# 🌐 Ryanair Airline Reviews Web Scraping
This notebook demonstrates how to scrape Ryanair customer reviews from *AirlineQuality.com* using the `BeautifulSoup` library.

### Motivation
The purpose of scraping data from the Ryanair airline reviews webpage is to create a dataset containing customer reviews, ratings, and other relevant information. This dataset can be utilized for various purposes such as:

#### 😊 Sentiment Analysis
Analyzing customer sentiments towards Ryanair by studying their reviews and ratings.

#### ⏱️ Performance Evaluation
Assessing the performance of Ryanair based on customer feedback regarding aspects like service quality, punctuality, and customer service.

#### 📊 Comparative Analysis
Comparing Ryanair's performance with other airlines by analyzing their respective datasets.

#### 🤖 Predictive Modeling
Building machine learning models to predict customer satisfaction or flight experiences based on review data.

#### 💡 Business Insights
Extracting valuable insights for Ryanair to improve its services, identify areas of improvement, and enhance customer satisfaction.



# Step 1: Importing Necessary Libraries

In this step, we import the libraries required for fetching and parsing web pages: `requests` to download each page, `BeautifulSoup` (from `bs4`) to parse the HTML, and `pandas` to store the results.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
%pip install -r /content/requirements.txt


Collecting retrying (from -r /content/requirements.txt (line 1))
  Downloading retrying-1.3.4-py3-none-any.whl (11 kB)
Installing collected packages: retrying
Successfully installed retrying-1.3.4


In [None]:
from retrying import retry
import time
import traceback

# Step 2: Defining Function to Fetch Webpage Content and Parse HTML

Here, we define two functions. `fetch_webpage(url)` downloads the HTML of a page with `requests.get()` and returns `None` if the status code is not 200. `parse_html(html_content)` parses that HTML with Beautiful Soup using the `'html.parser'` parser.

In [None]:
def fetch_webpage(url):
    """
    Fetches the content of a webpage using requests.

    Parameters:
        url (str): The URL of the webpage.

    Returns:
        str: The HTML content of the webpage.
    """
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to fetch webpage. Status code: {response.status_code}")
        return None

def parse_html(html_content):
    """
    Parses HTML content using Beautiful Soup.

    Parameters:
        html_content (str): The HTML content to parse.

    Returns:
        BeautifulSoup: A BeautifulSoup object representing the parsed HTML.
    """
    return BeautifulSoup(html_content, 'html.parser')

Now, we check if it works correctly.

In [None]:
# Example usage:
url = 'https://google.com'
html_content = fetch_webpage(url)
if html_content:
    soup = parse_html(html_content)
    # Now you can work with the parsed HTML using Beautiful Soup
    # For example, extracting specific elements or information from the webpage
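
The notebook imports `retry` and `time`, but `fetch_webpage` makes only a single attempt per URL. As a rough illustration of how transient failures could be handled, the sketch below retries a fetch with a growing pause between attempts. It is a hand-rolled standard-library loop, not the `retrying` decorator, and the names `fetch_with_retries`, `max_attempts`, `base_delay`, and `flaky_fetch` are hypothetical.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns content or the attempts run out.

    `fetch` is any callable that returns the page text, or None on failure
    (for example fetch_webpage). `sleep` is injectable so the helper can be
    exercised without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            content = fetch(url)
        except Exception as exc:  # e.g. network errors raised by the fetcher
            print(f"Attempt {attempt} raised {exc!r}")
            content = None
        if content is not None:
            return content
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return None


# Offline demonstration: a fake fetcher that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    return "<html>ok</html>" if calls["count"] >= 3 else None


waits = []  # records the requested pauses instead of sleeping
result = fetch_with_retries(flaky_fetch, "https://example.com", sleep=waits.append)
print(result, waits)
```

In the scraping loop, `fetch_with_retries(fetch_webpage, url)` could stand in for the direct `fetch_webpage(url)` call, assuming a few seconds of delay per failed page is acceptable.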

# Step 3: Web Scraping

The following code generates the list of URLs used to scrape Ryanair reviews from airlinequality.com. Each URL requests one page of up to 100 reviews (`pagesize=100`), sorted by most recent post date. The `MAX_PAGES` variable sets how many pages are scraped, here **23 pages**.

In [None]:
MAX_PAGES=23
list_url = [f'https://www.airlinequality.com/airline-reviews/ryanair/page/{page}/?sortby=post_date%3ADesc&pagesize=100' for page in range(1, MAX_PAGES+1)]
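
The query string in the list comprehension is written out by hand, including the percent-encoded colon (`%3A`). One possible alternative, sketched here, lets `urllib.parse.urlencode` do the encoding. The function name `build_review_urls` and the `page_size` parameter are hypothetical, introduced only for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/ryanair"


def build_review_urls(max_pages, page_size=100):
    """Return one URL per results page, sorted by most recent post date."""
    # urlencode percent-encodes the colon, giving sortby=post_date%3ADesc
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return [f"{BASE_URL}/page/{page}/?{query}" for page in range(1, max_pages + 1)]


urls = build_review_urls(23)
print(len(urls))
print(urls[0])
```

The resulting list should match `list_url` element for element, so either form can feed the scraping loop.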

### Initializing DataFrame and Required Variables

Here, we initialize an empty DataFrame to store the comments data retrieved from the website. We also define a dictionary, `class_to_label`, that maps the class names of HTML elements to the corresponding column labels.

In [None]:
# Initialize an empty DataFrame to store the comments data
comments_data = pd.DataFrame(columns=['Date Published', 'Overall Rating', 'Passenger Country', 'Trip_verified', 'Comment title','Comment',
                                       'Aircraft', 'Type Of Traveller', 'Seat Type', 'Origin', 'Destination', 'Date Flown',
                                       'Seat Comfort', 'Cabin Staff Service', 'Food & Beverages', 'Ground Service',
                                       'Value For Money', 'Recommended'])
comments_data_list = []

class_to_label = {
    'aircraft': 'Aircraft',
    'type_of_traveller': 'Type Of Traveller',
    'cabin_flown': 'Seat Type',
    'route': 'Route',
    'date_flown': 'Date Flown',
    'seat_comfort': 'Seat Comfort',
    'cabin_staff_service': 'Cabin Staff Service',
    'food_and_beverages': 'Food & Beverages',
    'inflight_entertainment':'Inflight Entertainment',
    'ground_service': 'Ground Service',
    'wifi_and_connectivity':'Wifi & Connectivity',
    'value_for_money': 'Value For Money',
    'recommended': 'Recommended'
}
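
To make the role of the mapping concrete, the sketch below runs the same lookup on a small hand-written HTML fragment. The fragment only imitates the markup that this notebook's selectors assume (a header cell whose second CSS class is the key, and either a text cell or a cell of star spans); it is not copied from the site, and the two-entry `labels` dictionary is a shortened stand-in for `class_to_label`.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the assumed ratings-table markup.
SAMPLE_HTML = """
<table class="review-ratings">
  <tr>
    <td class="review-rating-header cabin_flown">Seat Type</td>
    <td class="review-value">Economy Class</td>
  </tr>
  <tr>
    <td class="review-rating-header seat_comfort">Seat Comfort</td>
    <td class="review-rating-stars stars">
      <span class="star fill">1</span><span class="star fill">2</span>
      <span class="star">3</span><span class="star">4</span><span class="star">5</span>
    </td>
  </tr>
</table>
"""

labels = {"cabin_flown": "Seat Type", "seat_comfort": "Seat Comfort"}

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
row_data = {}
for row in soup.find_all("tr"):
    header = row.find("td", class_="review-rating-header")
    # The second CSS class on the header cell is the key into the mapping.
    label = labels.get(header["class"][1], "")
    text_cell = row.find("td", class_="review-value")
    if text_cell:
        row_data[label] = text_cell.text.strip()
    else:
        stars = row.find("td", class_="review-rating-stars")
        # The exact string "star fill" matches only the filled star spans.
        row_data[label] = len(stars.find_all("span", class_="star fill"))

print(row_data)
```

Text-valued rows are stored as strings, while star-rated rows are stored as the count of filled stars.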


### Scraping Reviews from Multiple Pages

This section iterates over each URL in the list of URLs (`list_url`) and scrapes the reviews data from each page. It extracts various information such as date published, overall rating, passenger country, comment title, comment text, and specific ratings related to the flight experience.

In [None]:
for url in list_url:
    html_content = fetch_webpage(url)

    if html_content:
        soup = parse_html(html_content)

        # Find all comment elements
        comments = soup.find_all('article', itemprop='review')  # All review articles on this page

        for comment in comments:
            try:
                date_published = comment.find('meta', itemprop='datePublished')['content']
                rating = comment.find('span', itemprop='ratingValue')
                rating = rating.text if rating else ''

                text_header = comment.find('h2', class_='text_header').text

                text_sub_header_text = comment.find('h3', class_='text_sub_header userStatusWrapper').get_text(strip=True)
                country = text_sub_header_text.split('(')[-1].split(')')[0]

                text_content = comment.find('div', class_='text_content', itemprop='reviewBody')

                # Find the element containing 'Not Verified' or 'Trip Verified'
                verification = text_content.find('strong')
                verification = verification.text.strip() if verification else ''
                text_content = text_content.text.strip()

                # The body starts with the verification status and a '|'; keep everything after the first '|'
                if '|' in text_content:
                    text_content = text_content.split('|', 1)[1].strip()


                ratings_table = comment.find('table', class_='review-ratings')
                review_ratings = ratings_table.find_all('tr') if ratings_table else []

                table_data = {}
                for row in review_ratings:
                    # Find the header and value cells
                    header_cell = row.find('td', class_='review-rating-header')
                    value_cell = row.find('td', class_='review-value')
                    value2_cell = row.find('td', class_='review-rating-stars')

                    # Check if both header and value cells exist
                    if header_cell and (value_cell or value2_cell):
                        # Get the class name of the header cell
                        class_name = header_cell['class'][1]

                        # Get the corresponding data label from the class_to_label dictionary
                        data_label = class_to_label.get(class_name, '')

                        # Store the data label and value in the table_data dictionary
                        if value_cell:
                            value = value_cell.text.strip()
                            # If the feature is 'Route', split the value into origin and destination
                            if data_label == 'Route':
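                                if ' to ' in value:
                                    origin, destination = value.split(' to ', 1)
                                elif '-' in value:
                                    origin, destination = value.split('-')[:2]
                                else:
                                    # No recognised separator: keep the raw value as the origin
                                    origin, destination = value, ''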
                                table_data['Origin'] = origin.strip()
                                table_data['Destination'] = destination.strip()
                            else:
                                table_data[data_label] = value
                        else:
                            filled_star_spans = value2_cell.find_all('span', class_='star fill')
                            table_data[data_label] = len(filled_star_spans)

                # Append the data from the current comment to the list
                comments_data_list.append({'Date Published': date_published, 'Overall Rating': rating,
                                           'Passenger Country': country, 'Trip_verified': verification,
                                           'Comment title': text_header, 'Comment': text_content, **table_data})

            except Exception as e:
                print(f'Error in comment: page {url.split("/page/")[1].split("/")[0]} -> {comments.index(comment)}')
                traceback.print_exc()
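
Route strings are the most fragile part of the loop, because the separator varies between reviews. As a sketch of how that logic could be isolated and checked on its own, the hypothetical helper `split_route` below handles the `' to '` and `'-'` forms used in the loop. It also strips a trailing `' via ...'` segment, which is an assumption about how connecting flights are written, not something the loop itself does.

```python
def split_route(value):
    """Split a route string into (origin, destination).

    Handles 'A to B', 'A to B via C' and 'A-B'. Returns (value, None)
    when no separator is recognised.
    """
    route = value.split(" via ")[0].strip()  # drop any connection airport
    # Try ' to ' first so hyphenated city names are not split by mistake.
    for separator in (" to ", "-"):
        if separator in route:
            origin, destination = route.split(separator, 1)
            return origin.strip(), destination.strip()
    return route, None


for example in ["Dublin to London", "Stockholm-London", "Rome to Porto via Madrid", "Cork"]:
    print(example, "->", split_route(example))
```

Checking for `' to '` with surrounding spaces matters: a plain `'to'` test also matches city names such as "Stockholm", which contain those letters without being a separator.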


### Converting Data into DataFrame and Displaying Results

Finally, the scraped data is converted into a pandas DataFrame (`comments_data`). This DataFrame contains all the extracted information from the reviews. We then display the DataFrame to inspect the scraped data.

In [None]:
# Convert the list of dictionaries into a DataFrame and preview the first rows
comments_data = pd.DataFrame(comments_data_list)
comments_data.head()
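
The scraped values arrive as strings, and a missing overall rating is stored as an empty string. Before analysis it may help to convert those columns to proper types. The sketch below does this on two hand-written rows (not real reviews) and assumes the publication date is in ISO `YYYY-MM-DD` form, which should be verified against the actual data.

```python
import pandas as pd

# Two hand-written rows in the shape produced by the scraping loop.
sample = pd.DataFrame([
    {"Date Published": "2024-01-15", "Overall Rating": "7", "Seat Comfort": 3},
    {"Date Published": "2023-11-02", "Overall Rating": "", "Seat Comfort": 1},
])

# errors="coerce" turns unparseable values (such as '') into NaN/NaT instead of raising.
sample["Overall Rating"] = pd.to_numeric(sample["Overall Rating"], errors="coerce")
sample["Date Published"] = pd.to_datetime(sample["Date Published"], errors="coerce")

print(sample.dtypes)
```

The same two conversions could be applied to `comments_data` before saving, if numeric ratings and real dates are wanted in later analysis.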

# Step 4: Saving the data

In [None]:
comments_data.to_csv('ryanair_reviews.csv', encoding='utf-8', index=False)  # omit the row index column