# Web Scraping of Airline Data for Comprehensive Flight Information Extraction
The scraped dataset describes flight options departing from Indian airports to domestic and international destinations. Each record contains the following fields:

**Airline Name:** The name of the airline operating the flight (e.g., Etihad Airways, Virgin Atlantic, Air India).

**Departure Airport and City:** The airport code (e.g., BOM for Mumbai, DEL for New Delhi) along with the city name and country (e.g., "Mumbai, India").

**Departure Time:** The scheduled time for departure in a 24-hour format (e.g., "13:55").

**Duration:** The total flight duration expressed in hours and minutes (e.g., "12h 00m").

**Layover Information:** For connecting flights, the layover airport code (e.g., AUH for Abu Dhabi), which appears between the departure time and the duration in the scraped text. Direct flights have no layover entry, so only the duration follows the departure time (e.g., "4h 30m").

**Arrival Time:** The scheduled time of arrival at the destination airport (e.g., "00:25+1D", where "+1D" indicates arrival on the next day).

**Flight Code:** The unique identifier for the flight. It is often shown on flight listings but does not appear in the scraped records, so it is not extracted here.

**Fare Information:** The fare for the flight, typically represented in the local currency (e.g., "9,728" indicating the price in Indian Rupees).

**Actionable Information:** The phrase "VIEW FARES" indicates that users can click to see more detailed fare information.

This dataset can be utilized for various analyses, including understanding pricing trends, popular routes, flight durations, and layover patterns across different airlines. It can serve as a valuable resource for travelers, travel agencies, and researchers looking to gain insights into airline services.
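Analyses such as pricing trends or duration comparisons need numeric values, while the scraped fields are plain strings. The following is a minimal sketch, assuming the formats shown in the examples above ("12h 00m", "9,728", "00:25+1D"); the helper names `duration_to_minutes`, `fare_to_int`, and `split_arrival` are hypothetical and not part of the original notebook.

```python
import re


def duration_to_minutes(duration):
    """Convert a duration such as '12h 00m' to total minutes; None if malformed."""
    match = re.fullmatch(r"(\d+)h\s(\d+)m", duration.strip())
    if match is None:
        return None
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


def fare_to_int(fare):
    """Convert a fare such as '9,728' to an integer by dropping the comma."""
    return int(fare.replace(",", ""))


def split_arrival(arrival):
    """Split '00:25+1D' into the clock time and a day offset (0 for same-day)."""
    match = re.fullmatch(r"(\d{1,2}:\d{2})(?:\+(\d+)D)?", arrival.strip())
    if match is None:
        return None
    clock, day_offset = match.groups()
    return clock, int(day_offset or 0)


print(duration_to_minutes("12h 00m"))  # 720
print(fare_to_int("9,728"))            # 9728
print(split_arrival("00:25+1D"))       # ('00:25', 1)
print(split_arrival("10:10"))          # ('10:10', 0)
```

Returning `None` for malformed input keeps a single unexpected string from aborting a whole conversion pass; the caller can decide whether to drop or inspect such rows.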

In [19]:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = 'https://www.goibibo.com/airlines/indigo/'
CARD_CLASS = 'srp-card-uistyles__CardWrap-sc-3flq99-7 SBjIn width100 dF'

driver = webdriver.Chrome()

# Use a set to store unique records
Records = set()


def scrape_section(section, delay, limit=50, scroll_increment=100):
    """Open each route button in one section of the page and collect its flight cards.

    section: position of the route list in the page layout
             (2 = domestic routes, 3 = international routes).
    """
    for route in range(2, 16):
        driver.get(URL)
        driver.find_element(
            By.XPATH,
            f'//*[@id="root"]/div[2]/div[1]/div[1]/div[2]/div[{section}]/div[{route}]/div/div/div[6]/button',
        ).click()

        current_height = driver.execute_script("return document.body.scrollHeight")
        count = 0

        # Scroll down in small steps so that lazily loaded cards get rendered
        for offset in range(0, current_height, scroll_increment):
            driver.execute_script(f"window.scrollTo(0, {offset});")
            time.sleep(delay)

            soup = BeautifulSoup(driver.page_source, 'html.parser')
            for element in soup.find_all('div', class_=CARD_CLASS):
                cleaned_text = element.text.replace('\xa0', ' ').strip()

                if cleaned_text not in Records:  # Check for duplicates
                    Records.add(cleaned_text)
                    count += 1

                if count > limit:
                    break

            # Stop scrolling once this route has contributed more than `limit` new records
            if count > limit:
                break


scrape_section(section=2, delay=0.001)  # domestic routes
scrape_section(section=3, delay=0.05)   # international routes

driver.quit()

# Convert the set of records back to a list (if needed)
unique_records = list(Records)
print(f"Total unique records collected: {len(unique_records)}")

Total unique records collected: 760
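
The de-duplication and per-route cap used while scrolling can be checked without a browser. The sketch below isolates that logic in a hypothetical helper, `collect_new`, which is not part of the original notebook; it works on one batch of card texts at a time, whereas the scraping loop keeps its counter running across scroll steps. Because the check is `> limit` rather than `>= limit`, up to `limit + 1` new records are accepted before the loop stops.

```python
def collect_new(texts, seen, limit=50):
    """Add unseen texts to `seen` and return how many were added.

    Stops as soon as more than `limit` new texts have been added,
    mirroring the `count > limit` check in the scraping loop.
    """
    added = 0
    for text in texts:
        cleaned = text.replace('\xa0', ' ').strip()
        if cleaned not in seen:
            seen.add(cleaned)
            added += 1
        if added > limit:
            break
    return added


seen = set()
added = collect_new(["a", "b", "a", "c", "d"], seen, limit=2)
print(added, sorted(seen))  # 3 ['a', 'b', 'c']
```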


In [20]:
unique_records

['Etihad AirwaysBOM Mumbai, India13:55-AUH-12h 00mXNB Dubai Bus Station, United Arab Emirates00:25+1D 9,728 VIEW FARES',
 'Virgin AtlanticDEL New Delhi, India21:30-BLR-17h 45mLHR London - Heathrow Apt, United Kingdom10:45+1D 23,211 VIEW FARES',
 'Air IndiaBOM Mumbai, India04:10 4h 30mBKK Bangkok, Thailand10:10 16,842 VIEW FARES',
 'VistaraBOM Mumbai, India07:30 2h 05mDEL New Delhi, India09:35 5,100 VIEW FARES',
 'IndiGoDEL New Delhi, India10:55-GWL-5h 45mBOM Mumbai, India16:40 4,697 VIEW FARES',
 'IndiGoDEL New Delhi, India18:15-BOM-7h 00mAUH Abu Dhabi, United Arab Emirates23:45 10,877 VIEW FARES',
 'Air IndiaDEL New Delhi, India11:00 4h 25mBKK Bangkok, Thailand16:55 13,861 VIEW FARES',
 'IndiGoBOM Mumbai, India07:00 2h 05mDEL New Delhi, India09:05 4,999 VIEW FARES',
 'IndiGoDEL New Delhi, India09:00-HYD-23h 00mSIN Singapore, Singapore10:30+1D 16,039 VIEW FARES',
 'Srilankan AirlinesBOM Mumbai, India20:40-CMB-8h 05mBKK Bangkok, Thailand06:15+1D 15,319 VIEW FARES',
 'IndiGoCCU Kolkata, 

### Creating DataFrame
- Create a DataFrame by applying regular expressions to the records extracted with BeautifulSoup and Selenium.
- The code below splits each record into fields; `np.nan` fills any field whose pattern finds no match.

In [25]:
import re
import numpy as np
import pandas as pd
Airline_Name = []
Departure_Airport_Code = []
Departure_Airport = []
Departure_Country = []
Arrival_Airport_Code = []
Arrival_Airport = []
Arrival_Country = []
Layover = []
Departure_Time = []
Arrival_Time = []
Duration = []
Fare = []
def extract(pattern, text):
    """Join every match of `pattern` in `text`; return NaN when nothing matches."""
    matches = re.findall(pattern, text)
    return ''.join(matches) if matches else np.nan


# Raw strings (r"...") keep backslash sequences such as \s and \d intact for the regex engine
for record in unique_records:
    Airline_Name.append(extract(r"([A-Za-z]+\s?[A-Za-z]+?)[A-Z]{3}\s\w+\s?\w+?,", record))
    Departure_Airport_Code.append(extract(r"[^m]([A-Z]{3})\s\w+\s?\w+?,", record))
    Departure_Airport.append(extract(r"[^m][A-Z]{3}\s(\w+\s?\w+?),", record))
    Departure_Country.append(extract(r"[A-Za-z]+\s?[A-Za-z]+?[A-Z]{3}\s\w+\s?\w+?,\s(\D+)", record))
    # The arrival fields are located by the "m" that ends the duration (e.g. "12h 00m")
    Arrival_Airport_Code.append(extract(r"m([A-Z]+)", record))
    Arrival_Airport.append(extract(r"m[A-Z]+\s(\w+\s?\w+)", record))
    Arrival_Country.append(extract(r"m[A-Z]+\s\w+\s?-?\s?(?:\w+\s\w+)?,\s(\D+\s?\D+)", record))
    # The layover code sits between two hyphens; direct flights have none
    Layover.append(extract(r"-(\w+(?: - \w{3})?)-", record))
    Departure_Time.append(extract(r"(\d+:\d+)[-\s](?:\d+h|\w{3})", record))
    Arrival_Time.append(extract(r"(\d+:\d+(?:\+1D)?)[\s|₹]\d+,", record))
    Duration.append(extract(r"\d+h\s\d+m", record))
    Fare.append(extract(r"(\d{1,2},\d{3})\s", record))

Airline = pd.DataFrame({'Airline_Name' : Airline_Name,
              'Departure_Airport_Code' : Departure_Airport_Code,
              'Departure_Airport' : Departure_Airport,
              'Departure_Country' : Departure_Country,
              'Arrival_Airport_Code' : Arrival_Airport_Code,
              'Arrival_Airport' : Arrival_Airport,
              'Arrival_Country' : Arrival_Country,
              'Layover' : Layover,
              'Departure_Time' : Departure_Time,
              'Arrival_Time' : Arrival_Time,
              'Duration' : Duration,
              'Fare' : Fare})
Airline.to_csv('C:/Users/wazid/OneDrive/Desktop/Airline.csv', index=False)
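
The twelve separate patterns above can also be expressed as one pattern with named groups, which parses a record in a single pass and makes the expected layout explicit. This is a sketch under assumptions, not the notebook's method: the pattern `RECORD` and the helper `parse_record` are hypothetical, and the layout is inferred only from the sample records printed earlier (three-letter airport codes, a space before the fare, and a trailing "VIEW FARES"). Records that deviate from that layout return `None`.

```python
import re

# One pattern for the whole record; each named group is one output column.
RECORD = re.compile(
    r"^(?P<airline>.+?)"
    r"(?P<dep_code>[A-Z]{3})\s(?P<dep_city>[^,]+),\s(?P<dep_country>\D+?)"
    r"(?P<dep_time>\d{2}:\d{2})"
    r"(?:-(?P<layover>[A-Z]{3})-|\s)"          # "-AUH-" for a layover, a space if direct
    r"(?P<duration>\d+h\s\d+m)"
    r"(?P<arr_code>[A-Z]{3})\s(?P<arr_airport>[^,]+),\s(?P<arr_country>\D+?)"
    r"(?P<arr_time>\d{2}:\d{2}(?:\+\d+D)?)"
    r"\s(?P<fare>[\d,]+)\sVIEW FARES$"
)


def parse_record(text):
    """Return a dict of fields for one scraped record, or None if it does not match."""
    match = RECORD.match(text)
    return match.groupdict() if match else None


connecting = parse_record(
    "Etihad AirwaysBOM Mumbai, India13:55-AUH-12h 00mXNB Dubai Bus Station, "
    "United Arab Emirates00:25+1D 9,728 VIEW FARES"
)
direct = parse_record(
    "Air IndiaBOM Mumbai, India04:10 4h 30mBKK Bangkok, Thailand10:10 16,842 VIEW FARES"
)
print(connecting["airline"], connecting["layover"], connecting["arr_airport"])
print(direct["airline"], direct["layover"], direct["fare"])
```

Unlike the per-field patterns, this version keeps the full arrival airport name (for example "Dubai Bus Station" rather than "Dubai Bus"), at the cost of rejecting a record outright when any single part fails to match.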

In [1]:
import pandas as pd
Airline = pd.read_csv('C:/Users/wazid/OneDrive/Desktop/Airline.csv')
Airline

| | Airline_Name | Departure_Airport_Code | Departure_Airport | Departure_Country | Arrival_Airport_Code | Arrival_Airport | Arrival_Country | Layover | Departure_Time | Arrival_Time | Duration | Fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Etihad Airways | BOM | Mumbai | India | XNB | Dubai Bus | United Arab Emirates | AUH | 13:55 | 00:25+1D | 12h 00m | 9728 |
| 1 | Virgin Atlantic | DEL | New Delhi | India | LHR | London | United Kingdom | BLR | 21:30 | 10:45+1D | 17h 45m | 23211 |
| 2 | Air India | BOM | Mumbai | India | BKK | Bangkok | Thailand | NaN | 04:10 | 10:10 | 4h 30m | 16842 |
| 3 | Vistara | BOM | Mumbai | India | DEL | New Delhi | India | NaN | 07:30 | 09:35 | 2h 05m | 5100 |
| 4 | IndiGo | DEL | New Delhi | India | BOM | Mumbai | India | GWL | 10:55 | 16:40 | 5h 45m | 4697 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 755 | Airline | MAA | Chennai | India | CMB | Colombo | Sri Lanka | DEL | 16:55 | 08:50+1D | 15h 55m | 60596 |
| 756 | IndiGo | CCU | Kolkata | India | BKK | Bangkok | Thailand | BLR | 13:15 | 12:05+1D | 21h 20m | 14731 |
| 757 | IndiGo | BLR | Bengaluru | India | SIN | Singapore | Singapore | MAA | 16:30 | 05:10+1D | 10h 10m | 10799 |
| 758 | IndiGo | DEL | New Delhi | India | KTM | Kathmandu | Nepal | BOM | 10:00 | 14:05+1D | 27h 50m | 11913 |


In [2]:
Airline.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 760 entries, 0 to 759
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Airline_Name            760 non-null    object
 1   Departure_Airport_Code  760 non-null    object
 2   Departure_Airport       760 non-null    object
 3   Departure_Country       760 non-null    object
 4   Arrival_Airport_Code    760 non-null    object
 5   Arrival_Airport         760 non-null    object
 6   Arrival_Country         760 non-null    object
 7   Layover                 540 non-null    object
 8   Departure_Time          760 non-null    object
 9   Arrival_Time            756 non-null    object
 10  Duration                760 non-null    object
 11  Fare                    760 non-null    object
dtypes: object(12)
memory usage: 71.4+ KB


In [3]:
Airline.describe()

| | Airline_Name | Departure_Airport_Code | Departure_Airport | Departure_Country | Arrival_Airport_Code | Arrival_Airport | Arrival_Country | Layover | Departure_Time | Arrival_Time | Duration | Fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 760 | 760 | 760 | 760 | 760 | 760 | 760 | 540 | 760 | 756 | 760 | 760 |
| unique | 21 | 7 | 7 | 1 | 17 | 15 | 8 | 54 | 227 | 251 | 250 | 304 |
| top | IndiGo | DEL | New Delhi | India | BLR | Bengaluru | India | BOM | 05:00 | 23:45 | 2h 15m | 10877 |
| freq | 454 | 270 | 270 | 760 | 120 | 120 | 380 | 132 | 13 | 23 | 30 | 22 |
