# Flight_Price_webscraping (kayak.com)

### Anyone who has booked a flight ticket knows how unpredictably prices vary. The cheapest available ticket on a given flight gets more or less expensive over time. Airlines usually do this to maximize revenue based on:
### 1. Time-of-purchase patterns (making sure last-minute purchases are expensive)
### 2. Keeping the flight as full as they want it (raising prices on a flight that is filling up in order to slow sales and hold back inventory for those expensive last-minute purchases)

### So, you have to work on a project where you collect flight fare data along with other features and build a model to predict flight fares.

### You have to scrape at least 1500 rows of data. You can scrape more if you like; the more data, the better the model.

### In this section you have to scrape flight data from different websites (yatra.com, skyscanner.com, official airline websites, etc.). There is no limit on the number of columns; it’s up to you and your creativity. Typical columns are airline name, date of journey, source, destination, route, departure time, arrival time, duration, total stops and the target variable, price. You can add or remove columns, depending on the website from which you are fetching the data.
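
One way to pin down the column layout before scraping is a small record type. The sketch below is only an illustration: `FlightRecord`, `MIN_ROWS` and `has_enough_rows` are hypothetical names, the fields are a subset of the columns listed above, and the sample values are taken from the first row of the result table further down.

```python
from dataclasses import dataclass, asdict

MIN_ROWS = 1500  # minimum dataset size stated in the brief


@dataclass
class FlightRecord:
    """One scraped row; the field names are illustrative, not a fixed schema."""
    airline: str
    source: str
    destination: str
    departure_time: str
    arrival_time: str
    total_stops: str
    price: int  # target variable


def has_enough_rows(records, minimum=MIN_ROWS):
    """Return True once the collected rows reach the required minimum."""
    return len(records) >= minimum


sample = FlightRecord("SpiceJet", "BLR Bengaluru Intl", "MAA Chennai",
                      "06:20", "07:20", "direct", 6858)
print(asdict(sample))
print(has_enough_rows([sample]))
```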

In [245]:
import selenium
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [495]:
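# Selenium 4 and later expect the chromedriver path wrapped in a Service object.
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(r"C:\Users\HP\Downloads\chromedriver_win32\chromedriver.exe"))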

url = "https://www.kayak.co.in/flights/BLR-MAA/2021-10-25/2021-11-01?sort=bestflight_a&fs=baditin=baditin"
driver.get(url)
driver.maximize_window()

#### Scrolls the page down and clicks the load more button

In [496]:
begin = time.time()

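# Scroll down and click the "load more" button 50 times,
# waiting up to 20 seconds for the button to become clickable each time.
for _ in range(50):
    driver.execute_script("window.scrollBy(0,3800)")
    more_button = WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable((By.XPATH, "//a[@class='moreButton']")))
    driver.execute_script("arguments[0].click();", more_button)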
    
end = time.time()
print(f"Total runtime of the program is {end - begin}")

Total runtime of the program is 166.26537823677063


#### Extracts and stores all the data in the variables below

In [497]:
begin = time.time()

airline = []
airport_from = []
airport_time = []
airport_dep = []
airline_stops =[]
price = []

    
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which newer Selenium releases removed.
# The slices assume four airport names per round-trip result:
# index 0 is the outbound origin and index 1 the outbound destination.
airport_from = [el.text for el in driver.find_elements(By.XPATH, "//span[@class='airport-name']")]
air_from = airport_from[0::4]
air_to = airport_from[1::4]

# Two arrival and two departure times per result; keep the outbound leg only.
airport_time = [el.text for el in driver.find_elements(By.XPATH, "//span[@class='arrival-time base-time']")]
air_arrival = airport_time[0::2]

airport_dep = [el.text for el in driver.find_elements(By.XPATH, "//span[@class='depart-time base-time']")]
air_dep = airport_dep[0::2]

# Three 'top' blocks per result; the second one holds the stops text.
airline_stops = [el.text for el in driver.find_elements(By.XPATH, "//div[@class='top']")]
air_stop_new = airline_stops[1::3]
    

end = time.time()
print(f"Total runtime of the program is {end - begin}")

Total runtime of the program is 454.8920774459839
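
The extraction above relies on stride slicing to pull one field per result out of a flat list of element texts. The idea can be tried without a browser on a hand-made list. This is a minimal sketch, assuming each round-trip result contributes four airport names in the order outbound origin, outbound destination, return origin, return destination (an assumption inferred from the stride of 4; the page layout may differ). `split_round_trip` is a hypothetical helper.

```python
def split_round_trip(names, per_result=4):
    """Pick the outbound origin and destination of every result.

    Assumes `names` is a flat list holding `per_result` airport names
    per search result, outbound leg first.
    """
    sources = names[0::per_result]       # 1st name of each group
    destinations = names[1::per_result]  # 2nd name of each group
    return sources, destinations


# Two made-up results for a BLR -> MAA round trip.
flat = ["BLR Bengaluru Intl", "MAA Chennai",
        "MAA Chennai", "BLR Bengaluru Intl"] * 2

sources, destinations = split_round_trip(flat)
print(sources)
print(destinations)
```

If a result is missing one of its elements, every later slice position shifts, so it is worth checking that the list length is a multiple of the stride before trusting the output.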


In [498]:
airline_name = driver.find_elements(By.XPATH, "//span[@class='codeshares-airline-names']")
for air in airline_name:
    airline.append(air.text)

# Codeshare results list several airlines separated by commas; keep only the first one.
# Selecting column 0 works no matter how many airlines a row contains.
airline_name_new = pd.DataFrame(airline, columns=["Split col airline"])
airline_name_new = airline_name_new["Split col airline"].str.split(",", expand=True)[[0]]
airline_name_new.columns = ["Airline Name"]

In [501]:
air_price = driver.find_elements(By.XPATH, "//div[@class='col-price result-column js-no-dtog']")
for airprice in air_price:
    price.append(airprice.text.strip().replace("₹ ","").replace(",",""))

# The cell text spans several lines; the fare is on the first one.
# Selecting column 0 works no matter how many extra lines a cell contains.
price_new = pd.DataFrame(price, columns=["Split price col"])
price_new = price_new["Split price col"].str.split("\n", expand=True)[[0]]
price_new.columns = ["Price"]
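
The price cleanup can also be checked on plain strings. This is a rough sketch, assuming the fare is on the first line of the cell text and is written like `₹ 6,858`. `parse_price` is a hypothetical helper, and the lines after the fare in the sample are made up. Unlike the cell above, it converts the fare to an integer.

```python
def parse_price(cell_text):
    """Return the fare as an int from the text of one price cell.

    Assumes the first line holds the fare (e.g. "₹ 6,858") and that
    any following lines can be ignored.
    """
    first_line = cell_text.strip().split("\n")[0]
    digits = first_line.replace("₹", "").replace(",", "").strip()
    return int(digits)


# The second and third lines are placeholders for whatever else the cell shows.
print(parse_price("₹ 6,858\nExample Provider\nView Deal"))
```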

In [503]:
total_list = {
             "Source":air_from,
             "Destination":air_to,
             "Arrival Time":air_arrival,
             "Departure Time":air_dep}

#### Because the extraction does not always return the same number of elements for every field, separate dataframes are built and then joined into a single dataframe

In [504]:
df1 = pd.DataFrame(airline_name_new)
df2 = pd.DataFrame(total_list)
df3 = pd.DataFrame(air_stop_new, columns=["Number of stops"])
df4 = pd.DataFrame(price_new)

In [506]:
pd.set_option("display.max_rows",None)

new_df_final = df1.join(df2)
new_df_final = new_df_final.join(df3)
new_df_final = new_df_final.join(df4)
new_df_final

|   | Airline Name | Source | Destination | Arrival Time | Departure Time | Number of stops | Price |
|---|---|---|---|---|---|---|---|
| 0 | SpiceJet | BLR Bengaluru Intl | MAA Chennai | 07:20 | 06:20 | direct | 6858 |
| 1 | Air India | BLR Bengaluru Intl | MAA Chennai | 08:05 | 07:05 | direct | 6698 |
| 2 | Air India | BLR Bengaluru Intl | MAA Chennai | 08:05 | 07:05 | direct | 6699 |
| 3 | Air India | BLR Bengaluru Intl | MAA Chennai | 08:05 | 07:05 | direct | 6699 |
| 4 | Air India | BLR Bengaluru Intl | MAA Chennai | 08:05 | 07:05 | direct | 6699 |
| 5 | Air India | BLR Bengaluru Intl | MAA Chennai | 08:05 | 07:05 | direct | 6699 |
| 6 | Air India | BLR Bengaluru Intl | MAA Chennai | 08:05 | 07:05 | direct | 6699 |
| 7 | Air India | BLR Bengaluru Intl | MAA Chennai | 08:05 | 07:05 | direct | 6699 |
| 8 | Air India | BLR Bengaluru Intl | MAA Chennai | 08:05 | 07:05 | direct | 6699 |
| 9 | Air India | BLR Bengaluru Intl | MAA Chennai | 08:05 | 07:05 | direct | 6700 |
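
The reason for joining separate dataframes instead of building one from a single dictionary shows up on a toy example: `DataFrame.join` aligns on the index and fills the gaps with `NaN`, whereas a dictionary of lists with unequal lengths raises an error. This is a small sketch with made-up frame names and one value deliberately missing.

```python
import pandas as pd

names = pd.DataFrame({"Airline Name": ["SpiceJet", "Air India", "Air India"]})
prices = pd.DataFrame({"Price": ["6858", "6698"]})  # one value short on purpose

# join() is a left join on the default 0..n-1 index,
# so the third row is kept and gets NaN for Price.
joined = names.join(prices)
print(joined)

missing = int(joined["Price"].isna().sum())
print(missing)

# Keep only fully populated rows before modelling.
complete = joined.dropna()
print(len(complete))
```

Because the join is positional, it only lines up correctly when the missing elements are at the end of a list; a gap in the middle shifts every later value onto the wrong flight.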


# Extract more data from the same website


In [514]:
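# Selenium 4 and later expect the chromedriver path wrapped in a Service object.
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(r"C:\Users\HP\Downloads\chromedriver_win32\chromedriver.exe"))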

url = "https://www.kayak.co.in/flights/BLR-DEL/2021-10-25/2021-11-01?sort=price_a&fs=baditin=baditin"
driver.get(url)
driver.maximize_window()

In [515]:
begin = time.time()

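# Scroll down and click the "load more" button 80 times,
# waiting up to 20 seconds for the button to become clickable each time.
for _ in range(80):
    driver.execute_script("window.scrollBy(0,3800)")
    more_button = WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable((By.XPATH, "//a[@class='moreButton']")))
    driver.execute_script("arguments[0].click();", more_button)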
    
end = time.time()
print(f"Total runtime of the program is {end - begin}")

Total runtime of the program is 579.6853535175323


In [516]:
begin = time.time()

airline1 = []
airport_from1 = []
airport_time1 = []
airport_dep1 = []
airline_stops1 =[]
price1 = []

    
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which newer Selenium releases removed.
# The slices assume four airport names per round-trip result:
# index 0 is the outbound origin and index 1 the outbound destination.
airport_from1 = [el.text for el in driver.find_elements(By.XPATH, "//span[@class='airport-name']")]
air_from1 = airport_from1[0::4]
air_to1 = airport_from1[1::4]

# Two arrival and two departure times per result; keep the outbound leg only.
airport_time1 = [el.text for el in driver.find_elements(By.XPATH, "//span[@class='arrival-time base-time']")]
air_arrival1 = airport_time1[0::2]

airport_dep1 = [el.text for el in driver.find_elements(By.XPATH, "//span[@class='depart-time base-time']")]
air_dep1 = airport_dep1[0::2]

# Three 'top' blocks per result; the second one holds the stops text.
airline_stops1 = [el.text for el in driver.find_elements(By.XPATH, "//div[@class='top']")]
air_stop_new1 = airline_stops1[1::3]
    
airline_name1 = driver.find_elements(By.XPATH, "//span[@class='codeshares-airline-names']")
for air1 in airline_name1:
    airline1.append(air1.text)

# Codeshare results list several airlines separated by commas; keep only the first one.
# Selecting column 0 works no matter how many airlines a row contains.
airline_name_new1 = pd.DataFrame(airline1, columns=["Split col airline1"])
airline_name_new1 = airline_name_new1["Split col airline1"].str.split(",", expand=True)[[0]]
airline_name_new1.columns = ["Airline Name"]


air_price1 = driver.find_elements(By.XPATH, "//div[@class='col-price result-column js-no-dtog']")
for airprice1 in air_price1:
    price1.append(airprice1.text.strip().replace("₹ ","").replace(",",""))

# The cell text spans several lines; the fare is on the first one.
# Selecting column 0 works no matter how many extra lines a cell contains.
price_new1 = pd.DataFrame(price1, columns=["Split price col1"])
price_new1 = price_new1["Split price col1"].str.split("\n", expand=True)[[0]]
price_new1.columns = ["Price"]
    

end = time.time()
print(f"Total runtime of the program is {end - begin}")

Total runtime of the program is 1089.6907267570496


In [517]:
total_list1 = {
             "Source":air_from1,
             "Destination":air_to1,
             "Arrival Time":air_arrival1,
             "Departure Time":air_dep1}

In [518]:
df5 = pd.DataFrame(airline_name_new1)
df6 = pd.DataFrame(total_list1)
df7 = pd.DataFrame(air_stop_new1, columns=["Number of stops"])
df8 = pd.DataFrame(price_new1)

In [519]:
pd.set_option("display.max_rows",None)

new_df_final_new = df5.join(df6)
new_df_final_new = new_df_final_new.join(df7)
new_df_final_new = new_df_final_new.join(df8)
new_df_final_new

|   | Airline Name | Source | Destination | Arrival Time | Departure Time | Number of stops | Price |
|---|---|---|---|---|---|---|---|
| 0 | Air India | BLR Bengaluru Intl | DEL Indira Gandhi Intl | 22:45 | 20:00 | direct | 13679 |
| 1 | Air India | BLR Bengaluru Intl | DEL Indira Gandhi Intl | 22:45 | 20:00 | direct | 13679 |
| 2 | Air India | BLR Bengaluru Intl | DEL Indira Gandhi Intl | 22:45 | 20:00 | direct | 13679 |
| 3 | Air India | BLR Bengaluru Intl | DEL Indira Gandhi Intl | 22:45 | 20:00 | direct | 13679 |
| 4 | Air India | BLR Bengaluru Intl | DEL Indira Gandhi Intl | 20:35 | 17:45 | direct | 13679 |
| 5 | Air India | BLR Bengaluru Intl | DEL Indira Gandhi Intl | 22:45 | 20:00 | direct | 13679 |
| 6 | Air India | BLR Bengaluru Intl | DEL Indira Gandhi Intl | 20:35 | 17:45 | direct | 13679 |
| 7 | Air India | BLR Bengaluru Intl | DEL Indira Gandhi Intl | 20:35 | 17:45 | direct | 13679 |
| 8 | Air India | BLR Bengaluru Intl | DEL Indira Gandhi Intl | 20:35 | 17:45 | direct | 13679 |
| 9 | Air India | BLR Bengaluru Intl | DEL Indira Gandhi Intl | 20:35 | 17:45 | direct | 13679 |


# Stack the two dataframes one below the other

In [521]:
vertical_stack = pd.concat([new_df_final, new_df_final_new], axis=0)

In [523]:
len(vertical_stack)

1981
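
A detail of `pd.concat` is worth knowing here: by default it keeps each frame's own index labels, so the stacked frame contains repeated labels. The toy frames below are made up purely to show the difference `ignore_index=True` makes. The notebook itself writes the CSV with `index=False`, so the repeated labels do not reach the file.

```python
import pandas as pd

first = pd.DataFrame({"Price": ["6858", "6698"]})
second = pd.DataFrame({"Price": ["13679"]})

# Default behaviour: the original labels 0, 1 and 0 are kept.
stacked = pd.concat([first, second], axis=0)
print(stacked.index.tolist())

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([first, second], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```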

# Convert the dataframe to CSV format and use it for further analysis

In [525]:
vertical_stack.to_csv("FlightPrices_WebScraping.csv", index=False)
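
As one possible first step in that further analysis, the duration column mentioned in the brief can be derived from the scraped time strings. This is a sketch only: `duration_minutes` is a hypothetical helper, it assumes both times are `HH:MM` strings in the same time zone, and it treats an arrival earlier than the departure as landing on the following day.

```python
from datetime import datetime, timedelta


def duration_minutes(departure, arrival):
    """Return the flight duration in minutes from two 'HH:MM' strings."""
    fmt = "%H:%M"
    dep = datetime.strptime(departure, fmt)
    arr = datetime.strptime(arrival, fmt)
    if arr < dep:
        # Assumption: the flight lands on the next calendar day.
        arr += timedelta(days=1)
    return int((arr - dep).total_seconds() // 60)


# Departure and arrival times taken from the two result tables above.
print(duration_minutes("06:20", "07:20"))
print(duration_minutes("20:00", "22:45"))
```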