# Introduction

Create a notebook that searches a travel or airline website such as Expedia or Southwest, finds the best flight rates, and emails an alert on a timed basis. This is a nice project done by Aref. On top of emailing, we can store the data in a format suitable for analysis or for a historical view of prices.

# Instructions

1. Connect Python to a web browser and access the airline website
2. Choose flights (roundtrip, one way, multi-city)
3. Select origin (city or airport of departure)
4. Select destination (city or airport of arrival)
5. Select departing and returning dates
6. Store the data in a structured format for exploratory data analysis (optional)
7. Connect to your email and push an alert every hour

# What do I hope to achieve from this?

Develop a stronger understanding of projects beyond machine learning and data analysis. Web scraping is the process of extracting information from a website and transforming it into structured data for further analysis. Automation is a powerful skill to have, and learning to collect fresh data instead of reading it from CSV files or SQL databases can be very handy in the future.

1. Automate my flight searches from OC to SJ
2. Initialize a new project and work from concept to product
3. Deploy to production (optional)
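
To make the idea of turning a webpage into structured data concrete, here is a minimal sketch that uses only the standard library. The markup, the `price` class name, and the `PriceParser` class are invented for illustration and do not come from any real airline page.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        # Keep only non-empty text found inside a price span
        if self._in_price and data.strip():
            self.prices.append(data.strip())


# Made-up markup standing in for a real results page
sample_html = """
<ul>
  <li><span class="airline">Alaska</span> <span class="price">$120</span></li>
  <li><span class="airline">Southwest</span> <span class="price">$98</span></li>
</ul>
"""

parser = PriceParser()
parser.feed(sample_html)
print(parser.prices)
```

The rest of the notebook does the same job with Selenium, which can also handle pages that build their content with JavaScript.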

## Import libraries

In [None]:
# Data
import pandas as pd

# Web
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests

# Misc
import time
import datetime
import os
from PIL import Image

# Email
import smtplib
from email.mime.multipart import MIMEMultipart

### Connect to the web

In [None]:
# Opens empty web browser of your choice
# browser = webdriver.Firefox()
browser = webdriver.Chrome()

Inspect the element we want to start working on and check the tags related to it.

It has a 'label' tag with 'id = flight-type-roundtrip-label-hp-flight'. This is the tag we will use to decide on the ticket type.

Below are the XPath expressions for each ticket type. Store them in variables.

_Note: You can right click the element and copy the XPath. The roundtrip result would be copied and it will be something like this: `//*[@id="flight-type-roundtrip-label-hp-flight"]`_

In [None]:
# Set ticket types
return_ticket = "//label[@id='flight-type-roundtrip-label-hp-flight']"
one_way_ticket = "//label[@id='flight-type-one-way-label-hp-flight']"
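
To see what these expressions select without opening a browser, the sketch below runs one of them against a small invented snippet using the standard library's `xml.etree.ElementTree`. The markup is an assumption that only mimics the two labels; ElementTree supports a limited subset of XPath and needs a relative path, hence the added `.` prefix.

```python
import xml.etree.ElementTree as ET

# The two expressions stored in the notebook
return_ticket = "//label[@id='flight-type-roundtrip-label-hp-flight']"
one_way_ticket = "//label[@id='flight-type-one-way-label-hp-flight']"

# Invented markup that mimics the two ticket-type labels
snippet = """
<form>
  <label id="flight-type-roundtrip-label-hp-flight">Roundtrip</label>
  <label id="flight-type-one-way-label-hp-flight">One way</label>
</form>
""".strip()

root = ET.fromstring(snippet)

# ElementTree only accepts relative paths, so prefix the expression with "."
match = root.find("." + one_way_ticket)
print(match.text)
```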

### Define a function to save HTML locally

In [None]:
sample_link = 'https://www.expedia.com/Flights-Search?flight-type=on&starDate=07%2F11%2F2019&endDate=07%2F11%2F2019&mode=search&trip=roundtrip&leg1=from%3AOrange+County%2C+CA+%28SNA-John+Wayne%29%2Cto%3ASan+Jose%2C+CA+%28SJC-Norman+Y.+Mineta+San+Jose+Intl.%29%2Cdeparture%3A07%2F11%2F2019TANYT&leg2=from%3ASan+Jose%2C+CA+%28SJC-Norman+Y.+Mineta+San+Jose+Intl.%29%2Cto%3AOrange+County%2C+CA+%28SNA-John+Wayne%29%2Cdeparture%3A07%2F11%2F2019TANYT&passengers=children%3A0%2Cadults%3A1%2Cseniors%3A0%2Cinfantinlap%3AY'

def save_html(html, path):
    with open(path, 'wb') as f:
        f.write(html)

In [None]:
r = requests.get(sample_link)
save_html(r.content, 'expedia_link')

### Open content

In [None]:
def open_html(path):
    with open(path, 'rb') as f:
        return f.read()
    
html = open_html('expedia_link')
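
As a quick sanity check that needs no network access, the two helpers can be exercised on a few bytes in a temporary directory. This is a sketch: the helpers are repeated so the block runs on its own, and the file name `sample_page.html` is hypothetical.

```python
import os
import tempfile


def save_html(html, path):
    # Write raw bytes, as returned by requests' r.content
    with open(path, 'wb') as f:
        f.write(html)


def open_html(path):
    # Read the bytes back unchanged
    with open(path, 'rb') as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_page.html')
    page = b'<html><body>hello</body></html>'
    save_html(page, path)
    restored = open_html(path)

print(restored == page)
```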

### Define a function to choose a roundtrip or one way ticket.

In [None]:
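# Define a new function
def choose_ticket(ticket):
    '''
    Try to click the button to choose the ticket type
    '''
    try:
        # find_element(By.XPATH, ...) replaces find_element_by_xpath, which Selenium 4 removed
        ticket_type = browser.find_element(By.XPATH, ticket)
        time.sleep(1.0)
        ticket_type.click()
    except Exception as e:
        # Report the failure instead of silently ignoring it
        print('Could not select the ticket type: {}'.format(e))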
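
The functions in this notebook pause with fixed `time.sleep()` calls. An alternative is to poll until a condition holds, which is the idea behind Selenium's own `WebDriverWait`. The sketch below shows that idea with the standard library only; `wait_until` and `element_ready` are hypothetical names, and the condition is a stand-in for "the element has appeared on the page".

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('Condition was not met within {} seconds'.format(timeout))
        time.sleep(poll_interval)


# Stand-in for "is the element on the page yet?": succeeds on the third check
attempts = {'count': 0}


def element_ready():
    attempts['count'] += 1
    return attempts['count'] >= 3


wait_until(element_ready, timeout=2.0, poll_interval=0.01)
print(attempts['count'])
```

Polling returns as soon as the page is ready, so a slow page gets more time and a fast page is not held up.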

### Define a function to choose the departure airport/city

The function will perform the following steps:
1. Type in the departure airport/city
2. Choose the first value from the drop-down menu

In [None]:
# Define a new function
def choose_departure_city(airport):
    '''
    Click flying from box, input airport, and select the first choice from the drop-down menu
    '''
    # Find departure box using the element's tags and attributes
    fly_from = browser.find_element(By.XPATH, "//input[@id='flight-origin-hp-flight']")
    time.sleep(1.5)
    
    # Clear any value written in the field
    fly_from.clear()
    time.sleep(2.0)
    
    # Type in the airport that will be passed into the function using .send_keys()
    fly_from.send_keys('  ' + airport)
    time.sleep(1.0)
    
    # Get the first item of the drop-down menu that appears after typing the airport
    first_item = browser.find_element(By.XPATH, "//a[@id='aria-option-0']")
    time.sleep(1.5)
    
    # Click the first value
    first_item.click()

### Define a function to choose the arrival airport/city

The function will perform the following steps:
1. Type in the arrival airport/city
2. Choose the first value from the drop-down menu

In [None]:
# Define a new function
def choose_arrival_city(airport):
    '''
    Click flying to box, input airport, and select the first choice from the drop-down menu
    '''
    # Find arrival box using the element's tags and attributes
    fly_to = browser.find_element(By.XPATH, "//input[@id='flight-destination-hp-flight']")
    time.sleep(1.5)
    
    # Clear any value written in the field
    fly_to.clear()
    time.sleep(2.0)
    
    # Type in the airport that will be passed into the function using .send_keys()
    fly_to.send_keys('  ' + airport)
    time.sleep(1.0)
    
    # Get the first item of the drop-down menu that appears after typing the airport
    first_item = browser.find_element(By.XPATH, "//a[@id='aria-option-0']")
    time.sleep(1.5)
    
    # Click the first value
    first_item.click()

### Define a function to choose the departure date

The function will perform the following steps:
1. Find the date element on the web page
2. Clear every value
3. Type in the date in this specific format: mm/dd/yyyy

In [None]:
# Define a new function
def choose_departure_date(month, day, year):
    '''
    Type the departure date into the departing field as mm/dd/yyyy
    '''
    # Find date using the element's tags and attributes
    dep_date = browser.find_element(By.XPATH, "//input[@id='flight-departing-hp-flight']")
    
    # Clear any value in the field
    dep_date.clear()
    
    # Type in the date that will be passed into the function using .send_keys()
    dep_date.send_keys(month + '/' + day + '/' + year)
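
The date functions take the month, day, and year as separate zero-padded strings. A small sketch of how those strings could be derived from a `datetime.date`, so the mm/dd/yyyy format is produced consistently; `date_parts` is a hypothetical helper, not part of the notebook.

```python
import datetime


def date_parts(date):
    """Return (month, day, year) as zero-padded strings."""
    return date.strftime('%m'), date.strftime('%d'), date.strftime('%Y')


departure = datetime.date(2019, 7, 26)
month, day, year = date_parts(departure)

# Same concatenation the notebook uses when typing the date
formatted = month + '/' + day + '/' + year
print(formatted)
```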

### Define a function to choose the return date

In [None]:
# Define a new function
def choose_return_date(month, day, year):
    '''
    Type the return date into the returning field as mm/dd/yyyy
    '''
    # Find date using the element's tag and attributes
    ret_date = browser.find_element(By.XPATH, "//input[@id='flight-returning-hp-flight']")
    
#     # Clear any value in the field
#     ret_date.clear()
    
#     # Type in the date that will be passed into the function
#     ret_date.send_keys(month + '/' + day + '/' + year)

    # Erase the existing value with backspaces instead of .clear(), then type the new date
    for _ in range(11):
        ret_date.send_keys(Keys.BACKSPACE)
    ret_date.send_keys(month + '/' + day + '/' + year)

### Search

In [None]:
# Define a function that will click the search button
def search():
    '''
    Function will now press the search button after inputting all of the values and load the next page
    '''
    # Find the search button by its tag and attributes
    search_button = browser.find_element(By.XPATH, "//button[@class='btn-primary btn-action gcw-submit']")
    
    # Click the button and wait for the page to load
    search_button.click()
    time.sleep(10)
    
    # Result
    print('Search results are loaded.')

### Get the data

Now that we have loaded the search results, we can gather the data and put it into a usable form. The following cell performs these steps:
1. Create an empty dataframe
2. Create a list variable for each flight attribute
3. Find all the elements for an attribute
4. Store the text of those elements in the matching list
5. Place the lists side by side as columns in the dataframe
6. Save the dataframe to an Excel sheet (done later, in the main loop)

In [None]:
# Create an empty df
df = pd.DataFrame()

# Define a function that gathers the data from the loaded results page
def get_data():
    global df
    global dep_times_list
    global arr_times_list
    global airlines_list
    global price_list
    global durations_list
    global stops_list
    global layovers_list
    
    ## Get elements
    
    # Departure times
    dep_times = browser.find_elements(By.XPATH, "//span[@data-test-id='departure-time']")
    dep_times_list = [x.text for x in dep_times]
    
    # Arrival Times
    arr_times = browser.find_elements(By.XPATH, "//span[@data-test-id='arrival-time']")
    arr_times_list = [x.text for x in arr_times]
    
    # Airline Names
    airlines = browser.find_elements(By.XPATH, "//span[@data-test-id='airline-name']")
    airlines_list = [x.text for x in airlines]
    
    # Prices: keep the part after the '$' sign (assumes the text looks like '$123')
    prices = browser.find_elements(By.XPATH, "//span[@data-test-id='listing-price-dollars']")
    price_list = [x.text.split('$')[-1] for x in prices]
                                   
    # Durations
    durations = browser.find_elements(By.XPATH, "//span[@data-test-id='duration']")
    durations_list = [x.text for x in durations]
    
    # Stops
    stops = browser.find_elements(By.XPATH, "//span[@class='number-stops']")
    stops_list = [x.text for x in stops]
    
    # Layovers
    layovers = browser.find_elements(By.XPATH, "//span[@data-test-id='layover-airport-stops']")
    layovers_list = [x.text for x in layovers]  
    
    ## Create the data in a structured format
    
    # Get current datetime
    now = datetime.datetime.now()
    
    # Format the date as yyyy-mm-dd and the time as hh:mm, both zero-padded
    current_date = now.strftime('%Y-%m-%d')
    current_time = now.strftime('%H:%M')
    
    # Name the price column after the time of this call
    current_price = 'price' + '(' + current_date + '---' + current_time + ')'
    
    # Loop through every list
    for i in range(len(dep_times_list)):
        try:
            df.loc[i, 'departure_time'] = dep_times_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'arrival_time'] = arr_times_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'airline'] = airlines_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'duration'] = durations_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'stops'] = stops_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'layovers'] = layovers_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, str(current_price)] = price_list[i]
        except Exception as e:
            pass
        
    print('Finished appending data.')
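
The loop above fills the dataframe one cell at a time and relies on `try`/`except` to skip lists that are shorter than the others. As an alternative sketch, `itertools.zip_longest` can line the lists up into rows and pad the gaps with `None`. Unlike the loop, it runs to the length of the longest list rather than the number of departure times. All values below are made up, and only four of the columns are shown.

```python
from itertools import zip_longest

columns = ['departure_time', 'arrival_time', 'airline', 'stops']

# Made-up values; the last two lists are shorter on purpose
dep_times = ['6:00am', '9:15am', '1:40pm']
arr_times = ['7:20am', '10:35am', '3:00pm']
airlines = ['Alaska', 'Southwest']
stops = ['(Nonstop)']

# One dictionary per flight; missing values become None
rows = [
    dict(zip(columns, values))
    for values in zip_longest(dep_times, arr_times, airlines, stops, fillvalue=None)
]

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` to get one column per key.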


In [None]:
# Test out a chunk to see if it's working as intended

testurl = 'https://www.expedia.com/Flights-Search?flight-type=on&starDate=07%2F11%2F2019&endDate=07%2F11%2F2019&mode=search&trip=roundtrip&leg1=from%3AOrange+County%2C+CA+%28SNA-John+Wayne%29%2Cto%3ASan+Jose%2C+CA+%28SJC-Norman+Y.+Mineta+San+Jose+Intl.%29%2Cdeparture%3A07%2F11%2F2019TANYT&leg2=from%3ASan+Jose%2C+CA+%28SJC-Norman+Y.+Mineta+San+Jose+Intl.%29%2Cto%3AOrange+County%2C+CA+%28SNA-John+Wayne%29%2Cdeparture%3A07%2F11%2F2019TANYT&passengers=children%3A0%2Cadults%3A1%2Cseniors%3A0%2Cinfantinlap%3AY'
browser = webdriver.Chrome()

browser.get(testurl)
time.sleep(5)

ppp = browser.find_elements(By.XPATH, "//span[@data-test-id='listing-price-dollars']")
aaa = browser.find_elements(By.XPATH, "//*[@id='flight-module-2019-07-11t15:05:00-07:00-coach-sna-sjc-as-3411_2019-07-11t18:05:00-07:00-coach-sjc-sna-as-3412_']/div[1]/div[1]/div[2]/div/div[1]/div[1]/span")

In [None]:
ppp

# browser = webdriver.Firefox()
# time.sleep(3)

# r = requests.get(testurl)
# print(r.content[:100])
# soup = BeautifulSoup(r.content, 'html.parser')

# soup.find_all('div', attrs={'class': 'full-bold no-wrap'})
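
Once the text of the price elements has been collected, it has to be converted to numbers before prices can be compared. A minimal sketch, assuming the text looks like `$176` or `$1,204` (the real page may format prices differently); `parse_price` and the sample values are hypothetical.

```python
def parse_price(text):
    """Turn a price string such as '$1,234' into an integer number of dollars."""
    return int(text.replace('$', '').replace(',', '').strip())


# Made-up price strings standing in for the text of the scraped elements
price_texts = ['$198', '$1,204', '$176', '$240']
prices = [parse_price(t) for t in price_texts]

# Position of the lowest price, usable as a row index into the other lists
cheapest_index = min(range(len(prices)), key=prices.__getitem__)

print(prices)
print(cheapest_index)
```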

### Setting up email

Now that we have selected our ticket and flight information and added the data to a dataframe, we can start working on our emailing functions.

1. Connect to Gmail
    - Use environment variables to hide your login information
2. Create the email message
3. Send the email

In [None]:
# Store login information
gmail_user = os.environ.get('gmail_user')
gmail_pass = os.environ.get('gmail_pass')
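
`os.environ.get` returns `None` when a variable is missing, which only shows up later as a failed login. A small sketch of a stricter lookup that fails early with a clear message; `require_env` and the `demo_gmail_user` variable are hypothetical and used only for the demonstration.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError('Environment variable {!r} is not set'.format(name))
    return value


# Demonstration only: set a throwaway variable, then read it back
os.environ['demo_gmail_user'] = 'someone@example.com'
print(require_env('demo_gmail_user'))
```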

In [None]:
# Define a function to connect to your email
def connect_email(user, pw):
    '''
    Python conveniently comes with smtplib, which handles all of the different parts of the protocol, like 
    connecting, authenticating, validating, and of course, sending emails.
    
    Mail submission uses port 587, which is what we will use.
    '''
    global server
    
    try:
        server = smtplib.SMTP('smtp.gmail.com', 587)
        
        # Use ehlo, which identifies you to the SMTP server
        server.ehlo()
        
        # Secure the SMTP connection: it starts unencrypted and is upgraded to TLS
        server.starttls()
        
        # Login with credentials
        server.login(user, pw)
    except (smtplib.SMTPException, OSError) as e:
        # Show the reason for the failure instead of hiding it
        print('Something went wrong: {}'.format(e))
    

In [None]:
# Define a function to create the message
def create_message():
    global msg
    
    # Fill the template with the details of the cheapest flight (read from the global variables)
    msg = '\nCurrent cheapest flight:\n\nDeparture time: {}\nArrival time: {}\nAirline: {}\nFlight duration: {}\nNo. of stops: {}\nPrice: {}\n'.format(
        cheapest_dep_time,
        cheapest_arrival_time,
        cheapest_airline,
        cheapest_duration,
        cheapest_stops,
        cheapest_price)
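
The message function reads its values from global variables. One possible variation, sketched here, passes the flight in as a dictionary and returns the text, which makes the function easy to try on its own. `format_alert` and the sample flight are hypothetical.

```python
def format_alert(flight):
    """Build the alert text from a dictionary describing one flight."""
    template = (
        '\nCurrent cheapest flight:\n\n'
        'Departure time: {departure_time}\n'
        'Arrival time: {arrival_time}\n'
        'Airline: {airline}\n'
        'Flight duration: {duration}\n'
        'No. of stops: {stops}\n'
        'Price: {price}\n'
    )
    return template.format(**flight)


# Made-up flight used only to show the output
sample_flight = {
    'departure_time': '6:00am',
    'arrival_time': '7:20am',
    'airline': 'Alaska',
    'duration': '1h 20m',
    'stops': '(Nonstop)',
    'price': '176',
}

body = format_alert(sample_flight)
print(body)
```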

In [None]:
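# Define a function to send the email
from email.mime.text import MIMEText

def send_email(msg):
    global message

    # Build a MIME message so the subject and address headers are sent with the text
    message = MIMEMultipart()
    message['Subject'] = 'Current Best Flight from OC to SJ'
    message['From'] = gmail_user
    message['To'] = gmail_user
    message.attach(MIMEText(msg, 'plain'))
    server.sendmail(gmail_user, gmail_user, message.as_string())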
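
The standard library also offers `email.message.EmailMessage`, a newer interface than `MIMEMultipart` for simple text emails. The sketch below swaps it in to build the message without sending anything; `build_email` and the example addresses are hypothetical.

```python
from email.message import EmailMessage


def build_email(sender, recipient, subject, body):
    """Return a plain-text email with its headers set."""
    message = EmailMessage()
    message['Subject'] = subject
    message['From'] = sender
    message['To'] = recipient
    message.set_content(body)
    return message


message = build_email(
    'me@example.com',
    'me@example.com',
    'Current Best Flight from OC to SJ',
    'Price: 176',
)

print(message['Subject'])
print(message.get_content())
```

An object built this way can be handed to `smtplib.SMTP.send_message()`, which reads the sender and recipient from the headers.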

### Run the script now

In [None]:
# Loop through a time interval
for i in range(3):
    
    # Open up the website
    airline_link = 'https://expedia.com/'
    print('Opening browser... [1/10]')
    browser.get(airline_link)
    time.sleep(5)
    
    # Choose flights
    flights = browser.find_element(By.XPATH, "//button[@id='tab-flight-tab-hp']")
    print('Finding flight element... [2/10]')
    flights.click()
    
    # Choose ticket type
    print('Choosing ticket type... [3/10]')
    choose_ticket(return_ticket)
    
    # Departure
    print('Submitting departing city... [4/10]')
    choose_departure_city('Santa Ana')
    
    # Arrival
    print('Submitting arrival city... [5/10]')
    choose_arrival_city('San Jose')
    
    # Departure date
    print('Submitting departure date... [6/10]')
    choose_departure_date('07', '26', '2019')

    
    # Return date
    print('Submitting return date... [7/10]')
    choose_return_date('07', '28', '2019')   
    
    # Search and compile
    search()
    get_data()
    
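    # Read the first row of the dataframe
    # (treating it as the cheapest flight assumes the results are sorted by price, lowest first)
    print('Appending data... [8/10]')
    current_values = df.iloc[0]

    cheapest_dep_time = current_values['departure_time']
    cheapest_arrival_time = current_values['arrival_time']
    cheapest_airline = current_values['airline']
    cheapest_duration = current_values['duration']
    cheapest_stops = current_values['stops']
    # The most recent price column is always the last one
    cheapest_price = current_values.iloc[-1]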
    
    print('Round {} completed.'.format(i))
    
    # Emails
    print('Creating email... [9/10]')
    create_message()
    connect_email(gmail_user, gmail_pass)
    send_email(msg)
    print('Email sent!')
    print('Creating xlsx file... [10/10]')
    df.to_excel('flights.xlsx')
    # Pause before the next round (240 seconds here; 3600 would give an hourly alert)
    time.sleep(240)
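
The loop above mixes the work of one round with the waiting between rounds. A minimal sketch of separating the two, so the interval is a single parameter and the scheduling can be tried without actually waiting. `run_every` is a hypothetical helper, and the `sleep` argument exists only so the demonstration can record the pauses instead of sleeping.

```python
import time


def run_every(job, interval_seconds, rounds, sleep=time.sleep):
    """Call `job(round_number)` `rounds` times, pausing `interval_seconds` between calls."""
    results = []
    for round_number in range(rounds):
        results.append(job(round_number))
        # No need to wait after the final round
        if round_number < rounds - 1:
            sleep(interval_seconds)
    return results


# Record the requested pauses instead of really sleeping for an hour
pauses = []
results = run_every(lambda n: 'round {}'.format(n), 3600, rounds=3, sleep=pauses.append)

print(results)
print(pauses)
```

In the notebook, `job` would be a function wrapping the search, scrape, and email steps of one round.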