# Overview

This notebook extracts unstructured data from Travel Waiver emails in Microsoft Outlook.
United Airlines sends these waivers when events, weather, and similar disruptions may impact travel.

The email data is cleaned and manipulated into a DataFrame that includes the Travel Waiver's
date sent, event name, severity level, cities impacted, and dates affected. 

The code can be broken down into the following 3 sections:

    1. Scrape Outlook to collect all Travel Waiver emails
    2. Create classes/functions to parse data from email 
    3. Iterate over the data set to clean and structure it

Over time, the collection of this data may yield some interesting insights into events that impact travel.
Feel free to ask questions or leave suggestions/critique. I love to learn others' approaches to problems!

### The image below is an example of the email being scraped. My mailbox had over a year's worth of Travel Waiver emails.

![TravelWaiverImage](https://raw.githubusercontent.com/eli64s/Python-Email-Scraper/master/travel_waiver_image.PNG)

As most of you do not have access to these Travel Waiver emails, the following cell provides the subject and body strings of the example Travel Waiver email shown above.

#### Use these strings to test the notebook on your own!

In [1]:
# The following strings come from the example Travel Waiver email above  
email_subject_example = 'EXTENSION-Travel Waiver (SEV3) China Novel Coronavirus - 01/30/20'

email_body_example = """Travel Waiver: China Novel Coronavirus
SEV 3
We have extended the Travel Waiver for Shanghai, Chengdu, and Beijing, China 
due to the ongoing situation with Novel Coronavirus.
*     Cities: CTU / PEK / PVG 
*     Travel date(s): January 24- March 31, 2020
*     Must be ticketed by: January 23, 2020
Avoid affected city on connections, if possible.
Permitted changes
*     Same origin and destination
*     Any booking code in originally ticketed cabin
*     Different connection
Standby
Same day confirmed 
Alternate day confirmed
*     Rescheduled travel up to/including April 30, 2020
Reuse ticket or coupon(s)
*     Use full value of unused ticket or coupon(s) for another itinerary within ticket validity
*     New itinerary can be any cities and cabin
*     Subject to add-collect of new fare
Full Refund of unused ticket or coupons
What’s waived?
Standby fee
Change fee: Same or alternate dates listed above
Add-collect: Within rescheduled travel dates listed above 
Non-refundability of fare rule
Travel agency waiver code: 7JCNZ
ATRE and DRS: TRAVEL WAIVER have been updated. United.com will be updated shortly.
Reminder: Flights that are already canceled or delayed fall under the Irregular Operations policy.
Goal: Same day travel flexibility | Change by specific date at no cost |
Use full value of unused ticket or coupon(s) for another itinerary by January 23, 2021
Full refund of unused ticket or coupon(s)"""

# Outlook Setup

In [2]:
import pandas as pd
import win32com.client
 
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6) # 6 = olFolderInbox (the default Inbox)
messages = inbox.Items              # Get all items in the Inbox folder

### Retrieve Mailbox Messages

#### Iterate over the messages in the mailbox, storing the extracted data in a Dictionary 

In [3]:
travel_waivers = {} # Dictionary{} to store email data - send date, title, body text
key_val = 0         # key value for Dictionary{}

for message in messages:
    # Iterate over messages to collect data from each email belonging to
    # SenderName - 'Travel Waivers'
    # Try/Except statements are used as some emails have an unknown SenderName
    
    try:
        if message.SenderName == 'Travel Waivers':
            
            # A 'From:' line in the body means the email is a reply, so skip it
            # Otherwise insert Travel Waiver data into Dictionary{}
            if 'From:' not in message.Body:
                travel_waivers[key_val] = {
                    'email_title': message.Subject,
                    'email_body': message.Body,
                    'Date_Sent': message.SentOn.strftime('%Y-%m-%d'),
                    'Event': None,
                    'Indicator': None,
                    'City': None,
                    'Start_Date': None,
                    'End_Date': None,
                    'All_Dates': None
                }
                key_val += 1
            
    # Exception occurs when SenderName is unknown
    except Exception as error_message:
        print(error_message)

<unknown>.SenderName
<unknown>.SenderName
<unknown>.SenderName
<unknown>.SenderName
<unknown>.SenderName
<unknown>.SenderName
<unknown>.SenderName
<unknown>.SenderName
<unknown>.SenderName
<unknown>.SenderName
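
If you don't have Outlook available, the filtering logic can still be tried on stand-in objects. The sketch below is only an illustration: `fake_messages`, `collect_waivers`, and the sample subjects are hypothetical and not part of the original notebook. It also assumes that `getattr` with a default is an acceptable substitute for the try/except, which I have only checked against these stand-ins, not real Outlook items.

```python
from datetime import datetime
from types import SimpleNamespace

# Hypothetical stand-ins for Outlook mail items (plain Python objects, not COM objects)
fake_messages = [
    SimpleNamespace(SenderName='Travel Waivers',
                    Subject='Travel Waiver (SEV2) Hypothetical Storm - 02/12/20',
                    Body='Travel Waiver: Hypothetical Storm',
                    SentOn=datetime(2020, 2, 12)),
    SimpleNamespace(SenderName='Travel Waivers',
                    Subject='RE: Travel Waiver',
                    Body='Thanks!\nFrom: Travel Waivers',
                    SentOn=datetime(2020, 2, 13)),
    SimpleNamespace(Subject='No sender attribute at all',
                    Body='...',
                    SentOn=datetime(2020, 2, 14)),
]

def collect_waivers(messages, sender='Travel Waivers'):
    """Keep non-reply emails from `sender`, keyed by a running index."""
    kept = {}
    for message in messages:
        # A missing SenderName falls back to None instead of raising
        if getattr(message, 'SenderName', None) != sender:
            continue
        # Same reply heuristic as the loop above: replies quote a 'From:' header
        if 'From:' in message.Body:
            continue
        kept[len(kept)] = {
            'email_title': message.Subject,
            'email_body': message.Body,
            'Date_Sent': message.SentOn.strftime('%Y-%m-%d'),
        }
    return kept

sample_waivers = collect_waivers(fake_messages)
print(len(sample_waivers))               # 1
print(sample_waivers[0]['Date_Sent'])    # 2020-02-12
```

Only the first stand-in survives: the second is treated as a reply and the third has no `SenderName`.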


# Create Classes to Parse Text of Email Body

In [4]:
from datetime import datetime, timedelta
import calendar
import re 

# Map each calendar month name to its number to match against dates in the waiver
month_names = {calendar.month_name[n]: n for n in range(1, 13)}
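
A quick way to see how this lookup is used later is to scan a date string for month names. The snippet below is a small sketch; the second string is a hypothetical example, included to show that matches come back in calendar order rather than in the order they appear in the text.

```python
import calendar

# Month name -> month number, e.g. {'January': 1, ..., 'December': 12}
month_names = {calendar.month_name[n]: n for n in range(1, 13)}

found = [m for m in month_names if m in ' January 24- March 31, 2020']
print(found)                    # ['January', 'March']
print(month_names['March'])     # 3

# Hypothetical range that crosses a year boundary
wrapped = [m for m in month_names if m in 'December 30 - January 2, 2020']
print(wrapped)                  # ['January', 'December']
```

Because the dictionary is iterated from January to December, a range that wraps around the new year would be read with its months reversed, which is worth keeping in mind when parsing dates this way.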

### Parent Class 

In [5]:
class GetData(object):
    '''
    This class extracts data from the email's subject and body
    '''
    def __init__(self, subject, body): 
        self.subject = subject 
        self.body = body
        
    def subject_data(self): 
        '''
        Gets event name and severity indicator from the email's subject
        '''
        # Travel Waiver Email Subject
        # Some Travel Waivers do not contain a Severity Indicator
        if ')' in self.subject: 
            subject_name = self.subject.split(')')[1]    
            self.subject_name = subject_name.split('-')[0].strip()
            
        else:
            self.subject_name = self.subject.split('-')[0].strip()
            
        # Severity Indicator
        # Some Travel Waivers do not contain a Severity Indicator
        if '(S' in self.subject: 
            ind = self.subject.split()
            ind = [i for i in ind if i.startswith('(S')][0]
            self.ind = re.sub('[()]', '', ind)
            
        else:
            self.ind = 'NA'
        
    def body_data(self):
        '''
        Gets two substrings from the email's body containing the city and date data
        '''
        text = self.body
        # Extract substring with Travel Waiver's cities
        start_word = re.findall(r"City:|Cities:", text)[0]              
        end_word = re.findall(r"Travel date:|Travel date\(s\):|Travel dates:", text)[0]
        self.city_text = text[text.find(start_word) + len(start_word): text.rfind(end_word)]
    
        # Extract substring with Travel Waiver's date range
        start_word = re.findall(r"Travel date:|Travel date\(s\):|Travel dates:", text)[0]
        end_word = re.findall(r', \d{4}', text)[0]  # Regex to find the ', YYYY'
        end_word = text.split(end_word, 1)[1]       # Gets entire string after ', YYYY'
        self.dates_text = text[text.find(start_word) + len(start_word): text.rfind(end_word)]
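
For comparison, the same fields could be pulled out with a few compiled regular expressions. This is a rough sketch rather than a replacement for the class: `parse_waiver` and the three patterns are hypothetical names, they were written against the single example email in this notebook, and they assume the cities and travel dates each sit on one line of the body.

```python
import re

# Hypothetical patterns, based only on the example email in this notebook
SEVERITY_RE = re.compile(r'\((S[^)\s]*)\)')                    # e.g. '(SEV3)'
CITIES_RE = re.compile(r'(?:City|Cities):[ \t]*(.+)')          # rest of the line
DATES_RE = re.compile(r'Travel date(?:s|\(s\))?:[ \t]*(.+)')   # rest of the line

def parse_waiver(subject, body):
    """Return the event name, severity, and raw city/date strings."""
    severity = SEVERITY_RE.search(subject)
    # Event name: text after the first ')' if there is one, up to the next '-'
    tail = subject.split(')', 1)[1] if ')' in subject else subject
    cities = CITIES_RE.search(body)
    dates = DATES_RE.search(body)
    return {
        'event': tail.split('-')[0].strip(),
        'severity': severity.group(1) if severity else 'NA',
        # Assumption: each field fits on a single line of the email body
        'city_text': cities.group(1).strip() if cities else None,
        'dates_text': dates.group(1).strip() if dates else None,
    }

sample_subject = 'EXTENSION-Travel Waiver (SEV3) China Novel Coronavirus - 01/30/20'
sample_body = (
    '*     Cities: CTU / PEK / PVG \n'
    '*     Travel date(s): January 24- March 31, 2020\n'
    '*     Must be ticketed by: January 23, 2020'
)
parsed = parse_waiver(sample_subject, sample_body)
print(parsed['event'], '|', parsed['severity'])   # China Novel Coronavirus | SEV3
print(parsed['city_text'])                        # CTU / PEK / PVG
print(parsed['dates_text'])                       # January 24- March 31, 2020
```

Returning `None` for a missing field avoids the `IndexError` that `re.findall(...)[0]` raises when a pattern is absent, at the cost of having to check for `None` downstream.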
        

### Child Class

In [6]:
class CleanData(GetData):
    '''
    Extracts the data for the Travel Waiver's affected cities and dates
    '''
    
    def __init__(self, city_text, dates_text): 
        self.city_text = city_text
        self.dates_text = dates_text
        
        
    def get_cities(self):
        '''
        This function gets the substring of city codes and returns list of city codes
        '''
        text = self.city_text
        cities = text.strip('/')                              # Strip leading/trailing '/' from substring
        cities = ''.join(ch for ch in cities if ch.isalnum()) # Delete non-alphanumeric characters
        
        return [cities[i:i + 3] for i in range(0, len(cities), 3)] # Create list of cities
    
    
    def get_dates(self):
        '''
        This function gets the substring of the Travel Waiver's issued date range
        Returns the start date, end date, and a list of every date in the range
        '''
        text = self.dates_text
        months = [m for m in month_names if m in text]           # Travel Waiver's month(s) 
        days_and_year = list(map(int, re.findall(r'\d+', text))) # Regex to get numeric values
        year = days_and_year[-1]                                 # Travel Waiver's year 
        days = days_and_year[:-1]                                # Travel Waiver's day range
        start_day = days[0]                                      # Start day
        end_day = days[-1]                                       # End day

        # Condition for Travel Waivers spanning multiple months: ex) 'September 30 – October 2, 2019'
        if len(days) == 2 and len(months) == 2:
            start_month = int(month_names[months[0]])   # Starting month number 
            end_month = int(month_names[months[1]])     # Ending month number
            
        # Condition for Travel Waivers spanning multiple days: ex) 'August 25 - 27'
        # or single day events
        else:
            start_month = int(month_names[months[0]])
            end_month = start_month

        day_range = (datetime(year, end_month, end_day) - datetime(year, start_month, start_day)).days      
        date_range_list = [(datetime(year, start_month, start_day) +
                            timedelta(days = d)).strftime('%Y-%m-%d') for d in range(day_range + 1)]
        start_date = date_range_list[0]                                # Start date in range
        end_date = date_range_list[-1]                                 # Last date in range

        return start_date, end_date, date_range_list
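
The two methods can also be written as small standalone functions, which makes the edge cases easier to test. The sketch below is one possible variant, not the notebook's method: `split_cities` and `expand_dates` are hypothetical names, months are read in the order they appear in the text, and a range whose end falls before its start is assumed to have begun in the previous year.

```python
import calendar
import re
from datetime import datetime, timedelta

MONTHS = {calendar.month_name[n]: n for n in range(1, 13)}
MONTH_RE = re.compile('|'.join(MONTHS))     # matches any full month name

def split_cities(text):
    """Pull three-letter uppercase codes out of e.g. ' CTU / PEK / PVG '."""
    return re.findall(r'\b[A-Z]{3}\b', text)

def expand_dates(text):
    """Expand e.g. 'January 24- March 31, 2020' into a list of ISO dates."""
    months = [MONTHS[m] for m in MONTH_RE.findall(text)]   # order of appearance
    *days, year = [int(n) for n in re.findall(r'\d+', text)]
    start = datetime(year, months[0], days[0])
    end = datetime(year, months[-1], days[-1])
    if end < start:
        # Assumption: the range crosses New Year, so it started the year before
        start = start.replace(year=year - 1)
    span = (end - start).days
    return [(start + timedelta(days=d)).strftime('%Y-%m-%d')
            for d in range(span + 1)]

print(split_cities(' CTU / PEK / PVG \n*     '))
# ['CTU', 'PEK', 'PVG']
print(expand_dates('September 30 – October 2, 2019'))
# ['2019-09-30', '2019-10-01', '2019-10-02']
print(expand_dates('August 25 - 27, 2019'))
# ['2019-08-25', '2019-08-26', '2019-08-27']
```

Matching whole city codes with a regex avoids slicing the cleaned string into fixed chunks of three, which would misalign if any stray letter or digit ended up in the city text.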

### Example Objects Using the Classes Above

##### Using the example data provided at the top of the notebook!

In [7]:
# GetData() Parent Class - gets Travel Waiver email subject data
event_info = GetData(email_subject_example, email_body_example)
event_info.subject_data()
event_name = event_info.subject_name
indicator = event_info.ind

# CleanData() Child Class - gets Travel Waiver email body data
event_info.body_data()
cities_dates = CleanData(
    event_info.city_text,
    event_info.dates_text
)

print(event_name, '\n')
print(indicator, '\n')
print(cities_dates.get_cities(), '\n')
print(cities_dates.get_dates(), '\n')

China Novel Coronavirus 

SEV3 

['CTU', 'PEK', 'PVG'] 

('2020-01-24', '2020-03-31', ['2020-01-24', '2020-01-25', '2020-01-26', '2020-01-27', '2020-01-28', '2020-01-29', '2020-01-30', '2020-01-31', '2020-02-01', '2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05', '2020-02-06', '2020-02-07', '2020-02-08', '2020-02-09', '2020-02-10', '2020-02-11', '2020-02-12', '2020-02-13', '2020-02-14', '2020-02-15', '2020-02-16', '2020-02-17', '2020-02-18', '2020-02-19', '2020-02-20', '2020-02-21', '2020-02-22', '2020-02-23', '2020-02-24', '2020-02-25', '2020-02-26', '2020-02-27', '2020-02-28', '2020-02-29', '2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04', '2020-03-05', '2020-03-06', '2020-03-07', '2020-03-08', '2020-03-09', '2020-03-10', '2020-03-11', '2020-03-12', '2020-03-13', '2020-03-14', '2020-03-15', '2020-03-16', '2020-03-17', '2020-03-18', '2020-03-19', '2020-03-20', '2020-03-21', '2020-03-22', '2020-03-23', '2020-03-24', '2020-03-25', '2020-03-26', '2020-03-27', '2020-03-28', '20

# Create Data Set

In [8]:
for value in travel_waivers:
    # Create Travel Waiver objects using the GetData() & CleanData() classes
    email_subject = travel_waivers[value]['email_title']
    email_body = travel_waivers[value]['email_body']
    
    # GetData() Parent Class - gets Travel Waiver email subject data
    event_info = GetData(email_subject, email_body)
    event_info.subject_data()
    event_name = event_info.subject_name
    indicator = event_info.ind
    
    # CleanData() Child Class - gets Travel Waiver email body data
    event_info.body_data()
    cities_dates = CleanData(
        event_info.city_text,
        event_info.dates_text
    )
    cities = cities_dates.get_cities()  # Cities
    dates = cities_dates.get_dates()    # Dates
    start = dates[0]                    # Start Date
    end = dates[1]                      # End Date
    all_dates = dates[2]                # List of all dates

    # Update dictionary values
    travel_waivers[value]['Event'] = event_name
    travel_waivers[value]['Indicator'] = indicator
    travel_waivers[value]['City'] = cities
    travel_waivers[value]['Start_Date'] = start
    travel_waivers[value]['End_Date'] = end
    travel_waivers[value]['All_Dates'] = all_dates

# Prepare Final Data for Output

In [9]:
# Convert the Dictionary with the data to DataFrame
travel_waiver_data = pd.DataFrame.from_dict(travel_waivers, orient = 'index')

# Drop the columns that contained the original scraped data ['email_title', 'email_body']
travel_waiver_data = travel_waiver_data[[
    'Date_Sent', 
    'Event',
    'Indicator', 
    'City',
    'Start_Date',
    'End_Date',
    'All_Dates'
    ]]

In [10]:
# explode() function takes a DataFrame column and creates a new row for each item  
travel_waiver_data_expanded = travel_waiver_data.explode('City')
travel_waiver_data_expanded = travel_waiver_data_expanded.explode('All_Dates')
travel_waiver_data_expanded = travel_waiver_data_expanded.drop_duplicates().reset_index(drop = True)
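
To make the effect of the two `explode()` calls concrete, here is a toy example with made-up values (`toy` and the 'Hypothetical Storm' row are illustrative only). One waiver with two cities and two dates becomes four rows, one per city/date pair.

```python
import pandas as pd

# One hypothetical waiver with list-valued City and All_Dates columns
toy = pd.DataFrame({
    'Event': ['Hypothetical Storm'],
    'City': [['ORD', 'MDW']],
    'All_Dates': [['2020-01-01', '2020-01-02']],
})

expanded = (
    toy.explode('City')          # one row per city
       .explode('All_Dates')     # then one row per city/date pair
       .drop_duplicates()
       .reset_index(drop=True)
)
print(expanded.shape)   # (4, 3)
print(expanded)
```

`reset_index(drop=True)` matters here because `explode()` repeats the original index label on every row it creates.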

#### Save DataFrame as .csv file

In [11]:
today = datetime.now().strftime('%Y%m%d')
save_path = 'travel_waivers_' + today + '.csv'
travel_waiver_data_expanded.to_csv(save_path, index = False)

In [12]:
travel_waiver_data_expanded.tail(25)

| | Date_Sent | Event | Indicator | City | Start_Date | End_Date | All_Dates |
|---|---|---|---|---|---|---|---|
| 2584 | 2020-02-03 | Hong Kong Novel Coronavirus | SEV3 | HKG | 2020-01-28 | 2020-03-31 | 2020-03-15 |
| 2585 | 2020-02-03 | Hong Kong Novel Coronavirus | SEV3 | HKG | 2020-01-28 | 2020-03-31 | 2020-03-16 |
| 2586 | 2020-02-03 | Hong Kong Novel Coronavirus | SEV3 | HKG | 2020-01-28 | 2020-03-31 | 2020-03-17 |
| 2587 | 2020-02-03 | Hong Kong Novel Coronavirus | SEV3 | HKG | 2020-01-28 | 2020-03-31 | 2020-03-18 |
| 2588 | 2020-02-03 | Hong Kong Novel Coronavirus | SEV3 | HKG | 2020-01-28 | 2020-03-31 | 2020-03-19 |
| 2589 | 2020-02-03 | Hong Kong Novel Coronavirus | SEV3 | HKG | 2020-01-28 | 2020-03-31 | 2020-03-20 |
| 2590 | 2020-02-03 | Hong Kong Novel Coronavirus | SEV3 | HKG | 2020-01-28 | 2020-03-31 | 2020-03-21 |
| 2591 | 2020-02-03 | Hong Kong Novel Coronavirus | SEV3 | HKG | 2020-01-28 | 2020-03-31 | 2020-03-22 |
| 2592 | 2020-02-03 | Hong Kong Novel Coronavirus | SEV3 | HKG | 2020-01-28 | 2020-03-31 | 2020-03-23 |
| 2593 | 2020-02-03 | Hong Kong Novel Coronavirus | SEV3 | HKG | 2020-01-28 | 2020-03-31 | 2020-03-24 |
