# ETL and EDA Notebook

# Part 1 - Amtrak Northeast Regional Train Data
* This project would not be possible without the diligent joint effort by [Chris Juckins](https://juckins.net/index.php) and [John Bobinyec](http://dixielandsoftware.net/Amtrak/status/StatusMaps/) to collect and preserve Amtrak's on-time performance records. Chris Juckins' archive of timetables was another invaluable resource which enabled me to sort through the trains and stations I chose to use in this project.
* The train data is sourced from [Amtrak Status Maps Archive Database (ASMAD)](https://juckins.net/amtrak_status/archive/html/home.php), and has been retrieved with Chris' permission.

### Overview of the Process
* Functions were written to scrape the HTML table returned from the search query and to process each column to the desired format
* Additional columns were also added during processing to aid in joining the train data with weather data
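
As a rough illustration of the scraping step, the sketch below turns an HTML table into a list of row dictionaries. It uses the standard-library `html.parser` rather than `lxml` so that it runs without extra dependencies; the class `TableRows`, the function `parse_table`, and the sample column names are hypothetical and are not taken from `trains_retrieve_and_process_data`.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def parse_table(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]

# Made-up sample markup, not the real ASMAD response
sample = (
    "<table>"
    "<tr><th>Train</th><th>Station</th></tr>"
    "<tr><td>171</td><td>BOS</td></tr>"
    "</table>"
)
records = parse_table(sample)
print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame` to get one column per header cell.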

### Setup

In [None]:
import time
import requests
import re
import lxml.html as lh
import pandas as pd
import numpy as np
from datetime import date, timedelta
from trains_retrieve_and_process_data import * 

### Retrieve HTML table data and recreate as a Pandas DataFrame
* By default, data is collected for the previous day. Run after 5am: ASMAD updates around 4am, so an earlier run retrieves no data.
* Collects both arrival and departure data and stores in a dictionary further indexed by station
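
The previous-day default described above can be sketched as follows. `previous_day_range` and its `ready_hour` parameter are hypothetical names used for illustration, not part of `trains_retrieve_and_process_data`; the 5am cutoff comes from the note above.

```python
from datetime import date, datetime, timedelta

def previous_day_range(now=None, ready_hour=5):
    """Return (start, end) covering only the previous calendar day.

    Hypothetical helper: refuses to run before `ready_hour`, on the
    assumption that the archive has not been updated yet.
    """
    now = now or datetime.now()
    if now.hour < ready_hour:
        raise RuntimeError("Too early: the previous day's data may not be available yet")
    yesterday = now.date() - timedelta(days=1)
    return yesterday, yesterday

# Example with a fixed timestamp so the result is reproducible
example_start, example_end = previous_day_range(datetime(2021, 4, 26, 6, 30))
print(example_start, example_end)
```

The returned dates have the same type as the `start` and `end` values passed to `retrieve_data` in the cells below.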

In [None]:
start = date(2021,4,21)
end = date(2021,4,25)

In [None]:
raw_data = retrieve_data(start=start, end=end)

In [None]:
depart = raw_data_to_raw_df(raw_data, 'Depart')
print(depart.shape[0])
depart.tail()

In [None]:
arrive = raw_data_to_raw_df(raw_data, 'Arrive')
print(arrive.shape[0])
arrive.tail()

### Save the raw DF to disk

In [None]:
arrive_filestring = './data/trains_raw/arrive_raw_{}_{}.csv'.format(str(start), str(end))
depart_filestring = './data/trains_raw/depart_raw_{}_{}.csv'.format(str(start), str(end))

# `lineterminator` is the keyword in pandas >= 1.5 (older releases spell it `line_terminator`)
arrive.to_csv(arrive_filestring, lineterminator='\n', index=False)
depart.to_csv(depart_filestring, lineterminator='\n', index=False)

### Process the raw DF with modifications/additions 
* Modifications to the data:
    * Separate the Origin Date and Origin Week Day into two columns
    * Add separate columns for Origin Year and Origin Month
    * Separate the Scheduled Arrival/Departure Date, Scheduled Arrival/Departure Week Day, and Scheduled Arrival/Departure Time into three separate columns
    * Calculate the value of the time difference between Scheduled and Actual Arrival/Departure
    * Convert Service Disruption and Cancellation column text flags to binary indicator columns
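
Two of the modifications listed above can be sketched in plain Python. This is not the code inside `process_columns`: the function names, the choice of minutes as the unit, and the flag text `'SD'` are assumptions made for the example.

```python
from datetime import datetime

def minutes_from_schedule(scheduled, actual):
    """Signed difference in minutes; positive means later than scheduled."""
    return (actual - scheduled).total_seconds() / 60

def text_flag_to_indicator(cell_text, flag):
    """Return 1 if the flag text appears in the cell, otherwise 0."""
    return int(flag in (cell_text or ""))

# Full datetimes (not bare times) keep the difference correct across midnight
sched = datetime(2021, 4, 21, 23, 55)
actual = datetime(2021, 4, 22, 0, 7)
delay = minutes_from_schedule(sched, actual)
print(delay)

# 'SD' is a made-up flag value used only for this example
print(text_flag_to_indicator("SD", "SD"), text_flag_to_indicator("", "SD"))
```

Working from full scheduled and actual datetimes is what makes a late-night train that arrives after midnight come out as a small positive delay rather than a large negative one.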
    
    

In [None]:
full_arrive = process_columns(arrive, 'Arrive')
full_arrive.head()

In [None]:
full_depart = process_columns(depart, "Depart")
full_depart.head()

### For new 2021 data, concatenate with previously retrieved and processed data from this year

In [None]:
arrive_filestring2021 = './data/trains/arrive_2021_processed.csv'
depart_filestring2021 = './data/trains/depart_2021_processed.csv'
        
prev_arrive2021 = pd.read_csv(arrive_filestring2021)
prev_depart2021 = pd.read_csv(depart_filestring2021)

In [None]:
new_arrive2021 = pd.concat([prev_arrive2021, full_arrive], ignore_index=True, axis=0)
new_depart2021 = pd.concat([prev_depart2021, full_depart], ignore_index=True, axis=0)

In [None]:
new_arrive2021.shape[0]

In [None]:
new_depart2021.shape[0]

### Drop duplicate rows

In [None]:
new_arrive2021.drop_duplicates(inplace = True, ignore_index = True)
new_arrive2021.shape[0]

In [None]:
new_depart2021.drop_duplicates(inplace = True, ignore_index = True)
new_depart2021.shape[0]

In [None]:
new_arrive2021.head()

In [None]:
new_arrive2021.tail()

In [None]:
new_depart2021.head()

In [None]:
new_depart2021.tail()

In [None]:
# `lineterminator` is the keyword in pandas >= 1.5 (older releases spell it `line_terminator`)
new_arrive2021.to_csv(arrive_filestring2021, lineterminator='\n', index=False)
new_depart2021.to_csv(depart_filestring2021, lineterminator='\n', index=False)

# Part 2 - Visual Crossing Weather Data

### Setup

In [None]:
import requests
import os
import pandas as pd
import numpy as np
from datetime import date, timedelta

In [None]:
from weather_retrieve_and_process_data import *
assert os.environ.get('VC_TOKEN') is not None , 'empty token!'

### Retrieve unprocessed data

In [None]:
start = str(date.today()-timedelta(days=1))
end = str(date.today()-timedelta(days=1))

In [None]:
successful_retrievals = retrieve_weather_data(start, end)

### Data Cleaning/Taking Subset of Columns

* Recent data is processed by year: new columns are added, minor string-format fixes are made, and a subset of the full column list is kept.
* The function processes the files that were successfully created in the previous step.
* This step assumes 2021 data is being read; it concatenates the previously retrieved data with the new data to create a single combined file.
* The output shows the fraction of the data kept. The data is almost always valid and complete ($> 99\%$ of the original data is retained).

In [None]:
process_weather_data(files_to_process=successful_retrievals)

### Data sample for viewing

In [None]:
sample = pd.read_csv('./data/weather/Providence_RI_weather_2021_subset.csv')
sample.head()

In [None]:
sample.tail()

# Part 3a: Loading Data into Postgres Database (Composite Primary Key)

### Setup

In [None]:
import psycopg2
import csv
import os
import sys 
import time
assert os.environ.get('DB_PASS') is not None, 'empty password!'

#### Functions to create and update tables in the database

In [None]:
def create_table(conn, command):
    """
    Create a table in the PostgreSQL database from the specified command.
    """
    try:
        cur = conn.cursor()
        cur.execute(command)
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        conn.rollback()


def update_table(conn, command, csv_file):
    """
    Insert rows from a CSV file into table specified by the command.
    """
    cur = conn.cursor()
    with open(csv_file, newline='') as file:
        info_reader = csv.reader(file, delimiter=',')
        next(info_reader) # Skip header                                                                          
        for row in info_reader:                                           
            try:
                cur.execute(command, tuple(row))
            except (Exception, psycopg2.DatabaseError) as error:
                print(error)
                conn.rollback()
        conn.commit() 
        
def update_trains(conn, command, arr_or_dep, csv_file):
    """
    Insert rows from trains CSV file into table specified by the command.
    """
    cur = conn.cursor()
    with open(csv_file, newline='') as file:
        info_reader = csv.reader(file, delimiter=',')
        next(info_reader) # Skip header                                                                          
        for row in info_reader:                                           
            try:
                cur.execute(command, tuple([arr_or_dep] + row))
            except (Exception, psycopg2.DatabaseError) as error:
                print(error)
                conn.rollback()
        conn.commit() 
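
One caveat with `update_table` and `update_trains` as written: `conn.rollback()` in the `except` branch undoes every row inserted since the last commit, not just the failing one, because the commit only happens after the loop. A per-row savepoint is one way to keep the earlier rows. The sketch below shows the pattern with the standard-library `sqlite3` module standing in for psycopg2 so that it runs without a database server; `insert_rows_with_savepoints`, `station_demo`, and the sample rows are hypothetical. PostgreSQL accepts the same `SAVEPOINT`, `ROLLBACK TO SAVEPOINT`, and `RELEASE SAVEPOINT` statements.

```python
import sqlite3

def insert_rows_with_savepoints(conn, command, rows):
    """Insert rows one by one; a failing row is undone without losing earlier rows.

    Returns the number of rows that were rejected.
    """
    cur = conn.cursor()
    failed = 0
    cur.execute("BEGIN")
    for row in rows:
        cur.execute("SAVEPOINT row_sp")
        try:
            cur.execute(command, tuple(row))
        except sqlite3.Error as error:
            print(error)
            failed += 1
            # Undo only the work done since the savepoint
            cur.execute("ROLLBACK TO SAVEPOINT row_sp")
        cur.execute("RELEASE SAVEPOINT row_sp")
    cur.execute("COMMIT")
    return failed

# isolation_level=None lets the sketch issue BEGIN/COMMIT explicitly
demo = sqlite3.connect(":memory:", isolation_level=None)
demo.execute(
    "CREATE TABLE station_demo (station_code TEXT PRIMARY KEY, nb_mile REAL NOT NULL)"
)

# Made-up rows; the middle one violates the NOT NULL constraint
rows = [("BOS", 0.0), ("PVD", None), ("NYP", 1.0)]
failed = insert_rows_with_savepoints(
    demo, "INSERT INTO station_demo VALUES (?, ?)", rows
)
kept = [r[0] for r in demo.execute(
    "SELECT station_code FROM station_demo ORDER BY station_code"
)]
print(failed, kept)
```

The trade-off is a few extra statements per row in exchange for a load that skips only the rows that actually fail.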

In [None]:
create_station_info_table_command = """ 
                                    DROP TABLE IF EXISTS station_info CASCADE;

                                    CREATE TABLE station_info (
                                        station_code text PRIMARY KEY,
                                        station_name text,
                                        state text,
                                        amtrak_city text,
                                        weather_loc text,
                                        longitude real,
                                        latitude real,
                                        nb_mile numeric,
                                        sb_mile numeric
                                    );
                                    """

insert_into_station_info_table_command = """
                                         INSERT INTO station_info (
                                             station_code,
                                             station_name,
                                             state,
                                             amtrak_city,
                                             weather_loc,
                                             longitude,
                                             latitude,
                                             nb_mile,
                                             sb_mile
                                        )
                                        VALUES
                                            (%s, %s, %s, %s, %s, %s, %s, %s, %s)
                                        ON CONFLICT DO NOTHING;
                                        """    

In [None]:
create_trains_table_command = """ 
                              DROP TABLE IF EXISTS all_trains CASCADE;
                              CREATE TABLE all_trains (
                                  arr_or_dep text,
                                  train_num text,
                                  station_code text REFERENCES station_info (station_code), 
                                  direction text,
                                  origin_date date,
                                  origin_year int,
                                  origin_month int,
                                  origin_week_day text,
                                  full_sched_arr_dep_datetime timestamp,
                                  sched_arr_dep_date date,
                                  sched_arr_dep_week_day text,
                                  sched_arr_dep_time time,
                                  act_arr_dep_time time,
                                  full_act_arr_dep_datetime timestamp,
                                  timedelta_from_sched numeric,
                                  service_disruption boolean,
                                  cancellations boolean,
                                  PRIMARY KEY (train_num, station_code, origin_date)
                              );
                              """

insert_into_trains_table_command = """
                                     INSERT INTO all_trains (
                                          arr_or_dep,
                                          train_num,
                                          station_code,
                                          direction,
                                          origin_date,
                                          origin_year,
                                          origin_month,
                                          origin_week_day,
                                          full_sched_arr_dep_datetime,
                                          sched_arr_dep_date,
                                          sched_arr_dep_week_day,
                                          sched_arr_dep_time,
                                          act_arr_dep_time,
                                          full_act_arr_dep_datetime,
                                          timedelta_from_sched,
                                          service_disruption,
                                          cancellations
                                     )
                                     VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                                     ON CONFLICT DO NOTHING; 
                                     """  

In [None]:
create_weather_table_command = """
                               DROP TABLE IF EXISTS weather_hourly CASCADE;
                               CREATE TABLE weather_hourly (
                                   location text,
                                   date_time timestamp,
                                   temperature real,
                                   precipitation real,
                                   cloud_cover real,
                                   conditions text, 
                                   weather_type text,
                                   latitude real,
                                   longitude real,
                                   PRIMARY KEY (date_time, location)
                               );
                               """

insert_into_weather_table_command = """
                                    INSERT INTO weather_hourly (
                                        location,
                                        date_time,
                                        temperature,
                                        precipitation,
                                        cloud_cover,
                                        conditions,
                                        weather_type,
                                        latitude,
                                        longitude
                                    )
                                    VALUES
                                        (%s, %s, %s, %s, %s, %s, %s, %s, %s) 
                                    ON CONFLICT DO NOTHING;
                                    """ 

In [None]:
create_route_table_command = """
                             DROP TABLE IF EXISTS regional_route CASCADE;
                            
                             CREATE TABLE regional_route (
                                 coord_id SERIAL PRIMARY KEY,
                                 longitude real,
                                 latitude real,
                                 path_group numeric,
                                 connecting_path text, 
                                 nb_station_group text,
                                 sb_station_group text
                             );
                             """

insert_into_route_table_command = """
                                  INSERT INTO regional_route (
                                      longitude,
                                      latitude, 
                                      path_group,
                                      connecting_path,
                                      nb_station_group,
                                      sb_station_group
                                  )
                                  VALUES 
                                      (%s, %s, %s, %s, %s, %s) 
                                  ON CONFLICT DO NOTHING;
                                  """

In [None]:
create_table_commands = [create_station_info_table_command,
                         create_trains_table_command,
                         create_weather_table_command,
                         create_route_table_command]

In [None]:
# Keyword arguments avoid quoting problems if the password contains spaces or special characters
conn = psycopg2.connect(dbname='amtrakproject', user=os.environ.get('USER'), password=os.environ.get('DB_PASS'))
assert conn is not None, 'need to fix conn!!'

In [None]:
for command in create_table_commands:
    create_table(conn, command)

In [None]:
# Insert all station facts into station info table
update_table(conn, insert_into_station_info_table_command, './data/facts/geo_stations_info.csv')

# Insert route with the coordinates into route table
update_table(conn, insert_into_route_table_command, './data/facts/NE_regional_lonlat.csv')

In [None]:
create_table(conn, create_trains_table_command)
years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]

begin_everything = time.time()

# Insert all arrival and departure data into the all_trains table
for year in years:
    start = time.time()
    arrive_csv = './data/trains/arrive_{}_processed.csv'.format(year)
    depart_csv = './data/trains/depart_{}_processed.csv'.format(year)
    update_trains(conn, insert_into_trains_table_command, 'Arrival', arrive_csv)
    update_trains(conn, insert_into_trains_table_command, 'Departure', depart_csv)
    print('DONE WITH', year, 'in', time.time() - start)
print('COMPLETE in', time.time() - begin_everything)
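
Because `all_trains` uses the composite key `(train_num, station_code, origin_date)` together with `ON CONFLICT DO NOTHING`, any later row that repeats an existing key is skipped silently, whatever its `arr_or_dep` value. The sketch below demonstrates that behavior on a cut-down table, using the standard-library `sqlite3` module (SQLite 3.24 or newer) as a stand-in for PostgreSQL; the table `trains_demo` and the sample rows are made up. Whether this affects the real load depends on whether the arrival and departure files ever share the same train, station, and origin date, which is not shown here.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("""
    CREATE TABLE trains_demo (
        arr_or_dep TEXT,
        train_num TEXT,
        station_code TEXT,
        origin_date TEXT,
        PRIMARY KEY (train_num, station_code, origin_date)
    )
""")

insert = """
    INSERT INTO trains_demo (arr_or_dep, train_num, station_code, origin_date)
    VALUES (?, ?, ?, ?)
    ON CONFLICT DO NOTHING
"""

# Made-up rows for illustration only
rows = [
    ("Arrival", "171", "PVD", "2021-04-21"),
    ("Departure", "171", "PVD", "2021-04-21"),  # same key as the row above
    ("Arrival", "171", "PVD", "2021-04-22"),
]
for row in rows:
    demo.execute(insert, row)
demo.commit()

stored = demo.execute(
    "SELECT arr_or_dep, origin_date FROM trains_demo ORDER BY origin_date"
).fetchall()
print(stored)
```

A quick check after a real load is to compare the row count of `all_trains` with the combined row counts of the processed CSV files.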

In [None]:
create_table(conn, create_weather_table_command)
location_names_for_files = ['Boston_MA', 'Providence_RI', 'Kingston_RI', 'Westerly_RI', 'Mystic_CT',
                            'New_London_CT', 'Old_Saybrook_CT', 'New_Haven_CT', 'Bridgeport_CT', 
                            'Stamford_CT', 'New_Rochelle_NY', 'Manhattan_NY', 'Newark_NJ', 'Iselin_NJ', 
                            'Trenton_NJ', 'Philadelphia_PA', 'Wilmington_DE','Aberdeen_MD', 'Baltimore_MD',
                            'Baltimore_BWI_Airport_MD', 'New_Carrollton_MD', 'Washington_DC']

years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]

# Insert all weather data into the weather data table
begin_everything = time.time()
for location in location_names_for_files:
    start = time.time()
    for year in years:
        weather_csv = './data/weather/{}_weather_subset_{}.csv'.format(location, year)
        update_table(conn, insert_into_weather_table_command, weather_csv)
    print('Finished adding location', location, 'to the database in', time.time() - start, 'seconds')
print("COMPLETE in", time.time() - begin_everything)

# Part 3b: Loading Data into Postgres Database (Serial Primary Key)

### Setup

In [1]:
import psycopg2
import csv
import os
import sys 
import time
assert os.environ.get('DB_PASS') is not None, 'empty password!'

In [2]:
def create_table(conn, command):
    """
    Create a table in the PostgreSQL database from the specified command.
    """
    try:
        cur = conn.cursor()
        cur.execute(command)
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        conn.rollback()


def update_table(conn, command, csv_file):
    """
    Insert rows from a CSV file into table specified by the command.
    """
    cur = conn.cursor()
    with open(csv_file, newline='') as file:
        info_reader = csv.reader(file, delimiter=',')
        next(info_reader) # Skip header                                                                          
        for row in info_reader:                                           
            try:
                cur.execute(command, tuple(row))
            except (Exception, psycopg2.DatabaseError) as error:
                print(error)
                conn.rollback()
        conn.commit() 
        
def update_trains(conn, command, arr_or_dep, csv_file):
    """
    Insert rows from trains CSV file into table specified by the command.
    """
    cur = conn.cursor()
    with open(csv_file, newline='') as file:
        info_reader = csv.reader(file, delimiter=',')
        next(info_reader) # Skip header                                                                          
        for row in info_reader:                                           
            try:
                cur.execute(command, tuple([arr_or_dep] + row))
            except (Exception, psycopg2.DatabaseError) as error:
                print(error)
                conn.rollback()
        conn.commit() 

In [3]:
create_station_info_table = """ 
                            DROP TABLE IF EXISTS station_info CASCADE;

                            CREATE TABLE station_info (
                                station_id SERIAL PRIMARY KEY,
                                station_code text,
                                amtrak_station text,
                                crew_change boolean,
                                weather_station text,
                                longitude real,
                                latitude real,
                                nb_mile numeric,
                                sb_mile numeric
                            );
                            """

insert_into_station_info_table = """
                                 INSERT INTO station_info (
                                     station_code,
                                     amtrak_station,
                                     crew_change,
                                     weather_station,
                                     longitude,
                                     latitude,
                                     nb_mile,
                                     sb_mile
                                 )
                                 VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
                                 ON CONFLICT DO NOTHING;
                                 """    

In [4]:
create_trains_table = """ 
                      DROP TABLE IF EXISTS all_trains CASCADE;
                      
                      CREATE TABLE all_trains (
                          dataset_id SERIAL PRIMARY KEY,
                          arr_or_dep text,
                          train_num text,
                          station_code text, 
                          direction text,
                          origin_date date,
                          origin_year int,
                          origin_month int,
                          origin_week_day text,
                          full_sched_arr_dep_datetime timestamp,
                          sched_arr_dep_date date,
                          sched_arr_dep_week_day text,
                          sched_arr_dep_time time,
                          act_arr_dep_time time,
                          full_act_arr_dep_datetime timestamp,
                          timedelta_from_sched numeric,
                          service_disruption boolean,
                          cancellations boolean
                      );
                      """

insert_into_trains_table = """
                           INSERT INTO all_trains (
                               arr_or_dep,
                               train_num,
                               station_code,
                               direction,
                               origin_date,
                               origin_year,
                               origin_month,
                               origin_week_day,
                               full_sched_arr_dep_datetime,
                               sched_arr_dep_date,
                               sched_arr_dep_week_day,
                               sched_arr_dep_time,
                               act_arr_dep_time,
                               full_act_arr_dep_datetime,
                               timedelta_from_sched,
                               service_disruption,
                               cancellations
                          )
                          VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                          ON CONFLICT DO NOTHING; 
                          """  

In [5]:
create_weather_table = """
                       DROP TABLE IF EXISTS weather_hourly CASCADE;
                       CREATE TABLE weather_hourly (
                           weather_id SERIAL PRIMARY KEY,
                           location text,
                           date_time timestamp,
                           temperature real,
                           precipitation real,
                           cloud_cover real,
                           conditions text, 
                           weather_type text,
                           latitude real,
                           longitude real
                       );
                       """

insert_into_weather_table = """
                            INSERT INTO weather_hourly (
                                location,
                                date_time,
                                temperature,
                                precipitation,
                                cloud_cover,
                                conditions,
                                weather_type,
                                latitude,
                                longitude
                            )
                            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s) 
                            ON CONFLICT DO NOTHING;
                            """ 

In [6]:
create_route_table = """
                     DROP TABLE IF EXISTS regional_route CASCADE;

                     CREATE TABLE regional_route (
                         coord_id SERIAL PRIMARY KEY,
                         longitude real,
                         latitude real,
                         path_group int,
                         station_pairing text, 
                         nb_station_group text,
                         sb_station_group text
                     );
                     """

insert_into_route_table = """
                          INSERT INTO
                              regional_route (
                                  longitude,
                                  latitude, 
                                  path_group,
                                  station_pairing,
                                  nb_station_group,
                                  sb_station_group
                              )
                          VALUES 
                              (%s, %s, %s, %s, %s, %s) 
                          ON CONFLICT DO NOTHING;
                          """

In [7]:
# Keyword arguments avoid quoting problems if the password contains spaces or special characters
conn = psycopg2.connect(dbname='amtrakproject', user=os.environ.get('USER'), password=os.environ.get('DB_PASS'))
assert conn is not None, 'need to fix conn!!'

In [8]:
# Create station link table
create_table(conn, create_station_info_table)

# Create route coordinates table
create_table(conn, create_route_table)

In [10]:
# Insert all station facts into station info table
update_table(conn, insert_into_station_info_table, './data/facts/geo_stations_info.csv')

# Insert route with the coordinates into route table
update_table(conn, insert_into_route_table, './data/facts/NE_regional_lonlat.csv')

In [11]:
create_table(conn, create_trains_table)
years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]

# Insert all arrival and departure data into the all_trains table
begin_everything = time.time()
for year in years:
    start = time.time()
    arrive_csv = './data/trains/arrive_{}_processed.csv'.format(year)
    depart_csv = './data/trains/depart_{}_processed.csv'.format(year)
    update_trains(conn, insert_into_trains_table, 'Arrival', arrive_csv)
    update_trains(conn, insert_into_trains_table, 'Departure', depart_csv)
    print('Finished adding year', year, 'to database in', time.time() - start, 'seconds')
print('COMPLETE in', time.time() - begin_everything)

Finished adding year 2011 to database in 4.946296215057373 seconds
Finished adding year 2012 to database in 4.8167150020599365 seconds
Finished adding year 2013 to database in 5.1157310009002686 seconds
Finished adding year 2014 to database in 5.718309164047241 seconds
Finished adding year 2015 to database in 5.917017221450806 seconds
Finished adding year 2016 to database in 6.266315698623657 seconds
Finished adding year 2017 to database in 6.2739222049713135 seconds
Finished adding year 2018 to database in 6.24074912071228 seconds
Finished adding year 2019 to database in 6.481274127960205 seconds
Finished adding year 2020 to database in 4.550313949584961 seconds
Finished adding year 2021 to database in 1.6248202323913574 seconds
COMPLETE in 57.953227043151855


In [12]:
create_table(conn, create_weather_table)
location_names_for_files = ['Boston_MA', 'Providence_RI', 'Kingston_RI', 'Westerly_RI', 'Mystic_CT',
                            'New_London_CT', 'Old_Saybrook_CT', 'New_Haven_CT', 'Bridgeport_CT', 
                            'Stamford_CT', 'New_Rochelle_NY', 'Manhattan_NY', 'Newark_NJ', 'Iselin_NJ', 
                            'Trenton_NJ', 'Philadelphia_PA', 'Wilmington_DE','Aberdeen_MD', 'Baltimore_MD',
                            'Baltimore_BWI_Airport_MD', 'New_Carrollton_MD', 'Washington_DC']

years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]

# Insert all weather data into the weather data table
begin_everything = time.time()
for location in location_names_for_files:
    start = time.time()
    for year in years:
        weather_csv = './data/weather/{}_weather_subset_{}.csv'.format(location, year)
        update_table(conn, insert_into_weather_table, weather_csv)
    print('Finished adding location', location, 'to database in', time.time() - start, 'seconds')
print("COMPLETE in", time.time() - begin_everything)

Finished adding location Boston_MA to database in 2.7271220684051514 seconds
Finished adding location Providence_RI to database in 2.7188689708709717 seconds
Finished adding location Kingston_RI to database in 2.6787750720977783 seconds
Finished adding location Westerly_RI to database in 2.68355393409729 seconds
Finished adding location Mystic_CT to database in 2.68345308303833 seconds
Finished adding location New_London_CT to database in 2.692837953567505 seconds
Finished adding location Old_Saybrook_CT to database in 2.6952991485595703 seconds
Finished adding location New_Haven_CT to database in 2.7057251930236816 seconds
Finished adding location Bridgeport_CT to database in 2.6888539791107178 seconds
Finished adding location Stamford_CT to database in 2.6934330463409424 seconds
Finished adding location New_Rochelle_NY to database in 2.7178261280059814 seconds
Finished adding location Manhattan_NY to database in 2.7187106609344482 seconds
Finished adding location Newark_NJ to databas