# ETL and EDA Notebook

# Part 1 - Amtrak Northeast Regional Train Data
* This project would not be possible without the diligent joint effort by [Chris Juckins](https://juckins.net/index.php) and [John Bobinyec](http://dixielandsoftware.net/Amtrak/status/StatusMaps/) to collect and preserve Amtrak's on-time performance records. Chris Juckins' archive of timetables was another invaluable resource which enabled me to sort through the trains and stations I chose to use in this project.
* The train data is sourced from [Amtrak Status Maps Archive Database (ASMAD)](https://juckins.net/amtrak_status/archive/html/home.php), and has been retrieved with Chris' permission.

### Overview of the Process
* Functions were written to scrape the HTML table returned by each search query and to convert each column into the desired format
* Additional columns were added during processing to make it easier to join the train data with the weather data

### Setup

In [1]:
import time
import requests
import re
import lxml.html as lh
import pandas as pd
import numpy as np
from datetime import date, timedelta
from trains_retrieve_and_process_data import * 

### Retrieve HTML table data and recreate as a Pandas DataFrame
* By default, data is collected for the previous day. Run this after 5 a.m.: ASMAD updates around 4 a.m., and an earlier run will retrieve no data.
* Both arrival and departure data are collected and stored in a dictionary that is further indexed by station.
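The previous-day default described above could be computed along these lines. This is only a sketch: `default_date_range` and `EARLIEST_SAFE_RUN` are hypothetical names, not part of `trains_retrieve_and_process_data`, and raising an error before the cutoff is an assumed design choice.

```python
from datetime import datetime, time, timedelta

# Assumption: ASMAD has finished its nightly update by 5 a.m.
EARLIEST_SAFE_RUN = time(5, 0)

def default_date_range(now):
    """Return (start, end) covering only the day before `now`.

    Raises ValueError if `now` is before the cutoff, since the archive
    may not yet contain the previous day's records.
    """
    if now.time() < EARLIEST_SAFE_RUN:
        raise ValueError("Run after 5 a.m.; ASMAD may not have updated yet")
    previous_day = now.date() - timedelta(days=1)
    return previous_day, previous_day

start, end = default_date_range(datetime(2021, 6, 12, 9, 30))
print(start, end)
```

The resulting `start` and `end` have the same `date` type as the values passed to `retrieve_data` in the cells below.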

In [2]:
start = date(2021,6,11)
end = date(2021,6,11)

In [3]:
raw_data = retrieve_data(start=start, end=end)

Complete in 19.277547121047974 seconds


In [4]:
depart =  raw_data_to_raw_df(raw_data, 'Depart')
print(depart.shape[0])
depart.tail()

STATION:   EWR  (Depart) - Train # Group: [66, 82, 86, 88, 94, 132, 96, 176, 178, 190, 194] | No data for time period, or an error occurred during data retrieval.
STATION:   EWR  (Depart) - Train # Group: [67, 83, 93, 95, 99, 135, 65, 149, 169, 177] | No data for time period, or an error occurred during data retrieval.
STATION:   ABE  (Depart) - Train # Group: [66, 82, 86, 88, 94, 132, 96, 176, 178, 190, 194] | No data for time period, or an error occurred during data retrieval.
STATION:   ABE  (Depart) - Train # Group: [150, 160, 162, 164, 166, 168, 170, 172, 174] | No data for time period, or an error occurred during data retrieval.
STATION:   ABE  (Depart) - Train # Group: [67, 83, 93, 95, 99, 135, 65, 149, 169, 177] | No data for time period, or an error occurred during data retrieval.
242


Unnamed: 0,Direction,Station,Train #,Origin Date,Sch Dp,Act Dp,Comments,Service Disruption,Cancellations
237,Northbound,WAS,176,06/11/2021 (Fr),06/11/2021 11:55 AM (Fr),12:27PM,Ar: 44 min late. | Dp: 32 min late.,,
238,Northbound,WAS,66,06/11/2021 (Fr),06/11/2021 10:00 PM (Fr),10:00PM,Ar: 26 min late. | Dp: On time.,,
239,Northbound,WAS,170,06/11/2021 (Fr),06/11/2021 4:45 AM (Fr),4:45AM,Dp: On time.,,
240,Northbound,WAS,172,06/11/2021 (Fr),06/11/2021 7:05 AM (Fr),7:05AM,Dp: On time.,,
241,Northbound,WAS,174,06/11/2021 (Fr),06/11/2021 10:01 AM (Fr),10:01AM,Ar: 1 min early. | Dp: On time.,,


In [5]:
arrive = raw_data_to_raw_df(raw_data, 'Arrive')
print(arrive.shape[0])
arrive.tail()

47


Unnamed: 0,Direction,Station,Train #,Origin Date,Sch Ar,Act Ar,Comments,Service Disruption,Cancellations
42,Southbound,WAS,95,06/11/2021 (Fr),06/11/2021 2:01 PM (Fr),2:01PM,Ar: On time. | Dp: On time.,,
43,Southbound,WAS,93,06/11/2021 (Fr),06/11/2021 5:21 PM (Fr),5:23PM,Ar: 2 min late.,,
44,Southbound,WAS,171,06/11/2021 (Fr),06/11/2021 4:20 PM (Fr),4:30PM,Ar: 10 min late. | Dp: 12 min late.,,
45,Southbound,WAS,173,06/11/2021 (Fr),06/11/2021 7:13 PM (Fr),7:11PM,Ar: 2 min early.,,
46,Southbound,WAS,137,06/11/2021 (Fr),06/11/2021 9:55 PM (Fr),10:05PM,Ar: 10 min late.,,


### Save the raw DF to disk

In [6]:
arrive_filestring = './data/trains_raw/arrive_raw_{}_{}.csv'.format(str(start), str(end))
depart_filestring = './data/trains_raw/depart_raw_{}_{}.csv'.format(str(start), str(end))

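# Note: pandas 1.5 renamed this argument to `lineterminator`, and pandas 2.0 removed `line_terminator`
arrive.to_csv(arrive_filestring, line_terminator='\n', index=False)
depart.to_csv(depart_filestring, line_terminator='\n', index=False)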

### Process the raw DF with modifications/additions 
* Modifications to the data:
    * Separate the Origin Date and Origin Week Day into two columns
    * Add separate columns for Origin Year and Origin Month
    * Separate the Scheduled Arrival/Departure Date, Scheduled Arrival/Departure Week Day, and Scheduled Arrival/Departure Time into three separate columns
    * Calculate the value of the time difference between Scheduled and Actual Arrival/Departure
    * Convert Service Disruption and Cancellation column text flags to binary indicator columns
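
The date splitting and time-difference steps listed above can be sketched with the standard library. This is not the code in `process_columns`: the helper names are hypothetical, and the 12-hour rule for runs that cross midnight is an assumption.

```python
from datetime import datetime, timedelta

def split_scheduled(raw):
    """Split '06/11/2021 11:55 AM (Fr)' into (datetime, weekday name)."""
    stamp = raw.rsplit(" (", 1)[0]  # drop the trailing '(Fr)' tag
    scheduled = datetime.strptime(stamp, "%m/%d/%Y %I:%M %p")
    return scheduled, scheduled.strftime("%A")

def minutes_from_schedule(scheduled, actual_raw):
    """Minutes between the scheduled time and an actual time like '12:27PM'.

    Positive means late. The actual time carries no date, so it is placed on
    the scheduled date and shifted by a day if that puts it more than 12 hours
    away (assumption: no train is more than 12 hours off schedule).
    """
    clock = datetime.strptime(actual_raw, "%I:%M%p").time()
    actual = datetime.combine(scheduled.date(), clock)
    if actual - scheduled > timedelta(hours=12):
        actual -= timedelta(days=1)
    elif scheduled - actual > timedelta(hours=12):
        actual += timedelta(days=1)
    return int((actual - scheduled).total_seconds() // 60)

# Raw values taken from the departure table above (train 176 at WAS)
scheduled, weekday = split_scheduled("06/11/2021 11:55 AM (Fr)")
diff = minutes_from_schedule(scheduled, "12:27PM")
print(scheduled.date(), weekday, scheduled.time(), diff)
```

For the sample row this gives a difference of 32 minutes, which agrees with the `Dp: 32 min late.` comment in the raw data.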
    
    

In [7]:
full_arrive = process_columns(arrive, 'Arrive')
full_arrive.head()

Unnamed: 0,Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,Origin Week Day,Full Sch Ar Date,Sch Ar Date,Sch Ar Day,Sch Ar Time,Act Ar Time,Full Act Ar Date,Arrive Diff,Service Disruption,Cancellations
0,66,NYP,Northbound,2021-06-10,2021,6,Thursday,2021-06-11 01:55:00,2021-06-11,Friday,01:55:00,02:33:00,2021-06-11 02:33:00,38,0,0
1,176,NYP,Northbound,2021-06-11,2021,6,Friday,2021-06-11 15:20:00,2021-06-11,Friday,15:20:00,16:04:00,2021-06-11 16:04:00,44,0,0
2,170,NYP,Northbound,2021-06-11,2021,6,Friday,2021-06-11 08:15:00,2021-06-11,Friday,08:15:00,08:21:00,2021-06-11 08:21:00,6,0,0
3,172,NYP,Northbound,2021-06-11,2021,6,Friday,2021-06-11 10:44:00,2021-06-11,Friday,10:44:00,10:54:00,2021-06-11 10:54:00,10,0,0
4,174,NYP,Northbound,2021-06-11,2021,6,Friday,2021-06-11 13:35:00,2021-06-11,Friday,13:35:00,13:34:00,2021-06-11 13:34:00,-1,0,0


In [8]:
full_depart = process_columns(depart, "Depart")
full_depart.head()

Unnamed: 0,Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,Origin Week Day,Full Sch Dp Date,Sch Dp Date,Sch Dp Day,Sch Dp Time,Act Dp Time,Full Act Dp Date,Depart Diff,Service Disruption,Cancellations
0,95,BOS,Southbound,2021-06-11,2021,6,Friday,2021-06-11 06:10:00,2021-06-11,Friday,06:10:00,06:10:00,2021-06-11 06:10:00,0,0,0
1,93,BOS,Southbound,2021-06-11,2021,6,Friday,2021-06-11 09:30:00,2021-06-11,Friday,09:30:00,09:30:00,2021-06-11 09:30:00,0,0,0
2,65,BOS,Southbound,2021-06-11,2021,6,Friday,2021-06-11 21:30:00,2021-06-11,Friday,21:30:00,21:30:00,2021-06-11 21:30:00,0,0,0
3,171,BOS,Southbound,2021-06-11,2021,6,Friday,2021-06-11 08:15:00,2021-06-11,Friday,08:15:00,08:15:00,2021-06-11 08:15:00,0,0,0
4,173,BOS,Southbound,2021-06-11,2021,6,Friday,2021-06-11 11:15:00,2021-06-11,Friday,11:15:00,11:15:00,2021-06-11 11:15:00,0,0,0


### For new 2021 data, concatenate with previously retrieved and processed data from this year

In [9]:
arrive_filestring2021 = './data/trains/arrive_2021_processed.csv'
depart_filestring2021 = './data/trains/depart_2021_processed.csv'
        
prev_arrive2021 = pd.read_csv(arrive_filestring2021)
prev_depart2021 = pd.read_csv(depart_filestring2021)

In [10]:
new_arrive2021 = pd.concat([prev_arrive2021, full_arrive], ignore_index=True, axis=0)
new_depart2021 = pd.concat([prev_depart2021, full_depart], ignore_index=True, axis=0)

In [11]:
new_arrive2021.shape[0]

8970

In [12]:
new_depart2021.shape[0]

47272

### Drop duplicate rows

In [13]:
new_arrive2021.drop_duplicates(inplace = True, ignore_index = True)
new_arrive2021.shape[0]

8970

In [14]:
new_depart2021.drop_duplicates(inplace = True, ignore_index = True)
new_depart2021.shape[0]

47272

In [15]:
new_arrive2021.head()

Unnamed: 0,Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,Origin Week Day,Full Sch Ar Date,Sch Ar Date,Sch Ar Day,Sch Ar Time,Act Ar Time,Full Act Ar Date,Arrive Diff,Service Disruption,Cancellations
0,82,NYP,Northbound,2021-01-01,2021,1,Friday,2021-01-01 13:46:00,2021-01-01,Friday,13:46:00,13:45:00,2021-01-01 13:45:00,-1,0,0
1,88,NYP,Northbound,2021-01-01,2021,1,Friday,2021-01-01 14:46:00,2021-01-01,Friday,14:46:00,14:46:00,2021-01-01 14:46:00,0,0,0
2,66,NYP,Northbound,2021-01-01,2021,1,Friday,2021-01-02 01:25:00,2021-01-02,Saturday,01:25:00,01:14:00,2021-01-02 01:14:00,-11,0,0
3,82,NYP,Northbound,2021-01-02,2021,1,Saturday,2021-01-02 13:46:00,2021-01-02,Saturday,13:46:00,13:44:00,2021-01-02 13:44:00,-2,0,0
4,88,NYP,Northbound,2021-01-02,2021,1,Saturday,2021-01-02 14:46:00,2021-01-02,Saturday,14:46:00,14:50:00,2021-01-02 14:50:00,4,0,0


In [16]:
new_arrive2021.tail()

Unnamed: 0,Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,Origin Week Day,Full Sch Ar Date,Sch Ar Date,Sch Ar Day,Sch Ar Time,Act Ar Time,Full Act Ar Date,Arrive Diff,Service Disruption,Cancellations
8965,95,WAS,Southbound,2021-06-11 00:00:00,2021,6,Friday,2021-06-11 14:01:00,2021-06-11,Friday,14:01:00,14:01:00,2021-06-11 14:01:00,0,0,0
8966,93,WAS,Southbound,2021-06-11 00:00:00,2021,6,Friday,2021-06-11 17:21:00,2021-06-11,Friday,17:21:00,17:23:00,2021-06-11 17:23:00,2,0,0
8967,171,WAS,Southbound,2021-06-11 00:00:00,2021,6,Friday,2021-06-11 16:20:00,2021-06-11,Friday,16:20:00,16:30:00,2021-06-11 16:30:00,10,0,0
8968,173,WAS,Southbound,2021-06-11 00:00:00,2021,6,Friday,2021-06-11 19:13:00,2021-06-11,Friday,19:13:00,19:11:00,2021-06-11 19:11:00,-2,0,0
8969,137,WAS,Southbound,2021-06-11 00:00:00,2021,6,Friday,2021-06-11 21:55:00,2021-06-11,Friday,21:55:00,22:05:00,2021-06-11 22:05:00,10,0,0


In [17]:
new_depart2021.head()

Unnamed: 0,Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,Origin Week Day,Full Sch Dp Date,Sch Dp Date,Sch Dp Day,Sch Dp Time,Act Dp Time,Full Act Dp Date,Depart Diff,Service Disruption,Cancellations
0,99,BOS,Southbound,2021-01-01,2021,1,Friday,2021-01-01 08:40:00,2021-01-01,Friday,08:40:00,08:40:00,2021-01-01 08:40:00,0,0,0
1,99,BOS,Southbound,2021-01-02,2021,1,Saturday,2021-01-02 08:40:00,2021-01-02,Saturday,08:40:00,08:40:00,2021-01-02 08:40:00,0,0,0
2,99,BOS,Southbound,2021-01-03,2021,1,Sunday,2021-01-03 08:40:00,2021-01-03,Sunday,08:40:00,08:40:00,2021-01-03 08:40:00,0,0,0
3,67,BOS,Southbound,2021-01-03,2021,1,Sunday,2021-01-03 21:30:00,2021-01-03,Sunday,21:30:00,21:30:00,2021-01-03 21:30:00,0,0,0
4,95,BOS,Southbound,2021-01-04,2021,1,Monday,2021-01-04 06:10:00,2021-01-04,Monday,06:10:00,06:11:00,2021-01-04 06:11:00,1,0,0


In [18]:
new_depart2021.tail()

Unnamed: 0,Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,Origin Week Day,Full Sch Dp Date,Sch Dp Date,Sch Dp Day,Sch Dp Time,Act Dp Time,Full Act Dp Date,Depart Diff,Service Disruption,Cancellations
47267,176,WAS,Northbound,2021-06-11 00:00:00,2021,6,Friday,2021-06-11 11:55:00,2021-06-11,Friday,11:55:00,12:27:00,2021-06-11 12:27:00,32,0,0
47268,66,WAS,Northbound,2021-06-11 00:00:00,2021,6,Friday,2021-06-11 22:00:00,2021-06-11,Friday,22:00:00,22:00:00,2021-06-11 22:00:00,0,0,0
47269,170,WAS,Northbound,2021-06-11 00:00:00,2021,6,Friday,2021-06-11 04:45:00,2021-06-11,Friday,04:45:00,04:45:00,2021-06-11 04:45:00,0,0,0
47270,172,WAS,Northbound,2021-06-11 00:00:00,2021,6,Friday,2021-06-11 07:05:00,2021-06-11,Friday,07:05:00,07:05:00,2021-06-11 07:05:00,0,0,0
47271,174,WAS,Northbound,2021-06-11 00:00:00,2021,6,Friday,2021-06-11 10:01:00,2021-06-11,Friday,10:01:00,10:01:00,2021-06-11 10:01:00,0,0,0
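
In the `head()` and `tail()` outputs of the combined frames, `Origin Date` appears as `2021-01-01` for previously saved rows and as `2021-06-11 00:00:00` for newly processed rows. This suggests that the concatenated column mixes two representations. If so, `drop_duplicates` would not match a re-retrieved row against its saved copy. A minimal stdlib sketch of normalizing the value before comparing (the helper name is hypothetical):

```python
from datetime import datetime

def normalize_origin_date(value):
    """Reduce '2021-06-11' or '2021-06-11 00:00:00' to '2021-06-11'."""
    return datetime.fromisoformat(str(value)).date().isoformat()

rows = [
    ("95", "WAS", "2021-06-11"),           # as read back from a saved CSV
    ("95", "WAS", "2021-06-11 00:00:00"),  # as displayed for freshly processed rows
]
unique = {(train, station, normalize_origin_date(day)) for train, station, day in rows}
print(len(unique))
```

In the notebook itself, the same effect could probably be had by converting the column with `pd.to_datetime` on the combined frame before calling `drop_duplicates`.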


In [19]:
new_arrive2021.to_csv(arrive_filestring2021, line_terminator='\n', index=False)
new_depart2021.to_csv(depart_filestring2021, line_terminator='\n', index=False)

# Part 2 - Visual Crossing Weather Data

### Setup

In [None]:
import requests
import os
import pandas as pd
import numpy as np
from datetime import date, timedelta

In [None]:
from weather_retrieve_and_process_data import *
assert os.environ.get('VC_TOKEN') is not None , 'empty token!'

### Retrieve unprocessed data

In [None]:
start = str(date(2021,6,10))
end = str(date.today()-timedelta(days=1))

In [None]:
successful_retrievals = retrieve_weather_data(start, end)

### Data Cleaning/Taking Subset of Columns

* Recent data is processed by year: new columns are added, minor string-format fixes are made, and a subset of the full column list is kept.
* The function processes only the files that were successfully created in the previous step.
* This step assumes 2021 data is being read, and concatenates the previously retrieved data with the new data to create a single combined file.
* The output shows the fraction of the data kept. The data is almost always valid and complete ($> 99\%$ of the original data is retained).

In [None]:
process_weather_data(successful_retrievals)

### Data sample for viewing

In [None]:
sample = pd.read_csv('./data/weather/Providence_RI_weather_subset_2021.csv')
sample.head()

In [None]:
sample.tail()

# Part 3a - Loading Data into a PostgreSQL Database
Schema pictured below:
![Database Schema](data/schema/Final_DB_Schema.pdf)

### Setup

In [20]:
import psycopg2
import csv
import os
import sys 
import time
assert os.environ.get('DB_PASS') is not None, 'empty password!'

#### Functions to create and update tables in the database

In [21]:
def execute_command(conn, command):
    """
    Execute specified command in PostgreSQL database.
    """
    try:
        cur = conn.cursor()
        cur.execute(command)
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        conn.rollback()

def update_table(conn, command, csv_file):
    """
    Insert rows from a CSV file into table specified by the command.
    """
    cur = conn.cursor()
    with open(csv_file, newline='') as file:
        info_reader = csv.reader(file, delimiter=',')
        next(info_reader) # Skip header                                                                          
        for row in info_reader:                                           
            try:
                cur.execute(command, tuple(row))
            except (Exception, psycopg2.DatabaseError) as error:
                print(error)
                conn.rollback()
        conn.commit() 
        
def update_trains(conn, command, arr_or_dep, csv_file):
    """
    Insert rows from trains CSV file into table specified by the command.
    """
    cur = conn.cursor()
    with open(csv_file, newline='') as file:
        info_reader = csv.reader(file, delimiter=',')
        next(info_reader) # Skip header                                                                          
        for row in info_reader:                                           
            try:
                cur.execute(command, tuple([arr_or_dep] + row))
            except (Exception, psycopg2.DatabaseError) as error:
                print(error)
                conn.rollback()
        conn.commit() 
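
In `update_table` and `update_trains`, the commit happens only after the whole file has been read, so a `conn.rollback()` triggered by one failing row also discards every earlier uncommitted row from that file. One possible alternative is a savepoint per row. The sketch below uses the standard library's `sqlite3` so that it runs without a server; the table and function names are hypothetical. PostgreSQL accepts the same `SAVEPOINT`, `ROLLBACK TO SAVEPOINT` and `RELEASE SAVEPOINT` statements, which with psycopg2 would go through `cur.execute` with `%s` placeholders.

```python
import sqlite3

def insert_rows_with_savepoints(conn, command, rows):
    """Insert rows one at a time; a bad row is undone without losing earlier ones.

    Returns the number of rows that failed.
    """
    failed = 0
    conn.execute("BEGIN")
    for row in rows:
        conn.execute("SAVEPOINT row_sp")
        try:
            conn.execute(command, tuple(row))
        except sqlite3.Error as error:
            print(error)
            conn.execute("ROLLBACK TO SAVEPOINT row_sp")  # undo only this row
            failed += 1
        finally:
            conn.execute("RELEASE SAVEPOINT row_sp")
    conn.execute("COMMIT")
    return failed

# isolation_level=None leaves transaction control to the explicit statements above
demo_conn = sqlite3.connect(":memory:", isolation_level=None)
demo_conn.execute("CREATE TABLE demo (code TEXT, value REAL NOT NULL)")
demo_rows = [("A", 1.0), ("B", None), ("C", 3.0)]  # middle row violates NOT NULL
failed = insert_rows_with_savepoints(demo_conn, "INSERT INTO demo VALUES (?, ?)", demo_rows)
kept = demo_conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(failed, kept)
```

Here the invalid middle row is reported and skipped, while the rows before and after it are both committed.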

In [22]:
# Keyword arguments avoid quoting problems if the password contains spaces or quotes
conn = psycopg2.connect(dbname='amtrakproject',
                        user=os.environ.get('USER'),
                        password=os.environ.get('DB_PASS'))
assert conn is not None, 'need to fix conn!!'

In [23]:
create_station_info = """ 
                      DROP TABLE IF EXISTS station_info CASCADE;
                      CREATE TABLE station_info (
                          station_code text UNIQUE PRIMARY KEY,
                          amtrak_station_name text,
                          crew_change boolean,
                          weather_location_name text,
                          longitude real,
                          latitude real,
                          nb_next_station text,
                          sb_next_station text,
                          nb_mile numeric,
                          sb_mile numeric,
                          nb_stop_num numeric,
                          sb_stop_num numeric,
                          nb_miles_to_next numeric,
                          sb_miles_to_next numeric
                      );
                      """

insert_into_station_info = """
                           INSERT INTO
                               station_info (
                                   station_code,
                                   amtrak_station_name,
                                   crew_change,
                                   weather_location_name,
                                   longitude,
                                   latitude,
                                   nb_next_station,
                                   sb_next_station,
                                   nb_mile,
                                   sb_mile,
                                   nb_stop_num,
                                   sb_stop_num,
                                   nb_miles_to_next,
                                   sb_miles_to_next

                             )
                         VALUES
                             (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                         ON CONFLICT DO NOTHING;
                         """  

In [24]:
create_stops = """
                 DROP TABLE IF EXISTS stops CASCADE;
                 CREATE TABLE stops (
                     stop_id SERIAL PRIMARY KEY,
                     arrival_or_departure text, 
                     train_num text,
                     station_code text REFERENCES station_info,
                     direction text,
                     origin_date date,
                     origin_year int,
                     origin_month int,
                     origin_week_day text,
                     full_sched_arr_dep_datetime timestamp,
                     sched_arr_dep_date date,
                     sched_arr_dep_week_day text,
                     sched_arr_dep_time time,
                     act_arr_dep_time time,
                     full_act_arr_dep_datetime timestamp,
                     timedelta_from_sched numeric,
                     service_disruption boolean,
                     cancellations boolean
                 );
               """

insert_into_stops = """
                    INSERT INTO
                        stops (
                            arrival_or_departure,
                            train_num,
                            station_code,
                            direction,
                            origin_date,
                            origin_year,
                            origin_month,
                            origin_week_day,
                            full_sched_arr_dep_datetime,
                            sched_arr_dep_date,
                            sched_arr_dep_week_day,
                            sched_arr_dep_time,
                            act_arr_dep_time,
                            full_act_arr_dep_datetime,
                            timedelta_from_sched,
                            service_disruption,
                            cancellations
                          )
                      VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                      ON CONFLICT DO NOTHING;
                      """
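
`insert_into_stops` has 17 placeholders, while a processed trains CSV has 16 columns (assuming the files were written with `index=False`, as in the save cells above). `update_trains` fills the gap by prepending the arrival/departure label to each row. A small sketch of that alignment, using a hypothetical helper and one row copied from the processed arrival output:

```python
import csv
import io

def build_stop_params(arr_or_dep, row):
    """Prepend the arrival/departure label, as update_trains does."""
    return tuple([arr_or_dep] + row)

# Shortened stand-in for the VALUES clause of insert_into_stops: 17 placeholders
placeholders = ", ".join(["%s"] * 17)

# Header and one processed arrival row, in CSV column order
sample_csv = io.StringIO(
    "Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,"
    "Origin Week Day,Full Sch Ar Date,Sch Ar Date,Sch Ar Day,Sch Ar Time,"
    "Act Ar Time,Full Act Ar Date,Arrive Diff,Service Disruption,Cancellations\n"
    "176,NYP,Northbound,2021-06-11,2021,6,Friday,2021-06-11 15:20:00,"
    "2021-06-11,Friday,15:20:00,16:04:00,2021-06-11 16:04:00,44,0,0\n"
)
reader = csv.reader(sample_csv)
next(reader)  # skip header
params = build_stop_params("Arrival", next(reader))

# A mismatch here would surface as an error on every row at insert time
assert len(params) == placeholders.count("%s")
print(len(params), params[:3])
```

A check like this could be run once per file before inserting, since a column-count mismatch would otherwise produce one printed error per row.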

In [25]:
create_weather = """
                 DROP TABLE IF EXISTS weather_hourly CASCADE;
                 CREATE TABLE weather_hourly (
                     weather_id SERIAL PRIMARY KEY,
                     location text,
                     obs_datetime timestamp,
                     temperature real,
                     precipitation real,
                     cloud_cover real,
                     weather_type text
                 );
                 """

insert_into_weather = """
                      INSERT INTO
                          weather_hourly (
                              location,
                              obs_datetime,
                              temperature,
                              precipitation,
                              cloud_cover,
                              weather_type
                      )
                      VALUES
                          (%s, %s, %s, %s, %s, %s) 
                      ON CONFLICT DO NOTHING;
                      """ 

In [26]:
create_route = """
               DROP TABLE IF EXISTS regional_route CASCADE;

               CREATE TABLE regional_route (
                 coord_id SERIAL PRIMARY KEY,
                 longitude real,
                 latitude real,
                 path_group numeric,
                 connecting_path text, 
                 nb_station_group text,
                 sb_station_group text
               );
               """

insert_into_route = """
                    INSERT INTO
                      regional_route (
                          longitude,
                          latitude, 
                          path_group,
                          connecting_path,
                          nb_station_group,
                          sb_station_group
                      )
                    VALUES 
                        (%s, %s, %s, %s, %s, %s) 
                    ON CONFLICT DO NOTHING;
                    """

In [27]:
# Keyword arguments avoid quoting problems if the password contains spaces or quotes
conn = psycopg2.connect(dbname='amtrakproject',
                        user=os.environ.get('USER'),
                        password=os.environ.get('DB_PASS'))
assert conn is not None, 'need to fix conn!!'

In [28]:
create_table_cmds = [create_station_info, create_stops, create_weather,  create_route]

for cmd in create_table_cmds:
    execute_command(conn, cmd)

In [29]:
# Insert all station facts into station info table
update_table(conn, insert_into_station_info, './data/facts/geo_stations_info.csv')

# Insert route with the coordinates into route table
update_table(conn, insert_into_route, './data/facts/NE_regional_lonlat.csv')

In [30]:
years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]

begin_everything = time.time()

# Insert all train data into arrival and departure data tables
for year in years:
    start = time.time()
    arrive_csv = './data/trains/arrive_{}_processed.csv'.format(year)
    depart_csv = './data/trains/depart_{}_processed.csv'.format(year)
    update_trains(conn, insert_into_stops, 'Arrival', arrive_csv)
    update_trains(conn, insert_into_stops, 'Departure', depart_csv)
    print('Finished adding year', year, 'to database in', time.time() - start, 'seconds')
print('COMPLETE in', time.time() - begin_everything)

Finished adding year 2011 to database in 5.664422988891602 seconds
Finished adding year 2012 to database in 5.518220901489258 seconds
Finished adding year 2013 to database in 5.930235147476196 seconds
Finished adding year 2014 to database in 6.607802867889404 seconds
Finished adding year 2015 to database in 6.864720821380615 seconds
Finished adding year 2016 to database in 7.184977054595947 seconds
Finished adding year 2017 to database in 7.186160326004028 seconds
Finished adding year 2018 to database in 7.105131149291992 seconds
Finished adding year 2019 to database in 7.373771905899048 seconds
Finished adding year 2020 to database in 5.144702911376953 seconds
Finished adding year 2021 to database in 2.664069890975952 seconds
COMPLETE in 67.24544787406921


In [31]:
location_names_for_files = ['Boston_MA', 'Providence_RI', 'Kingston_RI', 'Westerly_RI', 'Mystic_CT',
                            'New_London_CT', 'Old_Saybrook_CT', 'New_Haven_CT', 'Bridgeport_CT', 
                            'Stamford_CT', 'New_Rochelle_NY', 'Manhattan_NY', 'Newark_NJ', 'Iselin_NJ', 
                            'Trenton_NJ', 'Philadelphia_PA', 'Wilmington_DE','Aberdeen_MD', 'Baltimore_MD',
                            'Baltimore_BWI_Airport_MD', 'New_Carrollton_MD', 'Washington_DC']

years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]

# Insert all weather data into the weather data table
begin_everything = time.time()
for location in location_names_for_files:
    start = time.time()
    for year in years:
        weather_csv = './data/weather/{}_weather_subset_{}.csv'.format(location, year)
        update_table(conn, insert_into_weather, weather_csv)
    print('Finished adding location', location, 'to database in', time.time() - start, 'seconds')
print("COMPLETE in", time.time() - begin_everything)

Finished adding location Boston_MA to database in 2.529453992843628 seconds
Finished adding location Providence_RI to database in 2.552884817123413 seconds
Finished adding location Kingston_RI to database in 2.551476001739502 seconds
Finished adding location Westerly_RI to database in 2.4911279678344727 seconds
Finished adding location Mystic_CT to database in 2.4239580631256104 seconds
Finished adding location New_London_CT to database in 2.4675650596618652 seconds
Finished adding location Old_Saybrook_CT to database in 2.4762649536132812 seconds
Finished adding location New_Haven_CT to database in 2.4946210384368896 seconds
Finished adding location Bridgeport_CT to database in 2.4947707653045654 seconds
Finished adding location Stamford_CT to database in 2.5036308765411377 seconds
Finished adding location New_Rochelle_NY to database in 2.498610734939575 seconds
Finished adding location Manhattan_NY to database in 2.4717252254486084 seconds
Finished adding location Newark_NJ to databa

In [32]:
create_dates_trains = """
                      DROP TABLE IF EXISTS dates_trains CASCADE;
                      CREATE TABLE dates_trains AS SELECT DISTINCT
                          origin_date,
                          train_num
                      FROM
                          stops;

                      ALTER TABLE dates_trains
                          ADD COLUMN trip_id SERIAL PRIMARY KEY;
                      """

In [33]:
execute_command(conn, create_dates_trains)

In [34]:
join_data = """
            CREATE TABLE full_joined AS
            SELECT
                *
            FROM
                stops s
                INNER JOIN (
                    SELECT
                        station_code AS si_station_code,
                        amtrak_station_name,
                        crew_change,
                        weather_location_name,
                        longitude,
                        latitude,
                        nb_next_station,
                        sb_next_station,
                        nb_mile,
                        sb_mile,
                        nb_stop_num,
                        sb_stop_num,
                        nb_miles_to_next,
                        sb_miles_to_next
                    FROM
                        station_info) si ON s.station_code = si.si_station_code
                INNER JOIN weather_hourly wh ON wh.location = si.weather_location_name
                    AND DATE_TRUNC('hour', s.full_act_arr_dep_datetime) = wh.obs_datetime
            ORDER BY
                s.full_sched_arr_dep_datetime;
            """

alter_joined_table = """
                      ALTER TABLE full_joined
                          DROP COLUMN location,
                          DROP COLUMN obs_datetime,
                          DROP COLUMN weather_id,
                          DROP COLUMN weather_location_name,
                          DROP COLUMN longitude,
                          DROP COLUMN latitude;
                     """
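
The weather join above matches each stop to the hourly observation for its station's weather location by truncating the actual arrival/departure time to the top of the hour. Because both joins are inner joins, a stop with no matching observation (or no actual time) would not appear in `full_joined`. The same matching rule, sketched in Python with hypothetical placeholder observations and an assumed location name:

```python
from datetime import datetime

def weather_key(location, actual):
    """Key matching a stop to its hourly observation: (location, top of the hour)."""
    return location, actual.replace(minute=0, second=0, microsecond=0)

# Placeholder observations keyed like (location, obs_datetime) in weather_hourly
observations = {
    ("Washington_DC", datetime(2021, 6, 11, 12, 0)): "noon observation",
    ("Washington_DC", datetime(2021, 6, 11, 13, 0)): "1 p.m. observation",
}

# Train 176 actually departed WAS at 12:27 on 2021-06-11
actual_departure = datetime(2021, 6, 11, 12, 27)
matched = observations[weather_key("Washington_DC", actual_departure)]
print(matched)
```

Truncation rounds down, so a 12:59 departure is paired with the 12:00 observation rather than the nearer 13:00 one.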

In [35]:
execute_command(conn, join_data)

In [36]:
execute_command(conn, alter_joined_table)

### Remove duplicate rows
* There are 393 duplicate entries in the dataset, identified as rows that share the same (`origin_date`, `train_num`, `station_code`, `arrival_or_departure`) tuple. How they ended up there is unclear. For each such tuple, only the row with the lowest `stop_id` is kept.

In [37]:
remove_duplicates = """
                    DELETE 
                    FROM full_joined
                    WHERE full_joined.stop_id IN 
                    (
                        SELECT fj_stop_id
                        FROM(
                            SELECT 
                                *, 
                                fj.stop_id AS fj_stop_id,
                                row_number() OVER (PARTITION BY origin_date, train_num, station_code, arrival_or_departure ORDER BY stop_id) 
                            FROM full_joined fj
                        ) s
                        WHERE row_number >= 2
                    );
                    """

In [38]:
execute_command(conn, remove_duplicates)
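
The keep-the-lowest-`stop_id` rule can also be written without a window function. The sketch below is an alternative formulation, not the query used above, and runs against an in-memory `sqlite3` table with a reduced, hypothetical set of columns and rows.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE full_joined (stop_id INTEGER, origin_date TEXT, train_num TEXT,"
    " station_code TEXT, arrival_or_departure TEXT)"
)
demo.executemany(
    "INSERT INTO full_joined VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2021-06-11", "176", "WAS", "Departure"),
        (2, "2021-06-11", "176", "WAS", "Departure"),  # same key, higher stop_id
        (3, "2021-06-11", "176", "NYP", "Arrival"),
    ],
)

# Keep only the lowest stop_id within each key tuple
demo.execute(
    """
    DELETE FROM full_joined
    WHERE stop_id NOT IN (
        SELECT MIN(stop_id)
        FROM full_joined
        GROUP BY origin_date, train_num, station_code, arrival_or_departure
    )
    """
)
demo.commit()
remaining = [row[0] for row in demo.execute("SELECT stop_id FROM full_joined ORDER BY stop_id")]
print(remaining)
```

Counting rows before and after such a delete is a simple way to confirm the number of duplicates removed.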