# Data Retrieval and Database Loading Notebook

# Part 1 - Amtrak Northeast Regional Train Data
* This project would not be possible without the diligent joint effort by [Chris Juckins](https://juckins.net/index.php) and [John Bobinyec](http://dixielandsoftware.net/Amtrak/status/StatusMaps/) to collect and preserve Amtrak's on-time performance records. Chris Juckins' archive of timetables was another invaluable resource which enabled me to sort through the trains and stations I chose to use in this project.
* The train data is sourced from [Amtrak Status Maps Archive Database (ASMAD)](https://juckins.net/amtrak_status/archive/html/home.php), and has been retrieved with Chris' permission.

### Overview of the Process
* Functions were written to scrape the HTML table returned from the search query and to process each column to the desired format
* Additional columns were also added during processing to aid in joining the train data with weather data
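
As a rough illustration of the scraping step, the sketch below turns an HTML table into a list of row dictionaries. It uses the standard library's `html.parser` so that it runs without extra dependencies, whereas this notebook imports `lxml.html`. The `TableRows` class and the sample markup are hypothetical; they are not taken from `trains_retrieve_and_process_data` or from the real ASMAD page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up markup; the real page layout is not reproduced here.
SAMPLE_HTML = """
<table>
  <tr><th>Train #</th><th>Sch Ar</th><th>Act Ar</th></tr>
  <tr><td>93</td><td>06/16/2021 5:21 PM (We)</td><td>5:28PM</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)` to obtain a table like the raw DataFrames shown later.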

### Setup

In [1]:
import time
import requests
import re
import lxml.html as lh
import pandas as pd
import numpy as np
from datetime import date, timedelta
from trains_retrieve_and_process_data import * 

### Retrieve HTML table data and recreate as a Pandas DataFrame
* By default, data is collected for the previous day. Run the retrieval after 5 am; ASMAD updates around 4 am, so an earlier run retrieves no data.
* Both arrival and departure data are collected and stored in a dictionary that is further indexed by station
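
The "previous day" default could be computed along these lines. This is only a sketch: `default_date_range` and `UPDATE_HOUR` are hypothetical names, and the check uses the local clock without any time-zone handling.

```python
from datetime import datetime, timedelta

UPDATE_HOUR = 5  # assumption: a safe hour after the ~4 am archive update


def default_date_range(now=None):
    """Return (start, end) covering only the previous calendar day."""
    now = now or datetime.now()
    if now.hour < UPDATE_HOUR:
        raise RuntimeError("Too early: yesterday's data may not be archived yet")
    yesterday = now.date() - timedelta(days=1)
    return yesterday, yesterday


# Example: a run on the morning of 2021-06-17 targets 2021-06-16.
start, end = default_date_range(datetime(2021, 6, 17, 9, 0))
print(start, end)
```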

In [2]:
start = date(2021,6,16)
end = date(2021,6,16)

In [3]:
raw_data = retrieve_data(start=start, end=end)

Complete in 19.47046995162964 seconds


In [4]:
depart =  raw_data_to_raw_df(raw_data, 'Depart')
print(depart.shape[0])
depart.tail()

STATION:   EWR  (Depart) - Train # Group: [67, 83, 93, 95, 99, 135, 65, 149, 169, 177] | No data for time period, or an error occurred during data retrieval.
STATION:   ABE  (Depart) - Train # Group: [66, 82, 86, 88, 94, 132, 96, 176, 178, 190, 194] | No data for time period, or an error occurred during data retrieval.
STATION:   ABE  (Depart) - Train # Group: [150, 160, 162, 164, 166, 168, 170, 172, 174] | No data for time period, or an error occurred during data retrieval.
STATION:   ABE  (Depart) - Train # Group: [67, 83, 93, 95, 99, 135, 65, 149, 169, 177] | No data for time period, or an error occurred during data retrieval.
295


Unnamed: 0,Direction,Station,Train #,Origin Date,Sch Dp,Act Dp,Comments,Service Disruption,Cancellations
290,Northbound,WAS,94,06/16/2021 (We),06/16/2021 1:55 PM (We),1:55PM,Ar: 17 min early. | Dp: On time.,,
291,Northbound,WAS,66,06/16/2021 (We),06/16/2021 10:00 PM (We),10:00PM,Ar: 15 min early. | Dp: On time.,,
292,Northbound,WAS,170,06/16/2021 (We),06/16/2021 4:45 AM (We),4:45AM,Dp: On time.,,
293,Northbound,WAS,172,06/16/2021 (We),06/16/2021 7:05 AM (We),7:05AM,Dp: On time.,,
294,Northbound,WAS,174,06/16/2021 (We),06/16/2021 10:01 AM (We),10:01AM,Ar: 11 min early. | Dp: On time.,,


In [5]:
arrive = raw_data_to_raw_df(raw_data, 'Arrive')
print(arrive.shape[0])
arrive.tail()

57


Unnamed: 0,Direction,Station,Train #,Origin Date,Sch Ar,Act Ar,Comments,Service Disruption,Cancellations
52,Southbound,WAS,93,06/16/2021 (We),06/16/2021 5:21 PM (We),5:28PM,Ar: 7 min late. | Dp: 6 min late.,,
53,Southbound,WAS,171,06/16/2021 (We),06/16/2021 4:20 PM (We),4:17PM,Ar: 3 min early. | Dp: 9 min late.,,
54,Southbound,WAS,173,06/16/2021 (We),06/16/2021 7:13 PM (We),7:38PM,Ar: 25 min late.,,
55,Southbound,WAS,137,06/16/2021 (We),06/16/2021 9:55 PM (We),9:59PM,Ar: 4 min late.,,
56,Southbound,WAS,175,06/16/2021 (We),06/16/2021 11:19 PM (We),11:18PM,Ar: 1 min early.,,


### Save the raw DF to disk

In [6]:
arrive_filestring = './data/trains_raw/arrive_raw_{}_{}.csv'.format(str(start), str(end))
depart_filestring = './data/trains_raw/depart_raw_{}_{}.csv'.format(str(start), str(end))

# Note: pandas 1.5+ spells this keyword `lineterminator`; older releases used `line_terminator`.
arrive.to_csv(arrive_filestring, lineterminator='\n', index=False)
depart.to_csv(depart_filestring, lineterminator='\n', index=False)

### Process the raw DF with modifications/additions 
* Modifications to the data:
    * Separate the Origin Date and Origin Week Day into two columns
    * Add separate columns for Origin Year and Origin Month
    * Separate the Scheduled Arrival/Departure Date, Scheduled Arrival/Departure Week Day, and Scheduled Arrival/Departure Time into three separate columns
    * Calculate the time difference, in minutes, between Scheduled and Actual Arrival/Departure (positive when late)
    * Convert Service Disruption and Cancellation column text flags to binary indicator columns
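
To make the date splitting and the time-difference calculation concrete, here is a minimal sketch that works on the raw string formats shown in the output above (`06/16/2021 5:21 PM (We)` and `5:28PM`). The function names are hypothetical and this is not the actual `process_columns` implementation. The midnight handling is an assumption: the actual time is placed on whichever day puts it within 12 hours of the schedule.

```python
import re
from datetime import datetime, timedelta


def split_scheduled(raw):
    """Split '06/16/2021 5:21 PM (We)' into (datetime, full weekday name)."""
    stamp = re.sub(r"\s*\(\w+\)\s*$", "", raw)  # drop the '(We)' suffix
    sched = datetime.strptime(stamp, "%m/%d/%Y %I:%M %p")
    return sched, sched.strftime("%A")


def minutes_from_schedule(sched, actual_raw):
    """Return (actual datetime, signed difference in minutes; positive = late)."""
    actual_time = datetime.strptime(actual_raw, "%I:%M%p").time()
    actual = datetime.combine(sched.date(), actual_time)
    # Assumption: a gap of more than 12 hours means the actual time
    # belongs to the neighbouring calendar day.
    if actual - sched > timedelta(hours=12):
        actual -= timedelta(days=1)
    elif sched - actual > timedelta(hours=12):
        actual += timedelta(days=1)
    return actual, int((actual - sched).total_seconds() // 60)


sched, weekday = split_scheduled("06/16/2021 5:21 PM (We)")
actual, diff = minutes_from_schedule(sched, "5:28PM")
print(weekday, diff)  # Wednesday 7
```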
    
    

In [7]:
full_arrive = process_columns(arrive, 'Arrive')
full_arrive.head()

Unnamed: 0,Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,Origin Week Day,Full Sch Ar Date,Sch Ar Date,Sch Ar Day,Sch Ar Time,Act Ar Time,Full Act Ar Date,Arrive Diff,Service Disruption,Cancellations
0,66,NYP,Northbound,2021-06-15,2021,6,Tuesday,2021-06-16 01:55:00,2021-06-16,Wednesday,01:55:00,01:52:00,2021-06-16 01:52:00,-3,0,0
1,176,NYP,Northbound,2021-06-16,2021,6,Wednesday,2021-06-16 15:20:00,2021-06-16,Wednesday,15:20:00,15:15:00,2021-06-16 15:15:00,-5,0,0
2,94,NYP,Northbound,2021-06-16,2021,6,Wednesday,2021-06-16 17:22:00,2021-06-16,Wednesday,17:22:00,17:22:00,2021-06-16 17:22:00,0,0,0
3,170,NYP,Northbound,2021-06-16,2021,6,Wednesday,2021-06-16 08:15:00,2021-06-16,Wednesday,08:15:00,08:24:00,2021-06-16 08:24:00,9,0,0
4,172,NYP,Northbound,2021-06-16,2021,6,Wednesday,2021-06-16 10:44:00,2021-06-16,Wednesday,10:44:00,10:50:00,2021-06-16 10:50:00,6,0,0


In [8]:
full_depart = process_columns(depart, "Depart")
full_depart.head()

Unnamed: 0,Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,Origin Week Day,Full Sch Dp Date,Sch Dp Date,Sch Dp Day,Sch Dp Time,Act Dp Time,Full Act Dp Date,Depart Diff,Service Disruption,Cancellations
0,95,BOS,Southbound,2021-06-16,2021,6,Wednesday,2021-06-16 06:10:00,2021-06-16,Wednesday,06:10:00,06:10:00,2021-06-16 06:10:00,0,0,0
1,93,BOS,Southbound,2021-06-16,2021,6,Wednesday,2021-06-16 09:30:00,2021-06-16,Wednesday,09:30:00,09:30:00,2021-06-16 09:30:00,0,0,0
2,177,BOS,Southbound,2021-06-16,2021,6,Wednesday,2021-06-16 17:35:00,2021-06-16,Wednesday,17:35:00,17:36:00,2021-06-16 17:36:00,1,0,0
3,67,BOS,Southbound,2021-06-16,2021,6,Wednesday,2021-06-16 21:30:00,2021-06-16,Wednesday,21:30:00,21:30:00,2021-06-16 21:30:00,0,0,0
4,171,BOS,Southbound,2021-06-16,2021,6,Wednesday,2021-06-16 08:15:00,2021-06-16,Wednesday,08:15:00,08:15:00,2021-06-16 08:15:00,0,0,0


### For new 2021 data, concatenate with previously retrieved and processed data from this year

In [9]:
arrive_filestring2021 = './data/trains/arrive_2021_processed.csv'
depart_filestring2021 = './data/trains/depart_2021_processed.csv'
        
prev_arrive2021 = pd.read_csv(arrive_filestring2021)
prev_depart2021 = pd.read_csv(depart_filestring2021)

In [10]:
new_arrive2021 = pd.concat([prev_arrive2021, full_arrive], ignore_index=True, axis=0)
new_depart2021 = pd.concat([prev_depart2021, full_depart], ignore_index=True, axis=0)

In [11]:
new_arrive2021.shape[0]

9293

In [12]:
new_depart2021.shape[0]

48982

### Drop duplicate rows (if any)

In [13]:
new_arrive2021.drop_duplicates(inplace = True, ignore_index = True)
new_arrive2021.shape[0]

9248

In [14]:
new_depart2021.drop_duplicates(inplace = True, ignore_index = True)
new_depart2021.shape[0]

48739

In [15]:
new_arrive2021.head()

Unnamed: 0,Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,Origin Week Day,Full Sch Ar Date,Sch Ar Date,Sch Ar Day,Sch Ar Time,Act Ar Time,Full Act Ar Date,Arrive Diff,Service Disruption,Cancellations
0,82,NYP,Northbound,2021-01-01,2021,1,Friday,2021-01-01 13:46:00,2021-01-01,Friday,13:46:00,13:45:00,2021-01-01 13:45:00,-1,0,0
1,88,NYP,Northbound,2021-01-01,2021,1,Friday,2021-01-01 14:46:00,2021-01-01,Friday,14:46:00,14:46:00,2021-01-01 14:46:00,0,0,0
2,66,NYP,Northbound,2021-01-01,2021,1,Friday,2021-01-02 01:25:00,2021-01-02,Saturday,01:25:00,01:14:00,2021-01-02 01:14:00,-11,0,0
3,82,NYP,Northbound,2021-01-02,2021,1,Saturday,2021-01-02 13:46:00,2021-01-02,Saturday,13:46:00,13:44:00,2021-01-02 13:44:00,-2,0,0
4,88,NYP,Northbound,2021-01-02,2021,1,Saturday,2021-01-02 14:46:00,2021-01-02,Saturday,14:46:00,14:50:00,2021-01-02 14:50:00,4,0,0


In [16]:
new_arrive2021.tail()

Unnamed: 0,Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,Origin Week Day,Full Sch Ar Date,Sch Ar Date,Sch Ar Day,Sch Ar Time,Act Ar Time,Full Act Ar Date,Arrive Diff,Service Disruption,Cancellations
9243,93,WAS,Southbound,2021-06-16 00:00:00,2021,6,Wednesday,2021-06-16 17:21:00,2021-06-16,Wednesday,17:21:00,17:28:00,2021-06-16 17:28:00,7,0,0
9244,171,WAS,Southbound,2021-06-16 00:00:00,2021,6,Wednesday,2021-06-16 16:20:00,2021-06-16,Wednesday,16:20:00,16:17:00,2021-06-16 16:17:00,-3,0,0
9245,173,WAS,Southbound,2021-06-16 00:00:00,2021,6,Wednesday,2021-06-16 19:13:00,2021-06-16,Wednesday,19:13:00,19:38:00,2021-06-16 19:38:00,25,0,0
9246,137,WAS,Southbound,2021-06-16 00:00:00,2021,6,Wednesday,2021-06-16 21:55:00,2021-06-16,Wednesday,21:55:00,21:59:00,2021-06-16 21:59:00,4,0,0
9247,175,WAS,Southbound,2021-06-16 00:00:00,2021,6,Wednesday,2021-06-16 23:19:00,2021-06-16,Wednesday,23:19:00,23:18:00,2021-06-16 23:18:00,-1,0,0
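
The `Origin Date` values in the tail above carry a `00:00:00` time component, while the rows at the head (read back from CSV) do not. One possible explanation, which is an assumption and has not been verified against the processing code, is that the concatenated column mixes strings and `Timestamp` objects. The toy example below shows why that matters for `drop_duplicates`: rows that differ only in representation are not treated as duplicates until the column is coerced to a single type.

```python
import pandas as pd

# Toy stand-ins: one row as read back from CSV (string date) and the
# same row freshly processed (Timestamp).
prev = pd.DataFrame(
    {"Train Num": [93], "Station": ["WAS"], "Origin Date": ["2021-06-16"]}
)
new = pd.DataFrame(
    {"Train Num": [93], "Station": ["WAS"], "Origin Date": [pd.Timestamp("2021-06-16")]}
)

combined = pd.concat([prev, new], ignore_index=True)
naive_count = len(combined.drop_duplicates())

# Coerce every value to a plain date before comparing rows.
combined["Origin Date"] = combined["Origin Date"].map(lambda v: pd.Timestamp(v).date())
clean_count = len(combined.drop_duplicates(ignore_index=True))

print(naive_count, clean_count)
```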


In [17]:
new_depart2021.head()

Unnamed: 0,Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,Origin Week Day,Full Sch Dp Date,Sch Dp Date,Sch Dp Day,Sch Dp Time,Act Dp Time,Full Act Dp Date,Depart Diff,Service Disruption,Cancellations
0,99,BOS,Southbound,2021-01-01,2021,1,Friday,2021-01-01 08:40:00,2021-01-01,Friday,08:40:00,08:40:00,2021-01-01 08:40:00,0,0,0
1,99,BOS,Southbound,2021-01-02,2021,1,Saturday,2021-01-02 08:40:00,2021-01-02,Saturday,08:40:00,08:40:00,2021-01-02 08:40:00,0,0,0
2,99,BOS,Southbound,2021-01-03,2021,1,Sunday,2021-01-03 08:40:00,2021-01-03,Sunday,08:40:00,08:40:00,2021-01-03 08:40:00,0,0,0
3,67,BOS,Southbound,2021-01-03,2021,1,Sunday,2021-01-03 21:30:00,2021-01-03,Sunday,21:30:00,21:30:00,2021-01-03 21:30:00,0,0,0
4,95,BOS,Southbound,2021-01-04,2021,1,Monday,2021-01-04 06:10:00,2021-01-04,Monday,06:10:00,06:11:00,2021-01-04 06:11:00,1,0,0


In [18]:
new_depart2021.tail()

Unnamed: 0,Train Num,Station,Direction,Origin Date,Origin Year,Origin Month,Origin Week Day,Full Sch Dp Date,Sch Dp Date,Sch Dp Day,Sch Dp Time,Act Dp Time,Full Act Dp Date,Depart Diff,Service Disruption,Cancellations
48734,94,WAS,Northbound,2021-06-16 00:00:00,2021,6,Wednesday,2021-06-16 13:55:00,2021-06-16,Wednesday,13:55:00,13:55:00,2021-06-16 13:55:00,0,0,0
48735,66,WAS,Northbound,2021-06-16 00:00:00,2021,6,Wednesday,2021-06-16 22:00:00,2021-06-16,Wednesday,22:00:00,22:00:00,2021-06-16 22:00:00,0,0,0
48736,170,WAS,Northbound,2021-06-16 00:00:00,2021,6,Wednesday,2021-06-16 04:45:00,2021-06-16,Wednesday,04:45:00,04:45:00,2021-06-16 04:45:00,0,0,0
48737,172,WAS,Northbound,2021-06-16 00:00:00,2021,6,Wednesday,2021-06-16 07:05:00,2021-06-16,Wednesday,07:05:00,07:05:00,2021-06-16 07:05:00,0,0,0
48738,174,WAS,Northbound,2021-06-16 00:00:00,2021,6,Wednesday,2021-06-16 10:01:00,2021-06-16,Wednesday,10:01:00,10:01:00,2021-06-16 10:01:00,0,0,0


In [19]:
# Note: pandas 1.5+ spells this keyword `lineterminator`; older releases used `line_terminator`.
new_arrive2021.to_csv(arrive_filestring2021, lineterminator='\n', index=False)
new_depart2021.to_csv(depart_filestring2021, lineterminator='\n', index=False)

# Part 2 - Visual Crossing Weather Data

### Setup

In [22]:
import requests
import os
import pandas as pd
import numpy as np
from datetime import date, timedelta

In [23]:
from weather_retrieve_and_process_data import *
assert os.environ.get('VC_TOKEN') is not None , 'empty token!'

### Retrieve unprocessed data

In [24]:
start = str(date(2021,6,16))
end = str(date.today()-timedelta(days=1))

In [25]:
successful_retrievals = retrieve_weather_data(start, end)

Retrieving data for LOCATION: Boston_MA
    and DATE RANGE: 2021-06-16T00:00:00 to 2021-06-16T23:59:00
Retrieving data for LOCATION: Providence_RI
    and DATE RANGE: 2021-06-16T00:00:00 to 2021-06-16T23:59:00
Retrieving data for LOCATION: Kingston_RI
    and DATE RANGE: 2021-06-16T00:00:00 to 2021-06-16T23:59:00
Retrieving data for LOCATION: Westerly_RI
    and DATE RANGE: 2021-06-16T00:00:00 to 2021-06-16T23:59:00
Retrieving data for LOCATION: Mystic_CT
    and DATE RANGE: 2021-06-16T00:00:00 to 2021-06-16T23:59:00
Retrieving data for LOCATION: New_London_CT
    and DATE RANGE: 2021-06-16T00:00:00 to 2021-06-16T23:59:00
Retrieving data for LOCATION: Old_Saybrook_CT
    and DATE RANGE: 2021-06-16T00:00:00 to 2021-06-16T23:59:00
Retrieving data for LOCATION: New_Haven_CT
    and DATE RANGE: 2021-06-16T00:00:00 to 2021-06-16T23:59:00
Retrieving data for LOCATION: Bridgeport_CT
    and DATE RANGE: 2021-06-16T00:00:00 to 2021-06-16T23:59:00
Retrieving data for LOCATION: Stamford_CT
    an

### Data Cleaning/Taking Subset of Columns

* Recent data is processed by year: new columns are added, minor string-format fixes are made, and a subset of the full column list is kept.
* The function processes only the files that were successfully created in the previous step.
* This step assumes that 2021 data is being read, and concatenates the previously retrieved data with the new data to create a single combined file.
* The output shows the fraction of the data kept. The data is almost always valid and complete ($> 99\%$ of the original data is retained).
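
The idea of keeping a column subset and reporting the fraction of rows retained can be sketched as follows. This is not the `process_weather_data` implementation: the `REQUIRED` rule, the `Wind Speed` column, its values, and the blanked temperature in the third row are all made up for illustration.

```python
import csv
import io

KEEP_COLUMNS = ["Address", "Date time", "Temperature",
                "Precipitation", "Cloud Cover", "Weather Type"]
REQUIRED = ["Date time", "Temperature"]  # assumption: rows missing these are dropped


def subset_rows(csv_text):
    """Return (kept rows restricted to KEEP_COLUMNS, fraction of rows kept)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0
    kept = []
    for row in reader:
        total += 1
        if all(row[col] != "" for col in REQUIRED):
            kept.append({col: row[col] for col in KEEP_COLUMNS})
    fraction = len(kept) / total if total else 0.0
    return kept, fraction


# Illustrative input only.
RAW_CSV = """\
Address,Date time,Temperature,Wind Speed,Precipitation,Cloud Cover,Weather Type
"Providence, RI",01/01/2021 00:00:00,28.2,3.4,0.0,0.0,
"Providence, RI",01/01/2021 01:00:00,27.2,4.1,0.0,0.0,
"Providence, RI",01/01/2021 02:00:00,,5.0,0.0,7.5,
"Providence, RI",01/01/2021 03:00:00,26.4,2.2,0.0,12.5,
"""

kept, fraction = subset_rows(RAW_CSV)
print("FRACTION KEPT:", fraction)
```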

In [26]:
process_weather_data(successful_retrievals)

Successfully processed and combined the following raw data files with previous data:
        FILE:          ./data/weather/Boston_MA_weather_subset_2021.csv
        FRACTION KEPT: 1.0
        FILE:          ./data/weather/Providence_RI_weather_subset_2021.csv
        FRACTION KEPT: 1.0
        FILE:          ./data/weather/Kingston_RI_weather_subset_2021.csv
        FRACTION KEPT: 1.0
        FILE:          ./data/weather/Westerly_RI_weather_subset_2021.csv
        FRACTION KEPT: 1.0
        FILE:          ./data/weather/Mystic_CT_weather_subset_2021.csv
        FRACTION KEPT: 1.0
        FILE:          ./data/weather/New_London_CT_weather_subset_2021.csv
        FRACTION KEPT: 1.0
        FILE:          ./data/weather/Old_Saybrook_CT_weather_subset_2021.csv
        FRACTION KEPT: 1.0
        FILE:          ./data/weather/New_Haven_CT_weather_subset_2021.csv
        FRACTION KEPT: 1.0
        FILE:          ./data/weather/Bridgeport_CT_weather_subset_2021.csv
        FRACTION KEPT: 1.0

### Data sample for viewing

In [27]:
sample = pd.read_csv('./data/weather/Providence_RI_weather_subset_2021.csv')
sample.head()

Unnamed: 0,Address,Date time,Temperature,Precipitation,Cloud Cover,Weather Type
0,"Providence, RI",01/01/2021 00:00:00,28.2,0.0,0.0,
1,"Providence, RI",01/01/2021 01:00:00,27.2,0.0,0.0,
2,"Providence, RI",01/01/2021 02:00:00,26.8,0.0,7.5,
3,"Providence, RI",01/01/2021 03:00:00,26.4,0.0,12.5,
4,"Providence, RI",01/01/2021 04:00:00,30.0,0.0,22.5,


In [29]:
sample.tail()

Unnamed: 0,Address,Date time,Temperature,Precipitation,Cloud Cover,Weather Type
4016,"Providence, RI",06/16/2021 19:00:00,71.7,0.0,15.6,
4017,"Providence, RI",06/16/2021 20:00:00,68.5,0.0,25.0,
4018,"Providence, RI",06/16/2021 21:00:00,64.7,0.0,17.1,
4019,"Providence, RI",06/16/2021 22:00:00,59.8,0.0,15.6,
4020,"Providence, RI",06/16/2021 23:00:00,58.4,0.0,19.3,


# Part 3a: Loading Data into Postgres Database
Schema pictured below:
![Database Schema](data/schema/Final_DB_Schema.pdf)

### Setup

In [30]:
import psycopg2
import csv
import os
import sys 
import time
assert os.environ.get('DB_PASS') is not None, 'empty password!'

#### Functions to create and update tables in the database

In [31]:
def execute_command(conn, command):
    """
    Execute specified command in PostgreSQL database.
    """
    try:
        cur = conn.cursor()
        cur.execute(command)
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        conn.rollback()

def update_table(conn, command, csv_file):
    """
    Insert rows from a CSV file into table specified by the command.

    Rows are committed once, after the whole file has been read. Because
    conn.rollback() undoes everything since the last commit, a failing row
    also discards the rows from this file that were executed before it.
    """
    cur = conn.cursor()
    with open(csv_file, newline='') as file:
        info_reader = csv.reader(file, delimiter=',')
        next(info_reader)  # Skip header
        for row in info_reader:
            try:
                cur.execute(command, tuple(row))
            except (Exception, psycopg2.DatabaseError) as error:
                print(error)
                conn.rollback()
        conn.commit()

def update_trains(conn, command, arr_or_dep, csv_file):
    """
    Insert rows from trains CSV file into table specified by the command.

    The same commit/rollback behavior as update_table applies.
    """
    cur = conn.cursor()
    with open(csv_file, newline='') as file:
        info_reader = csv.reader(file, delimiter=',')
        next(info_reader)  # Skip header
        for row in info_reader:
            try:
                # Prepend the 'Arrival'/'Departure' label to the CSV columns
                cur.execute(command, tuple([arr_or_dep] + row))
            except (Exception, psycopg2.DatabaseError) as error:
                print(error)
                conn.rollback()
        conn.commit()
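
Because the loaders above commit only once per file, a `rollback()` triggered by one bad row also undoes the rows executed before it. One way to skip just the failing row is a savepoint per row. The sketch below is an alternative under stated assumptions, not the method used in this notebook: `insert_rows_with_savepoints` is a hypothetical helper, and the demo runs against an in-memory SQLite database so it works without a server. The `SAVEPOINT` statements are also valid in PostgreSQL, where the insert command would use `%s` placeholders instead of `?`.

```python
import sqlite3


def insert_rows_with_savepoints(conn, command, rows):
    """Insert rows one by one; a failing row is undone without touching the rest."""
    cur = conn.cursor()
    skipped = 0
    for row in rows:
        cur.execute("SAVEPOINT row_sp")
        try:
            cur.execute(command, tuple(row))
        except Exception as error:  # broad catch, as in the functions above
            print(error)
            cur.execute("ROLLBACK TO SAVEPOINT row_sp")  # undo this row only
            skipped += 1
        finally:
            cur.execute("RELEASE SAVEPOINT row_sp")
    conn.commit()
    return skipped


# Demo with SQLite standing in for PostgreSQL (toy table and values).
demo = sqlite3.connect(":memory:", isolation_level=None)
demo.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, temp REAL NOT NULL)")
demo.execute("BEGIN")  # psycopg2 opens a transaction implicitly; SQLite is told to here
skipped = insert_rows_with_savepoints(
    demo,
    "INSERT INTO t (id, temp) VALUES (?, ?)",
    [(1, 28.2), (2, None), (3, 26.8)],  # the second row violates NOT NULL
)
stored = [r[0] for r in demo.execute("SELECT id FROM t ORDER BY id")]
print(skipped, stored)
```

The trade-off is two or three extra statements per row, so loading would likely be slower than the timings reported in this notebook.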

In [32]:
# Keyword arguments avoid quoting problems if the user name or password contains spaces or quotes
conn = psycopg2.connect(dbname='amtrakproject', user=os.environ.get('USER'), password=os.environ.get('DB_PASS'))
assert conn is not None, 'need to fix conn!!'

In [33]:
create_station_info = """ 
                      DROP TABLE IF EXISTS station_info CASCADE;
                      CREATE TABLE station_info (
                          station_code text UNIQUE PRIMARY KEY,
                          amtrak_station_name text,
                          crew_change boolean,
                          weather_location_name text,
                          longitude real,
                          latitude real,
                          nb_next_station text,
                          sb_next_station text,
                          nb_mile numeric,
                          sb_mile numeric,
                          nb_stop_num numeric,
                          sb_stop_num numeric,
                          nb_miles_to_next numeric,
                          sb_miles_to_next numeric
                      );
                      """

insert_into_station_info = """
                           INSERT INTO
                               station_info (
                                   station_code,
                                   amtrak_station_name,
                                   crew_change,
                                   weather_location_name,
                                   longitude,
                                   latitude,
                                   nb_next_station,
                                   sb_next_station,
                                   nb_mile,
                                   sb_mile,
                                   nb_stop_num,
                                   sb_stop_num,
                                   nb_miles_to_next,
                                   sb_miles_to_next

                             )
                         VALUES
                             (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                         ON CONFLICT DO NOTHING;
                         """  

In [34]:
create_stops = """
                 DROP TABLE IF EXISTS stops CASCADE;
                 CREATE TABLE stops (
                     stop_id SERIAL PRIMARY KEY,
                     arrival_or_departure text, 
                     train_num text,
                     station_code text REFERENCES station_info,
                     direction text,
                     origin_date date,
                     origin_year int,
                     origin_month int,
                     origin_week_day text,
                     full_sched_arr_dep_datetime timestamp,
                     sched_arr_dep_date date,
                     sched_arr_dep_week_day text,
                     sched_arr_dep_time time,
                     act_arr_dep_time time,
                     full_act_arr_dep_datetime timestamp,
                     timedelta_from_sched numeric,
                     service_disruption boolean,
                     cancellations boolean
                 );
               """

insert_into_stops = """
                    INSERT INTO
                        stops (
                            arrival_or_departure,
                            train_num,
                            station_code,
                            direction,
                            origin_date,
                            origin_year,
                            origin_month,
                            origin_week_day,
                            full_sched_arr_dep_datetime,
                            sched_arr_dep_date,
                            sched_arr_dep_week_day,
                            sched_arr_dep_time,
                            act_arr_dep_time,
                            full_act_arr_dep_datetime,
                            timedelta_from_sched,
                            service_disruption,
                            cancellations
                          )
                      VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                      ON CONFLICT DO NOTHING;
                      """

In [35]:
create_weather = """
                 DROP TABLE IF EXISTS weather_hourly CASCADE;
                 CREATE TABLE weather_hourly (
                     weather_id SERIAL PRIMARY KEY,
                     location text,
                     obs_datetime timestamp,
                     temperature real,
                     precipitation real,
                     cloud_cover real,
                     weather_type text
                 );
                 """

insert_into_weather = """
                      INSERT INTO
                          weather_hourly (
                              location,
                              obs_datetime,
                              temperature,
                              precipitation,
                              cloud_cover,
                              weather_type
                      )
                      VALUES
                          (%s, %s, %s, %s, %s, %s) 
                      ON CONFLICT DO NOTHING;
                      """ 

In [36]:
create_route = """
               DROP TABLE IF EXISTS regional_route CASCADE;

               CREATE TABLE regional_route (
                 coord_id SERIAL PRIMARY KEY,
                 longitude real,
                 latitude real,
                 path_group numeric,
                 connecting_path text, 
                 nb_station_group text,
                 sb_station_group text
               );
               """

insert_into_route = """
                    INSERT INTO
                      regional_route (
                          longitude,
                          latitude, 
                          path_group,
                          connecting_path,
                          nb_station_group,
                          sb_station_group
                      )
                    VALUES 
                        (%s, %s, %s, %s, %s, %s) 
                    ON CONFLICT DO NOTHING;
                    """

In [37]:
# Keyword arguments avoid quoting problems if the user name or password contains spaces or quotes
conn = psycopg2.connect(dbname='amtrakproject', user=os.environ.get('USER'), password=os.environ.get('DB_PASS'))
assert conn is not None, 'need to fix conn!!'

In [38]:
create_table_cmds = [create_station_info, create_stops, create_weather,  create_route]

for cmd in create_table_cmds:
    execute_command(conn, cmd)

In [39]:
# Insert all station facts into station info table
update_table(conn, insert_into_station_info, './data/facts/geo_stations_info.csv')

# Insert route with the coordinates into route table
update_table(conn, insert_into_route, './data/facts/NE_regional_lonlat.csv')

In [40]:
years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]

begin_everything = time.time()

# Insert all arrival and departure train data into the stops table
for year in years:
    start = time.time()
    arrive_csv = './data/trains/arrive_{}_processed.csv'.format(year)
    depart_csv = './data/trains/depart_{}_processed.csv'.format(year)
    update_trains(conn, insert_into_stops, 'Arrival', arrive_csv)
    update_trains(conn, insert_into_stops, 'Departure', depart_csv)
    print('Finished adding year', year, 'to database in', time.time() - start, 'seconds')
print('COMPLETE in', time.time() - begin_everything)

Finished adding year 2011 to database in 5.6925718784332275 seconds
Finished adding year 2012 to database in 5.75301718711853 seconds
Finished adding year 2013 to database in 5.979727029800415 seconds
Finished adding year 2014 to database in 6.977294921875 seconds
Finished adding year 2015 to database in 7.017122268676758 seconds
Finished adding year 2016 to database in 7.247180700302124 seconds
Finished adding year 2017 to database in 7.503597974777222 seconds
Finished adding year 2018 to database in 7.3149778842926025 seconds
Finished adding year 2019 to database in 7.789529800415039 seconds
Finished adding year 2020 to database in 5.3095738887786865 seconds
Finished adding year 2021 to database in 2.754067897796631 seconds
COMPLETE in 69.3400650024414


In [41]:
location_names_for_files = ['Boston_MA', 'Providence_RI', 'Kingston_RI', 'Westerly_RI', 'Mystic_CT',
                            'New_London_CT', 'Old_Saybrook_CT', 'New_Haven_CT', 'Bridgeport_CT', 
                            'Stamford_CT', 'New_Rochelle_NY', 'Manhattan_NY', 'Newark_NJ', 'Iselin_NJ', 
                            'Trenton_NJ', 'Philadelphia_PA', 'Wilmington_DE','Aberdeen_MD', 'Baltimore_MD',
                            'Baltimore_BWI_Airport_MD', 'New_Carrollton_MD', 'Washington_DC']

years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]

# Insert all weather data into the weather data table
begin_everything = time.time()
for location in location_names_for_files:
    start = time.time()
    for year in years:
        weather_csv = './data/weather/{}_weather_subset_{}.csv'.format(location, year)
        update_table(conn, insert_into_weather, weather_csv)
    print('Finished adding location', location, 'to database in', time.time() - start, 'seconds')
print("COMPLETE in", time.time() - begin_everything)

Finished adding location Boston_MA to database in 2.5462558269500732 seconds
Finished adding location Providence_RI to database in 2.5801069736480713 seconds
Finished adding location Kingston_RI to database in 2.52608585357666 seconds
Finished adding location Westerly_RI to database in 2.5474460124969482 seconds
Finished adding location Mystic_CT to database in 2.496647834777832 seconds
Finished adding location New_London_CT to database in 2.532504081726074 seconds
Finished adding location Old_Saybrook_CT to database in 2.553225040435791 seconds
Finished adding location New_Haven_CT to database in 2.462512969970703 seconds
Finished adding location Bridgeport_CT to database in 2.4934451580047607 seconds
Finished adding location Stamford_CT to database in 2.525826930999756 seconds
Finished adding location New_Rochelle_NY to database in 2.515416145324707 seconds
Finished adding location Manhattan_NY to database in 2.541264057159424 seconds
Finished adding location Newark_NJ to database in

In [47]:
join_data = """
            DROP TABLE IF EXISTS stops_joined;
            CREATE TABLE stops_joined AS
            SELECT
                *
            FROM
                stops s
                INNER JOIN (
                    SELECT
                        station_code AS si_station_code,
                        amtrak_station_name,
                        crew_change,
                        weather_location_name,
                        nb_stop_num,
                        sb_stop_num
                    FROM
                        station_info) si ON s.station_code = si.si_station_code
                INNER JOIN weather_hourly wh ON wh.location = si.weather_location_name
                    AND DATE_TRUNC('hour', s.full_act_arr_dep_datetime) = wh.obs_datetime
            ORDER BY
                s.full_sched_arr_dep_datetime;
            """

alter_joined_table = """
                      ALTER TABLE stops_joined
                          DROP COLUMN si_station_code,
                          DROP COLUMN amtrak_station_name,
                          DROP COLUMN location,
                          DROP COLUMN obs_datetime,
                          DROP COLUMN weather_id,
                          DROP COLUMN weather_location_name;
                     """

In [48]:
execute_command(conn, join_data)

In [49]:
execute_command(conn, alter_joined_table)

### Remove duplicate rows
* The dataset contains 393 duplicate entries of unknown origin, identified as rows that share the same (`origin_date`, `train_num`, `station_code`, `arrival_or_departure`) tuple. Within each group, only the row with the lowest `stop_id` is kept.
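
Before deleting anything, the number of surplus rows can be checked with a grouped count over the same four columns. The query below is a sketch: it is run here against a toy in-memory SQLite table so that the example is self-contained, and the sample rows are made up. It uses only standard SQL and should run unchanged against the PostgreSQL `stops_joined` table.

```python
import sqlite3

count_duplicates = """
    SELECT COALESCE(SUM(n - 1), 0)
    FROM (
        SELECT COUNT(*) AS n
        FROM stops_joined
        GROUP BY origin_date, train_num, station_code, arrival_or_departure
        HAVING COUNT(*) > 1
    ) AS dup_groups;
"""

# Toy table with the four key columns only.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE stops_joined ("
    "stop_id INTEGER PRIMARY KEY, origin_date TEXT, train_num TEXT, "
    "station_code TEXT, arrival_or_departure TEXT)"
)
demo.executemany(
    "INSERT INTO stops_joined "
    "(origin_date, train_num, station_code, arrival_or_departure) "
    "VALUES (?, ?, ?, ?)",
    [
        ("2021-06-16", "93", "WAS", "Arrival"),
        ("2021-06-16", "93", "WAS", "Arrival"),    # 1 surplus row
        ("2021-06-16", "93", "WAS", "Departure"),
        ("2021-06-16", "171", "WAS", "Arrival"),
        ("2021-06-16", "171", "WAS", "Arrival"),   # 2 surplus rows
        ("2021-06-16", "171", "WAS", "Arrival"),
    ],
)
extra_rows = demo.execute(count_duplicates).fetchone()[0]
print(extra_rows)  # 3
```

With `psycopg2`, the same string could be executed through a cursor and read with `fetchone()`; the `execute_command` helper defined earlier does not return results.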

In [50]:
remove_duplicates = """
                    DELETE 
                    FROM stops_joined
                    WHERE stops_joined.stop_id IN 
                    (
                        SELECT sj_stop_id
                        FROM(
                            SELECT 
                                *, 
                                sj.stop_id AS sj_stop_id,
                                row_number() OVER (PARTITION BY origin_date, train_num, station_code, arrival_or_departure ORDER BY stop_id) 
                            FROM stops_joined sj
                        ) s
                        WHERE row_number >= 2
                    );
                    """
execute_command(conn, remove_duplicates)

### Add condition columns based on weather values
* `cloud_level` describes sky conditions during the course of the hour
* `precip_level` describes amount of precipitation during the course of the hour
* `precip_type` describes the worst type of precipitation that occurred (whether rain, snow, hail, miscellaneous mild weather conditions, etc.)
* `season` describes the *weather* season more than the actual official season name
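
The `precip_type` rules applied by the SQL `CASE` expression in the next cell can be mirrored in plain Python, which makes the precedence easier to read: snow wins over rain, and an empty `weather_type` falls back to the precipitation amount. This is an illustrative restatement only, and the example weather strings are made up rather than taken from the Visual Crossing data.

```python
def precip_type(weather_type, precipitation):
    """Mirror of the SQL CASE expression used to set precip_type."""
    if weather_type is None:
        return None  # the UPDATE only touches rows where weather_type IS NOT NULL
    if "Snow" in weather_type:
        return "Snow"  # covers both snow-only and snow-with-rain
    if "Rain" in weather_type:
        return "Rain"
    if weather_type == "" and precipitation is not None:
        if precipitation > 0:
            return "Rain"  # unlabeled precipitation is treated as rain
        if precipitation == 0:
            return "None"
    return "Other"


# Hypothetical inputs.
for wt, p in [("Snow, Rain", 0.1), ("Light Rain", 0.2), ("", 0.05), ("", 0.0), ("Fog", 0.0)]:
    print(repr(wt), p, "->", precip_type(wt, p))
```

Like SQL `LIKE`, the substring checks are case-sensitive.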

In [51]:
add_precip_type = """ALTER TABLE stops_joined ADD COLUMN precip_type text;"""

set_precip_type  = """
                    UPDATE
                        stops_joined
                    SET
                        precip_type = (
                            CASE WHEN weather_type LIKE '%Snow%'
                                AND weather_type LIKE '%Rain%' THEN
                                'Snow'
                            WHEN weather_type LIKE '%Snow%'
                                AND weather_type NOT LIKE '%Rain%' THEN
                                'Snow'
                            WHEN weather_type LIKE '%Rain%'
                                AND weather_type NOT LIKE '%Snow%' THEN
                                'Rain'
                            WHEN weather_type = '' AND precipitation > 0 THEN
                                'Rain'
                            WHEN weather_type = '' AND precipitation = 0 THEN
                                'None'
                            ELSE
                                'Other'
                            END)
                    WHERE
                        weather_type IS NOT NULL;
                 """

execute_command(conn, add_precip_type)
execute_command(conn, set_precip_type)