### Testing data coverage issue
* About 4% of the records were missing after joining the tables, so I needed to find the cause
* The 2021 records turned out to be malformed because of a column revision in the 2021 weather data
* I used the following queries to export per-station, per-year row counts to CSV files:
    * For the original stops table (full data for each stop record):
    ```sql
    -- Row counts per station and year in the original stops table.
    -- GROUP BY already yields one row per group, so DISTINCT is not needed.
    SELECT
        station_code,
        origin_year,
        COUNT(*) AS count,
        sb_mile
    FROM
        stops s
        INNER JOIN (
            SELECT
                station_code AS si_station_code,
                sb_mile
            FROM
                station_info) si ON s.station_code = si.si_station_code
    GROUP BY
        station_code,
        sb_mile,
        origin_year
    ORDER BY
        sb_mile ASC,
        origin_year ASC;
    ```
    * For the merged table of records (stops and weather entries):
    ```sql
    -- Row counts per station and year in the merged stops and weather table.
    -- GROUP BY already yields one row per group, so DISTINCT is not needed.
    SELECT
        station_code,
        origin_year,
        COUNT(*) AS count,
        si.sb_mile
    FROM
        full_joined s
        INNER JOIN (
            SELECT
                station_code AS si_station_code,
                sb_mile
            FROM
                station_info) si ON s.station_code = si.si_station_code
    GROUP BY
        station_code,
        si.sb_mile,
        origin_year
    ORDER BY
        si.sb_mile ASC,
        origin_year ASC;
    ```
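
As a rough check on the "about 4%" figure, the per-station, per-year counts from the two exports can be totalled and compared. The sketch below runs on small made-up frames rather than the real CSVs. The column names `station_code`, `origin_year`, and `count` follow the queries above, while `missing_share` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

def missing_share(before: pd.DataFrame, after: pd.DataFrame) -> float:
    """Net fraction of rows in `before` that are absent from `after`, by total count."""
    total_before = before['count'].sum()
    total_after = after['count'].sum()
    return (total_before - total_after) / total_before

# Made-up counts for illustration only (not the real exports)
stops_counts = pd.DataFrame({
    'station_code': ['AAA', 'AAA', 'BBB', 'BBB'],
    'origin_year': [2020, 2021, 2020, 2021],
    'count': [1000, 1000, 500, 500],
})
joined_counts = stops_counts.assign(count=[1000, 900, 500, 480])

print(f"{missing_share(stops_counts, joined_counts):.1%}")
```

Note that this is a net figure: duplicate rows in the joined table would offset missing rows, so it can understate the real gap.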

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_stops = pd.read_csv('./csv/export_counts_from_stops.csv')
df_joined = pd.read_csv('./csv/export_counts_from_joined.csv')

In [12]:
df_stops['count'].loc[df_stops['origin_year'] == 2021]

10     2318
21     2321
32     2317
43     2318
54     2318
65     1428
76     1079
87     2313
98     1887
109    4663
120    2104
131    2314
142    1995
153    4440
164    2106
175    1279
186    1810
197    2005
208    4254
219    2112
230     261
241    2108
252    2108
263    1982
274    2113
Name: count, dtype: int64

In [13]:
df_joined['count'].loc[df_joined['origin_year'] == 2021]

10     16
21     16
32     15
43     15
54     15
65      9
76      7
87     15
98     12
109    30
120    13
131    15
142    12
153    28
164    13
175     7
186    11
197    12
208    26
219    12
230     1
241    13
252    13
263    12
274    13
Name: count, dtype: int64

### There are also some duplicate entries, which I later removed
* They appear as negative differences below, meaning the joined table has more rows than the stops table for that station and year

In [15]:
for i in range(df_stops.shape[0]):
    # Positive: rows missing from the joined table; negative: extra (duplicate) rows in it
    diff = df_stops.loc[i, 'count'] - df_joined.loc[i, 'count']
    if diff != 0:
        print(df_stops.loc[i, 'station_code'], df_stops.loc[i, 'origin_year'], diff)

BOS 2020 2
BOS 2021 2302
BBY 2020 2
BBY 2021 2305
RTE 2020 2
RTE 2021 2302
PVD 2020 2
PVD 2021 2303
KIN 2020 2
KIN 2021 2303
WLY 2020 1
WLY 2021 1419
MYS 2020 2
MYS 2021 1072
NLC 2020 2
NLC 2021 2298
OSB 2020 1
OSB 2021 1875
NHV 2020 4
NHV 2021 4633
BRP 2021 2091
STM 2013 -1
STM 2014 -1
STM 2015 -1
STM 2017 -1
STM 2018 -1
STM 2019 -1
STM 2020 2
STM 2021 2299
NRO 2021 1983
NYP 2011 3
NYP 2013 -1
NYP 2014 -2
NYP 2015 -1
NYP 2016 -1
NYP 2018 -2
NYP 2019 -1
NYP 2020 3
NYP 2021 4412
NWK 2011 -1
NWK 2013 -1
NWK 2014 -1
NWK 2015 -1
NWK 2016 -1
NWK 2017 -1
NWK 2019 -1
NWK 2020 1
NWK 2021 2093
EWR 2021 1272
MET 2011 -1
MET 2013 -1
MET 2014 -1
MET 2015 -1
MET 2016 -1
MET 2017 -1
MET 2018 -2
MET 2020 2
MET 2021 1799
TRE 2012 -1
TRE 2017 -1
TRE 2019 -1
TRE 2020 1
TRE 2021 1993
PHL 2018 -2
PHL 2019 -2
PHL 2020 2
PHL 2021 4228
WIL 2012 -1
WIL 2014 -1
WIL 2017 -1
WIL 2020 1
WIL 2021 2100
ABE 2021 260
BAL 2013 -1
BAL 2015 -1
BAL 2016 -1
BAL 2017 -1
BAL 2019 -1
BAL 2020 1
BAL 2021 2095
BWI 2016 -1
BWI 
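
The loop above compares the two tables row by row, which assumes both CSVs list the same station and year combinations in the same order. A minimal sketch of a key-based comparison that drops that assumption is shown here. It assumes the same three columns as the exports; `count_diff` is a hypothetical helper and the frames are made-up data.

```python
import pandas as pd

def count_diff(stops: pd.DataFrame, joined: pd.DataFrame) -> pd.DataFrame:
    """Per-(station, year) count difference, aligned on keys rather than row position."""
    keys = ['station_code', 'origin_year']
    merged = stops[keys + ['count']].merge(
        joined[keys + ['count']],
        on=keys,
        how='outer',
        suffixes=('_stops', '_joined'),
    )
    # A group present in only one table counts as zero rows in the other
    merged[['count_stops', 'count_joined']] = (
        merged[['count_stops', 'count_joined']].fillna(0).astype(int)
    )
    merged['diff'] = merged['count_stops'] - merged['count_joined']
    nonzero = merged.loc[merged['diff'] != 0, keys + ['diff']]
    return nonzero.sort_values(keys).reset_index(drop=True)

# Made-up counts; the second frame is deliberately in a different row order
stops_demo = pd.DataFrame({
    'station_code': ['AAA', 'AAA', 'BBB'],
    'origin_year': [2020, 2021, 2020],
    'count': [10, 20, 5],
})
joined_demo = pd.DataFrame({
    'station_code': ['BBB', 'AAA', 'AAA'],
    'origin_year': [2020, 2020, 2021],
    'count': [6, 10, 3],
})

result = count_diff(stops_demo, joined_demo)
print(result)
```

As in the loop, a positive `diff` means rows are missing from the joined table and a negative `diff` means it has extra rows.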

### The fixed results

In [None]:
df_fixed = pd.read_csv('./csv/export_counts_from_joined_fixed.csv')

In [15]:
for i in range(df_stops.shape[0]):
    print(df_stops.loc[i, 'station_code'], df_stops.loc[i, 'origin_year'], np.abs(df_stops.loc[i, 'count'] - df_fixed.loc[i, 'count']))

BOS 2011 0
BOS 2012 0
BOS 2013 0
BOS 2014 0
BOS 2015 0
BOS 2016 0
BOS 2017 0
BOS 2018 0
BOS 2019 0
BOS 2020 0
BOS 2021 13
BBY 2011 0
BBY 2012 0
BBY 2013 0
BBY 2014 0
BBY 2015 0
BBY 2016 0
BBY 2017 0
BBY 2018 0
BBY 2019 0
BBY 2020 0
BBY 2021 13
RTE 2011 0
RTE 2012 0
RTE 2013 0
RTE 2014 0
RTE 2015 0
RTE 2016 0
RTE 2017 0
RTE 2018 0
RTE 2019 0
RTE 2020 0
RTE 2021 12
PVD 2011 0
PVD 2012 0
PVD 2013 0
PVD 2014 0
PVD 2015 0
PVD 2016 0
PVD 2017 0
PVD 2018 0
PVD 2019 0
PVD 2020 0
PVD 2021 12
KIN 2011 0
KIN 2012 0
KIN 2013 0
KIN 2014 0
KIN 2015 0
KIN 2016 0
KIN 2017 0
KIN 2018 0
KIN 2019 0
KIN 2020 0
KIN 2021 12
WLY 2011 0
WLY 2012 0
WLY 2013 0
WLY 2014 0
WLY 2015 0
WLY 2016 0
WLY 2017 0
WLY 2018 0
WLY 2019 0
WLY 2020 0
WLY 2021 8
MYS 2011 0
MYS 2012 0
MYS 2013 0
MYS 2014 0
MYS 2015 0
MYS 2016 0
MYS 2017 0
MYS 2018 0
MYS 2019 0
MYS 2020 0
MYS 2021 5
NLC 2011 0
NLC 2012 0
NLC 2013 0
NLC 2014 0
NLC 2015 0
NLC 2016 0
NLC 2017 0
NLC 2018 0
NLC 2019 0
NLC 2020 0
NLC 2021 12
OSB 2011 0
OSB 2012 0
OSB 

### Sum the number of duplicate rows
* 393 duplicate rows were removed in total, which does not match the sum below. Either some records had more than one duplicate, or the sum is also affected by stops for which no weather record is available

In [16]:
dupes = 0
for i in range(df_stops.shape[0]):
    # The absolute difference counts both extra and missing rows in the fixed table
    dupes += np.abs(df_stops.loc[i, 'count'] - df_fixed.loc[i, 'count'])
print(dupes)

344
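
Because the absolute value is taken, the sum above mixes two cases: groups where the fixed join has fewer rows than the stops table, and groups where it has more. One way to tell them apart is sketched below on made-up counts. `split_diff` is a hypothetical helper, and it assumes the two count columns are already aligned row for row.

```python
from typing import Tuple

import pandas as pd

def split_diff(stops_count: pd.Series, joined_count: pd.Series) -> Tuple[int, int]:
    """Return (missing, surplus) row totals between two aligned count columns."""
    diff = stops_count - joined_count
    missing = int(diff.clip(lower=0).sum())     # stops rows with no match in the join
    surplus = int((-diff).clip(lower=0).sum())  # extra rows in the join, e.g. duplicates
    return missing, surplus

# Made-up counts for illustration only
stops_count = pd.Series([10, 20, 5, 8])
joined_count = pd.Series([10, 15, 6, 10])

missing, surplus = split_diff(stops_count, joined_count)
print(missing, surplus)
```

The two totals add up to the absolute-difference sum, so reporting them separately shows how much of that sum comes from missing rows rather than surplus ones.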
