# Preliminary Analysis

## Data Cleaning Code
This section contains the code used to clean and process the two datasets, preceded by a data dictionary for each transformed dataset.

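- Data Dictionary for Air Quality
    - **indicator id:** numeric id of the indicator described by `name`
    - **name:** what is being measured, either a pollutant or a related health outcome (e.g. Nitrogen Dioxide (NO2), PM2.5-Attributable Deaths)
    - **measure:** the type of measurement used (e.g. Mean, Estimated Annual Rate)
    - **measure info:** unit of measurement (e.g. ppb, per 100,000 adults)
    - **geo type name:** type of geographic area in NYC (e.g. Borough, CD, UHF42)
    - **geo place name:** name of the area in NYC
    - **time period:** label of the period the value covers (e.g. Summer 2013, Annual Average 2011)
    - **start_date:** date on which the time period starts
    - **data value:** the measured value, in the unit given by `measure info`
    <br><br>
- Data Dictionary for Traffic Volume
    - **requestId:** id for each traffic count request
    - **boro:** borough of NYC where the count was taken
    - **vol:** traffic volume counted at `date_time`
    - **segmentId:** id of the street segment where the count was taken
    - **wktgeom:** location of the count as a Well-Known Text (WKT) point
    - **street:** name of the street where the traffic was counted
    - **fromst:** cross street where the counted stretch starts
    - **tost:** cross street where the counted stretch ends
    - **direction:** direction of travel (e.g. NB, SB, EB, WB)
    - **date_time:** date and time of the count, built from the original year, month, day, hour, and minute columns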
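
The dictionary above uses lowercase names, while the raw columns mix spaces, underscores, and camelCase (`Geo Place Name`, `Start_Date`, `fromSt`). A minimal sketch of one way to normalize the labels is shown below. The helper `to_snake_case` is hypothetical and not part of the notebook, and the resulting names (for example `from_st`) are just one possible convention.

```python
import re

import pandas as pd


def to_snake_case(label: str) -> str:
    """Convert a column label such as 'Geo Place Name' or 'fromSt' to snake_case."""
    label = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", label.strip())  # split camelCase boundaries
    label = re.sub(r"[\s\-]+", "_", label)  # spaces and hyphens become underscores
    return re.sub(r"_+", "_", label).lower()


# Empty toy frame that reuses a few of the real column labels.
toy = pd.DataFrame(
    columns=["Indicator ID", "Geo Place Name", "Start_Date", "RequestID", "fromSt", "WktGeom"]
)
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))
```

Applying the same `rename` call to `air_quality` and `traffic_volume` would give both frames a consistent naming scheme before they are combined.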

In [55]:
import pandas as pd
import numpy as np

In [6]:
air_quality = pd.read_csv('datasets/Air_Quality.csv')
traffic_volume = pd.read_csv('datasets/Automated_Traffic_Volume_Counts.csv')

In [8]:
print(air_quality.isnull().sum() / len(air_quality))

Unique ID         0.0
Indicator ID      0.0
Name              0.0
Measure           0.0
Measure Info      0.0
Geo Type Name     0.0
Geo Join ID       0.0
Geo Place Name    0.0
Time Period       0.0
Start_Date        0.0
Data Value        0.0
Message           1.0
dtype: float64
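
The output above shows that `Message` is missing in every row. Rather than naming such columns by hand, they can be dropped by threshold. The following is a small sketch on made-up rows; the helper `drop_mostly_null_columns` and its `threshold` parameter are assumptions for illustration, not something the notebook defines.

```python
import numpy as np
import pandas as pd


def drop_mostly_null_columns(df: pd.DataFrame, threshold: float = 1.0) -> pd.DataFrame:
    """Drop every column whose fraction of missing values is >= threshold."""
    null_fraction = df.isnull().mean()  # same quantity as isnull().sum() / len(df)
    to_drop = null_fraction[null_fraction >= threshold].index
    return df.drop(columns=to_drop)


# Toy rows shaped like the air quality data; 'Message' is entirely null.
toy = pd.DataFrame(
    {
        "Name": ["Ozone (O3)", "Nitrogen Dioxide (NO2)"],
        "Data Value": [25.56, 31.61],
        "Message": [np.nan, np.nan],
    }
)
cleaned = drop_mostly_null_columns(toy)
print(list(cleaned.columns))
```

With the default threshold of 1.0 only fully empty columns are removed, which matches what the next cell does for `Message`.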


In [9]:
# 'Message' is null in every row, so it carries no information.
air_quality = air_quality.drop(['Message'], axis=1)
print(air_quality.isnull().sum() / len(air_quality))

Unique ID         0.0
Indicator ID      0.0
Name              0.0
Measure           0.0
Measure Info      0.0
Geo Type Name     0.0
Geo Join ID       0.0
Geo Place Name    0.0
Time Period       0.0
Start_Date        0.0
Data Value        0.0
dtype: float64


In [10]:
print(air_quality.nunique() / len(air_quality))

Unique ID         1.000000
Indicator ID      0.001365
Name              0.001179
Measure           0.000496
Measure Info      0.000496
Geo Type Name     0.000310
Geo Join ID       0.004466
Geo Place Name    0.007071
Time Period       0.002791
Start_Date        0.002233
Data Value        0.253443
dtype: float64


In [11]:
# 'Unique ID' differs on every row (ratio 1.0 above), so it is only a row identifier.
air_quality = air_quality.drop(['Unique ID'], axis=1)
print(air_quality.shape)
print(air_quality.nunique() / len(air_quality))

(16122, 10)
Indicator ID      0.001365
Name              0.001179
Measure           0.000496
Measure Info      0.000496
Geo Type Name     0.000310
Geo Join ID       0.004466
Geo Place Name    0.007071
Time Period       0.002791
Start_Date        0.002233
Data Value        0.253443
dtype: float64


In [52]:
air_quality = air_quality.drop(['Geo Join ID'], axis=1)
print(air_quality.shape)
print(air_quality.nunique() / len(air_quality))

(16122, 9)
Indicator ID      0.001365
Name              0.001179
Measure           0.000496
Measure Info      0.000496
Geo Type Name     0.000310
Geo Place Name    0.007071
Time Period       0.002791
Start_Date        0.002233
Data Value        0.253443
dtype: float64


In [25]:
air_quality.dtypes

Indicator ID        int64
Name               object
Measure            object
Measure Info       object
Geo Type Name      object
Geo Join ID         int64
Geo Place Name     object
Time Period        object
Start_Date         object
Data Value        float64
dtype: object

In [26]:
air_quality.nunique()

Indicator ID        22
Name                19
Measure              8
Measure Info         8
Geo Type Name        5
Geo Join ID         72
Geo Place Name     114
Time Period         45
Start_Date          36
Data Value        4086
dtype: int64

In [27]:
air_quality['Time Period'].unique()

array(['Summer 2013', 'Summer 2014', 'Winter 2008-09', 'Summer 2009',
       'Summer 2010', 'Summer 2011', 'Summer 2012', 'Winter 2009-10',
       '2005-2007', '2013', '2005', '2009-2011', 'Winter 2010-11',
       'Winter 2011-12', 'Winter 2012-13', 'Annual Average 2009',
       'Annual Average 2010', 'Annual Average 2011',
       'Annual Average 2012', 'Annual Average 2013', '2015',
       'Winter 2013-14', 'Annual Average 2014', '2011', 'Winter 2014-15',
       '2016', 'Annual Average 2015', 'Summer 2015', 'Winter 2015-16',
       'Summer 2016', 'Annual Average 2016', 'Summer 2017', '2012-2014',
       'Summer 2018', 'Annual Average 2017', 'Summer 2019',
       'Winter 2016-17', 'Annual Average 2018', 'Winter 2017-18',
       '2015-2017', 'Summer 2020', 'Annual Average 2019',
       'Winter 2018-19', 'Annual Average 2020', 'Winter 2019-20'],
      dtype=object)
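
The `Time Period` labels above come in five shapes: `Summer YYYY`, `Winter YYYY-YY`, `Annual Average YYYY`, `YYYY`, and `YYYY-YYYY`. A sketch of splitting each label into a kind and a starting year follows. The function `parse_time_period` and the kind names are hypothetical; they are not used elsewhere in this notebook.

```python
import re

import pandas as pd

# One pattern per label shape seen in the 'Time Period' column.
_PATTERNS = [
    ("summer", re.compile(r"^Summer (\d{4})$")),
    ("winter", re.compile(r"^Winter (\d{4})-\d{2}$")),
    ("annual_average", re.compile(r"^Annual Average (\d{4})$")),
    ("multi_year", re.compile(r"^(\d{4})-\d{4}$")),
    ("single_year", re.compile(r"^(\d{4})$")),
]


def parse_time_period(label: str):
    """Return (kind, first year) for a label, or ('unknown', None) if no pattern matches."""
    for kind, pattern in _PATTERNS:
        match = pattern.match(label.strip())
        if match:
            return kind, int(match.group(1))
    return "unknown", None


labels = pd.Series(["Summer 2013", "Winter 2008-09", "Annual Average 2009", "2005-2007", "2013"])
parsed = pd.DataFrame(
    labels.map(parse_time_period).tolist(),
    columns=["period_kind", "period_start_year"],
)
print(parsed)
```

Separating the kind from the year makes it possible to filter, for example, only seasonal rows without string matching on the full label.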

In [28]:
air_quality['Time Period'].value_counts() / len(air_quality)

2012-2014              0.029773
2005-2007              0.029773
2015-2017              0.029773
2009-2011              0.029773
Summer 2012            0.026237
Winter 2012-13         0.026237
Winter 2015-16         0.026237
Winter 2014-15         0.026237
Summer 2013            0.026237
Summer 2009            0.026237
Winter 2009-10         0.026237
Summer 2014            0.026237
Summer 2019            0.026237
Winter 2008-09         0.026237
Winter 2011-12         0.026237
Summer 2018            0.026237
Summer 2016            0.026237
Summer 2011            0.026237
Winter 2013-14         0.026237
Summer 2010            0.026237
Winter 2010-11         0.026237
Summer 2020            0.026237
Summer 2017            0.026237
Summer 2015            0.026237
2005                   0.025245
2016                   0.019911
Annual Average 2011    0.017492
Winter 2017-18         0.017492
Winter 2018-19         0.017492
Annual Average 2009    0.017492
Annual Average 2018    0.017492
Winter 2

In [29]:
air_quality['Start_Date'].unique()

array(['06/01/2013', '06/01/2014', '12/01/2008', '06/01/2009',
       '06/01/2010', '06/01/2011', '06/01/2012', '12/01/2009',
       '01/01/2005', '01/01/2013', '01/01/2009', '12/01/2010',
       '12/01/2011', '12/01/2012', '01/01/2015', '12/01/2013',
       '01/01/2011', '12/01/2014', '01/01/2016', '06/01/2015',
       '12/01/2015', '05/31/2016', '12/31/2015', '06/01/2017',
       '01/02/2012', '06/01/2018', '01/01/2017', '06/01/2019',
       '12/01/2016', '01/01/2018', '12/01/2017', '06/01/2020',
       '01/01/2019', '12/01/2018', '01/01/2020', '12/01/2019'],
      dtype=object)

In [30]:
# Every Start_Date value is MM/DD/YYYY (e.g. '05/31/2016'), so state the format explicitly.
air_quality['Start_Date'] = pd.to_datetime(air_quality['Start_Date'], format='%m/%d/%Y')

In [31]:
air_quality['Start_Date'].min()

Timestamp('2005-01-01 00:00:00')

In [32]:
air_quality['Start_Date'].value_counts().sort_index() / len(air_quality)

2005-01-01    0.055018
2008-12-01    0.043729
2009-01-01    0.029773
2009-06-01    0.026237
2009-12-01    0.043729
2010-06-01    0.026237
2010-12-01    0.043729
2011-01-01    0.013274
2011-06-01    0.026237
2011-12-01    0.043729
2012-01-02    0.029773
2012-06-01    0.026237
2012-12-01    0.043729
2013-01-01    0.008932
2013-06-01    0.026237
2013-12-01    0.043729
2014-06-01    0.026237
2014-12-01    0.026237
2015-01-01    0.056197
2015-06-01    0.026237
2015-12-01    0.026237
2015-12-31    0.017492
2016-01-01    0.019911
2016-05-31    0.026237
2016-12-01    0.017492
2017-01-01    0.017492
2017-06-01    0.026237
2017-12-01    0.017492
2018-01-01    0.017492
2018-06-01    0.026237
2018-12-01    0.017492
2019-01-01    0.017492
2019-06-01    0.026237
2019-12-01    0.017492
2020-01-01    0.017492
2020-06-01    0.026237
Name: Start_Date, dtype: float64

In [33]:
traffic_volume.sample(10)

Unnamed: 0,RequestID,Boro,Yr,M,D,HH,MM,Vol,SegmentID,WktGeom,street,fromSt,toSt,Direction
6905083,22851,Brooklyn,2016,3,9,23,30,40,162697,POINT (987583.140853647 190844.75607060915),LIVINGSTON STREET,Boerum Place,Culver/ 6 Avenue Line,EB
18245292,8674,Brooklyn,2012,5,26,1,30,10,105904,POINT (996805.9 149496.8),BRIGHTON 15 ST,BRIGHTON BEACH AV,BEND,WB
1148967,11573,Brooklyn,2012,10,27,5,15,55,22924,POINT (989923.1 188795),ATLANTIC AV,3 AV,4 AV,WB
5333867,19635,Brooklyn,2015,2,10,20,30,0,44742,POINT (1004554.3605970219 195624.30125201194),KNICKERBOCKER AVENUE,Jefferson Street,Troutman Street,WB
22330895,7210,Manhattan,2010,11,6,2,30,215,145260,POINT (987371.9 202639.8),E/B E HOUSTON ST @ ALLEN ST,ALLEN ST,ALLEN ST,EB
21027644,23906,Bronx,2016,9,9,23,0,346,138734,POINT (1003176.1933528372 235761.60822257065),MADISON AVENUE BR APPROACH,Madison Avenue Bridge,Exterior Street,EB
5022927,8692,Queens,2012,6,20,13,15,577,190793,POINT (1000331.6 208690.3),MIDTOWN HWY BR,QN MIDTOWN EXWY,QN MIDTOWN EXWY,EB
2135230,23541,Brooklyn,2016,6,15,2,15,0,251715,POINT (992254.2938923276 176810.5545309051),EAST DRIVE,West Drive,West Drive,EB
11686554,23707,Brooklyn,2016,6,20,18,0,74,29355,POINT (994266.5646090935 188448.78370890685),ST JAMES PLACE,Gates Avenue,Fulton Street,SB
14602660,26698,Brooklyn,2017,9,5,10,45,206,28382,POINT (995362.0545424692 178276.608729737),FLATBUSH AVENUE,Winthrop Street,Parkside Avenue,SB


In [34]:
traffic_volume.shape

(27190511, 14)

In [35]:
print(traffic_volume.isnull().sum() / len(traffic_volume))

RequestID    0.000000
Boro         0.000000
Yr           0.000000
M            0.000000
D            0.000000
HH           0.000000
MM           0.000000
Vol          0.000000
SegmentID    0.000000
WktGeom      0.000000
street       0.000000
fromSt       0.000000
toSt         0.000074
Direction    0.000000
dtype: float64


In [36]:
print(traffic_volume.nunique() / len(traffic_volume))

RequestID    2.607527e-04
Boro         1.838877e-07
Yr           5.884406e-07
M            4.413304e-07
D            1.140104e-06
HH           8.826609e-07
MM           1.471101e-07
Vol          1.476986e-04
SegmentID    5.499345e-04
WktGeom      7.525787e-04
street       2.482484e-04
fromSt       2.361486e-04
toSt         2.175391e-04
Direction    2.206652e-07
dtype: float64


In [37]:
traffic_volume.nunique()

RequestID     7090
Boro             5
Yr              16
M               12
D               31
HH              24
MM               4
Vol           4016
SegmentID    14953
WktGeom      20463
street        6750
fromSt        6421
toSt          5915
Direction        6
dtype: int64

In [39]:
traffic_volume.dtypes

RequestID     int64
Boro         object
Yr            int64
M             int64
D             int64
HH            int64
MM            int64
Vol           int64
SegmentID     int64
WktGeom      object
street       object
fromSt       object
toSt         object
Direction    object
dtype: object

In [40]:
traffic_volume.Yr.min()

2000

In [41]:
# The air quality data starts in 2005 (earliest Start_Date), so drop older traffic counts.
traffic_volume = traffic_volume[traffic_volume['Yr'] >= 2005]

In [42]:
traffic_volume.shape

(27188607, 14)

In [43]:
traffic_volume['Yr'].value_counts().sort_index()

2006        664
2007      11780
2008      68591
2009    1012766
2010    1421397
2011    1238391
2012    2434583
2013    2829656
2014    3708367
2015    3232005
2016    3362243
2017    3013530
2018    2046443
2019    2365633
2020     442558
Name: Yr, dtype: int64

In [44]:
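# 2006-2008 have fewer than 70,000 rows each, so keep 2009 onward; copy to avoid SettingWithCopyWarning.
traffic_volume = traffic_volume[traffic_volume['Yr'] > 2008].copy()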

In [45]:
traffic_volume.shape

(27107572, 14)

In [46]:
traffic_volume['date_time'] = pd.to_datetime(dict(year=traffic_volume.Yr,
                                                  month=traffic_volume.M,
                                                  day=traffic_volume.D,
                                                  hour=traffic_volume.HH,
                                                  minute=traffic_volume.MM))
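
The cell above assembles one timestamp from the five component columns. The same step can be sketched on a toy frame by renaming the columns to the names `pd.to_datetime` expects and passing the sub-frame directly. The toy values are copied from the sample rows shown earlier; `errors="coerce"` is an optional addition here that turns impossible dates into `NaT` instead of raising.

```python
import pandas as pd

# Two toy rows with the same component columns as traffic_volume.
toy = pd.DataFrame(
    {
        "Yr": [2016, 2012],
        "M": [3, 5],
        "D": [9, 26],
        "HH": [23, 1],
        "MM": [30, 30],
        "Vol": [40, 10],
    }
)

# pd.to_datetime accepts a frame whose columns are named year, month, day, hour, minute.
parts = toy[["Yr", "M", "D", "HH", "MM"]].rename(
    columns={"Yr": "year", "M": "month", "D": "day", "HH": "hour", "MM": "minute"}
)
toy["date_time"] = pd.to_datetime(parts, errors="coerce")

# The component columns are redundant once the timestamp exists.
toy = toy.drop(columns=["Yr", "M", "D", "HH", "MM"])
print(toy)
```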

In [47]:
traffic_volume = traffic_volume.drop(['Yr', 'M', 'D', 'HH', 'MM'], axis=1)

In [48]:
traffic_volume.sample(10)

Unnamed: 0,RequestID,Boro,Vol,SegmentID,WktGeom,street,fromSt,toSt,Direction,date_time
3229533,32384,Bronx,299,144019,POINT (1003489.5459258083 233692.39948567233),3 AVENUE BRIDGE,Dead End,Dead end,WB,2020-10-19 14:30:00
12254319,30530,Bronx,41,88374,POINT (1026253.7113216726 256803.0970668355),EAST GUN HILL ROAD,Burke Avenue,Young Avenue,EB,2019-09-19 05:00:00
26527402,28834,Brooklyn,8,43312,POINT (1003103.589272826 188040.53529617586),STUYVESANT AVENUE,Macon Street,Mac Donough Street,SB,2018-11-15 04:00:00
10990059,24514,Brooklyn,7,24386,POINT (988631.9836337205 190309.84823588253),LIVINGSTON STREET,Elm Place,Bond Street,EB,2016-09-28 04:30:00
18243713,8407,Brooklyn,15,22638,POINT (987251.8 186361.9),CARROLL ST BR,CARROLL ST,CARROLL ST,EB,2012-05-01 07:45:00
10416911,14742,Queens,118,132557,POINT (1000813.5 211763.1),JACKSON AV,QNSBO BR UP RY APPR/DUTCH KILLS ST,QUEENS ST,EB,2013-06-20 13:45:00
24790017,10586,Manhattan,82,192179,POINT (1004947.3 253141.7),NAGLE AV,HILLSIDE AV,DYCKMAN ST,EB,2013-01-16 19:45:00
23278694,18209,Brooklyn,162,20487,POINT (989454.0378511314 162978.26420641012),BAY PARKWAY,64 Street,65 Street,SB,2014-09-28 19:45:00
5980166,32384,Manhattan,152,188852,POINT (994559.01025559 216508.43451445957),ED KOCH QUEENSBORO BRIDGE EXIT,Astoria Line,Dead end,NB,2020-10-22 05:30:00
13586253,26291,Manhattan,188,165481,POINT (992719.992942051 216871.28888757754),EAST 58 STREET,Park Avenue,Lexington Avenue,EB,2017-06-22 16:00:00


In [49]:
traffic_volume['date_time'].dt.year.value_counts().sort_index()

2009    1012766
2010    1421397
2011    1238391
2012    2434583
2013    2829656
2014    3708367
2015    3232005
2016    3362243
2017    3013530
2018    2046443
2019    2365633
2020     442558
Name: date_time, dtype: int64
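
To compare these counts with the seasonal air quality rows, each traffic timestamp needs a label in the same style as `Time Period`. The sketch below assumes that summer covers June through August and winter covers December through February. The source only shows that these periods start on 06/01 and 12/01, so the end months are an assumption, and `season_label` is a hypothetical helper.

```python
import pandas as pd


def season_label(ts: pd.Timestamp):
    """Return a label such as 'Summer 2013' or 'Winter 2008-09', or None for other months."""
    if ts.month in (6, 7, 8):  # assumed summer months
        return f"Summer {ts.year}"
    if ts.month == 12:  # winter starts in December of the first year
        return f"Winter {ts.year}-{(ts.year + 1) % 100:02d}"
    if ts.month in (1, 2):  # assumed to still belong to the previous December's winter
        return f"Winter {ts.year - 1}-{ts.year % 100:02d}"
    return None


# Toy rows; the values are illustrative only.
toy = pd.DataFrame(
    {
        "Boro": ["Brooklyn", "Brooklyn", "Queens", "Queens"],
        "Vol": [40, 10, 577, 118],
        "date_time": pd.to_datetime(
            ["2012-12-27 05:15", "2013-01-05 01:30", "2012-06-20 13:15", "2013-03-20 13:45"]
        ),
    }
)
toy["season"] = toy["date_time"].map(season_label)

# Total volume per borough and season; rows outside the two seasons are dropped.
seasonal = (
    toy.dropna(subset=["season"])
    .groupby(["Boro", "season"], as_index=False)["Vol"]
    .sum()
)
print(seasonal)
```

On the full table of roughly 27 million rows, a per-row `map` would be slow, so a vectorized version based on `dt.month` and `dt.year` would probably be preferable.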

## Transformed Datasets

In [50]:
traffic_volume.head(25)

Unnamed: 0,RequestID,Boro,Vol,SegmentID,WktGeom,street,fromSt,toSt,Direction,date_time
0,20856,Queens,9,171896,POINT (1052296.600156678 199785.26932711253),94 AVENUE,207 Street,Francis Lewis Boulevard,WB,2015-06-23 23:30:00
1,21231,Staten Island,6,9896,POINT (942668.0589509147 171441.21296926),RICHMOND TERRACE,Wright Avenue,Emeric Court,WB,2015-09-14 04:15:00
2,29279,Bronx,85,77817,POINT (1016508.0034050211 235221.59092266942),HUNTS POINT AVENUE,Whittier Street,Randall Avenue,NB,2017-10-19 04:30:00
3,27019,Brooklyn,168,188023,POINT (992925.4316054962 184116.82855457635),FLATBUSH AVENUE,Brighton Line,Brighton Line,NB,2017-11-07 18:30:00
4,26734,Manhattan,355,137516,POINT (1004175.9505178436 247779.63624949602),WASHINGTON BRIDGE,Harlem River Shoreline,Harlem River Shoreline,EB,2017-11-03 22:00:00
5,26015,Bronx,11,86053,POINT (1021709.470909429 248612.86356908735),WALLACE AVENUE,Rhinelander Avenue,Bronxdale Avenue,NB,2017-06-17 01:45:00
6,2033,Manhattan,99,70683,POINT (1000954.8 243914.9),S/B AMSTERDAM AVE @ W 162 ST,ST NICHOLAS AV/W 162 ST,W 163 ST,SB,2009-09-01 18:30:00
7,23133,Queens,232,101101,POINT (1050277.3347521287 216784.58047417598),NORTHERN BOULEVARD,220 Place,220 Street,WB,2016-03-21 09:45:00
8,32417,Queens,18,147877,POINT (1044172.6626552071 200130.04842303603),MIDLAND PARKWAY,Dalny Road,Connector,SB,2020-11-14 02:15:00
9,26198,Bronx,2,85935,POINT (1021747.2311522859 242463.04655740186),THIERIOT AVENUE,Gleason Avenue,Pelham Line,NB,2017-06-22 04:30:00


In [53]:
air_quality.sample(10)

Unnamed: 0,Indicator ID,Name,Measure,Measure Info,Geo Type Name,Geo Place Name,Time Period,Start_Date,Data Value
13694,639,PM2.5-Attributable Deaths,Estimated Annual Rate - Adults 30 Yrs and Older,"per 100,000 adults",UHF42,East Flatbush - Flatbush,2015-2017,2015-01-01,36.2
5236,375,Nitrogen Dioxide (NO2),Mean,ppb,UHF42,Greenwich Village - SoHo,Summer 2009,2009-06-01,31.61
2579,375,Nitrogen Dioxide (NO2),Mean,ppb,CD,Flushing and Whitestone (CD7),Annual Average 2011,2010-12-01,21.94
14069,661,O3-Attributable Asthma Hospitalizations,Estimated Annual Rate- 18 Yrs and Older,"per 100,000 adults",Borough,Queens,2015-2017,2015-01-01,2.6
8213,386,Ozone (O3),Mean,ppb,UHF42,Greenpoint,Summer 2009,2009-06-01,25.56
9798,375,Nitrogen Dioxide (NO2),Mean,ppb,CD,South Beach and Willowbrook (CD2),Annual Average 2015,2015-01-01,13.62
2262,375,Nitrogen Dioxide (NO2),Mean,ppb,CD,Bedford Stuyvesant (CD3),Summer 2011,2011-06-01,21.38
15047,386,Ozone (O3),Mean,ppb,UHF34,Kingsbridge - Riverdale,Summer 2019,2019-06-01,28.43
2601,375,Nitrogen Dioxide (NO2),Mean,ppb,CD,Washington Heights and Inwood (CD12),Annual Average 2012,2011-12-01,22.64
11128,375,Nitrogen Dioxide (NO2),Mean,ppb,CD,Park Slope and Carroll Gardens (CD6),Annual Average 2016,2015-12-31,21.17
