# Preliminary Analysis

### Data Cleaning Code
Code for cleaning and processing your data. Include a data dictionary for your transformed dataset.

- Data Dictionary for Air Quality
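    - **indicator id:** numeric ID of the indicator
    - **name:** name of the indicator, e.g. a pollutant such as Ozone (O3)
    - **measure:** how the indicator is measured (e.g. Mean, Estimated Annual Rate)
    - **measure info:** unit of the measure (e.g. ppb, mcg per cubic meter)
    - **geo type name:** geography type (e.g. UHF42, CD, Borough); UHF stands for United Hospital Fund neighborhoods
    - **geo place name:** neighborhood name
    - **time period:** period the value covers (e.g. Summer 2013, Annual Average 2015)
    - **start_date:** start date of the time period
    - **data value:** value of the indicator, in the unit given by measure info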
    <br><br>
- Data Dictionary for Traffic Volume
    - **requestId:** unique ID generated for each count request
    - **boro:** which of the five boroughs of New York City the location is in
    - **vol:** total count collected in each 15-minute increment
    - **segmentId:** ID that identifies each segment of a street
    - **wktgeom:** point geometry of the location, in well-known text (WKT) format
    - **street:** name of the street where the count took place
    - **fromst:** street where the counted segment starts
    - **tost:** street where the counted segment ends
    - **direction:** text-based direction of traffic where the count took place
    - **date_time:** date and time of the count, combined from the original Yr, M, D, HH, and MM columns
    <br><br>
- Data Dictionary for Mobility Dataset (2020-2022)
    - **sub_region_2:** county of the record
    - **date:** date of the record
    - **retail_and_recreation_percent_change_from_baseline:** mobility trends for places like restaurants, cafes, shopping centers, theme parks, museums, libraries, and movie theaters
    - **grocery_and_pharmacy_percent_change_from_baseline:** mobility trends for places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores, and pharmacies
    - **parks_percent_change_from_baseline:** mobility trends for places like national parks, public beaches, marinas, dog parks, plazas, and public gardens
    - **transit_stations_percent_change_from_baseline:** mobility trends for places like public transport hubs such as subway, bus, and train stations
    - **workplaces_percent_change_from_baseline:** mobility trends for places of work
    - **residential_percent_change_from_baseline:** mobility trends for places of residence
    
### Exploratory Analysis
Describe what work you have done so far and include the code. This may include descriptive statistics, graphs and charts, and preliminary models.

- We removed columns that were irrelevant to what we want to predict and combined columns that belong together, such as the date and time fields.


### Challenges
Describe any challenges you've encountered so far. Let me know if there's anything you need help with!

- It was challenging to decide which data to include, since our problem is specific to New York City.
- Choosing the transformations for each dataset was also a challenge: each dataset has many columns, and we had to identify the ones that aren't relevant to our problem.
- For some columns, such as segmentId in the Traffic Volume dataset, we are still unsure whether to keep or remove them.
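
One way to approach the segmentId question is to check whether it carries information beyond the street columns. The sketch below is only an illustration on a toy frame (the rows and the `toy` name are made up, not taken from the real file): it counts how many distinct (street, fromSt, toSt) combinations each SegmentID maps to.

```python
import pandas as pd

# Toy stand-in for traffic_volume; values are illustrative only.
toy = pd.DataFrame({
    "SegmentID": [36026, 36026, 158962, 158962, 39744],
    "street": ["5 AVENUE", "5 AVENUE", "BROADWAY", "BROADWAY", "NOSTRAND AV"],
    "fromSt": ["A", "A", "B", "B", "C"],
    "toSt": ["X", "X", "Y", "Z", "W"],
})

# For each SegmentID, count the distinct (street, fromSt, toSt) triples it appears with.
triples_per_segment = (
    toy.drop_duplicates(["SegmentID", "street", "fromSt", "toSt"])
       .groupby("SegmentID")
       .size()
)

# Share of segments that map to exactly one triple.
share_one_to_one = (triples_per_segment == 1).mean()
print(triples_per_segment)
print(round(share_one_to_one, 2))
```

If the share is close to 1 on the real data, SegmentID and the street columns are largely redundant and one of them could probably be dropped; if not, SegmentID may be worth keeping as the more precise location key.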

### Future Work
Describe what work you are planning to complete for the final analysis.

- Use the cleaned data as inputs to models: Logistic Regression for classification targets and Linear Regression for continuous targets.
- Make predictions with the trained models and compute accuracy scores to answer our questions.
- Identify the best-performing model, and graph/chart the data to understand it better for future predictions.
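
As a rough idea of what a first baseline could look like, here is a minimal sketch of a linear regression fit and held-out score. Everything in it is an assumption for illustration: the data is synthetic, the single feature only stands in for something like traffic volume, and the fit uses NumPy least squares rather than whichever modeling library we end up choosing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: one feature and a noisy linear target.
x = rng.uniform(0, 100, size=200)
y = 0.3 * x + 5 + rng.normal(0, 2, size=200)

# Hold out the last 25% of rows for evaluation.
split = int(len(x) * 0.75)
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]

# Ordinary least squares: solve for [slope, intercept].
A_train = np.column_stack([x_train, np.ones_like(x_train)])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
slope, intercept = coef

# R^2 on the held-out rows.
pred = slope * x_test + intercept
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(slope, intercept, r2)
```

For a continuous target the held-out score is R^2 rather than accuracy; accuracy applies once the target is turned into classes, which is where Logistic Regression would come in.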

### Contributions
Describe the contributions that each group member made.
- **Daniel Aguilar-Rodriguez**
    - Researched and acquired datasets
    - Helped present ideas during brainstorming session
    - Created Jupyter notebook and helped clean datasets
    - Helped transform datasets and removed columns irrelevant to our work
    <br><br>
- **Jia Cong Lin**
    - Helped present ideas during brainstorming session
    - Helped define necessary columns for the mobility dataset
    - Assisted in determining columns to clean and define 
    <br><br>
- **Anvinh Truong**
    - Helped clean and define some columns for the datasets and dictionary
    - Helped present ideas during brainstorming session
    - Assisted in thinking of procedure to clean data columns

In [64]:
import pandas as pd
import numpy as np

In [65]:
air_quality = pd.read_csv('datasets/Air_Quality.csv')
traffic_volume = pd.read_csv('datasets/Automated_Traffic_Volume_Counts.csv')
mobility_2020 = pd.read_csv('datasets/2020_US_Region_Mobility_Report.csv')
mobility_2021 = pd.read_csv('datasets/2021_US_Region_Mobility_Report.csv')
mobility_2022 = pd.read_csv('datasets/2022_US_Region_Mobility_Report.csv')

## Air Quality Dataset Cleaning

In [66]:
print(air_quality.isnull().sum() / len(air_quality))

Unique ID         0.0
Indicator ID      0.0
Name              0.0
Measure           0.0
Measure Info      0.0
Geo Type Name     0.0
Geo Join ID       0.0
Geo Place Name    0.0
Time Period       0.0
Start_Date        0.0
Data Value        0.0
Message           1.0
dtype: float64


In [67]:
air_quality = air_quality.drop(['Message'], axis=1)
print(air_quality.isnull().sum() / len(air_quality))

Unique ID         0.0
Indicator ID      0.0
Name              0.0
Measure           0.0
Measure Info      0.0
Geo Type Name     0.0
Geo Join ID       0.0
Geo Place Name    0.0
Time Period       0.0
Start_Date        0.0
Data Value        0.0
dtype: float64
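
The same step can be done without naming the column by hand. A small sketch on a toy frame (the rows are made up; only the column names mirror the Air Quality data) that finds and drops every column that is entirely null:

```python
import numpy as np
import pandas as pd

# Toy stand-in for air_quality with one fully empty column.
toy = pd.DataFrame({
    "Name": ["Ozone (O3)", "Sulfur Dioxide (SO2)"],
    "Data Value": [29.93, 4.46],
    "Message": [np.nan, np.nan],
})

# isnull().mean() gives the same fractions as isnull().sum() / len(toy).
null_share = toy.isnull().mean()
all_null = null_share[null_share == 1.0].index.tolist()

cleaned = toy.drop(columns=all_null)
print(all_null)
```

`toy.dropna(axis=1, how="all")` should give the same result in one call; the longer form just makes it visible which columns were removed.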


In [68]:
print(air_quality.nunique() / len(air_quality))

Unique ID         1.000000
Indicator ID      0.001365
Name              0.001179
Measure           0.000496
Measure Info      0.000496
Geo Type Name     0.000310
Geo Join ID       0.004466
Geo Place Name    0.007071
Time Period       0.002791
Start_Date        0.002233
Data Value        0.253443
dtype: float64


In [69]:
air_quality = air_quality.drop(['Unique ID'], axis=1)
print(air_quality.shape)
print(air_quality.nunique() / len(air_quality))

(16122, 10)
Indicator ID      0.001365
Name              0.001179
Measure           0.000496
Measure Info      0.000496
Geo Type Name     0.000310
Geo Join ID       0.004466
Geo Place Name    0.007071
Time Period       0.002791
Start_Date        0.002233
Data Value        0.253443
dtype: float64


In [70]:
air_quality = air_quality.drop(['Geo Join ID'], axis=1)
print(air_quality.shape)
print(air_quality.nunique() / len(air_quality))

(16122, 9)
Indicator ID      0.001365
Name              0.001179
Measure           0.000496
Measure Info      0.000496
Geo Type Name     0.000310
Geo Place Name    0.007071
Time Period       0.002791
Start_Date        0.002233
Data Value        0.253443
dtype: float64


In [71]:
air_quality.dtypes

Indicator ID        int64
Name               object
Measure            object
Measure Info       object
Geo Type Name      object
Geo Place Name     object
Time Period        object
Start_Date         object
Data Value        float64
dtype: object

In [72]:
air_quality.nunique()

Indicator ID        22
Name                19
Measure              8
Measure Info         8
Geo Type Name        5
Geo Place Name     114
Time Period         45
Start_Date          36
Data Value        4086
dtype: int64

In [73]:
air_quality['Time Period'].unique()

array(['Summer 2013', 'Summer 2014', 'Winter 2008-09', 'Summer 2009',
       'Summer 2010', 'Summer 2011', 'Summer 2012', 'Winter 2009-10',
       '2005-2007', '2013', '2005', '2009-2011', 'Winter 2010-11',
       'Winter 2011-12', 'Winter 2012-13', 'Annual Average 2009',
       'Annual Average 2010', 'Annual Average 2011',
       'Annual Average 2012', 'Annual Average 2013', '2015',
       'Winter 2013-14', 'Annual Average 2014', '2011', 'Winter 2014-15',
       '2016', 'Annual Average 2015', 'Summer 2015', 'Winter 2015-16',
       'Summer 2016', 'Annual Average 2016', 'Summer 2017', '2012-2014',
       'Summer 2018', 'Annual Average 2017', 'Summer 2019',
       'Winter 2016-17', 'Annual Average 2018', 'Winter 2017-18',
       '2015-2017', 'Summer 2020', 'Annual Average 2019',
       'Winter 2018-19', 'Annual Average 2020', 'Winter 2019-20'],
      dtype=object)
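
The labels above mix seasons, annual averages, single years, and multi-year ranges. If we later need them in a comparable form, one option is to split each label into a kind and a start/end year. This is a hypothetical helper (`parse_time_period` is not part of the notebook), written only against the label shapes listed above:

```python
import re

# Optional prefix, a four-digit start year, and an optional "-YY" or "-YYYY" end.
PATTERN = re.compile(r"(?:(Summer|Winter|Annual Average) )?(\d{4})(?:-(\d{4}|\d{2}))?")

def parse_time_period(label):
    """Split a 'Time Period' label into (kind, start_year, end_year)."""
    match = PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized Time Period label: {label!r}")
    prefix, start, end = match.groups()
    start_year = int(start)
    if end is None:
        end_year = start_year
    elif len(end) == 2:
        # Two-digit end year, e.g. 'Winter 2008-09'; assumes the same century as the start.
        end_year = start_year // 100 * 100 + int(end)
    else:
        end_year = int(end)
    kind = prefix or ("Multi-year" if end else "Year")
    return kind, start_year, end_year

for label in ["Summer 2013", "Winter 2008-09", "2005-2007", "2013", "Annual Average 2009"]:
    print(label, parse_time_period(label))
```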

In [74]:
air_quality['Time Period'].value_counts() / len(air_quality)

2012-2014              0.029773
2005-2007              0.029773
2015-2017              0.029773
2009-2011              0.029773
Summer 2012            0.026237
Winter 2012-13         0.026237
Winter 2015-16         0.026237
Winter 2014-15         0.026237
Summer 2013            0.026237
Summer 2009            0.026237
Winter 2009-10         0.026237
Summer 2014            0.026237
Summer 2019            0.026237
Winter 2008-09         0.026237
Winter 2011-12         0.026237
Summer 2018            0.026237
Summer 2016            0.026237
Summer 2011            0.026237
Winter 2013-14         0.026237
Summer 2010            0.026237
Winter 2010-11         0.026237
Summer 2020            0.026237
Summer 2017            0.026237
Summer 2015            0.026237
2005                   0.025245
2016                   0.019911
Annual Average 2011    0.017492
Winter 2017-18         0.017492
Winter 2018-19         0.017492
Annual Average 2009    0.017492
Annual Average 2018    0.017492
Winter 2

In [75]:
air_quality['Start_Date'].unique()

array(['06/01/2013', '06/01/2014', '12/01/2008', '06/01/2009',
       '06/01/2010', '06/01/2011', '06/01/2012', '12/01/2009',
       '01/01/2005', '01/01/2013', '01/01/2009', '12/01/2010',
       '12/01/2011', '12/01/2012', '01/01/2015', '12/01/2013',
       '01/01/2011', '12/01/2014', '01/01/2016', '06/01/2015',
       '12/01/2015', '05/31/2016', '12/31/2015', '06/01/2017',
       '01/02/2012', '06/01/2018', '01/01/2017', '06/01/2019',
       '12/01/2016', '01/01/2018', '12/01/2017', '06/01/2020',
       '01/01/2019', '12/01/2018', '01/01/2020', '12/01/2019'],
      dtype=object)

In [76]:
# Dates are MM/DD/YYYY strings; an explicit format avoids the deprecated infer_datetime_format flag.
air_quality['Start_Date'] = pd.to_datetime(air_quality['Start_Date'], format='%m/%d/%Y')

In [77]:
air_quality['Start_Date'].min()

Timestamp('2005-01-01 00:00:00')

In [78]:
air_quality['Start_Date'].value_counts().sort_index() / len(air_quality)

2005-01-01    0.055018
2008-12-01    0.043729
2009-01-01    0.029773
2009-06-01    0.026237
2009-12-01    0.043729
2010-06-01    0.026237
2010-12-01    0.043729
2011-01-01    0.013274
2011-06-01    0.026237
2011-12-01    0.043729
2012-01-02    0.029773
2012-06-01    0.026237
2012-12-01    0.043729
2013-01-01    0.008932
2013-06-01    0.026237
2013-12-01    0.043729
2014-06-01    0.026237
2014-12-01    0.026237
2015-01-01    0.056197
2015-06-01    0.026237
2015-12-01    0.026237
2015-12-31    0.017492
2016-01-01    0.019911
2016-05-31    0.026237
2016-12-01    0.017492
2017-01-01    0.017492
2017-06-01    0.026237
2017-12-01    0.017492
2018-01-01    0.017492
2018-06-01    0.026237
2018-12-01    0.017492
2019-01-01    0.017492
2019-06-01    0.026237
2019-12-01    0.017492
2020-01-01    0.017492
2020-06-01    0.026237
Name: Start_Date, dtype: float64

## Traffic Volume Dataset Cleaning

In [79]:
traffic_volume.sample(10)

Unnamed: 0,RequestID,Boro,Yr,M,D,HH,MM,Vol,SegmentID,WktGeom,street,fromSt,toSt,Direction
7754832,17864,Manhattan,2014,10,20,9,0,353,36026,POINT (991925.689545062 218081.12454349105),5 AVENUE,Astoria Line,East 61 Street,SB
6152433,28809,Manhattan,2019,1,15,4,45,30,158962,POINT (989304.9965110081 223426.7521785344),BROADWAY,Dead End,West 74 Street,SB
16566933,11132,Brooklyn,2012,10,29,15,30,29,39744,POINT (1000204.6 159290.7),NOSTRAND AV,AV T,AV S,NB
27156325,23544,Queens,2016,6,7,23,15,18,82566,POINT (1027240.3192830402 201619.68932911183),71 AVENUE,Continental Avenue,Austin Street,NB
8012507,24785,Brooklyn,2016,11,18,1,30,9,168117,POINT (1016524.4109720308 175851.41240695576),CROTON LOOP,Dead End,Pennsylvania Avenue,EB
27011485,11561,Brooklyn,2012,10,1,18,45,217,19566,POINT (982299.4 168550.7),65 ST,10 AV,11 AV,EB
6016569,1251,Queens,2011,7,10,21,45,1,130062,POINT (1011921.2 205634.9),65 PL,53 DR,53 AV,NB
3495643,26966,Brooklyn,2017,10,22,17,15,13,44348,POINT (1001487.7493934579 198493.97044553026),WATERBURY STREET,Ten Eyck Street,Maujer Street,SB
7503346,1798,Queens,2010,10,2,16,0,117,90711,POINT (1037259.6 209694.9),E/B BOOTH MEMORIAL AVE @ 160 ST,159 ST,160 ST,EB
3601634,18042,Brooklyn,2014,10,15,12,45,24,128838,POINT (1020102.2594734749 183632.24750882766),EUCLID AVENUE,Alley,Driveway,NB


In [80]:
traffic_volume.shape

(27190511, 14)

In [81]:
print(traffic_volume.isnull().sum() / len(traffic_volume))

RequestID    0.000000
Boro         0.000000
Yr           0.000000
M            0.000000
D            0.000000
HH           0.000000
MM           0.000000
Vol          0.000000
SegmentID    0.000000
WktGeom      0.000000
street       0.000000
fromSt       0.000000
toSt         0.000074
Direction    0.000000
dtype: float64
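
Only toSt has missing values, and very few of them. We have not decided how to treat them yet; the sketch below shows two possible options on a toy frame (made-up rows, and `"UNKNOWN"` is just a placeholder label we picked for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for traffic_volume with one missing end street.
toy = pd.DataFrame({
    "street": ["BROADWAY", "65 ST", "EUCLID AVENUE"],
    "toSt": ["West 74 Street", np.nan, "Driveway"],
    "Vol": [30, 217, 24],
})

# Option 1: keep the rows and mark the missing end street explicitly.
filled = toy.assign(toSt=toy["toSt"].fillna("UNKNOWN"))

# Option 2: drop the rows with a missing end street.
dropped = toy.dropna(subset=["toSt"])

print(filled)
print(dropped)
```

Option 1 keeps every count, which matters if Vol is what we model; option 2 is simpler if toSt ends up being used as a feature.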


In [82]:
print(traffic_volume.nunique() / len(traffic_volume))

RequestID    2.607527e-04
Boro         1.838877e-07
Yr           5.884406e-07
M            4.413304e-07
D            1.140104e-06
HH           8.826609e-07
MM           1.471101e-07
Vol          1.476986e-04
SegmentID    5.499345e-04
WktGeom      7.525787e-04
street       2.482484e-04
fromSt       2.361486e-04
toSt         2.175391e-04
Direction    2.206652e-07
dtype: float64


In [83]:
traffic_volume.nunique()

RequestID     7090
Boro             5
Yr              16
M               12
D               31
HH              24
MM               4
Vol           4016
SegmentID    14953
WktGeom      20463
street        6750
fromSt        6421
toSt          5915
Direction        6
dtype: int64

In [84]:
traffic_volume.dtypes

RequestID     int64
Boro         object
Yr            int64
M             int64
D             int64
HH            int64
MM            int64
Vol           int64
SegmentID     int64
WktGeom      object
street       object
fromSt       object
toSt         object
Direction    object
dtype: object

In [85]:
traffic_volume.Yr.min()

2000

In [86]:
traffic_volume = traffic_volume[traffic_volume['Yr'] >= 2005]

In [87]:
traffic_volume.shape

(27188607, 14)

In [88]:
traffic_volume['Yr'].value_counts().sort_index()

2006        664
2007      11780
2008      68591
2009    1012766
2010    1421397
2011    1238391
2012    2434583
2013    2829656
2014    3708367
2015    3232005
2016    3362243
2017    3013530
2018    2046443
2019    2365633
2020     442558
Name: Yr, dtype: int64

In [89]:
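# Drop 2006-2008, which have comparatively few rows.
# .copy() avoids a SettingWithCopyWarning when date_time is added later.
traffic_volume = traffic_volume[traffic_volume['Yr'] > 2008].copy()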

In [90]:
traffic_volume.shape

(27107572, 14)

In [91]:
# Combine the separate year/month/day/hour/minute columns into one timestamp.
traffic_volume['date_time'] = pd.to_datetime(dict(year=traffic_volume.Yr,
                                                  month=traffic_volume.M,
                                                  day=traffic_volume.D,
                                                  hour=traffic_volume.HH,
                                                  minute=traffic_volume.MM))

In [92]:
traffic_volume = traffic_volume.drop(['Yr', 'M', 'D', 'HH', 'MM'], axis=1)

In [93]:
traffic_volume.sample(10)

Unnamed: 0,RequestID,Boro,Vol,SegmentID,WktGeom,street,fromSt,toSt,Direction,date_time
4518237,26734,Manhattan,333,138892,POINT (995960.7465818096 215551.39421857634),ED KOCH QUEENSBORO BRIDGE,East River West Channel Shl,Dead end,WB,2017-10-12 02:00:00
12719173,21836,Manhattan,199,159237,POINT (992101.2904137683 228928.03971819984),BROADWAY,West 97 Street,West 96 Street,SB,2015-10-25 14:15:00
15702761,20284,Brooklyn,18,28880,POINT (990467.7070242653 181085.39723555313),PROSPECT PARK WEST,10 Street,11 Street,SB,2015-05-08 02:00:00
21143670,25473,Brooklyn,62,21336,POINT (987574.0760201351 174035.40703463872),36 STREET,Minna Street,Ft Hamilton Parkway,NB,2017-03-28 22:15:00
26073446,15389,Manhattan,234,34122,POINT (988658.3 219868.7),COLUMBUS AV,W 60 ST,W 61 ST,SB,2013-07-23 20:15:00
14409817,18989,Manhattan,31,33921,POINT (987657.8500733508 212924.6299008743),BROADWAY,Broadway Line,Broadway Line,SB,2014-12-02 11:45:00
22747300,24687,Manhattan,0,37175,POINT (990954.3032392034 222051.33444510837),CPW 72 APPROACH,8 Avenue Line,West Drive,EB,2016-11-03 15:00:00
26950462,23441,Bronx,101,87681,POINT (1022149.6740512461 254639.15134222773),ALLERTON AVENUE,Barnes Avenue,Matthews Avenue,EB,2016-05-17 13:45:00
21674274,16588,Brooklyn,23,22514,POINT (986964.3 185614.5),3 ST BR,3 ST,3 ST,EB,2014-04-23 21:30:00
4428157,7787,Manhattan,118,159038,POINT (989172.5 222264.2),BROADWAY,W 70 ST,W 69 ST,SB,2011-10-22 02:45:00


In [94]:
traffic_volume['date_time'].dt.year.value_counts().sort_index()

2009    1012766
2010    1421397
2011    1238391
2012    2434583
2013    2829656
2014    3708367
2015    3232005
2016    3362243
2017    3013530
2018    2046443
2019    2365633
2020     442558
Name: date_time, dtype: int64
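
The counts are recorded every 15 minutes, while the mobility data is daily, so at some point the traffic data will probably need to be rolled up to days. A minimal sketch of that roll-up on a toy frame (values are made up; column names follow the cleaned traffic data):

```python
import pandas as pd

# Toy stand-in for the cleaned traffic_volume frame.
toy = pd.DataFrame({
    "Boro": ["Queens", "Queens", "Queens", "Bronx"],
    "Vol": [9, 12, 7, 85],
    "date_time": pd.to_datetime([
        "2015-06-23 23:30", "2015-06-23 23:45", "2015-06-24 00:00", "2015-06-23 04:30",
    ]),
})

# Truncate each timestamp to midnight, then sum the counts per borough and day.
daily = (
    toy.assign(date=toy["date_time"].dt.normalize())
       .groupby(["Boro", "date"], as_index=False)["Vol"]
       .sum()
)
print(daily)
```

A plain sum depends on how many segments were counted on a given day, so a mean per segment may turn out to be a fairer daily measure; that is a choice we still have to make.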

## Mobility Dataset Cleaning (2020-2022)

In [95]:
mobility_2020.head(10).transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
country_region_code,US,US,US,US,US,US,US,US,US,US
country_region,United States,United States,United States,United States,United States,United States,United States,United States,United States,United States
sub_region_1,,,,,,,,,,
sub_region_2,,,,,,,,,,
metro_area,,,,,,,,,,
iso_3166_2_code,,,,,,,,,,
census_fips_code,,,,,,,,,,
place_id,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw
date,2020-02-15,2020-02-16,2020-02-17,2020-02-18,2020-02-19,2020-02-20,2020-02-21,2020-02-22,2020-02-23,2020-02-24
retail_and_recreation_percent_change_from_baseline,6,7,6,0,2,1,2,7,7,2


In [96]:
mobility_2021.head(10).transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
country_region_code,US,US,US,US,US,US,US,US,US,US
country_region,United States,United States,United States,United States,United States,United States,United States,United States,United States,United States
sub_region_1,,,,,,,,,,
sub_region_2,,,,,,,,,,
metro_area,,,,,,,,,,
iso_3166_2_code,,,,,,,,,,
census_fips_code,,,,,,,,,,
place_id,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw
date,2021-01-01,2021-01-02,2021-01-03,2021-01-04,2021-01-05,2021-01-06,2021-01-07,2021-01-08,2021-01-09,2021-01-10
retail_and_recreation_percent_change_from_baseline,-47,-26,-27,-19,-20,-22,-24,-26,-23,-26


In [97]:
mobility_2022.head(10).transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
country_region_code,US,US,US,US,US,US,US,US,US,US
country_region,United States,United States,United States,United States,United States,United States,United States,United States,United States,United States
sub_region_1,,,,,,,,,,
sub_region_2,,,,,,,,,,
metro_area,,,,,,,,,,
iso_3166_2_code,,,,,,,,,,
census_fips_code,,,,,,,,,,
place_id,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,ChIJCzYy5IS16lQRQrfeQ5K5Oxw
date,2022-01-01,2022-01-02,2022-01-03,2022-01-04,2022-01-05,2022-01-06,2022-01-07,2022-01-08,2022-01-09,2022-01-10
retail_and_recreation_percent_change_from_baseline,-43,-20,-14,-14,-16,-19,-24,-19,-21,-17


In [98]:
mobility_2020.shape

(812065, 15)

In [99]:
mobility = pd.concat([mobility_2020, mobility_2021, mobility_2022], ignore_index=True)

In [100]:
mobility.head()

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,metro_area,iso_3166_2_code,census_fips_code,place_id,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
0,US,United States,,,,,,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,2020-02-15,6.0,2.0,15.0,3.0,2.0,-1.0
1,US,United States,,,,,,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,2020-02-16,7.0,1.0,16.0,2.0,0.0,-1.0
2,US,United States,,,,,,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,2020-02-17,6.0,0.0,28.0,-9.0,-24.0,5.0
3,US,United States,,,,,,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,2020-02-18,0.0,-1.0,6.0,1.0,0.0,1.0
4,US,United States,,,,,,ChIJCzYy5IS16lQRQrfeQ5K5Oxw,2020-02-19,2.0,0.0,8.0,1.0,1.0,0.0


In [101]:
mobility.tail()

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,metro_area,iso_3166_2_code,census_fips_code,place_id,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
2511989,US,United States,Wyoming,Weston County,,,56045.0,ChIJd4Rqhed3YocR7ubT5-HgoJg,2022-10-10,,,,,-26.0,
2511990,US,United States,Wyoming,Weston County,,,56045.0,ChIJd4Rqhed3YocR7ubT5-HgoJg,2022-10-11,,,,,-20.0,
2511991,US,United States,Wyoming,Weston County,,,56045.0,ChIJd4Rqhed3YocR7ubT5-HgoJg,2022-10-12,,,,,-17.0,
2511992,US,United States,Wyoming,Weston County,,,56045.0,ChIJd4Rqhed3YocR7ubT5-HgoJg,2022-10-13,,,,,-15.0,
2511993,US,United States,Wyoming,Weston County,,,56045.0,ChIJd4Rqhed3YocR7ubT5-HgoJg,2022-10-14,,,,,-8.0,


In [102]:
# Keep only the five NYC counties; na=False treats rows with no county as non-matches.
mobility_nyc = mobility[(mobility['sub_region_1'] == 'New York') &
                        (mobility['sub_region_2'].str.contains('Bronx|Kings|New York|Queens|Richmond', na=False))]

In [103]:
mobility_nyc.sample(10).transpose()

Unnamed: 0,2202680,475463,475445,2194919,2206382,1370975,473363,1375239,2206196,2205714
country_region_code,US,US,US,US,US,US,US,US,US,US
country_region,United States,United States,United States,United States,United States,United States,United States,United States,United States,United States
sub_region_1,New York,New York,New York,New York,New York,New York,New York,New York,New York,New York
sub_region_2,New York County,New York County,New York County,Bronx County,Richmond County,New York County,Kings County,Richmond County,Richmond County,Queens County
metro_area,,,,,,,,,,
iso_3166_2_code,,,,,,,,,,
census_fips_code,36061,36061,36061,36005,36085,36061,36047,36085,36085,36081
place_id,ChIJOwE7_GTtwokRFq0uOwLSE9g,ChIJOwE7_GTtwokRFq0uOwLSE9g,ChIJOwE7_GTtwokRFq0uOwLSE9g,ChIJBUEf6ovgwokRwlazSIxIpsk,ChIJOwE7_GTtwokR1V_vES61lRI,ChIJOwE7_GTtwokRFq0uOwLSE9g,ChIJOwE7_GTtwokRs75rhW4_I6M,ChIJOwE7_GTtwokR1V_vES61lRI,ChIJOwE7_GTtwokR1V_vES61lRI,ChIJgav5pFbxwokRno6Tc5x2GL8
date,2022-01-23,2020-03-08,2020-02-19,2022-02-07,2022-09-26,2021-06-18,2020-06-20,2021-02-22,2022-03-24,2022-06-26
retail_and_recreation_percent_change_from_baseline,-47,-5,2,-30,-14,-41,-39,-38,-22,-13


In [104]:
mobility_nyc.shape

(4870, 15)
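
4,870 rows across five counties works out to 974 rows per county, which is what we would expect if every county has exactly one row per day. A quick way to confirm that is sketched below on a toy frame (made-up rows; on the real data `toy` would be `mobility_nyc`):

```python
import pandas as pd

# Toy stand-in: two counties, three consecutive days each.
dates = pd.date_range("2020-02-15", periods=3, freq="D").strftime("%Y-%m-%d")
toy = pd.DataFrame({
    "sub_region_2": ["Kings County"] * 3 + ["Queens County"] * 3,
    "date": list(dates) * 2,
})

# No (county, date) pair should appear more than once.
duplicates = toy.duplicated(["sub_region_2", "date"]).sum()

# Every county should cover the same number of distinct days.
days_per_county = toy.groupby("sub_region_2")["date"].nunique()

print(duplicates)
print(days_per_county)
```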

In [105]:
mobility_nyc.isnull().sum() / len(mobility_nyc)

country_region_code                                   0.000000
country_region                                        0.000000
sub_region_1                                          0.000000
sub_region_2                                          0.000000
metro_area                                            1.000000
iso_3166_2_code                                       1.000000
census_fips_code                                      0.000000
place_id                                              0.000000
date                                                  0.000000
retail_and_recreation_percent_change_from_baseline    0.000000
grocery_and_pharmacy_percent_change_from_baseline     0.000000
parks_percent_change_from_baseline                    0.001643
transit_stations_percent_change_from_baseline         0.000000
workplaces_percent_change_from_baseline               0.000000
residential_percent_change_from_baseline              0.000000
dtype: float64

In [106]:
mobility_nyc = mobility_nyc.drop(['metro_area', 'iso_3166_2_code'], axis=1)

In [107]:
mobility_nyc.isnull().sum()

country_region_code                                   0
country_region                                        0
sub_region_1                                          0
sub_region_2                                          0
census_fips_code                                      0
place_id                                              0
date                                                  0
retail_and_recreation_percent_change_from_baseline    0
grocery_and_pharmacy_percent_change_from_baseline     0
parks_percent_change_from_baseline                    8
transit_stations_percent_change_from_baseline         0
workplaces_percent_change_from_baseline               0
residential_percent_change_from_baseline              0
dtype: int64
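
The only remaining gaps are 8 missing values in the parks column. We have not settled on a fix; one possible approach, sketched here on a toy frame with made-up values, is to interpolate linearly within each county so that one county's values never leak into another's:

```python
import numpy as np
import pandas as pd

col = "parks_percent_change_from_baseline"

# Toy stand-in for mobility_nyc with one gap per county.
toy = pd.DataFrame({
    "sub_region_2": ["Kings County"] * 3 + ["Queens County"] * 3,
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    col: [10.0, np.nan, 20.0, -4.0, np.nan, -8.0],
})

# Sort so neighbouring rows are neighbouring days, then interpolate per county.
toy = toy.sort_values(["sub_region_2", "date"])
toy[col] = toy.groupby("sub_region_2")[col].transform(
    lambda s: s.interpolate(limit_direction="both")
)
print(toy)
```

With so few missing values, simply dropping those rows would also be defensible.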

In [108]:
mobility_nyc = mobility_nyc.drop(['country_region_code', 'country_region', 'sub_region_1'], axis=1)

In [109]:
mobility_nyc.sample(10)

Unnamed: 0,sub_region_2,census_fips_code,place_id,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
2205625,Queens County,36081.0,ChIJgav5pFbxwokRno6Tc5x2GL8,2022-03-29,-24.0,-6.0,-9.0,-28.0,-21.0,7.0
2194972,Bronx County,36005.0,ChIJBUEf6ovgwokRwlazSIxIpsk,2022-04-01,-30.0,-13.0,-36.0,-33.0,-26.0,5.0
466929,Bronx County,36005.0,ChIJBUEf6ovgwokRwlazSIxIpsk,2020-03-07,13.0,6.0,20.0,9.0,6.0,-1.0
2202921,New York County,36061.0,ChIJOwE7_GTtwokRFq0uOwLSE9g,2022-09-21,-32.0,-17.0,20.0,-25.0,-42.0,6.0
1361082,Bronx County,36005.0,ChIJBUEf6ovgwokRwlazSIxIpsk,2021-05-11,-15.0,-3.0,-17.0,-29.0,-36.0,8.0
479354,Richmond County,36085.0,ChIJOwE7_GTtwokR1V_vES61lRI,2020-05-01,-56.0,-15.0,-19.0,-62.0,-58.0,28.0
2200881,Kings County,36047.0,ChIJOwE7_GTtwokRs75rhW4_I6M,2022-08-28,-25.0,-22.0,67.0,-27.0,-25.0,0.0
2206342,Richmond County,36085.0,ChIJOwE7_GTtwokR1V_vES61lRI,2022-08-17,-11.0,-7.0,-13.0,-31.0,-35.0,9.0
475673,New York County,36061.0,ChIJOwE7_GTtwokRFq0uOwLSE9g,2020-10-04,-55.0,-25.0,-25.0,-50.0,-24.0,5.0
478950,Queens County,36081.0,ChIJgav5pFbxwokRno6Tc5x2GL8,2020-12-25,-73.0,-44.0,-53.0,-74.0,-80.0,31.0


In [110]:
mobility_nyc.groupby(['sub_region_2', 'census_fips_code'])['place_id'].nunique()

sub_region_2     census_fips_code
Bronx County     36005.0             1
Kings County     36047.0             1
New York County  36061.0             1
Queens County    36081.0             1
Richmond County  36085.0             1
Name: place_id, dtype: int64

In [111]:
mobility_nyc = mobility_nyc.drop(['census_fips_code', 'place_id'], axis=1)

In [112]:
mobility_nyc.sample(10).transpose()

Unnamed: 0,2200758,479488,475494,1360965,2202864,479416,1374597,1370808,1368525,1361294
sub_region_2,Kings County,Richmond County,New York County,Bronx County,New York County,Richmond County,Queens County,New York County,Kings County,Bronx County
date,2022-04-27,2020-09-12,2020-04-08,2021-01-14,2022-07-26,2020-07-02,2021-05-21,2021-01-02,2021-10-01,2021-12-09
retail_and_recreation_percent_change_from_baseline,-24,-24,-87,-28,-36,-19,-14,-61,-16,-19
grocery_and_pharmacy_percent_change_from_baseline,-12,-1,-49,-10,-19,9,3,-27,-1,-2
parks_percent_change_from_baseline,1,59,-65,-42,17,49,78,-16,41,-28
transit_stations_percent_change_from_baseline,-35,-23,-81,-44,-31,-40,-34,-53,-25,-27
workplaces_percent_change_from_baseline,-27,-8,-79,-40,-43,-45,-33,-27,-34,-26
residential_percent_change_from_baseline,6,3,32,12,8,12,9,9,6,5


In [113]:
mobility_nyc.shape

(4870, 8)

In [114]:
mobility_nyc.dtypes

sub_region_2                                           object
date                                                   object
retail_and_recreation_percent_change_from_baseline    float64
grocery_and_pharmacy_percent_change_from_baseline     float64
parks_percent_change_from_baseline                    float64
transit_stations_percent_change_from_baseline         float64
workplaces_percent_change_from_baseline               float64
residential_percent_change_from_baseline              float64
dtype: object
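
Two things stand out in the dtypes: date is still a string, and the county names do not match the Boro values in the traffic data. A sketch of how both could be aligned follows; the `COUNTY_TO_BORO` mapping and the `Boro` column on this frame are our own additions for illustration, and the rows are made up.

```python
import pandas as pd

# NYC counties and the boroughs they correspond to.
COUNTY_TO_BORO = {
    "Bronx County": "Bronx",
    "Kings County": "Brooklyn",
    "New York County": "Manhattan",
    "Queens County": "Queens",
    "Richmond County": "Staten Island",
}

# Toy stand-in for mobility_nyc.
toy = pd.DataFrame({
    "sub_region_2": ["Kings County", "Richmond County"],
    "date": ["2022-04-27", "2020-09-12"],
})

# Parse the ISO-formatted date strings and add a borough column.
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")
toy["Boro"] = toy["sub_region_2"].map(COUNTY_TO_BORO)
print(toy.dtypes)
print(toy)
```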

## Transformed Datasets

In [115]:
traffic_volume.head(15)

Unnamed: 0,RequestID,Boro,Vol,SegmentID,WktGeom,street,fromSt,toSt,Direction,date_time
0,20856,Queens,9,171896,POINT (1052296.600156678 199785.26932711253),94 AVENUE,207 Street,Francis Lewis Boulevard,WB,2015-06-23 23:30:00
1,21231,Staten Island,6,9896,POINT (942668.0589509147 171441.21296926),RICHMOND TERRACE,Wright Avenue,Emeric Court,WB,2015-09-14 04:15:00
2,29279,Bronx,85,77817,POINT (1016508.0034050211 235221.59092266942),HUNTS POINT AVENUE,Whittier Street,Randall Avenue,NB,2017-10-19 04:30:00
3,27019,Brooklyn,168,188023,POINT (992925.4316054962 184116.82855457635),FLATBUSH AVENUE,Brighton Line,Brighton Line,NB,2017-11-07 18:30:00
4,26734,Manhattan,355,137516,POINT (1004175.9505178436 247779.63624949602),WASHINGTON BRIDGE,Harlem River Shoreline,Harlem River Shoreline,EB,2017-11-03 22:00:00
5,26015,Bronx,11,86053,POINT (1021709.470909429 248612.86356908735),WALLACE AVENUE,Rhinelander Avenue,Bronxdale Avenue,NB,2017-06-17 01:45:00
6,2033,Manhattan,99,70683,POINT (1000954.8 243914.9),S/B AMSTERDAM AVE @ W 162 ST,ST NICHOLAS AV/W 162 ST,W 163 ST,SB,2009-09-01 18:30:00
7,23133,Queens,232,101101,POINT (1050277.3347521287 216784.58047417598),NORTHERN BOULEVARD,220 Place,220 Street,WB,2016-03-21 09:45:00
8,32417,Queens,18,147877,POINT (1044172.6626552071 200130.04842303603),MIDLAND PARKWAY,Dalny Road,Connector,SB,2020-11-14 02:15:00
9,26198,Bronx,2,85935,POINT (1021747.2311522859 242463.04655740186),THIERIOT AVENUE,Gleason Avenue,Pelham Line,NB,2017-06-22 04:30:00


In [116]:
air_quality.sample(10)

Unnamed: 0,Indicator ID,Name,Measure,Measure Info,Geo Type Name,Geo Place Name,Time Period,Start_Date,Data Value
1687,383,Sulfur Dioxide (SO2),Mean,ppb,CD,Long Island City and Astoria (CD1),Winter 2010-11,2010-12-01,4.46
9951,386,Ozone (O3),Mean,ppb,UHF42,Northeast Bronx,Summer 2015,2015-06-01,29.93
1621,383,Sulfur Dioxide (SO2),Mean,ppb,CD,Borough Park (CD12),Winter 2009-10,2009-12-01,2.82
2503,375,Nitrogen Dioxide (NO2),Mean,ppb,CD,Crown Heights and Prospect Heights (CD8),Annual Average 2010,2009-12-01,25.9
9482,375,Nitrogen Dioxide (NO2),Mean,ppb,UHF42,Long Island City - Astoria,Winter 2014-15,2014-12-01,26.22
15148,375,Nitrogen Dioxide (NO2),Mean,ppb,Borough,Queens,Summer 2020,2020-06-01,10.0
11431,652,O3-Attributable Cardiac and Respiratory Deaths,Estimated Annual Rate,"per 100,000 residents",UHF42,Canarsie - Flatlands,2012-2014,2012-01-02,5.9
9071,365,Fine Particulate Matter (PM2.5),Mean,mcg per cubic meter,UHF42,Ridgewood - Forest Hills,Annual Average 2015,2015-01-01,8.51
2245,375,Nitrogen Dioxide (NO2),Mean,ppb,CD,Central Harlem (CD10),Summer 2011,2011-06-01,24.0
1795,383,Sulfur Dioxide (SO2),Mean,ppb,CD,South Crown Heights and Lefferts Gardens (CD9),Winter 2012-13,2012-12-01,1.12


In [117]:
mobility_nyc.sample(10)

Unnamed: 0,sub_region_2,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
479423,Richmond County,2020-07-09,-23.0,4.0,44.0,-40.0,-44.0,13.0
467181,Bronx County,2020-11-14,-22.0,-3.0,-6.0,-16.0,-18.0,5.0
1368260,Kings County,2021-01-09,-38.0,-13.0,-10.0,-43.0,-23.0,8.0
1368562,Kings County,2021-11-07,-18.0,-12.0,33.0,-21.0,-22.0,3.0
1361118,Bronx County,2021-06-16,-8.0,-3.0,7.0,-26.0,-36.0,6.0
1368292,Kings County,2021-02-10,-35.0,-9.0,-26.0,-46.0,-45.0,16.0
2195040,Bronx County,2022-06-08,-24.0,-13.0,-10.0,-27.0,-27.0,3.0
467190,Bronx County,2020-11-23,-24.0,-8.0,-49.0,-36.0,-36.0,12.0
2200879,Kings County,2022-08-26,-28.0,-18.0,10.0,-38.0,-43.0,10.0
467013,Bronx County,2020-05-30,-38.0,-5.0,53.0,-24.0,-26.0,10.0
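
Before modeling, the three frames have to be brought together. As a rough sketch of one way to do that for traffic and mobility, the example below joins two toy frames on borough and day. It assumes the traffic counts have already been aggregated to daily totals and the mobility counties mapped to borough names; the frame names and all values are made up.

```python
import pandas as pd

# Toy daily traffic totals per borough.
daily_traffic = pd.DataFrame({
    "Boro": ["Queens", "Queens", "Bronx"],
    "date": pd.to_datetime(["2020-11-14", "2020-11-15", "2020-11-14"]),
    "Vol": [1800, 1650, 2400],
})

# Toy daily mobility values per borough.
mobility_daily = pd.DataFrame({
    "Boro": ["Queens", "Bronx", "Bronx"],
    "date": pd.to_datetime(["2020-11-14", "2020-11-14", "2020-11-23"]),
    "transit_stations_percent_change_from_baseline": [-30.0, -16.0, -36.0],
})

# Inner join keeps only borough-days present in both frames;
# validate raises if either side has duplicate (Boro, date) keys.
merged = daily_traffic.merge(
    mobility_daily, on=["Boro", "date"], how="inner", validate="one_to_one"
)
print(merged)
```

On the real data the overlap will be limited: the mobility data starts on 2020-02-15, while the traffic counts end in 2020, so only 2020 can contribute rows to such a join.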
