# Preliminary Analysis

### Data Cleaning Code
Code for cleaning and processing your data. Include a data dictionary for your transformed dataset.

- Data Dictionary for Air Quality
    - **indicator id:** ID of the indicator named in **name**
    - **name:** name of the indicator, i.e. what is being measured in the air
    - **measure:** how the indicator is measured
    - **measure info:** additional information about the measure, such as its units
    - **geo type name:** geography type; UHF stands for United Hospital Fund neighborhoods
    - **geo place name:** neighborhood name
    - **time period:** description of the time frame the value covers
    - **start_date:** date on which the time period started
    <br><br>
- Data Dictionary for Traffic Volume
    - **requestId:** unique ID generated for each counts request
    - **boro:** which of the five divisions (boroughs) of New York City the location is within
    - **vol:** total count collected within each 15-minute increment
    - **segmentId:** the ID that identifies each segment of a street
    - **wktgeom:** geometry point of the location
    - **street:** name of the street where the count took place
    - **fromst:** street where the counted segment starts
    - **tost:** street where the counted segment ends
    - **direction:** text-based direction of traffic where the count took place
    - **date_time:** date and time at which the count took place
    <br><br>
- Data Dictionary for 2020 Mobility Dataset
    - **sub_region_2:** county the record belongs to
    - **date:** date of the record
    - **retail_and_recreation_percent_change_from_baseline:** mobility trends for places like restaurants, cafes, shopping centers, theme parks, museums, libraries, and movie theaters
    - **grocery_and_pharmacy_percent_change_from_baseline:** mobility trends for places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores, and pharmacies
    - **parks_percent_change_from_baseline:** mobility trends for places like national parks, public beaches, marinas, dog parks, plazas, and public gardens
    - **transit_stations_percent_change_from_baseline:** mobility trends for places like public transport hubs such as subway, bus, and train stations
    - **workplaces_percent_change_from_baseline:** mobility trends for places of work
    - **residential_percent_change_from_baseline:** mobility trends for places of residence
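
To make the `*_percent_change_from_baseline` columns concrete: as we understand Google's documentation, each value compares a day's activity with a baseline, which is the median for the same day of the week over the five weeks from January 3 to February 6, 2020. The sketch below only illustrates that arithmetic. The function name and the visit counts are hypothetical, and the raw counts are not part of the published dataset.

```python
from statistics import median


def percent_change_from_baseline(value, baseline_values):
    """Percent change of `value` relative to the median of `baseline_values`."""
    baseline = median(baseline_values)
    if baseline == 0:
        raise ValueError("baseline median is zero; percent change is undefined")
    return (value - baseline) / baseline * 100.0


# Hypothetical visit counts for five baseline Mondays (median is 1000),
# followed by one later Monday with 750 visits.
baseline_mondays = [980, 1000, 1020, 1010, 990]
print(percent_change_from_baseline(750, baseline_mondays))  # -25.0
```

A value of -25 in one of these columns therefore reads as "25% less activity than a typical pre-pandemic day of the same weekday."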
    
### Exploratory Analysis
Describe what work you have done so far and include the code. This may include descriptive statistics, graphs and charts, and preliminary models.

- We removed columns that were irrelevant to what we want to predict and combined columns that belong together, such as the separate date and time columns.


### Challenges
Describe any challenges you've encountered so far. Let me know if there's anything you need help with!

- It was challenging to decide which data to include, since our problem is specific to New York City.
- Choosing the transformations for each dataset was also a challenge: every dataset has many columns, and we had to identify the ones that weren't relevant to our problem.
- For some columns, such as segmentId in the Traffic Volume dataset, we are still unsure whether to keep or remove them.
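
One way to approach the keep-or-remove question is to look at each column's share of missing values and share of distinct values, which are the same ratios the cleaning cells print separately. The following is a minimal sketch under assumptions: `profile_columns` is a hypothetical helper, and the toy frame only stands in for the real Traffic Volume data.

```python
import pandas as pd


def profile_columns(df):
    """Summarize each column's missing-value share and distinct-value share."""
    n_rows = len(df)
    return pd.DataFrame({
        'missing_ratio': df.isnull().sum() / n_rows,
        'unique_ratio': df.nunique() / n_rows,  # nunique ignores missing values
    })


# Toy stand-in for the traffic data (values are made up for illustration).
toy = pd.DataFrame({
    'SegmentID': [101, 101, 102, 102],
    'RequestID': [1, 2, 3, 4],
    'Vol': [12, 15, None, 9],
})
profile = profile_columns(toy)
print(profile)
```

A column whose `unique_ratio` is close to 1 behaves like a row identifier and is unlikely to help a model. A column with repeated values, as segmentId has in this toy frame, may still be useful as a grouping key for locations.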

### Future Work
Describe what work you are planning to complete for the final analysis.

- Use the cleaned data as inputs to models such as Logistic Regression (for classification) and Linear Regression (for predicting continuous values).
- Make predictions with the trained models and compute accuracy scores to answer our questions.
- Identify the most accurate model, and graph/chart the data to understand it better for future predictions.
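
Accuracy is defined for class labels, while a model that predicts a continuous value is usually scored with an error metric instead. The sketch below shows both calculations in plain Python so the difference is visible. The helper names and the sample labels and values are hypothetical, and in practice a library implementation would likely be used.

```python
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted numeric values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Classification example: three of four made-up labels match.
print(accuracy(['high', 'low', 'low', 'high'],
               ['high', 'low', 'high', 'high']))  # 0.75

# Regression example with made-up values.
print(rmse([10, 20, 30], [12, 18, 33]))
```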

### Contributions
Describe the contributions that each group member made.
- **Daniel Aguilar-Rodriguez**
    - Researched and acquired datasets
    - Helped present ideas during brainstorming session
    - Created Jupyter notebook and helped clean datasets
    - Helped transform datasets and removed columns irrelevant to our work
    <br><br>
- **Jia Cong Lin**
    - Helped present ideas during brainstorming session
    - Helped define necessary columns for the mobility dataset
    - Assisted in determining columns to clean and define 
    <br><br>
- **Anvinh Truong**
    - Helped clean and define some columns for the datasets and dictionary
    - Helped present ideas during brainstorming session
    - Assisted in thinking of procedure to clean data columns

In [98]:
import pandas as pd
import numpy as np
import os
import requests

In [99]:
def download_data(csv_name):
    # Map each dataset name to its source CSV URL.
    url_dict = {'air_quality': 'https://data.cityofnewyork.us/api/views/c3uy-2p5r/rows.csv',
                'mobility_global': 'https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv',
                'traffic_volume': 'https://data.cityofnewyork.us/api/views/7ym2-wayt/rows.csv'}

    response = requests.get(url_dict[csv_name])
    # Fail on an HTTP error instead of saving an error page as a CSV.
    response.raise_for_status()
    path = f'datasets/{csv_name}.csv'
    with open(path, 'wb') as f:
        f.write(response.content)

In [100]:
def csv_exists(csv_name):
    path = f'datasets/{csv_name}.csv'
    file_exists = os.path.exists(path)
    return file_exists

In [101]:
def create_df(csv_name):
    if not csv_exists(csv_name):
        download_data(csv_name)
    path = f'datasets/{csv_name}.csv'
    df = pd.read_csv(path)
    return df

In [102]:
def mkdir_if_not_exist():
    # Create the datasets directory; exist_ok avoids an error if it already exists.
    os.makedirs('datasets', exist_ok=True)

In [103]:
def create_all_df(csv_names):
    mkdir_if_not_exist()
    df_list = []
    
    for csv_name in csv_names:
        print(f'Creating {csv_name} df')
        df = create_df(csv_name)
        df_list.append(df)
        
    return df_list

In [None]:
csv_names = ['air_quality', 'mobility_global', 'traffic_volume']

air_quality, mobility_global, traffic_volume = create_all_df(csv_names)

Creating air_quality df
Creating mobility_global df


Creating traffic_volume df


## Air Quality Dataset Cleaning

In [None]:
print(air_quality.isnull().sum() / len(air_quality))

In [None]:
air_quality = air_quality.drop(['Message'], axis=1)
print(air_quality.isnull().sum() / len(air_quality))

In [None]:
print(air_quality.nunique() / len(air_quality))

In [None]:
air_quality = air_quality.drop(['Unique ID'], axis=1)
print(air_quality.shape)
print(air_quality.nunique() / len(air_quality))

In [None]:
air_quality = air_quality.drop(['Geo Join ID'], axis=1)
print(air_quality.shape)
print(air_quality.nunique() / len(air_quality))

In [None]:
air_quality.dtypes

In [None]:
air_quality.nunique()

In [None]:
air_quality['Time Period'].unique()

In [None]:
air_quality['Time Period'].value_counts() / len(air_quality)

In [None]:
air_quality['Start_Date'].unique()

In [None]:
# Parse the date strings; pandas infers the format by default.
air_quality['Start_Date'] = pd.to_datetime(air_quality['Start_Date'])

In [None]:
air_quality['Start_Date'].min()

In [None]:
air_quality['Start_Date'].value_counts().sort_index() / len(air_quality)

## Traffic Volume Dataset Cleaning

In [None]:
traffic_volume.sample(10)

In [None]:
traffic_volume.shape

In [None]:
print(traffic_volume.isnull().sum() / len(traffic_volume))

In [None]:
print(traffic_volume.nunique() / len(traffic_volume))

In [None]:
traffic_volume.nunique()

In [None]:
traffic_volume.dtypes

In [None]:
traffic_volume.Yr.min()

In [None]:
traffic_volume = traffic_volume[traffic_volume['Yr'] >= 2005]

In [None]:
traffic_volume.shape

In [None]:
traffic_volume['Yr'].value_counts().sort_index()

In [None]:
# Copy the filtered rows so the date_time column added later is not written to a slice.
traffic_volume = traffic_volume[traffic_volume['Yr'] > 2008].copy()

In [None]:
traffic_volume.shape

In [None]:
# Combine the separate year, month, day, hour, and minute columns into one timestamp.
traffic_volume['date_time'] = pd.to_datetime(dict(year=traffic_volume.Yr,
                                                  month=traffic_volume.M,
                                                  day=traffic_volume.D,
                                                  hour=traffic_volume.HH,
                                                  minute=traffic_volume.MM))

In [None]:
traffic_volume = traffic_volume.drop(['Yr', 'M', 'D', 'HH', 'MM'], axis=1)

In [None]:
traffic_volume.sample(10)

In [None]:
traffic_volume['date_time'].dt.year.value_counts().sort_index()

## 2020 Mobility Dataset Cleaning

In [None]:
mobility_global.head(10).transpose()

In [None]:
mobility_global.shape

In [None]:
# Keep only the five NYC counties; na=False treats rows with no county as non-matches.
mobility_nyc = mobility_global[mobility_global['country_region_code'].eq('US') &
                               mobility_global['sub_region_1'].eq('New York') &
                               mobility_global['sub_region_2'].str.contains('Bronx|Kings|New York|Queens|Richmond', na=False)]

In [None]:
mobility_nyc.sample(10).transpose()

In [None]:
mobility_nyc.shape

In [None]:
mobility_nyc.isnull().sum() / len(mobility_nyc)

In [None]:
mobility_nyc = mobility_nyc.drop(['metro_area', 'iso_3166_2_code'], axis=1)

In [None]:
mobility_nyc.isnull().sum()

In [None]:
mobility_nyc = mobility_nyc.drop(['country_region_code', 'country_region', 'sub_region_1'], axis=1)

In [None]:
mobility_nyc.sample(10)

In [None]:
mobility_nyc.groupby(['sub_region_2', 'census_fips_code'])['place_id'].nunique()

In [None]:
mobility_nyc = mobility_nyc.drop(['census_fips_code', 'place_id'], axis=1)

In [None]:
mobility_nyc.sample(10).transpose()

In [None]:
mobility_nyc.shape

In [None]:
mobility_nyc.dtypes

## Transformed Datasets

In [None]:
traffic_volume.head(15)

In [None]:
air_quality.sample(10)

In [None]:
mobility_nyc.sample(10)