### Obtaining the Data
The data comes from the provided [TLC Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) page. The site also offers an API, but I chose to scrape the files linked on this page because of the time constraint and ease of access.

In [1]:
#import packages
from bs4 import BeautifulSoup
import requests 
import pandas as pd

In [2]:
# Get list of links from website provided
url = 'https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser') # create soup of content, naming the parser explicitly

Before choosing which files to download, I wanted to get an idea of the files available. To do so, I:
1. Created a list of all the links found on this page
2. Identified the links that contained keywords indicating they were data files
3. Created a table of the URLs and corresponding data

In [3]:
# create list of links for data to scrape
sources = ['yellow', 'green', 'fhv', 'fhvhv']
keyword = 'trip+data'
url_list = []

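for link in soup.find_all('a'):
    href = link.get('href')
    if not href:  # skip anchors that have no href attribute
        continue
    for source in sources:
        if (keyword in href) and (source+'_' in href):
            year = href[-11:][:4]   # 'YYYY' from the trailing 'YYYY-MM.csv'
            month = href[-6:][:2]   # 'MM' from the trailing 'MM.csv'
            filename = source + '-' + year + '-' + month + '.csv'
            url_list.append([source, month, year, href, filename])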
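The slicing above depends on every link ending in exactly `YYYY-MM.csv`. As a hedged alternative, a regular expression can pull out the same fields and reject links that do not fit. This is a minimal sketch: the `_tripdata_` part of the pattern is an assumption about the file names, and `parse_trip_url` and the example URL are hypothetical.

```python
import re

# Assumed file-name shape: "<source>_tripdata_<YYYY>-<MM>.csv" at the end of the URL.
FILE_PATTERN = re.compile(
    r'(?P<source>[a-z]+)_tripdata_(?P<year>\d{4})-(?P<month>\d{2})\.csv$'
)

def parse_trip_url(url):
    """Return (source, year, month) for a trip-data URL, or None if it does not match."""
    match = FILE_PATTERN.search(url)
    if match is None:
        return None
    return match.group('source'), match.group('year'), match.group('month')

# Hypothetical URL used only to illustrate the pattern.
example = 'https://example.com/trip+data/yellow_tripdata_2021-01.csv'
print(parse_trip_url(example))  # ('yellow', '2021', '01')
```

Returning `None` for non-matching links makes it easy to skip anything on the page that is not a monthly data file.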

In [4]:
# Saving a list of files for use in my analysis notebook
file_info = pd.DataFrame(url_list, columns = ['source', 'month', 'year', 'url', 'filename'])
file_info.to_csv('file_info.csv', index=False)
file_info.head()

| Unnamed: 0 | source | month | year | url | filename |
|---|---|---|---|---|---|
| 0 | yellow | 1 | 2021 | https://s3.amazonaws.com/nyc-tlc/trip+data/yel... | yellow-2021-01.csv |
| 1 | green | 1 | 2021 | https://s3.amazonaws.com/nyc-tlc/trip+data/gre... | green-2021-01.csv |
| 2 | fhv | 1 | 2021 | https://nyc-tlc.s3.amazonaws.com/trip+data/fhv... | fhv-2021-01.csv |
| 3 | fhvhv | 1 | 2021 | https://nyc-tlc.s3.amazonaws.com/trip+data/fhv... | fhvhv-2021-01.csv |
| 4 | yellow | 2 | 2021 | https://s3.amazonaws.com/nyc-tlc/trip+data/yel... | yellow-2021-02.csv |


The table above makes it easy to organize the available files and choose which ones to download. I chose to use data from 2017 to the present. According to the [TLC Trip Records User Guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf), 2017 is the year the TLC started receiving drop-off locations for the FHV data, and we need data that includes the zone information. This range should also give enough data to establish a pre-COVID snapshot for 2017 through 2019.
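The same year cutoff can also be applied directly to the file table, which is convenient for checking what will be downloaded before starting. A minimal sketch, assuming the `year` column holds strings as produced by the scraping step; the rows and URLs here are illustrative placeholders, not real TLC links.

```python
import pandas as pd

# Illustrative rows only; the URLs are placeholders.
file_info = pd.DataFrame(
    [
        ['yellow', '01', '2016', 'https://example.com/yellow_2016-01.csv', 'yellow-2016-01.csv'],
        ['fhv', '01', '2017', 'https://example.com/fhv_2017-01.csv', 'fhv-2017-01.csv'],
        ['green', '02', '2021', 'https://example.com/green_2021-02.csv', 'green-2021-02.csv'],
    ],
    columns=['source', 'month', 'year', 'url', 'filename'],
)

target_year = 2017
# The scraped year is a string, so convert it to int before comparing.
selected = file_info[file_info['year'].astype(int) >= target_year]
print(selected['filename'].tolist())  # ['fhv-2017-01.csv', 'green-2021-02.csv']
```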

In [None]:
# download files for years 2017 to now
import os

target_year = 2017
os.makedirs('data', exist_ok=True)  # make sure the output folder exists
for row in url_list:
    if int(row[2]) >= target_year:
        path = 'data/' + row[4]
        response = requests.get(row[3])
        with open(path, 'wb') as file:
            file.write(response.content)

Since the download code takes a while to run, I used a separate notebook to analyze the data once the files were downloaded. That notebook can be found [HERE](Analyze.ipynb).
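Because the downloads are slow, one option is to stream each file to disk in chunks and skip files that already exist, so an interrupted run can be resumed. This is a sketch under stated assumptions rather than the code used for this project: `download_file` is a hypothetical helper, and the chunk size and timeout are arbitrary choices.

```python
import os
import requests

def download_file(url, path, chunk_size=1024 * 1024, timeout=60):
    """Stream url to path. Return False if path already exists, True after a download."""
    if os.path.exists(path):
        return False  # already present from an earlier run, so skip it
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    partial = path + '.part'  # write to a temporary name first
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()  # stop on HTTP errors instead of saving an error page
        with open(partial, 'wb') as file:
            for chunk in response.iter_content(chunk_size=chunk_size):
                file.write(chunk)
    os.replace(partial, path)  # rename only after the whole file has been written
    return True
```

Writing to a `.part` file and renaming at the end means a download that fails midway is not mistaken for a complete file on the next run.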