# Data Wrangling

**Dataset I - CitiBike Trip Data**
(https://www.citibikenyc.com/system-data)

The goal of this notebook is to gather all the data needed to complete the project. The first dataset to be compiled is the trip data from CitiBike, which holds key information about each trip taken by customers of the service. For example, the start time, end station, and rider gender are recorded for each trip.

**Dataset II - Neighborhood Profiles**
(https://furmancenter.org/neighborhoods)

The next dataset describes the characteristics of each neighborhood in New York City (NYC). The data was gathered by the Furman Center for Real Estate and Urban Policy at New York University. Each dataset covers several categories of information about each neighborhood in the city. For example, two categories in the dataset are demographics and housing.

**Dataset III - Community District GeoJSON**
(https://data.cityofnewyork.us/City-Government/Community-Districts/yfnk-k7r4)

The third dataset is GeoJSON data that delineates the community districts of NYC. The data is obtained directly from NYCOpenData. *Note: What Furman Center calls Neighborhoods, NYCOpenData calls Community Districts. NYCOpenData has a different dataset called Neighborhood Tabulation Areas, which is a more granular division of the city.*

**Dataset IV - Subway Entrance GeoJSON**
(https://data.cityofnewyork.us/Transportation/Subway-Entrances/drex-xx56)

The final dataset is another GeoJSON file that holds information on the entrances of all the subway stations in the city. Again, this data is obtained directly from NYCOpenData.

## Scraping
The purpose of this section is to connect to the CitiBike S3 bucket, then extract and store all of the tripdata files in a temporary folder in the working directory. We will use the requests, zipfile, and io packages to retrieve the zipped data and extract it to the temporary folder.

*The vision for this project is that all files needed for any analysis be stored in the cloud (AWS S3), separate from the directory of the code. In the "Upload..." sections we will upload the extracted data from the temporary folder to a personal S3 bucket and then delete the temporary folder. For the remainder of the project, all data will be pulled from that S3 bucket.*
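
The retrieve-and-extract step described above can be sketched without touching the network. This is a minimal illustration, not the notebook's own code: the helper names `make_zip_bytes` and `extract_first_member` and the sample file name are assumptions made for the example, and the in-memory archive stands in for the bytes a real download would return.

```python
import io
import os
import tempfile
import zipfile


def make_zip_bytes(name: str, text: str) -> bytes:
    """Build a small zip archive in memory (stands in for a downloaded file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(name, text)
    return buffer.getvalue()


def extract_first_member(payload: bytes, folder: str) -> str:
    """Extract the first member of a zipped payload into folder and return its path."""
    # io.BytesIO lets zipfile read the bytes as if they were a file on disk
    with zipfile.ZipFile(io.BytesIO(payload), "r") as archive:
        member = archive.namelist()[0]
        return archive.extract(member, path=folder)


# Hypothetical sample data; a real payload would come from the response content
payload = make_zip_bytes("sample-tripdata.csv", "tripduration,starttime\n600,2017-05-01\n")

with tempfile.TemporaryDirectory() as tmp:
    extracted_path = extract_first_member(payload, tmp)
    extracted_name = os.path.basename(extracted_path)
    with open(extracted_path) as handle:
        first_line = handle.readline().strip()

print(extracted_name, first_line)
```

Wrapping the payload in `io.BytesIO` is what allows the archive to be opened without first saving the zip file to disk.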

In [252]:
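import io
import os
import zipfile

import requests

def unzip(zipper, datafile: str, folder: str):
    # Join with os.path.join so the check works whether or not folder ends with a separator
    if os.path.exists(os.path.join(folder, datafile)):
        print(f"Skipped: {datafile} already extracted from S3 Bucket \n")
        return None

    zipper.extract(datafile, path = folder)
    print(f"Extract Success: {datafile} unzipped and uploaded to {folder} \n")
    return None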

## Scraping the TripData from the BayWheels S3 Bucket

In [253]:
BAYWHEELS_DATA_FOLDER = "https://s3.amazonaws.com/baywheels-data/" 
MY_BAYWHEELS = os.path.join(os.getcwd(),"BayWheelsData")

In [254]:
if not os.path.exists(MY_BAYWHEELS):
    os.makedirs(MY_BAYWHEELS)

In [255]:
def bay_request(bay_bucket: str, filename: str) -> requests.models.Response:
    """Connects to the BayWheels S3 bucket and attempts to request the given file
    
    Parameters
    ----------
    bay_bucket: str
        The URL to the BayWheels S3 bucket
    filename: str
        The name of the file to be downloaded from the bucket
    
    Returns
    -------
    r: requests.models.Response
        If the connection is successful the response to the file will be returned. If not it prints an error and returns None
    """
    # The purpose of the following try block is to attempt to connect to the file in the BayWheels S3 bucket 
    # and catch the HTTP errors that may occur if the connection fails. A failed connection exits the function
    
    try:
        r = requests.get(bay_bucket + filename, stream=True)   
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        # The first block might fail due to the inconsistency of the naming convention
        # Starting in 201905 the file names changed from fordgobike -> baywheels
        # We try to connect again with the new name
        try:
            r = requests.get(bay_bucket + filename.replace('fordgobike','baywheels'), stream=True)
            r.raise_for_status()
        except requests.exceptions.HTTPError as errh: 
            print(errh)
            return None
        else:
            print(f"Request Success: {filename.replace('fordgobike','baywheels')} requested from BayWheels S3 Bucket")       
    else:
        print(f"Request Success: {filename} requested from BayWheels S3 Bucket")
    
    return r
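
The try-then-retry pattern used for renamed files can be isolated from the HTTP layer. The sketch below is an illustration under stated assumptions: `candidate_names` and `first_available` are hypothetical helpers, and a plain dictionary stands in for the bucket so that the fallback logic can be exercised offline. In the notebook the lookup would be a `requests.get` call followed by `raise_for_status`.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def candidate_names(filename: str, old: str, new: str) -> List[str]:
    """Return the original file name followed by the renamed variant."""
    return [filename, filename.replace(old, new)]


def first_available(
    names: Iterable[str],
    fetch: Callable[[str], Optional[bytes]],
) -> Tuple[Optional[str], Optional[bytes]]:
    """Try each name in order and return the first one that fetch can resolve."""
    for name in names:
        payload = fetch(name)
        if payload is not None:
            return name, payload
    # Nothing matched; the caller decides how to report the miss
    return None, None


# Hypothetical bucket contents: only the renamed file exists
fake_bucket = {"201905-baywheels-tripdata.csv.zip": b"zip-bytes"}

found_name, found_payload = first_available(
    candidate_names("201905-fordgobike-tripdata.csv.zip", "fordgobike", "baywheels"),
    fake_bucket.get,
)
missing_name, missing_payload = first_available(
    candidate_names("209901-fordgobike-tripdata.csv.zip", "fordgobike", "baywheels"),
    fake_bucket.get,
)

print(found_name, missing_name)
```

Returning the name that actually resolved makes it straightforward to log the correct file in the success message.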

In [256]:
def bay_download(r: requests.models.Response, folder: str) -> None:
    """Uses the response from the bay_request function to unzip and download the BayWheels data to the output location
    
    Parameters
    ----------
    r: requests.models.Response
        The response that was returned from the bay_request function
    folder: str
        The output location of the file download
    
    Returns
    -------
    None:
        If executed properly there should be a csv file in the specified folder.
    """        
    
    # A failed request returns None, so there is nothing to extract
    if r is None:
        return None
    
    # The with block below unzips the file and extracts it to the folder passed in.
    # The archive is named archive (not zip) to avoid shadowing the built-in zip function.
    with zipfile.ZipFile(io.BytesIO(r.content), 'r') as archive: 

        # Regardless of the change in naming conventions, the actual data appears first in every archive
        datafile = archive.namelist()[0]
        unzip(archive, datafile, folder)
        
    return None
    

In [257]:
r = bay_request(BAYWHEELS_DATA_FOLDER, f"2017-fordgobike-tripdata.csv.zip")
bay_download(r, MY_BAYWHEELS)

Request Success: 2017-fordgobike-tripdata.csv.zip requested from BayWheels S3 Bucket
Extract Success: 2017-fordgobike-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/BayWheelsData 



In [258]:
# All the datafiles have the same prefix before the .csv.zip. For example the file with the prefix 201805-fordgobike-tripdata refers to
# the file that contains all the trips for May 2018

yearlist = ["2018", "2019", "2020"]
monthlist = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

# Files renamed from fordgobike to baywheels are handled by the fallback in bay_request
for year in yearlist:
    for month in monthlist:
        r = bay_request(BAYWHEELS_DATA_FOLDER, f"{year}{month}-fordgobike-tripdata.csv.zip")
        bay_download(r, MY_BAYWHEELS)

Request Success: 201801-fordgobike-tripdata.csv.zip requested from BayWheels S3 Bucket
Extract Success: 201801-fordgobike-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/BayWheelsData 

Request Success: 201802-fordgobike-tripdata.csv.zip requested from BayWheels S3 Bucket
Extract Success: 201802-fordgobike-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/BayWheelsData 

Request Success: 201803-fordgobike-tripdata.csv.zip requested from BayWheels S3 Bucket
Extract Success: 201803-fordgobike-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/BayWheelsData 

Request Success: 201804-fordgobike-tripdata.csv.zip requested from BayWheels S3 Bucket
Extract Success: 201804-fordgobike-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/BayWheelsData 

Request Success: 201805-fordgobike-tripdata.csv.zip requested from BayWheels S3 Bucket
Extract Success: 201805-fordgobike-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/BayWhe

## Scraping the TripData from the BlueBike S3 Bucket

In [259]:
BLUEBIKE_DATA_FOLDER = "https://s3.amazonaws.com/hubway-data/" 
MY_BLUEBIKE = os.path.join(os.getcwd(),"BlueBikeData")

In [260]:
if not os.path.exists(MY_BLUEBIKE):
    os.makedirs(MY_BLUEBIKE)

In [261]:
def blue_request(blue_bucket: str, filename: str) -> requests.models.Response:
    """Connects to the BlueBike S3 bucket and attempts to request the given file
    
    Parameters
    ----------
    blue_bucket: str
        The URL to the BlueBike S3 bucket
    filename: str
        The name of the file to be downloaded from the bucket
    
    Returns
    -------
    r: requests.models.Response
        If the connection is successful the response to the file will be returned. If not it prints an error and returns None
    """
    # The purpose of the following try block is to attempt to connect to the file in the BlueBike S3 bucket 
    # and catch the HTTP errors that may occur if the connection fails. A failed connection exits the function
    
    try:
        r = requests.get(blue_bucket + filename, stream=True)   
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        # The first block might fail due to the inconsistency of the naming convention
        # Starting in 201805 the file names changed from hubway -> bluebikes
        # We try to connect again with the new name
        try:
            r = requests.get(blue_bucket + filename.replace('hubway','bluebikes'), stream=True)
            r.raise_for_status()
        except requests.exceptions.HTTPError as errh: 
            print(errh)
            return None
        else:
            print(f"Request Success: {filename.replace('hubway','bluebikes')} requested from BlueBike S3 Bucket")       
    else:
        print(f"Request Success: {filename} requested from BlueBike S3 Bucket")
    
    return r

In [262]:
def blue_download(r: requests.models.Response, folder: str) -> None:
    """Uses the response from the blue_request function to unzip and download the BlueBike data to the output location
    
    Parameters
    ----------
    r: requests.models.Response
        The response that was returned from the blue_request function
    folder: str
        The output location of the file download
    
    Returns
    -------
    None:
        If executed properly there should be a csv file in the specified folder.
    """
    
    # A failed request returns None, so there is nothing to extract
    if r is None:
        return None
    
    # The with block below unzips the file and extracts it to the folder passed in.
    # The archive is named archive (not zip) to avoid shadowing the built-in zip function.
    with zipfile.ZipFile(io.BytesIO(r.content), 'r') as archive: 
        
        # Regardless of the change in naming conventions, the actual data appears first in every archive
        datafile = archive.namelist()[0]
        unzip(archive, datafile, folder)

    return None
    

In [263]:
yearlist = ['2011','2012','2013','2014_1','2014_2']

for year in yearlist:
    r = blue_request(BLUEBIKE_DATA_FOLDER, f"hubway_Trips_{year}.csv")
    if r is None:   # The request failed, so there is nothing to write
        continue

    # These early files are plain CSVs, so the response body is written directly to the BlueBike folder
    with open(os.path.join(MY_BLUEBIKE, f"{year}-hubway-tripdata.csv"), 'wb') as csv_file:
        csv_file.write(r.content)

Request Success: hubway_Trips_2011.csv requested from BlueBike S3 Bucket
Request Success: hubway_Trips_2012.csv requested from BlueBike S3 Bucket
Request Success: hubway_Trips_2013.csv requested from BlueBike S3 Bucket
Request Success: hubway_Trips_2014_1.csv requested from BlueBike S3 Bucket
Request Success: hubway_Trips_2014_2.csv requested from BlueBike S3 Bucket


In [264]:
# All the datafiles have the same prefix before the .zip. For example the file with the prefix 201505-hubway-tripdata refers to
# the file that contains all the trips for May 2015

yearlist = ["2015","2016","2017","2018", "2019", "2020"]
monthlist = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

# Files renamed from hubway to bluebikes are handled by the fallback in blue_request
for year in yearlist:
    for month in monthlist:
        r = blue_request(BLUEBIKE_DATA_FOLDER, f"{year}{month}-hubway-tripdata.zip")
        blue_download(r, MY_BLUEBIKE)

Request Success: 201501-hubway-tripdata.zip requested from BlueBike S3 Bucket
Extract Success: 201501-hubway-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/BlueBikeData 

Request Success: 201502-hubway-tripdata.zip requested from BlueBike S3 Bucket
Extract Success: 201502-hubway-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/BlueBikeData 

Request Success: 201503-hubway-tripdata.zip requested from BlueBike S3 Bucket
Extract Success: 201503-hubway-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/BlueBikeData 

Request Success: 201504-hubway-tripdata.zip requested from BlueBike S3 Bucket
Extract Success: 201504-hubway-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/BlueBikeData 

Request Success: 201505-hubway-tripdata.zip requested from BlueBike S3 Bucket
Extract Success: 201505-hubway-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/BlueBikeData 

Request Success: 201506-hubway-tripdata.zip requested from B

## Scraping the TripData from the Capital Bikeshare S3 Bucket

In [265]:
CAPITAL_DATA_FOLDER = "https://s3.amazonaws.com/capitalbikeshare-data/" 
MY_CAPITAL = os.path.join(os.getcwd(),"CapitalData")

In [266]:
if not os.path.exists(MY_CAPITAL):
    os.makedirs(MY_CAPITAL)

In [267]:
def capital_request(capital_bucket: str, filename: str) -> requests.models.Response:
    """Connects to the Capital Bikeshare S3 bucket and attempts to request the given file
    
    Parameters
    ----------
    capital_bucket: str
        The URL to the Capital Bikeshare S3 bucket
    filename: str
        The name of the file to be downloaded from the bucket
    
    Returns
    -------
    r: requests.models.Response
        If the connection is successful the response to the file will be returned. If not it prints an error and returns None
    """
    # The purpose of the following try block is to attempt to connect to the file in the Capital Bikeshare S3 bucket 
    # and catch the HTTP errors that may occur if the connection fails. A failed connection exits the function
    
    try:
        r = requests.get(capital_bucket + filename, stream=True)   
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print(errh)
        return None   
    else:
        print(f"Request Success: {filename} requested from Capital S3 Bucket")
    
    return r

In [268]:
def capital_download(r: requests.models.Response, folder: str) -> None:
    """Uses the response from the capital_request function to unzip and download the Capital Bikeshare data to the output location
    
    Parameters
    ----------
    r: requests.models.Response
        The response that was returned from the capital_request function
    folder: str
        The output location of the file download
    
    Returns
    -------
    None:
        If executed properly there should be one or more csv files in the specified folder.
    """
    
    # A failed request returns None, so there is nothing to extract
    if r is None:
        return None
    
    # The with block below unzips the file and extracts it to the folder passed in.
    with zipfile.ZipFile(io.BytesIO(r.content), 'r') as archive: 
        
        # Some yearly archives hold one file per quarter (for example 2012Q1-...), marked by a 'Q' after the year
        if archive.namelist()[0][4] == 'Q':
            for datafile in archive.namelist():
                unzip(archive, datafile, folder)
        else:
            datafile = archive.namelist()[0]
            unzip(archive, datafile, folder)
    return None
    

In [269]:
yearlist = ["2010","2011","2012","2013","2014","2015","2016","2017"]

for year in yearlist:
    r = capital_request(CAPITAL_DATA_FOLDER, f"{year}-capitalbikeshare-tripdata.zip")
    capital_download(r, MY_CAPITAL)

Request Success: 2010-capitalbikeshare-tripdata.zip requested from Capital S3 Bucket
Extract Success: 2010-capitalbikeshare-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/CapitalData 

Request Success: 2011-capitalbikeshare-tripdata.zip requested from Capital S3 Bucket
Extract Success: 2011-capitalbikeshare-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/CapitalData 

Request Success: 2012-capitalbikeshare-tripdata.zip requested from Capital S3 Bucket
Extract Success: 2012Q1-capitalbikeshare-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/CapitalData 

Extract Success: 2012Q2-capitalbikeshare-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/CapitalData 

Extract Success: 2012Q3-capitalbikeshare-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/CapitalData 

Extract Success: 2012Q4-capitalbikeshare-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/CapitalData 

Request Success: 2013-capitalbikesh

In [270]:
# All the datafiles have the same prefix before the .zip. For example the file with the prefix 201805-capitalbikeshare-tripdata refers to
# the file that contains all the trips for May 2018

yearlist = ["2018", "2019", "2020"]
monthlist = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

for year in yearlist:
    for month in monthlist:
        r = capital_request(CAPITAL_DATA_FOLDER, f"{year}{month}-capitalbikeshare-tripdata.zip")
        capital_download(r, MY_CAPITAL)

Request Success: 201801-capitalbikeshare-tripdata.zip requested from Capital S3 Bucket
Extract Success: 201801_capitalbikeshare_tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/CapitalData 

Request Success: 201802-capitalbikeshare-tripdata.zip requested from Capital S3 Bucket
Extract Success: 201802-capitalbikeshare-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/CapitalData 

Request Success: 201803-capitalbikeshare-tripdata.zip requested from Capital S3 Bucket
Extract Success: 201803-capitalbikeshare-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/CapitalData 

Request Success: 201804-capitalbikeshare-tripdata.zip requested from Capital S3 Bucket
Extract Success: 201804-capitalbikeshare-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/CapitalData 

Request Success: 201805-capitalbikeshare-tripdata.zip requested from Capital S3 Bucket
Extract Success: 201805-capitalbikeshare-tripdata.csv unzipped and uploaded to /root/Citi

## Scraping the TripData from the CitiBike S3 Bucket

In [271]:
import requests, zipfile, io   # Needed to pull data from CitiBike S3 bucket
import os   # Needed to work with folders that will be created
import shutil

In [272]:
CITIBIKE_DATA_FOLDER = "https://s3.amazonaws.com/tripdata/" 
MY_CITIBIKE = os.path.join(os.getcwd(),"CitiBikeData")

In [273]:
if not os.path.exists(MY_CITIBIKE):
    os.makedirs(MY_CITIBIKE)

In [274]:
def citi_request(citi_bucket: str, filename: str) -> requests.models.Response:
    """Connects to CitiBike's S3 bucket and attempts to make a connection to the filename
    
    Parameters
    ----------
    citi_bucket: str
        The URL to the CitiBike S3 bucket
    filename: str
        The name of the file to be downloaded from the bucket
    
    Returns
    -------
    r: requests.models.Response
        If the connection is successful the response to the file will be returned. If not it prints an error and returns None
    """
    # The purpose of the following try block is to attempt to connect to the file in the Citibike S3 bucket 
    # and catch the different errors that may occur if the connection fails. A failed connection exits the function
    
    try:
        r = requests.get(citi_bucket + filename, stream=True)   
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        # The first block might fail due to the inconsistency of the naming convention
        # Starting in 2017 the file endings changed from .zip -> .csv.zip
        # We try to connect again with the new ending 
        try:
            r = requests.get(citi_bucket + filename[:-4] + '.csv.' + filename[-3:], stream=True)
            r.raise_for_status()
        except requests.exceptions.HTTPError as errh: 
            print(errh)
            return None
        else:
            print(f"Request Success: {filename[:-4] + '.csv.' + filename[-3:]} requested from Citibike S3 Bucket")       
    else:
        print(f"Request Success: {filename} requested from Citibike S3 Bucket")
    
    return r

In [275]:
def citi_download(r: requests.models.Response, folder: str) -> None:
    """Uses the response from the citi_request function to unzip and download the CitiBike data to the output location
    
    Parameters
    ----------
    r: requests.models.Response
        The response that was returned from the citi_request function
    folder: str
        The output location of the file download
    
    Returns
    -------
    None:
        If executed properly there should be a csv file in the specified folder.
    """
    
    # The with block below unzips the file and extracts it to the folder passed in.
    # The archive is named archive (not zip) to avoid shadowing the built-in zip function.
    with zipfile.ZipFile(io.BytesIO(r.content), 'r') as archive: 
        
        # Regardless of the change in naming conventions, the actual data appears first in every archive
        datafile = archive.namelist()[0] 
        unzip(archive, datafile, folder)

    return None
    

In [276]:
# All the datafiles have the same prefix before the .zip. For example the file with the prefix 201705-citibike-tripdata refers to
# the file that contains all the trips for May 2017

yearlist = ["2013","2014", "2015", "2016", "2017", "2018", "2019", "2020"]
monthlist = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

# Citibike starts in 201306 so there should be 404 errors for the first 5 runs
for year in yearlist:
    for month in monthlist:
        r = citi_request(CITIBIKE_DATA_FOLDER, f"{year}{month}-citibike-tripdata.zip")
        # citi_request returns None when neither file name variant is found, so those months are skipped
        if r is not None:
            citi_download(r,MY_CITIBIKE)

404 Client Error: Not Found for url: https://s3.amazonaws.com/tripdata/201301-citibike-tripdata.csv.zip
404 Client Error: Not Found for url: https://s3.amazonaws.com/tripdata/201302-citibike-tripdata.csv.zip
404 Client Error: Not Found for url: https://s3.amazonaws.com/tripdata/201303-citibike-tripdata.csv.zip
404 Client Error: Not Found for url: https://s3.amazonaws.com/tripdata/201304-citibike-tripdata.csv.zip
404 Client Error: Not Found for url: https://s3.amazonaws.com/tripdata/201305-citibike-tripdata.csv.zip
Request Success: 201306-citibike-tripdata.zip requested from Citibike S3 Bucket
Extract Success: 201306-citibike-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/CitiBikeData 

Request Success: 201307-citibike-tripdata.zip requested from Citibike S3 Bucket
Extract Success: 2013-07 - Citi Bike trip data.csv unzipped and uploaded to /root/Citi-Bike-Expansion/CitiBikeData 

Request Success: 201308-citibike-tripdata.zip requested from Citibike S3 Bucket
Extract Suc
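
The five expected 404 errors come from requesting months before the June 2013 launch. One way to avoid them is to generate only the months in the range of interest. The sketch below assumes a hypothetical helper `month_prefixes` and uses the start and end months from the loop above; it builds file names only and makes no requests.

```python
from typing import List


def month_prefixes(start: str, end: str) -> List[str]:
    """Return every YYYYMM string from start to end, inclusive."""
    year, month = int(start[:4]), int(start[4:])
    end_year, end_month = int(end[:4]), int(end[4:])

    prefixes = []
    # Tuple comparison orders first by year, then by month
    while (year, month) <= (end_year, end_month):
        prefixes.append(f"{year}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return prefixes


prefixes = month_prefixes("201306", "202012")
filenames = [f"{prefix}-citibike-tripdata.zip" for prefix in prefixes]

print(len(filenames), filenames[0], filenames[-1])
```

Each generated name could then be passed to `citi_request` in place of the nested year and month loops.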

## Scraping the TripData from the Divvy S3 Bucket

In [277]:
import re

In [278]:
DIVVY_DATA_FOLDER = "https://divvy-tripdata.s3.amazonaws.com/" 
MY_DIVVY = os.path.join(os.getcwd(),"DivvyData")

In [279]:
if not os.path.exists(MY_DIVVY):
    os.makedirs(MY_DIVVY)

In [280]:
def divvy_request(divvy_bucket: str, filename: str) -> requests.models.Response:
    """Connects to the Divvy S3 bucket and attempts to request the given file
    
    Parameters
    ----------
    divvy_bucket: str
        The URL to the Divvy S3 bucket
    filename: str
        The name of the file to be downloaded from the bucket
    
    Returns
    -------
    r: requests.models.Response
        If the connection is successful the response to the file will be returned. If not it prints an error and returns None
    """
    # The purpose of the following try block is to attempt to connect to the file in the Divvy S3 bucket 
    # and catch the HTTP errors that may occur if the connection fails. A failed connection exits the function
    
    try:
        r = requests.get(divvy_bucket + filename, stream=True)   
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print(errh)
        return None    
    else:
        print(f"Request Success: {filename} requested from BayWheels S3 Bucket")
    
    return r

In [281]:
def divvy_download(r: requests.models.Response, folder: str, year: str) -> None:
    """Uses the response from the divvy_request function to unzip and download the Divvy data to the output location
    
    Parameters
    ----------
    r: requests.models.Response
        The response that was returned from the divvy_request function
    folder: str
        The output location of the file download
    year: str
        The year of the requested file, which determines how the archive is laid out
    
    Returns
    -------
    None:
        If executed properly there should be one or more csv files in the specified folder.
    """
    
    # Note: r is not checked for None here; the calling loops rely on the resulting exception for missing files.
    # The with block below unzips the file and extracts it to the folder passed in.
    with zipfile.ZipFile(io.BytesIO(r.content), 'r') as archive: 
        if year == '2013':
            # In the 2013 archive the trip data is the third member
            datafile = archive.namelist()[2]
            unzip(archive, datafile, folder)

        elif int(year) < 2018:                    
            # Raw strings keep \d from being treated as a string escape
            for file in archive.namelist():
                if re.match(r'(Divvy_Trips_\d{4}.{3,4}.csv)$', file):
                    datafile = file
                    unzip(archive, datafile, folder)
                elif re.match(r'(Divvy_Trips_\d{4}.{3,5}.csv)$', file):
                    datafile = file
                    unzip(archive, datafile, folder)

        else:
            datafile = archive.namelist()[0]
            unzip(archive, datafile, folder)
    
    return None
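
The regular-expression filter in `divvy_download` can be tried on a list of member names without downloading anything. The sketch below is a variation, not the notebook's exact logic: it matches on the base name of each member, so trip files inside a subfolder are also selected, and it escapes the dot before `csv`. The helper `trip_members` is hypothetical. The first, second, and fourth sample names appear in the recorded output; the other two are made up for illustration.

```python
import posixpath
import re
from typing import Iterable, List

# Year, then 3 to 5 characters for the period label (for example -Q1 or _Q1Q2)
TRIP_CSV = re.compile(r"Divvy_Trips_\d{4}.{3,5}\.csv")


def trip_members(names: Iterable[str]) -> List[str]:
    """Keep the archive members whose base name looks like a trip CSV."""
    # Zip member names always use forward slashes, hence posixpath
    return [name for name in names if TRIP_CSV.fullmatch(posixpath.basename(name))]


sample = [
    "Divvy_Trips_2015-Q1.csv",
    "Divvy_Trips_2014_Q1Q2.csv",
    "Divvy_Stations_2014-Q1Q2.xlsx",  # hypothetical non-trip member
    "Divvy_Stations_Trips_2014_Q3Q4/Divvy_Trips_2014-Q4.csv",
    "README.txt",  # hypothetical non-trip member
]

selected = trip_members(sample)
print(selected)
```

Matching on the base name is a design choice that would reduce the need for the manual extraction cells further down, though longer labels such as monthly splits would still need a wider pattern.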

In [282]:
yearlist = ['2013','2014','2015','2016','2017','2018','2019','2020']

for year in yearlist:
    if year == '2013':
        r = divvy_request(DIVVY_DATA_FOLDER, f"Divvy_Stations_Trips_{year}.zip")
        divvy_download(r, MY_DIVVY, year)
    
    elif year == '2014':
        for half in ['Q1Q2','Q3Q4']:
            r = divvy_request(DIVVY_DATA_FOLDER, f"Divvy_Stations_Trips_{year}_{half}.zip")
            divvy_download(r, MY_DIVVY, year)       
    
    elif int(year) < 2018:
        for half in ['Q1Q2','Q3Q4']:
            try:
                # 404 Error for 2015_Q1Q2 (the only file that gets run in the except)
                r = divvy_request(DIVVY_DATA_FOLDER, f"Divvy_Trips_{year}_{half}.zip")
                divvy_download(r, MY_DIVVY, year)
            except AttributeError:
                # divvy_request returned None, so retry with the hyphenated file name
                r = divvy_request(DIVVY_DATA_FOLDER, f"Divvy_Trips_{year}-{half}.zip")
                divvy_download(r, MY_DIVVY, year)
    else:
        for quarter in ['Q1','Q2','Q3','Q4']:
            try:
                # 404 Errors for 2020Q2 - 2020Q4
                r = divvy_request(DIVVY_DATA_FOLDER, f"Divvy_Trips_{year}_{quarter}.zip")
                divvy_download(r, MY_DIVVY, year)
            except AttributeError:
                # divvy_request returned None for a missing quarter, so it is skipped
                pass

Request Success: Divvy_Stations_Trips_2013.zip requested from BayWheels S3 Bucket
Extract Success: Divvy_Stations_Trips_2013/Divvy_Trips_2013.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Request Success: Divvy_Stations_Trips_2014_Q1Q2.zip requested from BayWheels S3 Bucket
Extract Success: Divvy_Trips_2014_Q1Q2.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Request Success: Divvy_Stations_Trips_2014_Q3Q4.zip requested from BayWheels S3 Bucket
404 Client Error: Not Found for url: https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2015_Q1Q2.zip
Request Success: Divvy_Trips_2015-Q1Q2.zip requested from BayWheels S3 Bucket
Extract Success: Divvy_Trips_2015-Q1.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Extract Success: Divvy_Trips_2015-Q2.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Request Success: Divvy_Trips_2015_Q3Q4.zip requested from BayWheels S3 Bucket
Extract Success: Divvy_Trips_2015_Q4.csv unz

In [283]:
# There are some files from 2014 and 2016 that need to be specifically downloaded
r = divvy_request(DIVVY_DATA_FOLDER, "Divvy_Stations_Trips_2014_Q3Q4.zip")
with zipfile.ZipFile(io.BytesIO(r.content), 'r') as zip: 
    for file in zip.namelist()[2:5]:
        datafile = file
        unzip(zip, datafile, MY_DIVVY)

Request Success: Divvy_Stations_Trips_2014_Q3Q4.zip requested from BayWheels S3 Bucket
Extract Success: Divvy_Stations_Trips_2014_Q3Q4/Divvy_Trips_2014-Q3-07.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Extract Success: Divvy_Stations_Trips_2014_Q3Q4/Divvy_Trips_2014-Q3-0809.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Extract Success: Divvy_Stations_Trips_2014_Q3Q4/Divvy_Trips_2014-Q4.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 



In [284]:
r = divvy_request(DIVVY_DATA_FOLDER, "Divvy_Trips_2016_Q1Q2.zip")
with zipfile.ZipFile(io.BytesIO(r.content), 'r') as zip:
    for file in zip.namelist()[1:5]:
        datafile = file
        unzip(zip, datafile, MY_DIVVY)

Request Success: Divvy_Trips_2016_Q1Q2.zip requested from BayWheels S3 Bucket
Extract Success: Divvy_Trips_2016_Q1Q2/Divvy_Trips_2016_04.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Extract Success: Divvy_Trips_2016_Q1Q2/Divvy_Trips_2016_05.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Extract Success: Divvy_Trips_2016_Q1Q2/Divvy_Trips_2016_06.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Extract Success: Divvy_Trips_2016_Q1Q2/Divvy_Trips_2016_Q1.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 



In [285]:
# All the datafiles have the same prefix before the .zip. For example the file with the prefix 202004-divvy-tripdata refers to
# the file that contains all the trips for April 2020

yearlist = ["2020"]
monthlist = ["04", "05", "06", "07", "08", "09", "10", "11", "12"]

# From 202004 onward Divvy publishes one file per month instead of one per quarter
for year in yearlist:
    for month in monthlist:
        r = divvy_request(DIVVY_DATA_FOLDER, f"{year}{month}-divvy-tripdata.zip")
        divvy_download(r, MY_DIVVY, year)

Request Success: 202004-divvy-tripdata.zip requested from BayWheels S3 Bucket
Extract Success: 202004-divvy-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Request Success: 202005-divvy-tripdata.zip requested from BayWheels S3 Bucket
Extract Success: 202005-divvy-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Request Success: 202006-divvy-tripdata.zip requested from BayWheels S3 Bucket
Extract Success: 202006-divvy-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Request Success: 202007-divvy-tripdata.zip requested from BayWheels S3 Bucket
Extract Success: 202007-divvy-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Request Success: 202008-divvy-tripdata.zip requested from BayWheels S3 Bucket
Extract Success: 202008-divvy-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/DivvyData 

Request Success: 202009-divvy-tripdata.zip requested from BayWheels S3 Bucket
Ex

In [286]:
# Files that were extracted into subdirectories are moved into the main DivvyData folder
destination = os.path.join(os.getcwd(),'DivvyData')

In [287]:
subfolders = [os.path.join(os.getcwd(),'DivvyData','Divvy_Stations_Trips_2013'),
              os.path.join(os.getcwd(),'DivvyData','Divvy_Stations_Trips_2014_Q3Q4'),
              os.path.join(os.getcwd(),'DivvyData','Divvy_Trips_2016_Q1Q2')]

In [288]:
# Moves the files out of the subfolders
for folder in subfolders:
    for file in os.listdir(folder):
        shutil.move(os.path.join(folder,file), destination)

In [289]:
# Deletes the subfolders
for folder in subfolders:
    shutil.rmtree(folder)
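
The move-then-delete steps can be rehearsed on a throwaway directory before running them on real data. This is a minimal sketch with assumptions: `flatten_subfolders` is a hypothetical helper, the folder and file names are placeholders, and `os.rmdir` is used instead of `shutil.rmtree` so that a subfolder is removed only once it is empty.

```python
import os
import shutil
import tempfile
from typing import List


def flatten_subfolders(destination: str, subfolder_names: List[str]) -> List[str]:
    """Move every file out of the named subfolders into destination, then remove them."""
    moved = []
    for name in subfolder_names:
        folder = os.path.join(destination, name)
        for file in sorted(os.listdir(folder)):
            shutil.move(os.path.join(folder, file), destination)
            moved.append(file)
        # os.rmdir fails on a non-empty folder, which guards against deleting unmoved files
        os.rmdir(folder)
    return moved


with tempfile.TemporaryDirectory() as destination:
    # Placeholder layout: two subfolders with one file each
    for sub, file in [("sub_a", "a.csv"), ("sub_b", "b.csv")]:
        os.makedirs(os.path.join(destination, sub))
        with open(os.path.join(destination, sub, file), "w") as handle:
            handle.write("demo\n")

    moved_files = flatten_subfolders(destination, ["sub_a", "sub_b"])
    remaining = sorted(os.listdir(destination))

print(moved_files, remaining)
```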

## Scraping Neighborhood Data I - Getting the Neighborhood Codes
To download the xlsx files from the Furman Center we need the 4-character code for each community district. To get those values we'll use BeautifulSoup to scrape the dropdown menu and store the code:name pairs of each community district in a dictionary. For example, BK04: Bushwick will be an entry in the dictionary (the BK portion represents the borough of Brooklyn).

In [237]:
from bs4 import BeautifulSoup

In [238]:
# Attempt connection to the URL
HoodURL = "https://furmancenter.org/neighborhoods"
try:
    r2 = requests.get(HoodURL)
    r2.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print(errh)

In [239]:
soup = BeautifulSoup(r2.content, "html.parser")

# The website has a dropdown with all the neighborhood codes and names
hood_codes = {}
for code in soup.find_all('option')[1:]:
    hood_codes[code.text[:4]] = code.text[6:].replace("/","-").replace(" ","_")   # Neighborhood names will be used as filenames in the next section
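
The conversion from dropdown text to a code:name pair can be checked on plain strings. The sketch below assumes each option reads like "BK04: Bushwick", as in the example above; `parse_option` is a hypothetical helper that splits on the colon rather than on fixed positions, and the second sample string is invented to show the character replacements.

```python
from typing import Dict, Tuple


def parse_option(text: str) -> Tuple[str, str]:
    """Split 'CODE: Name' into a code and a filename-safe name."""
    code, _, name = text.partition(":")
    # Slashes and spaces are replaced so the name can be used in a file name
    safe_name = name.strip().replace("/", "-").replace(" ", "_")
    return code.strip(), safe_name


# The second entry is a made-up option used only to show the replacements
options = ["BK04: Bushwick", "ZZ99: Sample Area/Example Town"]

hood_codes: Dict[str, str] = dict(parse_option(text) for text in options)
print(hood_codes)
```

Splitting on the colon tolerates small differences in spacing, whereas fixed slicing assumes every option has exactly the same layout.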

## Scraping Neighborhood Data II - Getting the Neighborhood Data Files
With the neighborhood codes available, we can send requests to the Furman Center, download their Excel files, and store them in a temporary folder. This process is similar to the one used for the tripdata files, but simpler, because we don't have to deal with zipped folders. Later the data will be uploaded to the S3 bucket and then deleted from the local repository.

In [240]:
TEMP_HOOD_FOLDER = "/root/Citi-Bike-Expansion/TempHoodData/"

if not os.path.exists(TEMP_HOOD_FOLDER):
    os.makedirs(TEMP_HOOD_FOLDER)

In [241]:
def pull_hood_data(code: str, name: str, folder: str) -> None:
    """Uses the scraped neighborhood code to download the xlsx data from Furman Center
    
    Parameters
    ----------
    code: str
        The 4 character neighborhood string
    name: str
        The actual name of the neighborhood
    folder: str
        The output location of the file download
    
    Returns
    -------
    None:
        If executed properly there should be an XLSX file in the specified folder.
    """
    
    file = f"https://furmancenter.org/files/NDP/{code}_NeighborhoodDataProfile.xlsx"
    
    if os.path.exists(folder + f"{code}_{name}.xlsx"):
        print(f"Skipped: {code}_{name} already downloaded from Furman Center")
        return None
    
    try:
        r3 = requests.get(file)
        r3.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print(errh)
        return None
    else:
        print(f"Request Success: {file} from Furman Center")
    
    with open(folder + f"{code}_{name}.xlsx", 'wb') as output:
        output.write(r3.content)
    
    return None

In [242]:
for key, value in hood_codes.items():
    pull_hood_data(key, value, TEMP_HOOD_FOLDER)

Request Success: https://furmancenter.org/files/NDP/BK01_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK02_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK03_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK04_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK05_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK06_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK07_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK08_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK09_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK1
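
Because `pull_hood_data` prints the error and returns when a request fails, a quick check after the loop can confirm that every code produced a file. The following is a minimal sketch, assuming the `{code}_{name}.xlsx` naming used above. The helper `missing_downloads` is hypothetical, and the demo uses a temporary folder with made-up entries instead of the real download folder.

```python
import os
import tempfile

def missing_downloads(codes, folder):
    """Return the codes whose expected '{code}_{name}.xlsx' file is absent from folder."""
    present = set(os.listdir(folder))
    return [code for code, name in codes.items()
            if f"{code}_{name}.xlsx" not in present]

# Demo with a temporary folder and made-up entries (not real download results)
with tempfile.TemporaryDirectory() as tmp:
    demo_codes = {"BK04": "Bushwick", "ZZ99": "Example_Name"}
    open(os.path.join(tmp, "BK04_Bushwick.xlsx"), "wb").close()
    demo_missing = missing_downloads(demo_codes, tmp)
    print(demo_missing)
```

In the notebook this could be called as `missing_downloads(hood_codes, TEMP_HOOD_FOLDER)` before uploading, and any returned codes could be retried.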

## Upload TripData to Personal S3 Bucket

In [243]:
import boto3

In [244]:
# Note: To run this code with your own S3 bucket, change the following values:
# ACCESS_KEY_ID, ACCESS_SECRET_KEY, bucket, trip_prefix
# The credentials are read from environment variables so that they are never stored in the notebook.

ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
ACCESS_SECRET_KEY = os.environ['AWS_SECRET_ACCESS_KEY']

s3 = boto3.resource(
     's3',
     aws_access_key_id = ACCESS_KEY_ID,
     aws_secret_access_key = ACCESS_SECRET_KEY
)

bucket = 'williams-citibike'   # Premade bucket in S3
trip_prefix = 'TripData'   # Premade folder inside the bucket

In [245]:
def s3_upload(directory: str, prefix: str) -> None:
    """Uploads every file in directory to the bucket under trip_prefix/prefix"""
    filenames = sorted(os.listdir(directory))
    
    for key in filenames:
        s3.Bucket(bucket).Object(os.path.join(trip_prefix,prefix,key)).upload_file(os.path.join(directory,key))

In [246]:
local_data_folders = [(MY_BAYWHEELS,'BayWheels'),
                      (MY_BLUEBIKE,'BlueBike'),
                      (MY_CAPITAL, 'CapitalBike'),
                      (MY_CITIBIKE, 'CitiBike'),
                      (MY_DIVVY, 'DivvyBike')]

In [247]:
for directory, prefix in local_data_folders:
    s3_upload(directory, prefix)
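
`s3_upload` builds each object key with `os.path.join`, which uses the separator of the operating system. S3 keys use forward slashes, so this works on Linux (as in this notebook) but would produce keys containing backslashes on Windows. The following is a minimal sketch of building the keys with `posixpath.join` instead. `build_s3_keys` is a hypothetical helper that performs no upload, so it can also serve as a dry run before calling `upload_file`; the file names in the demo are made up.

```python
import os
import posixpath
import tempfile

def build_s3_keys(directory, *prefixes):
    """Pair each file in directory with the S3 key it would be uploaded to."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        local_path = os.path.join(directory, name)
        if os.path.isfile(local_path):   # skip any subfolders
            pairs.append((local_path, posixpath.join(*prefixes, name)))
    return pairs

# Demo on a temporary folder (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.csv", "a.csv"):
        open(os.path.join(tmp, name), "w").close()
    demo_keys = [key for _, key in build_s3_keys(tmp, "TripData", "DivvyBike")]
    print(demo_keys)
```

Each `(local_path, key)` pair could then be passed to `s3.Bucket(bucket).Object(key).upload_file(local_path)`.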

## Upload Neighborhood Data to Personal S3 Bucket
The purpose of this section is to upload the downloaded neighborhood files to the same personal S3 bucket.

In [248]:
hood_prefix = "HoodData"
filenames = sorted(os.listdir(TEMP_HOOD_FOLDER))

In [249]:
for key in filenames:
    s3.Bucket(bucket).Object(os.path.join(hood_prefix,key)).upload_file(TEMP_HOOD_FOLDER + key)

## Delete the Local Repositories

In [250]:
for directory, _ in local_data_folders:
    shutil.rmtree(directory)

In [251]:
shutil.rmtree(TEMP_HOOD_FOLDER)