# **Data Wrangling**
The goal of this notebook is to get all the required data needed to complete the project. The vision for this project is that all files needed for any analysis be stored in the cloud (AWS S3), separate from the directory of the code. In the "Upload..." sections we will upload the extracted data from the temporary folder to a personal S3 bucket and then delete the temporary folder. For the remainder of the project, all data will be pulled from that S3 bucket. 

**Bike Share Trip Data Datasets**

This project looks at five bike share companies: BayWheels, BlueBike, CapitalBike, CitiBike, and DivvyBike. The first five datasets need to be compiled from each company's individual S3 bucket. These datasets contain trip data and hold key information about each trip taken by customers of the service, in columns such as start time and end station.

- Dataset I - <a href="https://s3.amazonaws.com/baywheels-data/index.html"> BayWheels S3 Trip Data Bucket </a>

- Dataset II - <a href="https://s3.amazonaws.com/hubway-data/index.html"> BlueBike S3 Trip Data Bucket </a>

- Dataset III - <a href="https://s3.amazonaws.com/capitalbikeshare-data/index.html"> Capital S3 Trip Data Bucket </a>

- Dataset IV - <a href="https://s3.amazonaws.com/tripdata/index.html"> CitiBike S3 Trip Data Bucket </a>

- Dataset V - <a href="https://divvy-tripdata.s3.amazonaws.com/index.html"> DivvyBike S3 Trip Data Bucket </a>



**Neighborhood Profiles Datasets**

The next group of data contains the characteristics of each neighborhood in New York City (NYC) and San Francisco. For NYC, the data was gathered by the Furman Center for Real Estate and Urban Policy at New York University. For San Francisco, the dataset was published by the San Francisco Planning Department. Each dataset has different categories of information about the neighborhoods in its city. 

- Dataset VI - <a href = "https://furmancenter.org/neighborhoods"> New York City Neighborhood Profiles </a>

- Dataset VII - <a href = "https://default.sfplanning.org/publications_reports/SF_NGBD_SocioEconomic_Profiles/2012-2016_ACS_Profile_Neighborhoods_Final.pdf"> San Francisco Neighborhood Profiles </a>



**GeoSpatial Datasets**

Two EDA portions of the project required the geospatial boundaries of the neighborhoods in NYC and San Francisco. The GeoJSON data for NYC was obtained from NYCOpenData and it segments NYC into community districts. *Note: What Furman Center calls Neighborhoods, NYCOpenData calls Community Districts. NYCOpenData has a different dataset called Neighborhood Tabulation Areas, which is a more granular division of the city.* The GeoJSON data for San Francisco was obtained from DataSF and it segments San Francisco into Analysis Neighborhoods. *Note: DataSF has many neighborhood divisions; the one that matches the neighborhood data is called Analysis Neighborhoods.* The final GeoJSON file has the locations of all the subway station entrances in NYC.

- Dataset VIII - <a href="https://data.cityofnewyork.us/api/geospatial/yfnk-k7r4?method=export&format=GeoJSON"> NYC Community District GeoJSON File </a>

- Dataset IX - <a href="https://data.sfgov.org/api/geospatial/p5b7-5n3h?method=export&format=GeoJSON"> San Francisco Analysis Neighborhoods GeoJSON File </a>

- Dataset X - <a href="https://data.cityofnewyork.us/api/geospatial/drex-xx56?method=export&format=GeoJSON"> Subway Entrance GeoJSON File </a>


**Zip Code Datasets**

The final group of datasets contain the properties of every zipcode in the United States as well as the Core Based Statistical Areas (CBSAs) of the country.

- Dataset XI - Zipcode USA Data
- Dataset XII - <a href="https://www2.census.gov/programs-surveys/metro-micro/geographies/reference-files/2020/delineation-files/list1_2020.xls"> Delineation File </a>

- Dataset XIII -
<a href="https://www.huduser.gov/portal/datasets/usps_crosswalk.html"> USPS Zipcode Crosswalk Files</a>
    
<hr>

## **Gathering the Trip Data**
The purpose of this section is to connect to and extract all of the trip data files from the S3 buckets of the five bike share companies. The data will be stored in a temporary folder in the working directory. In a later step the files will be uploaded to our personal S3 bucket. 

#### **Subsection Structure** -  Each of the 5 subsections in this section has the same structure
<ol>
    <li> Define the path to the company's S3 bucket and create an empty directory in working directory to house the files.
    <li> Create a custom function to request the file.
    <li> Create a custom function that downloads the file contained in the request.
    <li> Use a series of for loops to get all the files available (up to 2020-12). 
</ol>
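
Step 4 can be sketched without touching the network, because the loops only need to produce one file name per year-month pair. The helper name `monthly_filenames` and its `template` argument are illustrative assumptions and are not defined anywhere in this notebook.

```python
from itertools import product


def monthly_filenames(years, template):
    """Build one file name per (year, month) pair, January through December."""
    months = [f"{m:02d}" for m in range(1, 13)]
    return [template.format(year=y, month=m) for y, m in product(years, months)]


# Example template modelled on the CitiBike naming convention
names = monthly_filenames(["2019", "2020"], "{year}{month}-citibike-tripdata.zip")
print(names[0])    # 201901-citibike-tripdata.zip
print(len(names))  # 24
```

Each generated name would then be handed to a company's request function, and the response (if any) to its download function.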

[<p style="text-align:center;font-style:italic">I Understand & Don't Need to See the Code </p>](#Skip_Gathering) 

In [None]:
import requests, zipfile, io   # Needed to pull data from the bike share S3 buckets
import os   # Needed to work with folders that will be created
import shutil   # Needed to delete the temporary folder

In [None]:
def unzip(zipper: zipfile.ZipFile, datafile: str, folder: str) -> None:
    """A helper function that is used in the *_download functions to extract and save data. 
    
    Parameters
    ----------
    zipper: zipfile.ZipFile
        The zipfile that contains the datafile to be extracted.
    datafile: str
        The name of the file to be downloaded from the zipped folder.
    folder: str
        The folder where the unzipped datafile will be stored.
        
    Returns
    -------
    None
        If successful there should be a datafile (most likely .csv) in the folder that was passed. 
    """
    
    # Join with os.path.join so the check also works when folder has no trailing separator
    if os.path.exists(os.path.join(folder, datafile)):
        print(f"Skipped: {datafile} already extracted from S3 Bucket \n")
        return None

    zipper.extract(datafile, path = folder)
    print(f"Extract Success: {datafile} unzipped and saved to {folder} \n")
    return None
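
The extract-or-skip pattern behind `unzip` can be tried offline. The following is a minimal sketch that builds a small archive in memory as a stand-in for the bytes of a real response; the file name and CSV contents are made up for illustration.

```python
import io
import os
import tempfile
import zipfile

# Build a small zip archive in memory; it stands in for the bytes in r.content
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("201705-sample-tripdata.csv", "tripduration,starttime\n600,2017-05-01\n")

folder = tempfile.mkdtemp()   # Plays the role of the temporary data folder

with zipfile.ZipFile(io.BytesIO(buffer.getvalue()), "r") as archive:
    datafile = archive.namelist()[0]             # First member of the archive
    target = os.path.join(folder, datafile)
    if not os.path.exists(target):               # Skip files extracted on an earlier run
        archive.extract(datafile, path=folder)

print(os.listdir(folder))   # ['201705-sample-tripdata.csv']
```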

### **BayWheels S3 Bucket**

In [None]:
BAYWHEELS_DATA_FOLDER = "https://s3.amazonaws.com/baywheels-data/" 
MY_BAYWHEELS = os.path.join(os.getcwd(),"BayWheelsData")

In [None]:
if not os.path.exists(MY_BAYWHEELS):
    os.makedirs(MY_BAYWHEELS)

In [None]:
def bay_request(bay_bucket: str, filename: str) -> requests.models.Response:
    """Connects to BayWheels' S3 bucket and attempts to make a connection to the filename
    
    Parameters
    ----------
    bay_bucket: str
        The URL to the BayWheels S3 bucket
    filename: str
        The name of the file to be downloaded from the bucket
    
    Returns
    -------
    r: requests.models.Response
        If the connection is successful the response to the file will be returned. If not it prints an error and returns None
    """
    
    # The purpose of the following try block is to attempt to connect to the file in the BayWheels S3 bucket 
    # and catch the HTTP errors that may occur if the connection fails. A failed connection exits the function
    
    try:
        r = requests.get(bay_bucket + filename, stream=True)   
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        # The first block might fail due to the inconsistency of the naming convention
        # Starting in 201905 the file names changed from fordgobike -> baywheels
        # We try to connect again with the new name 
        try:
            r = requests.get(bay_bucket + filename.replace('fordgobike','baywheels'), stream=True)
            r.raise_for_status()
        except requests.exceptions.HTTPError as errh: 
            print(errh)
            return None
        else:
            print(f"Request Success: {filename.replace('fordgobike','baywheels')} requested from BayWheels S3 Bucket")       
    else:
        print(f"Request Success: {filename} requested from BayWheels S3 Bucket")
    
    return r
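
The retry with an alternative file name used in `bay_request` generalizes to "try each candidate name until one answers". Below is a hedged sketch of that idea: `first_available`, `FakeResponse` and `fake_get` are hypothetical names introduced here, and the fake getter replaces the network so the cell runs offline.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_available(bucket_url: str, candidates: Iterable[str], get: Callable) -> Optional[Tuple[str, object]]:
    """Return (name, response) for the first candidate answered with HTTP 200, else None."""
    for name in candidates:
        response = get(bucket_url + name)
        if response.status_code == 200:
            return name, response
    return None


class FakeResponse:
    """Stand-in for requests.Response so the sketch runs offline."""
    def __init__(self, status_code: int):
        self.status_code = status_code


def fake_get(url: str) -> FakeResponse:
    # Pretend that only the renamed file exists in the bucket
    return FakeResponse(200 if "baywheels-tripdata" in url else 404)


old_name = "201905-fordgobike-tripdata.csv.zip"
found = first_available(
    "https://s3.amazonaws.com/baywheels-data/",
    [old_name, old_name.replace("fordgobike", "baywheels")],
    fake_get,
)
print(found[0])   # 201905-baywheels-tripdata.csv.zip
```

With a real bucket, `requests.get` could be passed as `get`, since its responses also expose `status_code`.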

In [None]:
def bay_download(r: requests.models.Response, folder: str) -> None:
    """Uses the response from the bay_request function to unzip and download the BayWheels data to the output location
    
    Parameters
    ----------
    r: requests.models.Response
        The response that was returned from the bay_request function
    folder: str
        The output location of the file download
    
    Returns
    -------
    None:
        If executed properly there should be a csv file in the specified folder.
    """        
    
    # The with block's purpose is to unzip the file and extract it to the temporary BayWheels folder defined above.
    # The name 'archive' is used so that the built-in zip function is not shadowed
    with zipfile.ZipFile(io.BytesIO(r.content), 'r') as archive: 

        # Regardless of the change in naming conventions, the actual data appears first in every archive
        datafile = archive.namelist()[0]
        unzip(archive, datafile, folder)
        
    return None
    

In [None]:
# This one file doesn't follow the standard convention (see below)
r = bay_request(BAYWHEELS_DATA_FOLDER, f"2017-fordgobike-tripdata.csv.zip")
bay_download(r, MY_BAYWHEELS)

In [None]:
# All the datafiles have the same prefix before the .csv.zip. For example the file with the prefix 201805-fordgobike-tripdata refers to
# the file that contains all the trips for May 2018

yearlist = ["2018", "2019", "2020"]
monthlist = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

# A month without a published file produces an HTTP error; bay_request then returns None and the month is skipped
for year in yearlist:
    for month in monthlist:
        r = bay_request(BAYWHEELS_DATA_FOLDER, f"{year}{month}-fordgobike-tripdata.csv.zip")
        if r is not None:
            bay_download(r, MY_BAYWHEELS)

[<p style="text-align:center;font-style:italic">After Seeing it Once I Understand the Structure </p>](#Skip_Gathering) 

### **BlueBike S3 Bucket**

In [None]:
BLUEBIKE_DATA_FOLDER = "https://s3.amazonaws.com/hubway-data/" 
MY_BLUEBIKE = os.path.join(os.getcwd(),"BlueBikeData")

In [None]:
if not os.path.exists(MY_BLUEBIKE):
    os.makedirs(MY_BLUEBIKE)

In [None]:
def blue_request(blue_bucket: str, filename: str) -> requests.models.Response:
    """Connects to BlueBike's S3 bucket and attempts to make a connection to the filename
    
    Parameters
    ----------
    blue_bucket: str
        The URL to the BlueBike S3 bucket
    filename: str
        The name of the file to be downloaded from the bucket
    
    Returns
    -------
    r: requests.models.Response
        If the connection is successful the response to the file will be returned. If not it prints an error and returns None
    """
    
    # The purpose of the following try block is to attempt to connect to the file in the BlueBike S3 bucket 
    # and catch the HTTP errors that may occur if the connection fails. A failed connection exits the function
    
    try:
        r = requests.get(blue_bucket + filename, stream=True)   
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        # The first block might fail due to the inconsistency of the naming convention
        # Starting in 201805 the file names changed from hubway -> bluebikes
        # We try to connect again with the new name 
        try:
            r = requests.get(blue_bucket + filename.replace('hubway','bluebikes'), stream=True)
            r.raise_for_status()
        except requests.exceptions.HTTPError as errh: 
            print(errh)
            return None
        else:
            print(f"Request Success: {filename.replace('hubway','bluebikes')} requested from BlueBike S3 Bucket")       
    else:
        print(f"Request Success: {filename} requested from BlueBike S3 Bucket")
    
    return r

In [None]:
def blue_download(r: requests.models.Response, folder: str) -> None:
    """Uses the response from the blue_request function to unzip and download the BlueBike data to the output location
    
    Parameters
    ----------
    r: requests.models.Response
        The response that was returned from the blue_request function
    folder: str
        The output location of the file download
    
    Returns
    -------
    None:
        If executed properly there should be a csv file in the specified folder.
    """
    
    # The with block's purpose is to unzip the file and extract it to the temporary BlueBike folder defined above.
    # The name 'archive' is used so that the built-in zip function is not shadowed
    with zipfile.ZipFile(io.BytesIO(r.content), 'r') as archive: 
        
        # Regardless of the change in naming conventions, the actual data appears first in every archive
        datafile = archive.namelist()[0]
        unzip(archive, datafile, folder)

    return None
    

In [None]:
yearlist = ['2011','2012','2013','2014_1','2014_2']

# For these years the data isn't stored as a zip file, it's stored as a csv
for year in yearlist:
    r = blue_request(BLUEBIKE_DATA_FOLDER, f"hubway_Trips_{year}.csv")
    if r is None:   # Skip files that could not be requested
        continue
    # Write into the BlueBike folder created above instead of a hard-coded absolute path
    with open(os.path.join(MY_BLUEBIKE, f"{year}-hubway-tripdata.csv"), 'wb') as csv_file:
        csv_file.write(r.content)

In [None]:
# All the datafiles have the same prefix before the .zip. For example the file with the prefix 201705-hubway-tripdata refers to
# the file that contains all the trips for May 2017

yearlist = ["2015","2016","2017","2018", "2019", "2020"]
monthlist = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

# A month without a published file produces an HTTP error; blue_request then returns None and the month is skipped
for year in yearlist:
    for month in monthlist:
        r = blue_request(BLUEBIKE_DATA_FOLDER, f"{year}{month}-hubway-tripdata.zip")
        if r is not None:
            blue_download(r, MY_BLUEBIKE)

### **CapitalBike S3 Bucket**

In [None]:
CAPITAL_DATA_FOLDER = "https://s3.amazonaws.com/capitalbikeshare-data/" 
MY_CAPITAL = os.path.join(os.getcwd(),"CapitalData")

In [None]:
if not os.path.exists(MY_CAPITAL):
    os.makedirs(MY_CAPITAL)

In [None]:
def capital_request(capital_bucket: str, filename: str) -> requests.models.Response:
    """Connects to Capital's S3 bucket and attempts to make a connection to the filename
    
    Parameters
    ----------
    capital_bucket: str
        The URL to the Capital S3 bucket
    filename: str
        The name of the file to be downloaded from the bucket
    
    Returns
    -------
    r: requests.models.Response
        If the connection is successful the response to the file will be returned. If not it prints an error and returns None
    """
    # The purpose of the following try block is to attempt to connect to the file in the CapitalBike S3 bucket 
    # and catch the HTTP errors that may occur if the connection fails. A failed connection exits the function
    
    try:
        r = requests.get(capital_bucket + filename, stream=True)   
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print(errh)
        return None   
    else:
        print(f"Request Success: {filename} requested from Capital S3 Bucket")
    
    return r

In [None]:
def capital_download(r: requests.models.Response, folder: str) -> None:
    """Uses the response from the capital_request function to unzip and download the Capital data to the output location
    
    Parameters
    ----------
    r: requests.models.Response
        The response that was returned from the capital_request function
    folder: str
        The output location of the file download
    
    Returns
    -------
    None:
        If executed properly there should be a csv file in the specified folder.
    """
    
    # The with block's purpose is to unzip the file and extract it to the temporary Capital folder defined above.
    # The name 'archive' is used so that the built-in zip function is not shadowed
    with zipfile.ZipFile(io.BytesIO(r.content), 'r') as archive: 
        
        # When the fifth character of the first member name is 'Q', every member of the archive is extracted.
        # Otherwise only the first member is extracted.
        if archive.namelist()[0][4] == 'Q':
            for datafile in archive.namelist():
                unzip(archive, datafile, folder)
        else:
            datafile = archive.namelist()[0]
            unzip(archive, datafile, folder)
        return None
    

In [None]:
yearlist = ["2010","2011","2012","2013","2014","2015","2016","2017"]

for year in yearlist:
    r = capital_request(CAPITAL_DATA_FOLDER, f"{year}-capitalbikeshare-tripdata.zip")
    if r is not None:   # capital_request returns None when the file could not be requested
        capital_download(r, MY_CAPITAL)

In [None]:
# All the datafiles have the same prefix before the .zip. For example the file with the prefix 201805-capitalbikeshare-tripdata refers to
# the file that contains all the trips for May 2018

yearlist = ["2018", "2019", "2020"]
monthlist = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

for year in yearlist:
    for month in monthlist:
        r = capital_request(CAPITAL_DATA_FOLDER, f"{year}{month}-capitalbikeshare-tripdata.zip")
        if r is not None:   # capital_request returns None when the file could not be requested
            capital_download(r, MY_CAPITAL)

### **CitiBike S3 Bucket**

In [None]:
CITIBIKE_DATA_FOLDER = "https://s3.amazonaws.com/tripdata/" 
MY_CITIBIKE = os.path.join(os.getcwd(),"CitiBikeData")

In [None]:
if not os.path.exists(MY_CITIBIKE):
    os.makedirs(MY_CITIBIKE)

In [None]:
def citi_request(citi_bucket: str, filename: str) -> requests.models.Response:
    """Connects to CitiBike's S3 bucket and attempts to make a connection to the filename
    
    Parameters
    ----------
    citi_bucket: str
        The URL to the CitiBike S3 bucket
    filename: str
        The name of the file to be downloaded from the bucket
    
    Returns
    -------
    r: requests.models.Response
        If the connection is successful the response to the file will be returned. If not it prints an error and returns None
    """
    # The purpose of the following try block is to attempt to connect to the file in the Citibike S3 bucket 
    # and catch the different errors that may occur if the connection fails. A failed connection exits the function
    
    try:
        r = requests.get(citi_bucket + filename, stream=True)   
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        # The first block might fail due to the inconsistency of the naming convention
        # Starting in 2017 the file name endings changed from .zip -> .csv.zip
        # We try to connect again with the new ending 
        try:
            r = requests.get(citi_bucket + filename[:-4] + '.csv.' + filename[-3:], stream=True)
            r.raise_for_status()
        except requests.exceptions.HTTPError as errh: 
            print(errh)
            return None
        else:
            print(f"Request Success: {filename[:-4] + '.csv.' + filename[-3:]} requested from Citibike S3 Bucket")       
    else:
        print(f"Request Success: {filename} requested from Citibike S3 Bucket")
    
    return r

In [None]:
def citi_download(r: requests.models.Response, folder: str) -> None:
    """Uses the response from the citi_request function to unzip and download the CitiBike data to the output location
    
    Parameters
    ----------
    r: requests.models.Response
        The response that was returned from the citi_request function
    folder: str
        The output location of the file download
    
    Returns
    -------
    None:
        If executed properly there should be a csv file in the specified folder.
    """
    
    # The with block's purpose is to unzip the file and extract it to the temporary CitiBike folder defined above.
    # The name 'archive' is used so that the built-in zip function is not shadowed
    with zipfile.ZipFile(io.BytesIO(r.content), 'r') as archive: 
        
        # Regardless of the change in naming conventions, the actual data appears first in every archive
        datafile = archive.namelist()[0] 
        unzip(archive, datafile, folder)

    return None
    

In [None]:
# All the datafiles have the same prefix before the .zip. For example the file with the prefix 201705-citibike-tripdata refers to
# the file that contains all the trips for May 2017

yearlist = ["2013","2014", "2015", "2016", "2017", "2018", "2019", "2020"]
monthlist = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

# Citibike starts in 201306 so there should be 404 errors for the first 5 runs. For those months
# citi_request returns None, so the download step is skipped
for year in yearlist:
    for month in monthlist:
        r = citi_request(CITIBIKE_DATA_FOLDER, f"{year}{month}-citibike-tripdata.zip")
        if r is not None:
            citi_download(r, MY_CITIBIKE)

### **DivvyBike S3 Bucket**

In [None]:
import re  # Needed to do a regex matching

In [None]:
DIVVY_DATA_FOLDER = "https://divvy-tripdata.s3.amazonaws.com/" 
MY_DIVVY = os.path.join(os.getcwd(),"DivvyData")

In [None]:
if not os.path.exists(MY_DIVVY):
    os.makedirs(MY_DIVVY)

In [None]:
def divvy_request(divvy_bucket: str, filename: str) -> requests.models.Response:
    """Connects to Divvy's S3 bucket and attempts to make a connection to the filename
    
    Parameters
    ----------
    divvy_bucket: str
        The URL to the Divvy S3 bucket
    filename: str
        The name of the file to be downloaded from the bucket
    
    Returns
    -------
    r: requests.models.Response
        If the connection is successful the response to the file will be returned. If not it prints an error and returns None
    """
    # The purpose of the following try block is to attempt to connect to the file in the Divvy S3 bucket 
    # and catch the different errors that may occur if the connection fails. A failed connection exits the function
    
    try:
        r = requests.get(divvy_bucket + filename, stream=True)   
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print(errh)
        return None    
    else:
        print(f"Request Success: {filename} requested from Divvy S3 Bucket")
    
    return r

In [None]:
def divvy_download(r: requests.models.Response, folder: str, year: str) -> None:
    """Uses the response from the divvy_request function to unzip and download the Divvy data to the output location
    
    Parameters
    ----------
    r: requests.models.Response
        The response that was returned from the divvy_request function
    folder: str
        The output location of the file download
    year: str
        The four digit year of the requested file, which decides how the archive is handled
    
    Returns
    -------
    None:
        If executed properly there should be a csv file in the specified folder.
    """
    
    # The with block's purpose is to unzip the file and extract it to the temporary Divvy folder defined above.
    # The name 'archive' is used so that the built-in zip function is not shadowed
    with zipfile.ZipFile(io.BytesIO(r.content), 'r') as archive: 
        
        # The if-else block handles all the cases that will be encountered when trying to download files. 
        if year == '2013':
            datafile = archive.namelist()[2]
            unzip(archive, datafile, folder)

        elif int(year) < 2018:                    
            for file in archive.namelist():
                # Raw string so that \d reaches the regex engine unchanged.
                # {3,5} accepts a suffix of three to five characters between the year and .csv
                if re.match(r'(Divvy_Trips_\d{4}.{3,5}.csv)$', file):
                    unzip(archive, file, folder)

        else:
            datafile = archive.namelist()[0]
            unzip(archive, datafile, folder)
    
    return None
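
To see which archive members the `{3,5}` pattern from `divvy_download` selects, it can be run against a few names. The member names below are hypothetical and were chosen only to exercise the pattern; they are not a listing of the real archives.

```python
import re

# The {3,5} pattern from divvy_download, written as a raw string
TRIP_FILE = re.compile(r"(Divvy_Trips_\d{4}.{3,5}.csv)$")

# Hypothetical member names
members = [
    "Divvy_Trips_2016_Q3.csv",
    "Divvy_Trips_2017_Q3Q4.csv",
    "Divvy_Stations_2016_Q3.csv",
    "README.txt",
]

trip_files = [name for name in members if TRIP_FILE.match(name)]
print(trip_files)   # ['Divvy_Trips_2016_Q3.csv', 'Divvy_Trips_2017_Q3Q4.csv']
```

Station files and documentation are ignored because `re.match` anchors the pattern at the start of the name.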

In [None]:
yearlist = ['2013','2014','2015','2016','2017','2018','2019','2020']

for year in yearlist:
    # The naming convention changes between years and this if-else block handles all the cases
    if year == '2013':
        r = divvy_request(DIVVY_DATA_FOLDER, f"Divvy_Stations_Trips_{year}.zip")
        divvy_download(r, MY_DIVVY, year)
    
    elif year == '2014':
        for half in ['Q1Q2','Q3Q4']:
            r = divvy_request(DIVVY_DATA_FOLDER, f"Divvy_Stations_Trips_{year}_{half}.zip")
            divvy_download(r, MY_DIVVY, year)       
    
    elif int(year) < 2018:
        for half in ['Q1Q2','Q3Q4']:
            r = divvy_request(DIVVY_DATA_FOLDER, f"Divvy_Trips_{year}_{half}.zip")
            if r is None:
                # 404 Error for 2015_Q1Q2, the only file that needs the hyphenated name
                r = divvy_request(DIVVY_DATA_FOLDER, f"Divvy_Trips_{year}-{half}.zip")
            if r is not None:
                divvy_download(r, MY_DIVVY, year)
    else:
        for quarter in ['Q1','Q2','Q3','Q4']:
            # 404 Errors for 2020Q2 - 2020Q4, which are skipped
            r = divvy_request(DIVVY_DATA_FOLDER, f"Divvy_Trips_{year}_{quarter}.zip")
            if r is not None:
                divvy_download(r, MY_DIVVY, year)

In [None]:
# There are some files from 2014 and 2016 that need to be specifically downloaded
r = divvy_request(DIVVY_DATA_FOLDER, "Divvy_Stations_Trips_2014_Q3Q4.zip")
with zipfile.ZipFile(io.BytesIO(r.content), 'r') as zip: 
    for file in zip.namelist()[2:5]:
        datafile = file
        unzip(zip, datafile, MY_DIVVY)

In [None]:
# There are some files from 2014 and 2016 that need to be specifically downloaded
r = divvy_request(DIVVY_DATA_FOLDER, "Divvy_Trips_2016_Q1Q2.zip")
with zipfile.ZipFile(io.BytesIO(r.content), 'r') as zip:
    for file in zip.namelist()[1:5]:
        datafile = file
        unzip(zip, datafile, MY_DIVVY)

In [None]:
# All the datafiles have the same prefix before the .zip. For example the file with the prefix 202005-divvy-tripdata refers to
# the file that contains all the trips for May 2020

yearlist = ["2020"]
monthlist = ["04", "05", "06", "07", "08", "09", "10", "11", "12"]

# The monthly files are requested from 202004 onward; a failed request returns None and is skipped
for year in yearlist:
    for month in monthlist:
        r = divvy_request(DIVVY_DATA_FOLDER, f"{year}{month}-divvy-tripdata.zip")
        if r is not None:
            divvy_download(r, MY_DIVVY, year)

#### **Move Files into the Main DivvyBike Temporary Directory**

Three files were extracted into subfolders instead of the main temporary folder. In this sub-subsection we will move those files into the main temporary directory with the other files. 
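
The move-then-delete idea can also be written so that it discovers the subfolders itself. This is a sketch under the assumption that every direct subfolder of the data directory should be flattened; `flatten` is a hypothetical helper, and the demonstration uses a throwaway directory with a made-up file.

```python
import os
import shutil
import tempfile


def flatten(root: str) -> None:
    """Move every entry of each direct subfolder of root into root, then delete the subfolder."""
    for entry in os.listdir(root):
        subfolder = os.path.join(root, entry)
        if os.path.isdir(subfolder):
            for name in os.listdir(subfolder):
                shutil.move(os.path.join(subfolder, name), root)
            shutil.rmtree(subfolder)


# Demonstration on a throwaway directory with a hypothetical layout
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "Divvy_Trips_2016_Q1Q2"))
with open(os.path.join(root, "Divvy_Trips_2016_Q1Q2", "Divvy_Trips_2016_04.csv"), "w") as f:
    f.write("trip_id\n")

flatten(root)
print(os.listdir(root))   # ['Divvy_Trips_2016_04.csv']
```

`shutil.move` raises an error if a file with the same name already exists in the destination, so nothing is overwritten silently.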

In [None]:
# Destination for the files that were extracted into subfolders: the main DivvyData folder
destination = os.path.join(os.getcwd(),'DivvyData')

In [None]:
subfolders = [os.path.join(os.getcwd(),'DivvyData','Divvy_Stations_Trips_2013'),
              os.path.join(os.getcwd(),'DivvyData','Divvy_Stations_Trips_2014_Q3Q4'),
              os.path.join(os.getcwd(),'DivvyData','Divvy_Trips_2016_Q1Q2')]

In [None]:
# Moves the files out of the subfolders
for folder in subfolders:
    for file in os.listdir(folder):
        shutil.move(os.path.join(folder,file), destination)

In [None]:
# Deletes the subfolders
for folder in subfolders:
    shutil.rmtree(folder)

<hr>
<a id="Skip_Gathering"> </a>

## **Scraping NYC Neighborhood Data**
The purpose of this section is to connect to and extract all 59 of the NYC neighborhood profile files from the Furman Center. Like the trip data, the files will be stored in a temporary folder in the working directory and then uploaded to our personal S3 bucket in a later step. We will now need to add BeautifulSoup to our toolbox. 

#### **Phase I - Getting the Neighborhood Codes**
To download the xlsx files from Furman Center we need the 4 character code for each community district. To get those values we'll use the BeautifulSoup package to scrape the dropdown menu and store the code:name key-value pairs of each community in a dictionary. For example, BK04:Bushwick will be an entry in the dictionary (the BK portion represents the borough of Brooklyn).  
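
The string handling applied to each dropdown entry can be checked in isolation. The sketch below assumes the option text looks like `BK04: Bushwick`, that is, a 4 character code followed by two separator characters, which is what the slicing in the scraping cell implies. `parse_option` and the second sample string are made up for illustration.

```python
def parse_option(text: str) -> tuple:
    """Split dropdown text such as 'BK04: Bushwick' into a code and a file-safe name."""
    code = text[:4]                                        # First four characters are the district code
    name = text[6:].replace("/", "-").replace(" ", "_")    # Skip the two separator characters
    return code, name


# Sample strings are assumptions about the dropdown format, not scraped values
samples = ["BK04: Bushwick", "XX01: First Area/Second Area"]
sample_codes = dict(parse_option(text) for text in samples)
print(sample_codes)   # {'BK04': 'Bushwick', 'XX01': 'First_Area-Second_Area'}
```

Replacing slashes and spaces keeps the names usable as file names later on.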

In [None]:
from bs4 import BeautifulSoup   # Needed to parse the Furman Center website

In [None]:
# Attempt connection to the URL
HoodURL = "https://furmancenter.org/neighborhoods"
try:
    r2 = requests.get(HoodURL)
    r2.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print(errh)

In [None]:
soup = BeautifulSoup(r2.content, "html.parser")

# The website has a dropdown with all the neighborhood codes and names
hood_codes = {}
for code in soup.find_all('option')[1:]:
    hood_codes[code.text[:4]] = code.text[6:].replace("/","-").replace(" ","_")   # Neighborhood names will be used in the filenames in the next section

#### **Phase II - Getting the Neighborhood Data Files**
With the neighborhood codes available, we can send a request to the Furman Center, download their Excel files, and store them in a temporary folder. This is going to be a similar process to the trip data files, but simpler since we don't have to deal with zipped folders. Later the data will be uploaded to the S3 bucket and then deleted from the local repository.

In [None]:
# Assign only TEMP_HOOD_FOLDER here so that MY_CAPITAL keeps pointing at the CapitalData folder
TEMP_HOOD_FOLDER = os.path.join(os.getcwd(),"TempHoodData/")

if not os.path.exists(TEMP_HOOD_FOLDER):
    os.makedirs(TEMP_HOOD_FOLDER)

In [None]:
def pull_hood_data(code: str, name: str, folder: str) -> None:
    """Uses the scraped neighborhood code to download the xlsx data from Furman Center
    
    Parameters
    ----------
    code: str
        The 4 character neighborhood string
    name: str
        The actual name of the neighborhood
    folder: str
        The output location of the file download
    
    Returns
    -------
    None:
        If executed properly there should be an XLSX file in the specified folder.
    """
    
    file = f"https://furmancenter.org/files/NDP/{code}_NeighborhoodDataProfile.xlsx"
    
    if os.path.exists(folder + f"{code}_{name}.xlsx"):
        print(f"Skipped: {code}_{name} already downloaded from Furman Center")
        return None
    
    try:
        r3 = requests.get(file)
        r3.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print(errh)
        return None
    else:
        print(f"Request Success: {file} from Furman Center")
    
    with open(folder + f"{code}_{name}.xlsx", 'wb') as output:
        output.write(r3.content)
    
    return None

In [None]:
for key, value in hood_codes.items():
    pull_hood_data(key, value, TEMP_HOOD_FOLDER)

<hr>

## **Upload All the Data to our Personal S3 Bucket**

<div align="center" class="alert alert-block alert-danger">
    <b>Danger: The upload into S3 is free, but there is a cost attached to storing it in S3. </b>
</div>

#### **BikeShare Trip Data Uploads**

In [None]:
import boto3   # AWS SDK Needed to work with our Bucket

In [None]:
# Note: This code has to be executed with your own S3 bucket by changing the following string values:
# ACCESS_KEY_ID, ACCESS_SECRET_KEY, bucket, trip_prefix 

ACCESS_KEY_ID = ''
ACCESS_SECRET_KEY = ''

s3 = boto3.resource(
     's3',
     aws_access_key_id = ACCESS_KEY_ID,
     aws_secret_access_key = ACCESS_SECRET_KEY
)

bucket = 'williams-citibike'   # Premade bucket in S3
trip_prefix = 'TripData'   # Premade folder inside the bucket

In [None]:
def s3_upload(directory: str, prefix: str) -> None:
    """Goes through a temporary folder and uploads its files to the S3 bucket
    
    Parameters
    ----------
    directory: str
        The path to the directory that has the downloaded files
    prefix: str
        The name of the 'folder' in S3 where the data will be transferred to
    
    Returns
    -------
    None:
        If executed properly all the files will be within the S3 bucket in the prefix 'folder'
    """
        
    filenames = sorted(os.listdir(directory))
    
    for key in filenames:
        # S3 keys use forward slashes on every platform, so the key is built without os.path.join
        s3.Bucket(bucket).Object(f"{trip_prefix}/{prefix}/{key}").upload_file(os.path.join(directory,key))

In [None]:
local_data_folders = [(MY_BAYWHEELS,'BayWheels'),
                      (MY_BLUEBIKE,'BlueBike'),
                      (MY_CAPITAL, 'CapitalBike'),
                      (MY_CITIBIKE, 'CitiBike'),
                      (MY_DIVVY, 'DivvyBike')]

In [None]:
for directory, prefix in local_data_folders:
    s3_upload(directory, prefix)

#### **NYC Neighborhood Data Uploads**

In [None]:
hood_prefix = "HoodData"
filenames = sorted([file for file in os.listdir(TEMP_HOOD_FOLDER)])

In [None]:
for key in filenames:
    # S3 keys use forward slashes on every platform, so the key is built without os.path.join
    s3.Bucket(bucket).Object(f"{hood_prefix}/{key}").upload_file(os.path.join(TEMP_HOOD_FOLDER, key))

#### **Manual Uploads**
The remaining files were manually downloaded and uploaded to S3 using the management console. ***Note: It isn't recommended that you change the 4 character codes for files: SX01, ZX01, ZX02, ZX03. They are in the HoodData folder with the other 59 NYC files and changing the codes will require major changes to the code in future notebooks.***

Inside the HoodData "Folder" within the Bucket:
- San Francisco Profiles = "SX01_SanFran-Neighborhoods-Data.pdf"
- Zipcodes USA = "ZX01_Zipcodes-USA.csv"
- Delineations = "ZX02_Delineation.xls"
- USPS Crosswalk = "ZX03_USPS-Crosswalk.xlsx"

Inside the GeoSpatial "Folder" within the Bucket:
- NYC GeoJSON File = "NYC-Neighborhoods.geojson"
- San Francisco GeoJSON File = "San-Francisco-Neighborhoods.geojson"
- Subway Entrance GeoJSON File = "MTA-Subway-Entrances.geojson"
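
The layout above can be written down as the object keys that later notebooks would read. This is a sketch only: it assumes that both "folders" sit at the top level of the bucket, which the text states for HoodData through `hood_prefix` but not for GeoSpatial.

```python
import posixpath

# Folder -> file names, copied from the two lists above
manual_uploads = {
    "HoodData": [
        "SX01_SanFran-Neighborhoods-Data.pdf",
        "ZX01_Zipcodes-USA.csv",
        "ZX02_Delineation.xls",
        "ZX03_USPS-Crosswalk.xlsx",
    ],
    "GeoSpatial": [
        "NYC-Neighborhoods.geojson",
        "San-Francisco-Neighborhoods.geojson",
        "MTA-Subway-Entrances.geojson",
    ],
}

# S3 keys always use forward slashes, so posixpath.join is used instead of os.path.join
keys = [posixpath.join(folder, name) for folder, names in manual_uploads.items() for name in names]
for key in keys:
    print(key)
```

Each key has the same form as the ones passed to `Object(...)` in the upload cells, so the manual step could be scripted the same way if the files were available locally.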



<hr>

## **Delete the Local Repositories**

The purpose of this section is to clear out our local repository now that we have uploaded all the files to our S3 bucket. 
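
Because `shutil.rmtree` raises an error when the folder no longer exists, a small guard makes the cleanup safe to re-run. The helper name `remove_folder` is an assumption for illustration, and the demonstration uses a scratch directory rather than the real data folders.

```python
import os
import shutil
import tempfile


def remove_folder(path: str) -> bool:
    """Delete path and everything in it; return False if it was already gone."""
    if not os.path.isdir(path):
        return False
    shutil.rmtree(path)
    return True


scratch = tempfile.mkdtemp()       # Stand-in for one of the temporary data folders
print(remove_folder(scratch))      # True
print(remove_folder(scratch))      # False
```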

In [None]:
for directory, name in local_data_folders:
    shutil.rmtree(directory)

In [None]:
shutil.rmtree(TEMP_HOOD_FOLDER)

<hr>

### **Project Extension**

In the future I would like to convert the download and/or upload process to a Python script that can be run outside of the Jupyter Notebook.

<div style="line-height:11px">
    <p style="text-align:right;font-style:italic;color:#c1121f"> <b> Data Science = Solving Problems = Happiness </b> </p>
    <p style="text-align:right;"> <b> Denzel S. Williams </b> </p>
</div>