# Data Wrangling

**Dataset I - CitiBike Trip Data**
(https://www.citibikenyc.com/system-data)

The goal of this notebook is to gather all the data needed to complete the project. The first dataset to be compiled is the trip data from CitiBike, which holds key information about each trip taken by customers of the service. For example, the start time, end station, and rider gender are recorded for each trip.

**Dataset II - Neighborhood Profiles**
(https://furmancenter.org/neighborhoods)

The next dataset describes the characteristics of each neighborhood in New York City (NYC). The data was gathered by the Furman Center for Real Estate and Urban Policy at New York University. It covers several categories of information about each neighborhood in the city, for example demographics and housing.

**Dataset III - Community District GeoJson**
(https://data.cityofnewyork.us/City-Government/Community-Districts/yfnk-k7r4)

The final piece of information needed is the GeoJson data that defines the boundaries of the community districts of NYC. That data is obtained directly from NYCOpenData. *Note: What Furman Center calls Neighborhoods, NYCOpenData calls Community Districts. NYCOpenData has a different dataset called Neighborhood Tabulation Areas, which is a more granular division of the city.*

## Scraping the TripData from the CitiBike S3 Bucket
The purpose of this section is to connect to the CitiBike S3 bucket, download every tripdata file, and extract each one into a temporary folder in the working directory. We will use the requests, zipfile, and io packages to retrieve the zipped data and extract it to the temporary folder.

*The vision for this project is that all files will be stored in the cloud (AWS S3), separate from the directory of the code. In the section named "Upload TripData to Personal S3 Bucket" we will upload the extracted data from the temporary folder to a personal S3 bucket and then delete the temporary folder. For the remainder of the project, all data will be pulled from that S3 bucket.*

In [1]:
import requests, zipfile, io   # Needed to pull data from CitiBike S3 bucket
import os   # Needed to work with folders that will be created

In [2]:
CITIBIKE_DATA_FOLDER = "https://s3.amazonaws.com/tripdata/"    
TEMP_BIKE_FOLDER = os.path.join(os.getcwd(),"TempBikeData")

In [3]:
# exist_ok=True makes this a no-op when the folder already exists
os.makedirs(TEMP_BIKE_FOLDER, exist_ok=True)

In [4]:
def pull_citi_data(filename: str) -> None:
    """Connects to Citibike's S3 bucket, extracts, and stores the trip data into TEMP_BIKE_FOLDER

    Parameters
    ----------
    filename : str
        The name of a file in the Citibike S3 bucket (including the .zip extension)

    Returns
    -------
    None:
        If executed properly there should be a CSV file in the TEMP_BIKE_FOLDER.
    """
    
    # Attempts to connect to the file in the citibike S3 bucket and catches the different errors
    # Prints the error and returns None if the connection fails
    try:
        r = requests.get(CITIBIKE_DATA_FOLDER + filename, stream=True)   
        r.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        try:
            # Starting in 2017 bucket endings changed from .zip -> .csv.zip
            r = requests.get(CITIBIKE_DATA_FOLDER + filename[:-4] + '.csv.' + filename[-3:])
            r.raise_for_status()
        except requests.exceptions.HTTPError as errh: 
            print(errh)
            return None
        else:
            print(f"Request Success: {filename[:-4] + '.csv.' + filename[-3:]} requested from Citibike S3 Bucket")       
    except requests.exceptions.ConnectionError as errc:
        print(errc)
        return None
    except requests.exceptions.Timeout as errt:
        print(errt)
        return None
    except requests.exceptions.RequestException as err:
        print(err)
        return None
    else:
        print(f"Request Success: {filename} requested from Citibike S3 Bucket")
    
    # ==============================================================================================================
    
    # Unzips the file and extracts it to the Temporary Data Folder
    with zipfile.ZipFile(io.BytesIO(r.content), 'r') as archive:   # Named archive to avoid shadowing the built-in zip()
        
        # Regardless of the change in naming conventions, the actual data appears first in every bucket
        datafile = archive.namelist()[0] 
               
        # os.path.join is required because TEMP_BIKE_FOLDER has no trailing separator
        if os.path.exists(os.path.join(TEMP_BIKE_FOLDER, datafile)):
            print(f"Skipped: {datafile} already extracted from Citibike S3 Bucket \n")
            return None
        
        archive.extract(datafile, path = TEMP_BIKE_FOLDER)
    
    print(f"Extract Success: {datafile} unzipped and uploaded to {TEMP_BIKE_FOLDER} \n")
    return None
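
The retry inside `pull_citi_data` rebuilds the alternate file name by slicing the string. A minimal sketch of the same idea as a standalone helper, assuming `.zip` and `.csv.zip` are the only two naming patterns in the bucket (the name `candidate_names` is hypothetical and not part of the notebook):

```python
from typing import List


def candidate_names(filename: str) -> List[str]:
    """Return the names to try, in order: the original, then the '.csv.zip' variant."""
    # Split once from the right so only the final extension is separated
    stem, ext = filename.rsplit(".", 1)
    return [filename, f"{stem}.csv.{ext}"]


print(candidate_names("201701-citibike-tripdata.zip"))
```

Listing the candidates up front would let the request logic loop over them instead of nesting one `try` block inside another.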

In [4]:
yearlist = ["2013","2014", "2015", "2016", "2017", "2018", "2019", "2020"]
monthlist = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]

In [77]:
#Citibike starts in 201306 so there should be 404 errors for the first 5 runs
for year in yearlist:
    for month in monthlist:
        pull_citi_data(f"{year}{month}-citibike-tripdata.zip")

404 Client Error: Not Found for url: https://s3.amazonaws.com/tripdata/201301-citibike-tripdata.csv.zip
404 Client Error: Not Found for url: https://s3.amazonaws.com/tripdata/201302-citibike-tripdata.csv.zip
404 Client Error: Not Found for url: https://s3.amazonaws.com/tripdata/201303-citibike-tripdata.csv.zip
404 Client Error: Not Found for url: https://s3.amazonaws.com/tripdata/201304-citibike-tripdata.csv.zip
404 Client Error: Not Found for url: https://s3.amazonaws.com/tripdata/201305-citibike-tripdata.csv.zip
Request Success: 201306-citibike-tripdata.zip requested from Citibike S3 Bucket
Extract Success: 201306-citibike-tripdata.csv unzipped and uploaded to /root/Citi-Bike-Expansion/TempTripData/ 

Request Success: 201307-citibike-tripdata.zip requested from Citibike S3 Bucket
Extract Success: 2013-07 - Citi Bike trip data.csv unzipped and uploaded to /root/Citi-Bike-Expansion/TempTripData/ 

Request Success: 201308-citibike-tripdata.zip requested from Citibike S3 Bucket
Extract S
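
The five 404 errors above come from requesting months before the service launched. One way to avoid those requests is sketched here, assuming June 2013 (`201306`) is the first published month, as the comment in the loop states. `tripdata_filenames` and `FIRST_MONTH` are hypothetical names, not part of the notebook:

```python
from itertools import product

FIRST_MONTH = "201306"  # Assumed first month with trip data (YYYYMM)


def tripdata_filenames(years, months, first_month=FIRST_MONTH):
    """Yield one zip name per year/month pair at or after first_month."""
    for year, month in product(years, months):
        # Zero-padded YYYYMM strings sort the same way as the dates they encode
        if year + month >= first_month:
            yield f"{year}{month}-citibike-tripdata.zip"


demo_months = [f"{m:02d}" for m in range(1, 13)]
demo_names = list(tripdata_filenames(["2013", "2014"], demo_months))
print(demo_names[0], len(demo_names))
# 201306-citibike-tripdata.zip 19
```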

## Scraping Neighborhood Data I - Getting the Neighborhood Codes
To download the xlsx files from the Furman Center we need the 4-character code for each community district. To get those values we'll use BeautifulSoup to scrape the dropdown menu and store the code:name pairs of each community district in a dictionary. For example, BK04: Bushwick will be an entry in the dictionary (the BK portion represents the borough of Brooklyn).

In [14]:
from bs4 import BeautifulSoup

In [15]:
# Attempt connection to the URL
HoodURL = "https://furmancenter.org/neighborhoods"
try:
    r2 = requests.get(HoodURL)
    r2.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print(errh)

In [16]:
soup = BeautifulSoup(r2.content, "html.parser")

# The website has a dropdown with all the neighborhood codes and names
hood_codes = {}
for code in soup.find_all('option')[1:]:
    hood_codes[code.text[:4]] = code.text[6:].replace("/","-").replace(" ","_")   # Neighborhood names will be used as filenames in the next section
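
The slicing above relies on every option label having the form `BK04: Bushwick`, with the name starting at index 6. A minimal sketch of the same parsing that splits on the colon instead, written as a pure function so it can be checked without a network call (`parse_option` is a hypothetical name, and the second sample label is an assumed input inferred from the dictionary output):

```python
def parse_option(text):
    """Split a label like 'BK04: Bushwick' into (code, filename-safe name)."""
    # partition returns (before, separator, after) around the first colon
    code, _, name = text.partition(":")
    safe_name = name.strip().replace("/", "-").replace(" ", "_")
    return code.strip(), safe_name


print(parse_option("BK04: Bushwick"))
print(parse_option("BK13: Coney Island"))
```

Splitting on the separator keeps working if the spacing after the colon changes, which fixed offsets would not.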

In [17]:
hood_codes

{'BK01': 'Greenpoint-Williamsburg',
 'BK02': 'Fort_Greene-Brooklyn_Heights',
 'BK03': 'Bedford_Stuyvesant',
 'BK04': 'Bushwick',
 'BK05': 'East_New_York-Starrett_City',
 'BK06': 'Park_Slope-Carroll_Gardens',
 'BK07': 'Sunset_Park',
 'BK08': 'Crown_Heights-Prospect_Heights',
 'BK09': 'South_Crown_Heights-Lefferts_Gardens',
 'BK10': 'Bay_Ridge-Dyker_Heights',
 'BK11': 'Bensonhurst',
 'BK12': 'Borough_Park',
 'BK13': 'Coney_Island',
 'BK14': 'Flatbush-Midwood',
 'BK15': 'Sheepshead_Bay',
 'BK16': 'Brownsville',
 'BK17': 'East_Flatbush',
 'BK18': 'Flatlands-Canarsie',
 'BX01': 'Mott_Haven-Melrose',
 'BX02': 'Hunts_Point-Longwood',
 'BX03': 'Morrisania-Crotona',
 'BX04': 'Highbridge-Concourse',
 'BX05': 'Fordham-University_Heights',
 'BX06': 'Belmont-East_Tremont',
 'BX07': 'Kingsbridge_Heights-Bedford',
 'BX08': 'Riverdale-Fieldston',
 'BX09': 'Parkchester-Soundview',
 'BX10': 'Throgs_Neck-Co-op_City',
 'BX11': 'Morris_Park-Bronxdale',
 'BX12': 'Williamsbridge-Baychester',
 'MN01': 'Financ

## Scraping Neighborhood Data II - Getting the Neighborhood Data Files
As with the tripdata files from S3, but with far less work, we will use requests to download the data from the Furman Center and store it in a temporary folder. From there the files will be uploaded to S3 and the temporary folder deleted.

In [12]:
TEMP_HOOD_FOLDER = "/root/Citi-Bike-Expansion/TempHoodData/"

# exist_ok=True makes this a no-op when the folder already exists
os.makedirs(TEMP_HOOD_FOLDER, exist_ok=True)

In [13]:
def pull_hood_data(code: str, name: str) -> None:
    """Uses the scraped neighborhood code to download the xlsx data from Furman Center
    
    Parameters
    ----------
    code: str
        The 4 character neighborhood string
    name: str
        The actual name of the neighborhood
    Returns
    -------
    None:
        If executed properly there should be an XLSX file in the TEMP_HOOD_FOLDER.
    """
    
    file = f"https://furmancenter.org/files/NDP/{code}_NeighborhoodDataProfile.xlsx"
    
    if os.path.exists(TEMP_HOOD_FOLDER + f"{code}_{name}.xlsx"):
        print(f"Skipped: {code}_{name} already downloaded from Furman Center")
        return None
    
    try:
        r3 = requests.get(file)
        r3.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print(errh)
        return None
    else:
        print(f"Request Success: {file} from Furman Center")
    
    with open(TEMP_HOOD_FOLDER + f"{code}_{name}.xlsx", 'wb') as output:
        output.write(r3.content)
    
    return None

In [14]:
for key, value in hood_codes.items():
    pull_hood_data(key, value)

Request Success: https://furmancenter.org/files/NDP/BK01_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK02_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK03_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK04_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK05_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK06_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK07_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK08_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK09_NeighborhoodDataProfile.xlsx from Furman Center
Request Success: https://furmancenter.org/files/NDP/BK1

## Upload TripData to Personal S3 Bucket
The purpose of this section is to take the downloaded files and upload them to my own personal S3 bucket.

In [7]:
import boto3
import shutil

In [8]:
# Note: This code can be executed with your own S3 bucket by changing the following values:
# ACCESS_KEY_ID, ACCESS_SECRET_KEY, bucket, prefix (optional)
# Credentials are read from environment variables so that they are never stored in the notebook

ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]
ACCESS_SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]

s3 = boto3.resource(
     's3',
     aws_access_key_id = ACCESS_KEY_ID,
     aws_secret_access_key = ACCESS_SECRET_KEY
)

bucket = 'williams-citibike'   # Premade bucket in S3
trip_prefix = 'TripData'   # Premade folder inside the bucket

In [11]:
filenames = sorted(os.listdir(TEMP_BIKE_FOLDER))

In [12]:
# Bucket is where you want to store the file
# Object is what you want the name of the file to be
# Upload_file is the file that you want to upload

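for key in filenames:
    # TEMP_BIKE_FOLDER has no trailing separator, so the local path must be joined with os.path.join
    # S3 keys always use "/" regardless of the operating system
    s3.Bucket(bucket).Object(f"{trip_prefix}/{key}").upload_file(os.path.join(TEMP_BIKE_FOLDER, key))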

In [13]:
shutil.rmtree(TEMP_BIKE_FOLDER)
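
The upload loop above pairs each local file with an object key inside the bucket. A minimal sketch of that pairing as a separate step, so it can be inspected before anything is sent to S3. It assumes a flat folder of files, and `build_upload_plan` is a hypothetical helper that is not part of the notebook or of boto3:

```python
import os
import posixpath
import tempfile


def build_upload_plan(folder, prefix):
    """Return (local_path, s3_key) pairs for every file in folder, sorted by name."""
    plan = []
    for name in sorted(os.listdir(folder)):
        local_path = os.path.join(folder, name)   # OS-specific separator
        if os.path.isfile(local_path):
            # posixpath.join always uses "/", which is what S3 keys expect
            plan.append((local_path, posixpath.join(prefix, name)))
    return plan


# Demonstration on a throwaway folder holding two empty files
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ("b.csv", "a.csv"):
        open(os.path.join(demo_folder, name), "w").close()
    demo_keys = [key for _, key in build_upload_plan(demo_folder, "TripData")]

print(demo_keys)
# ['TripData/a.csv', 'TripData/b.csv']
```

Each pair could then be passed to `s3.Bucket(bucket).Object(s3_key).upload_file(local_path)` as in the loop above.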

## Upload Neighborhood Data to Personal S3 Bucket
The purpose of this section is to take the downloaded neighborhood files and upload them to the same personal S3 bucket.

In [24]:
hood_prefix = "HoodData"
filenames = sorted(os.listdir(TEMP_HOOD_FOLDER))

In [25]:
for key in filenames:
    # S3 keys always use "/" regardless of the operating system
    s3.Bucket(bucket).Object(f"{hood_prefix}/{key}").upload_file(os.path.join(TEMP_HOOD_FOLDER, key))

In [26]:
shutil.rmtree(TEMP_HOOD_FOLDER)