
<p style="text-align:center;"> Bike Share USA </p>

## 0. Introduction

Every major human advancement has improved how people, things, or ideas move from one point to another. Ancient innovations such as agriculture provided a surplus of food, enabling us to stop moving and build civilizations. Present-day innovations such as the internet have taken movement to new levels. On the back of the internet, you can order a package and have it delivered by the end of the day. You can send software money, permissionlessly, from one end of the world to another. Most importantly, the internet enables the global movement of information at near-instant speeds.

When it comes to the physical transportation of people, interplanetary space travel and self-driving cars are the talk of the town. However, the greatest macro transportation revolution is happening on a micro level. Micromobility refers to a range of small, lightweight vehicles that typically operate at speeds below 15 mph and are driven by the users themselves. Micromobility devices include bicycles, e-bikes, electric scooters and skateboards, shared bicycles, and electric pedal-assisted bicycles [1].

<div style="clear: right;">
   <p style="float: right;"> 
       <figure style="float:right;text-align:center">
           <img src="./Data/Images/Report/0001.Shared-Micromobility-Graph.png" width="450" height="450" />
           <figcaption> Caption </figcaption>
       </figure>
   </p>
   <p>According to the 2019 Shared Micromobility Snapshot, published by the National Association of City Transportation Officials (NACTO), 136 million trips were taken on shared bikes, e-bikes, and scooters in 2019, up 60% from 84 million in 2018 and 288% from 35 million in 2017 [2]. Although there is no official data on transportation mode shifting, NACTO's survey data suggests that micromobility might be replacing car trips. The United States has about 19,495 incorporated cities, towns, and villages, and of those, 310 are considered at least medium-sized cities, with populations of 100,000 or more [3]. On the NACTO map, only about 130 cities have micromobility services. Imagine if there were micromobility services in all 310 of those cities. Better yet, imagine every part of the United States having micromobility services, so that instead of a sparse map, the shared micromobility map resembled a 4G LTE coverage map.
    </p>
</div>

<figure style='text-align:center'>
    <img src="./Data/Images/Report/0002.Shared-Micromobility-Map.png" style="max-width:75%;" />
    <figcaption style="text-align:center"> Caption </figcaption>
</figure>
    
<p style="text-align:center;"> <b>The goal of this project is to expand the bike sharing sector of micromobility into every zip code across the country. The question that this project is looking to answer is: How many bike share stations should be in any given zip code in the United States?</b></p> 

## I. The Data Engineering

The data used to complete the project can be broken into four major groups. The first two groups were fundamental to completing the project; the last two were only required for the Exploratory Data Analytics (EDA):

<p style="text-align:right"> <b> A. Bike Share Trip Datasets </b> </p>
The subset of zip codes that have bike stations is derived from the five largest bike sharing services in the US: Bay Wheels, Blue Bike, Capital Bikeshare, Citi Bike, and Divvy Bike. Each company hosts its trip data in S3 buckets for public use. 
These datasets hold key information about each trip taken by the services' customers. Each row represents a single trip, and the columns are properties of the trip, such as the starting station and the time the trip ended. Since the trip data includes the start and end stations, it was also used to derive the station data.

<p style="text-align:right"> <b> B. Zip Code Datasets </b> </p>
This group covers all the zip codes of the US along with the properties of each zip code, such as the total population, core-based statistical area type, and water area.
    
<p style="text-align:right"> <b> C. Geospatial Datasets </b> </p>
New York City (NYC) and San Francisco both have geospatial boundaries for the neighborhoods within them. The datasets in this group contain those geospatial multi-polygons.
    
<p style="text-align:right"> <b> D. Neighborhood Profile Datasets </b> </p>
The datasets in this group hold the demographics of the neighborhoods within New York City and San Francisco. These demographics, combined with the geospatial data, were used for two custom analyses in the EDA portion of the project. The analyses used both the station location point geometries and the Voronoi polygons of the station locations.

<p style="text-align:center;"> <b> 1.1 ETL & The Database</b> </p>

Altogether there were ten different datasets, which summed to 68 GB of data across 350+ files. To work with this data, the best course of action was to build a database. Leveraging the Amazon Web Services (AWS) cloud, an RDS database running PostgreSQL 12.5 was created on a db.t3.micro instance.

With the blank database created, it was important to first think about how the data would be used for analytics, in order to determine how it should be fed into the database. With that in mind, an Entity Relationship Diagram (ERD) was created to structure the database and guide the transformation portion of the upcoming Extract Transform Load (ETL) jobs.

<figure style="text-align:center">
    <img src="./Data/Images/Report/0003.ERD-Final.png" style="max-width:65%" />
    <figcaption style="text-align:center"> Caption </figcaption>
</figure>

<p style="text-align:right"> <b> 1.1.A Extract</b> </p>
    
The first part of the project was a series of massive ETL jobs. The data had to be extracted from websites, S3 buckets, zip folders, CSV files, PDF files, Excel files, and geospatial files. As with using RDS, one goal of the project was to leverage the basic services of AWS, so the data was extracted and indirectly uploaded to a personal S3 bucket.

Only the bike share trip data and the neighborhood profile data needed to be extracted via code. The other two groups were simple downloads. For each bike service, a custom set of functions and loops requested all the relevant files from the company’s S3 bucket, unzipped each zip folder, and then extracted and saved the relevant files within it.
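
The unzip-and-save step for a single archive might look like the minimal sketch below. It is not the project's actual code: the function name `extract_csv_members` is hypothetical, the download itself is left out, and it assumes the relevant files are the CSVs inside each archive.

```python
import io
import zipfile
from pathlib import Path


def extract_csv_members(zip_bytes: bytes, out_dir: Path) -> list[Path]:
    """Save every CSV found inside an in-memory zip archive to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            # Skip folder entries and macOS metadata (assumed to be irrelevant).
            if member.endswith("/") or member.startswith("__MACOSX/"):
                continue
            if not member.lower().endswith(".csv"):
                continue
            # Flatten any folder structure inside the archive.
            target = out_dir / Path(member).name
            target.write_bytes(archive.read(member))
            saved.append(target)
    return saved
```

In a loop over a service's files, `zip_bytes` would be the body of each downloaded archive.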
    
The NYC neighborhood data is hosted at the <a href="https://furmancenter.org/neighborhoods"> Furman Center </a>. The website has a dropdown menu for navigating to the different neighborhoods, where you can download the Excel file. Using the BeautifulSoup package, the dropdown menu was scraped to get the neighborhood codes, which could then be used to construct the request directly to each file.
    
The San Francisco data had to be pulled out of a PDF file. The file was a report from the <a href="https://default.sfplanning.org/publications_reports/SF_NGBD_SocioEconomic_Profiles/2012-2016_ACS_Profile_Neighborhoods_Final.pdf"> SF Planning Department</a>, and the data in it was structured exactly the same way for each neighborhood. Using the PyPDF2 package, I iterated through every relevant page of the report, extracting the lines of data that were required.

<p style="text-align:right"> <b> 1.1.B Transform</b> </p>

After extraction, the three main technologies used in the transformation phase were Python with the Pandas package, PostgreSQL, and the Google Cloud Platform (GCP) Geocoding API.
    
The files were going to be sent to their appropriate staging tables, which meant their formats had to align with the staging table design. This required replacing missing values, reordering columns, inserting columns, changing data types, and doing string replacements for every trip file for every service. To handle this, as in the extraction phase, custom functions were written to transform all the trip files within a bike service. Once the files were transformed, loading the data into the database took a handful of lines of code.
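
As a rough illustration of those transformation steps, here is a minimal Pandas sketch. The staging layout in `STAGING_COLUMNS`, the rename mapping, and the `user_type` labels are all hypothetical, since the report does not list the real staging-table columns.

```python
import pandas as pd

# Hypothetical staging-table layout; the real column list is not given in the report.
STAGING_COLUMNS = ["start_time", "end_time", "start_station_name",
                   "end_station_name", "user_type"]


def to_staging_format(raw: pd.DataFrame, renames: dict[str, str]) -> pd.DataFrame:
    """Align one raw trip file with the (assumed) staging-table design."""
    df = raw.rename(columns=renames)
    # Insert any column the staging table expects but this file lacks.
    for col in STAGING_COLUMNS:
        if col not in df.columns:
            df[col] = pd.NA
    # Change data types: timestamps arrive as text in the CSV files.
    for col in ("start_time", "end_time"):
        df[col] = pd.to_datetime(df[col], errors="coerce")
    # String replacements, then missing-value handling (labels are assumptions).
    df["user_type"] = (
        df["user_type"]
        .replace({"Subscriber": "member", "Customer": "casual"})
        .fillna("unknown")
    )
    # Reorder the columns to match the staging table.
    return df[STAGING_COLUMNS]
```

Each service would pass its own `renames` mapping, which keeps the per-service differences in one place.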
    
    IMPORT SAMPLE
    

Embedded within every trip is the station information for both the starting station and the ending station. With all the trip data inside the staging tables, a simple DISTINCT query returns a data frame with all the stations and the information associated with them. When pulled down from the database, most stations had latitude and longitude data, which was used to append a point geometry to the returned data frame. However, the core piece of data needed to complete the project, the zip code, didn’t exist. To get the zip codes, the geographic coordinates were sent to the GCP Geocoding API and the zip code was extracted from the response. By 2020 most stations had coordinates attached to them. If the coordinates were not available, the name of the station was sent to the GCP Geocoding API combined with Region Filtering and the zip code was extracted. If both approaches failed, the zip code was found manually through different search methods.
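
Pulling the zip code out of a geocoding response could be sketched as follows. The function name is hypothetical, and the HTTP request and API key handling are omitted. The sketch assumes the JSON layout the Geocoding API documents, where each result carries `address_components` entries tagged with `types` such as `postal_code`.

```python
from typing import Optional


def extract_zip_code(geocode_response: dict) -> Optional[str]:
    """Return the first postal code in a Geocoding API JSON response, or None."""
    for result in geocode_response.get("results", []):
        for component in result.get("address_components", []):
            if "postal_code" in component.get("types", []):
                return component.get("long_name")
    # No postal code found: the caller falls back to the next lookup method.
    return None
```

Returning `None` rather than raising makes it easy to chain the fallbacks described above: coordinates first, then the station name, then a manual search.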
    
The zip codes found were then appended to the data frame. With the zip codes in place, the station data aligned with the database design and was loaded into the corresponding station table. Using the data in the staging and stations schemas, the trip table was created within the database using a query to select the desired columns. The query also derived four new columns:

<ol>
    <li> Subtracting the starting time from the ending time of the trips returned the duration in minutes.
    <li> Using the station point geometries the distance between the starting station and the ending station was calculated in miles.
    <li> If both the duration and the distance column weren’t null, then the speed of the trip was calculated in mph.
    <li> The name of the service for each trip. This was a constant value for every trip within a trip table. 
</ol>
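
The four derived columns were produced by a database query. The Python sketch below only mirrors the arithmetic. The haversine great-circle formula is an assumption (the report does not say how the distance between point geometries was measured), and the extra check for a positive duration is added here to avoid dividing by zero.

```python
import math
from datetime import datetime
from typing import Optional, Tuple

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def derive_trip_columns(start: datetime, end: datetime,
                        start_latlon: Optional[Tuple[float, float]],
                        end_latlon: Optional[Tuple[float, float]],
                        service: str) -> dict:
    """Compute duration (min), distance (mi), speed (mph), and service name."""
    duration_min = (end - start).total_seconds() / 60
    distance_mi = None
    if start_latlon is not None and end_latlon is not None:
        distance_mi = haversine_miles(*start_latlon, *end_latlon)
    speed_mph = None
    # Speed needs both inputs; the duration > 0 guard is an added assumption.
    if distance_mi is not None and duration_min > 0:
        speed_mph = distance_mi / (duration_min / 60)
    return {"duration": duration_min, "distance": distance_mi,
            "speed": speed_mph, "service": service}
```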

    IMPORT SAMPLE
    
The NYC Neighborhood data downloaded from the Furman Center was split across two tables. The profile table contains the actual data, and the other table is a lookup table that holds the names and descriptions of the columns in the profile table.
The properties of the NYC Neighborhoods are grouped into different categories, such as Demographics and Housing Market and Conditions. In the lookup table each property is given an alias based on the group that it is in and its position in the group. For example, the percentage of people born in New York State is the first property in the demographics group, so it is aliased DEM1. The full name and description of each property get their own columns.
    
In each NYC Neighborhood Profile Excel file, the rows are the properties of the neighborhood and the columns are different years of recorded data. The project only used the most recent data, from 2018. Each of the 59 files had to be individually pivoted into a single row. The single rows were then glued together to create one data frame that was sent to the database.
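
The pivot-and-stack step can be sketched in Pandas as below. The `alias` column name, the string year labels, and the function name `profile_to_row` are assumptions made for illustration and may not match the real file layout.

```python
import pandas as pd


def profile_to_row(profile: pd.DataFrame, neighborhood: str,
                   year: str = "2018") -> pd.DataFrame:
    """Turn one property-by-year sheet into a single row for one year."""
    # Keep only the chosen year, indexed by property alias, then transpose.
    row = profile.set_index("alias")[year].to_frame().T
    row.index = [neighborhood]
    row.columns.name = None
    return row


def stack_profiles(profiles: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Glue the single-row frames together, one row per neighborhood."""
    return pd.concat([profile_to_row(df, name) for name, df in profiles.items()])
```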

<i> The zip code data did not require that many transformations before sending it to the database. The only things that were done were changing column names and filling missing values.</i>  
    

<p style="text-align:center;"> <b> 1.2 Cleaning & Updating</b> </p>

The stations were derived from the entire history of trips. The problem with this is that the ecosystem of stations in a bike sharing service is always changing. The services add new stations, remove stations, and occasionally move stations to nearby locations. It is important to know the current state of the ecosystem when making predictions about zip codes: the model shouldn’t count a retired station in a zip code's station count. Using the trip table, two new columns were added to the tables in the station schema. The birth column records the first time a station appeared in a trip and the death column records the last time a station appeared in a trip. From there, stations that had a death date within December 2020 were considered alive and their death dates were set to NULL. The birth and death columns unlocked the ability to look at the state of a service's ecosystem at any point in time.
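
A minimal Pandas sketch of the birth and death logic follows. The project did this inside the database, so the column names and the `station_lifespans` function here are illustrative assumptions only.

```python
import pandas as pd


def station_lifespans(trips: pd.DataFrame,
                      alive_from: str = "2020-12-01") -> pd.DataFrame:
    """First and last sighting per station; recently seen stations stay alive."""
    # Each trip contributes two station sightings: its start and its end.
    starts = trips[["start_station_id", "start_time"]].set_axis(
        ["station_id", "seen"], axis=1)
    ends = trips[["end_station_id", "end_time"]].set_axis(
        ["station_id", "seen"], axis=1)
    sightings = pd.concat([starts, ends], ignore_index=True)
    spans = sightings.groupby("station_id")["seen"].agg(birth="min", death="max")
    # Stations last seen in December 2020 are still alive: clear their death date.
    spans.loc[spans["death"] >= pd.Timestamp(alive_from), "death"] = pd.NaT
    return spans
```

With these two columns, the stations alive on a date `d` are those with `birth <= d` and a death that is either missing or later than `d`.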

<p style="text-align:right"> <b> 1.2.A Trip Table Cleaning - Basic Cleaning</b> </p>

The first set of cleaning on the trip table handled outliers in the speed column. Trips with speed values that were too high had to be handled differently from trips with a speed of zero. 
    
According to the services, reports, and articles, the maximum speed that a pedal-assisted e-bike can achieve is 18 mph. A conservative value of 20 mph was used to define a speed outlier. Trips with speeds over 20 mph were adjusted: the speed was set to 20 mph and the duration was recalculated, since the distance between two stations is fixed. With distance in miles and duration in minutes:

$$\frac{Distance}{\frac{Duration}{60}} = 20 \longrightarrow \text{Solving for Duration} \longrightarrow Duration = 3 \times Distance$$

Round trips have a speed of 0 mph because they start and end at the same station, which means their distance is 0. For round trips the speed was set to the average speed of 6 mph. Using the same approach as before, the speed was set to a constant value and, since the duration is known, the distance column was updated.

$$\frac{Distance}{\frac{Duration}{60}} = 6 \longrightarrow \text{Solving for Distance} \longrightarrow Distance = \frac{Duration}{10}$$
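
Both speed corrections can be expressed in a few lines. This is a sketch of the rules above rather than the query that was actually run, and it assumes distance in miles and a duration in minutes that is greater than zero.

```python
MAX_SPEED_MPH = 20.0
ROUND_TRIP_SPEED_MPH = 6.0


def clean_speed(distance_mi: float, duration_min: float) -> tuple[float, float, float]:
    """Return (distance_mi, duration_min, speed_mph) after the two speed fixes.

    Assumes duration_min > 0.
    """
    if distance_mi == 0:
        # Round trip: fix the speed at 6 mph and back out the distance.
        distance_mi = ROUND_TRIP_SPEED_MPH * duration_min / 60  # = duration / 10
        return distance_mi, duration_min, ROUND_TRIP_SPEED_MPH
    speed = distance_mi / (duration_min / 60)
    if speed > MAX_SPEED_MPH:
        # Too fast: cap the speed at 20 mph and stretch the duration.
        duration_min = 60 * distance_mi / MAX_SPEED_MPH  # = 3 * distance
        speed = MAX_SPEED_MPH
    return distance_mi, duration_min, speed
```

For example, a 2-mile trip logged as taking 3 minutes (40 mph) becomes a 6-minute trip at 20 mph, and a 30-minute round trip is assigned a distance of 3 miles.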

With the speed outliers handled the next cleaning task was to remove trips where the start time was after the end time. These trips were removed because the quantity of trips that had this error did not justify the cost associated with fixing them.

<p style="text-align:right"> <b> 1.2.B Trip Table Cleaning - Duration Outliers</b> </p>

The first issue when cleaning the duration outliers is handling the millions of rows in each of the five trip tables. Determining outliers by querying an entire table is extremely inefficient and thus required a sample. However, the built-in Bernoulli sampling of PostgreSQL was just as inefficient as querying the entire table. The samples not only had to be random, but the number of trips selected had to be large enough and the query had to be decently fast. To meet all three conditions, a sampling procedure was created to get a million rows from each trip table. <b>ADD SAMPLING PROCEDURE TO APPENDIX</b>

INSERT DISTRIBUTION IMAGE

From the boxplot, all of the duration distributions look similar in the sense that they are drastically skewed to the right. The standard deviations range from 2.9 to 19.2 hours and the 75% quantiles barely break 20 minutes, whereas the maximum values are in the thousands of minutes. Because of this skew, the mean and standard deviation are not reliable for determining outliers. The quantiles are a significantly better measure, as they are more representative of real life. Each sample has its own distribution, so there isn’t a one-size-fits-all quantile that can be used for all the samples: 66 minutes might be the 99th percentile for Service A, while the 99th percentile for Service B may be 160 minutes. The question then becomes: Do riders that use bike sharing take trips of the same length, regardless of the service? If so, is there a universal duration cutoff that could be used across all bike share services?

Asking whether riders of different bike sharing services take trips of the same length is really asking whether the duration samples are all drawn from the same underlying distribution. Comparing a single dependent variable across five different non-normal samples meets the conditions for the Kruskal-Wallis Test with the:

<ul>
    <li> Null Hypothesis $H_0$: The samples all originate from the same distribution and have the same median values.
    <li> Alternative Hypothesis $H_1$: At least one sample originates from a different distribution and has a different median. 
</ul>

The SciPy package was used to conduct this test and the p-value returned was 0, meaning there is enough evidence to reject $H_0$ at any significance level. The alternative hypothesis of the Kruskal-Wallis Test states that at least one sample originates from a different distribution, but it doesn’t say anything about how the samples compare pairwise. On a pairwise level, the Mann-Whitney U Test was used to compare the samples with the same null and alternative hypotheses as the Kruskal-Wallis Test. All 10 tests resulted in a p-value of 0, meaning that riders of different bike share companies don’t ride for the same amounts of time.
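
The sketch below shows how the two tests can be run with SciPy. The durations are synthetic log-normal draws standing in for the real samples, so the resulting statistics are illustrative only and do not reproduce the project's results.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
services = ["Bay Wheels", "Blue Bike", "Capital Bikeshare", "Citi Bike", "Divvy Bike"]
# Synthetic right-skewed durations (minutes), one sample per service.
samples = {
    name: rng.lognormal(mean=2.5 + 0.1 * i, sigma=0.8, size=2_000)
    for i, name in enumerate(services)
}

# Omnibus test: do all five samples come from the same distribution?
h_stat, p_overall = stats.kruskal(*samples.values())

# Pairwise follow-up: one two-sided Mann-Whitney U test per pair of services.
pairwise = {}
for a, b in combinations(services, 2):
    _, p_value = stats.mannwhitneyu(samples[a], samples[b], alternative="two-sided")
    pairwise[(a, b)] = p_value
```

Five services give ten unordered pairs, which matches the ten pairwise tests mentioned above.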

Although the statistical tests showed that the services do not share a duration distribution, I still wanted to find a single outlier cutoff that could work for all the services. So what we looked for is a “Quantile Threshold”. The aim was to keep no less than 95% of the trip data for each service. Therefore, we were looking for two values, one for the lower end and one for the upper end, that would retain at least 95% of the data for every service. To find those values we computed the 97.5th percentile for every service and took the highest value of the group as the upper cutoff. Similarly, the 2.5th percentile was found for every service and we took the lowest of the group as the lower cutoff. Those two values were used as “pseudo universal” cutoff values. For every service, trips with a duration above 87.3 minutes or below 2.21 minutes were removed from the trip table. Using those values, the percentage of data kept ranged from 96% to 97.5%.
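
The cutoff search can be sketched with NumPy as follows. The function names are hypothetical, and the samples would be the per-service duration arrays.

```python
import numpy as np


def universal_cutoffs(samples: dict[str, np.ndarray],
                      keep: float = 0.95) -> tuple[float, float]:
    """Lowest lower-tail quantile and highest upper-tail quantile across services."""
    tail = (1 - keep) / 2  # 2.5% in each tail when keep = 0.95
    lower = min(np.quantile(x, tail) for x in samples.values())
    upper = max(np.quantile(x, 1 - tail) for x in samples.values())
    return float(lower), float(upper)


def share_kept(x: np.ndarray, lower: float, upper: float) -> float:
    """Fraction of a sample that falls inside the cutoffs."""
    return float(np.mean((x >= lower) & (x <= upper)))
```

Taking the most permissive bound on each side is what guarantees that every service keeps at least 95% of its trips. Services with tighter distributions keep more, which is consistent with the 96% to 97.5% range reported above.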


## II. Exploratory Data Analytics

Since the focus of the project is not on trip data, only basic analysis was done on the trip data directly. The trip data was mainly used as a secondary source to highlight analysis on the station data.

<p style="text-align:center;"> <b> 2.1 Basic Exploratory Analysis for Trips</b> </p>

The first question asked was how many trips were taken per month for each service across the service’s entire lifetime. 
The number of trips taken over time for each service, except Bay Wheels, ebbs and flows. This rise and fall may be due to the changing seasons from winter to summer and back in the four cities that the services are based in (Boston, Washington D.C., New York City, and Chicago). Temperatures begin to rise in spring and reach their height in summer, and the number of riders follows this pattern. When temperatures drop going into winter, the number of trips drops as well. For Bay Wheels, this weather pattern doesn’t really apply because the winter climate there is mild and rainy, more comparable to the spring climate in the other cities.

INSERT # OF TRIPS IMAGE

In line with the NACTO report, both the peaks and the troughs for the four services that show the seasonal pattern gradually got higher over time. Higher highs and higher lows imply an uptrend, and in this context an uptrend is equivalent to increased ridership.

As with all things, COVID-19 did influence the number of trips in 2020. In the first months of 2020 the number of trips had a massive drop off going into April instead of following its usual behavior of a steady rise.  When there are lockdowns in place, people aren’t leaving their house, let alone getting on a public bicycle. 

INSERT DAILY PATTERN IMAGE

Drilling down a little further, I looked at the number of trips aggregated by hour. What stands out is that the graphs look the same for every service. At every local hour of the day (ignoring time zones), the services across the country follow the same pattern of traffic.
<ol>
    <li> During the 06:00 hour the number of trips begins to increase, hitting its peak during the 08:00 hour.
    <li> Then from 09:00 to 15:00 the number of trips taken remains stable.
    <li> This is followed by a secondary volume increase at 16:00, reaching its high during the 17:00 and 18:00 hours.
    <li> Finally, the number of trips slowly tapers off and the cycle repeats in the morning. 
</ol>   

It is likely that people are using the bike services to commute to work. This would explain the two demand spikes during the day as they coincide with rush hour times [4].

Naturally, the next question is whether the cycle repeats every day of the week or whether certain days differ from others. It turns out that, on an hourly basis, the weekend days resemble each other but differ from the weekdays. Weekends do not have the sharp rises and declines that are associated with rush hour. The number of trips gradually rises and then gradually falls, slightly resembling the path of the sun in the sky.
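
An hourly weekday versus weekend profile like the one described can be computed with a short Pandas aggregation. This sketch assumes a series of local trip start timestamps; the function name is hypothetical.

```python
import pandas as pd


def hourly_profile(start_times: pd.Series) -> pd.DataFrame:
    """Count trips per start hour, split into weekday and weekend columns."""
    frame = pd.DataFrame({
        "hour": start_times.dt.hour,
        # dayofweek: Monday = 0 ... Sunday = 6, so 5 and 6 are the weekend.
        "day_type": start_times.dt.dayofweek.map(
            lambda d: "weekend" if d >= 5 else "weekday"),
    })
    counts = frame.groupby(["hour", "day_type"]).size().unstack(fill_value=0)
    # Make sure all 24 hours and both day types are present, even if empty.
    return counts.reindex(index=range(24), columns=["weekday", "weekend"],
                          fill_value=0)
```

Because there are five weekdays and only two weekend days, dividing each column by the number of days it covers gives a fairer comparison of the two curves.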

INSERT WEEKDAY/END PATTERN


<p style="text-align:center;"> <b> 2.2 Station Based Exploratory Analysis</b> </p>

Due to the dynamic nature of the station ecosystems in a bike sharing service, the birth and death columns were added to the station table in section 1.2. Using the manufactured columns, we can examine how many stations were added and removed in any given year for any service. Citi Bike is the clear winner when it comes to growth metrics, leading in most stations at launch, most stations overall, and most stations added in a single year.  

INSERT STATION GROWTH TABLE

<p style="text-align:right"> <b> 2.2.A Popularity Contest</b> </p>


<div style="clear: right;">
   <p style="float: right;"> 
       <figure style="float:right;text-align:center">
           <img src="./Data/Images/Report/0011.Top-Stations.jpg" width="300" height="300" />
           <figcaption> Caption </figcaption>
       </figure>
   </p>
   <p>Diving deeper into each service I looked at which stations are the most popular for each service by year, in terms of all the trips that they are involved in regardless of whether it was a start or end station. 

BayWheels, CapitalBike, and DivvyBike introduced the idea of dockless bikes in 2018-06, 2020-06, and 2020-07 respectively. A dockless bike doesn't need to be retrieved from or returned to a designated station and can be unlocked/locked at any public bike rack. When riders are given that option, they take advantage of it. In 2020, dockless rides were the most popular 'station' for all three services. This doesn't prove that riders use the dockless feature more than stations; in fact, that isn't true. It only shows that the dockless bike option is desirable to riders and would be used if a bikeshare service implemented it in its ecosystem.

<div style="center">
    <p style="text-align:center;font-style:italic"> There are infinitely many dockless stations and they are all used to come up with the total for the category, meaning it is an aggregate. Therefore comparing that aggregate column to a single station is unfair. If we were to sum up the totals for all stations, comparing an aggregate to an aggregate, then rides to/from a designated station would win by a landslide. </p>
</div>

As the number 1 ranked station in a service changes over time, it is important to understand whether the station taking the number 1 spot is newly created or an old one. When a reigning champion is displaced, let's look at the displacer and see which it was.

<b>Blue Bike</b> <br>
South Station was born 4 months after the MIT and Central Square stations and still took the number 1 spot for 2015. However, MIT at Mass Ave. stayed the champion until 2020. So what happened? The MIT station is in the vicinity of the college, and when the college slowed down due to COVID-19, I'm guessing traffic to that station did too. However, Central Square is only a couple of blocks away in the same zip code. I suspect that 02139 is a high-traffic zip code overall.

<b>Capital Bike </b> <br>
    Both reigning champions were born around the same time, but for some reason in 2014 Columbus Circle gained and maintained popularity. In fact, many stations overtook Massachusetts Ave, and over the years it slowly moved out of the top 5.
    
<b> Citi Bike </b> <br>
    All the top stations in Citi Bike were born at the same time and only 8 Ave was discontinued. In 2020, West St. might only be ranked number 1, once again, due to COVID-19. Pershing Square is across the street from Grand Central Terminal and three blocks away from Times Square, which are both ultra-high-traffic areas, except when the city is shut down.

 </p>
</div>

<p style="text-align:right"> <b> 2.2.B Popularity Contest USA</b> </p>

We figured out the top stations within a service, but what about the top stations across all services? To compare stations across all services, we have to look at the number of trips at a station relative to the number of trips taken in that year. Otherwise, Citi Bike stations would win all the time.
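
One way to compute such a relative measure is sketched below. It assumes the denominator is the number of trips within the same service and year, and that a round trip counts once for its station. Neither detail is spelled out in the report, and the column names are illustrative.

```python
import pandas as pd


def station_trip_share(trips: pd.DataFrame) -> pd.DataFrame:
    """Share of a service's yearly trips that each station was involved in."""
    keys = ["service", "year"]
    starts = trips[keys + ["start_station"]].rename(
        columns={"start_station": "station"})
    # Count the end station only when it differs, so round trips count once.
    one_way = trips[trips["start_station"] != trips["end_station"]]
    ends = one_way[keys + ["end_station"]].rename(
        columns={"end_station": "station"})
    involved = pd.concat([starts, ends], ignore_index=True)
    counts = (involved.groupby(keys + ["station"]).size()
              .rename("involvements").reset_index())
    totals = trips.groupby(keys).size().rename("trips_in_year").reset_index()
    out = counts.merge(totals, on=keys)
    out["share"] = out["involvements"] / out["trips_in_year"]
    return out.sort_values("share", ascending=False, ignore_index=True)
```

Ranking by `share` instead of the raw count puts a busy station in a small system on the same footing as a busy station in a large one.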

INSERT TOP 10

The percentage of trips taken using a dockless bike on Bay Wheels hit an astronomical value of 86% in 2020. I believe that COVID-19 played a role in the overall reduction of trips in general, but I am not confident that the same reasoning can be applied to the massive spike in dockless bike use.

One thing that I can think of is that, because dockless bikes can be parked anywhere and the state was locking down, the people who WERE using the bikeshare went directly to where they needed to go, parked the bike, picked up the same bike, parked it very close to their apartment, and then repeated the process with the same bike for another trip. <i>This explanation is really grasping at straws and there isn't any data to back it up.</i>

Other than dockless stations, "old-school" CapitalBike stations (2010-2012) appeared 5 times in the top 10 stations. Why? If we look back at our station growth table, in those years CapitalBike had 106, 144, and 194 stations in its ecosystem, respectively. Those are some of the lowest numbers of stations in an ecosystem at any given time. With such a low number of stations, the more popular stations are able to capture more of the rider market, because the system isn't "decentralized". That is why CitiBike only makes its appearances between 80th and 110th place (out of 111). From its inception, CitiBike had the most stations of any ecosystem and no other service could keep up with its growth. CitiBike has nearly double the number of stations of its closest competitor. You can say that the CitiBike ecosystem is "decentralized".
    
INSERT LINK TO STATION GROWTH TABLE

<p style="text-align:right"> <b> 2.2.C Inter Zip Code Travel</b> </p>

The significance of this project is to help a bike share company expand into a new area. For such an expansion, how important is it to expand into multiple zip codes? How many zip codes should the expansion encompass? To determine this, I looked at the ratio of the number of trips that started and ended in different zip codes to the total number of trips.
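
The ratio itself is simple to compute. The sketch below assumes each trip is available as a (start zip code, end zip code) pair; the function name is hypothetical.

```python
from typing import Iterable, Tuple


def inter_zip_ratio(trips: Iterable[Tuple[str, str]]) -> float:
    """Fraction of trips whose start and end zip codes differ."""
    trips = list(trips)
    if not trips:
        return float("nan")  # no trips, so the ratio is undefined
    crossing = sum(1 for start_zip, end_zip in trips if start_zip != end_zip)
    return crossing / len(trips)
```

A ratio above 0.5 means that most trips cross a zip code boundary.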

INSERT INTER BOXPLOT

The boxplot reveals that it is extremely important for stations to be set up in multiple zip codes when planning an expansion effort. The majority of trips taken by riders, in all services, start in one zip code and end in another. It's safe to say that an expansion confined to a single zip code wouldn't be desirable to riders. To get an idea of how things should be done, let's look at how many zip codes the stations were spread across when the five services made their initial launches and how many zip codes they were in at the end of 2020.

<p style="text-align:right"> <b> 2.2.D How Many People Does Each Station Serve?</b> </p>

When people use public transportation they go to the spot that is most convenient for them. Typically, convenient means the closest. I say typically because there are times when people have to go farther to catch a bus or train that has a different route than the one closest to them. However, in the case of bike share, there is no incentive to go to a bike station that is farther away than the one closest to you.

With that being said, a station only serves the people that are closer to it than to any other station. We will define a station's service by the equation $$S(s) = \sum_{i=1}^{N}\frac{A(G_i \cap V_s)}{A(G_i)} \cdot P_i$$ where $S(s)$ is the number of people served by station $s$, $N$ is the number of neighborhoods in the region (NYC or San Francisco), $A$ is the area function, $G_i$ is the geometry polygon for neighborhood $i$, $V_s$ is the Voronoi polygon for station $s$, and $P_i$ is the population of neighborhood $i$.

<div align="center" class="alert alert-block alert-info">
    <p style="text-align:center;font-style:italic"> A simpler way to think about the formula is that we are multiplying the area of the voronoi that is in a neighborhood by the population density of that neighborhood.</p>
</div>

For every station we need to iterate through every neighborhood and, for each neighborhood, go through the steps listed below. To complete this analysis, the population data from nyc_profile and sanfran_profile in the neighborhoods schema, a series of Python functions, and built-in PostGIS functions such as ST_VoronoiPolygons were used.
<ol>
    <li> Intersect the neighborhood geometry with the voronoi geometry of the station
    <li> Find the area of that intersection
    <li> Divide that area by the area of the neighborhood geometry
    <li> Multiply that number by the population of the neighborhood
    <li> Save the number
</ol>
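
To make the steps concrete, here is a toy version of the calculation. It uses axis-aligned rectangles in place of real neighborhood and Voronoi polygons, purely so the example is self-contained. The actual analysis used PostGIS geometries, and the `Rect` type and function names below are hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple


class Rect(NamedTuple):
    """Axis-aligned rectangle standing in for a polygon."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def area(r: Rect) -> float:
    # A rectangle with non-positive width or height has zero area.
    return max(0.0, r.xmax - r.xmin) * max(0.0, r.ymax - r.ymin)


def intersection(a: Rect, b: Rect) -> Rect:
    return Rect(max(a.xmin, b.xmin), max(a.ymin, b.ymin),
                min(a.xmax, b.xmax), min(a.ymax, b.ymax))


def people_served(voronoi_cell: Rect,
                  neighborhoods: Iterable[Tuple[Rect, float]]) -> float:
    """Sum over neighborhoods of (overlap area / neighborhood area) * population."""
    total = 0.0
    for geometry, population in neighborhoods:
        overlap = area(intersection(geometry, voronoi_cell))
        total += overlap / area(geometry) * population
    return total
```

For a cell that covers half of a neighborhood of 1,000 people and half of a neighborhood of 400 people, the function returns 500 + 200 = 700 people served.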

<figure style="text-align:center">
    <img src="./Data/Images/Report/0014.NYC-Serve-Box.jpg" style="float: left; width: 30%; margin-right: 3%; margin-bottom: 0.5em;" />
    <img src="./Data/Images/Report/0014.NYC-Serve-Kernel.jpg" style="float: left; width: 30%; margin-right: 3%; margin-bottom: 0.5em;" />
    <img src="./Data/Images/Report/0014.NYC-Serve-Strip.jpg" style="float: left; width: 30%; margin-right: 3%; margin-bottom: 0.5em;" />
    <p style="clear:both;">
    <figcaption> caption </figcaption>
</figure>
Each of the three graphs above shows the number of riders served per station for NYC. They reveal that the majority of stations, about 75%, serve between 100K and 225K people. Another, smaller group serves between 225K and 350K people. NYC is divided into 5 boroughs (only 4 have stations), and the graph below shows the number of riders served on a borough basis.
    
<img src="./Data/Images/Report/0015.NYC-Serve-Borough.jpg" style="max-width:50%" />

For the San Francisco stations the same process can be repeated. In San Francisco the majority of stations serve between 10K and 60K people, again about 75% of the stations. The people served graphs are shown below. 

<figure style="text-align:center">
    <img src="./Data/Images/Report/0016.SF-Serve-Box.jpg" style="float: left; width: 30%; margin-right: 3%; margin-bottom: 0.5em;" />
    <img src="./Data/Images/Report/0016.SF-Serve-Kernel.jpg" style="float: left; width: 30%; margin-right: 3%; margin-bottom: 0.5em;" />
    <img src="./Data/Images/Report/0016.SF-Serve-Strip.jpg" style="float: left; width: 30%; margin-right: 3%; margin-bottom: 0.5em;" />
    <p style="clear:both;">
    <figcaption> caption </figcaption>
</figure>

Although the people served statistic is interesting, it isn't very useful on its own. It's impossible to tell whether a station with a higher statistic has a bigger Voronoi area or has a smaller Voronoi area in a denser part of the city. A better statistic to look at is the ratio between the riders served and the area of the Voronoi polygon. The ratios for both cities are shown in the graphs below:

<figure style="text-align:center">
    <img src="./Data/Images/Report/0017.NYC-Serve-Ratio-Box.jpg" style="float: left; width: 30%; margin-right: 3%; margin-bottom: 0.5em;" />
    <img src="./Data/Images/Report/0017.NYC-Serve-Ratio-Kernel.jpg" style="float: left; width: 30%; margin-right: 3%; margin-bottom: 0.5em;" />
    <img src="./Data/Images/Report/0017.NYC-Serve-Ratio-Borough.jpg" style="float: left; width: 30%; margin-right: 3%; margin-bottom: 0.5em;" />
    <p style="clear:both;">
    <figcaption> caption </figcaption>
</figure>

<figure style="text-align:center">
    <img src="./Data/Images/Report/0018.SF-Serve-Ratio-Box.jpg" style="float: left; width: 30%; margin-right: 3%; margin-bottom: 0.5em;" />
    <img src="./Data/Images/Report/0018.SF-Serve-Ratio-Kernel.jpg" style="float: left; width: 30%; margin-right: 3%; margin-bottom: 0.5em;" />
    <img src="./Data/Images/Report/0018.SF-Serve-Ratio-Strip.jpg" style="float: left; width: 30%; margin-right: 3%; margin-bottom: 0.5em;" />
    <p style="clear:both;">
    <figcaption> caption </figcaption>
</figure>

When we looked at just riders served, the data was really spread out; the data is much tighter when looking at the ratio between the riders served and the area of the Voronoi polygon. Regardless of the borough or location, the number of people that a station serves in NYC is rarely over 3.5 people per square meter of its Voronoi polygon. This makes practical sense, because the denser the population of an area, the more stations you need to accommodate the population. The more stations packed into one area, the smaller the Voronoi area. Although the area is small, it is still serving tons of people. This leads me to believe that population density is an extremely important factor when a company chooses the number and locations of stations in a potential expansion area. Looking at the same ratio for San Francisco, the number of people that a station serves is rarely over 0.5 people per square meter of its Voronoi polygon. New York City is bigger and denser than San Francisco, so it makes sense that the ratios of people served per square meter are larger in NYC.


<p style="text-align:right"> <b> 2.2.E Would You Have Bike Access?</b> </p>
