<a href="https://colab.research.google.com/github/georgejordan3/IBM_Capstone/blob/main/Bicycle_Cities.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bicycle Cities - Evaluating the Best Cities in the US to Ride a Bicycle

George Jordan <br>
IBM Data Science Professional Certificate Capstone <br>
Last Updated: 2-26-21

<img src="https://www.confluence-denver.com/galleries/Features/2016/Issue_164/bike_lanes_04.jpg">

Credit: [Confluence Denver](https://www.confluence-denver.com/features/denver_bike_lanes_082416.aspx)


## Introduction 
I am a competitive cyclist, and my move to Denver was partly motivated by my love of cycling. While I have a strong understanding of the riding experience here, I wonder how it compares to that of cyclists in other parts of the US. I hope to tell a detailed story of these cities through the eyes of a cyclist and through the lens of data.

In this project, I will examine a list of the highest-ranking bicycle cities in the country and apply my own analysis to gain further insight into the cities and their unique characteristics. I will use analytic tools and machine learning algorithms to see how these cities relate to each other and to identify the supporting factors that have made them accessible by bicycle.

While this project is personally interesting to me, I believe that the insights here could be useful for a variety of business applications. Perhaps a company in one of these cities is considering provisions to support bike commuters. Or perhaps bicycle infrastructure matters to a company's workplace culture, in which case choosing an office location would call for this kind of analysis.

I believe that the bicycle is a very powerful tool to not only navigate a city but to also change it, through culture and infrastructure. I hope this project illuminates some of the impact that the bicycle has had on these cities.

## Data 
### PlacesForBikes City Ratings
- [PlacesForBikes City Ratings](https://cityratings.peopleforbikes.org/)

For this project, I will first look at the PlacesForBikes ratings to see their list of top cities for bicycles. They have made their data available to the public, along with an explanation of their ranking methodology. From this data I will select a number of cities to examine further, while also studying the decisions behind the ranking.

### Foursquare
- [Foursquare](https://foursquare.com/)

I will use the Foursquare API to get an understanding of the cities selected from the rankings above. The most obvious query is how many bike shops are within the city limits, but other types of locations may be worth exploring as well. This data will be visualized geospatially using mapping libraries in Python.

### Strava
- [Strava](https://www.strava.com/about)

Strava is an app that tracks users' activity files from a variety of sports, including cycling. Strava currently has 55 million users, so there will be no shortage of data to examine in these popular cycling cities. By using the Strava API, I will be able to gain insight into where rides take place within each city and how dense they are.

### Zip Codes
- [Zip-Codes.com](https://www.zip-codes.com/)

To organize some of this data, I will need quick access to a zip code database for reference. I will use a web scraping tool to collect it in this project.

## Methodology 


### Data Importing

In [None]:
import pandas as pd

For the City Ratings, I saved the files locally and then uploaded the files into the notebook.

In [None]:
pfbr = pd.read_excel("pfbr")

With the data uploaded, getting an idea of the scope of the dataset is an important way to start structuring our analysis.

In [None]:
pfbr.describe()

Unnamed: 0,Places_ID_2020,ACS Bike-to-Work Mode Share,Land Area,Population,ACS Target,ACS Normalized Score,ACS Ridership Points,SMS Recreation Riding,SMS Points,Community Survey Ridership Score,Total Ridership Points,Average Fatalities All Mode,All Mode Fatality Rate,All Mode Fatality Points,Average Fatalities Bike,Bike Fatality Rate,Bike Fatality Points,All Mode Injuries,All Mode Injury Rate,All Mode Injury Points,Bike Injuries,Bike Injury Rate,Bike Injury Points,All mode safety points,Bike Safety Points,Community Survey Safety Score,Total Safety Points,City Snapshot Points,Community Survey Acceleration Score,Total Acceleration Points,BNA,BNA Points,Community Survey Network Score,Total Network Points,Percent Communities of Concern,Number Underserved Communities,Average BNA,BNA Underserved Communities,BNA Gap,BNA Tier,BNA Target,Distance,BNA Points.1,ACS Bike-to-Work Mode Share Men,ACS Bike-to-Work Mode Share Women,ACS Gap,ACS Tier,ACS_Target,Distance.1,ACS Points,Total Reach Points,Bonus,Points with bonus
count,567.0,567.0,567.0,567.0,567.0,567.0,567.0,566.0,566.0,415.0,567.0,562.0,562.0,562.0,562.0,528.0,528.0,49.0,49.0,49.0,50.0,50.0,50.0,562.0,528.0,415.0,563.0,84.0,415.0,419.0,567.0,567.0,415.0,567.0,567.0,567.0,567.0,567.0,567.0,400.0,400.0,400.0,400.0,567.0,567.0,567.0,532.0,532.0,532.0,532.0,567.0,567.0,567.0
mean,1571.84127,0.016409,62.511993,166768.2,0.207453,7.528395,0.375132,0.15579,2.597173,2.46506,1.547949,12.441637,0.688078,3.012456,0.435231,16.231061,4.013258,301.5,9.21655,1.979592,233.57,3773.152823,2.28,1.592527,2.114583,2.48506,1.79541,2.675595,2.436627,0.908499,23.253439,1.654321,2.585783,1.702131,12.279012,25.504409,20.887125,20.264727,0.621693,2.2625,-18.6875,19.647,1.864,0.012875,0.006349,0.003175,1.921053,-0.218421,0.220113,1.923684,1.811341,0.069665,1.642742
std,5566.382221,0.039262,157.373424,336914.3,0.110523,12.540077,0.628953,0.022846,0.381138,0.311314,0.476178,26.481965,0.641592,0.998138,1.110943,99.924805,1.249597,661.187221,21.226128,1.492897,1533.255776,25784.915562,1.654185,0.628623,0.759279,0.373543,0.61098,1.177564,0.520784,0.978849,14.202756,0.756885,0.495063,0.682201,8.157348,62.898731,12.797697,14.006255,6.263045,0.869185,4.345925,6.879379,0.72648,0.047481,0.039367,0.022014,0.840012,0.26268,0.269598,0.838169,0.769549,0.173298,0.560002
min,1.0,0.0,0.6,215.0,0.057,0.0,0.0,0.09,1.5,1.3,0.102228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.4,0.2,0.0,1.0,0.20892,2.0,1.0,1.1,0.8,0.0,0.0,2.7,1.1,-33.7,1.0,-25.0,-1.6,0.0,0.0,0.0,-0.1,1.0,-0.6,0.0,0.0,0.0,0.0,0.505123
25%,186.5,0.002,12.6,31528.0,0.126,1.2,0.1,0.144,2.4,2.3,1.190639,1.2,0.3,2.0,0.0,0.0,3.0,21.0,2.9,1.0,2.125,21.875,1.0,1.0,1.5,2.2,1.4,2.05,2.1,0.42008,12.4,1.0,2.3,1.25547,6.5,2.0,10.55,9.75,-1.35,1.0,-25.0,15.3,1.4,0.0,0.0,0.0,1.0,-0.6,0.0,1.3,1.358236,0.0,1.286828
50%,376.0,0.006,28.3,71313.0,0.208,3.2,0.2,0.154,2.6,2.5,1.540199,3.9,0.6,3.0,0.0,0.65,4.0,93.5,4.2,2.0,7.0,48.05,2.0,1.5,2.5,2.5,1.8,2.7,2.5,0.51128,20.4,2.0,2.6,1.6,11.1,6.0,18.9,17.6,0.4,3.0,-15.0,17.7,1.8,0.0,0.0,0.0,2.0,-0.1,0.1,1.6,1.798655,0.0,1.498295
75%,867.5,0.014,57.9,152795.5,0.263,7.3,0.4,0.164,2.7,2.6,1.716447,11.35,0.9,4.0,0.4,11.325,5.0,308.5,7.1,3.0,12.25,118.725,4.0,2.0,2.5,2.7,2.18178,3.5,2.8,0.61803,31.35,2.0,2.9,2.13531,17.1,21.0,27.75,28.3,2.95,3.0,-15.0,22.8,2.3,0.0,0.0,0.0,3.0,0.0,0.6,3.0,2.242401,0.0,1.83944
max,50079.0,0.6,2703.9,3959657.0,0.51,100.0,5.0,0.276,4.6,3.3,3.91117,261.0,8.4,5.0,15.6,2000.0,5.0,3856.0,147.5,5.0,10855.5,182445.4,5.0,5.0,5.0,3.5,4.51776,4.9,3.9,4.63192,88.1,5.0,4.1,4.82346,50.0,728.0,88.9,88.9,26.4,3.0,-15.0,47.5,5.2,0.6,0.6,0.3,3.0,0.0,0.9,3.6,4.185593,0.5,3.544765
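The `count` row above is uneven: most columns have 567 values, but All Mode Injuries has only 49 and Bike Injuries only 50. A quick way to see which columns are sparsely populated is to compute the share of non-missing values per column. The sketch below uses a small made-up table (the `toy` frame and its values are assumptions for illustration, not the ratings data); the same expression could be applied to `pfbr`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ratings table; values are illustrative only.
toy = pd.DataFrame({
    "Population": [1000, 2500, 400, 8000],
    "Bike Injuries": [3.0, np.nan, np.nan, 12.0],
    "Community Survey Safety Score": [2.5, 2.1, np.nan, 3.0],
})

# Share of rows with a recorded value in each column, sparsest first.
coverage = toy.notna().mean().sort_values()
print(coverage)
```

Columns with low coverage may need to be dropped or handled separately before any clustering step.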


In [None]:
pfbr.shape

(567, 55)

With 55 columns in the dataframe, there is an overwhelming number of variables to consider. Correlating each variable with "Points with bonus", the score that determines the ranking, should show which variables are the most pertinent.

In [None]:
pfbr.corrwith(pfbr["Points with bonus"]).sort_values(ascending=False)

Points with bonus                      1.000000
Bonus                                  0.795198
Total Acceleration Points              0.793944
Total Safety Points                    0.666423
Total Network Points                   0.643687
Total Ridership Points                 0.616059
Average BNA                            0.543552
ACS Normalized Score                   0.542358
ACS Ridership Points                   0.538885
BNA                                    0.537928
Community Survey Acceleration Score    0.533997
BNA Points                             0.527640
Community Survey Ridership Score       0.509141
Total Reach Points                     0.504944
Distance.1                             0.501925
BNA Underserved Communities            0.501465
ACS Tier                               0.488804
Community Survey Network Score         0.471626
Bike Safety Points                     0.463532
All mode safety points                 0.463473
ACS Points                             0
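As a minimal sketch of the same correlation ranking on a small made-up table (the `toy` frame and its values are assumptions, not the ratings data), the variant below restricts the calculation to numeric columns, drops the target so its trivial correlation of 1.0 does not lead the list, and sorts by absolute value so that strong negative relationships also surface. Selecting numeric columns first may be necessary on recent pandas versions, where text columns can cause `corrwith` to raise an error.

```python
import pandas as pd

# Toy stand-in for the ratings table; values are illustrative only.
toy = pd.DataFrame({
    "City": ["A", "B", "C", "D", "E"],
    "Total Safety Points": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Bike Fatality Rate": [10.0, 9.0, 5.0, 4.0, 1.0],
    "Points with bonus": [1.1, 2.1, 3.1, 4.1, 5.1],
})

target = "Points with bonus"

# Keep numeric columns only, so text columns such as city names are excluded.
numeric = toy.select_dtypes(include="number")

# Pearson correlation of each remaining column with the target,
# ordered by absolute strength rather than by sign.
ranked = (
    numeric.drop(columns=target)
    .corrwith(numeric[target])
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranked)
```

In this toy example the fatality rate is strongly negatively correlated with the score, and it still appears near the top of the list instead of at the bottom.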

### Data Wrangling

### Clustering

## Results

## Discussion

## Conclusion