1. Link to the dataset and explanation of the data variables: https://data.bts.gov/Bicycles-and-Pedestrians/Locations-of-Docked-Bikeshare-Stations-by-System-a/7m5x-ubud/about_data

1a. Where I found the data: https://catalog.data.gov/dataset/locations-of-docked-bikeshare-stations-by-system-and-year

2. Link to the second dataset (docked and dockless bikeshare and e-scooter systems): https://data.bts.gov/Bicycles-and-Pedestrians/Bikeshare-Docked-and-Dockless-and-E-scooter-System/cqdc-cm7d/about_data

# **Possible Questions to Answer**

### **1. Geographic Distribution**
- **Where are bikeshare stations most densely located (e.g., by city, state, or CBSA code)?**
- **How does the distribution of bikeshare stations vary across different regions or urban areas?**
- **Are there specific cities or states with a higher concentration of bikeshare stations?**
- **What is the geographic coverage of each bikeshare system (e.g., radius, area)?**
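
A minimal sketch of how the concentration questions above could be approached: count distinct stations per state or per CBSA code. It assumes `FAC_ID` identifies one physical station across reporting years; the sample rows are invented for illustration, and `unique_station_counts` is a hypothetical helper, not part of the dataset or of pandas.

```python
import pandas as pd

# Hypothetical sample rows; only the column names follow the dataset.
stations = pd.DataFrame({
    "FAC_ID": ["IL1", "IL1", "IL2", "DC1", "MD1"],
    "YEAR": [2015, 2016, 2016, 2016, 2016],
    "STATE": ["IL", "IL", "IL", "DC", "MD"],
    "CBSA_CODE": ["16980", "16980", "16980", "47900", "47900"],
})


def unique_station_counts(df: pd.DataFrame, by: str) -> pd.Series:
    """Count distinct FAC_ID values per group, largest group first."""
    return df.groupby(by)["FAC_ID"].nunique().sort_values(ascending=False)


by_state = unique_station_counts(stations, "STATE")
by_cbsa = unique_station_counts(stations, "CBSA_CODE")
print(by_state)
print(by_cbsa)
```

Counting with `nunique()` rather than `size()` avoids counting the same station once per reporting year.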

---

### **2. Growth and Expansion**
- **How has the number of bikeshare stations grown over time (by year)?**
- **Which bikeshare systems have expanded the most (e.g., by number of stations or geographic coverage)?**
- **Are there specific years when bikeshare systems experienced rapid growth?**
- **What is the average lifespan of a bikeshare station (based on `launchDate` and `endDate`)?**
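
The growth questions above could be sketched as a per-year count of distinct stations with year-over-year changes. This assumes `FAC_ID` identifies a station and that each `YEAR` snapshot lists the stations active in that year; the sample records are made up.

```python
import pandas as pd

# Hypothetical station-year records, for illustration only.
records = pd.DataFrame({
    "FAC_ID": ["A", "A", "B", "A", "B", "C", "D"],
    "YEAR": [2015, 2016, 2016, 2017, 2017, 2017, 2017],
})

# Distinct stations reported in each year.
yearly = records.groupby("YEAR")["FAC_ID"].nunique()

growth = pd.DataFrame({
    "stations": yearly,
    "net_change": yearly.diff(),             # absolute change vs. previous year
    "pct_change": yearly.pct_change() * 100  # relative change in percent
})
print(growth)
```

Years with an unusually large `net_change` or `pct_change` would be candidates for the "rapid growth" question.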

---

### **3. System-Specific Analysis**
- **Which bikeshare systems have the most stations?**
- **How do bikeshare systems differ in terms of station density or geographic coverage?**
- **Are there differences in station types (`STATION_TYPE`) across systems?**
- **Which systems have the longest-running stations (based on `launchDate` and `endDate`)?**

---

### **4. Temporal Trends**
- **How has the number of bikeshare stations changed over time (by year)?**
- **Are there seasonal patterns in the launch or closure of bikeshare stations?**
- **Do certain years show a spike in station launches or closures?**
- **What is the average duration a bikeshare station remains active (based on `launchDate` and `endDate`)?**
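
A rough sketch of the duration calculation, assuming `launchDate` and `endDate` are `Mon-YYYY` strings (as in the `Jul-2012` values shown by `df1.head()`) and that a blank `endDate` means the station was still active at the snapshot. Both are assumptions about the data, and the sample rows are invented.

```python
import pandas as pd

# Hypothetical rows; the date format mirrors values such as "Jul-2012".
stations = pd.DataFrame({
    "FAC_ID": ["A", "B", "C"],
    "launchDate": ["Jul-2012", "Sep-2010", "Jun-2013"],
    "endDate": ["Jul-2014", None, "Dec-2013"],
})

# Parse month-year strings; missing or unparseable values become NaT.
# Note: %b expects English month abbreviations under the default locale.
launch = pd.to_datetime(stations["launchDate"], format="%b-%Y", errors="coerce")
end = pd.to_datetime(stations["endDate"], format="%b-%Y", errors="coerce")

# Whole calendar months between launch and end (NaN when there is no end date).
stations["months_active"] = (
    (end.dt.year - launch.dt.year) * 12 + (end.dt.month - launch.dt.month)
)
stations["still_active"] = end.isna()

# Average over closed stations only; open stations have no known lifespan yet.
closed = stations.loc[~stations["still_active"], "months_active"]
print(f"Mean months active (closed stations): {closed.mean():.1f}")
```

Averaging only closed stations understates lifespans if long-lived stations are mostly still open, so the result is best read as a lower bound.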

---

### **5. Station Characteristics**
- **What are the most common station types (`STATION_TYPE`)?**
- **Are there differences in station types based on location (e.g., urban vs. suburban)?**
- **How do station names (`FAC_NAME`) or addresses reflect their location or purpose?**
- **Are there patterns in the naming conventions of bikeshare stations?**

---

### **6. Urban and Regional Analysis**
- **How do bikeshare stations align with population density or urban development?**
- **Are bikeshare stations more common in certain types of neighborhoods (e.g., downtown areas, residential areas)?**
- **How does the distribution of bikeshare stations correlate with public transit hubs or other transportation infrastructure?**
- **Are there differences in bikeshare station density between metropolitan areas (based on `CBSA_CODE`)?**

---

### **7. Comparative Analysis**
- **How do bikeshare systems in different cities or states compare in terms of station density or growth?**
- **Are there differences in station types or locations between systems?**
- **Which cities or systems have the most unique or innovative station designs?**
- **How do bikeshare systems in different regions adapt to local needs (e.g., weather, population density)?**

---

### **8. Predictive and Diagnostic Questions**
- **Can we predict where new bikeshare stations are likely to be launched based on existing patterns?**
- **What factors (e.g., population density, income levels, public transit access) influence the location of bikeshare stations?**
- **Are there patterns in station closures (e.g., due to low usage, maintenance costs)?**
- **What is the likelihood of a station remaining active based on its location or type?**

---

### **9. Visualization and Mapping**
- **Create a map showing the geographic distribution of bikeshare stations by system or year.**
- **Visualize the growth of bikeshare systems over time using an animated map or timeline.**
- **Plot the density of bikeshare stations in specific cities or regions.**
- **Create a heatmap of bikeshare station locations to identify high-density areas.**
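
As a starting point for the heatmap idea, station coordinates can be binned into a grid of counts. The sketch below uses `numpy.histogram2d` on made-up longitude and latitude values; the bin count and bounding box are arbitrary assumptions.

```python
import numpy as np

# Hypothetical coordinates: three points near Chicago, two near Washington, DC.
lon = np.array([-87.62, -87.63, -87.61, -77.02, -77.09])
lat = np.array([41.88, 41.89, 41.87, 38.92, 38.98])

# Count stations per cell in a 4x4 grid over an assumed bounding box.
counts, lon_edges, lat_edges = np.histogram2d(
    lon, lat, bins=[4, 4], range=[[-90.0, -70.0], [35.0, 45.0]]
)

# Index (lon_bin, lat_bin) of the densest cell.
densest = np.unravel_index(np.argmax(counts), counts.shape)
print(counts.sum(), densest)

# To draw it, counts.T could be passed to matplotlib's imshow with
# origin="lower", because histogram2d puts longitude on the first axis.
```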

---

### **10. Policy and Planning**
- **How can bikeshare systems optimize station placement to maximize usage?**
- **What lessons can be learned from successful bikeshare systems to improve underperforming ones?**
- **How do bikeshare stations contribute to sustainable urban transportation?**
- **What policies or incentives could encourage the expansion of bikeshare systems in underserved areas?**

---

### **Example Hypotheses to Test**
- **Hypothesis 1**: Bikeshare stations are more densely located in urban areas with higher population density.
- **Hypothesis 2**: Bikeshare systems in cities with robust public transit networks have more stations.
- **Hypothesis 3**: Stations located near public transit hubs have longer lifespans.
- **Hypothesis 4**: Bikeshare systems in warmer climates have more stations than those in colder climates.
- **Hypothesis 5**: The number of bikeshare stations has grown exponentially since their introduction.
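
Hypothesis 5 could be checked with a log-linear fit: if counts follow $N(t) = N_0 e^{rt}$, then $\log N$ is linear in $t$. The sketch below fits that line with `numpy.polyfit`; the yearly counts are invented (they double each year) and only illustrate the mechanics.

```python
import numpy as np

years = np.array([2015, 2016, 2017, 2018, 2019])
# Hypothetical station counts that double every year, for illustration only.
stations = np.array([100, 200, 400, 800, 1600])

# Fit log(N) = log(N0) + r * t, with t measured from the first year.
t = years - years[0]
log_n = np.log(stations)
r, log_n0 = np.polyfit(t, log_n, 1)

# Goodness of fit in log space.
residuals = log_n - (log_n0 + r * t)
r_squared = 1 - np.sum(residuals**2) / np.sum((log_n - log_n.mean()) ** 2)

print(f"Annual growth factor: {np.exp(r):.2f}, R^2 (log space): {r_squared:.3f}")
```

A high R² in log space is suggestive but not conclusive with only a few yearly points; comparing it against a plain linear fit on the raw counts would make the test more convincing.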

---

### **Tools to Answer These Questions**
- **Data Cleaning**: Python (Pandas), R, or Excel.
- **Visualization**: Tableau, Power BI, Matplotlib, Seaborn, or Plotly.
- **Geospatial Analysis**: GeoPandas, QGIS, or ArcGIS for mapping.
- **Statistical Analysis**: Python (SciPy, Statsmodels) or R.
- **Machine Learning**: Scikit-learn for predictive modeling (e.g., station placement).

---



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df1 = pd.read_csv(r'C:\Users\andre\Desktop\Bikeshare Station Project\Locations_of_Docked_Bikeshare_Stations_by_System_and_Year.csv')
df1.head()



Unnamed: 0,the_geom,ID,FAC_ID,BIKE_ID,SYSTEM_ID,SYSTEM_NAME,YEAR,ASOFDATE,FAC_NAME,ADDRESS,CITY,STATE,ZIPCODE,CBSA_CODE,LONGITUDE,LATITUDE,STATION_TYPE,launchDate,endDate
0,POINT (-85.302739 35.047574),2278.0,TN3740303,-,54,Bike Chattanooga,2015,201512,-,Oak St & Houston St,Chattanooga,TN,37403,16860,-85.302739,35.047574,-,Jul-2012,
1,POINT (-77.091991 38.982456),2579.0,MD2077301,-,56,Capital Bikeshare,2015,201512,47th & Elm St,47th & Elm St,Chevy Chase,MD,20815,47900,-77.091991,38.982456,-,Sep-2010,
2,POINT (-87.624084 41.881031),2967.0,IL6060282,-,60,Divvy,2015,201512,Millennium Park,Millennium Park,Chicago,IL,60603,16980,-87.624084,41.881031,-,Jun-2013,
3,POINT (-71.107341 42.310579),3505.0,MA0213011,E32005,1,Hubway (03/2018 re-launched as Blue Bikes),2016,201612,-,Green St T,Jamaica Plain,MA,2130,14460,-71.107341,42.310579,-,Jul-2011,
4,POINT (-77.01597 38.917622),6350.0,DC2074001,31118,56,Capital Bikeshare,2016,201612,3rd & Elm St NW,3rd & Elm St NW,Washington,DC,20001,47900,-77.01597,38.917622,-,Sep-2010,


In [3]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67574 entries, 0 to 67573
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   the_geom      67574 non-null  object 
 1   ID            67573 non-null  float64
 2   FAC_ID        67574 non-null  object 
 3   BIKE_ID       67568 non-null  object 
 4   SYSTEM_ID     67574 non-null  int64  
 5   SYSTEM_NAME   67574 non-null  object 
 6   YEAR          67574 non-null  int64  
 7   ASOFDATE      67574 non-null  int64  
 8   FAC_NAME      67574 non-null  object 
 9   ADDRESS       67574 non-null  object 
 10  CITY          67574 non-null  object 
 11  STATE         67574 non-null  object 
 12  ZIPCODE       67574 non-null  object 
 13  CBSA_CODE     67574 non-null  object 
 14  LONGITUDE     67574 non-null  float64
 15  LATITUDE      67574 non-null  float64
 16  STATION_TYPE  67574 non-null  object 
 17  launchDate    67574 non-null  object 
 18  endDate       12057 non-nu

In [4]:
df1.shape

(67574, 19)

In [5]:
df1.describe()

Unnamed: 0,ID,SYSTEM_ID,YEAR,ASOFDATE,LONGITUDE,LATITUDE
count,67573.0,67574.0,67574.0,67574.0,67574.0,67574.0
mean,34156.945289,43.877941,2020.132492,202023.354693,-90.126471,38.783448
std,19913.705356,35.570226,2.743578,272.371383,18.35868,4.849393
min,1.0,1.0,2015.0,201512.0,-157.870423,21.2679
25%,16895.0,5.0,2018.0,201812.0,-96.69367,37.769095
50%,33855.0,43.0,2020.0,202012.0,-86.147,40.675832
75%,51759.0,60.0,2023.0,202307.0,-75.144008,41.870816
max,68663.0,166.0,2024.0,202407.0,-70.753969,47.666145


In [6]:
df1.isnull().sum()

the_geom            0
ID                  1
FAC_ID              0
BIKE_ID             6
SYSTEM_ID           0
SYSTEM_NAME         0
YEAR                0
ASOFDATE            0
FAC_NAME            0
ADDRESS             0
CITY                0
STATE               0
ZIPCODE             0
CBSA_CODE           0
LONGITUDE           0
LATITUDE            0
STATION_TYPE        0
launchDate          0
endDate         55517
dtype: int64
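
The null counts above do not include the `-` placeholders visible in `BIKE_ID`, `FAC_NAME`, and `STATION_TYPE` in the `df1.head()` output. A small sketch, assuming a lone `-` means "missing" (an interpretation, not something the dataset documentation is quoted on here), with made-up rows:

```python
import pandas as pd

# Hypothetical rows mimicking the "-" placeholders seen in df1.head().
sample = pd.DataFrame({
    "BIKE_ID": ["-", "-", "E32005", "31118"],
    "FAC_NAME": ["-", "47th & Elm St", "-", "3rd & Elm St NW"],
    "STATION_TYPE": ["-", "-", "-", "-"],
})

# Replace cells that are exactly "-" with NaN; other text is left untouched.
cleaned = sample.mask(sample == "-")

missing = cleaned.isna().sum()
print(missing)
```

If the assumption holds, passing `na_values=["-"]` to `pd.read_csv` would apply the same treatment at load time.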

In [7]:
# Drop the few rows with a missing ID or BIKE_ID.
df1 = df1.dropna(subset=['ID', 'BIKE_ID'])
print(f"Number of rows after removing null values: {len(df1)}")
print("\nNull counts in ID and BIKE_ID after cleaning:")
print(df1[['ID', 'BIKE_ID']].isnull().sum())

Number of rows after removing null values: 67567

Null counts in ID and BIKE_ID after cleaning:
ID         0
BIKE_ID    0
dtype: int64


In [8]:
df1.isnull().sum()


the_geom            0
ID                  0
FAC_ID              0
BIKE_ID             0
SYSTEM_ID           0
SYSTEM_NAME         0
YEAR                0
ASOFDATE            0
FAC_NAME            0
ADDRESS             0
CITY                0
STATE               0
ZIPCODE             0
CBSA_CODE           0
LONGITUDE           0
LATITUDE            0
STATION_TYPE        0
launchDate          0
endDate         55517
dtype: int64

In [9]:
df1.duplicated().sum()


np.int64(0)

In [17]:
df1.to_csv(r'C:\Users\andre\Desktop\Bikeshare Station Project\cleaned_bikeshare_stations.csv', index=False)


In [36]:
unique_cities = df1['CITY'].nunique()
print(f"Number of unique cities in the dataset: {unique_cities}")
print("\nList of unique cities:")
print(df1['CITY'].unique())


Number of unique cities in the dataset: 326

List of unique cities:
['Chattanooga' 'Chevy Chase' 'Chicago' 'Jamaica Plain' 'Washington'
 'Roxbury' 'El Paso' 'Fort Worth' 'Berkeley' 'Alexandria' 'Portland'
 'Long Island City' 'Jersey City' 'Miami' 'Charlotte' 'Salt Lake City'
 'Boston' 'New York' 'Omaha' 'Madison' 'Aspen' 'Pittsburgh' 'Arlington'
 'Honolulu' 'Bethesda' 'Raleigh' 'Brookline' 'Brooklyn' 'Niagara Falls'
 'Manitowish Waters' 'Evanston' 'Troy' 'Cambridge' 'Cincinnati'
 'Los Angeles' 'Albany' 'Mesa' 'Topeka' 'Phoenix' 'Carmel' 'Atlanta'
 'Denver' 'Saint Paul' 'Tampa' 'Boise' 'Cleveland' 'Portsmouth'
 'Minneapolis' 'Wichita' 'Jeffersonville' 'Kansas City' 'Baton Rouge'
 'Louisville' 'Falls Church' 'San Antonio' 'Milwaukee' 'Long Beach'
 'Dayton' 'Nashville' 'Memphis' 'Las Vegas' 'Saratoga Springs' 'Houston'
 'Reston' 'Astoria' 'Austin' 'Silver Spring' 'Miami Beach' 'Eugene'
 'Dorchester' 'Somerville' 'Boulder' 'Columbus' 'San Jose' 'Philadelphia'
 'Lincoln' 'Schenectady' 'Basa

In [37]:
# Each row is one station in one reporting year, so these are record counts
# per city, not counts of unique stations.
city_counts = df1.groupby('CITY').size().sort_values(ascending=False)
city_stats = city_counts.agg(['median', 'mean'])
print("Statistics for number of stations per city:")
print(f"\nMedian stations per city: {city_stats['median']:.1f}")
print(f"Mean stations per city: {city_stats['mean']:.1f}")
print("\nNumber of stations in each city:")
print(city_counts)


Statistics for number of stations per city:

Median stations per city: 37.0
Mean stations per city: 207.3

Number of stations in each city:
CITY
Chicago          8825
New York         5497
Brooklyn         3509
Washington       3055
San Francisco    2060
                 ... 
Daly City           1
Celebration         1
Cary                1
Brooks AFB          1
Alamo Heights       1
Length: 326, dtype: int64


In [11]:
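rows_before = len(df1)
df1 = df1.sort_values('YEAR')
df1 = df1.drop_duplicates(subset=['the_geom'], keep='first')  # earliest record per location
print(f"Number of duplicate locations removed: {rows_before - len(df1)}")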



Number of duplicate locations removed: 0


In [15]:
# Keep only the earliest-year record for each station (FAC_ID).
df1 = df1.sort_values('YEAR')
df1 = df1.drop_duplicates(subset=['FAC_ID'], keep='first')


In [16]:
df1.shape



(10903, 19)
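
Keeping the first record per `FAC_ID` retains each station's earliest snapshot. The sketch below contrasts that with keeping the latest snapshot, using invented rows in which only a later year carries an `endDate`. Whether the real data behaves this way has not been checked here, so treat it as a caution rather than a finding.

```python
import pandas as pd

# Hypothetical station-year records; station B closes in its 2017 snapshot.
records = pd.DataFrame({
    "FAC_ID": ["A", "B", "A", "B", "C"],
    "YEAR": [2016, 2015, 2015, 2017, 2016],
    "endDate": [None, None, None, "Dec-2017", None],
})

# A stable sort keeps the original order of rows within the same year.
ordered = records.sort_values("YEAR", kind="stable")

earliest = ordered.drop_duplicates(subset=["FAC_ID"], keep="first")
latest = ordered.drop_duplicates(subset=["FAC_ID"], keep="last")

print(earliest.set_index("FAC_ID"))
print(latest.set_index("FAC_ID"))
```

In this toy data, `keep="first"` loses the closure date for station B, while `keep="last"` preserves it; the better choice depends on which questions the cleaned file is meant to answer.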

In [71]:
df1.head(100).to_csv(r'C:\Users\andre\Desktop\Bikeshare Station Project\cleaned_bikeshare_stations_100.csv', index=False)


In [None]:
df1.to_csv(r'C:\Users\andre\Desktop\Bikeshare Station Project\cleaned_bikeshare_stations.csv', index=False)
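
The save cells above use an absolute path on one machine. A possible alternative, sketched with `pathlib`, builds the output path from a directory variable so the notebook can run elsewhere. `save_csv` and the `output` folder name are hypothetical, and the demo writes to a temporary directory rather than the project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df: pd.DataFrame, out_dir: Path, name: str) -> Path:
    """Write df to out_dir/name without the index, creating out_dir if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    df.to_csv(path, index=False)
    return path


# Hypothetical one-row frame standing in for the cleaned data.
demo = pd.DataFrame({"FAC_ID": ["TN3740303"], "YEAR": [2015]})

with tempfile.TemporaryDirectory() as tmp:
    saved = save_csv(demo, Path(tmp) / "output", "cleaned_bikeshare_stations.csv")
    round_trip = pd.read_csv(saved)

print(round_trip)
```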