# Walkable Cities
DS4A Empowerment Cohort 4 Team 27: Sola Agogu | Daniel Bernal | Omar Ibarra | Chanel Lee | Dami Salami

## Introduction
Most American cities are designed for cars instead of people. With over 2 million people injured in car accidents every year and 14% of greenhouse gas emissions coming from cars, policymakers are considering whether it is time for the US to move away from its car-focused culture. As more cities move toward walkability, we come closer to curbing car-centrism in the US. From pedestrian safety to public transportation, we will explore the effects of people-first city design.
Pedestrian-focused, walkable cities invest in the community. They not only reduce car use but also improve local economies, reduce obesity and diabetes rates, and more. We will investigate how walkable cities can contribute to a better quality of life for people.
We will explore what makes a city walkable by considering pedestrian-centric designs, public transportation, and city ordinances.
We will also compare walkable cities to car-centric cities, which treat pedestrians as an afterthought, on quality of life factors. We then aim to answer the following question: What quality of life factors are higher in walkable cities?


## Business Impact
This project is meant to encourage city planners and local governments to invest in walkability and implement pedestrian-centric city designs. We hypothesize that walkable cities improve their residents' quality of life and strengthen the local economy. We hope to inspire residents to ask their local governments to bring walkability to their cities, helping shape the future of our country’s infrastructure and health.

*****************************************************
** Include research on any other studies on this and approaches of the solutions to the main question. Provide references!

## Approach

***** EDIT: A number of studies have been done by researchers and students world-wide about the possibility of improving sustainability by building more walkable cities including ****, ******** and ** (references). Urban planning articles like (reference) extol the common sense benefits of walkable cities on the environment. As with most other facets of life, human beings tend to care about topics when they understand the specific benefits to them.
Our approach to demonstrating this included regression analysis and modeling to determine whether quality of life differs between walkable and non-walkable cities.
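
The comparison described above can be sketched as follows. This is a minimal illustration, not the project's analysis: the data values are made up, `obesity` stands in for any quality of life variable, and the 50-point cutoff for "walkable" is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
cities = pd.DataFrame({
    "walk_score": [88, 75, 62, 55, 48, 41, 33, 25],
    "obesity": [22.0, 24.5, 27.0, 28.5, 31.0, 32.5, 34.0, 36.5],
})

# Label each city using an assumed cutoff of 50 on the walk score.
cities["group"] = np.where(cities["walk_score"] > 50, "walkable", "car_centric")

# Compare the mean of the quality of life variable between the two groups.
group_means = cities.groupby("group")["obesity"].mean()

# Fit a least-squares line: obesity = intercept + slope * walk_score.
slope, intercept = np.polyfit(cities["walk_score"], cities["obesity"], deg=1)

print(group_means)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A difference in group means and a nonzero slope are only descriptive; whether they are statistically meaningful is a question for the hypothesis tests and models later in the report.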





## Data Analysis & Computation

### Datasets
 Several American Community Survey datasets were obtained with 2021 estimates for different types of variables. These were valuable because they contained information at the city level.
 - Demographic variables including age, sex, education (***REFERENCE***)
 - Health variables including population percentages with diabetes, high blood pressure, obesity (***REFERENCE***)
 - Other quality of life health variables including sleep, depression, crime, and how often members of the population go for medical checkups (***REFERENCE***)
 Walkability variables including cities' walk scores and bike scores were obtained by web scraping the *** site (*REFERENCE*)


 ### Data Cleaning and Wrangling
 Our challenge was not just gathering sufficient data, but also deciding which of the 100+ variables from the ACS and EPA data sources to eliminate.
 Doing this while maintaining the integrity of our dataset required extensive data cleaning and wrangling:
 - Using the Python pandas library, we dropped repeated variables, columns that contained only annotations, and variables that we considered irrelevant to our hypothesis, such as emissions from power plants and other industries.
 - We also standardized the column names to make the data more readable (S0101_15E became public_transportation, for example).
 - Unfortunately, the process also meant discarding information with the wrong granularity, such as EPA vehicle emissions data, which was only available down to the state level.
 
 This section was by far the most time-consuming and took several rounds of refining until we had an appropriate dataset.
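
The cleaning steps listed above can be sketched roughly as follows. This is an illustration under assumptions, not the code we ran: the toy table, the `VAR_001E` and `VAR_001EA` column codes, the `power_plant_emissions` column, and the rule that annotation columns end in `EA` are all hypothetical. Only the `S0101_15E` to `public_transportation` rename comes from the text above.

```python
import pandas as pd

# Hypothetical raw extract; codes and values are placeholders.
raw = pd.DataFrame(
    [[1000, "(X)", 12.5, 12.5, 3.1],
     [2000, "(X)", 8.0, 8.0, 4.7]],
    columns=["VAR_001E", "VAR_001EA", "S0101_15E", "S0101_15E",
             "power_plant_emissions"],
)

# 1. Drop repeated columns, keeping the first occurrence of each name.
clean = raw.loc[:, ~raw.columns.duplicated()]

# 2. Drop annotation-only columns (assumed here to end in "EA").
annotation_cols = [c for c in clean.columns if c.endswith("EA")]
clean = clean.drop(columns=annotation_cols)

# 3. Drop variables judged irrelevant to the hypothesis.
clean = clean.drop(columns=["power_plant_emissions"])

# 4. Rename coded columns to readable names.
clean = clean.rename(columns={
    "VAR_001E": "total_population",        # hypothetical mapping
    "S0101_15E": "public_transportation",  # mapping given in the text
})

print(clean)
```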

### Exploratory Data Analysis

In carrying out an exploratory analysis of our data, we found encouraging signs that there were some correlations between our walkability variables and our quality of life variables. 

******Below is a geographic bubble plot of city walkability vs ** (or other bivariate/multivariate visualization)
******And below is our standardized data using scikit-learn's StandardScaler
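
For reference, standardization rescales each column to mean 0 and unit variance. The sketch below reproduces that calculation with pandas alone on a made-up two-column table. `StandardScaler` divides by the population standard deviation by default, which corresponds to `ddof=0` here.

```python
import pandas as pd

# Hypothetical values, for illustration only.
df = pd.DataFrame({
    "walk_score": [30.0, 50.0, 70.0],
    "obesity": [35.0, 30.0, 25.0],
})

# z = (x - column mean) / column population standard deviation
standardized = (df - df.mean()) / df.std(ddof=0)

print(standardized)
```

After this step, columns measured on different scales (a 0 to 100 score and a percentage, for example) can be compared directly.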


include most meaningful data visualizations AND derived conclusions/remarkable hypothesis - use submitted extended analysis


#### 1. Data review
#### 2. Summary Statistics
#### 3. Scatterplots and other univariate distributions
#### 4. Heat Maps (Correlations)

### Statistical Analysis & Predictive Modeling

5. Hypothesis Testing
focus on why we made the choices we made
6. Regression Analysis/ Modeling
model implications and validity

## Description of Dashboard

### Use Case

### Data Engineering

### Flow charts/diagrams to show 

## Conclusions and Future Work

In [3]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import statsmodels.formula.api as smf
#import hvplot.pandas
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import pingouin as pg
from bokeh.plotting import figure, show
from bokeh.models import HoverTool
from bokeh.io import output_notebook
import cufflinks as cf
from sklearn.preprocessing import StandardScaler


In [5]:
# import wrangled dataset
walkable_cities = pd.read_csv("datasources/walkable-cities.csv")
walkable_cities

FileNotFoundError: [Errno 2] No such file or directory: 'datasources/walkable-cities.csv'

In [None]:
# reviewing a list of the dataset's columns
display(walkable_cities.columns)

From our visual review of the dataset and columns, we can see that our variables include:
 - walkability variables: walk_score and bike_score
 - race variables
 - health variables
 - age variables
 - crime variables
 - commuting variables
 - gender variables

In [None]:
#REVIEW
#Additional data wrangling
#keep only the first occurrence of each duplicated column name
walkable_cities = walkable_cities.loc[:, ~walkable_cities.columns.duplicated()]
#drop the columns at these positions; reassigning avoids modifying a slice in place
walkable_cities = walkable_cities.drop(columns=walkable_cities.columns[[4, 5, 6, 7, 11]])
walkable_cities

For our main walkability variable, the frequency distribution in the histogram below shows a slight right skewness. We can explore it further by carrying out descriptive analysis.
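
One way to put a number on the skew is the sample skewness, together with a comparison of the mean and the median. The sketch below uses made-up scores rather than our dataset. A positive skewness and a mean above the median both point to a right skew.

```python
import pandas as pd

# Hypothetical walk scores: mostly low, with a few high values.
scores = pd.Series([20, 22, 25, 27, 30, 32, 35, 40, 60, 90], name="walk_score")

skewness = scores.skew()   # sample skewness; above 0 suggests a right skew
mean_score = scores.mean()
median_score = scores.median()

print(f"skewness = {skewness:.2f}")
print(f"mean = {mean_score:.1f}, median = {median_score:.1f}")
```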

In [None]:
hist_walk = walkable_cities['walk_score']

# Plot a histogram of the "walk_score" column
sns.displot(hist_walk, kde=False)

In [None]:
# For review purposes, group the walk scores by state
pd.options.plotting.backend='hvplot'
walkable_cities.groupby('state')['walk_score'].mean().plot(kind='bar', width=1400, height=600)

In [None]:
# Identify the city with highest walk score
pd.options.plotting.backend='hvplot'
walkable_cities.groupby('city')['walk_score'].mean().plot(kind='bar', width=1600, height=600)

In [None]:
#New dataframe with walk score greater than 50
poss_walK_cities = walkable_cities.groupby('city').filter(lambda x: x['walk_score'].max() > 50)
#sort_values returns a new dataframe, so assign the result
poss_walK_cities = poss_walK_cities.sort_values('walk_score')
poss_walK_cities

In [None]:
#Rank the cities with walk score over 50 by their mean walk score
poss_walk_sort = poss_walK_cities.groupby('city').mean(numeric_only=True).sort_values('walk_score')
#Other way: poss_walk_sort.groupby("city" ).apply(lambda x: x.sort_values("walk_score"))
posswalk_over = poss_walk_sort[['walk_score']].sort_values(by='walk_score', ascending=False)
#Show the 40 highest-scoring cities
posswalk_over.head(40)

In [None]:
walkable_cities.columns

In [None]:
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# candidate predictor columns, defined once and reused below
features = ['pop_per_km2', 'median_age', 'male',
       'female',
       'access2', 'arthritis', 'binge', 'bphigh', 'bpmed', 'cancer', 'casthma',
       'cervical', 'chd', 'checkup', 'cholscreen', 'colon_screen', 'copd',
       'corem', 'corew', 'csmoking', 'dental', 'depression', 'diabetes',
       'ghlth', 'highchol', 'kidney', 'lpa', 'mammouse', 'mhlth', 'obesity',
       'phlth', 'sleep', 'stroke', 'teethlost', 'cumulative_confirmed',
       'cumulative_deceased', 'drive_commute',
       'public_transit_commute', 'walk_commute', 'bike_commute',
       'work_from_home', 'households', 'mean_household_income', 'mean_income',
       'median_household_income', 'living_wage', 'poverty',
       'unemployment_rate', 'median_aqi', 'violent_crime', 'property_crime']

target = 'walk_score'

# loop through every pair of predictor columns
for combination in combinations(features, 2):
    cols = list(combination)

    # keep only rows with no missing values in this pair or in the target
    subset = walkable_cities[cols + [target]].dropna()
    x_temp = subset[cols].values
    y_temp = subset[target].values

    # fit the model
    model = LinearRegression().fit(x_temp, y_temp)

    # calculate the in-sample R-squared
    r2 = r2_score(y_temp, model.predict(x_temp))

    # print the combination and the R-squared
    print(combination, r2)



In [None]:
# Identify the cities with walk score greater than 50
pd.options.plotting.backend='hvplot'
poss_walK_cities.groupby('city')['walk_score'].mean().plot(kind='bar', width=1400, height=600)

In [None]:
# List the cities with walk score over 50 in ascending order of mean walk score
posswalk_over = poss_walk_sort[['walk_score']].sort_values(by='walk_score', ascending=True)
posswalk_over.head(144)

In [None]:
#New dataframe with walk score less than or equal to 50
no_walK_cities = walkable_cities.groupby('city').filter(lambda x: x['walk_score'].max() <= 50)
#sort_values returns a new dataframe, so assign the result
no_walK_cities = no_walK_cities.sort_values('walk_score')
no_walK_cities

In [None]:
# Identify the cities with walk score less than 50
pd.options.plotting.backend='hvplot'
no_walK_cities.groupby('city')['walk_score'].mean().plot(kind='bar', width=1600, height=600)

In [None]:
# Set the maximum number of rows and columns displayed
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

# Count the missing (NA) values in each column
na_values = walkable_cities.isna().sum()

# Display the resulting Series of per-column counts
display(na_values)

In [None]:
#Summary Statistics Table
walkable_cities.describe()

In [None]:
# Create a scatter plot matrix
sns.pairplot(walkable_cities[['walk_score', 'bike_score', 'chd', 'obesity', 'bphigh', 'highchol', 'diabetes']])

# Show the plot
plt.show()

In [None]:
# First, create a scatter plot matrix (only numeric columns are plotted)
scatter_matrix = pd.plotting.scatter_matrix(walkable_cities, figsize=(20, 20))

# Rotate and reposition the axis labels so they stay readable.
# scatter_matrix already labels the outer axes with the column names,
# so the labels are not reassigned here.
for ax in scatter_matrix.reshape(-1):
    ax.xaxis.label.set_rotation(45)
    ax.yaxis.label.set_rotation(0)
    ax.get_yaxis().set_label_coords(-0.3, 0.5)

# Adjust the spacing between subplots
plt.subplots_adjust(hspace=0.5, wspace=0.5)

# Show the plot
plt.show()

In [None]:
#Heat Maps
# Create a list only with the features needed
walkable_values = ['walk_score', 'bike_score', 'chd', 'obesity', 'bphigh', 'highchol', 'diabetes']
# Create a pivot table with the "city" column as the index and the mean of each selected feature as the values
walkable_pivot_table = walkable_cities.pivot_table(index="city", values=walkable_values)

fig, ax = plt.subplots(figsize=(20, 50))
# Create a heatmap of the pivot table
sns.heatmap(walkable_pivot_table, cmap="coolwarm", ax=ax)

# Show the plot
plt.show()

In [None]:
#Group the cities in a dataframe by the "walk_score" column and create a heat map

# Create a new column with the "walk_score" values binned into ranges of 10
walkable_cities["walk_score_range"] = pd.cut(walkable_cities["walk_score"], bins=range(0, 110, 10), include_lowest=True)

# Create a pivot table with the "walk_score_range" column as the index and the mean of each selected feature as the values.
# The categorical range column cannot be averaged, so it is used for grouping rather than as the values.
walkable_pivot_table_1 = walkable_cities.pivot_table(index="walk_score_range", values=walkable_values, observed=True)

# Create a heatmap of the pivot table
sns.heatmap(walkable_pivot_table_1)

# Show the plot
plt.show()

In [None]:
# Check if the pivot table contains any non-NaN values
if walkable_pivot_table_1.notnull().values.any():
    # Create a heatmap of the pivot table
    sns.heatmap(walkable_pivot_table_1)

    # Show the plot
    plt.show()
else:
    print("The pivot table does not contain any non-NaN values.")

In [None]:
# Correlations
# Compute the correlation matrix once, using numeric columns only
corr = walkable_cities.corr(numeric_only=True)

# Ascending order puts the strongest negative correlations first
top_neg_corr_walk = corr['walk_score'].sort_values(ascending=True)[:10]
top_neg_corr_bike = corr['bike_score'].sort_values(ascending=True)[:10]
# Descending order puts each score's correlation with itself first, so skip it
top_pos_corr_walk = corr['walk_score'].sort_values(ascending=False)[1:11]
top_pos_corr_bike = corr['bike_score'].sort_values(ascending=False)[1:11]



In [None]:
#Top positive correlations
top_pos_corr_walk

In [None]:
#Top negative correlations
top_neg_corr_walk

In [None]:
# Hypothesis Testing and model

# Create a multiple linear regression model with the dependent variable and the independent variables
model = smf.ols('walk_score ~ bike_score + chd + obesity + bphigh + highchol + diabetes', data=walkable_cities).fit()

# Dependent variable - 'walk_score'
# Independent variables - 'bike_score', 'chd', 'obesity', 'bphigh', 'highchol', 'diabetes'

# Print the summary of the model
print(model.summary())

# Define the null and alternate hypotheses
null_hypothesis = "There is no relationship between walk_score and the independent variables"
alternate_hypothesis = "There is a relationship between walk_score and the independent variables"

# Overall F-test: are all slope coefficients jointly zero?
# The fitted model reports this test directly; the intercept is not part of it.
f_pvalue = model.f_pvalue

# Print the p-value of the F-test
print(f"p-value = {f_pvalue:.4f}")

# Interpret the results
if f_pvalue < 0.05:
    print("Reject the null hypothesis")
    print(alternate_hypothesis)
else:
    print("Fail to reject the null hypothesis")
    print(null_hypothesis)

In [None]:
#Linear Regression model

walkable_cities_1 = walkable_cities.loc[walkable_cities['walk_score'] >= 0, walkable_values]

type(walkable_cities_1)


In [None]:


# Split the dataframe into two groups
#from sklearn.preprocessing import StandardScaler

greater_than_50 = walkable_cities_1[walkable_cities_1['walk_score'] > 50]

less_than_or_equal_to_50 = walkable_cities_1[walkable_cities_1['walk_score'] <= 50]

# Define the features and target variable
X = greater_than_50.drop(columns='walk_score')
y = greater_than_50['walk_score']

# check the shape of greater_than_50 dataframe
print(greater_than_50.shape)

# check if walk_score column exists in the dataframe
print(greater_than_50.columns)

#check if there are any missing value in the dataframe
print(greater_than_50.isnull().sum())

# check if all columns are numeric
print(greater_than_50.dtypes)


In [None]:

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

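# Report missing values per column. Rows are not dropped here, so X and y stay aligned
# (calling dropna() without assigning the result has no effect).
print(X_train.isna().sum())
print(X_test.isna().sum())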

# Check the shape of the X_train variable
print(X_train.shape)

# Check if X_train is empty
if X_train.shape[0] == 0:
    print("X_train is empty.")

# Print the first 5 rows of X_train
print(X_train.head())

print(X_train.dtypes)

In [None]:
from sklearn.impute import SimpleImputer

# Create an imputer with mean strategy
imputer = SimpleImputer(strategy='mean')

# Fit and transform the imputer on the training set
X_train_imputed = imputer.fit_transform(X_train)

# Transform the test set
X_test_imputed = imputer.transform(X_test)

# Scale and normalize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# Initialize the model
reg = LinearRegression()

# Train the model on the training set
reg.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = reg.predict(X_test_scaled)

# Calculate the mean squared error and R^2 score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error: {:.2f}".format(mse))
print("R-squared value: {:.2f}".format(r2))