Build a regression model.

In [1]:
# Import libraries and packages
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [2]:
# Read in data
master_df = pd.read_csv('C:/Users/HP/Music/LHLDataCourse/Python/project_data/master_df2.csv')
master_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228 entries, 0 to 227
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               228 non-null    object 
 1   empty_slots        228 non-null    int64  
 2   free_bikes         228 non-null    int64  
 3   latitude           228 non-null    float64
 4   longitude          228 non-null    float64
 5   yelp_center_dist   228 non-null    float64
 6   yelp_college_dist  228 non-null    float64
 7   fsq_center_dist    228 non-null    float64
 8   fsq_college_dist   228 non-null    int64  
 9   total_slots        228 non-null    int64  
 10  pct_usage          228 non-null    float64
 11  usage_cat          228 non-null    object 
 12  bin_usage          228 non-null    int64  
dtypes: float64(6), int64(5), object(2)
memory usage: 23.3+ KB


## Multivariate Regression

Provide model output and an interpretation of the results. 

#### Using pct_usage as the dependent variable ($y$) and yelp_center_dist, yelp_college_dist, fsq_center_dist, fsq_college_dist as independent variables ($x_1$ to $x_4$).


In [5]:
# Model 1
y = master_df['pct_usage']
X = master_df[["yelp_center_dist", "yelp_college_dist", "fsq_center_dist", "fsq_college_dist"]]
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept
X.head()

Unnamed: 0,const,yelp_center_dist,yelp_college_dist,fsq_center_dist,fsq_college_dist
0,1.0,206.798597,1331.446597,204.0,1555
1,1.0,130.753256,1706.096309,431.0,1732
2,1.0,794.358688,2282.52981,79.0,2069
3,1.0,165.841893,1808.390164,142.0,1579
4,1.0,711.265245,2707.043477,729.0,2114


In [6]:
model = sm.OLS(y, X) #instantiate
results = model.fit() #fit the model 
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              pct_usage   R-squared:                       0.040
Model:                            OLS   Adj. R-squared:                  0.023
Method:                 Least Squares   F-statistic:                     2.309
Date:                Sun, 10 Sep 2023   Prob (F-statistic):             0.0589
Time:                        21:08:59   Log-Likelihood:                -1001.6
No. Observations:                 228   AIC:                             2013.
Df Residuals:                     223   BIC:                             2030.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                62.9003      2.84

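The **adjusted R-squared** shows that the multivariate model explains only about 2.3% of the variation in bike usage. The model's overall **p-value** (Prob (F-statistic) = 0.0589) is greater than 0.05, as are the p-values of the individual independent variables, so neither the model nor any single predictor is statistically significant at the 5% level.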


Since the center distances showed a correlation, I will try another model using only the two center-distance variables.
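
A correlation like the one mentioned above can be quantified with Pearson's r. The sketch below is only an illustration: `dist_a` and `dist_b` are synthetic stand-ins generated with a fixed seed, not columns from `master_df`, and the noise level is an arbitrary assumption.

```python
import numpy as np

# Synthetic stand-ins for two distance columns (NOT the notebook's data).
rng = np.random.default_rng(0)
dist_a = rng.uniform(50, 800, size=228)         # hypothetical distances
dist_b = dist_a + rng.normal(0, 150, size=228)  # noisy copy, so the two are related

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is Pearson's r.
r = np.corrcoef(dist_a, dist_b)[0, 1]
print(f"Pearson r = {r:.2f}")
```

On the real frame, `master_df[["yelp_center_dist", "fsq_center_dist"]].corr()` would return the corresponding correlation matrix.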

In [7]:
# Model 2
y = master_df['pct_usage']
X = master_df[["yelp_center_dist", "fsq_center_dist"]]
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept
X.head()

Unnamed: 0,const,yelp_center_dist,fsq_center_dist
0,1.0,206.798597,204.0
1,1.0,130.753256,431.0
2,1.0,794.358688,79.0
3,1.0,165.841893,142.0
4,1.0,711.265245,729.0


In [8]:
model = sm.OLS(y, X) #instantiate
results = model.fit() #fit the model
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              pct_usage   R-squared:                       0.023
Model:                            OLS   Adj. R-squared:                  0.015
Method:                 Least Squares   F-statistic:                     2.685
Date:                Sun, 10 Sep 2023   Prob (F-statistic):             0.0704
Time:                        21:09:42   Log-Likelihood:                -1003.5
No. Observations:                 228   AIC:                             2013.
Df Residuals:                     225   BIC:                             2023.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               59.9517      2.206  

This model's **adjusted R-squared** (0.015) is even lower, so it explains less of the variation in bike usage, and its p-value (Prob (F-statistic) = 0.0704) is not significant at the 5% level.
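
To make the comparison between the first two models concrete, adjusted R-squared can be recomputed from the R-squared, the number of observations, and the number of predictors (excluding the intercept) reported in each summary. This is a minimal sketch using the rounded values printed above, so the third decimal may differ slightly from the summary; the function name `adjusted_r2` is my own.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: penalises R-squared for each added predictor."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Rounded inputs taken from the two OLS summaries above.
adj_model_1 = adjusted_r2(0.040, n_obs=228, n_predictors=4)
adj_model_2 = adjusted_r2(0.023, n_obs=228, n_predictors=2)

print(f"Model 1: {adj_model_1:.3f}")
print(f"Model 2: {adj_model_2:.3f}")
```

Dropping two predictors lowers the penalty, but R-squared falls by more than the penalty saves, so the adjusted value still goes down.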

In [9]:
# Model 3 - Using only Yelp Fitness Centers
y = master_df['pct_usage']
X = master_df[["yelp_center_dist"]]
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept
X.head()

Unnamed: 0,const,yelp_center_dist
0,1.0,206.798597
1,1.0,130.753256
2,1.0,794.358688
3,1.0,165.841893
4,1.0,711.265245


In [10]:
model = sm.OLS(y, X) #instantiate
results = model.fit() #fit the model
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              pct_usage   R-squared:                       0.021
Model:                            OLS   Adj. R-squared:                  0.017
Method:                 Least Squares   F-statistic:                     4.857
Date:                Sun, 10 Sep 2023   Prob (F-statistic):             0.0285
Time:                        21:12:08   Log-Likelihood:                -1003.8
No. Observations:                 228   AIC:                             2012.
Df Residuals:                     226   BIC:                             2018.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               60.5919      2.019  

The third model explains **1.7%** of the variation in bike usage (adjusted R-squared), and its p-value (Prob (F-statistic) = 0.0285) is below 0.05, which indicates statistical significance. The `yelp_center_dist` coefficient also has a p-value below 0.05, showing a relationship, *albeit minute*, between the proximity of fitness centers and bike usage.

# Stretch

How can you turn the regression model into a classification model?
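
One common way to do this is to replace the continuous target with a binary label and fit a classifier on it. The sketch below shows that binarization step only. This notebook does not show how `bin_usage` and `usage_cat` were derived, so the median cut-off and the sample values here are assumptions for illustration.

```python
from statistics import median

# Hypothetical pct_usage values (NOT taken from master_df).
pct_usage = [12.5, 48.0, 55.0, 73.3, 90.0, 61.2]

# Assumed rule: stations at or above the median usage are labelled 1 ("High").
threshold = median(pct_usage)
bin_usage = [1 if p >= threshold else 0 for p in pct_usage]
usage_cat = ["High" if b == 1 else "Low" for b in bin_usage]

print(threshold, bin_usage, usage_cat)
```

With a binary target in place, logistic regression can model the probability of the "High" class.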

## Logistic Regression

#### Using bin_usage as the dependent variable ($y$) and yelp_center_dist, yelp_college_dist, fsq_center_dist, fsq_college_dist as independent variables ($x_1$ to $x_4$).




In [11]:
master_df['usage_cat'].value_counts()

High    120
Low     108
Name: usage_cat, dtype: int64

In [12]:
master_df['bin_usage'].value_counts()

1    120
0    108
Name: bin_usage, dtype: int64

In [13]:
# Model 1

y = master_df['bin_usage']
X = master_df[["yelp_center_dist", "yelp_college_dist", "fsq_center_dist", "fsq_college_dist"]]
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept

model = sm.Logit(y.astype(float), X.astype(float)) # (need to send in as floats)

results = model.fit() #fit the model (MLE)
print(results.summary())

Optimization terminated successfully.
         Current function value: 0.676564
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:              bin_usage   No. Observations:                  228
Model:                          Logit   Df Residuals:                      223
Method:                           MLE   Df Model:                            4
Date:                Sun, 10 Sep 2023   Pseudo R-squ.:                 0.02197
Time:                        21:20:03   Log-Likelihood:                -154.26
converged:                       True   LL-Null:                       -157.72
Covariance Type:            nonrobust   LLR p-value:                    0.1396
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -0.2705      0.296     -0.914      0.361      -0.850       0.309
yelp_cen

The **Pseudo R-squared** (0.02197), a likelihood-based measure of goodness of fit, shows that the predictors improve on the intercept-only model by only about 2.2%, so the fit is weak. The **LLR p-value** of 0.1396 suggests that the model is not statistically significant. None of the predictor variables show significance.
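
The Pseudo R-squared reported by statsmodels is McFadden's, which compares the fitted model's log-likelihood with that of the intercept-only model (LL-Null). A minimal sketch using the rounded log-likelihoods printed in the summary above; the helper name `mcfadden_r2` is my own, and the result can differ from the summary in the last digits because of that rounding.

```python
def mcfadden_r2(ll_model: float, ll_null: float) -> float:
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(intercept-only)."""
    return 1.0 - ll_model / ll_null

# Log-Likelihood and LL-Null from the Logit summary above.
pseudo_r2 = mcfadden_r2(ll_model=-154.26, ll_null=-157.72)
print(f"Pseudo R-squared = {pseudo_r2:.4f}")
```

Because it is a ratio of log-likelihoods, this value is not the share of variance explained in the way OLS R-squared is.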

In [14]:
# Model 2
y = master_df['bin_usage']
X = master_df[["fsq_center_dist"]]
X = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept

model = sm.Logit(y.astype(float), X.astype(float)) # (need to send in as floats)

results = model.fit() #fit the model (MLE)
print(results.summary())

Optimization terminated successfully.
         Current function value: 0.677240
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:              bin_usage   No. Observations:                  228
Model:                          Logit   Df Residuals:                      226
Method:                           MLE   Df Model:                            1
Date:                Sun, 10 Sep 2023   Pseudo R-squ.:                 0.02099
Time:                        21:26:46   Log-Likelihood:                -154.41
converged:                       True   LL-Null:                       -157.72
Covariance Type:            nonrobust   LLR p-value:                   0.01007
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              -0.3428      0.222     -1.546      0.122      -0.777       0.092
fsq_center_dis

The predictor in this model yields a Pseudo R-squared of 0.02099 (about 2.1%), and the model is statistically significant (LLR p-value = 0.01007). The `fsq_center_dist` predictor is also statistically significant, with a p-value of 0.013.
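
The LLR p-values of the two logistic models come from a likelihood-ratio test: twice the gain in log-likelihood over the intercept-only model is compared with a chi-square distribution whose degrees of freedom equal Df Model. The sketch below reproduces both approximately from the rounded log-likelihoods in the summaries. The helpers `chi2_sf` and `llr_pvalue` are my own and only cover the closed-form cases needed here (df of 1, or an even df).

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail chi-square probability; closed forms for df == 1 or even df only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df > 0 and df % 2 == 0:
        half = x / 2.0
        terms = (half**i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * sum(terms)
    raise ValueError("this sketch only handles df == 1 or even df")

def llr_pvalue(ll_model: float, ll_null: float, df: int) -> float:
    """Likelihood-ratio test p-value against the intercept-only model."""
    statistic = 2.0 * (ll_model - ll_null)
    return chi2_sf(statistic, df)

p_model_1 = llr_pvalue(-154.26, -157.72, df=4)  # four-predictor model
p_model_2 = llr_pvalue(-154.41, -157.72, df=1)  # fsq_center_dist only
print(f"Model 1: {p_model_1:.4f}  Model 2: {p_model_2:.4f}")
```

The two log-likelihoods are nearly identical, but the single-predictor model spends only one degree of freedom to get there, which is why its test is significant while the four-predictor model's is not.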

### Mapping

In [15]:
import folium

In [16]:
# Create a base map
m = folium.Map(location=[39.97195, -75.13445], zoom_start=15)

# Iterate through DataFrame rows and add CircleMarkers
for index, row in master_df.iterrows():
    lat, lng = row['latitude'], row['longitude']
    name = row['name']
    
    if not pd.isna(lat) and not pd.isna(lng):
        folium.CircleMarker(
            location=[lat, lng],
            radius=5,  # Adjust the size of the circle as needed
            color='magenta',  # Outline color
            fill=True,
            fill_color='white',  # Circle color
            fill_opacity=0.6,
            tooltip=name
        ).add_to(m)

# Save the map to an HTML file or display it
m.save("bikes_mapped.html")

In [17]:
m