
# Merge Data

In [3]:
import pandas as pd

In [1]:
# Mount Google Drive so the dataset CSVs can be read from Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
df_foursquare = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/Project-Statistics_city_bikes_FourSquare.csv')

In [5]:
df_yelp = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/Project-Statistics_city_bikes_yelp.csv')

In [6]:
merged_df = df_foursquare.merge(df_yelp, on=['city', 'station_name','latitude', 'longitude','empty_slots','slots','free_bikes','ebikes'])
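
The merge above is an inner join on eight key columns, several of which (`empty_slots`, `free_bikes`, `ebikes`) are live counts. If the two CSVs were captured at different moments (an assumption, not something the notebook states), stations whose counts changed would be dropped silently. A minimal sketch with made-up toy frames shows one way to audit this using the `indicator` option of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for the two source files (values are invented for illustration)
left = pd.DataFrame({
    "station_name": ["A", "B", "C"],
    "free_bikes": [3, 5, 7],
    "location_count": [0, 1, 2],
})
right = pd.DataFrame({
    "station_name": ["A", "B", "C"],
    "free_bikes": [3, 6, 7],  # station B changed between the two snapshots
    "yelp_location_count": [8, 9, 10],
})

keys = ["station_name", "free_bikes"]

# Default inner join: station B disappears because its key values differ
inner = left.merge(right, on=keys)

# Outer join with indicator=True labels each row as both / left_only / right_only
audit = left.merge(right, on=keys, how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(len(inner), "rows matched;", len(unmatched), "rows unmatched")
```

If rows turn out to be unmatched on the real data, joining on stable identifiers only (for example `station_name`, `latitude`, `longitude`) may be a safer choice.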

In [7]:
df = merged_df.copy()

In [8]:
df.shape

(242, 12)

In [9]:
df

Unnamed: 0,city,station_name,empty_slots,slots,free_bikes,ebikes,latitude,longitude,station_location,location_count,yelp_location_count,yelp_review_count
0,Vancouver,10th & Cambie,22,35,13,4,49.262487,-123.114397,"49.262487,-123.114397",0,8,42
1,Vancouver,Yaletown-Roundhouse Station,6,16,10,0,49.274566,-123.121817,"49.274566,-123.121817",0,14,94
2,Vancouver,Dunsmuir & Beatty,23,26,3,1,49.279764,-123.110154,"49.279764,-123.110154",0,10,61
3,Vancouver,12th & Yukon (City Hall),14,16,2,2,49.260599,-123.113504,"49.260599,-123.113504",0,10,49
4,Vancouver,8th & Ash,15,16,1,0,49.264215,-123.117772,"49.264215,-123.117772",0,9,69
...,...,...,...,...,...,...,...,...,...,...,...,...
237,Vancouver,Burrard & 14th,12,18,4,4,49.259469,-123.145718,"49.259469,-123.145718",0,8,15
238,Vancouver,Hornby & Drake,19,24,5,2,49.277178,-123.130000,"49.277178,-123.13",0,13,176
239,Vancouver,Cardero & Bayshore,2,20,16,1,49.291597,-123.129158,"49.291597,-123.129158",0,12,1251
240,Vancouver,27th & Main,13,22,9,7,49.247204,-123.101549,"49.247204,-123.101549",0,6,173


# Build a regression model.

In [11]:
df.dtypes

city                    object
station_name            object
empty_slots              int64
slots                    int64
free_bikes               int64
ebikes                   int64
latitude               float64
longitude              float64
station_location        object
location_count           int64
yelp_location_count      int64
yelp_review_count        int64
dtype: object

In [54]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Split the dataset into input features (X) and target variable (y)
# X = df[df.columns[~df.columns.isin(['ebikes'])]] # fails since they must be numerical
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols.remove('ebikes')
X = df[numeric_cols]
y = df['ebikes']

# Add a constant column to the input features
X_const = sm.add_constant(X)

# Fit the OLS regression model
model = sm.OLS(y, X_const)
results = model.fit()

Provide model output and an interpretation of the results. 

In [55]:
# Print the model summary
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                 ebikes   R-squared:                       0.196
Model:                            OLS   Adj. R-squared:                  0.168
Method:                 Least Squares   F-statistic:                     7.084
Date:                Sun, 04 Jun 2023   Prob (F-statistic):           2.23e-08
Time:                        06:12:59   Log-Likelihood:                -432.32
No. Observations:                 242   AIC:                             882.6
Df Residuals:                     233   BIC:                             914.0
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                1690.9419    

Based on the p-values, only latitude falls below 0.05, which indicates that it is statistically significant. We could try dropping the other columns and rerunning the model.

In [56]:
results.params

const                  1690.941861
empty_slots              -0.124956
slots                     0.149090
free_bikes               -0.051673
latitude                -47.280737
longitude                -5.191709
location_count           -0.168680
yelp_location_count      -0.024606
yelp_review_count         0.000215
dtype: float64

Let's interpret the coefficients for the given example:

Each coefficient is the change in the predicted value of the response variable (ebikes) for a one-unit increase in that predictor, holding the other predictors constant.

* empty_slots: For each unit increase in the number of empty slots, the predicted value decreases by approximately 0.125.

* slots: For each unit increase in the total number of slots, the predicted value increases by approximately 0.149.

* free_bikes: For each unit increase in the number of free bikes, the predicted value decreases by approximately 0.052.

* latitude: For each unit (one degree) increase in latitude, the predicted value decreases by approximately 47.281.

* longitude: For each unit (one degree) increase in longitude, the predicted value decreases by approximately 5.192.

* location_count: For each unit increase in the location count, the predicted value decreases by approximately 0.169.

* yelp_location_count: For each unit increase in the Yelp location count, the predicted value decreases by approximately 0.025.

* yelp_review_count: For each unit increase in the Yelp review count, the predicted value increases by approximately 0.0002.
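
As a worked example of reading these coefficients, the sketch below multiplies a coefficient by a change in its predictor to get the change in predicted ebikes. The coefficient values are copied from `results.params` above; the helper name `predicted_change` is made up for illustration. The latitudes in the rows shown differ by only a few hundredths of a degree, so a step of 0.01 is more realistic than a full degree.

```python
# Coefficients copied from results.params above
coef = {
    "empty_slots": -0.124956,
    "slots": 0.149090,
    "latitude": -47.280737,
}

def predicted_change(name, delta):
    """Change in predicted ebikes when `name` moves by `delta`, other predictors fixed."""
    return coef[name] * delta

print(round(predicted_change("slots", 1), 3))         # 0.149
print(round(predicted_change("empty_slots", 1), 3))   # -0.125
print(round(predicted_change("latitude", 0.01), 3))   # -0.473
```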

In [32]:
results.params.to_frame(name='Coefficient')

Unnamed: 0,Coefficient
const,1690.941861
empty_slots,-0.124956
slots,0.14909
free_bikes,-0.051673
latitude,-47.280737
longitude,-5.191709
location_count,-0.16868
yelp_location_count,-0.024606
yelp_review_count,0.000215


## Try with Logit

Logit only works when y is binary!

In [52]:
# Convert y to binary format
y_binary = np.where(y > 0, 1, 0)

In [57]:
X_const = sm.add_constant(X) #adds a column of 1's so the model will contain an intercept

model = sm.Logit(y_binary,X_const)
results = model.fit()
print(results.summary())

Optimization terminated successfully.
         Current function value: 0.631347
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                  242
Model:                          Logit   Df Residuals:                      233
Method:                           MLE   Df Model:                            8
Date:                Sun, 04 Jun 2023   Pseudo R-squ.:                 0.06237
Time:                        06:13:33   Log-Likelihood:                -152.79
converged:                       True   LL-Null:                       -162.95
Covariance Type:            nonrobust   LLR p-value:                  0.009166
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                1849.4328    947.993      1.951      0.051      -8.599    3707.465
em

* The p-value for the intercept (const) is 0.051. This is just above the 0.05 significance level, so the intercept's relationship with the log odds of the positive class is borderline and not statistically significant at that level.
* The p-values for empty_slots, slots, free_bikes, longitude, location_count, yelp_location_count, and yelp_review_count are all greater than 0.05. These variables do not show statistical significance in predicting whether a station has any ebikes.
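
Logit coefficients are on the log-odds scale, which is harder to read than OLS coefficients. A minimal sketch of the two usual conversions follows; the coefficient value `0.1` is hypothetical (the summary above is truncated, so no real slope is available here).

```python
import numpy as np

def log_odds_to_probability(z):
    """Logistic (sigmoid) function: maps log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Log odds of 0 correspond to even odds, i.e. a probability of 0.5
print(log_odds_to_probability(0.0))

# A hypothetical coefficient of 0.1 multiplies the odds by exp(0.1)
# for each one-unit increase in that predictor
hypothetical_coef = 0.1
odds_ratio = np.exp(hypothetical_coef)
print(round(float(odds_ratio), 3))
```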

### Try with Linear Regression Model

In [58]:
# Fit one single-predictor OLS model per column, using list comprehensions instead of manual for loops.
# Note: no constant is added here, so each model is fitted without an intercept
# (statsmodels then reports the uncentered R-squared).
Models = [sm.OLS(y, X[[x]]) for x in X]  # list of models
Results = [model.fit() for model in Models]  # list of results
Adj_Rsquared = [results.rsquared_adj for results in Results]  # list of adjusted R-squared
Pval = [results.pvalues for results in Results]  # list of p-values
Params = [results.params for results in Results]  # list of parameters

In [59]:
Results

[<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fcafc83aa10>,
 <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fcafc818dc0>,
 <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fcafc819150>,
 <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fcafc818160>,
 <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fcafc818430>,
 <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fcafc818520>,
 <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fcafc818610>,
 <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fcafc8187c0>]

In [60]:
print(Results[0].summary())

                                 OLS Regression Results                                
Dep. Variable:                 ebikes   R-squared (uncentered):                   0.278
Model:                            OLS   Adj. R-squared (uncentered):              0.275
Method:                 Least Squares   F-statistic:                              92.79
Date:                Sun, 04 Jun 2023   Prob (F-statistic):                    8.68e-19
Time:                        06:13:50   Log-Likelihood:                         -484.06
No. Observations:                 242   AIC:                                      970.1
Df Residuals:                     241   BIC:                                      973.6
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                  coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------

In this case, the R-squared and adjusted R-squared values (0.278 and 0.275) are relatively low, suggesting that this predictor does not have a strong linear relationship with the dependent variable or that other important variables are missing from the model. Because the single-variable models are fitted without a constant, statsmodels reports the uncentered R-squared, which is not directly comparable to the 0.196 of the full model that includes an intercept.
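
The "(uncentered)" label in the summary matters when reading these numbers. The sketch below, on synthetic data with NumPy only, fits a no-intercept line and computes R-squared both ways. The data-generating choices (a response that hovers around 2 and is unrelated to the predictor) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 30, size=200)      # hypothetical predictor, e.g. a slot count
y = 2.0 + rng.normal(0, 1, size=200)   # response with no real dependence on x

# Least-squares slope for the no-intercept model y = b * x
b = (x @ y) / (x @ x)
resid = y - b * x
ss_res = resid @ resid

# Uncentered R-squared compares against predicting 0 for every observation
r2_uncentered = 1 - ss_res / (y @ y)

# Centered R-squared compares against predicting the mean of y
y_dev = y - y.mean()
r2_centered = 1 - ss_res / (y_dev @ y_dev)

print("uncentered:", round(float(r2_uncentered), 3))
print("centered:  ", round(float(r2_centered), 3))
```

The uncentered value is never smaller than the centered one, so a no-intercept fit can look better than it is when the response has a non-zero mean.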

## Model Predictions

In [62]:
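# `results` was last assigned in the Logit cell above, so these are predicted probabilities, not ebike counts
y_pred = results.predict(X_const)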

In [63]:
# how accurate are these predictions?
from sklearn.metrics import r2_score
r2 = r2_score(y, y_pred) * 100
r2_formatted = "{:.2f}%".format(r2) 

print("The R-squared (R2) score of the regression model is: {}".format(r2_formatted))


The R-squared (R2) score of the regression model is: -15.70%


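Conclusion: A negative R-squared means these predictions fit the data worse than simply predicting the mean. Note that `results` was last assigned by the Logit fit, so `y_pred` holds probabilities between 0 and 1 rather than ebike counts, which is why the score is far below the 0.196 reported by the OLS summary. Either way, counting the number of nearby "parks" is not a very accurate method for predicting the number of ebikes.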
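
To see how an R-squared can go negative, it helps to compute it by hand as 1 - SS_res / SS_tot, which is the definition `r2_score` uses. The sketch below uses made-up counts and made-up predictions confined to the 0 to 1 range; neither comes from the notebook's data.

```python
import numpy as np

y_true = np.array([0, 1, 2, 4, 7])             # made-up counts
y_prob = np.array([0.4, 0.6, 0.7, 0.8, 0.9])   # made-up predictions confined to [0, 1]

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Predictions on the wrong scale leave more error than the mean does
print(round(float(r_squared(y_true, y_prob)), 3))   # -0.606

# Predicting the mean for every observation scores exactly 0
mean_baseline = np.full(y_true.shape, y_true.mean())
print(float(r_squared(y_true, mean_baseline)))      # 0.0
```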

## Compare Predictions to Average as Baseline

In [64]:
# Calculate the average value of the response variable
y_mean = np.mean(y)

# Create an array of predicted values with the average value repeated for all observations.
# Note: np.full_like inherits y's integer dtype, so the mean is truncated to a whole number.
y_pred_baseline = np.full_like(y, y_mean)

# Share of stations whose ebike count exactly equals the baseline prediction
baseline_accuracy = np.mean(y == y_pred_baseline) * 100

print("Baseline prediction accuracy: {:.2f}%".format(baseline_accuracy))

Baseline prediction accuracy: 23.97%


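This shows that simply predicting the average would do better than our model. The two numbers are not measured the same way, though: the baseline figure is the share of stations whose ebike count exactly equals the (integer-truncated) mean, while the model figure is an R-squared. On the R-squared scale, a mean-only baseline scores 0 by definition, which is still above the model's -15.70%.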
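
One detail worth checking in the baseline cell is what `np.full_like` does with an integer target. The sketch below uses made-up counts to show that the fill value is cast to the array's integer dtype, and contrasts it with `np.full`, which keeps the float mean. Mean absolute error is used here as one possible like-for-like error measure; that choice is an assumption, not the notebook's method.

```python
import numpy as np

y = np.array([0, 1, 1, 2, 4, 7])   # made-up counts with an integer dtype
y_mean = y.mean()                  # 2.5

# full_like copies y's integer dtype, so 2.5 is truncated to 2
truncated = np.full_like(y, y_mean)

# full with an explicit shape keeps the float value 2.5
exact = np.full(y.shape, y_mean)

print(truncated[0], exact[0])      # 2 2.5

# Exact-match rate against the truncated baseline (only the single 2 matches)
print(np.mean(y == truncated))

# Mean absolute error of the untruncated mean baseline
print(np.mean(np.abs(y - exact)))  # 2.0
```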

# Stretch

How can you turn the regression model into a classification model?

To turn a regression model into a classification model, you can apply a threshold to the predicted values and convert them into discrete classes.

In [44]:
# Set the threshold for classification
threshold = 0.5

# Apply the threshold to the predicted values
y_pred_class = np.where(y_pred >= threshold, 1, 0)

# Print the predicted class labels
print(y_pred_class)

[1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1
 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1]


In [50]:
from sklearn.metrics import accuracy_score

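# Calculate the accuracy score.
# Note: y holds raw ebike counts, so a station with 2 or more ebikes can never match a 0/1 prediction;
# comparing against the binarized target (y_binary) would be the like-for-like check.
accuracy = accuracy_score(y, y_pred_class)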

# Print the accuracy score
print("y_pred_classification Accuracy:", "{:.2f}%".format(accuracy*100) )

y_pred_classification Accuracy: 24.38%


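This is still too low to be considered statistically relevant. Part of the gap comes from the comparison itself: the predicted classes are 0 or 1, but `y` contains raw ebike counts, so every station with two or more ebikes is scored as a miss.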
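
A like-for-like check would score the thresholded predictions against a binarized target, as was done for `y_binary` in the Logit section. The sketch below uses made-up counts and made-up probabilities to show how much the choice of target changes the accuracy figure.

```python
import numpy as np

y_counts = np.array([0, 1, 3, 0, 2, 5])              # made-up ebike counts
y_prob = np.array([0.3, 0.8, 0.9, 0.6, 0.7, 0.4])    # made-up predicted probabilities

threshold = 0.5
y_pred_class = np.where(y_prob >= threshold, 1, 0)   # [0, 1, 1, 1, 1, 0]
y_true_class = np.where(y_counts > 0, 1, 0)          # [0, 1, 1, 0, 1, 1]

# Scoring against raw counts: only counts of exactly 0 or 1 can ever match
acc_vs_counts = np.mean(y_counts == y_pred_class)

# Scoring against the binarized target: both sides are 0/1 labels
acc_vs_binary = np.mean(y_true_class == y_pred_class)

print("vs raw counts: {:.2f}%".format(acc_vs_counts * 100))   # 33.33%
print("vs binary:     {:.2f}%".format(acc_vs_binary * 100))   # 66.67%
```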