Build a regression model.

In [2]:
import pandas as pd

import statsmodels.api as sm

combined_df = pd.read_csv('combined_df.csv')

# Treat non-numeric entries such as 'No rating provided' as missing, then convert the column to float
combined_df['Rating'] = pd.to_numeric(combined_df['Rating'], errors='coerce')
combined_df = combined_df.dropna(subset=['Bikes Available', 'Rating'])

x = combined_df['Bikes Available']  # independent variable
y = combined_df['Rating']           # dependent variable

x = sm.add_constant(x)  # add the intercept term
model = sm.OLS(y, x).fit()

print(model.summary())






                            OLS Regression Results                            
Dep. Variable:                 Rating   R-squared:                       0.060
Model:                            OLS   Adj. R-squared:                  0.046
Method:                 Least Squares   F-statistic:                     4.235
Date:                Wed, 30 Oct 2024   Prob (F-statistic):             0.0436
Time:                        08:58:27   Log-Likelihood:                -42.533
No. Observations:                  68   AIC:                             89.07
Df Residuals:                      66   BIC:                             93.50
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const               5.2179      0.095     

Provide model output and an interpretation of the results. 

In [None]:
# In the model above, the intercept is 5.2179, which is the expected rating when no bikes are available. Each additional bike
# available is associated with an average increase of 0.0188 in the expected rating. The R^2 value of 0.060 means the model
# explains only about 6% of the variance in Rating, so "Bikes Available" is not a strong predictor of Rating by itself, and
# other variables would need to be introduced to get a stronger prediction.
#
# In conclusion, this model shows a weak, positive relationship between bikes available and rating: on average, locations with
# more bikes have slightly higher ratings for nearby businesses. The relationship is statistically significant at the 5% level
# (Prob (F-statistic) = 0.0436, which in a single-predictor model is also the p-value of the slope), but the low R^2 shows that
# the model captures very little of the variance in Rating. One possible reason is limited public interaction on Yelp/Foursquare,
# which leaves many business locations with no ratings or very few. The low R^2 also indicates that other factors play a much
# larger role in determining a business's rating.





# Stretch

How can you turn the regression model into a classification model?

                Location   Latitude   Longitude  Free Bikes Station Name  \
0       Chilco & Barclay  49.291909 -123.140713        13.0          NaN   
1   St George & Broadway  49.262321 -123.093060         0.0          NaN   
2  Britannia Parking Lot  49.275882 -123.071865         0.0          NaN   
3        Morton & Denman  49.288030 -123.142135        16.0          NaN   
4    Thornton & National  49.273777 -123.092723        14.0          NaN   

  Station Coordinates  Bikes Available Name Address Rating  
0                 NaN              NaN  NaN     NaN    NaN  
1                 NaN              NaN  NaN     NaN    NaN  
2                 NaN              NaN  NaN     NaN    NaN  
3                 NaN              NaN  NaN     NaN    NaN  
4                 NaN              NaN  NaN     NaN    NaN  
