Build a regression model.

In [1]:
import statsmodels.api as sm
import pandas as pd

In [2]:
joined_bike_df = pd.read_json('joined_bike_df')

In [10]:
# Dependent variable
y = joined_bike_df['Number of Bikes']

# Independent variables
X = joined_bike_df[['Restaurant Count', 'Average Rating']]

In [11]:
# Add a constant (intercept term) to the predictor variables
X = sm.add_constant(X)

In [12]:
# Fit the regression model
model = sm.OLS(y, X)
results = model.fit()

Provide model output and an interpretation of the results. 

In [14]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:        Number of Bikes   R-squared:                       0.044
Model:                            OLS   Adj. R-squared:                  0.025
Method:                 Least Squares   F-statistic:                     2.269
Date:                Sun, 04 Jun 2023   Prob (F-statistic):              0.109
Time:                        22:29:39   Log-Likelihood:                -322.70
No. Observations:                 101   AIC:                             651.4
Df Residuals:                      98   BIC:                             659.2
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               34.1142     13.382  

# Output explanation

The regression results indicate the relationship between the number of bikes and the independent variables (Restaurant Count and Average Rating). 

- R-squared: The R-squared value of 0.044 suggests that approximately 4.4% of the variation in the number of bikes can be explained by the number of restaurants and their average rating included in the model.

- Adj. R-squared: The adjusted R-squared value is 0.025. After penalizing for the number of predictors in the model, the independent variables explain only about 2.5% of the variation in the number of bikes.

- F-statistic: The F-statistic of 2.269 tests the overall significance of the model. The associated probability is 0.109, which is higher than the typical significance level of 0.05. This indicates that the model as a whole is not statistically significant.

- The coefficient for the Restaurant Count variable is -1.4456. It suggests that, on average and holding the average rating constant, each additional restaurant in a location is associated with about 1.446 fewer bikes.

- The coefficient for the Average Rating variable is -2.8927. Because the coefficient is negative, it suggests that, on average and holding the restaurant count constant, each one-unit increase in the average rating of restaurants is associated with about 2.893 fewer bikes.

- P-values: The P-values for both the Restaurant Count and Average Rating variables are higher than the typical significance level of 0.05. This means that neither coefficient is statistically significant: for each variable, we fail to reject the null hypothesis that it has no linear relationship with the number of bikes.

In summary, the regression model with the given independent variables (Restaurant Count and Average Rating) explains very little of the variation in the number of bikes. Neither coefficient is statistically significant, and the overall model is not significant based on the F-statistic, so the estimated effects should not be read as evidence of a real relationship.

# Stretch

How can you turn the regression model into a classification model?

To turn the regression problem into a classification problem, we need to define classes or categories for the target variable (number of bikes) and then assign instances to these classes based on certain criteria.

- Define Classes: We can define classes based on ranges or thresholds. For example, we could create three classes: "Low Bikes" (0-10 bikes), "Medium Bikes" (11-20 bikes), and "High Bikes" (21 or more bikes).

- Data Preprocessing: Prepare the dataset by assigning each instance to the appropriate class based on the defined ranges. This requires discretizing the target variable (number of bikes) and creating a new column with the assigned class labels.

- Feature Selection: Identify the most relevant features (independent variables) that can help classify the instances into the defined classes. Consider variables such as restaurant count, average rating, and any other relevant features that may contribute to the classification task.

- Model Selection: Choose an appropriate classification algorithm for the problem. Common algorithms for classification include logistic regression, decision trees, random forests, and support vector machines (SVM). The choice of algorithm depends on the characteristics of the dataset, the number of features, and the desired interpretability of the model.

- Model Training and Evaluation: Split the dataset into training and testing sets. Train the chosen classification model using the training data and evaluate its performance using appropriate metrics such as accuracy, precision, recall, and F1-score.

- Model Deployment and Monitoring: After the model has been evaluated on the test set, deploy it and monitor its accuracy over time. We can also use additional techniques like feature importance analysis to gain insights into the factors influencing the classification.
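The discretization step in the list above can be sketched with `pandas.cut`. This is a minimal illustration, assuming the example ranges given earlier (0-10, 11-20, 21 or more); the bike counts in `demo_df` and the `Bike Class` column name are hypothetical.

```python
import pandas as pd

# Hypothetical bike counts standing in for joined_bike_df['Number of Bikes']
demo_df = pd.DataFrame({"Number of Bikes": [0, 4, 10, 11, 15, 20, 21, 35]})

# Bin edges for the example ranges: 0-10, 11-20, 21 or more.
# pd.cut uses right-closed bins by default, so 10 falls in "Low Bikes"
# and 20 falls in "Medium Bikes". The -1 lower edge makes 0 fall in the first bin.
bins = [-1, 10, 20, float("inf")]
labels = ["Low Bikes", "Medium Bikes", "High Bikes"]

demo_df["Bike Class"] = pd.cut(demo_df["Number of Bikes"], bins=bins, labels=labels)

print(demo_df)
print(demo_df["Bike Class"].value_counts().sort_index())
```

The thresholds are a modelling choice: very uneven class sizes make the later classification step harder to evaluate, so it is worth checking the class counts before training.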

Overall, the key steps involve defining classes, transforming the problem into a classification task, selecting relevant features, choosing an appropriate classification algorithm, training and evaluating the model, and monitoring its performance. By doing so, we can predict the class (low, medium, or high) of bike availability based on the given independent variables.
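As a rough end-to-end illustration of these steps, the sketch below bins a target, splits the data, trains a classifier, and reports the metrics mentioned above. It assumes scikit-learn is available, uses logistic regression as one of the candidate algorithms, and runs on randomly generated data (`demo_df` is hypothetical), so the scores it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for joined_bike_df (hypothetical values, not the real data)
rng = np.random.default_rng(42)
n = 300
demo_df = pd.DataFrame({
    "Restaurant Count": rng.integers(1, 20, size=n),
    "Average Rating": rng.uniform(3.0, 5.0, size=n),
    "Number of Bikes": rng.integers(0, 40, size=n),
})

# Define classes: discretize the target into the three example ranges
labels = ["Low Bikes", "Medium Bikes", "High Bikes"]
demo_df["Bike Class"] = pd.cut(
    demo_df["Number of Bikes"],
    bins=[-1, 10, 20, float("inf")],
    labels=labels,
).astype(str)

# Feature selection: the same two predictors used in the regression
X = demo_df[["Restaurant Count", "Average Rating"]]
y = demo_df["Bike Class"]

# Split into training and testing sets, keeping class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Model selection and training: multiclass logistic regression
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Evaluation: accuracy plus per-class precision, recall, and F1-score
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred, zero_division=0))
```

Logistic regression is used here only because it is simple and interpretable; a decision tree or random forest could be swapped in with the same `fit` and `predict` calls.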