Build a regression model.

In [115]:
import pandas as pd


# Load the joined_df CSV file

joined_df = pd.read_csv('/Users/fitsumbahlebi/Desktop/repo2/Statistical-Modelling-with-Python/data/joined_df.csv')
joined_df.head()

Unnamed: 0,id,name_x,latitude_x,longitude_x,free_bikes,empty_slots,name_y,address,latitude_y,longitude_y,rating,review_count,distance_meters,term,bike_station_id
0,024a3edf037cb411d16acc08a7fcb954,Bay at Strachan,43.267859,-79.867923,10,14,El Grito Mexicano,236 James Street N,43.2626,-79.866237,5.0,1,600.437633,restaurants,024a3edf037cb411d16acc08a7fcb954
1,024a3edf037cb411d16acc08a7fcb954,Bay at Strachan,43.267859,-79.867923,10,14,Charred Chicken,244 James Street N,43.262736,-79.866384,4.2,121,583.045938,restaurants,024a3edf037cb411d16acc08a7fcb954
2,024a3edf037cb411d16acc08a7fcb954,Bay at Strachan,43.267859,-79.867923,10,14,Synonym,328 James Street N,43.264597,-79.865372,4.7,16,417.393702,restaurants,024a3edf037cb411d16acc08a7fcb954
3,024a3edf037cb411d16acc08a7fcb954,Bay at Strachan,43.267859,-79.867923,10,14,Harbour Diner,486 James Street N,43.2694,-79.86329,3.6,66,411.087086,restaurants,024a3edf037cb411d16acc08a7fcb954
4,024a3edf037cb411d16acc08a7fcb954,Bay at Strachan,43.267859,-79.867923,10,14,Shawarma Royale Plus,114 York Blvd,43.260103,-79.872464,5.0,1,934.769733,restaurants,024a3edf037cb411d16acc08a7fcb954


In [125]:

"""
Aggregates data from the joined DataFrame by 'bike_station_id' and 'term'.

The aggregation includes:
- total_pois: Count of POIs (points of interest) by 'name_y'.
- avg_poi_rating: Mean rating of POIs.
- total_reviews: Sum of review counts.
- avg_poi_distance: Mean distance of POIs in meters.
- std_poi_distance: Standard deviation of POI distances in meters.
- total_free_bikes: Sum of free bikes.

Returns:
    DataFrame: Aggregated DataFrame with the specified metrics.
"""
agg_df = joined_df.groupby(['bike_station_id', 'term']).agg(
    total_pois=('name_y', 'count'),
    avg_poi_rating=('rating', 'mean'),
    total_reviews=('review_count', 'sum'),
    avg_poi_distance=('distance_meters', 'mean'),
    std_poi_distance=('distance_meters', 'std'),
    total_free_bikes=('free_bikes', 'sum')  # Sum (not mean): free_bikes repeats on each POI row, so this grows with the row count
).reset_index()

# Show the updated aggregation result
print(agg_df.head())


                    bike_station_id         term  total_pois  avg_poi_rating  \
0  024a3edf037cb411d16acc08a7fcb954    libraries           2        0.000000   
1  024a3edf037cb411d16acc08a7fcb954  restaurants          21        3.747619   
2  024a3edf037cb411d16acc08a7fcb954     shopping          28        2.092857   
3  0263c2af4dcdc215b9c81753a8df8a9a    libraries           2        0.000000   
4  0263c2af4dcdc215b9c81753a8df8a9a  restaurants          21        3.747619   

   total_reviews  avg_poi_distance  std_poi_distance  total_free_bikes  
0              0        454.601886          0.000000                20  
1            727        617.370288        231.653385               210  
2             44        685.749886        190.083498               280  
3              0        454.601886          0.000000                 4  
4            727        617.370288        231.653385                42  


In [118]:
import statsmodels.api as sm
"""
This script performs a linear regression analysis using the statsmodels library.

Steps:
1. Imports the necessary library (statsmodels.api as sm).
2. Defines the independent variables (X) and the dependent variable (y) from the DataFrame `agg_df`.
3. Adds a constant to the independent variables to include the intercept in the model.
4. Fits the Ordinary Least Squares (OLS) regression model.
5. Prints the summary of the regression model.

Variables:
- X: DataFrame containing the independent variables 'total_pois', 'avg_poi_rating', 'total_reviews', 'avg_poi_distance', and 'std_poi_distance'.
- y: Series containing the dependent variable 'total_free_bikes'.
- model: The fitted OLS regression model.
"""

# Define the independent variables (X) and the dependent variable (y)
X = agg_df[['total_pois', 'avg_poi_rating', 'total_reviews', 'avg_poi_distance', 'std_poi_distance']]
y = agg_df['total_free_bikes']

# Add a constant to the independent variables (for the intercept in the model)
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()

# Get the model summary
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:       total_free_bikes   R-squared:                       0.295
Model:                            OLS   Adj. R-squared:                  0.292
Method:                 Least Squares   F-statistic:                     118.4
Date:                Wed, 25 Dec 2024   Prob (F-statistic):           1.08e-43
Time:                        14:51:03   Log-Likelihood:                -3255.1
No. Observations:                 570   AIC:                             6516.
Df Residuals:                     567   BIC:                             6529.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               -0.0016      0.000  

Provide model output and an interpretation of the results. 

**Model Fit:**

*   **R-squared: 0.295:** The model explains about 29.5% of the variance in `total_free_bikes`, so roughly 70% of the variance is left unexplained.

*   **Adjusted R-squared: 0.292:** The adjusted value is almost identical to the R-squared, which suggests the fit is not being inflated by superfluous predictors. It is still low in absolute terms.

**Overall Significance:**

*   **F-statistic: 118.4:** Tests whether the predictors jointly explain more variance than an intercept-only model.
*   **Prob (F-statistic): 1.08e-43:** The p-value is far below 0.05, so the model as a whole is statistically significant.

**Individual Coefficients:**

*   **const: -0.0016:** The intercept is close to zero.
*   **total_pois: 0.0791:** A positive, statistically significant relationship with `total_free_bikes`, holding other variables constant. Part of this is mechanical: `total_free_bikes` sums `free_bikes` over every POI row of a station, so it grows with the number of POIs by construction (in the preview, a station with 10 free bikes gets 20 for 2 POIs and 210 for 21 POIs).
*   **avg_poi_rating: 0.0059:** Also a positive, statistically significant relationship: as the average POI rating increases, `total_free_bikes` tends to increase.
*   **total_reviews: -0.0762:** The total review count has a statistically significant negative association with `total_free_bikes`.
*   **avg_poi_distance: 0.0186:** Not statistically significant, as the p-value is greater than 0.05.
*   **std_poi_distance: 0.5723:** The standard deviation of POI distance has a statistically significant positive relationship with `total_free_bikes`. It is the largest coefficient listed, but the predictors are on different scales, so coefficient sizes are not directly comparable.

**Other Statistics**

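*   **Omnibus, Durbin-Watson, Jarque-Bera, Skew, Kurtosis:** These diagnostics indicate that the residuals are not normally distributed. This mainly weakens the reliability of the t-tests and confidence intervals rather than the coefficient estimates themselves.
*   **Cond. No.: 3.16e+18:** A condition number this large points to near-perfect multicollinearity: some predictors are (almost) exact linear combinations of others. This is consistent with the summary reporting `Df Model: 2` even though five predictors were supplied.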
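
One way to see where a condition number like this can come from is to compare the rank of the design matrix with its number of columns. The sketch below is illustrative only: `X_demo` is a hypothetical stand-in built from the three feature rows visible in the `agg_df` preview, repeated as if every station shared them. That repetition is an assumption about the full data, not a confirmed fact.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the predictors in agg_df: one row per term,
# copied from the preview (libraries, restaurants, shopping).
term_rows = pd.DataFrame({
    "total_pois": [2, 21, 28],
    "avg_poi_rating": [0.0, 3.747619, 2.092857],
    "total_reviews": [0, 727, 44],
    "avg_poi_distance": [454.601886, 617.370288, 685.749886],
    "std_poi_distance": [0.0, 231.653385, 190.083498],
})

# Assumption: every station sees the same three rows (here, 10 stations).
X_demo = pd.concat([term_rows] * 10, ignore_index=True)
X_demo.insert(0, "const", 1.0)  # intercept column, as sm.add_constant would add

# With only three distinct rows, the matrix cannot have rank above 3,
# no matter how many columns it has.
rank = np.linalg.matrix_rank(X_demo.to_numpy())
n_cols = X_demo.shape[1]
print(f"rank = {rank}, columns = {n_cols}")
```

On the real data, the same check could be run on `X` after `sm.add_constant(X)`. A rank below the column count would mean the individual coefficients are not uniquely determined, even if the overall fit is.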

**Key Takeaways**

*   **Moderate Model Fit:** With an R-squared of 0.295, the model captures some of the variance in `total_free_bikes`, but most of it remains unexplained.
*   **Significant Predictors:** The total number of POIs, the average POI rating, and the standard deviation of POI distance all have statistically significant positive coefficients. The total number of reviews has a statistically significant negative coefficient. The average POI distance does not appear to be significant.
*   **Multicollinearity Issues:** The very high condition number indicates strong correlations between the input variables. Under these conditions the coefficient estimates can be unstable, so the individual significance results above should be read with caution.
*   **Residual Issues:** The residuals are not normally distributed, so the normality assumption behind the OLS t-tests and confidence intervals is not met, and the reported p-values may be unreliable.

# Stretch

How can you turn the regression model into a classification model?

**From Regression to Classification: A Brief Overview**

1.  **Original Regression:** Predicts a continuous variable (e.g., exact number of `free_bikes` based on POI features like `rating`, `review_count`, and `distance_meters`).

2.  **Transformation:** Instead of predicting the exact number of bikes, we'll predict a category by grouping the `free_bikes` into discrete classes.

3.  **Defining Categories:**
    *   **Option 1 (Binned):** Create categories based on ranges of `free_bikes` (e.g., Low: 0-5, Medium: 6-15, High: 16+).
    *   **Option 2 (Relative):** Create categories based on percentiles (e.g., Low: bottom 25%, Medium: middle 50%, High: top 25%).

4.  **Target Variable:** The target becomes a categorical variable (e.g., Low, Medium, High).

5.  **Features:**  Use the same features (e.g., `rating`, `review_count`, `distance_meters`).

6.  **Classification Model:** Choose a suitable model (e.g., Logistic Regression, Random Forest, SVM).

7.  **Evaluation:** Use classification metrics (e.g., Accuracy, Confusion Matrix, Precision, Recall, F1-Score).

8.  **Interpretation:** Understand how features influence the probability of a station falling into a specific category (e.g., high rating might indicate high availability).
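
The two ways of defining categories in step 3 can be sketched with `pd.cut` (fixed ranges) and `pd.qcut` (percentiles). The `free_bikes` values below are made up for illustration, and the cut points simply follow the example ranges listed above.

```python
import pandas as pd

# Hypothetical free_bikes values for nine stations.
free_bikes = pd.Series([0, 3, 5, 6, 10, 15, 16, 22, 30], name="free_bikes")
labels = ["Low", "Medium", "High"]

# Option 1 (binned): fixed ranges. Intervals are right-inclusive,
# so the edges -1, 5, 15 give 0-5, 6-15, and 16+ for integer counts.
fixed = pd.cut(free_bikes, bins=[-1, 5, 15, float("inf")], labels=labels)

# Option 2 (relative): bottom 25%, middle 50%, top 25% of the observed values.
relative = pd.qcut(free_bikes, q=[0, 0.25, 0.75, 1.0], labels=labels)

print(pd.DataFrame({"free_bikes": free_bikes, "fixed": fixed, "relative": relative}))
```

Fixed ranges keep the class meaning stable across datasets, while percentile bins adapt to the data and keep class sizes closer to the chosen proportions.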

**Conceptual Approach:**

*   **Categorize:** Convert `free_bikes` into categories like "Low," "Medium," and "High".
*   **Target:** Treat these categories as the new target variable.
*   **Features:** Keep using the POI features as predictors.
*   **Model:** Apply a classification algorithm.
*   **Evaluate:** Use classification-specific performance measures.
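
The conceptual steps above might look like the following end-to-end sketch. It uses scikit-learn, which the notebook does not otherwise use, and a synthetic `demo` DataFrame. The column names mirror `agg_df`, but the values and the relationship between features and target are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300

# Synthetic POI features (assumed, not the real agg_df).
demo = pd.DataFrame({
    "total_pois": rng.integers(1, 40, size=n),
    "avg_poi_rating": rng.uniform(0, 5, size=n),
    "avg_poi_distance": rng.uniform(100, 1000, size=n),
})

# Synthetic continuous target, loosely tied to total_pois, then categorized.
free_bikes = (0.5 * demo["total_pois"] + rng.normal(0, 3, size=n)).clip(lower=0)
class_names = ["Low", "Medium", "High"]
target = pd.cut(free_bikes, bins=[-1, 5, 15, np.inf], labels=class_names).astype(str)

# Hold out a stratified test set so every class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    demo, target, test_size=0.25, random_state=0, stratify=target
)

# Scale features, then fit a multinomial logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Classification-specific evaluation.
acc = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred, labels=class_names)
print(f"accuracy = {acc:.2f}")
print(pd.DataFrame(cm, index=class_names, columns=class_names))
```

Logistic regression is only one of the candidate models listed above. The same pipeline shape would work with a random forest or SVM by swapping the final estimator.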

**Stretch Goal:**

*   **Multi-Class:** Predict finer-grained categories that combine availability with POI characteristics (e.g., "Very High availability with popular POIs").
*   **Hierarchical:** Classify into broad and then specific categories (e.g., first "High" vs "Low", then "Very High" vs "High").