Build a regression model.

In [291]:
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For data visualization
import seaborn as sns  # For data visualization
from sklearn.model_selection import train_test_split  # For splitting the data into training and testing sets
from sklearn.linear_model import LinearRegression  # For building the regression model
from sklearn.metrics import mean_squared_error, r2_score  # For evaluating the model
import statsmodels.api as sm  # For statistical modeling and hypothesis testing
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder  # For encoding categorical columns

I am looking for a correlation between the number of free bikes and the type of venue in the area.

In [292]:
# Load the combined CSV and drop the columns that are not needed
total = pd.read_csv('total.csv')
total = total.drop(columns=['Unnamed: 0', 'empty_slots', 'total_slots'])
total.head()

Unnamed: 0,name,longitude,latitude,timestamp,free_bikes,rating_mean,venue,venue_type
0,368 - Tolstoi - Lorenteggio,9.14943,45.45371,2025-02-07T23:19:44.932063Z,20,3.0,2,"Thrift Stores, Baby Gear & Furniture, Women's ..."
1,25 - Centrale 1,9.202572,45.485456,2025-02-07T23:19:44.938124Z,12,,1,"Clothing Store, Shoe Store"
2,161 - Coni Zugna - Solari,9.16801,45.457079,2025-02-07T23:19:44.936019Z,27,0.0,5,Fashion Clothing Store
3,16 - Moscova,9.18456,45.477534,2025-02-07T23:19:44.938271Z,12,,5,"Clothing Store, Women's Store"
4,359 - Tertulliano - Caroncini,9.218048,45.449209,2025-02-07T23:19:44.932234Z,18,,0,Fashion/Café/Bus


In [293]:
total.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 312 entries, 0 to 311
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         312 non-null    object 
 1   longitude    312 non-null    float64
 2   latitude     312 non-null    float64
 3   timestamp    312 non-null    object 
 4   free_bikes   312 non-null    int64  
 5   rating_mean  222 non-null    float64
 6   venue        312 non-null    int64  
 7   venue_type   312 non-null    object 
dtypes: float64(3), int64(2), object(3)
memory usage: 19.6+ KB


I will see if I need to drop the NaN rows in the ratings if I end up using that column

In [294]:
# Label-encode name, timestamp and venue_type as integers (LabelEncoder, not one-hot encoding)
le = LabelEncoder()
# Join list-valued cells into one string; plain strings pass through unchanged
total['name'] = total['name'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
total['timestamp'] = total['timestamp'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
total['venue_type'] = total['venue_type'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
# Fit the encoder on each column in turn and store the integer codes
total['name_encoded'] = le.fit_transform(total['name'])
total['timestamp_encoded'] = le.fit_transform(total['timestamp'])
total['venue_type_encoded'] = le.fit_transform(total['venue_type'])
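
For comparison, here is a minimal sketch of integer (label) encoding next to one-hot encoding. The `demo` frame and its venue types are made up for illustration and are not taken from `total.csv`. Integer codes give each category an arbitrary number in sorted order, which a linear model then reads as a numeric scale; one-hot encoding gives each category its own 0/1 column instead.

```python
import pandas as pd

# Hypothetical venue types, for illustration only
demo = pd.DataFrame({"venue_type": ["Bar, Café", "Bookstore", "Bar, Café", "Shoe Store"]})

# Integer (label) encoding: one integer per category, assigned in sorted order
demo["venue_type_encoded"] = demo["venue_type"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(demo["venue_type"], prefix="venue", dtype=int)

print(demo)
print(one_hot)
```

With 174 distinct venue types in this dataset, one-hot encoding would add many columns, so grouping similar types first may be worth considering.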


In [295]:
# Drop the original name and timestamp columns now that they have encoded versions (venue_type is dropped later)
total_encoded = total.drop(columns=['name', 'timestamp'])
total_encoded


Unnamed: 0,longitude,latitude,free_bikes,rating_mean,venue,venue_type,name_encoded,timestamp_encoded,venue_type_encoded
0,9.149430,45.453710,20,3.000000,2,"Thrift Stores, Baby Gear & Furniture, Women's ...",225,9,156
1,9.202572,45.485456,12,,1,"Clothing Store, Shoe Store",142,286,13
2,9.168010,45.457079,27,0.000000,5,Fashion Clothing Store,67,183,78
3,9.184560,45.477534,12,,5,"Clothing Store, Women's Store",65,293,14
4,9.218048,45.449209,18,,0,Fashion/Café/Bus,217,19,86
...,...,...,...,...,...,...,...,...,...
307,9.176445,45.478022,24,2.333333,3,Men's Clothing,90,195,117
308,9.189676,45.462064,3,4.500000,2,Fashion Jewelry Store,274,250,82
309,9.159315,45.466847,21,2.000000,5,"Desserts, Bakeries, Coffee & Tea Shoe Store",79,163,65
310,9.189180,45.478693,3,1.500000,3,Fashion Fashion Accessories Store,307,238,81


For reference: Venue Types by encoded value

In [296]:
# Create a new DataFrame with only 'venue_type' and 'venue_type_encoded' columns
venue_type_df = total[['venue_type', 'venue_type_encoded']]
venue_type_df.sort_values(by=['venue_type'])

Unnamed: 0,venue_type,venue_type_encoded
200,"Advertising Agency, Business and Strategy Con...",0
268,"Bar, Café",1
44,"Bar, Café",1
42,"Bar, Café",1
311,"Bar, Café, Restaurant",2
...,...,...
243,"Women's Clothing, Men's Clothing Shoe Store",170
208,"Women's Clothing, Men's Clothing, Bookstores",171
283,"Women's Clothing, Men's Clothing, Children's C...",172
141,"Women's Clothing, Men's Clothing, Luggage Indu...",173


In [297]:
# Drop venue_type
total_encoded = total_encoded.drop(columns=['venue_type'])
total_encoded

Unnamed: 0,longitude,latitude,free_bikes,rating_mean,venue,name_encoded,timestamp_encoded,venue_type_encoded
0,9.149430,45.453710,20,3.000000,2,225,9,156
1,9.202572,45.485456,12,,1,142,286,13
2,9.168010,45.457079,27,0.000000,5,67,183,78
3,9.184560,45.477534,12,,5,65,293,14
4,9.218048,45.449209,18,,0,217,19,86
...,...,...,...,...,...,...,...,...
307,9.176445,45.478022,24,2.333333,3,90,195,117
308,9.189676,45.462064,3,4.500000,2,274,250,82
309,9.159315,45.466847,21,2.000000,5,79,163,65
310,9.189180,45.478693,3,1.500000,3,307,238,81


In [298]:
# Count the missing values in each column
null_values = total_encoded.isnull().sum()
print(null_values)

longitude              0
latitude               0
free_bikes             0
rating_mean           90
venue                  0
name_encoded           0
timestamp_encoded      0
venue_type_encoded     0
dtype: int64


In [299]:
# Replace all nulls in ratings with the column mean
mean_rating = total['rating_mean'].mean()

# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which pandas warns will stop working in pandas 3.0
total_encoded['rating_mean'] = total_encoded['rating_mean'].fillna(mean_rating)
total_encoded


Unnamed: 0,longitude,latitude,free_bikes,rating_mean,venue,name_encoded,timestamp_encoded,venue_type_encoded
0,9.149430,45.453710,20,3.000000,2,225,9,156
1,9.202572,45.485456,12,2.862881,1,142,286,13
2,9.168010,45.457079,27,0.000000,5,67,183,78
3,9.184560,45.477534,12,2.862881,5,65,293,14
4,9.218048,45.449209,18,2.862881,0,217,19,86
...,...,...,...,...,...,...,...,...
307,9.176445,45.478022,24,2.333333,3,90,195,117
308,9.189676,45.462064,3,4.500000,2,274,250,82
309,9.159315,45.466847,21,2.000000,5,79,163,65
310,9.189180,45.478693,3,1.500000,3,307,238,81


In [300]:
y = total_encoded['free_bikes']
new_total_encoded = total_encoded.drop(columns=['free_bikes'])
new_total_encoded.columns

Index(['longitude', 'latitude', 'rating_mean', 'venue', 'name_encoded',
       'timestamp_encoded', 'venue_type_encoded'],
      dtype='object')

In [301]:
# Start with an empty list
X = []

# for loop to go through a list of things (in this case, each column in new_total_encoded)
for column in new_total_encoded.columns:

    # appending something to our list (X), and the 'something' is sm.add_constant(new_total_encoded[column])
    X.append(sm.add_constant(new_total_encoded[column]))

In [302]:
# Equivalent to: X = [sm.add_constant(new_total_encoded['longitude']), sm.add_constant(new_total_encoded['latitude']), ...]

# Create one design matrix per independent variable
# List of X's (each with a constant column)
X = [sm.add_constant(new_total_encoded[column]) for column in new_total_encoded.columns]

In [303]:
# Starting first column
X[0].head()

Unnamed: 0,const,longitude
0,1.0,9.14943
1,1.0,9.202572
2,1.0,9.16801
3,1.0,9.18456
4,1.0,9.218048


In [304]:
# Fit free_bikes against the first single-feature design matrix (constant + longitude)
model_num_free_bikes = sm.OLS(y, X[0])
results_num_free_bikes = model_num_free_bikes.fit()
adj_r2_num_free_bikes = results_num_free_bikes.rsquared_adj
pvalues_num_free_bikes = results_num_free_bikes.pvalues

print(adj_r2_num_free_bikes)
print(pvalues_num_free_bikes)

0.0005681526152812033
const        0.238433
longitude    0.278852
dtype: float64


In [305]:
print(results_num_free_bikes.summary())

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     1.177
Date:                Fri, 07 Feb 2025   Prob (F-statistic):              0.279
Time:                        23:52:51   Log-Likelihood:                -1092.6
No. Observations:                 312   AIC:                             2189.
Df Residuals:                     310   BIC:                             2197.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        192.9638    163.364      1.181      0.2

In [306]:
Models = [sm.OLS(y,x) for x in X] #list of models
Results = [model.fit() for model in Models] #list of results
Adj_Rsquared = [results.rsquared_adj for results in Results] #list of rsquared
Pval = [results.pvalues for results in Results] #list of p-values

for i in range(len(Adj_Rsquared)):
     print(f'adj_R2: {Adj_Rsquared[i]:.3f}, P-values: {*Pval[i],}, column: {new_total_encoded.columns[i]}')

adj_R2: 0.001, P-values: (0.23843311733905326, 0.2788521612726642), column: longitude
adj_R2: 0.010, P-values: (0.04479480515612426, 0.046269653853018324), column: latitude
adj_R2: -0.003, P-values: (4.438572316897306e-48, 0.861011224995542), column: rating_mean
adj_R2: 0.002, P-values: (5.069831579639912e-66, 0.2113174742082155), column: venue
adj_R2: 0.082, P-values: (6.208372753793783e-68, 1.512729991303594e-07), column: name_encoded
adj_R2: 0.064, P-values: (4.194760735651466e-65, 3.8244565132131386e-06), column: timestamp_encoded
adj_R2: -0.001, P-values: (2.944862802587295e-48, 0.4371861281326809), column: venue_type_encoded


In [307]:
Results[0].pvalues

const        0.238433
longitude    0.278852
dtype: float64

In [308]:
remaining_var = total_encoded.drop(['free_bikes', 'timestamp_encoded'], axis=1)
included_df = total_encoded[['timestamp_encoded']]

In [309]:
X = [sm.add_constant(pd.merge(included_df, remaining_var[column], right_index = True, left_index = True))\
     for column in remaining_var.columns]
X[0]

Unnamed: 0,const,timestamp_encoded,longitude
0,1.0,9,9.149430
1,1.0,286,9.202572
2,1.0,183,9.168010
3,1.0,293,9.184560
4,1.0,19,9.218048
...,...,...,...
307,1.0,195,9.176445
308,1.0,250,9.189676
309,1.0,163,9.159315
310,1.0,238,9.189180


In [310]:
Models = [sm.OLS(y,x) for x in X] #list of models
Results = [model.fit() for model in Models] #list of results
Adj_Rsquared = [results.rsquared_adj for results in Results] #list of rsquared
Pval = [results.pvalues for results in Results] #list of list of p-values

for i in range(len(Adj_Rsquared)):
     print(f'adj_R2: {Adj_Rsquared[i]:.3f}, P-values: {*Pval[i],}, column: {remaining_var.columns[i]}')

adj_R2: 0.061, P-values: (0.5955702968778681, 6.685288293738997e-06, 0.6816710054363029), column: longitude
adj_R2: 0.092, P-values: (0.001097909333613051, 1.2983660822228287e-07, 0.0011646672098106336), column: latitude
adj_R2: 0.061, P-values: (1.8966572352508888e-46, 3.919232872035296e-06, 0.8185794399345374), column: rating_mean
adj_R2: 0.064, P-values: (7.409286957404005e-51, 5.282110787951346e-06, 0.3154747872103129), column: venue
adj_R2: 0.146, P-values: (1.3823899142629452e-61, 1.4883312994338417e-06, 5.994848364631387e-08), column: name_encoded
adj_R2: 0.062, P-values: (1.1764522138460428e-43, 4.552528443897697e-06, 0.5617368985508167), column: venue_type_encoded


In [311]:
remaining_var = total_encoded.drop(['free_bikes', 'timestamp_encoded', 'name_encoded'], axis=1)
included_df = total_encoded[['timestamp_encoded', 'name_encoded']]

In [312]:
X = [sm.add_constant(pd.merge(included_df, remaining_var[column], right_index = True, left_index = True))\
     for column in remaining_var.columns]
X[0]

Unnamed: 0,const,timestamp_encoded,name_encoded,longitude
0,1.0,9,225,9.149430
1,1.0,286,142,9.202572
2,1.0,183,67,9.168010
3,1.0,293,65,9.184560
4,1.0,19,217,9.218048
...,...,...,...,...
307,1.0,195,90,9.176445
308,1.0,250,274,9.189676
309,1.0,163,79,9.159315
310,1.0,238,307,9.189180


In [313]:
Models = [sm.OLS(y,x) for x in X] #list of models
Results = [model.fit() for model in Models] #list of results
Adj_Rsquared = [results.rsquared_adj for results in Results] #list of rsquared
Pval = [results.pvalues for results in Results] #list of list of p-values

for i in range(len(Adj_Rsquared)):
     print(f'adj_R2: {Adj_Rsquared[i]:.3f}, P-values: {*Pval[i],}, column: {remaining_var.columns[i]}')

adj_R2: 0.148, P-values: (0.1479086969645307, 4.820039119379637e-06, 2.8851540549448295e-08, 0.19478559607188056), column: longitude
adj_R2: 0.184, P-values: (8.979630963099018e-05, 1.6607607519625864e-08, 5.701072310804093e-09, 9.796494248314906e-05), column: latitude
adj_R2: 0.143, P-values: (4.310204319587647e-49, 1.739350795684936e-06, 6.453694204341673e-08, 0.9452580836374449), column: rating_mean
adj_R2: 0.148, P-values: (4.121571344643016e-53, 2.2259012640855114e-06, 4.272916042935048e-08, 0.18513631707368777), column: venue
adj_R2: 0.148, P-values: (9.772048151105133e-49, 2.006644979743013e-06, 3.117263385900777e-08, 0.19244275572626404), column: venue_type_encoded


In [314]:
remaining_var = total_encoded.drop(['free_bikes', 'timestamp_encoded', 'name_encoded', 'venue_type_encoded'], axis=1)
included_df = total_encoded[['timestamp_encoded', 'name_encoded', 'venue_type_encoded']]

In [315]:
X = [sm.add_constant(pd.merge(included_df, remaining_var[column], right_index = True, left_index = True))\
     for column in remaining_var.columns]
X[0]

Unnamed: 0,const,timestamp_encoded,name_encoded,venue_type_encoded,longitude
0,1.0,9,225,156,9.149430
1,1.0,286,142,13,9.202572
2,1.0,183,67,78,9.168010
3,1.0,293,65,14,9.184560
4,1.0,19,217,86,9.218048
...,...,...,...,...,...
307,1.0,195,90,117,9.176445
308,1.0,250,274,82,9.189676
309,1.0,163,79,65,9.159315
310,1.0,238,307,81,9.189180


In [316]:
Models = [sm.OLS(y,x) for x in X] #list of models
Results = [model.fit() for model in Models] #list of results
Adj_Rsquared = [results.rsquared_adj for results in Results] #list of rsquared
Pval = [results.pvalues for results in Results] #list of list of p-values

for i in range(len(Adj_Rsquared)):
     print(f'adj_R2: {Adj_Rsquared[i]:.3f}, P-values: {*Pval[i],}, column: {remaining_var.columns[i]}')

adj_R2: 0.150, P-values: (0.15457918043719532, 6.277109572152097e-06, 1.529307977819371e-08, 0.1987203194379679, 0.2011450901944161), column: longitude
adj_R2: 0.185, P-values: (0.00012132434873824239, 2.5039394451138526e-08, 3.5015388004153203e-09, 0.2766465437472266, 0.00013176519283538026), column: latitude
adj_R2: 0.145, P-values: (1.0377311145893976e-36, 2.0641336324187493e-06, 3.431158123983247e-08, 0.18786079122247476, 0.830682752410084), column: rating_mean
adj_R2: 0.148, P-values: (7.179287297135734e-45, 2.719650774411611e-06, 2.670026538142152e-08, 0.2794719444753152, 0.26807774557337344), column: venue


In [317]:
remaining_var = total_encoded.drop(['free_bikes', 'timestamp_encoded', 'name_encoded', 'venue_type_encoded', 'latitude'], axis=1)
included_df = total_encoded[['timestamp_encoded', 'name_encoded', 'venue_type_encoded', 'latitude']]

In [318]:
X = [sm.add_constant(pd.merge(included_df, remaining_var[column], right_index = True, left_index = True))\
     for column in remaining_var.columns]
X[0]

Unnamed: 0,const,timestamp_encoded,name_encoded,venue_type_encoded,latitude,longitude
0,1.0,9,225,156,45.453710,9.149430
1,1.0,286,142,13,45.485456,9.202572
2,1.0,183,67,78,45.457079,9.168010
3,1.0,293,65,14,45.477534,9.184560
4,1.0,19,217,86,45.449209,9.218048
...,...,...,...,...,...,...
307,1.0,195,90,117,45.478022,9.176445
308,1.0,250,274,82,45.462064,9.189676
309,1.0,163,79,65,45.466847,9.159315
310,1.0,238,307,81,45.478693,9.189180


In [319]:
Models = [sm.OLS(y,x) for x in X] #list of models
Results = [model.fit() for model in Models] #list of results
Adj_Rsquared = [results.rsquared_adj for results in Results] #list of rsquared
Pval = [results.pvalues for results in Results] #list of list of p-values

for i in range(len(Adj_Rsquared)):
     print(f'adj_R2: {Adj_Rsquared[i]:.3f}, P-values: {*Pval[i],}, column: {remaining_var.columns[i]}')

adj_R2: 0.186, P-values: (8.44129742682036e-05, 9.34887870293999e-08, 1.9723813264879e-09, 0.28263310836454797, 0.00016288059622945067, 0.2584261263217958), column: longitude
adj_R2: 0.183, P-values: (0.00011614127290000323, 2.4173055930797774e-08, 3.948400517469708e-09, 0.25002313069436777, 0.00012604569914881992, 0.6740487463500839), column: rating_mean
adj_R2: 0.186, P-values: (0.00012118741939287324, 3.5386662253247984e-08, 2.9612299148860917e-09, 0.38652984845312477, 0.00013142333642660837, 0.2588050920016717), column: venue


Ok, that is where I draw the line and see no real improvements.

In [320]:
# Define the features and the target (the train/test split happens in the next cell)
X = total[['timestamp_encoded', 'name_encoded', 'venue_type_encoded', 'latitude']]
y = total['free_bikes']




In [321]:
# Split the data which I learned with practice through some Kaggle exercises 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [322]:
# Fit an OLS model (no constant is added here, so the summary reports uncentered R-squared)
model = sm.OLS(y_train, X_train).fit()

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print(model.summary())

                                 OLS Regression Results                                
Dep. Variable:             free_bikes   R-squared (uncentered):                   0.824
Model:                            OLS   Adj. R-squared (uncentered):              0.821
Method:                 Least Squares   F-statistic:                              287.1
Date:                Fri, 07 Feb 2025   Prob (F-statistic):                    3.39e-91
Time:                        23:52:52   Log-Likelihood:                         -850.98
No. Observations:                 249   AIC:                                      1710.
Df Residuals:                     245   BIC:                                      1724.
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
                         coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------

In [323]:
# Define the features and the target, this time without venue_type_encoded
X = total[['timestamp_encoded', 'name_encoded', 'latitude']]
y = total['free_bikes']


In [324]:
# Split the data which I learned with practice through some Kaggle exercises 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [325]:
# Fit an OLS model (no constant is added here, so the summary reports uncentered R-squared)
model = sm.OLS(y_train, X_train).fit()

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print(model.summary())

                                 OLS Regression Results                                
Dep. Variable:             free_bikes   R-squared (uncentered):                   0.824
Model:                            OLS   Adj. R-squared (uncentered):              0.822
Method:                 Least Squares   F-statistic:                              384.2
Date:                Fri, 07 Feb 2025   Prob (F-statistic):                    1.66e-92
Time:                        23:52:52   Log-Likelihood:                         -851.02
No. Observations:                 249   AIC:                                      1708.
Df Residuals:                     246   BIC:                                      1719.
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                        coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------

Provide model output and an interpretation of the results. 

<p>So, yeah, these adjusted R-squared values are very low, and a couple are even negative. My guess is that none of these features has a linear effect on the number of bikes available. The number of bikes available ranges anywhere from the lowest to the highest for all three types of venues. The only other feature that might help is how many venues there are nearby, but that makes it even worse since there could be one of each type of venue. One way to improve this would be to go into more detail on the type of venue instead of batching them into 3 types.</p>

After thinking this through, I let it ride with all venue types to see what difference it made. <br>

I thought that the venue type POI would play a role in predicting the number of free bikes, but its p-value suggests that it shouldn't be used. I removed it and ran the model again but got the same result. I suggest using other POI features such as distance.

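I get a better-looking result: the summary reports an adjusted R-squared (uncentered) of 0.821 using the 'timestamp_encoded', 'name_encoded', 'venue_type_encoded' and 'latitude' columns. That figure is uncentered because no constant was added to X, so it is not comparable with the adjusted R-squared of about 0.18 from the earlier models that include a constant, and it does not mean the model can predict the number of free bikes that well.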

# Stretch

How can you turn the regression model into a classification model?
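
One possible approach, sketched under assumptions: bin `free_bikes` into ordered classes and fit a classifier on the same kind of features. The bin edges (10 and 20), the class names, the synthetic features `f1` and `f2`, and the choice of `LogisticRegression` are all illustrative and not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and bike counts
rng = np.random.default_rng(0)
features = pd.DataFrame({"f1": rng.normal(size=400), "f2": rng.normal(size=400)})
free_bikes = np.clip(np.round(15 + 6 * features["f1"] + rng.normal(0, 2, 400)), 0, None)

# Turn the numeric target into classes; the bin edges are an assumption
labels = pd.cut(
    free_bikes, bins=[-1, 10, 20, np.inf], labels=["low", "medium", "high"]
).astype(str)

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=42)

# Fit a classifier on the binned target and score it on the held-out rows
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"test accuracy: {acc:.2f}")
```

Accuracy should be compared against the share of the most common class, since a model that always predicts "medium" can look decent on imbalanced bins.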