Build a regression model.

In [7]:
import pandas as pd
import numpy as np
from sklearn import linear_model, datasets
import statsmodels.api as sm

df = pd.read_csv("bikes_POIs_df.csv")
df.drop("Unnamed: 0", axis=1, inplace=True)
df.drop("reversed_POI_lat", axis=1, inplace=True)
df.head()

Unnamed: 0,empty_slots,free_bikes,POI_name,POI_latitude,POI_longitude,category_name,category_id,price,rating,distance,bike_ll,bike_latitude,bike_longitude
0,15,4,Cueva Colomba,19.43734,-99.137459,Bar,13003.0,2.0,6.1,13.0,"19.4374,-99.137571",19.4374,-99.137571
1,15,4,Tintico,19.43721,-99.137425,Coffee Shop,13035.0,1.0,7.6,16.0,"19.4374,-99.137571",19.4374,-99.137571
2,15,4,La Cueva Colomba,19.437141,-99.137434,Pub,13018.0,1.0,6.1,18.0,"19.4374,-99.137571",19.4374,-99.137571
3,15,4,Tintico Cafe & Galeria,19.43725,-99.137483,Café,13034.0,,,19.0,"19.4374,-99.137571",19.4374,-99.137571
4,15,4,Cantina Río de la Plata,19.437195,-99.137617,Beer Bar,13006.0,1.0,6.6,20.0,"19.4374,-99.137571",19.4374,-99.137571


In [17]:
num_unique_categories = df.category_name.nunique()
print(num_unique_categories)
unique_categories = df.category_name.unique()
print(unique_categories)

# There are too many distinct categories (128) to include in the regression.
# With fewer categories, the strings could be converted to dummy variables using pd.get_dummies.
# pd.get_dummies is a pandas function that converts a categorical variable into a numerical
# representation: for each unique category it creates a new column of 0s and 1s indicating
# whether or not each row belongs to that category.

128
['Bar' 'Coffee Shop' 'Pub' 'Café' 'Beer Bar' 'BBQ Joint' 'Dive Bar'
 'Mexican Restaurant' 'Diner' 'Restaurant' 'Taco Restaurant'
 'Sandwich Restaurant' 'Art Museum' 'Museum' 'Lounge' 'Salad Restaurant'
 'Fast Food Restaurant' 'Botanero' 'Cafes, Coffee, and Tea Houses'
 'Chinese Restaurant' 'Steakhouse' 'Tapas Restaurant' 'Pizzeria'
 'Seafood Restaurant' 'New American Restaurant' 'Peruvian Restaurant'
 'Beer Garden' 'American Restaurant' 'Burger Joint' 'Japanese Restaurant'
 'Middle Eastern Restaurant' 'Cocktail Bar' 'Afghan Restaurant'
 'Speakeasy' 'Arts and Entertainment' 'Brewery' 'Asian Restaurant'
 'Comfort Food Restaurant' 'Wine Bar' 'Yucatecan Restaurant' 'Wings Joint'
 'Deli' 'Italian Restaurant' 'History Museum' 'Spanish Restaurant'
 'Fried Chicken Joint' 'Ice Cream Parlor' 'Bistro'
 'Argentinian Restaurant' 'Vegan and Vegetarian Restaurant'
 'Mediterranean Restaurant' 'Buffet' 'Karaoke Bar'
 'Molecular Gastronomy Restaurant' 'Eastern European Restaurant'
 'Sushi Restaurant
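
One way to make the category column usable is to collapse the 128 names into a few coarse groups before calling pd.get_dummies. The sketch below is only an illustration: the six-row `sample` frame, the `coarse_group` helper and the keyword list are assumptions made for the example, not part of the original analysis.

```python
import pandas as pd

# Hypothetical sample: a few category names taken from the output above
sample = pd.DataFrame(
    {"category_name": ["Bar", "Coffee Shop", "Art Museum",
                       "Taco Restaurant", "Beer Bar", "Deli"]}
)

def coarse_group(name: str) -> str:
    """Map a detailed category name to a coarse group by keyword (illustrative helper)."""
    lowered = name.lower()
    for keyword in ("bar", "restaurant", "museum"):
        if keyword in lowered:
            return keyword
    return "other"

sample["group"] = sample["category_name"].map(coarse_group)

# One 0/1 column per coarse group
dummies = pd.get_dummies(sample["group"], prefix="is", dtype=int)
print(dummies)
```

If these columns were added to a regression that already has an intercept, one of them would normally be left out (for example with `drop_first=True`) so that the dummy columns are not perfectly collinear with the constant.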

In [56]:
# All columns used in the regression need to be ints/floats. Right now "price" is an object.
# The NaNs in price have to be dropped before it can be converted to a float.
# It turns out all NaNs have to be removed before doing the regression, even if the types are numeric.

df["price"].value_counts(dropna=False)
    # shows that the column contains NaNs

df_dropped = df.dropna(subset=["price", "rating", "distance", "free_bikes"]).copy()
    # drops the rows with NaNs; .copy() makes df_dropped an independent DataFrame,
    # so the assignment below does not trigger a SettingWithCopyWarning

df_dropped["price"] = pd.to_numeric(df_dropped["price"], errors="coerce")
    # convert to float

df_dropped["price"].value_counts(dropna=False)
    # check no NaNs are left


1.0    251
2.0    171
3.0     47
3.3     25
5.0     16
1.7      7
4.0      6
Name: price, dtype: int64
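
Because `errors="coerce"` turns anything it cannot parse into NaN, the conversion can itself create new missing values. A possible alternative ordering is to convert first and drop afterwards. The sketch below uses a small made-up frame (`raw` and its values are assumptions for illustration, not the bike data):

```python
import pandas as pd

# Hypothetical raw data: price stored as strings, with one bad value and one missing value
raw = pd.DataFrame(
    {"price": ["1", "2", "n/a", None], "free_bikes": [4, 7, 3, 9]}
)

clean = (
    raw.assign(price=pd.to_numeric(raw["price"], errors="coerce"))  # "n/a" and None become NaN
       .dropna(subset=["price"])                                    # then drop every NaN in one pass
)

print(clean.dtypes)
print(clean)
```

Since `assign` and `dropna` both return new DataFrames, nothing is written into a slice of the original frame.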

In [61]:
X = df_dropped[["price", "rating", "distance"]]  
    #double brackets because you are creating a new [df - outer brackets]
    # with column names input as a [list - inner brackets]
y = df_dropped.free_bikes

# check datatypes are all floats
X.dtypes
#y.dtypes

price       float64
rating      float64
distance    float64
dtype: object

In [65]:
X = sm.add_constant(X) # adding a constant to x, which will be the y intercept
lin_reg = sm.OLS(y,X)
model = lin_reg.fit()

print( model.summary() )

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     1.519
Date:                Thu, 02 Feb 2023   Prob (F-statistic):              0.209
Time:                        19:07:28   Log-Likelihood:                -1736.7
No. Observations:                 523   AIC:                             3481.
Df Residuals:                     519   BIC:                             3498.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.5539      2.048      4.176      0.0

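Interpretation: The adjusted R-squared is very low (0.003), suggesting that the independent variables collectively explain almost none of the variation in the number of free bikes. The overall F-test (p = 0.209) is not significant either.

Coefficients: These are all below 1. Coefficient size depends on the units of each variable, so on its own it does not show how strong a relationship is, but together with the very low R-squared it suggests that none of the independent variables has a strong relationship with the number of free bikes.

P-values are above 0.05 for everything except price, suggesting the other two variables are not statistically significant predictors of the number of free bikes.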
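
A note on reading coefficient size: the magnitude of a coefficient changes with the units of the predictor, while the fit of the model does not. The sketch below illustrates this on synthetic data (the data, the `fit_slope_and_r2` helper and the metres/kilometres rescaling are all assumptions for the example; it uses plain NumPy least squares rather than statsmodels):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
distance_m = rng.uniform(0, 1000, n)                      # synthetic distances in metres
y = 5 + 0.002 * distance_m + rng.normal(0, 1, n)          # synthetic response

def fit_slope_and_r2(x, y):
    """Ordinary least squares with an intercept; returns (slope, R-squared)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    centered = y - y.mean()
    r2 = 1 - (resid @ resid) / (centered @ centered)
    return beta[1], r2

slope_m, r2_m = fit_slope_and_r2(distance_m, y)           # predictor in metres
slope_km, r2_km = fit_slope_and_r2(distance_m / 1000, y)  # same predictor in kilometres

print(slope_m, slope_km)   # the slope is 1000 times larger in kilometres
print(r2_m, r2_km)         # R-squared is the same in both fits
```

This is why the R-squared and the p-values are more informative than the raw coefficient size when judging how much the predictors matter.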

Performing again just for price:

In [68]:
A = df_dropped[["price"]] 
y = df_dropped.free_bikes

A = sm.add_constant(A) # adding a constant to x, which will be the y intercept
lin_reg = sm.OLS(y,A)
model = lin_reg.fit()

print( model.summary() )

                            OLS Regression Results                            
Dep. Variable:             free_bikes   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.006
Method:                 Least Squares   F-statistic:                     4.287
Date:                Thu, 02 Feb 2023   Prob (F-statistic):             0.0389
Time:                        19:22:14   Log-Likelihood:                -1736.8
No. Observations:                 523   AIC:                             3478.
Df Residuals:                     521   BIC:                             3486.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.6064      0.618     12.310      0.0

Adj. R-squared is still very low (it can be at most 1, and it is only 0.006), although it is better than the 0.003 from the previous model. This suggests that very little of the variation in the availability of bikes is explained by the price of nearby bars/restaurants.

The coefficient measures the direction and size of the relationship (specifically, the amount of change in y for a one-unit change in x).
Here it is about 0.6, so each one-unit increase in price is associated with roughly 0.6 more free bikes. Given the very low R-squared, this is a very weak positive relationship between price and number of free bikes.

The p-value is still below 0.05 (0.039), suggesting that the relationship, though very weak, is statistically significant.

The standard error measures the variability of the estimated coefficient, that is, how much the estimate would be expected to change from sample to sample. A lower standard error means the coefficient is estimated more precisely. The standard error is about 0.3, but whether that is low depends on the size of the coefficient: what matters for significance is the ratio of the two (the t-statistic), which here is about 2. Statistical significance only says that the slope is distinguishable from zero; it says nothing about whether the effect is large, and with an R-squared this low the effect has little practical relevance.
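
As a quick arithmetic check on the price-only model: in a regression with a single predictor, the F-statistic is the square of the slope's t-statistic, and R-squared is the square of the correlation between x and y. The sketch below only plugs in the two numbers printed in the summary above.

```python
import math

f_statistic = 4.287   # F-statistic from the price-only summary
r_squared = 0.008     # R-squared from the price-only summary

t_statistic = math.sqrt(f_statistic)   # magnitude of the t-statistic for the price slope
correlation = math.sqrt(r_squared)     # magnitude of the correlation between price and free_bikes

print(round(t_statistic, 2), round(correlation, 2))   # 2.07 0.09
```

A t-statistic of about 2.07 is consistent with a coefficient of about 0.6 and a standard error of about 0.3, and a correlation of about 0.09 shows how weak the relationship is even though it passes the 0.05 threshold.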

# Stretch

How can you turn the regression model into a classification model?

In [18]:
# I could group the 128 unique categories into a few broad ones (e.g. "bar", "restaurant", "museum")
# using string matching, and then convert those groups to dummy variables with pd.get_dummies().
# I could then use these as inputs to see whether more free bikes appear near bars, restaurants or museums.

# Or, I could put the number of free bikes into categories, e.g. 0-5, 6-10, 11-15, and then build a
# classification model to predict which category a station is likely to fall into,
# given the number of bars and restaurants nearby.
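
The second idea, binning the target, could be sketched with pd.cut as below. The sample values, the bin edges and the extra "16+" class are assumptions chosen for illustration; the resulting labels could then serve as the target of a classifier (for example scikit-learn's LogisticRegression) instead of the raw counts.

```python
import pandas as pd

# Hypothetical free-bike counts for a handful of stations
free_bikes = pd.Series([0, 4, 5, 6, 10, 11, 15, 16, 30], name="free_bikes")

# Right-closed bins: (-1, 5], (5, 10], (10, 15], (15, inf)
bins = [-1, 5, 10, 15, float("inf")]
labels = ["0-5", "6-10", "11-15", "16+"]

bike_class = pd.cut(free_bikes, bins=bins, labels=labels)

print(pd.concat([free_bikes, bike_class.rename("bike_class")], axis=1))
```

Starting the first bin at -1 keeps stations with zero free bikes inside the "0-5" class, because pd.cut excludes the left edge of each bin by default.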