#### Build a regression model.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import csv

In [2]:
# Import the cleaned DF
df = pd.read_csv('cleaned_data.csv')

#### Provide model output and an interpretation of the results. 

In [3]:
# Prepare the data for the regression model
x = df[['Business Latitude', 'Business Longitude']]
y = df['Rating']

x = sm.add_constant(x)

# Fit the linear regression model
model = sm.OLS(y, x).fit()

# Print the summary of the regression model
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                 Rating   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     2.353
Date:                Thu, 03 Aug 2023   Prob (F-statistic):             0.0959
Time:                        01:07:03   Log-Likelihood:                -644.97
No. Observations:                 653   AIC:                             1296.
Df Residuals:                     650   BIC:                             1309.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                224.2468    287

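The "Business Latitude" p-value is below 0.05, which suggests that latitude is a statistically significant predictor of the Yelp rating of bubble tea shops in Vancouver. The model as a whole is weak, however: with an R-squared of 0.007 and a Prob (F-statistic) of 0.0959, latitude and longitude together explain less than 1% of the variance in ratings, and the overall fit is not significant at the 5% level. The other diagnostics can be summarized as follows: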

Omnibus/Prob(Omnibus): this tests whether the residuals are normally distributed; the result indicates that they are not. This is plausible, as the data is clustered towards downtown Vancouver.
Durbin-Watson (value between 0 and 4): the value of 2.086 is close to 2, which indicates no strong evidence of autocorrelation in the residuals.
Jarque-Bera/Prob(JB): Prob(JB) < 0.05 means that the residuals are not normally distributed, which agrees with the Omnibus test.
Skew: the residuals are moderately negatively skewed, with a value of -0.539.
Kurtosis: the value of 3.609 is above the normal-distribution value of 3, so the distribution is 'heavier-tailed' than a normal distribution. This may reflect that more shops are in downtown Vancouver, but are still located relative to the bike stations.
Cond. No.: the high value indicates multicollinearity or another numerical problem, meaning the independent variables may be correlated. Latitude and longitude also vary only slightly around large mean values, which likely makes them nearly collinear with the constant term. Simply visualizing the data, shops are irregularly scattered because they keep to commercial rather than residential areas.
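As a rough check on the normality diagnostics, the Jarque-Bera statistic can be recomputed from the sample size, skew, and kurtosis quoted above. This is a minimal sketch, assuming the rounded summary values (653 observations, skew -0.539, kurtosis 3.609); the helper name `jarque_bera` is made up for illustration, and the result may differ slightly from the figure in the full statsmodels summary because the inputs are rounded.

```python
import math


def jarque_bera(n, skew, kurtosis):
    """Return the Jarque-Bera statistic and its p-value.

    `kurtosis` is the plain (non-excess) kurtosis, where a normal
    distribution has a value of 3.
    """
    jb = n / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # JB is compared against a chi-squared distribution with 2 degrees of
    # freedom, whose survival function is exactly exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Rounded values taken from the OLS summary discussed above
n_obs = 653
skew = -0.539
kurtosis = 3.609

jb_stat, jb_p = jarque_bera(n_obs, skew, kurtosis)
print(f"JB = {jb_stat:.2f}, p = {jb_p:.2e}")
```

A p-value far below 0.05 here is consistent with the conclusion that the residuals are not normally distributed.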

# Stretch

How can you turn the regression model into a classification model?

In [4]:
# Set a threshold and convert the continuous rating into a discrete class
threshold = 4.0

# Label encoding: 1 = "High rating" (strictly above the threshold), 0 = "Low rating"
df['Rating_Class'] = (df['Rating'] > threshold).astype(int)

In [5]:
# logistic regression model
x = df[['Business Latitude', 'Business Longitude']]
y = df['Rating_Class']

x = sm.add_constant(x)

# Fit the logistic regression model
logit_model = sm.Logit(y, x).fit()

# Print logistic regression summary
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.591707
         Iterations 13
                           Logit Regression Results                           
Dep. Variable:           Rating_Class   No. Observations:                  653
Model:                          Logit   Df Residuals:                      650
Method:                           MLE   Df Model:                            2
Date:                Thu, 03 Aug 2023   Pseudo R-squ.:                 0.01652
Time:                        01:07:13   Log-Likelihood:                -386.38
converged:                       True   LL-Null:                       -392.87
Covariance Type:            nonrobust   LLR p-value:                  0.001521
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               1450.8713    977.710      1.484      0.138    -465.404    3367.147
Busi

The classification model described the trend of the data somewhat better: its LLR p-value of 0.001521 is below 0.05, whereas the OLS model's Prob (F-statistic) was 0.0959. The fit is still weak, with a pseudo R-squared of only 0.01652, and the constant term is not significant (p = 0.138). With a reported P-value of 0.000 (that is, below 0.0005 rather than exactly zero), Business Latitude would be the main factor for bubble tea shops. My interpretation is that because more bubble tea shops are concentrated in one area (downtown Vancouver), the quality of their products could be higher due to stronger competition. To earn more revenue and keep customers satisfied (as reflected in the rating), quality control must be better; otherwise the stores might lose business to a competitor down the street.
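The overall fit statistics in the Logit summary can be reproduced from the two log-likelihoods it reports. The sketch below assumes the rounded values printed above (Log-Likelihood -386.38, LL-Null -392.87, Df Model 2); the variable names are illustrative, and the closed-form p-value only applies because the model has exactly 2 degrees of freedom.

```python
import math

# Rounded values copied from the Logit summary above
ll_model = -386.38   # log-likelihood of the fitted model
ll_null = -392.87    # log-likelihood of the intercept-only model

# Likelihood-ratio statistic: twice the gain in log-likelihood
llr = 2.0 * (ll_model - ll_null)

# With 2 degrees of freedom, the chi-squared survival function is exp(-x / 2).
# This shortcut is NOT valid for other degrees of freedom.
llr_p_value = math.exp(-llr / 2.0)

# McFadden's pseudo R-squared
pseudo_r2 = 1.0 - ll_model / ll_null

print(f"LLR = {llr:.2f}")
print(f"LLR p-value = {llr_p_value:.6f}")
print(f"Pseudo R-squared = {pseudo_r2:.5f}")
```

The results land close to the reported LLR p-value (0.001521) and pseudo R-squared (0.01652), which shows that the model is significantly better than an intercept-only model while still explaining very little of the variation in the rating class.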

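The importance of Lat/Lng will change depending on the threshold set. My assumption is that most people would generally consider a rating of 4.0 a highly rated business/shop. Note that the code above uses a strict comparison (Rating > 4.0), so a shop rated exactly 4.0 is labelled as low.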
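To see how the threshold choice changes the class labels, a small helper can be used to compare cut-offs and the strict versus inclusive comparison. This is only a sketch: `label_ratings` and `sample_ratings` are hypothetical names, and the ratings are made-up illustrative values, not the notebook's data.

```python
def label_ratings(ratings, threshold, inclusive=False):
    """Label each rating 1 ("High rating") or 0 ("Low rating").

    With inclusive=False a rating must be strictly above the threshold,
    which matches the `>` comparison used in the notebook.
    """
    if inclusive:
        return [1 if r >= threshold else 0 for r in ratings]
    return [1 if r > threshold else 0 for r in ratings]


# Hypothetical ratings for illustration only (not the real data)
sample_ratings = [3.0, 3.5, 4.0, 4.0, 4.5, 5.0]

for threshold in (3.5, 4.0, 4.5):
    strict = label_ratings(sample_ratings, threshold)
    loose = label_ratings(sample_ratings, threshold, inclusive=True)
    print(
        f"threshold={threshold}: "
        f"high share (>) = {sum(strict) / len(strict):.2f}, "
        f"high share (>=) = {sum(loose) / len(loose):.2f}"
    )
```

Because Yelp ratings fall on a coarse scale, moving the threshold or switching between `>` and `>=` can shift a large share of shops between classes, which in turn can change the fitted coefficients.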