<a href="https://colab.research.google.com/github/cbsebastian24/randomStuff/blob/main/plenary_5_a.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Breakout 5.a

In this breakout, we will continue to use the data on restaurants in Ann Arbor, MI, from the previous project.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sb
from scipy.stats import norm, t
import statsmodels.api as sm

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/UM-Data-Science-101/data-FA2025/refs/heads/main/GoogleMaps_Cleaned.csv")
df.shape

(200, 254)

In [None]:
df_clean = df.copy()

## Part I: Transformations and basic $R^2$

In what follows, we will focus on `totalScore` as our outcome ($Y$).

To begin with, create a histogram of `totalScore`. What do you notice about the shape of the distribution? In particular, what kind of skew (if any) does it exhibit?


Guess the sign of the **coefficient of skewness** for this distribution. Now calculate the coefficient to see if you were correct.
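
As a rough sketch of the calculation, assuming the moment-based definition of the coefficient of skewness (the mean of the cubed z-scores), the example below uses a synthetic left-skewed sample rather than the restaurant data. The helper name `coef_skewness` is hypothetical. Note that `scipy.stats.skew` uses this definition by default, while the pandas `.skew()` method applies a small-sample bias correction, so the two can differ slightly.

```python
import numpy as np
from scipy.stats import skew

def coef_skewness(x):
    """Moment coefficient of skewness: mean of cubed z-scores (population SD)."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]              # ignore missing values
    z = (x - x.mean()) / x.std()     # np.std uses ddof=0 by default
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
# Synthetic left-skewed "ratings" (an assumption for illustration, not the real data)
y = 5 - rng.exponential(scale=0.4, size=500)

g1 = coef_skewness(y)
print(g1 < 0)                        # a long left tail gives a negative coefficient
print(np.isclose(g1, skew(y)))       # matches scipy's default (biased) estimator
```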

In class we discussed how to deal with **right skewed data**. As you have noticed, these data are **left skewed**.

A simple technique is to reflect the data by creating

$$W = (\max Y) + \epsilon - Y$$

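where $\epsilon$ is a small positive value, such as $\epsilon = 1$. Adding $\epsilon$ keeps every value of $W$ strictly positive, which the transformations below require. Create a new `"W"` column in the `df` table that holds the reflected data, and plot it.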


Now we have right-skewed data. Confirm this by calculating the coefficient of skewness of `"W"`. How does this value compare with the one you calculated before?
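
To see why the two values are related, here is a small sketch on synthetic data (not the restaurant table). Reflecting a variable as $(\max Y) + \epsilon - Y$ only flips and shifts it, so the coefficient of skewness keeps its magnitude and changes sign.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
# Synthetic left-skewed sample (an assumption for illustration)
y = 5 - rng.exponential(scale=0.4, size=500)

eps = 1.0
w = y.max() + eps - y      # reflected data; the smallest value is eps, so w > 0

print(np.isclose(w.min(), eps))         # the old maximum maps to eps
print(np.isclose(skew(w), -skew(y)))    # same magnitude, opposite sign
```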

We discussed three transformations that apply to **positive, right-skewed data**:

- Reciprocal
- Square root
- Logarithmic (usually the natural log, with base $e \approx 2.718$)

Implement each of these transformations, compute the coefficient of skewness for each, and graph the transformation with the best results.
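
One way to organize this comparison is sketched below on synthetic positive, right-skewed data. The column names and the rule of picking the transformation whose skewness is closest to zero are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

rng = np.random.default_rng(2)
# Synthetic positive, right-skewed sample (not the restaurant data)
w = pd.Series(1 + rng.exponential(scale=0.5, size=500))

transformed = pd.DataFrame({
    "original": w,
    "reciprocal": 1 / w,
    "sqrt": np.sqrt(w),
    "log": np.log(w),
})

# Coefficient of skewness for each column
skews = transformed.apply(lambda col: skew(col.to_numpy()))
print(skews)

# Assumed criterion: "best" means skewness closest to zero
best = skews.abs().idxmin()
print(best)
```

A histogram of `transformed[best]` would then show how symmetric the chosen scale looks.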

Take a moment to think about what small and large values mean on this new scale (remember that we first created $W = (\max Y) + \epsilon - Y$).


Just to get us on the same page, we found $V = 1 / W$ to be the best transformation. Graph the joint relationship of this variable with `imagesCount`. Would you say you observe a relationship?


Repeat the three transformations using `imagesCount`. Calculate the correlations among `imagesCount`, the three transformations of `imagesCount`, and `V` (our best transformation of `totalScore`).

Look at the correlations between `V` and the others. Which transformation gave the strongest correlation with `V`? Call this best variable `U`, and plot the pair `U` and `V`.
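
A minimal sketch of this kind of comparison, using synthetic data in place of `imagesCount` and `V` (all column names here are hypothetical): build the candidates as columns of one table, then read off the `V` column of the correlation matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Synthetic stand-ins: x is positive and right-skewed, v is roughly linear in log(x)
x = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=300))
v = 0.1 * np.log(x) + rng.normal(scale=0.03, size=300)

cands = pd.DataFrame({
    "x": x,
    "recip_x": 1 / x,
    "sqrt_x": np.sqrt(x),
    "log_x": np.log(x),
    "V": v,
})

# Correlation of each candidate with V (drop V's correlation with itself)
corr_with_v = cands.corr()["V"].drop("V")
print(corr_with_v)

# Compare by absolute value, since a reciprocal reverses the direction
best = corr_with_v.abs().idxmax()
print(best)
```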

How much did this transformation help? Recall that **the coefficient of determination** is given by:

$$R^2 = 1 - \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}$$

and is interpreted as the proportion (often quoted as a percentage) of the variance in $Y$ explained by the model.
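
The formula can be checked by hand. The sketch below, on synthetic data, fits a line with `np.polyfit`, computes $R^2$ from the two sums of squares, and confirms that for a simple regression with an intercept it equals the squared correlation. That link is why a stronger correlation implies a larger $R^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)                      # synthetic predictor
y = 2.0 + 0.5 * x + rng.normal(size=200)      # synthetic outcome

# Least-squares line; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

ss_res = np.sum((y - y_hat) ** 2)             # numerator: residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)          # denominator: total sum of squares
r2 = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r2, r ** 2))                 # R^2 equals the squared correlation
```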

Compare two models:

- regress `imagesCount` (Y) on `V` (X)
- regress `U` on `V`

and get the $R^2$ for each model. Remember to drop any rows with missing values before using `sm.OLS`.

As we probably expected from the modest improvement in correlation, the change in $R^2$ is also modest, but we see that strengthening the linear relationship translates directly into more variance explained by the model.

## Part II: Adjusted $R^2$ and Step-wise selection

Here is a table with just the numeric or similar types (it excludes strings, categories, variables with many missing values, and a few other problematic variables).

We've also included the reciprocal transformation of the review scores from the previous section. Note that in this table it is stored in the column `"W"` (it was called $V$ in Part I).


In [None]:
df_num = df_clean.drop(columns = ["googleFoodUrl", "cid", "rank",
                                  "reviewsDistribution.oneStar",
                                  "reviewsDistribution.twoStar",
                                  "reviewsDistribution.threeStar",
                                  "reviewsDistribution.fourStar",
                                  "reviewsDistribution.fiveStar",
                                  "permanentlyClosed",
                                  "isAdvertisement",
                                  "additionalInfo.Offerings.Organic_products",
                                  "additionalInfo.Recycling.Plastic_bottles",
                                  "additionalInfo.Accessibility.Assistive_hearing_loop"]).select_dtypes(include=[np.number, bool]) * 1.0
df_num = df_num.dropna()

df_num["W"] = 1 / (df_num["totalScore"].max() + 1 - df_num["totalScore"])
df_num = df_num.drop(columns = "totalScore")

df_num.shape

(189, 155)

Create a regression using `"W"` as the outcome and all remaining variables as predictors.

Print out the $R^2$. What percentage of the variance does this model explain?

Of course, we know that a bigger model will always fit at least as well as a smaller model nested inside it, as measured by $R^2$ (we can always set any coefficient to zero in the bigger model and do at least as well as the smaller model).

Recall that adjusted $R^2$ is used to *penalize* the $R^2$ calculation to take this into account.

$$R^2_{\text{adj}} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}$$

where $n$ is the number of observations and $p$ is the number of predictors in the model (not counting the intercept).

What was the adjusted $R^2$ in the previous model? Do things look as good using that?

Here is a function that will find the adjusted $R^2$ that comes from adding an additional parameter:

In [None]:
def add_param(current, new_var):
  # Fit W on the current predictors plus one candidate; return the adjusted R^2
  mod = sm.OLS(df_num["W"], sm.add_constant(df_num[current + [new_var]])).fit()
  return mod.rsquared_adj

And here is a demonstration of its use, showing what happens when we add the `No_contact_delivery` variable to an empty model.

In [None]:
add_param([], "additionalInfo.Service_options.No_contact_delivery")

np.float64(0.05026755236824154)

Here is a function that takes a list of candidate columns and finds the one that gives the highest adjusted $R^2$ when added to the current model.

In [None]:
def find_best(current, candidates):
  rs = pd.Series({ v:add_param(current, v) for v in candidates })
  return (rs.idxmax(), rs.max())

all_cols = df_num.drop(columns = "W").columns

find_best([], all_cols)

('additionalInfo.Service_options.Delivery', 0.06278426735635045)

Use `find_best` to apply forward step-wise regression to this table. Use 5 steps. Which variables did you select? Interpret the final model in terms of its coefficients and $R^2$.