# Urban Models

"Theories are structures of ideas that explain and interpret facts." -Stephen Jay Gould

Theory and models: to conduct statistical inference, we usually rely on *statistical models*: sets of assumptions plus mathematical relationships between variables, producing a formal representation of some theory. We are essentially trying to explain the process underlying the generation of our data.

Regression modeling steps:

  1. think through the relevant theory and assumptions
  2. specify a model based on theory
  3. collect data and clean/prep it
  4. estimate model parameters using the data
  5. interpret and report the results

A full course on regression covers much more than we can fit into today's quick overview, including interactions, transforming variables, handling multicollinearity, handling outliers, and conducting diagnostics. That's why there are entire courses dedicated to regression analysis.

In this notebook, we will focus on the basics of specifying, estimating, and interpreting regression models. The goal is to make you a knowledgeable consumer of studies that use regression, and to equip you to tell stories using regression yourself.

In [None]:
!curl -s -o pyproject.toml https://raw.githubusercontent.com/gboeing/ppd430/refs/heads/main/pyproject.toml && uv pip install -q -r pyproject.toml

In [None]:
import geopandas as gpd
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

## 1. Load and prep the data

The first step in regression analysis is doing all the boring work to collect and load your data, clean it, merge/join it, and otherwise get it in suitable condition for analysis. Here, we want tract census data about each listing's local environs. We can get it by merging tract geometries and census data, then spatial-joining listings to those tracts.
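Before merging real tables, it can help to rehearse the merge on toy data. The sketch below uses made-up tract IDs and a made-up `med_income` column (both are assumptions for illustration, not the course data). It shows how a left merge keeps every left-hand row and how `indicator=True` flags rows that found no match. The IDs are kept as strings so the leading zero in California's `06` prefix survives.

```python
import pandas as pd

# hypothetical stand-in for the tract geometries table (IDs kept as strings)
toy_geoms = pd.DataFrame({"GEOID": ["06037101110", "06037101122", "06059001101"]})

# hypothetical stand-in for the census table; one tract is deliberately missing
toy_census = pd.DataFrame(
    {"GEOID10": ["06037101110", "06059001101"], "med_income": [52000, 88000]}
)

# left merge keeps every row of the left table; indicator=True records match status
toy_merged = pd.merge(
    left=toy_geoms,
    right=toy_census,
    left_on="GEOID",
    right_on="GEOID10",
    how="left",
    indicator=True,
)

# count how many left rows found a match versus none
match_counts = toy_merged["_merge"].value_counts()
print(match_counts)
```

Rows marked `left_only` are tracts with geometry but no census data, which will show up as nulls in the census columns after the merge.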

### 1a. Load/merge tract geometries and census data

In [None]:
# load tract geometries (shapefile)
gdf_tracts = gpd.read_file(
    "https://raw.githubusercontent.com/gboeing/ppd430/main/data/tl_2017_06_tract/tl_2017_06_tract.shp"
)
gdf_tracts.shape

In [None]:
# load tract census data (csv file)
df_tracts = pd.read_csv(
    "https://raw.githubusercontent.com/gboeing/ppd430/main/data/census_tracts_data_ca.csv",
    dtype={"GEOID10": str, "state": str, "county": str},
)
df_tracts.shape

In [None]:
# finally, merge tract geometries and census data together
gdf_tracts_data = pd.merge(
    left=gdf_tracts, right=df_tracts, left_on="GEOID", right_on="GEOID10", how="left"
)
gdf_tracts_data.shape

### 1b. Load listings then spatial-join listings to tracts

This attaches local (i.e., neighborhood) data to each listing. I just copied this code from Module 8's notebook, so if it looks unfamiliar, review it there!

In [None]:
# load listings, create geometry column, convert to GeoDataFrame
df_listings = pd.read_csv(
    "https://raw.githubusercontent.com/gboeing/ppd430/main/data/listings-la_oc_vc.csv"
)
df_listings["geometry"] = gpd.points_from_xy(x=df_listings["longitude"], y=df_listings["latitude"])
gdf_listings = gpd.GeoDataFrame(df_listings, crs="epsg:4326")
gdf_listings.shape

In [None]:
# remember: always make sure CRSs match before a spatial join!
gdf_tracts_data.crs == gdf_listings.crs

In [None]:
# they don't match, so project one to the other's CRS
gdf_listings = gdf_listings.to_crs(gdf_tracts_data.crs)

In [None]:
# now spatial join listings to tracts
# (the `op` argument was deprecated and later removed in geopandas; use `predicate`)
gdf = gpd.sjoin(gdf_listings, gdf_tracts_data, how="inner", predicate="within")
gdf.shape

In [None]:
# all done: what variables do we have now to work with?
gdf.columns

## 2. Simple regression

Simple (aka bivariate) regression has just 2 variables: one is used to predict the other.
  
  - **Response** variable = what you are predicting (synonyms: dependent variable, outcome variable, regressand)
  - **Predictor** variable = what you are using to predict (synonyms: independent variable, feature, covariate, regressor)
  
The response vector $y$ is the column of observations of your response variable. The design matrix $X$ is the set of columns of observations of your predictor variables (in simple regression, there's only one column here).

In this example, I want to predict rent as a function of square footage. Therefore, I **specify** my model as $y = \beta_0 + \beta_1 \times x_1$ where $y$ represents a unit's asking rent and $x_1$ represents a unit's square footage. $\beta_0$ (the intercept, aka constant) and $\beta_1$ (the coefficient on square footage) are the model parameters to be estimated. My chosen confidence level is 95% (and thus my significance level is 0.05).
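To make the response vector and design matrix concrete, here is a minimal sketch that estimates $\beta_0$ and $\beta_1$ by least squares with plain NumPy. The data are synthetic (generated with a known intercept of 900 and slope of 1.1, which are my own assumptions) rather than the listings, and the names `X_toy`, `b0`, and `b1` are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data (not the listings): rent = 900 + 1.1 * sqft + noise
sqft = rng.uniform(400, 2000, size=500)
rent = 900 + 1.1 * sqft + rng.normal(0, 150, size=500)

# design matrix: a column of ones (for the intercept) plus the predictor column
X_toy = np.column_stack([np.ones_like(sqft), sqft])

# least squares chooses the betas that minimize the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X_toy, rent, rcond=None)
b0, b1 = beta_hat
print(f"intercept: {b0:.2f}, slope: {b1:.3f}")
```

The column of ones plays the same role here as `sm.add_constant` does in statsmodels: without it, the fitted line would be forced through the origin.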

In [None]:
# choose a response and predictor
response = "rent"
predictor = "sqft"

In [None]:
# filter full dataset to retain only these columns and only rows without nulls in these columns
data = gdf[[response, predictor]].dropna()
print(data.shape)

# create design matrix and response vector
X = data[predictor]
y = data[response]

In [None]:
# estimate a linear regression model with OLS, using statsmodels
model = sm.OLS(y, sm.add_constant(X))
result = model.fit()
print(result.summary())

**How do I interpret this regression results table?**

Remember my model from earlier: $y = \beta_0 + \beta_1 \times x_1$

The coefficient is the estimated relationship between my variables (the slope of the line). The *t*-statistic is my coefficient divided by its standard error. The *p* value tells me the probability of seeing a *t* at least this large in absolute value if the null hypothesis (that the coefficient's true value equals 0) were true. This is a hypothesis test.

The 95% confidence interval spans approximately the coefficient ± 2 standard errors. If we repeatedly drew new samples and rebuilt the interval each time, about 95% of those intervals would contain the true value of the coefficient.
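The pieces of the results table can be rebuilt by hand. This is a rough sketch on synthetic data (not the listings), and it assumes a normal approximation for the *p* value and 1.96 standard errors for the interval. statsmodels uses the *t* distribution instead, so its numbers will differ slightly, especially in small samples.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# synthetic data (not the listings)
n = 400
x = rng.uniform(400, 2000, size=n)
y_toy = 900 + 1.1 * x + rng.normal(0, 150, size=n)

# fit by least squares with an intercept column
X_toy = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# residual variance, using n minus the number of estimated parameters
resid = y_toy - X_toy @ beta_hat
sigma2 = resid @ resid / (n - X_toy.shape[1])

# standard errors: square roots of the diagonal of sigma2 * (X'X)^-1
cov = sigma2 * np.linalg.inv(X_toy.T @ X_toy)
se = np.sqrt(np.diag(cov))

# t-statistic: coefficient divided by its standard error
t_stats = beta_hat / se

# two-sided p-value from a normal approximation (reasonable when n is large)
p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

# approximate 95% interval: coefficient plus or minus 1.96 standard errors
ci_low = beta_hat - 1.96 * se
ci_high = beta_hat + 1.96 * se
print(t_stats, p_values, ci_low, ci_high)
```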

For more info on the diagnostics, see: https://www.statsmodels.org/stable/stats.html

Now that I've estimated my model parameters I can plug them in: $y = 935.07 + 1.07 \times x_1$

So if I know that a unit is 800 sqft in size, I can predict its asking rent (using my model) as \\$1,788. We can see this visually:

In [None]:
ax = sns.regplot(x=X, y=y)

To interpret my results in plain language, I would say that a 1 sqft increase in unit size is associated with a \\$1.07 increase in asking rent. The relationship between rent and sqft is statistically significant at my chosen significance level (0.05, ie, 95% confidence).

So how good is my model? What does the $R^2$ value tell me?
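$R^2$ is the share of the response's variance that the model accounts for. As a small sketch on synthetic data (not the listings; the noise level is an arbitrary assumption), it can be computed from the residual and total sums of squares, and in simple regression it equals the squared correlation between the predictor and the response.

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic data (not the listings)
x = rng.uniform(400, 2000, size=300)
y_toy = 900 + 1.1 * x + rng.normal(0, 400, size=300)

# fit a line; np.polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y_toy, deg=1)
y_hat = intercept + slope * x

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_toy - y_hat) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# in simple regression, R^2 equals the squared correlation of x and y
r = np.corrcoef(x, y_toy)[0, 1]
print(round(r_squared, 3), round(r**2, 3))
```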

To explain more (and predict better), we need more predictors in our model.

## 3. Multiple regression

Multiple regression is OLS regression with more than one predictor.

In [None]:
# choose a response and predictors
response = "rent"
predictors = ["bedrooms", "sqft"]

# filter full dataset to retain only these columns and only rows without nulls in these columns
data = gdf[[response] + predictors].dropna()

# create design matrix and response vector
X = data[predictors]
y = data[response]

# estimate a linear regression model with OLS, using statsmodels
model = sm.OLS(y, sm.add_constant(X))
result = model.fit()
print(result.summary())

**How do I interpret this multiple regression results table?**

Each coefficient represents the individual predictor's relationship with the response, while holding all the other predictors constant.

A 1 bedroom increase is associated with a \\$148 increase in asking rent, all else equal, and a 1 square foot increase is associated with a \\$0.92 increase in asking rent, all else equal.

I can plug my results into my model to make predictions: rent = 804.46 + 148.17(beds) + 0.92(sqft)

So if I know a unit is 2 bedrooms and 800 sqft I can predict its asking rent (using my new model) as \\$1,837.
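The arithmetic behind that prediction is just the fitted equation with values plugged in. Here is a minimal sketch using the rounded coefficients quoted above; `predict_rent` is a hypothetical helper of my own, not part of statsmodels.

```python
# rounded coefficients as reported in the text above
intercept = 804.46
coef_beds = 148.17
coef_sqft = 0.92


def predict_rent(beds, sqft):
    """Plug predictor values into the fitted equation (hypothetical helper)."""
    return intercept + coef_beds * beds + coef_sqft * sqft


# a 2-bedroom, 800 sqft unit
predicted = predict_rent(beds=2, sqft=800)
print(round(predicted))
```

With a fitted statsmodels result, `result.predict` can do the same job from the unrounded estimates, as long as the new data has the same columns as the design matrix (including the constant).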

For another example of interpreting a regression model in practice, see the discussion in [this article](https://doi.org/10.1080/01944363.2020.1819382).

In [None]:
# choose a response and predictors
response = "rent"
predictors = ["bedrooms", "sqft", "med_home_value", "mean_commute_time"]

# filter full dataset to retain only these columns and only rows without nulls in these columns
data = gdf[[response] + predictors].dropna()

# create design matrix and response vector
X = data[predictors]
y = data[response]

# estimate a linear regression model with OLS, using statsmodels
model = sm.OLS(y, sm.add_constant(X))
result = model.fit()
print(result.summary())

In [None]:
# now it's your turn
# try different sets of predictors to increase R2 while keeping the total number of predictors relatively low and theoretically sound

## 4. Multiple regression with dummy variables

A dummy variable is a binary (1 or 0) variable that represents mutually exclusive categories.

The trick with dummies in regression analysis is that you can't include all categories: you have to leave out 1 (as a reference group). For example, if I want to know the relationship between asking rent and majority white vs non-majority white neighborhood population, I would only include a majority white dummy in my model (so that non-majority white is the reference group that I'm leaving out of my model). Here's an example.

In [None]:
# create a dummy variable: 1 if listing is in majority white tract, otherwise 0
# note: a null pct_white compares as False, so those listings are also coded 0
gdf["majority_white"] = (gdf["pct_white"] > 50).astype(int)

In [None]:
# choose a response and predictors
response = "rent"
predictors = ["bedrooms", "sqft", "med_home_value", "mean_commute_time", "majority_white"]

# filter full dataset to retain only these columns and only rows without nulls in these columns
data = gdf[[response] + predictors].dropna()

# create design matrix and response vector
X = data[predictors]
y = data[response]

# estimate a linear regression model with OLS, using statsmodels
model = sm.OLS(y, sm.add_constant(X))
result = model.fit()
print(result.summary())

**How do I interpret my dummy variable's coefficient?**

"Compared to listings in non-majority white tracts, listings in majority white tracts are associated with asking rents \\$36 higher, all else equal."

The coefficient represents the "effect" (be cautious with causal language) on the response variable of this categorical predictor, compared to the reference group.
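Why must one category be left out as the reference group? A quick numerical check, sketched here on a synthetic binary category (not the tract data), shows that a constant plus dummies for *every* category produces perfectly collinear columns, so OLS has no unique solution.

```python
import numpy as np

rng = np.random.default_rng(3)

# synthetic binary category (not the tract data): 1 = in group, 0 = not
in_group = rng.integers(0, 2, size=200)
not_in_group = 1 - in_group
const = np.ones(200)

# constant plus BOTH dummies: the two dummies sum to the constant column,
# so the columns are perfectly collinear and the matrix is rank deficient
X_all = np.column_stack([const, in_group, not_in_group])
rank_all = np.linalg.matrix_rank(X_all)

# constant plus ONE dummy: full column rank, so OLS has a unique solution
X_ref = np.column_stack([const, in_group])
rank_ref = np.linalg.matrix_rank(X_ref)

print(X_all.shape[1], rank_all)  # 3 columns, rank 2
print(X_ref.shape[1], rank_ref)  # 2 columns, rank 2
```

Dropping one dummy removes the redundancy, and the dropped category becomes the baseline that the remaining coefficient is compared against.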



## 5. Explore relationships and transform variables

You should normally do this first, but here I want to show you the before-and-after, so I didn't.

In this example, I'm interested in predicting rent. So I want to explore the relationships between rent and my predictors, as well as relationships among those predictors themselves. Common ways to do this are with correlation matrices and pair plots.

If I see a nonlinear relationship, I can try to linearize it for better modeling.

In [None]:
# correlation matrix
variables = ["rent", "bedrooms", "sqft", "med_home_value", "mean_commute_time"]
correlations = gdf[variables].corr().round(2)
correlations

In [None]:
# visual correlation matrix via seaborn heatmap
# use vmin, vmax, center to set colorbar scale properly
ax = sns.heatmap(
    correlations, vmin=-1, vmax=1, center=0, cmap="coolwarm", square=True, linewidths=1
)

In [None]:
# plot pairwise relationships with seaborn
grid = sns.pairplot(gdf[variables].dropna(), markers=".")

In [None]:
# take the natural log of rent
gdf["rent_log"] = np.log(gdf["rent"])

In [None]:
# correlation matrix using our transformed variable
variables = ["rent_log", "bedrooms", "sqft", "med_home_value", "mean_commute_time"]
correlations = gdf[variables].corr().round(2)
correlations

In [None]:
# choose a response and predictors
response = "rent_log"
predictors = ["bedrooms", "sqft", "med_home_value", "mean_commute_time"]

# filter full dataset to retain only these columns and only rows without nulls in these columns
data = gdf[[response] + predictors].dropna()

# create design matrix and response vector
X = data[predictors]
y = data[response]

# estimate a linear regression model with OLS, using statsmodels
model = sm.OLS(y, sm.add_constant(X))
result = model.fit()
print(result.summary())

**How do I interpret coefficients when the response is log-transformed?**

If the response is log-transformed but the predictor is not (a "log-linear" model), then the coefficient is called a "semi-elasticity" and (when multiplied by 100) it approximately represents the percent change in the response given a one-unit increase in the predictor.

Example: all else equal, each 1-bedroom increase is associated with a 14% increase in asking rent, and each 1-minute increase in commute time is associated with a 1% decrease in asking rent.
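The "multiply by 100" reading is an approximation that works well for small coefficients. The exact implied percent change is $100 \times (e^{\beta} - 1)$. The sketch below compares the two, using illustrative coefficients of 0.14 and -0.01 that I back out from the rounded percentages above (the actual estimates in the results table may differ); `exact_pct_change` is a hypothetical helper.

```python
import math


def exact_pct_change(b):
    """Exact percent change in the response for a one-unit predictor increase."""
    return 100 * (math.exp(b) - 1)


# illustrative coefficients matching the rounded percentages quoted above
coefs = {"bedrooms": 0.14, "mean_commute_time": -0.01}

for name, b in coefs.items():
    approx_pct = 100 * b  # the rule-of-thumb reading
    print(f"{name}: approx {approx_pct:.2f}%, exact {exact_pct_change(b):.2f}%")
```

For the bedrooms coefficient the exact figure is about 15.0% rather than 14%, while for the small commute coefficient the two readings are nearly identical.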

If both the response and the predictor are log-transformed (a "log-log" model), then the coefficient is called an "elasticity" and it approximately represents the percent change in the response given a 1% increase in the predictor. See [this paper](https://doi.org/10.1177/0308518X19869678) for examples/discussion of interpreting log-linear and log-log models in practice.
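As a small illustration of the log-log case, the sketch below generates synthetic data (not the listings) with a known elasticity of 0.6, which is an arbitrary value I chose, and then recovers it by regressing the log of the response on the log of the predictor.

```python
import numpy as np

rng = np.random.default_rng(4)

# synthetic data (not the listings) with a true elasticity of 0.6
true_elasticity = 0.6
x = rng.uniform(400, 2000, size=500)
y_toy = 50 * x**true_elasticity * np.exp(rng.normal(0, 0.1, size=500))

# regress log(y) on log(x); the slope estimates the elasticity
slope, intercept = np.polyfit(np.log(x), np.log(y_toy), deg=1)
print(f"estimated elasticity: {slope:.3f}")
```

An estimate near 0.6 reads as: a 1% increase in the predictor is associated with roughly a 0.6% increase in the response.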