# Regression with Categorical Data

In [6]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# load the dataset
diamonds = pd.read_csv("../data/diamonds.csv")

diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17858 entries, 0 to 17857
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  17858 non-null  int64  
 1   carat       17858 non-null  float64
 2   cut         17858 non-null  object 
 3   color       17858 non-null  object 
 4   clarity     17858 non-null  object 
 5   depth       17858 non-null  float64
 6   table       17858 non-null  float64
 7   price       17858 non-null  int64  
 8   x           17858 non-null  float64
 9   y           17858 non-null  float64
 10  z           17857 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 1.5+ MB


In [7]:
import statsmodels.api as sm

predictors = ["carat", "x", "y"]

# Prepare the predictor and the response variables
X = sm.add_constant(diamonds[predictors])  # add a constant (intercept) column to the predictors
y = diamonds["price"]

In [8]:
# build and fit the model

model = sm.OLS(y, X)
results = model.fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.747
Model:                            OLS   Adj. R-squared:                  0.747
Method:                 Least Squares   F-statistic:                 1.757e+04
Date:                Fri, 20 Dec 2024   Prob (F-statistic):               0.00
Time:                        11:04:54   Log-Likelihood:            -1.4564e+05
No. Observations:               17858   AIC:                         2.913e+05
Df Residuals:                   17854   BIC:                         2.913e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -6364.4232    137.522    -46.279      0.0

In [10]:
# encode the categorical predictor (color) as dummy variables
color_dummies = pd.get_dummies(diamonds["color"], drop_first=True)
# drop_first=True avoids multicollinearity with the constant term:
# the dropped category becomes the reference, represented by all dummies being 0

color_dummies = color_dummies.astype(int)  # convert boolean to int

In [11]:
# use the dummy variables as predictors
X = sm.add_constant(color_dummies)
y = diamonds["price"]

# check X
X

       const  E  F  G  H  I  J
0        1.0  1  0  0  0  0  0
1        1.0  1  0  0  0  0  0
2        1.0  1  0  0  0  0  0
3        1.0  0  0  0  0  1  0
4        1.0  0  0  0  0  0  1
...      ... .. .. .. .. .. ..
17853    1.0  0  0  1  0  0  0
17854    1.0  0  0  0  0  0  1
17855    1.0  0  0  1  0  0  0
17856    1.0  0  1  0  0  0  0
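
The preview of X has columns for E through J but none for D, which suggests D is the dropped level. The sketch below shows one way to check this. It uses a small made-up color column (an assumption for illustration, not the diamonds data) and relies on `pd.get_dummies(..., drop_first=True)` dropping the first level in sorted order.

```python
import pandas as pd

# Toy color column (hypothetical values, not the diamonds data)
colors = pd.Series(["E", "D", "J", "G", "D", "E"], name="color")

# All levels present in the column, in sorted order
all_levels = sorted(colors.unique())

# Same encoding as in the notebook: drop the first level, convert booleans to ints
dummies = pd.get_dummies(colors, drop_first=True).astype(int)

# The reference level is the one that has no dummy column of its own
reference = [lvl for lvl in all_levels if lvl not in dummies.columns]

print("dummy columns:", list(dummies.columns))
print("reference level:", reference)
```

Rows where every dummy is 0 belong to the reference level, so no information is lost by dropping its column.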


In [13]:
# Fit the model
model = sm.OLS(y, X)

results = model.fit()

# Print the summary
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.010
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     29.22
Date:                Fri, 20 Dec 2024   Prob (F-statistic):           5.02e-35
Time:                        11:14:39   Log-Likelihood:            -1.5783e+05
No. Observations:               17858   AIC:                         3.157e+05
Df Residuals:                   17851   BIC:                         3.157e+05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       4014.7306     35.971    111.611      0.0

The model summary lists a coefficient for every color except one, because one color was dropped to avoid multicollinearity. The dropped color is the reference category: the intercept (const) is its mean price, and each remaining coefficient is the expected price difference between that color and the reference.

For example, assuming the reference color is D, a coefficient of 2153 for J means that diamonds of color J are expected to cost, on average, 2153 USD more than diamonds of color D. Negative coefficients likewise indicate lower average prices than the reference category.
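
To make this interpretation concrete, here is a minimal sketch on made-up prices (the numbers are illustrative, not taken from the diamonds data). It fits the same kind of dummy-variable model with `np.linalg.lstsq`, used here instead of `sm.OLS` only to keep the example dependency-light, and compares the coefficients with the differences in group means.

```python
import numpy as np
import pandas as pd

# Made-up prices for three colors (illustrative only)
toy = pd.DataFrame({
    "color": ["D", "D", "E", "E", "J", "J"],
    "price": [1000, 1200, 1500, 1700, 3000, 3400],
})

# Dummy-encode color with D as the dropped (reference) level
dummies = pd.get_dummies(toy["color"], drop_first=True).astype(float)

# Design matrix: a column of ones (the constant) followed by the dummies
X = np.column_stack([np.ones(len(toy)), dummies.to_numpy()])
y = toy["price"].to_numpy(dtype=float)

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = pd.Series(coef, index=["const"] + list(dummies.columns))

# Mean price per color, and each color's difference from the reference D
group_means = toy.groupby("color")["price"].mean()
diffs = group_means - group_means["D"]

print(fitted)
print(diffs)
```

In this toy setup the constant matches the mean price of color D, and the E and J coefficients match how far their mean prices sit above D's. This equivalence holds when the dummies are the only predictors; with additional predictors, the coefficients become differences adjusted for those predictors.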