
# **Linear Regression with categorical predictors**

In [4]:
import pandas as pd
data=pd.read_csv("/content/jane.csv")
data.head()

,x,color,y
0,1,red,24.894
1,1,blue,12.323
2,1,green,16.645
3,2,red,25.231
4,2,blue,12.119


In [5]:
data.shape

(150, 3)

In [6]:
data.dtypes

,dtype
x,int64
color,object
y,float64


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       150 non-null    int64  
 1   color   150 non-null    object 
 2   y       150 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 3.6+ KB


In [8]:
# Summary statistics for the numeric columns
data.describe()

,x,y
count,150.0,150.0
mean,25.5,41.668193
std,14.479214,15.606385
min,1.0,12.119
25%,13.0,28.8525
50%,25.5,42.037
75%,38.0,52.42425
max,50.0,73.067


In [13]:
# Convert the color column to a categorical dtype
data['color'] = data['color'].astype('category')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   x       150 non-null    int64   
 1   color   150 non-null    category
 2   y       150 non-null    float64 
dtypes: category(1), float64(1), int64(1)
memory usage: 2.7 KB


In [15]:
# Unique values of the color column
# (data.color.nunique() would return how many there are)
data.color.unique()

['red', 'blue', 'green']
Categories (3, object): ['blue', 'green', 'red']

In [16]:
# Number of rows in each color category
print(pd.crosstab(index=data["color"], columns="count"))

col_0  count
color       
blue      50
green     50
red       50


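Note: color is a factor (categorical) variable, so it enters the regression through dummy variables that take 0/1 values, one per category level (one-hot encoding).

pd.get_dummies() converts categorical columns (such as text categories) into one-hot encoded indicator columns.

drop_first=True drops the indicator for the first category (here color_blue) to avoid the dummy variable trap: when every category has its own indicator, the indicators sum to the intercept column, so one column can be perfectly predicted from the others. This perfect multicollinearity leaves the regression coefficients without a unique solution. The dropped category, blue, becomes the reference level.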


In [19]:
datadummy = pd.get_dummies(data=data, drop_first=True)
print(datadummy.head())

   x       y  color_green  color_red
0  1  24.894        False       True
1  1  12.323        False      False
2  1  16.645         True      False
3  2  25.231        False       True
4  2  12.119        False      False
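
The dummy variable trap described above can be checked numerically. The following is a minimal sketch on a hypothetical six-row toy frame (not the notebook's data): it builds a design matrix with an intercept plus the color indicators, once with all three indicators and once with drop_first=True, and compares the number of columns with the matrix rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three colors, two rows each (not the notebook's data).
toy = pd.DataFrame({"color": ["blue", "green", "red", "blue", "green", "red"]})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(toy["color"], dtype=float)
# Dummy coding: the first category (blue) is dropped and becomes the baseline.
reduced = pd.get_dummies(toy["color"], drop_first=True, dtype=float)

intercept = np.ones((len(toy), 1))

# Design matrices: intercept column followed by the indicator columns.
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# With all three indicators, the indicator columns sum to the intercept
# column, so the matrix has more columns than its rank.
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("full:   ", X_full.shape[1], "columns, rank", rank_full)
print("reduced:", X_reduced.shape[1], "columns, rank", rank_reduced)
```

When the rank is smaller than the number of columns, the least-squares problem has infinitely many solutions. Dropping one indicator restores full column rank.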


In [21]:
# Fit the model. The formula interface dummy-codes color automatically,
# so the model is fitted on data rather than on datadummy.
from statsmodels.formula.api import ols
model = ols('y ~ x + color', data=data).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     428.6
Date:                Sun, 05 Oct 2025   Prob (F-statistic):           3.81e-72
Time:                        09:08:21   Log-Likelihood:                -453.26
No. Observations:                 150   AIC:                             914.5
Df Residuals:                     146   BIC:                             926.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         13.1699      1.017     12.

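Note: the intercept is the estimated mean of y for color = blue (the reference level) at x = 0. The coefficient for green (labelled color[T.green] in the statsmodels summary) is the estimated difference in mean y between color = green and color = blue at the same value of x.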
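
This reading of the coefficients can be checked on simulated data. The sketch below is illustrative only: the toy frame, its offsets (10, 15, 25) and the slope of 1 are assumptions, not the notebook's data. It fits the same formula and compares the coefficients with predictions at x = 0.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data (assumption): three parallel lines, one per color.
rng = np.random.default_rng(0)
x = np.tile(np.arange(1, 21), 3)
color = np.repeat(["blue", "green", "red"], 20)
offset = pd.Series(color).map({"blue": 10.0, "green": 15.0, "red": 25.0}).to_numpy()
y = offset + 1.0 * x + rng.normal(scale=1.0, size=x.size)
toy = pd.DataFrame({"x": x, "color": color, "y": y})

fit = ols("y ~ x + color", data=toy).fit()

# Fitted means at x = 0 for the reference level (blue) and for green.
at_zero = pd.DataFrame({"x": [0, 0], "color": ["blue", "green"]})
pred = fit.predict(at_zero)

# Intercept equals the fitted mean for blue at x = 0.
print(fit.params["Intercept"], pred.iloc[0])
# color[T.green] equals the fitted green-minus-blue difference at the same x.
print(fit.params["color[T.green]"], pred.iloc[1] - pred.iloc[0])
```

Because the model has no interaction term, the green-minus-blue difference is the same at every x, not only at x = 0.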

Is there an interaction between x and color? An interaction would mean that the slope of y on x differs between colors.


In [23]:
model3 = ols('y ~ x + color + x:color', data=data).fit()
print(model3.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.899
Model:                            OLS   Adj. R-squared:                  0.895
Method:                 Least Squares   F-statistic:                     256.0
Date:                Sun, 05 Oct 2025   Prob (F-statistic):           9.09e-70
Time:                        09:15:31   Log-Likelihood:                -452.64
No. Observations:                 150   AIC:                             917.3
Df Residuals:                     144   BIC:                             935.4
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           13.9612      1.450  
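
One way to answer the interaction question from the two summaries is a likelihood-ratio comparison of the nested models. The sketch below uses only the log-likelihoods and model degrees of freedom printed above. The chi-squared reference distribution is a large-sample approximation, so the p-value is a rough guide, not an exact test.

```python
from scipy.stats import chi2

# Values copied from the two model summaries above.
llf_additive = -453.26      # y ~ x + color, Df Model = 3
llf_interaction = -452.64   # y ~ x + color + x:color, Df Model = 5

# Likelihood-ratio statistic and its degrees of freedom
# (the number of extra interaction coefficients).
lr_stat = 2 * (llf_interaction - llf_additive)
df_diff = 5 - 3

# Upper-tail probability under the chi-squared approximation.
p_value = chi2.sf(lr_stat, df_diff)

print(round(lr_stat, 2), round(p_value, 3))
```

The statistic is 1.24 on 2 degrees of freedom, giving a p-value of about 0.54, so this comparison gives no evidence that the slopes differ by color. The other reported figures agree: the interaction model has a higher AIC (917.3 versus 914.5) and a lower adjusted R-squared (0.895 versus 0.896). An F-test on the two fitted models, for example with statsmodels.stats.anova.anova_lm(model, model3), would be the exact counterpart of this check.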

# Confidence interval and prediction interval

**Confidence interval for the mean response at $x_0$:** The target is the mean response $E[y_0 \mid x_0]$, a fixed quantity, so the interval only incorporates uncertainty in the estimate $\widehat{y_0}$.

**Prediction interval for the response at $x_0$:** The target is a new observation $y_0$, a random quantity, so the interval incorporates uncertainty in both $\widehat{y_0}$ and the error term $\epsilon_0$.
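
In standard OLS notation, the two intervals at level $1-\alpha$ can be written as follows. The symbols are assumptions introduced here, not taken from the notebook: $X$ is the design matrix, $n$ the number of observations, $p$ the number of coefficients, $\hat\sigma$ the residual standard deviation, and $t_{1-\alpha/2,\,n-p}$ a quantile of the $t$ distribution.

$$
\text{CI:}\quad \widehat{y_0} \pm t_{1-\alpha/2,\,n-p}\;\hat\sigma\sqrt{x_0^\top (X^\top X)^{-1} x_0}
$$

$$
\text{PI:}\quad \widehat{y_0} \pm t_{1-\alpha/2,\,n-p}\;\hat\sigma\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}
$$

The extra $1$ under the square root is the variance contribution of $\epsilon_0$, which is why the prediction interval is always wider than the confidence interval at the same $x_0$.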

In [24]:
newdata = pd.DataFrame({"x":[2],"color":["blue"]})
pred = model.predict(newdata)
print(pred)

0    15.176774
dtype: float64


In [26]:
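# mean_ci_*: 95% confidence interval for the mean response
# obs_ci_*: 95% prediction interval for a new observation
model.get_prediction(newdata).summary_frame()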

,mean,mean_se,mean_ci_lower,mean_ci_upper,obs_ci_lower,obs_ci_upper
0,15.176774,0.977234,13.245421,17.108126,5.041323,25.312224
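
The intervals in this table can be reproduced by hand from the reported mean and standard error. The sketch below assumes the default 95% level and takes the residual degrees of freedom (146) from the first model summary. It also inverts the prediction-interval width to recover the implied residual standard deviation.

```python
from math import sqrt
from scipy.stats import t

# Values copied from the summary_frame output above.
mean, mean_se = 15.176774, 0.977234
obs_ci_lower, obs_ci_upper = 5.041323, 25.312224
df_resid = 146  # "Df Residuals" in the first model summary

# Two-sided 95% critical value of the t distribution.
t_crit = t.ppf(0.975, df_resid)

# Confidence interval for the mean response: mean +/- t * se(mean).
ci_lower = mean - t_crit * mean_se
ci_upper = mean + t_crit * mean_se

# The prediction interval uses sqrt(se(mean)^2 + sigma^2), so its reported
# half-width can be inverted to recover the residual standard deviation.
obs_se = (obs_ci_upper - obs_ci_lower) / (2 * t_crit)
sigma_hat = sqrt(obs_se**2 - mean_se**2)

print(round(ci_lower, 3), round(ci_upper, 3), round(sigma_hat, 2))
```

The computed confidence limits match mean_ci_lower and mean_ci_upper to about three decimals. The prediction interval is much wider because the residual standard deviation (roughly 5) dominates the standard error of the fitted mean (roughly 1).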
