## Linear regression exercises
We will use the [Kaggle dataset about gender pay gap](https://www.kaggle.com/datasets/mohithsairamreddy/salary-data?resource=download).
In Week 1, we learned how to open Kaggle dataset.
Perform the necessary EDA steps and a meaningful linear regression test that you will interpret.


In [43]:
# Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import statsmodels.api as sm

In [44]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("mohithsairamreddy/salary-data")

print("Path to dataset files:", path)

Path to dataset files: /home/cgraiff/.cache/kagglehub/datasets/mohithsairamreddy/salary-data/versions/4


In [45]:
filepath="/home/cgraiff/.cache/kagglehub/datasets/mohithsairamreddy/salary-data/versions/4/Salary_Data.csv"
df=pd.read_csv(filepath)
df.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32.0,Male,Bachelor's,Software Engineer,5.0,90000.0
1,28.0,Female,Master's,Data Analyst,3.0,65000.0
2,45.0,Male,PhD,Senior Manager,15.0,150000.0
3,36.0,Female,Bachelor's,Sales Associate,7.0,60000.0
4,52.0,Male,Master's,Director,20.0,200000.0


### Preprocessing
Some hints for text cleaning (Source: [This tutorial](https://medium.com/@evelyn.eve.9512/gender-pay-gap-comparisons-with-regression-analysis-45223cd3ed13))
<br> <br>
1. `pd.get_dummies()`: for linear regression, you need numerical variables. This method is useful to handle categorical variables. It creates a column for each value, and assigns value 1 (if it corresponds) or 0 (if it does not) to it.
For `gender`, this dataset only has two entries, so we can map it to one single column, which we will call male and identify with True=1 and False=0.

In [46]:
df['Male'] = pd.get_dummies(df['Gender'], drop_first=True)['Male']
df.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary,Male
0,32.0,Male,Bachelor's,Software Engineer,5.0,90000.0,True
1,28.0,Female,Master's,Data Analyst,3.0,65000.0,False
2,45.0,Male,PhD,Senior Manager,15.0,150000.0,True
3,36.0,Female,Bachelor's,Sales Associate,7.0,60000.0,False
4,52.0,Male,Master's,Director,20.0,200000.0,True


2. It makes more sense to visualize the age as "difference to the mean age", because age=0 is not relevant **to this specific analysis**.
> The step before is necessary, because linear regression needs numerical values. This step is not, and needs to be evaluated depending on your needs.

In [32]:
df['C_Age'] = df['Age'] - df['Age'].mean()
df.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary,Male,C_Age
0,32.0,Male,Bachelor's,Software Engineer,5.0,90000.0,True,-1.620859
1,28.0,Female,Master's,Data Analyst,3.0,65000.0,False,-5.620859
2,45.0,Male,PhD,Senior Manager,15.0,150000.0,True,11.379141
3,36.0,Female,Bachelor's,Sales Associate,7.0,60000.0,False,2.379141
4,52.0,Male,Master's,Director,20.0,200000.0,True,18.379141


3. We can divide the salary by 1000 to facilitate its visualization by avoiding huge numbers.
> Also not necessary!

In [33]:
df['Salary'] = df['Salary'] / 1000
df.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary,Male,C_Age
0,32.0,Male,Bachelor's,Software Engineer,5.0,90.0,True,-1.620859
1,28.0,Female,Master's,Data Analyst,3.0,65.0,False,-5.620859
2,45.0,Male,PhD,Senior Manager,15.0,150.0,True,11.379141
3,36.0,Female,Bachelor's,Sales Associate,7.0,60.0,False,2.379141
4,52.0,Male,Master's,Director,20.0,200.0,True,18.379141


Remember to check for empty values, and **drop them** or **replace them with the mean**, depending on how many they are and how meaningful the mean is.

In [None]:
# Let's check for empty values
df.isna().sum()
df.head()

In [None]:
# We can drop them
df.dropna(inplace=True)

In [35]:
X = df[["Years of Experience", "Male", "C_Age"]]
y = df["Salary"]

In [None]:
# Sanity check
X.isna().sum()

Years of Experience    0
Male                   0
C_Age                  0
dtype: int64

In [37]:
# Split dataset in train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # the data is shuffled before being split. You need to set a random state (in this case 42) to always obtain the same train and test sets

print("Training Set Size:", X_train.shape)
print("Testing Set Size:", X_test.shape)

Training Set Size: (5358, 3)
Testing Set Size: (1340, 3)


In [38]:
model = LinearRegression()
model.fit(X, y)

In [39]:
# Display the model parameters
print(f"Intercept (β₀): {model.intercept_:.2f}")
print(f"Coefficient (β₁): {model.coef_[0]:.2f}")

Intercept (β₀): 37.80
Coefficient (β₁): 9.17


In [40]:
# Predict values for the test set
predictions = model.predict(X_test)

# Get residuals
residuals = y_test - predictions
print(residuals)

1883    28.006018
2630   -10.676183
498     37.176753
5973   -15.639189
4108   -18.201307
          ...    
2830    -6.099607
6154   -10.639189
4940     1.860457
135    -64.575381
3688   -25.385365
Name: Salary, Length: 1340, dtype: float64


In [41]:
# Calculate R²
r2 = r2_score(y_test, predictions)

# Calculate adjusted R²
n = X.shape[0]  # number of samples
p = X.shape[1]  # number of features
print(n)
print(p)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R²: {r2:.3f}")
print(f"Adjusted R²: {adjusted_r2:.3f}")

6698
3
R²: 0.670
Adjusted R²: 0.670
