**HANDS-ON SESSION-II**

**PROFESSOR: IRINA HASHMI**

**REGRESSION MODELS**

Import the required libraries

In [78]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import make_scorer, mean_squared_error
import numpy as np

Import the data using pandas library

Delete the empty rows

set 'Capital-gain' as the target variable - This is the money earned when an asset is sold at a higher price than its original purchase price. For example, if someone buys stock for 1,000 and later sells it for 1,500, the capital gain is $500.

In [79]:
income_reg=pd.read_csv('/content/income.csv')
income_reg.dropna(inplace=True)
target = 'capital-gain'

Encode the categorical columns and perform standardization on the numeric columns. Standardization is an import step for regression models as we need to keep all numerical columns in one single scale.

In [80]:
# Define categorical and numerical columns
categorical_cols = ['workclass', 'education', 'marital status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
numerical_cols = ['age', 'education-num', 'capital-loss', 'hours-per-week']

# One-hot encode categorical features
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_data = encoder.fit_transform(income_reg[categorical_cols])

# Create a DataFrame from encoded data
encoded_columns = encoder.get_feature_names_out(categorical_cols)
df_encoded = pd.DataFrame(encoded_data, columns=encoded_columns, index=income_reg.index)

# Combine numerical data and encoded categorical data
X = pd.concat([income_reg[numerical_cols], df_encoded], axis=1)
y = income_reg[target]

# Scale the features for better performance of regression models
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Split the data into training and test sets. 70% of the original data as training data and 30% of the original data as the test set.

In [81]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)


Call the LinearRegression object and apply it to the data.

Regression models in general and evaluated using Mean Square Error and R square.

But, for this model the R square is 0.04 which means the model is not a good fit for the data.

In [82]:
# Initialize and train the Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Predict on the testing set
y_pred_linear = linear_model.predict(X_test)

# Evaluate the Linear Regression model
print("Linear Regression Mean Squared Error:", mean_squared_error(y_test, y_pred_linear))
print("Linear Regression R^2 Score:", r2_score(y_test, y_pred_linear))


Linear Regression Mean Squared Error: 59572016.428556494
Linear Regression R^2 Score: 0.04982291214451218


Decision trees for regression problems are built by splitting the data into smaller and smaller subsets to predict a continuous target value.

Random Forest Regression is a machine learning method that predicts a continuous value by averaging the results of many decision trees.

Initially the R square for the model is in negetives which suggests that the model is performing worse.

In the next steps we perform regularization or apply different regularization terms

In [83]:
# Initialize and train the Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict on the testing set
y_pred_rf = rf_model.predict(X_test)

# Evaluate the Random Forest Regressor
print("Random Forest Regressor Mean Squared Error:", mean_squared_error(y_test, y_pred_rf))
print("Random Forest Regressor R^2 Score:", r2_score(y_test, y_pred_rf))


Random Forest Regressor Mean Squared Error: 69449198.89273985
Random Forest Regressor R^2 Score: -0.10771871616834439


In the next steps we perform regularization or apply different regularization terms which define the model more precisely which can improve the model performance.

And the model performs better, but not as expected.

In [84]:
# Create a RandomForestRegressor with regularization parameters
regressor = RandomForestRegressor(
    n_estimators=100,          # Number of trees
    max_depth=10,              # Limit the depth of each tree
    min_samples_split=5,       # Minimum samples required to split an internal node
    min_samples_leaf=4,        # Minimum samples required to be at a leaf node
    max_features='sqrt',       # Use the square root of the total features at each split
    random_state=42
)

# Fit the model
regressor.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mean_mse:.2f}")
print(f"R^2 Score: {r2_score(y_test, y_pred):.2f}")


Mean Squared Error: 50849905.59
R^2 Score: 0.06


We give a try for cross validation for the regressor, we can see that the performance is same as applying regularization parameters.

In [85]:
# Assuming X and y are your features and target variable
# Define the model with regularization parameters
regressor = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=4,
    max_features='sqrt',
    random_state=42
)

# Set up k-fold cross-validation (5 folds)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation and calculate MSE for each fold
scores = cross_val_score(
    regressor, X, y, cv=kf, scoring=make_scorer(mean_squared_error)
)

# Calculate the mean and standard deviation of the MSE scores
mean_mse = np.mean(scores)
std_mse = np.std(scores)

print(f"Mean MSE from cross-validation: {mean_mse:.2f}")
print(f"Standard Deviation of MSE: {std_mse:.2f}")
print(f"R^2 Score: {r2_score(y_test, y_pred):.2f}")


Mean MSE from cross-validation: 50849905.59
Standard Deviation of MSE: 7408346.24
R^2 Score: 0.06


Gradient Boosting Regressor is a machine learning algorithm that builds a strong model by sequentially adding small decision trees, each correcting the errors of the previous ones. It combines these trees to make accurate predictions for continuous values.


The model performs a bit better than the Random Forest Regressor.

In [86]:
#define the gradient boosting model with different parameters
gbm_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm_model.fit(X_train, y_train)

# Predict on the testing set
y_pred_gbm = gbm_model.predict(X_test)

# Evaluate the Gradient Boosting Regressor
print("Gradient Boosting Regressor Mean Squared Error:", mean_squared_error(y_test, y_pred_gbm))
print("Gradient Boosting Regressor R^2 Score:", r2_score(y_test, y_pred_gbm))


Gradient Boosting Regressor Mean Squared Error: 58788759.604564026
Gradient Boosting Regressor R^2 Score: 0.06231590353010075


**OTHER REGRESSION MODELS**:

1. Ridge Regression (L2 Regularization):
Ridge regression adds a penalty to the sum of the squared coefficients in the linear model, which shrinks the coefficients but keeps all features.
This helps prevent overfitting by ensuring that the model doesn’t rely too heavily on any single feature.
2. Lasso Regression (L1 Regularization):
Lasso regression adds a penalty to the sum of the absolute values of the coefficients, which can shrink some coefficients to zero, effectively selecting features.
It’s useful for simplifying the model by keeping only the most important features.
3. Elastic Net (Combination of L1 and L2):
Elastic Net combines both L1 and L2 penalties, balancing between shrinkage (like Ridge) and feature selection (like Lasso).
It’s ideal when you want to retain some features but also simplify the model without relying too heavily on either type of regularization.

In [87]:
# L2 Regularization: Ridge Regression
ridge = Ridge(alpha=1.0)  # alpha controls the regularization strength; higher means more regularization
ridge.fit(X_train, y_train)
ridge_predictions = ridge.predict(X_test)
print("Ridge MSE:", mean_squared_error(y_test, ridge_predictions))
print("Ridge R^2 Score:", r2_score(y_test, ridge_predictions))

# L1 Regularization: Lasso Regression
lasso = Lasso(alpha=0.1)  # alpha is the regularization parameter
lasso.fit(X_train, y_train)
lasso_predictions = lasso.predict(X_test)
print("Lasso MSE:", mean_squared_error(y_test, lasso_predictions))
print("Lasso R^2 Score:", r2_score(y_test, lasso_predictions))

# L1 + L2 Regularization: Elastic Net
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio balances between L1 and L2 (0 = pure L2, 1 = pure L1)
elastic_net.fit(X_train, y_train)
elastic_net_predictions = elastic_net.predict(X_test)
print("Elastic Net MSE:", mean_squared_error(y_test, elastic_net_predictions))
print("Elastic Net R^2 Score:", r2_score(y_test, elastic_net_predictions))

Ridge MSE: 59572408.75693983
Ridge R^2 Score: 0.04981665448423622


  model = cd_fast.enet_coordinate_descent(


Lasso MSE: 59572227.93484804
Lasso R^2 Score: 0.049819538606998726
Elastic Net MSE: 59576903.49001714
Elastic Net R^2 Score: 0.04974496323317357


**CONCLUSION**:
<ol>
Linear Regression                           - 0.04  


Random Forest Regressor                     - -0.17


Random Forest Regressor with regularization - 0.05


Gradient Boosting Regressor                 - 0.06


Ridge Regression                            - 0.04


Lasso Regression                            - 0.04


Elastic Regression                          - 0.04
</ol>

Again in Regression model Gradient Boosting Regressor is the best model among others.