# Adel Movahedian 
# 400102074
## Assignment 6

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge  # LinearRegression is used for the baseline model below
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import GridSearchCV
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.linear_model import Lasso

## -----------------------------------------

df = pd.read_csv('omid_zendegi.csv')
print(df.head())

df.columns = df.columns.str.strip()
print(df.columns.tolist())

df = df.dropna(subset=['Life expectancy'])

df['Is_Developed'] = df['Status'].map({'Developed': 1, 'Developing': 0})
df = df.drop(columns=['Status', 'Country'])

numerical_features = [
    'Year', 'Is_Developed', 'Adult Mortality', 'infant deaths', 'Alcohol',
    'percentage expenditure', 'Hepatitis B', 'Measles', 'BMI', 'under-five deaths',
    'Polio', 'Total expenditure', 'Diphtheria', 'HIV/AIDS', 'GDP', 'Population',
    'thinness  1-19 years', 'Income composition of resources', 'Schooling'
]

imputer = SimpleImputer(strategy='median')
df[numerical_features] = imputer.fit_transform(df[numerical_features])

df['log_GDP'] = np.log(df['GDP'] + 1)
df['log_Population'] = np.log(df['Population'] + 1)
df = df.drop(columns=['GDP', 'Population'])
df['Alcohol_squared'] = df['Alcohol'] ** 2

features = [
    'Year', 'Is_Developed', 'Adult Mortality', 'infant deaths', 'Alcohol',
    'Alcohol_squared', 'percentage expenditure', 'Hepatitis B', 'Measles', 'BMI',
    'Polio', 'Total expenditure', 'Diphtheria', 'HIV/AIDS', 'log_GDP',
    'log_Population', 'thinness  1-19 years', 'Income composition of resources',
    'Schooling'
]

X = df[features]
y = df['Life expectancy']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

model = LinearRegression()
model.fit(X_train_poly, y_train)

y_pred = model.predict(X_test_poly)

r2 = r2_score(y_test, y_pred)
print(f"\nLinear Regression R2 Score: {r2:.3f}")


       Country  Year      Status  Life expectancy   Adult Mortality  \
0  Afghanistan  2015  Developing              65.0            263.0   
1  Afghanistan  2014  Developing              59.9            271.0   
2  Afghanistan  2013  Developing              59.9            268.0   
3  Afghanistan  2012  Developing              59.5            272.0   
4  Afghanistan  2011  Developing              59.2            275.0   

   infant deaths  Alcohol  percentage expenditure  Hepatitis B  Measles   ...  \
0             62     0.01               71.279624         65.0      1154  ...   
1             64     0.01               73.523582         62.0       492  ...   
2             66     0.01               73.219243         64.0       430  ...   
3             69     0.01               78.184215         67.0      2787  ...   
4             71     0.01                7.097109         68.0      3013  ...   

   Polio  Total expenditure  Diphtheria    HIV/AIDS         GDP  Population  \
0    6.

The code cleans and processes life expectancy data, converts categorical values into numbers, transforms features, and handles missing values. It prepares the data, trains a linear regression model, and evaluates its performance using the R2 score to assess prediction accuracy.

This code provides a practical demonstration of data preprocessing and regression analysis in machine learning. It shows how to clean, transform, and encode real-world data while handling missing values with median imputation. I gained an understanding of feature transformation techniques (logarithmic scaling of GDP and population, and degree-2 polynomial expansion) for capturing nonlinear relationships, and of how standardization puts all features on a comparable scale. The code also illustrates why data is divided into training and testing sets for unbiased model evaluation, and how a metric like R2 assesses prediction accuracy. It’s a concise yet complete workflow for predictive modeling.
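To make the R2 metric concrete, here is a minimal pure-Python sketch of the standard definition, R2 = 1 - SS_res / SS_tot. The helper name `r2_by_hand` and the toy values are illustrative assumptions, not part of the notebook or the dataset.

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy life-expectancy values (made up for illustration).
y_true = [60.0, 65.0, 70.0, 75.0]

perfect = r2_by_hand(y_true, y_true)        # predictions equal to the targets
baseline = r2_by_hand(y_true, [67.5] * 4)   # always predicting the mean of y_true
worse = r2_by_hand(y_true, [80.0] * 4)      # a constant far from the mean

print(perfect, baseline, worse)
```

A perfect prediction scores 1, always predicting the mean scores 0, and a model that does worse than the mean gets a negative score.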

In [None]:
param_grid = {
    'alpha': [0.01, 0.1, 1, 10],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'poly']
}

kr = KernelRidge()

grid_search = GridSearchCV(kr, param_grid, cv=5, scoring='r2', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

y_pred_kernel_opt = grid_search.predict(X_test_scaled)
r2_kernel_opt = r2_score(y_test, y_pred_kernel_opt)
print(f"Kernel Regression R2 Score: {r2_kernel_opt:.3f}")

Kernel Regression R2 Score: 0.923


This part of the code aims to optimize the performance of a Kernel Ridge Regression model for predicting life expectancy. By using GridSearchCV, it systematically tests every combination of the listed hyperparameters (alpha, gamma, and the kernel types 'rbf' and 'poly') to find the configuration that maximizes the R2 score. Five-fold cross-validation (cv=5) scores each configuration on data held out from fitting, which makes the comparison more robust and reduces the risk of selecting hyperparameters that overfit the training data.

The optimized model is then applied to the test data, and its accuracy is measured using the R2 score, which indicates how well the optimized model explains the variation in life expectancy.

From this, I have gained an understanding of how important hyperparameter tuning is for model accuracy and robustness. It also demonstrates how cross-validation is used to validate model performance. Automating the search finds the best configuration within the specified grid, rather than relying on arbitrary or manual parameter selection.
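The size of the search above can be counted directly. The sketch below uses only the standard library to enumerate the same `param_grid`; the variable names `candidates` and `n_cv_fits` are illustrative and do not come from scikit-learn.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'alpha': [0.01, 0.1, 1, 10],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'poly'],
}
cv_folds = 5

# Every combination of the listed values is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per cross-validation fold.
n_cv_fits = len(candidates) * cv_folds

print(len(candidates), n_cv_fits)
```

With `refit=True`, the GridSearchCV default, the best candidate is then fitted once more on the full training set, which is what makes `grid_search.predict` usable afterwards.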

In [None]:
df['High_life'] = (df['Life expectancy'] >= 70).astype(int)

selected_features = [
    'Year', 'Is_Developed', 'Adult Mortality', 'Alcohol', 'BMI',
    'HIV/AIDS', 'log_GDP', 'log_Population', 'Income composition of resources',
    'Schooling', 'Total expenditure', 'percentage expenditure', 'Diphtheria', 'Polio'
]

X_log = df[selected_features]
y_log = df['High_life']

X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X_log, y_log, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=4, interaction_only=False, include_bias=False)),
    ('logreg', LogisticRegression(C=100, solver='liblinear', max_iter=1000, class_weight='balanced'))
])

pipeline.fit(X_train_log, y_train_log)

y_pred = pipeline.predict(X_test_log)
y_pred_prob = pipeline.predict_proba(X_test_log)[:, 1]

accuracy = accuracy_score(y_test_log, y_pred)
r2 = r2_score(y_test_log, y_pred_prob)

print(f"Logistic Regression R2 Score: {r2:.3f}")

Logistic Regression R2 Score: 0.810



This part of the code builds and evaluates a logistic regression model to classify countries into two groups based on their life expectancy: countries with high life expectancy (≥70 years) and those below 70. It creates a new binary column (High_life) in the dataset to represent this classification.

To prepare for modeling, the code selects relevant features and splits the data into training and testing sets. It then builds a pipeline that standardizes the features, generates degree-4 polynomial features to capture complex interactions, and applies logistic regression for classification. The pipeline is trained on the training set and evaluated on the test set. It predicts both class labels and class probabilities. Accuracy is computed from the labels, although only the R2 score of the predicted probabilities is printed.
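A degree-4 expansion grows the feature space considerably. The sketch below applies the standard combinatorial count of monomials to the feature counts used in this notebook; `n_poly_features` is a hypothetical helper, not a scikit-learn function.

```python
from math import comb

def n_poly_features(n_inputs, degree):
    """Count monomials of total degree 1..degree in n_inputs variables.

    Hypothetical helper. This matches a full polynomial expansion with
    interaction terms and powers, and without the constant (bias) column.
    """
    return comb(n_inputs + degree, degree) - 1

logreg_columns = n_poly_features(14, 4)  # 14 selected features, degree 4
linreg_columns = n_poly_features(19, 2)  # 19 regression features, degree 2

print(logreg_columns, linreg_columns)
```

The classification pipeline therefore fits far more coefficients than the degree-2 regression models, which is one reason regularization matters here.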

This taught me how to transform a regression task into a classification problem by defining a custom threshold. I also learned how pipelines simplify preprocessing and modeling, ensuring all steps are applied consistently. Polynomial features and balanced class weights can help improve performance, especially with imbalanced datasets. Additionally, the R2 score computed on the predicted probabilities measures how closely those probabilities track the binary outcome, rather than only whether the thresholded prediction is correct.
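Applying R2 to predicted probabilities is unusual for a classifier, so it helps to see what it measures. The sketch below shows, on made-up labels and probabilities, that it is a rescaled Brier score (mean squared error of the probabilities). The helper names are illustrative assumptions.

```python
def brier_score(y_true, p):
    """Mean squared difference between 0/1 labels and predicted probabilities."""
    return sum((t - q) ** 2 for t, q in zip(y_true, p)) / len(y_true)

def r2_on_probabilities(y_true, p):
    """R2 computed as in the notebook: labels as targets, probabilities as predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - q) ** 2 for t, q in zip(y_true, p))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up labels and probabilities, for illustration only.
y_true = [1, 0, 1, 1, 0]
p = [0.9, 0.2, 0.7, 0.6, 0.1]

base_rate = sum(y_true) / len(y_true)
label_variance = base_rate * (1.0 - base_rate)  # variance of a 0/1 variable

r2 = r2_on_probabilities(y_true, p)
r2_via_brier = 1.0 - brier_score(y_true, p) / label_variance

print(r2, r2_via_brier)
```

Both expressions give the same number, so a high value means the probabilities beat a constant prediction equal to the class base rate. Accuracy reports the quality of the thresholded labels separately.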


In [None]:
ridge_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)),
    ('ridge', Ridge(alpha=10))
])

ridge_pipeline.fit(X_train, y_train)
y_pred_ridge = ridge_pipeline.predict(X_test)
r2_ridge = r2_score(y_test, y_pred_ridge)

print(f"Ridge Regression R2 Score: {r2_ridge:.3f}")

Ridge Regression R2 Score: 0.907



This part of the code uses Ridge Regression, a linear model with regularization to prevent overfitting, particularly useful when dealing with multicollinearity or polynomial features. The pipeline standardizes the data, creates polynomial features to capture non-linear relationships, and applies Ridge Regression with a specified regularization strength (alpha=10).

The model is trained on the training set and evaluated on the test set using the R2 score to measure how well it predicts the target variable (life expectancy). By introducing polynomial features and regularization, the goal is to balance complexity and performance, avoiding overfitting while capturing meaningful interactions.

From this, I’ve learned how Ridge Regression can be effectively combined with feature transformations to handle complex relationships in data. It demonstrates how regularization helps control model complexity, making it both robust and generalizable to unseen data.
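The shrinkage effect of the penalty can be seen in the closed-form ridge solution. The sketch below is a NumPy illustration on synthetic, nearly collinear data, not the notebook's dataset. `ridge_coefficients` is a hypothetical helper. It centers the data so that the intercept is not penalized, which is how `Ridge` behaves with `fit_intercept=True`.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centered data: (X^T X + alpha * I)^-1 X^T y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Synthetic data (assumption): the third column nearly duplicates the second.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 1] + 0.01 * rng.normal(size=50)
y = X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

alphas = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_coefficients(X, y, a))) for a in alphas]

for a, n in zip(alphas, norms):
    print(f"alpha={a:>5}: ||w|| = {n:.3f}")
```

The coefficient norm decreases as alpha grows, which is the mechanism that keeps a large polynomial expansion from fitting noise.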

In [None]:
lasso_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)),
    ('lasso', Lasso(alpha=0.01, max_iter=5000))
])

lasso_pipeline.fit(X_train, y_train)
y_pred_lasso = lasso_pipeline.predict(X_test)
r2_lasso = r2_score(y_test, y_pred_lasso)

print(f"LASSO Regression R2 Score: {r2_lasso:.3f}")



LASSO Regression R2 Score: 0.908


This section implements LASSO regression, a linear model with L1 regularization, which penalizes the absolute size of the coefficients and thereby acts as a form of feature selection. The pipeline begins by standardizing the data so that all features have a uniform scale, adds polynomial features to capture nonlinear relationships, and then applies LASSO regression with a specified regularization strength (alpha=0.01). This method is particularly useful for simplifying models because it shrinks the coefficients of less influential features to exactly zero.

The model is trained on the training data and tested on the test set. Its performance is measured using the R2 score, which indicates how well the model predicts the target variable (life expectancy).

From this, I’ve learned how LASSO regression is effective for reducing model complexity by performing feature selection, ensuring robustness without overfitting. It also demonstrates the combination of polynomial features and regularization to capture intricate data relationships while maintaining simplicity and interpretability.
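The way LASSO produces exact zeros can be illustrated with the soft-thresholding operator. This is a simplified sketch: with orthonormal features the LASSO solution reduces to soft-thresholding the least-squares coefficients, and coordinate-descent solvers apply the same operator one coefficient at a time. The coefficient values and the threshold `lam` are made up, and `lam` matches scikit-learn's `alpha` only up to scaling conventions.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, and return exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical least-squares coefficients, for illustration only.
ols_coefs = [3.0, -0.4, 0.05, -2.5]
lam = 0.5

lasso_like = [soft_threshold(c, lam) for c in ols_coefs]
print(lasso_like)
```

Small coefficients are removed entirely while large ones are only shrunk. A ridge penalty, by contrast, shrinks coefficients without setting them exactly to zero.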

---


The kernel trick is a method used in machine learning to work in a higher-dimensional feature space without explicitly computing the transformation into it. Instead of calculating the mapping, it uses a kernel function to compute the inner product between data points in the transformed space directly. This enables models such as Kernel Ridge Regression or Support Vector Machines to capture complex, nonlinear patterns in the data.

By using the kernel trick, regression models can better fit complicated relationships within the data, leading to improved prediction accuracy, especially in cases where linear models fall short. It's computationally efficient and avoids the high costs of working in very high-dimensional spaces explicitly.
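As a small worked example of the kernel trick, the sketch below checks that a degree-2 polynomial kernel on 2-D inputs gives the same value as an inner product in an explicit 6-D feature space. Degree 2 and the input points are chosen for illustration and are not the settings used in the grid search above. The function names are hypothetical.

```python
from math import sqrt

def poly_kernel(x, z):
    """Degree-2 polynomial kernel (x . z + 1)^2, computed in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1] + 1.0) ** 2

def explicit_map(x):
    """Explicit 6-D feature map whose inner product equals poly_kernel."""
    x1, x2 = x
    r2 = sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

# Two arbitrary 2-D points (illustrative values).
x, z = (1.0, 2.0), (3.0, -1.0)

k_implicit = poly_kernel(x, z)
k_explicit = sum(a * b for a, b in zip(explicit_map(x), explicit_map(z)))

print(k_implicit, k_explicit)
```

The kernel needs one dot product in two dimensions, while the explicit route has to build six features per point first. For higher degrees or the RBF kernel, whose feature space is infinite-dimensional, only the kernel route is practical.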