<div style='color: #BCA37F;
           background-color: #113946;
           font-size: 200%;
           border-radius:15px;
           text-align:center;
           font-weight:600;
           border-style: solid;
           border-color: dark green;
           font-family: Courier New;'>
Life Expectancy Analysis & Prediction
<a class="anchor" id="1"></a> 

In [None]:
# importing libraries 
import pandas as pd  
import matplotlib.pyplot as plt
import seaborn as sns 
import numpy as np 

In [None]:
from sklearn.impute import KNNImputer 

In [None]:
# to see all columns and ignore warnings if exist 
import warnings 
warnings.filterwarnings('ignore')
#pd.set_option('display.max_rows',None) 
pd.set_option('display.max_columns',None) 

sns.set_palette("crest")

In [None]:
df = pd.read_csv("/kaggle/input/life-expect/data.csv") # reading data

<span style='color: #BCA37F;
           font-size: 250%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
Data overview & Exploration

In [None]:
df.head(10)

In [None]:
df.tail(10)

In [None]:
# drop "Unnamed: 0" column  -> index column 
df.drop(columns=["Unnamed: 0"],inplace=True)

In [None]:
df.info() 

In [None]:
# summary statistics with visual cues on mean, std and median
(
    df.describe().T.style
    .bar(subset=['mean'], color='#205ff2')
    .background_gradient(subset=['std'], cmap='Reds')
    .background_gradient(subset=['50%'], cmap='coolwarm')
)

In [None]:
df.describe(exclude='number').T

In [None]:
df.isna().sum() # count of null values each column 

<span style='color: #BCA37F;
           font-size: 300%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
Cleaning & EDA 

In [None]:
import missingno as msno
msno.matrix(df)
plt.show()

#### _The missing values appear to be missing completely at random (MCAR), so I will fill them with different methods: the median or mean, KNN imputation, and multivariate imputation by chained equations (MICE)._
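
The options named above can be compared on a small toy frame before committing to one. This is only a sketch: the columns `a` and `b` and their values are made up, and scikit-learn's `IterativeImputer` stands in for a MICE-style imputer (it is not the `miceforest` package used later in this notebook).

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Toy data (hypothetical): "b" is roughly 2 * "a", with gaps in both columns.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [2.1, np.nan, 6.2, 7.9, np.nan, 12.1],
})

imputers = {
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=2),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}

# Fit each imputer on the same toy frame and keep the filled result.
filled = {
    name: pd.DataFrame(imp.fit_transform(toy), columns=toy.columns, index=toy.index)
    for name, imp in imputers.items()
}

for name, frame in filled.items():
    print(name)
    print(frame.round(2))
```

The median ignores the relationship between the two columns, while the KNN and iterative imputers use it, which is why they are usually preferred when columns are correlated.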

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Nation 

- Name of each country 

In [None]:
df.Nation.value_counts()

#### _We notice that some nations have 16 observations while others have just 1; we can drop the latter to avoid bias in our model._
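
The cells shown here do not actually perform that drop. One way it could be done is sketched below on a toy frame; the `min_obs` threshold and the toy values are assumptions, not something taken from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for the real data.
toy = pd.DataFrame({
    "Nation": ["A", "A", "A", "B", "C", "C"],
    "Life_Expectancy_Years": [70.1, 70.5, 71.0, 55.2, 64.0, 64.3],
})

min_obs = 2  # assumed threshold: keep nations with at least this many rows

counts = toy["Nation"].value_counts()
keep = counts[counts >= min_obs].index          # nations that pass the threshold
filtered = toy[toy["Nation"].isin(keep)]

print(sorted(filtered["Nation"].unique()))
```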

In [None]:
df.loc[df["Nation"].isna()] 

In [None]:
df = df.dropna(axis=0, subset=["Nation"])  # drop rows with a missing Nation instead of hard-coding index 186

In [None]:
df["Nation"].isna().sum() 

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Survey year 

- Survey_Year : The year in which the data related to the same row was collected

In [None]:
df["Survey_Year"].isna().sum()

In [None]:
df.loc[df["Survey_Year"].isna(),:]

In [None]:
df = df.dropna(axis=0,subset=["Survey_Year"])

In [None]:
df["Survey_Year"].isna().sum()

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Survey_Year"], ax = ax[0])
sns.boxplot(x = df["Survey_Year"], ax= ax[1])
plt.show()

- _There are no outliers in this column_

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Country category 

- Country_Category : The category of the country, either developed or developing

In [None]:
df["Country_Category"].isna().sum()

In [None]:
df.loc[df["Country_Category"].isna(),:]

- We can look these countries up online to find out whether they are developed or developing.
- We can also use `.loc` to see the category recorded for each country.

In [None]:
df.loc[df["Nation"] == "Swaziland",:]

- Swaziland : developing
- Lebanon : developing
- Chad : developing
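
Instead of typing the labels by hand, the category could also be copied from the same nation's other rows. The sketch below shows the idea on a toy frame; the helper `first_mode` and the toy values are hypothetical, and it assumes each nation has at least one row with a known category.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in the category column.
toy = pd.DataFrame({
    "Nation": ["Chad", "Chad", "Chad", "Lebanon", "Lebanon"],
    "Country_Category": ["Developing", np.nan, "Developing", np.nan, "Developing"],
})

def first_mode(s: pd.Series):
    """Most frequent non-null value in the group, or NaN if there is none."""
    modes = s.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan

# Broadcast each nation's most frequent label to all of its rows,
# then use it only where the original value is missing.
nation_label = toy.groupby("Nation")["Country_Category"].transform(first_mode)
toy["Country_Category"] = toy["Country_Category"].fillna(nation_label)

print(toy)
```

This reuses whatever spelling and casing the column already contains, so it cannot introduce a new label by accident.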

In [None]:
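# NOTE: iloc is positional, so these row numbers must be re-checked after any earlier drop.
# The label casing should also match the labels already present in the column.
df.iloc[9, df.columns.get_loc("Country_Category")] = "developing"
df.iloc[54, df.columns.get_loc("Country_Category")] = "Developing"
df.iloc[101, df.columns.get_loc("Country_Category")] = "developing"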

In [None]:
# df.columns.get_loc("Country_Category") # get index of column

In [None]:
df["Country_Category"].isna().sum()

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Mortality Adults

- Mortality_Adults :  represents the rate or number of deaths specifically among the adult population

In [None]:
df["Mortality_Adults"].isna().sum() 

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Mortality_Adults"], ax = ax[0])
sns.boxplot(x = df["Mortality_Adults"], ax= ax[1])
plt.show()

- The Mortality_Adults distribution is right-skewed, so we can fill the missing values with the median, or group by country and use each country's mean.
- These outliers are not real outliers: they lie within the normal range of mortality rates, so we can keep them.
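
The per-country fill described above could look like the sketch below. The notebook ends up dropping these rows instead, so this is only an illustration on made-up values; the fallback to the global median for nations with no known value is my assumption.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); nation "C" has no known value at all.
toy = pd.DataFrame({
    "Nation": ["A", "A", "A", "B", "B", "C"],
    "Mortality_Adults": [100.0, np.nan, 120.0, 300.0, 320.0, np.nan],
})

col = "Mortality_Adults"
nation_mean = toy.groupby("Nation")[col].transform("mean")  # per-row mean of its nation
global_median = toy[col].median()                           # fallback, computed before filling

# Use the nation's mean first, then the global median where the nation has no data.
toy[col] = toy[col].fillna(nation_mean).fillna(global_median)

print(toy)
```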

In [None]:
grouped_df = df.groupby(by='Nation').agg({'Mortality_Adults':'mean'}).sort_values(by='Mortality_Adults',ascending=False)

In [None]:
grouped_df.head(10)

In [None]:
null_Mortality_Adults = df.loc[df["Mortality_Adults"].isna()]

In [None]:
null_Mortality_Adults

- These observations have a lot of missing values across multiple columns, so we can drop them.

In [None]:
null_Mortality_Adults.index.to_list()

In [None]:
df = df.drop(null_Mortality_Adults.index.to_list(),axis=0)

In [None]:
df["Mortality_Adults"].isna().sum() 

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Infant deaths count

- Infant_Deaths_Count : refers to the number of deaths that occurred among infants (babies under one year old).

In [None]:
df["Infant_Deaths_Count"].isna().sum()  

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Infant_Deaths_Count"], ax = ax[0])
sns.boxplot(x = df["Infant_Deaths_Count"], ax= ax[1])
plt.show()

- The Infant_Deaths_Count distribution is heavily right-skewed.
- We don't know whether these outliers are real, so we will explore the rows with infant deaths > 600.

In [None]:
df.loc[df['Infant_Deaths_Count'] > 600]

- __All rows with infant deaths > 600 are from the same country (India), so we will keep them.__
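
The same check can be made without picking the cut-off by eye. The sketch below uses the upper boxplot whisker (Q3 + 1.5 × IQR) as the threshold and counts the extreme rows per nation; the toy values are invented for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values): one nation with very large counts.
toy = pd.DataFrame({
    "Nation": ["India"] * 3 + ["X"] * 5 + ["Y"] * 4,
    "Infant_Deaths_Count": [1500, 1400, 1300, 10, 12, 8, 15, 9, 30, 25, 40, 35],
})

col = "Infant_Deaths_Count"
q1, q3 = toy[col].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)  # upper whisker limit of a default boxplot

# Rows above the whisker, grouped by the nation they come from.
extreme = toy[toy[col] > upper_fence]
print(extreme["Nation"].value_counts())
```

If a single nation accounts for all of the extreme rows, they are more likely to reflect that country's size than a data error.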

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Alcohol Consumption Rate 

- Alcohol_Consumption_Rate : refers to the average consumption of alcohol per person per year 

In [None]:
df["Alcohol_Consumption_Rate"].isna().sum()

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Alcohol_Consumption_Rate"], ax = ax[0])
sns.boxplot(x = df["Alcohol_Consumption_Rate"], ax= ax[1])
plt.show()

- The distribution is skewed but has no outliers, so we can fill the missing values with the mean.
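
A note on `KNNImputer`: when it is given a single column, there is no other feature to measure distance with, so as far as I understand scikit-learn's behaviour it falls back to the column mean. The sketch below illustrates this on made-up values and contrasts it with a two-column call, where the neighbours are actually used.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame (hypothetical values) with one gap in the alcohol column.
toy = pd.DataFrame({
    "Alcohol_Consumption_Rate": [1.0, 2.0, np.nan, 6.0, 9.0],
    "Life_Expectancy_Years": [60.0, 62.0, 75.0, 70.0, 76.0],
})

# Single column: no other feature is available to measure distance,
# so the missing value is filled with the column mean.
single = KNNImputer(n_neighbors=3).fit_transform(toy[["Alcohol_Consumption_Rate"]])

# Two columns: neighbours are chosen by Life_Expectancy_Years,
# and the gap is filled with the average of their alcohol values.
multi = KNNImputer(n_neighbors=2).fit_transform(toy)

print(single[2, 0], toy["Alcohol_Consumption_Rate"].mean())
print(multi[2, 0])
```

So passing several related numeric columns to the imputer is what makes it differ from a plain mean fill.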

In [None]:
grouped_df = df.groupby(by='Nation').agg({'Alcohol_Consumption_Rate':'mean'}).sort_values(by='Alcohol_Consumption_Rate',ascending=False)

In [None]:
grouped_df.head(5)

In [None]:
grouped_df.loc["Czechia", "Alcohol_Consumption_Rate"]  # label-based lookup; positional [0] on a Series is deprecated

In [None]:
imputer = KNNImputer(n_neighbors=3)
df['Alcohol_Consumption_Rate'] = imputer.fit_transform(df[['Alcohol_Consumption_Rate']])


In [None]:
df["Alcohol_Consumption_Rate"].isna().sum()

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Alcohol_Consumption_Rate"], ax = ax[0])
sns.boxplot(x = df["Alcohol_Consumption_Rate"], ax= ax[1])
plt.show()

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Expenditure Percentage GDP 
 

-  _Expenditure on health as a percentage of Gross Domestic Product per capita(%)_

In [None]:
df["Expenditure_Percentage_GDP"].isna().sum()

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Expenditure_Percentage_GDP"], ax = ax[0])
sns.boxplot(x = df["Expenditure_Percentage_GDP"], ax= ax[1])
plt.show()

In [None]:
df.loc[df['Expenditure_Percentage_GDP'] > 12500]

_The distribution is heavily right-skewed and there are outliers, but it looks like rich countries have high expenditure on health, so we will keep them._

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Hepatitis B-Vaccination Coverage 
 

- Hepatitis B (HepB) immunization coverage among 1-year-olds (%)

In [None]:
df["Hepatitis_B_Vaccination_Coverage"].isna().sum()

In [None]:
df.loc[df['Hepatitis_B_Vaccination_Coverage'].isna()].head(10)

In [None]:
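null_pct = df["Hepatitis_B_Vaccination_Coverage"].isna().mean() * 100  # computed instead of hard-coding the count
print(f"percentage of null: {null_pct:.2f}")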

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Hepatitis_B_Vaccination_Coverage"], ax = ax[0])
sns.boxplot(x = df["Hepatitis_B_Vaccination_Coverage"], ax= ax[1])
plt.show()

- About 18% of the values are missing. The column is numeric and has a lot of outliers, so the mean would be distorted. The median is robust to outliers, so the next cell fills the gaps with it; a KNN imputer would be an alternative.

In [None]:
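# assign the result back; fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df["Hepatitis_B_Vaccination_Coverage"] = df["Hepatitis_B_Vaccination_Coverage"].fillna(df["Hepatitis_B_Vaccination_Coverage"].median())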

In [None]:
df["Hepatitis_B_Vaccination_Coverage"].isna().sum()

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Hepatitis_B_Vaccination_Coverage"], ax = ax[0])
sns.boxplot(x = df["Hepatitis_B_Vaccination_Coverage"], ax= ax[1])
plt.show()

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Measles Infection Count
 

- Measles - number of reported cases per 1000 population

In [None]:
df["Measles_Infection_Count"].isna().sum()

In [None]:
df.loc[df['Measles_Infection_Count'].isna()].head(10)

In [None]:
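# assign the result back; fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df["Measles_Infection_Count"] = df["Measles_Infection_Count"].fillna(df["Measles_Infection_Count"].median())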

In [None]:
df["Measles_Infection_Count"].isna().sum()

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Body Mass Index_Avg
 


- body mass index (BMI) is a person's weight in kilograms divided by the square of height in meters

In [None]:
df['Body_Mass_Index_Avg'].isnull().sum() 

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Body_Mass_Index_Avg"], ax = ax[0])
sns.boxplot(x = df["Body_Mass_Index_Avg"], ax= ax[1])
plt.show()

In [None]:
! pip install miceforest 

In [None]:
from miceforest import ImputationKernel

# Create an ImputationKernel instance on the numeric columns
kernel = ImputationKernel(
    data=df.select_dtypes('number'),
    save_all_iterations=True,
    random_state=2003
)

# Run the MICE algorithm for 3 iterations 
kernel.mice(3)

# Get the completed data
imputed_df = kernel.complete_data(0)

In [None]:
df["Body_Mass_Index_Avg"] = imputed_df["Body_Mass_Index_Avg"]

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Body_Mass_Index_Avg"], ax = ax[0])
sns.boxplot(x = df["Body_Mass_Index_Avg"], ax= ax[1])
plt.show()

In [None]:
df['Body_Mass_Index_Avg'].isnull().sum() 

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Polio Vaccination Coverage



- Polio_Vaccination_Coverage : Pol3 immunization coverage among 1-year-olds (%)

In [None]:
df["Polio_Vaccination_Coverage"].isna().sum()

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Polio_Vaccination_Coverage"], ax = ax[0])
sns.boxplot(x = df["Polio_Vaccination_Coverage"], ax= ax[1])
plt.show()

In [None]:
df["Polio_Vaccination_Coverage"] = imputed_df["Polio_Vaccination_Coverage"]

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Polio_Vaccination_Coverage"], ax = ax[0])
sns.boxplot(x = df["Polio_Vaccination_Coverage"], ax= ax[1])
plt.show()

In [None]:
df["Polio_Vaccination_Coverage"].isna().sum()

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Total Health Expenditure



- Total_Health_Expenditure : refers to the total expenditure on health as a percentage of total government expenditure

In [None]:
df["Total_Health_Expenditure"].isna().sum()

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Total_Health_Expenditure"], ax = ax[0])
sns.boxplot(x = df["Total_Health_Expenditure"], ax= ax[1])
plt.show()

In [None]:
df["Total_Health_Expenditure"] = imputed_df["Total_Health_Expenditure"]

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Total_Health_Expenditure"], ax = ax[0])
sns.boxplot(x = df["Total_Health_Expenditure"], ax= ax[1])
plt.show()

In [None]:
df["Total_Health_Expenditure"].isna().sum()

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Diphtheria Vaccination Coverage



- Diphtheria_Vaccination_Coverage : DTP3 immunization coverage among 1-year-olds (%) 

In [None]:
df["Diphtheria_Vaccination_Coverage"].isna().sum()

In [None]:
df["Diphtheria_Vaccination_Coverage"] = imputed_df["Diphtheria_Vaccination_Coverage"] 

In [None]:
df["Diphtheria_Vaccination_Coverage"].isna().sum()

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 HIV AIDS Prevalence Rate



- HIV_AIDS_Prevalence_Rate : HIV/AIDS prevalence, adult (% ages 15-49)

In [None]:
df["HIV_AIDS_Prevalence_Rate"].isna().sum()

- no null values

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["HIV_AIDS_Prevalence_Rate"], ax = ax[0],log_scale=True)
sns.boxplot(x = df["HIV_AIDS_Prevalence_Rate"], ax= ax[1])
plt.show()

- The distribution is heavily right-skewed and there are outliers, but we will keep them because they are within the normal range of HIV/AIDS prevalence among adults (% ages 15-49).

In [None]:
df.loc[df["HIV_AIDS_Prevalence_Rate"] > 10].head(10)

- It is high in poor countries with low GDP and low expenditure on health.

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Gross Domestic Product



- Gross_Domestic_Product : refers to the total value of goods produced and services provided in a country during one year.

In [None]:
df["Gross_Domestic_Product"].isna().sum()

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Gross_Domestic_Product"], ax = ax[0],log_scale=True)
sns.boxplot(x = df["Gross_Domestic_Product"], ax= ax[1])
plt.show()

In [None]:
df["Gross_Domestic_Product"] = imputed_df["Gross_Domestic_Product"]

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Gross_Domestic_Product"], ax = ax[0],log_scale=True)
sns.boxplot(x = df["Gross_Domestic_Product"], ax= ax[1])
plt.show()

In [None]:
df["Gross_Domestic_Product"].isna().sum()

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Total Population

- Total_Population : refers to the total number of people living in a country at a particular year

In [None]:
df["Total_Population"].isna().sum()

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Total_Population"], ax = ax[0],log_scale=True)
sns.boxplot(x = df["Total_Population"], ax= ax[1])
plt.show()

In [None]:
df["Total_Population"] = imputed_df["Total_Population"]

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Total_Population"], ax = ax[0],log_scale=True)
sns.boxplot(x = df["Total_Population"], ax= ax[1])
plt.show()

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Thinness

- Thinness : refers to the percentage of children under five years of age who are underweight

In [None]:
df["Thinness"].isna().sum()

In [None]:
df["Thinness"] = imputed_df["Thinness"]

In [None]:
df["Thinness"].isna().sum()

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
🔘 Life Expectancy Years

- Life_Expectancy_Years : refers to the average number of years a newborn is expected to live if current mortality rates continue to apply

In [None]:
df["Life_Expectancy_Years"].isna().sum()

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 5))
sns.histplot(df["Life_Expectancy_Years"], ax = ax[0])
sns.boxplot(x = df["Life_Expectancy_Years"], ax= ax[1])
plt.show()

- Everything looks good in this column.

In [None]:
df.isna().sum()

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(df.select_dtypes('number').corr(),annot=True,cmap='Greens')
plt.show()

- _Here we notice that "Expenditure_Percentage_GDP" and "Gross_Domestic_Product" are highly correlated, so we can drop one of them to avoid multicollinearity._
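
Reading pairs off a heatmap gets harder as the number of columns grows. As a sketch, the highly correlated pairs can also be listed programmatically; the toy columns and the `threshold` of 0.8 below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Toy frame (hypothetical): "health_spend" is built to track "gdp".
toy = pd.DataFrame({
    "gdp": x,
    "health_spend": x * 0.9 + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

threshold = 0.8  # assumed cut-off for "highly correlated"
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > threshold].sort_values(ascending=False)

print(high_pairs)
```

Which column of a pair to drop is still a judgment call; the list only tells you where to look.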

In [None]:
df.drop(columns=["Gross_Domestic_Product"],inplace=True)

In [None]:
df.hist(bins=10, figsize=(16,16))
plt.suptitle("Data Distribution of all the columns")
plt.show()

<span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
Bivariate Analysis

- __Let's look at the relationship between life expectancy and the columns that are highly correlated with it.__

In [None]:
# Mortality_Adults vs. Life_Expectancy_Years
# jointplot creates its own figure, so the size is set with `height`
sns.jointplot(x=df["Mortality_Adults"], y=df["Life_Expectancy_Years"], kind="hex", color="r", height=5)
plt.show()

In [None]:
# Body_Mass_Index_Avg vs. Life_Expectancy_Years
# jointplot creates its own figure, so the size is set with `height`
sns.jointplot(x=df["Body_Mass_Index_Avg"], y=df["Life_Expectancy_Years"], kind="hex", color="g", height=5)
plt.show()

In [None]:
# Thinness vs. Life_Expectancy_Years
# jointplot creates its own figure, so the size is set with `height`
sns.jointplot(x=df["Thinness"], y=df["Life_Expectancy_Years"], kind="hex", color="g", height=5)
plt.show()

  <span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
Preprocessing & Modeling



In [None]:
# split data to independent and dependent variables 

X = df.copy().drop(columns=["Life_Expectancy_Years","Nation"])
y = df["Life_Expectancy_Years"] 

In [None]:
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test=train_test_split(X,y,test_size=0.2,random_state=42,shuffle=True) # split data to train and test 

- _Based on the EDA, I will transform the columns according to their distributions and outliers._
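
To see why the transform depends on the shape of the distribution, the sketch below measures skewness before and after on synthetic data: `np.log1p` pulls in a long right tail, and `np.square` stretches out a left tail. The two series are generated for illustration and are not columns from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical right-skewed feature (log-normal), similar in shape to count-like columns.
raw = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))
logged = np.log1p(raw)

# Hypothetical left-skewed percentage, similar in shape to coverage columns.
coverage = pd.Series(100 - rng.lognormal(mean=2.0, sigma=0.6, size=1000)).clip(lower=0)
squared = np.square(coverage)

print("right-skewed:", raw.skew(), "-> after log1p:", logged.skew())
print("left-skewed:", coverage.skew(), "-> after square:", squared.skew())
```

Tree-based models are largely insensitive to these monotonic transforms, so they matter mostly for the linear model in the comparison.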

In [None]:
from sklearn.pipeline import make_pipeline 
from sklearn.preprocessing import FunctionTransformer 
from sklearn.preprocessing import MinMaxScaler 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import  make_column_transformer 

log_num_pipeline = make_pipeline(FunctionTransformer(np.log1p))


sqr_num_pipeline = make_pipeline(FunctionTransformer(np.square)) 

normal_num_pipeline = make_pipeline(MinMaxScaler()) 

cat_pipeline = make_pipeline(OneHotEncoder()) 



preprocessor = make_column_transformer(
                                        (log_num_pipeline, ["Mortality_Adults", "Infant_Deaths_Count", "Alcohol_Consumption_Rate", "Expenditure_Percentage_GDP", "Measles_Infection_Count", "HIV_AIDS_Prevalence_Rate", "Total_Population", "Thinness"]),
                                        (sqr_num_pipeline, ["Hepatitis_B_Vaccination_Coverage", "Polio_Vaccination_Coverage", "Diphtheria_Vaccination_Coverage"]),
                                        (normal_num_pipeline, ["Body_Mass_Index_Avg", "Total_Health_Expenditure"]),
                                        (cat_pipeline, ["Country_Category", "Survey_Year"])
                                        
                                        )    
                        

In [None]:
preprocessor

In [None]:
x_train_preprocessed = preprocessor.fit_transform(x_train) 

In [None]:
x_test_preprocessed = preprocessor.transform(x_test) 

In [None]:
pd.DataFrame(x_train_preprocessed).head(5)

In [None]:
pd.DataFrame(x_test_preprocessed).head(5) 

In [None]:
print(x_train_preprocessed.shape)
print(x_test_preprocessed.shape )
print(y_train.shape)
print(y_test.shape)

  <span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
Model Selection 



 - I will use RMSE to evaluate the model 
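
As a reminder of what the metric measures, RMSE is the square root of the mean squared difference between the true and predicted values, so it is expressed in the same unit as the target (years). A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([70.0, 65.0, 80.0, 55.0])  # hypothetical life expectancies
y_hat = np.array([72.0, 64.0, 77.0, 59.0])   # hypothetical predictions

# RMSE by hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same value from scikit-learn's MSE.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```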

In [None]:
! pip install lazypredict

In [None]:
from lazypredict.Supervised import LazyRegressor  # to see all models and their scores as a summary

In [None]:
reg = LazyRegressor(verbose=0,ignore_warnings=False, custom_metric=None) 
models,predictions = reg.fit(x_train_preprocessed, x_test_preprocessed, y_train, y_test)

In [None]:
print(models)

- _I will use cross-validation to guard against overfitting and to make sure that the model generalizes._
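
One possible refinement, sketched below on synthetic data, is to put the preprocessing and the model in a single pipeline so that scalers and encoders are re-fitted inside every fold and never see the validation rows. The toy columns `bmi` and `category`, the target formula, and the name `toy_preprocessor` are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

rng = np.random.default_rng(0)
n = 120
# Synthetic features and target (hypothetical relationship).
toy_X = pd.DataFrame({
    "bmi": rng.uniform(15, 60, size=n),
    "category": rng.choice(["Developed", "Developing"], size=n),
})
toy_y = 50 + 0.3 * toy_X["bmi"] + 5 * (toy_X["category"] == "Developed") + rng.normal(size=n)

toy_preprocessor = make_column_transformer(
    (MinMaxScaler(), ["bmi"]),
    (OneHotEncoder(handle_unknown="ignore"), ["category"]),
)

# Preprocessing + model in one estimator: each CV fold fits both on its own training part.
model = make_pipeline(toy_preprocessor, ExtraTreesRegressor(n_estimators=50, random_state=0))

scores = cross_val_score(model, toy_X, toy_y, cv=5, scoring="neg_root_mean_squared_error")
print("mean CV RMSE:", -scores.mean())
```

With transforms like `log1p` the difference is nil, but for fitted steps such as `MinMaxScaler` it keeps the validation folds truly unseen.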

In [None]:
from sklearn.ensemble import ExtraTreesRegressor                          
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
models = [
    ExtraTreesRegressor(),
    RandomForestRegressor(),
    LinearRegression(),
    HistGradientBoostingRegressor(),
    XGBRegressor(),
    GradientBoostingRegressor()
]


for model in models:
    cv_scores = cross_val_score(model, x_train_preprocessed, y_train, cv=5, scoring='neg_root_mean_squared_error')
    print(f"{model.__class__.__name__} cv scores: {-cv_scores.mean()}")



 - _The best model from cross-validation is ExtraTreesRegressor, so I will use grid search to tune its hyperparameters._ 

  <span style='color: #BCA37F;
           font-size: 200%;
           border-radius:10px;
           text-align:left;
           font-weight:600;
           padding-left: 20px;
           padding-right:20px;
           font-family: "Courier New";'>
Hyperparameter Tuning



In [None]:
from sklearn.model_selection import GridSearchCV 

extra_trees_model = ExtraTreesRegressor() 

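parameters = {
    'n_estimators': [50, 100, 200],             # Number of trees in the forest
    'max_features': [1.0, 'sqrt', 'log2'],      # Features considered per split; 1.0 = all ('auto' was removed in scikit-learn 1.3)
    'max_depth': [None, 10, 20, 30],             # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],             # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],               # Minimum number of samples required at a leaf node
    'random_state': [42],                        # Fixed seed for reproducibility
}

# Score with RMSE so that tuning uses the same metric as model selection
grid_search = GridSearchCV(extra_trees_model, parameters, cv=5, scoring='neg_root_mean_squared_error')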

grid_search.fit(x_train_preprocessed, y_train) 

best_params = grid_search.best_params_
print("Best Parameters:", best_params)

In [None]:
extra_trees_model = ExtraTreesRegressor(**best_params) # ** -> unpacking 

extra_trees_model.fit(x_train_preprocessed, y_train) # training model

y_pred = extra_trees_model.predict(x_test_preprocessed) # prediction 

In [None]:
# evaluation in test data 
from sklearn.metrics import mean_squared_error 

print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred))) 

In [None]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred) # R-squared: share of the target's variance explained by the model
print("R-squared:", r2)


**Thanks for reading my notebook. I hope you enjoyed it. If you have any questions or suggestions, please leave them in the comments.**