Reading the data and checking for missing values

In [24]:
import pandas as pd
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C


# Load the life-expectancy dataset
df = pd.read_csv('Life Expectancy Data.csv')

# Keep only the rows that contain at least one missing value
missing_data = df[df.isnull().any(axis=1)]

print(missing_data)


       Country  Year      Status  Life expectancy   Adult Mortality  \
32     Algeria  2015  Developing              75.6             19.0   
44     Algeria  2003  Developing              71.7            146.0   
45     Algeria  2002  Developing              71.6            145.0   
46     Algeria  2001  Developing              71.4            145.0   
47     Algeria  2000  Developing              71.3            145.0   
...        ...   ...         ...               ...              ...   
2918    Zambia  2003  Developing              46.4             64.0   
2919    Zambia  2002  Developing              45.5             69.0   
2920    Zambia  2001  Developing              44.6            611.0   
2921    Zambia  2000  Developing              43.8            614.0   
2922  Zimbabwe  2015  Developing              67.0            336.0   

      infant deaths  Alcohol  percentage expenditure  Hepatitis B  Measles   \
32               21      NaN                0.000000         95.0   

In [25]:
missing_counts = df.isnull().sum()

print("Number of missing values per column:")
print(missing_counts)


Number of missing values per column:
Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
 BMI                                34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
 HIV/AIDS                            0
GDP                                448
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64
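
The counts above also show that several column names carry stray leading or trailing spaces (for example ` BMI ` and ` thinness  1-19 years`), which makes later lookups by name error-prone. A minimal sketch of normalizing such headers, shown on a small made-up frame rather than on the real file (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame that mimics the padded headers seen in the output above.
toy = pd.DataFrame({
    " BMI ": [19.1, None],
    "Life expectancy ": [65.0, 59.9],
    " thinness  1-19 years": [17.2, 17.5],
})

# Strip outer whitespace, then collapse internal runs of spaces to a single space.
toy.columns = toy.columns.str.strip().str.replace(r"\s+", " ", regex=True)

print(list(toy.columns))
```

If the same two calls were applied to `df.columns` right after `read_csv`, the column names printed by the later cells would appear without the padding.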


Standardizing the data

In [26]:
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
scaled_data = scaler.fit_transform(df.select_dtypes(include=[np.number]))  # Only scale numeric data
scaled_df = pd.DataFrame(scaled_data, columns=df.select_dtypes(include=[np.number]).columns)
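
Scaling before imputation is possible because `StandardScaler` disregards NaNs when it computes the mean and variance and leaves them in place when transforming. A small check on made-up numbers (the array and the name `scaler_demo` are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single column with one missing entry.
X = np.array([[1.0], [2.0], [np.nan], [3.0]])

scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(X)

# The statistics come from the observed values only (1, 2 and 3).
print(scaler_demo.mean_)

# The missing entry is still NaN after scaling.
print(np.isnan(X_scaled).ravel())
```

This is why `scaled_df` still contains the same missing positions as `df`, which the imputation loop then fills.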




We go through the columns one at a time and check whether each has missing values. For a column that does, the rows where it is missing become the test set and the rows where it is present become the training set. Because the model cannot be fitted on feature rows that contain NaN, we drop any training rows with missing values in the other columns. The model is then fitted and its predictions are used to fill in the missing values. (Running the model takes a while.)

In [27]:
for col in scaled_df.columns:
    if scaled_df[col].isnull().sum() > 0:  # If the column has missing values
        print(f"Imputing missing values for column: {col}")
        
        # Separate rows with and without missing values
        train_data = scaled_df[scaled_df[col].notnull()]
        test_data = scaled_df[scaled_df[col].isnull()]
        
        # Use other columns as features
        X_train = train_data.drop(columns=[col]).values
        y_train = train_data[col].values
        X_test = test_data.drop(columns=[col]).values
        
        # Handle missing values in X_train and X_test
        valid_train_rows = ~np.isnan(X_train).any(axis=1)  # Remove rows with NaN in X_train
        X_train = X_train[valid_train_rows]
        y_train = y_train[valid_train_rows]
        
        # Temporarily fill NaNs in X_test with the training column means
        # (X_train has no NaNs at this point; np.where broadcasts the means across rows)
        col_means = X_train.mean(axis=0)
        X_test = np.where(np.isnan(X_test), col_means, X_test)
        
        # Define the Gaussian Process model
        kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-3, 1e3))
        gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, alpha=1e-2)
        
        # Fit the model
        gpr.fit(X_train, y_train)
        
        # Predict the missing values
        y_pred, y_std = gpr.predict(X_test, return_std=True)
        
        # Fill the missing values
        scaled_df.loc[scaled_df[col].isnull(), col] = y_pred

Imputing missing values for column: Life expectancy 
Imputing missing values for column: Adult Mortality
Imputing missing values for column: Alcohol
Imputing missing values for column: Hepatitis B
Imputing missing values for column:  BMI 
Imputing missing values for column: Polio
Imputing missing values for column: Total expenditure
Imputing missing values for column: Diphtheria 
Imputing missing values for column: GDP
Imputing missing values for column: Population
Imputing missing values for column:  thinness  1-19 years
Imputing missing values for column:  thinness 5-9 years
Imputing missing values for column: Income composition of resources
Imputing missing values for column: Schooling
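
The loop above requests `y_std` from `predict` but never uses it. As a sketch of how the predictive standard deviation could be used to flag uncertain imputations, the block below applies the same kernel setup to synthetic data. The data, the `threshold` value, and the flagging rule are assumptions for illustration, not part of the original analysis; `n_restarts_optimizer` is left at its default to keep the example fast.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

rng = np.random.default_rng(0)

# Synthetic stand-in for standardized features and one target column.
X_train = rng.normal(size=(40, 2))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=40)
X_test = np.array([[0.0, 0.0],    # inside the training cloud
                   [8.0, 8.0]])   # far from every training point

# Same kernel structure as in the imputation loop.
kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-3, 1e3))
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-2, random_state=0)
gpr.fit(X_train, y_train)

# Predictive mean and standard deviation for each test row.
y_pred, y_std = gpr.predict(X_test, return_std=True)

# Hypothetical rule: flag predictions whose standard deviation exceeds a threshold.
threshold = 0.5
uncertain = y_std > threshold
print(y_pred, y_std, uncertain)
```

In the imputation loop, a mask like `uncertain` could be stored per column so that imputed values with a wide predictive spread can be reviewed separately.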


In [None]:
imputed_data = scaler.inverse_transform(scaled_df)
imputed_df = pd.DataFrame(imputed_data, columns=df.select_dtypes(include=[np.number]).columns)

# Replace the original numeric columns with imputed values
for col in imputed_df.columns:
    df[col] = imputed_df[col]



missing_counts = df.isnull().sum()

print("Number of missing values per column:")
print(missing_counts)

Number of missing values per column:
Country                            0
Year                               0
Status                             0
Life expectancy                    0
Adult Mortality                    0
infant deaths                      0
Alcohol                            0
percentage expenditure             0
Hepatitis B                        0
Measles                            0
 BMI                               0
under-five deaths                  0
Polio                              0
Total expenditure                  0
Diphtheria                         0
 HIV/AIDS                          0
GDP                                0
Population                         0
 thinness  1-19 years              0
 thinness 5-9 years                0
Income composition of resources    0
Schooling                          0
dtype: int64
