# Veri İşleme - Eksik Veri ve Dengesiz Sınıf Dağılımı

<p>Bu notebookta işlenen konular hakkında işe yarar bağlantılar:</p><br />
<a href="https://www.kaggle.com/jackdaoud/marketing-data">Kullanılan Veri Kümesi</a><br />
<a href="https://scikit-learn.org/stable/modules/impute.html">Scikit-learn Eksik Veri Doldurma Dokümanı</a><br />
<a href="https://imbalanced-learn.readthedocs.io/en/stable/api.html">Dengesiz Sınıf Dağılımı için imblearn API Dokümantasyonu</a>

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from datetime import datetime
from copy import deepcopy

In [2]:
data = pd.read_csv("marketing_data.csv")
y = np.array(data["Response"]).astype(np.float16)
X = data.drop("Response", 1)
X = X.drop("ID", 1)

In [3]:
X.shape

(2240, 26)

In [4]:
X.isna().sum()

Year_Birth              0
Education               0
Marital_Status          0
 Income                24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Country                 0
dtype: int64

In [5]:
def remove_value(X):
    # Randomly remove values from the data
    a, b = X.shape
    n_missing = int(a * b * 0.1)
    print(f"Removing {n_missing} values randomly.")
    for i in range(n_missing):
        coor = (np.random.rand(2) * [*X.shape]).astype(np.int16)
        coor = np.clip(coor, a_min=0, a_max=max(X.shape) - 1)
        X.iloc[coor[0], coor[1]] = np.nan
    return X

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [7]:
X_train_orig = deepcopy(X_train)
X_test_orig = deepcopy(X_test)

In [8]:
X_train = remove_value(X_train)
X_test = remove_value(X_test)

Removing 4659 values randomly.


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)


Removing 1164 values randomly.


In [9]:
X_train.isna().sum()

Year_Birth             184
Education              170
Marital_Status         154
 Income                173
Kidhome                162
Teenhome               159
Dt_Customer            154
Recency                169
MntWines               186
MntFruits              154
MntMeatProducts        177
MntFishProducts        179
MntSweetProducts       168
MntGoldProds           171
NumDealsPurchases      186
NumWebPurchases        161
NumCatalogPurchases    187
NumStorePurchases      163
NumWebVisitsMonth      151
AcceptedCmp3           159
AcceptedCmp4           181
AcceptedCmp5           181
AcceptedCmp1           173
AcceptedCmp2           205
Complain               178
Country                159
dtype: int64

In [10]:
X_test.isna().sum()

Year_Birth             41
Education              50
Marital_Status         42
 Income                51
Kidhome                42
Teenhome               44
Dt_Customer            49
Recency                56
MntWines               29
MntFruits              41
MntMeatProducts        42
MntFishProducts        44
MntSweetProducts       38
MntGoldProds           38
NumDealsPurchases      41
NumWebPurchases        59
NumCatalogPurchases    36
NumStorePurchases      41
NumWebVisitsMonth      41
AcceptedCmp3           35
AcceptedCmp4           47
AcceptedCmp5           43
AcceptedCmp1           31
AcceptedCmp2           49
Complain               40
Country                38
dtype: int64

#### İlk İzlenimler

In [12]:
pos_res = y_train[y_train == 1].shape[0]
neg_res = y_train[y_train == 0].shape[0]
print(f"Number of positive responses: {pos_res}")
print(f"Number of negative responses: {neg_res}")
print(f"Accuracy if model predicts positive everytime: %{100 * pos_res/(pos_res+neg_res)}")
print(f"Accuracy if model predicts negative everytime: %{100 * neg_res/(pos_res+neg_res)}")

Number of positive responses: 264
Number of negative responses: 1528
Accuracy if model predicts positive everytime: %14.732142857142858
Accuracy if model predicts negative everytime: %85.26785714285714


In [13]:
X_train.isna().sum()

Year_Birth             184
Education              170
Marital_Status         154
 Income                173
Kidhome                162
Teenhome               159
Dt_Customer            154
Recency                169
MntWines               186
MntFruits              154
MntMeatProducts        177
MntFishProducts        179
MntSweetProducts       168
MntGoldProds           171
NumDealsPurchases      186
NumWebPurchases        161
NumCatalogPurchases    187
NumStorePurchases      163
NumWebVisitsMonth      151
AcceptedCmp3           159
AcceptedCmp4           181
AcceptedCmp5           181
AcceptedCmp1           173
AcceptedCmp2           205
Complain               178
Country                159
dtype: int64

In [14]:
X_train.dtypes

Year_Birth             float64
Education               object
Marital_Status          object
 Income                 object
Kidhome                float64
Teenhome               float64
Dt_Customer             object
Recency                float64
MntWines               float64
MntFruits              float64
MntMeatProducts        float64
MntFishProducts        float64
MntSweetProducts       float64
MntGoldProds           float64
NumDealsPurchases      float64
NumWebPurchases        float64
NumCatalogPurchases    float64
NumStorePurchases      float64
NumWebVisitsMonth      float64
AcceptedCmp3           float64
AcceptedCmp4           float64
AcceptedCmp5           float64
AcceptedCmp1           float64
AcceptedCmp2           float64
Complain               float64
Country                 object
dtype: object

In [15]:
X_train.columns[X_train.dtypes == object]

Index(['Education', 'Marital_Status', ' Income ', 'Dt_Customer', 'Country'], dtype='object')

In [18]:
X_train[X_train.columns[X_train.dtypes == object]].nunique()

Education            5
Marital_Status       8
 Income           1477
Dt_Customer        623
Country              8
dtype: int64

#### Veriyi Analize Uygun Hale Getirmek

In [20]:
def ordinal_encode(X, cats):
    X = deepcopy(X)
    for cat in cats:
        X[cat][X[cat].isna()] = "nan"
        
    enc=OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
    enc.fit(X[cats])
    
    transformed = enc.transform(X[cats])
    
    for i in range(len(cats)):
        nan_pos = np.where(enc.categories_[i] == "nan")
        transformed[[np.where(transformed[:, i] == nan_pos)], i] = np.nan
        
    return transformed, enc

In [21]:
cats = ["Education", "Marital_Status", "Country"]
a, b = ordinal_encode(X_train, cats)
for i, cat in enumerate(cats):
    X_train[cat] = a[:, i]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


In [22]:
for cat in cats:
    X_test[cat][X_test[cat].isna()] = "nan"
c = b.transform(X_test[cats])
for i in range(len(cats)):
    nan_pos = np.where(b.categories_[i] == "nan")
    c[[np.where(c[:, i] == nan_pos)], i] = np.nan
for i, cat in enumerate(cats):
    X_test[cat] = c[:, i]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._where(~key, value, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [23]:
for cat in cats:
    X_train_orig[cat][X_train_orig[cat].isna()] = "nan"
d = b.transform(X_train_orig[cats])
for i in range(len(cats)):
    nan_pos = np.where(b.categories_[i] == "nan")
    d[[np.where(d[:, i] == nan_pos)], i] = np.nan
for i, cat in enumerate(cats):
    X_train_orig[cat] = d[:, i]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [24]:
X_train

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
873,1963.0,,,"$33,378.00",1.0,1.0,,,33.0,6.0,...,,4.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,
2161,1984.0,2.0,3.0,"$31,761.00",1.0,0.0,4/5/14,96.0,19.0,1.0,...,0.0,4.0,,0.0,0.0,0.0,0.0,0.0,0.0,
496,1972.0,2.0,4.0,"$38,808.00",1.0,0.0,,,125.0,17.0,...,1.0,4.0,8.0,1.0,0.0,0.0,0.0,,0.0,6.0
997,1957.0,2.0,4.0,,1.0,,5/27/14,45.0,7.0,,...,0.0,2.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
1793,1961.0,1.0,3.0,"$28,249.00",0.0,0.0,6/15/14,80.0,1.0,9.0,...,0.0,3.0,6.0,,0.0,0.0,0.0,0.0,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,1963.0,2.0,6.0,"$34,213.00",1.0,1.0,9/7/12,2.0,50.0,4.0,...,1.0,2.0,9.0,0.0,,0.0,0.0,0.0,0.0,5.0
1643,1958.0,0.0,4.0,"$85,485.00",0.0,0.0,6/21/14,73.0,630.0,26.0,...,6.0,6.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
1984,1985.0,2.0,4.0,"$29,760.00",,0.0,8/29/12,87.0,64.0,4.0,...,1.0,4.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
567,1973.0,2.0,3.0,,1.0,1.0,5/7/14,25.0,69.0,2.0,...,0.0,4.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,


In [29]:
X_test

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
1350,1943.0,,,,0.0,0.0,7/6/13,,735.0,40.0,...,3.0,13.0,6.0,0.0,,0.0,0.0,,0.0,
1288,1975.0,2.0,4.0,"$84,196.00",0.0,1.0,6/3/13,56.0,215.0,63.0,...,4.0,7.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
669,1960.0,,5.0,"$78,468.00",0.0,0.0,4/9/14,29.0,434.0,22.0,...,7.0,,4.0,0.0,0.0,0.0,1.0,0.0,,7.0
1383,1959.0,4.0,3.0,"$33,762.00",2.0,1.0,7/7/13,,53.0,1.0,...,2.0,2.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
347,1977.0,0.0,5.0,,0.0,0.0,4/30/13,14.0,2.0,12.0,...,1.0,2.0,,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1010,1982.0,2.0,,"$68,627.00",0.0,0.0,,45.0,395.0,15.0,...,3.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
73,1970.0,2.0,,"$27,242.00",1.0,0.0,11/11/12,,,,...,0.0,3.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
1946,1965.0,2.0,3.0,"$44,393.00",1.0,1.0,8/22/13,86.0,24.0,2.0,...,0.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
1835,1964.0,4.0,3.0,"$59,304.00",0.0,1.0,,81.0,418.0,61.0,...,8.0,10.0,5.0,0.0,0.0,,0.0,0.0,0.0,6.0


In [22]:
# X_train = pd.get_dummies(X_train, columns=["Education", "Marital_Status", "Country"], dummy_na=True)
# X_test = pd.get_dummies(X_test, columns=["Education", "Marital_Status", "Country"], dummy_na=True)

In [23]:
# X_train_orig = pd.get_dummies(X_train_orig, columns=["Education", "Marital_Status", "Country"], dummy_na=True)

In [30]:
incomes = [float(inc.lstrip("$").replace(",", "")) for inc in X_train[" Income "] if isinstance(inc, str)]
X_train[" Income "][X_train[" Income "].apply(lambda x: isinstance(x, str))] = incomes
X_train[" Income "] = X_train[" Income "].astype(np.float64)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._where(~key, value, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [31]:
incomes = [float(inc.lstrip("$").replace(",", "")) for inc in X_train_orig[" Income "] if isinstance(inc, str)]
X_train_orig[" Income "][X_train_orig[" Income "].apply(lambda x: isinstance(x, str))] = incomes
X_train_orig[" Income "] = X_train_orig[" Income "].astype(np.float64)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [32]:
X_train

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
873,1963.0,,,33378.0,1.0,1.0,,,33.0,6.0,...,,4.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,
2161,1984.0,2.0,3.0,31761.0,1.0,0.0,4/5/14,96.0,19.0,1.0,...,0.0,4.0,,0.0,0.0,0.0,0.0,0.0,0.0,
496,1972.0,2.0,4.0,38808.0,1.0,0.0,,,125.0,17.0,...,1.0,4.0,8.0,1.0,0.0,0.0,0.0,,0.0,6.0
997,1957.0,2.0,4.0,,1.0,,5/27/14,45.0,7.0,,...,0.0,2.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
1793,1961.0,1.0,3.0,28249.0,0.0,0.0,6/15/14,80.0,1.0,9.0,...,0.0,3.0,6.0,,0.0,0.0,0.0,0.0,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,1963.0,2.0,6.0,34213.0,1.0,1.0,9/7/12,2.0,50.0,4.0,...,1.0,2.0,9.0,0.0,,0.0,0.0,0.0,0.0,5.0
1643,1958.0,0.0,4.0,85485.0,0.0,0.0,6/21/14,73.0,630.0,26.0,...,6.0,6.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
1984,1985.0,2.0,4.0,29760.0,,0.0,8/29/12,87.0,64.0,4.0,...,1.0,4.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
567,1973.0,2.0,3.0,,1.0,1.0,5/7/14,25.0,69.0,2.0,...,0.0,4.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,


In [33]:
incomes = [float(inc.lstrip("$").replace(",", "")) for inc in X_test[" Income "] if isinstance(inc, str)]
X_test[" Income "][X_test[" Income "].apply(lambda x: isinstance(x, str))] = incomes
X_test[" Income "] = X_test[" Income "].astype(np.float64)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._where(~key, value, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [35]:
X_train["Dt_Customer"] = (datetime.today() - pd.to_datetime(X_train["Dt_Customer"])).dt.days
X_test["Dt_Customer"] = (datetime.today() - pd.to_datetime(X_test["Dt_Customer"])).dt.days
X_train_orig["Dt_Customer"] = (datetime.today() - pd.to_datetime(X_train_orig["Dt_Customer"])).dt.days

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [36]:
X_train.insert(2, "Age", [2021 - birth for birth in X_train["Year_Birth"]])
X_test.insert(2, "Age", [2021 - birth for birth in X_test["Year_Birth"]])
X_train_orig.insert(2, "Age", [2021 - birth for birth in X_train_orig["Year_Birth"]])

In [37]:
X_train.dtypes

Year_Birth             float64
Education              float64
Age                    float64
Marital_Status         float64
 Income                float64
Kidhome                float64
Teenhome               float64
Dt_Customer            float64
Recency                float64
MntWines               float64
MntFruits              float64
MntMeatProducts        float64
MntFishProducts        float64
MntSweetProducts       float64
MntGoldProds           float64
NumDealsPurchases      float64
NumWebPurchases        float64
NumCatalogPurchases    float64
NumStorePurchases      float64
NumWebVisitsMonth      float64
AcceptedCmp3           float64
AcceptedCmp4           float64
AcceptedCmp5           float64
AcceptedCmp1           float64
AcceptedCmp2           float64
Complain               float64
Country                float64
dtype: object

## Korelasyon

In [39]:
corr_mat = X_train.corr()

In [40]:
corr_mat.to_csv("corr_mat.csv")

In [41]:
corr_mat

Unnamed: 0,Year_Birth,Education,Age,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
Year_Birth,1.0,-0.21729,-1.0,-0.058123,-0.171415,0.211125,-0.37865,,-0.022521,-0.126031,...,-0.110843,-0.095175,0.113192,0.055167,-0.063828,0.041124,0.005538,-0.039667,-0.019949,-0.002035
Education,-0.21729,1.0,0.21729,0.006467,0.161462,-0.068455,0.140368,,-0.006241,0.208418,...,0.06638,0.077955,-0.04487,0.005302,0.043719,0.058088,0.006075,0.029399,-0.052456,0.041884
Age,-1.0,0.21729,1.0,0.058123,0.171415,-0.211125,0.37865,,0.022521,0.126031,...,0.110843,0.095175,-0.113192,-0.055167,0.063828,-0.041124,-0.005538,0.039667,0.019949,0.002035
Marital_Status,-0.058123,0.006467,0.058123,1.0,0.019598,-0.027889,0.003408,,0.004135,-0.005199,...,0.031671,0.012867,-0.042,-0.042317,0.035291,0.027965,0.000299,0.025813,-0.014164,0.015153
Income,-0.171415,0.161462,0.171415,0.019598,1.0,-0.517584,0.039021,,-0.001984,0.7044,...,0.695483,0.624901,-0.639746,-0.004337,0.219209,0.397079,0.314149,0.114571,-0.043716,0.011576
Kidhome,0.211125,-0.068455,-0.211125,-0.027889,-0.517584,1.0,-0.050477,,0.021682,-0.495394,...,-0.503314,-0.48506,0.467562,0.012656,-0.160525,-0.221751,-0.154404,-0.080898,0.034003,-0.02324
Teenhome,-0.37865,0.140368,0.37865,0.003408,0.039021,-0.050477,1.0,,0.007299,-0.001481,...,-0.109984,0.057555,0.136089,-0.025432,0.030935,-0.204585,-0.132343,-0.017368,-0.016418,-0.027791
Dt_Customer,,,,,,,,,,,...,,,,,,,,,,
Recency,-0.022521,-0.006241,0.022521,0.004135,-0.001984,0.021682,0.007299,,1.0,0.005208,...,-0.001012,-0.000513,-0.011114,-0.041825,0.023339,-0.012951,-0.023796,0.00297,-0.007054,0.038233
MntWines,-0.126031,0.208418,0.126031,-0.005199,0.7044,-0.495394,-0.001481,,0.005208,1.0,...,0.658616,0.631491,-0.325184,0.054812,0.383909,0.457003,0.356384,0.2123,-0.050219,0.012026


In [43]:
corr_mat["Age"][np.abs(corr_mat["Age"]) > 0.5]

Year_Birth   -1.0
Age           1.0
Name: Age, dtype: float64

## Ölçekleme

Eksik verileri doldurmadan önce ölçeklemek (scaling) oldukça önemli. Bu adım atlandığında tek değişken yöntemleri daha iyi sonuç verebiliyor çünkü çok değişken yöntemleri aynı ölçekte olmayan sütunlardan fazlaca etkilenip daha yanlış kararlar alabiliyor.

In [44]:
sc = StandardScaler()
X_train_orig = pd.DataFrame(sc.fit_transform(X_train_orig), columns=X_train.columns)
X_train = pd.DataFrame(sc.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(sc.transform(X_test), columns=X_test.columns)

## Tek Değişken Yöntemleri 

### Ortalama, Medyan, Mod ve Sabit Değer

In [45]:
from sklearn.impute import SimpleImputer

In [46]:
imputers = list()
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imputers.append(imp_mean)
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
imputers.append(imp_median)
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputers.append(imp_mode)
imp_constant = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
imputers.append(imp_constant)

In [47]:
for imp in imputers:
    imp.fit(X_train)

In [48]:
X_train.mode().iloc[0]

Year_Birth             0.180637
Education             -0.357350
Age                   -0.180637
Marital_Status        -0.679522
 Income               -2.032625
Kidhome               -0.842520
Teenhome              -0.953397
Dt_Customer            0.000000
Recency                0.251008
MntWines              -0.899266
MntFruits             -0.660288
MntMeatProducts       -0.705575
MntFishProducts       -0.680377
MntSweetProducts      -0.658122
MntGoldProds          -0.813450
NumDealsPurchases     -0.694241
NumWebPurchases       -0.742134
NumCatalogPurchases   -0.929365
NumStorePurchases     -0.847171
NumWebVisitsMonth      0.685533
AcceptedCmp3          -0.280356
AcceptedCmp4          -0.290032
AcceptedCmp5          -0.278553
AcceptedCmp1          -0.257944
AcceptedCmp2          -0.115801
Complain              -0.100063
Country                0.689935
Name: 0, dtype: float64

In [49]:
print(X_train.mean())

Year_Birth            -8.020119e-16
Education             -8.568100e-17
Age                    2.582235e-16
Marital_Status         3.034293e-16
 Income                1.153425e-16
Kidhome               -2.062427e-16
Teenhome               8.484742e-17
Dt_Customer            0.000000e+00
Recency               -1.265504e-17
MntWines              -7.777092e-17
MntFruits              1.182747e-17
MntMeatProducts        1.031167e-18
MntFishProducts        1.727625e-17
MntSweetProducts      -1.562106e-17
MntGoldProds           4.588831e-18
NumDealsPurchases     -1.838850e-17
NumWebPurchases        7.174587e-17
NumCatalogPurchases   -4.634576e-17
NumStorePurchases      1.117720e-16
NumWebVisitsMonth     -7.668725e-17
AcceptedCmp3           4.691083e-18
AcceptedCmp4          -4.471215e-16
AcceptedCmp5          -5.711687e-16
AcceptedCmp1           1.725337e-16
AcceptedCmp2          -4.255155e-17
Complain               7.434161e-17
Country                2.383034e-16
dtype: float64


In [50]:
print(X_train.median())

Year_Birth             0.096249
Education             -0.357350
Age                   -0.096249
Marital_Status         0.245917
 Income               -0.039385
Kidhome               -0.842520
Teenhome              -0.953397
Dt_Customer            0.000000
Recency                0.006730
MntWines              -0.394460
MntFruits             -0.459643
MntMeatProducts       -0.447978
MntFishProducts       -0.461610
MntSweetProducts      -0.464132
MntGoldProds          -0.399416
NumDealsPurchases     -0.168569
NumWebPurchases       -0.380058
NumCatalogPurchases   -0.232504
NumStorePurchases     -0.229565
NumWebVisitsMonth      0.272400
AcceptedCmp3          -0.280356
AcceptedCmp4          -0.290032
AcceptedCmp5          -0.278553
AcceptedCmp1          -0.257944
AcceptedCmp2          -0.115801
Complain              -0.100063
Country                0.689935
dtype: float64


In [51]:
X_mean_tr = pd.DataFrame(imp_mean.transform(X_train), columns=X_train.columns)
X_median_tr = pd.DataFrame(imp_median.transform(X_train), columns=X_train.columns)
X_mode_tr = pd.DataFrame(imp_mode.transform(X_train), columns=X_train.columns)
X_const_tr = pd.DataFrame(imp_mode.transform(X_train), columns=X_train.columns)

## Çok Değişken Yöntemleri

### En Yakın Komşu

In [52]:
from sklearn.impute import KNNImputer

In [54]:
imp_knn = KNNImputer(missing_values=np.nan, n_neighbors=5, weights="distance")

In [55]:
imp_knn.fit(X_train)

KNNImputer(weights='distance')

In [56]:
X_knn_tr = pd.DataFrame(imp_knn.transform(X_train), columns=X_train.columns)

In [57]:
X_knn_tr

Unnamed: 0,Year_Birth,Education,Age,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
0,-0.494468,-0.139655,0.494468,0.258017,-0.847915,1.003324,0.880404,0.0,0.109101,-0.808547,...,-0.711679,-0.538368,0.685533,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,-0.155389
1,1.277681,-0.357350,-1.277681,-0.679522,-0.921942,1.003324,-0.953397,0.0,1.646885,-0.849517,...,-0.929365,-0.538368,0.540735,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,0.575067
2,0.265025,-0.357350,-0.265025,0.245917,-0.599327,1.003324,-0.953397,0.0,0.048843,-0.539318,...,-0.580935,-0.538368,1.098665,3.566888,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,0.689935
3,-1.000796,-0.357350,1.000796,0.245917,-1.044455,1.003324,-0.241047,0.0,-0.132858,-0.884634,...,-0.929365,-1.155974,0.685533,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,-1.145393
4,-0.663244,-1.244431,0.663244,-0.679522,-1.082724,-0.842520,-0.953397,0.0,1.088534,-0.902192,...,-0.929365,-0.847171,0.272400,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,-0.686561
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1787,-0.494468,-0.357350,0.494468,2.096794,-0.809688,1.003324,0.880404,0.0,-1.633425,-0.758798,...,-0.580935,-1.155974,1.511797,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,0.231103
1788,-0.916408,-2.131513,0.916408,0.245917,1.537573,-0.842520,-0.953397,0.0,0.844256,0.938519,...,1.161218,0.079239,-1.380129,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,0.689935
1789,1.362069,-0.357350,-1.362069,0.245917,-1.013549,1.003324,-0.953397,0.0,1.332813,-0.717829,...,-0.580935,-0.538368,1.098665,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,-1.145393
1790,0.349413,-0.357350,-0.349413,-0.679522,-0.569955,1.003324,0.880404,0.0,-0.830796,-0.703197,...,-0.929365,-0.538368,-0.140732,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,-0.268631


#### Döngülü Yöntemler (Sklearn'de Henüz Olgun bir İmplementasyonu Yok)

In [58]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [59]:
imp_linear = IterativeImputer(max_iter=100, random_state=0, estimator=BayesianRidge())
imp_tree = IterativeImputer(max_iter = 100, random_state=0, estimator=DecisionTreeRegressor(max_features='sqrt', random_state=0))
# imp_forest = IterativeImputer(max_iter = 25, random_state=0, estimator=RandomForestRegressor(n_estimators=20, random_state=0))

In [60]:
imp_linear = IterativeImputer(max_iter=100, random_state=0, estimator=BayesianRidge())

In [61]:
imp_linear.fit(X_train)

IterativeImputer(estimator=BayesianRidge(), max_iter=100, random_state=0)

In [62]:
imp_tree.fit(X_train)



IterativeImputer(estimator=DecisionTreeRegressor(max_features='sqrt',
                                                 random_state=0),
                 max_iter=100, random_state=0)

In [54]:
# imp_forest.fit(X_train)

In [63]:
X_linear_tr = pd.DataFrame(imp_linear.transform(X_train), columns=X_train.columns)
X_tree_tr = pd.DataFrame(imp_tree.transform(X_train), columns=X_train.columns)
# X_forest_tr = pd.DataFrame(imp_forest.transform(X_train), columns=X_train.columns)

#### Karşılaştırma

In [65]:
print("Metrik: Mean Absolute Error (Ortalama Mutlak Hata)\n")
print(f"Ortalamaya Göre: {((np.abs(X_train_orig - X_mean_tr))).mean().mean()}")
print(f"Medyana Göre: {((np.abs(X_train_orig - X_median_tr))).mean().mean()}")
print(f"Moda Göre: {((np.abs(X_train_orig - X_mode_tr))).mean().mean()}")
print(f"Sabit Değere Göre: {((np.abs(X_train_orig - X_const_tr))).mean().mean()}")
print(f"En Yakın Komşulara Göre: {((np.abs(X_train_orig - X_knn_tr))).mean().mean()}")
print(f"Lineer Regresyona Göre: {((np.abs(X_train_orig - X_linear_tr))).mean().mean()}")
print(f"Karar Ağacına Göre: {((np.abs(X_train_orig - X_tree_tr))).mean().mean()}")
# print(f"Karar Ağaçları Ormanına Göre: {((np.abs(X_train_orig - X_forest_tr))).mean().mean()}")

Metrik: Mean Absolute Error (Ortalama Mutlak Hata)

Ortalamaya Göre: 0.07346204475963544
Medyana Göre: 0.06531456214668258
Moda Göre: 0.07303956739720724
Sabit Değere Göre: 0.07303956739720724
En Yakın Komşulara Göre: 0.050770609849708874
Lineer Regresyona Göre: 0.05608738698276064
Karar Ağacına Göre: 0.05733556274316441


#### Test Kümesinin Dönüştürülmesi

In [67]:
X_test = pd.DataFrame(imp_knn.transform(X_test), columns=X_test.columns)

In [68]:
X_test

Unnamed: 0,Year_Birth,Education,Age,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
0,-2.182228,0.134891,2.182228,-0.328557,0.584225,-0.842520,-0.953397,0.0,-0.489510,1.245792,...,0.115926,2.240861,0.272400,-0.280356,1.880351,-0.278553,-0.257944,1.532488,-0.100063,0.435353
1,0.518189,-0.357350,-0.518189,0.245917,1.478561,-0.842520,0.880404,0.0,0.251008,-0.275941,...,0.464357,0.388042,-0.966996,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,0.689935
2,-0.747632,-0.679529,0.747632,1.171356,1.216330,-0.842520,-0.953397,0.0,-0.691208,0.364943,...,1.509649,0.245753,-0.553864,-0.280356,-0.290032,-0.278553,3.876816,-0.115801,-0.100063,1.148767
3,-0.832020,1.416813,0.832020,-0.679522,-0.830336,2.849167,0.880404,0.0,-1.050969,-0.750019,...,-0.232504,-1.155974,1.098665,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,-1.604226
4,0.686965,-2.131513,-0.686965,1.171356,-1.582113,-0.842520,-0.953397,0.0,-1.214662,-0.899266,...,-0.580935,-1.155974,0.749754,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,-1.604226
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
443,1.108905,-0.357350,-1.108905,0.067329,0.765804,-0.842520,-0.953397,0.0,-0.132858,0.250813,...,0.115926,0.079239,-1.793261,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,0.231103
444,0.096249,-0.357350,-0.096249,0.039698,-1.128825,1.003324,-0.953397,0.0,0.513226,-0.845566,...,-0.929365,-0.847171,1.511797,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,-0.686561
445,-0.325692,-0.357350,0.325692,-0.679522,-0.343642,1.003324,0.880404,0.0,1.297916,-0.834885,...,-0.929365,-0.538368,-0.553864,-0.280356,-0.290032,-0.278553,-0.257944,-0.115801,-0.100063,0.689935
446,-0.410080,1.416813,0.410080,-0.679522,0.338992,-0.842520,0.880404,0.0,1.123431,0.318120,...,1.858080,1.314451,-0.140732,-0.280356,-0.290032,0.412747,-0.257944,-0.115801,-0.100063,0.689935


In [70]:
# Inverse transform to get the originals
pd.DataFrame(sc.inverse_transform(X_test), columns=X_test.columns)

Unnamed: 0,Year_Birth,Education,Age,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
0,1943.0,2.554899e+00,78.0,3.379242,64660.706898,0.0,0.000000,18643.0,34.779845,735.000000,...,3.000000,13.000000,6.00000,0.0,0.580638,0.000000,6.938894e-18,0.188349,0.0,5.445153
1,1975.0,2.000000e+00,46.0,4.000000,84196.000000,0.0,1.000000,18643.0,56.000000,215.000000,...,4.000000,7.000000,3.00000,0.0,0.000000,0.000000,6.938894e-18,0.000000,0.0,6.000000
2,1960.0,1.636810e+00,61.0,5.000000,78468.000000,0.0,0.000000,18643.0,29.000000,434.000000,...,7.000000,6.539224,4.00000,0.0,0.000000,0.000000,1.000000e+00,0.000000,0.0,7.000000
3,1959.0,4.000000e+00,62.0,3.000000,33762.000000,2.0,1.000000,18643.0,18.690760,53.000000,...,2.000000,2.000000,8.00000,0.0,0.000000,0.000000,6.938894e-18,0.000000,0.0,1.000000
4,1977.0,4.440892e-16,44.0,5.000000,17340.680429,0.0,0.000000,18643.0,14.000000,2.000000,...,1.000000,2.000000,7.15545,0.0,0.000000,0.000000,6.938894e-18,0.000000,0.0,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
443,1982.0,2.000000e+00,39.0,3.807023,68627.000000,0.0,0.000000,18643.0,45.000000,395.000000,...,3.000000,6.000000,1.00000,0.0,0.000000,0.000000,6.938894e-18,0.000000,0.0,5.000000
444,1970.0,2.000000e+00,51.0,3.777167,27242.000000,1.0,0.000000,18643.0,63.514059,20.350158,...,0.000000,3.000000,9.00000,0.0,0.000000,0.000000,6.938894e-18,0.000000,0.0,3.000000
445,1965.0,2.000000e+00,56.0,3.000000,44393.000000,1.0,1.000000,18643.0,86.000000,24.000000,...,0.000000,4.000000,4.00000,0.0,0.000000,0.000000,6.938894e-18,0.000000,0.0,6.000000
446,1964.0,4.000000e+00,57.0,3.000000,59304.000000,0.0,1.000000,18643.0,81.000000,418.000000,...,8.000000,10.000000,5.00000,0.0,0.000000,0.178699,6.938894e-18,0.000000,0.0,6.000000


## Sınıf Dengesizliğini Gidermek

In [71]:
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.over_sampling import RandomOverSampler
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

In [72]:
smote_tomek = SMOTETomek(random_state=0)
smote_enn = SMOTEENN(random_state=0)
ros = RandomOverSampler(random_state=0)

In [74]:
X_resampled_tomek, y_resampled_tomek = smote_tomek.fit_resample(X_linear_tr, y_train)
X_resampled_enn, y_resampled_enn = smote_enn.fit_resample(X_linear_tr, y_train)
X_oversampled, y_oversampled = ros.fit_resample(X_linear_tr, y_train)

In [75]:
clf = [LGBMClassifier(), LGBMClassifier(), LGBMClassifier(), LGBMClassifier()]
clf[0].fit(X_knn_tr, y_train)
clf[1].fit(X_oversampled, y_oversampled)
clf[2].fit(X_resampled_tomek, y_resampled_tomek)
clf[3].fit(X_resampled_enn, y_resampled_enn)

print("Yalnızca Değer Doldurma")
print(classification_report(y_test, clf[0].predict(X_test)))

print("Random Oversampling")
print(classification_report(y_test, clf[1].predict(X_test)))

print("SMOTE TOMEK")
print(classification_report(y_test, clf[2].predict(X_test)))

print("SMOTE ENN")
print(classification_report(y_test, clf[0].predict(X_test)))

Yalnızca Değer Doldurma
              precision    recall  f1-score   support

         0.0       0.90      0.97      0.94       378
         1.0       0.75      0.43      0.55        70

    accuracy                           0.89       448
   macro avg       0.83      0.70      0.74       448
weighted avg       0.88      0.89      0.88       448

Random Oversampling
              precision    recall  f1-score   support

         0.0       0.89      0.94      0.92       378
         1.0       0.55      0.40      0.46        70

    accuracy                           0.85       448
   macro avg       0.72      0.67      0.69       448
weighted avg       0.84      0.85      0.85       448

SMOTE TOMEK
              precision    recall  f1-score   support

         0.0       0.91      0.93      0.92       378
         1.0       0.57      0.47      0.52        70

    accuracy                           0.86       448
   macro avg       0.74      0.70      0.72       448
weighted avg      

In [77]:
clf = [LogisticRegression(), LogisticRegression(), LogisticRegression(), LogisticRegression()]
clf[0].fit(X_knn_tr, y_train)
clf[1].fit(X_oversampled, y_oversampled)
clf[2].fit(X_resampled_tomek, y_resampled_tomek)
clf[3].fit(X_resampled_enn, y_resampled_enn)

print("Yalnızca Değer Doldurma")
print(classification_report(y_test, clf[0].predict(X_test)))

print("Random Oversampling")
print(classification_report(y_test, clf[1].predict(X_test)))

print("SMOTE TOMEK")
print(classification_report(y_test, clf[2].predict(X_test)))

print("SMOTE ENN")
print(classification_report(y_test, clf[0].predict(X_test)))

Yalnızca Değer Doldurma
              precision    recall  f1-score   support

         0.0       0.88      0.98      0.93       378
         1.0       0.77      0.29      0.42        70

    accuracy                           0.88       448
   macro avg       0.83      0.63      0.67       448
weighted avg       0.86      0.88      0.85       448

Random Oversampling
              precision    recall  f1-score   support

         0.0       0.93      0.80      0.86       378
         1.0       0.38      0.66      0.48        70

    accuracy                           0.78       448
   macro avg       0.66      0.73      0.67       448
weighted avg       0.84      0.78      0.80       448

SMOTE TOMEK
              precision    recall  f1-score   support

         0.0       0.92      0.81      0.86       378
         1.0       0.38      0.63      0.47        70

    accuracy                           0.78       448
   macro avg       0.65      0.72      0.67       448
weighted avg      

In [78]:
print("Değer Doldurma ve (Threshold Değiştirme)")
print(classification_report(y_test,
                           np.array(pd.DataFrame(clf[0].predict_proba(X_test)).applymap(lambda x: 1 if x>0.4 else 0)[1])))

Değer Doldurma ve (Threshold Değiştirme)
              precision    recall  f1-score   support

         0.0       0.89      0.97      0.93       378
         1.0       0.66      0.36      0.46        70

    accuracy                           0.87       448
   macro avg       0.77      0.66      0.69       448
weighted avg       0.85      0.87      0.85       448

