# Veri İşleme - Eksik Veri ve Dengesiz Sınıf Dağılımı

<p>Bu notebookta işlenen konular hakkında işe yarar bağlantılar:</p><br />
<a href="https://www.kaggle.com/jackdaoud/marketing-data">Kullanılan Veri Kümesi</a><br />
<a href="https://scikit-learn.org/stable/modules/impute.html">Scikit-learn Eksik Veri Doldurma Dokümanı</a><br />
<a href="https://imbalanced-learn.readthedocs.io/en/stable/api.html">Dengesiz Sınıf Dağılımı için imblearn API Dokümantasyonu</a>

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from datetime import datetime
from copy import deepcopy

In [2]:
data = pd.read_csv("marketing_data.csv")
y = np.array(data["Response"]).astype(np.float16)
X = data.drop(["Response"], axis=1)
X = X.drop(["ID"], axis=1)

In [3]:
X.shape

(2240, 26)

In [4]:
X.isna().sum()

Year_Birth              0
Education               0
Marital_Status          0
 Income                24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Country                 0
dtype: int64

In [5]:
def remove_value(X):
    # Randomly remove values from the data
    a, b = X.shape
    n_missing = int(a * b * 0.1)
    print(f"Removing {n_missing} values randomly.")
    for i in range(n_missing):
        coor = (np.random.rand(2) * [*X.shape]).astype(np.int16)
        coor = np.clip(coor, a_min=0, a_max=max(X.shape) - 1)
        X.iloc[coor[0], coor[1]] = np.nan
    return X

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [7]:
X_train_orig = deepcopy(X_train)
X_test_orig = deepcopy(X_test)

In [8]:
X_train = remove_value(X_train)
X_test = remove_value(X_test)

Removing 4659 values randomly.


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


Removing 1164 values randomly.


In [9]:
X_train.isna().sum()

Year_Birth             154
Education              170
Marital_Status         159
 Income                184
Kidhome                174
Teenhome               170
Dt_Customer            173
Recency                180
MntWines               179
MntFruits              165
MntMeatProducts        167
MntFishProducts        159
MntSweetProducts       165
MntGoldProds           163
NumDealsPurchases      179
NumWebPurchases        166
NumCatalogPurchases    176
NumStorePurchases      180
NumWebVisitsMonth      191
AcceptedCmp3           149
AcceptedCmp4           178
AcceptedCmp5           177
AcceptedCmp1           156
AcceptedCmp2           181
Complain               183
Country                167
dtype: int64

In [10]:
X_test.isna().sum()

Year_Birth             49
Education              37
Marital_Status         42
 Income                49
Kidhome                42
Teenhome               39
Dt_Customer            44
Recency                46
MntWines               44
MntFruits              40
MntMeatProducts        44
MntFishProducts        39
MntSweetProducts       36
MntGoldProds           44
NumDealsPurchases      42
NumWebPurchases        38
NumCatalogPurchases    34
NumStorePurchases      44
NumWebVisitsMonth      42
AcceptedCmp3           33
AcceptedCmp4           49
AcceptedCmp5           46
AcceptedCmp1           45
AcceptedCmp2           33
Complain               46
Country                53
dtype: int64

#### İlk İzlenimler

In [11]:
pos_res = y_train[y_train == 1].shape[0]
neg_res = y_train[y_train == 0].shape[0]
print(f"Number of positive responses: {pos_res}")
print(f"Number of negative responses: {neg_res}")
print(f"Accuracy if model predicts positive everytime: %{100 * pos_res/(pos_res+neg_res)}")
print(f"Accuracy if model predicts negative everytime: %{100 * neg_res/(pos_res+neg_res)}")

Number of positive responses: 266
Number of negative responses: 1526
Accuracy if model predicts positive everytime: %14.84375
Accuracy if model predicts negative everytime: %85.15625


In [12]:
X_train.isna().sum()

Year_Birth             154
Education              170
Marital_Status         159
 Income                184
Kidhome                174
Teenhome               170
Dt_Customer            173
Recency                180
MntWines               179
MntFruits              165
MntMeatProducts        167
MntFishProducts        159
MntSweetProducts       165
MntGoldProds           163
NumDealsPurchases      179
NumWebPurchases        166
NumCatalogPurchases    176
NumStorePurchases      180
NumWebVisitsMonth      191
AcceptedCmp3           149
AcceptedCmp4           178
AcceptedCmp5           177
AcceptedCmp1           156
AcceptedCmp2           181
Complain               183
Country                167
dtype: int64

In [13]:
X_train.dtypes

Year_Birth             float64
Education               object
Marital_Status          object
 Income                 object
Kidhome                float64
Teenhome               float64
Dt_Customer             object
Recency                float64
MntWines               float64
MntFruits              float64
MntMeatProducts        float64
MntFishProducts        float64
MntSweetProducts       float64
MntGoldProds           float64
NumDealsPurchases      float64
NumWebPurchases        float64
NumCatalogPurchases    float64
NumStorePurchases      float64
NumWebVisitsMonth      float64
AcceptedCmp3           float64
AcceptedCmp4           float64
AcceptedCmp5           float64
AcceptedCmp1           float64
AcceptedCmp2           float64
Complain               float64
Country                 object
dtype: object

In [14]:
X_train.columns[X_train.dtypes == object]

Index(['Education', 'Marital_Status', ' Income ', 'Dt_Customer', 'Country'], dtype='object')

In [15]:
X_train[X_train.columns[X_train.dtypes == object]].nunique()

Education            5
Marital_Status       8
 Income           1482
Dt_Customer        620
Country              8
dtype: int64

#### Veriyi Analize Uygun Hale Getirmek

In [16]:
def ordinal_encode(X, cats):
    X = deepcopy(X)
    for cat in cats:
        X[cat][X[cat].isna()] = "nan"
        
    enc=OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
    enc.fit(X[cats])
    
    transformed = enc.transform(X[cats])
    
    for i in range(len(cats)):
        nan_pos = np.where(enc.categories_[i] == "nan")
        transformed[[np.where(transformed[:, i] == nan_pos)], i] = np.nan
        
    return transformed, enc

In [17]:
cats = ["Education", "Marital_Status", "Country"]
a, b = ordinal_encode(X_train, cats)
for i, cat in enumerate(cats):
    X_train[cat] = a[:, i]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[cat][X[cat].isna()] = "nan"
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


In [18]:
for cat in cats:
    X_test[cat][X_test[cat].isna()] = "nan"
c = b.transform(X_test[cats])
for i in range(len(cats)):
    nan_pos = np.where(b.categories_[i] == "nan")
    c[[np.where(c[:, i] == nan_pos)], i] = np.nan
for i, cat in enumerate(cats):
    X_test[cat] = c[:, i]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test[cat][X_test[cat].isna()] = "nan"
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return self._update_inplace(result)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


In [19]:
for cat in cats:
    X_train_orig[cat][X_train_orig[cat].isna()] = "nan"
d = b.transform(X_train_orig[cats])
for i in range(len(cats)):
    nan_pos = np.where(b.categories_[i] == "nan")
    d[[np.where(d[:, i] == nan_pos)], i] = np.nan
for i, cat in enumerate(cats):
    X_train_orig[cat] = d[:, i]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train_orig[cat][X_train_orig[cat].isna()] = "nan"


In [20]:
X_train

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
2205,1973.0,,,"$54,222.00",0.0,1.0,3/1/14,98.0,199.0,12.0,...,3.0,5.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,
1370,1949.0,2.0,6.0,"$72,643.00",0.0,0.0,,,526.0,80.0,...,10.0,7.0,2.0,0.0,,0.0,1.0,,0.0,6.0
2003,1976.0,2.0,3.0,"$30,772.00",1.0,1.0,3/12/14,89.0,7.0,2.0,...,0.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1447,1975.0,3.0,3.0,"$54,730.00",0.0,1.0,8/15/13,64.0,318.0,,...,1.0,8.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
546,1968.0,,2.0,"$41,335.00",1.0,0.0,12/26/13,24.0,112.0,19.0,...,1.0,4.0,7.0,0.0,0.0,0.0,0.0,,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
334,1973.0,2.0,6.0,"$60,208.00",1.0,1.0,10/7/12,13.0,488.0,23.0,...,,7.0,7.0,0.0,1.0,0.0,0.0,0.0,0.0,6.0
442,1975.0,3.0,3.0,"$7,500.00",1.0,0.0,10/2/13,19.0,3.0,1.0,...,0.0,3.0,5.0,0.0,0.0,0.0,0.0,0.0,,0.0
1440,1968.0,4.0,3.0,"$36,778.00",1.0,1.0,8/5/12,63.0,29.0,4.0,...,0.0,3.0,9.0,0.0,0.0,,,0.0,0.0,6.0
1968,1962.0,3.0,4.0,"$59,247.00",0.0,2.0,11/8/13,87.0,327.0,9.0,...,2.0,9.0,6.0,0.0,,0.0,0.0,0.0,0.0,


In [21]:
X_test

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
634,1959.0,,,"$26,887.00",0.0,1.0,2/10/13,27.0,6.0,,...,0.0,3.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,
422,1979.0,0.0,,"$54,210.00",0.0,1.0,5/20/13,18.0,70.0,54.0,...,1.0,7.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,
533,1987.0,4.0,4.0,"$42,000.00",0.0,0.0,1/10/13,,124.0,,...,2.0,11.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
710,1982.0,2.0,3.0,"$79,908.00",0.0,0.0,4/5/13,30.0,557.0,129.0,...,6.0,7.0,2.0,0.0,,1.0,0.0,0.0,,6.0
1362,1965.0,3.0,3.0,,2.0,1.0,10/11/13,60.0,292.0,3.0,...,,5.0,7.0,0.0,0.0,0.0,,0.0,0.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1149,1963.0,2.0,2.0,"$68,118.00",0.0,1.0,10/18/13,51.0,595.0,23.0,...,9.0,4.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
1739,1976.0,2.0,2.0,"$31,859.00",1.0,0.0,6/14/13,77.0,3.0,1.0,...,0.0,2.0,7.0,0.0,0.0,0.0,,0.0,0.0,5.0
408,1965.0,0.0,2.0,"$60,161.00",0.0,1.0,10/23/12,,584.0,44.0,...,4.0,8.0,8.0,0.0,0.0,,0.0,0.0,0.0,1.0
456,1965.0,2.0,2.0,"$4,861.00",0.0,0.0,6/22/14,20.0,2.0,1.0,...,0.0,0.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0


In [22]:
# X_train = pd.get_dummies(X_train, columns=["Education", "Marital_Status", "Country"], dummy_na=True)
# X_test = pd.get_dummies(X_test, columns=["Education", "Marital_Status", "Country"], dummy_na=True)

In [23]:
# X_train_orig = pd.get_dummies(X_train_orig, columns=["Education", "Marital_Status", "Country"], dummy_na=True)

In [24]:
incomes = [float(inc.lstrip("$").replace(",", "")) for inc in X_train[" Income "] if isinstance(inc, str)]
X_train[" Income "][X_train[" Income "].apply(lambda x: isinstance(x, str))] = incomes
X_train[" Income "] = X_train[" Income "].astype(np.float64)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[" Income "][X_train[" Income "].apply(lambda x: isinstance(x, str))] = incomes
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return self._update_inplace(result)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


In [25]:
incomes = [float(inc.lstrip("$").replace(",", "")) for inc in X_train_orig[" Income "] if isinstance(inc, str)]
X_train_orig[" Income "][X_train_orig[" Income "].apply(lambda x: isinstance(x, str))] = incomes
X_train_orig[" Income "] = X_train_orig[" Income "].astype(np.float64)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train_orig[" Income "][X_train_orig[" Income "].apply(lambda x: isinstance(x, str))] = incomes


In [26]:
X_train

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
2205,1973.0,,,54222.0,0.0,1.0,3/1/14,98.0,199.0,12.0,...,3.0,5.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,
1370,1949.0,2.0,6.0,72643.0,0.0,0.0,,,526.0,80.0,...,10.0,7.0,2.0,0.0,,0.0,1.0,,0.0,6.0
2003,1976.0,2.0,3.0,30772.0,1.0,1.0,3/12/14,89.0,7.0,2.0,...,0.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1447,1975.0,3.0,3.0,54730.0,0.0,1.0,8/15/13,64.0,318.0,,...,1.0,8.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
546,1968.0,,2.0,41335.0,1.0,0.0,12/26/13,24.0,112.0,19.0,...,1.0,4.0,7.0,0.0,0.0,0.0,0.0,,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
334,1973.0,2.0,6.0,60208.0,1.0,1.0,10/7/12,13.0,488.0,23.0,...,,7.0,7.0,0.0,1.0,0.0,0.0,0.0,0.0,6.0
442,1975.0,3.0,3.0,7500.0,1.0,0.0,10/2/13,19.0,3.0,1.0,...,0.0,3.0,5.0,0.0,0.0,0.0,0.0,0.0,,0.0
1440,1968.0,4.0,3.0,36778.0,1.0,1.0,8/5/12,63.0,29.0,4.0,...,0.0,3.0,9.0,0.0,0.0,,,0.0,0.0,6.0
1968,1962.0,3.0,4.0,59247.0,0.0,2.0,11/8/13,87.0,327.0,9.0,...,2.0,9.0,6.0,0.0,,0.0,0.0,0.0,0.0,


In [27]:
incomes = [float(inc.lstrip("$").replace(",", "")) for inc in X_test[" Income "] if isinstance(inc, str)]
X_test[" Income "][X_test[" Income "].apply(lambda x: isinstance(x, str))] = incomes
X_test[" Income "] = X_test[" Income "].astype(np.float64)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test[" Income "][X_test[" Income "].apply(lambda x: isinstance(x, str))] = incomes
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return self._update_inplace(result)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


In [28]:
X_train["Dt_Customer"] = (datetime.today() - pd.to_datetime(X_train["Dt_Customer"])).dt.days
X_test["Dt_Customer"] = (datetime.today() - pd.to_datetime(X_test["Dt_Customer"])).dt.days
X_train_orig["Dt_Customer"] = (datetime.today() - pd.to_datetime(X_train_orig["Dt_Customer"])).dt.days

In [29]:
X_train.insert(2, "Age", [2021 - birth for birth in X_train["Year_Birth"]])
X_test.insert(2, "Age", [2021 - birth for birth in X_test["Year_Birth"]])
X_train_orig.insert(2, "Age", [2021 - birth for birth in X_train_orig["Year_Birth"]])

In [30]:
X_train.dtypes

Year_Birth             float64
Education              float64
Age                    float64
Marital_Status         float64
 Income                float64
Kidhome                float64
Teenhome               float64
Dt_Customer            float64
Recency                float64
MntWines               float64
MntFruits              float64
MntMeatProducts        float64
MntFishProducts        float64
MntSweetProducts       float64
MntGoldProds           float64
NumDealsPurchases      float64
NumWebPurchases        float64
NumCatalogPurchases    float64
NumStorePurchases      float64
NumWebVisitsMonth      float64
AcceptedCmp3           float64
AcceptedCmp4           float64
AcceptedCmp5           float64
AcceptedCmp1           float64
AcceptedCmp2           float64
Complain               float64
Country                float64
dtype: object

## Korelasyon

In [31]:
corr_mat = X_train.corr()

In [32]:
X_train.corrwith(pd.DataFrame(y_train)[0])

Year_Birth             0.006165
Education             -0.035431
Age                   -0.006165
Marital_Status        -0.041897
 Income                0.023546
Kidhome               -0.033143
Teenhome              -0.011060
Dt_Customer            0.004319
Recency               -0.016004
MntWines               0.050974
MntFruits              0.030784
MntMeatProducts        0.052295
MntFishProducts        0.047346
MntSweetProducts       0.014269
MntGoldProds           0.024523
NumDealsPurchases     -0.016147
NumWebPurchases        0.011903
NumCatalogPurchases    0.042604
NumStorePurchases      0.024579
NumWebVisitsMonth     -0.054940
AcceptedCmp3           0.034795
AcceptedCmp4           0.012330
AcceptedCmp5           0.083188
AcceptedCmp1           0.002546
AcceptedCmp2          -0.009598
Complain              -0.017151
Country               -0.043390
dtype: float64

In [33]:
corr_mat.to_csv("corr_mat.csv")

In [34]:
corr_mat

Unnamed: 0,Year_Birth,Education,Age,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
Year_Birth,1.0,-0.175916,-1.0,-0.035773,-0.12218,0.201373,-0.337064,-0.000147,-0.006454,-0.153614,...,-0.0918,-0.104636,0.084274,0.046838,-0.055651,0.020653,0.010356,0.004169,-0.023999,0.035768
Education,-0.175916,1.0,0.175916,-0.002133,0.095603,-0.051764,0.127788,-0.050332,-0.041149,0.184363,...,0.064561,0.046497,-0.061787,0.002608,0.029021,0.024141,0.004137,0.014571,-0.069968,0.043919
Age,-1.0,0.175916,1.0,0.035773,0.12218,-0.201373,0.337064,0.000147,0.006454,0.153614,...,0.0918,0.104636,-0.084274,-0.046838,0.055651,-0.020653,-0.010356,-0.004169,0.023999,-0.035768
Marital_Status,-0.035773,-0.002133,0.035773,1.0,0.021104,-0.023075,-0.017573,-0.008043,0.008028,0.000294,...,-0.00312,0.016375,-0.028445,-0.004232,0.011032,0.030234,-0.019823,0.031976,-0.00772,0.018629
Income,-0.12218,0.095603,0.12218,0.021104,1.0,-0.388243,-0.004968,-0.01381,-0.01056,0.539239,...,0.541505,0.493839,-0.516402,-0.023003,0.167086,0.329477,0.28324,0.08767,-0.038239,-0.004804
Kidhome,0.201373,-0.051764,-0.201373,-0.023075,-0.388243,1.0,-0.02523,-0.049252,-0.004462,-0.483049,...,-0.490422,-0.494329,0.456574,0.030571,-0.169213,-0.215498,-0.169996,-0.064854,0.058006,-0.016739
Teenhome,-0.337064,0.127788,0.337064,-0.017573,-0.004968,-0.02523,1.0,0.015486,0.012925,-0.016555,...,-0.128736,0.039736,0.14421,-0.057707,0.051157,-0.208089,-0.134481,-0.034028,-0.007966,-0.008825
Dt_Customer,-0.000147,-0.050332,0.000147,-0.008043,-0.01381,-0.049252,0.015486,1.0,0.03827,0.1717,...,0.096861,0.104122,0.283681,-0.000105,0.013559,0.004563,-0.032342,0.006401,0.041631,0.028056
Recency,-0.006454,-0.041149,0.006454,0.008028,-0.01056,-0.004462,0.012925,0.03827,1.0,0.026883,...,0.025909,0.001669,-0.028069,-0.011473,-0.001374,-0.001163,-0.014915,-0.021846,-0.000955,0.016194
MntWines,-0.153614,0.184363,0.153614,0.000294,0.539239,-0.483049,-0.016555,0.1717,0.026883,1.0,...,0.650882,0.648439,-0.319616,0.050955,0.378522,0.492061,0.387303,0.182009,-0.04997,0.014141


In [35]:
corr_mat["Age"][np.abs(corr_mat["Age"]) > 0.5]

Year_Birth   -1.0
Age           1.0
Name: Age, dtype: float64

## Ölçekleme

Eksik verileri doldurmadan önce ölçeklemek (scaling) oldukça önemli. Bu adım atlandığında tek değişken yöntemleri daha iyi sonuç verebiliyor çünkü çok değişken yöntemleri aynı ölçekte olmayan sütunlardan fazlaca etkilenip daha yanlış kararlar alabiliyor.

In [36]:
sc = StandardScaler()
X_train_orig = pd.DataFrame(sc.fit_transform(X_train_orig), columns=X_train.columns)
X_train = pd.DataFrame(sc.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(sc.transform(X_test), columns=X_test.columns)

## Tek Değişken Yöntemleri 

### Ortalama, Medyan, Mod ve Sabit Değer

In [37]:
from sklearn.impute import SimpleImputer

In [38]:
imputers = list()
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imputers.append(imp_mean)
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
imputers.append(imp_median)
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputers.append(imp_mode)
imp_constant = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
imputers.append(imp_constant)

In [39]:
for imp in imputers:
    imp.fit(X_train)

In [40]:
X_train.mode().iloc[0]

Year_Birth             0.584482
Education             -0.368352
Age                   -0.584482
Marital_Status        -0.661122
 Income               -1.693677
Kidhome               -0.830948
Teenhome              -0.950319
Dt_Customer           -1.503781
Recency               -1.616838
MntWines              -0.886892
MntFruits             -0.664619
MntMeatProducts       -0.715536
MntFishProducts       -0.687568
MntSweetProducts      -0.664354
MntGoldProds          -0.787608
NumDealsPurchases     -0.679947
NumWebPurchases       -0.762426
NumCatalogPurchases   -0.913623
NumStorePurchases     -0.850613
NumWebVisitsMonth      0.720441
AcceptedCmp3          -0.281959
AcceptedCmp4          -0.283410
AcceptedCmp5          -0.293393
AcceptedCmp1          -0.271092
AcceptedCmp2          -0.112119
Complain              -0.100219
Country                0.708680
Name: 0, dtype: float64

In [41]:
print(X_train.mean())

Year_Birth            -7.536096e-15
Education             -2.349892e-16
Age                   -1.272893e-16
Marital_Status         3.401419e-17
 Income               -4.715686e-17
Kidhome               -1.081404e-16
Teenhome              -2.512034e-17
Dt_Customer            3.564466e-16
Recency                1.221796e-16
MntWines               7.646119e-17
MntFruits              8.045193e-17
MntMeatProducts        1.714868e-17
MntFishProducts        9.565731e-17
MntSweetProducts       6.332434e-17
MntGoldProds          -1.020942e-16
NumDealsPurchases      5.733514e-17
NumWebPurchases       -1.666017e-17
NumCatalogPurchases    4.809134e-18
NumStorePurchases      2.479406e-16
NumWebVisitsMonth      1.047119e-17
AcceptedCmp3           3.513792e-17
AcceptedCmp4           1.220970e-16
AcceptedCmp5          -1.288959e-16
AcceptedCmp1          -1.154333e-16
AcceptedCmp2           1.931519e-16
Complain              -1.050537e-17
Country                1.601073e-16
dtype: float64


In [42]:
print(X_train.median())

Year_Birth             0.087270
Education             -0.368352
Age                   -0.087270
Marital_Status        -0.661122
 Income               -0.053291
Kidhome               -0.830948
Teenhome              -0.950319
Dt_Customer           -0.008136
Recency                0.005009
MntWines              -0.390465
MntFruits             -0.467223
MntMeatProducts       -0.437541
MntFishProducts       -0.472473
MntSweetProducts      -0.468094
MntGoldProds          -0.375220
NumDealsPurchases     -0.180741
NumWebPurchases       -0.050974
NumCatalogPurchases   -0.228830
NumStorePurchases     -0.238202
NumWebVisitsMonth      0.302381
AcceptedCmp3          -0.281959
AcceptedCmp4          -0.283410
AcceptedCmp5          -0.293393
AcceptedCmp1          -0.271092
AcceptedCmp2          -0.112119
Complain              -0.100219
Country                0.708680
dtype: float64


In [43]:
X_mean_tr = pd.DataFrame(imp_mean.transform(X_train), columns=X_train.columns)
X_median_tr = pd.DataFrame(imp_median.transform(X_train), columns=X_train.columns)
X_mode_tr = pd.DataFrame(imp_mode.transform(X_train), columns=X_train.columns)
X_const_tr = pd.DataFrame(imp_mode.transform(X_train), columns=X_train.columns)

## Çok Değişken Yöntemleri

### En Yakın Komşu

In [44]:
from sklearn.impute import KNNImputer

In [45]:
imp_knn = KNNImputer(missing_values=np.nan, n_neighbors=5, weights="distance")

In [46]:
imp_knn.fit(X_train)

KNNImputer(weights='distance')

In [47]:
X_knn_tr = pd.DataFrame(imp_knn.transform(X_train), columns=X_train.columns)

In [48]:
X_knn_tr

Unnamed: 0,Year_Birth,Education,Age,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
0,0.335876,0.302215,-0.335876,0.685673,0.067178,-0.830948,0.893482,-1.150711,1.695871,-0.317029,...,0.113567,-0.238202,-0.951797,-0.281959,-0.283410,-0.293393,-0.271092,-0.112119,-0.100219,-0.519776
1,-1.652973,-0.368352,1.652973,2.151089,0.761428,-0.830948,-0.950319,-0.848996,-0.267390,0.643514,...,2.510344,0.374209,-1.369856,-0.281959,-0.283410,-0.293393,3.688786,-0.112119,-0.100219,0.708680
2,0.584482,-0.368352,-0.584482,-0.661122,-0.816603,1.003259,0.893482,-1.204652,1.385304,-0.881017,...,-0.913623,-1.156819,0.302381,-0.281959,-0.283410,-0.293393,-0.271092,-0.112119,-0.100219,-1.564023
3,0.501614,0.524172,-0.501614,-0.661122,0.086324,-0.830948,0.893482,-0.179767,0.522620,0.032527,...,-0.571226,0.680415,-0.533737,-0.281959,-0.283410,-0.293393,-0.271092,-0.112119,-0.100219,0.708680
4,-0.078467,-0.201648,0.078467,-1.598526,-0.418506,1.003259,-0.950319,-0.831967,-0.857675,-0.572586,...,-0.571226,-0.544408,0.720441,-0.281959,-0.283410,-0.293393,-0.271092,-0.112119,-0.100219,-2.018564
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1787,0.335876,-0.368352,-0.335876,2.151089,0.292778,1.003259,0.893482,1.350205,-1.237257,0.531891,...,-0.174715,0.374209,0.720441,-0.281959,3.528456,-0.293393,-0.271092,-0.112119,-0.100219,0.708680
1788,0.501614,0.524172,-0.501614,-0.661122,-1.693677,1.003259,-0.950319,-0.415147,-1.030212,-0.892767,...,-0.913623,-0.850613,-0.115678,-0.281959,-0.283410,-0.293393,-0.271092,-0.112119,-0.100219,-2.018564
1789,-0.078467,1.416697,0.078467,-0.661122,-0.590250,1.003259,0.893482,1.659142,0.488112,-0.816393,...,-0.913623,-0.850613,1.556560,-0.281959,-0.283410,-0.293393,-0.271092,-0.112119,-0.100219,0.708680
1790,-0.575679,0.524172,0.575679,0.276281,0.256560,-0.830948,2.737283,-0.596586,1.316290,0.058964,...,-0.228830,0.986620,0.302381,-0.281959,0.383719,-0.293393,-0.271092,-0.112119,-0.100219,-0.080227


#### Döngülü Yöntemler (Sklearn'de Henüz Olgun bir İmplementasyonu Yok)

In [49]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [50]:
imp_linear = IterativeImputer(max_iter=100, random_state=0, estimator=BayesianRidge())
imp_tree = IterativeImputer(max_iter = 100, random_state=0, estimator=DecisionTreeRegressor(max_features='sqrt', random_state=0))
# imp_forest = IterativeImputer(max_iter = 25, random_state=0, estimator=RandomForestRegressor(n_estimators=20, random_state=0))

In [51]:
imp_linear = IterativeImputer(max_iter=100, random_state=0, estimator=BayesianRidge())

In [52]:
imp_linear.fit(X_train)

IterativeImputer(estimator=BayesianRidge(), max_iter=100, random_state=0)

In [53]:
imp_tree.fit(X_train)



IterativeImputer(estimator=DecisionTreeRegressor(max_features='sqrt',
                                                 random_state=0),
                 max_iter=100, random_state=0)

In [54]:
# imp_forest.fit(X_train)

In [55]:
X_linear_tr = pd.DataFrame(imp_linear.transform(X_train), columns=X_train.columns)
X_tree_tr = pd.DataFrame(imp_tree.transform(X_train), columns=X_train.columns)
# X_forest_tr = pd.DataFrame(imp_forest.transform(X_train), columns=X_train.columns)

#### Karşılaştırma

In [56]:
print("Metrik: Mean Absolute Error (Ortalama Mutlak Hata)\n")
print(f"Ortalamaya Göre: {((np.abs(X_train_orig - X_mean_tr))).mean().mean()}")
print(f"Medyana Göre: {((np.abs(X_train_orig - X_median_tr))).mean().mean()}")
print(f"Moda Göre: {((np.abs(X_train_orig - X_mode_tr))).mean().mean()}")
print(f"Sabit Değere Göre: {((np.abs(X_train_orig - X_const_tr))).mean().mean()}")
print(f"En Yakın Komşulara Göre: {((np.abs(X_train_orig - X_knn_tr))).mean().mean()}")
print(f"Lineer Regresyona Göre: {((np.abs(X_train_orig - X_linear_tr))).mean().mean()}")
print(f"Karar Ağacına Göre: {((np.abs(X_train_orig - X_tree_tr))).mean().mean()}")
# print(f"Karar Ağaçları Ormanına Göre: {((np.abs(X_train_orig - X_forest_tr))).mean().mean()}")

Metrik: Mean Absolute Error (Ortalama Mutlak Hata)

Ortalamaya Göre: 0.07486019168719599
Medyana Göre: 0.0663682563495706
Moda Göre: 0.07947788885172288
Sabit Değere Göre: 0.07947788885172288
En Yakın Komşulara Göre: 0.051860386677715156
Lineer Regresyona Göre: 0.058339048638690695
Karar Ağacına Göre: 0.05745386056005442


#### Test Kümesinin Dönüştürülmesi

In [57]:
X_test = pd.DataFrame(imp_knn.transform(X_test), columns=X_test.columns)

In [58]:
X_test

Unnamed: 0,Year_Birth,Education,Age,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
0,-0.824286,-0.191590,0.824286,0.640350,-0.963021,-0.830948,0.893482,0.732332,-0.754153,-0.883954,...,-0.913623,-0.850613,0.302381,-0.281959,-0.28341,-0.293393,-0.271092,-0.112119,-0.100219,0.259934
1,0.833089,-2.153401,-0.833089,-0.813049,0.066726,-0.830948,0.893482,0.246860,-1.064720,-0.695958,...,-0.571226,0.374209,-0.115678,-0.281959,-0.28341,-0.293393,-0.271092,-0.112119,-0.100219,0.797968
2,1.496038,1.416697,-1.496038,0.276281,-0.393443,-0.830948,-0.950319,0.884348,0.069758,-0.537337,...,-0.228830,1.599031,-0.115678,-0.281959,-0.28341,-0.293393,-0.271092,-0.112119,-0.100219,0.708680
3,1.081695,-0.368352,-1.081695,-0.661122,1.035230,-0.830948,-0.950319,0.467529,-0.650631,0.734575,...,1.140757,0.374209,-1.369856,-0.281959,-0.28341,3.408400,-0.271092,-0.112119,-0.100219,0.708680
4,-0.327073,0.524172,0.327073,-0.661122,-0.252952,2.837465,0.893482,-0.459281,0.384590,-0.043847,...,-0.089684,-0.238202,0.720441,-0.281959,-0.28341,-0.293393,-0.271092,-0.112119,-0.100219,0.254140
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
443,-0.492811,-0.368352,0.492811,-1.598526,0.590890,-0.830948,0.893482,-0.493608,0.074024,0.846197,...,2.167948,-0.544408,0.302381,-0.281959,-0.28341,-0.293393,-0.271092,-0.112119,-0.100219,0.708680
444,0.584482,-0.368352,-0.584482,-1.598526,-0.775636,1.003259,-0.950319,0.124266,0.971216,-0.892767,...,-0.913623,-1.156819,0.720441,-0.281959,-0.28341,-0.293393,-0.271092,-0.112119,-0.100219,0.254140
445,-0.327073,-2.153401,0.327073,-1.598526,0.291007,-0.830948,0.893482,1.271745,-0.122329,0.813886,...,0.455964,0.680415,1.138500,-0.281959,-0.28341,-0.293393,-0.271092,-0.112119,-0.100219,-1.564023
446,-0.327073,-0.368352,0.327073,-1.598526,-1.793135,-0.830948,-0.950319,-1.704835,-0.995705,-0.895704,...,-0.913623,-1.769230,3.646856,-0.281959,-0.28341,-0.293393,-0.271092,-0.112119,-0.100219,0.254140


In [59]:
# Inverse transform to get the originals
pd.DataFrame(sc.inverse_transform(X_test), columns=X_test.columns)

Unnamed: 0,Year_Birth,Education,Age,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Country
0,1959.0,2.198048,62.0,4.388380,26887.000000,0.0,1.0,3099.0,27.000000,6.0,...,0.000000,3.0,6.0,0.0,0.0,0.000000e+00,0.0,0.0,0.0,5.012747
1,1979.0,0.000000,42.0,2.837928,54210.000000,0.0,1.0,3000.0,18.000000,70.0,...,1.000000,7.0,5.0,0.0,0.0,0.000000e+00,0.0,0.0,0.0,6.196435
2,1987.0,4.000000,34.0,4.000000,42000.000000,0.0,0.0,3130.0,50.876377,124.0,...,2.000000,11.0,5.0,0.0,0.0,0.000000e+00,0.0,0.0,0.0,6.000000
3,1982.0,2.000000,39.0,3.000000,79908.000000,0.0,0.0,3045.0,30.000000,557.0,...,6.000000,7.0,2.0,0.0,0.0,1.000000e+00,0.0,0.0,0.0,6.000000
4,1965.0,3.000000,56.0,3.000000,45727.747621,2.0,1.0,2856.0,60.000000,292.0,...,2.406386,5.0,7.0,0.0,0.0,0.000000e+00,0.0,0.0,0.0,5.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
443,1963.0,2.000000,58.0,2.000000,68118.000000,0.0,1.0,2849.0,51.000000,595.0,...,9.000000,4.0,6.0,0.0,0.0,0.000000e+00,0.0,0.0,0.0,6.000000
444,1976.0,2.000000,45.0,2.000000,31859.000000,1.0,0.0,2975.0,77.000000,3.0,...,0.000000,2.0,7.0,0.0,0.0,0.000000e+00,0.0,0.0,0.0,5.000000
445,1965.0,0.000000,56.0,2.000000,60161.000000,0.0,1.0,3209.0,45.309831,584.0,...,4.000000,8.0,8.0,0.0,0.0,2.775558e-17,0.0,0.0,0.0,1.000000
446,1965.0,2.000000,56.0,2.000000,4861.000000,0.0,0.0,2602.0,20.000000,2.0,...,0.000000,0.0,14.0,0.0,0.0,0.000000e+00,0.0,0.0,0.0,5.000000


## Sınıf Dengesizliğini Gidermek

In [60]:
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.over_sampling import RandomOverSampler
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

In [61]:
smote_tomek = SMOTETomek(random_state=0)
smote_enn = SMOTEENN(random_state=0)
ros = RandomOverSampler(random_state=0)

In [62]:
X_resampled_tomek, y_resampled_tomek = smote_tomek.fit_resample(X_linear_tr, y_train)
X_resampled_enn, y_resampled_enn = smote_enn.fit_resample(X_linear_tr, y_train)
X_oversampled, y_oversampled = ros.fit_resample(X_linear_tr, y_train)

In [63]:
clf = [LGBMClassifier(), LGBMClassifier(), LGBMClassifier(), LGBMClassifier()]
clf[0].fit(X_knn_tr, y_train)
clf[1].fit(X_oversampled, y_oversampled)
clf[2].fit(X_resampled_tomek, y_resampled_tomek)
clf[3].fit(X_resampled_enn, y_resampled_enn)

print("Yalnızca Değer Doldurma")
print(classification_report(y_test, clf[0].predict(X_test)))

print("Random Oversampling")
print(classification_report(y_test, clf[1].predict(X_test)))

print("SMOTE TOMEK")
print(classification_report(y_test, clf[2].predict(X_test)))

print("SMOTE ENN")
print(classification_report(y_test, clf[0].predict(X_test)))

Yalnızca Değer Doldurma
              precision    recall  f1-score   support

         0.0       0.89      0.97      0.93       380
         1.0       0.69      0.35      0.47        68

    accuracy                           0.88       448
   macro avg       0.79      0.66      0.70       448
weighted avg       0.86      0.88      0.86       448

Random Oversampling
              precision    recall  f1-score   support

         0.0       0.91      0.96      0.93       380
         1.0       0.67      0.47      0.55        68

    accuracy                           0.88       448
   macro avg       0.79      0.71      0.74       448
weighted avg       0.87      0.88      0.88       448

SMOTE TOMEK
              precision    recall  f1-score   support

         0.0       0.90      0.96      0.93       380
         1.0       0.66      0.43      0.52        68

    accuracy                           0.88       448
   macro avg       0.78      0.69      0.72       448
weighted avg      

In [64]:
clf = [LogisticRegression(), LogisticRegression(), LogisticRegression(), LogisticRegression()]
clf[0].fit(X_knn_tr, y_train)
clf[1].fit(X_oversampled, y_oversampled)
clf[2].fit(X_resampled_tomek, y_resampled_tomek)
clf[3].fit(X_resampled_enn, y_resampled_enn)

print("Yalnızca Değer Doldurma")
print(classification_report(y_test, clf[0].predict(X_test)))

print("Random Oversampling")
print(classification_report(y_test, clf[1].predict(X_test)))

print("SMOTE TOMEK")
print(classification_report(y_test, clf[2].predict(X_test)))

print("SMOTE ENN")
print(classification_report(y_test, clf[0].predict(X_test)))

Yalnızca Değer Doldurma
              precision    recall  f1-score   support

         0.0       0.90      0.98      0.94       380
         1.0       0.78      0.37      0.50        68

    accuracy                           0.89       448
   macro avg       0.84      0.67      0.72       448
weighted avg       0.88      0.89      0.87       448

Random Oversampling
              precision    recall  f1-score   support

         0.0       0.93      0.85      0.89       380
         1.0       0.44      0.66      0.53        68

    accuracy                           0.82       448
   macro avg       0.69      0.75      0.71       448
weighted avg       0.86      0.82      0.83       448

SMOTE TOMEK
              precision    recall  f1-score   support

         0.0       0.94      0.84      0.89       380
         1.0       0.44      0.69      0.54        68

    accuracy                           0.82       448
   macro avg       0.69      0.77      0.71       448
weighted avg      

In [65]:
print("Değer Doldurma ve (Threshold Değiştirme)")
print(classification_report(y_test,
                           np.array(pd.DataFrame(clf[0].predict_proba(X_test)).applymap(lambda x: 1 if x>0.4 else 0)[1])))

Değer Doldurma ve (Threshold Değiştirme)
              precision    recall  f1-score   support

         0.0       0.91      0.97      0.94       380
         1.0       0.71      0.44      0.55        68

    accuracy                           0.89       448
   macro avg       0.81      0.70      0.74       448
weighted avg       0.88      0.89      0.88       448

