**Data Preprocessing**

Importing the Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Importing the Dataset

In [2]:
dataset = pd.read_csv('/content/Data.csv')
# iloc[:, :-1] selects all rows and every column except the last one (the features)
X = dataset.iloc[:, :-1].values
# iloc[:, -1] selects all rows of the last column (the target)
y = dataset.iloc[:, -1].values

In [3]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [4]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
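The slicing used above can be tried on a small frame. This is a minimal sketch: the column names are hypothetical (the source does not show the header of Data.csv), and the three rows are copied from the printed output.

```python
import pandas as pd

# Hypothetical column names; the values are the first three rows printed above
demo = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Age': [44.0, 27.0, 30.0],
    'Salary': [72000.0, 48000.0, 54000.0],
    'Purchased': ['No', 'Yes', 'No'],
})

X_demo = demo.iloc[:, :-1].values  # all rows, every column except the last
y_demo = demo.iloc[:, -1].values   # all rows, last column only

print(X_demo.shape)  # (3, 3)
print(y_demo.shape)  # (3,)
```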


Taking Care of Missing Data

In [5]:
# Import the SimpleImputer class from sklearn.impute for handling missing data
from sklearn.impute import SimpleImputer

# Create an instance of SimpleImputer to replace missing values (np.nan) with the mean of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on the 2nd and 3rd columns (index 1 and 2) of X to compute the mean for each column
imputer.fit(X[:, 1:3])

# Replace the missing values in the 2nd and 3rd columns of X with the computed mean values
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [6]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
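To see where the filled-in values come from, the sketch below refits an imputer on the age and salary columns copied from the first printed X and compares its learned statistics with a hand-computed mean that ignores NaN entries. The names `numeric` and `demo_imputer` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and salary columns copied from the printed X before imputation
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

demo_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = demo_imputer.fit_transform(numeric)

# statistics_ holds the per-column fill values learned by fit
print(demo_imputer.statistics_)
# The same values computed by hand, ignoring NaN entries
print(np.nanmean(numeric, axis=0))
```

Each column's mean is taken over its nine known values, which is why the missing age becomes 38.78 and the missing salary 63777.78.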


Encoding Categorical Data

Encoding the Independent Variable

In [7]:
# ColumnTransformer applies a transformation (such as encoding or scaling) to selected columns and leaves the rest unchanged
from sklearn.compose import ColumnTransformer
# OneHotEncoder converts a categorical column into one binary column per category
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the first column (index 0); remainder='passthrough' keeps all other columns as they are
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# Apply the transformation and convert the resulting structure to a NumPy array for easier handling.
X = np.array(ct.fit_transform(X))

In [8]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
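The order of the three new columns follows the encoder's sorted category list. The sketch below uses OneHotEncoder directly on a small country column (without the ColumnTransformer wrapper, to keep it short) and prints that list; `countries` and `demo_encoder` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Country column as a 2D array, since OneHotEncoder expects shape (n_samples, n_features)
countries = np.array([['France'], ['Spain'], ['Germany'], ['Spain']])

demo_encoder = OneHotEncoder()
# The default output is a sparse matrix, so convert it to a dense array for printing
encoded = demo_encoder.fit_transform(countries).toarray()

print(demo_encoder.categories_[0])  # categories in sorted order
print(encoded)
```

The categories come out as France, Germany, Spain, which matches the output above: France is encoded as 1 0 0, Germany as 0 1 0, and Spain as 0 0 1.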


Encoding the Dependent Variable

In [9]:
from sklearn.preprocessing import LabelEncoder
# LabelEncoder maps each class label to an integer code (here 'No' becomes 0 and 'Yes' becomes 1)
le = LabelEncoder()
y = le.fit_transform(y)

In [10]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
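The mapping from labels to integers can be inspected through the fitted encoder. A minimal sketch, using the label values printed earlier; `demo_le` and `labels` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

demo_le = LabelEncoder()
encoded_labels = demo_le.fit_transform(labels)

# The position of a label in classes_ is its integer code
print(demo_le.classes_)
print(encoded_labels)
# inverse_transform maps the integer codes back to the original strings
restored = demo_le.inverse_transform(encoded_labels)
print(restored)
```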


Splitting the Dataset into the Training Set and the Test Set

In [11]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows as the test set; random_state=1 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [12]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [13]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [14]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [16]:
print(y_test)

[0 1]
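The effect of test_size and random_state can be checked on placeholder data. In this sketch, `rows` and `targets` are stand-ins for the 10 rows of X and y, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(10).reshape(-1, 1)  # stand-in for the 10 rows of X
targets = np.arange(10)              # stand-in for y

# Two calls with the same random_state should produce the same split
a_train, a_test, _, _ = train_test_split(rows, targets, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(rows, targets, test_size=0.2, random_state=1)

print(len(a_train), len(a_test))       # sizes of the two partitions
print(np.array_equal(a_test, b_test))  # whether the two splits match
```

With 10 rows and test_size=0.2, 8 rows go to the training set and 2 to the test set, which matches the shapes printed above.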


Feature Scaling

In [17]:
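from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Scale only the numeric columns (index 3 onward); the one-hot columns stay as 0/1
# Fit on the training set, then reuse the same mean and standard deviation on the test set
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])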

In [18]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [19]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]


1. Standardization and fit_transform
sc.fit_transform(X_train[:, 3:]):
fit step: The mean and standard deviation of each column are computed from the training set.
transform step: Each value in the training set is standardized using the computed mean and standard deviation.
Both steps are applied to the training set because the model is trained on this data, and no information from the test set's statistics may be used.
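The two steps amount to z = (x - mean) / std for each column, where StandardScaler uses the population standard deviation. The sketch below reproduces that by hand on the age and salary columns copied from the printed X_train; `train`, `demo_sc`, and `manual` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and salary columns copied from the printed X_train before scaling
train = np.array([
    [38.77777777777778, 52000.0], [40.0, 63777.77777777778], [44.0, 72000.0],
    [38.0, 61000.0], [27.0, 48000.0], [48.0, 79000.0], [50.0, 83000.0], [35.0, 58000.0],
])

demo_sc = StandardScaler()
scaled = demo_sc.fit_transform(train)

# Same computation by hand: subtract the column mean, divide by the column standard deviation
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(demo_sc.mean_, demo_sc.scale_)  # statistics learned by fit
print(np.allclose(scaled, manual))    # whether both computations agree
```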

2. Applying Only transform to the Test Set
sc.transform(X_test[:, 3:]):
The test set is transformed using only the mean and standard deviation learned from the training set.
fit is not applied to the test set, because the test set's statistics must never reach the model.

3. Why? (Data Leakage)
If fit were also applied to the test set:

The model would know properties of the test set:

The mean and standard deviation learned from the test data would be passed to the model indirectly.
This undermines the test set's role as "unseen data", so the model's true performance cannot be measured accurately.

A real-world scenario could not be simulated:

In practice a model is trained only on training data. The test set consists of entirely new data that the model has not seen before.
Using the test set's statistics prevents the evaluation from simulating the conditions the model will face in practice.
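A small comparison makes the difference concrete. The sketch below scales the two test ages once with statistics from the training ages and once with a scaler fit on the test ages themselves. The values are copied from the printed X_train and X_test; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age column copied from the printed X_train and X_test before scaling
train_ages = np.array([[38.77777777777778], [40.0], [44.0], [38.0],
                       [27.0], [48.0], [50.0], [35.0]])
test_ages = np.array([[30.0], [37.0]])

# Correct: the statistics come from the training set only
train_scaler = StandardScaler().fit(train_ages)
test_correct = train_scaler.transform(test_ages)

# Leaky: a second scaler is fit on the test set itself
test_leaky = StandardScaler().fit_transform(test_ages)

print(test_correct.ravel())
print(test_leaky.ravel())
```

With training statistics, both test ages come out negative, because both people are younger than the training average. Fitting on the test set recenters them to -1 and 1 and hides that shift. The same reasoning applies to other fitted preprocessing steps such as SimpleImputer: fitting them after the split, on the training rows only, keeps test-set statistics out of the model.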