# Full ML Preprocessing Practice

## PART 1: Core Coding Test

* Task

    Write clean, production-quality code to:

    1. Load the dataset
    2. Separate features and target
    3. Perform a train-test split
    4. Handle missing values
    5. Encode categorical variables
    6. Scale numerical features
    7. Ensure NO DATA LEAKAGE
    8. Output X_train_final, X_test_final, y_train, y_test

### Import Libraries

In [1]:
# Import Pandas and Numpy
import pandas as pd
import numpy as np

### Load the data

In [2]:
# Load the dataset
df = pd.read_csv("data_day6_ml.csv")

# Get first 5 datapoints
df.head()

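| | age | salary | city | owns_house | target |
|---|---|---|---|---|---|
| 0 | 25.0 | 50000.0 | Delhi | Yes | 0 |
| 1 | 32.0 | 60000.0 | Mumbai | No | 1 |
| 2 | 45.0 | NaN | Bangalore | Yes | 1 |
| 3 | 28.0 | 52000.0 | Delhi | NaN | 0 |
| 4 | NaN | 58000.0 | Mumbai | No | 1 |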


### Data Exploration

In [3]:
# Data info for understanding data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   age         19 non-null     float64
 1   salary      16 non-null     float64
 2   city        18 non-null     object 
 3   owns_house  17 non-null     object 
 4   target      20 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 928.0+ bytes


In [4]:
# Shape and length of data
df.shape, len(df)

((20, 5), 20)

In [5]:
# Get summary statistics for the numerical columns
df.describe()

| | age | salary | target |
|---|---|---|---|
| count | 19.0 | 16.0 | 20.0 |
| mean | 36.736842 | 65187.5 | 0.65 |
| std | 8.258612 | 13823.741655 | 0.48936 |
| min | 25.0 | 48000.0 | 0.0 |
| 25% | 30.0 | 56500.0 | 0.0 |
| 50% | 36.0 | 61500.0 | 1.0 |
| 75% | 42.5 | 72750.0 | 1.0 |
| max | 52.0 | 95000.0 | 1.0 |


In [6]:
# Get column names
df.columns

Index(['age', 'salary', 'city', 'owns_house', 'target'], dtype='object')

### Preprocessing of data
1. Check for null values
2. Create train and test sets
3. Handle null or missing values
4. Encode the categorical values
5. Scale the numerical features

In [7]:
# Work on a copy of the dataset so the original data stays intact if a step fails
df_temp = df.copy()
df_temp

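| | age | salary | city | owns_house | target |
|---|---|---|---|---|---|
| 0 | 25.0 | 50000.0 | Delhi | Yes | 0 |
| 1 | 32.0 | 60000.0 | Mumbai | No | 1 |
| 2 | 45.0 | NaN | Bangalore | Yes | 1 |
| 3 | 28.0 | 52000.0 | Delhi | NaN | 0 |
| 4 | NaN | 58000.0 | Mumbai | No | 1 |
| 5 | 40.0 | 75000.0 | Bangalore | Yes | 1 |
| 6 | 35.0 | 62000.0 | NaN | No | 0 |
| 7 | 29.0 | NaN | Delhi | Yes | 0 |
| 8 | 50.0 | 90000.0 | Mumbai | Yes | 1 |
| 9 | 38.0 | 68000.0 | Bangalore | NaN | 1 |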


#### 1. Check for null values

In [8]:
# Check the null values
df_temp.isnull().any()

age            True
salary         True
city           True
owns_house     True
target        False
dtype: bool

In [9]:
# Get the total number of null values column wise
df_temp.isnull().sum()

age           1
salary        4
city          2
owns_house    3
target        0
dtype: int64

In [10]:
# Get the data type of each column to decide which processing to apply to it
df_temp.dtypes

age           float64
salary        float64
city           object
owns_house     object
target          int64
dtype: object

#### 2. Create train and test sets

In [11]:
# Create X and y 
X = df_temp.drop('target', axis = 1)
y = df_temp['target']

In [12]:
# Create Train and Test sets using train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((16, 4), (4, 4), (16,), (4,))
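
With only 20 rows and a target mean of 0.65 (see `df.describe()` above), a purely random split can leave the 4-row test set with a skewed class mix. One optional variation, sketched here on a hypothetical frame with the same class ratio, is to pass `stratify=y` so both splits keep roughly the original proportions. The column name `feature` and its values are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 20-row frame: 13 positives and 7 negatives (target mean 0.65)
X = pd.DataFrame({"feature": range(20)})
y = pd.Series([1] * 13 + [0] * 7, name="target")

# stratify=y asks the splitter to preserve the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Compare the share of positives in each split
print(y_train.mean(), y_test.mean())
```

Whether stratification is appropriate depends on the task; it is shown here only as an option, not as part of the original solution.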

#### 3. Handle null or missing values

In [13]:
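# Handle numerical null values: compute the medians on the training set only,
# then reuse them for the test set so no test statistic enters preprocessing
age_median = X_train['age'].median()
salary_median = X_train['salary'].median()

X_train['age'] = X_train['age'].fillna(age_median)
X_train['salary'] = X_train['salary'].fillna(salary_median)

X_test['age'] = X_test['age'].fillna(age_median)
X_test['salary'] = X_test['salary'].fillna(salary_median)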

In [14]:
# Handle categorical null values: take the modes from the training set only
city_mode = X_train['city'].mode()[0]
owns_house_mode = X_train['owns_house'].mode()[0]

X_train['city'] = X_train['city'].fillna(city_mode)
X_train['owns_house'] = X_train['owns_house'].fillna(owns_house_mode)

X_test['city'] = X_test['city'].fillna(city_mode)
X_test['owns_house'] = X_test['owns_house'].fillna(owns_house_mode)
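
Fill values are usually learned from the training set alone and then reused on the test set, so that no test-set statistic enters preprocessing. A minimal sketch of that pattern with scikit-learn's `SimpleImputer` follows. The frames `train` and `test` and their values are hypothetical, chosen only to make the behaviour easy to see.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical mini splits with one numerical and one categorical column
train = pd.DataFrame({
    "age": [25.0, np.nan, 45.0, 28.0],
    "city": ["Delhi", "Mumbai", np.nan, "Delhi"],
})
test = pd.DataFrame({
    "age": [np.nan, 50.0],
    "city": [np.nan, "Mumbai"],
})

num_imputer = SimpleImputer(strategy="median")         # learns the training median
cat_imputer = SimpleImputer(strategy="most_frequent")  # learns the training mode

train_filled = train.copy()
test_filled = test.copy()

# fit_transform on train learns the fill values; transform on test reuses them
train_filled["age"] = num_imputer.fit_transform(train[["age"]]).ravel()
test_filled["age"] = num_imputer.transform(test[["age"]]).ravel()
train_filled["city"] = cat_imputer.fit_transform(train[["city"]]).ravel()
test_filled["city"] = cat_imputer.transform(test[["city"]]).ravel()

print(test_filled)
```

In this sketch the missing test age is filled with 28.0 (the training median) and the missing test city with Delhi (the training mode), regardless of what the test set itself contains.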

In [15]:
X_train.head()

| | age | salary | city | owns_house |
|---|---|---|---|---|
| 8 | 50.0 | 90000.0 | Mumbai | Yes |
| 5 | 40.0 | 75000.0 | Bangalore | Yes |
| 11 | 26.0 | 48000.0 | Mumbai | No |
| 3 | 28.0 | 52000.0 | Delhi | Yes |
| 18 | 52.0 | 95000.0 | Mumbai | Yes |


In [16]:
X_test.head()

| | age | salary | city | owns_house |
|---|---|---|---|---|
| 0 | 25.0 | 50000.0 | Delhi | Yes |
| 17 | 27.0 | 51000.0 | Delhi | No |
| 15 | 36.0 | 64000.0 | Mumbai | No |
| 1 | 32.0 | 60000.0 | Mumbai | No |


In [17]:
# Check for remaining null values
X_train.isnull().sum(), X_test.isnull().sum()

(age           0
 salary        0
 city          0
 owns_house    0
 dtype: int64,
 age           0
 salary        0
 city          0
 owns_house    0
 dtype: int64)

#### 4. Encode the categorical values
A. Encode binary features (`owns_house`)

B. Encode nominal features (`city`)

In [18]:
# A. Encode the binary feature: Yes/No has no inherent order, so a 0/1 mapping is enough
enc_val  = {
    'Yes' : 1,
    'No' : 0
}

In [19]:
# Map the values onto both X_train and X_test
X_train['owns_house'] = X_train['owns_house'].map(enc_val)
X_test['owns_house'] = X_test['owns_house'].map(enc_val)
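
One thing worth checking after a dictionary mapping: `Series.map` returns `NaN` for any value that is not a key in the dictionary, so a stray spelling such as `yes` would silently reintroduce missing values. A small sketch of that check, using a hypothetical series `s`:

```python
import pandas as pd

enc_val = {"Yes": 1, "No": 0}

# Hypothetical column with one value ("yes") that is not a key in enc_val
s = pd.Series(["Yes", "No", "yes"])

mapped = s.map(enc_val)

# Values that failed to map show up as NaN; list the offending originals
unmapped = s[mapped.isna()]
print(unmapped.tolist())
```

If `unmapped` is non-empty, the raw values can be normalised (or the mapping extended) before encoding.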

In [20]:
# B. Encode Nominal features
X_train = pd.get_dummies(X_train, columns = ['city'])

X_test = pd.get_dummies(X_test, columns = ['city'])

### Problem with the code above
* Encoding the two sets separately can leave them with different columns. Here the test set contains no Bangalore rows, so:

`X_train has: city_Bangalore`

`X_test does NOT have: city_Bangalore`

* To fix this, align the test columns to the training columns:

In [21]:
X_train, X_test = X_train.align(
    X_test,
    join = 'left',
    axis = 1 , 
    fill_value = 0
)

### What this does:
* `axis=1` → align columns

    `axis=0` → rows

    `axis=1` → columns

* `join='left'` → the training set is the boss 👑

    `join='left'` means: "Keep all columns from X_train. Add missing ones to X_test." Columns that exist only in X_test are dropped.

* `fill_value=0` → safe default for one-hot encoding

    If a category is missing in test:
    `city_Bangalore = 0`
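
To make the effect of `align` concrete, here is a small self-contained sketch. The frames are hypothetical, and the city `Pune` is invented to show what happens to a category that appears only in the test set.

```python
import pandas as pd

# Hypothetical splits: "Pune" never appears in train, "Bangalore"/"Mumbai" never in test
train = pd.DataFrame({"city": ["Delhi", "Mumbai", "Bangalore"]})
test = pd.DataFrame({"city": ["Delhi", "Pune"]})

train_enc = pd.get_dummies(train, columns=["city"], dtype=int)
test_enc = pd.get_dummies(test, columns=["city"], dtype=int)

# Keep exactly the training columns; fill columns missing from test with 0
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(list(test_enc.columns))
print(test_enc)
```

After alignment the test frame has the three training columns, and the `Pune` row is all zeros because `city_Pune` was dropped. That is the intended behaviour when a model has only seen the training categories, but it does mean unseen categories carry no signal.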


#### 5. Scale the numerical features

In [22]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [23]:
# Get numerical feature columns
num_col = ['age', 'salary']
num_col

['age', 'salary']

In [24]:
# Fit the scaler on the training set only, then apply the same transformation to the test set

X_train[num_col] = scaler.fit_transform(X_train[num_col])
X_test[num_col] = scaler.transform(X_test[num_col])
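
The reason for calling `fit_transform` on the training set and only `transform` on the test set is that the test values are then standardised with the training mean and standard deviation. A short sketch that verifies this by hand, using hypothetical age values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature splits
train_age = np.array([[25.0], [32.0], [45.0], [28.0]])
test_age = np.array([[40.0]])

scaler = StandardScaler().fit(train_age)   # statistics come from train only
scaled_test = scaler.transform(test_age)   # test reuses those statistics

# Manual z-score with the training mean and population std (np.std uses ddof=0,
# which is the same convention StandardScaler uses)
manual = (test_age - train_age.mean()) / train_age.std()

print(np.allclose(scaled_test, manual))
```

Because the test set is scaled with training statistics, its scaled columns will generally not have mean 0 and standard deviation 1, which is expected.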

In [25]:
# Final X  train and test sets
X_train_final = X_train
X_test_final = X_test
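
The manual steps above can also be expressed as a single scikit-learn `ColumnTransformer`, which makes the fit-on-train, transform-on-test discipline harder to break. The following is a sketch under stated assumptions rather than the notebook's own solution: the frame `toy` reuses the first ten rows displayed earlier (the CSV file itself is not assumed to be available), and the step names (`impute`, `scale`, `encode`, `num`, `bin`, `nom`) are arbitrary labels.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# First ten rows of the dataset as displayed in the notebook
toy = pd.DataFrame({
    "age": [25, 32, 45, 28, np.nan, 40, 35, 29, 50, 38],
    "salary": [50000, 60000, np.nan, 52000, 58000, 75000, 62000, np.nan, 90000, 68000],
    "city": ["Delhi", "Mumbai", "Bangalore", "Delhi", "Mumbai",
             "Bangalore", np.nan, "Delhi", "Mumbai", "Bangalore"],
    "owns_house": ["Yes", "No", "Yes", np.nan, "No", "Yes", "No", "Yes", "Yes", np.nan],
    "target": [0, 1, 1, 0, 1, 1, 0, 0, 1, 1],
})

# Numerical columns: median imputation, then standardisation
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Binary column: mode imputation, then No -> 0, Yes -> 1
binary = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(categories=[["No", "Yes"]])),
])
# Nominal column: mode imputation, then one-hot; unseen test categories become all zeros
nominal = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "salary"]),
        ("bin", binary, ["owns_house"]),
        ("nom", nominal, ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = toy.drop(columns="target")
y = toy["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# All statistics (medians, modes, categories, means, stds) are learned from X_train only
X_train_final = preprocess.fit_transform(X_train)
X_test_final = preprocess.transform(X_test)

print(X_train_final.shape, X_test_final.shape)
```

With this design the column-matching problem does not arise, because the one-hot categories are fixed when the encoder is fitted on the training data.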

### Output Final Sets

In [26]:
# Output the final sets
print(f"Final X_train is  : {X_train_final.head()}\n")
print(f"Final y_train is  : {y_train.head()}\n")
print(f"Final X_test is  : {X_test_final.head()}\n")
print(f"Final y_test is  : {y_test.head()}\n")

Final X_train is  :          age    salary  owns_house  city_Bangalore  city_Delhi  city_Mumbai
8   1.505115  1.874968           1               0           0            1
5   0.189168  0.631895           1               1           0            0
11 -1.653160 -1.605636           0               0           0            1
3  -1.389970 -1.274150           1               0           1            0
18  1.768305  2.289326           1               0           0            1

Final y_train is  : 8     1
5     1
11    0
3     0
18    1
Name: target, dtype: int64

Final X_test is  :          age    salary  owns_house  city_Bangalore  city_Delhi  city_Mumbai
0  -1.784754 -1.439893           1               0           1            0
17 -1.521565 -1.357021           0               0           1            0
15 -0.337212 -0.279691           0               0           0            1
1  -0.863591 -0.611177           0               0           0            1

Final y_test is  : 0     0
17    0
