## Task: Manual Preprocessing (Numeric + Categorical)

1. Split the data into train (first 3 rows) and test (last 2 rows).
2. Impute numeric columns manually:
   - Use mean imputation.
   - Compute the mean only from train.
   - Apply to both train and test.
3. Standard-scale numeric columns:
   - Compute mean and std from train only.
   - Apply to both train and test.
4. One-hot encode City manually:
   - Use categories from train only.
   - Ensure test has the same columns.
   - Unseen categories (if any) → all zeros.
5. Print:
   - Final X_train
   - Final X_test

In [90]:
# Import the Libraries
import pandas as pd
import numpy as np

In [91]:
data = {
    "Age": [25, np.nan, 35, 45, np.nan],
    "Salary": [50000, 60000, np.nan, 80000, 90000],
    "City": ["Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi"]
}


In [92]:
df = pd.DataFrame(data)
df.head()

|   | Age | Salary | City |
|---|-----|--------|------|
| 0 | 25.0 | 50000.0 | Mumbai |
| 1 | NaN | 60000.0 | Delhi |
| 2 | 35.0 | NaN | Mumbai |
| 3 | 45.0 | 80000.0 | Chennai |
| 4 | NaN | 90000.0 | Delhi |


In [93]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     3 non-null      float64
 1   Salary  4 non-null      float64
 2   City    5 non-null      object 
dtypes: float64(2), object(1)
memory usage: 248.0+ bytes


## Splitting the data

In [94]:
# Salary is used as the target here, so Age is the only numeric feature left in X
X = df.drop("Salary", axis=1)
y = df["Salary"]
X.shape, y.shape

((5, 2), (5,))

In [95]:
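from sklearn.model_selection import train_test_split

# Note: this is a shuffled split (2 of the 5 rows go to test),
# not the positional first-3 / last-2 split described in the task
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)
X_train.shape, X_test.shape, y_train.shape, y_test.shape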

((3, 2), (2, 2), (3,), (2,))
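
The task statement asks for a fixed positional split (first 3 rows for train, last 2 for test), whereas `train_test_split` shuffles the rows. A minimal sketch of the positional variant follows, assuming Salary stays in the frame as a numeric feature; the names `train` and `test` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, np.nan, 35, 45, np.nan],
    "Salary": [50000, 60000, np.nan, 80000, 90000],
    "City": ["Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi"],
})

# Positional split: rows 0-2 go to train, rows 3-4 go to test.
# .copy() gives independent frames, so later edits do not touch df.
train = df.iloc[:3].copy()
test = df.iloc[3:].copy()

print(train.shape, test.shape)
```

With this split the train rows contain Mumbai and Delhi only, so Chennai becomes an unseen category in test.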

## Handling missing values using mean

In [96]:
X_train.isnull().sum(), X_test.isnull().sum()

(Age     2
 City    0
 dtype: int64,
 Age     0
 City    0
 dtype: int64)

In [97]:
# Fill missing Age values in train with the train mean (assign back instead of inplace=True)
X_train["Age"] = X_train["Age"].fillna(X_train["Age"].mean())

In [98]:
# Use the train mean for test as well, so no test statistics leak into preprocessing
X_test["Age"] = X_test["Age"].fillna(X_train["Age"].mean())

In [99]:
X_train.isnull().sum(), X_test.isnull().sum()

(Age     0
 City    0
 dtype: int64,
 Age     0
 City    0
 dtype: int64)

In [100]:
X_train, X_test

(    Age     City
 1  45.0    Delhi
 4  45.0    Delhi
 3  45.0  Chennai,
     Age    City
 2  35.0  Mumbai
 0  25.0  Mumbai)
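
The imputation rule from the task (compute the mean from train only, then apply it to both splits) can be sketched for several numeric columns at once. This is a minimal illustration, assuming a first-3 / last-2 positional split with Salary kept as a feature; `train`, `test`, and `train_means` are hypothetical names, not variables from the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative splits: first 3 rows as train, last 2 rows as test.
train = pd.DataFrame({"Age": [25.0, np.nan, 35.0],
                      "Salary": [50000.0, 60000.0, np.nan]})
test = pd.DataFrame({"Age": [45.0, np.nan],
                     "Salary": [80000.0, 90000.0]}, index=[3, 4])

num_cols = ["Age", "Salary"]

# Means are learned from train only; NaNs are skipped by default.
train_means = train[num_cols].mean()

# The same train means fill the gaps in both splits.
# fillna with a Series matches its index against the column names.
train[num_cols] = train[num_cols].fillna(train_means)
test[num_cols] = test[num_cols].fillna(train_means)

print(train)
print(test)
```

Reusing `train_means` for the test rows keeps test statistics out of the fitted preprocessing.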

## Standard Scaling

In [101]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
num_cols = ['Age']

In [102]:
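# Fit on train only, then reuse the train mean and std for test.
# Every train Age is 45.0 after imputation, so the variance is zero
# and StandardScaler falls back to a scale of 1 (values are only centered).
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])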

In [103]:
X_train, X_test

(   Age     City
 1  0.0    Delhi
 4  0.0    Delhi
 3  0.0  Chennai,
     Age    City
 2 -10.0  Mumbai
 0 -20.0  Mumbai)
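
Because the task title asks for manual preprocessing, the scaling step can also be written without `StandardScaler`. The sketch below is one possible version, assuming already-imputed illustrative values; `mu`, `sigma`, `train_scaled`, and `test_scaled` are hypothetical names. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, and replaces a zero std with 1 so that constant columns are centered rather than divided by zero.

```python
import pandas as pd

# Illustrative, already-imputed splits (not the notebook's random split).
train = pd.DataFrame({"Age": [25.0, 30.0, 35.0],
                      "Salary": [50000.0, 60000.0, 55000.0]})
test = pd.DataFrame({"Age": [45.0, 30.0],
                     "Salary": [80000.0, 90000.0]}, index=[3, 4])

num_cols = ["Age", "Salary"]

# Statistics come from train only.
mu = train[num_cols].mean()
sigma = train[num_cols].std(ddof=0)  # population std

# Guard against constant columns: a zero std would cause division by zero.
sigma = sigma.replace(0, 1.0)

# Apply the train statistics to both splits.
train_scaled = (train[num_cols] - mu) / sigma
test_scaled = (test[num_cols] - mu) / sigma

print(train_scaled)
print(test_scaled)
```

The zero-std guard explains the notebook output above: all train ages equal 45.0, so train becomes 0.0 and the test ages 35.0 and 25.0 become -10.0 and -20.0.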

## One-Hot Encoding

In [104]:
cat_col = ["City"]

In [105]:
X_train_final = pd.get_dummies(X_train, columns = cat_col)

In [106]:
X_train

|   | Age | City |
|---|-----|------|
| 1 | 0.0 | Delhi |
| 4 | 0.0 | Delhi |
| 3 | 0.0 | Chennai |


In [107]:
X_test = pd.get_dummies(X_test, columns=cat_col)

In [108]:
X_test

|   | Age | City_Mumbai |
|---|-----|-------------|
| 2 | -10.0 | 1 |
| 0 | -20.0 | 1 |


In [109]:
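# Align test to the train columns: categories unseen in train (City_Mumbai)
# are dropped, and train categories missing from test are added as zeros
X_test_final = X_test.reindex(
    columns=X_train_final.columns,
    fill_value=0
)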

In [110]:
X_test_final

|   | Age | City_Chennai | City_Delhi |
|---|-----|--------------|------------|
| 2 | -10.0 | 0 | 0 |
| 0 | -20.0 | 0 | 0 |
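
The same encoding can be done by hand, as the task asks, instead of combining `pd.get_dummies` with `reindex`. The following is a minimal sketch, assuming illustrative City values from a first-3 / last-2 split; the helper `one_hot` and the names `train_city`, `test_city`, and `categories` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Illustrative City columns: train has no Chennai, so it is unseen in test.
train_city = pd.Series(["Mumbai", "Delhi", "Mumbai"], name="City")
test_city = pd.Series(["Chennai", "Delhi"], index=[3, 4], name="City")

# Categories are learned from train only.
categories = sorted(train_city.unique())

def one_hot(col, categories):
    """Build one 0/1 column per known category; unknown values get all zeros."""
    return pd.DataFrame(
        {f"{col.name}_{cat}": (col == cat).astype(int) for cat in categories},
        index=col.index,
    )

# Both splits are encoded against the same train categories,
# so they end up with identical columns.
train_ohe = one_hot(train_city, categories)
test_ohe = one_hot(test_city, categories)

print(train_ohe)
print(test_ohe)
```

Because each column is built by comparing against a train category, an unseen test value such as Chennai matches none of them and yields a row of zeros.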


In [111]:
X_train_final

|   | Age | City_Chennai | City_Delhi |
|---|-----|--------------|------------|
| 1 | 0.0 | 0 | 1 |
| 4 | 0.0 | 0 | 1 |
| 3 | 0.0 | 1 | 0 |
