## Scenario (Realistic)

You are given customer data for a small housing platform.
Your task is to prepare ML-ready data to predict house prices.

## STEP 0 — Create the Dataset 

In [1]:
import pandas as pd
import numpy as np

In [51]:
data = {
    "City": ["Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi", "Mumbai", None],
    "Education": ["Bachelor", "Master", "PhD", "Bachelor", "Master", None, "PhD"],
    "Age": [25, 35, 45, np.nan, 29, 40, 50],
    "Experience": [2, 10, 20, 3, np.nan, 15, 25],
    "Price": [10, 18, 30, 12, 15, 25, 35]
}

In [52]:
df = pd.DataFrame(data)

In [53]:
df

Unnamed: 0,City,Education,Age,Experience,Price
0,Mumbai,Bachelor,25.0,2.0,10
1,Delhi,Master,35.0,10.0,18
2,Mumbai,PhD,45.0,20.0,30
3,Chennai,Bachelor,,3.0,12
4,Delhi,Master,29.0,,15
5,Mumbai,,40.0,15.0,25
6,,PhD,50.0,25.0,35


## PART 1 — Pandas Basics + Inspection
    Task 1
        Print:
        shape
        column names
        data types
    
    Identify:
        which columns have missing values
        how many missing values per column

In [5]:
df.shape

(7, 5)

In [6]:
df.columns

Index(['City', 'Education', 'Age', 'Experience', 'Price'], dtype='object')

In [7]:
df.dtypes

City           object
Education      object
Age           float64
Experience    float64
Price           int64
dtype: object

In [8]:
df.isna().any()

City           True
Education      True
Age            True
Experience     True
Price         False
dtype: bool

In [9]:
df.isnull().sum()

City          1
Education     1
Age           1
Experience    1
Price         0
dtype: int64
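The separate inspection calls above can also be gathered into one summary table. This is a minimal sketch; `df_demo`, `summary`, and `cols_with_na` are illustrative names that do not appear in the notebook, and the data is rebuilt so the cell runs on its own.

```python
import numpy as np
import pandas as pd

# Same toy data as in STEP 0, rebuilt so this cell is self-contained.
df_demo = pd.DataFrame({
    "City": ["Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi", "Mumbai", None],
    "Education": ["Bachelor", "Master", "PhD", "Bachelor", "Master", None, "PhD"],
    "Age": [25, 35, 45, np.nan, 29, 40, 50],
    "Experience": [2, 10, 20, 3, np.nan, 15, 25],
    "Price": [10, 18, 30, 12, 15, 25, 35],
})

# One row per column: dtype, missing count, and missing share in percent.
summary = pd.DataFrame({
    "dtype": df_demo.dtypes.astype(str),
    "n_missing": df_demo.isna().sum(),
    "pct_missing": (df_demo.isna().mean() * 100).round(1),
})
print(summary)

# Names of the columns that contain at least one missing value.
cols_with_na = df_demo.columns[df_demo.isna().any()].tolist()
print(cols_with_na)
```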

## PART 2 — Handling Missing Values
    Task 2
        Fill missing values using ML-safe logic:
        City → fill with most frequent value
        Education → fill with most frequent value
        Age → fill with mean
        Experience → fill with median

In [60]:
# Assign the result back: inplace=True on a selected column is deprecated in recent pandas
df["City"] = df["City"].fillna(df["City"].mode()[0])
df

Unnamed: 0,City,Education,Age,Experience,Price
0,Mumbai,Bachelor,25.0,2.0,10
1,Delhi,Master,35.0,10.0,18
2,Mumbai,PhD,45.0,20.0,30
3,Chennai,Bachelor,,3.0,12
4,Delhi,Master,29.0,,15
5,Mumbai,,40.0,15.0,25
6,Mumbai,PhD,50.0,25.0,35


In [61]:
# Bachelor, Master, and PhD each appear twice; mode() returns ties sorted, so [0] is "Bachelor"
df["Education"] = df["Education"].fillna(df["Education"].mode()[0])
df

Unnamed: 0,City,Education,Age,Experience,Price
0,Mumbai,Bachelor,25.0,2.0,10
1,Delhi,Master,35.0,10.0,18
2,Mumbai,PhD,45.0,20.0,30
3,Chennai,Bachelor,,3.0,12
4,Delhi,Master,29.0,,15
5,Mumbai,Bachelor,40.0,15.0,25
6,Mumbai,PhD,50.0,25.0,35


In [62]:
df["Age"] = df["Age"].fillna(df["Age"].mean())
df

Unnamed: 0,City,Education,Age,Experience,Price
0,Mumbai,Bachelor,25.0,2.0,10
1,Delhi,Master,35.0,10.0,18
2,Mumbai,PhD,45.0,20.0,30
3,Chennai,Bachelor,37.0,3.0,12
4,Delhi,Master,29.0,,15
5,Mumbai,Bachelor,40.0,15.0,25
6,Mumbai,PhD,50.0,25.0,35


In [64]:
df["Experience"] = df["Experience"].fillna(df["Experience"].median())
df

Unnamed: 0,City,Education,Age,Experience,Price
0,Mumbai,Bachelor,25.0,2.0,10
1,Delhi,Master,35.0,10.0,18
2,Mumbai,PhD,45.0,20.0,30
3,Chennai,Bachelor,37.0,3.0,12
4,Delhi,Master,29.0,12.5,15
5,Mumbai,Bachelor,40.0,15.0,25
6,Mumbai,PhD,50.0,25.0,35


In [65]:
df.isnull().sum()

City          0
Education     0
Age           0
Experience    0
Price         0
dtype: int64
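In a real pipeline, "ML-safe" usually also means that the fill values are learned from the training rows only and then reused on unseen rows, so no information leaks from test data. The sketch below assumes a hypothetical split (first five rows as training, last two as unseen); `df_demo`, `train`, `test`, and `fill_values` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "City": ["Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi", "Mumbai", None],
    "Education": ["Bachelor", "Master", "PhD", "Bachelor", "Master", None, "PhD"],
    "Age": [25, 35, 45, np.nan, 29, 40, 50],
    "Experience": [2, 10, 20, 3, np.nan, 15, 25],
    "Price": [10, 18, 30, 12, 15, 25, 35],
})

# Hypothetical split: rows 0-4 are "training", rows 5-6 are "unseen".
train = df_demo.iloc[:5].copy()
test = df_demo.iloc[5:].copy()

# Learn the fill values from the training rows only.
# mode() returns ties in sorted order, so [0] breaks ties alphabetically.
fill_values = {
    "City": train["City"].mode()[0],
    "Education": train["Education"].mode()[0],
    "Age": train["Age"].mean(),
    "Experience": train["Experience"].median(),
}

# Apply the same values to both parts.
train = train.fillna(fill_values)
test = test.fillna(fill_values)

print(fill_values)
print(test)
```

On a dataset this small the split is only for illustration; the point is that `test` never contributes to `fill_values`.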

## PART 3 — Encoding
    Task 3 — Decide encoding (THINK FIRST)
    
        Answer mentally, and say why:
        
        City → ❓ One-Hot or Label?
        -> One-Hot, because City is nominal (unordered) data.
        
        Education → ❓ One-Hot or Label?
        -> Label (ordinal), because Education is ordered data: Bachelor < Master < PhD.

## Task 4 — Encode correctly

In [66]:
label_encode = {
    'Bachelor' : 1,
    "Master" : 2,
    "PhD" : 3
}

In [67]:
df['Education'] = df['Education'].map(label_encode)

In [68]:
df

Unnamed: 0,City,Education,Age,Experience,Price
0,Mumbai,1,25.0,2.0,10
1,Delhi,2,35.0,10.0,18
2,Mumbai,3,45.0,20.0,30
3,Chennai,1,37.0,3.0,12
4,Delhi,2,29.0,12.5,15
5,Mumbai,1,40.0,15.0,25
6,Mumbai,3,50.0,25.0,35


In [71]:
# dtype=int keeps the indicator columns as 0/1; pandas 2.x defaults to True/False
df_encoded = pd.get_dummies(df, columns=['City'], dtype=int)

In [76]:
df_encoded

Unnamed: 0,Education,Age,Experience,Price,City_Chennai,City_Delhi,City_Mumbai
0,1,25.0,2.0,10,0,0,1
1,2,35.0,10.0,18,0,1,0
2,3,45.0,20.0,30,0,0,1
3,1,37.0,3.0,12,1,0,0
4,2,29.0,12.5,15,0,1,0
5,1,40.0,15.0,25,0,0,1
6,3,50.0,25.0,35,0,0,1
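As an alternative to the hand-written mapping dictionary, the order of Education can be declared once with an ordered `pd.Categorical`. This is a sketch on a shortened, hypothetical four-row frame (`df_demo`, `levels`, and `encoded` are illustrative names), not the notebook's own method.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "City": ["Mumbai", "Delhi", "Mumbai", "Chennai"],
    "Education": ["Bachelor", "Master", "PhD", "Bachelor"],
})

# Ordinal feature: declare the order explicitly, then take the integer codes.
levels = ["Bachelor", "Master", "PhD"]
edu = pd.Categorical(df_demo["Education"], categories=levels, ordered=True)

# Codes start at 0, so +1 reproduces the 1/2/3 mapping used above.
# A value outside `levels` would get code -1, i.e. 0 after the shift.
df_demo["Education"] = edu.codes + 1

# Nominal feature: one indicator column per city, stored as 0/1 integers.
encoded = pd.get_dummies(df_demo, columns=["City"], dtype=int)
print(encoded)
```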


## PART 4 — NumPy Conversion

In [80]:
X = df_encoded.drop("Price", axis = 1).values
y = df_encoded["Price"].values

X, y

(array([[ 1. , 25. ,  2. ,  0. ,  0. ,  1. ],
        [ 2. , 35. , 10. ,  0. ,  1. ,  0. ],
        [ 3. , 45. , 20. ,  0. ,  0. ,  1. ],
        [ 1. , 37. ,  3. ,  1. ,  0. ,  0. ],
        [ 2. , 29. , 12.5,  0. ,  1. ,  0. ],
        [ 1. , 40. , 15. ,  0. ,  0. ,  1. ],
        [ 3. , 50. , 25. ,  0. ,  0. ,  1. ]]),
 array([10, 18, 30, 12, 15, 25, 35], dtype=int64))

In [81]:
X.shape, y.shape

((7, 6), (7,))

## PART 5 — NumPy Thinking
    Task 6 — NumPy operations

In [83]:
df.mean(numeric_only = True)

Education      1.857143
Age           37.285714
Experience    12.500000
Price         20.714286
dtype: float64

In [84]:
df['Age'].min(), df['Age'].max()

(25.0, 50.0)
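The two cells above compute the statistics with pandas. On a plain NumPy array the same column-wise operations might look like the sketch below; `X_num` is an illustrative array holding the imputed Age and Experience values copied from the table in Part 2, and the min-max scaling step is an added example of broadcasting, not part of the original task.

```python
import numpy as np

# Age and Experience after imputation, one row per house.
X_num = np.array([
    [25.0, 2.0],
    [35.0, 10.0],
    [45.0, 20.0],
    [37.0, 3.0],
    [29.0, 12.5],
    [40.0, 15.0],
    [50.0, 25.0],
])

# axis=0 reduces over rows, giving one value per column.
col_mean = X_num.mean(axis=0)
col_min = X_num.min(axis=0)
col_max = X_num.max(axis=0)

# Min-max scaling via broadcasting: each column is mapped onto [0, 1].
X_scaled = (X_num - col_min) / (col_max - col_min)

print(col_mean, col_min, col_max)
print(X_scaled)
```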

## PART 6 — ML Mindset Check

### 1. Why should encoding happen after missing value handling?
    -> So that missing or null values are not encoded and the model does not learn an unnecessary feature.
    
    -->> Encoding should be done after handling missing values because encoders do not process NaNs correctly: a plain map leaves them as NaN, pd.get_dummies gives the row all zeros by default, and some encoders treat NaN as a separate category. Each of these introduces noise or misleading features into the model.
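A small sketch of what happens when encoding runs before imputation. The three-row frame `raw` is hypothetical and only meant to show how `map` and `pd.get_dummies` behave on missing values.

```python
import pandas as pd

raw = pd.DataFrame({
    "City": ["Mumbai", None, "Delhi"],
    "Education": ["Bachelor", "PhD", None],
})

# get_dummies skips missing values by default: row 1 gets all zeros.
city_default = pd.get_dummies(raw["City"], dtype=int)

# With dummy_na=True the missing value becomes its own indicator column.
city_with_na = pd.get_dummies(raw["City"], dummy_na=True, dtype=int)

# map() leaves missing entries as NaN, so the column turns into floats.
edu_codes = raw["Education"].map({"Bachelor": 1, "Master": 2, "PhD": 3})

print(city_default)
print(city_with_na)
print(edu_codes)
```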


### 2. Why would KNN fail if City was Label Encoded?
    -> Because City is unordered data, and Label Encoding it introduces an artificial ranking and arbitrary numeric distances between cities. That is dangerous in KNN, which classifies points by distance-based similarity.
    -->> KNN relies on distance metrics, and Label Encoding introduces artificial order and numeric distances between nominal categories like cities, causing the model to compute meaningless similarities and select incorrect neighbors.
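The distance argument can be checked with a few lines of NumPy. The label numbers below are a hypothetical assignment (any other assignment would be just as arbitrary), and `dist_label` / `dist_one_hot` are illustrative helper names.

```python
import numpy as np

# Hypothetical label encoding: the numbers carry no real meaning.
label = {"Chennai": 0, "Delhi": 1, "Mumbai": 2}

# One-hot encoding of the same three cities.
one_hot = {
    "Chennai": np.array([1, 0, 0]),
    "Delhi": np.array([0, 1, 0]),
    "Mumbai": np.array([0, 0, 1]),
}

def dist_label(a, b):
    """Distance between two cities under label encoding."""
    return abs(label[a] - label[b])

def dist_one_hot(a, b):
    """Euclidean distance between two cities under one-hot encoding."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

# Label encoding claims Chennai is twice as far from Mumbai as from Delhi.
print(dist_label("Chennai", "Delhi"), dist_label("Chennai", "Mumbai"))

# One-hot encoding keeps every pair of distinct cities equally far apart.
print(dist_one_hot("Chennai", "Delhi"), dist_one_hot("Chennai", "Mumbai"))
```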
    
### 3. For which columns would a Decision Tree care least about scaling?
    -> The numerical columns (Age, Experience), because a Decision Tree splits on thresholds within a single feature and not on distances between points.
    -->> Decision Trees care least about the scaling of numerical features because they split based on thresholds rather than distances. Rescaling keeps the order of the values, so the same rows end up on each side of every split.
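To make the threshold argument concrete, the sketch below finds the best single split on Age (minimum summed squared error, the usual regression-tree criterion) before and after min-max scaling. `best_split_partition` is a hypothetical helper written for this illustration, not a library function; the Age and Price values are taken from the imputed table.

```python
import numpy as np

def best_split_partition(x, y):
    """Return the boolean 'goes left' mask of the best single-threshold split.

    The split minimises the summed squared error of the two groups.
    """
    best_sse, best_mask = np.inf, None
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive sorted unique values.
    for t in (values[:-1] + values[1:]) / 2:
        left = x <= t
        sse_left = ((y[left] - y[left].mean()) ** 2).sum()
        sse_right = ((y[~left] - y[~left].mean()) ** 2).sum()
        if sse_left + sse_right < best_sse:
            best_sse, best_mask = sse_left + sse_right, left
    return best_mask

age = np.array([25.0, 35.0, 45.0, 37.0, 29.0, 40.0, 50.0])
price = np.array([10.0, 18.0, 30.0, 12.0, 15.0, 25.0, 35.0])

# Same feature, once raw and once min-max scaled to [0, 1].
age_scaled = (age - age.min()) / (age.max() - age.min())

mask_raw = best_split_partition(age, price)
mask_scaled = best_split_partition(age_scaled, price)

# The threshold value changes with the scale, but the rows on each side do not.
print(np.array_equal(mask_raw, mask_scaled))
```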