# Data Preprocessing Workflow

__Scenario__
You are given a dataset of house prices with features such as size, location, number of bedrooms, and price. The dataset contains missing values and categorical variables.

1. Load and Explore the Dataset
    * Create a pandas DataFrame with the following data:
        | Size (sqft) | Location   | Bedrooms | Price ($) |
        |-------------|------------|----------|-----------|
        | 1500        | Downtown   | 3        | 400000    |
        | 1700        | Suburban   | 4        | 450000    |
        | 1600        | Downtown   | NaN      | 420000    |
        | NaN         | Rural      | 2        | 200000    |
        | 1800        | Suburban   | 3        | NaN       |

2. Handle Missing Values. Fill missing values as follows:
    * Bedrooms: Use the median value.
    * Size: Use the mean value.
    * Price: Use the median value.

3. Encode Categorical Data
    * Convert the Location column into numerical values using one-hot encoding.

4. Feature Scaling
    * Standardize the numerical columns: Size, Bedrooms, and Price.

5. Split the Data
    * Divide the preprocessed dataset into 80% training and 20% testing sets.
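
Before imputing anything, it can help to confirm where the gaps are and which fill values steps 1 and 2 imply. The following is a minimal sketch of that exploration; the variable names `missing_counts` and `fill_values` are illustrative and not part of the exercise.

```python
import numpy as np
import pandas as pd

# Same five rows as the table in step 1.
df = pd.DataFrame({
    'Size (sqft)': [1500, 1700, 1600, np.nan, 1800],
    'Location': ['Downtown', 'Suburban', 'Downtown', 'Rural', 'Suburban'],
    'Bedrooms': [3, 4, np.nan, 2, 3],
    'Price ($)': [400000, 450000, 420000, 200000, np.nan],
})

# Count missing values per column to see which columns need imputing.
missing_counts = df.isna().sum()
print(missing_counts)

# Column dtypes show which features are numeric and which are categorical.
print(df.dtypes)

# Fill values that step 2 asks for; median() and mean() skip NaN by default.
fill_values = {
    'Bedrooms': df['Bedrooms'].median(),
    'Size (sqft)': df['Size (sqft)'].mean(),
    'Price ($)': df['Price ($)'].median(),
}
print(fill_values)
```

With this data, each numeric column has exactly one missing entry, and the fill values work out to 3 bedrooms, 1650 sqft, and a price of 410000.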



In [6]:
import pandas as pd 
import numpy as np 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# 1. Create the dataset
data = {
    'Size (sqft)' : [1500, 1700, 1600, np.nan, 1800],
    'Location' : ['Downtown', 'Suburban', 'Downtown', 'Rural', 'Suburban'],
    'Bedrooms' : [3, 4, np.nan, 2, 3],
    'Price ($)' : [400000, 450000, 420000, 200000, np.nan]
}

df = pd.DataFrame(data)

# 2. Handle missing values
# fillna returns a new Series, so the result must be assigned back to the column.
df['Bedrooms'] = df['Bedrooms'].fillna(df['Bedrooms'].median())
df['Size (sqft)'] = df['Size (sqft)'].fillna(df['Size (sqft)'].mean())
df['Price ($)'] = df['Price ($)'].fillna(df['Price ($)'].median())

# 3. Encode categorical data
# drop_first=True drops the first category (Downtown), which becomes the baseline.
df = pd.get_dummies(df, columns=['Location'], drop_first=True)

# 4. Feature scaling
# Standardize the numeric columns in place on a copy; the dummy columns stay unchanged.
num_cols = ['Size (sqft)', 'Bedrooms', 'Price ($)']
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[num_cols] = scaler.fit_transform(df[num_cols])

# 5. Split the data
train, test = train_test_split(df_scaled, test_size=0.2, random_state=42)

# Display results: full scaled frame, then the training set, then the test set
print(df_scaled, train, test, sep='\n\n')


   Size (sqft)  Bedrooms  Price ($)  Location_Rural  Location_Suburban
0         -1.5  0.000000   0.267927           False              False
1          0.5  1.581139   0.826107           False               True
2         -0.5  0.000000   0.491199           False              False
3          0.0 -1.581139  -1.964795            True              False
4          1.5  0.000000   0.379563           False               True

   Size (sqft)  Bedrooms  Price ($)  Location_Rural  Location_Suburban
4          1.5  0.000000   0.379563           False               True
2         -0.5  0.000000   0.491199           False              False
0         -1.5  0.000000   0.267927           False              False
3          0.0 -1.581139  -1.964795            True              False

   Size (sqft)  Bedrooms  Price ($)  Location_Rural  Location_Suburban
1          0.5  1.581139   0.826107           False               True
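
The cell above follows the exercise's step order, so the scaler is fitted on all five rows before the split. A common variation, not required by the exercise, is to split first and fit the imputers, encoder, and scaler on the training rows only, so that no statistics from the test row influence preprocessing. The sketch below assumes scikit-learn's `ColumnTransformer`, `Pipeline`, `SimpleImputer`, and `OneHotEncoder`; the names `preprocess`, `train_X`, and `test_X` are illustrative. It keeps all three location categories rather than dropping the first.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'Size (sqft)': [1500, 1700, 1600, np.nan, 1800],
    'Location': ['Downtown', 'Suburban', 'Downtown', 'Rural', 'Suburban'],
    'Bedrooms': [3, 4, np.nan, 2, 3],
    'Price ($)': [400000, 450000, 420000, 200000, np.nan],
})

# Split the raw rows first so preprocessing statistics come from training data only.
train_raw, test_raw = train_test_split(df, test_size=0.2, random_state=42)

def impute_then_scale(strategy):
    """Impute with the given strategy, then standardize."""
    return Pipeline([
        ('impute', SimpleImputer(strategy=strategy)),
        ('scale', StandardScaler()),
    ])

preprocess = ColumnTransformer(
    [
        ('median_cols', impute_then_scale('median'), ['Bedrooms', 'Price ($)']),
        ('mean_cols', impute_then_scale('mean'), ['Size (sqft)']),
        # Categories unseen during fit are encoded as all zeros instead of raising.
        ('location', OneHotEncoder(handle_unknown='ignore'), ['Location']),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# Fit on the training rows, then apply the same fitted transform to the test rows.
train_X = preprocess.fit_transform(train_raw)
test_X = preprocess.transform(test_raw)

print(train_X.shape, test_X.shape)
```

With only five rows the difference is small, but on larger datasets this ordering keeps the test set untouched by the fitted statistics.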
