# Task 1: Data Preprocessing
### Level 1 - Machine Learning Internship

**Objective:**
The goal of this task is to prepare the "House Prediction Dataset" for machine learning. This involves:
1. Loading the dataset.
2. Handling missing values.
3. Splitting the data into training and testing sets.
4. Feature Scaling.
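
The four steps can also be chained so that the imputer and the scaler are both fitted on the training rows only. The block below is a minimal sketch of that ordering, assuming synthetic data in place of the CSV file; `X_demo`, `y_demo`, and `preprocess` are hypothetical names, and the notebook itself imputes before splitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset (assumption: 100 rows, 4 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
X_demo[::10, 0] = np.nan          # inject a few missing values in column 0
y_demo = rng.normal(size=100)

# Split first, so test rows never influence the fitted statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Impute with the column mean, then standardize
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_tr_p = preprocess.fit_transform(X_tr)   # fit on training data only
X_te_p = preprocess.transform(X_te)       # reuse the training statistics

print(X_tr_p.shape, X_te_p.shape)
```

Fitting inside a pipeline keeps the imputation means and scaling statistics tied to the training set, which avoids leaking information from the test rows.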

In [10]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

print("Libraries Imported Successfully!")

Libraries Imported Successfully!


In [11]:
# 1. Loading Dataset
# The file is whitespace-delimited and has no header row
df = pd.read_csv('4) house Prediction Data Set.csv',
                 sep=r'\s+',  # replaces the deprecated delim_whitespace=True
                 header=None)
# Display first 5 rows
print("First 5 rows of the dataset:")
print(df.head())

# Dataset Information
print("\nDataset Info:")
df.info()  # info() prints its summary directly and returns None

First 5 rows of the dataset:
        0     1     2   3      4      5     6       7   8      9     10  \
0  0.00632  18.0  2.31   0  0.538  6.575  65.2  4.0900   1  296.0  15.3   
1  0.02731   0.0  7.07   0  0.469  6.421  78.9  4.9671   2  242.0  17.8   
2  0.02729   0.0  7.07   0  0.469  7.185  61.1  4.9671   2  242.0  17.8   
3  0.03237   0.0  2.18   0  0.458  6.998  45.8  6.0622   3  222.0  18.7   
4  0.06905   0.0  2.18   0  0.458  7.147  54.2  6.0622   3  222.0  18.7   

       11    12    13  
0  396.90  4.98  24.0  
1  396.90  9.14  21.6  
2  392.83  4.03  34.7  
3  394.63  2.94  33.4  
4  396.90  5.33  36.2  

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       506 non-null    float64
 1   1       506 non-null    float64
 2   2       506 non-null    float64
 3   3       506 non-null    int64  
 4   4       506 non-null    float
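
Because the file has no header, the columns are identified only by position (0 to 13). The sketch below shows one way to attach readable names while loading. It parses the first two rows shown above from an in-memory string rather than the real file. The column names are an assumption: the file never states them, and they are borrowed from the widely circulated Boston housing data, which these values resemble.

```python
import io
import pandas as pd

# Assumed column names (not given in the file); the last column is the target
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# First two rows of the dataset as printed above, with irregular spacing
sample = (
    "0.00632  18.0  2.31  0  0.538  6.575  65.2  4.0900  1  296.0  15.3  396.90  4.98  24.0\n"
    "0.02731   0.0  7.07  0  0.469  6.421  78.9  4.9671  2  242.0  17.8  396.90  9.14  21.6\n"
)

# sep=r"\s+" splits on any run of whitespace; names= supplies the labels
df_demo = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=columns)
print(df_demo[["CRIM", "RM", "MEDV"]])
```

With named columns, the feature/target split can be written as `df_demo.drop(columns="MEDV")` and `df_demo["MEDV"]` instead of positional slicing.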



In [12]:
# 2. Handling Missing Values
X = df.iloc[:, :-1].values # Features
y = df.iloc[:, -1].values  # Target

# Replace any missing values (NaN) in each column with that column's mean
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)

print(f"Missing values handled. X shape: {X.shape}")

Missing values handled. X shape: (506, 13)
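
To see what mean imputation does, it helps to run it on a tiny array where the answer can be checked by hand. This is an illustrative sketch with made-up numbers (`X_small` is a hypothetical name), not data from the housing file.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value in each column
X_small = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
])

# Count missing entries per column before imputing
print("Missing per column:", np.isnan(X_small).sum(axis=0))  # [1 1]

# Each NaN is replaced by the mean of the observed values in its column:
# column 0 -> (1 + 3) / 2 = 2.0, column 1 -> (2 + 4) / 2 = 3.0
X_filled = SimpleImputer(strategy="mean").fit_transform(X_small)
print(X_filled)
```

Counting `NaN` entries first is a cheap check on whether imputation changes anything at all for a given dataset.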


In [13]:
# 3. Splitting the Dataset
# 80% Training, 20% Testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training Data Shape: {X_train.shape}")
print(f"Testing Data Shape: {X_test.shape}")

Training Data Shape: (404, 13)
Testing Data Shape: (102, 13)
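
The 404/102 split follows from rounding: with `test_size=0.2`, scikit-learn takes the ceiling of 0.2 × 506 = 101.2 as the test count, and the remaining rows go to training. The sketch below reproduces the shapes on a placeholder array of the same size (assumption: `X_demo` and `y_demo` stand in for the real features and target) and shows that a fixed `random_state` makes the split repeatable.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_features = 506, 13
X_demo = np.arange(n_samples * n_features, dtype=float).reshape(n_samples, n_features)
y_demo = np.arange(n_samples, dtype=float)

# Expected sizes from the rounding rule
n_test = math.ceil(0.2 * n_samples)   # 102
n_train = n_samples - n_test          # 404

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# The same random_state yields the same rows in the same order
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))
```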


In [14]:
# 4. Feature Scaling
# Standardization matters for scale-sensitive models (e.g. regularized or gradient-based regression)
scaler = StandardScaler()

# Fit on training set only, then transform both
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Data Preprocessing Completed Successfully!")
print(f"First 5 rows of X_train_scaled:\n {X_train_scaled[:5]}")

Data Preprocessing Completed Successfully!
First 5 rows of X_train_scaled:
 [[ 1.28770177 -0.50032012  1.03323679 -0.27808871  0.48925206 -1.42806858
   1.02801516 -0.80217296  1.70689143  1.57843444  0.84534281 -0.07433689
   1.75350503]
 [-0.33638447 -0.50032012 -0.41315956 -0.27808871 -0.15723342 -0.68008655
  -0.43119908  0.32434893 -0.62435988 -0.58464788  1.20474139  0.4301838
  -0.5614742 ]
 [-0.40325332  1.01327135 -0.71521823 -0.27808871 -1.00872286 -0.40206304
  -1.6185989   1.3306972  -0.97404758 -0.60272378 -0.63717631  0.06529747
  -0.65159505]
 [ 0.38822983 -0.50032012  1.03323679 -0.27808871  0.48925206 -0.30045039
   0.59168149 -0.8392398   1.70689143  1.57843444  0.84534281 -3.86819251
   1.52538664]
 [-0.32528234 -0.50032012 -0.41315956 -0.27808871 -0.15723342 -0.83109424
   0.03374663 -0.00549428 -0.62435988 -0.58464788  1.20474139  0.3791194
  -0.16578736]]
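
A quick sanity check on the scaling step: after `fit_transform`, every training column should have mean 0 and standard deviation 1, while the test columns are only close to that, because they are transformed with the training statistics. The sketch below checks this on synthetic data (assumption: random normal features with the same row counts as above; `X_train_demo` and `X_test_demo` are hypothetical names).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(404, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(102, 3))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean_ and scale_
test_scaled = scaler_demo.transform(X_test_demo)        # reuses them unchanged

# Training columns are centered and have unit (population) standard deviation
print(np.allclose(train_scaled.mean(axis=0), 0.0))
print(np.allclose(train_scaled.std(axis=0), 1.0))

# transform() is just (x - training mean) / training standard deviation
manual = (X_test_demo - scaler_demo.mean_) / scaler_demo.scale_
print(np.allclose(test_scaled, manual))
```

Test-set means and standard deviations will generally differ slightly from 0 and 1. That is expected and does not indicate a bug.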
