Case Study on Preprocessing
 - Dataset Overview
        Load the dataset House_Pricing.csv.
        Display basic information about the dataset using .info() and .describe() to understand the features, data types, and any initial insights into missing values.
 - Duplicate Removal
        Rows: Check for duplicate rows in the dataset, if any, and remove them.
        Columns: Identify and drop duplicate columns, if any, based on their values.
 - Handling Missing Values
        Identify missing values in each column.
        Handle missing values:
            For numerical columns, use imputation techniques like mean/median imputation.
            For categorical columns, fill with mode.
        Document your approach for each feature with missing data.
 - Scaling Numerical Variables
        Identify all numerical columns (excluding the target variable, Sale Price).
        Scale these features using techniques like Min-Max scaling or Standard scaling.
 - Encoding Categorical Variables
        Identify all categorical columns in the dataset.
        Apply appropriate encoding techniques:
            Use One-Hot Encoding for nominal categories.
            Use Label Encoding for ordinal categories, if applicable.
 - Outlier Removal
        Perform an outlier detection analysis on numerical variables (e.g., using the IQR method).
        Remove outliers from these features if they are not representative of typical house prices.
 - Train-Test Split
        Set aside the Sale Price column as the target variable.
        Split the dataset into training (80%) and testing (20%) sets using the train_test_split function from sklearn.



In [9]:
# Dataset overview
import pandas as pd
df = pd.read_csv(r"C:\Users\rahul\OneDrive\Desktop\anitha\DSA\dsa assignments\preprocessing assignment\House_Pricing.csv")
df.info()  # info() prints its report itself; wrapping it in print() adds a stray "None"
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   ID                                         21613 non-null  int64  
 1   Date House was Sold                        21613 non-null  object 
 2   Sale Price                                 21609 non-null  float64
 3   No of Bedrooms                             21613 non-null  int64  
 4   No of Bathrooms                            21609 non-null  float64
 5   Flat Area (in Sqft)                        21604 non-null  float64
 6   Lot Area (in Sqft)                         21604 non-null  float64
 7   No of Floors                               21613 non-null  float64
 8   Waterfront View                            21613 non-null  object 
 9   No of Times Visited                        2124 non-null   object 
 10  Condition of the House

In [10]:
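# Duplicate removal
# Rows: drop exact duplicate rows.
df = df.drop_duplicates()
# Columns: drop columns whose values repeat an earlier column.
# df.columns.duplicated() compares names only, so it misses value-level copies.
df = df.loc[:, ~df.T.duplicated()]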
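
Name-based and value-based duplicate checks on columns can disagree. The sketch below illustrates the difference on a small made-up frame; the column "Flat Area Copy" is hypothetical and does not come from House_Pricing.csv.

```python
import pandas as pd

# Toy data for illustration only; "Flat Area Copy" is a hypothetical column.
toy = pd.DataFrame({
    "Flat Area (in Sqft)": [1180.0, 2570.0, 770.0],
    "Flat Area Copy": [1180.0, 2570.0, 770.0],  # same values, different name
    "No of Bedrooms": [3, 3, 2],
})

by_name = toy.columns.duplicated()          # compares column labels only
by_value = toy.T.duplicated().to_numpy()    # compares column contents

# Keep the first column of each group of identical columns.
deduped = toy.loc[:, ~by_value]
print(list(deduped.columns))
```

Transposing a wide or very long frame can be slow, so on large data it may be worth comparing only columns that share a dtype.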

In [11]:
# Identify missing values in each column
print(df.isnull().sum())

ID                                               0
Date House was Sold                              0
Sale Price                                       4
No of Bedrooms                                   0
No of Bathrooms                                  4
Flat Area (in Sqft)                              9
Lot Area (in Sqft)                               9
No of Floors                                     0
Waterfront View                                  0
No of Times Visited                          19489
Condition of the House                           0
Overall Grade                                    0
Area of the House from Basement (in Sqft)        3
Basement Area (in Sqft)                          0
Age of House (in Years)                          0
Renovated Year                                   0
Zipcode                                          1
Latitude                                         1
Longitude                                        1
Living Area after Renovation (i
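
The counts above are easier to judge as fractions of the rows: No of Times Visited, with 19489 missing values, is empty in roughly 90% of the rows, while the other gaps are a handful of rows each. A minimal sketch of such a report follows; `missing_report` is a hypothetical helper and the frame is toy data, not the real dataset.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, worst column first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "fraction": counts / len(frame),
    })
    return report.sort_values("fraction", ascending=False)

# Toy data for illustration only (not rows from House_Pricing.csv).
toy = pd.DataFrame({
    "Sale Price": [221900.0, np.nan, 180000.0, 604000.0],
    "No of Times Visited": [np.nan, np.nan, np.nan, "Once"],
    "No of Bedrooms": [3, 3, 2, 4],
})
report = missing_report(toy)
print(report)
```

A column that is mostly missing may be better dropped or treated as its own category than imputed; that choice depends on what the missing entries mean in the source data.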

In [12]:
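# Numerical columns: median imputation (less sensitive to extreme values than the mean).
# Note: this also fills the 4 missing 'Sale Price' values, since the target is still in df.
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].median())
# Categorical columns: fill with the mode (most frequent value).
# Note: 'No of Times Visited' is mostly missing, so its mode ends up filling most of that column.
categorical_cols = df.select_dtypes(include=['object']).columns
df[categorical_cols] = df[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))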
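
The task list also asks for the approach to be documented for each feature with missing data. One possible way to do that, sketched here on toy data, is to build a small table of the strategy and fill value per column before imputing. `imputation_plan` is a hypothetical helper, and it assumes the same rule as the cell above (median for numeric columns, mode otherwise).

```python
import numpy as np
import pandas as pd

def imputation_plan(frame: pd.DataFrame) -> pd.DataFrame:
    """List each column with missing values, plus the fill strategy and value."""
    rows = []
    for col in frame.columns[frame.isnull().any()]:
        if pd.api.types.is_numeric_dtype(frame[col]):
            strategy, value = "median", frame[col].median()
        else:
            strategy, value = "mode", frame[col].mode()[0]
        rows.append({
            "column": col,
            "missing": int(frame[col].isnull().sum()),
            "strategy": strategy,
            "fill_value": value,
        })
    return pd.DataFrame(rows)

# Toy data for illustration only (not rows from House_Pricing.csv).
toy = pd.DataFrame({
    "Flat Area (in Sqft)": [1180.0, np.nan, 770.0, 1960.0, 1680.0],
    "Waterfront View": ["No", "No", None, "Yes", "No"],
    "No of Floors": [1.0, 2.0, 1.0, 1.0, 1.0],
})
plan = imputation_plan(toy)
print(plan)
```

Running the helper before the `fillna` calls keeps a record of what was filled and with which value.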



In [13]:
# Scaling numerical features (target excluded)
from sklearn.preprocessing import StandardScaler

# Define features and target
X = df.drop(columns=['Sale Price'])
y = df['Sale Price']

# Identify numerical columns
numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns

# Apply standard scaling
scaler = StandardScaler()
X[numerical_cols] = scaler.fit_transform(X[numerical_cols])
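
The task list names Min-Max scaling as an alternative to standard scaling. A minimal sketch of that option on toy data follows, using scikit-learn's `MinMaxScaler`; the values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data for illustration only (not rows from House_Pricing.csv).
toy = pd.DataFrame({
    "No of Bedrooms": [2, 3, 4, 6],
    "Flat Area (in Sqft)": [800.0, 1200.0, 2000.0, 3200.0],
})

# Rescale each column to the [0, 1] range: (x - min) / (max - min).
minmax = MinMaxScaler()
scaled = pd.DataFrame(minmax.fit_transform(toy), columns=toy.columns, index=toy.index)
print(scaled)
```

Min-Max scaling keeps values in a fixed range but is sensitive to extreme values, whereas standard scaling centers each column at zero with unit variance.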

In [14]:
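# Encoding categorical variables
# One-hot encoding for nominal (unordered) variables
# Note: every object column is encoded, including 'Date House was Sold' (one dummy per distinct date).
X = pd.get_dummies(X, drop_first=True)
print(X)
from sklearn.preprocessing import LabelEncoder  # imported for ordinal columns; not used in this cell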




             ID  No of Bedrooms  No of Bathrooms  Flat Area (in Sqft)  \
0      0.886146       -0.398737        -1.447640            -0.979940   
1      0.637511       -0.398737         0.175628             0.533757   
2      0.365444       -1.473959        -1.447640            -1.426426   
3     -0.727656        0.676485         1.149589            -0.130528   
4     -0.912881       -0.398737        -0.149026            -0.435445   
...         ...             ...              ...                  ...   
21608 -1.500888       -0.398737         0.500282            -0.598793   
21609  0.702159        0.676485         0.500282             0.250619   
21610 -1.062751       -1.473959        -1.772294            -1.154179   
21611 -1.491046       -0.398737         0.500282            -0.522564   
21612 -1.062751       -1.473959        -1.772294            -1.154179   

       Lot Area (in Sqft)  No of Floors  Overall Grade  \
0               -0.228268     -0.915427      -0.564013   
1      
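
For ordinal categories, the order of the codes matters. `LabelEncoder` assigns codes by sorted label, which for strings is alphabetical, so an explicit category order is often safer. The sketch below uses an ordered pandas `Categorical`; the labels and their ordering are assumptions made for illustration and may not match the values in Condition of the House.

```python
import pandas as pd

# Assumed ordering, worst to best; the real labels in the dataset may differ.
condition_order = ["Bad", "Okay", "Fair", "Good", "Excellent"]

# Toy data for illustration only.
toy = pd.DataFrame({
    "Condition of the House": ["Fair", "Excellent", "Bad", "Good", "Fair"],
})

# Map each label to its position in the assumed order.
ordered = pd.Categorical(
    toy["Condition of the House"], categories=condition_order, ordered=True
)
toy["Condition Code"] = ordered.codes
print(toy)
```

Labels that are not in the list receive the code -1, which makes unexpected categories easy to spot. Ordinal columns would need to be encoded this way before `pd.get_dummies` is applied, otherwise they are one-hot encoded with the rest.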

In [15]:
# Outlier removal
# Calculate the IQR for each numerical column
Q1 = X[numerical_cols].quantile(0.25)
Q3 = X[numerical_cols].quantile(0.75)
IQR = Q3 - Q1
print("Q1:", Q1)
print("Q3:", Q3)
print("IQR:", IQR)

# Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
outlier_condition = ((X[numerical_cols] < (Q1 - 1.5 * IQR)) | (X[numerical_cols] > (Q3 + 1.5 * IQR)))

# Drop rows with an outlier in any numerical column, and keep y aligned with X
X = X[~outlier_condition.any(axis=1)]
y = y.loc[X.index]
print(X)
print(y)

Q1: ID                                          -0.854251
No of Bedrooms                              -0.398737
No of Bathrooms                             -0.473679
Flat Area (in Sqft)                         -0.707692
Lot Area (in Sqft)                          -0.242996
No of Floors                                -0.915427
Overall Grade                               -0.564013
Area of the House from Basement (in Sqft)   -0.722678
Basement Area (in Sqft)                     -0.658681
Age of House (in Years)                     -0.885000
Renovated Year                              -0.210128
Zipcode                                     -0.839900
Latitude                                    -0.642675
Longitude                                   -0.810290
Living Area after Renovation (in Sqft)      -0.724470
Lot Area after Renovation (in Sqft)         -0.280859
Name: 0.25, dtype: float64
Q3: ID                                           0.948583
No of Bedrooms                               0.
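
The Q1 listing above includes identifier and location columns such as ID, Zipcode, Latitude and Longitude, so the IQR rule is applied to them as well. One possible refinement, sketched here on toy data, is to pass an explicit list of columns to check. `iqr_keep_mask` is a hypothetical helper, and which columns belong in the list is a judgment call left to the analyst.

```python
import pandas as pd

def iqr_keep_mask(frame: pd.DataFrame, columns, k: float = 1.5) -> pd.Series:
    """Return True for rows with no value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    outlier = (frame[columns] < (q1 - k * iqr)) | (frame[columns] > (q3 + k * iqr))
    return ~outlier.any(axis=1)

# Toy data for illustration only (not rows from House_Pricing.csv).
toy = pd.DataFrame({
    "ID": [101, 102, 103, 104, 105, 106],
    "Flat Area (in Sqft)": [900.0, 1000.0, 1100.0, 1200.0, 1300.0, 9000.0],
})

# Only the area column is checked; ID is left out on purpose.
keep = iqr_keep_mask(toy, ["Flat Area (in Sqft)"])
filtered = toy[keep]
print(filtered)
```

The same mask can be used to index the target, for example `y.loc[filtered.index]`, so features and target stay aligned.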

In [16]:
# Train-test split (80% training, 20% testing)
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
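
A common variation, which is not what the cells above do, is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence the scaling. A minimal sketch on randomly generated toy data follows; the `_toy` names are hypothetical and the values are not from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy data for illustration only.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Flat Area (in Sqft)": rng.uniform(500, 4000, size=50),
    "No of Bedrooms": rng.integers(1, 6, size=50).astype(float),
})
y_toy = pd.Series(rng.uniform(100_000, 900_000, size=50), name="Sale Price")

# Split first (80% train, 20% test), as in the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Fit the scaler on the training rows only, then apply it to both sets.
toy_scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(toy_scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(toy_scaler.transform(X_te), columns=X_te.columns, index=X_te.index)
print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the test set is transformed using the training mean and standard deviation, so its scaled columns will generally not have exactly zero mean.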