<h1 style="color:red">Train data</h1>

In [26]:
import pandas as pd
# Load the training data
train = pd.read_csv("house_prices_data/train.csv")

In [27]:
# Display the first few rows of the training data
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [28]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [29]:
# Calculate the number and percentage of missing values
missing_values = train.isnull().sum()
missing_percentage = (missing_values / len(train)) * 100

# Create a DataFrame to summarize missing values
missing_summary = pd.DataFrame({
    'Column': train.columns,
    'Missing_Count': missing_values,
    'Missing_Percentage': missing_percentage
}).sort_values(by='Missing_Percentage', ascending=False)

# Display columns with missing values
missing_summary = missing_summary[missing_summary['Missing_Count'] > 0]
missing_summary.reset_index(drop=True, inplace=True)
print(missing_summary)


          Column  Missing_Count  Missing_Percentage
0         PoolQC           1453           99.520548
1    MiscFeature           1406           96.301370
2          Alley           1369           93.767123
3          Fence           1179           80.753425
4     MasVnrType            872           59.726027
5    FireplaceQu            690           47.260274
6    LotFrontage            259           17.739726
7    GarageYrBlt             81            5.547945
8     GarageCond             81            5.547945
9     GarageType             81            5.547945
10  GarageFinish             81            5.547945
11    GarageQual             81            5.547945
12  BsmtFinType2             38            2.602740
13  BsmtExposure             38            2.602740
14      BsmtQual             37            2.534247
15      BsmtCond             37            2.534247
16  BsmtFinType1             37            2.534247
17    MasVnrArea              8            0.547945
18    Electr

<h3 style="color:lime;">From this summary, we can determine the appropriate actions for handling the missing values based on the percentage of missingness and the importance of each column.</h3> 

### Columns to drop:
1. **`PoolQC` (99.52%)**: The data is almost entirely missing and may not provide significant predictive power.
2. **`MiscFeature` (96.30%)**: Similarly, it has very high missingness and limited impact.
3. **`Alley` (93.77%)**: High missing percentage, likely not critical to house pricing.
4. **`Fence` (80.75%)**: Too much missing data to be useful.
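
The drop decision above can also be expressed as a rule rather than a hand-picked list. The following is a minimal sketch on a toy frame that stands in for `train`; the 80% cutoff and the names `toy`, `DROP_THRESHOLD`, and `columns_to_drop` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are illustrative only.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0, 85.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
})

# Fraction of missing values per column (mean of the boolean mask).
missing_fraction = toy.isnull().mean()

# Hypothetical cutoff: drop columns missing in more than 80% of rows.
DROP_THRESHOLD = 0.80
columns_to_drop = missing_fraction[missing_fraction > DROP_THRESHOLD].index.tolist()

toy_reduced = toy.drop(columns=columns_to_drop)
print(columns_to_drop)
```

With this cutoff, the four columns listed above (all over 80% missing) would be selected, while `MasVnrType` at 59.73% would be kept.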

### Columns to consider filling:
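1. **`MasVnrType` (59.73%)**: More than half of the values are missing, but the column might be important for building characteristics. The gaps may simply mean "No Masonry Veneer," in which case a placeholder like "None" fits.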
2. **`FireplaceQu` (47.26%)**: Missing values likely indicate "No Fireplace," so it can be filled with a placeholder like "None."
3. **`LotFrontage` (17.74%)**: Missing values can be imputed with the overall median or with the median of each house's neighborhood.
4. **Garage-related columns** (`GarageYrBlt`, `GarageCond`, `GarageType`, `GarageFinish`, `GarageQual`): each is missing 5.55% of its values, which likely indicates "No Garage." The categorical columns can be filled with a placeholder such as "None"; `GarageYrBlt` is numeric and needs a numeric fill instead.
5. **Basement-related columns** (`BsmtFinType1`, `BsmtFinType2`, `BsmtExposure`, `BsmtQual`, `BsmtCond`) have about 2.53-2.60% missing values, which likely represent "No Basement." They can also be filled with placeholders.
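
As a rough illustration of the two ideas in this list, the sketch below fills categorical gaps with a "None" placeholder and fills `LotFrontage` with the median of the same neighborhood. The toy data and the variable names `placeholder_columns` and `neighborhood_median` are assumptions for the example; the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are illustrative only.
toy = pd.DataFrame({
    "Neighborhood": ["CollgCr", "CollgCr", "CollgCr", "Veenker", "Veenker"],
    "LotFrontage": [65.0, np.nan, 75.0, 80.0, np.nan],
    "FireplaceQu": ["TA", np.nan, "Gd", np.nan, "TA"],
    "GarageType": ["Attchd", "Detchd", np.nan, "Attchd", np.nan],
})

# Treat a missing quality/type label as its own "None" category.
placeholder_columns = ["FireplaceQu", "GarageType"]
toy[placeholder_columns] = toy[placeholder_columns].fillna("None")

# Fill LotFrontage with the median of the rows in the same neighborhood.
neighborhood_median = toy.groupby("Neighborhood")["LotFrontage"].transform("median")
toy["LotFrontage"] = toy["LotFrontage"].fillna(neighborhood_median)

print(toy)
```

A neighborhood with no observed `LotFrontage` at all would still be left with gaps under this approach, so a fallback to the overall median may be needed.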

### Columns with very low missingness:
1. **`MasVnrArea` (0.55%)**: Can be filled with the median or 0 if it indicates "No Masonry Veneer."
2. **`Electrical` (0.07%)**: Only one value is missing; it can be filled with the mode, i.e. the most common electrical system type.



<h2 style="color:red;">CLEANING</h2>

In [30]:
# Columns to drop from the train dataset.
# Id is only a unique identifier and is not needed at this stage.
drop_columns_train = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'Id']

# Dropping the columns from the train dataset
train = train.drop(columns=drop_columns_train)

# Check the result
train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 76 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   LotShape       1460 non-null   object 
 6   LandContour    1460 non-null   object 
 7   Utilities      1460 non-null   object 
 8   LotConfig      1460 non-null   object 
 9   LandSlope      1460 non-null   object 
 10  Neighborhood   1460 non-null   object 
 11  Condition1     1460 non-null   object 
 12  Condition2     1460 non-null   object 
 13  BldgType       1460 non-null   object 
 14  HouseStyle     1460 non-null   object 
 15  OverallQual    1460 non-null   int64  
 16  OverallCond    1460 non-null   int64  
 17  YearBuilt      1460 non-null   int64  
 18  YearRemo

In [31]:
train.shape

(1460, 76)

<h3>1. Identifying numerical and categorical columns with missing values:</h3>

In [36]:
# Identify numerical columns with missing values
numerical_columns = train.select_dtypes(include=['float64', 'int64']).columns
numerical_missing_train = train[numerical_columns].isnull().sum()

# Identify categorical columns with missing values
categorical_columns = train.select_dtypes(include=['object']).columns
categorical_missing_train = train[categorical_columns].isnull().sum()

In [37]:
# Show the numerical columns that have missing values
numerical_missing_train[numerical_missing_train > 0]

LotFrontage    259
MasVnrArea       8
GarageYrBlt     81
dtype: int64

In [38]:
# Show the categorical columns that have missing values
categorical_missing_train[categorical_missing_train > 0]

MasVnrType      872
BsmtQual         37
BsmtCond         37
BsmtExposure     38
BsmtFinType1     37
BsmtFinType2     38
Electrical        1
FireplaceQu     690
GarageType       81
GarageFinish     81
GarageQual       81
GarageCond       81
dtype: int64

<h3>2. Imputation Strategy:</h3>
<ul>
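<li><b>Numerical columns:</b> We will fill the missing values with the median, which is preferred over the mean because it is less sensitive to outliers.</li>
<li><b>Categorical columns:</b> We will fill the missing values with the mode (the most frequent value). Note that this assigns the most common category to every gap, including columns where a missing value likely means the feature is absent (for example, no garage or no fireplace).</li>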

</ul>
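
To make the outlier argument concrete, here is a small sketch with made-up frontage values, one of which is extreme. The numbers and variable names are assumptions chosen for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative lot frontages with one extreme value and one gap.
frontage = pd.Series([60.0, 65.0, 68.0, 80.0, 313.0, np.nan])

mean_value = frontage.mean()      # pulled upward by the 313.0 outlier
median_value = frontage.median()  # middle of the sorted non-missing values

# Filling with the median keeps the imputed value close to a typical lot.
filled_with_median = frontage.fillna(median_value)

print(f"mean={mean_value:.1f}, median={median_value:.1f}")
# prints: mean=117.2, median=68.0
```

The single large value shifts the mean far above most observations, while the median stays at a typical one.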

In [40]:
# Impute missing values for numerical columns (using median)
train[numerical_columns] = train[numerical_columns].apply(lambda x: x.fillna(x.median()))

# Impute missing values for categorical columns (using mode)
train[categorical_columns] = train[categorical_columns].apply(lambda x: x.fillna(x.mode()[0]))

# Verify the imputation
train.isnull().sum()

MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
                ..
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 76, dtype: int64
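
Because the per-column output above is abbreviated, a single total can be a more direct check. The sketch below assumes a hypothetical helper, `count_remaining_missing`, and runs on a toy frame rather than on `train`.

```python
import numpy as np
import pandas as pd

def count_remaining_missing(df: pd.DataFrame) -> int:
    """Return the total number of missing cells across all columns."""
    return int(df.isnull().sum().sum())

# Toy frame standing in for `train`; the values are illustrative only.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan],
    "Electrical": ["SBrkr", np.nan],
})
before = count_remaining_missing(toy)

# Same strategy as above: median for numerical, mode for categorical.
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())
toy["Electrical"] = toy["Electrical"].fillna(toy["Electrical"].mode()[0])
after = count_remaining_missing(toy)

print(before, after)
```

A result of zero after imputation confirms that no column still contains missing values.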