# Data Cleaning Notebook
## Objectives
Assess and handle missing values
Clean data

## Inputs
outputs/datasets/collection/HousePrices.csv

## Outputs
Cleaned data in outputs/datasets/cleaned
Data cleaning pipeline

---

# Change working directory
We need to change the working directory from the notebook's current folder to its parent folder.

In [None]:
import os

current_path = os.getcwd()
os.chdir(os.path.dirname(current_path))
current_path = os.getcwd()
current_path

# Data Exploration
Let's look at the data we gathered.

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/HousePrices.csv")
df.head()


## Profile Report

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

# Handling Missing Data
Some data appears to be missing, so we need to investigate this further.

In [None]:
summary = pd.DataFrame({
    "Name": df.columns,
    "DataType": df.dtypes,
    "TotalValues": len(df),
    "MissingValues": df.isnull().sum(),
    "PercentageMissing": df.isnull().sum() * 100 / len(df)
})

summary.reset_index(drop=True, inplace=True)

sorted_summary = summary.sort_values(by='PercentageMissing', ascending=False)
sorted_summary

## Drop High-missing columns
Both **EnclosedPorch** and **WoodDeckSF** have ~90% missing values, so we will drop these columns.

In [None]:
df.drop(['EnclosedPorch', 'WoodDeckSF'], axis=1, inplace=True)
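
Instead of hard-coding the column names, the columns to drop could also be selected by a missing-value threshold. The sketch below uses made-up toy data and a hypothetical helper, `columns_above_missing_threshold`, with an assumed cutoff of 80%. Neither the helper nor the cutoff comes from this notebook.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(frame, threshold=0.8):
    """Return names of columns whose fraction of missing values exceeds threshold."""
    missing_fraction = frame.isnull().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()


# Toy data for illustration only (not the real HousePrices data).
toy = pd.DataFrame({
    "EnclosedPorch": [np.nan] * 9 + [112.0],
    "WoodDeckSF": [np.nan] * 9 + [0.0],
    "SalePrice": list(range(100000, 110000, 1000)),
})

# Both sparse columns are 90% missing, which is above the assumed 80% cutoff.
to_drop = columns_above_missing_threshold(toy, threshold=0.8)
toy_reduced = toy.drop(columns=to_drop)
```

With a threshold, the same cell keeps working if the set of sparse columns changes in a later version of the dataset.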

## Replacing values
A missing value in **GarageFinish** or **BsmtFinType1** implies that the house doesn't have a garage or a basement, respectively.
So we will replace these missing values with zeroes.

In [None]:
for col in ['GarageFinish', 'BsmtFinType1']:
    df[col] = df[col].fillna(0)

## Median Imputation
The remaining missing values will be filled with the median of their column.

In [None]:
for col in ['LotFrontage', 'BedroomAbvGr', '2ndFlrSF', 'GarageYrBlt', 'MasVnrArea']:
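    # Assign the result back: fillna(inplace=True) on a column selection
    # may act on a temporary copy and leave df unchanged
    df[col] = df[col].fillna(df[col].median())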
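
To see what median imputation does, here is a minimal sketch on a made-up two-column frame. The values are invented for illustration and are not taken from HousePrices.csv.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "MasVnrArea": [0.0, 196.0, np.nan, 162.0],
})

for col in ["LotFrontage", "MasVnrArea"]:
    # median() skips NaN by default, so it is computed from observed values only.
    toy[col] = toy[col].fillna(toy[col].median())
```

The gap in `LotFrontage` is filled with 70.0 (the median of 60, 70, 80) and the gap in `MasVnrArea` with 162.0 (the median of 0, 162, 196).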

### Double check
Now we'll double-check that no missing values remain.

In [None]:
df.isnull().sum()
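
The visual check can also be turned into an assertion that fails loudly if anything was missed. This is a sketch only: `assert_no_missing` is a hypothetical helper, and the two small frames are made up for the example.

```python
import numpy as np
import pandas as pd


def assert_no_missing(frame):
    """Raise AssertionError naming the columns that still contain missing values."""
    missing = frame.isnull().sum()
    remaining = missing[missing > 0]
    assert remaining.empty, f"Columns with missing values: {remaining.to_dict()}"


# A fully populated frame passes silently.
clean_toy = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
assert_no_missing(clean_toy)

# A frame with a gap triggers the assertion, and the message names column "b".
dirty_toy = pd.DataFrame({"a": [1, 2], "b": [3.0, np.nan]})
try:
    assert_no_missing(dirty_toy)
    check_failed = False
except AssertionError as err:
    check_failed = True
    message = str(err)
```

In the notebook, the equivalent call would be `assert_no_missing(df)` after the imputation cells.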

# Split Train and Test Set

In [None]:
# exist_ok=True makes this cell safe to re-run when the folder already exists
os.makedirs('outputs/datasets/cleaned', exist_ok=True)

In [None]:
from sklearn.model_selection import train_test_split
# SalePrice stays inside df, so a single-array split is enough
TrainSet, TestSet = train_test_split(df, test_size=0.2, random_state=0)
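
As a quick sanity check of the 80/20 split, the sketch below splits a made-up 10-row frame the same way and checks the resulting sizes. The toy values are for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows, for illustration only.
toy = pd.DataFrame({
    "LotFrontage": range(10),
    "SalePrice": range(100, 110),
})

# Passing the whole frame keeps the SalePrice column in both sets.
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=0)

# Every row lands in exactly one of the two sets.
no_overlap = set(train_toy.index).isdisjoint(test_toy.index)
```

With `test_size=0.2`, 8 of the 10 rows go to the train set and 2 to the test set. Fixing `random_state` makes the split reproducible between runs.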

## Train Set

In [None]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)

## Test Set

In [None]:
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)

## Cleaned Set

In [None]:
df.to_csv("outputs/datasets/cleaned/HousePricesCleaned.csv", index=False)