# **Data Cleaning**

## Objectives

Clean data in preparation for determining the most important features.

## Inputs

1. House_prices_records.csv
2. Inherited_houses.csv

## Outputs

1. House_prices_records_clean.csv
2. Inherited_houses_clean.csv

## Comments
Note that this step is separate from feature engineering, which we will carry out after the correlation study to prepare the data for the model. We will use this opportunity to apply the same cleaning steps to the Inherited_houses.csv data set, since it must be cleaned before it can be run through the model.

---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Heritage-Housing/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Heritage-Housing'
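
Re-running cell In [2] moves the working directory up one more level each time. The following is a minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path shown above). The helper name `move_to_project_root` is hypothetical and not part of the original notebook.

```python
import os

def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only when we are inside the notebook folder."""
    current_dir = os.getcwd()
    # Only step up when the current folder is the notebooks subfolder,
    # so running the cell twice does not climb above the project root.
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, because the project root is no longer named `jupyter_notebooks`.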

# Load Data

Load the house price records and the inherited houses data sets from the collection folder.

In [4]:
import numpy as np
import pandas as pd

house_prices_df = pd.read_csv("outputs/datasets/collection/house_prices_records.csv")
inherited_houses_df = pd.read_csv("outputs/datasets/collection/inherited_houses.csv")

---

## Data cleaning

Following investigation of the data, the cleaning strategy below was implemented.


#### Remove columns missing substantial amounts of data.
EnclosedPorch and WoodDeckSF are missing 90.7% and 89.4% of their data respectively. Where values are present, they do not vary significantly across the range of houses, so these features will have limited predictive power. Imputing such a high proportion of missing data is riskier than removing the features entirely.
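
The missing-data percentages quoted above can be computed per column with pandas. The following is a minimal sketch on toy data, not the real house records; the helper name `missing_data_summary` and the column `OtherFeature` are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_data_summary(df):
    """Return the percentage of missing values per column, highest first."""
    pct_missing = df.isnull().mean() * 100
    # Keep only the columns that actually have missing values
    pct_missing = pct_missing[pct_missing > 0]
    return pct_missing.sort_values(ascending=False).round(1)

# Toy data for illustration only; these are not the real house records
toy_df = pd.DataFrame({
    "EnclosedPorch": [np.nan, np.nan, np.nan, 0.0],
    "OtherFeature": [1710, 1262, 1786, 1717],
})
summary = missing_data_summary(toy_df)
```

Running the same helper on `house_prices_df` should list each column with missing values alongside its percentage, which makes it easier to decide between imputing and dropping.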


In [5]:
from feature_engine.selection import DropFeatures

def drop_features(df):
    """
    Remove the EnclosedPorch and WoodDeckSF columns from the data set
    and return the resulting data frame.
    """
    # DropFeatures removes columns rather than imputing values
    dropper = DropFeatures(features_to_drop=['EnclosedPorch', 'WoodDeckSF'])
    df_removed_columns = dropper.fit_transform(df)
    return df_removed_columns


In [6]:
# Apply drop_features function to house_prices_df and check columns have been removed.
# Should now have 22 columns rather than 24.
house_prices_df = drop_features(house_prices_df)
house_prices_df.shape

(1460, 22)

In [7]:
# Apply drop_features function to inherited_houses_df and check columns have been removed.
# Should now have 21 columns rather than 23.
inherited_houses_df = drop_features(inherited_houses_df)
inherited_houses_df.shape

(4, 21)
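
Since the same cleaning has to be applied to both data sets, it can help to confirm that the dropped columns are gone from each and to see which columns differ between them. The following is a minimal sketch on toy data frames; the helper name `check_cleaned` and the columns `FeatureA` and `Target` are hypothetical placeholders, not names taken from the real data.

```python
import pandas as pd

DROPPED = ["EnclosedPorch", "WoodDeckSF"]

def check_cleaned(df, dropped=DROPPED):
    """Return the dropped columns still present in df (an empty list means clean)."""
    return [col for col in dropped if col in df.columns]

# Toy stand-ins for the two cleaned data frames (illustration only)
toy_records = pd.DataFrame({"FeatureA": [8450], "Target": [1]})
toy_inherited = pd.DataFrame({"FeatureA": [11622]})

still_present = check_cleaned(toy_records) + check_cleaned(toy_inherited)
# Columns that exist in the records data but not in the inherited data
only_in_records = sorted(set(toy_records.columns) - set(toy_inherited.columns))
```

Applied to `house_prices_df` and `inherited_houses_df`, the column difference would show why the two shapes above differ by one column.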

---

# Push files to Repo

* Create a new folder and save the cleaned data to it.

In [None]:
import os

# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/clean_data', exist_ok=True)

# Save the cleaned data frames produced above
house_prices_df.to_csv("outputs/datasets/clean_data/House_prices_records_clean.csv", index=False)
inherited_houses_df.to_csv("outputs/datasets/clean_data/Inherited_houses_clean.csv", index=False)
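
After saving, it can be worth reading a file back to confirm it was written as expected. The following is a minimal sketch that writes a toy data frame to a temporary folder and reloads it; the helper name `save_clean_data`, the file name `toy_clean.csv`, and the toy columns are hypothetical.

```python
import os
import tempfile

import pandas as pd

def save_clean_data(df, folder, filename):
    """Write df to folder/filename as CSV, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    return path

# Toy data for illustration only
toy_df = pd.DataFrame({"FeatureA": [1, 2], "FeatureB": [3.5, 4.5]})

with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "outputs", "datasets", "clean_data")
    path = save_clean_data(toy_df, folder, "toy_clean.csv")
    # Read the file back and compare it with what was saved
    reloaded = pd.read_csv(path)
    round_trip_ok = reloaded.equals(toy_df)
```

Writing with `index=False`, as the notebook does, keeps the row index out of the CSV so the reloaded frame has the same columns as the original.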

