# **Data Cleaning - Ames Housing dataset**

## Objectives

* Load relevant .csv data file from Kaggle
* Handle missing values appropriately 
* Fix bad data: empty cells, data in the wrong format, incorrect values, and duplicates
* Detect and treat outliers 
* Remove duplicate rows if any exist 
* Standardise categorical values where necessary 
* Save cleaned data for modelling 

## Inputs

* Datasets:
   - Raw dataset: data/raw/house_prices_records.csv 
   - Client dataset: data/raw/inherited_houses.csv 
* Libraries: pandas, numpy

## Outputs

* Updated variables and explanations that will be used for the EDA, feature selection and ML model. 
* Cleaned dataset saved to: data/processed/cleaned_data.csv 

## Additional Comments

* Document all cleaning steps and justify each decision
* Visualise missing values before and after cleaning 
* The cleaned dataset will be reused in later notebooks 


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/Users/aisha/Desktop/vscode-projects/p5-heritage-housing/p5-heritage-housing/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/Users/aisha/Desktop/vscode-projects/p5-heritage-housing/p5-heritage-housing'

---

# Clean Data - Introduction 

In this notebook, I will clean the Ames Housing dataset to prepare it for exploratory data analysis (EDA) and machine learning (ML).

#### 1. Inspect the raw structure of the dataset
- Confirm the dataset shape
- Ensure the dataset loaded correctly before cleaning.
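
A minimal sketch of this inspection step, assuming the raw file is read with pandas. A tiny in-memory CSV stands in for data/raw/house_prices_records.csv so the cell runs anywhere, and the column names are illustrative assumptions rather than a confirmed schema.

```python
import io

import pandas as pd

# Stand-in for data/raw/house_prices_records.csv (illustrative columns only).
raw_csv = io.StringIO(
    "LotArea,YearBuilt,SalePrice\n"
    "8450,2003,208500\n"
    "9600,1976,181500\n"
    "11250,2001,223500\n"
)

# In the notebook this would be pd.read_csv("data/raw/house_prices_records.csv").
df = pd.read_csv(raw_csv)

# Confirm the shape and check that each column was parsed with a sensible dtype.
n_rows, n_cols = df.shape
print(f"Rows: {n_rows}, Columns: {n_cols}")
print(df.dtypes)
print(df.head())
```

Checking the dtypes alongside the shape gives an early hint of columns that were read as text when numbers were expected.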

#### 2. Identify and handle any duplicate rows 
- Detect any fully duplicated entries.
- Decide whether these should be removed to prevent bias during training.
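
One possible way to detect fully duplicated rows is sketched below with pandas. The toy DataFrame and its column names are assumptions used only to make the example self-contained.

```python
import pandas as pd

# Toy data: the third row is an exact copy of the first (illustrative values).
df = pd.DataFrame({
    "LotArea": [8450, 9600, 8450, 11250],
    "SalePrice": [208500, 181500, 208500, 223500],
})

# duplicated() flags each row that repeats an earlier, fully identical row.
duplicate_mask = df.duplicated()
n_duplicates = int(duplicate_mask.sum())
print(f"Fully duplicated rows: {n_duplicates}")

# If removal is justified, keep the first occurrence and rebuild the index.
df_no_duplicates = df.drop_duplicates().reset_index(drop=True)
print(df_no_duplicates)
```

Whether to drop the flagged rows remains a judgment call: two houses can legitimately share identical values in a small set of columns.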

#### 3. Analyse missing data patterns
- Identify columns with high levels of missingness.
- Group columns by type of missingness 
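
The missingness analysis could be sketched as a per-column summary table. The data and column names below are assumptions for illustration; the real notebook would run this on the loaded dataset.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in one numeric and one categorical column (illustrative).
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "GarageFinish": ["RFn", None, "Unf", "Fin"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Count and percentage of missing values per column, plus the column dtype.
n_missing = df.isna().sum()
missing_summary = pd.DataFrame({
    "n_missing": n_missing,
    "pct_missing": (n_missing / len(df) * 100).round(1),
    "dtype": df.dtypes.astype(str),
})

# Keep only columns that have gaps, worst first.
missing_summary = (
    missing_summary[missing_summary["n_missing"] > 0]
    .sort_values("pct_missing", ascending=False)
)
print(missing_summary)
```

Including the dtype in the summary makes it easier to group columns into numeric and categorical cases before choosing a treatment.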

#### 4. Decide on appropriate missing value treatments
Based on the nature of each feature:
- Impute numerical columns using mean/median where appropriate.
- Impute categorical columns using mode or a meaningful label 
- Drop columns with extremely high missingness if they provide no predictive value.
- Consider dropping rows with excessive missing values if justified.
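
A hedged sketch of these treatments follows. The 70% drop threshold, the choice of median over mean, the "None" label, and the column names are all assumptions for illustration, not decisions taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative): one numeric gap, one categorical gap,
# and one column that is almost entirely empty.
df = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "GarageFinish": ["Unf", None, "Unf", "Fin"],
    "EnclosedPorch": [np.nan, np.nan, np.nan, 0.0],
})

MAX_MISSING_FRACTION = 0.7  # hypothetical threshold for dropping a column

# Drop columns whose share of missing values exceeds the threshold.
missing_fraction = df.isna().mean()
cols_to_drop = missing_fraction[missing_fraction > MAX_MISSING_FRACTION].index.tolist()
df_clean = df.drop(columns=cols_to_drop)

# Numerical column: impute with the median, which is robust to skew.
df_clean["LotFrontage"] = df_clean["LotFrontage"].fillna(df_clean["LotFrontage"].median())

# Categorical column: impute with a meaningful label instead of the mode.
df_clean["GarageFinish"] = df_clean["GarageFinish"].fillna("None")

print(cols_to_drop)
print(df_clean)
```

If the mode is preferred for a categorical column, `df["col"].mode()[0]` returns the most frequent value and can be passed to `fillna` in the same way.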

#### 5. Validate and correct data types
- Convert incorrectly stored numeric columns (e.g., stored as object) to integers/floats.
- Ensure categorical features are stored as object or category.
- Handle year-based columns to ensure no impossible dates.
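
The type checks above might look like the sketch below. The column names, the "n/a" placeholder, and the earliest plausible year are assumptions chosen for the example.

```python
from datetime import date

import pandas as pd

# Toy data (illustrative): a numeric column stored as text and a year in the future.
df = pd.DataFrame({
    "GrLivArea": ["1710", "1262", "n/a"],
    "KitchenQual": ["Gd", "TA", "Gd"],
    "YearBuilt": [2003, 1976, 2150],
})

# Convert text to numbers; values that cannot be parsed become NaN for later imputation.
df["GrLivArea"] = pd.to_numeric(df["GrLivArea"], errors="coerce")

# Store the categorical feature with the category dtype.
df["KitchenQual"] = df["KitchenQual"].astype("category")

# Flag impossible years: before a hypothetical lower bound or after the current year.
EARLIEST_PLAUSIBLE_YEAR = 1800  # assumption, adjust to the dataset
current_year = date.today().year
impossible_years = df[
    (df["YearBuilt"] < EARLIEST_PLAUSIBLE_YEAR) | (df["YearBuilt"] > current_year)
]

print(df.dtypes)
print(impossible_years)
```

Using `errors="coerce"` turns unparseable entries into missing values, so they should be counted before and after the conversion to avoid silently losing data.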

#### 6. Detect and treat outliers
- Use summary statistics to identify unrealistic values 
- Decide whether to cap, remove, or leave outliers 
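
As one option for this step, the sketch below combines summary statistics with the 1.5 × IQR rule and then caps the flagged values. The IQR rule is a common convention assumed here, not a method this notebook has committed to, and the data is illustrative.

```python
import pandas as pd

# Toy data (illustrative): one value is far larger than the rest.
df = pd.DataFrame({"GrLivArea": [1000, 1100, 1200, 1300, 1400, 1500, 9000]})

# Summary statistics give a first impression of unrealistic values.
print(df["GrLivArea"].describe())

# IQR fences: values beyond 1.5 * IQR from the quartiles are flagged.
q1 = df["GrLivArea"].quantile(0.25)
q3 = df["GrLivArea"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outlier_mask = (df["GrLivArea"] < lower_bound) | (df["GrLivArea"] > upper_bound)
print(f"Outliers flagged: {int(outlier_mask.sum())}")

# Capping keeps the row but limits the extreme value to the nearest fence.
df["GrLivArea_capped"] = df["GrLivArea"].clip(lower=lower_bound, upper=upper_bound)
print(df)
```

Capping into a new column keeps the original values available, which helps when documenting why an outlier was capped, removed, or left alone.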

#### 7. Standardise and clean categorical values
- Check for inconsistent spelling or formatting.
- Replace values such as “None”/“NA”/“Missing” with consistent labels.
- Ensure categories match between the main dataset and inherited_houses.csv.
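
A minimal sketch of the categorical clean-up, assuming "None" is chosen as the single label for missing entries. The column name, the example values, and the small stand-in for inherited_houses.csv are illustrative assumptions.

```python
import pandas as pd

# Toy data (illustrative): stray whitespace and several spellings of "missing".
df = pd.DataFrame({"GarageFinish": ["Unf ", "Unf", "NA", "Missing", "RFn", None]})
inherited = pd.DataFrame({"GarageFinish": ["Unf", "Fin"]})  # stand-in for inherited_houses.csv

MISSING_LABELS = ["none", "na", "missing"]  # spellings treated as missing (assumption)

# Trim whitespace, then map every missing-like entry to one consistent label.
cleaned = df["GarageFinish"].str.strip()
is_missing = cleaned.isna() | cleaned.str.lower().isin(MISSING_LABELS)
df["GarageFinish"] = cleaned.mask(is_missing, "None")

# Categories in the client data that never occur in the main dataset.
unseen_categories = set(inherited["GarageFinish"]) - set(df["GarageFinish"])

print(df["GarageFinish"].value_counts())
print(f"Categories only in inherited data: {unseen_categories}")
```

Any category that appears only in the client data deserves attention, because a model trained on the main dataset would never have seen it.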

#### 8. Produce and save a clean dataset
- Once all cleaning steps are complete, export the final dataset as cleaned_data.csv into a data/processed/ folder.
- This cleaned dataset will feed into the next notebooks (EDA and Model Training).
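
The export step could be sketched as follows. A temporary directory stands in for the project root so the example does not write into a real repository; in the notebook the base directory would be the working directory set earlier. The DataFrame contents are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy cleaned data (illustrative).
df_clean = pd.DataFrame({
    "LotArea": [8450, 9600],
    "SalePrice": [208500, 181500],
})

# Temporary stand-in for the project root (assumption for a self-contained example).
base_dir = tempfile.mkdtemp()
output_dir = os.path.join(base_dir, "data", "processed")

# exist_ok=True makes the cell safe to re-run.
os.makedirs(output_dir, exist_ok=True)

# index=False avoids writing the row index as an extra unnamed column.
output_path = os.path.join(output_dir, "cleaned_data.csv")
df_clean.to_csv(output_path, index=False)

# Reload to confirm the file round-trips with the same shape and columns.
reloaded = pd.read_csv(output_path)
print(reloaded.shape)
```

Reloading the file straight after saving is a cheap check that later notebooks will receive the same columns and row count.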


---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down: do not create a workflow in which you must go back to an earlier cell to execute a task, such as refreshing a variable's content.

---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import streamlit

print("✅ Notebook connected to correct environment!")


ModuleNotFoundError: No module named 'sklearn'