# Exploratory Data Analysis for the Kaggle House Prices Dataset
Created on September 4, 2024 
Author: Xin (David) Zhao 

Reference notebook: [Comprehensive data exploration with Python](https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python)



## EDA Steps

The following EDA steps help ensure the dataset is well prepared for modeling.

1. Load data: Use `pandas` to load the dataset from CSV, Excel, or another format.
2. Initial data overview: Inspect the structure of the data by checking the first few rows, the columns, and the data types.
3. Check for missing values: Identify missing values with `pandas`.
4. Handle missing data: Drop missing values, fill them with the mean, median, or mode, or use a more advanced method such as KNN imputation.
5. Identify and handle duplicates: Check for and remove duplicate rows.
6. Outlier detection and handling: Use box plots or scatter plots to detect outliers visually; then remove them, transform the affected features, or use robust models that are less sensitive to outliers.
7. Feature analysis and data visualization: Visualize distributions with histograms or KDE plots for numerical variables and count plots for categorical variables; use a correlation matrix to identify relationships between numerical features.
8. Handle incorrect data types: Convert data types where necessary (e.g., dates, categories).
9. Encode categorical variables: Use one-hot encoding for nominal categorical variables and label (ordinal) encoding for ordinal categorical variables.
10. Scaling and normalization: Normalize or standardize features using `StandardScaler` or `MinMaxScaler` from `scikit-learn`, which improves performance for some models.
11. Feature analysis and selection: Identify features with low variance or high multicollinearity and consider removing them.
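
As a rough illustration of how several of these steps (3, 4, 5, 6, 8, 9, and 10) might look in `pandas`, here is a minimal sketch on a toy DataFrame. The values are made up, and the column names are used only as stand-ins for the real data. Two substitutions are assumptions of this sketch: outliers are flagged numerically with the 1.5 × IQR rule rather than read off a box plot, and standardization is done by hand rather than with `StandardScaler`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; all values are invented for illustration.
df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9600.0, 215000.0, 11250.0],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Veenker", "NoRidge", "CollgCr"],
    "ExterQual": ["Gd", "TA", "Gd", "TA", "Ex", "TA"],
    "SalePrice": [208500, 181500, 223500, 181500, 250000, 140000],
})

# Step 3: count missing values per column.
missing = df.isna().sum()

# Step 4: fill missing numeric values with the column median.
df["LotArea"] = df["LotArea"].fillna(df["LotArea"].median())

# Step 5: remove exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# Step 6: flag outliers with the 1.5 * IQR rule (assumed here in place of a plot).
q1, q3 = df["LotArea"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df["LotArea"] < q1 - 1.5 * iqr) | (df["LotArea"] > q3 + 1.5 * iqr)

# Step 8: store a nominal text column as a categorical dtype.
df["Neighborhood"] = df["Neighborhood"].astype("category")

# Step 9: map an ordinal variable to ordered integers (illustrative ordering),
# then one-hot encode the nominal variable.
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
df["ExterQual"] = df["ExterQual"].map(quality_order)
encoded = pd.get_dummies(df, columns=["Neighborhood"])

# Step 10: standardize to zero mean and unit variance (population std, ddof=0).
encoded["LotArea_std"] = (
    encoded["LotArea"] - encoded["LotArea"].mean()
) / encoded["LotArea"].std(ddof=0)

print(missing)
print(encoded.head())
```

The order of operations follows the list above. In practice, the choice of fill value, outlier rule, and encoding depends on the feature and on the model that will be trained.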

## Import necessary modules 


```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

## Step 1: Load Data