# **2. EDA and Feature Engineering**

## *Table of Contents*

1. [Data Cleaning](../01_Data_Cleaning/1_Data_Cleaning.ipynb)
2. [**EDA and Feature Engineering**](./2_Exploratory_Data_Analysis.ipynb)
   1. [*Library Imports*](#Library-Imports)
   2. [*File Importation*](#File-Importation)
   3. [*Analysis*](#Analysis)
      1. [Descriptive Statistics](#Descriptive-Statistics)
      2. [Histogram Plots](#Histogram-Plots)
      3. [Skewness Check](#Skewness-Check)
      4. [Q-Q Plots](#Q-Q-Plots)
      5. [Box Plots](#Box-Plots)
      6. [Count Plots](#Count-Plots)
      7. [Pair Plot](#Pair-Plot)
      8. [Correlation Matrix](#Correlation-Matrix)
      9. [Chi-Square Test](#Chi-Square-Test)
3. [Regression Modeling](../03_Regression_Models/3_Regression_Modeling.ipynb)
4. [Time Series](../04_Time_Series_Analysis/4_Time_Series.ipynb)

## **Library Imports**

### Standard library imports

In [None]:
import sys  # For interpreter settings such as the module search path (sys.path)
import os  # For interacting with the operating system (paths, working directory)
from itertools import combinations  # For creating combinations of elements

### Third-party imports

In [None]:
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For scientific computing and array objects
import matplotlib.pyplot as plt  # For creating static, animated, and interactive visualizations
import seaborn as sns  # For data visualization based on matplotlib
from scipy import stats  # For statistical functions and distributions
from scipy.stats import chi2_contingency  # For performing chi-square contingency tests
from sklearn.preprocessing import MinMaxScaler  # For feature scaling

### Local application imports

In [None]:
# Add the parent of the current working directory to the module search path.
# This lets the shared 'utils' package be imported when the notebook runs from its own folder.
parent_dir = os.path.dirname(os.getcwd())
sys.path.insert(0, parent_dir)

# Local application imports
from utils import plot_utils, func_utils

## **File Importation**

In [None]:
# Resolve the parent of the current working directory (the project root when run from this folder)
script_dir = os.path.dirname(os.getcwd())

# Construct the path to the data file
data_path = os.path.join(script_dir, '01_Data_Cleaning', '1_cleaned_melb_data.csv')

# Load dataset containing cleaned Melbourne housing data
melb_data = pd.read_csv(data_path)

## **Analysis**

In [None]:
# Define quantitative and categorical columns for subsequent analysis
quan_columns = ['Price', 'Bedroom', 'Bathroom', 'Car', 'Distance', 'Landsize', 'BuildingArea']
cat_columns = ['Postcode', 'Suburb', 'Regionname', 'CouncilArea', 'Type', 'SellerG', 'Method', 'Year', 'Month']

### Descriptive Statistics

In [None]:
# Display descriptive statistics to summarize central tendency, dispersion, and shape
print(melb_data.describe())

### Histogram Plots

In [None]:
# Plot histograms for quantitative columns in 'melb_data' across a 3x3 grid.
plot_utils.plot_hist(data=melb_data, column_list=quan_columns, rows=3, cols=3)

### Skewness Check

In [None]:
# Assess skewness of quantitative variables
print(melb_data[quan_columns].skew())
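
A common response to strong right skew in a variable such as `Price` is a log transform. The cell below is a minimal sketch on synthetic data (the `demo` frame is a made-up stand-in, not `melb_data`) showing how `np.log1p` changes the skewness statistic. Whether the transform suits the real columns depends on the values printed above.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a column such as 'Price'
# (assumption: illustrative data only, not the Melbourne dataset)
rng = np.random.default_rng(seed=0)
demo = pd.DataFrame({"Price": rng.lognormal(mean=13.5, sigma=0.5, size=5_000)})

# Skewness before and after a log(1 + x) transform
skew_before = demo["Price"].skew()
skew_after = np.log1p(demo["Price"]).skew()

print(f"Skewness before log1p: {skew_before:.2f}")
print(f"Skewness after log1p:  {skew_after:.2f}")
```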

### Q-Q Plots

In [None]:
# Generate Q-Q plots for quantitative columns in 'melb_data' on a 3x3 grid.
plot_utils.plot_qq(data=melb_data, column_list=quan_columns, rows=3, cols=3)
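
As a numeric companion to the Q-Q plots, `scipy.stats.probplot` also returns the correlation coefficient `r` of the least-squares line through the Q-Q points. Values near 1 indicate that the sample quantiles track the normal quantiles closely. The helper `qq_correlation` and the `demo` frame below are hypothetical names used for illustration, and this is not necessarily how `plot_utils.plot_qq` works internally.

```python
import numpy as np
import pandas as pd
from scipy import stats


def qq_correlation(data, column_list):
    """Return the Q-Q (probability plot) correlation coefficient for each column."""
    results = {}
    for col in column_list:
        values = data[col].dropna()
        # With the default fit=True, probplot returns
        # ((theoretical quantiles, ordered values), (slope, intercept, r))
        (_, _), (_, _, r) = stats.probplot(values, dist="norm")
        results[col] = r
    return pd.Series(results, name="qq_r")


# Synthetic columns (assumption: illustrative data only)
rng = np.random.default_rng(seed=0)
demo = pd.DataFrame({
    "normal_like": rng.normal(size=2_000),
    "right_skewed": rng.exponential(size=2_000),
})

qq_r = qq_correlation(demo, ["normal_like", "right_skewed"])
print(qq_r)
```

On the real data, the same call would be `qq_correlation(melb_data, quan_columns)`.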

### Box Plots

In [None]:
# Create box plots for quantitative columns in 'melb_data' arranged in a 3x3 grid.
plot_utils.plot_box(data=melb_data, column_list=quan_columns, rows=3, cols=3)

In [None]:
# Plot box plots for the categorical columns in 'melb_data' against price (price=True) on a 3x3 grid.
plot_utils.plot_box(data=melb_data, column_list=cat_columns, price=True, rows=3, cols=3)

In [None]:
# Display bar plots for categorical columns in 'melb_data' configured in a 3x3 grid layout.
plot_utils.plot_bar(data=melb_data, column_list=cat_columns, rows=3, cols=3)

### Count Plots

In [None]:
# Plot count plots for categorical columns in 'melb_data', structured within a 3x3 grid.
plot_utils.plot_count(data=melb_data, column_list=cat_columns, rows=3, cols=3)

### Pair Plot

In [None]:
# Visualize pair-wise relationships to identify potential correlations and trends
sns.pairplot(data=melb_data[quan_columns])
plt.show()  # Display the pairplot

### Correlation Matrix

In [None]:
# Display correlation matrix to assess linear relationships between variables
correlation_matrix = melb_data[quan_columns].corr()
sns.heatmap(correlation_matrix, annot=True)
plt.title("Correlation Matrix")
plt.show()
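
When the target is `Price`, it can help to read one column of the correlation matrix as a ranked list instead of scanning the heatmap. The sketch below does this on a small synthetic frame. The `demo` data and its coefficients are invented for illustration, so the numbers say nothing about the Melbourne dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data with a known structure (assumption: illustrative only)
rng = np.random.default_rng(seed=0)
n = 1_000
rooms = rng.integers(1, 6, size=n)
distance = rng.uniform(1, 40, size=n)
demo = pd.DataFrame({
    "Bedroom": rooms,
    "Distance": distance,
    "Price": 200_000 * rooms - 10_000 * distance + rng.normal(0, 50_000, size=n),
})

# Pearson correlation of every feature with Price, ordered by absolute strength
price_corr = (
    demo.corr()["Price"]
    .drop("Price")
    .sort_values(key=np.abs, ascending=False)
)
print(price_corr)
```

The equivalent on the real data would start from `melb_data[quan_columns].corr()["Price"]`.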

### Chi-Square Test

In [None]:
# List of combinations of categorical columns taken two at a time
cat_comb = list(combinations(cat_columns, 2))

for col_a, col_b in cat_comb:
    # Create a contingency table for the current pair of categorical columns
    table = pd.crosstab(melb_data[col_a], melb_data[col_b])

    # Perform the chi-square test of independence; degrees of freedom and expected counts are unused here
    chi2_stat, p_value, _, _ = chi2_contingency(table)

    # Print the pair, Chi-square statistic, and p-value
    print(f"Pair: ({col_a}, {col_b}), Chi2 Statistic: {chi2_stat:.2f}, p-value: {p_value:.4g}")
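
With many rows, the chi-square test tends to return very small p-values even for weak associations, so an effect size can be a useful complement. The sketch below computes Cramér's V, defined as sqrt(chi2 / (n * (min(rows, cols) - 1))), from the same contingency table. The function name `cramers_v` and the synthetic `demo` columns are assumptions added for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V effect size for two categorical series (0 = none, 1 = perfect association)."""
    table = pd.crosstab(x, y)
    # correction=False disables the Yates correction, which only applies to 2x2 tables
    chi2_stat, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return np.sqrt(chi2_stat / (n * min_dim)) if min_dim > 0 else np.nan


# Synthetic categories (assumption: illustrative data only)
rng = np.random.default_rng(seed=0)
demo = pd.DataFrame({"Type": rng.choice(["h", "u", "t"], size=3_000)})
demo["Type_copy"] = demo["Type"]                           # perfectly associated
demo["Random"] = rng.choice(["a", "b", "c"], size=3_000)   # independent of Type

v_dependent = cramers_v(demo["Type"], demo["Type_copy"])
v_independent = cramers_v(demo["Type"], demo["Random"])
print(f"Cramér's V (dependent):   {v_dependent:.3f}")
print(f"Cramér's V (independent): {v_independent:.3f}")
```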