# Exploratory Data Analysis Notebook

## Introduction
In this notebook, we explore the Ames housing data, which contains information about the sale of individual residential properties in Ames, Iowa.

The major steps involved in this exploratory data analysis are as follows:

- Load and inspect the data
- Identify missing values
- Explore distributions of key variables
- Find correlations between variables
- Generate visualizations for insights into the data

This process will allow us to better understand the data and determine what preprocessing and feature engineering need to be done before modeling.

## Importing Libraries and Setting Style

First, we import the necessary libraries for our analysis: `pandas` for data manipulation, `numpy` for numerical operations, `matplotlib` for basic plotting, and `seaborn` for more advanced visualization.

We also set the color palette to a pastel one and the style to `whitegrid` using `seaborn`.

In [None]:
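import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Apply the whitegrid style and pastel palette described above
sns.set_style('whitegrid')
sns.set_palette('pastel')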

## Loading Data

We start by reading in the training and test datasets using `pandas`'s `read_csv()` function. The `train_df` and `test_df` dataframes are created to store the data from the corresponding CSV files.

We then print the shape of each dataframe, a `(rows, columns)` tuple, to check how many rows and columns each one contains. The exact figures appear in the cell output.

Finally, we print the row counts on their own using `len()`, which gives a labeled, easier-to-read summary of the number of training and test rows.



In [None]:
train_df = pd.read_csv('../data_details/train.csv')
test_df = pd.read_csv('../data_details/test.csv')

print(train_df.shape)
print(test_df.shape)


print(f"Training rows: {len(train_df)}") 
print(f"Test rows: {len(test_df)}")
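A quick consistency check between the two files can follow the shape printout. The sketch below is only illustrative: `compare_columns` is a hypothetical helper, and the two small frames are toy stand-ins for `train_df` and `test_df`, built on the assumption that the target column is the only intended difference between the files.

```python
import pandas as pd


def compare_columns(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Return the column names that appear in only one of the two frames."""
    train_cols, test_cols = set(train.columns), set(test.columns)
    return {
        "only_in_train": sorted(train_cols - test_cols),
        "only_in_test": sorted(test_cols - train_cols),
    }


# Toy stand-ins for train_df and test_df (hypothetical values)
toy_train = pd.DataFrame(
    {"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]}
)
toy_test = pd.DataFrame({"Id": [3, 4], "LotArea": [11622, 14267]})

column_diff = compare_columns(toy_train, toy_test)
print(column_diff)
```

On the real data the same call would be `compare_columns(train_df, test_df)`. Any column other than the target showing up in either list would be worth investigating before modeling.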

## Data Inspection

To get a better understanding of the data, we inspect the first few rows, the last few rows, and a random sample of `train_df` using `head()`, `tail()`, and `sample()`, respectively. We also print the column data types with `dtypes`, list the column names, and count the missing values in each column.

Together these checks give a sense of the variables and their values, and they expose missing data or other issues that need to be addressed during preprocessing.


In [None]:
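# Data inspection: first rows, last rows, a random sample, and column types

print(train_df.head())
print(train_df.tail())
print(train_df.sample(5, random_state=42))
print(train_df.dtypes)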


In [None]:
print(train_df.columns)

In [None]:
null_counts = train_df.isnull().sum()
print(null_counts[null_counts > 0])
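Raw counts are easier to judge next to the share of rows they represent. A minimal sketch, assuming a hypothetical helper named `missing_summary` and a toy frame in place of `train_df` (the values are made up for illustration):

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame(
        {"missing": counts, "percent": (counts / len(df) * 100).round(1)}
    )
    # Keep only columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for train_df (hypothetical values)
toy_df = pd.DataFrame(
    {
        "LotFrontage": [65.0, np.nan, 68.0, np.nan],
        "Alley": [None, None, None, "Grvl"],
        "LotArea": [8450, 9600, 11250, 9550],
    }
)
toy_missing = missing_summary(toy_df)
print(toy_missing)
```

Columns that are mostly empty may be candidates for dropping or for a dedicated "missing" category, while columns with only a few gaps are usually better suited to imputation.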

## Exploring Distribution of Target


We plot a histogram of the target variable, `SalePrice` (the variable we want to predict), to understand its distribution. The shape of this distribution, in particular any skew, helps us choose appropriate models and transformations later on.

In [None]:
plt.figure(figsize=(10,6))
plt.hist(train_df['SalePrice'], bins=30, color='pink', edgecolor='black')
plt.title('Distribution of Sale Prices')
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
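One way to put a number on the shape of the histogram is the sample skewness, and a log transform is a common candidate when the target has a long right tail. The sketch below uses synthetic log-normal prices rather than the Ames data, so the printed values are illustrative only; the name `toy_prices` and the distribution parameters are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed prices (hypothetical, not the Ames data)
toy_prices = pd.Series(
    rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice"
)

raw_skew = toy_prices.skew()
log_skew = np.log1p(toy_prices).skew()
print(f"Skewness before log1p: {raw_skew:.2f}")
print(f"Skewness after log1p:  {log_skew:.2f}")
```

On the real data the equivalent calls would be `train_df['SalePrice'].skew()` and `np.log1p(train_df['SalePrice']).skew()`. A skewness close to zero after the transform would suggest modeling the log of the price instead of the raw price.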




In [None]:
# get the correlation of numerical variables
correlation = train_df.select_dtypes(include=[np.number]).corr()

# sort the correlations of the features with SalePrice
print("Correlation with SalePrice:")
correlation['SalePrice'].sort_values(ascending=False)


In [None]:
# Selecting SalePrice plus its 10 most positively correlated features (head(11) includes SalePrice itself)
top_corr_features = correlation['SalePrice'].sort_values(ascending=False).head(11).index

# Creating a correlation matrix for the top correlated features
corr_matrix = train_df[top_corr_features].corr()

# Setting up the figure size
plt.figure(figsize=(12, 8))

# Plotting the heatmap
sns.heatmap(corr_matrix, annot=True, cmap='RdPu')

# Displaying the plot
plt.show()


The heatmap shows strong positive correlations between SalePrice and features such as OverallQual, GrLivArea, GarageCars, GarageArea, and TotalBsmtSF.

We can also see some correlation between the predictors themselves. For example, GarageCars and GarageArea are highly correlated (0.88), which makes sense - the more cars that fit into a garage, the larger the garage area tends to be. Similarly, GrLivArea and TotRmsAbvGrd have a high correlation (0.83) because houses with more rooms are likely to have a larger living area.

We'll need to consider this multicollinearity, as highly correlated predictors can sometimes negatively impact certain types of regression models. For example, in linear regression, high levels of multicollinearity can cause the coefficients of the predictors to be unstable and difficult to interpret.
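Reading correlated pairs off a heatmap gets tedious as the number of features grows. The sketch below lists predictor pairs whose absolute correlation exceeds a cutoff; `high_corr_pairs` is a hypothetical helper, the 0.8 threshold is an arbitrary assumption, and the toy frame only mimics the Ames column names.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, target: str, threshold: float = 0.8) -> pd.Series:
    """Absolute correlations above `threshold` between pairs of numeric predictors."""
    corr = df.select_dtypes(include=[np.number]).drop(columns=[target]).corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison also discards any NaN entries left over from the mask
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy frame standing in for train_df (hypothetical values)
toy_df = pd.DataFrame(
    {
        "GarageCars": [1, 2, 2, 3, 3, 1],
        "GarageArea": [250, 480, 500, 750, 800, 240],
        "LotArea": [9000, 7000, 12000, 8000, 10000, 11000],
        "SalePrice": [120000, 180000, 200000, 260000, 300000, 125000],
    }
)
toy_pairs = high_corr_pairs(toy_df, target="SalePrice")
print(toy_pairs)
```

For each flagged pair, one option is to keep only the feature that correlates more strongly with the target; regularized models are another way to cope with correlated predictors.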

In [None]:

# Variables to explore
vars_to_explore = ['GrLivArea', 'OverallQual', 'Neighborhood', 'YearBuilt', 'LotArea', 'YearRemodAdd', 'BsmtQual']

# Plotting
fig, axs = plt.subplots(nrows=len(vars_to_explore), figsize=(12, 6*len(vars_to_explore)))

for i, var in enumerate(vars_to_explore):
    # Scatter plot for numerical variables
    if pd.api.types.is_numeric_dtype(train_df[var]):
        axs[i].scatter(train_df[var], train_df['SalePrice'])
        axs[i].set_title(f'SalePrice vs {var}')
        axs[i].set_xlabel(var)
        axs[i].set_ylabel('SalePrice')
    # Box plot for categorical variables
    else:
        sns.boxplot(x=var, y='SalePrice', data=train_df, ax=axs[i])
        axs[i].set_title(f'SalePrice by {var}')
        axs[i].set_xlabel(var)
        axs[i].set_ylabel('SalePrice')

plt.tight_layout()
plt.show()


GrLivArea: As expected, 'GrLivArea' and 'SalePrice' are positively related: houses with larger living areas generally sell for higher prices. However, a few outliers are large houses that sold for relatively low prices, which suggests that other factors (e.g., the quality of the house or the neighborhood) can significantly influence the sale price.

OverallQual: 'SalePrice' clearly increases with 'OverallQual'. Because 'OverallQual' is stored as an integer, the cell above draws it as a scatter plot with one vertical band per rating, and the spread of prices within each band widens as the rating rises. This suggests that higher quality houses not only sell for higher prices on average but also show more price variability.

Neighborhood: The sale price appears to vary significantly by neighborhood. Some neighborhoods like 'NoRidge', 'NridgHt', and 'StoneBr' have much higher median prices compared to others. This confirms that location is a crucial factor in determining house prices.

YearBuilt: There seems to be a slight upward trend in prices for more recently built houses, which is expected as newer houses tend to have modern designs and require less maintenance. However, the trend is not very strong, indicating that other factors also play a significant role in determining prices.

LotArea: While there's a general trend of larger lots commanding higher prices, the relationship is not very strong. There are many small lots that sold for high prices, and some large lots that sold for relatively low prices. This might indicate that the utility of a larger lot diminishes after a certain point.

YearRemodAdd: Houses that were remodeled more recently tend to sell for higher prices. This is consistent with the expectation that buyers would pay more for more modern and updated features.

BsmtQual: The sale price seems to increase with the quality of the basement, with 'Ex' (Excellent) basements commanding the highest prices. The variation in price also seems to increase with basement quality.

## Numeric Feature Distributions

We look at the distribution, shape, and outliers of the key numeric columns/features. This informs data scaling/normalization and transformation needs later.



In [None]:
# Histogram for GrLivArea
plt.hist(train_df['GrLivArea'], bins=30, color='pink', edgecolor='black')
plt.title('Distribution of GrLivArea')
plt.xlabel('GrLivArea')
plt.ylabel('Frequency')

In [None]:
# Histogram for LotArea
plt.hist(train_df['LotArea'], bins=20, color='pink', edgecolor='black')
plt.title('Distribution of LotArea')
plt.xlabel('LotArea')
plt.ylabel('Frequency')

In [None]:
# Summary statistics
print(train_df[['GrLivArea', 'LotArea', '1stFlrSF']].describe())

In [None]:
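# Draw each histogram and boxplot on its own axes so the plots do not overlap
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
for col, name in enumerate(['GrLivArea', 'LotArea']):
    axs[0, col].hist(train_df[name], bins=30, color='pink', edgecolor='black')
    axs[0, col].set_title(f'Histogram of {name}')
    axs[1, col].boxplot(train_df[name])
    axs[1, col].set_title(f'Boxplot of {name}')
plt.tight_layout()
plt.show()

# Summary stats for all numeric columns
print(train_df.describe())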
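The points drawn beyond the whiskers of a box plot can also be listed directly. A minimal sketch, assuming the common 1.5 × IQR rule and a hypothetical helper named `iqr_outliers`; the toy living-area values are made up:

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Toy living areas standing in for train_df['GrLivArea'] (hypothetical values)
toy_area = pd.Series(
    [900, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 5600], name="GrLivArea"
)
toy_outliers = iqr_outliers(toy_area)
print(toy_outliers)
```

Whether flagged rows should be removed, capped, or kept is a modeling decision; the rule only marks candidates for a closer look.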

## Discrete Feature Analysis

We now look at features that take a small number of distinct values, such as the number of bedrooms and bathrooms. This is like sorting marbles by color and counting how many of each color there are.

Seeing how many houses have 1 bedroom versus 2, 3, and so on reveals patterns in the data. The same grouping and counting applies to categorical features such as neighborhood or house style.

Count plots of the 'BedroomAbvGr' and 'FullBath' features show how these discrete features are distributed, for example the most common number of bedrooms or full bathrooms in a house.


In [None]:
# Bedrooms: one bar per distinct value
sns.countplot(x='BedroomAbvGr', data=train_df, color='pink', edgecolor='black')
plt.xlabel('Bedrooms')
plt.ylabel('Frequency')

In [None]:
# Full bathrooms: one bar per distinct value
sns.countplot(x='FullBath', data=train_df, color='pink', edgecolor='black')
plt.xlabel('FullBath')
plt.ylabel('Frequency')



### Understanding Zoning

Zoning refers to the local or municipal laws or regulations that dictate how real estate can and cannot be utilized within specific geographic regions. For instance, zoning laws might restrict commercial or industrial land use, ensuring that businesses related to oil, manufacturing, or other industries don't set up their premises in residential neighborhoods.

The 'MSZoning' feature in our dataset signifies the general zoning classification of each property:

- **RL - Residential Low Density**: This category includes properties where the housing density is low, with one unit or a small number of units per building.
- **RM - Residential Medium Density**: This category includes properties with a higher number of units per building compared to low-density residential areas.
- **FV - Floating Village Residential**: This category includes houses that are part of a village grouping with shared open spaces.
- **RH - Residential High Density**: This category includes properties where housing density is highest, with the largest number of units per building.
- **C (all) - Commercial**: This category includes areas that are intended for commercial business use.

By evaluating the frequency of each zoning category, we can see which property types are present in our dataset and how they are distributed.

We use the `value_counts()` function to count the occurrences of each category in the 'MSZoning' feature, which identifies the most common zoning classification for houses in our dataset.




In [None]:
# The value counts printed show the distribution of houses across these zoning types

print('Frequency of values for MSZoning:')
print(train_df['MSZoning'].value_counts())


We can see that the majority of houses (1151) are in residential low density areas (RL). There are also a good number (218) in residential medium density (RM) and some in floating village zones (FV). Only a small fraction are in high density (RH) or commercial (C) areas.
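Counts like these are often easier to compare as shares of the whole. The sketch below puts counts and percentages side by side using `value_counts(normalize=True)`; the toy labels are hypothetical and do not reproduce the real distribution.

```python
import pandas as pd

# Toy zoning labels (hypothetical, not the real counts)
toy_zoning = pd.Series(
    ["RL", "RL", "RL", "RM", "RL", "FV", "RM", "RL"], name="MSZoning"
)

zoning_counts = toy_zoning.value_counts()
# normalize=True returns fractions of the total, converted here to percentages
zoning_share = toy_zoning.value_counts(normalize=True).mul(100).round(1)
zoning_table = pd.DataFrame({"count": zoning_counts, "percent": zoning_share})
print(zoning_table)
```

On the real data the series would be `train_df['MSZoning']`. Categories with a very small share may be worth grouping together before encoding.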


In [None]:
# The Neighborhood feature specifies which neighborhood each house belongs to. The value counts show the number of houses in each:

print('\nFrequency of values for Neighborhood:') 
print(train_df['Neighborhood'].value_counts())

NAmes is the largest neighborhood with 225 houses, while Veenker has only 11 and Blueste only 2.

## Location Analysis

Relating location features to sale price reveals patterns such as how prices differ between neighborhoods.


In [None]:
# Location analysis

# Setting up the figure size
plt.figure(figsize=(12, 8))

# Creating a boxplot of SalePrice by Neighborhood
custom_color = (255/255, 51/255, 153/255)  # RGB values for a bright pink color
sns.boxplot(x='Neighborhood', y='SalePrice', data=train_df, color=custom_color)

# Rotating the x labels for better visibility
plt.xticks(rotation=45)
plt.title('SalePrice distribution by Neighborhood')

# Displaying the plot
plt.show()
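The box plot can be complemented by a table of median prices per neighborhood, together with the number of sales behind each median. The sketch below uses a toy frame with made-up prices; only the column names follow the dataset.

```python
import pandas as pd

# Toy sales standing in for train_df (hypothetical values)
toy_sales = pd.DataFrame(
    {
        "Neighborhood": ["NAmes", "NAmes", "NAmes", "NoRidge", "NoRidge", "Blueste"],
        "SalePrice": [140000, 150000, 160000, 320000, 340000, 130000],
    }
)

neighborhood_stats = (
    toy_sales.groupby("Neighborhood")["SalePrice"]
    .agg(["median", "count"])
    .sort_values("median", ascending=False)
)
print(neighborhood_stats)

# The sorted index could be passed as `order=` to sns.boxplot to rank the boxes by median
neighborhood_order = neighborhood_stats.index.tolist()
```

Keeping the count next to the median is a reminder that medians for neighborhoods with very few sales are much less reliable than those for large neighborhoods.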
