# **Heritage Housing**

# 1. Introduction
This notebook aims to help maximize the sale prices of four inherited properties in Ames, Iowa by analyzing house features and building a machine learning model to predict house sale prices.
We will explore several hypotheses and validate them through data analysis and visualizations.

## Objectives

* The objective of this notebook is to fetch, clean, and analyze housing data to predict house sale prices using machine learning models. This includes performing Exploratory Data Analysis (EDA), building a predictive model, tuning it for accuracy, and evaluating the results.

## Inputs

* The input data is from the 'Ames Housing Dataset', a CSV file. The notebook requires the data to have features like total square footage, year built, neighborhood information, garage area, and sale price, among others. 

## Outputs

* The outputs will include exploratory data visualizations, trained machine learning models, and their performance metrics. The final model will be able to predict sale prices based on house features. Artifacts generated include the best-tuned Random Forest model and its performance report.



---

# Section 1 (Loading imports and files)

This section loads the imports and files needed to run the rest of the notebook.

In [None]:
# 1. Imports and Data Loading
import streamlit as st
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_csv('AmesHousing.csv')

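# Fill missing values in numeric columns with each column's median
# (numeric_only=True skips text columns, which raise a TypeError in pandas 2.x)
df.fillna(df.median(numeric_only=True), inplace=True)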

# Encode categorical features
le = LabelEncoder()
df['Neighborhood'] = le.fit_transform(df['Neighborhood'])

# Add new features
df['TotalSF'] = df['1stFlrSF'] + df['2ndFlrSF'] + df['TotalBsmtSF']

# Split the dataset into features (X) and target (y)
X = df[['TotalSF', 'OverallQual', 'GarageArea', 'YearBuilt', 'Neighborhood']]
y = df['SalePrice']

# Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Title
st.title("Heritage Housing Price Prediction Dashboard")
st.write("This dashboard helps predict house prices in Ames, Iowa using Exploratory Data Analysis and machine learning models.")
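
The median fill in the cell above only makes sense for numeric columns. As a minimal sketch, assuming a small made-up frame in place of `AmesHousing.csv`, the snippet below shows numeric columns filled with their medians and a text column filled with its most frequent value. The most-frequent rule for text columns is an assumption for illustration, not something the notebook itself does.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "GarageArea": [400.0, np.nan, 600.0, 500.0],
    "YearBuilt": [1990, 2005, np.nan, 1975],
    "Neighborhood": ["NAmes", "CollgCr", None, "NAmes"],
})

# Median of the numeric columns only; the text column is skipped.
numeric_medians = toy.median(numeric_only=True)
filled = toy.fillna(numeric_medians)

# Text columns need their own rule; here the most frequent value is used
# (an illustrative choice, not taken from the notebook).
most_common = filled["Neighborhood"].mode()[0]
filled["Neighborhood"] = filled["Neighborhood"].fillna(most_common)

print(filled)
```

Filling with a Series keyed by column name leaves any column without an entry untouched, which is why the text column has to be handled separately.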


---

# Section 2 (EDA)

This section covers the analysis and plotting of the dataset provided with the template I forked at the start of this project.

In [None]:
# 2. Exploratory Data Analysis (EDA)

# Correlation Heatmap
st.subheader('Exploratory Data Analysis')
st.write("### Correlation Heatmap")
fig, ax = plt.subplots(figsize=(10, 8))
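# numeric_only=True keeps text columns out of the correlation matrix (required in pandas 2.x)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f', ax=ax)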
st.pyplot(fig)

# Scatter plot for Total Square Footage vs Sale Price
st.write("### Total Square Footage vs Sale Price")
fig = px.scatter(df, x='TotalSF', y='SalePrice', opacity=0.5, title='Total Square Footage vs Sale Price')
st.plotly_chart(fig)

# Box plot for Neighborhood vs Sale Price
st.write("### Neighborhood vs Sale Price")
fig = px.box(df, x='Neighborhood', y='SalePrice', title='Neighborhood vs Sale Price')
st.plotly_chart(fig)

# Scatter plot for Overall Quality vs Sale Price
st.write("### Overall Quality vs Sale Price")
fig = px.scatter(df, x='OverallQual', y='SalePrice', opacity=0.5, title='Overall Quality vs Sale Price')
st.plotly_chart(fig)

# Scatter plot for Year Built vs Sale Price
st.write("### Year Built vs Sale Price")
fig = px.scatter(df, x='YearBuilt', y='SalePrice', opacity=0.5, title='Year Built vs Sale Price')
st.plotly_chart(fig)

# Scatter plot for Garage Area vs Sale Price
st.write("### Garage Area vs Sale Price")
fig = px.scatter(df, x='GarageArea', y='SalePrice', opacity=0.5, title='Garage Area vs Sale Price')
st.plotly_chart(fig)
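
A full heatmap can be hard to read when there are many columns. As a rough sketch, assuming a small made-up frame rather than the real dataset, the same correlation matrix can be reduced to a ranked list of each numeric feature's correlation with `SalePrice`:

```python
import pandas as pd

# Made-up values for illustration; not taken from the Ames data.
toy = pd.DataFrame({
    "TotalSF": [1500, 2000, 2500, 3000, 3500],
    "OverallQual": [5, 6, 6, 8, 9],
    "YearBuilt": [1960, 1995, 1980, 2005, 2010],
    "Neighborhood": ["A", "B", "A", "C", "B"],
    "SalePrice": [120000, 160000, 190000, 250000, 300000],
})

# Pearson correlation of every numeric column with the target,
# strongest positive correlation first. The text column is ignored.
corr_with_price = (
    toy.corr(numeric_only=True)["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)

print(corr_with_price)
```

Correlation only captures linear association, so a low value here does not rule out a non-linear relationship that the scatter plots might still reveal.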


# **Section 4 Machine learning model development**

This section covers the predictions and evaluations from the different models trained on the dataset:
- Linear regression model
- Random Forest model
    - Hyperparameter tuning for the Random Forest model
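
The comparison listed above could look something like the following sketch. It runs on synthetic data rather than the Ames dataset, and the parameter grid, fold count, and variable names (`lin_model`, `search`, `scores`) are assumptions chosen to keep the example small, not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the housing features (not the Ames data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 3))
y = 100 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: ordinary least squares.
lin_model = LinearRegression().fit(X_train, y_train)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid, cv=3, scoring="r2"
)
search.fit(X_train, y_train)

# Compare both models on the same held-out test set.
scores = {
    "linear": r2_score(y_test, lin_model.predict(X_test)),
    "random_forest": r2_score(y_test, search.best_estimator_.predict(X_test)),
}

print(search.best_params_)
print(scores)
```

The grid search tunes only on the training split through cross-validation, so the test set stays untouched until the final comparison.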

In [None]:
# 3. Machine Learning Model Development

# Function to make predictions
def predict_price(total_sf, overall_qual, garage_area, year_built, neighborhood):
    input_data = pd.DataFrame([[total_sf, overall_qual, garage_area, year_built, neighborhood]], 
                              columns=['TotalSF', 'OverallQual', 'GarageArea', 'YearBuilt', 'Neighborhood'])
    return rf_model.predict(input_data)[0]

# Sidebar for user input
st.sidebar.header("Input Features")
total_sf = st.sidebar.slider('Total Square Footage', int(df['TotalSF'].min()), int(df['TotalSF'].max()), int(df['TotalSF'].mean()))
overall_qual = st.sidebar.slider('Overall Quality', int(df['OverallQual'].min()), int(df['OverallQual'].max()), int(df['OverallQual'].mean()))
garage_area = st.sidebar.slider('Garage Area', int(df['GarageArea'].min()), int(df['GarageArea'].max()), int(df['GarageArea'].mean()))
year_built = st.sidebar.slider('Year Built', int(df['YearBuilt'].min()), int(df['YearBuilt'].max()), int(df['YearBuilt'].mean()))
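# Show neighborhood names in the selectbox, then convert the choice
# back to the integer code the model was trained on
neighborhood_name = st.sidebar.selectbox('Neighborhood', le.classes_)
neighborhood = le.transform([neighborhood_name])[0]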

# Predict house price based on input
if st.sidebar.button("Predict Sale Price"):
    price = predict_price(total_sf, overall_qual, garage_area, year_built, neighborhood)
    st.write(f"### Predicted Sale Price: ${price:,.2f}")
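
Because the `Neighborhood` column was label-encoded in the first cell, the model expects an integer code rather than a name. The snippet below is a minimal sketch of how `LabelEncoder` maps between the two, using a few made-up neighborhood labels; the variable names are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# A handful of example labels (illustrative, not the full Ames list).
names = ["NAmes", "CollgCr", "OldTown", "NAmes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

# classes_ holds the unique labels in sorted order;
# a label's position in classes_ is its integer code.
print(list(encoder.classes_))  # ['CollgCr', 'NAmes', 'OldTown']

# Name -> code, as needed before calling the model.
code_for_oldtown = int(encoder.transform(["OldTown"])[0])

# Code -> name, as needed when showing results to a user.
name_back = encoder.inverse_transform([code_for_oldtown])[0]
```

Note that these integer codes follow alphabetical order and carry no meaning about price or location, which a tree-based model tolerates better than a linear one.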


# **Section 5 Model Evaluation and results**

In [None]:
# 4. Model Evaluation and Results

# Residuals Plot
st.subheader("Model Evaluation")
st.write("### Residuals Plot")
y_pred_rf = rf_model.predict(X_test)
fig, ax = plt.subplots(figsize=(8, 6))
# Draw on the Axes object so the plot is tied to the figure passed to Streamlit
ax.scatter(y_test, y_test - y_pred_rf, alpha=0.5)
ax.set_title('Residuals Plot')
ax.set_xlabel('Actual Sale Price')
ax.set_ylabel('Residuals')
st.pyplot(fig)

# Display R² and RMSE for the model
st.write(f"### Random Forest R²: {r2_score(y_test, y_pred_rf):.4f}")
st.write(f"### Random Forest RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_rf)):.2f}")
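
For reference, the two metrics reported above can be worked out by hand. The sketch below uses four made-up actual and predicted values, not results from this notebook, and applies the standard definitions: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up actual and predicted values for illustration.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 190.0, 260.0])

residuals = y_true - y_hat

# RMSE: typical size of an error, in the same units as the target.
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: share of the target's variance explained by the predictions.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.2f}")  # RMSE: 10.00
print(f"R²: {r2:.3f}")      # R²: 0.968
```

RMSE is easier to explain to a seller because it is expressed in dollars, while R² is easier to compare across datasets with different price ranges.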


---

# **Conclusion**

From the analysis, we found that larger, higher-quality, and newer homes fetch higher sale prices. Features such as location (neighborhood) and garage space also impact prices significantly. The best-tuned Random Forest model provided accurate predictions, and feature importance analysis showed that house size and overall quality are key drivers of sale price.
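
A feature importance analysis of the kind mentioned above can be read straight from a fitted `RandomForestRegressor`. The following is a sketch on synthetic data in which price is constructed to depend mostly on size, then on quality, and not at all on garage area, so it illustrates the technique rather than reproducing this project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features that borrow the notebook's column names (values are made up).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "TotalSF": rng.uniform(800, 4000, 300),
    "OverallQual": rng.integers(1, 11, 300),
    "GarageArea": rng.uniform(0, 900, 300),
})

# Synthetic price: driven mostly by size, then quality; garage has no effect here.
y = 60 * X["TotalSF"] + 8000 * X["OverallQual"] + rng.normal(0, 5000, 300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; larger means the feature
# contributed more to reducing error across the trees.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances can overstate features with many distinct values, so permutation importance on a held-out set is a common cross-check.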

---

# 7. Credits and Acknowledgements

- Dataset: the Ames Housing Dataset.
- Inspiration from Machine Learning and Data Analysis walkthrough projects.
- My mentor, Precious Ijege, for his guidance in this project.
- My fellow peers, such as Beth Cottel, for checking in with me when times were tough during development (and for keeping me smiling as well as motivated).

# Limitations and Next steps

* One limitation of the model is that we focused on a subset of features; other factors such as market conditions or interior characteristics might also impact house prices.