## Import Libraries
In this section, we import the libraries used to load, visualize, and model the data, and to build the dashboard.

In [67]:
%pip install streamlit

Note: you may need to restart the kernel to use updated packages.


In [68]:
%pip install seaborn

Note: you may need to restart the kernel to use updated packages.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# SalePrice is continuous, so this is a regression task (not classification)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import joblib
import streamlit as st


## Data Exploration
First, let's load the dataset and explore its basic properties.


In [None]:
# Load the dataset
train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')

# Check the shape and columns of the dataset
print(train.shape)
print(train.columns)

# Display the first few rows of the dataset
train.head()


## Visualize the Data
Now, let's visualize the distribution of the target variable 'SalePrice'.


In [None]:
# Plot the distribution of SalePrice
plt.figure(figsize=(8, 6))
sns.histplot(train['SalePrice'], kde=True, bins=30)
plt.title('Distribution of Sale Price')
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.show()


The histogram shows whether the target variable (SalePrice) is roughly normally distributed or skewed.
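
The visual check can be backed by a number. The following is a minimal sketch, not part of the original notebook: it uses synthetic right-skewed prices as a stand-in for `train['SalePrice']` and compares the skewness before and after a `log1p` transform, a common (but here assumed) remedy for a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices standing in for train['SalePrice'] (illustrative only)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

# Skewness near 0 suggests a symmetric distribution; > 0 indicates a right tail
raw_skew = prices.skew()

# log1p compresses large values, which usually reduces right skew
log_skew = np.log1p(prices).skew()

print(f"Skewness before log transform: {raw_skew:.2f}")
print(f"Skewness after log transform:  {log_skew:.2f}")
```

In the notebook, `prices` would be replaced with `train['SalePrice']`. If a log-transformed target were used for training, predictions would need `np.expm1` applied to return to the original price scale.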


## Data Cleaning
Next, we clean the data by handling missing values.


In [None]:
# Check for missing values
missing_data = train.isnull().sum()
missing_data = missing_data[missing_data > 0]
missing_data.sort_values(ascending=False)


The output lists every column that has missing values, sorted by count. We handle them below by imputation, and drop any rows that lack a target value.


In [None]:
# Fill missing values in 'LotFrontage' with the median
train['LotFrontage'] = train['LotFrontage'].fillna(train['LotFrontage'].median())

# Drop rows with missing target value (SalePrice)
train.dropna(subset=['SalePrice'], inplace=True)

# Fill all remaining missing values (numeric and categorical) with each column's mode
train.fillna(train.mode().iloc[0], inplace=True)

We've handled missing data by filling 'LotFrontage' with its median and filling the remaining missing values in every other column, numeric or categorical, with that column's mode (its most frequent value).
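
If the goal is to use the median for every numeric column and the mode for every categorical column, one option is to split the columns by dtype first. The sketch below assumes a small hypothetical frame `df` in place of `train`; the column names are real dataset columns, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; in the notebook this would be `train`
df = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 100.0],
    "GarageType": ["Attchd", None, "Attchd", "Detchd"],
})

# Split columns by dtype
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Numeric columns: fill with each column's median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns: fill with each column's most frequent value
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

print(df)
```

Whether the mode is a sensible fill value depends on the column. In this dataset, a missing value in some categorical columns indicates that the house lacks the feature, so a dedicated "None" category may be a better choice there.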


## Categorical Data Encoding
Since the dataset includes categorical variables, we'll encode them for model training.


In [None]:
# Convert categorical columns to numerical by one-hot encoding
train_encoded = pd.get_dummies(train, drop_first=True)
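
One caveat with `pd.get_dummies` is that the output columns depend on the categories present in the data being encoded, so new data can produce different columns than the training set. A possible way to keep them consistent, sketched here with hypothetical frames `train_df` and `new_df` and a hypothetical helper `encode`, is to fix the category list from the training data before encoding.

```python
import pandas as pd

# Illustrative frames; values are made up
train_df = pd.DataFrame({"LotArea": [8450, 9600, 11250], "Street": ["Pave", "Grvl", "Pave"]})
new_df = pd.DataFrame({"LotArea": [7000], "Street": ["Pave"]})

# Record the categories seen in training for each categorical column
cat_cols = ["Street"]
categories = {col: sorted(train_df[col].unique()) for col in cat_cols}

def encode(df):
    """One-hot encode df using the category lists learned from train_df."""
    df = df.copy()
    for col, cats in categories.items():
        # A fixed category list makes get_dummies emit the same columns every time
        df[col] = pd.Categorical(df[col], categories=cats)
    return pd.get_dummies(df, drop_first=True)

train_enc = encode(train_df)
new_enc = encode(new_df)

print(list(train_enc.columns))
print(list(new_enc.columns))
```

Because `new_df` contains only one street type, encoding it on its own with `drop_first=True` would drop the only dummy column; fixing the categories avoids that mismatch.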


## Model Training
Now, we will train a Random Forest Regressor model using the cleaned and preprocessed data.


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Define features and target
X = train_encoded.drop('SalePrice', axis=1)
y = train_encoded['SalePrice']

# Split data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on the validation set
y_pred = model.predict(X_val)

# Evaluate the model performance
mse = mean_squared_error(y_val, y_pred)
print(f'Mean Squared Error: {mse}')

We trained a Random Forest Regressor on the dataset and evaluated its performance using Mean Squared Error (MSE).
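
MSE is expressed in squared dollars, which is hard to interpret directly. As a sketch, the snippet below computes the root mean squared error (RMSE, in dollars) and the R² score alongside MSE; the arrays `y_true` and `y_hat` are made-up values standing in for `y_val` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values standing in for y_val and y_pred
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_hat = np.array([210000.0, 140000.0, 290000.0, 260000.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)            # same units as SalePrice
r2 = r2_score(y_true, y_hat)   # fraction of variance explained

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R^2:  {r2:.3f}")
```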


## Save the Model
Now, we will save the trained model to a file for deployment.


In [None]:
import joblib

# Save the model as a .pkl file
joblib.dump(model, 'house_price_model.pkl')
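
A saved model expects the same feature columns at prediction time that it saw during training. One possible approach, shown here as a sketch on a tiny made-up dataset, is to store the column list next to the model. The file name `house_price_bundle.pkl` and the dictionary layout are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Tiny made-up training set with two real column names from the dataset
X = pd.DataFrame({"LotArea": [8450, 9600, 11250, 9550], "OverallQual": [7, 6, 7, 5]})
y = pd.Series([208500, 181500, 223500, 140000])

model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Bundle the model with the column order it was trained on
bundle = {"model": model, "columns": list(X.columns)}
path = os.path.join(tempfile.mkdtemp(), "house_price_bundle.pkl")  # hypothetical file name
joblib.dump(bundle, path)

# Later (e.g. in the dashboard): load the bundle and build a matching input row
loaded = joblib.load(path)
row = pd.DataFrame([[9000, 6]], columns=loaded["columns"])
pred = loaded["model"].predict(row)[0]

print(f"Predicted price: {pred:,.0f}")
```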


## Dashboard Creation
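Next, we will create an interactive dashboard using Streamlit to predict house prices. Streamlit apps do not render inside a notebook cell, so save the app code below as a script (for example `app.py`) and launch it with `streamlit run app.py`.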


In [None]:
import warnings

# Suppress specific warning
warnings.filterwarnings("ignore", message="missing ScriptRunContext!")

In [None]:
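import streamlit as st
import joblib
import pandas as pd

# Load the saved model
model = joblib.load('house_price_model.pkl')

# Define function for prediction
def predict_house_price(lot_area, overall_quality):
    # The model was trained on every one-hot-encoded column, so it expects the
    # same columns at prediction time; passing only two values raises an error.
    # Start from a row of zeros over the training columns (feature_names_in_ is
    # set when the model is fitted on a DataFrame) and fill in the user's values.
    # Zero is only a placeholder for the other features; typical training
    # values (e.g. medians) would be a better default.
    row = pd.DataFrame(0, index=[0], columns=model.feature_names_in_)
    row['LotArea'] = lot_area
    row['OverallQual'] = overall_quality
    return model.predict(row)[0]

# Streamlit app
st.title('House Price Prediction')
st.write('Enter house details for price prediction')

# Input fields for prediction
lot_area = st.number_input('Lot Area', min_value=0)
overall_quality = st.selectbox('Overall Quality', options=list(range(1, 11)))

# Predict when the user presses the button
if st.button('Predict'):
    prediction = predict_house_price(lot_area, overall_quality)
    st.write(f'Predicted House Price: ${prediction:,.2f}')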
