## Housing Price Regression Walkthrough
> Let's start with some goals to achieve on this dataset:
1. Understand and clean the data to ensure it is ready for analysis and modeling.
2. Explore dependencies between features (data analysis).
3. Basic data engineering.
4. Experiment with various regression models and tune their hyperparameters.
5. Implement cross-validation to ensure the model generalizes well.
6. Feature engineering.
7. Conduct error analysis to identify and address the model's shortcomings.
8. Ensembling.
9. Submit the model.
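
To make goal 5 concrete, here is a minimal sketch of k-fold cross-validation. It is not the model used later in this walkthrough: the helper `kfold_rmse`, the ordinary least-squares fit, and the synthetic `X_demo` / `y_demo` arrays are all assumptions chosen so that the cell runs on its own with NumPy only.

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=42):
    """Return the validation RMSE of an ordinary least-squares fit for each of k folds.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))      # shuffle row indices once
    folds = np.array_split(indices, k)     # k roughly equal folds
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Prepend an intercept column, then fit on the training folds only.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        residuals = A_val @ coef - y[val_idx]
        scores.append(float(np.sqrt(np.mean(residuals ** 2))))
    return scores

# Synthetic stand-in data (assumption): a linear signal plus small Gaussian noise.
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + demo_rng.normal(scale=0.1, size=200)

cv_scores = kfold_rmse(X_demo, y_demo, k=5)
print("per-fold RMSE:", cv_scores)
print("mean RMSE:", np.mean(cv_scores))
```

A large spread between the per-fold scores would suggest that the model is sensitive to which rows it sees, which is the situation cross-validation is meant to expose.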

In [69]:
%pip install plotly

Note: you may need to restart the kernel to use updated packages.


In [54]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import scipy.stats as stats
import os
import seaborn as sns
from IPython.display import display, HTML
SEED = 42

In [25]:
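# Absolute path of the folder that holds the competition CSV files
data_dir = os.path.abspath('H_data_set')

In [26]:
train_df = pd.read_csv(os.path.join(data_dir, 'train.csv'))
test_df = pd.read_csv(os.path.join(data_dir, 'test.csv'))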

In [30]:
def scrollable_table(df, title, table_id):
    """Render a DataFrame as HTML inside a fixed-height, scrollable <div>."""
    html = f'<h2>{title}</h2>'
    html += f'<div id="{table_id}" style="height:300px; overflow:auto;">'
    html += df.to_html()
    html += '</div>'
    return html

In [34]:
df_num = train_df.select_dtypes(include=['float64', 'int64'])
num_summary = df_num.describe().T
# The third argument becomes the HTML id of the <div>, so it must not contain spaces
html_numerical = scrollable_table(num_summary, 'Numerical Features Summary', 'numerical_features')
display(HTML(html_numerical))

| Feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Id | 1460.0 | 730.5 | 421.610009 | 1.0 | 365.75 | 730.5 | 1095.25 | 1460.0 |
| MSSubClass | 1460.0 | 56.89726 | 42.300571 | 20.0 | 20.0 | 50.0 | 70.0 | 190.0 |
| LotFrontage | 1201.0 | 70.049958 | 24.284752 | 21.0 | 59.0 | 69.0 | 80.0 | 313.0 |
| LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.5 | 9478.5 | 11601.5 | 215245.0 |
| OverallQual | 1460.0 | 6.099315 | 1.382997 | 1.0 | 5.0 | 6.0 | 7.0 | 10.0 |
| OverallCond | 1460.0 | 5.575342 | 1.112799 | 1.0 | 5.0 | 5.0 | 6.0 | 9.0 |
| YearBuilt | 1460.0 | 1971.267808 | 30.202904 | 1872.0 | 1954.0 | 1973.0 | 2000.0 | 2010.0 |
| YearRemodAdd | 1460.0 | 1984.865753 | 20.645407 | 1950.0 | 1967.0 | 1994.0 | 2004.0 | 2010.0 |
| MasVnrArea | 1452.0 | 103.685262 | 181.066207 | 0.0 | 0.0 | 0.0 | 166.0 | 1600.0 |
| BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.0 | 383.5 | 712.25 | 5644.0 |


In [35]:
df_cat = train_df.select_dtypes(include=['object'])
cat_summary = df_cat.describe().T
# The third argument becomes the HTML id of the <div>, so it must not contain spaces
html_categorical = scrollable_table(cat_summary, 'Categorical Features Summary', 'categorical_features')
display(HTML(html_categorical))

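| Feature | count | unique | top | freq |
|---|---|---|---|---|
| MSZoning | 1460 | 5 | RL | 1151 |
| Street | 1460 | 2 | Pave | 1454 |
| Alley | 91 | 2 | Grvl | 50 |
| LotShape | 1460 | 4 | Reg | 925 |
| LandContour | 1460 | 4 | Lvl | 1311 |
| Utilities | 1460 | 2 | AllPub | 1459 |
| LotConfig | 1460 | 5 | Inside | 1052 |
| LandSlope | 1460 | 3 | Gtl | 1382 |
| Neighborhood | 1460 | 25 | NAmes | 225 |
| Condition1 | 1460 | 9 | Norm | 1260 |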


In [41]:
# Percentage of missing values per column
null_values = train_df.isnull().sum() / len(train_df) * 100
html_null = scrollable_table(null_values.to_frame(), 'Null Values (%)', 'null_values')
display(HTML(html_null))


| Feature | Missing (%) |
|---|---|
| Id | 0.0 |
| MSSubClass | 0.0 |
| MSZoning | 0.0 |
| LotFrontage | 17.739726 |
| LotArea | 0.0 |
| Street | 0.0 |
| Alley | 93.767123 |
| LotShape | 0.0 |
| LandContour | 0.0 |
| Utilities | 0.0 |
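
The table shows two quite different cases: `Alley` is missing in about 94% of rows, while `LotFrontage` is missing in about 18%. One possible way to act on that is sketched below. The helper `handle_missing`, the 80% drop threshold, and median imputation are assumptions made for illustration, not choices taken from this notebook, and the small `demo` frame is made-up data that only borrows the column names.

```python
import numpy as np
import pandas as pd

def handle_missing(df, drop_threshold=80.0):
    """Drop columns whose missing percentage exceeds drop_threshold,
    then fill the remaining numeric gaps with each column's median.

    Hypothetical helper; the threshold and the imputation rule are assumptions.
    """
    missing_pct = df.isnull().mean() * 100
    to_drop = missing_pct[missing_pct > drop_threshold].index
    cleaned = df.drop(columns=to_drop)   # returns a new frame, input is untouched
    numeric_cols = cleaned.select_dtypes(include="number").columns
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
    return cleaned

# Made-up rows for illustration only (not taken from train.csv).
demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0, np.nan, 75.0],
    "Alley": [np.nan, np.nan, np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
})

cleaned_demo = handle_missing(demo)
print(cleaned_demo)
```

For categorical columns, a missing value may simply mean that the house lacks the feature, so dropping the column is not always the right call; checking the dataset's data description before deciding is a safer habit.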


In [66]:
# Histogram of SalePrice, normalized so that the bar areas sum to 1
hist_data = go.Histogram(
    x=train_df['SalePrice'],
    nbinsx=50,
    name='Histogram',
    opacity=0.75,
    histnorm='probability density',
    marker=dict(color='purple'),
)

# Normal density with the same mean and standard deviation, for comparison
x_norm = np.linspace(train_df['SalePrice'].min(), train_df['SalePrice'].max(), 100)
y_norm = stats.norm.pdf(x_norm, train_df['SalePrice'].mean(), train_df['SalePrice'].std())
norm_data = go.Scatter(x=x_norm, y=y_norm, mode='lines', name='Normal Distribution')
fig = go.Figure(data=[hist_data, norm_data])

fig.update_layout(
    title='Sale Price Distribution',
    xaxis_title='Sale Price',
    yaxis_title='Density',  # matches histnorm='probability density'
    legend_title='Data Distribution',
    plot_bgcolor='rgba(32,32,32,1)',
    paper_bgcolor='rgba(32,32,32,1)',
    font=dict(color='white')
)
fig.show()

In [68]:
# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
qq_data = stats.probplot(train_df['SalePrice'], dist="norm")
qq_fig = px.scatter(
    x=qq_data[0][0],
    y=qq_data[0][1],
    labels={'x': 'Theoretical Quantiles', 'y': 'Ordered Values'},
    color_discrete_sequence=['purple'],
)
qq_fig.update_layout(
    title='Q-Q Plot for Sale Price',
    xaxis_title='Theoretical Quantiles', 
    yaxis_title='Ordered Values',
    legend_title='Data Distribution',
    plot_bgcolor='rgba(32,32,32,1)',
    paper_bgcolor='rgba(32,32,32,1)',
    font=dict(color='white')
)

slope, intercept, r_value, p_value, std_err = stats.linregress(qq_data[0][0], qq_data[0][1])
line_x = np.array([qq_data[0][0].min(), qq_data[0][0].max()])
line_y = intercept + slope * line_x
line_data = go.Scatter(x=line_x, y=line_y, mode='lines', name='Linear Regression', line=dict(color='green', width=2))
qq_fig.add_trace(line_data)
qq_fig.show()
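
The histogram and the Q-Q plot are visual checks of normality. A numeric companion is the sample skewness, sketched below together with a `log1p` transform, which is a common (but here assumed) remedy for right-skewed targets. The helper `sample_skewness` is hypothetical, and `synthetic_prices` is log-normal stand-in data rather than the real `SalePrice` column, so the printed values say nothing about the actual dataset.

```python
import numpy as np

def sample_skewness(values):
    """Biased sample skewness: the third central moment divided by std cubed."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.std(values) ** 3)

# Synthetic right-skewed stand-in for a price column (assumption, not real data).
price_rng = np.random.default_rng(42)
synthetic_prices = price_rng.lognormal(mean=12.0, sigma=0.4, size=1460)

skew_raw = sample_skewness(synthetic_prices)
skew_log = sample_skewness(np.log1p(synthetic_prices))

print("skewness before log1p:", skew_raw)
print("skewness after log1p: ", skew_log)
```

A skewness near zero after the transform would correspond to Q-Q points lying close to the reference line. To try this on the real data, pass `train_df['SalePrice']` instead of `synthetic_prices`.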