
# Jupyter Notebook: Explanation of Experiments in Layman's Terms

**Introduction:**
This project is about predicting house prices using different techniques. We are going to try various methods to make our predictions more accurate and to see which approach works best. We'll begin with simpler methods and move to more advanced ones to understand their benefits and challenges.
    


**Experiment 1: Basic Linear Regression**
- **What's Happening:** In this first experiment, we use a straightforward method called linear regression. Think of it as drawing a straight line through the data to predict house prices based on a few key features (like overall quality, living area, and year built). This experiment serves as our baseline, or starting point, to see how well a simple model can predict prices.
    

In [None]:

# Importing Libraries and Loading Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

# 1. Introduction: Loading dataset and providing a brief overview
train = pd.read_csv('train.csv')
print("Dataset Overview:")
train.info()  # info() prints its own summary and returns None, so no print() is needed
print(train.describe())

# Checking target variable (SalePrice) distribution
plt.figure(figsize=(10, 6))
sns.histplot(train['SalePrice'], kde=True)
plt.title("Distribution of Sale Prices")
plt.xlabel("Sale Price")
plt.ylabel("Frequency")
plt.show()

# Selecting basic features for Experiment 1
features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
X = train[features]
y = train['SalePrice']

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Making predictions and evaluating
y_pred = linear_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Experiment 1 - RMSE for Linear Regression: {rmse}")

# Plotting actual vs predicted values
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Sale Price")
plt.ylabel("Predicted Sale Price")
plt.title("Experiment 1: Actual vs Predicted Sale Price")
plt.show()
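
The RMSE reported in Experiment 1 is the square root of the average squared gap between actual and predicted prices, so it is expressed in the same units as `SalePrice`. As a minimal sketch, assuming three made-up prices that are not taken from `train.csv`, the calculation can be written out by hand and compared with scikit-learn's result:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices, for illustration only
actual = np.array([200_000.0, 150_000.0, 300_000.0])
predicted = np.array([210_000.0, 140_000.0, 320_000.0])

# Step 1: error for each house; Step 2: square, average, take the square root
errors = predicted - actual
rmse_by_hand = np.sqrt(np.mean(errors ** 2))

# The same quantity computed via scikit-learn, as in the experiment code
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))

print(f"By hand: {rmse_by_hand:.2f}, via scikit-learn: {rmse_sklearn:.2f}")
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones.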
    


**Experiment 2: Improving with More Features and Ridge Regression**
- **What's Happening:** Here, we improve upon our first attempt by adding more features and using a different method called Ridge Regression. The extra features are squares and pairwise products of the original ones, which let the model capture curved relationships instead of only a straight line. Ridge Regression adds a penalty that keeps the model's weights small, which helps prevent overfitting: a model that is too focused on our specific dataset is less useful for new, unseen data.
    

In [None]:

# Adding new polynomial features for Ridge Regression
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Splitting data with transformed features (before scaling, so the scaler never sees test rows)
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(X_poly, y, test_size=0.2, random_state=42)

# Scaling the features: fit on the training split only, then apply to both splits
scaler = StandardScaler()
X_train_poly = scaler.fit_transform(X_train_poly)
X_test_poly = scaler.transform(X_test_poly)

# Training Ridge regression model
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_poly, y_train_poly)

# Making predictions and evaluating
y_pred_poly = ridge_model.predict(X_test_poly)
rmse_poly = np.sqrt(mean_squared_error(y_test_poly, y_pred_poly))
print(f"Experiment 2 - RMSE for Ridge Regression: {rmse_poly}")

# Plotting actual vs predicted values
plt.scatter(y_test_poly, y_pred_poly)
plt.xlabel("Actual Sale Price")
plt.ylabel("Predicted Sale Price")
plt.title("Experiment 2: Actual vs Predicted Sale Price (Ridge Regression)")
plt.show()
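
One way to keep the polynomial expansion, scaling, and Ridge model together is a scikit-learn pipeline, which fits each preprocessing step on the training rows only and then reuses the fitted steps on the test rows. The sketch below is an illustration under assumptions: it uses synthetic data from `make_regression` rather than `train.csv`, and the names `X_demo`, `y_demo`, and `ridge_pipeline` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the housing data (illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same three steps as Experiment 2, chained so they are always applied in order
ridge_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)

# fit() learns the scaling statistics from the training rows only
ridge_pipeline.fit(X_tr, y_tr)

# predict() applies the already-fitted steps to the unseen test rows
rmse_demo = np.sqrt(mean_squared_error(y_te, ridge_pipeline.predict(X_te)))
print(f"Demo RMSE: {rmse_demo:.2f}")
```

With six input features, the degree-2 expansion produces 27 columns (6 originals, 6 squares, 15 pairwise products), which is why a penalty such as Ridge's becomes useful.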
    


**Experiment 3: Even More Details with Lasso Regression**
- **What's Happening:** Now, we're taking an even deeper dive by adding more features and using Lasso Regression. Lasso is similar to Ridge, but it can shrink the weights of less useful features all the way to zero, which removes them and simplifies our model. We're experimenting to see if removing some features helps improve the accuracy of our predictions.
    

In [None]:

# Adding additional relevant features and Lasso regression model
additional_features = features + ['1stFlrSF', '2ndFlrSF', 'LotArea', 'BsmtFinSF1', 'Fireplaces']
X_extended = train[additional_features]

# Splitting data with extended feature set
X_train_ext, X_test_ext, y_train_ext, y_test_ext = train_test_split(X_extended, y, test_size=0.2, random_state=42)

# Training Lasso regression model (features are unscaled here, so allow extra iterations to converge)
lasso_model = Lasso(alpha=0.1, max_iter=10000)
lasso_model.fit(X_train_ext, y_train_ext)

# Making predictions and evaluating
y_pred_ext = lasso_model.predict(X_test_ext)
rmse_ext = np.sqrt(mean_squared_error(y_test_ext, y_pred_ext))
print(f"Experiment 3 - RMSE for Lasso Regression with adjustments: {rmse_ext}")

# Plotting actual vs predicted values
plt.scatter(y_test_ext, y_pred_ext)
plt.xlabel("Actual Sale Price")
plt.ylabel("Predicted Sale Price")
plt.title("Experiment 3: Actual vs Predicted Sale Price (Lasso Regression with Adjustments)")
plt.show()
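
To see the feature-removal behaviour described for Lasso, it helps to compare it with Ridge on data where we know which features matter. The sketch below is a toy illustration, assuming synthetic data in which only the first two of five columns affect the target; the names `X_demo`, `y_demo`, `lasso_demo`, and `ridge_demo` are hypothetical and the `alpha` value is chosen only for the demo.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Five standard-normal features; only columns 0 and 1 influence the target
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Fit both penalised models with the same penalty strength
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))

# Indices of the features Lasso kept (non-zero coefficients)
print("Features kept by Lasso:", np.flatnonzero(lasso_demo.coef_))
```

In this setup Lasso should set the three irrelevant coefficients to exactly zero, while Ridge only makes them small. The same check, `np.flatnonzero(lasso_model.coef_)`, could be applied to the fitted model from Experiment 3 to list which housing features survive.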
    


**Experiment 3 (Alternative): Using Lasso with Cross-Validation**
- **What's Happening:** In this part, we take Lasso Regression a step further by using an automated way to find the best value of `alpha`, the setting that controls how strongly Lasso shrinks feature weights. Cross-validation splits the training data into five parts, fits the model on four of them, checks it on the remaining one, and repeats until every part has been used for checking. This helps us avoid both overfitting and underfitting the data.
    

In [None]:

# Splitting data with extended feature set (before scaling, so the scaler never sees test rows)
X_train_ext, X_test_ext, y_train_ext, y_test_ext = train_test_split(X_extended, y, test_size=0.2, random_state=42)

# Standardizing the features: fit on the training split only, then apply to both splits
scaler = StandardScaler()
X_train_ext = scaler.fit_transform(X_train_ext)
X_test_ext = scaler.transform(X_test_ext)

# Using LassoCV to automatically select the best alpha and handle convergence
lasso_cv_model = LassoCV(alphas=[0.1, 0.5, 1.0, 5.0, 10.0], max_iter=10000, cv=5)
lasso_cv_model.fit(X_train_ext, y_train_ext)

# Getting the best alpha
print(f"Optimal alpha selected by LassoCV: {lasso_cv_model.alpha_}")

# Making predictions and evaluating
y_pred_ext_cv = lasso_cv_model.predict(X_test_ext)
rmse_ext_cv = np.sqrt(mean_squared_error(y_test_ext, y_pred_ext_cv))
print(f"Experiment 3 - RMSE for Lasso Regression with LassoCV: {rmse_ext_cv}")

# Plotting actual vs predicted values
plt.scatter(y_test_ext, y_pred_ext_cv)
plt.xlabel("Actual Sale Price")
plt.ylabel("Predicted Sale Price")
plt.title("Experiment 3: Actual vs Predicted Sale Price (Lasso Regression with LassoCV)")
plt.show()
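
The search that `LassoCV` automates can be approximated by hand, which makes the idea of cross-validation more concrete. The sketch below is a rough equivalent under assumptions, not a reproduction of `LassoCV` internals: it uses synthetic data from `make_regression`, scores each candidate by average RMSE across five folds (`LassoCV` itself compares mean squared error), and the names `candidate_alphas`, `mean_rmse`, and `best_alpha` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (illustration only)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=15.0, random_state=0
)

# Same candidate grid as the LassoCV call in the experiment
candidate_alphas = [0.1, 0.5, 1.0, 5.0, 10.0]
mean_rmse = {}

for alpha in candidate_alphas:
    # Scaling lives inside the pipeline, so it is refit on each training fold
    model = make_pipeline(StandardScaler(), Lasso(alpha=alpha, max_iter=10000))
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    # Scores are negated RMSE values, so flip the sign before averaging
    mean_rmse[alpha] = -scores.mean()

# The best alpha is the one with the lowest average error across folds
best_alpha = min(mean_rmse, key=mean_rmse.get)
for alpha, err in mean_rmse.items():
    print(f"alpha={alpha:<5} mean CV RMSE={err:.2f}")
print(f"Selected alpha: {best_alpha}")
```

If the selected value sits at either end of the grid, that is usually a hint to widen the range of candidates before trusting the result.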
    