# Ames Housing

## 1-Overview

In this project, I will predict housing prices using the training and test datasets that Kaggle provides for the Ames Housing dataset, with the goal of achieving the lowest possible RMSE (root mean squared error). The project is also designed to show that the factors influencing price negotiations extend beyond the basic assumptions the average person typically holds.
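For reference, RMSE measures the typical size of a prediction error in the same units as the target, so smaller values are better. The following is a minimal sketch, assuming NumPy is available; the function name `rmse` and the sample prices are illustrative and do not come from the project.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative prices only (not taken from the dataset).
actual = [200_000, 150_000, 300_000]
predicted = [210_000, 140_000, 300_000]
error = rmse(actual, predicted)
print(f"RMSE: {error:.2f}")
```

A perfect model would score 0, and larger misses are penalized more heavily because the errors are squared before averaging.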

### 1.1-Project Plan

1. **Initialization and Data Loading**  
   - Import necessary libraries (e.g., pandas, numpy, matplotlib, scikit-learn).  
   - Load the training and test datasets from Kaggle (Ames Housing dataset).  
   - Display the first few rows and summary information (e.g., shape and column types) to verify successful loading.

2. **Data Cleaning and Preparation**  
   - Handle missing values and duplicates using pandas.  
   - Encode categorical variables.  

3. **Feature Engineering**  
   - Identify potential new features or transform existing ones to enhance model performance.

4. **Model Training/Model Evaluation**  
   - Train a Linear Regression model.  
   - Tune hyperparameters using GridSearchCV with k-fold cross-validation.  
   - Evaluate model performance solely based on RMSE.

5. **Results Output**  
   - Generate predictions on the test set.  
   - Export the predictions to a CSV file containing only the `Id` and predicted `SalePrice` columns.
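Steps 2 and 5 of the plan can be sketched on a toy frame. This is only an illustration under stated assumptions: the frame `toy`, the median fill strategy, and the placeholder predictions are hypothetical and are not the choices made later in the project.

```python
import io
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "Id": [1, 2, 2, 3],
    "LotFrontage": [65.0, 80.0, 80.0, None],
    "MSZoning": ["RL", "RM", "RM", "RL"],
})

# Step 2: drop exact duplicate rows, then fill numeric gaps with the median
# (median imputation is an assumption made for this sketch).
clean = toy.drop_duplicates().copy()
clean["LotFrontage"] = clean["LotFrontage"].fillna(clean["LotFrontage"].median())

# Step 2: one-hot encode the categorical column.
encoded = pd.get_dummies(clean, columns=["MSZoning"])

# Step 5: a submission holds only Id and the predicted SalePrice.
# The predictions here are placeholders, not model output.
submission = pd.DataFrame({
    "Id": clean["Id"].to_numpy(),
    "SalePrice": [0.0] * len(clean),
})
buffer = io.StringIO()  # stands in for a file path such as a CSV on disk
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `index=False` keeps the row index out of the exported file, so the output contains only the two required columns.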

### 1.2-Questions

1. What does the distribution of SalePrice reveal about housing market trends in the dataset?  
2. Which numerical and categorical features show the strongest correlation with SalePrice?  
3. What insights can be derived from the relationships between these features and SalePrice?  
4. What feature engineering techniques could improve model performance, and how might they impact the RMSE?
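One possible way to approach question 2 for numerical features is to rank them by their Pearson correlation with `SalePrice`. The sketch below uses a handful of values copied from the first five training rows shown later in the notebook; the frame name `sample` is hypothetical, and five rows are far too few to draw conclusions from. Categorical features such as `MSZoning` are excluded here because Pearson correlation needs numeric input; they would need encoding or a different comparison (for example, group-wise medians).

```python
import pandas as pd

# Five rows for illustration; the real analysis would use the full `train` frame.
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "YrSold": [2008, 2007, 2008, 2006, 2008],
    "MSZoning": ["RL", "RL", "RL", "RL", "RL"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Keep numeric columns only, then correlate every column with the target.
numeric = sample.select_dtypes(include="number")
correlations = (
    numeric.corr()["SalePrice"]
    .drop("SalePrice")                                   # the target correlates 1.0 with itself
    .sort_values(key=lambda s: s.abs(), ascending=False)  # strongest first, ignoring sign
)
print(correlations)
```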

## 2-Initialization

Placing imports at the top of a Jupyter notebook declares its dependencies up front and makes the imported libraries available to every later cell.

In [1]:
import pandas as pd

1. **pandas:** A powerful library for data manipulation and analysis using DataFrame structures.

In [2]:
test = pd.read_csv('./datasets/test.csv')
train = pd.read_csv('./datasets/train.csv')

Each line in the code snippet reads a different CSV file from the `./datasets/` directory and loads it into a DataFrame using the `pd.read_csv()` function from the pandas library. Specifically, `test.csv` is loaded into the `test` DataFrame, and `train.csv` into the `train` DataFrame.
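To illustrate how `pd.read_csv()` treats a header row, the sketch below parses a small in-memory CSV instead of the real files. It assumes, without confirmation from the source, that the `Unnamed: 0` header seen in the output comes from a CSV saved together with its row index; pandas assigns names of that form to columns with a blank header. The variable names `csv_text`, `raw`, and `indexed` are hypothetical.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV that was saved with its index column
# (note the empty first header field).
csv_text = """,Id,MSSubClass,MSZoning,LotArea
0,1461,20,RH,11622
1,1462,20,RL,14267
"""

# Default load: the unnamed first column is kept as a data column.
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# With index_col=0 the same column becomes the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
```

Whether to use `index_col=0` depends on whether that first column carries information; if it only repeats the row position, reading it as the index keeps it out of the feature set.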

In [3]:
display(test, train)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,6,2006,WD,Normal
1455,2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,4,2006,WD,Abnorml
1456,2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,9,2006,WD,Abnorml
1457,2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,...,0,0,,MnPrv,Shed,700,7,2006,WD,Normal


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


Both DataFrames loaded successfully. Judging by the final index values in the output, each holds roughly 1,460 rows. The `test` DataFrame lacks the target variable `SalePrice`, which is the value the model must predict.
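These observations can also be checked programmatically rather than by eye. The sketch below uses two toy frames, `train_demo` and `test_demo` (hypothetical names with a few values copied from the output above), in place of the real `train` and `test`.

```python
import pandas as pd

# Toy frames standing in for `train` and `test`.
train_demo = pd.DataFrame({
    "Id": [1, 2],
    "LotArea": [8450, 9600],
    "SalePrice": [208500, 181500],
})
test_demo = pd.DataFrame({
    "Id": [1461, 1462],
    "LotArea": [11622, 14267],
})

# Row and column counts for each frame.
print(train_demo.shape, test_demo.shape)

# Columns present in the training data but absent from the test data.
missing_from_test = train_demo.columns.difference(test_demo.columns)
print(list(missing_from_test))
```

Applied to the real frames, the same `columns.difference` call should list `SalePrice` if the target is the only column withheld from the test set.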