# Paris House Price Prediction - Regression Problem

## Introduction

This project predicts house prices in Paris using a synthetic dataset whose columns describe various features of houses in an urban environment. Because the data is artificial, the dataset is best suited to educational use: it lets students and practitioners practice regression modeling without the noise of real-world data.

## Content

The dataset provides a comprehensive view of house attributes, making it suitable for building regression models to predict house prices. Each row represents a house, and each column represents a specific feature of the house.

## Source

This dataset is available on Kaggle at the following link:
> <https://www.kaggle.com/datasets/mssmartypants/paris-housing-price-prediction/data>

## Data Dictionary

All attributes in the dataset are numeric. Each one is described below:

- **squareMeters**: The total area of the house in square meters. This is numeric.
- **numberOfRooms**: The total number of rooms in the house. This is numeric.
- **hasYard**: Indicates whether the house has a yard (1 for yes, 0 for no). This is binary.
- **hasPool**: Indicates whether the house has a swimming pool (1 for yes, 0 for no). This is binary.
- **floors**: The number of floors in the house. This is numeric.
- **cityCode**: The zip code of the area where the house is located. This is numeric.
- **cityPartRange**: Indicates the exclusivity of the neighborhood (the higher the range, the more exclusive the neighborhood). This is numeric.
- **numPrevOwners**: The number of previous owners the house has had. This is numeric.
- **made**: The year the house was built. This is numeric.
- **isNewBuilt**: Indicates whether the house is newly built (1 for yes, 0 for no). This is binary.
- **hasStormProtector**: Indicates whether the house has a storm protector (1 for yes, 0 for no). This is binary.
- **basement**: The size of the basement in square meters. This is numeric.
- **attic**: The size of the attic in square meters. This is numeric.
- **garage**: The size of the garage in square meters. This is numeric.
- **hasStorageRoom**: Indicates whether the house has a storage room (1 for yes, 0 for no). This is binary.
- **hasGuestRoom**: The number of guest rooms in the house. Despite the `has` prefix, this is a count rather than a binary flag. This is numeric.
- **price**: The price of the house (target variable). This is numeric.
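The dictionary above can be checked against the data. The following is a minimal sketch, assuming the binary flags are stored as integers 0 and 1; the helper name `non_binary_columns` and the small sample frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Binary flags as listed in the data dictionary.
BINARY_COLUMNS = [
    "hasYard", "hasPool", "isNewBuilt", "hasStormProtector", "hasStorageRoom",
]

def non_binary_columns(frame, columns=BINARY_COLUMNS):
    """Return the listed columns that are present but hold values other than 0 or 1."""
    return [
        col for col in columns
        if col in frame.columns and not frame[col].isin([0, 1]).all()
    ]

# Small illustrative frame (hypothetical values, not the real dataset).
sample = pd.DataFrame({
    "hasYard": [0, 1, 0],
    "hasPool": [1, 1, 0],
    "hasGuestRoom": [7, 2, 9],  # a count, so it should fail a binary check
})

bad = non_binary_columns(sample, ["hasYard", "hasPool", "hasGuestRoom"])
print(bad)
```

Running the same helper on the full dataframe with the default column list would flag any binary column that contains unexpected values.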

## Problem Statement

1. **Feature Engineering**: Transform an existing feature, or create new features from existing ones.
2. **Feature Selection**: Select the features that are most effective at capturing the patterns in the data for a model.
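As an illustration of the first objective, the sketch below derives a new feature from an existing one. The column name `houseAge`, the helper `add_house_age`, and the `reference_year` parameter are hypothetical and are not part of this notebook; only the `made` column comes from the dataset.

```python
import pandas as pd

def add_house_age(frame, reference_year):
    """Return a copy of `frame` with a hypothetical `houseAge` column derived from `made`."""
    out = frame.copy()  # leave the caller's dataframe untouched
    out["houseAge"] = reference_year - out["made"]
    return out

# Small illustrative frame; reference_year is an assumed value.
sample = pd.DataFrame({
    "squareMeters": [75523, 80771, 55712],
    "made": [2005, 2015, 2021],
})
engineered = add_house_age(sample, reference_year=2021)
print(engineered)
```

Whether such a derived feature helps is an empirical question; it would go through the same selection step as the original columns.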

### Load Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import warnings

from sklearn.feature_selection import SelectKBest, f_regression

### Settings

In [2]:
# Warnings
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
csv_path = os.path.join(data_path, "ParisHousing_uf.csv")

### Load Data

In [3]:
df = pd.read_csv(csv_path)

In [4]:
# Check data
df.head()

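|   | squareMeters | numberOfRooms | hasYard | hasPool | floors | cityPartRange | numPrevOwners | made | isNewBuilt | hasStormProtector | basement | attic | garage | hasStorageRoom | hasGuestRoom | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75523 | 3 | 0 | 1 | 63 | 3 | 8 | 2005 | 0 | 1 | 4313 | 9005 | 956 | 0 | 7 | 7559081.5 |
| 1 | 80771 | 39 | 1 | 1 | 98 | 8 | 6 | 2015 | 1 | 0 | 3653 | 2436 | 128 | 1 | 2 | 8085989.5 |
| 2 | 55712 | 58 | 0 | 1 | 19 | 6 | 8 | 2021 | 0 | 0 | 2937 | 8852 | 135 | 1 | 9 | 5574642.1 |
| 3 | 32316 | 47 | 0 | 0 | 6 | 10 | 4 | 2012 | 0 | 1 | 659 | 7141 | 359 | 0 | 3 | 3232561.2 |
| 4 | 70429 | 19 | 1 | 1 | 90 | 3 | 7 | 1990 | 1 | 0 | 8435 | 2429 | 292 | 1 | 4 | 7055052.0 |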


### Feature Selection

In [10]:
# Separate input features (X) from the output feature (y)
X = df.drop("price", axis=1)
y = df["price"]

In [24]:
# Define the selector: keep the 12 features with the highest univariate F-scores
selector = SelectKBest(score_func=f_regression, k=12)

# Fit the selector on the input features and the target
selector.fit(X, y)
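After fitting, `SelectKBest` exposes the per-feature scores and p-values through its `scores_` and `pvalues_` attributes, which helps explain why a feature was kept or dropped. The sketch below ranks them on a small synthetic dataset; the names `X_demo`, `y_demo`, and `demo_selector` and the generated data are assumptions for illustration, not the Paris housing data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic demo data: the target depends strongly on "area", weakly on "rooms".
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "area": rng.uniform(50, 500, n),
    "rooms": rng.integers(1, 10, n),
    "noise": rng.normal(size=n),
})
y_demo = 100.0 * X_demo["area"] + 5.0 * X_demo["rooms"] + rng.normal(scale=10.0, size=n)

demo_selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# One row per feature, highest F-score first.
scores = (
    pd.DataFrame({
        "feature": X_demo.columns,
        "f_score": demo_selector.scores_,
        "p_value": demo_selector.pvalues_,
    })
    .sort_values("f_score", ascending=False)
    .reset_index(drop=True)
)
print(scores)
```

Note that `f_regression` scores each feature on its own with a univariate linear F-test, so it does not account for interactions or redundancy between features.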

In [25]:
# Get the selected feature indices
selected_index = selector.get_support(indices=True)
selected_index

array([ 0,  1,  2,  3,  5,  6,  7,  8,  9, 10, 12, 13])

In [26]:
# Get selected feature names (index into X, the frame the selector was fitted on)
selected_features = X.columns[selected_index]
selected_features

Index(['squareMeters', 'numberOfRooms', 'hasYard', 'hasPool', 'cityPartRange',
       'numPrevOwners', 'made', 'isNewBuilt', 'hasStormProtector', 'basement',
       'garage', 'hasStorageRoom'],
      dtype='object')

In [27]:
# Get a dataframe with the selected features (copy to avoid modifying a view of df)
df_selected = df[selected_features].copy()
# Add the output feature so the saved file can be used directly for model training
df_selected["price"] = df["price"]

In [28]:
# Sanity check: inspect the dataframe that will be saved
df_selected.head()

|   | squareMeters | numberOfRooms | hasYard | hasPool | cityPartRange | numPrevOwners | made | isNewBuilt | hasStormProtector | basement | garage | hasStorageRoom | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75523 | 3 | 0 | 1 | 3 | 8 | 2005 | 0 | 1 | 4313 | 956 | 0 | 7559081.5 |
| 1 | 80771 | 39 | 1 | 1 | 8 | 6 | 2015 | 1 | 0 | 3653 | 128 | 1 | 8085989.5 |
| 2 | 55712 | 58 | 0 | 1 | 6 | 8 | 2021 | 0 | 0 | 2937 | 135 | 1 | 5574642.1 |
| 3 | 32316 | 47 | 0 | 0 | 10 | 4 | 2012 | 0 | 1 | 659 | 359 | 0 | 3232561.2 |
| 4 | 70429 | 19 | 1 | 1 | 3 | 7 | 1990 | 1 | 0 | 8435 | 292 | 1 | 7055052.0 |


In [29]:
# Save the dataframe with selected features
sel_path = os.path.join(data_path, "ParisHousing_sf.csv")
df_selected.to_csv(sel_path, index= False)
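A quick round-trip check can confirm that the saved file reloads with the same columns and values. The sketch below uses a temporary directory and a two-row frame so that it runs on its own; the file name `demo_sf.csv` and the frame are hypothetical stand-ins for the real output.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for the dataframe with selected features.
frame = pd.DataFrame({
    "squareMeters": [75523, 80771],
    "price": [7559081.5, 8085989.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "demo_sf.csv")
    # index=False keeps the row index out of the file, so no extra
    # "Unnamed: 0" column appears when the file is read back.
    frame.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```

If the reloaded frame matches the original, the file is safe to hand to the model-training step.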