<a href="https://colab.research.google.com/github/crazygovind/AI-Foundation-Whitehat-Jr/blob/master/53_Project_MULTIPLE_LINEAR_REGRESSION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Instructions

---

#### Goal of the Project

This project is designed for you to practice and solve the activities that are based on the concepts covered in the following lessons:

 1. Multiple linear regression - Introduction
 
 

---

### Problem Statement

A real estate company wishes to analyse the prices of properties based on various factors such as area, number of rooms, bathrooms, bedrooms, etc. Create a multiple linear regression model which is capable of predicting the sale price of houses based on multiple factors and evaluate the accuracy of this model.








---

### List of Activities

**Activity 1:** Analysing the Dataset

**Activity 2:** Data Preparation
  
**Activity 3:** Train-Test Split

**Activity 4:**  Model Training

**Activity 5:** Model Prediction and Evaluation







---


#### Activity 1:  Analysing the Dataset

- Create a pandas DataFrame for the **Housing** dataset using the link below. The dataset consists of the following columns:


|Field|Description|
|---:|:---|
|price|Sale price of a house in INR|
|area|Total size of a property in square feet|
|bedrooms|Number of bedrooms|
|bathrooms|Number of bathrooms|
|stories|Number of storeys excluding basement|
|mainroad|yes, if the house faces a main road|
|guestroom|yes, if the house has a separate living room or a drawing room for guests|
|basement|yes, if the house has a basement|
|hotwaterheating|yes, if the house uses gas for hot water heating|
|airconditioning|yes, if there is central air conditioning|
|parking|Number of cars that can be parked|
|prefarea|yes, if the house is located in the preferred neighbourhood of the city|
|furnishingstatus|furnished, semi-furnished or unfurnished|


  **Dataset Link:** https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/house-prices.csv

- Print the first five rows of the dataset. Check for null values and treat them accordingly.






In [2]:
# Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
ds = pd.read_csv('https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/house-prices.csv')

# Print first five rows using head() function
ds.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


In [3]:
# Check if there are any null values. If any column has null values, treat them accordingly
ds.isnull()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,False,False,False,False,False,False,False,False,False,False,False,False,False
541,False,False,False,False,False,False,False,False,False,False,False,False,False
542,False,False,False,False,False,False,False,False,False,False,False,False,False
543,False,False,False,False,False,False,False,False,False,False,False,False,False
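
`ds.isnull()` returns a Boolean frame with one entry per cell, which is hard to scan for a dataset of this size. A minimal sketch of a per-column summary follows. It uses a small made-up frame (`demo` and its values are illustrative assumptions, not the housing data) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Illustrative frame only: the values are made up, not taken from the housing data.
demo = pd.DataFrame({
    'price': [13300000, 12250000, np.nan],
    'area': [7420, 8960, 9960],
    'mainroad': ['yes', None, 'yes'],
})

# Count the missing values in each column.
null_counts = demo.isnull().sum()
print(null_counts)

# True if at least one value is missing anywhere in the frame.
has_nulls = bool(demo.isnull().values.any())
print(has_nulls)
```

On the notebook's frame the equivalent calls would be `ds.isnull().sum()` and `ds.isnull().values.any()`. If any count is non-zero, the affected rows can be dropped with `dropna()` or filled with `fillna()`.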


---

#### Activity 2: Data Preparation

This dataset contains several columns with categorical data, i.e. the values 'yes' or 'no'. Linear regression, however, needs numerical data, so you need to convert all 'yes' and 'no' values to 1s and 0s, where
- 1 means 'yes'
- 0 means 'no'

Similarly, replace

- `unfurnished` with 0
- `semi-furnished` with 1
- `furnished` with 2

**Hint:** To replace all 'yes' values with 1 and all 'no' values with 0, use the `replace()` function of the DataFrame object.

For example, `df.replace(to_replace="yes", value=1, inplace=True)` $\Rightarrow$ replaces the "yes" values in all columns with 1. Set the boolean `inplace` argument to `True` to modify the DataFrame in place rather than return a new one.



In [4]:
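# Replace all the non-numeric values with numeric values.
# Note: the codes used in this cell differ from the mapping listed above.
# Here 'yes' -> 1 and 'no' -> 2 (not 0), and 'furnished' -> 1,
# 'semi-furnished' -> 2, 'unfurnished' -> 3 (not 2, 1, 0).
# The outputs recorded in the rest of the notebook were produced with these codes.
ds.replace(to_replace='yes', value=1, inplace=True)
ds.replace(to_replace='no', value=2, inplace=True)
ds.replace(to_replace='semi-furnished', value=2, inplace=True)
ds.replace(to_replace='furnished', value=1, inplace=True)
ds.replace(to_replace='unfurnished', value=3, inplace=True)
ds.head(20)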

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,1,2,2,2,1,2,1,1
1,12250000,8960,4,4,4,1,2,2,2,1,3,2,1
2,12250000,9960,3,2,2,1,2,1,2,2,2,1,2
3,12215000,7500,4,2,2,1,2,1,2,1,3,1,1
4,11410000,7420,4,1,2,1,1,1,2,1,2,2,1
5,10850000,7500,3,3,1,1,2,1,2,1,2,1,2
6,10150000,8580,4,3,4,1,2,2,2,1,2,1,2
7,10150000,16200,5,3,2,1,2,2,2,2,0,2,3
8,9870000,8100,4,1,2,1,1,1,2,1,2,1,1
9,9800000,5750,3,2,4,1,1,2,2,1,1,1,3
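
A minimal sketch of the encoding described in this activity (1 and 0 for 'yes' and 'no'; 0, 1 and 2 for the furnishing levels), using per-column `map()` with dictionaries instead of repeated frame-wide `replace()` calls. The frame `demo` and the names `yes_no_codes` and `furnishing_codes` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Illustrative rows only: column names follow the dataset, values are made up.
demo = pd.DataFrame({
    'mainroad': ['yes', 'no', 'yes'],
    'basement': ['no', 'no', 'yes'],
    'furnishingstatus': ['furnished', 'semi-furnished', 'unfurnished'],
})

# Mapping described in the activity text.
yes_no_codes = {'yes': 1, 'no': 0}
furnishing_codes = {'unfurnished': 0, 'semi-furnished': 1, 'furnished': 2}

encoded = demo.copy()
for col in ['mainroad', 'basement']:
    encoded[col] = encoded[col].map(yes_no_codes)
encoded['furnishingstatus'] = encoded['furnishingstatus'].map(furnishing_codes)

print(encoded)
```

`map()` turns any value that is missing from the dictionary into `NaN`, which makes unexpected categories easy to spot. Each code used in the cell above is a linear function of the corresponding code here (2 minus the yes/no code, 3 minus the furnishing code), so a linear model with an intercept makes the same predictions under either encoding; only the intercept and the signs of the affected coefficients differ.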


---

#### Activity 3: Train-Test Split

You need to predict the house prices based on several factors. Thus, `price` is the target variable and all the other columns are the feature variables.

Split the dataset into a training set and a test set such that the training set contains 70% of the instances and the remaining 30% form the test set.

In [5]:
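# Split the DataFrame into the training and test sets.
from sklearn.model_selection import train_test_split

# Every column except the target 'price' is a feature.
features = list(ds.columns)
features.remove('price')
print(features)

X = ds[features]
y = ds['price']

# test_size=0.3 keeps 70% of the rows for training; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)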

['area', 'bedrooms', 'bathrooms', 'stories', 'mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'parking', 'prefarea', 'furnishingstatus']
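
To see how `test_size` translates into row counts, here is a small sketch on a synthetic 100-row frame. The frame `demo`, its values and the variable names are assumptions made for illustration, not the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing frame: 100 rows of made-up values.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'price': rng.integers(1_000_000, 10_000_000, size=100),
    'area': rng.integers(1_000, 10_000, size=100),
    'bedrooms': rng.integers(1, 6, size=100),
})

# Features are every column except the target.
X_demo = demo.drop(columns='price')
y_demo = demo['price']

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# 70 rows go to the training set and 30 rows to the test set.
print(X_tr.shape, X_te.shape)
```

The features and the target are shuffled together, so `X_tr` and `y_tr` keep matching row labels.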


---

#### Activity 4: Model Training

Implement multiple linear regression using the `sklearn` module in the following way:

1. Reshape the target variable arrays into two-dimensional arrays using the `reshape(-1, 1)` method of NumPy arrays.
2. Build the model by importing the `LinearRegression` class and creating an object of this class.
3. Call the `fit()` function on the `LinearRegression` object.
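
The steps above can be sketched on synthetic data with a known relationship, which makes it easy to check that the fitted intercept and coefficients are sensible. The data, the true values (5, 3 and 2) and the variable names are assumptions chosen for illustration. The sketch also fits a one-dimensional target, which `LinearRegression` accepts as well; the only difference is the shape of `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: y = 5 + 3*x1 + 2*x2 (no noise).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 5 + 3 * X_demo[:, 0] + 2 * X_demo[:, 1]

# One-dimensional target: coef_ has shape (n_features,).
model_1d = LinearRegression().fit(X_demo, y_demo)

# Two-dimensional target, as in this activity: coef_ has shape (1, n_features).
model_2d = LinearRegression().fit(X_demo, y_demo.reshape(-1, 1))

print(model_1d.intercept_, model_1d.coef_)
print(model_2d.intercept_, model_2d.coef_)
```

With a two-dimensional target the coefficients are read from `coef_[0]`, which is why the cell below indexes `lr.coef_[0]`.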

In [6]:
# Create two-dimensional NumPy arrays for the target variable
y_train_reshaped = y_train.values.reshape(-1, 1)
y_test_reshaped = y_test.values.reshape(-1, 1)

# Build linear regression model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train_reshaped)

# Print the value of the intercept
print(lr.intercept_)

# Print the names of the features along with the values of their corresponding coefficients.
# The target is two-dimensional, so coef_ has shape (1, n_features) and coef_[0] holds the coefficients.
for name, coef in zip(X.columns.values, lr.coef_[0]):
    print(name, coef)

[6286623.99051092]
area 253.0623283718065
bedrooms 82734.87457030988
bathrooms 1117372.866630468
stories 415801.12251943746
mainroad -408320.4647816372
guestroom -279534.0414578727
basement -484980.2152513361
hotwaterheating -619934.3471477584
airconditioning -680006.9208959391
parking 304078.3327665583
prefarea -509441.46380309114
furnishingstatus -198031.32519469137


---

#### Activity 5: Model Prediction and Evaluation

Predict the values for both training and test sets by calling the `predict()` function on the LinearRegression object. Also, calculate the $R^2$, MSE, RMSE and MAE values to evaluate the accuracy of your model.

In [7]:
# Predict the target variable values for training and test set
y_train_pred=lr.predict(X_train)
y_test_pred=lr.predict(X_test)

In [8]:
# Evaluate the linear regression model using the 'r2_score', 'mean_squared_error' & 'mean_absolute_error' functions of the 'sklearn' module.
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Training set: R-squared, MSE, RMSE and MAE, in that order
print(r2_score(y_train_reshaped, y_train_pred))
print(mean_squared_error(y_train_reshaped, y_train_pred))
print(np.sqrt(mean_squared_error(y_train_reshaped, y_train_pred)))
print(mean_absolute_error(y_train_reshaped, y_train_pred), '\n')

# Test set: the same four metrics in the same order
print(r2_score(y_test_reshaped, y_test_pred))
print(mean_squared_error(y_test_reshaped, y_test_pred))
print(np.sqrt(mean_squared_error(y_test_reshaped, y_test_pred)))
print(mean_absolute_error(y_test_reshaped, y_test_pred))

0.6927795109061217
965153171508.6733
982422.094371189
719440.7398749229 

0.6435419628959107
1535047758428.0498
1238970.4429194627
925543.5483156563
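
For reference, the four metrics printed above can be computed by hand from the prediction errors. The sketch below does this on a few made-up values (`y_true` and `y_pred` are illustrative assumptions, not the notebook's data) and compares the results with the `sklearn` functions.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 10.0])

errors = y_true - y_pred

# MSE: mean of the squared errors. RMSE: its square root, in the units of the target.
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)

# MAE: mean of the absolute errors.
mae = np.mean(np.abs(errors))

# R-squared: 1 minus the ratio of the residual sum of squares to the total sum of squares.
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(r2, mse, rmse, mae)

# The hand-computed values agree with the sklearn functions.
print(np.isclose(r2, r2_score(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
```

RMSE and MAE are expressed in the units of the target (INR for the housing data), while MSE is in squared units, which is why its values above are so much larger.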


---