# Home price prediction with multivariate linear regression
Sample problem: predicting home prices in Monroe Township, New Jersey (USA)
The table below lists home prices in Monroe Township, NJ. Here the price depends on the area (square feet), the number of bedrooms, and the age of the home (in years). Given these prices, we have to predict the prices of new homes based on area, bedrooms, and age.

<img src="https://github.com/codebasics/py/blob/master/ML/2_linear_reg_multivariate/homeprices.jpg?raw=true">

Given these home prices, find the price of a home that has:

3000 sq ft area, 3 bedrooms, 40 years old

2500 sq ft area, 4 bedrooms, 5 years old

We will use regression with multiple variables here. The price can be calculated using the following equation:

<img src="https://github.com/codebasics/py/blob/master/ML/2_linear_reg_multivariate/home_equation.jpg?raw=true">


Here area, bedrooms, and age are called independent variables or features, whereas price is the dependent variable.
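As a rough illustration of the equation above, the price is a weighted sum of the features plus an intercept. The sketch below uses made-up coefficients and a hypothetical helper, `predict_price`. The real values come from fitting the model.

```python
import numpy as np

def predict_price(area, bedrooms, age, coefs, intercept):
    """Weighted sum of the three features plus the intercept."""
    features = np.array([area, bedrooms, age], dtype=float)
    return float(features @ np.asarray(coefs, dtype=float) + intercept)

# Made-up coefficients and intercept, chosen only to show the arithmetic.
example_price = predict_price(3000, 3, 40,
                              coefs=[100.0, 10000.0, -1000.0],
                              intercept=50000.0)
print(example_price)
```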

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
%matplotlib inline

In [2]:
df = pd.read_csv('homeprices.csv')[:5]
df

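|   | area | bedrooms | age | price |
|---|------|----------|-----|-------|
| 0 | 2600 | 3.0 | 20 | 550000 |
| 1 | 3000 | 4.0 | 15 | 565000 |
| 2 | 3200 | NaN | 18 | 610000 |
| 3 | 3600 | 3.0 | 30 | 595000 |
| 4 | 4000 | 5.0 | 8 | 760000 |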


In [3]:
import math
median_bedroom = math.floor(df.bedrooms.median())
median_bedroom

3
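The median skips missing values, so the one missing bedroom count does not distort it. A small sketch with the bedroom values from the table (typed in by hand here rather than read from the CSV) shows the raw median, the floored fill value, and the filled column:

```python
import math
import pandas as pd

bedrooms = pd.Series([3.0, 4.0, None, 3.0, 5.0])

raw_median = bedrooms.median()        # NaN is skipped by default
fill_value = math.floor(raw_median)   # round down to a whole bedroom count
filled = bedrooms.fillna(fill_value)  # returns a new Series, original untouched

print(raw_median, fill_value)
print(filled.tolist())
```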

In [4]:
df['bedrooms'] = df['bedrooms'].fillna(median_bedroom)
df

|   | area | bedrooms | age | price |
|---|------|----------|-----|-------|
| 0 | 2600 | 3.0 | 20 | 550000 |
| 1 | 3000 | 4.0 | 15 | 565000 |
| 2 | 3200 | 3.0 | 18 | 610000 |
| 3 | 3600 | 3.0 | 30 | 595000 |
| 4 | 4000 | 5.0 | 8 | 760000 |


In [5]:
reg = LinearRegression()
reg.fit(df[['area', 'bedrooms', 'age']], df['price'])

LinearRegression()

In [6]:
reg.coef_

array([   137.25, -26025.  ,  -6825.  ])

In [7]:
# Multivariate linear regression by hand: y_pred = area * coefficient1 + bedrooms * coefficient2 + age * coefficient3 + intercept
y = df['area'] * reg.coef_[0] + df['bedrooms'] * reg.coef_[1] + df['age'] * reg.coef_[2] + reg.intercept_
y

0    526000.0
1    589000.0
2    622000.0
3    595000.0
4    748000.0
dtype: float64
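One way to sanity-check these fitted values is to solve the same least-squares problem directly with NumPy. This sketch assumes the five cleaned rows from the table, typed in by hand. `np.linalg.lstsq` stands in for `LinearRegression` here and is not what the notebook itself uses.

```python
import numpy as np

# Cleaned feature rows: area, bedrooms, age
X = np.array([
    [2600, 3, 20],
    [3000, 4, 15],
    [3200, 3, 18],
    [3600, 3, 30],
    [4000, 5, 8],
], dtype=float)
y = np.array([550000, 565000, 610000, 595000, 760000], dtype=float)

# Append a column of ones so the last parameter acts as the intercept.
A = np.column_stack([X, np.ones(len(X))])
params, *_ = np.linalg.lstsq(A, y, rcond=None)
coefs, intercept = params[:3], params[3]

fitted = X @ coefs + intercept
print(coefs, intercept)
print(fitted)
```

The fitted values will not match the training prices exactly, because five rows cannot be fit perfectly by four parameters.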

In [8]:
reg.predict(pd.DataFrame([[3000, 3, 40]], columns=['area', 'bedrooms', 'age']))

array([444400.])

In [9]:
reg.predict(pd.DataFrame([[2500, 4, 5]], columns=['area', 'bedrooms', 'age']))

array([588625.])
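Both homes can also be scored in one call. The sketch below rebuilds the five cleaned rows by hand (an assumption, so that the example is self-contained) and refits the model. It then passes a DataFrame with the training column names so that scikit-learn can match features by name.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.DataFrame({
    'area': [2600, 3000, 3200, 3600, 4000],
    'bedrooms': [3.0, 4.0, 3.0, 3.0, 5.0],
    'age': [20, 15, 18, 30, 8],
    'price': [550000, 565000, 610000, 595000, 760000],
})
features = ['area', 'bedrooms', 'age']
model = LinearRegression().fit(train[features], train['price'])

# One row per home to price; column order matches the training features.
new_homes = pd.DataFrame([[3000, 3, 40], [2500, 4, 5]], columns=features)
new_homes['predicted_price'] = model.predict(new_homes[features])
print(new_homes)
```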

## Exercise
The `hiring.csv` file contains hiring statistics for a firm: each candidate's experience, written test score, and personal interview score. Based on these three factors, HR decides the salary. Given this data, build a machine learning model that helps the HR department decide salaries for future candidates. Use it to predict salaries for the following candidates:

<strong>2 yr experience, 9 test score, 6 interview score</strong>

<strong>12 yr experience, 10 test score, 10 interview score</strong>

In [10]:
df = pd.read_csv('hiring.csv')
df

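|   | experience | test_score(out of 10) | interview_score(out of 10) | salary($) |
|---|------------|-----------------------|----------------------------|-----------|
| 0 | NaN | 8.0 | 9 | 50000 |
| 1 | NaN | 8.0 | 6 | 45000 |
| 2 | five | 6.0 | 7 | 60000 |
| 3 | two | 10.0 | 10 | 65000 |
| 4 | seven | 9.0 | 6 | 70000 |
| 5 | three | 7.0 | 10 | 62000 |
| 6 | ten | NaN | 7 | 72000 |
| 7 | eleven | 7.0 | 8 | 80000 |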


In [11]:

df['experience'] = df['experience'].fillna('zero')
df

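|   | experience | test_score(out of 10) | interview_score(out of 10) | salary($) |
|---|------------|-----------------------|----------------------------|-----------|
| 0 | zero | 8.0 | 9 | 50000 |
| 1 | zero | 8.0 | 6 | 45000 |
| 2 | five | 6.0 | 7 | 60000 |
| 3 | two | 10.0 | 10 | 65000 |
| 4 | seven | 9.0 | 6 | 70000 |
| 5 | three | 7.0 | 10 | 62000 |
| 6 | ten | NaN | 7 | 72000 |
| 7 | eleven | 7.0 | 8 | 80000 |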


### Data Cleaning

In [12]:
import math
df['test_score(out of 10)'] = df['test_score(out of 10)'].fillna(math.floor(df['test_score(out of 10)'].mean()))
df

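|   | experience | test_score(out of 10) | interview_score(out of 10) | salary($) |
|---|------------|-----------------------|----------------------------|-----------|
| 0 | zero | 8.0 | 9 | 50000 |
| 1 | zero | 8.0 | 6 | 45000 |
| 2 | five | 6.0 | 7 | 60000 |
| 3 | two | 10.0 | 10 | 65000 |
| 4 | seven | 9.0 | 6 | 70000 |
| 5 | three | 7.0 | 10 | 62000 |
| 6 | ten | 7.0 | 7 | 72000 |
| 7 | eleven | 7.0 | 8 | 80000 |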
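The fill value depends on which statistic is used. A quick sketch with the test scores from the table (typed in by hand) compares the floored mean used above with the median:

```python
import math
import pandas as pd

scores = pd.Series([8.0, 8.0, 6.0, 10.0, 9.0, 7.0, None, 7.0])

mean_fill = math.floor(scores.mean())   # mean of the 7 known scores, rounded down
median_fill = scores.median()           # middle value of the 7 known scores
print(mean_fill, median_fill)
```

On a sample this small the two statistics give different fill values, so the choice shifts the fitted model slightly.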


In [13]:
# Convert the experience words to numbers using the word2number library
from word2number import w2n
df['experience'] = df['experience'].apply(w2n.word_to_num)
df

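|   | experience | test_score(out of 10) | interview_score(out of 10) | salary($) |
|---|------------|-----------------------|----------------------------|-----------|
| 0 | 0 | 8.0 | 9 | 50000 |
| 1 | 0 | 8.0 | 6 | 45000 |
| 2 | 5 | 6.0 | 7 | 60000 |
| 3 | 2 | 10.0 | 10 | 65000 |
| 4 | 7 | 9.0 | 6 | 70000 |
| 5 | 3 | 7.0 | 10 | 62000 |
| 6 | 10 | 7.0 | 7 | 72000 |
| 7 | 11 | 7.0 | 8 | 80000 |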
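If installing `word2number` is not an option, a plain lookup table covers the handful of words in this column. This is a stand-in sketch, not the library's behavior: `WORD_TO_NUM` is a hypothetical mapping that only knows the words listed.

```python
import pandas as pd

# Hypothetical lookup table: covers only the words that appear in this dataset.
WORD_TO_NUM = {'zero': 0, 'two': 2, 'three': 3, 'five': 5,
               'seven': 7, 'ten': 10, 'eleven': 11}

experience = pd.Series(['zero', 'zero', 'five', 'two',
                        'seven', 'three', 'ten', 'eleven'])
experience_num = experience.map(WORD_TO_NUM)

# Unknown words become NaN under map(), so fail loudly instead of training on them.
if experience_num.isna().any():
    raise ValueError('unmapped experience value')
print(experience_num.tolist())
```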


In [14]:
reg = LinearRegression()
reg.fit(df[['experience', 'test_score(out of 10)', 'interview_score(out of 10)']], df['salary($)'])

LinearRegression()

In [15]:
reg.predict(pd.DataFrame([[2, 9, 6]], columns=['experience', 'test_score(out of 10)', 'interview_score(out of 10)']))

array([53713.86677124])

In [16]:
reg.predict(pd.DataFrame([[12, 10, 10]], columns=['experience', 'test_score(out of 10)', 'interview_score(out of 10)']))

array([93747.79628651])
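To see what drives these predictions, it helps to look at the fitted coefficients. The sketch below retypes the cleaned table by hand (an assumption, so that the example is self-contained), refits the model, and pairs each coefficient with its column name.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

hiring = pd.DataFrame({
    'experience': [0, 0, 5, 2, 7, 3, 10, 11],
    'test_score(out of 10)': [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, 7.0, 7.0],
    'interview_score(out of 10)': [9, 6, 7, 10, 6, 10, 7, 8],
    'salary($)': [50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000],
})
features = ['experience', 'test_score(out of 10)', 'interview_score(out of 10)']
model = LinearRegression().fit(hiring[features], hiring['salary($)'])

# Each coefficient is the salary change per one-unit increase in that feature,
# holding the other two fixed.
effects = pd.Series(model.coef_, index=features)
print(effects)
print('intercept:', model.intercept_)

# Score both exercise candidates with named columns.
candidates = pd.DataFrame([[2, 9, 6], [12, 10, 10]], columns=features)
predictions = model.predict(candidates)
print(predictions)
```

With only eight rows, the coefficients are sensitive to how the missing values were filled, so they are best read as illustrative.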