<h2 style="color:green" align="center"> Machine Learning With Python: Linear Regression with Multiple Variables</h2>

<h3 style="color:purple">Sample problem: predicting home prices in Monroe, New Jersey (USA)</h3>

The table below contains home prices in Monroe Township, NJ. Here the price depends on **area (square feet), number of bedrooms and age of the home (in years)**. Given these prices, we have to predict the prices of new homes based on area, bedrooms and age.

<img src="homeprices.jpg" style='height:200px;width:350px'>

Given these home prices, find the price of a home that has:

**3000 sq ft area, 3 bedrooms, 40 years old**

**2500 sq ft area, 4 bedrooms, 5 years old**

We will use linear regression with multiple variables here. Price can be calculated using the following equation:

<img src="equation.jpg" >

Here area, bedrooms and age are called independent variables or **features**, whereas price is the dependent variable.
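
Written out, the relationship takes the linear form below. The symbols $m_1$, $m_2$, $m_3$ (one coefficient per feature) and $b$ (the intercept) are names assumed here for illustration; they play the role of the model's `coef_` and `intercept_` values.

```latex
\text{price} = m_1 \cdot \text{area} + m_2 \cdot \text{bedrooms} + m_3 \cdot \text{age} + b
```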

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
home_price = pd.read_csv('homeprices.csv')
home_price

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000


**Data Preprocessing: Fill NA values with median value of a column**

In [3]:
home_price.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   area      6 non-null      int64  
 1   bedrooms  5 non-null      float64
 2   age       6 non-null      int64  
 3   price     6 non-null      int64  
dtypes: float64(1), int64(3)
memory usage: 320.0 bytes


In [4]:
home_price.describe()

Unnamed: 0,area,bedrooms,age,price
count,6.0,5.0,6.0,6.0
mean,3416.666667,4.2,16.5,648333.333333
std,587.934237,1.30384,8.288546,109117.673484
min,2600.0,3.0,8.0,550000.0
25%,3050.0,3.0,9.75,572500.0
50%,3400.0,4.0,16.5,602500.0
75%,3900.0,5.0,19.5,722500.0
max,4100.0,6.0,30.0,810000.0


In [5]:
# Fill the missing bedroom count with the median of the bedrooms column
home_price['bedrooms'] = home_price['bedrooms'].fillna(home_price['bedrooms'].median())

# Equivalent attribute-style form:
# home_price.bedrooms = home_price.bedrooms.fillna(home_price.bedrooms.median())

home_price

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,4.0,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000
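
To see the median fill in isolation, here is a minimal sketch on a small made-up frame. The name `toy` and its values are assumptions for illustration only and are not the contents of `homeprices.csv`.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing bedroom count (not the tutorial's CSV)
toy = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600],
    "bedrooms": [3.0, 4.0, np.nan, 5.0],
})

# median() skips missing values, so this is the median of [3, 4, 5]
bedrooms_median = toy["bedrooms"].median()

# Assign the filled column back rather than relying on inplace=True
toy["bedrooms"] = toy["bedrooms"].fillna(bedrooms_median)

print(toy)
```

Assigning the result back to the column keeps the update explicit and avoids in-place modification of a selected column, which newer pandas versions warn about.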


In [6]:
our_model=linear_model.LinearRegression()
our_model.fit(home_price.drop('price', axis='columns'),home_price.price)

LinearRegression()

In [7]:
our_model.coef_

array([  112.06244194, 23388.88007794, -3231.71790863])

In [8]:
our_model.intercept_

221323.0018654043
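
`LinearRegression` fits these numbers by ordinary least squares. As a rough cross-check, the sketch below solves the same least-squares problem directly with NumPy on the six rows shown above (after the median fill). The variable names are assumptions for illustration, and this is not a claim about how scikit-learn implements the fit internally.

```python
import numpy as np

# Rows of (area, bedrooms, age) from the table above, after the median fill
X = np.array([
    [2600, 3.0, 20],
    [3000, 4.0, 15],
    [3200, 4.0, 18],
    [3600, 3.0, 30],
    [4000, 5.0, 8],
    [4100, 6.0, 8],
], dtype=float)
y = np.array([550000, 565000, 610000, 595000, 760000, 810000], dtype=float)

# Append a column of ones so the last fitted weight acts as the intercept
X_with_bias = np.column_stack([X, np.ones(len(X))])

# Least-squares solution of X_with_bias @ w ≈ y
w, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
coefficients, intercept = w[:3], w[3]

print(coefficients)
print(intercept)
```

The printed values should agree with `coef_` and `intercept_` above up to floating-point rounding.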

**Find the price of a home with 3000 sq ft area, 3 bedrooms, 40 years old**

In [9]:
our_model.predict([[3000, 3, 40]])

array([498408.25158031])

In [10]:
# Manual check: coefficient * feature for each column, plus the intercept
112.06244194*3000 + 23388.88007794*3 + -3231.71790863*40 + 221323.00186540384

498408.25157402386
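
The same arithmetic can be wrapped in a small function so it is easy to reuse for other homes. `predict_price` is a hypothetical helper written for this illustration, with the coefficients and intercept copied from the outputs above.

```python
# Coefficients and intercept copied from the coef_ and intercept_ outputs above
COEFFICIENTS = [112.06244194, 23388.88007794, -3231.71790863]
INTERCEPT = 221323.0018654043


def predict_price(area, bedrooms, age):
    """Hypothetical helper: weighted sum of the features plus the intercept."""
    features = [area, bedrooms, age]
    return sum(c * f for c, f in zip(COEFFICIENTS, features)) + INTERCEPT


first = predict_price(3000, 3, 40)   # 3000 sq ft, 3 bedrooms, 40 years old
second = predict_price(2500, 4, 5)   # 2500 sq ft, 4 bedrooms, 5 years old
print(round(first, 2), round(second, 2))
```

Because the coefficients are rounded copies of the fitted values, the results can differ from `our_model.predict` in the last few decimal places.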

**Find the price of a home with 2500 sq ft area, 4 bedrooms, 5 years old**

In [11]:
our_model.predict([[2500, 4, 5]])

array([578876.03748933])

<h3>Exercise</h3>

In the exercise folder (at the same level as this notebook on GitHub) there is **hiring.csv**. This file contains hiring statistics for a firm: each candidate's experience, written test score and personal interview score. Based on these 3 factors, HR decides the salary. Given this data, build a machine learning model for the HR department that can help them decide salaries for future candidates. Use it to predict salaries for the following candidates:


**2 yr experience, 9 test score, 6 interview score**

**12 yr experience, 10 test score, 10 interview score**


In [2]:
excercise2=pd.read_csv(r"H:\ML&DL\ML Project 01\py\ML\2_linear_reg_multivariate\Exercise\hiring.csv")
excercise2

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


In [13]:
excercise2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   experience                  6 non-null      object 
 1   test_score(out of 10)       7 non-null      float64
 2   interview_score(out of 10)  8 non-null      int64  
 3   salary($)                   8 non-null      int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 384.0+ bytes


In [14]:
excercise2.describe()

Unnamed: 0,test_score(out of 10),interview_score(out of 10),salary($)
count,7.0,8.0,8.0
mean,7.857143,7.875,63000.0
std,1.345185,1.642081,11501.55269
min,6.0,6.0,45000.0
25%,7.0,6.75,57500.0
50%,8.0,7.5,63500.0
75%,8.5,9.25,70500.0
max,10.0,10.0,80000.0


In [6]:
excercise2.loc[:, ['experience']] = excercise2.loc[:, ['experience']].fillna('zero')
excercise2

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


In [16]:
!pip install word2number

In [17]:
from word2number import w2n



In [18]:
# Convert the spelled-out numbers to integers. Writing through .loc updates
# the DataFrame itself instead of assigning into a possibly copied Series.
for i in excercise2.index:
    excercise2.loc[i, 'experience'] = w2n.word_to_num(str(excercise2.loc[i, 'experience']))
excercise2['experience']


0     0
1     0
2     5
3     2
4     7
5     3
6    10
7    11
Name: experience, dtype: object

In [24]:
x=w2n.word_to_num("ten")
x

10
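
If installing `word2number` is not an option, a plain lookup table is enough for a column like this one. The sketch below is an assumption-laden alternative, not part of the library: `WORD_TO_INT` and `word_to_int` are hypothetical names, and the table only covers zero through twelve.

```python
# Hypothetical lookup table covering zero through twelve, enough for this column
WORD_TO_INT = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12,
}


def word_to_int(word):
    """Return the integer for a spelled-out number, or raise on unknown words."""
    key = str(word).strip().lower()
    if key not in WORD_TO_INT:
        raise ValueError(f"unrecognised number word: {word!r}")
    return WORD_TO_INT[key]


# The experience column as it looks after missing values were filled with 'zero'
experience_words = ["zero", "zero", "five", "two", "seven", "three", "ten", "eleven"]
experience_years = [word_to_int(w) for w in experience_words]
print(experience_years)
```

Raising on unknown words, rather than silently returning a default, makes unexpected spellings in the data visible early.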

In [19]:
excercise2

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,,7,72000
7,11,7.0,8,80000


In [25]:
# Fill the missing test score with the mean of that column
score_col = 'test_score(out of 10)'
excercise2[score_col] = excercise2[score_col].fillna(excercise2[score_col].mean())
excercise2

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,7.857143,7,72000
7,11,7.0,8,80000


In [26]:
x=excercise2.drop('salary($)',axis='columns')
x

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10)
0,0,8.0,9
1,0,8.0,6
2,5,6.0,7
3,2,10.0,10
4,7,9.0,6
5,3,7.0,10
6,10,7.857143,7
7,11,7.0,8


In [27]:
y=excercise2['salary($)']
y

0    50000
1    45000
2    60000
3    65000
4    70000
5    62000
6    72000
7    80000
Name: salary($), dtype: int64

In [28]:
LR=linear_model.LinearRegression()
LR.fit(x,y)

LinearRegression()

<h3>Answer</h3>

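Expected answer: 53713.86 and 93747.79. The predictions below differ somewhat from these reference values. A likely reason is the choice of fill value for the missing test score: the column mean (7.857143) was used here, and a different choice such as the median (8.0) changes the fitted coefficients.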

In [42]:
LR.predict([[2,9,6]])

array([53290.89255945])

In [43]:
LR.predict([[12,10,10]])

array([92268.07227784])
