price = m1 * area + m2 * bedrooms + m3 * age + b

area, bedrooms and age are the independent variables (features/predictors), while price is the dependent variable (label).
m1, m2 and m3 are the coefficients and b is the intercept.
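
The equation can be read as a plain function of the three features. A minimal sketch, assuming made-up coefficient values chosen only for illustration (they are not fitted to any data); `predict_price` is a hypothetical helper name:

```python
def predict_price(area, bedrooms, age, m1, m2, m3, b):
    """Evaluate price = m1*area + m2*bedrooms + m3*age + b."""
    return m1 * area + m2 * bedrooms + m3 * age + b


# Illustrative (not fitted) coefficients: each one scales a single feature.
example_price = predict_price(area=3000, bedrooms=3, age=40,
                              m1=100, m2=20000, m3=-3000, b=200000)
print(example_price)
```

Each coefficient is the change in price for a one-unit change in its feature while the other two are held fixed.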

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model

In [2]:
df = pd.read_csv('files/homeprices.csv')
df

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000


Handle missing data: the bedrooms value for row 2 is missing, so fill it with the column median.

In [3]:
import math

median_bedrooms = math.floor(df.bedrooms.median())
median_bedrooms

4

In [4]:
df.bedrooms = df.bedrooms.fillna(median_bedrooms)

In [5]:
df

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,4.0,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000
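
The two steps above (compute the median, then fill) can be sketched in isolation on just the bedrooms column. The series below copies the values from the table; the names `bedrooms`, `fill_value` and `filled` are only for this illustration.

```python
import math

import numpy as np
import pandas as pd

# Bedrooms column from the table above, with one missing value (NaN).
bedrooms = pd.Series([3.0, 4.0, np.nan, 3.0, 5.0, 6.0])

# median() skips NaN by default; floor() keeps the fill value a whole number
# even when the median falls between two values (e.g. 3.5).
fill_value = math.floor(bedrooms.median())
filled = bedrooms.fillna(fill_value)
print(fill_value, filled.tolist())
```

The median is less sensitive to a single unusually large house than the mean, which is one common reason to prefer it for a small sample like this.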


In [6]:
reg = linear_model.LinearRegression()
reg.fit(df[['area', 'bedrooms', 'age']], df.price)

LinearRegression()

In [7]:
reg.coef_  # m1, m2 and m3

array([  112.06244194, 23388.88007794, -3231.71790863])

In [8]:
reg.intercept_  # b

221323.00186540443
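
LinearRegression fits by ordinary least squares. As a rough cross-check, and not a description of scikit-learn's internals, the same coefficients and intercept can be recovered with NumPy's least-squares solver. The arrays below copy the filled table; `X_with_bias` and `params` are names introduced here.

```python
import numpy as np

# Features (area, bedrooms, age) and prices from the table above,
# with the missing bedrooms value already filled with 4.
X = np.array([
    [2600, 3, 20],
    [3000, 4, 15],
    [3200, 4, 18],
    [3600, 3, 30],
    [4000, 5, 8],
    [4100, 6, 8],
], dtype=float)
y = np.array([550000, 565000, 610000, 595000, 760000, 810000], dtype=float)

# Append a column of ones so the last solved parameter is the intercept b.
X_with_bias = np.column_stack([X, np.ones(len(X))])
params, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
coefficients, intercept = params[:3], params[3]
print(coefficients, intercept)
```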

In [9]:
reg.predict([[3000, 3, 40]])  # predict the price of a 40-year-old house with 3000 square feet and 3 bedrooms

array([498408.25158031])

How did we get this value? It is the linear equation evaluated with the fitted coefficients and intercept:

In [10]:
p = 112.06244194 * 3000 + 23388.88007794 * 3 + (-3231.71790863 * 40) + 221323.00186540443
p

498408.25157402444
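
The same arithmetic can be written as a sum over coefficient and feature pairs. This is a small sketch using the values printed above; `manual_predict` is a hypothetical helper. The tiny difference from the output of `reg.predict` comes from the coefficients being printed to 8 decimal places.

```python
import math

# Coefficients and intercept as printed above (rounded to 8 decimals).
coefficients = [112.06244194, 23388.88007794, -3231.71790863]
intercept = 221323.00186540443


def manual_predict(features):
    """Sum of coefficient * feature pairs, plus the intercept."""
    return sum(m * x for m, x in zip(coefficients, features)) + intercept


manual = manual_predict([3000, 3, 40])
# Compare with the model's prediction, allowing for the rounded coefficients.
print(math.isclose(manual, 498408.25158031, rel_tol=1e-9))
```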

A new prediction: a house with 2500 square feet and 4 bedrooms that is 5 years old.

In [11]:
reg.predict([[2500, 4, 5]])

array([578876.03748933])

# exercise

In [12]:
data = pd.read_csv('files/hiring.csv')
data

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,,8.0,9,50000
1,,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000


data preprocessing

In [13]:
# Rows 0 and 1 have no experience recorded; treat the missing values as 'zero'.
data.experience = data.experience.fillna('zero')
data

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,zero,8.0,9,50000
1,zero,8.0,6,45000
2,five,6.0,7,60000
3,two,10.0,10,65000
4,seven,9.0,6,70000
5,three,7.0,10,62000
6,ten,,7,72000
7,eleven,7.0,8,80000
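
The missing experience values can also be set with a single `.loc` assignment, which pandas recommends over chained indexing of the form `data.iloc[:2]['experience'] = ...`. A minimal sketch on a cut-down copy of the data; the `hiring` frame is a stand-in built here only for illustration.

```python
import numpy as np
import pandas as pd

# First three rows of the hiring data, reduced to two columns for brevity.
hiring = pd.DataFrame({
    'experience': [np.nan, np.nan, 'five'],
    'salary($)': [50000, 45000, 60000],
})

# One .loc call with both the row selector and the column label writes
# directly into `hiring`, so no temporary copy is involved.
hiring.loc[hiring['experience'].isnull(), 'experience'] = 'zero'
print(hiring['experience'].tolist())
```

Selecting rows with `isnull()` rather than by position also keeps the fix correct if the missing values are not in the first two rows.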


In [14]:
from word2number import w2n

# Convert number words such as 'five' to integers.
data.experience = data.experience.apply(w2n.word_to_num)

In [15]:
data

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,,7,72000
7,11,7.0,8,80000
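
If the word2number package is not installed, a small lookup table is enough for the words in this dataset. This is a sketch only: `WORD_TO_NUM` and `word_to_int` are hypothetical names, and the table deliberately stops at twelve rather than handling arbitrary number phrases.

```python
# Hypothetical lookup table covering only small number words.
WORD_TO_NUM = {
    'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
    'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10,
    'eleven': 11, 'twelve': 12,
}


def word_to_int(word):
    """Map a number word to an int; raise ValueError on unknown words."""
    try:
        return WORD_TO_NUM[word.strip().lower()]
    except KeyError:
        raise ValueError(f'unsupported number word: {word!r}') from None


# Experience column as it appears after the missing values are set to 'zero'.
experience_words = ['zero', 'zero', 'five', 'two', 'seven', 'three', 'ten', 'eleven']
experience_years = [word_to_int(w) for w in experience_words]
print(experience_years)
```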


In [16]:
data['test_score(out of 10)'].isnull().sum()

1

In [17]:
mean_test_score = math.floor(data['test_score(out of 10)'].mean())  # floored mean of the non-missing scores
mean_test_score

7

In [18]:
data['test_score(out of 10)'] = data['test_score(out of 10)'].fillna(mean_test_score)
data

Unnamed: 0,experience,test_score(out of 10),interview_score(out of 10),salary($)
0,0,8.0,9,50000
1,0,8.0,6,45000
2,5,6.0,7,60000
3,2,10.0,10,65000
4,7,9.0,6,70000
5,3,7.0,10,62000
6,10,7.0,7,72000
7,11,7.0,8,80000


In [19]:
lr = linear_model.LinearRegression()
lr.fit(data[['experience', 'test_score(out of 10)', 'interview_score(out of 10)']], data['salary($)'])

LinearRegression()

In [20]:
lr.coef_

array([2922.26901502, 2221.30909959, 2147.48256637])

In [21]:
lr.intercept_

14992.651446693148

In [22]:
lr.predict([[2, 9, 6]])  # 2 years of experience, test score 9, interview score 6

array([53713.86677124])

In [23]:
lr.predict([[12, 10, 10]])  # 12 years of experience, test score 10, interview score 10

array([93747.79628651])
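
Both salary predictions can be reproduced in one step as a matrix product with the printed coefficients. A short sketch, assuming the coefficient and intercept values shown above; `candidates` and `salaries` are names introduced here.

```python
import numpy as np

# Coefficients and intercept as printed by lr.coef_ and lr.intercept_.
coefficients = np.array([2922.26901502, 2221.30909959, 2147.48256637])
intercept = 14992.651446693148

# One row per candidate: experience, test score, interview score.
candidates = np.array([
    [2, 9, 6],
    [12, 10, 10],
], dtype=float)

# Each row's dot product with the coefficients, plus the intercept.
salaries = candidates @ coefficients + intercept
print(salaries)
```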