# Linear Regression with Multiple Variables

The goal is to implement multivariate linear regression from scratch. Since it's hard to generate realistic random test values, I'll run the algorithm on a real-world data set as the test. That means I'll also need a Python library like pandas to load the data.

In [1]:
!conda install pandas

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [69]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

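# Scratch cell: checking how np.matrix shapes and products behave
a = np.matrix([[1,2],[3,4]])
print(a)
b = np.matrix([[2,3]])
print(b)
d = np.array(a[0])[0].shape[0]   # number of columns in a
d
e = np.matrix([1,2,3,4,5])
print(a.shape[0])                # number of rows in a

t = np.transpose(np.matrix([1,2,3,4,5]))   # 5x1 column
x = np.matrix([[1,2,3,4,5],[1,1,1,1,1]])   # 2x5
y = x*t                                    # 2x1 matrix product

# Sum the entries of y
sums = 0
for k in range(2):
    sums += y.item(k,0)
print(sums)

# List assignment and append
tel = [0,0]
tel[0], tel[1] = 1, 3
tel.append(0)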


[[1 2]
 [3 4]]
[[2 3]]
2
55
70


## Approach

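Let's say there are n+1 parameters $\theta$<sub>0</sub> to $\theta$<sub>n</sub> and n+1 variables X<sub>0</sub> to X<sub>n</sub> (with X<sub>0</sub> = 1 so that $\theta$<sub>0</sub> acts as the intercept), measured over m training examples. Let's also say that an AxB matrix has A rows and B columns.
1. Represent the parameters in an (n+1)x1 matrix.
2. Represent the variables in an (n+1)xm matrix, one column per training example.
3. The hypothesis function will be H = $\theta$<sup>T</sup> * X, a 1xm matrix of predictions.
4. Note: multiplying the transposed (n+1)x1 parameter matrix by the (n+1)xm variable matrix gives a 1xm row matrix, not an mx1 column. To get the predictions as an mx1 column, the code below stores one example per row (an mx(n+1) matrix) and computes X * $\theta$ instead.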
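
The shape behaviour described in the note can be checked directly. This is a minimal sketch with made-up random values; the names `n_params`, `n_examples`, `X_cols` and `X_rows` are hypothetical and only serve to show the two layouts side by side.

```python
import numpy as np

n_params, n_examples = 4, 5            # hypothetical sizes for illustration
rng = np.random.default_rng(0)

theta = rng.random((n_params, 1))          # parameters as a column matrix
X_cols = rng.random((n_params, n_examples))  # one training example per column
X_rows = X_cols.T                          # one training example per row

h_row = theta.T @ X_cols   # theta^T X gives a single row of predictions
h_col = X_rows @ theta     # X theta gives a single column of predictions

print(h_row.shape)                   # (1, 5)
print(h_col.shape)                   # (5, 1)
print(np.allclose(h_row.T, h_col))   # True: same numbers, transposed layout
```

Both products hold the same predictions; which one to use depends only on whether the examples are stored as columns or as rows.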

In [77]:
# For house prices we will consider area, BHK and bathrooms (the only numerical values in the CSV file)
def sumDiff(hMatrix, y, xMatrix, m, row):
    """Partial derivative of the cost for theta[row]: mean of (h - y) * x[:, row]."""
    xT = np.transpose(xMatrix)
    sums = 0
    for k in range(m):
        sums += (hMatrix.item(k, 0) - y[k]) * xT.item(row, k)
    return sums / m

def gradDesc(x, theta, alpha, iterations):
    # NOTE: y is read from the notebook's global scope, so it must be defined before this is called.
    parameters = x.shape[1]   # number of columns = number of parameters
    m = x.shape[0]            # number of training examples

    # Random integer starting values in [2, 5); this appends to the list passed in by the caller.
    for i in range(parameters):
        theta.append(np.random.randint(low=2, high=5))

    for j in range(iterations):
        h = x * np.transpose(np.matrix([theta]))   # m x 1 matrix of predictions
        # Simultaneous update: every new value is computed from the same h.
        theta = [theta[p] - alpha * sumDiff(h, y, x, m, p) for p in range(parameters)]
    return theta
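
Each pass of the loop above applies the update $\theta_j := \theta_j - \alpha \cdot \frac{1}{m}\sum_{i}(h_i - y_i)\,x_{ij}$ to every parameter, using the same predictions for all of them. As a rough sketch, the same update can be written without the inner Python loops using plain NumPy arrays. The function name `grad_desc_vectorized` is hypothetical, and two choices here are assumptions that differ from the cell above: `y` is passed in explicitly and the parameters start at zero instead of random integers.

```python
import numpy as np

def grad_desc_vectorized(X, y, alpha, iterations, theta=None):
    """Batch gradient descent for linear regression (hypothetical helper).

    X: (m, p) array whose first column is all ones; y: (m,) array of targets.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m, p = X.shape
    # Assumption: start from zeros unless the caller supplies starting values.
    theta = np.zeros(p) if theta is None else np.asarray(theta, dtype=float)
    for _ in range(iterations):
        error = X @ theta - y              # h - y for every example, shape (m,)
        gradient = X.T @ error / m         # one partial derivative per parameter
        theta = theta - alpha * gradient   # simultaneous update of all parameters
    return theta

# Synthetic check on made-up data: y = 2 + 3*x exactly.
x_feature = np.linspace(0, 1, 50)
X_demo = np.column_stack([np.ones_like(x_feature), x_feature])
y_demo = 2 + 3 * x_feature
theta_demo = grad_desc_vectorized(X_demo, y_demo, alpha=0.5, iterations=5000)
print(np.round(theta_demo, 3))   # should be close to [2, 3]
```

Noise-free synthetic data like this gives a known answer to compare against, which is a useful check before moving on to the real data set.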

In [120]:
df = pd.read_csv("data1.csv")

# Scale factors applied to the area and the price
x1scale = 100
yscale = 10000

x1 = df["Area"].to_numpy() / x1scale
x2 = df["BHK"].to_numpy()
x3 = df["Bathroom"].fillna(0).to_numpy()   # treat missing bathroom counts as 0
y = df["Price"].to_numpy() / yscale

items = x1.shape[0]
x0 = np.ones(items)   # column of ones for the intercept term

x = np.transpose(np.matrix([x0, x1, x2, x3]))
theta = []

paras = gradDesc(x, theta, 0.0000006, 9500)
# Predicted price for Area = 1100 (11 after dividing by x1scale), 3 BHK and 2 bathrooms
print((paras[0] + paras[1]*11 + paras[2]*3 + paras[3]*2)*yscale)

12901958.417743534
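
The prediction in the last line of the cell applies the same scaling by hand: the area is divided by `x1scale` before it is multiplied by its parameter, and the result is multiplied by `yscale` to get back to a price. A small sketch of that arithmetic as a function, where the name `predict_price` and the parameter values in the example are made up for illustration and are not the fitted values:

```python
import numpy as np

def predict_price(paras, area, bhk, bathrooms, x1scale=100, yscale=10000):
    """Hypothetical helper: scale inputs as in training, then undo the price scaling."""
    # The leading 1.0 multiplies the intercept parameter paras[0].
    features = np.array([1.0, area / x1scale, bhk, bathrooms])
    return float(np.dot(paras, features)) * yscale

# Made-up parameters, for illustration only.
example_paras = [1.0, 2.0, 3.0, 4.0]
print(predict_price(example_paras, area=1100, bhk=3, bathrooms=2))   # 400000.0
```

Keeping the scale factors in one place makes it harder to forget one of them when predicting for a new house.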


## Things to Note
1. Linear regression models don't work very well when the outcome also depends on non-numerical variables such as locality and furnishing, which this model ignores.
2. They work well for data that depends mostly on numerical variables.
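
One common way to bring a non-numerical column into a numerical model is one-hot encoding, which pandas provides through `pd.get_dummies`. The sketch below uses made-up rows, and the column name `Furnishing` and its categories are assumptions for illustration, not taken from `data1.csv`. It was not tried on the data set above, so whether it improves the predictions is an open question.

```python
import pandas as pd

# Made-up rows; "Furnishing" is a hypothetical column name used only for illustration.
demo = pd.DataFrame({
    "Area": [800, 1200, 950],
    "Furnishing": ["Furnished", "Unfurnished", "Semi-Furnished"],
})

# Each category becomes its own 0/1 column, which can then be weighted like any other variable.
encoded = pd.get_dummies(demo, columns=["Furnishing"], dtype=int)
print(encoded.columns.tolist())
```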