### Day 8 - Least Square Regression Line
________________________________________________

  <br/>

- [Background](#Background)
- [Task 1](#Task)
- [Task 2](#task2)

  <br/>

#### Background 

If the data show a linear relationship between $ X $ and $ Y $, the straight line that best describes that relationship in the least-squares sense is the regression line. The regression line is given by $ Y = a + bX $.

The values of $ a $ and $ b $ can be calculated using the following formulas:

$$
\begin{eqnarray}
b & = \rho \cdot \frac{\sigma_Y}{\sigma_X} \\
a & = \bar{y} - b\bar{x}
\end{eqnarray}
$$

where:
- $ \rho $ is the Pearson correlation coefficient
- $ \sigma_X $ is the standard deviation of $ X $
- $ \sigma_Y $ is the standard deviation of $ Y $
- $ \bar{x} $ is the mean of $ X $
- $ \bar{y} $ is the mean of $ Y $
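
Because $ \rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} $, the slope formula above can also be written without the correlation coefficient. The following is a short derivation sketch; the symbol $ \operatorname{cov}(X, Y) $ (the covariance of $ X $ and $ Y $) and the index $ i $ over the data points are not part of the original formulas and are introduced here only for illustration:

$$
\begin{eqnarray}
b & = \rho \cdot \frac{\sigma_Y}{\sigma_X}
    = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \cdot \frac{\sigma_Y}{\sigma_X}
    = \frac{\operatorname{cov}(X, Y)}{\sigma_X^2}
    = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
\end{eqnarray}
$$

The normalising factor $ \frac{1}{n} $ appears in both the covariance and the variance, so it cancels in the last step.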

[Full tutorial link on HackerRank](https://www.hackerrank.com/challenges/s10-least-square-regression-line/tutorial)


#### Task

A group of five students enrolls in Statistics immediately after taking a Math aptitude test. Each student's Math aptitude test score, __x__, and Statistics course grade, __y__, can be expressed as the following list of __(x, y)__ points:

    (95, 85)
    (85, 95)
    (80, 70)
    (70, 65)
    (60, 70)
    
If a student scored an __80__ on the Math aptitude test, what grade would we expect them to achieve in Statistics? Determine the equation of the best-fit line using the least squares method, then compute and print the value of __y__ when __x = 80__.

##### Input Format

There are five lines of input; each line contains two space-separated integers giving one student's __x__ and __y__ values, in that order:

    95 85
    85 95
    80 70
    70 65
    60 70

If you do not wish to read this information from stdin, you can hard-code it into your program.
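
For readers who do want to read the points from stdin, the input format described above could be parsed roughly as follows. This is only a minimal sketch: the helper name `read_points` and its `stream` and `count` parameters are hypothetical and not part of the original task, and an in-memory stream stands in for real stdin so the example can run on its own.

```python
import io
import sys


def read_points(stream=sys.stdin, count=5):
    """Read `count` lines of two space-separated integers; return (x, y) tuples."""
    points = []
    for _ in range(count):
        x, y = map(int, stream.readline().split())
        points.append((x, y))
    return points


# Demonstration with an in-memory stream instead of real stdin.
sample_input = io.StringIO("95 85\n85 95\n80 70\n70 65\n60 70\n")
points = read_points(sample_input)
print(points)  # [(95, 85), (85, 95), (80, 70), (70, 65), (60, 70)]
```

Calling `read_points()` with no arguments reads the five lines from stdin instead.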

##### Output Format

Print a single line denoting the answer, rounded to a scale of __3__ decimal places.



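In [1]:
# Solution using scikit-learn

from sklearn import linear_model
import numpy as np

# To read the five points from stdin instead, uncomment:
# grades = [list(map(int, input().split())) for _ in range(5)]
grades = [[95, 85], [85, 95], [80, 70], [70, 65], [60, 70]]
X, Y = zip(*grades)

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
lm = linear_model.LinearRegression()
lm.fit(np.asarray(X).reshape(-1, 1), Y)
a = lm.intercept_   # intercept of the fitted line
b = lm.coef_[0]     # slope of the fitted line
res = a + b * 80    # predicted Statistics grade for x = 80

print(f"{res:0.3f}")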

78.288


In [2]:
# Solution without scikit-learn
import math

# To read the five points from stdin instead, uncomment:
# grades = [list(map(int, input().split())) for _ in range(5)]
grades = [[95, 85], [85, 95], [80, 70], [70, 65], [60, 70]]
X, Y = zip(*grades)
n = len(X)

X_mean = sum(X) / n
Y_mean = sum(Y) / n

# Population standard deviations (divide by n)
X_std = math.sqrt(sum((x - X_mean) ** 2 for x in X) / n)
Y_std = math.sqrt(sum((y - Y_mean) ** 2 for y in Y) / n)

# Pearson correlation coefficient
pear_c = sum((x - X_mean) * (y - Y_mean) for x, y in zip(X, Y)) / (n * X_std * Y_std)

b = pear_c * (Y_std / X_std)   # slope
a = Y_mean - b * X_mean        # intercept
res = a + b * 80               # predicted Statistics grade for x = 80

print(f"{res:0.3f}")

78.288
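
As a cross-check, the slope can also be computed directly as the ratio of the sum of centered products to the sum of centered squares, which is algebraically the same as $ \rho \cdot \frac{\sigma_Y}{\sigma_X} $ but needs no square roots. For the five points here the means are $ \bar{x} = 78 $ and $ \bar{y} = 77 $, the two sums are $ 470 $ and $ 730 $, so $ b = \frac{470}{730} \approx 0.644 $ and $ a = 77 - 78b \approx 26.781 $. The sketch below is one possible way to write this; the function name `predict` and the variable names `s_xy` and `s_xx` are illustrative and do not come from the original notebook.

```python
def predict(points, x_new):
    """Fit y = a + b*x by least squares and return the prediction at x_new."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n

    # Sum of centered products and sum of centered squares
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    s_xx = sum((x - x_mean) ** 2 for x, _ in points)

    b = s_xy / s_xx           # slope
    a = y_mean - b * x_mean   # intercept
    return a + b * x_new


grades = [(95, 85), (85, 95), (80, 70), (70, 65), (60, 70)]
print(f"{predict(grades, 80):0.3f}")  # 78.288
```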


#### Task<a name="task2" />
The regression line of $ Y $ on $ X $ is $ 3x + 4y + 8 = 0 $, and the regression line of $ X $ on $ Y $ is $ 4x + 3y + 7 = 0 $. What is the value of the Pearson correlation coefficient?

##### Solution
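Solve the first equation for $y$ and the second for $x$ to read off the slopes $b_1$ (regression of $Y$ on $X$) and $b_2$ (regression of $X$ on $Y$):
$$
\begin{eqnarray}
4y & = & -8 - 3x \\
y & = & -2 - \frac{3}{4}x \\
b_1 & = & -\frac{3}{4} \\
4x & = & -7 - 3y \\
x & = & -\frac{7}{4} - \frac{3}{4}y \\
b_2 & = & -\frac{3}{4}
\end{eqnarray}
$$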

Both $b_1$ and $b_2$ can be expressed using the Pearson correlation coefficient, so

$$
\begin{eqnarray}
b_1 & = & \rho \cdot \frac{\sigma_Y}{\sigma_X} \\
b_2 & = & \rho \cdot \frac{\sigma_X}{\sigma_Y} \\
\rho & = & b_1 \cdot \frac{\sigma_X}{\sigma_Y} \\
\rho & = & b_2 \cdot \frac{\sigma_Y}{\sigma_X}
\end{eqnarray}
$$

Multiplying the two expressions for $\rho$ cancels the standard deviations and gives:

$$
\begin{eqnarray}
\rho^2 & = & b_1 \cdot b_2 \\
\rho^2 & = & \frac{9}{16} \\
\rho & = & \pm\sqrt{\frac{9}{16}} = \pm\frac{3}{4}
\end{eqnarray}
$$

Since $b_1$ and $b_2$ are both negative, $X$ and $Y$ are negatively correlated. Standard deviations are always positive, so $\rho$ has the same sign as the slopes, and the Pearson correlation coefficient must be negative.

##### Answer
$ \rho = -\frac{3}{4} $
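
The reasoning in this solution can be checked numerically. The sketch below assumes each regression line is given as coefficients `(A, B, C)` of $ Ax + By + C = 0 $; the function name `rho_from_lines` and this tuple representation are hypothetical and chosen only for the example. It recovers both slopes, takes the square root of their product, and copies the sign from the slopes.

```python
import math


def rho_from_lines(y_on_x, x_on_y):
    """Pearson coefficient from two regression lines given as (A, B, C) in A*x + B*y + C = 0."""
    a1, b1, _ = y_on_x
    a2, b2, _ = x_on_y

    slope_y_on_x = -a1 / b1   # solve for y: y = -(A/B) x - C/B
    slope_x_on_y = -b2 / a2   # solve for x: x = -(B/A) y - C/A

    product = slope_y_on_x * slope_x_on_y   # equals rho squared
    if product < 0:
        raise ValueError("regression slopes must share the same sign")

    # rho has the same sign as both slopes
    return math.copysign(math.sqrt(product), slope_y_on_x)


print(rho_from_lines((3, 4, 8), (4, 3, 7)))  # -0.75
```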