
## Objectives:
- define a vector and calculate a vector length and dot product
- define a matrix and calculate a matrix dot product, transpose, and inverse
- explain cosine similarity and compute the similarity between two vectors
- use linear algebra to solve for linear regression coefficients

# Use the following information to answer the assignment questions 1) - 11).

### Is head size related to brain weight in healthy adult humans?

The Brainhead.csv dataset provides information on 237 individuals who were subject to post-mortem examination at the Middlesex Hospital in London around the turn of the 20th century. The study authors used cadavers to see whether a relationship could be determined between brain weight and other, more easily measured physiological characteristics such as age, sex, and head size. The end goal was to develop a way to estimate a person’s brain size while they were still alive (as the living aren’t keen on having their brains taken out and weighed).

**We wish to determine if we can improve on our model of the linear relationship between head size and brain weight in healthy human adults.**

Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to the Size of the Head", Biometrika, Vol. 4, pp. 105-123.

In [None]:
#Import the Brainhead.csv dataset from a URL and print the first few rows

import pandas as pd
import numpy as np


data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Brainhead/Brainhead.csv'

df = pd.read_csv(data_url, skipinitialspace=True, header=0)

df.head()

Unnamed: 0,Gender,Age,Head,Brain
0,1,1,4512,1530
1,1,1,3738,1297
2,1,1,4261,1335
3,1,1,3777,1282
4,1,1,4177,1590


1) Store the response variable - brain size - as a matrix called Y.

In [None]:
Y = np.array(df['Brain']).reshape((-1,1))

2) Store the explanatory variable - head size - as a matrix called X.  Don't forget to include the column of 1s for the intercept term.

In [None]:
X = np.array(df['Head']).reshape((-1,1))

In [None]:
ones = np.array(np.ones(len(X))).reshape((-1,1))

In [None]:
X = np.concatenate((ones, X), axis = 1)

In [None]:
#np.transpose(X).shape

In [None]:
X.shape

(237, 2)

3) Calculate $X^T$.  Explain what the transpose of a matrix is.

In [None]:
X_t = np.transpose(X)

In [None]:
X_t.shape

(3, 237)

> The transpose of a matrix is a new matrix in which each row is a column of the original and each column is a row of the original: entry $(i, j)$ of $X^T$ equals entry $(j, i)$ of $X$, so an $m \times n$ matrix becomes $n \times m$.
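
As a small illustration of this definition, the sketch below transposes a 2×3 matrix with NumPy. The matrix `A` is a made-up example and is not part of the Brainhead data.

```python
import numpy as np

# Hypothetical 2x3 matrix used only for illustration
A = np.array([[1, 2, 3],
              [4, 5, 6]])

A_t = A.T  # equivalent to np.transpose(A)

print(A.shape)    # (2, 3)
print(A_t.shape)  # (3, 2)

# Entry (i, j) of the transpose equals entry (j, i) of the original
print(A_t[2, 0] == A[0, 2])  # True
```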

4) Use matrix multiplication to calculate $X^TX$.

In [None]:
X_t_X = np.matmul(X_t,X)

In [None]:
X_t_X

array([[3.16128319e+09, 1.31823100e+06, 8.61256000e+05],
       [1.31823100e+06, 6.18000000e+02, 3.64000000e+02],
       [8.61256000e+05, 3.64000000e+02, 2.37000000e+02]])

5) Calculate $(X^TX)^{-1}$.  Explain what the inverse of a matrix is.

In [None]:
X_t_X_inv = np.linalg.inv(X_t_X)

In [None]:
X_t_X_inv

array([[ 3.21169750e-08,  2.47472443e-06, -1.20513659e-04],
       [ 2.47472443e-06,  1.71556109e-02, -3.53418297e-02],
       [-1.20513659e-04, -3.53418297e-02,  4.96445307e-01]])

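> The inverse of a square matrix $A$ is the matrix $A^{-1}$ for which $AA^{-1} = A^{-1}A = I$, the identity matrix. It plays the role that the reciprocal plays for ordinary numbers, and it exists only when the determinant of $A$ is nonzero.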
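
The defining property of the inverse can be checked numerically. The following is a minimal sketch on made-up matrices (`A` and `singular` are hypothetical and unrelated to the assignment data).

```python
import numpy as np

# Hypothetical 2x2 matrix used only for illustration
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

A_inv = np.linalg.inv(A)

# Multiplying a matrix by its inverse gives the identity matrix
# (up to floating-point rounding)
print(np.allclose(A @ A_inv, np.eye(2)))  # True
print(np.allclose(A_inv @ A, np.eye(2)))  # True

# A matrix whose determinant is 0 has no inverse
singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])
print(np.isclose(np.linalg.det(singular), 0.0))  # True
```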

6) Use matrix multiplication to calculate $X^TY$.

In [None]:
X_t_Y = np.matmul(X_t,Y)

In [None]:
X_t_Y

array([[1.1131768e+09],
       [4.6456100e+05],
       [3.0404100e+05]])

7) Use your previous results to calculate the values of the slope and intercept using the formula $$ B = (X^TX)^{-1}X^TY $$

In [None]:
B = np.matmul(X_t_X_inv,X_t_Y)

In [None]:
B

array([[3.25573421e+02],
       [2.63429339e-01]])
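
The formula from question 7 can be sanity-checked on data where the answer is known in advance. The sketch below uses hypothetical points lying exactly on the line y = 2 + 3x (the names `x`, `y`, `X_demo`, `B_demo`, and `B_lstsq` are illustrative only) and compares the normal-equation result with NumPy's least-squares solver.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = (2.0 + 3.0 * x).reshape(-1, 1)

# Design matrix: a column of 1s for the intercept, then x
X_demo = np.column_stack((np.ones(len(x)), x))

# Normal equations: B = (X^T X)^{-1} X^T Y
B_demo = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y

# Least-squares solver, which does not form the inverse explicitly
B_lstsq, *_ = np.linalg.lstsq(X_demo, y, rcond=None)

print(B_demo.ravel())                # approximately [2. 3.]
print(np.allclose(B_demo, B_lstsq))  # True
```

Because the column of 1s comes first in the design matrix, the first entry of the result is the intercept and the second is the slope.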

8) Use the OLS function to calculate the slope and intercept and compare your answers.

In [None]:
from statsmodels.formula.api import ols

model = ols('Brain~Head', data = df ).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Brain   R-squared:                       0.639
Model:                            OLS   Adj. R-squared:                  0.638
Method:                 Least Squares   F-statistic:                     416.5
Date:                Fri, 16 Oct 2020   Prob (F-statistic):           5.96e-54
Time:                        05:19:56   Log-Likelihood:                -1350.3
No. Observations:                 237   AIC:                             2705.
Df Residuals:                     235   BIC:                             2711.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    325.5734     47.141      6.906      0.0

9) Create a new X matrix that includes columns for both head size and age group.

In [None]:
df.head()

Unnamed: 0,Gender,Age,Head,Brain
0,1,1,4512,1530
1,1,1,3738,1297
2,1,1,4261,1335
3,1,1,3777,1282
4,1,1,4177,1590


In [None]:
head = np.array(df['Head']).reshape((-1,1))
age = np.array(df['Age']).reshape((-1,1))
ones  = np.ones(len(head)).reshape((-1,1))
X = np.concatenate((ones, head ,age), axis = 1)


10) Calculate the values of the intercept and slope terms for head size and age using the formula $$ B = (X^TX)^{-1}X^TY $$

In [None]:
B = np.linalg.inv(X.T @ X) @ X.T @ Y  # normal equations with the new X
B

array([[ 3.68282145e+02],
       [ 2.60438766e-01],
       [-2.07316446e+01]])

11) Use the OLS function to confirm your answer in 10).

In [None]:
model = ols('Brain~Head+Age', data = df).fit()

In [None]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Brain   R-squared:                       0.647
Model:                            OLS   Adj. R-squared:                  0.644
Method:                 Least Squares   F-statistic:                     214.1
Date:                Fri, 16 Oct 2020   Prob (F-statistic):           1.38e-53
Time:                        05:27:24   Log-Likelihood:                -1347.8
No. Observations:                 237   AIC:                             2702.
Df Residuals:                     234   BIC:                             2712.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    368.2821     50.618      7.276      0.0

# Use the following information to answer the assignment questions 12) - 16).

The songwriting collaboration between John Lennon and Paul McCartney was one of the most productive in music history.  Unlike many other partnerships where one individual wrote lyrics and one wrote music, Lennon and McCartney composed both, and it was decided that any song that was written would be credited to both.  In the beginning of their relationship, many of their songs were truly collaborative.  However, later on, they often worked separately with little to no input from the other.

Because of extensive reporting on the Beatles over the years, it is generally known if a Lennon-McCartney song was a true collaboration, primarily (or totally) written by Lennon, or primarily (or totally) written by McCartney.

However, there are several disputed songs where both Lennon and McCartney at times claimed to be the sole (or primary) composer.

We will now use cosine similarity to determine if *Ticket to Ride* (disputed) is most similar to *From Me to You* (collaborative, not disputed) or *Strawberry Fields* (Lennon, not disputed).
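
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it reflects direction rather than magnitude. A minimal sketch with made-up count vectors follows; the helper `cosine_similarity` and the vectors `a`, `b`, and `c` are hypothetical and only illustrate the calculation.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(u, v):
    """Return cos(theta) = (u . v) / (|u| |v|) for two 1-D vectors."""
    return np.dot(u, v) / (norm(u) * norm(v))

a = np.array([1, 2, 0])
b = np.array([2, 4, 0])   # same direction as a, twice as long
c = np.array([0, 0, 5])   # shares no nonzero component with a

print(cosine_similarity(a, b))  # close to 1.0 (same direction)
print(cosine_similarity(a, c))  # 0.0 (orthogonal)
```

For word-count vectors, which have no negative entries, the value falls between 0 (no words in common) and 1 (identical relative word frequencies).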

From the Wikipedia article on the Lennon-McCartney Partnership: Lennon said that McCartney's contribution was limited to "the way Ringo played the drums". In *Many Years from Now*, McCartney said "we sat down and wrote it together ... give him 60 percent of it."

12) Import the text of Strawberry Fields and calculate the frequency of each word in the lyrics using the code below.

In [37]:
import pandas as pd

#Strawberry Fields - John Lennon (not disputed)

Strawberry_ = "let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever living is easy with eyes closed misunderstanding all you see its getting hard to be someone but it all works out it doesnt matter much to me let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever no one I think is in my tree I mean it must be high or low that is you cant you know tune in but its all right that is I think its not too bad let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever always no sometimes think but you know I know when it's a dream I think er no I mean er yes but its all wrong that is I think I disagree let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever Strawberry Fields forever Strawberry Fields forever"

Strawberry_ = Strawberry_.lower()

Strawberry_df = pd.DataFrame({"Words": Strawberry_.split()})

Strawberry_df = pd.crosstab(Strawberry_df['Words'], 'count_strawberry')

Strawberry_df.head()

col_0,count_strawberry
Words,Unnamed: 1_level_1
a,1
about,4
all,4
always,1
and,4


13) Import the text of From Me to You and calculate the frequency of each word in the lyrics using the code below.

In [46]:
From_ = 'youre like a melody that never goes away youre the sweetest thing an easy song to sing your like a work of art so priceless to me a timeless beauty from the movie screen that never ever seems to fade so baby put that glass down and turn the music up its like these words was written just for the two of us do you remember that day in september when we fell in love this ones for me and you so let the record play i love the way it makes your body move it sounds so good to me reminds me of you and when the record plays a melody love fills up the room it sounds so good to me this ones for me and you girl we came so far and beat out all the odds they never thought wed make it but i knew it from the start so lets celebrate cause we got it good a classic going down in history feels just like we won the lottery so baby put that glass down and turn music up its like these words was written just for the two of us do you remember that day in september when we fell in love this ones for me and you so let the record play i love the way it makes your body move it sound so good to me reminds me of you and when the record plays a melody love fills up the room it sounds so good to me this ones for me and you let it play let it play let it play this ones for my baby let it play let it play let it play this ones for me and you let it play let it play let it play just for me and you girl let it play let it play let it play yeah oh so baby put that glass down and turn music up its like these words was written just for the two of us do you remember that day in september baby when we fell when we fell in love this ones for me and you so let the record play i love the way it makes your body move it sounds so good to me reminds me of you and when record plays a melody love fills up the room it sounds so good to me this ones for me and you let it play let it play let it play hey were gonna sit back down yes we are with a glass of wine let it play let it play let it play for my baby for my girl this ones for you let it play let it play let it play for my baby for my girl this ones for you let it play let it play let it play for my baby for my girl this ones for you'

In [47]:
From_df = pd.DataFrame({'Words':From_.split(' ')})

In [48]:
From_df = pd.crosstab(From_df['Words'],'from_count')

13) Import the text of Ticket to Ride using the code below.

In [32]:
Ticket_ = '''I think I'm gonna be sad
I think it's today, yeah
The girl that's driving me mad
Is going away
She's got a ticket to ride
She's got a ticket to ride
She's got a ticket to ride
But she don't care
She said that living with me
Is bringing her down yeah
For she would never be free
When I was around
She's got a ticket to ride
She's got a ticket to ride
She's got a ticket to ride
But she don't care
I don't know why she's ridin' so high
She ought to think twice
She ought to do right by me
Before she gets to saying goodbye
She ought to think twice,
She ought to do right by me
I think I'm gonna be sad
I think it's today yeah
The girl that's driving me mad
Is going away, yeah
She's got a ticket to ride
She's got a ticket to ride
She's got a ticket to ride
But she don't care
I don't know why she's ridin' so high
She ought to think twice
She ought to do right by me
Before she gets to saying goodbye
She ought to think twice
She ought to do right by me
She said that living with me
Is bringing her down, yeah
For she would never be free
When I was around
Ah, she's got a ticket to ride
She's got a ticket to ride
She's got a ticket to ride
But she don't care
My baby don't care, my baby don't care
My baby don't care, my baby don't care
My baby don't care, my baby don't care'''

In [33]:
Ticket_ = Ticket_.replace(',','').replace("'", '').replace('\n',' ')

In [10]:
Ticket_ = Ticket_.lower()

14) Concatenate Ticket to Ride and Strawberry Fields and calculate the cosine similarity.

In [63]:
Ticket_df = pd.DataFrame({'Words': Ticket_.split(" ")}) 

In [64]:
Ticket_df = pd.crosstab(Ticket_df['Words'], 'ticket_count')

In [43]:
Strawberry_df.shape

(69, 1)
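
The cells above build the word counts but stop short of the similarity that question 14 asks for. One possible way to compute it, sketched here with the standard library rather than the `crosstab` and `merge` approach used elsewhere in this notebook, is to count the words in each text and take the cosine of the two count vectors over their combined vocabulary. The helper name `text_cosine` and the two short sample strings are hypothetical; in the notebook one would presumably pass `Ticket_` and `Strawberry_` instead.

```python
from collections import Counter
from math import sqrt

def text_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # A word missing from one text counts as 0 there, so only
    # shared words contribute to the dot product
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = sqrt(sum(n * n for n in counts_a.values()))
    norm_b = sqrt(sum(n * n for n in counts_b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical sample strings, not the real lyrics
sample_1 = "she has a ticket to ride"
sample_2 = "a ticket to the show"

print(round(text_cosine(sample_1, sample_2), 4))  # 0.5477
```

The sketch assumes both texts contain at least one word; an empty text would make a norm zero and the division fail.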

15) Concatenate Ticket to Ride and From Me to You and calculate the cosine similarity.

In [55]:
pd.set_option('display.max_rows', 100)

In [65]:
ticket_from = pd.merge(From_df, Ticket_df, on='Words', how='outer')


In [68]:
import numpy as np

In [69]:
ticket_vector = np.nan_to_num(np.array(ticket_from['ticket_count']))

In [71]:
from_vector = np.nan_to_num(np.array(ticket_from['from_count']))

In [73]:
from numpy import dot
from numpy.linalg import norm

In [74]:
cosine = dot(ticket_vector, from_vector) / (norm(ticket_vector) * norm(from_vector))

In [75]:
cosine

0.17149942927278827

16) What is your conclusion about Ticket to Ride?  Does it appear most similar to Strawberry Fields (Lennon) or From Me to You (collaborative)?

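> Ticket to Ride appears more similar to *Strawberry Fields* (Lennon) than to *From Me to You* (collaborative), whose cosine similarity with it is only about 0.17.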