## Objectives:
- define a vector and calculate a vector length and dot product
- define a matrix and calculate a matrix dot product, transpose, and inverse
- explain cosine similarity and compute the similarity between two vectors
- use linear algebra to solve for linear regression coefficients

# Use the following information to answer the assignment questions 1) - 11).

### Is head size related to brain weight in healthy adult humans?

The Brainhead.csv dataset provides information on 237 individuals who were subject to post-mortem examination at the Middlesex Hospital in London around the turn of the 20th century. Study authors used cadavers to see if a relationship between brain weight and other more easily measured physiological characteristics such as age, sex, and head size could be determined. The end goal was to develop a way to estimate a person’s brain size while they were still alive (as the living aren’t keen on having their brains taken out and weighed).

**We wish to determine if we can improve on our model of the linear relationship between head size and brain weight in healthy human adults.**

Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to the Size of the Head", Biometrika, Vol. 4, pp. 105-123.

In [98]:
#Import the Brainhead.csv dataset from a URL and print the first few rows

import pandas as pd
import numpy as np


data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Brainhead/Brainhead.csv'

df = pd.read_csv(data_url, skipinitialspace=True, header=0)

df.head()

Unnamed: 0,Gender,Age,Head,Brain
0,1,1,4512,1530
1,1,1,3738,1297
2,1,1,4261,1335
3,1,1,3777,1282
4,1,1,4177,1590


1) Store the response variable - brain size - as a matrix called Y.

In [99]:
### YOUR CODE HERE ###
Y = np.array(df['Brain']).reshape(-1, 1)
print(Y)

[[1530]
 [1297]
 [1335]
 [1282]
 [1590]
 [1300]
 [1400]
 [1255]
 [1355]
 [1375]
 [1340]
 [1380]
 [1355]
 [1522]
 [1208]
 [1405]
 [1358]
 [1292]
 [1340]
 [1400]
 [1357]
 [1287]
 [1275]
 [1270]
 [1635]
 [1505]
 [1490]
 [1485]
 [1310]
 [1420]
 [1318]
 [1432]
 [1364]
 [1405]
 [1432]
 [1207]
 [1375]
 [1350]
 [1236]
 [1250]
 [1350]
 [1320]
 [1525]
 [1570]
 [1340]
 [1422]
 [1506]
 [1215]
 [1311]
 [1300]
 [1224]
 [1350]
 [1335]
 [1390]
 [1400]
 [1225]
 [1310]
 [1560]
 [1330]
 [1222]
 [1415]
 [1175]
 [1330]
 [1485]
 [1470]
 [1135]
 [1310]
 [1154]
 [1510]
 [1415]
 [1468]
 [1390]
 [1380]
 [1432]
 [1240]
 [1195]
 [1225]
 [1188]
 [1252]
 [1315]
 [1245]
 [1430]
 [1279]
 [1245]
 [1309]
 [1412]
 [1120]
 [1220]
 [1280]
 [1440]
 [1370]
 [1192]
 [1230]
 [1346]
 [1290]
 [1165]
 [1240]
 [1132]
 [1242]
 [1270]
 [1218]
 [1430]
 [1588]
 [1320]
 [1290]
 [1260]
 [1425]
 [1226]
 [1360]
 [1620]
 [1310]
 [1250]
 [1295]
 [1290]
 [1290]
 [1275]
 [1250]
 [1270]
 [1362]
 [1300]
 [1173]
 [1256]
 [1440]
 [1180]
 [1306]


2) Store the explanatory variable - head size - as a matrix called X.  Don't forget to include the column of 1s for the intercept term.

In [100]:

### YOUR CODE HERE ###
ones = np.repeat(1,len(df)).reshape(-1,1)
head = np.array(df['Head']).reshape(-1,1)

X = np.concatenate((ones, head), axis=1)

print(X)

[[   1 4512]
 [   1 3738]
 [   1 4261]
 [   1 3777]
 [   1 4177]
 [   1 3585]
 [   1 3785]
 [   1 3559]
 [   1 3613]
 [   1 3982]
 [   1 3443]
 [   1 3993]
 [   1 3640]
 [   1 4208]
 [   1 3832]
 [   1 3876]
 [   1 3497]
 [   1 3466]
 [   1 3095]
 [   1 4424]
 [   1 3878]
 [   1 4046]
 [   1 3804]
 [   1 3710]
 [   1 4747]
 [   1 4423]
 [   1 4036]
 [   1 4022]
 [   1 3454]
 [   1 4175]
 [   1 3787]
 [   1 3796]
 [   1 4103]
 [   1 4161]
 [   1 4158]
 [   1 3814]
 [   1 3527]
 [   1 3748]
 [   1 3334]
 [   1 3492]
 [   1 3962]
 [   1 3505]
 [   1 4315]
 [   1 3804]
 [   1 3863]
 [   1 4034]
 [   1 4308]
 [   1 3165]
 [   1 3641]
 [   1 3644]
 [   1 3891]
 [   1 3793]
 [   1 4270]
 [   1 4063]
 [   1 4012]
 [   1 3458]
 [   1 3890]
 [   1 4166]
 [   1 3935]
 [   1 3669]
 [   1 3866]
 [   1 3393]
 [   1 4442]
 [   1 4253]
 [   1 3727]
 [   1 3329]
 [   1 3415]
 [   1 3372]
 [   1 4430]
 [   1 4381]
 [   1 4008]
 [   1 3858]
 [   1 4121]
 [   1 4057]
 [   1 3824]
 [   1 3394]
 [   1 3558]

3) Calculate $X^T$.  Explain what the transpose of a matrix is.

In [101]:
### YOUR CODE HERE ###
X_T = np.transpose(X)
print(X_T)

[[   1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1    1    1
     1    1    1    1    1    1    1    1    1    1    1    1   

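The transpose of a matrix swaps its rows and columns: the entry in row i, column j of $X$ becomes the entry in row j, column i of $X^T$. Here the 237 x 2 matrix $X$ becomes a 2 x 237 matrix.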
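
As a small illustration of the row and column swap, here is a sketch on a made-up 3 x 2 matrix (the values are hypothetical and not taken from Brainhead.csv). `A.T` is NumPy shorthand for `np.transpose(A)`.

```python
import numpy as np

# Hypothetical 3x2 matrix: a column of ones and one predictor column.
A = np.array([[1, 10],
              [1, 20],
              [1, 30]])

A_T = A.T  # same result as np.transpose(A)

print(A.shape)    # rows x columns of the original
print(A_T.shape)  # dimensions are swapped
print(A_T)

# Entry (row 2, column 1) of A is entry (row 1, column 2) of A_T.
print(A[2, 1] == A_T[1, 2])
```

Transposing twice returns the original matrix, which is a quick sanity check when shapes get confusing.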

4) Use matrix multiplication to calculate $X^TX$

In [102]:
### YOUR CODE HERE ###
X_T_X = np.matmul(X_T,X)

print(X_T_X)

[[       237     861256]
 [    861256 3161283190]]
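
Because the first column of X is all 1s, the entries of $X^TX$ have a direct reading: the top-left entry is the number of rows (237), the off-diagonal entries are the sum of the head sizes, and the bottom-right entry is the sum of the squared head sizes. The sketch below checks this on a tiny made-up predictor; `x`, `X_demo`, and `gram` are illustrative names, not part of the assignment.

```python
import numpy as np

# Hypothetical predictor values (not from the dataset).
x = np.array([2.0, 3.0, 5.0])

# Design matrix with a leading column of ones for the intercept.
X_demo = np.column_stack((np.ones_like(x), x))

gram = X_demo.T @ X_demo

# Expected structure: [[n, sum(x)], [sum(x), sum(x^2)]]
expected = np.array([[len(x), x.sum()],
                     [x.sum(), (x ** 2).sum()]])

print(gram)
print(np.array_equal(gram, expected))
```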


5) Calculate $(X^TX)^{-1}$.  Explain what the inverse of a matrix is.

In [103]:
### YOUR CODE HERE ###
X_T_X_inv = np.linalg.inv(X_T_X)

print(X_T_X_inv)

[[ 4.23638519e-01 -1.15415543e-04]
 [-1.15415543e-04  3.17599920e-08]]


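The inverse of a square matrix $A$ is the matrix $A^{-1}$ for which $AA^{-1} = A^{-1}A = I$, where $I$ is the identity matrix. It is the matrix analogue of a reciprocal, and it exists only when the determinant of $A$ is nonzero.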
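
A minimal sketch of that defining property, using a made-up 2 x 2 matrix rather than the $X^TX$ computed above. The second matrix `S` is a hypothetical example of a matrix with no inverse, since its determinant is zero.

```python
import numpy as np

# Hypothetical invertible matrix: determinant is 4*6 - 7*2 = 10.
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

A_inv = np.linalg.inv(A)
print(A_inv)

# A times its inverse is the identity, up to floating-point rounding.
is_identity = np.allclose(A @ A_inv, np.eye(2))
print(is_identity)

# Hypothetical singular matrix: the second row is twice the first,
# so the determinant is zero and no inverse exists.
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])
det_S = np.linalg.det(S)
print(np.isclose(det_S, 0.0))
```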

6) Use matrix multiplication to calculate $X^TY$.

In [104]:
### YOUR CODE HERE ###
X_T_Y = np.matmul(X_T,Y)

print(X_T_Y)

[[    304041]
 [1113176805]]


7) Use your previous results to calculate the values of the slope and intercept using the formula $$ B = (X^{'}X)^{-1}X^{'}Y$$

In [105]:
### YOUR CODE HERE ###
B = np.matmul(X_T_X_inv,X_T_Y)

print(B)

[[3.25573421e+02]
 [2.63429339e-01]]
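
The formula in 7) is the solution of the normal equations $X^TXB = X^TY$. The sketch below, on synthetic data where the true intercept and slope are known, compares the explicit-inverse formula with two NumPy alternatives that avoid forming the inverse: `np.linalg.solve` and `np.linalg.lstsq`. The data and the names `X_demo`, `B_inv`, `B_solve`, and `B_lstsq` are assumptions for illustration only.

```python
import numpy as np

# Synthetic data (hypothetical): y = 3 + 2x with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = (3.0 + 2.0 * x).reshape(-1, 1)
X_demo = np.column_stack((np.ones_like(x), x))

# Formula from the assignment, with an explicit inverse.
B_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y

# Same normal equations, solved as a linear system.
B_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y)

# Least-squares routine that works directly on X and y.
B_lstsq, *_ = np.linalg.lstsq(X_demo, y, rcond=None)

print(B_inv.ravel())
print(B_solve.ravel())
print(B_lstsq.ravel())
```

All three should agree on well-conditioned data like this; the solver-based versions are generally preferred in practice because inverting $X^TX$ can amplify rounding error.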


8) Use the OLS function to calculate the slope and intercept and compare your answers.

In [106]:
### YOUR CODE HERE ###
from statsmodels.formula.api import ols

#Enter the model in the format Y ~ X

model = ols('Brain ~ Head', data=df).fit()

print(model.params)

Intercept    325.573421
Head           0.263429
dtype: float64


9) Create a new X matrix that includes columns for both head size and age group.

In [107]:
### YOUR CODE HERE ###
ones = np.repeat(1,len(df)).reshape(-1,1)
head = np.array(df['Head']).reshape(-1,1)
age = np.array(df['Age']).reshape(-1,1)

X = np.concatenate((ones, head, age), axis=1)

print(X)

[[   1 4512    1]
 [   1 3738    1]
 [   1 4261    1]
 [   1 3777    1]
 [   1 4177    1]
 [   1 3585    1]
 [   1 3785    1]
 [   1 3559    1]
 [   1 3613    1]
 [   1 3982    1]
 [   1 3443    1]
 [   1 3993    1]
 [   1 3640    1]
 [   1 4208    1]
 [   1 3832    1]
 [   1 3876    1]
 [   1 3497    1]
 [   1 3466    1]
 [   1 3095    1]
 [   1 4424    1]
 [   1 3878    1]
 [   1 4046    1]
 [   1 3804    1]
 [   1 3710    1]
 [   1 4747    1]
 [   1 4423    1]
 [   1 4036    1]
 [   1 4022    1]
 [   1 3454    1]
 [   1 4175    1]
 [   1 3787    1]
 [   1 3796    1]
 [   1 4103    1]
 [   1 4161    1]
 [   1 4158    1]
 [   1 3814    1]
 [   1 3527    1]
 [   1 3748    1]
 [   1 3334    1]
 [   1 3492    1]
 [   1 3962    1]
 [   1 3505    1]
 [   1 4315    1]
 [   1 3804    1]
 [   1 3863    1]
 [   1 4034    1]
 [   1 4308    1]
 [   1 3165    1]
 [   1 3641    1]
 [   1 3644    1]
 [   1 3891    1]
 [   1 3793    1]
 [   1 4270    1]
 [   1 4063    1]
 [   1 4012    1]
 [   1 345

10) Calculate the values of the intercept and slope terms for head size and age using the formula $$ B = (X^{'}X)^{-1}X^{'}Y$$

In [108]:
### YOUR CODE HERE ###
X_T = np.transpose(X)
X_T_X = np.matmul(X_T,X)
X_T_X_inv = np.linalg.inv(X_T_X)
X_T_Y = np.matmul(X_T,Y)
B = np.matmul(X_T_X_inv,X_T_Y)

print(B)

[[ 3.68282145e+02]
 [ 2.60438766e-01]
 [-2.07316446e+01]]


11) Use the OLS function to confirm your answer in 10).

In [109]:
### YOUR CODE HERE ###
model = ols('Brain ~ Head + Age', data=df).fit()

print(model.params)

Intercept    368.282145
Head           0.260439
Age          -20.731645
dtype: float64


# Use the following information to answer the assignment questions 12) - 16).

The songwriting collaboration between John Lennon and Paul McCartney was one of the most productive in music history.  Unlike many other partnerships where one individual wrote lyrics and one wrote music, Lennon and McCartney composed both, and it was decided that any song that was written would be credited to both.  In the beginning of their relationship, many of their songs were truly collaborative.  However, later on, they often worked separately with little to no input from the other.

Because of extensive reporting on the Beatles over the years, it is generally known whether a Lennon-McCartney song was a true collaboration, primarily (or totally) written by Lennon, or primarily (or totally) written by McCartney.

However, there are several disputed songs where both Lennon and McCartney at times claimed to be the sole (or primary) composer.

We will now use cosine similarity to determine if *Ticket to Ride* (disputed) is most similar to *From Me to You* (collaborative, not disputed) or *Strawberry Fields* (Lennon, not disputed).

From the Wikipedia article on the Lennon-McCartney Partnership: Lennon said that McCartney's contribution was limited to "the way Ringo played the drums". In *Many Years from Now*, McCartney said "we sat down and wrote it together ... give him 60 percent of it."
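
Cosine similarity is the dot product of two vectors divided by the product of their lengths, $\cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. For vectors of non-negative word counts it ranges from 0 (no words in common) to 1 (same direction). The sketch below works through the pieces on small made-up vectors; `u`, `v`, and `w` are hypothetical and unrelated to the song data.

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

dot_uv = np.dot(u, v)        # 3*4 + 4*3 = 24
len_u = np.linalg.norm(u)    # sqrt(3^2 + 4^2) = 5
len_v = np.linalg.norm(v)    # sqrt(4^2 + 3^2) = 5

cos_uv = dot_uv / (len_u * len_v)  # 24 / 25
print(cos_uv)

# Perpendicular vectors have a dot product of 0, so similarity is 0.
w = np.array([-4.0, 3.0])
cos_uw = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))
print(cos_uw)
```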

12) Import the text of Strawberry Fields and calculate the frequency of each word in the lyrics using the code below.

In [110]:
### YOUR CODE HERE ###
Strawberry = "let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever living is easy with eyes closed misunderstanding all you see its getting hard to be someone but it all works out it doesnt matter much to me let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever no one I think is in my tree I mean it must be high or low that is you cant you know tune in but its all right that is I think its not too bad let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever always no sometimes think but you know I know when it's a dream I think er no I mean er yes but its all wrong that is I think I disagree let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever Strawberry Fields forever Strawberry Fields forever"
Strawberry_df = pd.DataFrame({"Words": Strawberry.split()})
Strawberry_df_freq = pd.DataFrame(pd.crosstab(index=Strawberry_df['Words'],columns='count'))

Strawberry_df_freq.head(20)

col_0,count
Words,Unnamed: 1_level_1
Fields,10
I,8
Im,4
Strawberry,10
a,1
about,4
all,4
always,1
and,4
bad,1


13) Import the text of From Me to You and calculate the frequency of each word in the lyrics using the code below.

In [111]:
### YOUR CODE HERE ###
Me = "if there's anything that you want if there's anything I can do just call on me and Ill send it along with love from me to you Ive got everything that you want like a heart thats oh so true just call on me and Ill send it along with love from me to you Ive got arms that long to hold you and keep you by my side Ive got lips that long to kiss you and keep you satisfied oh if theres anything that you want if theres anything I can do just call on me and Ill send it along with love from me to you from me to you just call on me and Ill send it along with love from me to you Ive got arms that long to hold you and keep you by my side Ive got lips that long to kiss you and keep you satisfied oh if theres anything that you want if theres anything I can do just call on me and Ill send it along with love from me to you to you to you to you"
Me_df = pd.DataFrame({"Words": Me.split()})
Me_df_freq = pd.DataFrame(pd.crosstab(index=Me_df['Words'],columns='count'))

Me_df_freq.head(20)

col_0,count
Words,Unnamed: 1_level_1
I,3
Ill,5
Ive,5
a,1
along,5
and,9
anything,6
arms,2
by,2
call,5


13) Import the text of Ticket to Ride using the code below.

In [112]:
### YOUR CODE HERE ###
Ride = "I think Im gonna be sad I think its today yeah the girl thats driving me mad is going away shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care she said that living with me is bringing her down yeah for she would never be free when I was around shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care I dont know why shes ridin so high she ought to think twice she ought to do right by me before she gets to saying goodbye she ought to think twice she ought to do right by me I think Im gonna be sad I think its today yeah the girl thats driving me mad is going away yeah shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care I dont know why shes ridin so high she ought to think twice she ought to do right by me before she gets to saying goodbye she ought to think twice she ought to do right by me she said that living with me is bringing her down yeah for she would never be free when I was around ah shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care my baby dont care my baby dont care my baby dont care my baby dont care my baby dont care my baby dont care"
Ride_df = pd.DataFrame({"Words": Ride.split()})
Ride_df_freq = pd.DataFrame(pd.crosstab(index=Ride_df['Words'],columns='count'))

Ride_df_freq.head(20)


14) Concatenate Ticket to Ride and Strawberry Fields and calculate the cosine similarity.

In [113]:
### YOUR CODE HERE ###
from numpy import dot
from numpy.linalg import norm

dfs = [Strawberry_df_freq, Ride_df_freq]

combined = pd.concat(dfs, axis=1)
combined = combined.fillna(0)
combined.columns = ['Strawberry', 'Ride']
cos_sim = dot(combined['Strawberry'], combined['Ride'])/(norm(combined['Strawberry'])*norm(combined['Ride']))

print(cos_sim)
combined.head(20)

0.324035859004908


Unnamed: 0,Strawberry,Ride
Fields,10.0,0.0
I,8.0,8.0
Im,4.0,2.0
Strawberry,10.0,0.0
a,1.0,12.0
about,4.0,0.0
all,4.0,0.0
always,1.0,0.0
and,4.0,0.0
bad,1.0,0.0
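
The key step above is alignment: `pd.concat(..., axis=1)` lines the two frequency tables up by word, and `fillna(0)` gives a count of 0 to any word that appears in only one song. The same idea can be sketched without pandas using `collections.Counter`. The helper `cosine_from_text` and the two short strings are hypothetical and only illustrate the technique; like `split()` in the cells above, the comparison is case-sensitive.

```python
from collections import Counter
import numpy as np

def cosine_from_text(text_a, text_b):
    """Cosine similarity between whitespace-split word counts of two strings."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    # Shared vocabulary: every word that appears in either text.
    vocab = sorted(set(counts_a) | set(counts_b))
    # A Counter returns 0 for missing words, which mirrors fillna(0).
    a = np.array([counts_a[w] for w in vocab], dtype=float)
    b = np.array([counts_b[w] for w in vocab], dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy strings that share only the word "love".
sim = cosine_from_text("la la love", "love me do")
print(sim)
```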


15) Concatenate Ticket to Ride and From Me to You and calculate the cosine similarity.

In [114]:
### YOUR CODE HERE ###
dfs2 = [Me_df_freq, Ride_df_freq]

combined2 = pd.concat(dfs2, axis=1)
combined2 = combined2.fillna(0)
combined2.columns = ['Me', 'Ride']
cos_sim2 = dot(combined2['Me'], combined2['Ride'])/(norm(combined2['Me'])*norm(combined2['Ride']))

print(cos_sim2)
combined2.head(20)

0.2882268853551227


Unnamed: 0,Me,Ride
I,3.0,8.0
Ill,5.0,0.0
Ive,5.0,0.0
a,1.0,12.0
along,5.0,0.0
and,9.0,0.0
anything,6.0,0.0
arms,2.0,0.0
by,2.0,4.0
call,5.0,0.0


16) What is your conclusion about Ticket to Ride?  Does it appear most similar to Strawberry Fields (Lennon) or From Me to You (collaborative)?

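Ticket to Ride has a higher cosine similarity with Strawberry Fields (about 0.324) than with From Me to You (about 0.288), so its word frequencies are slightly closer to the Lennon song. The gap is small, so this is only weak evidence that Lennon was the primary writer.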