## Autograded Notebook (Canvas & CodeGrade)

This notebook will be automatically graded. It is designed to test your answers and award points for the correct answers. Follow the instructions for each Task carefully.

### Instructions

* **Download this notebook** as you would any other ipynb file
* **Upload** to Google Colab or work locally (if you have that set-up)
* **Delete `raise NotImplementedError()`**
* Write your code in the `# YOUR CODE HERE` space
* **Execute** the Test cells that contain `assert` statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
* **Save** your notebook when you are finished
* **Download** as an `ipynb` file (if working in Colab)
* **Upload** your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)

# Unit 1 Sprint 3 Module 4

## Module Project: Linear Algebra

### Learning Objectives

* define a vector and calculate a vector length and dot product
* define a matrix and calculate a matrix dot product, transpose, and inverse
* explain cosine similarity and compute the similarity between two vectors
* use linear algebra to solve for linear regression coefficients

### Total notebook points: 13

## Part A: Introduction

### Statistical significance between head size and brain weight in healthy adult humans

The `Brainhead.csv` dataset provides information on 237 individuals who were subject to post-mortem examination at the Middlesex Hospital in London around the turn of the 20th century. Study authors used cadavers to see if a relationship could be determined between brain weight and other more easily measured physiological characteristics such as age, sex, and head size. The end goal was to develop a way to estimate a person’s brain size while they were still alive (as the living aren’t keen on having their brains taken out and weighed).

**We wish to determine if we can improve on our model of the linear relationship between head size and brain weight in healthy human adults.**

Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to the Size of the Head", Biometrika, Vol. 4, pp105-123.

### Use the above information to complete the following tasks. 

**Task 1** - Load the data

Let's load the data! The URL and the pandas import have been provided (numpy is imported in Task 2).

* load your CSV file into a DataFrame named `df`

In [1]:
# Task 1

# Imports
import pandas as pd

# Dataset URL
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Brainhead/Brainhead.csv'

# YOUR CODE HERE
df = pd.read_csv(data_url)

# Print out your DataFrame
print(df.head())

Unnamed: 0,Gender,Age,Head,Brain
0,1,1,4512,1530
1,1,1,3738,1297
2,1,1,4261,1335
3,1,1,3777,1282
4,1,1,4177,1590


**Task 2**

Store the response variable (brain size) as a matrix called Y.

* Import numpy
* Assign the `Brain` column to a matrix variable called `Y` which should be a `np.array` with dimensions `(237, 1)`
* Print out your matrix shape to check the dimensions.

In [4]:
# Task 2

# YOUR CODE HERE
import numpy as np
Y = np.array(df['Brain']).reshape(-1,1)
print(Y[:10]); print()

# Print your matrix
print("Matrix Y shape: ", Y.shape)

[[1530]
 [1297]
 [1335]
 [1282]
 [1590]
 [1300]
 [1400]
 [1255]
 [1355]
 [1375]]

Matrix Y shape:  (237, 1)


In [5]:
# Task 2 - Test

assert Y.shape == (237, 1), "Is your Y matrix of the correct shape?"


**Task 3** - Create the X and Y matrices

Store the explanatory variable (head size) as the matrix `X`.  Don't forget to include the **column of ones** for the intercept term (refer to the Guided Project notes if needed).

* Create an `np.array` of ones with the name `ones`; it should be the same length as your DataFrame.
* Create another `np.array` with the name `head`; it should have a shape of `(237, 1)`.
* Combine the `ones` and `head` arrays to create your matrix and call it `X`; it should have a shape of `(237, 2)`
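
As a rough illustration of the three steps above, the sketch below builds a small design matrix from three made-up head-size values. The `toy_` names and the values are assumptions for the example, not the real dataset, and `np.hstack` is only one of several ways to join the columns.

```python
import numpy as np

# Hypothetical head-size values standing in for df['Head']
toy_head = np.array([4512, 3738, 4261]).reshape(-1, 1)   # shape (3, 1)

# Column of ones for the intercept term, same number of rows
toy_ones = np.ones((len(toy_head), 1))                   # shape (3, 1)

# Place the two columns side by side
toy_X = np.hstack([toy_ones, toy_head])                  # shape (3, 2)
print(toy_X.shape)
```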

$$Y = XB + \epsilon$$

$$ B = (X^{'}X)^{-1}X^{'}Y$$
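
One way to see where the second formula comes from, sketched here under the usual assumption that $X^{'}X$ is invertible: least squares chooses the $B$ that minimizes the sum of squared residuals, so we set the derivative of that sum to zero and solve for $B$.

$$S(B) = (Y - XB)^{'}(Y - XB)$$

$$\frac{\partial S}{\partial B} = -2X^{'}(Y - XB) = 0$$

$$X^{'}XB = X^{'}Y \quad\Rightarrow\quad B = (X^{'}X)^{-1}X^{'}Y$$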

In [96]:
# Task 3

# INPUTS ***********************************************************************

the_df = df #PANDAS DATAFRAME
ind_var_cols = ['Head'] #LIST OF COLUMN LABELS
dep_var_col = 'Brain' #SINGLE COLUMN LABEL

# CALCULATIONS *****************************************************************

#X MATRIX
ones = np.repeat(1, len(the_df)).reshape(-1,1) 
X = ones.copy() 
for col in ind_var_cols:
  X = np.concatenate([X, np.array(the_df[col]).reshape(-1,1)], axis=1)

#X TRANSPOSE
X_T = np.transpose(X)

#COEFFICIENTS
B = np.matmul( np.matmul( np.linalg.inv(np.matmul(X_T,X)), X_T ), Y)

#RESIDUALS
E = np.array(the_df[dep_var_col]).reshape(-1,1) - np.matmul(X,B)

head = np.array(the_df['Head']).reshape(-1,1)

In [97]:
# Task 3 - Test

assert ones.shape == (237, 1), "Is your ones matrix of the correct shape?"
assert head.shape == (237, 1), "Is your 'head' matrix of the correct shape?"
assert X.shape == (237, 2), "Is your X matrix of the correct shape?"


**Task 4** - Transpose of a matrix

Calculate the transpose of X which is written as $X^T$.  

* Calculate the transpose of `X` and assign it to the variable `X_T`.

Explain what the transpose of a matrix is.

YOUR ANSWER
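
A small example may help when writing the explanation. The sketch below uses a made-up 2 x 3 array (`M` is a hypothetical name, not part of the assignment): transposing swaps rows and columns, so the entry at row i, column j moves to row j, column i.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])      # hypothetical 2 x 3 matrix

M_T = M.T                      # equivalent to np.transpose(M)

print(M.shape, M_T.shape)      # (2, 3) (3, 2)
print(M_T)
```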


In [52]:
# Task 4

# YOUR CODE HERE
X_T = np.transpose(X)  # swap rows and columns: (237, 2) becomes (2, 237)

# Print your transposed matrix shape
print(X_T.shape)

(2, 237)


In [53]:
# Task 4 - Test

assert X_T.shape == (2, 237), "Is your X_T matrix of the correct shape?"


**Task 5** - More matrix multiplication!

Use matrix multiplication to calculate the transpose of X multiplied by X, which is written as $X^TX$.

* Assign the results of your calculation to the variable `X_T_X` which should be a `numpy.ndarray`.
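
For a design matrix made of a column of ones and one predictor column, $X^TX$ has a recognizable structure: the number of rows in the top-left corner, the sum of the predictor in the two off-diagonal entries, and the sum of its squares in the bottom-right. The sketch below checks this on made-up values (the `toy_` names are assumptions for illustration).

```python
import numpy as np

toy_x = np.array([2, 3, 5])                              # hypothetical predictor values
toy_X = np.column_stack([np.ones(3, dtype=int), toy_x])  # ones column + predictor

toy_XTX = toy_X.T @ toy_X
print(toy_XTX)
# [[ 3 10]
#  [10 38]]
# 3 rows, sum of x = 10, sum of x squared = 4 + 9 + 25 = 38
```

For the real data, this means the top-left entry should equal 237, the number of rows.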

In [55]:
# Task 5

# YOUR CODE HERE
X_T_X = np.matmul(X_T,X)

# Look at your matrix
print(X_T_X)

[[       237     861256]
 [    861256 3161283190]]


In [56]:
# Task 5 - Test

assert X_T_X.shape == (2, 2), "Is your X_T_X matrix of the correct shape?"


**Task 6** - Continuing the matrix multiplication

Calculate the inverse of the matrix you calculated in **Task 5**. This would be written as $(X^TX)^{-1}$.  

* Assign your calculation to the variable `X_T_X_inv` which should be a `numpy.ndarray`.

Explain what the inverse of a matrix is. You can use examples from other areas of math to illustrate.

YOUR ANSWER
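
As a possible starting point for the explanation: the inverse plays the role for matrices that the reciprocal plays for numbers. Just as $5 \times \frac{1}{5} = 1$, a square matrix times its inverse gives the identity matrix. The sketch below checks this on a made-up 2 x 2 matrix (`A` is a hypothetical name).

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])           # hypothetical invertible matrix

A_inv = np.linalg.inv(A)

# A times its inverse should be the 2 x 2 identity, up to rounding error
print(np.allclose(A @ A_inv, np.eye(2)))   # True
```

Not every matrix has an inverse. Just as 0 has no reciprocal, a square matrix whose determinant is 0 cannot be inverted.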

In [57]:
# Task 6

# YOUR CODE HERE
X_T_X_inv = np.linalg.inv( np.matmul(X_T, X) )

# Look at your matrix
print(X_T_X_inv)

[[ 4.23638519e-01 -1.15415543e-04]
 [-1.15415543e-04  3.17599920e-08]]


In [58]:
# Task 6 - Test

assert X_T_X_inv.shape == (2, 2), "Is your X_T_X_inv matrix of the correct shape?"


**Task 7** - More and more matrix multiplication

Next, we'll continue with the matrix multiplication. Let's calculate the result of X transpose multiplied by Y which is written $X^TY$.

* Assign your result to the variable `X_T_Y` which should be a `numpy.ndarray`.

In [59]:
# Task 7

# YOUR CODE HERE
X_T_Y = np.matmul(X_T, Y)

# Look at your matrix and the shape
print(X_T_Y)
print("X_T_Y shape: ", X_T_Y.shape)

[[    304041]
 [1113176805]]
X_T_Y shape:  (2, 1)


In [60]:
# Task 7 - Test

assert X_T_Y.shape == (2, 1), "Is your X_T_Y matrix of the correct shape?"


**Task 8** - Calculate the slope and intercept 

Finally! We can calculate the slope and intercept from all of the matrix multiplication we just did.

Use the results from your previous tasks to calculate the values of the slope and intercept using the formula $$ B = (X^{'}X)^{-1}X^{'}Y$$

* Assign the results of your calculation to the variable `B` which should be a `numpy.ndarray`.
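
Forming the inverse explicitly mirrors the formula, but it is not the only way to get `B`. A common alternative (not what this task asks for) is to solve the system $X^{'}XB = X^{'}Y$ directly with `np.linalg.solve`, which avoids computing the inverse. The sketch below compares the two on made-up data; all names and values in it are assumptions.

```python
import numpy as np

# Hypothetical design matrix (ones + one predictor) and response
toy_X = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
toy_Y = np.array([[3.1], [4.9], [7.2], [8.8]])

XTX = toy_X.T @ toy_X
XTY = toy_X.T @ toy_Y

B_inv = np.linalg.inv(XTX) @ XTY        # formula as written
B_solve = np.linalg.solve(XTX, XTY)     # solve X'X B = X'Y directly

print(np.allclose(B_inv, B_solve))      # True
```

Tiny differences between the two results, on the order of floating-point rounding, are expected and do not indicate a mistake.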

In [61]:
# Task 8

# YOUR CODE HERE
B = np.matmul(X_T_X_inv, X_T_Y) #HAD TO RECALCULATE B BECAUSE OF A 10^-12 DIFFERENCE

# Look at your matrix and the shape
print(B)
print("The shape of B is: ", B.shape)

[[3.25573421e+02]
 [2.63429339e-01]]
The shape of B is:  (2, 1)


In [62]:
# Task 8 - Test

assert B.shape == (2, 1), "Is your B matrix of the correct shape?"


**Task 9** - Verify the linear algebra calculations

Now, we're going to verify our linear algebra calculations by fitting a linear model!

We'll use the results of this linear model to compare to our calculated values for the slope and intercept.

* Import the `statsmodels.formula.api` library and fit an `ols` model (enter the model in the format Y ~ X).
* Assign your model to the variable `model`

Compare your `model.params` output to the two values in the `B` matrix you calculated above - this isn't autograded, just a chance to see if the results agree.
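
A numerical check can replace eyeballing the two sets of numbers. The sketch below uses made-up data and NumPy's `np.linalg.lstsq` as the reference fit instead of statsmodels, then compares it with the matrix formula using `np.allclose`. The same kind of comparison should work with `model.params` in place of the reference. All `toy_` names and values are assumptions.

```python
import numpy as np

# Hypothetical data: a straight line plus a small alternating wobble
toy_x = np.arange(10, dtype=float)
toy_noise = np.array([0.1, -0.1] * 5)
toy_y = 2.0 + 0.5 * toy_x + toy_noise

toy_X = np.column_stack([np.ones(10), toy_x])   # ones column + predictor
toy_Y = toy_y.reshape(-1, 1)

# Coefficients from the matrix formula
B_formula = np.linalg.inv(toy_X.T @ toy_X) @ toy_X.T @ toy_Y

# Reference coefficients from NumPy's least-squares solver
B_ref, *_ = np.linalg.lstsq(toy_X, toy_Y, rcond=None)

print(np.allclose(B_formula, B_ref))
```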

In [64]:
# Task 9

# YOUR CODE HERE
from statsmodels.formula.api import ols
model = ols('Brain~Head', data=df).fit()
print(model.params)

Intercept    325.573421
Head           0.263429
dtype: float64


In [None]:
# Task 9 - Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 10** - Apply matrix calculations to the data

Now that we've verified that our matrix calculations result in the same parameters as we fit with our model, let's practice using the matrix calculations with some of the data.

Create a new X matrix that includes columns for both head size and age group.

**Your tasks** (read carefully)

* Create an array for the head size column 'Head' and assign it to the variable `head`; you should have a `numpy.ndarray`
* Create a new column by subtracting `1` from each value in `df['Age']`; call this new column `Age_01`
* Create an array from the 'Age_01' column and assign it to the variable `age`; this should be a `numpy.ndarray`
* Concatenate the following three arrays and assign the result to the variable `X`: `ones`, `head`, `age` (use the `ones` array from Task 3).

In [67]:
# Task 10

df['Age_01'] = df['Age'] - 1
head = np.array(df['Head']).reshape(-1,1)
age = np.array(df['Age_01']).reshape(-1,1)

# INPUTS ***********************************************************************

the_df = df #PANDAS DATAFRAME
ind_var_cols = ['Head', 'Age_01'] #LIST OF COLUMN LABELS
dep_var_col = 'Brain' #SINGLE COLUMN LABEL

# CALCULATIONS *****************************************************************

#X MATRIX
X = np.repeat(1, len(the_df)).reshape(-1,1) 
for col in ind_var_cols:
  X = np.concatenate([X, np.array(the_df[col]).reshape(-1,1)], axis=1)

#X TRANSPOSE
X_T = np.transpose(X)

#COEFFICIENTS
B = np.matmul( np.matmul( np.linalg.inv(np.matmul(X_T,X)), X_T ), Y)

#RESIDUALS
E = np.array(the_df[dep_var_col]).reshape(-1,1) - np.matmul(X,B)

# PRINT STATEMENTS *************************************************************
print("X"); print(X[:10]); print(X.shape); print()


X
[[   1 4512    0]
 [   1 3738    0]
 [   1 4261    0]
 [   1 3777    0]
 [   1 4177    0]
 [   1 3585    0]
 [   1 3785    0]
 [   1 3559    0]
 [   1 3613    0]
 [   1 3982    0]]
(237, 3)



In [68]:
# Task 10 - Test

assert head.shape == (237, 1), "Is your 'head' matrix of the correct shape?"
assert age.shape == (237, 1), "Is your 'age' matrix of the correct shape?"
assert X.shape == (237, 3), "Is your X matrix of the correct shape?"


**Task 11** - Calculate the slope and intercept

Now, calculate the values of the intercept and slope terms for head size and age using the formula $$ B = (X^{'}X)^{-1}X^{'}Y$$

* Break it down into smaller steps, in separate cells if needed
* COMMENT all the steps, so you can troubleshoot if there is a mistake

This task will not be autograded but verify your answers! 

In [69]:
# Task 11
# B = (X'X)^-1 X'Y

# YOUR CODE HERE
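X_T = np.transpose(X)                         # X', shape (3, 237)
X_T_X_inv = np.linalg.inv(np.matmul(X_T, X))  # (X'X)^-1, shape (3, 3)
X_T_Y = np.matmul(X_T, Y)                     # X'Y, shape (3, 1)
B = np.matmul(X_T_X_inv, X_T_Y)               # coefficients, shape (3, 1)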

# Print your final result!
print(B)

[[ 3.47550501e+02]
 [ 2.60438766e-01]
 [-2.07316446e+01]]


**Task 12** - Verify the matrix calculation with a linear model

We're going to fit another OLS model and then confirm our answer with what we calculated for `B` in **Task 11**

* Fit an OLS model as you did above, but this time with two independent variables, head size and age (your model input should be Y ~ X1 + X2). **Make sure to use `Head` for X1 and `Age_01` (not `C(Age_01)`, which would change the order of the output) for X2.**
* Assign the model results to the variable `model2`.

Compare your `model2.params` to the values you determined for `B` in Task 11 - do they match?

In [70]:
# Task 12

# YOUR CODE HERE
model2 = ols('Brain~Head+Age_01', data=df).fit()
print(model2.params)

Intercept    347.550501
Head           0.260439
Age_01       -20.731645
dtype: float64


**Task 12 Test**

In [None]:
# Task 12 - Test
# Hidden tests - you will see the results when you submit to Canvas

## Part B: Cosine Similarity

Use the following information to answer the remaining Tasks in the Module Project.

The songwriting collaboration between John Lennon and Paul McCartney was one of the most productive in music history. Unlike many other partnerships where one individual wrote lyrics and one wrote music, Lennon and McCartney composed both, and it was decided that any song that was written would be credited to both. In the beginning of their relationship, many of their songs were truly collaborative. However, later on, they often worked separately with little to no input from the other.

Because of extensive reporting on the Beatles over the years, it is generally known if a Lennon-McCartney song was a true collaboration, primarily (or totally) written by Lennon, or primarily (or totally) written by McCartney.  

**However, there are several disputed songs where both Lennon and McCartney at times claimed to be the sole (or primary) composer.**

We will now use cosine similarity to determine if *Ticket to Ride* (disputed) is most similar to *From Me to You* (collaborative, not disputed) or *Strawberry Fields* (Lennon, not disputed).

From the Wikipedia article on the Lennon-McCartney Partnership: Lennon said that McCartney's contribution was limited to "the way Ringo played the drums". In Many Years from Now, McCartney said "we sat down and wrote it together ... give him 60 percent of it."
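
For reference, the cosine similarity of two vectors $a$ and $b$ is their dot product divided by the product of their lengths, $\frac{a \cdot b}{\|a\|\,\|b\|}$. For vectors of non-negative counts it ranges from 0 (no shared direction) to 1 (same direction). The sketch below applies it to made-up count vectors; the function name, variable names, and values are assumptions for illustration.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (norm(a) * norm(b))

u = np.array([1, 2, 0])
v = np.array([2, 4, 0])   # same direction as u, twice as long
w = np.array([0, 0, 3])   # shares no nonzero position with u

print(cosine_similarity(u, v))   # very close to 1.0
print(cosine_similarity(u, w))   # 0.0
```

Note that `v` is twice as long as `u` but still has a similarity of about 1: cosine similarity compares direction, not length, which is why songs of different lengths can be compared.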

**Task 13** - Import all the song lyrics

Import the text of "Strawberry Fields", "From Me to You", and "Ticket to Ride". The code has been provided for you - run the cell to see the output. For each set of lyrics, the word frequencies have also been calculated.

**Run the following cell to load all the lyrics.**

In [75]:
# Task 13

# RUN THIS CELL TO LOAD THE LYRICS

# Strawberry Fields - John Lennon (not disputed)

Strawberry_ = "let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever living is easy with eyes closed misunderstanding all you see its getting hard to be someone but it all works out it doesnt matter much to me let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever no one I think is in my tree I mean it must be high or low that is you cant you know tune in but its all right that is I think its not too bad let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever always no sometimes think but you know I know when it's a dream I think er no I mean er yes but its all wrong that is I think I disagree let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever Strawberry Fields forever Strawberry Fields forever"
Strawberry_df = pd.DataFrame({"Words": Strawberry_.lower().split()})
Strawberry_df_freq = pd.DataFrame(pd.crosstab(index=Strawberry_df['Words'],columns='count'))

# Display the first five rows
print(Strawberry_df_freq[0:5]); print()


# From Me to You - Lennon and McCartney (not disputed)

Me_ = "if there's anything that you want if there's anything I can do just call on me and Ill send it along with love from me to you Ive got everything that you want like a heart thats oh so true just call on me and Ill send it along with love from me to you Ive got arms that long to hold you and keep you by my side Ive got lips that long to kiss you and keep you satisfied oh if theres anything that you want if theres anything I can do just call on me and Ill send it along with love from me to you from me to you just call on me and Ill send it along with love from me to you Ive got arms that long to hold you and keep you by my side Ive got lips that long to kiss you and keep you satisfied oh if theres anything that you want if theres anything I can do just call on me and Ill send it along with love from me to you to you to you to you"
Me_df = pd.DataFrame({"Words": Me_.lower().split()})
Me_df_freq = pd.DataFrame(pd.crosstab(index=Me_df['Words'],columns='count'))

# Display the first five rows
print(Me_df_freq[0:5]); print()


# Ticket to Ride (disputed)

Ticket_ = "I think Im gonna be sad I think its today yeah the girl thats driving me mad is going away shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care she said that living with me is bringing her down yeah for she would never be free when I was around shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care I dont know why shes ridin so high she ought to think twice she ought to do right by me before she gets to saying goodbye she ought to think twice she ought to do right by me I think Im gonna be sad I think its today yeah the girl thats driving me mad is going away yeah shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care I dont know why shes ridin so high she ought to think twice she ought to do right by me before she gets to saying goodbye she ought to think twice she ought to do right by me she said that living with me is bringing her down yeah for she would never be free when I was around ah shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care my baby dont care my baby dont care my baby dont care my baby dont care my baby dont care my baby dont care"
Ticket_df = pd.DataFrame({"Words": Ticket_.lower().split()})
Ticket_df_freq = pd.DataFrame(pd.crosstab(index=Ticket_df['Words'],columns='count'))

# Display the first five rows
print(Ticket_df_freq[0:5]); print()

col_0   count
Words        
a           1
about       4
all         4
always      1
and         4

col_0     count
Words          
a             1
along         5
and           9
anything      6
arms          2

col_0   count
Words        
a          12
ah          1
around      2
away        2
baby        6



**Task 14** - Cosine similarity calculations

Now it's your turn to complete some linear algebra calculations! In this task, we need to concatenate the word frequencies for "Ticket to Ride" and "Strawberry Fields" and calculate the cosine similarity. Two linear algebra imports have been provided.

* Concatenate `Strawberry_df_freq` and `Ticket_df_freq` to create a new DataFrame.
* Assign your final result to `cos_sim_1`; your final answer should be a float.

Hint: you might need to use `.fillna()` on your DataFrame
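
The reason for the hint: the two songs do not use the same words, so after the tables are lined up by word, any word missing from one song has no count there. Treating that gap as 0 gives two vectors of equal length. The sketch below shows the same idea with `collections.Counter` from the standard library rather than pandas, on two made-up phrases (all names are assumptions).

```python
from collections import Counter

phrase_a = "let me take you down"
phrase_b = "take me to you"

count_a = Counter(phrase_a.split())
count_b = Counter(phrase_b.split())

# Shared vocabulary, sorted so both vectors use the same word order
vocab = sorted(set(count_a) | set(count_b))

# A Counter returns 0 for a missing word, which plays the role of fillna(0)
vec_a = [count_a[word] for word in vocab]
vec_b = [count_b[word] for word in vocab]

print(vocab)   # ['down', 'let', 'me', 'take', 'to', 'you']
print(vec_a)   # [1, 1, 1, 1, 0, 1]
print(vec_b)   # [0, 0, 1, 1, 1, 1]
```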

In [88]:
# Task 14
# Compare 'Strawberry Fields' to 'Ticket to Ride'

# Imports for linear algebra
from numpy import dot
from numpy.linalg import norm

# INPUT
df_list = [Strawberry_df_freq, Ticket_df_freq]

#CALCULATIONS
new_df = pd.concat(df_list,axis=1).fillna(0)
col_A = np.array(new_df.iloc[:,0])
col_B = np.array(new_df.iloc[:,1])
A_dot_B = np.dot(col_A, col_B)
norm_A = np.linalg.norm(col_A)
norm_B = np.linalg.norm(col_B)

# OUTPUT
cos_sim_1 = A_dot_B / (norm_A * norm_B)

# Print out your result
print(cos_sim_1)

0.324035859004908


In [None]:
# Task 14 - Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 15** - More cosine similarity calculations

Now, we'll compare "Ticket to Ride" and "From Me to You" and calculate the cosine similarity.

* Concatenate `Ticket_df_freq` and `Me_df_freq` to create a new DataFrame.
* Assign your final result to `cos_sim_2`; your final answer should be a float.

Hint: you might need to use `.fillna()` on your DataFrame

In [89]:
# Task 15
# Compare From Me to You to Ticket to Ride

# INPUT
df_list = [Ticket_df_freq, Me_df_freq]

#CALCULATIONS
new_df = pd.concat(df_list,axis=1).fillna(0)
col_A = np.array(new_df.iloc[:,0])
col_B = np.array(new_df.iloc[:,1])
A_dot_B = np.dot(col_A, col_B)
norm_A = np.linalg.norm(col_A)
norm_B = np.linalg.norm(col_B)

# OUTPUT
cos_sim_2 = A_dot_B / (norm_A * norm_B)

# Print out your result
print(cos_sim_2)

0.2882268853551227


In [None]:
# Task 15 - Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 16** - Summary


What is your conclusion about "Ticket to Ride"?  Does it appear most similar to "Strawberry Fields" (Lennon) or "From Me to You" (collaborative)?

Select the best answer. Specify your answer in the next code block using `Answer = `.  For example, if the correct answer is choice B, you'll type `Answer = 'B'`.

A: Ticket to Ride is most similar to Strawberry Fields

B: Ticket to Ride is most similar to From Me to You

C: Ticket to Ride is equally similar to both songs

D: Ticket to Ride was probably not written either by John Lennon or collaboratively.

In [None]:
# Task 16

# YOUR CODE HERE
Answer = 'A'  # A cosine similarity closer to 1 means more similar: 0.324 (Strawberry Fields) > 0.288 (From Me to You)

**Task 16 Test**

In [None]:
# Task 16 - Test
# Hidden tests - you will see the results when you submit to Canvas