## Part A: Introduction

### Statistical significance between head size and brain weight in healthy adult humans

The `Brainhead.csv` dataset provides information on 237 individuals who were subject to post-mortem examination at the Middlesex Hospital in London around the turn of the 20th century. The study authors used cadavers to determine whether brain weight was related to other, more easily measured physiological characteristics such as age, sex, and head size. The end goal was to develop a way to estimate a person’s brain size while they were still alive (as the living aren’t keen on having their brains taken out and weighed).

**We wish to determine if we can improve on our model of the linear relationship between head size and brain weight in healthy human adults.**

Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to the Size of the Head", Biometrika, Vol. 4, pp. 105-123.

### Use the above information to complete the following tasks.

**Task 1** - Load the data

Let's load the data! The URL has been provided as well as the imports for pandas and numpy.

* Load your CSV file into a DataFrame.

In [2]:
import pandas as pd
import numpy as np

data_url = 'https://raw.githubusercontent.com/pixeltests/datasets/main/Brainhead.csv'
df = pd.read_csv(data_url)
df.head()

Unnamed: 0,Gender,Age,Head,Brain
0,1,1,4512,1530
1,1,1,3738,1297
2,1,1,4261,1335
3,1,1,3777,1282
4,1,1,4177,1590


**Task 2**

Store the response variable (brain size) as a matrix called Y.

* Import numpy
* Assign the `Brain` column to a matrix variable called `Y` which should be a `np.array` with dimensions `(237, 1)`
* Print out your matrix shape to check the dimensions.

In [5]:
Y = np.array(df['Brain']).reshape(-1,1)

print('Matrix Y shape', Y.shape)

Matrix Y shape (237, 1)


**Task 3** - Create the X and Y matrices

Store the explanatory variable (head size) as the matrix `X`.  Don't forget to include the **column of ones** for the intercept term (refer to the Guided Project notes if needed).

* Create an `np.array` of ones with the name `ones`; it should be the same length as your DataFrame.
* Create another `np.array` with the name `head`; it should have a shape of `(237, 1)`.
* Combine the `ones` and `head` arrays to create your matrix and call it `X`; it should have a shape of `(237, 2)`

In [10]:
ones = np.repeat(1,len(df)).reshape(-1,1)
head = np.array(df['Head']).reshape(-1,1)
X = np.concatenate((ones,head),axis=1)
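The construction above can be sanity-checked on a few rows. The sketch below is only an illustration: it uses the first three head sizes shown in the `df.head()` output, and the names `head_demo`, `ones_demo`, and `X_demo` are hypothetical. It also uses `np.column_stack`, which is one alternative to `np.concatenate(..., axis=1)`.

```python
import numpy as np

# First three head sizes from the df.head() output (illustration only)
head_demo = np.array([4512, 3738, 4261]).reshape(-1, 1)

# Column of ones for the intercept term, one row per observation
ones_demo = np.ones((len(head_demo), 1), dtype=int)

# Place the two columns side by side to form the design matrix
X_demo = np.column_stack((ones_demo, head_demo))

print(X_demo.shape)  # (3, 2)
```

The first column is all ones and the second holds the predictor, which is the layout the later matrix formula relies on.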

**Task 4** - Transpose of a matrix

Calculate the transpose of X which is written as $X^T$.  

* Calculate the transpose of `X` and assign it to the variable `X_T`.

Explain what the transpose of a matrix is.

YOUR ANSWER = The transpose of a matrix is the matrix obtained by swapping its rows and columns, so an $m \times n$ matrix becomes an $n \times m$ matrix.

In [12]:
X_T = np.transpose(X)
print(X_T.shape)

(2, 237)
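A small example may make the row and column swap easier to see. The matrix `A` below is made up for illustration and is not part of the dataset.

```python
import numpy as np

# A 2 x 3 matrix chosen only for illustration
A = np.array([[1, 2, 3],
              [4, 5, 6]])

# .T is shorthand for np.transpose(A)
A_T = A.T

print(A.shape, A_T.shape)  # (2, 3) (3, 2)

# Element (i, j) of A becomes element (j, i) of the transpose
print(A[0, 2] == A_T[2, 0])  # True
```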


**Task 5** - More matrix multiplication!

Use matrix multiplication to calculate the transpose of X multiplied by X, which is written as $X^TX$.

* Assign the results of your calculation to the variable `X_T_X` which should be a `numpy.ndarray`.

In [13]:
X_T_X = np.matmul(X_T,X)
X_T_X

array([[       237,     861256],
       [    861256, 3161283190]])
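With a column of ones and a single predictor $x$, the entries of $X^TX$ have a recognizable structure: the top-left entry is the number of rows $n$, the off-diagonal entries are $\sum x$, and the bottom-right entry is $\sum x^2$. That is why the top-left value above is 237. The sketch below checks this pattern on the five head sizes from the `df.head()` output; the variable names are hypothetical.

```python
import numpy as np

# Five head sizes from the df.head() output (illustration only)
x = np.array([4512, 3738, 4261, 3777, 4177])

# Design matrix: column of ones plus the predictor
X_small = np.column_stack((np.ones(len(x), dtype=int), x))

# The @ operator is equivalent to np.matmul for 2-D arrays
gram = X_small.T @ X_small

# Expected structure: [[n, sum(x)], [sum(x), sum(x^2)]]
expected = np.array([[len(x), x.sum()],
                     [x.sum(), (x ** 2).sum()]])

print(np.array_equal(gram, expected))  # True
```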

**Task 6** - Continuing the matrix multiplication

Calculate the inverse of the matrix you calculated in **Task 5**. This would be written as $(X^TX)^{-1}$.  

* Assign your calculation to the variable `X_T_X_inv` which should be a `numpy.ndarray`.

Explain what the inverse of a matrix is. You can use examples from other areas of math to illustrate.

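YOUR ANSWER = The inverse of a square matrix $A$ is the matrix $A^{-1}$ for which $AA^{-1} = A^{-1}A = I$, the identity matrix. It plays the same role as the reciprocal of a number: just as $a \cdot \frac{1}{a} = 1$ for any nonzero $a$, multiplying a matrix by its inverse gives the identity. Only square matrices with a nonzero determinant have an inverse.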

In [15]:
X_T_X_inv = np.linalg.inv(X_T_X)
X_T_X_inv

array([[ 4.23638519e-01, -1.15415543e-04],
       [-1.15415543e-04,  3.17599920e-08]])
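One way to check an inverse is to multiply it by the original matrix and confirm that the product is (numerically) the identity. The 2 x 2 matrix below is made up for illustration; the same check could be applied to `X_T_X` and `X_T_X_inv`.

```python
import numpy as np

# Small invertible matrix chosen for illustration (determinant = 5)
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

A_inv = np.linalg.inv(A)

# A times its inverse should equal the identity, up to floating-point error
print(np.allclose(A @ A_inv, np.eye(2)))  # True
print(np.allclose(A_inv @ A, np.eye(2)))  # True
```

`np.allclose` is used instead of `==` because floating-point arithmetic rarely reproduces the identity matrix exactly.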

**Task 7** - More and more matrix multiplication

Next, we'll continue with the matrix multiplication. Let's calculate the result of X transpose multiplied by Y which is written $X^TY$.

* Assign your result to the variable `X_T_Y` which should be a `numpy.ndarray`.

In [17]:
X_T_Y = np.matmul(X_T,Y)
X_T_Y

array([[    304041],
       [1113176805]])

**Task 8** - Calculate the slope and intercept

Finally! We can calculate the slope and intercept from all of the matrix multiplication we just did.

Use the results from your previous tasks to calculate the values of the slope and intercept using the formula $$ B = (X^TX)^{-1}X^TY$$

* Assign the results of your calculation to the variable `B` which should be a `numpy.ndarray`.

In [18]:
B = np.matmul(X_T_X_inv,X_T_Y)
B

array([[3.25573421e+02],
       [2.63429339e-01]])
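The explicit inverse works well for a small problem like this one. As a hedged aside, the same coefficients can also be obtained by solving the normal equations $(X^TX)B = X^TY$ directly with `np.linalg.solve`, or with `np.linalg.lstsq`, which avoids forming the inverse and is generally the more numerically robust choice. The sketch below uses made-up data that lie exactly on the line $y = 2 + 3x$, so the expected intercept and slope are known in advance; all variable names are hypothetical.

```python
import numpy as np

# Toy data on an exact line y = 2 + 3x (not from Brainhead.csv)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = (2.0 + 3.0 * x).reshape(-1, 1)

# Design matrix with a column of ones for the intercept
X_toy = np.column_stack((np.ones_like(x), x))

# 1) Explicit inverse, as in the tasks above
B_inv = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y

# 2) Solve the normal equations without forming the inverse
B_solve = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y)

# 3) Least-squares solver applied to X and y directly
B_lstsq, *_ = np.linalg.lstsq(X_toy, y, rcond=None)

# All three approaches should agree on intercept 2 and slope 3
print(np.allclose(B_inv, B_solve), np.allclose(B_solve, B_lstsq))
```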

**Task 9** - Verify the linear algebra calculations

Now, we're going to verify our linear algebra calculations by fitting a linear model!

We'll use the results of this linear model to compare to our calculated values for the slope and intercept.

* Import the `statsmodels.formula.api` library and fit an `ols` model (enter the model in the format Y ~ X).
* Assign your model to the variable `model`

Compare your `model.params` output to the two values in the `B` matrix you calculated above - this isn't autograded, just a chance to see if the results agree.

In [21]:
from statsmodels.formula.api import ols
equation = 'Brain~Head'
model = ols(equation,data=df).fit()
print(model.params)
print(model.summary())

Intercept    325.573421
Head           0.263429
dtype: float64
                            OLS Regression Results                            
Dep. Variable:                  Brain   R-squared:                       0.639
Model:                            OLS   Adj. R-squared:                  0.638
Method:                 Least Squares   F-statistic:                     416.5
Date:                Mon, 15 Jan 2024   Prob (F-statistic):           5.96e-54
Time:                        14:49:04   Log-Likelihood:                -1350.3
No. Observations:                 237   AIC:                             2705.
Df Residuals:                     235   BIC:                             2711.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------

**Task 10** - Apply matrix calculations to the data

Now that we've verified that our matrix calculations result in the same parameters as we fit with our model, let's practice using the matrix calculations with some of the data.

Create a new X matrix that includes columns for both head size and age group.

**Your tasks** (read carefully)

* Create an array for the head size column 'Head' and assign it to the variable `head`; you should have a `numpy.ndarray`
* Create a new column by subtracting `1` from each value in `df['Age']`; call this new column `Age_01`
* Create an array from the 'Age_01' column and assign it to the variable `age`; this should be a `numpy.ndarray`
* Concatenate the following three arrays and assign the result to the variable `X`: `ones`, `head`, `age` (use the `ones` array from Task 3).

In [37]:
head = np.array(df['Head']).reshape(-1,1)

df['Age_01'] = df['Age']-1
age = np.array(df['Age_01']).reshape(-1,1)

X1 = np.concatenate((ones,head,age),axis=1)
print(X1[:5])

[[   1 4512    0]
 [   1 3738    0]
 [   1 4261    0]
 [   1 3777    0]
 [   1 4177    0]]
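The recoding and concatenation steps can be sketched on a tiny DataFrame. The example assumes, as the subtraction step suggests, that `Age` is coded 1 and 2; the third `Age` value and the names `demo` and `X_demo3` are hypothetical and chosen only to show both codes.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for df; the Age value of 2 is hypothetical
demo = pd.DataFrame({"Head": [4512, 3738, 4261], "Age": [1, 1, 2]})

# Recode the age group from 1/2 to 0/1
demo["Age_01"] = demo["Age"] - 1

# Column of ones, then head size, then the recoded age group
X_demo3 = np.column_stack((
    np.ones(len(demo), dtype=int),
    demo[["Head", "Age_01"]].to_numpy(),
))

print(X_demo3.shape)  # (3, 3)
```

The column order matters: the coefficients returned by the matrix formula come back in the same order as the columns of the design matrix.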


**Task 11** - Calculate the slope and intercept

Now, calculate the values of the intercept and slope terms for head size and age using the formula $$ B = (X^TX)^{-1}X^TY$$

* Break it down into smaller steps, in separate cells if needed
* COMMENT all the steps, so you can troubleshoot if there is a mistake.


In [40]:
def get_B(X1,Y):
  # Transpose of the design matrix
  X1_T = np.transpose(X1)
  # X^T X
  X1_T_X1 = np.matmul(X1_T,X1)
  # Inverse of X^T X
  X1_T_X1_inv = np.linalg.inv(X1_T_X1)
  # X^T Y
  X1_T_Y = np.matmul(X1_T,Y)
  # B = (X^T X)^{-1} X^T Y
  B = np.matmul(X1_T_X1_inv,X1_T_Y)
  return B

In [41]:
# Recompute B from the three-column matrix built in Task 10
B = get_B(X1,Y)
B

**Task 12** - Verify the matrix calculation with a linear model

We're going to fit another OLS model and then confirm our answer with what we calculated for `B` in **Task 11**

* Fit an OLS model as you did above, but this time use the two independent variables head size and age (your model input should be Y ~ X1 + X2). **Make sure to use `Head` for X1 and `Age_01` (not `C(Age_01)`, which would change the order of the output) for X2.**
* Assign the model results to the variable `model2`.

Compare your `model2.params` to the values you determined for `B` in Task 11 - do they match?

In [36]:
from statsmodels.formula.api import ols

model2 = ols('Brain~Head + Age_01',data = df).fit()
print(model2.params)

Intercept    347.550501
Head           0.260439
Age_01       -20.731645
dtype: float64


## Part B: Cosine Similarity

Use the following information to answer the remaining Tasks in the Module Project.

The songwriting collaboration between John Lennon and Paul McCartney was one of the most productive in music history.  Unlike many other partnerships where one individual wrote lyrics and one wrote music, Lennon and McCartney composed both, and it was decided that any song that was written would be credited to both.  In the beginning of their relationship, many of their songs were truly collaborative.  However, later on, they often worked separately with little to no input from the other.

Because of extensive reporting on the Beatles over the years, it is generally known if a Lennon-McCartney song was a true collaboration, primarily (or totally) written by Lennon, or primarily (or totally) written by McCartney.  

**However, there are several disputed songs where both Lennon and McCartney at times claimed to be the sole (or primary) composer.**

We will now use cosine similarity to determine if *Ticket to Ride* (disputed) is most similar to *From Me to You* (collaborative, not disputed) or *Strawberry Fields* (Lennon, not disputed).

From the Wikipedia article on the Lennon-McCartney Partnership: Lennon said that McCartney's contribution was limited to "the way Ringo played the drums". In *Many Years from Now*, McCartney said "we sat down and wrote it together ... give him 60 percent of it."

**Task 13** - Import all the song lyrics

Import the text of "Strawberry Fields", "From Me to You", and "Ticket to Ride". The code has been provided for you - run the cell to see the output. For each set of lyrics, the word frequencies have also been calculated.

**Run the following cell to load all the lyrics.**

In [43]:
Strawberry_ = "let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever living is easy with eyes closed misunderstanding all you see its getting hard to be someone but it all works out it doesnt matter much to me let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever no one I think is in my tree I mean it must be high or low that is you cant you know tune in but its all right that is I think its not too bad let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever always no sometimes think but you know I know when it's a dream I think er no I mean er yes but its all wrong that is I think I disagree let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever Strawberry Fields forever Strawberry Fields forever"
Strawberry_df = pd.DataFrame({"Words": Strawberry_.split()})
Strawberry_df_freq = pd.DataFrame(pd.crosstab(index=Strawberry_df['Words'],columns='count'))



Me_ = "if there's anything that you want if there's anything I can do just call on me and Ill send it along with love from me to you Ive got everything that you want like a heart thats oh so true just call on me and Ill send it along with love from me to you Ive got arms that long to hold you and keep you by my side Ive got lips that long to kiss you and keep you satisfied oh if theres anything that you want if theres anything I can do just call on me and Ill send it along with love from me to you from me to you just call on me and Ill send it along with love from me to you Ive got arms that long to hold you and keep you by my side Ive got lips that long to kiss you and keep you satisfied oh if theres anything that you want if theres anything I can do just call on me and Ill send it along with love from me to you to you to you to you"
Me_df = pd.DataFrame({"Words": Me_.split()})
Me_df_freq = pd.DataFrame(pd.crosstab(index=Me_df['Words'],columns='count'))



Ticket_ = "I think Im gonna be sad I think its today yeah the girl thats driving me mad is going away shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care she said that living with me is bringing her down yeah for she would never be free when I was around shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care I dont know why shes ridin so high she ought to think twice she ought to do right by me before she gets to saying goodbye she ought to think twice she ought to do right by me I think Im gonna be sad I think its today yeah the girl thats driving me mad is going away yeah shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care I dont know why shes ridin so high she ought to think twice she ought to do right by me before she gets to saying goodbye she ought to think twice she ought to do right by me she said that living with me is bringing her down yeah for she would never be free when I was around ah shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care my baby dont care my baby dont care my baby dont care my baby dont care my baby dont care my baby dont care"
Ticket_df = pd.DataFrame({"Words": Ticket_.split()})
Ticket_df_freq = pd.DataFrame(pd.crosstab(index=Ticket_df['Words'],columns='count'))

**Task 14** - Cosine similarity calculations

Now it's your turn to complete some linear algebra calculations! In this task, we need to concatenate "Ticket to Ride" and "Strawberry Fields" and calculate the cosine similarity. Two linear algebra imports have been provided.

* Concatenate `Strawberry_df_freq` and `Ticket_df_freq` to create a new DataFrame.
* Assign your final result to `cos_sin_1`; your final answer should be a float.



In [44]:
Strawberry_df_freq.isnull().sum().sum()


0

In [45]:
Ticket_df_freq.isnull().sum().sum()


0

In [47]:
from numpy import dot
from numpy.linalg import norm
import pandas as pd



In [62]:
dfa = [Strawberry_df_freq,Ticket_df_freq]
all_words = pd.concat(dfa,axis=1)
all_words = all_words.fillna(0)
all_words.columns = ['Strawberry Fields','Ticket to Ride']
all_words[0:50]

Words,Strawberry Fields,Ticket to Ride
Fields,10.0,0.0
I,8.0,8.0
Im,4.0,2.0
Strawberry,10.0,0.0
a,1.0,12.0
about,4.0,0.0
all,4.0,0.0
always,1.0,0.0
and,4.0,0.0
bad,1.0,0.0


In [68]:
dfa = [Strawberry_df_freq,Ticket_df_freq]
all_words = pd.concat(dfa,axis=1)
all_words = all_words.fillna(0)
all_words.columns = ['Strawberry Fields','Ticket to Ride']
all_words[0:50]
# Parentheses around the product of norms: cos = (a . b) / (||a|| * ||b||)
cos_sin_1 = dot(all_words['Strawberry Fields'],all_words['Ticket to Ride'])/(norm(all_words['Strawberry Fields'])*norm(all_words['Ticket to Ride']))

In [66]:
print(cos_sin_1)

0.324035859004908
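Cosine similarity is the dot product of two vectors divided by the product of their lengths, $\cos\theta = \frac{a \cdot b}{\|a\|\,\|b\|}$. The sketch below wraps that formula in a small helper and checks it on made-up vectors whose answers are known; the function name `cosine_similarity` is hypothetical. Note that the whole denominator must be in parentheses, since `x / y * z` in Python evaluates as `(x / y) * z`.

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(u, v):
    """Return the cosine of the angle between 1-D vectors u and v."""
    return float(dot(u, v) / (norm(u) * norm(v)))

# Made-up vectors for illustration
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

print(cosine_similarity(a, a))  # 1.0 (same direction)
print(cosine_similarity(a, b))  # 0.0 (orthogonal)
```

For word-count vectors, which have no negative entries, the value always falls between 0 and 1, with larger values indicating more shared vocabulary.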


**Task 15** - More cosine similarity calculations

Now, we'll compare "Ticket to Ride" and "From Me to You" and calculate the cosine similarity.

* Concatenate `Ticket_df_freq` and `Me_df_freq` to create a new DataFrame.
* Assign your final result to `cos_sin_2`; your final answer should be a float.
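The steps for this task mirror Task 14: build word counts for each text, align them on a shared vocabulary, fill missing words with zero, and apply the cosine formula. The sketch below shows one possible way to do this on two made-up strings rather than the real lyrics; the helper name `text_cosine_similarity` is hypothetical, and it uses `value_counts` in place of the `pd.crosstab` call from Task 13.

```python
import pandas as pd
from numpy import dot
from numpy.linalg import norm

def text_cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    # Word frequencies for each text
    freq_a = pd.Series(text_a.split()).value_counts()
    freq_b = pd.Series(text_b.split()).value_counts()

    # Align on the union of words; words missing from one text count as 0
    words = pd.concat([freq_a, freq_b], axis=1, keys=["a", "b"]).fillna(0)

    # Dot product divided by the product of the vector lengths
    return float(dot(words["a"], words["b"]) / (norm(words["a"]) * norm(words["b"])))

# Made-up texts for illustration (not song lyrics)
demo_score = text_cosine_similarity("red red blue green", "red blue yellow")
print(round(demo_score, 4))
```

Applying the same helper to `Ticket_` and `Me_` would be one way to obtain `cos_sin_2`; the result is not shown here because it depends on running the notebook.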


**Task 16** - Summary


What is your conclusion about "Ticket to Ride"?  Does it appear most similar to "Strawberry Fields" (Lennon) or "From Me to You" (collaborative)?

Select the best answer. Specify your answer in the next code block using `Answer = `.  For example, if the correct answer is choice B, you'll type `Answer = 'B'`.

A: Ticket to Ride is most similar to Strawberry Fields

B: Ticket to Ride is most similar to From Me to You

C: Ticket to Ride is equally similar to both songs

D: Ticket to Ride was probably not written either by John Lennon or collaboratively.