# OPTIONAL READING

Hello! You've opened a notebook that's **OPTIONAL**. You don't have to read it. That's because it's **optional.** You can close it if you want, or you can read it if you want. Either one is fine. But please don't ask in OH or on Ed if you need to do anything with this notebook. You don't. Because it's **optional.** But hopefully you find it interesting :) 

## Walkthrough of Building a Model

For this **optional** reading, we will take you on a brief walkthrough of how to create models of your data that can be used to understand and make predictions about your dataset. The following cells have code that is provided for you. You can't run some of them because they rely on completed versions of the code from the main notebook. If you want to run them, copy that completed code over first. Otherwise, you can investigate the pre-computed outputs as you follow along. There is also a discussion of the model's underlying features that you are welcome to read.


### What Is My Kind of Song?

Our aim will be to build a model that is capable of recognizing whether or not some **new** song would fit into my library. In order to do that, it will be important to have examples of songs that are not currently among my Liked Songs. These come from an anonymous friend (let's call her "E") and are found in the file called `other_songs.csv`.

The main hypothesis that drives the model I'm building is that the songs that I have saved to my library might tend to fit a certain profile. Maybe my taste is, for example, happy, high-energy, acoustic-sounding music. By introducing songs that come from another library—from another person with a different taste for music—we might introduce some examples that have notably different musical profiles. This will give our model a notion of what music outside of my personal taste might look like so that it can learn to identify examples of it.

The first step for building our model will be to add a new column to the underlying data. We'll call this column `"harry"` and have it represent whether or not a song came from my library. Songs in my library will take the value `1` for this attribute and songs from E's library will take the value `0`. Then, we'll join the two libraries together using `pd.concat()`.

In [None]:
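# read_songs() and pd come from the main notebook
harry = read_songs('sharry_songs.csv')
harry["harry"] = 1  # label: song is from my library
not_harry = read_songs('other_songs.csv')
not_harry["harry"] = 0  # label: song is from E's library

# Stack the two libraries into one DataFrame and peek at a random sample
combined = pd.concat([harry, not_harry])
combined.sample(15)[['Track Name', 'Artist Name(s)', 'harry']]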

Unnamed: 0,Track Name,Artist Name(s),harry
1519,Denial,Mannequin Pussy,1
2494,Breaks,The Black Keys,1
1374,10 Minutes 10 Years,Tennis,1
939,"Freddy My Love - From ""Grease Live!"" Music Fro...","Keke Palmer, Kether Donohue, Vanessa Hudgens, ...",0
2633,Effect and Cause,The White Stripes,1
125,Friend of Nothing - Acoustic,Together Pangea,1
4075,Down Rodeo,Rage Against The Machine,1
745,"Schubert: Auf dem Wasser zu singen, Op. 72, D....","Franz Schubert, Barbara Bonney, Geoffrey Parsons",0
249,Olivia,One Direction,0
1988,Submarine,Silicon,1


### Building a Model for Taste

We can use this combined DataFrame as the source for a **logistic regression model** implemented in the `statsmodels` library that we import below. A logistic regression model is a specific example of a **regression**, which is a statistical technique that builds a mathematical description for how different features of your data influence some outcome. In building a logistic regression model, we are constructing a mathematical object that explains how different features of the songs can predict whether a song is "one of mine" or "one of E's". 

Whenever we build a logistic regression model, we have to let `statsmodels` know which column we think can be predicted using a combination of the others. 

```python
model = smf.logit("harry ~ Energy + Danceability + Speechiness + Acousticness + Instrumentalness + Liveness + Valence", data=combined)
```

The big, long string passed in as an argument to `smf.logit()` tells `statsmodels` that we would like to see how well the `harry` value of a song can be predicted by a combination of the `Energy`, `Danceability`, `Speechiness`, `Acousticness`, `Instrumentalness`, `Liveness`, and `Valence` of that song. 

When we specify the model, all we have to do is pick the features that we expect to be useful in predicting which person's library the song belongs to. We don't provide any explicit information about which way each feature goes, though; instead, we ask Python to **fit** the model to the data we provided it. This is the (complex) process of taking each song in the dataset, looking at whether it's a "Harry song" or an "E song", and then learning how much each of the specified features actually matters for songs of either category. After the model is fitted to the data, we can inspect it to determine the influence of each feature. `result.summary()` returns a table with this information.
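To get a rough feel for what "fitting" means, here is a minimal sketch that fits a one-feature logistic model to a tiny made-up dataset using plain gradient ascent. Everything in it is invented for illustration (the `fit_logistic` helper, the toy `energy` and `is_mine` values, the learning rate), and it is not meant to mirror how `statsmodels` carries out the fit internally.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.5, steps=5000):
    """Fit p = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_b0 = grad_b1 = 0.0
        for x, y in zip(xs, ys):
            # Actual label minus the probability the model currently predicts
            error = y - sigmoid(b0 + b1 * x)
            grad_b0 += error
            grad_b1 += error * x
        # Nudge both coefficients in the direction that makes the data more likely
        b0 += learning_rate * grad_b0 / n
        b1 += learning_rate * grad_b1 / n
    return b0, b1

# Made-up toy data (an assumption, not from the real libraries):
# one feature value per song, and 1 if the song is "mine".
energy = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
is_mine = [0, 0, 1, 0, 0, 1, 1, 1, 1]

b0, b1 = fit_logistic(energy, is_mine)
print(b0, b1)
```

A positive `b1` in this toy setup would mean that higher feature values make a song more likely to be labeled as mine, which is the same way the signs of the real model's coefficients get read.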

In [None]:
import statsmodels.formula.api as smf
import numpy as np
import pickle

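# Specify the model, then fit it to the combined data
model = smf.logit("harry ~ Energy + Danceability + Speechiness + Acousticness + Instrumentalness + Liveness + Valence", data=combined)
result = model.fit()
# Save the fitted result so it can be reloaded later without refitting
with open("model.pkl", "wb") as f:
    pickle.dump(result, f)
result.summary()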

Optimization terminated successfully.
         Current function value: 0.447949
         Iterations 6


0,1,2,3
Dep. Variable:,harry,No. Observations:,5835.0
Model:,Logit,Df Residuals:,5827.0
Method:,MLE,Df Model:,7.0
Date:,"Fri, 02 Aug 2024",Pseudo R-squ.:,0.1712
Time:,14:45:47,Log-Likelihood:,-2613.8
converged:,True,LL-Null:,-3153.8
Covariance Type:,nonrobust,LLR p-value:,5.789e-229

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,2.1849,0.213,10.273,0.000,1.768,2.602
Energy,2.4853,0.226,10.997,0.000,2.042,2.928
Danceability,-3.5034,0.255,-13.763,0.000,-4.002,-3.005
Speechiness,-4.1893,0.438,-9.574,0.000,-5.047,-3.332
Acousticness,-0.8345,0.150,-5.580,0.000,-1.128,-0.541
Instrumentalness,1.6393,0.144,11.392,0.000,1.357,1.921
Liveness,0.3013,0.236,1.278,0.201,-0.161,0.763
Valence,-0.7190,0.179,-4.024,0.000,-1.069,-0.369


OK. That is extremely intimidating and contains a bunch of stuff that requires much more knowledge of statistics to use and explain. But there is one useful gem in here: the `coef` column. These values are the *coefficients* of the logistic regression model, which indicate whether an increase in each feature pushes the predicted probability that a song belongs to my library rather than E's up (positive coefficient) or down (negative coefficient). First, in order to interpret the coefficients in a slightly more human way, we can convert them to **odds-ratios**, which are quantities that tell us the factor by which the odds of the outcome are multiplied for a unit change (change of `1`) in that feature, with the other features held fixed. (Odds are explained further down.)
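In symbols, and assuming the standard form of a logistic regression model, the relationship looks like this. The notation is introduced here for illustration and does not come from the notebook's code: $p$ is the probability that a song is mine, $x_1, \dots, x_7$ are the seven feature values, and $\beta_0, \dots, \beta_7$ are the entries of the `coef` column.

$$
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_7 x_7
$$

Increasing one feature $x_i$ by $1$ while holding the others fixed adds $\beta_i$ to the left-hand side, which multiplies the odds $p/(1-p)$ by $e^{\beta_i}$. That factor is the odds-ratio for that feature.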

In [None]:
coefs = pd.DataFrame({
    "coef": result.params.values,
    "odds ratio": np.exp(result.params.values),
    "name": result.params.index
})
coefs

Unnamed: 0,coef,odds ratio,name
0,2.184899,8.88975,Intercept
1,2.48529,12.004601,Energy
2,-3.503449,0.030093,Danceability
3,-4.189306,0.015157,Speechiness
4,-0.834488,0.434097,Acousticness
5,1.639266,5.151388,Instrumentalness
6,0.301303,1.351619,Liveness
7,-0.71899,0.487244,Valence


In order to interpret the meaning of the odds ratio, let's imagine that we have a song with the following profile:

|Feature|Value|
|-------|-----|
|Energy|0|
|Danceability|0.8|
|Speechiness|0.3|
|Acousticness|0.2|
|Instrumentalness|0.1|
|Liveness|0.8|
|Valence|0.4|

If we use the `predict()` method of the model that we've built, we can observe that the probability that this hypothetical song is a "Harry song" is about $12.7\%$.

In [None]:
sample_df = pd.DataFrame([
    {'Energy': 0,
     'Danceability': 0.8,
     'Speechiness': 0.3,
     'Acousticness': 0.2,
     'Instrumentalness': 0.1,
     'Liveness': 0.8,
     'Valence': 0.4}
])
result.predict(sample_df)

0    0.127397
dtype: float64
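As a sanity check, the same number can be approximated by hand. The sketch below assumes that `predict()` applies the standard logistic function to the weighted sum of the features. The coefficients are copied (rounded) from the table above, and the `predict_probability` helper is a made-up name for illustration.

```python
import math

# Coefficients copied from the `coef` column of the table above (rounded).
coefficients = {
    "Intercept": 2.184899,
    "Energy": 2.48529,
    "Danceability": -3.503449,
    "Speechiness": -4.189306,
    "Acousticness": -0.834488,
    "Instrumentalness": 1.639266,
    "Liveness": 0.301303,
    "Valence": -0.71899,
}

# The hypothetical song profile described above.
song = {
    "Energy": 0,
    "Danceability": 0.8,
    "Speechiness": 0.3,
    "Acousticness": 0.2,
    "Instrumentalness": 0.1,
    "Liveness": 0.8,
    "Valence": 0.4,
}

def predict_probability(coefs, features):
    """Weighted sum of the features, passed through the logistic function."""
    log_odds = coefs["Intercept"] + sum(
        coefs[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

probability = predict_probability(coefficients, song)
print(probability)
```

Because the coefficients are rounded, the result should land very close to, but not necessarily exactly on, the value that `result.predict()` reports.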

The probability of an event is a representation of the likelihood of that event. When expressed as a percentage, it represents how many times out of a hundred that event will be expected to happen. In the context of our prediction model, the percentage refers to the number of songs with exactly this profile that I would be expected to like out of a hundred.

Odds are another way of representing an event's likelihood. If something happens with a probability of $x\%$, then out of $100$ tries, we would expect it to happen $x$ times and therefore for it **not** to happen $100 - x$ times. The odds are expressed as the quotient of the number of times it would be expected to happen and the number of times it would not be expected to happen. To give it some specific numbers, if something has a probability of $80\%$, then out of $100$ tries, it could be expected to happen $80$ times and not happen $20$ times. The odds are therefore expressed as `80:20`, or `4:1` when simplified. *"For every four successes, we'd expect one failure."* The model predicts the song above to have a $12.7\%$ chance of being from my library instead of E's, or `12.7:87.3` odds, which is very nearly `1:7` (more precisely, `1:6.849` when computed from the unrounded prediction `0.127397`).
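The conversion between probabilities and odds described above can be sketched in a few lines. The helper names `probability_to_odds` and `odds_to_probability` are made up for this illustration and are not part of the notebook.

```python
def probability_to_odds(p):
    """Expected successes per expected failure, e.g. 0.8 -> about 4 (i.e. 4:1)."""
    return p / (1.0 - p)

def odds_to_probability(odds):
    """Inverse conversion, e.g. odds of 4 (4:1) -> about 0.8."""
    return odds / (1.0 + odds)

# The 80% example from the text.
print(probability_to_odds(0.80))

# The predicted song: how many expected failures per expected success?
failures_per_success = 1.0 / probability_to_odds(0.127397)
print(failures_per_success)
```

The second printed value is the "7" in the rough `1:7` odds quoted above, before rounding.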

The **odds-ratio**, then, is a quantity that tells us the factor by which the odds are multiplied when the value of a certain variable increases by one and everything else stays the same. The song above has ~ `1:7` odds of being in my library with the current features. What would the likelihood of the song being mine be if the Energy value was at `1` instead of `0`?



In [None]:
with_high_energy = pd.DataFrame([
    {'Energy': 1,
     'Danceability': 0.8,
     'Speechiness': 0.3,
     'Acousticness': 0.2,
     'Instrumentalness': 0.1,
     'Liveness': 0.8,
     'Valence': 0.4}
])
result.predict(with_high_energy)

0    0.63671
dtype: float64

The probability goes all the way up to $63.7\%$! That means the odds are now `63.7:36.3`, which is roughly equal to `1.75:1` (or `7:4`, depending on how you want to express that ratio). If we recall from before that the odds-ratio for Energy in our model is `12.0`, then we would expect that increasing Energy by `1` (going from a totally lethargic song to an absolutely frenetic one) should increase the odds by a factor of `12`.

||Rough Value|Exact Value|
|----|----|----|
|Old Odds|`1:7`|`1:6.849` $\approx 0.1459965$|
|New Odds|`7:4`|`1.753:1` $\approx 1.7526219$|

Dividing the new odds by the old gives us a quotient of about `12.00`, which matches the odds-ratio for Energy listed earlier (`12.0046`) up to rounding in the printed probabilities.
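The arithmetic behind that comparison can be checked with a short sketch. It uses only the two predicted probabilities and the Energy coefficient printed earlier. The variable and helper names are made up for illustration.

```python
import math

def probability_to_odds(p):
    """Expected successes per expected failure."""
    return p / (1.0 - p)

old_probability = 0.127397  # predicted with Energy = 0
new_probability = 0.63671   # predicted with Energy = 1
energy_coef = 2.48529       # Energy row of the coefficient table

old_odds = probability_to_odds(old_probability)
new_odds = probability_to_odds(new_probability)

# Ratio observed from the two predictions vs. the ratio the model implies.
observed_ratio = new_odds / old_odds
model_odds_ratio = math.exp(energy_coef)
print(observed_ratio, model_odds_ratio)

# Going the other way: scale the old odds by the odds-ratio,
# then convert the result back into a probability.
predicted_new_odds = old_odds * model_odds_ratio
predicted_new_probability = predicted_new_odds / (1.0 + predicted_new_odds)
print(predicted_new_probability)
```

The two ratios agree to several decimal places, and the small gap comes from the printed probabilities being rounded.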

### Why Talk About Odds?

Logistic regression models are useful for two main reasons: 1) they allow for automatic classification of different datapoints into different categories and 2) they explain how different features ultimately affect the categorization. Sometimes it's enough to build a model that can do classification for you: maybe you want to build a "priority" filter for your email so that certain kinds of emails are displayed more prominently in your inbox. In this case, it's not all that important to understand what features of an email (the sender, the time it was sent, the subject line, the number of images contained in it) influence this decision, since you can still see all of the emails you receive anyway. Other models for more serious tasks, like those currently used for approving and denying loans or weighing an incarcerated person's risk of recidivism, raise serious questions about how they make their decisions. Moreover, from an investigative standpoint, it is often vital to look beyond the outcomes and try to understand the effects that different "variables" have on the systems we use. How does a person's age, gender, race, level of education, or even zip code get used to make decisions about whether they can take out a mortgage or be released from prison on parole?

The odds-ratio in a logistic regression model is useful for exactly these purposes. Given the model, we can analyze the characteristics of a data point that lead to the decision being made.