# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since those too live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

## Assignment - real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and try to use to classify genre of tracks. To get you started:

### First I'll download all the data and set a few parameters for Pandas. 

In [2]:
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)  # Unlimited columns

In [3]:
#!wget https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
#!unzip fma_metadata.zip


### Now I'll open the Tracks.csv and create my dataframe.


In [4]:
# Reading this CSV but the names of the columns look all messed up.
names = pd.read_csv('fma_metadata/tracks.csv')
names.head()

Unnamed: 0.1,Unnamed: 0,album,album.1,album.2,album.3,album.4,album.5,album.6,album.7,album.8,album.9,album.10,album.11,album.12,artist,artist.1,artist.2,artist.3,artist.4,artist.5,artist.6,artist.7,artist.8,artist.9,artist.10,artist.11,artist.12,artist.13,artist.14,artist.15,artist.16,set,set.1,track,track.1,track.2,track.3,track.4,track.5,track.6,track.7,track.8,track.9,track.10,track.11,track.12,track.13,track.14,track.15,track.16,track.17,track.18,track.19
0,,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
1,track_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
3,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
4,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World


Ooh, those headers look all messed up. I'll replace them with a cleaner list.

### Replace headers with tidy list of good names.

In [5]:
# So I need to make a list of the good names, and drop all the rows that aren't good. 
# I'm going to save these names and use them as the feature names.
cols = names[0:1].values.tolist()
cols = cols[0]
cols[0] = 'track_id'
print(f"The fixed list of names now: {cols[0:3]} ...\n")

# Instead of renaming these I'm going to reimport my csv and drop the top rows.
# This way it will import the datatypes for each column
tracks = pd.read_csv('fma_metadata/tracks.csv', skiprows=[0,1,2], header=None, names=cols)
tracks.head(2)

The fixed list of names now: ['track_id', 'comments', 'date_created'] ...



Unnamed: 0,track_id,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
0,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
1,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave


### Inspect the data a bit more....

In [6]:
print(tracks.shape)


(106574, 53)


### Here I'm going to add in the Features.csv field. 

These features are from the features extraction from the audio (used to create features.csv).
https://github.com/mdeff/fma

In [7]:
#features = pd.read_csv('fma_metadata/features.csv', skiprows=[1,2,3])
#features = features.rename(columns={'feature': 'track_id'})

In [8]:
# Here I printed out a longer head table to make sure that the rows allign.
# I also print out the shape to make sure it looks right for mergeing the two dataframes. 
#print(features.shape)
#features.head(3)

### Merge Tracks and Features into one dataset

In [9]:
#df = pd.merge(tracks, features, on=['track_id', 'track_id'])
df=tracks

In [10]:
df.head()

Unnamed: 0,track_id,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
0,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
1,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
2,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
3,10,0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4,6,,47632,,[],Constant Hitmaker,2,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
4,20,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:48:56,2008-01-01 00:00:00,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level


In [11]:
topgenre_unique = df['genre_top'].value_counts()
print(topgenre_unique)
#Country music? Not in my classifier. Hahaha
replace_these = ["Jazz", "Old-Time / Historic", "Spoken", "Soul-RnB", "Blues", "Country","Easy Listening"]
for i in replace_these:
    df["genre_top"] = df['genre_top'].replace(i,"other")

topgenre_unique = df['genre_top'].value_counts()
print(topgenre_unique)

Rock                   14182
Experimental           10608
Electronic              9372
Hip-Hop                 3552
Folk                    2803
Pop                     2332
Instrumental            2079
International           1389
Classical               1230
Jazz                     571
Old-Time / Historic      554
Spoken                   423
Country                  194
Soul-RnB                 175
Blues                    110
Easy Listening            24
Name: genre_top, dtype: int64
Rock             14182
Experimental     10608
Electronic        9372
Hip-Hop           3552
Folk              2803
Pop               2332
Instrumental      2079
other             2051
International     1389
Classical         1230
Name: genre_top, dtype: int64


### Now a bit of Feature Engineering on the merged DF


In [12]:
# Which columns are missing data
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(25)

Unnamed: 0,Total,Percent
lyricist,106263,0.997082
publisher,105311,0.988149
information.1,104225,0.977959
composer,102904,0.965564
active_year_end,101199,0.949566
wikipedia_page,100993,0.947633
date_recorded,100415,0.942209
related_projects,93422,0.876593
associated_labels,92303,0.866093
language_code,91550,0.859028


In [13]:
# Lets take some of the data and see if we can do some tricky feature engineering.

# Here I'll use Label Encoding 
from sklearn.preprocessing import LabelEncoder
lb_make = LabelEncoder()

# df['name'] Should encode this name. Categorical variable.
#df = df.dropna(subset=['name'])
df["name_code"] = lb_make.fit_transform(df["name"])
print(df[["name", "name_code"]].head(10))

# df['type'] Should encode this name. Categorical variable.
df = df.dropna(subset=['type'])
df["type_code"] = lb_make.fit_transform(df["type"])
print(df[["type", "type_code"]].head(10))

# df['genre_top'] Should encode this name. Categorical variable.
df = df.dropna(subset=['genre_top'])
df["genre_top_code"] = lb_make.fit_transform(df["genre_top"])
print(df[["genre_top", "genre_top_code"]].head(10))


         name  name_code
0        AWOL        299
1        AWOL        299
2        AWOL        299
3   Kurt Vile       7425
4  Nicky Cook       9558
5  Nicky Cook       9558
6  Nicky Cook       9558
7  Nicky Cook       9558
8  Nicky Cook       9558
9        AWOL        299
    type  type_code
0  Album          0
1  Album          0
2  Album          0
3  Album          0
4  Album          0
5  Album          0
6  Album          0
7  Album          0
8  Album          0
9  Album          0
       genre_top  genre_top_code
0        Hip-Hop               4
1        Hip-Hop               4
2        Hip-Hop               4
3            Pop               7
9        Hip-Hop               4
10          Rock               8
11          Rock               8
12  Experimental               2
13  Experimental               2
14          Folk               3


In [14]:
df.shape

(47551, 56)

In [15]:
# Drop all remaining columns missing data. 
nan_columns = df.columns[df.isna().any()].tolist()
df = df.drop(columns=nan_columns)
df.head(1)

# Make sure we didn't miss any.
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(4)

Unnamed: 0,Total,Percent
genre_top_code,0,0.0
type_code,0,0.0
comments,0,0.0
date_created,0,0.0


In [16]:
# Which columns are missing data
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(4)

Unnamed: 0,Total,Percent
genre_top_code,0,0.0
type_code,0,0.0
comments,0,0.0
date_created,0,0.0


In [19]:
# All my rows are dropped that will be dropped. Now I'll make my key lists for 
df_key= df[["genre_top", "genre_top_code","type", "type_code","name", "name_code"]]

# Lets drop all the non-numerical columns now. 
data = df._get_numeric_data()
data.shape

(47551, 19)

### Now for some Logistic Regression

In [20]:
# Define my X & Y
y=data["genre_top_code"]
X=data.drop(columns=['genre_top_code'])

# Going to try a standard scaling my X for fast converge with SAG / SAGA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Here is where I split the model into test/train sets. 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, 
                                                    test_size=0.2, random_state=42)


In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV

# Cross-Validation Logistic Regression to determine C hyperparameter
logregCV = LogisticRegressionCV(Cs=[.01,.1,.5,1,5,10,30],
                                cv=2,
                                class_weight ='balanced', 
                                scoring='f1_score',
                                solver='saga', 
                                max_iter=3000,
                                n_jobs=-1, 
                                verbose=1,
                                random_state=42)

logregCV.fit(X_train, y_train)


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


convergence after 207 epochs took 301 seconds
convergence after 232 epochs took 335 seconds
convergence after 250 epochs took 363 seconds
convergence after 283 epochs took 410 seconds
convergence after 283 epochs took 412 seconds
convergence after 285 epochs took 418 seconds
convergence after 294 epochs took 428 seconds
convergence after 298 epochs took 435 seconds
convergence after 303 epochs took 437 seconds
convergence after 304 epochs took 448 seconds
convergence after 355 epochs took 517 seconds
convergence after 366 epochs took 534 seconds
convergence after 451 epochs took 652 seconds
convergence after 673 epochs took 974 seconds
convergence after 352 epochs took 511 seconds
convergence after 756 epochs took 1107 seconds
convergence after 79 epochs took 115 seconds
convergence after 825 epochs took 1194 seconds
convergence after 75 epochs took 108 seconds
convergence after 921 epochs took 1323 seconds
convergence after 14 epochs took 19 seconds
convergence after 926 epochs took 1

ValueError: 'f1_score' is not a valid scoring value. Use sorted(sklearn.metrics.SCORERS.keys()) to get valid options.

convergence after 758 epochs took 745 seconds
convergence after 132 epochs took 100 seconds
convergence after 92 epochs took 68 seconds
convergence after 650 epochs took 497 seconds
convergence after 1269 epochs took 1129 seconds
convergence after 642 epochs took 479 seconds
convergence after 203 epochs took 123 seconds
convergence after 203 epochs took 123 seconds
convergence after 178 epochs took 110 seconds


In [22]:
print ('Max auc_roc:', logregCV.scores_[1].mean(axis=0).max()) 
y_predicted = logregCV.predict(X_test)

AttributeError: 'LogisticRegressionCV' object has no attribute 'scores_'

convergence after 31 epochs took 20 seconds
convergence after 175 epochs took 108 seconds
convergence after 21 epochs took 13 seconds
convergence after 31 epochs took 16 seconds
convergence after 21 epochs took 11 seconds
convergence after 2038 epochs took 1686 seconds
convergence after 2102 epochs took 1215 seconds
convergence after 2529 epochs took 1638 seconds
convergence after 1417 epochs took 606 seconds
convergence after 937 epochs took 324 seconds
convergence after 273 epochs took 95 seconds
convergence after 192 epochs took 67 seconds
convergence after 2926 epochs took 1527 seconds
convergence after 1006 epochs took 262 seconds
convergence after 1318 epochs took 347 seconds
convergence after 1385 epochs took 367 seconds
convergence after 196 epochs took 40 seconds
convergence after 133 epochs took 26 seconds
convergence after 1552 epochs took 224 seconds
convergence after 1343 epochs took 132 seconds
convergence after 1436 epochs took 143 seconds
convergence after 254 epochs to

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

def get_metrics(y_test, y_predicted):  
    # true positives / (true positives+false positives)
    precision = precision_score(y_test, y_predicted, pos_label=None,
                                    average='weighted')             
    # true positives / (true positives + false negatives)
    recall = recall_score(y_test, y_predicted, pos_label=None,
                              average='weighted')
    
    # harmonic mean of precision and recall
    f1 = f1_score(y_test, y_predicted, pos_label=None, average='weighted')
    
    # true positives + true negatives/ total
    accuracy = accuracy_score(y_test, y_predicted)
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))

In [None]:
# Lets see if we can find the ideal features.
logreg = LogisticRegression(solver='saga', multi_class='auto', n_jobs=-1, 
                         random_state=42, verbose=1, max_iter=500, warm_start=True)

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
rfe = RFE(logreg, 20)
rfe = rfe.fit(X_test, y_test)

print(rfe.support_)
print(rfe.ranking_)

## Guidance for the Assignment




This is the biggest data you've played with so far, and while it does generally fit in Colab, it can take awhile to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is download it locally and edit as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers

If you find that fitting a model to classify *all* genres isn't very good, it's totally OK to limit to the most frequent genres, or perhaps trying to combine or cluster genres as a preprocessing step. Even then, there will be limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig in a bit more to both the code and math (also takes a gradient descent approach, introducing the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com), and find a competition to try logistic regression with
- (Not logistic regression related) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the actual raw audio, so (while this is more of a longterm project than a stretch goal, and won't fit in Colab) if you'd like you can check those out and see what sort of deeper analysis you can do.