# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since those too live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).
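To make the "unit interval" point concrete, here is a minimal sketch of the logistic (sigmoid) function applied to a linear score. The weights, bias, and feature values are made up for illustration and are not taken from the FMA data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights, bias, and feature vector, for illustration only.
weights = [0.8, -1.2]
bias = 0.1
features = [2.0, 0.5]

# A logistic model computes a linear score, then squashes it into a probability.
score = sum(w * x for w, x in zip(weights, features)) + bias
probability = sigmoid(score)
print(round(probability, 3))
```

A score of zero maps to a probability of exactly 0.5, and larger positive or negative scores push the probability toward 1 or 0 respectively.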

## Assignment - real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It includes a selection of CSVs with metadata and calculated audio features that you can load and use to classify the genre of tracks. The cells below download and load the metadata to get you started.

This is the largest dataset you've worked with so far, and while it generally fits in Colab, it can take a while to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is download it locally and edit as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers
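If the data size becomes a problem, sampling might look something like the sketch below. It uses a small synthetic table rather than the real tracks data, and the column names `genre_top` and `duration` are placeholders chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the tracks table; column names are illustrative only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "genre_top": rng.choice(["Rock", "Electronic", "Folk"], size=1000,
                            p=[0.6, 0.3, 0.1]),
    "duration": rng.integers(30, 600, size=1000),
})

# Simple random sample: 10% of the rows, reproducible via random_state.
simple_sample = toy.sample(frac=0.1, random_state=42)

# Stratified sample: 10% of each genre, so rarer genres stay represented.
stratified_sample = toy.groupby("genre_top").sample(frac=0.1, random_state=42)

print(simple_sample.shape, stratified_sample.shape)
```

The stratified variant is worth considering when some genres are much rarer than others, since a plain random sample can leave them with very few rows.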

If a model that classifies *all* genres doesn't perform well, it's totally OK to limit it to the most frequent genres, or to combine or cluster genres as a preprocessing step. Even then, there are limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).
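As a rough sketch of limiting to the most frequent genres, the example below works on a tiny hand-made column. The cutoff `top_n` is a hypothetical value and the "Other" label is just one possible way to combine rare genres.

```python
import pandas as pd

# Toy stand-in for the genre column; in this notebook the real column is
# tracks1['track_genre_top'].
toy = pd.DataFrame({
    "track_genre_top": ["Rock"] * 5 + ["Electronic"] * 4 + ["Folk"] * 2 + ["Jazz"],
})

top_n = 2  # hypothetical cutoff; tune it for the real data

# Genres ranked by frequency, keeping the top_n most common.
top_genres = toy["track_genre_top"].value_counts().nlargest(top_n).index

# Option 1: keep only rows whose genre is among the most frequent.
is_top = toy["track_genre_top"].isin(top_genres)
subset = toy[is_top]

# Option 2: keep every row, but lump the rare genres into one "Other" class.
lumped = toy["track_genre_top"].where(is_top, "Other")

print(subset["track_genre_top"].value_counts())
print(lumped.value_counts())
```

Option 1 shrinks the dataset, while option 2 keeps every row at the cost of a catch-all class that may be hard to predict.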

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

In [0]:
!wget https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
!unzip fma_metadata.zip

--2019-01-24 14:32:54--  https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
Resolving os.unil.cloud.switch.ch (os.unil.cloud.switch.ch)... 86.119.28.13, 2001:620:5ca1:2ff::ce53
Connecting to os.unil.cloud.switch.ch (os.unil.cloud.switch.ch)|86.119.28.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358412441 (342M) [application/zip]
Saving to: ‘fma_metadata.zip’


2019-01-24 14:33:19 (17.5 MB/s) - ‘fma_metadata.zip’ saved [358412441/358412441]

Archive:  fma_metadata.zip
 bunzipping: fma_metadata/README.txt  
 bunzipping: fma_metadata/checksums  
 bunzipping: fma_metadata/not_found.pickle  
 bunzipping: fma_metadata/raw_genres.csv  
 bunzipping: fma_metadata/raw_albums.csv  
 bunzipping: fma_metadata/raw_artists.csv  
 bunzipping: fma_metadata/raw_tracks.csv  
 bunzipping: fma_metadata/tracks.csv  
 bunzipping: fma_metadata/genres.csv  
 bunzipping: fma_metadata/raw_echonest.csv  
 bunzipping: fma_metadata/echonest.csv  
 bunzipping: fma_metadata/features.

In [0]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [0]:
tracks = pd.read_csv('fma_metadata/tracks.csv', header=[0,1]) # Turning rows 0 & 1 into headers


In [0]:
pd.set_option('display.max_columns', None)  # Unlimited columns
tracks.head(3)

Unnamed: 0_level_0,Unnamed: 0_level_0,album,album,album,album,album,album,album,album,album,album,album,album,album,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,set,set,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track
Unnamed: 0_level_1,Unnamed: 0_level_1.1,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
0,track_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000.0,0.0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168.0,2.0,Hip-Hop,[21],[21],,4656.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293.0,,3.0,,[],Food
2,3,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000.0,0.0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237.0,1.0,Hip-Hop,[21],[21],,1470.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514.0,,4.0,,[],Electric Ave


In [0]:
tracks.shape

(106575, 53)

In [0]:
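# Making a real copy of the original df (plain assignment would only alias it,
# so the column renaming below would also change `tracks`)
tracks1 = tracks.copy()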

# Joining headers & subheaders with '_'
tracks1.columns = tracks1.columns.map('_'.join)

# Renaming first column from 'Unnamed...' into 'track_id'
tracks1 = tracks1.rename(columns={'Unnamed: 0_level_0_Unnamed: 0_level_1': 'track_id'})

# Dropping row 0 (with mostly useless info)
tracks1 = tracks1.drop(tracks1.index[0])

# Resetting index of tracks df
tracks1 = tracks1.reset_index(drop=True)

In [0]:
tracks1.head(3)

Unnamed: 0,track_id,album_comments,album_date_created,album_date_released,album_engineer,album_favorites,album_id,album_information,album_listens,album_producer,album_tags,album_title,album_tracks,album_type,artist_active_year_begin,artist_active_year_end,artist_associated_labels,artist_bio,artist_comments,artist_date_created,artist_favorites,artist_id,artist_latitude,artist_location,artist_longitude,artist_members,artist_name,artist_related_projects,artist_tags,artist_website,artist_wikipedia_page,set_split,set_subset,track_bit_rate,track_comments,track_composer,track_date_created,track_date_recorded,track_duration,track_favorites,track_genre_top,track_genres,track_genres_all,track_information,track_interest,track_language_code,track_license,track_listens,track_lyricist,track_number,track_publisher,track_tags,track_title
0,2,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000.0,0.0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168.0,2.0,Hip-Hop,[21],[21],,4656.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293.0,,3.0,,[],Food
1,3,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000.0,0.0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237.0,1.0,Hip-Hop,[21],[21],,1470.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514.0,,4.0,,[],Electric Ave
2,5,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000.0,0.0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206.0,6.0,Hip-Hop,[21],[21],,1933.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151.0,,6.0,,[],This World


In [0]:
# dropping all rows without 'track_genre_top'
tracks1 = tracks1.dropna(subset=['track_genre_top'])

In [0]:
# dropping latitude and longitude
tracks1 = tracks1.drop(['artist_latitude', 'artist_longitude'], axis=1)

In [0]:
tracks1.shape

(49598, 51)

In [0]:
tracks_numeric = tracks1.select_dtypes('number')

In [0]:
tracks_numeric.isna().sum()

album_comments      0
album_favorites     0
album_id            0
album_listens       0
album_tracks        0
artist_comments     0
artist_favorites    0
artist_id           0
track_bit_rate      0
track_comments      0
track_duration      0
track_favorites     0
track_interest      0
track_listens       0
track_number        0
dtype: int64

In [0]:
tracks_numeric.shape

(49598, 15)

In [0]:
tracks_numeric.head()

Unnamed: 0,album_comments,album_favorites,album_id,album_listens,album_tracks,artist_comments,artist_favorites,artist_id,track_bit_rate,track_comments,track_duration,track_favorites,track_interest,track_listens,track_number
0,0.0,4.0,1.0,6073.0,7.0,0.0,9.0,1.0,256000.0,0.0,168.0,2.0,4656.0,1293.0,3.0
1,0.0,4.0,1.0,6073.0,7.0,0.0,9.0,1.0,256000.0,0.0,237.0,1.0,1470.0,514.0,4.0
2,0.0,4.0,1.0,6073.0,7.0,0.0,9.0,1.0,256000.0,0.0,206.0,6.0,1933.0,1151.0,6.0
3,0.0,4.0,6.0,47632.0,2.0,3.0,74.0,6.0,192000.0,0.0,161.0,178.0,54881.0,50135.0,1.0
9,0.0,4.0,1.0,6073.0,7.0,0.0,9.0,1.0,256000.0,0.0,207.0,3.0,1126.0,943.0,5.0


In [0]:
tracks_numeric.describe()

Unnamed: 0,album_comments,album_favorites,album_id,album_listens,album_tracks,artist_comments,artist_favorites,artist_id,track_bit_rate,track_comments,track_duration,track_favorites,track_interest,track_listens,track_number
count,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0
mean,0.333179,1.145449,11813.603351,19974.69,21.763035,1.16291,16.721602,11400.403786,260278.65793,0.024336,268.627263,2.381447,2523.905,1586.32838,8.545607
std,1.312889,2.463242,6455.900325,57369.8,51.492489,4.18622,58.966302,7046.466624,65663.862632,0.332693,284.327919,11.147578,19802.85,6039.952955,16.98826
min,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.0,0.0,0.0,2.0,1.0,0.0
25%,0.0,0.0,6530.0,2602.0,7.0,0.0,1.0,5449.0,192000.0,0.0,146.0,0.0,456.0,212.0,2.0
50%,0.0,0.0,11887.0,6092.0,11.0,0.0,4.0,11384.0,256000.0,0.0,211.0,1.0,938.0,520.0,5.0
75%,0.0,1.0,17410.0,16219.0,17.0,1.0,12.0,17450.0,320000.0,0.0,299.0,2.0,2091.0,1321.0,9.0
max,17.0,40.0,22940.0,1193803.0,652.0,68.0,963.0,24357.0,448000.0,37.0,11030.0,1482.0,3293557.0,543252.0,255.0


## Regression

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [0]:
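# Target: the top-level genre label; features: the 15 numeric metadata columns.
# Note that album_id and artist_id are identifiers rather than measurements.
y = tracks1['track_genre_top']
X = tracks_numeric

# Holding out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

# One-vs-rest logistic regression: one binary classifier per genre
log_reg = LogisticRegression(multi_class='ovr',
                             solver='liblinear',
                             max_iter=500)
log_reg.fit(X_train, y_train)

# Mean accuracy on the held-out test set
log_reg.score(X_test, y_test)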

0.3815322580645161
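An accuracy like the one above is easier to interpret next to a baseline, and the "best predictors" question can be approached by looking at the fitted coefficients. The sketch below shows one way to do both. It runs on synthetic data from `make_classification`, not on the FMA tracks, and the feature names, the scaling step, and the "mean absolute coefficient" ranking are all choices made for this example rather than part of the notebook above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 3 classes, 6 features (3 of them informative).
X_arr, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                               n_redundant=0, n_classes=3,
                               n_clusters_per_class=1, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]  # hypothetical
X = pd.DataFrame(X_arr, columns=feature_names)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Baseline: always predict the most frequent class in the training set.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Standardize features first so coefficient magnitudes are comparable.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
model.fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
print(f"baseline: {baseline_acc:.3f}  logistic regression: {model_acc:.3f}")

# One row of coefficients per class, one column per feature.
log_reg_step = model.named_steps["logisticregression"]
coefs = pd.DataFrame(log_reg_step.coef_, index=log_reg_step.classes_,
                     columns=feature_names)

# Rough ranking: mean absolute coefficient across classes.
importance = coefs.abs().mean(axis=0).sort_values(ascending=False)
print(importance)
```

Coefficient size is only a rough guide to importance, and it is only comparable across features when they share a scale, which is why the sketch standardizes them first.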

## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig in a bit more to both the code and math (also takes a gradient descent approach, introducing the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com), and find a competition to try logistic regression with
- (Not logistic regression related) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the actual raw audio, so if you'd like, you can check it out and see what sort of deeper analysis you can do (this is more of a long-term project than a stretch goal, and won't fit in Colab).
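For the stretch goal on comparing scikit-learn classifiers, the shared API means a comparison loop can be quite short. The following is a sketch on synthetic data, with default hyperparameters and 5-fold cross-validation picked arbitrarily for illustration. Results on the FMA metadata may look very different.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for a real feature matrix and labels.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# Every estimator exposes the same fit/predict/score interface.
models = {
    "logistic regression": LogisticRegression(max_iter=500),
    "SVM": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

for name, acc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```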