# 1. Project info

**Project title**: Kick out the jams: Classifying song genres

**Name:** Lina Tran, Joel Ostblom, Madeleine Bonsma-Fisher, Ahmed Hasan

**E-mail:** lina.mntran@gmail.com, joel.ostblom@gmail.com, m.bonsma@mail.utoronto.ca, ahmed.hasan@mail.utoronto.ca

**GitHub username**: linanmt, joelostblom, mbonsma, aays

**Link to prior writing**: [Lina](https://github.com/UofTCoders/studyGroup/blob/gh-pages/lessons/python/classes/Classes_in_Python.ipynb), [Joel](https://github.com/UofTCoders/rcourse/blob/master/lec02-basic-r.Rmd), [Madeleine](https://github.com/UofTCoders/rcourse/blob/master/lec07-pop-models.Rmd), [Ahmed](https://github.com/UofTCoders/studyGroup/blob/gh-pages/lessons/r/dplyrmagrittr/lesson.Rmd) 

**Short description**: A short description of the project, less than 110 characters. This will be read by the students on the DataCamp platform **before** deciding to start the project.

> Using a dataset of song properties, apply machine learning methods in Python to classify tracks by genre.

---

#### Long description ####

A longer description of the project, around four sentences in length. 
This will be read by the students on the DataCamp platform **before** deciding to start the project.

It should mention some of the major prerequisites for completing the project 
(for example "familiarity with `pandas` DataFrames" or "know how to use the Naive Bayes method from `scikit learn`")

> Using a subset of the dataset comprising two genres (Hip-Hop and Rock), we will train a classifier to distinguish between the two genres based only on track information derived from Echonest (now part of Spotify) data for each track. We will first use the `pandas` and `seaborn` packages in Python to subset the data, aggregate information, and create plots while exploring the data for obvious trends or factors we should be aware of when doing machine learning. Next, we will use the `scikit-learn` package to test whether we can correctly classify a song's genre based on features such as danceability, energy, acousticness, and tempo. We will go over implementations of common algorithms such as PCA, logistic regression, and decision trees.


#### Datasets used ####

Short description of (and ideally links to) the datasets used in the project. This will be read by me (David) only.

> An open dataset for music analysis - https://github.com/mdeff/fma. The tracks represented in the dataset are all from the Free Music Archive, a large library of free audio downloads. 

> https://www.dropbox.com/sh/bab3yu27thfcv45/AAAHoBux36O54Dyg6UUOBv_8a?dl=0

#### Assumed student background ####

What background knowledge you assume the student doing this project will have. The more specific the better. This will be read by me (David) only. Please list things like modules, tools, functions, methods, and statistical concepts and jargon.

Not so useful: "The student has a basic familiarity with `pandas`."

More useful: "The student knows how to read in a csv file using `pandas` and how to compute grouped summary statistics using `groupby()`."

> The student knows how to read in csv and json files using `pandas`, and how to examine correlations across the columns of a `pandas` DataFrame. The student understands basic plotting (scatter plots, pair plots) with `seaborn`. The student is also familiar with implementing principal component analysis, supervised learning methods (decision trees and logistic regression), and model accuracy scoring using k-fold cross-validation and the classification report in `scikit-learn`.

# 2. Project narrative intro

## These recommendations are so on point! How does this playlist know me so well?

> Over the past few years, streaming services with vast catalogues have become the primary means through which most people listen to their favourite music. At the same time, the wealth of options on offer can leave users a bit overwhelmed when looking for new music that suits their tastes.

> For this reason, streaming services have found various means of categorizing music to allow for personalized recommendations. One method involves direct analysis of the raw audio information in a given song, scoring the raw data on a variety of metrics. Today, we'll be examining data compiled by a research group known as The Echo Nest. Our goal is to look through this dataset and classify songs as being either 'Hip-Hop' or 'Rock' - all without listening to a single one ourselves.

## 1. Read in the data

> To begin with, let's load the data on our tracks alongside the track metrics compiled by The Echo Nest. These exist in two different files, which are in different formats - `csv` and `json`. While `csv` is a popular file format for tabular data, `json` is another common file format, in which databases often return the results of a given query.

> Let's start by creating two `pandas` DataFrames out of these files, and then merging the DataFrames into one for our genre classification analysis.

In [1]:
import pandas as pd

tracks = pd.read_csv('./fma-rock-vs-hiphop.csv')
tracks.info()
track_metrics = pd.read_json('./echonest-clean.json')
track_metrics.info()

# Students will remove NAs here

echo_tracks = pd.merge(track_metrics, tracks[['genre_top', 'title', 'track_id']], on='track_id')
echo_tracks.head()

# Students will make a correlation matrix to explore if there are correlated variables
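
The two student steps noted in the cell above (removing NAs and building a correlation matrix) could look roughly like the sketch below. It is only an illustration under stated assumptions: the inline `csv_text` and `json_text` strings are made-up stand-ins for the real files, and dropping every row with a missing value is just one possible way to handle NAs.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two input files (NOT the real project data).
csv_text = """track_id,genre_top,title
1,Rock,Song A
2,Hip-Hop,Song B
3,Rock,Song C
4,Hip-Hop,Song D
"""
json_text = """[
 {"track_id": 1, "danceability": 0.40, "energy": 0.90, "tempo": 140.0},
 {"track_id": 2, "danceability": 0.80, "energy": 0.60, "tempo": 95.0},
 {"track_id": 3, "danceability": 0.35, "energy": null, "tempo": 150.0},
 {"track_id": 4, "danceability": 0.85, "energy": 0.55, "tempo": 90.0}
]"""

tracks = pd.read_csv(io.StringIO(csv_text))
track_metrics = pd.read_json(io.StringIO(json_text))

# Assumption: rows with any missing metric are simply dropped.
track_metrics = track_metrics.dropna()

# Inner merge on the shared key keeps only tracks present in both tables.
echo_tracks = pd.merge(
    track_metrics, tracks[['genre_top', 'title', 'track_id']], on='track_id')

# Pairwise Pearson correlations between the numeric metrics
# (track_id is an identifier, so it is excluded).
corr = echo_tracks.select_dtypes('number').drop(columns=['track_id']).corr()
print(corr.round(2))
```

Strongly correlated columns in `corr` would hint at redundant features, which is one motivation for the dimensionality reduction in the next task.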

## 2. PCA and visualization

> The variance between genres could possibly be explained by just a few features in the dataset. To find out, a commonly used dimensionality reduction approach is Principal Component Analysis (PCA), which rotates the data so that the first axes (the principal components) capture the most variance, allowing visualization in lower dimensions.

> First, we will preprocess the data by assigning all numerical features to the `features` variable and the genres to the `labels` variable.

> Next, we will use PCA to find out the intrinsic number of dimensions for the data, and visualize the results in a bar plot with the explained variance on the y-axis.

In [2]:
%matplotlib inline
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

features = echo_tracks.drop(['genre_top', 'title', 'track_id'], axis=1) 
labels = echo_tracks['genre_top']

pca = PCA()
pca.fit(features)

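fig, ax = plt.subplots()
# One bar per principal component, showing its share of the total variance
x = range(pca.n_components_)
y = pca.explained_variance_ratio_
ax.bar(x, y)
ax.set_xlabel('Principal component #')
ax.set_ylabel('Explained variance ratio')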
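
One thing worth checking in this task: PCA is sensitive to the scale of each feature, so a column measured in large units (such as tempo in beats per minute) can dominate the components if the data is not standardized first. The sketch below illustrates this on synthetic data, not the project dataset. The use of `StandardScaler` and the 90% cumulative variance cutoff are assumptions made for illustration, not choices stated in the proposal.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(10)
n = 200
latent = rng.normal(size=(n, 2))
features = np.column_stack([
    latent[:, 0],
    latent[:, 0] + 0.1 * rng.normal(size=n),  # nearly duplicates column 0
    latent[:, 1],
    100 + 30 * rng.normal(size=n),            # tempo-like column, large scale
])

# Without scaling, the large-scale column dominates the first component.
raw_ratios = PCA().fit(features).explained_variance_ratio_

# Standardize to zero mean and unit variance, then fit PCA.
scaled = StandardScaler().fit_transform(features)
pca = PCA()
pca.fit(scaled)
ratios = pca.explained_variance_ratio_

# Smallest number of components whose cumulative share reaches the cutoff.
cumulative = np.cumsum(ratios)
n_components = int(np.argmax(cumulative >= 0.90)) + 1

print('unscaled:', raw_ratios.round(3))
print('scaled:  ', ratios.round(3))
print('components needed for 90% of the variance:', n_components)
```

A cumulative explained variance curve like `cumulative` is a common complement to the bar plot when the bars show no clear elbow.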

## 3. Compare the predictive power of logistic regression and a decision tree

> After using PCA to visualize and qualitatively inspect the data, it is now time to make quantitative predictions about the genres of the songs. There are many algorithms that could perform well on this type of task. Here, we will try two common ones: a decision tree and logistic regression.

> The first algorithm will be a decision tree. We'll start by splitting the data into training and test sets, and then train a decision tree classifier on the training data.

In [3]:
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, random_state=10)

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=10)
tree.fit(train_features, train_labels)
tree.score(test_features, test_labels)
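
The comparison named in this task's title could continue along the lines of the sketch below, which scores a decision tree and a logistic regression with k-fold cross-validation and prints a classification report for each. It runs on synthetic data from `make_classification`, not the project dataset. The 10 folds, the stratified split, and the scaling step before logistic regression are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the track features and genres.
features, labels = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=10)

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, random_state=10, stratify=labels)

models = {
    'decision tree': DecisionTreeClassifier(random_state=10),
    # Scaling inside the pipeline keeps each CV fold free of test-fold leakage.
    'logistic regression': make_pipeline(
        StandardScaler(), LogisticRegression(random_state=10)),
}

kf = KFold(n_splits=10, shuffle=True, random_state=10)
cv_scores = {}
for name, model in models.items():
    # Mean accuracy across the folds of the training data.
    scores = cross_val_score(model, train_features, train_labels, cv=kf)
    cv_scores[name] = scores.mean()

    # Refit on the full training set and report per-class metrics on the test set.
    model.fit(train_features, train_labels)
    print(name, 'mean CV accuracy:', round(cv_scores[name], 3))
    print(classification_report(test_labels, model.predict(test_features)))
```

Cross-validated scores are less dependent on one particular split than a single `score` call on the test set, which makes the comparison between the two models more reliable.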

*Stop here! Only the first three tasks :)*