In [None]:
# modules for research report
from datascience import *
import numpy as np
import random
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# module for YouTube video
from IPython.display import YouTubeVideo

# okpy config
from client.api.notebook import Notebook
ok = Notebook('music-final-project.ok')
_ = ok.auth(inline=True)

# Free Music Archive: A Dataset For Music Analysis

This dataset was introduced by Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson at the International Society for Music Information Retrieval (MIR) Conference in 2017.
It has been cleaned for your convenience: all missing values have been removed, and low-quality observations and variables have been filtered out. A brief summary of the dataset, originally
given at the conference, is provided below. 

**NB: You may not copy any public analyses of this dataset. Doing so will result in a zero.**

## Summary

>We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and
organizing large music collections. The community's growing interest in feature and end-to-end learning is however restrained by the limited availability of large audio datasets.
The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a
hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text
such as biographies.

A small, random subset of this larger dataset is provided to you.

## Data Description

This dataset consists of three tables stored in the `data` folder:
1. `tracks` provides information on individual tracks.
2. `genres` contains information on all of the genres.
3. `features` contains information on the Spotify audio features of each track.

A description of each table's variables is provided below:

`tracks`:
* `track_id`: a unique ID for each track
* `track_title`: title of each track
* `artist_name`: name of the artist
* `album_title`: title of the album that the track comes from
* `track_duration`: the length of the song in seconds
* `track_genre`: the genre(s) that the track fall(s) into
* `album_date_released`: a string indicating the album release date
* `album_type`: specifies whether the album is studio-recorded, live, or from a radio program
* `album_tracks`: number of tracks on the album

`genres`:
* `genre_id`: a unique ID for each genre
* `title`: the name of the genre
* `# tracks`: the number of tracks that fall into this genre
* `parent`: the genre that this subgenre falls under (will be 0 if not a subgenre)
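
Since `parent` is 0 for a top-level genre, the top-level genre of any subgenre can be found by following `parent` links until 0 is reached. The sketch below assumes that `parent` holds the `genre_id` of the parent genre and that the relationship has been copied into a plain dictionary. The IDs are made up for illustration and are not taken from the dataset, and `top_level_genre` is a hypothetical helper rather than part of any library.

```python
def top_level_genre(genre_id, parent_of):
    """Follow parent links until reaching a genre whose parent is 0."""
    seen = set()
    while parent_of[genre_id] != 0:
        if genre_id in seen:  # guard against accidental cycles in the data
            raise ValueError("cycle in parent links")
        seen.add(genre_id)
        genre_id = parent_of[genre_id]
    return genre_id

# Hypothetical genre_id -> parent mapping (illustrative values only)
parent_of = {1: 0, 2: 1, 3: 2, 4: 0}

print(top_level_genre(3, parent_of))  # genre 3 -> 2 -> 1, and 1 has parent 0
print(top_level_genre(4, parent_of))  # genre 4 is already top-level
```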

`features` (descriptions from the [Spotify API page](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/)):
* `track_id`: a unique ID for each track
* `acousticness`: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
* `danceability`: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
* `energy`: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.
* `instrumentalness`: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. 
* `liveness`: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
* `speechiness`: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. 
* `tempo`: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. 
* `valence`: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

## Inspiration

A variety of exploratory analyses, hypothesis tests, and prediction problems can be tackled with this data. Here are a few ideas to get
you started:


1. Which genre has the longest songs?
2. Is there a relationship between danceability and energy? What about danceability and valence?
3. Can you classify which of two genres a track belongs to based on its audio features?
4. Do two genres (or parent genres) of your choice have the same average energy?

Don't forget to review the [Final Project Guidelines](https://docs.google.com/document/d/1NuHDYTdWGwhPNRov8Y3I8y6R7Rbyf-WDOfQwovD-gmw/edit?usp=sharing) for a complete list of requirements.
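
For the relationship questions above, one common starting point is the correlation coefficient, computed as the mean of the products of two variables in standard units. The sketch below is only an illustration: the arrays are made-up values, not rows from `features`, and `standard_units` and `correlation` are hypothetical helper names. In the notebook, the arrays would presumably come from calls such as `features.column('danceability')`.

```python
import numpy as np

def standard_units(values):
    """Convert values to standard units (mean 0, standard deviation 1)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def correlation(x, y):
    """Mean of the products of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up values for illustration only
danceability = np.array([0.2, 0.4, 0.6, 0.8])
energy = np.array([0.3, 0.5, 0.7, 0.9])

r = correlation(danceability, energy)
print(r)  # these made-up points lie exactly on a line, so r is (numerically) 1
```

A value of r near 0 suggests no linear association, but it does not rule out a nonlinear one, so a scatter plot is still worth checking.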

## Preview

The tables are loaded in the code cells below. Take some time to explore them!

In [None]:
# load the genres table
genres = Table.read_table("data/genres_final.csv")
genres

In [None]:
# load the features table
features = Table.read_table("data/features_final.csv")
features

In [None]:
# load the tracks table
tracks = Table.read_table("data/tracks_final.csv")
tracks
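
The `tracks` and `features` tables share the `track_id` column, so rows can be matched across them. The sketch below shows the idea of an inner join in plain Python with made-up rows; `inner_join` is a hypothetical helper, and it assumes each `track_id` appears at most once in the right-hand table. In the notebook, the `datascience` call `tracks.join('track_id', features)` should serve the same purpose.

```python
# Made-up rows for illustration only (not taken from the dataset)
tracks_rows = [
    {"track_id": 1, "track_title": "Song A"},
    {"track_id": 2, "track_title": "Song B"},
    {"track_id": 3, "track_title": "Song C"},
]
features_rows = [
    {"track_id": 1, "energy": 0.9},
    {"track_id": 3, "energy": 0.4},
]

def inner_join(left, right, key):
    """Keep rows of `left` whose key also appears in `right`, merging columns."""
    right_by_key = {row[key]: row for row in right}  # assumes unique keys in `right`
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]

joined = inner_join(tracks_rows, features_rows, "track_id")
for row in joined:
    print(row)
```

Track 2 has no matching row in `features_rows`, so it is dropped from the result.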

<br>

# Research Report

## Introduction

*Replace this text with your introduction*

## Hypothesis Testing and Prediction Questions

**Please bold your hypothesis testing and prediction questions.**

*Replace this text with your hypothesis testing and prediction questions*

## Exploratory Data Analysis

**You may change the order of the plots and tables.**

**Quantitative Plot:**

In [None]:
# Use this cell to generate your quantitative plot
...

*Replace this text with an analysis of your plot*

**Qualitative Plot:**

In [None]:
# Use this cell to generate your qualitative plot
...

*Replace this text with an analysis of your plot*

**Aggregated Data Table:**

In [None]:
# Use this cell to generate your aggregated data table
...

*Replace this text with an analysis of your table*

**Table Requiring a Join Operation:**

In [None]:
# Use this cell to join two datasets
...

*Replace this text with an analysis of your table*

## Hypothesis Testing

**Do not copy code from demo notebooks or homeworks! You may split portions of your code into distinct cells. Also, be sure to
set a random seed so that your results are reproducible.**

In [None]:
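# set the random seeds so that results are reproducible
# (random.seed covers Python's random module; np.random.seed covers NumPy-based sampling)
random.seed(1231)
np.random.seed(1231)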

...
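
To illustrate why the seed matters, here is a minimal sketch of a permutation test for a difference in group means, one of several tests that could fit a question like "do two genres have the same average energy?". The values are made up, `permutation_p_value` is a hypothetical helper, and the choice of test statistic and number of repetitions are assumptions. Note that the shuffling uses NumPy, so it is `np.random.seed` that makes the result repeatable.

```python
import numpy as np

def permutation_p_value(group_a, group_b, repetitions=2000):
    """Estimate a p-value for the absolute difference in means by shuffling labels."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(repetitions):
        shuffled = np.random.permutation(pooled)  # random relabeling of the groups
        simulated = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if simulated >= observed:
            count += 1
    return count / repetitions

# Made-up energy values for two hypothetical genres
energy_a = [0.80, 0.90, 0.85, 0.95, 0.90]
energy_b = [0.10, 0.20, 0.15, 0.25, 0.20]

np.random.seed(1231)
p_first = permutation_p_value(energy_a, energy_b)

np.random.seed(1231)  # same seed, so the same shuffles and the same estimate
p_second = permutation_p_value(energy_a, energy_b)

print(p_first, p_first == p_second)
```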

## Prediction

**Be sure to set a random seed so that your results are reproducible.**

In [None]:
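# set the random seeds so that results are reproducible
# (random.seed covers Python's random module; np.random.seed covers NumPy-based sampling)
random.seed(1231)
np.random.seed(1231)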

...
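
As one possible shape for a prediction, the sketch below shuffles rows into training and test sets with a fixed seed and classifies each test point by majority vote among its nearest training neighbors. Everything here is an assumption made for illustration: the two-column points, the labels "A" and "B", the choice of k, and the helper name `classify`. None of it comes from the dataset.

```python
import numpy as np

def classify(train_points, train_labels, new_point, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    train_points = np.asarray(train_points, dtype=float)
    new_point = np.asarray(new_point, dtype=float)
    distances = np.sqrt(np.sum((train_points - new_point) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(np.asarray(train_labels)[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Made-up (danceability, energy) pairs for two hypothetical genres
points = np.array([
    [0.80, 0.90], [0.90, 0.80], [0.85, 0.75], [0.75, 0.85],  # hypothetical genre "A"
    [0.20, 0.10], [0.10, 0.20], [0.15, 0.25], [0.25, 0.15],  # hypothetical genre "B"
])
labels = np.array(["A"] * 4 + ["B"] * 4)

# Seeded shuffle so the train/test split is the same on every run
np.random.seed(1231)
order = np.random.permutation(len(labels))
train, test = order[:6], order[6:]

predictions = np.array([classify(points[train], labels[train], points[i]) for i in test])
accuracy = np.mean(predictions == labels[test])
print(accuracy)
```

Accuracy on held-out rows is a fairer measure than accuracy on the training rows, which is why the split happens before any classification.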

## Conclusion

*Replace this text with your conclusion*

## Presentation

*In this section, you'll need to provide a link to your video presentation. If you've uploaded your presentation to YouTube,
you can include the URL in the code below. We've provided an example to show you how to do this. Otherwise, provide the link
in a markdown cell.*

**Link:** *Replace this text with a link to your video presentation*

In [None]:
# Full Link: https://www.youtube.com/watch?v=BKgdDLrSC5s&feature=emb_logo
# Plug in the string between "v=" and "&feature":
YouTubeVideo('BKgdDLrSC5s')
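
If picking the ID out of the URL by hand feels error-prone, it can also be extracted with the standard library. This is a minimal sketch that assumes the link has the `watch?v=...` form shown above; `youtube_video_id` is a hypothetical helper name, and shortened or embed-style links would need different handling.

```python
from urllib.parse import urlparse, parse_qs

def youtube_video_id(url):
    """Return the value of the `v` query parameter from a YouTube watch URL."""
    query = parse_qs(urlparse(url).query)
    return query["v"][0]

video_id = youtube_video_id("https://www.youtube.com/watch?v=BKgdDLrSC5s&feature=emb_logo")
print(video_id)  # prints BKgdDLrSC5s
```

The resulting string can then be passed to `YouTubeVideo`.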

# Submission

*Just as with the other assignments in this course, please submit your research notebook to Okpy. We suggest that you
submit often so that your progress is saved.*

In [None]:
# Run this line to submit your work
_ = ok.submit()