BerkeleyX: Data8.3x

Foundations of Data Science: Prediction and Machine Learning

In [None]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Lab 5: Song Classification, Part 2

In [None]:
# dataset was extracted from the Million Song Dataset
# http://labrosa.ee.columbia.edu/millionsong/
lyrics = Table.read_table('../../data/lyrics.csv')

In [None]:
# lab setup
training_proportion = 11/16

num_songs = lyrics.num_rows
num_train = int(num_songs * training_proportion)
num_test = num_songs - num_train

train_lyrics = lyrics.take(np.arange(num_train))
test_lyrics = lyrics.take(np.arange(num_train, num_songs))

def most_common(label, table):
    return table.group(label).sort('count', descending=True).column(label).item(0)
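
The split above takes the first 11/16 of the rows for training and the rest for testing. A minimal sketch of the same arithmetic on row indices follows, using plain NumPy and a hypothetical 16-row dataset in place of `lyrics`. Because nothing is shuffled, a positional split like this is only representative if the rows are not ordered by genre, which the source does not state.

```python
import numpy as np

# Hypothetical stand-in for the lyrics table: 16 rows, identified by index.
num_songs = 16
training_proportion = 11 / 16

num_train = int(num_songs * training_proportion)  # rows used for training
num_test = num_songs - num_train                  # rows held out for testing

all_rows = np.arange(num_songs)
train_rows = all_rows[:num_train]   # first num_train row indices
test_rows = all_rows[num_train:]    # remaining row indices

# The two sets should be disjoint and together cover every row.
overlap = np.intersect1d(train_rows, test_rows)
print(num_train, num_test, len(overlap))
```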

1. Features

In [None]:
x = np.array(range(1, 4))
y = np.array(range(3))

print(x, y)
print(x + y)
print(x - y)

In [None]:
# Question 1.1
# Write a function to compute the Euclidean distance between two arrays of features of arbitrary (but equal) length.
# Use it to compute the distance between the first song in the training set and the first song in the test set,
#  using all of the features. (Remember that the title, artist, and genre of the songs are not features.)
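def distance(features1, features2):
    """The Euclidean distance between two arrays of feature values."""
    # Drop the first three entries (title, artist, genre), which are not features.
    f1 = np.array(features1[3:], float)
    f2 = np.array(features2[3:], float)
    return np.sum((f1 - f2) ** 2) ** .5

# Compare the first training song with the first test song, as the question asks.
x = np.array(train_lyrics.row(0))
y = np.array(test_lyrics.row(0))
distance(x, y)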
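
As a sanity check on the formula, here is a small worked example on hand-picked vectors. The helper name `euclidean` is hypothetical. Unlike the notebook's `distance`, it expects feature values only, with no title, artist, or genre entries to skip.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# Differences are (3, 4, 0), so the distance is sqrt(9 + 16 + 0) = 5.
d = euclidean([1, 2, 3], [4, 6, 3])
print(d)  # 5.0
```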

1.1. Creating your own feature set

In [None]:
# Question 1.1.1
# Choose 20 common words that you think might let you distinguish between country and hip-hop songs
my_20_features = [
    'love', 'burn', 'break', 'alway',
    'fight', 'kill', 'lord', 'blood',
    'death', 'smile', 'gone', 'babi',
    'vida', 'face', 'pena', 'kick',
    'plan', 'wanna', 'god', 'girl'
]

train_20 = train_lyrics.select(my_20_features)
test_20 = test_lyrics.select(my_20_features)

In [None]:
def fast_distances(test_row, train_rows):
    """An array of the distances between test_row and each row in train_rows.

    Takes 2 arguments:
      test_row: A row of a table containing features of one
        test song (e.g., test_20.row(0)).
      train_rows: A table of features (for example, the whole
        table train_20)."""
    assert train_rows.num_columns < 50, "Make sure you're not using all the features of the lyrics table."
    counts_matrix = np.asmatrix(train_rows.columns).transpose()
    diff = np.tile(np.array(test_row), [counts_matrix.shape[0], 1]) - counts_matrix
    distances = np.squeeze(np.asarray(np.sqrt(np.square(diff).sum(1))))
    return distances
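
`fast_distances` avoids a Python loop by subtracting the test row from every training row at once. A rough equivalent using NumPy broadcasting instead of `np.tile` and `np.asmatrix` might look like the sketch below. The function name `distances_to_rows` and the toy data are assumptions for illustration, and the input is a plain 2-D array rather than a `Table`.

```python
import numpy as np

def distances_to_rows(test_features, train_matrix):
    """Distances from one feature vector to each row of a 2-D array."""
    test_features = np.asarray(test_features, dtype=float)
    train_matrix = np.asarray(train_matrix, dtype=float)
    # Broadcasting subtracts test_features from every row of train_matrix.
    diff = train_matrix - test_features
    # Sum squared differences across features (axis 1), one value per row.
    return np.sqrt((diff ** 2).sum(axis=1))

# Toy training set with two features per row.
train = np.array([[0, 0], [3, 4], [6, 8]])
result = distances_to_rows([0, 0], train)
print(result)  # distances 0, 5 and 10
```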

**Question 1.1.2** <br/>
Use the `fast_distances` function provided above to compute the distance from the first song in the test set to all the songs in the training set, **using your set of 20 features**. Make a new table called `genre_and_distances` with one row for each song in the training set and two columns:
* The `"Genre"` of the training song
* The `"Distance"` from the first song in the test set

Ensure that `genre_and_distances` is **sorted in increasing order by distance to the first test song**.

In [None]:
genre_and_distances = Table().with_columns(
    'Genre', train_lyrics.column('Genre'),
    'Distance', fast_distances(test_20.row(0), train_20)
).sort("Distance")
genre_and_distances

In [None]:
test_lyrics['Genre'][0]

In [None]:
# Question 1.1.3
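# Compute the 5-nearest-neighbors classification of the first song in the test set.

# Set my_assigned_genre to the most common genre among the 5 nearest neighbors,
# i.e. the first 5 rows of genre_and_distances (already sorted by distance).
my_assigned_genre = most_common('Genre', genre_and_distances.take(np.arange(5)))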

# Set my_assigned_genre_was_correct to True if my_assigned_genre
# matches the actual genre of the first song in the test set.
my_assigned_genre_was_correct = my_assigned_genre == test_lyrics['Genre'][0]

print("The assigned genre, {}, was{}correct.".format(my_assigned_genre, " " if my_assigned_genre_was_correct else " not "))

1.2. A classifier function

**Question 1.2.1** <br/>
Write a function called `classify`. It should take the following four arguments:
* A row of features for a song to classify (e.g., `test_20.row(0)`).
* A table with a column for each feature (e.g., `train_20`).
* An array of classes that has as many items as the previous table has rows, and in the same order.
* `k`, the number of neighbors to use in classification.

It should return the class a `k`-nearest neighbor classifier picks for the given row of features (the string `'Country'` or the string `'Hip-hop'`).

In [None]:
def classify(test_row, train_rows, train_classes, k):
    """Return the most common class among the k nearest neighbors to test_row."""
    distances = fast_distances(test_row, train_rows)

    genre_and_distances = Table().with_columns(
        'Class', train_classes,
        'Distance', distances
    ).sort("Distance")

    return most_common('Class', genre_and_distances.take(range(k)))
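
The same three steps (compute distances, keep the `k` closest, take a majority vote) can be sketched without the `datascience` tables. The function name `knn_classify` and the toy data below are hypothetical. On a tied vote this version returns the alphabetically first label, which may differ from how `most_common` breaks ties; with two classes and an odd `k`, no tie can occur.

```python
import numpy as np

def knn_classify(test_features, train_matrix, train_classes, k):
    """Majority class among the k training rows closest to test_features."""
    train_matrix = np.asarray(train_matrix, dtype=float)
    diff = train_matrix - np.asarray(test_features, dtype=float)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # Indices of the k smallest distances; a stable sort keeps row order on ties.
    nearest = np.argsort(distances, kind="stable")[:k]
    # Count how often each class appears among the k nearest rows.
    labels, counts = np.unique(np.asarray(train_classes)[nearest],
                               return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: three points near the origin, two far away.
train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]])
classes = np.array(['Country', 'Country', 'Country', 'Hip-hop', 'Hip-hop'])

prediction = knn_classify([0.2, 0.1], train, classes, 3)
print(prediction)  # Country
```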

In [None]:
# Question 1.2.2 
# Assign grandpa_genre to the genre predicted by your classifier
#  for the song "Grandpa Got Runned Over By A John Deere"
#  in the test set, using 9 neighbors and using your 20 features

grandpa_features = test_lyrics.where('Title', 'Grandpa Got Runned Over By A John Deere').select(my_20_features).row(0)
grandpa_genre = classify(grandpa_features, train_20, train_lyrics.column('Genre'), 9)
grandpa_genre

In [None]:
# Question 1.2.3 
# Create a classification function that takes as its argument a row containing your 20 features
# and classifies that row using the 5-nearest neighbors algorithm with train_20 as its training set.

def classify_one_argument(row):
    return classify(row, train_20, train_lyrics.column('Genre'), 5)

# When you're done, this should produce 'Hip-hop' or 'Country'.
classify_one_argument(test_20.row(0))

1.3. Evaluating your classifier

In [None]:
# Question 1.3.1
# Use classify_one_argument with the table method apply to classify every song in the test set.
# Name these guesses test_guesses. Then compute the proportion of correct classifications.

test_guesses = test_20.apply(classify_one_argument)
proportion_correct = np.count_nonzero(test_guesses == test_lyrics.column('Genre')) / test_lyrics.num_rows
proportion_correct
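
The accuracy computation compares the guesses with the true labels element by element and counts the matches. A small sketch on made-up labels (the arrays `guesses` and `actual` are hypothetical) shows that `np.count_nonzero(...) / n` gives the same value as the mean of the boolean comparison.

```python
import numpy as np

# Hypothetical predictions and true labels for four songs.
guesses = np.array(['Country', 'Hip-hop', 'Country', 'Country'])
actual = np.array(['Country', 'Hip-hop', 'Hip-hop', 'Country'])

# Element-wise comparison yields a boolean array; True marks a correct guess.
matches = guesses == actual

proportion_correct = np.count_nonzero(matches) / len(actual)
same_value = np.mean(matches)  # True counts as 1, False as 0

print(proportion_correct)  # 0.75
```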