## KNN implementation to classify song genre 
### Tasneem Abed (1408535)
#### COMS4030A project 2019

In this notebook, a dataset containing 3000 instances of songs and their lyrics is used to train a K Nearest Neighbour algorithm in order to predict the genre of the song. There are 4 genres, namley country, pop, rap and rock. The data is split into 80% for training and 20% for testing. There are 5 features that are used: Number of words per line(WPL), number of unique words per line(UWPL), token ration which is the ratio of the unique words to the total number of words(Token ratio), Mean word length, and the total number of words in the song(Total).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
from sklearn.model_selection import train_test_split
import operator

%matplotlib inline

In [2]:
#Load dataset

data = pd.read_csv("FinalNorm.csv")

In [3]:
train = data.copy()

In [4]:
#Drop unnecessary columns
trainData = train.drop(['song','artist','lyrics'], axis=1);

In [5]:
leftover, newtraining = train_test_split(trainData, test_size=0.4)
newtraining.shape

(1200, 6)

In [6]:
newtraining.head()

Unnamed: 0,WPL,Unique WPL,Token ration,Mean word length,Total,genre
580,0.416667,0.5,0.574803,0.422451,0.055633,country
438,0.583333,0.6,0.354167,0.400609,0.130274,country
1697,0.416667,0.4,0.384615,0.44119,0.261938,rap
757,0.416667,0.3,0.207675,0.348824,0.202133,pop
2618,0.083333,0.1,0.219858,0.230786,0.062123,rock


In [41]:
#Euclidean distance
def distanceEuclid(data1, data2, length):
    d = 0
    for x in range(length-1):
        d += np.square(data1[x] - data2[x])
    return(np.sqrt(d))

In [42]:
def KNN(train, test, K):
    distances = {}
    sort = {}
    length = test.shape[1] #Note that this will be number of features excluding genre
    
    for x in range(len(train)):
        dist = distanceEuclid(test, train.iloc[x], length)
        
        distances[x] = dist[0]
    
    sortedDistance = sorted(distances.items(), key=operator.itemgetter(1))
    
    neighbours = []
    for x in range(K):
        neighbours.append(sortedDistance[x][0])
    
    classVotes = {}
    
    for x in range(len(neighbours)):
        response = train.iloc[neighbours[x]][-1]
        
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
            
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return (sortedVotes[0][0], neighbours)

In [48]:
#Run
predictions = []
g = []

i = 0
while i < 5:
    print(i, ":")
    training, testing = train_test_split(newtraining, test_size=0.2)
    tester = testing.iloc[0:5].copy()
    for i in range(0, tester.shape[0]):
        g.append(tester.genre.iloc[i])
    acc = 0
    j1 = 0
    for row in tester.iterrows():
        temp = []
        index, data = row
        temp.append(data.tolist())
        y = data[-1]
        currTest = pd.DataFrame(temp)
        result, neighb = KNN(training, currTest, 7)
        predictions.append(result)
        print(j1, ":",y, result)
        j1 += 1
        if (result == y):
            acc += 1
    accuracy = (acc/tester.shape[0])*100
    print("Accuracy: ", accuracy, "%")
    file5 = open("KNNAccuracy.txt","a") #append mode
    file5.write(str(accuracy))
    file5.write("\n")
    file5.close()
    class_error = np.sum(g != predictions) / tester.shape[0] 
    print("Testing error: ", class_error)
    i += 1
#     break

0 :
0 : rock rock
1 : country country
2 : pop pop
3 : pop pop
4 : pop country
Accuracy:  80.0 %
Testing error:  0.2


In [47]:
#Remove WPL and run for 100 tests


predictions1 = []
i = 0
while i < 5:
    print(i)
    training, testing = train_test_split(newtraining, test_size=0.2)
    training1 = training.drop(['Total'], axis=1)
    testing1 = testing.drop(['Total'], axis=1)
    tester1 = testing1.iloc[0:5].copy()
    if (i == 0):
        print(tester1.shape)
    acc1 = 0
    acc3 = 0
    acc5 = 0
    j = 0
    for row in tester1.iterrows():
        temp = []
        index, data = row
        temp.append(data.tolist())
        y = data[-1]
        currTest = pd.DataFrame(temp)
        result1, neighb1 = KNN(training1, currTest, 5)
        predictions1.append(result1)
        print(j, ":",y, result1)
        j += 1
        if (result1 == y):
            acc1 += 1
    accuracy1 = (acc1/tester1.shape[0])*100
    print("Accuracy1: ", accuracy1, "%")
    
    file1 = open("Remove.txt","a") #append mode
    file1.write(str(accuracy1))
    file1.write("\n")
    file1.close()
    
    i += 1
#     break


0
(5, 5)
0 : country country
1 : country rap
2 : rock rock
3 : rock country
4 : rap country
Accuracy1:  40.0 %


In [49]:
co = 0
po = 0
ra = 0
ro = 0
for k in range(0,len(predictions1)):
    if (predictions1[k] == 'country'):
        co += 1
    elif (predictions1[k] == 'pop'):
        po += 1
    elif (predictions1[k] == 'rap'):
        ra += 1
    elif (predictions1[k] == 'rock'):
        ro += 1
print(co, po, ra, ro)

3 0 1 1
