## Analysis and Classification of Song dataset
This notebook contains the instructions for the mini project on classification for the course [Statistical Machine Learning](http://www.it.uu.se/edu/course/homepage/sml), 1RT700. The task is to classify a set of 200 songs, predicting whether Andreas Lindholm (the course instructor) would like each of them, with the help of a training data set of 750 songs.
We are expected to (i) try some (or all) of the classification methods from the course and evaluate their performance on the problem, and (ii) decide which one to use and 'put in production' by uploading our predictions to [this](http://www.it.uu.se/edu/course/homepage/sml/project/submit/) website.

In [None]:
import pandas as pd
import seaborn
import numpy as np
from matplotlib import pyplot as plt

## Dataset Visualisation and Number of features
Make sure to check the paths before reading the training and test files. The training dataset consists of 750 samples with 13 features and an associated label (14 columns in total). The test dataset consists of 200 samples for which we need to predict the label.

In [None]:
train_path = 'data/training_data.csv'
test_path = 'data/songs_to_classify.csv'

In [None]:
df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)

print('Total number of samples in training dataset: \t%s' % df_train.shape[0])
print('Total number of columns (features + label): \t%s' % len(df_train.columns.values))
print('Total number of samples in test dataset: \t%s' % df_test.shape[0])

In [None]:
df_train.head()

In [None]:
df_train.describe()

In [None]:
# Returns, for each column of the dataframe, a boolean telling whether it contains any null value
def null_column(df):
    return df.isnull().any()

The result returned by the null_column function for the training dataset shows that no column contains NaN values.

In [None]:
null_column(df_train)

## Data Normalization
Normalization refers to the process of standardizing the values of the independent features of a dataset. Since many machine learning techniques use a distance to measure the difference between two or more samples, a feature with a broad range of values will dominate that computation. To avoid this, the ranges of all features are normalized so that each feature contributes approximately proportionately.
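
As a minimal sketch of the idea, the standardization can be written out by hand with NumPy. The numbers below are made up for illustration and are not taken from the song data. The point is that the mean and standard deviation are estimated on the training set only and then reused for the test set.

```python
import numpy as np

# Toy data (made-up values): two features on very different scales.
train = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])
test = np.array([[2.0, 2000.0]])

# Statistics are estimated on the training data only ...
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0)

# ... and then applied unchanged to both sets.
train_z = (train - mean) / std
test_z = (test - mean) / std

print(train_z.mean(axis=0))  # approximately 0 for each feature
print(train_z.std(axis=0))   # 1 for each feature
print(test_z)
```

After this transformation both features live on a comparable scale, so neither dominates a distance computation simply because of its units.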

In [None]:
from sklearn.preprocessing import normalize
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
# Separate features from their labels
features = [x != 'label' for x in df_train.columns.values]
train_values = df_train.loc[:, features].values
train_labels = df_train.loc[:,['label']].values
test_values = df_test.values

In [None]:
# Standardize each feature to zero mean and unit variance, using statistics from the training set only
#train_values = normalize(train_values, axis=0,norm='max')
scaler = StandardScaler()
scaler.fit(train_values)
train_values_n = scaler.transform(train_values)
test_values_n = scaler.transform(test_values)

In [None]:
pd.DataFrame(train_values_n).describe()

In [None]:
pd.DataFrame(test_values_n).describe()

# Correlation Check

We compute the correlation among all the attributes in the dataset. We use the pandas corr() function, which gives the pairwise correlation coefficient between the columns of the dataframe passed to it. If the value is close to 1 or -1, we say that the two columns are strongly correlated.

In [None]:
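corr = pd.DataFrame(train_values_n).corr()
# Column pairs with correlation above 0.65 (each pair appears twice, as {x: y} and {y: x})
print([{x:y} for x in range(0,13) for y in range(0,13) if corr.iloc[x,y]>0.65 and x!=y])
# Column pairs with correlation below -0.65
print([{x:y} for x in range(0,13) for y in range(0,13) if corr.iloc[x,y]<-0.65 and x!=y])
print("Hence, these columns are correlated")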

In [None]:
corr.style.background_gradient()
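
The pair listing above reports every correlated pair twice. A possible alternative, sketched here on synthetic data rather than on the song features, walks only the upper triangle of the correlation matrix so that each pair is listed once together with its coefficient. The 0.65 cut-off is the one used above; everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not the song data): column 1 is built from column 0,
# column 2 is independent noise.
a = rng.normal(size=500)
X = np.column_stack([a, a + 0.1 * rng.normal(size=500), rng.normal(size=500)])

corr = np.corrcoef(X, rowvar=False)   # 3 x 3 correlation matrix
threshold = 0.65                      # same cut-off as in the cell above

# np.triu_indices with k=1 walks the strict upper triangle, so each pair
# (i, j) with i < j is visited exactly once and the diagonal is skipped.
rows, cols = np.triu_indices(corr.shape[0], k=1)
pairs = [(int(i), int(j), round(float(corr[i, j]), 2))
         for i, j in zip(rows, cols) if abs(corr[i, j]) > threshold]
print(pairs)
```

Using the absolute value handles strong positive and strong negative correlation in a single pass.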

## Class Imbalance

Class imbalance refers to the phenomenon where some classes (labels) of a dataset have more samples than others. This is a problem because the machine learning algorithms will tend to focus on the classification of the samples that are overrepresented while ignoring or misclassifying the underrepresented samples.

In [None]:
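# Look the counts up by label instead of relying on the sort order of value_counts()
counts = df_train.label.value_counts()
ones, zeros = counts.loc[1], counts.loc[0]

print('Proportion of label 1 in training dataset is {:.3f}'.format(ones/df_train.shape[0]))
print('Proportion of label 0 in training dataset is {:.3f}'.format(zeros/df_train.shape[0]))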

From the output above we can see that the dataset is mildly imbalanced: label 1 has 452 samples, while label 0 has only 298. The imbalance is moderate for a binary classification problem, but we can still check whether this difference in the number of samples affects our results.
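
To put numbers on the imbalance, the sketch below computes the class proportions from the counts quoted above and, as one possible countermeasure, the 'balanced' class weights `n_samples / (n_classes * count)`. Whether such weights are actually needed here is an open question; the snippet only illustrates the arithmetic.

```python
import numpy as np

# Class counts as reported above: label 0 -> 298, label 1 -> 452.
counts = np.array([298, 452])
n_samples = counts.sum()
n_classes = len(counts)

proportions = counts / n_samples
# 'Balanced' weighting heuristic: n_samples / (n_classes * count).
weights = n_samples / (n_classes * counts)

for label in range(n_classes):
    print('label {}: proportion {:.3f}, weight {:.3f}'.format(
        label, proportions[label], weights[label]))
```

The rarer class receives a weight above 1 and the more frequent class a weight below 1, so that both classes contribute equally in total. Many scikit-learn classifiers accept `class_weight='balanced'`, which applies the same formula.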

# Confusion Matrix
Before working with the different machine learning methods which we'll use for classification, let's create a helper method that renders a confusion matrix for a given model's predictions. A confusion matrix is a table often used to analyze the performance of a classifier on samples for which the true values are known (we'll use it to analyze the performance of the machine learning methods on the validation set). Each row of the table represents the instances of an actual class, while each column represents the instances of a predicted class.
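
The row and column convention described above can be checked on a tiny hand-made example. The labels below are invented for illustration. The loop increments the cell whose row is the true class and whose column is the predicted class, which is the same layout that `sklearn.metrics.confusion_matrix` produces.

```python
import numpy as np

# Made-up labels for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

n_classes = 2
cm = np.zeros((n_classes, n_classes), dtype=int)
# Row index = actual class, column index = predicted class.
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm)
# [[1 1]
#  [1 3]]

# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```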

In [None]:
from sklearn.metrics import confusion_matrix
import itertools

def plot_confusion_matrix(cm, labels, normalize=True):
    '''
    Plot the confusion matrix cm (rows: true classes, columns: predicted classes).
    If normalize is True, each row is scaled to sum to 1 before plotting.
    '''
    # Accuracy is computed from the raw counts, before any normalization
    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    # Normalize each row so that the cells show per-class rates
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(6, 4))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title('Confusion matrix')
    plt.colorbar()

    # Draw ticks
    tick_marks = np.arange(len(labels))
    plt.xticks(tick_marks, labels, rotation=45)
    plt.yticks(tick_marks, labels)

    # Write each cell value, in white on dark cells and black on light ones
    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    fmt = "{:0.2f}" if normalize else "{:,}"
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, fmt.format(cm[i, j]),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.2f}; misclass={:0.2f}'.format(accuracy, misclass))
    plt.show()

In [None]:
labels_list = np.unique(df_train.loc[:,['label']].values)

# Methods
Methods to explore. We need to implement at least as many 'families' as there are group members, and explore at least one method from each 'family'.
- [ ] Logistic Regression
- [ ] Discriminant Analysis: LDA, QDA
- [ ] K-nearest neighbor
- [ ] Tree-based methods: classification trees, random forest, bagging
- [ ] Boosting


# PCA Analysis

Correlated features can have an adverse effect on the prediction problem. Earlier, we found that columns 0, 3 and 7 are correlated. Hence, we use the PCA class from sklearn.decomposition to tackle this problem.
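
Why PCA helps with correlated features can be seen in a small sketch on synthetic two-dimensional data (not the song features). PCA is computed here directly from the SVD of the centred data. After projecting onto the principal directions, the resulting components are uncorrelated by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, strongly correlated 2-D data (not the song data).
x = rng.normal(size=300)
X = np.column_stack([x, 0.9 * x + 0.1 * rng.normal(size=300)])

# PCA via SVD of the centred data: rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                      # data expressed in the PCA basis

print(np.corrcoef(X, rowvar=False)[0, 1])       # close to 1
print(np.corrcoef(scores, rowvar=False)[0, 1])  # close to 0
```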


In [None]:
from sklearn.decomposition import PCA

pcaComps=PCA(n_components=13)

In [None]:
pcaComps.fit(train_values_n)

# Fraction of the variance explained by each principal component
var = pcaComps.explained_variance_ratio_

# Cumulative explained variance, in percent
var1 = np.cumsum(np.round(pcaComps.explained_variance_ratio_, decimals=4)*100)

print(var1)

In [None]:
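# Start the x axis at 1 so that it reads as a number of components
plt.plot(range(1, len(var1) + 1), var1)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance (%)')
plt.show()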

Based on the cumulative variance curve above, we will use 9 PCA components.
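
One way to make such a choice reproducible is to fix a target for the cumulative explained variance and pick the smallest number of components that reaches it. The sketch below uses hypothetical variance ratios and an assumed target of 85%. Neither value comes from this notebook, where the ratios would instead be read from `pcaComps.explained_variance_ratio_`.

```python
import numpy as np

# Hypothetical explained-variance ratios for illustration; in the notebook
# these would come from pcaComps.explained_variance_ratio_.
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
target = 0.85   # assumed threshold, not taken from the notebook

cumulative = np.cumsum(ratios)
# Index of the first entry that reaches the target, plus one to turn the
# zero-based index into a component count.
n_components = int(np.argmax(cumulative >= target)) + 1
print(n_components)
```

With these made-up ratios the first three components explain about 80% of the variance and the first four about 90%, so four components are selected.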

In [None]:
pca = PCA(n_components=9)
train_values_pca = pca.fit_transform(train_values_n)
test_values_pca = pca.transform(test_values_n)

## LDA
### SVD

PCA and LDA are both linear transformation techniques. However, PCA is an unsupervised dimensionality reduction technique, while LDA is a supervised one.

PCA does not take the class labels into account. LDA tries to reduce the dimensions of the feature set while retaining the information that discriminates between the output classes. It projects the data points onto new dimensions in such a way that the class clusters are as well separated from each other as possible, while the individual elements within a cluster are as close to the cluster's centroid as possible. Used as a classifier, LDA yields linear decision boundaries between the classes.
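
For two classes, the projection described above has a closed form known as Fisher's linear discriminant: the direction is proportional to the inverse within-class scatter matrix applied to the difference of the class means. The following is a minimal sketch on synthetic Gaussian data (not the song data), and the midpoint threshold is a simplifying assumption that ignores class priors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic Gaussian classes (not the song data) sharing one covariance.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter: summed scatter of each class around its own mean.
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Fisher's direction: w is proportional to Sw^{-1} (mu1 - mu0).
w = np.linalg.solve(Sw, mu1 - mu0)
w /= np.linalg.norm(w)

# Project onto w and classify by the midpoint of the projected class means.
threshold = 0.5 * (mu0 @ w + mu1 @ w)
pred0 = X0 @ w > threshold   # True means "predicted class 1"
pred1 = X1 @ w > threshold
accuracy = (np.sum(~pred0) + np.sum(pred1)) / 400
print(accuracy)
```

Unlike the PCA directions, `w` depends on the labels: it is chosen to pull the class means apart relative to the spread inside each class.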



In [None]:
# Splitting the training data into a training set and a validation set
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(train_values_pca, train_labels, test_size = 0.2, random_state = 0)
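
Given the class imbalance noted earlier, one option worth considering is a stratified split, which keeps the class ratio roughly equal in the training and validation parts. The sketch below uses the `stratify` argument of `train_test_split` on a synthetic stand-in with the class counts reported earlier; the feature values are random placeholders and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the class counts reported earlier
# (452 ones, 298 zeros); the feature values are random placeholders.
rng = np.random.default_rng(0)
y = np.array([1] * 452 + [0] * 298)
X = rng.normal(size=(len(y), 9))

# stratify=y asks for (approximately) the same class ratio in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Fraction of label 1 overall, in the training part and in the validation part
print(y.mean(), y_tr.mean(), y_va.mean())
```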

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

lda = LinearDiscriminantAnalysis()
model_lda_svd=lda.fit(X_train, Y_train.flatten())
predicted_labels = model_lda_svd.predict(X_val)
model_lda_svd_acc = accuracy_score(Y_val, predicted_labels)
model_lda_svd_acc_cm = confusion_matrix(Y_val, predicted_labels)
plot_confusion_matrix(model_lda_svd_acc_cm, labels_list)


In [None]:
# LDA with the eigen solver and automatic (Ledoit-Wolf) shrinkage of the covariance estimate
lda = LinearDiscriminantAnalysis(solver='eigen', shrinkage='auto')
model_lda_eigen=lda.fit(X_train, Y_train.flatten())
predicted_labels = model_lda_eigen.predict(X_val)
model_lda_eigen_acc = accuracy_score(Y_val, predicted_labels)
model_lda_eigen_acc_cm = confusion_matrix(Y_val, predicted_labels)
plot_confusion_matrix(model_lda_eigen_acc_cm, labels_list)


# Logistic Regression

Now that all the data has been processed, we can apply logistic regression to the input.

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
model_lr = classifier.fit(X_train, Y_train.flatten())
predicted_labels = model_lr.predict(X_val)
model_lr_acc = accuracy_score(Y_val, predicted_labels)
model_lr_acc_cm = confusion_matrix(Y_val, predicted_labels)
plot_confusion_matrix(model_lr_acc_cm, labels_list)


## KNN

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import neighbors
from sklearn.model_selection import cross_val_score

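k_values = [2,3,4,5,6,7,8]

# Mean 5-fold cross-validation accuracy for each candidate k
val_accuracy = []
for k in k_values:
    kNNClassifier = neighbors.KNeighborsClassifier(n_neighbors = k)
    # ravel() turns the (n, 1) label column into the 1-D array scikit-learn expects
    accuracy = np.mean(cross_val_score(kNNClassifier, 
                                     train_values_pca, 
                                     y = train_labels.ravel(), 
                                     cv = 5, 
                                     n_jobs = -1))
    val_accuracy.append(accuracy)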

In [None]:
print('Best K value: {}'.format(k_values[np.argmax(val_accuracy)]))

In [None]:
kNNClassifier = neighbors.KNeighborsClassifier(n_neighbors=3).fit(X_train, Y_train.flatten())
predicted_labels = kNNClassifier.predict(X_val)
knn_pca_acc = accuracy_score(Y_val, predicted_labels)
knn_pca_cm = confusion_matrix(Y_val, predicted_labels)
plot_confusion_matrix(knn_pca_cm, labels_list)

# Tree based methods
Tree based methods for classification

# Classification tree

In [None]:
from sklearn import tree

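dtClassifier = tree.DecisionTreeClassifier(max_depth=2)
# The split above defines Y_train (capital Y); flatten it to a 1-D label array
dtClassifier.fit(X_train, Y_train.flatten())

import os
import graphviz
# Windows-specific: make the Graphviz binaries discoverable; adjust or remove on other setups
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'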

dot_data = tree.export_graphviz(dtClassifier, out_file=None,
#                                 feature_names = df_train.copy().drop(columns=["label"]).columns,
#                                 class_names = model.classes_, 
                                filled=True, rounded=True, 
                                leaves_parallel=True, proportion=True) 
graph = graphviz.Source(dot_data) 
graph

In [None]:
#validation_labels = val_label.flatten()
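predicted_labels = dtClassifier.predict(X_val)
# Flatten Y_val so the comparison is elementwise rather than broadcast to an (n, n) matrix
print('Accuracy rate is %.2f' % np.mean(predicted_labels == Y_val.flatten()))
pd.crosstab(predicted_labels, Y_val.flatten())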

dt_pca_acc = accuracy_score(Y_val, predicted_labels)
dt_pca_cm = confusion_matrix(Y_val, predicted_labels)
plot_confusion_matrix(dt_pca_cm, labels_list)

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X=X_train, y=Y_train.flatten())

predicted_labels = model.predict(X_val)
rf_pca_acc = accuracy_score(Y_val, predicted_labels)
rf_pca_cm = confusion_matrix(Y_val, predicted_labels)
plot_confusion_matrix(rf_pca_cm, labels_list)

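## Boosting

In [None]:
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV

# Boosting is run on the original (unscaled, non-PCA) features
X_train_tree, X_val_tree, Y_train_tree, Y_val_tree = train_test_split(train_values, train_labels, test_size = 0.2, random_state = 0)

parameters = {
    "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
    "max_depth":[3,4,5,6],
    "max_features":["log2","sqrt"],
    "n_estimators":[50,80,100,150,120,200]
    }

clf = GridSearchCV(ensemble.GradientBoostingClassifier(), parameters, cv=10, n_jobs=-1)

clf.fit(X_train_tree, Y_train_tree.ravel())
# Note: this is the accuracy on the training data, not a validation estimate
print(clf.score(X_train_tree, Y_train_tree.ravel()))
print(clf.best_params_)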

In [None]:
from sklearn import ensemble

X_train_tree, X_val_tree, Y_train_tree, Y_val_tree = train_test_split(train_values, train_labels, test_size = 0.2, random_state = 0)


n_estimators = 80

#ind2=[x for x in range(14) if x not in [0,7,3]]

gb = ensemble.GradientBoostingClassifier(n_estimators=n_estimators, random_state=0,
                                         learning_rate=0.1, max_depth=4, max_features=3)
gb.fit(X_train_tree, Y_train_tree.ravel())

predicted_labels = gb.predict(X_val_tree)
#print(predicted_labels)

gb_acc = accuracy_score(Y_val_tree, predicted_labels)
gb_cm = confusion_matrix(Y_val_tree, predicted_labels)
plot_confusion_matrix(gb_cm, labels_list)