**Machine Learning Basic Principles 2018 - Data Analysis Project Report**

*All the text in italics is instructions for filling the template - remove when writing the project report!*

# *Title* 

*The title should be concise and informative, and should describe the approach used to solve the problem. Some good titles from previous years:*

*- Comparing extreme learning machines and naive Bayes classifier in spam detection*

*- Using linear discriminant analysis in spam detection*

*Some not-so-good titles:*

*- Bayesian spam filtering with extras*

*- Two-component classifier for spam detection*

*- CS-E3210 Term Project, final report*




## Abstract

*Precise summary of the whole report, previews the contents and results. Must be a single paragraph between 100 and 200 words.*



## 1. Introduction

*Background, problem statement, motivation, many references, description of
contents. Introduces the reader to the topic and the broad context within which your
research/project fits.*

*- What do you hope to learn from the project?*
*- What question is being addressed?*
*- Why is this task important? (motivation)*

*Keep it short (half to 1 page).*



## 2. Data analysis

*Briefly describe the data (class distribution, dimensionality) and how it will affect
classification. Visualize the data. Don’t focus too much on the meaning of the features,
unless you want to.*

*- Include histograms showing class distribution.*



In [56]:
from math import log
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy import linalg as la
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

In [2]:
###### DATA

def get_train_set_splited(df_train_data, df_train_labels, test_size = 1./7.):
    df_train_set, df_val_set, df_train_lbl, df_val_lbl = train_test_split(df_train_data, df_train_labels, test_size = test_size, random_state = 0)
    return df_train_set.values, df_train_lbl.values, df_val_set.values, df_val_lbl.values


def remove_columns(df_data, indexes):
    new_values = np.delete(df_data.values, indexes, 1)
    return pd.DataFrame(new_values)

In [3]:
######### PCA
def make_tuples(first_array, second_array):
    n = len(first_array)
    
    tuples = []
    for i in range(n):
        tuples.append((first_array[i], second_array[i]))
        
    return tuples

def get_pca(raw_data, number_of_components):
    N = raw_data.shape[0]
    # Q equals the sample covariance matrix only if each column of raw_data has zero mean.
    Q = (1./N) * np.dot(raw_data.transpose(), raw_data)

    # Q is symmetric, so eigh is the appropriate solver: it returns real
    # eigenvalues (in ascending order) with the eigenvectors as columns.
    eigenvalues, eigenvectors = la.eigh(Q)

    # Reorder from the largest to the smallest eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Rows of pca are the leading eigenvectors (principal directions).
    pca = eigenvectors[:, :number_of_components].transpose()

    return pca, eigenvalues
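
The number of components is easier to justify from the eigenvalue spectrum than from a fixed value. The following is a minimal sketch, not part of the original notebook: `components_for_variance` is a hypothetical helper, the 0.99 threshold is arbitrary, and the data are centered first, which `get_pca` does not do.

```python
import numpy as np

def components_for_variance(data, threshold=0.95):
    """Smallest number of principal components whose eigenvalues cover
    at least `threshold` of the total variance (hypothetical helper)."""
    centered = data - data.mean(axis=0)            # assumption: center each feature
    cov = centered.T @ centered / centered.shape[0]
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]    # eigvalsh returns ascending order
    eigenvalues = np.clip(eigenvalues, 0.0, None)  # remove tiny negative round-off
    ratio = np.cumsum(eigenvalues) / eigenvalues.sum()
    # First position where the cumulative ratio reaches the threshold.
    k = min(int(np.searchsorted(ratio, threshold)) + 1, len(ratio))
    return k, ratio

# Toy data: three features, the third is almost a copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
toy = np.column_stack([base[:, 0], base[:, 1],
                       base[:, 0] + 0.01 * rng.normal(size=500)])

k, ratio = components_for_variance(toy, threshold=0.99)
print("components kept:", k)
print("cumulative explained variance:", ratio)
```

On the toy data two components are enough, because the third feature carries almost no independent variance.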

In [50]:
###### LABELS

def get_labeled_array(label, labels):
    N = labels.shape[0]
    y = np.zeros((N, 1))
        
    for index in range(N):
        if labels[index][0] == label: 
            y[index][0] = 1
    
    return y

def get_predicted_labels(probabilities):
    N = probabilities.shape[0]
    predicted_labels = np.zeros((N, 1))

    for i in range(N):
        index_of_max_value = np.argmax(probabilities[i])
        predicted_labels[i] = index_of_max_value + 1
        
    return predicted_labels

def calculate_accuracy(actual_labels, predicted_labels):
    # Both arrays must have the same shape, e.g. (N, 1); returns a percentage.
    correct = np.count_nonzero(actual_labels == predicted_labels)
    return 100.0 * correct / actual_labels.size

def calculate_log_loss_accuracy(actual_labels, probabilities, eps=1e-15):
    # Multiclass log loss: mean negative log probability of the true class.
    # Labels start at 1, so label k uses column k - 1.
    labels = np.asarray(actual_labels).ravel().astype(int)
    N = labels.size
    true_class_probabilities = probabilities[np.arange(N), labels - 1]
    # Clip to avoid log(0) and the log of negative scores.
    return -np.mean(np.log(np.clip(true_class_probabilities, eps, 1.0)))

In [5]:
######### LOGISTIC REGRESSION

def sigmoid(z):
    return np.divide(1, (1 + np.exp((-1) * z)))


def derivative_sigmoid(z):
    return np.multiply(np.exp(-z), np.power(sigmoid(z), 2))


def gradient_logistic_regression(X, y, w):
    N = X.shape[0]
    z = np.dot(X, w)

    # With sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), the gradient of the
    # logistic loss simplifies to (1/N) * X^T (sigmoid(z) - y).
    return (1. / N) * np.dot(np.transpose(X), sigmoid(z) - y)


def empirical_risk_logistic_regression(X, y, w):
    z = np.dot(X, w)

    first_term = np.multiply(y, np.log(sigmoid(z)))
    second_term = np.multiply(1 - y, np.log(1 - sigmoid(z)))

    # Named `total` to avoid shadowing the built-in sum().
    total = first_term + second_term
    return (-1) * np.mean(total)


def predicted_probabilities_logistic_regresion(X, w):
    y = np.dot(X, w)
    return sigmoid(y)
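
Because the sigmoid satisfies sigmoid'(z) = sigmoid(z) (1 - sigmoid(z)), the gradient of the logistic loss reduces to (1/N) X^T (sigmoid(Xw) - y). The sketch below checks that closed form against a central finite-difference approximation on random toy data. The function names and the toy data are illustrative and not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(X, y, w):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def logistic_gradient(X, y, w):
    # Simplified form: (1/N) * X^T (sigmoid(Xw) - y)
    return X.T @ (sigmoid(X @ w) - y) / X.shape[0]

def numerical_gradient(f, w, eps=1e-6):
    # Central differences, one coordinate of w at a time.
    grad = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = (rng.random(size=(50, 1)) < 0.5).astype(float)
w = rng.normal(size=(4, 1))

analytic = logistic_gradient(X, y, w)
numeric = numerical_gradient(lambda v: logistic_loss(X, y, v), w)
max_diff = np.max(np.abs(analytic - numeric))
print("largest difference between the two gradients:", max_diff)
```

A difference close to machine precision suggests the analytic gradient is implemented correctly. The same check can be pointed at any `gradient`/`empirical_risk` pair.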

In [6]:
###### LINEAR REGRESSION

def gradient_linear_regression(X, y, w):
    N = X.shape[0]
    X_transposed = np.transpose(X)
    final_matrix = np.dot(X_transposed, y - np.dot(X, w))
    gradient = (-2. / N) * final_matrix
    return gradient

def empirical_risk_linear_regression(X, y, w):
    # Mean squared error; the name avoids shadowing the built-in sum().
    squared_errors = np.power((y - np.dot(X, w)), 2)
    return np.mean(squared_errors)

In [7]:
###### REGRESSION

def regression(X, y, step_size, iterations, gradient, empirical_risk):
    d = X.shape[1]
    w = np.zeros((d, 1))
    loss_list = []

    for i in range(iterations):
        grad = gradient(X, y, w)
        w = w - step_size * grad
        loss_list.append(empirical_risk(X, y, w))

    return loss_list, w

def get_w_opt_regression(X, train_labels_values, gradient, empirical_risk, step_size = 1e-5, iterations = 3000):
    N = X.shape[0]
    d = X.shape[1]
    labels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    quantity_of_labels = len(labels)

    y = np.zeros((N, quantity_of_labels))
    w = np.zeros((d, quantity_of_labels))
    
    for i in range(quantity_of_labels):
        y_subproblem = get_labeled_array(i+1, train_labels_values)
        loss_list_subproblem, w_subproblem = regression(X, y_subproblem, step_size, iterations, gradient, empirical_risk)
    
        w[:, i:i+1] = w_subproblem
        y[:, i:i+1] = y_subproblem
    
    return w
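
One way to sanity-check the step size and iteration count is to compare gradient descent on the squared loss with the least-squares solution from `np.linalg.lstsq`. This is a minimal sketch on synthetic data: `gradient_descent` mirrors the update in `regression` but is a standalone illustration, and the step size and iteration count are chosen for the toy problem only.

```python
import numpy as np

def gradient_descent(X, y, step_size, iterations):
    # Same update rule as in the notebook: w <- w - step_size * gradient.
    w = np.zeros((X.shape[1], 1))
    for _ in range(iterations):
        grad = (-2.0 / X.shape[0]) * X.T @ (y - X @ w)
        w = w - step_size * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([[1.0], [-2.0], [0.5]])
y = X @ true_w + 0.1 * rng.normal(size=(200, 1))

w_gd = gradient_descent(X, y, step_size=0.1, iterations=500)
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # direct least-squares solution

gap = np.max(np.abs(w_gd - w_ls))
print("largest weight difference:", gap)
```

If the gap stays large on the real data, the step size is probably too small for the number of iterations, or too large for the iteration to be stable.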

In [74]:
#### GET DATA
df_data = pd.read_csv('train_data.csv', header=None)
print("Train data shape is ", df_data.shape)

df_labels = pd.read_csv('train_labels.csv', header=None)
print("Train label shape is ", df_labels.shape)

df_test_data = pd.read_csv('test_data.csv', header=None)
print("Test data shape is ", df_test_data.shape)

grouped = df_labels.groupby(by=0)
for name, group in grouped:
    print("Label %d has %d songs" % (name, group.size))

Train data shape is  (4363, 264)
Train label shape is  (4363, 1)
Test data shape is  (6544, 264)
Label 1 has 2178 songs
Label 2 has 618 songs
Label 3 has 326 songs
Label 4 has 253 songs
Label 5 has 214 songs
Label 6 has 260 songs
Label 7 has 141 songs
Label 8 has 195 songs
Label 9 has 92 songs
Label 10 has 86 songs
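
The counts above can be turned into class shares to make the imbalance explicit. The sketch below uses only the printed counts and draws a text histogram. In the notebook, `plt.bar(list(shares), list(shares.values()))` would draw the same data with matplotlib.

```python
# Class counts copied from the output above (label -> number of songs).
class_counts = {1: 2178, 2: 618, 3: 326, 4: 253, 5: 214,
                6: 260, 7: 141, 8: 195, 9: 92, 10: 86}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Text histogram: one '#' per 2% of the training set.
for label, share in shares.items():
    bar = "#" * round(50 * share)
    print("Label %2d %5.1f%% %s" % (label, 100 * share, bar))

# Ratio between the largest and the smallest class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
print("Largest class is %.1f times the smallest" % imbalance_ratio)
```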


In [75]:
#### CLEAN DATA
columns_to_remove = [216, 217, 218, 219]
df_data = remove_columns(df_data, columns_to_remove)
df_test_data = remove_columns(df_test_data, columns_to_remove)

train_data, train_labels, val_data, val_labels = get_train_set_splited(df_data, df_labels, test_size = 0.25)

In [66]:
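#### STANDARDIZATION (alternative to the NORMALIZATION cell below; run only one of the two, otherwise val_data is scaled twice)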
scaler = StandardScaler()
scaler.fit(pd.DataFrame(train_data))
X = scaler.transform(pd.DataFrame(train_data))
val_data = scaler.transform(pd.DataFrame(val_data))
test_data = scaler.transform(df_test_data)

In [76]:
#### NORMALIZATION
scaler = MinMaxScaler()

df_train_data = pd.DataFrame(train_data)

scaler.fit(df_train_data)

X = scaler.transform(df_train_data)
val_data = scaler.transform(pd.DataFrame(val_data))
test_data = scaler.transform(df_test_data)

In [77]:
#### APPLY PCA
number_of_pca_components = 200

pca, eigenvalues = get_pca(X, number_of_pca_components)
X = np.dot(X, pca.transpose())
val_data = np.dot(val_data, pca.transpose())
test_data = np.dot(test_data, pca.transpose())

In [19]:
#### RUN LOGISTIC REGRESSION
w_opt = get_w_opt_regression(X, train_labels, gradient_logistic_regression, empirical_risk_logistic_regression,
                             step_size = 0.1, iterations = 3000)

probabilities = predicted_probabilities_logistic_regresion(val_data, w_opt)
predicted_labels = get_predicted_labels(probabilities)
accuracy = calculate_accuracy(val_labels, predicted_labels)
# log_loss_accuracy = calculate_log_loss_accuracy(val_labels, probabilities)

print("Accuracy of logistic regression is ", accuracy)
# print("Log loss of logistic regression is ", log_loss_accuracy)


In [78]:
#### RUN LINEAR REGRESSION
w_opt = get_w_opt_regression(X, train_labels, gradient_linear_regression, empirical_risk_linear_regression,
                             step_size = 0.01, iterations = 10000)

probabilities = np.dot(val_data, w_opt)
predicted_labels = get_predicted_labels(probabilities)
accuracy = calculate_accuracy(val_labels, predicted_labels)
# log_loss_accuracy = calculate_log_loss_accuracy(val_labels, probabilities)

print("Accuracy of linear regression is ", accuracy)
# print("Log loss of linear regression is ", log_loss_accuracy)

In [20]:
#### PROBABILITIES FOR LOGISTIC REGRESSION (w_opt must come from the logistic regression cell)
test_probabilities = predicted_probabilities_logistic_regresion(test_data, w_opt)

In [81]:
#### PROBABILITIES FOR LINEAR REGRESSION (w_opt must come from the linear regression cell)
test_probabilities = np.dot(test_data, w_opt)

In [82]:
#### GETTING PREDICTED LABELS FOR TEST DATA
predicted_test_labels = get_predicted_labels(test_probabilities).astype(int)


#### FILE FOR ACCURACY CHALLENGE
ids = np.arange(1, len(predicted_test_labels) + 1)
df_submission_accuracy = pd.DataFrame({"Sample_id" : ids, "Sample_label" : predicted_test_labels.flatten()})
df_submission_accuracy.to_csv("accuracy_challenge.csv", index=False)

#### FILE FOR LOGLOSS CHALLENGE
ids = np.arange(1, len(predicted_test_labels) + 1)

# One column per class: Class_1 ... Class_10, in that order.
logloss_columns = {"Sample_id": ids}
for class_index in range(test_probabilities.shape[1]):
    logloss_columns["Class_%d" % (class_index + 1)] = test_probabilities[:, class_index]

df_submission_logloss = pd.DataFrame(logloss_columns)
df_submission_logloss.to_csv("logloss_challenge.csv", index=False)
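
One-vs-rest sigmoid outputs are computed independently per class, and linear-regression scores can be negative, so the rows of `test_probabilities` are generally not probability vectors. If the log-loss evaluation expects each row to be a distribution over the ten classes (an assumption, since the submission rules are not stated here), the scores could be clipped and renormalized first. `normalize_scores` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def normalize_scores(scores, eps=1e-15):
    """Turn per-class scores into rows that are valid probability vectors.

    Hypothetical helper: clips every score to at least `eps`, then divides
    each row by its sum. The ranking of classes within a row is preserved
    for scores above `eps`.
    """
    clipped = np.clip(scores, eps, None)
    return clipped / clipped.sum(axis=1, keepdims=True)

# Toy scores for two samples and three classes.
scores = np.array([[0.9, 0.4, -0.1],
                   [0.2, 0.2, 0.2]])
probabilities = normalize_scores(scores)
print(probabilities)
print(probabilities.sum(axis=1))
```

Row normalization is only one option. A softmax over the scores would also give valid rows, but it changes the relative weights differently.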

In [None]:
# Load the data and cleanup


In [None]:
#Analysis of the input data
# ...

## 3. Methods and experiments

*- Explain your whole approach (you can include a block diagram showing the steps in your process).* 

*- What methods/algorithms were used, and why were they chosen?*

*- What evaluation methodology was used (cross-validation, etc.)?*
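
With classes as imbalanced as in this dataset, a plain random split can leave a fold with very few samples of a rare class. The sketch below shows one possible stratified k-fold split in NumPy. `stratified_k_fold` is a hypothetical helper written for illustration. scikit-learn provides `StratifiedKFold` in `sklearn.model_selection` for the same purpose.

```python
import numpy as np

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs with similar class ratios per fold."""
    labels = np.asarray(labels).ravel()
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for label in np.unique(labels):
        # Shuffle the indices of this class, then split them into k
        # nearly equal chunks, one chunk per fold.
        idx = rng.permutation(np.flatnonzero(labels == label))
        for fold_id, chunk in enumerate(np.array_split(idx, k)):
            folds[fold_id].extend(chunk.tolist())
    for fold_id in range(k):
        val_idx = np.array(sorted(folds[fold_id]), dtype=int)
        train_idx = np.array(
            sorted(i for j in range(k) if j != fold_id for i in folds[j]),
            dtype=int)
        yield train_idx, val_idx

# Toy labels: 10 samples of class 1, 5 of class 2, 5 of class 3.
toy_labels = np.array([1] * 10 + [2] * 5 + [3] * 5)
splits = list(stratified_k_fold(toy_labels, k=5))

for train_idx, val_idx in splits:
    print("validation labels:", toy_labels[val_idx])
```

Each validation fold in the toy example contains two samples of class 1 and one each of classes 2 and 3, matching the overall 2:1:1 ratio.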



In [None]:
# Trials with ML algorithms

## 4. Results

*Summarize the results of the experiments without discussing their implications.*

*- Include both performance measures (accuracy and LogLoss).*

*- How does it perform on Kaggle compared to the training data?*

*- Include a confusion matrix.*
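
A confusion matrix can be built directly from the label arrays used in this notebook, which have shape (N, 1) and labels starting at 1. The sketch below is one possible implementation on toy labels. `confusion_matrix` here is a local illustrative function, while scikit-learn offers `sklearn.metrics.confusion_matrix` as a ready-made alternative.

```python
import numpy as np

def confusion_matrix(actual, predicted, num_classes):
    """Rows are true labels, columns are predicted labels (labels start at 1)."""
    actual = np.asarray(actual, dtype=int).ravel()
    predicted = np.asarray(predicted, dtype=int).ravel()
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    # add.at accumulates correctly when the same (row, column) pair repeats.
    np.add.at(matrix, (actual - 1, predicted - 1), 1)
    return matrix

# Toy example with three classes and six samples.
actual = np.array([[1], [1], [2], [3], [3], [3]])
predicted = np.array([[1], [2], [2], [3], [1], [3]])

cm = confusion_matrix(actual, predicted, num_classes=3)
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print("recall per class:", per_class_recall)
```

The diagonal holds the correctly classified samples, so per-class recall is the diagonal divided by the row sums. This is informative when one class dominates the overall accuracy.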



In [None]:
#Confusion matrix ...

## 5. Discussion/Conclusions

*Interpret and explain your results.*

*- Discuss the relevance of the performance measures (accuracy and LogLoss) for
imbalanced multiclass datasets.*

*- How do the results relate to the literature?*

*- Suggestions for future research/improvement.*

*- Did the study answer your questions?*
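
When discussing accuracy and LogLoss on an imbalanced dataset, it helps to compare results against trivial baselines. The sketch below uses the training class counts printed in the data analysis section and computes three reference values: accuracy of always predicting the largest class, the corresponding balanced accuracy, and the log loss of always predicting the class priors. The baselines are illustrative and evaluated on the training distribution itself.

```python
import math

# Training-set class counts as printed in the data analysis section.
class_counts = [2178, 618, 326, 253, 214, 260, 141, 195, 92, 86]
total = sum(class_counts)
priors = [count / total for count in class_counts]

# Baseline 1: always predict the most frequent class.
majority_accuracy = max(priors)
# Its recall is 1 for the majority class and 0 for every other class,
# so the mean per-class recall (balanced accuracy) is 1 / number of classes.
balanced_accuracy = 1.0 / len(class_counts)

# Baseline 2: always output the class priors as probabilities.
# On the same data its log loss equals the entropy of the priors.
prior_log_loss = -sum(p * math.log(p) for p in priors)

# A uniform prediction gives log(number of classes) for any class balance.
uniform_log_loss = math.log(len(class_counts))

print("majority-class accuracy: %.3f" % majority_accuracy)
print("balanced accuracy:       %.3f" % balanced_accuracy)
print("log loss with priors:    %.3f" % prior_log_loss)
print("log loss with uniform:   %.3f" % uniform_log_loss)
```

A classifier whose accuracy is near the majority baseline, or whose log loss is near the prior baseline, has learned little beyond the class frequencies.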



## 6. References

*List of all the references cited in the document*

## Appendix
*Any additional material needed to complete the report can be included here, for example additional source code, images or plots, mathematical derivations, etc. The content should be relevant to the report and should help explain or visualize something mentioned earlier. **You can remove the whole Appendix section if there is no need for it.***