<a href="https://colab.research.google.com/github/dsvalencias/MachineLearning/blob/main/Assignment_3_Kernels_and_SVMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 3: Kernels and SVMs, Machine Learning 2021**
> *ml-assign3-diagarciaar-lahiguarans-dsvalencias.ipynb*

> García Arenas, Diego Alejandro - diagarciaar@unal.edu.co

> Higuaran Serrano, Luis Alejandro - lahiguarans@unal.edu.co

> Valencia Salazar, Dave Sebastian - dsvalencias@unal.edu.co


In [None]:
#import list
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## 1. Train an SVM for detecting whether a word belongs to English or Spanish

### (a) Build training and test data sets. You can use the most frequent words in http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists. Consider words at least 4 characters long and ignore accents.




*   English: the initial data set is the list of the 10,000 most common English words from Project Gutenberg (2006).
*   Spanish: the initial data set is the list of the 10,000 most common Spanish words from the RAE.

We removed special characters in both languages and dropped all words with fewer than four (4) characters before uploading the data set to GitHub.
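The cleaning step described above can be sketched as follows. This is a minimal illustration, not the script that produced the uploaded CSV: the helper names `normalize_word` and `clean_words` are hypothetical, and the sketch assumes that "special characters" means accents are stripped and that words still containing non-letters are dropped.

```python
import unicodedata

def normalize_word(word):
    """Lowercase a word and strip accents (e.g. 'Canción' -> 'cancion')."""
    decomposed = unicodedata.normalize("NFKD", word.lower().strip())
    # Combining marks (accents, tildes) are removed; base letters are kept.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean_words(words, min_length=4):
    """Keep purely alphabetic words with at least min_length characters."""
    cleaned = (normalize_word(w) for w in words)
    return [w for w in cleaned if w.isalpha() and len(w) >= min_length]

print(clean_words(["Canción", "the", "house", "don't"]))
```

Note that this normalization also maps "ñ" to "n", which may or may not match the original preprocessing.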



In [None]:
# Load the dataset from a raw GitHub URL into a pandas DataFrame.

github_raw_url_dataset = "https://raw.githubusercontent.com/dsvalencias/MachineLearning/main/Datasets/languages_words.csv"
words_df = pd.read_csv(github_raw_url_dataset)
# Shuffle the rows (no random_state is set, so the order changes between runs).
words_df = words_df.sample(frac=1)
print("Words dataframe size: %s" % str(words_df.shape))

Words dataframe size: (7000, 4)


After data cleaning, 7000 words remain; we use this set with every kernel.

In [None]:
# Normalize the Language column: lowercase it and strip surrounding whitespace
words_df['Language'] = words_df['Language'].str.lower().str.strip()
words_df.head()

Unnamed: 0,Index,Word,Length,Language
6548,12503,coste,5,spanish
2717,2719,depth,5,english
3906,9861,estamos,7,spanish
3021,3023,paragraphs,10,english
2659,2661,format,6,english


In [None]:
#Validation for word length
print("Unique values for Length column: %s" % str(words_df.Length.unique()))

#Validation for word language
print("Unique values for Language column: %s" % str(words_df.Language.unique()))

Unique values for Length column: [ 5  7 10  6  4  9  8 11 12 13 14 15 17 16]
Unique values for Language column: ['spanish' 'english']


In [None]:
# The Word column is the input
X = words_df['Word']
# The Language column is the associated label
y = words_df['Language']

# Encode the label as a binary target: 1 for English, 0 for Spanish
Y = np.array([1 if language == 'english' else 0 for language in y])

In [None]:
# Split the data into training (67%) and validation (33%) sets
data_train, data_validation, target_train, target_validation = train_test_split(X, Y, test_size=0.33, random_state=42)

print("Data train size: %s\nData validation size: %s\nTarget train size: %s\nTarget validation size: %s\n" % (str(data_train.shape) , str(data_validation.shape) , str(target_train.shape) , str(target_validation.shape)))

Data train size: (4690,)
Data validation size: (2310,)
Target train size: (4690,)
Target validation size: (2310,)



In [None]:
print("Data train head")
print(data_train.head())

print("\nTarget train head")
print(target_train[:5])

Data train head
6291    observacion
5390    comerciales
3888      capacidad
556       including
1186           thin
Name: Word, dtype: object

Target train head
[0 0 0 1 1]


### (b) Implement different string kernels:

#### i. Histogram cosine kernel: calculate a bag of n-grams representation (use the CountVectorizer from scikit-learn) and apply the cosine_similarity from scikit-learn.
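A minimal sketch of the cosine kernel on character n-gram histograms. The function name `cosine_ngram_kernel`, the choice of character bigrams, and the toy words are assumptions for illustration; the vocabulary is fitted on the training words only, so the returned matrix has one row per word in `other_words` and one column per training word, which is the layout a precomputed-kernel SVM expects.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_ngram_kernel(train_words, other_words, n=2):
    """Cosine similarity between character n-gram count histograms."""
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(n, n))
    train_hist = vectorizer.fit_transform(train_words)   # fit vocabulary on training words
    other_hist = vectorizer.transform(other_words)       # reuse the same vocabulary
    return cosine_similarity(other_hist, train_hist)

toy_words = ["house", "casa", "mouse"]   # hypothetical examples
gram = cosine_ngram_kernel(toy_words, toy_words)
print(gram.round(2))
```

"house" and "mouse" share three of their four bigrams, so their similarity is 0.75, while "house" and "casa" share none.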

#### ii. Histogram intersection: calculate a bag of n-grams representation, normalize it (the bins of each histogram must sum to 1, i.e. $\|x_i\|_1 = 1 \ \forall i$) and calculate the sum of the minimum for each bin of the histogram.
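One possible way to compute the histogram intersection kernel, assuming dense L1-normalized histograms. The name `intersection_kernel` and the toy words are hypothetical; the loop over rows keeps memory use modest compared with broadcasting a full three-dimensional array.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

def intersection_kernel(hist_a, hist_b):
    """K[i, j] = sum over bins of min(hist_a[i, bin], hist_b[j, bin])."""
    a = np.asarray(hist_a, dtype=float)
    b = np.asarray(hist_b, dtype=float)
    return np.array([np.minimum(row, b).sum(axis=1) for row in a])

toy_words = ["house", "mouse", "casa"]   # hypothetical examples
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
counts = vectorizer.fit_transform(toy_words).astype(float)
hist = normalize(counts, norm="l1").toarray()   # each row now sums to 1
gram = intersection_kernel(hist, hist)
print(gram.round(2))
```

With normalized histograms the kernel of a word with itself is 1, and "house" versus "mouse" gives 0.75 because three of the four equally weighted bigrams coincide.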

#### iii. $\chi^2$ kernel: calculate a bag of n-grams representation and apply the chi2_kernel from scikit-learn.
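A sketch of the $\chi^2$ kernel on raw n-gram counts. scikit-learn's `chi2_kernel` computes $\exp(-\gamma \sum_k (x_k - y_k)^2 / (x_k + y_k))$ and needs dense, non-negative input, so the sparse counts are converted with `toarray()`. The wrapper name `chi2_ngram_kernel`, the bigram setting and `gamma=1.0` are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import chi2_kernel

def chi2_ngram_kernel(train_words, other_words, n=2, gamma=1.0):
    """Exponential chi-squared kernel between character n-gram histograms."""
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(n, n))
    # chi2_kernel does not accept sparse matrices, hence toarray().
    train_hist = vectorizer.fit_transform(train_words).toarray().astype(float)
    other_hist = vectorizer.transform(other_words).toarray().astype(float)
    return chi2_kernel(other_hist, train_hist, gamma=gamma)

toy_words = ["house", "casa", "mouse"]   # hypothetical examples
gram = chi2_ngram_kernel(toy_words, toy_words)
print(gram.round(3))
```

Whether to normalize the histograms first and which `gamma` to use are tuning decisions that the cross-validation in part (c) can cover.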


#### iv. SSK kernel: use the code available at this repository https://github.com/helq/python-ssk.
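For reference, the string subsequence kernel (SSK) can also be written as a short, unoptimized recursion. The sketch below is not the code from the repository linked above and does not use its API; it is a plain-Python version of the standard recursive definition, with hypothetical names `ssk` and `ssk_normalized`, subsequence length `n` and decay factor `lam`. It is far too slow for thousands of words but is handy for checking values on short strings.

```python
from functools import lru_cache
import math

def ssk(s, t, n, lam):
    """String subsequence kernel of order n with decay lam (naive recursion)."""

    @lru_cache(maxsize=None)
    def k_prime(i, ls, lt):
        # Auxiliary kernel K'_i on the prefixes s[:ls] and t[:lt].
        if i == 0:
            return 1.0
        if min(ls, lt) < i:
            return 0.0
        x = s[ls - 1]
        total = lam * k_prime(i, ls - 1, lt)
        for j in range(lt):
            if t[j] == x:
                total += k_prime(i - 1, ls - 1, j) * lam ** (lt - j + 1)
        return total

    @lru_cache(maxsize=None)
    def k(ls, lt):
        # Kernel K_n on the prefixes s[:ls] and t[:lt].
        if min(ls, lt) < n:
            return 0.0
        x = s[ls - 1]
        total = k(ls - 1, lt)
        for j in range(lt):
            if t[j] == x:
                total += k_prime(n - 1, ls - 1, j) * lam ** 2
        return total

    return k(len(s), len(t))

def ssk_normalized(s, t, n, lam):
    """SSK scaled so that a string has similarity 1 with itself."""
    return ssk(s, t, n, lam) / math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))

print(ssk("car", "cat", 2, 0.5), ssk("cat", "cat", 2, 0.5))
```

For `n = 2`, "car" and "cat" share only the subsequence "ca", which contributes $\lambda^4$; "cat" with itself gives $2\lambda^4 + \lambda^6$.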

### (c) Use scikit-learn to train different SVMs using precomputed kernels. Use cross validation to find appropriate regularization parameters, plotting the training and validation error vs. the regularization parameter. Use a logarithmic scale for $C \in \{2^{-15}, 2^{-14}, \dots, 2^{10}\}$. Try different configurations of the parameters (in particular different n values for the n-grams).
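A sketch of the cross-validation loop for a precomputed kernel. The key detail is that each fold needs the train-by-train block of the Gram matrix for fitting and the validation-by-train block for scoring. The function name `cv_error_curves`, the cosine kernel on bigrams, the toy words and the number of folds are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_error_curves(gram, labels, c_values, n_splits=5, seed=0):
    """Mean training and validation error of a precomputed-kernel SVM per C."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    train_err, val_err = [], []
    for c in c_values:
        fold_train, fold_val = [], []
        for tr, va in folds.split(gram, labels):
            clf = SVC(kernel="precomputed", C=c)
            clf.fit(gram[np.ix_(tr, tr)], labels[tr])            # train x train block
            fold_train.append(1.0 - clf.score(gram[np.ix_(tr, tr)], labels[tr]))
            fold_val.append(1.0 - clf.score(gram[np.ix_(va, tr)], labels[va]))  # val x train
        train_err.append(np.mean(fold_train))
        val_err.append(np.mean(fold_val))
    return np.array(train_err), np.array(val_err)

c_grid = 2.0 ** np.arange(-15, 11)   # 2^-15, 2^-14, ..., 2^10

# Hypothetical toy data: 1 = English, 0 = Spanish.
toy_words = ["house", "mouse", "this", "that", "there", "where",
             "casa", "mesa", "esta", "pero", "como", "para"]
toy_labels = np.array([1] * 6 + [0] * 6)
hist = CountVectorizer(analyzer="char", ngram_range=(2, 2)).fit_transform(toy_words)
gram = cosine_similarity(hist)
train_err, val_err = cv_error_curves(gram, toy_labels, c_grid, n_splits=3)
```

The two returned arrays can then be plotted against `c_grid` on a log-scaled x axis, and the same function can be reused with any other precomputed Gram matrix or n-gram size.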

### (d) Evaluate the performance of the SVMs in the test data set:

#### i. Report the results in a table for the different evaluated configurations.

#### ii. Illustrate examples of errors (English words mistaken as Spanish, Spanish words mistaken as English). Give a possible explanation for these mistakes.

#### iii. Discuss the results.
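To collect error examples for item ii, a small helper along these lines may be enough. The function name `misclassified` and the inputs in the usage line are hypothetical; the label convention (1 for English, 0 for Spanish) follows the encoding used earlier in this notebook.

```python
def misclassified(words, y_true, y_pred, names=("spanish", "english")):
    """Return (word, true language, predicted language) for every error."""
    errors = []
    for word, true_label, predicted_label in zip(words, y_true, y_pred):
        if true_label != predicted_label:
            errors.append((word, names[true_label], names[predicted_label]))
    return errors

# Hypothetical predictions, only to show the output format.
example = misclassified(["taxi", "hotel"], [1, 0], [1, 1])
print(example)
```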

## 2. SVM interpretability

### (a) Use the same dataset from question 1 and calculate a bag of n-grams representation.

### (b) Train an SVM using the histogram intersection kernel on this dataset.

### (c) Identify the support vectors found by the SVM training algorithm. Show the samples corresponding to the support vectors with the maximum absolute value of the $\alpha_i$ coefficients, for both positive and negative values. Do they make sense? Analyze and discuss.
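A sketch of how the most influential support vectors could be listed. In scikit-learn, a fitted binary `SVC` exposes `support_` (indices of the support vectors in the training set) and `dual_coef_` (the signed coefficients $y_i \alpha_i$, positive for the class with label 1). The helper names, the toy words and `C=10.0` are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

def intersection_gram(a, b):
    """Histogram intersection between every row of a and every row of b."""
    return np.array([np.minimum(row, b).sum(axis=1) for row in a])

def top_support_vectors(clf, train_words, k=3):
    """Support vectors with the largest positive and negative signed coefficients."""
    coefs = clf.dual_coef_.ravel()                      # y_i * alpha_i per support vector
    words = [train_words[i] for i in clf.support_]
    order = np.argsort(coefs)
    most_negative = [(words[i], coefs[i]) for i in order[:k] if coefs[i] < 0]
    most_positive = [(words[i], coefs[i]) for i in order[::-1][:k] if coefs[i] > 0]
    return most_positive, most_negative

# Hypothetical toy data: 1 = English, 0 = Spanish.
toy_words = ["house", "mouse", "this", "that", "casa", "mesa", "esta", "pero"]
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
counts = CountVectorizer(analyzer="char", ngram_range=(2, 2)).fit_transform(toy_words)
hist = normalize(counts.astype(float), norm="l1").toarray()
clf = SVC(kernel="precomputed", C=10.0).fit(intersection_gram(hist, hist), toy_labels)

english_like, spanish_like = top_support_vectors(clf, toy_words)
print(english_like, spanish_like)
```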

### (d) For different test samples, calculate the classification manually, i.e. compute the kernel between the sample and the support vectors and check how they contribute, positively or negatively, to the final classification. Show those vectors that have the highest value of the kernel. Analyze and discuss.
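The manual classification amounts to evaluating $f(x) = \sum_i y_i \alpha_i \, k(x_i, x) + b$ over the support vectors. The sketch below does this with `dual_coef_`, `support_` and `intercept_` from a fitted binary `SVC` and compares the result with `decision_function`. The helper names, the toy training words and the test word "those" are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

def intersection_gram(a, b):
    """Histogram intersection between every row of a and every row of b."""
    return np.array([np.minimum(row, b).sum(axis=1) for row in a])

def manual_decision(clf, kernel_row):
    """kernel_row holds k(x, x_j) for every training sample x_j."""
    k_sv = kernel_row[clf.support_]                  # keep only the support vectors
    contributions = clf.dual_coef_.ravel() * k_sv    # signed contribution of each one
    return contributions.sum() + clf.intercept_[0], contributions

# Hypothetical toy data: 1 = English, 0 = Spanish.
toy_words = ["house", "mouse", "this", "that", "casa", "mesa", "esta", "pero"]
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
train_hist = normalize(vectorizer.fit_transform(toy_words).astype(float), norm="l1").toarray()
clf = SVC(kernel="precomputed", C=10.0).fit(intersection_gram(train_hist, train_hist), toy_labels)

test_hist = normalize(vectorizer.transform(["those"]).astype(float), norm="l1").toarray()
kernel_row = intersection_gram(test_hist, train_hist)[0]
score, contributions = manual_decision(clf, kernel_row)

# Support vectors most similar to the test word, by kernel value.
closest = np.argsort(kernel_row[clf.support_])[::-1][:3]
closest_words = [toy_words[clf.support_[i]] for i in closest]
print(score, closest_words)
```

A positive score means the word is assigned to class 1 (English under the encoding used here); the sign of each entry in `contributions` shows which way that support vector pushes the decision.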

### (e) Propose a method that, for a given word to be classified, highlights in one color (e.g. blue) those n-grams that suggest the word is from English and in another color (e.g. red) those n-grams that suggest the word is from Spanish.
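One possible method exploits the fact that the histogram intersection kernel is a sum over bins, so the decision function splits into one term per n-gram: $f(x) = \sum_k \big[\sum_i y_i \alpha_i \min(x_k, x_{i,k})\big] + b$. The bracketed term is the contribution of n-gram $k$, and its sign can drive the color. This is a sketch of that idea, not a prescribed solution; the helper names, toy words and color labels are assumptions, and the bias $b$ is not attributed to any n-gram.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

def intersection_gram(a, b):
    """Histogram intersection between every row of a and every row of b."""
    return np.array([np.minimum(row, b).sum(axis=1) for row in a])

def ngram_contributions(clf, train_hist, test_row, vectorizer):
    """Per-n-gram share of the decision value for one test histogram."""
    sv_hist = train_hist[clf.support_]
    per_bin = clf.dual_coef_.ravel() @ np.minimum(sv_hist, test_row)
    index_to_ngram = {idx: gram for gram, idx in vectorizer.vocabulary_.items()}
    return {index_to_ngram[k]: per_bin[k] for k in np.flatnonzero(test_row)}

def color_ngrams(contributions):
    """Blue suggests English (class 1), red suggests Spanish (class 0)."""
    return {gram: "blue" if s > 0 else "red" if s < 0 else "none"
            for gram, s in contributions.items()}

# Hypothetical toy data: 1 = English, 0 = Spanish.
toy_words = ["house", "mouse", "this", "that", "casa", "mesa", "esta", "pero"]
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
train_hist = normalize(vectorizer.fit_transform(toy_words).astype(float), norm="l1").toarray()
clf = SVC(kernel="precomputed", C=10.0).fit(intersection_gram(train_hist, train_hist), toy_labels)

test_row = normalize(vectorizer.transform(["those"]).astype(float), norm="l1").toarray()[0]
contribs = ngram_contributions(clf, train_hist, test_row, vectorizer)
print(color_ngrams(contribs))
```

N-grams of the test word that never appeared in training (here "os") have no bin and therefore receive no color.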


## 3. Kernel logistic regression


### (a) Write an expression of the discriminant function expressed in terms of the kernel and the coefficients $\alpha_i$.
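One common way to write it, assuming training samples $x_1, \dots, x_N$, a kernel $k$, a bias term $b$ and labels $y \in \{0, 1\}$ (the bias and the label convention are assumptions, chosen to match the 0/1 encoding used earlier):

$$
f(x) = \sum_{i=1}^{N} \alpha_i \, k(x_i, x) + b, \qquad P(y = 1 \mid x) = \sigma\big(f(x)\big) = \frac{1}{1 + e^{-f(x)}}
$$

A sample is assigned to class 1 when $f(x) > 0$, i.e. when $\sigma(f(x)) > 0.5$.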

### (b) Formulate the problem of learning the parameters of the model as an optimization problem that looks for the parameters $\alpha_i$ that minimize a cross entropy loss function.
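A possible formulation, assuming labels $y_j \in \{0, 1\}$, a bias term $b$, the Gram matrix $K_{ij} = k(x_i, x_j)$ and $f(x_j) = \sum_{i=1}^{N} \alpha_i K_{ij} + b$:

$$
\min_{\alpha, b} \; L(\alpha, b) = -\sum_{j=1}^{N} \Big[ y_j \log \sigma\big(f(x_j)\big) + (1 - y_j) \log\big(1 - \sigma(f(x_j))\big) \Big]
$$

Writing $p_j = \sigma(f(x_j))$, the gradients needed for gradient descent are

$$
\frac{\partial L}{\partial \alpha_i} = \sum_{j=1}^{N} (p_j - y_j) \, K_{ij}, \qquad \frac{\partial L}{\partial b} = \sum_{j=1}^{N} (p_j - y_j),
$$

or in matrix form $\nabla_\alpha L = K (p - y)$. A regularization term such as $\frac{\lambda}{2} \alpha^\top K \alpha$ is often added to control overfitting; that term is an optional extension, not part of the statement above.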

### (c) Write a function that receives a training data set and a kernel function and finds a vector $\alpha$ that minimizes the loss function using gradient descent.
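A minimal sketch of such a function, using plain batch gradient descent on the mean cross entropy with labels in $\{0, 1\}$. The names `fit_kernel_logreg` and `predict_proba`, the bias term, the small regularizer `reg` and the default step size are assumptions; the learning rate in particular usually has to be retuned for each kernel.

```python
import numpy as np
from functools import partial
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_kernel_logreg(X, y, kernel, lr=0.05, n_iter=2000, reg=1e-3):
    """Gradient descent on the mean cross entropy of f(x) = sum_i alpha_i k(x_i, x) + b."""
    K = kernel(X, X)                       # Gram matrix of the training set
    n = K.shape[0]
    alpha, bias = np.zeros(n), 0.0
    for _ in range(n_iter):
        p = sigmoid(K @ alpha + bias)      # predicted P(y = 1) for each training sample
        grad_alpha = K @ (p - y) / n + reg * (K @ alpha)
        grad_bias = np.mean(p - y)
        alpha -= lr * grad_alpha
        bias -= lr * grad_bias
    return alpha, bias

def predict_proba(X_train, X_new, alpha, bias, kernel):
    """P(y = 1 | x) for new samples, using the kernel against the training set."""
    return sigmoid(kernel(X_new, X_train) @ alpha + bias)

# Usage on a synthetic 2D data set with a Gaussian (RBF) kernel.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
gaussian = partial(rbf_kernel, gamma=2.0)
alpha, bias = fit_kernel_logreg(X, y, gaussian)
train_accuracy = np.mean((predict_proba(X, X, alpha, bias, gaussian) > 0.5) == y)
print(train_accuracy)
```

Because the kernel is passed in as a function, the same code can be reused with `linear_kernel` or `polynomial_kernel` from `sklearn.metrics.pairwise`, and `predict_proba` can be evaluated on a mesh grid to draw decision regions.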

### (d) Test your algorithm using different kernels (linear, polynomial, Gaussian, etc.) on synthetic 2D datasets from sklearn (https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html). Plot the decision regions and discuss the results.