# HSE 2024: Mathematical Methods for Data Analysis

## Homework 3

**Warning 1**: some problems require (especially the lemmatization part) significant amount of time, so **it is better to start early (!)**

**Warning 2**: it is critical to describe and explain what you are doing and why, use markdown cells

In [5]:
from typing import Tuple, List

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

sns.set(style="darkgrid")

## PART 1: Logit model

We consider a binary classification problem. For prediction, we would like to use a logistic regression model. For regularization we add a combination of the $l_2$ and $l_1$ penalties (Elastic Net).

Each object in the training dataset is indexed with $i$ and described by pair: features $x_i\in\mathbb{R}^{K}$ and binary labels $y_i$. The model parametrized with bias $w_0\in\mathbb{R}$ and weights $w\in\mathbb{R}^K$. Note: Bias is included in $w$ vector

The optimization problem with respect to the $w_0, w$ is the following (Logistic loss with Elastic Net regularizers):

$$L(w, w_0) = \sum_{i=1}^{N} -y_i \log{\sigma{(w^\top x_i)}} - (1 - y_i) \log{(1 - \sigma{(w^\top x_i)})} + \gamma \|w\|_1 + \beta \|w\|_2^2$$

#### 1. [0.5 points]  Find the gradient of the Elastic Net loss and write its formulas (better in latex format). Remember what derivative sigmoid has (gradient in fact is a lot simpler than you may get using automatic tools like sympy, matlab or whatever)

Градиент для Elastic Net Loss равен:

$$ \nabla_w L(w, w_0) = \sum_{i=1}^{N}(\sigma{(w^\top x_i)} - y_i)x_i + \gamma_1 \, k + 2 \beta \|w\|_1 $$


#### 2. [0.25 points] Implement the Elastic Net loss (as a function)

In [6]:
def loss(X, y, w: List[float], gamma=1., beta=1.) -> float:
    w = np.array(w)
    pred = w.T * X
    sig = 1 / (1 + np.exp(-pred))
    
    main_loss = np.sum(-y * np.log(sig) - (1 - y) * np.log(1 - sig))
    
    l1 = gamma * np.sum(np.abs(w))
    
    l2 = beta * np.sum(w ** 2)
    
    final_loss = main_loss + l1 + l2
    return final_loss

#### 3. [0.25 points] Implement the gradient (as a function)

In [7]:
def get_grad(X, y, w: List[float], gamma=1., beta=1.) -> Tuple[List[float], float]:
    w = np.array(w)
    
    w = np.array(w)
    
    N = X.shape[0]
    
    pred = w.T * X
    
    sig = 1 / (1 + np.exp(-pred))
    
    error = sig - y
    grad_logistic = error * X

    grad_l1 = gamma * np.sign(w)

    grad_l2 = 2 * beta * w

    grad_w = grad_logistic + grad_l1 + grad_l2
    
    return grad_w.tolist()
    

    return grad_w

#### Check yourself

In [8]:
np.random.seed(42)
X = np.random.multivariate_normal(np.arange(5), np.eye(5), size=10)
X = np.c_[np.ones(X.shape[0]), X]
y = np.random.binomial(1, 0.42, size=10)
w = np.random.normal(size=5 + 1)


grad_w = get_grad(X, y, w)
print(grad_w)
assert(np.allclose(grad_w,
                   [-3.99447493, -1.84786723,  0.64520104,  1.67059973, -5.03858487, -5.21496336],
                   rtol=1e-2)
)

ValueError: operands could not be broadcast together with shapes (10,6) (10,) 

####  4. [1 point]  Implement gradient descent which works for both tol level and max_iter stop criteria and plot the decision boundary of the result

The template provides basic sklearn API class. You are free to modify it in any convenient way.

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin

In [None]:
class Logit(BaseEstimator, ClassifierMixin):
    def __init__(self, beta=1.0, gamma=1.0, lr=1e-3, tolerance=0.01, max_iter=1000, random_state=42):
        self.beta = beta
        self.gamma = gamma
        self.tolerance= tolerance
        self.max_iter= max_iter
        self.learning_rate = lr
        self.random_state = random_state
        self.w = None
        # you may additional properties if you wish

    def fit(self, X, y):
        # add weights and bias and optimize Elastic Net loss over (X,y) dataset
        # save history of optimization steps

        # your code here

        return self

    def predict(self, X):
        # return vector of predicted labels (0 or 1) for each object from X
        # your code here

        return predict

    def predict_proba(self, X):
      # return vector of probabilities for each object from X
        return np.array([1 / (1 + np.exp(np.dot(X, self.w))),\
                         1 / (1 + np.exp(-np.dot(X, self.w)))])

In [None]:
# sample data to test your model
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=180, n_features=2, n_redundant=0, n_informative=2,
                               random_state=42, n_clusters_per_class=1)

In [9]:
# a function to plot the decision boundary
def plot_decision_boundary(model, X, y):
    fig = plt.figure()
    X1min, X2min = X.min(axis=0)
    X1max, X2max = X.max(axis=0)
    x1, x2 = np.meshgrid(np.linspace(X1min, X1max, 200),
                         np.linspace(X2min, X2max, 200))
    ypred = model.predict(np.c_[x1.ravel(), x2.ravel()])
    ypred = ypred.reshape(x1.shape)

    plt.contourf(x1, x2, ypred, alpha=.4)
    plt.scatter(X[:,0], X[:,1], c=y)

In [10]:
model = Logit(0, 0)
model.fit(X, y)
plot_decision_boundary(model, X, y)

NameError: name 'Logit' is not defined

#### 5. [0.25 points] Plot loss diagram for the model, i.e. show the dependence of the loss function from the gradient descent steps

In [11]:
# your code here

## PART 2: Support Vector Machines

#### 6. [2 point] Using the same dataset, train SVM Classifier from Sklearn.
Investigate how different parameters influence the quality of the solution:
+ Try several kernels: Linear, Polynomial, RBF (and others if you wish). Some Kernels have hypermeters: don't forget to try different.
+ Regularization coefficient

Show how these parameters affect accuracy, roc_auc and f1 score.
Make plots for the dependencies between metrics and parameters.
Try to formulate conclusions from the observations. How sensitive are kernels to hyperparameters? How sensitive is a solution to the regularization? Which kernel is prone to overfitting?

In [12]:
# your code here

## PART 3: Natural Language Processing

#### 7. [1.75 point] Form the dataset

We are going to form a dataset that we will use in the following tasks for binary and multiclass classification

0. Choose **six** authors that you like (specify who you've chosen) and download the <a href="https://www.kaggle.com/d0rj3228/russian-literature?select=prose">relevant data</a> from **prose** section
1. Build your own dataset for these authors:
    * divide each text into sentences such that we will have two columns: *sentence* and *target author*, each row will contain one sentence and one target
    * drop sentences where N symbols in a sentence < 15
    * fix random state and randomly choose sentences in the folowing proportion "5k : 15k : 8k : 11k : 20k : 3k" for the authors respectively
    
    sample data may look like:
    
    <center>
    <table>
        <tr>
            <th> sentence </th>
            <th> author </th>
        </tr>
        <tr><td> Несколько лет тому назад в одном из своих поместий жил старинный русской барин, Кирила Петрович Троекуров. </td><td> Пушкин </td><td>
        <tr><td> Уже более недели приезжий господин жил в городе, разъезжая по вечеринкам и обедам и таким образом проводя, как говорится, очень приятно время. </td><td> Гоголь </td><td>
        <tr><td> ... </td><td> ... </td><td>
        <tr><td> Я жил недорослем, гоняя голубей и играя в чехарду с дворовыми мальчишками. </td><td> Пушкин </td><td>         
    </table>
</center>
     
2. Preprocess (tokenize and clean) the dataset
    * tokenize, remove all stop words (nltk.corpus.stopwords), punctuation (string.punctuation) and numbers
    * convert to lower case and apply either stemming or lemmatization of the words (on your choice)
    * vectorize words using both **bag of words** and **tf-idf** (use sklearn)
    * observe and describe the difference between vectorized output (what do numbers look like after transformations and what do they represent?)

In [30]:
from pathlib import Path
import nltk
nltk.download('punkt')  # потребуется загрузить punkt, если не установлен
from nltk.tokenize import sent_tokenize

# 0
authors = ['Blok', 'Bryusov', 'Chekhov', 'Dostoevsky', 'Gogol', 'Gorky']

# 1
path = 'data/prose/'

sentences = []
for i in authors:
    folder_path = Path(path + i)
    for file_path in folder_path.glob('*.txt'):
        with open(file_path, encoding = 'utf-8') as f:
            text = f.read().replace('\n', ' ')
            split_text = sent_tokenize(text)
            for sentence in split_text:
                if len(sentence) > 15:
                    sentences.append({"sentence": sentence.strip(), "author": i})

sentences_df = pd.DataFrame(sentences)
sentences_df.head(10)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,sentence,author
0,"Ни сны, ни явь Мы сидели на закате всем се...",Blok
1,За сиренями из оврага уже поднимался туман.,Blok
2,"Стало слышно, как точат косы.",Blok
3,Соседние мужики вышли косить купеческий луг.,Blok
4,"Не орут, не ругаются, как всегда.",Blok
5,"Косы зашаркали по траве, слышно - штук двадцать.",Blok
6,Вдруг один из них завел песню.,Blok
7,"Без усилия полился и сразу наполнил и овраг, и...",Blok
8,"За сиренью, за туманом ничего не разглядеть, п...",Blok
9,Мужики подхватили песню.,Blok


In [42]:
print(sentences_df.shape)


sample_sizes = [5000, 15000, 8000, 11000, 20000, 3000]
sampled_sentences = pd.DataFrame()

for author, size in zip(authors, sample_sizes):
    author_sentences = sentences_df[sentences_df['author'] == author]
    sampled_sentences = pd.concat([sampled_sentences, author_sentences.sample(n=size, random_state=42, replace=True)])

shuffled_df = sampled_sentences.sample(frac=1).reset_index(drop=True)
shuffled_df.head(10)

(225859, 2)


Unnamed: 0,sentence,author
0,"124 Господь не изменял, да Никон изменил.",Gorky
1,"Он жадно всматривался в лица, слушал не мигая,...",Chekhov
2,Уже тонкие чары темной женщины не давали ему п...,Blok
3,"– Затем начал он слегка поворачивать бричку, п...",Gogol
4,Не ночевать же в такое время в степи.,Gogol
5,"ступай в кладовую, вынь ковер самый лучший — ч...",Gogol
6,"— от каких-нибудь негодных клопов, которым бы ...",Gogol
7,Мудрость обладает силой побеждать все бедствия...,Bryusov
8,"Он вспомнил вопросы, которые задавали ему, пок...",Chekhov
9,"Я рассказываю теми краткими словами, как расск...",Gorky


In [44]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


import string
from nltk import WordNetLemmatizer, word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('russian'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = word_tokenize(text)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words and word.isalpha()]
    return ' '.join(words)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...


In [45]:
df = shuffled_df
df['cleaned_text'] = df['sentence'].apply(preprocess_text)


In [47]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer()
bow_vectors = bow_vectorizer.fit_transform(df['cleaned_text'])


In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(df['cleaned_text'])


In [None]:
# your code here

###  Binary classification

#### 8. [2 point] Train model using Logistic Regression (your own) and SVC (SVM can be taken from sklearn)

* choose *two* authors from the dataset that you have formed in the previous task
* check the balance of the classes
* divide the data into train and test samples with 0.7 split rate (don't forget to fix the random state)
* using GridSearchCV - find the best parameters for the models (by F1 score) and use it in the next tasks
* make several plots to address the dependence between F1 score and parameters
* plot confusion matrix for train and test samples
* compute some relevant metrics for test sample (useful to check the seminars 5 and 6, use sklearn)
* make conclusions about the performance of your models


#### 9. [1 point] Analysing ROC AUC

It is possible to control the proportion of statistical errors of different types using different thresholds for choosing a class. Plot ROC curves for Logistic Regression and SVC, show the threshold on ROC curve plots. Choose such a threshold that your models have no more than 30% of false positive errors rate. Pay attention to `thresholds` parameter in sklearn roc_curve

In [None]:
# your code here

### Multiclass logit

#### 10. [1 point] Take the One-VS-One classifier (use sklearn) and apply to Logit model (one you've made in the 4th task) in order to get multiclass linear classifier

*It is possible to use sklearn model instead of your own one but with a penalty of 0.5*

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html">OneVsOneClassifier</a>

* use the data you got at the previous step for 6 authors
* divide the data into train and test samples with 0.7 split rate
* using GridSearchCV - find the best parameters for the models (by F1 score)
* plot confusion matrix for train and test samples
* compute relevant metrics for test sample (use sklearn)

In [None]:
# your code here