## Assignment 3
1. Implement linear regression model for multiclass classification using pytorch.
2. Implement multinomial and one-vs-rest variants on multiclass classification.
3. Implement L2 relularization for your model.
4. Test your model on 20newsgroups dataset. Your baseline is accuracy=0.75.
5. How can we justify using accuracy score for this problem?
6. What is acuraccy score for random answer for this problem?

Follow #TODO in the code below.
Feel free to add additional regularizers to your model.
Remember, that SGD convergence is slower that lbfgs from scikit-learn. Manage your time.

Usefull links:
https://pytorch.org/
https://gluon.mxnet.io/chapter06_optimization/gd-sgd-scratch.html
(bonus) http://ruder.io/optimizing-gradient-descent/

In [185]:
import torch as tt
from torch.optim import SGD
from torch import nn
import numpy as np
from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
from sklearn import metrics
from scipy import sparse
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.preprocessing import normalize

%matplotlib inline

SEED = 42
np.random.seed(SEED)

import re
import string
from tqdm import tqdm
from tqdm import tqdm_notebook

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer
stemms = SnowballStemmer('english')

from nltk.tokenize import TweetTokenizer
tok = TweetTokenizer()

In [186]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# retrieve dataset
data = fetch_20newsgroups()


X = data['data']
y = data['target']
#TODO some feature engineering
# If you want to use some sparse feature vectors, pay attention to feature size.
# While your feature matrix can be sparse, weight tensor in the model is always dense.

In [187]:
X[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [188]:
def cleaning(text):
    text = text.lower()
    text = re.sub('[0-9]', '', text)
    text = re.sub('\W', ' ', text)
    text = text.strip()
    words = [stemms.stem(word) for word in tok.tokenize(text)]
    return ' '.join(words)

In [189]:
cleaning(X[0])

'from lerxst wam umd edu where s my thing subject what car is this nntp post host rac wam umd edu organ univers of maryland colleg park line i was wonder if anyon out there could enlighten me on this car i saw the other day it was a door sport car look to be from the late s earli s it was call a bricklin the door were realli small in addit the front bumper was separ from the rest of the bodi this is all i know if anyon can tellm a model name engin spec year of product where this car is made histori or whatev info you have on this funki look car pleas e mail thank il brought to you by your neighborhood lerxst'

In [190]:
X = [cleaning(i) for i in tqdm_notebook(X)]

In [191]:
tfidf = TfidfVectorizer(min_df=5, stop_words='english', ngram_range=(1,2))
X = tfidf.fit_transform(X)
y = np.array(y)

In [192]:
X.shape, y.shape

((11314, 62019), (11314,))

In [308]:
class LogisticRegressionNN(nn.Module):
    """
    All neural networks in pytorch are descendants of nn.Module class
    As you remember, Logistic regression is just a 1-layer neural network
    #TODO implement multinomial logistic regression
    """
    
    def __init__(self, d, k):
        """
        In the constructor we define model weights and layers
        d: feature size
        k: number of classes
        """
        super(LogisticRegressionNN, self).__init__()
        
        # TODO create tensor of weights and tensor of biases
        # initialize tensors from N(0,1) using np.random.rand
        # W has shape (d,k)
        # b has shape (d,)
        # set requires_grad=True for tensors, so they can be learned during training
        self.W = tt.tensor(np.random.rand(d, k), requires_grad=True, dtype=tt.float32)
        self.b = tt.tensor(np.random.rand(k,), requires_grad=True, dtype=tt.float32)
        
    def forward(self, x):
        """
        In this method we implement connections between neural network weights
        x: batch feature matrix
        returns: probability logits
        """
        # TODO implement linear model without softmax
        result = tt.matmul(x.double(), self.W.double()).add(self.b.double())
        return result
    
    def parameters(self):
        """
        learnable model parameters
        """
        return [self.W, self.b]
    
    
class LogisticRegressionEstimator(BaseEstimator, ClassifierMixin):
    """
    Logistic Regression estimator coping interface from scikit-learn
    """
    def __init__(self, learning_rate, n_epochs, batch_size, alpha=1, multi_class='multinomial', verbose=False):
        """
        learning_rate: SGD learning rate
        n_epochs: number of epochs
        batch_size: size of mini-batch
        alpha:  regularizer coef
        multi_class: ['multinomial', 'ovr']
        verbose:
        """
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.alpha = alpha
        self.multi_class = multi_class
        self.verbose = verbose
        self.model_nn = None
        self.batch_size = batch_size
        
    def _train_nn(self, model, X, y):
        """
        Train neural network
        model: neural network module
        X: - feature matrix
        y: - target values
        """
        
        # criterion to minimize
        criterion = nn.CrossEntropyLoss()
        # optimization algorithm
        optimizer = tt.optim.SGD(model.parameters(), lr=self.learning_rate)

        #TODO calculate number of batches, round to the ceil
        n_batches = int(np.ceil(X.shape[0] / self.batch_size)) 

        if self.verbose:
            # nice progress bar
            t_epochs = tqdm_notebook(range(self.n_epochs), desc='epochs', leave=True)
        else:
            t_epochs = range(self.n_epochs)

        # iterate over epochs
        for epoch in t_epochs:

            # TODO make random permutation over indices, use np.random.choice
            indices = np.random.choice(list(range(X.shape[0])), size=X.shape[0])
            
            epoch_average_loss = 0

            # iterate over mini-batches
            for j in range(n_batches):

                batch_idx = indices[j: j + self.batch_size]

                # we have to wrap data into tensors before feed them to neural network
                #TODO: batch feature float tensor. use tt.from_numpy
                batch_x = tt.from_numpy(X[batch_idx].toarray())
                #TODO batch target long tensor. use tt.from_numpy
                batch_y = tt.from_numpy(y[batch_idx]).long()

                # reset gradients for the new iteration
                optimizer.zero_grad()
                # get predictions
                pred = model.forward(batch_x)
                
                # cross-entropy loss
                loss = criterion(pred, batch_y)
                #TODO: add regularizer on weights
                loss += self.alpha/2 * tt.norm(model.W.double())**2

                # calculate gradients
                loss.backward()
                # make optimization step
                optimizer.step()

                epoch_average_loss += loss.data.detach().item()

            # average loss for epoch
            epoch_average_loss /= n_batches
            if self.verbose:
                t_epochs.set_postfix(loss='%.3f' % epoch_average_loss)
        
        
    def fit(self, X, y):
        """
        X: feature matrix
        y: target values
        """
        
        n_features = X.shape[1]
        self.n_classes_ = len(np.unique(y))
        
        # binary classification
        if self.n_classes_ == 2:
            self.model_nn = LogisticRegressionNN(n_features, 2)
            self._train_nn(self.model_nn, X, y)
            
        else:
            
            if self.multi_class == 'multinomial':
                # TODO: multinomial classification
                self.model_nn = LogisticRegressionNN(n_features, self.n_classes_)
                self._train_nn(self.model_nn, X, y)
                
            # ovr classification
            elif self.multi_class == 'ovr':
                
                if self.verbose:
                    t_ovr = tqdm_notebook(range(self.n_classes_), desc='ovr')
                else:
                    t_ovr = range(self.n_classes_)
                
                # TODO: one-vs-rest classification
                for key, i in enumerate(t_ovr):
                    self.model_nn = LogisticRegressionNN(n_features, 2)
                    y_new = np.array([1 if item == key else 0 for item in y])
                    self._train_nn(self.model_nn, X, y_new)
        return self
                    
    def predict_proba(self, X):
        
        if sparse.issparse(X):
            # create sparse tensor
            X = X.tocoo()
            ii = tt.LongTensor([X.row, X.col])
            X = tt.sparse.FloatTensor(ii, tt.from_numpy(X.data).float(), X.shape)
        else:
            # create dense tensor
            X = tt.from_numpy(X).float()
            
        
        if self.n_classes_ == 2:
            pred = self.model_nn.forward(X)
            pred = tt.softmax(pred, dim=-1)
            pred = pred.detach().numpy()
            return pred
            
        else:
            if self.multi_class == 'multinomial':
                # TODO return class probabilities
                pred = self.model_nn.forward(X)
                pred = tt.softmax(pred, dim=-1)
                pred = pred.detach().numpy()
                return pred
                
            elif self.multi_class == 'ovr':
                # TODO return class probabilities
                # remember to normalize probabities from different binary classification models, so they sum up to 1.0
                pred = []
                
                for est in self.models:
                    item = tt.sigmoid(est.forward(X))
                    pred.append(item.detach().numpy()[:, 1])
                
                return normalize(np.array(pred).T)
            
    def predict(self, X):
        proba = self.predict_proba(X)
        return proba.argmax(axis=1)

In [None]:
# select hyperparams to obtain good quality in reasonable time

est = LogisticRegressionEstimator(...)

est.fit(X, y)

In [233]:
# fetch test subset
test_data = fetch_20newsgroups(subset='test')

X_test = tfidf.transform(test_data['data'])
y_test = test_data['target']
X_test.shape, y_test.shape

((7532, 62019), (7532,))

подбор с помощью GridSearchCV (работал оооочень долго и поэтому остановила)

In [222]:
from sklearn.model_selection import GridSearchCV

In [None]:
# select hyperparams to obtain good quality in reasonable time

est = LogisticRegressionEstimator(
    learning_rate=1,
    n_epochs=10,
    batch_size=100,
    alpha=1,
    multi_class='multinomial',
    verbose=False
)

In [None]:
%%time

gscv = GridSearchCV(
    est,
    {'learning_rate': [1, 5, 10],
     'alpha': [1e-5, 1e-3, 0.1, 1, 10],
     'batch_size': [32, 64, 128]
    },
    scoring='accuracy',
    cv=5
)

gscv = gscv.fit(X, y)
print('Best params:', grid.best_params_)

подбор вручную

In [267]:
%%time

est = LogisticRegressionEstimator(
    learning_rate = 5,
    n_epochs = 20,
    batch_size = 32,
    alpha = 1e-05,
    multi_class = 'multinomial',
    verbose = False
)
est.fit(X,y)

CPU times: user 3min 18s, sys: 1min 12s, total: 4min 31s
Wall time: 3min 25s


In [268]:
print('acc', metrics.accuracy_score(y_test, est.predict(X_test)))

acc 0.6485661178969729


In [272]:
%%time

est = LogisticRegressionEstimator(
    learning_rate = 5,
    n_epochs = 50,
    batch_size = 64,
    alpha = 1e-05,
    multi_class = 'multinomial',
    verbose = False
)
est.fit(X,y)

CPU times: user 6min 26s, sys: 1min 24s, total: 7min 50s
Wall time: 5min 31s


In [273]:
print('acc', metrics.accuracy_score(y_test, est.predict(X_test)))

acc 0.7033988316516198


In [274]:
%%time

est = LogisticRegressionEstimator(
    learning_rate = 5,
    n_epochs = 50,
    batch_size = 128,
    alpha = 1e-05,
    multi_class = 'multinomial',
    verbose = False
)
est.fit(X,y)

CPU times: user 6min 8s, sys: 57 s, total: 7min 5s
Wall time: 3min 46s


In [275]:
print('acc', metrics.accuracy_score(y_test, est.predict(X_test)))

acc 0.7047265002655337


In [278]:
%%time

est = LogisticRegressionEstimator(
    learning_rate = 5,
    n_epochs = 50,
    batch_size = 64,
    alpha = 0,
    multi_class = 'multinomial',
    verbose = False
)
est.fit(X,y)

CPU times: user 6min 15s, sys: 1min 19s, total: 7min 35s
Wall time: 5min 23s


In [279]:
print('acc', metrics.accuracy_score(y_test, est.predict(X_test)))

acc 0.6911842804036112


"из крайности в крайность" попробовала с совсем другими параметрами и все равно не удалось побить бейзлайн 

In [247]:
%%time

est = LogisticRegressionEstimator(
    learning_rate = 1,
    n_epochs = 1000,
    batch_size = 1000,
    alpha = 0.001,
    multi_class = 'multinomial',
    verbose = False
)
est.fit(X,y)

CPU times: user 2h 35min 7s, sys: 45min 4s, total: 3h 20min 12s
Wall time: 1h 21min 40s


In [248]:
print('acc', metrics.accuracy_score(y_test, est.predict(X_test)))

acc 0.6614445034519384


In [258]:
%%time

est = LogisticRegressionEstimator(
    learning_rate = 10,
    n_epochs = 100,
    batch_size = 500,
    alpha = 0.001,
    multi_class = 'multinomial',
    verbose = False
)
est.fit(X,y)

CPU times: user 13min 49s, sys: 4min 38s, total: 18min 28s
Wall time: 8min 15s


In [259]:
print('acc', metrics.accuracy_score(y_test, est.predict(X_test)))

acc 0.5845724907063197


In [262]:
%%time

est = LogisticRegressionEstimator(
    learning_rate = 10,
    n_epochs = 100,
    batch_size = 1000,
    alpha = 0.001,
    multi_class = 'multinomial',
    verbose = False
)
est.fit(X,y)

CPU times: user 15min 9s, sys: 4min 20s, total: 19min 29s
Wall time: 7min 40s


In [263]:
print('acc', metrics.accuracy_score(y_test, est.predict(X_test)))

acc 0.5860329261816251


In [269]:
%%time

est = LogisticRegressionEstimator(
    learning_rate = 5,
    n_epochs = 20,
    batch_size = 64,
    alpha = 1e-05,
    multi_class = 'multinomial',
    verbose = False
)
est.fit(X,y)

CPU times: user 2min 38s, sys: 35.4 s, total: 3min 14s
Wall time: 2min 14s


In [271]:
print('acc', metrics.accuracy_score(y_test, est.predict(X_test)))

acc 0.6594530005310675


###### ИТОГ multinomial: acc 0.7047265002655337

ovr попробуем с теми же параметрами, что дали наивысший acc multinominal

In [294]:
%%time

est = LogisticRegressionEstimator(
    learning_rate = 5,
    n_epochs = 50,
    batch_size = 128,
    alpha = 1e-05,
    multi_class = 'ovr',
    verbose = False
)
est.fit(X,y)

CPU times: user 1h 6min 35s, sys: 14min 44s, total: 1h 21min 20s
Wall time: 43min 20s


При выдаче acc выдавал ошибку, я ее исправила, но тк работает почти полтора часа, не успею заново пересчитать(

### How can we justify using accuracy score for this problem?

Посмотрим, как у нас сбалансированы классы: 

In [310]:
Counter(list(y))

Counter({0: 480,
         1: 584,
         2: 591,
         3: 590,
         4: 578,
         5: 593,
         6: 585,
         7: 594,
         8: 598,
         9: 597,
         10: 600,
         11: 595,
         12: 591,
         13: 594,
         14: 593,
         15: 599,
         16: 546,
         17: 564,
         18: 465,
         19: 377})

Разброс в значениях небольщой, можем сказать, что классы сбалансированы, значит accuracy нам подходит

### What is acuraccy score for random answer for this problem?

Мы имеем 20 классов, следовательно вероятность получить верный ответ на 1 объекте 1/20 или 0.05