# Dependencies
Let's download and import all the needed libraries.
- **NumPy** is needed for intermidiate computaions,
- **Pandas** is needed to store data,
- **Python's RE** is needed for text preprocessing,
- **lxml** is needed to parse data files stored in XML format,
- **Gensim** is needed to use FastText embedding,
- **Scikit-learn** is needed to evaluate F1 score,
- **MXNet** is choosed as deep learning framework,
- **google.colab.drive** is need to mount Google Drive with all the data.

In [0]:
%%capture
!pip install mxnet-cu100

In [0]:
import numpy as np
import pandas as pd

import re
import lxml.etree
import smart_open
from gensim.utils import tokenize
from gensim.models.fasttext import FastText
from sklearn.metrics import f1_score

import mxnet as mx
import mxnet.ndarray as nd
import mxnet.gluon as gluon
import mxnet.autograd as autograd

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')
DRIVE_PATH = '/content/gdrive/My Drive/NLP Data/'

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


# Data Loading
Data should by extracted from XML files: tweets and their sentiments.
And then before we can apply any algorithm data should be normalized.

**Tweets normalization**
- delete URLs
- delete user tags
- delete non-russian letters

**Sentiments normalization**

In data provided sentiments is given company wise, so it is need to assign sentiment on corresponding company to the whole tweet. In case of multiple different sentiments the given tweet will be assumed as sum of all sentiments.

In [0]:
def normalize_tweet(string):
    string = re.sub(r'(?:http[^\s]+)($|\s)', '', string)
    string = re.sub(r'(?:@[^\s]+)($|\s)', '', string)
    string = re.sub(r'[^абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ0123456789()!?\- ]', '', string)
    return string

def normalize_sentiment(raw_sentiments_on_companies):
    sentiment = 0
    for raw_value in raw_sentiments_on_companies:
        if raw_value.text != 'NULL':
            sentiment += int(raw_value.text)
    return np.sign(sentiment)

def load_dataframe(filename, n_companies):
    tweets = []
    sentiments = []
    
    for sample in lxml.etree.parse(filename).xpath('database/table'):
        tweet = normalize_tweet(sample[3].text)
        sentiment = normalize_sentiment(sample[4:4+n_companies])

        tweets.append(tweet)
        sentiments.append(sentiment)
    
    return pd.DataFrame({'tweet': tweets, 'sent': sentiments})

In [0]:
bank_train = load_dataframe(DRIVE_PATH + 'SentiRuEval/bank_train_2016.xml', n_companies=8)
bank_test = load_dataframe(DRIVE_PATH + 'SentiRuEval/bank_test_etalon.xml', n_companies=8)

comm_train = load_dataframe(DRIVE_PATH + 'SentiRuEval/tkk_train_2016.xml', n_companies=7)
comm_test = load_dataframe(DRIVE_PATH + 'SentiRuEval/tkk_test_etalon.xml', n_companies=7)

Let's see the data we are working with.

In [6]:
bank_train.head(20)

Unnamed: 0,tweet,sent
0,Взять кредит тюмень альфа банк,0
1,Мнение о кредитной карте втб 24,0
2,Райффайзенбанк Снижение ключевой ставки ЦБ на ...,0
3,Современное состояние кредитного поведения в р...,0
4,Главное чтоб банки СБЕР и ВТБ!!!,1
5,Оформить краткосрочный кредит оао банк москвы,0
6,Самый выгодный автокредит в втб 24,1
7,Кредит иногородним в москве сбербанк,0
8,Кредитный калькулятор россельхозбанк чита,0
9,Легко можно получить денежный кредит ы втб 24 ...,1


# Word CNN (pretrained FastText embedding)

### Training own FastText embedding
In order to train FastText on Julia's dataset it is needed to parse SQL script file. In my solution every line starting with `INSERT` treated as mini-batch. So finding text shifts for different tweets and skipping other fields of database makes possible to extract mini-batch of tweets per line of SQL script.


In [0]:
class TweetsIterator(object):
    def __init__(self, path):
        self.path = path
        self.insert_prefix = 'INSERT INTO `sentiment` VALUES '
        
    def __iter__(self):
        with smart_open.open(self.path, 'r', encoding='utf-8') as sql_script:
            for line in sql_script:
                if not line.startswith(self.insert_prefix):
                    continue

                idx = line.find('(')
                while idx != -1:
                    idx = line.find(',', idx + 1)
                    idx = line.find(',', idx + 1)
                    idx = line.find(',', idx + 1) + 1
                    begin = idx + 1

                    idx = line.find("'", idx + 1)
                    while line[idx + 1] == "'":
                        idx = line.find("'", idx + 1)

                    end = idx
                    idx = line.find('(', idx)
                    tweet = normalize_tweet(line[begin:end])
                    yield list(tokenize(tweet))

**WARNING**: Several hours of execution for training FastText

In [0]:
embedding = FastText()
embedding.build_vocab(sentences=TweetsIterator(DRIVE_PATH + 'db.sql'))
embedding.train(sentences=TweetsIterator(DRIVE_PATH + 'db.sql'), total_examples=embedding.corpus_count, epochs=10)
embedding.save(DRIVE_PATH + 'twitter.model')

In [9]:
embedding = FastText.load(DRIVE_PATH + 'twitter.model')

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


### Word embedding

In [0]:
EMBEDDING_SIZE = embedding.vector_size
MAX_WORDS = 40

def tweet_to_matrix(tweet):
    matrix = nd.zeros((EMBEDDING_SIZE, MAX_WORDS))
    for i, word in enumerate(tweet.split()):
        if word in embedding.wv.vocab:
            matrix[:, i] = embedding.wv[word]

    return matrix

def one_hot_label(sentiment):
    return (np.arange(3) == (sentiment + 1)).astype(np.float32)

### Preprocess data before word-level CNN
Here we can choose dataset for our investigation:
1. Tweets about banks,
2. Tweets about telecommunication companies.

In [0]:
dataframes = [bank_train, bank_test]
#dataframes = [comm_train, comm_test]

In [0]:
X_train = nd.zeros((len(dataframes[0]), EMBEDDING_SIZE, MAX_WORDS))
y_train = nd.zeros((len(dataframes[0]), 3))
                        
X_test = nd.zeros((len(dataframes[1]), EMBEDDING_SIZE, MAX_WORDS))
y_test = nd.zeros((len(dataframes[1]), 3))

for i in range(len(bank_train)):
    X_train[i] = tweet_to_matrix(dataframes[0].tweet[i])
    y_train[i] = one_hot_label(dataframes[0].sent[i])
    
for i in range(len(bank_test)):
    X_test[i] = tweet_to_matrix(dataframes[1].tweet[i])
    y_test[i] = one_hot_label(dataframes[1].sent[i])

In [0]:
train_dataset = gluon.data.ArrayDataset(X_train, y_train)
test_dataset = gluon.data.ArrayDataset(X_test, y_test)

train_dataloader = gluon.data.DataLoader(train_dataset, batch_size=1024, shuffle=False)
test_dataloader = gluon.data.DataLoader(test_dataset, batch_size=1024, shuffle=False)

### Network configuration

In [0]:
class WordCNN(gluon.nn.HybridBlock):
    def __init__(self, **kwargs):
        super(WordCNN, self).__init__(**kwargs)
        with self.name_scope():
            self.conv1 = gluon.nn.Conv1D(channels=100, kernel_size=3, activation='relu')
            self.conv2 = gluon.nn.Conv1D(channels=100, kernel_size=4, activation='relu')
            self.conv3 = gluon.nn.Conv1D(channels=100, kernel_size=5, activation='relu')
            
            self.pool = gluon.nn.GlobalMaxPool1D()
            self.drop = gluon.nn.Dropout(0.5)
            self.fc = gluon.nn.Dense(units=3)
            
    def forward(self, x):
        with x.context:
            y1 = self.conv1(x)
            y2 = self.conv2(x)
            y3 = self.conv3(x)
            
            z1 = self.pool(y1)
            z2 = self.pool(y2)
            z3 = self.pool(y3)
            
            u = nd.concat(z1, z2, z3, dim=1)
            v = self.drop(u)
            w = self.fc(u)
            
            return w
        
net = WordCNN()
net.hybridize()
softmax = gluon.loss.SoftmaxCrossEntropyLoss(sparse_label=False)

### Training

In [0]:
net.initialize(mx.init.Normal(sigma=0.05), force_reinit=True, ctx=mx.gpu())
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-3})

In [29]:
net.collect_params().reset_ctx(mx.gpu())
for epoch in range(250):
    accumulated_loss = 0
    for features, label in train_dataloader:
        features = features.as_in_context(mx.gpu())
        label = label.as_in_context(mx.gpu())
        with autograd.record(train_mode=True):
            output = net(features)
            loss = softmax(output, label)
        loss.backward()
        accumulated_loss += nd.mean(loss).asscalar()
        trainer.step(batch_size=len(label))
  
    if (epoch + 1) % 10 == 0:
        print('epoch', epoch + 1, '-- train loss', accumulated_loss / len(train_dataloader))

epoch 10 -- train loss 0.3011109381914139
epoch 20 -- train loss 0.19143805764615535
epoch 30 -- train loss 0.06812883708626032
epoch 40 -- train loss 0.0445197707042098
epoch 50 -- train loss 0.03419495970010757
epoch 60 -- train loss 0.02715745447203517
epoch 70 -- train loss 0.02228016182780266
epoch 80 -- train loss 0.01886808481067419
epoch 90 -- train loss 0.016498337732627988
epoch 100 -- train loss 0.015237399842590094
epoch 110 -- train loss 0.018640868552029132
epoch 120 -- train loss 0.07791596073657274
epoch 130 -- train loss 0.40588442608714104
epoch 140 -- train loss 0.01711213616654277
epoch 150 -- train loss 0.012992948153987527
epoch 160 -- train loss 0.011234527151100338
epoch 170 -- train loss 0.010167771903797983
epoch 180 -- train loss 0.009420553455129266
epoch 190 -- train loss 0.008859317540191114
epoch 200 -- train loss 0.00842060890281573
epoch 210 -- train loss 0.008142104919534177
epoch 220 -- train loss 0.008422175701707602
epoch 230 -- train loss 0.0081534

### Testing
For testing we will use only positive and negative tweets ignoring neutral ones.

In [30]:
net.collect_params().reset_ctx(mx.cpu())
y_true = nd.argmax(y_test, axis=1).asnumpy()
y_pred = nd.argmax(net(X_test), axis=1).asnumpy()

mask = np.logical_or(y_true == 0, y_true == 2)

accuracy = np.mean(y_true[mask] == y_pred[mask])
f1_macro = f1_score(y_true[mask], y_pred[mask], average='macro', labels=(0,2))
f1_micro = f1_score(y_true[mask], y_pred[mask], average='micro', labels=(0,2))

print('Accuracy:', accuracy)
print('F1-macro:', f1_macro)
print('F1-micro:', f1_micro)

Accuracy: 0.528584817244611
F1-macro: 0.5882524894549593
F1-micro: 0.6535341830822712
