In [1]:
import os
import re
import tqdm
import pickle
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Data

To detect the intent of users' questions, we will need two text collections:

- `tagged_posts.tsv` — StackOverflow posts, tagged with one programming language (positive samples).
- `dialogues.tsv` — dialogue phrases from movie subtitles (negative samples).

For questions with programming-related intent, we will proceed as follows: predict the programming language (only one tag per question is allowed here) and rank candidates within that tag using embeddings. For the ranking part, we will need:

- `word_embeddings.tsv` — word embeddings, that you trained with StarSpace in the 3rd assignment. It's not a problem if you didn't do it, because we can offer an alternative solution for you.

As a result of this notebook, we should obtain the following new objects that we will then use in the running bot:

- `intent_recognizer.pkl` — intent recognition model;
- `tag_classifier.pkl` — programming language classification model;
- `tfidf_vectorizer.pkl` — vectorizer used during training;
- `thread_embeddings_by_tags` — folder with thread embeddings, arranged by tags.

In [3]:
# Create the data folder if it does not exist yet (portable, no shell call needed).
os.makedirs('./data', exist_ok=True)

In [None]:
!wget --no-check-certificate \
    https://github.com/hse-aml/natural-language-processing/releases/download/project/tagged_posts.tsv \
    -O ./data/tagged_posts.tsv > /dev/null 2>&1

In [None]:
!wget --no-check-certificate \
    https://github.com/hse-aml/natural-language-processing/releases/download/project/dialogues.tsv \
    -O ./data/dialogues.tsv > /dev/null 2>&1

In [4]:
seed = 781
sample_size = 200000

df_stackoverflow = pd.read_csv('./data/tagged_posts.tsv', sep='\t').sample(sample_size, random_state=seed)
df_dialogues = pd.read_csv('./data/dialogues.tsv', sep='\t').sample(sample_size, random_state=seed)

In [5]:
df_stackoverflow.head()

Unnamed: 0,post_id,title,tag
631024,9071076,C++ virtual method overload/override compiler ...,c_cpp
353311,5298353,Check a condition and also identify the patter...,php
1547617,23947511,isset($_POST['x']) only works if the submit bu...,php
70588,1353559,Trying to make this star output using a for lo...,c_cpp
534998,7753016,"Django+Postgres: ""current transaction is abort...",python


In [6]:
df_dialogues.head()

Unnamed: 0,text,tag
154349,What's that got to do with you?,dialogue
105643,Nooo. Is it your story?,dialogue
122343,"No Bela, that's ""incorporates."" Look, just sa...",dialogue
183491,For getting a divorce?,dialogue
129003,"No danger of attack, as long as you don't trig...",dialogue


# Part I. Intent and language recognition

We want to write a bot that will not only **answer programming-related questions**, but will also be able to **maintain a dialogue**. We would also like to detect the *intent* of the user from the question (we could have had a 'Question answering mode' check-box in the bot, but that wouldn't be fun at all). So the first thing we need to do is to **distinguish programming-related questions from general ones**.

It would also be good to predict which programming language a particular question refers to. By doing so, we restrict the search to a single language, which speeds it up by roughly a factor of the number of languages (10 here).

## Data preparation

In [7]:
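# Compile the patterns and build the stopword set once, not on every call.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS_SET = set(stopwords.words('english'))

def text_prepare(text):
    """Performs tokenization and simple preprocessing."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # separator punctuation becomes a space
    text = BAD_SYMBOLS_RE.sub('', text)        # keep only digits, a-z, space, '#', '+', '_'
    # split() never yields empty tokens, so only the stopword filter is needed.
    return ' '.join(x for x in text.split() if x not in STOPWORDS_SET)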
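
As a quick sanity check, the sketch below applies the same preprocessing steps to a couple of titles. To stay self-contained it uses a tiny hard-coded stopword set (`DEMO_STOPWORDS`, an assumption made for illustration; the notebook itself uses NLTK's English list), so results on real data may differ.

```python
import re

# Hypothetical miniature stopword list, for illustration only.
DEMO_STOPWORDS = {'how', 'to', 'a', 'in', 'the', 'is'}

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

def text_prepare_demo(text, stopwords_set=DEMO_STOPWORDS):
    """Same steps as text_prepare, but with an injectable stopword set."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    return ' '.join(t for t in text.split() if t not in stopwords_set)

# '<', '>' and '?' are dropped, while '+' is kept so 'c++' stays a token.
print(text_prepare_demo('How to sort a vector<int> in C++?'))
# Brackets and '/' act as separators, ':' is simply removed.
print(text_prepare_demo('SQL Server (2008): JOIN/UNION'))
```

Note that `#` and `+` are deliberately kept, so language names such as `c#` and `c++` remain recognizable tokens.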

In [8]:
%%time
df_stackoverflow.title = df_stackoverflow.title.apply(text_prepare)
df_dialogues.text = df_dialogues.text.apply(text_prepare)

CPU times: user 57.4 s, sys: 6.38 s, total: 1min 3s
Wall time: 1min 4s


## Intent recognition

We will do binary classification on TF-IDF representations of the texts. Labels will be either `dialogue` for general questions or `stackoverflow` for programming-related questions. First, we prepare the data for this task:

- concatenate the dialogue and StackOverflow examples into one sample;
- split it into train and test sets in a 90/10 proportion, with a fixed `random_state` for reproducibility (the code below uses `seed`);
- transform it into TF-IDF features.

In [9]:
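def extract_tfidf_features(X_train, X_test, to_='./out'):
    """Fits TF-IDF on the train set, saves the vectorizer, and transforms both sets."""
    os.makedirs(to_, exist_ok=True)
    # token_pattern=r'(\S+)' splits on whitespace only, so tokens like 'c++' and 'c#' survive.
    vect = TfidfVectorizer(min_df=5, max_df=0.9, ngram_range=(1, 2), token_pattern=r'(\S+)')
    vect.fit(X_train)  # fit on the train set only, so the test set does not leak into the vocabulary
    with open(os.path.join(to_, 'tfidf_vectorizer.pkl'), 'wb') as file:
        pickle.dump(vect, file)
    X_train = vect.transform(X_train)
    X_test = vect.transform(X_test)
    return X_train, X_test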
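
The custom `token_pattern` matters for programming titles. The sketch below, on two made-up documents, compares the vocabulary learned by scikit-learn's default tokenizer with the whitespace-only pattern used above. The documents and variable names are illustrative assumptions, not data from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy, already-preprocessed titles (made up for illustration).
docs = ['c++ vector sort', 'c# list sort']

# Default pattern keeps only runs of 2+ word characters, so 'c++' and 'c#' vanish.
default_vect = TfidfVectorizer().fit(docs)
# Whitespace-only pattern keeps every non-space run as a token.
whitespace_vect = TfidfVectorizer(token_pattern=r'(\S+)').fit(docs)

default_vocab = set(default_vect.vocabulary_)
whitespace_vocab = set(whitespace_vect.vocabulary_)

print(sorted(default_vocab))
print(sorted(whitespace_vocab))
```

With the default pattern the two documents become indistinguishable by language, which is exactly the signal the tag classifier needs.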

In [None]:
%%time
X = np.concatenate([df_dialogues.text.values, df_stackoverflow.title.values])
y = ['dialogue'] * df_dialogues.shape[0] + ['stackoverflow'] * df_stackoverflow.shape[0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=seed)
print(f'Train size={len(X_train)}, test size={len(X_test)}')

X_train_tfidf, X_test_tfidf = extract_tfidf_features(X_train, X_test)

Train size=360000, test size=40000


In [None]:
%%time
intent_recognizer = LogisticRegression(penalty='l2', C=10, random_state=seed, solver='liblinear')
intent_recognizer.fit(X_train_tfidf, y_train)

In [None]:
y_test_pred = intent_recognizer.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f'Test accuracy={test_accuracy}')

In [None]:
# Use a context manager so the file is flushed and closed deterministically.
with open('./out/intent_recognizer.pkl', 'wb') as file:
    pickle.dump(intent_recognizer, file)

## Programming Language Classification

We will train one more classifier for the programming-related questions. It will predict exactly one tag (i.e. one programming language) and will also be based on Logistic Regression with TF-IDF features, reusing the vectorizer fitted in the previous step.

First, let us prepare the data for this task.

In [None]:
X = df_stackoverflow.title.values
y = df_stackoverflow.tag.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
print(f'Train size={len(X_train)}, test size={len(X_test)}')

In [None]:
# Reuse the vectorizer that was fitted and saved during intent recognition.
with open('./out/tfidf_vectorizer.pkl', 'rb') as file:
    vectorizer = pickle.load(file)

X_train_tfidf, X_test_tfidf = vectorizer.transform(X_train), vectorizer.transform(X_test)

In [None]:
%%time
tag_classifier = OneVsRestClassifier(LogisticRegression(penalty='l2', C=5, random_state=seed, solver='liblinear'))
tag_classifier.fit(X_train_tfidf, y_train)

In [None]:
y_test_pred = tag_classifier.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f'Test accuracy={test_accuracy}')

In [None]:
with open('./out/tag_classifier.pkl', 'wb') as file:
    pickle.dump(tag_classifier, file)
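
With both models saved, the running bot can chain them: vectorize the prepared question once, ask the intent recognizer, and call the tag classifier only for programming-related questions. The helper below, `route_question`, is a hypothetical name introduced for illustration and is not part of the original notebook. It assumes the question has already gone through `text_prepare` and that the three objects were loaded from the pickles written above.

```python
def route_question(prepared_question, vectorizer, intent_recognizer, tag_classifier):
    """Returns (intent, tag); tag is None for general dialogue.

    Hypothetical helper: any objects exposing scikit-learn style
    `transform` / `predict` methods will work here.
    """
    # Both classifiers were trained on the same TF-IDF space, so transform once.
    features = vectorizer.transform([prepared_question])
    intent = intent_recognizer.predict(features)[0]
    if intent == 'dialogue':
        return intent, None
    # Programming-related question: predict exactly one language tag.
    tag = tag_classifier.predict(features)[0]
    return intent, tag
```

The returned tag then selects which per-tag embedding file to search, so only one language's index needs to be loaded per question.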

# Part II: Ranking questions with embeddings

To find a relevant answer (a StackOverflow thread) to a question, we will use vector representations to calculate the similarity between the question and existing threads. We create a `question_to_vec` function that builds such a representation by averaging the vectors of the words in the question.

However, it would be costly to compute such representations for all possible answers while the bot is online (i.e. running and answering questions from many users). This is why we will create a database of pre-computed representations. These representations will be arranged by non-overlapping tags (programming languages), so that the search for an answer is performed within only one tag at a time. This makes the bot more efficient and avoids keeping the whole database in RAM.

In [2]:
def question_to_vec(question, embeddings, dim=300):
    """
        question: a string
        embeddings: dict where the key is a word and the value is its embedding
        dim: size of the representation

        result: vector representation for the question
    """
    vec = []
    for token in question.split():
        if token in embeddings:
            vec.append(embeddings[token])
    if len(vec) == 0:
        return np.zeros((dim,))
    return np.stack(vec).mean(axis=0)
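
A small worked example may help: with a toy two-dimensional embedding table (made up for illustration), known words are averaged and unknown words are ignored, while a question with no known words falls back to the zero vector. The function is repeated in compact form so the snippet runs on its own.

```python
import numpy as np

def question_to_vec(question, embeddings, dim=300):
    """Mean of the embeddings of known words; zeros if none are known."""
    vecs = [embeddings[t] for t in question.split() if t in embeddings]
    if not vecs:
        return np.zeros((dim,))
    return np.stack(vecs).mean(axis=0)

# Toy embeddings, purely illustrative.
toy_embeddings = {
    'python': np.array([1.0, 0.0]),
    'list': np.array([0.0, 1.0]),
}

known = question_to_vec('python list unknownword', toy_embeddings, dim=2)
unknown = question_to_vec('unknownword', toy_embeddings, dim=2)
print(known)    # average of the two known word vectors
print(unknown)  # zero vector fallback
```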

Since we want to precompute representations for all possible answers, we need to load the whole posts dataset, unlike what we did for the intent classifier, where we used only a sample:

In [None]:
posts_df = pd.read_csv('./data/tagged_posts.tsv', sep='\t')

In [None]:
posts_df.head()

In [None]:
counts_by_tag = posts_df.groupby('tag').count().max(axis=1)
counts_by_tag.head()

Now, for each tag, we need to create two data structures, which will serve as an online search index:

- `tag_post_ids` — a list of post_ids with shape `(counts_by_tag[tag],)`. It will be needed to show the title and link to the thread;
- `tag_vectors` — a matrix with shape `(counts_by_tag[tag], embeddings_dim)` where embeddings for each answer are stored.
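
The cell that builds the index relies on two names, `w2v_embeddings` and `embeddings_dim`, which are not defined in the cells shown here. One possible way to obtain them is sketched below. It assumes that `word_embeddings.tsv` stores one word per line followed by its tab-separated vector components; both this layout and the helper name `load_embeddings` are assumptions, so adjust them to the actual file.

```python
import numpy as np

def load_embeddings(embeddings_path):
    """Hypothetical loader: reads 'word<TAB>v1<TAB>v2...' lines into a dict.

    Returns (embeddings, dim), where dim is the length of the vectors.
    """
    embeddings = {}
    with open(embeddings_path, encoding='utf-8') as f:
        for line in f:
            word, *values = line.rstrip('\n').split('\t')
            if not values:
                continue  # skip blank or malformed lines
            embeddings[word] = np.array([float(v) for v in values], dtype=np.float32)
    dim = len(next(iter(embeddings.values()))) if embeddings else 0
    return embeddings, dim

# Example call (the path is an assumption):
# w2v_embeddings, embeddings_dim = load_embeddings('./data/word_embeddings.tsv')
```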


In [None]:
# makedirs also creates './out' if it is missing.
os.makedirs('./out/thread_embeddings_by_tags', exist_ok=True)

In [None]:
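# NOTE: `w2v_embeddings` (dict: word -> vector) and `embeddings_dim` are not defined in the
# cells shown here; they must be loaded from `word_embeddings.tsv` before running this cell.
for tag, count in tqdm.tqdm(counts_by_tag.items()):
    tag_posts = posts_df[posts_df['tag'] == tag]

    tag_post_ids = tag_posts.post_id.tolist()

    # One row per post: the mean embedding of the words in its title.
    tag_vectors = np.zeros((count, embeddings_dim), dtype=np.float32)
    for i, title in enumerate(tag_posts['title']):
        tag_vectors[i, :] = question_to_vec(title, w2v_embeddings, embeddings_dim)

    # Dump post ids and vectors to a file.
    filename = os.path.join('./out/thread_embeddings_by_tags', os.path.normpath('%s.pkl' % tag))
    with open(filename, 'wb') as file:
        pickle.dump((tag_post_ids, tag_vectors), file)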
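
At query time, the bot would load the `(tag_post_ids, tag_vectors)` pair for the predicted tag and pick the closest thread. The sketch below uses cosine similarity, which is an assumption here since the notebook does not fix the similarity measure, and `rank_candidates` is a hypothetical helper name.

```python
import numpy as np

def rank_candidates(question_vec, tag_post_ids, tag_vectors, top_k=1):
    """Returns the top_k (post_id, cosine similarity) pairs, best first."""
    dots = tag_vectors @ question_vec
    denom = np.linalg.norm(tag_vectors, axis=1) * np.linalg.norm(question_vec)
    # A zero norm means a zero dot product too, so dividing by 1 yields similarity 0.
    sims = dots / np.where(denom > 0, denom, 1.0)
    best = np.argsort(-sims)[:top_k]
    return [(tag_post_ids[i], float(sims[i])) for i in best]

# Toy index with three posts in a 2-d embedding space (illustrative values).
toy_ids = [10, 20, 30]
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32)
top2 = rank_candidates(np.array([1.0, 0.1]), toy_ids, toy_vectors, top_k=2)
print(top2)

# In the bot, the index for one tag could be loaded like this:
# with open('./out/thread_embeddings_by_tags/python.pkl', 'rb') as f:
#     tag_post_ids, tag_vectors = pickle.load(f)
```

Because each tag has its own file, only the matrix for the predicted language has to be in memory during the search.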

In [20]:
import chatterbot

ModuleNotFoundError: No module named 'chatterbot'

In [22]:
!pip install chatterbot

Collecting chatterbot
  Using cached ChatterBot-1.0.5-py2.py3-none-any.whl (67 kB)
Collecting spacy<2.2,>=2.1
  Using cached spacy-2.1.9-cp37-cp37m-manylinux1_x86_64.whl (30.8 MB)
Processing /root/.cache/pip/wheels/23/b9/73/57aaccb6957d94ed63f474b51a9f7f992c5eff4635052c0557/PyYAML-5.1.2-cp37-cp37m-linux_x86_64.whl
Installing collected packages: spacy, pyyaml, chatterbot
Killed


In [16]:
!conda remove PyYAML -y

Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /root/anaconda3

  removed specs:
    - pyyaml


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.11.28         |           py37_1         156 KB
    conda-4.8.3                |           py37_0         2.8 MB
    decorator-4.4.2            |             py_0          14 KB
    idna-2.9                   |             py_1          49 KB
    ipython-7.13.0             |   py37h5ca1d4c_0         991 KB
    jedi-0.16.0                |           py37_1         769 KB
    json5-0.9.4                |             py_0          21 KB
    jupyter_client-6.1.2       |             py_0          82 KB
    jupyterlab_server-1.1.0    |             py_0          27 KB
    openssl-1.1.1e             |       h7b6447c_0         2.5 MB
    parso-0.6.2                |    
