# Overview

I'll start with simple data exploration.

In [None]:
import pandas as pd

df = pd.read_csv('./data/train_set.csv', encoding='utf8', engine='python')
print('The dataset has {} rows and {} columns'.format(len(df.index), len(df.columns)))
df.head()

Let's count labels:

In [None]:
print(df['author'].value_counts())

So the dataset is fairly well balanced, except that tweets by Donald J. Trump are underrepresented.

Let's look at some tweets.

In [None]:
pd.set_option('display.max_colwidth', 280)
df['tweet'].head()

So the dataset contains different languages, hashtags, URLs, and mentions. All that seems important. That could be tough for many NLU techniques, but I'll try to deal with it.
# The first approach
I usually start working on a task by building a simple baseline estimator. I do it before going deep into data analysis and exploration. It helps me to understand task complexity.

Obviously, tweets are key to this classification task. So I'll ignore metadata fields for now. 

To build sentence representations suitable for classification, I'll use my default approach: TF-IDF vectors of tokenized and lemmatized sentences.
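
As a minimal sketch of what TF-IDF weighting does, the snippet below vectorizes three made-up sentences (they are not from the dataset). A word that occurs in fewer documents gets a higher weight within a row than a word shared by several documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, made-up "tweets" used only for illustration.
docs = [
    "rocket launch today",
    "rocket landing today",
    "new show tonight",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

print(X.shape)  # (number of documents, vocabulary size)

vocab = vectorizer.vocabulary_      # maps each word to its column index
row0 = X[0].toarray().ravel()       # dense weights of the first document

# "launch" occurs in one document, "rocket" in two,
# so "launch" is weighted higher in the first row.
print(row0[vocab["launch"]] > row0[vocab["rocket"]])
```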

I'll use RandomForest as my baseline estimator. It takes relatively little time to train and often works fine even without hyperparameter optimization.

In [None]:
import spacy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score
from spacy.lang.en import English
import numpy as np
from sklearn.model_selection import train_test_split

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
nlp = English()

clzs = {n: i for i, n in enumerate(df['author'].unique())}

labels = list()
tweets = list()

for index, row in df.iterrows():
    labels.append(clzs[row['author']])
    tweets.append(' '.join([w.lemma_ for w in nlp(row['tweet']) if not w.is_stop]))


labels = np.array(labels)
tweets = np.array(tweets)

X_train, X_test, y_train, y_test = train_test_split(tweets, labels, test_size=0.1, random_state=10)

vectorizer = TfidfVectorizer()
model = RandomForestClassifier(n_estimators=10)

X_train = vectorizer.fit_transform(X_train)
model.fit(X_train,y_train)

y_train_predict = model.predict(X_train)
print('Train set score {:.3f}'.format(precision_score(y_train, y_train_predict, average='micro')))

X_test = vectorizer.transform(X_test)
test_predict = model.predict(X_test)

# Use the same averaging as for the train score so the two numbers are comparable
print('Test set score {:.3f}'.format(precision_score(y_test, test_predict, average='micro')))

I got micro-precision ~70% from that estimator. It is lower than what I expect to get from my final solution, but it's a good start.
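
Micro- and macro-averaged precision can differ noticeably, so it matters which one is reported. The sketch below uses made-up labels (not the dataset) to show the difference. For single-label multiclass problems, micro-averaged precision equals accuracy.

```python
from sklearn.metrics import accuracy_score, precision_score

# Made-up labels: class 0 is more frequent than class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

# Micro: pool all predictions, then compute precision once.
micro = precision_score(y_true, y_pred, average="micro")
# Macro: compute precision per class, then take the unweighted mean.
macro = precision_score(y_true, y_pred, average="macro")
accuracy = accuracy_score(y_true, y_pred)

print("micro={:.3f} macro={:.3f} accuracy={:.3f}".format(micro, macro, accuracy))
```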

Now I'm ready to explore the metadata fields.

# Utilizing the metadata fields
Let's do some aggregations on the metadata fields to explore them.

In [None]:
df_agg = df.groupby('author')
df_agg.agg({
    'year': ['min', 'max'],
    'day_of_week': 'nunique',
    'is_retweet': ['mean'],
    'has_hashtag': ['mean'],
    'has_mentions': ['mean'],
    'has_url': ['mean'],
    'has_media': ['mean'],
    'source': 'nunique',
    'lang': 'nunique',
})

Retweets are absent from this dataset, but the other fields could be useful. As we can see, Elon Musk is loyal to his iPhone, while Cristiano Ronaldo leads in the number of sources he uses. We can also see that Elon Musk and Donald Trump rarely use hashtags, while Sebastian Ruder never posts media.

I see a problem with the year field here. I've checked the test set manually and found out that all tweets in the test set before 2012 seem to be authored by Cristiano Ronaldo. So it could help me to improve the score on the test set.

I don't know whether the lower limits of the year column actually show when those people joined Twitter, or whether they are a limitation of the dataset. If we were building a production-ready software system, I would be cautious about using the year for classification.
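
One way to spot such a shortcut is to look at how tweets are distributed across authors within each year. The sketch below does this on a made-up table; the author names `A`, `B`, `C` and the years are placeholders, not values from the dataset.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "author": ["A", "A", "B", "B", "B", "C"],
    "year":   [2010, 2011, 2015, 2016, 2016, 2016],
})

# Share of each author within every year (each row sums to 1).
share = pd.crosstab(toy["year"], toy["author"], normalize="index")

# Years in which a single author accounts for all tweets:
# for these years, the year alone identifies the author.
leaky_years = share.index[share.max(axis=1) == 1.0].tolist()
print(leaky_years)
```

If many years show up in such a list on the real data, the model may be learning the collection period of each account rather than anything about the author.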

Now I'm ready to use the metadata in my baseline estimator. Fortunately, I used a decision-tree-based estimator, which is already suitable for heterogeneous data. I only have to map string fields to numbers and attach the metadata to the TF-IDF vectors.
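
A plain dictionary lookup such as `lang_mappings[x]` raises a `KeyError` for a value that is not in the hard-coded list. As a hedged alternative, the sketch below builds the mapping from the data and falls back to a reserved code for unseen values. The helper names `build_mapping` and `encode` are hypothetical and are not used elsewhere in this notebook.

```python
def build_mapping(values):
    """Map each distinct value to an integer code, in first-seen order."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return mapping


def encode(values, mapping, unknown=-1):
    """Encode values with the mapping; unseen values get the `unknown` code."""
    return [mapping.get(value, unknown) for value in values]


# Made-up language codes for illustration.
train_langs = ["en", "pt", "en", "und"]
lang_codes = build_mapping(train_langs)

# "xx" was never seen during training, so it is encoded as -1.
encoded = encode(["pt", "en", "xx"], lang_codes)
print(lang_codes, encoded)
```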

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
import scipy.sparse  # scipy.sparse.hstack needs the submodule imported explicitly

langs = ['en', 'pt', 'und', 'tl', 'ht', 'in', 'sv', 'es', 'it', 'pl', 'hi', 'et', 'ca', 'de', 'fi', 'fr',
             'ru', 'hu', 'nl', 'da', 'tr', 'lt', 'ro', 'eu', 'ja', 'cy', 'sl', 'is', 'no', 'cs']

days_of_week = ['Mon', 'Sat', 'Fri', 'Wed', 'Tue', 'Sun', 'Thu']

lang_mappings = {n: i for i, n in enumerate(langs)}
dow_mappings = {n: i for i, n in enumerate(days_of_week)}

metadata = np.vstack([df['source'],
                          df['lang'].apply(lambda x: lang_mappings[x]).values,
                          df['day_of_week'].apply(lambda x: dow_mappings[x]).values,
                          df['has_hashtag'].astype(int).values,
                          df['has_mentions'].astype(int).values,
                          df['has_url'].astype(int).values,
                          df['has_media'].astype(int).values,
                          df['day'].values, df['month'].values,
                          df['year'].values, df['hour'].values, df['minute'].values,
                          df['second'].values, df['day_of_year'].values,
                          df['week_of_year'].values]).T

split = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_index, test_index = next(split.split(tweets, labels))

tweets_train, tweets_test = tweets[train_index], tweets[test_index]
labels_train, labels_test = labels[train_index], labels[test_index]
metadata_train, metadata_test = metadata[train_index], metadata[test_index]

vectorizer = TfidfVectorizer()
tweets_train = vectorizer.fit_transform(tweets_train)
tweets_test = vectorizer.transform(tweets_test)

X_train = scipy.sparse.hstack([tweets_train, metadata_train])
X_test = scipy.sparse.hstack([tweets_test, metadata_test])


model = RandomForestClassifier(n_estimators=10)
model.fit(X_train,labels_train)

train_predict = model.predict(X_train)
print('Train set score {:.3f}'.format(precision_score(labels_train, train_predict, average='micro')))

test_predict = model.predict(X_test)

# Use the same averaging as for the train score so the two numbers are comparable
print('Test set score {:.3f}'.format(precision_score(labels_test, test_predict, average='micro')))

I've got 82% here. That's a significant improvement! 

Now I'm ready to select a more suitable model, if one exists.

# Model selection
I started with RandomForest from scikit-learn. The data is heterogeneous and high-dimensional, so a tree-based estimator is a reasonable choice. Gradient-boosted trees, however, are usually among the strongest tree-based ensembles, so I'll try to improve my scores by switching to CatBoost. CatBoost is highly optimized and easy to train. It can be trained on either a CPU or a GPU, and it doesn't require a GPU for inference. There are several other popular gradient boosting libraries (XGBoost, for example), but CatBoost is my default choice.

I'm not very experienced with the optimization of hyperparameters of gradient boosting trees. So I'll use a grid search for that. Note that the following code could run up to a couple of hours on a GPU and up to a day on an average CPU.
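
To get a feeling for the cost, it helps to count how many hyperparameter combinations an exhaustive search over this grid has to evaluate. The sketch below only enumerates the combinations with the standard library; it does not train anything.

```python
from itertools import product

grid = {
    "learning_rate": [0.1, 0.3, 0.6],
    "depth": [6, 10],
    "l2_leaf_reg": [0.01, 0.05, 0.1, 0.5],
}

keys = list(grid)
# One dict per combination of values: 3 * 2 * 4 candidates in total.
candidates = [dict(zip(keys, values)) for values in product(*grid.values())]

print(len(candidates))
print(candidates[0])
```

Every extra value in any list multiplies the number of models to train, which is why the running time grows so quickly.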

In [None]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(task_type="GPU")  # Change to "CPU" if you don't have a GPU
grid = {'learning_rate': [0.1, 0.3, 0.6],
        'depth': [6, 10],
        'l2_leaf_reg': [0.01, 0.05, 0.1, 0.5]}

grid_search_result = model.grid_search(grid,
                                       X=X_train,
                                       y=labels_train,
                                       plot=True)

The grid search showed the best result with learning_rate=0.3, depth=10, l2_leaf_reg=0.05.

Now I'm ready to train and test the model with optimized hyperparameters. To be sure that my results are valid, I'll use 5-fold cross-validation. Note that the following code could run up to an hour on a GPU and up to twelve hours on an average CPU.
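
Stratified folds keep the class proportions of the full dataset in every test fold, which matters when one author is underrepresented. A minimal sketch on made-up labels (the names `toy_labels` and `toy_features` are placeholders):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 10 samples of class 0 and 5 samples of class 1.
toy_labels = np.array([0] * 10 + [1] * 5)
# Placeholder features; only the labels drive the stratification.
toy_features = np.zeros((len(toy_labels), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count how many samples of each class land in every test fold.
fold_counts = [
    np.bincount(toy_labels[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(toy_features, toy_labels)
]
print(fold_counts)
```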

In [None]:
from catboost import CatBoostClassifier

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

prs = list()
for idx, (train_index, test_index) in enumerate(skf.split(metadata, labels)):
    print('K-Fold iteration {}'.format(idx+1))

    tweets_train = tweets[train_index]
    metadata_train = metadata[train_index]

    tweets_test = tweets[test_index]
    metadata_test = metadata[test_index]

    y_train, y_test = labels[train_index], labels[test_index]



    vectorizer = TfidfVectorizer()
    tweets_train_tfidf = vectorizer.fit_transform(tweets_train)
    tweets_test_tfidf = vectorizer.transform(tweets_test)

    X_train = scipy.sparse.hstack([tweets_train_tfidf, metadata_train])
    X_test = scipy.sparse.hstack([tweets_test_tfidf, metadata_test])

    model = CatBoostClassifier(learning_rate=0.3, depth=10, l2_leaf_reg=0.1, task_type="GPU")
    model.fit(X_train,y_train)

    y_train_predict = model.predict(X_train)
    print('Train set score {}'.format(precision_score(y_train, y_train_predict, average='micro')))

    y_test_predict = model.predict(X_test)

    pr = precision_score(y_test, y_test_predict, average='micro')
    prs.append(pr)
    print('Test set score {}'.format(pr))


print(prs)

I got 90-91%, which I think is an amazing result. Now I'm ready to build the final model on the whole dataset.

In [None]:
import os
import pickle
import shutil

vectorizer = TfidfVectorizer()
tweets_train_tfidf = vectorizer.fit_transform(tweets)
X_train = scipy.sparse.hstack([tweets_train_tfidf, metadata])
model = CatBoostClassifier(learning_rate=0.3, depth=10, l2_leaf_reg=0.1, task_type="GPU")
model.fit(X_train,labels)
if os.path.exists('./sample_model'):
    shutil.rmtree('./sample_model')
os.makedirs('./sample_model')
model.save_model('./sample_model/model.cbm')
with open('./sample_model/vectorizer.pickle', 'wb') as f:
    pickle.dump(vectorizer, f)

# Model analysis
Let's explore the most important features of the final model.

In [None]:
import os
from catboost import CatBoostClassifier
import numpy as np
import pickle

with open(os.path.join('./model_cb_full', 'vectorizer.pickle'), 'rb') as f:
    vectorizer = pickle.load(f)

    
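# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = vectorizer.get_feature_names_out().tolist()
metadata_features = ['source',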
     'lang',
     'day_of_week',
     'has_hashtag',
     'has_mentions',
     'has_url',
     'has_media',
     'day', 'month',
     'year', 'hour', 'minute',
     'second', 'day_of_year',
     'week_of_year']

model = CatBoostClassifier(learning_rate=0.3, depth=10, l2_leaf_reg=0.1, task_type="GPU")
model.load_model(os.path.join('./model_cb_full', 'model.cbm'))
feature_importance = model.get_feature_importance(type="FeatureImportance")
important_features = np.argsort(-feature_importance)

print([features[x] if x < len(features) else metadata_features[x-len(features)] for x in important_features[:100]])

The model turned out to be interpretable. Metadata fields are the most important features. The most important words are hashtags like kkw*, ellentube, theellenshow. The model learned to predict authors from a few salient features rather than from in-depth text analysis. And it works well.

I don't think that any publicly available word embeddings or pre-trained BERT models could deal with this data significantly better. [Character-based CNNs](https://ieeexplore.ieee.org/document/7966145/citations?tabFilter=papers#citations) might. A [pretrained tweet2vec](https://github.com/bdhingra/tweet2vec) could be worth trying. Models based on combinations of word embeddings, character embeddings, and metadata (as in [BiDAF](https://arxiv.org/abs/1611.01603)) could do better. But that would take long research with unpredictable results.

I also think that it is possible to achieve better results with my model by improving tokenization so that it works better with emojis, emoticons, URLs, and other non-linguistic textual data.
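
As one possible starting point, and not something I have evaluated on this dataset, URLs and mentions could be replaced with placeholder tokens before vectorization. The patterns and the placeholder names `<url>` and `<mention>` below are assumptions chosen for illustration; emojis and emoticons are not handled here.

```python
import re

# Simplified patterns; real tweets may need more careful handling.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def normalize_tweet(text):
    """Replace URLs and mentions with placeholders and collapse whitespace."""
    text = URL_RE.sub(" <url> ", text)
    text = MENTION_RE.sub(" <mention> ", text)
    return " ".join(text.split())


# Made-up example tweet.
normalized = normalize_tweet("Thanks @nasa! Details: https://example.com/x #launch")
print(normalized)
```

Whether it is better to collapse URLs and mentions like this or to keep them as distinct tokens is an open question: specific mentions and hashtags may be exactly the salient features the model relies on.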