# Wikipedia Talk Data - Getting Started

This notebook gives an introduction to working with the various data sets in the [Wikipedia
Talk](https://figshare.com/projects/Wikipedia_Talk/16731) project on Figshare. The release includes:

1. a large historical corpus of discussion comments on Wikipedia talk pages
2. a sample of over 100k comments with human labels for whether the comment contains a personal attack
3. a sample of over 100k comments with human labels for whether the comment has aggressive tone

Please refer to our [wiki](https://meta.wikimedia.org/wiki/Research:Detox/Data_Release) for documentation of the schema of each data set and our [research paper](https://arxiv.org/abs/1610.08914) for documentation on the data collection and modeling methodology. 

In this notebook we show how to build a simple classifier for detecting personal attacks and apply the classifier to a random sample of the comment corpus to see whether discussions on user pages have more personal attacks than discussions on article pages.

## Building a classifier for personal attacks
In this section we will train a simple bag-of-words classifier for personal attacks using the [Wikipedia Talk Labels: Personal Attacks]() data set.

The first change that I made was to add the recall, precision, F-beta, and confusion matrix metrics so that I could get 
a good idea of the current performance. I also added K-Fold cross-validation with 5 splits (n_splits=5). I chose 5 because 
running 5 iterations seemed like a good balance between multiple attempts and a reasonable runtime. It also means that in 
each iteration 20% of the data is held out for testing and 80% is used for training. These
values are similar to the original ratio in the provided data and also give a good tradeoff between training data size
and test data size.
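
To make the 80/20 arithmetic concrete, here is a minimal, dependency-free sketch of how contiguous, unshuffled 5-fold splitting divides a data set. The helper name `kfold_indices` and the 100-row toy size are assumptions for illustration; the notebook itself uses scikit-learn's `KFold(n_splits=5)`.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample.
    fold_sizes = [n_samples // n_splits] * n_splits
    for extra in range(n_samples % n_splits):
        fold_sizes[extra] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Toy data set of 100 rows (hypothetical size, chosen for easy arithmetic).
splits = list(kfold_indices(100, n_splits=5))
for fold, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold}: {len(train)} train / {len(test)} test")
```

Every row lands in the test set exactly once, so each of the 5 iterations trains on 80% of the rows and evaluates on the remaining 20%.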

The base results were:

Avg. Recall: 0.571

Avg. Precision: 0.889

Avg. F-Beta: 0.696

Avg. ROC AUC: 0.955

Time to run:  0:04:56.208785

In [2]:
import pandas as pd
import urllib.request
import re
import string
import numpy as np
from datetime import datetime
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score, precision_recall_fscore_support, confusion_matrix
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

In [5]:
# download annotated comments and annotations

ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634'
ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637'


def download_file(url, fname):
    urllib.request.urlretrieve(url, fname)


start = datetime.now()                
download_file(ANNOTATED_COMMENTS_URL, 'attack_annotated_comments.tsv')
download_file(ANNOTATIONS_URL, 'attack_annotations.tsv')

In [6]:
comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

In [7]:
len(annotations['rev_id'].unique())

115864

In [8]:
# label a comment as an attack if the majority of annotators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5

In [9]:
# join labels and comments
comments['attack'] = labels
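
The majority rule above can be checked by hand on a toy example. This is a pandas-free sketch of the same rule, with made-up annotations (the `toy_annotations` values are hypothetical, not taken from the data set). Note that the comparison is strict, so an exact tie is not labeled as an attack.

```python
from collections import defaultdict

# Hypothetical toy annotations: (rev_id, attack) pairs, one per annotator.
toy_annotations = [
    (1, 1), (1, 1), (1, 0),   # 2 of 3 annotators say attack -> mean 0.667
    (2, 0), (2, 0), (2, 1),   # 1 of 3 annotators say attack -> mean 0.333
    (3, 1), (3, 0),           # tie                          -> mean 0.5
]

votes = defaultdict(list)
for rev_id, attack in toy_annotations:
    votes[rev_id].append(attack)

# Same rule as annotations.groupby('rev_id')['attack'].mean() > 0.5
toy_labels = {rev_id: sum(v) / len(v) > 0.5 for rev_id, v in votes.items()}
print(toy_labels)
```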

8. Text Cleanup

a. I attempted many text cleanup methods. The ones included in the code are replacing quotes with an empty string,
replacing all punctuation with spaces, reducing multiple consecutive spaces to one, and making the text lowercase. I
also tried removing stopwords using the built-in 'english' stop word list in the CountVectorizer, and replacing all
punctuation with empty strings.

With replacing all punctuation with empty strings:

Avg. Recall: 0.566

Avg. Precision: 0.891

Avg. F-Beta: 0.692

Avg. ROC AUC: 0.954

Time to run:  0:03:37.461214

With the features that were left in:

Avg. Recall: 0.571

Avg. Precision: 0.889

Avg. F-Beta: 0.695

Avg. ROC AUC: 0.955

Time to run:  0:03:49.056966

Adding stopword removal to the features that were left in:

Avg. Recall: 0.559

Avg. Precision: 0.894

Avg. F-Beta: 0.688

Avg. ROC AUC: 0.949

Time to run:  0:03:38.880159

As you can see, replacing punctuation with spaces performed better than replacing it with empty strings, and removing
stopwords made the results worse. That surprised me at first, since removing stopwords typically helps reduce noise, but
the built-in stopword list may include some terms that would not be considered stopwords in every domain,
as noted in the docs:  https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words

In [None]:
# remove newline, tab tokens, and many forms of punctuation
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
# Replace quote with nothing
comments['comment'] = comments['comment'].apply(lambda x: x.replace("'", ""))
# Idea borrowed from https://stackoverflow.com/questions/34860982/replace-the-punctuation-with-whitespace
comments['comment'] = comments['comment'].apply(lambda x: x.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))))
# Collapse multiple spaces to one
comments['comment'] = comments['comment'].apply(lambda x: re.sub(' +', ' ', x))
# Make every string lowercase
comments['comment'] = comments['comment'].apply(lambda x: x.lower())
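
The same cleanup steps can be collected into a single function, which makes it easy to compare the two punctuation strategies on one string. This is a sketch: the names `clean_comment`, `PUNCT_TO_SPACE`, and `PUNCT_TO_NOTHING` and the sample text are assumptions for illustration, not part of the original code.

```python
import re
import string

# Maps every ASCII punctuation character to a single space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
# The alternative that was tried: delete punctuation instead of replacing it.
PUNCT_TO_NOTHING = str.maketrans('', '', string.punctuation)


def clean_comment(text, punct_table=PUNCT_TO_SPACE):
    # Tokens go first: the underscore in NEWLINE_TOKEN counts as punctuation.
    text = text.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")
    text = text.replace("'", "")        # drop apostrophes: don't -> dont
    text = text.translate(punct_table)  # handle the remaining punctuation
    text = re.sub(' +', ' ', text)      # collapse runs of spaces
    return text.lower()


sample = "Don't SHOUT,NEWLINE_TOKENwell-known"
print(repr(clean_comment(sample)))                    # 'dont shout well known'
print(repr(clean_comment(sample, PUNCT_TO_NOTHING)))  # 'dont shout wellknown'
```

With spaces, "well-known" stays as two separate words. With empty strings it fuses into a single new token, which may be part of why that variant scored slightly lower.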

In [12]:
comments.query('attack')['comment'].head(10)

rev_id
801279             Iraq is not good  ===  ===  USA is bad   
2702703      ____ fuck off you little asshole. If you wan...
4632658         i have a dick, its bigger than yours! hahaha
6545332      == renault ==  you sad little bpy for drivin...
6545351      == renault ==  you sad little bo for driving...
7977970    34, 30 Nov 2004 (UTC)  ::Because you like to a...
8359431    `  ::You are not worth the effort. You are arg...
8724028    Yes, complain to your rabbi and then go shoot ...
8845700                     i am using the sandbox, ass wipe
8845736      == GOD DAMN ==  GOD DAMN it fuckers, i am us...
Name: comment, dtype: object

In [13]:
# fit a simple text classifier

# get different data groups
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")
# Tuning data
dev_comments = comments.query("split=='dev'")
# Test/Train data when using KFold
non_dev_comments = comments.query("split=='test' or split=='train'")

if __name__ == '__main__':



Test ROC AUC: 0.957


12. Parameter Tuning (Part 1)

e. For the parameter tuning phase, I started with the documentation for the model and searched around on the internet to
see which parameters could make a difference and the typical values used. I came up with the set below. I set cv on the
GridSearchCV to 3 to reduce the runtime, and the search was still able to improve the results.

The previous results versus post tuning:

Avg. Recall: 0.452 v .613  -- improved by .161

Avg. Precision: 0.946 v .900 -- decreased by .046

Avg. F-Beta: 0.611 v .729 -- increased by .118

Avg. ROC AUC: 0.952 v .965 -- increased by .013

Time to run:  0:09:24.646272 v 0:05:41.022994 -- time decreased by over 3.5 minutes

While it did drop down the high precision that I originally found with the classifier, the tuning did improve the recall,
F-beta, and ROC AUC significantly.

In [None]:
    # parameters to tune
    parameters = {
        'clf__alpha': [.00000001, .000001, .0001, .01, 1, 100],
        'clf__loss': ['log', 'modified_huber'],
        'clf__penalty': ['l2', 'l1', 'elasticnet'],
        'clf__max_iter': [1000, 2000],
        'clf__n_iter_no_change': [5, 10],
        'clf__class_weight': ['balanced', None],
    }
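
A quick way to see why an exhaustive search over this grid is slow is to count the combinations. The sketch below repeats the grid under the hypothetical name `grid` so that it runs on its own, and assumes the 3-fold cross-validation described in the text.

```python
from itertools import product

# Same grid as above, repeated so the snippet is self-contained.
grid = {
    'clf__alpha': [.00000001, .000001, .0001, .01, 1, 100],
    'clf__loss': ['log', 'modified_huber'],
    'clf__penalty': ['l2', 'l1', 'elasticnet'],
    'clf__max_iter': [1000, 2000],
    'clf__n_iter_no_change': [5, 10],
    'clf__class_weight': ['balanced', None],
}

cv_folds = 3  # assumed from the cv=3 setting described in the text
combinations = list(product(*grid.values()))
total_fits = len(combinations) * cv_folds

print(len(combinations), "parameter combinations")  # 288
print(total_fits, "cross-validation fits")          # 864
```

Each extra value on any parameter multiplies the total, which is why the set of values per parameter has to stay small.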

10. Feature Extraction and 11. Modeling the Data

b. Most of the work I did around features revolved around using words versus character n-grams and the size of the n-grams.
I first tried word features and experimented with unigrams, unigrams and bigrams, and unigrams, bigrams, and trigrams.
Ultimately, only using unigrams provided the best results:

Avg. Recall: 0.581

Avg. Precision: 0.891

Avg. F-Beta: 0.703

Avg. ROC AUC: 0.958

Time to run:  0:02:35.807145

I then tried running it with only word boundary character n-grams. I tried bigrams, trigrams, 4-grams, and 5-grams. The 
best results were produced by the 4-grams run.

Avg. Recall: 0.611

Avg. Precision: 0.890

Avg. F-Beta: 0.725

Avg. ROC AUC: 0.963

Time to run:  0:04:51.814777

I then combined word unigrams with 4-grams and 5-grams to see what combinations worked best and the best results came 
from the unigrams with 4-grams.

Avg. Recall: 0.612

Avg. Precision: 0.893

Avg. F-Beta: 0.726

Avg. ROC AUC: 0.964

Time to run:  0:05:21.840541

I then tried to add the length of the comments, which made the results worse. I also tried upping the number of features
for the word and character n-grams to 20,000 and the results were worse.

After completing all of the other steps, I returned to each step in turn to see if I could make any other improvements.
Ultimately I added a feature capturing the logged_in field of the data and found that adding that improved the results.
These results were found after performing all of the other steps, so the improvements are not solely from adding this feature.

Avg. Recall: 0.621

Avg. Precision: 0.899

Avg. F-Beta: 0.734

Avg. ROC AUC: 0.964

Time to run:  0:05:56.381002

The features included in the final system are word unigrams, character 4-grams, and the boolean value of the
logged_in field.
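
To illustrate the difference between the two text feature types, here is a simplified, pure-Python sketch of word unigrams and word-boundary character 4-grams. It is an approximation written for illustration, based on my understanding that the 'char_wb' analyzer pads each word with a space on both sides and never lets an n-gram cross a word boundary. It is not scikit-learn's implementation, and the function names are hypothetical.

```python
def word_unigrams(text):
    # Simplified: lowercase and split on whitespace only.
    return text.lower().split()


def char_wb_ngrams(text, n=4):
    """Character n-grams that never cross a word boundary."""
    ngrams = []
    for word in text.lower().split():
        padded = " " + word + " "      # pad each word with one space per side
        if len(padded) <= n:
            ngrams.append(padded)      # short words contribute a single gram
            continue
        for i in range(len(padded) - n + 1):
            ngrams.append(padded[i:i + n])
    return ngrams


print(word_unigrams("you sad little"))
print(char_wb_ngrams("you sad little"))
```

One plausible reason character n-grams help here is that misspelled or inflected forms of a word still share most of their 4-grams, whereas they are entirely different word unigrams.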

d. I tried 4 different models. The first was the LogisticRegression that was part of the base code.

Avg. Recall: 0.612

Avg. Precision: 0.893

Avg. F-Beta: 0.726

Avg. ROC AUC: 0.964

Time to run:  0:05:21.840541

Multinomial Naive Bayes

Avg. Recall: 0.602

Avg. Precision: 0.842

Avg. F-Beta: 0.702

Avg. ROC AUC: 0.932

Time to run:  0:04:51.760757

RandomForest

Avg. Recall: 0.522

Avg. Precision: 0.888

Avg. F-Beta: 0.658

Avg. ROC AUC: 0.914

Time to run:  0:08:07.777086

SGDClassifier

Avg. Recall: 0.452

Avg. Precision: 0.946

Avg. F-Beta: 0.611

Avg. ROC AUC: 0.952

Time to run:  0:09:24.646272

Ultimately, I chose the SGDClassifier. These results actually seem to be worse than the LogisticRegression, 
but the difference in the precision was significant and none of the other models provided a value near it. 
I figured I could try it and see what would happen with tuned parameters. If I was looking to maximize the ROC AUC score,
I would have kept the LogisticRegression model, but I wanted to see if I could use the SGDClassifier and get better results.

In [None]:
    # Combine word and character features
    word_and_char = FeatureUnion([
        ('vect_word', CountVectorizer(max_features=10000, analyzer='word', ngram_range=(1, 1))),
        # Borrowed from https://stackoverflow.com/questions/39121104/how-to-add-another-feature-length-of-text-to-current-bag-of-words-classificati
        ('vect_char', CountVectorizer(max_features=10000, analyzer='char_wb', ngram_range=(4, 4)))
    ])

    clf = Pipeline([
        # Combine word/char features with the logged_in column
        ('all', FeatureUnion([
            ('comments', Pipeline([
                ('extract_field', FunctionTransformer(lambda x: x['comment'], validate=False)),
                ('vects', word_and_char),
                ('tfidf', TfidfTransformer(norm='l2'))
            ])),
            ('login', Pipeline([
                ('extract_field', FunctionTransformer(lambda x: x['logged_in'].to_numpy()[:, np.newaxis], validate=False)),
                ('encoder', OneHotEncoder())
            ]))
        ])),
        # Classify using the parameters that were found to be the best values
        # The search.fit below can be commented out as these values were found using that and it takes ~3.5 hours
        ('clf', SGDClassifier(alpha=.0001, class_weight=None, loss='modified_huber', max_iter=1000, n_iter_no_change=10, penalty='elasticnet', random_state=5))
    ])

12. Parameter Tuning (Part 2)

The parameters in the SGDClassifier above are the values that GridSearchCV reported as the best parameters.
The grid search code below is commented out because the parameter tuning steps take about 3.5 hours; uncomment it to rerun the search.

In [None]:
    # Find best parameters
    # search = GridSearchCV(clf, parameters, cv=3, verbose=10)
    # search.fit(dev_comments, dev_comments['attack'])  # pass the full frame; the pipeline selects its own columns
    # 
    # # Print out the best score and parameter set
    # print("Best Score: %.3f" %search.best_score_)
    # print("Best parameters set:")
    # 
    # best_parameters = search.best_estimator_.get_params()
    # for param_name in sorted(parameters.keys()):
    #     print("\t%s: %r" % (param_name, best_parameters[param_name]))

    # Set up KFold
    kf = KFold(n_splits=5)

    recalls = []
    precisions = []
    fbetas = []
    roc_aucs = []

    i =  1
    # For each split in KFold
    for training_data_indices, test_data_indices in kf.split(non_dev_comments):
        print("**********************************************************")
        print("Test Run: " + str(i))
        # Get training and test data
        training_data = non_dev_comments.iloc[training_data_indices]
        test_data = non_dev_comments.iloc[test_data_indices]

        # Fit the training data
        clf = clf.fit(training_data, training_data['attack'])

9. Metrics (Part 1)

f. The metrics provided a lot of information. The ROC AUC metric was a bit confusing at first but some internet research
helped to clarify it. Adding in the precision, recall, and confusion matrix was very useful as they provided a very clear 
picture of where the classifier could improve. They also helped to show how changes could be influencing
the performance in one direction or another. As seen in the results for the models, based solely on the ROC AUC score, I 
would have chosen LogisticRegression, but the high precision suggested that SGDClassifier might perform well after being tuned.
Having a variety of metrics allowed for a much deeper understanding of the results and how they were affected by the decisions 
that I made.
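
One interpretation of ROC AUC that may help: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. The function name `roc_auc_pairwise` and the toy scores are assumptions for illustration; the notebook itself calls `roc_auc_score`, and this quadratic-time loop is only meant for small examples.

```python
def roc_auc_pairwise(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy scores: 3 of the 4 positive/negative pairs are ordered correctly.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_pairwise(y_true, y_score))  # 0.75
```

Because the metric depends only on the ranking of scores, a model can have a high ROC AUC and still have low recall at the default decision threshold, which matches the pattern seen in the results above.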

I did add cross validation. I used K-Fold with 5 splits (n_splits=5) over all of the non-tuning comments (the train and test splits). As mentioned above,
the value of 5 was chosen to balance the size of the training versus test data against the time it would take to run
multiple iterations. I tracked the recall, precision, F-beta, and ROC AUC score for each iteration and then averaged them
to produce the metrics seen throughout this notebook. I did not average the confusion matrix, as it didn't make 
sense as something to be averaged, but I do print it for each iteration of the K-Fold.
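
The relationship between the confusion matrix and the averaged metrics can be sketched as follows. The counts are made up for illustration (they are not results from this notebook), and the helper name `binary_metrics` is hypothetical. The matrix layout `[[tn, fp], [fn, tp]]` follows what scikit-learn's `confusion_matrix` returns for boolean labels, and since `precision_recall_fscore_support` is called without a `beta` argument, the reported F-beta is the F1 score.

```python
def binary_metrics(conf_matrix):
    """Precision, recall, and F1 from a 2x2 matrix laid out as [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical confusion matrices for two folds.
fold_matrices = [
    [[880, 20], [40, 60]],
    [[890, 10], [50, 50]],
]
per_fold = [binary_metrics(m) for m in fold_matrices]

# Average each metric across folds, as done for the numbers in this notebook.
avg_precision, avg_recall, avg_f1 = (sum(col) / len(col) for col in zip(*per_fold))
print('Avg. Precision: %.3f' % avg_precision)
print('Avg. Recall: %.3f' % avg_recall)
print('Avg. F-Beta: %.3f' % avg_f1)
```

Note that the average of the per-fold F1 scores is generally not the same as the F1 computed from the averaged precision and recall, so the reported averages should not be expected to satisfy the F1 formula exactly.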

In [None]:
        # Predict the test data
        predictions = clf.predict(test_data)
        # Get the precision, recall, and fbeta
        (precision, recall, fbeta, support) = precision_recall_fscore_support(test_data['attack'], predictions, average='binary')
        print('Recall: %.3f' %recall)
        print('Precision: %.3f' %precision)
        print('F-Beta: %.3f' %fbeta)
        recalls.append(recall)
        precisions.append(precision)
        fbetas.append(fbeta)
        
        # Get the confusion matrix
        conf_matrix = confusion_matrix(test_data['attack'], predictions)
        print('Confusion Matrix:\n', conf_matrix)
        

In [None]:
        # Get the roc auc score
        auc = roc_auc_score(test_data['attack'], clf.predict_proba(test_data)[:, 1])
        print('Test ROC AUC: %.3f' %auc)
        roc_aucs.append(auc)
        
        i += 1
        print("**********************************************************")

9. Metrics (Part 2)

In [None]:
    # Average the recall, precision, F-beta, and roc auc
    print('Avg. Recall: %.3f' %np.mean(recalls))
    print('Avg. Precision: %.3f' %np.mean(precisions))
    print('Avg. F-Beta: %.3f' %np.mean(fbetas))
    print('Avg. ROC AUC: %.3f' %np.mean(roc_aucs))

end = datetime.now()

# Find time to run to understand performance from that perspective
time = end - start
print("Time to run: ", time)

c. No optimizations were performed in this code. A few tweaks were made though. The first change was that the classifier
does not operate on just the comment column but on the entire data set, so that it can access the logged_in field. Also, when
performing parameter tuning I set cv to 3 so that the process would be faster given the number of parameters I tuned.
With the current setup it took about 3.5 hours to run through every combination. For the most part, having a long runtime
was not an issue, so effort was not spent trying to optimize the system.

g. Final metrics v Original metrics

Avg. Recall: 0.621 v .571 -- improved by .05

Avg. Precision: 0.899 v .889 -- improved by .01

Avg. F-Beta: 0.734 v .696 -- improved by .038

Avg. ROC AUC: 0.964 v .955 -- improved by .009

Time to run:  0:05:56.381002 v 0:04:56.208785 -- slower by 1 minute

Every metric improved, except for runtime, which is expected since we are now doing K-Fold with n=5.

These results come from the SGDClassifier model.


h. The most interesting thing that I learned was how many different ways there are to approach a problem like this.
From the text cleanup all the way through the parameter tuning, there were so many decisions to make that
at times it was overwhelming to decide what to try next to improve the performance. I hadn't grasped the amount of
thought and decision making that goes into designing a classification system and tuning it to perform as well as it can.
It was very interesting to try out different things and see how each decision affected the metrics: some improved
recall but hurt precision, and others improved both but could take a very long time to run. A lot of time was spent
trying to balance all of the metrics and find the "best" classifier.

i. The hardest thing to do in this project was to identify which decisions would help improve the classifier. As 
mentioned above, there were many decisions to make at each step, and trying to make the best one was very difficult. Looking
through the scikit-learn documentation and searching online provided many different ways of cleaning text data and many different
models that could be used. Even after narrowing down the list to a handful of models, I had to decide which classifier seemed
to be the best in order to run parameter tuning on it, as there was not enough time to try tuning every model. I had to
make these decisions with the information I had and without a complete grasp of how each model worked. After choosing a model,
it was difficult to decide which parameters to tune and how to tune them. Given the sharp increase in runtime for every
parameter that is tuned and each value to be tried, there was a need to reduce the size of the set. I ultimately found a
set that could run in about 3.5 hours and tune a handful of the parameters that SGDClassifier has.