# Analysis of attacks in Wikipedia comments

## Approach 

### Initial Code
The strawman code was used initially and then modified as needed. The comment annotation and comment files (`attack_annotations.tsv` and `attack_annotated_comments.tsv`) were downloaded to the local machine.

*Please note that since the data is imbalanced, the F1 score is reported for the 'attack' label. It is noteworthy that the F1 score for the 'not attack' label was greater than 0.9 in all experiments.*

In [26]:
import pandas as pd
from stemmer import StemTokenizer
from pprint import pprint
from time import time
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix

comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

len(annotations['rev_id'].unique())

# label a comment as an attack if the majority of annotators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5

# join labels and comments
comments['attack'] = labels

# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

# split the comments into train and test sets
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")


### Baseline
Running the given strawman code resulted in the following metrics.

Test ROC AUC: 0.957

Precision and recall:

    Precision (not attack, attack): (0.94272964, 0.91476591)
    Recall (not attack, attack):    (0.99304671, 0.55297533)

F1 Score: 

    Not attack: 0.96722596
    Attack:     0.68928087

|  Conf. Matrix     | Predicted not attack | Predicted attack |
| ----------------- |:--------------------:| ----------------:|
| Actual not attack |  20280               | 142              |
| Actual attack     |  1232                | 1524             |
 
Thus, although accuracy is high, the F1 score and the number of false negatives indicate that the classifier misclassifies many comments with label 1 (attacks).

### Data pruning 
Along with the given methods of data pruning, the following three methods were explored:
1. Punctuation Removal
2. Stopwords Removal
3. Stemming

#### Punctuation Removal
All punctuation was removed from the data. This did not significantly improve the accuracy or the F1 score.

#### Stopwords Removal
Words such as 'is', 'the' and 'a' are very common in English, so they are unlikely to contribute much to the classifier's performance. Sklearn's CountVectorizer stop word removal was used.

Test ROC AUC: 0.951

Precision and recall:

    Precision (not attack, attack): (0.94141386, 0.92160494)
    Recall (not attack, attack):    (0.99378122, 0.54172714)

F1 Score: 

    Not attack: 0.96688899
    Attack:     0.68235832

|  Conf. Matrix     | Predicted not attack | Predicted attack |
| ----------------- |:--------------------:| ----------------:|
| Actual not attack |  20295               | 127              |
| Actual attack     |  1263                | 1493             |

Result: The accuracy decreased along with the F1 score, so stop word removal was not used further. The nltk stop word list includes words with negative connotations, so a customized list of stopwords was prepared and used. It did not yield any significant improvement either.
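
One way to build such a customized list is to start from a standard English list and keep the negation words. The sketch below is only an illustration, not the list used in these experiments: it uses scikit-learn's `ENGLISH_STOP_WORDS` rather than the nltk list, and the `negations` set and the two sample comments are assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Assumption: words we want to keep because they can flip the meaning of a comment.
negations = {"not", "no", "never"}

# Remove the negations from the standard list; CountVectorizer accepts a plain list.
custom_stop_words = sorted(ENGLISH_STOP_WORDS - negations)

vect = CountVectorizer(stop_words=custom_stop_words)
vect.fit(["you are not the worst editor", "this is the best edit"])

vocab = set(vect.vocabulary_)
print(sorted(vocab))
```

With this list, function words such as "the" are dropped while "not" survives as a feature.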


In [27]:
# stopwords=["the", "a", "all", "am", "an", "and", "are", "as", "at", "be", "because", "been", "being",
#            "between", "both", "by", "did", "do", "does", "doing", "during",  "for", "from",
#            "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers",
#            "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into",
#            "is", "it", "it's", "its", "itself", "let's", "me", "my", "myself", "of", "other", "our", "ours", "ourselves",
#            "she", "she'd", "she'll", "she's", "should", "so", "that", "that's", "the", "their", "about", "again",
#            "theirs", "them", "themselves", "there", "there's", "these", "they", "they'd", "they'll", "they're",
#            "they've", "this", "those", "through", "to", "was", "we", "we'd", "we'll", "we're", "we've", "were",
#            "what", "what's", "when", "when's", "where", "where's", "which", "who", "who's", "whom", "why", "why's",
#            "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"]
# import re
# stopword_set = set(stopwords)                                    # build the lookup set once
#
# def preprocess(x):
#     x = re.sub(r'[^a-z\s:)(]', '', x.lower())                    # drop punctuation, keep ':', '(' and ')'
#     x = [w for w in x.split() if w not in stopword_set]          # remove stopwords
#     return ' '.join(x)
#
# train_comments['comment'] = train_comments['comment'].apply(preprocess)
# Y_train = train_comments['attack']
#
# test_comments['comment']=test_comments['comment'].apply(preprocess)
# Y_test=test_comments['attack']


#### Stemming 
Words such as 'lovingly' and 'loved' carry the same meaning, so they are reduced to their root form 'love'. This can help coalesce features and thus improve model performance.
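
The `StemTokenizer` imported from the local `stemmer` module is not shown in this report. A minimal sketch of what such a tokenizer could look like follows, assuming nltk's `PorterStemmer` and a simple regular-expression word splitter. Both are assumptions, not the original implementation.

```python
import re
from nltk.stem import PorterStemmer


class StemTokenizer:
    """Hypothetical stand-in: split text into word tokens, then stem each one."""

    def __init__(self):
        self.stemmer = PorterStemmer()
        self.token_re = re.compile(r"\w+")

    def __call__(self, text):
        # Lowercase, extract word tokens, and reduce each token to its stem.
        return [self.stemmer.stem(tok) for tok in self.token_re.findall(text.lower())]


tokens = StemTokenizer()("She loved it and kept loving it")
print(tokens)
```

An instance can be passed as `CountVectorizer(tokenizer=StemTokenizer())`. Note that scikit-learn applies a custom tokenizer only when `analyzer='word'`.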

Result:

Test ROC AUC: 0.959

Precision and recall:

    Precision (not attack, attack): (0.94424502, 0.90330052)
    Recall (not attack, attack):    (0.99182254, 0.56603774)

F1 Score: 

    Not attack: 0.96744919
    Attack:     0.69596253

|  Conf. Matrix     | Predicted not attack | Predicted attack |
| ----------------- |:--------------------:| ----------------:|
| Actual not attack |  20255               | 167              |
| Actual attack     |  1196                | 1560             |

 
Thus stemming was found useful and hence was used further.

#### Combined (Stemming and Stop word removal)
A combination of the above two features was carried out on the strawman code.

Result:

Test ROC AUC: 0.955

Precision and recall:

    Precision (not attack, attack): (0.94057111, 0.91780822)
    Recall (not attack, attack):    (0.99353638, 0.53483309)

F1 Score: 

    Not attack: 0.96632852
    Attack:     0.67583677

|  Conf. Matrix     | Predicted not attack | Predicted attack |
| ----------------- |:--------------------:| ----------------:|
| Actual not attack |  20290               | 132              |
| Actual attack     |  1282                | 1474             |



Result: Hence, the combination of stemming and stop word removal was not found useful.

Along with the above methods, some combinations of the given methods, such as 3-grams, were tried.
The results follow.

*Note: These experiments were carried out with strawman code.*

| Data Pruning           | F1 (attack)   | AUC   |
| ---------------------- |:-------------:| -----:|
| None (baseline)        | 0.689204      | 0.957 |
| 3 Gram                 | 0.687812      | 0.955 |
| Stopword               | 0.682358      | 0.951 |
| Stemming               | 0.696544      | 0.959 |
| Stopword + Stemming    | 0.675837      | 0.955 |
| Char 7 gram            | 0.673421      | 0.951 |
| Stemming + Char 7 gram | 0.683660      | 0.949 |

3-grams are a feature extraction procedure rather than data pruning, but they are included here for comparison.

### Feature Extraction
Along with the Count Vectorization and TF-IDF vectorization already given, the following features were used.

#### Data grams
As mentioned in the document, the following n-gram ranges were tried:

1. (1, 1)
2. (1, 2)
3. (1, 3)
4. 2
5. (1, 2)
6. 3

A range (1, n) means that all n-grams of sizes 1 through n were used; a single number n means that only n-grams of that size were used. Extensive experimentation showed that (1, 2) gave the best accuracy.
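
To make the range notation concrete, here is a small sketch of how `ngram_range` changes the features that `CountVectorizer` extracts. The sample sentence and the helper name `ngram_features` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["you are wrong"]


def ngram_features(ngram_range):
    """Return the sorted n-gram features extracted for the given range."""
    vect = CountVectorizer(ngram_range=ngram_range)
    vect.fit(doc)
    return sorted(vect.vocabulary_)


uni_and_bi = ngram_features((1, 2))   # unigrams and bigrams
bi_only = ngram_features((2, 2))      # bigrams only
print(uni_and_bi)
print(bi_only)
```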

#### Char versus Word
Characters were used as features instead of words, **with n_grams (1, 2)**

Result:

Test ROC AUC: 0.915

Precision and recall:

    Precision (not attack, attack): (0.92754089, 0.83380481)
    Recall (not attack, attack):    (0.9884928, 0.4277939)

F1 Score: 

    Not attack: 0.95704736
    Attack:     0.56546763

|  Conf. Matrix     | Predicted not attack | Predicted attack |
| ----------------- |:--------------------:| ----------------:|
| Actual not attack |  20187               | 235              |
| Actual attack     |  1577                | 1179             |

**with n_grams (1, 5)**

Result:

Test ROC AUC: 0.957

Precision and recall:

    Precision (not attack, attack): (0.94138395, 0.91538933)
    Recall (not attack, attack):    (0.99324258, 0.54172714)

F1 Score: 

    Not attack: 0.96661822
    Attack:     0.69064737

|  Conf. Matrix     | Predicted not attack | Predicted attack |
| ----------------- |:--------------------:| ----------------:|
| Actual not attack |  20284               | 138              |
| Actual attack     |  1263                | 1493             |

From the above experiments, it is clear that characters as features with n_grams (1, 5) do yield an improvement.
Hence, they were used in the code.
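
One plausible reason character n-grams can help on this kind of data is that deliberately misspelled words still share sub-word features with the original spelling. The sketch below illustrates this with a made-up pair of spellings; it is not taken from the experiments above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectorizer uses to turn text into features.
char_analyzer = CountVectorizer(analyzer='char', ngram_range=(1, 5)).build_analyzer()
word_analyzer = CountVectorizer(analyzer='word').build_analyzer()

# Character n-grams shared by the plain and the obfuscated spelling.
shared_chars = set(char_analyzer("idiot")) & set(char_analyzer("id1ot"))

# Word features shared by the same pair.
shared_words = set(word_analyzer("idiot")) & set(word_analyzer("id1ot"))

print(sorted(shared_chars))
print(sorted(shared_words))
```

At the word level the two spellings are unrelated tokens, while at the character level they overlap.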

In [28]:
pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', MLPClassifier()),
    ])
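feature_params = {
    'vect__max_features': (5000, 10000, 30000, 50000),
    'vect__ngram_range': ((1, 5),),           # trailing comma: a single candidate, the tuple (1, 5)
    'vect__tokenizer': (StemTokenizer(),),    # sklearn applies a tokenizer only when analyzer == 'word'
    'vect__analyzer': ('char',),
    'tfidf__norm': ('l2',)                    # l1 normalization was not tried
}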

The data also contains some non-text features, such as:

1. year
2. is_logged

#### Year
Year could be an important feature, since the vocabulary of everyday English has changed over the past few years: new words and shorthand forms have been introduced. Separating newer comments (with newer words) could therefore be useful.

Normalisation of data: The TF-IDF values in the feature matrix ranged from 0.01 to 0.99, whereas the year has the form 2xxx, which is in the thousands. Hence the year was scaled down as follows:

(year % 2000)/100

Other scaling techniques, such as those in sklearn's preprocessing module, were also tried.
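
A minimal sketch of this scaling, and of one way to append the scaled year to a sparse text feature matrix, is shown below. The small `text_features` matrix is a made-up stand-in for the TF-IDF output, and `scipy.sparse.hstack` is an assumption about how the columns could be joined, not necessarily what was done in these experiments.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

years = np.array([2004, 2010, 2015])

# Scale the year into the same range as the TF-IDF values: 2004 -> 0.04, 2015 -> 0.15.
year_scaled = (years % 2000) / 100.0

# Stand-in for a TF-IDF matrix with 3 comments and 2 text features.
text_features = csr_matrix(np.array([[0.0, 0.7],
                                     [0.5, 0.5],
                                     [0.9, 0.0]]))

# Append the scaled year as one extra column.
combined = hstack([text_features, csr_matrix(year_scaled.reshape(-1, 1))]).tocsr()
print(combined.toarray())
```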

Result:

Test ROC AUC: 0.701

Precision and recall:

    Precision (not attack, attack): (0.88616349, 0.19274681)
    Recall (not attack, attack):    (0.94114191, 0.10413643)

F1 Score: 

    Not attack: 0.91282563
    Attack:     0.1352179

|  Conf. Matrix     | Predicted not attack | Predicted attack |
| ----------------- |:--------------------:| ----------------:|
| Actual not attack |  19220               | 1202             |
| Actual attack     |  2469                | 287              |

Here, it is clear that year as a feature yielded poor results. Hence, it was not used further.

#### Is_logged
The is_logged value could generally separate genuine users, who wish to give constructive comments, from fake users who want to poke fun at others. Hence, this could be a good feature.

Normalisation of data: The feature is binary. It was normalised with sklearn's preprocessing.scale, which computes z-scores.
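
For reference, a z-score subtracts the column mean and divides by the column standard deviation. A minimal sketch with made-up flag values:

```python
import numpy as np
from sklearn import preprocessing

# Made-up binary flags for four comments, as a single feature column.
logged_in = np.array([[1.0], [0.0], [1.0], [1.0]])

# scale() standardizes each column to zero mean and unit variance.
scaled = preprocessing.scale(logged_in)
print(scaled.ravel())
```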

Result:

Test ROC AUC: 0.722

Precision and recall:

    Precision (not attack, attack): (0.88718, 0.20428)
    Recall (not attack, attack):    (0.93992, 0.11430)

F1 Score: 

    Not attack: 0.9127871
    Attack:     0.1465798

|  Conf. Matrix     | Predicted not attack | Predicted attack |
| ----------------- |:--------------------:| ----------------:|
| Actual not attack |  19195               | 1227             |
| Actual attack     |  2441                | 315              |

Here, it is clear that, like year, is_logged as a feature did not improve on the baseline. Hence, this feature was dropped.


### Models
Experiments were carried out using the following models:
1. Multinomial Naive Bayes
2. Support Vector Machine
3. MLP classifier

#### Multinomial Naive Bayes
The Multinomial Naive Bayes algorithm, with the regularization (alpha smoothing) and fit_prior hyperparameters tuned, yielded the following results. GridSearchCV was used to find the best parameters.

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__alpha': [1e-06, 1e-04],
 'clf__fit_prior': (True, False),
 'tfidf__norm': ['l2'],
 'vect__analyzer': ('word', 'char'),
 'vect__max_features': (10000, 30000),
 'vect__ngram_range': ((1, 2),(1, 5),),
 'vect__tokenizer': (<stemmer.StemTokenizer object at 0x126a2a898>,)}
done in 29770.999s

Best score: 0.927
Best parameters set:
clf__alpha: 1e-06
clf__fit_prior: True
tfidf__norm: 'l2'
vect__max_features: 10000
vect__ngram_range: (1, 2)
vect__tokenizer: <stemmer.StemTokenizer object at 0x126a2a898>

Test ROC AUC: 0.929

F1 score 0.672893


Thus, both the ROC AUC and the F1 score are lower than the baseline.

#### Support Vector Machine
A Support Vector Machine with a linear kernel and balanced class weights yielded the following result. Because the data is imbalanced, the 'balanced' class weight was used. This helps the SVM train better, since errors on each class are penalized in the cost function in proportion to the class weight.
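
To show what 'balanced' means numerically, the sketch below applies the heuristic documented by scikit-learn, `n_samples / (n_classes * class_count)`. The class counts are the test-set counts implied by the final confusion matrix in this report, used here only as a stand-in because the training counts are not reported.

```python
import numpy as np

# Stand-in class counts: (not attack, attack).
class_counts = np.array([20422, 2756])

n_samples = class_counts.sum()
n_classes = len(class_counts)

# 'balanced' heuristic: rarer classes receive proportionally larger weights.
weights = n_samples / (n_classes * class_counts)
print(dict(zip(["not attack", "attack"], weights.round(3))))
```

With these counts, an error on an attack comment costs roughly seven times as much as an error on a non-attack comment.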

('vect', CountVectorizer(max_features = 10000, ngram_range = (1,5), analyzer='char', tokenizer=StemTokenizer())),
('tfidf', TfidfTransformer(norm = 'l2')),
('clf', SVC(C=0.001, class_weight='balanced', kernel='linear') )

Test ROC AUC: 0.960

F1 Score: 0.72196821

Both the F1 score and the ROC AUC improved over the baseline.


#### MLP classifier
The Multi-Layer Perceptron algorithm was applied, and initial results showed a good improvement over the baseline. Hence, the MLP hyper-parameters were tuned to get a better result.

Initial result:

pipeline = Pipeline([
    ('vect', CountVectorizer(max_features=10000, ngram_range=(1, 5), analyzer='char',
                             tokenizer=StemTokenizer())),    # stop word removal left disabled
    ('tfidf', TfidfTransformer(norm='l2')),
    ('clf', MLPClassifier(hidden_layer_sizes=(50,), activation='logistic', alpha=0.01, n_iter_no_change=4, batch_size=200))
])

Test ROC AUC: 0.960

F1 Score: 0.72750698


### Hyper-Parameters Optimization

The following hyper-parameters were tuned in the MLP model.

#### Regularization constant (alpha)
Used alpha values of (0.0000005, 0.00002, 0.001, 0.01).
Among these, 0.01 resulted in the best performance.

#### Activation function
The following activation functions were applied: ('logistic', 'tanh', 'relu'). The logistic function gave the best F1 score (precision and recall). Tanh and ReLU (the default) actually decreased the accuracy.

#### Hidden Layers
The following hidden layer combinations were used: ((50,), (50, 50), (100,), (50, 50, 50)).
A network with a single hidden layer of 50 neurons, (50,), was found to give the best results in terms of F1 score.

#### Max Iteration
Max iteration values of 100, 200 and 400 were tried, among which 200 was the best. For some combinations, 200 iterations resulted in a convergence warning; rerunning these hyper-parameter combinations with 400 iterations did not yield significant improvements.

In [28]:
hyper_params={
    'clf__alpha': (0.0000005, 0.00002, 0.001, 0.01),
    'clf__hidden_layer_sizes': ((50,), (50, 50), (100, ), (50, 50, 50)),
    'clf__max_iter': (200, 400),
    'clf__batch_size': (200,),
    'clf__activation': (['logistic', 'tanh', 'relu'])
}

parameters={}
parameters.update(hyper_params)
parameters.update(feature_params)

grid_search = GridSearchCV(pipeline,
                           parameters,
                           cv=5,                    # an integer cv uses StratifiedKFold for classifiers
                           n_jobs=-1,
                           refit=True,
                           scoring='f1_weighted')

t0 = time()

grid_search.fit(train_comments['comment'], train_comments['attack'])

print("done in %0.3fs" % (time() - t0))
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

done in 67309.999s
 
Best score: 0.961
Best parameters set:
	clf__alpha: '1e-02'
	clf__hidden_layer_sizes: (50,)
	clf__max_iter: 200
	clf__batch_size: 200
	clf__activation: 'logistic'
	vect__max_features: 10000
	vect__ngram_range: (1, 5)
	vect__tokenizer: <stemmer.StemTokenizer object at 0x126a2a898>
	vect__analyzer: 'char'
	tfidf__norm: 'l2'
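
The long run time is easier to understand once the size of the search is counted. The sketch below uses scikit-learn's `ParameterGrid` on the multi-valued parameters from the search above; parameters with a single candidate are left out because they do not change the count.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    'clf__alpha': (0.0000005, 0.00002, 0.001, 0.01),
    'clf__hidden_layer_sizes': ((50,), (50, 50), (100,), (50, 50, 50)),
    'clf__max_iter': (200, 400),
    'clf__batch_size': (200,),
    'clf__activation': ('logistic', 'tanh', 'relu'),
    'vect__max_features': (5000, 10000, 30000, 50000),
}

n_candidates = len(ParameterGrid(grid))   # number of parameter combinations
n_fits = n_candidates * 5                 # each combination is fitted once per fold (cv=5)
print(n_candidates, n_fits)
```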


## Metrics and Cross validation
#### Metrics
Rather than accuracy, precision and recall were better estimates. The data was **unbalanced**, so a very high accuracy might only indicate that the classifier is very good at predicting one of the classes, but not the other. Precision and recall, on the other hand, are better estimates because they separate true positives from false positives, and likewise for the negatives.
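
The gap between accuracy and the per-class metrics can be checked directly from a confusion matrix. The sketch below uses the matrix from the final test output in this report, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes, with 'attack' as the positive class).

```python
import numpy as np

# [[TN, FP],
#  [FN, TP]] for the 'attack' class.
cm = np.array([[20239, 183],
               [1075, 1681]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision = tp / (tp + fp)    # how many predicted attacks are real attacks
recall = tp / (tp + fn)       # how many real attacks are found
f1 = 2 * precision * recall / (precision + recall)

print('accuracy %.3f, precision %.3f, recall %.3f, F1 %.3f'
      % (accuracy, precision, recall, f1))
```

Accuracy is about 0.95 even though roughly four in ten attacks are missed, which is why the F1 score for the 'attack' label is the more informative number.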

#### Cross Validation 
K-fold cross-validation was used with k = 5, via the built-in GridSearchCV method. For classifiers, an integer `cv` makes GridSearchCV use stratified folds.
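
Stratification matters for imbalanced data because it keeps the share of the rare class the same in every fold. A minimal sketch on made-up labels (20 negatives, 5 positives):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels; the features are irrelevant for the split itself.
y = np.array([0] * 20 + [1] * 5)
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5)

# Count the positive examples that land in each validation fold.
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(positives_per_fold)
```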

In [28]:
test_result_proba = grid_search.predict_proba(test_comments['comment'])
test_result = grid_search.predict(test_comments['comment'])

auc = roc_auc_score(test_comments['attack'], test_result_proba[:, 1])
print('Test ROC AUC: %.3f' % auc)

fscore = f1_score(test_comments['attack'], test_result)
print('F1 score %.3f' % fscore)

matrix = confusion_matrix(test_comments['attack'], test_result)
print('Confusion Matrix', matrix)

Test ROC AUC: 0.961
F1 score 0.727
array([[20239, 183],
[1075, 1681]])



#### Results

ROC AUC increased by 0.4 percentage points over the baseline (0.957 to 0.961).

F1 score for the 'attack' label increased by about 4 percentage points (0.689 to 0.727).

### Interesting things learned from the project
1. I got to dive deep into several methods of machine learning. It was interesting to see how strongly hyper-parameters affect the results.
2. Along with the above methods, I tried Keras, a deep learning library, with a CNN-based model. There, each word is mapped to a 32-dimensional vector, and semantically closer words are mapped closer together in the vector space. Such a classifier can derive meaningful representations of sentences.

### Difficult part of the project
1. Tuning hyper-parameters to tackle the imbalance of the data involved a steep learning curve for me.
1. Also, some well-known methods such as stop word removal didn't have a significant impact, which was difficult to understand.
1. Running GridSearchCV in a Jupyter notebook is very time consuming. I had to run my code separately on the machine.
1. Implementing feature union (trying out column transformers) was difficult, as it resulted in several Python errors.