# Personal Attack Identifier - Hao Qin

###### Dataset from Wikipedia 
Questions A-I are presented in the order of classifier implementation, not in alphabetical order.

### C. Improvements
1. Used `ColumnTransformer` instead of `FeatureUnion`.
2. Used a dictionary of scorers to print comprehensive results.

### F. Metrics & CrossValidation
1. Metrics tell me more about my classifier. For example, recall tells me what percentage of personal attacks my classifier catches, and precision tells me what percentage of the comments it flags really are attacks.
2. In my opinion we do not want to miss bad comments, since Wikipedia is open to all ages. Based on this, we want to lower the false negative rate. **Therefore this model puts slightly more weight on Recall, then AUC, then Precision.**
3. Cross-validation: Yes, I think cross-validation is necessary to reduce sampling bias.
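
To make the cross-validation idea concrete, here is a minimal pure-Python sketch of how k folds partition the row indices so that every row is used for testing exactly once. The helper `k_fold_indices` is a hypothetical name introduced for illustration. It produces plain, unshuffled folds, whereas scikit-learn's `cv=5` uses stratified folds by default for a classifier with binary labels, so this is a simplification rather than the exact split used in this notebook.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Toy run: 10 rows, 5 folds. Each row lands in exactly one test fold.
for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5)):
    print(fold, test_idx)
```

Averaging a score over the five test folds is what makes the estimate less dependent on any single train/test split.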

In [1]:
import urllib.request  # urlretrieve lives in the urllib.request submodule
import pandas as pd
import string
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

In [2]:
# download annotated comments and annotations

ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634' 
ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637' 

def download_file(url, fname):
    urllib.request.urlretrieve(url, fname)

                
# download_file(ANNOTATED_COMMENTS_URL, 'attack_annotated_comments.tsv')
# download_file(ANNOTATIONS_URL, 'attack_annotations.tsv')

In [3]:
comments = pd.read_table('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

In [4]:
len(annotations['rev_id'].unique())

115864

In [5]:
# label a comment as an attack if the majority of annotators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5
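
The majority-vote rule in the cell above can be sketched without pandas. The function `majority_labels` and the toy `(rev_id, attack)` pairs are hypothetical and only illustrate the `mean() > 0.5` logic. Note that a tie (mean exactly 0.5) is labeled as not an attack because the comparison is strict.

```python
from collections import defaultdict


def majority_labels(annotations):
    """Map each rev_id to True when the mean of its 0/1 attack votes exceeds 0.5."""
    votes = defaultdict(list)
    for rev_id, attack in annotations:
        votes[rev_id].append(attack)
    return {rev_id: sum(v) / len(v) > 0.5 for rev_id, v in votes.items()}


# Hypothetical (rev_id, attack) pairs, not taken from the dataset.
toy_annotations = [(1, 1), (1, 1), (1, 0), (2, 0), (2, 1), (3, 0), (3, 0)]
toy_labels = majority_labels(toy_annotations)
print(toy_labels)
```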

In [6]:
# join labels and comments
comments['attack'] = labels

### A. Text clean up
1. The original scheme, which deletes the 'newline' and 'tab' tokens.
2. Removing punctuation, which improves performance a little but slightly lowers the scores.
3. Removing stopwords, which actually lowers the scores slightly.
4. **Removing digits and converting everything to lowercase, which improves all scores by 0.001% (chosen)**
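
As a rough illustration of the chosen clean-up (option 4), the sketch below applies the same steps to a single string: drop the two placeholder tokens, lowercase, and strip digits. The function `clean_comment` and its `token_replacement` parameter are hypothetical additions. The parameter is there only to show that replacing a token with an empty string glues the neighbouring words together, while replacing it with a space keeps them apart.

```python
import string

# Translation table that deletes every ASCII digit.
_DIGIT_TABLE = str.maketrans("", "", string.digits)


def clean_comment(text, token_replacement=""):
    """Remove placeholder tokens, lowercase the text, and strip digits."""
    # Tokens are upper-case, so they must be replaced before lowercasing.
    for token in ("NEWLINE_TOKEN", "TAB_TOKEN"):
        text = text.replace(token, token_replacement)
    return text.lower().translate(_DIGIT_TABLE)


raw = "HelloNEWLINE_TOKENWorld 2019"
print(repr(clean_comment(raw)))        # words are joined
print(repr(clean_comment(raw, " ")))   # words stay separate
```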

In [7]:
# clean data
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", ""))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", ""))
comments.comment = comments.comment.apply(lambda x: x.lower())
comments.comment = comments.comment.apply(lambda x: x.translate(str.maketrans('','',string.digits)))

In [8]:
comments.query('attack')['comment'].head(10)

rev_id
2702703    ____fuck off you little asshole. if you want t...
4632658         i have a dick, its bigger than yours! hahaha
6545332    == renault ==you sad little bpy for driving a ...
6545351    == renault ==you sad little bo for driving a r...
7977970    ,  nov  (utc)::because you like to accuse me o...
8359431    `::you are not worth the effort. you are argui...
8724028    yes, complain to your rabbi and then go shoot ...
8845700                     i am using the sandbox, ass wipe
8845736    == god damn ==god damn it fuckers, i am using ...
Name: comment, dtype: object

In [9]:
# baseline
attack = comments['attack']
baseline = 1-attack.mean()
print(baseline)

0.8827073120209901


In [10]:
# train/test split
train = comments.query("split=='train'")
test = comments.query("split=='test'")

### E. Tuning Parameters
1. **C: I tried 0.25, 0.5, 1, 1.5 and 2. AUC and Precision increase as C goes up and decrease as C goes down; Recall behaves the other way around. After testing several combinations I decided on C=0.6, which gives a good balance between the scores. (chosen)**
2. loss: 'hinge' decreases all scores.
3. dual, intercept_scaling, max_iter: These parameters have no obvious effect on the scores.
4. **TfidfVectorizer: max_features=10000 and ngram_range=(1,1) also improve the scores (chosen)**
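
When several values are tried per parameter, a grid search evaluates every combination, each fitted once per fold. The sketch below counts that cost in plain Python. The helper `expand_grid` is hypothetical, and the grid values are only an example built from the C values listed above plus the two `LinearSVC` loss options. It is not the grid actually run in this notebook.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


# Example grid using this notebook's step names (the values are illustrative).
candidate_grid = {
    "lsvc__C": [0.25, 0.5, 1, 1.5, 2],
    "lsvc__loss": ["hinge", "squared_hinge"],
    "union__text__max_features": [10000],
}
candidates = expand_grid(candidate_grid)
cv_folds = 5
print(len(candidates), "candidates,", len(candidates) * cv_folds, "cross-validation fits")
```

This is why slow models become impractical quickly: every extra value multiplies the number of fits.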

In [11]:
# parameters for GridSearchCV
pipe_parms=[{
    'union__transformer_weights': [{'non-text': 0.2, 'text':0.8}],
    'union__text__max_features':[10000],
    'union__text__ngram_range': [(1,1)],
    'lsvc__C': [0.6],
}]

### B. Feature Extraction
1. year: no significant improvement. As more weight is added to year, AUC actually goes down.
2. logged_in: no significant improvement. As more weight is added, AUC actually goes down.
3. **sample: slight improvement in recall with a weight of 0.2. Increasing the weight of sample raises recall by 1-5%, but precision and AUC decrease significantly. After testing different weights, the best distribution so far is "sample": 0.2, "comment": 0.8 (chosen)**
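
The effect of the weights can be pictured as scaling each feature block before the blocks are concatenated into one row, which is how `transformer_weights` acts in `ColumnTransformer`. The sketch below is a plain-Python illustration. The helpers `one_hot` and `weighted_concat`, the category names for `sample`, and the TF-IDF values are all assumptions made for the example.

```python
def one_hot(value, categories):
    """Encode value as a 0/1 vector over the given category order."""
    return [1.0 if value == c else 0.0 for c in categories]


def weighted_concat(blocks, weights):
    """Scale each named feature block by its weight, then concatenate in order."""
    row = []
    for name, features in blocks:
        row.extend(weights[name] * x for x in features)
    return row


sample_categories = ["blocked", "random"]  # assumed values of the 'sample' column
text_features = [0.5, 0.0, 0.25]           # made-up TF-IDF values for one comment

row = weighted_concat(
    [("non-text", one_hot("random", sample_categories)), ("text", text_features)],
    {"non-text": 0.2, "text": 0.8},
)
print(row)
```

A larger weight stretches that block's columns, so a linear model with a fixed regularization strength is effectively penalized less for relying on it.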

In [12]:
#classifier & feature extraction
pipeline = Pipeline([
    ('union', ColumnTransformer([
        ('non-text', OneHotEncoder(categories='auto'), ['sample']),
        ('text', TfidfVectorizer(), 'comment')])
    ),
    ('lsvc', LinearSVC()) 
])

### D. Model Selection
1. LogisticRegression: This method achieves an AUC of around 95.7% and 91% precision. However, recall is only 55%.
2. MultinomialNB: This method has the highest AUC at 96%, but recall and precision are 51% and 87% respectively.
3. RandomForestClassifier: This method only reaches AUC 86%, recall 43%, precision 81%, and takes much longer than the other methods.
4. MLPClassifier: Could not finish running this method within hours; it is not suitable for a problem of this scale.
5. **LinearSVC: This method achieves AUC 96%, recall 64%, precision 87%. (chosen)**
6. SVC: Ran for hours without finishing, so abandoned.
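
One way to apply the stated priority (Recall, then AUC, then Precision) to the numbers above is a lexicographic sort. This is a stricter reading than "slightly more weight on Recall" and is only a sketch, with `ModelScore` and `rank_models` being hypothetical names. The two models that did not finish are left out.

```python
from collections import namedtuple

# Scores in percent, copied from the list above.
ModelScore = namedtuple("ModelScore", ["name", "auc", "recall", "precision"])

scores = [
    ModelScore("LogisticRegression", 95.7, 55, 91),
    ModelScore("MultinomialNB", 96, 51, 87),
    ModelScore("RandomForestClassifier", 86, 43, 81),
    ModelScore("LinearSVC", 96, 64, 87),
]


def rank_models(rows):
    """Sort best-first by recall, breaking ties by AUC and then precision."""
    return sorted(rows, key=lambda r: (r.recall, r.auc, r.precision), reverse=True)


for row in rank_models(scores):
    print(row.name, row.recall, row.auc, row.precision)
```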

In [13]:
#fitting
scoring = {'AUC': 'roc_auc', 'Accuracy': metrics.make_scorer(metrics.accuracy_score),
           'Recall': 'recall', 'Precision': 'precision'}

gs = GridSearchCV(pipeline, param_grid = pipe_parms, cv=5, scoring=scoring, refit='AUC', return_train_score=True)
# note: this fits on the full comments frame, including the rows with split=='test'
%time gs.fit(comments, comments['attack'])

Wall time: 2min 36s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('union',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('non-text',
                                                                         OneHotEncoder(categorical_features=None,
                                                                                       categories='auto',
                                                                                       drop=None,
                                                                                       dtype=<class 'numpy.float64'>,
                      

### G. Results
1. AUC: 96.07%, Recall: 65.47%, Precision: 87.3%.
2. Change compared to the strawman: AUC +0.37%, Recall +10.32%, Precision -4%.
3. LinearSVC produced the best result.

In [14]:
#result
gs.cv_results_

{'mean_fit_time': array([5.83453908]),
 'std_fit_time': array([0.57147617]),
 'mean_score_time': array([5.17539859]),
 'std_score_time': array([0.76853994]),
 'param_lsvc__C': masked_array(data=[0.6],
              mask=[False],
        fill_value='?',
             dtype=object),
 'param_union__text__max_features': masked_array(data=[10000],
              mask=[False],
        fill_value='?',
             dtype=object),
 'param_union__text__ngram_range': masked_array(data=[(1, 1)],
              mask=[False],
        fill_value='?',
             dtype=object),
 'param_union__transformer_weights': masked_array(data=[{'non-text': 0.2, 'text': 0.8}],
              mask=[False],
        fill_value='?',
             dtype=object),
 'params': [{'lsvc__C': 0.6,
   'union__text__max_features': 10000,
   'union__text__ngram_range': (1, 1),
   'union__transformer_weights': {'non-text': 0.2, 'text': 0.8}}],
 'split0_test_AUC': array([0.9549124]),
 'split1_test_AUC': array([0.96438798]),
 'split2_

In [16]:
#best parameters
gs.best_params_

{'lsvc__C': 0.6,
 'union__text__max_features': 10000,
 'union__text__ngram_range': (1, 1),
 'union__transformer_weights': {'non-text': 0.2, 'text': 0.8}}

In [15]:
#best score
gs.best_score_

0.9606676083422383

In [17]:
# confusion matrix
y_pred=gs.predict(test)
metrics.confusion_matrix(test['attack'], y_pred)

array([[20284,   138],
       [  852,  1904]], dtype=int64)

#### Confusion matrix
    [[20284,   138]
    [  852,  1904]]
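
The matrix can be read with scikit-learn's layout, where rows are true labels and columns are predictions, i.e. `[[TN, FP], [FN, TP]]`. The sketch below derives the usual metrics from those four counts. The function `summarize` is a hypothetical helper. These test-split values need not match the cross-validated scores in section G. One likely reason is that `gs.fit` in In [13] was called on the full `comments` frame, so the refitted model has already seen the test rows.

```python
def summarize(matrix):
    """Compute basic metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),          # flagged comments that are attacks
        "recall": tp / (tp + fn),             # attacks that were caught
        "majority_baseline": (tn + fp) / total,  # accuracy of always predicting "not attack"
    }


# Counts copied from the confusion matrix above.
summary = summarize([[20284, 138], [852, 1904]])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The 852 false negatives are the missed attacks, which is the count the recall-first priority is meant to reduce.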

### H&I. Interesting & Difficult Parts of the Project
1. Overall, this machine learning project is one of the most interesting projects I've had so far.
2. The most interesting part is understanding the models and trying to figure out how I can improve the results. The creators of these models and of sklearn are geniuses. I learned a lot from their models and way of thinking. This project reminds me how profound computer science is.
3. The hardest part is also trying to understand and improve the classifier. Some models and their parameters are hard to understand. Therefore, I tried many different combinations to guess which would be best, and tried to understand them along the way.
4. There are a lot of other things that may improve my classifier but are beyond my knowledge at this point; I will look into them in the future.