# Classification

In this unit we will learn different methods for classification using the `scikit-learn` library. We'll start off using short texts collected from the [Universal Periodic Review](https://en.wikipedia.org/wiki/Universal_Periodic_Review), an international human rights mechanism. Each of these texts has one or more attached *labels* identifying the human rights issues the text is concerned with.

From these texts, we're going to estimate a supervised model that tries to guess the label(s) from the text data. Note that this is a *multilabel* classification problem, because each text may have more than one label, or no label.

In [1]:
import os
import re
import csv
import sys
import random
from pandas import DataFrame
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC
from sklearn import metrics, tree
# sklearn.cross_validation and sklearn.grid_search were removed in
# scikit-learn 0.20; their contents now live in sklearn.model_selection.
from sklearn import model_selection as cross_validation
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import coverage_error
from sklearn.model_selection import GridSearchCV



## 1) Load and Preprocess

We'll first load in a csv file that contains our texts and their corresponding label(s).

In [13]:
# read in full csv
with open('data/upr.csv', 'r', newline='') as csvfile:
    # each row becomes a dict keyed by the csv header
    recs = list(csv.DictReader(csvfile))
print(len(recs))

41066


Here are a few of the texts and their labels.

In [14]:
for x in recs[:5]:
    print("Text:", x['Text'])
    print("Labels:", x['Issue'])
    print()

Text: Consider the possibility of acceding to the International Covenant on Economic, Social and Cultural Rights and the International Covenant on Civil and Political Rights
Labels: CP rights - general,ESC rights - general,International instruments

Text: Establish a national institution to promote and protect human rights
Labels: NHRI

Text: Consider the possibility of establishing a national human rights institution in accordance with the Paris Principles
Labels: NHRI

Text: Consider the possibility of ratifying the International Covenant on Civil and Political Rights, the International Covenant on Economic, Social and Cultural Rights, the International Convention for the Protection of All Persons from Enforced Disappearance and the Convention on the Rights of Persons with Disabilities
Labels: CP rights - general,Disabilities,Enforced disappearances,ESC rights - general,International instruments

Text: Continue with the efforts to prevent, punish and eradicate all forms of violence a

First we'll split each comma-separated label string into a list, drop the catch-all labels `Other` and `General`, and remove any texts that are left without a label.

In [15]:
for i in recs:
    issues = i['Issue'].split(',')
    i['Issue'] = [x for x in issues if x != 'Other' and x != 'General']
    
rec_sub = [i for i in recs if i['Issue']]
print("Number of recs:", len(rec_sub))

Number of recs: 39475
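The load-and-clean steps above can be tried on a small in-memory file. This is a minimal sketch: the three sample rows are made up for illustration and only mimic the `Text` and `Issue` columns of `data/upr.csv`.

```python
import csv
import io

# Hypothetical three-row sample standing in for data/upr.csv.
sample = io.StringIO(
    'Text,Issue\n'
    '"Establish a national institution",NHRI\n'
    '"Accede to the Covenants","CP rights - general,ESC rights - general,Other"\n'
    '"Continue its efforts",General\n'
)

rows = list(csv.DictReader(sample))

for row in rows:
    # a quoted field keeps its commas, so we split it ourselves
    issues = row['Issue'].split(',')
    row['Issue'] = [x for x in issues if x not in ('Other', 'General')]

# rows whose label list is now empty are dropped
kept = [row for row in rows if row['Issue']]
print(len(rows), len(kept))
for row in kept:
    print(row['Issue'])
```

The third row only carried the `General` label, so it ends up with an empty list and is filtered out.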


Now we'll turn our data into a `pandas` DataFrame.

In [19]:
data = DataFrame(rec_sub)
print(data.shape)

data.head()

(39475, 2)


| | Issue | Text |
|---|---|---|
| 0 | [CP rights - general, ESC rights - general, In... | Consider the possibility of acceding to the In... |
| 1 | [NHRI] | Establish a national institution to promote an... |
| 2 | [NHRI] | Consider the possibility of establishing a nat... |
| 3 | [CP rights - general, Disabilities, Enforced d... | Consider the possibility of ratifying the Inte... |
| 4 | [Women's rights] | Continue with the efforts to prevent, punish a... |


Let's extract the text and label data.

In [21]:
text = data['Text'].values
labels = data['Issue'].values

We now have to "binarize" the labels: each list of labels becomes a row of binary indicators, with one column per possible label and a 1 wherever that label applies. For instance, an array such as `np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])` represents label 0 in the first sample, labels 1 and 2 in the second sample, and no labels in the third sample.

The `MultiLabelBinarizer` transformer can be used to convert between a collection of collections of labels and the indicator format.

In [30]:
mlb = MultiLabelBinarizer()
labels_binary = mlb.fit_transform(labels)
print(labels_binary)

[[0 1 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
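To make the indicator format concrete, here is a small sketch on three made-up label lists (the `toy_` names are illustrative and not part of the notebook). It also shows `inverse_transform`, which converts the indicator rows back into label tuples.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three toy samples: one label, two labels, and no label.
toy_labels = [['NHRI'], ['Death penalty', 'Detention'], []]

toy_mlb = MultiLabelBinarizer()
toy_binary = toy_mlb.fit_transform(toy_labels)

# Columns follow the sorted label names stored in classes_.
print(toy_mlb.classes_)
print(toy_binary)

# Going back from indicators to label tuples.
recovered = toy_mlb.inverse_transform(toy_binary)
print(recovered)
```

The column order comes from `classes_`, which is why the notebook later uses `mlb.classes_` to map predicted columns back to label names.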


We're now ready to split the data into `X_train, X_test, y_train, y_test`, holding out 20% of the texts for testing:

In [43]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    text, labels_binary, test_size=0.2, random_state=40)
print("Number of training data observations:", len(X_train))

Number of training data observations: 31580


For each observation, `X` is the text, and `y` is the corresponding array of labels.

In [44]:
for x in range(5):
    print("Text:", X_train[x])
    print("Labels:", y_train[x])
    print()

Text: Maintain its commitment to improve the quality of education to ensure the full enjoyment of the right to education
Labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Text: Incorporate into the new plans the need for a unified register of cases of violence against women, and to increase efforts to combat that scourge and impunity for those who commit such acts, and in particular to consider criminalizing the crime of femicide
Labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]

Text: Make necessary amendments to the national legislation in order to bring it into line with international obligations and commitments for the protection of children and in particular for their protection against sexual abuses, as well as against trafficking of persons
Labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 1 0 0

Here are the labels we have:

In [47]:
label_names = list(mlb.classes_)
print(label_names)

['Asylum-seekers - refugees', 'CP rights - general', 'Civil society', 'Corruption', 'Counter-terrorism', 'Death penalty', 'Detention', 'Development', 'Disabilities', 'ESC rights - general', 'Elections', 'Enforced disappearances', 'Environment', 'Extrajudicial executions', 'Freedom of association and peaceful assembly', 'Freedom of movement', 'Freedom of opinion and expression', 'Freedom of religion and belief', 'Freedom of the press', 'HIV - Aids', 'Human rights defenders', 'Human rights education and training', 'Human rights violations by state agents', 'Impunity', 'Indigenous peoples', 'Internally displaced persons', 'International humanitarian law', 'International instruments', 'Justice', 'Labour', 'Migrants', 'Minorities', 'NHRI', 'National plan of action', 'Poverty', 'Public security', 'Racial discrimination', 'Right to education', 'Right to food', 'Right to health', 'Right to housing', 'Right to land', 'Right to water', 'Rights of the Child', 'Sexual Orientation and Gender Identi

## 2) Pipelines

Machine learning often involves a fixed sequence of steps for processing the data, for example feature selection, normalization and classification. 

Scikit-learn includes a [Pipeline](http://scikit-learn.org/stable/modules/pipeline.html) structure to help with this. Pipelines serve two purposes:

1. **Convenience:** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
2. **Joint parameter selection**: You can [grid search](http://scikit-learn.org/stable/modules/grid_search.html#grid-search) over parameters of all estimators in the pipeline at once.

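We'll build a pipeline that counts unigrams and bigrams, reweights the counts with tf-idf, and fits one linear Support Vector Machine per label (one-vs-rest).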

In [49]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                    ('tfidf', TfidfTransformer()),
                    ('clf', OneVsRestClassifier(LinearSVC(random_state=0)))
                     ])
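To see what `ngram_range=(1, 2)` contributes, the sketch below fits a `CountVectorizer` on a single made-up sentence and lists the features it learns. The sentence and the `toy_vect` name are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams and bigrams, as in the pipeline's 'vect' step.
toy_vect = CountVectorizer(ngram_range=(1, 2))
toy_vect.fit(['abolish the death penalty'])

# vocabulary_ maps each learned feature to its column index.
features = sorted(toy_vect.vocabulary_)
print(features)
```

With bigrams included, a phrase such as "death penalty" becomes a feature of its own rather than two unrelated words.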

Now we'll fit the model using our pipeline.

In [50]:
clf = text_clf.fit(X_train, y_train)
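As a rough illustration of the convenience a pipeline provides, the sketch below fits the same three steps twice on a made-up four-sentence corpus: once through a `Pipeline` and once by chaining the steps by hand. The toy documents and labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus with two labels: column 0 = death penalty, column 1 = disabilities.
toy_docs = ['abolish the death penalty',
            'ratify the convention on disabilities',
            'establish a moratorium on executions and the death penalty',
            'protect persons with disabilities']
toy_y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

# One call to fit runs every step in order.
toy_pipe = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', OneVsRestClassifier(LinearSVC(random_state=0)))])
toy_pipe.fit(toy_docs, toy_y)

# The same steps chained by hand.
vect = CountVectorizer(ngram_range=(1, 2))
tfidf = TfidfTransformer()
svm = OneVsRestClassifier(LinearSVC(random_state=0))
svm.fit(tfidf.fit_transform(vect.fit_transform(toy_docs)), toy_y)

new_docs = ['abolish the death penalty',
            'rights of persons with disabilities']
pipe_pred = toy_pipe.predict(new_docs)
manual_pred = svm.predict(tfidf.transform(vect.transform(new_docs)))
print(np.array_equal(pipe_pred, manual_pred))
```

Note that the manual version has to remember to call `transform` (not `fit_transform`) on new text; the pipeline handles that distinction itself.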

## 3) Predicting

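Now we can use our trained model to predict labels for the held-out "test" set. There's no need to extract features or preprocess the test data separately, since the pipeline applies the same steps it learned during training. For multilabel targets, `score` reports subset accuracy: a text counts as correct only if every one of its labels is predicted exactly.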

In [57]:
predicted = clf.predict(X_test)
clf.score(X_test, y_test) 

0.7584547181760608

In [52]:
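# Element-wise agreement: the share of individual label indicators
# (one per text and label) that match. It is far higher than the score
# above because most indicators are 0 for any given text.
np.mean(predicted == y_test)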

0.99388853704876501
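The gap between the two numbers above comes from what is being counted. The sketch below uses made-up indicator arrays to compare subset accuracy (what `accuracy_score` computes for multilabel targets), element-wise agreement, and Hamming loss, which is simply one minus the element-wise agreement.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Four toy samples with four possible labels each.
toy_true = np.array([[1, 0, 0, 0],
                     [0, 1, 1, 0],
                     [0, 0, 0, 1],
                     [0, 0, 0, 0]])
# The second sample misses one of its two labels.
toy_pred = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 1],
                     [0, 0, 0, 0]])

subset_acc = accuracy_score(toy_true, toy_pred)  # whole rows must match
cell_acc = np.mean(toy_true == toy_pred)         # individual cells match
ham = hamming_loss(toy_true, toy_pred)           # share of wrong cells

print(subset_acc, cell_acc, ham)
```

One wrong cell out of sixteen costs a whole row under subset accuracy (3 of 4 rows correct) but barely moves the element-wise figure (15 of 16 cells correct).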

Let's look at a few of the test texts and see what labels our model predicted:

In [58]:
for doc, label in zip(list(X_test[:5]), predicted[:5]):
    print('%r => %s' % (doc, ", ".join(list(np.array(label_names)[label==1]))))
    print()

"Take stronger measures to combat discrimination in both the public and private sectors while promoting greater women's participation at the highest levels of decision-making" => Women's rights

'Submit all pending reports to the respective United Nations treaty bodies, namely, the Committee on Economic, Social and Cultural Rights, the Human Rights Committee and the Committee on the Rights of the Child' => ESC rights - general, Treaty bodies

'Ratify CRPD' => Disabilities, International instruments

'Ensure that the new Constitution fully guarantees the right to freedom of religion or belief and the right to equality and non-discrimination in line with international standards' => Freedom of religion and belief

'Establish a moratorium on executions with a view to abolishing the death penalty' => Death penalty



Here are some metrics that show how well our model does on individual labels:

In [60]:
print(metrics.classification_report(y_test, predicted,
    target_names=label_names)) 

                                              precision    recall  f1-score   support

                   Asylum-seekers - refugees       0.95      0.80      0.87       120
                         CP rights - general       0.88      0.79      0.83       115
                               Civil society       0.93      0.88      0.91       155
                                  Corruption       0.95      0.85      0.90        47
                           Counter-terrorism       1.00      0.65      0.79        26
                               Death penalty       0.99      0.94      0.96       379
                                   Detention       0.93      0.89      0.91       471
                                 Development       0.80      0.58      0.67       179
                                Disabilities       0.96      0.95      0.96       283
                        ESC rights - general       0.90      0.80      0.85       228
                                   Elections       0.
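Each row of the report treats one label column as its own binary problem. As a minimal sketch of how the columns of the report are defined, the hypothetical helper below computes precision, recall and F1 for a single made-up label column in plain Python.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for one binary label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # precision: of the texts we tagged, how many really had the label
    precision = tp / (tp + fp) if tp + fp else 0.0
    # recall: of the texts that had the label, how many we tagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy column: four texts truly carry the label, the model tags three texts.
col_true = [1, 1, 1, 1, 0, 0]
col_pred = [1, 1, 0, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(col_true, col_pred)
support = sum(col_true)  # number of true occurrences of the label
print(round(precision, 2), round(recall, 2), round(f1, 2), support)
```

In the report above, labels with high precision but lower recall (for example `Counter-terrorism`) are ones the model rarely assigns wrongly but often misses.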

## 4) Cross Validation and Grid Search

Beyond fitting a single model, `scikit-learn` includes cross-validation and grid-search utilities for estimating how well a model generalizes and for comparing parameter settings.

Here are the cross validation scores for our model:

In [61]:
# 5-fold cross-validation on the full data set
scores = cross_validation.cross_val_score(
    text_clf, text, labels_binary, cv=5)
scores

array([ 0.74122863,  0.73552882,  0.7409753 ,  0.72868904,  0.74756175])
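Each of the five scores comes from training on four fifths of the data and testing on the remaining fifth. The sketch below shows the idea with `KFold` on ten made-up items; it assumes plain, unshuffled k-fold splitting, which may differ from the exact splitter `cross_val_score` picks for a given target type.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10)  # ten toy observations
kf = KFold(n_splits=5)

test_counts = np.zeros(len(toy_X), dtype=int)
fold_sizes = []
for train_idx, test_idx in kf.split(toy_X):
    print("train:", train_idx, "test:", test_idx)
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))
```

Every observation is used for testing exactly once, so the five scores together cover the whole data set.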

Now we'll use a grid search to try out different parameters. In this instance we're going to see what happens with different ngram ranges, and with and without inverse-document-frequency (IDF) weighting in the tf-idf step.

In [65]:
from sklearn.model_selection import GridSearchCV

# keys follow the pattern <pipeline step name>__<parameter name>
parameters = {'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
              'tfidf__use_idf': (True, False),
}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)

What are the best parameters?

In [63]:
best_parameters = gs_clf.best_params_
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))

tfidf__use_idf: True
vect__ngram_range: (1, 2)


We can also see the cross validation scores for all the models:

In [66]:
results = gs_clf.cv_results_
for mean, std, params in zip(results['mean_test_score'],
                             results['std_test_score'],
                             results['params']):
    print("mean: %.5f, std: %.5f, params: %r" % (mean, std, params))

[mean: 0.72970, std: 0.00508, params: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': True},
 mean: 0.73284, std: 0.00197, params: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': True},
 mean: 0.72217, std: 0.00214, params: {'vect__ngram_range': (1, 3), 'tfidf__use_idf': True},
 mean: 0.71482, std: 0.00466, params: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': False},
 mean: 0.72289, std: 0.00223, params: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': False},
 mean: 0.72131, std: 0.00241, params: {'vect__ngram_range': (1, 3), 'tfidf__use_idf': False}]
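The six results correspond to every combination of the two parameter lists. As a small sketch, `ParameterGrid` can enumerate those combinations without fitting anything; the `toy_parameters` name is illustrative and simply repeats the grid used above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above: 3 ngram ranges x 2 IDF settings.
toy_parameters = {'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
                  'tfidf__use_idf': [True, False]}

candidates = list(ParameterGrid(toy_parameters))
print(len(candidates))
for candidate in candidates:
    print(candidate)
```

Because every combination is refit once per cross-validation fold, the cost of a grid search grows quickly as more parameters or values are added.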