**Gustavo Espinosa** &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp;   <img src= "https://i0.wp.com/immap.org/wp-content/uploads/2016/12/immap-logo-strap2.png?resize=300%2C45&ssl=1"> <br />
*July 2020*


# 2) UNHCR DEEP Data Analysis (Multi Label Text Classification)

<br />

**GitHub Project Card**: https://github.com/ghespinosa/iMMAPColombia/projects/2#card-42450420

*With help from: https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5*




In [1]:
#Import packages

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from nltk.corpus import stopwords
#stop_words = set(stopwords.words('english')) - m
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

import pandas as pd
import numpy as np

## Data Analysis - Multi Label Text Classification

In [2]:
# Load data from CSV and do basic prep...

raw_data = pd.read_csv("CleanData.csv")
raw_data

Unnamed: 0,Lead Title,Author,Modified Excerpt,Lang,Affected,All,Asylum Seekers,Displaced,Host,IDP,In transit,Migrants,Non Displaced,Not affected,Others,Pendular,Permanent,Refugees,Returnees
0,Venezuelans in Colombia struggle to find healt...,Doctors Without borders,the lack of access to basic and specialized he...,en,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,"Venezuelan refugees are very vulnerable, emoti...",Handicap International - Humanity & Inclusion,the serious political and economic crisis in v...,en,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0
2,Waning welcome: the growing challenges facing ...,Mixed Migration Centre,family reunification combines with socio econo...,en,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0
3,Supporting solidarity: why the world must bols...,refugeesinternational,1 4 million meanwhile in nds department alone ...,en,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0
4,Waning welcome: the growing challenges facing ...,Mixed Migration Centre,11 out of 13 participating states agreed among...,en,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
439,"WFP Colombia External Situation Report, 18 Jul...",World Food Programme,in coordination with the government of colombi...,en,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1
440,Regional Refugee and Migrant Response Plan for...,R4V,protection refugees and migrants from venezuel...,en,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0
441,Regional Refugee and Migrant Response Plan for...,R4V,socio economic inclusion as the venezuelan pop...,en,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0
442,Colombia and Ecuador: Population Movement - In...,international federation of red cross and cres...,the colombian red cross monitors from the nati...,en,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


## Train N' Test Function

In [3]:
def TrainNTest(Categories,Test_Size,Model,Print=True):
    
    
    """ 
    |Train one binary classifier per category on a random train/test split of the
    |global raw_data DataFrame and score each one on the test split.
    |
    |
    |Parameters: 
    |-----------
    |Categories (list): output categories, e.g. Categories=['Casualties', 'Context', 'Displacement', ...]
    |Test_Size (float): fraction of the data held out for testing, e.g. Test_Size=.33 means 33% of the data is used for testing
    |Model (str): 'NB' is Naive Bayes; 'SVC' is Support Vector Classification; 'LogReg' is Logistic Regression
    |Print (bool): default = True; if True, the confusion matrix and accuracy of each category are printed
    |
    |Returns: 
    |-----------
    |comparison (DataFrame): pd DataFrame with the modified excerpt along with actual and predicted values
    |accuracies (list): test accuracy of each category, in the same order as Categories
    |
    """
    
    #splitting into training and testing data

    categories = Categories
    train, test = train_test_split(raw_data, test_size=Test_Size, shuffle=True)
    X_train = train['Modified Excerpt']
    X_test = test['Modified Excerpt']
    
    # Depending on user input, define a pipeline combining a text feature extractor with multi label classifier
    if Model=='NB':
        pipeline = Pipeline([
                    ('tfidf', TfidfVectorizer()),
                    ('clf', OneVsRestClassifier(MultinomialNB(
                        fit_prior=True, class_prior=None))),
                ])
    elif Model=='SVC':
        pipeline = Pipeline([
                    ('tfidf', TfidfVectorizer()),
                    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
                ])
    elif Model=='LogReg':
        pipeline = Pipeline([
                    ('tfidf', TfidfVectorizer()),
                    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1)),
                ])
    else:
        # Stop here: without a pipeline the training loop below would fail with a NameError
        raise ValueError("Model must be 'NB', 'SVC', or 'LogReg'")
        
    if Print:
        print("#############################################")
        print(Model)
        print("------------")
    
    #prep our arrays to store actual and predicted values for the comparison Data Frame
    actuals = []
    predicted = []
    text = []
    accuracies = []

    d = {'Modified Excerpt':X_test}
    comparison = pd.DataFrame(data=d)

    #iterate through the categories, train the model, predict values, store them, print results
    for i in range(0,len(categories)):

        category = categories[i]

        # train the model on the training excerpts and this category's labels
        pipeline.fit(X_train, train[category])

        # generate predicted classification based on fitted model
        prediction = pipeline.predict(X_test)

        #add actual and predicted values to our dataframe
        actuals.append(test[category])
        predicted.append(prediction)

        col_name_act = ('Actual ' + categories[i])
        col_name_pred = ('Predicted ' + categories[i])

        comparison[col_name_act] = actuals[i]
        comparison[col_name_pred]= predicted[i]    

        #generate confusion matrix and accuracy
        conf_mat,acc = CompareClasses(comparison[col_name_act],comparison[col_name_pred])
        
        if Print:
            print(category.upper())
            
            print('Confusion Matrix:')
            print(conf_mat)
            print('----')
            print('Testing Accuracy: ')
            print(format(acc, '.4f'))
            print()
            print()
        
        #append the accuracy score to the list
        accuracies.append(acc)
            
    return comparison,accuracies
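
`TrainNTest` draws a fresh random split on every call, so the three models are scored on different test sets. A minimal sketch of one way to compare models on the same split follows. The toy DataFrame, the `random_state=0` seed, and the choice to fit all labels at once through `OneVsRestClassifier` are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for raw_data (hypothetical rows, not taken from CleanData.csv)
toy = pd.DataFrame({
    "Modified Excerpt": [
        "migrants cross the border on foot",
        "refugees need access to health care",
        "migrants and refugees share shelters",
        "host communities face rising rents",
        "refugees wait for asylum decisions",
        "migrants look for informal work",
        "host families shelter migrants",
        "refugees and host communities meet",
    ],
    "Migrants": [1, 0, 1, 0, 0, 1, 1, 0],
    "Refugees": [0, 1, 1, 0, 1, 0, 0, 1],
    "Host":     [0, 0, 0, 1, 0, 0, 1, 1],
})
labels = ["Migrants", "Refugees", "Host"]

# One split, fixed by a seed (arbitrary value), shared by every model
train, test = train_test_split(toy, test_size=0.25, shuffle=True, random_state=0)
y_train = train[labels].values
y_test = test[labels].values

models = {"NB": MultinomialNB(), "LogReg": LogisticRegression()}
scores = {}
for name, estimator in models.items():
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        # A 2-D 0/1 target makes OneVsRestClassifier fit one classifier per label
        ("clf", OneVsRestClassifier(estimator)),
    ])
    pipe.fit(train["Modified Excerpt"], y_train)
    pred = pipe.predict(test["Modified Excerpt"])
    # Per-label accuracy: share of test rows where prediction equals the actual value
    scores[name] = (pred == y_test).mean(axis=0)

print(pd.DataFrame(scores, index=labels))
```

Because both models see the same test rows, differences in the printed accuracies come from the models rather than from the split.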


## Confusion Matrix Function

In [6]:
def CompareClasses(actual, predicted, names=None):

    """ 
    |Cross-tabulate predicted against actual values and compute the accuracy.
    |
    |
    |Parameters: 
    |-----------
    |actual (Series): actual values
    |predicted (Series or array): predicted values
    |names (list): relabels the rows/columns to the provided names
    |
    |Returns: 
    |-----------
    |conf_mat (DataFrame): crosstab of the predicted (rows) against the actual (columns)
    |accuracy (float): share of predictions that match the actual values
    """
    y_names = names
    
    accuracy = sum(actual==predicted)/actual.shape[0]
    classes = pd.DataFrame(columns=['Actual','Predicted'])
    classes['Actual'] = actual
    classes['Predicted'] = predicted
    conf_mat = pd.crosstab(classes['Predicted'],classes['Actual'])
    
    if names is not None:
        conf_mat.index=y_names
        conf_mat.index.name='Predicted'
        conf_mat.columns=y_names
        conf_mat.columns.name = 'Actual'
        
    return conf_mat, accuracy
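
`pd.crosstab` only creates rows for values that actually occur, so when a model predicts a single class the confusion matrix has one row (as in several of the outputs further down). A minimal sketch of an alternative that always returns the full 2x2 table, using `sklearn.metrics.confusion_matrix` with an explicit `labels` argument, is shown here. The helper name `full_confusion` is hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def full_confusion(actual, predicted, labels=(0, 1)):
    """Hypothetical helper: confusion table that always contains every label."""
    mat = confusion_matrix(actual, predicted, labels=list(labels))
    # sklearn puts actual values on rows and predictions on columns;
    # transpose so the layout matches CompareClasses (predicted on rows)
    table = pd.DataFrame(mat.T, index=list(labels), columns=list(labels))
    table.index.name = "Predicted"
    table.columns.name = "Actual"
    return table

# The model never predicts class 1, yet the row for Predicted = 1 is still present
actual = [0, 0, 1, 1, 0]
predicted = [0, 0, 0, 0, 0]
table = full_confusion(actual, predicted)
print(table)
```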


In [4]:
# Generate names of categories (will only work if the desired categories are the only integer type columns in raw_data)

categories = raw_data.loc[:,raw_data.dtypes==np.int64].columns
categories = categories.tolist()
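
The selection above depends on the label columns being the only `int64` columns. A small sketch of a stricter check, which keeps only numeric columns whose values are all 0 or 1, is given here. The toy DataFrame and its `Year` column are made up for illustration.

```python
import pandas as pd

# Toy frame (hypothetical): "Year" is numeric but is not a 0/1 label column
toy = pd.DataFrame({
    "Lead Title": ["a", "b", "c"],
    "Year": [2019, 2020, 2020],
    "Migrants": [1, 0, 1],
    "Refugees": [0, 0, 1],
})

numeric_cols = toy.select_dtypes(include="number").columns
# Keep a column only if every value is 0 or 1
label_cols = [c for c in numeric_cols if toy[c].isin([0, 1]).all()]
print(label_cols)
```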

In [9]:
# Define the categories and run all the tests

categories

nb_comparison,nb_acc = TrainNTest(categories,.33,'NB')
svc_comparison,svc_acc = TrainNTest(categories,.33,'SVC')
logreg_comparison,logreg_acc = TrainNTest(categories,.33,'LogReg') # the 'sag' solver warns that max_iter is reached

#############################################
NB
------------
AFFECTED
Confusion Matrix:
Actual       0   1
Predicted         
0          110  37
----
Testing Accuracy: 
0.7483


ALL
Confusion Matrix:
Actual       0  1
Predicted        
0          145  2
----
Testing Accuracy: 
0.9864


ASYLUM SEEKERS
Confusion Matrix:
Actual       0  1
Predicted        
0          145  2
----
Testing Accuracy: 
0.9864


DISPLACED
Confusion Matrix:
Actual       0   1
Predicted         
0          119  28
----
Testing Accuracy: 
0.8095


HOST
Confusion Matrix:
Actual       0  1
Predicted        
0          141  6
----
Testing Accuracy: 
0.9592


IDP
Confusion Matrix:
Actual       0  1
Predicted        
0          146  1
----
Testing Accuracy: 
0.9932


IN TRANSIT
Confusion Matrix:
Actual       0   1
Predicted         
0          128  19
----
Testing Accuracy: 
0.8707


MIGRANTS
Confusion Matrix:
Actual      0    1
Predicted         
1          13  134
----
Testing Accuracy: 
0.9116


NON DISPLACED
Confu



PERMANENT
Confusion Matrix:
Actual       0   1
Predicted         
0          125  22
----
Testing Accuracy: 
0.8503


REFUGEES
Confusion Matrix:
Actual      0   1
Predicted        
0          44  25
1          24  54
----
Testing Accuracy: 
0.6667


RETURNEES
Confusion Matrix:
Actual       0   1
Predicted         
0          128  17
1            0   2
----
Testing Accuracy: 
0.8844


#############################################
LogReg
------------
AFFECTED
Confusion Matrix:
Actual       0   1
Predicted         
0          118  29
----
Testing Accuracy: 
0.8027


ALL
Confusion Matrix:
Actual       0  1
Predicted        
0          146  1
----
Testing Accuracy: 
0.9932


ASYLUM SEEKERS
Confusion Matrix:
Actual       0  1
Predicted        
0          144  3
----
Testing Accuracy: 
0.9796






DISPLACED
Confusion Matrix:
Actual       0   1
Predicted         
0          116  31
----
Testing Accuracy: 
0.7891


HOST
Confusion Matrix:
Actual       0  1
Predicted        
0          141  6
----
Testing Accuracy: 
0.9592


IDP
Confusion Matrix:
Actual       0  1
Predicted        
0          146  1
----
Testing Accuracy: 
0.9932






IN TRANSIT
Confusion Matrix:
Actual       0   1
Predicted         
0          133  14
----
Testing Accuracy: 
0.9048


MIGRANTS
Confusion Matrix:
Actual      0    1
Predicted         
1          17  130
----
Testing Accuracy: 
0.8844


NON DISPLACED
Confusion Matrix:
Actual       0  1
Predicted        
0          146  1
----
Testing Accuracy: 
0.9932






NOT AFFECTED
Confusion Matrix:
Actual       0
Predicted     
0          147
----
Testing Accuracy: 
1.0000


OTHERS
Confusion Matrix:
Actual       0  1
Predicted        
0          146  1
----
Testing Accuracy: 
0.9932


PENDULAR
Confusion Matrix:
Actual       0  1
Predicted        
0          144  3
----
Testing Accuracy: 
0.9796


PERMANENT
Confusion Matrix:
Actual       0   1
Predicted         
0          129  18
----
Testing Accuracy: 
0.8776


REFUGEES
Confusion Matrix:
Actual      0   1
Predicted        
0          43  19
1          24  61
----
Testing Accuracy: 
0.7075


RETURNEES
Confusion Matrix:
Actual       0   1
Predicted         
0          128  19
----
Testing Accuracy: 
0.8707




In [10]:
# Display the scores of the three different models

d = {}
d['Model']=['Naive Bayes','Support Vector Classifier','Logistic Regression']

d['Model Average'] = [(sum(nb_acc)/len(nb_acc)),(sum(svc_acc)/len(svc_acc)),(sum(logreg_acc)/len(logreg_acc))]

for i in range(0,len(categories)):
    # third entry must come from logreg_acc; it was nb_acc, which duplicated the Naive Bayes row
    d[categories[i]]=[nb_acc[i],svc_acc[i],logreg_acc[i]]
    

df = pd.DataFrame(data=d)
df

Unnamed: 0,Model,Model Average,Affected,All,Asylum Seekers,Displaced,Host,IDP,In transit,Migrants,Non Displaced,Not affected,Others,Pendular,Permanent,Refugees,Returnees
0,Naive Bayes,0.909751,0.748299,0.986395,0.986395,0.809524,0.959184,0.993197,0.870748,0.911565,0.993197,0.993197,1.0,0.986395,0.877551,0.653061,0.877551
1,Support Vector Classifier,0.909751,0.836735,0.986395,0.979592,0.721088,0.959184,0.993197,0.884354,0.918367,1.0,0.993197,0.993197,0.979592,0.85034,0.666667,0.884354
2,Logistic Regression,0.915193,0.802721,0.993197,0.979592,0.789116,0.959184,0.993197,0.904762,0.884354,0.993197,1.0,0.993197,0.979592,0.877551,0.707483,0.870748
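
Many of the high accuracies come from rare labels where the model predicts 0 for every excerpt. A minimal sketch of how to check this follows: it compares accuracy with a majority-class baseline and with the F1 score. The counts are taken from the Naive Bayes "ALL" confusion matrix printed earlier (145 negatives, 2 positives, every prediction 0); using F1 as the second metric is a suggestion, not something the notebook does.

```python
from sklearn.metrics import accuracy_score, f1_score

# Counts from the "ALL" confusion matrix: 145 actual 0s, 2 actual 1s, all predicted 0
actual = [0] * 145 + [1] * 2
predicted = [0] * 147

acc = accuracy_score(actual, predicted)
# Accuracy obtained by always predicting the most frequent class
baseline = max(actual.count(0), actual.count(1)) / len(actual)
# zero_division=0 returns 0 instead of warning when no positives are predicted
f1 = f1_score(actual, predicted, zero_division=0)

print(f"accuracy={acc:.4f} baseline={baseline:.4f} f1={f1:.4f}")
# prints: accuracy=0.9864 baseline=0.9864 f1=0.0000
```

An accuracy equal to the baseline together with an F1 of 0 shows that the classifier has not learned to detect the positive class for that label.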
