### Voting algorithm test.
The data comes from the conduct risk domain. Each sentence is labelled as either problematic or innocent/neutral from a conduct risk perspective. For this exercise the models are trained and tested on imbalanced classes. I have printed out the top of both the training and test data sets to give some context to the sentences used in the exercise.
#### There are four custom classes below.
The "prep" class reads in the test and training data and rebalances the classes if required. In this exercise the class of interest is set at 3% of the entire dataset. It can be set at whatever is needed or just ignored if appropriate.
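As a rough illustration of what rebalancing to a fixed share can look like, here is a minimal pandas sketch. It is not the code inside `prep`: the function name `rebalance`, its parameters and the toy data are all assumptions made for this example. It keeps every neutral row and downsamples the class of interest until it makes up roughly 3% of the result.

```python
import pandas as pd


def rebalance(df, label_col="Label_bin", positive_share=0.03, seed=0):
    """Downsample the positive class to about `positive_share` of all rows.

    Hypothetical helper for illustration only, not the notebook's prep class.
    """
    positives = df[df[label_col] == 1]
    negatives = df[df[label_col] == 0]
    # Solve n_pos / (n_pos + n_neg) = share for n_pos, capped at what is available.
    target = round(positive_share * len(negatives) / (1 - positive_share))
    n_pos = min(target, len(positives))
    sampled = positives.sample(n=n_pos, random_state=seed)
    # Shuffle so the two classes are interleaved.
    combined = pd.concat([negatives, sampled])
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data: 200 problem sentences and 800 neutral ones.
toy = pd.DataFrame({
    "Sentences": [f"sentence {i}" for i in range(1000)],
    "Label_bin": [1] * 200 + [0] * 800,
})
balanced = rebalance(toy)
print(len(balanced), balanced["Label_bin"].mean())
```

With 800 neutral rows, this keeps 25 problem rows, so the class of interest ends up at about 3% of 825 rows.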
The "clean" class uses spaCy to produce a corpus ready for the pipeline and ultimately the voting algorithm.
The "mean embedding" class (not my original code) allows the word embeddings to be passed to the sklearn pipeline.
The "voter" class runs all the pipelines and produces a confusion matrix from the final voting algorithm.
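To show the idea behind passing word embeddings to an sklearn pipeline, here is a minimal sketch of a mean embedding transformer. It is written for this explanation and may differ from the imported `MeanEmbeddingVectorizer`: the class name `MeanEmbeddingSketch`, the `word_vectors` dictionary and the two-dimensional toy vectors are assumptions. Each tokenised sentence becomes the average of the vectors of its known words, which gives every sentence a fixed-length feature row.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanEmbeddingSketch(TransformerMixin, BaseEstimator):
    """Map each tokenised sentence to the mean of its word vectors.

    Illustrative only; `word_vectors` is any dict of token -> 1-D numpy array.
    """

    def __init__(self, word_vectors, dim):
        self.word_vectors = word_vectors
        self.dim = dim

    def fit(self, X, y=None):
        # Nothing to learn: the embeddings are supplied up front.
        return self

    def transform(self, X):
        rows = []
        for tokens in X:
            known = [self.word_vectors[t] for t in tokens if t in self.word_vectors]
            # Sentences with no known tokens fall back to an all-zero vector.
            rows.append(np.mean(known, axis=0) if known else np.zeros(self.dim))
        return np.vstack(rows)


# Toy two-dimensional "embeddings" standing in for a trained model.
toy_vectors = {"call": np.array([1.0, 0.0]), "offline": np.array([0.0, 1.0])}
vectorizer = MeanEmbeddingSketch(toy_vectors, dim=2)
features = vectorizer.fit_transform([["call", "me", "offline"], ["hello"]])
print(features)  # [[0.5 0.5], [0. 0.]]
```

Because the class exposes `fit` and `transform`, it can sit as the first step of a `Pipeline` in front of any classifier.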
#### Overview of the voting algorithm.
This is a natural language processing exercise. The data consists of sentences and accompanying labels. The final task performed in the notebook is to classify sentences as OK or problematic. Problematic sentences are those that a risk professional would regard as a concern and worthy of further examination.
The confusion matrix summarises the accuracy of the voting algorithm.
The various classes that are imported are doing the following under the hood:
1. Producing word embeddings from the Facebook fastText model and the Google Word2Vec model.
2. Each set of embeddings, along with the labels, is passed to four classification algorithms: KNN, a bagging classifier (KNN base), a fully connected neural network classifier and an extra-trees (extremely randomised trees) classifier. Each of these classification algorithms in effect produces a prediction for the voting algorithm, but with its own bias-variance characteristics.
3. In total there are eight classifiers working in the pipeline ahead of the final voting algorithm: four classifiers for each of the two word embedding models.
4. The voting algorithm, in this exercise, treats each of the eight predictions equally in deciding how to classify the unseen test sentences. Results/predictions are summarised in the confusion matrix.
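The four steps above can be sketched with scikit-learn's `VotingClassifier`. This is an illustration under stated assumptions, not the contents of the `voter` class: the helper names (`mean_embed`, `toy_vectors`, `make_classifiers`, `build_voter`), the hyperparameters and the toy sentences are invented here, and random vectors stand in for the fastText and Word2Vec embeddings so that the example runs without any trained model.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def mean_embed(sentences, vectors, dim):
    """Average the vectors of known tokens; all-zero row if none are known."""
    rows = []
    for sentence in sentences:
        known = [vectors[t] for t in sentence.lower().split() if t in vectors]
        rows.append(np.mean(known, axis=0) if known else np.zeros(dim))
    return np.vstack(rows)


def toy_vectors(vocab, dim, seed):
    """Random stand-in for a trained embedding model (NOT fastText or Word2Vec)."""
    rng = np.random.default_rng(seed)
    return {word: rng.normal(size=dim) for word in vocab}


def make_classifiers():
    """Fresh instances of the four classifier types; settings are assumptions."""
    return {
        "knn": KNeighborsClassifier(n_neighbors=3),
        "bag": BaggingClassifier(
            KNeighborsClassifier(n_neighbors=3), n_estimators=5, random_state=0
        ),
        "mlp": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
        "xt": ExtraTreesClassifier(n_estimators=25, random_state=0),
    }


def build_voter(embeddings):
    """One pipeline per (embedding, classifier) pair, combined by equal-weight hard voting."""
    estimators = []
    for emb_name, (vectors, dim) in embeddings.items():
        for clf_name, clf in make_classifiers().items():
            embed = FunctionTransformer(mean_embed, kw_args={"vectors": vectors, "dim": dim})
            estimators.append((f"{emb_name}_{clf_name}", make_pipeline(embed, clf)))
    return VotingClassifier(estimators=estimators, voting="hard")


train_sentences = [
    "can we talk offline", "call me on my mobile",
    "please call my cell now", "speak offline please",
    "i bought apples at the market", "great deal on my plan",
    "the market was busy", "apples are a great deal",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

vocab = sorted({token for s in train_sentences for token in s.split()})
embeddings = {
    "ft": (toy_vectors(vocab, dim=8, seed=1), 8),     # placeholder for fastText
    "w2v": (toy_vectors(vocab, dim=12, seed=2), 12),  # placeholder for Word2Vec
}

voter = build_voter(embeddings)
voter.fit(train_sentences, train_labels)
preds = voter.predict(["can you call me offline", "apples at the market"])
print(len(voter.estimators_), preds)
```

With `voting="hard"` and no weights, each of the eight pipelines contributes one vote per sentence. With an even number of voters a 4-4 tie is possible, and scikit-learn documents that ties are resolved by the ascending sort order of the class labels.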


 
##### Oct 2019


In [1]:
# Use pandas for this notebook to read in the test and training data.
import pandas as pd
# Import custom classes
from Downloads.prepdata import prep
from Downloads.spacyclean import clean
from Downloads.meanembeddings import MeanEmbeddingVectorizer
from Downloads.vote import voter

In [2]:
# Define test and training data
test = pd.read_csv('/Users/brianfarrell/Desktop/test_client_dissatisfaction.csv', encoding='utf8')
training = pd.read_csv('/Users/brianfarrell/Desktop/training.csv', encoding='utf8')

In [3]:
#Display top of the training dataset
training.head()

| Unnamed: 0 | Sentences | Label_bin | Misconduct_type |
|---|---|---|---|
| 0 | I need to speak offline | 1 | Evassiveness |
| 1 | Can you call me on my mobile | 1 | Evassiveness |
| 2 | Can you call me on my cell now | 1 | Evassiveness |
| 3 | Can we talk offline | 1 | Evassiveness |
| 4 | LTOL | 1 | Evassiveness |


In [4]:
#Display top of test dataset
test.head()

| Unnamed: 0 | Sentences | Label_bin | Misconduct_type |
|---|---|---|---|
| 0 | Client is not happy can't talk right now | 1 | Client dissatisfaction |
| 1 | I just bought some apples at the market on the... | 0 | Conduct Neutral |
| 2 | But that would nt work . | 0 | Conduct Neutral |
| 3 | Well there are times when you are on and times... | 0 | Conduct Neutral |
| 4 | I got a great deal on my mobile plan . | 0 | Conduct Neutral |


In [5]:
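# Run voting algorithm and produce confusion matrix.
def produce_confusion_matrix():
    # Rebalance the classes across the test and training data.
    data = prep.balancedata(test, training)
    # Clean the sentences with spaCy and build the corpus used for the embeddings.
    cleaned_training_corpus = clean.spacy_cleanup(data.Sentences.str.cat(sep='\n'))
    corpus = clean.get_corpus(cleaned_training_corpus)
    labels = data.Label_bin
    # Split the cleaned sentences and labels into training and test sets.
    X_train, X_test, y_train, y_test = clean.split_data(cleaned_training_corpus, labels)
    # Build the fastText and Word2Vec embeddings from the corpus.
    ft_embeddings = voter.get_ft_embeddings(corpus)
    w2v_embeddings = voter.get_w2v_embeddings(corpus)
    # Run all the pipelines and print the voting classifier's confusion matrix.
    voter.run_pipe(ft_embeddings, w2v_embeddings, X_train, X_test, y_train, y_test)


produce_confusion_matrix()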

100%|██████████| 3157/3157 [00:00<00:00, 212164.81it/s]

 All sentences parsed





     VotingClassifier confusion matrix

         predicted ok  predicted problem
ok                763                  0
problem             0                 27
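
For reference, a labelled confusion matrix in this layout can be built from true and predicted labels roughly as follows. This is a sketch using scikit-learn and pandas, not the code inside `voter.run_pipe`, and the toy labels are invented for the example rather than taken from the run above. With imbalanced classes, recall and precision on the problem class are usually more informative than overall accuracy, so the sketch derives those as well.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (0 = ok, 1 = problem).
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
table = pd.DataFrame(
    cm,
    index=["ok", "problem"],
    columns=["predicted ok", "predicted problem"],
)
print(table)

# Unpack the four cells and compute metrics for the problem class.
tn, fp, fn, tp = cm.ravel()
recall = tp / (tp + fn)        # share of true problem sentences that were caught
precision = tp / (tp + fp)     # share of flagged sentences that really are problems
print(f"recall={recall:.2f} precision={precision:.2f}")
```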
