# Lab 3 / Text Classification of Consumer Complaints

Before we get started, we need to load the Python modules that are used later in the code. Do this by running the next code cell.

In [None]:
import somialabs.lab3
%matplotlib inline

To complete this lab:
1. Follow the instructions, running the code when asked.
2. Discuss each question in your group.
3. Keep notes for your answers to the questions in a separate MS Word document (you can use [this template](Lab3_answers_template.docx)).
4. When completed, briefly discuss your answers with the Lecturer/Teaching Assistant attending your lab. You **do not** need to submit your answers to Studentportalen.

In this lab, you will try to categorize consumer complaints, based on the complaint narrative, using supervised machine learning with Support Vector Machines (SVM). You will also be able to experiment with different forms of data pre-processing to test the effects on the categorization of the text.

### Loading the data

To read the dataset, run the following code cell:

In [None]:
somialabs.lab3.view_complaints_table()

**Question 1.** How many complaint records are present in the dataset?

Run the following cell to get some summary information about the dataset:

In [None]:
somialabs.lab3.view_complaints_table_info()

**Question 2.** How many records contain a complaint narrative (transcript of the complaint from the complainer)?

### What is a Term Document Matrix?

The dataset you work with consists of consumer complaint narratives (a description from each consumer about their complaint) alongside a lot of extra data about the complaints, such as when each complaint was made, which company it relates to, and some categorisations such as product or issue category.

For this lab we are interested in predicting the `Product` relating to each complaint. Each row in the dataset corresponds to a complaint. We start by creating a Term Document Matrix (TDM), which represents each complaint as a feature vector, as we did in Lab 1. We can experiment with several techniques for optimizing the input dataset and inspect the TDMs after processing.
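
To make the idea of a TDM concrete, the sketch below builds one by hand for three made-up narratives. The narratives and the helper name `build_tdm` are illustrative assumptions and are not part of `somialabs.lab3`, which may tokenize the text differently.

```python
from collections import Counter

# Three made-up narratives standing in for complaint records (illustrative only).
docs = [
    "my loan payment was late",
    "the bank charged my card twice",
    "late fee on my card payment",
]

def build_tdm(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

vocabulary, tdm = build_tdm(docs)
print(vocabulary)
for row in tdm:
    print(row)
```

Each row is the feature vector for one document, and most entries are zero because any single document uses only a small part of the vocabulary.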

First, let's take a closer look at a sample of 5 complaint narratives. Run the following cell to get a sample from the dataset:

In [None]:
somialabs.lab3.view_sample_narratives()

#### Stemming

Stemming is a method where words are shortened to their morphological root. The algorithm that performs this truncation is adapted to the features of specific languages and thus it is not possible to use the same algorithm in Swedish as you would use in English. In this lab we focus on data in English.
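
As a rough illustration of the idea, the sketch below strips a few common English suffixes. The function `crude_stem` and its suffix list are assumptions made for this example. Real stemmers, such as the Porter stemmer, apply many more rules, and this is not the algorithm used by `somialabs.lab3`.

```python
# Suffixes to strip, tried in this order so that "ing" and "es" are checked before "s".
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["complain", "complained", "complains", "complaining", "payment", "payments"]
stems = [crude_stem(w) for w in words]
print(stems)
print(len(set(words)), "distinct terms before,", len(set(stems)), "after")
```

The six distinct words collapse to two stems, `complain` and `payment`, which is why a stemmed TDM has fewer columns than an unstemmed one.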

We will create three different TDMs based on a sample of the `Consumer_Complaints.csv` dataset. We use a sample initially because inspecting and manipulating a TDM built from a large input dataset easily becomes unworkable.

Let us first take a closer look at applying stemming to a TDM:

In [None]:
somialabs.lab3.filter_stem()

**Question 3.** How many features (terms) are present in the initial TDM generated from the sampled corpus, and after stemming? Explain your observations.

Let's now study some of the terms in our original corpus against those in the stemmed corpus:

In [None]:
somialabs.lab3.compare_stemmed().head(50)

**Question 4.** How do the terms differ in a TDM with stemming from a TDM without stemming?

#### Stopwords

Stopwords are words of limited importance that do not significantly affect the text analysis. Words that are filtered out are, for example, prepositions (prepositioner) and conjunctions (konjunktioner). We experimented with stopwords in Lab 1.

- A *preposition* is a word that tells you where or when something is in relation to something else. For example, words like "after", "before", "on", "under", "inside" and "outside".
- A *conjunction* is a connective word that joins sentences together. For example, the FANBOYS words: "for", "and", "nor", "but", "or", "yet", "so".
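
Stopword filtering can be sketched in a few lines. The stopword set and the helper `remove_stopwords` below are illustrative assumptions. Real stopword lists contain a few hundred words, and the list used by `somialabs.lab3` is not shown in this lab.

```python
# A tiny illustrative stopword list (an assumption for this example).
STOPWORDS = {"a", "after", "and", "but", "for", "i", "my", "on", "or", "the", "to", "was"}

def remove_stopwords(text):
    """Lower-case the text, split on whitespace, and drop stopwords."""
    return [token for token in text.lower().split() if token not in STOPWORDS]

narrative = "I was charged a late fee on my card after the payment"
print(remove_stopwords(narrative))
```

Of the twelve words in the example narrative, only `charged`, `late`, `fee`, `card` and `payment` remain.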

Run the following code to apply stopword filtering to the sampled corpus and resulting TDM:

In [None]:
somialabs.lab3.filter_stopwords()

**Question 5.** How many features (terms) are present in the stopword-truncated TDM generated from the sampled corpus? How might stopword deletion affect the quality of the TDM? Explain your observations.

#### Frequency

When generating the TDMs, each word is recorded in the feature vectors based only on the number of occurrences of that term in each record of the corpus.

Another matrix we can derive is a TF-IDF (term frequency-inverse document frequency) matrix. This weights the occurrence of a word in a particular document against how many of the other documents the word appears in. If a word occurs in almost all documents, it is given a lower value in the matrix. A word that appears in only a few documents is instead weighted higher. A simpler way to represent a word in the feature vector is TF (term frequency) alone. TF weights each word only by the number of times it occurs in a document and records this count in the feature vector.

Consider this small example corpus:

    She watches bandy and football
    Alice likes to play bandy 
    Karl loves to play football
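
For reference, the sketch below computes TF-IDF weights for this corpus by hand. It assumes the textbook formula, raw term count multiplied by log(N / document frequency). Libraries such as scikit-learn use smoothed and normalized variants, so the numbers shown by the interactive cells may differ from these.

```python
import math
from collections import Counter

corpus = [
    "She watches bandy and football",
    "Alice likes to play bandy",
    "Karl loves to play football",
]

tokenized = [doc.lower().split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Weight each term by its raw count times log(N / document frequency)."""
    tf = Counter(tokens)
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}

for doc, tokens in zip(corpus, tokenized):
    weights = tf_idf(tokens)
    print(doc)
    print({term: round(w, 2) for term, w in weights.items()})
```

Under this formula "bandy", which appears in two of the three documents, gets a weight of about 0.41, while "she", which appears in only one, gets about 1.1. With plain TF both would have a weight of 1.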


Inspect the TDM and the TF-IDF matrix created below from this small corpus:

In [None]:
somialabs.lab3.interact_tdm()

In [None]:
somialabs.lab3.interact_tf_idf_matrix()

**Question 6.** Describe how the weighting of terms differs depending on how the frequency is calculated based on the terms found above.

*You can try adding and removing documents (each line is one document in the corpus), or editing each document to help you observe changes in weights.*

### Create a Term Document Matrix

Now it's time to get back to our consumer complaints dataset and create a TF-IDF matrix for text analysis. We can then apply stopword removal and then stemming.

Run the next code cell to make an interactive TDM to explore the input corpus size and the effects of applying stopword removal and stemming (*Note: for larger corpus sizes, be patient as it takes a bit longer to process and see the effects. It is clearer to see the matrix values using smaller corpus sizes.*):

In [None]:
somialabs.lab3.interact_complaints_tdm()

Now create the TF-IDF matrix:

In [None]:
somialabs.lab3.interact_complaints_tf_idf_matrix()

**Question 7.** What are the implications of data pre-processing for the objectivity of an analysis? (e.g. see Boyd & Crawford 2012 for a discussion)

## Training of SVM and classification

Once you have created a TDM, it is time to divide the dataset into a training set and a test set. Classification with supervised machine learning requires a training set, from which the algorithm learns how to categorize data. An SVM is fitted so that it can classify the training set. The classifier is then evaluated on the test set. This is the same process we used when we trained the decision tree in Lab 2.

Since training our classifier takes some time if we use the full complaints dataset, we will load only a sample of the first $x$ records (default 10000) for the rest of this lab.

Let us now visualize the distribution of complaint records according to the product categorization. Use the slider to explore how the sample size affects the distribution. You can also manually set the sample size by clicking on the number and entering your own value (*Hint: this is useful for exploring low values, for example less than 1000 or less than 100*):

In [None]:
somialabs.lab3.interact_plot_product_distribution()
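
The counting behind such a plot can be sketched as follows. The label list and the helper `product_distribution` are made up for illustration and do not reflect the real proportions in the complaints dataset.

```python
from collections import Counter

# Made-up product labels standing in for the `Product` column (illustrative only).
labels = ["Mortgage"] * 6 + ["Debt collection"] * 3 + ["Student loan"] * 1

def product_distribution(product_labels, sample_size):
    """Count labels in the first `sample_size` records, most common first."""
    return Counter(product_labels[:sample_size]).most_common()

# Print a simple text bar chart for the first 10 records.
for product, count in product_distribution(labels, 10):
    print(f"{product:<16} {'#' * count} ({count})")
```

Calling `product_distribution(labels, 6)` returns only the `Mortgage` category, which shows how a small sample taken from the start of the data can miss categories entirely.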

**Question 8.** What do you notice about the shape of the distribution at lower and higher sample sizes?

**Question 9.** What can you observe about the number of complaints per product? How might this affect our analysis?

### Classifier with no data pre-processing

We will use a Linear Support Vector Classification model from the Python `sklearn` library to create our classifier. We train the model using the TF-IDF matrix as input, alongside the relevant product labels. The TF-IDF matrix provides the training features and the product labels provide the target classes.

When training a model, we take an input dataset, in our case the input complaints records, and split it into a training dataset and a test dataset. This allows us to train the model with labelled data, and then test the trained model with labelled data that was not used in the training process. The `train_test_split()` function by default splits the input data into 75% training data and 25% test data.
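
A minimal sketch of this workflow with scikit-learn is shown below. The narratives and labels are made up, and the `stratify` and `random_state` arguments are choices made for this small example. The code inside `somialabs.lab3` is not shown in this lab and may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Made-up narratives and labels (illustrative only, not from the lab dataset).
narratives = [
    "my mortgage payment was applied late",
    "the mortgage escrow account is wrong",
    "mortgage lender lost my payment",
    "wrong interest rate on my mortgage",
    "debt collector calls me every day",
    "collector reported a debt I already paid",
    "debt collection agency keeps calling",
    "I do not owe this debt to the collector",
]
products = ["Mortgage"] * 4 + ["Debt collection"] * 4

# Default split: 75% training, 25% test. `stratify` keeps both classes in each part.
X_train, X_test, y_train, y_test = train_test_split(
    narratives, products, random_state=0, stratify=products
)

# Fit the TF-IDF vocabulary on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(X_train)
test_features = vectorizer.transform(X_test)

classifier = LinearSVC()
classifier.fit(train_features, y_train)

# Print the true label, the predicted label and the narrative for each test record.
for text, truth, predicted in zip(X_test, y_test, classifier.predict(test_features)):
    print(f"{truth:<16} {predicted:<16} {text}")
```

With eight records, the default split leaves six for training and two for testing. Fitting the vectorizer on the training text alone is a design choice in this sketch that keeps the test records unseen during training.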

Run the next cell to train the model on the input complaints data and output a summary table of some complaint narratives, with their true classification and the classification predicted by the Linear SVC model we just trained. Use the slider to explore how the training and test dataset sizes affect the performance of the predictions:

In [None]:
somialabs.lab3.interact_linearSVC()

**Question 10.** From inspecting the table, how well do you think the classifier performs? Were there any misclassifications? Explain your observations.

By inspecting the results table above, we can see if the classifier has done a good or bad job (it should have done an OK job). However, we can quantify the accuracy. We do this using cross-validation. This checks the predictions against known values to produce some quantifiable statistics about the performance of the classifier.

Run the next cell to run cross-validation on our classifier. *Note: when using a large corpus size, be patient. There may be a delay in producing the cross-validation scores*:

In [None]:
somialabs.lab3.interact_linearSVC_cross_validation()

The array produced gives us a list of scores of the classifier for each of 5 runs of the cross validation.
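
A comparable array can be produced with scikit-learn's `cross_val_score`, as in the sketch below. The toy narratives and the use of a pipeline are assumptions for illustration, and the lab's own implementation may be set up differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up narratives, five per class so that each of the 5 folds can hold one of each.
mortgage = [
    "my mortgage payment was applied late",
    "the mortgage escrow account is wrong",
    "mortgage lender lost my payment",
    "wrong interest rate on my mortgage",
    "mortgage servicer refused my modification",
]
debt = [
    "debt collector calls me every day",
    "collector reported a debt I already paid",
    "debt collection agency keeps calling",
    "I do not owe this debt to the collector",
    "collector threatened me over an old debt",
]
narratives = mortgage + debt
products = ["Mortgage"] * len(mortgage) + ["Debt collection"] * len(debt)

# The pipeline refits TF-IDF inside every fold, so held-out text never shapes the vocabulary.
model = make_pipeline(TfidfVectorizer(), LinearSVC())

# cv=5: train on four fifths, score accuracy on the remaining fifth, five times.
scores = cross_val_score(model, narratives, products, cv=5)
print(scores)
print("mean accuracy:", scores.mean())
```

Each entry in `scores` is the accuracy on one held-out fold. Averaging them gives a single summary figure that depends less on any one particular split.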

**Question 11.** Based on the cross validation score, how well does the classifier perform with different input sizes? Does removing stopwords or applying stemming have different effects at low and high corpus sizes?

### Comparison of accuracies

You can probably see the general accuracies from each of the cross-validation results, but to make things a little clearer we can visualize the results.

*Note: At larger corpus sizes, the plots below will take some time to process since they are training 3 Linear SVMs and then running the cross validation on them. Be patient if it takes some time to render (should only be a couple of minutes max).*

Let us first look at the cross-validation scores for 1000 records as input to our models, then try using the slider to explore the effects of using less training data, or more training data:

In [None]:
somialabs.lab3.plot_interact_linearSVC_cross_val_comparison()

**Question 12.** What do you observe about the cross-validated accuracies using Linear SVC without pre-processed features, stopword removed features, and stemmed features? Can you explain the reasons behind your observation(s)?

Re-run the analysis using 25000 records from the input complaints dataset:

In [None]:
somialabs.lab3.plot_interact_linearSVC_cross_val_comparison()

**Question 13.** Does increasing the input training data size affect your previous observations? If so, provide possible reasons.