# Sentiment Classification with a Deep Learning Model

This notebook introduces a machine learning task from the field of natural language processing (machine learning focused on the processing of spoken and written text).

## Sentiment Analysis

The task modelled here is a classification task called sentiment analysis.
Text snippets are classified according to the positive or negative sentiment they express.
This can be modelled as a 3-class problem (negative, neutral, positive) or as a degree of sentiment on a 5-class or 10-class scale.




## Acknowledgement

The notebook is based on https://www.manning.com/books/real-world-natural-language-processing, an upcoming book focused on NLP.

The ML frameworks used are:

* pytorch
* allennlp
* spacy


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/erikgraf/deepLearning/blob/master/Deep_Learning_Sentiment_classifier.ipynb)

## Installing Dependencies

The cell below installs the main dependencies and clones a repository that forms the basis of the implementation.

Executing it with `CTRL + Enter` (`STRG + Enter` on a German keyboard) could take a couple of minutes.

In [None]:
!pip install allennlp==1.0.0
!pip install allennlp_models==1.0.0


In [None]:
!git clone https://github.com/mhagiwara/realworldnlp.git
%cd realworldnlp

## Imports

Execute the cell below to load all required modules. 

In [None]:
from typing import Dict

import numpy as np
import torch
import torch.optim as optim
from allennlp.data import DataLoader
from allennlp.data.samplers import BucketBatchSampler
from allennlp.data.vocabulary import Vocabulary
from allennlp.models import Model
from allennlp.modules.seq2vec_encoders import Seq2VecEncoder, PytorchSeq2VecWrapper
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.nn.util import get_text_field_mask
from allennlp.training.metrics import CategoricalAccuracy, F1Measure
from allennlp.training.trainer import GradientDescentTrainer
from allennlp_models.classification.dataset_readers.stanford_sentiment_tree_bank import \
    StanfordSentimentTreeBankDatasetReader

from realworldnlp.predictors import SentenceClassifierPredictor

## Hyperparameters

The cell below sets the hyperparameters.

* EMBEDDING_DIM: the dimensionality of the word embeddings (numeric vector representations of words, such as word2vec or GloVe (https://nlp.stanford.edu/projects/glove/)).
* HIDDEN_DIM: the dimensionality of the hidden state of the LSTM (Long Short-Term Memory) network.

A value of 128 is a common choice for both EMBEDDING_DIM and HIDDEN_DIM.




In [None]:
EMBEDDING_DIM = 128
HIDDEN_DIM = 128
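
To get a feel for what these two values control, the sketch below counts the trainable parameters of a single-layer, unidirectional LSTM as PyTorch implements it (four gates, each with input weights, recurrent weights and two bias vectors). The function name `lstm_parameter_count` is made up for this illustration and is not part of any library.

```python
def lstm_parameter_count(input_dim: int, hidden_dim: int) -> int:
    """Trainable parameters of a single-layer, unidirectional torch.nn.LSTM."""
    gates = 4  # input, forget, cell and output gate
    input_weights = input_dim * hidden_dim
    recurrent_weights = hidden_dim * hidden_dim
    biases = 2 * hidden_dim  # PyTorch keeps separate input and recurrent bias vectors
    return gates * (input_weights + recurrent_weights + biases)


# With EMBEDDING_DIM = 128 as input size and HIDDEN_DIM = 128:
print(lstm_parameter_count(128, 128))  # 132096
```

Doubling HIDDEN_DIM roughly quadruples the recurrent weights, so larger values increase training time noticeably.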

## Training Data Set

For training we will use the Stanford Sentiment Treebank, a data set for training sentiment analysis models. It is annotated with sentiment both at the sentence level and at the word level.

When loading the data set we can configure the granularity to `'5-class'` or `'3-class'`.

`'3-class'` classifies into `negative`, `neutral` and `positive`, encoded as `0`, `1` and `2`. `'5-class'` uses a finer scale from `0` (most negative) to `4` (most positive).
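
The relation between the two granularities can be sketched as follows. This is a minimal illustration that assumes the usual treebank convention (labels `0` and `1` are negative, `2` is neutral, `3` and `4` are positive); the names `FIVE_CLASS_NAMES`, `THREE_CLASS_NAMES` and `five_to_three_class` are hypothetical helpers, not part of AllenNLP.

```python
# Human-readable names, indexed by label ID.
FIVE_CLASS_NAMES = ["very negative", "negative", "neutral", "positive", "very positive"]
THREE_CLASS_NAMES = ["negative", "neutral", "positive"]


def five_to_three_class(label: int) -> int:
    """Collapse a 5-class label (0..4) into a 3-class label (0..2)."""
    if label < 2:
        return 0  # very negative and negative -> negative
    if label == 2:
        return 1  # neutral stays neutral
    return 2      # positive and very positive -> positive


for label, name in enumerate(FIVE_CLASS_NAMES):
    coarse = five_to_three_class(label)
    print(f"{label} ({name}) -> {coarse} ({THREE_CLASS_NAMES[coarse]})")
```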





In [None]:
reader = StanfordSentimentTreeBankDatasetReader(granularity='5-class')

In [None]:
train_dataset = reader.read('https://s3.amazonaws.com/realworldnlpbook/data/stanfordSentimentTreebank/trees/train.txt')
dev_dataset = reader.read('https://s3.amazonaws.com/realworldnlpbook/data/stanfordSentimentTreebank/trees/dev.txt')

## Model Implementation in AllenNLP

Execute the cell below to define the classification model.

Depending on the granularity chosen (3-class vs. 5-class), change the default `positive_label` in the `__init__` method to `'2'` or `'4'`.


In [None]:
# Model in AllenNLP represents a model that is trained.
@Model.register("lstm_classifier")
class LstmClassifier(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2VecEncoder,
                 vocab: Vocabulary,
                 positive_label: str = '4') -> None:
        super().__init__(vocab)
        # We need the embeddings to convert word IDs to their vector representations
        self.word_embeddings = word_embeddings

        self.encoder = encoder

        # After converting a sequence of vectors to a single vector, we feed it into
        # a fully-connected linear layer to reduce the dimension to the total number of labels.
        self.linear = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                      out_features=vocab.get_vocab_size('labels'))

        # Monitor the metrics - we use accuracy, as well as prec, rec, f1 for 4 (very positive)
        positive_index = vocab.get_token_index(positive_label, namespace='labels')
        self.accuracy = CategoricalAccuracy()
        self.f1_measure = F1Measure(positive_index)

        # We use the cross entropy loss because this is a classification task.
        # Note that PyTorch's CrossEntropyLoss combines softmax and log likelihood loss,
        # which makes it unnecessary to add a separate softmax layer.
        self.loss_function = torch.nn.CrossEntropyLoss()

    # Instances are fed to forward after batching.
    # Fields are passed through arguments with the same name.
    def forward(self,
                tokens: Dict[str, torch.Tensor],
                label: torch.Tensor = None) -> torch.Tensor:
        # In deep NLP, when sequences of tensors in different lengths are batched together,
        # shorter sequences get padded with zeros to make them equal length.
        # Masking is the process to ignore extra zeros added by padding
        mask = get_text_field_mask(tokens)

        # Forward pass
        embeddings = self.word_embeddings(tokens)
        encoder_out = self.encoder(embeddings, mask)
        logits = self.linear(encoder_out)

        # In AllenNLP, the output of forward() is a dictionary.
        # Your output dictionary must contain a "loss" key for your model to be trained.
        output = {"logits": logits}
        if label is not None:
            self.accuracy(logits, label)
            self.f1_measure(logits, label)
            output["loss"] = self.loss_function(logits, label)

        return output

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        precision, recall, f1_measure = self.f1_measure.get_metric(reset)
        return {'accuracy': self.accuracy.get_metric(reset),
                'precision': precision,
                'recall': recall,
                'f1_measure': f1_measure}
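
The comment in the model about cross entropy combining softmax and log likelihood can be made concrete with a few lines of plain Python. This is a sketch for a single example, not PyTorch's implementation (which works on batches and averages the per-example losses by default); `softmax` and `cross_entropy` are illustrative names.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of the target class under softmax(logits)."""
    return -math.log(softmax(logits)[target])


# An untrained 5-class model that scores all classes equally assigns each a
# probability of 0.2, so the loss is log(5), roughly 1.609.
print(cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], target=4))
```

A loss close to `log(5)` at the start of training on the 5-class task is therefore what one would expect from a model that has not learned anything yet.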

## Transform Text into Numeric Representation

The following cells are responsible for the transformation of text in string form into numeric representations that are suitable as learning input for the neural network.

1. Extract vocabulary of unique terms from the text
2. Create embeddings for the terms
3. Define transformation (encoding) for a sequence of text (i.e. a sentence)

In [None]:
# You can optionally specify the minimum count of tokens/labels.
# `min_count={'tokens':3}` here means that any tokens that appear less than three times
# will be ignored and not included in the vocabulary.
vocab = Vocabulary.from_instances(train_dataset + dev_dataset,
                                  min_count={'tokens': 3})
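
What `min_count` does can be illustrated without AllenNLP. The sketch below is a simplified stand-in for the vocabulary construction, not the library's code: `build_token_index` is a hypothetical helper, and reserving IDs 0 and 1 for padding and unknown tokens is an assumption made here to mirror the common convention.

```python
from collections import Counter
from typing import Dict, List


def build_token_index(sentences: List[List[str]], min_count: int) -> Dict[str, int]:
    """Map each sufficiently frequent token to an integer ID."""
    counts = Counter(token for sentence in sentences for token in sentence)
    # IDs 0 and 1 are reserved for padding and out-of-vocabulary tokens.
    index = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
    for token, count in counts.most_common():
        if count >= min_count:
            index[token] = len(index)
    return index


toy_sentences = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "movie"]]
token_index = build_token_index(toy_sentences, min_count=3)
print(token_index)

# "good" and "bad" occur only once, so they are looked up as @@UNKNOWN@@.
unknown_id = token_index["@@UNKNOWN@@"]
print([token_index.get(token, unknown_id) for token in ["a", "good", "movie"]])
```

Rare tokens end up sharing a single embedding, which keeps the embedding matrix small and avoids learning vectors from very few examples.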

In [None]:
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)

In [None]:
# BasicTextFieldEmbedder takes a dict - we need an embedding just for tokens,
# not for labels, which are used as-is as the "answer" of the sentence classification
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})

In [None]:
# Seq2VecEncoder is a neural network abstraction that takes a sequence of something
# (usually a sequence of embedded word vectors), processes it, and returns a single
# vector. Oftentimes this is an RNN-based architecture (e.g., LSTM or GRU), but
# AllenNLP also supports CNNs and other simple architectures (for example,
# just averaging over the input vectors).
encoder = PytorchSeq2VecWrapper(
    torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

## Configure Model for Training

The following four cells configure the model for training.

1. The LstmClassifier class takes the word_embeddings, the defined sequence encoder and the vocabulary as input configuration.

2. The data sets are indexed with the vocabulary and wrapped in a DataLoader. BucketBatchSampler is a helper class that groups instances of similar length into batches, which reduces the amount of padding.

3. optimizer specifies the learning rate for Adam (a mathematical optimisation method that guides the weight updates of our model).

4. trainer holds our instantiation of the model and defines the number of epochs.



In [None]:
model = LstmClassifier(word_embeddings, encoder, vocab)

In [None]:
# Index the data sets with the vocabulary so that tokens can be converted to IDs.
train_dataset.index_with(vocab)
dev_dataset.index_with(vocab)

# BucketBatchSampler batches instances with a similar number of tokens together.
train_data_loader = DataLoader(
    train_dataset,
    batch_sampler=BucketBatchSampler(train_dataset, batch_size=32, sorting_keys=["tokens"]))
dev_data_loader = DataLoader(
    dev_dataset,
    batch_sampler=BucketBatchSampler(dev_dataset, batch_size=32, sorting_keys=["tokens"]))

In [None]:
optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

In [None]:
trainer = GradientDescentTrainer(model=model,
                                 optimizer=optimizer,
                                 data_loader=train_data_loader,
                                 validation_data_loader=dev_data_loader,
                                 patience=40,
                                 num_epochs=40)
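
Bucketing, the idea behind the batching helper configured above, can be sketched in plain Python. The example compares how many padding positions are needed when sentences are batched in arbitrary order versus sorted by length. It is a simplified illustration with randomly generated sentence lengths, not AllenNLP's implementation (which, among other things, adds noise to the lengths before sorting); `padding_needed` and `make_batches` are made-up helper names.

```python
import random
from typing import List


def padding_needed(batches: List[List[int]]) -> int:
    """Padding positions required when each batch is padded to its longest sequence."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)


def make_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Split a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


random.seed(0)
sentence_lengths = [random.randint(3, 40) for _ in range(320)]

unsorted_batches = make_batches(sentence_lengths, batch_size=32)
bucketed_batches = make_batches(sorted(sentence_lengths), batch_size=32)
random.shuffle(bucketed_batches)  # the order of the batches can still be randomized

print("padding without bucketing:", padding_needed(unsorted_batches))
print("padding with bucketing:   ", padding_needed(bucketed_batches))
```

Less padding means fewer wasted computations per batch, which is why bucketing is a common default for variable-length text input.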


## Train

Execute the cell below to train.

In [None]:
trainer.train()

## Sanity Check

The cell below will allow you to enter sample sentences and test the predictions of the model.

In [None]:
predictor = SentenceClassifierPredictor(model, dataset_reader=reader)
logits = predictor.predict("Don't waste your money")['logits']
label_id = np.argmax(logits)

print(model.vocab.get_token_from_index(label_id, 'labels'))

## More Substantive Checks

To check in more depth how well the model performs and how well it might generalize, we can use a set of Amazon reviews.

http://jmcauley.ucsd.edu/data/amazon/

The site above holds a very large collection of Amazon reviews that can be used for scientific purposes.

### Task 1: Choose and Download a Subcategory

From the table below, choose a category that you will use for testing. 
Download the 5-core file, which holds the full text, title and rating of each review.


<html>

<table>
<tbody><tr>
  <td>Books</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books_10.json.gz">10-core</a> (4,701,968 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books_5.json.gz">5-core</a> (8,898,041 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Books.csv">ratings only</a> (22,507,155 ratings)</td>
</tr>

<tr>
  <td>Electronics</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_10.json.gz">10-core</a> (347,393 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz">5-core</a> (1,689,188 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Electronics.csv">ratings only</a> (7,824,482 ratings)</td>
</tr>

<tr>
  <td>Movies and TV</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Movies_and_TV_10.json.gz">10-core</a> (958,986 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Movies_and_TV_5.json.gz">5-core</a> (1,697,533 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Movies_and_TV.csv">ratings only</a> (4,607,047 ratings)</td>
</tr>

<tr>
  <td>CDs and Vinyl</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_CDs_and_Vinyl_10.json.gz">10-core</a> (445,412 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_CDs_and_Vinyl_5.json.gz">5-core</a> (1,097,592 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_CDs_and_Vinyl.csv">ratings only</a> (3,749,004 ratings)</td>
</tr>

<tr>
  <td>Clothing, Shoes and Jewelry</td>
  <!-- <td></td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Clothing_Shoes_and_Jewelry_5.json.gz">5-core</a> (278,677 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Clothing_Shoes_and_Jewelry.csv">ratings only</a> (5,748,920 ratings)</td>
</tr>

<tr>
  <td>Home and Kitchen</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_10.json.gz">10-core</a> (25,445 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz">5-core</a> (551,682 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Home_and_Kitchen.csv">ratings only</a> (4,253,926 ratings)</td>
</tr>

<tr>
  <td>Kindle Store</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Kindle_Store_10.json.gz">10-core</a> (367,478 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Kindle_Store_5.json.gz">5-core</a> (982,619 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Kindle_Store.csv">ratings only</a> (3,205,467 ratings)</td>
</tr>

<tr>
  <td>Sports and Outdoors</td>
  <!-- <td></td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Sports_and_Outdoors_5.json.gz">5-core</a> (296,337 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Sports_and_Outdoors.csv">ratings only</a> (3,268,695 ratings)</td>
</tr>

<tr>
  <td>Cell Phones and Accessories</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_10.json.gz">10-core</a> (1,854 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz">5-core</a> (194,439 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Cell_Phones_and_Accessories.csv">ratings only</a> (3,447,249 ratings)</td>
</tr>

<tr>
  <td>Health and Personal Care</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Health_and_Personal_Care_10.json.gz">10-core</a> (55,076 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Health_and_Personal_Care_5.json.gz">5-core</a> (346,355 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Health_and_Personal_Care.csv">ratings only</a> (2,982,326 ratings)</td>
</tr>

<tr>
  <td>Toys and Games</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Toys_and_Games_10.json.gz">10-core</a> (18,637 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Toys_and_Games_5.json.gz">5-core</a> (167,597 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Toys_and_Games.csv">ratings only</a> (2,252,771 ratings)</td>
</tr>

<tr>
  <td>Video Games</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Video_Games_10.json.gz">10-core</a> (52,158 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Video_Games_5.json.gz">5-core</a> (231,780 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Video_Games.csv">ratings only</a> (1,324,753 ratings)</td>
</tr>

<tr>
  <td>Tools and Home Improvement</td>
  <!-- <td></td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Tools_and_Home_Improvement_5.json.gz">5-core</a> (134,476 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Tools_and_Home_Improvement.csv">ratings only</a> (1,926,047 ratings)</td>
</tr>

<tr>
  <td>Beauty</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Beauty_10.json.gz">10-core</a> (28,798 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Beauty_5.json.gz">5-core</a> (198,502 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Beauty.csv">ratings only</a> (2,023,070 ratings)</td>
</tr>

<tr>
  <td>Apps for Android</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Apps_for_Android_10.json.gz">10-core</a> (264,050 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Apps_for_Android_5.json.gz">5-core</a> (752,937 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Apps_for_Android.csv">ratings only</a> (2,638,172 ratings)</td>
</tr>

<tr>
  <td>Office Products</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Office_Products_10.json.gz">10-core</a> (25,374 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Office_Products_5.json.gz">5-core</a> (53,258 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Office_Products.csv">ratings only</a> (1,243,186 ratings)</td>
</tr>

<tr>
  <td>Pet Supplies</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Pet_Supplies_10.json.gz">10-core</a> (3,152 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Pet_Supplies_5.json.gz">5-core</a> (157,836 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Pet_Supplies.csv">ratings only</a> (1,235,316 ratings)</td>
</tr>

<tr>
  <td>Automotive</td>
  <!-- <td></td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Automotive_5.json.gz">5-core</a> (20,473 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Automotive.csv">ratings only</a> (1,373,768 ratings)</td>
</tr>

<tr>
  <td>Grocery and Gourmet Food</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Grocery_and_Gourmet_Food_10.json.gz">10-core</a> (37,348 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Grocery_and_Gourmet_Food_5.json.gz">5-core</a> (151,254 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Grocery_and_Gourmet_Food.csv">ratings only</a> (1,297,156 ratings)</td>
</tr>

<tr>
  <td>Patio, Lawn and Garden</td>
  <!-- <td></td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Patio_Lawn_and_Garden_5.json.gz">5-core</a> (13,272 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Patio_Lawn_and_Garden.csv">ratings only</a> (993,490 ratings)</td>
</tr>

<tr>
  <td>Baby</td>
  <!-- <td></td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Baby_5.json.gz">5-core</a> (160,792 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Baby.csv">ratings only</a> (915,446 ratings)</td>
</tr>

<tr>
  <td>Digital Music</td>
  <!-- <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Digital_Music_10.json.gz">10-core</a> (22,772 reviews)</td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Digital_Music_5.json.gz">5-core</a> (64,706 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Digital_Music.csv">ratings only</a> (836,006 ratings)</td>
</tr>

<tr>
  <td>Musical Instruments</td>
  <!-- <td></td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Musical_Instruments_5.json.gz">5-core</a> (10,261 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Musical_Instruments.csv">ratings only</a> (500,176 ratings)</td>
</tr>

<tr>
  <td>Amazon Instant Video</td>
  <!-- <td></td> -->
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Amazon_Instant_Video_5.json.gz">5-core</a> (37,126 reviews)</td>
  <td><a href="http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ratings_Amazon_Instant_Video.csv">ratings only</a> (583,933 ratings)</td>
</tr>
</tbody>
    </table>
</html>

### Task 2

Sanity checks with the reviews.

The format of the files is as follows:

```
{
	"reviewerID": "A2ICI6VUC0U5K6",
	"asin": "B0014JKKGK",
	"reviewerName": "Jermin Botrous \"gigigigi\"",
	"helpful": [0, 0],
	"reviewText": "Don't waste your money because elastic goes bad after 2 washes",
	"overall": 1.0,
	"summary": "One Star",
	"unixReviewTime": 1404432000,
	"reviewTime": "07 4, 2014"
}
```

Use the following code snippets to load individual review texts.

Opening a file in Python:

```
# The with statement closes the file automatically after reading.
with open('file_name.json', 'r') as test_file:
    first_line = test_file.readline()
```

Transform the line into a JSON object to access the individual fields (such as reviewText).


```
import json
j_obj = json.loads(first_line)
print('reviewText: ' + j_obj['reviewText'])
```
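
The downloads in the table above are gzip-compressed (`.json.gz`), so it can be convenient to read them without unpacking first. The following sketch assumes that each line of the file holds one JSON object, as the snippets above suggest; `iter_reviews` is a hypothetical helper name.

```python
import gzip
import json
from typing import Dict, Iterator


def iter_reviews(path: str) -> Iterator[Dict]:
    """Yield one review dictionary per non-empty line of a (possibly gzipped) file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as review_file:
        for line in review_file:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, a loop such as `for review in iter_reviews('reviews_Musical_Instruments_5.json.gz'):` reads one review at a time and never holds the whole file in memory.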

Finally, use the code from above to test the prediction for the loaded review:

```
logits = predictor.predict(j_obj['reviewText'])['logits']
label_id = np.argmax(logits)
prediction = model.vocab.get_token_from_index(label_id, 'labels')
print(prediction)
```

### Task 3

Iterate over the reviews and extract:

* 100 positive predictions (i.e. 4)
* 100 negative predictions (i.e. 0)

Save the sets of positive and negative predictions as plain text files:

* categoryName_100_pos.txt
* categoryName_100_neg.txt

Manually inspect the predictions to identify potential false positives in both sets.
Store a couple of those false positives in the files:

* categoryName_100_pos_fp.txt
* categoryName_100_neg_fp.txt
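
One possible way to organise the collection step is sketched below. It is only a starting point under stated assumptions: `predict_label` stands for any function that maps a review text to a predicted label string (for example a small wrapper around the prediction code from Task 2), and `collect_predictions` and `save_lines` are hypothetical helper names.

```python
from typing import Callable, Dict, Iterable, List


def collect_predictions(texts: Iterable[str],
                        predict_label: Callable[[str], str],
                        wanted_labels: Iterable[str] = ("4", "0"),
                        per_label: int = 100) -> Dict[str, List[str]]:
    """Gather up to `per_label` texts for each wanted predicted label."""
    collected: Dict[str, List[str]] = {label: [] for label in wanted_labels}
    for text in texts:
        label = predict_label(text)
        if label in collected and len(collected[label]) < per_label:
            collected[label].append(text)
        if all(len(items) >= per_label for items in collected.values()):
            break  # stop early once every list is full
    return collected


def save_lines(path: str, lines: Iterable[str]) -> None:
    """Write one text per line, flattening line breaks inside a review."""
    with open(path, "w", encoding="utf-8") as out_file:
        for line in lines:
            out_file.write(" ".join(line.split()) + "\n")
```

The early exit matters for the larger categories, where predicting over millions of reviews just to fill two lists of 100 would take a long time.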

### Task 4

Generate a listing of false positives.
Analyse the data from Amazon.
How could this data be used to generate larger lists of false positives?
Derive a method that allows you to predict over the full content of the file and create lists of:
* True positive 'positive' predictions
* False positive 'positive' predictions
* True positive 'negative' predictions
* False positive 'negative' predictions

Save the four sets for your group submission:

* categoryName_pos_tp.txt
* categoryName_pos_fp.txt
* categoryName_neg_tp.txt
* categoryName_neg_fp.txt



### Task 5

Calculate approximate precision values based on your mapping from Task 4.
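
As a reminder, precision for a class is the number of true positives divided by the number of all predictions made for that class (true positives plus false positives). A minimal sketch follows; the counts in the example call are invented for illustration and are not results from the model.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of the predictions for a class that were actually correct."""
    predicted = true_positives + false_positives
    if predicted == 0:
        return 0.0  # convention chosen here: no predictions means precision 0
    return true_positives / predicted


# Invented example: 80 correct and 20 incorrect 'positive' predictions.
print(precision(true_positives=80, false_positives=20))  # 0.8
```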

Store the calculations as part of a readme or send the values by e-mail.

Submit the files from Task 3 and Task 4 as your group submission.

