# Text Classification: Be lazy, use Prodigy!

Text classification can be complex to implement and to tune. If a good-enough, production-ready, easy-to-upgrade model meets your objective, use [Prodigy](https://prodi.gy/) to train a classifier for [spaCy](https://spacy.io/), the production-ready Python framework for NLP.

Text classification is often a great exercise for diving deep into NLP techniques, because you can test and apply a lot of tools: from TF-IDF to word embeddings, training your own doc2vec/word2vec, applying classic classifiers, testing neural networks such as RNNs, etc.

Indeed, you can spend a **HUGE** amount of time building your classification strategy and improving your algorithms: you will augment your data, create new features, try new tricks to increase the accuracy of your models, etc. It's endless. If you're lucky, you will find the golden feature that boosts your model on your specific data, and if you're even luckier, a secret sauce that generalizes it to a wider domain.
But it's hard. Sometimes the solution is really specific to your business domain (it can depend on the vocabulary used, the length of the sentences, etc.), and tuning it usually takes time. That may not be compatible with a production environment where you have to deliver several classifiers quickly and maintain them over time.
So, if a good-enough (i.e. not 0.9999), production-ready, easy-to-upgrade model meets your objective, I suggest trying Prodigy to train a classification model usable with spaCy, the production-ready Python framework for NLP.

# Prodigy
As explained by its creators, [Prodigy](https://prodi.gy/) is:

`
an annotation tool so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration. Whether you’re working on entity recognition, intent detection or image classification, Prodigy can help you train and evaluate your models faster. Stream in your own examples or real-world data from live APIs, update your model in real-time and chain models together to build more complex systems.
`

So Prodigy provides web interfaces to tag your data (texts, images, etc.), but also command-line tools to train new models. It is this last part that I will use in my example to train a text classifier.

## Example Description: Spooky Competition
For my example, I am using the dataset provided by the [Spooky Kaggle Competition](), where the objective was to predict the author of excerpts from horror stories by Edgar Allan Poe (EAP), Mary Shelley (MWS), and H.P. Lovecraft (HPL).
The tagging is already done, so I do not have to use Prodigy's web interface this time (which is great, by the way!).

What I have to do is:
- prepare the data for Prodigy,
- build the dataset for Prodigy,
- and train a new model using the tool.

### Import libraries

In [None]:
import os, sys, time
import json
import csv

## Prepare Data for Prodigy
Kaggle provides the data as CSV. I first create some functions to work with JSON instead of CSV files:
- create_json: create a JSON file from a CSV file
- load_json: load the content of a JSON file

In [None]:
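def create_json(csv_file, json_file):
    """Convert a CSV file into a JSON file holding a list of row dictionaries.

    Return True on success, False if the CSV is missing or cannot be converted.
    """
    if not os.path.exists(csv_file):
        return False
    try:
        # newline='' lets the csv module handle line breaks inside quoted fields
        with open(csv_file, newline='') as f:
            rows = list(csv.DictReader(f))

        with open(json_file, 'w') as f:
            json.dump(rows, f)
    except (OSError, csv.Error):
        return False
    return True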

In [None]:
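def load_json(json_file):
    """Load the list of rows stored in a JSON file.

    Return an empty list if the file is missing or unreadable.
    """
    if not os.path.exists(json_file):
        return []
    try:
        with open(json_file) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        return []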
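
As a quick illustration of what these helpers produce, the sketch below runs the same `csv.DictReader` to `json` conversion in memory. The sample row reuses the sentence quoted further down in this post, and the `id,text,author` column layout is an assumption about the Kaggle training file.

```python
import csv
import io
import json

# Assumed column layout of train.csv: id, text, author.
sample_csv = (
    'id,text,author\n'
    '"id17569","It never once occurred to me that the fumbling '
    'might be a mere mistake.","HPL"\n'
)

# csv.DictReader turns each row into a dictionary keyed by the header names.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# json.dumps / json.loads round-trip the rows unchanged: every value is a string.
restored = json.loads(json.dumps(rows))

print(restored[0]["id"])
print(restored[0]["author"])
```

Each row comes back as a plain dictionary of strings, which is the shape the rest of the code expects when it reads `entry['id']`, `entry['text']` and `entry.get('author')`.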

### Build the dataset for Prodigy
The function opens each JSON file, then retrieves:
- **texts**: the text of the sentence
- **ids**: the id of the sentence
- **labels**: the author, taken from the train file. Nothing is retrieved when it is missing (test file)

In [None]:
def build_dataset(json_files):
    ids = []
    texts = []
    labels = []

    for jfile in json_files:
        j = load_json(jfile)
        for entry in j :
            ids.append(entry['id'])
            texts.append(entry['text'])
            if entry.get('author'):
                labels.append(entry['author'])
    return texts, ids, labels

It's time to create the input file for Prodigy.

### Create the JSONL file for Prodigy

Prodigy requires a JSONL file as input data, with one JSON dictionary per line.

Each JSON record must contain at least the following keys:
`
{"text": "text of the sentence", "label": "category of the sentence", "answer": "'reject' or 'accept'"}
`

So, I have to create JSON records with:
- a text item, populated with the text of the sentence,
- a label item, populated with the author of the sentence,
- a meta item holding the id of the sentence, only for my convenience,
- and an answer item, describing whether the association between the text and the label is valid.

`
Note regarding the answer item: Prodigy needs both 'reject' and 'accept' data to train its model. If we only provide 'accept' data, the outputs will not be good at all.
So, I create 'accept' data with the correct author, and 'reject' data with the others.
Example:
The sentence "It never once occurred to me that the fumbling might be a mere mistake." is from H.P. Lovecraft (HPL). So, the JSONL file must look like:
{"answer": "reject", "meta": {"id": "id17569"}, "text": "It never once occurred to me that the fumbling might be a mere mistake.", "label": "MWS"}
{"answer": "accept", "meta": {"id": "id17569"}, "text": "It never once occurred to me that the fumbling might be a mere mistake.", "label": "HPL"}
{"answer": "reject", "meta": {"id": "id17569"}, "text": "It never once occurred to me that the fumbling might be a mere mistake.", "label": "EAP"}
`
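
The accept/reject expansion described in the note can be sketched as a small helper. `expand_example` is a hypothetical name introduced here for illustration; it is not part of Prodigy, and it only mirrors the rule stated above: one record per candidate label, with 'accept' for the true author and 'reject' for the others.

```python
import json

def expand_example(text, true_label, all_labels, example_id=None):
    """Return one record per candidate label.

    The record for the true label is marked 'accept', all others 'reject'.
    """
    records = []
    for label in all_labels:
        record = {
            "text": text,
            "label": label,
            "answer": "accept" if label == true_label else "reject",
        }
        if example_id is not None:
            # Optional metadata, kept only for convenience.
            record["meta"] = {"id": example_id}
        records.append(record)
    return records

records = expand_example(
    "It never once occurred to me that the fumbling might be a mere mistake.",
    "HPL",
    ["MWS", "HPL", "EAP"],
    example_id="id17569",
)
for record in records:
    # One JSON object per line, as expected in a JSONL file.
    print(json.dumps(record))
```

With three authors, every sentence yields exactly one 'accept' record and two 'reject' records.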

In [None]:
#path to folders
path_to_json = './data/'
save_dir = "./save/"

#labels
labels = ['MWS','HPL','EAP']

#transform csv into json files:
create_json(path_to_json + "train.csv", path_to_json+ "train.json")
create_json(path_to_json + "test.csv", path_to_json + "test.json")

#Retrieve data from json files:
json_file_train = [path_to_json + pos_json for pos_json in os.listdir(path_to_json) if pos_json == 'train.json']
json_file_test = [path_to_json + pos_json for pos_json in os.listdir(path_to_json) if pos_json == 'test.json']

texts_train, ids_train, labels_train = build_dataset(json_file_train)
texts_test, ids_test, _ = build_dataset(json_file_test)

#create jsonl file for prodigy:
os.makedirs(save_dir, exist_ok=True)  # make sure the output folder exists
with open(save_dir + "spooky.jsonl", "w") as jsonl:
    for text, author, text_id in zip(texts_train, labels_train, ids_train):
        # one record per candidate label:
        # 'accept' for the true author, 'reject' for the others
        for l in labels:
            line = {
                'text': text,
                'label': l,
                'meta': {'id': text_id},
                'answer': 'accept' if l == author else 'reject',
            }
            jsonl.write(json.dumps(line) + "\n")
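
Before handing the file to Prodigy, it can be worth checking that the accept/reject balance is what you expect. The sketch below is a minimal sanity check under the format described above; `count_answers` is a hypothetical helper, and the three sample lines are just the example records from the note.

```python
import json
from collections import Counter

def count_answers(lines):
    """Count records per answer value in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record["answer"]] += 1
    return counts

sample_text = "It never once occurred to me that the fumbling might be a mere mistake."
sample_lines = [
    json.dumps({"answer": "reject", "meta": {"id": "id17569"}, "text": sample_text, "label": "MWS"}),
    json.dumps({"answer": "accept", "meta": {"id": "id17569"}, "text": sample_text, "label": "HPL"}),
    json.dumps({"answer": "reject", "meta": {"id": "id17569"}, "text": sample_text, "label": "EAP"}),
]

counts = count_answers(sample_lines)
print(counts["accept"], counts["reject"])
```

Since an open file is itself an iterable of lines, the same function can be applied to the generated file, for example `with open(save_dir + "spooky.jsonl") as f: count_answers(f)`. With three authors, the number of 'reject' records should be twice the number of 'accept' records.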

# Train the model using Prodigy

I first create the dataset for my task:

`
bash-3.2$ prodigy dataset spooky
`

Then import your JSONL file:

`
bash-3.2$ prodigy db-in spooky path_to_your_jsonl
`

And train the model:

`
bash-3.2$ prodigy textcat.batch-train spooky --output spooky-model --eval-split 0.1 --n-iter 30
`

# Import and test the model
Now I just have to load the trained model with spaCy to use it, which is really easy to do:

In [None]:
import spacy

In [None]:
nlp = spacy.load('path_to_spooky-model')

In [None]:
solution = [['id','EAP','HPL','MWS']]
for i in range(len(texts_test)):
    doc = nlp(texts_test[i])
    solution.append([ids_test[i],doc.cats['EAP'],doc.cats['HPL'],doc.cats['MWS']])
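
`doc.cats` is a dictionary mapping each label to a float score. Depending on how the classifier was configured, these scores may not sum to 1 across the three authors (an assumption worth checking on your own model). If you want one probability distribution per row, or simply the most likely author, a minimal post-processing sketch could look like this. `normalize_scores` and `predicted_label` are hypothetical helpers and the scores are made up for illustration.

```python
def normalize_scores(cats):
    """Rescale a {label: score} dictionary so that the scores sum to 1."""
    total = sum(cats.values())
    if total <= 0:
        # Degenerate case: fall back to a uniform distribution.
        return {label: 1.0 / len(cats) for label in cats}
    return {label: score / total for label, score in cats.items()}

def predicted_label(cats):
    """Return the label with the highest score."""
    return max(cats, key=cats.get)

# Made-up scores, shaped like doc.cats for the three authors.
example_cats = {"EAP": 0.1, "HPL": 0.6, "MWS": 0.1}

probs = normalize_scores(example_cats)
best = predicted_label(example_cats)
print(best, probs)
```

In the loop above, this would mean calling `normalize_scores(doc.cats)` before appending the row to `solution`.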

### Create the result file
Write the results to a CSV file:

In [None]:
with open(os.path.join(save_dir, "result_pdy.csv"), "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(solution)