# Prodigy - Text Annotation

Prodigy supports manual annotation, but we skipped it because of time constraints. Without a 'gold standard' to test against, the probability scores below are our only evidence of whether the model worked.

```
# textcat.teach starts an annotation session for text classification. The base model can be one that already contains an NER component.
prodigy textcat.teach text_class_hurricane_conspiracy ./Jared/temp_model/ ./data/conspiracy_3000.jsonl --label HOAX --patterns ./Jared/harvey_model_annotations.jsonl

# train textcat trains a text classifier on the annotated dataset and saves the resulting model. The base model can be one that already contains an NER component.
prodigy train textcat text_class_hurricane_conspiracy "./Jared/temp_model/" --output "./text_classification_ner" >> textcat_conspiracy.out
```

The result of the training run is shown below.

![train_result.png](attachment:4659a7b7-5689-4c5f-a718-c558e74bf68a.png)

In [2]:
import pandas as pd
import json
import spacy

Load the text classification model that also includes the NER component.

In [3]:
nlp = spacy.load("text_classification_ner")

Load some test data. Note that because we did not manually label this data, we have no ground truth to evaluate the predictions against.

In [30]:
df = pd.read_csv("../../data/irma/hurricaneirma_10000.csv")

In [35]:
df.shape

(10000, 15)

In [46]:
text_scores = []

# Score every tweet and keep the HOAX probability next to its text.
for idx, row in df.iterrows():
    text = row['text']
    doc = nlp(text)
    text_scores.append([doc.cats['HOAX'], doc.text])

| | P(Hoax) | Text |
|---|---|---|
| 0 | 0.994979 | SW #Lakeland! @Spectrum Cable OUT For 1 Hour&a... |
| 1 | 0.997419 | @Jney2Leadership is back on track after being ... |
| 2 | 0.978492 | The latest The ASEAN Global Hub! http://paper.... |
| 3 | 0.998786 | With #HurricaneIrma following so closely, it m... |
| 4 | 0.987689 | Send dm for more info! mariapauladc #puertoric... |
| ... | ... | ... |
| 9995 | 0.997527 | If you're looking for a pet you lost during #H... |
| 9996 | 0.998619 | Ministering to the Needs of Those Affected by ... |
| 9997 | 0.994531 | Before making charitable donations, review the... |
| 9998 | 0.993639 | BACK TO EXPLORING! #HurricaneIrma made some bi... |
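
The row-by-row scoring loop above can also be written with spaCy's batched `nlp.pipe`, which is usually faster on a few thousand texts. The following is only a sketch: the helper name `score_texts`, the batch size, and the decision to skip missing or empty texts are assumptions, not part of the original notebook.

```python
from typing import Iterable, List, Tuple


def score_texts(nlp, texts: Iterable[str], label: str = "HOAX",
                batch_size: int = 256) -> List[Tuple[float, str]]:
    """Return (score, text) pairs for `label` using batched inference.

    `nlp` is assumed to be a loaded spaCy pipeline with a text classifier.
    """
    # Skip missing or empty values (e.g. NaN from an empty CSV cell).
    clean = [t for t in texts if isinstance(t, str) and t.strip()]
    results = []
    # nlp.pipe processes the texts in batches instead of one call per row.
    for doc in nlp.pipe(clean, batch_size=batch_size):
        results.append((doc.cats[label], doc.text))
    return results
```

With the objects from this notebook it could be called as `text_scores = score_texts(nlp, df["text"])`.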


In [47]:
scores = pd.DataFrame(text_scores, columns=['P(Hoax)','Text'])

In [48]:
scores['P(Hoax)'].mean()

0.991505790707469
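
The mean alone does not show how the scores are spread. A small sketch like the one below, using only the standard library, could summarize the distribution and the share of texts above a decision threshold. The helper name `summarize_scores` and the 0.5 threshold are assumptions chosen for illustration.

```python
from statistics import mean, median
from typing import Dict, Sequence


def summarize_scores(values: Sequence[float],
                     threshold: float = 0.5) -> Dict[str, float]:
    """Summarize how the class probabilities are distributed."""
    if len(values) == 0:
        raise ValueError("no scores to summarize")
    # Count how many texts would be labelled positive at this threshold.
    above = sum(1 for v in values if v >= threshold)
    return {
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
        "share_above_threshold": above / len(values),
    }
```

For the data here it could be called as `summarize_scores(scores['P(Hoax)'].tolist())`. A minimum close to the maximum would suggest that the classifier is not separating the texts at all.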

I can't make sense of this score. Either the way we annotated (on a subsample) and combined NER with text classification is incorrect, or I did not annotate enough samples for the text-classification training step. Another possibility is that the model is heavily biased because it was trained on such a large number of posts from r/Conspiracy. I had assumed the model was overfit because of its high training score (see above), but the fact that text classification assigns a high Hoax probability to almost every message means that something in the pipeline needs to be fixed.
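
One way to probe the annotation hypothesis is to count how many accepted and rejected examples the dataset actually contains, since a classifier that has seen almost no rejected examples has little reason to predict a low probability. The sketch below assumes the annotations were exported to JSONL (for example with `prodigy db-out`) and that each task carries `label` and `answer` fields. The helper name `count_answers` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_answers(jsonl_lines: Iterable[str], label: str = "HOAX") -> Counter:
    """Count Prodigy answers (accept / reject / ignore) for one label."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        task = json.loads(line)
        if task.get("label") == label:
            # Tasks without an answer are counted under "missing".
            counts[task.get("answer", "missing")] += 1
    return counts
```

An open file handle can be passed directly, for example `count_answers(open("annotations.jsonl"))`, where `annotations.jsonl` is a placeholder name for the exported dataset.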