## 02: Process text and counts

This script uses a pretrained [spaCy](https://spacy.io) model to extract entities from JSONL-formatted data and count them. It expects each record to have a `"text"` string and a `"meta"` dict with a `"utc"` value containing the UTC timestamp (a Unix timestamp in seconds). Counts are generated per entity and per month and saved as a CSV with one row per entity and one column per month. For example:

```csv
,2012-01,2012-02
meat,1011.0,873.0
salt,805.0,897.0
chicken,694.0,713.0
```
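
To make the expected input concrete, here is a minimal sketch of one record and of how its timestamp maps to a month column. The record contents and the `month_key` helper are made up for illustration; only the `"text"`, `"meta"` and `"utc"` keys come from the script itself.

```python
from datetime import datetime, timezone

# Hypothetical record: the text and timestamp are invented for illustration.
record = {
    "text": "Season the chicken with salt.",
    "meta": {"utc": 1326067200},  # Unix timestamp in seconds
}


def month_key(record):
    """Return the 'YYYY-MM' bucket for a record's UTC timestamp."""
    timestamp = int(record["meta"]["utc"])  # also accepts numeric strings
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m")


print(month_key(record))
```

Every entity found in this record's text would be counted under the column that `month_key` returns.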

> ⚠️ **Important note:** If you have a lot of data, you probably want to split up your raw data and run multiple jobs in parallel. The next script that calculates the final counts and variance can take a directory of `.csv` files as its input, so reconciling the counts afterwards is no problem.
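
One possible way to split the raw data is sketched below, using only the standard library. The `split_jsonl` helper and the `part_000.jsonl` naming scheme are assumptions for illustration, not part of this project. Lines are distributed round-robin so each part covers a similar mix of records.

```python
from pathlib import Path


def split_jsonl(input_path, output_dir, n_parts):
    """Distribute the lines of a JSONL file round-robin across n_parts files."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = [output_dir / f"part_{i:03d}.jsonl" for i in range(n_parts)]
    handles = [path.open("w", encoding="utf8") for path in paths]
    try:
        with open(input_path, encoding="utf8") as f:
            for i, line in enumerate(f):
                if not line.strip():
                    continue  # skip blank lines
                if not line.endswith("\n"):
                    line += "\n"  # make sure every record ends its line
                handles[i % n_parts].write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

Each part can then be processed by its own copy of this notebook, with `DATA_FILE` and `OUTPUT_FILE` set per job so the resulting `.csv` files end up in one directory.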

In [None]:
SPACY_MODEL = "./food_model"      # path to spaCy model with entity recognizer
DATA_FILE = "./reddit.jsonl"      # preprocessed Reddit data created in previous step
OUTPUT_FILE = "./raw_counts.csv"  # path to output file
N_PROCESSES = 16                  # number of processes for multiprocessing
ENTITY_LABEL = "INGRED"           # label of entity to count

In [None]:
!pip install spacy srsly pandas

In [None]:
import spacy
from collections import Counter, defaultdict
import srsly
from datetime import datetime
import pandas as pd

In [None]:
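from datetime import timezone  # needed for the timezone-aware conversion below

# Nested counts: entity text -> Counter mapping "YYYY-MM" -> occurrences
counts = defaultdict(Counter)
nlp = spacy.load(SPACY_MODEL)
data = srsly.read_jsonl(DATA_FILE)  # generator, so records are streamed

# Pass each record along as context so its metadata stays attached to the doc
data_tuples = ((eg["text"], eg) for eg in data)
for doc, eg in nlp.pipe(data_tuples, as_tuples=True, n_process=N_PROCESSES):
    timestamp = int(eg["meta"]["utc"])
    # datetime.utcfromtimestamp is deprecated as of Python 3.12
    year_month = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m")
    for ent in doc.ents:
        if ent.label_ == ENTITY_LABEL:
            counts[ent.lower_][year_month] += 1

# Rows are entities, columns are months
df = pd.DataFrame(data=counts).transpose()
df.to_csv(OUTPUT_FILE)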
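
The counts in the saved CSV show up as floats (for example `1011.0`) because an entity that never occurs in a given month has no entry, and pandas stores the gap as `NaN`. The sketch below reproduces this with toy counts invented for illustration and shows one optional cleanup. Whether the next script prefers empty cells or zeros is not stated here, so treat the `fillna` step as an assumption rather than a required change.

```python
from collections import Counter, defaultdict

import pandas as pd

# Toy counts, invented for illustration: "meat" never occurs in 2012-02.
toy_counts = defaultdict(Counter)
toy_counts["meat"]["2012-01"] += 2
toy_counts["salt"]["2012-01"] += 1
toy_counts["salt"]["2012-02"] += 1

# Same construction as above: rows are entities, columns are months
toy_df = pd.DataFrame(data=toy_counts).transpose()

# Optional cleanup: treat missing months as zero and restore integer counts
clean_df = toy_df.fillna(0).astype(int)
print(clean_df)
```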