# PROJECT 1: Categorizing news articles

### Your task
* Given a bunch of Reuters news service articles, develop a set of labels for categorizing them
* Labels should be a single word or short phrase. Some articles might fit more than one label, and some might not fit any.
* Aim for about 10–15 labels, give or take
* Use methods from labs so far (keyword analysis, terminology extraction, topic models)
* No specific ‘correct’ answer; the process you use to develop the list is more important than the solution.

### Deliverables
* List of labels
* For each label, the number of articles from the dataset that fit that label
* The number of articles that don't fit any of the labels (ideally this won't be a big number)
* Annotated notebook showing your process

Import packages for text analysis:

In [1]:
import pandas as pd
import numpy as np
from cytoolz import *
import re
from tqdm.auto import tqdm
import pycld2

tqdm.pandas()

Load the dataset:

In [2]:
df = pd.read_parquet('s3://ling583/project1.parquet', storage_options={'anon':True})

Set up spaCy and a part-of-speech pattern for matching candidate terms:

In [3]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm', exclude=[
                 'parser', 'ner', 'lemmatizer', 'attribute_ruler'])

matcher = Matcher(nlp.vocab)
matcher.add('Term', [[{'TAG': {'IN': ['JJ', 'NN', 'NNP']}},
                      {'TAG': {'IN': ['JJ', 'NN', 'IN',
                                      'HYPH', 'NNP']}, 'OP': '*'},
                      {'TAG': {'IN': ['NN', 'NNP']}}]])


def get_candidates(text):
    doc = nlp(text)
    spans = matcher(doc, as_spans=True)
    return [tuple(tok.norm_ for tok in span) for span in spans]

Connect to a running Dask cluster to process the articles in parallel:

In [4]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:37385")
client

Client: Scheduler: tcp://127.0.0.1:37385, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 16.62 GB


Here we extract candidate domain-specific terminology from the Reuters data by running the spaCy matcher over every article and counting how often each candidate occurs:

In [5]:
import dask.bag as db
import dask.dataframe as dd

texts = dd.from_pandas(df['text'], npartitions=50).to_bag()

graph = texts.map(get_candidates).flatten().frequencies()

In [6]:
%%time

candidates = graph.compute()

CPU times: user 5.59 s, sys: 797 ms, total: 6.39 s
Wall time: 3min 49s
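
For reference, the `map(...).flatten().frequencies()` graph above does the same job as a plain map, flatten, and tally on one machine. The sketch below uses only the standard library. `count_candidates` and `toy_extract` are hypothetical names introduced here for illustration, and `toy_extract` is a simple stand-in for `get_candidates`, not the spaCy matcher.

```python
from collections import Counter
from itertools import chain


def count_candidates(texts, extract):
    """Map an extractor over texts, flatten the results, and tally them."""
    per_text = map(extract, texts)          # one list of candidates per text
    flat = chain.from_iterable(per_text)    # flatten into a single stream
    return list(Counter(flat).items())      # (candidate, frequency) pairs


def toy_extract(text):
    # Hypothetical stand-in for get_candidates: adjacent word pairs.
    words = text.lower().split()
    return list(zip(words, words[1:]))


toy_texts = ["interest rates rise", "interest rates fall"]
toy_counts = dict(count_candidates(toy_texts, toy_extract))
print(toy_counts[("interest", "rates")])
```

Like the Dask version, this returns a list of `(candidate, frequency)` pairs, which is the shape the `for c, f in candidates` loop further down expects.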


In [7]:
from nltk import ngrams


def get_subterms(term):
    k = len(term)
    for m in range(k-1, 1, -1):
        yield from ngrams(term, m)

In [8]:
from collections import Counter, defaultdict
from math import log2

freqs = defaultdict(Counter)
for c, f in candidates:
    freqs[len(c)][c] += f


def c_value(F, theta):

    termhood = Counter()
    longer = defaultdict(list)

    for k in sorted(F, reverse=True):
        for term in F[k]:
            if term in longer:
                discount = sum(longer[term]) / len(longer[term])
            else:
                discount = 0
            c = log2(k) * (F[k][term] - discount)
            if c > theta:
                termhood[term] = c
                for subterm in get_subterms(term):
                    if subterm in F[len(subterm)]:
                        longer[subterm].append(F[k][term])
    return termhood

In [11]:
terms = c_value(freqs, theta=200)
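
To see what the C-value scoring does, here is a small self-contained sketch that follows the same logic as `c_value` on made-up frequencies. The names `subterms`, `c_value_toy`, and `toy_freqs` are hypothetical and the counts are invented for illustration. `subterms` is a dependency-free equivalent of `get_subterms`.

```python
from collections import defaultdict
from math import log2


def subterms(term):
    """Yield contiguous sub-sequences of length len(term)-1 down to 2."""
    k = len(term)
    for m in range(k - 1, 1, -1):
        for i in range(k - m + 1):
            yield term[i:i + m]


def c_value_toy(F, theta):
    """Score terms, longest first, discounting terms nested in longer ones."""
    termhood = {}
    longer = defaultdict(list)  # term -> frequencies of accepted longer terms
    for k in sorted(F, reverse=True):
        for term, freq in F[k].items():
            nested = longer.get(term)
            discount = sum(nested) / len(nested) if nested else 0
            c = log2(k) * (freq - discount)
            if c > theta:
                termhood[term] = c
                for sub in subterms(term):
                    if sub in F.get(len(sub), {}):
                        longer[sub].append(freq)
    return termhood


# Invented frequencies, keyed by term length
toy_freqs = {
    3: {('real', 'estate', 'market'): 4},
    2: {('real', 'estate'): 10, ('estate', 'market'): 4},
}
toy_terms = c_value_toy(toy_freqs, theta=0)
print(toy_terms)
```

In this toy case `('estate', 'market')` only ever occurs inside the longer term, so its discounted score is 0 and it is dropped. `('real', 'estate')` occurs 6 more times on its own and is kept with a score of 6.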

Save the MWEs (multi-word expressions) to reuters-terms.txt for use in the next steps

In [14]:
with open('reuters-terms.txt', 'w') as f:
    for t in terms:
        print(' '.join(t), file=f)

Load the saved terms into the MWE tokenizer:

In [15]:
from tokenizer import MWETokenizer

tokenizer = MWETokenizer(open('reuters-terms.txt'))

In [16]:
import tomotopy as tp
import time

In [17]:
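k = 100                 # number of topics
min_df = 100            # drop words that appear in fewer than this many documents
rm_top = 75             # drop this many of the most frequent words
tw = tp.TermWeight.ONE  # weight every term equally
alpha = 0.1             # Dirichlet prior on document-topic distributions
eta = 0.01              # Dirichlet prior on topic-word distributions
tol = 1e-3              # stop when the log-likelihood gain falls below this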

Tokenize

In [18]:
df['tokens'] = df['text'].progress_apply(tokenizer.tokenize)

  0%|          | 0/50085 [00:00<?, ?it/s]

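Estimate the topic model, training in blocks of 50 iterations until the per-word log-likelihood improves by less than `tol`: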

In [19]:
%%time

mdl = tp.LDAModel(k=k, min_df=min_df, rm_top=rm_top, tw=tw, alpha=alpha, eta=eta)

for doc in df['tokens']:
    if doc:
        mdl.add_doc(doc)

last = -np.inf  # np.NINF was removed in NumPy 2.0
for i in range(0, 5000, 50):
    mdl.train(50)
    ll = mdl.ll_per_word
    print(f'{i:5d} LL = {ll:7.4f}', flush=True)
    # Stop once the gain over the last 50 iterations drops below tol
    if ll - last < tol:
        break
    last = ll

print('Done!')

    0 LL = -7.8677
   50 LL = -7.7306
  100 LL = -7.6745
  150 LL = -7.6443
  200 LL = -7.6265
  250 LL = -7.6128
  300 LL = -7.6054
  350 LL = -7.5990
  400 LL = -7.5933
  450 LL = -7.5907
  500 LL = -7.5878
  550 LL = -7.5854
  600 LL = -7.5838
  650 LL = -7.5830
Done!
CPU times: user 19min 23s, sys: 9.99 s, total: 19min 33s
Wall time: 6min 15s
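
As a quick check on the stopping rule, the sketch below replays the logged values through the same comparison. The helper `first_converged` is a hypothetical name introduced here. The values are copied from the log above and are rounded to four decimals, so this only approximates what the training loop saw.

```python
import math


def first_converged(lls, tol, step=50):
    """Return the iteration label where the gain first falls below tol."""
    last = -math.inf
    for n, ll in enumerate(lls):
        if ll - last < tol:
            return n * step
        last = ll
    return None  # never converged within the logged values


logged = [-7.8677, -7.7306, -7.6745, -7.6443, -7.6265, -7.6128, -7.6054,
          -7.5990, -7.5933, -7.5907, -7.5878, -7.5854, -7.5838, -7.5830]
stop = first_converged(logged, tol=1e-3)
print(stop)
```

The gain between the last two log lines is about 0.0008, the first one below `tol`, which is why training stopped at the line labeled 650.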


List the top words for each topic:

In [20]:
topics = pd.DataFrame({'words': [' '.join(map(first, mdl.get_topic_words(t)))
                                 for t in range(mdl.k)]})

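Write the topics to topics.csv, add a `label` column to the file by hand (leaving it blank for topics that don't suggest a label), then read the labeled file back in: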

In [25]:
topics.to_csv('topics.csv', index=False)

In [50]:
topics = pd.read_csv('topics.csv')

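Count the number of articles associated with each label, then estimate the number of articles not associated with any label by subtracting the sum of those counts from the total number of articles. An article is counted once for every labeled topic whose probability is above 0.01, so an article that matches several labeled topics is counted more than once and the final figure is only a rough estimate.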

In [51]:
label_freqs = {}

for article in mdl.docs:
    for tag, prob in article.get_topics():
        if prob > 0.01:
            label = topics['label'].loc[tag]
            if pd.isna(label):
                continue  # this topic was left unlabeled
            label_freqs[label] = label_freqs.get(label, 0) + 1
print(label_freqs)
print(len(df) - sum(label_freqs.values()))

{'hotels': 2686, 'telecommunication': 2379, 'business': 10722, 'stock': 5297, 'airline': 4099, 'europe': 2138, 'labor': 4988, 'politics': 3587, 'asia': 4004, 'maritime transport': 3728, 'media': 1291, 'shipping': 2298, 'stocks': 1240, 'north america': 1276}
352
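
One way to make these counts exact is to collect the set of labels for each article, so an article counts at most once per label and is only "unlabeled" if that set is empty. This is a minimal sketch on toy data, not the code that produced the numbers above. `count_labels`, `toy_labels`, and `toy_docs` are hypothetical names, and the threshold default simply mirrors the 0.01 used above.

```python
from collections import Counter


def count_labels(doc_topics, topic_labels, threshold=0.01):
    """Count articles per label and articles with no label at all.

    doc_topics: one list of (topic_id, probability) pairs per article.
    topic_labels: dict of topic_id -> label, or None if unlabeled.
    """
    label_counts = Counter()
    unlabeled = 0
    for pairs in doc_topics:
        labels = {topic_labels.get(t) for t, p in pairs if p > threshold}
        labels.discard(None)  # ignore topics without a label
        if labels:
            label_counts.update(labels)  # at most once per label
        else:
            unlabeled += 1
    return label_counts, unlabeled


# Invented toy data
toy_labels = {0: 'stocks', 1: 'stocks', 2: 'labor', 3: None}
toy_docs = [
    [(0, 0.6), (1, 0.3)],  # two topics, same label: counted once
    [(0, 0.5), (2, 0.4)],  # two different labels
    [(3, 0.9)],            # only an unlabeled topic
    [(2, 0.005)],          # labeled topic, but below the threshold
]
counts, unlabeled = count_labels(toy_docs, toy_labels)
print(dict(counts), unlabeled)
```

With the real model, `doc_topics` could be built from `article.get_topics()` for each article in `mdl.docs`, and `topic_labels` from the `label` column of `topics`. Articles that were skipped because they had no tokens are not in `mdl.docs`, so they would need to be added to the unlabeled count separately.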
