Topic research
============

This notebook explores how unsupervised topic detection can give insight into a corpus. It covers two basic methods: TF-IDF and word embeddings. It does not yet cover LDA or other unsupervised clustering methods.

In [1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('../')

In [2]:
import json

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import spacy

from readers import JsonReader
from analysis import TopicDetector, TopicDetector1

In [3]:
#from spacy.lang.nl.stop_words import STOP_WORDS
STOP_WORDS = "english"  # takes stop words from sklearn

nlp = spacy.load("en_core_web_lg")
json_reader = JsonReader(source="death_penalty.json", subjects=["death", "penalty"])
texts = json_reader.get_texts()

Tfidf sums
-------------

This section builds n-gram word frames (one row per text, one column per n-gram) and sums each column to arrive at a ranked topic list.
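To make the idea concrete, here is a minimal standard-library sketch of an n-gram frame and its column sums. It uses raw counts rather than TF-IDF weights to keep the arithmetic visible. The helper names `word_ngrams` and `ngram_sums` and the toy texts are made up for illustration; `get_word_frame` below does the real work with scikit-learn.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text as space-joined strings."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_sums(texts, n):
    """Count n-grams per text (one 'frame row' each), then sum over all texts."""
    rows = [Counter(word_ngrams(text, n)) for text in texts]
    totals = Counter()
    for row in rows:
        totals.update(row)
    return totals


# Toy corpus, invented for this example.
toy_texts = [
    "death penalty abolished",
    "court upholds death penalty",
]
bigram_totals = ngram_sums(toy_texts, 2)
print(bigram_totals.most_common(3))
```

N-grams that occur in many texts end up with the largest sums, which is why they surface at the top of the topic list.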

In [4]:
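def get_word_frame(texts, ngram, sums=True):
    """Build a TF-IDF frame (texts x n-grams); return its column sums if `sums` is True."""
    tfidf_vectorizer = TfidfVectorizer(stop_words=STOP_WORDS, ngram_range=(ngram, ngram))
    tfidf_vectorizer.fit(texts)
    # get_feature_names() was removed in scikit-learn 1.2; prefer the newer method when available
    if hasattr(tfidf_vectorizer, "get_feature_names_out"):
        feature_names = list(tfidf_vectorizer.get_feature_names_out())
    else:
        feature_names = tfidf_vectorizer.get_feature_names()
    tfidf_vectors = tfidf_vectorizer.transform(texts)
    frame = pd.DataFrame(tfidf_vectors.toarray(), columns=feature_names)
    if ngram == 1:
        # Drop unigrams that are not purely alphabetic, such as numbers
        number_features = [feature for feature in feature_names if not feature.isalpha()]
        frame.drop(labels=number_features, axis=1, inplace=True)
    return frame.sum(axis=0) if sums else frame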

In [5]:
tfidf_frame = get_word_frame(texts, 1, sums=False)
tfidf_words_sorted = tfidf_frame.sum(axis=0).sort_values(ascending=False)
#tfidf_words_sorted_bi = get_word_frame(texts, 2)
#tfidf_words_sorted_bi.sort_values(ascending=False, inplace=True)
#tri_frame = get_word_frame(texts, 3)
#tfidf_words_sorted_tri = tri_frame.sum(axis=0).sort_values(ascending=False)
#tetra_frame = get_word_frame(texts, 4)
#tfidf_words_sorted_tetra = tetra_frame.sum(axis=0).sort_values(ascending=False)

In [6]:
with pd.option_context('display.max_rows', None, 'display.max_columns', 5):
    print(tfidf_words_sorted[:10])

#tfidf_words_sorted_bi
#tfidf_words_sorted_tri
#tfidf_words_sorted_tetra

death         33.849226
penalty       24.563764
row           14.395764
punishment     9.664681
court          9.277523
state          9.233713
capital        9.089691
said           7.893743
execution      7.819045
life           7.427824
dtype: float64


The process above of building a frame per n-gram size has been automated in the TopicDetector. There are two versions. TopicDetector1 is the original version, developed in this notebook; it only supports English. The current version (TopicDetector) is more memory efficient, also supports Dutch, and runs in production. Here we compare the output of the two versions for quality assurance.

In [7]:
td = TopicDetector(lambda text: text)
td.run(texts)


[['death penalty information center', 1.9240768079324992],
 ['help improve online experience', 1.3726796729596455],
 ['site uses cookies help', 1.3726796729596455],
 ['uses cookies help improve', 1.3726796729596455],
 ['cookies help improve online', 1.3726796729596455],
 ['arabia seeks death penalty', 1.2234212999172254],
 ['saudi arabia seeks death', 1.2234212999172254],
 ['amnesty international site uses', 1.0379342145438972],
 ['international amnesty international site', 1.0379342145438972],
 ['amnesty international skip main', 1.0379342145438972],
 ['death row records', 0.7265589527499751],
 ['people death row', 0.4631655057691228],
 ['death penalty new', 0.39693434619882884],
 ['mumia abu jamal', 0.38542143825936626],
 ['sentenced life prison', 0.37157138317259486],
 ['death lethal injection', 0.3638097536717997],
 ['death penalty said', 0.363570873989616],
 ['victims family members', 0.33809783507194774],
 ['international human rights', 0.3314568658817309],
 ['death penalty uncon

In [8]:
td1 = TopicDetector1()
td1(texts)

In [9]:
results = []
for ix, serie in td1.sorted_ngrams.items():
    results += [(topic, len(topic.split(" ")), importance) for topic, importance in serie[:10].items()]
results.sort(key=lambda result: (result[1], result[2],), reverse=True)
results 

[('death penalty information center', 4, 1.9240768079324986),
 ('help improve online experience', 4, 1.3726796729596453),
 ('site uses cookies help', 4, 1.3726796729596453),
 ('uses cookies help improve', 4, 1.3726796729596453),
 ('cookies help improve online', 4, 1.3726796729596453),
 ('arabia seeks death penalty', 4, 1.2234212999172256),
 ('saudi arabia seeks death', 4, 1.2234212999172256),
 ('amnesty international site uses', 4, 1.0379342145438972),
 ('international amnesty international site', 4, 1.0379342145438972),
 ('amnesty international skip main', 4, 1.0379342145438972),
 ('death row inmate', 3, 3.3824444067473713),
 ('death row inmates', 3, 2.2086166482165432),
 ('abolish death penalty', 3, 1.7496502727860928),
 ('years death row', 3, 1.3803295831273121),
 ('new york times', 3, 1.2243457105756672),
 ('death penalty cases', 3, 1.216477621946909),
 ('support death penalty', 3, 1.1884812437281662),
 ('death row prisoners', 3, 1.0963549766828462),
 ('use death penalty', 3, 1.038

There are some differences between the output of TopicDetector and TopicDetector1. However, these are due to one simple difference: the original TopicDetector1 takes tfidf values above 0.95, while TopicDetector takes the top 5% of terms. I think the latter approach is more accurate and leads to more interesting results for the death penalty corpus.
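The two selection rules can be contrasted with a small sketch. This is one possible reading, not the detectors' actual code: the notebook does not show how either class applies its cutoff, so the function names, the handling of ties and the toy scores are all assumptions.

```python
def select_above_threshold(scores, threshold=0.95):
    """Keep terms whose summed score exceeds a fixed threshold."""
    return {term: score for term, score in scores.items() if score > threshold}


def select_top_fraction(scores, fraction=0.05):
    """Keep the highest-scoring fraction of terms (always at least one term)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return dict(ranked[:keep])


# Toy scores for 20 hypothetical terms, invented for this example.
toy_scores = {"term{}".format(i): 0.1 * i for i in range(20)}

print(len(select_above_threshold(toy_scores)))
print(list(select_top_fraction(toy_scores)))
```

A fixed threshold lets the number of selected terms grow with the size of the corpus, because summed scores grow too. A fraction keeps the selection proportional to the vocabulary, which may explain why the two detectors return different lists.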

Word similarity
------------------

For each of the 50 most important words, the cell below prints the five most similar and the five most different words in the corpus vocabulary, according to spaCy's large English word vectors.
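spaCy's `similarity` is by default the cosine similarity between two word vectors. The sketch below spells that measure out with the standard library. The two-dimensional vectors are made up for illustration; the real `en_core_web_lg` vectors have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a word without a vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example.
toy_vectors = {
    "court": [1.0, 0.2],
    "judge": [0.9, 0.3],
    "banana": [-0.1, 1.0],
}

target = toy_vectors["court"]
ranked = sorted(
    ((word, cosine_similarity(target, vector))
     for word, vector in toy_vectors.items() if word != "court"),
    key=lambda item: item[1],
    reverse=True,
)
for word, similarity in ranked:
    print("{0} {1:.2f}".format(word, similarity))
```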

In [None]:
all_words = tfidf_words_sorted.index.tolist()
all_tokens = list(map(lambda word: nlp.vocab[word], all_words))
most_important_tokens = all_tokens[:50]
for important_token in most_important_tokens:
    similarities = []
    for token in all_tokens:
        if token is important_token:
            continue
        similarities.append((token.text, important_token.similarity(token),))
    similarities = sorted(similarities, key=lambda item: item[1])
    print(important_token.text)
    print("*" * 10 + "most similar" + "*" * 10)
    most_similar = similarities[-5:]
    most_similar.reverse()
    print("\n".join(
        "{0} {1:.2f}".format(word, similarity)
        for word, similarity in most_similar
    ))
    print("*" * 10 + "most different" + "*" * 10)
    print("\n".join(
        "{0} {1:.2f}".format(word, similarity)
        for word, similarity in similarities[:5]
    ))
    print()
    print()

Co-occurrence
-----------------

Below are some early experiments with co-occurrence of words.
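The co-occurrence matrix in the next cell is the product of the transposed frame with itself. As a rough illustration, the sketch below writes that product out in plain Python for a tiny frame; the function name `cooccurrence`, the terms and the weights are invented for the example.

```python
def cooccurrence(frame_rows):
    """Return the terms x terms product of a texts x terms weight matrix.

    Entry [i][j] sums, over all texts, the weight of term i times the
    weight of term j, which is what frame.T.dot(frame) computes.
    """
    n_terms = len(frame_rows[0])
    return [
        [sum(row[i] * row[j] for row in frame_rows) for j in range(n_terms)]
        for i in range(n_terms)
    ]


# Toy frame: three texts (rows) and three terms (columns).
toy_terms = ["death", "penalty", "reward"]
toy_frame = [
    [1.0, 1.0, 0.0],  # text 0 mentions "death" and "penalty"
    [1.0, 0.0, 0.0],  # text 1 mentions only "death"
    [0.0, 0.0, 1.0],  # text 2 mentions only "reward"
]

matrix = cooccurrence(toy_frame)
for term, row in zip(toy_terms, matrix):
    print(term, row)
```

Because TF-IDF weights are never negative, an entry of zero means the two words never appear in the same text. The graph export at the end of this notebook relies on exactly those zero entries.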

In [None]:
all_words = tfidf_words_sorted.index.tolist()
least_important_words = all_words[500:]
most_important_frame = tfidf_frame.drop(labels=least_important_words, axis=1)

most_important_cooccurence = most_important_frame.T.dot(most_important_frame)
#np.fill_diagonal(most_important_cooccurence.values, 0)
#most_important_cooccurence = most_important_cooccurence.applymap(lambda v: v if v >= 0.3 else 0.0)

most_important_cooccurence

In [None]:
mic_sum = most_important_cooccurence.sum(axis=0).sort_values()
mic_sum

In [None]:
life_text_ixs = tfidf_frame["life"].argsort()[::-1]
justice_text_ixs = tfidf_frame["justice"].argsort()[::-1]
court_text_ixs = tfidf_frame["court"].argsort()[::-1]
law_text_ixs = tfidf_frame["law"].argsort()[::-1]
reward_text_ids = tfidf_frame["reward"].argsort()[::-1]

In [None]:
life_text_ixs = tfidf_frame["life"].argsort()
justice_text_ixs = tfidf_frame["justice"].argsort()
court_text_ixs = tfidf_frame["court"].argsort()
law_text_ixs = tfidf_frame["law"].argsort()
reward_text_ids = tfidf_frame["reward"].argsort()

In [None]:
texts[life_text_ixs.iloc[0]]

In [None]:
texts[justice_text_ixs.iloc[2]]

In [None]:
texts[court_text_ixs.iloc[0]]

In [None]:
texts[law_text_ixs.iloc[2]]

In [None]:
texts[reward_text_ids.iloc[5]]

In [None]:
mins = most_important_cooccurence.min()
nzeros = mins[mins > 0]
frame = most_important_cooccurence.drop(labels=nzeros.index, axis=0)
frame = frame.drop(labels=nzeros.index, axis=1)
frame.shape

In [None]:
nodes = [{"name": column, "group": 0} for column in frame.columns]
node_names = [node["name"] for node in nodes]
links = [
    {"source": node_names.index(column), "target": node_names.index(key), "value": 1}
    for column, row in frame.items()  # items() replaces iteritems(), which pandas 2.0 removed
    for key, value in row.items()
    if value == 0  # link words that never occur in the same text
]
with open("../data/cooccurence-graph.json", "w") as file:
    json.dump({
        "nodes": nodes,
        "links": links
    }, file, indent=4)