# Topic Vectors

An experiment in building document topic vectors with TF-IDF and PCA.

## First step - get data

Before running this notebook, we'll need to grab some text. Running `scripts/wikipedia_grab.py` will pull down text from Wikipedia articles. Here we'll attempt to discriminate topics based on TF-IDF vectors squeezed down by PCA.

In [1]:
import json
from pathlib import Path

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

In [3]:
import pandas as pd

In [8]:
cd ..

/Users/christopherbare/Documents/personal/nlp-play


In [9]:
path = Path('./data/')

## Iterate documents

After scraping documents from Wikipedia, we'll have one subdirectory per topic. We use a generator to read the files in each subdirectory and yield the topic label along with the parsed JSON document.

In [10]:
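def iter_pages(path):
    """Yield (label, page) pairs, where label is the topic subdirectory name."""
    for d in path.iterdir():
        if d.is_dir():
            label = d.name
            # Each file in a topic directory holds one page as JSON
            for p in d.iterdir():
                with p.open() as f:
                    page = json.load(f)
                yield label, page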
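
A quick way to check the generator is to point it at a throwaway directory that mimics the layout it expects. This is only a sketch: the sample topics, file names and text below are made up for illustration, and the only fields assumed are the `title` and `extract` keys that later cells read.

```python
import json
import tempfile
from pathlib import Path


def iter_pages(path):
    """Yield (label, page) for every JSON file in each topic subdirectory."""
    for d in path.iterdir():
        if d.is_dir():
            label = d.name
            for p in d.iterdir():
                with p.open() as f:
                    page = json.load(f)
                yield label, page


# Hypothetical sample data: two topics with one or two pages each.
sample = {
    "writers": [{"title": "Jane Austen", "extract": "English novelist."}],
    "physics": [
        {"title": "Entropy", "extract": "A thermodynamic quantity."},
        {"title": "Photon", "extract": "A quantum of light."},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Write one JSON file per page under a directory named after its topic.
    for label, pages in sample.items():
        (root / label).mkdir()
        for i, page in enumerate(pages):
            (root / label / f"{i}.json").write_text(json.dumps(page))

    # Sort the results because iterdir() makes no promise about ordering.
    found = sorted((label, page["title"]) for label, page in iter_pages(root))

print(found)
```

Since `iterdir()` does not guarantee an order, the notebook relies on two separate passes over the same directory coming back in the same sequence, which is worth keeping in mind if labels and vectors ever look misaligned.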

In [15]:
xs = [(label, page['title']) for label, page in iter_pages(path)]
labels, titles = zip(*xs)

In [16]:
len(titles)

90

## Compute document vectors

How much meaning can be crammed into 2 dimensions? Not much, I guess, but you can plot it, so let's try.

In [72]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(page['extract'] for label, page in iter_pages(path))
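
To get a feel for what ends up in each row of `X`, here is a hand-rolled TF-IDF in plain Python. It is a rough sketch, not the library's implementation: it assumes the smoothed idf `ln((1 + n) / (1 + df)) + 1` and L2 row normalization that, as far as I know, match `TfidfVectorizer`'s defaults, and it tokenizes on whitespace, which is simpler than what scikit-learn does. The function name `tfidf_vectors` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) using smoothed idf and L2-normalized rows."""
    # Simplification: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        # Scale each document vector to unit length.
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row])
    return vocab, rows


docs = ["the cat sat", "the dog sat", "the quantum field"]  # toy corpus
vocab, rows = tfidf_vectors(docs)

for doc, row in zip(docs, rows):
    weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
    print(f"{doc!r}: {weights}")
```

A word like "the", which appears in every document, gets the smallest weight in each row, while words unique to one document get the largest. That down-weighting of common words is what gives PCA something topic-like to work with.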

In [73]:
pca = PCA(n_components=2)
# PCA needs a dense ndarray; todense() returns np.matrix, which newer scikit-learn rejects
Y = pca.fit_transform(X.toarray())
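
One way to put a number on "how much can be crammed into 2 dimensions" is to look at `explained_variance_ratio_` on the fitted PCA. The sketch below uses a tiny made-up corpus in place of the Wikipedia extracts, so the figures it prints say nothing about the real data; the same two lines at the end could be run against the notebook's own `pca` object.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the Wikipedia extracts (made up for illustration).
docs = [
    "novels and short stories by the author",
    "the author wrote novels and essays",
    "quantum mechanics describes particles and waves",
    "particles and waves in quantum field theory",
]

X = TfidfVectorizer().fit_transform(docs)
pca = PCA(n_components=2)
Y = pca.fit_transform(X.toarray())

# Fraction of the total variance captured by each component, and combined.
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the combined ratio is small, most of the variation between documents lives outside the plotted plane, and nearby points in the chart may not be nearby in the full TF-IDF space.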

In [74]:
for label, title, yi in zip(labels, titles, Y):
    print(f'{label:16} {title:25} {yi[0]:10.4f} {yi[1]:10.4f}')

writers          Charles Dickens              -0.1315    -0.0998
writers          Isaac Asimov                 -0.0826     0.0099
writers          Neal Stephenson              -0.0335     0.0454
writers          Jane Austen                  -0.0997     0.0442
writers          Philip K. Dick               -0.1304    -0.0454
writers          Jorge Luis Borges            -0.1283    -0.0235
writers          Douglas Adams                -0.0663    -0.0372
writers          Edgar Allan Poe               0.0574     0.1946
writers          James Joyce                  -0.1168    -0.0668
writers          Kurt Vonnegut                -0.0647    -0.0192
writers          William Gibson               -0.1263    -0.0449
writers          George Orwell                -0.1405    -0.1453
writers          Don DeLillo                  -0.0820     0.0097
writers          Dave Eggers                  -0.0935     0.0225
writers          F. Scott Fitzgerald          -0.0409    -0.0331
writers          Ursula K

In [75]:
df = pd.DataFrame(Y, index=titles, columns=[f'PC{i+1}' for i in range(Y.shape[1])])
df['label'] = labels

In [76]:
df.columns

Index(['PC1', 'PC2', 'label'], dtype='object')

In [77]:
df = df.reset_index().rename(columns={'index': 'title'})

In [78]:
import altair as alt

In [79]:
alt.Chart(df).mark_circle(opacity=0.5, size=120).encode(
    x='PC1',
    y='PC2',
    color=alt.Color('label', scale=alt.Scale(scheme='category10')),
    tooltip=['title', 'label', 'PC1', 'PC2']
).interactive()