This is the first in a series of posts on extracting word representations using statistical language modeling techniques. This first installment covers rudimentary corpus preprocessing, tokenization, vectorization, and inference within the vector space model. The corpus is a public domain dataset of more than a million news headlines published by the Australian Broadcasting Corporation between 2003 and 2021.

All code blocks for this part of the project are included in this document. The first block includes the imports used in this part of the project.

https://github.com/Using-Namespace-System/Syntagmatic-And-Paradigmatic-Word-Associations.git

The whole series can be cloned from the link above into a dev container; the configs include the necessary dependencies.

In [None]:
from itertools import zip_longest
from matplotlib.pyplot import figure
import nltk
from nltk.corpus import stopwords
import numpy as np
import pandas as pd
from scipy.sparse import csr_array
from scipy.sparse import find
from pickleshare import PickleShareDB

df = pd.read_csv('../input/abcnews-date-text.csv')
nltk.download('stopwords')
stopwords_set = set(stopwords.words('english'))


Preprocessing is limited to filtering out short headlines, short words, and stop-words. Each step is done in pandas, which I believe improves readability. The documents are exploded into a single series representing the whole corpus, from which short words and stop-words are then filtered out. No further sanitization is performed.

In [37]:
#tokenize and sanitize

#tokenize documents into individual words
df['tokenized'] = df.headline_text.str.split(' ')

#remove short documents from corpus
df['length'] = df.tokenized.map(len)
df = df.loc[df.length > 1]

#use random subset of corpus
#df=df.sample(frac=0.0016).reset_index()

df = df.reset_index()

#flatten all words into single series
ex = df.explode('tokenized')

#remove shorter words
ex = ex.loc[ex.tokenized.str.len() > 2]

#remove stop-words
ex = ex.loc[~ex.tokenized.isin(stopwords_set)]
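
The filtering steps above can be traced on a toy corpus. The following is a minimal sketch, assuming three made-up headlines and a hand-written stop-word set in place of the NLTK list; `toy`, `toy_stopwords`, and `toy_ex` are illustrative names that do not appear in the project.

```python
import pandas as pd

# hypothetical three-headline corpus, not taken from the dataset
toy = pd.DataFrame({'headline_text': ['police probe the fire in sydney',
                                      'rain',
                                      'council backs new water plan']})

# small stand-in for the NLTK English stop-word list
toy_stopwords = {'the', 'in'}

# split on spaces, then drop single-word headlines
toy['tokenized'] = toy.headline_text.str.split(' ')
toy = toy.loc[toy.tokenized.map(len) > 1].reset_index(drop=True)

# one row per word; the index still points back to the headline
toy_ex = toy.explode('tokenized')

# drop words of one or two characters, then stop-words
toy_ex = toy_ex.loc[toy_ex.tokenized.str.len() > 2]
toy_ex = toy_ex.loc[~toy_ex.tokenized.isin(toy_stopwords)]

# regroup by headline to inspect what survived
print(toy_ex.tokenized.groupby(level=0).agg(list))
```

Because `explode` keeps the original row index, grouping by that index later is enough to reassemble each headline from its surviving words.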

Tokenization of the corpus is performed by creating forward and backward lookup dictionaries. Each unique word is mapped to a unique integer. This is a very simple method of tokenization.

In [38]:
#create dictionary of words

#shuffle for sparse matrix visual
dictionary = ex.tokenized.drop_duplicates().sample(frac=1)

#dataframe with (index/code):word
dictionary = pd.Series(dictionary.tolist(), name='words').to_frame()

#store code:word dictionary for reverse encoding
dictionary_lookup = dictionary.to_dict()['words']

#offset index to prevent clash with zero fill
dictionary['encode'] = dictionary.index + 1

#store word:code dictionary for encoding
dictionary = dictionary.set_index('words').to_dict()['encode']

#use dictionary to encode each word to integer representation
encode = ex.tokenized.map(dictionary.get).to_frame()
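#cast index and codes to int (astype returns a copy, so the result is assigned)
encode.index = encode.index.astype('int')
encode['tokenized'] = encode.tokenized.astype('int')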
#un-flatten encoded words back into original documents
docs = encode.tokenized.groupby(level=0).agg(tuple)

#match up document indexes for reverse lookup
df = df.sort_index().iloc[docs.index].reset_index()
docs = docs.reset_index()['tokenized']
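
The two lookup directions and the `+ 1` offset can be seen on a tiny vocabulary. This is a minimal sketch in plain Python, assuming a made-up word list; `word_to_code` and `code_to_word` are illustrative names that play the roles of `dictionary` and `dictionary_lookup`. As in the code above, the forward codes start at 1 while the reverse lookup is keyed from 0.

```python
# hypothetical filtered tokens for two documents
toy_docs = [['police', 'probe', 'fire'], ['fire', 'crews', 'police']]

# unique words in first-seen order (the post shuffles them instead)
vocab = list(dict.fromkeys(w for doc in toy_docs for w in doc))

# reverse lookup: 0-based code -> word
code_to_word = dict(enumerate(vocab))

# forward lookup: word -> 1-based code, leaving 0 free for padding
word_to_code = {w: i + 1 for i, w in enumerate(vocab)}

# encode every document as a tuple of integer codes
encoded = [tuple(word_to_code[w] for w in doc) for doc in toy_docs]
print(encoded)  # [(1, 2, 3), (3, 4, 1)]

# decoding subtracts the offset again
decoded = [[code_to_word[c - 1] for c in doc] for doc in encoded]
```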



In its simplest form, the word vector for each term is a binary (one-hot style) encoding of the documents in which the term is present (1) or absent (0). Likewise, the transpose is composed of document vectors, each a binary encoding of the corpus terms that are or are not present in that document.

In this instance the word vector is a count vector. This is similar to one-hot but is able to convey how many times the term occurred in the document.

For the news headline dataset, document-wise term repetition is minimal and the statistical weight it provides is negligible.
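
The difference between the binary and the count representation is easy to see on a small term-document matrix. This is a minimal sketch, assuming a hand-built matrix of three terms by two documents; the terms and counts are hypothetical and not taken from the corpus.

```python
import numpy as np
from scipy.sparse import csr_array

# rows are terms, columns are documents; values are made-up counts
terms = ['police', 'fire', 'water']
counts = csr_array(np.array([[2, 0],
                             [1, 1],
                             [0, 3]]), dtype=float)

# binary presence/absence: any non-zero count becomes 1
binary = (counts > 0).astype(float)

# dense copies are fine at this size and make the rows easy to print
dense_counts = counts.toarray()
dense_binary = binary.toarray()

# word vector for 'police' under both encodings
print(dense_counts[0])  # how often it occurs in each document
print(dense_binary[0])  # whether it occurs in each document
```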

In [None]:

#zero pad x dimension by longest sentence
encoded_docs = list(zip(*zip_longest(*docs.to_list(), fillvalue=0)))

#convert to sparse matrix
encoded_docs = csr_array(encoded_docs, dtype=int)

#convert to index for each word
row_column_code = find(encoded_docs)

#presort by words
word_sorted_index = row_column_code[2].argsort()
doc_word = np.array([row_column_code[0][word_sorted_index], row_column_code[2][word_sorted_index]])

#presort by docs and words
doc_word_sorted_index = doc_word[0].argsort()
doc_word = pd.DataFrame(np.array([doc_word[0][doc_word_sorted_index], doc_word[1][doc_word_sorted_index]]).T, columns=['doc','word'])

#offset code no longer needed after zero-fill
doc_word.word = doc_word.word - 1

#convert to index of word counts per document
doc_word_count  = doc_word.groupby(['doc','word']).size().to_frame('count').reset_index().to_numpy().T

#convert to sparse matrix
sparse_word_doc_matrix = csr_array((doc_word_count[2],(doc_word_count[0],doc_word_count[1])), shape=(np.size(encoded_docs, 0),len(dictionary)), dtype=float).T

#visualize sparse matrix
fig = figure(figsize=(12,12))
sparse_word_doc_matrix_visualization = fig.add_subplot(1,1,1)
sparse_word_doc_matrix_visualization.spy(sparse_word_doc_matrix, markersize=0.007, aspect = 'auto')

%store sparse_word_doc_matrix
%store dictionary
%store dictionary_lookup
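
The padding and indexing steps above can be followed on two short encoded documents. This is a minimal sketch, assuming made-up integer codes. It shows that `zip_longest` pads every document to the longest length with zeros, and that `find` then returns only the non-zero entries as (row, column, value) triples, which is why the codes were offset by one earlier.

```python
from itertools import zip_longest
from scipy.sparse import csr_array, find

# hypothetical encoded documents of different lengths (codes start at 1)
toy_docs = [(1, 2, 3), (3, 1)]

# transpose, pad with zeros, transpose back
padded = list(zip(*zip_longest(*toy_docs, fillvalue=0)))
print(padded)  # [(1, 2, 3), (3, 1, 0)]

# the padding zeros are not stored in the sparse matrix
rows, cols, codes = find(csr_array(padded, dtype=int))

# (document, word code) pairs with the offset removed
pairs = sorted(zip(rows.tolist(), (codes - 1).tolist()))
print(pairs)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 2)]
```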

The visualization below shows the words (y-axis) against the documents (x-axis) they occur in. Across roughly 1,200,000 documents, the terms that recur most often show up as horizontal lines.

![Sparse Word Doc Matrix](sparse_word_doc_matrix.png)

Words that often occur together in the corpus are also closer together, as word vectors, in this 1,200,000-dimensional vector space. This is demonstrated in the table below.

In [41]:
#approximating cosine similarity with dot product of the term document matrix and its transpose

similarity_matrix  = sparse_word_doc_matrix @ sparse_word_doc_matrix.T

#displaying slice of matrix with highest similarity scores

similarity_matrix_compressed = similarity_matrix[(-similarity_matrix.sum(axis = 1)).argsort()[:20]].toarray()

#map word codes back to words with the reverse lookup dictionary
pd.DataFrame((-similarity_matrix_compressed).argsort(axis = 1)[:20,:10].T).applymap(dictionary_lookup.get)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | police | new | man | says | court | nsw | australia | govt | council | fire | australian | qld | sydney | plan | death | water | health | crash | back | coast |
| 1 | man | zealand | charged | govt | man | police | day | urged | plan | house | open | north | man | council | police | plan | mental | fatal | hits | gold |
| 2 | investigate | laws | police | minister | accused | rural | south | nsw | new | crews | market | govt | police | water | man | restrictions | service | car | bounce | north |
| 3 | probe | police | court | trump | face | govt | coronavirus | qld | considers | police | dollar | rural | hobart | govt | toll | council | minister | plane | fight | sunshine |
| 4 | missing | cases | murder | new | told | country | new | vic | water | man | south | central | western | new | inquest | supply | new | dies | track | nsw |
| 5 | search | australia | jailed | australia | faces | hour | live | says | land | threat | year | police | charged | basin | charged | govt | services | killed | plan | south |
| 6 | car | york | dies | labor | murder | coast | test | plan | seeks | govt | share | health | nsw | backs | court | murray | qld | police | court | west |
| 7 | death | council | missing | union | high | new | india | local | says | destroys | new | new | morning | murray | rises | new | indigenous | man | urged | police |
| 8 | officer | year | accused | could | front | coronavirus | world | new | plans | nsw | china | government | briefing | group | probe | use | funding | driver | get | mid |
| 9 | hunt | nsw | guilty | government | hears | government | cup | fed | city | school | first | south | airport | says | woman | irrigators | says | road | says | man |
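
The raw dot product used above grows with word frequency, so frequent words tend to score highly against many other words. Dividing by the Euclidean norm of both word vectors gives the cosine similarity itself. This is a minimal sketch, assuming a small hand-built term-document count matrix `X` with hypothetical values in place of `sparse_word_doc_matrix`.

```python
import numpy as np
from scipy.sparse import csr_array

# rows are terms, columns are documents; counts are made up
X = csr_array(np.array([[1, 1, 0, 0],
                        [1, 1, 0, 0],
                        [1, 0, 1, 1]]), dtype=float)

# unnormalized similarity: term-document matrix times its transpose
dot = (X @ X.T).toarray()

# the diagonal holds each term's squared Euclidean norm
norms = np.sqrt(dot.diagonal())

# cosine similarity: dot product divided by both norms
cosine = dot / np.outer(norms, norms)
print(np.round(cosine, 3))
```

The dense division is only practical at toy size. On the full matrix, the rows would need to be scaled while the matrix is still sparse.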


In [None]:
#previewing document similarity

doc_similarity_matrix  = sparse_word_doc_matrix.T @ sparse_word_doc_matrix

#displaying slice of matrix with highest similarity scores

doc_similarity_matrix_compressed = doc_similarity_matrix[(-doc_similarity_matrix.sum(axis = 1)).argsort()[:6]].toarray()

#map document indexes back to headlines
pd.DataFrame((-doc_similarity_matrix_compressed).argsort(axis = 1)[:6,:6].T).applymap(df.headline_text.to_dict().get)
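
The full document-by-document product has one row and one column per headline, so it can be cheaper to score a single document against all the others. This is a minimal sketch, assuming a small hand-built term-document matrix `X` with hypothetical counts; `query`, `scores`, and `ranked` are illustrative names, and this is an alternative to the cell above rather than the method the post uses.

```python
import numpy as np
from scipy.sparse import csr_array

# hypothetical term-document counts: 3 terms (rows), 4 documents (columns)
X = csr_array(np.array([[1, 1, 0, 0],
                        [1, 0, 0, 1],
                        [0, 0, 1, 1]]), dtype=float)

query = 0  # column index of the document to compare against

# column vector of the query document, shape (terms, 1)
q = X[:, [query]]

# dot product of every document with the query document
scores = (X.T @ q).toarray().ravel()

# most similar documents first, skipping the query itself
ranked = [int(d) for d in np.argsort(-scores, kind='stable') if d != query]
print(ranked)  # [1, 3, 2]
```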