# Vectorise Corpus
Felix Zaussinger | 08.01.2021

## Core Analysis Goal(s)
1. Explore textacy vectorisation functions
2. Transform corpus to sparse matrix representations

## Key Insight(s)
1. Creating the sparse matrix from the corpus takes a long time (V1 ~ 30 min, > V2 ~ 7 min)
2. Compared to the full corpus, sparse matrices are really small memory-wise
3. There is a lot of subtlety in choosing parameters, particularly weighting schemes. I tried to follow a standard implementation of a TF-IDF weighting scheme.

## Sources
- https://textacy.readthedocs.io/en/stable/api_reference/vsm_and_tm.html#textacy.vsm.vectorizers.Vectorizer

In [1]:
# magic commands
%load_ext autoreload
%autoreload 2

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# imports
import os
import re
import sys
import glob
import pickle
import textacy
import logging
import numpy as np
import textacy.vsm
import pandas as pd
from tqdm import tqdm
import en_core_web_lg
import seaborn as sns
from pathlib import Path
import matplotlib.pyplot as plt
from textacy import preprocessing
from dotenv import find_dotenv, load_dotenv

# module settings
sns.set_context("poster")
sns.set(rc={'figure.figsize': (16, 9.)})
sns.set_style("ticks")

# logging
logging.basicConfig(level=logging.INFO, stream=sys.stdout)

#### Define directory structure

In [2]:
# project directory
abspath = os.path.abspath('')
project_dir = str(Path(abspath).parents[0])

# sub-directories
data_raw = os.path.join(project_dir, "data", "raw")
data_interim = os.path.join(project_dir, "data", "interim")
data_processed = os.path.join(project_dir, "data", "processed")
model_dir = os.path.join(project_dir, "models")
figure_dir = os.path.join(project_dir, "reports", "figures")

#### Load and configure spacy nlp model

In [3]:
%%time
nlp = en_core_web_lg.load()
nlp.max_length = 30000000

CPU times: user 4.12 s, sys: 800 ms, total: 4.92 s
Wall time: 5.43 s


#### Define version/iteration ID

In [4]:
version = "V6"

#### Load textacy corpus that stores the pre-processed BBC Monitoring data and its metadata

In [5]:
%%time
fname_corpus = "BBC_2007_07_04_CORPUS_TEXTACY_{}.bin.gz".format(version)
corpus = textacy.Corpus.load(
    lang=nlp,
    filepath=os.path.join(data_processed, fname_corpus),
    store_user_data=True
)

CPU times: user 7min 25s, sys: 5.48 s, total: 7min 31s
Wall time: 7min 31s


In [6]:
print(corpus)

Corpus(1691 docs, 42797573 tokens)


### Vectorisation (textacy.vsm.vectorizers)

Two key options:
1. Vectorizer: Transform a collection of tokenized documents into a **document-term matrix of shape (# docs, # unique terms)**, with various ways to filter or limit included terms and flexible weighting schemes for their values.
2. GroupVectorizer: Transform a collection of tokenized documents into a **group-term matrix of shape (# unique groups, # unique terms)**, with various ways to filter or limit included terms and flexible weighting schemes for their values.

Further info:
- *doc.to_terms_list* (generator function!): Transform Doc into a sequence of ngrams and/or entities — not necessarily in order of appearance — where each appears in the sequence as many times as it appears in Doc.
- *textacy.vsm.vectorizers.GroupVectorizer*: Transform one or more tokenized documents into a group-term matrix of shape (# groups, # unique terms), with tf-, tf-idf, or binary-weighted values. This is an extension of typical document-term matrix vectorization, where terms are grouped by the documents in which they co-occur. It allows for customized grouping, such as by a shared author or publication year, that may span multiple documents, without forcing users to merge those documents themselves.
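To make the difference between a document-term and a group-term matrix concrete, here is a minimal pure-Python sketch of the grouping idea. It is not textacy's implementation, and the toy documents, basin labels and variable names are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: three tokenized documents and the group (basin) of each.
tokenized_docs = [
    ["water", "dam", "water"],
    ["dam", "treaty"],
    ["flood", "water"],
]
groups = ["Nile", "Nile", "Danube"]

# Sum term counts over all documents that share a group label.
counts_by_group = defaultdict(Counter)
for terms, grp in zip(tokenized_docs, groups):
    counts_by_group[grp].update(terms)

# Sort rows (groups) and columns (terms) alphabetically.
grps_list = sorted(counts_by_group)
terms_list = sorted({term for terms in tokenized_docs for term in terms})

# Group-term count matrix as a list of rows: shape (# groups, # unique terms).
grp_term_counts = [
    [counts_by_group[grp][term] for term in terms_list] for grp in grps_list
]

print(grps_list)
print(terms_list)
print(grp_term_counts)
```

The two "Nile" documents collapse into a single row, so the matrix has two rows instead of three. That is the only structural difference to a document-term matrix.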

**Tokenize and vectorize the documents of this corpus**

In [7]:
# returns a generator
tokenized_docs, basin_group, year_group = textacy.io.unzip(
    (doc._.to_terms_list(
        ngrams=(1, 2),
        entities=False,
        normalize="lemma",
        as_strings=True,
        filter_stops=True,
        filter_nums=True,
        include_pos={'ADJ', 'NOUN', 'VERB'},  # also test: {'ADJ', 'NOUN', 'VERB', 'PROPN'}
        min_freq=2,
    )
     , doc._.meta["basin"]
     , doc._.meta["year"]) for doc in corpus
)
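The "returns a generator" comment matters in practice: each of the three unzipped streams can only be consumed once. The sketch below mimics the unzip pattern with `itertools.tee`. I assume `textacy.io.unzip` behaves similarly, but the helper here is my own stand-in, not the library code.

```python
import itertools


def _pluck(position, records):
    """Lazily yield one tuple position from every record."""
    for record in records:
        yield record[position]


def unzip(records, n_fields):
    """Split an iterable of n-tuples into n independent, one-shot iterators."""
    copies = itertools.tee(records, n_fields)
    return tuple(_pluck(i, copy) for i, copy in enumerate(copies))


# Hypothetical stand-in for (terms, basin) pairs produced per document.
records = (("doc{}".format(i), "basin{}".format(i % 2)) for i in range(4))
docs, basins = unzip(records, 2)

first_pass = list(docs)   # consumes the iterator
second_pass = list(docs)  # nothing left to iterate over
basin_labels = list(basins)

print(first_pass)
print(second_pass)
print(basin_labels)
```

If the streams are needed more than once, either re-create them or materialise them as lists, memory permitting.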

**Filter out terms that are too common and/or too rare (by document frequency), and compactify the top max_n_terms in the id_to_term mapping accordingly.**

- min_df (float or int): If float, value is the fractional proportion of the total number of documents, which must be in [0.0, 1.0]. If int, value is the absolute number. Filter terms whose document frequency is less than ``min_df``.

- max_df (float or int): If float, value is the fractional proportion of the total number of documents, which must be in [0.0, 1.0]. If int, value is the absolute number. Filter terms whose document frequency is greater than ``max_df``.

- max_n_terms (int): Only include terms whose document frequency is within the top ``max_n_terms``.
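The effect of `min_df` and `max_df` can be sketched in a few lines of plain Python. This is an illustration of the filtering rule as described above, not textacy's code. The function name `filter_by_doc_freq` and the toy documents are hypothetical.

```python
from collections import Counter


def filter_by_doc_freq(tokenized_docs, min_df=1, max_df=1.0):
    """Return the sorted terms whose document frequency lies in [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing the term at least once.
    doc_freqs = Counter(term for doc in tokenized_docs for term in set(doc))
    # Floats are fractions of the number of documents, ints are absolute counts.
    lower = min_df * n_docs if isinstance(min_df, float) else min_df
    upper = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(t for t, df in doc_freqs.items() if lower <= df <= upper)


docs = [
    ["water", "dam", "water"],
    ["water", "treaty"],
    ["water", "dam", "flood"],
    ["water", "flood"],
]

# Same thresholds as used in this notebook, applied to the toy documents.
kept = filter_by_doc_freq(docs, min_df=0.3, max_df=0.95)
print(kept)
```

With four documents, "treaty" (1 of 4) falls below the 30% threshold and "water" (4 of 4) exceeds the 95% threshold, so only "dam" and "flood" survive.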

**Weighting schemes**


-    “tf”: Weights are simply the absolute per-document term frequencies (tfs), i.e. value (i, j) in an output doc-term matrix corresponds to the number of occurrences of term j in doc i. Terms appearing many times in a given doc receive higher weights than less common terms. Params: tf_type="linear", apply_idf=False, apply_dl=False

-    “**tfidf**”: Doc-specific, local tfs are multiplied by their corpus-wide, global inverse document frequencies (idfs). Terms appearing in many docs have higher document frequencies (dfs), correspondingly smaller idfs, and in turn, lower weights. **Params: tf_type="linear", apply_idf=True, idf_type="smooth", apply_dl=False**

-    “bm25”: This scheme includes a local tf component that increases asymptotically, so higher tfs have diminishing effects on the overall weight; a global idf component that can go negative for terms that appear in a sufficiently high proportion of docs; as well as a row-wise normalization that accounts for document length, such that terms in shorter docs hit the tf asymptote sooner than those in longer docs. Params: tf_type="bm25", apply_idf=True, idf_type="bm25", apply_dl=True

-    “binary”: This weighting scheme simply replaces all non-zero tfs with 1, indicating the presence or absence of a term in a particular doc. That’s it. Params: tf_type="binary", apply_idf=False, apply_dl=False

Slightly altered versions of these “standard” weighting schemes are common, and may have better behavior in general use cases:

-    “lucene-style tfidf”: Adds a doc-length normalization to the usual local and global components. Params: tf_type="linear", apply_idf=True, idf_type="smooth", apply_dl=True, dl_type="sqrt"

-    “lucene-style bm25”: Uses a smoothed idf instead of the classic bm25 variant to prevent weights on terms from going negative. Params: tf_type="bm25", apply_idf=True, idf_type="smooth", apply_dl=True, dl_type="linear"
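The three `idf_type` options differ only in the global idf formula. The sketch below encodes the formulas as I read them in the textacy documentation, so treat them as an assumption and check them against the installed version. The function name `idf` and the example numbers are illustrative.

```python
import math


def idf(df, n_docs, idf_type="standard"):
    """Inverse document frequency for a term that appears in `df` of `n_docs` docs."""
    if idf_type == "standard":
        return math.log(n_docs / df) + 1.0
    if idf_type == "smooth":
        # Adds 1 to both counts, as if one extra doc contained every term.
        return math.log((n_docs + 1) / (df + 1)) + 1.0
    if idf_type == "bm25":
        # Goes negative once a term appears in more than half of the docs.
        return math.log((n_docs - df + 0.5) / (df + 0.5))
    raise ValueError("unknown idf_type: {}".format(idf_type))


n_docs = 105  # illustrative: the number of rows in the matrix of this notebook
for df in (1, 50, 100):
    row = [round(idf(df, n_docs, kind), 3) for kind in ("standard", "smooth", "bm25")]
    print(df, row)

# A linear tf-idf weight is the raw term frequency times the idf.
tf = 3
tfidf_weight = tf * idf(10, n_docs, "smooth")
print(tfidf_weight)
```

Note that the "tfidf" scheme described above uses `idf_type="smooth"`, whereas `idf_type="standard"` omits the add-one smoothing.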


#### Create and fit vectorizer

In [8]:
vectorizer = textacy.vsm.GroupVectorizer(
    tf_type="linear",  # {"linear", "sqrt", "log", "binary"}
    apply_idf=True,
    idf_type="standard",  # {"standard", "smooth", "bm25"}
    apply_dl=False,
    dl_type="linear",  # {"linear", "sqrt", "log"}
    norm="l2",  # {"l1", "l2"} or None
    min_df=0.3,  # Filter terms whose document frequency is less than ``min_df``
    max_df=0.95,  # Filter terms whose document frequency is greater than ``max_df``,
    max_n_terms=None,
    vocabulary_terms=None, 
    vocabulary_grps=None
)

- **dim(matrix)** = unique groups x unique terms = 105 x 6819
- **N = 309338** stored elements

In [9]:
%%time
grp_term_matrix = vectorizer.fit_transform(tokenized_docs, basin_group)

CPU times: user 6min 24s, sys: 1.27 s, total: 6min 26s
Wall time: 6min 27s


In [10]:
type(grp_term_matrix)

scipy.sparse.csr.csr_matrix
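A quick back-of-the-envelope check of the matrix size, using only the dimensions reported above (105 x 6819, 309338 stored elements). The byte sizes assume float64 values and int32 indices, which I have not verified for this matrix.

```python
n_groups, n_terms = 105, 6819
n_stored = 309338

# Fraction of cells that hold an explicitly stored value.
density = n_stored / (n_groups * n_terms)

# CSR stores one value and one column index per stored element,
# plus one row pointer per row (+1). Assumed: 8-byte values, 4-byte indices.
csr_bytes = n_stored * 8 + n_stored * 4 + (n_groups + 1) * 4

# A dense array stores every cell. Assumed: 8-byte values.
dense_bytes = n_groups * n_terms * 8

print("density: {:.3f}".format(density))
print("CSR: {:.2f} MB, dense: {:.2f} MB".format(csr_bytes / 1e6, dense_bytes / 1e6))
```

With `min_df=0.3` the matrix is fairly dense, so the sparse format saves only a modest amount over a dense array under these assumptions. Either way it is only a few megabytes.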

In [11]:
# save group-term matrix to disk (former "step 1" was based on a ~60% sample split)
textacy.io.matrix.write_sparse_matrix(
    data=grp_term_matrix,
    filepath=os.path.join(data_processed, "BBC_2007_07_04_CORPUS_TEXTACY_{}_GROUPTERMMATRIX_STEP1".format(version)),
    compressed=True  # writes to single .npz file (numpy binary format)
)

#### Vectorize the remaining documents of the corpus, using only the groups, terms, and weights learned in the previous step

In [12]:
# (former "step 2": applied the trained model that used sample 1 to sample 2)
#tokenized_docs, basin_group, year_group = textacy.io.unzip(
#    (doc._.to_terms_list(ngrams=(1, 2, 3), entities=True, as_strings=True), doc._.meta["basin"], doc._.meta["year"]) for doc in corpus[n_split:]
#)

In [13]:
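# NOTE: fit_transform above exhausted these one-shot iterators; re-create them (re-run In [7]) before calling transform
grp_term_matrix_fitted = vectorizer.transform(tokenized_docs, basin_group)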

In [14]:
# save group-term matrix (step 2) to disk
textacy.io.matrix.write_sparse_matrix(
    data=grp_term_matrix_fitted,
    filepath=os.path.join(data_processed, "BBC_2007_07_04_CORPUS_TEXTACY_{}_GROUPTERMMATRIX_STEP2".format(version)),
    compressed=True  # writes to single .npz file (numpy binary format)
)

#### Save trained vectorizer object

In [15]:
with open(os.path.join(model_dir, 'BBC_2007_07_04_CORPUS_TEXTACY_{}_VECTORIZER.pkl'.format(version)), "wb") as f:
    pickle.dump(vectorizer, f)  # context manager closes the file handle after writing
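For later sessions, the saved object can be restored with `pickle.load`. The sketch below shows the round trip with a plain dictionary standing in for the vectorizer. The file name and contents are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted vectorizer: any picklable Python object works the same way.
stand_in = {
    "vocabulary_terms": {"dam": 0, "water": 1},
    "vocabulary_grps": {"Danube": 0},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "VECTORIZER.pkl")

    # Write the object to disk.
    with open(path, "wb") as f:
        pickle.dump(stand_in, f)

    # Read it back in binary mode.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in)
```

Unpickling the real vectorizer requires textacy to be importable in the loading environment, and pickle files should only be loaded from trusted sources.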

#### Inspect the terms associated with columns and the groups associated with rows (both are sorted alphabetically)

**(6819,)**

In [16]:
vectorizer.vocabulary_terms  # mapping of term string -> column index
vectorizer.terms_list  # term strings, ordered by column index
textacy.vsm.matrix_utils.get_term_freqs(grp_term_matrix)
textacy.vsm.matrix_utils.get_doc_freqs(grp_term_matrix)
textacy.vsm.matrix_utils.get_inverse_doc_freqs(grp_term_matrix)
textacy.vsm.matrix_utils.get_information_content(grp_term_matrix)

array([0.87988131, 0.9991421 , 0.91441667, ..., 0.99871833, 0.97227946,
       0.99617391])

**(105,)**

In [17]:
vectorizer.vocabulary_grps  # mapping of group string -> row index
vectorizer.grps_list  # group strings, ordered by row index
# textacy.vsm.matrix_utils.get_doc_lengths(grp_term_matrix)

['Akpa',
 'Amazon',
 'Amazonas',
 'Amur',
 'Araks',
 'Aral Sea',
 'Artibonite',
 'Asi',
 'Astara',
 'Atrak',
 'Atrek',
 'Aviles',
 'Awash',
 'Ayeyarwady',
 'Aysen',
 'Baker',
 'Baraka',
 'Beilun',
 'Belize',
 'Benito',
 'Bia',
 'Bidasoa',
 'Black River',
 'Brahmaputra',
 'Buzi',
 'Ca',
 'Candelaria',
 'Catatumba',
 'Catatumbo',
 'Cavally',
 'Cestos',
 'Changuinola',
 'Chico',
 'Chiloango',
 'Chira',
 'Choluteca',
 'Chu',
 'Chui',
 'Chuy',
 'Coco',
 'Colorado',
 'Columbia',
 'Congo',
 'Corentyne',
 'Coruh',
 'Cross',
 'Cullen',
 'Cuvelai',
 'Danube',
 'Dara',
 'Dasht',
 'Dayan',
 'Dnepr',
 'Dnieper',
 'Dniester',
 'Dnipro',
 'Don',
 'Douro',
 'Dra',
 'Draa',
 'Drava',
 'Drim',
 'Drin',
 'Duero',
 'Ebro',
 'Elbe',
 'Ertis',
 'Etosha',
 'Euphrates',
 'Evros',
 'Fane',
 'Flurry',
 'Fly',
 'Foyle',
 'Gambia',
 'Ganges',
 'Garona',
 'Garonne',
 'Garun',
 'Gash',
 'Golok',
 'Grijalva',
 'Guadiana',
 'HIrmand',
 'Han',
 'Har Nuur',
 'Hari',
 'Harirud',
 'Helmand',
 'Hirmand',
 'Hondo',
 'Hsi',

#### Repeat the same procedure to create a year-term matrix

In [18]:
# TODO: implement
# yr_term_matrix = vectorizer.fit_transform(tokenized_docs, year_group)