# 3.3 Looking at the Lexical Vocabulary from the Perspective of the Literary Material

The goal of this notebook is to explore the connection between the literary corpus and individual lexical texts. To do so, we construct a full document-term matrix (DTM) of the literary vocabulary, including bigrams and trigrams, and see which lexical texts have a larger or smaller intersection with that vocabulary.
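
As a rough illustration of what "intersection" means here, the sketch below compares a toy literary vocabulary with two toy lexical texts. All names and data (`toy_lit_vocab`, `toy_lexical`, `lex_A`, `lex_B`) are hypothetical, and measuring overlap as the share of a lexical text's entries found in the literary vocabulary is only one possible choice, not necessarily the measure used later.

```python
# Hypothetical toy data in lemma[guideword]pos format; MWEs joined by underscores
toy_lit_vocab = {"lugal[king]n", "e[house]n", "lugal[king]n_e[house]n", "ud[sun]n"}

toy_lexical = {
    "lex_A": {"lugal[king]n", "e[house]n", "gud[ox]n"},
    "lex_B": {"ud[sun]n", "kug[metal]n", "an[sky]n", "ki[place]n"},
}

# Share of each lexical text's entries that also occur in the literary vocabulary
overlap = {
    name: len(entries & toy_lit_vocab) / len(entries)
    for name, entries in toy_lexical.items()
}

for name, share in overlap.items():
    print(f"{name}: {share:.2f}")
```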

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # this suppresses a warning about pandas from tqdm
import pandas as pd
from ipywidgets import interact
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from tqdm.auto import tqdm
tqdm.pandas() # initiate pandas support in tqdm, allowing progress_apply() and progress_map()
from nltk import trigrams, bigrams
import re

In [None]:
lit_lines = pd.read_pickle('output/litlines.p')
lit_lines

Make ngrams: unigrams, bigrams, and trigrams. Represent bigrams and trigrams as MWEs, connected by underscores. Create a full list of all lemmas and ngrams, omitting all non-lemmatized words (or ngrams that include non-lemmatized words).

In [None]:
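def make_ngrams(lemmas):
    """Add the unigrams, bigrams, and trigrams of one line to the global list lit_vocab."""
    lemmas = lemmas.split()
    lemmas_bi = bigrams(lemmas)
    lemmas_tri = trigrams(lemmas)
    lemmas_n = list(lemmas_bi) + list(lemmas_tri)
    lemmas_n = ['_'.join(lem) for lem in lemmas_n]  # represent MWEs with underscores
    lemmas = set(lemmas + lemmas_n)
    # drop unlemmatized words and any ngram that contains one
    lemmas = [lem for lem in lemmas if '[na]na' not in lem]
    lit_vocab.extend(lemmas)  # lit_vocab must exist before this function is called
    return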

In [None]:
lit_vocab = []
lit_lines['lemma'].progress_apply(make_ngrams)
lit_vocab = list(set(lit_vocab))
lit_vocab.sort()
lit_vocab

> Note: This step can also be done with `CountVectorizer` by setting `ngram_range=(1, 3)`.

In [None]:
lit_comp = lit_lines.groupby(['id_text']).agg({'lemma' : ' '.join}).reset_index()
#lit_comp['lemma'] = [lem for lem in lit_comp['lemma'] if not '[na]na' in lem] # remove unlemmatized 

In [None]:
tv = TfidfVectorizer(token_pattern = r'[^ ]+', ngram_range = (1,3))
dtm = tv.fit_transform(lit_comp['lemma'])
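# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires 1.0 or later
lit_df = pd.DataFrame(dtm.toarray(), columns= tv.get_feature_names_out(), index=lit_comp["id_text"])
cols = [col for col in lit_df.columns if '[na]na' not in col]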
lit_df = lit_df[cols]

In [None]:
lit_df

Issues with the TfidfVectorizer

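- Do we need the tf-idf scores at all?
- Unlemmatized words are removed only after vectorizing.
- This ensures that words separated by an unlemmatized word do not end up in the same bigram or trigram.
- However, it makes the tf-idf scores slightly inaccurate, because the removed columns were still included when the scores were computed (the difference is probably very small).
- Words on consecutive lines become part of the same bigram or trigram, because the lines of each composition are joined into one string.
- Alternative: use the ngrams determined above to create MWEs with NLTK's `MWETokenizer`?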
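
One way to sidestep the line-crossing issue listed above is to build the ngrams line by line and only then count them per composition. The sketch below is not the `MWETokenizer` route suggested in the last bullet; it is a plain-Python stand-in. The helper `line_ngrams` and the data in `toy_composition` are hypothetical.

```python
from collections import Counter

def line_ngrams(line, n_max=3, skip='[na]na'):
    """Return all 1- to n_max-grams of a single line, joined by underscores,
    leaving out every ngram that contains an unlemmatized token."""
    tokens = line.split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if not any(skip in tok for tok in window):
                grams.append('_'.join(window))
    return grams

# Hypothetical composition consisting of two lines
toy_composition = [
    "ŋi[night]n ud[sun]n",
    "e[house]n x[na]na",
]

# Count the ngrams of the composition without ever joining its lines
counts = Counter(g for line in toy_composition for g in line_ngrams(line))
print(counts)
```

Token lists built this way could be handed to `CountVectorizer` or `TfidfVectorizer` through a callable `analyzer`, so that unlemmatized words are already gone when the scores are computed.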

In [None]:
lit_df.columns[12000:12100]

In [None]:
lit_comp.loc[lit_comp.lemma.str.contains(r'ŋi\[night\]n ud\[sun\]n')] # raw string: \[ is a regex escape, not a string escape