# 1-5-2 First Feature Set

## Term Frequencies

Comparing texts begins with term frequencies. From there we can examine relative frequencies and topics, to name two things we will be doing shortly. As you can imagine, we have to start from our ground truth and then decide how to abstract from it in order to facilitate the kinds of comparisons we want to make. If we are interested in stylistics and/or attribution, then we will want to pay attention to the *function words*, and often the punctuation, which tend to carry author signals. If we are interested in topics, then we will want to pay attention to the *lexical words* (and throw away the function words).

To do any of these things, we need to establish the frequency of each term in each text, across all our texts. This involves several steps:

1. Determine what we are going to include as tokens: words, words+punctuation, words-stopwords, etc.
2. Count those tokens in a text
3. Compile the tokens and their frequencies across our corpus
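
As a rough illustration, the three steps above can be sketched by hand with nothing but the standard library. The two tiny documents, the `tokenize` helper, and the choice to keep only alphabetic words are all assumptions made for this example; they are not the corpus or the tokenizer used in the rest of this notebook.

```python
import re
from collections import Counter

# Two made-up documents standing in for a corpus (hypothetical data)
docs = {"doc1": "The cat sat. The cat slept.", "doc2": "A dog sat."}

# Step 1: decide what counts as a token
# (an assumption here: lowercase alphabetic words only)
def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# Step 2: count those tokens in each text
counts = {label: Counter(tokenize(text)) for label, text in docs.items()}

# Step 3: compile the tokens and their frequencies across the corpus
vocabulary = sorted(set().union(*counts.values()))
table = {label: [c[term] for term in vocabulary] for label, c in counts.items()}

print(vocabulary)
for label, row in table.items():
    print(label, row)
```

Each row of `table` holds one document's counts, in vocabulary order, with a zero wherever a term does not occur in that document.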

If you're thinking "That's a lot of work," you are correct, but there is a process, and it looks like this ...

In [None]:
# IMPORTS
import re
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import matplotlib.pyplot as plt

# Set plt parameters
plt.rcParams['figure.dpi'] = 300
plt.rcParams["figure.figsize"] = (10,5)

In [None]:
files = ["A", "B", "C", "D", "E", "F", "G", "H", "mdg"]

strings = []
for i in files:
    # Create the path to the file
    the_file = "../data/1924/texts/"+i+".txt"
    # Read the file to a string, closing the file when done
    with open(the_file, 'r') as f:
        the_string = f.read()
    # Add the string to a list of strings
    strings.append(the_string)

print(len(strings), strings[8][0:50])

In [None]:
df = pd.DataFrame({"labels":files, "text":strings})
df.head()

In [None]:
texts = df.text.tolist()
print(texts[8][0:50])

## Tokenization(s)

Before we can count words and establish frequencies, we need to settle upon what we are going to consider words, which means determining our method of tokenizing our strings of characters into lists of tokens.

- The first tokenizer is a regex that I have long used in order to keep contractions as single words; it throws away all other forms of punctuation.
- The NLTK's `word_tokenize()` function is based on the `TreebankWordTokenizer`: basically, it tokenizes text as in the Penn Treebank, which means contractions are broken into their distinct parts, e.g., `I'm` becomes `I` + `'m`. By contrast, `wordpunct_tokenize()` is a regex that turns the apostrophe of a contraction into its own token, so `I'm` becomes `I` + `'` + `m`.
- SciKit-Learn's tokenization comes up the leanest: it drops punctuation and any token of only one character.
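
To make the difference concrete, here is a small sketch that applies the first regex and the pattern behind `wordpunct_tokenize()` to one sample sentence. The sample sentence is made up, and the second pattern is written out with `re` rather than imported from NLTK, so treat it as an approximation for illustration. `word_tokenize()` is left out because it cannot be reduced to a single pattern.

```python
import re

# A made-up sample sentence (hypothetical data)
sample = "I'm sure it's a test."

# The regex approach used in this notebook: keep letters and apostrophes,
# turn everything else into spaces, then split on whitespace
regex_tokens = re.sub("[^a-zA-Z']", " ", sample).lower().split()

# The pattern behind wordpunct_tokenize(): runs of word characters,
# or runs of punctuation, each become a token
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", sample.lower())

print(regex_tokens)      # contractions survive as single tokens
print(wordpunct_tokens)  # apostrophes and the period become tokens
```

The same sentence yields five tokens one way and ten the other, which is why the word counts in the experiments below disagree.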

### Experiments

#### Word Counts

In [None]:
# REGEX
regex = [word for word in re.sub("[^a-zA-Z']"," ", texts[8]).lower().split()]

# NLTK
w_tokens = [word.lower() for word in word_tokenize(texts[8])]
wp_tokens = [word.lower() for word in wordpunct_tokenize(texts[8])]

# SciKit-Learn
vectorizer = CountVectorizer(lowercase = True)  # We are vectorizing
x = vectorizer.fit_transform([texts[8]])        # the same text as above
sk_count = np.sum(x.toarray(), axis = 1)        # then summing the freq count

# Print to Compare
print(f"regex:       {len(regex)}")
print(f"nltk words:  {len(w_tokens)}")
print(f"nltk wpunct: {len(wp_tokens)}")
print(f"scikit:      {sk_count[0]}")

#### Vocabularies

In [None]:
# Let's compare vocabulary sizes:
print("METHOD : TOKEN SET")
print(f"regex  :  {len(set(regex))}")
print(f"NLTK   :  {len(set(w_tokens))}")
print(f"SciKit :  {x.shape[1]}")

In [None]:
difference = set(w_tokens) - set(vectorizer.get_feature_names_out())
print(difference)

## Creating a Document-Term Matrix

These experiments reveal the strengths and weaknesses of SciKit-Learn's built-in tokenizer. We will explore alternate tokenizers later; for now, please be aware that if you run `CountVectorizer` unadorned, it has the following defaults:

- lowercase everything,
- get rid of all punctuation,
- make a word out of any run of two or more alphanumeric characters,
- split contractions (at the apostrophe), and
- remove no stopwords.

The tokenizer is not without its problems: while it breaks contractions at the apostrophe, like NLTK, it then throws away anything shorter than two characters, which means `I'm` disappears entirely. And pity the indefinite article *a*, which is pitched while *an* and the definite article *the* remain. (More on this later, but you should know that the documentation for the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is quite good.)
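
One way to see what survives is to run the documented default value of `CountVectorizer`'s `token_pattern` parameter directly through `re`. This is only a sketch: the sample sentence is made up, and applying the pattern by hand to lowercased text approximates the default behavior rather than calling scikit-learn itself.

```python
import re

# Documented default for CountVectorizer's token_pattern parameter:
# two or more word characters between word boundaries
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# A made-up sample sentence (hypothetical data)
sample = "I'm reading a book and an essay; the end."

# Lowercase first, as CountVectorizer does by default, then extract tokens
tokens = re.findall(DEFAULT_TOKEN_PATTERN, sample.lower())
print(tokens)
```

Both halves of `I'm` and the article *a* fall below the two-character minimum, so none of them appears in the result.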

In [None]:
# We are going with the defaults, 
# so no options/arguments are being passed:
vectorizer = CountVectorizer()

# fit the model to the data 
# vecs = vectorizer.fit(texts)
X = vectorizer.fit_transform(texts)

# see how many features we have
X.shape

With our nine observations, we have over seven thousand features!

The easiest way to "see" this is to convert the array to a dataframe.

In [None]:
# Convert:
df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())

# See what this looks like:
df.head()

In [None]:
# As always, we can save to a CSV file and look at this in other apps
# df.to_csv("../data/mdg_texts.csv")

In [None]:
# Keep only terms that appear in at least two documents
vectorizer_min = CountVectorizer(min_df = 2)
X2 = vectorizer_min.fit_transform(texts)
X2.shape
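
Setting `min_df = 2` asks the vectorizer to drop any term that occurs in fewer than two documents. The sketch below imitates that filter in plain Python on three made-up, already tokenized documents; the data and variable names are assumptions for illustration, and this is not scikit-learn's actual implementation.

```python
from collections import Counter

# Three made-up, already tokenized documents (hypothetical data)
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]

# Document frequency: the number of documents that contain each term.
# set(doc) ensures a term is counted once per document.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Keep only the terms found in at least min_df documents
min_df = 2
kept = sorted(term for term, n in doc_freq.items() if n >= min_df)
print(kept)
```

Terms unique to a single text are removed, which is why the filtered matrix has fewer columns than the unfiltered one.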

In [None]:
df2 = pd.DataFrame(X2.toarray(), 
                   columns = vectorizer_min.get_feature_names_out())

df2.head(9)

In [None]:
df2["label"] = files
df2.set_index("label", inplace=True)
df2.head(9)