# Punchlines and Pipelines: A Python NLP Tutorial for Comedic Text Vectorization

Author: Duncan Calvert
Last Modified: 9/24/24

This article is part of my series Watch One, Do One, Teach One (WDT), a data science series that helps beginner data scientists learn AI concepts, practice implementing them, and then teach them to others in order to cement their understanding.

## Table of Contents
1. [What is Text Vectorization](#what-is-text-vectorization)
2. [General NLP Vocabulary](#general-nlp-vocabulary)
3. [Applying Text Vectorization to a Real World Use Case and Data Set](#applying-text-vectorization-to-a-real-world-use-case-and-data-set)
4. [Library Imports and Configurations](#library-imports-and-configurations)
5. [Downloading the Jokes Data Set](#downloading-the-jokes-data-set)
6. [Text Vectorization Approaches](#text-vectorization-approaches)
    1. [One-Hot Encoding](#1-one-hot-encoding)
    2. [Bag of Words](#2-bag-of-words-bow)
    3. [NGrams](#3-ngrams)
    4. [TF-IDF](#4-tf-idf)
    5. [Word Embeddings (Word2Vec, GloVe)](#5-word-embeddings-word2vec-glove)
    6. [Sentence Embeddings (BERT, GPT)](#6-sentence-embeddings-bert-gpt)


## What is Text Vectorization

Text Vectorization is the process of converting natural language text (think emails, tweets, and books) into numbers that represent it. This step is crucial for Natural Language Processing (NLP) problems because machine learning models operate on numbers, not raw text. Examples of text vectorization include Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), Word Embeddings (Word2Vec, GloVe), and Sentence/Document Embeddings (BERT, GPT, and other Transformers). More generally, text vectorization can be thought of as a specific sub-type of "feature engineering", as it creates derived features from raw data. You'll occasionally hear it referred to as "feature extraction" or "text representation", but "text vectorization" is generally the most commonly used term.
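
To make "converting text into numbers" concrete, here is a minimal sketch that maps each unique word to an integer ID. It is a toy illustration rather than one of the approaches covered later, and the names `word_to_id` and `vector` are made up for this example.

```python
# A toy illustration: replace every word with an integer ID.
text = "do infants enjoy infancy as much as adults enjoy adultery"
tokens = text.split()

# Assign IDs in order of first appearance (dicts preserve insertion order).
word_to_id = {}
for token in tokens:
    if token not in word_to_id:
        word_to_id[token] = len(word_to_id)

# The sentence is now a list of numbers; repeated words share an ID.
vector = [word_to_id[token] for token in tokens]

print(word_to_id)
print(vector)
```

Repeated words such as "as" and "enjoy" map to the same number each time they appear, which is the basic idea every approach in this tutorial builds on.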

## General NLP Vocabulary

To get us started and ensure we're all clear on the general terms, let's define some foundational NLP vocabulary.

* <ins>Corpus</ins>: The full collection of text in a dataset. Importantly, the words in a corpus are <ins>non-unique</ins>, so the same word will often appear many times.
* <ins>Vocabulary</ins>: The complete list of <ins>unique</ins> words in a dataset. The Vocabulary is the Corpus with duplicate words removed.
* <ins>Document</ins>: A single record in a dataset. For example, if your dataset were a library, a document would be a single book. For tabular data sets, a document is generally a row.
* <ins>Word</ins>: A single word in a document, often called a token once the text has been split up.
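
The terms above can be sketched with two tiny documents. The sentences and the `example_*` variable names are invented for this illustration and are not part of the jokes data set.

```python
# Two tiny "documents" (hypothetical examples).
example_docs = [
    "why did the chicken cross the road",
    "the chicken wanted to get to the other side",
]

# Corpus: every word from every document, duplicates included.
example_corpus = [word for doc in example_docs for word in doc.split()]

# Vocabulary: the unique words only.
example_vocab = set(example_corpus)

print("Documents:", len(example_docs))
print("Corpus size:", len(example_corpus))
print("Vocabulary size:", len(example_vocab))
```

The corpus is larger than the vocabulary because words like "the" and "chicken" are counted every time they occur.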

## Applying Text Vectorization to a Real World Use Case and Data Set

Okay, so why are these concepts useful for a real world use case?

Imagine that you're a comedian and you have a dataset containing thousands of jokes. You might be interested in doing any of the following:
1. Analyze the data set to find major themes or keywords in popular jokes and see how your jokes compare
2. Analyze the data set to find jokes that are very similar to your own to identify cases where a comedian may be using your material, and also avoid accidentally using someone else's material
3. Create an ML model to automatically classify your jokes into categories, give them a quality rating, or recommend which cities to use them in
4. Create an AI model tailored to your own comedy style that can act as a co-writer to help you generate new jokes

If any of those use cases sound interesting I have good news. There are a number of available large [joke data sets](https://github.com/taivop/joke-dataset) online. For this tutorial, we'll be using the wocka.com scrape. This includes about 10,000 jokes and comes in the following JSON schema:
* id -- page ID on wocka.com.
* category -- see available categories here.
* title -- title of the joke.
* body -- the main text of the joke

Example of a single data set item: 

In [71]:
{
        "title": "Infants vs Adults",
        "body": "Do infants enjoy infancy as much as adults enjoy adultery?",
        "category": "One Liners",
        "id": 17
}

{'title': 'Infants vs Adults',
 'body': 'Do infants enjoy infancy as much as adults enjoy adultery?',
 'category': 'One Liners',
 'id': 17}

### Library Imports and Configurations

In [None]:
import pandas as pd
import numpy as np
import itertools
import json # for parsing our data set file

# The Natural Language Toolkit (NLTK) provides a common basic tokenizer
import nltk
from nltk.tokenize import word_tokenize # for tokenization


In [100]:
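# Download the necessary tokenizer data from nltk
nltk.download('punkt')
# Newer NLTK releases load the 'punkt_tab' resource instead, so fetch it as well
nltk.download('punkt_tab')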

[nltk_data] Downloading package punkt to /Users/attis/nltk_data...


### Downloading the Jokes Data Set
1. Navigate to the [joke data set link](https://github.com/taivop/joke-dataset)
2. Click on the file wocka.json
3. Click on the download button (situated on the middle right of the page)
4. Place it in your project directory
5. Update the code below with your file path

In [101]:
# Load JSON dataset (a one-joke inline sample stands in for the full wocka.json file here)
json_data = """
[
{
        "title": "Infants vs Adults",
        "body": "Do infants enjoy infancy as much as adults enjoy adultery?",
        "category": "One Liners",
        "id": 17
}]
"""

data = json.loads(json_data)
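
The cell above parses an inline sample. To load the downloaded file instead, something like the following sketch should work, assuming the file is a JSON list of records shaped like the sample. The helper name `load_jokes` and the path `wocka.json` are placeholders, so point the path at wherever you saved the file.

```python
import json
from pathlib import Path


def load_jokes(path):
    """Read a JSON file that contains a list of joke records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical location: change this to your own file path.
jokes_path = Path("wocka.json")

if jokes_path.exists():
    data = load_jokes(jokes_path)
    print(f"Loaded {len(data)} jokes")
else:
    print(f"{jokes_path} not found; download it first")
```

The existence check only keeps the cell from crashing if the file has not been downloaded yet.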

Step 1: Convert the wocka.json file into a DataFrame

In [118]:
df = pd.DataFrame(data)
df.head()

| | title | body | category | id |
|---|---|---|---|---|
| 0 | Infants vs Adults | Do infants enjoy infancy as much as adults enj... | One Liners | 17 |


Step 2: Pre-process each document (row of the DataFrame) by lowercasing it and splitting it into individual words. To verify this ran correctly, we'll print the tokenized documents and then measure the total corpus size.

In [117]:
# Lowercase each joke and split it on whitespace
corpus = df['body'].apply(lambda x: x.lower().split())
# Alternative: NLTK's tokenizer also separates punctuation (requires the 'punkt_tab' resource)
# corpus = df['body'].apply(word_tokenize)

print(corpus)
# Total corpus size: the number of (non-unique) words across all documents
print(corpus.apply(len).sum())
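
The choice of tokenizer changes what counts as a word. The sketch below compares plain whitespace splitting with a simple regular expression that separates punctuation. The regex is an illustrative stand-in chosen for this example and is not the algorithm NLTK's `word_tokenize` uses.

```python
import re

joke = "Do infants enjoy infancy as much as adults enjoy adultery?"

# Whitespace splitting leaves punctuation attached to the neighbouring word.
whitespace_tokens = joke.lower().split()

# Regex alternative: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", joke.lower())

print(whitespace_tokens[-1])
print(regex_tokens[-2:])
```

With whitespace splitting, "adultery?" and "adultery" would be treated as two different vocabulary entries, which is worth keeping in mind when reading the vocabulary printed later.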


Step 3: Create a vocabulary of unique words by flattening the corpus (a list of lists) into a single sequence of words and turning it into a "set".

In [88]:
vocabulary = set(word for document in corpus for word in document)

Since a set is unordered and doesn't support slicing, we'll use `itertools.islice` to print out 5 of its elements instead.

In [89]:
print("--- A few random vocabulary words ---")
for val in itertools.islice(vocabulary, 5):
    print(val)

--- A few random vocabulary words ---
adultery?
as
much
infants
infancy


Let's also check the total length to get the vocabulary size.


In [90]:
print("The total vocabulary size is {} unique words/tokens".format(len(vocabulary)))

The total vocabulary size is 8 unique words/tokens


## Text Vectorization Approaches

Okay, that's pretty good! From our initial round of pre-processing we've been able to:
1. Measure the vocabulary size (unique words)
2. Measure the corpus size (non-unique words)

Now that we've done the basics, let's start exploring an assortment of text vectorization approaches!

### 1. One-Hot Encoding
<ins>One-Hot Encoding (OHE)</ins> is used with categorical and natural language datasets to convert their values into binary vector representations. Specifically, it represents each category in a feature as a vector of 0s with a single 1, with the vector's length equal to the number of possible categories (i.e. the Vocabulary size in the case of natural language). To encode a whole document, we combine its words' vectors into a single row that holds a 1 for every vocabulary word the document contains.
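
Before applying this to the DataFrame, here is a minimal sketch of the idea at the level of single words. The four-word `toy_vocabulary` and the helper `one_hot` are invented for this illustration.

```python
# Toy vocabulary in a fixed order (sorted so the positions are reproducible).
toy_vocabulary = sorted({"do", "infants", "enjoy", "infancy"})
word_to_index = {word: i for i, word in enumerate(toy_vocabulary)}


def one_hot(word):
    """Return a vector of 0s with a single 1 at the word's position."""
    vector = [0] * len(toy_vocabulary)
    vector[word_to_index[word]] = 1
    return vector


for word in toy_vocabulary:
    print(word, one_hot(word))

# A document row: 1 wherever any of the document's words has a 1.
toy_document = ["do", "infants", "enjoy"]
document_row = [max(column) for column in zip(*(one_hot(w) for w in toy_document))]
print(document_row)
```

Each word gets exactly one 1, while the document row has a 1 for every word it contains. The DataFrame code in this section builds that same kind of row for each joke.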

Using the same example data set as before, let's create a list of all unique words to use as the columns (features) of our one-hot encoded DataFrame.

In [91]:
# Convert the set into a list
vocabulary_list = list(vocabulary)

# Create a DataFrame to hold the one-hot encoded representation
one_hot_encoded_df = pd.DataFrame(0, index=range(len(df)), columns=vocabulary_list)
one_hot_encoded_df.head()

| | adultery? | as | much | infants | infancy | do | adults | enjoy |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [92]:
# Iterate over each document and set the corresponding word column to 1
for idx, document in enumerate(corpus):
    for word in document:
        one_hot_encoded_df.at[idx, word] = 1

print("One-Hot Encoded DataFrame:")
one_hot_encoded_df.head()

One-Hot Encoded DataFrame:


| | adultery? | as | much | infants | infancy | do | adults | enjoy |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |


### 2. Bag of Words (BoW)

### 3. NGrams

### 4. TF-IDF

### 5. Word Embeddings (Word2Vec, GloVe)

### 6. Sentence Embeddings (BERT, GPT)