## Table of Contents
1. [General Vocabulary](#general-vocabulary)
2. [Feature Engineering](#feature-engineering)
    1. [One Hot Encoding](#one-hot-encoding)
    2. [Bag of Words](#bag-of-words-bow)
    3. [NGrams](#ngrams)
    4. [TF-IDF](#tf-idf)
    5. [Word Embeddings](#word-embeddings)


## General Vocabulary

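* <ins>Corpus</ins>: the full collection of documents (and therefore all of the words) in a dataset
* <ins>Vocabulary</ins>: the set of unique words in the corpus
* <ins>Document</ins>: a single record in the dataset (here, one row of text)
* <ins>Word</ins> (or token): a single unit of text within a document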

To illustrate these concepts, let's use the following example dataset.

In [28]:
import pandas as pd
import numpy as np
import itertools

# Example dataset: five short documents
data = {
    'Documents': [
        'I love machine learning and data science',
        'Data science is fascinating and challenging',
        'Machine learning is a subset of data science',
        'I enjoy learning new things in data science',
        'This is an example document, it may have a lot of symbols like (! @ # $ or %)'
    ]
}

Step 1: Convert the corpus (data dictionary) into a DataFrame

In [29]:
df = pd.DataFrame(data)
df.head()

| | Documents |
|---|---|
| 0 | I love machine learning and data science |
| 1 | Data science is fascinating and challenging |
| 2 | Machine learning is a subset of data science |
| 3 | I enjoy learning new things in data science |
| 4 | This is an example document, it may have a lot... |


Step 2: Pre-process each document (row of the dataframe) to make it lowercase and split it into individual words

In [30]:
corpus = df['Documents'].apply(lambda x: x.lower().split())
print(corpus)

0     [i, love, machine, learning, and, data, science]
1    [data, science, is, fascinating, and, challeng...
2    [machine, learning, is, a, subset, of, data, s...
3    [i, enjoy, learning, new, things, in, data, sc...
4    [this, is, an, example, document,, it, may, ha...
Name: Documents, dtype: object
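
Note that `split()` only breaks on whitespace, so punctuation stays attached to words (for example `document,`) and symbols such as `@` become tokens of their own. Below is a minimal sketch of one alternative, assuming we only want alphanumeric tokens. The `tokenize` helper and its regular expression are illustrative choices and are not used in the rest of this notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

example = 'This is an example document, it may have a lot of symbols like (! @ # $ or %)'
print(tokenize(example))
```

With this helper the trailing comma is dropped from `document,` and the standalone symbols are discarded, which would shrink the vocabulary built in the next step.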


Step 3: Create a vocabulary of unique words by flattening the corpus (a list of lists) and collecting the words into a `set`

In [35]:
vocabulary = set(word for document in corpus for word in document)

Sets are unordered and don't support indexing or slicing, so we use `itertools.islice` to print 5 elements instead. Which 5 elements appear can differ from run to run.


In [36]:
print("--- A few random vocabulary words ---")
for val in itertools.islice(vocabulary, 5):
    print(val)

--- A few random vocabulary words ---
a
subset
in
@
learning


Let's also check the total length to get the vocabulary size.


In [34]:
print("The total vocabulary size is {} unique words/tokens".format(len(vocabulary)))

The total vocabulary size is 33 unique words/tokens
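
To make the difference between all tokens and unique tokens concrete, the sketch below counts both for the same five documents using `collections.Counter`. It is a standalone illustration in plain Python; the variable names (`tokenized`, `all_tokens`, `token_counts`) are assumptions introduced here, not part of the cells above.

```python
from collections import Counter

documents = [
    'I love machine learning and data science',
    'Data science is fascinating and challenging',
    'Machine learning is a subset of data science',
    'I enjoy learning new things in data science',
    'This is an example document, it may have a lot of symbols like (! @ # $ or %)',
]

# Same pre-processing as above: lowercase, then split on whitespace
tokenized = [doc.lower().split() for doc in documents]

# Flatten the list of lists into one list of tokens
all_tokens = [word for doc in tokenized for word in doc]

# Counter maps each unique token to the number of times it occurs
token_counts = Counter(all_tokens)

print("Total tokens:", len(all_tokens))
print("Unique tokens (vocabulary size):", len(token_counts))
print("Occurrences of 'data':", token_counts['data'])
```

The number of keys in `token_counts` matches the vocabulary size computed above, while the total token count is larger because words such as `data` and `science` repeat across documents.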


## Feature Engineering

Feature engineering, also known as "feature extraction", "text representation", or "text vectorization", is an integral step that represents text as numbers (feature vectors). This is required for NLP problems because machine learning models operate on numbers, not on text in its raw form. Several approaches are covered below.

### One Hot Encoding
<ins>One Hot Encoding (OHE)</ins>: used with categorical and natural language datasets to convert their values into binary vector representations. Each category in a feature is represented as a vector of 0s with a single 1 at that category's position, and the vector's length equals the number of possible categories (i.e. the vocabulary size in the case of natural language). In the example that follows, each document is encoded as one row with a 1 in the column of every word it contains.
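
As a minimal sketch of the idea for a single word, the snippet below builds a one-hot vector against a tiny sorted vocabulary. The `one_hot` helper and `toy_vocabulary` are hypothetical names used only for illustration.

```python
def one_hot(word, vocabulary):
    """Return a list of 0s with a single 1 at the word's position in the vocabulary."""
    index = {w: i for i, w in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1  # raises KeyError if the word is not in the vocabulary
    return vector

# Sorting gives a fixed, reproducible position for every word
toy_vocabulary = sorted({"data", "learning", "machine", "science"})

print(toy_vocabulary)                      # ['data', 'learning', 'machine', 'science']
print(one_hot("machine", toy_vocabulary))  # [0, 0, 1, 0]
```

The vector length equals the vocabulary size, so the representation grows with every new unique word.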

Using the same example dataset as before, let's turn the vocabulary into a list of unique words to use as the columns (features) for one-hot encoding.

In [44]:
# Convert the set into a list
vocabulary_list = list(vocabulary)

# Create a DataFrame to hold the one-hot encoded representation
one_hot_encoded_df = pd.DataFrame(0, index=range(len(df)), columns=vocabulary_list)
one_hot_encoded_df.head()

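| | a | subset | in | @ | learning | it | i | and | fascinating | love | ... | is | like | this | science | %) | an | document, | have | of | (! |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |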


In [46]:
# Iterate over each document and set the corresponding word column to 1
for idx, document in enumerate(corpus):
    for word in document:
        one_hot_encoded_df.at[idx, word] = 1

print("One-Hot Encoded DataFrame:")
one_hot_encoded_df.head()

One-Hot Encoded DataFrame:


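| | a | subset | in | @ | learning | it | i | and | fascinating | love | ... | is | like | this | science | %) | an | document, | have | of | (! |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |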
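
Because `vocabulary` is a set, the column order above can change between runs. The sketch below shows the same binary encoding in plain Python with a sorted vocabulary, so the columns are reproducible. It uses only the first two documents to keep the output small, and the names `columns` and `rows` are assumptions made for this illustration.

```python
documents = [
    'I love machine learning and data science',
    'Data science is fascinating and challenging',
]

# Same pre-processing as above: lowercase, then split on whitespace
tokenized = [doc.lower().split() for doc in documents]

# Sorted vocabulary gives a deterministic column order
columns = sorted({word for doc in tokenized for word in doc})

# One row per document: 1 if the word occurs in the document, else 0
rows = []
for doc in tokenized:
    present = set(doc)
    rows.append([int(word in present) for word in columns])

print(columns)
for row in rows:
    print(row)
```

Each row only records whether a word is present, so a word that appears several times in a document still maps to a single 1.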


### Bag of Words (BoW)

### NGrams

### TF-IDF

### Word Embeddings