In [7]:
import string
import pandas as pd
import nltk

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer # BOW
from sklearn.feature_extraction.text import TfidfVectorizer # TF-IDF
from sklearn.metrics.pairwise import cosine_similarity # Cosine Similarity

Ensure that the NLTK resources are available by running the cell below once to download them. NLTK's '*stopwords*' corpus is used to remove stop-words from our corpus, and the '*wordnet*' dictionary is used for lemmatization (reducing each word to its dictionary form).

In [8]:
nltk.download('stopwords') 
nltk.download('wordnet')   

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/cherwah/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/cherwah/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

___

# CODE TINKERING: Bag of Words (BOW)

Each entry in the 'docs' list represents a document in our corpus, so we have 3 documents. They are used for the Bag-of-Words, TF-IDF and Cosine Similarity computations.

In [9]:
docs = [
    'John has some cats.',
    'Cats, being cats, eat fish.',
    'I ate a big fish.'
]

Preparing for lemmatization and stop-word removal.

In [10]:
lemmatizer = WordNetLemmatizer()

stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

We place the pre-processing code in a function because it is used by Bag-of-Words, TF-IDF and Cosine Similarity. The function strips punctuation, converts each term to lowercase, removes stop-words and lemmatizes each remaining term as a verb (the 'v' argument).

In [11]:
def preprocess(docs):
    docs_clean = []
    # translation table that deletes every punctuation character
    punc = str.maketrans('', '', string.punctuation)
    for doc in docs:
        doc_no_punc = doc.translate(punc)
        words = doc_no_punc.lower().split()
        # drop stop-words, then lemmatize the remaining words as verbs ('v')
        words = [lemmatizer.lemmatize(word, 'v')
                        for word in words if word not in stop_words]
        docs_clean.append(' '.join(words))

    return docs_clean

docs_clean = preprocess(docs)

Let's take a peek at our documents after pre-processing.

In [12]:
docs_clean

['john cat', 'cat cat eat fish', 'eat big fish']
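To make the individual steps easier to follow, here is a minimal sketch of the same pipeline without the lemmatization step, using only the standard library. The `demo_stop_words` set and the `tokenize` helper are hypothetical names introduced for illustration; the notebook itself uses NLTK's full English stop-word list.

```python
import string

# Hypothetical mini stop-word list, just enough for these three documents.
# The notebook uses NLTK's full English list instead.
demo_stop_words = {'i', 'a', 'has', 'some', 'being'}

def tokenize(doc):
    """Strip punctuation, lowercase, split on whitespace, drop stop-words."""
    punc = str.maketrans('', '', string.punctuation)
    words = doc.translate(punc).lower().split()
    return [w for w in words if w not in demo_stop_words]

demo_tokens = [tokenize(d) for d in [
    'John has some cats.',
    'Cats, being cats, eat fish.',
    'I ate a big fish.',
]]
print(demo_tokens)
# [['john', 'cats'], ['cats', 'cats', 'eat', 'fish'], ['ate', 'big', 'fish']]
```

Comparing this with the output above shows what lemmatization contributes: 'cats' becomes 'cat' and 'ate' becomes 'eat'.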

Using scikit-learn's CountVectorizer, we generate our Bag-of-Words feature vectors. Each row of the output array is a document and each column is a vocabulary term; the values are the frequency-counts of the terms in each document.

In [13]:
bow = CountVectorizer()

# fit_transform() returns a sparse matrix; toarray() converts it to a dense numpy array
feature_vectors = bow.fit_transform(docs_clean).toarray()
feature_vectors

array([[0, 1, 0, 0, 1],
       [0, 2, 1, 1, 0],
       [1, 0, 1, 1, 0]])
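The counting itself can be sketched by hand with the standard library. This is only an illustration, assuming simple whitespace tokenization (which happens to give the same tokens as CountVectorizer for these documents); `sample_docs`, `manual_vocab` and `manual_bow` are names introduced here.

```python
from collections import Counter

# The pre-processed documents from the notebook
sample_docs = ['john cat', 'cat cat eat fish', 'eat big fish']

# Vocabulary: every unique token, sorted alphabetically
manual_vocab = sorted({tok for doc in sample_docs for tok in doc.split()})

# One row per document, one column per vocabulary term
manual_bow = []
for doc in sample_docs:
    counts = Counter(doc.split())
    manual_bow.append([counts[term] for term in manual_vocab])

print(manual_vocab)  # ['big', 'cat', 'eat', 'fish', 'john']
for row in manual_bow:
    print(row)
# [0, 1, 0, 0, 1]
# [0, 2, 1, 1, 0]
# [1, 0, 1, 1, 0]
```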

We can view our vocabulary (every unique word in our corpus is a feature) by asking the fitted BOW vectorizer for its feature names.

In [29]:
vocab = bow.get_feature_names_out()
vocab

array(['big', 'cat', 'eat', 'fish', 'john'], dtype=object)

Let's pretty-print our BOW results by combining our vocabulary and feature vectors into a Pandas DataFrame. Here, we can see the frequency-count of each term in a document much more clearly.

In [15]:

df = pd.DataFrame(data=feature_vectors,
                index=['doc1', 'doc2', 'doc3'],
                columns=vocab)

df

| | big | cat | eat | fish | john |
|---|---|---|---|---|---|
| doc1 | 0 | 1 | 0 | 0 | 1 |
| doc2 | 0 | 2 | 1 | 1 | 0 |
| doc3 | 1 | 0 | 1 | 1 | 0 |


___

# CODE TINKERING: TF-IDF

Using scikit-learn's TfidfVectorizer, we generate a feature-vector for each document in our corpus.

In [16]:
tfidf = TfidfVectorizer()

feature_vectors = tfidf.fit_transform(docs_clean).toarray()
feature_vectors


array([[0.        , 0.60534851, 0.        , 0.        , 0.79596054],
       [0.        , 0.81649658, 0.40824829, 0.40824829, 0.        ],
       [0.68091856, 0.        , 0.51785612, 0.51785612, 0.        ]])

We can view the unique tokens in our corpus (aka Vocabulary) by asking the fitted TF-IDF vectorizer for its feature names.

In [30]:
vocab = tfidf.get_feature_names_out()
vocab

array(['big', 'cat', 'eat', 'fish', 'john'], dtype=object)

We place our TF-IDF feature vectors in a Pandas DataFrame to visualize the data easily. Notice that for document 1, the values in its feature vector that correspond to 'big', 'eat' and 'fish' are all zero because document 1 does not contain those terms. In addition, note the high value of 0.816 for the term 'cat' in document 2's feature vector, which is due to the two occurrences of 'cat' in that document.

In [18]:
df = pd.DataFrame(data=feature_vectors,
                index=['doc1', 'doc2', 'doc3'],
                columns=vocab)

df

| | big | cat | eat | fish | john |
|---|---|---|---|---|---|
| doc1 | 0.0 | 0.605349 | 0.0 | 0.0 | 0.795961 |
| doc2 | 0.0 | 0.816497 | 0.408248 | 0.408248 | 0.0 |
| doc3 | 0.680919 | 0.0 | 0.517856 | 0.517856 | 0.0 |
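To see where these numbers come from, the weights can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The names `doc_freq`, `idf`, `tfidf_row` and `manual_tfidf` are introduced here for illustration.

```python
import math
from collections import Counter

# The pre-processed documents from the notebook
sample_docs = ['john cat', 'cat cat eat fish', 'eat big fish']
tokenized = [doc.split() for doc in sample_docs]
terms = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = {t: sum(t in doc for doc in tokenized) for t in terms}

# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in terms}

def tfidf_row(tokens):
    """Raw count times IDF, then scale the row to unit (L2) length."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in terms]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

manual_tfidf = [tfidf_row(doc) for doc in tokenized]
for row in manual_tfidf:
    print([round(x, 6) for x in row])
```

In document 2, 'cat', 'eat' and 'fish' all share the same IDF because each appears in two documents, so the count of 2 for 'cat' alone accounts for its weight being twice that of the other two terms.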


___

# CODE TINKERING: Cosine Similarity

We now use 3 different query strings, stored in the 'query' array, to compare against the 3 documents in our corpus for similarity.

In [19]:
query = [
    'cats and fish',
    'and he',
    'john'
]

As before, we pre-process our query strings to remove stop-words and punctuation, and to apply lemmatization. The output shows the query strings after pre-processing.

In [25]:
query_clean = preprocess(query)
query_clean

['cat fish', '', 'john']

We then generate the TF-IDF feature-vectors for the 3 query strings using the vocabulary and IDF weights learned earlier from our corpus. These were obtained when we called tfidf.fit_transform() in an earlier step (under section CODE TINKERING: TF-IDF); here we call only tfidf.transform(), so the queries do not alter them.

In [21]:
query_feature_vector = tfidf.transform(query_clean).toarray()
query_feature_vector

array([[0.        , 0.70710678, 0.        , 0.70710678, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ]])

Now we use a DataFrame to pretty-print our 3 query strings' TF-IDF feature vectors. Remember that each query-string's feature vector has the same dimension as the vocabulary in our corpus. Since we have 5 tokens in our vocabulary, the dimension for each query-string's feature vector is 5.

In [22]:
query_df = pd.DataFrame(data=query_feature_vector,
                        index=['q1', 'q2', 'q3'],
                        columns=vocab)

query_df

| | big | cat | eat | fish | john |
|---|---|---|---|---|---|
| q1 | 0.0 | 0.707107 | 0.0 | 0.707107 | 0.0 |
| q2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| q3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
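The way a query is projected onto the corpus vocabulary can also be sketched by hand. This assumes the same default settings as TfidfVectorizer (smoothed IDF, L2 normalization); `transform_query` is a hypothetical helper, and the extra query 'dog' is added here only to show what happens to a word that is not in the vocabulary.

```python
import math
from collections import Counter

# "Fit" step: vocabulary and IDF weights come from the corpus only
sample_docs = ['john cat', 'cat cat eat fish', 'eat big fish']
tokenized = [doc.split() for doc in sample_docs]
terms = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)
doc_freq = {t: sum(t in doc for doc in tokenized) for t in terms}
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in terms}

def transform_query(query):
    """Weight a query with the corpus IDF; unknown words are ignored."""
    tf = Counter(query.split())
    raw = [tf[t] * idf[t] for t in terms]  # only vocabulary terms get a column
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]  # empty query -> all zeros

# 'dog' is a hypothetical out-of-vocabulary query, not part of the notebook
manual_queries = {q: transform_query(q) for q in ['cat fish', '', 'john', 'dog']}
for q, vec in manual_queries.items():
    print(repr(q), [round(x, 6) for x in vec])
```

Both the empty query and the out-of-vocabulary query end up as all-zero vectors, which is why they cannot match any document.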


Next, we compute the Cosine Similarity of the 3 query-strings' TF-IDF feature vectors against the 3 documents in our corpus. Notice that the first query-string is most similar to 'doc2', with a cosine similarity of 0.866. The third query-string is most similar to 'doc1', with a cosine similarity of 0.796. The second query-string, however, is not similar to any of the 3 documents: both of its words are stop-words, so pre-processing leaves an empty string and its feature vector is all zeros.

In [27]:
similarity = cosine_similarity(query_feature_vector, feature_vectors)

cs = pd.DataFrame(data=similarity,
                index=['cs1', 'cs2', 'cs3'],
                columns=['doc1', 'doc2', 'doc3'])

cs

| | doc1 | doc2 | doc3 |
|---|---|---|---|
| cs1 | 0.428046 | 0.866025 | 0.36618 |
| cs2 | 0.0 | 0.0 | 0.0 |
| cs3 | 0.795961 | 0.0 | 0.0 |
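As a cross-check, cosine similarity can be computed directly from its definition (dot product divided by the product of the vector lengths). The sketch below copies the feature vectors from the outputs above; the `cosine` helper, the explicit zero-vector guard and the 'no match' label are choices made here for illustration.

```python
import math

def cosine(u, v):
    """Dot product over the product of lengths; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Feature vectors copied from the TF-IDF outputs above
doc_vectors = {
    'doc1': [0.0, 0.60534851, 0.0, 0.0, 0.79596054],
    'doc2': [0.0, 0.81649658, 0.40824829, 0.40824829, 0.0],
    'doc3': [0.68091856, 0.0, 0.51785612, 0.51785612, 0.0],
}
query_vectors = {
    'cs1': [0.0, 0.70710678, 0.0, 0.70710678, 0.0],
    'cs2': [0.0, 0.0, 0.0, 0.0, 0.0],
    'cs3': [0.0, 0.0, 0.0, 0.0, 1.0],
}

best_matches = {}
for q_name, q_vec in query_vectors.items():
    scores = {d: cosine(q_vec, d_vec) for d, d_vec in doc_vectors.items()}
    best = max(scores, key=scores.get)
    # A top score of 0 means the query shares no terms with any document
    best_matches[q_name] = best if scores[best] > 0 else 'no match'
    print(q_name, {d: round(s, 3) for d, s in scores.items()}, '->', best_matches[q_name])
```

Because every TF-IDF row here is already scaled to unit length, the cosine similarity of two non-zero vectors reduces to their dot product.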
