**Text Feature Extraction: Bag-of-Words Model**

Text preprocessing includes extracting features from the given text data. For this, you can use any standard tool, such as a bag-of-words or k-mer model, as long as it produces sparse output. For this assignment, do not use models that create dense representations of the text or of individual words (e.g., word2vec).
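
To make the sparse-output requirement concrete, here is a minimal sketch in plain Python. The two-document corpus and the helper name `bag_of_words` are made up for illustration. Each document stores only the tokens that actually occur in it, and the dense rows are built only to show how many zeros they would otherwise carry.

```python
from collections import Counter

def bag_of_words(text):
    """Return a sparse bag-of-words: only tokens that occur are stored."""
    return Counter(text.lower().split())

# Hypothetical toy corpus, used only for illustration.
corpus = ["the cat sat on the mat", "the dog barked"]
bags = [bag_of_words(d) for d in corpus]

# Shared vocabulary: one column per distinct token, in sorted order.
vocab = sorted(set().union(*bags))

# Dense view: one entry per vocabulary token, zeros included.
dense = [[bag[t] for t in vocab] for bag in bags]

print(vocab)          # ['barked', 'cat', 'dog', 'mat', 'on', 'sat', 'the']
print(dense)          # [[0, 1, 0, 1, 1, 1, 2], [1, 0, 1, 0, 0, 0, 1]]
print(dict(bags[1]))  # sparse view of the second document: 3 stored entries
```

With a realistic vocabulary of tens of thousands of tokens, almost every entry of a dense row is zero, which is why the sparse form is preferred.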

Here is an example of feature extraction using standard bag-of-words tools.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
from scipy.sparse import csr_matrix
import string
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import sklearn as sk
from sklearn.feature_extraction.text import TfidfVectorizer
import math

In [2]:
doc = ["The bag-of-words model is a simplifying representation using natural language processing and information retrieval (IR)",
      "In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity",
      "The bag-of-words model has also been used for computer vision."] 

**Sklearn CountVectorizer**

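token_pattern is set explicitly because CountVectorizer's default pattern, `(?u)\b\w\w+\b`, only matches tokens of two or more characters and would therefore drop single-character words such as "a".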

In [3]:
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
count_occurrences = cv.fit_transform(doc)
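
The effect of the two token patterns can be checked with Python's `re` module alone, without scikit-learn. The sketch below applies the default pattern and the pattern from the cell above to a made-up sentence. The variable names are illustrative.

```python
import re

# Default CountVectorizer token pattern: at least two word characters.
default_pattern = r"(?u)\b\w\w+\b"
# Pattern used in the cell above: one or more word characters.
custom_pattern = r"(?u)\b\w+\b"

text = "a bag of words is a multiset"

default_tokens = re.findall(default_pattern, text)
custom_tokens = re.findall(custom_pattern, text)

print(default_tokens)  # ['bag', 'of', 'words', 'is', 'multiset']
print(custom_tokens)   # ['a', 'bag', 'of', 'words', 'is', 'a', 'multiset']
```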

In [4]:
count_occurrences.toarray()

array([[1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1,
        0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
       [3, 0, 1, 2, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1,
        1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1],
       [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]])

In [5]:
count_vect_df = pd.DataFrame(data = count_occurrences.toarray(), columns= cv.get_feature_names_out())

In [6]:
count_vect_df

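| | a | also | and | as | bag | been | but | computer | disregarding | document | ... | simplifying | such | text | the | this | used | using | vision | word | words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 3 | 0 | 1 | 2 | 1 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |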


**Stemming**

Stemming is a technique used to reduce an inflected word down to its word stem. Performing this text-processing technique is often useful for dealing with sparsity and/or standardizing vocabulary.

The following is a stemming example using NLTK:

In [7]:
nltk.download('punkt')

first_sentence = "Stemming is a natural language processing technique that lowers inflection in words to their root forms, hence aiding in the preprocessing of text, words, and documents for text normalization."

# Initialize Python porter stemmer
ps = PorterStemmer()

# Remove punctuation
first_sentence_no_punct = first_sentence.translate(str.maketrans("", "", string.punctuation))

# Create tokens
word_tokens = word_tokenize(first_sentence_no_punct)

# Perform stemming
print("{0:20}{1:20}".format("--Word--","--Stem--"))
for word in word_tokens:
    print ("{0:20}{1:20}".format(word, ps.stem(word)))

[nltk_data] Downloading package punkt to /home/david/nltk_data...


--Word--            --Stem--            
Stemming            stem                
is                  is                  
a                   a                   
natural             natur               
language            languag             
processing          process             
technique           techniqu            
that                that                
lowers              lower               
inflection          inflect             
in                  in                  
words               word                
to                  to                  
their               their               
root                root                
forms               form                
hence               henc                
aiding              aid                 
in                  in                  
the                 the                 
preprocessing       preprocess          
of                  of                  
text                text                
words           

[nltk_data]   Unzipping tokenizers/punkt.zip.


**Lemmatization**

Lemmatization is another technique used to reduce inflected words to their root word. It describes the algorithmic process of identifying an inflected word’s “lemma” (dictionary form) based on its intended meaning.

In our lemmatization example, we will use a popular lemmatizer, the WordNet lemmatizer.

In [9]:
nltk.download("wordnet")
nltk.download("omw-1.4")

# Initialize wordnet lemmatizer
wnl = WordNetLemmatizer()

# Example sentence
second_sentence = "Lemmatization is a text pre-processing technique used in natural language processing (NLP) models to break a word down to its root meaning to identify similarities. For example, a lemmatization algorithm would reduce the word better to its root word, or lemma, good."

# Remove punctuation
second_sentence_no_punc = second_sentence.translate(str.maketrans("", "", string.punctuation))

# Create tokens
word_tokens = word_tokenize(second_sentence_no_punc)

# Perform lemmatization (pos="v" treats every token as a verb)
print("{0:20}{1:20}".format("--Word--", "--Lemma--"))
for word in word_tokens:
    print("{0:20}{1:20}".format(word, wnl.lemmatize(word, pos="v")))


[nltk_data] Downloading package wordnet to /home/david/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/david/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


--Word--            --Lemma--           
Lemmatization       Lemmatization       
is                  be                  
a                   a                   
text                text                
preprocessing       preprocessing       
technique           technique           
used                use                 
in                  in                  
natural             natural             
language            language            
processing          process             
NLP                 NLP                 
models              model               
to                  to                  
break               break               
a                   a                   
word                word                
down                down                
to                  to                  
its                 its                 
root                root                
meaning             mean                
to                  to                  
identify        

Stemming and lemmatization are both ways to shrink the size of the vocabulary space. Note that you may use either one of the two, but not both.
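
The vocabulary-shrinking effect can be seen on a handful of tokens. The sketch below uses a deliberately crude, hypothetical suffix stripper (`crude_stem`). It is not the Porter algorithm or WordNet lemmatization and only stands in for them so the example runs without NLTK.

```python
def crude_stem(word):
    """Hypothetical toy stemmer: strip one common suffix. Not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Made-up token list with several inflected forms of the same words.
tokens = "models model modeling modeled word words".split()

vocab_raw = sorted(set(tokens))
vocab_stemmed = sorted({crude_stem(t) for t in tokens})

print(len(vocab_raw), vocab_raw)          # 6 distinct tokens before stemming
print(len(vocab_stemmed), vocab_stemmed)  # 2 ['model', 'word']
```

Fewer distinct tokens means fewer columns in the document-feature matrix and more shared non-zero entries between documents.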

**Creating TF-IDF**

The following example applies the TF-IDF technique in Python using standard tools. TF-IDF weights each word by how often it occurs in a document and discounts words that occur in many documents. This compensates for a limitation of the plain bag-of-words technique, in which raw counts let very frequent words dominate. Bag-of-words itself remains useful for text classification and for turning words into numbers that a machine can process.

In [10]:
# Create the TfidfVectorizer with its default settings
vectorize = TfidfVectorizer()

# Fit the model and transform both example sentences in one step
response = vectorize.fit_transform([first_sentence, second_sentence])
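
For intuition about the weights, here is a from-scratch sketch in plain Python. It assumes the weighting that scikit-learn documents as the TfidfVectorizer default: raw term counts, smoothed $\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1$ for $n$ documents, and L2 normalization of each row. The function name `tfidf` and the toy corpus are illustrative. Tokenization is omitted, so the results are not meant to match the cell above exactly.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute smoothed, L2-normalized TF-IDF rows for tokenized documents.

    Returns a list of {term: weight} dicts, one per document.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Hypothetical toy corpus, already tokenized.
docs = [["the", "bag", "of", "words"], ["the", "bag", "model"]]
rows = tfidf(docs)
print(rows[1])
```

In the second document, "model" receives a larger weight than "the" or "bag" because it occurs in only one of the two documents.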

Now let's implement the feature extraction module from scratch.

**Manually (Scratch)**

The following function, given a string and a k-mer length parameter c, returns the string's character k-mers, i.e., all overlapping substrings of length c.

In [11]:
def cmer(name, c=3):
    r""" Given a string and parameter c, return the list of overlapping 
    character k-mers (substrings of length c) of the lowercased string.
    """
    name = name.lower()
    if len(name) < c:
        return [name]
    v = []
    for i in range(len(name)-c+1):
        v.append(name[i:(i+c)])
    return v
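
As a quick illustration of what such a function returns, the sketch below restates the same logic as a list comprehension under a different, hypothetical name (`char_kmers`) so that it runs on its own. Note that spaces and punctuation are ordinary characters here and end up inside the k-mers.

```python
def char_kmers(text, k=3):
    """Return all overlapping character k-mers of `text` (lowercased)."""
    text = text.lower()
    if len(text) < k:
        # Strings shorter than k are returned whole, as a single term.
        return [text]
    return [text[i:i + k] for i in range(len(text) - k + 1)]

print(char_kmers("Model", 3))  # ['mod', 'ode', 'del']
print(char_kmers("a bag", 2))  # ['a ', ' b', 'ba', 'ag']
print(char_kmers("hi", 3))     # ['hi']
```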

In [12]:
def build_matrix(docs):
    r""" Build sparse matrix from a list of documents, 
    each of which is a list of word/terms in the document.  
    """
    nrows = len(docs)
    idx = {}
    tid = 0
    nnz = 0
    for d in docs:
        nnz += len(set(d))
        for w in d:
            if w not in idx:
                idx[w] = tid
                tid += 1
    ncols = len(idx)
        
    # set up memory
    ind = np.zeros(nnz, dtype=int)
    val = np.zeros(nnz, dtype=np.double)
    ptr = np.zeros(nrows+1, dtype=int)
    i = 0  # document ID / row counter
    n = 0  # non-zero counter
    # transfer values
    for d in docs:
        cnt = Counter(d)
        keys = list(k for k,_ in cnt.most_common())
        l = len(keys)
        for j,k in enumerate(keys):
            ind[j+n] = idx[k]
            val[j+n] = cnt[k]
        ptr[i+1] = ptr[i] + l
        n += l
        i += 1
            
    mat = csr_matrix((val, ind, ptr), shape=(nrows, ncols), dtype=np.double)
    mat.sort_indices()
    
    return mat

def csr_info(mat, name="", non_empty=False):
    r""" Print out info about this CSR matrix. If non_empty, 
    report number of non-empty rows and cols as well
    """
    if non_empty:
        print("%s [nrows %d (%d non-empty), ncols %d (%d non-empty), nnz %d]" % (
                name, mat.shape[0], 
                sum(1 if mat.indptr[i+1] > mat.indptr[i] else 0 
                for i in range(mat.shape[0])), 
                mat.shape[1], len(np.unique(mat.indices)), 
                len(mat.data)))
    else:
        print( "%s [nrows %d, ncols %d, nnz %d]" % (name, 
                mat.shape[0], mat.shape[1], len(mat.data)) )

def csr_l2normalize(mat, copy=False, **kargs):
    r""" Normalize the rows of a CSR matrix by their L-2 norm. 
    If copy is True, normalize and return a copy of the matrix;
    otherwise, normalize mat in place and return None.
    """
    if copy is True:
        mat = mat.copy()
    nrows = mat.shape[0]
    nnz = mat.nnz
    ind, val, ptr = mat.indices, mat.data, mat.indptr
    # normalize
    for i in range(nrows):
        rsum = 0.0    
        for j in range(ptr[i], ptr[i+1]):
            rsum += val[j]**2
        if rsum == 0.0:
            continue  # do not normalize empty rows
        rsum = 1.0/np.sqrt(rsum)
        for j in range(ptr[i], ptr[i+1]):
            val[j] *= rsum
            
    if copy is True:
        return mat
        
def textToMatrix(names, c):
    docs = [cmer(n, c) for n in names]
    return build_matrix(docs)
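
The three arrays behind a CSR matrix are easier to follow on a tiny input. The sketch below is a simplified, pure-Python variant of the idea in build_matrix (a single pass, plain lists instead of NumPy arrays, and a hypothetical function name `to_csr`), followed by a row-wise L2 normalization in the spirit of csr_l2normalize. It is meant for inspection only and is not a replacement for the functions above.

```python
import math
from collections import Counter

def to_csr(docs):
    """Build CSR arrays (val, ind, ptr) and the term index from tokenized docs."""
    idx = {}                    # term -> column id, in order of first appearance
    val, ind, ptr = [], [], [0]
    for doc in docs:
        counts = Counter(doc)
        for term in doc:
            idx.setdefault(term, len(idx))
        # One (column, count) pair per distinct term, sorted by column id.
        for col, cnt in sorted((idx[t], c) for t, c in counts.items()):
            ind.append(col)
            val.append(float(cnt))
        ptr.append(len(ind))    # row i occupies val[ptr[i]:ptr[i+1]]
    return val, ind, ptr, idx

docs = [["a", "b", "a"], ["b", "c"]]
val, ind, ptr, idx = to_csr(docs)
print(idx)  # {'a': 0, 'b': 1, 'c': 2}
print(val)  # [2.0, 1.0, 1.0, 1.0]
print(ind)  # [0, 1, 1, 2]
print(ptr)  # [0, 2, 4]

# L2-normalize each row on a copy of the values.
val_norm = list(val)
for i in range(len(ptr) - 1):
    lo, hi = ptr[i], ptr[i + 1]
    norm = math.sqrt(sum(v * v for v in val_norm[lo:hi]))
    if norm > 0.0:  # leave empty rows untouched
        for j in range(lo, hi):
            val_norm[j] /= norm
print(val_norm)
```

Only the four non-zero counts are stored. A dense 2 x 3 matrix would store six entries, and the gap grows quickly with vocabulary size.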

In [13]:
csr_info(textToMatrix(doc, 1))
csr_info(textToMatrix(doc, 2))
csr_info(textToMatrix(doc, 3))

 [nrows 3, ncols 28, nnz 74]
 [nrows 3, ncols 156, nnz 246]
 [nrows 3, ncols 251, nnz 314]


IMPORTANT: DO NOT MAKE CHANGES TO THIS NOTEBOOK. YOU MAY USE HELP FROM THE GIVEN MODULES BY COPYING THEM TO A NEW NOTEBOOK.

**Problem:**

1. Given the training set and the test set files from the Program 1 assignment, extract features from both using the standard tools and the manual function (see above).

2. Make sure that both document-feature matrices are in the same Euclidean space, i.e., the $i$th dimension refers to the same token in both the training and test matrices. Think about different ways you could achieve this with the manual processing function.

3. Time the processing of the datasets using both the standard tools and the manual method, and report the times.
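
As one possible starting point for items 2 and 3, the sketch below shows a single way (among several) to keep training and test rows in the same feature space: build the token-to-column index from the training documents only, then reuse it for the test documents and drop tokens it has never seen. Timing uses `time.perf_counter`. The function names and the toy data are hypothetical and stand in for the real files.

```python
import time
from collections import Counter

def fit_vocabulary(train_docs):
    """Map each training token to a column id, in order of first appearance."""
    vocab = {}
    for doc in train_docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def transform(docs, vocab):
    """Return one {column: count} row per document; unseen tokens are dropped."""
    return [dict(Counter(vocab[t] for t in doc if t in vocab)) for doc in docs]

# Hypothetical toy data standing in for the real training and test files.
train = [["bag", "of", "words"], ["bag", "model"]]
test = [["words", "model", "vision"]]

start = time.perf_counter()
vocab = fit_vocabulary(train)
train_rows = transform(train, vocab)
test_rows = transform(test, vocab)
elapsed = time.perf_counter() - start

print(vocab)      # {'bag': 0, 'of': 1, 'words': 2, 'model': 3}
print(test_rows)  # [{2: 1, 3: 1}]
print(f"elapsed: {elapsed:.6f} s")
```

Column 2 means "words" in both the training and the test rows, and "vision" is ignored because it has no column. An alternative design is to build the index over the union of both sets, at the cost of letting test data influence the feature space.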
