# Document Vectorization | BAIS:6100

**Instructor: Qihang Lin**

Machine learning algorithms operate on numbers, not text, so we have to transform each document into a numeric vector. This process is known as **text vectorization**. The matrix formed by stacking these vectors is called the **Document-Term Matrix** (DTM).


## Document-Term Matrix

We use the **CountVectorizer** from the **sklearn** library to construct a DTM.

In [1]:
#!pip3 install --upgrade scikit-learn
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import nltk                                  

In [2]:
mytexts = ["I will take text mining in Fall 2021.",
           "Are you taking Text-Mining this year?",
           "Unfortunately, Text Mining isn't offered."]

In [3]:
vectorizer = CountVectorizer()     #Initialize the vectorizer with default setting.
DTM = vectorizer.fit_transform(mytexts)     #Convert the corpus into DTM. 

In [4]:
DTM.shape      #The shape of DTM. (Num of Docs) * (Num of Terms).

(3, 15)

DTM is a **sparse matrix**, namely, a matrix in which most of the elements are zeros. 

In [5]:
DTM

<3x15 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

When storing a sparse matrix, a **Compressed Sparse Row format** is used to take advantage of the sparsity so that the RAM space used by a DTM is significantly reduced. 

You can convert a DTM from the compressed sparse row format to the regular format as follows, but it is not recommended because the regular format requires significantly larger RAM space.

In [6]:
DTM.toarray()

array([[1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1],
       [0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0]])

Let's compare the RAM space used by the compressed format and the regular format.

In [7]:
import sys                           #sys.getsizeof() shows the size of a Python object
print(sys.getsizeof(DTM))            #DTM is in compressed format (only the wrapper object is counted, not its data arrays)
print(sys.getsizeof(DTM.toarray()))  #DTM.toarray() converts DTM to the regular format

48
488
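
Note that `sys.getsizeof()` only measures the Python object it is given; for a SciPy sparse matrix, that does not include the arrays that hold the stored values. A rough sketch of a fairer comparison is shown below. It assumes the matrix is in CSR format, whose storage consists of the arrays `data`, `indices`, and `indptr`, and it rebuilds the matrix from the `DTM.toarray()` output above instead of rerunning the vectorizer.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense copy of the 3x15 DTM shown above (int64 counts).
dense = np.array([[1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
                  [0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1],
                  [0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0]],
                 dtype=np.int64)
sparse = csr_matrix(dense)

# CSR storage = nonzero values + their column indices + row pointers.
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
dense_bytes = dense.nbytes  # every cell is stored, zeros included

print("stored elements:", sparse.nnz)
print("sparse bytes:", sparse_bytes)
print("dense bytes:", dense_bytes)
```

On a corpus this small the saving is modest; it grows as the vocabulary gets larger and the matrix gets sparser.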


In order to better display the DTM in this notebook, we still convert DTM to a regular format. **You don't want to do this in your homework or project when the data is large.**

**vectorizer.fit_transform()** not only creates a DTM but also learns a **vocabulary** from the text data. The following code prints the vocabulary the vectorizer has learned. 

In [8]:
print(vectorizer.get_feature_names_out().tolist())   #get_feature_names() was removed in scikit-learn 1.2

['2021', 'are', 'fall', 'in', 'isn', 'mining', 'offered', 'take', 'taking', 'text', 'this', 'unfortunately', 'will', 'year', 'you']




We then convert DTM into a dataframe using the vocabulary as the column names. 

In [9]:
pd.DataFrame(DTM.toarray(), columns = vectorizer.get_feature_names_out())

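| | 2021 | are | fall | in | isn | mining | offered | take | taking | text | this | unfortunately | will | year | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |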


The **default vectorizer** converts all letters to lowercase. It also uses a tokenizer different from the one in the nltk library. In particular, the default vectorizer removes all punctuation and all single-character tokens (it keeps only tokens with at least two characters, which is why "in" survives above but "I" does not), whereas the tokenizer in nltk keeps most punctuation marks. 
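
To see where tokens such as "isn" in the vocabulary above come from, the sketch below applies the regular expression that `CountVectorizer` uses as its default `token_pattern`, `(?u)\b\w\w+\b`, to lowercased text with Python's `re` module. It imitates only the lowercasing and tokenization steps, not the full vectorizer, and the helper name `default_tokens` is made up for this illustration.

```python
import re

# Default token_pattern of CountVectorizer: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    """Lowercase the text and keep tokens with at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())

print(default_tokens("I will take text mining in Fall 2021."))
print(default_tokens("Are you taking Text-Mining this year?"))
print(default_tokens("Unfortunately, Text Mining isn't offered."))
```

The single-character pieces "I" and "t" are dropped, the hyphen splits "Text-Mining" into two tokens, and the two-character token "in" is kept.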

If we prefer the tokenizer in nltk, we can do the following:

In [10]:
#Initialize vectorizer using nltk tokenizer.
vectorizer = CountVectorizer(tokenizer = nltk.word_tokenize)   
DTM = vectorizer.fit_transform(mytexts)
pd.DataFrame(DTM.toarray(), columns = vectorizer.get_feature_names_out())



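| | , | . | 2021 | ? | are | fall | i | in | is | mining | ... | offered | take | taking | text | text-mining | this | unfortunately | will | year | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |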


Interested in learning more about **CountVectorizer**? See (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

A DTM is a type of **bag-of-words model**, where a document is viewed as a "bag" of its words: grammar and word order are disregarded, and only the multiplicity of each word is kept.
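
As a small illustration of the bag-of-words idea, the sketch below counts words with Python's `collections.Counter`. The two sentences and the helper `bag_of_words` are invented for this example; splitting on whitespace is a simplification and is not how `CountVectorizer` tokenizes.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercased, whitespace-separated word occurs."""
    return Counter(text.lower().split())

bag_a = bag_of_words("the dog chased the cat")
bag_b = bag_of_words("the cat chased the dog")

print(bag_a)           # word counts only; the order of words is not stored
print(bag_a == bag_b)  # same words with the same counts give the same bag
```

The two sentences mean different things, yet they produce identical bags because only the counts are kept.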

## DTM Cleaning

The default **vectorizer** converts all tokens to lowercase, removes all punctuation, and removes all single-character tokens. However, we still need to clean the DTM further.

Similar to counting word frequency in the entire dataset, the following cleaning steps can be modified or skipped depending on the use case. 

1. Replace words if needed

In [11]:
mytexts = [s.replace("n't "," not ") for s in mytexts]
mytexts

['I will take text mining in Fall 2021.',
 'Are you taking Text-Mining this year?',
 'Unfortunately, Text Mining is not offered.']

2. Remove all stop words. You may customize your stop word list as needed.

In [12]:
#Remove stop words using the list from nltk during vectorization
nltk_stopwords = nltk.corpus.stopwords.words("english")
vectorizer = CountVectorizer(stop_words=nltk_stopwords)
DTM = vectorizer.fit_transform(mytexts)
pd.DataFrame(DTM.toarray(), columns = vectorizer.get_feature_names_out())



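| | 2021 | fall | mining | offered | take | taking | text | unfortunately | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |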


3. Stemming. Unfortunately, **sklearn** cannot do stemming by itself. We need to create a new vectorizer that **inherits** from the **CountVectorizer** class and integrates the stemming function from **nltk**. 

    The code below starting from "class" is beyond the scope of this course. You only need to know how to apply it; you will not be asked to modify it.

In [13]:
stemmer = nltk.stem.SnowballStemmer("english")  #You may use a different stemmer.
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

In [14]:
vectorizer = StemmedCountVectorizer(stop_words=nltk_stopwords)
DTM = vectorizer.fit_transform(mytexts)
pd.DataFrame(DTM.toarray(), columns = vectorizer.get_feature_names_out())



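| | 2021 | fall | mine | offer | take | text | unfortun | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |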


   Please note that the vectorizer **always removes stop words before stemming**, which may cause some problems. For example, a word that is not on the stop word list is kept even if its stem is a stop word. 

4. Build a vocabulary based on terms' **document frequency**.  A term's document frequency is the proportion of documents containing this term. 

   By specifying **max_df** and **min_df** as proportions between 0 and 1, we can let the vectorizer ignore terms whose document frequency is lower than min_df or higher than max_df. (Integer values are interpreted as absolute document counts instead.)

In [15]:
# We only keep terms whose document frequency is <=0.9 and >=0.4.
vectorizer = StemmedCountVectorizer(stop_words=nltk_stopwords,
                                    max_df=0.9,
                                    min_df=0.4)
DTM = vectorizer.fit_transform(mytexts)
pd.DataFrame(DTM.toarray(), columns = vectorizer.get_feature_names_out())



| | take |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 0 |
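
To see why only "take" survives, the document-frequency filter can be reproduced by hand. The sketch below is plain Python, not sklearn's implementation; the token lists are typed in from the stemmed, stop-word-free DTM shown earlier, and the variable names are chosen for this illustration.

```python
# Stemmed tokens of the three documents after stop word removal.
docs = [
    ["take", "text", "mine", "fall", "2021"],
    ["take", "text", "mine", "year"],
    ["unfortun", "text", "mine", "offer"],
]
n_docs = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})

# Document frequency = share of documents that contain the term.
doc_freq = {term: sum(term in doc for doc in docs) / n_docs
            for term in vocabulary}

min_df, max_df = 0.4, 0.9
kept = [term for term in vocabulary if min_df <= doc_freq[term] <= max_df]

for term in vocabulary:
    print(term, round(doc_freq[term], 3))
print("kept:", kept)
```

"mine" and "text" appear in every document (document frequency 1.0, above max_df), most other terms appear in only one of three documents (about 0.333, below min_df), and "take" appears in two of three (about 0.667).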


5. Build a vocabulary that only contains the top $K$ terms with the highest total frequencies across all documents. This can be done by setting **max_features**=$K$.

In [16]:
# keep top 3 terms ordered by term frequency across the corpus. 
vectorizer = StemmedCountVectorizer(stop_words=nltk_stopwords,
                                    max_features=3)
DTM = vectorizer.fit_transform(mytexts)
pd.DataFrame(DTM.toarray(), columns = vectorizer.get_feature_names_out())



| | mine | take | text |
|---|---|---|---|
| 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 |
| 2 | 1 | 0 | 1 |


6. Use a customized vocabulary.

In [17]:
myvocabulary=["mine","fall","year"]       #This is a customized vocabulary.
vectorizer = StemmedCountVectorizer(stop_words=nltk_stopwords,
                                    vocabulary=myvocabulary)
DTM = vectorizer.fit_transform(mytexts)
pd.DataFrame(DTM.toarray(), columns = vectorizer.get_feature_names_out())



| | mine | fall | year |
|---|---|---|---|
| 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 |


## Count Total Frequency of a Term using DTM

A DTM can be used to calculate the total term frequency across all documents. We just need to take the sum along each column with **DTM.sum(axis=0).tolist()[0]**.

In [18]:
vectorizer = StemmedCountVectorizer(stop_words=nltk_stopwords)
DTM = vectorizer.fit_transform(mytexts)
df = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                   'Frequency': DTM.sum(axis=0).tolist()[0]
                  })

df.sort_values(by="Frequency",inplace=True,ascending=False)
df.reset_index(inplace=True,drop=True)
df



| | Term | Frequency |
|---|---|---|
| 0 | mine | 3 |
| 1 | text | 3 |
| 2 | take | 2 |
| 3 | 2021 | 1 |
| 4 | fall | 1 |
| 5 | offer | 1 |
| 6 | unfortun | 1 |
| 7 | year | 1 |
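
The trailing `[0]` in `DTM.sum(axis=0).tolist()[0]` is needed because summing a SciPy sparse matrix along an axis returns a two-dimensional 1 x n result rather than a flat vector. The sketch below shows an equivalent way to flatten it with NumPy. To stay self-contained it rebuilds the stemmed DTM from the values printed earlier instead of calling the vectorizer, so `terms` and the array are typed in by hand.

```python
import numpy as np
from scipy.sparse import csr_matrix

terms = ["2021", "fall", "mine", "offer", "take", "text", "unfortun", "year"]
DTM = csr_matrix(np.array([[1, 1, 1, 0, 1, 1, 0, 0],
                           [0, 0, 1, 0, 1, 1, 0, 1],
                           [0, 0, 1, 1, 0, 1, 1, 0]]))

col_sums = DTM.sum(axis=0)                      # one row holding the column totals
totals = np.asarray(col_sums).ravel().tolist()  # flatten to a plain Python list
freq = dict(zip(terms, totals))

print(col_sums.shape)
print(freq)
```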


Text vectorization is a fundamental step in analyzing text data, and different vectorization algorithms may drastically affect the end results, so we need to choose one that will deliver the results we're hoping for.

## TFIDF (Term Frequency-Inverse Document Frequency)

In all DTMs above, each term in a document is scored by its **term frequency** (TF). Is it always the best way to score a term?

Example:

In [19]:
mytexts = ["An apple is a fruit.", 
           "Apple Inc. is a technology company.",
           "Apple recall is issued due to listeria."]
vectorizer = CountVectorizer(stop_words=nltk_stopwords)
text_counts= vectorizer.fit_transform(mytexts)
pd.DataFrame(text_counts.toarray(), columns = vectorizer.get_feature_names_out())



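| | apple | company | due | fruit | inc | issued | listeria | recall | technology |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |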


Should "fruit" and "apple" have the same score in Doc1? 

Should "company" and "apple" have the same score in Doc2? 

Should "listeria" and "apple" have the same score in Doc3?

**Principle:** A good vectorization scheme should give a term in a document a high score if 
1.  The term has a high frequency in that document.
2.  The term occurs in only a small number of documents (it is a "signature" of those documents).

* TF: Term frequency
 -  TF of term $t$ in doc $d$ $=$ the number of occurrences of $t$ in $d$.
 -  Follows Principle 1.
 -  Changes from document to document.

* IDF: Inverse document frequency
 -  IDF of term $t$ $= 1+\ln\left(\frac{1+N}{1+\mathrm{df}(t)}\right)$, where $N$ is the total number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$.
 -  IDF is high if the term appears in only a small number of documents.
 -  Follows Principle 2.
 -  The same for every document.

* **TFIDF = TF $\times$ IDF**
 -  The best-known document representation scheme in information retrieval.
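
The IDF formula above can be checked by hand on the three "apple" documents. The sketch below is plain Python using only the `math` module; the helper `idf` is written for this illustration and is not a sklearn function.

```python
import math

def idf(n_docs, doc_count):
    """IDF as defined above: 1 + ln((1 + n_docs) / (1 + doc_count))."""
    return 1 + math.log((1 + n_docs) / (1 + doc_count))

n_docs = 3
idf_apple = idf(n_docs, 3)  # "apple" occurs in all three documents
idf_fruit = idf(n_docs, 1)  # "fruit" occurs in only one document

# Both terms occur once in the first document (TF = 1), so TFIDF = 1 * IDF.
print("apple:", round(1 * idf_apple, 6))
print("fruit:", round(1 * idf_fruit, 6))
```

"apple" gets the lowest possible IDF of 1 because it appears in every document, while "fruit" is scored higher, which answers the first question posed above.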

We can use **TfidfVectorizer** from **sklearn** to create a DTM in terms of TFIDF.

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
vectorizer=TfidfVectorizer(stop_words=nltk_stopwords, norm=None)
DTM = vectorizer.fit_transform(mytexts)
pd.DataFrame(DTM.toarray(), columns = vectorizer.get_feature_names_out())



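| | apple | company | due | fruit | inc | issued | listeria | recall | technology |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1.693147 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.693147 | 0.0 | 0.0 | 1.693147 | 0.0 | 0.0 | 0.0 | 1.693147 |
| 2 | 1.0 | 0.0 | 1.693147 | 0.0 | 0.0 | 1.693147 | 1.693147 | 1.693147 | 0.0 |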


Here, **norm=None** means we do not normalize the rows of the DTM. (By default, **TfidfVectorizer** uses **norm="l2"**, which rescales each row to unit Euclidean length.)

If we want each row to sum to one, for example, when the documents vary widely in length, we can set **norm="l1"**.
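
To make the effect of the norm option concrete, the sketch below normalizes by hand the two nonzero TFIDF scores of the first document in the table above ("apple" = 1.0 and "fruit" = 1.693147). It assumes the usual definitions: "l1" divides a row by the sum of its absolute values, and "l2" divides it by its Euclidean length. It is plain Python, not a call into sklearn.

```python
import math

# Nonzero TFIDF scores of the first document: "apple" and "fruit".
row = [1.0, 1.693147]

l1_total = sum(abs(v) for v in row)
l1_row = [v / l1_total for v in row]          # entries now sum to one

l2_length = math.sqrt(sum(v * v for v in row))
l2_row = [v / l2_length for v in row]         # squared entries now sum to one

print([round(v, 4) for v in l1_row], round(sum(l1_row), 4))
print([round(v, 4) for v in l2_row], round(sum(v * v for v in l2_row), 4))
```

Either way, the ratio between the two scores is unchanged; only the scale of the row differs.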

Want to know more about **TfidfVectorizer**? See (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

Similar to **CountVectorizer**, **TfidfVectorizer** cannot do stemming by itself, so the following vectorizer needs to be defined if we need to apply stemming before creating the DTM.

In [22]:
stemmer = nltk.stem.SnowballStemmer("english")
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

In [23]:
vectorizer=StemmedTfidfVectorizer(stop_words=nltk_stopwords, norm=None)
DTM = vectorizer.fit_transform(mytexts)
pd.DataFrame(DTM.toarray(), columns = vectorizer.get_feature_names_out())



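| | appl | compani | due | fruit | inc | issu | listeria | recal | technolog |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1.693147 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.693147 | 0.0 | 0.0 | 1.693147 | 0.0 | 0.0 | 0.0 | 1.693147 |
| 2 | 1.0 | 0.0 | 1.693147 | 0.0 | 0.0 | 1.693147 | 1.693147 | 1.693147 | 0.0 |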


Just like **CountVectorizer**, **TfidfVectorizer** and **StemmedTfidfVectorizer** accept optional arguments such as max_df, min_df, max_features, and vocabulary.

## Binary DTM

In a binary DTM, each term has a score of either one (the term occurs in the document) or zero (it does not).

In [24]:
vectorizer=CountVectorizer(stop_words=nltk_stopwords, binary=True)
DTM = vectorizer.fit_transform(mytexts)
pd.DataFrame(DTM.toarray(), columns = vectorizer.get_feature_names_out())



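| | apple | company | due | fruit | inc | issued | listeria | recall | technology |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |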


A binary DTM is often used to vectorize short documents, for example, tweets, Facebook posts, and text messages. 

   * In a long document, a frequent term deserves a high TF score.
   
   
   * In a short document, a frequent term might not carry more information than others. Shall we still give it a high TF score?
   

For example, someone tweets **"I've never enjoyed the snow so much before. Here is the snowman I made, seconds before my kid knocked it down."** Should "before" be scored twice as high as other words like "snow" or "kid"?
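
The difference between counting and binary scoring for this tweet can be sketched in a few lines of plain Python. This is an illustration only: it tokenizes with a simple regular expression (tokens of two or more word characters) and does not remove stop words, so it is not what `CountVectorizer(binary=True)` would output with the nltk stop word list.

```python
import re
from collections import Counter

tweet = ("I've never enjoyed the snow so much before. Here is the snowman "
         "I made, seconds before my kid knocked it down.")

tokens = re.findall(r"\b\w\w+\b", tweet.lower())
counts = Counter(tokens)                # term-frequency scores
binary = {term: 1 for term in counts}   # binary scores: present terms get 1

for term in ["before", "snow", "kid"]:
    print(term, "count:", counts[term], "binary:", binary[term])
```

With term frequency, "before" is scored twice as high as "snow" or "kid"; with binary scoring, all three receive the same score.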

## N-Grams

An **n-gram** is a contiguous sequence of $n$ tokens in a document. 1-grams, 2-grams, and 3-grams are also called unigrams, bigrams, and trigrams, respectively.

For example, "I love reading in a rainy day" has five **trigrams**: "I love reading", "love reading in", "reading in a", "in a rainy", and "a rainy day". 

After removing stop words, the sentence becomes "love reading rainy day" and has only two **trigrams**: "love reading rainy" and "reading rainy day".
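
The trigram counts above can be verified with a short helper. This is a minimal sketch in plain Python; the function `ngrams` is written for this illustration, and whitespace splitting stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love reading in a rainy day".split()
filtered = "love reading rainy day".split()   # after removing stop words

print(ngrams(sentence, 3))
print(ngrams(filtered, 3))
```

A sequence of $m$ tokens has $m-n+1$ n-grams, which gives five trigrams for the seven-word sentence and two for the four words left after stop word removal.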

We can create a DTM using a vocabulary that consists of all n-grams in the dataset using **ngram_range**.

In [25]:
mytexts = ["I will take text mining in Fall 2021.",
           "Are you taking Text-Mining this year?",
           "Unfortunately, Text Mining is not offered."]

In [26]:
vectorizer=CountVectorizer(stop_words=nltk_stopwords,
                           ngram_range=(2,2))
DTM= vectorizer.fit_transform(mytexts)
pd.DataFrame(DTM.toarray(), columns = vectorizer.get_feature_names_out())



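| | fall 2021 | mining fall | mining offered | mining year | take text | taking text | text mining | unfortunately text |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |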


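Note that the **vectorizer** always removes stop words before creating n-grams. That is why "mining offered" appears: once "is" and "not" are removed from the third document, "mining" and "offered" become adjacent. 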

Typically, looking at the most frequent n-grams can help us understand the text dataset better than looking at unigrams alone. If an n-gram has a high frequency, it is probably a commonly used fixed phrase, such as:
- text mining
- home run
- never work
- new year
- big data
- university of iowa

Note that the DTM created by n-grams will be huge but most of the n-grams in the vocabulary will be meaningless. Hence, we almost always need to set **max_features** in order to limit the size of the vocabulary.

Moreover, our **StemmedCountVectorizer** applies stemming after the n-grams are created. The stemmer therefore treats each n-gram as a single string, so only the last word in the n-gram is stemmed. We should either skip stemming or clean the raw text ourselves before creating the DTM.

In [27]:
vectorizer=StemmedCountVectorizer(stop_words=nltk_stopwords,
                                  ngram_range=(2,2))
DTM= vectorizer.fit_transform(mytexts)
pd.DataFrame(DTM.toarray(), columns = vectorizer.get_feature_names_out())



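| | fall 2021 | mining fal | mining off | mining year | take text | taking text | text min | unfortunately text |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |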


## fit_transform() VS transform()

**vectorizer.fit_transform()** and **vectorizer.transform()** can both be used to create a DTM. However, they are slightly different, and it is important that you know the difference because we will use them at different steps of predictive modeling.

   * **vectorizer.fit_transform()** constructs the vocabulary from the text data, saves the vocabulary internally, and creates the DTM using that vocabulary. This should be applied to **training** data.
   
   
   * **vectorizer.transform()** does not construct the vocabulary. Instead, it directly borrows the vocabulary constructed by **vectorizer.fit_transform()** to create the DTM. This should be applied to **testing** data.
   

See the following examples:

In [28]:
mytext_train = ["I will take text mining in Fall 2021.",
                "Are you taking Text-Mining this year?",
                "Unfortunately, Text Mining is not offered."]
mytext_test = ["I will have to take text mining next year."]

In [29]:
vectorizer = CountVectorizer()     #Initialize the vectorizer with default setting.
DTM_train = vectorizer.fit_transform(mytext_train)  #Construct and save the vocabulary and create DTM
pd.DataFrame(DTM_train.toarray(), columns = vectorizer.get_feature_names_out()) #Print DTM



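| | 2021 | are | fall | in | is | mining | not | offered | take | taking | text | this | unfortunately | will | year | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |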


In [30]:
DTM_test = vectorizer.transform(mytext_test)  #Use the saved vocabulary to create DTM
pd.DataFrame(DTM_test.toarray(), columns = vectorizer.get_feature_names_out()) #Print DTM



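| | 2021 | are | fall | in | is | mining | not | offered | take | taking | text | this | unfortunately | will | year | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |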


  * "next" appears in **mytext_test** but not in **mytext_train**. The vocabulary is constructed from **mytext_train** by **vectorizer.fit_transform**, so "next" isn't included in the vocabulary. Since **vectorizer.transform** directly uses the vocabulary constructed by **vectorizer.fit_transform** to create the DTM, it will not include "next" either. This is what we want.
  
  
  * "2021" appears in **mytext_train** but not in **mytext_test**, so "2021" is included in the saved vocabulary by **vectorizer.fit_transform**. Since **vectorizer.transform** directly uses the saved vocabulary to create DTM, it will still create a column for "2021" even if its frequency is zero in **mytext_test**.
  
  
  * This makes sure the DTM created by **vectorizer.transform** will have exactly the same columns as the DTM created by **vectorizer.fit_transform**. 
  
  
  * This is very important for predictive modeling because we will apply **vectorizer.fit_transform** to training data and apply **vectorizer.transform** to testing data, and the training and testing DTMs must have the same set of columns. 
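
The division of labor between the two methods can be mimicked with a toy class. The sketch below is a simplified, hypothetical stand-in written for this illustration (`ToyCountVectorizer` and its `vocab` attribute are not part of sklearn): `fit_transform` learns a sorted vocabulary and counts with it, while `transform` only counts with the vocabulary already stored.

```python
import re

class ToyCountVectorizer:
    """Hypothetical, simplified stand-in for CountVectorizer (illustration only)."""

    def _tokens(self, text):
        # Lowercase and keep tokens of two or more word characters.
        return re.findall(r"\b\w\w+\b", text.lower())

    def fit_transform(self, docs):
        # Learn the vocabulary from these documents, then count with it.
        self.vocab = sorted({t for d in docs for t in self._tokens(d)})
        return self.transform(docs)

    def transform(self, docs):
        # Reuse the stored vocabulary; terms outside it are ignored.
        rows = []
        for d in docs:
            tokens = self._tokens(d)
            rows.append([tokens.count(term) for term in self.vocab])
        return rows

mytext_train = ["I will take text mining in Fall 2021.",
                "Are you taking Text-Mining this year?",
                "Unfortunately, Text Mining is not offered."]
mytext_test = ["I will have to take text mining next year."]

vec = ToyCountVectorizer()
train_rows = vec.fit_transform(mytext_train)  # vocabulary comes from training data
test_rows = vec.transform(mytext_test)        # same columns, no new vocabulary

print(vec.vocab)
print(test_rows)
print(len(train_rows[0]) == len(test_rows[0]))
```

Calling `fit_transform` on the test data instead would learn a different vocabulary, and the resulting columns would no longer line up with those of the training DTM.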