<a href="https://colab.research.google.com/github/Walidsati/AAI_612O/blob/main/Week7/Notebook7.1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# AAI612: Deep Learning & its Applications

*Notebook 7.1: Text Preprocessing Using Scikit-learn*



The `CountVectorizer` provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary. You can use it as follows:

* Create an instance of the `CountVectorizer` class.
* Call the `fit()` method to learn a vocabulary from one or more documents.
* Call the `transform()` method on one or more documents as needed to encode each as a vector.

An encoded vector is returned with a length equal to the size of the vocabulary and an integer count of the number of times each word appeared in the document. Because these vectors contain many zeros, we call them sparse. SciPy provides an efficient way of handling sparse vectors in the `scipy.sparse` package. The vectors returned from a call to `transform()` are sparse, and you can convert them to NumPy arrays by calling `toarray()` to inspect them and better understand what is going on. Below is an example of using the `CountVectorizer` to tokenize, build a vocabulary, and then encode a document.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
corpus = ["The quick brown fox jumped over the lazy dog."]

### Create the transform

In [2]:
vectorizer = CountVectorizer()

### Tokenize and Build Vocabulary

In [3]:
vectorizer.fit(corpus)

Access the vocabulary to see what exactly was tokenized.  Notice that all words were made lowercase by default and that the punctuation was ignored:

In [4]:
vectorizer.vocabulary_

{'the': 7,
 'quick': 6,
 'brown': 0,
 'fox': 2,
 'jumped': 3,
 'over': 5,
 'lazy': 4,
 'dog': 1}

### Encode the document

In [5]:
vector = vectorizer.transform(corpus)

### Summarize encoded vector

In [6]:
print(vector.shape)
print(vector.toarray())

(1, 8)
[[1 1 1 1 1 1 1 2]]
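
To make the sparse storage concrete, the following is a minimal sketch that inspects the matrix returned by the vectorizer and pairs each count with its word. The names `docs`, `demo_vectorizer`, `demo_vector`, and `counts_by_word` are illustrative and not part of the original notebook.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumped over the lazy dog."]

# fit_transform() learns the vocabulary and encodes the documents in one call
demo_vectorizer = CountVectorizer()
demo_vector = demo_vectorizer.fit_transform(docs)

# The result is a SciPy sparse matrix that stores only the non-zero counts
print(issparse(demo_vector))
print(demo_vector.nnz)

# Order the vocabulary by column index so each count can be paired with its word
words = sorted(demo_vectorizer.vocabulary_, key=demo_vectorizer.vocabulary_.get)
counts_by_word = {
    word: int(count) for word, count in zip(words, demo_vector.toarray()[0])
}
print(counts_by_word)
```

With a single short document every vocabulary word is present, so nothing is saved here; the benefit of sparse storage shows up once the corpus has many documents and a large vocabulary.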


### Encode other documents

The same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored and contribute no count to the resulting vector. For example, the code below uses the vectorizer above to encode a document with one word that is in the vocabulary and one word that is not:

In [9]:
# encode another document
text2 = ["the puppy"]
vector = vectorizer.transform(text2)
print(vector.toarray())

[[0 0 0 0 0 0 0 1]]
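
One way to see which words were dropped is to tokenize the new document with the vectorizer's own analyzer and compare the tokens against the learned vocabulary. This is only a sketch; `demo_vectorizer`, `tokens`, and `unknown` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(["The quick brown fox jumped over the lazy dog."])

# build_analyzer() returns the callable that lowercases and tokenizes text
analyzer = demo_vectorizer.build_analyzer()
tokens = analyzer("the puppy")

# Tokens missing from the learned vocabulary get no column in the encoded vector
unknown = [token for token in tokens if token not in demo_vectorizer.vocabulary_]
print(tokens)
print(unknown)
```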


## Word Frequencies with TF-IDF

Word counts are a good starting point, but they are very basic. One issue with simple counts is that some words, like *the*, appear many times, and their large counts are not very meaningful in the encoded vectors. An alternative is to calculate word frequencies, and by far the most popular method is TF-IDF. This is an acronym for Term Frequency - Inverse Document Frequency, the two components of the score assigned to each word.
* **Term Frequency**: This summarizes how often a given word appears within a document.
* **Inverse Document Frequency**: This downscales words that appear a lot across documents.

Without going into the math, TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents. The `TfidfVectorizer` will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.

Alternatively, if you already have a fitted `CountVectorizer`, you can use it with a `TfidfTransformer` to just calculate the inverse document frequencies and start encoding documents. The same create, fit, and transform process is used as with the `CountVectorizer`.
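
The two-step route can be sketched as follows, assuming default settings for every class. The variable names (`docs`, `count_vectorizer`, `transformer`, and so on) are illustrative; the final comparison checks that the two-step result matches what `TfidfVectorizer` produces in one step.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = ["The quick brown fox jumped over the lazy dog.", "The dog.", "The fox"]

# Step 1: learn the vocabulary and produce raw word counts
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# Step 2: learn the inverse document frequencies from the counts and reweight them
transformer = TfidfTransformer()
tfidf_two_step = transformer.fit_transform(counts)

# One-step equivalent for comparison
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print(same)
```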

### List of text documents

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["The quick brown fox jumped over the lazy dog.", "The dog.", "The fox"]

### Create the transform

In [11]:
vectorizer = TfidfVectorizer()

### Tokenize and build vocab

In [12]:
vectorizer.fit(corpus)

### Summarize

In [13]:
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]


### Encode document

In [14]:
vector = vectorizer.transform([corpus[0]])

### Summarize encoded vector

In [15]:
print(vector.shape)
print(vector.toarray())

(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]


A vocabulary of 8 words is learned from the documents, and each word is assigned a unique integer index in the output vector. The inverse document frequencies are calculated for each word in the vocabulary, assigning the lowest score of 1.0 to the most frequently observed word: *the* at index 7. Finally, the first document is encoded as an 8-element sparse array, and we can review the final score of each word: *the*, *fox*, and *dog* receive different values from the other words in the vocabulary. By default each document vector is scaled to unit Euclidean (L2) length, so the scores fall between 0 and 1, and the encoded document vectors can then be used directly with most machine learning algorithms.
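
For readers who do want to see the math, the scores above can be reproduced by hand. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency of a word is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the word. All variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The quick brown fox jumped over the lazy dog.", "The dog.", "The fox"]

# Raw counts: one row per document, one column per vocabulary word
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Smoothed inverse document frequency (assumes the default smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF for the first document: count * idf, then scale to unit L2 length
raw_scores = counts[0] * idf
manual_scores = raw_scores / np.linalg.norm(raw_scores)

# Compare against the library's own result
reference = TfidfVectorizer().fit(docs)
print(np.allclose(idf, reference.idf_))
print(np.allclose(manual_scores, reference.transform([docs[0]]).toarray()[0]))
```

The word *the* appears in all three documents, so its inverse document frequency is `ln(4 / 4) + 1 = 1.0`, the lowest possible value under this formula.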

In [16]:
corpus = ["Artificial intelligence and machine learning are core components of data science."]

In [17]:
vectorizer = CountVectorizer()

In [18]:
vectorizer.fit(corpus)

In [19]:
vectorizer.vocabulary_

{'artificial': 2,
 'intelligence': 6,
 'and': 0,
 'machine': 8,
 'learning': 7,
 'are': 1,
 'core': 4,
 'components': 3,
 'of': 9,
 'data': 5,
 'science': 10}

In [20]:
vector = vectorizer.transform(corpus)

In [21]:
print(vector.shape)
print(vector.toarray())

(1, 11)
[[1 1 1 1 1 1 1 1 1 1 1]]


In [23]:
# encode another document
text2 = ["define machine learning"]
vector = vectorizer.transform(text2)
print(vector.toarray())

[[0 0 0 0 0 0 0 1 1 0 0]]
