# Introduction to NLP in Python
## Quest 2: Vectorization

### Bag-of-Words Model (BoW)


In Python, we can use the scikit-learn library to create BoW vectors from text data.

1. If it has not been installed, run the following line to install scikit-learn:

In [1]:
!pip install -U scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.1
    Uninstalling scikit-learn-1.2.1:
      Successfully uninstalled scikit-learn-1.2.1
Successfully installed scikit-learn-1.2.2


2. Import the necessary libraries:

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

3. Create an instance of the CountVectorizer class:

In [3]:
count_vectorizer = CountVectorizer()

4. Load your text data into a list or an array. Here, we have provided some sentences for you to work with. Feel free to add on to the sentences or come up with your own sample text to experiment more!

In [4]:
sample_text = [
    "Supervised learning is a type of machine learning",
    "Unsupervised learning is another type of machine learning",
    "Machine learning algorithms can be used for NLP",
    "NLP is a subfield of machine learning",
    "Sentiment analysis is a type of NLP application"
]

5. Next, fit the vectorizer and transform the text data. Fitting builds the vocabulary from the text, while transforming converts each sentence into its BoW representation. The `fit_transform` method does both in a single call.

In [5]:
bow = count_vectorizer.fit_transform(sample_text)

The resulting `bow` object is a sparse matrix that represents the text data in BoW form. Go ahead and print its contents (converted to a dense array with `toarray()`) to see how often each word in the vocabulary occurs in each sentence.

In [6]:
print(bow.toarray())

[[0 0 0 0 0 0 0 1 2 1 0 1 0 0 1 1 0 0]
 [0 0 1 0 0 0 0 1 2 1 0 1 0 0 0 1 1 0]
 [1 0 0 0 1 1 1 0 1 1 1 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 0 0 0]
 [0 1 0 1 0 0 0 1 0 0 1 1 1 0 0 1 0 0]]


Each row in the matrix corresponds to a sentence in the sample text, and each column corresponds to a word in the vocabulary. The values in the matrix represent the frequency of each word in each sentence.

That's it! You now know how to use the Bag-of-Words vectorizer in Python. **Head back to the StackUp platform, where we continue with another vectorization technique.**

<br></br>
### Term Frequency-Inverse Document Frequency (TF-IDF)

The scikit-learn library provides a convenient implementation of the TF-IDF algorithm. 
1. First, run the cell below to import the necessary libraries:

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

2. Define the text that you want to analyze. This can be a list or array of strings, or even a Pandas DataFrame containing text data for a larger dataset. 

Here, let's work with a slightly larger set of data than before. Feel free to include more sentences or import your own Pandas DataFrame!

In [8]:
text_data = [
    "Supervised learning is a type of machine learning.",
    "Unsupervised learning is another type of machine learning.",
    "Machine learning algorithms can be used for a wide range of applications, such as image recognition and natural language processing.",
    "One of the most popular machine learning algorithms is the decision tree.",
    "Random forest is an ensemble learning method that combines multiple decision trees.",
    "Support vector machines are a powerful class of machine learning algorithms used for classification and regression tasks.",
    "Neural networks are a type of machine learning algorithm inspired by the structure and function of the human brain.",
    "Deep learning is a subfield of machine learning that involves neural networks with many layers.",
    "Transfer learning is a technique in machine learning where a model trained on one task is used to improve performance on another related task.",
    "Natural Language Processing (NLP) is a subfield of machine learning that deals with the interaction between computers and humans in natural language.",
    "Text classification is an important task in NLP that involves assigning a document to one or more predefined categories.",
    "Named Entity Recognition (NER) is another important task in NLP that involves identifying and extracting entities such as people, organizations, and locations from text data."
]


3. Create an instance of the TfidfVectorizer class. You can specify any parameters you want to use. Some common parameters include:
- `max_features`: the maximum number of features (unique words) to include in the TF-IDF matrix
- `stop_words`: a list of words to exclude from the TF-IDF calculation, such as common stop words like "the" and "and"

If you are keen to know more, check out the other available parameters [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Feel free to experiment with the parameters and change their values in the vectorizer!

In [9]:
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')

4. Fit and transform the text into a TF-IDF matrix:

In [10]:
tfidf = tfidf_vectorizer.fit_transform(text_data)

5. Print the resulting matrix:

In [11]:
print(tfidf.toarray())

[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.52146104 0.         0.28202368 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.64158678
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.48673138 0.         0.
  0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.   

The resulting matrix is printed using the `toarray()` method, which converts the sparse matrix to a dense matrix for easier viewing.

That's it for the TF-IDF technique! **Switch back over to the StackUp platform, where we cover the third and final vectorization method.**

<br></br>
### Word Embeddings

Here, we will use the Gensim Python library to create word embeddings.

1. If it has not been installed, run the following line to install Gensim:

In [12]:
!pip install gensim


2. Import the necessary libraries:

In [13]:
import gensim
from gensim.models import Word2Vec
import nltk
from nltk import word_tokenize

3. Train a Word2Vec model using a tokenized text corpus. Word2Vec takes lists of tokens as its input, so the `word_tokenize` function from the previous quest is used here. We will reuse the `text_data` list from the TF-IDF section:

In [14]:
text_data = [
    "Supervised learning is a type of machine learning.",
    "Unsupervised learning is another type of machine learning.",
    "Machine learning algorithms can be used for a wide range of applications, such as image recognition and natural language processing.",
    "One of the most popular machine learning algorithms is the decision tree.",
    "Random forest is an ensemble learning method that combines multiple decision trees.",
    "Support vector machines are a powerful class of machine learning algorithms used for classification and regression tasks.",
    "Neural networks are a type of machine learning algorithm inspired by the structure and function of the human brain.",
    "Deep learning is a subfield of machine learning that involves neural networks with many layers.",
    "Transfer learning is a technique in machine learning where a model trained on one task is used to improve performance on another related task.",
    "Natural Language Processing (NLP) is a subfield of machine learning that deals with the interaction between computers and humans in natural language.",
    "Text classification is an important task in NLP that involves assigning a document to one or more predefined categories.",
    "Named Entity Recognition (NER) is another important task in NLP that involves identifying and extracting entities such as people, organizations, and locations from text data."
]

tokens_in_text = [nltk.word_tokenize(sentence.lower()) for sentence in text_data] 

model = Word2Vec(tokens_in_text, min_count=1)

The cell above converts each sentence to lowercase, splits it into individual tokens, and passes the resulting list of token lists into the Word2Vec model.

The `min_count` parameter tells the model to ignore all words with a total frequency lower than this value. With `min_count=1`, no words are ignored: even words that appear only once are kept in the vocabulary.

4. Once your model is trained, you can access the word embeddings for any word like this:

In [15]:
vector = model.wv['class']
print(vector)

[ 4.2433320e-03  4.1890572e-04 -9.6239574e-04 -3.2622248e-03
  4.9813832e-03 -9.1374218e-03  5.1592113e-03  5.7440903e-03
 -2.4666542e-03 -4.2066104e-03 -7.3827682e-03 -6.4108432e-03
 -5.1381472e-03  3.1858739e-03 -4.7803563e-03 -1.2228384e-03
 -8.9032417e-03 -3.5037154e-03 -8.0953800e-04 -8.0476720e-03
  3.2537808e-03  9.7621465e-03  4.4832961e-03  1.8484986e-03
 -4.8418064e-03 -1.1118491e-03  2.2252002e-03  1.7033379e-03
  8.7847020e-03 -9.2013582e-04  7.0528258e-03  5.7957778e-03
  3.1277214e-03  7.2565642e-03 -4.9505257e-03 -3.9364714e-03
  6.9563957e-03  8.5821832e-03  3.1465974e-03 -4.1753957e-03
  7.6282239e-03  2.6987009e-03 -2.6221786e-04  8.8245040e-03
  4.3581417e-03  3.9333361e-03 -3.9797589e-05 -8.9332564e-03
 -3.0806521e-03  2.4464622e-03 -5.6178658e-03  7.0744818e-03
  7.4318070e-03 -8.9356452e-03 -5.3773653e-03 -6.0308459e-03
 -6.4305880e-04 -1.8272620e-03 -2.0672381e-03 -2.2206558e-03
  3.5930267e-03  9.5866909e-03 -4.5388569e-03 -8.2720798e-03
  3.7101971e-03 -6.38759

This returns the vector that represents the word 'class' in the embedding space. The `wv` attribute of a Word2Vec model is short for "word vectors" and holds the trained embeddings.

And that sums up the three techniques for vectorization in NLP! **Return to the StackUp platform,** where we wrap up the quest and prepare the deliverables for submission.