<a href="https://colab.research.google.com/github/aygul0790/Bootcamp/blob/main/NLP_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Text Preprocessing
For any Machine Learning algorithm, data preprocessing is crucial, and this remains true for algorithms dealing with text

✍️ Text preprocessing is quite different from numerical preprocessing. The most common preprocessing tasks for textual data are:

- lowercasing
- dealing with numbers, punctuation, and symbols
- splitting
- tokenizing
- removing "stopwords"
- lemmatizing

<br>

### 💻 🧹 Basic cleaning with Python core string operations
When you have some unstructured text, you can already clean it with some Python built-in string operations

<br>

💻 ✂️ [strip](https://docs.python.org/3/library/stdtypes.html#str.strip) (1/2)

`strip` removes all the whitespaces at the beginning and the end of a string

In [None]:
texts = [
    '   Bonjour, comment ca va ?     ',
    '    Heyyyyy, how are you doing ?   ',
    '        Hallo, wie gehts ?     '
]
texts

['   Bonjour, comment ca va ?     ',
 '    Heyyyyy, how are you doing ?   ',
 '        Hallo, wie gehts ?     ']

In [None]:
[text.strip() for text in texts]

['Bonjour, comment ca va ?',
 'Heyyyyy, how are you doing ?',
 'Hallo, wie gehts ?']

💻 ✂️ strip (2/2)

You can also specify a "list" of characters (in the form of a single and unordered string) to be removed at the beginning and at the end of a string

In [None]:
text = "abcd Who is abcd ? That's not a real name!!! abcd"
text

"abcd Who is abcd ? That's not a real name!!! abcd"

In [None]:
text.strip('bdac')

" Who is abcd ? That's not a real name!!! "

💻 🔄 [replace](https://docs.python.org/3/library/stdtypes.html#str.replace)

In [None]:
text = "I love koalas, koalas are the cutest animals on Earth."
text

'I love koalas, koalas are the cutest animals on Earth.'

In [None]:
text.replace("koala", "panda")

'I love pandas, pandas are the cutest animals on Earth.'

💻 📏 [split](https://docs.python.org/3/library/stdtypes.html#str.split)

In [None]:
text = "linkin park / metallica /red hot chili peppers"

In [None]:
text.split("/")

['linkin park ', ' metallica ', 'red hot chili peppers']
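Note that `split` keeps the whitespace around each piece. A minimal sketch of how `split` and `strip` could be combined to get clean items (the variable name `bands` is only illustrative):

```python
text = "linkin park / metallica /red hot chili peppers"

# Split on the separator, then strip the whitespace left around each piece
bands = [band.strip() for band in text.split("/")]
print(bands)
```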

💻 🔡 Lowercase

Text modeling algorithms are case-sensitive (the capitalization of words carries meanings and contexts). Two words need to have the same casing to be considered equal.

In [None]:
text = "i LOVE football sO mUch. FOOTBALL is my passion. Who else loves fOOtBaLL ?"
text

'i LOVE football sO mUch. FOOTBALL is my passion. Who else loves fOOtBaLL ?'

In [None]:
text.lower()

'i love football so much. football is my passion. who else loves football ?'

💻 🔢 Numbers

✅ We can (and often should) remove numbers during the text preprocessing steps, especially for:

- text clustering
- collecting keyphrases

In [None]:
text = "i do not recommend this restaurant, we waited for so long, like 30 minutes, this is ridiculous"
text

'i do not recommend this restaurant, we waited for so long, like 30 minutes, this is ridiculous'

In [None]:
cleaned_text = ''.join(char for char in text if not char.isdigit())
cleaned_text

'i do not recommend this restaurant, we waited for so long, like  minutes, this is ridiculous'
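Removing the digits character by character leaves a double space where the number used to be. One possible alternative, sketched here with the standard `re` module, removes the digits and then collapses the leftover whitespace:

```python
import re

text = "i do not recommend this restaurant, we waited for so long, like 30 minutes, this is ridiculous"

# Remove every run of digits in one pass
without_digits = re.sub(r"\d+", "", text)

# Collapse the runs of whitespace left behind into single spaces
cleaned_text = re.sub(r"\s+", " ", without_digits).strip()
print(cleaned_text)
```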

💻 ❗️❓Punctuation and Symbols

Punctuation like ".?!" and symbols like "@#$" are not useful for topic modeling.

Punctuation is rarely used consistently on social media platforms.

Warning: you might want to keep punctuation and symbols for authorship attribution (the task of identifying the author of a given text)!

In [None]:
text = "I love bubble tea! OMG so #tasty @channel XOXO @$ ^_^ "
text

'I love bubble tea! OMG so #tasty @channel XOXO @$ ^_^ '

In [None]:
import string # "string" module is already installed with Python
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
for punctuation in string.punctuation:
    text = text.replace(punctuation, '')

text

'I love bubble tea OMG so tasty channel XOXO   '
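The loop above calls `replace` once per punctuation character. As a sketch of an equivalent single-pass approach, the built-in `str.maketrans` and `str.translate` can delete all of these characters at once:

```python
import string

text = "I love bubble tea! OMG so #tasty @channel XOXO @$ ^_^ "

# Build a translation table that maps every punctuation character to "deleted"
remove_punctuation = str.maketrans("", "", string.punctuation)

# Apply the table to the whole string in a single pass
cleaned_text = text.translate(remove_punctuation)
print(repr(cleaned_text))
```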

💻 Combination: `strip` + `lowercase` + `numbers` + `punctuation/symbols`

In [None]:
sentences = [
    "   I LOVE Pizza 999 @^_^",
    "  Study is amazing, take care - 666"
]

In [None]:
def basic_cleaning(sentence):
    sentence = sentence.lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())

    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')

    sentence = sentence.strip()

    return sentence

In [None]:
cleaned = [basic_cleaning(sentence) for sentence in sentences]
cleaned

['i love pizza', 'study is amazing take care']

💻 🔍 Removing Tags with [RegEx](https://regexr.com/)

We can remove HTML tags using RegEx:

In [None]:
import re

text = """<head><body>Hello HSLU!</body></head>"""
cleaned_text = re.sub('<[^<]+?>','', text)

print(cleaned_text)

Hello HSLU!


We can also extract e-mail addresses from a text:

In [None]:
import re

txt = '''
    This is a random text, authored by darkvador@gmail.com
    and batman@outlook.com, WOW!
'''

# A raw string (r'...') avoids "invalid escape sequence" warnings for \w and \.
re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', txt)

['darkvador@gmail.com', 'batman@outlook.com']

## 💻 Cleaning with NLTK

Natural Language Toolkit (NLTK) is an NLP library that provides preprocessing and modeling tools for text data

📚 [NLTK official website](https://www.nltk.org/)

🛠 [Installation Documentation](https://www.nltk.org/install.html)

### 💻 🧩 Tokenizing

Tokenizing is essentially splitting a sentence, a paragraph, or even an entire piece of text into smaller chunks, called tokens, such as individual words.

"Natural Language Processing"  →   ["Natural","Language","Processing"]

📚 [nltk.tokenize](https://www.nltk.org/api/nltk.tokenize.html)

🔅 Here is a quote often attributed to Aristotle:

In [None]:
text = 'It is during our darkest moments that we must focus to see the light'

text

'It is during our darkest moments that we must focus to see the light'

In [None]:
# When using nltk for the first time, we also need to download a few data packages (corpora and tokenizer models)

import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /home/aygul/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /home/aygul/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /home/aygul/nltk_data...
[nltk_data] Downloading package omw-1.4 to /home/aygul/nltk_data...
[nltk_data] Downloading package punkt_tab to /home/aygul/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
from nltk.tokenize import word_tokenize

word_tokens = word_tokenize(text)
print(word_tokens) # print displays the words in one line

['It', 'is', 'during', 'our', 'darkest', 'moments', 'that', 'we', 'must', 'focus', 'to', 'see', 'the', 'light']
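On a sentence without punctuation, a tokenizer gives the same result as `str.split`. The difference shows up once punctuation is present: a tokenizer typically separates punctuation marks into their own tokens. The sketch below illustrates this with a simple regular expression; it is only a rough approximation for illustration and not the algorithm used by `word_tokenize`.

```python
import re

text = "Wait, is it over?"

# Splitting on whitespace leaves punctuation glued to the words
whitespace_tokens = text.split()

# A rough tokenizer: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```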


💻 🛑 Stopwords

Stopwords are words that are used so frequently that they don't carry much information, especially for topic modeling

NLTK has a built-in corpus of English stopwords that can be loaded and used

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english')) # you can also choose other languages

Here is an example of a tokenized sentence:

In [None]:
tokens = ["i", "am", "going", "to", "go", "to", "the",
        "club", "and", "party", "all", "night", "long"]

❓ What stopwords could be removed ❓

In [None]:
stopwords_removed = [w for w in tokens if w in stop_words]
stopwords_removed

['i', 'am', 'to', 'to', 'the', 'and', 'all']

❓ What are the meaningful words in this sentence ❓

In [None]:
tokens_cleaned = [w for w in tokens if w not in stop_words]
tokens_cleaned

['going', 'go', 'club', 'party', 'night', 'long']

👉 What if you are not going to the party?

😱 "not" is also considered as a stopword

✅ Removing stopwords is useful for:

- topic modeling

❌ Dangerous for:

- sentiment analysis
- authorship attribution
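When negations matter (for sentiment analysis, for instance), one option is to remove them from the stopword set before filtering. The sketch below uses a small hand-written stopword set as a stand-in for `stopwords.words('english')`; both this set and the `negations` set are assumptions made for illustration.

```python
# Small illustrative stopword set (an assumption standing in for the NLTK list,
# which also contains "not")
stop_words = {"i", "am", "not", "to", "the", "and", "all"}

# Words we want to keep because they flip the meaning of a sentence
negations = {"not", "no", "nor"}
stop_words_keep_negations = stop_words - negations

tokens = ["i", "am", "not", "going", "to", "the", "party"]
tokens_cleaned = [w for w in tokens if w not in stop_words_keep_negations]
print(tokens_cleaned)
```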

### 💻 🧬 Lemmatizing
Lemmatizing is a technique that reduces words to their dictionary form (the lemma), in order to group them by their meaning rather than by their exact form

![lemmatizing](https://github.com/aygul0790/Bootcamp/blob/main/pics/stem_lemma.png?raw=1)

📚 [nltk.stem - WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)

👇 Look at the following sentence:

In [None]:
sentence = 'He was RUNNING and EATING at the same time =[. He has a bad habit of swimming after playing 3 hours in the Sun =/'

In [None]:
sentence

'He was RUNNING and EATING at the same time =[. He has a bad habit of swimming after playing 3 hours in the Sun =/'

🗓 Let's apply the following steps:

- Basic cleaning
- Tokenizing
- Removing stopwords (if not doing sentiment analysis!)
- Lemmatizing


🧹 Step 1: Basic Cleaning

In [None]:
cleaned_sentence = basic_cleaning(sentence)
cleaned_sentence

'he was running and eating at the same time  he has a bad habit of swimming after playing  hours in the sun'

💻 🧩 Step 2 : Tokenize

In [None]:
tokenized_sentence = word_tokenize(cleaned_sentence)
print(tokenized_sentence)

['he', 'was', 'running', 'and', 'eating', 'at', 'the', 'same', 'time', 'he', 'has', 'a', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'hours', 'in', 'the', 'sun']


🛑 Step 3: Remove Stopwords

In [None]:
tokenized_sentence_no_stopwords = [w for w in tokenized_sentence if w not in stop_words]
print(tokenized_sentence_no_stopwords)

['running', 'eating', 'time', 'bad', 'habit', 'swimming', 'playing', 'hours', 'sun']


💻 🧬 Step 4: Lemmatizing

📚 [WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)

[Lemmatization with NLTK](https://www.geeksforgeeks.org/python-lemmatization-with-nltk/)

In [None]:
from nltk.stem import WordNetLemmatizer

# Create the lemmatizer once and reuse it for every word
lemmatizer = WordNetLemmatizer()

# 1 - Lemmatizing the verbs
verb_lemmatized = [
    lemmatizer.lemmatize(word, pos="v")  # v --> verbs
    for word in tokenized_sentence_no_stopwords
]

# 2 - Lemmatizing the nouns (applied to the output of step 1)
noun_lemmatized = [
    lemmatizer.lemmatize(word, pos="n")  # n --> nouns
    for word in verb_lemmatized
]

✅ Lemmatizing is useful for:

- topic modeling
- sentiment analysis

## Preprocessing Text - Takeaways

First of all, we can perform some pre-cleaning operations on the pieces of text of a corpus using Python built-in functions such as:

- ✂️ strip
- 🔄 replace
- 📏 split
- 🔡 lowercase
- 🔢 removing numbers
- ❗️ removing punctuation and symbols

Next, we can apply preprocessing techniques to prepare the pieces of text for NLP algorithms

- 🧩 Tokenizing
- 🛑 Removing stopwords
- 🧬 Lemmatizing


🤔 Now that the text is preprocessed, how can it be analyzed by Machine Learning algorithms?

## 2. Vectorizing

🤖 Machine Learning algorithms cannot process raw text: it must first be converted into numbers

**Vectorizing** = the process of converting raw text into a numerical representation

There are multiple vectorizing techniques. Among them, we will present:

- `Bag-of-Words`
- `Tf-idf`
- `N-grams`


### 2.1. Bag-of-Words representation

**Bag-of-Words (BoW) representation** is one of the simplest and most effective ways to represent text for Machine Learning models.

When using this representation, we are simply counting how often each word appears in each document of a corpus.

The count for each word becomes a feature.
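To make the idea concrete, here is a minimal hand-rolled sketch that uses only the standard library. It splits on whitespace, which is a simplifying assumption: a real vectorizer applies its own tokenization rules.

```python
from collections import Counter

texts = [
    'the young dog is running with the cat',
    'your cat is young',
]

# Vocabulary: every distinct word in the corpus, sorted alphabetically
vocabulary = sorted({word for text in texts for word in text.split()})

# One row per document, one column (feature) per vocabulary word
bag_of_words = []
for text in texts:
    counts = Counter(text.split())
    bag_of_words.append([counts[word] for word in vocabulary])

print(vocabulary)
print(bag_of_words)
```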



💻 `CountVectorizer`

In Scikit-Learn, there is a tool called `CountVectorizer` to generate bag-of-words representations of a set of texts

👉 `CountVectorizer` converts a collection of text documents into a matrix of token counts

📚 [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

👇 Look at the following sentences:

In [None]:
texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat'
]

Let's apply the CountVectorizer to generate a Bag-of-Words representation of these four sentences

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(texts)
X.toarray()

array([[1, 1, 0, 0, 0, 1, 1, 2, 1, 1, 0],
       [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],
       [3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0]])

🤔 Can you guess which column represents which word?

🔥 As soon as the `CountVectorizer` is fitted to the text, you can retrieve all the words seen with `get_feature_names_out()`:

In [None]:
count_vectorizer.get_feature_names_out()

array(['cat', 'dog', 'for', 'good', 'health', 'is', 'running', 'the',
       'with', 'young', 'your'], dtype=object)

In [None]:
import pandas as pd

vectorized_texts = pd.DataFrame(
    X.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = texts
)

display(vectorized_texts)

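| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 0 |
| running is good for your health | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| your cat is young | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |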


Be aware that there are some limitations when it comes to the bag-of-words representation:

❌ A BoW does NOT take into account the order of the words  →   hence the name `"Bag of Words"`

❌ A BoW does NOT take into account a document's length  →   `Tf-idf` to the rescue

❌ A BoW does NOT capture document context  →   `N-gram` to the rescue

### 2.2. `Tf-idf` Representation

Term Frequency (`tf`) & `CountVectorizer`

*Idea: The more often a word appears in a document relative to others, the more likely it is that it will be important to this document*

Example: if the word "elections" appears relatively frequently in a document, this document likely deals with politics.



The frequency of a word $x$ in a document $d$ is called **term frequency**, and is denoted by:

$$
tf_{x,d} = \dfrac{\text{Number of times term } x \text{ appears in document } d}{\text{Total number of terms in document } d}
$$


❓ In our last example, could we compute $tf_{young, document4}$ ❓

In [None]:
vectorized_texts

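| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 0 |
| running is good for your health | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| your cat is young | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |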


$tf_{young, document4} = \dfrac{5 \text{ counts of "young"}}{8 \text{ total words}} = 0.625 $

Document Frequency (`df`)

*Idea: If a word appears in many documents of a corpus, however, it shouldn't be that important to understand a particular document.*

Example: on eurosport.com/football, the word "football" appears in every article, so on this website it tells us very little about any particular article!

The number of documents $d$ in a corpus containing the word $x$ is called document frequency (df), and is denoted by $df_{x}$

❓ In our last example, could we compute $df_{cat}$, $df_{young}$, $df_{the}$ ❓

In [None]:
# Compute document frequency (DF)
document_frequency = (vectorized_texts > 0).sum(axis=0)

# Convert DF into a DataFrame format
document_frequency = pd.DataFrame([document_frequency], index=["Document Frequency"])

# Display the DataFrame
display(document_frequency)


| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Document Frequency | 3 | 1 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 3 | 2 |


If a word $x$ appears in too many documents of a corpus - i.e. if the document frequency $df_{x}$ is too high - the word $x$ won't help us with topic modeling and should be considered irrelevant.

Example: on eurosport.com/football/, the word "football" won't help us distinguish two articles, one dealing mainly with strategy and another one talking about referee best practices!

What if we considered the **relative document frequency** of a word $x$, which can be computed as:

$
\dfrac{df_x}{N}
$

where:
- $df_x$ is the number of documents $d$ containing the word $x$,
- $N$ is the total number of documents in a corpus.

For the word "football" on Eurosport, we would expect this formula to be close to 1 since the number of docs containing the word "football" will probably only be slightly less than the total number of docs (out of 100 maybe only 5 don't have the word "football", so we get 95/100).

Idea: A word $x$ in a corpus of texts will be considered important when its **(relative) document frequency** is **low** ⇔ its inverse document frequency $\dfrac{N}{df_x}$ is high.

Again, if the word "football" appears in all the articles, it is not very useful for distinguishing between two articles. But if only a few documents contain words like "concussion" or "wellbeing" (e.g. they appear in 2/100 articles), these words will be much more useful in determining the topic of those articles (they are probably specifically about player welfare).

**Tf-idf Formula**

💡 Thus the intuition of the `term frequency - inverse document frequency` approach is to give a high weight to any term which appears frequently in a single document, but not in too many documents of the corpus.

The weight of a word $x$ in a document $d$ is given by:

$$
w_{x,d} = tf_{x,d} \times \left[ \log \left( \frac{N + 1}{df_x + 1} \right) + 1 \right]
$$

where:

- $tf_{x,d}$ = $ \dfrac{\text{Number of occurrences of word } x \text{ in document } d}{\text{Total number of words in document } d} $

- $df_x$ = Number of documents $d$ containing the word $x$
- $N$ = Total number of documents in a corpus
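As a worked example, the sketch below plugs the numbers from the four example sentences into the formula above. The function name `tf_idf_weight` is hypothetical, and the use of the natural logarithm is an assumption (the formula does not state the base).

```python
import math

def tf_idf_weight(term_count, doc_length, n_docs, doc_freq):
    """Weight of a term in a document, following the formula above."""
    tf = term_count / doc_length
    # Natural logarithm (an assumption about the base of the log)
    idf = math.log((n_docs + 1) / (doc_freq + 1)) + 1
    return tf * idf

# "young" in document 4: 5 of the 8 words, and 3 of the 4 documents contain it
w_young = tf_idf_weight(term_count=5, doc_length=8, n_docs=4, doc_freq=3)

# "cat" in document 4: 3 of the 8 words, and 3 of the 4 documents contain it
w_cat = tf_idf_weight(term_count=3, doc_length=8, n_docs=4, doc_freq=3)

print(round(w_young, 4), round(w_cat, 4))
```

Both words have the same document frequency, so the difference in weight comes only from how often each appears in the document.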

### 2.3. 💻 TfidfVectorizer

`raw documents`  →   `matrix of tf-idf features`

📚 [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [None]:
texts

['the young dog is running with the cat',
 'running is good for your health',
 'your cat is young',
 'young young young young young cat cat cat']

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Instantiating the TfidfVectorizer
tf_idf_vectorizer = TfidfVectorizer()

# Training it on the texts
weighted_words = pd.DataFrame(tf_idf_vectorizer.fit_transform(texts).toarray(),
                 columns = tf_idf_vectorizer.get_feature_names_out())

weighted_words

| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.227904 | 0.357056 | 0.0 | 0.0 | 0.0 | 0.227904 | 0.281507 | 0.714112 | 0.357056 | 0.227904 | 0.0 |
| 1 | 0.0 | 0.0 | 0.463709 | 0.463709 | 0.463709 | 0.29598 | 0.365594 | 0.0 | 0.0 | 0.0 | 0.365594 |
| 2 | 0.470063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.470063 | 0.0 | 0.0 | 0.0 | 0.470063 | 0.580622 |
| 3 | 0.514496 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.857493 | 0.0 |
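These values can be reproduced by hand. With its default settings, `TfidfVectorizer` multiplies the raw count of each word by the smoothed inverse document frequency and then scales each row to unit Euclidean (L2) length. That last step explains why the numbers differ from a direct application of the formula: dividing counts by the document length only rescales a row, and the normalization cancels that out. The sketch below is a standard-library approximation under those assumptions (whitespace tokenization is enough here because every word is lowercase and at least two characters long); it is not scikit-learn's implementation.

```python
import math
from collections import Counter

texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat',
]

tokenized = [text.split() for text in texts]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_row(tokens):
    """Return {word: weight} for one document, scaled to unit L2 norm."""
    counts = Counter(tokens)
    # Raw count times smoothed idf
    raw = {
        word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word, count in counts.items()
    }
    # L2 normalization of the row
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
print(round(rows[3]['cat'], 6), round(rows[3]['young'], 6))
```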


**Controlling the vocabulary size**:

In every language, there are many words used in everyday vocabulary:

- 🇬🇧 English: ~20,000 words
- 🇫🇷 French: ~20,000 words
- 🇩🇪 German: ~20,000 words

In a document, we can't afford to vectorize every word!

We can, however, control the number of words to be vectorized (*curse of dimensionality*!):

👉 Scikit-Learn allows us to customize the `CountVectorizer` and `TfidfVectorizer` with key parameters to control vocabulary size.

💻 Key parameters of `TfidfVectorizer` (and `CountVectorizer`)
- `max_df/min_df`
- `max_features`

💻 `max_df` (resp. `min_df`)

*When building the vocabulary, `CountVectorizer` and `TfidfVectorizer` will remove terms which have a document frequency strictly higher (resp. lower) than the given threshold. `max_df` and `min_df` help us build corpus-specific stopword lists.*

Example: when classifying pieces of text into "basketball" or "football", the word "ball" would appear too often and would be useless for this classification, it would be better to filter it out using `max_df`

**How to use these parameters in practice?**

`max_df` (`min_df`) can be either a float between 0.0 and 1.0 (a proportion of documents) or an integer (an absolute number of documents)

- `max_df` (`min_df`) = 0.5  ⇔   "ignore terms that appear in more (fewer) than 50% of the documents"
- `max_df` (`min_df`) = 20  ⇔   "ignore terms that appear in more (fewer) than 20 documents"

By default, `max_df` = 1.0  ⇔  no "frequent" word will be removed

By default, `min_df` = 1  ⇔   no "infrequent" word will be removed

In [None]:
# Document frequency of each word (number of documents containing it)
document_frequency

| | cat | dog | for | good | health | is | running | the | with | young | your |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Document Frequency | 3 | 1 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 3 | 2 |


In [None]:
# Instantiate the CountVectorizer with max_df = 2
count_vectorizer = CountVectorizer(max_df = 2) # removing "cat", "is", "young"

# Train it
X = count_vectorizer.fit_transform(texts)
X = pd.DataFrame(
    X.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = texts
)

X

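| | dog | for | good | health | running | the | with | your |
|---|---|---|---|---|---|---|---|---|
| the young dog is running with the cat | 1 | 0 | 0 | 0 | 1 | 2 | 1 | 0 |
| running is good for your health | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| your cat is young | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| young young young young young cat cat cat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |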
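The filtering rule can be sketched by hand with the standard library: compute the document frequency of each word, then keep only the words whose document frequency does not exceed the threshold. The names `kept` and `removed` are only illustrative, and whitespace splitting is a simplifying assumption.

```python
from collections import Counter

texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat',
]

# Document frequency: count each word at most once per document
df = Counter(word for text in texts for word in set(text.split()))

max_df = 2  # integer threshold: an absolute number of documents

# Words with a document frequency strictly higher than max_df are dropped
kept = sorted(word for word, freq in df.items() if freq <= max_df)
removed = sorted(word for word, freq in df.items() if freq > max_df)

print(kept)
print(removed)
```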


💻 max_features

By specifying `max_features` = $k$ (k being an integer), the `CountVectorizer` (or the `TfidfVectorizer`) will build a vocabulary that only considers the top $k$ tokens ordered by term frequency across the corpus.

**How to use "max_features" in practice?**

In [None]:
# CountVectorizer with the 3 most frequent words
count_vectorizer = CountVectorizer(max_features = 3)

X = count_vectorizer.fit_transform(texts)
X = pd.DataFrame(
    X.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = texts
)

X

| | cat | is | young |
|---|---|---|---|
| the young dog is running with the cat | 1 | 1 | 1 |
| running is good for your health | 0 | 1 | 0 |
| your cat is young | 1 | 1 | 1 |
| young young young young young cat cat cat | 3 | 0 | 5 |
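To see why "cat", "is" and "young" are selected, the total count of each word across the corpus can be computed by hand. This is a standard-library sketch with whitespace tokenization (a simplifying assumption); the names `term_counts` and `top_3` are illustrative.

```python
from collections import Counter

texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat',
]

# Total number of occurrences of each word across the whole corpus
term_counts = Counter(word for text in texts for word in text.split())

# Keep the 3 most frequent words, then sort them alphabetically
# to match the column order of the vectorizer output
top_3 = [word for word, _ in term_counts.most_common(3)]
vocabulary = sorted(top_3)

print(term_counts.most_common(3))
print(vocabulary)
```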


✅ Advantages of the `Tf-idf` representation:

- Using relative frequency rather than count is robust to document length
- Takes into account the context of the whole corpus

❌ Disadvantages of the `Tf-idf` representation:

- Like the `BoW`, `Tf-idf` does NOT capture the **within-document context** →  `N-gram` helps here
- Like the `BoW`, the word order is completely disregarded

### 2.4. `N-grams`

Example: the two following sentences have opposite meanings but the exact same bag-of-words representation:

In [None]:
actors_movie = [
    "I like the movie but NOT the actors",
    "I like the actors but NOT the movie"
]

In [None]:
# Vectorize the sentences
count_vectorizer = CountVectorizer()
actors_movie_vectorized = count_vectorizer.fit_transform(actors_movie)

# Show the representations in a nice DataFrame
actors_movie_vectorized = pd.DataFrame(
    actors_movie_vectorized.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = actors_movie
)

# Show the vectorized movies
actors_movie_vectorized

| | actors | but | like | movie | not | the |
|---|---|---|---|---|---|---|
| I like the movie but NOT the actors | 1 | 1 | 1 | 1 | 1 | 2 |
| I like the actors but NOT the movie | 1 | 1 | 1 | 1 | 1 | 2 |


When using a `bag-of-words` representation, an efficient way to capture context is to consider:

- the count of single tokens (unigrams)
- the count of pairs (bigrams), triplets (trigrams), and more generally sequences of $n$ words, also known as `n-grams`

Examples:

- "mathematics" is a unigram (n = 1)
- "machine learning" is a bigram (n = 2)
- "natural language processing" is a trigram (n = 3)
- "deep convolutional neural networks" is a 4-gram (n = 4)

💻 `ngram_range`

In both `CountVectorizer` and `TfidfVectorizer`, you can specify the length of your sequences with the parameter `ngram_range` = (`min_n`, `max_n`).

Examples:

- ngram_range = (1, 1) 👉 (by default) will only capture the unigrams (single words)
- ngram_range = (1, 2) 👉 will capture the unigrams, and the bigrams
- ngram_range = (1, 3) 👉 will capture the unigrams, the bigrams, and the trigrams
- ngram_range = (2, 3) 👉 will capture the bigrams, and the trigrams but not the unigrams
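To illustrate what an n-gram is, here is a minimal sketch that builds n-grams from a list of tokens. The helper names `ngrams` and `ngrams_in_range` are hypothetical and only mimic the idea behind `ngram_range`; they are not part of Scikit-Learn.

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngrams_in_range(tokens, min_n, max_n):
    """Collect the n-grams for every n between min_n and max_n (inclusive)."""
    return [gram for n in range(min_n, max_n + 1) for gram in ngrams(tokens, n)]

tokens = "like the movie but not the actors".split()

bigrams = ngrams(tokens, 2)
uni_and_bigrams = ngrams_in_range(tokens, 1, 2)

print(bigrams)
print(len(uni_and_bigrams))
```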

With a unigram vectorization, we couldn't distinguish two sentences with the same words.

In [None]:
actors_movie_vectorized

| | actors | but | like | movie | not | the |
|---|---|---|---|---|---|---|
| I like the movie but NOT the actors | 1 | 1 | 1 | 1 | 1 | 2 |
| I like the actors but NOT the movie | 1 | 1 | 1 | 1 | 1 | 2 |


 What about a **bigram vectorization**?

In [None]:
# Vectorize the sentences
count_vectorizer_n_gram = CountVectorizer(ngram_range = (2,2)) # BI-GRAMS
actors_movie_vectorized_n_gram = count_vectorizer_n_gram.fit_transform(actors_movie)

# Show the representations in a nice DataFrame
actors_movie_vectorized_n_gram = pd.DataFrame(
    actors_movie_vectorized_n_gram.toarray(),
    columns = count_vectorizer_n_gram.get_feature_names_out(),
    index = actors_movie
)

# Show the vectorized movies with bigrams
actors_movie_vectorized_n_gram

| | actors but | but not | like the | movie but | not the | the actors | the movie |
|---|---|---|---|---|---|---|---|
| I like the movie but NOT the actors | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| I like the actors but NOT the movie | 1 | 1 | 1 | 0 | 1 | 1 | 1 |


👍 The two sentences are now distinguishable

#### **Vectorizing - Takeaways**

We covered two vectorizers:
- `CountVectorizer` (counting)
- `TfidfVectorizer` (weighting: takes the document length into consideration)

The most important parameters of these vectorizers are:
- `min_df` (infrequent words)
- `max_df` (frequent words)
- `max_features` (curse of dimensionality)
- `ngram_range` = (`min_n`, `max_n`) (capturing the context of the words)