<h1 align="center"><b>Feature Engineering</b></h1>

<h3><div align="right">Ehtisham Sadiq</div></h3>    


<img align="center" width="900"  src="images/phase3.png"  > 

# Learning Agenda of this Notebook

- **Overview of Feature Engineering and Vectorization**
  
- **Frequency or Statistical Based Approaches**
  - Label Encoding
  - One-Hot Encoding
  - Bag of Words
  - Bag of n-grams
  - TF-IDF

- **Prediction-Based Approaches (Embeddings)**
  - Word2Vec — From Google
  - FastText — From Facebook

# <span style='background :lightgreen' > Overview of Feature Engineering and Text Vectorization </span>

## a. What do you mean by `Feature` in Machine Learning?

<img src="images/pca5.png" style="display: block; margin: 0 auto;" height="400px" width="900px">

<br/>

> **Feature engineering involves identifying and transforming attributes `(features)` from raw data into a format suitable for machine learning models. Think of it as crafting the input that powers AI systems.**

Imagine predicting house prices. `Features` could include:
- The city it’s in.
- The size of the house.
- The number of bedrooms.

## b. What do you mean by `Feature` in Natural Language Processing?

<h3 align="center"><div class="alert alert-success" style="margin: 20px">Andrew Ng YouTube lectures are amazing.</div></h3>

Similarly, in `NLP`:
- A `feature` can be a single word, phrase, sentence, or even an entire document.
- Our job is to represent these features as numbers. Why? Because machine learning models can only understand numbers, not plain text.

## c. What is Feature Engineering or Text Vectorization?

> **It’s the process of converting text into vectors (arrays of numbers) that represent the meaning or frequency of the text.**

<img src="images/text-rep3.png" style="display: block; margin: 0 auto;" height="500px" width="900px">

####  Feature Extraction from Text (Text Representation or Text Vectorization)

The process of converting text data into vectors of real numbers is called **Feature Extraction from Text** or **Text Representation/Text Vectorization**. The goal of this process is to transform textual data into numerical form, ensuring that these numbers can capture the semantic or contextual meaning of the words.

In simple terms:
- **Text Vectorization** converts words or phrases into a format that machines can process, such as vectors of numbers.
- These numbers represent the meaning or attributes of the text in a mathematical form.


**Example**
The example demonstrates how different textual entities can be represented numerically based on their attributes or properties:
1. Each vector contains numerical representations of features such as:
   - **Person** (1 or 0, denoting if it is a person)
   - **Healthy/Fit** (a continuous value representing fitness level)
   - **Location** (binary representation indicating presence as a location)
   - **Has two eyes** (binary attribute indicating presence of this feature)
   - **Has Government** (binary attribute for governmental entities)

2. The vectors in the image:
   - Blue vector `[1, 0.8, 0, 1, 0]`: Represents a person who is healthy, not a location, has two eyes, and does not represent a government.
   - Orange vector `[1, 0.9, 0, 1, 0]`: Another individual with similar attributes but a higher "Healthy/Fit" value.
   - Green vector `[0, 0.6, 1, 0, 1]`: Represents a location entity with medium fitness value, does not have two eyes, but represents a government.

This approach ensures that text entities are numerically encoded for further machine learning or natural language processing tasks.
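
As a rough illustration of why such numeric encodings are useful, the three vectors from the example can be compared directly. This is only a sketch: the variable names and the choice of Euclidean distance are assumptions made here, not something prescribed by the figure.

```python
import numpy as np

# Feature order: [person, healthy/fit, location, has two eyes, has government]
blue = np.array([1, 0.8, 0, 1, 0])
orange = np.array([1, 0.9, 0, 1, 0])
green = np.array([0, 0.6, 1, 0, 1])

def euclidean_distance(u, v):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

# The two "person" vectors should be much closer to each other
# than either of them is to the "location/government" vector.
print("blue vs orange:", euclidean_distance(blue, orange))
print("blue vs green: ", euclidean_distance(blue, green))
```

Entities with similar attributes end up close together, which is the property later vectorization techniques try to achieve for text.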


**Importance**
- **Why Feature Extraction from Text?**
   - Converts unstructured textual data into structured numerical data.
   - Enables the use of machine learning models for text classification, clustering, and other NLP tasks.
   - Represents textual data in a way that preserves semantic meaning and relationships.

## d. Why to do Text Vectorization?


<img src="images/text-rep2.png" style="display: block; margin: 0 auto;" height="900px" width="1300px">

1. **Convert Text to Numerical Representations**  
   - Text vectorization transforms textual features like words, phrases, or entire documents into numerical vectors, enabling machine learning algorithms to process and analyze the data.  
   - Example: A document can be represented by how often the words "football" and "basketball" occur in it, e.g., as the numeric vector `[3, 1]`.

2. **Enable Mathematical and Statistical Operations**  
   - Text vectors allow for mathematical operations such as addition, subtraction, and similarity measurement.  
   - This enables algorithms to assess relationships between words, phrases, or documents effectively.

3. **Facilitate Document Similarity Analysis**  
   - By representing documents as vectors, their similarity can be calculated using methods like **cosine similarity**.  
   - Example: Two document vectors separated by an angle close to 0 degrees are very similar, while those separated by a larger angle (e.g., 90 degrees) are dissimilar.

4. **Handle High-Dimensional Text Features**  
   - Although visualizing low-dimensional vectors is simple, real-world text data involves thousands of features.  
   - Vector comparison measures like cosine similarity enable efficient handling and comparison of such high-dimensional data.


#### **Difference Between Cosine Similarity and Cosine Distance**

##### **Cosine Similarity**
- **Definition**: Measures the cosine of the angle between two vectors in a multidimensional space.
- **Range**: In general, values range from -1 to 1. For vectors with non-negative entries (such as word counts or TF-IDF weights), values range from 0 to 1:
  - `1`: Vectors point in the same direction (angle = 0 degrees), regardless of their lengths.
  - `0`: Vectors are orthogonal, i.e., the documents share no terms (angle = 90 degrees).

##### **Cosine Distance**
- **Definition**: Measures the dissimilarity between two vectors by subtracting the cosine similarity from 1.
- **Range**: In general, values range from 0 to 2. For vectors with non-negative entries, values range from 0 to 1:
  - `0`: Vectors point in the same direction.
  - `1`: Vectors are orthogonal (no terms in common).
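
The two measures can be sketched in a few lines of NumPy. The helper names and the small example vectors below are illustrative assumptions; for count-style vectors with non-negative entries the similarity stays between 0 and 1.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """Dissimilarity defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

doc_a = [3, 1]   # e.g. counts of two words in a document
doc_b = [6, 2]   # same direction as doc_a, only longer
doc_c = [0, 5]   # mostly about the second word

print("sim(a, b):", cosine_similarity(doc_a, doc_b))
print("sim(a, c):", cosine_similarity(doc_a, doc_c))
print("dist(a, c):", cosine_distance(doc_a, doc_c))
```

Note that `doc_a` and `doc_b` differ in length but not in direction, so their similarity is 1: cosine similarity ignores document length and compares only the mix of terms.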

#### **When to Use Each**

1. **Cosine Similarity**
   - **Use Case**: When you want to **compare similarity** between text data or documents.  
     Examples:
       - Ranking search results based on query-document similarity.
       - Measuring how similar two product reviews or news articles are.
   - **Scenario**: Useful in recommendation systems, document clustering, and NLP tasks where identifying closeness in meaning is the goal.

2. **Cosine Distance**
   - **Use Case**: When you want to focus on **dissimilarity** or differences between vectors.  
     Examples:
       - Finding outliers in datasets.
       - Evaluating how distinct one document is from another.
   - **Scenario**: Useful in anomaly detection, classification tasks, or when penalizing differences is critical.

## e. How to do Text Vectorization?


<img src="images/vec-techniques.jpg" style="display: block; margin: 0 auto;" height="600px" width="1000px">



##### **1. Frequency or Statistical-Based Approaches**
These methods rely on counting or transforming the frequency of words in the text.

- **Label Encoding**: Assigns unique integers to each word.
- **One-Hot Encoding**: Creates binary vectors with one "hot" (1) value for the presence of a word and 0 elsewhere.
- **Bag of Words**: Represents text by the frequency of words, ignoring grammar and word order.
- **Bag of N-Grams**: Considers combinations of words (n-grams) to capture more context.
- **TF-IDF**: Combines term frequency (TF) and inverse document frequency (IDF) to weight words based on their importance in the document.



##### **2. Prediction-Based Approaches (Embeddings)**
These approaches rely on neural networks to generate dense vector representations of words.

- **Word2Vec**: Developed by Google, uses shallow neural networks to create word embeddings.
- **FastText**: Developed by Facebook, extends Word2Vec by capturing subword information, making embeddings useful for rare words.



##### **Process Workflow for Text Vectorization**
The process of text vectorization involves the following steps:

1. **Corpus Preparation**:
   - Start with raw textual data or a document collection.

2. **Preprocessing**:
   - **Splitting**: Split text into smaller components like sentences or words.
   - **Noise Removal**: Remove unnecessary characters, symbols, or stopwords.
   - **Normalization**: Standardize text (e.g., convert to lowercase).

3. **Tokenization**:
   - Split the text into individual words or tokens.

4. **Token-ID Mapping**:
   - Map tokens to their respective IDs using:
     - Vocabulary lookup for known words.
     - Feature hashing for unknown or rare words.

5. **Vectorization Outputs**:
   - **One-Hot Encoding**
   - **Count Vectors (+ TF-IDF)**
   - **Word Embeddings**
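
The first steps of this workflow (preprocessing, tokenization, and building a token-ID mapping) can be sketched in plain Python. The stop-word list, the regular expression, and the function names are illustrative assumptions, not a standard pipeline.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list)
STOPWORDS = {"the", "is", "a", "an", "and", "of"}

def preprocess(text):
    """Normalize, remove noise, tokenize, and drop stop words."""
    text = text.lower()                        # normalization
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # noise removal: punctuation and symbols
    tokens = text.split()                      # tokenization on whitespace
    return [t for t in tokens if t not in STOPWORDS]

def build_vocabulary(token_lists):
    """Assign each new token the next free integer ID."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

raw_docs = ["The person is healthy and fit!", "A government of the location."]
tokenized = [preprocess(doc) for doc in raw_docs]
token_vocab = build_vocabulary(tokenized)
token_ids = [[token_vocab[t] for t in tokens] for tokens in tokenized]

print(tokenized)
print(token_vocab)
print(token_ids)
```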

### Example of Token-ID Mapping

In [43]:
from sklearn.feature_extraction.text import HashingVectorizer

# Vocabulary lookup for known words
vocabulary = {
    "person": 1,
    "healthy": 2,
    "fit": 3,
    "location": 4,
    "government": 5
}

In [None]:
# Input text
text = "person healthy fit unknown"

# Tokenize the text
tokens = text.split()

# Token-ID Mapping using vocabulary lookup
token_ids = [vocabulary.get(token, None) for token in tokens]
print("Token IDs using vocabulary lookup:")
print(token_ids)  # Known tokens will have IDs, unknown will be None

In [None]:
# Feature hashing for unknown or rare words
vectorizer = HashingVectorizer(n_features=10)  # 10 hash buckets
hashed_features = vectorizer.transform([text])

print("\nFeature hashing output:")
print(hashed_features.toarray())

## 1. Label Encoding

#### Definitions:
- **Document**: A single text data point (e.g., a tweet, a YouTube comment, or a product review).
- **Corpus**: A collection of all documents in the dataset.
- **Feature**: Every unique word in the corpus.

#### Example:
- Consider the following corpus that consists of three documents:

    ```boldtext
    doc1 = ["Ali help his students"]
    doc2 = ["Ali assist his students"]
    doc3 = ["Ali lectures are great"]
    ```
**Vocabulary:**
- To build a vocabulary, assign a unique number to each word in the corpus:

    ```boldtext
    vocab = {1: 'students', 2: 'assist', 3: 'his', 4: 'help', 5: 'Ali', 6: 'lectures', 7: 'great', 8: 'are'}
    ```

**Vectorized Documents:**
- Each word in the document is replaced by its corresponding number from the vocabulary:

    ```boldtext
    doc1 = [5, 4, 3, 1]  # "Ali help his students"
    doc2 = [5, 2, 3, 1]  # "Ali assist his students"
    doc3 = [5, 6, 8, 7]  # "Ali lectures are great"
    ```

- **Limitations of Label Encoding**
    - **Size is not Fixed**:
        - Example: Consider a new document: `Ali students are great help`.
        - When vectorized, this document has a size of 5 instead of 4 (the size the machine learning algorithm was given for the other documents).
        - Machine learning algorithms usually expect fixed-size input, so such a document cannot be processed directly.

    - **Out of Vocabulary Problem (OOV)**:
        - Example: Consider a new document: `Ali YouTube lectures are great`.
        - The word `YouTube` is not in the vocabulary, so it cannot be encoded.
        - This is known as the Out of Vocabulary (OOV) problem.

    - **Cannot Capture Semantics/Meanings**:
        - The words `help` and `assist` are close in meaning, but label encoding assigns them unrelated integers.
        - As a result, semantic relationships are lost.

    > **Conclusion:** Due to these limitations, **Label Encoding** is not suitable for transforming text data into numbers for most Natural Language Processing (NLP) applications.

In [None]:
# Python Implementation
corpus = [
    "Ali help his students",
    "Ali assist his students",
    "Ali lectures are great"
]

# Build Vocabulary
vocab = {'students': 1, 'assist': 2, 'his': 3, 'help': 4, 'Ali': 5, 'lectures': 6, 'great': 7, 'are': 8}

# Function to encode documents
def encode_document(doc, vocab):
    return [vocab[word] for word in doc.split()]

# Encode each document
doc1_encoded = encode_document("Ali help his students", vocab)
doc2_encoded = encode_document("Ali assist his students", vocab)
doc3_encoded = encode_document("Ali lectures are great", vocab)
# doc4_encoded = encode_document("Ali lectures are not great", vocab)

print("Encoded Documents:")
print("doc1:", doc1_encoded)
print("doc2:", doc2_encoded)
print("doc3:", doc3_encoded)
# print("doc4:", doc4_encoded)
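
The first two limitations can be made visible with a small variation of the encoder. This is a sketch under one assumption: the ID `0` is reserved here as a hypothetical "unknown word" marker, which is not part of the vocabulary defined above.

```python
label_vocab = {'students': 1, 'assist': 2, 'his': 3, 'help': 4,
               'Ali': 5, 'lectures': 6, 'great': 7, 'are': 8}
UNK_ID = 0  # hypothetical ID reserved for out-of-vocabulary words

def encode_with_unk(doc, vocab, unk_id=UNK_ID):
    """Label-encode a document, mapping unknown words to unk_id."""
    return [vocab.get(word, unk_id) for word in doc.split()]

longer_doc = encode_with_unk("Ali students are great help", label_vocab)
oov_doc = encode_with_unk("Ali YouTube lectures are great", label_vocab)

print("5-word document:", longer_doc, "-> length", len(longer_doc))
print("OOV document:   ", oov_doc)
```

The first result has five entries instead of four, and in the second the word `YouTube` collapses to the placeholder ID, so whatever it meant is lost.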

## 2. One-Hot Encoding

- Consider the following corpus that consists of three documents and the corresponding vocabulary of the corpus:

    ```boldtext
    doc1 = ["Ali help his students"]
    doc2 = ["Ali assist his students"]
    doc3 = ["Ali lectures are great"]
    ```

```boldtext    
vocab = {'students', 'assist', 'his', 'help', 'Ali', 'lectures', 'great', 'are'}
```

- Encode every word as a binary vector whose length equals the number of words in the vocabulary; exactly one position, the one belonging to that word, is 1 and all others are 0.
- In the following matrix, each row is a word with its one-hot vector, and the columns are the words in the vocabulary:

| Word      | students | assist | his | help | Ali | lectures | great | are |
|-----------|----------|--------|-----|------|-----|----------|-------|-----|
| Ali       | 0        | 0      | 0   | 0    | 1   | 0        | 0     | 0   |
| help      | 0        | 0      | 0   | 1    | 0   | 0        | 0     | 0   |
| his       | 0        | 0      | 1   | 0    | 0   | 0        | 0     | 0   |
| students  | 1        | 0      | 0   | 0    | 0   | 0        | 0     | 0   |
| assist    | 0        | 1      | 0   | 0    | 0   | 0        | 0     | 0   |
| great     | 0        | 0      | 0   | 0    | 0   | 0        | 1     | 0   |
| are       | 0        | 0      | 0   | 0    | 0   | 0        | 0     | 1   |
| lectures  | 0        | 0      | 0   | 0    | 0   | 1        | 0     | 0   |

- In One-Hot Encoding, you represent every word of a document as a vector of size `v`, where `v` is the size of the vocabulary.
- The vectorization of the three documents using one-hot encoding is shown below:

    ```boldtext
    doc1 = ["Ali help his students"] = [[00001000], [00010000], [00100000], [10000000]]
    doc2 = ["Ali assist his students"] = [[00001000], [01000000], [00100000], [10000000]]
    doc3 = ["Ali lectures are great"] = [[00001000], [00000100], [00000001], [00000010]]
    ```

**Limitations**:
- **Size is not Fixed:**
    - Consider a document (Ali students are great help) that we need to classify. Once you vectorize this document, the matrix representing it has shape `(5, 8)`, i.e., 40 numbers.
    - Unfortunately, most machine learning algorithms expect fixed-size input, so we cannot process it directly.
- **Out of Vocabulary Problem (OOV)**:
    - Consider a document (Ali YouTube lectures are great). It has a new word `YouTube` that is not in the vocabulary.
    - So we will not be able to encode it. This is called an out of vocabulary (OOV) problem.
- **Cannot Capture Semantic Meanings:**
    - The words `help` and `assist` are close in meaning, but they have completely different representations in one-hot encoding.
- **Sparse Matrix Representation:**
    - This technique is memory-hungry. For example, in the above toy example, each document is represented by a sparse matrix of shape `(4, 8)`, i.e., 32 numbers, out of which 28 are zero.

> **Due to these limitations, word-level One-Hot Encoding is rarely used on its own to represent documents in NLP applications.**

In [None]:
# Define the vocabulary
vocab = ['students', 'assist', 'his', 'help', 'Ali', 'lectures', 'great', 'are']

# Define the documents
documents = [
    "Ali help his students",
    "Ali assist his students",
    "Ali lectures are great"
]

# Create a one-hot encoding matrix
one_hot_matrix = {word: [1 if word == token else 0 for token in vocab] for word in vocab}

# Display the one-hot encoded table
print(f"{'Word':<10} {' '.join([f'{token:<8}' for token in vocab])}")
for word in vocab:
    print(f"{word:<10} {' '.join([str(one_hot_matrix[word][i]).ljust(8) for i in range(len(vocab))])}")
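
The table above encodes single words. A whole document then becomes a matrix with one row per word, which is what makes the input size vary. The following is a minimal sketch; the helper name `one_hot_document` is an assumption made for illustration.

```python
import numpy as np

oh_vocab = ['students', 'assist', 'his', 'help', 'Ali', 'lectures', 'great', 'are']
word_to_index = {word: i for i, word in enumerate(oh_vocab)}

def one_hot_document(doc):
    """Return a (number of words, vocabulary size) matrix of one-hot rows."""
    tokens = doc.split()
    matrix = np.zeros((len(tokens), len(oh_vocab)), dtype=int)
    for row, token in enumerate(tokens):
        # A word missing from the vocabulary raises KeyError (the OOV problem)
        matrix[row, word_to_index[token]] = 1
    return matrix

doc3_matrix = one_hot_document("Ali lectures are great")
print(doc3_matrix.shape)
print(doc3_matrix)

# A five-word document produces a differently shaped matrix
print(one_hot_document("Ali students are great help").shape)
```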

## 3. Bag of Words (BoW) Encoding

### a. Conceptual Understanding
- `Bag of Words (BoW)` is the most basic strategy for converting a text document into numbers: each document is represented by the presence or count of every vocabulary word (or n-gram) in it.
- The most common NLP application in which we use BoW representation is text classification (e.g., classifying a collection of documents into categories like sports, entertainment, and politics).
- Consider the following corpus of three documents containing five, ten, and five words respectively, together with the vocabulary of the corpus:

	```boldtext
	doc1 = ["Ali YouTube lectures are amazing"]
	doc2 = ["I like YouTube lectures and Khurram also like YouTube lectures"]
	doc3 = ["Ali YouTube lectures are great"]
	```
**Vocabulary Mapping** (sorted alphabetically, as `CountVectorizer` does)
	```
	vocab = {'ali': 0, 'also': 1, 'amazing': 2, 'and': 3, 'are': 4, 'great': 5, 'khurram': 6, 'lectures': 7, 'like': 8, 'youtube': 9}
	```

- Irrespective of the size, each document is converted into a `v-dimensional frequency vector`, where `v` is the size of the vocabulary.
- The three documents are represented as a `Document-Term Matrix (DTM)`, which is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.

**In Theory**
| Word     | ali | also | amazing | and | are | great | khurram | lectures | like | youtube |
|----------|-----|------|---------|-----|-----|-------|---------|----------|------|---------|
| **doc1** | 1   | 0    | 1       | 0   | 1   | 0     | 0       | 1        | 0    | 1       |
| **doc2** | 0   | 1    | 0       | 1   | 0   | 0     | 1       | 2        | 2    | 2       |
| **doc3** | 1   | 0    | 0       | 0   | 1   | 1     | 0       | 1        | 0    | 1       |


**In Practice**
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-------|---|---|---|---|---|---|---|---|---|---|
| **doc1** | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| **doc2** | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 2 |
| **doc3** | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |


- `doc1` has a total of 5 words each appearing once, and it is represented as a vector of size 10 (count of vocab), having 5 non-zero values.

- `doc2` has a total of 9 words with 6 unique words (excluding the single character `I`). The words “like”, “YouTube”, and “lectures” each appear twice; the rest appear once. `doc2` is represented as a vector of size 10.

- `doc3` has a total of 5 words, each appearing once, and it is represented as a vector of size 10 (count of vocab), having 5 non-zero values.

**Advantage:**
- `Size is Fixed:` Unlike one-hot encoding, which encodes every word separately, BoW technique encodes every document as a fixed-size vector irrespective of the number of words in it. So this can now be easily fed to a machine learning algorithm.

**Limitations:**
- `Size Reduced but Still Large:` The vector representation of a document is smaller than in one-hot encoding, but its length still equals the vocabulary size, which grows with the corpus.
- `Sparsity Reduced but Still Exists:` A bit better than one-hot encoding; however, the BoW vector of a document still contains many zero values.
- `OOV Partially Solved:` In BoW, a word that is not in the vocabulary is simply ignored, so a new document can still be encoded, but the information carried by that word is lost.
- `Semantic Meaning is Partially Captured:` A bit better than one-hot encoding; however, BoW does not capture the meaning of sentences accurately.
- `Ordering of Words is Ignored:` In BoW representation, the ordering of words is not captured.
- `Two Very Similar Vectors Convey Completely Different Meanings:` For instance, “Ali YouTube lectures are very good” and “Ali YouTube lectures are not very good” will have similar vectors in BoW but convey opposite meanings.

### b. Creating Bag of Words using `CountVectorizer`

- **Steps to create a BOW representation of a corpus programmatically:**
    - **Tokenization**: First, tokenize all the input documents.
    - **Vocabulary creation**: Of all the obtained tokenized words, only unique words are selected to create the vocabulary and then sorted by alphabetical order.
    - **Vector creation**: Finally, a sparse matrix is created in which each row is a document vector whose length (the columns of the matrix) is equal to the size of the vocabulary. The value of each cell in a row/document is the frequency count of the word under that column.

- **Scikit-learn's `CountVectorizer`**: The `CountVectorizer` computes the frequency of occurrence of a word in a document. It converts the corpus of multiple documents (say product reviews) into a Document Term Matrix (a sparse matrix). It also allows you to:
    - Control your n-gram size.
    - Perform custom preprocessing.
    - Perform custom tokenization.
    - Eliminate stop words.
    - Limit vocabulary size.

    ```boldtext
    cv = sklearn.feature_extraction.text.CountVectorizer(arg1, arg2, arg3,......, arg16)
    ```

In [50]:
# Corpus of documents
corpus = [
    "Ali youTube lectures are amazing",
    "I like youTube lectures and Khurram also like youTube lectures",
    "Ali youTube lectures are great"
]

In [None]:
# Importing necessary libraries
from sklearn.feature_extraction.text import CountVectorizer

# Create an object of the CountVectorizer class
cv = CountVectorizer()

# Check the type of the object
type(cv)

In [55]:
bow = cv.fit_transform(corpus) # generates vocabulary dictionary and returns a DTM

In [54]:
dir(cv)

In [None]:
# to print the vocabulary
print(cv.vocabulary_)

In [None]:
print(cv.get_feature_names_out())

> **Note that single characters are not included in the vocabulary.**

In [None]:
bow.shape

In [None]:
bow

In [None]:
# bow is a SciPy sparse matrix; convert it to a dense array with its toarray() method (or todense() for a matrix)
bow.toarray()

**Let us understand the sparsity of the matrix.**

In [None]:
# Total count of values in the BOW matrix
total_cells = bow.shape[0] * bow.shape[1]
total_cells

# Total count of non-zero cells
nonzero_cells = bow.nnz
nonzero_cells
total_cells, nonzero_cells

In [None]:
# Percentage of non-zero values in the document term matrix
percentage = (nonzero_cells / total_cells) * 100
percentage

- Since this is a toy example, around 47% of the cells contain zero values.
- In real-world examples, approximately 99% of the cells contain zero values.
- In order to save memory space and speed up algebraic operations, we use the sparse representation of matrices.

**Let us save the corpus as Document Term Matrix of BoW Representation**

In [None]:
import pandas as pd
dmt = pd.DataFrame(bow.toarray(), columns=cv.get_feature_names_out())
dmt

In [None]:
dmt2 = pd.DataFrame(bow.toarray())
dmt2

### c. **Hyper-Parameters of CountVectorizer**

- To improve or fine-tune the results on your dataset, you can tweak different hyperparameters of the `CountVectorizer` class. Some important ones are mentioned below:

  ```boldtext
  cv = sklearn.feature_extraction.text.CountVectorizer(arg1, arg2, ...)
  ```

**Where:**
  * **vocabulary**: Default `None`, in which case the vocabulary is built from the input documents. You can instead pass a Python dictionary where keys are terms and values are indices in the feature matrix.
  * **lowercase**: Default `True`, which converts all characters of the documents to lowercase before tokenizing.
  * **tokenizer**: Default `None`. The default tokenization in `CountVectorizer` drops special characters and punctuation and keeps only tokens of two or more characters, so single characters are removed. If this is not the behavior you desire, you can pass a custom tokenizer.
  * **stop\_words**: Default `None`, so no stop words are removed. You can pass a custom list, or `'english'` to use sklearn's built-in English stop-word list.
  * **preprocessor**: Default `None`. If passed, a callable that replaces the built-in preprocessing step; it can, for example, change text to lower case, remove characters of your choice, or apply stemming or lemmatization.
  * **binary**: Default `False`, which stores frequency counts (0, 1, 2, 3, ...). If you are not interested in the frequency of words and only want to know whether a word exists in a document, set `binary=True`. It sets all non-zero counts to 1. This is often a reasonable choice for tasks such as sentiment analysis, where the presence of a word can matter more than its count.
  * **max\_features**: Default `None`. To put a limit on the number of features (vocabulary size), pass an integer value to this argument. For example, a value of 100 keeps the 100 most frequent words across the corpus and drops the rest.
  * **ngram\_range**: Default `(1, 1)`. The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that `min_n <= n <= max_n` will be used. For example, an ngram\_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if `analyzer` is not callable.

> - I strongly recommend going through all the hyper-parameters of `CountVectorizer()` at the following link:

[https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

## 4. Bag of N-Grams Encoding

### a. Conceptual Understanding
- In the simple BoW model, the vocabulary consists of single unique words of the corpus, and its limitation is that the ordering of words is not captured.
- A `Bag of n-grams model` is quite similar to the BoW model, and it represents a text document as an unordered collection of its n-grams (a contiguous sequence of **n items** from a given sample of text or speech).
- An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"; size 3 is a "trigram," and so on.
- The formula to calculate the count of n-grams in a document is:
  $X - (N - 1)$
  where **X** is the number of words in a given document and **N** is the number of words in each n-gram.
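
The count formula can be checked with a few lines of plain Python. The helper `word_ngrams` is a hypothetical name used only for this sketch, and it splits on whitespace rather than using a full tokenizer.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Ali YouTube lectures are amazing"   # X = 5 words
num_words = len(sentence.split())

for n in (1, 2, 3):
    grams = word_ngrams(sentence, n)
    # The number of n-grams should equal X - (N - 1)
    print(n, len(grams), num_words - (n - 1), grams)
```

For a five-word sentence this yields five unigrams, four bigrams, and three trigrams, in line with the formula.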

### Example:
- Using BoW representation, the two sentences:
  - *Ali YouTube lectures are good*
  - *Ali YouTube lectures are not good*
  will be very similar.
- Using bigram, the two bigrams **are good** in the first sentence and **not good** in the second sentence will make the difference.

- Consider the following corpus that consists of three documents consisting of five, ten, and five words respectively:

    ```boldtext
    doc1 = ["Ali YouTube lectures are amazing"]
    doc2 = ["I like YouTube lectures and Khurram also like YouTube lectures"]
    doc3 = ["Ali YouTube lectures are great"]
    ```

- The vocabulary of bi-grams in this case consists of ten bi-grams, which can be represented as a dictionary (sorted alphabetically) as shown below:
    ```boldtext
    vocab = {
        'ali youtube': 0,
        'also like': 1,
        'and khurram': 2,
        'are amazing': 3,
        'are great': 4,
        'khurram also': 5,
        'lectures and': 6,
        'lectures are': 7,
        'like youtube': 8,
        'youtube lectures': 9
    }
    ```

**Bag of Bi-grams**
| Document | ali youtube | also like | and khurram | are amazing | are great | khurram also | lectures and | lectures are | like youtube | youtube lectures |
|----------|-------------|-----------|-------------|-------------|-----------|--------------|--------------|--------------|--------------|------------------|
| **doc1** | 1           | 0         | 0           | 1           | 0         | 0            | 0            | 1            | 0            | 1                |
| **doc2** | 0           | 1         | 1           | 0           | 0         | 1            | 1            | 0            | 2            | 2                |
| **doc3** | 1           | 0         | 0           | 0           | 1         | 0            | 0            | 1            | 0            | 1                |


- The vocabulary of tri-grams in this case consists of ten tri-grams, which can be represented as a dictionary (sorted alphabetically) as shown below:

    ```boldtext
    vocab = {
        'ali youtube lectures': 0,
        'also like youtube': 1,
        'and khurram also': 2,
        'khurram also like': 3,
        'lectures and khurram': 4,
        'lectures are amazing': 5,
        'lectures are great': 6,
        'like youtube lectures': 7,
        'youtube lectures and': 8,
        'youtube lectures are': 9
    }
    ```

**Bag of Tri-grams**

| Document | ali youtube lectures | also like youtube | and khurram also | khurram also like | lectures and khurram | lectures are amazing | lectures are great | like youtube lectures | youtube lectures and | youtube lectures are |
|----------|----------------------|-------------------|------------------|-------------------|----------------------|----------------------|--------------------|-----------------------|----------------------|----------------------|
| **doc1** | 1                    | 0                 | 0                | 0                 | 0                    | 1                    | 0                  | 0                     | 0                    | 1                    |
| **doc2** | 0                    | 1                 | 1                | 1                 | 1                    | 0                    | 0                  | 2                     | 1                    | 0                    |
| **doc3** | 1                    | 0                 | 0                | 0                 | 0                    | 0                    | 1                  | 0                     | 0                    | 1                    |

- **Advantages of N-Grams**

    - `Captures Local Word Order:` Bi-grams and tri-grams keep short sequences of adjacent words together, so some context (for example, `not good`) is preserved and word relationships are easier to find.
    - `Intuitive and Easy to Implement:` Implementation of N-grams is straightforward with a little bit of modification in the Bag of Words approach.

- **Disadvantages of Bag of N-Grams Encoding**
    - As we move from unigrams to larger n-grams, the vocabulary (and therefore the vector dimension) grows, which increases the time needed for computation and prediction.
    - The out-of-vocabulary problem remains: new words or n-grams in a new sentence can only be ignored.

### b. Creating Bag of N-Grams using `CountVectorizer`

- To create a Bag of N-Grams, we can use the `ngram_range` argument of the `CountVectorizer` class:

    ```boldtext
    cv = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1,1))
    ```
#### Explanation:
- **Uni-grams**: Single words from the text are treated as tokens.
- **Bi-grams**: Consecutive pairs of words are treated as tokens.
- **Tri-grams**: Consecutive triplets of words are treated as tokens.
- **Combined N-Grams**: Tokens are generated by combining multiple ranges of N-Grams (e.g., uni-grams, bi-grams, and tri-grams together).


**Example of Bi-Grams Using CountVectorizer**

In [None]:
corpus = [
    "Ali YouTube lectures are amazing",
    "I like YouTube lectures and Khurram also like YouTube lectures",
    "Ali YouTube lectures are great"
]

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create an object of CountVectorizer with ngram_range for bi-grams
cv = CountVectorizer(ngram_range=(2, 2))

# Fit and transform the corpus to create a bag-of-bi-grams representation
bow = cv.fit_transform(corpus)

# Print the vocabulary
print(cv.vocabulary_)

# Print the bag-of-words matrix as an array
print("\n\n")
print(bow.toarray())

In [None]:
import pandas as pd

# Convert the matrix to a DataFrame
dtm1 = pd.DataFrame(data=bow.toarray(), columns=cv.get_feature_names_out())

# Display the DataFrame
dtm1

**Example of Tri-Grams Using CountVectorizer**

In [None]:
corpus = [
    "Ali YouTube lectures are amazing",
    "I like YouTube lectures and Khurram also like YouTube lectures",
    "Ali YouTube lectures are great"
]

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create an object of CountVectorizer with ngram_range for tri-grams
cv = CountVectorizer(ngram_range=(3, 3))

# Fit and transform the corpus to create a bag-of-tri-grams representation
bow = cv.fit_transform(corpus)

# Print the vocabulary
print(cv.vocabulary_)

# Print the bag-of-words matrix as an array
print("\n\n")
print(bow.toarray())

In [None]:
import pandas as pd

# Convert the matrix to a DataFrame
dtm1 = pd.DataFrame(data=bow.toarray(), columns=cv.get_feature_names_out())

# Display the DataFrame
dtm1

# 5. Term Frequency-Inverse Document Frequency (TF-IDF)


## a. Conceptual Understanding

- In the `Bag of Words` representation of a document, the values of the vector are the number of times a particular word appears in a document, but it does not capture the importance of a word in a document.
- In simple terms, the Bag of Words approach treats every word equally, irrespective of its actual importance.
- So, BoW gives undue weight to unimportant words that simply appear often in a document. For example, words like `since`, `as`, `can`, `any`, `and`, `the`, `of`, `it`, `they` can have high frequency but carry little meaning.
- This draws the attention of the machine learning model away from less frequent but more important words.
- **TF-IDF** stands for **Term Frequency (TF)** times **Inverse Document Frequency (IDF)**, and it is used to address this issue:
  - **Term Frequency (TF)**: Tells us how important a term is within a particular document by assigning more weight to a term that appears more frequently in that document.
  - **Inverse Document Frequency (IDF)**: Tells us how informative a term is across the entire corpus of documents. Document frequency counts the documents that contain the term; the intuition behind taking its inverse is that the more common a word is across all documents, the lesser its importance for the current document.

### (i) Term Frequency (TF)

- **Term Frequency (TF)** tells us the count of a term in a specific document and thus tells us how important a term is in a particular document.
- In literature, there are two different formulae for computing **TF**, as shown below:

$$
TF(t, d) = \frac{\text{Number of times term (t) occurs in document (d)}}{\text{Total number of terms in document (d)}}
$$

$$
TF(t, d) = \frac{\text{Number of times term (t) occurs in document (d)}}{\text{Frequency of most common term in document (d)}}
$$
- Consider a corpus of three documents as shown below:

    ```boldtext
    doc1 = ["Ali YouTube lectures are amazing"]
    doc2 = ["I like YouTube lectures and Khurram also like YouTube lectures"]
    doc3 = ["Ali YouTube lectures are great"]
    ```

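> sklearn's `TfidfVectorizer` uses neither formula by default: it takes the raw count (the numerator alone) as TF and instead L2-normalizes each document vector after multiplying by IDF. The table below therefore shows raw counts.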

**Term Frequencies of Each Term in Each Document**

| Term       | also | amazing | and | are | ali | great | khurram | lectures | like | youtube |
|------------|------|---------|-----|-----|------|-------|---------|----------|------|---------|
| **doc1**   | 0    | 1       | 0   | 1   | 1    | 0     | 0       | 1        | 0    | 1       |
| **doc2**   | 1    | 0       | 1   | 0   | 0    | 0     | 1       | 2        | 2    | 2       |
| **doc3**   | 0    | 0       | 0   | 1   | 1    | 1     | 0       | 1        | 0    | 1       |


**Document Analysis**

- **Doc1**:
  - Contains a total of 5 words, each appearing once.
  - Represented as a vector of size 10 (count of vocabulary), having 5 non-zero values.

- **Doc2**:
  - Contains a total of 10 words; after dropping the single character "I", 9 tokens remain, of which 6 are unique.
  - The words *like*, *YouTube*, and *lectures* appear twice, while all other words appear once.
  - Represented as a vector of size 10 (count of vocabulary), having 6 non-zero values.

- **Doc3**:
  - Contains a total of 5 words, each appearing once.
  - Represented as a vector of size 10 (count of vocabulary), having 5 non-zero values.
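
As a small illustration of the two TF formulas, the sketch below computes the raw counts for doc2 and then both normalized variants. It is a plain-Python sketch, assuming the single character "I" has already been dropped; the variable names are illustrative only.

```python
from collections import Counter

# doc2 after lowercasing and dropping the single character "I"
doc2 = "like youtube lectures and khurram also like youtube lectures".split()

counts = Counter(doc2)                 # raw term counts, as in the table above
total_terms = len(doc2)                # 9 tokens in the document
max_count = max(counts.values())       # frequency of the most common term

# Formula 1: count divided by the total number of terms in the document
tf_by_length = {term: c / total_terms for term, c in counts.items()}

# Formula 2: count divided by the count of the most common term
tf_by_max = {term: c / max_count for term, c in counts.items()}

for term in sorted(counts):
    print(term, counts[term], round(tf_by_length[term], 4), tf_by_max[term])
```

Both variants keep the ordering of terms within a document; they differ only in the scale of the values.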

### (ii) Inverse Document Frequency (IDF)

- **Document Frequency (DF)** is the number of documents in the corpus that contain a term; a term found in almost every document tells us little about any single one of them.
- To penalize words that occur across many documents, we take the **Inverse of Document Frequency**.
- This way, the IDF of rare words in the corpus will be large, while the IDF of very common words in the corpus will be close to zero.
- In literature, the most common formula for computing IDF is:


$$\text{IDF}(t, D) = \log_e \frac{n}{df(d, t)}$$


Where:
- \( n \) = Total number of documents in the corpus.
- \( df(d, t) \) = Number of documents in which term \( t \) appears.

- If a word appears in only one document of a large corpus (large \( n \), \( df = 1 \)), the ratio \( n / df \) becomes very large. Therefore, we take the natural logarithm to dampen its effect.

- **Sklearn's `TfidfVectorizer()`** uses the following formula when the argument `smooth_idf=True`:

$$\text{IDF}(t, D) = 1 + \log_e \frac{1 + n}{1 + df(d, t)}$$

#### Why `smooth_idf=True`?

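- The constant "1" added to the numerator and denominator acts as if one extra document contained every term in the vocabulary exactly once, which prevents zero divisions.
- The "1" added outside the logarithm ensures that terms whose logarithm is zero (i.e., terms that occur in all documents) will not be entirely ignored.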


**Examples:**

$$\text{IDF}(\text{Ali}, D) = 1 + \log_e \frac{1 + 3}{1 + 2} = 1 + \log_e(1.333333) = 1 + 0.28768 = 1.28768$$


$$\text{IDF}(\text{YouTube}, D) = 1 + \log_e \frac{1 + 3}{1 + 3} = 1 + \log_e(1) = 1$$



**Inverse Document Frequencies (common for all the documents)**


| Term       | also       | amazing    | and        | are        | ali        | great      | khurram    | lectures   | like       | youtube    |
|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
| Corpus     | 1.69314718 | 1.69314718 | 1.69314718 | 1.28768207 | 1.28768207 | 1.69314718 | 1.69314718 | 1          | 1.69314718 | 1          |

<br/>
<br/>
<br/>

> - The **Term Frequency (TF)** of a term varies from document to document, but its **Inverse Document Frequency (IDF)** is computed once for the whole corpus, so each term has the same IDF value in every document.
> - A term that appears in all the documents (e.g., "youtube") is not given a zero IDF because of the "+1" added in the formula. This ensures that the term is not entirely ignored in the overall calculation of TF-IDF.
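
The IDF values above can be checked with a few lines of plain Python. This is a sketch under the assumption that the three documents are lowercased and the single character "I" is dropped; it computes both the textbook formula and the smoothed formula quoted for `smooth_idf=True`.

```python
import math

docs = [
    "ali youtube lectures are amazing".split(),
    "like youtube lectures and khurram also like youtube lectures".split(),
    "ali youtube lectures are great".split(),
]

n = len(docs)                                            # number of documents
vocab = sorted({term for doc in docs for term in doc})   # 10 unique terms

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Textbook IDF: log(n / df)
idf_plain = {term: math.log(n / df[term]) for term in vocab}

# Smoothed IDF: 1 + log((1 + n) / (1 + df))
idf_smooth = {term: 1 + math.log((1 + n) / (1 + df[term])) for term in vocab}

for term in vocab:
    print(term, df[term], round(idf_plain[term], 8), round(idf_smooth[term], 8))
```

With the textbook formula, "youtube" and "lectures" get an IDF of exactly zero, whereas the smoothed formula gives them 1, so they still contribute to the final score.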

### (iii) Term Frequency-Inverse Document Frequency (TFIDF)
- The **TFIDF score** can be calculated using the following formula:

$$TFIDF = TF \times IDF$$

**TFIDF Table**

| Term   | also       | amazing    | and        | are        | ali        | great      | khurram    | lectures   | like       | youtube    |
|--------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
| doc1   | 0          | 1.69314718 | 0          | 1.28768207 | 1.28768207 | 0          | 0          | 1          | 0          | 1          |
| doc2   | 1.69314718 | 0          | 1.69314718 | 0          | 0          | 0          | 1.69314718 | 2          | 3.38629436 | 2          |
| doc3   | 0          | 0          | 0          | 1.28768207 | 1.28768207 | 1.69314718 | 0          | 1          | 0          | 1          |


### (iv) Normalizing TFIDF Values
- To avoid large documents in the corpus dominating smaller ones, each row of the sparse matrix is normalized by dividing it by its Euclidean (L2) norm, so every document vector ends up with unit length.
- The **Euclidean Norm** for the three documents is:
  - **Doc1**: 2.86059
  - **Doc2**: 5.29785
  - **Doc3**: 2.86059


$$Normalized \, TFIDF = \frac{TFIDF}{\sqrt{\sum_{i=1}^{n}(TFIDF^2)}}$$

**Normalized TFIDF Table**

| Term   | also       | amazing    | and        | are        | ali        | great      | khurram    | lectures   | like       | youtube    |
|--------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
| doc1   | 0          | 0.59188659 | 0          | 0.45014501 | 0.45014501 | 0          | 0          | 0.34957775 | 0          | 0.34957775 |
| doc2   | 0.31959128 | 0          | 0.31959128 | 0          | 0          | 0          | 0.31959128 | 0.37751152 | 0.63918256 | 0.37751152 |
| doc3   | 0          | 0          | 0          | 0.45014501 | 0.45014501 | 0.59188659 | 0          | 0.34957775 | 0          | 0.34957775 |




- **The higher the TFIDF score, the more relevant the term is in that document.**
  - In **doc1**, the term **"amazing"** is the most relevant term as it has the highest TFIDF value (**0.59188659**).
  - In **doc2**, the term **"like"** is the most relevant term as it has the highest TFIDF value (**0.63918256**).
  - In **doc3**, the term **"great"** is the most relevant term as it has the highest TFIDF value (**0.59188659**).
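
The whole pipeline (raw count, smoothed IDF, L2 normalization) can be sketched in plain Python as follows. This is an illustrative re-derivation of the numbers above, not sklearn's code, and it assumes the same lowercased tokens with "I" removed.

```python
import math
from collections import Counter

docs = {
    "doc1": "ali youtube lectures are amazing".split(),
    "doc2": "like youtube lectures and khurram also like youtube lectures".split(),
    "doc3": "ali youtube lectures are great".split(),
}

n = len(docs)
vocab = sorted({term for tokens in docs.values() for term in tokens})
df = {term: sum(term in tokens for tokens in docs.values()) for term in vocab}
idf = {term: 1 + math.log((1 + n) / (1 + df[term])) for term in vocab}

norms = {}
normalized = {}
for name, tokens in docs.items():
    counts = Counter(tokens)
    raw = {term: counts[term] * idf[term] for term in vocab}    # TF x IDF
    norm = math.sqrt(sum(value * value for value in raw.values()))
    norms[name] = norm                                          # Euclidean norm
    normalized[name] = {term: value / norm for term, value in raw.items()}

for name in docs:
    top_term = max(normalized[name], key=normalized[name].get)
    print(name, round(norms[name], 5), top_term, round(normalized[name][top_term], 8))
```

After normalization, every document vector has length 1, so a long document such as doc2 no longer gets larger scores just because it contains more words.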

## b. Creating TFIDF using `TfidfVectorizer`

In [114]:
corpus = [
    "Ali YouTube lectures are amazing",
    "I like YouTube lectures and Khurram also like YouTube lectures",
    "Ali YouTube lectures are great"
]

In [115]:
# Create an instance of TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
type(tfidf_vec)

sklearn.feature_extraction.text.TfidfVectorizer

In [116]:
tfidf = tfidf_vec.fit_transform(corpus) # generates vocabulary dictionary and returns a DTM having TF-IDF values

In [117]:
# to print the vocabulary
print(tfidf_vec.vocabulary_)

{'ali': 0, 'youtube': 9, 'lectures': 7, 'are': 4, 'amazing': 2, 'like': 8, 'and': 3, 'khurram': 6, 'also': 1, 'great': 5}


In [118]:
print(tfidf_vec.get_feature_names_out())

['ali' 'also' 'amazing' 'and' 'are' 'great' 'khurram' 'lectures' 'like'
 'youtube']


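> **Note that single characters (such as "I") are not included in the vocabulary, because the default `token_pattern` only matches tokens of two or more alphanumeric characters.**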
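
The effect of the tokenization rule can be seen with Python's `re` module alone. The sketch below applies the regular expression that sklearn documents as its default `token_pattern`, next to a relaxed pattern that also accepts single characters; the constant names are made up for this example.

```python
import re

# sklearn's documented default: two or more word characters
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# A relaxed alternative that also keeps single characters
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

text = "I like YouTube lectures and Khurram also like YouTube lectures".lower()

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, text)
all_tokens = re.findall(SINGLE_CHAR_PATTERN, text)

print(default_tokens)   # "i" is missing
print(all_tokens)       # "i" is kept
```

If single-character tokens matter for a task, the relaxed pattern can be passed as `token_pattern` when creating the vectorizer.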

In [119]:
tfidf.shape

(3, 10)

In [120]:
tfidf

<3x10 sparse matrix of type '<class 'numpy.float64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [121]:
# tfidf is a SciPy sparse matrix; convert it to a dense array with its toarray() method (or todense() for a matrix)
tfidf.toarray()

array([[0.45014501, 0.        , 0.59188659, 0.        , 0.45014501,
        0.        , 0.        , 0.34957775, 0.        , 0.34957775],
       [0.        , 0.31959128, 0.        , 0.31959128, 0.        ,
        0.        , 0.31959128, 0.37751152, 0.63918256, 0.37751152],
       [0.45014501, 0.        , 0.        , 0.        , 0.45014501,
        0.59188659, 0.        , 0.34957775, 0.        , 0.34957775]])

**Let us save the corpus as a Document Term Matrix of TF-IDF values**

In [122]:
dtm2 = pd.DataFrame(data=tfidf.toarray(), columns=tfidf_vec.get_feature_names_out())
dtm2

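|   | ali      | also     | amazing  | and      | are      | great    | khurram  | lectures | like     | youtube  |
|---|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| 0 | 0.450145 | 0.0      | 0.591887 | 0.0      | 0.450145 | 0.0      | 0.0      | 0.349578 | 0.0      | 0.349578 |
| 1 | 0.0      | 0.319591 | 0.0      | 0.319591 | 0.0      | 0.0      | 0.319591 | 0.377512 | 0.639183 | 0.377512 |
| 2 | 0.450145 | 0.0      | 0.0      | 0.0      | 0.450145 | 0.591887 | 0.0      | 0.349578 | 0.0      | 0.349578 |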


**Advantage:**
- The TFIDF technique of vectorization is widely used in **Information Retrieval**, for example to rank documents against a query in search engines.

**Limitations: *(Almost same as BoW)***
- **Size of the Vector**: Depends on the overall size of the vocabulary (quite large), thus increasing the number of dimensions.
- **Sparsity Exists**: Many entries in the vector remain zero.
- **Out of Vocabulary Problem**: If a new word appears, it cannot be vectorized.
- **Increased Dimensions**: The number of dimensions increases with a larger vocabulary.
- **Semantic Meanings Are Not Completely Captured**:
  - **Ordering of Words Is Ignored**: The sequence of words is not considered.
  - **Two Very Similar Vectors Can Convey Completely Different Meanings**: This can lead to misrepresentation of the context.
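
Two of these limitations are easy to demonstrate with a toy count vectorizer. The sketch below is plain Python with a hypothetical three-word vocabulary and a hypothetical helper `bow_vector`; it shows that word order is lost and that an unseen word leaves no trace in the vector.

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Count how often each vocabulary term occurs; unknown words are dropped."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["bites", "dog", "man"]   # toy vocabulary fixed at "training" time

v1 = bow_vector("dog bites man", vocabulary)
v2 = bow_vector("man bites dog", vocabulary)   # opposite meaning, same words
v3 = bow_vector("cat bites man", vocabulary)   # "cat" is out of vocabulary

print(v1, v2, v1 == v2)
print(v3)
```

The first two sentences map to the same vector although they mean opposite things, and "cat" is silently ignored because it was never part of the vocabulary. The same behaviour carries over to TF-IDF, since it only reweights these counts.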

## 6. Word Embeddings

### a. Conceptual Understanding
- In Bag of Words and TF-IDF encodings, every word is treated as an individual entity, and semantics are completely ignored.
- These vectorization techniques work fine for NLP tasks like **Document Classification** and **Information Retrieval**, where word counts alone carry much of the signal.
- However, Bag-of-Words and TFIDF encodings won’t be as effective for other NLP tasks like **Sentiment Analysis**, **Machine Translation**, and **Question Answering**, where a deeper understanding of the context is required to achieve great results.
- For this, we turn to **Word Embeddings**, a featurized word-level representation capable of capturing the semantic meanings of words.
- **Word Embeddings** are techniques that map a single **word** as well as an entire **document** to a dense vector of fixed size (50 to 300 dimensions) that captures the semantic meanings of words.

#### Word Embedding Techniques:
1. **Prediction-Based (Don’t Count, Predict)**:
   - **Word2Vec** by Google (2013): [Paper](https://arxiv.org/pdf/1301.3781.pdf)
   - **FastText** by Facebook (2016): [Paper](https://arxiv.org/pdf/1607.01759.pdf)
2. **Frequency/Count-Based**:
   - **Global Vectors (GloVe)** by Stanford (2014): [Paper](https://nlp.stanford.edu/pubs/glove.pdf)

### b. Word2Vec
- **Word2Vec** was released in 2013 by Google researchers; it uses a simple (shallow) Neural Network to generate word embeddings.
- It learns each word's vector from the words that surround it in the training text, so words used in similar contexts end up with similar vectors.
- It converts a word into a dense vector of real numbers (commonly a few hundred dimensions, e.g., 300).
- **Two Approaches to Train a Word2Vec Model**:
  - **CBOW**: Use the surrounding context words to predict the target (center) word.
  - **Skip-Gram**: Use the target (center) word to predict its surrounding context words.
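
The difference between the two approaches shows up in how the training examples are built. The following plain-Python sketch only generates the (input, output) pairs for a given window size; it does not train a network, and the helper name `training_pairs` is hypothetical.

```python
def training_pairs(tokens, window=1):
    """Build CBOW and Skip-Gram training examples from one tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        cbow.append((context, target))                    # context words -> target
        skipgram.extend((target, c) for c in context)     # target -> one context word
    return cbow, skipgram

tokens = "ali youtube lectures are amazing".split()
cbow, skipgram = training_pairs(tokens, window=1)

for context, target in cbow:
    print("CBOW     ", context, "->", target)
for target, context_word in skipgram:
    print("Skip-Gram", target, "->", context_word)
```

CBOW produces one example per word, with several context words as input, while Skip-Gram produces one example per (word, neighbour) pair.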
  
<img align="right" width="800"  src="images/word2vec.jpeg" >


**Example:**

- Bag of Words (BOW) and TFIDF treat `king`, `queen`, `woman`, and `princess` as completely different words.
- **Word Embedding techniques** will capture the semantic meanings of these words and represent them as vectors that are quite close or similar to each other.

<img align="right" width="800"  src="images/word2vectors.png" >

### c. `word2vec` using spaCy `en_core_web_lg` Model

In [129]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [131]:
len(nlp.vocab.vectors) # number of word vectors stored in the model

342918

In [133]:
nlp.vocab.vectors.shape # dimension of the word vectors, it means that each word has 300 dimensions

(342918, 300)

**Vector Representation of Words**

In [139]:
nlp(u"man").vector # the u prefix marks a Unicode string literal (optional in Python 3)

array([-1.7310e-01,  2.0663e-01,  1.6543e-02, -3.1026e-01,  1.9719e-02,
        2.7791e-01,  1.2283e-01, -2.6328e-01,  1.2522e-01,  3.1894e+00,
       -1.6291e-01, -8.8759e-02,  3.3067e-03, -2.9483e-03, -3.4398e-01,
        1.2779e-01, -9.4536e-02,  4.3467e-01,  4.9742e-01,  2.5068e-01,
       -2.0901e-01, -5.8931e-01,  6.1615e-02,  1.0434e-01,  2.4424e-01,
       -2.9120e-01,  3.0746e-01,  3.6276e-01,  7.1151e-01, -8.0523e-02,
       -5.9524e-01,  3.4834e-01, -3.3048e-01,  7.0316e-02,  5.3329e-01,
       -2.9081e-01,  1.3459e-01, -3.9856e-01, -3.2435e-01,  1.1867e-01,
       -1.4938e-01, -3.8256e-01,  3.3116e-01, -3.1488e-01, -9.4491e-02,
       -6.1319e-02,  1.5518e-01, -2.5523e-01, -1.1813e-01,  2.5296e-01,
       -9.5174e-02, -1.6596e-01, -1.0840e-01,  8.8803e-02,  2.0890e-01,
        4.3981e-01,  1.0476e-03, -4.0666e-02,  2.6487e-01, -6.1009e-01,
       -1.4405e-01, -8.1185e-02,  7.5475e-03,  2.3373e-01, -2.7772e-02,
       -2.9315e-01, -1.1744e-01, -8.3193e-02, -2.3768e-01,  1.57

In [140]:
nlp(u"ehtisham").vector # the u prefix marks a Unicode string literal (optional in Python 3)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

**Out of Vocabulary Problem and L2 Norms**

In [141]:
words = nlp(u"lion cat ehtisham")
dir(words[0])

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang

In [142]:
for word in words:
    print(word.text, '--->', word.has_vector, '--->', word.vector_norm, '--->', word.is_oov)

lion ---> True ---> 6.5120897 ---> False
cat ---> True ---> 6.6808186 ---> False
ehtisham ---> False ---> 0.0 ---> True
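
The pattern in this output (a real vector with a positive norm for known words, an all-zero vector with norm 0 for unknown ones) can be imitated with a toy lookup table. The 3-dimensional vectors below are made-up values for illustration only, and `lookup` and `l2_norm` are hypothetical helpers, not spaCy functions.

```python
import math

# Hypothetical toy embeddings; the real model stores 300-dimensional vectors
toy_vectors = {
    "lion": [3.0, 4.0, 0.0],
    "cat": [1.0, 2.0, 2.0],
}
DIM = 3

def lookup(word):
    """Return (vector, is_oov); unknown words fall back to an all-zero vector."""
    if word in toy_vectors:
        return toy_vectors[word], False
    return [0.0] * DIM, True

def l2_norm(vector):
    """Euclidean length of a vector: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in vector))

for word in ["lion", "cat", "ehtisham"]:
    vector, is_oov = lookup(word)
    print(word, "--->", not is_oov, "--->", l2_norm(vector), "--->", is_oov)
```

A norm of 0 is therefore a quick signal that a word had no vector, which is worth checking before computing similarities.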


**Cosine Similarity/Cosine Distance among Vectors**

In [144]:
words = nlp(u"python cat pet")
for word1 in words:
    for word2 in words:
        print(word1.text, '--->', word2.text, '--->', word1.similarity(word2))

python ---> python ---> 1.0
python ---> cat ---> 0.3270457983016968
python ---> pet ---> 0.21873925626277924
cat ---> python ---> 0.3270457983016968
cat ---> cat ---> 1.0
cat ---> pet ---> 0.7505456209182739
pet ---> python ---> 0.21873925626277924
pet ---> cat ---> 0.7505456209182739
pet ---> pet ---> 1.0


In [145]:
words = nlp(u"love hate")
for word1 in words:
    for word2 in words:
        print(word1.text, '--->', word2.text, '--->', word1.similarity(word2))

love ---> love ---> 1.0
love ---> hate ---> 0.6393099427223206
hate ---> love ---> 0.6393099427223206
hate ---> hate ---> 1.0


**Compute Cosine Similarity and Cosine distance using Sklearn**

In [150]:
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

words = nlp(u"love hate")
print(cosine_similarity(words[0].vector.reshape(1, -1), words[1].vector.reshape(1, -1)))
print(cosine_distances(words[0].vector.reshape(1, -1), words[1].vector.reshape(1, -1)))

[[0.63931]]
[[0.36069]]


In [152]:
print(cosine_similarity([words[0].vector], [words[1].vector]))
print(cosine_distances([words[0].vector], [words[1].vector]))

[[0.63931]]
[[0.36069]]
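
For reference, cosine similarity is the dot product of two vectors divided by the product of their L2 norms, and cosine distance is one minus that value (0.63931 + 0.36069 = 1 in the output above). A minimal plain-Python sketch on made-up toy vectors, with hypothetical helper names, looks like this:

```python
import math

def cosine_sim(a, b):
    """Dot product divided by the product of the two Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing
        return 0.0
    return dot / (norm_a * norm_b)

def cosine_dist(a, b):
    """Cosine distance is defined as 1 - cosine similarity."""
    return 1 - cosine_sim(a, b)

# Toy vectors, not real word embeddings
a = [1.0, 2.0, 2.0]
b = [2.0, 1.0, 2.0]

print(cosine_sim(a, b), cosine_dist(a, b))
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))       # orthogonal vectors
print(cosine_sim(a, [2.0, 4.0, 4.0]))           # same direction, different length
```

Because only the direction of the vectors matters, scaling a vector does not change its similarity to any other vector.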


In [None]:
from IPython.display import HTML

style = """
    <style>
        body {
            background-color: #f2fff2;
        }
        h1 {
            text-align: center;
            font-weight: bold;
            font-size: 36px;
            color: #4295F4;
            text-decoration: underline;
            padding-top: 15px;
        }
        
        h2 {
            text-align: left;
            font-weight: bold;
            font-size: 30px;
            color: #4A000A;
            text-decoration: underline;
            padding-top: 10px;
        }
        
        h3 {
            text-align: left;
            font-weight: bold;
            font-size: 30px;
            color: #f0081e;
            text-decoration: underline;
            padding-top: 5px;
        }

        
        p {
            text-align: center;
            font-size: 12px;
            color: #0B9923;
        }
    </style>
"""

html_content = """
<h1>Hello</h1>
<p>Hello World</p>
<h2> Hello</h2>
<h3> World </h3>
"""

HTML(style + html_content)