# Feature Extraction from Text

Also known as text representation or text vectorization: converting text into numbers.

Techniques for converting text into numbers: One Hot Encoding (OHE), Bag of Words (BoW), N-grams, TF-IDF, custom features, and Word2Vec (embeddings learned by a neural network).

Corpus : All the words in a dataset taken together, repeated words included.
Vocabulary : The set of unique words in the corpus.
Document : Each individual text in the dataset (for example, a single review).
Word : Each individual token in a document.
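
To make these four terms concrete, here is a minimal sketch on a made-up two-review dataset. The reviews and the whitespace tokenization are assumptions for illustration only.

```python
# Toy dataset: each string is one document (hypothetical reviews).
documents = [
    "good movie good acting",
    "bad movie",
]

# Corpus: every word from every document, repeats included.
corpus = [word for doc in documents for word in doc.split()]

# Vocabulary: the unique words of the corpus.
vocabulary = sorted(set(corpus))

print(len(corpus))   # 6 words in total
print(vocabulary)    # ['acting', 'bad', 'good', 'movie']
```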

# One Hot Encoding

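Disadvantages of One Hot Encoding:
1. Not suited to large datasets (sparsity problem). Each word becomes a vector as long as the vocabulary with a single 1 in it, so the result is a very sparse array, and such high-dimensional sparse features make overfitting more likely.
2. OOV (out-of-vocabulary) problem: a new word seen at prediction time has no encoding.
3. No fixed input size. A longer document produces more word vectors, but most ML algorithms expect an input of the same size every time.
4. No semantic meaning is captured: every pair of distinct words is equally far apart.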
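
The points above can be seen in a small hand-written sketch. The vocabulary and the helper names `one_hot` and `encode` are hypothetical, chosen only for this illustration.

```python
# Vocabulary learned from the training data (hypothetical toy example).
vocabulary = ["cats", "dogs", "mats", "on", "sit"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return the one-hot vector for `word`, or None if it is out of vocabulary."""
    if word not in index:
        return None
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

def encode(document):
    """Encode a document as one vector per word."""
    return [one_hot(word) for word in document.lower().split()]

short_doc = encode("cats sit")
long_doc = encode("dogs sit on mats")

print(len(short_doc), len(long_doc))  # 2 4 -> input size changes with document length
print(one_hot("cats"))                # [1, 0, 0, 0, 0] -> mostly zeros (sparse)
print(one_hot("birds"))               # None -> out-of-vocabulary word
```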

# Bag of Words (Used in Text Classification)

Each document is represented by how many times each vocabulary word occurs in it (the frequency of words in a document).

Limitations of Bag of Words:

Loss of Semantic Meaning: BoW does not capture the context or order of words. Two documents that contain the same words with the same counts get the same vector even when their meanings differ, so semantic information is lost.

High Dimensionality: For large corpora, the vocabulary size can become very large, leading to high-dimensional vectors that require more computational resources.

Sparsity: Most vectors are sparse (contain many zeros), which is inefficient in terms of storage and processing and can contribute to overfitting.
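
The loss of word order is easy to show by hand. The sketch below counts words with `collections.Counter`; the helper `bag_of_words` and the two sentences are made up for this illustration.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["bites", "dog", "man"]

v1 = bag_of_words("dog bites man", vocabulary)
v2 = bag_of_words("man bites dog", vocabulary)

print(v1, v2)    # [1, 1, 1] [1, 1, 1]
print(v1 == v2)  # True -> opposite meanings, identical vectors
```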

In [1]:
# Example of Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample text documents
documents = {
    'Document_ID': [1, 2, 3, 4, 5,6],
    'Text': [
        "Cats sit on mats.",
        "Dogs sit on mats.",
        "Cats and dogs play together.",
        "Mats are places where cats and dogs rest.",
        "Rest and play are important for pets.",
        'Hi chayan How are you chayan'
    ]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(documents)
# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
vectorized_documents = vectorizer.fit_transform(df['Text'])

# Convert the result to an array
vector_array = vectorized_documents.toarray()

# Vocabulary (features)
vocabulary = vectorizer.get_feature_names_out()

print("Vocabulary:", vocabulary)
print("Vector Representation:\n", vector_array)

Vocabulary: ['and' 'are' 'cats' 'chayan' 'dogs' 'for' 'hi' 'how' 'important' 'mats'
 'on' 'pets' 'places' 'play' 'rest' 'sit' 'together' 'where' 'you']
Vector Representation:
 [[0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0]
 [1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0]
 [1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0]
 [1 1 0 0 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0]
 [0 1 0 2 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1]]


In [2]:
vectorizer.vocabulary_

{'cats': 2,
 'sit': 15,
 'on': 10,
 'mats': 9,
 'dogs': 4,
 'and': 0,
 'play': 13,
 'together': 16,
 'are': 1,
 'places': 12,
 'where': 17,
 'rest': 14,
 'important': 8,
 'for': 5,
 'pets': 11,
 'hi': 6,
 'chayan': 3,
 'how': 7,
 'you': 18}

In [3]:
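vectorized_documents[5].toarray()
# Words that are not in the fitted vocabulary are ignored at transform time,
# and every vocabulary word that is absent from a document gets a count of 0.
# Unlike OHE, Bag of Words gives every document a vector of the same fixed size.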

array([[0, 1, 0, 2, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
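
To see how a fixed vocabulary handles unseen words, here is a pure-Python imitation of the transform step. It reuses the 19-word vocabulary printed above; the `transform` helper and its regular expression (lowercase tokens of two or more word characters) are assumptions meant to mimic the vectorizer's default behaviour, not its actual code.

```python
import re
from collections import Counter

# Vocabulary printed by the CountVectorizer example above.
vocabulary = ['and', 'are', 'cats', 'chayan', 'dogs', 'for', 'hi', 'how',
              'important', 'mats', 'on', 'pets', 'places', 'play', 'rest',
              'sit', 'together', 'where', 'you']

def transform(text):
    """Count vocabulary words in `text`; unknown words are simply dropped."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# 'chase' and 'birds' were never seen during fitting.
new_vector = transform("Cats chase birds on mats.")

print(len(new_vector))  # 19 -> same size as every other document vector
print(sum(new_vector))  # 3  -> only 'cats', 'on' and 'mats' were counted
```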

# N-Grams

An n-gram is a contiguous sequence of n items from a given sample of text or speech. N-grams are a fundamental concept in natural language processing (NLP) and text analysis, used to capture the context and structure of the text by looking at sequences of words or characters.

Types of N-grams:
Unigrams: Single words (n=1).

Example: "Cats play outside" → ["Cats", "play", "outside"]

Bigrams: Sequences of two adjacent words (n=2).

Example: "Cats play outside" → ["Cats play", "play outside"]

Trigrams: Sequences of three adjacent words (n=3).

Example: "Cats play outside" → ["Cats play outside"]

Higher-order N-grams: Sequences of more than three words (n=4, n=5, etc.).

Example (4-gram): "Cats play outside during" → ["Cats play outside during"]

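N-grams are useful for preserving local context, since each feature keeps a short run of word order. Note that adding n-grams enlarges the vocabulary (compare the 19 unigram features above with the 43 unigram and bigram features below), so the resulting matrix becomes larger and sparser, not smaller.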
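
The examples above can be reproduced with a few lines of plain Python. This is only a sketch: the `ngrams` helper here is hand-written for illustration and splits on whitespace.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Cats play outside".split()

print(ngrams(tokens, 1))  # ['Cats', 'play', 'outside']
print(ngrams(tokens, 2))  # ['Cats play', 'play outside']
print(ngrams(tokens, 3))  # ['Cats play outside']
print(ngrams(tokens, 4))  # [] -> the text is too short for a 4-gram
```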

In [4]:
import pandas as pd
import nltk
from nltk.util import ngrams
nltk.download('punkt')

# Sample DataFrame
data = {
    'Document_ID': [1, 2, 3],
    'Text': [
        "The quick brown fox jumps over the lazy dog.",
        "A journey of a thousand miles begins with a single step.",
        "To be or not to be, that is the question."
    ]
}

# Create the DataFrame
df = pd.DataFrame(data)

# Function to generate n-grams
def generate_ngrams(text, n):
    # Tokenize the text
    tokens = nltk.word_tokenize(text.lower())  # Convert text to lowercase
    # Generate n-grams
    n_grams = list(ngrams(tokens, n))
    return [' '.join(gram) for gram in n_grams]

# Generate bigrams and trigrams for each document
df['Bigrams'] = df['Text'].apply(lambda x: generate_ngrams(x, 2))
df['Trigrams'] = df['Text'].apply(lambda x: generate_ngrams(x, 3))

# Display the resulting DataFrame
df


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


,Document_ID,Text,Bigrams,Trigrams
0,1,The quick brown fox jumps over the lazy dog.,"[the quick, quick brown, brown fox, fox jumps,...","[the quick brown, quick brown fox, brown fox j..."
1,2,A journey of a thousand miles begins with a si...,"[a journey, journey of, of a, a thousand, thou...","[a journey of, journey of a, of a thousand, a ..."
2,3,"To be or not to be, that is the question.","[to be, be or, or not, not to, to be, be ,, , ...","[to be or, be or not, or not to, not to be, to..."


In [5]:
# Using n-grams with Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample text documents
documents = {
    'Document_ID': [1, 2, 3, 4, 5,6],
    'Text': [
        "Cats sit on mats.",
        "Dogs sit on mats.",
        "Cats and dogs play together.",
        "Mats are places where cats and dogs rest.",
        "Rest and play are important for pets.",
        'Hi chayan How are you chayan'
    ]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(documents)
# Initialize CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 2))  # (1, 2) -> unigrams and bigrams; use (2, 2) for bigrams only

# Fit and transform the documents
vectorized_documents = vectorizer.fit_transform(df['Text'])

# Convert the result to an array
vector_array = vectorized_documents.toarray()

# Vocabulary (features)
vocabulary = vectorizer.get_feature_names_out()

print("Vocabulary:", vocabulary)
print("Vector Representation:\n", vector_array)

Vocabulary: ['and' 'and dogs' 'and play' 'are' 'are important' 'are places' 'are you'
 'cats' 'cats and' 'cats sit' 'chayan' 'chayan how' 'dogs' 'dogs play'
 'dogs rest' 'dogs sit' 'for' 'for pets' 'hi' 'hi chayan' 'how' 'how are'
 'important' 'important for' 'mats' 'mats are' 'on' 'on mats' 'pets'
 'places' 'places where' 'play' 'play are' 'play together' 'rest'
 'rest and' 'sit' 'sit on' 'together' 'where' 'where cats' 'you'
 'you chayan']
Vector Representation:
 [[0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0
  1 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0
  1 1 0 0 0 0 0]
 [1 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0
  0 0 1 0 0 0 0]
 [1 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0
  0 0 0 1 1 0 0]
 [1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 1 1
  0 0 0 0 0 0 0]
 [0 0 0 1 0 0 1 0 0 0 2 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 

# TF-IDF (Term Frequency and Inverse Document Frequency)

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (a corpus).

It is widely used in text mining, information retrieval, and NLP to transform text into numerical vectors that can be fed to machine learning models.

Components of TF-IDF:

1. Term Frequency (TF): Measures how frequently a word appears in a document. It reflects the relevance of the word within that specific document.

2. Inverse Document Frequency (IDF): Measures how informative a word is across the entire corpus. Words that appear in many documents are less informative and are given lower weights.

.toarray() : Converts the sparse matrix returned by the vectorizer into a dense NumPy array.
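
As a worked example, the sketch below computes TF-IDF by hand for the four sentences used in this section. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + N) / (1 + df)) + 1, followed by L2 normalization of each document vector. The `tokenize` and `tfidf` helpers are written for this illustration only.

```python
import math
import re
from collections import Counter

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog barks loudly in the backyard.",
    "A lazy dog is often a sleepy dog.",
    "Foxes are quick and brown, and they jump high.",
]

def tokenize(text):
    # Assumption: lowercase, keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

weights = tfidf(tokenized[2])  # third document
for term in sorted(weights):
    print(term, round(weights[term], 4))
```

For the third document this gives roughly 0.557 for "dog" and 0.344 for "lazy". "dog" occurs twice in the document (high TF) but appears in three of the four documents (low IDF), while "often" and "sleepy" occur once but only in this document.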


In [6]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample DataFrame with text data
data = {
    'Document_ID': [1, 2, 3, 4],
    'Text': [
        "The quick brown fox jumps over the lazy dog.",
        "The dog barks loudly in the backyard.",
        "A lazy dog is often a sleepy dog.",
        "Foxes are quick and brown, and they jump high."
    ]
}

# Create the DataFrame
df = pd.DataFrame(data)

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_matrix = vectorizer.fit_transform(df['Text'])

# Convert the TF-IDF matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Add the Document_ID to the TF-IDF DataFrame for reference
tfidf_df['Document_ID'] = df['Document_ID']

# Set Document_ID as the index for better readability
tfidf_df.set_index('Document_ID', inplace=True)

# Display the TF-IDF DataFrame
print("\nTF-IDF DataFrame:")
tfidf_df


TF-IDF DataFrame:


Document_ID,and,are,backyard,barks,brown,dog,fox,foxes,high,in,...,jump,jumps,lazy,loudly,often,over,quick,sleepy,the,they
1,0.0,0.0,0.0,0.0,0.28305,0.229153,0.359012,0.0,0.0,0.0,...,0.0,0.359012,0.28305,0.0,0.0,0.359012,0.28305,0.0,0.566099,0.0
2,0.0,0.0,0.380865,0.380865,0.0,0.243101,0.0,0.0,0.0,0.380865,...,0.0,0.0,0.0,0.380865,0.0,0.0,0.0,0.0,0.600557,0.0
3,0.0,0.0,0.0,0.0,0.0,0.557077,0.0,0.0,0.0,0.0,...,0.0,0.0,0.344051,0.0,0.436384,0.0,0.0,0.436384,0.0,0.0
4,0.624903,0.312451,0.0,0.0,0.24634,0.0,0.0,0.312451,0.312451,0.0,...,0.312451,0.0,0.0,0.0,0.0,0.0,0.24634,0.0,0.0,0.312451


# Word2Vec

Word2Vec is a popular NLP technique for generating word embeddings, which are dense vector representations of words. It is a neural-network-based method (a shallow network trained on a word-prediction task). Each word is converted into a dense vector whose size you choose, typically a few hundred dimensions.

Advantages : Low dimensionality for faster computation, dense vectors (which overcome sparsity), and capture of semantic meaning.

Word Embedding : Word embeddings are dense, fixed-size vectors (typically 100 to 300 dimensions) that capture the semantic meaning of words. Words with similar meanings are represented by vectors that are close to each other in the embedding space.
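
"Close to each other" is usually measured with cosine similarity. The sketch below computes it by hand; the 3-dimensional vectors are invented for illustration and are not real Word2Vec embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example only.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])

print(round(king_queen, 2))  # close to 1 -> similar words
print(round(king_apple, 2))  # much lower -> unrelated words
```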

In [7]:
# Import necessary libraries
import pandas as pd
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# Step 1: Prepare the DataFrame
data = {
    'Document': [
        "The quick brown fox jumps over the lazy dog",
        "Never jump over the lazy dog quickly",
        "Brightly colored birds are flying over the clear blue sky",
        "The quick red fox swiftly jumps over the lazy hound",
        "A fast moving train swiftly passes by the station"
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Step 2: Preprocess the Text
# Tokenize the sentences in the DataFrame
df['Tokens'] = df['Document'].apply(lambda x: word_tokenize(x.lower()))

# Step 3: Train the Word2Vec Model
# Using the tokenized sentences from the DataFrame
model = Word2Vec(df['Tokens'], vector_size=100, window=5, min_count=1, sg=0)  # CBOW model

# Save the model
model.save("word2vec_dataframe_example.model")

# Step 4: Explore the Word Embeddings
# Load the model
model = Word2Vec.load("word2vec_dataframe_example.model")

# Find similar words to 'quick'
similar_words = model.wv.most_similar("quick")
print("\nWords most similar to 'quick':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")

# Get the vector for the word 'fox'
fox_vector = model.wv['fox']
print(f"\nVector representation for 'fox':\n{fox_vector}")


Words most similar to 'quick':
train: 0.1669
swiftly: 0.1389
dog: 0.1315
flying: 0.0977
are: 0.0965
hound: 0.0716
lazy: 0.0640
moving: 0.0606
red: 0.0475
clear: 0.0442

Vector representation for 'fox':
[ 8.1664836e-03 -4.4427775e-03  8.9857746e-03  8.2530547e-03
 -4.4332538e-03  3.0154921e-04  4.2750454e-03 -3.9255521e-03
 -5.5614812e-03 -6.5129590e-03 -6.6958909e-04 -2.9906651e-04
  4.4628102e-03 -2.4737732e-03 -1.7084299e-04  2.4598504e-03
  4.8684007e-03 -3.0527943e-05 -6.3401032e-03 -9.2617162e-03
  2.6186362e-05  6.6623739e-03  1.4673454e-03 -8.9668771e-03
 -7.9376111e-03  6.5512918e-03 -3.7855492e-03  6.2557305e-03
 -6.6829068e-03  8.4785884e-03 -6.5149795e-03  3.2870427e-03
 -1.0579199e-03 -6.7878389e-03 -3.2878725e-03 -1.1636761e-03
 -5.4713818e-03 -1.2093936e-03 -7.5616692e-03  2.6468080e-03
  9.0693105e-03 -2.3785823e-03 -9.7887742e-04  3.5141103e-03
  8.6649666e-03 -5.9214211e-03 -6.8887267e-03 -2.9316733e-03
  9.1479970e-03  8.6731103e-04 -8.6771306e-03 -1.4473030e-03
  9.

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
fox_vector.shape

(100,)

In [9]:
similar_words

[('train', 0.1669362187385559),
 ('swiftly', 0.13887985050678253),
 ('dog', 0.13149002194404602),
 ('flying', 0.09769198298454285),
 ('are', 0.09651773422956467),
 ('hound', 0.07155133783817291),
 ('lazy', 0.06398721039295197),
 ('moving', 0.0605585090816021),
 ('red', 0.0475480742752552),
 ('clear', 0.04421411082148552)]

In [10]:
#example with google news dataset
# Import necessary libraries
import pandas as pd
import gensim.downloader as api
import numpy as np

# Step 1: Load the Google News Word2Vec Model
# Note: the download is large (roughly 1.6 GB) and the loaded vectors need several GB of RAM.
print("Loading the Google News Word2Vec model...")
model = api.load("word2vec-google-news-300")
print("Model loaded successfully!")

# Step 2: Use the Model for Similarity Queries

# Define some words to find similarities
words_to_query = ["king", "queen", "apple", "car"]

# Create an empty list to store results
results = []

# Query the model for each word
for word in words_to_query:
    if word in model:
        similar_words = model.most_similar(word, topn=5)
        for similar_word, similarity in similar_words:
            results.append([word, similar_word, similarity])
    else:
        print(f"The word '{word}' is not in the model's vocabulary.")

# Step 3: Visualize the Results with a DataFrame

# Create a DataFrame to hold the results
df = pd.DataFrame(results, columns=["Original Word", "Similar Word", "Similarity Score"])

# Display the DataFrame
print("\nSimilarity results using the Google News Word2Vec model:")
print(df)

# Step 4: Get the Vector for a Specific Word

# Retrieve and display the vector for the word "computer"
if "computer" in model:
    vector_computer = model['computer']
    print("\nVector representation for 'computer':")
    print(vector_computer)
else:
    print("The word 'computer' is not in the model's vocabulary.")


Loading the Google News Word2Vec model...
Model loaded successfully!

Similarity results using the Google News Word2Vec model:
   Original Word              Similar Word  Similarity Score
0           king                     kings          0.713805
1           king                     queen          0.651096
2           king                   monarch          0.641319
3           king              crown_prince          0.620422
4           king                    prince          0.615999
5          queen                    queens          0.739944
6          queen                  princess          0.707053
7          queen                      king          0.651096
8          queen                   monarch          0.638360
9          queen  very_pampered_McElhatton          0.635703
10         apple                    apples          0.720360
11         apple                      pear          0.645070
12         apple                     fruit          0.641015
13         apple   

# CBOW

CBOW (Continuous Bag of Words) : One of the two training methods used in the Word2Vec model for generating word embeddings.

CBOW predicts a target word (the center word) from its context words (the surrounding words).

CBOW uses a sliding window around each word to define its context.

The window covers n words on each side of the target word, so up to 2n context words in total; n determines how many words before and after the target word are considered for prediction. In gensim, the window parameter is this n (the maximum distance on each side).

For example, with a window size of 2, the model considers the two words before and the two words after the target word.

Input of NN : The average of the vectors of the context words.

Output of NN : A probability distribution over all words in the vocabulary; training pushes the highest probability toward the target word.

For each word in the training corpus, CBOW takes the context words and predicts the target word in the middle.

CBOW is typically faster to train than skip-gram, especially on large datasets, because it makes one prediction per window of context words rather than predicting each context word separately from a single target word.
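
The sliding window can be sketched without any library. The code below only builds the (context words, target word) training examples that CBOW would learn from; it does not train a network, and `cbow_pairs` is a hypothetical helper written for this illustration.

```python
def cbow_pairs(tokens, window):
    """Return one (context_words, target_word) example per position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((left + right, target))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = cbow_pairs(tokens, window=2)

for context, target in pairs:
    print(context, "->", target)
# The middle example is ['the', 'quick', 'fox', 'jumps'] -> brown
```

Words near the start or end of the sentence get fewer context words because the window is cut off at the sentence boundary.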


In [11]:
# Import necessary libraries
import nltk
from gensim.models import Word2Vec
import pandas as pd
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# Example corpus
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
    "Brightly colored birds are flying over the clear blue sky",
    "The quick red fox swiftly jumps over the lazy hound",
    "A fast moving train swiftly passes by the station"
]

# Step 1: Preprocess the text (tokenize the sentences)
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Step 2: Train the Word2Vec model using CBOW
model_cbow = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=2, min_count=1, sg=0)  # sg=0 specifies CBOW

# Save the model
model_cbow.save("cbow_example.model")

# Load the model
model_cbow = Word2Vec.load("cbow_example.model")

# Find words similar to 'quick'
similar_words_cbow = model_cbow.wv.most_similar("quick")
print("Words most similar to 'quick' using CBOW model:")
for word, similarity in similar_words_cbow:
    print(f"{word}: {similarity:.4f}")

# Get the vector for the word 'fox'
vector_fox_cbow = model_cbow.wv['fox']
print("\nVector representation for 'fox' using CBOW model:")
print(vector_fox_cbow)


Words most similar to 'quick' using CBOW model:
train: 0.1670
swiftly: 0.1389
dog: 0.1315
flying: 0.0979
are: 0.0966
hound: 0.0716
lazy: 0.0642
moving: 0.0605
red: 0.0478
clear: 0.0442

Vector representation for 'fox' using CBOW model:
[ 8.1674615e-03 -4.4427789e-03  8.9943772e-03  8.2541173e-03
 -4.4356198e-03  3.0441178e-04  4.2767138e-03 -3.9221640e-03
 -5.5602528e-03 -6.5167863e-03 -6.6987943e-04 -2.9904293e-04
  4.4625741e-03 -2.4786519e-03 -1.7218136e-04  2.4563510e-03
  4.8653078e-03 -2.9344041e-05 -6.3427431e-03 -9.2698727e-03
  2.4019097e-05  6.6687656e-03  1.4692649e-03 -8.9644017e-03
 -7.9422919e-03  6.5578488e-03 -3.7909430e-03  6.2555107e-03
 -6.6862735e-03  8.4818834e-03 -6.5225735e-03  3.2860374e-03
 -1.0534794e-03 -6.7866454e-03 -3.2889245e-03 -1.1639396e-03
 -5.4755118e-03 -1.2114083e-03 -7.5642853e-03  2.6420618e-03
  9.0703806e-03 -2.3765215e-03 -9.8270620e-04  3.5181378e-03
  8.6619668e-03 -5.9354748e-03 -6.8899016e-03 -2.9335199e-03
  9.1478257e-03  8.6760009e-04 -

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# SkipGram

Skip-gram is the other method used in the Word2Vec model for generating word embeddings. Unlike CBOW, which predicts a target word given its context, skip-gram works in the opposite direction: it predicts the context words given a target word.

The goal of skip-gram is to predict the surrounding context words given a central (target) word.

It aims to maximize the probability of the context words that appear around a given target word.

Skip-gram also uses a sliding window to define the context words around the target word.

The size of the window determines how many words before and after the target word are considered for prediction.

Input : The target word.

Output : A probability distribution over all words in the vocabulary for each context position within the specified window size. For each word in the corpus, skip-gram tries to predict each of its context words. Training takes longer than with CBOW because there are multiple predictions per target word.

It generally produces higher-quality embeddings, especially when the training corpus is relatively small or the words of interest are rare.
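
The extra training cost is visible if we list the examples skip-gram creates. The sketch below builds one (target, context) pair per context word; `skipgram_pairs` is a hypothetical helper for illustration and does no training.

```python
def skipgram_pairs(tokens, window):
    """Return one (target_word, context_word) example per context position."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # the target is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=2)

print(len(pairs))  # 14 examples from a 5-word sentence
print([p for p in pairs if p[0] == "brown"])
# [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
```

The same sentence and window give CBOW only one example per word (5 in total), which is one way to see why skip-gram trains more slowly.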

In [12]:
# Import necessary libraries
import nltk
from gensim.models import Word2Vec
import pandas as pd
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# Example corpus
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
    "Brightly colored birds are flying over the clear blue sky",
    "The quick red fox swiftly jumps over the lazy hound",
    "A fast moving train swiftly passes by the station"
]

# Step 1: Preprocess the text (tokenize the sentences)
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Step 2: Train the Word2Vec model using Skip-Gram
model_skipgram = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=2, min_count=1, sg=1)  # sg=1 specifies Skip-Gram

# Save the model
model_skipgram.save("skipgram_example.model")

# Load the model
model_skipgram = Word2Vec.load("skipgram_example.model")

# Find words similar to 'quick'
similar_words_skipgram = model_skipgram.wv.most_similar("quick")
print("Words most similar to 'quick' using Skip-Gram model:")
for word, similarity in similar_words_skipgram:
    print(f"{word}: {similarity:.4f}")

# Get the vector for the word 'fox'
vector_fox_skipgram = model_skipgram.wv['fox']
print("\nVector representation for 'fox' using Skip-Gram model:")
print(vector_fox_skipgram)


Words most similar to 'quick' using Skip-Gram model:
train: 0.1671
swiftly: 0.1388
dog: 0.1314
flying: 0.0978
are: 0.0968
hound: 0.0715
lazy: 0.0642
moving: 0.0609
red: 0.0481
clear: 0.0441

Vector representation for 'fox' using Skip-Gram model:
[ 8.1679700e-03 -4.4448152e-03  8.9972867e-03  8.2545383e-03
 -4.4309525e-03  2.9656425e-04  4.2692111e-03 -3.9117662e-03
 -5.5667358e-03 -6.5185563e-03 -6.6661945e-04 -3.0787205e-04
  4.4581732e-03 -2.4677459e-03 -1.7359917e-04  2.4666733e-03
  4.8720646e-03 -2.2302580e-05 -6.3520004e-03 -9.2721526e-03
  2.2349635e-05  6.6642119e-03  1.4734204e-03 -8.9791911e-03
 -7.9350900e-03  6.5591554e-03 -3.7853247e-03  6.2591634e-03
 -6.6885678e-03  8.4860791e-03 -6.5151392e-03  3.2775723e-03
 -1.0572986e-03 -6.7933607e-03 -3.2882635e-03 -1.1609819e-03
 -5.4604937e-03 -1.2128860e-03 -7.5625200e-03  2.6458437e-03
  9.0675363e-03 -2.3841353e-03 -9.9444960e-04  3.5208962e-03
  8.6625209e-03 -5.9187771e-03 -6.8909694e-03 -2.9341045e-03
  9.1481712e-03  8.696

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Applying Word2Vec to the Game of Thrones Dataset

In [13]:
import gensim
import os

In [18]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

#Consolidating file

import zipfile
import os

# Step 1: Unzip the File

# Path to the zip file
zip_file_path = '/content/001ssb.txt.zip'

# Directory to extract files to
extract_dir = 'extracted_files'

# Create a directory for extracted files if it doesn't exist
if not os.path.exists(extract_dir):
    os.makedirs(extract_dir)

# Extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

# Get the list of extracted files
extracted_files = os.listdir(extract_dir)
print(f"Extracted files: {extracted_files}")

# Step 2: Read the Text File

# The archive is expected to contain a single text file named 001ssb.txt
text_file_path = os.path.join(extract_dir, '001ssb.txt')

# Initialize an empty list to hold the lines
lines_list = []

# Open and read the text file
with open(text_file_path, 'r') as file:
    # Read all lines and strip newline characters
    lines_list = [line.strip() for line in file.readlines()]

# Step 3: Consolidate to a List

# Display the list
print("Consolidated list of lines from the text file:")
print(lines_list)


Extracted files: ['001ssb.txt']
Consolidated list of lines from the text file:


In [20]:
type(lines_list)

list

In [21]:
len(lines_list)

20169

In [22]:
# Tokenizing each line with gensim's simple_preprocess (Process 1): lowercases and drops punctuation and very short tokens

tokenized_sentences = [simple_preprocess(line) for line in lines_list]
tokenized_sentences

[['game', 'of', 'thrones'],
 ['book', 'one', 'of', 'song', 'of', 'ice', 'and', 'fire'],
 ['by', 'george', 'martin'],
 ['prologue'],
 ['we',
  'should',
  'start',
  'back',
  'gared',
  'urged',
  'as',
  'the',
  'woods',
  'began',
  'to',
  'grow',
  'dark',
  'around',
  'them',
  'the',
  'wildlings',
  'are'],
 ['dead'],
 ['do',
  'the',
  'dead',
  'frighten',
  'you',
  'ser',
  'waymar',
  'royce',
  'asked',
  'with',
  'just',
  'the',
  'hint',
  'of',
  'smile'],
 ['gared',
  'did',
  'not',
  'rise',
  'to',
  'the',
  'bait',
  'he',
  'was',
  'an',
  'old',
  'man',
  'past',
  'fifty',
  'and',
  'he',
  'had',
  'seen',
  'the',
  'lordlings',
  'come',
  'and',
  'go'],
 ['dead',
  'is',
  'dead',
  'he',
  'said',
  'we',
  'have',
  'no',
  'business',
  'with',
  'the',
  'dead'],
 ['are',
  'they',
  'dead',
  'royce',
  'asked',
  'softly',
  'what',
  'proof',
  'have',
  'we'],
 ['will',
  'saw',
  'them',
  'gared',
  'said',
  'if',
  'he',
  'says',
  'the

In [23]:
# Tokenizing each line with NLTK's word_tokenize (Process 2): keeps the original case and punctuation

import nltk
from nltk.tokenize import word_tokenize

# Ensure nltk resources are available
nltk.download('punkt')

# Tokenize sentences using NLTK
tokenized_sentences_2 = [word_tokenize(line) for line in lines_list]
tokenized_sentences_2

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[['A', 'Game', 'Of', 'Thrones'],
 ['Book', 'One', 'of', 'A', 'Song', 'of', 'Ice', 'and', 'Fire'],
 ['By', 'George', 'R.', 'R.', 'Martin'],
 ['PROLOGUE'],
 ['``',
  'We',
  'should',
  'start',
  'back',
  ',',
  "''",
  'Gared',
  'urged',
  'as',
  'the',
  'woods',
  'began',
  'to',
  'grow',
  'dark',
  'around',
  'them',
  '.',
  '``',
  'The',
  'wildlings',
  'are'],
 ['dead', '.', "''"],
 ['``',
  'Do',
  'the',
  'dead',
  'frighten',
  'you',
  '?',
  "''",
  'Ser',
  'Waymar',
  'Royce',
  'asked',
  'with',
  'just',
  'the',
  'hint',
  'of',
  'a',
  'smile',
  '.'],
 ['Gared',
  'did',
  'not',
  'rise',
  'to',
  'the',
  'bait',
  '.',
  'He',
  'was',
  'an',
  'old',
  'man',
  ',',
  'past',
  'fifty',
  ',',
  'and',
  'he',
  'had',
  'seen',
  'the',
  'lordlings',
  'come',
  'and',
  'go',
  '.'],
 ['``',
  'Dead',
  'is',
  'dead',
  ',',
  "''",
  'he',
  'said',
  '.',
  '``',
  'We',
  'have',
  'no',
  'business',
  'with',
  'the',
  'dead',
  '.',
  "''

In [24]:
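# CBOW model (sg=0).
# vector_size: dimensionality of each word vector; window: context words on each side;
# min_count=1: keep every word that occurs at least once; workers: number of worker threads.
model1 = gensim.models.Word2Vec(vector_size=100, window=5, min_count=1, sg=0, workers=4)

# No corpus was passed to the constructor, so build the vocabulary and train explicitly.
# (Passing the corpus to the constructor would already train the model once.)
model1.build_vocab(tokenized_sentences)
model1.train(tokenized_sentences, total_examples=model1.corpus_count, epochs=model1.epochs)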




(1081285, 1423500)

In [26]:
model1.wv.most_similar('daenerys')

[('reed', 0.9777145981788635),
 ('rhaegar', 0.9760158658027649),
 ('forel', 0.9757736325263977),
 ('ben', 0.9749482870101929),
 ('giggling', 0.971921980381012),
 ('daeren', 0.9716361165046692),
 ('samwell', 0.9707944393157959),
 ('hushed', 0.9707140922546387),
 ('gracious', 0.9705380797386169),
 ('brothels', 0.9699256420135498)]

In [28]:
#odd one out
model1.wv.doesnt_match("cersei tyrion jaime robb".split())

'jaime'

In [29]:
model1.wv.doesnt_match("cersei tyrion jaime bronn".split())

'tyrion'

In [30]:
model1.wv.most_similar('tyrion')

[('ned', 0.9233036041259766),
 ('jon', 0.8829814791679382),
 ('robb', 0.8681408166885376),
 ('he', 0.8647586107254028),
 ('dany', 0.8488488793373108),
 ('sansa', 0.8386368751525879),
 ('catelyn', 0.8372656106948853),
 ('joffrey', 0.8333051800727844),
 ('bran', 0.8331075310707092),
 ('littlefinger', 0.8290245532989502)]

In [31]:
model1.wv.doesnt_match("jon rikon robb arya sansa bran".split())



'robb'

In [32]:
model1.wv.similarity('tyrion', 'jaime')

0.64719486

Vectors with more than 3 dimensions cannot be plotted directly. To visualize them, we can first apply a dimensionality reduction technique such as PCA, LDA, or t-SNE to bring them down to 2 or 3 dimensions.
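
As a rough picture of what PCA does, the sketch below finds the single direction of largest variance in a tiny 3-dimensional dataset and projects the points onto it. It uses power iteration on the covariance matrix, which is a simplification chosen for this illustration and not how scikit-learn implements PCA; the data points are made up.

```python
import math

def first_principal_component(points, iterations=100):
    """Return (means, direction); direction is the unit vector of largest variance."""
    n, dim = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - means[d] for d in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to its dominant eigenvector.
    direction = [1.0] * dim
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(dim))
                   for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in product))
        direction = [x / norm for x in product]
    return means, direction

# Points that lie almost on the line y = 2x, with a little noise in z.
points = [[1, 2, 0.1], [2, 4, -0.1], [3, 6, 0.1], [4, 8, -0.1]]
means, direction = first_principal_component(points)

# Project each centered point onto the direction: 3 numbers become 1.
projected = [sum((p[d] - means[d]) * direction[d] for d in range(3))
             for p in points]

print([round(x, 2) for x in direction])
print([round(x, 2) for x in projected])
```

The direction comes out close to (1, 2, 0) scaled to unit length, so one coordinate is enough to keep the ordering of the points. Reducing 100-dimensional word vectors to 3 components follows the same idea with three directions instead of one.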

In [33]:
# L2-normalized vectors of all the words in the vocabulary
model1.wv.get_normed_vectors()

array([[ 0.0274129 ,  0.06005437, -0.00443343, ..., -0.06812375,
        -0.1342919 ,  0.03715624],
       [ 0.01279588,  0.07755776,  0.083529  , ..., -0.11166973,
        -0.04484145, -0.00713079],
       [ 0.09801964, -0.04466369,  0.09030006, ..., -0.05498173,
        -0.07387774,  0.07339717],
       ...,
       [-0.11220028,  0.05541934,  0.18253513, ..., -0.01629963,
        -0.10800217, -0.09487936],
       [-0.11364119,  0.13268521,  0.04860182, ..., -0.0012538 ,
        -0.0625404 , -0.00182984],
       [-0.00191663,  0.11623424,  0.11953957, ..., -0.13361076,
        -0.08137963, -0.05734683]], dtype=float32)

In [34]:
model1.wv.get_normed_vectors().shape

(11307, 100)

In [35]:
# Dictionary mapping every vocabulary word to its index
y = model1.wv.key_to_index
len(y)

11307

In [37]:
#Applying PCA

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=3)
pca_result = pca.fit_transform(model1.wv.get_normed_vectors())

pca_result.shape

(11307, 3)

In [40]:
import plotly.express as px
# Color each point by the word's index in the vocabulary
fig = px.scatter_3d(x=pca_result[:, 0], y=pca_result[:, 1], z=pca_result[:, 2], color=list(y.values()))
fig.show()