# Section 3: Vector Models and Text Preprocessing

### Important definitions:

>**Token:**
- A token is a unit of text: it can be a word, a punctuation mark, or a sub-word. 'Token' is often used interchangeably with 'word'.

>**Letters and Characters:**
- Letters: a, A, b, B.
- Characters: '_', '\n'.
- [All letters are characters, but not all characters are letters].

>**Vocabulary:**
- The set of all unique words (tokens) in the corpus.

>**Corpus:**
- The text dataset that our ML model is trained on.

>**N-gram:**
- A sequence of N consecutive items (for example, words or characters).
- Unigram: 'I'; bigram: 'good morning'; trigram: 'See you soon'.
- Use cases: word2vec (bigrams), Markov models (bigram probabilities).
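As a minimal sketch of the definition above, the helper below slides a window of size `n` over a token list. The function name `ngrams` and the sample sentence are illustrative assumptions, not part of the original notes.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, split on whitespace
tokens = "See you soon my friend".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A list with fewer than `n` tokens simply yields an empty result, so no special-casing is needed.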

>**Vector:**
- An array of scalars.
- Use case: in spam detection, text is converted to vectors because vectors are easier to work with than raw text.

>**Bag of Words:**
- An unordered collection of words.
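To illustrate what "unordered" means here, the sketch below builds a bag of words with `collections.Counter`. The helper name `bag_of_words` and the two sample sentences are assumptions made for the example.

```python
from collections import Counter

def bag_of_words(text):
    """Count each lowercased, whitespace-separated token, ignoring order."""
    return Counter(text.lower().split())

# Two hypothetical sentences with the same words in a different order
bag_a = bag_of_words("dog bites man")
bag_b = bag_of_words("man bites dog")

print(bag_a)
print(bag_a == bag_b)
```

The two sentences mean different things, yet their bags compare equal: word order is exactly the information this representation throws away.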

>**Stopwords:**
- Very common words that carry little meaning on their own.
- Examples: "and", "the", "but".

>**Stemming and lemmatization:**
- Both reduce words to a root form.
- Stemming: crude; the result may or may not be a meaningful word.
- Lemmatization: more sophisticated; returns a meaningful root word (the lemma).
- The root word of a word depends on its part of speech (POS).

>**Vector Similarity:**
- Computing how similar two vectors are and returning a similarity score.
- Applications: article spinning, word replacement.
- Measures of similarity: Euclidean distance and cosine similarity. Cosine similarity is more commonly used.
- Euclidean distance: the Euclidean distance `d` between two points `A(x₁, y₁)` and `B(x₂, y₂)` is given by the formula:

    <p style="text-align: center;">d = sqrt((x₂ - x₁)² + (y₂ - y₁)²)</p>

- Cosine similarity - The cosine similarity between two vectors A and B is given by the formula:

    - Cosine Similarity = (A · B) / (||A|| ||B||)
    Where:
    - A · B is the dot product of A and B.
    - ||A|| is the magnitude of vector A.
    - ||B|| is the magnitude of vector B.
    In expanded form, this can be written as:
    <p style="text-align: center;">Cosine Similarity = Σ (Aᵢ * Bᵢ) / (√(Σ Aᵢ²) * √(Σ Bᵢ²))</p>
    Where Σ denotes the summation over the vector elements.

- When ranking items by similarity, Euclidean distance and cosine similarity give the same ordering once the vectors are normalized to unit length, because for unit vectors d² = 2 - 2 × (cosine similarity).
- Normalizing a vector means dividing it by its L2-norm, so that its L2-norm becomes 1. With every vector at unit length, the comparison depends only on the angle between them, which keeps it simple.
- L2-norm: the Euclidean distance of a vector from the origin.
    - The L2-norm (Euclidean norm) of a vector A with n components is given by the formula:

    <p style="text-align: center;">||A||₂ = √(Σ Aᵢ²)</p>

    - Where:
        - ||A||₂ represents the L2-norm of vector A.
        - Aᵢ denotes the i-th component of vector A.

    - This formula calculates the square root of the sum of squares of each component of the vector, providing a measure of the vector's magnitude in Euclidean space.
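The formulas above can be sketched in plain Python as follows. The vectors and the names `query`, `candidates`, `doc_a`, `doc_b`, and `doc_c` are made-up values for illustration; the point is only to check that, after normalization, ranking by cosine similarity and ranking by Euclidean distance agree.

```python
import math

def l2_norm(a):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in a))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))

def normalize(a):
    """Scale a vector to unit length (L2-norm of 1)."""
    n = l2_norm(a)
    return [x / n for x in a]

# Hypothetical vectors, chosen only for illustration
query = [1.0, 2.0, 0.0]
candidates = {
    "doc_a": [2.0, 4.0, 0.0],
    "doc_b": [1.0, 0.0, 3.0],
    "doc_c": [0.0, 1.0, 1.0],
}

q = normalize(query)

# Highest cosine similarity first
by_cosine = sorted(
    candidates,
    key=lambda k: cosine_similarity(query, candidates[k]),
    reverse=True,
)
# Smallest Euclidean distance (between unit vectors) first
by_distance = sorted(
    candidates,
    key=lambda k: euclidean_distance(q, normalize(candidates[k])),
)

print(by_cosine)
print(by_distance)
```

Note the opposite sort directions: a higher cosine similarity means "more similar", while a smaller distance does.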

>**TF-IDF:**
- An improvement over the count vectorizer: raw counts are reweighted so that words common to many documents count for less.
- Popular for document retrieval and text mining.
- Intuition: if a word appears in many documents, it does little to distinguish them, so it carries little significance when analyzing those documents.
- The TF-IDF (Term Frequency-Inverse Document Frequency) score of a term `t` in a document `d` within a corpus is calculated as follows:

    <p style="text-align: center;">TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)</p>

    - Where:
        - TF(t, d) is the term frequency: how often term `t` occurs in document `d`.
        - IDF(t, D) is the inverse document frequency of term `t` in corpus `D`, representing the logarithmically scaled inverse fraction of the documents that contain term `t` across the entire corpus `D`.

    - TF-IDF score is used to evaluate the importance of a term within a document relative to its frequency in the entire corpus, providing a measure of the significance of the term in the context of the document and corpus.
- The normalized TF-IDF score of a term t in a document d within a corpus is calculated as follows:

    <p style="text-align: center;">TF-IDF_norm(t, d, D) = TF-IDF(t, d, D) / √(∑(TF-IDF(t', d, D)²))</p>

    - Where:
        - TF-IDF(t, d, D) is the TF-IDF score of term t in document d within corpus D, as calculated by the TF-IDF formula.
        - ∑ denotes the summation over all terms t' in document d.
        - TF-IDF_norm(t, d, D) is the normalized TF-IDF score of term t in document d within corpus D.

    - Normalization divides each TF-IDF score in a document by the Euclidean norm of that document's TF-IDF vector. Every document vector then has unit length and, for non-negative scores, each score lies in the range [0, 1].
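A minimal sketch of the two formulas above, assuming raw counts for TF and IDF(t, D) = log(N / df(t)), where N is the number of documents and df(t) the number of documents containing `t`. These are illustrative choices: libraries often use smoothed variants, so their numbers may differ. The documents are the four corpus sentences, already lowercased and tokenized by hand.

```python
import math
from collections import Counter

docs = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
    ["and", "this", "is", "the", "third", "one"],
    ["is", "this", "the", "first", "document"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term occurs
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """TF-IDF(t, d, D) = TF(t, d) * log(N / df(t)) for each term in doc."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

def l2_normalize(scores):
    """Divide every score by the Euclidean norm of the score vector."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {t: (v / norm if norm else 0.0) for t, v in scores.items()}

weights = [l2_normalize(tfidf(doc)) for doc in docs]

for term, score in sorted(weights[1].items()):
    print(f"{term}: {score:.3f}")
```

Under these assumptions, "this", "is", and "the" appear in every document, so their IDF is log(1) = 0 and they drop out, while "second" (unique to one document) receives the largest weight in that document.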

>**Word embedding:**
- Unlike count vectorizing, which treats words as isolated entities with frequencies, word embeddings capture the semantic relationships between words.
- Words with similar meanings or contexts are closer together in the embedding space.
- Analogies: Man:King::Woman:Queen, Miami:Florida::Dallas:Texas
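The analogy idea can be sketched with hand-made vectors. The 2-dimensional values below are invented for illustration (one axis loosely standing for "royalty", the other for "gender"); they are not real trained embeddings, which have many more dimensions and are learned from data.

```python
import math

# Hypothetical toy embeddings: (royalty, gender)
toy = {
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Man:King::Woman:?  ->  king - man + woman
target = tuple(
    k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])
)

# Search among the words that are not part of the question itself
options = [w for w in toy if w not in ("king", "man", "woman")]
best = max(options, key=lambda w: cosine(toy[w], target))

print(target, best)
```

With these toy values the arithmetic lands exactly on the vector for "queen"; with real embeddings the result is only approximately close, which is why a nearest-neighbor search by cosine similarity is used.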

### Tokenization:
>**Definition:**
- Splitting text into tokens, e.g. "I like cats" -> str.split() -> ["I", "like", "cats"].
- Punctuation characters can be kept as tokens if that improves the result.
- Types of tokenization: word-based, character-based, sub-word-based.
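The three types listed above can be compared side by side in a small sketch. The sub-word part uses a greedy longest-match against a tiny hand-written vocabulary (`toy_vocab`), which is a simplifying assumption: real sub-word tokenizers learn their vocabulary from data. All names here are illustrative.

```python
import re

text = "I like cats!"

# Word-based: whitespace split leaves punctuation attached to the word
word_tokens = text.split()

# Word-based, with punctuation as separate tokens
word_punct_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-based: every character (spaces included) is a token
char_tokens = list(text)

def subword_tokenize(word, vocab):
    """Greedily take the longest known piece; fall back to one character."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no known piece starts here
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical sub-word vocabulary
toy_vocab = {"token", "ization", "cat", "s"}

print(word_tokens)
print(word_punct_tokens)
print(char_tokens)
print(subword_tokenize("tokenization", toy_vocab))
print(subword_tokenize("cats", toy_vocab))
```

Sub-word tokenization sits between the other two: the vocabulary stays small, yet an unseen word can still be represented from known pieces.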


In [1]:
# %pip install scikit-learn
# %pip install nltk
# %pip install --upgrade pip

In [2]:
# Import the CountVectorizer class from sklearn
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
# Read the text file; each line becomes one document in the corpus
with open('../textfile.txt', 'r') as f:
    corpus = [line.strip() for line in f]
corpus

['This is the first document.',
 'This document is the second document.',
 'And this is the third one.',
 'Is this the first document?']

In [5]:
# Create a CountVectorizer object and fit it to the corpus (tokenizes and counts)
# Tokens are words; text is lowercased and accents are stripped to ASCII
vectorizer_cv = CountVectorizer(analyzer='word', lowercase=True, strip_accents='ascii')
# For character-level tokenization instead:
# vectorizer_cv = CountVectorizer(analyzer='char', lowercase=True, strip_accents='ascii')
X = vectorizer_cv.fit_transform(corpus)
X = vectorizer_cv.fit_transform(corpus)

In [6]:
# Get the vocabulary
vectorizer_cv.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [7]:
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

In [19]:
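# Removing stopwords
# Import the stopword list, WordNet, and the word tokenizer from nltk
# Note: word_tokenize also needs NLTK's tokenizer data (e.g. 'punkt') to be downloaded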
import nltk
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\debnathk\AppData\Roaming\nltk_data...


In [12]:
nltk.download('stopwords')
stop_words = stopwords.words('english')
print(stop_words)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\debnathk\AppData\Roaming\nltk_data...


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data]   Unzipping corpora\stopwords.zip.


In [13]:
word_tokens = word_tokenize(corpus[0])
word_tokens

['This', 'is', 'the', 'first', 'document', '.']

In [14]:
# Filter stopwords out of the first sentence
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
filtered_sentence

['first', 'document', '.']

In [15]:
# Stemming
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in word_tokens:
    print(f'{word}: {stemmer.stem(word)}')

This: thi
is: is
the: the
first: first
document: document
.: .


In [20]:
# Lemmatizing
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in word_tokens:
    print(f'{word}: {lemmatizer.lemmatize(word)}')

This: This
is: is
the: the
first: first
document: document
.: .


In [21]:
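# Extra: POS-tagging

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Default to noun, which is also the lemmatizer's default POS
        return wordnet.NOUN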

In [22]:
sentence = "Donal Trump has a devoted following."
sentence_tok = word_tokenize(sentence)
sentence_tok

['Donal', 'Trump', 'has', 'a', 'devoted', 'following', '.']

In [24]:
nltk.download('averaged_perceptron_tagger')

words_and_tags = nltk.pos_tag(sentence_tok)
words_and_tags

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\debnathk\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


[('Donal', 'NNP'),
 ('Trump', 'NNP'),
 ('has', 'VBZ'),
 ('a', 'DT'),
 ('devoted', 'VBN'),
 ('following', 'NN'),
 ('.', '.')]

In [25]:
for word, tag in words_and_tags:
    lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
    print(lemma, end=" ")

Donal Trump have a devote following . 

### Count Vectorizer:

>**Definition:**
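- Converting each document into a vector of token counts: one entry per vocabulary word, holding the number of times that word occurs in the document.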
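To make the definition concrete, here is a hand-rolled sketch that rebuilds the vocabulary and count matrix for the four-sentence corpus used earlier, without scikit-learn. It assumes tokens are lowercased runs of two or more word characters, which I believe is close to `CountVectorizer`'s default word analyzer; the helper names `tokenize` and `count_vector` are hypothetical.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every unique token in the corpus, in sorted order
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

def count_vector(text):
    """One count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

matrix = [count_vector(doc) for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary word, so the second row has a 2 in the "document" column because that word occurs twice in the second sentence.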