# Definition

Text representation means converting words and sentences into numbers so a computer can process them.

# Why do we need it?

Computers cannot work with raw text directly.
So we convert text into numbers (vectors) that ML models can process and learn from.

We use text representation to create good features (numerical vectors) from text so that machine learning models can learn better.

# Why is it difficult?

Language is ambiguous: the same word can have several meanings, and the same idea can be written in many ways. Computers only work with numbers, so a numeric representation has to preserve as much of that meaning as possible.

# What is the core idea?

Turn text (words) into numbers so that a computer can process it and learn from it.



# What are the techniques?

1️⃣ Bag of Words (BoW) – Count how many times each word appears.

2️⃣ TF-IDF – Weighted counting (words that are frequent in a document but rare in the corpus get more weight).

3️⃣ One-Hot Encoding – Each word gets a unique vector of 0s with a single 1.

4️⃣ Word Embeddings – Convert words into dense vectors that capture meaning (like Word2Vec).

5️⃣ Contextual Embeddings – Represent a word based on the sentence it appears in (like BERT).

# Common terms

1️⃣ Corpus

A corpus is the whole collection of text data.

2️⃣ Document

A single piece of text in the corpus (for example, one sentence, review, or article).

3️⃣ Token

A token is a small unit of text (usually a word).

4️⃣ Tokenization

The process of breaking text into tokens.

5️⃣ Vocabulary

The list of all unique tokens in the corpus.

6️⃣ Vector

The numerical form of a piece of text.

7️⃣ Feature

A numerical value (or vector of values) that the ML model uses as input.

8️⃣ Stop Words

Very common words that carry little meaning on their own (for example, "the", "is").

9️⃣ Stemming

Cutting a word down to a crude root form by stripping endings (for example, "running" → "run"); the result is not always a real word.

🔟 Lemmatization

Converting a word to its dictionary base form, the lemma (for example, "better" → "good").


# 🎯 One-Line Core Memory (Exam/Interview)

Corpus = full text data, Document = single text, Token = word, Vocabulary = unique words, Vector = numerical form of text.

# One-Hot Encoding

One-Hot Encoding is a method that converts each word into a vector of 0s and 1s where only one position is 1 and the rest are 0.

# Simple

Each word gets its own unique position in a list (vocabulary).

# Example

“I love AI”

# Vocabulary (unique words):

[I, love, AI]

# One-Hot Vectors:

I     → [1, 0, 0]
love  → [0, 1, 0]
AI    → [0, 0, 1]
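
The example above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper names `build_vocabulary` and `one_hot` are made up for illustration, and splitting on whitespace stands in for a real tokenizer.

```python
def build_vocabulary(sentence):
    """Return the unique words of a sentence, in order of first appearance."""
    vocabulary = []
    for word in sentence.split():  # assumption: whitespace tokenization
        if word not in vocabulary:
            vocabulary.append(word)
    return vocabulary


def one_hot(word, vocabulary):
    """Return a list of 0s with a single 1 at the word's vocabulary position."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1  # raises ValueError for unknown words
    return vector


vocabulary = build_vocabulary("I love AI")
for word in vocabulary:
    print(word, one_hot(word, vocabulary))
```

Each vector is as long as the vocabulary, which is why one-hot vectors grow quickly on real corpora.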

# ❌ Main Problems

1️⃣  Vectors become very large when the vocabulary is big (one dimension per word)

2️⃣  Does NOT capture meaning or similarity between words (every pair of words is equally different)

3️⃣  Sparsity problem: most values are 0, which wastes memory and computation

# Bag of Words

Bag of Words is a text representation technique that converts text into numbers by counting how many times each word appears.

The vectors have a fixed size: every sentence is converted into a vector of the same length (the vocabulary size).

# Core Idea (Very Simple)

Ignore grammar and order.
Just count the words.

# Example

“I love AI”

“I love coding”

# Step 1: Create Vocabulary (unique words)

[I, love, AI, coding]

# Step 2: Count word frequency (BoW vectors)

"I love AI"     → [1, 1, 1, 0]  
"I love coding" → [1, 1, 0, 1]

**Each number = how many times the word appears.**
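
The two steps can be sketched in plain Python as follows. The function names `build_corpus_vocabulary` and `bag_of_words` are hypothetical, and whitespace splitting is an assumed stand-in for proper tokenization.

```python
from collections import Counter


def build_corpus_vocabulary(documents):
    """Step 1: collect unique words across all documents, in first-seen order."""
    vocabulary = []
    for document in documents:
        for word in document.split():  # assumption: whitespace tokenization
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def bag_of_words(document, vocabulary):
    """Step 2: count how often each vocabulary word occurs in the document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]  # missing words count as 0


documents = ["I love AI", "I love coding"]
vocabulary = build_corpus_vocabulary(documents)
vectors = [bag_of_words(document, vocabulary) for document in documents]

print(vocabulary)
for document, vector in zip(documents, vectors):
    print(document, vector)
```

Both vectors have length 4 because the vocabulary has 4 words, regardless of how long each sentence is.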

# ❌ Problems with Bag of Words

1. Ignores word order, so sentences with different meanings can get the same vector (this is why we study n-grams)

2. Large vocabulary = large vectors

3. Does not understand context or meaning

# Example:
“I love AI” and “AI love I” → Same vector (wrong meaning)

Both give the same vector in Bag of Words ❌, but the meaning is different.
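
A quick way to check the word-order problem is to compare the word counts of the two sentences. This is a small sketch, with `word_counts` as a made-up helper and whitespace splitting assumed.

```python
from collections import Counter


def word_counts(sentence):
    """Count words while ignoring their order, as Bag of Words does."""
    return Counter(sentence.split())


same = word_counts("I love AI") == word_counts("AI love I")
print(same)  # True: the counts are identical, so the order is lost
```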






# n-grams

N-grams are sequences of N consecutive words, used to capture local word order and context in text.

# Simple Idea

Instead of looking at single words, we look at groups of neighbouring words.

# Example

“I love AI”

1️⃣ Unigram (n = 1) → Single words

[I], [love], [AI]

2️⃣ Bigram (n = 2) → 2-word groups

[I love], [love AI]

3️⃣ Trigram (n = 3) → 3-word groups

[I love AI]
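
The unigrams, bigrams, and trigrams listed above can be generated with a sliding window over the words. The function name `ngrams` is chosen here for illustration, and whitespace splitting is an assumption.

```python
def ngrams(sentence, n):
    """Return every run of n consecutive words in the sentence."""
    words = sentence.split()  # assumption: whitespace tokenization
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


for n in (1, 2, 3):
    print(n, ngrams("I love AI", n))
```

A sentence with fewer than `n` words simply produces an empty list.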

# 🎯 Why We Use N-grams

1. To capture word order

2. To understand context better

3. To keep more of the meaning than single-word Bag of Words does

Example:

“not good” ≠ “good”

Bigram captures: [not good] ✔ (better meaning)
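
To see both points in code, the sketch below counts bigrams instead of single words. The helper `bigram_counts` and the sentence "the movie is not good" are illustrative assumptions, not part of the notes above.

```python
from collections import Counter


def bigram_counts(sentence):
    """Count each pair of neighbouring words in the sentence."""
    words = sentence.split()  # assumption: whitespace tokenization
    return Counter(zip(words, words[1:]))


# Word order now matters: the two sentences no longer look identical.
order_lost = bigram_counts("I love AI") == bigram_counts("AI love I")
print(order_lost)  # False

# "not good" survives as one feature instead of two unrelated words.
has_not_good = ("not", "good") in bigram_counts("the movie is not good")
print(has_not_good)  # True
```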

# ❌ Problems with N-grams

Vocabulary becomes very large

High memory usage

Slower computation



# TF-IDF

TF-IDF = Term Frequency – Inverse Document Frequency

TF-IDF gives higher weight to words that are frequent in a document but rare across the corpus, and lower weight to words that are common everywhere.

# Simple Understanding

Not all words are equally important.

# Example sentence:

“The movie is very very good”

“the”, “is” → common (less important)

“good” → important (more weight)

TF-IDF works this out automatically from word counts across the corpus; nothing is trained.

# Two Parts (Easy)

1️⃣ TF (Term Frequency)

How many times a word appears in a document.

Example:

“AI AI coding”

TF(AI) = 2

2️⃣ IDF (Inverse Document Frequency)

How rare a word is across all documents (the corpus).

Common word → low IDF

Rare word → high IDF
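
One common way to write the two parts as formulas is shown below. The symbols are assumptions introduced here: $N$ is the number of documents in the corpus and $\mathrm{df}(t)$ is the number of documents that contain the term $t$. This is the basic unsmoothed form; libraries often use smoothed or normalized variants, so their exact numbers may differ.

$$
\mathrm{tf}(t, d) = \text{number of times } t \text{ appears in document } d
$$

$$
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

A word that appears in every document has $\mathrm{df}(t) = N$, so $\mathrm{idf}(t) = \log 1 = 0$ and its TF-IDF weight is 0.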

# Example (Easy)

Documents:

“I love AI”

“I love coding”

Common words: “I”, “love” → lower importance
Rare words: “AI”, “coding” → higher importance

So TF-IDF gives more weight to:

AI, coding ✔
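
The same example can be worked through in plain Python. This is a sketch under stated assumptions: TF is the raw count, IDF is the unsmoothed `log(N / df)`, tokens come from whitespace splitting, and the function name `tf_idf` is made up. Real libraries such as scikit-learn use smoothed variants by default, so their numbers will not match these.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document (raw TF times log(N / df))."""
    tokenized = [document.split() for document in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            word: term_freq[word] * math.log(n_docs / doc_freq[word])
            for word in term_freq
        })
    return weights


weights = tf_idf(["I love AI", "I love coding"])
for document_weights in weights:
    print(document_weights)
```

Here "I" and "love" occur in both documents, so their weight is 0, while "AI" and "coding" each occur in only one document and get a positive weight.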

# 🎯 Why We Use TF-IDF

Reduces the weight of common words

Highlights important keywords

Often works better than raw Bag of Words counts for tasks such as search and text classification

# ❌ Limitation

Still does NOT capture the deeper meaning of words (for example, synonyms are treated as unrelated words)

Ignores word order and context (like BoW)

# 📌 One-Line (Exam / Interview)

TF-IDF is a text representation technique that measures how important a word is in a document relative to the entire corpus.

**TF-IDF gives higher weight to important and rare words, and lower weight to common words in the corpus.**