# Data Representation

## Today's Agenda

1. What is feature extraction from text/image?  
2. Why do we need it?
3. Why is it so difficult?
4. What is the core idea?

## Agenda in Detail

1. What is feature extraction from text/image?  
Feature extraction is the process of transforming raw data (like text or images) into numerical features that can be used for machine learning. For text, this could mean converting words into vectors (such as using Bag of Words, TF-IDF, or word embeddings). For images, it could involve extracting edges, shapes, or using deep learning to get feature vectors.

2. Why do we need it?  
Machine learning algorithms require numerical input. Raw text or images cannot be directly used by most algorithms, so feature extraction converts them into a format that models can understand and learn from.

3. Why is it so difficult?  
Raw data is often complex, high-dimensional, and unstructured. Capturing the most important information while ignoring noise is challenging. Also, different tasks may require different features, and the best features are not always obvious.

4. What is the core idea?  
The core idea is to represent complex data in a simplified, structured, and informative way that preserves the essential information needed for the task (such as classification or clustering), while reducing noise and irrelevant details.

5. Some techniques  
- For text: Bag of Words, TF-IDF, word embeddings (Word2Vec, GloVe, BERT)
- For images: SIFT, HOG, color histograms, CNN feature maps

____

**TF-IDF** stands for **Term Frequency–Inverse Document Frequency**. It is a statistical measure used to evaluate how important a word is to a document in a collection (corpus).

### How it works:
- **Term Frequency (TF):**  
  Measures how frequently a word appears in a document.  
  TF = (Number of times the word appears in the document) / (Total words in the document)

- **Inverse Document Frequency (IDF):**  
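  Measures how important a word is across all documents. Rare words get a higher score.  
  IDF = log(Total number of documents / Number of documents containing the word)  
  The logarithm base is a matter of convention and only rescales the scores; the worked example in this section uses base 10.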

- **TF-IDF Score:**  
  TF-IDF = TF × IDF  
  A high TF-IDF score means the word is frequent in a document but rare in the corpus, making it important for that document.

### Why use TF-IDF?
- It helps highlight unique words in each document.
- Common words (like "the", "is") get lower scores, while rare, meaningful words get higher scores.
- Widely used in text mining, search engines, and document classification.

**Example:**  
If "pizza" appears often in one review but rarely in others, it will have a high TF-IDF score for that review, indicating its importance.

____

### TF-IDF Example with Calculation

Suppose you have the following 3 documents:

- **Document 1:** "I love pizza"
- **Document 2:** "I love pasta"
- **Document 3:** "Pizza and pasta are delicious"

Let's calculate the TF-IDF for the word **"pizza"** in **Document 1**.

---

#### 1. **Term Frequency (TF)**
TF = (Number of times "pizza" appears in Document 1) / (Total words in Document 1)  
TF = 1 / 3 = **0.333**

---

#### 2. **Inverse Document Frequency (IDF)**
First, count in how many documents "pizza" appears:
- "pizza" appears in Document 1 and Document 3 → **2 documents**

IDF = log₁₀(Total number of documents / Number of documents containing "pizza")  
IDF = log₁₀(3 / 2) ≈ **0.176**

---

#### 3. **TF-IDF Score**
TF-IDF = TF × IDF  
TF-IDF = 0.333 × 0.176 ≈ **0.059**

---

**Interpretation:**  
- The TF-IDF score for "pizza" in Document 1 is **0.059**.
- This means "pizza" is somewhat important in Document 1, but since it appears in more than one document, its uniqueness is reduced.
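
The arithmetic above can be reproduced in a few lines of plain Python. This is a minimal sketch, assuming lowercasing, whitespace tokenization, and a base-10 logarithm (the base that yields 0.176); the helper names `tf` and `idf` are illustrative and do not come from any library.

```python
import math

documents = [
    "I love pizza",
    "I love pasta",
    "Pizza and pasta are delicious",
]

# Assumption: lowercase the text and split on whitespace.
tokenized = [doc.lower().split() for doc in documents]


def tf(term, tokens):
    """Share of the document's tokens that equal `term`."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Base-10 log of (number of documents / documents containing `term`)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / containing)


tf_pizza = tf("pizza", tokenized[0])
idf_pizza = idf("pizza", tokenized)
tfidf_pizza = tf_pizza * idf_pizza

print(round(tf_pizza, 3), round(idf_pizza, 3), round(tfidf_pizza, 3))
# 0.333 0.176 0.059
```

Note that `idf` as written divides by zero for a term that appears in no document; library implementations usually add smoothing to avoid this.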

In [4]:
# Example: Calculating TF-IDF using scikit-learn

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "I love pizza",
    "I love pasta",
    "Pizza and pasta are delicious"
]

# Create the vectorizer (by default it lowercases the text and drops
# single-character tokens such as "I")
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Convert the sparse matrix to a dense DataFrame and print it
df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print(df)

        and       are  delicious      love     pasta     pizza
0  0.000000  0.000000   0.000000  0.707107  0.000000  0.707107
1  0.000000  0.000000   0.000000  0.707107  0.707107  0.000000
2  0.490479  0.490479   0.490479  0.000000  0.373022  0.373022
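
These values differ from the hand calculation (0.059 for "pizza" in Document 1) because `TfidfVectorizer` applies different defaults: it lowercases the text, keeps only tokens of two or more characters (so "I" is dropped), uses raw counts for TF, computes a smoothed IDF as ln((1 + N) / (1 + df)) + 1, and scales each row to unit L2 norm. The sketch below reproduces the matrix in plain Python under those default settings; the helper names are illustrative.

```python
import math
import re

documents = [
    "I love pizza",
    "I love pasta",
    "Pizza and pasta are delicious",
]

# Mimic the default tokenizer: lowercase, keep tokens of 2+ word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(tokenized)


def smoothed_idf(term):
    """ln((1 + N) / (1 + df)) + 1, the smoothed IDF used when smooth_idf=True."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(tokens):
    """Raw count times smoothed IDF, then scaled to unit L2 norm."""
    weights = [tokens.count(term) * smoothed_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


print(vocabulary)
for tokens in tokenized:
    print([round(w, 6) for w in tfidf_row(tokens)])
```

The printed rows match the table above. In Document 1 both remaining words ("love" and "pizza") occur in two documents, so they receive equal weight and each ends up at 1/√2 ≈ 0.707 after normalization.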


In [7]:
from transformers import AutoTokenizer

# Choose a model (e.g., 'bert-base-uncased' or 'gpt2')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

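text = "I love pizza and pasta!"
tokens = tokenizer.tokenize(text)    # WordPiece tokens; the uncased model lowercases the text
token_ids = tokenizer.encode(text)   # encode() also adds the special tokens [CLS] (101) and [SEP] (102)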

print("Tokens:", tokens)
print("Token IDs:", token_ids)

Tokens: ['i', 'love', 'pizza', 'and', 'pasta', '!']
Token IDs: [101, 1045, 2293, 10733, 1998, 24857, 999, 102]


**Bag of Words (BoW)** is a simple and popular technique for representing text data in machine learning.

- In BoW, each document is represented as a vector of word counts (or frequencies), ignoring grammar and word order.
- The vocabulary is built from all unique words in the corpus.
- Each position in the vector corresponds to a word from the vocabulary, and the value is the number of times that word appears in the document.

**Example:**  
Suppose your corpus has these two sentences:  
1. "I love pizza"  
2. "I love pasta"

The vocabulary is: `["I", "love", "pizza", "pasta"]`

The BoW vectors are:  
- "I love pizza" → [1, 1, 1, 0]  
- "I love pasta" → [1, 1, 0, 1]

BoW is widely used for text classification and information retrieval, but it does not capture word order or context.
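
The example above can be reproduced without any library. This is a minimal sketch, assuming whitespace tokenization and a vocabulary ordered by first appearance; the function name `bag_of_words` is illustrative.

```python
corpus = ["I love pizza", "I love pasta"]

# Assumption: split on whitespace and keep words in order of first appearance.
vocabulary = []
for sentence in corpus:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)


def bag_of_words(sentence):
    """Count how often each vocabulary word occurs in `sentence`."""
    words = sentence.split()
    return [words.count(word) for word in vocabulary]


vectors = [bag_of_words(sentence) for sentence in corpus]
print(vocabulary)  # ['I', 'love', 'pizza', 'pasta']
print(vectors)     # [[1, 1, 1, 0], [1, 1, 0, 1]]
```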

____

**One hot encoding** is a technique used to convert categorical data (such as words or labels) into a numerical format that can be used by machine learning algorithms.

In one hot encoding:
- Each unique category or word is represented as a binary vector.
- The vector has the same length as the number of unique categories.
- Only the position corresponding to the category is set to 1; all other positions are 0.

**Example:**  
Suppose you have three categories: `cat`, `dog`, `mouse`.

| Category | One Hot Encoding |
|----------|------------------|
| cat      | [1, 0, 0]        |
| dog      | [0, 1, 0]        |
| mouse    | [0, 0, 1]        |

This method is commonly used for representing categorical variables in machine learning and for representing words in text data before more advanced techniques like embeddings.
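
The table above can be generated with a few lines of plain Python. This is a minimal sketch; the helper name `one_hot` is illustrative.

```python
categories = ["cat", "dog", "mouse"]

# Map each category to its position in the vector.
index = {category: i for i, category in enumerate(categories)}


def one_hot(category):
    """Return a vector with a single 1 at the category's position."""
    vector = [0] * len(categories)
    vector[index[category]] = 1
    return vector


for category in categories:
    print(category, one_hot(category))
# cat [1, 0, 0]
# dog [0, 1, 0]
# mouse [0, 0, 1]
```

In practice, library helpers such as scikit-learn's `OneHotEncoder` or pandas' `get_dummies` are more common than a hand-written function.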

___

**Drawbacks of One Hot Encoding:**

- High Dimensionality: For large vocabularies or many categories, the resulting vectors become very large and sparse, consuming a lot of memory.
- No Semantic Meaning: It does not capture any relationship or similarity between categories or words (for example, "cat" and "dog" are as different as "cat" and "car").
- Curse of Dimensionality: High-dimensional data can make machine learning models less efficient and harder to train.
- Not Suitable for Rare Categories: Categories that appear rarely give the model few examples to learn from, so their dimensions add size without adding much signal.

____

**Drawbacks of Bag of Words:**

- Ignores Word Order: Bag of Words does not consider the order of words, so important context or meaning can be lost.
- High Dimensionality: Like one hot encoding, Bag of Words creates large, sparse vectors for big vocabularies.
- No Semantic Information: It treats all words as independent and does not capture word meanings or relationships.
- Sensitive to Vocabulary Size: Adding new words to the corpus changes the vector size, and words not seen when the vocabulary was built have no position in the vector, so they are dropped at prediction time.
- Cannot Handle Synonyms: Different words with similar meanings are treated as completely different features.
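
The first drawback is easy to demonstrate. In the sketch below, two sentences with opposite meanings produce identical word counts; the example sentences are made up for illustration.

```python
from collections import Counter

# Same words, different order, opposite meaning.
counts_a = Counter("dog bites man".split())
counts_b = Counter("man bites dog".split())

print(counts_a == counts_b)  # True: Bag of Words cannot tell the two apart
```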

____

For sentiment analysis, you can use several techniques depending on your data size, accuracy needs, and resources:

**1. Bag of Words (BoW) or TF-IDF:**  
- Good for simple models and small datasets.
- Convert text to numerical vectors using BoW or TF-IDF.
- Train a classifier (like Logistic Regression, SVM, or Naive Bayes) on these vectors.

**2. Word Embeddings:**  
- Use pre-trained embeddings (Word2Vec, GloVe, FastText) for better semantic understanding.
- Average the word vectors or use them as input to a neural network.

**3. Deep Learning / LLMs:**  
- Use models like LSTM, GRU, or transformers (BERT, RoBERTa, DistilBERT).
- These models capture context and word order, giving better results for complex data.
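
For technique 2, averaging word vectors can be sketched as follows. The three-dimensional embeddings here are hypothetical values made up for illustration; real vectors would come from a pre-trained model such as Word2Vec or GloVe and typically have hundreds of dimensions.

```python
# Hypothetical 3-dimensional embeddings, made up for illustration only.
toy_embeddings = {
    "i": [0.1, 0.3, 0.0],
    "love": [0.9, 0.8, 0.1],
    "pizza": [0.2, 0.7, 0.9],
}


def average_vector(text):
    """Mean of the embeddings of all known words in `text`."""
    words = text.lower().split()
    vectors = [toy_embeddings[w] for w in words if w in toy_embeddings]
    if not vectors:
        return [0.0] * 3  # no known words: fall back to a zero vector
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


sentence_vector = average_vector("I love pizza")
print([round(x, 3) for x in sentence_vector])
# [0.4, 0.6, 0.333]
```

The resulting fixed-length vector can be fed to the same classifiers used with BoW or TF-IDF features. Like those, averaging discards word order.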

**Recommendation:**  
- For beginners or small projects, start with TF-IDF + Logistic Regression.
- For higher accuracy, use a pre-trained transformer model (like BERT) with fine-tuning.

**Example (TF-IDF + Logistic Regression):**


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Example data
texts = ["I love this product!", "This is terrible.", "Absolutely fantastic!", "Not good at all."]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train classifier
clf = LogisticRegression()
clf.fit(X, labels)

# Predict sentiment
test_texts = ["I hate this", "Best ever!"]
X_test = vectorizer.transform(test_texts)
predictions = clf.predict(X_test)
print(predictions)

[0 0]
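
The output labels "Best ever!" as negative, which is not surprising with only four training sentences. One likely contributor is that neither "best" nor "ever" occurs in the training texts, so the sentence is transformed into an all-zero vector and the classifier has nothing but its intercept to go on. The sketch below checks which words of each test sentence the fitted vectorizer actually knows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["I love this product!", "This is terrible.", "Absolutely fantastic!", "Not good at all."]

vectorizer = TfidfVectorizer()
vectorizer.fit(texts)

# build_analyzer() returns the function the vectorizer uses to tokenize text.
analyze = vectorizer.build_analyzer()

for sentence in ["I hate this", "Best ever!"]:
    row = vectorizer.transform([sentence])
    known = [w for w in analyze(sentence) if w in vectorizer.vocabulary_]
    print(sentence, "->", known, "| non-zero features:", row.nnz)
# I hate this -> ['this'] | non-zero features: 1
# Best ever! -> [] | non-zero features: 0
```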




For best results on large or complex datasets, use transformer-based models from Hugging Face Transformers.

____
## Let's implement Bag of Words in Python

In [6]:
import numpy as np
import pandas as pd

In [7]:
df = pd.DataFrame({
    "text": ["people watch dswithbappy",
             "dswithbappy watch dswithbappy",
             "people write comment",
             "dswithbappy write comment"],
    "output": [1, 1, 0, 0],
})

df

|   | text                          | output |
|---|-------------------------------|--------|
| 0 | people watch dswithbappy      | 1      |
| 1 | dswithbappy watch dswithbappy | 1      |
| 2 | people write comment          | 0      |
| 3 | dswithbappy write comment     | 0      |


In [8]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [9]:
bow = cv.fit_transform(df["text"])

In [10]:
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'dswithbappy': 1, 'write': 4, 'comment': 0}


In [11]:
bow.toarray()

array([[0, 1, 1, 1, 0],
       [0, 2, 0, 1, 0],
       [1, 0, 1, 0, 1],
       [1, 1, 0, 0, 1]])

In [12]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[0 1 1 1 0]]
[[0 2 0 1 0]]
[[1 0 1 0 1]]


In [13]:
# Transform a new sentence; the unseen word "Bappy" is not in the vocabulary, so it is ignored
cv.transform(['Bappy watch dswithbappy']).toarray()

array([[0, 1, 0, 1, 0]])

In [14]:
X = bow.toarray()
y = df['output']
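
`X` and `y` are now in the shape a scikit-learn classifier expects. The notebook does not show a training step here, so the following is only a sketch of one possible next step, assuming Multinomial Naive Bayes as the classifier; any classifier that accepts count features would work the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "people watch dswithbappy",
    "dswithbappy watch dswithbappy",
    "people write comment",
    "dswithbappy write comment",
]
labels = [1, 1, 0, 0]

cv = CountVectorizer()
X = cv.fit_transform(texts).toarray()

# Assumption: Multinomial Naive Bayes, chosen here only for illustration.
clf = MultinomialNB()
clf.fit(X, labels)

# New text must go through the same fitted vectorizer before prediction.
new_X = cv.transform(["people watch video"]).toarray()
print(clf.predict(new_X))
```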

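### N-grams

An n-gram is a sequence of n consecutive words treated as a single token. With an n-gram value of 2 (bigrams), for example, every pair of adjacent words becomes one token.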

In [15]:
df = pd.DataFrame({
    "text": ["people watch dswithbappy",
             "dswithbappy watch dswithbappy",
             "people write comment",
             "dswithbappy write comment"],
    "output": [1, 1, 0, 0],
})

df

|   | text                          | output |
|---|-------------------------------|--------|
| 0 | people watch dswithbappy      | 1      |
| 1 | dswithbappy watch dswithbappy | 1      |
| 2 | people write comment          | 0      |
| 3 | dswithbappy write comment     | 0      |


In [16]:
# Bigrams: every pair of consecutive words becomes one token
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2,2))

This line creates an instance of the `CountVectorizer` class with the parameter `ngram_range=(2,2)`. The `CountVectorizer` is a tool from scikit-learn used to convert a collection of text documents into a matrix of token counts, which is a common first step in text analysis and machine learning workflows.

By setting `ngram_range=(2,2)`, the vectorizer is configured to extract only bigrams (sequences of two consecutive words) from the input text, rather than the default unigrams (single words). For example, given the sentence "machine learning is fun", the bigrams extracted would be "machine learning", "learning is", and "is fun". When you later fit this vectorizer to a corpus, it will build a vocabulary of all unique bigrams found and represent each document as a vector indicating the count of each bigram.

This approach is useful when you want to capture word pair relationships and context that single words alone might miss.
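
The "machine learning is fun" example can be checked directly. This sketch uses `build_analyzer()`, which returns the function the vectorizer applies to turn text into tokens, so no fitting is needed; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "machine learning is fun"

# Bigrams only.
bigram_analyzer = CountVectorizer(ngram_range=(2, 2)).build_analyzer()
bigrams = bigram_analyzer(sentence)
print(bigrams)
# ['machine learning', 'learning is', 'is fun']

# ngram_range=(1, 2) keeps the single words as well as the bigrams.
mixed_analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
mixed = mixed_analyzer(sentence)
print(mixed)
```

A range such as `(1, 2)` is a common compromise: it keeps the unigram features and adds word-pair context, at the cost of a larger vocabulary.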

In [17]:
bow = cv.fit_transform(df['text'])


In [18]:
print(cv.vocabulary_)

{'people watch': 2, 'watch dswithbappy': 4, 'dswithbappy watch': 0, 'people write': 3, 'write comment': 5, 'dswithbappy write': 1}


In [19]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[0 0 1 0 1 0]]
[[1 0 0 0 1 0]]
[[0 0 0 1 0 1]]


In [20]:
# Trigrams: every run of three consecutive words becomes one token
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(3,3))

In [21]:
bow = cv.fit_transform(df['text'])

In [22]:
print(cv.vocabulary_)

{'people watch dswithbappy': 2, 'dswithbappy watch dswithbappy': 0, 'people write comment': 3, 'dswithbappy write comment': 1}


In [23]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[0 0 1 0]]
[[1 0 0 0]]
[[0 0 0 1]]


____

### TF-IDF (Term Frequency–Inverse Document Frequency)

In [24]:
df = pd.DataFrame({
    "text": ["people watch dswithbappy",
             "dswithbappy watch dswithbappy",
             "people write comment",
             "dswithbappy write comment"],
    "output": [1, 1, 0, 0],
})

df

|   | text                          | output |
|---|-------------------------------|--------|
| 0 | people watch dswithbappy      | 1      |
| 1 | dswithbappy watch dswithbappy | 1      |
| 2 | people write comment          | 0      |
| 3 | dswithbappy write comment     | 0      |


### Word2Vec

In [2]:
import numpy as np
import pandas as pd
import gensim
import os


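ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

This error is raised while importing `gensim`, before any Word2Vec code runs. It usually indicates that a compiled extension was built against a different NumPy version than the one installed; reinstalling `numpy` and `gensim` at mutually compatible versions typically resolves it.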