<a href="https://colab.research.google.com/github/appliedcode/mthree-c422/blob/main/Exercises/day-4/Conversion_techniques/Conversion_types.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Conversion Techniques: Single-Lab Exercises
Below are four self-contained lab exercises—one each for Bag of Words, N-grams, TF-IDF, and Word2Vec—complete with concise step-by-step instructions and reference Python solutions.
All examples run in a standard Jupyter/Python 3 environment and rely only on widely used libraries (nltk, scikit-learn, gensim, pandas).

In [6]:
import nltk
# Download both old and new tokenizer data
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# 1. Bag of Words (BoW)
### Exercise Brief
Convert three short product reviews into a Bag-of-Words matrix and compare a manual implementation with CountVectorizer.

### Reviews

“This phone has great battery life”

“Battery life on this phone is poor”

“I love the camera on this phone”

### Tasks

- Pre-process each review: lowercase, tokenize, remove stop-words.

- Construct a vocabulary of unique words.

- Create a frequency vector for every review (manual).

- Repeat using sklearn.feature_extraction.text.CountVectorizer.

- Compare the two matrices.





In [7]:
import nltk, string, pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This phone has great battery life",
    "Battery life on this phone is poor",
    "I love the camera on this phone"
]

# download once per session
nltk.download('punkt'); nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return [w for w in word_tokenize(text) if w not in stop_words]

tokens_list = [preprocess(doc) for doc in corpus]
vocab = sorted({w for sent in tokens_list for w in sent})

def bow_vector(tokens):
    return [tokens.count(term) for term in vocab]

manual_bow = [bow_vector(t) for t in tokens_list]
manual_df  = pd.DataFrame(manual_bow, columns=vocab)

cv = CountVectorizer(lowercase=True, stop_words='english')
cv_bow = cv.fit_transform(corpus).toarray()
cv_df  = pd.DataFrame(cv_bow, columns=cv.get_feature_names_out())

print(manual_df, '\n'); print(cv_df)

   battery  camera  great  life  love  phone  poor
0        1       0      1     1     0      1     0
1        1       0      0     1     0      1     1
2        0       1      0     0     1      1     0 

   battery  camera  great  life  love  phone  poor
0        1       0      1     1     0      1     0
1        1       0      0     1     0      1     1
2        0       1      0     0     1      1     0
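
The last task, comparing the two matrices, can also be done programmatically rather than by eye. The following is a minimal sketch using only the standard library, with the counts copied from the two printed matrices above; the helper names `as_term_counts` and `diff_matrices` are hypothetical and not part of any library.

```python
# Counts copied from the two printed matrices above (one row per review).
vocab = ["battery", "camera", "great", "life", "love", "phone", "poor"]
manual_bow = [
    [1, 0, 1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 1, 0],
]
cv_vocab = ["battery", "camera", "great", "life", "love", "phone", "poor"]
cv_bow = [
    [1, 0, 1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 1, 0],
]

def as_term_counts(matrix, columns):
    """Convert each row to a {term: count} dict, dropping zero counts."""
    return [{t: c for t, c in zip(columns, row) if c} for row in matrix]

def diff_matrices(a, a_cols, b, b_cols):
    """Return (doc_index, term, count_a, count_b) for every disagreement."""
    mismatches = []
    rows_a = as_term_counts(a, a_cols)
    rows_b = as_term_counts(b, b_cols)
    for i, (ra, rb) in enumerate(zip(rows_a, rows_b)):
        for term in sorted(set(ra) | set(rb)):
            if ra.get(term, 0) != rb.get(term, 0):
                mismatches.append((i, term, ra.get(term, 0), rb.get(term, 0)))
    return mismatches

mismatches = diff_matrices(manual_bow, vocab, cv_bow, cv_vocab)
print("identical" if not mismatches else mismatches)

# Comparing by term makes the check independent of column order.
rev_cols = vocab[::-1]
rev_bow = [row[::-1] for row in manual_bow]
reordered = diff_matrices(manual_bow, vocab, rev_bow, rev_cols)
print("identical" if not reordered else reordered)
```

In the notebook, the inputs could instead come from `manual_df.values.tolist()` and `cv_df.values.tolist()`. The two matrices agree here, but that is not guaranteed in general: the NLTK and scikit-learn English stop-word lists differ, and `CountVectorizer`'s default token pattern drops single-character tokens.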


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 🧪 Exercise Brief: N-grams Generation and Analysis

## 📝 Sentence
> “Natural language processing is fascinating and powerful”

---

## 🧩 Tasks

### 1. Tokenize the Sentence
- Use `nltk.word_tokenize` to split the sentence into tokens.

### 2. Generate N-grams (n = 1 to 4)
- Use `nltk.util.ngrams` to create:
  - **Unigrams** (n=1)
  - **Bigrams** (n=2)
  - **Trigrams** (n=3)
  - **Four-grams** (n=4)
- Store:
  - As **lists of tuples** (e.g., `('natural', 'language')`)
  - As **joined strings** (e.g., `"natural language"`)
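
The two storage formats can be sketched as follows. This is an illustration under stated assumptions: `make_ngrams` is a hypothetical standard-library stand-in for `nltk.util.ngrams` (no padding), and the sentence is split on whitespace, which yields the same tokens as `word_tokenize` here only because the sentence contains no punctuation.

```python
def make_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = "Natural language processing is fascinating and powerful"
tokens = sentence.lower().split()  # assumption: whitespace split is sufficient here

# Format 1: lists of tuples, keyed by n.
ngram_tuples = {n: make_ngrams(tokens, n) for n in range(1, 5)}

# Format 2: the same n-grams joined into space-separated strings.
ngram_strings = {
    n: [" ".join(gram) for gram in grams] for n, grams in ngram_tuples.items()
}

print(ngram_tuples[2][0])    # ('natural', 'language')
print(ngram_strings[2][0])   # natural language
print({n: len(grams) for n, grams in ngram_tuples.items()})  # {1: 7, 2: 6, 3: 5, 4: 4}
```

Tuples are convenient as `Counter` keys, while joined strings are easier to display or to use as column names.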

---

### 3. Frequency Analysis
- Use `collections.Counter` to compute:
  - **Bigram frequency count**
  - **Trigram frequency count**
- Report most frequent combinations (if ties, show all with max count)

---

In [8]:
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter
sentence = "Natural language processing is fascinating and powerful"
tokens   = word_tokenize(sentence.lower())

for n in range(1, 5):
    grams = list(ngrams(tokens, n))
    print(f"{n}-grams:", grams)

bigram_freq  = Counter(ngrams(tokens, 2))
trigram_freq = Counter(ngrams(tokens, 3))
print("Top bigrams:", bigram_freq.most_common())
print("Top trigrams:", trigram_freq.most_common())

1-grams: [('natural',), ('language',), ('processing',), ('is',), ('fascinating',), ('and',), ('powerful',)]
2-grams: [('natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'fascinating'), ('fascinating', 'and'), ('and', 'powerful')]
3-grams: [('natural', 'language', 'processing'), ('language', 'processing', 'is'), ('processing', 'is', 'fascinating'), ('is', 'fascinating', 'and'), ('fascinating', 'and', 'powerful')]
4-grams: [('natural', 'language', 'processing', 'is'), ('language', 'processing', 'is', 'fascinating'), ('processing', 'is', 'fascinating', 'and'), ('is', 'fascinating', 'and', 'powerful')]
Top bigrams: [(('natural', 'language'), 1), (('language', 'processing'), 1), (('processing', 'is'), 1), (('is', 'fascinating'), 1), (('fascinating', 'and'), 1), (('and', 'powerful'), 1)]
Top trigrams: [(('natural', 'language', 'processing'), 1), (('language', 'processing', 'is'), 1), (('processing', 'is', 'fascinating'), 1), (('is', 'fascinating', 'and'), 1), (('fascinating', 'and', 'powerful'), 1)]
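
Every n-gram in this sentence occurs exactly once, so `most_common()` returns the full list and all entries tie. To report only the entries that share the maximum count, a small helper along these lines might be used. It is a sketch: `most_frequent` is a hypothetical name, the bigrams are built with `zip` instead of `nltk.util.ngrams` so the block runs without NLTK, and the second sentence is invented purely to show a non-trivial result.

```python
from collections import Counter

def most_frequent(counter):
    """Return (items, count) for every item tied at the highest count."""
    if not counter:
        return [], 0
    top = max(counter.values())
    return [item for item, c in counter.items() if c == top], top

tokens = "natural language processing is fascinating and powerful".split()
bigram_freq = Counter(zip(tokens, tokens[1:]))
winners, count = most_frequent(bigram_freq)
print(f"{len(winners)} bigrams tie at count {count}")  # 6 bigrams tie at count 1

# Invented example with a repeated bigram, to show a single winner.
repeated = "the cat saw the cat".split()
rep_winners, rep_count = most_frequent(Counter(zip(repeated, repeated[1:])))
print(rep_winners, rep_count)  # [('the', 'cat')] 2
```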

# 📘 Exercise Brief: TF-IDF Vectorization of News Headlines

## 📰 Headlines
1. “Stock market crashes amid global uncertainty”  
2. “Global leaders discuss climate change solutions”

---

## 🧩 Tasks

### 1. Pre-processing
- Convert all text to **lowercase**
- **Tokenize** the headlines (split into words)
- **Remove stop words** (e.g., "the", "is", "on")

---

### 2. Compute Term Frequency (TF)
- Count term occurrences in each headline
- Normalize by total terms in the headline

---

### 3. Compute Inverse Document Frequency (IDF)
- Use the formula:  
  \[
  \text{IDF}(t) = \log\left(\frac{N}{\text{df}(t)}\right)
  \]  
  Where:
  - *N* = total number of documents (2 in this case)
  - *df(t)* = number of documents containing term *t*
  - log is the natural logarithm (`math.log`)
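
Tutorials commonly write the IDF formula either with or without a `+ 1` in the denominator, and on a two-document corpus the choice matters a great deal. The sketch below evaluates both variants for one term that appears in a single headline and one that appears in both. It assumes whitespace tokenization, which is enough here because the headlines contain no punctuation or stop words; the `smooth` flag is a hypothetical parameter for illustration.

```python
from math import log

docs = [
    "stock market crashes amid global uncertainty".split(),
    "global leaders discuss climate change solutions".split(),
]

def idf(term, docs, smooth=False):
    """log(N / df) by default; log(N / (1 + df)) when smooth=True."""
    n = len(docs)
    df = sum(term in d for d in docs)  # number of documents containing term
    return log(n / (1 + df)) if smooth else log(n / df)

for term in ("stock", "global"):
    plain = round(idf(term, docs), 6)
    smoothed = round(idf(term, docs, smooth=True), 6)
    print(term, plain, smoothed)
# stock 0.693147 0.0
# global 0.0 -0.405465
```

With the `+ 1` variant, every term unique to one headline gets weight 0 and the shared term "global" becomes negative. With the plain variant, "stock" has TF 1/6 and IDF 0.693147, which gives 0.115525, the value shown in the manual matrix.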

---

### 4. Build TF-IDF Matrix (Manual)
- Multiply each term’s TF by its IDF
- Result: **2 × V matrix**, where *V* is the vocabulary size

---

### 5. Recreate Matrix with `TfidfVectorizer` (Sklearn)
- Use `sklearn.feature_extraction.text.TfidfVectorizer` with:
  - `stop_words='english'`
  - `lowercase=True`
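
With its default settings, `TfidfVectorizer` does not compute a plain TF × IDF product, so its matrix should not be expected to match a manual one number for number. By default it uses raw term counts as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below reproduces that calculation with the standard library only, assuming whitespace tokenization; `sklearn_style_row` is a hypothetical helper name.

```python
from math import log, sqrt

docs = [
    "stock market crashes amid global uncertainty".split(),
    "global leaders discuss climate change solutions".split(),
]
vocab = sorted({t for d in docs for t in d})
n_docs = len(docs)

def sklearn_style_row(doc):
    """Raw count times smoothed IDF, then scale the row to unit L2 length."""
    raw = []
    for term in vocab:
        df = sum(term in d for d in docs)
        idf = log((1 + n_docs) / (1 + df)) + 1  # default smooth_idf=True behaviour
        raw.append(doc.count(term) * idf)
    norm = sqrt(sum(w * w for w in raw))        # default norm='l2'
    return [w / norm for w in raw]

row0 = dict(zip(vocab, sklearn_style_row(docs[0])))
print(round(row0["stock"], 5), round(row0["global"], 6))
```

Because of the `+ 1` added to the IDF, a term that occurs in every document (here "global") keeps a non-zero weight instead of dropping to 0.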

---

### 6. Top-Weighted Terms
- For each headline:
  - Extract top **three terms** with the **highest TF-IDF weights**
  - Report terms and their corresponding scores

---

## ✅ Expected Output
- **Manual TF-IDF Matrix**
- **Sklearn TF-IDF Matrix**
- **Top 3 TF-IDF terms per headline**  


In [9]:
from math import log

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Stock market crashes amid global uncertainty",
    "Global leaders discuss climate change solutions"
]

# scikit-learn path: smoothed IDF and L2-normalized rows by default
vec = TfidfVectorizer(lowercase=True, stop_words='english')
tfidf = vec.fit_transform(docs).toarray()
print(pd.DataFrame(tfidf, columns=vec.get_feature_names_out()))

# manual TF-IDF (for pedagogy): length-normalized TF times unsmoothed IDF
tokens = [word_tokenize(d.lower()) for d in docs]
stop   = set(stopwords.words('english'))
proc   = [[w for w in t if w.isalpha() and w not in stop] for t in tokens]
vocab  = sorted({w for doc in proc for w in doc})
# idf(t) = log(N / df(t)), where df(t) = number of documents containing t
idf    = {t: log(len(docs) / sum(t in p for p in proc)) for t in vocab}
tfidf_manual = []
for doc in proc:
    length = len(doc)  # total terms in this headline after filtering
    tfidf_manual.append([doc.count(t) / length * idf[t] for t in vocab])
print(pd.DataFrame(tfidf_manual, columns=vocab))


      amid   change  climate  crashes  discuss    global  leaders   market  \
0  0.42616  0.00000  0.00000  0.42616  0.00000  0.303216  0.00000  0.42616   
1  0.00000  0.42616  0.42616  0.00000  0.42616  0.303216  0.42616  0.00000   

   solutions    stock  uncertainty  
0    0.00000  0.42616      0.42616  
1    0.42616  0.00000      0.00000  
       amid    change   climate   crashes   discuss  global   leaders  \
0  0.115525  0.000000  0.000000  0.115525  0.000000     0.0  0.000000   
1  0.000000  0.115525  0.115525  0.000000  0.115525     0.0  0.115525   

     market  solutions     stock  uncertainty  
0  0.115525   0.000000  0.115525     0.115525  
1  0.000000   0.115525  0.000000     0.000000  
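
The remaining task, extracting the top three terms per headline, can be sketched as below. The weights for the first headline are copied from the two printed matrices above (zero columns omitted); `top_terms` is a hypothetical helper, and breaking ties alphabetically is an assumption made only to keep the result deterministic.

```python
def top_terms(weights, k=3):
    """Return the k highest-weighted (term, score) pairs; ties break alphabetically."""
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Headline 1, scikit-learn matrix (values copied from the output above).
sklearn_row = {
    "amid": 0.42616, "crashes": 0.42616, "global": 0.303216,
    "market": 0.42616, "stock": 0.42616, "uncertainty": 0.42616,
}
# Headline 1, manual matrix (values copied from the output above).
manual_row = {
    "amid": 0.115525, "crashes": 0.115525, "global": 0.0,
    "market": 0.115525, "stock": 0.115525, "uncertainty": 0.115525,
}

print(top_terms(sklearn_row))  # [('amid', 0.42616), ('crashes', 0.42616), ('market', 0.42616)]
print(top_terms(manual_row))   # [('amid', 0.115525), ('crashes', 0.115525), ('market', 0.115525)]
```

In the notebook, a row dictionary could be built with `dict(zip(vec.get_feature_names_out(), tfidf[0]))`. Note that five terms tie in each headline, so which three are reported depends entirely on the tie-breaking rule; the only ranking both matrices support is that "global", the one term shared by both headlines, comes last.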
