# 🛠️ Active Learning Workshop: Implementing an Inverted Matrix (Jupyter + GitHub Edition)
## 🔍 Workshop Theme
*Readable, correct, and collaboratively reviewed code—just like in the real world.*

# Team Members
## Fasalu Rahman Kottaparambu 8991782
## Srinu Babu Rai 8994032


Welcome to the 90-minute workshop! In this hands-on session, your team will build an **Inverted Index** pipeline, the foundation of many intelligent systems, such as AI agents, that need fast and relevant access to text data.

### 👥 Team Guidelines
- Work in teams of 3.
- Submit one completed Jupyter Notebook per team.
- The final notebook must contain **Markdown explanations** and **Python code**.
- Push your notebook to GitHub and share the `.git` link before class ends.

---
## 🔧 Workshop Tasks Overview

1. **Document Collection**
2. **Tokenizer Implementation**
3. **Normalization Pipeline (Stemming, Stop Words, etc.)**
4. **Build and Query the Inverted Index**

> Each step includes a sample **talking point**. Your team must add your own custom **Markdown + code cells** with a **second talking point**, and test your Inverted Index with **2 phrase queries**.




## 🧠 Learning Objectives
- Implement an **Inverted Matrix** using real-world data during the NLP process.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(15 min)* – Set up teams of 3 people. Read and understand the workshop, plus submission instructions. Seek assistance if needed.
2. **Team Jupyter Notebook Development** *(45 min)* – Manual IR and Inverted Matrix coding + Markdown documentation (work as teams)
3. **Push to GitHub** *(15 min)* – Teams commit and push initial notebooks. **Make sure to include your names so it is easy to identify the team that developed the code**.
4. **Instructor Review** – The instructor will go around, take notes, and provide coaching as needed during the **Peer Review Round**.
5. **Email Delivery** *(15 min)* – Each team sends the instructor an email **with the *.git link** to the GitHub repo **(one email/team)**. The email subject is: PROG8245 - Inverted Matrix Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `IR_InvertedMatrix_Workshop.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, and Inverted Index.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** and 2 phrase query tests
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `IR-invertedmatrix-workshop`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 📄 Step 1: Document Collection


### 🗣 Instructor Talking Point:
> We begin by gathering a text corpus. To build a robust index, your vocabulary should include **over 2000 unique words**. You can use scraped articles, academic papers, or open datasets.

### 🔧 Your Task:
- Collect at least 20 text documents.
- Ensure the vocabulary exceeds 2000 unique words.
- Load the documents into a list for processing.


In [46]:
import os
import re
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fasal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [47]:
# Load every .txt file from a folder into a list of strings
def load_documents(folder_path):
    documents = []
    # sorted() keeps document IDs stable across runs and platforms
    for filename in sorted(os.listdir(folder_path)):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents

# Replace './documents' with your actual folder
documents = load_documents("./documents")
print(f"Loaded {len(documents)} documents.")



Loaded 20 documents.
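
The 2000-word vocabulary requirement can be checked right after loading. The following is a minimal sketch, assuming the same regex that the tokenizer in Step 2 uses and counting raw lowercase tokens before any normalization. `vocabulary_size` and `sample_docs` are hypothetical names, not part of the workshop template.

```python
import re

def vocabulary_size(docs):
    """Count distinct lowercase alphanumeric tokens across all documents."""
    vocabulary = set()
    for text in docs:
        vocabulary.update(re.findall(r'\b[a-zA-Z0-9]+\b', text.lower()))
    return len(vocabulary)

# Hypothetical stand-in corpus; in the notebook, pass the `documents` list instead.
sample_docs = ["Big data and AI.", "AI needs data, big or small."]
print(vocabulary_size(sample_docs))  # 7
```

Stemming and stop-word removal shrink the vocabulary, so a count taken after normalization would likely be lower than this raw figure.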


## ✂️ Step 2: Tokenizer


### 🗣 Instructor Talking Point:
> The tokenizer breaks raw text into a stream of words (tokens). This is the foundation for every later step in IR and NLP.

### 🔧 Your Task:
- Implement a basic tokenizer that splits text into lowercase words.
- Handle punctuation removal and basic non-alphanumeric filtering.


In [48]:

def tokenize(text):
    # Lowercase the text, then keep runs of ASCII letters and digits; punctuation is dropped
    tokens = re.findall(r'\b[a-zA-Z0-9]+\b', text.lower())
    return tokens

# Test on one document
tokens = tokenize(documents[0])
print(tokens[:20])  # Preview first 20 tokens


['ai', 'and', 'big', 'data', 'analytics', 'introduction', 'the', 'combination', 'of', 'artificial', 'intelligence', 'and', 'big', 'data', 'analytics', 'enables', 'organizations', 'to', 'extract', 'valuable']
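
A few edge cases show what this pattern does with hyphens, apostrophes, and underscores. This is a small sketch that redefines `tokenize` with the same regex so it runs on its own; the input strings are made up for illustration.

```python
import re

def tokenize(text):
    # Same pattern as the tokenizer above: runs of ASCII letters/digits between word boundaries
    return re.findall(r'\b[a-zA-Z0-9]+\b', text.lower())

# Hyphens and apostrophes split words into separate tokens
print(tokenize("State-of-the-art AI isn't magic!"))
# ['state', 'of', 'the', 'art', 'ai', 'isn', 't', 'magic']

# "_" counts as a word character, so no \b falls next to it and nothing matches
print(tokenize("snake_case"))
# []
```

Whether `isn` and `t` should survive as tokens is a design choice; a stop-word list applied later may or may not remove such fragments.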


## 🔁 Step 3: Normalization Pipeline (Stemming, Stop Word Removal, etc.)


### 🗣 Instructor Talking Point:
> Now we normalize tokens: convert to lowercase, remove stop words, apply stemming or affix stripping. This reduces redundancy and enhances search accuracy.

### 🔧 Your Task:
- Use `nltk` to remove stopwords and apply stemming.


In [49]:

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def normalize_tokens(tokens):
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

# Example: normalize one document
norm_tokens = normalize_tokens(tokens)
print(norm_tokens[:20])


['ai', 'big', 'data', 'analyt', 'introduct', 'combin', 'artifici', 'intellig', 'big', 'data', 'analyt', 'enabl', 'organ', 'extract', 'valuabl', 'insight', 'massiv', 'complex', 'dataset', 'ai']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fasal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
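
The effect of each normalization stage is easier to see when the stop-word set and the stemmer are passed in as arguments. The sketch below assumes a hand-picked stop list and a naive suffix stripper as stand-ins for NLTK's `stopwords` and `PorterStemmer`. `TOY_STOP_WORDS`, `toy_stem`, and `normalize` are hypothetical names, and the toy stemmer is far cruder than the Porter algorithm.

```python
# Hypothetical stand-in for stopwords.words('english')
TOY_STOP_WORDS = {"the", "of", "and", "to"}

def toy_stem(token):
    """Naive suffix stripping; a stand-in for PorterStemmer.stem, not equivalent to it."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def normalize(tokens, stop_words, stem):
    """Drop stop words first, then stem what is left."""
    return [stem(t) for t in tokens if t not in stop_words]

sample_tokens = ["the", "indexes", "and", "indexing", "of", "documents"]
print(normalize(sample_tokens, TOY_STOP_WORDS, toy_stem))
# ['index', 'index', 'document']
```

Two surface forms collapse to the same term `index`, which is the redundancy reduction the talking point describes. Queries have to pass through the same function, otherwise their terms will not match the index keys.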


## 🔍 Step 4: Inverted Index


### 🗣 Instructor Talking Point:
> We now map each normalized token to the list of document IDs in which it appears. This is the core structure that allows fast Boolean and phrase queries.

### 🔧 Your Task:
- Build the inverted index using a dictionary.
- Add code to support phrase queries using positional indexing.


In [50]:
from collections import defaultdict

def build_positional_inverted_index(documents):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(documents):
        tokens = normalize_tokens(tokenize(text))
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index

positional_inverted_index = build_positional_inverted_index(documents)

# Preview first 5 terms
for term in list(positional_inverted_index.keys())[:5]:
    print(f"{term}: {dict(positional_inverted_index[term])}")


ai: {0: [0, 19, 158, 189], 1: [0, 17, 94, 130, 148, 154], 2: [0, 38, 40, 73, 90, 180], 3: [0, 21, 73, 171], 4: [0, 40, 169], 5: [0, 20, 35, 162, 172], 6: [0, 15, 189], 7: [0, 16, 31, 54, 115, 141, 151, 156, 161], 8: [0, 18, 32, 62, 180, 187], 9: [0, 16, 30, 84, 109, 144], 10: [7, 24, 45, 48, 63, 113, 181, 184, 191, 204, 206, 282, 317, 325, 372, 389, 412, 420, 425, 443, 457, 470, 478, 486, 509, 542, 566, 581, 600, 604, 625, 628, 649, 686], 13: [1, 12, 16, 36, 56, 162, 167, 176], 14: [1, 19, 28, 45, 50, 55, 60, 66, 78, 85, 113, 123, 159, 186, 188, 201], 16: [25], 18: [2, 59, 115, 124, 145, 158]}
big: {0: [1, 8, 36, 190], 8: [117]}
data: {0: [2, 9, 23, 29, 37, 43, 54, 72, 114, 115, 120, 140, 164, 178, 183, 191], 1: [74, 132], 2: [36, 102, 116, 124, 150, 166, 171], 3: [88], 4: [27, 33, 38, 92, 143, 151, 164], 5: [25, 67, 96, 145, 154], 6: [85], 7: [91, 128, 137], 8: [118, 132, 157, 162], 9: [115], 10: [66, 133, 290, 510, 525], 11: [160], 12: [26, 45, 90, 122, 128, 148, 190, 230], 13: [107]
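
The nested structure is easier to read on a corpus small enough to check by hand. This is a minimal sketch, assuming plain whitespace tokenization with no stop-word removal or stemming; `build_toy_index` and `toy_docs` are hypothetical names.

```python
from collections import defaultdict

def build_toy_index(docs):
    """Map term -> {doc_id: [positions]} using plain whitespace tokenization."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index

toy_docs = ["big data needs big storage", "data about big cities"]
toy_index = build_toy_index(toy_docs)

print(dict(toy_index["big"]))    # {0: [0, 3], 1: [2]}
print(dict(toy_index["data"]))   # {0: [1], 1: [0]}
print(sorted(toy_index["big"]))  # [0, 1]  (doc IDs only, as in a non-positional index)
```

Both documents contain `big` and `data`, but only document 0 has them at adjacent positions (0 and 1), which is the information a phrase query needs.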

## 🧪 Test: Phrase Queries


### 🗣 Instructor Talking Point:
> A phrase query requires the exact sequence of terms (e.g., "machine learning"). To support this, extend the inverted index to store positions, not just docIDs.

### 🔧 Your Task:
- Implement 2 phrase queries.
- Demonstrate that they return the correct documents.


In [62]:
# Phrase Query Implementation using Positional Inverted Index

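def phrase_query(phrase, index):
    # Normalize the phrase exactly like the documents so terms match index keys
    terms = normalize_tokens(tokenize(phrase))
    if not terms:
        return []
    # Candidate documents must contain every term of the phrase
    potential_docs = set(index.get(terms[0], {}).keys())
    for term in terms[1:]:
        potential_docs &= set(index.get(term, {}).keys())
    result_docs = []
    for doc_id in potential_docs:
        # Position sets give fast membership tests
        positions = [set(index[term][doc_id]) for term in terms]
        # Positions are counted after stop-word removal, so stop words between terms are ignored.
        # A match starts at pos if term k occurs at pos + k for every k.
        for pos in positions[0]:
            if all((pos + offset) in positions[offset] for offset in range(len(terms))):
                result_docs.append(doc_id)
                break
    return sorted(result_docs)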

# Example phrase queries
query1 = "machine learning"
query2 = "deep learning"
query3 = "natural language"
query4 = "data"
query5 = "big data"


print(f"Documents with exact phrase '{query1}':", phrase_query(query1, positional_inverted_index))
print(f"Documents with exact phrase '{query2}':", phrase_query(query2, positional_inverted_index))
print(f"Documents with exact phrase '{query3}':", phrase_query(query3, positional_inverted_index))
print(f"Documents with exact phrase '{query4}':", phrase_query(query4, positional_inverted_index))
print(f"Documents with exact phrase '{query5}':", phrase_query(query5, positional_inverted_index))


Documents with exact phrase 'machine learning': [0, 1, 2, 3, 4, 5, 7, 8, 12, 15, 16, 17, 18, 19]
Documents with exact phrase 'deep learning': [6, 10, 12, 13, 15, 16]
Documents with exact phrase 'natural language': [0, 1, 3, 7, 8, 17]
Documents with exact phrase 'data': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
Documents with exact phrase 'big data': [0, 8]
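
One way to demonstrate that the results are correct is to compare them with a slow scan that does not use the index at all. The sketch below assumes plain regex tokenization without stop-word removal or stemming; `contains_phrase`, `brute_force_phrase_search`, and `toy_docs` are hypothetical names.

```python
import re

def contains_phrase(tokens, terms):
    """True if `terms` occurs as a contiguous run inside `tokens`."""
    n = len(terms)
    return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))

def brute_force_phrase_search(docs, phrase):
    """Scan every document directly; slow, but independent of the index."""
    terms = re.findall(r'\b[a-zA-Z0-9]+\b', phrase.lower())
    if not terms:
        return []
    return [doc_id for doc_id, text in enumerate(docs)
            if contains_phrase(re.findall(r'\b[a-zA-Z0-9]+\b', text.lower()), terms)]

toy_docs = ["Big data needs big storage.", "Data about big cities.", "Storage for big data"]
print(brute_force_phrase_search(toy_docs, "big data"))  # [0, 2]
print(brute_force_phrase_search(toy_docs, "data big"))  # []
```

In the notebook, the same comparison would need both the documents and the phrase passed through `normalize_tokens`, after which the brute-force result should equal the sorted output of `phrase_query`.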


# Talking points
1. Learned how to preprocess documents by tokenizing, removing stopwords, and applying stemming to prepare data for indexing.
2. Implemented Boolean queries using AND, OR, and NOT operations to find documents containing specific terms quickly.
3. Created and tested phrase queries that find documents containing exact phrases or single words using our positional inverted index.
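
Boolean operators such as those named in point 2 reduce to set operations on the document IDs stored in the index. This is a minimal sketch, assuming an index in the same `term -> {doc_id: [positions]}` shape and query terms that are already normalized. The helper names and the toy index are hypothetical, and this is not necessarily how the team implemented it.

```python
def docs_with(term, index):
    """Set of doc IDs whose postings contain `term` (empty set if unseen)."""
    return set(index.get(term, {}))

def boolean_and(a, b, index):
    return docs_with(a, index) & docs_with(b, index)

def boolean_or(a, b, index):
    return docs_with(a, index) | docs_with(b, index)

def boolean_not(term, index, all_doc_ids):
    # NOT needs the full set of doc IDs, since the index only records where a term occurs
    return set(all_doc_ids) - docs_with(term, index)

# Hypothetical toy index in the same term -> {doc_id: [positions]} shape
toy_index = {"big": {0: [0, 3], 1: [2]}, "data": {0: [1], 2: [0]}}
all_ids = range(3)

print(sorted(boolean_and("big", "data", toy_index)))   # [0]
print(sorted(boolean_or("big", "data", toy_index)))    # [0, 1, 2]
print(sorted(boolean_not("big", toy_index, all_ids)))  # [2]
```

Using `index.get` instead of `index[term]` avoids creating empty entries when the index is a `defaultdict` and the term has never been seen.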
