# MBA AI - Language Technology Group Assignment 2025
**Course**: Language Technology  
**Team Members**:  
**Date**:  

## Overview
This assignment consists of four questions covering lexical representations, named entity recognition, question-answering systems, and generative AI chatbots. Each question is worth 25 points, for a total of 100 points.

---

## Table of Contents
1. [Lexical Representations & Vocabulary](#section1)
2. [Named Entity Recognition and Entity Analysis](#section2)
3. [Question Answering with Transformers](#section3)
4. [RAG-Based Football Chatbot](#section4)

---

## General Dependencies Summary

### Core Libraries:
```python
# Data handling
import pandas as pd
import numpy as np

# NLP Libraries
import spacy  # en_core_web_sm model required
import re

# Scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Transformers & Sentence Transformers
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import torch

# Vector Database
from pymilvus import MilvusClient  # the Milvus Python client is distributed as pymilvus

# Utilities
from collections import Counter
import json
import requests
```

### Model Downloads Required:
- spaCy: `python -m spacy download en_core_web_sm`
- Hugging Face models (auto-downloaded):
  - `sentence-transformers/multi-qa-mpnet-base-cos-v1`
  - `distilbert/distilbert-base-cased-distilled-squad`
  - `deepset/tinyroberta-squad2`

<a id="section1"></a>
# 1. Lexical Representations & Vocabulary

## Question 1: Lexical Representations and Text Classification (25 points)

### Actions Required:
1. **Load datasets** from parquet files (Amazon Reviews, Malawi, EN Crawl)
2. **Vocabulary analysis** with preprocessing (Q1 - 10pts)
3. **Binary classification model** for review scores (Q2 - 15pts)

### Q1 Approach (10pts):
**Objective**: Calculate vocabulary size with lowercase + lemmatization preprocessing

**Algorithm**:
1. Load Amazon Reviews dataset from parquet
2. Apply preprocessing pipeline:
   - Convert text to lowercase
   - Apply lemmatization using spaCy
3. Use CountVectorizer to build vocabulary
4. Return vocabulary size
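
The preprocessing and counting steps above could be sketched as follows. This is a minimal sketch under stated assumptions, not the graded solution: `vocabulary_size` and `make_spacy_lemmatizer` are hypothetical helper names, and the lemmatizer is passed in as a plain function so the counting logic can be tried without loading a spaCy model.

```python
from typing import Callable, Iterable

from sklearn.feature_extraction.text import CountVectorizer


def make_spacy_lemmatizer(model_name: str = "en_core_web_sm") -> Callable[[str], str]:
    """Return a function that maps a text to its space-joined spaCy lemmas."""
    import spacy  # imported lazily so the rest of the sketch runs without spaCy

    # Parser and NER are not needed for lemmatization, so they are disabled for speed.
    nlp = spacy.load(model_name, disable=["parser", "ner"])

    def lemmatize(text: str) -> str:
        return " ".join(token.lemma_ for token in nlp(text))

    return lemmatize


def vocabulary_size(texts: Iterable[str], lemmatize: Callable[[str], str]) -> int:
    """Lowercase and lemmatize each text, then count distinct CountVectorizer tokens."""
    processed = [lemmatize(text.lower()) for text in texts]
    # lowercase=True also lowercases any lemma the lemmatizer returns in upper case.
    vectorizer = CountVectorizer(lowercase=True)
    vectorizer.fit(processed)
    return len(vectorizer.vocabulary_)
```

With the real data this might be called as `vocabulary_size(df["text"], make_spacy_lemmatizer())`, where `df["text"]` is an assumed column name. Note that `CountVectorizer`'s default token pattern drops single-character tokens, which affects the reported size.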

**Libraries/Dependencies**:
- `pandas` - data loading
- `spacy` (en_core_web_sm model) - lemmatization
- `sklearn.feature_extraction.text.CountVectorizer` - vocabulary extraction
- `re` - regular expressions

### Q2 Approach (15pts):
**Objective**: Binary classification (score == 5 vs. score != 5) with hyperparameter exploration

**Algorithm**:
1. Sample 50,000 random records from Amazon Reviews
2. Create binary target: score==5 vs score!=5
3. Build 2-stage sklearn pipeline:
   - Stage 1: TfidfVectorizer with specific parameters
   - Stage 2: LogisticRegression (max_iter=1000)
4. Hyperparameter grid search on:
   - Regularization strength `C`
   - Token filtering (only tokens of at least 3 ASCII letters)
   - TF vs TF-IDF (`use_idf` parameter)
   - Unigrams vs unigrams + bigrams (`ngram_range` parameter)
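
One way the two-stage pipeline and its grid could be wired together is sketched below. The candidate values for `C`, the regular expression used for the "at least 3 ASCII letters" filter, and the names `PARAM_GRID` and `build_search` are assumptions for illustration; the assignment may prescribe different values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# scikit-learn's default token pattern, and an assumed stricter alternative
# that keeps only tokens made of at least three ASCII letters.
DEFAULT_TOKENS = r"(?u)\b\w\w+\b"
ASCII_3PLUS_TOKENS = r"\b[A-Za-z]{3,}\b"

PARAM_GRID = {
    "tfidf__token_pattern": [DEFAULT_TOKENS, ASCII_3PLUS_TOKENS],
    "tfidf__use_idf": [False, True],         # TF vs TF-IDF
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs unigrams + bigrams
    "clf__C": [0.1, 1.0, 10.0],              # assumed candidate values
}


def build_search(cv: int = 5) -> GridSearchCV:
    """Build a grid search over a TfidfVectorizer + LogisticRegression pipeline."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    return GridSearchCV(pipeline, PARAM_GRID, cv=cv, scoring="accuracy")
```

Assuming the reviews are in a dataframe `df` with hypothetical columns `text` and `score`, the sample and target could be prepared with `sample = df.sample(n=50_000, random_state=0)` and `y = (sample["score"] == 5).astype(int)`, followed by `build_search().fit(sample["text"], y)`.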

**Libraries/Dependencies**:
- `sklearn.feature_extraction.text.TfidfVectorizer`
- `sklearn.linear_model.LogisticRegression`
- `sklearn.pipeline.Pipeline`
- `sklearn.model_selection.GridSearchCV`
- `sklearn.metrics` - evaluation metrics

<a id="section2"></a>
# 2. Named Entity Recognition and Entity Analysis

## Question 2: Named Entity Recognition (25 points)

### Actions Required:
1. **Entity category analysis** (Q1 - 10pts)
2. **VIP identification and analysis** (Q2 - 15pts)

### Q1 Approach (10pts):
**Objective**: Build dataframe of entity categories and their usage counts

**Algorithm**:
1. Load Malawi dataset from parquet
2. Process each text document with spaCy NER
3. Extract all entities and their categories
4. Count occurrences of each entity category
5. Create summary dataframe
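
The counting steps above might look like the sketch below. It assumes only that each processed document exposes spaCy's `doc.ents` and `ent.label_` attributes; `entity_category_counts` and the output column names are hypothetical.

```python
from collections import Counter
from typing import Iterable

import pandas as pd


def entity_category_counts(docs: Iterable) -> pd.DataFrame:
    """Count entity labels over docs exposing spaCy's `doc.ents` / `ent.label_`."""
    counts = Counter(ent.label_ for doc in docs for ent in doc.ents)
    # most_common() orders the categories from most to least frequent.
    return pd.DataFrame(counts.most_common(), columns=["category", "count"])
```

With spaCy loaded as `nlp = spacy.load("en_core_web_sm")`, the documents could be streamed with `entity_category_counts(nlp.pipe(df["text"]))`, where `df["text"]` is an assumed column name for the Malawi texts.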

**Libraries/Dependencies**:
- `spacy` (en_core_web_sm model) - NER
- `pandas` - data manipulation
- `collections.Counter` - counting occurrences

### Q2 Approach (15pts):
**Objective**: Identify top 10 most mentioned people with analysis

**Algorithm**:
1. Extract all PERSON entities from Malawi corpus
2. Normalize mentions (lowercase grouping)
3. Count mentions per person
4. Rank and select top 10
5. Analyze accuracy and provide recommendations
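
A minimal sketch of the extraction, normalization, and ranking steps follows. `normalize_name` and `top_people` are hypothetical helpers, and the normalization shown (lowercasing plus whitespace collapsing) is only one reading of "lowercase grouping".

```python
from collections import Counter
from typing import Iterable, List, Tuple


def normalize_name(name: str) -> str:
    """Lowercase and collapse whitespace so 'Jane  Doe' and 'jane doe' group together."""
    return " ".join(name.lower().split())


def top_people(docs: Iterable, k: int = 10) -> List[Tuple[str, int]]:
    """Return the k most frequent PERSON mentions as (normalized name, count) pairs."""
    counts = Counter(
        normalize_name(ent.text)
        for doc in docs
        for ent in doc.ents
        if ent.label_ == "PERSON"
    )
    return counts.most_common(k)
```

Lowercase grouping alone does not merge partial mentions such as a surname with the matching full name, which is worth keeping in mind for the accuracy analysis in step 5.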

**Libraries/Dependencies**:
- `spacy` (en_core_web_sm model) - NER for PERSON entities
- `pandas` - data manipulation and analysis
- `re` - text normalization

<a id="section3"></a>
# 3. Question Answering with Transformers

## Question 3: Question-Answering System (25 points)

### Actions Required:
1. **Document chunking and vectorization**
2. **Semantic search implementation**
3. **Span-based QA with multiple models**
4. **Answer 3 specific questions about BERT**

### Approach:
**Objective**: Build QA system using semantic search + span extraction

**Algorithm**:
1. Load BERT article text
2. Sentence tokenization using spaCy
3. Create 2-sentence chunks
4. Encode chunks using sentence-transformers model
5. For each question:
   - Encode question with same model
   - Find 3 most similar chunks (cosine similarity)
   - Check similarity threshold (>0.3)
   - If threshold met, use chunks as context for QA models
   - Apply both QA models and compare results
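
The chunking and retrieval steps above can be illustrated with plain NumPy. This sketch assumes non-overlapping chunks and applies the 0.3 threshold to each retrieved chunk individually; both are interpretations, and `make_chunks` and `top_chunks` are hypothetical names.

```python
from typing import List, Sequence, Tuple

import numpy as np


def make_chunks(sentences: Sequence[str], size: int = 2) -> List[str]:
    """Join consecutive sentences into non-overlapping chunks of `size` sentences."""
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


def top_chunks(
    question_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 3,
    threshold: float = 0.3,
) -> List[Tuple[int, float]]:
    """Return (chunk index, cosine similarity) for the k best chunks above the threshold."""
    q = np.asarray(question_vec, dtype=float)
    m = np.asarray(chunk_vecs, dtype=float)
    # Cosine similarity between the question and every chunk embedding.
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    best = np.argsort(-sims)[:k]  # indices of the k highest similarities
    return [(int(i), float(sims[i])) for i in best if sims[i] > threshold]
```

The embeddings would come from the encoder, for example `model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1")` followed by `model.encode(chunks)` and `model.encode(question)`. An empty result from `top_chunks` signals that no chunk passed the threshold, so the QA models would be skipped for that question.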

**Libraries/Dependencies**:
- `spacy` (en_core_web_sm) - sentence tokenization
- `sentence-transformers` - text encoding
- `transformers` - QA pipeline
- `numpy` - similarity calculations
- `torch` - model backend

**Models Required**:
- `sentence-transformers/multi-qa-mpnet-base-cos-v1` - encoding
- `distilbert/distilbert-base-cased-distilled-squad` - QA
- `deepset/tinyroberta-squad2` - QA
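
To compare the two QA models on the same retrieved context, something like the following sketch could be used. `load_qa_models` and `compare_qa_models` are hypothetical helpers; the only library behavior relied on is that a Hugging Face `question-answering` pipeline accepts `question=` and `context=` and returns a dict containing `answer` and `score`.

```python
from typing import Callable, Dict, Iterable, Sequence


def load_qa_models(model_names: Iterable[str]) -> Dict[str, Callable]:
    """Create one question-answering pipeline per model name (downloads on first use)."""
    from transformers import pipeline  # imported lazily; loading models is slow

    return {name: pipeline("question-answering", model=name) for name in model_names}


def compare_qa_models(
    question: str,
    chunks: Sequence[str],
    qa_models: Dict[str, Callable],
) -> Dict[str, dict]:
    """Run every QA model on the same question and the concatenated retrieved chunks."""
    context = " ".join(chunks)
    return {name: qa(question=question, context=context) for name, qa in qa_models.items()}
```

A call such as `compare_qa_models(question, retrieved_chunks, load_qa_models([...]))` with the two model names listed above would return one result dict per model, making differences in `answer` and `score` easy to tabulate.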


<a id="section4"></a>
# 4. RAG-Based Football Chatbot

## Question 4: RAG-based Chatbot (25 points)

### Actions Required:
1. **Setup API connections** (LLM + embeddings + Milvus)
2. **Build RAG pipeline with multiple LLM calls**
3. **Implement multilingual chatbot**
4. **Test and analyze performance**

### Approach:
**Objective**: Build multilingual football chatbot using RAG architecture

**Algorithm**:
1. Setup system prompt for football topic validation
2. For each user query:
   - Check if query relates to football (LLM call 1)
   - If not football-related, politely decline
   - Parse query into question + formatting instructions (LLM call 2)
   - Rephrase question considering multiple facets (LLM call 3)
   - Retrieve relevant chunks from Milvus vector store
   - Generate answer with source citation (LLM call 4)
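
The control flow of the four LLM calls might be organized as in the sketch below. Everything here is illustrative: `answer_query`, the prompt wording, and the `llm` and `retrieve` callables are assumptions standing in for the API endpoints provided in the notebook.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str, str], str]                     # (system prompt, user message) -> reply
Retriever = Callable[[str], List[Tuple[str, str]]]  # query -> [(chunk text, source)]


def answer_query(query: str, llm: LLM, retrieve: Retriever) -> str:
    """Run the validation, parsing, rephrasing, retrieval, and answering steps in order."""
    # LLM call 1: topic validation.
    verdict = llm("Answer YES or NO: is the user's message about football?", query)
    if not verdict.strip().upper().startswith("YES"):
        return "Sorry, I can only help with football-related questions."

    # LLM call 2: separate the question from the formatting instructions.
    parsed = llm("Split the message into two lines: 'QUESTION: ...' and 'FORMAT: ...'.", query)
    question, fmt = query, ""
    for line in parsed.splitlines():
        if line.upper().startswith("QUESTION:"):
            question = line.split(":", 1)[1].strip()
        elif line.upper().startswith("FORMAT:"):
            fmt = line.split(":", 1)[1].strip()

    # LLM call 3: rephrase the question to improve retrieval.
    search_query = llm("Rephrase the question so it covers its different facets.", question)

    # Retrieval from the vector store, then LLM call 4: grounded answer with citations.
    context = "\n".join(f"[{source}] {text}" for text, source in retrieve(search_query))
    return llm(
        "Answer using only the context, cite sources in brackets, and reply in the "
        f"language of the question. Formatting instructions: {fmt or 'none'}",
        f"Context:\n{context}\n\nQuestion: {question}",
    )
```

Passing the LLM and retriever in as functions keeps the orchestration testable with stubs and independent of whichever model the GenAI endpoint serves.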

**Pipeline Components**:
1. **Topic Validation**: LLM determines if query is football-related
2. **Query Parsing**: Separate question from formatting instructions
3. **Query Enhancement**: Rephrase for better retrieval
4. **Semantic Retrieval**: Vector similarity search in Milvus
5. **Answer Generation**: Grounded response with citations

**Libraries/Dependencies**:
- Custom API endpoints (provided in notebook)
- `pymilvus` - Milvus vector database client
- `requests` - API calls
- `json` - data handling

**Infrastructure**:
- GenAI API endpoint (multiple models)
- Text embedding API (text-embedding-3-small)
- Milvus vector store (62,068 embedded chunks)
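
For the retrieval step against the Milvus store, a hedged sketch using the `pymilvus` `MilvusClient.search` call is shown below. The collection name, the `text` and `source` field names, and the helper `retrieve_chunks` are assumptions; the actual schema is defined by the notebook's infrastructure.

```python
from typing import Any, Dict, List, Sequence


def retrieve_chunks(
    client: Any,
    collection_name: str,
    query_vector: Sequence[float],
    k: int = 5,
    output_fields: Sequence[str] = ("text", "source"),  # assumed field names
) -> List[Dict[str, Any]]:
    """Search the collection with one query vector and flatten the hits into dicts."""
    results = client.search(
        collection_name=collection_name,
        data=[list(query_vector)],
        limit=k,
        output_fields=list(output_fields),
    )
    # search() returns one hit list per query vector; only one vector was sent.
    return [{"score": hit["distance"], **dict(hit["entity"])} for hit in results[0]]
```

Here `client` would be a `pymilvus.MilvusClient` created with the URI and credentials from the notebook, and `query_vector` the embedding of the rephrased question returned by the text embedding API.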

