# FN_YunzeWu_ITAI2377

**Final Project – Implementing a Domain-Specific AI Assistant**  
**Domain:** Intelligent Education – Few-Shot Learning Based Educational Assistant  
**Environment:** Google Colab (CPU)

This notebook implements the core components of our domain-specific AI assistant:

1. Data Collection & Preprocessing  
2. Feature Engineering  
3. Assistant Core Functionality (Few-Shot Style)  
4. Evaluation & Reflection  

All code cells are documented with comments, and each section includes markdown explanations.


## 1. Setup & Imports

This section sets up the Python environment and imports all required libraries for data handling, preprocessing, feature engineering, and prototyping the assistant.


In [None]:
# SECTION 1: Setup & Imports

# Core data and numerical libraries
import pandas as pd
import numpy as np

# Text processing
import re
import nltk
from nltk.corpus import stopwords

# Feature engineering and ML utilities
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.ensemble import RandomForestClassifier

# Visualization
import matplotlib.pyplot as plt

# Download NLTK resources
nltk.download('stopwords')

print("Environment setup complete.")


## 2. Data Collection & Loading

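We use a small synthetic dataset of five math questions (three algebra, two geometry), following our midterm plan. Each record has a question, an answer, a topic, and a difficulty label.  
This cell can be swapped for real data later, as long as the same four fields are kept.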


In [None]:
# SECTION 2: Data Collection & Loading

data = [
    {"question": "Simplify 2x + 3x.", "answer": "5x", "topic": "algebra", "difficulty": "easy"},
    {"question": "Solve for y: 2y + 4 = 10.", "answer": "3", "topic": "algebra", "difficulty": "easy"},
    {"question": "Find x if 3x - 5 = 16.", "answer": "7", "topic": "algebra", "difficulty": "medium"},
    {"question": "A triangle has sides 3, 4, 5. Is it a right triangle?", "answer": "Yes.", "topic": "geometry", "difficulty": "medium"},
    {"question": "Area of a circle with radius 2?", "answer": "4π", "topic": "geometry", "difficulty": "medium"},
]

df = pd.DataFrame(data)
df.head()
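
Because this cell may later be swapped for real data, a small validation pass could catch malformed records before they reach the DataFrame. The sketch below is one possible approach and is not part of the original pipeline: `REQUIRED_FIELDS`, `ALLOWED_DIFFICULTIES`, and `validate_records` are hypothetical names, the allowed difficulty set is an assumption based on the labels used above, and the second sample record is deliberately malformed for illustration.

```python
# Hypothetical helper: sanity-check question records before building a DataFrame.
REQUIRED_FIELDS = ("question", "answer", "topic", "difficulty")
# Assumption: only the difficulty labels that appear in the synthetic dataset are valid.
ALLOWED_DIFFICULTIES = {"easy", "medium"}

def validate_records(records):
    """Return a list of human-readable problems; an empty list means every record passed."""
    problems = []
    for i, rec in enumerate(records):
        # Every required field must be a non-empty string.
        for field in REQUIRED_FIELDS:
            value = rec.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty '{field}'")
        # A present difficulty label must be one of the known levels.
        difficulty = rec.get("difficulty")
        if isinstance(difficulty, str) and difficulty.strip() and difficulty not in ALLOWED_DIFFICULTIES:
            problems.append(f"record {i}: unknown difficulty '{difficulty}'")
    return problems

sample = [
    {"question": "Simplify 2x + 3x.", "answer": "5x", "topic": "algebra", "difficulty": "easy"},
    # Deliberately broken record: empty answer and an unmapped difficulty label.
    {"question": "Area of a circle with radius 2?", "answer": "", "topic": "geometry", "difficulty": "hard"},
]

for problem in validate_records(sample):
    print(problem)
```

Running the check before `pd.DataFrame(data)` would surface problems early instead of letting them show up later as NaN values in the encoded features.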


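## 3. Data Exploration & Quality Assessment

This section checks the dataset's structure, the topic and difficulty distributions, and missing values, then plots the number of questions per topic.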


In [None]:
print("DataFrame info:")
df.info()  # info() prints its summary itself and returns None, so it is not wrapped in print()

print("\nTopic distribution:")
print(df['topic'].value_counts())

print("\nDifficulty distribution:")
print(df['difficulty'].value_counts())

print("\nMissing values:")
print(df.isna().sum())

df['topic'].value_counts().plot(kind='bar')
plt.title("Question Count by Topic")
plt.show()


## 4. Preprocessing Pipeline

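Each question is lowercased, stripped of punctuation with a regular expression, and filtered against NLTK's English stopword list. A second cell compares token counts before and after cleaning.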


In [None]:
stop_words = set(stopwords.words('english'))

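def clean_text(text):
    """Lowercase the text, strip punctuation, and remove English stopwords."""
    text = text.lower()
    # Replace punctuation with a space (not an empty string) so that
    # "3,4,5" becomes "3 4 5" instead of being fused into "345".
    text = re.sub(r'[^a-z0-9 ]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    tokens = [w for w in text.split() if w not in stop_words]
    return ' '.join(tokens)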

df['question_clean'] = df['question'].apply(clean_text)
df[['question', 'question_clean']]
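
The cleaning step above removes every non-alphanumeric character, which includes the math operators `+`, `-`, and `=`. For a math-focused assistant it may be worth keeping them. The sketch below shows one way this could be done; `clean_math_text` and `SMALL_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for the NLTK list only so that the example runs without downloads.

```python
import re

# Assumption: a tiny illustrative stopword set, used here instead of the NLTK
# list so the sketch runs without any downloads.
SMALL_STOPWORDS = {"a", "an", "the", "is", "it", "of", "for", "if", "with", "do", "i", "how"}

def clean_math_text(text):
    """Hypothetical variant of clean_text that keeps math operators as tokens."""
    text = text.lower()
    # Put spaces around operators so "2y+4=10" splits into separate tokens.
    text = re.sub(r'([+\-*/=^])', r' \1 ', text)
    # Replace every other non-alphanumeric character with a space.
    text = re.sub(r'[^a-z0-9+\-*/=^ ]', ' ', text)
    tokens = [w for w in text.split() if w not in SMALL_STOPWORDS]
    return ' '.join(tokens)

print(clean_math_text("Simplify 2x + 3x."))      # simplify 2x + 3x
print(clean_math_text("Solve for y: 2y+4=10."))  # solve y 2y + 4 = 10
```

Keeping operators only helps if the vectorizer also keeps them. With `TfidfVectorizer`, that would likely mean passing a custom `token_pattern` such as `r"\S+"`, since the default pattern only matches word characters.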


In [None]:
df['len_raw'] = df['question'].apply(lambda x: len(x.split()))
df['len_clean'] = df['question_clean'].apply(lambda x: len(x.split()))

plt.hist(df['len_raw'], alpha=0.5, label='Raw')
plt.hist(df['len_clean'], alpha=0.5, label='Clean')
plt.legend()
plt.title("Token Count Before vs After Cleaning")
plt.show()

df[['question', 'len_raw', 'question_clean', 'len_clean']]


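## 5. Feature Engineering

We derive three simple features (cleaned question length, an algebra indicator, and an ordinal difficulty code) plus a TF-IDF representation of the cleaned questions. A small random forest then gives a rough view of which basic feature separates the two topics.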


In [None]:
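# Number of tokens in the cleaned question
df['question_length'] = df['question_clean'].apply(lambda x: len(x.split()))
# Binary indicator: 1 for algebra, 0 otherwise (geometry in this dataset)
df['is_algebra'] = (df['topic'] == 'algebra').astype(int)

# Ordinal encoding of difficulty; any label missing from the map becomes NaN
difficulty_map = {'easy': 1, 'medium': 2}
df['difficulty_enc'] = df['difficulty'].map(difficulty_map)

# TF-IDF representation of the cleaned questions
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['question_clean'])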

df[['question_clean', 'question_length', 'is_algebra', 'difficulty_enc']]
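
One detail worth checking here: the default `token_pattern` of scikit-learn's `TfidfVectorizer` only keeps tokens with two or more word characters, so single-digit numbers such as `4` or `5` never enter the vocabulary. The sketch below compares the default with a pattern that also keeps one-character tokens. The two strings in `docs` are hand-written approximations of cleaned questions, and whether single-character tokens are actually useful for this dataset is an assumption to be tested.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hand-written approximations of cleaned questions (illustrative only).
docs = ["solve 2y 4 10", "triangle sides 3 4 5 right triangle"]

# Default settings: tokens must have at least two word characters.
default_vec = TfidfVectorizer().fit(docs)
# Assumption: keeping one-character tokens is desirable for a math dataset.
single_char_vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)

print(sorted(default_vec.vocabulary_))
# ['10', '2y', 'right', 'sides', 'solve', 'triangle']
print(sorted(single_char_vec.vocabulary_))
# ['10', '2y', '3', '4', '5', 'right', 'sides', 'solve', 'triangle']
```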


In [None]:
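# Fit a small random forest to see which basic feature best separates algebra from geometry.
# With only five rows, the importances are illustrative rather than statistically meaningful.
X_basic = df[['question_length', 'difficulty_enc']].values
y_basic = df['is_algebra'].values

rf = RandomForestClassifier(random_state=42, n_estimators=50)
rf.fit(X_basic, y_basic)

for name, imp in zip(['question_length', 'difficulty_enc'], rf.feature_importances_):
    print(f"{name}: {imp:.3f}")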


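## 6. Few-Shot Educational Assistant Prototype

The prototype is retrieval-based: it vectorizes the student's query with TF-IDF, finds the stored question with the highest cosine similarity, and returns that question's answer together with its topic and difficulty.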


In [None]:
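def find_most_similar_question(user_query, df, tfidf_vectorizer, tfidf_matrix):
    """Return (row index, cosine similarity) of the stored question closest to the query."""
    user_clean = clean_text(user_query)
    user_vec = tfidf_vectorizer.transform([user_clean])
    sims = cosine_similarity(user_vec, tfidf_matrix)[0]
    idx = sims.argmax()
    return idx, sims[idx]

def assistant_reply(user_query, df, tfidf_vectorizer, tfidf_matrix):
    """Format a reply from the closest stored question, or a fallback if nothing matches."""
    idx, sim = find_most_similar_question(user_query, df, tfidf_vectorizer, tfidf_matrix)
    # A similarity of 0 means the query shares no vocabulary with the dataset;
    # argmax would then silently return the first row, so handle it explicitly.
    if sim == 0:
        return "Sorry, I could not find a similar question in the dataset."
    row = df.iloc[idx]
    return (
        f"Closest question: {row['question']}\n"
        f"Suggested answer: {row['answer']}\n"
        f"Topic: {row['topic']}, Difficulty: {row['difficulty']}\n"
        f"(Similarity: {sim:.3f})"
    )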

tfidf_assistant = TfidfVectorizer()
X_assistant = tfidf_assistant.fit_transform(df['question_clean'])

example_query = "How do I solve 2y + 4 = 10?"
print(assistant_reply(example_query, df, tfidf_assistant, X_assistant))
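
The functions above return only the single closest question. A few-shot setup would typically pass several nearby examples along, and it helps to drop matches that are too weak. The sketch below shows one way that could look; `top_k_examples`, `k`, and `min_sim` are hypothetical names, the threshold of 0.1 is an arbitrary illustrative value, and the questions are vectorized raw (without `clean_text`) to keep the example self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The five questions from the dataset above, used raw (no clean_text) for brevity.
questions = [
    "Simplify 2x + 3x.",
    "Solve for y: 2y + 4 = 10.",
    "Find x if 3x - 5 = 16.",
    "A triangle has sides 3, 4, 5. Is it a right triangle?",
    "Area of a circle with radius 2?",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(questions)

def top_k_examples(query, k=2, min_sim=0.1):
    """Hypothetical helper: return up to k (index, similarity) pairs with similarity >= min_sim."""
    sims = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    ranked = np.argsort(sims)[::-1][:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in ranked if sims[i] >= min_sim]

# Only the "Solve for y" question (index 1) clears the threshold.
print(top_k_examples("How do I solve 2y + 4 = 10?"))
# No shared vocabulary, so nothing is returned.
print(top_k_examples("Define photosynthesis"))
```

Returning an empty list for unrelated queries gives the caller a clear signal to fall back to a default message rather than presenting an arbitrary match.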


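### 6.1 Interactive Loop

A simple console loop for trying the assistant interactively. It keeps answering until the student types `quit` or `exit`.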


In [None]:
def chat_loop(df, tfidf_vectorizer, tfidf_matrix):
    print("Assistant ready. Type 'quit' or 'exit' to stop.\n")
    while True:
        q = input("Student: ")
        # Ignore surrounding whitespace and case when checking for an exit command
        if q.strip().lower() in ['quit', 'exit']:
            print("Goodbye!")
            break
        print("\nAssistant:")
        print(assistant_reply(q, df, tfidf_vectorizer, tfidf_matrix))
        print()

# Uncomment to use in Colab:
# chat_loop(df, tfidf_assistant, X_assistant)


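## 7. Evaluation & Reflection

We run three test queries through the retriever and report the best-match cosine similarity for each, along with the average. Similarity reflects how closely a query matches a stored question, not whether the returned answer is correct.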


In [None]:
test_queries = [
    "Simplify 5x + 2x",
    "Is 3,4,5 a right triangle?",
    "Area of a circle radius 3?",
]

# Record the best-match similarity for each test query
scores = []
for q in test_queries:
    _, sim = find_most_similar_question(q, df, tfidf_assistant, X_assistant)
    scores.append(sim)
    print(f"{q!r} => similarity: {sim:.3f}")

print("\nAverage similarity:", np.mean(scores))
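
Average similarity alone does not show whether the assistant retrieved the right kind of question. For example, "Area of a circle radius 3?" matches the stored radius-2 question closely, yet the stored answer 4π would be wrong for radius 3 (the correct area is 9π). As a complementary check, the sketch below computes top-1 topic accuracy. The expected topics in `labeled_queries` are hand-assigned assumptions, `topic_accuracy` is a hypothetical metric name, and the questions are vectorized raw to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (question, topic) pairs copied from the dataset above.
corpus = [
    ("Simplify 2x + 3x.", "algebra"),
    ("Solve for y: 2y + 4 = 10.", "algebra"),
    ("Find x if 3x - 5 = 16.", "algebra"),
    ("A triangle has sides 3, 4, 5. Is it a right triangle?", "geometry"),
    ("Area of a circle with radius 2?", "geometry"),
]
# Assumption: expected topics assigned by hand for the three test queries.
labeled_queries = [
    ("Simplify 5x + 2x", "algebra"),
    ("Is 3,4,5 a right triangle?", "geometry"),
    ("Area of a circle radius 3?", "geometry"),
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([question for question, _ in corpus])

hits = 0
for query, expected_topic in labeled_queries:
    sims = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    predicted_topic = corpus[sims.argmax()][1]  # topic of the closest stored question
    hits += int(predicted_topic == expected_topic)

topic_accuracy = hits / len(labeled_queries)
print(f"Top-1 topic accuracy: {topic_accuracy:.2f}")
```

Topic accuracy is still a coarse measure. Checking the returned answer against a hand-written expected answer for each query would be a stricter test, and with only five stored questions any of these numbers should be read as a smoke test rather than a benchmark.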
