# Model Development

## Objective

Demonstrate how the AI receptionist learns to predict the most appropriate system response (classification task).

We will build several models of increasing complexity and compare their performance.

| Model                   | Technique                    | Purpose                                                                  |
| ----------------------- | ---------------------------- | ------------------------------------------------------------------------ |
| **Baseline**            | TF-IDF + Linear Regression   | Establish a baseline accuracy figure (simple, no hyperparameter tuning). |
| **Mid Model**           | TF-IDF + Logistic Regression | Core classical ML classifier: interpretable and scalable.                |
| **Advanced (optional)** | Neural Seq2Seq / Transformer | Text generation or intent prediction, for top marks.                     |


# 1) Baseline Model
### TF-IDF + Linear Regression

In [1]:
# Read in the three data splits produced by the data preprocessing step.
# Reference: https://www.w3schools.com/python/pandas/pandas_csv.asp
import pandas as pd

train_pairs_df = pd.read_csv('train_pairs_df.csv')
val_pairs_df = pd.read_csv('val_pairs_df.csv')
test_pairs_df = pd.read_csv('test_pairs_df.csv')

# Check that the data loaded correctly by viewing three random rows
test_pairs_df.sample(3)

| Unnamed: 0 | dialogue_id  | context                                            | response                                           | domain            | user_intents | user_slots                                         |
| ---------- | ------------ | -------------------------------------------------- | -------------------------------------------------- | ----------------- | ------------ | -------------------------------------------------- |
| 3516       | MUL2086.json | no, that will be all, thanks.                      | thank you for using our service and enjoy your...  | train, hotel      | []           | {'hotel-internet': 'yes', 'hotel-name': 'gonvi...  |
| 3867       | MUL2074.json | nope, that's all i need today. thanks for your...  | glad to have been of service, have a great day!    | train, hotel      | []           | {'hotel-bookday': 'friday', 'hotel-bookpeople'...  |
| 5422       | MUL1273.json | ok, that's ok. i'll take care of it later. tha...  | have a nice day.                                   | restaurant, hotel | []           | {'restaurant-area': 'centre', 'restaurant-book...  |
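The `Unnamed: 0` column in the output above is what pandas typically produces when a DataFrame's index was written out by `to_csv()` and then read back as an ordinary column. The sketch below reproduces that behaviour and shows one way to avoid it with `index_col=0`. It assumes the splits were saved with the default `to_csv()` settings; the two-row `toy` frame is hypothetical and stands in for the real data.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for one of the saved splits (not the real data).
toy = pd.DataFrame(
    {
        "context": ["hello", "thanks, bye"],
        "response": ["hi, how can i help?", "goodbye"],
    },
    index=[3516, 3867],
)

# By default, to_csv() writes the index as a first column with an empty header.
csv_text = toy.to_csv()

# Reading back without index_col: the old index appears as "Unnamed: 0".
with_extra_col = pd.read_csv(io.StringIO(csv_text))
print(list(with_extra_col.columns))

# Reading back with index_col=0: the first column becomes the index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(restored.columns))
```

Either form works for the modelling steps that follow, since only the `context` and `response` columns are used; the second simply keeps the frame free of a redundant column.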


We begin by using `TfidfVectorizer` from scikit-learn to transform the `context` column into numeric vectors.

This follows the official scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
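To make the transformation less of a black box, here is a minimal pure-Python sketch of the TF-IDF weighting that `TfidfVectorizer` applies with its default settings: lowercasing, tokens of two or more word characters, smoothed IDF, and L2 normalisation of each row. The three-sentence corpus and the helper names `tokenize` and `tfidf_vector` are hypothetical, and the sketch is for intuition only, not a replacement for the library.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus, not taken from the dataset.
docs = [
    "i need a cheap hotel",
    "i need a train to cambridge",
    "thanks that is all",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring TfidfVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(d) for d in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: the number of documents each token appears in.
df = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}


def tfidf_vector(tokens):
    # Term count times IDF, then scale the row to unit L2 length.
    counts = Counter(tokens)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


matrix = [tfidf_vector(toks) for toks in tokenized]
print(vocab)
print(len(matrix), len(matrix[0]))  # rows = documents, columns = vocabulary size
```

In this toy corpus, "need" occurs in two of the three documents, so it receives a lower weight in the first row than "cheap" or "hotel", which each occur in only one. Single-character tokens such as "i" and "a" never enter the vocabulary.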

In [9]:
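from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF on the "context" column and inspect a slice of the learned vocabulary.
# Note: this cell fits on the test split purely to explore the features;
# when training a model, the vectorizer would normally be fit on the training split.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(test_pairs_df["context"])
print(vectorizer.get_feature_names_out()[1000:1040])  # 40 vocabulary entries
print(X.shape)  # (number of contexts, vocabulary size)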

['pre' 'prefer' 'preferable' 'preferably' 'preference' 'preferred'
 'pretty' 'previously' 'prezzo' 'price' 'priced' 'prices' 'pricy' 'prince'
 'probably' 'problem' 'professional' 'promised' 'promising' 'proper'
 'property' 'provide' 'provides' 'psychic' 'punter' 'put' 'quality'
 'question' 'questions' 'quickly' 'quite' 'quote' 'raise' 'rajmahal'
 'random' 'range' 'ranges' 'rated' 'rates' 'rather']
(5776, 1495)


For `.shape`, the first number is the number of rows (context messages) and the second is the size of the learned vocabulary (unique tokens), so:

5,776 context messages × 1,495 unique tokens → each context is turned into a 1,495-dimensional numeric vector.
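Because the number of columns is set by the vocabulary learned during fitting, every split has to be transformed with the same fitted vectorizer for the dimensions to line up. The sketch below illustrates the usual pattern of calling `fit_transform` on the training contexts and `transform` on the others. The toy lists `train_contexts` and `val_contexts` are hypothetical stand-ins for the `context` columns of the real splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy splits standing in for the "context" columns.
train_contexts = [
    "i need a cheap hotel",
    "book a train to cambridge",
    "thanks that is all",
]
val_contexts = [
    "i need a train",
    "thanks for the taxi",
]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training contexts only.
X_train = vectorizer.fit_transform(train_contexts)

# Reuse the learned vocabulary; tokens unseen during fitting are ignored.
X_val = vectorizer.transform(val_contexts)

print(X_train.shape)  # (number of training contexts, vocabulary size)
print(X_val.shape)    # same number of columns as X_train
```

Words that appear only in the validation contexts, such as "taxi" here, get no column, which is why both matrices share the same width regardless of what the later splits contain.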