# NLP-Based Academic Assistant Chatbot for Students

## Student Schedule Management and Mental Health

## 1. Import Libraries

All libraries required for the NLP project are imported up front so that the remaining steps run in a single, consistent environment.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import re
import nltk

from datasets import load_dataset

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score

import joblib



## 2. Download NLTK Resources

NLTK needs two additional resources for preprocessing: the `stopwords` corpus and `wordnet`.

```python
nltk.download('stopwords')
nltk.download('wordnet')
```

---



## 3. Load Dataset

The dataset is loaded directly from the Hugging Face Hub using the `datasets` library. Because the dataset is public and loaded by name, the same data can be retrieved reproducibly without downloading files by hand.

In [2]:
dataset = load_dataset("edmdias/amfam-chatbot-intent-dataset")

Generating train split: 100%|██████████| 11130/11130 [00:00<00:00, 152269.41 examples/s]
Generating test split: 100%|██████████| 11042/11042 [00:00<00:00, 325267.26 examples/s]


## 4. Convert the Dataset to a DataFrame

To simplify analysis and preprocessing, the training split is converted to a pandas DataFrame.

In [8]:
train_df = pd.DataFrame(dataset['train'])
train_df.head()

| | INTENT_NAME | UTTERANCES |
|---|---|---|
| 0 | INFO_ADD_HOUSE | add a homeowners policy |
| 1 | INFO_ADD_HOUSE | I just bought a house and want to add it to th... |
| 2 | INFO_ADD_HOUSE | How can I add my house to my existing policies |
| 3 | INFO_ADD_HOUSE | just purchased a house and need to add it to m... |
| 4 | INFO_ADD_HOUSE | I need to add a house to my policy |


In [12]:
train_df = train_df.rename(columns={
    'UTTERANCES': 'text',
    'INTENT_NAME': 'intent'
})

## 5. Exploratory Data Analysis (EDA)

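### 5.1 Dataset Information

The output below still shows the original column names because this cell (`In [9]`) was executed before the rename in `In [12]`.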

In [9]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11130 entries, 0 to 11129
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   INTENT_NAME  11130 non-null  object
 1   UTTERANCES   11130 non-null  object
dtypes: object(2)
memory usage: 174.0+ KB


### 5.2 Intent Label Distribution

This analysis shows how the samples are distributed across intent classes. The ten most frequent intents are listed below.

In [13]:
train_df['intent'].value_counts().head(10)

intent
INFO_ADD_REMOVE_VEHICLE    189
INFO_LOGIN_ERROR           186
INFO_ERS                   180
INFO_ADD_REMOVE_INSURED    179
INFO_CAREERS               162
INFO_DIFFERENT_AMTS        161
INFO_CANCEL_INS_POLICY     155
INFO_SPEAK_TO_REP          155
INFO_UPDATE_LIENHOLDER     153
INFO_DELETE_DUPE_PYMT      148
Name: count, dtype: int64
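
The ten most frequent intents say little about the tail of the distribution, and the report in Section 10 lists some intents with only one or two test samples. A minimal sketch of how rare intents could be surfaced, using the standard-library `Counter`; the helper `rare_classes`, the threshold, and the toy labels are hypothetical and not part of the original notebook:

```python
from collections import Counter

def rare_classes(labels, min_count):
    """Return {label: count} for labels seen fewer than min_count times."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}

# Hypothetical toy labels; in the notebook this would be train_df['intent'].
labels = (
    ["INFO_LOGIN_ERROR"] * 6
    + ["INFO_CAREERS"] * 5
    + ["INFO_ADD_HOUSE"] * 2
)

print(rare_classes(labels, min_count=5))  # {'INFO_ADD_HOUSE': 2}
```

Intents flagged this way are candidates for merging, collecting more utterances, or reporting separately, since their per-class metrics rest on very few samples.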

## 6. Text Preprocessing

Preprocessing is carried out step by step, with each operation in its own cell so that the workflow is easy to follow and each step can be checked on its own.

### 6.1 Lowercasing

All text is converted to lowercase so that words differing only in capitalization are treated as the same token.

In [14]:
train_df['text_lower'] = train_df['text'].str.lower()
train_df[['text', 'text_lower']].head()

| | text | text_lower |
|---|---|---|
| 0 | add a homeowners policy | add a homeowners policy |
| 1 | I just bought a house and want to add it to th... | i just bought a house and want to add it to th... |
| 2 | How can I add my house to my existing policies | how can i add my house to my existing policies |
| 3 | just purchased a house and need to add it to m... | just purchased a house and need to add it to m... |
| 4 | I need to add a house to my policy | i need to add a house to my policy |


### 6.2 Cleaning (Removing Symbols and Numbers)

Every character other than the lowercase letters a-z and whitespace is removed to reduce noise in the text.

In [15]:
train_df['text_clean'] = train_df['text_lower'].apply(lambda x: re.sub(r'[^a-z\s]', '', x))
train_df[['text_lower', 'text_clean']].head()

| | text_lower | text_clean |
|---|---|---|
| 0 | add a homeowners policy | add a homeowners policy |
| 1 | i just bought a house and want to add it to th... | i just bought a house and want to add it to th... |
| 2 | how can i add my house to my existing policies | how can i add my house to my existing policies |
| 3 | just purchased a house and need to add it to m... | just purchased a house and need to add it to m... |
| 4 | i need to add a house to my policy | i need to add a house to my policy |
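
The first five rows contain no punctuation or digits, so the effect of the regular expression is not visible in the table. The sketch below applies the same pattern, `[^a-z\s]`, to two hypothetical utterances that are not taken from the dataset; the wrapper function `clean` is introduced only for illustration:

```python
import re

def clean(text):
    """Lowercase, then drop every character that is not a-z or whitespace."""
    return re.sub(r'[^a-z\s]', '', text.lower())

# Hypothetical utterances, chosen to contain punctuation and digits.
samples = [
    "I can't log in!",
    "Add my 2nd car, a 2019 Civic",
]

for s in samples:
    print(repr(clean(s)))
# 'i cant log in'
# 'add my nd car a  civic'
```

Two side effects are worth noting: contractions collapse into a single token (`can't` becomes `cant`), and removing digits can leave fragments such as `nd` and doubled spaces. The doubled spaces are harmless here because the later `split()` call treats any run of whitespace as one separator.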


### 6.3 Tokenizing

Tokenizing splits each cleaned text into individual words. Here this is done with `str.split()`, which separates on whitespace.

In [16]:
train_df['tokens'] = train_df['text_clean'].apply(lambda x: x.split())
train_df[['text_clean', 'tokens']].head()

| | text_clean | tokens |
|---|---|---|
| 0 | add a homeowners policy | [add, a, homeowners, policy] |
| 1 | i just bought a house and want to add it to th... | [i, just, bought, a, house, and, want, to, add... |
| 2 | how can i add my house to my existing policies | [how, can, i, add, my, house, to, my, existing... |
| 3 | just purchased a house and need to add it to m... | [just, purchased, a, house, and, need, to, add... |
| 4 | i need to add a house to my policy | [i, need, to, add, a, house, to, my, policy] |


### 6.4 Stopword Removal

Stopwords are removed because they carry little information for distinguishing one intent from another.

In [17]:
stop_words = set(stopwords.words('english'))

train_df['tokens_no_stopwords'] = train_df['tokens'].apply(
    lambda x: [word for word in x if word not in stop_words]
)
train_df[['tokens', 'tokens_no_stopwords']].head()

| | tokens | tokens_no_stopwords |
|---|---|---|
| 0 | [add, a, homeowners, policy] | [add, homeowners, policy] |
| 1 | [i, just, bought, a, house, and, want, to, add... | [bought, house, want, add, rest, policies] |
| 2 | [how, can, i, add, my, house, to, my, existing... | [add, house, existing, policies] |
| 3 | [just, purchased, a, house, and, need, to, add... | [purchased, house, need, add, policies] |
| 4 | [i, need, to, add, a, house, to, my, policy] | [need, add, house, policy] |
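
One caveat for intent classification is that the NLTK English stopword list also contains negation words such as `not`, so removing every stopword can erase the difference between "want to cancel" and "do not want to cancel". The sketch below illustrates this with a small hand-picked subset of stopwords rather than the full NLTK list; the helper `remove_stopwords` and the example tokens are hypothetical:

```python
def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in stop_words."""
    return [t for t in tokens if t not in stop_words]

# Illustrative subset only; the notebook uses stopwords.words('english').
stop_words = {"i", "do", "not", "to", "my", "a"}
tokens = ["i", "do", "not", "want", "to", "cancel", "my", "policy"]

# Removing every stopword drops the negation.
print(remove_stopwords(tokens, stop_words))
# ['want', 'cancel', 'policy']

# Optional variant: keep negation words by excluding them from the set.
keep_negation = stop_words - {"not"}
print(remove_stopwords(tokens, keep_negation))
# ['not', 'want', 'cancel', 'policy']
```

Whether keeping negations helps depends on the intents involved, so it is best treated as an option to compare on the evaluation split rather than a fixed rule.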


### 6.5 Lemmatization

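Lemmatization reduces each word to its base form. `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, so plural nouns are normalized (`policies` becomes `policy`) while verb forms such as `bought` and `purchased` stay unchanged, as the output below shows.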

In [18]:
lemmatizer = WordNetLemmatizer()

train_df['tokens_lemmatized'] = train_df['tokens_no_stopwords'].apply(
    lambda x: [lemmatizer.lemmatize(word) for word in x]
)
train_df[['tokens_no_stopwords', 'tokens_lemmatized']].head()

| | tokens_no_stopwords | tokens_lemmatized |
|---|---|---|
| 0 | [add, homeowners, policy] | [add, homeowner, policy] |
| 1 | [bought, house, want, add, rest, policies] | [bought, house, want, add, rest, policy] |
| 2 | [add, house, existing, policies] | [add, house, existing, policy] |
| 3 | [purchased, house, need, add, policies] | [purchased, house, need, add, policy] |
| 4 | [need, add, house, policy] | [need, add, house, policy] |


### 6.6 Join Tokens

The processed tokens are joined back into a single string per row, which is the input format expected by the vectorizer.

In [19]:
train_df['final_text'] = train_df['tokens_lemmatized'].apply(lambda x: ' '.join(x))
train_df[['final_text']].head()

| | final_text |
|---|---|
| 0 | add homeowner policy |
| 1 | bought house want add rest policy |
| 2 | add house existing policy |
| 3 | purchased house need add policy |
| 4 | need add house policy |
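
At prediction time, a new user message has to pass through exactly the same steps before it reaches the vectorizer. One way to guarantee this is to collect steps 6.1 to 6.6 into a single function. The sketch below is an assumption about how that could look, not code from the notebook: `preprocess` is a hypothetical helper, and the stopword set and lemma lookup are tiny stand-ins for the NLTK `stopwords` corpus and `WordNetLemmatizer`.

```python
import re

def preprocess(text, stop_words=frozenset(), lemmatize=lambda w: w):
    """Apply lowercasing, cleaning, tokenizing, stopword removal,
    lemmatization, and joining, in the same order as Section 6."""
    text = text.lower()                                  # 6.1
    text = re.sub(r'[^a-z\s]', '', text)                 # 6.2
    tokens = text.split()                                # 6.3
    tokens = [t for t in tokens if t not in stop_words]  # 6.4
    tokens = [lemmatize(t) for t in tokens]              # 6.5
    return ' '.join(tokens)                              # 6.6

# Stand-ins for illustration; the notebook would pass
# set(stopwords.words('english')) and WordNetLemmatizer().lemmatize.
toy_stop_words = {"i", "a", "to", "my"}
toy_lemmas = {"policies": "policy"}

result = preprocess(
    "I need to add a house to my policies!",
    stop_words=toy_stop_words,
    lemmatize=lambda w: toy_lemmas.get(w, w),
)
print(result)  # need add house policy
```

The same function could then be applied to the training column and to incoming chatbot messages, which keeps training and inference consistent.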


## 7. Feature Extraction (TF-IDF)

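TF-IDF converts each text into a numeric vector in which a term's weight grows with its frequency in the document and shrinks as the term appears in more documents. The vocabulary is capped at the 5,000 most frequent terms.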

In [20]:
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train_df['final_text'])
y = train_df['intent']
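
To make the weighting concrete, the sketch below computes TF-IDF by hand using the formula that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It is a simplified illustration with hypothetical toy documents. It tokenizes on whitespace, whereas the scikit-learn default token pattern also drops single-character tokens, and it returns dense lists instead of a sparse matrix.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Hypothetical toy documents in the style of final_text.
docs = ["add house policy", "cancel policy"]
vocab, rows = tfidf(docs)

print(vocab)  # ['add', 'cancel', 'house', 'policy']
for row in rows:
    print([round(w, 3) for w in row])
```

In the second document, `cancel` receives a higher weight than `policy` because `policy` occurs in both documents and therefore has a lower IDF.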

## 8. Train-Test Split

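The data is split into a training set (80%) and a test set (20%) for model evaluation. The `stratify=y` argument keeps the proportion of each intent roughly the same in both sets.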

In [21]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
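
In the cells above, `vectorizer.fit_transform` runs on every row before `train_test_split`, so the vocabulary and IDF weights are partly derived from rows that later serve as test data. One common way to avoid this is to split the text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that the vectorizer is fitted on the training rows only. A minimal sketch under that assumption, with hypothetical toy texts standing in for `train_df['final_text']` and `train_df['intent']`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; not taken from the dataset.
texts = [
    "add house policy", "add home policy",
    "add house existing policy", "need add house",
    "cancel policy", "cancel insurance policy",
    "want cancel policy", "cancel auto policy",
]
labels = ["INFO_ADD_HOUSE"] * 4 + ["INFO_CANCEL_INS_POLICY"] * 4

# Split the text itself, before any feature extraction.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training rows only.
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(text_train, y_train)

# At prediction time the test text is only transformed, never fitted.
predictions = pipeline.predict(text_test)
print(list(zip(text_test, predictions)))
```

A further practical benefit is that the fitted pipeline is a single object, so the vectorizer and the model cannot get out of sync when they are saved and loaded.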

## 9. Model Training

### 9.1 Model 1: Logistic Regression

In [22]:
model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 1000 |


### 9.2 Model 2: Support Vector Machine (SVM)

In [23]:
model_svm = LinearSVC()
model_svm.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| loss | 'squared_hinge' |
| dual | 'auto' |
| tol | 0.0001 |
| C | 1.0 |
| multi_class | 'ovr' |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| verbose | 0 |


## 10. Evaluation

Both models are evaluated on the test split with `classification_report`, which lists per-intent precision, recall, and F1-score together with overall accuracy.

In [24]:
print("Logistic Regression")
print(classification_report(y_test, model_lr.predict(X_test)))

print("SVM")
print(classification_report(y_test, model_svm.predict(X_test)))

Logistic Regression
                                             precision    recall  f1-score   support

                             INFO_ADD_HOUSE       0.00      0.00      0.00         1
                    INFO_ADD_REMOVE_INSURED       0.67      0.92      0.78        36
                    INFO_ADD_REMOVE_VEHICLE       0.69      0.92      0.79        38
INFO_ADD_VEHICLE_PROPERTY_PAPERLESS_BILLING       0.83      0.92      0.87        26
                           INFO_AGENT_WRONG       0.00      0.00      0.00         1
                    INFO_AGT_NOT_RESPONDING       0.68      1.00      0.81        27
                         INFO_AMERICAN_STAR       0.00      0.00      0.00         1
                               INFO_AMT_DUE       0.71      0.87      0.78        23
                          INFO_AST_PURCHASE       0.00      0.00      0.00         2
                             INFO_AST_QUOTE       0.00      0.00      0.00         4
                        INFO_ATV_INS_EXPLAN 

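
`accuracy_score` is imported in Section 1 but not called in the evaluation cell, and intents that a model never predicts show a precision of 0.00 in the report. The sketch below shows one way to get a compact summary per model: overall accuracy plus macro-averaged F1, with `zero_division=0` stating explicitly how undefined precision is handled. The labels are hypothetical toy values, not results from the notebook.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Hypothetical toy labels; "C" is never predicted, so its precision is undefined.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "A"]

acc = accuracy_score(y_true, y_pred)
# Macro F1 averages the per-class F1 scores, so rare classes count
# as much as frequent ones.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy = {acc:.3f}")   # accuracy = 0.667
print(f"macro F1 = {macro_f1:.3f}")  # macro F1 = 0.489
print(classification_report(y_true, y_pred, zero_division=0))
```

With many small intent classes, macro F1 is often a more informative number for comparing Logistic Regression and SVM than accuracy alone, because accuracy is dominated by the frequent intents.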


## 11. Save Model

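The Logistic Regression model and the fitted vectorizer are saved so that they can be reused at the deployment stage. Both files are needed, because new text must be transformed with the same vectorizer that was used in training.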

In [26]:
joblib.dump(model_lr, '../model/intent_model.pkl')
joblib.dump(vectorizer, '../model/tfidf_vectorizer.pkl')

['../model/tfidf_vectorizer.pkl']
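
`joblib.dump` raises an error if the target directory does not exist, and the saved files are only useful if they can be loaded back and produce the same predictions. The sketch below shows a save-and-load round trip under those assumptions. It trains a toy model on hypothetical data and writes to a temporary directory instead of `../model/`, so it can run anywhere.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the real training set.
texts = ["add house policy", "need add house",
         "cancel policy", "want cancel policy"]
labels = ["INFO_ADD_HOUSE", "INFO_ADD_HOUSE",
          "INFO_CANCEL_INS_POLICY", "INFO_CANCEL_INS_POLICY"]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "model")
    os.makedirs(model_dir, exist_ok=True)  # create the folder before dumping

    model_path = os.path.join(model_dir, "intent_model.pkl")
    vec_path = os.path.join(model_dir, "tfidf_vectorizer.pkl")
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vec_path)

    # Load both artifacts, as a deployment script would.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vec_path)

# New text must already be preprocessed the same way as final_text.
new_text = ["cancel policy"]
original = model.predict(vectorizer.transform(new_text))
restored = loaded_model.predict(loaded_vectorizer.transform(new_text))
print(original[0], restored[0])
```

Files written by `joblib` are pickle-based, so they should be loaded with compatible library versions and only from trusted sources.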

## 12. Next Development Steps

This chatbot model will be integrated into the MindSchedule application so that it can read the context of a user's schedule and give more personalized recommendations on time management and mental health.
