**Tutorial: Question Type Detection / Classification**

Detecting the question type is part of the *question processing* stage.

Reference: Xin Li and Dan Roth, "Learning Question Classifiers," COLING'02, Aug. 2002.

Dataset used: training_set_1, https://cogcomp.seas.upenn.edu/Data/QA/QC/

Import the libraries needed for text classification.

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

Read the file containing the question dataset; each line holds a question together with its type label.

In [22]:
with open('train_1000.label', 'r', encoding='ISO-8859-1') as file:
    lines = file.readlines()

Print the first question and its label from the dataset.

In [23]:
print(lines[0])

DESC:manner How did serfdom develop in and then leave Russia ?



Check the total number of questions.

In [24]:
print(len(lines))

1000


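Process the dataset: separate the question-type label from the question text. In this tutorial, the type is identified only at the coarse (more general) level. The code also applies simple keyword rules that override the dataset label (for example, a question containing "Who" is labeled HUM). For the full description of the question types, see the reference paper.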

In [25]:
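labels = []
questions = []
for line in lines:
    tokens = line.split()
    # Keep only the coarse label, e.g. "DESC" from "DESC:manner".
    current_label = tokens[0].split(':')[0]
    # Everything after the label and the following space is the question.
    current_question = line[len(tokens[0])+1:]
    # Keyword rules override the dataset label. any() tests every keyword;
    # an expression like ("A" or "B") in text would test only "A".
    if any(k in current_question for k in ("Who", "Whom")):
        current_label = "HUM"
    elif "Where" in current_question:
        current_label = "LOC"
    elif any(k in current_question for k in ("How much", "How many")):
        current_label = "NUM"
    elif any(k in current_question for k in ("abbreviation", "stand for")):
        current_label = "ABBR"
    labels.append(current_label)
    questions.append(current_question)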
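
Each dataset line has the form `COARSE:fine question text`, so the label and the question can also be separated without tokenizing the whole line. The following is a minimal sketch of that idea, not a replacement for the loop above; `parse_line` is a hypothetical helper name introduced here for illustration.

```python
def parse_line(line):
    """Split one dataset line into (coarse_label, fine_label, question).

    Assumes the format 'COARSE:fine question text', as in the first line
    of the dataset shown earlier.
    """
    # The label is everything before the first space.
    label, _, question = line.strip().partition(' ')
    # The coarse and fine parts of the label are separated by a colon.
    coarse, _, fine = label.partition(':')
    return coarse, fine, question


example = "DESC:manner How did serfdom develop in and then leave Russia ?\n"
print(parse_line(example))
```

Unlike the loop above, this helper strips the trailing newline from the question and leaves the dataset label unchanged.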

Check the label of the first question.

In [26]:
print(labels[0])

DESC


Check the first question.

In [27]:
print(questions[0])

How did serfdom develop in and then leave Russia ?



Split the data into training and test sets, 80:20.

In [28]:
X_train = questions[0:800]
y_train = labels[0:800]
X_test = questions[800:1000]
y_test = labels[800:1000]
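
Slicing keeps the original file order, so the class proportions in the two parts are not guaranteed to match. A quick way to check this is to count the labels in each part. The sketch below uses made-up toy labels, and `label_distribution` is a hypothetical helper name.

```python
from collections import Counter


def label_distribution(labels):
    """Return the fraction of each label in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels for illustration only (not taken from the dataset).
toy_labels = ["HUM", "LOC", "HUM", "NUM", "DESC", "HUM", "LOC", "DESC", "NUM", "HUM"]
split_at = int(0.8 * len(toy_labels))  # 80:20 split point
train_part, test_part = toy_labels[:split_at], toy_labels[split_at:]

print(label_distribution(train_part))
print(label_distribution(test_part))
```

In the notebook, the same check would be `label_distribution(y_train)` and `label_distribution(y_test)`.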

In [29]:
print(len(X_train))

800


Convert the text representation into count vectors.

In [30]:
cv = CountVectorizer(analyzer='word') 
X_train_cv = cv.fit_transform(X_train)

In [31]:
X_test_cv = cv.transform(X_test)

In [32]:
print(X_train[0])

How did serfdom develop in and then leave Russia ?



Example: the vector representation of the first question.

In [33]:
print(X_train_cv[0])

  (0, 1053)	1
  (0, 646)	1
  (0, 1862)	1
  (0, 637)	1
  (0, 1074)	1
  (0, 131)	1
  (0, 2089)	1
  (0, 1221)	1
  (0, 1817)	1
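
Each output line reads `(row, column)  count`: the row is the question index, the column is the vocabulary index of a word, and the count is how often that word occurs in the question. The plain-Python sketch below illustrates the idea. It is a simplified stand-in, not scikit-learn's implementation, and it assumes lowercasing, tokens of at least two word characters, and a vocabulary indexed in alphabetical order. The function names and toy questions are made up for illustration.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # words with at least two characters


def tokenize(text):
    """Lowercase the text and extract word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(texts):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({token for text in texts for token in tokenize(text)})
    return {token: index for index, token in enumerate(tokens)}


def to_counts(text, vocabulary):
    """Return {column_index: count} for tokens that are in the vocabulary."""
    counts = Counter(tokenize(text))
    return {vocabulary[t]: n for t, n in sorted(counts.items()) if t in vocabulary}


# Toy questions for illustration only (not taken from the dataset).
toy_questions = ["Who wrote the book ?", "Where is the library in the city ?"]
vocabulary = build_vocabulary(toy_questions)
print(vocabulary)
print(to_counts(toy_questions[1], vocabulary))
```

Words that are not in the fitted vocabulary are dropped when a new text is transformed, so a test question can have fewer entries than it has distinct words.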


Train a Multinomial Naive Bayes classifier.

In [34]:
clf = MultinomialNB()
clf.fit(X_train_cv, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
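
Multinomial Naive Bayes picks the class with the highest score, where the score is the log prior of the class plus the sum of the log probabilities of each word given that class. The `alpha` parameter adds a constant to every word count so that unseen word and class combinations do not get zero probability. The sketch below illustrates this on token lists in plain Python. It is a simplified illustration, not scikit-learn's implementation, and `TinyMultinomialNB` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Simplified multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, documents, labels):
        """documents: list of token lists; labels: list of class names."""
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {t for tokens in documents for t in tokens}
        return self

    def log_scores(self, tokens):
        """Return {class: log P(class) + sum of log P(token | class)}."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / total_docs)
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # words never seen in training are ignored
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, tokens):
        """Return the class with the highest log score."""
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training data for illustration only (not taken from the dataset).
train_docs = [["who", "wrote", "it"], ["who", "is", "he"],
              ["where", "is", "it"], ["where", "was", "he"]]
train_labels = ["HUM", "HUM", "LOC", "LOC"]
model = TinyMultinomialNB(alpha=1.0).fit(train_docs, train_labels)
print(model.predict(["who", "was", "she"]))
print(model.predict(["where", "is", "she"]))
```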

Predict the question types for the test data.

In [35]:
y_predict = clf.predict(X_test_cv)

Vector representation of the first test example.

In [36]:
print(X_test_cv[0])

  (0, 279)	1
  (0, 846)	1
  (0, 937)	1
  (0, 1074)	1
  (0, 1500)	1
  (0, 2084)	2
  (0, 2111)	1
  (0, 2215)	1
  (0, 2251)	1


The first question in the test data.

In [37]:
print(X_test[0])

Who was the first black golfer to tee off in the Masters ?



Predicted question type for the first test question.

In [38]:
print(y_predict[0])

HUM


Actual type of the first test question.

In [39]:
print(y_test[0])

HUM
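
Checking one example at a time does not scale to the whole test set. A short sketch for listing every wrong prediction follows; `misclassified` is a hypothetical helper name and the toy values are made up.

```python
def misclassified(questions, y_true, y_pred):
    """Return (question, true_label, predicted_label) for every wrong prediction."""
    return [
        (question, true, predicted)
        for question, true, predicted in zip(questions, y_true, y_pred)
        if true != predicted
    ]


# Toy values for illustration only (not actual model output).
toy_questions = ["Who wrote Hamlet ?", "Where is Jakarta ?", "What is an atom ?"]
toy_true = ["HUM", "LOC", "DESC"]
toy_pred = ["HUM", "LOC", "ENTY"]
for row in misclassified(toy_questions, toy_true, toy_pred):
    print(row)
```

In the notebook, the corresponding call would be `misclassified(X_test, y_test, y_predict)`.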


Classification performance (question type detection) on the test data.

In [40]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

        ABBR       0.00      0.00      0.00         4
        DESC       0.72      0.57      0.64        37
        ENTY       0.52      0.67      0.59        51
         HUM       0.78      0.76      0.77        50
         LOC       0.85      0.88      0.86        32
         NUM       0.83      0.77      0.80        26

    accuracy                           0.70       200
   macro avg       0.62      0.61      0.61       200
weighted avg       0.71      0.70      0.70       200
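
For each class, precision is the share of predictions of that class that are correct, recall is the share of true members of that class that are found, and F1 is their harmonic mean. When a class is never predicted, its precision is undefined and is reported as 0.00, which is consistent with the ABBR row above. The sketch below computes these values in plain Python on made-up labels; `per_class_metrics` is a hypothetical helper, not scikit-learn's implementation.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label; undefined ratios become 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for illustration only (not the tutorial's predictions).
toy_true = ["HUM", "HUM", "LOC", "LOC", "ABBR", "NUM"]
toy_pred = ["HUM", "LOC", "LOC", "LOC", "LOC", "NUM"]
for label in ["ABBR", "HUM", "LOC", "NUM"]:
    print(label, per_class_metrics(toy_true, toy_pred, label))
```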



