Introduction: The goal of this task is to predict a customer's intent from the text of their query. The dataset is ATIS (Airline Travel Information System), which consists of (query, intent) pairs. For this problem, you will implement the Bag-of-Words technique and a fully connected neural network to predict the intent of the query text.

In [1]:
import tensorflow as tf
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [2]:
df_train = pd.read_csv('dataset/atis.train.csv', delimiter=',')
query_train, intent_train = df_train["tokens"], df_train["intent"]

df_test = pd.read_csv('dataset/atis.test.csv', delimiter=',')
query_test, intent_test = df_test["tokens"], df_test["intent"]

In [3]:
print("Query examples: ")
query_train.head(5)

Query examples: 


0    BOS what is the cost of a round trip flight fr...
1    BOS now i need a flight leaving fort worth and...
2    BOS i need to fly from kansas city to chicago ...
3           BOS what is the meaning of meal code s EOS
4    BOS show me all flights from denver to pittsbu...
Name: tokens, dtype: object

In [4]:
print("Intent examples: ")
intent_train.head(5)

Intent examples: 


0         atis_airfare
1          atis_flight
2          atis_flight
3    atis_abbreviation
4          atis_flight
Name: intent, dtype: object

In [5]:
print("There are {:d} different categories of intent.".format(intent_train.nunique()))
print(intent_train.unique())

There are 17 different categories of intent.
['atis_airfare' 'atis_flight' 'atis_abbreviation' 'atis_ground_service'
 'atis_restriction' 'atis_airport' 'atis_quantity' 'atis_meal'
 'atis_airline' 'atis_city' 'atis_flight_no' 'atis_ground_fare'
 'atis_flight_time' 'atis_flight#atis_airfare' 'atis_distance'
 'atis_aircraft' 'atis_capacity']


In [12]:
intent_test.shape, intent_train.shape, query_test.shape, query_train.shape

((586,), (4274,), (586,), (4274,))

In [41]:
#####################################################################################
# TODO: Transform query_train and query_test into Bag-of-Words representations: X_train, X_test
# Hint: Use sklearn's CountVectorizer to construct the Bag-of-Words representation of the query text.
# Approximately 3 lines of code

vectorizer = CountVectorizer(ngram_range=(1,2))
# fit_vectorizer = vectorizer.fit(query_train)
X_train = vectorizer.fit_transform(query_train).toarray()
X_test = vectorizer.transform(query_test).toarray()


#####################################################################################

In [42]:
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [43]:
X_test.shape, X_train.shape

((586, 6992), (4274, 6992))
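
To make the effect of `ngram_range=(1, 2)` concrete, here is a minimal sketch on two made-up queries. The `toy_queries` list and the variable names are assumptions for illustration only and are not part of the ATIS data. It compares the vocabulary built from unigrams alone with the one built from unigrams plus bigrams.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for query_train (not taken from the ATIS files).
toy_queries = [
    "BOS show me flights from denver EOS",
    "BOS what is the cost of a flight EOS",
]

# By default CountVectorizer lowercases the text and keeps only tokens of
# two or more characters, so the single-letter word "a" is dropped.
unigram_vec = CountVectorizer(ngram_range=(1, 1)).fit(toy_queries)
bigram_vec = CountVectorizer(ngram_range=(1, 2)).fit(toy_queries)

n_uni = len(unigram_vec.vocabulary_)
n_bi = len(bigram_vec.vocabulary_)

print("unigram features:", n_uni)
print("unigram + bigram features:", n_bi)

# Bigram features are the vocabulary entries that contain a space.
print(sorted(term for term in bigram_vec.vocabulary_ if " " in term))
```

Each bigram becomes its own column, which is why the feature count grows quickly once contiguous word pairs are included.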

In [44]:
from sklearn import preprocessing

labelEncoder = preprocessing.LabelEncoder()
y_train = labelEncoder.fit_transform(intent_train)
y_test = labelEncoder.transform(intent_test)

In [45]:
y_train.shape, y_test.shape

((4274,), (586,))

In [46]:
#####################################################################################
# TODO: Train your favorite model and evaluate your model with the test data
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [47]:
xgb_model = XGBClassifier(use_label_encoder=False)

# Fit the XGBoost model on the training features and encoded labels
xgb_model.fit(X_train, y_train)

# predict() already returns integer class labels, so no rounding is needed
y_pred = xgb_model.predict(X_test)

# Compute the accuracy of the predictions
accuracy = accuracy_score(y_test, y_pred)

print('Test accuracy:', round(accuracy, 2)*100, 'percent')

Test accuracy: 97.0 percent
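
The introduction also mentions a fully connected neural network. The sketch below shows one way such a model could be trained on Bag-of-Words features. It swaps in scikit-learn's `MLPClassifier` in place of a TensorFlow model, and the toy queries, intents, and hyperparameters are assumptions chosen only to keep the example small. It is not the model evaluated above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for query_train / intent_train.
toy_queries = [
    "BOS show me flights from denver to boston EOS",
    "BOS i need a flight from dallas to chicago EOS",
    "BOS what is the cost of a round trip ticket EOS",
    "BOS how much is the fare to atlanta EOS",
]
toy_intents = ["atis_flight", "atis_flight", "atis_airfare", "atis_airfare"]

# Bag-of-Words features and integer-encoded labels, as in the cells above.
vec = CountVectorizer()
X_toy = vec.fit_transform(toy_queries)
enc = LabelEncoder()
y_toy = enc.fit_transform(toy_intents)

# One hidden layer of 32 units; the size and iteration count are arbitrary.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
mlp.fit(X_toy, y_toy)

# Predict the intent of a new query and map it back to its string label.
new_query = ["BOS show me flights to boston EOS"]
pred = enc.inverse_transform(mlp.predict(vec.transform(new_query)))
print(pred[0])
```

On the real data the same pattern would apply to `X_train` and `y_train`, although the layer sizes would need tuning.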


## Brief observation and conclusion
Passing ngram_range=(1, 2) to CountVectorizer increases my XGBoost classifier's test accuracy from 96% to 97%.
The bigrams capture pairs of contiguous words that carry meaning together, so they add some local semantic context.
Without the bigrams, the number of input features is significantly smaller, which makes XGBoost cheaper to train. With them,
the vocabulary grows to 6,992 features and the classifier takes longer to run. The gain in accuracy is small relative to that extra cost.
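
One way to reduce the cost discussed above is to keep the `CountVectorizer` output as a sparse matrix instead of calling `.toarray()`. The sketch below, on made-up queries (`toy_queries` is an assumption, not ATIS data), shows how few cells of a Bag-of-Words matrix are actually non-zero.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for query_train.
toy_queries = [
    "BOS show me flights from denver to boston EOS",
    "BOS what is the cost of a round trip flight EOS",
    "BOS i need to fly from chicago to dallas EOS",
]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_sparse = vectorizer.fit_transform(toy_queries)  # SciPy sparse matrix
X_dense = X_sparse.toarray()                      # what the notebook materializes

stored = X_sparse.nnz   # number of non-zero entries actually stored
total = X_dense.size    # number of cells in the dense array
density = stored / total

print(f"shape={X_sparse.shape}, stored={stored}, cells={total}, density={density:.2%}")
```

Recent XGBoost releases accept SciPy sparse matrices in `fit` and `predict`, so passing the sparse output directly may reduce memory use and training time. Whether it helps on this dataset would need to be measured.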