## Classifying intention!

We will do the classic intent classification problem with Naive Bayes!

In [55]:
import pandas as pd
from io import StringIO
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

In [56]:
col = ['class', 'question']
intents= pd.read_csv('./intent_data.csv')
intents = intents[col]
intents = intents[pd.notnull(intents['question'])]
intents.columns = ['class', 'question']
intents['category_id'] = intents['class'].factorize()[0]
intents.head()

Unnamed: 0,class,question,category_id
0,Description,What is Filename injection Path traversel ?,0
1,Description,What does Filename injection Path traversel m...,0
2,Description,Tell me something about Filename injection Pa...,0
3,Description,Filename injection Path traversel,0
4,Description,Explain Filename injection Path traversel ?,0


In [57]:
category_id_df = intents[['class', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'class']].values)

Lets create features from the text, for this is useful the TTfidfVectorizer which is a tool to extract valuable features from text!

In [58]:
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(intents.question).toarray()
labels = intents.category_id
features.shape


(6096, 1322)

Let's create a wrapper to easy the measurement of accuracy!

In [59]:
def predict(question):
    intent=clf.predict(count_vect.transform([question]))
    intent=str(intent).strip("['']")
    return intent

Using Naive Bayes is as simple as instantiating the class MultinomialNB! Then pass each question in the train dataset to the vectorizer!

In [60]:
X_train, X_test, y_train, y_test = train_test_split(intents['question'], intents['class'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)

Lets validate the accuracy of the model!

In [61]:
predictions = X_test.apply(lambda x: predict(x))

In [62]:
predictions.head() 

263     Description
1430    Description
3432       Solution
5183       Solution
124        Solution
Name: question, dtype: object

In [63]:
print('Accuracy for the model is {}'.format(sum(predictions == y_test)/len(y_test)))

Accuracy for the model is 0.9744094488188977


Awesome right!