# Question type classification problem

This problem is tackled with the bag-of-words (BOW) representation commonly used in NLP, since each question contains very few words. On top of the bag of words, we apply term frequency and inverse document frequency (TF-IDF) to weight each word by how often it occurs in a question and across the training set. We then tried several machine learning techniques and identified the random forest classifier as the most suitable algorithm, obtaining an F1 score of more than 92%.

In [1]:
import pandas as pd
import numpy as np

In [2]:
train=pd.read_csv('LabelledData (1).txt',sep=',,,',names=['question','label'],engine='python')

### 1. Exploratory Data Analysis

In [3]:
train.head()

| | question | label |
|---|---|---|
| 0 | how did serfdom develop in and then leave russ... | unknown |
| 1 | what films featured the character popeye doyle ? | what |
| 2 | how can i find a list of celebrities ' real na... | unknown |
| 3 | what fowl grabs the spotlight after the chines... | what |
| 4 | what is the full form of .com ? | what |


In [4]:
#Removing unwanted spaces from both the ends

train['label']=train['label'].apply(lambda x: x.strip())
train['question']=train['question'].apply(lambda x: x.strip())

In [5]:
train.describe()  # count (1483) and unique (1476) differ for question, so some questions are duplicated

| | question | label |
|---|---|---|
| count | 1483 | 1483 |
| unique | 1476 | 5 |
| top | what is the speed of the mississippi river ? | what |
| freq | 3 | 609 |


In [6]:
train.drop_duplicates(inplace=True)

In [7]:
train['label'].value_counts()

what           606
who            401
unknown        271
affirmation    102
when            96
Name: label, dtype: int64

In [8]:
train.groupby(by='label').describe()

| label | question: count | question: unique | question: top | question: freq |
|---|---|---|---|---|
| affirmation | 102 | 102 | is this considered a "fill hose" ? | 1 |
| unknown | 271 | 271 | which japanese car maker had its biggest perce... | 1 |
| what | 606 | 606 | what was the name of the sitcom that alyssa mi... | 1 |
| when | 96 | 96 | when were camcorders introduced in malaysia ? | 1 |
| who | 401 | 401 | who played the ringo kid in the 1939 film stag... | 1 |


In [9]:
train.describe()

| | question | label |
|---|---|---|
| count | 1476 | 1476 |
| unique | 1476 | 5 |
| top | where is the danube ? | what |
| freq | 1 | 606 |


### 2. Data processing

In [10]:
import string
import nltk

In [11]:
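# Strip punctuation from the question and tokenize it into words
def text_process(question):
    # Iterating over a string yields characters, so the filter works character by character
    no_punct=[char for char in question if char not in string.punctuation]
    no_punct=''.join(no_punct).strip()
    # Disabled alternative: keep only English stop words (needs `from nltk.corpus import stopwords`)
    #return [word for word in no_punct.split() if word.lower() in stopwords.words('english')]
    return no_punct.split()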
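
To make the tokenizer's behaviour concrete, here is a minimal sketch that applies the same logic to two questions that appear in the outputs of this notebook. The function is restated in compact form so that the block runs on its own; it is meant to behave like the cell above, not to replace it.

```python
import string

def text_process(question):
    # Drop every punctuation character, then split on whitespace
    no_punct = ''.join(ch for ch in question if ch not in string.punctuation)
    return no_punct.strip().split()

print(text_process('what is the full form of .com ?'))
# ['what', 'is', 'the', 'full', 'form', 'of', 'com']

print(text_process('is this considered a "fill hose" ?'))
# ['is', 'this', 'considered', 'a', 'fill', 'hose']
```

Note that the function does not lowercase its input, so `What` and `what` would become different tokens; the questions shown in this data set already appear in lowercase.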

#### Vectorizing the questions into a bag of words (BOW) using the custom analyzer

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
bow_trasformer=CountVectorizer(analyzer=text_process).fit(train['question'])

In [14]:
print(len(bow_trasformer.vocabulary_))

3685


In [15]:
messages_bow=bow_trasformer.transform(train['question'])

In [16]:
messages_bow.shape

(1476, 3685)
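
The shape above means one row per question and one column per distinct token. As a rough illustration of what a bag-of-words matrix contains, the following plain-Python sketch builds one for two already tokenized questions from this notebook. It is a simplified stand-in, not scikit-learn's implementation; the names `docs`, `vocabulary` and `to_counts` are made up for this example, and the alphabetical column order is simply a deterministic choice.

```python
from collections import Counter

# Toy corpus: two questions from this notebook, already tokenized
docs = [
    ['what', 'is', 'the', 'full', 'form', 'of', 'com'],
    ['what', 'is', 'the', 'speed', 'of', 'the', 'mississippi', 'river'],
]

# Vocabulary: every distinct token gets a column index (alphabetical order)
all_tokens = sorted({tok for doc in docs for tok in doc})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

def to_counts(tokens):
    # One row of the document-term matrix: token counts per vocabulary column
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in all_tokens]

bow = [to_counts(doc) for doc in docs]

print(len(vocabulary))              # number of columns: 10
print(bow[1][vocabulary['the']])    # 'the' occurs twice in the second question: 2
```

With 1476 questions and 3685 distinct tokens, almost every entry of the real matrix is zero, which is why the vectorizer returns a sparse matrix.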

#### Transforming the bag-of-words matrix into TF-IDF

In [17]:
from sklearn.feature_extraction.text import TfidfTransformer

In [18]:
tfidf_transformer=TfidfTransformer().fit(messages_bow)

In [19]:
message_tfidf=tfidf_transformer.transform(messages_bow)
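
The weighting can be sketched in plain Python. The block below assumes the `TfidfTransformer` defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit Euclidean length. The count matrix is a hypothetical toy example, not data from this notebook.

```python
import math

# Toy count matrix: 3 documents x 3 terms (hypothetical values)
counts = [
    [1, 1, 0],
    [1, 0, 2],
    [1, 0, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: rare terms get a larger weight
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    # Weight raw counts by idf, then scale the row to unit L2 norm
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

tfidf = [tfidf_row(row) for row in counts]

for row in tfidf:
    print([round(v, 3) for v in row])
```

The first term occurs in every document, so its idf is exactly 1; the other two terms occur in a single document each and receive a higher weight within the rows that contain them.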

### 3. Loading different learning algorithms

In [20]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
classifier=[MultinomialNB(),LogisticRegression(),RandomForestClassifier()]

### 4. Splitting the data set into training and test sets

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
qn_train,qn_test,label_train,label_test=train_test_split(train['question'],train['label'],test_size=0.3)
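
As a rough sketch of what `test_size=0.3` means for this data set, the block below computes the split sizes and shuffles the row indices in plain Python. It assumes the test count is rounded up, which agrees with the total support of 443 in the classification report in section 6. The seed value is an illustrative choice: the call above passes no `random_state`, so the notebook's actual split differs from run to run.

```python
import math
import random

n_samples = 1476      # rows left after dropping duplicates
test_size = 0.3

# Test count is rounded up; the training set gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)   # 1033 443

# Shuffle the row indices with a fixed seed (hypothetical value) and cut once
indices = list(range(n_samples))
random.Random(0).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

Passing a fixed `random_state` to `train_test_split` would make the reported scores reproducible in the same way the fixed seed does here.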

### 5. Creating a pipeline for training and prediction

1. Vectorize the questions into a BOW using the `text_process` function defined above
2. Generate the TF-IDF weights from the BOW
3. Apply a particular classifier

In [23]:
from sklearn.pipeline import Pipeline

In [24]:
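def complete_process(classifier):
    """Fit BOW -> TF-IDF -> classifier on the training split and return predictions for qn_test."""
    # Note: relies on the global qn_train, label_train and qn_test defined above
    pipeline=Pipeline([
        ('bow',CountVectorizer(analyzer=text_process)),
        ('tfidf',TfidfTransformer()),
        ('classifier',classifier)
    ])
    pipeline.fit(qn_train,label_train)
    return pipeline.predict(qn_test)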

### 6. Checking the confusion matrix and derived F1 score for each classifier
We find that the random forest classifier has an F1 score above 92%, which is comparatively high.
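
As a sketch of how the scores in a classification report follow from a confusion matrix, the block below recomputes precision, recall and F1 in plain Python from the MultinomialNB matrix printed in the output of this section. The helper name `per_class_scores` is made up for this example; the support-weighted average is assumed to correspond to the report's `avg / total` row.

```python
labels = ['affirmation', 'unknown', 'what', 'when', 'who']

# Rows are true labels, columns are predicted labels
# (MultinomialNB confusion matrix copied from this notebook's output)
cm = [
    [1, 5, 30, 0, 1],
    [0, 43, 40, 0, 6],
    [0, 0, 174, 0, 2],
    [0, 1, 28, 1, 1],
    [0, 0, 21, 0, 89],
]

def per_class_scores(cm, k):
    tp = cm[k][k]
    predicted = sum(row[k] for row in cm)   # column sum: predicted as class k
    actual = sum(cm[k])                     # row sum: truly class k (support)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, actual

scores = [per_class_scores(cm, k) for k in range(len(labels))]
for name, (p, r, f1, support) in zip(labels, scores):
    print(f'{name:>11}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}')

# Support-weighted average F1 across the classes
total = sum(s[3] for s in scores)
weighted_f1 = sum(s[2] * s[3] for s in scores) / total
print(f'weighted F1: {weighted_f1:.2f}')
```

For `what`, precision is 174/293 and recall is 174/176, which explains the pattern in the Naive Bayes report: the model predicts the majority class far too often, so recall for `affirmation` and `when` collapses.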

In [25]:
from sklearn.metrics import classification_report, confusion_matrix

In [42]:
labels=['affirmation','unknown','what','when','who']

for classify in classifier:
    predictions=complete_process(classify)
    print(classify)
    print('\n')
    # Rows are true labels, columns are predicted labels
    print(pd.DataFrame(confusion_matrix(label_test,predictions,labels=labels),columns=labels,index=labels))
    print('\n')
    print(classification_report(label_test,predictions))

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)


             affirmation  unknown  what  when  who
affirmation            1        5    30     0    1
unknown                0       43    40     0    6
what                   0        0   174     0    2
when                   0        1    28     1    1
who                    0        0    21     0   89


             precision    recall  f1-score   support

affirmation       1.00      0.03      0.05        37
    unknown       0.88      0.48      0.62        89
       what       0.59      0.99      0.74       176
       when       1.00      0.03      0.06        31
        who       0.90      0.81      0.85       110

avg / total       0.79      0.70      0.64       443

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start