# Naive Bayes Classifier
This notebook trains a Naive Bayes classifier on a training set of Quora questions and makes predictions on a test set. The data is a Kaggle dataset available [here](https://www.kaggle.com/c/quora-insincere-questions-classification/data). To follow along, add `train.csv` and `test.csv` to the `data/` directory of this repository.

In [None]:
import pandas as pd
import src.NB_Classifier as NB

from sklearn.model_selection import train_test_split

%load_ext autoreload
%autoreload 2

In [29]:
train = pd.read_csv('data/train.csv')
train.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


In [None]:
# Extract the training data and corresponding labels
text = train['question_text']
labels = train['target']

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(text, labels,
                                                  test_size=0.2)
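
If the positive class is rare, a purely random split can leave the validation set with a different class ratio than the training set, and each run produces a different split. The following is a minimal sketch on toy data (not the Quora set) of passing `stratify` and `random_state` to `train_test_split`; the toy frame and the seed value 42 are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train.csv: 90 negative and 10 positive rows (illustrative only)
toy = pd.DataFrame({
    'question_text': [f'question {i}' for i in range(100)],
    'target': [0] * 90 + [1] * 10,
})

# stratify keeps the class ratio the same in both splits;
# random_state makes the split reproducible across runs
X_tr, X_va, y_tr, y_va = train_test_split(
    toy['question_text'], toy['target'],
    test_size=0.2, stratify=toy['target'], random_state=42)

# Fraction of positive labels in each split
print(y_tr.mean(), y_va.mean())
```

Both splits keep the 10% positive rate of the toy data, which makes validation scores easier to compare between runs.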

In [30]:
# Initialize the classifier and train on the training set
classifier = NB.NB_Classifier('data/stopwords.txt')

# Only use words that appear more than 50 times in the dataset
classifier.collect_dictionary(X_train, 50)

classifier.train(X_train, y_train, k=1)
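
The internals of `NB_Classifier` are not shown in this notebook. As a rough sketch of what a multinomial Naive Bayes trainer with a word-frequency cutoff and add-k (Laplace) smoothing might look like, the example below assumes that `k` in the call above is the smoothing constant and that the second argument to `collect_dictionary` is a frequency cutoff. The function names (`build_vocab`, `train_nb`, `predict_nb`) and the toy questions are hypothetical and are not the `src.NB_Classifier` API.

```python
import math
from collections import Counter

def build_vocab(docs, min_count):
    """Keep words that occur more than min_count times across all docs."""
    counts = Counter(w for d in docs for w in d.lower().split())
    return {w for w, c in counts.items() if c > min_count}

def train_nb(docs, labels, vocab, k=1):
    """Estimate log priors and add-k smoothed log likelihoods per class."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        word_counts[y].update(w for w in d.lower().split() if w in vocab)
    log_prior = {c: math.log(sum(1 for y in labels if y == c) / len(labels))
                 for c in classes}
    log_like = {}
    for c in classes:
        # Adding k to every count keeps unseen words from getting probability 0
        total = sum(word_counts[c].values()) + k * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + k) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Return the class with the highest log posterior; out-of-vocab words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc.lower().split() if w in log_like[c])
    return max(scores, key=scores.get)

# Hypothetical toy data, not taken from train.csv
docs = ['why is the sky blue', 'why are people so stupid',
        'how is the weather', 'stupid people are stupid']
labels = [0, 1, 0, 1]
vocab = build_vocab(docs, min_count=0)
log_prior, log_like = train_nb(docs, labels, vocab, k=1)
print(predict_nb('why are they so stupid', log_prior, log_like))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.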

In [33]:
# Evaluate the classifier on the validation set
classifier.evaluate(X_val, y_val)
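
The output of `evaluate` is not shown here. When one class is rare, a classifier that mostly predicts the majority class can still reach high accuracy, so it can help to look at precision, recall, and F1 for the positive class as well. The following is a minimal sketch using scikit-learn's metric functions on hypothetical labels; the numbers are made up for illustration and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical validation labels and predictions (not results from this notebook)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
precision = precision_score(y_true, y_pred)  # share of predicted positives that are correct
recall = recall_score(y_true, y_pred)        # share of true positives that were found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

In this toy case accuracy is noticeably higher than F1, which is the kind of gap that accuracy alone would hide.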

# Predictions
The remaining cells generate predictions for the questions in `test.csv`.

In [35]:
test = pd.read_csv('data/test.csv')
test_qs = test['question_text']

In [36]:
test.head()

Unnamed: 0,qid,question_text
0,00014894849d00ba98a9,My voice range is A2-C5. My chest voice goes u...
1,000156468431f09b3cae,How much does a tutor earn in Bangalore?
2,000227734433360e1aae,What are the best made pocket knives under $20...
3,0005e06fbe3045bd2a92,Why would they add a hypothetical scenario tha...
4,00068a0f7f41f50fc399,What is the dresscode for Techmahindra freshers?


In [37]:
preds = classifier.generate_preds(test_qs)
test['predictions'] = pd.Series(preds).values

In [38]:
test.head()

Unnamed: 0,qid,question_text,predictions
0,00014894849d00ba98a9,My voice range is A2-C5. My chest voice goes u...,0
1,000156468431f09b3cae,How much does a tutor earn in Bangalore?,0
2,000227734433360e1aae,What are the best made pocket knives under $20...,0
3,0005e06fbe3045bd2a92,Why would they add a hypothetical scenario tha...,0
4,00068a0f7f41f50fc399,What is the dresscode for Techmahindra freshers?,0


In [39]:
# Drop the question text so that only the remaining columns and the predictions are written out
test = test.drop(['question_text'], axis=1)

In [76]:
# Write the predictions to disk without the DataFrame index
test.to_csv('data/bmmidei_NB_Submission_1', index=False)