<a href="https://colab.research.google.com/github/farooqzaman1/DataSciencePrj/blob/master/bernoliNB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Quora Insincere Questions Classification
#Detect toxic content to improve online conversations
An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.

In this competition, Kagglers will develop models that identify and flag insincere questions. To date, Quora has employed both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content.

Here's your chance to combat online trolls at scale. Help Quora uphold their policy of “Be Nice, Be Respectful” and continue to be a place for sharing and growing the world’s knowledge.


**Importing required libararies**


In [0]:
from string import punctuation
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.naive_bayes import BernoulliNB
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import numpy as np

In [2]:
!pip install PyDrive

Collecting PyDrive
[?25l  Downloading https://files.pythonhosted.org/packages/52/e0/0e64788e5dd58ce2d6934549676243dc69d982f198524be9b99e9c2a4fd5/PyDrive-1.3.1.tar.gz (987kB)
[K    100% |████████████████████████████████| 993kB 7.5MB/s 
Building wheels for collected packages: PyDrive
  Running setup.py bdist_wheel for PyDrive ... [?25l- \ done
[?25h  Stored in directory: /root/.cache/pip/wheels/fa/d2/9a/d3b6b506c2da98289e5d417215ce34b696db856643bad779f4
Successfully built PyDrive
Installing collected packages: PyDrive
Successfully installed PyDrive-1.3.1


**Using google drive for project data resources**

In [4]:
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
from google.colab import files


auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)


# # files.upload()
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


**loading dataset**

In [5]:

data = pd.read_csv("drive/My Drive/DataScienceProject/train.csv")
print("total Instances: ",data.shape[0])

data=data.sort_values(by=['target'])

print("class 1: ",data['target'].sum(), "class 0: ", len(data['target'])-data['target'].sum())
SampleIndex=data['target'].sum() * 2 + 1000

dataSample= data[-SampleIndex:]
print("Selected Subset: ",dataSample.shape[0])


total Instances:  1306122
class 1:  80810 class 0:  1225312
Selected Subset:  162620


**Droping Nan values**

In [6]:
data = data[pd.notnull(data['target'])]

print("is there any null?")
print(data.isna().sum())

is there any null?
qid              0
question_text    0
target           0
dtype: int64


**Splitting data into train and test part**


In [15]:
train, test = train_test_split(dataSample, test_size=0.3)

X_train = train['question_text']
y_train = train['target']

    
X_test = test['question_text']
y_test = test['target']
print("Training on :", len(train['target']))
print("test on: ", len(test['target']))

Training on : 113834
test on:  48786


**Helper Function for reading Files**

In [0]:
# read Files
def load_doc(filename):
	file = open(filename, 'r', encoding="utf-8")
	text = file.read()
	file.close()
	return text


**Create list of valid tokens from text**

In [0]:
#create tokens of text, remove punctuation marks and filter out invalid tokens
def clean_question(quest, vocab):
	# create tokens using white space
	tokens = quest.split()
	# remove punctuation
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove tokens not in vocab
	tokens = [w for w in tokens if w in vocab]
	tokens = ' '.join(tokens)
	return tokens

**process list of documents**

In [0]:
# function to process and clean all questions
def process_docs(docs, vocab,):
    questionss = list()
    for d in docs:
        tokens = clean_question(d, vocab)
        questionss.append(tokens)
    return questionss


**Loading vocablary previously created  on complete data**

In [0]:
# load the vocabulary
vocab_filename = 'drive/My Drive/DataScienceProject/vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

**Process all training Questions**

In [0]:
# process all training Questions
train_docs = process_docs(X_train, vocab)


**Create tokenizer and convert text to sequences of maximum question length**

In [0]:
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the document
tokenizer.fit_on_texts(train_docs)

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

**Applying Bernoulli Naive Bayes from sklearn**

In [24]:

model = BernoulliNB()
model.fit(Xtrain, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

**Load test data and create text sequences of max document length**

In [0]:
# process all test questions and convert it to numerical form
test_docs = process_docs(X_test, vocab)
# fit the tokenizer on the documents
tokenizer.fit_on_texts(test_docs)
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

**Test predictions**

In [26]:

y_pred= (model.predict(Xtest))
sc1 = accuracy_score(y_test, y_pred)
print("Acc=",sc1)



Acc= 0.6400196777764112


**Confusion matrix**

In [41]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()


print( "tp=",tp, "tn =",tn, " fp=",fp, " fn=",fn)

tp= 12667 tn = 18557  fp= 6100  fn= 11462


**Percision, Recall and F Measure**

In [42]:
P = tp/(tp+fn)
R= tp/(tp+fp)
accuracy = (tp+tn)/(tn + fp + fn + tp)
F1=2*P*R/(P+R)
print("Accuracy= ",np.round(accuracy,2),"precission=",np.round(P,2),"recall=",np.round(R,2)," F1=",np.round(F1,2))

Accuracy=  0.64 precission= 0.52 recall= 0.67  F1= 0.59
