Problem Definition
---
I think one of the important things when you start a new machine learning project is Defining your problem. that means you should understand business problem.( Problem Formalization)

> We will be predicting whether a question asked on Quora is sincere or not

Source : https://www.kaggle.com/mjbahmani/a-data-science-framework-for-quora

Data Source : https://www.kaggle.com/c/quora-insincere-questions-classification/data

**About Quora**

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.

**Business View**

An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.

**What is a insincere question?**

Is defined as a question intended to make a statement rather than look for helpful answers.

![Quora_moderation_warning](\datasets\images/Quora_moderation_warning.png)

**Feature Set**

We use train.csv and test.csv as Input and we should upload a submission.csv as Output.

The training set contains the following 3 features (for Supervised Learning)
1. qid - unique question identifier
2. question_text - Quora question text
3. target - a question labeled "insincere" has a value of 1, otherwise 0

**Coding a solutiuon for solving the above problem**

In [42]:
# all import statements
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
#from wordcloud import WordCloud as wc   # not needed
from nltk.corpus import stopwords
import matplotlib.pylab as pylab
import matplotlib.pyplot as plt
from pandas import get_dummies
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import string
import scipy
import numpy
import nltk
import json
import sys
import csv
import os

In [43]:
# printing versions of the important packages
print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))

matplotlib: 3.1.2
sklearn: 0.22.1
scipy: 1.4.1
seaborn: 0.9.0
pandas: 0.25.3
numpy: 1.18.0
Python: 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)]


We would first do EDA ( Exploratory Data Analysis ) over Quora Data set :
--

![Quora_EDA_steps](images/Quora_EDA_steps.png)

In [44]:
# I start Collection Data by reading training and testing datasets 
# into Pandas DataFrames.
from sklearn.model_selection import train_test_split
train_large = pd.read_csv('C:\Program Files\Python36\suven\Adv ML\datasets\datasets/QuoratrainSet.csv')
# test_large = pd.read_csv('C:\Program Files\Python36\suven\Adv ML\datasets\datasets/Quoratestdata.csv')

train = train_large[ :15000]
X = train.drop('target', axis=1)  
y = train['target']  

In [45]:
# check top 5 records of training dataset
X.head() 

Unnamed: 0,qid,question_text
0,00002165364db923c7e6,How did Quebec nationalists see their province...
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco..."
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...


In [46]:
# Find the type of features in Quora dataset
# i.e get a quick statistics


print(X.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 2 columns):
qid              15000 non-null object
question_text    15000 non-null object
dtypes: object(2)
memory usage: 234.5+ KB
None


In [47]:
print(X.info())  # see carefully the last value is -> None. 
                    # indicating that there are no "Null" values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 2 columns):
qid              15000 non-null object
question_text    15000 non-null object
dtypes: object(2)
memory usage: 234.5+ KB
None


In [48]:
# shape for train and test
print('Shape of train:',X.shape)
print('Shape of test:',y.shape)

Shape of train: (15000, 2)
Shape of test: (15000,)


In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [50]:
# How many NA elements in every column!!
# Good news, it is Zero!
# To check out how many null info are on the dataset, we can use isnull().sum().
# recall from info() -> we found that it has zero Nulls. 

X_train.isnull().sum()
X_test.isnull().sum()
# data is infact clean and ready for use.

qid              0
question_text    0
dtype: int64

In [51]:
# in case , their were NA or None values in any row then we would drop the row.

# remove rows that have NA's
print('Before Droping',X_train.shape)
print('Before Droping',X_test.shape)
X_train = X_train.dropna()
X_test = X_test.dropna()
print('After Droping',X_train.shape)
print('After Droping',X_test.shape)

Before Droping (12000, 2)
Before Droping (3000, 2)
After Droping (12000, 2)
After Droping (3000, 2)


In [52]:
# Number of words in the text

X_train["num_words"] = X_train["question_text"].apply(lambda x: len(str(x).split()))
X_test["num_words"] = X_test["question_text"].apply(lambda x: len(str(x).split()))

print('maximum of num_words in train',X_train["num_words"].max())
print('min of num_words in train',X_train["num_words"].min())
print("maximum of  num_words in test",X_test["num_words"].max())
print('min of num_words in train',X_test["num_words"].min())

maximum of num_words in train 55
min of num_words in train 2
maximum of  num_words in test 51
min of num_words in train 3


In [53]:
# Number of unique words in the text
X_train["num_unique_words"] = X_train["question_text"].apply(lambda x: len(set(str(x).split())))
X_test["num_unique_words"] = X_test["question_text"].apply(lambda x: len(set(str(x).split())))

print('maximum of num_unique_words in train',X_train["num_unique_words"].max())

print("maximum of num_unique_words in test",X_test["num_unique_words"].max())


maximum of num_unique_words in train 48
maximum of num_unique_words in test 42


In [54]:
# Number of stopwords in the text

#from nltk.corpus import stopwords
eng_stopwords = set(stopwords.words("english"))

X_train["num_stopwords"] = X_train["question_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
X_test["num_stopwords"] = X_test["question_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))

print('maximum of num_stopwords in train',X_train["num_stopwords"].max())
print("maximum of num_stopwords in test",X_test["num_stopwords"].max())


maximum of num_stopwords in train 30
maximum of num_stopwords in test 29


In [55]:
# Number of punctuations in the text

X_train["num_punctuations"] =X_train['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
X_test["num_punctuations"] =X_test['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
# print(train.head())
# print(test.head())
print('maximum of num_punctuations in train',X_train["num_punctuations"].max())
print("maximum of num_punctuations in test",X_test["num_punctuations"].max())

maximum of num_punctuations in train 39
maximum of num_punctuations in test 25


In [56]:
# step 1: Change all the text to lower case. 

# This is required as python interprets 'quora' and 'QUORA' differently

X_train['question_text'] = [entry.lower() for entry in X_train['question_text']]

X_test['question_text'] = [entry.lower() for entry in X_test['question_text']]

X_train.head()

Unnamed: 0,qid,question_text,num_words,num_unique_words,num_stopwords,num_punctuations
74,00033991dd3302d609e2,do web developer refer to w3c standard practice?,8,8,2,1
5538,01143604d4dc4b028344,was life in soviet union much better than in a...,10,9,4,1
3647,00b62782e246c25e06d9,what are some reasons that people behave in an...,14,14,8,1
3195,009f5b16e260b6cae6d8,how can i root my samsung galaxy s7 edge?,9,9,4,1
1911,005f71079134f4bb0342,if you had a city as your boyfriend/girlfriend...,14,14,10,3


In [57]:
# more imports for NLP
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

In [58]:
# step 2 : Tokenization : In this each entry in the corpus will be broken 
#                         into set of words


X_train['question_text']= [word_tokenize(entry) for entry in X_train['question_text']]

X_test['question_text']= [word_tokenize(entry) for entry in X_test['question_text']]

X_test.head()

Unnamed: 0,qid,question_text,num_words,num_unique_words,num_stopwords,num_punctuations
6498,01429e53e4883a27a203,"[which, top, 3, biblical, fallacies, or, inacc...",23,21,10,1
9608,01e0f4e3e8320688ffdc,"[how, do, software, download, sites, like, sof...",19,16,6,2
1281,003f6d2f760c16445cb9,"[would, it, be, possible, to, create, a, bette...",20,20,8,1
1917,005fbb834ab6617b046e,"[what, is, my, zip, code, for, india, ?]",7,7,4,1
14958,02f0dbc16acf93a574f5,"[where, is, the, book, 'harry, and, the, wrink...",9,8,5,3


In [59]:
# step 3, 4 and 5
# Remove Stop words and Numeric data 
# and perfom Word Stemming/Lemmenting.

# WordNetLemmatizer requires Pos tags to understand if the word is noun or verb
# or adjective etc. By default it is set to Noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
# the tag_map would map any tag to 'N' (Noun) except
# Adjective to J, Verb -> v, Adverb -> R
# that means if you get a Pronoun then it would still be mapped to Noun


for index,entry in enumerate(X_train['question_text']):
    # Declaring Empty List to store the words that follow the rules for this step
    Final_words = []
    
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    
    # pos_tag function below will provide the 'tag' 
    # i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Below condition is to check for Stop words and consider only 
        # alphabets
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
            
    # The final processed set of words for each iteration will be stored 
    # in 'question_text_final'
    X_train.loc[index,'question_text_final'] = str(Final_words)  
    
print(X_train.head())

                       qid                                      question_text  \
74    00033991dd3302d609e2  [do, web, developer, refer, to, w3c, standard,...   
5538  01143604d4dc4b028344  [was, life, in, soviet, union, much, better, t...   
3647  00b62782e246c25e06d9  [what, are, some, reasons, that, people, behav...   
3195  009f5b16e260b6cae6d8  [how, can, i, root, my, samsung, galaxy, s7, e...   
1911  005f71079134f4bb0342  [if, you, had, a, city, as, your, boyfriend/gi...   

      num_words  num_unique_words  num_stopwords  num_punctuations  \
74          8.0               8.0            2.0               1.0   
5538       10.0               9.0            4.0               1.0   
3647       14.0              14.0            8.0               1.0   
3195        9.0               9.0            4.0               1.0   
1911       14.0              14.0           10.0               3.0   

                                    question_text_final  
74    ['suppose', 'write', 'statem

In [60]:
# step 3, 4 and 5
# Remove Stop words and Numeric data 
# and perfom Word Stemming/Lemmenting.

# WordNetLemmatizer requires Pos tags to understand if the word is noun or verb
# or adjective etc. By default it is set to Noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
# the tag_map would map any tag to 'N' (Noun) except
# Adjective to J, Verb -> v, Adverb -> R
# that means if you get a Pronoun then it would still be mapped to Noun


for index,entry in enumerate(X_test['question_text']):
    # Declaring Empty List to store the words that follow the rules for this step
    Final_words_test = []
    
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    
    # pos_tag function below will provide the 'tag' 
    # i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Below condition is to check for Stop words and consider only 
        # alphabets
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words_test.append(word_Final)
            
    # The final processed set of words for each iteration will be stored 
    # in 'question_text_final'
    X_test.loc[index,'question_text_final'] = str(Final_words_test)    

print(X_test.head())

                        qid  \
6498   01429e53e4883a27a203   
9608   01e0f4e3e8320688ffdc   
1281   003f6d2f760c16445cb9   
1917   005fbb834ab6617b046e   
14958  02f0dbc16acf93a574f5   

                                           question_text  num_words  \
6498   [which, top, 3, biblical, fallacies, or, inacc...       23.0   
9608   [how, do, software, download, sites, like, sof...       19.0   
1281   [would, it, be, possible, to, create, a, bette...       20.0   
1917            [what, is, my, zip, code, for, india, ?]        7.0   
14958  [where, is, the, book, 'harry, and, the, wrink...        9.0   

       num_unique_words  num_stopwords  num_punctuations  \
6498               21.0           10.0               1.0   
9608               16.0            6.0               2.0   
1281               20.0            8.0               1.0   
1917                7.0            4.0               1.0   
14958               8.0            5.0               3.0   

                         

In [66]:
Tfidf_vect = TfidfVectorizer()
Tfidf_vect.fit(X_train['question_text_final'])


Train_X_Tfidf = Tfidf_vect.transform(X_train['question_text_final'])

# Test_X_Tfidf = Tfidf_vect.transform(X_test['question_text_final'])

AttributeError: 'float' object has no attribute 'lower'

In [62]:
# You can use the below syntax to see the vocabulary that 
# it has learned from the corpus
print(Tfidf_vect.vocabulary_)

AttributeError: 'TfidfVectorizer' object has no attribute 'vocabulary_'

In [140]:
print(Train_X_Tfidf)

# Output: 
# 1: Row number of ‘Train_X_Tfidf’, 0 is the sentence number
# 2: Unique Integer number of each word, 2982 is where the word is in the corpus
# 3: Score calculated by TF-IDF Vectorizer

  (0, 11939)	0.30069751448370113
  (0, 10855)	0.4848705185236342
  (0, 10708)	0.4968836778542565
  (0, 9021)	0.534199959102793
  (0, 9018)	0.3771188134083847
  (1, 14862)	0.18744864502808342
  (1, 12147)	0.3383931780682121
  (1, 10005)	0.19449855559871665
  (1, 4393)	0.3799035394097826
  (1, 3935)	0.3014794820607904
  (1, 198)	0.7598070788195652
  (2, 14321)	0.6816644513132366
  (2, 13539)	0.20030929495367858
  (2, 12542)	0.2773844978302303
  (2, 5477)	0.37839876312335247
  (2, 249)	0.5244825817900651
  (3, 14500)	0.4297809332572122
  (3, 14194)	0.19803570932994952
  (3, 9631)	0.44809094845120234
  (3, 8035)	0.44809094845120234
  (3, 6011)	0.4167897707322595
  (3, 5761)	0.44809094845120234
  (4, 13911)	0.3970738498195995
  (4, 8826)	0.39013726524235975
  (4, 8760)	0.44651000583578443
  :	:
  (14995, 13376)	0.3571061914887556
  (14995, 10547)	0.5614028585252218
  (14995, 7601)	0.30713012393983075
  (14995, 6881)	0.5614028585252218
  (14995, 1643)	0.30873299285549494
  (14995, 1384)	0.22

In [141]:
print(Test_X_Tfidf)

  (0, 14817)	0.246423013630397
  (0, 14619)	0.40015255554868456
  (0, 11619)	0.38176567573815745
  (0, 10429)	0.2868424453874653
  (0, 8181)	0.22460278249406615
  (0, 7823)	0.33921737570028404
  (0, 5497)	0.17354084561856511
  (0, 1506)	0.3412331575750045
  (0, 1299)	0.22976716938140146
  (0, 825)	0.43158513419750283
  (1, 14539)	0.264860351980696
  (1, 13045)	0.27485958772416885
  (1, 11389)	0.49704798960652974
  (1, 4429)	0.4081235424725203
  (1, 2707)	0.2967087287651224
  (1, 2666)	0.3969771825747993
  (1, 713)	0.44164773052978473
  (2, 11051)	0.4994411873004137
  (2, 9355)	0.7825191291141893
  (2, 7752)	0.37178261524488176
  (3, 4474)	1.0
  (4, 11051)	0.40283577553718425
  (4, 10005)	0.29484496439614455
  (4, 9316)	0.582583977298527
  (4, 8085)	0.30956499115747393
  :	:
  (2996, 8085)	0.18309714418289233
  (2996, 7594)	0.3202153218227155
  (2996, 7198)	0.3489456740828173
  (2996, 3388)	0.41128838375225674
  (2996, 2043)	0.3538276792320532
  (2996, 1246)	0.3733089239236176
  (2997, 

Data Pre-processing is over !!
---

Use ML Algorithms to Predict the outcome
---

In [142]:
# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()
train_Y = train["target"]

Naive.fit(Train_X_Tfidf,train_Y)

# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)

print(predictions_NB)

[0 0 0 ... 0 0 0]


In [144]:
from collections import Counter
Counter(predictions_NB)

Counter({0: 3000})

In [145]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')

SVM.fit(Train_X_Tfidf,train['target'])

# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)

print(predictions_SVM)
Counter(predictions_SVM)

[0 0 0 ... 0 0 0]


Counter({0: 2946, 1: 54})