In [1]:
import os
import sys
import numpy as np 
import pandas as pd
import utilities as ut
import quora_vocab as qv
from importlib import reload
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource
output_notebook()

# Quora Insincere Questions Classification EDA
## Introduction
This notebook is a short exploratory data analysis of the data set distributed as part of the Kaggle competition 'Quora Insincere Questions Classification', whose purpose is to 'Detect toxic content to improve online conversations'.

**Competition Description:**

>An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.
>
>Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.
>
>In this competition, Kagglers will develop models that identify and flag insincere questions. To date, Quora has employed both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content.
>
>Here's your chance to combat online trolls at scale. Help Quora uphold their policy of “Be Nice, Be Respectful” and continue to be a place for sharing and growing the world’s knowledge.

### Data Loading

In [2]:
DATA_PATH = '~/google_drive/data/quora/'
DATA_FILE = '{}{}'.format(DATA_PATH,'train.csv')
data = pd.read_csv(DATA_FILE)
data.head()

|   | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |


### Tokenizing Each Comment

In [3]:
data_pos_tokenized = [[vec[0],ut.canon_token_sentence(vec[1]),vec[2]] 
                      for vec in data.to_numpy() if vec[2] == 1]  
data_neg_tokenized = [[vec[0],ut.canon_token_sentence(vec[1]),vec[2]] 
                      for vec in data.to_numpy() if vec[2] == 0]   

### Tokenized Inappropriate Question Example

In [4]:
' '.join(data_pos_tokenized[103][1])

'what do democrats think of the fact that i am homophobic and was born this way will this make you rethink your rhetoric'

### Tokenized Appropriate Question Example

In [5]:
' '.join(data_neg_tokenized[11][1])

'how were the calgary flames founded'
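
The two tokenized examples above are lowercase and free of punctuation. The source of `ut.canon_token_sentence` is not shown in this notebook, so the following is only a minimal sketch of a tokenizer with that visible behaviour. The name `canon_token_sentence_sketch` and the regular expression are assumptions, and the real helper may do more (for example, expand contractions).

```python
import re


def canon_token_sentence_sketch(text):
    """Hypothetical stand-in for ut.canon_token_sentence (assumed behaviour only)."""
    lowered = text.lower()
    # Replace every character that is not a lowercase letter, digit, or
    # whitespace with a space, so punctuation never glues tokens together.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # split() with no arguments collapses runs of whitespace.
    return cleaned.split()


tokens = canon_token_sentence_sketch("How were the Calgary Flames founded?")
print(" ".join(tokens))  # how were the calgary flames founded
```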

### Class Construction

In [7]:
reload(qv)
comments = qv.CommentVocab(data_pos_tokenized+data_neg_tokenized)

Processing Comments: 100%|██████████| 1306122/1306122 [00:37<00:00, 34466.61it/s]


### Comparison of Question Length by Class

In [10]:
pos_lengths = [ len(vec[1]) for vec in data_pos_tokenized]
neg_lengths = [ len(vec[1]) for vec in data_neg_tokenized]
print('{}\n{}\n'.format('----Length of Positive Examples-----',pd.Series(pos_lengths).describe()))
print('{}\n{}'.format('----Length of Negative Examples-----',pd.Series(neg_lengths).describe()))

----Length of Positive Examples-----
count    80810.000000
mean        17.425282
std          9.641367
min          1.000000
25%         10.000000
50%         15.000000
75%         23.000000
max         83.000000
dtype: float64

----Length of Negative Examples-----
count    1.225312e+06
mean     1.260752e+01
std      6.811622e+00
min      2.000000e+00
25%      8.000000e+00
50%      1.100000e+01
75%      1.500000e+01
max      1.320000e+02
dtype: float64


In [11]:
comments.comment_length_graph()

#### Comments

* The summary statistics and the graph show that the length distribution of inappropriate questions has a fatter tail than that of appropriate questions (a mean of about 17.4 tokens versus 12.6, and a 75th percentile of 23 versus 15), so inappropriate questions tend to be somewhat longer. This matches what one might expect from the description of the inappropriate class, which includes questions that are really statements of some position the user holds, or just generalized trolling.

* On the other hand, one can interpret the relative shortness of appropriate questions as reflecting more concise and clearly stated questions.
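
One way to put a number on the "fatter tail" is to compare the share of questions in each class that exceed a fixed token count. The sketch below uses only the standard library and made-up toy lengths. The helper `tail_share` and the threshold of 20 are illustrative assumptions, not part of `quora_vocab`.

```python
from statistics import mean, median


def tail_share(lengths, threshold):
    """Fraction of questions with strictly more than `threshold` tokens."""
    return sum(1 for n in lengths if n > threshold) / len(lengths)


# Toy token counts, invented for illustration (not taken from the data set).
toy_pos_lengths = [8, 10, 25, 31]
toy_neg_lengths = [6, 3, 11, 12]

for name, lengths in [("positive", toy_pos_lengths), ("negative", toy_neg_lengths)]:
    print(name, mean(lengths), median(lengths), tail_share(lengths, 20))
```

On the real data the same function could be applied to `pos_lengths` and `neg_lengths` from the cell above.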

### Comparison of Unigram Word Frequency By Class

In [12]:
comments.word_frequency_graphs(min_rank=3,max_rank=50)

#### Comments

* As one would expect, the highest-ranked unigrams in each of the classes are mostly short determiner words. However, one interesting comparison is the relatively higher ranking (in terms of counts) of the words 'people', 'trump', 'women', and 'men' in the inappropriate class.

In [13]:
comments.word_count_difference_graph(gram_length=1,num=20)

#### Comments

* One thing the last graph shows, as one should expect, is a higher frequency in the inappropriate class of words that are either polarizing, e.g. 'trump', or name groups that are often subject to claims of supremacy or inferiority depending on the prejudices of the asker, e.g. 'white', 'men', 'women', 'muslims', 'black'.

* One interesting, unexpected result is the difference in the frequency of the interrogative words used in the two classes: 'what', 'how', and 'which' are more frequent in the appropriate class, while the interrogative word 'why' is far more likely in the inappropriate class.

* Perhaps the increased use of 'why' in the inappropriate class is due to the ease with which one can disguise a statement of a dubious, non-factual nature as a *why* question. For example, the question 'why are aliens managing my local dairy queen?' presents the premise, that aliens are managing a dairy queen somewhere, as fact and implicitly requires the reader to accept the premise in order to respond directly to it. Note that questions like 'which of my local dairy queens is managed by aliens', 'how can my local dairy queen be managed by aliens', and 'does/can my local dairy queen be managed by aliens' do not require the answerer to accept the premise in order to respond, and instead invite the answerer to respond to the premise directly.
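
The comparison behind a word-difference graph can be sketched with `collections.Counter`. The internals of `comments.word_count_difference_graph` are not shown here, so this is only an assumed reconstruction: it compares relative frequencies (count divided by the class's total tokens) rather than raw counts, because the two classes differ greatly in size. The function names and toy questions are hypothetical.

```python
from collections import Counter


def relative_frequencies(token_lists):
    """Map each token to its share of all tokens in the given questions."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def frequency_differences(pos_token_lists, neg_token_lists, num=20):
    """Top `num` tokens by absolute difference in relative frequency (pos minus neg)."""
    pos = relative_frequencies(pos_token_lists)
    neg = relative_frequencies(neg_token_lists)
    diffs = {tok: pos.get(tok, 0.0) - neg.get(tok, 0.0) for tok in set(pos) | set(neg)}
    # Sort by descending magnitude, breaking ties alphabetically for stable output.
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:num]


# Toy questions, invented for illustration.
toy_pos = [["why", "are", "aliens", "here"], ["why", "do", "men", "lie"]]
toy_neg = [["how", "do", "planes", "fly"], ["how", "is", "velocity", "measured"]]

print(frequency_differences(toy_pos, toy_neg, num=2))
```

A positive difference marks a token that is relatively more common in the inappropriate class, a negative one a token that is more common in the appropriate class.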

### Comparison of Bigram Word Frequency By Class

In [17]:
comments.word_frequency_graphs(gram_length=2,min_rank=3,max_rank=50)

#### Comments

* Many of the highest-ranked bigrams in the appropriate class appear, on their face, to be elicitations of instructions for bringing about some result, e.g. 'how_do', 'can_i', 'how_can'.

* It is interesting that 'in_india' shows up in the graph for both classes and contains the only proper noun, 'india', present in the appropriate graph.


In [18]:
comments.word_count_difference_graph(gram_length=2,num=20)

#### Bigram Frequency Comments

* There are a few notable things here. First, we can see that 'what', 'can', 'how', and 'is_there' type questions are the most prevalent in the appropriate class, while 'why' and 'when' questions are more prevalent in the inappropriate class.

* This result reinforces the comments made in the unigram section: that there is a fundamental difference in the type of questions being posed in the two classes. 

* One might have expected that the subject or object of a question would be the best differentiator, but these results suggest that the phrasing of the question itself is the most important feature.
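
The bigram labels in the graphs, such as 'how_do' and 'in_india', look like adjacent tokens joined by an underscore. A minimal sketch of that construction follows. The function name `ngrams` is an assumption, and how `CommentVocab` actually builds its n-grams is not shown in this notebook.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of `tokens`, each joined with '_'."""
    # range() is empty when the question has fewer than n tokens,
    # so short questions simply yield no n-grams.
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["how", "can", "i", "learn"]
print(ngrams(tokens, 2))  # ['how_can', 'can_i', 'i_learn']
print(ngrams(tokens, 3))  # ['how_can_i', 'can_i_learn']
```

Setting `n=1` returns the tokens themselves, so the same helper covers the unigram, bigram, and trigram sections.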


### Comparison of Trigram Word Frequency By Class

In [19]:
comments.word_count_difference_graph(gram_length=3,num=20)

#### Trigram Frequency Comments

* Examining trigrams, we find the same trends as in the analysis of unigrams and bigrams.

* In the inappropriate class we find a comparatively high frequency of 'why' interrogative words and phrases, which can be used to phrase a statement of dubious facts as a question that implicitly requires the answerer to assume the premise in order to answer it.
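
If the phrasing of a question matters as much as these graphs suggest, one simple check is the share of inappropriate questions among those that open with a given word. The sketch below assumes rows shaped like `[qid, tokens, target]`, as in `data_pos_tokenized`. The function name and the toy rows are hypothetical, and no results from the real data are implied.

```python
from collections import defaultdict


def positive_rate_by_first_token(rows):
    """Share of target == 1 among questions that start with each first token."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for _qid, tokens, target in rows:
        if not tokens:
            continue  # skip questions that tokenized to nothing
        totals[tokens[0]] += 1
        positives[tokens[0]] += int(target == 1)
    return {tok: positives[tok] / totals[tok] for tok in totals}


# Toy rows, invented for illustration.
toy_rows = [
    ["a", ["why", "are", "aliens", "managing", "my", "local", "dairy", "queen"], 1],
    ["b", ["why", "is", "the", "sky", "blue"], 0],
    ["c", ["how", "do", "planes", "fly"], 0],
    ["d", ["how", "can", "i", "learn", "python"], 0],
]

print(positive_rate_by_first_token(toy_rows))
```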

## Future Extensions

* In the future I would like to use the Stanford Parser to explore whether there are any important differences in the common syntactic structures of the two classes of questions. The hope is that this will lead to the engineering of pertinent features to include in the training of a classifier.
