## Load Libraries

In [None]:
import os
import time
import numpy as np
import pandas as pd 
from collections import Counter

from wordcloud import WordCloud
import matplotlib.pyplot as plt

## Load Data

In [None]:
train_df = pd.read_csv("../input/train.csv")
train_df.shape

In [None]:
train_df.head(10)

In [None]:
target_count = train_df.target.value_counts()
print('Class 0:', target_count[0])
print('Class 1:', target_count[1])
print('Proportion:', round(target_count[0] / target_count[1], 2), ': 1')

target_count.plot(kind='bar', title='Count (target)');

Our sample is heavily imbalanced: class 0 far outnumbers class 1.
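
To put a number on the imbalance, the class shares can be computed with `value_counts(normalize=True)`. The sketch below uses a small made-up `toy_target` series (hypothetical data, not the competition labels) so that it runs on its own; on the real data one would presumably pass `train_df.target` instead.

```python
import pandas as pd

# Hypothetical toy labels: 9 questions of class 0 and 1 question of class 1.
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Share of each class as a fraction of all rows.
class_share = toy_target.value_counts(normalize=True)

# Ratio of majority-class size to minority-class size.
counts = toy_target.value_counts()
imbalance_ratio = counts[0] / counts[1]

print(class_share)
print("Imbalance ratio:", imbalance_ratio)
```

With a ratio this skewed, a model that always predicts class 0 would still score high on plain accuracy, so the class shares are worth keeping in mind when reading any later metric.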

## Tokenize

We will drop stop words and tokenize text.

**Stop words** usually refers to the most common words in a language, such as ‘the’, ‘is’, and ‘are’. They can be filtered out of the text before it is processed. There is no universal list of stop words in NLP research, but the NLTK module ships with one.

**Tokenization** is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing.

In [None]:
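import nltk
from nltk.corpus import stopwords

# The NLTK tokenizer and stop-word resources must be downloaded beforehand (nltk.download).
# Build the stop-word set once instead of on every call; set lookups are fast.
stop_words = set(stopwords.words('english'))

def tokenizer(file_text):
    """Tokenize the text and drop English stop words."""
    tokens = nltk.word_tokenize(file_text)
    # Compare in lowercase so capitalized stop words ('The', 'Is') are dropped too.
    tokens = [t for t in tokens if t.lower() not in stop_words]
    return ' '.join(tokens)

train_df.question_text = train_df.question_text.apply(tokenizer)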

train_df.head(10)

## Word Clouds

Build our first word cloud from the question text of both classes together (the code takes at most the last 1,000,000 rows).

In [None]:
text = ' '.join(train_df['question_text'].str.lower().values[-1000000:])
wordcloud = WordCloud(max_font_size=None, background_color='black',
                      width=1200, height=1000).generate(text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud)
plt.title('Top words in question text')
plt.axis("off")
plt.show()

**Conclusion**

This cloud does not provide any useful information or insights.

We need to look at separate clouds for nontoxic and toxic content in search of insight.

### Nontoxic content

In [None]:
train_df[train_df['target']==0].question_text.head(10)

In [None]:
text = ' '.join(train_df[train_df['target']==0].question_text.str.lower().values[-1000000:])
wordcloud = WordCloud(max_font_size=None, background_color='black',
                      width=1200, height=1000).generate(text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud)
plt.title('Top words in nontoxic question text')
plt.axis("off")
plt.show()

**Conclusion**

Like the previous one, this cloud does not provide any useful information or insights.

We need to see the cloud of toxic content and compare them.

### Toxic content

Attention, 18+ content

The text below may be offensive.

![Toxic content](https://upload.wikimedia.org/wikipedia/commons/7/78/RARS_18%2B.svg)

In [None]:
text = ' '.join(train_df[train_df['target']==1].question_text.str.lower().values[-1000000:])
wordcloud = WordCloud(max_font_size=None, background_color='black',
                      width=1200, height=1000).generate(text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud)
plt.title('Top words in toxic question text')
plt.axis("off")
plt.show()

![trump](https://timedotcom.files.wordpress.com/2018/03/donald-trump-snl-baldwin-twitter.jpg)

**Conclusion**

Donald Trump on top of the world

Several words appear prominently in both the toxic and the nontoxic question text, for example: India, People.
We need to work out how to handle this overlap.
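
One way to make the overlap concrete is to count words per class and intersect the vocabularies. The sketch below is only an illustration: `nontoxic_texts`, `toxic_texts`, and `word_counts` are hypothetical names, the sentences are made up, and tokens are split on whitespace rather than with NLTK.

```python
from collections import Counter

# Hypothetical toy questions standing in for the two classes.
nontoxic_texts = ["what is the best food in india", "how do people learn python"]
toxic_texts = ["why are people in india so rude", "why do people hate everything"]

def word_counts(texts):
    """Count lowercase, whitespace-separated tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

nontoxic_counts = word_counts(nontoxic_texts)
toxic_counts = word_counts(toxic_texts)

# Words that occur in both classes, mapped to (nontoxic count, toxic count).
shared = {
    word: (nontoxic_counts[word], toxic_counts[word])
    for word in nontoxic_counts.keys() & toxic_counts.keys()
}
print(sorted(shared.items()))
```

Comparing the two counts for each shared word shows whether it leans toward one class or is equally common in both, which the word clouds alone cannot tell us.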

Beyond that, the toxic cloud is dominated by typical racist, sexist, and political question text.

I was very surprised to find Donald Trump at the top of the toxic questions.
Out of curiosity, I decided to look at what kind of questions these are.

In [None]:
# None disables column truncation (pandas >= 1.0; the old value -1 is deprecated).
pd.set_option('display.max_colwidth', None)

In [None]:
train_df[(train_df['target']==1) & (train_df['question_text'].str.contains("Trump"))].head(10)

Now let's look at nontoxic questions about Donald Trump:

In [None]:
train_df[(train_df['target']==0) & (train_df['question_text'].str.contains("Trump"))].head(10)

I am not a native English speaker, but I think that many questions labeled target = 1 should be labeled 0, and many labeled target = 0 should be labeled 1.

Yes, I know that the competition description says:
>  The training data includes the question that was asked, and whether it was identified as insincere (target = 1). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.

I think this label noise can seriously hurt the model's ability to generalize.

## TODO
1. Try n-grams
1. Try Word2Vec
1. Try seq2seq
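
As a starting point for the first item, n-grams can be counted with plain Python. This is a minimal sketch under stated assumptions: the `ngrams` helper and the sample sentence are made up for illustration and are not part of the notebook's pipeline.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, split on whitespace.
tokens = "why do people ask why do people care".split()

# Count how often each bigram (n = 2) occurs.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(2))
```

The same helper could be applied per class to the tokenized `question_text` to see which phrases, rather than single words, separate toxic from nontoxic questions.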