# Civil Comments - Classifying Toxicity, Using NLTK Naive Bayes Classifier

## By: ZXS

### Introduction-

The Civil Comments platform, which has since shut down, released its archive of roughly 2 million user comments as a dataset in 2017.

Each comment has been labelled with a toxicity index representing the level of negative sentiment believed to be present in that post. The toxicity index is a continuous variable between 0 and 1, and a comment is classified as toxic when its value is >= 0.5.

Our objective is to see whether a machine learning model can accurately identify this sentiment.

### Procedure-

To conduct a sentiment analysis, we first have to perform some "natural language processing" on our data. There are many ways to do this. For our purposes we will make the following modifications to our data:

* **Removing punctuation.**
* **Tokenizing:** splitting all paragraphs/sentences into lists of their component words.
* **Removing stopwords:** dropping very common words (such as "the" or "and") that carry little sentiment.
* **Stemming:** removing the endings from words, leaving only the root. For example, "running" would become "run".
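
The preprocessing code itself is not part of this notebook (the processed tokens are loaded from a pickle), so the following is only a minimal sketch of the four steps above, not the exact pipeline that was used. The `preprocess` helper and the tiny `STOPWORDS` set are hypothetical names introduced for illustration. The sketch also assumes a plain whitespace split in place of `word_tokenize`, and a hard-coded stopword set in place of `stopwords.words('english')`, so that it runs without downloading any NLTK data.

```python
import string
from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stopword set for illustration only.
# In practice nltk.corpus.stopwords.words('english') would be used,
# which requires a one-time nltk.download('stopwords').
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "and", "of", "to"}

stemmer = PorterStemmer()

def preprocess(comment):
    """Return the stemmed, stopword-free tokens of a single comment."""
    # 1. Remove punctuation
    no_punct = comment.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenize (whitespace split as a stand-in for word_tokenize,
    #    which needs the 'punkt' data to be downloaded first)
    tokens = no_punct.lower().split()
    # 3. Remove stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Stem each remaining token
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The dogs were running!"))
```

With this sketch, the comment "The dogs were running!" is reduced to the two stems `dog` and `run`.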


In [1]:
# Import the required libraries
import os
import numpy as np
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import pickle
import operator
from tqdm import tqdm
import gc
tqdm.pandas()
from sklearn.model_selection import train_test_split
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier



In [2]:
# Set the working directory
wd = '/Users/zxs/Documents/code/kaggle/sentiment/'
os.chdir(wd)

# Load the data
train_df = pd.read_csv('train.csv.zip', compression = 'zip')
test_df = pd.read_csv('test.csv.zip', compression = 'zip')

# Load the pickled data (each with-block closes its file automatically,
# so no explicit f.close() is needed)
with open('sentiment.pickle', mode = 'rb') as f:

    train_text = pickle.load(f)

with open('train_words.pickle', mode = 'rb') as f:

    train_words = pickle.load(f)

In [3]:
# Attach the preprocessed token lists to the training data
train_df['processed_text'] = train_words

# Label each comment: 'positive' means toxic (target >= .5), 'negative' means non-toxic
train_df['toxicity'] = train_df['target'].progress_apply(lambda x: 'positive' if x >= .5 else 'negative')

# Join each token list back into a single space-separated string
train_df['processed_text'] = train_df['processed_text'].progress_apply(lambda x: ' '.join(x))

# Collect [text, label] pairs as a plain list
text = train_df[['processed_text', 'toxicity']].values.tolist()

100%|██████████| 1804874/1804874 [00:01<00:00, 1064537.36it/s]
100%|██████████| 1804874/1804874 [00:03<00:00, 548007.45it/s]


### Feature Extraction-

Now that the comments have been separated into components and processed, we must extract features for each class of toxicity: "positive" (toxic) for comments whose "target" value is at least 0.5, and "negative" (non-toxic) otherwise. The "target" column holds the continuous value between 0 and 1 assigned by the human scoring.

To do this, we will create a simple dictionary for each comment that maps every word present in that comment to the boolean value True, and pair each dictionary with the comment's class label:

In [4]:
# Turn a sequence of tokens into an NLTK featureset: {word: True}
def word_fts(words):

    return {word: True for word in words}

# Tokenize each comment and group the token tuples by class
# (tuples are hashable, so duplicates can later be removed with set())
def split_by_class(rows):

    toxic, nontoxic = [], []

    for comment, label in rows:

        tokens = tuple(word_tokenize(comment))

        if label == 'positive':

            toxic.append(tokens)

        else:

            nontoxic.append(tokens)

    return toxic, nontoxic

# Split the data for training and testing
train, test = train_test_split(text, test_size = .2, random_state = 100)

toxic_train, nontoxic_train = split_by_class(train)
toxic_test, nontoxic_test = split_by_class(test)

# Remove duplicates (training data only)
toxic_train = set(toxic_train)
nontoxic_train = set(nontoxic_train)

# ID features. The test split needs the same conversion: passing the raw
# strings to accuracy() raises
# AttributeError: 'str' object has no attribute 'copy'
toxicft = [(word_fts(tox), 'positive') for tox in toxic_train]
nontoxicft = [(word_fts(nontox), 'negative') for nontox in nontoxic_train]
test = ([(word_fts(tox), 'positive') for tox in toxic_test]
        + [(word_fts(nontox), 'negative') for nontox in nontoxic_test])

# Recombine features
train = toxicft + nontoxicft

# Model
nbc = NaiveBayesClassifier.train(train)

# Evaluate
print('Accuracy:', nltk.classify.util.accuracy(nbc, test))
nbc.show_most_informative_features()
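
To make the featureset format concrete, here is a toy sketch. The training and test comments are invented for illustration and are not taken from the Civil Comments data, and the `toy_` names are hypothetical. `NaiveBayesClassifier` expects every example, at training time and at evaluation time, as a `(dict, label)` pair; handing it a raw string where the dict should be is what produces an `AttributeError` about `copy`.

```python
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

def word_fts(words):
    """Map each token to True, the featureset format NLTK expects."""
    return {word: True for word in words}

# Invented, already-stemmed toy comments (not from the dataset).
# As in the notebook, 'positive' means toxic and 'negative' means non-toxic.
toy_train = [
    (word_fts(["stupid", "idiot"]), "positive"),
    (word_fts(["stupid", "hate"]), "positive"),
    (word_fts(["thank", "great"]), "negative"),
    (word_fts(["great", "articl"]), "negative"),
]

# The evaluation data uses exactly the same (featureset, label) structure
toy_test = [
    (word_fts(["stupid"]), "positive"),
    (word_fts(["great"]), "negative"),
]

toy_nbc = NaiveBayesClassifier.train(toy_train)

predictions = [toy_nbc.classify(fts) for fts, _ in toy_test]
toy_accuracy = accuracy(toy_nbc, toy_test)
print(predictions, toy_accuracy)
```

On this toy data both test comments are classified correctly. The point of the sketch is the data shape, not the score: four hand-picked examples say nothing about how well the approach works on the real comments.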