# DQI Random Forest Classification Exercise for Binary Data

This notebook uses the binary participation score (which can only be 0 or 1) to train a random forest classification model. The labelled data is split into a training set (80%) and a testing set (20%). In the first pass (April 21, 2017), the model correctly predicted the participation score for 41 of the 46 test comments, a success rate of roughly 90%.

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import nltk
import re
from nltk.corpus import stopwords
import matplotlib.pyplot as plt

data_df = pd.read_csv('combined_scored.csv')


# Preparing the data
The first step in random forest classification is to split the labelled data into two subsets: a training set and a testing set. The training set is used to build the model, and the testing set is used to evaluate the accuracy of its predictions.

The comments are then cleaned by replacing all non-alphabetic characters with spaces, lowercasing the text, and removing stopwords. In this case, the stopword list is imported from the Python NLTK library.
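To make the cleaning step concrete, the sketch below applies the same regex-and-filter logic to a made-up comment. The function name `clean_comment`, the sample sentence, and the tiny `TOY_STOPWORDS` set are assumptions for illustration only; the set stands in for the NLTK list so that the snippet runs without downloading the corpus.

```python
import re

# Hypothetical, hand-written stopword set standing in for stopwords.words("english")
TOY_STOPWORDS = {"i", "don", "t", "with", "the", "but"}

def clean_comment(raw_comment, stops=TOY_STOPWORDS):
    # Replace every non-letter character (digits, punctuation) with a space
    letters_only = re.sub("[^a-zA-Z]", " ", raw_comment)
    # Lowercase and split on whitespace
    words = letters_only.lower().split()
    # Keep only the words that are not stopwords
    return " ".join(w for w in words if w not in stops)

sample = "I don't agree with the 2nd point, but thanks!"
print(clean_comment(sample))  # agree nd point thanks
```

Note the side effects of stripping non-letters: "don't" is split into "don" and "t", and "2nd" is reduced to "nd".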

In [8]:
data = data_df[['comment', 'participation']]

train, test = train_test_split(data, train_size=0.8, random_state=44)
# Work on explicit copies so that adding a column does not raise SettingWithCopyWarning
train, test = train.copy(), test.copy()

# Build the stopword set once instead of on every call
stops = set(stopwords.words("english"))

def comment_to_words(raw_comment):
    # Replace non-letter characters with spaces, lowercase, and drop stopwords
    letters_only = re.sub("[^a-zA-Z]", " ", raw_comment)
    words = letters_only.lower().split()
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

train["cleaned_comment"] = train["comment"].apply(comment_to_words)
clean_train_comments = train["cleaned_comment"].tolist()


# Creating word features
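Once the stopwords are removed, the remaining words are transformed into bag-of-words features. In the bag-of-words method, each comment is represented by how many times each vocabulary word occurs in it; word order and context are ignored. The vocabulary is capped at the 5000 most frequent words.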
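As a rough illustration of what a bag-of-words matrix looks like, here is a minimal pure-Python sketch. The helper `bag_of_words` and the two sample comments are hypothetical, and the tokenization is a simple whitespace split, so details will differ from `CountVectorizer` (which, for example, ignores single-character tokens by default).

```python
from collections import Counter

def bag_of_words(cleaned_comments, max_features=None):
    # Count every word across all comments to find the most frequent ones
    totals = Counter(word for c in cleaned_comments for word in c.split())
    kept = [word for word, _ in totals.most_common(max_features)]
    # Use an alphabetically sorted vocabulary as the column order
    vocabulary = sorted(kept)
    # One row per comment, one column per vocabulary word
    rows = []
    for c in cleaned_comments:
        counts = Counter(c.split())
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

comments = ["great idea great plan", "bad idea"]
vocab, matrix = bag_of_words(comments)
print(vocab)   # ['bad', 'great', 'idea', 'plan']
print(matrix)  # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Each row has the same length regardless of how long the comment is, which is what lets the classifier treat every comment as a fixed-size feature vector.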


In [12]:
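# Cap the vocabulary at the 5000 most frequent words
vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None,
                             max_features=5000)

# Learn the vocabulary and convert each comment into a vector of word counts
train_data_features = vectorizer.fit_transform(clean_train_comments)
train_data_features = train_data_features.toarray()

print('(Number of comments, number of words/features)')
print(train_data_features.shape)

# Vocabulary and the total count of each word across the training comments
# (newer scikit-learn releases replace this call with get_feature_names_out())
vocab = vectorizer.get_feature_names()
dist = np.sum(train_data_features, axis=0)

#for tag, count in zip(vocab, dist):
#    print(count, tag)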

(Number of comments, number of words/features)
(182, 2026)


# Create the random forest classifier
The Random Forest Classifier is then created. A random forest is an ensemble of decision trees: each tree is trained on a bootstrap sample of the training data and considers only a random subset of the features at each split. Individual decision trees tend to grow very deep and overfit the data. A random forest reduces this by combining the predictions of all its trees.

In this case the classifier is created with 100 decision trees and trained on the training data set.
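The aggregation step can be sketched in a few lines. The example below assumes each tree reports class probabilities and the forest picks the class with the highest mean probability, which is how scikit-learn's `RandomForestClassifier` is documented to combine its trees. The helper `forest_predict` and the five sets of per-tree probabilities are made up for illustration.

```python
# Hypothetical outputs [P(class 0), P(class 1)] from five trees for one comment
tree_probabilities = [
    [0.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.25, 0.75],
    [1.0, 0.0],
]

def forest_predict(per_tree_probs):
    n_trees = len(per_tree_probs)
    n_classes = len(per_tree_probs[0])
    # Average the probability of each class over all trees
    mean_probs = [sum(p[c] for p in per_tree_probs) / float(n_trees)
                  for c in range(n_classes)]
    # The forest predicts the class with the highest mean probability
    predicted_class = max(range(n_classes), key=lambda c: mean_probs[c])
    return predicted_class, mean_probs

predicted, mean_probs = forest_predict(tree_probabilities)
print(predicted, mean_probs)
```

Here two trees vote confidently for class 0, but the averaged probability still favours class 1, so the disagreement of individual trees is smoothed out.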

In [13]:
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data_features, train["participation"])

# Run the trained random forest

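Once the random forest is trained, the test data is cleaned and vectorized in the same way as the training data, reusing the vocabulary fitted on the training set, and is then used to evaluate the classifier. The output is a confusion matrix followed by a table of actual versus predicted participation scores.

The confusion matrix shows that 41 of the 46 test comments were predicted correctly, a success rate of roughly 90% in this first attempt. Note that the classifier predicted 1 for every test comment, so all five comments with an actual score of 0 were misclassified.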
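The arithmetic behind that figure can be checked directly from the counts in the confusion matrix printed below (5 and 41, with no comments predicted as 0). This is a small sketch; the dictionary layout and variable names are assumptions chosen for readability.

```python
# Counts keyed by (actual, predicted), taken from the printed confusion matrix.
# The crosstab has no "Predicted 0" column, so those cells are zero.
confusion = {
    (0, 0): 0,
    (0, 1): 5,
    (1, 0): 0,
    (1, 1): 41,
}

total = sum(confusion.values())

# Accuracy: share of test comments whose prediction matches the actual score
accuracy = (confusion[(0, 0)] + confusion[(1, 1)]) / float(total)

# Baseline: accuracy of always predicting the most common actual class
actual_0 = confusion[(0, 0)] + confusion[(0, 1)]
actual_1 = confusion[(1, 0)] + confusion[(1, 1)]
majority_baseline = max(actual_0, actual_1) / float(total)

# Recall for class 0: share of actual 0 comments that were predicted as 0
recall_class_0 = confusion[(0, 0)] / float(actual_0)

print(total, round(accuracy, 3), round(majority_baseline, 3), recall_class_0)
# 46 0.891 0.891 0.0
```

Because accuracy equals the majority-class baseline here, the score alone does not show that the model has learned to distinguish the two classes; per-class recall is a useful companion measure.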

In [6]:
# Run trained Random Forest on test data set
test = test.copy()  # explicit copy avoids SettingWithCopyWarning when adding a column
test["cleaned_comment"] = test["comment"].apply(comment_to_words)
clean_test_comments = test["cleaned_comment"].tolist()

# Reuse the vocabulary learned from the training set (transform, not fit_transform)
test_data_features = vectorizer.transform(clean_test_comments)
test_data_features = test_data_features.toarray()

result = forest.predict(test_data_features)

output = pd.DataFrame(data={"actual_participation": test["participation"], "predicted_participation": result})

# Create confusion matrix
df_confusion = pd.crosstab(output['actual_participation'], output['predicted_participation'],
                           rownames=['Actual'], colnames=['Predicted'])

print(df_confusion)
print(output)

Predicted   1
Actual       
0           5
1          41
     actual_participation  predicted_participation
221                     1                        1
186                     1                        1
35                      1                        1
106                     1                        1
110                     1                        1
30                      1                        1
137                     1                        1
105                     1                        1
58                      1                        1
208                     1                        1
203                     1                        1
82                      1                        1
210                     1                        1
135                     1                        1
178                     0                        1
77                      1                        1
43                      1                        1
174                     1 

