# Natural Language Processing with Kaggle Quora Classification Competition

## Load Data

We store the data in AWS S3 so that we can use SageMaker to run our machine learning models.
Link: https://www.kaggle.com/c/quora-insincere-questions-classification/data

In [1]:
import boto3
import pandas as pd
from sagemaker import get_execution_role

# files: embeddings.zip, sample_submission.csv.zip, test.csv.zip, train.csv.zip

role = get_execution_role()
bucket = 'quora-kaggle'
s3_uri = 's3://{}/{}'  # filled in with the bucket name and the file key

# pandas infers zip compression from the .zip extension
df_train = pd.read_csv(s3_uri.format(bucket, 'train.csv.zip'))
df_test = pd.read_csv(s3_uri.format(bucket, 'test.csv.zip'))
df_sample = pd.read_csv(s3_uri.format(bucket, 'sample_submission.csv.zip'))

In [2]:
df_train.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


In [3]:
df_test.head()

Unnamed: 0,qid,question_text
0,00014894849d00ba98a9,My voice range is A2-C5. My chest voice goes u...
1,000156468431f09b3cae,How much does a tutor earn in Bangalore?
2,000227734433360e1aae,What are the best made pocket knives under $20...
3,0005e06fbe3045bd2a92,Why would they add a hypothetical scenario tha...
4,00068a0f7f41f50fc399,What is the dresscode for Techmahindra freshers?


In [4]:
df_sample.head()

Unnamed: 0,qid,prediction
0,00014894849d00ba98a9,0
1,000156468431f09b3cae,0
2,000227734433360e1aae,0
3,0005e06fbe3045bd2a92,0
4,00068a0f7f41f50fc399,0


## Import NLP Libraries

In [5]:
import sys
import nltk
import sklearn
import pandas
import numpy as np

## Preprocess Data

In [6]:
# separate the data into labels (Y) and question texts (text_messages)
Y = df_train['target']
#Y
text_messages = df_train['question_text']
# text_messages

In [8]:
# use regular expressions to replace email addresses, URLs, phone numbers, other numbers
# link: http://regexlib.com/?AspxAutoDetectCookieSupport=1
# regex=True is passed explicitly: since pandas 2.0, str.replace treats the pattern as a literal string by default

# Replace email addresses with 'emailaddress'
processed = text_messages.str.replace(r'^.+@[^\.].*\.[a-z]{2,}$',
                                 'emailaddress', regex=True)

# Replace URLs with 'webaddress'
processed = processed.str.replace(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$',
                                  'webaddress', regex=True)

# Replace money symbols with 'moneysymb' (£ can be typed with ALT key + 156)
processed = processed.str.replace(r'£|\$', 'moneysymb', regex=True)

# Replace 10-digit phone numbers (formats include parentheses, spaces, no spaces, dashes) with 'phonenumbr'
processed = processed.str.replace(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$',
                                  'phonenumbr', regex=True)

# Replace numbers with 'numbr'
processed = processed.str.replace(r'\d+(\.\d+)?', 'numbr', regex=True)

# note: the placeholders end in 'r' instead of 'er' so that the later stemming step does not alter them
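
The email, URL, and phone patterns above are anchored with `^` and `$`, so they only fire when the entire question matches. The sketch below uses the standard `re` module to show the effect on three made-up strings. The unanchored pattern `INLINE_EMAIL` is an assumption added for illustration and is not part of the original notebook.

```python
import re

# Anchored pattern copied from the cell above: ^...$ must span the whole string
ANCHORED_EMAIL = re.compile(r'^.+@[^\.].*\.[a-z]{2,}$')
# Hypothetical unanchored variant (an assumption, not from the notebook)
INLINE_EMAIL = re.compile(r'\S+@\S+\.[a-z]{2,}')

only_email = 'someone@example.com'
in_sentence = 'can i reach support at someone@example.com today'
at_end = 'contact me at someone@example.com'

# The whole string is an address, so it is replaced
anchored_whole = ANCHORED_EMAIL.sub('emailaddress', only_email)
# The address sits mid-sentence, so the anchored pattern does not match
anchored_inline = ANCHORED_EMAIL.sub('emailaddress', in_sentence)
# The string ends with an address, so the entire question is replaced
anchored_tail = ANCHORED_EMAIL.sub('emailaddress', at_end)
# The unanchored variant replaces only the address itself
unanchored_inline = INLINE_EMAIL.sub('emailaddress', in_sentence)

print(anchored_whole)
print(anchored_inline)
print(anchored_tail)
print(unanchored_inline)
```

Whether to drop the anchors depends on the data. For free-form questions, an unanchored pattern is more likely to catch embedded addresses without discarding the surrounding words.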

In [9]:
# Remove punctuation
processed = processed.str.replace(r'[^\w\d\s]', ' ', regex=True)

# Replace whitespace between terms with a single space
processed = processed.str.replace(r'\s+', ' ', regex=True)

# Remove leading and trailing whitespace
processed = processed.str.replace(r'^\s+|\s+?$', '', regex=True)

In [10]:
# change words to lower case - Hello, HELLO, hello are all the same word
processed = processed.str.lower()
print(processed)

0          how did quebec nationalists see their province...
1          do you have an adopted dog how would you encou...
2          why does velocity affect time does velocity af...
3          how did otto von guericke used the magdeburg h...
4          can i convert montra helicon d to a mountain b...
5          is gaza slowly becoming auschwitz dachau or tr...
6          why does quora automatically ban conservative ...
7          is it crazy if i wash or wipe my groceries off...
8          is there such a thing as dressing moderately a...
9          is it just me or have you ever been in this ph...
10                           what can you say about feminism
11                       how were the calgary flames founded
12         what is the dumbest yet possibly true explanat...
13         can we use our external hard disk as a os as w...
14         i am numbr living at home and have no boyfrien...
15         what do you know about bram fischer and the ri...
16         how difficult
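
The punctuation, whitespace, and lower-casing steps can also be checked on a single string before running them over the whole column. This is a minimal sketch using the standard `re` module. The helper name `normalize` is hypothetical, and `str.strip()` stands in for the leading/trailing whitespace regex.

```python
import re

def normalize(text):
    """Apply the punctuation, whitespace, and lower-casing steps to one string."""
    text = re.sub(r'[^\w\d\s]', ' ', text)  # punctuation -> space
    text = re.sub(r'\s+', ' ', text)        # collapse runs of whitespace
    text = text.strip()                     # drop leading/trailing whitespace
    return text.lower()

example = normalize('  Why does velocity affect time?  Does it, really?! ')
print(example)  # why does velocity affect time does it really
```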

In [11]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# remove stop words from the questions

stop_words = set(stopwords.words('english'))

processed = processed.apply(lambda x: ' '.join(
    term for term in x.split() if term not in stop_words))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [12]:
# Reduce each word to its stem using a Porter stemmer
ps = nltk.PorterStemmer()

processed = processed.apply(lambda x: ' '.join(
    ps.stem(term) for term in x.split()))

## Generating Features

In [13]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')
# create bag-of-words
all_words = []

for message in processed:
    words = word_tokenize(message)
    for w in words:
        all_words.append(w)
        
all_words = nltk.FreqDist(all_words)

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [14]:
# print the number of distinct words and the 15 most common words
print('Number of words: {}'.format(len(all_words)))
print('Most common words: {}'.format(all_words.most_common(15)))

Number of words: 142847
Most common words: [('numbr', 174365), ('get', 73704), ('best', 62487), ('would', 61527), ('peopl', 56781), ('like', 54179), ('use', 49560), ('good', 39110), ('make', 38316), ('one', 36471), ('india', 32853), ('year', 30064), ('think', 29340), ('differ', 29166), ('time', 28358)]


In [15]:
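# use 1500 words as features
# note: keys() is in insertion order, so this takes the first 1500 distinct words
# encountered, not the 1500 most frequent; all_words.most_common(1500) would rank by frequency
word_features = list(all_words.keys())[:1500]

# THIS CAN BE CHANGED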
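
NLTK's `FreqDist` subclasses `collections.Counter`, so the difference between slicing `keys()` and calling `most_common()` can be shown with a plain `Counter`. The token list below is made up for illustration.

```python
from collections import Counter

# Toy token stream (made up for illustration)
tokens = ['quebec', 'see', 'numbr', 'get', 'numbr', 'get', 'numbr', 'best']
counts = Counter(tokens)

n = 2
# keys() follows insertion order: the first n distinct tokens seen
first_seen = list(counts.keys())[:n]
# most_common(n) ranks by count: the n most frequent tokens
most_frequent = [word for word, _ in counts.most_common(n)]

print(first_seen)     # ['quebec', 'see']
print(most_frequent)  # ['numbr', 'get']
```

Selecting features by frequency tends to cover more questions per feature, while slicing `keys()` favors whatever words happen to appear in the earliest rows.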

In [16]:
# The find_features function determines which of the 1500 word features are contained in a question
def find_features(message):
    words = set(word_tokenize(message))  # a set makes each membership test O(1)
    features = {}
    for word in word_features:
        features[word] = (word in words)

    return features

# Let's see an example!
features = find_features(processed[0])
for key, value in features.items():
    if value:
        print(key)

quebec
nationalist
see
provinc
nation
numbr


In [17]:
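# Now let's do it for all the questions
# zip() returns a one-shot iterator in Python 3, so materialize it as a list
messages = list(zip(processed, Y))

# define a seed for reproducibility (used by train_test_split below)
seed = 1
# optional manual shuffle; train_test_split already shuffles by default
# np.random.seed(seed)
# np.random.shuffle(messages)

# call find_features function for each question
featuresets = [(find_features(text), label) for (text, label) in messages]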

SyntaxError: invalid syntax (<ipython-input-17-d6b8a06cd89c>, line 1)

In [None]:
# we can split the featuresets into training and testing datasets using sklearn
from sklearn import model_selection

# split the data into training and testing datasets
training, testing = model_selection.train_test_split(featuresets, test_size = 0.25, random_state=seed)

In [None]:
print(len(training))
print(len(testing))

## Scikit-Learn Classifiers with NLTK

In [None]:
# We can use sklearn algorithms in NLTK
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC

model = SklearnClassifier(SVC(kernel = 'linear'))

# train the model on the training data
model.train(training)

# and test on the testing dataset!
accuracy = nltk.classify.accuracy(model, testing)*100
print("SVC Accuracy: {}".format(accuracy))

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Define models to train
names = ["K Nearest Neighbors", "Decision Tree", "Random Forest", "Logistic Regression", "SGD Classifier",
         "Naive Bayes", "SVM Linear"]

classifiers = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    LogisticRegression(),
    SGDClassifier(max_iter = 100),
    MultinomialNB(),
    SVC(kernel = 'linear')
]

models = zip(names, classifiers)

for name, model in models:
    nltk_model = SklearnClassifier(model)
    nltk_model.train(training)
    accuracy = nltk.classify.accuracy(nltk_model, testing)*100
    print("{} Accuracy: {}".format(name, accuracy))

In [None]:
# Ensemble methods - Voting classifier
from sklearn.ensemble import VotingClassifier

names = ["K Nearest Neighbors", "Decision Tree", "Random Forest", "Logistic Regression", "SGD Classifier",
         "Naive Bayes", "SVM Linear"]

classifiers = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    LogisticRegression(),
    SGDClassifier(max_iter = 100),
    MultinomialNB(),
    SVC(kernel = 'linear')
]

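# VotingClassifier expects a list of (name, estimator) tuples, not a zip iterator
models = list(zip(names, classifiers))

nltk_ensemble = SklearnClassifier(VotingClassifier(estimators = models, voting = 'hard', n_jobs = -1))
nltk_ensemble.train(training)
# evaluate the ensemble itself, not the last model from the previous loop
accuracy = nltk.classify.accuracy(nltk_ensemble, testing)*100
print("Voting Classifier: Accuracy: {}".format(accuracy))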
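
With `voting = 'hard'`, each model casts one vote per sample and the majority label wins. The sketch below reproduces that idea in plain Python on made-up predictions. The function `hard_vote` is hypothetical and is not how scikit-learn implements it. Its tie-breaking (first label encountered) is specific to this sketch, which is why the example uses an odd number of models.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the majority label for each sample.

    predictions_per_model: list of equal-length label lists, one per model.
    """
    voted = []
    # zip(*...) regroups the predictions so each tuple holds one sample's votes
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) yields [(label, count)] for the most frequent label
        label, _ = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted

# Made-up predictions from three models on four samples
model_a = [0, 1, 0, 1]
model_b = [0, 1, 1, 0]
model_c = [0, 0, 1, 1]

ensemble = hard_vote([model_a, model_b, model_c])
print(ensemble)  # [0, 1, 1, 1]
```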

In [None]:
# make class label prediction for testing set
txt_features, labels = zip(*testing)

prediction = nltk_ensemble.classify_many(txt_features)

In [None]:
# print a confusion matrix and a classification report
print(classification_report(labels, prediction))

# target 0 = sincere question, target 1 = insincere question
pd.DataFrame(
    confusion_matrix(labels, prediction),
    index = [['actual', 'actual'], ['sincere', 'insincere']],
    columns = [['predicted', 'predicted'], ['sincere', 'insincere']])
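
Accuracy alone can hide poor performance on the positive class when one class dominates, which is why the confusion matrix and classification report are worth reading together. As a rough sketch of what those numbers mean, the hypothetical helper `binary_metrics` below derives the four confusion-matrix cells and precision, recall, and F1 for the positive class from two made-up label lists. It is an illustration, not a replacement for the scikit-learn functions.

```python
def binary_metrics(labels, prediction, positive=1):
    """Count confusion-matrix cells and derive precision, recall, and F1."""
    pairs = list(zip(labels, prediction))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'tn': tn, 'fp': fp, 'fn': fn, 'tp': tp,
            'precision': precision, 'recall': recall, 'f1': f1}

# Made-up labels: 0 = sincere, 1 = insincere
toy_labels = [0, 0, 0, 0, 1, 1]
toy_prediction = [0, 0, 0, 1, 1, 0]

metrics = binary_metrics(toy_labels, toy_prediction)
print(metrics)
```

In this toy case four of six predictions are correct, yet precision and recall for the insincere class are both 0.5.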