# ML Project with McDonald's sentiment data

by Georgios Karakostas

## Imaginary problem statement

McDonald's receives **thousands of customer comments** on their website per day, and many of them are negative. Their corporate employees don't have time to read every single comment, but they do want to read a subset of comments that they are most interested in. In particular, the media has recently portrayed their employees as being rude, and so they want to review comments about **rude service**.

McDonald's has hired us to develop a system that ranks each comment by the **likelihood that it is referring to rude service**. They will use our system to build a "rudeness dashboard" for their corporate employees, so that employees can spend a few minutes each day examining the **most relevant recent comments**.

## Description of the data

Before hiring us, McDonald's used the [CrowdFlower platform](http://www.crowdflower.com/data-for-everyone) to pay humans to **hand-annotate** about 1500 comments with the **type of complaint**. The complaint types are listed below, with the encoding used in the data listed in parentheses:

- Bad Food (BadFood)
- Bad Neighborhood (ScaryMcDs)
- Cost (Cost)
- Dirty Location (Filthy)
- Missing Item (MissingFood)
- Problem with Order (OrderProblem)
- Rude Service (RudeService)
- Slow Service (SlowService)
- None of the above (na)

## Task 1

We read **`mcdonalds.csv`** into a pandas DataFrame and examine it. 

- The **policies_violated** column lists the type of complaint. If there is more than one type, the types are separated by newline characters.
- The **policies_violated:confidence** column lists CrowdFlower's confidence in the judgments of its human annotators for that row (higher is better).
- The **city** column is the McDonald's location.
- The **review** column is the actual text comment.
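
Because one row can carry several complaint types, it may help to see how a multi-label cell lines up with its confidence values. The following is a minimal plain-Python sketch, assuming both cells are read as newline-separated strings; `parse_labels` is a hypothetical helper name, and the values are copied from the first row of the data.

```python
def parse_labels(violated, confidence):
    """Pair each complaint type with its confidence score.

    Both arguments are newline-separated strings, as in the
    policies_violated and policies_violated:confidence columns.
    """
    labels = violated.split("\n")
    scores = [float(s) for s in confidence.split("\n")]
    return dict(zip(labels, scores))


# values copied from the first row of the data
pairs = parse_labels("RudeService\nOrderProblem\nFilthy", "1.0\n0.6667\n0.6667")
print(pairs)
```

The first label listed is not necessarily the only relevant one, which is why the later tasks search the whole cell rather than reading a single value.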

In [1]:
# read mcdonalds.csv using a relative path
import pandas as pd
path = '../data/mcdonalds.csv'
mcd = pd.read_csv(path)

In [2]:
# examine the first three rows
mcd.head(3)

| | _unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | policies_violated | policies_violated:confidence | city | policies_violated_gold | review | Unnamed: 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 679455653 | False | finalized | 3 | 2/21/15 0:36 | RudeService\nOrderProblem\nFilthy | 1.0\n0.6667\n0.6667 | Atlanta | | I'm not a huge mcds lover, but I've been to be... | |
| 1 | 679455654 | False | finalized | 3 | 2/21/15 0:27 | RudeService | 1 | Atlanta | | Terrible customer service. I came in at 9:30... | |
| 2 | 679455655 | False | finalized | 3 | 2/21/15 0:26 | SlowService\nOrderProblem | 1.0\n1.0 | Atlanta | | First they "lost" my order, actually they gave... | |


In [3]:
# examine the text of the first review
mcd.loc[0, 'review']

"I'm not a huge mcds lover, but I've been to better ones. This is by far the worst one I've ever been too! It's filthy inside and if you get drive through they completely screw up your order every time! The staff is terribly unfriendly and nobody seems to care."

## Task 2

We remove any rows from the DataFrame in which the **policies_violated** column has a **null value**. We check the shape of the DataFrame before and after to confirm that we only removed about 50 rows.

- **Note:** Null values are also known as "missing values", and are encoded in pandas with the special value "NaN". This is distinct from the "na" encoding used by CrowdFlower to denote "None of the above". Rows that contain "na" should **not** be removed.
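
The difference between a missing value and the literal string "na" can be illustrated without pandas. This is only a sketch of the filtering logic under simplified assumptions: `is_null` is a hypothetical helper and the cell values are made up for illustration.

```python
import math


def is_null(value):
    """Return True for missing values (None or float NaN), False otherwise."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# hypothetical cell values: two real labels, one "na" label, one missing value
cells = ["RudeService", "na", float("nan"), "SlowService\nOrderProblem"]

# keep everything that is not missing; the string "na" survives the filter
kept = [c for c in cells if not is_null(c)]
print(kept)
```

Only the NaN entry is dropped, which mirrors what the null filter on **policies_violated** is meant to do.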

In [4]:
# examine the shape before removing any rows
mcd.shape

(1525, 11)

In [5]:
# count the number of null values in each column
mcd.isnull().sum()

_unit_id                           0
_golden                            0
_unit_state                        0
_trusted_judgments                 0
_last_judgment_at                  0
policies_violated                 54
policies_violated:confidence      54
city                              87
policies_violated_gold          1525
review                             0
Unnamed: 10                     1525
dtype: int64

In [6]:
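# filter the DataFrame to only include rows in which policies_violated is not null
# (.copy() avoids a SettingWithCopyWarning when a new column is added later)
mcd = mcd[mcd.policies_violated.notnull()].copy()

# alternatively, use the 'dropna' method to accomplish the same thing
# (this is a no-op here, since the null rows have already been removed)
mcd = mcd.dropna(subset=['policies_violated'])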

In [7]:
# examine the shape after removing rows
mcd.shape

(1471, 11)

## Task 3

We add a new column to the DataFrame called **"rude"** that is 1 if the **policies_violated** column contains the text "RudeService", and 0 if the **policies_violated** column does not contain "RudeService". The "rude" column is going to be our response variable, so we should check how many zeros and ones it contains.
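
As a plain-Python illustration of the labeling rule, the sketch below marks a cell as rude when "RudeService" is one of its newline-separated labels, then counts the classes. `is_rude` is a hypothetical helper, and the five cells are the first five **policies_violated** values in the data. For this label set, an exact match on the split labels and a substring search give the same answer, since no other label contains the text "RudeService".

```python
from collections import Counter


def is_rude(violated):
    """Return 1 if 'RudeService' is one of the newline-separated labels, else 0."""
    return int("RudeService" in violated.split("\n"))


# first five policies_violated values from the data
cells = [
    "RudeService\nOrderProblem\nFilthy",
    "RudeService",
    "SlowService\nOrderProblem",
    "na",
    "RudeService",
]

rude = [is_rude(c) for c in cells]
counts = Counter(rude)  # class distribution of the response variable
print(rude, counts)
```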

In [8]:
# search for the string 'RudeService', and convert the boolean results to integers
mcd['rude'] = mcd.policies_violated.str.contains('RudeService').astype(int)

In [9]:
# confirm that it worked
mcd.loc[0:4, ['policies_violated', 'rude']]

| | policies_violated | rude |
|---|---|---|
| 0 | RudeService\nOrderProblem\nFilthy | 1 |
| 1 | RudeService | 1 |
| 2 | SlowService\nOrderProblem | 0 |
| 3 | na | 0 |
| 4 | RudeService | 1 |


In [10]:
# examine the class distribution
mcd.rude.value_counts()

0    968
1    503
Name: rude, dtype: int64

## Task 4

1. We define X (the **review** column) and y (the **rude** column).
2. We split X and y into training and testing sets (using the parameter **`random_state=1`**).
3. We use CountVectorizer (with the **default parameters**) to create document-term matrices from X_train and X_test.
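
To make the idea of a document-term matrix concrete, here is a simplified pure-Python sketch of what a count vectorizer does: learn a vocabulary from the training documents only, then count those terms in both sets. It is not scikit-learn's implementation. The helper names and the example reviews are made up, and the tokenizer only approximates the default behavior (lowercasing, tokens of two or more word characters).

```python
import re

# tokens of two or more word characters, similar to CountVectorizer's default pattern
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_vocabulary(docs):
    """Learn a sorted term -> column index mapping from the training documents."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: i for i, term in enumerate(terms)}


def transform(docs, vocab):
    """Build a dense matrix of term counts; terms outside the vocabulary are ignored."""
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in tokenize(doc):
            if tok in vocab:
                row[vocab[tok]] += 1
        matrix.append(row)
    return matrix


# made-up reviews, for illustration only
train_docs = ["The staff was rude", "Slow service, rude staff"]
test_docs = ["Rude and slow cashier"]

vocab = fit_vocabulary(train_docs)        # learned from training data only
train_dtm = transform(train_docs, vocab)
test_dtm = transform(test_docs, vocab)    # "and" and "cashier" are dropped
print(sorted(vocab), train_dtm, test_dtm)
```

Because the vocabulary comes from the training set alone, the training and testing matrices always have the same number of columns, which is why both shapes in this task report the same feature count.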

In [11]:
# define X and y
X = mcd.review
y = mcd.rude

In [23]:
# split X and y into training and testing sets
# (sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [24]:
# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [25]:
# fit and transform X_train into X_train_dtm
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(1103, 7300)

In [26]:
# transform X_test into X_test_dtm
X_test_dtm = vect.transform(X_test)
X_test_dtm.shape

(368, 7300)

## Task 5

We fit a Multinomial Naive Bayes model to the training set, calculate the **predicted probabilities** (not the class predictions) for the testing set, and then calculate the **AUC**. We then repeat this task using a logistic regression model to see which of the two models achieves a better AUC.

- **Note:** Because McDonald's only cares about ranking the comments by the likelihood that they refer to rude service, **classification accuracy** is not the relevant evaluation metric. **Area Under the Curve (AUC)** is a more useful evaluation metric for this scenario, since it measures the ability of the classifier to assign higher predicted probabilities to positive instances than to negative instances.
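
The ranking interpretation of AUC can be checked by hand: AUC equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as half. The sketch below computes it that way on made-up labels and probabilities; `pairwise_auc` is a hypothetical helper, and its quadratic loop is meant for illustration rather than for large datasets.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties count as half a correctly ordered pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# made-up labels and predicted probabilities
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.1]

auc = pairwise_auc(y_true, y_score)
print(auc)  # 5 of the 6 pairs are ordered correctly
```

Only the ordering of the scores matters here, not their absolute values, which is what makes AUC a natural fit for a dashboard that shows the top-ranked comments.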

In [27]:
# import/instantiate/fit a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [28]:
# calculate the predicted probability of rude=1 for each testing set observation
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]

In [29]:
# calculate the AUC
from sklearn import metrics
metrics.roc_auc_score(y_test, y_pred_prob)

0.8426005404546177

In [30]:
# repeat this task using a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
metrics.roc_auc_score(y_test, y_pred_prob)

0.8233667143538389

## Task 6

Since Naive Bayes achieved the better AUC in the previous task, we use it to try **tuning CountVectorizer**. We check the testing set **AUC** after each change and keep the parameters that increase it the most.

- **Note:** To help us in this process, we will define a **`tokenize_test()`** function that will allow us to iterate quickly through different sets of parameters.
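
Before trying different vectorizer parameters, it can be useful to see how document-frequency filtering shrinks a vocabulary. The sketch below is a simplified stand-in, not scikit-learn's code. It assumes an integer `min_df` is an absolute document count and a float `max_df` is a proportion of documents. `filter_vocabulary`, the example reviews, and the tiny stop word set are all made up for illustration.

```python
import re
from collections import Counter


def filter_vocabulary(docs, stop_words=(), min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies between min_df and max_df.

    min_df is an absolute document count; max_df is a proportion of documents.
    """
    # use a set per document so each term counts at most once per document
    doc_sets = [set(re.findall(r"\b\w\w+\b", d.lower())) for d in docs]
    df = Counter(tok for toks in doc_sets for tok in toks)
    max_count = max_df * len(docs)
    return sorted(
        term for term, count in df.items()
        if term not in stop_words and min_df <= count <= max_count
    )


# made-up reviews, for illustration only
docs = [
    "the fries were cold",
    "the cashier was rude",
    "the manager was rude too",
    "cold fries and cold coffee",
]

kept = filter_vocabulary(
    docs, stop_words={"and", "was", "were", "too"}, min_df=2, max_df=0.5
)
print(kept)
```

In this toy example "the" is removed for appearing in too many documents, rare terms such as "coffee" are removed by `min_df`, and "was" is removed as a stop word, leaving a much smaller set of features.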

In [31]:
# define a function that accepts a vectorizer and calculates the AUC
def tokenize_test(vect):
    
    # create document-term matrices using the vectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # print the number of features that were generated
    print('Features:', X_train_dtm.shape[1])
    
    # use Multinomial Naive Bayes to calculate predicted probabilities
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
    
    # print the AUC
    print('AUC:', metrics.roc_auc_score(y_test, y_pred_prob))

In [32]:
# confirm that the AUC is identical to task 5 when using the default parameters
vect = CountVectorizer()
tokenize_test(vect)

Features: 7300
AUC: 0.8426005404546177


In [33]:
# tune CountVectorizer to increase the AUC
vect = CountVectorizer(stop_words='english', max_df=0.3, min_df=4)
tokenize_test(vect)

Features: 1732
AUC: 0.8621522810364012
