# Goal:

Rank each comment by the likelihood that it is referring to rude service and present an ordered list of comments and associated probabilities.

## Description of the data:

The complaint types are listed below, with the encoding used in the data listed in parentheses:

- Bad Food (BadFood)
- Bad Neighborhood (ScaryMcDs)
- Cost (Cost)
- Dirty Location (Filthy)
- Missing Item (MissingFood)
- Problem with Order (OrderProblem)
- Rude Service (RudeService)
- Slow Service (SlowService)
- None of the above (na)

### Read in the data:

- The **policies_violated** column lists the type of complaint. If there is more than one type, the types are separated by newline characters.
- The **policies_violated:confidence** column lists the confidence in the judgments of human annotators for that row (higher is better).
- The **city** column is the McDonald's location.
- The **review** column is the actual text comment.

In [1]:
import numpy as np
import pandas as pd

mcdonalds = pd.read_csv('mcdonalds.csv')
mcdonalds.head(3)

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,policies_violated,policies_violated:confidence,city,policies_violated_gold,review,Unnamed: 10
0,679455653,False,finalized,3,2/21/15 0:36,RudeService\r\nOrderProblem\r\nFilthy,1.0\r\n0.6667\r\n0.6667,Atlanta,,"I'm not a huge mcds lover, but I've been to be...",
1,679455654,False,finalized,3,2/21/15 0:27,RudeService,1,Atlanta,,Terrible customer service. I came in at 9:30...,
2,679455655,False,finalized,3,2/21/15 0:26,SlowService\r\nOrderProblem,1.0\r\n1.0,Atlanta,,"First they ""lost"" my order, actually they gave...",


### Inspect the data:

In [2]:
mcdonalds.dtypes

_unit_id                          int64
_golden                            bool
_unit_state                      object
_trusted_judgments                int64
_last_judgment_at                object
policies_violated                object
policies_violated:confidence     object
city                             object
policies_violated_gold          float64
review                           object
Unnamed: 10                     float64
dtype: object

In [3]:
mcdonalds.columns

Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'policies_violated',
       'policies_violated:confidence', 'city', 'policies_violated_gold',
       'review', 'Unnamed: 10'],
      dtype='object')

In [4]:
mcdonalds.policies_violated.value_counts().head()

na              295
RudeService     177
SlowService     127
OrderProblem    116
BadFood         101
Name: policies_violated, dtype: int64
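
Note that `value_counts()` treats each distinct combination of complaint types as its own value, so a row labelled `RudeService\r\nOrderProblem` is not counted under `RudeService`. A minimal sketch of counting individual types, using a small made-up Series in place of the real column (the toy values are assumptions, not rows from the data):

```python
import pandas as pd

# Toy stand-in for the policies_violated column (illustrative values only).
policies = pd.Series([
    "RudeService\r\nOrderProblem\r\nFilthy",
    "RudeService",
    "SlowService\r\nOrderProblem",
    "na",
])

# str.split() with no pattern splits on any whitespace, including \r\n.
# explode() then gives one row per individual complaint type.
type_counts = policies.str.split().explode().value_counts()
print(type_counts)
```

This works here only because none of the complaint codes contain spaces.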

### Remove rows from the data where the policies_violated column has a null value
- **Note:** In this data, 'na' is not a null value; it is the label for "none of the above".

In [5]:
mcdonalds.shape

(1525, 11)

In [6]:
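# .copy() avoids a SettingWithCopyWarning when columns are added to the filtered frame later
mcdonalds = mcdonalds[mcdonalds.policies_violated.notnull()].copy()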
mcdonalds.shape

(1471, 11)
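
The distinction between the string 'na' and a true null can be checked on a tiny in-memory CSV. This is a sketch with made-up rows; it relies on lowercase `na` not being among the strings that `pd.read_csv` converts to NaN by default, which is consistent with the 295 'na' rows counted earlier:

```python
import io
import pandas as pd

# Three hypothetical rows: a literal 'na' label, an empty (null) label, and a real label.
csv_text = "policies_violated,review\nna,fine\n,missing label\nRudeService,rude staff\n"
toy = pd.read_csv(io.StringIO(csv_text))

# Only the row with the empty label is dropped; the 'na' string survives.
kept = toy[toy.policies_violated.notnull()]
print(kept.policies_violated.tolist())
```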

### Add a 'rude' column that is 1 if the policies_violated column contains the text 'RudeService' and 0 otherwise

In [7]:
mcdonalds['rude'] = mcdonalds.policies_violated.str.contains('RudeService').astype(int)

# verify resulting column
mcdonalds[['policies_violated','rude']].head()

Unnamed: 0,policies_violated,rude
0,RudeService\r\nOrderProblem\r\nFilthy,1
1,RudeService,1
2,SlowService\r\nOrderProblem,0
3,na,0
4,RudeService,1


### Define X as the 'review' column and y as the 'rude' column

In [8]:
X = mcdonalds.review
y = mcdonalds.rude

### Calculate the null accuracy (the accuracy achieved by always predicting the most frequent class, 'rude' or 'not rude')

In [9]:
max(y.mean(), 1-y.mean())

0.6580557443915704
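
An equivalent way to get the null accuracy, which also reports which class is the majority, is to normalise the class counts. A small sketch on a toy label vector (the values are made up):

```python
import pandas as pd

# Toy labels: three 'not rude' (0) and two 'rude' (1).
y_toy = pd.Series([0, 0, 0, 1, 1])

# Share of each class; the largest share is the null accuracy.
class_shares = y_toy.value_counts(normalize=True)
null_accuracy = class_shares.max()
majority_class = class_shares.idxmax()
print(majority_class, null_accuracy)
```

For a binary 0/1 target this matches `max(y.mean(), 1 - y.mean())`, but it also generalises to more than two classes.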

### Split X and y into training and testing sets (with a set random_state for reproducible results)

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(1103,) (368,)
(1103,) (368,)
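
The split above is a plain random split. Because the classes are imbalanced (roughly two thirds 'not rude'), one option that the notebook does not use is a stratified split, which keeps the class proportions the same in both sets. A sketch on toy arrays (the data below is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 negatives and 4 positives, so one third of the labels are positive.
X_toy = np.arange(12)
y_toy = np.array([0] * 8 + [1] * 4)

# stratify=y_toy preserves the 2:1 class ratio in both the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=1, stratify=y_toy
)
print(len(y_tr), y_tr.sum(), len(y_te), y_te.sum())
```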


### 'Learn' the language of the reviews using CountVectorizer and convert X_train and X_test into sparse matrices

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [12]:
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [13]:
print(X_train_dtm.shape)
print(X_test_dtm.shape)

(1103, 7300)
(368, 7300)
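
Both matrices have 7300 columns because the vocabulary is learned from the training set only; `transform` maps the test reviews onto that fixed vocabulary and drops any token it has not seen. A minimal sketch on made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, not taken from the review data.
train_docs = ["the staff was rude", "the food was cold"]
test_docs = ["rude and slow staff"]

toy_vect = CountVectorizer()
train_dtm = toy_vect.fit_transform(train_docs)  # learns the vocabulary
test_dtm = toy_vect.transform(test_docs)        # reuses it; 'and' and 'slow' are ignored

vocab = sorted(toy_vect.vocabulary_)
print(vocab)
print(test_dtm.toarray())
```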


- ### Fit a Multinomial Naive Bayes model to the training set
- ### Calculate the predicted probabilities for the test set
- ### Calculate AUC

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
nb = MultinomialNB()

In [15]:
nb.fit(X_train_dtm, y_train)

# [:, 1] selects the predicted probability of class 1 (the 'rude' class)
y_pred_probs = nb.predict_proba(X_test_dtm)[:,1]

In [16]:
metrics.roc_auc_score(y_test, y_pred_probs)

0.8426005404546177
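
AUC suits the goal of ranking comments because it depends only on the ordering of the scores: it is the fraction of (rude, not rude) pairs in which the rude comment receives the higher score, with ties counted as one half. A small sketch that checks this against `roc_auc_score` on made-up scores:

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]

# Fraction of positive/negative pairs ranked correctly (ties count as half).
pairs = list(product(pos, neg))
pairwise_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

sk_auc = roc_auc_score(y_true, scores)
print(pairwise_auc, sk_auc)
```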

### Repeat above steps for a Logistic Regression model instead

In [17]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [18]:
logreg.fit(X_train_dtm, y_train)

y_pred_probs = logreg.predict_proba(X_test_dtm)[:,1]

In [19]:
metrics.roc_auc_score(y_test, y_pred_probs)

0.8233985058019394

### Since Multinomial Naive Bayes outperformed Logistic Regression (AUC 0.843 vs. 0.823), we will tune the CountVectorizer parameters to look for better-performing models

In [20]:
# define a function that accepts a vectorizer and reports the test-set AUC of the resulting model
def tokenize_test(vect):
    
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # print the number of features that were generated
    print('Features: ', X_train_dtm.shape[1])
    
    # use Multinomial Naive Bayes to predict probabilities
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_probs = nb.predict_proba(X_test_dtm)[:,1]
    
    # print the AUC of its predictions (the 'Accuracy' label below actually reports AUC)
    print('Accuracy: ', metrics.roc_auc_score(y_test, y_pred_probs))

In [21]:
# default parameters
vect = CountVectorizer()
tokenize_test(vect)

Features:  7300
Accuracy:  0.8426005404546177


In [22]:
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
tokenize_test(vect)

Features:  8742
Accuracy:  0.8406453663964394


In [23]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1,2))
tokenize_test(vect)

Features:  57936
Accuracy:  0.8195994277539342


In [24]:
# include 1-grams, 2-grams and 3-grams
vect = CountVectorizer(ngram_range=(1,3))
tokenize_test(vect)

Features:  142380
Accuracy:  0.7963916706405977


In [25]:
# include 1-grams, 2-grams, 3-grams, and 4-grams
vect = CountVectorizer(ngram_range=(1,4))
tokenize_test(vect)

Features:  236084
Accuracy:  0.7837068828485139


In [26]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

Features:  7020
Accuracy:  0.853520902877126


In [27]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)
tokenize_test(vect)

Features:  7291
Accuracy:  0.8448895247178507


In [28]:
# only keep the top 1k most frequent terms
vect = CountVectorizer(max_features=1000)
tokenize_test(vect)

Features:  1000
Accuracy:  0.8300906056270863


In [29]:
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
tokenize_test(vect)

Features:  3500
Accuracy:  0.8445875059608965


In [30]:
vect = CountVectorizer(ngram_range=(1,2), min_df=2)
tokenize_test(vect)

Features:  14485
Accuracy:  0.8412970910824988


In [31]:
vect = CountVectorizer(ngram_range=(1,2), stop_words='english')
tokenize_test(vect)

Features:  44739
Accuracy:  0.8316165951359085
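
The runs above compare each configuration on a single test split, so small differences in AUC may partly reflect that one split. An alternative that the notebook does not use is a cross-validated grid search over a pipeline. The sketch below uses made-up reviews and a small, arbitrary parameter grid; both are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews; 1 = rude service, 0 = anything else.
docs = [
    "the cashier was rude to me", "rude staff and a rude manager",
    "she rolled her eyes and was rude", "very rude employee at the window",
    "the manager yelled at us", "staff ignored me and then was rude",
    "the fries were cold", "my order was missing a burger",
    "fast and friendly service", "clean tables and friendly staff",
    "the line was slow today", "prices keep going up",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is refit inside every fold, so no vocabulary leaks from held-out data.
pipe = Pipeline([("vect", CountVectorizer()), ("nb", MultinomialNB())])
param_grid = {
    "vect__stop_words": [None, "english"],
    "vect__ngram_range": [(1, 1), (1, 2)],
}

# Score every combination by mean AUC over 3 stratified folds.
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=3)
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```

On the real data, `docs` and `labels` would be the review text and the `rude` column, and the grid could include `min_df` and `max_df` as well.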


### Combine the 'city' column with the 'review' column to see whether the city is predictive of the response

In [32]:
mcdonalds['review_city'] = mcdonalds.city.str.cat(mcdonalds.review, sep=' ', na_rep='na')
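
The `na_rep='na'` argument matters because, without it, any row with a missing city would produce a missing combined value instead of text. A short sketch on a two-row toy frame (made-up values):

```python
import pandas as pd

# Toy frame with one missing city.
toy = pd.DataFrame({
    "city": ["Atlanta", None],
    "review": ["Rude cashier.", "Fast and friendly."],
})

# With na_rep, the missing city is replaced by the literal text 'na'.
toy["review_city"] = toy.city.str.cat(toy.review, sep=" ", na_rep="na")

# Without na_rep, the second row would be missing entirely.
without_rep = toy.city.str.cat(toy.review, sep=" ")

print(toy.review_city.tolist())
print(without_rep.isna().tolist())
```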

In [33]:
X = mcdonalds.review_city

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [35]:
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

Features:  7023
Accuracy:  0.8532347798442219


In [36]:
vect = CountVectorizer(stop_words='english', min_df=2)
tokenize_test(vect)

Features:  3246
Accuracy:  0.8541885232872358


In [37]:
vect = CountVectorizer(stop_words='english', min_df=2, max_df=0.3)
tokenize_test(vect)

Features:  3242
Accuracy:  0.8571769194086791


### Score new comments with the likelihood that they are referring to rude service

In [38]:
X = mcdonalds.review_city

In [39]:
vect = CountVectorizer(stop_words='english', min_df=2, max_df=0.3)

In [40]:
X_dtm = vect.fit_transform(X)

In [41]:
X_dtm.shape

(1471, 3834)

In [42]:
new_comments = pd.read_csv('mcdonalds_new.csv')

In [43]:
new_comments['review_city'] = new_comments.city.str.cat(new_comments.review, 
                                                        sep=' ', na_rep='na')

In [44]:
new_comments

Unnamed: 0,city,review,review_city
0,Las Vegas,Went through the drive through and ordered a #...,Las Vegas Went through the drive through and o...
1,Chicago,Phenomenal experience. Efficient and friendly ...,Chicago Phenomenal experience. Efficient and f...
2,Los Angeles,Ghetto lady helped me at the drive thru. Very ...,Los Angeles Ghetto lady helped me at the drive...
3,New York,Close to my workplace. It was well manged befo...,New York Close to my workplace. It was well ma...
4,Portland,I've made at least 3 visits to this particular...,Portland I've made at least 3 visits to this p...
5,Houston,Why did I revisited this McDonald's again. I...,Houston Why did I revisited this McDonald's a...
6,Atlanta,This specific McDonald's is the bar I hold all...,Atlanta This specific McDonald's is the bar I ...
7,Dallas,My friend and I stopped in to get a late night...,Dallas My friend and I stopped in to get a lat...
8,Cleveland,Friendly people but completely unable to deliv...,Cleveland Friendly people but completely unabl...
9,,"Having visited many McDonald's over the years,...",na Having visited many McDonald's over the yea...


In [45]:
new_dtm = vect.transform(new_comments.review_city)

In [46]:
new_dtm.shape

(10, 3834)

In [47]:
y = mcdonalds.rude

In [48]:
y.shape

(1471,)

In [49]:
nb = MultinomialNB()
nb.fit(X_dtm, y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [50]:
new_pred_probs = nb.predict_proba(new_dtm)[:,1]

In [51]:
new_pred_probs

array([0.38421442, 0.00676949, 0.95068161, 0.03715289, 0.72560183,
       0.07188112, 0.21955204, 0.99998954, 0.95277939, 0.01742432])

In [52]:
# widen the column display
pd.set_option('display.max_colwidth', 1000)

In [53]:
pd.DataFrame({'comment': new_comments.review_city, 
              'rude_probability': new_pred_probs}).sort_values('rude_probability', ascending=False)

Unnamed: 0,comment,rude_probability
7,Dallas My friend and I stopped in to get a late night snack and we were refused service. The store claimed to be 24 hours and the manager was standing right there doing paper work but would not help us. The cashier was only concerned with doing things for the drive thru and said that the manager said he wasn't allowed to help us. We thought it was a joke at first but when realized it wasn't we said goodbye and they just let us leave. I work in a restaurant and this is by far the worst service I have ever seen. I know it was late and maybe they didn't want to be there but it was completely ridiculous. I think the manager should be fired.,0.99999
8,"Cleveland Friendly people but completely unable to deliver what was ordered at the drive through. Out of my last 6 orders they got it right 3 times. Incidentally, the billing was always correct - they just could not read the order and deliver. Very frustrating!",0.952779
2,Los Angeles Ghetto lady helped me at the drive thru. Very rude and disrespectful to the co workers. Never coming back. Yuck!,0.950682
4,"Portland I've made at least 3 visits to this particular location just because it's right next to my office building.. and all my experience have been consistently bad. There are a few helpers taking your orders throughout the drive-thru route and they are the worst. They rush you in placing an order and gets impatient once the order gets a tad bit complicated. Don't even bother changing your mind oh NO! They will glare at you and snap at you if you want to change something. I understand its FAST food, but I want my order placed right. Not going back if I can help it.",0.725602
0,"Las Vegas Went through the drive through and ordered a #10 (cripsy sweet chili chicken wrap) without fries- the lady couldn't understand that I did not want fries and charged me for them anyways. I got the wrong order- a chicken sandwich and a large fries- my boyfriend took it back inside to get the correct order. The gentleman that ordered the chicken sandwich was standing there as well and she took the bag from my bf- glanced at the insides and handed it to the man without even offering to replace. I mean with all the scares about viruses going around... ugh DISGUSTING SERVICE. Then when she gave him the correct order my wrap not only had the sweet chili sauce on it, but the nasty (just not my first choice) ranch dressing on it!!!! I mean seriously... how lazy can you get!!!! I worked at McDonalds in Texas when I was 17 for about 8 months and I guess I was spoiled with good management. This was absolutely ridiculous. I was beyond disappointed.",0.384214
6,"Atlanta This specific McDonald's is the bar I hold all other fast food joints to now. Been working in this area for 3 years now and gone to this location many times for drive-through pickup. Service is always fast, food comes out right, and the staff is extremely warm and polite.",0.219552
5,"Houston Why did I revisited this McDonald's again. I needed to use the restroom facilities and the women's bathroom didn't have soap, the floor was wet, the bathroom stink, and the toilets were nasty. This McDonald's is very nasty.",0.071881
3,"New York Close to my workplace. It was well manged before. Now it's OK. The parking can be tight sometimes. Like all McDonald's, prices are getting expensive.",0.037153
9,"na Having visited many McDonald's over the years, I have to say that this one is the most efficient one ever! Even though it is still fast food, the service at the drive-thru is the best. They rarely make a mistake and I never see anyone parked in the drive-thru slots where they bring food out because they don't have it ready. So, if you like McDonald's fast food, it doesn't get any better than this.",0.017424
1,"Chicago Phenomenal experience. Efficient and friendly staff. Clean restrooms, good, fast service and bilingual staff. One of the best restaurants in the chain.",0.006769
