# Akshay Bhide - Homework 1 (Due Tuesday, March 22, 2022 at 6:29pm PST)

**Rubric**
* Identified 4 major themes from the reviews (2pts)
* Regex that groups / cleans the reviews is correctly implemented (4pts)
* Word count is correctly implemented (2pts)
* Analysis of recommendations and pitfalls/limitations are specific enough to be actionable (2pts)

Not actionable recommendation:
* *The store managers should consider trying to improve the drive through experience to be more pleasant for customers*

Actionable:
* *Drive throughs are mentioned 23% of the time in reviews, and often focus on how slow the service is. We recommend adopting parallel drive through stations for Atlanta and Chicago*

You are a business analyst working for McDonalds. First, read through the reviews in `mcdonalds-yelp-negative-reviews.csv` (found in `datasets` folder). 

1. Identify 4 recurring themes/topics that reviewers are unhappy with. For example, one theme is that users are consistently unhappy with the drive-through experience.

2. Next, using regex, group together all occurrences of these phrases. For example, `drive-thru`, `drive through`, `drivethrough` can all be replaced as `_DRIVE_THROUGH_`.

3. Perform a word count, both overall, and broken out by city.

4. **Provide a few sentences with your findings and business recommendations.** Make any assumptions you'd like. I just want you to get into the habit of "finishing" your analysis: to avoid delivering technical numbers to a non-technical manager.

Some considerations in your analysis:

* Explain some of the **pitfalls/limitations** of using only a word count analysis to make these inferences. What additional research/steps would you need to do to verify your conclusions?

**Submit everything as a new notebook: send the HW as an attachment in a Slack group direct message to me (Yu Chen) and the TAs (Mengqi Tan and Siyuan Ni).**

**NOTE**: Name the notebook `lastname_firstname_HW1.ipynb`.

Every day late is -10%.

In [1]:
#import libraries
import pandas as pd
import numpy as np
import re

In [2]:
#read in data
data = pd.read_csv('../datasets/mcdonalds-yelp-negative-reviews.csv', encoding="latin1")

In [3]:
#data head/shape
data.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


In [4]:
#1525 reviews
data.shape

(1525, 3)

In [5]:
#isolate reviews column
reviews = data['review']

In [6]:
#sample reviews and run over and over to look for common themes
samp = reviews.sample(3)
for s in samp:
    print(s, end='\n\n')

I swear the service with this mcdonalds is NO GOOD AT ALL!!! I don't like stopping at this McDonald's location cause I absolutely hate theirService but my girlfriend was hungry after work and she insisted weStop there so she can get a bite to eat so I went for it! BIG MISTAKE!! 1ST of all Once they speak it seems like they have an attitude. they dont even greetYou with good customer service and huff and puff as if they are in a rush to take your Order. 2nd when I ordered the wings for my girlfriend they said "its gonna be A 20 minute wait after you pay" are they serious they cant start on my orderRight away once I place it??!! There was 3 cars in front of me so I have to waitEven longer??? So I said no thank you we will just order something else and thereWe hear another deep breathe as if im taking very long to place my order. Then after I placed my order we stood in line with 3 cars still in front of us so we waited patiently.After waiting 15mins and no car has moved and 2 cars behind

### Major Recurring Themes

- slow service
- wrong orders
- homelessness issue
- bad customer service

In [7]:
#use this function to find reviews that contain a given word or regex pattern (to find words to substitute for each theme)
def find_rev_with(word):
    return [x for x in reviews if re.search(word, x)]

#commented because of long output
# find_rev_with('rude')

In [8]:
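#substitute words for each theme
#note: the bare word 'order' matches any mention of an order, not only wrong ones, so _WRONG_ORDER_ is likely over-counted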
revs_cleaned = []
for review in reviews:
    rev_copy = review
    rev_copy = re.sub(r'\b(order|messed up|incorrect|wrong|screw(ed)?)\b', '_WRONG_ORDER_', rev_copy, flags=re.IGNORECASE)
    rev_copy = re.sub(r'\b(slow|wait(ed|ing)?)\b', '_SLOW_SERVICE_', rev_copy, flags=re.IGNORECASE)
    rev_copy = re.sub(r'\b(homeless|hobo(s)?|transient(s)?)\b', '_HOMELESSNESS_', rev_copy, flags=re.IGNORECASE)
    rev_copy = re.sub(r'\b(attitude|rude|disrespectful|curs(e|ing)?|incompetent|unfriendly)\b', '_CUST_SERVICE_', rev_copy, flags=re.IGNORECASE)
    
    revs_cleaned.append(rev_copy)

In [9]:
#check to make sure subs worked

#commented because of long output
# [x for x in revs_cleaned if len(re.findall('_HOMELESSNESS_', x)) > 0]
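
A quick way to sanity-check the substitutions without printing the full dataset is to run the same patterns over a few short strings. The sketch below copies the four patterns from the cell above; the sample sentences and the helper name `tag_themes` are made up for illustration and are not part of the dataset or the original notebook.

```python
import re

# Patterns copied from the substitution cell; applied in the same order.
THEME_PATTERNS = [
    (r'\b(order|messed up|incorrect|wrong|screw(ed)?)\b', '_WRONG_ORDER_'),
    (r'\b(slow|wait(ed|ing)?)\b', '_SLOW_SERVICE_'),
    (r'\b(homeless|hobo(s)?|transient(s)?)\b', '_HOMELESSNESS_'),
    (r'\b(attitude|rude|disrespectful|curs(e|ing)?|incompetent|unfriendly)\b', '_CUST_SERVICE_'),
]

def tag_themes(text):
    """Hypothetical helper: replace theme keywords with their theme token."""
    for pattern, token in THEME_PATTERNS:
        text = re.sub(pattern, token, text, flags=re.IGNORECASE)
    return text

# Made-up sample sentences, not taken from the reviews.
samples = [
    "They got my order right and it was fast.",
    "Waited 20 minutes and the cashier was RUDE.",
    "Lots of transients outside.",
]
tagged = [tag_themes(s) for s in samples]
for t in tagged:
    print(t)
```

The first sample is a positive sentence, yet it is still tagged `_WRONG_ORDER_` because the bare word "order" is one of the keywords. That suggests the wrong-order share is an upper bound rather than a precise figure.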

In [10]:
#define function to count the number of reviews that mention each theme at least once, overall and broken down by city
theme_keywords = {'_WRONG_ORDER_':'Wrong Order', '_SLOW_SERVICE_':'Slow Service', 
                  '_HOMELESSNESS_':'Homelessness Issue', '_CUST_SERVICE_':'Bad Customer Service'}
def find_theme_counts(revs_cleaned):
    theme_counts = {'_WRONG_ORDER_':0, '_SLOW_SERVICE_':0, '_HOMELESSNESS_':0, '_CUST_SERVICE_':0}

    for k in theme_counts:
        for r in revs_cleaned:
            if k in r:
                theme_counts[k] += 1
                
    return theme_counts

overall_counts = find_theme_counts(revs_cleaned)
print(f'Overall word count of themes: ({len(data)} reviews)')
for k in theme_keywords:
    print(f'{theme_keywords[k]}:  {overall_counts[k]} ({round(overall_counts[k]/len(data),2)})')
    
print()
data['Reviews Cleaned'] = revs_cleaned
data['city'] = data['city'].fillna('no city given')
cities = data['city'].unique()

for c in cities:
    sub_df = data[data.city == c]
    city_counts = find_theme_counts(sub_df['Reviews Cleaned'].values)
    
    print(f'Overall word count for: {c.upper()} ({len(sub_df)} reviews)')
    for k in theme_keywords:
        print(f'{theme_keywords[k]}:  {city_counts[k]} ({round(city_counts[k]/len(sub_df),2)})')
    print()

Overall word count of themes: (1525 reviews)
Wrong Order:  566 (0.37)
Slow Service:  379 (0.25)
Homelessness Issue:  61 (0.04)
Bad Customer Service:  176 (0.12)

Overall word count for: ATLANTA (130 reviews)
Wrong Order:  54 (0.42)
Slow Service:  39 (0.3)
Homelessness Issue:  2 (0.02)
Bad Customer Service:  18 (0.14)

Overall word count for: LAS VEGAS (409 reviews)
Wrong Order:  166 (0.41)
Slow Service:  100 (0.24)
Homelessness Issue:  8 (0.02)
Bad Customer Service:  44 (0.11)

Overall word count for: DALLAS (75 reviews)
Wrong Order:  33 (0.44)
Slow Service:  14 (0.19)
Homelessness Issue:  3 (0.04)
Bad Customer Service:  8 (0.11)

Overall word count for: PORTLAND (97 reviews)
Wrong Order:  26 (0.27)
Slow Service:  16 (0.16)
Homelessness Issue:  4 (0.04)
Bad Customer Service:  13 (0.13)

Overall word count for: CHICAGO (219 reviews)
Wrong Order:  79 (0.36)
Slow Service:  49 (0.22)
Homelessness Issue:  8 (0.04)
Bad Customer Service:  19 (0.09)

Overall word count for: CLEVELAND (71 revie
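
The counts above are the number of reviews that contain each theme token at least once. A count of every occurrence (closer to a literal word count) could be sketched as follows. The rows here are hypothetical stand-ins for the `city` and `Reviews Cleaned` columns, and the names `THEME_TOKEN`, `overall`, and `by_city` are illustrative.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (city, cleaned review) pairs; not real rows from the dataset.
rows = [
    ("Atlanta", "_SLOW_SERVICE_ line and a _WRONG_ORDER_ again"),
    ("Atlanta", "the _SLOW_SERVICE_ drive thru was _SLOW_SERVICE_"),
    ("Chicago", "_CUST_SERVICE_ cashier, _WRONG_ORDER_ twice"),
]

# Matches tokens such as _WRONG_ORDER_ or _HOMELESSNESS_.
THEME_TOKEN = re.compile(r'_[A-Z]+(?:_[A-Z]+)*_')

overall = Counter()             # occurrences across all reviews
by_city = defaultdict(Counter)  # occurrences per city
for city, text in rows:
    tokens = THEME_TOKEN.findall(text)
    overall.update(tokens)
    by_city[city].update(tokens)

print(overall)
for city, counts in by_city.items():
    print(city, dict(counts))
```

Counting occurrences gives more weight to reviews that repeat a complaint, while counting reviews treats each reviewer equally. Which one is more useful depends on the question being asked.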

### Business Recommendations


- **Wrong Orders:** More than a third (37%) of all reviews mention wrong orders, and many describe them as repeat occurrences. Employee training on order accuracy should be strengthened, and accountability measures should be put in place so that employees are incentivized to deliver accurate orders to customers.
- **Slow Service:** Slow service is mentioned in about a fourth (25%) of all reviews, and in 38% of reviews for the Cleveland restaurant. Order completion practices and training should be reviewed across the board to see if any improvements can be made. Moreover, hiring more employees or scheduling more employees per shift, particularly in the Cleveland restaurant, could help decrease service time.
- **Homelessness Issue:** Homeless people or transients disturbing customers were mentioned in only 4% of reviews, but this appears to be a serious problem where it is mentioned. Furthermore, 14% of reviews in Los Angeles mention homelessness. Specifically for that location, I recommend working with local law enforcement agencies to find ways for customers to feel safe and comfortable entering and eating inside a McDonald's restaurant.
- **Bad Customer Service:** 12% of reviews mention employees having poor attitudes, being rude or disrespectful, or cursing in front of customers. Although these incidents are mentioned in fewer reviews than wrong orders or slow service, this is an important matter to address, and this kind of behavior should not be tolerated in any restaurant. All restaurants should perform a thorough review of each employee and assess whether their attitude towards customers meets the standard expected of a McDonald's employee.

### Pitfalls of using just a word count analysis

The biggest pitfall of using a word count analysis to make these kinds of inferences is that it ignores the context around the words being matched. Since words are matched individually, word counts don't reflect the context in which a word occurs in a sentence or document. For example, when counting occurrences of the word "rude," this analysis would flag any review with "rude" in it and classify it as a review where an employee was rude to a customer. However, the review might have said that the employee "was very kind and not rude at all," in which case that classification would be incorrect. Another pitfall is that a raw word count doesn't handle tenses well. For example, in the context of slow orders, "wait", "waited", and "waiting" would all show up as separate entries unless they are grouped explicitly (as the regex above does), even though a combined count is more informative.
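
The "not rude" problem can be illustrated with a minimal sketch. The negation cues below are an assumption chosen for illustration, not a complete list, and the example sentences and function names are made up.

```python
import re

# Assumed negation cues, optionally followed by one word (e.g. "not very rude").
NEGATED_RUDE = re.compile(
    r"\b(?:not|never|wasn't|weren't|isn't)\s+(?:\w+\s+)?rude\b",
    re.IGNORECASE,
)

def mentions_rude(text):
    """True if the word 'rude' appears at all (what a word count sees)."""
    return re.search(r'\brude\b', text, flags=re.IGNORECASE) is not None

def negated_rude(text):
    """True if 'rude' is preceded by one of the assumed negation cues."""
    return NEGATED_RUDE.search(text) is not None

# Made-up examples.
examples = [
    "The cashier was rude.",
    "She was very kind and not rude at all.",
    "Staff wasn't rude, just slow.",
]
flags = [(mentions_rude(e), negated_rude(e)) for e in examples]
for e, f in zip(examples, flags):
    print(f, e)
```

All three examples count as "rude" mentions, but only the first is an actual complaint about rudeness. A simple cue list like this would still miss many phrasings, so it is a rough check rather than a fix.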


One additional step to verify my conclusions would be to use sentiment analysis to classify each review as positive or negative before doing a word count analysis (a positive review with "rude" in it might actually say "not rude" or "wasn't rude"). Another step would be to examine the context around certain words in reviews. For example, we could run a secondary word count on all reviews with the word "rude" in them, or find the most common words before, after, or around the word "rude". Along with possibly uncovering context, a secondary word count like this could also reveal which words are most commonly used together, which could provide additional insight and further support or reject the conclusions.
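
The "words around rude" idea could be sketched roughly as below. The function name `context_counts`, the window size of two tokens, and the sample sentences are all assumptions made for illustration.

```python
import re
from collections import Counter

def context_counts(texts, target, window=2):
    """Count words that appear within `window` tokens of `target`."""
    counts = Counter()
    for text in texts:
        # Simple lowercase tokenization; keeps apostrophes so "wasn't" stays whole.
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

# Made-up sentences, not taken from the reviews.
sample_reviews = [
    "The manager was rude and slow.",
    "Not rude at all, very kind.",
    "So rude and so slow.",
]
neighbors = context_counts(sample_reviews, "rude")
print(neighbors.most_common(3))
```

On real data, common stop words such as "and" or "so" would likely dominate the neighbors, so filtering them out first would probably make the result easier to read.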