# Homework 3 (Due 5:30pm PST April 2nd, 2019): N-Grams, Regex, and TF-IDF

### Submit via Slack/email.

You are an analyst at McDonald's corporate headquarters, charged with identifying areas for improvement in customer service.

Using the `mcdonalds-yelp-negative-reviews.csv` dataset, clean and parse the text reviews. Document the decisions you make:
- why remove/keep stopwords?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?

Finally, generate a TF-IDF report that **visualizes**, for each city, what the major sources of complaints about the McDonald's franchises are. Offer your analysis and business recommendations on next steps for the global SVP of Operations.

Here is a brief summary of my answers to the following four questions:
- why remove/keep stopwords?
I removed stopwords because doing so filters out frequent but uninformative words such as 'the', which makes it easier to find the root causes of negative reviews.

- stemming versus lemmatization?
I used **lemmatization** because it keeps the original meaning of each word, so the resulting terms are easier to understand than the truncated stems that stemming produces.

- regex cleaning and substitution?
I applied a regex rule that requires tokens to have at least two characters, which filters out less informative single-character tokens. I chose a **less strict** rule than patterns that require longer tokens, because valuable phrases like **'20 minutes'** would be filtered out under stricter rules. I do not want to lose too many valuable insights from the data.

- adding in custom stopwords?
I added some custom stopwords. For example, I added words like **'mcdonalds' and 'always'** because they are not informative and add no valuable content.
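The trade-off behind the regex decision can be illustrated with a small sketch. `LOOSE` is the token pattern used in this notebook; `STRICT` is a hypothetical stricter alternative (letters only, at least three characters) chosen purely for comparison, and the sample sentence is made up.

```python
import re

# Token pattern used in this notebook: at least two letters or digits.
LOOSE = re.compile(r"\b[a-zA-Z0-9]{2,}\b")
# Hypothetical stricter alternative, for comparison only:
# letters only, at least three characters.
STRICT = re.compile(r"\b[a-zA-Z]{3,}\b")

# Made-up sample sentence, not taken from the dataset.
text = "I waited 20 minutes at the drive thru for a burger"

loose_tokens = LOOSE.findall(text)
strict_tokens = STRICT.findall(text)

print(loose_tokens)   # single-character tokens such as "I" and "a" are dropped
print(strict_tokens)  # "20" is dropped too, so "20 minutes" can no longer form a bigram
```

Under the stricter pattern the number disappears before the n-grams are built, which is why the looser rule was preferred here.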

In [None]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

In [None]:
df = pd.read_csv("mcdonalds-yelp-negative-reviews.csv", encoding="latin1")
corpus = list(df["review"].values)

In [None]:
lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk2wn_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
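    else:
        return None

# lemmatize each token using its part-of-speech tag; tokens without a WordNet tag are kept unchanged
def lemmatize_sentence(sentence):
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wn_tagged = map(lambda x: (x[0], nltk2wn_tag(x[1])), nltk_tagged)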
    res_words = []
    for word, tag in wn_tagged:
        if tag is None:            
            res_words.append(word)
        else:
            res_words.append(lemmatizer.lemmatize(word, tag))
    return " ".join(res_words)

In [None]:
corpus_lem = [lemmatize_sentence(review) for review in corpus]

In [None]:
# Here I remove the standard English stopwords and add several more words as custom stopwords.
# The reason is that only in this way can I filter out the frequent but uninformative words
# and find the root causes of negative reviews.
# I added 'mcdonalds', 'mcdonald', 'macdonald', 'mcds', and 'mc', all of which refer to McDonald's and add no valuable content.
# I also added words like 'ever', 'like', 'always', which do not point to an area for improvement either.

stop=stopwords.words('english')+['mcdonalds','mcdonald','macdonald','mcds','mc',
                                 'ever','like','always','every',
                                 'bad','worst','terrible','even','though']

In [None]:
# Here I use a regex to define the token pattern: a token must contain at least two letters or digits.
# The reason is that tokens made of a single letter or digit
# do not contribute any meaningful information.
vectorizer = TfidfVectorizer(ngram_range=(2,3),
                             token_pattern=r'\b[a-zA-Z0-9]{2,}\b',
                             max_df=0.5,
                             min_df=1, stop_words=stop)

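# fit on the lemmatized reviews so the lemmatization step above is actually used
X = vectorizer.fit_transform(corpus_lem)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()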
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score["term"] = terms
score.sort_values(by="score", ascending=False, inplace=True)

In [None]:
score.head()
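To make the `score` column concrete, here is a minimal pure-Python sketch of what "summing TF-IDF weights per term across all reviews" means. It is an illustration, not the code that produced the results: it assumes whitespace tokenization and bigrams only, and it follows what I understand to be scikit-learn's default weighting (smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). The function names and the toy reviews are hypothetical.

```python
import math
from collections import Counter

def bigrams(tokens):
    """Return space-joined bigrams from a list of tokens."""
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def tfidf_totals(docs):
    """Sum L2-normalized TF-IDF weights per bigram over all documents."""
    counts = [Counter(bigrams(doc.lower().split())) for doc in docs]
    n_docs = len(docs)
    # document frequency: number of documents containing each bigram
    doc_freq = Counter(term for c in counts for term in c)
    # smoothed IDF, assumed to match scikit-learn's default
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}
    totals = Counter()
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for t, w in weights.items():
            totals[t] += w / norm
    return totals

# Toy reviews, made up for illustration.
toy_reviews = ["drive thru slow", "drive thru rude", "ice cream machine"]
totals = tfidf_totals(toy_reviews)
print(totals.most_common(3))
```

A bigram that recurs across many reviews accumulates weight from each of them, which is why recurring complaints rise to the top of the sorted table.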

In [None]:
# check how many cities there are in the dataset
city = df.city.unique()
city = list(city)
city

In [None]:
def word_score(df,city):
    df_city = df[df['city']==city]
    corpus = list(df_city["review"].values)
    # lemmatize this city's reviews, then fit the vectorizer on the lemmatized text
    corpus_lem = [lemmatize_sentence(review) for review in corpus]
    X = vectorizer.fit_transform(corpus_lem)
    terms = vectorizer.get_feature_names_out()
    tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
    tf_idf = tf_idf.sum(axis=1)
    score = pd.DataFrame(tf_idf, columns=["score"])
    score["term"] = terms
    score.sort_values(by="score", ascending=False, inplace=True)
    return score

In [None]:
score_atlanta = word_score(df, 'Atlanta')
score_atlanta

In [None]:
score_lasvegas = word_score(df, 'Las Vegas')
score_lasvegas.head()

In [None]:
score_dallas = word_score(df, 'Dallas')
score_dallas.head()

In [None]:
score_portland = word_score(df, 'Portland')
score_portland.head()

In [None]:
score_chicago = word_score(df, 'Chicago')
score_chicago.head()

In [None]:
score_cleveland = word_score(df, 'Cleveland')
score_cleveland.head()

In [None]:
score_houston = word_score(df, 'Houston')
score_houston.head()

In [None]:
score_la = word_score(df, 'Los Angeles')
score_la.head()

In [None]:
score_ny = word_score(df, 'New York')
score_ny.head()

# Take a closer look at the reviews containing the key words

## Note: this is a good idea and what I consider an example of a deep dive. Many of us wrote that `drive thru` was a top issue, but it's important to dig into the reviews themselves to see the context: is it the drive-thru itself, poor customer service, bad-tasting products, the lack of a drive-thru, etc.?
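Reading whole reviews can be slow when a phrase matches many of them. As a complement, here is a minimal keyword-in-context sketch that returns only a window of text around each match. The helper name `keyword_in_context`, the `width` parameter, and the sample reviews are hypothetical and not part of the original analysis.

```python
import re

def keyword_in_context(reviews, phrase, width=40):
    """Return a snippet of up to `width` characters on each side of every match."""
    pattern = re.compile(re.escape(phrase), flags=re.IGNORECASE)
    snippets = []
    for review in reviews:
        for m in pattern.finditer(review):
            start = max(m.start() - width, 0)
            end = min(m.end() + width, len(review))
            snippets.append(review[start:end].strip())
    return snippets

# Made-up reviews for illustration.
sample_reviews = ["The Drive Thru line took 20 minutes.", "Great fries."]
snippets = keyword_in_context(sample_reviews, "drive thru", width=5)
print(snippets)
```

Matching is case-insensitive, so "Drive Thru" and "drive thru" are both found; passing the notebook's `corpus` instead of the sample list would apply the same idea to the real reviews.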

In [None]:
# lowercase each review before matching so that the search is case-insensitive
match = [string for string in corpus_lem if "drive thru" in string.lower()]
match

In [None]:
match = [string for string in corpus_lem if "customer service" in string.lower()]
match

In [None]:
match = [string for string in corpus if "ice cream" in string.lower()]
match

In [None]:
match = [string for string in corpus if "10 minutes" in string.lower()]
match

In [None]:
match = [string for string in corpus if "fast food" in string.lower()]
match

In [None]:
match = [string for string in corpus_lem if "parking lot" in string.lower()]
match

In [None]:
match = [string for string in corpus if "roaches" in string.lower()]
match

In [None]:
match = [string for string in corpus if "egg mcmuffin" in string.lower()]
match

# Customer Complaints Analysis
After analyzing 1525 negative customer reviews of McDonald's, I identified the following five main issues, all of which are shown in the pictures.

**- 1/Drive thru**

Drive thru is one of the greatest sources of complaints. Customers are dissatisfied with long waits at the drive thru and with receiving the wrong food for their orders.

**- 2/Fast food**

This term appears frequently because customers expect fast service, so they are dissatisfied with long waits to place orders and get their food.

**- 3/Customer service**

Many customers complain that the staff at McDonald's had a bad attitude and did not show respect to customers.

**- 4/Ice cream**

Customers complain that they cannot get ice cream because the machine is shut down at midnight.

**- 5/Wrong order**

Customers are angry about not getting the right food for their orders.


Next, I would like to point out the sources of complaints in each city.

Atlanta: the 1st, 2nd and 4th issues discussed above.

Chicago: in addition to the 1st and 3rd issues, people complain about parking lots.

Cleveland: the 1st, 2nd, 3rd issues.

Dallas: the 1st, 2nd, 3rd issues. People are also complaining about parking lots.

Houston: the 1st, 2nd, 3rd issues.

Los Angeles: the 1st, 2nd, 5th issues.

Las Vegas: the 1st, 2nd, 3rd, 5th issues.

New York: in addition to the 1st and 2nd issues mentioned above, customers complain about sanitary conditions, particularly roaches.

Portland: people mainly complain about the 1st, 2nd and 5th issues mentioned above.

Customers order the Egg McMuffin a lot but often get the wrong food for their orders.

Note: in the following pictures, color and size represent the frequency of the complaint sources. The darker the color and the bigger the box, the more frequently the words appear in negative customer reviews.
<img src="image/All Cities.png" style="width:600px;"/>
For detailed pictures of each city, see the appendix.

# Recommendation

**1. Promote McDonald’s mobile ordering feature.** Encouraging customers to place orders in the mobile application can **improve working efficiency** for staff at McDonald’s and **avoid serving the wrong food**. It also gives staff more time to prepare the food and **reduces waiting time for customers**.

**2. Design a new window or a drive-thru lane dedicated to mobile-order pick-up.** Combined with the first recommendation, this lets McDonald’s **streamline the service** for customers who order online and save them time.

**3. Use staff training and development to motivate employees.** Encourage staff to communicate with and serve customers in a more patient and friendly manner. McDonald’s can **conduct employee evaluations and reward staff who perform well**.

**4. Encourage customers to order using McDonald’s self-ordering kiosks.** In this way McDonald’s can avoid unpleasant customer experiences with staff and also ensure that customers get the right food.

**5. For McDonald’s locations with high ice cream demand, consider adding another ice cream machine.** Ice cream machines cannot run 24/7 because they need time to be cleaned. Given customer demand and current supply, locations where customers have high demand for ice cream should consider adding another machine to meet it. This would ensure that we are **not missing opportunities to sell more products**, and it would **improve our brand image and increase customer loyalty**. With ample supply, customers would also be encouraged to purchase other food.

# Appendix

## NOTE: One thing this student could have done better is format the visualizations so that the text is a little bit larger (or at least proportional to the size of the relevance / tf-idf score)
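One possible way to act on this note, sketched under assumptions: map each term's TF-IDF score linearly onto a font-size range and pass the result to whatever plotting call draws the labels. The helper name `scaled_font_sizes` and the default size range are hypothetical; the plots shown here were not produced with this code.

```python
def scaled_font_sizes(scores, min_size=10, max_size=28):
    """Linearly map scores onto [min_size, max_size] so bigger scores get bigger text."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # all scores are equal: use one readable size for every label
        return [float(max_size)] * len(scores)
    span = max_size - min_size
    return [min_size + (s - lo) / (hi - lo) * span for s in scores]

# Made-up scores for illustration.
sizes = scaled_font_sizes([1.0, 2.0, 3.0])
print(sizes)
```

Each size could then be supplied as the font size of the matching label (for example, the `fontsize` argument of a matplotlib text call), so that label size tracks the score.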

<img src="image/Atlanta.png" style="width:600px;"/>
<img src="image/Las Vegas.png" style="width:600px;"/>
<img src="image/Dallas.png" style="width:600px;"/>
<img src="image/Portland.png" style="width:600px;"/>
<img src="image/Chicago.png" style="width:600px;"/>
<img src="image/Cleveland.png" style="width:600px;"/>
<img src="image/Houston.png" style="width:600px;"/>
<img src="image/la.png" style="width:600px;"/>
<img src="image/ny.png" style="width:600px;"/>