# Project 3: Project #GetWellPlan
---

**Organization of Project Notebooks:**
- **Notebook #1: Problem Statement & Web Scraping** (current notebook)
- Notebook #2: [Data Cleaning & Exploratory Data Analysis](./02_data_cleaning_and_eda.ipynb)
- Notebook #3: [Preprocessing & Modelling](./03_preprocessing_and_modelling.ipynb)
- Notebook #4: [Sentiment Analysis & Recommendations](./04_sentiment_analysis_and_recommendations.ipynb)

## Notebook #1: Problem Statement & Web Scraping

### Contents
- [1. Introduction](#1.-Introduction)
- [2. Problem Statement](#2.-Problem-Statement)
- [3. Executive Summary](#3.-Executive-Summary)
- [4. Data Dictionary](#4.-Data-Dictionary)
- [5. Webscrapping](#5.-Webscrapping)

### 1. Introduction

According to the **American Customer Satisfaction Index (ACSI) Retail and Consumer Shipping Report 2020-2021** (released in March 2021), the retail industry (comprising supermarkets, health and personal care stores, department and discount stores, specialty retail stores, online retail and gas stations) saw its customer satisfaction score decline to 75.7 out of 100 in 2020 as a result of the COVID-19 pandemic, its lowest score since 2015.

Within the supermarkets segment, the ACSI score fell by 2.6% to 76 in 2020 (from 78 in 2019), with 17 of the 20 major brands attaining lower customer satisfaction scores (Fig 1). **Specifically, Walmart's score declined by 3% to 71 in 2020 (from 73 in 2019), staying below the segment average, and Walmart remained in last place in both 2019 and 2020**.

<img src='./images/us-supermarkets-acsi-scores.jpg' width=700 align=center>
<center><font size=2 color='grey'>(Fig 1. The American Customer Satisfaction Index of Supermarkets in 2019 and 2020.)</font></center>


Source: [Russell Redman (March 2021)](https://www.supermarketnews.com/issues-trends/customer-satisfaction-fell-supermarkets-2020)
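The percentage declines quoted in this section can be checked directly from the year-on-year scores. The sketch below uses only the scores stated above; the helper name `pct_decline` is introduced here purely for illustration.

```python
def pct_decline(old, new):
    """Return the relative decline from `old` to `new`, in percent."""
    return (old - new) / old * 100


# Scores taken from the text above (2019 -> 2020)
supermarkets = pct_decline(78, 76)  # supermarket segment average
walmart = pct_decline(73, 71)       # Walmart

print(f"Supermarkets: {supermarkets:.1f}% decline")
print(f"Walmart: {walmart:.1f}% decline")
```

Walmart's decline works out to roughly 2.7%, which the report rounds to 3%.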

### 2. Problem Statement

To **move up the ranks in ACSI 2021**, the Executive Management has requested a thorough review of the existing customer journey at Walmart, Walmart's brand image, and the aspects that need to be addressed to improve customer experience and satisfaction.

As part of the Social Media Branding team at Walmart, we have been tasked with improving the company's brand image on social media sites. To start, we will look into what users are saying on Walmart's Reddit page, alongside that of a chosen competitor, Costco, which retained its 2nd-place ranking across 2019 and 2020.

A classification model can be developed to predict whether a Reddit post belongs to Walmart or Costco based on the words it contains, and metrics such as accuracy, precision, recall and ROC-AUC scores can be employed to evaluate the model's performance. In addition, a sentiment analysis can be conducted to determine whether the post has positive, neutral or negative sentiment, before arriving at the recommendations to improve the brand image online.
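As a rough illustration of how three of these metrics are derived from a model's predictions, the sketch below computes accuracy, precision and recall in plain Python. The function name, the choice of "walmart" as the positive class, and the six labels are all hypothetical and are not results from this project. ROC-AUC needs predicted probabilities rather than hard labels, so it is left out of this sketch.

```python
def classification_metrics(y_true, y_pred, positive="walmart"):
    """Compute accuracy, precision and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels for six posts (illustration only)
y_true = ["walmart", "walmart", "walmart", "costco", "costco", "costco"]
y_pred = ["walmart", "walmart", "costco", "costco", "costco", "walmart"]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```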

Essentially, this project aims to achieve the following objectives: 
- **Primary Objective**: To enhance our understanding of Walmart's brand image on Reddit, in comparison to Costco's, so as to introduce a series of strategies for improvement.

- **Secondary Objective**: To identify positive and negative feedback from Reddit users regarding both supermarkets, so that practices praised in positive feedback can be reinforced and/or implemented, while issues raised in negative feedback can be addressed and prevented.

Our primary stakeholders include employees at Walmart Corporation, while our secondary stakeholders include customers of Walmart.

### 3. Executive Summary

*INTRODUCTION*

With Walmart remaining in last place in the supermarket segment of the ACSI in both 2019 and 2020, this initiative aims to understand what users are saying on Walmart's social media sites in comparison to Costco's, and to implement changes that can improve the customer experience as well as the company's brand image online. The social media site explored in this project is Reddit, an American platform for social news, web content rating and user-created discussion forums (source: [Reddit (2021)](https://www.reddit.com/)). Users can post content such as images, text, videos and links within the respective subreddits, and can express their like or dislike of a particular post by voting it up or down.

We will be analyzing posts from the Walmart subreddit ([r/walmart](https://www.reddit.com/r/walmart/)) and the Costco subreddit ([r/Costco](https://www.reddit.com/r/Costco/)). While both are well-known supermarket brands in the United States (U.S.), their different business models may have resulted in different customer experiences and brand images.



*METHODOLOGY*

A data science workflow was followed for this analysis. Firstly, the problem statement was defined: how does Walmart's brand image on Reddit compare to Costco's, and what positive and negative feedback from Reddit users about both supermarkets needs to be addressed, so that Walmart can improve its brand image on social media sites and move up the ranks in ACSI 2021 with higher customer satisfaction scores? Next, the contents of the posts on the Walmart and Costco subreddits were extracted via web scraping using the Python Reddit API Wrapper (PRAW).

Thereafter, an exploratory data analysis was conducted to identify the one-word and two-word phrases that appeared most frequently in the respective subreddit pages. New features such as the character length of the posts were engineered, and differences between Walmart and Costco were visualized using a series of histograms, boxplots and bar charts. Concurrently, external research into the background of the commonly occurring words was carried out, to understand factors that could have affected customer satisfaction. It was interesting to observe that Walmart's subreddit contained mainly posts from employees, whereas the posts on Costco's were mostly contributed by customers.
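To make the phrase-counting step concrete, here is a minimal sketch of counting one-word and two-word phrases and engineering a character-length feature in plain Python. The helper `top_ngrams`, the simple regex tokenizer and the three sample posts are assumptions made for illustration; they are not the code or data used in the next notebook, and no stop-word removal is applied.

```python
import re
from collections import Counter


def top_ngrams(texts, n=1, k=3):
    """Return the k most frequent n-word phrases across a list of texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphabetic tokens (apostrophes allowed)
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)


# Made-up post titles (illustration only)
sample_posts = [
    "Rotisserie chicken is the best",
    "Best rotisserie chicken deal",
    "Is the food court pizza back",
]

top_bigrams = top_ngrams(sample_posts, n=2, k=1)
word_counts = dict(top_ngrams(sample_posts, n=1, k=5))
char_lengths = [len(post) for post in sample_posts]  # engineered length feature

print(top_bigrams)
print(char_lengths)
```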

A classification model was developed, with multiple combinations of vectorizers and models being tested to predict whether a random post belonged to Walmart or Costco. Metrics such as accuracy, precision, recall and ROC-AUC scores were utilised to evaluate the models' performance. Eventually, the final combination selected was a Logistic Regression model coupled with a TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer, which was capable of making predictions with an accuracy score of 78.0% *(which was 27% higher than baseline accuracy)*. A grid search was carried out to fine-tune the parameters used for the modelling, though it did not improve the model's accuracy further.
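Baseline accuracy is commonly taken to be the accuracy of always predicting the majority class. The sketch below shows that calculation, assuming the raw scrape sizes reported at the end of this notebook (847 Walmart posts and 964 Costco posts) as the class counts. The cleaned dataset used for modelling may have different counts, so this figure is not necessarily the baseline behind the comparison quoted above.

```python
def baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())


# Raw post counts from this notebook's scrape (before any cleaning)
raw_counts = {"walmart": 847, "costco": 964}

baseline = baseline_accuracy(raw_counts)
print(f"Majority-class baseline on raw counts: {baseline:.1%}")
```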

Lastly, a sentiment analysis was conducted to determine whether the posts on the respective subreddit pages had positive, neutral or negative sentiment. Based on this analysis, recommendations for Walmart were compiled, which included improving the employee experience first, followed by enhancing the customer journey, in order to improve the brand image online.



*FINDINGS*

The top 10 words that identified whether a post belonged to **Walmart** were (in order of increasing importance):
- 'loa', 'vaccine', 'lol', 'covid', 'ppto', 'getting', 'pallet', 'store', 'associate', 'customer'.
     
On the other hand, the top 10 words that identified whether a post belonged to **Costco** were (in order of increasing importance):
- 'experience', 'anyone', 'online', 'good', 'delivery', 'best', 'pizza', 'food', 'membership', 'chicken'. 

The sentiment analysis revealed that Walmart's subreddit contained 6.5% positive posts, 86.9% neutral posts and 6.6% negative posts, whereas Costco's contained 14.8% positive posts, 81.8% neutral posts and 3.4% negative posts. A higher proportion of Walmart's posts were therefore identified as negative compared to Costco's, and this needs to be looked into to understand the source of the negativity.
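Percentage shares like these can be produced by labelling each post from a sentiment score and then tallying the labels. The sketch below is a hedged illustration only: this notebook does not state which sentiment tool was used, so it assumes a scorer that returns a compound score between -1 and 1 and uses the ±0.05 cut-offs conventionally recommended for VADER. The scores themselves are made up.

```python
from collections import Counter


def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label (assumed cut-offs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def sentiment_shares(scores):
    """Return the percentage of posts falling under each sentiment label."""
    labels = Counter(label_sentiment(score) for score in scores)
    total = sum(labels.values())
    return {
        name: labels[name] / total * 100
        for name in ("positive", "neutral", "negative")
    }


# Hypothetical compound scores for eight posts (illustration only)
hypothetical_scores = [0.62, 0.0, -0.31, 0.01, 0.0, -0.04, 0.0, 0.2]

shares = sentiment_shares(hypothetical_scores)
print(shares)
```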

### 4. Data Dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---|
|author|object|walmart_posts, walmart_posts_actual|author of each post
|created_utc|float|walmart_posts, walmart_posts_actual|date and time the post was created (Unix timestamp, UTC)
|id|object|walmart_posts, walmart_posts_actual|id of each post
|num_comments|integer|walmart_posts, walmart_posts_actual|number of comments each post receives
|score|integer|walmart_posts, walmart_posts_actual|number of upvotes a post has 
|selftext|object|walmart_posts, walmart_posts_actual|body text of each post
|subreddit|object|walmart_posts, walmart_posts_actual|subreddit the post belongs to (Walmart)
|title|object|walmart_posts, walmart_posts_actual|title text of each post
|url|object|walmart_posts, walmart_posts_actual|url website for each post
|author|object|costco_posts, costco_posts_actual|author of each post
|created_utc|float|costco_posts, costco_posts_actual|date and time the post was created (Unix timestamp, UTC)
|id|object|costco_posts, costco_posts_actual|id of each post
|num_comments|integer|costco_posts, costco_posts_actual|number of comments each post receives
|score|integer|costco_posts, costco_posts_actual|number of upvotes a post has 
|selftext|object|costco_posts, costco_posts_actual|body text of each post
|subreddit|object|costco_posts, costco_posts_actual|subreddit the post belongs to (Costco)
|title|object|costco_posts, costco_posts_actual|title text of each post
|url|object|costco_posts, costco_posts_actual|url website for each post
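
The `created_utc` field is stored as a float holding a Unix timestamp. A minimal sketch of converting it into a readable, timezone-aware datetime is shown below; the helper name is hypothetical, and the sample value is the first Walmart timestamp as displayed in the preview further down (which may be rounded by the display).

```python
from datetime import datetime, timezone


def utc_from_timestamp(created_utc):
    """Convert a Unix timestamp (seconds since 1970-01-01) to an aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


posted_at = utc_from_timestamp(1605935000.0)
print(posted_at.isoformat())
```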

### 5. Webscrapping

To scrape data from Reddit, a few options are available, including the [Pushshift Reddit API](https://github.com/pushshift/api) and the [Python Reddit API Wrapper (PRAW)](https://praw.readthedocs.io/en/latest/getting_started/quick_start.html). PRAW was selected for this project, as it can easily retrieve fields from the Reddit pages while abiding by Reddit's API rules. In addition, it does not require any sleep calls to be introduced, yet it is still able to pull around 1,000 posts at a time.

Using PRAW, the following information was collected from both Walmart and Costco subreddits: 
- post author
- post date
- post ID
- number of comments per post
- post score (i.e. number of upvotes)
- post text (body)
- post title
- post URL 

The data from this section will be explored in the [next notebook](./02_data_cleaning_and_eda.ipynb).
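
The fields listed above can be pulled from each post into a flat record before building a DataFrame. The sketch below is one possible way to do this, not the code used in the cells that follow: `FIELDS`, `post_to_record` and the stand-in post are hypothetical, and a `SimpleNamespace` is used in place of a real PRAW submission so the example runs without credentials. It assumes that `author` and `subreddit` are objects whose string form is their name, and that `author` is `None` for deleted accounts.

```python
from types import SimpleNamespace

# Attribute names collected for each post, matching the list above
FIELDS = ["author", "created_utc", "id", "num_comments", "score",
          "selftext", "subreddit", "title", "url"]


def post_to_record(post):
    """Flatten a post-like object into a dict of plain Python values."""
    record = {name: getattr(post, name) for name in FIELDS}
    # Assumption: author/subreddit are objects; keep only their names as text
    record["author"] = str(post.author) if post.author is not None else None
    record["subreddit"] = str(post.subreddit)
    return record


# Stand-in for a scraped post (illustration only, not real data)
fake_post = SimpleNamespace(
    author=None, created_utc=1624623000.0, id="abc123", num_comments=7,
    score=61, selftext="", subreddit="walmart", title="Good morning",
    url="https://example.com/post",
)

record = post_to_record(fake_post)
print(record)
```

A list of such records can be passed straight to `pd.DataFrame`, which takes its column names from the dictionary keys.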

In [1]:
# import praw package
import praw

In [2]:
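# create a reddit instance; credentials are read from environment variables
# (the variable names below are placeholders) instead of being hard-coded
import os

reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='xinna_scrapping')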

Version 7.0.0 of praw is outdated. Version 7.3.0 was released Thursday June 17, 2021.


In [3]:
# import the relevant package
import pandas as pd

#### Import posts from Walmart subreddit

In [4]:
# request up to 1500 posts from the 'hot' listing of r/walmart
# (Reddit listings return at most around 1,000 items, so fewer rows come back)
walmart_posts = []
walmart_subreddit = reddit.subreddit('walmart')

for post in walmart_subreddit.hot(limit=1500):
    walmart_posts.append([post.author, 
                          post.created_utc, 
                          post.id, 
                          post.num_comments, 
                          post.score, 
                          post.selftext, 
                          post.subreddit, 
                          post.title, 
                          post.url])

In [5]:
# store the posts in a DataFrame
walmart_posts_df = pd.DataFrame(walmart_posts, columns=['author', 
                                                        'created_utc', 
                                                        'id', 
                                                        'num_comments', 
                                                        'score', 
                                                        'selftext', 
                                                        'subreddit', 
                                                        'title', 
                                                        'url'])

In [6]:
# view the first 5 posts in the DataFrame
walmart_posts_df.head()

| |author|created_utc|id|num_comments|score|selftext|subreddit|title|url|
|---|---|---|---|---|---|---|---|---|---|
|0|armoreddillo|1605935000.0|jy56so|434|4912|👆|walmart|If you're here, as a customer, to complain abo...|https://www.reddit.com/r/walmart/comments/jy56...|
|1|jasiad|1624301000.0|o523jg|82|38|haha Johnathan you are making my retail experi...|walmart|Weekly Salt Thread 209 - See you all in Therapy!|https://www.reddit.com/r/walmart/comments/o523...|
|2|boobalogne|1624623000.0|o7mgnw|7|61| |walmart|Good morning|https://www.reddit.com/gallery/o7mgnw|
|3|Dabble007|1624576000.0|o7b68g|22|496| |walmart|We have been on a roll lately, BUT....|https://i.redd.it/79kur2k4pa771.jpg|
|4|bug1998|1624599000.0|o7hdja|13|65| |walmart|Cashiers are fed up with them cutting hours|https://i.redd.it/hrtjh50umc771.jpg|


In [7]:
# view the shape of the DataFrame
walmart_posts_df.shape

(847, 9)

In [8]:
# export the walmart posts DataFrame as csv file
walmart_posts_df.to_csv('datasets/walmart_posts.csv')

#### Import posts from Costco subreddit

In [9]:
# request up to 1500 posts from the 'hot' listing of r/Costco
# (Reddit listings return at most around 1,000 items, so fewer rows come back)
costco_posts = []
costco_subreddit = reddit.subreddit('Costco')

for post in costco_subreddit.hot(limit=1500):
    costco_posts.append([post.author, 
                         post.created_utc, 
                         post.id, 
                         post.num_comments, 
                         post.score, 
                         post.selftext, 
                         post.subreddit, 
                         post.title, 
                         post.url])

In [10]:
# store the posts in a DataFrame
costco_posts_df = pd.DataFrame(costco_posts, columns=['author', 
                                                      'created_utc', 
                                                      'id', 
                                                      'num_comments', 
                                                      'score', 
                                                      'selftext', 
                                                      'subreddit', 
                                                      'title', 
                                                      'url'])

In [11]:
# view the first 5 posts in the DataFrame
costco_posts_df.head()

| |author|created_utc|id|num_comments|score|selftext|subreddit|title|url|
|---|---|---|---|---|---|---|---|---|---|
|0|dyzlexiK|1616428000.0|maqvfw|278|1378|Hey All,\n\nRecently we caught and banned a us...|Costco|Always use your best judgement before buying a...|https://www.reddit.com/r/Costco/comments/maqvf...|
|1|Nardelan|1623716000.0|o00s9e|8|112|\*\*Rule 5\*\*: Item name required in post titles ...|Costco|Please See The New Rule Added For Posting In T...|https://www.reddit.com/r/Costco/comments/o00s9...|
|2|daenu80|1624577000.0|o7bcie|110|673| |Costco|Whoever did this should have their membership ...|https://i.redd.it/lcvvvv1pqa771.jpg|
|3|Joe2700|1624591000.0|o7fdzm|6|38| |Costco|I'm going to be the coolest dad at the office....|https://imgur.com/rUbjJ2P|
|4|Front-Contribution91|1624574000.0|o7ame6|52|73| |Costco|Be honest, is it worth it?|https://i.redd.it/8fv3mb4gka771.jpg|


In [12]:
# view the shape of the DataFrame
costco_posts_df.shape

(964, 9)

In [13]:
# export the costco posts DataFrame as csv file
costco_posts_df.to_csv('datasets/costco_posts.csv')