# Subreddit Post Classification (Google Home vs Amazon Echo)

<br>
Adel Alsagoff<br>
General Assembly<br>
April 22, 2021<br>


## Problem Statement
<br>

Google's smart speaker system, Google Home, was designed to compete with the popular Amazon Echo. Both products serve as a vehicle for their respective voice-activated virtual assistants, which connect to the internet.

Reddit users have used the platform as a forum to discuss their experiences with the products. I have been tasked by Google's Research team to analyze customer sentiment towards Google Home from subreddit posts on 'r/GoogleHome'.

Additionally, the Research team would like to find out which customer pain points are common to both the Google Home and the Amazon Echo and which are unique to each, with the goal of designing a better product. Therefore, subreddit posts from 'r/AmazonEcho' are also included in the dataset.

A Random Forest model will also be built to predict, based on selected features, whether a given set of words refers to a discussion of the Amazon Echo or the Google Home.

Each subreddit post is represented as a 'document' in the dataset, and therefore both terms will be used interchangeably.


## Executive Summary


## Data Dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---|
|sentiment_score|float|modelling_data|Measure of compound sentiment of a given document (ordinal)|
|word_count|integer|modelling_data|Number of words in the document|
|linktype|object|modelling_data|Type of link found in the document|
|selftext_|object|modelling_data|Post found on the subreddit|
|num_comments|integer|modelling_data|Number of comments per post|
|subreddit_.|object|modelling_data|Which subreddit the document belongs to|


**Contents**
- [Data-collection](#Data-collection)
- [Data-cleaning](#Data-cleaning)


## Data Collection & Cleaning

### Dropping posts regarded as spam

For this project, the Pushshift Reddit API was used to scrape 3000 posts per subreddit. This API was designed and created by the /r/datasets mod team to provide enhanced functionality and search capabilities for Reddit comments and submissions. It was far easier to use than Beautiful Soup, and I would recommend it to anybody doing a similar project.
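
The paging logic behind such a scrape can be sketched as follows. This is a minimal sketch, not the project's actual code: the endpoint URL, the `subreddit`, `size` and `before` parameters, and the assumption that results come back newest first reflect the Pushshift submission search as it was publicly documented at the time, and `fetch_page` and `collect_posts` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: Pushshift's submission search (availability may have changed).
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"


def fetch_page(subreddit, before=None, size=100):
    """Request one page of submissions (hypothetical helper, assumes newest first)."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        params["before"] = before  # only return posts older than this epoch time
    url = PUSHSHIFT_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["data"]


def collect_posts(subreddit, total=3000, fetch=fetch_page):
    """Page backwards through a subreddit until `total` posts are collected."""
    posts, before = [], None
    while len(posts) < total:
        page = fetch(subreddit, before=before)
        if not page:  # no older posts left
            break
        posts.extend(page)
        before = page[-1]["created_utc"]  # continue from the oldest post seen
    return posts[:total]
```

Passing the page fetcher in as an argument keeps the paging loop testable without network access.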

The dataset was then cleaned by dropping posts with recurring URLs. These posts tend to have no selftext and carry variations of the same title, even though they were written by unique authors. The majority of these URLs were links to promotions or deals, but they leaned towards 'spam' or even phishing sites. It is worth mentioning that 7.07% of the Google Home subreddit posts were of this type, compared to 1.01% of the Amazon Echo subreddit posts.

These documents were therefore removed as they were not meaningful for our predictions.
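
One way to express this cleaning step in pandas is sketched below. It assumes the posts sit in a dataframe with a `url` column (the column name and the helper `drop_recurring_urls` are illustrative, not taken from the project) and treats any URL that appears more than once as recurring.

```python
import pandas as pd


def drop_recurring_urls(df, url_col="url"):
    """Drop every post whose URL appears more than once in the dataframe."""
    has_url = df[url_col].notna()
    # keep=False flags all copies of a repeated URL, not just the later ones;
    # missing URLs are excluded so that they are not treated as duplicates.
    repeated = df[url_col].duplicated(keep=False) & has_url
    return df[~repeated].reset_index(drop=True)
```

Comparing `len(df)` before and after the call gives the share of posts removed per subreddit.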

## Exploratory Data Analysis

### Forming inferences from missing data

917 documents from the Google Home subreddit and 765 documents from the Amazon Echo subreddit had titles but no selftext. Additionally, 624 Google Home and 388 Amazon Echo posts were missing values in the post topic column. These gaps were investigated further to understand the possible reasons, and inferences were formed from the findings.
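
Counts like these can be produced with a small helper such as the one below. It is a sketch under assumptions: the column names `subreddit`, `selftext` and `link_flair_text` follow Reddit's field names, the helper name is hypothetical, and empty strings are counted as missing alongside nulls.

```python
import pandas as pd


def missing_by_subreddit(df, cols=("selftext", "link_flair_text")):
    """Count missing (null or empty-string) values per subreddit for each column."""
    subset = df[list(cols)]
    missing = subset.isna() | subset.eq("")
    return missing.groupby(df["subreddit"]).sum()
```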

### **Figure 1**

![test](../images/count-of-post-topic-type.png)



These posts were found to have titles that more or less capture the sentiment or overall message that the redditor is expressing, although some posts were peculiar or nonsensical while still having something to do with the product.

More than half of each subreddit dataframe has a link_flair_text category. Posts in categories such as 'Question' and 'Bug' are more than likely ones where the author felt it would be redundant to elaborate in 'selftext', either because the title was self-explanatory or because an image or URL was used instead of text.

These posts were likely questions requesting assistance or interaction in the form of comments. This inference can be supported by comparing the types of posts assigned to each document in link_flair_text, and by comparing the distribution of comments or the presence of an image. This led to the inference that both subreddits are mostly a platform for troubleshooting.




## Feature Engineering

### Bin similar topics together

The purpose of separating images and URLs is to see whether there is any correlation between them, the respective subreddits and the topic category. My assumption is that images would appear more in topics associated with issues with the product, such as 'Question', 'Technical Issue', 'Help', 'Bug', 'Commands | How To's' and 'Alexa Skill'.

'Features WishList', 'Feature Request' and 'Skill Request' are likewise equivalent: in each, authors express what they would like to have in the next generation of products. The same applies to 'Hacks' and 'Tips', where authors share an interesting workaround to their problems, and to 'Product Review' and 'Review'.

It would be better to bin these categories together, as this should result in more accurate predictions from our model.
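
The grouping described above can be written as a simple lookup. In this sketch the flair names come from the text, while the bin labels (`issue`, `request`, `tips`, `review`, `other`) and the helper name `bin_topics` are illustrative choices rather than the project's own.

```python
import pandas as pd

# Flair names as listed above; the bin labels are illustrative.
TOPIC_BINS = {
    "Question": "issue",
    "Technical Issue": "issue",
    "Help": "issue",
    "Bug": "issue",
    "Commands | How To's": "issue",
    "Alexa Skill": "issue",
    "Features WishList": "request",
    "Feature Request": "request",
    "Skill Request": "request",
    "Hacks": "tips",
    "Tips": "tips",
    "Product Review": "review",
    "Review": "review",
}


def bin_topics(flair, other="other"):
    """Map raw flair text to a bin; unlisted flairs become `other`, missing stays missing."""
    binned = flair.map(TOPIC_BINS)
    return binned.where(flair.isna() | binned.notna(), other)
```

Keeping missing flairs as missing, instead of folding them into `other`, preserves the distinction between unlabelled posts and posts with an uncommon label.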


### Distribution of Post Topic Type per Subreddits

Based on the above barplot, where only 3196 documents are labelled with a post type, we can still clearly see that the majority in both subreddits are posts pertaining to troubleshooting.

From here, I would like to know if people posting are mentioning the respective customer support sites as a solution to their issues.

## Check for links to customer support website or mention of site

From the above, rather surprisingly, authors do not mention the customer support sites of the respective products. This may indicate that the customer support sites are not very helpful in fixing the authors' issues with the products.
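
A check of this kind can be sketched with a regular expression. The patterns below are assumptions chosen for illustration (the project's actual search terms are not given), and `mentions_support` is a hypothetical helper name.

```python
import re

# Hypothetical patterns: support URLs and generic phrases, not the project's own list.
SUPPORT_PATTERN = re.compile(
    r"support\.google\.com|amazon\.com/(?:gp/)?help|customer (?:support|service)",
    re.IGNORECASE,
)


def mentions_support(text):
    """Return True if the text links to or names a customer support channel."""
    if not isinstance(text, str):  # missing selftext (None or NaN)
        return False
    return SUPPORT_PATTERN.search(text) is not None
```

Applied to a dataframe column, for example with `df["selftext"].map(mentions_support).sum()`, this gives the number of posts that mention customer support.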

## Investigate relationship between topic and linktype

I had assumed that images would fare the highest under issues, thinking that many authors would share pictures showing where their issues lie for easier troubleshooting. Images and links are indeed higher for issues than for other topics. However, the bar chart above shows that the majority of posts contain only text and no links of any type.
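
The relationship can also be tabulated directly. The sketch below assumes a binned topic column named `topic` (a hypothetical name) alongside the `linktype` feature from the data dictionary, and reports the share of each link type within every topic.

```python
import pandas as pd


def topic_linktype_share(df, topic_col="topic", link_col="linktype"):
    """Row-normalised cross-tabulation: share of each link type within a topic."""
    return pd.crosstab(df[topic_col], df[link_col], normalize="index")
```

Normalising by row makes topics of different sizes comparable, which raw counts in a bar chart do not.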

## Find distribution of word counts

The word counts for the respective subreddits follow the same distribution, although the Google Home subreddit has more posts in certain bins.

After 400 words there are some outliers. It would be interesting to explore these posts to understand the reason for their length.
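
A minimal sketch of how `word_count` could be derived and the long posts isolated is shown below. It assumes whitespace tokenisation of a `selftext` column, and the helper names are hypothetical; the 400-word threshold is the one mentioned above.

```python
import pandas as pd


def add_word_count(df, text_col="selftext"):
    """Add a word_count column based on whitespace-separated tokens."""
    out = df.copy()
    out["word_count"] = out[text_col].fillna("").str.split().str.len()
    return out


def long_posts(df, threshold=400):
    """Return the posts whose word count exceeds the threshold."""
    return df[df["word_count"] > threshold]
```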

## Find most frequently used words and bigrams

There are words common to both subreddits, such as 'app', 'device', 'devices' and 'music', which possibly indicates that authors are describing issues in these areas of the products. Perhaps the respective apps can be buggy, and playing music is where the products have connectivity problems.

As for unique words, the obvious ones are brand names such as 'Google' and 'Amazon', which is not very useful information. However, the unique words in the Google Home subreddit include 'mini', a new Google product, similar to 'dot' in the Amazon Echo subreddit.

In the Google Home subreddit, words such as 'tv', 'phone', 'speaker' and 'lights' are present, and they are not in the Amazon Echo subreddit. A supposition could be that owners of Google Home have more specific issues, as these words do not turn up among the Amazon Echo unique words.

Moving forward, counts of bigrams are to be explored.
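
Word and bigram frequencies of this kind can be sketched with the standard library alone. The stop word list here is a tiny illustrative one and the function names are hypothetical; the project may well have used a vectorizer from a machine learning library instead.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "my", "and", "it", "i", "of", "on", "in"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def top_ngrams(docs, n=1, k=10):
    """Return the k most common n-grams (as tuples) across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        # n-grams are formed after stop word removal, so they may span removed words
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


def unique_terms(docs_a, docs_b, k=20):
    """Top-k terms that appear in one corpus's top list but not the other's."""
    top_a = {term for term, _ in top_ngrams(docs_a, k=k)}
    top_b = {term for term, _ in top_ngrams(docs_b, k=k)}
    return top_a - top_b, top_b - top_a
```

Calling `top_ngrams(docs, n=2)` gives the bigram counts, and `unique_terms` compares the two subreddits' top terms.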

## Sentiment Analysis with Vader lexicon

Conducting a sentiment analysis would be useful to quantify the average sentiment of posts per subreddit. Correlations with other features would also be interesting to see.
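
A sketch of this step is given below, assuming NLTK's VADER implementation is the one in use (the text names only the lexicon). `vader_compound`, `add_sentiment` and `mean_sentiment` are hypothetical helpers, and the column names `selftext` and `subreddit` are assumptions. The scorer is passed in as a function so that the dataframe logic does not depend on the lexicon download.

```python
import pandas as pd


def vader_compound():
    """Build a text -> compound score function backed by NLTK's VADER."""
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]


def add_sentiment(df, scorer, text_col="selftext"):
    """Add a sentiment_score column by applying the scorer to each document."""
    out = df.copy()
    out["sentiment_score"] = out[text_col].fillna("").map(scorer)
    return out


def mean_sentiment(df):
    """Average sentiment score per subreddit."""
    return df.groupby("subreddit")["sentiment_score"].mean()
```

With NLTK installed, `mean_sentiment(add_sentiment(df, vader_compound()))` yields one average compound score per subreddit, each between -1 and 1.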

## Finding Correlation Between Features

<b>Observations:</b><br>
    - Most features have weak correlations with each other, suggesting that they are largely independent.<br> 
    - num_comments has a positive relationship with image, suggesting that people tend to engage more with posts that include pictures.<br>
    - sentiment_score has a negative relationship with image. This is likely due to authors posting issues with their product accompanied by an image.<br>
    - Against my own intuition, sentiment_score has a positive relationship with the number of words. I had expected a negative relationship, as people tend to write lengthier posts when they are upset, although this positive relationship is still rather weak.<br>
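
A correlation matrix of this kind can be sketched as follows. The numeric column names come from the data dictionary. One-hot encoding `linktype` to obtain an `image` column and using the Pearson coefficient (the pandas default) are assumptions about how the original matrix was built.

```python
import pandas as pd


def feature_correlations(
    df,
    numeric_cols=("sentiment_score", "word_count", "num_comments"),
    cat_col="linktype",
):
    """Pearson correlations between numeric features and one-hot link types."""
    dummies = pd.get_dummies(df[cat_col], dtype=float)
    features = pd.concat([df[list(numeric_cols)], dummies], axis=1)
    return features.corr()
```

The resulting dataframe can be passed to a heatmap plot to read off relationships such as `num_comments` against `image`.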
    
    