To build the word count distribution, we start by placing every review for a product into a “word count group”. For example, a 23-word review would fall into the “21-25 word count group”, a 109-word review into the “101-125 word count group”, and a 600-word review into the “201+ word count group”. The share of reviews in each group is the product’s word count distribution. On its own, that distribution doesn’t tell us much: we need something to compare it to. That is why we also compute the word count distribution for all of the reviews in the product’s category (category2), which gives us the expected word count distribution.
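
The grouping step can be sketched in plain Python as follows. This is a minimal illustration rather than the notebook's own implementation: the bin edges and labels are borrowed from the `pd.cut` call used later in this notebook (so they differ from the illustrative groups quoted above), and the helper names `word_count_group` and `word_count_distribution` are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Upper edges of the right-closed bins; anything above 200 lands in the last group.
# Edges and labels are assumed from the pd.cut call later in this notebook.
UPPER_EDGES = [5, 15, 25, 40, 65, 100, 200]
LABELS = ['0 - 5 words', '6 - 15 words', '16 - 25 words', '26 - 40 words',
          '41 - 65 words', '66 - 100 words', '101 - 200 words', '200+']


def word_count_group(text):
    """Return the word count group label for one review text (hypothetical helper)."""
    n_words = len(text.split())
    return LABELS[bisect_left(UPPER_EDGES, n_words)]


def word_count_distribution(reviews):
    """Return the share of reviews that falls into each word count group."""
    counts = Counter(word_count_group(text) for text in reviews)
    total = len(reviews)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Three made-up reviews with 2, 10 and 23 words.
reviews = ["great product", "works fine " * 5, "word " * 23]
distribution = word_count_distribution(reviews)
```

Running the same function over all reviews in a category, instead of one product, would give the expected distribution to compare against.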

Once we have the product’s word count distribution and the expected distribution for its category, we compare the two and identify the word count groups that hold a higher share of the product’s reviews than we’d expect. For each of these groups we run a significance test to check that the excess reflects a real overrepresentation rather than random chance or a lack of data points. This matters because a product with few reviews is likely to show more variance from chance alone. If the test finds the difference statistically significant, we label that group an Overrepresented Word Count Group.
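
The text does not say which significance test is used, so the sketch below shows only one plausible choice: a one-sided exact binomial test that asks how likely it is to see at least the observed number of reviews in a group if the product followed the category's expected share. The function names, the example counts, and the 0.05 threshold are all assumptions made for illustration.

```python
import math


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing the pmf in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def overrepresented_groups(product_counts, category_shares, alpha=0.05):
    """Return {group: p_value} for groups significantly above the expected share.

    product_counts: reviews per word count group for one product.
    category_shares: expected share per group, taken from the category.
    alpha: significance threshold (0.05 is an assumed value, not from the source).
    """
    n = sum(product_counts.values())
    flagged = {}
    for group, count in product_counts.items():
        expected = category_shares[group]
        if n == 0 or count / n <= expected:
            continue  # only groups larger than expected are tested
        p_value = binom_upper_tail(count, n, expected)
        if p_value < alpha:
            flagged[group] = p_value
    return flagged


# Made-up numbers: the category expects 20% of reviews in the shortest group.
category_shares = {'0 - 5 words': 0.2, '6 - 15 words': 0.4, '16 - 25 words': 0.4}
large_product = {'0 - 5 words': 30, '6 - 15 words': 10, '16 - 25 words': 10}
small_product = {'0 - 5 words': 2, '6 - 15 words': 1, '16 - 25 words': 0}

flagged_large = overrepresented_groups(large_product, category_shares)
flagged_small = overrepresented_groups(small_product, category_shares)
```

Both example products have most of their reviews in the shortest group, but only the product with 50 reviews is flagged; with 3 reviews the same pattern is well within what chance would produce.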

In [1]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# One-time preprocessing, kept commented out; it writes the two CSV files loaded below.
# # Load the review dataset
# reviews = pd.read_csv("RSC reviews with profile ids.csv")
# # Load the sales dataset
# sales = pd.read_csv("SalesRankExport_f0337c16-d7f3-4fc0-a46b-a0e14f18b595.csv")
# # Extract only columns of interest
# sales = sales[['id','category_id2']]
# # Take only the unique product id
# sales = sales.drop_duplicates('id')  #shape(2525,2)
# sales.to_csv("Sales.csv")

# # Now let's compile the two dataframes to identify the category of each product in the reviews dataset
# compiled = pd.merge(reviews,sales, how = 'inner', left_on = "product", right_on="id")
# compiled.to_csv("RSC_reviews_with_category.csv")

## Word Count Comparison

### Objective: compare each product's word count distribution with that of its category
### Khaled's outline
1. Filter columns
2. Create the totalwords metric (word count per review)
3. Cut word counts into bins with ranges
4. Aggregate bins across products and across categories
5. Merge the product table with the sales table to attach each product's category
6. Merge in the category table by joining on category_id2

### No change at all

In [4]:
# Load the prepared datasets and keep only the columns of interest
df = pd.read_csv("RSC_reviews_with_category.csv")
sales = pd.read_csv("Sales.csv")
df = df[['product', 'text', 'category_id2']]


# Create the word count column (number of whitespace-separated tokens per review)
df['totalwords'] = df['text'].str.split().str.len()

# Assign each review to a labelled word count bin.
# Bins are right-closed, e.g. (5, 15] is '6 - 15 words';
# include_lowest=True keeps 0-word reviews in the first bin instead of NaN.
word_bin_edges = [0, 5, 15, 25, 40, 65, 100, 200, 100000]
word_bin_labels = ['0 - 5 words', '6 - 15 words', '16 - 25 words', '26 - 40 words', '41 - 65 words', '66 - 100 words', '101 - 200 words', '200+']
df['word_bins'] = pd.cut(x=df['totalwords'], bins=word_bin_edges, labels=word_bin_labels, include_lowest=True)

# Create a dataframe to aggregate word bins across products & categories
# Normalize to get proportions
product_aggregation = pd.crosstab(df["product"], df["word_bins"], margins=True, normalize='index')
category_aggregation = pd.crosstab(df["category_id2"], df["word_bins"], margins=True, normalize='index')


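# Attach each product's category, then merge the product and category feature tables.
# Suffixes distinguish the product's share from the category's share of each bin.
product_aggregation = pd.merge(product_aggregation, sales, how='inner', left_on="product", right_on="id")
product_aggregation = pd.merge(product_aggregation, category_aggregation, how='inner', left_on="category_id2", right_on="category_id2", suffixes=('_product', '_category'))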

#product_aggregation.to_csv("Word_count_features.csv")