<b>A. (Suggested duration: 90 mins)</b>  
With the given data for 548552 products, perform exploratory analysis and make
suggestions for further analysis on the following aspects.  
  
<b>1. Trustworthiness of ratings</b>
Ratings are susceptible to manipulation, bias etc. What can you say (quantitatively
speaking) about the ratings in this dataset?  

In [1]:
import pandas as pd
import datetime

We will begin by reading in the data and parsing it:

In [2]:
# Read in the data
import gzip
with gzip.open('amazon-meta.txt.gz', 'rt') as f:
    file_content = f.read()  # the with block closes the file on exit

In [3]:
# Parse and group the data by each product
grouped = []
add = []

for string in file_content.split('\n')[3:]:
    if string != '':
        add.append(string)
    else:
        grouped.append(add)
        add = []
        
grouped[0:2]        

[['Id:   0', 'ASIN: 0771044445', '  discontinued product'],
 ['Id:   1',
  'ASIN: 0827229534',
  '  title: Patterns of Preaching: A Sermon Sampler',
  '  group: Book',
  '  salesrank: 396585',
  '  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X',
  '  categories: 2',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
  '  reviews: total: 2  downloaded: 2  avg rating: 5',
  '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
  '    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5']]
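The parsing step above reads the whole file into memory and relies on a trailing blank line to close the last product. As a hedged alternative, the sketch below groups lines lazily. The name 'iter_groups' and the sample lines are illustrative assumptions, not part of the original analysis, and unlike the loop above it skips empty groups produced by consecutive blank lines.

```python
from itertools import islice

def iter_groups(lines):
    """Yield one list of non-empty lines per product, split on blank lines."""
    current = []
    for line in lines:
        line = line.rstrip('\n')
        if line:
            current.append(line)
        elif current:
            yield current
            current = []
    if current:  # last product when the input has no trailing blank line
        yield current

# Small in-memory sample in the same layout as the parsed output above.
sample = [
    'Id:   0\n', 'ASIN: 0771044445\n', '  discontinued product\n',
    '\n', '\n',
    'Id:   1\n', 'ASIN: 0827229534\n',
]
sample_groups = list(iter_groups(sample))

# On the real file, the three header lines could be skipped with islice:
# with gzip.open('amazon-meta.txt.gz', 'rt') as f:
#     for group in iter_groups(islice(f, 3, None)):
#         ...
```

Because the generator yields one product at a time, downstream steps can process each group without holding the full text in memory.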

We now have a list in which each item holds one product together with all of its information. To address the first question ("<b>Ratings are susceptible to manipulation, bias etc. What can you say (quantitatively speaking) about the ratings in this dataset?</b>"), we can extract the rating summary of each product into a dataframe and check whether it shows signs of manipulation.

In [4]:
# Extract rating info from each product group into a dataframe
rating_dict = {}

for group in grouped:
    ident, total, downloaded, avg_rating = '', '', '', ''
    skip = False  # reset for every product so the flag is always defined
    for item in group:
        if item.startswith('Id:'):
            ident = item.split()[-1]
        elif item.startswith('  reviews:'):
            total = item.split()[2]
            downloaded = item.split()[4]
            avg_rating = item.split()[7]
        elif item.startswith('  discontinued product'):
            skip = True
    # Discontinued products carry no rating information, so leave them out
    if not skip:
        rating_dict[ident] = [total, downloaded, avg_rating]
    
rating_df = pd.DataFrame.from_dict(rating_dict)
rating_df = rating_df.T.reset_index(drop=False)
rating_df.columns = ['id', 'total', 'downloaded', 'avg_rating']
rating_df = rating_df.iloc[1:]
rating_df['id']  = rating_df['id'].astype(int)
rating_df['total']  = rating_df['total'].astype(int)
rating_df['downloaded']  = rating_df['downloaded'].astype(int)
rating_df['avg_rating']  = rating_df['avg_rating'].astype(float)
rating_df = rating_df.sort_values('id')
rating_df.head()

|  | id | total | downloaded | avg_rating |
|---|---|---|---|---|
| 1 | 1 | 2 | 2 | 5.0 |
| 109902 | 2 | 12 | 12 | 4.5 |
| 219674 | 3 | 1 | 1 | 5.0 |
| 329477 | 4 | 1 | 1 | 4.0 |
| 439371 | 5 | 0 | 0 | 0.0 |


We now have a dataframe of rating information. Let's explore it:

In [5]:
# Are there products with more ratings than downloads?
rating_df[rating_df['total'] > rating_df['downloaded']].head()

|  | id | total | downloaded | avg_rating |
|---|---|---|---|---|
| 102215 | 193 | 261 | 260 | 3.0 |
| 292161 | 366 | 10 | 5 | 4.5 |
| 295451 | 369 | 416 | 5 | 5.0 |
| 297637 | 371 | 416 | 415 | 5.0 |
| 300934 | 374 | 7 | 5 | 3.5 |


In [6]:
# How many are there?
print('There are {} products with more ratings than downloads.'\
          .format(len(rating_df[rating_df['total'] > rating_df['downloaded']])))

There are 8615 products with more ratings than downloads.


In [7]:
# Take a look at the range of values
rating_df['total_downloaded_difference'] = rating_df['total'] - rating_df['downloaded']
rating_df = rating_df.sort_values('total_downloaded_difference', ascending=False)
rating_df.head()

|  | id | total | downloaded | avg_rating | total_downloaded_difference |
|---|---|---|---|---|---|
| 52948 | 148185 | 5034 | 5 | 5.0 | 5029 |
| 542125 | 99487 | 5033 | 5 | 5.0 | 5028 |
| 31510 | 128673 | 4922 | 5 | 5.0 | 4917 |
| 307144 | 379661 | 2925 | 5 | 4.5 | 2920 |
| 166426 | 251503 | 2925 | 5 | 4.5 | 2920 |


In [8]:
# What can we say about the range of values where total is greater than downloaded?

len_list = [4000,2000,1000,500,100,50,10,1]

for i in len_list:
    print('There are {} products with more than {} reviews greater than the number of downloads.'\
              .format(len(rating_df[rating_df['total_downloaded_difference'] > i]), i))
    if i == 1:
        print('There are {} products with {} or more reviews greater than the number of downloads.'\
                  .format(len(rating_df[rating_df['total_downloaded_difference'] >= i]), i))

There are 3 products with more than 4000 reviews greater than the number of downloads.
There are 12 products with more than 2000 reviews greater than the number of downloads.
There are 28 products with more than 1000 reviews greater than the number of downloads.
There are 56 products with more than 500 reviews greater than the number of downloads.
There are 300 products with more than 100 reviews greater than the number of downloads.
There are 549 products with more than 50 reviews greater than the number of downloads.
There are 1674 products with more than 10 reviews greater than the number of downloads.
There are 4512 products with more than 1 reviews greater than the number of downloads.
There are 8615 products with 1 or more reviews greater than the number of downloads.


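As we can see above, some products have an unusually large number of ratings compared to the number of product downloads: 28 products exceed their download count by more than 1000 ratings, and 3 exceed it by more than 4000. This could signify ratings manipulation. It is acceptable to see somewhat more ratings than downloads (taking into account unverified Amazon reviews), but I believe we are seeing more than an acceptable level. One caveat: this reading assumes that 'downloaded' counts product downloads. If the field instead counts the reviews captured when the data was collected, the gap would reflect incomplete collection rather than manipulation.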
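As a suggestion for further analysis, the individual review lines could be examined rather than only the per-product summary. The following is a minimal sketch, assuming review lines follow the layout seen in the parsed output (including the dataset's 'cutomer' spelling). The helper 'parse_review' and the sample list are hypothetical and were not part of the analysis above.

```python
import re
from collections import Counter

# Pattern for one review line, assuming the layout seen in the parsed output
# (the dataset itself spells the field "cutomer").
REVIEW_RE = re.compile(
    r'^\s*(?P<date>\d{4}-\d{1,2}-\d{1,2})\s+cutomer:\s+(?P<customer>\S+)'
    r'\s+rating:\s+(?P<rating>\d+)\s+votes:\s+(?P<votes>\d+)'
    r'\s+helpful:\s+(?P<helpful>\d+)\s*$'
)

def parse_review(line):
    """Return a dict for a review line, or None if the line is not a review."""
    match = REVIEW_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    for key in ('rating', 'votes', 'helpful'):
        record[key] = int(record[key])
    return record

sample_lines = [
    '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
    '    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5',
    '  reviews: total: 2  downloaded: 2  avg rating: 5',  # summary line, not a review
]

reviews = [r for r in map(parse_review, sample_lines) if r is not None]
rating_counts = Counter(r['rating'] for r in reviews)           # rating distribution
reviews_per_customer = Counter(r['customer'] for r in reviews)  # reviewer activity
```

On the full data, a heavily skewed 'rating_counts' or a handful of customers with very many reviews would be natural candidates for a closer look.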

<b>2. Category bloat</b>
Consider the product group named 'Books'. Each product in this group is associated with
categories. Naturally, with categorization, there are tradeoffs between how broad or
specific the categories must be.  
  
For this dataset, quantify the following:  
a. Is there redundancy in the categorization? How can it be identified/removed?  
b. Is it possible to reduce the number of categories drastically (say to 10% of existing categories) by sacrificing relatively few category entries (say close to 10%)?

In [9]:
# Let's look at the grouped data again: Is there redundancy?
grouped[0:2]

[['Id:   0', 'ASIN: 0771044445', '  discontinued product'],
 ['Id:   1',
  'ASIN: 0827229534',
  '  title: Patterns of Preaching: A Sermon Sampler',
  '  group: Book',
  '  salesrank: 396585',
  '  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X',
  '  categories: 2',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
  '  reviews: total: 2  downloaded: 2  avg rating: 5',
  '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
  '    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5']]

Books with more than one category can repeat most of their hierarchy: in the example above, the two branches share their first five levels and differ only in the last one. Let's dig deeper:
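This kind of overlap can be measured directly. The sketch below counts how many leading levels two branches share, using the two branches of product 1 from the output above. The function name 'shared_depth' is an illustrative assumption.

```python
def shared_depth(branch_a, branch_b):
    """Count the leading category levels that two branches have in common."""
    levels_a = branch_a.strip().split('|')[1:]
    levels_b = branch_b.strip().split('|')[1:]
    depth = 0
    for level_a, level_b in zip(levels_a, levels_b):
        if level_a != level_b:
            break
        depth += 1
    return depth

branch_1 = ('   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]'
            '|Christianity[12290]|Clergy[12360]|Preaching[12368]')
branch_2 = ('   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]'
            '|Christianity[12290]|Clergy[12360]|Sermons[12370]')

print(shared_depth(branch_1, branch_2))  # 5 of the 6 levels are repeated
```

Averaging this depth over all branch pairs of each book would give one possible redundancy measure for the whole group.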

In [10]:
# Extract category info from each book into a list
category_list = []
book_count = 0

for group in grouped:
    prod_type = ''
    for item in group:
        if item.startswith('  group:'):
            prod_type = item.split()[-1]
        if prod_type == 'Book':
            if item.startswith('   |'):
                category_list.append(item.strip())
    if prod_type == 'Book':
        book_count += 1
            
print('There are {} total books.'.format(book_count))
print('There are {} total category branches across all books.'.format(len(category_list)))
print('There are {} unique category branches across all books.'.format(len(set(category_list))))

There are 393561 total books.
There are 1440329 total category branches across all books.
There are 12853 unique category branches across all books.


In [11]:
# How many sub-categories are there?
sub_category_list = []
for branch in category_list:
    for sub_cat in branch.split('|')[1:]:
        sub_category_list.append(sub_cat)
        
print('There are {} total sub-categories across all books.'.format(len(sub_category_list)))
print('There are {} unique sub-categories across all books.'.format(len(set(sub_category_list))))

There are 7891047 total sub-categories across all books.
There are 14923 unique sub-categories across all books.


In [12]:
# Show examples of sub-categories
sub_category_list[0:5]

['Books[283155]',
 'Subjects[1000]',
 'Religion & Spirituality[22]',
 'Christianity[12290]',
 'Clergy[12360]']

In [13]:
# Create a dataframe to show sub-category counts
from collections import Counter

cat_counted = pd.DataFrame.from_dict(Counter(sub_category_list), orient='index').reset_index()
cat_counted = cat_counted.sort_values(0, ascending=False)
cat_counted.head()

|  | index | 0 |
|---|---|---|
| 11993 | Books[283155] | 1286848 |
| 13033 | Subjects[1000] | 1222638 |
| 8500 | Children's Books[4] | 134263 |
| 5409 | Amazon.com Stores[285080] | 123925 |
| 1224 | [265523] | 123925 |


If category entries are concentrated in a small number of sub-category titles, we can keep only the most common titles and drop the long tail of rare ones (assuming that this does not harm the searching structure too much). The check below asks whether roughly the top 10% of titles account for roughly 90% of all sub-category entries:

In [14]:
# Determine what share of all sub-category entries the top ~10% of titles account for
print('The first {:.2f}% of largest sub-category titles makes up {:.2f}% of the sub-category volume.'\
          .format((len(cat_counted.iloc[0:1400])/len(cat_counted))*100, 
                  cat_counted.iloc[0:1400][0].sum()/cat_counted[0].sum()*100))

The first 9.38% of largest sub-category titles makes up 90.04% of the sub-category volume.


According to our analysis, keeping the 1400 most frequently occurring sub-category titles (9.38% of the 14923 unique titles) and dropping the rest would reduce the number of titles by about 90% while sacrificing only about 10% (9.96%) of sub-category entries.
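The cutoff of 1400 titles in the cell above is hard-coded. As a hedged sketch, it could instead be derived from the counts by finding the smallest number of most frequent titles that reaches a target share of entries. The function 'titles_needed' and the toy counts are hypothetical, and on the real data the input would be a Counter like the one built from 'sub_category_list'.

```python
from collections import Counter

def titles_needed(counts, target_share):
    """Return how many of the most frequent titles cover target_share of entries."""
    total = sum(counts.values())
    covered = 0
    for n_titles, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target_share:
            return n_titles
    return len(counts)

# Toy counts for illustration only (not taken from the dataset).
toy_counts = Counter({'Books': 50, 'Subjects': 41, 'Fiction': 4, 'Poetry': 3, 'Drama': 2})

print(titles_needed(toy_counts, 0.90))  # 2: the two largest titles cover 91 of 100 entries
```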

<b>B. (Suggested duration: 30 mins)</b>  
Give the number crunching a rest! Just think about these problems.  
  
<b>1. Algorithm thinking</b>  
How would you build the product categorization from scratch, using similar/co-purchased
information?  
  
<b>2. Product thinking</b>  
Now, put on your 'product thinking' hat:  
a. Is it a good idea to show users the categorization hierarchy for items?  
b. Is it a good idea to show users similar/co-purchased items?  
c. Is it a good idea to show users reviews and ratings for items?  
d. For each of the above, why? How will you establish the same?  

To build the product categorization from scratch, we could represent each product as a vector of its co-purchase information and cluster those vectors to determine how many groups of products exist. From these groups of products we could compute cluster statistics and create product categories based on those statistics.  
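As a rough starting point for such a grouping, the 'similar' lines can be treated as edges of a co-purchase graph. The sketch below finds connected components of that graph, which is a much cruder stand-in for the vector clustering described above, not an implementation of it. The helper names and the 'TOY-' products are made up for illustration. Only the first 'similar' line comes from the parsed output.

```python
from collections import defaultdict

def parse_similar(line):
    """Return the ASINs listed on a '  similar: N  A  B ...' line."""
    return line.split()[2:]

def connected_components(edges):
    """Group the nodes of an undirected graph into connected components."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk from the start node
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbours[node] - seen)
        components.append(component)
    return components

# Toy input: one real 'similar' line plus made-up products.
similar_by_asin = {
    '0827229534': parse_similar(
        '  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X'),
    'TOY-A': ['TOY-B'],
    'TOY-B': ['TOY-C'],
}
edges = [(asin, other) for asin, others in similar_by_asin.items() for other in others]
groups = connected_components(edges)
```

On real data a single component would likely absorb most products, so a finer method (for example, clustering the co-purchase vectors as described above) would still be needed to split it into usable categories.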
  
It is not a good idea to show users too much of the categorization hierarchy. Too much information causes overload and is distracting. Some hierarchy is useful, however, so that users can navigate the site more directly.  
  
It is a good idea to show users similar/co-purchased items as long as they are not the main focus of the page. Since these items are frequently bought by others along with the main item of interest, the user may well need them too. Even a user who is not looking for those additional items may realize that they are needed and choose to buy them in that moment.
  
It is a good idea to show users reviews and ratings for the items they are searching for, since these serve as external, unbiased validation for new customers that the products are reliable. In addition, reviews create a community that builds trust among users and with the site itself.