# KeepUp Take-Home Challenge

For this exercise, you will analyze a dataset from Amazon. The data format and a
sample entry are shown on the next page.

## Part A
(Suggested duration: 90 mins)
With the given data for 548552 products, perform exploratory analysis and make
suggestions for further analysis on the following aspects.

### Question 1. Trustworthiness of ratings
Ratings are susceptible to manipulation, bias etc. What can you say (quantitatively
speaking) about the ratings in this dataset?


In [76]:
#import required libraries
import pandas as pd
import numpy as np
from datetime import datetime
%matplotlib inline
import matplotlib.pyplot as plt

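d = {}
values = []
key = 'Header'  # lines before the first "Id:" line are stored under this key

# read the file, starting a new dictionary entry at each line containing "Id:"
with open('amazon-meta.txt', encoding="utf8") as fp:
    for line in fp:
        if "Id:" in line:
            d[key] = values
            key = line.rstrip('\n')
            values = []
        else:
            values.append(line.rstrip('\n'))
# store the final entry, which is not followed by another "Id:" line
d[key] = values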

In [64]:
# first 3 entries in dictionary
dict(list(d.items())[:3])

{'Header': ['# Full information about Amazon Share the Love products',
  'Total items: 548552',
  ''],
 'Id:   0': ['ASIN: 0771044445', '  discontinued product', ''],
 'Id:   1': ['ASIN: 0827229534',
  '  title: Patterns of Preaching: A Sermon Sampler',
  '  group: Book',
  '  salesrank: 396585',
  '  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X',
  '  categories: 2',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
  '  reviews: total: 2  downloaded: 2  avg rating: 5',
  '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
  '    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5',
  '']}

In [78]:
# last item in dictionary
dict(list(d.items())[-1:])

{'Id:   548551': ['ASIN: B00005MHUG',
  "  title: That Travelin' Two-Beat/Sings the Great Country Hits",
  '  group: Music',
  '  salesrank: 0',
  '  similar: 5  B00008OETQ  B00005O6KL  B00006RY87  B0002OTI98  B0000634HG',
  '  categories: 6',
  '   |Music[5174]|Styles[301668]|Miscellaneous[35]|Nostalgia[67175]',
  '   |Music[5174]|Styles[301668]|Broadway & Vocalists[265640]|Classic Vocalists[67178]',
  '   |Music[5174]|Styles[301668]|Broadway & Vocalists[265640]|Traditional Vocal Pop[513060]',
  '   |Music[5174]|Styles[301668]|Pop[37]|Vocal Pop[406646]|General[513062]',
  '   |Music[5174]|Styles[301668]|Pop[37]|Vocal Pop[406646]|Classic[604206]',
  '   |Music[5174]|Styles[301668]|Broadway & Vocalists[265640]|General[912012]',
  '  reviews: total: 1  downloaded: 1  avg rating: 5',
  '    2004-5-31  cutomer:  ABTSEEYVYQ52M  rating: 5  votes:   9  helpful:   9',
  '']}

In [122]:
# Extract rating info into df
rating_dict = {}
discont = 0

for k in d.keys():
    ident = k.split()[-1]
    for v in d[k]:
        if v.startswith('  reviews:'):
            total = v.split()[2]
            downloaded = v.split()[4]
            avg_rating = v.split()[7]
            rating_dict[ident] = [total, downloaded, avg_rating]
        elif v.startswith('  discontinued product'):
            discont += 1

In [132]:
rating_df = pd.DataFrame.from_dict(rating_dict)
rating_df = rating_df.T.reset_index(drop=False)
rating_df.columns = ['id', 'total', 'downloaded', 'avg_rating']
# 'id' is left as a string for now: a few ids are non-numeric and are fixed below
rating_df['total']  = rating_df['total'].astype(int)
rating_df['downloaded']  = rating_df['downloaded'].astype(int)
rating_df['avg_rating']  = rating_df['avg_rating'].astype(float)
rating_df = rating_df.sort_values('id')

In [134]:
print('Number of Discontinued Items:', discont)

Number of Discontinued Items: 5868


In [133]:
rating_df.head()

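|   | id | total | downloaded | avg_rating |
|---|----|-------|------------|------------|
| 0 | 1 | 2 | 2 | 5.0 |
| 1 | 10 | 6 | 6 | 4.0 |
| 2 | 100 | 0 | 0 | 0.0 |
| 3 | 1000 | 1 | 1 | 5.0 |
| 4 | 10000 | 0 | 0 | 0.0 |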


In [137]:
rating_df.tail()

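|   | id | total | downloaded | avg_rating |
|---|----|-------|------------|------------|
| 542679 | 99999 | 1 | 1 | 4.0 |
| 542680 | B | 4 | 4 | 4.5 |
| 542681 | Film | 10 | 10 | 3.0 |
| 542682 | Princes) | 0 | 0 | 0.0 |
| 542683 | Problematiques | 0 | 0 | 0.0 |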


In [136]:
548552 - discont

542684

In [138]:
# replace the four non-numeric "id" values with the correct product ids
# (found using "Find" in the txt file); only the 'id' column is touched
id_fixes = {'Problematiques': '255657', 'Princes)': '468382',
            'B': '489629', 'Film': '119182'}
rating_df['id'] = rating_df['id'].replace(id_fixes)

In [139]:
rating_df.tail()

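|   | id | total | downloaded | avg_rating |
|---|----|-------|------------|------------|
| 542679 | 99999 | 1 | 1 | 4.0 |
| 542680 | 489629 | 4 | 4 | 4.5 |
| 542681 | 119182 | 10 | 10 | 3.0 |
| 542682 | 468382 | 0 | 0 | 0.0 |
| 542683 | 255657 | 0 | 0 | 0.0 |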
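
The four non-numeric ids are what one would expect if a later line of a record (for example a title) happens to contain the text `Id:`, since the reading loop tests `"Id:" in line`. Below is a minimal sketch of a stricter split, assuming record headers always begin at the start of the line with `Id:`. The function name `split_records`, the ASIN, and the sample title are hypothetical and not part of the original notebook.

```python
def split_records(lines):
    """Group raw lines into records keyed by the integer after a leading 'Id:'."""
    records = {}
    current = None
    for raw in lines:
        line = raw.rstrip('\n')
        # Only a line that *starts* with 'Id:' opens a new record
        if line.startswith('Id:'):
            current = int(line.split()[1])
            records[current] = []
        elif current is not None:
            # Lines before the first record header are ignored
            records[current].append(line)
    return records


# Hypothetical sample: the title contains "Id:" but must not start a record
sample = [
    'Id:   7\n',
    'ASIN: 0000000000\n',
    '  title: The Ego and the Id: A Film\n',
    '  reviews: total: 1  downloaded: 1  avg rating: 4\n',
]
records = split_records(sample)
print(sorted(records))
print(len(records[7]))
```

With a split like this, the ids could be converted to integers directly and the manual replacement step would likely become unnecessary.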


In [140]:
rating_df['id'] = pd.to_numeric(rating_df['id'])

In [141]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 542684 entries, 0 to 542683
Data columns (total 4 columns):
id            542684 non-null int64
total         542684 non-null int64
downloaded    542684 non-null int64
avg_rating    542684 non-null float64
dtypes: float64(1), int64(3)
memory usage: 20.7 MB


In [143]:
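# sort numerically now that id is an integer, then use it as the index
rating_df = rating_df.sort_values('id')
rating_df.index = rating_df.id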

In [144]:
rating_df.head()

| id (index) | id | total | downloaded | avg_rating |
|------------|----|-------|------------|------------|
| 1 | 1 | 2 | 2 | 5.0 |
| 2 | 2 | 12 | 12 | 4.5 |
| 3 | 3 | 1 | 1 | 5.0 |
| 4 | 4 | 1 | 1 | 4.0 |
| 5 | 5 | 0 | 0 | 0.0 |


In [120]:
rating_df.avg_rating.mean()

3.2095311627598431

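#### The mean rating is 3.21

This figure includes products with no reviews, which are recorded with an `avg_rating` of 0.0 (for example ids 100 and 10000 above), so it understates the typical rating of products that have been reviewed.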

In [146]:
rating_df.sum()

id            1.489202e+11
total         7.781990e+06
downloaded    7.593244e+06
avg_rating    1.741763e+06
dtype: float64

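There are more "total" reviews than downloads (7,781,990 vs. 7,593,244), suggesting some of the reviews came from users that did not download the product. An alternative reading is that "downloaded" counts the reviews actually captured in this file rather than product downloads: each sample entry above lists exactly as many review lines as its "downloaded" value. Under that reading, the gap would reflect incomplete collection rather than suspicious activity.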

In [150]:
more_rev = rating_df[rating_df.total > rating_df.downloaded]
len(more_rev)

8615

There are 8615 products that have more reviews than downloads, suggesting they may have some nefarious (fake) reviews.

In [149]:
more_rev.avg_rating.mean()

4.1973302379570514

The mean rating of the products with potentially fake reviews (4.20) is higher than the overall mean rating (3.21), which suggests it may be sellers or manufacturers that are creating the reviews.
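
The comparison above mixes in products with no reviews, which are stored with an `avg_rating` of 0.0 and lower the overall mean. A like-for-like check would compare flagged products only against other products that have at least one review. Here is a small sketch on toy data: the frame `toy` and all of its values are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Toy stand-in for rating_df (values are illustrative only)
toy = pd.DataFrame({
    'total':      [2, 0, 6, 0, 10, 4],
    'downloaded': [2, 0, 6, 0, 5, 4],
    'avg_rating': [5.0, 0.0, 4.0, 0.0, 4.5, 3.0],
})

# Keep only products with at least one review, so the 0.0 placeholders
# for unreviewed products do not distort the baseline
reviewed = toy[toy['total'] > 0].copy()

# Flag products whose total review count exceeds the downloaded count
reviewed['flagged'] = reviewed['total'] > reviewed['downloaded']

# Mean over every product versus means within the reviewed subset
overall_mean = toy['avg_rating'].mean()
by_flag = reviewed.groupby('flagged')['avg_rating'].mean()

print(overall_mean)
print(by_flag)
```

On the real data, the same pattern applied to `rating_df` would show whether the gap between flagged and unflagged products survives once unreviewed products are excluded.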

In [159]:
# work on an explicit copy so the new column is not assigned to a slice of rating_df
more_rev = more_rev.copy()
more_rev['excess'] = more_rev.total - more_rev.downloaded


In [162]:
more_rev.sort_values(by='excess', ascending = False)

| id (index) | id | total | downloaded | avg_rating | excess |
|------------|----|-------|------------|------------|--------|
| 148185 | 148185 | 5034 | 5 | 5.0 | 5029 |
| 99487 | 99487 | 5033 | 5 | 5.0 | 5028 |
| 128673 | 128673 | 4922 | 5 | 5.0 | 4917 |
| 251503 | 251503 | 2925 | 5 | 4.5 | 2920 |
| 379661 | 379661 | 2925 | 5 | 4.5 | 2920 |
| 379659 | 379659 | 2925 | 5 | 4.5 | 2920 |
| 359452 | 359452 | 2919 | 5 | 4.5 | 2914 |
| 69897 | 69897 | 2919 | 5 | 4.5 | 2914 |
| 103684 | 103684 | 2562 | 5 | 3.5 | 2557 |
| 186578 | 186578 | 2396 | 5 | 5.0 | 2391 |


In [165]:
more_rev[more_rev.excess == 1].avg_rating.mean()

4.1823056300268098

Even items with only 1 more "total" than "downloaded" have a higher rating than the overall mean.

In [167]:
more_rev[more_rev.excess> 1].avg_rating.mean()

4.2109929078014181

And items with more than 1 excess review have an even higher rating.

However, the item with the largest excess is **Harry Potter and the Sorcerer's Stone**, with 5034 total reviews (5029 more than downloaded). This suggests that items with a very high number of reviews may simply be popular products that people often buy from places other than Amazon but still choose to review there because they like the product so much.
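
As a suggestion for further analysis, the individual review lines could help separate popular items from manipulated ones, since each line carries a customer id, a rating, and helpfulness votes. The sketch below parses the two review lines shown for product 1 and computes the share of votes marked helpful. The regular expression and the helper name `parse_review` are assumptions based only on the sample lines (including the dataset's own `cutomer` spelling), so other line variants in the file may need extra handling.

```python
import re

# Pattern inferred from the sample review lines; "cutomer" is the
# spelling used in the data file itself
REVIEW_RE = re.compile(
    r'^\s*(\d{4}-\d{1,2}-\d{1,2})\s+cutomer:\s+(\S+)\s+rating:\s+(\d+)'
    r'\s+votes:\s+(\d+)\s+helpful:\s+(\d+)\s*$'
)


def parse_review(line):
    """Return a dict for one review line, or None if it does not match."""
    match = REVIEW_RE.match(line)
    if match is None:
        return None
    date, customer, rating, votes, helpful = match.groups()
    return {
        'date': date,
        'customer': customer,
        'rating': int(rating),
        'votes': int(votes),
        'helpful': int(helpful),
    }


# The two review lines listed under product 1
lines = [
    '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
    '    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5',
]
reviews = [parse_review(line) for line in lines]

# Share of all votes that were "helpful" votes across these reviews
helpful_share = (sum(r['helpful'] for r in reviews)
                 / sum(r['votes'] for r in reviews))
print(helpful_share)
```

Aggregating such records per customer or per product (for instance, customers who only ever give 5 stars, or reviews with few helpful votes) would be one possible next step.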

### Question 2 (Part A): Category bloat

Consider the product group named 'Books'. Each product in this group is associated with
categories. Naturally, with categorization, there are tradeoffs between how broad or
specific the categories must be.

For this dataset, quantify the following:

a. Is there redundancy in the categorization? How can it be identified/removed?

b. Is it possible to reduce the number of categories drastically (say to 10% of existing
categories) by sacrificing relatively few category entries (say close to 10%)?

In [169]:
# take a glance at the first few products in the dataset
dict(list(d.items())[:6])

{'Id:   0': ['ASIN: 0771044445', '  discontinued product', ''],
 'Id:   1': ['ASIN: 0827229534',
  '  title: Patterns of Preaching: A Sermon Sampler',
  '  group: Book',
  '  salesrank: 396585',
  '  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X',
  '  categories: 2',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
  '  reviews: total: 2  downloaded: 2  avg rating: 5',
  '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
  '    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5',
  ''],
 'Id:   2': ['ASIN: 0738700797',
  '  title: Candlemas: Feast of Flames',
  '  group: Book',
  '  salesrank: 168596',
  '  similar: 5  0738700827  1567184960  1567182836  0738700525  0738700940',
  '  categories: 2',
  '   |Books[283155]|Subjects[10

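It appears that when a Book has more than one category, the category paths are largely redundant: they share most of their ancestor nodes and differ only near the end. For product 1, the two paths are identical except for the final node (Preaching vs. Sermons).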
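
One way to quantify this overlap is to count how many leading nodes two category paths share. The sketch below does this for the two paths of product 1; the helper names `split_path` and `shared_prefix_len` are illustrative and not part of the original notebook.

```python
def split_path(path):
    """Turn '|Books[283155]|Subjects[1000]|...' into a list of nodes."""
    return path.strip().split('|')[1:]


def shared_prefix_len(a, b):
    """Number of leading nodes that two category paths have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# The two category paths listed for product 1
p1 = split_path('   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]'
                '|Christianity[12290]|Clergy[12360]|Preaching[12368]')
p2 = split_path('   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]'
                '|Christianity[12290]|Clergy[12360]|Sermons[12370]')

shared = shared_prefix_len(p1, p2)
print(shared, len(p1), len(p2))
```

For product 1 this gives 5 shared nodes out of 6. Averaging the shared fraction over all pairs of paths per book would be one possible redundancy measure.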

In [180]:
#extracting category info
books=0
num_cats = {}
cat_list = []

for k in d.keys():
    group = ''
    key= k.split()[-1]
    for v in d[k]:
        if v.startswith('  group:'):
            if v.split()[1] == 'Book':
                books +=1
                group = 'book'
        if group == 'book':
            if v.startswith('  categories:'):
                num_cats[key] = v.split()[1]
            if v.startswith('   |'):
                cat_list.append(v.strip())


In [211]:
len(cat_list)

1440329

Number of referenced categories for all books: 1440329

In [212]:
len(set(cat_list))

12853

Number of unique category paths for books: 12853

In [213]:
print ('Number of Books:',books)

Number of Books: 393561


In [205]:
cats_per = pd.DataFrame.from_dict(num_cats, orient='index')

cats_per.columns = ['Number of Categories']
cats_per.index.name = 'Product ID'

cats_per.head(10)

| Product ID | Number of Categories |
|------------|----------------------|
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
| 4 | 5 |
| 5 | 2 |
| 6 | 5 |
| 8 | 4 |
| 9 | 1 |
| 10 | 3 |
| 11 | 4 |


In [207]:
cats_per['Number of Categories'] = cats_per['Number of Categories'].apply(pd.to_numeric)

In [210]:
cats_per['Number of Categories'].mean() 

3.6597350855394715

The average number of category entries per book is about 3.66.

In [218]:
#Count subcategories
sub_cats = []

for cat in cat_list:
    for node in cat.split('|')[1:]:
        sub_cats.append(node)

In [219]:
#show first 10 entries of subcategories
sub_cats[:10]

['Books[283155]',
 'Subjects[1000]',
 'Religion & Spirituality[22]',
 'Christianity[12290]',
 'Clergy[12360]',
 'Preaching[12368]',
 'Books[283155]',
 'Subjects[1000]',
 'Religion & Spirituality[22]',
 'Christianity[12290]']

In [239]:
from collections import Counter
catcount = pd.DataFrame.from_dict(Counter(sub_cats), orient = 'index')
#catcount['sub_category'] = catcount.index
catcount = catcount.sort_values(by = 0,ascending = False)
catcount.head(20)

|   | 0 |
|---|---|
| Books[283155] | 1286848 |
| Subjects[1000] | 1222638 |
| Children's Books[4] | 134263 |
| Amazon.com Stores[285080] | 123925 |
| [265523] | 123925 |
| Nonfiction[53] | 106966 |
| Religion & Spirituality[22] | 93648 |
| Home & Office[764512] | 90409 |
| Literature & Fiction[17] | 84709 |
| Business & Investing Books[767740] | 76457 |


In [238]:
len(catcount)*.1

1492.3000000000002

10% of the total number of unique category types (14923) is 1492. If the 1492 most common category types were removed, this might allow for more specific categorization in general and might drastically reduce the category entries per book, thus reducing redundancy.

In [242]:
(catcount[:1492].sum()/catcount.sum())

0    0.905716
dtype: float64

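### The 10% most common subcategories account for 90.57% of all category instances.
So removing this 10% of categories would in fact cut the total number of category entries to roughly 9.4% of the current total, significantly reducing redundancy. Read the other way, keeping only these 1492 categories would reduce the number of categories to 10% of the current set while sacrificing only about 9.4% of entries.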
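
The coverage calculation can be wrapped in a small reusable function, which makes it easy to try other cut-offs or to rerun the check on full category paths (for example `Counter(cat_list)`) instead of individual nodes. This is a sketch on toy counts: the function name `coverage_of_top` and the numbers are illustrative only.

```python
from collections import Counter


def coverage_of_top(counts, fraction):
    """Share of all entries covered by the most common `fraction` of keys."""
    ranked = sorted(counts.values(), reverse=True)
    keep = int(len(ranked) * fraction)  # truncate, as with 1492.3 -> 1492
    return sum(ranked[:keep]) / sum(ranked)


# Toy counts (illustrative only): 10 categories, 100 entries in total
toy_counts = Counter({
    'A': 55, 'B': 15, 'C': 10, 'D': 6, 'E': 5,
    'F': 3, 'G': 2, 'H': 2, 'I': 1, 'J': 1,
})

kept = coverage_of_top(toy_counts, 0.2)  # keep the top 20% of categories
sacrificed = 1 - kept                    # share of entries that would be lost
print(kept, sacrificed)
```

In the toy example the two most common categories cover 70 of the 100 entries, so keeping only them would sacrifice 30% of entries.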

## Part B. 
(Suggested duration: 30 mins)
Give the number crunching a rest! Just think about these problems.

### Question 1. Algorithm thinking
How would you build the product categorization from scratch, using similar/co-purchased
information?

First, the products would need to be sorted into their product groups (books vs. music, etc.). Without the categories currently available in the dataset, products could be sub-categorized by clustering the available data, such as the "similar" product info. Once clustering statistics are calculated and a hierarchical structure is formed, the structure should be checked for accuracy and sensibility, and then the various branches can be labeled.
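
As a very coarse first step toward such a clustering, products can be linked whenever one lists the other under "similar", and each connected group of the resulting graph can be treated as a candidate category. The sketch below uses plain connected components, which is a simplified stand-in for the hierarchical clustering described above and not a replacement for it. The product ids `P1` to `P5` and the helper names are hypothetical.

```python
from collections import defaultdict, deque


def parse_similar(line):
    """Return the product ids listed on a '  similar: N  a  b ...' line."""
    parts = line.split()
    return parts[2:]  # parts[0] is 'similar:', parts[1] is the count


def connected_components(edges):
    """Group nodes of an undirected graph into connected components (BFS)."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        queue, group = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(group)
    return groups


# Hypothetical products and their "similar" lines
similar_lines = {
    'P1': '  similar: 2  P2  P3',
    'P2': '  similar: 1  P1',
    'P4': '  similar: 1  P5',
}
edges = [(product, other)
         for product, line in similar_lines.items()
         for other in parse_similar(line)]
groups = connected_components(edges)
print(groups)
```

On the full dataset the components would probably be too large to serve as categories directly, so a hierarchical or community-detection method applied within each product group would be the natural refinement.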

### Question 2. Product thinking (Part B)
Now, put on your 'product thinking' hat.

a. Is it a good idea to show users the categorization hierarchy for items?

b. Is it a good idea to show users similar/co-purchased items?

c. Is it a good idea to show users reviews and ratings for items?

d. For each of the above, why? How will you establish the same?

Showing users the categorization hierarchy is not a good idea, as it is not likely to generate more sales. Seeing the hierarchy may confuse the average user and perhaps lead them to navigate away from a product they intended to purchase or were considering purchasing.

Showing similar/co-purchased items is a good idea, however, because it may help a user quickly find either a more appropriate product for their needs or a complementary product that they forgot they would also need.

User reviews and ratings can help buyers build confidence in their purchase and encourage purchase completion, so it is a good idea to make this available. 