For this exercise, you will analyze a dataset from Amazon. The data format and a
sample entry are shown on the next page.

A. (Suggested duration: 90 mins)
With the given data for 548552 products, perform exploratory analysis and make
suggestions for further analysis on the following aspects.
1. Trustworthiness of ratings
Ratings are susceptible to manipulation, bias, etc. What can you say (quantitatively
speaking) about the ratings in this dataset?

2. Category bloat
Consider the product group named 'Books'. Each product in this group is associated with
categories. Naturally, categorization involves a trade-off between broad and
specific categories.

For this dataset, quantify the following:
a. Is there redundancy in the categorization? How can it be identified/removed?
b. Is it possible to reduce the number of categories drastically (say to 10% of existing
categories) by sacrificing relatively few category entries (say close to 10%)?

B. (Suggested duration: 30 mins)
Give the number crunching a rest! Just think about these problems.
1. Algorithm thinking
How would you build the product categorization from scratch, using similar/co-purchased
information?
2. Product thinking
Now, put on your 'product thinking' hat.
a. Is it a good idea to show users the categorization hierarchy for items?
b. Is it a good idea to show users similar/co-purchased items?
c. Is it a good idea to show users reviews and ratings for items?
d. For each of the above, why? How would you establish this?

In [1]:
#import required libraries
import pandas as pd
import numpy as np
from datetime import datetime
%matplotlib inline
import matplotlib.pyplot as plt

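# Map each record's "Id: N" line to the list of lines that follow it;
# lines before the first record are stored under 'Header'.
d = {}
values = []
key = 'Header'

# Read the file
with open('amazon-meta.txt', encoding="utf8") as fp:
    for line in fp:
        # A record starts with "Id:" at the beginning of a line; matching only
        # there avoids splitting on other lines that merely contain "Id:".
        if line.startswith("Id:"):
            d[key] = values
            key = line.rstrip('\n')
            values = []
        else:
            values.append(line.rstrip('\n'))
# Store the final record
d[key] = values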

In [2]:
dict(list(d.items())[:5])

{'Header': ['# Full information about Amazon Share the Love products',
  'Total items: 548552',
  ''],
 'Id:   0': ['ASIN: 0771044445', '  discontinued product', ''],
 'Id:   1': ['ASIN: 0827229534',
  '  title: Patterns of Preaching: A Sermon Sampler',
  '  group: Book',
  '  salesrank: 396585',
  '  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X',
  '  categories: 2',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
  '  reviews: total: 2  downloaded: 2  avg rating: 5',
  '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
  '    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5',
  ''],
 'Id:   2': ['ASIN: 0738700797',
  '  title: Candlemas: Feast of Flames',
  '  group: Book',
  '  salesrank: 168596',
  '  similar: 5  0738700827 

In [3]:
d.pop('Header')

['# Full information about Amazon Share the Love products',
 'Total items: 548552',
 '']

In [4]:
# Extract the per-product review summary into a DataFrame
rating_dict = {}

discont = 0  # number of discontinued products

for k, lines in d.items():
    ident = k.split()[1]  # 'Id:   1' -> '1'
    for v in lines:
        if v.startswith('  reviews:'):
            # e.g. '  reviews: total: 2  downloaded: 2  avg rating: 5'
            parts = v.split()
            total, downloaded, avg_rating = parts[2], parts[4], parts[7]
            rating_dict[ident] = [total, downloaded, avg_rating]
        elif v.startswith('  discontinued product'):
            discont += 1

# One row per product id, built directly without transposing
rating_df = pd.DataFrame.from_dict(
    rating_dict, orient='index', columns=['total', 'downloaded', 'avg_rating']
)
rating_df = rating_df.rename_axis('id').reset_index()
# Note: this drops the first row (the first product that has a reviews line)
rating_df = rating_df.iloc[1:].copy()
rating_df = rating_df.astype({'total': int, 'downloaded': int, 'avg_rating': float})
# 'id' is still a string here, so this sort is lexicographic ('10' < '100' < '2')
rating_df = rating_df.sort_values('id')
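
The positional indexing into `v.split()` above relies on the summary line always having the same number of tokens. As a hedged alternative, the sketch below parses the same line with a regular expression and returns `None` when the line does not match. The helper name `parse_review_summary` and the pattern are illustrative assumptions, checked only against the sample line shown in the `In [2]` output, not against the full file.

```python
import re

# Assumed layout, taken from the sample record:
# '  reviews: total: 2  downloaded: 2  avg rating: 5'
REVIEW_SUMMARY = re.compile(
    r"^\s*reviews:\s+total:\s+(\d+)\s+downloaded:\s+(\d+)\s+avg rating:\s+([\d.]+)"
)


def parse_review_summary(line):
    """Return (total, downloaded, avg_rating), or None if the line does not match."""
    m = REVIEW_SUMMARY.match(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))


sample = '  reviews: total: 2  downloaded: 2  avg rating: 5'
print(parse_review_summary(sample))             # (2, 2, 5.0)
print(parse_review_summary('  group: Book'))    # None
```

Returning typed values at parse time would also make the later `astype` conversions unnecessary, at the cost of a slightly slower loop.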

In [5]:
print('Number of Discontinued Items:', discont)

Number of Discontinued Items: 5868


In [6]:
rating_df.head()

       id  total  downloaded  avg_rating
1      10      6           6         4.0
2     100      0           0         0.0
3    1000      1           1         5.0
4   10000      0           0         0.0
5  100000      2           2         4.5


In [7]:
rating_df.tail(15)

           id  total  downloaded  avg_rating
542669   9999     39          39         3.5
542670  99990      7           7         5.0
542671  99991    128           5         4.5
542672  99992      4           4         4.5
542673  99993      4           4         4.5
542674  99994      0           0         0.0
542675  99995      0           0         0.0
542676  99996      4           4         5.0
542677  99997      2           2         3.0
542678  99998      0           0         0.0
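
In the output above, product 99991 lists 128 reviews in total but only 5 downloaded, so its individual review lines cover a small fraction of the ratings behind `avg_rating`. A minimal sketch of how such rows might be flagged, using a few rows copied from that output; the `coverage` column name is an assumption introduced here for illustration.

```python
import pandas as pd

# Rows copied from the tail output above
rows = pd.DataFrame(
    {
        "id": ["99990", "99991", "99992", "99994"],
        "total": [7, 128, 4, 0],
        "downloaded": [7, 5, 4, 0],
        "avg_rating": [5.0, 4.5, 4.5, 0.0],
    }
)

# Fraction of reviews actually present in the file; NaN when there are no reviews
rows["coverage"] = rows["downloaded"] / rows["total"].where(rows["total"] > 0)

# Products whose downloaded reviews are only a subset of all reviews
incomplete = rows[rows["downloaded"] < rows["total"]]
print(incomplete[["id", "total", "downloaded", "coverage"]])
```

Any per-review analysis (for example, of helpfulness votes) would describe only the downloaded subset for such products, which may be worth keeping in mind when judging how trustworthy the average rating is.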


In [8]:
rating_df.sort_values(by='id').head()

       id  total  downloaded  avg_rating
1      10      6           6         4.0
2     100      0           0         0.0
3    1000      1           1         5.0
4   10000      0           0         0.0
5  100000      2           2         4.5
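
The order 10, 100, 1000, ... arises because `id` is stored as a string, so `sort_values` compares it character by character. A small sketch on toy data (the values below are made up for illustration) shows the difference once the column is cast to `int`:

```python
import pandas as pd

# Toy frame with string ids, mimicking how 'id' is stored in rating_df
toy = pd.DataFrame({"id": ["10", "100", "2", "1"], "total": [6, 0, 3, 2]})

# String sort: compares character by character
lexicographic = toy.sort_values("id")["id"].tolist()

# Numeric sort: cast to int first
numeric = toy.assign(id=toy["id"].astype(int)).sort_values("id")["id"].tolist()

print(lexicographic)  # ['1', '10', '100', '2']
print(numeric)        # [1, 2, 10, 100]
```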
