### For this exercise, you will analyze a dataset from Amazon. 

### With the given data for 548552 products, perform exploratory analysis and make suggestions for further analysis on the following aspects:
### 1. Trustworthiness of ratings
Ratings are susceptible to manipulation, bias etc. What can you say (quantitatively
speaking) about the ratings in this dataset?

### 2. Category bloat

Consider the product group named 'Books'. Each product in this group is associated with
categories. Naturally, with categorization, there are tradeoffs between how broad or
specific the categories must be.  For this dataset, quantify the following:

    a. Is there redundancy in the categorization? How can it be identified/removed?
    b. Is it possible to reduce the number of categories drastically (say to 10% of existing categories) by sacrificing relatively few category entries (say close to 10%)?

In [256]:
import pandas as pd
import numpy as np
%matplotlib inline


In [257]:
with open('../KeepUp/amazon-meta.txt', encoding="utf8") as myfile:
    text = myfile.read()  # the with-block closes the file automatically
text[:1000]

'# Full information about Amazon Share the Love products\nTotal items: 548552\n\nId:   0\nASIN: 0771044445\n  discontinued product\n\nId:   1\nASIN: 0827229534\n  title: Patterns of Preaching: A Sermon Sampler\n  group: Book\n  salesrank: 396585\n  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X\n  categories: 2\n   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]\n   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]\n  reviews: total: 2  downloaded: 2  avg rating: 5\n    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9\n    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5\n\nId:   2\nASIN: 0738700797\n  title: Candlemas: Feast of Flames\n  group: Book\n  salesrank: 168596\n  similar: 5  0738700827  1567184960  1567182836  0738700525  0738700940\n  categories: 2\n   |Books[283155]|Subjects[1000]|Religi

#### A quick look at the file structure shows that entries are grouped by product 'Id: ', as listed in the data description.  The entries are separated by blank lines (double line breaks), which is how we will split them when parsing.

In [258]:
# Group lines into one list per product; a blank line ends each entry.
id_groups = []  # list of line-lists, one per product Id
entry = []      # lines of the entry currently being read
for line in text.split('\n'):
    if line != '':
        entry.append(line)
    elif entry:  # ignore repeated blank lines
        id_groups.append(entry)
        entry = []
if entry:  # keep the last entry if the file does not end with a blank line
    id_groups.append(entry)
del id_groups[0]  # drop the file header, which is not a product entry
id_groups[0:3]


[['Id:   0', 'ASIN: 0771044445', '  discontinued product'],
 ['Id:   1',
  'ASIN: 0827229534',
  '  title: Patterns of Preaching: A Sermon Sampler',
  '  group: Book',
  '  salesrank: 396585',
  '  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X',
  '  categories: 2',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
  '  reviews: total: 2  downloaded: 2  avg rating: 5',
  '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
  '    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5'],
 ['Id:   2',
  'ASIN: 0738700797',
  '  title: Candlemas: Feast of Flames',
  '  group: Book',
  '  salesrank: 168596',
  '  similar: 5  0738700827  1567184960  1567182836  0738700525  0738700940',
  '  categories: 2',
  '   |Books[283155]|Subjects[1000]|Re

#### To answer the first question we must extract each product rating into a dataframe to analyze.

In [None]:
snip = id_groups[0:2]
del snip[0]
snip

In [None]:
ratings = {}  # product Id -> list of extracted fields

# Extract the main fields. The individual similar, category, and review
# entries are not broken down further here.
# Values keep the leading space that follows the colon (e.g. ' Book').
for product_lines in id_groups:
    skip = False  # set when the product is discontinued
    for line in product_lines:
        field = line.strip()
        if field.startswith('Id:'):
            Id = field.split()[-1]
        elif field.startswith('ASIN:'):
            ASIN = field.split()[-1]
        elif field.startswith('title:'):
            # split only once so that titles containing ':' are kept whole
            title = field.split(':', 1)[1]
        elif field.startswith('group:'):
            group = field.split(':', 1)[1]
        elif field.startswith('salesrank:'):
            salesrank = field.split(':', 1)[1]
        elif field.startswith('similar:'):
            similar = field.split(':', 1)[1]
        elif field.startswith('categories:'):
            num_categories = field.split(':', 1)[1]
        elif field.startswith('reviews:'):
            parts = field.split()
            total_reviews, downloads, avg_rating = parts[2], parts[4], parts[7]
        elif field.startswith('discontinued product'):
            skip = True
    if not skip:
        ratings[Id] = [ASIN, title, group, salesrank, similar, num_categories, total_reviews, downloads, avg_rating]

In [None]:
data_df = pd.DataFrame.from_dict(ratings)
data_df.head()

In [None]:
df = data_df.T.reset_index()
df.columns = ['Id', 'ASIN', 'title', 'group', 'salesrank', 'similar', 'num_categories', 'total_reviews',
              'downloads', 'avg_rating']
df.head()

In [None]:
#convert some string columns to numeric types
df['Id'] = df['Id'].astype(int)
df['salesrank'] = df['salesrank'].astype(int)
df['num_categories'] = df['num_categories'].astype(int)
df['total_reviews'] = df['total_reviews'].astype(int)
df['downloads'] = df['downloads'].astype(int)
df['avg_rating'] = df['avg_rating'].astype(float)
df['group'] = df['group'].astype(str)
df = df.sort_values('Id')
df.head()

In [None]:
#what are all the product groups
df.group.value_counts()

We can see there are a few product groups that do not appear very often.  For exploring the ratings, we will focus only on the top 4 product groups from the list.

In [None]:
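# named top_groups to avoid shadowing the built-in `list`;
# the values keep the leading space left by the parser
top_groups = [' Book', ' Music', ' Video', ' DVD']
df2 = df[df['group'].isin(top_groups)]
df2.group.value_counts()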

In [None]:
df2.describe()

In [None]:
df2['avg_rating'].plot(kind = 'hist')

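The histogram of average ratings is interesting: the extremes, 0 and 5, appear far more often than middling values.  This suggests a bias in which people who are either very pleased or very displeased are more likely to leave a rating.  Note, however, that Amazon star ratings run from 1 to 5, so an average of 0 most likely marks products with no reviews rather than very poor ones; those products should be separated out before drawing conclusions about the low end.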
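
Since an average rating of 0 may simply mean a product has no reviews, it can help to split the data before reading the histogram. The following is a minimal sketch, assuming a frame with the `total_reviews` and `avg_rating` columns built above; the helper names and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd


def split_by_review_presence(frame):
    """Return (reviewed, unreviewed) products based on total_reviews."""
    reviewed = frame[frame["total_reviews"] > 0]
    unreviewed = frame[frame["total_reviews"] == 0]
    return reviewed, unreviewed


def rating_share(frame):
    """Share of products at each distinct average rating, sorted by rating."""
    return frame["avg_rating"].value_counts(normalize=True).sort_index()


# Hypothetical toy data using the same column names as the notebook's frame.
sample = pd.DataFrame({
    "total_reviews": [0, 0, 2, 10, 1, 4],
    "avg_rating": [0.0, 0.0, 5.0, 4.5, 1.0, 3.0],
})

reviewed, unreviewed = split_by_review_presence(sample)
shares = rating_share(reviewed)
print(len(unreviewed), "products without reviews")
print(shares)
```

On the real data, `split_by_review_presence(df2)` would show how much of the spike at 0 comes from unreviewed products, and `rating_share` on the reviewed subset gives the distribution that the bias argument should be based on.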


### Category bloat
Consider the product group named 'Books'. Each product in this group is associated with categories. Naturally, with categorization, there are tradeoffs between how broad or specific the categories must be. For this dataset, quantify the following:

a. Is there redundancy in the categorization? How can it be identified/removed?

b. Is it possible to reduce the number of categories drastically (say to 10% of existing categories) by sacrificing relatively few category entries (say close to 10%)?

#### For part a, we will need to split apart the categorization better.

In [None]:
category_list = []  # full category paths for books
totals = 0          # number of products in the Book group
# strip out the category paths for books and flatten them into a df
for product_lines in id_groups:
    for line in product_lines:
        if line.startswith('  group:'):
            product = line.split(':')[-1]
            if 'Book' in product:
                totals += 1
        elif line.startswith('   |Books'):
            category_list.append(line.strip())

# split each path into its individual category labels
sub_category = []
for path in category_list:
    for subcat in path.split('|')[1:]:
        sub_category.append(subcat)

categorical_df = pd.DataFrame(sub_category)
categorical_df.head()

In [None]:
print ('total number of books                               :', totals)
print ('total categories/sub_categories for all books       :', len(categorical_df[0]))
print ('total unique categories/sub_categories for all books:', len(categorical_df[0].unique()))

#get the counts per unique category and explore the top 20
categorical_df[0].value_counts().head(20)

We can see that there is a fair amount of redundancy in the category structure.  Nondescript labels such as "Subjects" add little information and could easily be dropped.
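
Both parts of the question can be quantified with a few lines. The sketch below is one possible approach, not a definitive answer: it treats a category name that appears under more than one numeric ID as a candidate for redundancy, and it measures how many category entries survive if only the most frequent 10% of distinct labels are kept. The function names and the toy labels (including the `General[1]`/`General[2]` IDs) are hypothetical.

```python
import re
from collections import Counter, defaultdict

LABEL_RE = re.compile(r"^(.*)\[(\d+)\]$")


def split_label(label):
    """Split 'Name[123]' into ('Name', '123'); the ID is None if absent."""
    match = LABEL_RE.match(label)
    return (match.group(1), match.group(2)) if match else (label, None)


def names_with_many_ids(labels):
    """Return {name: set of IDs} for names that occur under more than one ID."""
    ids = defaultdict(set)
    for label in labels:
        name, cat_id = split_label(label)
        ids[name].add(cat_id)
    return {name: found for name, found in ids.items() if len(found) > 1}


def top_fraction_coverage(labels, fraction=0.10):
    """Share of all entries covered by the most frequent `fraction` of labels."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    keep = max(1, int(len(counts) * fraction))
    covered = sum(n for _, n in counts.most_common(keep))
    return covered / len(labels)


# Hypothetical toy labels: 10 distinct labels, 19 entries in total.
labels = (
    ["Books[283155]"] * 6
    + ["Subjects[1000]"] * 5
    + ["General[1]", "General[2]", "Clergy[12360]", "Sermons[12370]",
       "Preaching[12368]", "Christianity[12290]", "Fiction[3]", "Poetry[4]"]
)

duplicates = names_with_many_ids(labels)
coverage = top_fraction_coverage(labels, 0.10)
print(duplicates)
print(round(coverage, 3))
```

On the notebook's data, `top_fraction_coverage(sub_category)` would report the share of entries retained when keeping 10% of the categories; part b asks whether that share is close to 90%.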

### B. (Suggested duration: 30 mins)

#### 1. Algorithm thinking
How would you build the product categorization from scratch, using similar/co-purchased
information?
#### 2. Product thinking
Now, put on your 'product thinking' hat.

    a. Is it a good idea to show users the categorization hierarchy for items?
    b. Is it a good idea to show users similar/co-purchased items?
    c. Is it a good idea to show users reviews and ratings for items?
    d. For each of the above, why? How will you establish the same?

One way to build a product categorization from scratch may be to derive it from the data itself: group products using the similar/co-purchase links, and use the item descriptions to name and refine the resulting groups.
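
As a rough illustration of the co-purchase idea, the sketch below links each product to the ASINs on its `similar:` line and groups products that are connected to each other. It uses plain connected components (via union-find) as the simplest possible baseline; this is an assumption for illustration, and on the full dataset a real community-detection or clustering method would probably be needed, since one very large component may swallow most products. All function names and the toy ASINs `A` to `F` are hypothetical.

```python
def parse_similar(line):
    """Return the ASINs on a 'similar:' line; the first token is the count."""
    tokens = line.split(":", 1)[1].split()
    return tokens[1:]


def connected_groups(edges):
    """Group nodes that are linked by any chain of edges (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Largest groups first.
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical toy input: product ASIN -> ASINs from its 'similar:' line.
similar_map = {"A": ["B", "C"], "D": ["E"], "F": []}
edges = [(asin, other) for asin, others in similar_map.items() for other in others]

groups = connected_groups(edges)
print(groups)
print(parse_similar("  similar: 2  0804215715  156101074X"))
```

Products with no links, such as `F` here, do not appear in any group and would need a fallback, for example assignment based on the item description.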

I believe the categorization hierarchy does not need to be shown by default, but it could be made available in a drop-down for users who want to see it.  This would prevent clutter while still allowing the user to access the information if wanted.

Showing similar/co-purchased items is a great idea.  It can often speed up the search process, and users may not realize they need or want a similar/co-purchased item until reminded.  This could increase the number of purchases made per visit.

User reviews and ratings are critical to many users' purchase habits and should be shown.  Care must be taken to ensure the reviews/ratings are as accurate and trustworthy as possible, because a negative experience with a product could also erode trust in purchasing from the site again.
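
One trust signal already present in the data is the `votes` and `helpful` count on each review line. The sketch below parses a single review line in the format shown in the file excerpt (including the dataset's own `cutomer:` spelling) and computes the share of votes that found the review helpful. The function names are hypothetical, and how such a ratio should be weighted or displayed is left open.

```python
def parse_review(line):
    """Parse one review line into a dict.

    Expected tokens after splitting on whitespace:
    [date, 'cutomer:', customer_id, 'rating:', r, 'votes:', v, 'helpful:', h]
    """
    tokens = line.split()
    return {
        "date": tokens[0],
        "customer": tokens[2],
        "rating": int(tokens[4]),
        "votes": int(tokens[6]),
        "helpful": int(tokens[8]),
    }


def helpful_ratio(review):
    """Share of votes marked helpful, or None when the review has no votes."""
    if review["votes"] == 0:
        return None
    return review["helpful"] / review["votes"]


# Review line taken from the file excerpt at the top of the notebook.
line = "    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9"
review = parse_review(line)
print(review)
print(helpful_ratio(review))
```

Reviews with few votes give a noisy ratio, so any use of this signal would likely need a minimum vote count.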

