A. (Suggested duration: 90 mins)
With the given data for 548552 products, perform exploratory analysis and make
suggestions for further analysis on the following aspects.

1. **Trustworthiness of ratings:** Ratings are susceptible to manipulation, bias etc. What can you say (quantitatively speaking) about the ratings in this dataset?

2. **Category bloat:** Consider the product group named 'Books'. Each product in this group is associated with
    categories. Naturally, with categorization, there are tradeoffs between how broad or
    specific the categories must be.
    For this dataset, quantify the following: 
        a. Is there redundancy in the categorization? How can it be identified/removed? 
        b. Is is possible to reduce the number of categories drastically (say to 10% of existing categories) by sacrificing relatively few category entries (say close to 10%)?

In [123]:
import pandas as pd
import numpy as np
import itertools

file = open("amazon-meta.txt",encoding="utf8")
file = file.readlines()

print("Number of lines", len(file))

Number of lines 15010574


## A. Extract IDS

In [124]:
# Extract lines where "Id" Shows up
id_in_line = []
with open("amazon-meta.txt",encoding="utf8") as myfile:
    for num, line in enumerate(myfile, 1):
        if 'Id:' in line:
            id_in_line.append(num)
            
search_these_ids = id_in_line
queried_ids = [file[i-1] for i in search_these_ids]
extracted_ids = [np.str.split(np.str.strip(i))[1] for i in queried_ids]

### A1. QA
- Results of QA:
    - Four books have "Id: in it
    - Exclude these. 

In [135]:
data_check = pd.DataFrame({"line":id_in_line,"ID":extracted_ids})
errors = data_check[data_check.ID.str.contains("^[a-z]", case = False)]
errors.head()

Unnamed: 0,ID,line
119183,Monsters,3271146
255659,The,6926372
468385,The,12450227
489633,Id:,13000643


In [139]:
[file[i-1] for i in errors.line]

['  title: Monsters from the Id: The Rise of Horror in Fiction and Film\n',
 "  title: The Unconscious and the Id: A Volume from Laplanche's Problematiques\n",
 '  title: The Sea of Precious Virtues: Bahr Al-Favaid : A Medieval Islamic Mirror for Princes (Bahr Al-Fava Id: a Medieval Islamic Mirror for Princes)\n',
 '  title: Id: Peace B\n']

In [147]:
data_checked = data_check[~data_check.ID.isin(errors.ID)]
len(data_checked)

## B. Create a list of Product Information by ID

In [150]:
data_checked['next_id_line'] = data_checked.line.shift(-1)
rows = data_checked[['ID','line', 'next_id_line']].apply(tuple, axis=1)
chunks = []
for i in rows:
    c,a,b = i
    chunks.append([c]+[file[int(i)] for i in np.arange(a-1,b-2)])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


## Build "Reviews" data frame
- Build Summary
- Build individual Level

#### Build Summary frame

In [151]:
review_line = []
for i in chunks:
    chunk_len = len(i)
    chunk_id = i[0]
    for num, line in enumerate(i,1):
        if 'reviews:' in line:
            review_line.append([chunk_id] + [chunk_len] + [num] + [float([i for i in np.str.split(line)][i]) for i in [2,4,7]])
review_summary_by_id = pd.DataFrame(review_line, columns = ['id','chunk_len','review_line_in_chunk','review_total','review_downloaded','review_rating'])            

In [153]:
review_summary_by_id.head()

Unnamed: 0,id,chunk_len,review_line_in_chunk,review_total,review_downloaded,review_rating
0,1,13,11,2.0,2.0,5.0
1,2,23,11,12.0,12.0,4.5
2,3,11,10,1.0,1.0,5.0
3,4,15,14,1.0,1.0,4.0
4,5,11,11,0.0,0.0,0.0


#### Build Detail Frame

In [167]:
review_lines = review_summary_by_id[['id','chunk_len','review_line_in_chunk']].apply(tuple, axis=1)
review_detail_collector = []
for row in review_lines:
    prod_id,b,a = row
    if (b == a):
        review_detail_collector.append([[prod_id] + [np.nan]*5])
    prod_id = int(prod_id)
    search_space = chunks[prod_id]
    query_range = np.arange(a,b)
    review_detail_collector.append([
       [prod_id] + [np.str.split(search_space[line])[column] for column in [0,2,4,6,8]] for line in query_range])

In [155]:
len(review_detail_collector)

542683

In [166]:
review_detail_collector

[[[1, '2000-7-28', 'A2JW67OY8U6HHK', '5', '10', '9'],
  [1, '2003-12-14', 'A2VE83MZF98ITY', '5', '6', '5']],
 [[2, '2001-12-16', 'A11NCO6YTE4BTJ', '5', '5', '4'],
  [2, '2002-1-7', 'A9CQ3PLRNIR83', '4', '5', '5'],
  [2, '2002-1-24', 'A13SG9ACZ9O5IM', '5', '8', '8'],
  [2, '2002-1-28', 'A1BDAI6VEYMAZA', '5', '4', '4'],
  [2, '2002-2-6', 'A2P6KAWXJ16234', '4', '16', '16'],
  [2, '2002-2-14', 'AMACWC3M7PQFR', '4', '5', '5'],
  [2, '2002-3-23', 'A3GO7UV9XX14D8', '4', '6', '6'],
  [2, '2002-5-23', 'A1GIL64QK68WKL', '5', '8', '8'],
  [2, '2003-2-25', 'AEOBOF2ONQJWV', '5', '8', '5'],
  [2, '2003-11-25', 'A3IGHTES8ME05L', '5', '5', '5'],
  [2, '2004-2-11', 'A1CP26N8RHYVVO', '1', '13', '9'],
  [2, '2005-2-7', 'ANEIANH0WAT9D', '5', '1', '1']],
 [[3, '2003-7-10', 'A3IDGASRQAW8B2', '5', '2', '2']],
 [[4, '2004-8-19', 'A2591BUPXCS705', '4', '1', '1']],
 ['5', nan, nan, nan, nan, nan],
 [],
 [[6, '1997-7-4', 'ATVPDKIKX0DER', '5', '12', '11'],
  [6, '1998-10-11', 'AUEZ7NVOEHYRY', '5', '13', '12'],
  

In [119]:
file[13000642]

'  title: Id: Peace B\n'