# Scratch page for exploring the data

In [1]:
import pandas as pd
import nltk

In [2]:
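df1 = pd.read_csv('amazon_reviews_us_Mobile_Electronics_v1_00.tsv', sep="\t", error_bad_lines=False)
# Some lines are malformed and get skipped (see the messages below).
# Note: error_bad_lines was removed in pandas 2.0; use on_bad_lines='warn' there.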

b'Skipping line 35246: expected 15 fields, saw 22\n'
b'Skipping line 87073: expected 15 fields, saw 22\n'
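
The skipping behaviour above can be reproduced on a tiny in-memory file. This is a minimal sketch, assuming pandas 1.3 or newer, where `on_bad_lines` replaces `error_bad_lines`; the three-column TSV and its values are made up for illustration and are not taken from the Amazon file.

```python
import io

import pandas as pd

# Hypothetical three-column TSV; the last data row has two extra fields.
raw = (
    "review_id\tstar_rating\treview_body\n"
    "R1\t5\tWorks great\n"
    "R2\t1\tTwo failed\n"
    "R3\t4\tGood\tstray\tfields\n"
)

# on_bad_lines="skip" silently drops rows that have too many fields;
# on_bad_lines="warn" drops them too but reports each one.
df = pd.read_csv(io.StringIO(raw), sep="\t", on_bad_lines="skip")

print(df["review_id"].tolist())
```

If the malformed lines in the real file come from stray quote characters inside the review text, passing `quoting=csv.QUOTE_NONE` to `read_csv` might keep them instead of dropping them. That is an assumption about this file, not something the messages above confirm.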


### DATA COLUMNS:
marketplace       - 2 letter country code of the marketplace where the review was written.  
customer_id       - Random identifier that can be used to aggregate reviews written by a single author.  
review_id         - The unique ID of the review.  
product_id        - The unique Product ID the review pertains to. In the multilingual dataset the reviews
                    for the same product in different countries can be grouped by the same product_id.  
product_parent    - Random identifier that can be used to aggregate reviews for the same product.  
product_title     - Title of the product.  
product_category  - Broad product category that can be used to group reviews  
                    (also used to group the dataset into coherent parts).  
star_rating       - The 1-5 star rating of the review.  
helpful_votes     - Number of helpful votes.  
total_votes       - Number of total votes the review received.  
vine              - Review was written as part of the Vine program.  
verified_purchase - The review is on a verified purchase.  
review_headline   - The title of the review.  
review_body       - The review text.  
review_date       - The date the review was written.  

Keep only the verified-purchase reviews and select a subset of the columns

In [3]:
df1 = df1.loc[df1['verified_purchase']=='Y',['review_id', 'product_id', 'product_title', 'helpful_votes','review_headline', 'review_body']]
df1['product_id'].nunique()

22299

Get a list of products with at least 200 reviews

In [4]:
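# Number of reviews per product (non-null review_id values in each group)
count_df = df1.groupby('product_id').count()
count_df = count_df['review_id']

# Quick sanity checks; in a notebook only the last expression of a cell is displayed
count_df.mean()
count_df.max()

# Keep products with at least 200 reviews, most-reviewed first
count_df = count_df.loc[lambda x: x>=200]
count_df = count_df.sort_values(ascending=False)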

product_list = count_df.index.values.tolist()
product_list

['B00J46XO9U',
 'B004911E9M',
 'B00E5PI594',
 'B008R68DFS',
 'B005S1CYO6',
 'B0067XVNTG',
 'B00166G81M',
 'B003TPQBJW',
 'B009L7EEZA',
 'B0052RMI2Y',
 'B002D4IHYM',
 'B0030BBWHQ',
 'B00QERR5CY',
 'B0069ZFYCY',
 'B008JGR44W',
 'B007OXLK82',
 'B00LAG4HN4',
 'B004A83PE6',
 'B00BUFBQWU',
 'B004T0CEBK',
 'B00LAG4F6S',
 'B0052NZYXI',
 'B007UXNHGY',
 'B002FA3R08',
 'B004N85WJE',
 'B006QF22HM',
 'B007WOKBRY',
 'B007WO5LF6',
 'B00KW3XNUE',
 'B001Q5DQOU']
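
The same per-product count and threshold can also be written with `value_counts`. This is a sketch on toy data rather than the notebook's method: the frame, the IDs, and the `min_reviews` name are hypothetical, and the threshold is lowered from 200 to 2 so the example stays small.

```python
import pandas as pd

# Toy stand-in for df1; product and review IDs are made up.
df = pd.DataFrame(
    {
        "product_id": ["A", "A", "A", "B", "B", "C"],
        "review_id": ["r1", "r2", "r3", "r4", "r5", "r6"],
    }
)

min_reviews = 2  # the notebook uses 200

# value_counts returns the number of rows per product_id, sorted descending.
counts = df["product_id"].value_counts()

# Keep only the products that reach the threshold.
popular = counts[counts >= min_reviews]
product_list = popular.index.tolist()

print(product_list)
```

One difference to keep in mind: `value_counts` counts rows per `product_id`, while `groupby(...).count()['review_id']` counts non-null `review_id` values, so the two agree only when `review_id` has no missing entries.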

In [5]:
df1 = df1[df1['product_id'].isin(product_list)]

df1

| | review_id | product_id | product_title | helpful_votes | review_headline | review_body |
|---|---|---|---|---|---|---|
| 2 | R2Y0MM9YE6OP3P | B00QERR5CY | iXCC Multi pack Lightning cable | 0.0 | great cables | These work great and fit my life proof case fo... |
| 4 | R26I2RI1GFV8QG | B0067XVNTG | Generic Car Dashboard Video Camera Vehicle Vid... | 0.0 | Cameras has battery issues | Be careful with these products, I have bought ... |
| 48 | R2WGDZBMIMZ1HK | B00LAG4HN4 | iXCC Element II Lightning Cable 6ft, iPhone Ch... | 0.0 | Good, strong, and 6 feet long! | Good, strong, and 6 feet long. |
| 60 | RRPOCULNRBGQ | B00LAG4HN4 | iXCC Element II Lightning Cable 6ft, iPhone Ch... | 0.0 | made with excellent materials at the joints be... | Apple makes their charging products with infer... |
| 77 | R2K2WK38XR5FKZ | B00QERR5CY | iXCC Multi pack Lightning cable | 0.0 | One Star | Two failed |
| 86 | R2EYF5O4W313NW | B00J46XO9U | iXCC Lightning Cable 3ft, iPhone charger, for ... | 0.0 | Four Stars | Very good quality. So far so good. |
| 104 | R8P7Q3NE3Y83S | B00QERR5CY | iXCC Multi pack Lightning cable | 0.0 | Three Stars | One dead on arrival, one working as intended. |
| 107 | R5F0MONILUKRP | B00J46XO9U | iXCC Lightning Cable 3ft, iPhone charger, for ... | 0.0 | Five Stars | Good product and good seller |
| 111 | R3OLUK314SVUJX | B00LAG4HN4 | iXCC Element II Lightning Cable 6ft, iPhone Ch... | 0.0 | Certified and lasted long enough. | My iPhone recognized this cable and never told... |
| 124 | R1L9T6O2ISEKHI | B007WOKBRY | EC TECHNOLOGY? New design Backup External Batt... | 0.0 | Three Stars | It culd be better, but it is seems to be not w... |


Test on one product: get the reviews for the first (most-reviewed) product in the list

In [110]:
# Select the rows for the first product, then keep only the review text
reviews = df1[df1['product_id']==product_list[0]]
reviews = reviews['review_body']

In [111]:
# Split each review at periods and stack the pieces so each fragment gets its own row
reviews = reviews.str.split('.',expand=True).stack().reset_index()

In [112]:
# Count whitespace-separated words in each fragment
reviews['word_count'] = reviews[0].str.split().apply(len)
reviews

| | level_0 | level_1 | 0 | word_count |
|---|---|---|---|---|
| 0 | 86 | 0 | Very good quality | 3 |
| 1 | 86 | 1 | So far so good | 4 |
| 2 | 86 | 2 | | 0 |
| 3 | 107 | 0 | Good product and good seller | 5 |
| 4 | 129 | 0 | Great product! | 2 |
| 5 | 141 | 0 | They charge my wife's phone | 5 |
| 6 | 156 | 0 | Works like a charm! | 4 |
| 7 | 157 | 0 | I've ordered a ton of these white and black, l... | 12 |
| 8 | 157 | 1 | I keep buying them because they are made so ... | 10 |
| 9 | 157 | 2 | I need more to buy for the office, my car, t... | 12 |


In [113]:
reviews = reviews.loc[reviews['word_count']>0,[0]]
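
The split, word-count, and filter steps above can be sketched in one chain with a regular expression and `explode`. This is an alternative under stated assumptions, not the notebook's approach: it also splits at `!` and `?`, and it strips the punctuation from the fragments, which the period-only split does not do. The two sample reviews are hypothetical, built from text shown in the output above.

```python
import pandas as pd

# Two toy reviews, indexed like rows of the original frame.
reviews = pd.Series(
    ["Very good quality. So far so good.", "Great product! Works like a charm!"],
    index=[86, 129],
    name="review_body",
)

sentences = (
    reviews.str.split(r"[.!?]+")  # pattern longer than one character, so pandas treats it as a regex
    .explode()                    # one fragment per row, original index repeated
    .str.strip()                  # drop the leading space left after each delimiter
)

# Drop the empty fragment that follows the final punctuation mark.
sentences = sentences[sentences != ""]

# Words per fragment, counted on whitespace.
word_count = sentences.str.split().str.len()

print(sentences.tolist())
```

Because `explode` repeats the original index, each fragment can still be traced back to its review without the `level_0` and `level_1` columns that `stack().reset_index()` produces.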

In [118]:
reviews.columns

Int64Index([0], dtype='int64')

In [124]:
reviews['POS'] = reviews[0].apply(nltk.word_tokenize).apply(nltk.pos_tag)

In [125]:
reviews

| | 0 | POS |
|---|---|---|
| 0 | Very good quality | [(Very, RB), (good, JJ), (quality, NN)] |
| 1 | So far so good | [(So, RB), (far, RB), (so, RB), (good, JJ)] |
| 3 | Good product and good seller | [(Good, JJ), (product, NN), (and, CC), (good, ... |
| 4 | Great product! | [(Great, JJ), (product, NN), (!, .)] |
| 5 | They charge my wife's phone | [(They, PRP), (charge, VBP), (my, PRP$), (wife... |
| 6 | Works like a charm! | [(Works, NNS), (like, IN), (a, DT), (charm, NN... |
| 7 | I've ordered a ton of these white and black, l... | [(I, PRP), ('ve, VBP), (ordered, VBN), (a, DT)... |
| 8 | I keep buying them because they are made so ... | [(I, PRP), (keep, VBP), (buying, VBG), (them, ... |
| 9 | I need more to buy for the office, my car, t... | [(I, PRP), (need, VBP), (more, JJR), (to, TO),... |
| 12 | etc | [(etc, NN)] |
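
Each entry in the `POS` column is a list of `(token, tag)` tuples, so it can be inspected with plain Python. The sketch below hard-codes three tagged fragments copied from the output above instead of calling NLTK, so it runs without the tokenizer and tagger resources. Tallying tags and pulling out adjectives is only one possible way to look at the column, not a step the notebook takes.

```python
from collections import Counter

import pandas as pd

# Tagged fragments copied from the POS output above (no NLTK call needed here).
pos = pd.Series(
    [
        [("Very", "RB"), ("good", "JJ"), ("quality", "NN")],
        [("So", "RB"), ("far", "RB"), ("so", "RB"), ("good", "JJ")],
        [("Great", "JJ"), ("product", "NN"), ("!", ".")],
    ]
)

# Tally how often each tag occurs across all fragments.
tag_counts = Counter(tag for tagged in pos for _, tag in tagged)

# Keep only the adjectives (Penn Treebank tags starting with JJ) per fragment.
adjectives = pos.apply(lambda tagged: [tok for tok, tag in tagged if tag.startswith("JJ")])

print(tag_counts.most_common())
print(adjectives.tolist())
```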
