# Overview
This project aims to build a predictive model that classifies whether a given review is positive or negative.  
Positive review: more than 3 stars  
Negative review: fewer than 3 stars  
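
The labelling rule above can be sketched as a small function. This is an illustration only: the name `label_review` is hypothetical, and skipping reviews with exactly 3 stars is an assumption, since the rule above does not say how they are handled.

```python
from typing import Optional


def label_review(star_rating: int) -> Optional[str]:
    """Map a star rating to a sentiment label (hypothetical helper)."""
    if star_rating > 3:
        return "positive"
    if star_rating < 3:
        return "negative"
    # Assumption: exactly 3 stars is neither positive nor negative, so no label.
    return None
```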

The data used in this project comes from Amazon's US Gift Card reviews dataset and can be accessed at
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz

### Data Processing
The original data contains the reviews along with additional metadata. Dedicated scripts perform the following steps:  
- Automatically download the data (if not already downloaded).  
- Process the data line by line.
- Split the data into training and test samples (default: 70% training, 30% test).
- Save the datasets into 4 files: positive and negative reviews for each of the training and test sets.  
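
As a rough sketch of the line-by-line processing and the 70/30 split, the snippet below reads tab-separated rows and sorts each review body into one of four buckets. It is not the project's actual implementation: `split_reviews`, `train_fraction` and `seed` are hypothetical names, the random per-row assignment is an assumed splitting strategy, and 3-star reviews are assumed to be skipped. The column names `star_rating` and `review_body` are taken from the dataset header. Downloading the data and writing the four files are left out.

```python
import csv
import io
import random


def split_reviews(tsv_lines, train_fraction=0.7, seed=0):
    """Split review bodies into train/test and positive/negative buckets."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    buckets = {"train_pos": [], "train_neg": [], "test_pos": [], "test_neg": []}
    # QUOTE_NONE: treat quote characters in review text as ordinary characters.
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        try:
            stars = int(row["star_rating"])
        except (KeyError, TypeError, ValueError):
            continue  # skip malformed rows
        if stars == 3:
            continue  # assumption: 3-star reviews are neither positive nor negative
        split = "train" if rng.random() < train_fraction else "test"
        polarity = "pos" if stars > 3 else "neg"
        buckets[f"{split}_{polarity}"].append(row["review_body"])
    return buckets


# Tiny in-memory example standing in for the downloaded file.
sample = "star_rating\treview_body\n5\tGreat gift\n1\tNever arrived\n3\tIt was fine\n"
buckets = split_reviews(io.StringIO(sample))
```

Because rows are assigned at random, the resulting proportions only approximate 70/30 on small inputs.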

The `DataExtractor` class used in the cells below implements these steps.

In [1]:
# Import the helper modules and instantiate the objects
from extractor import DataExtractor
from utils.text_processing import TextProcessing

extract_data = DataExtractor()
process_text = TextProcessing()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
files = extract_data.get_data()
print('Number of positive training samples: ', len(files['train_pos']))
print('Number of negative training samples: ', len(files['train_neg']))
print('Number of positive test samples: ', len(files['test_pos']))
print('Number of negative test samples: ', len(files['test_neg']))

Number of positive training samples:  97872
Number of negative training samples:  4393
Number of positive test samples:  41696
Number of negative test samples:  1969


In [8]:
#Columns in dataset
files['header']

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date\n


In [12]:
# A few samples of negative reviews (column 13 is review_body)
files['train_neg'][13]

0                                           What?????????
1            I wanted the gift car4 for MUSIC not movies.
2       I had no idea that the amount is listed in US ...
3       ordered this for a going away party thinking i...
4       The giftcard themselves weren't the problem - ...
                              ...                        
4388    Beware, since this item is not eligible for su...
4389    I have had iTunes for about 3 weeks. The Progr...
4390    I have a G5 dual processor that is a little ov...
4391    iTune by Apple has the worst customer service ...
4392    I have ordered 2 of these cards from Amazon an...
Name: 13, Length: 4393, dtype: object

In [13]:
# A few samples of positive reviews (column 13 is review_body)
files['train_pos'][13]

0                                                     Good
1        I can't believe how quickly Amazon can get the...
2                                                 excelent
3                               Great and Safe Gift Giving
4                                                     Bien
                               ...                        
97867    The itunes gift card is absolutely the best gi...
97868    Finally there is a way for your family to buy ...
97869    Finally there is a way for your family to buy ...
97870    I picked up a few of these at Target a while b...
97871    This is the ultimate tool for downloading musi...
Name: 13, Length: 97872, dtype: object

### Processing Reviews

To convert the text into a meaningful numerical representation, the following steps are performed:
- Tokenisation: split each sentence into individual words, e.g. a sentence with 5 words becomes a list with 5 elements.
- Stopword removal: remove stopwords, e.g. i, me, you.
- Punctuation removal: remove punctuation marks, e.g. ! and #.
- Lemmatisation: reduce each word to its root form, e.g. runs, running and ran are all converted to run.
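
The four steps above can be sketched with the standard library alone. This is a simplified stand-in, not the `TextProcessing` class itself: the notebook output suggests the project relies on NLTK's `stopwords`, `punkt` and `wordnet` resources, whereas this sketch swaps in a regular-expression tokeniser, a tiny hand-written stopword set and a toy lemma lookup table. All names here (`preprocess`, `STOPWORDS`, `LEMMAS`, `PUNCTUATION`) are hypothetical.

```python
import re
import string

# Assumption: a tiny illustrative stopword set, far smaller than a real list.
STOPWORDS = {"i", "me", "you", "he", "she", "the", "a", "an", "and", "is", "was", "for", "to"}
# Assumption: a toy lookup table standing in for a real lemmatizer.
LEMMAS = {"runs": "run", "running": "run", "ran": "run", "cards": "card"}
PUNCTUATION = set(string.punctuation)


def preprocess(text):
    """Tokenise, drop stopwords and punctuation, then lemmatise."""
    # Tokenisation: lower-case, then split into word runs and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Punctuation removal.
    tokens = [t for t in tokens if t not in PUNCTUATION]
    # Lemmatisation via the toy lookup table; unknown words pass through unchanged.
    return [LEMMAS.get(t, t) for t in tokens]


print(preprocess("She runs, he ran, and I was running!"))  # ['run', 'run', 'run']
```

A real lemmatizer handles far more word forms than this lookup table, so the output of the actual pipeline will differ on realistic text.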

### Word Frequencies

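After processing the text, we build a word frequency dictionary that counts how often each word appears in positive and negative reviews. Each key is a (word, label) pair and each value is the number of times that word occurs in reviews with that label.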
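
A minimal sketch of how such a dictionary could be built is shown here. The function name `build_freq_sketch` is hypothetical and is not the project's `build_freq` method. The label values `1.0` and `0.0` are assumptions as well, since the text does not state which class each value encodes.

```python
from collections import Counter


def build_freq_sketch(tokenised_reviews, labels):
    """Count (word, label) pairs across already-tokenised reviews."""
    freqs = Counter()
    for tokens, label in zip(tokenised_reviews, labels):
        for token in tokens:
            freqs[(token, label)] += 1
    return dict(freqs)


# Two toy reviews with assumed label values.
example = build_freq_sketch([["gift", "card", "gift"], ["gift"]], [1.0, 0.0])
```

Keying on the pair rather than on the word alone keeps the counts for the two classes separate, so the same word can have a different count under each label.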

In [16]:
#build freqs
all_reviews, labels = extract_data.process_freq_text()
feqs = process_text.build_freq(all_reviews,labels)


In [19]:
#get freq
feqs

{('wanted', 1.0): 254,
 ('gift', 1.0): 5266,
 ('car4', 1.0): 1,
 ('music', 1.0): 43,
 ('movie', 1.0): 22,
 ('idea', 1.0): 140,
 ('amount', 1.0): 187,
 ('listed', 1.0): 14,
 ('u', 1.0): 254,
 ('dollar', 1.0): 175,
 ('ended', 1.0): 80,
 ('purchasing', 1.0): 104,
 ('100', 1.0): 137,
 ('instead', 1.0): 179,
 ('cad', 1.0): 2,
 ('meant', 1.0): 31,
 ('extra', 1.0): 58,
 ('30cad', 1.0): 1,
 ('ordered', 1.0): 532,
 ('going', 1.0): 145,
 ('away', 1.0): 70,
 ('party', 1.0): 51,
 ('thinking', 1.0): 36,
 ('quick', 1.0): 30,
 ('easy', 1.0): 150,
 ('process', 1.0): 144,
 ('two', 1.0): 295,
 ('hour', 1.0): 157,
 ('later', 1.0): 169,
 ('called', 1.0): 239,
 ('customer', 1.0): 506,
 ('service', 1.0): 492,
 ('still', 1.0): 367,
 ("n't", 1.0): 1966,
 ('34', 1.0): 591,
 ('authorized', 1.0): 5,
 ('told', 1.0): 306,
 ('would', 1.0): 1157,
 ('three', 1.0): 102,
 ('maximum', 1.0): 2,
 ('rep.', 1.0): 6,
 ('half', 1.0): 52,
 ('back', 1.0): 289,
 ('may', 1.0): 91,
 ('full', 1.0): 35,
 ('day', 1.0): 721,
 ('unfort