# Sentiment Analysis with Deep Learning

# Phase 3 - Prediction

This notebook contains the functions and code for scraping comments from the Amazon website, pre-processing them, and predicting their sentiment with the trained model.

### CRISP-DM phase

The Deployment phase of CRISP-DM is covered in this notebook.

#### 6. Deployment

Generally, this means deploying a code representation of the model into an operating system to score or categorize new, unseen data as it arises, and creating a mechanism for using that new information to solve the original business problem. Importantly, the code representation must also include all the data preparation steps leading up to modeling, so that the model treats new raw data in the same manner as during model development.

### Table of Contents

- 1. Import Libraries
- 2. Define Functions 
- 3. Load-Read-Extract
- 4. Pre-Processing
- 5. Tokenizing-Sequencing-Padding
- 6. Exploratory Data Analysis (EDA)


## 1. Import Libraries

In [1]:

import pandas as pd 
import numpy as np
import string
from nltk.corpus import stopwords
stop = stopwords.words('english')
from sklearn.metrics import accuracy_score
np.random.seed(0)
import pickle

from keras.models import Model, Sequential, Input
from keras.models import load_model
from keras.preprocessing import text, sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

import urllib.request
import urllib.parse
import urllib.error
from bs4 import BeautifulSoup
import ssl
import joblib

Using TensorFlow backend.


### Load the trained model

In [25]:
#model = load_model('cnn_2cnv.h5')

model = joblib.load("models/joblib_RL_Model.pkl")

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


In [26]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 128, 64)           524288    
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 128, 64)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 128, 100)          25700     
_________________________________________________________________
batch_normalization_1 (Batch (None, 128, 100)          400       
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 128, 100)          40100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 128, 100)          400       
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 100)              

### Load the tokenizer

Load the saved tokenizer with pickle. This tokenizer was fitted in the pre-processing notebook and saved with pickle. We need it to build the pipeline for the new data we will scrape from the Amazon website.

In [27]:
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer=pickle.load(handle)
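
For reference, saving and loading form a simple round trip. The sketch below uses a plain dictionary as a stand-in for the fitted tokenizer and a temporary file instead of `tokenizer.pickle`; both are assumptions made so that the example runs without Keras.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted Keras tokenizer (assumption)
fake_tokenizer = {"word_index": {"great": 1, "door": 2, "works": 3}}

# Save the object, as the pre-processing notebook presumably did
path = os.path.join(tempfile.mkdtemp(), "tokenizer.pickle")
with open(path, "wb") as handle:
    pickle.dump(fake_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back, mirroring the cell above
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == fake_tokenizer)  # → True
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.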

## 2. Define Functions

In [28]:

def scrape_reviews():
    # Build an SSL context that ignores certificate errors (insecure; for experiments only)
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    url = input("Enter Amazon Product Url- ")
    # Make a GET request to retrieve the page; pass ctx so the SSL settings above take effect
    html_page = urllib.request.urlopen(url, context=ctx)
    soup = BeautifulSoup(html_page, 'html.parser')

    reviews = []
    ratings = []

    # Each review sits in a <div data-hook="review"> element
    review_rows = soup.find_all('div', attrs={'data-hook': 'review'})
    for row in review_rows:
        # Keep only the first character of the rating text (the star count)
        ratings.append(row.find('span', attrs={'class': 'a-icon-alt'}).text.strip()[0])
        reviews.append(row.find('div', attrs={'data-hook': 'review-collapsed'}).text.strip())

    print('There are {} reviews in for this product'.format(len(reviews)))
    print(reviews)

    return reviews, ratings

def punctuationRemover(p):
    '''
    Input: Takes a string. You may have to use str() to force it.
    Replaces every punctuation character and digit with a space.
    Output: Returns a string (with one leading space).
    '''
    punctuations = '''!()-[]{};:'"\\,<>./?@#$%^&*_~1234567890'''
    no_punctuations = ' '

    for char in p:  # Iterate over the characters of the string
        if char in punctuations:
            no_punctuations = no_punctuations + ' '
        else:
            no_punctuations = no_punctuations + char

    return no_punctuations

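def no_stopword(p):
    # Drop stopwords, then lower-case the remaining words.
    # Note: the stopword check runs before lower-casing, so capitalized
    # stopwords such as "The" are kept.
    token = ' '.join([word.lower() for word in p.split() if word not in stop])
    return token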
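
To see what the two cleaning steps do to a single sentence, here is a minimal sketch. It combines both steps in one hypothetical helper, `clean_review`, and makes two assumptions: it uses a small hard-coded stopword set instead of the NLTK list, and it uses `string.punctuation`, which covers a slightly larger set of characters than the list in `punctuationRemover`.

```python
import string

# Small stand-in stopword set (assumption); the notebook uses NLTK's English list
STOP = {"the", "is", "a", "and", "it"}

def clean_review(text_in, stop=STOP):
    # Replace punctuation and digits with spaces
    table = str.maketrans({c: " " for c in string.punctuation + string.digits})
    no_punct = text_in.translate(table)
    # Drop stopwords (case-sensitive check, as in no_stopword), then lower-case
    return " ".join(w.lower() for w in no_punct.split() if w not in stop)

print(clean_review("It is great, and the door works 100%!"))
# → it great door works
```

The capitalized "It" survives because the stopword check compares the original word against a lower-case list.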


    

## 3. Scraping Comments 

In this part we scrape the comments for a product from the website. We also scrape the ratings so that we can evaluate the model's predictions on unseen data.

In [39]:
review_list, ratings = scrape_reviews()



Enter Amazon Product Url- https://www.amazon.com/Internets-Best-Decorative-Nightstand-Enclosure/dp/B01M01DVR0/ref=pd_sim_199_5/138-6621365-9457105?_encoding=UTF8&pd_rd_i=B01M01DVR0&pd_rd_r=0be292b7-7baf-41bb-aa84-f1383c8af59f&pd_rd_w=PwcWV&pd_rd_wg=gxj8M&pf_rd_p=04d27813-a1f2-4e7b-a32b-b5ab374ce3f9&pf_rd_r=QWS2HDND8660HQTAYERQ&psc=1&refRID=QWS2HDND8660HQTAYERQ
There are 8 reviews in for this product
["This is much sturdier than I anticipated. I had no problems putting it together and it looks great. (Im a 48 year old woman with back problems and no engineering degree.) I purchased 3 different cat houses on Amazon for hiding litter boxes around my home. This is beautiful and no one notices what is hiding behind the door.  I put it together and placed the litter box inside, then I left the door open for a couple of days so that the cats got used to it and could check it out. Then, I shut the door and they seemed to never notice the change. Easy to sweep out if needed. I purchase a litter

In [40]:
ratings

['5', '3', '4', '1', '4', '1', '5', '1']

### Pre-process the comment list

Remove punctuation and stopwords from each comment using the pre-processing functions. Each comment comes back as a single cleaned, lower-cased string.

In [48]:

review_list_processed=[no_stopword(punctuationRemover(p)) for p in review_list]


In [42]:
test=review_list_processed

In [43]:
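# Convert each cleaned review to a sequence of word indices
test_sequences = tokenizer.texts_to_sequences(test)
# Pad (or truncate) every sequence to 128 tokens, the input length the model expects
test_vector = sequence.pad_sequences(test_sequences, maxlen=128)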

In [44]:
test_vector

array([[   0,    0,    0, ...,  887,  301, 1629],
       [   0,    0,    0, ...,   23, 2583,  216],
       [   0,    0,    0, ...,  216,  175, 1052],
       ...,
       [   0,    0,    0, ...,  398,    6,   56],
       [   0,    0,    0, ..., 2252,  858,    9],
       [   0,    0,    0, ...,  207, 1993, 1219]], dtype=int32)
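
The leading zeros in this output come from padding. With its default settings, `pad_sequences` pads at the front of each sequence and, for sequences longer than `maxlen`, keeps only the last `maxlen` entries. The pure-Python sketch below mimics that default behaviour; `pad_pre` is a hypothetical helper for illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]  # truncate from the front if too long
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([887, 301, 1629], maxlen=6))           # → [0, 0, 0, 887, 301, 1629]
print(pad_pre([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6))   # → [3, 4, 5, 6, 7, 8]
```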

In [45]:
prediction=model.predict([test_vector])

In [46]:
print(prediction)
# Round each probability to the nearest class label (0 or 1)
print(np.round(prediction).ravel().tolist())

[[0.82825875]
 [0.4909972 ]
 [0.9271606 ]
 [0.02025419]
 [0.03693619]
 [0.8342918 ]
 [0.8935666 ]
 [0.00141972]]
[1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
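
Rounding is equivalent to cutting the probabilities at 0.5. Writing the threshold out makes the cutoff visible and easy to adjust. The sketch below reuses the probabilities printed above; the name `THRESHOLD` and its value are assumptions for illustration.

```python
# Probabilities copied from the model output above
probabilities = [0.82825875, 0.4909972, 0.9271606, 0.02025419,
                 0.03693619, 0.8342918, 0.8935666, 0.00141972]

THRESHOLD = 0.5  # assumed decision boundary

labels = [1 if p >= THRESHOLD else 0 for p in probabilities]
print(labels)  # → [1, 0, 1, 0, 0, 1, 1, 0]
```

The second review, at about 0.49, sits right at the boundary, so a small change in the threshold would flip its label.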


In [47]:
ratings

['5', '3', '4', '1', '4', '1', '5', '1']
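
To compare the predictions with the scraped ratings, the star ratings first have to be mapped to binary labels. The notebook does not state which mapping was used, so the sketch below assumes that 4 or 5 stars counts as positive (1) and anything lower as negative (0).

```python
ratings = ['5', '3', '4', '1', '4', '1', '5', '1']  # scraped star ratings
predicted = [1, 0, 1, 0, 0, 1, 1, 0]                # rounded model output

# Assumption: 4 or 5 stars is positive (1), 3 or fewer is negative (0)
actual = [1 if int(r) >= 4 else 0 for r in ratings]

matches = sum(a == p for a, p in zip(actual, predicted))
accuracy = matches / len(actual)

print(actual)    # → [1, 0, 1, 0, 1, 0, 1, 0]
print(accuracy)  # → 0.75 under this labeling assumption
```

The `accuracy_score` function imported in section 1 should give the same figure when called as `accuracy_score(actual, predicted)`. Eight reviews are too few for a reliable estimate.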