# Amazon product performance check using Sentiment Analysis

## What is Web Scraping?

Web scraping is an automated method for collecting large amounts of data from websites. Most of this data arrives as unstructured HTML, which is then converted into structured form, such as a spreadsheet or a database, so that it can be used in other applications.
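As a minimal sketch of the "unstructured HTML to structured data" step, the snippet below parses a small, made-up HTML fragment into a list of dictionaries. It uses only the standard library's `html.parser` instead of BeautifulSoup so that it runs without a network connection; the `SAMPLE_HTML` markup, the `ProductParser` class and the `data-id` attribute are assumptions for illustration, not Amazon's real page structure.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li class="product" data-id="A1">Steel watch</li>
  <li class="product" data-id="B2">Leather watch</li>
</ul>
"""

class ProductParser(HTMLParser):
    """Collects one dict per <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None  # row being filled while inside a product <li>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self._current = {"id": attrs.get("data-id"), "name": ""}

    def handle_data(self, data):
        # Text is only recorded while we are inside a product element.
        if self._current is not None:
            self._current["name"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.rows.append(self._current)
            self._current = None

parser = ProductParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` to obtain a table.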

In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [None]:
header={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}

In [None]:
search_query="titan+men+watches"
base_url="https://www.amazon.in/s?k="
url=base_url+search_query
header={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}
search_response=requests.get(url,headers=header)
search_response.status_code

200

A status code of 200 means the request succeeded. Any other code means the page was not returned, and the rest of the process cannot continue.
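One way to make this check explicit is sketched below with the standard library's `http.HTTPStatus`. The helper names `can_continue` and `describe` are hypothetical and not part of the notebook; they only illustrate the decision the scraper makes on each response.

```python
from http import HTTPStatus

def can_continue(status_code: int) -> bool:
    """Return True only for 200 OK, the one status the scraper proceeds on."""
    return status_code == HTTPStatus.OK

def describe(status_code: int) -> str:
    """Readable name for a status code, e.g. for logging a failed request."""
    try:
        return HTTPStatus(status_code).phrase
    except ValueError:
        # Codes that are not registered in HTTPStatus
        return "Unknown status"

for code in (200, 404, 503):
    print(code, describe(code), "-> continue" if can_continue(code) else "-> stop")
```

With `requests`, the same test would be applied to `search_response.status_code`.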

In [None]:
# Fetches the search results page for a query.
# Returns the response on HTTP 200, otherwise the string "Error".

cookie={} # add request cookies inside the braces if the site requires them
def getAmazonSearch(search_query):
    url="https://www.amazon.in/s?k="+search_query
    print(url)
    page=requests.get(url,cookies=cookie,headers=header)
    if page.status_code==200:
        return page
    else:
        return "Error"


# A function to get the contents of individual product pages using 'data-asin' number (unique identification number)

def Searchasin(asin):
    url="https://www.amazon.in/dp/"+asin
    print(url)
    page=requests.get(url,cookies=cookie,headers=header)
    if page.status_code==200:
        return page
    else:
        return "Error"

## Scraping product names and ASIN numbers

Every product on Amazon has a unique identifier called the ASIN (Amazon Standard Identification Number). With the ASIN, we can go directly to an individual product's page.
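The sketch below shows how an ASIN maps to a product URL, using the same `https://www.amazon.in/dp/` prefix as `Searchasin`. The format check (exactly 10 uppercase letters or digits) is an assumption based on the ASINs seen in this notebook, and `is_asin` and `product_url` are hypothetical helper names.

```python
import re

BASE_URL = "https://www.amazon.in/dp/"  # same prefix the notebook's Searchasin uses
# Assumption: an ASIN is exactly 10 uppercase letters or digits.
ASIN_PATTERN = re.compile(r"[A-Z0-9]{10}")

def is_asin(value: str) -> bool:
    """True if the value looks like an ASIN under the assumption above."""
    return ASIN_PATTERN.fullmatch(value) is not None

def product_url(asin: str) -> str:
    """Build the product page URL for a single ASIN."""
    if not is_asin(asin):
        raise ValueError(f"not a valid ASIN: {asin!r}")
    return BASE_URL + asin

# A scraped attribute may be empty, so filter before building URLs.
scraped = ["B07SNC1BZQ", "", "B07DD2KBXV"]
urls = [product_url(a) for a in scraped if is_asin(a)]
print(urls)
```

Filtering first keeps empty or malformed `data-asin` values from producing requests to pages that do not exist.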

In [None]:
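data_asin=[]
response=getAmazonSearch('titan+men+watches')
# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup=BeautifulSoup(response.content,'html.parser')
# Each search result is a <div> that carries its ASIN in the 'data-asin' attribute
for i in soup.findAll("div",{'class':"sg-col-4-of-12 s-result-item s-asin sg-col-4-of-16 sg-col s-widget-spacing-small sg-col-4-of-20"}):
    data_asin.append(i['data-asin'])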

https://www.amazon.in/s?k=titan+men+watches


In [None]:
data_asin

In [None]:
link=[]
# Visit the first 10 products and collect each one's "see all reviews" link
for asin in data_asin[:10]:
    response=Searchasin(asin)
    soup=BeautifulSoup(response.content,'html.parser')
    for a in soup.findAll("a",{'data-hook':"see-all-reviews-link-foot"}):
        link.append(a['href'])

https://www.amazon.in/dp/B07SNC1BZQ
https://www.amazon.in/dp/B07DD2KBXV
https://www.amazon.in/dp/B01LZPW4SV
https://www.amazon.in/dp/B07DD4LBXF
https://www.amazon.in/dp/B018VZBTLY
https://www.amazon.in/dp/B00ISNVQMW
https://www.amazon.in/dp/B07CQ2DBSN
https://www.amazon.in/dp/B07DD39617
https://www.amazon.in/dp/B01CLFHBAS
https://www.amazon.in/dp/B079FW32J7


In [None]:
def Searchreviews(review_link):
    url="https://www.amazon.in"+review_link
    print(url)
    page=requests.get(url,cookies=cookie,headers=header)
    if page.status_code==200:
        return page
    else:
        return "Error"

In [None]:
reviews=[]
# For every product, walk through up to 100 review pages and collect the review text
for review_link in link:
    for k in range(100):
        response=Searchreviews(review_link+'&pageNumber='+str(k))
        soup=BeautifulSoup(response.content,'html.parser')
        for i in soup.findAll("span",{'data-hook':"review-body"}):
            reviews.append(i.text)

In [None]:
reviews[:40]

In [None]:
rev={'reviews':reviews} #converting the reviews list into a dictionary
review_data=pd.DataFrame.from_dict(rev) #converting this dictionary into a dataframe

In [None]:
review_data

In [None]:
review_data.to_csv('Scraping reviews.csv',index=False)

## Sentiment Analysis

Let's perform sentiment analysis using the model we built in the previous notebook.

In [55]:
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import string
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
from nltk.corpus import wordnet, stopwords

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [56]:
# Load the trained model; the context manager closes the file afterwards
with open('/content/logreg.pkl', 'rb') as f:
    logreg_model = pickle.load(f)
scrapped_reviews = pd.read_csv('/content/Scraping reviews.csv')

In [57]:
# Dropping the Null values
scrapped_reviews.dropna(inplace = True)

In [58]:
scrapped_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3598 entries, 1 to 3674
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   reviews  3598 non-null   object
dtypes: object(1)
memory usage: 56.2+ KB


In [59]:
stopwords_list = stopwords.words('english')

# Removes emojis (and any other non-ASCII characters) from a review
def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')

# Removing stopwords and unwanted text
def ReviewProcessing(df):
  # strip leading and trailing '\n' from every review
  df['reviews']=df['reviews'].apply(lambda x:x.strip('\n'))
  # remove everything except letters, digits and spaces
  # (regex=True is required: newer pandas versions treat the pattern as a literal string by default)
  df['review_cleaned'] = df.reviews.str.replace('[^a-zA-Z0-9 ]', '', regex=True)
  # lowercase
  df.review_cleaned = df.review_cleaned.str.lower()
  # split into a list of words
  df.review_cleaned = df.review_cleaned.str.split(' ')
  # remove stopwords
  df.review_cleaned = df.review_cleaned.apply(lambda x: [item for item in x if item not in stopwords_list])
  return df

# Lemmatization
def get_wordnet_pos(word):
  tag = nltk.pos_tag([word])[0][1][0].upper()
  tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

  return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = nltk.stem.WordNetLemmatizer()

def get_lemmatize(sent):
  return " ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sent)])
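To see what the cleaning steps do to a single review, here is a dependency-free sketch that applies the same sequence (strip newlines, drop non-alphanumeric characters, lowercase, split, remove stopwords) to one string. The `STOPWORDS` set is a tiny stand-in for NLTK's English list and `clean_review` is a hypothetical helper; lemmatization is left out.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption, for illustration only).
STOPWORDS = {"the", "is", "not", "too", "a", "to"}

def clean_review(text: str) -> str:
    """Apply the same cleaning sequence as ReviewProcessing to one review."""
    text = text.strip("\n")                    # drop leading/trailing newlines
    text = re.sub(r"[^a-zA-Z0-9 ]", "", text)  # keep letters, digits, spaces
    text = text.lower()
    # Unlike the notebook version, empty tokens from repeated spaces are dropped too.
    tokens = [t for t in text.split(" ") if t and t not in STOPWORDS]
    return " ".join(tokens)

cleaned = clean_review("\nToo heavy. The dial glass is not scratch proof\n")
print(cleaned)  # heavy dial glass scratch proof
```

Note that "not" is a stopword, so negations such as "not scratch proof" lose their negation before the text reaches the model.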

In [60]:
clean_data = ReviewProcessing(scrapped_reviews)
clean_data.review_cleaned = clean_data.review_cleaned.apply(' '.join)
clean_data.review_cleaned = clean_data.review_cleaned.apply(deEmojify)
clean_data['review_cleaned_lemmatized'] = clean_data.review_cleaned.apply(get_lemmatize)



In [61]:
clean_data

|      | reviews | review_cleaned | review_cleaned_lemmatized |
| ---- | ------- | -------------- | ------------------------- |
| 1    | Nice gift to give! However not sure how long t... | nice gift give however sure long product may l... | nice gift give however sure long product may l... |
| 2    | The media could not be loa... | media could loaded ... | medium could load |
| 3    | Super | super | super |
| 4    | Too heavy. The dial glass is not scratch proof | heavy dial glass scratch proof | heavy dial glass scratch proof |
| 5    | Good looking slim watch | good looking slim watch | good look slim watch |
| ...  | ... | ... | ... |
| 3670 | i have gifted this watch to my father on his b... | gifted watch father birthday good loved | gift watch father birthday good love |
| 3671 | No |  |  |
| 3672 | Geniune product | geniune product | geniune product |
| 3673 | Simply amazing in its look and feel. So far it... | simply amazing look feel far working without h... | simply amaze look feel far work without hiccup |


In [62]:
# Input Data
x = clean_data['review_cleaned_lemmatized'].copy()
