# Scraping Amazon Reviews using Scrapy in Python Part 2
> Are you looking for a way to scrape Amazon reviews but do not know where to begin? If so, you may find this blog useful.
- toc: true 
- badges: true
- comments: true
- author: Zeyu Guan
- categories: [spaCy, Python, Machine Learning, Data Mining, NLP, RandomForest]
- annotations: true
- image: https://www.freecodecamp.org/news/content/images/2020/09/wall-5.jpeg
- hide: false

## Required Packages
[wordcloud](https://github.com/amueller/word_cloud), 
[geopandas](https://geopandas.org/en/stable/getting_started/install.html), 
[nbformat](https://pypi.org/project/nbformat/), 
[seaborn](https://seaborn.pydata.org/installing.html), 
[scikit-learn](https://scikit-learn.org/stable/install.html)

## Now let's get started!
First things first: load all the necessary libraries.

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
from wordcloud import WordCloud
from wordcloud import STOPWORDS
import re
import plotly.graph_objects as go
import seaborn as sns

# Data Cleaning

Following the previous blog, the raw data we scraped from Amazon looks like this:
![Raw Data](https://live.staticflickr.com/65535/51958371521_d139a6c0b1_h.jpg)

Even though the data looks relatively clean, there are still some imperfections: for example, `star1` and `star2` need to be combined and the date needs to be split. The whole process can be found in my [GitHub notebooks](https://github.com/christopherGuan/sample-ds-blog).
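
As a rough illustration of those two steps, here is a minimal sketch on made-up rows. The column contents, the date wording, and the output column names are assumptions for the example, not the exact format of the scraped data; the linked notebook has the real cleaning code.

```python
import pandas as pd

# Hypothetical raw rows: column contents and the date wording are assumptions.
raw = pd.DataFrame({
    "star1": ["5.0 out of 5 stars", None],
    "star2": [None, "2.0 out of 5 stars"],
    "date": ["Reviewed in the United States on January 5, 2022",
             "Reviewed in Canada on March 12, 2021"],
})

# Combine the two partially filled star columns into one numeric rating.
raw["rating"] = (
    raw["star1"].fillna(raw["star2"])
    .str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)

# Split the date string into a country and a real datetime value.
parts = raw["date"].str.extract(
    r"Reviewed in (?:the )?(?P<country>.+?) on (?P<date_str>.+)"
)
raw["country"] = parts["country"]
raw["review_date"] = pd.to_datetime(parts["date_str"], format="%B %d, %Y")
raw["month"] = raw["review_date"].dt.month

clean = raw[["rating", "country", "review_date", "month"]]
print(clean)
```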

Below is the data after cleaning. It contains 6 columns and more than 500 rows.

![Clean Data](https://live.staticflickr.com/65535/51930037691_4a23b4c441_b.jpg)


# EDA
Below are the questions I was curious about, along with the results of the analysis.

- Which rating (1-5) was given the most and the least often?
![Which point was rated the most](https://live.staticflickr.com/65535/51929072567_d34db66693_h.jpg)

- Which country are they targeting?
![Target Country](https://live.staticflickr.com/65535/51930675230_6314e2ccde_h.jpg)

- In which month do people tend to give higher ratings?
![higher rating](https://live.staticflickr.com/65535/51929085842_0cb0aa6b06_w.jpg)

- In which month do people leave the most comments?
![More comments](https://live.staticflickr.com/65535/51929085857_49f7c889d2_w.jpg)


- What are the useful words that people mentioned in the reviews?

![Useful words](https://live.staticflickr.com/65535/51930471329_82bf0c43b9.jpg)
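
The counts behind charts like these can be computed with a few pandas calls. This is only a sketch on a made-up table; the column names `rating` and `month` are assumptions about the cleaned data.

```python
import pandas as pd

# Hypothetical cleaned data: the column names are assumptions.
reviews = pd.DataFrame({
    "rating": [5, 4, 5, 1, 2, 5],
    "month": [1, 1, 2, 2, 3, 3],
})

# How often each rating was given.
rating_counts = reviews["rating"].value_counts().sort_index()

# Average rating and number of reviews per month.
mean_rating_by_month = reviews.groupby("month")["rating"].mean()
reviews_per_month = reviews.groupby("month").size()

print(rating_counts)
print(mean_rating_by_month)
print(reviews_per_month)
```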

# Sentiment Analysis (Method 1)

## What is sentiment analysis?

Essentially, sentiment analysis (or sentiment classification) falls under the broad category of text classification tasks: you are given a phrase or a list of phrases, and your classifier is expected to determine whether the sentiment behind each phrase is positive, negative, or neutral. To keep the problem a binary classification problem, the neutral class is sometimes ignored. Recent tasks have also taken into account sentiments such as "somewhat positive" and "somewhat negative."

In this specific case, we categorize 4- and 5-star reviews as the positive group and 1- and 2-star reviews as the negative group.
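
A minimal sketch of that labeling step is shown below. The `rating` column name is an assumption, and dropping 3-star reviews is also an assumption made here to keep the task binary.

```python
import pandas as pd

# Hypothetical ratings: the column name is an assumption.
df_example = pd.DataFrame({"rating": [5, 4, 3, 2, 1]})

# Assumption: 3-star reviews are treated as neutral and left out.
labeled = df_example[df_example["rating"] != 3].copy()

# 4 and 5 stars become +1 (positive); 1 and 2 stars become -1 (negative).
labeled["sentiment"] = labeled["rating"].apply(lambda r: 1 if r > 3 else -1)
print(labeled)
```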

![rating](https://live.staticflickr.com/65535/51930237493_b6afc18052_c.jpg)

Below are the most frequent words in reviews from the positive and negative groups, respectively.

Positive review
![positive](https://live.staticflickr.com/65535/51930164126_33b911e6b3_c.jpg)

Negative review
![negative](https://live.staticflickr.com/65535/51930165221_cf61fce68e_c.jpg)

## Build up the first model

Now we can build a simple model that takes a review as input and predicts whether it is positive or negative.

Because this is a classification task, we will train a simple logistic regression model.


- **Clean Data**
First, we create a function that removes punctuation from the data for later use.


In [3]:
def remove_punctuation(text):
    final = "".join(u for u in text if u not in ("?", ".", ";", ":",  "!",'"'))
    return final
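
As a usage sketch, the function can be applied to a text column with `apply`. The sample rows are made up, and the function is repeated here only so the snippet runs on its own.

```python
import pandas as pd

def remove_punctuation(text):
    # Same character set as the function above.
    return "".join(ch for ch in text if ch not in ("?", ".", ";", ":", "!", '"'))

# Hypothetical sample rows, including one missing title.
sample = pd.DataFrame(
    {"title": ["Great product!", 'Broke after a week... "never again"', None]}
)

# Drop missing text first, because the function expects a string.
sample = sample.dropna(subset=["title"])
sample["title"] = sample["title"].apply(remove_punctuation)
print(sample["title"].tolist())
```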


- **Split the Dataframe**

Now, we split the dataset into roughly 80% for training and 20% for testing. Each subset should contain only two variables: one indicating whether the review is positive or negative, and the other holding the review text.

![output](https://live.staticflickr.com/65535/51930294148_3a9db0297c_b.jpg)


In [None]:
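# Draw one uniform random number in [0, 1) per row;
# rows at or below 0.8 (about 80%) go to the training set
df['random_number'] = np.random.rand(len(df))
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]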
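
If you prefer an exact and reproducible split, scikit-learn's `train_test_split` is a common alternative to the random-number approach. The sketch below uses a made-up DataFrame; the `random_state` value and the stratification are choices made for this example, not part of the original workflow.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with the two columns used in this post.
toy = pd.DataFrame({
    "title": [f"review {i}" for i in range(10)],
    "sentiment": [1, -1] * 5,
})

# Exactly 20% of the rows go to the test set; stratify keeps the
# positive/negative ratio the same in both parts.
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["sentiment"]
)
print(len(train_toy), len(test_toy))
```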


- **Create a bag of words**

Here I would like to introduce a new package.

[Scikit-learn](https://scikit-learn.org/stable/install.html) is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

In this example, we are going to use [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=countvectorizer#sklearn.feature_extraction.text.CountVectorizer) to convert a collection of text documents to a matrix of token counts.

We need to convert the text into a bag-of-words representation because the logistic regression algorithm cannot work with raw text.


In [None]:
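from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary on the training text only, then reuse it for the test text
vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train['title'])
test_matrix = vectorizer.transform(test['title'])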

- **Import Logistic Regression**

In [6]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

- **Split target and independent variables**

In [None]:
X_train = train_matrix
X_test = test_matrix
y_train = train['sentiment']
y_test = test['sentiment']

- **Fit model on data**

In [None]:
lr.fit(X_train,y_train)

- **Make predictions**

In [None]:
predictions = lr.predict(X_test)

The output will be either 1 or -1. As defined before, 1 means the model predicts that the review is positive, and -1 means it predicts that the review is negative.
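
Putting the steps together, the sketch below trains the same kind of model on a handful of made-up reviews and predicts the label of two new ones. The training sentences are invented for illustration, and wrapping the two steps in a scikit-learn `Pipeline` is a convenience chosen here, not something the steps above require.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training reviews with +1 / -1 labels.
texts = [
    "good great product",
    "love it great",
    "bad broken product",
    "terrible broken waste",
]
labels = [1, 1, -1, -1]

# Bag of words followed by logistic regression, in one object.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

new_predictions = model.predict(["great love", "broken terrible"])
print(new_predictions)
```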

## Testing

Now, we can test the accuracy of our model!

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

# scikit-learn expects the true labels first, then the predictions
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

![accuracy](https://live.staticflickr.com/65535/51929260132_045027628c_z.jpg)

The accuracy is as high as 89%!
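
To see how an accuracy figure relates to the confusion matrix, here is a sketch with made-up labels (not the actual results above). Accuracy is the sum of the diagonal, i.e. the correctly classified reviews, divided by the total number of reviews.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for eight reviews.
y_true = np.array([1, 1, 1, -1, -1, 1, -1, 1])
y_pred = np.array([1, 1, -1, -1, 1, 1, -1, 1])

# Rows are true labels, columns are predicted labels, in the order [-1, 1].
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

print(cm)
print(accuracy)                          # 0.75
print(accuracy_score(y_true, y_pred))    # same value
```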

# Sentiment Analysis (Method 2)

In this part, you will learn how to build your own sentiment analysis classifier using Python and understand the basics of NLP (natural language processing). First, let's use a quick and dirty method that relies on the [Naive Bayes classifier](https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python) to predict the sentiment of Amazon product reviews.

Based on the application's requirements, we should first put each review in its own txt file and sort the files into separate folders for negative and positive reviews.

In [None]:
# Find all negative reviews
neg = df[df["sentiment"] == -1].review

# Reset the index so it runs from 0 to len(neg) - 1
neg.index = range(len(neg.index))

# Write each review to its own txt file (0.txt, 1.txt, ...)
for i in range(len(neg)):
    data = neg[i]
    with open(str(i) + ".txt", "w") as file:
        file.write(data + "\n")
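
The same idea can be written so that both groups land in their own folders in one pass. This is a sketch only: the sample reviews are made up, the folder names `pos` and `neg` are assumptions, and a temporary directory stands in for the real output location.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up reviews with +1 / -1 sentiment labels.
toy = pd.DataFrame({
    "review": ["Love it", "Stopped working", "Works fine"],
    "sentiment": [1, -1, 1],
})

# Hypothetical output location: a temporary directory for this example.
out_root = Path(tempfile.mkdtemp())

for label, folder in {1: "pos", -1: "neg"}.items():
    target = out_root / folder
    target.mkdir(parents=True, exist_ok=True)
    # Renumber the reviews of this group from 0 so the files are 0.txt, 1.txt, ...
    subset = toy.loc[toy["sentiment"] == label, "review"].reset_index(drop=True)
    for i, text in subset.items():
        (target / f"{i}.txt").write_text(text + "\n", encoding="utf-8")

print(sorted(p.name for p in (out_root / "pos").iterdir()))
```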

Next, we sort the files of the official data (the NLTK `movie_reviews` corpus) by name and empty their contents. In other words, we keep only the file names.

In [None]:
import os
import pandas as pd

# Get the file names of the negative reviews in the NLTK corpus
file_names = os.listdir('/Users/zeyu/nltk_data/corpora/movie_reviews/neg')

# Convert to a pandas DataFrame
neg_df = pd.DataFrame(file_names, columns=['file_name'])

# Split each name on "_" so the files can be sorted by their number
neg_df[['number', 'id']] = neg_df.file_name.apply(
    lambda x: pd.Series(str(x).split("_")))

# Use the number as the index and sort by it
neg_df_index = neg_df.set_index('number')
neg_org = neg_df_index.sort_index(ascending=True)
neg_org.reset_index(inplace=True)

# Drop the first row and keep only the file names
neg_org = neg_org.drop([0], axis=0).reset_index(drop=True)
neg_names = neg_org['file_name']

# Empty every file while keeping its name
for file_name in neg_names:
    with open(f'/Users/zeyu/nltk_data/corpora/movie_reviews/neg/{file_name}', 'w') as t:
        t.write("")

Next, we write the content of the Amazon reviews into the official files, keeping their original file names.

In [None]:
# Get the names of the Amazon review files
file_names = os.listdir('/Users/zeyu/Desktop/DS/neg')

# Convert to a pandas DataFrame
amazon_df = pd.DataFrame(file_names, columns=['file_name'])
amazon_names = amazon_df['file_name']

for index, file_name in enumerate(amazon_names):
    try:
        # Read one Amazon review and escape any non-ASCII characters
        with open(f'/Users/zeyu/Desktop/DS/neg/{file_name}', 'r') as src:
            t_val = ascii(src.read())

        # Write it into the corpus file at the same position.
        # Assumption: neg_names (built in the previous cell) holds the corpus file names.
        writefname = neg_names[index]
        with open(f'/Users/zeyu/nltk_data/corpora/movie_reviews/neg/{writefname}', 'w') as dst:
            dst.write(t_val)
    except (OSError, KeyError):
        print(f'{index} Reading/writing Error')

Finally, we can run these few lines to predict the sentiment of Amazon product reviews.

In [None]:
import nltk
from nltk.corpus import movie_reviews
import random

documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

# Shuffle the documents so that the later train/test split mixes both categories
random.shuffle(documents)


In [None]:
# Lower-case every word and count how often it appears
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Keep the 2,000 most frequent words as features
word_features = [word for word, _ in all_words.most_common(2000)]


def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
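
To show what this function returns, here is a toy version with a three-word feature list. The words and the sample document are made up; the logic is the same membership test as above.

```python
# Hypothetical, tiny feature list instead of the 2,000 words used above.
toy_word_features = ["great", "broken", "refund"]

def toy_document_features(document, word_features):
    # A set makes the "word in document" test fast.
    document_words = set(document)
    return {f"contains({w})": (w in document_words) for w in word_features}

features = toy_document_features(
    ["great", "value", "great", "price"], toy_word_features
)
print(features)
# {'contains(great)': True, 'contains(broken)': False, 'contains(refund)': False}
```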

In [None]:
# Train on everything except the first 100 documents, then measure accuracy on those 100

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

In [None]:
classifier.show_most_informative_features(5)
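
Once trained, the classifier can also label a single new review with `classify`. The sketch below trains a tiny Naive Bayes classifier on four made-up sentences so that it runs without the corpus files; the vocabulary, the training sentences, and the `featurize` helper are all hypothetical.

```python
import nltk

def featurize(text, vocabulary):
    # Hypothetical helper: same "contains(word)" features as above,
    # built from a whitespace-split, lower-cased string.
    words = set(text.lower().split())
    return {f"contains({w})": (w in words) for w in vocabulary}

vocabulary = ["great", "love", "broken", "refund"]

# Made-up training examples.
train_data = [
    (featurize("great product love it", vocabulary), "pos"),
    (featurize("love the great price", vocabulary), "pos"),
    (featurize("arrived broken want refund", vocabulary), "neg"),
    (featurize("broken again refund please", vocabulary), "neg"),
]

toy_classifier = nltk.NaiveBayesClassifier.train(train_data)

prediction = toy_classifier.classify(
    featurize("I love it, great quality", vocabulary)
)
print(prediction)
```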