# AmazonAnalyzer
**Federico Canzonieri** 

**Matricola:  1000024369** 

## What is it?
AmazonAnalyzer is a university project for the **TAP** (*Technologies for Advanced Programming*) course.

The aim of this project is to build a data pipeline using *Docker* and some of the main technologies for handling big data.


## What is it for?
This tool takes reviews from [Amazon](www.amazon.it) (even in *"real-time"*) and performs sentiment analysis on them in order to understand how well a product is received.

## Who is it for?

This tool can be useful for anyone who wants to buy something on *Amazon*, wants to understand whether the product is good, and wants to see whether the sentiment is increasing or decreasing over time.




## Application architecture
* **Ingestion**
* **Streaming**
* **Processing**
* **Indexing**
* **Visualization**



![Architecture](image/schema.svg)


## Ingestion

The ingestion phase is where the data is collected from the source website, [Amazon](www.amazon).

The data is retrieved through *web scraping* with Selenium and Python.
There are two services: the first one collects data in streaming ("real-time") by waiting for the latest reviews, while the second one retrieves all the reviews that have already been written.

The retrieved data is sent to **Logstash** over a TCP socket, in JSON format.
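
As an illustration of the last step, the snippet below is a minimal sketch of how a scraped review could be pushed over a TCP socket as one JSON line. The function name `send_review`, the default host and port, and the newline-delimited framing are assumptions made for this example (they would fit a Logstash TCP input configured with a line-oriented JSON codec); they are not taken from the project's code.

```python
import json
import socket


def send_review(review: dict, host: str = "localhost", port: int = 5000) -> None:
    """Send one review to a TCP listener as a newline-terminated JSON line.

    The host and port defaults are placeholders, not the project's real values.
    """
    # ensure_ascii=False keeps accented characters readable in the payload
    payload = json.dumps(review, ensure_ascii=False) + "\n"
    # Open a short-lived connection, send the whole line, then close it
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(payload.encode("utf-8"))
```

A call such as `send_review({"nickname": "mario", "rating": 5})` would then deliver one event per review. A long-running scraper would more likely keep a single connection open instead of reconnecting for every review.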



### Data taken from Amazon
* Nickname 
* Date
* Review text
* Rating (out of 5 stars)
* Helpful_vote 
* Country
* Review title
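
The fields above could be modelled as a small record type before being serialized to JSON. This is only a sketch: the class name `Review`, the attribute names and the chosen types are assumptions for illustration and may differ from the keys the scraper actually emits.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Review:
    # Attribute names are illustrative; the real JSON keys may differ
    nickname: str
    date: str          # kept as the raw string scraped from the page
    review: str        # body of the review
    rating: int        # number of stars, from 1 to 5
    helpful_vote: int  # how many users marked the review as helpful
    country: str
    title: str

    def to_json(self) -> str:
        """Serialize the record to a JSON string, preserving non-ASCII text."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

Keeping the date as a plain string avoids guessing the locale-specific format at scraping time; parsing can be deferred to a later stage of the pipeline.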

### Why Selenium?
Selenium is needed here because a simple web scraper (e.g. BeautifulSoup) cannot execute JavaScript, so the page content would not be loaded.

![Image](image/meme1.jpg)


### Logstash

In this project, Logstash receives the data through its TCP input plugin and forwards it to Kafka, in the topic **amazon**.

## Streaming

### Kafka
Apache Kafka is a data streaming platform that allows applications to publish and subscribe to streams of records in real time.



## Processing
Once the data is in Kafka, it can be manipulated and processed to obtain useful information.


### Apache Spark

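Apache Spark is a framework for distributed computation. According to the project's own benchmarks, it can run in-memory workloads up to 100x faster than Hadoop MapReduce, and it ships with a library (MLlib) for machine learning algorithms.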


### Machine Learning


The *"polarity scores"* of each review are obtained using VADER.

![Vader](image/vader.jpg)
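
VADER returns four scores per text (`neg`, `neu`, `pos` and a normalized `compound` value between -1 and 1). A common way to turn the compound score into a label is to use the thresholds suggested in the VADER documentation, ±0.05. The helper below is a minimal sketch of that rule; the function name `label_from_compound` is an assumption for this example and is not part of the library or of the project.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse sentiment label.

    Scores at or above +threshold are positive, scores at or below
    -threshold are negative, and anything in between is neutral.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

For instance, a compound score of `0.5994` would be labelled `positive` and `-0.5574` would be labelled `negative`.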



## Workaround for VADER

Unfortunately, VADER only works on English text.
How can VADER be made to work on multilingual text?

![Image](image/meme2.jpg)


In [2]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Create the analyzer once and reuse it for every call
vader = SentimentIntensityAnalyzer()

def get_sentiment(text):
    # Return only the compound score, a single value between -1 and 1
    value = vader.polarity_scores(text)
    return value['compound']

print(get_sentiment("this dish is a piece of shit"))
print(get_sentiment("Buono"))


-0.5574
0.7351


In [4]:
%pip install translate

Collecting translate
  Downloading translate-3.6.1-py2.py3-none-any.whl (12 kB)
Collecting lxml
  Downloading lxml-4.6.3-cp37-cp37m-manylinux2014_x86_64.whl (6.3 MB)
Collecting libretranslatepy==2.1.1
  Downloading libretranslatepy-2.1.1-py3-none-any.whl (3.2 kB)
Installing collected packages: lxml, libretranslatepy, translate
Successfully installed libretranslatepy-2.1.1 lxml-4.6.3 translate-3.6.1
Note: you may need to restart the kernel to use updated packages.


In [25]:
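from translate import Translator

# Translate Italian text to English so that VADER can score it.
# `vader` is the SentimentIntensityAnalyzer created in the previous cell.
translator = Translator(from_lang="italian", to_lang="en")

def get_sentiment(text):
    # Translate first, then return the full VADER polarity dictionary
    value = vader.polarity_scores(translator.translate(text))
    # value = value['compound']  # uncomment to return only the compound score
    return value

print(translator.translate("Bello sto libro"))
print(get_sentiment("Bello sto libro"))
print(get_sentiment("Good this book"))

# Translate a single word on its own
translator = Translator(from_lang="italian", to_lang="eng")
translation = translator.translate("Buono")
print(translation)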

This book is beautiful
{'neg': 0.0, 'neu': 0.435, 'pos': 0.565, 'compound': 0.5994}
{'neg': 0.0, 'neu': 0.408, 'pos': 0.592, 'compound': 0.4404}
Good


## Does it work?
![Image](image/meme3.jpg)





The performance of this method depends on the quality of the translation.


## Indexing

The data is indexed using Elasticsearch.
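
To give an idea of what indexing involves, the sketch below builds a request body in the newline-delimited format expected by the Elasticsearch Bulk API: one action line followed by one document line per review. The helper name `build_bulk_body`, the index name `amazon` and the document fields are assumptions for this example. The project may well write to Elasticsearch through a connector instead of building such requests by hand.

```python
import json
from typing import Iterable


def build_bulk_body(docs: Iterable[dict], index: str = "amazon") -> str:
    """Build a newline-delimited body for the Elasticsearch Bulk API.

    Each document is preceded by an "index" action line. The index name
    is a placeholder chosen for this example.
    """
    lines = []
    for doc in docs:
        # Action line: tells Elasticsearch which index receives the document
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself
        lines.append(json.dumps(doc, ensure_ascii=False))
    if not lines:
        return ""
    # The Bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"
```

The resulting string would be sent to the `_bulk` endpoint with the `application/x-ndjson` content type.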


## Visualization

Kibana is an open-source data visualization dashboard for Elasticsearch. It allows you to create different types of charts and plots to explore and query the data.
KIBANA IMAGE

## Future updates


* Kubernetes
* Add other sources (eBay, AliExpress)
* Use