- Use Ubuntu 18.04 LTS (due to use of bash scripts)
- AWS EC2 key pair created in the region 'ap-southeast-1'
Take Note
- Notable events are printed to the console to show progress. All actions and events are appended to the log file at /automation/config/logs.log
- The estimated time required for setting up is about 20 minutes; the estimated times for the analytics can be found in the table below.
- Amazon performs status checks on the instances, which monitor the software and network configurations. These checks can fail for various reasons; this is outside of our control and cannot be resolved using Boto3. In that case, terminate all instances and security groups manually and try again.
- The configuration of the EC2 instances can be changed in `automation/main.py`
- Clone the project repository:

```sh
git clone https://github.com/andrehadianto/50043_isit_database.git
```
- In the `automation` folder, create a Python 3 virtual environment and install the requirements:

```sh
virtualenv env
source env/bin/activate
python -m pip install -r requirements.txt
```
- Run `main.py` to set up and configure the EC2 instances and run the analytics. This takes 20-30 minutes.

```sh
python3 main.py <access key> <secret access key> <key pair name> <absolute pem file directory> <cluster size>
```
- The public IP of the EC2 instance hosting the web application is printed on the console. The website can then be viewed at `<public IP 1>:5000/isit`
- To run the analytics again:

```sh
python analytics.py
```

  To change the cluster size:

```sh
python clean.py
python3 main.py <access key> <secret access key> <key pair name> <absolute pem file directory> <cluster size>
```
- The results of the analytics tasks are stored in HDFS; the file locations are printed on the console. To list the files, SSH into the namenode and run:

```sh
hadoop fs -ls /tfidf
hadoop fs -ls /corr
```
A sample of the TF-IDF and correlation analytics output can be seen below.
The estimated runtimes (in minutes) for performing the analytics task on various vocabulary sizes and cluster sizes are as follows:
vocabulary size | 2 nodes | 4 nodes | 8 nodes |
---|---|---|---|
2000 | 10 | 4 | 2 |
5000 | 20 | 7 | 3 |
10000 | - | - | 6 |
In this project, we built a web application for Kindle book reviews, similar to Goodreads. The brief was to start with public datasets from Amazon and to design and implement the application around them; the requirements were intentionally broad, leaving freedom to explore alternative design choices. The full details of the project requirements can be found here.
We were provided with two public datasets from Amazon:
- Amazon Kindle reviews, available from the Kaggle website. This dataset has 982,619 entries (about 700 MB).
- Amazon Kindle metadata, available from the UCSD website. This dataset has 434,702 products (about 450 MB).
After cleaning the dataset, we loaded the data into the respective databases.
This consists of a web page that lets a user perform at least the following:
- See some reviews
- Add new review
- Add a new book

You are free to use any web framework you want, and free to decide your own structure and layout, as long as input can be entered and output returned. A prettier website earns more points, up to a limit, and more functionality can be added as the project progresses.
We used ReactJS for our frontend because it is fast, scalable, and simple. It allows us to create a web application which can change data, without reloading the page. React also allows us to create reusable UI components, making our code lighter and less repetitive. The full documentation on the usage can be found here.
The features implemented in our web application are as follows:
- The homepage of our website shows a list of all books available in our database. Users may sign up and log in to the website.
- Filter books by multiple categories; typing in the input form shows category suggestions.
- Filtered books show the selected categories as labels in the upper-left corner.
- Search for a book by its `asin` code.
- Users may log in to gain access to user-only features.
- Users may sign up for a new account; usernames are unique.
- User-only features consist of adding new books, viewing logs, and logging out.
- Selecting a book shows the book details: the title (if available), genre, author, description, and price, followed by all available reviews by other users and the overall rating.
- Anonymous users may post a review under a nickname of their choice.
- Logged-in users do not have to choose a nickname; their reviews are posted under their display name.
- Logged-in users can add a new book, with a preview shown before submission.
- Users can check the details of a successful new book submission immediately.
- Logged-in users may see the list of logs.
- View the details of a log.
This is a preview of our frontend
The full implementation of our frontend and the various features supported by our web application can be found [here](/static).
- The web server receives requests and computes the responses by interacting with the databases.
- The reviews are stored in a relational database (SQL is recommended).
- The metadata (book descriptions) is in a document store (MongoDB).
- The web server logs are recorded in a document store. Each log record must have at least the following information:
- Timestamp
- What type of request is being served
- What is the response
Our server uses Flask because it allows more control over which components to use, such as which databases to connect to and how to interact with them. Since our project requires us to connect to multiple SQL and Mongo databases, we decided that Flask was most suitable. The full documentation on the usage can be found here.
Schema (MongoDB)
- id: ObjectId, primary key
- asin: String
- title: String
- categories: String[]
- imUrl?: String
- related?: json
- also_bought: string[]
- also_viewed: string[]
- bought_together: string[]
- price?: Double
- description?: String
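For illustration, a book document following this schema might look like the sketch below. The field values are made up; only the field names come from the schema above (fields marked `?` are optional and may be absent).

```python
# A hypothetical book document matching the metadata schema.
# All concrete values here are made-up examples, not real data.
book = {
    "asin": "B000FA64PK",
    "title": "A Sample Kindle Book",
    "categories": ["Kindle Store", "Kindle eBooks", "Science Fiction"],
    "imUrl": "http://example.com/cover.jpg",       # optional
    "related": {                                   # optional
        "also_bought": ["B000FA64PA"],
        "also_viewed": ["B000FA64PB"],
        "bought_together": ["B000FA64PC"],
    },
    "price": 3.99,                                 # optional
    "description": "An example description.",     # optional
}

print(sorted(book.keys()))
```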
Endpoints
Endpoint | REST | Description |
---|---|---|
/book/:asin | GET | Returns book details (all available fields); erroneous asin values in the array are ignored |
/books?page={}&count={} | GET | Returns book information (lightweight) with pagination |
/books/previews/ | POST | Returns book information (lightweight) |
/books/category | POST | Returns books whose categories contain the categories in categoryArray |
/book/new | POST | Adds a book to the database |
/book/update/:asin | PUT | Updates book details |
/category/all | GET | Returns all available book categories |
/user/logs?page={}&count={} | GET | Returns a list of logs |
/user/logs/:id | GET | Returns a specific log in detail |
/user/login | POST | Logs in to an account; returns user data for authorization (body: name, id, token) |
/user/signup | POST | Signs up a new account; returns the status code of the sign-up result |
/books_titles | GET | Returns books that have titles in the metadata (body: titles, asin) |
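As a small sketch, a client can assemble the pagination parameters into a query string when calling the `/books` endpoint. The helper below is hypothetical; the port and parameter names come from this document.

```python
from urllib.parse import urlencode

def books_url(host, page, count):
    """Build the paginated /books URL for a given backend host (hypothetical helper)."""
    query = urlencode({"page": page, "count": count})
    return "http://{}:5000/books?{}".format(host, query)

print(books_url("1.2.3.4", 2, 20))
```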
Schema (MySQL)
- id: Integer, primary key
- asin: String
- helpful: String
- overall: Integer
- reviewText: Text
- reviewTime: String
- reviewerID: String
- reviewerName: String
- summary: String
- unixReviewTime: Integer
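The review schema can be illustrated with the sketch below. The project uses MySQL; sqlite3 is used here only so the example is self-contained, the column types are assumptions, and the inserted row is made up.

```python
import sqlite3

# In-memory stand-in for the MySQL reviews table (types are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reviews (
        id INTEGER PRIMARY KEY,
        asin TEXT,
        helpful TEXT,
        overall INTEGER,
        reviewText TEXT,
        reviewTime TEXT,
        reviewerID TEXT,
        reviewerName TEXT,
        summary TEXT,
        unixReviewTime INTEGER
    )
""")
conn.execute(
    "INSERT INTO reviews (asin, helpful, overall, reviewText, reviewTime, "
    "reviewerID, reviewerName, summary, unixReviewTime) VALUES (?,?,?,?,?,?,?,?,?)",
    ("B000FA64PK", "[0, 0]", 5, "Loved it.", "01 2, 2014",
     "A1B2C3", "sample_user", "Great read", 1388620800),
)
row = conn.execute("SELECT asin, overall FROM reviews").fetchone()
print(row)  # ('B000FA64PK', 5)
```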
Endpoints
Endpoint | REST | Description |
---|---|---|
/reviews/:asin?page={}&count={} | GET | Returns reviews details (all fields) for a book with pagination
|
/reviews/:asin | POST | Adds a review to database
|
/reviews/user/:userid | GET | Returns all reviews by user
|
/review/:id | DELETE | Delete review |
/review/:id | PUT | Edit review by user
|
Schema (MongoDB)
- id: Integer, primary key
- time: String
- body: String
- method: String
- path: String
- status: String
- statusCode: Integer
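A log record can be assembled from each request/response pair; a minimal sketch is below. The helper and field values are hypothetical, the field names follow the schema above, and the `statusCode` spelling assumes the schema's `statuCode` is a typo.

```python
from datetime import datetime, timezone

def make_log_record(method, path, body, status_code):
    """Build a log document matching the log schema (hypothetical helper)."""
    return {
        "time": datetime.now(timezone.utc).isoformat(),  # timestamp
        "body": body,                                    # request body served
        "method": method,                                # type of request
        "path": path,
        "status": "OK" if status_code < 400 else "ERROR",
        "statusCode": status_code,                       # response code
    }

record = make_log_record("GET", "/books?page=1&count=20", "", 200)
print(record["method"], record["statusCode"])
```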
You will build an analytics pipeline and system that comprises:
- You will first write a script that saves data from the production system, and then loads the data to a distributed file system (HDFS) in the analytics systems.
- Write the following applications in Spark.
- Correlation: compute the Pearson correlation between price and average review length. You are to implement it in a map-reduce fashion, and are not allowed to use mllib.stat.Statistics.
- TF-IDF: compute the term frequency inverse document frequency metric on the review text. Treat one review as a document.
To compute the Pearson correlation between price and average review length, we first need to retrieve these data from HDFS. Since price comes from the `Meta` dataset while the reviews come from the `Reviews` dataset, we need to fetch both datasets and do some preprocessing. The preprocessing includes computing the average `reviewText` length for each book (`asin`). Thereafter, we join the two datasets on `asin` and extract the `average_reviewLength` and `price` values required for the correlation computation.
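The preprocessing can be sketched in plain Python with stand-in data; in the real pipeline these are Spark datasets read from HDFS, and the toy rows below are made up.

```python
# Stand-in rows: (asin, price) from Meta and (asin, reviewText) from Reviews
meta = [("A1", 3.99), ("A2", 10.0)]
reviews = [("A1", "good"), ("A1", "very good"), ("A2", "ok")]

# Average reviewText length per asin
totals = {}
for asin, text in reviews:
    count, length = totals.get(asin, (0, 0))
    totals[asin] = (count + 1, length + len(text))
avg_len = {asin: length / count for asin, (count, length) in totals.items()}

# Join on asin -> (price, average_reviewLength) pairs
data = [(price, avg_len[asin]) for asin, price in meta if asin in avg_len]
print(data)  # [(3.99, 6.5), (10.0, 2.0)]
```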
We made use of the MapReduce concept for the computation. In the mapping stage, we mapped each row to its `x`, `x**2`, `xy`, `y`, and `y**2` values (let x = price and y = `average_reviewLength`). This is done via:
```python
# row -> (price, average_reviewLength)
flatdata = data.flatMap(lambda row: (
    ("x", row[0]),
    ("y", row[1]),
    ("x2", row[0] * row[0]),
    ("y2", row[1] * row[1]),
    ("xy", row[0] * row[1])))
```
Next, in the reducing stage, we reduced by key and summed up the values for each key. Thereafter, we sorted by key to get a deterministic order we can slice from.

```python
# Sort by key so that we get an order we can slice from
# (keys sort lexicographically: "x", "x2", "xy", "y", "y2")
reduced_data = flatdata.reduceByKey(lambda x, y: x + y).sortByKey()

# Retrieve the sums required for calculating the correlation
end_data = reduced_data.take(5)
x = end_data[0][1]
xx = end_data[1][1]
xy = end_data[2][1]
y = end_data[3][1]
yy = end_data[4][1]
```
Now that we have Σx, Σx², Σxy, Σy, and Σy² (together with n, the number of (price, average_reviewLength) pairs), computing the Pearson correlation is straightforward using the formula

r = (n·Σxy − Σx·Σy) / √((n·Σx² − (Σx)²) · (n·Σy² − (Σy)²))
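The final step can be sketched in plain Python, taking the five sums produced by the reduce step plus n:

```python
import math

def pearson(n, sx, sy, sxx, syy, sxy):
    """Pearson correlation computed from running sums (map-reduce style)."""
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

# Perfectly linearly related data should give r = 1
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
r = pearson(
    len(xs),
    sum(xs), sum(ys),
    sum(v * v for v in xs), sum(v * v for v in ys),
    sum(a * b for a, b in zip(xs, ys)),
)
print(round(r, 6))  # 1.0
```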
After running the analytics scripts detailed in the next section, the output of the correlation script can be found in HDFS under the /corr/ directory. The value of the Pearson correlation coefficient is in one of the files (`part-0000X`) in that directory.

```sh
hadoop fs -ls /corr
```
The second analytic task was to compute the term frequency inverse document frequency metric on the reviews where each review is a document.
(a) Convert each review to an array of words

```python
tokenizer = Tokenizer(inputCol="reviewText", outputCol="words")
wordsData = tokenizer.transform(data)
```
(b) Get the term frequency vectors
We used `CountVectorizer` to compute the term frequency vectors instead of `HashingTF`, even though it is slower, because `CountVectorizer` is able to restore the word vocabulary.

```python
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures")
model = cv.fit(wordsData)
featurizedData = model.transform(wordsData)
```
(c) Apply IDF

```python
featurizedData.cache()
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
```
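Under the hood, Spark's `IDF` uses the smoothed formula idf(t) = log((n + 1) / (df(t) + 1)), where n is the number of documents and df(t) is the number of documents containing term t. A small sketch of the rescaling, with made-up counts:

```python
import math

def spark_idf(num_docs, doc_freq):
    """Spark MLlib's smoothed IDF: log((n + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

# A term appearing in every document gets idf 0, so its tf-idf is 0
common = spark_idf(4, 4)

# tf-idf = raw term frequency * idf (counts here are illustrative)
tf = 3
tfidf = tf * spark_idf(100, 10)
print(common, round(tfidf, 4))
```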
(d) Convert word indices back to words

`CountVectorizer` indexes each word, so after the transform each word is represented by an integer index. To recover the words, we created a user-defined function (`udf`) and applied it to our dataframe.
```python
# CountVectorizer outputs word indices, so map each index back to its word
def map_to_word1(row, vocab):
    d = {}
    array = row.toArray()
    for i in range(len(array)):
        # if the tf-idf value is 0, ignore; otherwise key by the corresponding word
        if array[i] != 0:
            d[vocab[i]] = array[i]
    return str(d)

def map_to_word(vocab):
    return udf(lambda row: map_to_word1(row, vocab))

# apply the udf to convert indices back to words
# (vocab is the word list recovered from the fitted CountVectorizer model)
vocab = model.vocabulary
df = rescaledData.withColumn("modified", map_to_word(vocab)(rescaledData.features))
```
After running the analytics scripts detailed in the next section, the output of the TF-IDF script can be found in HDFS under the /tfidf/ directory.

```sh
hadoop fs -ls /tfidf
```
The requirements for the scripts are as follows:
- No managed services such as EMR or RDS.
- Your scripts are expected to:
- Take my AWS credentials as input. (For the analytic tasks, your scripts also take the number of nodes).
- Spin up new instances from a base Ubuntu image. Only most basic OS packages are installed in the base image.
- Configure and start your systems (both production and analytics). Your scripts tell me how/where to access the front end.
- Have options to start the analytic tasks.
- You can save the results of the analytic tasks to file, and tell me how/where to access the file.
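As a sketch of how the setup script might assemble its EC2 launch parameters from the command-line arguments: the helper, AMI ID, instance type, and the way the total instance count relates to the cluster size are all assumptions; the actual values live in automation/main.py. The dict mirrors keyword arguments accepted by boto3's `create_instances`.

```python
def build_launch_spec(key_pair_name, cluster_size, ami_id="ami-xxxxxxxx",
                      instance_type="t2.medium"):
    """Assemble kwargs for boto3 ec2.create_instances (hypothetical helper,
    placeholder AMI and instance type)."""
    return {
        "ImageId": ami_id,              # base Ubuntu image
        "InstanceType": instance_type,
        "KeyName": key_pair_name,
        "MinCount": 1,
        # web server + MySQL + MongoDB + namenode, plus the datanodes
        "MaxCount": 4 + cluster_size,
    }

spec = build_launch_spec("my-keypair", 2)
print(spec["MaxCount"])  # 6
```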
The EC2 instances are set up in the following structure (the Content entries are inferred from the Details column):

Instance Number | Content | Details |
---|---|---|
1 | Web application (frontend and backend) | The frontend and backend are hosted on port 5000; the website can be viewed at <public IP 1>:5000/isit. The backend retrieves information from the databases via their IP addresses. |
2 | Kindle Reviews (MySQL) | The data can be accessed by the backend and namenode at <public IP 2>:3306. |
3 | Kindle metadata and logs (MongoDB) | The data can be accessed by the backend and namenode at <public IP 3>:27017. |
4 | Namenode (driver) | The namenode is configured to store the public DNS of the datanodes. The HDFS cluster can be viewed at <public IP 4>:50070. When the client issues an analytics job, data is retrieved from the databases via their IP addresses. The driver delegates tasks to the workers by SSH-ing into the datanodes. Upon completion, the driver returns the output file. |
5+ | Datanodes (workers) | The workers perform the tasks assigned by the driver. |
Refer to here for instructions to run automated setup.