- Use Ubuntu 18.04 LTS (due to use of bash scripts)
- AWS EC2 key pair created in the region 'ap-southeast-1'
Take Note
- Notable events are printed to the console to show progress. All actions and events are appended to the log file at /automation/config/logs.log
- The estimated time required for setting up is about 20 minutes; the estimated times for the analytics can be found in the table below.
- Amazon performs status checks on the instances, which monitor the software and network configurations. These checks can fail for various reasons; this is outside of our control and cannot be resolved using Boto3. In that case, terminate all instances and security groups manually and try again.
- The configuration of the EC2 instances can be changed in `automation/main.py`
- Clone the project repository:

```sh
git clone https://github.com/andrehadianto/50043_isit_database.git
```
- In the `automation` folder, create a Python 3 virtual environment and install the requirements:

```sh
virtualenv env
source env/bin/activate
python -m pip install -r requirements.txt
```
- Run `main.py` to set up and configure the EC2 instances and run the analytics. This takes 20-30 minutes.

```sh
python3 main.py <access key> <secret access key> <key pair name> <absolute pem file directory> <cluster size>
```
- The public IP of the EC2 instance hosting the web application is printed on the console. The website can then be viewed at `<public IP 1>:5000/isit`
- To run the analytics again:

```sh
python analytics.py
```

  To change the cluster size:

```sh
python clean.py
python3 main.py <access key> <secret access key> <key pair name> <absolute pem file directory> <cluster size>
```
- The results of the analytics tasks are stored in HDFS; the file locations are printed on the console. To list the files, SSH into the namenode and run:

```sh
hadoop fs -ls /tfidf
hadoop fs -ls /corr
```
A sample of the TF-IDF and correlation analytics output can be seen below.
The estimated runtimes (in minutes) for performing the analytics task on various vocabulary sizes and cluster sizes are as follows:
vocabulary size | 2 nodes | 4 nodes | 8 nodes |
---|---|---|---|
2000 | 10 | 4 | 2 |
5000 | 20 | 7 | 3 |
10000 | - | - | 6 |
In this project, we built a web application for Kindle book reviews, similar to Goodreads. The brief was to start with public datasets from Amazon and to design and implement the application around them; the requirements were intentionally broad, leaving freedom to explore alternative design choices. The full details of the project requirements can be found here.
We were provided with two public datasets from Amazon:
- Amazon Kindle reviews, available from the Kaggle website. This dataset has 982,619 entries (about 700 MB).
- Amazon Kindle metadata, available from the UCSD website. This dataset has 434,702 products (about 450 MB).
After cleaning the dataset, we loaded the data into the respective databases.
This consists of a web page that lets a user perform at least the following:
- See some reviews
- Add new review
- Add a new book

You are free to use any web framework you want, and free to decide your own structure and layout, as long as input can be entered and output returned. A prettier website earns more points, up to a limit, and more functionality can be added as the project progresses.
We used ReactJS for our frontend because it is fast, scalable, and simple. It allows us to create a web application which can change data, without reloading the page. React also allows us to create reusable UI components, making our code lighter and less repetitive. The full documentation on the usage can be found here.
The features implemented in our web application are as follows:
- The homepage of our website shows a list of all books available in our database. Users may sign up and log in to the website.
- Filter books by multiple categories; typing in the input form shows category suggestions.
- Filtered books show the selected categories as labels in the upper-left corner.
- Search for a book by its `asin` code.
- Users may log in to gain access to user-only features.
- Users may sign up for a new account; usernames are unique.
- User-only features consist of adding new books, viewing logs, and logging out.
- Selecting a book shows the book details: the title (if available), genre, author, description, and price, followed by all available reviews by other users and the overall rating.
- Anonymous users may post a review under a nickname of their choice.
- Logged-in users do not have to choose a nickname; their reviews are posted under their display name.
- Logged-in users can add a new book, with a preview shown before submission.
- Users can check the details of a successful new book submission immediately.
- Logged-in users may see the list of logs.
- View the details of a log.
This is a preview of our frontend
The full implementation of our frontend and the various features supported by our web application can be found [here](/static).
- The web server receives requests and computes the responses by interacting with the databases.
- The reviews are stored in a relational database (SQL is recommended).
- The metadata (book descriptions) is in a document store (MongoDB).
- The web server logs are recorded in a document store. Each log record must have at least the following information:
- Timestamp
- What type of request is being served
- What is the response
Our server uses Flask because it allows more control over which components to use, such as which databases to connect to and how to interact with them. Since our project requires us to connect to multiple SQL and Mongo databases, we decided that Flask was most suitable. The full documentation on the usage can be found here.
Schema (MongoDB)
- id: ObjectId, primary key
- asin: String
- title: String
- categories: String[]
- imUrl?: String
- related?: json
- also_bought: string[]
- also_viewed: string[]
- bought_together: string[]
- price?: Double
- description?: String
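For illustration, a book document following this schema might look like the sketch below. The field values are made up; only the field names come from the schema above (fields marked `?` are optional and may be absent).

```python
# A hypothetical book document matching the metadata schema.
# All concrete values here are made-up examples, not real data.
book = {
    "asin": "B000FA64PK",
    "title": "A Sample Kindle Book",
    "categories": ["Kindle Store", "Kindle eBooks", "Science Fiction"],
    "imUrl": "http://example.com/cover.jpg",       # optional
    "related": {                                   # optional
        "also_bought": ["B000FA64PA"],
        "also_viewed": ["B000FA64PB"],
        "bought_together": ["B000FA64PC"],
    },
    "price": 3.99,                                 # optional
    "description": "An example description.",     # optional
}

print(sorted(book.keys()))
```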
Endpoints
Endpoint | REST | Description |
---|---|---|
/book/:asin | GET | Returns book details (all available fields); erroneous asin values in the array are ignored |
/books?page={}&count={} | GET | Returns book information (lightweight) with pagination |
/books/previews/ | POST | Returns book information (lightweight) |
/books/category | POST | Returns books whose categories contain the categories in categoryArray |
/book/new | POST | Adds a book to the database |
/book/update/:asin | PUT | Updates book details |
/category/all | GET | Returns all available book categories |
/user/logs?page={}&count={} | GET | Returns a list of logs |
/user/logs/:id | GET | Returns a specific log in detail |
/user/login | POST | Logs in to an account; returns user data for authorization (body: name, id, token) |
/user/signup | POST | Signs up a new account; returns the status code of the sign-up result |
/books_titles | GET | Returns books that have titles in the metadata (body: titles, asin) |
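As a small sketch, a client can assemble the pagination parameters into a query string when calling the `/books` endpoint. The helper below is hypothetical; the port and parameter names come from this document.

```python
from urllib.parse import urlencode

def books_url(host, page, count):
    """Build the paginated /books URL for a given backend host (hypothetical helper)."""
    query = urlencode({"page": page, "count": count})
    return "http://{}:5000/books?{}".format(host, query)

print(books_url("1.2.3.4", 2, 20))
```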
Schema (MySQL)
- id: Integer, primary key
- asin: String
- helpful: String
- overall: Integer
- reviewText: Text
- reviewTime: String
- reviewerID: String
- reviewerName: String
- summary: String
- unixReviewTime: Integer
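The review schema can be illustrated with the sketch below. The project uses MySQL; sqlite3 is used here only so the example is self-contained, the column types are assumptions, and the inserted row is made up.

```python
import sqlite3

# In-memory stand-in for the MySQL reviews table (types are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reviews (
        id INTEGER PRIMARY KEY,
        asin TEXT,
        helpful TEXT,
        overall INTEGER,
        reviewText TEXT,
        reviewTime TEXT,
        reviewerID TEXT,
        reviewerName TEXT,
        summary TEXT,
        unixReviewTime INTEGER
    )
""")
conn.execute(
    "INSERT INTO reviews (asin, helpful, overall, reviewText, reviewTime, "
    "reviewerID, reviewerName, summary, unixReviewTime) VALUES (?,?,?,?,?,?,?,?,?)",
    ("B000FA64PK", "[0, 0]", 5, "Loved it.", "01 2, 2014",
     "A1B2C3", "sample_user", "Great read", 1388620800),
)
row = conn.execute("SELECT asin, overall FROM reviews").fetchone()
print(row)  # ('B000FA64PK', 5)
```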
Endpoints
Endpoint | REST | Description |
---|---|---|
/reviews/:asin?page={}&count={} | GET | Returns reviews details (all fields) for a book with pagination
|
/reviews/:asin | POST | Adds a review to database
|
/reviews/user/:userid | GET | Returns all reviews by user
|
/review/:id | DELETE | Delete review |
/review/:id | PUT | Edit review by user
|
Schema (MongoDB)
- id: Integer, primary key
- time: String
- body: String
- method: String
- path: String
- status: String
- statusCode: Integer
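A log record can be assembled from each request/response pair; a minimal sketch is below. The helper and field values are hypothetical, the field names follow the schema above, and the `statusCode` spelling assumes the schema's `statuCode` is a typo.

```python
from datetime import datetime, timezone

def make_log_record(method, path, body, status_code):
    """Build a log document matching the log schema (hypothetical helper)."""
    return {
        "time": datetime.now(timezone.utc).isoformat(),  # timestamp
        "body": body,                                    # request body served
        "method": method,                                # type of request
        "path": path,
        "status": "OK" if status_code < 400 else "ERROR",
        "statusCode": status_code,                       # response code
    }

record = make_log_record("GET", "/books?page=1&count=20", "", 200)
print(record["method"], record["statusCode"])
```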
You will build an analytics pipeline and system that comprises:
- You will first write a script that saves data from the production system, and then loads the data to a distributed file system (HDFS) in the analytics systems.
- Write the following applications in Spark.
- Correlation: compute the Pearson correlation between price and average review length. You are to implement it in a map-reduce fashion, and are not allowed to use mllib.stat.Statistics.
- TF-IDF: compute the term frequency inverse document frequency metric on the review text. Treat one review as a document.
To compute the Pearson correlation between price and average review length, we first need to retrieve these data from HDFS. Since price comes from the `Meta` dataset while the reviews come from the `Reviews` dataset, we need to fetch both datasets and do some preprocessing. The preprocessing includes computing the average `reviewText` length for each book (`asin`). Thereafter, we join the two datasets on `asin` and extract the `average_reviewLength` and `price` values required for the correlation computation.
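The preprocessing can be sketched in plain Python with stand-in data; in the real pipeline these are Spark datasets read from HDFS, and the toy rows below are made up.

```python
# Stand-in rows: (asin, price) from Meta and (asin, reviewText) from Reviews
meta = [("A1", 3.99), ("A2", 10.0)]
reviews = [("A1", "good"), ("A1", "very good"), ("A2", "ok")]

# Average reviewText length per asin
totals = {}
for asin, text in reviews:
    count, length = totals.get(asin, (0, 0))
    totals[asin] = (count + 1, length + len(text))
avg_len = {asin: length / count for asin, (count, length) in totals.items()}

# Join on asin -> (price, average_reviewLength) pairs
data = [(price, avg_len[asin]) for asin, price in meta if asin in avg_len]
print(data)  # [(3.99, 6.5), (10.0, 2.0)]
```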
We made use of the MapReduce concept for the computation. In the mapping stage, we mapped each row to its `x`, `x**2`, `xy`, `y`, and `y**2` values (let x = price and y = `average_reviewLength`). This is done via:
```python
# row -> (price, average_reviewLength)
flatdata = data.flatMap(lambda row: (
    ("x", row[0]),
    ("y", row[1]),
    ("x2", row[0] * row[0]),
    ("y2", row[1] * row[1]),
    ("xy", row[0] * row[1])))
```
Next, in the reducing stage, we reduced by key and summed up the values for each key. Thereafter, we sorted by key to get a deterministic order we can slice from.

```python
# Sort by key so that we get an order we can slice from
# (keys sort lexicographically: "x", "x2", "xy", "y", "y2")
reduced_data = flatdata.reduceByKey(lambda x, y: x + y).sortByKey()

# Retrieve the sums required for calculating the correlation
end_data = reduced_data.take(5)
x = end_data[0][1]
xx = end_data[1][1]
xy = end_data[2][1]
y = end_data[3][1]
yy = end_data[4][1]
```
Now that we have Σx, Σx², Σxy, Σy, and Σy² (together with n, the number of (price, average_reviewLength) pairs), computing the Pearson correlation is straightforward using the formula

r = (n·Σxy − Σx·Σy) / √((n·Σx² − (Σx)²) · (n·Σy² − (Σy)²))
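The final step can be sketched in plain Python, taking the five sums produced by the reduce step plus n:

```python
import math

def pearson(n, sx, sy, sxx, syy, sxy):
    """Pearson correlation computed from running sums (map-reduce style)."""
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

# Perfectly linearly related data should give r = 1
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
r = pearson(
    len(xs),
    sum(xs), sum(ys),
    sum(v * v for v in xs), sum(v * v for v in ys),
    sum(a * b for a, b in zip(xs, ys)),
)
print(round(r, 6))  # 1.0
```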
After running the analytics scripts detailed in the next section, the output of the correlation script can be found in HDFS under the /corr/ directory. The value of the Pearson correlation coefficient is in one of the files (`part-0000X`) in that directory.

```sh
hadoop fs -ls /corr
```
The second analytic task was to compute the term frequency inverse document frequency metric on the reviews where each review is a document.
(a) Convert each review to an array of words

```python
tokenizer = Tokenizer(inputCol="reviewText", outputCol="words")
wordsData = tokenizer.transform(data)
```
(b) Get the term frequency vectors
We used `CountVectorizer` to compute the term frequency vectors instead of `HashingTF`, even though it is slower, because `CountVectorizer` is able to restore the word vocabulary.

```python
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures")
model = cv.fit(wordsData)
featurizedData = model.transform(wordsData)
```
(c) Apply IDF

```python
featurizedData.cache()
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
```
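Under the hood, Spark's `IDF` uses the smoothed formula idf(t) = log((n + 1) / (df(t) + 1)), where n is the number of documents and df(t) is the number of documents containing term t. A small sketch of the rescaling, with made-up counts:

```python
import math

def spark_idf(num_docs, doc_freq):
    """Spark MLlib's smoothed IDF: log((n + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

# A term appearing in every document gets idf 0, so its tf-idf is 0
common = spark_idf(4, 4)

# tf-idf = raw term frequency * idf (counts here are illustrative)
tf = 3
tfidf = tf * spark_idf(100, 10)
print(common, round(tfidf, 4))
```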
(d) Convert word indices back to words

`CountVectorizer` indexes each word, so after the transform each word is represented by an integer index. To recover the words, we created a user-defined function (`udf`) and applied it to our dataframe.
```python
# CountVectorizer outputs word indices, so map each index back to its word
def map_to_word1(row, vocab):
    d = {}
    array = row.toArray()
    for i in range(len(array)):
        # if the tf-idf value is 0, ignore; otherwise key by the corresponding word
        if array[i] != 0:
            d[vocab[i]] = array[i]
    return str(d)

def map_to_word(vocab):
    return udf(lambda row: map_to_word1(row, vocab))

# apply the udf to convert indices back to words
# (vocab is the word list recovered from the fitted CountVectorizer model)
vocab = model.vocabulary
df = rescaledData.withColumn("modified", map_to_word(vocab)(rescaledData.features))
```
After running the analytics scripts detailed in the next section, the output of the TF-IDF script can be found in HDFS under the /tfidf/ directory.

```sh
hadoop fs -ls /tfidf
```
The requirements for the scripts are as follows:
- No managed services such as EMR or RDS.
- Your scripts are expected to:
- Take my AWS credentials as input. (For the analytic tasks, your scripts also take the number of nodes).
- Spin up new instances from a base Ubuntu image. Only most basic OS packages are installed in the base image.
- Configure and start your systems (both production and analytics). Your scripts tell me how/where to access the front end.
- Have options to start the analytic tasks.
- You can save the results of the analytic tasks to file, and tell me how/where to access the file.
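As a sketch of how the setup script might assemble its EC2 launch parameters from the command-line arguments: the helper, AMI ID, instance type, and the way the total instance count relates to the cluster size are all assumptions; the actual values live in automation/main.py. The dict mirrors keyword arguments accepted by boto3's `create_instances`.

```python
def build_launch_spec(key_pair_name, cluster_size, ami_id="ami-xxxxxxxx",
                      instance_type="t2.medium"):
    """Assemble kwargs for boto3 ec2.create_instances (hypothetical helper,
    placeholder AMI and instance type)."""
    return {
        "ImageId": ami_id,              # base Ubuntu image
        "InstanceType": instance_type,
        "KeyName": key_pair_name,
        "MinCount": 1,
        # web server + MySQL + MongoDB + namenode, plus the datanodes
        "MaxCount": 4 + cluster_size,
    }

spec = build_launch_spec("my-keypair", 2)
print(spec["MaxCount"])  # 6
```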
The EC2 instances are set up in the following structure (the Content entries are inferred from the Details column):

Instance Number | Content | Details |
---|---|---|
1 | Web application (frontend and backend) | The frontend and backend are hosted on port 5000; the website can be viewed at <public IP 1>:5000/isit. The backend retrieves information from the databases via their IP addresses. |
2 | Kindle Reviews (MySQL) | The data can be accessed by the backend and namenode at <public IP 2>:3306. |
3 | Kindle metadata and logs (MongoDB) | The data can be accessed by the backend and namenode at <public IP 3>:27017. |
4 | Namenode (driver) | The namenode is configured to store the public DNS of the datanodes. The HDFS cluster can be viewed at <public IP 4>:50070. When the client issues an analytics job, data is retrieved from the databases via their IP addresses. The driver delegates tasks to the workers by SSH-ing into the datanodes. Upon completion, the driver returns the output file. |
5+ | Datanodes (workers) | The workers perform the tasks assigned by the driver. |
Refer to here for instructions to run automated setup.