Amazon Vine Analysis: ETL pipeline

Background

In this assignment you will put your ETL skills to the test. Many of Amazon's shoppers depend on product reviews to make a purchase. Amazon makes these datasets publicly available. However, they are quite large and can exceed the capacity of local machines to handle. One dataset alone contains over 1.5 million rows; with over 40 datasets, this can be quite taxing on the average local computer. Your first goal for this assignment will be to perform the ETL process completely in the cloud and upload a DataFrame to an RDS instance. The second goal will be to use PySpark or SQL to perform a statistical analysis of selected data. There are two levels to this homework assignment. The second level is optional but highly recommended.

Project objective

Developing google Colaboratory Notebook files for ETL pipeline of Amazon music reviews and connection to AWS PostgreSQL database and analysis of the ratio of five star reviews as it relates to participation in the Vine program.

Resources

Data Source
- Amazon-reviews-links
- Music data
Technologies used
- Python
- Spark
- PySpark
- PostgreSQL
- pgAdmin
- Google colaboratory

Results

Running AWS RDS ETL Pipeline amazon_music_reviews.ipynb to populate the four tables in our PostgreSQL database as shown in the following images:

review_id_table
products
customers
vine_table

Analysis to compare the number and ratio of 5 star reviews between those included in the Vine program (paid) and those not (unpaid). Before making this comparison, then filtered the Vine data set to only contain rows with the following conditions:

At least 20 total votes
Majority of votes considered helpful,i.e helpful_votes / total_votes >= 0.5

The resulting DataFrame splited in two, with one DataFrame containing paid reviews, i.e vine == "Y", and the other containing unpaid reviews, i.e vine == "N". Then .groupby("star_rating").agg(count("star_rating")) were used to obtain the count of each rating level for each DataFrame, as shown in Rating Counts Summary. The summarry of the information obtained in DataFrame shown in table below:

Summary

Comparing the number and ratio of 5 star reviews between those included and not included in the Vine program, it appears there is no clear positivity bias for reviews in the program (0.0% Vine 5 star review while 63.76% Non-Vine 5 star reviews). However, the much larger number of non-Vine reviews relative to Vine reviews (105979 versus 7) likely indicates that this conclusion is not statistically significant. To obtain a better understanding, increased vine review data is necessary. From here, one could formulate a two-sample T-test with the following hypotheses:

H_0 : The mean rating (in number of stars) is the same between Vine and non-Vine reviews. i.e: mean_rating_vine = mean_rating_nonvine H_a : The mean rating (in number of stars) for Vine reviews is greater than that of non-Vine reviews, i.e there is positivity bias, and: mean_rating_vine > mean_rating_nonvine

This would then determine if the mean rating for Vine reviews is significantly greater than the mean rating for non-Vine reviews.

Use case

The majority of the code for this analysis is contained in the two Jupyter Notebook files amazon_music_reviews.ipynb and vine_reviews_analysis.ipynb. Both; however, require the Spark dependency and are thus best run using Google Colaboratory.

`amazon_music_reviews.ipynb`

Open the notebook amazon_music_reviews.ipynb using "File > Open notebook". Prior to running all cells, the user should create an AWS RDS instance as follows:

Navigate to the AWS Management Console and sign in.
Search for "RDS" (Managed Relational Database Service) and select the first result.
On the resulting page, select "Create database" and change the following from the default options:

"Engine options > Engine type": "PostgreSQL"
"Templates" > "Free tier"
"Settings > DB instance identifier":
"Master username": , or use default postgres
"Master password": , separate from pgAdmin4 password
"Connectivity > Public access": "Yes"

Select "Create database".
After creating the database, update which IP addresses can access it by first navigating to the database instance on AWS (select "Databases" in side pane and choose the recently created database).
Scroll down to "Security group rules" and select the first "Security group".
Choose the only "Security group ID" shown.
Select "Edit inbound rules"
For the first entry shown under "Inbound rules", change the "Type" to "PostgreSQL" and the "Source" to "Anywhere".

This is not the best practice for production, but in this example it simplifies connecting to the database.

Select "Save rules".
Now select the "Outbound rules" tab and then "Edit outbound rules".
For the first entry shown under "Outbound rules", change the "Type" to "All traffic" and the "Destination" to "Anywhere".
Select "Save rules".
The database is now instantiated and accessible from any IP address, though the database password is still required.

With the RDS instance now created, the user should connect pgAdmin4 to its endpoint for local access. This is accomplished as follows:

Navigate to the created RDS instance on the AWS Management Console.
Copy the "Endpoint" shown under "Connectivity & security"
Open and log into pgAdmin4.
Select "Add New Server" and set the following:

Under the "General" tab, name the connection something like "AWS".
Choose the "Connection" tab and paste the copied RDS endpoint to the "Host name/address" setting and use the default Port 5432.
Leave the "Username" as "postgres" unless a different username was chosen during creation of the RDS instance.
Fill in the "Password" with the same password set during creation of the RDS instance.

Choose "Save" to establish the connection.
Establish the necessary database structure by using the "Query Tool" for the instantiated database.
Open and run the queries contained in music_schema.sql

With the RDS instance created, connection to pgAdmin4 established, and database schema defined, the user can now establish connection in amazon_music_revies.ipynb:

Before returning to Google Colaboratory, copy the <connection_string> for the RDS instance:

In pgAdmin4 right-click on the "AWS" connection shown in the "Server" directory
Select "Properties"
Select the "Connection" tab
Copy the address in the "Host name/address" field, this is the <connection_string>

Return to Google Colaboratory, and in the first cell under "Connect to the AWS RDS instance and write each DataFrame to its table", replace this <connection_string> along with the , , and that are currently shown with those created in previous steps:

One can then run all cells in Amazon_Reviews_ETL.ipynb, return to pgAdmin4, and query the recently created tables to confirm the data loading into the AWS RDS instance was successful.

After completion of this AWS RDS pipeline, the user should shut down its instance to ensure they do not incur unexpected charges. This is accomplished as follows:

Navigate to the RDS Service page from the AWS Management Console.
Select "DB Instances".
Select the checkbox for the recently created database.
Under the "Actions" dropdown menu, select "Delete" and confirm deletion.

There is no need to "Create final snapshot" or "Retain automated backups" and so these option should not be selected.

vine_review_analysis.ipynb

vine_review_analysis.ipynb does not require a user-defined AWS RDS instance and therefore one can simply open this notebook in Google Colaboratory and run all cells. This will include reading the Amazon review data set from an AWS S3 instance into a Spark DataFrame. Subsequent cells then include filtering this DataFrame and parsing the result to determine the number and percent of five star reviews among the Vine program participants and non-participants.

References

Amazon customer Reviews Dataset. (n.d.). Retrieved April 08, 2021, from: https://s3.amazonaws.com/amazon-reviews-pds/readme.html

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
Instructions		Instructions
Resources		Resources
level_1		level_1
level_2		level_2
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Amazon Vine Analysis: ETL pipeline

Background

Project objective

Resources

Results

Summary

Use case

`amazon_music_reviews.ipynb`

`vine_review_analysis.ipynb`

References

About

Releases

Packages

Languages

bigoshunane/Big-Data-ETL-Pipeline-Project

Folders and files

Latest commit

History

Repository files navigation

Amazon Vine Analysis: ETL pipeline

Background

Project objective

Resources

Results

Summary

Use case

amazon_music_reviews.ipynb

vine_review_analysis.ipynb

References

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

`amazon_music_reviews.ipynb`

`vine_review_analysis.ipynb`

Packages