ETL pipeline for processing Amazon movie reviews into Postgres, along with accompanying sentiment analysis.

Movie Review Sentiment Analysis ETL Pipeline

This repository implements an Extract, Transform, and Load (ETL) pipeline for sentiment analysis of Amazon reviews. Data is first sourced from Amazon's own Open Data Registry in the Parquet format. The transformation pipeline then filters the data down to the Video DVD category and the US marketplace segment, and reduces the number of columns to match the target data shape. Next, the data is written back out in Parquet format for consumption by the load process, which reads the intermediary Parquet files and publishes them to PostgreSQL via the sqlalchemy and psycopg2 packages. Although the extract, transform, and load steps could run in a single script, we have opted for separate scripts, with data files cached in the input directory. This allows each phase of the pipeline to be modified without having to restart processing from scratch.
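
The filter-and-reduce step described above can be sketched with pandas as follows. This is a minimal illustration, not the contents of `extract_transform.py`: the source column names `marketplace` and `product_category`, the function names, and the paths passed to `run_transform` are assumptions, and "remove duplicates" is interpreted here as dropping rows that are identical across the selected columns.

```python
from __future__ import annotations

import pandas as pd

# Column lists as given in this README's Pipeline section.
REVIEW_COLUMNS = ["product_id", "review_id", "review_headline",
                  "review_body", "star_rating", "review_date"]
MOVIE_COLUMNS = ["product_id", "product_title"]


def split_reviews_and_movies(raw: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Keep US Video DVD rows, then build de-duplicated reviews and movies frames."""
    # Assumption: the source data exposes "marketplace" and "product_category".
    mask = (raw["marketplace"] == "US") & (raw["product_category"] == "Video DVD")
    subset = raw[mask]

    # Assumption: duplicates are rows identical across the selected columns.
    reviews = subset[REVIEW_COLUMNS].drop_duplicates().reset_index(drop=True)
    movies = subset[MOVIE_COLUMNS].drop_duplicates().reset_index(drop=True)
    return reviews, movies


def run_transform(source_path: str, out_dir: str) -> None:
    """Read the source Parquet data and write the two intermediary files."""
    # Hypothetical wrapper; needs a Parquet engine such as pyarrow installed.
    raw = pd.read_parquet(source_path)
    reviews, movies = split_reviews_and_movies(raw)
    reviews.to_parquet(f"{out_dir}/reviews.parquet", index=False)
    movies.to_parquet(f"{out_dir}/movies.parquet", index=False)
```

Keeping the filtering logic in a function that takes and returns DataFrames makes it possible to check the transformation on a handful of rows before running it against the full download.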

Data Sources

Pipeline

  1. Download data files
    • cd src && ./download_amazon.sh (2.90 GB download; requires an Amazon Web Services access key ID and secret access key)
  2. Transformations / Generation
    • Amazon Reviews - cd src && python extract_transform.py
      1. Load source Parquet files (7,135,819 records)
      2. Limit to US market (6,166,026 records)
      3. Reviews
        1. Filter columns ("product_id", "review_id", "review_headline", "review_body", "star_rating", "review_date")
        2. Remove duplicates (5,069,140 records)
        3. Store reviews - input/reviews.parquet
      4. Movies
        1. Filter columns ("product_id", "product_title")
        2. Remove duplicates (297,919 records)
        3. Store movies - input/movies.parquet
    • Sentiment Analysis (Generation)
      1. Iterate over reduced review dataset
      2. Submit review title and text to analysis service (TBD)
      3. Store review id and sentiment scores - input/sentiment.parquet
  3. Load
    • Schema - schema/schema.sql
    • Data - src/load.py
      • Movies
      • Reviews
      • Sentiment Analysis
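
The load step listed above can be sketched along these lines. This is an illustration under stated assumptions rather than the contents of `src/load.py`: the table names (`movies`, `reviews`, `sentiment`), the helper names, and the connection URL are hypothetical, and the tables are assumed to have been created from the schema file beforehand, so rows are appended.

```python
import pandas as pd
from sqlalchemy import create_engine


def load_table(frame: pd.DataFrame, table_name: str, engine) -> int:
    """Append a DataFrame to an existing table and return the row count."""
    # if_exists="append" keeps a table that the schema script already created.
    frame.to_sql(table_name, engine, if_exists="append", index=False,
                 chunksize=10_000)
    return len(frame)


def load_all(database_url: str, input_dir: str) -> None:
    """Load the intermediary Parquet files in the order listed in this README."""
    # Hypothetical URL form: postgresql+psycopg2://user:password@host:5432/dbname
    engine = create_engine(database_url)
    # Hypothetical table names paired with the files named in the Pipeline section.
    for table_name, file_name in [
        ("movies", "movies.parquet"),
        ("reviews", "reviews.parquet"),
        ("sentiment", "sentiment.parquet"),
    ]:
        frame = pd.read_parquet(f"{input_dir}/{file_name}")
        load_table(frame, table_name, engine)
```

Loading movies before reviews and sentiment follows the order given in the list, which would also suit a schema in which reviews reference movies; whether the actual schema defines such a constraint is not stated here.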