This project introduced the Extract, Transform, Load (ETL) process. Data from Wikipedia, Kaggle, and MovieLens ratings were combined into a clean movie dataset for the fictional Amazing Prime Hackathon: the Wikipedia and Kaggle data are extracted from their source files, transformed by cleaning rows, formatting datatypes, and joining the sources, and finally loaded into a PostgreSQL database.
To support the Hackathon, we built an automated pipeline that takes in new Wikipedia data, Kaggle metadata, and MovieLens rating data, transforms it, and loads it into an existing PostgreSQL database.
For this analysis we performed the following:
- Write an ETL function to read three data files,
- Extract and transform the Wikipedia data,
- Extract and transform the Kaggle and rating data,
- Load the data into a PostgreSQL movie database.
The ETL process is performed in four Jupyter notebooks; details are provided below:
First, an ETL function takes in the Wikipedia JSON, Kaggle metadata, and MovieLens CSV files and creates three separate DataFrames that are used in the later steps.
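The extraction step can be sketched as follows. This is a minimal, hypothetical version: the function name and file paths are placeholders, and tiny inline sample files are written first so the sketch runs end to end (the real pipeline points these paths at the full downloads).

```python
import json

import pandas as pd


def extract(wiki_file, kaggle_file, ratings_file):
    """Read the three source files and return one DataFrame per source."""
    # Wikipedia data is a JSON list of movie records
    with open(wiki_file, mode="r") as f:
        wiki_movies_raw = json.load(f)
    wiki_movies_df = pd.DataFrame(wiki_movies_raw)

    # Kaggle metadata and MovieLens ratings are CSV files
    kaggle_metadata = pd.read_csv(kaggle_file, low_memory=False)
    ratings = pd.read_csv(ratings_file)
    return wiki_movies_df, kaggle_metadata, ratings


# Tiny invented sample files so the sketch is self-contained
with open("wiki_sample.json", "w") as f:
    json.dump([{"title": "Movie A", "year": 2001}], f)
pd.DataFrame({"id": [1], "title": ["Movie A"]}).to_csv("kaggle_sample.csv", index=False)
pd.DataFrame({"userId": [1], "movieId": [1], "rating": [4.5]}).to_csv(
    "ratings_sample.csv", index=False
)

wiki_df, kaggle_df, ratings_df = extract(
    "wiki_sample.json", "kaggle_sample.csv", "ratings_sample.csv"
)
```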
Next, the Wikipedia data is transformed: TV shows are filtered out, redundant data is consolidated, duplicates are removed, and the remaining columns are formatted.
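A minimal sketch of that Wikipedia cleaning step, using a tiny invented sample. Column names such as `Director` and `No. of episodes` are assumptions about the raw scrape; the idea is to keep only entries that look like movies and then deduplicate on the IMDb ID pulled from the link.

```python
import pandas as pd

# Hypothetical raw Wikipedia records (the real data has many more columns)
raw = [
    {"title": "Movie A", "imdb_link": "https://www.imdb.com/title/tt0000001/",
     "Director": "Jane Doe"},
    {"title": "Movie A", "imdb_link": "https://www.imdb.com/title/tt0000001/",
     "Director": "Jane Doe"},                    # duplicate entry
    {"title": "Show B", "imdb_link": "https://www.imdb.com/title/tt0000002/",
     "No. of episodes": 10},                     # TV show, filtered out
]

# Keep only movies: entries with a director and no episode count
movies = [m for m in raw if "Director" in m and "No. of episodes" not in m]
wiki_df = pd.DataFrame(movies)

# Extract the IMDb ID from the link, then drop duplicates on it
wiki_df["imdb_id"] = wiki_df["imdb_link"].str.extract(r"(tt\d{7})")
wiki_df = wiki_df.drop_duplicates(subset="imdb_id")
```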
The Kaggle metadata and rating data are transformed in the same way: redundant data is consolidated, duplicates are removed, and the data is formatted and grouped. The Kaggle and rating data are then merged with the Wikipedia movies DataFrame so that all three sources form a single dataset.
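The merge-and-group step can be sketched like this, again with tiny invented frames. The join key (`imdb_id` between Wikipedia and Kaggle, `kaggle_id` against the MovieLens `movieId`) and the wide `rating_*` columns follow the common pattern for this kind of pipeline and are assumptions here:

```python
import pandas as pd

# Hypothetical minimal inputs
wiki_df = pd.DataFrame({"imdb_id": ["tt0000001"], "wiki_title": ["Movie A"]})
kaggle_df = pd.DataFrame({"imdb_id": ["tt0000001"], "kaggle_id": [1],
                          "title": ["Movie A"]})
ratings = pd.DataFrame({"userId": [1, 2, 3], "movieId": [1, 1, 1],
                        "rating": [4.0, 5.0, 4.0]})

# Merge Wikipedia and Kaggle data on the shared IMDb ID
movies_df = pd.merge(wiki_df, kaggle_df, on="imdb_id")

# Group the ratings into a count per movie and rating value, then pivot wide
rating_counts = (
    ratings.groupby(["movieId", "rating"], as_index=False).count()
    .rename(columns={"userId": "count"})
    .pivot(index="movieId", columns="rating", values="count")
)
rating_counts.columns = ["rating_" + str(col) for col in rating_counts.columns]

# Left-merge so movies without ratings are kept
movies_with_ratings = pd.merge(movies_df, rating_counts,
                               left_on="kaggle_id", right_index=True, how="left")
```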
Finally, the ETL function ties these steps together: it collects, cleans, transforms, and merges the movie data from all sources, then loads it into PostgreSQL, where the tables are ready for use in the Amazing Prime Hackathon.
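The load step typically uses `DataFrame.to_sql` with a SQLAlchemy engine. A minimal sketch: the PostgreSQL connection string shown in the comment is an assumed example, and an in-memory SQLite engine stands in for it here only so the sketch runs without a database server.

```python
import pandas as pd
from sqlalchemy import create_engine

# In the real pipeline the connection string points at PostgreSQL, e.g.
#   create_engine(f"postgresql://postgres:{password}@127.0.0.1:5432/movie_data")
# SQLite in-memory is used here only so the sketch is self-contained.
engine = create_engine("sqlite://")

movies_df = pd.DataFrame({"imdb_id": ["tt0000001"], "title": ["Movie A"]})
movies_df.to_sql(name="movies", con=engine, if_exists="replace", index=False)

# A large ratings file would normally be loaded in chunks to limit memory, e.g.:
#   for chunk in pd.read_csv("ratings.csv", chunksize=1_000_000):
#       chunk.to_sql(name="ratings", con=engine, if_exists="append", index=False)

# Read the table back to confirm the load
loaded = pd.read_sql("SELECT title FROM movies", con=engine)
```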