This project introduced the Extract, Transform, Load (ETL) process. Data from Wikipedia, Kaggle, and MovieLens ratings were combined into a clean movie dataset for the fictional Amazing Prime Hackathon: the Wikipedia and Kaggle data are extracted from their source files, transformed by cleaning rows, formatting datatypes, and joining the sources, and finally loaded into a PostgreSQL database.
To support the Hackathon, we built an automated pipeline that takes in new Wikipedia data, Kaggle metadata, and MovieLens rating data, transforms it, and loads it into an existing PostgreSQL database.
For this analysis we performed the following:
- Write an ETL function to read three data files,
- Extract and transform the Wikipedia data,
- Extract and transform the Kaggle and rating data,
- Load the data into a PostgreSQL movie database.
The ETL process is performed in four Jupyter notebooks; details are provided below:
First, an ETL function takes in the Wikipedia JSON, Kaggle metadata, and MovieLens CSV files and creates three separate DataFrames that are used in the later steps.
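The extraction step can be sketched as follows. This is a minimal, hypothetical version: the function name and file paths are placeholders, and tiny inline sample files are written first so the sketch runs end to end (the real pipeline points these paths at the full downloads).

```python
import json

import pandas as pd


def extract(wiki_file, kaggle_file, ratings_file):
    """Read the three source files and return one DataFrame per source."""
    # Wikipedia data is a JSON list of movie records
    with open(wiki_file, mode="r") as f:
        wiki_movies_raw = json.load(f)
    wiki_movies_df = pd.DataFrame(wiki_movies_raw)

    # Kaggle metadata and MovieLens ratings are CSV files
    kaggle_metadata = pd.read_csv(kaggle_file, low_memory=False)
    ratings = pd.read_csv(ratings_file)
    return wiki_movies_df, kaggle_metadata, ratings


# Tiny invented sample files so the sketch is self-contained
with open("wiki_sample.json", "w") as f:
    json.dump([{"title": "Movie A", "year": 2001}], f)
pd.DataFrame({"id": [1], "title": ["Movie A"]}).to_csv("kaggle_sample.csv", index=False)
pd.DataFrame({"userId": [1], "movieId": [1], "rating": [4.5]}).to_csv(
    "ratings_sample.csv", index=False
)

wiki_df, kaggle_df, ratings_df = extract(
    "wiki_sample.json", "kaggle_sample.csv", "ratings_sample.csv"
)
```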
Next, the Wikipedia data is transformed: TV shows are filtered out, redundant data is consolidated, duplicates are removed, and the remaining columns are formatted.
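A minimal sketch of that Wikipedia cleaning step, using a tiny invented sample. Column names such as `Director` and `No. of episodes` are assumptions about the raw scrape; the idea is to keep only entries that look like movies and then deduplicate on the IMDb ID pulled from the link.

```python
import pandas as pd

# Hypothetical raw Wikipedia records (the real data has many more columns)
raw = [
    {"title": "Movie A", "imdb_link": "https://www.imdb.com/title/tt0000001/",
     "Director": "Jane Doe"},
    {"title": "Movie A", "imdb_link": "https://www.imdb.com/title/tt0000001/",
     "Director": "Jane Doe"},                    # duplicate entry
    {"title": "Show B", "imdb_link": "https://www.imdb.com/title/tt0000002/",
     "No. of episodes": 10},                     # TV show, filtered out
]

# Keep only movies: entries with a director and no episode count
movies = [m for m in raw if "Director" in m and "No. of episodes" not in m]
wiki_df = pd.DataFrame(movies)

# Extract the IMDb ID from the link, then drop duplicates on it
wiki_df["imdb_id"] = wiki_df["imdb_link"].str.extract(r"(tt\d{7})")
wiki_df = wiki_df.drop_duplicates(subset="imdb_id")
```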
The Kaggle metadata and rating data are transformed in the same way: redundant data is consolidated, duplicates are removed, and the data is formatted and grouped. The Kaggle and rating data are then merged with the Wikipedia movies DataFrame so that all three sources form a single dataset.
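The merge-and-group step can be sketched like this, again with tiny invented frames. The join key (`imdb_id` between Wikipedia and Kaggle, `kaggle_id` against the MovieLens `movieId`) and the wide `rating_*` columns follow the common pattern for this kind of pipeline and are assumptions here:

```python
import pandas as pd

# Hypothetical minimal inputs
wiki_df = pd.DataFrame({"imdb_id": ["tt0000001"], "wiki_title": ["Movie A"]})
kaggle_df = pd.DataFrame({"imdb_id": ["tt0000001"], "kaggle_id": [1],
                          "title": ["Movie A"]})
ratings = pd.DataFrame({"userId": [1, 2, 3], "movieId": [1, 1, 1],
                        "rating": [4.0, 5.0, 4.0]})

# Merge Wikipedia and Kaggle data on the shared IMDb ID
movies_df = pd.merge(wiki_df, kaggle_df, on="imdb_id")

# Group the ratings into a count per movie and rating value, then pivot wide
rating_counts = (
    ratings.groupby(["movieId", "rating"], as_index=False).count()
    .rename(columns={"userId": "count"})
    .pivot(index="movieId", columns="rating", values="count")
)
rating_counts.columns = ["rating_" + str(col) for col in rating_counts.columns]

# Left-merge so movies without ratings are kept
movies_with_ratings = pd.merge(movies_df, rating_counts,
                               left_on="kaggle_id", right_index=True, how="left")
```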
Finally, the ETL function ties these steps together: it collects, cleans, transforms, and merges the movie data from all sources, then loads it into PostgreSQL, where the tables are ready for use in the Amazing Prime Hackathon.
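The load step typically uses `DataFrame.to_sql` with a SQLAlchemy engine. A minimal sketch: the PostgreSQL connection string shown in the comment is an assumed example, and an in-memory SQLite engine stands in for it here only so the sketch runs without a database server.

```python
import pandas as pd
from sqlalchemy import create_engine

# In the real pipeline the connection string points at PostgreSQL, e.g.
#   create_engine(f"postgresql://postgres:{password}@127.0.0.1:5432/movie_data")
# SQLite in-memory is used here only so the sketch is self-contained.
engine = create_engine("sqlite://")

movies_df = pd.DataFrame({"imdb_id": ["tt0000001"], "title": ["Movie A"]})
movies_df.to_sql(name="movies", con=engine, if_exists="replace", index=False)

# A large ratings file would normally be loaded in chunks to limit memory, e.g.:
#   for chunk in pd.read_csv("ratings.csv", chunksize=1_000_000):
#       chunk.to_sql(name="ratings", con=engine, if_exists="append", index=False)

# Read the table back to confirm the load
loaded = pd.read_sql("SELECT title FROM movies", con=engine)
```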