This project is an end-to-end ETL pipeline over the MovieLens dataset for training a recommendation system. The end-to-end workflow is automated using Apache Airflow.
Each pipeline is executed as a Directed Acyclic Graph (DAG) by Airflow.
Spark is used for the transformations.
NOTE: This code requires Python 3.5; it has not been tested on other versions.
Each transformation pipeline can be defined by extending etl.pipline.BasePipeLine.
Scripts that process or explore the data are placed in the scripts folder.
This folder is created by the pipelines; the result dataset is generated here.
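The actual interface of etl.pipline.BasePipeLine is not shown in this README, so the following is only a sketch of the extension pattern, with a stand-in base class and assumed extract/transform/load hooks:

```python
# Sketch only: the real base class lives in etl.pipline.BasePipeLine and its
# interface may differ; the extract/transform/load hooks are assumptions.


class BasePipeLine:
    """Stand-in for etl.pipline.BasePipeLine (assumed interface)."""

    def extract(self):
        raise NotImplementedError

    def transform(self, data):
        raise NotImplementedError

    def load(self, data):
        raise NotImplementedError

    def run(self):
        # Chain the three stages: extract -> transform -> load.
        return self.load(self.transform(self.extract()))


class RatingsPipeLine(BasePipeLine):
    """Hypothetical pipeline: average rating per movie."""

    def extract(self):
        # The real project would read the MovieLens files here (e.g. via Spark).
        return [("m1", 4), ("m1", 5), ("m2", 3)]

    def transform(self, rows):
        totals = {}
        for movie, rating in rows:
            s, n = totals.get(movie, (0, 0))
            totals[movie] = (s + rating, n + 1)
        return {movie: s / n for movie, (s, n) in totals.items()}

    def load(self, result):
        # The real pipeline would write the result dataset to the output folder.
        return result
```

Calling RatingsPipeLine().run() on the toy data above returns {"m1": 4.5, "m2": 3.0}.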
For now, the DAG should be triggered externally. DAGs can be scheduled by setting start_date in default_args (together with a schedule_interval on the DAG) or through the CLI.
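As a sketch of scheduling a DAG (assuming Airflow 1.x, to match the trigger_dag CLI used below; the owner, dates, interval, and task are illustrative, not this project's actual DAG definition):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "airflow",
    # Setting start_date here (plus a schedule_interval on the DAG) lets the
    # scheduler pick the DAG up instead of requiring an external trigger.
    "start_date": datetime(2019, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "ETLPipeLineForMovieLensData",
    default_args=default_args,
    schedule_interval="@daily",  # set to None to keep external triggering only
)


def run_transform():
    # Placeholder: the real task would invoke the Spark transformation.
    pass


transform_task = PythonOperator(
    task_id="transform",
    python_callable=run_transform,
    dag=dag,
)
```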
export AIRFLOW_HOME=~/airflow/
pip install -r requirements.txt
dags_folder = //dags/
airflow webserver -p 8080
airflow scheduler
airflow worker
airflow trigger_dag ETLPipeLineForMovieLensData
(or)
python dags/ETL_MovieLensData_pipeline.py
Default web UI address: localhost:8080
- Set up a pipeline to train a simple predictive model
- Work on a sub-branch instead of master
- Create a config file to handle the Spark and file configuration (use ConfigParser for this)
- Modify scripts/transform.py so that it works correctly
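The config-file item above could use the standard-library configparser. A minimal sketch, where the section and key names (spark, files, master, input_dir, ...) are assumptions rather than the project's actual layout:

```python
from configparser import ConfigParser

# Hypothetical layout for the planned config file; the section and key names
# are assumptions, not the project's actual configuration.
SAMPLE = """
[spark]
master = local[*]
app_name = movielens_etl

[files]
input_dir = data/
output_dir = output/
"""

config = ConfigParser()
# In practice this would be config.read("etl.cfg") on a file in the repo.
config.read_string(SAMPLE)

spark_master = config.get("spark", "master")
input_dir = config.get("files", "input_dir")
```

Pipelines could then pull Spark settings and file paths from one place instead of hard-coding them in each script.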