TODO

  1. Build Docker image
  2. Build final result
  3. Add git tags to check out
  4. Publish instructions as a GitHub Pages repo

Data Engineering 101: Building a Data Pipeline

This repository contains the files and data from a workshop at PARISOMA as well as resources around Data Engineering.

I would love your feedback on the materials via the GitHub issues. Please do not hesitate to reach out to me directly via email at jonathan@galvanize.it or on Twitter @clearspandex.

The presentation can be found on SlideShare or in this repository (presentation.pdf).

Throughout this workshop, you will learn how to build a scalable and sustainable data pipeline in Python with Luigi.

Learning Objectives

  • Run a simple one-stage Luigi flow reading/writing to local files (see the sketch after this list)
  • Write a Luigi flow containing stages with multiple dependencies
    • Visualize the progress of the flow using the centralized scheduler
    • Parameterize the flow from the command line
    • Output parameter specific output files
  • Manage serialization to/from a Postgres database
  • Integrate a Hadoop Map/Reduce task into an existing flow
  • Parallelize non-dependent stages of a multi-stage Luigi flow
  • Schedule a local Luigi job to run once every day
  • Run any arbitrary shell command in a repeatable way
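
To make the first objective concrete, here is a minimal sketch of a one-stage Luigi flow that reads a local file and writes a transformed copy. The file names and the uppercasing step are illustrative assumptions, not part of the workshop code:

```python
import luigi

class UppercaseFile(luigi.Task):
    """A one-stage flow: read a local file, write an uppercased copy."""

    # Hypothetical input path, overridable from the command line.
    input_path = luigi.Parameter(default='input.txt')

    def output(self):
        # Luigi checks whether this target exists to decide if the task is done.
        return luigi.LocalTarget('output.txt')

    def run(self):
        with open(self.input_path) as infile, self.output().open('w') as outfile:
            outfile.write(infile.read().upper())

if __name__ == '__main__':
    # local_scheduler=True avoids needing a running luigid for this toy example.
    luigi.build([UppercaseFile()], local_scheduler=True)
```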

Prerequisites

  1. Install Python; I recommend Anaconda (Mac OS X or Windows): http://continuum.io/downloads
  2. Get the files: download the ZIP or `git clone` this repository: https://github.com/Jay-Oh-eN/data-engineering-101 (git tutorial)
  3. Text editor: I recommend Sublime Text (https://www.sublimetext.com)
  4. A (modern) web browser: I recommend Google Chrome (https://www.google.com/chrome/)
  5. Docker: download Kitematic

Schedule

| Time      | Activity                                                            |
|-----------|---------------------------------------------------------------------|
| 1:00-1:10 | Components of Data pipelines (Lecture)                              |
| 1:10-1:20 | What and Why Luigi (Lecture)                                        |
| 1:20-1:40 | The Smallest (1 stage) pipeline (Live Code)                         |
| 1:25-1:40 | The Smallest (1 stage) pipeline (Lab)                               |
| 1:25-1:40 | The Smallest (1 stage) pipeline (Solution)                          |
| (10 min)  | Managing dependencies in a pipeline                                 |
| (15 min)  | Lab: Multi-stage pipeline and introduction to the Luigi Visualizer  |
| (10 min)  | Serialization in a Data Pipeline                                    |
| (20 min)  | Lab: Integrating your pipeline with HDFS and Postgres               |
| (10 min)  | Scheduling                                                          |
| (20 min)  | Lab: Parallelism and recurring jobs with Luigi                      |
| (5 min)   | Wrap up and next steps                                              |

Getting Started

First, install Python and get a copy of this repository as described in the Prerequisites above.

Run the Code

  1. Hadoop Docker image (includes the `upload-data.sh` script to transfer files)
  2. Luigi client Docker image:
    • `luigid --background --logdir logs`
    • `python ml-pipeline.py BuildModels --input-dir text --num-topics 10 --lam 0.8`

Local

  1. Install libraries and dependencies: `pip install -r requirements.txt`
  2. Start the UI server: `luigid --background --logdir logs`
  3. Navigate with a web browser to http://localhost:[port], where [port] is the port the luigid server is running on (luigid defaults to port 8082)
  4. Run the final pipeline: `python ml-pipeline.py BuildModels --input-dir text --num-topics 10 --lam 0.8` (see the parameter sketch below)
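
The --input-dir, --num-topics, and --lam flags map onto Luigi parameters declared on the BuildModels task; Luigi turns dashes into underscores when matching flags to parameter names. The real definitions live in ml-pipeline.py and may differ; this is just a minimal sketch with assumed names and defaults:

```python
import luigi

class BuildModels(luigi.Task):
    # --input-dir, --num-topics, and --lam on the command line set these
    # attributes. Names and defaults here are assumptions for illustration.
    input_dir = luigi.Parameter()
    num_topics = luigi.IntParameter(default=10)
    lam = luigi.FloatParameter(default=0.8)

    def run(self):
        print(self.input_dir, self.num_topics, self.lam)
```

Because parameter values are part of a task's identity, running the flow with different values produces distinct task instances, which is what makes parameter-specific output files possible.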

Hadoop

  1. Start Hadoop cluster: `sbin/start-dfs.sh; sbin/start-yarn.sh`
  2. Set up directory structure: `hadoop fs -mkdir /tmp/text`
  3. Get files onto the cluster: `hadoop fs -put ./data/text /tmp/text`
  4. Retrieve results: `hadoop fs -getmerge /tmp/text-count/2012-06-01 ./counts.txt`
  5. View results: `head ./counts.txt`
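
The workshop's actual Map/Reduce pipeline is in hadoop_word_count.py. For flavor, a word-count job built on luigi.contrib.hadoop looks roughly like the sketch below; the module paths follow current Luigi releases, and the HDFS paths mirror the steps above but are otherwise assumptions:

```python
import luigi
import luigi.contrib.hadoop
import luigi.contrib.hdfs

class InputText(luigi.ExternalTask):
    """The text files already uploaded to HDFS with hadoop fs -put."""

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget('/tmp/text')

class WordCount(luigi.contrib.hadoop.JobTask):
    """Streaming Map/Reduce job that counts words across all documents."""

    def requires(self):
        return InputText()

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget('/tmp/text-count')

    def mapper(self, line):
        # Called once per input line; emit a (word, 1) pair per token.
        for word in line.strip().split():
            yield word, 1

    def reducer(self, key, values):
        # Called with every count emitted for one word; emit the total.
        yield key, sum(values)
```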

Libraries Used

What's in here?

text/                   20newsgroups text files
example_luigi.py        example scaffold of a luigi pipeline
hadoop_word_count.py    example luigi pipeline using Hadoop
ml-pipeline.py          luigi pipeline covered in workshop
LICENSE                 Details of rights of use and distribution
presentation.pdf        lecture slides from presentation
readme.md               this file!

The Data

The data (in the text/ folder) is from the 20 newsgroups dataset, a standard benchmarking dataset for machine learning and NLP. Each file in text/ corresponds to a single 'document' (or post) from one of two selected newsgroups (comp.sys.ibm.pc.hardware or alt.atheism). The first line indicates which newsgroup the document is from, and everything thereafter is the body of the post:

comp.sys.ibm.pc.hardware
I'm looking for a better method to back up files.  Currently using a MaynStream
250Q that uses DC 6250 tapes.  I will need to have a capacity of 600 Mb to 1Gb
for future backups.  Only DOS files.

I would be VERY appreciative of information about backup devices or
manufacturers of these products.  Flopticals, DAT, tape, anything.  
If possible, please include price, backup speed, manufacturer (phone #?), 
and opinions about the quality/reliability.

Please E-Mail, I'll send summaries to those interested.

Thanx in advance,
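
Given this layout, splitting a document into its label and body takes only a few lines. A minimal sketch, with a hypothetical function name and file path:

```python
from pathlib import Path

def load_document(path):
    """Split a 20 newsgroups file into (newsgroup_label, post_body).

    The first line names the newsgroup; everything after it is the post.
    """
    text = Path(path).read_text(errors='ignore')
    label, _, body = text.partition('\n')
    return label.strip(), body

# Hypothetical usage against a file in the text/ folder:
# label, body = load_document('text/0001.txt')
```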

Resources/References

License

Copyright 2015 Jonathan Dinu.

All files and content are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License.
