Canceled Flight Prediction

This project is a Cloudera Machine Learning (CML) Applied Machine Learning Prototype and has all the code and data needed to deploy an end-to-end machine learning project on a running CML instance.

The primary goal of this repository is to build a gradient boosted (XGBoost) classification model to predict the likelihood of a flight being canceled based on years of historical records. To achieve that goal, this project demonstrates the end-to-end processing needed to take two large, raw datasets and transform them into a clean, unified dataset for model training and inference using Spark on CML. Additionally, this project deploys a hosted model and front-end application to allow users to interact with the trained model.

The two datasets used in this project come from Kaggle and the Bureau of Transportation Statistics.

Project Structure

The project is organized with the following folder structure:

.
├── code/           # Backend scripts and notebooks needed to create project artifacts
├── data/           # A post-processed sample of the full dataset used for model training
├── app/            # Assets needed to support the front end application
├── images/         # A collection of images referenced in project docs
├── models/         # Directory to hold trained models
├── cdsw-build.sh   # Shell script used to build environment for experiments and models
├── README.md
├── LICENSE.txt
└── requirements.txt

By following the notebooks, scripts, and documentation in the code directory, you will understand how to perform similar tasks on CML, as well as how to use the platform's major features to your advantage. These features include:

Data ingestion, cleaning, and processing with Spark
Hive table creation and querying
Streamlined model development
Point-and-click model deployment to a RESTful API endpoint
Application hosting for deploying frontend ML applications
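As a rough illustration of what calling a deployed model over a RESTful endpoint can look like, the sketch below builds (but does not send) a JSON POST request using only the Python standard library. The endpoint URL, access key, and feature names are hypothetical placeholders, and the `accessKey`/`request` payload layout is an assumption about the model endpoint; check the model's overview page in your workspace for the exact request format.

```python
import json
import urllib.request


def build_model_request(endpoint_url, access_key, features):
    """Build a JSON POST request for a hosted model endpoint.

    Assumption: the endpoint expects a body of the form
    {"accessKey": ..., "request": {...}}.
    """
    payload = {"accessKey": access_key, "request": features}
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values for illustration only; none come from this project.
req = build_model_request(
    "https://modelservice.example.com/model",
    "your-access-key",
    {"carrier": "AA", "origin": "JFK", "dest": "LAX"},
)
# Sending is left out on purpose; urllib.request.urlopen(req) would issue the call.
```

Building the request separately from sending it makes the payload easy to inspect before pointing it at a live endpoint.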

We focus on working within CML and using what the platform offers, and gloss over the details that are standard data science practice. Particular attention goes to data ingestion and processing at scale with Spark.

Deploying on CML

There are three ways to launch this prototype on CML:

1. From Prototype Catalog - Navigate to the Prototype Catalog on a CML workspace, select the "Airline Delay Prediction" tile, click "Launch as Project", click "Configure Project".
2. As ML Prototype - In a CML workspace, click "New Project", add a Project Name, select "ML Prototype" as the Initial Setup option, copy in the repo URL, click "Create Project", click "Configure Project".
3. Manual Setup - In a CML workspace, click "New Project", add a Project Name, select "Git" as the Initial Setup option, copy in the repo URL, click "Create Project". Then follow the steps listed in this document in order.

If you deploy this project as an Applied ML Prototype (AMP) (the first two options above), you will need to specify whether to run the project with STORAGE_MODE set to local or external. Running in external mode requires external storage to be configured on your CML workspace and triggers the project to ingest, process, and store ~20GB of raw data using Spark. Running in local mode bypasses the data ingestion and manipulation steps by using the data/preprocessed_flight_data.tgz file to train a model and deploy the application. While running the project as an AMP will install, set up, and build all project artifacts for you, it may still be instructive to review the documentation and files in the code directory.
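The branching described above can be sketched as follows. This is a minimal illustration, not the project's actual setup code: it assumes STORAGE_MODE is exposed as an environment variable, falls back to local when the variable is unset (an assumption), and uses the hypothetical helper names `resolve_storage_mode` and `plan_steps`.

```python
import os

VALID_MODES = {"local", "external"}


def resolve_storage_mode(env=None):
    """Read STORAGE_MODE from the environment and validate it.

    Assumption: defaulting to "local" when unset is a choice made for
    this sketch, not documented project behavior.
    """
    env = os.environ if env is None else env
    mode = env.get("STORAGE_MODE", "local").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"STORAGE_MODE must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return mode


def plan_steps(mode):
    """Return the high-level steps implied by each mode."""
    if mode == "external":
        # External mode runs the full Spark pipeline on ~20GB of raw data.
        return [
            "ingest raw data with Spark",
            "process and store data in external storage",
            "train model",
            "deploy application",
        ]
    # Local mode skips ingestion and uses the bundled preprocessed sample.
    return [
        "extract data/preprocessed_flight_data.tgz",
        "train model",
        "deploy application",
    ]


steps = plan_steps(resolve_storage_mode({"STORAGE_MODE": "local"}))
```

Passing the environment in as a plain dictionary keeps the mode check easy to exercise without changing real environment variables.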
