sparkml-flights-delay

Predicting the arrival delay of commercial flights using Apache Spark MLlib

Getting started | Validation process | Authors | License

Getting started

The easiest way to run this project is to clone the repository locally, build a fat JAR with Maven, and execute the shell script in the project's root directory.

mvn clean package
./run.sh

You can activate or deactivate the explore stage with the --explore flag (add or remove this flag inside the run.sh script).

The output should be similar to the following:

[Image: project-demo]

You can also import the project into your favourite IDE, but keep in mind that the program requires one argument: the dataset to process. Several valid datasets are available from Airline On-Time Statistics and Delay Causes.

Be aware that a run can take a long time on a large dataset (14 models are trained with 10-fold cross-validation). For this reason, we included a small tuning.csv file in the raw folder. Consider using this dataset first to check that the program works properly.
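To get a feeling for why a large dataset is slow, the figures above can be turned into a rough count of training runs. This is only a back-of-the-envelope sketch in plain Python: the helper names and the 3-minutes-per-fit figure are hypothetical, and it assumes every one of the 14 candidate models is fitted once per fold.

```python
def cv_training_runs(n_candidates: int, n_folds: int, n_refits: int = 0) -> int:
    """Number of model fits performed by a cross-validated grid search.

    Each candidate is fitted once per fold; n_refits optionally counts
    the final refits of the winning models on the full training set.
    """
    return n_candidates * n_folds + n_refits


def estimated_minutes(runs: int, minutes_per_fit: float) -> float:
    """Rough wall-clock estimate, assuming fits run one after another."""
    return runs * minutes_per_fit


runs = cv_training_runs(n_candidates=14, n_folds=10)
print(runs)  # 140

# Hypothetical cost per fit; measure it on your own cluster and dataset.
print(estimated_minutes(runs, minutes_per_fit=3))  # 420
```

With the small tuning.csv file each fit is cheap, so the same 140 runs finish quickly.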

Validation process

The general workflow of the program is shown in the image below:

[Image: project-flow]

Hyperparameter tuning and model selection are carried out using cross-validation on the training dataset. In this stage, a grid search is performed over two different models: Linear Regression and Random Forest (you can add your own by extending the CVTuningPipeline class). Finally, the test error of the best model is obtained using the test set.
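The selection procedure described above can be sketched in a few lines of plain Python. This is not the project's Spark code: the data is synthetic, all function names are hypothetical, and a 1-D ridge regression and a k-nearest-neighbours regressor are used as lightweight stand-ins for Linear Regression and Random Forest. It only illustrates the idea: score every grid candidate with k-fold cross-validation on the training set, keep the best one, and touch the test set once at the end.

```python
import random
from statistics import mean


def rmse(model, rows):
    """Root mean squared error of `model` (a callable x -> prediction) on (x, y) rows."""
    return mean((model(x) - y) ** 2 for x, y in rows) ** 0.5


def fit_ridge(rows, lam):
    """1-D ridge regression: the slope is shrunk by `lam`, the intercept is not penalised."""
    xs, ys = zip(*rows)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    slope = sxy / (sxx + lam)
    return lambda x: my + slope * (x - mx)


def fit_knn(rows, k):
    """k-nearest-neighbours regressor: average y of the k closest training points."""
    def predict(x):
        nearest = sorted(rows, key=lambda r: abs(r[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict


def k_folds(rows, k):
    """Yield (train, validation) splits; fold i holds every k-th row starting at index i."""
    for i in range(k):
        yield ([r for j, r in enumerate(rows) if j % k != i],
               [r for j, r in enumerate(rows) if j % k == i])


def cv_error(fit, rows, k):
    """Average validation RMSE of one candidate over k folds."""
    return mean(rmse(fit(train), valid) for train, valid in k_folds(rows, k))


def select_and_test(candidates, train, test, k=10):
    """Pick the candidate with the lowest CV error, refit it on all training rows, score on test."""
    scores = {name: cv_error(fit, train, k) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores, rmse(candidates[best](train), test)


# Synthetic data, made up for illustration: y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
data = [(x, 2 * x + 1 + rng.gauss(0, 1))
        for x in (rng.uniform(0, 10) for _ in range(200))]
train, test = data[:160], data[160:]  # rows are already in random order

# The grid: every (model family, hyperparameter value) pair is one candidate.
candidates = {f"ridge(lam={lam})": (lambda rows, lam=lam: fit_ridge(rows, lam))
              for lam in (0.0, 1.0, 100.0)}
candidates.update({f"knn(k={k})": (lambda rows, k=k: fit_knn(rows, k))
                   for k in (1, 5)})

best, scores, test_rmse = select_and_test(candidates, train, test)
print(best, round(test_rmse, 3))
```

The test rows never influence which candidate wins, so the final RMSE is an honest estimate of how the selected model behaves on unseen data.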

Authors 🇪🇸 💙 🇮🇹

  • Fernando Díaz
  • Giorgio Ruffa

License

This project is licensed under the MIT License; see the LICENSE.md file for details.