Assignment for Scalable Machine Learning which aims to study the basics of regression and classification in Spark.

angeligareta/machine-learning-spark

Machine Learning Overview with Spark ML

First lab of the Scalable Machine Learning course of the EIT Digital data science master at KTH

Problem Statement

This project studies the basics of regression and classification in Spark. It is divided into two parts:

  • First part: a guided exercise whose objective is to predict the median housing value in the California Housing Data (1990) dataset. It involves analyzing and transforming the attributes of the dataset (e.g., string indexing, one-hot encoding, normalization). After that, four different regression models are implemented: linear regression, decision tree, random forest, and gradient-boosted tree regression. Finally, the dataset is divided into train and test sets, the models are trained, and their hyperparameters are tuned.
  • Second part: aims to classify the default payment of credit card customers using the Default of Credit Card Clients Dataset. First, an exploratory analysis is performed on the data, followed by the implementation and training of three different classification models (logistic regression, decision tree, and random forest). Finally, the models are compared, with a brief discussion of which one performs best for the task.
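
The steps of the first part can be sketched as a single Spark ML pipeline. This is a minimal illustration, not the notebook's actual code: it assumes Spark 3.x, a hypothetical CSV path (data/housing.csv), and the column names of the commonly distributed California Housing CSV (ocean_proximity, median_house_value, and so on). The split ratio, fold count, and parameter grid values are arbitrary choices for the example, and only linear regression is shown.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.{OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object HousingRegressionSketch {
  // Assumed column names; adjust them to the actual dataset schema.
  val labelCol = "median_house_value"
  val categoricalCol = "ocean_proximity"
  val numericCols: Array[String] = Array(
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income")

  // Shared estimator instance so the parameter grid refers to the same params.
  val lr: LinearRegression = new LinearRegression()
    .setLabelCol(labelCol)
    .setFeaturesCol("features")

  // Illustrative grid: 3 regularization strengths x 3 elastic-net mixes.
  val paramGrid: Array[ParamMap] = new ParamGridBuilder()
    .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
    .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
    .build()

  def buildPipeline(): Pipeline = {
    // Map the string category to a numeric index; "keep" sends unseen labels to an extra bucket.
    val indexer = new StringIndexer()
      .setInputCol(categoricalCol).setOutputCol("category_idx").setHandleInvalid("keep")
    // One-hot encode the category index.
    val encoder = new OneHotEncoder()
      .setInputCols(Array("category_idx")).setOutputCols(Array("category_vec"))
    // Collect numeric columns into one vector, then standardize it.
    val numericAssembler = new VectorAssembler()
      .setInputCols(numericCols).setOutputCol("numeric_features")
    val scaler = new StandardScaler()
      .setInputCol("numeric_features").setOutputCol("numeric_scaled")
      .setWithMean(true).setWithStd(true)
    // Concatenate scaled numeric features with the encoded category.
    val assembler = new VectorAssembler()
      .setInputCols(Array("numeric_scaled", "category_vec")).setOutputCol("features")
    new Pipeline().setStages(Array(indexer, encoder, numericAssembler, scaler, assembler, lr))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("housing-sketch").master("local[*]").getOrCreate()

    // Hypothetical input path; rows with missing values are dropped for simplicity.
    val housing = spark.read
      .option("header", "true").option("inferSchema", "true")
      .csv("data/housing.csv")
      .na.drop()

    val Array(train, test) = housing.randomSplit(Array(0.8, 0.2), seed = 42L)

    val evaluator = new RegressionEvaluator()
      .setLabelCol(labelCol).setPredictionCol("prediction").setMetricName("rmse")

    // 3-fold cross-validation over the grid, fitted on the training split only.
    val cv = new CrossValidator()
      .setEstimator(buildPipeline())
      .setEstimatorParamMaps(paramGrid)
      .setEvaluator(evaluator)
      .setNumFolds(3)

    val model = cv.fit(train)
    println(s"Test RMSE: ${evaluator.evaluate(model.transform(test))}")
    spark.stop()
  }
}
```

Swapping LinearRegression for DecisionTreeRegressor, RandomForestRegressor, or GBTRegressor (all in org.apache.spark.ml.regression) leaves the preprocessing stages unchanged; only the final stage and its parameter grid differ. The second part could follow the same shape with a classifier as the last stage and a classification evaluator.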

Tools

Both parts of the assignment are implemented in the Scala programming language with the Apache Spark machine learning library (Spark ML). In addition, Databricks was used to train more efficiently on a cluster, so the source is a Scala notebook. The implementation can be found at src/ and the notebook preview at https://angeligareta.com/machine-learning-spark/.

Authors