Skip to content
Parallelization project and report for DS8001: Design of Algorithms and Programming for Massive Data
Jupyter Notebook
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.
DS8001.ipynb Add code and project report Dec 18, 2018
LICENSE Create LICENSE Dec 20, 2018 Update Dec 20, 2018

Parallelization of segments of the end to end ensemble training and evaluation for a text classification task

End to end heterogeneous scalable ensemble learning algorithm which demonstrates the performance gain of using parallelism in certain portions of the code. The ensemble learning algorithm is used for the training and evaluation of text classification. To be specific, 3 shallow machine learning models (namely Logistic Regression, Naive Bayes and Random Forest) and 3 deep neural models (LSTMs with 3 different dropout rates) are trained and the training and evaluation of these models is executed in parallel. Aside from the actual training and evaluation, the preprocessing of the data is also parallelized.

Because of lack of access to a distributed cluster (Hadoop/Spark) and GPU server, the implementation is limited to a single 4-core CPU machine using Python’s multiprocessing framework with the joblib wrapper. For the purpose of text classification, a problem where the task is to identify authors given text excerpts from books written by them has been chosen. For this dataset, there are 3 predefined authors: Edgar Allan Poe, Mary Shelley, and HP Lovecraft. The dataset from this Kaggle competition is from: . The training dataset contains more than 18,000 records which, while not gigantic, is big enough to allow us to simulate a big data problem and experiment with parallelization.

The end to end algorithm has been implemented with python’s multi-processing framework along with joblib wrapper, numpy, pandas, scikit learn and Keras.



You can’t perform that action at this time.