Latest commit 2fcbf77 Jun 4, 2015 @ceteri adding social graph example


Microservices, Containers, and Machine Learning

A frequently asked question on the Apache Spark user email list concerns where to find data sets for evaluating the code. Oddly enough, the collection of archived messages for this email list provides an excellent data set for evaluating machine learning, graph algorithms, text analytics, time-series analysis, etc.

Herein, an open source developer community considers itself algorithmically. This project shows work-in-progress for how to surface data insights from the developer email forums for an Apache open source project. It leverages advanced technologies for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the <user@spark.apache.org> email list archives to help understand its community better.

See DataDayTexas 2015 session talk

In particular, we show production use of NLP tooling in Python, integrated with MLlib (machine learning) and GraphX (graph algorithms) in Apache Spark. Machine learning approaches used include: Word2Vec, TextRank, Connected Components, Streaming K-Means, etc.
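To make one of the listed approaches concrete, here is a minimal plain-Python sketch of Connected Components over a reply graph. It is not the project's GraphX code: the function name `connected_components` and the sample sender/recipient pairs are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def connected_components(edges):
    """Group the nodes of an undirected graph into connected components."""
    # Build an adjacency list; each edge links two list participants.
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collects everyone reachable from `start`.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


# Hypothetical (sender, replied_to) pairs, not taken from the real archive.
replies = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]
groups = connected_components(replies)
```

Each resulting set is a group of people linked directly or indirectly by replies, which is one simple way to look for sub-communities in an email list.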

Keep in mind that "One Size Fits All" is an anti-pattern, especially for Big Data tools. This project illustrates how to leverage microservices and containers to scale out the code+data components that do not fit well in Spark, Hadoop, etc.

In addition to Spark, other technologies used include: Mesos, Docker, Anaconda, Flask, NLTK, TextBlob.

Dependencies

conda config --add channels https://conda.binstar.org/sloria
conda install textblob
python -m textblob.download_corpora
python -m nltk.downloader -d ~/nltk_data all
pip install -U textblob textblob-aptagger
pip install lxml
pip install python-dateutil
pip install Flask

NLTK and TextBlob require some data downloads which may also require updating the NLTK data path:

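import os
import nltk
# NLTK does not expand "~" itself, so expand it before adding the path
nltk.data.path.append(os.path.expanduser("~/nltk_data/"))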

Running

To change the project configuration simply edit the defaults.cfg file.
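For checking edits to the configuration, a small helper along these lines may be useful. This is a sketch that assumes defaults.cfg is an INI-style file readable by Python's standard `configparser`; the README does not confirm the file format, and the helper name `load_config` is hypothetical.

```python
import configparser


def load_config(path="defaults.cfg"):
    """Read an INI-style file into a plain nested dict (format is assumed)."""
    parser = configparser.ConfigParser()
    # read() returns the list of files it parsed; an empty list means
    # the file was missing or unreadable.
    if not parser.read(path):
        raise FileNotFoundError(path)
    # Option names are lowercased by configparser's default behavior.
    return {section: dict(parser[section]) for section in parser.sections()}
```

Calling `load_config()` from the project directory and printing the result shows which sections and options the scripts would see after an edit.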

scrape the email list

./scrape.py data/foo.json

parse the email text

./parse.py data/foo.json parsed/foo.json

What's in a name?

The word exsto is the Latin verb meaning "to stand out", in its present active form.

Research Topics

machine learning

microservices and containers