dazza-codes/enron-email-analysis

Requirements

Developed on an Ubuntu 16.04 Linux system using Oracle Java 8.

Usage

This project uses the Scala Build Tool (sbt). To run the examples, install sbt and start its REPL in the project directory (where the build.sbt file is located). sbt should download all the project dependencies. Any warnings about dependency conflicts can be ignored; they arise from third-party dependency resolution.

Email Spark Analysis

Package a jar containing the application

sbt package
...
[info] Packaging {..}/{..}/target/scala-2.11/enron-emails_2.11-1.0.jar

Use spark-submit to run analysis (see bin/spark_analysis.sh for updated code)

PACKAGE_JAR="target/scala-2.11/enron-emails_2.11-1.0.jar"
SPARK_HOME="/opt/spark-2.2.1-bin-hadoop2.7"
# Quote local[2] so the shell does not treat the brackets as a glob pattern
"${SPARK_HOME}/bin/spark-submit" \
  --class "SparkAnalysisApp" \
  --master "local[2]" \
  "${PACKAGE_JAR}" \
  enron_email_records.parquet  2> spark_analysis.log
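
The spark-submit call above fails with a fairly opaque error if the jar has not been built or SPARK_HOME points at the wrong place. A minimal sketch of a wrapper that checks both before submitting might look like the following. The function names `check_prereqs` and `run_analysis` are hypothetical and are not taken from bin/spark_analysis.sh; the jar path, class name, master setting, and input file are copied from the command above.

```shell
#!/bin/sh
# Hypothetical wrapper around the spark-submit command shown above.
# Defaults can be overridden from the environment.
PACKAGE_JAR="${PACKAGE_JAR:-target/scala-2.11/enron-emails_2.11-1.0.jar}"
SPARK_HOME="${SPARK_HOME:-/opt/spark-2.2.1-bin-hadoop2.7}"

# Return non-zero, with a message on stderr, if the jar or spark-submit is missing.
check_prereqs() {
  if [ ! -f "${PACKAGE_JAR}" ]; then
    echo "missing jar: ${PACKAGE_JAR} (run 'sbt package' first)" >&2
    return 1
  fi
  if [ ! -x "${SPARK_HOME}/bin/spark-submit" ]; then
    echo "spark-submit not found under SPARK_HOME=${SPARK_HOME}" >&2
    return 1
  fi
  return 0
}

# Submit the analysis for the parquet file given as $1; Spark's stderr goes to a log file.
run_analysis() {
  check_prereqs || return 1
  "${SPARK_HOME}/bin/spark-submit" \
    --class "SparkAnalysisApp" \
    --master "local[2]" \
    "${PACKAGE_JAR}" \
    "$1" 2> spark_analysis.log
}
```

After sourcing the file, `run_analysis enron_email_records.parquet` would behave like the original command, except that it stops early when a prerequisite is missing.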

Some Resources

Enron Email Data

Spark

Graphs

Akka

Avro

Eel

"Eel is a toolkit for manipulating data in the hadoop ecosystem. By hadoop ecosystem we mean file formats common to the big-data world, such as parquet, orc, csv in locations such as HDFS or Hive tables. In contrast to distributed batch or streaming engines such as Spark or Flink, Eel is an SDK intended to be used directly in process."

ElasticSearch

Gmail analysis

Scala SQL