# Creating a Big Data Stack

## Assignment 2

Big data is a rapidly evolving field. Keeping track of the latest big data stacks, programming libraries, software, and other tools requires constant vigilance, and any book on the subject risks being out of date by the time it is published. We need a resource that is updated more frequently. 

This assignment will help create that resource by researching the latest big data tools and technologies. We will use this research to create an *Awesome Big Data* list. Below is a list of similar *awesome* lists that may be useful when creating our *Awesome Big Data* list. 

*[Awesome Python](https://awesome-python.com/)* is a curated list of awesome Python frameworks, libraries, software and resources. It was inspired by [awesome-php](https://github.com/ziadoz/awesome-php). 

*[Awesome Jupyter](https://github.com/markusschanta/awesome-jupyter)* is a curated list of awesome Jupyter projects, libraries and resources. Jupyter is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

*[Awesome Dash](https://github.com/ucg8j/awesome-dash)* is a curated list of awesome Dash (plotly) resources. Dash is a productive Python framework for building web applications. Written on top of Flask, Plotly.js, and React.js, Dash is ideal for building data visualization apps with highly custom user interfaces in pure Python. It's particularly suited for anyone who works with data in Python.

*[Awesome JavaScript](https://github.com/sorrycc/awesome-javascript)* is a collection of awesome browser-side JavaScript libraries, resources and shiny things. The [data visualization section](https://github.com/sorrycc/awesome-javascript#data-visualization) may be of use. 

*[Awesome Deep Learning](https://github.com/ChristosChristofidis/awesome-deep-learning)* is a curated list of awesome Deep Learning tutorials, projects and communities.

*[Awesome Machine Learning](https://github.com/josephmisiti/awesome-machine-learning)* is a curated list of awesome machine learning frameworks, libraries and software (by language).

*[Awesome Data Engineering](https://github.com/igorbarinov/awesome-data-engineering)* is a curated list of data engineering tools for software developers. 

*[Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)* is a list of high-quality, topic-centric public data sources. They are collected and tidied from blogs, answers, and user responses. 

*[Awesome](https://github.com/sindresorhus/awesome)* is a list of awesome lists about all kinds of interesting topics.

### Assignment 2.1

Before we get started, we will assess your knowledge of big data by taking the [Pokémon or Big Data Quiz](http://pixelastic.github.io/pokemonorbigdata/). Don't worry. The quiz results won't impact your grade. 

Included below is code that fetches the answers to the questions and provides the results in a Pandas dataframe. 

In [None]:
import pandas as pd

# URL of the quiz's question data (JSON) in the quiz's GitHub repository
quiz_answers_json = 'https://raw.githubusercontent.com/pixelastic/pokemonorbigdata/master/app/questions.json'
df_all = pd.read_json(quiz_answers_json)
# Pokémon answers
df_all[df_all['type'] == 0]

In [None]:
# Big data answers
df_all[df_all['type'] == 1]

### Assignment 2.2

In the next part of the assignment, you will populate our list with items organized by category. The first chapter of the textbook, *Big Data Science & Analytics: A Hands-On Approach*, provides a list of categories and subcategories for the big data stack. We will use these categories as a starting point but will not constrain ourselves to them. 

When creating categories, avoid deeply nested layers of categories and subcategories. At most, define a top-level category with multiple subcategories. We will start with the following high-level categories and subcategories. 

***Categories***

We will use the distutils trove classification convention defined in [PEP 301](https://www.python.org/dev/peps/pep-0301/) when defining a category with a subcategory. This convention uses a double colon ("::") to separate a category from its subcategory. The following is an example of categories and subcategories as defined in the first chapter of the textbook, *Big Data Science & Analytics: A Hands-On Approach*. 

- Batch Analysis :: DAG
- Batch Analysis :: Machine Learning
- Batch Analysis :: MapReduce
- Batch Analysis :: Script
- Batch Analysis :: Search
- Batch Analysis :: Workflow Scheduling
- Data Access Connector :: Custom Connectors
- Data Access Connector :: Publish-Subscribe
- Data Access Connector :: Queues
- Data Access Connector :: SQL
- Data Access Connector :: Source-Sink
- Data Storage :: Distributed File System
- Data Storage :: NoSQL
- Deployment :: NoSQL
- Deployment :: SQL
- Deployment :: Visualization Frameworks
- Deployment :: Web Frameworks
- Interactive Querying :: Analytic SQL
- Real-Time Analysis :: In-Memory
- Real-Time Analysis :: Stream Processing
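
The double-colon convention is easy to handle in code. The following is a minimal sketch, not part of the assignment's provided code, that splits a category string into its top-level category and optional subcategory. The names `Category` and `parse_category` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Optional


class Category(NamedTuple):
    """A top-level category with an optional subcategory (hypothetical helper)."""
    name: str
    subcategory: Optional[str] = None


def parse_category(text: str) -> Category:
    """Split 'Category :: Subcategory' on the first '::' separator."""
    name, _, subcategory = text.partition("::")
    # partition() returns an empty string when no separator is present,
    # so an absent (or blank) subcategory becomes None.
    return Category(name.strip(), subcategory.strip() or None)


print(parse_category("Batch Analysis :: MapReduce"))
print(parse_category("Message Queues"))
```

Splitting only on the first separator keeps the result at one level of nesting, which matches the advice above to avoid deeply nested categories.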

Below is a list containing categories and suggested starting points for research. Fill in at least two items from each of the suggested categories. Create at least one category that is not listed and add two items to that category. 

* AI and Machine Learning
    * Apache Spark's MLlib
    * H2O
    * TensorFlow
* Batch Processing
    * Apache
    * Apache Spark
    * Dask
    * MapReduce
* Cloud and Data Platforms
    * Amazon Web Services
    * Cloudera Data Platform
    * Google Cloud Platform
    * Microsoft Azure
* Container Engines and Orchestration
    * Docker
    * Docker Swarm
    * Kubernetes
    * Podman
* Data Storage :: Block Storage
    * Amazon EBS
    * OpenEBS
* Data Storage :: Cluster Storage
    * Ceph
    * HDFS
* Data Storage :: Object Storage
    * Amazon S3
    * MinIO
* Data Transfer Tools
    * Apache Sqoop
* Full-Text Search
    * Apache Solr
    * Elasticsearch
* Interactive Query
    * Apache Hive
    * Google BigQuery
    * Spark SQL
* Message Queues
    * Apache Kafka
    * RabbitMQ
* NoSQL :: Document Databases
    * CouchDB
    * Google Firestore
    * MongoDB
* NoSQL :: Graph Databases
    * Dgraph
    * Neo4j
* NoSQL :: Key-Value Databases
    * Amazon DynamoDB
* NoSQL :: Time-Series Databases
    * TSDB
* Serverless Functions
    * AWS Lambda
    * OpenFaaS
* Stream Processing
    * Apache Spark's Structured Streaming
    * Apache Storm
    * Google Dataflow
* Visualization Frameworks
    * Apache Superset
    * Redash
* Workflow Engine
    * Apache Airflow
    * Google Cloud Composer
    * Oozie
    
We populate the list items using the `ListItem` class, defined below. The following is a description of the `ListItem` fields. 

**name**

The proper name of the list item.

**website**

Link to the item's website.  Include `http://` or `https://` in the link. 

**category**

Category and optional subcategory for the item. 

**short_description**

Provide a short, one to two-sentence description of the item. 

In [None]:
from dataclasses import dataclass

@dataclass(frozen=True)
class ListItem:
    name: str
    website: str
    category: str
    short_description: str
    
all_items = set()
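
The field descriptions above can be checked mechanically before an item is added. The following is a minimal sketch of such a check, assuming the rules stated in this assignment (a link that includes `http://` or `https://`, at most one subcategory level, and non-empty text fields). The function `find_problems` is hypothetical, and the `ListItem` class is repeated here only so the sketch runs on its own. In the notebook you would reuse the class defined above.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass(frozen=True)
class ListItem:  # repeated from the cell above so this sketch is self-contained
    name: str
    website: str
    category: str
    short_description: str


def find_problems(item: ListItem) -> List[str]:
    """Return human-readable problems with an item (hypothetical helper)."""
    problems = []
    if not item.name.strip():
        problems.append("name is empty")
    parsed = urlparse(item.website)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("website must include http:// or https://")
    if not item.category.strip():
        problems.append("category is empty")
    elif item.category.count("::") > 1:
        problems.append("category has more than one subcategory level")
    if not item.short_description.strip():
        problems.append("short_description is empty")
    return problems


good = ListItem("Apache Kafka", "https://kafka.apache.org/",
                "Message Queues", "A distributed event streaming platform.")
bad = ListItem("Apache Kafka", "kafka.apache.org",
               "Message Queues", "A distributed event streaming platform.")
print(find_problems(good))
print(find_problems(bad))
```

Returning a list of problems rather than raising on the first one makes it possible to report everything wrong with an entry in a single pass.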

The following is an example of creating the entry for AWS as a separate variable and then adding it to the `all_items` set. 

In [None]:
aws = ListItem(
    'Amazon Web Services',
    'https://aws.amazon.com/',
    'Cloud and Data Platforms',
    """Provides on-demand cloud computing platforms and APIs to individuals, 
    companies, and governments, on a metered pay-as-you-go basis."""
)

all_items.add(aws)

You can also add an item to the `all_items` set directly, without assigning it to a variable first. 

In [None]:
all_items.remove(aws)
all_items.add(ListItem(
    'Amazon Web Services',
    'https://aws.amazon.com/',
    'Cloud and Data Platforms',
    """Provides on-demand cloud computing platforms and APIs to individuals, 
    companies, and governments, on a metered pay-as-you-go basis."""
))
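
Because `ListItem` is declared with `frozen=True`, its instances are hashable and compare equal when all four fields match, which is what allows them to be stored in a set. The following minimal sketch illustrates the consequence: adding an identical item twice leaves a single entry. The class is repeated so the sketch runs on its own, and the variable names and shortened description are chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListItem:  # repeated from the cell above so this sketch is self-contained
    name: str
    website: str
    category: str
    short_description: str


items = set()
first = ListItem("Amazon Web Services", "https://aws.amazon.com/",
                 "Cloud and Data Platforms", "On-demand cloud computing.")
duplicate = ListItem("Amazon Web Services", "https://aws.amazon.com/",
                     "Cloud and Data Platforms", "On-demand cloud computing.")

items.add(first)
items.add(duplicate)  # equal to `first`, so the set does not grow

print(first == duplicate)  # True
print(len(items))          # 1
```

Note that two items differing in any field, even only in the whitespace of the description, count as distinct entries. Removing the old entry before adding a revised one avoids near-duplicates.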

In [None]:
# TODO: Fill in at least two items from each of the suggested categories. 

# TODO: Create at least one category that is not listed and add two items to that category. 

### Assignment 2.3 (Optional)

Use the `all_items` data to create Markdown output that mirrors the output of [Awesome Python](https://raw.githubusercontent.com/vinta/awesome-python/master/README.md). You can use the `jinja2` template engine to complete this task. This part of the assignment is entirely optional and is not graded. 

In [None]:
import jinja2

template = jinja2.Template("""
# Awesome Big Data

A curated list of awesome big data frameworks, libraries, software and resources.

Inspired by [awesome-php](https://github.com/ziadoz/awesome-php).
""")
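
As a starting point, the grouping logic can be worked out separately from the template. The following is a minimal sketch that uses only the Python standard library instead of `jinja2`: it groups items by category and emits one Markdown heading per category with a bullet per item. The function `render_markdown` and the exact output layout are assumptions that only loosely follow the Awesome Python README, and the `ListItem` class is repeated so the sketch runs on its own.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ListItem:  # repeated from Assignment 2.2 so this sketch is self-contained
    name: str
    website: str
    category: str
    short_description: str


def render_markdown(items: Iterable[ListItem]) -> str:
    """Render items as Markdown grouped by category (hypothetical helper)."""
    by_category = defaultdict(list)
    for item in items:
        by_category[item.category].append(item)

    lines = ["# Awesome Big Data", ""]
    for category in sorted(by_category):
        lines += [f"## {category}", ""]
        for item in sorted(by_category[category], key=lambda i: i.name.lower()):
            # Collapse the newlines and indentation left by triple-quoted strings.
            description = " ".join(item.short_description.split())
            lines.append(f"* [{item.name}]({item.website}) - {description}")
        lines.append("")
    return "\n".join(lines)


sample = {
    ListItem("RabbitMQ", "https://www.rabbitmq.com/",
             "Message Queues", "An open-source message broker."),
    ListItem("Apache Kafka", "https://kafka.apache.org/",
             "Message Queues", """A distributed event
             streaming platform."""),
}
print(render_markdown(sample))
```

Sorting categories and item names makes the output deterministic even though `all_items` is an unordered set. The same grouped dictionary could be passed to a `jinja2` template in place of the string building shown here.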