### Install fastavro

```bash 
sudo pip install fastavro
```

In [1]:
import os
from cStringIO import StringIO
import fastavro

In [3]:
import pyspark

sc = pyspark.SparkContext()

### Load data, and use avro schema to map to JSON
We must first acquire the data. In this case, we load a series of .avro files from AWS S3. Note:
- in our binaryFiles() call, we use a "?" character as a wildcard. In a path glob it matches exactly one character; it is not a regular-expression quantifier.
- for access to the S3 bucket, you need to set your AWS credentials as environment variables.
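To make the wildcard behaviour concrete, here is a minimal sketch using Python's `fnmatch`, which follows shell-style glob rules. It assumes the path passed to Spark is interpreted as a Hadoop-style glob, where `?` has the same meaning; the file names are hypothetical.

```python
import fnmatch

# Hypothetical file names, used only to illustrate the wildcard.
names = ["part-1.avro", "part-12.avro", "part-.avro"]

# "?" matches exactly one character, so only "part-1.avro" qualifies.
matches = [n for n in names if fnmatch.fnmatchcase(n, "part-?.avro")]
print(matches)
```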

In the few lines below, we:
- distribute the S3 keys of the files as an RDD,
- map each key's binary contents to a file-like object using StringIO,
- read each object with fastavro and combine (flatMap) the resulting records into a single JSON RDD.

**An important note on distributed processing**: Imagine we're dealing with files that are ~500TB put together. We can't process that locally, which is where Spark's distributed framework comes in. By loading the data into an RDD, we ensure we're using Spark's core strength to process these large data sets at scale.

**An important note on lazy evaluation in Spark**: As noted during the Spark lecture, Spark applies _lazy evaluation_: transformations like map() and flatMap() are only evaluated when their results are explicitly requested through an action like <code>.take()</code> or <code>.collect()</code>.
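The idea can be illustrated without a cluster. The sketch below is a plain-Python analogy using a generator, not Spark itself: building the pipeline does no work, and only requesting results triggers the calls. The `double` function and the `calls` list are hypothetical helpers. Spark's own `.take()` may evaluate more elements than it returns, so the analogy covers only the deferral.

```python
from itertools import islice

calls = []

def double(x):
    # Record each call so we can see when the work actually happens.
    calls.append(x)
    return 2 * x

# Defining the pipeline is like calling map(): nothing is computed yet.
pipeline = (double(x) for x in range(1000))
calls_before = len(calls)

# Requesting results is like calling take(3): evaluation happens now,
# and only for the elements that are actually needed.
first_three = list(islice(pipeline, 3))
calls_after = len(calls)

print(calls_before, first_three, calls_after)
```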

In [4]:
from boto.s3.connection import S3Connection

conn = S3Connection('Access Key Id','Secret Access Key')
bucket = conn.get_bucket('dsci')
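Rather than pasting keys into the notebook, the credentials can be read from the environment. This is a minimal sketch: `aws_credentials` is a hypothetical helper, and it assumes the conventional variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.

```python
import os

def aws_credentials(environ=os.environ):
    """Return (access_key_id, secret_access_key) read from the environment."""
    try:
        return environ["AWS_ACCESS_KEY_ID"], environ["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        # Fail early with a readable message instead of a failed S3 request.
        raise RuntimeError("Missing environment variable: {}".format(missing))
```

With this in place, the connection could be opened as `S3Connection(*aws_credentials())`, which keeps secrets out of the saved notebook.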

In [5]:
keys = sc.parallelize(bucket.get_all_keys(prefix='6007/data/SuperWebAnalytics/new_data/data'))

In [6]:
avro_data = keys.map(lambda key: StringIO(key.get_contents_as_string()))

In [7]:
json_data = avro_data.flatMap(fastavro.reader)
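The reason for using flatMap rather than map is that every file contains many records. The sketch below shows the difference in plain Python; `read_records` is a hypothetical stand-in for `fastavro.reader`, and the "files" are just comma-separated strings.

```python
from itertools import chain

def read_records(file_contents):
    # Hypothetical stand-in for fastavro.reader: one file yields many records.
    return file_contents.split(",")

files = ["a,b", "c", "d,e,f"]

# map keeps one output element per input file: a list of lists.
mapped = [read_records(f) for f in files]

# flatMap concatenates the per-file results into one flat sequence of records.
flat_mapped = list(chain.from_iterable(read_records(f) for f in files))

print(mapped)
print(flat_mapped)
```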

In [8]:
json_data.count()

## Data Exploration
Before working with a data set, it is useful to explore it a bit and see what we are working with. In this case, our data is based on an avro schema for a graph schema we have worked with before. The data consists of records, each describing either a property of a node, or an edge.

Let's use .take() to grab the first few records: 

In [9]:
print json_data.take(2)

[{u'dataunit': {u'page_view': {u'nonce': 788500601, u'person': {u'cookie': u'UVWXY'}, u'page': {u'url': u'http://mysite.com/'}}}, u'pedigree': {u'true_as_of_secs': 1438379334}}, {u'dataunit': {u'equiv': {u'id2': {u'cookie': u'KLMNO'}, u'id1': {u'user_id': 888}}}, u'pedigree': {u'true_as_of_secs': 1438379334}}]


## Partitioning the data
Now that we have the data, we need to divide it into pieces according to the partitioning scheme outlined in the Lab specs. Our data is stored as a JSON array of objects. We'll take each record and map it to a 2-tuple containing the datatype and the actual datum. By dynamically generating the partition name, our code will be able to handle any new node properties or edge types that might be added later.

In [10]:
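def partition_data(datum):
    # The single key of 'dataunit' names the record type (e.g. 'page_view').
    datatype = datum['dataunit'].keys()[0]
    if datatype.endswith('property'):
        # Properties are sub-partitioned by the name of the property they set.
        return '/'.join((datatype, datum['dataunit'][datatype]['property'].keys()[0])), datum
    else:
        return datatype, datum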

In [11]:
partitioned_json = json_data.map(partition_data)

In [12]:
print partitioned_json.take(3)

[(u'page_view', {u'dataunit': {u'page_view': {u'nonce': 788500601, u'person': {u'cookie': u'UVWXY'}, u'page': {u'url': u'http://mysite.com/'}}}, u'pedigree': {u'true_as_of_secs': 1438379334}}), (u'equiv', {u'dataunit': {u'equiv': {u'id2': {u'cookie': u'KLMNO'}, u'id1': {u'user_id': 888}}}, u'pedigree': {u'true_as_of_secs': 1438379334}}), (u'page_view', {u'dataunit': {u'page_view': {u'nonce': 3444084808, u'person': {u'cookie': u'UVWXY'}, u'page': {u'url': u'http://mysite.com/'}}}, u'pedigree': {u'true_as_of_secs': 1438379334}})]
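The sample above contains no property records, so the sub-partition branch is not exercised. The sketch below applies the same naming rule locally; `partition_key` is a hypothetical helper that returns only the partition name, and the shape of the `page_property` record (in particular the `page_views` field) is an assumption made for illustration.

```python
def partition_key(datum):
    # Each dataunit holds exactly one entry; its name is the data type.
    datatype = next(iter(datum["dataunit"]))
    if datatype.endswith("property"):
        # Properties get a sub-partition named after the property they set.
        prop = next(iter(datum["dataunit"][datatype]["property"]))
        return "/".join((datatype, prop))
    return datatype

page_view = {"dataunit": {"page_view": {"nonce": 788500601,
                                         "person": {"cookie": "UVWXY"},
                                         "page": {"url": "http://mysite.com/"}}},
             "pedigree": {"true_as_of_secs": 1438379334}}

# Hypothetical property record; the field names inside are assumptions.
page_property = {"dataunit": {"page_property": {"id": {"url": "http://mysite.com/"},
                                                 "property": {"page_views": 1}}},
                 "pedigree": {"true_as_of_secs": 1438379334}}

print(partition_key(page_view))      # page_view
print(partition_key(page_property))  # page_property/page_views
```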


In [None]:
partitioned_json.cache()

PythonRDD[3] at RDD at PythonRDD.scala:43

In [None]:
partition_names = partitioned_json.map(lambda t: t[0]).distinct().collect()

In [None]:
partitioned_json.countByKey()

In [None]:
partition_names
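For intuition, here is what the distinct-keys and count-by-key steps compute, sketched on a small local list of pairs. This is a plain-Python analogy with made-up values, not the Spark implementation.

```python
from collections import Counter

# Made-up (key, value) pairs standing in for the partitioned RDD.
pairs = [("page_view", 1), ("equiv", 2), ("page_view", 3)]

# Count how many records fall under each key, ignoring the values.
counts = Counter(key for key, _ in pairs)

# The distinct keys are the partition names.
names = sorted(counts)

print(dict(counts))
print(names)
```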

In [None]:
# TODO: need to gracefully handle when dir/file already exists

for p in partition_names:
    path = "../SuperWebAnalytics/master/{}".format(p)
    if os.path.exists(path):
        print "{} exists".format(path)
    else:
        partitioned_json.filter(lambda t: t[0] == p).values().saveAsPickleFile(path)
#         #  line below does avro:
#         partitioned_json.filter(lambda t: t[0] == p).values().mapPartitions(avro_writer).saveAsTextFile(path)
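One possible way to address the TODO is sketched below. `prepare_output_path` is a hypothetical helper, and it only handles local paths, matching the `os.path.exists` check in the loop. It reports whether the path is free to write to, optionally deleting what is already there.

```python
import os
import shutil

def prepare_output_path(path, overwrite=False):
    """Return True if `path` is free to write to, clearing it first if asked."""
    if not os.path.exists(path):
        return True
    if not overwrite:
        return False
    # Spark saves create a directory, but handle a plain file as well.
    if os.path.isdir(path):
        shutil.rmtree(path)
    else:
        os.remove(path)
    return True
```

In the loop, the save call would then run only when `prepare_output_path(path)` returns True. Deleting existing output is destructive, so `overwrite` defaults to False.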

In [None]:
!tree *_property