# SageMaker Donkey introduction

This first tutorial will introduce you to the SageMaker service and its [Jupyter Notebooks](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-notebooks-instances.html).

More specifically, we'll have a look at the data format used by the [Donkey](https://github.com/wroscoe/donkey) library when saving training data. The data is generated when you manually drive the car on the track. We'll also play around with some of the more common libraries and data structures available in the Notebooks, such as [pandas](https://pandas.pydata.org/).

## Download sample data

We've created some sample data for you to start working on, so that you don't have to wait for your car to be ready. Since the sample data was recorded with another car on another track, it might not be representative of your car. However, it will allow you to get started, and it will provide a good foundation for you to continue training once you get data from your own car.

Download the sample driving runs, called *Tubs* in Donkey:

In [None]:
from sagemaker import get_execution_role

# Bucket location to get training data
sample_data_location = 's3://jayway-robocar-raw-data/samples'

# IAM execution role that gives SageMaker access to resources in your AWS account.
role = get_execution_role()

In [None]:
role

In [None]:
!aws s3 cp {sample_data_location}/ore.zip .

In [None]:
!unzip -o ore.zip

In [None]:
!cat tub_8_18-02-09/record_3658.json | jq

## Inspect and mangle Donkey data

In this section, we parse and manipulate the data generated by the [Donkey](https://github.com/wroscoe/donkey/tree/master/donkeycar) library to get familiar with the format.

The default configuration will save the captured data in a directory called a *Tub*. A new *Tub* will be created every time a new *drive* session starts.
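
Since every drive session produces its own *Tub*, you will soon have several of them lying around. The following is a minimal sketch for listing them in session order. It assumes that the directories are named `tub_<number>_<date>`, like the sample Tub used in this notebook; `list_tubs` is a hypothetical helper and not part of the Donkey library.

```python
import glob
import os

def list_tubs(root='.'):
    '''Return the Tub directories found under root, ordered by session number.

    Assumes names like tub_<number>_<date> (an assumption based on the sample data).
    '''
    def session_number(path):
        # 'tub_8_18-02-09' -> 8
        return int(os.path.basename(path).split('_')[1])

    tubs = [p for p in glob.glob(os.path.join(root, 'tub_*')) if os.path.isdir(p)]
    return sorted(tubs, key=session_number)
```

Calling `list_tubs('.')` in the notebook directory should return the unzipped sample Tub, provided the naming assumption holds.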

A *Tub* directory contains *records* in JSON format, *images* in JPG format and a metadata file, *meta.json*, which specifies the format of the *records*. The default *record* has the following JSON structure:

```json
{
   "user/angle": 1.0,
   "cam/image_array": "3658_cam-image_array_.jpg",
   "user/mode": "user",
   "user/throttle": 0.23455000457777642
}
```
A short description of the properties:
- *user/angle* - wheel angle
- *user/throttle* - speed
- *user/mode* - drive mode (user, local angle, or pilot)
- *cam/image_array* - relative reference to image
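
To get a feel for the properties listed above, here is a small sketch that parses the example record with the standard `json` module. The record text is copied from the structure shown earlier; the variable names are only illustrative.

```python
import json

# The example record from above, as a string
raw = '''
{
   "user/angle": 1.0,
   "cam/image_array": "3658_cam-image_array_.jpg",
   "user/mode": "user",
   "user/throttle": 0.23455000457777642
}
'''

record = json.loads(raw)

# The keys contain a slash, so they are read with item access
angle = record['user/angle']
throttle = record['user/throttle']
image_name = record['cam/image_array']

print(angle, throttle, image_name)
```

Note that the image itself is not part of the record. The record only holds a file name that is resolved relative to the Tub directory.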

The metadata file specifies the types of the properties in the record files:

In [None]:
!cat tub_8_18-02-09/meta.json | jq
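
The metadata can also be read from Python. The sketch below assumes that *meta.json* holds two parallel lists, `inputs` and `types`; compare with the output of the cell above, since the content used here is a hypothetical stand-in and your file may differ. `property_types` is an illustrative helper, not a Donkey function.

```python
import json

# Hypothetical meta.json content; check it against the real file in your Tub
meta_text = '''
{
  "inputs": ["cam/image_array", "user/angle", "user/throttle", "user/mode"],
  "types": ["image_array", "float", "float", "str"]
}
'''

def property_types(meta):
    '''Pair each property name with its declared type'''
    return dict(zip(meta['inputs'], meta['types']))

types = property_types(json.loads(meta_text))
for name, type_name in types.items():
    print('{:<16} {}'.format(name, type_name))
```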

### Parse data

Next, parse the input files into a more suitable format. This snippet will return a list of records, where each record is a dictionary with *record_id*, *angle*, *throttle* and *image*.

Take your time to read through the code. There are a few very common libraries introduced in this section, e.g.:
* `pandas` - Data structures and analysis tools
* `PIL` - The Python Imaging Library. Nice when working with images.

In [None]:
import os
import glob
import pandas as pd
from PIL import Image

def read_tub(path):
    '''
    Read a Tub directory into memory
    
    A Tub contains records in json format, one file for each sample. With a default sample frequency of 20 Hz,
    a 5 minute drive session will contain roughly 6000 files.
    
    A record JSON object has the following properties (per default):
    - 'user/angle'      - wheel angle
    - 'user/throttle'   - speed
    - 'user/mode'       - drive mode (e.g. user or pilot)
    - 'cam/image_array' - relative path to image
    
    Returns a list of dicts, [ { 'record_id', 'angle', 'throttle', 'image', } ]
    '''

    def as_record(file):
        '''Parse a json file into a Pandas Series (vector) object'''
        return pd.read_json(file, typ='series')
    
    def is_valid(record):
        '''Only records with angle, throttle and image are valid'''
        required = ('user/angle', 'user/throttle', 'cam/image_array')
        return all(key in record.index for key in required)
        
    def map_record(file, record):
        '''Map a Tub record to a dict'''
        # Force library to eager load the image and close the file pointer to prevent 'too many open files' error
        img = Image.open(os.path.join(path, record['cam/image_array']))
        img.load()
        # Strip directory and 'record_' from file name, and parse it to integer to get a good id
        record_id = int(os.path.splitext(os.path.basename(file))[0][len('record_'):])
        return {
            'record_id': record_id,
            'angle': record['user/angle'],
            'throttle': record['user/throttle'],
            'image': img
        }
    
    json_files = glob.glob(os.path.join(path, '*.json'))
    records = ((file, as_record(file)) for file in json_files)
    return list(map_record(file, record) for (file, record) in records if is_valid(record))

In [None]:
%%time
records = read_tub('tub_8_18-02-09')
print('parsed Tub into {} records'.format(len(records)))
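
The record id is the only thing that tells us in which order the samples were captured, because `glob` returns the files in arbitrary order. Here is the id parsing from `read_tub` in isolation, as a small sketch; `record_id_from_path` is just an illustrative name for the same expression.

```python
import os

def record_id_from_path(file):
    '''Turn e.g. "tub_8_18-02-09/record_3658.json" into the integer 3658'''
    name = os.path.splitext(os.path.basename(file))[0]   # 'record_3658'
    return int(name[len('record_'):])                    # 3658

print(record_id_from_path('tub_8_18-02-09/record_3658.json'))
```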

### Inspect

Inspect one of the parsed records:

In [None]:
print(records[100])
records[100]['image']

### Create a matrix

Looks legit. Let's merge all the vectors into a matrix with the following format:

| record_id    | angle | throttle | image    |
| ------------ | ----- | -------- | -------- |
| 1            |   0.1 |   0.3    | PIL...   |         
| ...          |       |          |          |
| n            |       |          |          |

In [None]:
df = pd.DataFrame.from_records(records).set_index('record_id') # Use record_id as index
df.sort_index(inplace=True)                                    # Do not create a new copy when sorting
pd.set_option('display.max_columns', 10)                       # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)                          # Keep the output on one page
df
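
If the chain of calls above feels dense, the same steps can be tried on a tiny hand-made data set. This is only a sketch: the values in `toy_records` are made up, and the image column is left out to keep it short.

```python
import pandas as pd

# Made-up records, deliberately out of order
toy_records = [
    {'record_id': 3, 'angle': -0.5, 'throttle': 0.20},
    {'record_id': 1, 'angle': 0.1, 'throttle': 0.30},
    {'record_id': 2, 'angle': 0.0, 'throttle': 0.25},
]

# One row per dict, record_id as the row label, rows sorted by that label
toy = pd.DataFrame.from_records(toy_records).set_index('record_id').sort_index()
print(toy)
```

Here `sort_index()` returns a new frame, whereas the cell above sorts `df` in place with `inplace=True`. The resulting order is the same.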

### The Pandas DataFrame

Finally, let's look at some of the properties of the [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/api.html#dataframe) object:

In [None]:
# Displays the top 5 rows (i.e. not top 5 elements based on label index)
df.head()

In [None]:
# Similar to head, but displays the last rows
df.tail()

In [None]:
# The dimensions of the dataframe as a (rows, cols) tuple
df.shape

In [None]:
# The number of rows. Equal to df.shape[0]
len(df) 

In [None]:
# An array of the column names
df.columns 

In [None]:
# Columns and their types
df.dtypes

In [None]:
# Axes
df.axes

In [None]:
# The data of the frame as a two-dimensional NumPy array
df.values 

In [None]:
# Displays descriptive stats for all columns
df.describe()
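
One thing worth noticing is how `dtypes` and `describe()` treat a column that holds Python objects, such as our images. The sketch below uses a made-up frame where plain `object()` instances stand in for the PIL images.

```python
import pandas as pd

toy = pd.DataFrame({
    'angle': [0.1, 0.0, -0.5],
    'throttle': [0.30, 0.25, 0.20],
    'image': [object(), object(), object()],   # stand-ins for PIL images
})

# The numeric columns are float64, the stand-in image column is object
print(toy.dtypes)

# For a frame with mixed types, describe() summarizes the numeric columns only
stats = toy.describe()
print(list(stats.columns))
```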

In [None]:
# Selecting one row by its label (record_id) returns a pandas.Series object
df.loc[1]

In [None]:
# Selecting a range of labels returns a pandas.DataFrame object (the end label is included)
df.loc[1:5]
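
Selecting with `loc` works on labels, not on positions, which is easy to mix up when the labels happen to be integers. A small sketch on a made-up frame, where the labels 10 to 40 are chosen so that they cannot be confused with positions:

```python
import pandas as pd

toy = pd.DataFrame(
    {'angle': [0.1, 0.0, -0.5, 0.3], 'throttle': [0.30, 0.25, 0.20, 0.22]},
    index=pd.Index([10, 20, 30, 40], name='record_id'),
)

by_label = toy.loc[10:30]      # labels 10, 20 and 30: the end label is included
by_position = toy.iloc[0:2]    # positions 0 and 1: the end position is excluded
single = toy.loc[10]           # one row, returned as a Series

print(len(by_label), len(by_position), type(single).__name__)
```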

## Visualizing data

Let's see if we can make the data a little more visual.

In [None]:
# Plot throttle only
%matplotlib inline

df.plot.line(y='throttle')

In [None]:
# Plot throttle per record next to the density of the steering angle
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=2)
df.plot(ax=axes[0], kind='line', y='throttle', color='orange')
df.plot(ax=axes[1], kind='density', y='angle', color='red')
plt.show()

Nice.

We can see that throttle seems to be limited to 0.25 (see the [Donkey configuration file](https://github.com/wroscoe/donkey/blob/master/donkeycar/templates/config_defaults.py) for an explanation of that).

The density plot also shows that, to some extent, the car turns more in one direction than in the other.
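
The same observation can be put into numbers. The sketch below is one possible way to do it, not an established metric: `steering_summary` and its `dead_zone` threshold are hypothetical, and the angles in the example are made up.

```python
import pandas as pd

def steering_summary(angles, dead_zone=0.05):
    '''Summarize how often a run steers to each side.

    dead_zone is a hypothetical threshold: angles closer to zero than this count as straight.
    '''
    angles = pd.Series(angles, dtype=float)
    return {
        'mean': angles.mean(),
        'negative_share': (angles < -dead_zone).mean(),   # share of samples steering one way
        'positive_share': (angles > dead_zone).mean(),    # share of samples steering the other way
    }

summary = steering_summary([-1.0, -0.5, -0.2, 0.0, 0.4])
print(summary)
```

On the real data you could call `steering_summary(df['angle'])`, and `df['throttle'].max()` gives the throttle ceiling seen in the line plot.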

### Images

Let's also have a quick look at the images in the data set. One way is to create a video of all the images:

In [None]:
%%time
import numpy
from cv2 import VideoWriter, VideoWriter_fourcc, cvtColor, COLOR_RGB2BGR
from contextlib import contextmanager

@contextmanager
def VideoCreator(*args, **kwargs):
    v = VideoWriter(*args, **kwargs)
    try:
        yield v
    finally:
        v.release()

def make_video(images, out='donkey-run.mp4', fps=20):
    '''
    Creates a video from PIL images
    '''
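    # Materialize the input so positional indexing works for lists and pandas Series alike
    images = list(images)
    if len(images) == 0:
        raise ValueError('Images array must not be empty')
    
    # Extract size from first image
    size = images[0].size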
    
    # Create codec
    fourcc = VideoWriter_fourcc(*'H264')
    
    # Create a VideoCreator and return the new video
    with VideoCreator(out, fourcc, float(fps), size) as v:
        for img in images:
            arr = cvtColor(numpy.array(img), COLOR_RGB2BGR)
            v.write(arr)

    return out

video_file = os.path.join('~/SageMaker', make_video(df['image']))
print(video_file)
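
Two details in `make_video` are easy to get wrong: the channel order and the frame size. OpenCV works with BGR channel order, and the video size is given as (width, height), while a NumPy image is shaped (height, width, channels). Here is a NumPy-only sketch of both, using a made-up 2 by 3 pixel frame; for a plain three-channel array, reversing the last axis gives the same channel swap as `COLOR_RGB2BGR`.

```python
import numpy

# A made-up frame, 2 pixels high and 3 pixels wide, filled with pure red in RGB order
frame = numpy.zeros((2, 3, 3), dtype=numpy.uint8)
frame[..., 0] = 255

# Reversing the channel axis turns RGB into BGR
bgr = frame[..., ::-1]

# The array shape is (height, width, channels); a video size is (width, height)
height, width = frame.shape[:2]
size = (width, height)

print(bgr[0, 0].tolist(), size)
```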

Sadly, Jupyter notebooks do not currently support HTML5 video inline (v.5.0.0). You'll have to open it in a new tab:

[Run video](./donkey-run.mp4)
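
If the video does not play, for example because the codec is missing on your instance, a still overview can serve as a rough substitute. This is a minimal sketch using only `PIL`; `contact_sheet`, `columns` and `thumb_size` are hypothetical names, and the frames in the example are generated rather than taken from a Tub.

```python
from PIL import Image

def contact_sheet(images, columns=4, thumb_size=(80, 60)):
    '''Paste every image, scaled to thumb_size, into one grid image'''
    images = list(images)
    if not images:
        raise ValueError('Images must not be empty')

    rows = (len(images) + columns - 1) // columns   # round up
    width, height = thumb_size
    sheet = Image.new('RGB', (columns * width, rows * height))

    for i, img in enumerate(images):
        thumb = img.convert('RGB').resize(thumb_size)
        # Fill the grid row by row, left to right
        sheet.paste(thumb, ((i % columns) * width, (i // columns) * height))
    return sheet

# Five generated frames in increasingly bright red
frames = [Image.new('RGB', (160, 120), (i * 40, 0, 0)) for i in range(5)]
sheet = contact_sheet(frames)
print(sheet.size)
```

In the notebook, something like `contact_sheet(df['image'].iloc[::200])` would show every 200th frame of the run in one image.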

## Next

[Donkey library tools](./donkey-tools.ipynb)