# DAY 3 - Data Exploration and Visualization

Exploratory data analysis and visualization are important steps in the data analytics pipeline. Cleaning and exploring the dataset is often an iterative process that precedes developing the models and algorithms used to extract knowledge and insight. This notebook lets you combine what you have learned about Spark with producing graphical content that makes the information easier to perceive.

## The Dataset

The dataset is rating data from the [MovieLens](https://movielens.org/) website. Information about the data is available [here](https://s3-eu-west-1.amazonaws.com/orvarsbucket/ml-latest/README.txt).

## Loading Data From S3

[S3](http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html) is a scalable storage service provided by AWS. Two datasets have been uploaded to a bucket and can be fetched with the provided access keys. The smaller one can be used for testing purposes.

An external library called [spark-csv](https://github.com/databricks/spark-csv) can be used to parse CSV files for Spark.
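
As a small local illustration of why a real CSV parser is preferable to splitting on commas, the sketch below parses a few rows in the `movies.csv` layout with Python's standard `csv` module. The rows are illustrative samples written for this example (not read from the bucket), and `parse_movies` is a hypothetical helper, not part of spark-csv.

```python
import csv

# Illustrative rows in the movies.csv layout (movieId,title,genres).
# The second title contains a comma, so the field is quoted.
SAMPLE_LINES = [
    'movieId,title,genres',
    '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy',
    '11,"American President, The (1995)",Comedy|Drama|Romance',
]


def parse_movies(lines):
    """Parse CSV lines with a header row into a list of dicts."""
    reader = csv.reader(lines)
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]


movies = parse_movies(SAMPLE_LINES)

# A naive split breaks the quoted title into two pieces.
naive_fields = SAMPLE_LINES[2].split(',')

print(movies[1]['title'])
print(len(naive_fields))
```

The parser keeps the quoted title as one field, while the naive split produces one field more than the header declares.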

**Exercise:** Create Dataframes for the datasets and perform simple counts.

In [None]:
# Set access keys for S3 bucket.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", accessKeyId)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secretAccessKey)

# Paths. Change PATH_DATASET to ml-latest/ to get the larger dataset.
PATH_BUCKET = 's3n://orvarsbucket/'
PATH_DATASET = 'ml-latest-small/'

In [None]:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Create Spark SQL context.
sql_context = SQLContext(sc)

# Read links.csv
filename = 'links.csv'
links_schema = StructType([ \
    StructField("movieId", StringType(), True), \
    StructField("imdbId", StringType(), True), \
    StructField("tmdbId", StringType(), True), \
])

links_df = sql_context.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .load(PATH_BUCKET + PATH_DATASET + filename, schema=links_schema)

links_df.cache()
print('Loaded {} entries from {}\n'.format(links_df.count(), filename))
    
# Read movies.csv
filename = 'movies.csv'
movies_schema = StructType([ \
    StructField("movieId", StringType(), True), \
    StructField("title", StringType(), True), \
    StructField("genres", StringType(), True), \
])

movies_df = sql_context.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .load(PATH_BUCKET + PATH_DATASET + filename, schema=movies_schema)

movies_df.cache()
print('Loaded {} entries from {}\n'.format(movies_df.count(), filename))
    
# Read ratings.csv
filename = 'ratings.csv'
ratings_schema = StructType([ \
    StructField("userId", StringType(), True), \
    StructField("movieId", StringType(), True), \
    StructField("rating", FloatType(), True), \
    StructField("timestamp", IntegerType(), True), \
])

ratings_df = sql_context.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .load(PATH_BUCKET + PATH_DATASET + filename, schema=ratings_schema)

ratings_df.cache()
print('Loaded {} entries from {}\n'.format(ratings_df.count(), filename))
    
# Read tags.csv
filename = 'tags.csv'
tags_schema = StructType([ \
    StructField("userId", StringType(), True), \
    StructField("movieId", StringType(), True), \
    StructField("tag", StringType(), True), \
    StructField("timestamp", IntegerType(), True), \
])

tags_df = sql_context.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .load(PATH_BUCKET + PATH_DATASET + filename, schema=tags_schema)

tags_df.cache()
print('Loaded {} entries from {}\n'.format(tags_df.count(), filename))

## Introduction to Matplotlib

[Matplotlib](http://matplotlib.org/gallery.html) is an excellent graphics library in Python for generating simple visualizations. Follow the simple example below to understand how it works.

In [None]:
# Aggregate data by using Spark
ratings_gby_df = ratings_df.withColumn("year", year(from_unixtime(ratings_df["timestamp"]))) \
    .groupBy("year") \
    .count()
    
# Collect the (year, count) rows once, sorted by year, so both lists stay aligned
rows = ratings_gby_df.select("year", "count").orderBy("year").collect()
data_x = [row[0] for row in rows]
data_y = [row[1] for row in rows]
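
To make the aggregation concrete, here is a minimal local sketch of the same idea in plain Python: convert each Unix timestamp to a year and count the occurrences. The timestamps are made up for illustration, the names `ratings_per_year`, `data_x_local` and `data_y_local` are hypothetical, and the sketch assumes UTC, whereas Spark's `from_unixtime` uses the cluster's time zone.

```python
import time
from collections import Counter

# Made-up Unix timestamps (seconds since 1970-01-01 UTC), for illustration only.
sample_timestamps = [
    946684800, 946684801,                # two ratings in 2000
    978307200,                           # one rating in 2001
    1009843200, 1009843201, 1009843202,  # three ratings in 2002
]


def ratings_per_year(timestamps):
    """Count timestamps per calendar year (UTC)."""
    return Counter(time.gmtime(ts).tm_year for ts in timestamps)


counts = ratings_per_year(sample_timestamps)

# Sort by year so the x-axis is ordered, then split into two parallel lists.
data_x_local = sorted(counts)
data_y_local = [counts[year] for year in data_x_local]

print(data_x_local)
print(data_y_local)
```

Two parallel lists like these are the shape that the bar chart call expects for its x and y values.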

In [None]:
# This line configures matplotlib to show figures embedded in the notebook.
%matplotlib inline

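# Import the pyplot interface explicitly. A star import from pylab would
# shadow names such as sum, mean and max imported from pyspark.sql.functions.
import matplotlib.pyplot as plt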

# Create a new matplotlib figure with a specific size.
fig = plt.figure(figsize=(12, 9))

# Add subplot in figure and retrieve axes. "111" means "1x1 grid, first subplot"
ax = fig.add_subplot(111)

# Remove plot frame lines
ax.spines["top"].set_visible(False)       
ax.spines["right"].set_visible(False)

# Only add ticks at bottom/left
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()

# Edit ticks
plt.xticks(data_x, fontsize=14, rotation='vertical')
plt.yticks(fontsize=14)

# Insert labels
plt.xlabel("Year", fontsize=16)  
plt.ylabel("Count", fontsize=16)

# Insert title
plt.title("Number of ratings per year", fontsize=18)

# Create barchart
ax.bar(
    data_x,
    data_y,
    align="center",
    color="#3F5D7D",
    width=1.0,
    alpha=1.0
)

plt.show()

<font color='red'>**EXERCISE:**</font> Explore the dataset through visualizations. Pick a couple of aggregations that could be interesting to show and display them in a suitable format.

Examples of visualizations that can be performed:

* Compare the average ratings for each genre over time.
* Show in which months, days of the week and/or hours users are most active.
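
As a starting point for the genre comparison, note that the `genres` column holds a pipe-separated list, so a movie can belong to several genres. The sketch below shows the idea on made-up rows in plain Python: each rating is counted once for every genre of the movie. The function name `average_rating_per_genre` and the sample values are assumptions for illustration. In Spark, `split` and `explode` from `pyspark.sql.functions` can play a similar role.

```python
from collections import defaultdict

# Made-up (genres, rating) pairs in the MovieLens layout, for illustration only.
sample_rows = [
    ("Adventure|Comedy", 4.0),
    ("Comedy", 3.0),
    ("Drama", 5.0),
    ("Adventure|Drama", 2.0),
]


def average_rating_per_genre(rows):
    """Return {genre: mean rating}, counting a rating once per listed genre."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for genres, rating in rows:
        for genre in genres.split("|"):
            totals[genre] += rating
            counts[genre] += 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


genre_means = average_rating_per_genre(sample_rows)

for genre in sorted(genre_means):
    print("{}: {:.2f}".format(genre, genre_means[genre]))
```

Adding the rating year as a second grouping key would extend this to the "over time" comparison.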

Think about which type of visualization suits your subset of the data. See, for example, [this guide](http://blog.hubspot.com/marketing/data-visualization-choosing-chart#sm.00008fchn9p1dfecu9x2chwb2h5qi) and [this chart](http://img.labnol.org/di/data-chart-type.png).

Use [Colorbrewer](http://colorbrewer2.org/) when choosing a color palette.

Check out [D3.js](https://github.com/d3/d3/wiki/Gallery) for inspiration.

*Tip: Remember the [data-to-ink ratio](http://static1.squarespace.com/static/56713bf4dc5cb41142f28d1f/t/5671eae2816924fc2265189a/1454121618204/data-ink.gif?format=750w) when producing visualizations!*

In [None]:
# ...

## Interactive Visualizations

Sometimes when visualizing data, one wants interactivity built into the visuals rather than just static images, because interaction adds another dimension of perception.

Several Python libraries target web browsers to produce more interactive plots. They are also often very high-level, which makes them ideal for rapidly developing prototypes. One such tool is [Bokeh](http://bokeh.pydata.org/).

<font color='red'>**EXERCISE:**</font> Use an interactive visualization tool, e.g. Bokeh, to find out whether movies with higher ratings also have more ratings. Do this by producing a scatter plot (see the Iris example further down) with, for example, the number of ratings on one axis and the average rating on the other. To make it more interactive, you can add functionality such as displaying the title in a tooltip or coloring by genre.
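
The data behind such a scatter plot is one point per movie: the number of ratings and the average rating. The sketch below computes these on made-up pairs in plain Python, only to show the shape of the result. The helper `rating_stats` and the sample values are hypothetical. In the notebook, the same figures would more likely come from a Spark `groupBy("movieId")` with count and average aggregates.

```python
from collections import defaultdict

# Made-up (movieId, rating) pairs, for illustration only.
sample_ratings = [
    ("1", 4.0), ("1", 5.0), ("1", 3.0),
    ("2", 2.0), ("2", 4.0),
    ("3", 5.0),
]


def rating_stats(ratings):
    """Return {movieId: (number of ratings, average rating)}."""
    grouped = defaultdict(list)
    for movie_id, rating in ratings:
        grouped[movie_id].append(rating)
    return {m: (len(r), sum(r) / len(r)) for m, r in grouped.items()}


stats = rating_stats(sample_ratings)

# One point per movie: x = number of ratings, y = average rating.
points = sorted(stats.values())
print(points)
```

Each tuple in `points` corresponds to one marker in the scatter plot.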

In [None]:
# Aggregate data by using Spark
# ...

In [None]:
from bokeh.io import output_notebook, show
from bokeh.resources import INLINE

# Load Bokeh and embed its JavaScript and CSS inline in the notebook
output_notebook(resources=INLINE)

In [None]:
from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.models import HoverTool
from bokeh.sampledata.iris import flowers
flowers.head()

# Define data source
df = flowers

# Create figure with size and tools to use
fig = figure(
    plot_width=800,
    plot_height=600,
    tools=["wheel_zoom", "reset"]
)

# Define colors
def spec_color(species):
    if species == "virginica":
        return "#de2d26"
    elif species == "versicolor":
        return "#2b8cbe"
    else:
        return "#31a354"

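colors = [spec_color(s) for s in df["species"]]

# Wrap the DataFrame in a data source and attach the colors as an extra column
source = ColumnDataSource(data=df)
source.add(colors, name="color")

# Define plot and data to use for axes; columns are referenced by name
fig.scatter(
    "petal_length",
    "petal_width",
    source=source,
    size=10,
    alpha=0.5,
    color="color"
)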

# Define axis labels
fig.xaxis.axis_label = "Petal Length"
fig.yaxis.axis_label = "Petal Width"

# Create tooltips for markers to display extra information
hover = HoverTool()
hover.tooltips = [
    ("Sepal Length", "@sepal_length"),
    ("Sepal Width", "@sepal_width")
]

# Add tools to figure
fig.add_tools(hover)

# Show figure
show(fig)

In [None]:
# ...