![](http://pbpython.com/images/sns_header.png)

# Visual Data Literacy - 02 - Statistical Plotting

In [None]:
# %load utils/imports.py
import numpy as np
import pandas as pd

from utils import *
from utils.styles import *
from utils.plotting import *
from utils.demo import *

## 04 Challenge - Perform EDA on your own datasets

Each block has instructions on which steps to follow in your exploration. Instructions are either `CODE` or `ANSWER` blocks. Please answer the former in code and the latter in writing. Bullet points are fine, as long as you reveal your thought processes as much as the outcomes they produce.

You've been split up into 5 'areas of interest' in the dataset. Each area of interest refers to a set of attributes from the `Summary` table which you will be looking at for this assignment.

* Group A (personal data) : Jeffrey & Terrence 
* Group B (phone data) : Sparrow
* Group C (sms logs data) : Stanley
* Group D (3rd party data) : Noah
* Group E (derived data) : Jazz

There are also 47 columns which are available to _everyone_; these are marked as 'all' in the CSV.

These are individual assignments, so we expect everyone to attempt them alone first. Of course, ask your teammates and us for help with ideas or syntax, or if you get stuck!

The assignment is due on **Thursday, 4 August**. I will review it on Friday and get the feedback to you on Monday.


The columns for your consideration are (fill in the `your_group` variable):

In [None]:
your_group = 'A' # your assigned group i.e. {A,B,C,D,E}

In [None]:
pd.set_option('display.max_rows',120)
df = pd.read_csv('groupings.csv')
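# `.ix` was removed from pandas; `.iloc` does the same positional lookup.
# The groups sit two columns apart, so group A is column 0, B is column 2, and so on.
df.iloc[:, "ABCDE".index(your_group) * 2]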

## Warming Up!

The `CSV` that we gave you isn't in the _tidy_ format.

In [None]:
# ANSWER : What is wrong with it, and how could you fix it?

In [None]:
# CODE : Convert groupings.csv into a tidy data format

## Data Wrangling

Data Wrangling is the process of cleaning and unifying messy and complex data sets. This process includes manually converting/mapping data from one raw form into another format to allow for more convenient consumption and organization of the data. In the following steps you'll be wrangling your data into a usable format.

### Data Loading

In [None]:
# CODE : Load an extract of the data you are going to be analysing

In [None]:
# ANSWER : What rationale did you use when deciding the size of your extract?

### From raw data to technically correct data

By technically correct data we mean a data set in which each value:

1. can be directly recognized as belonging to a certain variable;
2. is stored in a data type that represents the value domain of the real-world variable.

In other words, for each unit, a text variable should be stored as text, a numeric variable as a number, and so on, and all this in a format that is consistent across the data set.
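As a rough illustration of the idea above, here is a minimal sketch on a made-up three-row extract. The column names (`age`, `signup_date`, `gender`) and their values are assumptions for the example only and are not taken from the assignment dataset.

```python
import pandas as pd

# Hypothetical raw extract: every column arrives as text.
raw = pd.DataFrame({
    "age": ["34", "27", "unknown"],
    "signup_date": ["2016-01-05", "2016-02-17", "2016-03-02"],
    "gender": ["F", "M", "F"],
})

typed = raw.assign(
    # Entries that cannot be parsed as numbers become NaN instead of raising.
    age=pd.to_numeric(raw["age"], errors="coerce"),
    # Dates are stored as datetimes rather than strings.
    signup_date=pd.to_datetime(raw["signup_date"], format="%Y-%m-%d"),
    # A variable with a small fixed set of values is stored as a category.
    gender=raw["gender"].astype("category"),
)

print(typed.dtypes)
```

Note that coercion only makes the data technically correct: the `NaN` it produces for `"unknown"` still has to be dealt with when making the data consistent.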

In [None]:
# CODE : Transform all attributes in your dataset, from raw data to technically correct data.

In [None]:
# CODE : Is there any string normalisation that needs to be applied to your attributes?

### From technically correct data to consistent data

Consistent data are technically correct data that are fit for statistical analysis. They are data in
which missing values, special values, (obvious) errors and outliers are either removed, corrected
or imputed. The data are consistent with constraints based on real-world knowledge about the
subject that the data describe.

The process towards consistent data always involves the following three steps:
1. Detection of an inconsistency. That is, one establishes which constraints are violated. For example, an age variable is constrained to non-negative values.
2. Selection of the field or fields causing the inconsistency. This is trivial for a univariate constraint such as the one in the previous step, but may be more cumbersome when cross-variable relations are expected to hold. For example, a child's marital status must be unmarried. In the case of a violation it is not immediately clear whether age, marital status or both are wrong.
3. Correction of the fields that are deemed erroneous by the selection method. This may be done through deterministic (model-based) or stochastic methods.
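The three steps above can be sketched with boolean masks in pandas. The records, the column names and the age threshold of 15 are all assumptions made for this illustration; the right constraints for your own attributes depend on what you know about them.

```python
import pandas as pd

# Hypothetical records, for illustration only.
people = pd.DataFrame({
    "age": [34, -2, 9, 41],
    "marital_status": ["married", "single", "married", "single"],
})

# Step 1: detect which constraints are violated.
negative_age = people["age"] < 0
married_child = people["age"].between(0, 15) & (people["marital_status"] == "married")

# Step 2: select the suspect fields. The univariate rule points at `age` alone;
# the cross-variable rule only tells us that `age`, `marital_status` or both are wrong.
violations = people[negative_age | married_child]

# Step 3: one possible deterministic correction, replacing impossible ages with NaN.
cleaned = people.assign(age=people["age"].where(~negative_age))

print(violations)
print(cleaned)
```

The married-child row is deliberately left untouched in `cleaned`, since the rule alone does not say which of the two fields to correct.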

In [None]:
# CODE : Transform all attributes in your dataset, from technically correct to consistent data

In [None]:
# CODE : Use visualisation to validate your transformations in the previous step

In [None]:
# ANSWER : For which of the attributes did you not have enough information to decide how to make it consistent?
#          Could you think of ways in which triangulating with other attributes could help?

## Seaborn

Seaborn is an excellent resource for common regression and distribution plots, but where Seaborn really shines is in its ability to visualize many different features at once. Using Seaborn's `factorplot` (renamed `catplot` in seaborn 0.9), `pairplot`, and `JointGrid`, explore your dataset and develop the following visualisations.

### Visualizing Distribution with Seaborn

When dealing with a set of data, often the first thing you'll want to do is get a sense for how the variables are distributed.

In [None]:
# CODE : Plot the univariate distributions of the attributes you are interested in exploring. Make sure that the
#        graphs are readable, i.e. they shouldn't be so small, dense or cluttered that you can't do any analysis
#        on them.

In [None]:
# ANSWER : Seeing the distributions of your datasets, what are 5 characteristics or phenomena that you weren't
#          expecting to see and would possibly need to take a closer look at?

In [None]:
# CODE : Select the attributes which are of interest to you and plot the pairwise relationships between them.

In [None]:
# ANSWER : What are the three most important take-aways?

### Visualizing linear relationships

Your datasets contain multiple quantitative variables, and the goal of our analysis is to relate those variables to each other. It can be very helpful to use statistical models to estimate a simple relationship between two noisy sets of observations. You'll be using linear regression here to explore those relationships.
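To make "estimating a simple relationship between two noisy sets of observations" concrete, here is a minimal sketch using NumPy alone on synthetic data. The true slope of 2 and intercept of 1 are assumptions chosen for the example; an ordinary least-squares line like this one is the kind of fit that Seaborn's regression plots draw.

```python
import numpy as np

# Synthetic noisy observations around the line y = 2x + 1 (illustrative values).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Ordinary least-squares fit of a straight line (degree-1 polynomial).
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(x, y)[0, 1]

# Residuals: what the fitted line does not explain.
residuals = y - (slope * x + intercept)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

If the residuals show structure rather than looking like random scatter, a straight line is probably not the whole story.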

In [None]:
# ANSWER : What's the difference between a `regplot` and an `lmplot`? When do you use which one?

Based on the pair plot you've run in the previous step, select one attribute which you want to treat as the target variable for your analysis here.

In [None]:
# CODE : Find out how strongly each of your most promising attributes correlates with your chosen attribute.
#        Try at least a dozen from the shared attributes, and a dozen from your area of interest.

In [None]:
# ANSWER : What can you learn by looking at these graphs?

In [None]:
# CODE : Explore the residuals of your selected attributes.

In [None]:
# ANSWER : Which of the residual plots suggest that there is a 'hidden parameter' which you haven't accounted for?

In [None]:
# CODE : Explore what happens when you use higher-order polynomials.

In [None]:
# ANSWER : What can you learn by looking at these graphs?

### Plotting with categorical data

It's useful to divide seaborn's categorical plots into three groups: those that **show each observation at each level** of the categorical variable, those that **show an abstract representation** of each distribution of observations, and those that apply a statistical estimation to **show a measure of central tendency** and confidence interval.
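One function from each of the three groups, drawn side by side on the same toy data, may make the distinction easier to see. This is only a sketch: the `platform` and `sms_count` columns and their values are invented for the example, and `stripplot`, `boxplot` and `barplot` are just one representative of each group.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data, for illustration only.
toy = pd.DataFrame({
    "platform": ["android"] * 4 + ["ios"] * 4,
    "sms_count": [3, 5, 4, 12, 7, 8, 6, 9],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)

# Group 1: every observation at each level of the categorical variable.
sns.stripplot(data=toy, x="platform", y="sms_count", ax=axes[0])
# Group 2: an abstract representation of each distribution.
sns.boxplot(data=toy, x="platform", y="sms_count", ax=axes[1])
# Group 3: a measure of central tendency with a confidence interval.
sns.barplot(data=toy, x="platform", y="sms_count", ax=axes[2])

titles = ["each observation", "abstract distribution", "central tendency"]
for ax, title in zip(axes, titles):
    ax.set_title(title)
fig.tight_layout()
```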

In [None]:
# ANSWER : Since there are these three different types of plots - what are the advantages / disadvantages of each?

In [None]:
# CODE : Filter out your categorical attributes from your dataframe.

In [None]:
# CODE : Explore those categorical attributes in scatterplots.

In [None]:
# ANSWER : What do you see when you plot those attributes against `platforms` and `gender`?

In [None]:
# CODE : Explore the categorical attributes in your area of interest through boxplots.

In [None]:
# ANSWER : What can you say about the outliers in your datasets?

### Plotting on data-aware grids


When exploring medium-dimensional data, a useful approach is to draw multiple instances of the same plot on different subsets of your dataset. Since your dataset has over a hundred columns, it makes sense to plot on data-aware grids.

In [None]:
# CODE : For the most interesting linear relationships you found, see what plotting them against some of your
#        most interesting categorical variables reveals.

In [None]:
# ANSWER : The more attributes / dimensions you add to a plot, the more complex it becomes.
#          What are different ways in which you can visualise an additional dimension in your graph?
#          Name 4, other than 'X' and 'Y' coordinates.

In [None]:
# CODE : Create a graph representing 5 dimensions of your choice.

### Open Exploration

Now that you have a good foundational understanding of your dataset, it's time to go out and explore on your own!

In [None]:
# ANSWER : What are three questions about relationships in your dataset that you have after making the
#          plots in the previous steps?

In [None]:
# CODE : Attempt to answer those three questions with visualisations