# Review for Lecture on 13/06/2017 
---

## General remarks: 
---
We will learn about 
  - Elements of a good data visualisation: What makes 
  - Tools available:
    - Python is great! 
    - Matplotlib: The grandfather of most other tools in Python. 
    - Numpy: Essential for numerical calculations. 
    - Pandas: provides data storage types such as "DataFrame" (think Pythonic Excel) and "Series" together with tools to manipulate them. 
    - Seaborn: Great for early stage data explorations. 
    - Bokeh, Plotly: interactivity and online tools
    - ...
  - One of the important things you should learn: how to find and understand the tools you need for your visualisation task. Some useful references for tools in Python include
     - Pandas documentation and cookbook: http://pandas.pydata.org/pandas-docs/stable/
     - Matplotlib documentation, gallery: http://matplotlib.org/
     - ... Google will get you there. Read widely and actively about visualisation tasks to learn the keywords to search for. A good foundation in algorithmic thinking and knowledge of Python help. 

  
Regarding Data visualisation itself:
  
  - It is a process. 
     - **Exploration**: Analyse your dataset with an open mind. You never know if your assumptions or biases are correct. 
     - **Exposition**: Pin down what you want to tell your audience, and then
     - **Design**: Think about the best way to present it. 
  - Don't make your presentation look "busy"; be aware of your *data-to-ink ratio*. Make sure each element you add to your visualisation serves a purpose and serves it well. 
  - Beware of summary statistics; they can be misleading. Check for outliers. Do a scatter plot: it reveals more structure in the data than summary statistics can. 
  - Encodings (the various forms information can take)
    - numbers and strings (but humans don't process these well, they are for machines)
    - Positions (points in 1, 2 or 3 dimensional space, height, width) 
    - Colours (grey scale, RGB, HLS, HSV, contrasts)  
    - Shape
    - Kinetics (movements or what the brain will interpret as movement) 
    - ...
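
The warning about summary statistics above can be checked with a small, well-known example. The sketch below uses datasets I and II of Anscombe's quartet (a standard teaching dataset, not one of the lecture datasets); the names `anscombe_df` and `scatter_both` are only illustrative.

```python
import pandas as pd

# Datasets I and II of Anscombe's quartet; they share the same x values.
anscombe_df = pd.DataFrame({
    "x":  [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    "y1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "y2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
})

# Summary statistics: the two y columns look almost interchangeable.
summary_df = anscombe_df[["y1", "y2"]].agg(["mean", "std"])
corr_y1 = anscombe_df["x"].corr(anscombe_df["y1"])
corr_y2 = anscombe_df["x"].corr(anscombe_df["y2"])
print(summary_df.round(2))
print(round(corr_y1, 2), round(corr_y2, 2))


def scatter_both(df):
    """Draw the two scatter plots side by side (call this in a notebook)."""
    # Imported here so the numbers above can be computed without matplotlib.
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(8, 3))
    df.plot.scatter(x="x", y="y1", ax=axes[0])
    df.plot.scatter(x="x", y="y2", ax=axes[1])
    return fig
```

Calling `scatter_both(anscombe_df)` should show that `y1` scatters around a straight line while `y2` follows a smooth curve, a difference the means, standard deviations and correlations do not reveal.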


---
## Running Python for data visualisation
--- 
*Do ask me if you have any issues / questions regarding installing Miniconda and setting up the `dataviz` environment.* 

There are various ways to work with Python. For the purpose of data analysis, here are some popular ones: 

  - Spyder
  - IPython 
  - Jupyter (you're looking at one)

It is highly recommended to develop your code in a controlled "Python environment", i.e. a separate Python interpreter and set of libraries / packages for each project. This ensures that you know which packages the project needs and avoids polluting the working environment of other projects. `Anaconda` (and the `Miniconda` you downloaded in lecture) serves this purpose and also provides a convenient way to download the packages you require in a few keystrokes. In lecture, you created the `dataviz` environment using the specification given by `dataviz.yml`, which gives you access to standard data analytic packages such as `pandas`, `seaborn`, `numpy`, `scipy` ... When you *activate* the `dataviz` environment using the command ``source activate dataviz`` (or ``conda activate dataviz`` in conda 4.4 and later), you gain access to these packages, and any modifications will be contained in this environment only. 
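
One way to confirm that the activated environment really provides these packages is to ask the interpreter directly. This is a minimal sketch using only the standard library; `EXPECTED_PACKAGES` and `check_packages` are names made up for this illustration, and the package list is just the one mentioned above.

```python
import importlib.util
import sys

# Packages the text above says the `dataviz` environment provides.
EXPECTED_PACKAGES = ["pandas", "seaborn", "numpy", "scipy"]


def check_packages(names):
    """Return {package name: True if it can be imported by this interpreter}."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# sys.prefix is the root directory of the interpreter currently running,
# so it should point inside the environment you activated.
print("Active environment:", sys.prefix)
for name, available in check_packages(EXPECTED_PACKAGES).items():
    print(f"{name:10s} {'ok' if available else 'MISSING'}")
```

If a package is reported as missing, the likely causes are that the environment was not activated in this session or that the notebook is running on a different interpreter.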

---
## On Tidy Data
---
Reference: http://vita.had.co.nz/papers/tidy-data.pdf (by Hadley Wickham)  
80% (so it is said) of data analytic effort goes into tidying up raw data into a form suitable for exploration, analysis, visualisation, etc ... "Tidy data" is about adhering, whenever possible, to a standard or convention of data organisation that links the structure of the data (for our purpose: rows and columns) to its meaning (variables and observations). 

With the mantra that "all tidy data are alike" (that's why it's a standard), we require the following of tidy data:

  - Columns are variables
  - Rows are observations
  - Different tables for different observational units

Advantages? 

  - It provides a uniform way of thinking and talking about data. 
    - You can talk about "rows" and others will know you're talking about the individual observations. "That is a large dataset!" translates to "the data has a lot of rows", which translates to "the data has a lot of observations". Similarly, "this dataset is tall" or "that's too many columns!" only make sense when we know which conventions we adhere to.
  - Standard data analysis tools often assume the tidy data format, and the code you write yourself to analyse the data will be easier to follow if you conform to the tidy data convention. 
    - e.g. `seaborn.boxplot(data=data)` produces a boxplot for each column since it assumes each column corresponds to a variable of interest.
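
To make the "columns are variables, rows are observations" convention concrete, here is a minimal sketch of tidying a small table whose column headers are values (years) rather than variable names. The data and the names `wide_df` and `tidy_df` are invented for illustration; `DataFrame.melt` is one of several Pandas tools that can do this reshaping.

```python
import pandas as pd

# Hypothetical "wide" table: one row per country, one column per year.
wide_df = pd.DataFrame({
    "country": ["A", "B"],
    "2015": [1.0, 3.0],
    "2016": [2.0, 4.0],
})

# Tidy form: each row is one observation (a country in a given year),
# each column is one variable (country, year, value).
tidy_df = wide_df.melt(id_vars="country", var_name="year", value_name="value")
print(tidy_df)
```

In the tidy version the year becomes an ordinary column, so it can be filtered, grouped or mapped to a plot axis like any other variable.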





----
#### Notes
----
You might want to be ready to do a simple tidy data example to reinforce their knowledge, and if time permits, a strip/jitter/swarm plot as well.


In [63]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# This command tells Jupyter notebook to always display plots as "inline output". 
%matplotlib inline 

---
## Some data tidying
---
Below are some examples of how I would read the datasets provided in `mbs-datasets` into a Pandas dataframe. I will convert them into tidy form if they are not already in that form. 

These aren't the only correct ways to do it. Some decisions (such as what to use as index columns, which rows to skip) are context dependent and sometimes a matter of style. 


#### Notes:
 
  - A convention that I adopted is to name variables that refer to a `pandas.DataFrame` object with a `_df` suffix. 
  - Each of these takes a few trials. You first read in the dataset naively and then see what modifications it needs. To understand the code, it is highly recommended that you try it yourself before you look at what I have done. Do tell me if you strongly disagree with what I have done. 
  - Read (just roughly) the documentation of these parsers (e.g. `read_csv`, `read_excel`). 

---

#### Data: `gapminder-health-income.csv`
- The parser will automatically infer that the first row in the file is a header. 
- We manually specify that we only want the columns with index 1 to 4 (i.e. skip the first column). 
- Of these columns, the first one (index 0 in the chosen columns, index 1 in the original file) is chosen as the index column.
- This dataset is already tidy.

In [54]:
health_income_df = pd.read_csv("./mbs-datasets/gapminder-health-income.csv", usecols=range(1, 5), index_col=0)
# uncomment the line below to see the result
#health_income_df  
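
The interaction between `usecols` and `index_col` described above can be tried on a tiny in-memory file. This is only a sketch: the column names and values below are made up and are not those of `gapminder-health-income.csv`.

```python
import io

import pandas as pd

# Hypothetical file contents; the first column is an unwanted row counter.
csv_text = """row_id,country,income,health,population
0,Aland,1000,70.5,50000
1,Borduria,2000,65.2,80000
"""

example_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=list(range(1, 5)),  # keep file columns 1 to 4, dropping row_id
    index_col=0,                # position 0 within the kept columns: country
)
print(example_df)
```

Here `index_col=0` picks `country` rather than `row_id`, because the position is counted among the columns that survive `usecols`.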

#### Data: `air-quality-exposure.csv`
- Skipped the first row in the file since I consider it metadata: it gives information about what the numbers mean. There are various ways to handle this, including recording it somewhere in your code (e.g. in a variable name) or manually changing your dataframe's header to incorporate that information. 
- There are missing data, which are noted in the file as `No data`. We can alert the parser about this using the `na_values=` option. If missing data is denoted in more than one way (e.g. `No data`, `no data`, `null`, `<empty>`), we should alert the parser about all of them so that missing data can be treated uniformly during analysis. 
- Otherwise, this dataset is tidy.

In [58]:
air_quality_exposure_2014_df = pd.read_csv("./mbs-datasets/air-quality-exposure.csv", 
                                           skiprows=2, 
                                           index_col=0, 
                                           na_values=['No data'])
#air_quality_exposure_2014_df
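
The effect of `na_values` is easiest to see by comparing a naive read with one that declares the missing-data markers. The sketch below uses an invented in-memory file with one description line and two spellings of "missing"; it does not reproduce the contents of `air-quality-exposure.csv`.

```python
import io

import pandas as pd

# Hypothetical file: one description line, then a small table.
csv_text = """Hypothetical description line that is not part of the table
country,2013,2014
A,10.5,No data
B,no data,12.0
C,8.1,9.3
"""

# Naive read: the missing-data markers are kept as text,
# so the affected columns are not numeric.
naive_df = pd.read_csv(io.StringIO(csv_text), skiprows=1, index_col=0)

# Declaring every spelling of "missing" lets the parser store NaN instead.
clean_df = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=1,
    index_col=0,
    na_values=["No data", "no data"],
)
print(clean_df.dtypes)
print(clean_df.isna().sum())
```

With the markers declared, both year columns are read as floating-point numbers and the missing entries can be counted or dropped with the usual Pandas tools.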

#### Data: 

In [61]:
pd.read_excel("./mbs-datasets/Journal_publishing_cost_FOIs_UK_universities.xlsx", header=[0, 1])
pd.read_excel("./mbs-datasets/Journalsubscost20152016v3.xlsx")

| Unnamed: 0 | Elsevier | Wiley | Springer | Taylor & Francis | De Gruyter | Sage | RSC | IOP | Total (for these eight publishers) |
|---|---|---|---|---|---|---|---|---|---|
| University of Aberdeen | 7.678757e+05 | 254286.02 | 180968.980 | 165926.27 | 965.600 | 67527.76 | 22731.77 | 22821.94 | 1.483104e+06 |
| Abertay University | 5.109810e+04 | 44060.20 | 12083.900 | 20784.00 | 0.000 | 29092.30 | 0.00 | 0.00 | 1.571185e+05 |
| Aberystwyth University | 2.815634e+05 | 155577.24 | 84073.600 | 101943.50 | 1793.150 | 47229.83 | 3113.02 | 7346.22 | 6.826399e+05 |
| Anglia Ruskin University | 7.392400e+04 | 115389.00 | 2623.000 |  |  |  | 695.00 |  | 1.926310e+05 |
| University of the Arts London | 8.091670e+03 | 0.00 | 0.000 | 2400.00 | 19397.580 |  | 0.00 | 0.00 | 2.988925e+04 |
| Arts University Bournemouth | 0.000000e+00 | 345.52 | 0.000 | 7687.64 | 0.000 | 1148.82 | 0.00 | 0.00 | 9.181980e+03 |
| Aston University | 2.693984e+05 | 90968.33 | 22166.850 | 63974.93 | 2232.900 | 36559.20 | 10607.32 | 8383.58 | 5.042915e+05 |
| Bangor University | 2.703484e+05 | 148027.07 | 78898.380 | 172098.86 | 5062.080 | 68883.64 | 20153.60 | 6116.30 | 7.695883e+05 |
| University of Bath | 5.689650e+05 | 210773.00 | 113835.000 | 158201.00 | 1804.000 | 103147.00 | 23214.00 | 21931.00 | 1.201870e+06 |
| Bath Spa University | 2.100000e+04 |  |  | 37500.00 |  |  | 0.00 | 0.00 | 5.850000e+04 |


In [39]:
# Read the 'cities' sheet and ignore its last 10 rows.
aap_pm_df = pd.read_excel("./mbs-datasets/AAP_PM_database_May2014.xls", skipfooter=10, sheet_name='cities')
