# Getting Started with Exploratory Data Analysis

Three important Python packages:
1. NumPy for efficient computation on arrays
2. Pandas for analysis of small and medium-sized data
3. Matplotlib for plotting in the notebook

## Outline
- Pandas basic concepts
- Analysis of a Lustre logfile
    - Machine learning
- Analysis of Penguin data
    - Plotting on world map

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Pandas

Python module for manipulating tabular data

## `pandas`

- Gives Python a `DataFrame`
- Structured manipulation tools
- Built on top of `numpy`
- Huge growth from 2011-2012
- Very **efficient**
- Great for *medium* data
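
As a minimal sketch of the points above, the snippet below builds a tiny `DataFrame`, pulls one column out as a NumPy array, and groups rows by a key. The table contents and the names `toy`, `values`, and `per_host` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A small, made-up table: each dict key becomes a column, each position a row.
toy = pd.DataFrame({
    "host": ["oss07", "oss07", "mds02"],
    "value": [0.0, 2.5, 1.5],
})

# A numeric column can be handed to NumPy as a plain array.
values = toy["value"].to_numpy()
mean_value = np.mean(values)

# Structured manipulation: group rows by host and sum the values per group.
per_host = toy.groupby("host")["value"].sum()
print(per_host)
```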

Resources

- [pandas.pydata.org](http://pandas.pydata.org/)
- [Python for Data Analysis](http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793) by Wes McKinney
- [Data Wrangling Kung Fu with Pandas](https://vimeo.com/63295598) by Wes McKinney
- [Cheat sheet](https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+Pandas,+SciPy,+NumPy+Cheat+Sheet.pdf) by Quandl

### Why `pandas`?

> 80% of the effort in data analysis is spent cleaning data. [Hadley Wickham](http://vita.had.co.nz/papers/tidy-data.pdf)

Efficiency

- Different views of data
- [Tidy data](http://vita.had.co.nz/papers/tidy-data.pdf) by Hadley Wickham
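
To illustrate what a different view of the same data can look like, here is a small sketch that reshapes a wide table into a tidy one, with one observation per row. The table and the names `wide` and `tidy` are hypothetical.

```python
import pandas as pd

# Made-up "wide" table: one row per host, one column per metric.
wide = pd.DataFrame({
    "host": ["oss07", "mds02"],
    "cpu_intr": [0.0, 0.5],
    "statfs": [8.8, 0.0],
})

# melt() stacks the metric columns into (metric, value) pairs,
# so every row holds exactly one observation.
tidy = pd.melt(wide, id_vars=["host"], var_name="metric", value_name="value")
print(tidy)
```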

Raw data is often in the wrong format

- How often do you download an array ready for array-oriented computing?
- e.g. `scikit-learn` interface
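
Array-oriented libraries generally expect a two-dimensional array with one row per sample and one column per feature. The sketch below shows one way to get there from a `DataFrame`; the feature table and the names `features` and `X` are invented for the example.

```python
import pandas as pd

# Made-up feature table: one row per sample, one numeric column per feature.
features = pd.DataFrame({
    "cache_access": [0.0, 1.5, 3.0],
    "statfs": [8.8, 2.2, 2.2],
})

# to_numpy() returns a 2-D array of shape (n_samples, n_features).
X = features.to_numpy()
print(X.shape)
```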

Storage may be best in a different format

- Sparse representations
- Upload to database
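
As one possible illustration of the database route, the sketch below writes a small frame to an in-memory SQLite database and queries it back. The table name `metrics` and the sample rows are assumptions made for the example, not part of the logfile analysis.

```python
import sqlite3

import pandas as pd

# Made-up rows standing in for cleaned data.
toy = pd.DataFrame({
    "host": ["oss07", "mds02"],
    "metric": ["statfs", "statfs"],
    "value": [8.8, 0.0],
})

# An in-memory SQLite database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
toy.to_sql("metrics", conn, index=False)

# Read back only the rows with a non-zero value.
back = pd.read_sql_query("SELECT host, value FROM metrics WHERE value > 0", conn)
conn.close()
print(back)
```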




## Converting a logfile into a Pandas Data Frame

The log file contains the following columns:
- host
- metric
- value 
- type of value
- units of value
- time stamp

Steps we are taking:
- Read the semicolon-delimited CSV file
- Add the column names
- Print the first few rows

In [11]:
csv_filename = '2014-04-24.csv'
df = pd.read_csv(csv_filename, sep=';', names=['host', 'metric', 'value', 'type', 'units', 'time stamp'])
df.head()

Unnamed: 0,host,metric,value,type,units,time stamp
0,oss07,lustre.scratch.ost.obdfilter.OST0017.cache_access,0.0,float,pages/s,1398382546
1,oss07,lustre.scratch.ost.obdfilter.OST0015.disconnect,0.0,float,requests/s,1398382546
2,oss07,cpu_intr,0.0,float,%,1398382546
3,oss07,lustre.scratch.ost.obdfilter.hosttotal.cache_a...,0.0,float,pages/s,1398382546
4,oss07,lustre.scratch.ost.obdfilter.OST0025.connect,0.0,float,requests/s,1398382546
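
The `time stamp` values look like Unix epoch seconds; that is an assumption based on the numbers shown, not something the file states. Under that assumption, a sketch of converting them to readable datetimes could look like this, using two rows copied from the output above (the new column name `time` is hypothetical):

```python
import pandas as pd

# Two rows copied from the output above; the full file is not needed here.
sample = pd.DataFrame({
    "host": ["oss07", "oss07"],
    "metric": ["cpu_intr", "lustre.scratch.ost.obdfilter.OST0025.connect"],
    "value": [0.0, 0.0],
    "time stamp": [1398382546, 1398382546],
})

# Assumption: the integers are seconds since 1970-01-01 00:00:00 UTC.
sample["time"] = pd.to_datetime(sample["time stamp"], unit="s")
print(sample[["host", "time"]])
```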


In [21]:
df.shape

(8352247, 6)

In [22]:
statfs = df[df['metric'].str.contains('statfs')]

In [23]:
statfs.head()

Unnamed: 0,host,metric,value,type,units,time stamp
101,oss07,lustre.scratch.ost.obdfilter.OST0007.statfs,0.0,float,requests/s,1398382546
104,oss07,lustre.scratch.ost.obdfilter.OST0027.statfs,0.0,float,requests/s,1398382546
106,oss07,lustre.scratch.ost.obdfilter.hosttotal.statfs,8.79832,float,requests/s,1398382546
159,oss07,lustre.scratch.ost.obdfilter.OST0005.statfs,2.19956,float,requests/s,1398382546
163,oss07,lustre.scratch.ost.obdfilter.OST0025.statfs,2.1996,float,requests/s,1398382546


In [29]:
statfs[statfs.host == 'mds02']

Unnamed: 0,host,metric,value,type,units,time stamp
1871,mds02,lustre.scratch.mdt.mdt.MDT0000.statfs,0.00000,float,ops/s,1398382546
1933,mds02,lustre.scratch.mdt.mdt.hosttotal.statfs,0.00000,float,ops/s,1398382546
8203,mds02,lustre.scratch.mdt.mdt.MDT0000.statfs,0.00000,float,ops/s,1398382569
8265,mds02,lustre.scratch.mdt.mdt.hosttotal.statfs,0.00000,float,ops/s,1398382569
14535,mds02,lustre.scratch.mdt.mdt.MDT0000.statfs,0.00000,float,ops/s,1398382581
14597,mds02,lustre.scratch.mdt.mdt.hosttotal.statfs,0.00000,float,ops/s,1398382581
20867,mds02,lustre.scratch.mdt.mdt.MDT0000.statfs,0.00000,float,ops/s,1398382598
20929,mds02,lustre.scratch.mdt.mdt.hosttotal.statfs,0.00000,float,ops/s,1398382598
27205,mds02,lustre.scratch.mdt.mdt.MDT0000.statfs,0.00000,float,ops/s,1398382612
27267,mds02,lustre.scratch.mdt.mdt.hosttotal.statfs,0.00000,float,ops/s,1398382612


In [38]:
pivot = df.pivot(index='time stamp', columns='metric', values='value')
pivot.head()

ValueError: Index contains duplicate entries, cannot reshape
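
`pivot` requires every (index, columns) pair to be unique. A likely cause of the error here, though not verified against the full file, is that several hosts report the same metric at the same time stamp. The sketch below reproduces the situation on made-up rows and shows two possible ways around it with `pivot_table`: keeping hosts apart in the column key, or aggregating the duplicates.

```python
import pandas as pd

# Made-up rows: two hosts report the same metric at the same time stamp,
# so the pair (time stamp, metric) is not unique.
toy = pd.DataFrame({
    "host": ["oss07", "mds02", "oss07", "mds02"],
    "metric": ["cpu_intr", "cpu_intr", "cpu_intr", "cpu_intr"],
    "value": [1.0, 3.0, 2.0, 4.0],
    "time stamp": [1398382546, 1398382546, 1398382569, 1398382569],
})

# Check for the duplicates that make pivot() fail.
has_duplicates = toy.duplicated(subset=["time stamp", "metric"]).any()

# Option 1: keep hosts apart by making host part of the column key.
by_host = toy.pivot_table(index="time stamp", columns=["host", "metric"], values="value")

# Option 2: collapse duplicates with an explicit aggregation (here the mean).
averaged = toy.pivot_table(index="time stamp", columns="metric", values="value", aggfunc="mean")
print(averaged)
```

Which option fits depends on the question being asked: per-host columns preserve every reading, while aggregation gives one series per metric across the whole system.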