# Brief Introduction to Pandas

### Limitations of using numpy for tabular data

We have seen how to use numpy to import tabular data stored in a CSV file:

In [7]:
import numpy as np

data = np.loadtxt('data.csv', delimiter=',', skiprows=2)

data

array([[ 200.,  240.],
       [ 400.,  370.],
       [ 600.,  630.]])

However, there are two limitations in using numpy for tabular data:

- a numpy array just stores the data, not the metadata (column names, row index)
- a numpy array has a single data type (e.g., integer, float), while tables may have columns of data with different types
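The second limitation can be seen with a small sketch. The table below is made up for illustration (it is not the content of `data.csv`): as soon as a text column sits next to a numeric one, numpy converts every element to a common type.

```python
import numpy as np

# Made-up table with a text column and a numeric column
mixed = np.array([["Paris", 200.0], ["Lima", 400.0]])

# numpy picks one common type for every element: here, strings
print(mixed.dtype.kind)  # 'U' stands for unicode string

# The numbers are now stored as text, so they must be converted back by hand
values = mixed[:, 1].astype(float)
print(values.sum())
```

The column names are also lost: nothing in `mixed` says that the first column is a city and the second a measurement.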

### Here comes Pandas

- Pandas (http://pandas.pydata.org/) is a widely used Python library for handling tabular data
  - read from / write to different formats (CSV...)
  - analytics, statistics, transformations, plotting (on top of matplotlib)
- Borrows many features from R’s dataframes
  - A dataframe is a 2-dimensional table whose columns have names and can have different data types

We first import the library:

In [2]:
import pandas as pd
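As a minimal sketch of what a table with named columns of different types looks like, here is a small dataframe built by hand. The column names and values are invented for illustration.

```python
import pandas as pd

# Made-up table: one text column, one integer column, one float column
table = pd.DataFrame(
    {
        "city": ["Paris", "Lima", "Oslo"],
        "year": [2001, 2002, 2003],
        "temperature": [12.5, 19.3, 6.1],
    }
)

print(table.dtypes)                  # one data type per column
print(table["temperature"].mean())   # columns are accessed by name
```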

## A real example

First, look at the real dataset of land-surface temperature (region averages) that we will use in the project:

http://berkeleyearth.org/data/

http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/

This is a good example of a real dataset used in science: text format, good documentation, human readable, but a bit harder to deal with programmatically (e.g., column names given as comments instead of a strict CSV header).

Start by importing the packages that we will need for the project.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

Load the data (a single region/file; we won't use all the columns available). Note that we can pass a URL directly to `pandas.read_csv`!

In [9]:
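df = pd.read_csv(
    "http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/united-states-TAVG-Trend.txt",
    sep=r"\s+",      # columns are separated by runs of whitespace
    comment='%',     # the file header is written as '%' comment lines
    header=None,     # so there is no header row left to read
    usecols=(0, 1, 2, 3, 8, 9),
    names=("year", "month", "anomaly", "uncertainty", "10-year-anomaly", "10-year-uncertainty")
)

# Build a monthly datetime index from the year and month columns
# (combining date columns inside read_csv is deprecated in recent pandas)
df.index = pd.to_datetime(df[["year", "month"]].assign(day=1))
df = df.drop(columns=["year", "month"])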

df.index.name = "date"
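To see what the parsing options do without downloading anything, the sketch below reads a tiny in-memory text that mimics the layout of such a file. The sample lines and their values are invented, not real measurements; only `comment`, `header`, `names` and a whitespace separator are illustrated.

```python
import io
import pandas as pd

# Made-up sample: '%' comment lines, then whitespace-separated columns
sample = io.StringIO(
    "% Year, Month, Anomaly, Unc.\n"
    "%\n"
    "  1850     1    -0.50   0.40\n"
    "  1850     2     0.25   0.35\n"
    "\n"
    "  1850     3      NaN    NaN\n"
)

small = pd.read_csv(
    sample,
    sep=r"\s+",      # split on runs of whitespace
    comment="%",     # skip the commented header lines
    header=None,     # no header row is left once comments are skipped
    names=("year", "month", "anomaly", "uncertainty"),
)

print(small)
```

The comment lines and the blank line are skipped, and the `NaN` entries are read as missing values.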

## First exercise: data inspection, basic statistics and plotting

Objectives:

1. show the head and a sample of the data
2. plot the data
    * plot 'anomaly', playing around with `linewidth` and opacity (`alpha`)
    * plot '10-year-anomaly'
    * plot '10-year-uncertainty' around it (`plt.fill_between`)
    * try adjusting the size of the figure
3. statistics of 'anomaly'
    * print descriptive statistics, print the mean of each column
    * plot distribution of 'anomaly' using `hist`, playing around with `bins`
    * plot distribution of 'anomaly' separated by years but in one plot
        1. 1850-1900
        2. 1950-2000

### Solution
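One possible sketch, not the only way to do it. So that it runs offline, it uses a synthetic dataframe `demo` with random values as a stand-in for `df` (the name `demo` and all its values are assumptions for illustration); the same calls should apply to `df` once it is loaded.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for `df` (random values, NOT the Berkeley Earth data)
rng = np.random.default_rng(0)
dates = pd.date_range("1850-01-01", "2000-12-01", freq="MS")
demo = pd.DataFrame(
    {
        "anomaly": rng.normal(0.0, 1.0, len(dates)),
        "10-year-anomaly": np.linspace(-0.5, 0.5, len(dates)),
        "10-year-uncertainty": 0.2,
    },
    index=dates,
)

# 1. Show the head and a random sample of the data
print(demo.head())
print(demo.sample(5, random_state=0))

# 2. Plot the anomaly, the 10-year anomaly and its uncertainty band
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(demo.index, demo["anomaly"], linewidth=0.5, alpha=0.5)
ax.plot(demo.index, demo["10-year-anomaly"], linewidth=2)
ax.fill_between(
    demo.index,
    demo["10-year-anomaly"] - demo["10-year-uncertainty"],
    demo["10-year-anomaly"] + demo["10-year-uncertainty"],
    alpha=0.3,
)

# 3. Descriptive statistics and the mean of each column
print(demo.describe())
print(demo.mean())

# Distribution of 'anomaly' for two periods, drawn on the same axes
early = demo.loc["1850":"1900", "anomaly"]
late = demo.loc["1950":"2000", "anomaly"]
fig2, ax2 = plt.subplots()
early.hist(ax=ax2, bins=30, alpha=0.5, label="1850-1900")
late.hist(ax=ax2, bins=30, alpha=0.5, label="1950-2000")
ax2.legend()
```

Selecting rows with `demo.loc["1850":"1900"]` relies on the index being a datetime index, which is why the dates were set as the index when loading the data.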

## Second exercise: more advanced pandas analytics features 

Try recalculating the 10-year anomaly from the anomaly (rolling mean) using pandas. Assign the results to a new column '10-year-anomaly-pandas' in the dataframe. Compare these results with the '10-year-anomaly' column in a plot.

Tip: look at `rolling` and `mean` in the pandas documentation.

### Solution
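A minimal sketch of the rolling mean, again on a synthetic stand-in (`demo`, random values) so that it runs offline. It assumes that a 10-year window corresponds to 120 monthly rows and that the window is centered on each month; both are assumptions to check against the dataset documentation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'anomaly' column (random values, not real data)
rng = np.random.default_rng(0)
dates = pd.date_range("1900-01-01", periods=360, freq="MS")
demo = pd.DataFrame({"anomaly": rng.normal(0.0, 1.0, len(dates))}, index=dates)

# Assumption: 10 years = 120 monthly rows, window centered on each month
window = 120
demo["10-year-anomaly-pandas"] = (
    demo["anomaly"].rolling(window, center=True).mean()
)

# Rows without a full window around them are left as NaN
print(demo["10-year-anomaly-pandas"].isna().sum())
```

On the real data, the same `rolling(...).mean()` call can be applied to `df["anomaly"]`, and `df[["10-year-anomaly", "10-year-anomaly-pandas"]].plot()` draws both columns on one figure for comparison.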

## Reuse the code for other data (countries)

Create a function that takes a country name as input.
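A possible sketch is shown here. The helper names `country_url` and `load_country` are hypothetical, and the file-naming rule (lower-case name with hyphens instead of spaces) is an assumption inferred from the single `united-states` URL used earlier; it may not hold for every country.

```python
import pandas as pd

BASE_URL = "http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/"


def country_url(country):
    """Build the data URL for a country name such as 'United States'.

    Assumption: file names are the lower-case country name with spaces
    replaced by hyphens, as in the 'united-states' example.
    """
    slug = country.strip().lower().replace(" ", "-")
    return f"{BASE_URL}{slug}-TAVG-Trend.txt"


def load_country(country):
    """Load the temperature table of one country, indexed by date."""
    data = pd.read_csv(
        country_url(country),
        sep=r"\s+",      # whitespace-separated columns
        comment="%",     # header lines are '%' comments
        header=None,
        usecols=(0, 1, 2, 3, 8, 9),
        names=("year", "month", "anomaly", "uncertainty",
               "10-year-anomaly", "10-year-uncertainty"),
    )
    # Build a monthly datetime index from the year and month columns
    data.index = pd.to_datetime(data[["year", "month"]].assign(day=1))
    data.index.name = "date"
    return data.drop(columns=["year", "month"])
```

A call such as `load_country("France")` would then return a dataframe with the same columns as `df`, provided a file with the corresponding name exists on the server.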