
# A quick overview of working with Pandas

Begin by importing all the Python modules we will need. For this exercise we will use Pandas.

Pandas is a popular open source Python library for data analysis. It introduces two new data structures to Python - Series and DataFrame, both of which are built on top of NumPy, so operations on whole columns run in compiled code and are fast.
- [Series](http://pandas.pydata.org/pandas-docs/version/0.15.2/dsintro.html#basicsseries) is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). 
- [DataFrame](http://pandas.pydata.org/pandas-docs/version/0.15.2/dsintro.html#basics-dataframe) is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
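
To make the two structures concrete, here is a minimal sketch that builds one of each by hand. The years and temperatures are made-up illustration values, not data from the API used later in this notebook.

```python
import pandas as pd

# A Series: one-dimensional, with a label (index entry) for every value.
temps = pd.Series([70.1, 71.4, 72.0], index=[2018, 2019, 2020], name="temp_f")

# A DataFrame: built here from a dict of equal-length columns.
frame = pd.DataFrame({
    "year": [2018, 2019, 2020],
    "temp_f": [70.1, 71.4, 72.0],
    "model": ["CNRM-CM5"] * 3,
})

print(temps.loc[2019])        # look up a value by its label
print(type(frame["temp_f"]))  # selecting one column gives back a Series
print(frame.dtypes)           # each column keeps its own type
```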

Pandas is also tightly integrated with [matplotlib](http://pandas.pydata.org/pandas-docs/version/0.15.2/visualization.html), so you can do basic plotting directly from a DataFrame. It also provides functionality for applying complex transformations and filters to the data, and much more. There are lots of great Pandas tutorials on the web; here is one: [An Introduction to Pandas](http://synesthesiam.com/posts/an-introduction-to-pandas.html).

In [None]:
# A 'magic' command to display plots inline
%matplotlib inline

import requests
import pandas as pd
import matplotlib.pyplot as plt

### 1. Create a Pandas DataFrame

We will create a new Pandas DataFrame from a timeseries. Let's get the timeseries for Annual Average Maximum Temperature for one model (CNRM-CM5) and one scenario (RCP 4.5) for Sacramento County.

In [None]:
url = 'http://api.cal-adapt.org/api/series/tasmax_year_CNRM-CM5_rcp45/rasters/?pagesize=100&stat=mean&ref=/api/counties/34/'

# Make request
response = requests.get(url)
# Get json from response object
data = response.json()

# Create a new pandas dataframe. 
df = pd.DataFrame(data['results'])
df
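
If the API is unreachable, the same step can be tried offline. The sketch below assumes the `results` entry is a list of records with `event` and `image` keys (the two fields used later in this notebook). The payload is a hypothetical stand-in with made-up values, and the real response carries more fields.

```python
import pandas as pd

# Hypothetical stand-in for response.json(); values are invented.
sample_data = {
    "results": [
        {"event": "2006-01-01", "image": 296.5},
        {"event": "2007-01-01", "image": 297.1},
        {"event": "2008-01-01", "image": 296.8},
    ],
}

# Each dict becomes one row; the dict keys become the column names.
sample_df = pd.DataFrame(sample_data["results"])
print(sample_df.columns.tolist())
print(len(sample_df))
```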

### 2. Explore the data in a DataFrame

Uncomment one line at a time, run the cell again, and examine the output. A notebook cell displays only the value of its last expression.

In [None]:
len(df)
#df.head()
#df.tail()
#df.columns
#df.image
#df['image']
#df.event.head(20)

### 3. Apply functions to a DataFrame

There are lots of useful built-in methods we can use on a specific column, such as `mean()` to get the average. Most of pandas' methods ignore missing values like NaN. You can also apply these methods to all columns of a DataFrame at once. Try `std()`, `max()`, `min()` and `sum()`.

In [None]:
df.image.mean()
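# numeric_only=True skips text columns, which recent pandas versions refuse to average
#df.mean(numeric_only=True)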
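
A small sketch of how missing values are handled, using a made-up two-column frame rather than the API data. `skipna` is a standard parameter of these reduction methods.

```python
import numpy as np
import pandas as pd

nan_sample = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

print(nan_sample["a"].mean())              # NaN is skipped: (1 + 2 + 4) / 3
print(nan_sample["a"].mean(skipna=False))  # NaN propagates when skipping is off
print(nan_sample.max())                    # column-wise: one value per column
```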

You can also create custom functions. The `image` values in our dataframe are in Kelvin and we want to change them to degrees Fahrenheit. Let's create a function that does this conversion and apply it to each row in the `image` column using `.apply()`. For background, see this intro to [lambda](http://www.diveintopython.net/power_of_introspection/lambda_functions.html) functions in Python.

In [None]:
df.image = df.image.apply(lambda x: (x - 273.15) * 9 / 5 + 32)
# For simple calculations you don't need to use apply
# df.image = (df.image - 273.15) * 9 / 5 + 32

df.image.head()
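
The two approaches give the same numbers. The sketch below checks this on a few made-up Kelvin values; `kelvin_to_fahrenheit` is a helper name introduced here for illustration.

```python
import pandas as pd

def kelvin_to_fahrenheit(k):
    """Convert a temperature from Kelvin to degrees Fahrenheit."""
    return (k - 273.15) * 9 / 5 + 32

kelvin = pd.Series([273.15, 295.0, 300.0])  # made-up values

via_apply = kelvin.apply(kelvin_to_fahrenheit)  # function called once per element
vectorized = kelvin_to_fahrenheit(kelvin)       # arithmetic on the whole column at once

print(via_apply.tolist())
print((via_apply - vectorized).abs().max())     # largest difference between the two
```

Note that the conversion cell overwrites `image` in place, so running it a second time converts the already converted values again.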

### 4. Indexing

Each row in a DataFrame has an identifier called the `index`. By default Pandas autogenerates an integer index for each row, but it can be useful to identify each row by other values.

In our DataFrame each row represents the maximum temperature for a year. The date (contained in the `event` field) is a unique identifier. We can tell the DataFrame to use the `event` field as the index field. This creates a timeseries, and pandas provides some extra [functionality](https://tomaugspurger.github.io/modern-7-timeseries.html) for working with timeseries data.

In [None]:
# First change format of `event` field to datetime
df['event'] = pd.to_datetime(df['event'], format='%Y-%m-%d')
# Set event field as index
df = df.set_index(['event'])
# set_index also accepts a list of several columns, which creates a MultiIndex

In [None]:
df.head(20)
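
As a rough illustration of what a date index gives you, the sketch below repeats the two steps on a tiny made-up frame and then looks up a row by its date. The column names mirror the notebook; the values are invented.

```python
import pandas as pd

dated = pd.DataFrame({
    "event": ["2019-01-01", "2020-01-01", "2021-01-01"],
    "image": [75.2, 76.0, 75.8],  # made-up values
})
dated["event"] = pd.to_datetime(dated["event"], format="%Y-%m-%d")
dated = dated.set_index("event")

print(type(dated.index))                              # a DatetimeIndex
print(dated.loc[pd.Timestamp("2020-01-01"), "image"]) # look up a row by its date label
print(dated.index.year.tolist())                      # date parts are available on the index
```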

### 5. Filtering

In [None]:
# Filter by value
hi_temps = df[df.image >= 80]
hi_temps.head()

# Filter by time. Uncomment the following lines and run the cell again
#filtered_df = df['20200101':'20300101']
#filtered_df
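
Filtering by value works through a boolean mask: the comparison produces one True/False per row, and indexing with that mask keeps the True rows. A minimal sketch on made-up data, which also shows date slicing with `.loc` and combining two conditions:

```python
import pandas as pd

years = pd.date_range("2018-01-01", periods=6, freq="YS")  # 2018 .. 2023
small = pd.DataFrame({"image": [78.5, 80.2, 79.9, 81.0, 80.0, 77.3]}, index=years)

mask = small["image"] >= 80  # boolean Series, one entry per row
print(small[mask])

# Label-based slicing on a DatetimeIndex includes both end points.
window = small.loc["2020-01-01":"2022-01-01"]
print(window)

# Combine conditions with & (and) or | (or); each needs its own parentheses.
hot_and_recent = small[(small["image"] >= 80) & (small.index.year >= 2021)]
print(hot_and_recent)
```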

### 6. Resampling

Resampling is similar to a groupby: you split the time series into groups (10 year bins below), apply a function to each group (mean), and combine the results (one row per group). Because we resample a single column with a single function, the result is a `Series`, another data structure commonly used in Pandas - a one-dimensional labeled array capable of holding any data type. Aggregating with several functions at once returns a DataFrame with one column per function. See the pandas documentation for more examples of [resampling](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html).

In [None]:
# Only resample rows where year is between 2010 and 2099.
# 10AS means 10-year bins: A = annual, S = anchored at the start of the year.
# Recent pandas versions spell this alias 'YS'. Refer to pandas docs for other ways to specify bins
decadal_avg = df['20100101':'20990101'].image.resample('10AS').mean()
print(type(decadal_avg))
decadal_avg

In [None]:
# The first row of the resampling result above is equivalent to:
df['20100101':'20190101'].image.mean()

In [None]:
decadal_stats = df['20100101':'20990101'].image.resample('10AS').agg(['mean', 'max', 'min', 'std'])
print(type(decadal_stats))
decadal_stats
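
To see the split-apply-combine idea behind resampling, the sketch below computes decadal means twice on made-up data: once with `resample` and once with an explicit `groupby` on the decade. It uses the `YS` spelling of the year-start alias, which is the one newer pandas releases prefer over `AS`.

```python
import pandas as pd

years = pd.date_range("2010-01-01", periods=30, freq="YS")  # 2010 .. 2039
yearly = pd.Series(range(60, 90), index=years, dtype="float64")  # made-up values

# Resample into 10-year bins that start at the first year in the data.
decadal = yearly.resample("10YS").mean()

# The same grouping spelled out: map every year to the start of its decade.
decade_start = (yearly.index.year // 10) * 10
by_decade = yearly.groupby(decade_start).mean()

print(decadal)
print(by_decade)
```

The two results hold the same means; they differ only in their labels (timestamps for `resample`, plain integers for `groupby`).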

### 7. Plotting

The `plot` method on a DataFrame or Series is a thin wrapper around matplotlib's `plt.plot()`.

In [None]:
df.image.plot()
#df.image.hist()
#df['image'].plot.box()
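
Because pandas draws onto ordinary matplotlib axes, the two styles can be mixed. A small sketch with made-up values, drawing one line through pandas and a second one directly with matplotlib on the same axes:

```python
import matplotlib.pyplot as plt
import pandas as pd

yearly_f = pd.Series([75.2, 76.0, 75.8, 77.1], index=[2018, 2019, 2020, 2021])  # made-up values

fig, ax = plt.subplots()
yearly_f.plot(ax=ax, label="Series.plot")  # pandas draws onto the given axes
ax.plot(yearly_f.index, yearly_f.values + 1, linestyle="--",
        label="ax.plot (shifted by 1)")    # plain matplotlib on the same axes
ax.legend()
```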

In [None]:
df.image.plot(figsize=(10, 8), color='#348ABD')
plt.title("CNRM-CM5 RCP 4.5")
plt.ylabel("Temperature (degrees F)")
plt.grid(True)

Experiment with more examples [here](http://pandas.pydata.org/pandas-docs/stable/visualization.html).