![title](img/pandas_logo.png)

`pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

<center>Reference: http://pandas.pydata.org</center>

Topics:
    1. Pandas data structures
    2. Loading data
    3. Cleaning and formatting data
    4. Basic visualization

#### SETUP

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

#### The <font color='red'>Pandas</font> library.

In [None]:
import pandas as pd
print('Using pandas version {}'.format(pd.__version__))

pd.set_option("display.max_rows", 16) # only 16 rows of data will be displayed

LARGE_FIGSIZE = (12, 8) # set figure size

Remember that tab completion is a quick way to explore a namespace:

In [None]:
#pd.<TAB>  # display the contents of the pandas namespace

## <center>Pandas data structures</center>

<center>Pandas data structures are similar to numpy ndarrays but with extra functionality.</center>

#### 1D data structures

A <font color='red'>Series</font>  is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. 

Think of a Series as a cross between a list and a dict.

A series can be constructed with the `pd.Series` constructor (passing a list or array of values).

In [None]:
a = np.array([1, 1, 2, 3, 5])   # Numpy

In [None]:
s = pd.Series([1, 1, 2, 3, 5])  # Pandas

In [None]:
type(s)

In [None]:
s.describe()
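
As a rough illustration of the "cross between a list and a dict" idea, the sketch below reads one small Series both by position and by label. The labels `a`, `b`, `c` and the values are placeholders, not data from this notebook.

```python
import pandas as pd

# A small made-up Series with explicit string labels
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# List-like behaviour: ordered values that can be sliced by position
first_two = s.iloc[:2].tolist()

# Dict-like behaviour: look up by label, test membership on the labels
value_b = s["b"]
has_c = "c" in s
as_dict = s.to_dict()
```

Note that `in` tests the labels (like dict keys), not the values.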

#### NumPy arrays as backend of Pandas

In [None]:
s.values

In [None]:
print(s.index)
print('Type of index: {}'.format(type(s.index)))
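
A short sketch of what the NumPy backend means in practice, reusing the same five values as above: arithmetic is element-wise as with an ndarray, NumPy functions accept a Series directly, and the index is carried along.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 2, 3, 5])

values = s.values   # the underlying data as a NumPy array
doubled = s * 2     # element-wise, like NumPy, but the index is kept
total = np.sum(s)   # NumPy functions can be applied to a Series
```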

#### Pandas data structures have an index

In [None]:
countries = np.array(['Albania', 'Algeria', 'Andorra', 'Angola'])   
print(countries[0])

In NumPy, single-element indexing of a 1-D array works as one expects: by integer position.

In [None]:
# data corresponding to countries "index"
life_expectancy_values = np.array([74.7,  75. ,  83.4,  57.6])

In [None]:
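if False:
    # Disabled on purpose: this raises an IndexError, because a NumPy array
    # can only be indexed by integer position, not by string labels
    print(life_expectancy_values[tuple(countries)])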

In [None]:
life_expectancy_values = pd.Series([74.7,  75. ,  83.4,  57.6],
                                  index=['Albania', 'Algeria', 'Andorra', 'Angola'])
print(life_expectancy_values)

#### Numpy Array has an implicitly defined integer index used to access the values while the Pandas Series has an explicitly defined index associated with the values.

In [None]:
print(life_expectancy_values.iloc[0])  # value at position 0 in the Series

In [None]:
print(life_expectancy_values.loc['Angola'])  # value at the given index label

In [None]:
# Note: no index is given this time
life_expectancy_values = pd.Series([74.7,  75. ,  83.4,  57.6])
print(life_expectancy_values)

Without an explicit `index`, the Series gets the default integer index (0, 1, 2, ...).

In [None]:
print(life_expectancy_values.iloc[0])  # use iloc to get the value at position n

#### Index and position are not the same!
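
The difference is easiest to see when the labels are integers that do not match the positions. The sketch below reuses the life expectancy values with a deliberately reversed integer index (an assumption made purely for illustration).

```python
import pandas as pd

# Integer labels in reverse order, so label 0 sits at the LAST position
s = pd.Series([74.7, 75.0, 83.4, 57.6], index=[3, 2, 1, 0])

by_label = s.loc[0]      # looks up the label 0
by_position = s.iloc[0]  # looks up the first element
```

Here `by_label` and `by_position` are different values, which is why `.loc` and `.iloc` are the unambiguous way to say which one is meant.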

#### 2D data structures

A <font color='red'>DataFrame</font> is a 2-dimensional labeled data structure with columns of potentially different types. It is generally the most commonly used pandas object.

A <font color='red'>DataFrame</font> is like a sequence of aligned <font color='red'>Series</font> objects, i.e. they share the same index.

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

In [None]:
states.columns

In [None]:
states.index

Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.
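
To make the "aligned Series" idea concrete, here is a minimal sketch in which the two Series deliberately do not cover the same states (a subset of the values above, chosen for illustration). pandas aligns them on their labels and fills the gaps with NaN.

```python
import pandas as pd

# Two Series whose indexes only partially overlap
area = pd.Series({"California": 423967, "Texas": 695662})
population = pd.Series({"Texas": 26448193, "Florida": 19552860})

states_demo = pd.DataFrame({"area": area, "population": population})

# Rows exist for every label seen in either Series
labels = set(states_demo.index)
florida_area_missing = pd.isnull(states_demo.loc["Florida", "area"])
```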

#### Create a DataFrame from a 2D Numpy array

Given a two-dimensional array of data, we can create a dataframe with any specified column and index names. If left out, an integer index will be used for each.

In [None]:
np.random.rand(3, 2)

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

**EXERCISE:** Investigate how to access elements of a DataFrame using the following data:

In [None]:
# NYC subway ridership for 5 stations on 10 different days
ridership_df = pd.DataFrame(
    data=[[   0,    0,    2,    5,    0],
          [1478, 3877, 3674, 2328, 2539],
          [1613, 4088, 3991, 6461, 2691],
          [1560, 3392, 3826, 4787, 2613],
          [1608, 4802, 3932, 4477, 2705],
          [1576, 3933, 3909, 4979, 2685],
          [  95,  229,  255,  496,  201],
          [   2,    0,    1,   27,    0],
          [1438, 3785, 3589, 4174, 2215],
          [1342, 4043, 4009, 4665, 3033]],
    index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
           '05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
    columns=['R003', 'R004', 'R005', 'R006', 'R007']
)
print(type(ridership_df))
print(ridership_df)

In [None]:
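# Accessing elements
if False:
    print(ridership_df.iloc[0])           # first row, by position
    print(ridership_df.loc['05-05-11'])   # row with the given index label
    print(ridership_df['R003'])           # column, by name
    print(ridership_df.iloc[1, 3])        # single value at row 1, column 3 (by position)

# Accessing multiple rows
if False:
    print(ridership_df.iloc[1:4])

# Accessing multiple columns
if False:
    print(ridership_df[['R003', 'R005']])

# Pandas axis
if False:
    df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
    print(df.sum())         # axis=0 (default): one sum per column
    print(df.sum(axis=1))   # axis=1: one sum per row
    print(df.values.sum())  # sum of all values in the underlying array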

## <center>Load some data</center>

#### Pandas <font color='red'>read_csv</font>

Weather data from Weather Underground (http://www.wunderground.com/)

In [None]:
!head data/weather_year_10-10-15_10-10-16.csv

In [None]:
weather_data = pd.read_csv("data/weather_year_10-10-15_10-10-16.csv")

In [None]:
weather_data

In [None]:
print(len(weather_data))  # only gets you the number of rows

In [None]:
weather_data.columns

In [None]:
#weather_data["EDT"]

# or

weather_data.EDT  # nicer because you can autocomplete

In [None]:
weather_data[["EDT", "Mean TemperatureF"]]

In [None]:
weather_data.EDT.head() # can also pass an argument

In [None]:
weather_data["Mean TemperatureF"].head()

#### Rename columns

Assign a new list of column names to the columns property of the DataFrame.

In [None]:
weather_data.columns = ["date", "max_temp", "mean_temp", "min_temp", "max_dew",
                "mean_dew", "min_dew", "max_humidity", "mean_humidity",
                "min_humidity", "max_pressure", "mean_pressure",
                "min_pressure", "max_visibilty", "mean_visibility",
                "min_visibility", "max_wind", "mean_wind", "min_wind",
                "precipitation", "cloud_cover", "events", "wind_dir"]

In [None]:
weather_data
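
Assigning a full list requires naming every column, in order. As a hedged alternative (not used in this notebook), `DataFrame.rename` takes a mapping and renames only the columns listed. The tiny frame below is made up and keeps just two of the original column names.

```python
import pandas as pd

# A made-up frame with two of the original weather column names
demo = pd.DataFrame({"EDT": ["2015-10-10"], "Mean TemperatureF": [60]})

# rename returns a new frame; columns not in the mapping are left untouched
renamed = demo.rename(columns={"EDT": "date", "Mean TemperatureF": "mean_temp"})
```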

In [None]:
# Now we can use dot (attribute) access, since the new names are valid Python identifiers
weather_data.mean_temp.head()

In [None]:
weather_data.mean_temp.std()

In [None]:
weather_data.mean_temp.hist()

### Climate data

### Sea surface temperature anomalies

In [None]:
!head data/temperatures/annual.land_ocean.90S.90N.df_1901-2000mean.dat

#### Pandas  <font color='red'>read_table</font>

In [None]:
filename = "data/temperatures/annual.land_ocean.90S.90N.df_1901-2000mean.dat"
sst_anom = pd.read_table(filename)
print(type(sst_anom))

In [None]:
sst_anom.columns

There is only 1 column! Let's reformat the data noting that values are separated by any number of spaces.

## <center>Data cleaning</center>

In [None]:
sst_anom = pd.read_table(filename, sep=r"\s+")
sst_anom

Now there are columns, but the first data row was used as the header, so the column names are 1880 and -0.1591!

In [None]:
sst_anom = pd.read_table(filename, sep=r"\s+", names=["year", "mean_anom"])
sst_anom
sst_anom

Since we only have 2 columns, and one of them (the year of the record) would be a more natural way to access the data, let's try using the `index_col` option:

In [None]:
sst_anom = pd.read_table(filename, sep=r"\s+", names=["year", "mean_anom"],
                                index_col=0)
sst_anom

Last step: the index is made of dates. Let's make that explicit:

In [None]:
sst_anom = pd.read_table(filename, sep=r"\s+", names=["year", "mean_anom"],
                                index_col=0, parse_dates=True)
sst_anom
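
The same sequence of options can be tried without the data file. The sketch below parses an in-memory string: the first row matches the values seen above, while the other rows are made up. It also builds the dates explicitly with `pd.to_datetime`, as an alternative to `parse_dates=True`.

```python
import io
import pandas as pd

# Whitespace-separated text with no header row (rows 2 and 3 are made up)
raw = "1880   -0.1591\n1881   -0.0789\n1882   -0.1313\n"

sst_demo = pd.read_table(io.StringIO(raw), sep=r"\s+",
                         names=["year", "mean_anom"], index_col=0)

# Turn the integer year labels into real dates (January 1st of each year)
sst_demo.index = pd.to_datetime(sst_demo.index.astype(str), format="%Y")
```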

### Dealing with missing values

In [None]:
sst_anom.tail()

In [None]:
# Convert the missing-value marker (-999) to NaN
sst_anom[sst_anom == -999.000] = np.nan
sst_anom.tail()

In [None]:
# Remove NaN values
sst_anom.dropna().tail()
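
The marker can also be declared while reading, through the `na_values` option of the reading functions. This is a minimal sketch on an in-memory string with hypothetical values; only the `-999.000` marker is taken from the data above.

```python
import io
import pandas as pd

# Hypothetical rows; the last one uses the missing-value marker
raw = "2014   0.7402\n2015   0.8990\n2016   -999.000\n"

demo = pd.read_table(io.StringIO(raw), sep=r"\s+",
                     names=["year", "mean_anom"], index_col=0,
                     na_values=["-999.000"])  # treat this text as NaN

n_missing = int(demo["mean_anom"].isnull().sum())
cleaned = demo.dropna()
```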

### Sea level dataset

The University of Colorado posts updated timeseries for mean sea level globally, per hemisphere, or even per ocean, sea, ... Let's download the global one, and the ones for the northern and southern hemispheres.

That will also illustrate that to load text files that are online, there is no more work than replacing the filepath by a URL in `read_table`:

In [None]:
northern_sea_level = pd.read_table("http://sealevel.colorado.edu/files/current/sl_nh.txt",
                                   sep=r"\s+")
northern_sea_level

In [None]:
southern_sea_level = pd.read_table("http://sealevel.colorado.edu/files/current/sl_sh.txt",
                                   sep=r"\s+")
southern_sea_level

In [None]:
# The 2016 version of the global dataset:
url = "http://sealevel.colorado.edu/files/2016_rel2/sl_ns_global.txt"
global_sea_level = pd.read_table(url, sep=r"\s+")
global_sea_level

### Creating new DataFrames

As shown before, `DataFrame`s can be created manually by grouping several Series together. Let's make a new frame from the 3 sea level datasets we downloaded above. They will be displayed along the same index.

Wait, does it make sense to do that?

In [None]:
# For two Series to share the same DataFrame the Series have to be aligned.
# Let's look at sea level data for NH and SH. Are they aligned?
southern_sea_level.year == northern_sea_level.year

In [None]:
# Could use Numpy:
np.all(southern_sea_level.year == northern_sea_level.year)

So the northern hemisphere and southern hemisphere datasets are aligned. What about the global one?

In [None]:
print(len(global_sea_level.year))
print(len(northern_sea_level.year))
len(global_sea_level.year) == len(northern_sea_level.year)

For now, let's just build a DataFrame with the 2 hemisphere datasets then. We will come back to add the global one later...

In [None]:
# A dictionary of Series
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"], 
                               "southern_hem": southern_sea_level["msl_ib(mm)"], 
                               "date": northern_sea_level.year})
mean_sea_level

There is still the date in a regular column and a numerical index that is not that meaningful (or useful). We can specify the `index` of a `DataFrame` at creation. Let's try:

In [None]:
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"], 
                               "southern_hem": southern_sea_level["msl_ib(mm)"]},
                               index = northern_sea_level.year)
mean_sea_level

#### What's going on?

In [None]:
# Note that
northern_sea_level["msl_ib(mm)"].index

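When a dict of Series is passed together with an explicit `index`, pandas aligns each Series to that index by label. The Series here still carry the default integer index (0, 1, 2, ...), and none of those labels appear in the requested index of dates, so every value comes out as NaN.

So, pass the underlying values instead of the Series when creating the DataFrame: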

In [None]:
mean_sea_level = pd.DataFrame({"northern_hem": northern_sea_level["msl_ib(mm)"].values, 
                               "southern_hem": southern_sea_level["msl_ib(mm)"].values},
                               index = northern_sea_level.year)
mean_sea_level
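
The effect can be reproduced on a tiny example. In this sketch the first date is taken from the data, while the remaining dates and all sea level values are hypothetical.

```python
import pandas as pd

# Stand-ins for the "year" column and one hemisphere's values
year = pd.Series([1992.9323, 1992.9595, 1992.9866], name="year")
north = pd.Series([13.7, 1.3, -13.1])  # default integer index 0, 1, 2

# Passing the Series: its labels 0, 1, 2 are looked up in the date index
misaligned = pd.DataFrame({"northern_hem": north}, index=year)

# Passing the raw values: they are simply placed along the date index
aligned = pd.DataFrame({"northern_hem": north.values}, index=year)
```

`misaligned` contains only NaN, while `aligned` holds the three values in order.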

Note the following:

In [None]:
mean_sea_level.index

The index name, `year`, is not an accurate description of what it indexes.

We can rename an index by setting its name.

For example, the index of the `mean_sea_level` DataFrame could be called `date` since it contains more than just the year:

In [None]:
mean_sea_level.index.name = "date"
mean_sea_level

### Adding columns

While building the `mean_sea_level` DataFrame earlier, we didn't include the values from `global_sea_level` since the years were not aligned. Adding a column to a DataFrame is as easy as adding an entry to a dictionary. So let's try:

In [None]:
mean_sea_level["mean_global"] = global_sea_level["msl_ib_ns(mm)"]
mean_sea_level

The column is full of NaNs again because pandas auto-alignment looks up the index values of `mean_sea_level` (such as `1992.9323`) in the index of the `global_sea_level["msl_ib_ns(mm)"]` Series and does not find them, since that Series still has the default integer index. Let's set its index to the years so that auto-alignment can work for us and figure out which values we have and which we don't:

In [None]:
global_sea_level

In [None]:
global_sea_level = global_sea_level.set_index("year")
global_sea_level["msl_ib_ns(mm)"]

In [None]:
mean_sea_level["mean_global"] = global_sea_level["msl_ib_ns(mm)"]
mean_sea_level

In [None]:
mean_sea_level.fillna(value=0).head()
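
A minimal sketch of the same alignment rule when adding a column, on made-up numbers: the new Series is matched to the frame by label, and labels it does not cover are left as NaN.

```python
import pandas as pd

# Made-up frame indexed by year
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[2000, 2001, 2002])

# The new column has no entry for 2001
extra = pd.Series([10.0, 30.0], index=[2000, 2002])

demo["b"] = extra  # aligned on the index, not on position
```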

**EXERCISE:** Create a new series containing the average of the 2 hemispheres minus the global value to see if that is close to 0. Work inside the mean_sea_level dataframe first. Then try with the original Series to see what happens with data alignment while doing computations.

### Reading from a local or remote HTML file

To be able to grab more local data about mean sea levels, we can download and extract data about mean sea level stations around the world from the PSMSL (http://www.psmsl.org/). Again, to download and parse all tables in a webpage, just give `read_html` the URL to parse:

In [None]:
# Needs `lxml`, `beautifulSoup4` and `html5lib` python packages
table_list = pd.read_html("http://www.psmsl.org/data/obtaining/")

In [None]:
len(table_list)

In [None]:
table_list

In [None]:
# there is 1 table on that page which contains metadata about the stations where 
# sea levels are recorded
local_sea_level_stations = table_list[0]
local_sea_level_stations

That table can be used to search for a station in a region of the world we choose, extract an ID for it and download the corresponding time series with the URL http://www.psmsl.org/data/obtaining/met.monthly.data/< ID >.metdata

The datasets that we obtain straight from the reading functions are pretty raw. A lot of pre-processing can be done during data read but we haven't used all the power of the reading functions. Let's learn to do a lot of cleaning and formatting of the data.

In [None]:
# The columns of the local_sea_level_stations aren't clean: they contain spaces and dots.
local_sea_level_stations.columns

In [None]:
# Let's clean them up a bit:
local_sea_level_stations.columns = [name.strip().replace(".", "") 
                                    for name in local_sea_level_stations.columns]
local_sea_level_stations.columns

### Global temperature climatology

Let's load a different file with temperature data. NASA's GISS dataset is written in chunks: look at it in `data/temperatures/GLB.Ts+dSST.txt`

In [None]:
!head data/temperatures/GLB.Ts+dSST.txt

In [None]:
giss_temp = pd.read_table("data/temperatures/GLB.Ts+dSST.txt", sep=r"\s+", skiprows=7,
                          skipfooter=11, engine="python")
type(giss_temp)
giss_temp

#### Exercise
What happens if you remove the `skiprows`? `skipfooter`? `engine`?

#### Exercise
Load some readings of CO2 concentrations in the atmosphere from the `data/greenhouse_gaz/co2_mm_global.txt` data file.
    1. Use read_table to load data
    2. Cast `year` and `month` columns into one date (hint use parse_dates)
    3. Remove `decimal` column.

In [None]:
# Internal nature of the object
print(giss_temp.shape)
print(giss_temp.dtypes)

Descriptors for the vertical axis (axis=0)

In [None]:
print(giss_temp.index)

Descriptors for the horizontal axis (axis=1)

In [None]:
giss_temp.columns

#### Recall: every column is a Series

A lot of information at once including memory usage:

In [None]:
giss_temp.info()

### Setting the index

In [None]:
# We didn't specify an index column when reading giss_temp;
# we can set it after the data has been read:
giss_temp = giss_temp.set_index("Year")
giss_temp.head()

Note that the `Year.1` column is redundant.

### Dropping rows and columns

In [None]:
giss_temp.columns

In [None]:
# Let's drop it:
giss_temp = giss_temp.drop("Year.1", axis=1) # axis=1 refers to the columns
giss_temp

In [None]:
# We can also just select the columns we want to keep (another way to drop columns)
giss_temp = giss_temp[[u'Jan', u'Feb', u'Mar', u'Apr', u'May', u'Jun', u'Jul', 
                       u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']]
# Note how we passed a List of column names

giss_temp

In [None]:
# Let's remove the last row (Year  Jan ...).
giss_temp = giss_temp.drop("Year")  # by  default drop() works on index axis (axis=0)
giss_temp
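
The two uses of `drop` can be compared side by side on a small made-up frame (the column names mimic the ones above, the values are placeholders):

```python
import pandas as pd

demo = pd.DataFrame({"Jan": [-30, -10], "Year.1": [1880, 1881]},
                    index=[1880, 1881])

without_column = demo.drop("Year.1", axis=1)  # axis=1: drop a column by name
without_row = demo.drop(1881)                 # default axis=0: drop a row by label
```

Both calls return a new frame and leave `demo` unchanged, which is why the notebook assigns the result back to `giss_temp`.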

Let's also set `****` to a real missing value (`np.nan`). This can often be done with a boolean mask, but that may trigger a pandas warning. Another way to assign based on a boolean condition is to use the `where` method:

In [None]:
#giss_temp[giss_temp == "****"] = np.nan # WARNING due to memory layout

# use .where
giss_temp = giss_temp.where(giss_temp != "****", np.nan)

In [None]:
giss_temp.tail()
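
`where` keeps the values for which the condition is True and substitutes the second argument everywhere else. A small sketch on made-up string data:

```python
import numpy as np
import pandas as pd

# Made-up string data with one "****" placeholder
demo = pd.DataFrame({"Jan": ["-20", "****"], "Feb": ["5", "12"]})

# Keep entries that differ from "****", replace the others with NaN
cleaned = demo.where(demo != "****", np.nan)
```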

Because of the labels (strings) found in the middle of the timeseries, every column was assumed to contain only strings (the values were not converted to floating point):

In [None]:
giss_temp.dtypes

That can be changed after the fact (and after the cleanup) with the `astype` method of a `Series`:

In [None]:
giss_temp["Jan"].astype("float32")

In [None]:
# Loop over all columns that had 'Object' type and make them 'float32'
for col in giss_temp.columns:
    giss_temp[col] = giss_temp[col].astype(np.float32)
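
As a possible alternative to the loop (not what this notebook does), `pd.to_numeric` with `errors="coerce"` converts each column and turns anything unparseable, such as `****`, into NaN in the same step. The frame below is made up.

```python
import pandas as pd

# Made-up string data with one "****" placeholder
raw = pd.DataFrame({"Jan": ["-20", "****"], "Feb": ["5", "12"]})

# apply calls pd.to_numeric on each column; errors="coerce" is passed through
numeric = raw.apply(pd.to_numeric, errors="coerce")
```

Unlike `astype(np.float32)`, this lets pandas pick the numeric type per column, so a column without missing values may stay integer.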

An index has a `dtype` just like any Series and that can be changed after the fact too.

In [None]:
giss_temp.index.dtype

For now, let's change it to an integer so that values can at least be compared properly.

In [None]:
giss_temp.index = giss_temp.index.astype(np.int32)

### Removing missing values

In [None]:
# This will remove any year that has a missing value. Use how='all' to keep partial years
giss_temp.dropna(how="any").tail()
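
The difference between `how="any"` and `how="all"` shows up on a small made-up frame with one partial year and one completely empty year:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Jan": [1.0, np.nan, np.nan],
                     "Feb": [2.0, 5.0, np.nan]},
                    index=[2014, 2015, 2016])

complete_years = demo.dropna(how="any")  # drop rows with at least one NaN
non_empty_years = demo.dropna(how="all") # drop rows only if every value is NaN
```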

In [None]:
# Replace (fill) NaN with 0 (or some other value, like -999)
giss_temp.fillna(value=0).tail()

In [None]:
# ffill = forward fill: This fills them with the previous year.
giss_temp.ffill().tail()

There is also a `.interpolate` method that works on a `Series`:

In [None]:
giss_temp.Aug.interpolate().tail()
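
To compare the three strategies, here is a sketch on a short made-up Series with a gap of two values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_zero = s.fillna(0).tolist()        # constant fill
filled_forward = s.ffill().tolist()       # repeat the last valid value
filled_linear = s.interpolate().tolist()  # linear interpolation (the default)
```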

For now, we will leave the missing values in all our datasets, because it wouldn't be meaningful to fill them.

**EXERCISE:** Go back to the reading functions, and learn more about other options that could have allowed us to fold some of these pre-processing steps into the data loading.

## <center>Basic visualization</center>

Once we are done with data wrangling it is easy to do visualization with Pandas. 

One can simply invoke `.plot` to generate basic line plots (pandas uses matplotlib under the covers).

In [None]:
sst_anom.plot()

In [None]:
giss_temp.plot(figsize=LARGE_FIGSIZE)

In [None]:
mean_sea_level.plot(subplots=True, figsize=(16, 12));

There are more plot options inside `pandas.tools.plotting`; for example:

In [None]:
mean_sea_level.plot(kind='kde', figsize=(12, 8));

In [None]:
# A boxplot
giss_temp.boxplot()

In [None]:
# Are there correlations between the northern and southern sea level timeseries we loaded?
from pandas.tools.plotting import scatter_matrix
scatter_matrix(mean_sea_level, figsize=LARGE_FIGSIZE);

## <center>Storing our work</center>

For each `read_**` function to load data, there is a `to_**` method attached to Series and DataFrames.
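
As a small sketch of this pairing, the example below round-trips a tiny made-up frame through CSV text. `io.StringIO` stands in for a file on disk, so nothing is written.

```python
import io
import pandas as pd

# A tiny made-up frame with a named index
demo = pd.DataFrame({"mean_anom": [0.5, -0.25]},
                    index=pd.Index([1880, 1881], name="year"))

buffer = io.StringIO()
demo.to_csv(buffer)   # the to_* side
buffer.seek(0)        # rewind before reading back

restored = pd.read_csv(buffer, index_col="year")  # the read_* side
```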

Another file format that is commonly used is Excel.

Multiple datasets can be stored in 1 file.

In [None]:
writer = pd.ExcelWriter("test.xls")

In [None]:
giss_temp.to_excel(writer, sheet_name="GISS temp data")
sst_anom.to_excel(writer, sheet_name="NASA sst anom data")

In [None]:
writer.close()

Another, more powerful file format for storing binary data is HDF5, which allows us to store both `Series` and `DataFrame`s without having to cast anything.

In [None]:
with pd.HDFStore("all_data.h5") as writer:
    giss_temp.to_hdf(writer, "/temperatures/giss")
    sst_anom.to_hdf(writer, "/temperatures/anomalies")
    mean_sea_level.to_hdf(writer, "/sea_level/mean_sea_level")
    local_sea_level_stations.to_hdf(writer, "/sea_level/stations")

In [None]:
%ls

**EXERCISE**: Add the greenhouse gas dataset in this data store. Store it in a separate folder.

## Extra material

### Bulk operations

Methods like sum() and std() work on entire columns. We can run our own functions across all values in a column (or row) using apply().

In [None]:
weather_data.date.tail()

We can use the `values` property of the column to get a NumPy array of its values. Inspecting the first value reveals that these are strings with a particular format.

In [None]:
first_date = weather_data.date.values[0]
first_date

Use the <font color='red'>strptime</font> function from the <font color='red'>datetime</font> module.

In [None]:
from datetime import datetime
dt = datetime.strptime(first_date, "%Y-%m-%d")
print(dt)

Using the <font color='red'>apply()</font> method, which takes a function (here an <font color='blue'>anonymous function</font>, i.e. a `lambda`), we can apply strptime to each value in the column. We'll overwrite the string date values with their Python datetime equivalents.

In [None]:
weather_data.date = weather_data.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
weather_data.date.head()
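
For comparison, pandas also ships a vectorized parser, `pd.to_datetime`, which converts a whole column without an explicit `apply`. The sketch below runs both on two made-up date strings in the same `%Y-%m-%d` format.

```python
from datetime import datetime

import pandas as pd

# Two made-up date strings
dates = pd.Series(["2015-10-10", "2015-10-11"])

via_apply = dates.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
via_to_datetime = pd.to_datetime(dates, format="%Y-%m-%d")

same_result = bool((via_apply == via_to_datetime).all())
```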

Let's go one step further. Each row in our DataFrame represents the weather from a single day. Each row in a DataFrame is associated with an index, which is a label that uniquely identifies a row.

Our row indices up to now have been auto-generated by pandas, and are simply integers from 0 to 365. If we use dates instead of integers for our index, we will get some extra benefits from pandas when plotting later on. Overwriting the index is as easy as assigning to the <font color='red'>index</font> property of the DataFrame.

In [None]:
weather_data.index = weather_data.date
weather_data.head()

Now we can quickly look up a row by its date with the <font color='red'>loc[]</font> indexer.

In [None]:
weather_data.loc[datetime(2016, 10, 8)]  # Lots of rain/wind on this date (hurricane Matthew)
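
A date index supports more than single lookups. This sketch uses a made-up three-day frame (the precipitation values are placeholders) to show a label lookup with a `datetime` and a slice between two date strings.

```python
from datetime import datetime

import pandas as pd

# Made-up daily data indexed by date
idx = pd.to_datetime(["2016-10-07", "2016-10-08", "2016-10-09"])
demo = pd.DataFrame({"precipitation": [0.0, 2.5, 0.1]}, index=idx)

one_day = demo.loc[datetime(2016, 10, 8)]       # a single row, as a Series
two_days = demo.loc["2016-10-08":"2016-10-09"]  # label slice, both ends included
```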

In [None]:
weather_data.columns

In [None]:
weather_data.precipitation.plot(figsize=LARGE_FIGSIZE)

In [None]:
weather_data.max_temp.tail().plot()

In [None]:
weather_data.max_temp.tail().plot(kind="bar", rot=10)

The <font color='red'>plot()</font> function returns a matplotlib <font color='red'>Axes</font> object (shown as `AxesSubplot` in older matplotlib versions). You can pass this object into subsequent calls to plot() in order to compose plots.

In [None]:
ax = weather_data.max_temp.plot(title="Min and Max Temperatures")
weather_data.min_temp.plot(style="red", ax=ax)
ax.set_ylabel("Temperature (F)")