
##### CSCI 303
# Introduction to Data Science
<p/>

### 9 - pandas basics

![pandas logo](pandas_logo.png)

## This Lecture
---
- Learn pandas basics

The obligatory setup code...

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
import sklearn.datasets

%matplotlib inline

## pandas
---
Python toolkit for data analysis

- provides Series and DataFrame data structures
- DataFrame type inspired by R
- designed to interact with the whole Python data science stack
- eases many of the data science tasks, particularly data "wrangling"

## Series
---
A one-dimensional array-like object:

- contains a sequence of values of one particular type
- has an associated array of *index* labels
  - labels do not have to be integers
  - labels do not have to be unique
  - labels do not have to be sequential
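
As a minimal sketch of the points above (the variable names and values are made up for illustration), a Series reports a single `dtype`, and its index labels can be repeated and out of order:

```python
import pandas as pd

# Hypothetical example data: integer labels that are neither unique nor sequential
temps = pd.Series([20.5, 21.0, 19.5], index=[10, 3, 10])

print(temps.dtype)            # one dtype for the whole Series
print(temps.index.is_unique)  # False: the label 10 appears twice

# Mixing ints and floats upcasts everything to a single common type
mixed = pd.Series([1, 2.5, 3])
print(mixed.dtype)
```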

Like a NumPy array, a Series can be constructed from any iterable:

In [None]:
from pandas import Series

s = Series([42, 17, 99])
s

The *index* is shown on the left

- default: RangeIndex (representing sequential integers)
- access index via `index` property of the Series object

In [None]:
s.index

There is also a `values` property:

In [None]:
s.values
# shows the values in the Series

Things get interesting when you use *labels* for the index:

In [None]:
s = Series([42, 17, 99], index=['apple', 'pear', 'orange']) 
s

Like a dictionary:

- associate values with labels
- retrieve values via [ ] operator

Unlike a dictionary:

- retain original order
- labels can duplicate

In [None]:
s2 = Series([42, 17, 99, 3.1415], index=['apple', 'pear', 'orange', 'apple'])
s2

In [None]:
s2['orange']

In [None]:
#fields = ['orange', 'pear']
s2[['orange', 'pear']] # can access more than one value at a time

In [None]:
s2['apple']  # a duplicated label returns every matching value, as a Series

In [None]:
test = Series([1,2,3],index=['foo',17,True])
test

Note that the lookup by list (`s2[['orange', 'pear']]`) and the lookup by duplicated label (`s2['apple']`) both returned Series objects.
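
The dictionary analogy extends a little further. The sketch below rebuilds the same illustrative values under a different name (`fruit`, chosen here) so it can run on its own. It shows membership tests on labels, `get` with a default, and how the return type depends on the lookup:

```python
import pandas as pd

fruit = pd.Series([42, 17, 99, 3.1415],
                  index=['apple', 'pear', 'orange', 'apple'])

# 'in' tests the index labels, not the values (like a dict)
print('pear' in fruit)
print(42 in fruit)   # 42 is a value, not a label

# get() returns a default instead of raising KeyError
print(fruit.get('kiwi', 0.0))

# A unique label yields a scalar; a duplicated label or a list yields a Series
print(type(fruit['orange']))
print(type(fruit['apple']))
print(type(fruit[['orange', 'pear']]))
```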

You can apply math and other NumPy-like operations:

In [None]:
s2

In [None]:
s2 = s2 * 2

In [None]:
np.cos(s2)

Data aligns by label in arithmetic operations:

In [None]:
s3 = Series([1, 2, 3, 4], ['a', 'b', 'c', 'd'])
s4 = Series([5, 6, 7, 8, 9], ['d', 'b', 'a', 'e', 'd'])
s3 + s4

In [None]:
s5 = Series(['hello', 'goodbye', np.nan], index=['a','b','c'])  # np.nan (np.NaN was removed in NumPy 2.0)
s5

Note that the unmatched labels in `s3 + s4` turned into NaNs, the pandas notation for missing data.
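
Here is a small sketch of some common ways to deal with those NaNs. It uses two made-up Series with unique labels (`left` and `right` are names chosen for this example) to keep the alignment easy to follow:

```python
import pandas as pd

# Illustrative Series; 'a' and 'd' each appear on only one side
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = left + right
print(total)            # 'a' and 'd' are NaN: no partner on the other side
print(total.isna())     # Boolean mask of the missing entries
print(total.dropna())   # keep only the labels present in both

# add() with fill_value treats a missing partner as 0 instead of producing NaN
filled = left.add(right, fill_value=0)
print(filled)
```

Whether to drop or fill missing values depends on what a missing label means in your data, so treat `fill_value=0` as one option and not a default.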

Series objects can also be *named*, via the `name` property:

In [None]:
s2.name = 'tonnes'
s2

The index can also be named:

In [None]:
s2.index.name = 'fruit'
#s2['orange']
s2
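
One place these names pay off is when a Series is converted to a table. The sketch below uses a fresh illustrative Series (`weights` is a name chosen for this example) and `reset_index`, which turns the index into an ordinary column:

```python
import pandas as pd

weights = pd.Series([42, 17, 99], index=['apple', 'pear', 'orange'])
weights.name = 'tonnes'
weights.index.name = 'fruit'

# reset_index() moves the index into a regular column;
# the index name and the Series name become the column labels
table = weights.reset_index()
print(table.columns.tolist())
print(table)
```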

## DataFrame
---
A data structure which functions much like a database table

- ordered collection of columns, each of a specific type
- column index labels the columns, similar to attribute names
- row index labels rows, similar to a primary key

However, more complex than a database table (and more powerful!)

You can make a DataFrame object from a dictionary object:

In [None]:
from pandas import DataFrame

df = DataFrame(
    {'fruit' : ['apple', 'orange', 'peach', 'apple'],
     'tonnes' : [42, 17, 99, 3.1415],
     'type' : ['pome', 'citrus', 'drupe', 'pome']})

df.index = ['crate a', 'crate b', 'crate w', 'crate f']
df
print(df)
print(df[:2][['fruit','tonnes']]) # shows the first two crates in the dataframe

In [None]:
# df.head(n) shows only the first n rows (the default is n=5)
df.head(2)

...although mostly we'll be getting DataFrames in other ways, such as from external sources.
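
As a rough sketch of what loading from an external source looks like, the example below reads CSV text with `pd.read_csv`. The CSV contents are invented, and an in-memory `io.StringIO` buffer stands in for a real file so the example runs anywhere:

```python
import io
import pandas as pd

# Stand-in for a file on disk; the contents are invented for illustration
csv_text = """location,fruit,tonnes
crate a,apple,42
crate b,orange,17
crate w,peach,99
"""

# read_csv accepts a path, URL, or file-like object; index_col picks the row index
crates = pd.read_csv(io.StringIO(csv_text), index_col='location')
print(crates)
print(crates.dtypes)   # column types are inferred from the text
```

With a real file you would pass its path in place of the `StringIO` object.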

DataFrame objects have much of the same extensible naming/indexing as Series objects:

In [None]:
df.index = ['crate 1', 'crate 2', 'crate 16', 'crate 11']
df

In [None]:
df.index.name = 'location'
df

You access columns by name, using either [ ] or the . operator:

In [None]:
df['fruit']  # or df.fruit

In [None]:
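df[['tonnes', 'fruit']]  # a list of names selects several columns (result not displayed here)
first_row = df[:1]       # a slice selects rows: this is a one-row DataFrame, not a Series
first_row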

However, note that slicing notation applies to rows:

In [None]:
df[1:3]

You can more precisely access rows by label or position using the special `loc` and `iloc` indexers (*not methods!* They take square brackets, not parentheses):

In [None]:
df.loc['crate 16', ['fruit','tonnes']]

In [None]:
df.loc[:'crate 16', ['type', 'tonnes']]

In [None]:
df.iloc[1:3,1:2]

In [None]:
df.iloc[3]
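
One detail worth seeing side by side is how the two indexers treat slice endpoints. The sketch below rebuilds a similar crate table under a different name (`crates`, chosen for this example) so it is self-contained:

```python
import pandas as pd

crates = pd.DataFrame(
    {'fruit': ['apple', 'orange', 'peach', 'apple'],
     'tonnes': [42, 17, 99, 3.1415]},
    index=['crate 1', 'crate 2', 'crate 16', 'crate 11'])

# loc slices by label and INCLUDES the end label
by_label = crates.loc['crate 1':'crate 16']
# iloc slices by position and EXCLUDES the end position, like a Python list
by_position = crates.iloc[0:2]

print(len(by_label))      # 3 rows: crate 1, crate 2, crate 16
print(len(by_position))   # 2 rows: positions 0 and 1

# A row label plus a column label picks out a single value
print(crates.loc['crate 2', 'tonnes'])
```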

There's also Boolean indexing:

In [None]:
df[df['fruit']=='apple']

In [None]:
df[df.tonnes > 20]
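
Boolean masks can also be combined. Here is a minimal sketch on a made-up copy of the crate data (`crates` is a name chosen for this example), using `&` for "and" together with the `isin` method:

```python
import pandas as pd

crates = pd.DataFrame(
    {'fruit': ['apple', 'orange', 'peach', 'apple'],
     'tonnes': [42, 17, 99, 3.1415]})

# Combine masks with & (and), | (or), ~ (not); each comparison needs parentheses
big_apples = crates[(crates['fruit'] == 'apple') & (crates['tonnes'] > 20)]
print(big_apples)

# isin() tests membership in a collection of values
stone_or_citrus = crates[crates['fruit'].isin(['peach', 'orange'])]
print(stone_or_citrus)
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why the `&` and `|` operators are used.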

Confused yet?

We'll explore these further as needed.  Don't forget the pandas documentation under the Help menu in your notebook!

Also, here's a ["cheat sheet"](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf).

## The California Housing Dataset (replaces the removed Boston Housing Dataset)
---
A well-known and heavily studied dataset for statistical inference.

Available in the scikit-learn package, or from many sources online.

In [None]:
from sklearn.datasets import fetch_california_housing    
raw = fetch_california_housing()

In [None]:
raw.keys()

In [None]:
print(raw.DESCR)

We can view the raw data and target arrays...

In [None]:
raw.data

In [None]:
raw.target

Instead, let's load the data into a DataFrame where we can explore it a bit more easily.

Along the way, we'll explore some of the DataFrame object's interface.

In [None]:
cali = DataFrame(raw.data, columns=raw.feature_names)

In [None]:
cali

Adding/deleting a column is simple:

In [None]:
cali['Target'] = raw.target
#del cali['Target']
cali[:10]
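
To make the difference between the options concrete, here is a sketch on a tiny made-up table (not the housing data). The column name `kg` and the conversion are purely illustrative:

```python
import pandas as pd

crates = pd.DataFrame({'fruit': ['apple', 'orange'], 'tonnes': [42.0, 17.0]})

# Assigning to a new name adds a column; here it is derived from an existing one
crates['kg'] = crates['tonnes'] * 1000

# drop() returns a new DataFrame and leaves the original untouched
without_kg = crates.drop(columns=['kg'])

# del removes the column from the original in place
del crates['tonnes']

print(crates.columns.tolist())       # ['fruit', 'kg']
print(without_kg.columns.tolist())   # ['fruit', 'tonnes']
```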

## Basic Statistics
---
pandas provides the `describe` method (similar to R's `summary`):

In [None]:
cali.describe()

pandas has other convenience methods.  How about pairwise correlations in the data?

In [None]:
cali.corr()
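
A full correlation matrix can be hard to read, so one common follow-up is to pull out a single column and sort it. The sketch below uses synthetic data (not the housing set) so it runs without a download. The column names and the relationship `target = 3 * x + noise` are assumptions made for the example:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data with a seeded generator for repeatability
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    'x': x,
    'noise': rng.normal(size=200),
    'target': 3 * x + rng.normal(scale=0.1, size=200),
})

# Take the 'target' column of the correlation matrix, drop its self-correlation,
# and rank the remaining variables from most to least positively correlated
ranked = demo.corr()['target'].drop('target').sort_values(ascending=False)
print(ranked)
```

The same pattern should apply to `cali.corr()['Target']` once the housing DataFrame has been built.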

We can take sums, means, standard deviations, etc. by row or column:

In [None]:
cali.mean()

In [None]:
cali.sum(axis=1)[:10] 
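
The `axis` argument is easy to get backwards, so here is a tiny sketch on a made-up table where the results can be checked by hand:

```python
import pandas as pd

scores = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default) collapses the rows: one result per column
print(scores.sum(axis=0).tolist())   # [6, 60]

# axis=1 collapses the columns: one result per row
print(scores.sum(axis=1).tolist())   # [11, 22, 33]

# The same convention applies to mean, std, min, max, and so on
print(scores.mean(axis=1).tolist())  # [5.5, 11.0, 16.5]
```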

## Next Time
---
Next lecture, we'll do some exploratory data analysis on the California housing set.

![Exploratory data analysis plots](eda.png) 