# Pandas
## What is Pandas?

Pandas is a Python library that provides rich functionality on top of NumPy. In addition to 'Excel-like' tables, Pandas works well with NumPy constructs and scikit-learn.

In [None]:
import pandas as pd
import numpy as np

## Pandas Series

Pandas Series are like 1-dimensional NumPy arrays, except that they are _labeled_: every element has an index. In addition, the elements can be numbers, bools, strings, datetime objects, function objects, etc.

In [None]:
int_series = pd.Series( np.random.randint(0, 10, 10) ) #random integers in [0, 10)
int_series.head()
int_series.tail(2)

In [None]:
num_series = pd.Series( np.random.random(10) )
num_series.head()

In [None]:
str_series = pd.Series([x for x in 'abcdefg'])
str_series.head()

In [None]:
tup_series = pd.Series([(x,1) for x in range(10)])
tup_series.head()

In [None]:
fun_series = pd.Series( [map for x in range(5) ])
fun_series.head()

## Indexes
(These are not the same as the _labels_ mentioned in supervised learning.)
Each row of a pandas Series has an index by default. You can also specify your own indices to perform fast lookups, grouping operations, and descriptive statistics associated with those indices.

Indices can be strings, integers, or even dates and times.

In [None]:
indSeries1 = pd.Series(np.random.random(5), index = ['CA','AK','IL','IN','NY'])

In [None]:
indSeries2 = pd.Series(np.random.random(3), index=['CA','IL','WA'])

In [None]:
indSeries1 + indSeries2
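
Arithmetic between two Series aligns on the index rather than on position. The sketch below uses small made-up series (`s1`, `s2` and their values are illustrative, not from the cells above) to show that labels present in only one operand come out as NaN, and that `add` with `fill_value` is one way to avoid that.

```python
import pandas as pd

# Two small series with partially overlapping labels (illustrative values)
s1 = pd.Series([1.0, 2.0, 3.0], index=['CA', 'AK', 'IL'])
s2 = pd.Series([10.0, 20.0], index=['CA', 'WA'])

# Addition aligns on the index: only 'CA' appears in both series
total = s1 + s2
print(total)  # labels found in just one series are NaN

# add() with fill_value treats a missing label as 0 instead of NaN
filled = s1.add(s2, fill_value=0)
print(filled)
```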

In [None]:
datSeries = pd.Series(np.random.random(5), index= pd.date_range('2015-01-01','2015-06-01',freq='M')) #month-end dates; newer pandas versions prefer the alias 'ME'
datSeries

In [None]:
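datSeries.resample('Q').mean() #resample needs an aggregation to produce values; newer pandas versions prefer the alias 'QE'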
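
Resampling only defines the time bins; an aggregation such as `mean` has to follow before any values come back. A minimal sketch with deterministic, made-up data (the names `daily` and `monthly_mean` are illustrative), using the month-start alias `'MS'`:

```python
import pandas as pd

# Six daily observations spanning two calendar months (illustrative values)
idx = pd.date_range('2015-01-30', periods=6, freq='D')
daily = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# resample() groups the rows into monthly bins labeled by month start;
# mean() then reduces each bin to a single value
monthly_mean = daily.resample('MS').mean()
print(monthly_mean)
```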

## DataFrame
DataFrames are extensions of Series into tables. They can have multiple indices (rows) and columns. Think of a DataFrame as a set of Series stacked side by side, all sharing the same index.

In [None]:
df = pd.DataFrame(np.random.random((5,5)))
df

In [None]:
df_idx = pd.date_range('2015-01-01','2015-01-05',freq='d')
df_col = ['sun','mon','tues','wed','thurs']

In [None]:
df = pd.DataFrame(np.random.random((5,5)),index=df_idx,columns=df_col)
df.head()

In [None]:
df.mean()
df.min()
df.cumsum()
df.describe()

In [None]:
df.columns #List the column labels of df

In [None]:
df.index # List the row labels of df

In [None]:
df.sun # the columns can be addressed directly as pandas series

In [None]:
df['sun'] # this form is preferred; it also works when the label is not a valid attribute name

In [None]:
df[ ['sun'] ] # A single label returns a Series; a list of labels returns a DataFrame

In [None]:
df[ ['sun','mon'] ] # Can just select certain columns

In [None]:
df.loc['2015-01-01'] # rows can be addressed by label through loc, also returns a series

In [None]:
df.iloc[0] # same row as above, addressed by integer position (.ix was removed in pandas 1.0)

In [None]:
df['fri'] = 1.0* df['wed'] + 2.0 * df['thurs']
df

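Be careful with `axis` here. As in numpy, axis=0 refers to rows and axis=1 refers to columns, so dropping a column label requires axis=1. Note that `drop` returns a new DataFrame; `df` itself is unchanged unless you reassign it.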

In [None]:
df.drop('fri',axis=1)

In [None]:
df.drop('2015-01-01',axis=0) #may raise a KeyError, depending on the pandas version: the label is a string, but the index holds Timestamps

In [None]:
type(df.index)

In [None]:
df.drop( pd.Timestamp('2015-01-01'), axis = 0)
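
A small sketch of how `axis` behaves, on a made-up two-by-two frame (`small` and its labels are illustrative). For `drop`, the axis says where the label lives; for reductions such as `sum`, it says which direction is collapsed.

```python
import pandas as pd

small = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r1', 'r2'])

no_col = small.drop('b', axis=1)   # remove a column label
no_row = small.drop('r1', axis=0)  # remove a row label

# drop returns a new frame; the original keeps its shape
print(small.shape, no_col.shape, no_row.shape)

col_sums = small.sum(axis=0)  # collapse the rows: one value per column
row_sums = small.sum(axis=1)  # collapse the columns: one value per row
```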

In [None]:
df.loc[ pd.Timestamp('2015-01-01') ] = pd.Series(np.random.random(6),index=df.columns)

You can also create dataframes using a dictionary of objects, one entry per column. Scalar values are repeated for every row.

In [None]:
df2 = pd.DataFrame({'a': 1.,
                    'b': pd.Timestamp('2015-01-01'),
                    'c': pd.Series(np.random.random(10),index=list(range(10))),
                    'd': 'foo'})
df2

## Subsetting  dataframes

Subsetting dataframes works similarly to numpy, but with some additional functionality.

In [None]:
df[ df['sun'] > .5 ] #Subset certain rows, where the 'sun' column for that row is greater than .5

In [None]:
df[ (df['sun'] > .5) & (df['mon'] < .6) ] #multiple conditions: wrap each condition in parentheses and combine with & or |
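
The boolean operators `&`, `|`, and `~` work element-wise on the masks, and the parentheses are required because these operators bind more tightly than comparisons. A minimal sketch on a made-up frame (`scores` and its values are illustrative):

```python
import pandas as pd

scores = pd.DataFrame({'sun': [0.2, 0.7, 0.9],
                       'mon': [0.5, 0.4, 0.8]})

high_sun = scores['sun'] > 0.5   # boolean Series: False, True, True
low_mon = scores['mon'] < 0.6    # boolean Series: True, True, False

both = scores[high_sun & low_mon]        # rows where both conditions hold
either = scores[high_sun | low_mon]      # rows where at least one holds
neither = scores[~(high_sun | low_mon)]  # rows where none holds
```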

## Advanced Subsetting

We've subsetted portions of the columns and rows. What if we need to select from both the columns and rows?

#### Indexing functions summary

Pandas Dataframes support various methods for indexing:

- .iloc <- Index by integer/position
- .loc  <- Index by labels
- .ix   <- Supported both label and integer/positional indexing; removed in pandas 1.0, so use .loc or .iloc instead
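
A short sketch contrasting label-based and position-based slicing on a made-up frame (`frame` and its labels are illustrative). When labels and positions need to be mixed, one option is to convert a label to a position with `get_loc`.

```python
import pandas as pd

frame = pd.DataFrame({'x': [10, 20, 30, 40], 'y': [1, 2, 3, 4]},
                     index=['a', 'b', 'c', 'd'])

by_label = frame.loc['b':'c', 'x']   # label slice: both endpoints included
by_position = frame.iloc[1:3, 0]     # positional slice: end excluded

# Mixing the two: look up the column's position, then use iloc throughout
col_pos = frame.columns.get_loc('y')
mixed = frame.iloc[0:2, col_pos]
```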

In [None]:
df.iloc[1:3,0:2] #Integer indexing into the rows and columns using noninclusive endpoints

In [None]:
df.loc['2015-01-02':'2015-01-03',['sun','mon'] ] #NOTE the date range slice is INCLUSIVE

In [None]:
df.loc[ '2015-01-01':'2015-01-03', df.columns[0:2] ] #Label rows with positional columns: the column positions are converted to labels first

In [None]:
df3 = pd.DataFrame(np.random.randn(5))
type(df3)


In [None]:
type(df3.iloc[:,0]) #Note that the type changed: a single column is a Series

In [None]:
type(df3.iloc[:,0:1]) #Still a dataframe, because a slice was used

### Multiple indexing
Pandas allows you to have more than one level of indices on the rows or the columns.


In [None]:
df3 = pd.DataFrame(np.random.randn(30,5),index=pd.date_range('2015-1-1','2017-7-1',freq='M')) #30 month-end dates; newer pandas versions prefer the alias 'ME'

In [None]:
df3['blah'] = ['b1','b2','b3']*10
df3.head()

In [None]:
df3 = df3.reset_index() #Reset the index of df3, set it as a new column
df3 = df3.set_index(['blah','index'])

In [None]:
df3.index.names = ['blah','date']
df3.head()

In [None]:
df3.loc['b1'] #Works!

In [None]:
df3.loc[ [pd.Timestamp('2015-01-31')]  ] #doesnt work: the first index level is 'blah', not the date

In [None]:
df3.loc[('b1','2015-01-31')] # works

In [None]:
# You can address the index levels from left to right, stopping at any level
df4 = pd.DataFrame(np.random.randn(8),index=['idx'+str(x) for x in range(8)])
df4['idx2'] = ['foo','foo','bar','bar']*2
df4['idx3'] = ['fah','bah']*4

In [None]:
df4 = df4.reset_index().set_index(['index','idx2','idx3'])
df4.index.names = ['idx1','idx2','idx3']
df4
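
A minimal sketch of partial indexing on a two-level index, using a made-up frame (`multi` and the level names `outer`/`inner` are illustrative). Plain `.loc` addresses levels from the left; `xs` is one way to select on an inner level alone.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [('b1', 'x'), ('b1', 'y'), ('b2', 'x'), ('b2', 'y')],
    names=['outer', 'inner'])
multi = pd.DataFrame({'val': [1, 2, 3, 4]}, index=mi)

outer_only = multi.loc['b1']                # leftmost level only: rows keyed by 'inner'
exact = multi.loc[('b2', 'y'), 'val']       # both levels given: a single value
inner_only = multi.xs('x', level='inner')   # inner level only: rows keyed by 'outer'
```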

## Group By: Split-Apply-Combine
Pandas provides powerful aggregation within 'groups'. The process involves:
* **Splitting** the data into groups based on criteria
* **Applying** a function to each of the groups independently
* **Combining** the groups back together into a dataframe

The **Apply** step can be any function, such as aggregating values (mean, min, median, count, etc.), transforming values (similar to the winsorization example), or filtering (removing data).

The most similar paradigm is a SQL statement such as:
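
To make the three steps concrete, here is a sketch that performs them by hand on a made-up frame and then compares the result with `groupby` (`data`, `pieces`, and the column names are illustrative). The manual version is only for exposition; `groupby` does the same work in one call.

```python
import pandas as pd

data = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                     'value': [1, 2, 3, 4]})

# Split: one sub-frame per distinct key
pieces = {k: data[data['key'] == k] for k in data['key'].unique()}

# Apply: compute the mean of each piece independently
means = {k: piece['value'].mean() for k, piece in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(means, name='value').sort_index()

# The same computation with groupby
built_in = data.groupby('key')['value'].mean()
```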
```
SELECT column1, AVG(column2), MAX(column3)
FROM TheTable
GROUP BY column1
```
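
A rough pandas counterpart to a grouped SQL query: group by one column, then take the mean of a second column and the max of a third. The frame `the_table` and its values are made up for illustration.

```python
import pandas as pd

the_table = pd.DataFrame({'column1': ['x', 'x', 'y'],
                          'column2': [1.0, 3.0, 5.0],
                          'column3': [7, 9, 8]})

# Group on column1, then aggregate each remaining column with its own function
result = (the_table
          .groupby('column1')
          .agg({'column2': 'mean', 'column3': 'max'})
          .reset_index())  # turn the group key back into an ordinary column
print(result)
```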

In [None]:
df = pd.DataFrame({'A': ['foo','bar','foo','bar',
                         'foo','bar','foo','foo'],
                   'B': ['one','one','two','three',
                         'two','two','one','three'],
                   'C': np.random.randint(0,10,8),
                   'D': np.random.randint(0,10,8)})
df

In [None]:
grouped = df.groupby('A') #groupby object
grouped = df.groupby(['A','B']) #creates groups based on distinct combn of A and B
#Groupby does NOT split. It just validates a correct mapping of labels to group names
df

In [None]:
for name, group in grouped:
    print("Name:", name)
    print("Group:", group)

In [None]:
#Can also split on columns, and even based on your own rules
def tmp(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'


grouped = df.groupby(tmp, axis=1) #tmp is called on each column label; note that axis=1 is deprecated in newer pandas versions
grouped.get_group('consonant')

In [None]:
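#Various descriptive stats measured on each of the groups
grouped = df.groupby(['A','B']) #regroup by rows; the column-wise groups above mix strings and integers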
grouped.all() # All of the elements are true (or coercible to true)
grouped.any()
grouped.count()
grouped.sum()
grouped.groups #Returns a dict of the groups. Keys are the group titles, values are axis labels
#Grouped itself is an object

In [None]:
df4 = pd.DataFrame(np.random.randn(8),index=['idx'+str(x) for x in range(8)],columns=['val'])
df4['idx2'] = ['foo','foo','bar','bar']*2
df4['idx3'] = ['fah','bah']*4
df4 = df4.reset_index().set_index(['index','idx2','idx3'])
df4.index.names = ['idx1','idx2','idx3']

#Once you have groupings, you can perform functions on them
grouped = df4.groupby(level=2) # Groupby using the 3rd index
grouped.mean()

In [None]:
# OR you can specify your own aggregation functions
grouped.agg('mean') #accepts a function or the name of a built-in aggregation

In [None]:
# can handle multiple aggregations on the columns through dicts
grouped.agg({'val': ['mean','min','max']})

In [None]:
#you can also transform the data group by group; transform returns a new object and leaves df4 unchanged
def zScore(x):
    return (x - x.mean()) / x.std()

grouped = df4.groupby(level=2)
grouped.transform(zScore)
#Group by 'fah' and 'bah', then z-score the elements within each group
df4

In [None]:
#apply is similar to transform, but the function receives each group as a whole and the results are combined
grouped.apply(lambda g: g.mean())

In [None]:
#May not do what you expect: transform broadcasts the result, so each element of 'fah' gets the same mean
grouped.transform('mean')
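
The difference between aggregating and transforming is easiest to see in the shape of the result. A minimal sketch on a made-up frame (`vals` and its values are illustrative): `agg` returns one row per group, while `transform` returns one row per original row, which makes it convenient for group-wise centering.

```python
import pandas as pd

vals = pd.DataFrame({'grp': ['fah', 'bah', 'fah', 'bah'],
                     'val': [1.0, 2.0, 3.0, 6.0]})
g = vals.groupby('grp')['val']

reduced = g.agg('mean')          # one value per group
broadcast = g.transform('mean')  # group mean repeated for every original row

# Because broadcast lines up with vals, it can be used directly in arithmetic
centered = vals['val'] - broadcast
```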

## Exercise

Pandas can also load a dataset directly from a URL using the read_csv function.

Download the dataset with the command below. Note that this may take a while; in the notebook, an asterisk to the left of the cell means Python is still running.
```
import pandas as pd
chi = pd.read_csv('https://data.cityofchicago.org/api/views/4ijn-s7e5/rows.csv?accessType=DOWNLOAD')
chi.head()
```

What is the shape of this data set? How many rows and columns are there?

How many distinct cities are in this dataset?

What is the most common Inspection Type? Hint: Use `groupby` and `idxmax`
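
As a warm-up for the hint, here is the same pattern on a toy frame. The column name `kind` and its values are made up and are not the Chicago dataset's actual columns; substitute the real column names once you have inspected `chi.columns`.

```python
import pandas as pd

toy = pd.DataFrame({'kind': ['canvass', 'complaint', 'canvass',
                             'license', 'canvass']})

print(toy.shape)  # (rows, columns)

n_distinct = toy['kind'].nunique()  # number of distinct values in a column

counts = toy.groupby('kind').size()  # rows per group
most_common = counts.idxmax()        # label of the largest group
```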