# Pandas for Data Manipulation
### USI ARA AAM, Nov 19

## Notebook Material Credits
- Alfred Essa ( @alfessa )
- Harrison Kinsley ( @sentdex )
- Verena Kaynig-Fittkau (Harvard CS)

## What is pandas?

Pandas is one of the most popular tools for data wrangling in Python. In essence, a pandas DataFrame is the equivalent of a data frame in R. Pandas is also tightly integrated with NumPy and matplotlib, which makes its data easy to feed into models (in scikit-learn) and plots.

## Importing modules

To start off we shall import the most basic modules that work with pandas.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# this allows you to plot and display within the notebook environment
%matplotlib inline    

# Pandas Series

The most basic pandas data structure is a *Series* object. A Series **duck-types** as a NumPy array as well as a dictionary: it supports positional, array-style access and key-based, dict-style access.

In [None]:
s1 = pd.Series([33, 19, 15, 89, 11, -5, 9])
print(type(s1))
print(s1)
print(s1[0], s1[3])

## Index in a Series

A pandas Series has an **index** that you can optionally specify. If unspecified, it becomes a simple serial number like above. The index is like a reference to the row - it says what that entry is about. Here is an example of a Series that records the temperature on 7 days of a week. 

In [None]:
data1= [33,19,15,89,11,-5,9]
index1= ['mon','tue','wed','thu','fri','sat','sun']
s2= pd.Series(data1, index=index1)
print(s2, "\n\n")

print(s2['mon'], "\n\n")         # single label
print(s2['wed':'fri'], "\n\n")   # label slice, 'fri' is included
print(s2[2:4])                   # positional slice, position 4 is excluded
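
The two slices above behave differently at the end point, which is easy to miss. The following is a minimal sketch (written for Python 3 and a recent pandas; the name `temps_week` is made up for this illustration) that rebuilds the same Series and compares a label slice with a positional slice.

```python
import pandas as pd

# Same data and labels as the temperature example above
temps_week = pd.Series([33, 19, 15, 89, 11, -5, 9],
                       index=['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun'])

by_label = temps_week['wed':'fri']     # label slice: the end label is included
by_position = temps_week.iloc[2:4]     # positional slice: the end position is excluded

print(list(by_label.index))
print(list(by_position.index))
```

Using `.iloc` for the positional slice makes the intent explicit, which helps when the index itself contains integers.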

It is possible for you to give names to your index and data.

In [None]:
s2.name = 'Daily temperature'
s2.index.name = 'Weekday'
print(s2)

## Series from a dictionary

As you might have guessed by now, the reason a Series behaves like a dictionary is this index. Thus, you can create a Series out of a dict, and the dict keys become the index labels. There is a catch, though. Older versions of pandas (before 0.23, or any version running on Python older than 3.6) *rearrange* the labels in alphabetic / numeric sort order, which might not be what you wanted; newer versions keep the dict's insertion order.

In [None]:
dict1= {'mon':33, 'tue':19.7, 'wed':15, 'thu':89, 'fri': 34, 'sat': 43, 'sun': 51}
s4= pd.Series(dict1)
print(s4)

To remedy this you need to specify the index in the way you want it, while the data can come from the dict. 

In [None]:
s4 = pd.Series(dict1, index=index1)
print(s4)
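
One consequence of passing an explicit index is worth seeing: a label that has no matching key in the dict is not an error, it simply becomes a missing value. The sketch below assumes Python 3 and uses a deliberately incomplete, made-up dict (`partial`) to show this.

```python
import pandas as pd

partial = {'mon': 33, 'tue': 19.7, 'wed': 15}   # hypothetical, incomplete data
wanted = ['mon', 'tue', 'wed', 'thu']           # 'thu' has no entry in the dict

s_partial = pd.Series(partial, index=wanted)
print(s_partial)
print(s_partial.isnull().sum())   # number of missing entries
```

The index you pass decides both the order and the set of labels; the dict only supplies values where the keys match.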

Also, just like a dict, you can iterate over the entries.

In [None]:
for k, v in s4.items():   # iteritems() was removed in pandas 2.0
    print(k, v)

Also, just like a dictionary, you can operate on the keys.

In [None]:
print(s4['thu'], "\n\n")      # access by label
print(s4.iloc[3], "\n\n")     # access by position
print('sun' in s4, "\n\n")    # membership tests look at the index
print('moon' in s4)
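
The dictionary analogy extends a little further than the cell above shows. As a short sketch (Python 3, with a small made-up Series `s`), `keys()` returns the index labels and `get()` returns a default instead of raising when a label is missing.

```python
import pandas as pd

s = pd.Series({'mon': 33, 'tue': 19.7, 'wed': 15})

print(list(s.keys()))         # the index labels, like dict.keys()
print(s.get('tue'))           # value for an existing label
print(s.get('moon', 'n/a'))   # default returned for a missing label
```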

## Vectorized operations

Because a Series stores its values in a NumPy array, it supports NumPy functions and vectorized operations.

In [None]:
print(s4.sum(), "\n\n")
print(s4.median(), "\n\n")
print(s4.cumsum())

In [None]:
print(s4 * 2, "\n\n")
print(s4 ** 2, "\n\n")
print(s4 + 100)
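
Vectorized operations also cover NumPy functions, boolean masks, and arithmetic between two Series. The sketch below (Python 3; the Series `a` and `b` are made up for this illustration) shows one point that differs from plain arrays: two Series are combined by matching index labels, not by position.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 4, 9], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'w'])   # different order, one new label

roots = np.sqrt(a)   # NumPy functions return a Series with the same index
big = a[a > 3]       # a boolean mask keeps only the matching entries
total = a + b        # values are aligned by label before adding

print(roots)
print(big)
print(total)
```

Labels present in only one of the two Series ('x' and 'w' here) come out as NaN in the sum.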

You can also iterate over a Series in a list comprehension.

In [None]:
new = [x**2 for x in s4]
print(new)
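
A list comprehension returns a plain Python list, so the index labels are lost. As a minimal sketch (Python 3, with a small made-up Series `s`), the vectorized form and `map` give the same numbers while keeping the index.

```python
import pandas as pd

s = pd.Series([3, -1, 4], index=['a', 'b', 'c'])

as_list = [x ** 2 for x in s]      # plain list, index labels are lost
as_series = s ** 2                 # Series, index labels are kept
mapped = s.map(lambda x: x ** 2)   # element-wise function, also keeps the index

print(as_list)
print(as_series)
```

For simple arithmetic the vectorized form is usually the better choice; `map` is useful when the per-element logic cannot be written as an array expression.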

# Pandas DataFrame

While pandas Series are interesting, they are not too useful on their own - you can make do with dicts and arrays in place of Series. Pandas becomes really useful with its next data structure - the *DataFrame*. A DataFrame is a set of Series objects stacked side by side, sharing a single index that labels each row. As you might imagine, this is essentially data like a spreadsheet, in rows and columns. This is the most common format data scientists use on a daily basis.

You can create your own dataframes as demonstrated below. 

In [75]:
import datetime as dtm

dt= dtm.datetime(2014,12,1)
en= dtm.datetime(2014,12,8)
step= dtm.timedelta(days=1)
dates= []
while dt < en:
    dates.append(dt.strftime('%Y-%m-%d'))
    dt += step
print(dates)

['2014-12-01', '2014-12-02', '2014-12-03', '2014-12-04', '2014-12-05', '2014-12-06', '2014-12-07']
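
The loop above builds the list of dates by hand. As an alternative sketch (Python 3; the name `dates_alt` is made up here), pandas can generate the same seven strings with `pd.date_range`.

```python
import pandas as pd

# Seven consecutive days starting on 2014-12-01, formatted as strings
dates_alt = pd.date_range('2014-12-01', periods=7, freq='D').strftime('%Y-%m-%d').tolist()
print(dates_alt)
```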


In [76]:
t1= [15,19,15,11,9,8,13]
t2= [20,18,23,19,25,27,23]
t3= [-2,0,2,5,7,-5,-3]
d= {'Date': dates,'Tokyo': t1, 'Mumbai': t2, 'Paris': t3}
print(d)

{'Date': ['2014-12-01', '2014-12-02', '2014-12-03', '2014-12-04', '2014-12-05', '2014-12-06', '2014-12-07'], 'Paris': [-2, 0, 2, 5, 7, -5, -3], 'Mumbai': [20, 18, 23, 19, 25, 27, 23], 'Tokyo': [15, 19, 15, 11, 9, 8, 13]}


In [82]:
temps= pd.DataFrame(d)
temps

         Date  Mumbai  Paris  Tokyo
0  2014-12-01      20     -2     15
1  2014-12-02      18      0     19
2  2014-12-03      23      2     15
3  2014-12-04      19      5     11
4  2014-12-05      25      7      9
5  2014-12-06      27     -5      8
6  2014-12-07      23     -3     13


Each column in the dataset is a Series, and it can be referenced using the column name.

In [97]:
print(type(temps['Mumbai']))

<class 'pandas.core.series.Series'>
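
Since each column is a Series, everything shown earlier for Series applies to columns as well. The sketch below (Python 3) uses a small made-up frame named `demo` with two of the same city names to show column arithmetic and how assigning a Series adds a new column.

```python
import pandas as pd

# Small hypothetical frame with the same column names as above
demo = pd.DataFrame({'Mumbai': [20, 18, 23], 'Paris': [-2, 0, 2]})

gap = demo['Mumbai'] - demo['Paris']   # element-wise difference, a new Series
demo['Gap'] = gap                      # assigning a Series adds a column

print(demo['Mumbai'].mean())
print(demo)
```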


In [84]:
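temps = temps.set_index('Date')
# Run this cell only once: after the first run 'Date' is the index,
# no longer a column, so a second set_index('Date') raises a KeyError.
temps.head()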

            Mumbai  Paris  Tokyo
Date                            
2014-12-01      20     -2     15
2014-12-02      18      0     19
2014-12-03      23      2     15
2014-12-04      19      5     11
2014-12-05      25      7      9
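
Because `set_index` removes the column it promotes, running it twice fails. One way to make the step safe to repeat, and to undo it, is sketched below (Python 3; the small frame `demo` is made up for this illustration).

```python
import pandas as pd

demo = pd.DataFrame({'Date': ['2014-12-01', '2014-12-02'], 'Paris': [-2, 0]})

# Guard: only move 'Date' into the index if it is still a column
if 'Date' in demo.columns:
    demo = demo.set_index('Date')

restored = demo.reset_index()   # turns the index back into a 'Date' column

print(demo)
print(restored)
```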


## Rows and Columns

The most basic thing you will want to do with a DataFrame is access its rows and columns. This can be done using column names and index labels.

In [94]:
days = pd.date_range('2014-01-01', '2014-03-01', freq='D')   # 60 days
dim = (60, 5)
# random integers from -20 to 40 inclusive (randint excludes the upper bound)
df = pd.DataFrame(np.random.randint(-20, 41, dim),
                  index=days,
                  columns=['Madrid', 'Boston', 'Tokyo', 'Shanghai', 'Kolkata'])
df.tail()

            Madrid  Boston  Tokyo  Shanghai  Kolkata
2014-02-25       7      20     12        37       12
2014-02-26      -9      17     13       -14      -12
2014-02-27      28      39     11        27       10
2014-02-28      13      -1     -9        -7        1
2014-03-01       8      18    -10       -11        7


In [95]:
print(df.shape, "\n\n")
print(df.columns.values)

(60, 5) 


['Madrid' 'Boston' 'Tokyo' 'Shanghai' 'Kolkata']


In [96]:
print(df.index)

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06', '2014-01-07', '2014-01-08',
               '2014-01-09', '2014-01-10', '2014-01-11', '2014-01-12',
               '2014-01-13', '2014-01-14', '2014-01-15', '2014-01-16',
               '2014-01-17', '2014-01-18', '2014-01-19', '2014-01-20',
               '2014-01-21', '2014-01-22', '2014-01-23', '2014-01-24',
               '2014-01-25', '2014-01-26', '2014-01-27', '2014-01-28',
               '2014-01-29', '2014-01-30', '2014-01-31', '2014-02-01',
               '2014-02-02', '2014-02-03', '2014-02-04', '2014-02-05',
               '2014-02-06', '2014-02-07', '2014-02-08', '2014-02-09',
               '2014-02-10', '2014-02-11', '2014-02-12', '2014-02-13',
               '2014-02-14', '2014-02-15', '2014-02-16', '2014-02-17',
               '2014-02-18', '2014-02-19', '2014-02-20', '2014-02-21',
               '2014-02-22', '2014-02-23', '2014-02-24', '2014-02-25',
      

### Column Selection

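You can select columns by referring to the column name, either as an attribute (*df.Madrid*) or in brackets (*df['Tokyo']*). Passing a list of names returns a DataFrame with just those columns.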

In [101]:
print(df.Madrid.head())

2014-01-01     1
2014-01-02    -2
2014-01-03    -5
2014-01-04    11
2014-01-05   -13
Freq: D, Name: Madrid, dtype: int32


In [102]:
print(df['Tokyo'].tail())

2014-02-25    12
2014-02-26    13
2014-02-27    11
2014-02-28    -9
2014-03-01   -10
Freq: D, Name: Tokyo, dtype: int32


In [103]:
df[['Boston', 'Shanghai']].head()

            Boston  Shanghai
2014-01-01      19         6
2014-01-02      31        -7
2014-01-03      30        -2
2014-01-04     -18        38
2014-01-05       4         0
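
The three selection styles above return different types, and attribute access has limits. As a brief sketch (Python 3; the frame `demo` and its 'New York' column are made up for this illustration):

```python
import pandas as pd

demo = pd.DataFrame({'Boston': [19, 31], 'New York': [6, -7]})

one = demo['Boston']     # single label -> Series
sub = demo[['Boston']]   # list of labels -> DataFrame, even with one column
ny = demo['New York']    # names containing spaces only work with brackets

print(type(one))
print(type(sub))
```

Bracket access works for every column name, so it is the safer default; attribute access is a convenience for names that are valid Python identifiers.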


### Row selection 

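You can select rows by using the *ix* indexer. Note that there are 2 other indexers: *loc* (label-based) and *iloc* (position-based). Read about the subtle differences [here](http://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation). I personally prefer using **ix** since it's a wrapper on the others. Be aware, though, that *ix* was deprecated in pandas 0.20 and removed in pandas 1.0; on current versions, use *loc* for label-based selection like the examples in this section. And then again, mostly nobody selects rows by index.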

In [104]:
# label-based row selection; .loc replaces .ix, which was removed in pandas 1.0
df.loc['2014-01-15']

Madrid      21
Boston      30
Tokyo        0
Shanghai   -14
Kolkata      3
Name: 2014-01-15 00:00:00, dtype: int32

In [105]:
# label slices with .loc include both end points
df.loc['2014-01-24':'2014-01-31']

            Madrid  Boston  Tokyo  Shanghai  Kolkata
2014-01-24      37      -1     13       -11      -11
2014-01-25      12      32    -18        -8       20
2014-01-26      32     -15     21         5       10
2014-01-27      24      35    -14        16      -17
2014-01-28      25      18     29         7       16
2014-01-29      40      -8     -3        19       23
2014-01-30      19       1      8        19       21
2014-01-31      15     -12    -10         2      -12


In [106]:
# .loc takes row labels first, then column labels
df.loc['2014-02-10':'2014-02-15', ['Madrid', 'Kolkata']]

            Madrid  Kolkata
2014-02-10      12       35
2014-02-11      25        9
2014-02-12      -2        1
2014-02-13      25       -6
2014-02-14      -2        9
2014-02-15      35        5
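
To make the difference between label-based and position-based selection concrete, here is a minimal sketch (Python 3 and a recent pandas; the small frame `demo` is made up for this illustration) that picks the same rows and column once with *loc* and once with *iloc*.

```python
import numpy as np
import pandas as pd

days = pd.date_range('2014-01-01', periods=5, freq='D')
demo = pd.DataFrame(np.arange(10).reshape(5, 2),
                    index=days, columns=['Madrid', 'Kolkata'])

by_label = demo.loc['2014-01-02':'2014-01-04', ['Madrid']]   # labels, end included
by_pos = demo.iloc[1:4, [0]]                                 # positions, end excluded

print(by_label)
print(by_label.equals(by_pos))
```

Both calls return the same three rows; the slice end differs because *loc* includes its end label while *iloc* follows the usual Python convention of excluding the end position.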
