# Agenda
* Numpy
* Pandas
* Lab


# Introduction


## Create a new notebook for your code-along:

From our submission directory, type:
    
    jupyter notebook

From the IPython Dashboard, open a new notebook.
Change the title to: "Numpy and Pandas"

# Introduction to Numpy

* Overview
* ndarray
* Indexing and Slicing

More info: [http://wiki.scipy.org/Tentative_NumPy_Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial)


## Numpy Overview

* Why Python for data? NumPy brings *decades* of C math into Python!
* NumPy wraps extensive C/C++/Fortran codebases that provide data analysis functionality
* The ndarray allows easy vectorized math and broadcasting (i.e. elementwise operations between arrays of different but compatible shapes)
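
To make the broadcasting bullet concrete, here is a minimal sketch. The names `col`, `row` and `grid` are illustrative and not part of the lesson: a column of shape (3, 1) and a row of shape (4,) are stretched to a common (3, 4) shape before being added.

```python
import numpy as np

# A (3, 1) column and a (4,) row broadcast to a (3, 4) grid.
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3, 4])
grid = col + row

print(grid.shape)  # (3, 4)
print(grid)

# Vectorized math: the operation applies to every element without a Python loop.
doubled = row * 2
print(doubled)     # [2 4 6 8]
```

No explicit loop is written; NumPy handles the elementwise work in compiled code.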

In [121]:
!pip install package_name



In [122]:
import numpy as np

### Creating ndarrays

An array object represents a multidimensional, homogeneous array of fixed-size items. 

In [123]:
# Creating arrays
a = np.zeros((3))
b = np.ones((2,3))
c = np.random.randint(1,10,(2,3,4))
d = np.arange(0,11,1)

In [124]:
print (a)

[ 0.  0.  0.]


In [125]:
print(b)

[[ 1.  1.  1.]
 [ 1.  1.  1.]]


In [126]:
print(c)

[[[7 3 2 5]
  [7 2 4 3]
  [2 1 6 2]]

 [[4 9 1 3]
  [3 5 7 1]
  [2 9 4 7]]]


In [127]:
print(d)

[ 0  1  2  3  4  5  6  7  8  9 10]


What are these functions?

    np.arange?

In [128]:
c = np.random.randint(1,10,4)
print(c)

[7 2 8 6]


In [129]:
# Note the way each array is printed:
a,b,c,d

(array([ 0.,  0.,  0.]), array([[ 1.,  1.,  1.],
        [ 1.,  1.,  1.]]), array([7, 2, 8, 6]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]))
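
"Homogeneous" means every element of an array shares one dtype. The sketch below inspects `shape`, `ndim` and `dtype` for arrays built the same way as above; `mixed` is an illustrative name added here, and the exact integer dtype of `d` depends on the platform.

```python
import numpy as np

a = np.zeros(3)            # 1-D, float64 by default
b = np.ones((2, 3))        # 2-D, float64 by default
d = np.arange(0, 11, 1)    # 1-D integers from 0 to 10 inclusive

for name, arr in [("a", a), ("b", b), ("d", d)]:
    print(name, arr.shape, arr.ndim, arr.dtype)

# Mixing ints and floats upcasts every element to a single dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64
```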

In [130]:
## Arithmetic on arrays is elementwise

In [131]:
a = np.array( [20,30,40,50] )
b = np.arange( 4 )
b

array([0, 1, 2, 3])

In [132]:
c = a-b
c

array([20, 29, 38, 47])

In [133]:
b**2

array([0, 1, 4, 9])

In [134]:
# b**2 works because b is an ndarray; a plain Python list does not support **, so use map (or a list comprehension)
list(map(lambda x: x**2, [0,1,2,3]))

[0, 1, 4, 9]

## Indexing, Slicing and Iterating

In [135]:
# one-dimensional arrays work like lists:
a = np.arange(10)**2
print(a)

# equivalent with plain Python: list(map(lambda x: x**2, range(10)))

[ 0  1  4  9 16 25 36 49 64 81]


In [136]:
a

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [137]:
a[2:5]

array([ 4,  9, 16])

In [138]:
# Multidimensional arrays use tuples with commas for indexing
# with (row,column) conventions beginning, as always in Python, from 0

In [139]:
b = np.random.randint(1,100,(4,4))

In [140]:
b

array([[74, 57, 39, 68],
       [28, 74, 24, 40],
       [85, 83, 81, 25],
       [27,  8, 18, 21]])

In [141]:
# Guess the output
print(b[2,3])
print(b[0,0])


25
74


In [142]:
b[0:3,1],b[:,1]

(array([57, 74, 83]), array([57, 74, 83,  8]))

In [143]:
b[1:3,:]

array([[28, 74, 24, 40],
       [85, 83, 81, 25]])
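
Because the array above is random, the results change on every run. As a reproducible sketch of the same indexing patterns, the example below uses a fixed 4x4 grid; the name `m` is illustrative.

```python
import numpy as np

# A fixed 4x4 grid holding the values 0..15, so results are reproducible.
m = np.arange(16).reshape(4, 4)

print(m[2, 3])        # single element at row 2, column 3 -> 11
print(m[:, 1])        # every row of column 1 -> [ 1  5  9 13]
print(m[1:3, :])      # rows 1 and 2, all columns
print(m[m % 2 == 0])  # boolean mask: even values only
```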

# Introduction to Pandas

* Object Creation
* Viewing data
* Selection
* Missing data
* Grouping
* Reshaping
* Time series
* Plotting
* i/o
 

_pandas.pydata.org_

## Pandas Overview

_Source: [pandas.pydata.org](http://pandas.pydata.org/pandas-docs/stable/10min.html)_

In [144]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [145]:
dates = pd.date_range('20140101',periods=6)
dates

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [146]:
# freq="366D" steps 366 days at a time, so each date falls in the next year (but drifts forward by about a day per year)
dates = pd.date_range('20140101',periods=6, freq="366D")
print(dates)

DatetimeIndex(['2014-01-01', '2015-01-02', '2016-01-03', '2017-01-03',
               '2018-01-04', '2019-01-05'],
              dtype='datetime64[ns]', freq='366D')
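
If the goal is the same calendar day in each year, stepping by a fixed number of days drifts, as the output above shows. One possible alternative, sketched here with the illustrative name `yearly`, is to pass a `pd.DateOffset` of one year as the frequency.

```python
import pandas as pd

# Step by one calendar year rather than a fixed number of days.
yearly = pd.date_range('2014-01-01', periods=6, freq=pd.DateOffset(years=1))
print(yearly)
```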


In [147]:
# indexing a DatetimeIndex returns a single Timestamp, which helps with historical data
dates = pd.date_range('20140101',periods=6)
dates[0]

# pandas can parse stringified dates like "2010-05-01" or "2010/04/04" into Timestamps

Timestamp('2014-01-01 00:00:00', freq='D')
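
A short sketch of parsing date strings directly, using the two example strings from the comment above. The names `t1` and `t2` are illustrative.

```python
import pandas as pd

t1 = pd.Timestamp("2010-05-01")
t2 = pd.to_datetime("2010/04/04")

print(t1.year, t1.month, t1.day)  # 2010 5 1
print(t2 < t1)                    # True
print((t1 - t2).days)             # 27
```

Once parsed, timestamps can be compared and subtracted like numbers.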

In [148]:
#sample timestamp
date1 = dates[0]

In [149]:
# printing the components of the timestamp
print(date1.day)
print(date1.month)
print(date1.year)

1
1
2014


In [150]:
# returns samples from the standard normal distribution as an array with 6 rows and 4 columns
np.random.randn(6,4)

array([[ 0.77787037, -1.63403812, -0.05073682, -0.91129071],
       [ 0.91387228,  1.22458463, -0.38830443, -1.62152504],
       [-0.02701094,  2.09591455, -0.43175507, -0.19026274],
       [-0.67588368, -0.9470706 , -0.16523217, -0.49757321],
       [ 0.58897033,  0.88520008, -1.92176667, -2.08512682],
       [ 1.21335908, -0.75901335, -1.64831826, -0.03224365]])

In [151]:
import time

In [152]:
time.localtime()
#this gives you the local time on your computer

time.struct_time(tm_year=2016, tm_mon=11, tm_mday=21, tm_hour=21, tm_min=23, tm_sec=46, tm_wday=0, tm_yday=326, tm_isdst=0)

In [153]:
#making a dataframe

df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
z = pd.DataFrame(index = df.index, columns = df.columns)
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

In [154]:
df.head()

Unnamed: 0,A,B,C,D
2014-01-01,1.621396,1.312366,-0.076154,0.026133
2014-01-02,-0.41164,-0.481411,-0.699226,0.449376
2014-01-03,0.880895,0.169607,0.189482,2.03209
2014-01-04,1.479393,-0.495958,-0.157277,-1.952386
2014-01-05,0.889966,1.339485,-0.435839,-1.169412


In [155]:
# Index, columns, underlying numpy data
# .T transposes the table: rows become columns and columns become rows
df.T


Unnamed: 0,2014-01-01 00:00:00,2014-01-02 00:00:00,2014-01-03 00:00:00,2014-01-04 00:00:00,2014-01-05 00:00:00,2014-01-06 00:00:00
A,1.621396,-0.41164,0.880895,1.479393,0.889966,2.915477
B,1.312366,-0.481411,0.169607,-0.495958,1.339485,-0.221361
C,-0.076154,-0.699226,0.189482,-0.157277,-0.435839,-0.348233
D,0.026133,0.449376,2.03209,-1.952386,-1.169412,1.341672


In [156]:
# Another way to build a DataFrame is to pass a dictionary: the keys (A to E) become the column names
# and each value fills its column. A scalar is repeated on every row, so column A is 1.0 throughout.
# Every row of column B holds the exact same date.
# Column C is a float32 Series, so it shows 1.0 on all 4 rows.
# Column D comes from the array [3, 3, 3, 3], stored as int32, and column E repeats the string 'foo'.

df2 = pd.DataFrame({ 'A' : 1.,
                         'B' : pd.Timestamp('20130102'),
                         'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                         'D' : np.array([3] * 4,dtype='int32'),
                         'E' : 'foo' })
    

df2

Unnamed: 0,A,B,C,D,E
0,1.0,2013-01-02,1.0,3,foo
1,1.0,2013-01-02,1.0,3,foo
2,1.0,2013-01-02,1.0,3,foo
3,1.0,2013-01-02,1.0,3,foo


In [157]:
# With specific dtypes
# df2.dtypes shows the schema, i.e. the data type of each column
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
dtype: object

#### Viewing Data

In [158]:
df.head()

Unnamed: 0,A,B,C,D
2014-01-01,1.621396,1.312366,-0.076154,0.026133
2014-01-02,-0.41164,-0.481411,-0.699226,0.449376
2014-01-03,0.880895,0.169607,0.189482,2.03209
2014-01-04,1.479393,-0.495958,-0.157277,-1.952386
2014-01-05,0.889966,1.339485,-0.435839,-1.169412


In [159]:
df.tail()

Unnamed: 0,A,B,C,D
2014-01-02,-0.41164,-0.481411,-0.699226,0.449376
2014-01-03,0.880895,0.169607,0.189482,2.03209
2014-01-04,1.479393,-0.495958,-0.157277,-1.952386
2014-01-05,0.889966,1.339485,-0.435839,-1.169412
2014-01-06,2.915477,-0.221361,-0.348233,1.341672


In [160]:
df.index # returns the row index (here a DatetimeIndex)

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [161]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,1.229248,0.270455,-0.254541,0.121245
std,1.094413,0.852426,0.309138,1.498214
min,-0.41164,-0.495958,-0.699226,-1.952386
25%,0.883163,-0.416398,-0.413938,-0.870526
50%,1.184679,-0.025877,-0.252755,0.237755
75%,1.585895,1.026676,-0.096435,1.118598
max,2.915477,1.339485,0.189482,2.03209


In [162]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2014-01-04,1.479393,-0.495958,-0.157277,-1.952386
2014-01-02,-0.41164,-0.481411,-0.699226,0.449376
2014-01-06,2.915477,-0.221361,-0.348233,1.341672
2014-01-03,0.880895,0.169607,0.189482,2.03209
2014-01-01,1.621396,1.312366,-0.076154,0.026133
2014-01-05,0.889966,1.339485,-0.435839,-1.169412


In [163]:
df.sort_values(by='B') # returns a sorted copy; passing inplace=True would make the sorting stay in df

Unnamed: 0,A,B,C,D
2014-01-04,1.479393,-0.495958,-0.157277,-1.952386
2014-01-02,-0.41164,-0.481411,-0.699226,0.449376
2014-01-06,2.915477,-0.221361,-0.348233,1.341672
2014-01-03,0.880895,0.169607,0.189482,2.03209
2014-01-01,1.621396,1.312366,-0.076154,0.026133
2014-01-05,0.889966,1.339485,-0.435839,-1.169412
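
The difference between a sorted copy and an in-place sort can be sketched on a tiny frame. The name `scores` and its values are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({"B": [3, 1, 2]}, index=["x", "y", "z"])

ordered = scores.sort_values(by="B")      # returns a sorted copy
print(list(scores.index))                 # ['x', 'y', 'z'] (unchanged)
print(list(ordered.index))                # ['y', 'z', 'x']

scores.sort_values(by="B", inplace=True)  # modifies scores itself
print(list(scores.index))                 # ['y', 'z', 'x']
```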


In [164]:
df.head() # df is still in its original order because sort_values returned a copy

Unnamed: 0,A,B,C,D
2014-01-01,1.621396,1.312366,-0.076154,0.026133
2014-01-02,-0.41164,-0.481411,-0.699226,0.449376
2014-01-03,0.880895,0.169607,0.189482,2.03209
2014-01-04,1.479393,-0.495958,-0.157277,-1.952386
2014-01-05,0.889966,1.339485,-0.435839,-1.169412


### Selection

In [165]:
df[['A','B']] #returns a dataframe with just 2 columns.

Unnamed: 0,A,B
2014-01-01,1.621396,1.312366
2014-01-02,-0.41164,-0.481411
2014-01-03,0.880895,0.169607
2014-01-04,1.479393,-0.495958
2014-01-05,0.889966,1.339485
2014-01-06,2.915477,-0.221361


In [166]:
df[0:3] # slices rows by position: rows 0, 1 and 2 (the stop index is excluded), with all columns

Unnamed: 0,A,B,C,D
2014-01-01,1.621396,1.312366,-0.076154,0.026133
2014-01-02,-0.41164,-0.481411,-0.699226,0.449376
2014-01-03,0.880895,0.169607,0.189482,2.03209


In [167]:
# By label
df.loc[dates[0]]

A    1.621396
B    1.312366
C   -0.076154
D    0.026133
Name: 2014-01-01 00:00:00, dtype: float64

In [168]:
# multi-axis by label
# basically give me column A and B
df.loc[:,['A','B']]

Unnamed: 0,A,B
2014-01-01,1.621396,1.312366
2014-01-02,-0.41164,-0.481411
2014-01-03,0.880895,0.169607
2014-01-04,1.479393,-0.495958
2014-01-05,0.889966,1.339485
2014-01-06,2.915477,-0.221361


In [169]:
df.loc['2014-01-01':'2014-01-02',['A','B']]

Unnamed: 0,A,B
2014-01-01,1.621396,1.312366
2014-01-02,-0.41164,-0.481411


In [170]:
# Date Range
# Label slices include both endpoints: select column B where the index is between those days
df.loc['20140102':'20140104',['B']]

Unnamed: 0,B
2014-01-02,-0.481411
2014-01-03,0.169607
2014-01-04,-0.495958


In [171]:
# Fast access to scalar
df.at[dates[1],'B']

-0.48141078838954465

In [172]:
# iloc provides integer locations similar to np style
df.iloc[3:]

Unnamed: 0,A,B,C,D
2014-01-04,1.479393,-0.495958,-0.157277,-1.952386
2014-01-05,0.889966,1.339485,-0.435839,-1.169412
2014-01-06,2.915477,-0.221361,-0.348233,1.341672
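
A compact comparison of label-based and position-based selection, as a sketch on a small frame with known values. The name `frame` and its contents are illustrative.

```python
import pandas as pd

dates = pd.date_range("2014-01-01", periods=6)
frame = pd.DataFrame({"A": range(6), "B": range(10, 16)}, index=dates)

by_label = frame.loc["2014-01-02":"2014-01-04", "B"]  # both endpoints included
by_position = frame.iloc[1:4, 1]                      # stop position excluded

print(by_label.tolist())     # [11, 12, 13]
print(by_position.tolist())  # [11, 12, 13]
print(frame.iloc[3:].shape)  # (3, 2)
```

Both calls pick the same three rows, but `.loc` needs the last label while `.iloc` needs one position past the last row.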


### Boolean Indexing

In [173]:
df[df.A < 0] # Basically a 'where' operation.
# Returns every row whose value in column A is less than zero

Unnamed: 0,A,B,C,D
2014-01-02,-0.41164,-0.481411,-0.699226,0.449376
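
The filter works because the comparison produces a boolean Series that is then used as a row mask. The sketch below shows the mask itself and how two conditions can be combined; `frame` and its values are made up.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.5, -0.4, 0.9, -2.0], "B": [1.3, -0.5, 0.2, 0.7]})

mask = frame["A"] < 0
print(mask.tolist())  # [False, True, False, True]
print(frame[mask])    # rows 1 and 3

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
both = frame[(frame["A"] < 0) & (frame["B"] > 0)]
print(both.index.tolist())  # [3]
```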


### Setting

In [174]:
# copy() creates an independent copy of your dataframe
df_posA = df.copy() # without copy(), both names would refer to the same data, so changes would also act on df

df_posA.head()

Unnamed: 0,A,B,C,D
2014-01-01,1.621396,1.312366,-0.076154,0.026133
2014-01-02,-0.41164,-0.481411,-0.699226,0.449376
2014-01-03,0.880895,0.169607,0.189482,2.03209
2014-01-04,1.479393,-0.495958,-0.157277,-1.952386
2014-01-05,0.889966,1.339485,-0.435839,-1.169412


In [175]:
df_posA[df_posA.A < 0] = -1*df_posA # flips the sign of every value in the rows where A is negative

In [176]:
df_posA

Unnamed: 0,A,B,C,D
2014-01-01,1.621396,1.312366,-0.076154,0.026133
2014-01-02,0.41164,0.481411,0.699226,-0.449376
2014-01-03,0.880895,0.169607,0.189482,2.03209
2014-01-04,1.479393,-0.495958,-0.157277,-1.952386
2014-01-05,0.889966,1.339485,-0.435839,-1.169412
2014-01-06,2.915477,-0.221361,-0.348233,1.341672


In [177]:
#Setting new column aligns data by index
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20140102',periods=6))

In [178]:
s1

2014-01-02    1
2014-01-03    2
2014-01-04    3
2014-01-05    4
2014-01-06    5
2014-01-07    6
Freq: D, dtype: int64

In [179]:
df['F'] = s1

In [180]:
df

Unnamed: 0,A,B,C,D,F
2014-01-01,1.621396,1.312366,-0.076154,0.026133,
2014-01-02,-0.41164,-0.481411,-0.699226,0.449376,1.0
2014-01-03,0.880895,0.169607,0.189482,2.03209,2.0
2014-01-04,1.479393,-0.495958,-0.157277,-1.952386,3.0
2014-01-05,0.889966,1.339485,-0.435839,-1.169412,4.0
2014-01-06,2.915477,-0.221361,-0.348233,1.341672,5.0
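
The NaN in the first row and the missing 2014-01-07 value both come from index alignment: values are matched on labels, not on position. A minimal sketch with illustrative string labels:

```python
import pandas as pd

frame = pd.DataFrame({"A": [10, 20, 30]}, index=["x", "y", "z"])
s = pd.Series([1, 2, 3], index=["y", "z", "w"])

frame["F"] = s  # matched on labels, not on position
print(frame)
# "x" has no match in s, so it becomes NaN; "w" is not in frame, so it is dropped.
```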


### Missing Data

In [181]:
# Add a column with missing data
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])

In [182]:
df1.loc[dates[0]:dates[1],'E'] = 1

In [183]:
df1

Unnamed: 0,A,B,C,D,F,E
2014-01-01,1.621396,1.312366,-0.076154,0.026133,,1.0
2014-01-02,-0.41164,-0.481411,-0.699226,0.449376,1.0,1.0
2014-01-03,0.880895,0.169607,0.189482,2.03209,2.0,
2014-01-04,1.479393,-0.495958,-0.157277,-1.952386,3.0,


In [184]:
# find where values are null
pd.isnull(df1)

Unnamed: 0,A,B,C,D,F,E
2014-01-01,False,False,False,False,True,False
2014-01-02,False,False,False,False,False,False
2014-01-03,False,False,False,False,False,True
2014-01-04,False,False,False,False,False,True
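
Once the missing values are located, they are usually either dropped or filled. The sketch below uses a small frame with the same NaN pattern as columns F and E above; the fill value 5 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"F": [np.nan, 1.0, 2.0, 3.0],
                      "E": [1.0, 1.0, np.nan, np.nan]})

missing_per_column = frame.isnull().sum()  # count of NaN values per column
complete_rows = frame.dropna(how="any")    # keep only rows with no NaN
filled = frame.fillna(value=5)             # replace every NaN with 5

print(missing_per_column)
print(complete_rows)
print(filled)
```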


### Operations

In [185]:
df.describe()

Unnamed: 0,A,B,C,D,F
count,6.0,6.0,6.0,6.0,5.0
mean,1.229248,0.270455,-0.254541,0.121245,3.0
std,1.094413,0.852426,0.309138,1.498214,1.581139
min,-0.41164,-0.495958,-0.699226,-1.952386,1.0
25%,0.883163,-0.416398,-0.413938,-0.870526,2.0
50%,1.184679,-0.025877,-0.252755,0.237755,3.0
75%,1.585895,1.026676,-0.096435,1.118598,4.0
max,2.915477,1.339485,0.189482,2.03209,5.0


In [186]:
df.mean(),df.mean(1) # Operation on two different axes

(A    1.229248
 B    0.270455
 C   -0.254541
 D    0.121245
 F    3.000000
 dtype: float64, 2014-01-01    0.720935
 2014-01-02   -0.028580
 2014-01-03    1.054415
 2014-01-04    0.374754
 2014-01-05    0.924840
 2014-01-06    1.737511
 Freq: D, dtype: float64)
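
The two axes are easier to see on a frame with round numbers. In this sketch (names and values are illustrative), `axis=0` collapses the rows to give one value per column, and `axis=1` collapses the columns to give one value per row.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

col_means = frame.mean(axis=0)  # one value per column
row_means = frame.mean(axis=1)  # one value per row

print(col_means.tolist())  # [2.0, 20.0]
print(row_means.tolist())  # [5.5, 11.0, 16.5]
```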

### Applying functions

In [187]:
# apply is useful when we are trying to derive a column from existing ones
!pwd

/Users/edmond_20000/Desktop/Edmond-repo/lessons/lesson-02/code


In [188]:
import pandas as pd
df = pd.DataFrame ({"A" : range(5), "B" : range(5)})
df.head()

Unnamed: 0,A,B
0,0,0
1,1,1
2,2,2
3,3,3
4,4,4


In [189]:
df["C"] = df["A"]*2

In [190]:
df.columns

Index([u'A', u'B', u'C'], dtype='object')

In [191]:
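df.drop("C", axis = 1) # returns a new DataFrame without C; df itself is unchanged

df2 = df[["A", "B"]] # selecting the columns to keep is an alternative to dropping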

In [192]:
df.head()

Unnamed: 0,A,B,C
0,0,0,0
1,1,1,2
2,2,2,4
3,3,3,6
4,4,4,8


In [193]:
df2 = pd.DataFrame( {"firstName" : ["Alex", "Tom"], "lastName" : ["Henry", "Smith"]})

df2.head()

Unnamed: 0,firstName,lastName
0,Alex,Henry
1,Tom,Smith


In [194]:
def makeFullName(row):
    # with axis=1, apply passes each row as a Series, so read the fields by column label
    firstName = row["firstName"]
    lastName = row["lastName"]
    return firstName + "" + lastName

df2["fullname"] = df2.apply(makeFullName, axis = 1)

In [195]:
df2.head()

Unnamed: 0,firstName,lastName,fullname
0,Alex,Henry,AlexHenry
1,Tom,Smith,TomSmith


In [196]:
df2["fullName"] = df2.firstName + "" + df2.lastName

In [197]:
df2.head()

Unnamed: 0,firstName,lastName,fullname,fullName
0,Alex,Henry,AlexHenry,AlexHenry
1,Tom,Smith,TomSmith,TomSmith
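
Both approaches above join the names with an empty string. As a sketch of the same two techniques with a space separator, the example below uses a separate frame, `people`, with illustrative column names `viaApply` and `viaConcat`.

```python
import pandas as pd

people = pd.DataFrame({"firstName": ["Alex", "Tom"],
                       "lastName": ["Henry", "Smith"]})

# Row-wise apply: each row arrives as a Series, so fields are read by column label.
people["viaApply"] = people.apply(
    lambda row: row["firstName"] + " " + row["lastName"], axis=1)

# Vectorized alternative: operate on whole columns at once.
people["viaConcat"] = people["firstName"] + " " + people["lastName"]

print(people)
```

The vectorized form is shorter and usually faster; row-wise `apply` is worth keeping for logic that cannot be expressed on whole columns.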


In [198]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C
0,0,0,0
1,1,1,2
2,3,3,6
3,6,6,12
4,10,10,20


In [199]:
df.apply(lambda x: x.max() - x.min()) 

#lambda is like def f(x): return x.max() - x.min()

A    4
B    4
C    8
dtype: int64

In [200]:
#Built in string methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])


In [201]:
df["C"] = df.A.apply(lambda x: x*2) # same result as df["C"] = df["A"]*2

In [202]:
# create a new column called E whose values are those of column C multiplied by 2
df['E'] = df.C.apply(lambda x: x*2)

# a lambda is an anonymous function, handy when the function will not be used again
# with a named function (say, a hypothetical multiplyBy2) you can also write => df['E'] = df.C.apply(multiplyBy2)

In [203]:
df.head()

Unnamed: 0,A,B,C,E
0,0,0,0,0
1,1,1,2,4
2,2,2,4,8
3,3,3,6,12
4,4,4,8,16


In [204]:
df.apply(np.cumsum, axis = 1).head() # cumulative sum across each row: every cell adds the cells to its left

Unnamed: 0,A,B,C,E
0,0,0,0,0
1,1,2,4,8
2,2,4,8,16
3,3,6,12,24
4,4,8,16,32


In [205]:
df.apply(np.cumsum, axis = 0).head() # cumulative sum down each column: every cell adds the cells above it

Unnamed: 0,A,B,C,E
0,0,0,0,0
1,1,1,2,4
2,3,3,6,12
3,6,6,12,24
4,10,10,20,40


In [206]:
df.apply(lambda x: x.max() - x.min()).head() # this gives the range of each column: its max minus its min
# apply works column by column here and returns a Series

A     4
B     4
C     8
E    16
dtype: int64

In [207]:
import math
df.apply(lambda x: math.exp(5)*x, axis = 1).head()

Unnamed: 0,A,B,C,E
0,0.0,0.0,0.0,0.0
1,148.413159,148.413159,296.826318,593.652636
2,296.826318,296.826318,593.652636,1187.305273
3,445.239477,445.239477,890.478955,1780.957909
4,593.652636,593.652636,1187.305273,2374.610546


In [208]:
# Built in string methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
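
Other methods are available through the same `.str` accessor, and missing values pass through as missing rather than raising an error. A small sketch on a shortened, illustrative Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'Aaba', np.nan, 'dog'])

upper = s.str.upper()                        # NaN stays NaN
lengths = s.str.len()                        # NaN stays NaN
starts_a = s.str.startswith('A', na=False)   # treat missing values as False

print(upper)
print(lengths)
print(starts_a.tolist())  # [True, True, False, False]
```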

### Merge

In [209]:
np.random.randn(10,4)

array([[ 0.70071167,  0.73305915, -0.09581073, -0.04764483],
       [-0.51673945, -0.37624324,  0.67581334,  2.23847276],
       [-1.42546133,  0.04856823, -0.0955727 , -0.88679877],
       [ 2.64031314, -1.12641422,  0.16075074, -0.49874895],
       [ 1.54098668, -0.72645427,  1.19673223,  0.55665444],
       [ 0.18214469, -0.37565474, -1.17147092, -0.49884837],
       [-0.29401988,  0.1535869 ,  1.39330932,  0.21509311],
       [ 0.18080443,  1.20339762, -1.05715376, -0.08529368],
       [-1.39750215,  1.22076071,  0.17448627,  0.80496534],
       [-1.86095568, -0.0774186 , -0.03112078, -1.33010556]])

In [210]:
#Concatenating pandas objects together
df = pd.DataFrame(np.random.randn(10,4))
df

Unnamed: 0,0,1,2,3
0,0.047037,-0.036591,0.081803,-0.408284
1,-2.472627,0.674638,-0.378715,1.934594
2,-1.293841,-0.069443,-0.683601,0.43563
3,-0.356498,-0.103545,1.326275,-0.694971
4,0.459868,1.219288,-0.790061,0.379402
5,-0.207504,-1.783782,0.469649,-0.039768
6,-0.771914,-0.023665,0.927234,0.852817
7,0.774478,-0.177907,-0.422939,-0.357576
8,-0.692516,0.425219,-0.887986,0.803792
9,0.213281,0.401548,-0.996292,0.966377


In [211]:
print(df[3:7])

          0         1         2         3
3 -0.356498 -0.103545  1.326275 -0.694971
4  0.459868  1.219288 -0.790061  0.379402
5 -0.207504 -1.783782  0.469649 -0.039768
6 -0.771914 -0.023665  0.927234  0.852817


In [212]:
# Break it into pieces
pieces = [df[:3], df[3:7],df[7:]]
pieces

[          0         1         2         3
 0  0.047037 -0.036591  0.081803 -0.408284
 1 -2.472627  0.674638 -0.378715  1.934594
 2 -1.293841 -0.069443 -0.683601  0.435630,
           0         1         2         3
 3 -0.356498 -0.103545  1.326275 -0.694971
 4  0.459868  1.219288 -0.790061  0.379402
 5 -0.207504 -1.783782  0.469649 -0.039768
 6 -0.771914 -0.023665  0.927234  0.852817,
           0         1         2         3
 7  0.774478 -0.177907 -0.422939 -0.357576
 8 -0.692516  0.425219 -0.887986  0.803792
 9  0.213281  0.401548 -0.996292  0.966377]

In [213]:
pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,0.047037,-0.036591,0.081803,-0.408284
1,-2.472627,0.674638,-0.378715,1.934594
2,-1.293841,-0.069443,-0.683601,0.43563
3,-0.356498,-0.103545,1.326275,-0.694971
4,0.459868,1.219288,-0.790061,0.379402
5,-0.207504,-1.783782,0.469649,-0.039768
6,-0.771914,-0.023665,0.927234,0.852817
7,0.774478,-0.177907,-0.422939,-0.357576
8,-0.692516,0.425219,-0.887986,0.803792
9,0.213281,0.401548,-0.996292,0.966377


In [214]:
# "join" and "merge" are also available (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
df

Unnamed: 0,0,1,2,3
0,0.047037,-0.036591,0.081803,-0.408284
1,-2.472627,0.674638,-0.378715,1.934594
2,-1.293841,-0.069443,-0.683601,0.43563
3,-0.356498,-0.103545,1.326275,-0.694971
4,0.459868,1.219288,-0.790061,0.379402
5,-0.207504,-1.783782,0.469649,-0.039768
6,-0.771914,-0.023665,0.927234,0.852817
7,0.774478,-0.177907,-0.422939,-0.357576
8,-0.692516,0.425219,-0.887986,0.803792
9,0.213281,0.401548,-0.996292,0.966377
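
`pd.concat` stacks pieces on top of each other. Combining two frames on a shared key column is done with `pd.merge`, which behaves like a SQL join. A minimal sketch, where the frames `left` and `right` and their columns are made up for illustration:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar", "baz"], "rval": [4, 5, 6]})

inner = pd.merge(left, right, on="key")               # keys present in both frames
outer = pd.merge(left, right, on="key", how="outer")  # all keys, NaN where missing

print(inner)
print(outer)
```

The default is an inner join; `how` also accepts `"left"` and `"right"`.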


### Grouping


In [215]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                       'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three',
                             'two', 'two', 'one', 'three'],
                       'C' : np.random.randn(8),
                       'D' : np.random.randn(8)})

In [216]:
df

Unnamed: 0,A,B,C,D
0,foo,one,1.291539,-0.178794
1,bar,one,-0.884615,0.176141
2,foo,two,-0.426549,0.97886
3,bar,three,-0.610475,-0.867778
4,foo,two,0.535769,-0.44426
5,bar,two,0.557147,-1.722213
6,foo,one,-0.822591,-0.373227
7,foo,three,0.849102,0.143569


In [218]:
# count the rows in each (A, B) group and name the resulting column
df3 = df.groupby(['A','B']).size().reset_index(name="groupCount")

In [220]:
df3.head()

Unnamed: 0,A,B,groupCount
0,bar,one,1
1,bar,three,1
2,bar,two,1
3,foo,one,2
4,foo,three,1
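
Counting is only one aggregation. As a sketch of computing several statistics per group at once, the example below uses `agg` with a list of function names on a small frame with fixed, illustrative values.

```python
import pandas as pd

frame = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "C": [1, 2, 3, 4, 5],
})

sums = frame.groupby("A")["C"].sum()
summary = frame.groupby("A")["C"].agg(["count", "sum", "mean"])

print(sums)
print(summary)
```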


### Reshaping

In [None]:
# You can also stack or unstack levels

In [None]:
a = df.groupby(['A','B']).sum()

In [None]:
# Pivot Tables
pd.pivot_table(df,values=['C','D'],index=['A'],columns=['B'])
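
Since the frame above holds random values, here is a sketch of the same reshaping steps on fixed numbers (the names are illustrative). `unstack` moves an index level into the columns, `stack` reverses it, and `pivot_table` reaches the wide layout in one call, averaging duplicates by default.

```python
import pandas as pd

frame = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar"],
    "B": ["one", "two", "one", "two"],
    "C": [1, 2, 3, 4],
})

grouped = frame.groupby(["A", "B"])["C"].sum()  # Series with a two-level index
wide = grouped.unstack()                        # level B becomes the columns
long_again = wide.stack()                       # back to the long form
pivoted = pd.pivot_table(frame, values="C", index="A", columns="B")

print(wide)
print(pivoted)
```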

### Time Series


In [None]:
import pandas as pd
import numpy as np

In [None]:
# 100 Seconds starting on January 1st
rng = pd.date_range('1/1/2014', periods=100, freq='S')

In [None]:
# Give each second a random value
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [None]:
ts

In [None]:
# Built in resampling
ts.resample('1Min').mean() # downsample from one-second to one-minute bins, averaging each bin
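
To see what resampling does to the values, the sketch below replaces the random data with the integers 0 to 119, so each one-minute bin has a mean that can be checked by hand. It uses the lowercase aliases `'s'` and `'1min'`, on the assumption that a recent pandas release is installed.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2014-01-01', periods=120, freq='s')  # 120 seconds
ts = pd.Series(np.arange(120), index=rng)

per_minute = ts.resample('1min').mean()
print(per_minute.tolist())  # [29.5, 89.5]
```

The first bin averages 0..59 and the second averages 60..119.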

In [None]:
# Many additional time series features
# type "ts." and press Tab to browse the available methods

### Plotting


In [None]:
ts.plot()

In [None]:
def randwalk(startdate,points):
    ts = pd.Series(np.random.randn(points), index=pd.date_range(startdate, periods=points))
    ts=ts.cumsum()
    ts.plot()
    return ts

In [None]:
# Using pandas to make a simple random walker by repeatedly running:
a=randwalk('1/1/2012',1000)

In [None]:
# By default, the pandas plot function draws labeled axes and a legend

In [None]:
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot() # creates its own figure, so a separate plt.figure() call is unnecessary
plt.legend(loc='best')

### I/O
I/O is straightforward with, for example, pd.read_csv or df.to_csv
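
A minimal round trip, sketched with an in-memory buffer instead of a file on disk so that it runs anywhere. With a real file, pass a path such as a hypothetical `"data.csv"` in place of `buffer`.

```python
import io
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)  # index=False leaves the row labels out of the file
buffer.seek(0)                     # rewind before reading the text back

restored = pd.read_csv(buffer)
print(restored)
```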

#### The benefits of open source:

Let's look under x's in plt modules

# Next Steps

**Recommended Resources**

Name | Description
--- | ---
[Official Pandas Tutorials](http://pandas.pydata.org/pandas-docs/stable/10min.html) | Wes & Company's selection of tutorials and lectures
[Julia Evans Pandas Cookbook](https://github.com/jvns/pandas-cookbook) | Great resource with examples from weather, bikes and 311 calls
[Learn Pandas Tutorials](https://bitbucket.org/hrojas/learn-pandas) | A great series of Pandas tutorials from Dave Rojas
[Research Computing Python Data PYNBs](https://github.com/ResearchComputing/Meetup-Fall-2013/tree/master/python) | A super awesome set of python notebooks from a meetup-based course exclusively devoted to pandas