## Pandas
pandas consists of the following components:
• A set of labeled array data structures, the primary of which are Series and DataFrame

• Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing

• An integrated group by engine for aggregating and transforming data sets

• Date range generation (date_range) and custom date offsets enabling the implementation of customized frequencies

• Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format.

• Memory-efficient “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)

• Moving window statistics (rolling mean, rolling standard deviation, etc.)

• Static and moving window linear and panel regression


Data structures

Series: 1D labeled, homogeneously-typed array

DataFrame: general 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns

Panel: general 3D labeled, also size-mutable array (deprecated in pandas 0.20 and removed in 0.25)


The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Panel is a container for DataFrame objects. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
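
As a minimal sketch of the dictionary-like behaviour described above, columns can be inserted, looked up, and removed much like dictionary entries. The frame and column names here are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical frame: a DataFrame behaves like a dict of Series keyed by column name.
frame = pd.DataFrame({"a": [1, 2, 3]})

# Insert a new Series under a new key.
frame["b"] = pd.Series([10.0, 20.0, 30.0])

# Membership tests and lookups use the column labels.
has_b = "b" in frame
col_b = frame["b"]

# Remove a column with pop(), which returns the removed Series.
popped = frame.pop("a")
remaining = list(frame.columns)
print(remaining)
```

`del frame["b"]` would remove a column without returning it, mirroring `del` on a dictionary.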


When using ndarrays to store 2- and 3-dimensional data, a burden is placed on the user to consider the orientation of the data set when writing functions; axes are considered more or less equivalent. In pandas, the axes are intended to lend more semantic meaning to the data; i.e., for a particular data set there is likely to be a “right” way to orient the data. The goal, then, is to reduce the amount of mental effort required to code up data transformations in downstream functions.

For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1. And iterating through the columns of the DataFrame thus results in more readable code:

for col in df.columns:
    series = df[col]
    # do something with series



In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

### Object Creation

Creating a Series by passing a list of values, letting pandas create a default integer index, and then a DataFrame by passing a NumPy array with a datetime index and labeled columns:

In [26]:
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [28]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [29]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.89216,0.154374,2.121266,-0.246541
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745
2013-01-03,0.368814,-1.181423,0.344051,-0.772633
2013-01-04,0.284433,-0.731625,0.470674,-1.167195
2013-01-05,-1.146946,-0.358999,0.862938,1.067858
2013-01-06,0.396533,-0.52372,0.382615,0.562757


In [31]:
# Creating a DataFrame by passing a dict of objects that can be converted to Series-like structures.

df2 = pd.DataFrame({ 'A' : 1.,'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]), 'F' : 'foo' })
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [32]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

### Viewing Data

In [33]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,-0.89216,0.154374,2.121266,-0.246541
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745
2013-01-03,0.368814,-1.181423,0.344051,-0.772633
2013-01-04,0.284433,-0.731625,0.470674,-1.167195
2013-01-05,-1.146946,-0.358999,0.862938,1.067858


In [34]:
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,0.284433,-0.731625,0.470674,-1.167195
2013-01-05,-1.146946,-0.358999,0.862938,1.067858
2013-01-06,0.396533,-0.52372,0.382615,0.562757


In [35]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [36]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [37]:
df.values

array([[-0.89216   ,  0.15437446,  2.1212661 , -0.24654066],
       [ 1.32096766, -0.19476784, -0.74916941, -1.49074521],
       [ 0.36881362, -1.18142336,  0.34405092, -0.77263313],
       [ 0.28443339, -0.73162517,  0.47067385, -1.16719507],
       [-1.14694607, -0.35899936,  0.86293825,  1.06785817],
       [ 0.39653312, -0.52371969,  0.38261526,  0.5627568 ]])

In [38]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.055274,-0.472693,0.572062,-0.341083
std,0.917848,0.459419,0.930425,1.00014
min,-1.146946,-1.181423,-0.749169,-1.490745
25%,-0.598012,-0.679649,0.353692,-1.068555
50%,0.326624,-0.44136,0.426645,-0.509587
75%,0.389603,-0.235826,0.764872,0.360432
max,1.320968,0.154374,2.121266,1.067858


In [39]:
df.T

Unnamed: 0,2013-01-01 00:00:00,2013-01-02 00:00:00,2013-01-03 00:00:00,2013-01-04 00:00:00,2013-01-05 00:00:00,2013-01-06 00:00:00
A,-0.89216,1.320968,0.368814,0.284433,-1.146946,0.396533
B,0.154374,-0.194768,-1.181423,-0.731625,-0.358999,-0.52372
C,2.121266,-0.749169,0.344051,0.470674,0.862938,0.382615
D,-0.246541,-1.490745,-0.772633,-1.167195,1.067858,0.562757


In [42]:
# Sorting by an axis

df.sort_index(axis=1, ascending=False)



Unnamed: 0,D,C,B,A
2013-01-01,-0.246541,2.121266,0.154374,-0.89216
2013-01-02,-1.490745,-0.749169,-0.194768,1.320968
2013-01-03,-0.772633,0.344051,-1.181423,0.368814
2013-01-04,-1.167195,0.470674,-0.731625,0.284433
2013-01-05,1.067858,0.862938,-0.358999,-1.146946
2013-01-06,0.562757,0.382615,-0.52372,0.396533


In [46]:
# Sorting by values

df.sort_values(by='B', ascending=True)

Unnamed: 0,A,B,C,D
2013-01-03,0.368814,-1.181423,0.344051,-0.772633
2013-01-04,0.284433,-0.731625,0.470674,-1.167195
2013-01-06,0.396533,-0.52372,0.382615,0.562757
2013-01-05,-1.146946,-0.358999,0.862938,1.067858
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745
2013-01-01,-0.89216,0.154374,2.121266,-0.246541


### Selection 

#### Getting 

Selecting a single column, which yields a Series, equivalent to df.A

In [47]:
df['A']

2013-01-01   -0.892160
2013-01-02    1.320968
2013-01-03    0.368814
2013-01-04    0.284433
2013-01-05   -1.146946
2013-01-06    0.396533
Freq: D, Name: A, dtype: float64

In [48]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,-0.89216,0.154374,2.121266,-0.246541
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745
2013-01-03,0.368814,-1.181423,0.344051,-0.772633


In [49]:
df['20130102':'20130104']

Unnamed: 0,A,B,C,D
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745
2013-01-03,0.368814,-1.181423,0.344051,-0.772633
2013-01-04,0.284433,-0.731625,0.470674,-1.167195


####  Selection by Label

In [50]:
df.loc[dates[0]]

A   -0.892160
B    0.154374
C    2.121266
D   -0.246541
Name: 2013-01-01 00:00:00, dtype: float64

In [57]:
df.loc[dates[1]]

A    1.320968
B   -0.194768
C   -0.749169
D   -1.490745
Name: 2013-01-02 00:00:00, dtype: float64

In [63]:
## Selecting on a multi-axis by label

df.loc[:,['A','B']]

Unnamed: 0,A,B
2013-01-01,-0.89216,0.154374
2013-01-02,1.320968,-0.194768
2013-01-03,0.368814,-1.181423
2013-01-04,0.284433,-0.731625
2013-01-05,-1.146946,-0.358999
2013-01-06,0.396533,-0.52372


In [64]:
#label slicing, both endpoints are included

df.loc['20130101': '20130104',['A','B']]

Unnamed: 0,A,B
2013-01-01,-0.89216,0.154374
2013-01-02,1.320968,-0.194768
2013-01-03,0.368814,-1.181423
2013-01-04,0.284433,-0.731625


In [68]:
print(type(df.loc['20130101': '20130104',['A','B']]))
# Reduction in the dimensions of the returned object

df.loc['20130102',['A','B']]

print(df.loc['20130102',['A','B']])

print(type(df.loc['20130102',['A','B']]))



<class 'pandas.core.frame.DataFrame'>
A    1.320968
B   -0.194768
Name: 2013-01-02 00:00:00, dtype: float64
<class 'pandas.core.series.Series'>


In [69]:
# For getting a scalar value
df.loc[dates[0],'A']

-0.8921599988037986

In [70]:
# For getting fast access to a scalar (equiv to the prior method)

df.at[dates[0],'A']

-0.8921599988037986

#### Selection by Position

In [72]:
df.iloc[3]

A    0.284433
B   -0.731625
C    0.470674
D   -1.167195
Name: 2013-01-04 00:00:00, dtype: float64

In [73]:
# By integer slices, acting similar to numpy/python

df.iloc[3:5,0:2]


Unnamed: 0,A,B
2013-01-04,0.284433,-0.731625
2013-01-05,-1.146946,-0.358999


In [74]:
df.iloc[3:5]

Unnamed: 0,A,B,C,D
2013-01-04,0.284433,-0.731625,0.470674,-1.167195
2013-01-05,-1.146946,-0.358999,0.862938,1.067858


In [76]:
df.iloc[:,2:5]


Unnamed: 0,C,D
2013-01-01,2.121266,-0.246541
2013-01-02,-0.749169,-1.490745
2013-01-03,0.344051,-0.772633
2013-01-04,0.470674,-1.167195
2013-01-05,0.862938,1.067858
2013-01-06,0.382615,0.562757


In [77]:
df.iloc[:,2:6]

Unnamed: 0,C,D
2013-01-01,2.121266,-0.246541
2013-01-02,-0.749169,-1.490745
2013-01-03,0.344051,-0.772633
2013-01-04,0.470674,-1.167195
2013-01-05,0.862938,1.067858
2013-01-06,0.382615,0.562757


In [78]:
df.iloc[1:9,:]

Unnamed: 0,A,B,C,D
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745
2013-01-03,0.368814,-1.181423,0.344051,-0.772633
2013-01-04,0.284433,-0.731625,0.470674,-1.167195
2013-01-05,-1.146946,-0.358999,0.862938,1.067858
2013-01-06,0.396533,-0.52372,0.382615,0.562757


In [79]:
## By lists of integer position locations, similar to the numpy/python style

df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2013-01-02,1.320968,-0.749169
2013-01-03,0.368814,0.344051
2013-01-05,-1.146946,0.862938


In [80]:
df.iloc[[1,2,7],[0,2]]

IndexError: positional indexers are out-of-bounds

In [82]:
# For getting a value explicitly
df.iloc[1,1]

-0.1947678388344518
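
The label and position examples above differ in how slices treat their endpoints and in how out-of-range positions are handled. The following sketch restates those rules on a small stand-in frame (`frame` is a hypothetical name, filled with consecutive integers rather than the random data used above).

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Hypothetical stand-in for df, with deterministic values.
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slices include both endpoints: Jan 1 through Jan 4 is four rows.
by_label = frame.loc["20130101":"20130104"]

# Position slices exclude the stop position: rows 0, 1 and 2 only.
by_position = frame.iloc[0:3]

# A slice that runs past the end is truncated rather than raising...
clipped = frame.iloc[:, 2:10]

# ...but an explicit out-of-range position in a list raises IndexError.
try:
    frame.iloc[[1, 2, 7]]
    raised = False
except IndexError:
    raised = True

print(len(by_label), len(by_position), list(clipped.columns), raised)
```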

#### Boolean Indexing

In [84]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.89216,0.154374,2.121266,-0.246541
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745
2013-01-03,0.368814,-1.181423,0.344051,-0.772633
2013-01-04,0.284433,-0.731625,0.470674,-1.167195
2013-01-05,-1.146946,-0.358999,0.862938,1.067858
2013-01-06,0.396533,-0.52372,0.382615,0.562757


In [83]:
df[df.A >0]

Unnamed: 0,A,B,C,D
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745
2013-01-03,0.368814,-1.181423,0.344051,-0.772633
2013-01-04,0.284433,-0.731625,0.470674,-1.167195
2013-01-06,0.396533,-0.52372,0.382615,0.562757


In [85]:
df[df >0]

Unnamed: 0,A,B,C,D
2013-01-01,,0.154374,2.121266,
2013-01-02,1.320968,,,
2013-01-03,0.368814,,0.344051,
2013-01-04,0.284433,,0.470674,
2013-01-05,,,0.862938,1.067858
2013-01-06,0.396533,,0.382615,0.562757


In [86]:
### Using the isin() method for filtering:

df2 = df.copy()

In [87]:
df2

Unnamed: 0,A,B,C,D
2013-01-01,-0.89216,0.154374,2.121266,-0.246541
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745
2013-01-03,0.368814,-1.181423,0.344051,-0.772633
2013-01-04,0.284433,-0.731625,0.470674,-1.167195
2013-01-05,-1.146946,-0.358999,0.862938,1.067858
2013-01-06,0.396533,-0.52372,0.382615,0.562757


In [88]:
df2['E'] = ['one', 'one','two','three','four','three']

df2


In [90]:
df2[df2['E'].isin(['two','four']) ]

Unnamed: 0,A,B,C,D,E
2013-01-03,0.368814,-1.181423,0.344051,-0.772633,two
2013-01-05,-1.146946,-0.358999,0.862938,1.067858,four
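
Boolean masks such as the ones above can also be combined. This is a small sketch on made-up data (the frame and its values are assumptions for illustration); each condition is wrapped in parentheses and joined with `&`, `|`, or negated with `~`.

```python
import pandas as pd

# Hypothetical data, loosely mirroring the A and E columns used above.
frame = pd.DataFrame(
    {"A": [-1.0, 2.0, 3.0, -4.0], "E": ["one", "two", "three", "two"]}
)

# Rows where A is positive AND E equals "two".
positive_two = frame[(frame["A"] > 0) & (frame["E"] == "two")]

# Rows where E is NOT in the given list.
not_two = frame[~frame["E"].isin(["two"])]

print(positive_two.index.tolist(), not_two.index.tolist())
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.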


#### Setting 

Setting a new column automatically aligns the data by the indexes

In [97]:
s1 = pd.Series([1,2,3,4,5,6], index = pd.date_range('20130102', periods =6))
s1


2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [98]:
# Setting a new column: s1 is aligned on df's index, so 2013-01-01 gets NaN and 2013-01-07 is dropped
df['F'] = s1

In [99]:
df


Unnamed: 0,A,B,C,D,F
2013-01-01,-0.89216,0.154374,2.121266,-0.246541,
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745,1.0
2013-01-03,0.368814,-1.181423,0.344051,-0.772633,2.0
2013-01-04,0.284433,-0.731625,0.470674,-1.167195,3.0
2013-01-05,-1.146946,-0.358999,0.862938,1.067858,4.0
2013-01-06,0.396533,-0.52372,0.382615,0.562757,5.0


In [100]:
# Setting values by label

df.at[dates[0],'A'] =0

In [101]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.154374,2.121266,-0.246541,
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745,1.0
2013-01-03,0.368814,-1.181423,0.344051,-0.772633,2.0
2013-01-04,0.284433,-0.731625,0.470674,-1.167195,3.0
2013-01-05,-1.146946,-0.358999,0.862938,1.067858,4.0
2013-01-06,0.396533,-0.52372,0.382615,0.562757,5.0


In [102]:
#Setting values by position
df.iat[0,2] = 0

In [103]:
df


Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.154374,0.0,-0.246541,
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745,1.0
2013-01-03,0.368814,-1.181423,0.344051,-0.772633,2.0
2013-01-04,0.284433,-0.731625,0.470674,-1.167195,3.0
2013-01-05,-1.146946,-0.358999,0.862938,1.067858,4.0
2013-01-06,0.396533,-0.52372,0.382615,0.562757,5.0


In [105]:
#Setting by assigning with a numpy array

df.loc[:,'D'] = np.array([5]*len(df))

In [106]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.154374,0.0,5,
2013-01-02,1.320968,-0.194768,-0.749169,5,1.0
2013-01-03,0.368814,-1.181423,0.344051,5,2.0
2013-01-04,0.284433,-0.731625,0.470674,5,3.0
2013-01-05,-1.146946,-0.358999,0.862938,5,4.0
2013-01-06,0.396533,-0.52372,0.382615,5,5.0


In [107]:
# copy() must be called; a bare `df.copy` only references the method
df2 = df.copy()


In [108]:
df2


Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.154374,0.0,5,
2013-01-02,1.320968,-0.194768,-0.749169,5,1.0
2013-01-03,0.368814,-1.181423,0.344051,5,2.0
2013-01-04,0.284433,-0.731625,0.470674,5,3.0
2013-01-05,-1.146946,-0.358999,0.862938,5,4.0
2013-01-06,0.396533,-0.52372,0.382615,5,5.0


In [113]:
# Keeps the entries greater than 0; all other entries become NaN
df2[df2 > 0]

#### Missing Data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [115]:
df1 = df.reindex(index = dates[0:4], columns= list(df.columns)+['E'])

In [118]:
df1
df1.loc[dates[0]:dates[1],'E'] = 1
df1

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.154374,0.0,5,,1.0
2013-01-02,1.320968,-0.194768,-0.749169,5,1.0,1.0
2013-01-03,0.368814,-1.181423,0.344051,5,2.0,
2013-01-04,0.284433,-0.731625,0.470674,5,3.0,


In [119]:
# To drop any rows that have missing data.
df1.dropna(how = 'any')

Unnamed: 0,A,B,C,D,F,E
2013-01-02,1.320968,-0.194768,-0.749169,5,1.0,1.0


In [121]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.154374,0.0,5,5.0,1.0
2013-01-02,1.320968,-0.194768,-0.749169,5,1.0,1.0
2013-01-03,0.368814,-1.181423,0.344051,5,2.0,5.0
2013-01-04,0.284433,-0.731625,0.470674,5,3.0,5.0


In [123]:
# To get the boolean mask where values are nan

pd.isnull(df1)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,False,False,False,False,True,False
2013-01-02,False,False,False,False,False,False
2013-01-03,False,False,False,False,False,True
2013-01-04,False,False,False,False,False,True
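
The two statements at the start of this section (missing values are skipped in computations by default, and reindexing returns a copy) can be checked on a tiny Series. This is a sketch with made-up values; `values` and `extended` are hypothetical names.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# NaN is ignored by default...
total_skipping = values.sum()
mean_skipping = values.mean()

# ...unless skipna=False is passed, in which case NaN propagates.
total_strict = values.sum(skipna=False)

# reindex returns a new object; labels not present before are filled with NaN,
# and the original Series keeps its length.
extended = values.reindex([0, 1, 2, 3])

print(total_skipping, mean_skipping, total_strict, len(values), len(extended))
```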


#### Operations 
Operations in general exclude missing data. 

Performing a descriptive statistic

In [124]:
df.mean()

A    0.203967
B   -0.472693
C    0.218518
D    5.000000
F    3.000000
dtype: float64

In [125]:
# Same operation on the other axis

df.mean(1)

2013-01-01    1.288594
2013-01-02    1.275406
2013-01-03    1.306288
2013-01-04    1.604696
2013-01-05    1.671399
2013-01-06    2.051086
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.

In [134]:
s = pd.Series([1,3,5,np.nan,6,8], index=dates)
s

2013-01-01    1.0
2013-01-02    3.0
2013-01-03    5.0
2013-01-04    NaN
2013-01-05    6.0
2013-01-06    8.0
Freq: D, dtype: float64

In [135]:
s= s.shift(2)
s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [129]:
help(df.sub)

Help on method sub in module pandas.core.ops:

sub(other, axis='columns', level=None, fill_value=None) method of pandas.core.frame.DataFrame instance
    Subtraction of dataframe and other, element-wise (binary operator `sub`).
    
    Equivalent to ``dataframe - other``, but with support to substitute a fill_value for
    missing data in one of the inputs.
    
    Parameters
    ----------
    other : Series, DataFrame, or constant
    axis : {0, 1, 'index', 'columns'}
        For Series input, axis to match Series index on
    level : int or name
        Broadcast across a level, matching Index values on the
        passed MultiIndex level
    fill_value : None or float value, default None
        Fill existing missing (NaN) values, and any new element needed for
        successful DataFrame alignment, with this value before computation.
        If data in both corresponding DataFrame locations is missing
        the result will be missing
    
    Notes
    -----
    Mismatched in

In [137]:
df


Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.154374,0.0,5,
2013-01-02,1.320968,-0.194768,-0.749169,5,1.0
2013-01-03,0.368814,-1.181423,0.344051,5,2.0
2013-01-04,0.284433,-0.731625,0.470674,5,3.0
2013-01-05,-1.146946,-0.358999,0.862938,5,4.0
2013-01-06,0.396533,-0.52372,0.382615,5,5.0


In [136]:
df.sub(s,axis = 'index')

Unnamed: 0,A,B,C,D,F
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,-0.631186,-2.181423,-0.655949,4.0,1.0
2013-01-04,-2.715567,-3.731625,-2.529326,2.0,0.0
2013-01-05,-6.146946,-5.358999,-4.137062,0.0,-1.0
2013-01-06,,,,,
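
To make the alignment and broadcasting in `df.sub(s, axis='index')` easier to follow, here is the same operation on a small deterministic example. The frame and Series below are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20130101", periods=3)
# Hypothetical data with easy-to-check values.
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]}, index=idx)
offsets = pd.Series([1.0, np.nan, 3.0], index=idx)

# axis="index" matches the Series labels against the row labels, then
# subtracts the matched value from every column in that row.
result = frame.sub(offsets, axis="index")
print(result)
```

Rows whose matching Series value is NaN come out as NaN in every column, which is why the first two rows and the last row of the output above are empty.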


#### Apply 

Applying functions to the data:

In [138]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.154374,0.0,5,
2013-01-02,1.320968,-0.040393,-0.749169,10,1.0
2013-01-03,1.689781,-1.221817,-0.405118,15,3.0
2013-01-04,1.974215,-1.953442,0.065555,20,6.0
2013-01-05,0.827269,-2.312441,0.928494,25,10.0
2013-01-06,1.223802,-2.836161,1.311109,30,15.0


In [140]:
df.apply(lambda x: x.max() - x.min())

A    2.467914
B    1.335798
C    1.612108
D    0.000000
F    4.000000
dtype: float64

In [141]:
type(df.apply(lambda x: x.max() - x.min()))

pandas.core.series.Series
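
By default `apply` passes each column to the function, which is why the result above has one entry per column. A short sketch on made-up data shows the row-wise variant via `axis=1`.

```python
import pandas as pd

# Hypothetical frame with small integer values.
frame = pd.DataFrame({"A": [1, 4, 2], "B": [10, 30, 20]})

# Default axis=0: the function receives each column as a Series.
col_range = frame.apply(lambda x: x.max() - x.min())

# axis=1: the function receives each row as a Series instead.
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1)

print(col_range.to_dict(), row_range.tolist())
```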

#### Histogramming 

In [143]:
s = pd.Series(np.random.randint(0,7, size =10))
s

0    4
1    5
2    1
3    5
4    3
5    2
6    1
7    1
8    2
9    4
dtype: int64

In [144]:
s.value_counts()

1    3
5    2
4    2
2    2
3    1
dtype: int64

####  String methods

In [147]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s

0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object

In [146]:
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

#### Merge

pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

Concatenating pandas objects together with concat():

In [150]:
df  = pd.DataFrame(np.random.randn(10,4))

In [151]:
df

Unnamed: 0,0,1,2,3
0,1.264577,0.051834,-1.144204,1.207308
1,1.161618,-1.051908,-0.200668,-0.173539
2,0.11575,-0.119277,-1.976578,-0.051619
3,-0.538089,-0.555356,-0.103677,1.183125
4,-0.055464,-0.008104,-0.148013,1.566107
5,0.551317,-1.801087,-0.957699,0.511444
6,0.069443,-1.318194,-0.458085,-0.911792
7,1.452841,0.732603,-0.421263,0.142031
8,-0.048391,0.408418,-0.711116,-1.399384
9,0.917506,-1.253084,-0.415187,-1.12451


In [154]:
pieces = [df[:3], df[3:7], df[7:]]

In [155]:
pieces

[          0         1         2         3
 0  1.264577  0.051834 -1.144204  1.207308
 1  1.161618 -1.051908 -0.200668 -0.173539
 2  0.115750 -0.119277 -1.976578 -0.051619,
           0         1         2         3
 3 -0.538089 -0.555356 -0.103677  1.183125
 4 -0.055464 -0.008104 -0.148013  1.566107
 5  0.551317 -1.801087 -0.957699  0.511444
 6  0.069443 -1.318194 -0.458085 -0.911792,
           0         1         2         3
 7  1.452841  0.732603 -0.421263  0.142031
 8 -0.048391  0.408418 -0.711116 -1.399384
 9  0.917506 -1.253084 -0.415187 -1.124510]

In [156]:
pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,1.264577,0.051834,-1.144204,1.207308
1,1.161618,-1.051908,-0.200668,-0.173539
2,0.11575,-0.119277,-1.976578,-0.051619
3,-0.538089,-0.555356,-0.103677,1.183125
4,-0.055464,-0.008104,-0.148013,1.566107
5,0.551317,-1.801087,-0.957699,0.511444
6,0.069443,-1.318194,-0.458085,-0.911792
7,1.452841,0.732603,-0.421263,0.142031
8,-0.048391,0.408418,-0.711116,-1.399384
9,0.917506,-1.253084,-0.415187,-1.12451


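#### Join

SQL-style merges with merge(). Because the key 'foo' appears twice in each frame below, the result contains every left/right pairing (2 × 2 = 4 rows).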



In [158]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left

Unnamed: 0,key,lval
0,foo,1
1,foo,2


In [159]:
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right

Unnamed: 0,key,rval
0,foo,4
1,foo,5


In [161]:
pd.merge(left,right, on ='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5
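
For contrast with the duplicate-key case above, the following sketch merges frames whose keys only partly overlap. The key values `bar` and `baz` are made up for illustration; `how` is a standard `pd.merge` parameter.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

# Default how="inner": only keys present on both sides survive.
inner = pd.merge(left, right, on="key")

# how="outer": keys from either side are kept; missing values become NaN.
outer = pd.merge(left, right, on="key", how="outer")

print(inner)
print(outer)
```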


#### Append

In [163]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
0,1.105848,-1.603355,-1.182758,-0.904533
1,-0.504589,-0.27093,-0.968451,0.614221
2,-0.887285,-0.95648,0.670953,-0.336781
3,0.779374,-0.792451,0.108343,0.253346
4,-0.700572,-0.412557,3.046421,-1.225675
5,1.927459,0.43529,0.337455,-1.015704
6,-2.093842,0.401445,-0.045044,0.308447
7,-0.328095,-1.25382,-2.396085,-0.292637


In [165]:
s = df.iloc[3]
s


A    0.779374
B   -0.792451
C    0.108343
D    0.253346
Name: 3, dtype: float64

In [166]:
df.append(s, ignore_index=True)

Unnamed: 0,A,B,C,D
0,1.105848,-1.603355,-1.182758,-0.904533
1,-0.504589,-0.27093,-0.968451,0.614221
2,-0.887285,-0.95648,0.670953,-0.336781
3,0.779374,-0.792451,0.108343,0.253346
4,-0.700572,-0.412557,3.046421,-1.225675
5,1.927459,0.43529,0.337455,-1.015704
6,-2.093842,0.401445,-0.045044,0.308447
7,-0.328095,-1.25382,-2.396085,-0.292637
8,0.779374,-0.792451,0.108343,0.253346
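
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the call above fails on current releases. A minimal sketch of the same operation with `pd.concat`, using a small made-up frame in place of `df`:

```python
import pandas as pd

# Hypothetical stand-in for df.
frame = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
row = frame.iloc[1]

# Turn the row Series into a one-row DataFrame, then concatenate.
# ignore_index=True renumbers the result 0..n-1, as in the append call.
appended = pd.concat([frame, row.to_frame().T], ignore_index=True)
print(appended)
```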


#### Grouping 

By “group by” we are referring to a process involving one or more of the following steps:

• Splitting the data into groups based on some criteria

• Applying a function to each group independently

• Combining the results into a data structure
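
The three steps above can be spelled out by hand and compared with the built-in aggregation. This is only an illustrative sketch on made-up data, not how pandas implements groupby internally.

```python
import pandas as pd

# Hypothetical data with two groups.
frame = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]}
)

# Split: iterating a groupby yields (key, sub-frame) pairs.
# Apply: compute one number per group.
partial = {key: group["C"].sum() for key, group in frame.groupby("A")}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(partial, name="C")

# The built-in aggregation performs all three steps in one call.
builtin = frame.groupby("A")["C"].sum()

print(manual.to_dict(), builtin.to_dict())
```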


In [168]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                    'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                    'C' : np.random.randn(8),
                    'D' : np.random.randn(8)})
df

Unnamed: 0,A,B,C,D
0,foo,one,-0.304921,0.731358
1,bar,one,0.049356,-1.087067
2,foo,two,0.19349,1.046361
3,bar,three,-1.462271,-2.119326
4,foo,two,-0.144763,-0.13542
5,bar,two,-0.357238,-0.464117
6,foo,one,-0.348356,0.558083
7,foo,three,-1.353866,0.301702


In [169]:
# Grouping and then applying a function sum to the resulting groups.
df.groupby('A').sum()

Unnamed: 0_level_0,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,-1.770153,-3.67051
foo,-1.958416,2.502084


In [171]:
df.groupby(['A','B']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.049356,-1.087067
bar,three,-1.462271,-2.119326
bar,two,-0.357238,-0.464117
foo,one,-0.653277,1.289441
foo,three,-1.353866,0.301702
foo,two,0.048727,0.91094


#### Stack

In [172]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))

In [173]:
tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [175]:
index = pd.MultiIndex.from_tuples(tuples,names=['first','second'])
index


MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [177]:
df =pd.DataFrame(np.random.randn(8,2),index = index , columns= ['A','B'])
df


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-0.266321,-0.366412
bar,two,-0.019614,1.124003
baz,one,0.651159,-1.559394
baz,two,0.82872,0.192244
foo,one,-0.12757,-0.781729
foo,two,0.196075,1.12858
qux,one,2.930268,0.700657
qux,two,-0.842797,-2.056811


In [179]:
df2 = df[:4]
df2

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-0.266321,-0.366412
bar,two,-0.019614,1.124003
baz,one,0.651159,-1.559394
baz,two,0.82872,0.192244


In [None]:
# The stack() method “compresses” a level in the DataFrame’s columns.


In [181]:
stacked = df2.stack()
stacked

first  second   
bar    one     A   -0.266321
               B   -0.366412
       two     A   -0.019614
               B    1.124003
baz    one     A    0.651159
               B   -1.559394
       two     A    0.828720
               B    0.192244
dtype: float64

In [182]:
stacked.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-0.266321,-0.366412
bar,two,-0.019614,1.124003
baz,one,0.651159,-1.559394
baz,two,0.82872,0.192244


In [183]:
stacked.unstack(1)

Unnamed: 0_level_0,second,one,two
first,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,A,-0.266321,-0.019614
bar,B,-0.366412,1.124003
baz,A,0.651159,0.82872
baz,B,-1.559394,0.192244


In [184]:
stacked.unstack(0)


Unnamed: 0_level_0,first,bar,baz
second,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,-0.266321,0.651159
one,B,-0.366412,-1.559394
two,A,-0.019614,0.82872
two,B,1.124003,0.192244
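
The stack and unstack calls above are inverses of each other when no level is specified. A small sketch with integer values (made up so the results are easy to check) shows the round trip and unstacking a named level.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")],
    names=["first", "second"],
)
# Hypothetical values in place of the random data used above.
frame = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}, index=index)

# stack(): the column labels become the innermost index level.
stacked = frame.stack()

# unstack() with no argument pivots the last level back into columns.
restored = stacked.unstack()

# A level can also be chosen by name (or by position, as in unstack(1)).
by_second = stacked.unstack("second")

print(by_second)
```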


#### Pivot Tables

In [186]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
'B' : ['A', 'B', 'C'] * 4,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D' : np.random.randn(12),
'E' : np.random.randn(12)})

df

Unnamed: 0,A,B,C,D,E
0,one,A,foo,0.704409,-1.917265
1,one,B,foo,-0.866929,-0.4228
2,two,C,foo,-0.499975,0.261985
3,three,A,bar,1.070945,-1.017651
4,one,B,bar,0.086886,1.750031
5,one,C,bar,-0.731205,-0.909896
6,two,A,foo,-1.583108,1.224982
7,three,B,foo,0.451708,2.405149
8,one,C,foo,1.481875,0.705291
9,one,A,bar,0.106135,0.405283


In [187]:
help(pd.pivot_table)

Help on function pivot_table in module pandas.core.reshape.pivot:

pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
    Create a spreadsheet-style pivot table as a DataFrame. The levels in
    the pivot table will be stored in MultiIndex objects (hierarchical
    indexes) on the index and columns of the result DataFrame
    
    Parameters
    ----------
    data : DataFrame
    values : column to aggregate, optional
    index : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table index.  If an array is passed,
        it is being used as the same manner as column values.
    columns : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can c

In [188]:
pd.pivot_table(df,values='D',index =['A','B'],columns= ['C'])

Unnamed: 0_level_0,C,bar,foo
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,0.106135,0.704409
one,B,0.086886,-0.866929
one,C,-0.731205,1.481875
three,A,1.070945,
three,B,,0.451708
three,C,0.363401,
two,A,,-1.583108
two,B,-1.309732,
two,C,,-0.499975
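
Since `aggfunc` defaults to `'mean'`, the pivot table above holds the mean of `D` for each combination of index and column keys, with NaN where a combination does not occur. The sketch below, on made-up data, compares it with an equivalent groupby followed by unstack.

```python
import pandas as pd

# Hypothetical data: the ("two", "bar") combination is deliberately absent.
frame = pd.DataFrame(
    {
        "A": ["one", "one", "two", "two"],
        "C": ["foo", "bar", "foo", "foo"],
        "D": [1.0, 2.0, 3.0, 5.0],
    }
)

# Mean of D for each (A, C) combination.
table = pd.pivot_table(frame, values="D", index="A", columns="C")

# The same numbers via groupby + unstack.
grouped = frame.groupby(["A", "C"])["D"].mean().unstack("C")

print(table)
```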


#### Time Series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data).

In [190]:
rng = pd.date_range('1/1/2012', periods=100, freq='S')
rng


DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
               '2012-01-01 00:00:02', '2012-01-01 00:00:03',
               '2012-01-01 00:00:04', '2012-01-01 00:00:05',
               '2012-01-01 00:00:06', '2012-01-01 00:00:07',
               '2012-01-01 00:00:08', '2012-01-01 00:00:09',
               '2012-01-01 00:00:10', '2012-01-01 00:00:11',
               '2012-01-01 00:00:12', '2012-01-01 00:00:13',
               '2012-01-01 00:00:14', '2012-01-01 00:00:15',
               '2012-01-01 00:00:16', '2012-01-01 00:00:17',
               '2012-01-01 00:00:18', '2012-01-01 00:00:19',
               '2012-01-01 00:00:20', '2012-01-01 00:00:21',
               '2012-01-01 00:00:22', '2012-01-01 00:00:23',
               '2012-01-01 00:00:24', '2012-01-01 00:00:25',
               '2012-01-01 00:00:26', '2012-01-01 00:00:27',
               '2012-01-01 00:00:28', '2012-01-01 00:00:29',
               '2012-01-01 00:00:30', '2012-01-01 00:00:31',
               '2012-01-

In [191]:
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts

2012-01-01 00:00:00    430
2012-01-01 00:00:01     54
2012-01-01 00:00:02    276
2012-01-01 00:00:03    195
2012-01-01 00:00:04    223
2012-01-01 00:00:05    432
2012-01-01 00:00:06    439
2012-01-01 00:00:07    355
2012-01-01 00:00:08    406
2012-01-01 00:00:09    293
2012-01-01 00:00:10    145
2012-01-01 00:00:11    468
2012-01-01 00:00:12    203
2012-01-01 00:00:13     89
2012-01-01 00:00:14     36
2012-01-01 00:00:15     27
2012-01-01 00:00:16     43
2012-01-01 00:00:17     79
2012-01-01 00:00:18    435
2012-01-01 00:00:19    373
2012-01-01 00:00:20    374
2012-01-01 00:00:21     91
2012-01-01 00:00:22     22
2012-01-01 00:00:23    282
2012-01-01 00:00:24     68
2012-01-01 00:00:25    391
2012-01-01 00:00:26    436
2012-01-01 00:00:27    403
2012-01-01 00:00:28     62
2012-01-01 00:00:29    444
                      ... 
2012-01-01 00:01:10    307
2012-01-01 00:01:11    462
2012-01-01 00:01:12     21
2012-01-01 00:01:13    232
2012-01-01 00:01:14     84
2012-01-01 00:01:15    151
2

In [192]:
ts.resample('5Min').sum()

2012-01-01    25625
Freq: 5T, dtype: int64
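
In the example above all 100 seconds fall into a single 5-minute bin, so the effect of resampling is hard to see. The following sketch uses ten minutes of constant data (an assumption made so the bin sums are obvious), and passes the one-second step as a `pd.Timedelta` instead of a frequency string.

```python
import numpy as np
import pandas as pd

# 600 one-second samples, all equal to 1, starting at midnight (made-up data).
rng = pd.date_range("2012-01-01", periods=600, freq=pd.Timedelta(seconds=1))
ts = pd.Series(np.ones(len(rng), dtype=int), index=rng)

# Each 5-minute bin collects 300 one-second samples.
five_min = ts.resample("5min").sum()
print(five_min)
```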

In [195]:
## Time zone representation
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

2012-03-06    2.260751
2012-03-07   -0.149555
2012-03-08   -0.336333
2012-03-09    0.737622
2012-03-10   -1.741359
Freq: D, dtype: float64

In [196]:
ts_utc = ts.tz_localize('UTC')

In [197]:
ts_utc

2012-03-06 00:00:00+00:00    2.260751
2012-03-07 00:00:00+00:00   -0.149555
2012-03-08 00:00:00+00:00   -0.336333
2012-03-09 00:00:00+00:00    0.737622
2012-03-10 00:00:00+00:00   -1.741359
Freq: D, dtype: float64

In [198]:
ts_utc.tz_convert('US/Eastern')

2012-03-05 19:00:00-05:00    2.260751
2012-03-06 19:00:00-05:00   -0.149555
2012-03-07 19:00:00-05:00   -0.336333
2012-03-08 19:00:00-05:00    0.737622
2012-03-09 19:00:00-05:00   -1.741359
Freq: D, dtype: float64

In [202]:
## Converting between time span representations
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-01-31    0.115396
2012-02-29   -1.487858
2012-03-31    2.495916
2012-04-30    1.396403
2012-05-31    0.521523
Freq: M, dtype: float64

In [204]:
ps = ts.to_period()
ps

2012-01    0.115396
2012-02   -1.487858
2012-03    2.495916
2012-04    1.396403
2012-05    0.521523
Freq: M, dtype: float64

In [205]:
ps.to_timestamp()

2012-01-01    0.115396
2012-02-01   -1.487858
2012-03-01    2.495916
2012-04-01    1.396403
2012-05-01    0.521523
Freq: MS, dtype: float64

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:

In [206]:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

In [207]:
ts = pd.Series(np.random.randn(len(prng)), prng)

In [211]:
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

In [212]:
ts.head()

1990-03-01 09:00    1.292438
1990-06-01 09:00    0.071346
1990-09-01 09:00    0.303950
1990-12-01 09:00   -1.486121
1991-03-01 09:00    0.999761
Freq: H, dtype: float64
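
The chained index expression above is dense. As a rough sketch, the same steps can be walked through on a single `Period`. The variable names are illustrative, and the 9am shift is done here with a `Timedelta` after converting to a timestamp, which is a slightly different route from the period arithmetic used above:

```python
import pandas as pd

# First quarter of a fiscal year that ends in November.
q = pd.Period("1990Q1", freq="Q-NOV")

# Last month of that quarter.
last_month = q.asfreq("M", how="end")

# The month following the quarter end.
next_month = last_month + 1

# Start of that month, shifted to 9am.
stamp = next_month.to_timestamp(how="start") + pd.Timedelta(hours=9)

print(last_month, next_month, stamp)
```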

#### Categoricals

In [224]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

In [214]:
df


Unnamed: 0,id,raw_grade
0,1,a
1,2,b
2,3,b
3,4,a
4,5,a
5,6,e


In [225]:
df['grade'] = df['raw_grade'].astype("category")


In [226]:
df

Unnamed: 0,id,raw_grade,grade
0,1,a,a
1,2,b,b
2,3,b,b
3,4,a,a
4,5,a,a
5,6,e,e


In [227]:

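# Rename the categories to more meaningful names (a, b, e map to the new names in order).
df['grade'] = df['grade'].cat.rename_categories(["Very_good", "Good", "Very_bad"])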
df

Unnamed: 0,id,raw_grade,grade
0,1,a,Very_good
1,2,b,Good
2,3,b,Good
3,4,a,Very_good
4,5,a,Very_good
5,6,e,Very_bad


In [222]:
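# Reorder the categories and add the missing ones; the names must match the existing categories.
df["grade"] = df["grade"].cat.set_categories(["Very_bad", "Bad", "Medium", "Good", "Very_good"])

In [223]:
df


Unnamed: 0,id,raw_grade,grade
0,1,a,Very_good
1,2,b,Good
2,3,b,Good
3,4,a,Very_good
4,5,a,Very_good
5,6,e,Very_bad


In [221]:
df.sort_values(by="grade")

Unnamed: 0,id,raw_grade,grade
5,6,e,Very_bad
1,2,b,Good
2,3,b,Good
0,1,a,Very_good
3,4,a,Very_good
4,5,a,Very_good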
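
One thing to watch with `set_categories`: it matches categories by name, so any value whose current category is missing from the new list becomes NaN. A minimal sketch on a small made-up Series (all names here are illustrative) shows both the matching and the mismatching case, and that sorting follows category order rather than alphabetical order:

```python
import pandas as pd

grades = pd.Series(["a", "b", "e"], dtype="category")

# Rename positionally: a -> very good, b -> good, e -> very bad.
renamed = grades.cat.rename_categories(["very good", "good", "very bad"])

# New category list contains every existing name: no values are lost.
reordered = renamed.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

# Names differ (capitalisation), so nothing matches and every value becomes NaN.
mismatched = renamed.cat.set_categories(["Very bad", "Good", "Very good"])

print(reordered.sort_values().tolist())
print(mismatched.isna().tolist())
```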


#### Plotting

In [228]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

In [229]:
ts = ts.cumsum()

In [230]:
ts.plot()


<matplotlib.axes._subplots.AxesSubplot at 0x116a7a828>

In [231]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])


In [232]:
df = df.cumsum()

In [233]:
plt.figure();

In [235]:
df.plot(); 

In [236]:
plt.legend(loc='best')

<matplotlib.legend.Legend at 0x116c45908>

#### Getting Data In/Out

In [237]:
df.to_csv('foo.csv')

In [238]:
pd.read_csv('foo.csv')

Unnamed: 0.1,Unnamed: 0,A,B,C,D
0,2000-01-01,0.827621,-0.281782,-0.166937,-0.869104
1,2000-01-02,1.398077,-0.395234,1.625438,-0.765463
2,2000-01-03,2.920340,-1.696290,1.647347,-1.240526
3,2000-01-04,3.787990,-2.873107,1.641727,-1.586159
4,2000-01-05,2.888086,-1.652603,1.769317,-2.383967
5,2000-01-06,3.108331,-3.507232,0.434552,-1.243871
6,2000-01-07,3.724756,-4.451210,0.431474,-2.480920
7,2000-01-08,3.887005,-7.194912,-0.180652,-2.935083
8,2000-01-09,4.488749,-6.776646,0.392486,-2.329330
9,2000-01-10,5.351117,-5.798648,0.975288,-2.341406


#### HDF5

In [239]:
df.to_hdf('foo.h5','df')

In [240]:
pd.read_hdf('foo.h5','df')

Unnamed: 0,A,B,C,D
2000-01-01,0.827621,-0.281782,-0.166937,-0.869104
2000-01-02,1.398077,-0.395234,1.625438,-0.765463
2000-01-03,2.920340,-1.696290,1.647347,-1.240526
2000-01-04,3.787990,-2.873107,1.641727,-1.586159
2000-01-05,2.888086,-1.652603,1.769317,-2.383967
2000-01-06,3.108331,-3.507232,0.434552,-1.243871
2000-01-07,3.724756,-4.451210,0.431474,-2.480920
2000-01-08,3.887005,-7.194912,-0.180652,-2.935083
2000-01-09,4.488749,-6.776646,0.392486,-2.329330
2000-01-10,5.351117,-5.798648,0.975288,-2.341406


#### Excel

In [242]:
df.to_excel('foo.xlsx', sheet_name='Sheet1')


In [243]:
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])


Unnamed: 0,A,B,C,D
2000-01-01,0.827621,-0.281782,-0.166937,-0.869104
2000-01-02,1.398077,-0.395234,1.625438,-0.765463
2000-01-03,2.920340,-1.696290,1.647347,-1.240526
2000-01-04,3.787990,-2.873107,1.641727,-1.586159
2000-01-05,2.888086,-1.652603,1.769317,-2.383967
2000-01-06,3.108331,-3.507232,0.434552,-1.243871
2000-01-07,3.724756,-4.451210,0.431474,-2.480920
2000-01-08,3.887005,-7.194912,-0.180652,-2.935083
2000-01-09,4.488749,-6.776646,0.392486,-2.329330
2000-01-10,5.351117,-5.798648,0.975288,-2.341406



## Pandas Memory usage

In [209]:
import numpy as np
import pandas as pd

dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]',
          'complex128', 'object', 'bool']
n = 5000

# One column of n random integers per dtype, each cast to that dtype.
data = {t: np.random.randint(100, size=n).astype(t) for t in dtypes}
print("Data is", type(data['complex128']))
print("Data is", data['complex128'].size)

df = pd.DataFrame(data)
df['categorical'] = df['object'].astype('category')
df.info()

# The + symbol in the reported memory usage indicates that the true usage could be
# higher, because pandas does not count the memory used by values in columns with dtype=object.

Data is <class 'numpy.ndarray'>
Data is 5000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
bool               5000 non-null bool
complex128         5000 non-null complex128
datetime64[ns]     5000 non-null datetime64[ns]
float64            5000 non-null float64
int64              5000 non-null int64
object             5000 non-null object
timedelta64[ns]    5000 non-null timedelta64[ns]
categorical        5000 non-null category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 289.1+ KB


In [210]:
df.head()

Unnamed: 0,bool,complex128,datetime64[ns],float64,int64,object,timedelta64[ns],categorical
0,True,(18+0j),1970-01-01 00:00:00.000000094,71.0,50,87,00:00:00.000000,87
1,True,(75+0j),1970-01-01 00:00:00.000000049,70.0,23,57,00:00:00.000000,57
2,True,(16+0j),1970-01-01 00:00:00.000000068,1.0,1,25,00:00:00.000000,25
3,True,(30+0j),1970-01-01 00:00:00.000000069,26.0,31,50,00:00:00.000000,50
4,True,(83+0j),1970-01-01 00:00:00.000000052,90.0,80,43,00:00:00.000000,43


Passing memory_usage='deep' enables a more accurate memory usage report that accounts for the full usage of the contained objects. This is optional because the deeper introspection can be expensive.


The memory usage of each column can be found by calling the memory_usage method, which returns a Series indexed by column name with each column's memory usage in bytes. Summing that Series gives the total memory usage of the DataFrame, and passing index=False leaves the index out of the report:


In [18]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
bool               5000 non-null bool
complex128         5000 non-null complex128
datetime64[ns]     5000 non-null datetime64[ns]
float64            5000 non-null float64
int64              5000 non-null int64
object             5000 non-null object
timedelta64[ns]    5000 non-null timedelta64[ns]
categorical        5000 non-null category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 425.6 KB


In [19]:
df.memory_usage()

Index                 80
bool                5000
complex128         80000
datetime64[ns]     40000
float64            40000
int64              40000
object             40000
timedelta64[ns]    40000
categorical        10920
dtype: int64

In [20]:
df.memory_usage().sum()

296000

In [21]:
df.memory_usage(index=False)

bool                5000
complex128         80000
datetime64[ns]     40000
float64            40000
int64              40000
object             40000
timedelta64[ns]    40000
categorical        10920
dtype: int64
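
The gap between the shallow and deep figures comes from object columns, where the shallow report counts only the references to the Python objects. A minimal sketch with made-up data (the names `obj`, `cat`, `shallow`, `deep` and `cat_deep` are illustrative) compares the two reports and the effect of converting to a categorical:

```python
import pandas as pd

# Illustrative column of repeated Python strings, stored with dtype=object.
obj = pd.Series(["low", "medium", "high"] * 1000, dtype=object)
cat = obj.astype("category")

# Shallow: counts only the array of object references.
shallow = obj.memory_usage(index=False)

# Deep: also counts the memory of the string objects themselves.
deep = obj.memory_usage(index=False, deep=True)

# Categorical: small integer codes plus one copy of each distinct string.
cat_deep = cat.memory_usage(index=False, deep=True)

print(shallow, deep, cat_deep)
```

With only three distinct values, the categorical version should come out far smaller than either object figure. The exact byte counts depend on the platform and Python version.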

In [1]:
## Essential basic functionality

In [3]:
import pandas as pd
import numpy as np
index = pd.date_range('1/1/2000', periods=8)

In [4]:
index

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08'],
              dtype='datetime64[ns]', freq='D')

In [5]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [6]:
df = pd.DataFrame(np.random.randn(8, 3), index=index,
                  columns=['A', 'B', 'C'])

In [7]:
df


Unnamed: 0,A,B,C
2000-01-01,0.80302,1.004358,0.145901
2000-01-02,-1.187186,-0.212143,0.020759
2000-01-03,-0.071699,0.831925,0.001975
2000-01-04,-0.505624,0.959901,-0.292569
2000-01-05,0.74626,-0.589335,-1.074199
2000-01-06,1.695816,-1.505707,-0.094453
2000-01-07,0.975223,1.537385,-1.299893
2000-01-08,-0.587501,1.131928,0.821056


In [8]:
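# Note: Panel was deprecated in pandas 0.20 and removed in 0.25, so this cell needs an older pandas.
wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
              major_axis=pd.date_range('1/1/2000', periods=5),
              minor_axis=['A', 'B', 'C', 'D'])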


In [9]:
wp

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

In [10]:
long_series = pd.Series(np.random.randn(1000))

In [11]:
long_series.head()

0    0.391092
1   -0.645428
2   -0.375977
3    1.329915
4    0.331439
dtype: float64

In [12]:
long_series.tail(3)

997   -0.735797
998   -0.371751
999    1.087176
dtype: float64

In [13]:
df[:2]

Unnamed: 0,A,B,C
2000-01-01,0.80302,1.004358,0.145901
2000-01-02,-1.187186,-0.212143,0.020759


In [14]:
df.columns = [x.lower() for x in df.columns]


In [15]:
df

Unnamed: 0,a,b,c
2000-01-01,0.80302,1.004358,0.145901
2000-01-02,-1.187186,-0.212143,0.020759
2000-01-03,-0.071699,0.831925,0.001975
2000-01-04,-0.505624,0.959901,-0.292569
2000-01-05,0.74626,-0.589335,-1.074199
2000-01-06,1.695816,-1.505707,-0.094453
2000-01-07,0.975223,1.537385,-1.299893
2000-01-08,-0.587501,1.131928,0.821056


In [16]:
s.values

array([-1.07171067,  1.41070939,  2.31439459,  0.84051414,  0.30788718])

In [17]:
df.values

array([[ 0.80301965,  1.00435777,  0.14590118],
       [-1.18718625, -0.2121428 ,  0.02075891],
       [-0.07169884,  0.83192452,  0.00197486],
       [-0.50562428,  0.95990123, -0.29256872],
       [ 0.7462605 , -0.5893348 , -1.07419877],
       [ 1.6958162 , -1.50570735, -0.09445301],
       [ 0.97522336,  1.53738523, -1.29989305],
       [-0.58750082,  1.13192798,  0.82105574]])

In [18]:
wp.values

array([[[ 0.78752296, -0.34653899,  1.85953583, -0.12741824],
        [-0.26576317,  0.89267403, -0.90956655, -1.51146916],
        [ 0.6584737 ,  1.06258322,  0.30971611, -0.37042125],
        [-0.85545809, -0.39299941,  0.94935573,  1.17210453],
        [-1.47721789,  0.24396343,  1.99931577,  0.35319816]],

       [[ 0.66648108, -0.41773557, -0.187028  , -0.8957193 ],
        [ 1.29895514, -0.24778625,  0.77040594,  1.62211833],
        [-0.5061132 ,  0.40083772,  0.91897324, -0.06481612],
        [-0.04642707,  0.48244299,  1.43386829, -0.78360549],
        [ 0.3124662 ,  1.15914644, -0.18086868,  0.51571104]]])

In [19]:
## Matching / broadcasting behavior

In [20]:
df = pd.DataFrame({'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
                   'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
                   'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

In [21]:
df

Unnamed: 0,one,three,two
a,-0.384367,,-0.479292
b,0.876369,0.093015,0.634339
c,-0.644334,-0.145912,0.968932
d,,-1.104801,1.299398


In [23]:
row = df.iloc[1]

In [24]:
row

one      0.876369
three    0.093015
two      0.634339
Name: b, dtype: float64

In [25]:
column = df['two']

In [27]:
df.sub(row, axis='columns')

Unnamed: 0,one,three,two
a,-1.260736,,-1.113632
b,0.0,0.0,0.0
c,-1.520703,-0.238927,0.334593
d,,-1.197816,0.665058


In [28]:
df.sub(row, axis=1)

Unnamed: 0,one,three,two
a,-1.260736,,-1.113632
b,0.0,0.0,0.0
c,-1.520703,-0.238927,0.334593
d,,-1.197816,0.665058


In [29]:
df.sub(column, axis='index')

Unnamed: 0,one,three,two
a,0.094925,,0.0
b,0.24203,-0.541324,0.0
c,-1.613267,-1.114844,0.0
d,,-2.404199,0.0


In [30]:
df.sub(column, axis=0)

Unnamed: 0,one,three,two
a,0.094925,,0.0
b,0.24203,-0.541324,0.0
c,-1.613267,-1.114844,0.0
d,,-2.404199,0.0
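
The `axis` argument in the calls above decides which labels of the DataFrame the Series is matched against. A small sketch with fixed, made-up numbers (the names `demo`, `row`, `col`, `by_row` and `by_col` are illustrative) makes the two directions easier to check by hand:

```python
import pandas as pd

demo = pd.DataFrame({"one": [1.0, 2.0], "two": [10.0, 20.0]}, index=["a", "b"])

row = demo.iloc[0]   # Series indexed by the column labels
col = demo["two"]    # Series indexed by the row labels

# Match the Series against the column labels: subtract `row` from every row.
by_row = demo.sub(row, axis="columns")

# Match the Series against the row labels: subtract `col` from every column.
by_col = demo.sub(col, axis="index")

print(by_row)
print(by_col)
```

Matching on columns is also what the plain operator `demo - row` does, so `axis="index"` is the case that needs the method form.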


In [31]:
## Furthermore you can align a level of a multi-indexed DataFrame with a Series.
dfmi = df.copy()

In [32]:
dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a')],
                                       names=['first', 'second'])

In [33]:
dfmi.sub(column, axis=0, level='second')

Unnamed: 0_level_0,Unnamed: 1_level_0,one,three,two
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,a,0.094925,,0.0
1,b,0.24203,-0.541324,0.0
1,c,-1.613267,-1.114844,0.0
2,a,,-0.625509,1.77869


In [34]:
major_mean = wp.mean(axis='major')

In [35]:
wp.sub(major_mean, axis='major')

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

In [36]:
## Missing data / operations with fill values
df

Unnamed: 0,one,three,two
a,-0.384367,,-0.479292
b,0.876369,0.093015,0.634339
c,-0.644334,-0.145912,0.968932
d,,-1.104801,1.299398


In [38]:
df2 = pd.DataFrame(np.random.randn(8, 3), index=index,
columns=['A', 'B', 'C'])

In [39]:
df2

Unnamed: 0,A,B,C
2000-01-01,-1.180721,-2.7763,0.732917
2000-01-02,1.267004,-0.60201,-0.447131
2000-01-03,-0.020138,-0.620051,-1.404851
2000-01-04,-0.658608,-0.594525,0.924978
2000-01-05,1.070208,-0.643575,2.529221
2000-01-06,-0.367219,0.732131,-0.507819
2000-01-07,-0.133431,1.111262,1.733206
2000-01-08,0.759702,-0.154683,0.849453


In [40]:
df+df2

Unnamed: 0,A,B,C,one,three,two
2000-01-01 00:00:00,,,,,,
2000-01-02 00:00:00,,,,,,
2000-01-03 00:00:00,,,,,,
2000-01-04 00:00:00,,,,,,
2000-01-05 00:00:00,,,,,,
2000-01-06 00:00:00,,,,,,
2000-01-07 00:00:00,,,,,,
2000-01-08 00:00:00,,,,,,
a,,,,,,
b,,,,,,


In [41]:
df.add(df2, fill_value=0)

Unnamed: 0,A,B,C,one,three,two
2000-01-01 00:00:00,-1.180721,-2.7763,0.732917,,,
2000-01-02 00:00:00,1.267004,-0.60201,-0.447131,,,
2000-01-03 00:00:00,-0.020138,-0.620051,-1.404851,,,
2000-01-04 00:00:00,-0.658608,-0.594525,0.924978,,,
2000-01-05 00:00:00,1.070208,-0.643575,2.529221,,,
2000-01-06 00:00:00,-0.367219,0.732131,-0.507819,,,
2000-01-07 00:00:00,-0.133431,1.111262,1.733206,,,
2000-01-08 00:00:00,0.759702,-0.154683,0.849453,,,
a,,,,-0.384367,,-0.479292
b,,,,0.876369,0.093015,0.634339
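
The two frames above share no labels, so the effect of `fill_value` is easier to see on frames that do overlap. In this minimal sketch (made-up data; `left`, `right`, `plain` and `filled` are illustrative names), a value missing on one side only is treated as 0, while a value missing on both sides stays NaN:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan, 3.0]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10.0, np.nan, np.nan]}, index=["a", "b", "c"])

# Plain addition: NaN on either side gives NaN.
plain = left + right

# With fill_value=0: a one-sided NaN is replaced by 0 before adding.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```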


In [42]:
df.gt(df2)

Unnamed: 0,A,B,C,one,three,two
2000-01-01 00:00:00,False,False,False,False,False,False
2000-01-02 00:00:00,False,False,False,False,False,False
2000-01-03 00:00:00,False,False,False,False,False,False
2000-01-04 00:00:00,False,False,False,False,False,False
2000-01-05 00:00:00,False,False,False,False,False,False
2000-01-06 00:00:00,False,False,False,False,False,False
2000-01-07 00:00:00,False,False,False,False,False,False
2000-01-08 00:00:00,False,False,False,False,False,False
a,False,False,False,False,False,False
b,False,False,False,False,False,False


In [43]:
df2.ne(df)

Unnamed: 0,A,B,C,one,three,two
2000-01-01 00:00:00,True,True,True,True,True,True
2000-01-02 00:00:00,True,True,True,True,True,True
2000-01-03 00:00:00,True,True,True,True,True,True
2000-01-04 00:00:00,True,True,True,True,True,True
2000-01-05 00:00:00,True,True,True,True,True,True
2000-01-06 00:00:00,True,True,True,True,True,True
2000-01-07 00:00:00,True,True,True,True,True,True
2000-01-08 00:00:00,True,True,True,True,True,True
a,True,True,True,True,True,True
b,True,True,True,True,True,True


In [44]:
# Boolean Reductions

In [45]:
(df > 0).all()

one      False
three    False
two      False
dtype: bool

In [46]:
(df > 0).any()


one      True
three    True
two      True
dtype: bool

In [47]:
(df > 0).any().any()

True

In [48]:
df.empty

False

In [49]:
pd.DataFrame(columns=list('ABC')).empty

True

In [50]:
pd.Series([True]).bool()

True

In [51]:
pd.Series([False]).bool()

False

In [52]:
pd.DataFrame([[True]]).bool()

True

In [53]:
pd.DataFrame([[False]]).bool()

False

In [54]:
## Comparing if objects are equivalent

In [55]:
df+df == df*2

Unnamed: 0,one,three,two
a,True,False,True
b,True,True,True
c,True,True,True
d,False,True,True


In [56]:
(df+df == df*2).all()

one      False
three    False
two       True
dtype: bool

In [57]:
## Notice that the boolean DataFrame df+df == df*2 contains some False values! This is because NaNs do not
# compare as equal:

np.nan == np.nan

False

In [58]:
(df+df).equals(df*2)

True

In [59]:
df1 = pd.DataFrame({'col':['foo', 0, np.nan]})

In [60]:
df2 = pd.DataFrame({'col':[np.nan, 0, 'foo']}, index=[2,1,0])

In [61]:
df1.equals(df2)


False

In [62]:
df1.equals(df2.sort_index())

True

In [63]:
## Comparing array-like objects
pd.Series(['foo', 'bar', 'baz']) == 'foo'

0     True
1    False
2    False
dtype: bool

In [64]:
pd.Index(['foo', 'bar', 'baz']) == 'foo'

array([ True, False, False], dtype=bool)

In [65]:
pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])

0     True
1     True
2    False
dtype: bool

In [66]:
pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])

0     True
1     True
2    False
dtype: bool

In [67]:
pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])

ValueError: Can only compare identically-labeled Series objects

In [68]:
pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])

ValueError: Can only compare identically-labeled Series objects

In [70]:
np.array([1, 2, 3]) == np.array([2])

array([False,  True, False], dtype=bool)

In [71]:
np.array([1, 2, 3]) == np.array([1, 2])

False

In [72]:
## Combining overlapping data sets

In [73]:
df1 = pd.DataFrame({'A' : [1., np.nan, 3., 5., np.nan],
'B' : [np.nan, 2., 3., np.nan, 6.]})

In [74]:
df2 = pd.DataFrame({'A' : [5., 2., 4., np.nan, 3., 7.],
'B' : [np.nan, np.nan, 3., 4., 6., 8.]})


In [75]:
df1

Unnamed: 0,A,B
0,1.0,
1,,2.0
2,3.0,3.0
3,5.0,
4,,6.0


In [76]:
df2


Unnamed: 0,A,B
0,5.0,
1,2.0,
2,4.0,3.0
3,,4.0
4,3.0,6.0
5,7.0,8.0
