## Pandas
pandas consists of the following things
• A set of labeled array data structures, the primary of which are Series and DataFrame

• Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing

• An integrated group by engine for aggregating and transforming data sets

•Daterangegeneration(date_range)andcustomdateoffsetsenablingtheimplementationofcustomizedfrequen- cies

• Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format.

• Memory-efficient “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value)

• Moving window statistics (rolling mean, rolling standard deviation, etc.)

• Static and moving window linear and panel regression


Data structures 

Series - 1 Dimension - 1D labeled homogeneously-typed array

DataFrame -- General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns

Panel  --General 3D labeled, also size-mutable array


The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Panel is a container for DataFrame objects. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.


When using ndarrays to store 2- and 3-dimensional data, a burden is placed on the user to consider the orientation of the data set when writing functions; axes are considered more or less equivalent . In pandas, the axes are intended to lend more semantic meaning to the data; i.e., for a particular data set there is likely to be a “right” way to orient the data. The goal, then, is to reduce the amount of mental effort required to code up data transformations in downstream functions.

For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1. And iterating through the columns of the DataFrame thus results in more readable code:

￼for col in df.columns:

    series = df[col]
    
    do something with series￼for col in df.columns:



In [6]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

### Object Creation

Creating a Series by passing a list of values, letting pandas create a default integer index:

Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:

In [5]:
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [37]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [38]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,1.452981,0.089208,-1.175745,0.36733
2013-01-02,0.297813,-0.600171,-0.726994,0.491718
2013-01-03,0.32104,-0.892563,0.035313,0.952633
2013-01-04,0.311151,-1.231775,-0.440496,-1.908496
2013-01-05,0.839235,-0.24542,0.332668,0.311598
2013-01-06,-1.139229,-0.455167,-0.243025,0.371151


In [39]:
## Creating a DataFrame by passing a dict of objects that can be converted to series-like.

df2 = pd.DataFrame({ 'A' : 1.,'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]), 'F' : 'foo' })
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [15]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

### Viewing Data

In [40]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,1.452981,0.089208,-1.175745,0.36733
2013-01-02,0.297813,-0.600171,-0.726994,0.491718
2013-01-03,0.32104,-0.892563,0.035313,0.952633
2013-01-04,0.311151,-1.231775,-0.440496,-1.908496
2013-01-05,0.839235,-0.24542,0.332668,0.311598


In [34]:
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,0.284433,-0.731625,0.470674,-1.167195
2013-01-05,-1.146946,-0.358999,0.862938,1.067858
2013-01-06,0.396533,-0.52372,0.382615,0.562757


In [41]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [42]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [19]:
df.values

array([[-1.22088445,  0.33874724,  1.29463986, -0.10512401],
       [ 2.21046127, -0.52146734, -2.65790279, -0.62062096],
       [-1.14463234,  0.78462611, -0.67048203, -0.60940414],
       [ 1.63168405,  1.48492866, -0.2678563 , -1.15941899],
       [ 0.77209586, -1.23684343, -1.59607903,  0.50323467],
       [-0.46415537,  0.42662131, -0.19778566,  1.72957376]])

In [43]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.347165,-0.555981,-0.369713,0.097656
std,0.857758,0.467936,0.539589,1.010367
min,-1.139229,-1.231775,-1.175745,-1.908496
25%,0.301147,-0.819465,-0.65537,0.325531
50%,0.316095,-0.527669,-0.341761,0.36924
75%,0.709686,-0.297856,-0.034272,0.461576
max,1.452981,0.089208,0.332668,0.952633


In [39]:
df.T

Unnamed: 0,2013-01-01 00:00:00,2013-01-02 00:00:00,2013-01-03 00:00:00,2013-01-04 00:00:00,2013-01-05 00:00:00,2013-01-06 00:00:00
A,-0.89216,1.320968,0.368814,0.284433,-1.146946,0.396533
B,0.154374,-0.194768,-1.181423,-0.731625,-0.358999,-0.52372
C,2.121266,-0.749169,0.344051,0.470674,0.862938,0.382615
D,-0.246541,-1.490745,-0.772633,-1.167195,1.067858,0.562757


In [21]:
# Sorting by an axis

df.sort_index(axis =1 , ascending = False)



Unnamed: 0,D,C,B,A
2013-12-01,-0.105124,1.29464,0.338747,-1.220884
2013-12-02,-0.620621,-2.657903,-0.521467,2.210461
2013-12-03,-0.609404,-0.670482,0.784626,-1.144632
2013-12-04,-1.159419,-0.267856,1.484929,1.631684
2013-12-05,0.503235,-1.596079,-1.236843,0.772096
2013-12-06,1.729574,-0.197786,0.426621,-0.464155


In [22]:
# sorting by values 

df.sort_values(by ='B', ascending = True)

Unnamed: 0,A,B,C,D
2013-12-05,0.772096,-1.236843,-1.596079,0.503235
2013-12-02,2.210461,-0.521467,-2.657903,-0.620621
2013-12-01,-1.220884,0.338747,1.29464,-0.105124
2013-12-06,-0.464155,0.426621,-0.197786,1.729574
2013-12-03,-1.144632,0.784626,-0.670482,-0.609404
2013-12-04,1.631684,1.484929,-0.267856,-1.159419


### Selection 

#### Getting 

Selecting a single column, which yields a Series, equivalent to df.A

In [44]:
df['A']

2013-01-01    1.452981
2013-01-02    0.297813
2013-01-03    0.321040
2013-01-04    0.311151
2013-01-05    0.839235
2013-01-06   -1.139229
Freq: D, Name: A, dtype: float64

In [24]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-12-01,-1.220884,0.338747,1.29464,-0.105124
2013-12-02,2.210461,-0.521467,-2.657903,-0.620621
2013-12-03,-1.144632,0.784626,-0.670482,-0.609404


In [45]:
df['20130102':'20130104']

Unnamed: 0,A,B,C,D
2013-01-02,0.297813,-0.600171,-0.726994,0.491718
2013-01-03,0.32104,-0.892563,0.035313,0.952633
2013-01-04,0.311151,-1.231775,-0.440496,-1.908496


####  Selection by Label

In [47]:
df

Unnamed: 0,A,B,C,D
2013-01-01,1.452981,0.089208,-1.175745,0.36733
2013-01-02,0.297813,-0.600171,-0.726994,0.491718
2013-01-03,0.32104,-0.892563,0.035313,0.952633
2013-01-04,0.311151,-1.231775,-0.440496,-1.908496
2013-01-05,0.839235,-0.24542,0.332668,0.311598
2013-01-06,-1.139229,-0.455167,-0.243025,0.371151


In [46]:
df.loc[dates[0]]

A    1.452981
B    0.089208
C   -1.175745
D    0.367330
Name: 2013-01-01 00:00:00, dtype: float64

In [57]:
df.loc[dates[1]]

A    1.320968
B   -0.194768
C   -0.749169
D   -1.490745
Name: 2013-01-02 00:00:00, dtype: float64

In [32]:
## Selecting on a multi-axis by label

df.loc[:,['A','B']]

Unnamed: 0,A,B
2013-12-01,-1.220884,0.338747
2013-12-02,2.210461,-0.521467
2013-12-03,-1.144632,0.784626
2013-12-04,1.631684,1.484929
2013-12-05,0.772096,-1.236843
2013-12-06,-0.464155,0.426621


In [48]:
#label slicing, both endpoints are included

df.loc['20130101': '20130104',['A','B']]

Unnamed: 0,A,B
2013-01-01,1.452981,0.089208
2013-01-02,0.297813,-0.600171
2013-01-03,0.32104,-0.892563
2013-01-04,0.311151,-1.231775


In [49]:
print(type(df.loc['20130101': '20130104',['A','B']]))
# Reduction in the dimensions of the returned object

df.loc['20130102',['A','B']]

print(df.loc['20130102',['A','B']])

print(type(df.loc['20130102',['A','B']]))



<class 'pandas.core.frame.DataFrame'>
A    0.297813
B   -0.600171
Name: 2013-01-02 00:00:00, dtype: float64
<class 'pandas.core.series.Series'>


In [51]:
# For getting a scalar value
df.loc[dates[0],'A']

1.4529809863826681

In [52]:
# For getting fast access to a scalar (equiv to the prior method)

df.at[dates[0],'A']

1.4529809863826681

#### Selection by Position

In [53]:
df.iloc[3]

A    0.311151
B   -1.231775
C   -0.440496
D   -1.908496
Name: 2013-01-04 00:00:00, dtype: float64

In [54]:
# By integer slices, acting similar to numpy/python

df.iloc[3:5,0:2]


Unnamed: 0,A,B
2013-01-04,0.311151,-1.231775
2013-01-05,0.839235,-0.24542


In [55]:
df.iloc[3:5]

Unnamed: 0,A,B,C,D
2013-01-04,0.311151,-1.231775,-0.440496,-1.908496
2013-01-05,0.839235,-0.24542,0.332668,0.311598


In [56]:
df.iloc[:,2:5]


Unnamed: 0,C,D
2013-01-01,-1.175745,0.36733
2013-01-02,-0.726994,0.491718
2013-01-03,0.035313,0.952633
2013-01-04,-0.440496,-1.908496
2013-01-05,0.332668,0.311598
2013-01-06,-0.243025,0.371151


In [57]:
df.iloc[:,2:6]

Unnamed: 0,C,D
2013-01-01,-1.175745,0.36733
2013-01-02,-0.726994,0.491718
2013-01-03,0.035313,0.952633
2013-01-04,-0.440496,-1.908496
2013-01-05,0.332668,0.311598
2013-01-06,-0.243025,0.371151


In [59]:
df.iloc[1:9,:]

Unnamed: 0,A,B,C,D
2013-01-02,0.297813,-0.600171,-0.726994,0.491718
2013-01-03,0.32104,-0.892563,0.035313,0.952633
2013-01-04,0.311151,-1.231775,-0.440496,-1.908496
2013-01-05,0.839235,-0.24542,0.332668,0.311598
2013-01-06,-1.139229,-0.455167,-0.243025,0.371151


In [61]:
## By lists of integer position locations, similar to the numpy/python style

df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2013-01-02,0.297813,-0.726994
2013-01-03,0.32104,0.035313
2013-01-05,0.839235,0.332668


In [80]:
df.iloc[[1,2,7],[0,2]]

IndexError: positional indexers are out-of-bounds

In [62]:
# For getting a value explicitly
df.iloc[1,1]

-0.60017078860716844

#### Boolean Indexing

In [63]:
df

Unnamed: 0,A,B,C,D
2013-01-01,1.452981,0.089208,-1.175745,0.36733
2013-01-02,0.297813,-0.600171,-0.726994,0.491718
2013-01-03,0.32104,-0.892563,0.035313,0.952633
2013-01-04,0.311151,-1.231775,-0.440496,-1.908496
2013-01-05,0.839235,-0.24542,0.332668,0.311598
2013-01-06,-1.139229,-0.455167,-0.243025,0.371151


In [83]:
df[df.A >0]

Unnamed: 0,A,B,C,D
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745
2013-01-03,0.368814,-1.181423,0.344051,-0.772633
2013-01-04,0.284433,-0.731625,0.470674,-1.167195
2013-01-06,0.396533,-0.52372,0.382615,0.562757


In [65]:
df[df >0]

Unnamed: 0,A,B,C,D
2013-01-01,1.452981,0.089208,,0.36733
2013-01-02,0.297813,,,0.491718
2013-01-03,0.32104,,0.035313,0.952633
2013-01-04,0.311151,,,
2013-01-05,0.839235,,0.332668,0.311598
2013-01-06,,,,0.371151


In [86]:
### Using the isin() method for filtering:

df2 = df.copy()

In [66]:
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [70]:
df2['E'] = ['one', 'one','two','three']
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,one,foo
1,1.0,2013-01-02,1.0,3,one,foo
2,1.0,2013-01-02,1.0,3,two,foo
3,1.0,2013-01-02,1.0,3,three,foo


In [72]:
df

Unnamed: 0,A,B,C,D
2013-01-01,1.452981,0.089208,-1.175745,0.36733
2013-01-02,0.297813,-0.600171,-0.726994,0.491718
2013-01-03,0.32104,-0.892563,0.035313,0.952633
2013-01-04,0.311151,-1.231775,-0.440496,-1.908496
2013-01-05,0.839235,-0.24542,0.332668,0.311598
2013-01-06,-1.139229,-0.455167,-0.243025,0.371151


df2


In [79]:
df2[~df2['E'].isin(['one','three']) ]


Unnamed: 0,A,B,C,D,E,F
2,1.0,2013-01-02,1.0,3,two,foo


#### Setting 

Setting a new column automatically aligns the data by the indexes

In [80]:
s1 = pd.Series([1,2,3,4,5,6], index = pd.date_range('20130102', periods =6))
s1


2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [82]:
# setting values by label 
df['F'] = s1

In [83]:
df


Unnamed: 0,A,B,C,D,F
2013-01-01,1.452981,0.089208,-1.175745,0.36733,
2013-01-02,0.297813,-0.600171,-0.726994,0.491718,1.0
2013-01-03,0.32104,-0.892563,0.035313,0.952633,2.0
2013-01-04,0.311151,-1.231775,-0.440496,-1.908496,3.0
2013-01-05,0.839235,-0.24542,0.332668,0.311598,4.0
2013-01-06,-1.139229,-0.455167,-0.243025,0.371151,5.0


In [84]:
# Setting values by label

df.at[dates[0],'A'] =0

In [101]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.154374,2.121266,-0.246541,
2013-01-02,1.320968,-0.194768,-0.749169,-1.490745,1.0
2013-01-03,0.368814,-1.181423,0.344051,-0.772633,2.0
2013-01-04,0.284433,-0.731625,0.470674,-1.167195,3.0
2013-01-05,-1.146946,-0.358999,0.862938,1.067858,4.0
2013-01-06,0.396533,-0.52372,0.382615,0.562757,5.0


In [85]:
#Setting values by position
df.iat[0,2] = 0

In [86]:
df


Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.089208,0.0,0.36733,
2013-01-02,0.297813,-0.600171,-0.726994,0.491718,1.0
2013-01-03,0.32104,-0.892563,0.035313,0.952633,2.0
2013-01-04,0.311151,-1.231775,-0.440496,-1.908496,3.0
2013-01-05,0.839235,-0.24542,0.332668,0.311598,4.0
2013-01-06,-1.139229,-0.455167,-0.243025,0.371151,5.0


In [91]:
#Setting by assigning with a numpy array

df.loc[:,'D'] = np.array([3]*len(df))

In [92]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.089208,0.0,3,
2013-01-02,0.297813,-0.600171,-0.726994,3,1.0
2013-01-03,0.32104,-0.892563,0.035313,3,2.0
2013-01-04,0.311151,-1.231775,-0.440496,3,3.0
2013-01-05,0.839235,-0.24542,0.332668,3,4.0
2013-01-06,-1.139229,-0.455167,-0.243025,3,5.0


In [107]:
df2 = df.copy


In [108]:
df2


<bound method NDFrame.copy of                    A         B         C  D    F
2013-01-01  0.000000  0.154374  0.000000  5  NaN
2013-01-02  1.320968 -0.194768 -0.749169  5  1.0
2013-01-03  0.368814 -1.181423  0.344051  5  2.0
2013-01-04  0.284433 -0.731625  0.470674  5  3.0
2013-01-05 -1.146946 -0.358999  0.862938  5  4.0
2013-01-06  0.396533 -0.523720  0.382615  5  5.0>

In [93]:
df2[df2 > 0 ] 

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,one,foo
1,1.0,2013-01-02,1.0,3,one,foo
2,1.0,2013-01-02,1.0,3,two,foo
3,1.0,2013-01-02,1.0,3,three,foo


#### Missing Data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [97]:
df1 = df.reindex(index = dates[0:4], columns= list(df.columns)+['E'])


In [99]:

df1.loc[dates[0]:dates[1],'E'] = 1
print(df1)

                   A         B         C  D    F    E
2013-01-01  0.000000  0.089208  0.000000  3  NaN  1.0
2013-01-02  0.297813 -0.600171 -0.726994  3  1.0  1.0
2013-01-03  0.321040 -0.892563  0.035313  3  2.0  NaN
2013-01-04  0.311151 -1.231775 -0.440496  3  3.0  NaN


In [119]:
# To drop any rows that have missing data.
df1.dropna(how = 'any')

Unnamed: 0,A,B,C,D,F,E
2013-01-02,1.320968,-0.194768,-0.749169,5,1.0,1.0


In [100]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.089208,0.0,3,5.0,1.0
2013-01-02,0.297813,-0.600171,-0.726994,3,1.0,1.0
2013-01-03,0.32104,-0.892563,0.035313,3,2.0,5.0
2013-01-04,0.311151,-1.231775,-0.440496,3,3.0,5.0


In [101]:
df1.dropna(how = 'any')

Unnamed: 0,A,B,C,D,F,E
2013-01-02,0.297813,-0.600171,-0.726994,3,1.0,1.0


In [123]:
# To get the boolean mask where values are nan

pd.isnull(df1)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,False,False,False,False,True,False
2013-01-02,False,False,False,False,False,False
2013-01-03,False,False,False,False,False,True
2013-01-04,False,False,False,False,False,True


#### Operations 
Operations in general exclude missing data. 

Performing a descriptive statistic

In [124]:
df.mean()

A    0.203967
B   -0.472693
C    0.218518
D    5.000000
F    3.000000
dtype: float64

In [125]:
# Same operation on the other axis

df.mean(1)

2013-01-01    1.288594
2013-01-02    1.275406
2013-01-03    1.306288
2013-01-04    1.604696
2013-01-05    1.671399
2013-01-06    2.051086
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.

In [127]:
s = pd.Series([1,3,5,np.nan,6,8], index=dates)
s

2013-01-01    1.0
2013-01-02    3.0
2013-01-03    5.0
2013-01-04    NaN
2013-01-05    6.0
2013-01-06    8.0
Freq: D, dtype: float64

In [134]:
s= s.shift(2)
s


2013-01-01    NaN
2013-01-02    NaN
2013-01-03    NaN
2013-01-04    NaN
2013-01-05    1.0
2013-01-06    3.0
Freq: D, dtype: float64

In [140]:
type(s)
print(s.shape)

print(df.shape)

(6,)
(6, 5)


In [136]:
help(df.sub)


Help on method sub in module pandas.core.ops:

sub(other, axis='columns', level=None, fill_value=None) method of pandas.core.frame.DataFrame instance
    Subtraction of dataframe and other, element-wise (binary operator `sub`).
    
    Equivalent to ``dataframe - other``, but with support to substitute a fill_value for
    missing data in one of the inputs.
    
    Parameters
    ----------
    other : Series, DataFrame, or constant
    axis : {0, 1, 'index', 'columns'}
        For Series input, axis to match Series index on
    fill_value : None or float value, default None
        Fill missing (NaN) values with this value. If both DataFrame
        locations are missing, the result will be missing
    level : int or name
        Broadcast across a level, matching Index values on the
        passed MultiIndex level
    
    Notes
    -----
    Mismatched indices will be unioned together
    
    Returns
    -------
    result : DataFrame
    
    See also
    --------
    DataFrame.

In [132]:
df


Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.089208,0.0,3,
2013-01-02,0.297813,-0.600171,-0.726994,3,1.0
2013-01-03,0.32104,-0.892563,0.035313,3,2.0
2013-01-04,0.311151,-1.231775,-0.440496,3,3.0
2013-01-05,0.839235,-0.24542,0.332668,3,4.0
2013-01-06,-1.139229,-0.455167,-0.243025,3,5.0


In [133]:

df.sub(s,axis = 'index')

Unnamed: 0,A,B,C,D,F
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,-0.67896,-1.892563,-0.964687,2.0,1.0
2013-01-04,-2.688849,-4.231775,-3.440496,0.0,0.0
2013-01-05,-4.160765,-5.24542,-4.667332,-2.0,-1.0
2013-01-06,,,,,


#### Apply 

Applying function to data

In [138]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.154374,0.0,5,
2013-01-02,1.320968,-0.040393,-0.749169,10,1.0
2013-01-03,1.689781,-1.221817,-0.405118,15,3.0
2013-01-04,1.974215,-1.953442,0.065555,20,6.0
2013-01-05,0.827269,-2.312441,0.928494,25,10.0
2013-01-06,1.223802,-2.836161,1.311109,30,15.0


In [140]:
df.apply (lambda x : x.max()-x.min())

A    2.467914
B    1.335798
C    1.612108
D    0.000000
F    4.000000
dtype: float64

In [141]:
type(df.apply (lambda x : x.max()-x.min()))

pandas.core.series.Series

#### Histrogramming 

In [143]:
s = pd.Series(np.random.randint(0,7, size =10))
s

0    4
1    5
2    1
3    5
4    3
5    2
6    1
7    1
8    2
9    4
dtype: int64

In [144]:
s.value_counts()

1    3
5    2
4    2
2    2
3    1
dtype: int64

####  String methods

In [141]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s

0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object

In [142]:
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

#### merge

pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

Concatenating pandas objects together with concat():

In [143]:
df  = pd.DataFrame(np.random.randn(10,4))

In [144]:
df


Unnamed: 0,0,1,2,3
0,-0.828817,0.024161,-1.155168,-1.312135
1,-0.250658,-1.8587,0.92224,0.963875
2,-1.130809,-0.379527,0.166091,-0.223364
3,0.16323,0.918472,1.244863,0.054307
4,-1.162385,0.958406,1.336289,-1.47228
5,-0.536736,-0.408177,-0.592653,0.395826
6,-1.679761,0.818185,1.298533,1.220418
7,-0.997911,-0.205352,1.272323,-0.986731
8,2.17561,1.122211,0.229506,-0.313692
9,0.610868,-0.076015,-1.587109,-0.105152


In [154]:
piceses =  [df[:3],df[3:7],df[7:]]

In [155]:
piceses

[          0         1         2         3
 0  1.264577  0.051834 -1.144204  1.207308
 1  1.161618 -1.051908 -0.200668 -0.173539
 2  0.115750 -0.119277 -1.976578 -0.051619,
           0         1         2         3
 3 -0.538089 -0.555356 -0.103677  1.183125
 4 -0.055464 -0.008104 -0.148013  1.566107
 5  0.551317 -1.801087 -0.957699  0.511444
 6  0.069443 -1.318194 -0.458085 -0.911792,
           0         1         2         3
 7  1.452841  0.732603 -0.421263  0.142031
 8 -0.048391  0.408418 -0.711116 -1.399384
 9  0.917506 -1.253084 -0.415187 -1.124510]

In [156]:
pd.concat(piceses)

Unnamed: 0,0,1,2,3
0,1.264577,0.051834,-1.144204,1.207308
1,1.161618,-1.051908,-0.200668,-0.173539
2,0.11575,-0.119277,-1.976578,-0.051619
3,-0.538089,-0.555356,-0.103677,1.183125
4,-0.055464,-0.008104,-0.148013,1.566107
5,0.551317,-1.801087,-0.957699,0.511444
6,0.069443,-1.318194,-0.458085,-0.911792
7,1.452841,0.732603,-0.421263,0.142031
8,-0.048391,0.408418,-0.711116,-1.399384
9,0.917506,-1.253084,-0.415187,-1.12451


#### Join



In [145]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left

Unnamed: 0,key,lval
0,foo,1
1,foo,2


In [146]:
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right

Unnamed: 0,key,rval
0,foo,4
1,foo,5


In [147]:
pd.merge(left,right, on ='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5


#### Append

In [148]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
0,1.63675,-1.566832,0.954594,0.278716
1,0.624538,-1.373829,0.502789,-1.213104
2,0.104419,0.056162,-0.005679,0.247602
3,1.140408,1.067881,-1.705766,0.091304
4,0.670082,1.958619,0.066951,0.947443
5,1.849962,-1.512124,0.159452,0.914792
6,-0.211039,0.357835,1.088611,0.152963
7,0.653162,-0.412609,1.246842,-1.041493


In [165]:
s = df.iloc[3]
s


A    0.779374
B   -0.792451
C    0.108343
D    0.253346
Name: 3, dtype: float64

In [166]:
df.append(s, ignore_index=True)

Unnamed: 0,A,B,C,D
0,1.105848,-1.603355,-1.182758,-0.904533
1,-0.504589,-0.27093,-0.968451,0.614221
2,-0.887285,-0.95648,0.670953,-0.336781
3,0.779374,-0.792451,0.108343,0.253346
4,-0.700572,-0.412557,3.046421,-1.225675
5,1.927459,0.43529,0.337455,-1.015704
6,-2.093842,0.401445,-0.045044,0.308447
7,-0.328095,-1.25382,-2.396085,-0.292637
8,0.779374,-0.792451,0.108343,0.253346


#### Grouping 

By “group by” we are referring to a process involving one or more of the following steps 

• Splitting the data into groups based on some criteria

• Applying a function to each group independently

• Combining the results into a data structure


In [149]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                    'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                    'C' : np.random.randn(8),
                    'D' : np.random.randn(8)})
df

Unnamed: 0,A,B,C,D
0,foo,one,0.673245,0.74305
1,bar,one,-1.644412,0.903547
2,foo,two,-0.404697,-0.957063
3,bar,three,0.940455,1.763688
4,foo,two,-0.255676,-0.498992
5,bar,two,0.739576,-0.85663
6,foo,one,-1.548273,-1.113304
7,foo,three,0.259137,-1.143938


In [169]:
# Grouping and then applying a function sum to the resulting groups.
df.groupby('A').sum()

Unnamed: 0_level_0,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,-1.770153,-3.67051
foo,-1.958416,2.502084


In [171]:
df.groupby(['A','B']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.049356,-1.087067
bar,three,-1.462271,-2.119326
bar,two,-0.357238,-0.464117
foo,one,-0.653277,1.289441
foo,three,-1.353866,0.301702
foo,two,0.048727,0.91094


#### Stack

In [150]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))

In [151]:
tuples


[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [153]:
index = pd.MultiIndex.from_tuples(tuples,names=['first','second'])
index


MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [155]:
df =pd.DataFrame(np.random.randn(8,2),index = index , columns= ['A','B'])
df


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.429859,0.736744
bar,two,0.571468,-0.004701
baz,one,-0.275011,-0.913135
baz,two,-0.155922,-1.171553
foo,one,0.227955,0.435839
foo,two,-0.031727,-0.446389
qux,one,-0.871404,0.346541
qux,two,0.404015,-0.45609


In [156]:
df2 = df[:4]
df2

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.429859,0.736744
bar,two,0.571468,-0.004701
baz,one,-0.275011,-0.913135
baz,two,-0.155922,-1.171553


In [None]:
#The stack() method “compresses” a level in the DataFrame’s columns.


In [158]:
stacked = df2.stack()
stacked

first  second   
bar    one     A    0.429859
               B    0.736744
       two     A    0.571468
               B   -0.004701
baz    one     A   -0.275011
               B   -0.913135
       two     A   -0.155922
               B   -1.171553
dtype: float64

In [159]:
stacked.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.429859,0.736744
bar,two,0.571468,-0.004701
baz,one,-0.275011,-0.913135
baz,two,-0.155922,-1.171553


In [183]:
stacked.unstack(1)

Unnamed: 0_level_0,second,one,two
first,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,A,-0.266321,-0.019614
bar,B,-0.366412,1.124003
baz,A,0.651159,0.82872
baz,B,-1.559394,0.192244


In [184]:
stacked.unstack(0)


Unnamed: 0_level_0,first,bar,baz
second,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,-0.266321,0.651159
one,B,-0.366412,-1.559394
two,A,-0.019614,0.82872
two,B,1.124003,0.192244


#### Pivot Tables

In [162]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
'B' : ['A', 'B', 'C'] * 4,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D' : np.random.randn(12),
'E' : np.random.randn(12)})

df

Unnamed: 0,A,B,C,D,E
0,one,A,foo,-0.138447,0.596549
1,one,B,foo,-0.94471,0.340938
2,two,C,foo,-0.592216,0.793961
3,three,A,bar,1.898187,-0.646526
4,one,B,bar,1.219602,-2.115418
5,one,C,bar,1.131427,-0.315726
6,two,A,foo,-0.727554,0.842931
7,three,B,foo,1.483589,0.125674
8,one,C,foo,-0.24456,0.55327
9,one,A,bar,-0.476104,-0.580648


In [187]:
help(pd.pivot_table)

Help on function pivot_table in module pandas.core.reshape.pivot:

pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
    Create a spreadsheet-style pivot table as a DataFrame. The levels in
    the pivot table will be stored in MultiIndex objects (hierarchical
    indexes) on the index and columns of the result DataFrame
    
    Parameters
    ----------
    data : DataFrame
    values : column to aggregate, optional
    index : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table index.  If an array is passed,
        it is being used as the same manner as column values.
    columns : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can c

In [188]:
pd.pivot_table(df,values='D',index =['A','B'],columns= ['C'])

Unnamed: 0_level_0,C,bar,foo
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,0.106135,0.704409
one,B,0.086886,-0.866929
one,C,-0.731205,1.481875
three,A,1.070945,
three,B,,0.451708
three,C,0.363401,
two,A,,-1.583108
two,B,-1.309732,
two,C,,-0.499975


#### Time Series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency con- version (e.g., converting secondly data into 5-minutely data). 

In [163]:
rng = pd.date_range('1/1/2012', periods=100, freq='S')
rng

DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
               '2012-01-01 00:00:02', '2012-01-01 00:00:03',
               '2012-01-01 00:00:04', '2012-01-01 00:00:05',
               '2012-01-01 00:00:06', '2012-01-01 00:00:07',
               '2012-01-01 00:00:08', '2012-01-01 00:00:09',
               '2012-01-01 00:00:10', '2012-01-01 00:00:11',
               '2012-01-01 00:00:12', '2012-01-01 00:00:13',
               '2012-01-01 00:00:14', '2012-01-01 00:00:15',
               '2012-01-01 00:00:16', '2012-01-01 00:00:17',
               '2012-01-01 00:00:18', '2012-01-01 00:00:19',
               '2012-01-01 00:00:20', '2012-01-01 00:00:21',
               '2012-01-01 00:00:22', '2012-01-01 00:00:23',
               '2012-01-01 00:00:24', '2012-01-01 00:00:25',
               '2012-01-01 00:00:26', '2012-01-01 00:00:27',
               '2012-01-01 00:00:28', '2012-01-01 00:00:29',
               '2012-01-01 00:00:30', '2012-01-01 00:00:31',
               '2012-01-

In [164]:
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts

2012-01-01 00:00:00    230
2012-01-01 00:00:01    204
2012-01-01 00:00:02    251
2012-01-01 00:00:03    348
2012-01-01 00:00:04     50
2012-01-01 00:00:05    219
2012-01-01 00:00:06    342
2012-01-01 00:00:07    378
2012-01-01 00:00:08    371
2012-01-01 00:00:09    283
2012-01-01 00:00:10    118
2012-01-01 00:00:11      1
2012-01-01 00:00:12    495
2012-01-01 00:00:13    383
2012-01-01 00:00:14    282
2012-01-01 00:00:15    132
2012-01-01 00:00:16    189
2012-01-01 00:00:17     24
2012-01-01 00:00:18    486
2012-01-01 00:00:19     31
2012-01-01 00:00:20    264
2012-01-01 00:00:21    350
2012-01-01 00:00:22    420
2012-01-01 00:00:23    317
2012-01-01 00:00:24    423
2012-01-01 00:00:25    329
2012-01-01 00:00:26    335
2012-01-01 00:00:27    463
2012-01-01 00:00:28    344
2012-01-01 00:00:29    352
                      ... 
2012-01-01 00:01:10    153
2012-01-01 00:01:11    221
2012-01-01 00:01:12     10
2012-01-01 00:01:13    289
2012-01-01 00:01:14    421
2012-01-01 00:01:15    212
2

In [165]:
ts.resample('1Min').sum()

2012-01-01 00:00:00    16310
2012-01-01 00:01:00     9387
Freq: T, dtype: int64

In [166]:
## Time zone representation
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

2012-03-06   -1.425213
2012-03-07    2.140607
2012-03-08   -0.660842
2012-03-09    1.581978
2012-03-10    1.241751
Freq: D, dtype: float64

In [196]:
ts_utc = ts.tz_localize('UTC')

In [197]:
ts_utc

2012-03-06 00:00:00+00:00    2.260751
2012-03-07 00:00:00+00:00   -0.149555
2012-03-08 00:00:00+00:00   -0.336333
2012-03-09 00:00:00+00:00    0.737622
2012-03-10 00:00:00+00:00   -1.741359
Freq: D, dtype: float64

In [198]:
ts_utc.tz_convert('US/Eastern')

2012-03-05 19:00:00-05:00    2.260751
2012-03-06 19:00:00-05:00   -0.149555
2012-03-07 19:00:00-05:00   -0.336333
2012-03-08 19:00:00-05:00    0.737622
2012-03-09 19:00:00-05:00   -1.741359
Freq: D, dtype: float64

In [167]:
## Converting between time span representations
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-01-31   -0.149327
2012-02-29   -1.819186
2012-03-31    0.984010
2012-04-30    1.347231
2012-05-31   -0.273676
Freq: M, dtype: float64

In [168]:
ps = ts.to_period()
ps

2012-01   -0.149327
2012-02   -1.819186
2012-03    0.984010
2012-04    1.347231
2012-05   -0.273676
Freq: M, dtype: float64

In [170]:
ps.to_timestamp()

2012-01-31   -0.149327
2012-02-29   -1.819186
2012-03-31    0.984010
2012-04-30    1.347231
2012-05-31   -0.273676
Freq: M, dtype: float64

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:

In [173]:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng

PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='period[Q-NOV]', freq='Q-NOV')

In [179]:
ts = pd.Series(np.random.randn(len(prng)), prng)
ts

1990Q1    0.971381
1990Q2   -1.108808
1990Q3   -0.444173
1990Q4   -1.386688
1991Q1    1.955035
1991Q2   -0.361296
1991Q3    0.812855
1991Q4   -1.260271
1992Q1   -1.174834
1992Q2    0.419585
1992Q3    1.039369
1992Q4    0.263308
1993Q1    1.869706
1993Q2    1.133111
1993Q3    0.468307
1993Q4    1.023134
1994Q1   -1.496256
1994Q2    0.625379
1994Q3    0.833143
1994Q4   -1.337258
1995Q1    0.357554
1995Q2   -1.750471
1995Q3   -1.474140
1995Q4    1.321157
1996Q1    0.385847
1996Q2    0.092388
1996Q3    1.168368
1996Q4    0.500581
1997Q1    1.614860
1997Q2   -0.220471
1997Q3    0.297750
1997Q4   -1.088346
1998Q1   -2.158687
1998Q2   -0.480569
1998Q3    0.046427
1998Q4    2.047984
1999Q1   -0.266469
1999Q2    0.239019
1999Q3    1.876142
1999Q4    0.148843
2000Q1   -0.731716
2000Q2    0.448813
2000Q3   -0.588068
2000Q4    2.746139
Freq: Q-NOV, dtype: float64

In [177]:
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

In [178]:
ts.head()

1990-03-01 09:00    0.035074
1990-06-01 09:00    0.715765
1990-09-01 09:00    1.133733
1990-12-01 09:00   -0.017971
1991-03-01 09:00    0.114940
Freq: H, dtype: float64

#### Categoricals

In [221]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'c', 'd', 'a', 'e']})

In [222]:
df


Unnamed: 0,id,raw_grade
0,1,a
1,2,b
2,3,c
3,4,d
4,5,a
5,6,e


In [223]:
df['grade'] = df['raw_grade'].astype("category")


In [224]:
df

Unnamed: 0,id,raw_grade,grade
0,1,a,a
1,2,b,b
2,3,c,c
3,4,d,d
4,5,a,a
5,6,e,e


In [225]:

df['grade'].cat.categories = ["very bad", "bad", "medium", "good", "very good"]
df


Unnamed: 0,id,raw_grade,grade
0,1,a,very bad
1,2,b,bad
2,3,c,medium
3,4,d,good
4,5,a,very bad
5,6,e,very good


In [226]:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

In [227]:
df


Unnamed: 0,id,raw_grade,grade
0,1,a,very bad
1,2,b,bad
2,3,c,medium
3,4,d,good
4,5,a,very bad
5,6,e,very good


In [193]:
df.sort_values(by="grade")

Unnamed: 0,id,raw_grade,grade
0,1,a,
1,2,b,
2,3,b,
3,4,a,
4,5,a,
5,6,e,


#### Plotting

In [228]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

In [229]:
ts = ts.cumsum()

In [230]:
ts.plot()


<matplotlib.axes._subplots.AxesSubplot at 0x116a7a828>

In [231]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])


In [232]:
 df = df.cumsum()

In [233]:
plt.figure();

In [235]:
df.plot(); 

In [236]:
plt.legend(loc='best')

<matplotlib.legend.Legend at 0x116c45908>

#### Getting Data In/Out

In [237]:
df.to_csv('foo.csv')

In [238]:
pd.read_csv('foo.csv')

Unnamed: 0.1,Unnamed: 0,A,B,C,D
0,2000-01-01,0.827621,-0.281782,-0.166937,-0.869104
1,2000-01-02,1.398077,-0.395234,1.625438,-0.765463
2,2000-01-03,2.920340,-1.696290,1.647347,-1.240526
3,2000-01-04,3.787990,-2.873107,1.641727,-1.586159
4,2000-01-05,2.888086,-1.652603,1.769317,-2.383967
5,2000-01-06,3.108331,-3.507232,0.434552,-1.243871
6,2000-01-07,3.724756,-4.451210,0.431474,-2.480920
7,2000-01-08,3.887005,-7.194912,-0.180652,-2.935083
8,2000-01-09,4.488749,-6.776646,0.392486,-2.329330
9,2000-01-10,5.351117,-5.798648,0.975288,-2.341406


#### HDF5

In [239]:
df.to_hdf('foo.h5','df')

In [240]:
pd.read_hdf('foo.h5','df')

Unnamed: 0,A,B,C,D
2000-01-01,0.827621,-0.281782,-0.166937,-0.869104
2000-01-02,1.398077,-0.395234,1.625438,-0.765463
2000-01-03,2.920340,-1.696290,1.647347,-1.240526
2000-01-04,3.787990,-2.873107,1.641727,-1.586159
2000-01-05,2.888086,-1.652603,1.769317,-2.383967
2000-01-06,3.108331,-3.507232,0.434552,-1.243871
2000-01-07,3.724756,-4.451210,0.431474,-2.480920
2000-01-08,3.887005,-7.194912,-0.180652,-2.935083
2000-01-09,4.488749,-6.776646,0.392486,-2.329330
2000-01-10,5.351117,-5.798648,0.975288,-2.341406


#### Excel

In [242]:
df.to_excel('foo.xlsx', sheet_name='Sheet1')


In [243]:
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])


Unnamed: 0,A,B,C,D
2000-01-01,0.827621,-0.281782,-0.166937,-0.869104
2000-01-02,1.398077,-0.395234,1.625438,-0.765463
2000-01-03,2.920340,-1.696290,1.647347,-1.240526
2000-01-04,3.787990,-2.873107,1.641727,-1.586159
2000-01-05,2.888086,-1.652603,1.769317,-2.383967
2000-01-06,3.108331,-3.507232,0.434552,-1.243871
2000-01-07,3.724756,-4.451210,0.431474,-2.480920
2000-01-08,3.887005,-7.194912,-0.180652,-2.935083
2000-01-09,4.488749,-6.776646,0.392486,-2.329330
2000-01-10,5.351117,-5.798648,0.975288,-2.341406



## Pandas Memory usage

In [228]:
dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]', 
          'complex128', 'object', 'bool']
n = 5000
import numpy as np
test = [(np.random.randint(100, size=n).astype(t)) for t in dtypes]
#print("Test is",test)
data = dict([ (t, np.random.randint(100, size=n).astype(t)) for t in dtypes])
#print("Data is",data)
print("Data is",type(data['complex128']))
print("Data is",data['complex128'].size)
import pandas as pd
df = pd.DataFrame(data)
df['categorical'] = df['object'].astype('category')
df.info()

## The + symbol indicates that the true memory usage could be 
#higher, because pandas does not count the memory used by values in columns with dtype=object.

Data is <class 'numpy.ndarray'>
Data is 5000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
bool               5000 non-null bool
complex128         5000 non-null complex128
datetime64[ns]     5000 non-null datetime64[ns]
float64            5000 non-null float64
int64              5000 non-null int64
object             5000 non-null object
timedelta64[ns]    5000 non-null timedelta64[ns]
categorical        5000 non-null category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 289.1+ KB


In [210]:
df.head()

Unnamed: 0,bool,complex128,datetime64[ns],float64,int64,object,timedelta64[ns],categorical
0,True,(18+0j),1970-01-01 00:00:00.000000094,71.0,50,87,00:00:00.000000,87
1,True,(75+0j),1970-01-01 00:00:00.000000049,70.0,23,57,00:00:00.000000,57
2,True,(16+0j),1970-01-01 00:00:00.000000068,1.0,1,25,00:00:00.000000,25
3,True,(30+0j),1970-01-01 00:00:00.000000069,26.0,31,50,00:00:00.000000,50
4,True,(83+0j),1970-01-01 00:00:00.000000052,90.0,80,43,00:00:00.000000,43


Passing memory_usage=’deep’ will enable a more accurate memory usage report, that accounts for the full usage of the contained objects. This is optional as it can be expensive to do this deeper introspection.


The memory usage of each column can be found by calling the memory_usage method. This returns a Series with an index represented by column names and memory usage of each column shown in bytes. For the dataframe above, the memory usage of each column and the total memory usage of the dataframe can be found with the memory_usage method:


In [18]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
bool               5000 non-null bool
complex128         5000 non-null complex128
datetime64[ns]     5000 non-null datetime64[ns]
float64            5000 non-null float64
int64              5000 non-null int64
object             5000 non-null object
timedelta64[ns]    5000 non-null timedelta64[ns]
categorical        5000 non-null category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 425.6 KB


In [19]:
df.memory_usage()

Index                 80
bool                5000
complex128         80000
datetime64[ns]     40000
float64            40000
int64              40000
object             40000
timedelta64[ns]    40000
categorical        10920
dtype: int64

In [20]:
df.memory_usage().sum()

296000

In [21]:
df.memory_usage(index=False)

bool                5000
complex128         80000
datetime64[ns]     40000
float64            40000
int64              40000
object             40000
timedelta64[ns]    40000
categorical        10920
dtype: int64

In [1]:
##ESSENTIAL BASIC FUNCTIONALITY

In [3]:
import pandas as pd
import numpy as np
index = pd.date_range('1/1/2000', periods=8)

In [4]:
index

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08'],
              dtype='datetime64[ns]', freq='D')

In [5]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [6]:
df = pd.DataFrame(np.random.randn(8, 3), index=index,
                  columns=['A', 'B', 'C'])

In [7]:
df


Unnamed: 0,A,B,C
2000-01-01,0.80302,1.004358,0.145901
2000-01-02,-1.187186,-0.212143,0.020759
2000-01-03,-0.071699,0.831925,0.001975
2000-01-04,-0.505624,0.959901,-0.292569
2000-01-05,0.74626,-0.589335,-1.074199
2000-01-06,1.695816,-1.505707,-0.094453
2000-01-07,0.975223,1.537385,-1.299893
2000-01-08,-0.587501,1.131928,0.821056


In [229]:
wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
major_axis=pd.date_range('1/1/2000', periods=5),
minor_axis=['A', 'B', 'C', 'D'])


In [232]:
wp.as_matrix()



array([[[ 0.98896692, -1.42003175,  0.60820254, -0.06927342],
        [-0.40587222, -1.04939629,  0.32040119,  1.51253948],
        [ 0.14837544,  0.36873058,  0.26658242,  0.294606  ],
        [-0.35518178,  0.28761132, -0.91239707, -0.01757868],
        [-2.38374351, -0.61226   ,  0.28058534,  0.73783934]],

       [[-1.070275  ,  0.86777953,  0.5868191 ,  0.01690768],
        [ 0.57519848,  1.75594034, -1.25763281, -2.19813679],
        [ 0.13461648, -0.75891673,  0.18725296, -2.11348946],
        [-0.7533341 , -0.72605239,  0.01422364, -0.83744066],
        [ 0.70661603,  1.12180823, -1.27664818,  0.53657784]]])

In [10]:
long_series = pd.Series(np.random.randn(1000))

In [11]:
long_series.head()

0    0.391092
1   -0.645428
2   -0.375977
3    1.329915
4    0.331439
dtype: float64

In [12]:
long_series.tail(3)

997   -0.735797
998   -0.371751
999    1.087176
dtype: float64

In [13]:
df[:2]

Unnamed: 0,A,B,C
2000-01-01,0.80302,1.004358,0.145901
2000-01-02,-1.187186,-0.212143,0.020759


In [14]:
df.columns = [x.lower() for x in df.columns]


In [15]:
df

Unnamed: 0,a,b,c
2000-01-01,0.80302,1.004358,0.145901
2000-01-02,-1.187186,-0.212143,0.020759
2000-01-03,-0.071699,0.831925,0.001975
2000-01-04,-0.505624,0.959901,-0.292569
2000-01-05,0.74626,-0.589335,-1.074199
2000-01-06,1.695816,-1.505707,-0.094453
2000-01-07,0.975223,1.537385,-1.299893
2000-01-08,-0.587501,1.131928,0.821056


In [16]:
s.values

array([-1.07171067,  1.41070939,  2.31439459,  0.84051414,  0.30788718])

In [17]:
df.values

array([[ 0.80301965,  1.00435777,  0.14590118],
       [-1.18718625, -0.2121428 ,  0.02075891],
       [-0.07169884,  0.83192452,  0.00197486],
       [-0.50562428,  0.95990123, -0.29256872],
       [ 0.7462605 , -0.5893348 , -1.07419877],
       [ 1.6958162 , -1.50570735, -0.09445301],
       [ 0.97522336,  1.53738523, -1.29989305],
       [-0.58750082,  1.13192798,  0.82105574]])

In [18]:
wp.values

array([[[ 0.78752296, -0.34653899,  1.85953583, -0.12741824],
        [-0.26576317,  0.89267403, -0.90956655, -1.51146916],
        [ 0.6584737 ,  1.06258322,  0.30971611, -0.37042125],
        [-0.85545809, -0.39299941,  0.94935573,  1.17210453],
        [-1.47721789,  0.24396343,  1.99931577,  0.35319816]],

       [[ 0.66648108, -0.41773557, -0.187028  , -0.8957193 ],
        [ 1.29895514, -0.24778625,  0.77040594,  1.62211833],
        [-0.5061132 ,  0.40083772,  0.91897324, -0.06481612],
        [-0.04642707,  0.48244299,  1.43386829, -0.78360549],
        [ 0.3124662 ,  1.15914644, -0.18086868,  0.51571104]]])

In [19]:
## Accelerated operations

In [20]:
df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

In [21]:
df

Unnamed: 0,one,three,two
a,-0.384367,,-0.479292
b,0.876369,0.093015,0.634339
c,-0.644334,-0.145912,0.968932
d,,-1.104801,1.299398


In [23]:
row = df.iloc[1]

In [24]:
row

one      0.876369
three    0.093015
two      0.634339
Name: b, dtype: float64

In [25]:
column = df['two']

In [27]:
df.sub(row, axis='columns')

Unnamed: 0,one,three,two
a,-1.260736,,-1.113632
b,0.0,0.0,0.0
c,-1.520703,-0.238927,0.334593
d,,-1.197816,0.665058


In [28]:
df.sub(row, axis=1)

Unnamed: 0,one,three,two
a,-1.260736,,-1.113632
b,0.0,0.0,0.0
c,-1.520703,-0.238927,0.334593
d,,-1.197816,0.665058


In [29]:
df.sub(column, axis='index')

Unnamed: 0,one,three,two
a,0.094925,,0.0
b,0.24203,-0.541324,0.0
c,-1.613267,-1.114844,0.0
d,,-2.404199,0.0


In [30]:
df.sub(column, axis=0)

Unnamed: 0,one,three,two
a,0.094925,,0.0
b,0.24203,-0.541324,0.0
c,-1.613267,-1.114844,0.0
d,,-2.404199,0.0


In [31]:
## Furthermore you can align a level of a multi-indexed DataFrame with a Series.
dfmi = df.copy()

In [32]:
dfmi.index = pd.MultiIndex.from_tuples([(1,'a'),(1,'b'),(1,'c'),(2,'a')],
names=['first','second'])

In [33]:
dfmi.sub(column, axis=0, level='second')

Unnamed: 0_level_0,Unnamed: 1_level_0,one,three,two
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,a,0.094925,,0.0
1,b,0.24203,-0.541324,0.0
1,c,-1.613267,-1.114844,0.0
2,a,,-0.625509,1.77869


In [34]:
major_mean = wp.mean(axis='major')

In [35]:
wp.sub(major_mean, axis='major')

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

In [36]:
## Missing data / operations with fill values
df

Unnamed: 0,one,three,two
a,-0.384367,,-0.479292
b,0.876369,0.093015,0.634339
c,-0.644334,-0.145912,0.968932
d,,-1.104801,1.299398


In [38]:
df2 = pd.DataFrame(np.random.randn(8, 3), index=index,
columns=['A', 'B', 'C'])

In [39]:
df2

Unnamed: 0,A,B,C
2000-01-01,-1.180721,-2.7763,0.732917
2000-01-02,1.267004,-0.60201,-0.447131
2000-01-03,-0.020138,-0.620051,-1.404851
2000-01-04,-0.658608,-0.594525,0.924978
2000-01-05,1.070208,-0.643575,2.529221
2000-01-06,-0.367219,0.732131,-0.507819
2000-01-07,-0.133431,1.111262,1.733206
2000-01-08,0.759702,-0.154683,0.849453


In [40]:
df+df2

Unnamed: 0,A,B,C,one,three,two
2000-01-01 00:00:00,,,,,,
2000-01-02 00:00:00,,,,,,
2000-01-03 00:00:00,,,,,,
2000-01-04 00:00:00,,,,,,
2000-01-05 00:00:00,,,,,,
2000-01-06 00:00:00,,,,,,
2000-01-07 00:00:00,,,,,,
2000-01-08 00:00:00,,,,,,
a,,,,,,
b,,,,,,


In [41]:
df.add(df2, fill_value=0)

Unnamed: 0,A,B,C,one,three,two
2000-01-01 00:00:00,-1.180721,-2.7763,0.732917,,,
2000-01-02 00:00:00,1.267004,-0.60201,-0.447131,,,
2000-01-03 00:00:00,-0.020138,-0.620051,-1.404851,,,
2000-01-04 00:00:00,-0.658608,-0.594525,0.924978,,,
2000-01-05 00:00:00,1.070208,-0.643575,2.529221,,,
2000-01-06 00:00:00,-0.367219,0.732131,-0.507819,,,
2000-01-07 00:00:00,-0.133431,1.111262,1.733206,,,
2000-01-08 00:00:00,0.759702,-0.154683,0.849453,,,
a,,,,-0.384367,,-0.479292
b,,,,0.876369,0.093015,0.634339


In [42]:
df.gt(df2)

Unnamed: 0,A,B,C,one,three,two
2000-01-01 00:00:00,False,False,False,False,False,False
2000-01-02 00:00:00,False,False,False,False,False,False
2000-01-03 00:00:00,False,False,False,False,False,False
2000-01-04 00:00:00,False,False,False,False,False,False
2000-01-05 00:00:00,False,False,False,False,False,False
2000-01-06 00:00:00,False,False,False,False,False,False
2000-01-07 00:00:00,False,False,False,False,False,False
2000-01-08 00:00:00,False,False,False,False,False,False
a,False,False,False,False,False,False
b,False,False,False,False,False,False


In [43]:
df2.ne(df)

Unnamed: 0,A,B,C,one,three,two
2000-01-01 00:00:00,True,True,True,True,True,True
2000-01-02 00:00:00,True,True,True,True,True,True
2000-01-03 00:00:00,True,True,True,True,True,True
2000-01-04 00:00:00,True,True,True,True,True,True
2000-01-05 00:00:00,True,True,True,True,True,True
2000-01-06 00:00:00,True,True,True,True,True,True
2000-01-07 00:00:00,True,True,True,True,True,True
2000-01-08 00:00:00,True,True,True,True,True,True
a,True,True,True,True,True,True
b,True,True,True,True,True,True


In [44]:
# Boolean Reductions

In [45]:
(df > 0).all()

one      False
three    False
two      False
dtype: bool

In [46]:
(df > 0).any()


one      True
three    True
two      True
dtype: bool

In [47]:
(df > 0).any().any()

True

In [48]:
df.empty

False

In [49]:
pd.DataFrame(columns=list('ABC')).empty

True

In [50]:
pd.Series([True]).bool()

True

In [51]:
pd.Series([False]).bool()

False

In [52]:
pd.DataFrame([[True]]).bool()

True

In [53]:
pd.DataFrame([[False]]).bool()

False

In [54]:
    ## Comparing if objects are equivalent

In [55]:
df+df == df*2

Unnamed: 0,one,three,two
a,True,False,True
b,True,True,True
c,True,True,True
d,False,True,True


In [56]:
(df+df == df*2).all()

one      False
three    False
two       True
dtype: bool

In [57]:
## Notice that the boolean DataFrame df+df == df*2 contains some False values! That is because NaNs do not
#compare as equals:

np.nan == np.nan

False

In [58]:
(df+df).equals(df*2)

True

In [59]:
df1 = pd.DataFrame({'col':['foo', 0, np.nan]})

In [60]:
df2 = pd.DataFrame({'col':[np.nan, 0, 'foo']}, index=[2,1,0])

In [61]:
df1.equals(df2)


False

In [62]:
df1.equals(df2.sort_index())

True

A problem occasionally arising is the combination of two similar data sets where values in one are preferred over the
other. An example would be two data series representing a particular economic indicator where one is considered to
be of “higher quality”. However, the lower quality series might extend further back in history or have more complete
data coverage. As such, we would like to combine two DataFrame objects where missing values in one DataFrame
are conditionally filled with like-labeled values from the other DataFrame. The function implementing this operation
is combine_first(),

In [63]:
## Comparing array-like objects
pd.Series(['foo', 'bar', 'baz']) == 'foo'

0     True
1    False
2    False
dtype: bool

In [64]:
pd.Index(['foo', 'bar', 'baz']) == 'foo'

array([ True, False, False], dtype=bool)

In [65]:
pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])

0     True
1     True
2    False
dtype: bool

In [66]:
pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])

0     True
1     True
2    False
dtype: bool

In [67]:
pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])

ValueError: Can only compare identically-labeled Series objects

In [68]:
pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])

ValueError: Can only compare identically-labeled Series objects

In [70]:
np.array([1, 2, 3]) == np.array([2])

array([False,  True, False], dtype=bool)

In [71]:
np.array([1, 2, 3]) == np.array([1, 2])

  if __name__ == '__main__':


False

In [72]:
## Combining overlapping data sets

In [73]:
df1 = pd.DataFrame({'A' : [1., np.nan, 3., 5., np.nan],
'B' : [np.nan, 2., 3., np.nan, 6.]})

In [74]:
df2 = pd.DataFrame({'A' : [5., 2., 4., np.nan, 3., 7.],
'B' : [np.nan, np.nan, 3., 4., 6., 8.]})


In [75]:
df1

Unnamed: 0,A,B
0,1.0,
1,,2.0
2,3.0,3.0
3,5.0,
4,,6.0


In [76]:
df2


Unnamed: 0,A,B
0,5.0,
1,2.0,
2,4.0,3.0
3,,4.0
4,3.0,6.0
5,7.0,8.0


In [77]:
df1.combine_first(df2)

Unnamed: 0,A,B
0,1.0,
1,2.0,2.0
2,3.0,3.0
3,5.0,4.0
4,3.0,6.0
5,7.0,8.0


In [78]:
combiner = lambda x, y: np.where(pd.isnull(x), y, x)

In [79]:
df1.combine(df2, combiner)

Unnamed: 0,A,B
0,1.0,
1,2.0,2.0
2,3.0,3.0
3,5.0,4.0
4,3.0,6.0
5,7.0,8.0


In [80]:
##Descriptive statistics


In [82]:
df

Unnamed: 0,one,three,two
a,-0.384367,,-0.479292
b,0.876369,0.093015,0.634339
c,-0.644334,-0.145912,0.968932
d,,-1.104801,1.299398


In [83]:
df.mean(0)

one     -0.050777
three   -0.385899
two      0.605844
dtype: float64

In [84]:
df.mean(1)

a   -0.431830
b    0.534574
c    0.059562
d    0.097298
dtype: float64

In [85]:
# All such methods have a skipna option signaling whether to exclude missing data (True by default):
df.sum(0, skipna=False)

one           NaN
three         NaN
two      2.423377
dtype: float64

In [86]:
df.sum(axis=1, skipna=True)

a   -0.863659
b    1.603723
c    0.178687
d    0.194596
dtype: float64

In [87]:
ts_stand = (df - df.mean()) / df.std()

In [88]:
ts_stand.std()

one      1.0
three    1.0
two      1.0
dtype: float64

In [89]:
xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)


In [90]:
xs_stand.std(1)

a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64

In [91]:
df.cumsum()

Unnamed: 0,one,three,two
a,-0.384367,,-0.479292
b,0.492002,0.093015,0.155047
c,-0.152332,-0.052897,1.123979
d,,-1.157698,2.423377


In [92]:
np.mean(df['one'])

-0.0507774580609372

In [93]:
np.mean(df['one'].values)

nan

In [94]:
series = pd.Series(np.random.randn(500))

In [95]:
series[20:500] = np.nan

In [96]:
series[10:20] = 5

In [98]:
## Series also has a method nunique() which will return the number of unique non-null values:
series.nunique()

11

In [99]:
## Summarizing data: describe

In [100]:
series = pd.Series(np.random.randn(1000))

In [101]:
series[::2] = np.nan


In [102]:
series.describe()

count    500.000000
mean       0.126953
std        1.037183
min       -2.931329
25%       -0.618176
50%        0.087927
75%        0.790774
max        3.526849
dtype: float64

In [103]:
frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

In [105]:
frame.loc[::2] = np.nan

In [106]:
frame.describe()

Unnamed: 0,a,b,c,d,e
count,500.0,500.0,500.0,500.0,500.0
mean,0.04527,0.050884,-0.029944,-0.025274,-0.010222
std,0.992895,0.968022,1.030619,0.981891,1.029361
min,-3.677458,-2.585459,-3.833547,-2.496552,-3.201139
25%,-0.668046,-0.597402,-0.753482,-0.672231,-0.684884
50%,0.040577,0.014885,-0.084266,-0.014768,-0.02685
75%,0.762296,0.76732,0.761946,0.57178,0.743115
max,2.459558,3.137463,2.539736,3.286979,3.17338


In [109]:
frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})
frame.describe(include=['object'])

Unnamed: 0,a
count,4
unique,2
top,Yes
freq,2


In [110]:
frame.describe(include=['number'])

Unnamed: 0,b
count,4.0
mean,1.5
std,1.290994
min,0.0
25%,0.75
50%,1.5
75%,2.25
max,3.0


In [111]:
frame.describe(include='all')

Unnamed: 0,a,b
count,4,4.0
unique,2,
top,Yes,
freq,2,
mean,,1.5
std,,1.290994
min,,0.0
25%,,0.75
50%,,1.5
75%,,2.25


In [112]:
## Index of Min/Max Values

In [113]:
s1 = pd.Series(np.random.randn(5))

In [114]:
s1

0   -1.644652
1    0.048466
2    0.522437
3    0.694215
4   -1.192757
dtype: float64

In [115]:
s1.idxmin(), s1.idxmax()

(0, 3)

In [116]:
df1 = pd.DataFrame(np.random.randn(5,3), columns=['A','B','C'])

In [118]:
df1

Unnamed: 0,A,B,C
0,0.976981,1.196252,-0.49237
1,1.656404,-1.686317,-0.426543
2,-0.846679,0.104997,-0.676612
3,-1.012821,-0.291061,1.071064
4,0.367634,0.284725,1.207348


In [119]:
df1.idxmin(axis=0)

A    3
B    1
C    2
dtype: int64

In [120]:
df1.idxmax(axis=1)

0    B
1    A
2    B
3    C
4    C
dtype: object

In [121]:
df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [122]:
df3

Unnamed: 0,A
e,2.0
d,1.0
c,1.0
b,3.0
a,


In [123]:
df3['A'].idxmin()

'd'

In [124]:
## Value counts (histogramming) / Mode


In [125]:
data = np.random.randint(0, 7, size=50)

In [126]:
data

array([1, 4, 2, 4, 4, 6, 2, 0, 3, 5, 0, 1, 5, 1, 2, 6, 5, 3, 6, 1, 4, 6, 1,
       5, 4, 6, 2, 1, 5, 2, 6, 3, 6, 5, 6, 3, 4, 2, 5, 3, 1, 0, 4, 6, 1, 5,
       0, 3, 0, 4])

In [127]:
s = pd.Series(data)

In [129]:
s.value_counts()


6    9
5    8
4    8
1    8
3    6
2    6
0    5
dtype: int64

In [130]:
pd.value_counts(data)

6    9
5    8
4    8
1    8
3    6
2    6
0    5
dtype: int64

In [131]:
s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [132]:
s5.mode()


0    3
1    7
dtype: int64

In [133]:
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
"B": np.random.randint(-10, 15, size=50)})

In [134]:
df5.mode()

Unnamed: 0,A,B
0,0,-1.0
1,4,
2,5,


In [135]:
## Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample
##quantiles) functions:

In [136]:
arr = np.random.randn(20)

In [137]:
factor = pd.cut(arr, 4)

In [138]:
factor

[(-0.449, 0.429], (-1.33, -0.449], (-0.449, 0.429], (-0.449, 0.429], (-1.33, -0.449], ..., (1.307, 2.185], (0.429, 1.307], (-0.449, 0.429], (0.429, 1.307], (-0.449, 0.429]]
Length: 20
Categories (4, interval[float64]): [(-1.33, -0.449] < (-0.449, 0.429] < (0.429, 1.307] < (1.307, 2.185]]

In [139]:
factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [140]:
factor

[(0, 1], (-5, -1], (-1, 0], (-1, 0], (-5, -1], ..., (1, 5], (0, 1], (0, 1], (1, 5], (0, 1]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

In [141]:
arr = np.random.randn(30)

In [142]:
factor = pd.qcut(arr, [0, .25, .5, .75, 1])

In [143]:
factor

[(0.112, 2.013], (-2.059, -0.924], (-0.924, -0.396], (-0.924, -0.396], (0.112, 2.013], ..., (-2.059, -0.924], (-2.059, -0.924], (0.112, 2.013], (0.112, 2.013], (0.112, 2.013]]
Length: 30
Categories (4, interval[float64]): [(-2.059, -0.924] < (-0.924, -0.396] < (-0.396, 0.112] < (0.112, 2.013]]

In [144]:
pd.value_counts(factor)

(0.112, 2.013]      8
(-2.059, -0.924]    8
(-0.396, 0.112]     7
(-0.924, -0.396]    7
dtype: int64

In [145]:
arr = np.random.randn(20)

In [146]:
factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [147]:
factor


[(0.0, inf], (-inf, 0.0], (0.0, inf], (-inf, 0.0], (0.0, inf], ..., (0.0, inf], (-inf, 0.0], (0.0, inf], (-inf, 0.0], (-inf, 0.0]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]