In [1]:
import array
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


Creating a **Series** by passing a list of values, letting pandas create a default integer index:

In [2]:
s= pd.Series([1,3,5, np.nan,6,8])

In [3]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
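
The `float64` dtype above comes from the `np.nan` entry: NaN is a floating-point value, so the integers are upcast to floats. The following is a minimal sketch of that behaviour; the variable names `ints` and `with_nan` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A list of plain Python ints is stored with an integer dtype
# (int64 on most platforms).
ints = pd.Series([1, 3, 5, 6, 8])

# np.nan is a float, so including it upcasts the whole Series to float64.
with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

print(ints.dtype)
print(with_nan.dtype)          # float64
print(with_nan.isna().sum())   # number of missing entries: 1
```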

Creating a *DataFrame* by passing a NumPy array, with a datetime index and labeled columns:

In [4]:
dates = pd.date_range('20130101', periods=6)  # six consecutive days starting 2013-01-01
dates



DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [5]:
df= pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.002974,-0.084594,-0.979501,0.617876
2013-01-02,-0.023601,1.640083,2.232776,-1.735685
2013-01-03,-0.970831,0.130434,-0.694079,-0.155694
2013-01-04,0.434012,0.13102,-0.21612,-0.248139
2013-01-05,-0.020577,0.177296,1.383906,-0.939243
2013-01-06,0.85407,1.760759,1.162881,-0.945938


In [6]:
m=np.random.randn(6,4)
m

array([[ 0.01938665, -1.99589551, -0.15124967,  0.01853181],
       [ 1.91795696,  1.2413333 , -0.62226611,  0.94501343],
       [ 1.25771592, -0.32006981,  1.69449089,  0.48667823],
       [-0.12436185,  0.77400296, -0.34639285, -0.82931585],
       [-1.10437613, -1.39633681, -0.76916567, -0.13040665],
       [ 1.35258943, -0.19376015, -0.52840844, -1.64653387]])

Creating a **DataFrame** by passing a dict of objects that can be converted to a series-like structure:

In [7]:
df2= pd.DataFrame({ 'A': 1.,
             'B': pd.Timestamp('20130102'),
             'C': pd.Series(1, index=list(range(4)), dtype='float32'),
             'D': np.array([3]*4, dtype='int32'),
             'E': pd.Categorical(['test', 'train', 'test','train']),
             'F': 'foo'})
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


As an aside, the IPython `%time` magic function reports how long a single statement takes to run:

In [8]:
%time sum([x for x in range(100000)]) #Example of magic function

CPU times: user 11.2 ms, sys: 326 µs, total: 11.5 ms
Wall time: 10.7 ms


4999950000

The columns of the resulting **DataFrame** have different dtypes

In [9]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:

In [17]:
#df2.<tab> #Check why the <TAB> attribute is not working

In an interactive session, the columns A, B, C, and D are offered as tab completions, and E as well; the rest of the attributes are omitted here for brevity.
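
Tab completion cannot be shown in static output, but a rough check is possible with `dir()`, which is what completers typically consult. This sketch assumes that pandas adds column names to `dir()` only when they are valid Python identifiers; the frame and the column `"my col"` are hypothetical.

```python
import pandas as pd

# Two identifier-like column names and one that contains a space.
frame = pd.DataFrame({"A": [1.0], "B": [2.0], "my col": [3.0]})

names = dir(frame)

print("A" in names)        # identifier-like columns are exposed as attributes
print("my col" in names)   # a name with a space cannot be an attribute
print(frame.A.equals(frame["A"]))  # attribute access returns the same column
```

Columns that are not valid identifiers remain reachable through `frame["my col"]`.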

# Viewing Data

See the [Basic Section](https://pandas.pydata.org/pandas-docs/stable/basics.html)

Here is how to view the top and bottom rows of the frame:

In [31]:
df.head()

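Unnamed: 0,A,B,C,D
2013-01-01,-0.002974,-0.084594,-0.979501,0.617876
2013-01-02,-0.023601,1.640083,2.232776,-1.735685
2013-01-03,-0.970831,0.130434,-0.694079,-0.155694
2013-01-04,0.434012,0.13102,-0.21612,-0.248139
2013-01-05,-0.020577,0.177296,1.383906,-0.939243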


In [32]:
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,0.434012,0.13102,-0.21612,-0.248139
2013-01-05,-0.020577,0.177296,1.383906,-0.939243
2013-01-06,0.85407,1.760759,1.162881,-0.945938


Display the index, columns, and the underlying *NumPy* data:

In [33]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [34]:
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

In [35]:
df.values

array([[-0.00297442, -0.08459447, -0.97950057,  0.61787578],
       [-0.02360141,  1.64008278,  2.23277602, -1.73568477],
       [-0.97083102,  0.13043365, -0.69407947, -0.15569449],
       [ 0.43401216,  0.13101973, -0.21612008, -0.24813895],
       [-0.02057651,  0.17729579,  1.3839061 , -0.93924275],
       [ 0.85407003,  1.76075901,  1.16288118, -0.94593834]])
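
Recent pandas versions also provide `DataFrame.to_numpy()` for the same purpose as `.values`. The sketch below uses small hypothetical frames (`frame`, `mixed`) to show that a frame with one dtype converts to an array of that dtype, while mixed columns fall back to `object`.

```python
import numpy as np
import pandas as pd

# All-float frame: the resulting array keeps the float64 dtype.
frame = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
arr = frame.to_numpy()
print(arr.shape)   # (2, 2)
print(arr.dtype)   # float64

# Mixed float and string columns: NumPy needs one common dtype,
# so the array is created with dtype object.
mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["foo", "bar"]})
mixed_arr = mixed.to_numpy()
print(mixed_arr.dtype)  # object
```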

[describe()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html#pandas.DataFrame.describe) shows a quick statistical summary of your data:

In [37]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.045016,0.625833,0.481644,-0.567804
std,0.608136,0.838202,1.292189,0.8152
min,-0.970831,-0.084594,-0.979501,-1.735685
25%,-0.022845,0.13058,-0.57459,-0.944264
50%,-0.011775,0.154158,0.473381,-0.593691
75%,0.324766,1.274386,1.32865,-0.178806
max,0.85407,1.760759,2.232776,0.617876


**Transposing** your data:

In [38]:
df.T

Unnamed: 0,2013-01-01 00:00:00,2013-01-02 00:00:00,2013-01-03 00:00:00,2013-01-04 00:00:00,2013-01-05 00:00:00,2013-01-06 00:00:00
A,-0.002974,-0.023601,-0.970831,0.434012,-0.020577,0.85407
B,-0.084594,1.640083,0.130434,0.13102,0.177296,1.760759
C,-0.979501,2.232776,-0.694079,-0.21612,1.383906,1.162881
D,0.617876,-1.735685,-0.155694,-0.248139,-0.939243,-0.945938


**Sorting** by an axis:

In [40]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,0.617876,-0.979501,-0.084594,-0.002974
2013-01-02,-1.735685,2.232776,1.640083,-0.023601
2013-01-03,-0.155694,-0.694079,0.130434,-0.970831
2013-01-04,-0.248139,-0.21612,0.13102,0.434012
2013-01-05,-0.939243,1.383906,0.177296,-0.020577
2013-01-06,-0.945938,1.162881,1.760759,0.85407


Sorting by values:

In [41]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-01,-0.002974,-0.084594,-0.979501,0.617876
2013-01-03,-0.970831,0.130434,-0.694079,-0.155694
2013-01-04,0.434012,0.13102,-0.21612,-0.248139
2013-01-05,-0.020577,0.177296,1.383906,-0.939243
2013-01-02,-0.023601,1.640083,2.232776,-1.735685
2013-01-06,0.85407,1.760759,1.162881,-0.945938
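
Because the values above are random, the two sorting calls are easier to verify on fixed data. This is a small sketch with a hypothetical frame; `ascending=False` reverses the order in both cases.

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [3, 1, 2], "B": [0.5, -1.0, 2.0]},
    index=["x", "y", "z"],
)

# Sort the columns (axis=1) by their labels, in descending order.
by_columns = frame.sort_index(axis=1, ascending=False)
print(by_columns.columns.tolist())  # ['B', 'A']

# Sort the rows by the values in column B, largest first.
by_b_desc = frame.sort_values(by="B", ascending=False)
print(by_b_desc.index.tolist())     # ['z', 'x', 'y']
```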


# Selection

**Note:** While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods, *.at, .iat, .loc* and *.iloc*.

See the indexing documentation [Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing) and [MultiIndex / Advanced Indexing](https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced).
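
As a quick orientation, the four accessors named in the note can be compared on a single cell. This is a minimal sketch on a hypothetical frame with known values: `.loc` and `.at` take labels, `.iloc` and `.iat` take integer positions, and `.at` / `.iat` accept only a single cell.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
# Values 0.0 .. 5.0 laid out row by row, so row 1 / column B holds 3.0.
frame = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=dates, columns=["A", "B"])

by_label = frame.loc[dates[1], "B"]     # label-based, general purpose
by_position = frame.iloc[1, 1]          # position-based, general purpose
fast_label = frame.at[dates[1], "B"]    # label-based, scalar only
fast_position = frame.iat[1, 1]         # position-based, scalar only

print(by_label, by_position, fast_label, fast_position)  # 3.0 3.0 3.0 3.0
```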

# Getting

Selecting a single column, which yields a *Series*, equivalent to *df.A*:

In [45]:
n=df['A']
n

2013-01-01   -0.002974
2013-01-02   -0.023601
2013-01-03   -0.970831
2013-01-04    0.434012
2013-01-05   -0.020577
2013-01-06    0.854070
Freq: D, Name: A, dtype: float64

In [46]:
type(n)

pandas.core.series.Series

Selecting via *[]*, which slices the rows (subsetting):

In [47]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,-0.002974,-0.084594,-0.979501,0.617876
2013-01-02,-0.023601,1.640083,2.232776,-1.735685
2013-01-03,-0.970831,0.130434,-0.694079,-0.155694


In [None]:
df['20130102':'20130104']
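
The cell above has no recorded output. The difference between the two slicing styles can be checked on a frame with known values: integer slices are positional and exclude the stop, while label slices include both endpoints. The sketch uses a hypothetical frame and writes the label slice with `.loc`, which is an explicit alternative to the bare `[]` form.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": range(6)}, index=dates)

# Positional slice: rows 0, 1, 2 (the stop position 3 is excluded).
by_position = frame[0:3]
print(by_position["A"].tolist())  # [0, 1, 2]

# Label slice: 2013-01-02 through 2013-01-04, both endpoints included.
by_label = frame.loc["20130102":"20130104"]
print(by_label["A"].tolist())     # [1, 2, 3]
```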

# Selection by Label

See more in [Selection by Label](https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label)

For getting a cross section using a label:

In [49]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.002974,-0.084594,-0.979501,0.617876
2013-01-02,-0.023601,1.640083,2.232776,-1.735685
2013-01-03,-0.970831,0.130434,-0.694079,-0.155694
2013-01-04,0.434012,0.13102,-0.21612,-0.248139
2013-01-05,-0.020577,0.177296,1.383906,-0.939243
2013-01-06,0.85407,1.760759,1.162881,-0.945938


In [48]:
df.loc[dates[0]]  # select a row by its index label (here, the first date)

A   -0.002974
B   -0.084594
C   -0.979501
D    0.617876
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label (Selection by Column name):

In [50]:
df.loc[:, ['A', 'B']]

Unnamed: 0,A,B
2013-01-01,-0.002974,-0.084594
2013-01-02,-0.023601,1.640083
2013-01-03,-0.970831,0.130434
2013-01-04,0.434012,0.13102
2013-01-05,-0.020577,0.177296
2013-01-06,0.85407,1.760759


Label slicing includes both endpoints (row and column selection):

In [55]:
df.loc['20130102':'20130104',['A','B', 'D']]

Unnamed: 0,A,B,D
2013-01-02,-0.023601,1.640083,-1.735685
2013-01-03,-0.970831,0.130434,-0.155694
2013-01-04,0.434012,0.13102,-0.248139


Reduction in the dimensions of the returned object:

In [54]:
df.loc['20130102', ['A', 'B', 'D']]

A   -0.023601
B    1.640083
D   -1.735685
Name: 2013-01-02 00:00:00, dtype: float64
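
The reduction in dimensions can be made explicit by checking the type returned at each step. This is a sketch on a hypothetical frame: a row slice with a column list gives a DataFrame, a single row label with a column list gives a Series, and a single row label with a single column label gives a scalar.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=4)
frame = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=dates, columns=["A", "B"])

two_d = frame.loc["20130102":"20130103", ["A", "B"]]  # slice of rows -> DataFrame
one_d = frame.loc[dates[1], ["A", "B"]]               # one row -> Series
zero_d = frame.loc[dates[1], "A"]                     # one cell -> scalar

print(type(two_d).__name__)  # DataFrame
print(type(one_d).__name__)  # Series
print(zero_d)                # 2.0
```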

For getting a scalar value (pass a row label, here one of `dates[0]` through `dates[5]`, and a column label):

In [58]:
df.loc[dates[0], 'A']

-0.002974415489522606

In [60]:
df.loc[dates[1], 'A']

-0.02360141489271486

For getting fast access to a scalar (equivalent to the prior method):

In [62]:
df.at[dates[0], 'A']

-0.002974415489522606

# Selection by Position

See more in [Selection by Position](https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer).

Select via the position of the passed integers:

In [67]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.002974,-0.084594,-0.979501,0.617876
2013-01-02,-0.023601,1.640083,2.232776,-1.735685
2013-01-03,-0.970831,0.130434,-0.694079,-0.155694
2013-01-04,0.434012,0.13102,-0.21612,-0.248139
2013-01-05,-0.020577,0.177296,1.383906,-0.939243
2013-01-06,0.85407,1.760759,1.162881,-0.945938


In [70]:
df.shape

(6, 4)

In [71]:
df.iloc[:3, :3]

Unnamed: 0,A,B,C
2013-01-01,-0.002974,-0.084594,-0.979501
2013-01-02,-0.023601,1.640083,2.232776
2013-01-03,-0.970831,0.130434,-0.694079


In [87]:
df.iloc[:2, 2]

2013-01-01   -0.979501
2013-01-02    2.232776
Freq: D, Name: C, dtype: float64
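
Besides slices, `.iloc` also accepts single integers and lists of integer positions. The following is a minimal sketch on a hypothetical frame with known values, so each result can be checked by hand.

```python
import numpy as np
import pandas as pd

# Values 0 .. 11 laid out row by row in a 3 x 4 frame.
frame = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

# A single integer selects one row as a Series.
row = frame.iloc[1]
print(row.tolist())                 # [4, 5, 6, 7]

# Lists of positions pick specific rows and columns.
picked = frame.iloc[[0, 2], [1, 3]]
print(picked.to_numpy().tolist())   # [[1, 3], [9, 11]]

# Slices follow Python semantics: the stop position is excluded.
first_two_of_c = frame.iloc[:2, 2]
print(first_two_of_c.tolist())      # [2, 6]
```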