# Pandas Example

pandas is a Python library for working with tabular data organized in rows and columns.

## Object Creation

In [1]:
import numpy as np
import pandas as pd

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creates a Series from a list of values; pandas supplies a default integer index.
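As a small sketch of what the default index and the `float64` dtype mean in practice: `np.nan` is a float, so the integers in the list are stored as floats, and the default index simply counts up from 0. The variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# The default index counts up from 0.
index_labels = list(s.index)

# np.nan is a float, so the integer values are upcast to float64.
dtype_name = str(s.dtype)

# isna() flags the missing entry; count() skips it.
n_missing = int(s.isna().sum())
n_present = int(s.count())

print(index_labels, dtype_name, n_missing, n_present)
```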

In [3]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

Creates a DatetimeIndex of six consecutive days starting on 2013-01-01.
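The range does not have to be daily or defined by a count. The sketch below shows two variations, a step of two days and an explicit end date; the names `every_other_day` and `bounded` are made up for this example.

```python
import pandas as pd

# Six consecutive days (daily frequency is the default).
dates = pd.date_range('20130101', periods=6)

# Same start, but stepping two days at a time.
every_other_day = pd.date_range('20130101', periods=3, freq='2D')

# Give a start and an end instead of a number of periods.
bounded = pd.date_range(start='2013-01-01', end='2013-01-06')

print(every_other_day)
print(bounded.equals(dates))
```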

In [4]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

                   A         B         C         D
2013-01-01  0.382990 -0.164855  0.383959 -1.601526
2013-01-02 -1.190067  0.323596  0.874489 -2.016138
2013-01-03 -0.170926 -0.960040  0.218619  1.041684
2013-01-04 -0.244392  0.933560 -1.380531  0.573557
2013-01-05 -0.816873  1.060858  1.457101 -0.276459
2013-01-06 -1.204632 -1.832616  0.108081 -0.050782


Creates a DataFrame by passing a NumPy array, with a datetime index and labeled columns
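Because the values are random, rerunning the cell gives different numbers each time. One way to get repeatable values, sketched here under the assumption that NumPy's `default_rng` generator is an acceptable substitute for `np.random.randn`, is to seed the generator. The seed value and the helper name `make_frame` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

def make_frame(seed):
    # A seeded generator yields the same numbers on every run.
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.standard_normal((6, 4)),
                        index=dates, columns=list('ABCD'))

df = make_frame(seed=0)
df_again = make_frame(seed=0)

print(df.shape)
print(df.equals(df_again))
```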

In [5]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

df2

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


Creates a DataFrame by passing a dict of objects that can be converted to a series-like structure. Scalar values such as 1. and 'foo' are repeated for every row.

In [6]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

The columns of the resulting DataFrame have different dtypes
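Per-column dtypes can also be used to pick out columns. The following sketch rebuilds `df2` and uses `select_dtypes` to keep only the numeric columns; the variable name `numeric` is just for illustration.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

# Keep only columns with a numeric dtype (floats and integers here).
numeric = df2.select_dtypes(include='number')

print(list(numeric.columns))
```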

## Viewing Data

In [7]:
df.head()

                   A         B         C         D
2013-01-01  0.382990 -0.164855  0.383959 -1.601526
2013-01-02 -1.190067  0.323596  0.874489 -2.016138
2013-01-03 -0.170926 -0.960040  0.218619  1.041684
2013-01-04 -0.244392  0.933560 -1.380531  0.573557
2013-01-05 -0.816873  1.060858  1.457101 -0.276459


In [8]:
df.tail()

                   A         B         C         D
2013-01-02 -1.190067  0.323596  0.874489 -2.016138
2013-01-03 -0.170926 -0.960040  0.218619  1.041684
2013-01-04 -0.244392  0.933560 -1.380531  0.573557
2013-01-05 -0.816873  1.060858  1.457101 -0.276459
2013-01-06 -1.204632 -1.832616  0.108081 -0.050782


head() and tail() show the top and bottom rows of the frame, five rows by default.
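Both methods accept a row count. The sketch below uses a stand-in frame filled with the integers 0 to 23 (not the random data above) so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# Stand-in data: integers 0..23 instead of random values.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

first_two = df.head(2)    # first two rows
last_three = df.tail(3)   # last three rows

print(first_two)
print(last_three)
```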

In [9]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

Displays the index (the row labels).

In [10]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

Displays the column labels.

In [11]:
df.to_numpy()

array([[ 0.38299023, -0.16485515,  0.38395892, -1.60152574],
       [-1.19006691,  0.32359642,  0.87448911, -2.0161382 ],
       [-0.17092561, -0.96003979,  0.21861923,  1.04168424],
       [-0.24439246,  0.93356017, -1.38053126,  0.57355668],
       [-0.81687299,  1.06085777,  1.45710114, -0.27645853],
       [-1.20463163, -1.8326158 ,  0.10808118, -0.05078215]])

Converts the DataFrame to a NumPy array. Every column of df is float64, so the result is a plain float64 array and the conversion is fast.

In [12]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

Converting a DataFrame whose columns have different dtypes is relatively expensive: a NumPy array has a single dtype, so pandas falls back to object, which can hold every value.

Note: DataFrame.to_numpy() does not include the index or column labels in the output.
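The difference between the two conversions can be seen from the dtype of the resulting array. This is a minimal sketch with two tiny stand-in frames, `homogeneous` and `mixed`, which are not part of the example above.

```python
import pandas as pd

# All columns are floats, so the array keeps a float dtype.
homogeneous = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})

# Floats and strings share no numeric dtype, so the array is object.
mixed = pd.DataFrame({'A': [1.0, 2.0], 'B': ['x', 'y']})

homogeneous_array = homogeneous.to_numpy()
mixed_array = mixed.to_numpy()

print(homogeneous_array.dtype, mixed_array.dtype)
# Only the values are returned; index and column labels are dropped.
print(homogeneous_array.shape)
```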

In [13]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.540650 -0.106583  0.276953 -0.388277
std    0.635057  1.126130  0.952935  1.201518
min   -1.204632 -1.832616 -1.380531 -2.016138
25%   -1.096768 -0.761244  0.135716 -1.270259
50%   -0.530633  0.079371  0.301289 -0.163620
75%   -0.189292  0.781069  0.751857  0.417472
max    0.382990  1.060858  1.457101  1.041684


Shows a quick statistical summary of each column.
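The summary is itself a DataFrame, so individual statistics can be looked up by label. A short sketch on a small stand-in column, where the expected values are easy to work out by hand:

```python
import pandas as pd

# Stand-in data: one column with the values 1, 2, 3, 4.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]})

summary = df.describe()

# Rows of the summary are labeled by statistic, columns by column name.
mean_a = summary.loc['mean', 'A']
median_a = summary.loc['50%', 'A']

print(summary)
print(mean_a, median_a)
```

The `std` row matches `df['A'].std()`, which computes the sample standard deviation.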

In [14]:
df.T

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.382990   -1.190067   -0.170926   -0.244392   -0.816873   -1.204632
B   -0.164855    0.323596   -0.960040    0.933560    1.060858   -1.832616
C    0.383959    0.874489    0.218619   -1.380531    1.457101    0.108081
D   -1.601526   -2.016138    1.041684    0.573557   -0.276459   -0.050782


Transposes the data: rows become columns and columns become rows.
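To make the swap concrete, this sketch transposes a small stand-in frame and checks that the index and the columns trade places. The labels `r0` and `r1` are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.0).reshape(2, 3),
                  index=['r0', 'r1'], columns=list('ABC'))

t = df.T

# The old column labels are now the index, and vice versa.
print(t.shape)
print(list(t.index), list(t.columns))

# A value keeps its pair of labels, only in the opposite order.
print(t.loc['B', 'r1'] == df.loc['r1', 'B'])
```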

In [15]:
df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2013-01-01 -1.601526  0.383959 -0.164855  0.382990
2013-01-02 -2.016138  0.874489  0.323596 -1.190067
2013-01-03  1.041684  0.218619 -0.960040 -0.170926
2013-01-04  0.573557 -1.380531  0.933560 -0.244392
2013-01-05 -0.276459  1.457101  1.060858 -0.816873
2013-01-06 -0.050782  0.108081 -1.832616 -1.204632


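Sorts by the labels along an axis. Here axis=1 sorts the column labels, and ascending=False puts them in reverse order (D, C, B, A).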
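For comparison, the default `axis=0` sorts the row labels instead. A minimal sketch on a stand-in frame with a deliberately shuffled index:

```python
import pandas as pd

# Stand-in data with rows and columns both out of order.
df = pd.DataFrame({'B': [1, 2, 3], 'A': [4, 5, 6]}, index=[2, 0, 1])

by_rows = df.sort_index()                             # axis=0: row labels
by_cols_desc = df.sort_index(axis=1, ascending=False) # axis=1: column labels

print(list(by_rows.index))
print(list(by_cols_desc.columns))
# sort_index returns a new frame; df itself keeps its order.
print(list(df.index))
```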

In [16]:
df.sort_values(by='B')

                   A         B         C         D
2013-01-06 -1.204632 -1.832616  0.108081 -0.050782
2013-01-03 -0.170926 -0.960040  0.218619  1.041684
2013-01-01  0.382990 -0.164855  0.383959 -1.601526
2013-01-02 -1.190067  0.323596  0.874489 -2.016138
2013-01-04 -0.244392  0.933560 -1.380531  0.573557
2013-01-05 -0.816873  1.060858  1.457101 -0.276459


Sorts the rows by the values in column B, in ascending order.
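`sort_values` also takes `ascending=False` and a list of columns to break ties. The sketch below uses a small stand-in frame so the resulting row order can be checked by hand.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1], 'B': [0.5, -1.0, 0.1]})

ascending = df.sort_values(by='B')
descending = df.sort_values(by='B', ascending=False)

# Sort by A first, then by B within equal values of A.
by_two_keys = df.sort_values(by=['A', 'B'])

print(list(ascending.index))
print(list(descending.index))
print(list(by_two_keys.index))
```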

## Selection

In [19]:
df['A']

2013-01-01    0.382990
2013-01-02   -1.190067
2013-01-03   -0.170926
2013-01-04   -0.244392
2013-01-05   -0.816873
2013-01-06   -1.204632
Freq: D, Name: A, dtype: float64

Selects a single column, which yields a Series.
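The type of the result depends on what goes inside the brackets. In this sketch, on a small stand-in frame, a single label gives a Series while a one-element list of labels gives a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})

col = df['A']       # single label: a Series named 'A'
frame = df[['A']]   # list of labels: a one-column DataFrame

print(type(col).__name__, col.name)
print(type(frame).__name__, frame.shape)

# Attribute access is an alternative spelling for simple column names.
print(df.A.equals(col))
```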

In [18]:
df[0:3]

                   A         B         C         D
2013-01-01  0.382990 -0.164855  0.383959 -1.601526
2013-01-02 -1.190067  0.323596  0.874489 -2.016138
2013-01-03 -0.170926 -0.960040  0.218619  1.041684
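Slicing with `[]` selects rows rather than columns, and `0:3` returns the rows at positions 0, 1 and 2. As a rough sketch of how this compares with slicing by label, the example below uses a stand-in frame of integers and `.loc`, which is not used in the cells above; note that a label slice includes both endpoints.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# Stand-in data: integers 0..23 instead of random values.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Positional slice: the end position is excluded.
first_three = df[0:3]

# Label slice: both the start and end labels are included.
by_label = df.loc['20130102':'20130104']

print(first_three)
print(by_label)
```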
