# Introduction to Pandas

## Pandas

Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.

Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of where the data comes from:
- load
- prepare
- manipulate
- model
- analyze
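
As a rough illustration of these five steps, the sketch below runs them on a tiny, made-up CSV held in memory. The data, the column names, the `sales_df` variable and the straight-line fit with `np.polyfit` are all assumptions chosen for illustration; a real workflow would read an actual file and may use a very different model.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: an in-memory CSV stands in for a real file.
csv_text = """day,sales
1,10
2,
3,14
4,16
"""

# 1. Load: read the CSV text into a DataFrame.
sales_df = pd.read_csv(io.StringIO(csv_text))

# 2. Prepare: drop the row whose sales value is missing.
sales_df = sales_df.dropna()

# 3. Manipulate: add a derived column.
sales_df["sales_doubled"] = sales_df["sales"] * 2

# 4. Model: fit a straight line, sales = slope * day + intercept.
slope, intercept = np.polyfit(
    sales_df["day"].to_numpy(), sales_df["sales"].to_numpy(), deg=1
)

# 5. Analyze: summarize the prepared data.
summary = sales_df["sales"].describe()
print(summary)
```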

### Key Features of Pandas
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.
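
A few of these features can be sketched in a handful of lines. The example below touches on data alignment, missing-data handling, group-by aggregation and merging; the Series, DataFrames and their values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Data alignment: values are matched by index label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])
aligned_sum = a + b  # labels present in only one Series give NaN

# Missing data: replace NaN with a chosen fill value.
filled = aligned_sum.fillna(0)

# Group by: aggregate values per key.
orders = pd.DataFrame({"customer": ["ann", "bob", "ann"],
                       "amount": [5, 7, 3]})
totals = orders.groupby("customer")["amount"].sum()

# Merge: join two tables on a shared column.
cities = pd.DataFrame({"customer": ["ann", "bob"],
                       "city": ["Oslo", "Rome"]})
merged = orders.merge(cities, on="customer")
```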

Import the pandas module and additional modules:

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Object Creation


Creating a Series by passing a list of values, letting pandas create a default integer index:

In [4]:
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
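
The default integer index is only one option. As a small sketch (the `temps` Series and its labels are made up for illustration), an explicit index can be supplied at creation time, after which values can be looked up either by label or by position:

```python
import pandas as pd

# Explicit labels replace the default 0..n-1 integer index.
temps = pd.Series([21.5, 19.0, 23.1], index=["mon", "tue", "wed"])

by_label = temps["tue"]      # lookup by label
by_position = temps.iloc[0]  # lookup by position
```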

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [5]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [6]:
# np.random.randn(rows, columns) draws samples from the standard normal distribution
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -1.552183 | -0.433129 | 0.972894 | -1.343322 |
| 2013-01-02 | 1.142853 | -0.465687 | -0.796619 | -1.623627 |
| 2013-01-03 | -0.636104 | 0.004551 | -2.125687 | -0.426892 |
| 2013-01-04 | 0.177149 | 0.054317 | -0.843511 | -1.823639 |
| 2013-01-05 | 1.451345 | -0.346647 | -0.073056 | 0.760828 |
| 2013-01-06 | -0.535913 | -0.265819 | 0.557702 | 1.351285 |


Creating a DataFrame by passing a dict of objects that can be converted to a series-like structure:

In [7]:
df2 = pd.DataFrame({ 'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })
df2

|  | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |


The columns of the resulting DataFrame have different dtypes.

In [8]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
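
Because each column carries its own dtype, columns can be converted or filtered by type. The following is a minimal sketch on a made-up `frame`, using `astype` to convert one column and `select_dtypes` to keep only the numeric columns:

```python
import pandas as pd

frame = pd.DataFrame({"count": [1, 2, 3], "label": ["a", "b", "c"]})

# Convert one column to a different dtype.
frame["count"] = frame["count"].astype("float32")

# Keep only the numeric columns.
numeric_only = frame.select_dtypes(include="number")
```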

### Viewing Data

Here is how to view the top and bottom rows of the frame:

In [9]:
df.head()

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -1.552183 | -0.433129 | 0.972894 | -1.343322 |
| 2013-01-02 | 1.142853 | -0.465687 | -0.796619 | -1.623627 |
| 2013-01-03 | -0.636104 | 0.004551 | -2.125687 | -0.426892 |
| 2013-01-04 | 0.177149 | 0.054317 | -0.843511 | -1.823639 |
| 2013-01-05 | 1.451345 | -0.346647 | -0.073056 | 0.760828 |


In [10]:
df.tail(3)

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-04 | 0.177149 | 0.054317 | -0.843511 | -1.823639 |
| 2013-01-05 | 1.451345 | -0.346647 | -0.073056 | 0.760828 |
| 2013-01-06 | -0.535913 | -0.265819 | 0.557702 | 1.351285 |


Display the index, columns, and the underlying NumPy data:

In [11]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [12]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [13]:
df.values

array([[-1.55218273, -0.43312931,  0.97289359, -1.34332236],
       [ 1.14285308, -0.46568701, -0.79661866, -1.62362653],
       [-0.63610355,  0.00455149, -2.12568708, -0.4268918 ],
       [ 0.17714879,  0.0543168 , -0.84351102, -1.82363908],
       [ 1.45134549, -0.34664659, -0.07305597,  0.76082786],
       [-0.53591338, -0.26581939,  0.55770188,  1.35128532]])
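
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for getting the underlying array. A minimal sketch on a small, made-up `frame`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# 2-D NumPy array holding the same data as frame.values
arr = frame.to_numpy()
```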

describe() shows a quick statistical summary of your data:

In [14]:
df.describe()

|  | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 0.007858 | -0.242069 | -0.384713 | -0.517561 |
| std | 1.144083 | 0.222115 | 1.117097 | 1.322587 |
| min | -1.552183 | -0.465687 | -2.125687 | -1.823639 |
| 25% | -0.611056 | -0.411509 | -0.831788 | -1.55355 |
| 50% | -0.179382 | -0.306233 | -0.434837 | -0.885107 |
| 75% | 0.901427 | -0.063041 | 0.400012 | 0.463898 |
| max | 1.451345 | 0.054317 | 0.972894 | 1.351285 |
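
When only some of these statistics are needed, they can be computed directly. The sketch below uses a made-up `frame`; `describe` also accepts a `percentiles` argument to choose which percentiles are reported.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                      "B": [10.0, 20.0, 30.0, 40.0]})

column_means = frame.mean()      # one value per column
median_a = frame["A"].median()   # a single statistic for one column

# Ask for the 10th and 90th percentiles in the summary.
custom = frame.describe(percentiles=[0.1, 0.9])
```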


Transposing your data:

In [15]:
df.T

|  | 2013-01-01 00:00:00 | 2013-01-02 00:00:00 | 2013-01-03 00:00:00 | 2013-01-04 00:00:00 | 2013-01-05 00:00:00 | 2013-01-06 00:00:00 |
|---|---|---|---|---|---|---|
| A | -1.552183 | 1.142853 | -0.636104 | 0.177149 | 1.451345 | -0.535913 |
| B | -0.433129 | -0.465687 | 0.004551 | 0.054317 | -0.346647 | -0.265819 |
| C | 0.972894 | -0.796619 | -2.125687 | -0.843511 | -0.073056 | 0.557702 |
| D | -1.343322 | -1.623627 | -0.426892 | -1.823639 | 0.760828 | 1.351285 |


Sorting by an axis:

In [16]:
df.sort_index(axis=1, ascending=False)

|  | D | C | B | A |
|---|---|---|---|---|
| 2013-01-01 | -1.343322 | 0.972894 | -0.433129 | -1.552183 |
| 2013-01-02 | -1.623627 | -0.796619 | -0.465687 | 1.142853 |
| 2013-01-03 | -0.426892 | -2.125687 | 0.004551 | -0.636104 |
| 2013-01-04 | -1.823639 | -0.843511 | 0.054317 | 0.177149 |
| 2013-01-05 | 0.760828 | -0.073056 | -0.346647 | 1.451345 |
| 2013-01-06 | 1.351285 | 0.557702 | -0.265819 | -0.535913 |


Sorting by values:

In [17]:
df.sort_values(by='B')

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | 1.142853 | -0.465687 | -0.796619 | -1.623627 |
| 2013-01-01 | -1.552183 | -0.433129 | 0.972894 | -1.343322 |
| 2013-01-05 | 1.451345 | -0.346647 | -0.073056 | 0.760828 |
| 2013-01-06 | -0.535913 | -0.265819 | 0.557702 | 1.351285 |
| 2013-01-03 | -0.636104 | 0.004551 | -2.125687 | -0.426892 |
| 2013-01-04 | 0.177149 | 0.054317 | -0.843511 | -1.823639 |
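
`sort_values` sorts in ascending order by default. As a small sketch on a made-up `frame`, the order can be reversed with `ascending=False`, and a list of column names sorts by several keys in turn:

```python
import pandas as pd

frame = pd.DataFrame({"group": ["b", "a", "b", "a"],
                      "score": [1, 4, 3, 2]})

# Largest score first.
by_score_desc = frame.sort_values(by="score", ascending=False)

# Sort by group, then by score within each group.
by_group_then_score = frame.sort_values(by=["group", "score"])
```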


### Selection

#### Getting

Selecting a single column, which yields a Series, equivalent to df.A:

In [18]:
df['A']

2013-01-01   -1.552183
2013-01-02    1.142853
2013-01-03   -0.636104
2013-01-04    0.177149
2013-01-05    1.451345
2013-01-06   -0.535913
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows.

In [19]:
# get the first 3 rows (positions 0, 1 and 2; the end position is excluded)
df[0:3]

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -1.552183 | -0.433129 | 0.972894 | -1.343322 |
| 2013-01-02 | 1.142853 | -0.465687 | -0.796619 | -1.623627 |
| 2013-01-03 | -0.636104 | 0.004551 | -2.125687 | -0.426892 |


In [20]:
df['20130102':'20130104']

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | 1.142853 | -0.465687 | -0.796619 | -1.623627 |
| 2013-01-03 | -0.636104 | 0.004551 | -2.125687 | -0.426892 |
| 2013-01-04 | 0.177149 | 0.054317 | -0.843511 | -1.823639 |
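
The two slices above behave differently at the end point: slicing by position excludes the end, while slicing by label includes it. A minimal sketch that spells this out with the explicit `.iloc` and `.loc` accessors, on a made-up `frame` with the same kind of date index:

```python
import pandas as pd

idx = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": [0, 1, 2, 3, 4, 5]}, index=idx)

# Positional slice: rows 0, 1, 2 (end position excluded).
by_position = frame.iloc[0:3]

# Label slice: 2013-01-02 through 2013-01-04 (both endpoints included).
by_label = frame.loc["20130102":"20130104"]
```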


#### Selection by Label

For getting a cross section using a label:

In [22]:
df.loc[dates[0]]

A   -1.552183
B   -0.433129
C    0.972894
D   -1.343322
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

In [23]:
df.loc[:,['A','B']]

|  | A | B |
|---|---|---|
| 2013-01-01 | -1.552183 | -0.433129 |
| 2013-01-02 | 1.142853 | -0.465687 |
| 2013-01-03 | -0.636104 | 0.004551 |
| 2013-01-04 | 0.177149 | 0.054317 |
| 2013-01-05 | 1.451345 | -0.346647 |
| 2013-01-06 | -0.535913 | -0.265819 |


Label slicing includes both endpoints:

In [24]:
df.loc['20130102':'20130104',['A','B']]

|  | A | B |
|---|---|---|
| 2013-01-02 | 1.142853 | -0.465687 |
| 2013-01-03 | -0.636104 | 0.004551 |
| 2013-01-04 | 0.177149 | 0.054317 |


For getting a scalar value:

In [25]:
df.loc[dates[0],'A']

-1.552182731343274
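
For single values, pandas also offers the `.at` accessor, which takes one row label and one column label. A minimal sketch on a made-up `frame`:

```python
import pandas as pd

idx = pd.date_range("20130101", periods=3)
frame = pd.DataFrame({"A": [1.5, 2.5, 3.5]}, index=idx)

# Scalar lookup by row label and column label.
value = frame.at[idx[0], "A"]
```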

#### Selection by Position

Select via the position of the passed integers:

In [26]:
df.iloc[3]

A    0.177149
B    0.054317
C   -0.843511
D   -1.823639
Name: 2013-01-04 00:00:00, dtype: float64

Select by lists of integer position locations:

In [29]:
df.iloc[[1,2,4],[0,2]]

|  | A | C |
|---|---|---|
| 2013-01-02 | 1.142853 | -0.796619 |
| 2013-01-03 | -0.636104 | -2.125687 |
| 2013-01-05 | 1.451345 | -0.073056 |
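
Besides single integers and lists, `.iloc` accepts Python-style slices, and `.iat` returns a single value by position. The following sketch uses a made-up `frame`:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3, 4],
                      "B": [5, 6, 7, 8],
                      "C": [9, 10, 11, 12]})

block = frame.iloc[1:3, 0:2]     # rows 1-2, columns 0-1 (end positions excluded)
whole_rows = frame.iloc[1:3, :]  # a bare ':' keeps every column
cell = frame.iat[1, 1]           # scalar access by row and column position
```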


### Boolean Indexing

Using a single column’s values to select data.

In [30]:
df[df.A > 0]

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | 1.142853 | -0.465687 | -0.796619 | -1.623627 |
| 2013-01-04 | 0.177149 | 0.054317 | -0.843511 | -1.823639 |
| 2013-01-05 | 1.451345 | -0.346647 | -0.073056 | 0.760828 |
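
Conditions on several columns can be combined with the element-wise operators `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on a made-up `frame`:

```python
import pandas as pd

frame = pd.DataFrame({"A": [-1, 2, 3], "B": [5, -6, 7]})

# Rows where both columns are positive.
both_positive = frame[(frame["A"] > 0) & (frame["B"] > 0)]

# Rows where at least one column is negative.
either_negative = frame[(frame["A"] < 0) | (frame["B"] < 0)]
```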


Selecting values from a DataFrame where a boolean condition is met.

In [31]:
df[df > 0]

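|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | NaN | NaN | 0.972894 | NaN |
| 2013-01-02 | 1.142853 | NaN | NaN | NaN |
| 2013-01-03 | NaN | 0.004551 | NaN | NaN |
| 2013-01-04 | 0.177149 | 0.054317 | NaN | NaN |
| 2013-01-05 | 1.451345 | NaN | NaN | 0.760828 |
| 2013-01-06 | NaN | NaN | 0.557702 | 1.351285 |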


Using the isin() method for filtering:

In [32]:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2

|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2013-01-01 | -1.552183 | -0.433129 | 0.972894 | -1.343322 | one |
| 2013-01-02 | 1.142853 | -0.465687 | -0.796619 | -1.623627 | one |
| 2013-01-03 | -0.636104 | 0.004551 | -2.125687 | -0.426892 | two |
| 2013-01-04 | 0.177149 | 0.054317 | -0.843511 | -1.823639 | three |
| 2013-01-05 | 1.451345 | -0.346647 | -0.073056 | 0.760828 | four |
| 2013-01-06 | -0.535913 | -0.265819 | 0.557702 | 1.351285 | three |


In [33]:
df2[df2['E'].isin(['two','four'])]

|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2013-01-03 | -0.636104 | 0.004551 | -2.125687 | -0.426892 | two |
| 2013-01-05 | 1.451345 | -0.346647 | -0.073056 | 0.760828 | four |
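
The boolean mask returned by `isin()` can be inverted with `~` to keep the rows that do not match. A minimal sketch on a made-up `frame`:

```python
import pandas as pd

frame = pd.DataFrame({"E": ["one", "two", "three", "four"]})

# Keep rows whose E value is NOT in the list.
kept = frame[~frame["E"].isin(["two", "four"])]
```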


### Reference:

https://pandas.pydata.org/pandas-docs/stable/10min.html#selection-by-label