
# Pandas

The Pandas library is built on NumPy and provides easy-to-use data structures and data analysis tools for the Python programming language.
Use the following import convention:

`import pandas as pd`

In [2]:
import pandas as pd

# Pandas Data Structures

## Series

A one-dimensional labeled array capable of holding any data type.

In [4]:
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s

a    3
b   -5
c    7
d    4
dtype: int64
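
As a small illustration of the "labeled array" idea, the sketch below builds the same Series from a dict (keys become the index labels) and pulls the labels and values apart. The variable names `labels` and `values` are only illustrative.

```python
import pandas as pd

# Build the same Series from a dict: keys become the index labels.
s = pd.Series({'a': 3, 'b': -5, 'c': 7, 'd': 4})

labels = list(s.index)   # the index labels
values = s.to_numpy()    # the underlying NumPy array of values
print(labels, values, s.dtype)
```

Because the index travels with the data, `s['b']` looks a value up by label rather than by position.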

## DataFrame

A two-dimensional labeled data structure with columns of potentially different types.

In [12]:
data = {'Country' : ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
data

{'Capital': ['Brussels', 'New Delhi', 'Brasília'],
 'Country': ['Belgium', 'India', 'Brazil'],
 'Population': [11190846, 1303171035, 207847528]}

In [13]:
df = pd.DataFrame(data, columns=['Country','Capital','Population'])
df

   Country    Capital  Population
0  Belgium   Brussels    11190846
1    India  New Delhi  1303171035
2   Brazil   Brasília   207847528
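
A DataFrame can also be built row by row. The sketch below uses a list of dicts (one dict per row) and then inspects the per-column types. The names `records` and `df_records` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Each dict is one row; the keys become the column names.
records = [
    {'Country': 'Belgium', 'Capital': 'Brussels', 'Population': 11190846},
    {'Country': 'India', 'Capital': 'New Delhi', 'Population': 1303171035},
    {'Country': 'Brazil', 'Capital': 'Brasília', 'Population': 207847528},
]
df_records = pd.DataFrame(records)

# Every column carries its own dtype, so text and integers can coexist.
column_types = df_records.dtypes
print(column_types)
```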


# Dropping

In [15]:
s.drop(['a', 'c']) #Drop values from rows (axis=0)

b   -5
d    4
dtype: int64

In [16]:
df.drop('Country', axis=1) #Drop values from columns

     Capital  Population
0   Brussels    11190846
1  New Delhi  1303171035
2   Brasília   207847528
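
Note that `drop` returns a new object and leaves the original untouched. The sketch below checks this and also shows the `index=` and `columns=` keywords, which are an alternative to passing `axis`. The data is a shortened copy of the frame used above.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Population': [11190846, 1303171035, 207847528]})

dropped = s.drop(['a', 'c'])                  # new Series; s still has 4 entries
without_country = df.drop(columns='Country')  # same as df.drop('Country', axis=1)
without_row0 = df.drop(index=0)               # same as df.drop(0, axis=0)

print(len(s), list(dropped.index))
print(list(df.columns), list(without_country.columns))
```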


# Asking For Help

In [None]:
help(pd.Series.loc)

# Sort & Rank

In [20]:
df.sort_index() #Sort by labels along an axis

   Country    Capital  Population
0  Belgium   Brussels    11190846
1    India  New Delhi  1303171035
2   Brazil   Brasília   207847528


In [21]:
df.sort_values(by='Country') #Sort by the values

   Country    Capital  Population
0  Belgium   Brussels    11190846
2   Brazil   Brasília   207847528
1    India  New Delhi  1303171035


In [22]:
df.rank() #Assign ranks to entries

   Country  Capital  Population
0      1.0      2.0         1.0
1      3.0      3.0         3.0
2      2.0      1.0         2.0
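
Both sorting and ranking accept an `ascending` flag. As a minimal sketch on the same data, the following sorts by population from largest to smallest and ranks the populations so that the largest gets rank 1.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Population': [11190846, 1303171035, 207847528]})

# Largest population first; the original row labels are kept.
by_pop = df.sort_values(by='Population', ascending=False)

# Rank 1.0 goes to the largest value when ascending=False.
pop_rank = df['Population'].rank(ascending=False)

print(by_pop)
print(pop_rank)
```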


# I/O

## Read and Write to CSV

In [None]:
pd.read_csv('file.csv', header=None, nrows=5)
df.to_csv('myDataFrame.csv')
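
The calls above need files on disk. To try the round trip without one, the sketch below writes to an in-memory `io.StringIO` buffer instead, which is an assumption made for the example. It also shows what `header=None` and `nrows` do: the header line is read as an ordinary data row and only the first rows are returned.

```python
import io
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Capital': ['Brussels', 'New Delhi', 'Brasília'],
                   'Population': [11190846, 1303171035, 207847528]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False: do not write the row labels

buffer.seek(0)
df_back = pd.read_csv(buffer)    # first line is used as the header

buffer.seek(0)
raw = pd.read_csv(buffer, header=None, nrows=2)  # header line becomes row 0

print(df_back)
print(raw)
```

Without `index=False`, the row labels are written as an extra unnamed first column, which shows up as `Unnamed: 0` when the file is read back.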

## Read and Write to Excel

In [None]:
pd.read_excel('file.xlsx')
df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1')

### Read multiple sheets from the same file

In [None]:
xlsx = pd.ExcelFile('file.xls')
df = pd.read_excel(xlsx,'Sheet1')

## Read and Write to SQL Query or Database Table

In [None]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
pd.read_sql("SELECT * FROM my_table;", engine)
pd.read_sql_table('my_table', engine)
pd.read_sql_query("SELECT * FROM my_table;", engine)

`read_sql()` is a convenience wrapper around `read_sql_table()` and `read_sql_query()`.

In [None]:
df.to_sql('myDf', engine)
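
The SQL cells above assume that `my_table` already exists in the database. A self-contained round trip can be sketched with the standard-library `sqlite3` module in place of a SQLAlchemy engine (an assumption made here to avoid the extra dependency). `read_sql_table()` is left out because, as far as I know, it requires SQLAlchemy.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Population': [11190846, 1303171035, 207847528]})

conn = sqlite3.connect(':memory:')         # throwaway in-memory database
df.to_sql('my_table', conn, index=False)   # create and fill the table

query = ("SELECT Country, Population FROM my_table "
         "WHERE Population > 100000000 ORDER BY Population DESC")
result = pd.read_sql_query(query, conn)
conn.close()

print(result)
```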

# Selection

## Getting

In [23]:
s['b']

-5

In [24]:
df[1:]

   Country    Capital  Population
1    India  New Delhi  1303171035
2   Brazil   Brasília   207847528


## Selecting, Boolean Indexing & Setting

### By Position

In [25]:
#Select by row & column position (list indexers return a 1x1 DataFrame)
df.iloc[[0],[0]]

   Country
0  Belgium


### By Label

In [30]:
#Select by row & column labels (list indexers return a 1x1 DataFrame)
df.loc[[0],['Country']]

   Country
0  Belgium


### By Label/Position

In [33]:
df.iloc[2] #Select a single row by position

Country          Brazil
Capital        Brasília
Population    207847528
Name: 2, dtype: object

In [40]:
df['Capital'] #Select a single column (returns a Series)

0     Brussels
1    New Delhi
2     Brasília
Name: Capital, dtype: object

In [44]:
df[['Capital']] #Select a single column as a one-column DataFrame

     Capital
0   Brussels
1  New Delhi
2   Brasília


In [41]:
df[['Capital','Population']] #Select a subset of columns

     Capital  Population
0   Brussels    11190846
1  New Delhi  1303171035
2   Brasília   207847528


In [47]:
df[['Capital']].iloc[1] #Select a row from a subset of columns

Capital    New Delhi
Name: 1, dtype: object
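
The list-style indexers used above return DataFrames. When a single scalar is wanted, plain integer or label indexers can be used instead, as in the sketch below. It also contrasts slicing by position with slicing by label, since `iloc` excludes the end point while `loc` includes it.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Capital': ['Brussels', 'New Delhi', 'Brasília'],
                   'Population': [11190846, 1303171035, 207847528]})

by_position = df.iloc[0, 0]          # scalar: row 0, column 0
by_label = df.loc[0, 'Country']      # scalar: row label 0, column 'Country'
fast_pos = df.iat[1, 1]              # scalar-only accessor, by position
fast_label = df.at[2, 'Population']  # scalar-only accessor, by label

pos_slice = df.iloc[1:2]    # end excluded: one row
label_slice = df.loc[1:2]   # end included: two rows

print(by_position, by_label, fast_pos, fast_label)
print(len(pos_slice), len(label_slice))
```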

### Boolean Indexing

In [48]:
s

a    3
b   -5
c    7
d    4
dtype: int64

In [49]:
s[~(s > 1)] #Series s where value is not >1

b   -5
dtype: int64

In [50]:
s[(s < -1) | (s > 2)] #s where value is <-1 or >2

a    3
b   -5
c    7
d    4
dtype: int64

In [51]:
#Filter DataFrame rows with a boolean mask
df[df['Population']>1200000000]

   Country    Capital  Population
1    India  New Delhi  1303171035
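
Conditions on a DataFrame combine the same way as on a Series, with `&`, `|` and `~` and parentheses around each comparison. A minimal sketch, which passes the mask to `loc` so that columns can be selected in the same step:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Capital': ['Brussels', 'New Delhi', 'Brasília'],
                   'Population': [11190846, 1303171035, 207847528]})

# Each comparison needs its own parentheses because & binds tighter than > and !=.
mask = (df['Population'] > 100_000_000) & (df['Country'] != 'India')
large_non_india = df.loc[mask, ['Country', 'Population']]

# isin() tests membership in a list of values.
selected = df[df['Country'].isin(['Belgium', 'Brazil'])]

print(large_non_india)
print(selected)
```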


## Setting

In [52]:
s['a'] = 6 #Set index a of Series s to 6
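
Assignment also works with boolean masks and with new DataFrame columns. The sketch below is illustrative only, and the column name `Population_millions` is made up for the example.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Population': [11190846, 1303171035, 207847528]})

s['a'] = 6     # set one value by label
s[s < 0] = 0   # set every value selected by a boolean mask

# Assigning to a new column name adds the column (hypothetical name).
df['Population_millions'] = df['Population'] / 1_000_000

print(s)
print(df)
```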

# Retrieving Series/DataFrame Information

## Basic Information

In [53]:
df.shape

(3, 3)

In [54]:
df.index

RangeIndex(start=0, stop=3, step=1)

In [55]:
df.columns

Index(['Country', 'Capital', 'Population'], dtype='object')

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Country     3 non-null      object
 1   Capital     3 non-null      object
 2   Population  3 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes


In [57]:
df.count()

Country       3
Capital       3
Population    3
dtype: int64

## Summary

In [64]:
df[['Population']].sum()

Population    1522209409
dtype: int64

In [65]:
df[['Population']].cumsum()

   Population
0    11190846
1  1314361881
2  1522209409


In [66]:
df[['Population']].min()/df[['Population']].max() #Ratio of minimum to maximum value

Population    0.008587
dtype: float64

In [67]:
df[['Population']].idxmin()/df[['Population']].idxmax() #Ratio of the index labels of the minimum (0) and maximum (1)

Population    0.0
dtype: float64
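
Dividing the two results is rarely what is wanted. Usually `min()`/`max()` and `idxmin()`/`idxmax()` are called separately, and the index labels are then used to look up the matching rows. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Population': [11190846, 1303171035, 207847528]})

pop = df['Population']
smallest_label = pop.idxmin()   # index label of the minimum value
largest_label = pop.idxmax()    # index label of the maximum value

# Use the labels to look up the corresponding countries.
smallest_country = df.loc[smallest_label, 'Country']
largest_country = df.loc[largest_label, 'Country']

print(smallest_country, pop.min())
print(largest_country, pop.max())
```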

In [68]:
df.describe()

         Population
count           3.0
mean    507403100.0
std     696134600.0
min      11190850.0
25%     109519200.0
50%     207847500.0
75%     755509300.0
max    1303171000.0


In [69]:
df[['Population']].mean()

Population    5.074031e+08
dtype: float64

In [70]:
df[['Population']].median()

Population    207847528.0
dtype: float64

# Applying Lambda Functions

In [75]:
f = lambda x: x*2
df.apply(f) #Apply function to each column

          Country             Capital  Population
0  BelgiumBelgium    BrusselsBrussels    22381692
1      IndiaIndia  New DelhiNew Delhi  2606342070
2    BrazilBrazil    BrasíliaBrasília   415695056


In [76]:
df.applymap(f) #Apply function element-wise (renamed to DataFrame.map in pandas 2.1)

          Country             Capital  Population
0  BelgiumBelgium    BrusselsBrussels    22381692
1      IndiaIndia  New DelhiNew Delhi  2606342070
2    BrazilBrazil    BrasíliaBrasília   415695056
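
With `x*2` the two calls happen to give the same result, which hides the difference between them: `apply` passes each whole column to the function, while the element-wise method passes one cell at a time. The sketch below uses `len` to make this visible. The `getattr` fallback is an assumption to cope with either pandas version, since `DataFrame.applymap` was deprecated in favor of `DataFrame.map` in pandas 2.1.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Capital': ['Brussels', 'New Delhi', 'Brasília'],
                   'Population': [11190846, 1303171035, 207847528]})

# apply: len() receives each column (a Series), so the result is 3 per column.
col_lengths = df.apply(len)

# Element-wise: use DataFrame.map where it exists, else the older applymap.
elementwise = getattr(df, 'map', None) or df.applymap
doubled = elementwise(lambda x: x * 2)

print(col_lengths)
print(doubled)
```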


# Data Alignment

## Internal Data Alignment

NA values are introduced in the indices that don’t overlap:

In [77]:
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
s + s3

a    13.0
b     NaN
c     5.0
d     7.0
dtype: float64

## Arithmetic Operations with Fill Methods

You can also do the internal data alignment yourself with the help of the fill methods:

In [81]:
S = s + s3

In [82]:
S.fillna(5)

a    13.0
b     5.0
c     5.0
d     7.0
dtype: float64
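
Filling after the addition replaces the missing result itself. The arithmetic methods (`add`, `sub`, `mul`, `div`) take a `fill_value` that is instead substituted for the missing operand before the operation. The sketch below compares the two approaches. Here `s` is written out with `a` already set to 6, as it is at this point in the notebook.

```python
import pandas as pd

s = pd.Series([6, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])

# fill_value=0 stands in for the missing 'b' in s3, so 'b' keeps s's value.
filled_before = s.add(s3, fill_value=0)

# fillna(0) replaces the NaN that the plain addition produced for 'b'.
filled_after = (s + s3).fillna(0)

print(filled_before)
print(filled_after)
```

The two results differ only at `b`: -5.0 with `fill_value=0` and 0.0 with `fillna(0)`.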