## Pandas

The Pandas library is built on NumPy and provides easy-to-use data structures and data analysis tools for the Python programming language.

Use the following import convention:

In [1]:
import pandas as pd

### Asking For Help

In [8]:
help(pd.Series.loc)

Help on property:

    Access a group of rows and columns by label(s) or a boolean array.
    
    ``.loc[]`` is primarily label based, but may also be used with a
    boolean array.
    
    Allowed inputs are:
    
    - A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
      interpreted as a *label* of the index, and **never** as an
      integer position along the index).
    - A list or array of labels, e.g. ``['a', 'b', 'c']``.
    - A slice object with labels, e.g. ``'a':'f'``.
    
          start and the stop are included
    
    - A boolean array of the same length as the axis being sliced,
      e.g. ``[True, False, True]``.
    - An alignable boolean Series. The index of the key will be aligned before
      masking.
    - An alignable Index. The Index of the returned selection will be the input.
    - A ``callable`` function with one argument (the calling Series or
      DataFrame) and that returns valid output for indexing (one of the above)
    
    See more at 

### Pandas Data Structures

* Series

A one-dimensional labeled array capable of holding any data type.

In [2]:
s = pd.Series([3,-5,7,1.5], index=["a","b","c","d"])

s

a    3.0
b   -5.0
c    7.0
d    1.5
dtype: float64
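
The index labels are what distinguish a Series from a plain array. The sketch below, which reuses the Series defined above, shows a lookup by label and contrasts a label slice (stop included, as the `.loc` help text states) with a position slice (stop excluded). The variable names `single`, `by_label`, and `by_position` are only illustrative.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 1.5], index=["a", "b", "c", "d"])

single = s["b"]             # look up one value by its label
by_label = s.loc["a":"c"]   # label slice: both "a" and "c" are included
by_position = s.iloc[0:2]   # position slice: the stop position is excluded

print(single)                # -5.0
print(by_label.tolist())     # [3.0, -5.0, 7.0]
print(by_position.tolist())  # [3.0, -5.0]
```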

* DataFrame

A two-dimensional labeled data structure with columns of potentially different types.

In [4]:
data = {"Country":["Turkey","Poland","France"],
        "Capital":["Ankara","Warsaw","Paris"],
        "Country Phone Code":[90,48,33]}

dataframe = pd.DataFrame(data)

dataframe

Unnamed: 0,Country,Capital,Country Phone Code
0,Turkey,Ankara,90
1,Poland,Warsaw,48
2,France,Paris,33


1. Columns = Country, Capital, Country Phone Code

2. Index = 0,1,2

### Dropping

In [7]:
data = {"Country":["Turkey","Poland","France"],
        "Capital":["Ankara","Warsaw","Paris"],
        "Country Phone Code":[90,48,33]}

df = pd.DataFrame(data)   # df = dataframe

df = df.drop("Country", axis=1)  # Drop the "Country" column (axis=1 refers to columns)

df

Unnamed: 0,Capital,Country Phone Code
0,Ankara,90
1,Warsaw,48
2,Paris,33
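
As a complement to `axis=1`, `drop` also accepts the `columns=` and `index=` keywords, which some readers find easier to read. The following is a minimal sketch on the same data; `no_country` and `no_first_row` are illustrative names. Note that `drop` returns a new DataFrame and leaves the original unchanged unless the result is reassigned.

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Turkey", "Poland", "France"],
                   "Capital": ["Ankara", "Warsaw", "Paris"],
                   "Country Phone Code": [90, 48, 33]})

no_country = df.drop(columns="Country")  # same effect as df.drop("Country", axis=1)
no_first_row = df.drop(index=0)          # drop the row whose index label is 0

print(list(no_country.columns))      # ['Capital', 'Country Phone Code']
print(no_first_row.index.tolist())   # [1, 2]
print(df.shape)                      # (3, 3): the original is untouched
```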


### Sort & Rank

In [13]:
d = {'col1': [5, 2,10,4,7], 'col2': [1, 3,8,11,25]}

df = pd.DataFrame(data=d)

df.sort_index()  # Sort by labels along an axis

Unnamed: 0,col1,col2
0,5,1
1,2,3
2,10,8
3,4,11
4,7,25


In [14]:
df.sort_values(by = "col1")  # Sort by the values along an axis

Unnamed: 0,col1,col2
1,2,3
3,4,11
0,5,1
4,7,25
2,10,8


In [15]:
df.rank()  # Assign ranks to entries

Unnamed: 0,col1,col2
0,3.0,1.0
1,1.0,2.0
2,5.0,3.0
3,2.0,4.0
4,4.0,5.0
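
Two options that often come up with these calls are descending order and tie handling. The sketch below assumes the same `df` as above, plus a small hypothetical Series `scores` that contains a tie, since the original columns have none.

```python
import pandas as pd

df = pd.DataFrame({"col1": [5, 2, 10, 4, 7], "col2": [1, 3, 8, 11, 25]})

# Sort from largest to smallest; the original index labels travel with the rows
desc = df.sort_values(by="col1", ascending=False)
print(desc["col1"].tolist())   # [10, 7, 5, 4, 2]

# Hypothetical data with a tie, to show how rank() resolves it
scores = pd.Series([10, 20, 20, 30])
avg_rank = scores.rank()              # default: tied values share the average rank
min_rank = scores.rank(method="min")  # tied values share the lowest rank

print(avg_rank.tolist())   # [1.0, 2.5, 2.5, 4.0]
print(min_rank.tolist())   # [1.0, 2.0, 2.0, 4.0]
```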


**Renaming Columns**

In [78]:
d = {'col_old': [5, 2,10,4,7], 'col2': [1, 3,8,11,25]}

df = pd.DataFrame(data=d)

df.rename(columns={"col_old":"col_new"}, inplace=True)

df

Unnamed: 0,col_new,col2
0,5,1
1,2,3
2,10,8
3,4,11
4,7,25


**Read and Write to CSV**

* The reader function must match the file type: `read_csv` for CSV files, `read_excel` for Excel files.

In [17]:
# pd.read_csv("file.csv")

**Read and Write to Excel**

In [18]:
# pd.read_excel("file.xlsx")
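
The cells above only show reading. A minimal round-trip sketch for CSV follows; it writes to an in-memory buffer (`io.StringIO`) instead of a file so that it runs without touching the disk. The buffer and the name `restored` are assumptions made for this illustration; with a real file you would pass a path such as `"file.csv"` to both calls.

```python
import io

import pandas as pd

df = pd.DataFrame({"col1": [5, 2, 10], "col2": [1, 3, 8]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False avoids writing the row labels as a column
buffer.seek(0)                  # rewind so the reader starts at the beginning

restored = pd.read_csv(buffer)
print(restored.values.tolist())   # [[5, 1], [2, 3], [10, 8]]
```

Writing without `index=False` stores the index as an extra column, which then appears as `Unnamed: 0` when the file is read back.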

### Selection

* Getting

In [20]:
import numpy as np

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

df

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


In [21]:
df["a"]  # Get column

0    1
1    4
2    7
Name: a, dtype: int32

In [22]:
df[1:]  # Get rows from position 1 onward

Unnamed: 0,a,b,c
1,4,5,6
2,7,8,9


* Selecting, Boolean Indexing & Setting

By Position

In [28]:
data = {"Country":["Turkey","Poland","France"],
        "Capital":["Ankara","Warsaw","Paris"],
        "Country Phone Code":[90,48,33]}

df = pd.DataFrame(data)

df

Unnamed: 0,Country,Capital,Country Phone Code
0,Turkey,Ankara,90
1,Poland,Warsaw,48
2,France,Paris,33


In [29]:
df.iloc[0,0]  # Select single value by row & column

'Turkey'

In [24]:
df.iloc[[0],[0]]

Unnamed: 0,Country
0,Turkey


By Label

In [26]:
df.loc[1,"Country"]  # Select single value by row & column labels

'Poland'

In [27]:
df.loc[[1],["Country"]]

Unnamed: 0,Country
1,Poland
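
With the default index (0, 1, 2), labels and positions coincide, which can hide the difference between `loc` and `iloc`. The sketch below uses a hypothetical custom index (`10, 20, 30`) to make the distinction visible; the index values and variable names are assumptions chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Turkey", "Poland", "France"],
                   "Capital": ["Ankara", "Warsaw", "Paris"]},
                  index=[10, 20, 30])

by_label = df.loc[20, "Country"]   # row labeled 20, column labeled "Country"
by_position = df.iloc[1, 0]        # second row, first column

# 1 is a valid position but not a label in this index, so loc raises KeyError
try:
    df.loc[1, "Country"]
    label_found = True
except KeyError:
    label_found = False

print(by_label, by_position, label_found)   # Poland Poland False
```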


Boolean Indexing

In [31]:
data = {"Country":["Belgium","India","Brazil"],
        "Capital":["Brussels","New Delhi","Brasilia"],
        "Population":[11190846, 1303171035, 207847528]}

df = pd.DataFrame(data)

df

Unnamed: 0,Country,Capital,Population
0,Belgium,Brussels,11190846
1,India,New Delhi,1303171035
2,Brazil,Brasilia,207847528


In [33]:
df[df["Country"] == "India"]  # Keep only the rows where Country equals "India"

Unnamed: 0,Country,Capital,Population
1,India,New Delhi,1303171035


In [38]:
df[df["Population"] >= 12000000]

Unnamed: 0,Country,Capital,Population
1,India,New Delhi,1303171035
2,Brazil,Brasilia,207847528


In [45]:
df[(df["Population"] >= 10000) & (df["Population"] < 120000000)]

Unnamed: 0,Country,Capital,Population
0,Belgium,Brussels,11190846
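
Besides `&` (and), conditions can be combined with `|` (or) and negated with `~`. Each condition needs its own parentheses because these operators bind more tightly than comparisons. A short sketch on the same data, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Belgium", "India", "Brazil"],
                   "Capital": ["Brussels", "New Delhi", "Brasilia"],
                   "Population": [11190846, 1303171035, 207847528]})

# OR: rows that are either small or very large
small_or_huge = df[(df["Population"] < 100_000_000) | (df["Population"] > 1_000_000_000)]

# NOT: every row except India
not_india = df[~(df["Country"] == "India")]

print(small_or_huge["Country"].tolist())   # ['Belgium', 'India']
print(not_india["Country"].tolist())       # ['Belgium', 'Brazil']
```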


* Setting

In [47]:
df["Population"] = 100000  # Assign the same value to every row of the column

df

Unnamed: 0,Country,Capital,Population
0,Belgium,Brussels,100000
1,India,New Delhi,100000
2,Brazil,Brasilia,100000
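
Assigning a scalar to a column overwrites every row. To change only some rows, one common approach is to combine a boolean condition with `loc`. This is a sketch under the assumption that you start again from the original population figures; the replacement value `0` is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Belgium", "India", "Brazil"],
                   "Capital": ["Brussels", "New Delhi", "Brasilia"],
                   "Population": [11190846, 1303171035, 207847528]})

# Set Population only where the condition holds; other rows keep their values
df.loc[df["Country"] == "India", "Population"] = 0

print(df["Population"].tolist())   # [11190846, 0, 207847528]
```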


### Retrieving Series/DataFrame Information

* Basic Selection

In [86]:
data = {"a":[1, 2, 3, 5, 6, 7, 8, 9, 10,"a"], 
                   "b":[11, 12, 13, 14, 15, 16, 17, 18, 19,"b"]}

df = pd.DataFrame(data)

df.head()  # Select the first n rows (n defaults to 5)

Unnamed: 0,a,b
0,1,11
1,2,12
2,3,13
3,5,14
4,6,15


In [87]:
df.tail()  # Select the last n rows (n defaults to 5)

Unnamed: 0,a,b
5,7,16
6,8,17
7,9,18
8,10,19
9,a,b


In [88]:
data = {"Country":["Belgium","India","Brazil"],
        "Capital":["Brussels","New Delhi","Brasilia"],
        "Population":[11190846, 1303171035, 207847528]}

df = pd.DataFrame(data)

df

Unnamed: 0,Country,Capital,Population
0,Belgium,Brussels,11190846
1,India,New Delhi,1303171035
2,Brazil,Brasilia,207847528


In [90]:
df[["Country","Population"]]  # Select multiple columns with specific names

Unnamed: 0,Country,Population
0,Belgium,11190846
1,India,1303171035
2,Brazil,207847528


In [91]:
df.Country

0    Belgium
1      India
2     Brazil
Name: Country, dtype: object

* Basic Information

In [48]:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

df

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


In [50]:
df.shape  # (rows,columns)

(3, 3)

In [53]:
df.index  # Describe index

RangeIndex(start=0, stop=3, step=1)

In [54]:
df.columns   # Describe DataFrame columns

Index(['a', 'b', 'c'], dtype='object')

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       3 non-null      int32
 1   b       3 non-null      int32
 2   c       3 non-null      int32
dtypes: int32(3)
memory usage: 168.0 bytes


In [57]:
df.count()  # Number of non-NA values per column

a    3
b    3
c    3
dtype: int64

In [59]:
df = pd.DataFrame(np.array([[1, np.nan, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

df

Unnamed: 0,a,b,c
0,1.0,,3.0
1,4.0,5.0,6.0
2,7.0,8.0,9.0


In [60]:
df.isnull()

Unnamed: 0,a,b,c
0,False,True,False
1,False,False,False
2,False,False,False


In [61]:
df.isnull().sum()  # Sum of NA Values

a    0
b    1
c    0
dtype: int64
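
Once missing values have been counted, the usual next step is to fill or drop them. The sketch below shows both options on the same data; the fill value `0` is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, np.nan, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=["a", "b", "c"])

filled = df.fillna(0)   # replace every NaN with 0
dropped = df.dropna()   # remove every row that contains at least one NaN

print(filled["b"].tolist())     # [0.0, 5.0, 8.0]
print(dropped.index.tolist())   # [1, 2]
```

Both methods return new DataFrames, so `df` still contains its NaN afterwards.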

* Summary

In [100]:
data = {'col1': [1, 2, 5, 7, 9], 'col2': [3, 4, 10, 8, 6]}

df = pd.DataFrame(data)

df

Unnamed: 0,col1,col2
0,1,3
1,2,4
2,5,10
3,7,8
4,9,6


In [101]:
df.sum()  # Sum of values

col1    24
col2    31
dtype: int64

In [102]:
df.cumsum()  # Cumulative sum of values

Unnamed: 0,col1,col2
0,1,3
1,3,7
2,8,17
3,15,25
4,24,31


In [103]:
display(df.min())  # Minimum value of each column

display(df.max())  # Maximum value of each column

col1    1
col2    3
dtype: int64

col1     9
col2    10
dtype: int64

In [70]:
display(df.idxmin())  # Index label of each column's minimum

display(df.idxmax())  # Index label of each column's maximum

col1    0
col2    0
dtype: int64

col1    4
col2    2
dtype: int64

In [104]:
df.describe()  # Summary statistics

Unnamed: 0,col1,col2
count,5.0,5.0
mean,4.8,6.2
std,3.34664,2.863564
min,1.0,3.0
25%,2.0,4.0
50%,5.0,6.0
75%,7.0,8.0
max,9.0,10.0


In [105]:
df.shift(1)  # Move values down one row; the first row becomes NaN

Unnamed: 0,col1,col2
0,,
1,1.0,3.0
2,2.0,4.0
3,5.0,10.0
4,7.0,8.0


In [106]:
df.shift(-1)  # Move values up one row; the last row becomes NaN

Unnamed: 0,col1,col2
0,2.0,4.0
1,5.0,10.0
2,7.0,8.0
3,9.0,6.0
4,,
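
A typical use of `shift` is comparing each row with the previous one. The sketch below computes row-to-row changes this way and checks that the result matches the built-in `diff` method; `change` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 5, 7, 9], "col2": [3, 4, 10, 8, 6]})

# Subtract the previous row from each row; the first row has no predecessor, so it is NaN
change = df - df.shift(1)

print(change["col1"].tolist()[1:])   # [1.0, 3.0, 2.0, 2.0]
print(change["col2"].tolist()[1:])   # [1.0, 6.0, -2.0, -2.0]
print(change.equals(df.diff()))      # True
```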


### Applying Functions

In [72]:
f = lambda x: x*2

In [73]:
df.apply(f)

Unnamed: 0,col1,col2
0,2,6
1,4,8
2,10,20
3,14,16
4,18,12
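
By default `apply` passes each column to the function as a Series. With `axis=1` it passes each row instead. A minimal sketch of both modes on the same data, using illustrative lambdas:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 5, 7, 9], "col2": [3, 4, 10, 8, 6]})

# Column-wise (default): the function receives one column at a time
col_range = df.apply(lambda col: col.max() - col.min())

# Row-wise: the function receives one row at a time
row_sum = df.apply(lambda row: row["col1"] + row["col2"], axis=1)

print(col_range.tolist())   # [8, 7]
print(row_sum.tolist())     # [4, 6, 15, 15, 15]
```

For simple arithmetic like this, the vectorized form `df["col1"] + df["col2"]` gives the same row sums and is usually faster than a row-wise `apply`.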


### Data Alignment

* Internal Data Alignment

When two Series are combined, values are matched by index label. Labels that appear in only one of the Series produce NaN in the result:

In [75]:
s = pd.Series([3,-5,7,1.5], index=["a","b","c","d"])

s2 = pd.Series([7,-2,3], index =["a","c","d"])


s + s2

a    10.0
b     NaN
c     5.0
d     4.5
dtype: float64

* Arithmetic Operations with Fill Methods

To avoid NaN for labels that do not overlap, use the arithmetic methods with `fill_value`, which substitutes a default for the missing side before the operation is carried out:

In [76]:
s.add(s2, fill_value=0)

a    10.0
b    -5.0
c     5.0
d     4.5
dtype: float64
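
The same `fill_value` argument is available on the other arithmetic methods, such as `sub` and `mul`. In the sketch below, `0` is used as the neutral fill for subtraction and `1` for multiplication; those choices are assumptions that suit this example rather than general defaults.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 1.5], index=["a", "b", "c", "d"])
s2 = pd.Series([7, -2, 3], index=["a", "c", "d"])

difference = s.sub(s2, fill_value=0)   # "b" is missing from s2, so it is treated as 0
product = s.mul(s2, fill_value=1)      # "b" is missing from s2, so it is treated as 1

print(difference.tolist())   # [-4.0, -5.0, 9.0, -1.5]
print(product.tolist())      # [21.0, -5.0, -14.0, 4.5]
```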

### Combining Data Sets

In [94]:
data_1 = {"x1":["A","B","C"], "x2":[1,2,3]}
data_2 = {"x1":["A","B","D"], "x3":[4,5,6]}

df_1 = pd.DataFrame(data_1)

df_2 = pd.DataFrame(data_2)

df_merge = pd.merge(df_1,df_2, how = "left", on ="x1")

df_merge

Unnamed: 0,x1,x2,x3
0,A,1,4.0
1,B,2,5.0
2,C,3,


In [95]:
df_merge = pd.merge(df_1,df_2, how = "right", on ="x1")

df_merge

Unnamed: 0,x1,x2,x3
0,A,1.0,4
1,B,2.0,5
2,D,,6


In [97]:
df_merge = pd.merge(df_1,df_2, how = "inner", on ="x1")

df_merge

Unnamed: 0,x1,x2,x3
0,A,1,4
1,B,2,5


In [98]:
df_merge = pd.merge(df_1,df_2, how = "outer", on ="x1")

df_merge

Unnamed: 0,x1,x2,x3
0,A,1.0,4.0
1,B,2.0,5.0
2,C,3.0,
3,D,,6.0
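
To see which side each row of a merge came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below applies it to the outer merge above; the dictionary `origin` is only a convenient way to print the result.

```python
import pandas as pd

df_1 = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
df_2 = pd.DataFrame({"x1": ["A", "B", "D"], "x3": [4, 5, 6]})

merged = pd.merge(df_1, df_2, how="outer", on="x1", indicator=True)

# Map each key to the side(s) it was found on
origin = dict(zip(merged["x1"], merged["_merge"].astype(str)))
print(origin)
# {'A': 'both', 'B': 'both', 'C': 'left_only', 'D': 'right_only'}
```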


In [99]:
df_isin = df_1[df_1["x1"].isin(df_2["x1"])]  # Keep rows of df_1 whose x1 also appears in df_2

df_isin

Unnamed: 0,x1,x2
0,A,1
1,B,2
