**Pandas Tutorial**

Pandas provides numerous tools to work with tabular data like you'd find in spreadsheets or databases. It is widely used for data preparation, cleaning, and analysis. It can work with a wide variety of data and provides many visualization options. It is built on top of NumPy.

Imports

In [1]:
import numpy as np
import pandas as pd
from numpy import random

# **Series**

In [2]:
# The main Pandas data structure is the DataFrame. It is a
# 2D data structure that can hold multiple data types.
# Columns have labels, and each column is a Series.

# Series are built on top of NumPy arrays. 
# Create a series by first creating a list
list_1 = ['a', 'b', 'c', 'd']
# Pass index= to use the provided labels as the
# series index
labels = [1, 2, 3, 4]
ser_1 = pd.Series(data=list_1, index=labels)
ser_1

1    a
2    b
3    c
4    d
dtype: object

In [3]:
# You can also create a series from a NumPy array
arr_1 = np.array([1, 2, 3, 4])
arr_1

array([1, 2, 3, 4])

In [4]:
# Convert to a series; a default integer index is created automatically
ser_2 = pd.Series(arr_1)
ser_2

0    1
1    2
2    3
3    4
dtype: int64

*Dictionary*

In [5]:
# You can quickly add labels and values with a dictionary
dict_1 = {"f_name": "Derek", "l_name": "Banas", "age": 44}
dict_1

{'f_name': 'Derek', 'l_name': 'Banas', 'age': 44}

In [6]:
ser_3 = pd.Series(dict_1)
ser_3

f_name    Derek
l_name    Banas
age          44
dtype: object

In [7]:
# Get data by label
ser_3['f_name']

'Derek'

In [8]:
# You can get the datatype
ser_1.dtype, ser_2.dtype, ser_3.dtype

(dtype('O'), dtype('int64'), dtype('O'))

## *Simple maths in Series*

In [9]:
# You can perform math operations on series
ser_2 + ser_2

0    2
1    4
2    6
3    8
dtype: int64

In [10]:
ser_2 - ser_2

0    0
1    0
2    0
3    0
dtype: int64

In [11]:
ser_2 * ser_2

0     1
1     4
2     9
3    16
dtype: int64

In [12]:
ser_2 / ser_2

0    1.0
1    1.0
2    1.0
3    1.0
dtype: float64

In [13]:
np.exp(ser_2)

0     2.718282
1     7.389056
2    20.085537
3    54.598150
dtype: float64

In [14]:
# The difference between Series and ndarray is that operations
# align by labels

In [15]:
# Create a series from a dictionary
ser_4 = pd.Series({4: 5, 5: 6, 6: 7, 7: 8})
# If labels don't align you will get NaN
ser_2 + ser_4

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
7   NaN
dtype: float64
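When labels do not line up, the method form of an operator can substitute a value for the missing side. The sketch below re-creates `ser_2` and `ser_4` and uses `Series.add` with `fill_value=0`; the choice of 0 as the fill value is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

ser_2 = pd.Series(np.array([1, 2, 3, 4]))      # index 0..3
ser_4 = pd.Series({4: 5, 5: 6, 6: 7, 7: 8})    # index 4..7

# The + operator aligns on labels; no label is shared, so every result is NaN
plain_sum = ser_2 + ser_4

# Series.add with fill_value uses 0 for whichever side lacks the label
filled_sum = ser_2.add(ser_4, fill_value=0)
print(filled_sum)
```

The result covers the union of both indexes, and the values become floats because alignment goes through NaN.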

In [16]:
# You can assign names to series
ser_5 = pd.Series({8: 9, 9: 10}, name='rand_nums')
ser_5

8     9
9    10
Name: rand_nums, dtype: int64

In [17]:
ser_5.name

'rand_nums'

# DataFrames

DataFrames are the most commonly used data structure with Pandas. They are made up of multiple series that share the same index / label. They can contain multiple data types. They can be created from dicts, series, lists or other dataframes.

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
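As a rough illustration of the sources listed above, the sketch below builds small DataFrames from a list of dicts, from a named Series, and from another DataFrame. The variable names and values are made up for the example.

```python
import pandas as pd

# From a list of dicts: each dict is a row, keys become column labels
records = [{"name": "x", "score": 1.5}, {"name": "y", "score": 2.5}]
df_records = pd.DataFrame(records)

# From a named Series: the name becomes the column label
df_from_series = pd.Series([10, 20], name="value").to_frame()

# From another DataFrame: copy() gives an independent object
df_copy = df_records.copy()

# Each column is a Series that shares the DataFrame's index
col = df_records["score"]
print(type(col).__name__, col.index.equals(df_records.index))
```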

## Creating DataFrames

In [18]:
# Create random 2x3 matrix with values from 10 to 49 (the upper bound 50 is exclusive)
arr_2 = np.random.randint(10, 50, size=(2, 3))
arr_2

array([[30, 23, 17],
       [39, 28, 32]])

In [19]:
# Create DF with data, row labels & column labels
df_1 = pd.DataFrame(arr_2, index = ['A', 'B'], columns=['C', 'D', 'E'] )
df_1

    C   D   E
A  30  23  17
B  39  28  32


In [20]:
df_1A = pd.DataFrame(arr_2, index = ['A', 'B'], 
                    columns=['C', 'D', 'E'],dtype = 'float')
df_1A

      C     D     E
A  30.0  23.0  17.0
B  39.0  28.0  32.0


In [21]:
# Create a DF from multiple series in a dict
# If the series have different lengths, the missing entries are filled with NaN
dict_3 = {'one': pd.Series([1., 2., 3.], index = ['a', 'b', 'c']),
         'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
dict_3

{'one': a    1.0
 b    2.0
 c    3.0
 dtype: float64,
 'two': a    1.0
 b    2.0
 c    3.0
 d    4.0
 dtype: float64}

In [22]:
df_2 = pd.DataFrame(dict_3)
df_2

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [23]:
# from_dict accepts a dict that maps column labels to lists of values
dict_exm = dict([('A', [1,2,3]), ('B', [4,5,6])])
dict_exm

{'A': [1, 2, 3], 'B': [4, 5, 6]}

In [24]:
pd.DataFrame.from_dict(dict_exm)

   A  B
0  1  4
1  2  5
2  3  6


In [25]:
# With orient='index' the keys become row labels, and the
# column labels are passed separately
pd.DataFrame.from_dict(dict([('A',[1,2,3]), ('B', [4,5,5])]), orient = 'index',
                      columns = ['one', 'two', 'three'])

   one  two  three
A    1    2      3
B    4    5      5


In [26]:
# Get number of rows and columns as tuple
df_1.shape, df_2.shape

((2, 3), (4, 2))

In [62]:
# Multi Index DataFrames
# First, a DataFrame with a regular single-level index for comparison
df_mult_index = pd.DataFrame({"a" : [4, 5, 6],
                                "b" : [7, 8, 9],
                                "c" : [10, 11, 12]},
                               index = [1, 2, 3])
df_mult_index

   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12


In [70]:
# Same data with a MultiIndex: each row label is a tuple (n, v)
df_mult_index_2 = pd.DataFrame({"a" : [4, 5, 6], "b" : [7, 8, 9],"c" : [10, 11, 12]},
                               index = pd.MultiIndex.from_tuples([('d', 1), 
                                                                  ('d', 2),
                                                                  ('e', 4)], 
                                                                 names=['n', 'v']))
df_mult_index_2

     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 4  6  9  12
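A short sketch of how rows might be selected from a MultiIndex DataFrame like the one above. It rebuilds the same frame under the hypothetical name `df_mi` and shows selection by outer label, by full tuple, and by a named inner level with `xs`.

```python
import pandas as pd

df_mi = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [("d", 1), ("d", 2), ("e", 4)], names=["n", "v"]
    ),
)

# Outer label only: every row under 'd', now indexed by the inner level 'v'
rows_d = df_mi.loc["d"]

# Full (outer, inner) tuple plus ':' for all columns: one row as a Series
row_d2 = df_mi.loc[("d", 2), :]

# xs selects on a named level, here the inner level 'v'
rows_v4 = df_mi.xs(4, level="v")
```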


# Editing and Retrieving Data

In [27]:
print(df_1)

    C   D   E
A  30  23  17
B  39  28  32


In [28]:
# Grab a column
print(df_1['C']) # Single brackets ['x'] return a Series
print('--------')
print(df_1[['C', 'E']]) # Double brackets [['x', 'y']] return a DataFrame

A    30
B    39
Name: C, dtype: int64
--------
    C   E
A  30  17
B  39  32


In [29]:
# Grab a row as a series by label
print(df_1.loc['A'])
print('--------')
# Grab row by index position
df_1.iloc[1]

C    30
D    23
E    17
Name: A, dtype: int64
--------


C    39
D    28
E    32
Name: B, dtype: int64
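One difference between `loc` and `iloc` that is easy to miss is how slices behave. The sketch below uses a small stand-in frame (the values are illustrative, since `arr_2` is random) to show that label slices include both endpoints while position slices exclude the stop.

```python
import pandas as pd

df = pd.DataFrame(
    {"C": [30, 39, 44], "D": [23, 28, 45]}, index=["A", "B", "F"]
)

# Label slice with loc: both endpoints are included
by_label = df.loc["A":"B"]

# Position slice with iloc: the stop position is excluded
by_position = df.iloc[0:1]

print(by_label)
print(by_position)
```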

In [30]:
# Grab cell with Row & Column
df_1.loc['A', 'C']

30

In [31]:
# Grab multiple cells by defining rows wanted & the
# columns from those rows
df_1.loc[['A', 'B'], ['C', 'E']]

    C   E
A  30  17
B  39  32


In [32]:
# Make new column
df_1['Total'] = df_1['C'] + df_1['D'] + df_1['E']
df_1

    C   D   E  Total
A  30  23  17     70
B  39  28  32     99


In [33]:
df_2

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [34]:
# You can build a column from other calculations too; NaN inputs give NaN results
df_2['mult'] = df_2['one'] * df_2['two']
df_2

   one  two  mult
a  1.0  1.0   1.0
b  2.0  2.0   4.0
c  3.0  3.0   9.0
d  NaN  4.0   NaN


In [35]:
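# Make a new row by appending. DataFrame.append was removed in
# pandas 2.0, so concatenate the row as a one-row DataFrame instead
dict_2 = {"C": 44, "D": 45, "E": 46}
new_row = pd.Series(dict_2, name = 'F')
# reindex adds the missing 'Total' entry as NaN before concatenating
df_1 = pd.concat([df_1, new_row.reindex(df_1.columns).to_frame().T])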
df_1

      C     D     E  Total
A  30.0  23.0  17.0   70.0
B  39.0  28.0  32.0   99.0
F  44.0  45.0  46.0    NaN


In [36]:
# Delete a column (axis = 1). By default drop returns a new DataFrame
# and leaves the original alone; inplace=True modifies df_1 directly
df_1.drop('Total', axis=1, inplace=True)
df_1

      C     D     E
A  30.0  23.0  17.0
B  39.0  28.0  32.0
F  44.0  45.0  46.0


In [37]:
# Delete a row (axis=0). axis=0 is the default, so it can be omitted
df_1.drop('B', axis = 0, inplace=True)
df_1

      C     D     E
A  30.0  23.0  17.0
F  44.0  45.0  46.0
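For comparison, here is a sketch of `drop` without `inplace=True`, using the `columns=` and `index=` keywords in place of `axis`. The frame is a stand-in with illustrative values, and `"missing"` is a made-up label used to show `errors="ignore"`.

```python
import pandas as pd

df = pd.DataFrame(
    {"C": [30, 39], "D": [23, 28], "E": [17, 32]}, index=["A", "B"]
)

# Without inplace=True, drop returns a new DataFrame; df itself is unchanged
no_d = df.drop(columns="D")
no_b = df.drop(index="B")

# errors="ignore" suppresses the KeyError raised for a label that is absent
same = df.drop(columns="missing", errors="ignore")
```

Binding the result to a new name keeps the original available, which is often easier to reason about than in-place changes.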


In [38]:
# Create a new column and make it the index
df_1['Sex'] = ['Men', 'Women']
df_1.set_index('Sex', inplace=True)
df_1

          C     D     E
Sex
Men    30.0  23.0  17.0
Women  44.0  45.0  46.0


In [39]:
# You can reset index values to numbers
df_1.reset_index(inplace=True)
df_1

     Sex     C     D     E
0    Men  30.0  23.0  17.0
1  Women  44.0  45.0  46.0
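The round trip between a column and the index can be sketched without `inplace`, as below. The frame is a reduced stand-in for `df_1`, and `drop=True` is shown as one option for discarding the old index instead of restoring it as a column.

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["Men", "Women"], "C": [30.0, 44.0]})

indexed = df.set_index("Sex")                # 'Sex' moves from column to index
restored = indexed.reset_index()             # 'Sex' comes back as a column
discarded = indexed.reset_index(drop=True)   # 'Sex' is thrown away
```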


In [40]:
df_2

   one  two  mult
a  1.0  1.0   1.0
b  2.0  2.0   4.0
c  3.0  3.0   9.0
d  NaN  4.0   NaN


In [41]:
# assign returns a new DF with the extra column and leaves the
# original untouched; the result is not stored here, so df_2 is unchanged
df_2.assign(div=df_2['one'] / df_2['two'])
df_2

   one  two  mult
a  1.0  1.0   1.0
b  2.0  2.0   4.0
c  3.0  3.0   9.0
d  NaN  4.0   NaN


In [42]:
# You can pass in a function as well
df_2.assign(div=lambda x: (x['one'] / x['two']))
df_2

   one  two  mult
a  1.0  1.0   1.0
b  2.0  2.0   4.0
c  3.0  3.0   9.0
d  NaN  4.0   NaN
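Since `assign` returns a new DataFrame, the new column only survives if the result is bound to a name. The sketch below rebuilds a frame shaped like `df_2` and keeps the result; `with_div`, `with_both` and `div_pct` are hypothetical names chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, np.nan], "two": [1.0, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# Bind the returned DataFrame to keep the new column
with_div = df.assign(div=df["one"] / df["two"])

# Later keyword arguments can use columns created earlier in the same call
with_both = df.assign(
    div=lambda x: x["one"] / x["two"],
    div_pct=lambda x: x["div"] * 100,
)
print(with_both)
```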


In [None]:
# Combine DataFrames while keeping df_3 data unless
# there is a NaN value

In [48]:
df_3 = pd.DataFrame({'A': [1., np.nan, 3., np.nan]})
df_3

     A
0  1.0
1  NaN
2  3.0
3  NaN


In [49]:
df_4 = pd.DataFrame({'A': [8., 9., 2., 4.]})
df_4

     A
0  8.0
1  9.0
2  2.0
3  4.0


In [52]:
df_3new = df_3.combine_first(df_4)
df_3new

     A
0  1.0
1  9.0
2  3.0
3  4.0


In [53]:
df_3 = pd.DataFrame({'A': [1., np.nan, 3., np.nan]})
df_4 = pd.DataFrame({'A': [8., 9., 2., 4.]})
df_3.combine_first(df_4)

     A
0  1.0
1  9.0
2  3.0
3  4.0
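`combine_first` matches values by row and column label, so the two frames do not need the same shape. A minimal sketch with made-up frames `left` and `right` whose indexes and columns only partly overlap:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"A": [1.0, np.nan]}, index=[0, 1])
right = pd.DataFrame(
    {"A": [8.0, 9.0, 2.0], "B": [5.0, 6.0, 7.0]}, index=[1, 2, 3]
)

# The result covers the union of both indexes and both sets of columns.
# Values from left win; right only fills positions where left is NaN or absent.
merged = left.combine_first(right)
print(merged)
```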


In [54]:
# Show the differing values side by side (self = df_3, other = df_4)
df_3.compare(df_4)

     A
  self other
0  1.0   8.0
1  NaN   9.0
2  3.0   2.0
3  NaN   4.0
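In the output above every row differs, so every row is listed. The sketch below uses slightly different, made-up values so that some rows match, and shows the `keep_shape` and `align_axis` options of `DataFrame.compare` (available from pandas 1.1 onward).

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({"A": [1.0, np.nan, 3.0, 4.0]})
df_b = pd.DataFrame({"A": [8.0, np.nan, 2.0, 4.0]})

# Only differing rows are returned; NaN in the same position counts as equal
diff = df_a.compare(df_b)

# keep_shape=True keeps every row and fills matching positions with NaN
diff_full = df_a.compare(df_b, keep_shape=True)

# align_axis=0 stacks self/other as rows instead of columns
diff_rows = df_a.compare(df_b, align_axis=0)
print(diff)
```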


In [58]:
# Combine column by column using a function; np.minimum keeps
# the smaller value in each position
df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})
print(df1)
print('--------')
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
print(df2)
print('--------')
df1.combine(df2, np.minimum)

   A  B
0  5  2
1  0  4
--------
   A  B
0  1  3
1  1  3
--------


   A  B
0  1  2
1  0  3
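The function passed to `combine` receives one column from each DataFrame as a pair of Series and returns the Series to keep, so it can also pick whole columns. In this sketch `take_larger_sum` is a hypothetical helper and the values are made up.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [5, 0], "B": [2, 4]})
df2 = pd.DataFrame({"A": [1, 1], "B": [3, 4]})

def take_larger_sum(s1, s2):
    """Keep whichever column has the larger total (ties go to s1)."""
    return s1 if s1.sum() >= s2.sum() else s2

# Column A: 5 vs 2, so df1's column is kept
# Column B: 6 vs 7, so df2's column is kept
result = df1.combine(df2, take_larger_sum)
print(result)
```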
