# Python Pandas Tutorial

Pandas is the de facto standard library for data munging in Python. This tutorial covers the basic operations that a data scientist needs in order to use Pandas.

See the official page for more information: https://pandas.pydata.org/

## Import Pandas and Numpy

In [1]:
import pandas as pd
import numpy as np

## Data Types

Before we delve into the details, let's first understand the two core Pandas data structures, namely Series and DataFrame.

### Series

Series are the foundational building block of Pandas. You can think of a Series as a labeled (indexed) one-dimensional array.
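To make the "labeled array" idea concrete, here is a minimal sketch that separates a Series into its labels and its values. The name `temps` and the numbers in it are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series pairs an array of values with an index of labels.
# `temps` and its contents are hypothetical example data.
temps = pd.Series([21.5, 19.0, 23.2], index=['mon', 'tue', 'wed'])

labels = list(temps.index)   # the labels
values = temps.to_numpy()    # the underlying NumPy array
tuesday = temps['tue']       # look up a single value by its label

print(labels)
print(values)
print(tuesday)
```

The index is what distinguishes a Series from a plain NumPy array: values can be looked up by label as well as by position.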

In [6]:
# Create Series
x = pd.Series(np.random.randn(5))
x

0   -0.895791
1    0.003069
2    1.011121
3   -0.871680
4    0.556485
dtype: float64
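The random values shown in this tutorial will differ on every run. If you want repeatable output, one option is to seed a NumPy random generator first. This is a sketch using `np.random.default_rng`; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# Two generators created with the same (arbitrary) seed produce the same numbers.
rng = np.random.default_rng(seed=42)
a = pd.Series(rng.standard_normal(5))

rng_again = np.random.default_rng(seed=42)
b = pd.Series(rng_again.standard_normal(5))

print(a.equals(b))      # both Series hold identical values
print(list(a.index))    # the default index starts at 0
```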

In [8]:
# Create Series with index (if no index is provided, the labels are numbered from 0 to n-1)
x = pd.Series(np.random.randn(5), index=['k', 'l', 'm', 'n', 'o'])
x

k    0.867775
l    0.096473
m    0.288008
n   -1.628125
o    0.971568
dtype: float64

In [12]:
# You can also create a Series from a dictionary (the keys become the index labels)
d = {
    'k': 1,
    'l': 2,
    'm': 3,
    'n': 7
}
y = pd.Series(d)
y

k    1
l    2
m    3
n    7
dtype: int64
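When a dictionary is combined with an explicit `index`, pandas picks the values by label rather than by order. The sketch below reuses the dictionary from above; the label `'q'` is a made-up key that does not exist in it.

```python
import pandas as pd

d = {'k': 1, 'l': 2, 'm': 3, 'n': 7}

# The index selects and orders the entries; 'q' is not a key in d,
# so its value is filled with NaN and the dtype becomes float64.
z = pd.Series(d, index=['n', 'k', 'q'])
print(z)
```

Note that keys left out of `index` (here `'l'` and `'m'`) are simply dropped.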

In [24]:
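# A Series acts like a NumPy ndarray, so you can slice it and do operations on it.
# .iloc selects by integer position; plain y[[2, 3]] is ambiguous when the labels are not integers.
print("From second element to end:\n{}\n".format(y.iloc[1:]))
print("Except for the last element:\n{}\n".format(y.iloc[:-1]))
print("Elements at indices:\n{}\n".format(y.iloc[[2, 3]]))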

From second element to end:
l    2
m    3
n    7
dtype: int64

Except for the last element:
k    1
l    2
m    3
dtype: int64

Elements at indices:
m    3
n    7
dtype: int64
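Selection by position and selection by label can be made explicit with `.iloc` and `.loc`. The following sketch rebuilds the same `y` Series and shows one difference that often surprises newcomers: a label slice includes its end point, while a positional slice does not.

```python
import pandas as pd

y = pd.Series({'k': 1, 'l': 2, 'm': 3, 'n': 7})

by_position = y.iloc[1:3]      # positions 1 and 2; the end position is excluded
by_label = y.loc['l':'n']      # labels 'l' through 'n'; the end label is included
picked = y.loc[['k', 'n']]     # an explicit list of labels

print(by_position)
print(by_label)
print(picked)
```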



In [26]:
# Arithmetic between two Series aligns on the index labels. If a label is missing from one operand,
# the result for that label is NaN (not a number), which is how pandas marks a missing value
x + y

k    1.867775
l    2.096473
m    3.288008
n    5.371875
o         NaN
dtype: float64
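If NaN is not the result you want for unmatched labels, the `add` method accepts a `fill_value`, and `isna` / `dropna` help to find or remove missing entries afterwards. A small sketch follows; the values in `x` are fixed here (instead of random) so the result is predictable.

```python
import pandas as pd

x = pd.Series([0.5, 1.5, 2.5, 3.5, 4.5], index=['k', 'l', 'm', 'n', 'o'])
y = pd.Series({'k': 1, 'l': 2, 'm': 3, 'n': 7})

total = x + y                       # 'o' has no partner in y, so it becomes NaN
filled = x.add(y, fill_value=0)     # treat the missing operand as 0 instead
missing = total.isna()              # boolean Series marking the NaN entries
cleaned = total.dropna()            # drop the NaN entries

print(total)
print(filled)
print(cleaned)
```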

### DataFrames

DataFrames are the most important data structure in Pandas and the one you most need to master.
A DataFrame can be seen as a table (e.g. an Excel sheet or a database table), organized as rows and columns.
Below, let us see how DataFrames can be created and used.

In [3]:
# Create an empty DataFrame with columns specified
df = pd.DataFrame(columns=['a', 'b', 'c'])
print("Empty DF:\n{}\n".format(df))

# Create from Series
s = pd.Series(np.random.rand(5), index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame(s, columns=['N'])
print("DF created from a Series:\n{}\n".format(df))

# Create from a dictionary (keys become column names, values become the column contents)
d = {'a': np.random.rand(5), 'b': np.random.rand(5), 'c': np.random.rand(5)}
df = pd.DataFrame(d)
print("DF created from a dictionary:\n{}\n".format(df))

Empty DF:
Empty DataFrame
Columns: [a, b, c]
Index: []

DF created from a Series:
          N
a  0.140113
b  0.379190
c  0.508186
d  0.641902
e  0.847614

DF created from a dictionary:
          a         b         c
0  0.604368  0.940134  0.226047
1  0.926921  0.721891  0.749715
2  0.719402  0.381389  0.261365
3  0.175019  0.369784  0.323642
4  0.668848  0.634273  0.982425
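The dictionary above describes the table column by column. A DataFrame can also be built row by row from a list of dictionaries, which is handy for record-like data. The sketch below uses made-up records (`name`, `age`) and also shows a few attributes for inspecting the result.

```python
import pandas as pd

# Hypothetical records; the third one has no 'age' key.
records = [
    {'name': 'ada', 'age': 36},
    {'name': 'bob', 'age': 41},
    {'name': 'cy'},
]
people = pd.DataFrame(records)

print(people)
print(people.shape)            # (number of rows, number of columns)
print(list(people.columns))    # column names
print(people.dtypes)           # one dtype per column
```

A key that is missing from a record becomes NaN in that row, which is why the `age` column ends up as a float column.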



In [4]:
# Create a sample dataframe
d = {'a': np.random.rand(5), 'b': np.random.rand(5), 'c': np.random.rand(5)}
df = pd.DataFrame(d)
print("Original:\n{}\n".format(df))

# Column Selection
col_a = df['a']
print("Column A has type: {} and content:\n{}\n".format(type(col_a), col_a))

# Column Addition (just treat df like a dictionary)
col_d = np.random.rand(5)
df['d'] = col_d
print("New df with column d:\n{}\n".format(df))

# You can insert a column at a specific position with the insert method
col_e = np.random.rand(5)
df.insert(1, 'e', col_e)
print("New df with column e:\n{}\n".format(df))

# Delete a column
del df['d']
del df['e']
print("Original:\n{}\n".format(df))

Original:
          a         b         c
0  0.881947  0.138732  0.775113
1  0.040201  0.845211  0.123721
2  0.417347  0.669749  0.238554
3  0.858018  0.633113  0.258512
4  0.635574  0.339315  0.904747

Column A has type: <class 'pandas.core.series.Series'> and content:
0    0.881947
1    0.040201
2    0.417347
3    0.858018
4    0.635574
Name: a, dtype: float64

New df with column d:
          a         b         c         d
0  0.881947  0.138732  0.775113  0.720730
1  0.040201  0.845211  0.123721  0.434242
2  0.417347  0.669749  0.238554  0.487331
3  0.858018  0.633113  0.258512  0.033973
4  0.635574  0.339315  0.904747  0.877786

New df with column e:
          a         e         b         c         d
0  0.881947  0.044834  0.138732  0.775113  0.720730
1  0.040201  0.789613  0.845211  0.123721  0.434242
2  0.417347  0.293025  0.669749  0.238554  0.487331
3  0.858018  0.100410  0.633113  0.258512  0.033973
4  0.635574  0.093754  0.339315  0.904747  0.877786

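Original:
          a         b         c
0  0.881947  0.138732  0.775113
1  0.040201  0.845211  0.123721
2  0.417347  0.669749  0.238554
3  0.858018  0.633113  0.258512
4  0.635574  0.339315  0.904747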
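Adding a column with `df['d'] = ...` and removing one with `del` both modify the DataFrame in place. When you would rather keep the original untouched, `assign` and `drop` return new DataFrames instead. A minimal sketch, with small fixed numbers and a made-up column name `total`:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# assign returns a copy with the extra column; df itself is unchanged.
with_total = df.assign(total=df['a'] + df['b'])

# drop(columns=...) returns a copy without the listed columns.
without_b = with_total.drop(columns=['b'])

print(df)
print(with_total)
print(without_b)
```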

In [50]:
# Create a sample dataframe
d = {'a': np.random.rand(5), 'b': np.random.rand(5), 'c': np.random.rand(5)}
df = pd.DataFrame(d)
df.index = ['e', 'f', 'g', 'h', 'j']
print("Original:\n{}\n".format(df))

# Select row by index label
row = df.loc['f']
print("Row of type:{} content:\n{}\n".format(type(row), row))


# Select row by integer location
row = df.iloc[3]
print("Row of type:{} content:\n{}\n".format(type(row), row))

# Select multiple rows by integer position (the result is a DataFrame)
df_selected = df.iloc[2:4]
print("Sliced with indices:\n{}\n".format(df_selected))

# Select multiple rows with a boolean vector
vec = [True, False, False, True, False]
df_selected = df[vec]
print("Sliced with boolean vector:\n{}\n".format(df_selected))

Original:
          a         b         c
e  0.556364  0.918399  0.410224
f  0.547123  0.728116  0.890848
g  0.097189  0.008000  0.328595
h  0.880321  0.688627  0.055330
j  0.251598  0.844132  0.961841

Row of type:<class 'pandas.core.series.Series'> content:
a    0.547123
b    0.728116
c    0.890848
Name: f, dtype: float64

Row of type:<class 'pandas.core.series.Series'> content:
a    0.880321
b    0.688627
c    0.055330
Name: h, dtype: float64

Sliced with indices:
          a         b         c
g  0.097189  0.008000  0.328595
h  0.880321  0.688627  0.055330

Sliced with boolean vector:
          a         b         c
e  0.556364  0.918399  0.410224
h  0.880321  0.688627  0.055330
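In practice the boolean vector is rarely typed by hand; it usually comes from a comparison on a column. The sketch below, using small fixed values for predictability, builds such a mask and also shows `.loc` taking a row selector and a column selector together.

```python
import pandas as pd

df = pd.DataFrame(
    {'a': [0.2, 0.9, 0.5, 0.7], 'b': [1, 2, 3, 4]},
    index=['e', 'f', 'g', 'h'],
)

mask = df['a'] > 0.6                         # boolean Series, one entry per row
high = df[mask]                              # rows where the condition holds
cell = df.loc['g', 'b']                      # a single value: row 'g', column 'b'
combo = df[(df['a'] > 0.4) & (df['b'] < 4)]  # combine conditions with & and parentheses

print(high)
print(cell)
print(combo)
```

The parentheses around each condition are required, because `&` binds more tightly than the comparison operators in Python.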



In [22]:
# Arithmetic with a scalar is automatically broadcast to every element of the DataFrame
d = {'a': np.random.rand(3), 'b': np.random.rand(3)}
df = pd.DataFrame(d, index=['e', 'f', 'g'])
print("Original df:\n{}\n".format(df))
print("df + 3:\n{}\n".format(df + 3))
print("df * 2:\n{}\n".format(df * 2))

Original df:
          a         b
e  0.098913  0.736981
f  0.823385  0.891014
g  0.990247  0.882337

df + 3:
          a         b
e  3.098913  3.736981
f  3.823385  3.891014
g  3.990247  3.882337

df * 2:
          a         b
e  0.197826  1.473961
f  1.646769  1.782028
g  1.980493  1.764675
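Broadcasting also works between a DataFrame and a Series, in which case the labels decide what lines up with what. The sketch below uses fixed numbers: subtracting the column means aligns on the column labels, while `sub(..., axis=0)` aligns on the row labels. The name `row_offsets` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]},
    index=['e', 'f', 'g'],
)

col_means = df.mean()          # one mean per column, labeled 'a' and 'b'
centered = df - col_means      # the Series labels are matched to the column labels

row_offsets = pd.Series([1.0, 2.0, 3.0], index=['e', 'f', 'g'])
shifted = df.sub(row_offsets, axis=0)   # axis=0 matches the Series labels to the row labels

print(centered)
print(shifted)
```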



In [24]:
# You can print a summary of a DataFrame (index, columns, dtypes, memory usage) with the info() method
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, e to g
Data columns (total 2 columns):
a    3 non-null float64
b    3 non-null float64
dtypes: float64(2)
memory usage: 72.0+ bytes
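While `info()` describes the structure of a DataFrame, `describe()` summarizes the numbers inside it. A short sketch with fixed values:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4, 5, 6]})

summary = df.describe()   # count, mean, std, min, quartiles and max per numeric column
print(summary)

# The summary is itself a DataFrame, so individual statistics can be looked up by label.
mean_a = summary.loc['mean', 'a']
max_b = summary.loc['max', 'b']
print(mean_a, max_b)
```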
