# Introduction to Pandas
+ Pandas is an open source library built on top of NumPy.
+ It allows for fast data analysis, cleaning, and preparation.
+ It excels in performance and productivity.
+ It also has built-in visualization features.
+ It can work with data from a wide variety of sources.

In Pandas we will learn about:
+ Series
+ DataFrames
+ Missing Data
+ GroupBy
+ Merging, Joining, and Concatenating
+ Operations
+ Data Input and Output

<hr>

# Series
A Series is a Pandas data type that is similar to a NumPy array.
Unlike a NumPy array, a Series can have axis labels, so it can be indexed by a label instead of only by integer position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

In [1]:
import numpy as np
import pandas as pd

## Creating a Series
We can convert a list, NumPy array, or dictionary to a Series:


In [2]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10, 'b':20, 'c':30}

**Using lists**

In [3]:
pd.Series(data=my_list)

0    10
1    20
2    30
dtype: int64

In [4]:
pd.Series(data=my_list, index=labels)

a    10
b    20
c    30
dtype: int64

In [5]:
pd.Series(my_list, labels)

a    10
b    20
c    30
dtype: int64

**NumPy arrays**

In [6]:
pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [7]:
pd.Series(arr, labels)

a    10
b    20
c    30
dtype: int32

**Dictionary**

In [8]:
pd.Series(d)

a    10
b    20
c    30
dtype: int64
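
When a dictionary is passed together with an explicit index, pandas looks up each index label among the dictionary keys. The short sketch below reuses `d` from above; the label `'z'` is made up for illustration to show what happens when a key is absent.

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30}

# Labels are looked up among the dict keys; 'z' is not a key, so it becomes NaN
# and the values are upcast to float to make room for the missing entry.
s = pd.Series(d, index=['c', 'a', 'z'])
print(s)
```

The index also controls the order of the result, so this is a simple way to select and reorder entries from a dictionary.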

## Data in a Series
A pandas Series can hold a variety of object types:

In [9]:
pd.Series(data=labels)

0    a
1    b
2    c
dtype: object

In [10]:
# it can even store references to python functions
pd.Series(data=[sum,len,len])

0    <built-in function sum>
1    <built-in function len>
2    <built-in function len>
dtype: object
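
As a small illustration of what storing functions means in practice, the sketch below pulls the functions back out of a Series and calls them. The labels `'total'` and `'count'` are chosen here for illustration only.

```python
import pandas as pd

funcs = pd.Series([sum, len], index=['total', 'count'])

# Each element is an ordinary Python object, so it can be pulled out and called.
values = [10, 20, 30]
total = funcs['total'](values)   # calls sum(values)
count = funcs['count'](values)   # calls len(values)
print(total, count)
```

A Series of dtype `object` is flexible, but it generally does not benefit from the fast vectorized operations available for numeric dtypes.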

## Using an Index
A Series uses its index labels (names or numbers) for fast lookups of information, similar to a hash map or dictionary.

In [11]:
ser1 = pd.Series([1,2,3,4], index=['one', 'two', 'three', 'four'])
ser1

one      1
two      2
three    3
four     4
dtype: int64

In [12]:
ser2 = pd.Series([1,2,5,4], index=['one', 'two', 'five', 'four'])
ser2

one     1
two     2
five    5
four    4
dtype: int64

In [13]:
ser1['two']

2
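
The same lookup can be written more explicitly. The following sketch, which rebuilds `ser1` so that it runs on its own, contrasts label-based access with position-based access and shows a dictionary-style membership test.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['one', 'two', 'three', 'four'])

by_label = ser1.loc['two']      # look up by index label
by_position = ser1.iloc[1]      # look up by integer position (0-based)
has_label = 'five' in ser1      # membership tests check the index, like a dict
print(by_label, by_position, has_label)
```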

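Operations between Series are aligned on index labels. A label that appears in only one Series yields NaN, and the result is converted to float to hold it: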

In [14]:
ser1 + ser2

five     NaN
four     8.0
one      2.0
three    NaN
two      4.0
dtype: float64
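
If NaN is not the desired result for unmatched labels, one option is the `add` method with its `fill_value` parameter. This is a minimal sketch; whether 0 is a sensible default depends on what the data represents.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['one', 'two', 'three', 'four'])
ser2 = pd.Series([1, 2, 5, 4], index=['one', 'two', 'five', 'four'])

# Treat a label missing from one Series as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)
print(total)
```

With this call, `'three'` keeps its value from `ser1` and `'five'` keeps its value from `ser2`.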

<hr>

# DataFrames
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index.

In [16]:
data = np.random.randn(5,4)
data

array([[ 0.20346832,  0.6246626 , -0.47802864, -0.04007864],
       [-0.9859476 , -1.12843599,  0.39249414,  0.64144693],
       [-1.25569334, -0.60091718,  0.25742   , -1.53283206],
       [ 0.43016398,  0.2980349 ,  0.78130118, -0.63004176],
       [ 0.69058464,  0.0727666 ,  1.33069513, -0.81471154]])

In [17]:
df = pd.DataFrame(data, index='A B C D E'.split(), columns='W X Y Z'.split())
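
Because `np.random.randn` draws new numbers on every run, the values shown in this document will not reappear when the cells are re-executed. A possible way to get repeatable data is a seeded generator, sketched below. The seed value 0 is arbitrary, and the numbers it produces differ from the ones printed here.

```python
import numpy as np
import pandas as pd

# A seeded generator yields the same numbers on every run (seed 0 is arbitrary).
rng = np.random.default_rng(seed=0)
data = rng.standard_normal((5, 4))

df = pd.DataFrame(data, index='A B C D E'.split(), columns='W X Y Z'.split())
print(df.shape)
```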

In [18]:
df

          W         X         Y         Z
A  0.203468  0.624663 -0.478029 -0.040079
B -0.985948 -1.128436  0.392494  0.641447
C -1.255693 -0.600917  0.257420 -1.532832
D  0.430164  0.298035  0.781301 -0.630042
E  0.690585  0.072767  1.330695 -0.814712


## Selection and Indexing

In [19]:
df['W']

A    0.203468
B   -0.985948
C   -1.255693
D    0.430164
E    0.690585
Name: W, dtype: float64

In [21]:
# Pass a list of columns
df[['W','Y']]

          W         Y
A  0.203468 -0.478029
B -0.985948  0.392494
C -1.255693  0.257420
D  0.430164  0.781301
E  0.690585  1.330695


In [22]:
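# Attribute access: works only if the name is a valid Python identifier and does not clash with a DataFrame attribute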
df.W

A    0.203468
B   -0.985948
C   -1.255693
D    0.430164
E    0.690585
Name: W, dtype: float64

DataFrame columns are just Series:

In [23]:
type(df['W'])

pandas.core.series.Series
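
To make the "Series sharing one index" picture concrete, the sketch below builds a DataFrame directly from two Series. The names `w` and `y` and their values are invented for illustration.

```python
import pandas as pd

w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
y = pd.Series([10.0, 20.0], index=['A', 'C'])   # no entry for 'B'

# Columns are aligned on the union of the index labels.
df = pd.DataFrame({'W': w, 'Y': y})
print(df)
```

Row `'B'` has no value in `y`, so the `Y` column holds NaN there, the same alignment behavior seen when adding two Series.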

**Creating a new column:**

In [24]:
df['new'] = df['W'] + df['Y']

In [25]:
df

          W         X         Y         Z       new
A  0.203468  0.624663 -0.478029 -0.040079 -0.274560
B -0.985948 -1.128436  0.392494  0.641447 -0.593453
C -1.255693 -0.600917  0.257420 -1.532832 -0.998273
D  0.430164  0.298035  0.781301 -0.630042  1.211465
E  0.690585  0.072767  1.330695 -0.814712  2.021280
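
Bracket assignment modifies the DataFrame in place. When the original should stay untouched, `assign` is one alternative, sketched here on a small made-up frame.

```python
import pandas as pd

df = pd.DataFrame({'W': [1.0, 2.0], 'Y': [0.5, -0.5]}, index=['A', 'B'])

# assign returns a new DataFrame with the extra column; df is left untouched.
with_new = df.assign(new=df['W'] + df['Y'])
print(with_new)
```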


**Removing Columns**

In [26]:
df.drop('new', axis=1)

          W         X         Y         Z
A  0.203468  0.624663 -0.478029 -0.040079
B -0.985948 -1.128436  0.392494  0.641447
C -1.255693 -0.600917  0.257420 -1.532832
D  0.430164  0.298035  0.781301 -0.630042
E  0.690585  0.072767  1.330695 -0.814712


In [27]:
# drop returns a new DataFrame; df itself is unchanged unless inplace=True is passed.
df

          W         X         Y         Z       new
A  0.203468  0.624663 -0.478029 -0.040079 -0.274560
B -0.985948 -1.128436  0.392494  0.641447 -0.593453
C -1.255693 -0.600917  0.257420 -1.532832 -0.998273
D  0.430164  0.298035  0.781301 -0.630042  1.211465
E  0.690585  0.072767  1.330695 -0.814712  2.021280


In [28]:
df.drop('new',axis=1,inplace=True)

In [29]:
df

          W         X         Y         Z
A  0.203468  0.624663 -0.478029 -0.040079
B -0.985948 -1.128436  0.392494  0.641447
C -1.255693 -0.600917  0.257420 -1.532832
D  0.430164  0.298035  0.781301 -0.630042
E  0.690585  0.072767  1.330695 -0.814712


**Rows can be dropped the same way, using axis=0**

In [30]:
df.drop('E', axis=0)

          W         X         Y         Z
A  0.203468  0.624663 -0.478029 -0.040079
B -0.985948 -1.128436  0.392494  0.641447
C -1.255693 -0.600917  0.257420 -1.532832
D  0.430164  0.298035  0.781301 -0.630042
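
The `axis` numbers are easy to mix up. As a sketch of a more readable spelling, `drop` also accepts the `columns` and `index` keywords, and reassigning the result is an alternative to `inplace=True`. The small frame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6]}, index=['A', 'B', 'C'])

no_x = df.drop(columns='X')    # same as df.drop('X', axis=1)
no_c = df.drop(index='C')      # same as df.drop('C', axis=0)

# Reassigning the result is an alternative to inplace=True.
df = df.drop(columns='X')
print(df)
```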


**Selecting Rows**

In [34]:
df.loc['A']

W    0.203468
X    0.624663
Y   -0.478029
Z   -0.040079
Name: A, dtype: float64

In [36]:
# or based on position
df.iloc[2]

W   -1.255693
X   -0.600917
Y    0.257420
Z   -1.532832
Name: C, dtype: float64

**Selecting Subsets of rows and columns**

In [37]:
df.loc['B','Y']

0.3924941439703172

In [38]:
df.loc[['A','B'],['X','Y']]

          X         Y
A  0.624663 -0.478029
B -1.128436  0.392494
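
Both `loc` and `iloc` accept slices as well as single values and lists, with one difference that often surprises newcomers. The sketch below uses a small made-up frame to show it.

```python
import pandas as pd

df = pd.DataFrame(
    {'X': [1, 2, 3, 4], 'Y': [5, 6, 7, 8]},
    index=['A', 'B', 'C', 'D'],
)

by_label = df.loc['A':'C', 'X']   # label slice: includes the end label 'C'
by_position = df.iloc[0:2, 0]     # position slice: excludes the end position 2
print(len(by_label), len(by_position))
```

Label slices include both endpoints, while position slices follow the usual Python rule of excluding the stop value.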


## Conditional Selection
An important feature of pandas is conditional selection using bracket notation.

In [39]:
df

          W         X         Y         Z
A  0.203468  0.624663 -0.478029 -0.040079
B -0.985948 -1.128436  0.392494  0.641447
C -1.255693 -0.600917  0.257420 -1.532832
D  0.430164  0.298035  0.781301 -0.630042
E  0.690585  0.072767  1.330695 -0.814712


In [40]:
df>0

       W      X      Y      Z
A   True   True  False  False
B  False  False   True   True
C  False  False   True  False
D   True   True   True  False
E   True   True   True  False


In [41]:
df[df>0]

          W         X         Y         Z
A  0.203468  0.624663       NaN       NaN
B       NaN       NaN  0.392494  0.641447
C       NaN       NaN  0.257420       NaN
D  0.430164  0.298035  0.781301       NaN
E  0.690585  0.072767  1.330695       NaN


In [43]:
df[df['W']>0]

          W         X         Y         Z
A  0.203468  0.624663 -0.478029 -0.040079
D  0.430164  0.298035  0.781301 -0.630042
E  0.690585  0.072767  1.330695 -0.814712


In [44]:
df[df['W']>0]['Y']

A   -0.478029
D    0.781301
E    1.330695
Name: Y, dtype: float64

In [45]:
df[df['W']>0][['Y','Z']]

          Y         Z
A -0.478029 -0.040079
D  0.781301 -0.630042
E  1.330695 -0.814712
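
Chaining two sets of brackets works for reading data, but the row filter and the column selection can also be combined in one `loc` call. The sketch below uses a small made-up frame. The single-call form matters most when assigning values, since assignment through chained brackets may act on a temporary copy.

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [0.2, -0.9, 0.4], 'Y': [-0.4, 0.3, 0.7], 'Z': [0.0, 0.6, -0.6]},
    index=['A', 'B', 'C'],
)

# Row filter and column choice in a single indexing operation.
subset = df.loc[df['W'] > 0, ['Y', 'Z']]

# A single .loc call is also the reliable way to assign to the filtered cells.
df.loc[df['W'] > 0, 'Y'] = 0.0
print(subset)
```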


To combine two conditions, use | (or) and & (and), wrapping each condition in parentheses:

In [46]:
df[(df['W']>0) & (df['Y']>1)]

          W         X         Y         Z
E  0.690585  0.072767  1.330695 -0.814712
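
The parentheses are needed because `&` and `|` bind more tightly than comparison operators such as `>`. The sketch below, on a small made-up frame, shows the "or" and "not" forms and what happens when Python's `and` keyword is used instead.

```python
import pandas as pd

df = pd.DataFrame({'W': [0.2, -0.9, 0.6], 'Y': [-0.4, 0.3, 1.3]}, index=['A', 'B', 'C'])

either = df[(df['W'] > 0) | (df['Y'] > 1)]      # rows where at least one condition holds
neither = df[~((df['W'] > 0) | (df['Y'] > 1))]  # ~ negates a boolean mask

# Python's `and`/`or` need a single True/False, which a whole Series cannot provide.
try:
    df[(df['W'] > 0) and (df['Y'] > 1)]
    raised = False
except ValueError:
    raised = True
print(list(either.index), list(neither.index), raised)
```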
