## Pandas - Series

A pandas Series is a one-dimensional array, much like a NumPy array, except that it carries axis labels. Having axis labels means each element can be looked up by a label (which can be almost any hashable value) instead of only by its zero-based integer position. A Series can also hold arbitrary Python objects, not just numbers.

In [69]:

import numpy as np
import pandas as pd

# a list, a NumPy array, or a dictionary can each be converted into a pandas Series

keys = ['a', 'b', 'c'] # a regular Python list
ls = [10, 20, 30] # a regular Python list
arr = np.array(ls) # a numpy array
d = {'a':10, 'b':20, 'c':30} # a Python dictionary

# creating Pandas series using a list
seriesListRegularIndexing = pd.Series(data=ls) # the Series gets the default zero-based integer index
print("Pandas Series using List (Traditional Index) - \n", seriesListRegularIndexing)
seriesListCustomIndexing = pd.Series(data=ls, index=keys) # the Series uses the labels in the list keys as its index
print("\nPandas Series using List (Custom Index) - \n", seriesListCustomIndexing)

# creating Pandas series using a NumPy array
seriesArrayRegularIndexing = pd.Series(data=arr) # the Series gets the default zero-based integer index
print("\nPandas Series using NumPy Array (Traditional Index) - \n", seriesArrayRegularIndexing)
seriesArrayCustomIndexing = pd.Series(data=arr, index=keys) # the Series uses the labels in the list keys as its index
print("\nPandas Series using NumPy Array (Custom Index) - \n", seriesArrayCustomIndexing)

# creating Pandas series using a Python Dictionary
# when a Series is created from a dictionary, the keys become the index and the values become the elements of the Series
seriesDict = pd.Series(data=d) # the index is taken from the dictionary keys ('a', 'b', 'c'), not zero-based integers
print("\nPandas Series using Dictionary - \n", seriesDict)

Pandas Series using List (Traditional Index) - 
 0    10
1    20
2    30
dtype: int64

Pandas Series using List (Custom Index) - 
 a    10
b    20
c    30
dtype: int64

Pandas Series using NumPy Array (Traditional Index) - 
 0    10
1    20
2    30
dtype: int32

Pandas Series using NumPy Array (Custom Index) - 
 a    10
b    20
c    30
dtype: int32

Pandas Series using Dictionary - 
 a    10
b    20
c    30
dtype: int64
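
To make the idea of axis labels concrete, here is a minimal sketch of looking up an element by label and by position, and of how arithmetic between two Series lines up on labels rather than positions. The names `s`, `other` and `total` are made up for this illustration and are not part of the cell above.

```python
import pandas as pd

# illustrative Series with a custom index (same values as the cell above)
s = pd.Series(data=[10, 20, 30], index=['a', 'b', 'c'])

by_label = s.loc['b']      # label-based lookup
by_position = s.iloc[1]    # position-based lookup (zero-based)
print(by_label, by_position)

# arithmetic aligns on labels: only 'b' and 'c' exist in both Series,
# so 'a' and 'd' come out as NaN and the result is upcast to float
other = pd.Series(data=[1, 2, 3], index=['b', 'c', 'd'])
total = s + other
print(total)
```

Using `loc` and `iloc` explicitly avoids any ambiguity about whether a value inside the brackets is meant as a label or as a position.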


## Pandas - DataFrames

A pandas DataFrame is a collection of Series objects put together so that they share the same index; each column of the DataFrame is a Series.
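
As a small sketch of that statement, the snippet below builds a DataFrame from two Series and shows that they end up on one shared index. The names `w`, `x` and `df_demo` and their values are made up for illustration.

```python
import pandas as pd

# two illustrative Series; x has no entry for label 'C'
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0], index=['A', 'B'])

# the columns are aligned on the union of the labels,
# and the missing ('C', 'X') cell is filled with NaN
df_demo = pd.DataFrame({'W': w, 'X': x})
print(df_demo)

# pulling a column back out gives a Series again
print(type(df_demo['W']))
```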

In [70]:
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)

df = pd.DataFrame(data=randn(5,4), index=['A','B','C','D','E'], columns=['W','X','Y','Z'])
print("Basic DataFrame - \n\n", df)

# Selection and Indexing of DataFrames

# df['W'] # to access the W column, which is actually a series by itself
print("\n\nSelecting single column (W) - \n\n", df['W'])

# df[['W','Z']] to access multiple columns, W and Z
# basically to access multiple columns, pass their names as a list
print("\n\nSelecting two columns (W and Z) - \n\n", df[['W', 'Z']])

# creating / adding columns
df['newColumn (adding W and Z)'] = df['W'] + df['Z']
print("\n\nAfter adding new column - \n\n", df)

# removing columns and rows
# axis=1 must be specified when removing a column, because the default axis=0 refers to the index (row) labels, not the column labels
# unless inplace=True is specified, drop() returns a new DataFrame and leaves the original unchanged
df.drop('newColumn (adding W and Z)', axis=1)
print("\n\nColumn not deleted in original DataFrame - \n\n", df)
df.drop('newColumn (adding W and Z)', axis=1, inplace=True)
print("\n\nColumn deleted in original DataFrame - \n\n", df)
# rows can be deleted in a similar way
# for rows, use axis=0, since row labels live on the index of the DataFrame
df.drop('E', axis=0, inplace=True)
print("\n\nRow E has been removed from original DataFrame - \n\n", df)

# rows can be selected from a DataFrame in two ways
# Method 1 uses the loc indexer, to which you pass the label of the row you wish to select
# Method 2 uses the iloc indexer, to which you pass the integer position of the row you wish to select (zero-based)
print("\n\nSelecting row A using loc - \n\n", df.loc['A'])
print("\n\nSelecting row A using iloc - \n\n", df.iloc[0])
# note that a selected row is also a pandas Series object

# a subset of rows and columns could also be accessed using the NumPy comma notation we saw in the NumPy section
print("\n\nAccessing data at row B and column Y - ", df.loc['B', 'Y'])
print("\n\nAccessing data in rows A and B with columns W and Y - \n\n", df.loc[['A', 'B'], ['W', 'Y']])

# Conditional Selection
print("\n\nDataFrame where values are greater than 0 - \n\n", df>0)
print("\n\nApplying the boolean DataFrame above as a mask keeps the values where it is True and gives NaN where it is False - \n\n", df[df>0])
# further selections can be made on a conditionally selected DataFrame
print("\n\nSome Operations of Conditionally selected DataFrame - \n\n", df[df['W']>0]['Y'])
print("\n\nSome Operations of Conditionally selected DataFrame - \n\n", df[df['W']>0][['Y', 'X']])
# for multiple conditions, use the element-wise operators & and | (Python's and / or keywords do not work on a Series)
# each condition must be wrapped in parentheses
# the operator for and is &
# the operator for or is |
print("\n\nSome Operations of Conditionally selected DataFrame - \n\n", df[(df['W']>0) & (df['Y'] > 1)])

# more on Indexing
# to reset to the default index (zero-based integers), use the reset_index() method
# reset_index() does not change the DataFrame in place; it returns a new DataFrame in which the previous index becomes a column named 'index'
print("\n\nUsing reset_index() function - \n\n", df.reset_index())
# to set the index to something else of your choosing, use the set_index() method
# again, the index change is not done in place by default
# set_index() is given a column name here, so the new labels are first added to the DataFrame as a column
newIndices = ["CA", "NY", "WY", "CO"] # US state codes, one per remaining row
df["State Codes"] = newIndices
print("\n\nUsing set_index() function - \n\n", df.set_index("State Codes"))

# Multi-Index, Index Hierarchy and Multi-Level DataFrames
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
df = pd.DataFrame(data=randn(6,2), index=hier_index, columns=['A', 'B'])
print("\n\nMulti-Level DataFrame - \n\n", df)
print("\n\nIndex level names by default - ", df.index.names)
# setting the index level names
df.index.names = ['Groups', 'Num']
print("\n\nIndex level names after setting - ", df.index.names)
print("\n\nDataFrame - \n\n", df)
print("\n\nCross section of the DataFrame where the index level 'Num' equals 1 - \n\n", df.xs(1,level='Num'))

Basic DataFrame - 

           W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


Selecting single column (W) - 

 A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64


Selecting two columns (W and Z) - 

           W         Z
A  2.706850  0.503826
B  0.651118  0.605965
C -2.018168 -0.589001
D  0.188695  0.955057
E  0.190794  0.683509


After adding new column - 

           W         X         Y         Z  newColumn (adding W and Z)
A  2.706850  0.628133  0.907969  0.503826                    3.210676
B  0.651118 -0.319318 -0.848077  0.605965                    1.257083
C -2.018168  0.740122  0.528813 -0.589001                   -2.607169
D  0.188695 -0.758872 -0.933237  0.955057                    1.143752
E  0.190794  1.978757  2.605967  0.683509     
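
The conditional selection in the cell above combines conditions with `&` and `|` rather than Python's `and` and `or`. Below is a minimal sketch of why, using a small hand-written frame (`df_demo` and its values are made up for illustration and are not the random data above).

```python
import pandas as pd

# illustrative data, not the randn() values from the cell above
df_demo = pd.DataFrame(
    {'W': [2.7, 0.6, -2.0, 0.2], 'Y': [0.9, -0.8, 0.5, 1.5]},
    index=['A', 'B', 'C', 'D'],
)

# & and | work element by element on boolean Series;
# the parentheses are needed because & and | bind tighter than > and <
both = df_demo[(df_demo['W'] > 0) & (df_demo['Y'] > 1)]
either = df_demo[(df_demo['W'] > 0) | (df_demo['Y'] > 1)]
print(both)
print(either)

# loc can filter rows and pick a column in a single step
y_only = df_demo.loc[df_demo['W'] > 0, 'Y']
print(y_only)

# the and keyword asks for a single True/False for a whole Series,
# which pandas refuses to guess, so this raises ValueError
try:
    df_demo[(df_demo['W'] > 0) and (df_demo['Y'] > 1)]
except ValueError as err:
    print("and fails:", err)
```

The single-step `loc` form is often preferred over chaining two sets of brackets, since it performs the row filter and the column selection in one operation.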

## Pandas - Missing Data

Methods for dealing with missing data, such as null (None) or NaN values.
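
Before dropping or filling anything, it usually helps to see where the gaps are. The sketch below counts missing values per column and non-missing values per row, using the same small data as the cell that follows; the names `df_demo`, `per_column` and `per_row` are made up for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# isna() gives a boolean DataFrame that is True wherever a value is missing
print(df_demo.isna())

# number of missing values in each column
per_column = df_demo.isna().sum()
print(per_column)

# number of non-missing values in each row,
# which is the count that dropna(thresh=...) compares against
per_row = df_demo.notna().sum(axis=1)
print(per_row)
```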

In [10]:
import numpy as np
import pandas as pd

d = {'A':[1,2,np.nan], 'B':[5,np.nan,np.nan], 'C':[1,2,3]}
df = pd.DataFrame(d)
print("\n\nOriginal Dataframe - \n\n", df)

# to drop rows with null values
# dropna() returns a new DataFrame; pass inplace=True (or reassign the result) to change df itself
print("\n\nDropping rows with null values - \n\n", df.dropna())

# to drop columns with null values
# as above, pass inplace=True (or reassign the result) to change df itself
print("\n\nDropping columns with null values - \n\n", df.dropna(axis=1))

# setting threshold while dropping
# as above, pass inplace=True (or reassign the result) to change df itself
# dropna(thresh=2), for example, keeps the rows that have at least 2 non-null values and drops the rest
print("\n\nDropping rows with a threshold of 2 - \n\n", df.dropna(thresh=2))

# filling the Null values with a filler
# fillna() also returns a new DataFrame; pass inplace=True (or reassign the result) to change df itself
print("\n\nFilling the Null values with a filler - \n\n", df.fillna(value='Filler Value'))



Original Dataframe - 

      A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3


Dropping rows with null values - 

      A    B  C
0  1.0  5.0  1


Dropping columns with null values - 

    C
0  1
1  2
2  3


Dropping rows with a threshold of 2 - 

      A    B  C
0  1.0  5.0  1
1  2.0  NaN  2


Filling the Null values with a filler - 

               A             B  C
0           1.0           5.0  1
1           2.0  Filler Value  2
2  Filler Value  Filler Value  3
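
Filling with a string, as above, turns the affected numeric columns into object columns. As one possible alternative, the sketch below fills each column with its own mean, which keeps the columns numeric. Using the mean is an assumption made for this illustration, not something the cell above prescribes; `df_demo` and `filled` are made-up names.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# df_demo.mean() is a Series indexed by column name (NaN values are skipped),
# so fillna() fills each column with that column's own mean
filled = df_demo.fillna(value=df_demo.mean())
print(filled)

# the original is untouched because the result was assigned to a new name
print(df_demo.isna().sum().sum(), "missing values remain in df_demo")
```

Whether a mean, a constant, or dropping the rows is appropriate depends on the data and on what the missing values mean.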
