# DataFrames

DataFrames are the real workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a collection of Series objects that share the same index. Let's use pandas to explore this topic.

In [89]:
import pandas as pd
import numpy as np

In [90]:
from numpy.random import randn

In [91]:
# Set the seed
np.random.seed(101)

# Setting a seed makes the random number generator reproducible: the same seed yields the same sequence of random numbers on every run
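
As a small check of what seeding does, the sketch below resets the seed and draws again. It assumes the legacy `np.random.seed` interface used in this notebook; the names `first` and `second` are illustrative and not part of the original.

```python
import numpy as np
from numpy.random import randn

# Seed, then draw three values
np.random.seed(101)
first = randn(3)

# Reset to the same seed and draw again
np.random.seed(101)
second = randn(3)

# Both draws contain exactly the same numbers
print(np.array_equal(first, second))  # True
```

Without the second call to `np.random.seed`, the generator would simply continue its sequence and `second` would hold different values.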

In [92]:
df = pd.DataFrame(data=randn(5,4), index=['A', 'B', 'C', 'D', 'E'], columns=['W', 'X', 'Y', 'Z'])

# randn(5, 4) generates a 5 x 4 array of random values drawn from a standard normal distribution (mean 0, standard deviation 1)
# index supplies the row labels
# columns supplies the column labels

In [93]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


Each of the columns (W, X, Y, Z) is a pandas Series, and all of them share the same index (A, B, C, D, E).
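
To make the "Series sharing an index" picture concrete, here is a minimal sketch that builds a DataFrame from two Series. The names `idx`, `w`, `x` and `small`, and the literal values, are made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels
idx = ['A', 'B', 'C']
w = pd.Series([1.0, 2.0, 3.0], index=idx)
x = pd.Series([10.0, 20.0, 30.0], index=idx)

# A dict of Series becomes a DataFrame; the dict keys become column names
small = pd.DataFrame({'W': w, 'X': x})
print(small)

# Every column carries the same index as the DataFrame itself
print(small['W'].index.equals(small.index))  # True
```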

## Indexing and Selection 

Let us learn the various ways to grab data from a DataFrame.

In [94]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

As the output shows, each column of a DataFrame ('W' in this example) is a Series. We can verify this by checking the type of df['W'].

In [95]:
type(df['W'])

pandas.core.series.Series

If we check the type of df itself, we see that it is a DataFrame, whereas each of its columns is a pandas Series.

In [96]:
type(df)

pandas.core.frame.DataFrame

Another way of selecting a column from a pandas DataFrame is attribute access, which resembles SQL syntax: we write the column name after the dot (.) operator.

In [97]:
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

This is not the recommended way of extracting columns, because column names are easily confused with the many methods and attributes of a DataFrame object. At worst, a column whose name matches a built-in method or attribute (for example, a column called count) is shadowed by it, so dot access returns the method instead of the column.
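
The name clash can be seen in a small sketch. The DataFrame `demo` and its column named `count` are hypothetical and chosen only because `count` is also a DataFrame method.

```python
import pandas as pd

# A column deliberately named like the built-in DataFrame.count method
demo = pd.DataFrame({'W': [1, 2], 'count': [10, 20]})

# Bracket notation always returns the column
print(type(demo['count']).__name__)  # Series

# Dot notation finds the method first, so the column is not reachable this way
print(callable(demo.count))  # True

# Columns without a clash still work with dot notation
print(demo.W.tolist())  # [1, 2]
```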

In [98]:
# To grab data from multiple columns, pass in a list of columns
df[['W', 'X']]

          W         X
A  2.706850  0.628133
B  0.651118 -0.319318
C -2.018168  0.740122
D  0.188695 -0.758872
E  0.190794  1.978757


Remember that when we ask for multiple columns from a DataFrame, we get a sub-DataFrame and not a Series, whereas when we grab a single column we get a pandas Series object.

In [99]:
type(df[['W', 'X']])

pandas.core.frame.DataFrame
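
What decides the return type is whether a list is passed, not how many columns it names. The sketch below uses a small made-up DataFrame called `frame` to show that a one-element list still yields a DataFrame.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1.0, 2.0], 'X': [3.0, 4.0]}, index=['A', 'B'])

# A single label returns a one-dimensional Series
as_series = frame['W']
print(type(as_series).__name__, as_series.shape)  # Series (2,)

# A list with a single label returns a two-dimensional DataFrame
as_frame = frame[['W']]
print(type(as_frame).__name__, as_frame.shape)  # DataFrame (2, 1)
```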

## Creating a new column 

In [100]:
# Create a new column named 'new' whose values are the sum of the corresponding values from columns W and Y
df['new'] = df['W'] + df['Y']

# This means that new columns can be created by using the data values from existing columns
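
One alternative, sketched here on made-up data, is the `assign` method, which returns a new DataFrame with the extra column and leaves the original untouched. The names `frame` and `with_new` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1.0, 2.0], 'Y': [0.5, -0.5]}, index=['A', 'B'])

# assign returns a copy with the additional column
with_new = frame.assign(new=frame['W'] + frame['Y'])

print(list(frame.columns))       # ['W', 'Y']
print(list(with_new.columns))    # ['W', 'Y', 'new']
print(with_new['new'].tolist())  # [1.5, 1.5]
```

Bracket assignment, as used in this notebook, modifies the DataFrame directly, so which form to use depends on whether the original should change.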

In [101]:
# show the updated dataframe
df

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762


## Removing Columns 

To remove a column from the DataFrame, use the built-in drop method in pandas. Remember to specify the axis as 1.

In [102]:
# Drop the column with the name "new" from the dataFrame "df"
df.drop(labels='new', axis=1)

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [103]:
# show the original dataframe
df

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762


An important thing to note about the DataFrame's drop method is that the removal does not happen in place: the method returns a new DataFrame and leaves the original unchanged, unless we explicitly request otherwise by setting inplace=True.

In [104]:
# drop the column named "new" and also affect the dataFrame in place
df.drop(labels='new', axis=1, inplace=True)

In [105]:
# show the original dataFrame to check whether the changes have taken place
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
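
A common alternative to inplace=True is to reassign the result of drop. The sketch below uses a made-up DataFrame `frame` and the `columns=` keyword, which is shorthand for `labels=..., axis=1`.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [4, 6]}, index=['A', 'B'])

# drop returns a new DataFrame; the original keeps its column
dropped = frame.drop(columns='new')
print(list(frame.columns))    # ['W', 'X', 'new']
print(list(dropped.columns))  # ['W', 'X']

# Reassigning the name has the same visible effect as inplace=True
frame = frame.drop(columns='new')
print(list(frame.columns))    # ['W', 'X']
```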


## Removing Rows

The built-in drop() method can also remove rows from the DataFrame: we just set the axis argument to 0. Note that if we omit the axis argument altogether, pandas assumes it to be 0 and therefore tries to match the given labels against the index labels.

In [106]:
df.drop(labels='E', axis=0)

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
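
The default axis can be checked with a short sketch on made-up data. The names `frame`, `explicit`, `implicit`, `several` and `raised` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6]}, index=['A', 'B', 'C'])

# Omitting axis gives the same result as axis=0
explicit = frame.drop(labels='C', axis=0)
implicit = frame.drop(labels='C')
print(explicit.equals(implicit))  # True

# A list of labels drops several rows at once
several = frame.drop(labels=['A', 'C'])
print(list(several.index))  # ['B']

# With the default axis, a column name is searched for in the index and not found
raised = False
try:
    frame.drop(labels='W')
except KeyError:
    raised = True
print(raised)  # True
```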


A common point of confusion is why rows are referred to with axis=0 and columns with axis=1. This convention comes from NumPy, since DataFrames are essentially labeled indexes layered on top of NumPy arrays (see below).

In [107]:
df.shape

(5, 4)

Note that the shape attribute returns a tuple for a two-dimensional DataFrame like the one above. Index 0 of the tuple holds the number of rows, whereas index 1 holds the number of columns.

In [108]:
df.shape[0]

5

In [109]:
df.shape[1]

4

This is why rows are referred to as axis 0 and columns as axis 1: the numbering follows the positions in the shape tuple, just as it does for a NumPy array.
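
The same axis numbering applies to other operations. The sketch below sums a small NumPy array and a DataFrame built from it along each axis. The data and the names `arr` and `frame` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A 2 x 3 array: [[0, 1, 2], [3, 4, 5]]
arr = np.arange(6).reshape(2, 3)
frame = pd.DataFrame(arr, index=['A', 'B'], columns=['W', 'X', 'Y'])

# The array and the DataFrame report the same shape
print(arr.shape == frame.shape)  # True

# axis=0 sums down the rows, giving one value per column
print(arr.sum(axis=0).tolist())    # [3, 5, 7]
print(frame.sum(axis=0).tolist())  # [3, 5, 7]

# axis=1 sums across the columns, giving one value per row
print(arr.sum(axis=1).tolist())    # [3, 12]
print(frame.sum(axis=1).tolist())  # [3, 12]
```

For reductions such as sum, the axis names the dimension that is collapsed. For drop, it names the dimension whose labels are searched.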

## Selecting Rows

In [110]:
# print the dataFrame
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


There are two indexers that can be used to select rows from a DataFrame in pandas:
1. loc, which selects by label
2. iloc, which selects by integer position
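
The difference is easiest to see when the index labels are themselves integers. In this sketch the DataFrame `s_frame` and its labels 10, 20, 30 are hypothetical.

```python
import pandas as pd

# Integer labels that do not coincide with the row positions 0, 1, 2
s_frame = pd.DataFrame({'val': ['a', 'b', 'c']}, index=[10, 20, 30])

# loc looks up the label 10
print(s_frame.loc[10, 'val'])  # a

# iloc looks up position 0, which is the same row
print(s_frame.iloc[0, 0])  # a

# There is no row labelled 0, so loc raises a KeyError
missing_label = False
try:
    s_frame.loc[0]
except KeyError:
    missing_label = True
print(missing_label)  # True
```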

In [111]:
# Pass in the index label which delineates the row in the dataframe
df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

The above returns a pandas Series, which shows that a row of a DataFrame, just like a column, is returned as a Series object.

In [112]:
type(df.loc['A'])

pandas.core.series.Series

In [113]:
# Pass in the numerical (zero-based) position of the row you want to select
df.iloc[2]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

## Selecting a subset of rows and columns

In [114]:
# Returns the data value at Row 'B' and column 'Y'
df.loc['B' , 'Y']

-0.8480769834036315

In [115]:
# Returns a dataFrame containing values in rows A, B and columns W, Y
df.loc[['A', 'B'], ['W', 'Y']]

          W         Y
A  2.706850  0.907969
B  0.651118 -0.848077


In [117]:
# Since we retrieved multiple rows and columns at once, we get a mini-DataFrame and not a single value or a Series object
type(df.loc[['A', 'B'], ['W', 'Y']])

pandas.core.frame.DataFrame
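
The same subset can be requested by position with iloc. The sketch below uses made-up integer data in a DataFrame called `frame`, and also shows how slices differ between the two indexers.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=['A', 'B', 'C'],
    columns=['W', 'X', 'Y', 'Z'],
)

# Rows A, B and columns W, Y by label, then by position
by_label = frame.loc[['A', 'B'], ['W', 'Y']]
by_position = frame.iloc[[0, 1], [0, 2]]
print(by_label.equals(by_position))  # True

# Label slices include the end label; position slices exclude the end position
print(frame.loc['A':'B'].shape)  # (2, 4)
print(frame.iloc[0:1].shape)     # (1, 4)
```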

## Conditional Selection 

A very important feature of pandas is the ability to perform conditional selection using the bracket notation.

In [118]:
# Using a conditional operator against the dataFrame
df > 0

       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True


We get a boolean DataFrame of the same shape, in which the cells that satisfy the condition contain True and all others contain False.

In [119]:
# store the boolean matrix (dataframe)
bool_df = df > 0

In [120]:
# display the boolean dataframe
bool_df

       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True


In [121]:
# use the boolean dataframe with the original dataframe to conditionally select the values. This is known as Boolean Indexing.
df[bool_df]

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118       NaN       NaN  0.605965
C       NaN  0.740122  0.528813       NaN
D  0.188695       NaN       NaN  0.955057
E  0.190794  1.978757  2.605967  0.683509


The cells for which the boolean DataFrame contained True keep their values, whilst the others contain NaN.

In [122]:
# Combining the steps into a single one
df[df>0]

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118       NaN       NaN  0.605965
C       NaN  0.740122  0.528813       NaN
D  0.188695       NaN       NaN  0.955057
E  0.190794  1.978757  2.605967  0.683509
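
Masking a whole DataFrame in this way gives the same result as the `where` method. The sketch below uses a small made-up DataFrame `frame` to compare the two.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1.0, -2.0], 'X': [-3.0, 4.0]}, index=['A', 'B'])

# Bracket notation with a boolean DataFrame keeps the shape and fills NaN elsewhere
masked = frame[frame > 0]

# DataFrame.where expresses the same operation as a method call
print(masked.equals(frame.where(frame > 0)))  # True

# Two of the four cells failed the condition and are now NaN
print(int(masked.isna().sum().sum()))  # 2
```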


This sort of operation on an entire DataFrame isn't that common. More often, instead of passing in the entire DataFrame, we build the condition from a single column. The result then contains no NaNs, only the rows of the DataFrame where the condition is true.

In [124]:
df['W'] > 0
# This returns a series of boolean values

A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool

We can now use this Series of boolean values, one per row, to filter the rows by a column's value: if we pass the Series into the DataFrame using bracket notation, we get back only the rows where the condition is true.

In [125]:
df[df['W'] > 0]

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [126]:
# return all the rows in which the value of column 'Z' is negative
df[df['Z'] < 0]

          W         X         Y         Z
C -2.018168  0.740122  0.528813 -0.589001


Note that the result of this kind of conditional selection is a DataFrame, which means we can call methods on it and index it further.

In [129]:
result_df = df[df['W'] > 0]

print(result_df)

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [130]:
result_df['X']

A    0.628133
B   -0.319318
D   -0.758872
E    1.978757
Name: X, dtype: float64

In [131]:
# Combining the two steps together
df[df['W'] > 0]['X']

A    0.628133
B   -0.319318
D   -0.758872
E    1.978757
Name: X, dtype: float64
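
The filter and the column selection can also be written as a single loc call, with the boolean Series as the row selector and the column label as the column selector. This sketch uses made-up data; the names `frame`, `chained` and `single` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {'W': [1.0, -2.0, 3.0], 'X': [10.0, 20.0, 30.0]},
    index=['A', 'B', 'C'],
)

# Two bracket operations chained together, as in the cell above
chained = frame[frame['W'] > 0]['X']

# One loc call: rows where W > 0, column X
single = frame.loc[frame['W'] > 0, 'X']

print(chained.equals(single))  # True
print(single.tolist())         # [10.0, 30.0]
```

For reading values the two forms give the same result. When assigning values, the single loc call is generally the safer choice, because chained indexing may operate on a temporary copy.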