The DataFrame data structure is the heart of the Pandas library. It's the primary object that you'll be working with in data analysis and cleaning tasks.

The DataFrame is conceptually a two-dimensional Series object, where there's an index and multiple columns of content, with each column having a label. We can think of the DataFrame itself as simply a two-axis labeled array.

In [1]:
# let's start by importing our pandas library
import pandas as pd

In [2]:
# we are going to jump in with an example. let's create three school records for students and
# their class grades. we'll create each as a Series which has a student name, the class name, and the score.
record1 = pd.Series({'Name': 'Alice',
                     'Class': 'Physics',
                     'Score': 85})
record2 = pd.Series({'Name': 'Jack',
                     'Class': 'Chemistry',
                     'Score': 82})
record3 = pd.Series({'Name': 'Helen',
                     'Class': 'Biology',
                     'Score': 90})

In [3]:
record1

Name       Alice
Class    Physics
Score         85
dtype: object

In [4]:
record2

Name          Jack
Class    Chemistry
Score           82
dtype: object

In [5]:
record3

Name       Helen
Class    Biology
Score         90
dtype: object

In [6]:
# like a Series, the DataFrame object is indexed. here we'll use a group of Series, where each Series
# represents a row of data. just like with the Series constructor, we can pass in our individual items
# in a list, and we can pass in our index values as a second argument
df = pd.DataFrame([record1, record2, record3],
                  index=['school1', 'school2', 'school3'])
# we can use the head() method to see the first several rows of the DataFrame, including
# the labels from both axes, and we can use this to verify the columns and the rows
df.head()

          Name      Class  Score
school1  Alice    Physics     85
school2   Jack  Chemistry     82
school3  Helen    Biology     90


In [7]:
# you'll notice here that Jupyter creates a nice bit of HTML to render the results of the
# DataFrame. so we have the index, which is the leftmost column and holds the school name, and
# then we have the rows of data, where each value sits under a column header which was given in our
# initial record dictionaries
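
The row and column labels that Jupyter renders can also be read programmatically. As a minimal sketch (the frame is rebuilt from plain dictionaries under the illustrative name `demo_df` so that the snippet runs on its own), the `index`, `columns`, and `shape` attributes expose both axes:

```python
import pandas as pd

# rebuild the three records so this snippet is self-contained
# (demo_df is an illustrative name, not one used elsewhere in the lecture)
records = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
           {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
           {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
demo_df = pd.DataFrame(records, index=['school1', 'school2', 'school3'])

row_labels = list(demo_df.index)       # labels along axis 0 (the rows)
column_labels = list(demo_df.columns)  # labels along axis 1 (the columns)
n_rows, n_cols = demo_df.shape         # (number of rows, number of columns)

print(row_labels)
print(column_labels)
print(n_rows, n_cols)
```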

In [8]:
# an alternative method is to use a list of dictionaries, where each dictionary
# represents a row of data

students = [{ 
    'Name': 'Alice',
    'Class': 'Physics',
    'Score': 85
},
{
    'Name': 'Jack',
    'Class': 'Chemistry',
    'Score': 82
},
{
    'Name': 'Helen',
    'Class': 'Biology',
    'Score': 90
}]
students

[{'Class': 'Physics', 'Name': 'Alice', 'Score': 85},
 {'Class': 'Chemistry', 'Name': 'Jack', 'Score': 82},
 {'Class': 'Biology', 'Name': 'Helen', 'Score': 90}]

In [9]:
# then we pass this list of dictionaries into the DataFrame function
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])
df.head()

          Name      Class  Score
school1  Alice    Physics     85
school2   Jack  Chemistry     82
school1  Helen    Biology     90
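
As a quick sanity check, the two construction routes can be compared directly. The sketch below builds the same table once from Series objects and once from dictionaries, then compares labels and cell values. The variable names are illustrative, and column dtypes are deliberately left out of the comparison.

```python
import pandas as pd

rows = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
        {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
        {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
labels = ['school1', 'school2', 'school1']

# build the same table two ways: from Series objects and from dictionaries
from_series = pd.DataFrame([pd.Series(row) for row in rows], index=labels)
from_dicts = pd.DataFrame(rows, index=labels)

# compare the labels on both axes and the cell values, row by row
same_index = list(from_series.index) == list(from_dicts.index)
same_columns = list(from_series.columns) == list(from_dicts.columns)
same_values = from_series.to_dict('records') == from_dicts.to_dict('records')

print(same_index, same_columns, same_values)
```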


In [10]:
# similar to the Series, we can extract data using the .iloc and .loc attributes. because
# the DataFrame is 2D, passing a single value to the .loc indexing operator will return
# a Series if there's only one row to return.

# for instance, if we wanted to select data associated with school2, we would just query
# the .loc attribute with one parameter
df.loc['school2']

Name          Jack
Class    Chemistry
Score           82
Name: school2, dtype: object

In [11]:
# you'll note that the name of the Series is the row's index value, while the column
# names have become the index of the Series.

# we can check the data type of the return using the python type function
type(df.loc['school2'])

pandas.core.series.Series
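
The positional counterpart, `.iloc`, is mentioned above but not shown. As a small sketch on the same three rows (rebuilt here under the illustrative name `students_df`), `.iloc` takes integer positions rather than labels, so position 1 is the second row whatever the index labels happen to be:

```python
import pandas as pd

# same data as the lecture's df, rebuilt so the snippet runs on its own
students_df = pd.DataFrame(
    [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
     {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
     {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}],
    index=['school1', 'school2', 'school1'])

by_label = students_df.loc['school2']   # row selected by its index label
by_position = students_df.iloc[1]       # row selected by its integer position

# both selections point at the same row, so they hold the same values
same_row = list(by_label) == list(by_position)

print(by_position['Name'])
print(same_row)
```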

In [12]:
# it's important to remember that the indices and column names along either axis, horizontal
# or vertical, could be non-unique. in this example, we see two records for school1 as different
# rows. if we use a single value with the DataFrame .loc attribute, multiple rows of the DataFrame
# will be returned, not as a new Series, but as a new DataFrame

# let's query for school1 records
df.loc['school1']

          Name    Class  Score
school1  Alice  Physics     85
school1  Helen  Biology     90


In [13]:
type(df.loc['school1'])

pandas.core.frame.DataFrame
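
If the return type needs to be predictable, one option is to pass the label inside a list, which yields a DataFrame even when only one row matches. The sketch below shows this, along with the `is_unique` attribute of the index, which reports whether duplicate row labels are present (`students_df` is an illustrative name):

```python
import pandas as pd

students_df = pd.DataFrame(
    [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
     {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
     {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}],
    index=['school1', 'school2', 'school1'])

# a single label returns a Series when it matches exactly one row ...
one_row = students_df.loc['school2']
# ... but wrapping the label in a list returns a DataFrame with one row
one_row_frame = students_df.loc[['school2']]

# the index itself can tell us whether any row label is repeated
has_unique_labels = students_df.index.is_unique

print(type(one_row).__name__)
print(type(one_row_frame).__name__)
print(has_unique_labels)
```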

In [14]:
# one of the powers of the Pandas DataFrame is that you can quickly select data based on
# multiple axes. for instance, if you wanted just a list of the student names for school1,
# you would supply two parameters to .loc, one being the row index and the other being the
# column name.

# for instance, if we are only interested in school1's student names
df.loc['school1', 'Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

In [15]:
# alternatively, we could chain two selections: first the school1 rows, then the Name column
df.loc['school1']['Name']

school1    Alice
school1    Helen
Name: Name, dtype: object
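
Because school1 appears twice, the selections above come back as a Series. As a small sketch on the same data (rebuilt as `students_df`, an illustrative name), a row label that matches exactly one row, combined with a single column name, yields a plain scalar instead:

```python
import pandas as pd

students_df = pd.DataFrame(
    [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
     {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
     {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}],
    index=['school1', 'school2', 'school1'])

single_value = students_df.loc['school2', 'Name']    # one row, one column
several_values = students_df.loc['school1', 'Name']  # two rows, one column

print(single_value)
print(list(several_values))
```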

In [17]:
# remember, just like the Series, the pandas developers have implemented this using the indexing
# operator and not as parameters to a function.

# what would we do if we just wanted to select a single column though? well, there are a few
# mechanisms. firstly, we could transpose the matrix. this pivots all of the rows into columns
# and all of the columns into rows, and is done with the T attribute
df.T

       school1    school2  school1
Name     Alice       Jack    Helen
Class  Physics  Chemistry  Biology
Score       85         82       90


In [18]:
# then we can call .loc on the transpose to get the student names only
df.T.loc['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [24]:
# another option is to take a full row slice with .iloc and then index that result by column name
df.iloc[:]['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [27]:
# however, since .iloc and .loc are used for row selection, Pandas reserves the indexing operator
# directly on the DataFrame for column selection. in a Pandas DataFrame, columns always have a name,
# so this selection is always label based, and is not as confusing as it was when using the square
# bracket operator on Series objects. for those familiar with relational databases, this operator
# is analogous to column projection.
df['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [28]:
# in practice, this works really well since you're often trying to add or drop columns.
# however, this also means that you get a KeyError if you try to use .loc with a column name
df.loc['Name']

KeyError: 'Name'
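
To see why the lookup fails, it can help to check which axis a label lives on. The following is a minimal sketch on the same data (`students_df` and `raised` are illustrative names): 'Name' is a column label, so it is found in `columns` but not in `index`, and `.loc` with a single argument only searches the row labels.

```python
import pandas as pd

students_df = pd.DataFrame(
    [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
     {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
     {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}],
    index=['school1', 'school2', 'school1'])

raised = False
try:
    students_df.loc['Name']   # 'Name' is a column label, not a row label
except KeyError:
    raised = True

in_rows = 'Name' in students_df.index       # searched by .loc['Name']
in_columns = 'Name' in students_df.columns  # searched by students_df['Name']

print(raised, in_rows, in_columns)
```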

In [29]:
# note too that the result of a single column projection is a Series object
type(df['Name'])

pandas.core.series.Series
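
Projecting several columns at once works the same way, except that a list of column names is passed and the result is a DataFrame rather than a Series. A short sketch on the same data (`students_df` is an illustrative name):

```python
import pandas as pd

students_df = pd.DataFrame(
    [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
     {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
     {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}],
    index=['school1', 'school2', 'school1'])

names = students_df['Name']                        # single label -> Series
names_and_scores = students_df[['Name', 'Score']]  # list of labels -> DataFrame

print(type(names).__name__)
print(type(names_and_scores).__name__)
print(list(names_and_scores.columns))
```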

In [31]:
# since the result of using the indexing operator is either a DataFrame or Series, you can chain
# operations together. for instance, we can select all of the rows which relate to school1 using
# .loc, then project the Name column from just those rows
df.loc['school1']['Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

In [32]:
# if you get confused, use type to check the responses from resulting operations
print(type(df.loc['school1'])) # should be a DataFrame
print(type(df.loc['school1']['Name'])) # should be a Series

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [33]:
# chaining, by indexing on the return value of another indexing operation, can come with some
# costs and is best avoided if you can use another approach. in particular, chaining tends to
# cause Pandas to return a copy of the DataFrame instead of a view on the DataFrame.
# for selecting data, this is not a big deal, though it might be slower than necessary.
# if you are changing data, though, this is an important distinction and can be a source of error.
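
One way to sidestep the copy-versus-view question when changing data is to do the row and column selection in a single `.loc` call. The following is a minimal sketch, using a scratch frame with unique row labels and a made-up replacement score of 95 (both are assumptions for illustration):

```python
import pandas as pd

# scratch frame with unique row labels, used only for this illustration
scratch = pd.DataFrame(
    [{'Name': 'Alice', 'Score': 85},
     {'Name': 'Jack', 'Score': 82},
     {'Name': 'Helen', 'Score': 90}],
    index=['school1', 'school2', 'school3'])

# one .loc call with both the row label and the column label assigns
# directly into the original frame, with no intermediate selection
scratch.loc['school2', 'Score'] = 95

print(scratch.loc['school2', 'Score'])
print(list(scratch['Score']))
```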

In [34]:
# here's another approach. as we saw, .loc does row selection, and it can take two parameters,
# the row index and the list of column names. the .loc attribute also supports slicing.

# if we wanted to select all rows, we can use a colon to indicate a full slice from beginning
# to end. this is just like slicing items in a list in python. then we can add the column name
# as the second parameter as a string. if we wanted to include multiple columns, we could do
# so in a list, and pandas will bring back only the columns we have asked for.

# here's an example, where we ask for all the names and scores for all schools using .loc
df.loc[:, ['Name', 'Score']]

          Name  Score
school1  Alice     85
school2   Jack     82
school1  Helen     90


In [38]:
# take a look at that again. the colon means that we want to get all of the rows, and the list
# in the second argument position is the list of columns we want to get back
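
Slices in `.loc` are not limited to the bare colon. As a sketch, using a frame with unique row labels (`unique_df`, an illustrative name, since label slices are easiest to reason about when labels are not repeated), a label-based slice can be given on either axis. Unlike list slicing in python, a label slice includes both of its endpoints.

```python
import pandas as pd

# unique row labels, assumed here to keep the slice unambiguous
unique_df = pd.DataFrame(
    [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
     {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
     {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}],
    index=['school1', 'school2', 'school3'])

# rows from school1 through school2 and columns from Name through Class,
# with both end labels included in the result
first_two = unique_df.loc['school1':'school2', 'Name':'Class']

print(list(first_two.index))
print(list(first_two.columns))
```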

In [39]:
# that's selecting and projecting data from a DataFrame based on row and column labels. the key
# concepts to remember are that the rows and columns are really just for our benefit. underneath,
# this is just a two-axis labeled array, and transposing it is easy. also, consider
# the issue of chaining carefully and try to avoid it, as it can cause unpredictable results, where
# your intent was to obtain a view of the data, but instead Pandas returns a copy to you.

In [40]:
# before we leave the discussion of accessing data in DataFrames, let's talk about dropping data.
# it's easy to delete data in Series and DataFrames, and we can use the drop function to do so.
# this function takes a single parameter, which is the index or row label to drop. this is another
# tricky place for new users -- the drop function doesn't change the DataFrame by default! instead,
# the drop function returns to you a copy of the DataFrame with the given rows removed.

df.drop('school1')

         Name      Class  Score
school2  Jack  Chemistry     82


In [41]:
# but if we look at our original DataFrame we see that the data is still intact.
df

          Name      Class  Score
school1  Alice    Physics     85
school2   Jack  Chemistry     82
school1  Helen    Biology     90


In [43]:
# drop has two interesting optional parameters. the first is called inplace, and if it's
# set to True, the DataFrame will be updated in place, instead of a copy being returned.
# the second parameter is axis, which says which axis to drop from. by default, this value is 0,
# indicating the row axis, but you could change it to 1 if you want to drop a column.

# for example, let's make a copy of a DataFrame using .copy()
copy_df = df.copy()
# now let's drop the Name column in this copy
copy_df.drop('Name', inplace=True, axis=1)
copy_df

             Class  Score
school1    Physics     85
school2  Chemistry     82
school1    Biology     90
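
The drop function also accepts a `columns` keyword, which some readers find easier to read than remembering that `axis=1` means columns. A small sketch on the same data (`students_df` and `without_name` are illustrative names), leaving `inplace` at its default so that a new DataFrame is returned:

```python
import pandas as pd

students_df = pd.DataFrame(
    [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
     {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
     {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}],
    index=['school1', 'school2', 'school1'])

# columns='Name' targets the column axis, like drop('Name', axis=1)
without_name = students_df.drop(columns='Name')

print(list(without_name.columns))
print(list(students_df.columns))   # the original frame is left untouched
```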


In [44]:
# there is a second way to drop a column, and that's directly through the use of the indexing
# operator, using the del keyword. this way of dropping data, however, takes immediate effect
# on the DataFrame and does not return a new object
del copy_df['Class']
copy_df

         Score
school1     85
school2     82
school1     90


In [45]:
# finally, adding a new column to the DataFrame is as easy as assigning it to some value using
# the indexing operator. for instance, if we wanted to add a class ranking column with a default
# value of None, we could do so by using the assignment operator after the square brackets.
# this broadcasts the default value to the new column immediately.

df['ClassRanking'] = None
df

          Name      Class  Score  ClassRanking
school1  Alice    Physics     85          None
school2   Jack  Chemistry     82          None
school1  Helen    Biology     90          None
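
A new column does not have to be a single broadcast value. In the sketch below, the column names `Passed` and `Rank`, the pass mark of 85, and the scratch frame with unique row labels are all made up for illustration. One column is derived from an existing column, and the other is assigned from a list with one value per row.

```python
import pandas as pd

# scratch frame used only for this illustration
scratch = pd.DataFrame(
    [{'Name': 'Alice', 'Score': 85},
     {'Name': 'Jack', 'Score': 82},
     {'Name': 'Helen', 'Score': 90}],
    index=['school1', 'school2', 'school3'])

# a column computed from an existing one (the 85 pass mark is hypothetical)
scratch['Passed'] = scratch['Score'] >= 85
# a column given as a list, one value per row, in row order
scratch['Rank'] = [2, 3, 1]

print(list(scratch.columns))
print(list(scratch['Passed']))
```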


In this lecture we have learned about the data structure we'll use the most in Pandas, the DataFrame. The DataFrame is indexed both by row and column, and we can easily select individual rows and project the columns we're interested in using the familiar indexing methods from the Series class.