### DataFrame Data Structure

The DataFrame is conceptually a two-dimensional Series object: it has an index and multiple columns of content, and each column has a label. In fact, the distinction between a column and a row is largely conceptual, and you can think of the DataFrame itself as simply a two-axis labeled array.
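
As a minimal sketch of the two labeled axes, the snippet below builds a small DataFrame and inspects its row labels, column labels, and shape. The sample values and the dictionary-of-lists constructor are chosen for illustration only.

```python
import pandas as pd

# Illustrative data: two rows labeled by school, two labeled columns
df = pd.DataFrame(
    {"Name": ["Alice", "Mark"], "Score": [89, 95]},
    index=["School 1", "School 2"],
)

print(df.index.tolist())    # row labels (axis 0)
print(df.columns.tolist())  # column labels (axis 1)
print(df.shape)             # (number of rows, number of columns)
```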

In [1]:
import pandas as pd

In [2]:
record_1 = pd.Series({'Name' : 'Alice',
                     'Class': 'Chemistry',
                     'Score': 89})
record_2 = pd.Series({'Name' : 'Mark',
                     'Class': 'Botany',
                     'Score': 95})
record_3 = pd.Series({'Name' : 'Lilly',
                     'Class': 'Psychology',
                     'Score': 98})

In [22]:
# Like a Series, the DataFrame object is indexed. Here I'll use a group of Series, where each Series
# represents a row of data. Just like with the Series constructor, we can pass in our individual items
# as a list, and we can pass in our index values as a second argument

df = pd.DataFrame([record_1, record_2, record_3], index=['School 1', 'School 2', 'School 3'])
df.head()

Unnamed: 0,Name,Class,Score
School 1,Alice,Chemistry,89
School 2,Mark,Botany,95
School 3,Lilly,Psychology,98


In [20]:
# Jupyter creates a nice bit of HTML to render the results of the DataFrame.
# So we have the index, which is the leftmost column and holds the school names, and
# then we have the rows of data, where each column has a header taken from the keys of our
# initial record dictionaries

In [27]:
# An alternative is to use a list of dictionaries, where each dictionary represents a row of data.
# Note that the index below repeats 'School 1', so two rows share the same label.

students = [{'Name': 'Alice',
             'Class': 'Chemistry',
             'Score': 89},
            {'Name': 'Mark',
             'Class': 'Botany',
             'Score': 95},
            {'Name': 'Lilly',
             'Class': 'Psychology',
             'Score': 98}]

df = pd.DataFrame(students, index=['School 1', 'School 2', 'School 1'])
df.head()

Unnamed: 0,Name,Class,Score
School 1,Alice,Chemistry,89
School 2,Mark,Botany,95
School 1,Lilly,Psychology,98
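
For comparison, here is a hedged sketch of a column-oriented alternative: a single dictionary whose keys are column names and whose values are lists of column data. The variable names are made up for this example; the data matches the rows used above.

```python
import pandas as pd

labels = ["School 1", "School 2", "School 1"]

# Row-oriented input: one dictionary per row
rows = [
    {"Name": "Alice", "Class": "Chemistry", "Score": 89},
    {"Name": "Mark", "Class": "Botany", "Score": 95},
    {"Name": "Lilly", "Class": "Psychology", "Score": 98},
]

# Column-oriented input: one list per column
columns = {
    "Name": ["Alice", "Mark", "Lilly"],
    "Class": ["Chemistry", "Botany", "Psychology"],
    "Score": [89, 95, 98],
}

df_from_rows = pd.DataFrame(rows, index=labels)
df_from_columns = pd.DataFrame(columns, index=labels)

# Compare the two frames element by element
same = df_from_rows.equals(df_from_columns)
print(same)
```

Which form is more convenient usually depends on how the data arrives: record by record, or field by field.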


In [28]:
# Similar to the Series, we can extract data using the .iloc and .loc attributes. Because the
# DataFrame is two-dimensional, passing a single value to the .loc indexing operator will return
# a Series if there's only one row to return.

# For instance, if we wanted to select data associated with School 2, we would just query the
# .loc attribute with one parameter.
df.loc['School 2']

Name       Mark
Class    Botany
Score        95
Name: School 2, dtype: object

In [29]:
# We can check the data type of the return using the python type function
type(df.loc['School 2'])

pandas.core.series.Series
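
If a DataFrame is wanted even when only one row matches, one option is to pass the label inside a list. The sketch below assumes a trimmed-down version of the data above.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Mark", "Lilly"], "Score": [89, 95, 98]},
    index=["School 1", "School 2", "School 1"],
)

one_row = df.loc["School 2"]     # a bare label that matches one row gives a Series
as_frame = df.loc[["School 2"]]  # a list of labels gives a DataFrame

print(type(one_row).__name__)
print(type(as_frame).__name__)
```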

In [32]:
# One of the powers of the pandas DataFrame is that you can quickly select data based on multiple axes.
# To do so, you supply two parameters to .loc, one being the row index and the other being the
# column name.

# For instance, if we are only interested in School 1's student names
df.loc['School 1', 'Name']

School 1    Alice
School 1    Lilly
Name: Name, dtype: object
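
The shape of the result depends on how many rows carry the label. As a small sketch using the same duplicated index: a label that appears once yields a single value, while a repeated label yields a Series.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Mark", "Lilly"], "Score": [89, 95, 98]},
    index=["School 1", "School 2", "School 1"],
)

single = df.loc["School 2", "Name"]   # label appears once: one value comes back
several = df.loc["School 1", "Name"]  # label appears twice: a Series comes back

print(single)
print(several.tolist())
```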

In [33]:
# Remember, just like the Series, the pandas developers have implemented this using the indexing
# operator and not as parameters to a function.

# What would we do if we just wanted to select a single column though? Well, there are a few
# mechanisms. Firstly, we could transpose the matrix. This pivots all of the rows into columns
# and all of the columns into rows, and is done with the T attribute
df.T

Unnamed: 0,School 1,School 2,School 1
Name,Alice,Mark,Lilly
Class,Chemistry,Botany,Psychology
Score,89,95,98


In [34]:
# Then we can call .loc on the transpose to get the student names only
df.T.loc['Name']

School 1    Alice
School 2     Mark
School 1    Lilly
Name: Name, dtype: object

In [36]:
# However, since iloc and loc are used for row selection, pandas reserves the indexing operator
# directly on the DataFrame for column selection. In a pandas DataFrame, columns always have a name.
# So this selection is always label based, and is not as confusing as it was when using the square
# bracket operator on the Series objects. For those familiar with relational databases, this operator
# is analogous to column projection.
df['Name']

School 1    Alice
School 2     Mark
School 1    Lilly
Name: Name, dtype: object

In [37]:
# In practice, this works really well since you're often trying to add or drop columns. However,
# this also means that you get a KeyError if you try to use .loc with a column name as its only argument

# df.loc['Name'] --> This will raise a KeyError.
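
A minimal sketch of that failure, using a small stand-in DataFrame, catches the exception so the cell keeps running, then reaches the same column by passing it to .loc as the second argument.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Mark", "Lilly"], "Score": [89, 95, 98]},
    index=["School 1", "School 2", "School 1"],
)

caught = False
try:
    df.loc["Name"]  # 'Name' is a column label, not a row label
except KeyError:
    caught = True

print(caught)

# A full row slice plus the column label selects the column through .loc
names = df.loc[:, "Name"]
print(names.tolist())
```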

In [38]:
# Note too that the result of a single column projection is a Series object
type(df['Name'])

pandas.core.series.Series
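
By contrast, passing a list of column names to the indexing operator returns a DataFrame, even when the list has one entry. A short sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Mark", "Lilly"], "Score": [89, 95, 98]},
    index=["School 1", "School 2", "School 1"],
)

as_series = df["Name"]           # single label: Series
as_frame = df[["Name"]]          # list with one label: DataFrame with one column
two_cols = df[["Name", "Score"]] # list with two labels: DataFrame with two columns

print(type(as_series).__name__, type(as_frame).__name__)
print(two_cols.columns.tolist())
```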

In [39]:
# Since the result of using the indexing operator is either a DataFrame or Series, you can chain
# operations together. For instance, we can select all of the rows which relate to School 1 using
# .loc, then project the Name column from just those rows
df.loc['School 1']['Name']

School 1    Alice
School 1    Lilly
Name: Name, dtype: object

In [42]:
# If you get confused, use type to check the responses from resulting operations
print(type(df.loc['School 1'])) # Should be a DataFrame
print(type(df.loc['School 1']['Name'])) # Should be a Series

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [43]:
# Chaining, that is, indexing on the result of another indexing operation, can come with some costs
# and is best avoided if you can use another approach. In particular, chaining tends to cause pandas
# to return a copy of the DataFrame instead of a view on the DataFrame.
# For selecting data, this is not a big deal, though it might be slower than necessary.
# If you are changing data, though, this is an important distinction and can be a source of error.
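
When changing data, a single .loc call with both the row and column labels avoids the intermediate object altogether. The sketch below uses a stand-in DataFrame and an arbitrary replacement score of 100; the chained form is left as a comment because its behavior depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Mark", "Lilly"], "Score": [89, 95, 98]},
    index=["School 1", "School 2", "School 1"],
)

# Chained form, not executed here:
#   df.loc["School 1"]["Score"] = 100
# The first step may hand back a copy, so the assignment may never reach df.

# Single-step form: row label and column label in one .loc call
df.loc["School 1", "Score"] = 100

print(df["Score"].tolist())
```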

In [44]:
# Here's another approach. As we saw, .loc does row selection, and it can take two parameters,
# the row index and the list of column names. The .loc attribute also supports slicing.

# If we wanted to select all rows, we can use a colon to indicate a full slice from beginning to end.
# This is just like slicing a list in Python. Then we can add the column name as the
# second parameter as a string. If we wanted to include multiple columns, we could pass them in a
# list, and pandas will bring back only the columns we have asked for.

# Here's an example, where we ask for all the names and scores for all schools using the .loc operator.
df.loc[:, ['Name', 'Score']]

Unnamed: 0,Name,Score
School 1,Alice,89
School 2,Mark,95
School 1,Lilly,98
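
Slices in .loc need not cover every row. As a sketch on a DataFrame with unique, ordered labels (an assumption made so the slice is unambiguous), a label slice includes both endpoints, unlike a positional slice with .iloc.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Mark", "Lilly"],
        "Class": ["Chemistry", "Botany", "Psychology"],
        "Score": [89, 95, 98],
    },
    index=["School 1", "School 2", "School 3"],
)

# Label slice: both 'School 1' and 'School 2' are included
by_label = df.loc["School 1":"School 2", ["Name", "Score"]]

# Positional slice: the stop position is excluded, as with Python lists
by_position = df.iloc[0:1]

print(by_label.index.tolist())
print(by_position.index.tolist())
```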


In [45]:
# It's easy to delete data in Series and DataFrames, and we can use the drop function to do so.
# This function takes a single parameter: the index or row label to drop. This is another
# tricky place for new users: the drop function doesn't change the DataFrame by default! Instead,
# the drop function returns to you a copy of the DataFrame with the given rows removed.

df.drop('School 1')

Unnamed: 0,Name,Class,Score
School 2,Mark,Botany,95


In [46]:
# But if we look at our original DataFrame we see the data is still intact.
df

Unnamed: 0,Name,Class,Score
School 1,Alice,Chemistry,89
School 2,Mark,Botany,95
School 1,Lilly,Psychology,98
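
Since drop returns a new object, a common pattern is to bind the result to a name. The sketch below also passes a list of labels to drop several rows at once; the unique index is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Mark", "Lilly"], "Score": [89, 95, 98]},
    index=["School 1", "School 2", "School 3"],
)

# Keep the returned copy; df itself is left untouched
remaining = df.drop(["School 1", "School 3"])

print(remaining.index.tolist())
print(len(df))
```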


In [58]:
# Drop has two interesting optional parameters. The first is called inplace, and if it's
# set to True, the DataFrame will be updated in place, instead of a copy being returned.
# The second parameter is axis, which says which axis to drop from. By default, this value is 0,
# indicating the row axis. But you could change it to 1 if you want to drop a column.

# For example, let's make a copy of a DataFrame using .copy()
copy_df = df.copy()
# Now let's drop the Name column in this copy
copy_df.drop("Name", inplace=True, axis=1)
copy_df

Unnamed: 0,Class,Score
School 1,Chemistry,89
School 2,Botany,95
School 1,Psychology,98
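
As a hedged alternative to axis=1, drop also accepts a columns keyword. The sketch below uses it without inplace, so the original stays intact and the result is kept under a new name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Mark", "Lilly"],
        "Class": ["Chemistry", "Botany", "Psychology"],
        "Score": [89, 95, 98],
    },
    index=["School 1", "School 2", "School 1"],
)

trimmed = df.drop(columns=["Name"])   # keyword form
same_result = df.drop("Name", axis=1) # axis form used above

print(trimmed.columns.tolist())
print(trimmed.equals(same_result))
```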


In [59]:
# There is a second way to drop a column, and that's directly through the use of the indexing
# operator, using the del keyword. This way of dropping data, however, takes immediate effect
# on the DataFrame and does not return a copy.

del copy_df['Class']
copy_df

Unnamed: 0,Score
School 1,89
School 2,95
School 1,98


In [64]:
# Finally, adding a new column to the DataFrame is as easy as assigning it to some value using
# the indexing operator. For instance, if we wanted to add a class ranking column with default 
# value of None, we could do so by using the assignment operator after the square brackets.
# This broadcasts the default value to the new column immediately.

df['ClassRanking'] = None
df

Unnamed: 0,Name,Class,Score,ClassRanking
School 1,Alice,Chemistry,89,
School 2,Mark,Botany,95,
School 1,Lilly,Psychology,98,
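
The same assignment syntax accepts more than a single default value. As a sketch, the new column can be given one value per row, or computed from an existing column. The ranking values, the HighScore column name, and the threshold of 90 are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Mark", "Lilly"], "Score": [89, 95, 98]},
    index=["School 1", "School 2", "School 1"],
)

# One value per row (hypothetical rankings), matched to rows by position
df["ClassRanking"] = [3, 2, 1]

# A column derived from another column (hypothetical threshold)
df["HighScore"] = df["Score"] >= 90

print(df.columns.tolist())
```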
