In [1]:
import pandas as pd
import numpy as np

## Using Pandas to work with dataframes

We'll explore the widely used pandas library in this part of the course. We'll cover:
- Intro to pandas
- Creating dataframes
- Structure of a dataframe
- Viewing dataframes
- Slicing dataframes
- Joining dataframes

### Intro to pandas
Pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

Most of the functionality of pandas is delivered through two data structures: the DataFrame and the Series.

### Creating dataframes

We can create a DataFrame from an iterable of iterables (such as a list of lists or a list of dictionaries), or from a NumPy array:

In [2]:
data = [[i + 3*j for i in range(3)] for j in range(4)]
df = pd.DataFrame(data)
df.head()

Unnamed: 0,0,1,2
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11


pandas automatically labels the columns and the index with sequential integers if we don't provide them.

In [3]:
data = [[i + 3*j for i in range(3)] for j in range(4)]
df = pd.DataFrame(data, index = ['Lisa', 'Bart', 'Maggie', 'Nelson'], columns = ['Age','Salary','Grade'])
df.head()

Unnamed: 0,Age,Salary,Grade
Lisa,0,1,2
Bart,3,4,5
Maggie,6,7,8
Nelson,9,10,11


We've provided our own index labels and column names for this data.

We can do the same using a 2D NumPy array:

In [4]:
df = pd.DataFrame(np.random.randn(2, 3), columns = ['sin', 'cos', 'tan'])
df.head()

Unnamed: 0,sin,cos,tan
0,-0.721819,-1.625807,0.70112
1,-0.783506,-1.30809,-0.118255


We've created a dataframe of random floats with our own column names and the default index. The values are random, so they will differ on each run.

We can also create dataframes from CSV and Excel files.

In [5]:
pd.read_csv('_data/simpsons.csv')

Unnamed: 0,Character,Profession,Age,Alignment
0,Bart,Student,8,Chaotic good
1,Homer,Nuclear technician,47,Neutral good
2,Lisa,Student,9,Lawful good
3,Nelson,Student,10,Chaotic evil
4,Mongomery,Billionare,80,Lawful evil
5,Moe,Bartender,65,Neutral neutral
6,Barney,Unemployed,35,Chaotic neutral


Pandas has loaded our CSV file into a dataframe and inferred the column names from the header row.

In [6]:
pd.read_excel('_data/Futurama.xlsx')

Unnamed: 0,Character,Profession,Age
0,Fry,Delivery person,27
1,Bender,Robot,timeless
2,Farnsworth,Professor,300


Pandas has loaded our Excel file into a dataframe and inferred the column names.

### Structure of a dataframe
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns.

Rows are labelled with an index, which can be of any type - by default this is a sequential set of integers.

In [7]:
import pandas as pd
import numpy as np
df = pd.DataFrame([{'name':'Nickols', 'height in cm':123, 'hair colour': 'red'},
                   {'name':'Benjals', 'height in cm':200, 'hair colour': 'black'},
                   {'name':'Dennisons', 'height in cm':180, 'hair colour': 'brunette'}])
df.head()

Unnamed: 0,name,height in cm,hair colour
0,Nickols,123,red
1,Benjals,200,black
2,Dennisons,180,brunette


We've created a simple dataframe with columns name, height in cm and hair colour

Every column in a pandas dataframe is a Series object:

In [8]:
print(df['name'])
print(type(df['name']))

0      Nickols
1      Benjals
2    Dennisons
Name: name, dtype: object
<class 'pandas.core.series.Series'>


We can see the 'name' column is a Series containing 3 values.

You can also create a Series from scratch:

In [9]:
ages = pd.Series([23, 45, 37])
print(ages)

0    23
1    45
2    37
dtype: int64


Here we've created a Series of integers - we could add this as a column to our DataFrame.
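As a rough sketch of that last step, a Series can be assigned to a new column name. The column names `age` and `age_shifted` are illustrative choices, not part of the original data:

```python
import pandas as pd

# Same small frame as in the example above.
df = pd.DataFrame([
    {'name': 'Nickols', 'height in cm': 123, 'hair colour': 'red'},
    {'name': 'Benjals', 'height in cm': 200, 'hair colour': 'black'},
    {'name': 'Dennisons', 'height in cm': 180, 'hair colour': 'brunette'},
])

ages = pd.Series([23, 45, 37])

# Assignment aligns on the index: the row labelled 0 receives ages[0], and so on.
df['age'] = ages

# If the labels don't line up, rows without a matching label get NaN.
shifted = pd.Series([23, 45, 37], index=[1, 2, 3])
df['age_shifted'] = shifted

print(df)
```

Because values are matched by index label rather than by position, it is worth checking that the Series and the DataFrame share the same index before assigning.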

### Viewing dataframes
Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame respectively:

In [10]:
df = pd.read_csv('_data/simpsons.csv')
df.head()

Unnamed: 0,Character,Profession,Age,Alignment
0,Bart,Student,8,Chaotic good
1,Homer,Nuclear technician,47,Neutral good
2,Lisa,Student,9,Lawful good
3,Nelson,Student,10,Chaotic evil
4,Mongomery,Billionare,80,Lawful evil


By default df.head() and df.tail() show 5 rows; pass a number to show a different count.

In [11]:
df.tail(3)

Unnamed: 0,Character,Profession,Age,Alignment
4,Mongomery,Billionare,80,Lawful evil
5,Moe,Bartender,65,Neutral neutral
6,Barney,Unemployed,35,Chaotic neutral


We've chosen to look at the 3 last rows in our DataFrame.

Display the DataFrame.index or DataFrame.columns:

In [12]:
print(df.index)
print(df.columns)

RangeIndex(start=0, stop=7, step=1)
Index(['Character', 'Profession', 'Age', 'Alignment'], dtype='object')


We can see there's an auto-generated index, and we have four named columns.

describe() shows a quick statistical summary of your data:

In [13]:
df.describe()

Unnamed: 0,Age
count,7.0
mean,36.285714
std,29.118804
min,8.0
25%,9.5
50%,35.0
75%,56.0
max,80.0


pandas summarises only the numerical Age column; non-numeric columns are skipped by default.

DataFrame.sort_values() sorts by values:

In [14]:
df.sort_values(by='Age')

Unnamed: 0,Character,Profession,Age,Alignment
0,Bart,Student,8,Chaotic good
2,Lisa,Student,9,Lawful good
3,Nelson,Student,10,Chaotic evil
6,Barney,Unemployed,35,Chaotic neutral
1,Homer,Nuclear technician,47,Neutral good
5,Moe,Bartender,65,Neutral neutral
4,Mongomery,Billionare,80,Lawful evil


We've sorted our table by Age - notice the indices move with the data.
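If a fresh 0..n-1 index is wanted after sorting, `reset_index(drop=True)` is one option. A minimal sketch on an inline stand-in frame (the real data lives in the CSV file):

```python
import pandas as pd

# Inline stand-in for a few rows of the Simpsons data.
df = pd.DataFrame({
    'Character': ['Bart', 'Homer', 'Lisa'],
    'Age': [8, 47, 9],
})

# sort_values returns a new frame; the original df is left unchanged.
by_age = df.sort_values(by='Age', ascending=False)
print(by_age.index.tolist())  # original labels, in their new order

# drop=True discards the old labels instead of keeping them as a column.
renumbered = by_age.reset_index(drop=True)
print(renumbered)
```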

### Slicing dataframes

We can select a single column of a dataframe using square brackets, which returns a Series:

In [15]:
df['Age']

0     8
1    47
2     9
3    10
4    80
5    65
6    35
Name: Age, dtype: int64

We've selected the 'Age' column.

We can then index into the Series to return a single value:

In [16]:
age_series = df['Age']
age_series[0]

8

Grab the value labelled 0 in the 'Age' column. With the default integer index, label 0 is also the first row.

You can pass a list of columns to [] to select columns in that order. The result is a DataFrame.
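The difference between looking up by label and looking up by position only shows once the index is no longer in its default order. A minimal sketch, assuming a small stand-in Series (not the CSV data) and using the standard `.loc` and `.iloc` accessors:

```python
import pandas as pd

# Stand-in for the 'Age' column; the values are illustrative.
ages = pd.Series([8, 47, 9], name='Age')

# Sorting reorders the rows but keeps each row's original label.
sorted_ages = ages.sort_values(ascending=False)

# .loc looks up by index label, .iloc by position.
by_label = sorted_ages.loc[0]      # the value that carries label 0
by_position = sorted_ages.iloc[0]  # the first value after sorting

print(by_label, by_position)
```

Using `.loc` or `.iloc` explicitly makes it clear which of the two lookups is intended.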

In [17]:
df[['Character','Age']]

Unnamed: 0,Character,Age
0,Bart,8
1,Homer,47
2,Lisa,9
3,Nelson,10
4,Mongomery,80
5,Moe,65
6,Barney,35


Here we see the subset of columns we've selected.

Use loc to access a subset of rows in the dataframe by label. loc slicing looks like array slicing, but it works on index labels and includes both endpoints.

In [18]:
df.loc[1]

Character                  Homer
Profession    Nuclear technician
Age                           47
Alignment           Neutral good
Name: 1, dtype: object

Grab the row with index label 1; a single row comes back as a Series.

In [19]:
df.loc[1:3, 'Profession':'Alignment']

Unnamed: 0,Profession,Age,Alignment
1,Nuclear technician,47,Neutral good
2,Student,9,Lawful good
3,Student,10,Chaotic evil


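A more complicated slice of our data: rows labelled 1 through 3 and columns 'Profession' through 'Alignment', with both endpoints included.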
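For comparison, here is a small sketch of label-based slicing with `loc` next to position-based slicing with `iloc`. The frame is an inline stand-in for the CSV data, so the example runs without the file:

```python
import pandas as pd

# Inline stand-in for part of the Simpsons data.
df = pd.DataFrame({
    'Character': ['Bart', 'Homer', 'Lisa', 'Nelson'],
    'Profession': ['Student', 'Nuclear technician', 'Student', 'Student'],
    'Age': [8, 47, 9, 10],
})

# loc slices by label and includes both endpoints: rows 1, 2 and 3.
label_slice = df.loc[1:3, 'Profession':'Age']

# iloc slices by position and excludes the stop value: rows 1 and 2.
position_slice = df.iloc[1:3, 1:3]

print(label_slice)
print(position_slice)
```

The same `1:3` therefore selects three rows with `loc` but two rows with `iloc`.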

### Joining dataframes


The concat function does all the heavy lifting of concatenating similar dataframes together along an axis:

In [20]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)


df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)


df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)

pd.concat([df1, df2, df3], join='outer')

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


Dataframes concatenated along axis 0
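The frames above have non-overlapping index labels, so the result has a clean index. When the inputs share labels, concat keeps them as they are, and `ignore_index=True` can be used to renumber the rows. A small sketch with two toy frames (hypothetical data):

```python
import pandas as pd

# Two toy frames that both use the default index 0, 1.
a = pd.DataFrame({'A': ['A0', 'A1']})
b = pd.DataFrame({'A': ['A2', 'A3']})

# The original labels are kept, so 0 and 1 each appear twice.
stacked = pd.concat([a, b])
print(stacked.index.tolist())

# ignore_index=True replaces them with a fresh 0..n-1 index.
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())
```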

When gluing together multiple DataFrames, you have a choice of how to handle the other axes (other than the one being concatenated). This can be done in the following two ways:

- Take the union of them all, join='outer'. This is the default option as it results in zero information loss.
- Take the intersection, join='inner'.
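The two options can be compared on a pair of toy frames stacked along axis 0, where the "other axis" is the columns. This is only a sketch with made-up data:

```python
import pandas as pd

# Toy frames that share only column 'B'.
top = pd.DataFrame({'A': ['A0'], 'B': ['B0']})
bottom = pd.DataFrame({'B': ['B1'], 'C': ['C1']})

# Outer: keep every column from either frame, filling the gaps with NaN.
union = pd.concat([top, bottom], join='outer', ignore_index=True)

# Inner: keep only the columns present in both frames.
shared = pd.concat([top, bottom], join='inner', ignore_index=True)

print(union)
print(shared)
```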

In [21]:
df4 = pd.DataFrame(
    {
        "E": ["B2", "B3", "B6", "B7"],
        "F": ["D2", "D3", "D6", "D7"],
        "G": ["F2", "F3", "F6", "F7"],
    },
    index=[2, 3, 6, 7],
)


pd.concat([df1, df4], axis=1, join = 'outer')

Unnamed: 0,A,B,C,D,E,F,G
0,A0,B0,C0,D0,,,
1,A1,B1,C1,D1,,,
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3
6,,,,,B6,D6,F6
7,,,,,B7,D7,F7


An outer join: every index label from either frame is kept, and cells with no matching row are left as missing values (NaN).

In [22]:
df4 = pd.DataFrame(
    {
        "D": ["B2", "B3", "B6", "B7"],
        "E": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
    index=[2, 3, 6, 7],
)


pd.concat([df1, df4], axis=1, join = 'inner')

Unnamed: 0,A,B,C,D,D.1,E,F
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3


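Concat with join='inner' returns only the rows whose index labels appear in both frames. Both frames contain a column named D, so the result holds two D columns (the second appears as D.1 in the table above).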
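Concatenating along axis=1 does not rename clashing columns, so a name that exists in both frames ends up duplicated. The sketch below uses cut-down versions of df1 and df4 to show this, and renames the column beforehand as one possible workaround (`D_right` is an arbitrary name chosen for illustration):

```python
import pandas as pd

# Cut-down stand-ins for df1 and df4; both contain a column named 'D'.
df1 = pd.DataFrame({"C": ["C2", "C3"], "D": ["D2", "D3"]}, index=[2, 3])
df4 = pd.DataFrame({"D": ["B2", "B3"], "E": ["D2", "D3"]}, index=[2, 3])

combined = pd.concat([df1, df4], axis=1, join="inner")
print(combined.columns.tolist())  # 'D' is listed twice

# Selecting a duplicated name returns a DataFrame holding both columns.
both_d = combined["D"]

# Renaming before concatenating keeps the column names unique.
renamed = pd.concat(
    [df1, df4.rename(columns={"D": "D_right"})], axis=1, join="inner"
)
print(renamed.columns.tolist())
```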

We can also join or merge two pandas DataFrames:

pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame or named Series objects - let's create a couple of objects to merge:

In [23]:
left = pd.DataFrame({"Character": ['Bart', 'Marge'], "Id": [1, 2]})
left

Unnamed: 0,Character,Id
0,Bart,1
1,Marge,2


In [24]:
right = pd.DataFrame({"Id": [1, 5, 6], "Profession": ['Student', 'Police Officer', 'Doctor']})
right

Unnamed: 0,Id,Profession
0,1,Student
1,5,Police Officer
2,6,Doctor


In [25]:
pd.merge(left, right, on="Id", how="outer")

Unnamed: 0,Character,Id,Profession
0,Bart,1,Student
1,Marge,2,
2,,5,Police Officer
3,,6,Doctor


Here we combine our two tables with an outer join on the Id key, so all rows from both tables are returned, with missing values filled with NaN.

In [26]:
pd.merge(left, right, on="Id", how="inner")

Unnamed: 0,Character,Id,Profession
0,Bart,1,Student


Here we perform an inner join, so we only get rows that have a matching Id in both tables.

In [27]:
pd.merge(left, right, on="Id", how="left")

Unnamed: 0,Character,Id,Profession
0,Bart,1,Student
1,Marge,2,


Here we perform a left join, so all rows in the left table are returned, along with only the matching rows in the right table.

In [28]:
pd.merge(left, right, on="Id", how="right")

Unnamed: 0,Character,Id,Profession
0,Bart,1,Student
1,,5,Police Officer
2,,6,Doctor


Conversely, in a right join, all rows in the right table are returned, along with only the matching rows from the left table.
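To see which table each row of a join came from, merge() accepts an `indicator=True` argument that adds a `_merge` column. A short sketch reusing the left and right tables from above:

```python
import pandas as pd

left = pd.DataFrame({"Character": ['Bart', 'Marge'], "Id": [1, 2]})
right = pd.DataFrame({"Id": [1, 5, 6],
                      "Profession": ['Student', 'Police Officer', 'Doctor']})

# indicator=True adds a '_merge' column with the values
# 'both', 'left_only' or 'right_only' for each row.
merged = pd.merge(left, right, on="Id", how="outer", indicator=True)
print(merged[['Id', '_merge']])

# Map each Id to its source as plain strings, for easy inspection.
source_by_id = dict(zip(merged['Id'], merged['_merge'].astype(str)))
```

Filtering on `_merge` is a convenient way to find keys that are missing from one side.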