In [1]:
import pandas as pd
import numpy as np

In [2]:
print(pd.__version__)
print(np.__version__)

2.2.0
1.26.4


We'd be remiss if we didn't cover _what dataframes are_ and how to _create_ dataframes with Pandas, so here's a quick primer.


A Pandas DataFrame is a way to store data in Python that's similar to an Excel spreadsheet or a table in a relational database. Like an Excel spreadsheet, it organizes data into rows and columns, which makes it easy to see and work with. Each column in a DataFrame holds data of the same type, for example, all numbers or all text, just like columns in Excel. Each row in a DataFrame is like a record in a database table, containing different types of data across its columns.

If you're familiar with how databases work, you know that you can filter data, join tables, and aggregate data in a database. DataFrames allow you to do similar things in Python. You can filter rows, join DataFrames together like you would join tables in a database, and summarize data. This makes DataFrames a powerful tool for data analysis, giving you a familiar way to handle data in Python if you're already used to working with Excel spreadsheets or database tables.
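
To make the database analogy concrete, here is a minimal sketch of filtering, joining, and aggregating. The `orders` and `customers` tables and their values are hypothetical examples made up for illustration; they are not part of the course data.

```python
import pandas as pd

# Hypothetical example data, made up for illustration
orders = pd.DataFrame(
    {
        "customer_id": [1, 2, 1, 3],
        "amount": [20.0, 35.0, 15.0, 50.0],
    }
)
customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "name": ["Ana", "Ben", "Cy"],
    }
)

# Filter: keep rows where amount exceeds 18 (similar to SQL WHERE)
large_orders = orders[orders["amount"] > 18]

# Join: match rows on a shared key (similar to SQL INNER JOIN)
joined = orders.merge(customers, on="customer_id", how="inner")

# Aggregate: total amount per customer (similar to SQL GROUP BY)
totals = joined.groupby("name")["amount"].sum()

print(large_orders)
print(joined)
print(totals)
```

Each step returns a new object and leaves `orders` and `customers` unchanged, which is the typical pattern in Pandas.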

DataFrames can be constructed like any other object: you just need an `index`, `columns`, and `data`!

In [3]:
new_df = pd.DataFrame(
    columns=["A", "B", "C", "D", "E"],
    index=["V", "W", "X", "Y", "Z"],
    # Generate a 2D Vandermonde matrix (https://en.wikipedia.org/wiki/Vandermonde_matrix)
    data=np.vander((5, 4, 3, 2, 1), 5),
)
display(new_df)

Unnamed: 0,A,B,C,D,E
V,625,125,25,5,1
W,256,64,16,4,1
X,81,27,9,3,1
Y,16,8,4,2,1
Z,1,1,1,1,1
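
Another common way to build a dataframe is from a dictionary of columns. The sketch below is one possible approach; the name `dict_df` and the subset of values are chosen only for illustration.

```python
import pandas as pd

# Each dict key becomes a column name; the index supplies the row labels
dict_df = pd.DataFrame(
    {"A": [625, 256, 81], "B": [125, 64, 27]},
    index=["V", "W", "X"],
)
print(dict_df)
```

This form tends to be handy when the data already arrives column by column, for example as lists collected in a loop.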


Columns can be added through assignment. Element-wise operations such as addition work across columns of compatible types (for example, all numeric).

In [4]:
new_df["F"] = new_df["A"] + new_df["B"] + new_df["C"] + new_df["D"] + new_df["E"]
display(new_df)

Unnamed: 0,A,B,C,D,E,F
V,625,125,25,5,1,781
W,256,64,16,4,1,341
X,81,27,9,3,1,121
Y,16,8,4,2,1,31
Z,1,1,1,1,1,5
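
When many columns are involved, chaining `+` gets tedious. As a sketch of an alternative, the same row totals can be computed with `DataFrame.sum(axis=1)`. The name `df` is used here only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Rebuild the same Vandermonde-based dataframe used above
df = pd.DataFrame(
    columns=["A", "B", "C", "D", "E"],
    index=["V", "W", "X", "Y", "Z"],
    data=np.vander((5, 4, 3, 2, 1), 5),
)

# axis=1 sums across the columns, producing one total per row
df["F"] = df[["A", "B", "C", "D", "E"]].sum(axis=1)
print(df["F"].tolist())
```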


Columns can also be dropped or renamed, but it's usually best to assign the result to a new dataframe rather than modify the original in place:

In [5]:
# Hopefully, you're a bit more creative than us :)

# Drop and assign to a new dataframe
new_new_df = new_df.drop("F", axis=1)

# Drop the data in place
new_df.drop("F", axis=1, inplace=True)

In [6]:
new_new_df

Unnamed: 0,A,B,C,D,E
V,625,125,25,5,1
W,256,64,16,4,1
X,81,27,9,3,1
Y,16,8,4,2,1
Z,1,1,1,1,1
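
Renaming follows the same pattern as dropping. Here is a minimal sketch using `DataFrame.rename`; the small dataframe and the new column name `alpha` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["V", "W"])

# rename returns a new dataframe; the original keeps its column names
renamed_df = df.rename(columns={"A": "alpha"})

print(renamed_df.columns.tolist())
print(df.columns.tolist())
```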


We can also manipulate entire dataframes:

In [7]:
double_new_df = new_df * 2
display(double_new_df)

Unnamed: 0,A,B,C,D,E
V,1250,250,50,10,2
W,512,128,32,8,2
X,162,54,18,6,2
Y,32,16,8,4,2
Z,2,2,2,2,2


The main elements of a dataframe are its columns, index, and data. We can access these at any time:

In [8]:
new_df.columns

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [9]:
new_df.index

Index(['V', 'W', 'X', 'Y', 'Z'], dtype='object')

In [10]:
new_df

Unnamed: 0,A,B,C,D,E
V,625,125,25,5,1
W,256,64,16,4,1
X,81,27,9,3,1
Y,16,8,4,2,1
Z,1,1,1,1,1


We can slice or "locate" data in dataframes in a few ways: by selecting columns directly _or_ by using the `.loc` (label-based) or `.iloc` (position-based) indexers.

In [11]:
new_df["A"]

V    625
W    256
X     81
Y     16
Z      1
Name: A, dtype: int64

In [12]:
new_df.loc["V"]

A    625
B    125
C     25
D      5
E      1
Name: V, dtype: int64

In [16]:
# .loc is label-based, so an integer slice fails on this string index
new_df.loc[0:1]

TypeError: cannot do slice indexing on Index with these indexers [0] of type int

In [13]:
new_df.iloc[0]

A    625
B    125
C     25
D      5
E      1
Name: V, dtype: int64
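
The difference between label-based and position-based slicing is easy to trip over, so here is a short sketch. It rebuilds the same dataframe under the illustrative name `df`. Note that `.loc` slices include the end label, while `.iloc` slices exclude the stop position, as in ordinary Python slicing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns=["A", "B", "C", "D", "E"],
    index=["V", "W", "X", "Y", "Z"],
    data=np.vander((5, 4, 3, 2, 1), 5),
)

# Label-based slice: both endpoints are included (rows V and W)
by_label = df.loc["V":"W"]

# Position-based slice: the stop position is excluded (row V only)
by_position = df.iloc[0:1]

# A single cell by row label and column label
cell = df.loc["X", "B"]

print(by_label)
print(by_position)
print(cell)
```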

Like relational data, dataframes can be joined (JOIN) or concatenated (UNION ALL, since duplicate rows are kept). The easiest way to do this with Pandas is using [pd.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html).

In [14]:
join_df = pd.concat([new_df, double_new_df], axis=1)
display(join_df)

Unnamed: 0,A,B,C,D,E,A,B,C,D,E
V,625,125,25,5,1,1250,250,50,10,2
W,256,64,16,4,1,512,128,32,8,2
X,81,27,9,3,1,162,54,18,6,2
Y,16,8,4,2,1,32,16,8,4,2
Z,1,1,1,1,1,2,2,2,2,2


In [15]:
union_df = pd.concat([new_df, double_new_df], axis=0)
display(union_df)

Unnamed: 0,A,B,C,D,E
V,625,125,25,5,1
W,256,64,16,4,1
X,81,27,9,3,1
Y,16,8,4,2,1
Z,1,1,1,1,1
V,1250,250,50,10,2
W,512,128,32,8,2
X,162,54,18,6,2
Y,32,16,8,4,2
Z,2,2,2,2,2


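Of course, you can also pass the `join` parameter (`"inner"` or `"outer"`) to control how non-matching labels are handled, much like join types in SQL, and the `keys` parameter to label which input each row or column came from.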
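
As a rough sketch of how `join` and `keys` behave in `pd.concat`, consider two small dataframes whose indices only partly overlap. The names `left` and `right` and their values are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3]}, index=["V", "W", "X"])
right = pd.DataFrame({"B": [10, 20, 30]}, index=["W", "X", "Y"])

# Default join="outer": keep every index label, filling gaps with NaN
outer = pd.concat([left, right], axis=1)

# join="inner": keep only the labels present in both inputs
inner = pd.concat([left, right], axis=1, join="inner")

# keys adds an outer index level recording which input each row came from
keyed = pd.concat([left, left * 2], axis=0, keys=["original", "doubled"])

print(outer)
print(inner)
print(keyed)
```

For joins on column values rather than on the index, `DataFrame.merge` is usually the closer match to a SQL `JOIN`.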

So those are the basics of DataFrames! You can think of them as you would a relational database table or a spreadsheet: they're nothing more than a tabular way to store data in labeled rows and columns.

DataFrames can be manipulated with Pandas, Polars, NumPy, or DuckDB, as you'll see in this course. There's an extensive amount of documentation online, so if you get stuck, Google and Stack Overflow are your friends!