### DataFrames

In [1]:
import pandas as pd
import numpy as np

Let's see a few ways of creating `DataFrame` objects:

##### From a list of Series objects

In [2]:
columns = pd.Index(
    [
        'The Bronx', 
        'Brooklyn', 
        'Manhattan', 
        'Queens', 
        'Staten Island'
    ]
)
counties = pd.Series(
    ['Bronx', 'Kings', 'New York', 'Queens', 'Richmond'],
    index=columns,
    name='county'
)
populations = pd.Series(
    [1_418_207, 2_559_903, 1_628_706, 2_253_858, 476_143],
    index = columns,
    name='population'
)
gdp = pd.Series(
    [42.695, 91.559, 600.244, 93.310, 14.514],
    index=columns,
    name='gdp'
)
areas = pd.Series(
    [42.10, 70.82, 22.83, 108.53, 58.37],
    index=columns,
    name='area'
)

In [3]:
new_york = pd.DataFrame([counties, populations, gdp, areas])
new_york

Unnamed: 0,The Bronx,Brooklyn,Manhattan,Queens,Staten Island
county,Bronx,Kings,New York,Queens,Richmond
population,1418207,2559903,1628706,2253858,476143
gdp,42.695,91.559,600.244,93.31,14.514
area,42.1,70.82,22.83,108.53,58.37


As you can see, the series names became the row labels (and in fact the explicit index for the rows), while the common index shared by the series is now the column index for the data frame.

If we want, we can **transpose** the table (switch the columns and rows around):

In [4]:
new_york.transpose()

Unnamed: 0,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37
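If the goal is to have each series become a column straight away, one alternative to building the frame and then transposing it is `pd.concat` with `axis=1`. The following is only a small sketch using a shortened version of the data above (the variable names `boroughs` and `by_column` are made up for this illustration):

```python
import pandas as pd

boroughs = pd.Index(['The Bronx', 'Brooklyn', 'Manhattan'])
counties = pd.Series(
    ['Bronx', 'Kings', 'New York'], index=boroughs, name='county'
)
populations = pd.Series(
    [1_418_207, 2_559_903, 1_628_706], index=boroughs, name='population'
)

# axis=1 places the series side by side, so each series name
# becomes a column label and the shared index labels the rows
by_column = pd.concat([counties, populations], axis=1)
print(by_column)
print(by_column.dtypes)
```

Because each series is kept as its own column here, each column can retain its own data type, which is not the case when heterogeneous rows are transposed.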


##### From a dictionary of Series objects

In [5]:
d = {
    'county': counties,
    'population': populations,
    'gdp': gdp,
    'area': areas
}

new_york = pd.DataFrame(d)
new_york

Unnamed: 0,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


The dictionary keys became the labels for the column index in the data frame, and the rows were aligned to a common row index based on the series indexes (the boroughs).

Again, we can transpose this table if we prefer:

In [6]:
new_york.transpose()

Unnamed: 0,The Bronx,Brooklyn,Manhattan,Queens,Staten Island
county,Bronx,Kings,New York,Queens,Richmond
population,1418207,2559903,1628706,2253858,476143
gdp,42.695,91.559,600.244,93.31,14.514
area,42.1,70.82,22.83,108.53,58.37
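The row alignment that happens when building a data frame from a dictionary of series is easier to see when the series do not share exactly the same labels. The sketch below uses two small made-up series (the names `population`, `area` and `partial` are just for illustration); labels missing from one series should show up as `NaN` in that column:

```python
import pandas as pd

population = pd.Series(
    {'Brooklyn': 2_559_903, 'Queens': 2_253_858}, name='population'
)
area = pd.Series(
    {'Queens': 108.53, 'Manhattan': 22.83}, name='area'
)

# rows are matched by label, not by position; the resulting
# row index is the union of the labels found in both series
partial = pd.DataFrame({'population': population, 'area': area})
print(partial)
```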


##### From a Dictionary of Dictionaries

In [7]:
counties = {
    'The Bronx': 'Bronx',
    'Brooklyn': 'Kings',
    'Manhattan': 'New York',
    'Queens': 'Queens',
    'Staten Island': 'Richmond'
}
populations = {
    # note how the keys are not necessarily in the same order
    'Manhattan': 1_628_706,
    'Queens': 2_253_858,
    'Staten Island': 476_143,
    'The Bronx': 1_418_207,
    'Brooklyn': 2_559_903
}
gdp = {
    'The Bronx': 42.695,
    'Brooklyn': 91.559,
    'Manhattan': 600.244,
    'Queens': 93.310,
    'Staten Island': 14.514
}
areas = {
    'The Bronx': 2.10,
    'Brooklyn': 70.82,
    'Manhattan': 22.83,
    'Queens': 108.53,
    'Staten Island': 58.37
}

d = {
    'county': counties,
    'population': populations,
    'gdp': gdp,
    'area': areas
}

new_york = pd.DataFrame(d)
new_york

Unnamed: 0,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,2.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


As you can see, the keys of the outer dictionary became the columns of the data frame, and the items in the inner dictionaries were "aligned" to the same row index, regardless of the order in which their keys were defined.
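If we would rather have the outer dictionary keys become the rows, one option (besides transposing) is the `DataFrame.from_dict` class method with `orient='index'`. This is a small sketch with a cut-down version of the dictionaries (the name `by_row` is made up for the example):

```python
import pandas as pd

d = {
    'county': {'The Bronx': 'Bronx', 'Brooklyn': 'Kings'},
    # inner keys in a different order, on purpose
    'population': {'Brooklyn': 2_559_903, 'The Bronx': 1_418_207},
}

# orient='index' treats each outer key as a row label,
# so the inner keys end up labelling the columns
by_row = pd.DataFrame.from_dict(d, orient='index')
print(by_row)
```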

##### From a list of Dictionaries

We can also create a data frame from a list of dictionaries. In that case each dictionary becomes a row and its keys become the column labels, but there is nothing to define the row index values, so the data is loaded slightly differently.

Let's see what happens.

In [8]:
new_york = pd.DataFrame([counties, populations, gdp, areas])
new_york

Unnamed: 0,The Bronx,Brooklyn,Manhattan,Queens,Staten Island
0,Bronx,Kings,New York,Queens,Richmond
1,1418207,2559903,1628706,2253858,476143
2,42.695,91.559,600.244,93.31,14.514
3,2.1,70.82,22.83,108.53,58.37


Notice how we did not have anything to define the row indices here, so we ended up with a default explicit index based on the position of each row.

We can rename the row indices (the labels), using the `rename()` method where we specify the old label and the new label using a dictionary:

In [9]:
new_york.rename(index={0: 'county', 1: 'population', 2: 'gdp', 3: 'area'})

Unnamed: 0,The Bronx,Brooklyn,Manhattan,Queens,Staten Island
county,Bronx,Kings,New York,Queens,Richmond
population,1418207,2559903,1628706,2253858,476143
gdp,42.695,91.559,600.244,93.31,14.514
area,2.1,70.82,22.83,108.53,58.37
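Instead of renaming after the fact, the row labels can also be supplied when the data frame is created, via the `index` argument of the `DataFrame` constructor. A minimal sketch with shortened dictionaries (the name `labelled` is just for illustration):

```python
import pandas as pd

counties = {'The Bronx': 'Bronx', 'Brooklyn': 'Kings'}
populations = {'Brooklyn': 2_559_903, 'The Bronx': 1_418_207}

# one label per dictionary in the list, in the same order
labelled = pd.DataFrame(
    [counties, populations],
    index=['county', 'population']
)
print(labelled)
```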


We could now transpose this data frame as well if we wanted to:

In [10]:
new_york.rename(
    index={0: 'county', 1: 'population', 2: 'gdp', 3: 'area'}
).transpose()

Unnamed: 0,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,2.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


We could also transpose the data first:

In [11]:
new_york

Unnamed: 0,The Bronx,Brooklyn,Manhattan,Queens,Staten Island
0,Bronx,Kings,New York,Queens,Richmond
1,1418207,2559903,1628706,2253858,476143
2,42.695,91.559,600.244,93.31,14.514
3,2.1,70.82,22.83,108.53,58.37


In [12]:
new_york = new_york.transpose()
new_york

Unnamed: 0,0,1,2,3
The Bronx,Bronx,1418207,42.695,2.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


And now we have to change the labels on the column index, again with the `rename` method:

In [13]:
new_york = new_york.rename(
    columns={0: 'county', 1: 'population', 2: 'gdp', 3: 'area'}
)
new_york

Unnamed: 0,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,2.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


##### From a list of lists

In this example, our data is formatted as a list of lists: 

In [14]:
burroughs = ['The Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']
counties = ['Bronx', 'Kings', 'New York', 'Queens', 'Richmond']
populations = [1_418_207, 2_559_903, 1_628_706, 2_253_858, 476_143]
gdp = [42.695, 91.559, 600.244, 93.310, 14.514]
areas = [42.10, 70.82, 22.83, 108.53, 58.37]

In [15]:
data = [burroughs, counties, populations, gdp, areas]
data

[['The Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'],
 ['Bronx', 'Kings', 'New York', 'Queens', 'Richmond'],
 [1418207, 2559903, 1628706, 2253858, 476143],
 [42.695, 91.559, 600.244, 93.31, 14.514],
 [42.1, 70.82, 22.83, 108.53, 58.37]]

We can load it all up in a data frame:

In [16]:
new_york = pd.DataFrame(
    data, 
    index=['burroughs', 'county', 'population', 'gdp', 'area']
)
new_york

Unnamed: 0,0,1,2,3,4
burroughs,The Bronx,Brooklyn,Manhattan,Queens,Staten Island
county,Bronx,Kings,New York,Queens,Richmond
population,1418207,2559903,1628706,2253858,476143
gdp,42.695,91.559,600.244,93.31,14.514
area,42.1,70.82,22.83,108.53,58.37


Here I'm going to transpose the table first:

In [17]:
new_york = new_york.transpose()
new_york

Unnamed: 0,burroughs,county,population,gdp,area
0,The Bronx,Bronx,1418207,42.695,42.1
1,Brooklyn,Kings,2559903,91.559,70.82
2,Manhattan,New York,1628706,600.244,22.83
3,Queens,Queens,2253858,93.31,108.53
4,Staten Island,Richmond,476143,14.514,58.37


You'll notice that our row index is based on positional values - instead we really want the row index to be based on the borough names.

But since `burroughs` is already a column in our data frame, we can set the index using that column, via the `set_index()` method on the data frame - this essentially allows us to pick an existing column to become the row index:

In [18]:
new_york.set_index('burroughs')

burroughs,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


We could have picked any column to use as the row indices:

In [19]:
new_york.set_index('county')

county,burroughs,population,gdp,area
Bronx,The Bronx,1418207,42.695,42.1
Kings,Brooklyn,2559903,91.559,70.82
New York,Manhattan,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Richmond,Staten Island,476143,14.514,58.37
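Note that `set_index()` returns a new data frame and leaves the original untouched, which is why the calls above could be made one after the other on the same `new_york` object. The sketch below illustrates this on a tiny made-up frame, and also shows the `drop=False` option, which keeps the chosen column in the data as well as using it for the index:

```python
import pandas as pd

df = pd.DataFrame({
    'borough': ['Brooklyn', 'Queens'],
    'county': ['Kings', 'Queens'],
})

# returns a new frame; df itself still has its positional index
indexed = df.set_index('borough')

# drop=False keeps 'borough' as a regular column too
kept = df.set_index('borough', drop=False)

print(df.index)
print(indexed)
print(kept)
```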


#### DataFrame Properties

Notice the title above the index in the output above? That's the index name, which Pandas sets automatically from the name of the column we used as the new index.

In [20]:
new_york = new_york.set_index('burroughs')
new_york

burroughs,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


We can get the row index using the `index` property:

In [21]:
new_york.index

Index(['The Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'], dtype='object', name='burroughs')

And we can get the column index using the `columns` property:

In [22]:
new_york.columns

Index(['county', 'population', 'gdp', 'area'], dtype='object')

If we wanted to, we could change the name property of either index:

In [23]:
new_york

burroughs,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


In [24]:
new_york.index.name = None
new_york

Unnamed: 0,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37
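Assigning to `index.name` modifies the data frame in place. If a modified copy is preferred, the `rename_axis()` method is one way to set (or clear) the index names. A small sketch on a made-up frame (the names `named` and `both` are just for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [2_559_903, 2_253_858]},
    index=['Brooklyn', 'Queens']
)

# returns a copy whose row index is named 'borough'
named = df.rename_axis('borough')

# the row and column index names can be set together
both = df.rename_axis(index='borough', columns='measure')

print(df.index.name)
print(named.index.name)
print(both.columns.name)
```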


Data frames have many properties and methods, some of which we'll study later in this section, but you can read up more on them in the Pandas documentation:

https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

An interesting one is the `info` method:

In [25]:
new_york.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, The Bronx to Staten Island
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   county      5 non-null      object
 1   population  5 non-null      object
 2   gdp         5 non-null      object
 3   area        5 non-null      object
dtypes: object(4)
memory usage: 200.0+ bytes


Notice how each column is of `object` type? That's a little surprising, since we do have homogeneous data in the columns:

In [26]:
new_york

Unnamed: 0,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


Why do we not see `int64` for `population` and `float64` for `gdp` and `area`?

You have to remember how we built up this dataset:

In [27]:
burroughs = ['The Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']
counties = ['Bronx', 'Kings', 'New York', 'Queens', 'Richmond']
populations = [1_418_207, 2_559_903, 1_628_706, 2_253_858, 476_143]
gdp = [42.695, 91.559, 600.244, 93.310, 14.514]
areas = [42.10, 70.82, 22.83, 108.53, 58.37]

data = [burroughs, counties, populations, gdp, areas]

new_york = pd.DataFrame(
    data, 
    index=['burroughs', 'county', 'population', 'gdp', 'area']
)

new_york

Unnamed: 0,0,1,2,3,4
burroughs,The Bronx,Brooklyn,Manhattan,Queens,Staten Island
county,Bronx,Kings,New York,Queens,Richmond
population,1418207,2559903,1628706,2253858,476143
gdp,42.695,91.559,600.244,93.31,14.514
area,42.1,70.82,22.83,108.53,58.37


This was our original data set, and as you can see the columns are not homogeneous (each one mixes strings, integers and floats), so our data has been stored using the generic `object` type:

In [28]:
new_york.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, burroughs to area
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       5 non-null      object
 1   1       5 non-null      object
 2   2       5 non-null      object
 3   3       5 non-null      object
 4   4       5 non-null      object
dtypes: object(5)
memory usage: 240.0+ bytes


We then transposed and set the row index, but the underlying data type of each value in the table was `object`, and that remains the same:

In [29]:
new_york = new_york.transpose()
new_york = new_york.set_index('burroughs')
new_york

burroughs,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


In [30]:
new_york.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, The Bronx to Staten Island
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   county      5 non-null      object
 1   population  5 non-null      object
 2   gdp         5 non-null      object
 3   area        5 non-null      object
dtypes: object(4)
memory usage: 200.0+ bytes
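The same effect can be reproduced on a much smaller scale. In this sketch (the frame `mixed` is made up for the illustration), a frame with a string column and an integer column is transposed twice; the values come back in their original positions, but the columns should all report the `object` data type afterwards:

```python
import pandas as pd

mixed = pd.DataFrame(
    {
        'county': ['Kings', 'Queens'],
        'population': [2_559_903, 2_253_858],
    },
    index=['Brooklyn', 'Queens']
)

# .T is shorthand for .transpose(); transposing puts strings and
# integers in the same column, which forces the generic object type
round_trip = mixed.T.T

print(mixed.dtypes)
print(round_trip.dtypes)
```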


If we had loaded the data up directly into homogeneous columns we would see different data types being used:

In [31]:
counties = pd.Series(
    ['Bronx', 'Kings', 'New York', 'Queens', 'Richmond'],
    index=columns,
    name='county'
)
populations = pd.Series(
    [1_418_207, 2_559_903, 1_628_706, 2_253_858, 476_143],
    index = columns,
    name='population'
)
gdp = pd.Series(
    [42.695, 91.559, 600.244, 93.310, 14.514],
    index=columns,
    name='gdp'
)
areas = pd.Series(
    [42.10, 70.82, 22.83, 108.53, 58.37],
    index=columns,
    name='area'
)

d = {
    'county': counties,
    'population': populations,
    'gdp': gdp,
    'area': areas
}

new_york_homogeneous = pd.DataFrame(d)
new_york_homogeneous

Unnamed: 0,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


In [32]:
new_york_homogeneous.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, The Bronx to Staten Island
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   county      5 non-null      object 
 1   population  5 non-null      int64  
 2   gdp         5 non-null      float64
 3   area        5 non-null      float64
dtypes: float64(2), int64(1), object(1)
memory usage: 360.0+ bytes


There are ways we can change the data type of a column, and we'll look at that later.
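As a brief preview, and only as a sketch, one of those ways is the `astype()` method, which accepts a dictionary mapping column labels to the desired data types. The frame below is deliberately created with `dtype=object` to mimic the situation we ended up in (the names `raw` and `typed` are made up for the example):

```python
import pandas as pd

# force every column to object, as happened after transposing
raw = pd.DataFrame(
    [['Kings', 2_559_903, 91.559], ['Queens', 2_253_858, 93.310]],
    index=['Brooklyn', 'Queens'],
    columns=['county', 'population', 'gdp'],
    dtype=object
)

# convert only the columns named in the dictionary;
# a new frame is returned and raw is left as it was
typed = raw.astype({'population': 'int64', 'gdp': 'float64'})

print(raw.dtypes)
print(typed.dtypes)
```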

One last operation we'll look at is how to remove rows or columns from our data frame.

Just as we saw with `Series` objects, we can use the `drop()` method.

The difference here is that we may want to drop a row or a column.

We can drop (one or more) columns by specifying the `columns` argument in the `drop()` method - the argument can be either a single column label or a list of column labels:

In [33]:
new_york

burroughs,county,population,gdp,area
The Bronx,Bronx,1418207,42.695,42.1
Brooklyn,Kings,2559903,91.559,70.82
Manhattan,New York,1628706,600.244,22.83
Queens,Queens,2253858,93.31,108.53
Staten Island,Richmond,476143,14.514,58.37


In [34]:
new_df = new_york.drop(columns='county')
new_df

burroughs,population,gdp,area
The Bronx,1418207,42.695,42.1
Brooklyn,2559903,91.559,70.82
Manhattan,1628706,600.244,22.83
Queens,2253858,93.31,108.53
Staten Island,476143,14.514,58.37


And we can drop rows the same way by using the `index` argument. Again the argument can be a single row label, or a list of row labels:

In [35]:
new_df = new_df.drop(index=['Brooklyn', 'Queens'])
new_df

burroughs,population,gdp,area
The Bronx,1418207,42.695,42.1
Manhattan,1628706,600.244,22.83
Staten Island,476143,14.514,58.37
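A few more details about `drop()` can be sketched on a small made-up frame: it returns a new data frame rather than modifying the original, the labels can alternatively be passed positionally together with an `axis` argument, and `errors='ignore'` skips labels that do not exist instead of raising an exception:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=list('ABC'),
    columns=list('abc')
)

# these two calls are equivalent ways of dropping column 'b'
no_b = df.drop(columns='b')
same = df.drop('b', axis=1)

# 'Z' is not a row label; errors='ignore' drops 'A' and skips 'Z'
ignored = df.drop(index=['A', 'Z'], errors='ignore')

print(no_b)
print(ignored)
print(df.shape)  # the original frame is unchanged
```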


##### Renaming Index Labels

Lastly, let's look at the `rename()` method again - there are a few ways in which we can use this method to redefine the index labels.

In [36]:
df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index = list('ABC'),
    columns = list('abc')
)
df

Unnamed: 0,a,b,c
A,0,1,2
B,3,4,5
C,6,7,8


We saw how we could rename the index labels of either columns or rows using a dictionary to map from the old index label to the new index label:

In [37]:
df.rename(
    columns={'a': 'aa', 'b': 'bb', 'c': 'cc'},
    index={'A': 'AA', 'B': 'BB', 'C': 'CC'}
)

Unnamed: 0,aa,bb,cc
AA,0,1,2
BB,3,4,5
CC,6,7,8


But, if we want to only rename a subset of the labels, we can just specify which ones we want to rename:

In [38]:
df.rename(
    columns={'a': 'aa'},
    index={'A': 'AA'}
)

Unnamed: 0,aa,b,c
AA,0,1,2
B,3,4,5
C,6,7,8
