# Best practices

## Setup

In [1]:
import pandas as pd

## Creation

Creation of an example DataFrame (starting from a dictionary of dictionaries):

In [2]:
data = {
    "Capital": {
        "Spain": "Madrid",
        "Belgium": "Brussels",
        "France": "Paris",
        "Italy": "Roma",
        "Germany": "Berlin",
        "Portugal": "Lisbon",
        "Norway": "Oslo",
        "Greece": "Athens",
    },
    "Population": {
        "Spain": 46733038,
        "Belgium": 11449656,
        "France": 67076000,
        "Italy": 60390560,
        "Germany": 83122889,
        "Portugal": 10295909,
        "Norway": 5391369,
        "Greece": 10718565,
    },
    "Monarch": {
        "Spain": "Felipe VI",
        "Belgium": "Philippe",
        "Norway": "Harald V",
    },
    "Area": {
        "Spain": 505990,
        "Belgium": 30688,
        "France": 640679,
        "Italy": 301340,
        "Germany": 357022,
        "Portugal": 92212,
        "Norway": 385207,
        "Greece": 131957,
    },
}

In [3]:
# For now, let's forget about these steps:
df = pd.DataFrame(data)
df["Capital"] = df["Capital"].astype("string")
df["Monarch"] = df["Monarch"].astype("string")
df.index.name = "Country"
df = df.reset_index()

Apple stock data, taken from the [`matplotlib` sample datasets](https://github.com/matplotlib/sample_data/blob/master/aapl.csv):

In [4]:
# For now, let's forget about these steps:
apple = pd.read_csv("AAPL.csv")
apple["Date"] = apple["Date"].astype("datetime64[ns]")
apple = apple.set_index("Date")
apple = apple.sort_index()

## Best practice 1: Use method chaining

In [5]:
df

,Country,Capital,Population,Monarch,Area
0,Spain,Madrid,46733038,Felipe VI,505990
1,Belgium,Brussels,11449656,Philippe,30688
2,France,Paris,67076000,,640679
3,Italy,Roma,60390560,,301340
4,Germany,Berlin,83122889,,357022
5,Portugal,Lisbon,10295909,,92212
6,Norway,Oslo,5391369,Harald V,385207
7,Greece,Athens,10718565,,131957


Chain methods when they are related to each other:

In [6]:
df.set_index("Country")

Country,Capital,Population,Monarch,Area
Spain,Madrid,46733038,Felipe VI,505990
Belgium,Brussels,11449656,Philippe,30688
France,Paris,67076000,,640679
Italy,Roma,60390560,,301340
Germany,Berlin,83122889,,357022
Portugal,Lisbon,10295909,,92212
Norway,Oslo,5391369,Harald V,385207
Greece,Athens,10718565,,131957


In [7]:
df.set_index("Country").sort_index()

Country,Capital,Population,Monarch,Area
Belgium,Brussels,11449656,Philippe,30688
France,Paris,67076000,,640679
Germany,Berlin,83122889,,357022
Greece,Athens,10718565,,131957
Italy,Roma,60390560,,301340
Norway,Oslo,5391369,Harald V,385207
Portugal,Lisbon,10295909,,92212
Spain,Madrid,46733038,Felipe VI,505990


In [8]:
df = df.set_index("Country").sort_index()

In [9]:
df

Country,Capital,Population,Monarch,Area
Belgium,Brussels,11449656,Philippe,30688
France,Paris,67076000,,640679
Germany,Berlin,83122889,,357022
Greece,Athens,10718565,,131957
Italy,Roma,60390560,,301340
Norway,Oslo,5391369,Harald V,385207
Portugal,Lisbon,10295909,,92212
Spain,Madrid,46733038,Felipe VI,505990


In [10]:
df.reset_index()

,Country,Capital,Population,Monarch,Area
0,Belgium,Brussels,11449656,Philippe,30688
1,France,Paris,67076000,,640679
2,Germany,Berlin,83122889,,357022
3,Greece,Athens,10718565,,131957
4,Italy,Roma,60390560,,301340
5,Norway,Oslo,5391369,Harald V,385207
6,Portugal,Lisbon,10295909,,92212
7,Spain,Madrid,46733038,Felipe VI,505990


In [11]:
df.reset_index().sort_values("Population")

,Country,Capital,Population,Monarch,Area
5,Norway,Oslo,5391369,Harald V,385207
6,Portugal,Lisbon,10295909,,92212
3,Greece,Athens,10718565,,131957
0,Belgium,Brussels,11449656,Philippe,30688
7,Spain,Madrid,46733038,Felipe VI,505990
4,Italy,Roma,60390560,,301340
1,France,Paris,67076000,,640679
2,Germany,Berlin,83122889,,357022


In [12]:
df.reset_index().sort_values("Population", ascending=False)

,Country,Capital,Population,Monarch,Area
2,Germany,Berlin,83122889,,357022
1,France,Paris,67076000,,640679
4,Italy,Roma,60390560,,301340
7,Spain,Madrid,46733038,Felipe VI,505990
0,Belgium,Brussels,11449656,Philippe,30688
3,Greece,Athens,10718565,,131957
6,Portugal,Lisbon,10295909,,92212
5,Norway,Oslo,5391369,Harald V,385207


In [13]:
df = df.reset_index().sort_values("Population", ascending=False)

In [14]:
df

,Country,Capital,Population,Monarch,Area
2,Germany,Berlin,83122889,,357022
1,France,Paris,67076000,,640679
4,Italy,Roma,60390560,,301340
7,Spain,Madrid,46733038,Felipe VI,505990
0,Belgium,Brussels,11449656,Philippe,30688
3,Greece,Athens,10718565,,131957
6,Portugal,Lisbon,10295909,,92212
5,Norway,Oslo,5391369,Harald V,385207
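
Longer chains tend to stay readable when wrapped in parentheses with one method per line. As a minimal sketch, the setup steps from the Creation section could be written as a single chain. The three-country `sample` dictionary and the name `countries` are placeholders introduced only for this illustration; they are not part of the original notebook:

```python
import pandas as pd

# Small stand-in for the `data` dictionary defined earlier (a subset of the same values)
sample = {
    "Capital": {"Spain": "Madrid", "Norway": "Oslo", "France": "Paris"},
    "Population": {"Spain": 46733038, "Norway": 5391369, "France": 67076000},
    "Monarch": {"Spain": "Felipe VI", "Norway": "Harald V"},
}

countries = (
    pd.DataFrame(sample)
    .astype({"Capital": "string", "Monarch": "string"})  # convert both text columns at once
    .rename_axis("Country")                               # give the index a name
    .reset_index()                                        # move the index into a regular column
    .sort_values("Population", ascending=False)           # largest population first
)
```

Each step returns a new DataFrame, so no intermediate variable is needed and the original `sample` dictionary is left untouched.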


Much like the difference between the built-in `sorted()` (which returns a new list) and `list.sort()` (which sorts the list in place and returns `None`), many `pandas` methods accept an `inplace=True` option.  
However, in most cases it does **not** actually mean that the operation happens in place: a copy is usually still created under the hood, so there is usually no performance improvement.  
Worse, with `inplace=True` the method returns `None`, which prevents method chaining:

In [20]:
students = ["Carla", "Alice", "Bob"]

In [21]:
sorted(students)

['Alice', 'Bob', 'Carla']

In [22]:
students

['Carla', 'Alice', 'Bob']

In [23]:
students.sort()

In [24]:
students

['Alice', 'Bob', 'Carla']

In [25]:
df.set_index("Country")

Country,Capital,Population,Monarch,Area
Germany,Berlin,83122889,,357022
France,Paris,67076000,,640679
Italy,Roma,60390560,,301340
Spain,Madrid,46733038,Felipe VI,505990
Belgium,Brussels,11449656,Philippe,30688
Greece,Athens,10718565,,131957
Portugal,Lisbon,10295909,,92212
Norway,Oslo,5391369,Harald V,385207


In [26]:
df.set_index("Country", inplace=True)

In [27]:
df

Country,Capital,Population,Monarch,Area
Germany,Berlin,83122889,,357022
France,Paris,67076000,,640679
Italy,Roma,60390560,,301340
Spain,Madrid,46733038,Felipe VI,505990
Belgium,Brussels,11449656,Philippe,30688
Greece,Athens,10718565,,131957
Portugal,Lisbon,10295909,,92212
Norway,Oslo,5391369,Harald V,385207
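
The cell with `inplace=True` above shows no output because the call returns `None`. The following sketch makes that explicit and shows how a chained call then fails; the two-row frame `small` and the names `result` and `chain_failed` are introduced here purely for illustration:

```python
import pandas as pd

# Tiny stand-in frame (values taken from the example above)
small = pd.DataFrame(
    {"Country": ["Norway", "Spain"], "Population": [5391369, 46733038]}
)

# With inplace=True the method modifies `small` and returns None ...
result = small.set_index("Country", inplace=True)
print(result)  # None

# ... so any call chained onto it raises an AttributeError
try:
    small.reset_index(inplace=True).sort_values("Population")
    chain_failed = False
except AttributeError:
    chain_failed = True

# Reassigning the result keeps the chain intact
small = small.set_index("Country").sort_index()
```

Reassignment (`small = small.method(...)`) gives the same end result as `inplace=True` while keeping every step chainable.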


<div class="alert alert-success">

<b>Best Practice:</b> Use <b>method chaining</b> when meaningful, and avoid the <code>inplace=True</code> setting: it usually brings no performance improvement and prevents chaining because the method returns <code>None</code>.

</div>

## Best practice 2: ...

In [28]:
df

Country,Capital,Population,Monarch,Area
Germany,Berlin,83122889,,357022
France,Paris,67076000,,640679
Italy,Roma,60390560,,301340
Spain,Madrid,46733038,Felipe VI,505990
Belgium,Brussels,11449656,Philippe,30688
Greece,Athens,10718565,,131957
Portugal,Lisbon,10295909,,92212
Norway,Oslo,5391369,Harald V,385207
