# One-hot encode countries by continent

Given a dataframe with [ISO 3166 country codes](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes), one-hot encode each row by continent.

Gracefully handle countries that can't be found by assigning them to an `unclassified` category.

Create a dataframe with some fake data, mixing ISO codes, a full country name, and one value that can't be resolved:

In [44]:
import pandas as pd
from pandas import DataFrame

df_initial: DataFrame = pd.DataFrame([
    {
        "country": "DE",
        "price": 1,
        "quantity": 10
    },
    {
        "country": "AU",
        "price": 2,
        "quantity": 20
    },
    {
        "country": "USA",
        "price": 3,
        "quantity": 30
    },
    {
        "country": "Singapore",
        "price": 4,
        "quantity": 40
    },
    {
        "country": "Gobbledygook",
        "price": 5,
        "quantity": 50
    }
])

print(df_initial.head())

        country  price  quantity
0            DE      1        10
1            AU      2        20
2           USA      3        30
3     Singapore      4        40
4  Gobbledygook      5        50


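Use the `countryinfo` library to map each country code or name to a continent. `CountryInfo` accepts several formats (ISO codes as well as full names), and the `region` field of its `info()` result serves as the continent here, which is why the US maps to `Americas`: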

In [45]:
from countryinfo import CountryInfo


def country_to_continent(country: str) -> str:
    """Return the region reported by countryinfo, or "unclassified" if the lookup fails."""
    # Keep the input string under its own name so the message below prints it,
    # not the repr of the CountryInfo object.
    country_info: CountryInfo = CountryInfo(country_name=country)
    try:
        return country_info.info()["region"]
    except KeyError:
        unclassified: str = "unclassified"
        print(f"unable to find country using country code/name {country}, returning {unclassified} as continent")

        return unclassified


df_with_continent: DataFrame = df_initial.assign(continent=df_initial["country"].map(country_to_continent)).drop("country", axis=1)

df_with_continent.head()

unable to find country using country code/name Gobbledygook, returning unclassified as continent


   price  quantity     continent
0      1        10        Europe
1      2        20       Oceania
2      3        30      Americas
3      4        40          Asia
4      5        50  unclassified
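
On a larger dataframe the same country value usually appears many times, so it may be worth resolving each distinct value only once and then mapping the results back. The following is a minimal sketch of that idea. `lookup_region` is a hypothetical stand-in with a hard-coded table so that the example runs without `countryinfo`; in the notebook, `country_to_continent` would take its place.

```python
import pandas as pd


def lookup_region(country: str) -> str:
    """Hypothetical stand-in for country_to_continent, backed by a tiny hard-coded table."""
    known = {"DE": "Europe", "AU": "Oceania", "USA": "Americas"}
    return known.get(country, "unclassified")


countries = pd.Series(["DE", "AU", "DE", "USA", "Gobbledygook", "AU"])

# Resolve each distinct value once, then broadcast the results with Series.map.
region_by_country = {c: lookup_region(c) for c in countries.unique()}
regions = countries.map(region_by_country)

# Collect the values that could not be resolved, for later inspection.
unresolved = sorted(c for c, r in region_by_country.items() if r == "unclassified")
```

A side effect of this approach is that the "unable to find country" message would be printed once per distinct value rather than once per row.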


One-hot encode the `continent` column, join the indicator columns back onto the dataframe, then drop the original `continent` column:

In [46]:
one_hot_continent: DataFrame = pd.get_dummies(df_with_continent["continent"])
df_with_one_hot_continent: DataFrame = df_with_continent.join(one_hot_continent).drop("continent", axis=1)

df_with_one_hot_continent.head()

   price  quantity  Americas  Asia  Europe  Oceania  unclassified
0      1        10         0     0       1        0             0
1      2        20         0     0       0        1             0
2      3        30         1     0       0        0             0
3      4        40         0     1       0        0             0
4      5        50         0     0       0        0             1
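
Note that `pd.get_dummies` only creates columns for the continents that actually occur in the data, so two dataframes can end up with different column sets (there is no `Africa` column above, for example). One possible way to keep the columns stable is to convert `continent` to a categorical with a fixed category list before encoding. The sketch below assumes the category list shown in `ALL_CONTINENTS`, which is an illustrative guess at the values that can occur rather than something taken from `countryinfo`. It also uses the `columns=` argument to encode and drop the column in one call, and `dtype=int` to request 0/1 indicators.

```python
import pandas as pd

# Assumed, illustrative list of every value the continent column may take.
ALL_CONTINENTS = ["Africa", "Americas", "Asia", "Europe", "Oceania", "unclassified"]

df = pd.DataFrame({
    "price": [1, 2, 5],
    "quantity": [10, 20, 50],
    "continent": ["Europe", "Oceania", "unclassified"],
})

# A categorical with fixed categories makes get_dummies emit a column for
# every category, including those that never occur in this dataframe.
df["continent"] = pd.Categorical(df["continent"], categories=ALL_CONTINENTS)

# columns= encodes the named column and drops it from the result;
# an empty prefix and separator keep the bare continent names as column names.
encoded = pd.get_dummies(df, columns=["continent"], prefix="", prefix_sep="", dtype=int)
```

The trade-off is that any value outside the fixed category list becomes missing when the column is converted, so the list has to cover every value the mapping step can return.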
