# `clean_country()`: Clean Country Names

Follow the [ISO 3166 country codes](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) and use only ASCII characters.

Convert a country name into one of the following formats:
1. "name": the country name
2. "official": the official state name
3. "alpha-2": the two-letter abbreviation
4. "alpha-3": the three-letter abbreviation
5. "numeric": the three-digit numeric code
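
To make the five formats concrete, the sketch below shows one country in each of them. The dictionary name `GERMANY` is purely illustrative and not part of the proposed API; the values are the ISO 3166-1 entries for Germany.

```python
# Illustration only: the five target formats for a single country.
# The numeric code is kept as a string so that codes with leading
# zeros (for example "004") would survive unchanged.
GERMANY = {
    "name": "Germany",
    "official": "Federal Republic of Germany",
    "alpha-2": "DE",
    "alpha-3": "DEU",
    "numeric": "276",
}

for fmt, value in GERMANY.items():
    print(f"{fmt:>8}: {value}")
```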

In [25]:
from typing import Literal, Union

import dask.dataframe as dd
import pandas as pd

def clean_country(
    df: Union[pd.DataFrame, dd.DataFrame],
    column: str,
    input_format: Literal["auto", "name", "official", "alpha-2", "alpha-3", "numeric"],
    output_format: Literal["name", "official", "alpha-2", "alpha-3", "numeric"],
    # language: TODO (type and semantics not yet decided)
):
    ...  # signature only; see the implementation notes

## Parameters

* `df`: Union[*pandas.DataFrame, dask.DataFrame*] &mdash; the data frame to be transformed 
* `column`: *str* &mdash; the name of the column to be cleaned
* `input_format`: *str* &mdash; the format of the input countries, or "auto" to detect it
* `output_format`: *str* &mdash; the format of the output countries
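
Since both format parameters accept only a fixed set of strings, the function could validate them up front. The following is a minimal sketch under that assumption; the helper `check_formats` and the constants `OUTPUT_FORMATS` and `INPUT_FORMATS` are hypothetical names, not part of the specification.

```python
# Hypothetical constants: the allowed values listed in this document.
OUTPUT_FORMATS = ("name", "official", "alpha-2", "alpha-3", "numeric")
INPUT_FORMATS = ("auto",) + OUTPUT_FORMATS


def check_formats(input_format: str, output_format: str) -> None:
    """Raise ValueError if either format name is not recognized."""
    if input_format not in INPUT_FORMATS:
        raise ValueError(
            f"input_format {input_format!r} is invalid; "
            f"expected one of {INPUT_FORMATS}"
        )
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(
            f"output_format {output_format!r} is invalid; "
            f"expected one of {OUTPUT_FORMATS}"
        )


check_formats("auto", "alpha-3")  # valid combination, passes silently
```

Note that "auto" is accepted only as an input format, so `check_formats("name", "auto")` would raise a `ValueError`.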

## Implementation

0. Create a database with ISO 3166 data and the associated regular expressions for string formatting (see the [country_converter](https://github.com/konstantinstadler/country_converter/blob/master/country_converter/country_data.tsv) data file). This could be a JSON file, loaded into a Python dict.

1. Standardize null values

2. If `input_format` is "auto": use regex matching and table lookups to identify each country, then output it in `output_format`.

3. If `input_format` is not "auto": map each value directly from the input format to the output format.
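
The steps above can be sketched for a single value as follows. This is only an illustration under stated assumptions: `COUNTRY_DB` is a hypothetical three-country stand-in for the full ISO 3166 table, its regexes are simplified placeholders, and the helper names (`standardize_null`, `find_record`, `convert_country`) as well as the choice to return `None` for unmatched input are assumptions, not decisions made in this document.

```python
import re

# Step 0 (hypothetical miniature database): one record per country,
# each with a simplified regex for free-text names.
COUNTRY_DB = [
    {"name": "Germany", "official": "Federal Republic of Germany",
     "alpha-2": "DE", "alpha-3": "DEU", "numeric": "276",
     "regex": re.compile(r"german", re.IGNORECASE)},
    {"name": "Tunisia", "official": "Republic of Tunisia",
     "alpha-2": "TN", "alpha-3": "TUN", "numeric": "788",
     "regex": re.compile(r"tunisia", re.IGNORECASE)},
    {"name": "France", "official": "French Republic",
     "alpha-2": "FR", "alpha-3": "FRA", "numeric": "250",
     "regex": re.compile(r"france|french", re.IGNORECASE)},
]

# "NA" is deliberately absent: it is the alpha-2 code for Namibia.
NULL_VALUES = {"", "n/a", "nan", "null", "none"}
CODE_FORMATS = ("alpha-2", "alpha-3", "numeric")


def standardize_null(value):
    """Step 1: map null-like inputs to None, everything else to a string."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in NULL_VALUES else text


def find_record(text, input_format):
    """Return the matching database record, or None if nothing matches."""
    if input_format == "auto":
        # Step 2: exact code lookups first, then the regexes.
        key = text.zfill(3) if text.isdigit() else text.upper()
        for fmt in CODE_FORMATS:
            for record in COUNTRY_DB:
                if record[fmt] == key:
                    return record
        for record in COUNTRY_DB:
            if record["regex"].search(text):
                return record
        return None
    # Step 3: direct, case-insensitive lookup in the stated input format.
    for record in COUNTRY_DB:
        if record[input_format].casefold() == text.casefold():
            return record
    return None


def convert_country(value, input_format="auto", output_format="name"):
    """Convert one value; unmatched or null input yields None (assumption)."""
    text = standardize_null(value)
    if text is None:
        return None
    record = find_record(text, input_format)
    return record[output_format] if record else None


names = ["Germany", "de", 788, "French Republic", "n/a", "Atlantis"]
print([convert_country(n, output_format="alpha-3") for n in names])
# ['DEU', 'DEU', 'TUN', 'FRA', None, None]
```

For a pandas DataFrame, the same per-value function could be applied with something like `df[column].map(convert_country)`; a vectorized lookup would likely be preferable for large or dask-backed frames.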

<!-- 4. regex [fuzzy match country names](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) when name type is "short" or "official", and standardize the output -->


## Resources

Python libraries:
1. [country_converter](https://github.com/konstantinstadler/country_converter). Converts countries into standardized formats.
2. [pycountry](https://github.com/flyingcircusio/pycountry). Has "fuzzy search" for country matching.

R libraries:
1. [StandardizeText](https://www.rdocumentation.org/packages/StandardizeText/versions/1.0/topics/standardize.countrynames). They can [standardize country names](https://github.com/cran/StandardizeText/blob/master/R/standardize.countrynames.R) using regex and have a [regex database](https://github.com/cran/StandardizeText/blob/master/data/country.regex.rda) for matching countries.
2. [Passport](https://rdrr.io/cran/passport/man/parse_country.html). Parses irregular country names to ISO using regex, Google Maps geocoding, and DSTK geocoding. They have a [dataset of codes](https://github.com/alistaire47/passport/blob/master/data/codes.rda). [GitHub](https://github.com/alistaire47/passport)
3. [Countrycode](https://github.com/vincentarelbundock/countrycode). Standardizes country names in many languages.
4. [RangeBuilder](https://github.com/ptitle/rangeBuilder). They [standardize country names](https://github.com/ptitle/rangeBuilder/blob/master/rangeBuilder/R/standardizeCountry.R) using regex and **fuzzy matching**.


In [24]:
import country_converter as coco
some_names = ['United Rep. of Tanzania', 'DE', 'Cape Verde', '788', 'Burma', 'COG',
              'Iran (Islamic Republic of)', 'Korea, Republic of',
              "Dem. People's Rep. of Korea"]
standard_names = coco.convert(names=some_names, to='name_short')
print(standard_names)

['Tanzania', 'Germany', 'Cabo Verde', 'Tunisia', 'Myanmar', 'Congo Republic', 'Iran', 'South Korea', 'North Korea']
