# Pandas internationalisation: Bangla (বাংলা) data example

An example of reading Bangla data into Pandas.

The file `/data/bn_global_popl.tsv` is a tab-delimited file in Bangla (Bengali).

Column 0 (নং) contains a ranking expressed in Bangla digits. Column 3 (জাতিসংঘের অনুমান) contains an integer in Bangla digits. Column 4 (তারিখ) contains dates that are formatted according to Bangla locale preferences.

A set of conversion functions is passed to `pd.read_table()` through its `converters` argument to convert the data to a format that can be used in Pandas.

Columns 0 and 3 are converted from Bangla digits to Arabic (ASCII) digits and returned as numbers by the `convert_digits()` function. Column 4 is converted to the ISO 8601 date format (`yyyy-MM-dd`), using `icu.DateFormat` and `icu.SimpleDateFormat`.

`icu.DateFormat` &ndash; a class for parsing (reading) and formatting (writing) dates for any defined locale. \
`icu.SimpleDateFormat` &ndash; a class for language-independent parsing (reading) and formatting (writing) of dates. It is a concrete subclass of `icu.DateFormat`.
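
As a rough illustration of the digit conversion described above, the sketch below uses only the standard library instead of `unicodedataplus` and `regex`. It relies on Python's built-in `int()` and `unicodedata.decimal()` accepting any Unicode decimal digit (category `Nd`), which includes Bangla digits. The helper name `to_ascii_digits` and the sample strings are invented for this example and are not part of the notebook.

```python
import unicodedata

def to_ascii_digits(text):
    """Replace every Unicode decimal digit in text with its ASCII equivalent."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )

# Illustrative values only; they are not read from the data file.
rank = "১২"
grouped = "৩২,৯০,৬৪,৯১৭"

# Separators are left untouched; only the digits change script.
ascii_grouped = to_ascii_digits(grouped)

# int() parses Bangla digits directly once the grouping separators are removed.
value = int(grouped.replace(",", ""))
```

The notebook's `convert_digits()` goes further than this sketch: it validates the string with a regular expression first and can also return a `float` when a decimal separator is present.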

In [8]:
import unicodedataplus as ud, regex as re, pandas as pd
from icu import Locale, DateFormat, SimpleDateFormat, Transliterator, UTransDirection

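def convert_digits(s, sep = (",", ".")):
    """Convert a numeric string written in any Unicode decimal digits to int or float.

    sep is (thousands separator, decimal separator). Non-numeric input is returned unchanged.
    """
    # Optional minus sign, one digit, then any mix of digits and common separators.
    # The trailing * (rather than +) lets single-digit values such as "১" match too.
    nd = re.compile(r'^-?\p{Nd}[,.\u066B\u066C\u0020\u2009\u202F\p{Nd}]*$')
    tsep, dsep = sep
    if nd.match(s):
        s = s.replace(tsep, "")
        # ud.decimal(c, c) maps a digit to its value and leaves other characters as-is.
        s = ''.join([str(ud.decimal(c, c)) for c in s])
        if dsep in s:
            return float(s.replace(dsep, ".")) if dsep != "." else float(s)
        return int(s)
    return s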

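# Parser for long-form dates in the Bangla (India) locale.
loc = Locale("bn_IN")
dformat = DateFormat.LONG
inf = DateFormat.createDateInstance(dformat, loc)
# Formatter that writes ISO 8601 calendar dates.
outf = SimpleDateFormat("yyyy-MM-dd")

def convert_dates(ld, inf, outf):
    """Parse a localised date string with inf and re-format it with outf."""
    in_date = inf.parseObject(ld)
    return outf.format(in_date)

# Single-argument wrapper, as expected by the converters argument of pd.read_table().
lf = lambda x: convert_dates(x, inf=inf, outf=outf)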
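
To make the intent of `convert_dates()` concrete without ICU installed, here is a minimal standard-library sketch of the same transformation. It is not how the notebook does it: ICU supplies the full locale data, whereas this version uses a hand-written, deliberately partial month table. The `day month, year` layout of the input, the function name `bn_long_date_to_iso` and the `BN_MONTHS` table are all assumptions made for illustration.

```python
import datetime

# Partial, hand-written month table (March to July only), for illustration.
# ICU provides the complete, locale-aware data that the notebook relies on.
BN_MONTHS = {"মার্চ": 3, "এপ্রিল": 4, "মে": 5, "জুন": 6, "জুলাই": 7}

def bn_long_date_to_iso(text):
    """Convert a date such as '১ জুলাই, ২০১৯' to 'YYYY-MM-DD'.

    Assumes a 'day month, year' layout, which may not match the data file.
    """
    day, month, year = text.replace(",", " ").split()
    # int() accepts Bangla digits, so no separate digit conversion is needed.
    return datetime.date(int(year), BN_MONTHS[month], int(day)).isoformat()

iso = bn_long_date_to_iso("১ জুলাই, ২০১৯")
```

A lookup table like this breaks as soon as a month or spelling variant is missing, which is the main reason to prefer a locale-aware parser such as `icu.DateFormat`.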

In [9]:
conv = {'জাতিসংঘের অনুমান': convert_digits, 'নং': convert_digits, 'তারিখ': lf}
df = pd.read_table("../data/bn_global_popl.tsv", skiprows=range(1, 3), converters=conv, parse_dates=['তারিখ'])
df.head(3)

| | নং | দেশ | মহাদেশ | জাতিসংঘের অনুমান | তারিখ |
|---|---|---|---|---|---|
| 0 | 1 | গণচীন | এশিয়া | 1433783686 | 2019-07-01 |
| 1 | 2 | ভারত | এশিয়া | 1366417754 | 2019-07-01 |
| 2 | 3 | যুক্তরাষ্ট্র | আমেরিকা | 329064917 | 2019-07-01 |
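
The `converters` mechanism used in the cell above can be tried without the data file. The sketch below reads a small in-memory TSV whose rows are made up for this example (the real file's layout may differ) and applies a simplified, hypothetical converter named `bangla_int` in place of `convert_digits()`.

```python
import io
import pandas as pd

def bangla_int(text):
    """Parse an integer written with Bangla digits and comma grouping."""
    return int(text.replace(",", ""))

# Made-up two-row sample in the spirit of the data file.
sample = (
    "নং\tদেশ\tজাতিসংঘের অনুমান\n"
    "১\tগণচীন\t১,৪৩,৩৭,৮৩,৬৮৬\n"
    "২\tভারত\t১,৩৬,৬৪,১৭,৭৫৪\n"
)

# Each converter receives the raw cell text for its column as a string.
demo = pd.read_table(
    io.StringIO(sample),
    converters={"নং": bangla_int, "জাতিসংঘের অনুমান": bangla_int},
)
```

Because each converter returns Python integers, Pandas stores both columns as numbers, while the unconverted `দেশ` column remains text.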


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233 entries, 0 to 232
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   নং                233 non-null    int64         
 1   দেশ               233 non-null    object        
 2   মহাদেশ            233 non-null    object        
 3   জাতিসংঘের অনুমান  233 non-null    int64         
 4   তারিখ             233 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 9.2+ KB


## Adding transliteration columns to dataframe

In [11]:
transliterator = Transliterator.createInstance('Bengali-Latin', UTransDirection.FORWARD)

df.insert(loc=2, column='dēśa', value=df['দেশ'].apply(lambda x: transliterator.transliterate(x)))
df.insert(loc=4, column='mahādēśa', value=df['মহাদেশ'].apply(lambda x: transliterator.transliterate(x)))

In [12]:
df.head(10)

| | নং | দেশ | dēśa | মহাদেশ | mahādēśa | জাতিসংঘের অনুমান | তারিখ |
|---|---|---|---|---|---|---|---|
| 0 | 1 | গণচীন | gaṇacīna | এশিয়া | ēśiẏā | 1433783686 | 2019-07-01 |
| 1 | 2 | ভারত | bhārata | এশিয়া | ēśiẏā | 1366417754 | 2019-07-01 |
| 2 | 3 | যুক্তরাষ্ট্র | yuktarāṣṭra | আমেরিকা | āmērikā | 329064917 | 2019-07-01 |
| 3 | 4 | ইন্দোনেশিয়া | indōnēśiẏā | এশিয়া | ēśiẏā | 270625568 | 2019-07-01 |
| 4 | 5 | পাকিস্তান | pākistāna | এশিয়া | ēśiẏā | 216565318 | 2019-07-01 |
| 5 | 6 | ব্রাজিল | brājila | আমেরিকা | āmērikā | 211049527 | 2019-07-01 |
| 6 | 7 | নাইজেরিয়া | nā'ijēriẏā | আফ্রিকা | āphrikā | 200963599 | 2019-07-01 |
| 7 | 8 | বাংলাদেশ | bānlādēśa | এশিয়া | ēśiẏā | 186893830 | 2019-07-01 |
| 8 | 9 | রাশিয়া | rāśiẏā | ইউরোপ-এশিয়া | i'urōpa-ēśiẏā | 145872256 | 2019-07-01 |
| 9 | 10 | মেক্সিকো | mēksikō | আমেরিকা | āmērikā | 127575529 | 2019-07-01 |


## Resources

* [Formatting Dates and Times](https://unicode-org.github.io/icu/userguide/format_parse/datetime/)
* [icu::DateFormat Class Reference](https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1DateFormat.html)
* [icu::SimpleDateFormat Class Reference](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classSimpleDateFormat.html)