# Pandas internationalisation: Persian data example

An example of reading Persian-language data into Pandas.

The file `fa_stats.tsv` is a tab-delimited file in Persian. Column 1 contains a four-digit year in the Persian (Solar Hijri) calendar, which is the calendar the code below selects with `@calendar=persian`. Columns 2 and 3 contain integers written with Eastern Arabic-Indic digits and the Arabic thousands separator (U+066C).

A set of conversion functions is passed to `pd.read_table()` through its `converters` argument to convert the data to a format that can be used in Pandas.

Column 1 is converted to the Gregorian calendar using a combination of the `convert_digits()` function and PyICU's `icu.Calendar` and `icu.GregorianCalendar` classes. Once the dataframe is available, `pandas.Series.dt.year` reduces the datetime objects in the column to a four-digit year.
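As a rough cross-check on the ICU conversion, the relationship between the two calendars can be sketched in plain Python. This is not the method used in the notebook: the function name and the constant below are hypothetical, and the sketch only relies on the fact that a Persian year begins around 20 or 21 March of the Gregorian year 621 years later, so each Persian year spans two Gregorian years.

```python
# Hypothetical helper, not part of the notebook or of PyICU.
# A Persian (Solar Hijri) year Y starts in March of Gregorian year Y + 621
# and ends in March of Gregorian year Y + 622.
PERSIAN_TO_GREGORIAN_OFFSET = 621


def persian_year_to_gregorian_span(persian_year):
    """Return the two Gregorian years that a Persian year overlaps."""
    start = persian_year + PERSIAN_TO_GREGORIAN_OFFSET
    return start, start + 1


span = persian_year_to_gregorian_span(1338)
print(span)
```

Because a single Persian year maps to two Gregorian years, any conversion that returns one year has to pick a reference date within the Persian year. The ICU code in this example sets only the year field on a calendar created for the current date, so the Gregorian year it returns may depend on when the notebook is run.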

The `convert_digits()` function converts the Eastern Arabic-Indic digits in columns 2 and 3 to ASCII digits (0-9), which Pandas can then handle as numbers.
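The core of the digit conversion can be illustrated with the standard library alone. The following is a minimal sketch, assuming the Arabic thousands separator (U+066C) and the Arabic decimal separator (U+066B); the helper name `to_number` is hypothetical and is a simplified stand-in for `convert_digits()`, without its regular-expression check.

```python
import unicodedata


def to_number(text, thousands="\u066C", decimal="\u066B"):
    """Convert a string of Unicode decimal digits to an int or a float."""
    cleaned = text.strip().replace(thousands, "")
    # unicodedata.decimal() maps any decimal digit character to its integer
    # value; the second argument is returned unchanged for other characters.
    ascii_text = "".join(str(unicodedata.decimal(ch, ch)) for ch in cleaned)
    if decimal in ascii_text:
        return float(ascii_text.replace(decimal, "."))
    return int(ascii_text)


# "۸۶۴٬۸۴۶" written with escapes so the code points are unambiguous.
births = to_number("\u06F8\u06F6\u06F4\u066C\u06F8\u06F4\u06F6")
# "۱۲٫۵": digits with the Arabic decimal separator.
ratio = to_number("\u06F1\u06F2\u066B\u06F5")
print(births, ratio)
```

Relying on `unicodedata.decimal()` rather than a hand-written translation table means the same function also handles other scripts' decimal digits without changes.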

In [1]:
import unicodedataplus as ud, regex as re, pandas as pd
from icu import Locale, DateFormat, SimpleDateFormat, Calendar, GregorianCalendar

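def convert_digits(s, sep = (",", ".")):
    # Match an optional minus sign, a leading decimal digit, then any mix of
    # digits and common thousands/decimal separators. The trailing `*` (rather
    # than `+`) lets single-digit values match as well.
    nd = re.compile(r'^-?\p{Nd}[,.\u066B\u066C\u0020\u2009\p{Nd}]*$')
    tsep, dsep = sep
    if nd.match(s):
        # Drop the thousands separator, then map each digit to its ASCII form.
        s = s.replace(tsep, "")
        s = ''.join([str(ud.decimal(c, c)) for c in s])
        if dsep in s:
            return float(s.replace(dsep, ".")) if dsep != "." else float(s)
        return int(s)
    # Anything that does not look like a number is returned unchanged.
    return s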

loc = "fa_IR"
in_c = Calendar.createInstance(Locale(loc + "@calendar=persian"))
out_c = GregorianCalendar(Locale(loc + "@calendar=gregorian"))

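def convert_islamic_year(y, in_c, out_c):
    # Despite the name, in_c is a Persian-calendar instance (see above).
    y = convert_digits(y.strip())
    # Set the Persian year, then read the same instant back as a Gregorian year.
    in_c.set(Calendar.YEAR, y)
    out_c.setTime(in_c.getTime())
    return out_c.get(Calendar.YEAR)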

seps = ("\u066C", "\u066B")
digitf = lambda x: convert_digits(x.strip(), sep = seps)
datef = lambda x: convert_islamic_year(x, in_c=in_c, out_c=out_c)


In [2]:
# Column names: سال = year, ولادت = births, وفات = deaths
conv = {"سال": datef, "ولادت": digitf, "وفات": digitf}
df = pd.read_table("../data/csv/fa_stats.tsv", converters=conv, parse_dates=['سال'])
df["سال"] = df["سال"].dt.year
df.head(3)

| | سال | ولادت | وفات |
|---|---|---|---|
| 0 | 1959 | 864846 | 176288 |
| 1 | 1960 | 876206 | 171040 |
| 2 | 1961 | 902260 | 159371 |
