# Pandas internationalisation: Turkish (Türk dili) data example

An example of reading Turkish data into Pandas, and of string-based data preparation steps that may affect Turkish (or Azeri) data.

Turkish uses the baseline dot (period) as the grouping delimiter within numbers and the comma as the decimal separator.
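
As a minimal illustration of this convention outside Pandas, the sketch below converts a Turkish-formatted number string into a `Decimal`. The helper name `parse_tr_number` is hypothetical, and the function assumes well-formed input containing only digits, grouping dots and at most one decimal comma.

```python
from decimal import Decimal

def parse_tr_number(text):
    """Parse a Turkish-formatted number such as '2.841,83' into a Decimal."""
    # Drop the grouping dots first, then swap the decimal comma for a dot.
    return Decimal(text.replace(".", "").replace(",", "."))

print(parse_tr_number("15.519.267"))  # 15519267
print(parse_tr_number("2.841,83"))    # 2841.83
```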

In [2]:
import os, sys
import unicodedataplus as ud, pandas as pd
from icu import Locale, UnicodeString

libpath = os.path.expanduser('../utils')
if libpath not in sys.path:
    sys.path.append(libpath)
import el_utils as elu

In [3]:
df_pre = pd.read_table("../data/türkiye'ninz-illeri.tsv")
df_pre.head(3)

Unnamed: 0,Ad,Alan (km²),Nüfus (2019),NY kişi/km²,Plaka kodu,Telefon kodu,Vali
0,İstanbul,5.461,15.519.267,"2.841,83",34,"212, 216",Ali Yerlikaya
1,Eskişehir,13.96,887.475,6357,26,222,Erol Ayyıldız
2,Bursa,10.813,3.056.120,28263,16,224,Yakup Canbolat


In [4]:
df_pre.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Ad            81 non-null     object 
 1   Alan (km²)    81 non-null     float64
 2   Nüfus (2019)  81 non-null     object 
 3   NY kişi/km²   81 non-null     object 
 4   Plaka kodu    81 non-null     int64  
 5   Telefon kodu  81 non-null     object 
 6   Vali          81 non-null     object 
dtypes: float64(1), int64(1), object(5)
memory usage: 4.6+ KB


_Alan (km²)_ is treated as a floating point number instead of an integer, because the grouping dot is read as a decimal point.
_Nüfus (2019)_, _NY kişi/km²_, and _Telefon kodu_ are treated as strings.

The numeric columns can easily be corrected while reading the data by specifying the __decimal__ and __thousands__ parameters of `pd.read_table()`.
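
The effect of the two parameters can be checked without the data file. The sketch below reads a small in-memory sample twice, once with the defaults and once with the Turkish separators. The two-row `sample` string is a hypothetical extract in the same layout as the real file, not the file itself.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout as the data file.
sample = (
    "Ad\tAlan (km²)\tNüfus (2019)\tNY kişi/km²\n"
    "İstanbul\t5.461\t15.519.267\t2.841,83\n"
    "Bursa\t10.813\t3.056.120\t282,63\n"
)

# Default settings: the dot is taken as a decimal point, the comma as plain text.
naive = pd.read_table(io.StringIO(sample))

# Turkish settings: the dot groups digits, the comma marks the decimal part.
tailored = pd.read_table(io.StringIO(sample), decimal=",", thousands=".")

print(naive.dtypes)
print(tailored.dtypes)
```

With the defaults, only _Alan (km²)_ is numeric, and its values are misread as fractions. With the tailored call, all three columns are numeric.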

_Telefon kodu_ is read as a string. Most values in the column are single integers, but one or more cells contain several integers separated by commas. There are two approaches: either leave the column values as strings, or convert each value to a list of integers. We will use a converter function to turn each string into a list of integers.

In [5]:
#lf = lambda x: list(map(int, x.split(", ")))
lf = lambda x: [int(n) for n in x.split(", ")]
conv = {"Telefon kodu": lf}
df = pd.read_table("../data/türkiye'ninz-illeri.tsv", decimal=",", thousands=".", converters=conv)
df.head(3)

Unnamed: 0,Ad,Alan (km²),Nüfus (2019),NY kişi/km²,Plaka kodu,Telefon kodu,Vali
0,İstanbul,5461,15519267,2841.83,34,"[212, 216]",Ali Yerlikaya
1,Eskişehir,13960,887475,63.57,26,[222],Erol Ayyıldız
2,Bursa,10813,3056120,282.63,16,[224],Yakup Canbolat


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Ad            81 non-null     object 
 1   Alan (km²)    81 non-null     int64  
 2   Nüfus (2019)  81 non-null     int64  
 3   NY kişi/km²   81 non-null     float64
 4   Plaka kodu    81 non-null     int64  
 5   Telefon kodu  81 non-null     object 
 6   Vali          81 non-null     object 
dtypes: float64(1), int64(3), object(3)
memory usage: 4.6+ KB


The operation `Series.str.lower()` is the Pandas equivalent of the core Python string casing operation `str.lower()` and, like `str.lower()`, is language and locale insensitive. `Series.str.lower()` takes a series as input and returns a series. It is suitable in most cases. Alternatives should be sought when language-specific tailorings are required, as is the case for Turkish.

The value in the first column and first row is _İstanbul_. As can be seen in the code snippet below, the standard lowercasing of this string yields _i̇stanbul_. In the default casing operation <U+0130> is converted to <U+0069, U+0307>, while a Turkish-specific casing operation would convert <U+0130> to <U+0069>.

In [7]:
initial_value = df.at[0,'Ad']
print(elu.cp(initial_value))
print(elu.cp(initial_value.lower()))

U+0130 (İ) U+0073 (s) U+0074 (t) U+0061 (a) U+006E (n) U+0062 (b) U+0075 (u) U+006C (l)
U+0069 (i) U+0307 (̇) U+0073 (s) U+0074 (t) U+0061 (a) U+006E (n) U+0062 (b) U+0075 (u) U+006C (l)
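
The `elu.cp()` helper comes from a local utilities module that is not shown here. As a rough stand-in, the sketch below uses only the standard library to list the code points involved. The function name `codepoints` is hypothetical, and its output format is only an approximation of what `elu.cp()` prints.

```python
import unicodedata

def codepoints(text):
    """Describe each character as 'U+XXXX (NAME)'."""
    return " ".join(
        f"U+{ord(ch):04X} ({unicodedata.name(ch, '?')})" for ch in text
    )

word = "\u0130stanbul"          # İstanbul, starting with U+0130
lowered = word.lower()          # default, locale-insensitive lowercasing

print(codepoints(word[:1]))     # the single capital letter
print(codepoints(lowered[:2]))  # the two code points it becomes
print(len(word), len(lowered))  # the string grows by one code point
print(lowered == "istanbul")    # False: the combining dot gets in the way
```

The last line shows the practical consequence: a default-lowercased value no longer compares equal to the plain ASCII spelling.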


As can be seen below, the `Series.str.lower()` operation yields the same result as `str.lower()`:

In [9]:
df_temp = df.copy()
df_temp["Ad"] = df_temp["Ad"].str.lower()
df_temp["Vali"] = df_temp["Vali"].str.lower()
temp_value = df_temp.at[0,'Ad']
print(elu.cp(temp_value))
print(temp_value == initial_value.lower())

U+0069 (i) U+0307 (̇) U+0073 (s) U+0074 (t) U+0061 (a) U+006E (n) U+0062 (b) U+0075 (u) U+006C (l)
True


There are two approaches to handling Turkish casing:

1. Write a wrapper function around `str.lower()`. This is the most common approach.
2. Use PyICU to create a Turkish locale instance and perform the casing operation using that locale.

The first approach:

In [None]:
# To lowercase
def kucukharfyap(s):
    return ud.normalize("NFC", s).replace('İ', 'i').replace('I', 'ı').lower()

# To uppercase
def buyukharfyap(s):
    return ud.normalize("NFC", s).replace('ı', 'I').replace('i', 'İ').upper()

print(elu.cp(kucukharfyap(initial_value)))

df_temp2 = df.copy()
# Note: pandas 2.1 and later deprecate DataFrame.applymap in favour of DataFrame.map
df_temp2[["Ad", "Vali"]] = df_temp2[["Ad", "Vali"]].applymap(kucukharfyap)
temp2_value = df_temp2.at[0,'Ad']
print(elu.cp(temp2_value))

U+0069 (i) U+0073 (s) U+0074 (t) U+0061 (a) U+006E (n) U+0062 (b) U+0075 (u) U+006C (l)
U+0069 (i) U+0073 (s) U+0074 (t) U+0061 (a) U+006E (n) U+0062 (b) U+0075 (u) U+006C (l)
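
The NFC normalisation step in the wrappers matters when the input is stored in decomposed form. The sketch below mirrors `kucukharfyap()` and `buyukharfyap()` under the hypothetical names `tr_lower` and `tr_upper`, with the standard library `unicodedata` module swapped in for `unicodedataplus`, on the assumption that the two behave the same for these characters.

```python
import unicodedata

def tr_lower(s):
    # Compose first, so that 'I' + U+0307 becomes the single letter U+0130.
    s = unicodedata.normalize("NFC", s)
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()

def tr_upper(s):
    s = unicodedata.normalize("NFC", s)
    return s.replace("\u0131", "I").replace("i", "\u0130").upper()

decomposed = "I\u0307stanbul"   # capital I followed by a combining dot above

print(tr_lower(decomposed) == "istanbul")  # True
print(tr_lower("ISPARTA"))                 # ısparta
print(tr_upper("diyarbakır"))              # DİYARBAKIR
```

Without the normalisation step, the decomposed capital would not match the `İ` replacement and the combining dot would be left attached to a dotless _ı_.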


Custom wrappers for casing operations are useful when you are working with a single language. If your code needs to be adaptable to datasets in a range of languages, however, each language-specific tailoring needs its own wrapper, which adds complexity and makes the code harder to maintain over time.
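
To make that maintenance cost concrete, the sketch below dispatches on a language code. The names `TAILORED_LOWER` and `lower_for` are hypothetical, the registry covers only Turkish and Azeri, and the input is assumed to be in precomposed (NFC) form. Every further language with its own casing rules would need another hand-written entry.

```python
def tr_lower(s):
    # Turkish and Azeri tailoring: dotted and dotless i are distinct letters.
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()

# Hypothetical registry mapping language codes to tailored lowercasing functions.
TAILORED_LOWER = {"tr": tr_lower, "az": tr_lower}

def lower_for(s, lang):
    """Lowercase s with the tailoring registered for lang, else str.lower."""
    return TAILORED_LOWER.get(lang, str.lower)(s)

print(lower_for("ISPARTA", "tr"))  # ısparta
print(lower_for("ISPARTA", "en"))  # isparta
```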

The second approach is to leverage the PyICU package, which is a wrapper for `icu4c`.

For PyICU:

1. Create an ICU Locale object using `icu.Locale()`.
2. Create a UnicodeString object from the string, using `icu.UnicodeString()`.
3. Apply the lowercasing operation for the specified locale, using `icu.UnicodeString.toLower()`.
4. Typecast to a string, using `str()`.

In [None]:
from icu import UnicodeString, Locale

loc = Locale("tr_TR")
def toLower(s, l):
    return str(UnicodeString(s).toLower(l))

df_temp3 = df.copy()
# Note: pandas 2.1 and later deprecate DataFrame.applymap in favour of DataFrame.map
df_temp3[["Ad", "Vali"]] = df_temp3[["Ad", "Vali"]].applymap(lambda x: toLower(x, loc))

temp3_value = df_temp3.at[0,'Ad']
# temp3_value = df_temp3.iat[0,0]
print(elu.cp(temp3_value))

U+0069 (i) U+0073 (s) U+0074 (t) U+0061 (a) U+006E (n) U+0062 (b) U+0075 (u) U+006C (l)


In [None]:
df_temp3.to_csv('temp.csv')