# Demo of introducing errors into a categorical column

Using the CategoricalLabelTransformer class, we corrupt the language value in a fraction of the rows, selected completely at random.
By default, the CategoricalLabelTransformer replaces the correct value with a different value from the same column. Optionally, a list can be given to specify the categories from which the replacement value is randomly selected. Another option is to pass a dictionary that specifies, for each original value, the value that replaces it.
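
To make the three replacement modes concrete, here is a minimal sketch in plain pandas and NumPy. It is not the CategoricalLabelTransformer implementation: the function `corrupt_labels`, its parameters, and the toy dataframe are hypothetical and only mirror the behavior described above.

```python
import numpy as np
import pandas as pd


def corrupt_labels(df, column, fraction, replacement=None, seed=0):
    """Hypothetical stand-in, NOT the library's implementation.

    replacement=None -> draw a different value from the same column
    replacement=list -> draw a random value from the list
    replacement=dict -> look up the original value in the dict
    Returns the corrupted copy and the affected index labels.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_affected = int(round(fraction * len(df)))
    # Choose the affected rows completely at random, without replacement.
    affected = rng.choice(df.index.to_numpy(), size=n_affected, replace=False)

    categories = list(pd.unique(df[column]))
    for idx in affected:
        original = df.at[idx, column]
        if replacement is None:
            # Any category from the column except the current one.
            candidates = [c for c in categories if c != original]
            out.at[idx, column] = candidates[rng.integers(len(candidates))]
        elif isinstance(replacement, dict):
            out.at[idx, column] = replacement[original]
        else:
            out.at[idx, column] = replacement[rng.integers(len(replacement))]
    return out, list(affected)


# Toy data (made up for illustration).
toy = pd.DataFrame(
    {"language": ["de", "de", "en", "fr", "tr", "es", "en", "fr", "tr", "es"]}
)
corrupted_toy, affected_toy = corrupt_labels(toy, "language", 0.2)
print(corrupted_toy.loc[affected_toy])
```

In this sketch, the list mode does not prevent a drawn value from coinciding with the original label when the list happens to contain it; whether the real class guards against that is not stated here.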


Importing the CategoricalLabelTransformer and pandas


In [1]:
from . import CategoricalLabelTransformer
import pandas as pd

Read the data into a dataframe


In [2]:
data = pd.read_csv("consolidated.csv")

Create an instance of the CategoricalLabelTransformer class with a fraction of 0.2 on the language column. Then fit the transformer to the data and transform the data. The result is a dataframe in which 20% of the rows have a corrupted value in the language column.


In [3]:
corruptor = CategoricalLabelTransformer(0.2, "language")
corruptor.fit(data)
corrupted_data = corruptor.transform(data)

Print the 'review_body' and 'language' columns of the original dataframe and of the corrupted dataframe.


In [4]:
print(data[["review_body", "language"]])
print(corrupted_data[["review_body", "language"]])

                                               review_body language
0           Armband ist leider nach 1 Jahr kaputt gegangen       de
1                       In der Lieferung war nur Ein Akku!       de
2        Ein Stern, weil gar keine geht nicht. Es hande...       de
3        Dachte, das wären einfach etwas festere Binden...       de
4        Meine Kinder haben kaum damit gespielt und nac...       de
...                                                    ...      ...
1049995  Bir çok ürün denedim bu kategoride baya başarı...       tr
1049996  Çift hat özelliğinin olması dışında uygulamala...       tr
1049997  arkadas bu ne güzel bi kumaştır yaw çok begend...       tr
1049998  Çok kullanışlı bir ürün. içinde ekstra gözler ...       tr
1049999  çok araştırdım. hepsiburada güvencesi ve yorum...       tr

[1050000 rows x 2 columns]
                                               review_body language
0           Armband ist leider nach 1 Jahr kaputt gegangen       en
1                   

Print the rows where the language label is corrupted. There are 210000 corrupted rows, which is indeed a fraction of 0.2 of the 1050000 rows in the original dataset.


In [5]:
corrupted_data[corrupted_data["language"] != data["language"]]

                                               review_body language  stars
0           Armband ist leider nach 1 Jahr kaputt gegangen       en      1
3        Dachte, das wären einfach etwas festere Binden...       es      1
11       Das Buch sagt mir nicht zu. Die Geschichten si...       es      1
15       Trotz diesem Fliegengitter haben ungebetene Gä...       tr      1
17       Das ist gefühlt das 10 was ich für mein Handy ...       tr      1
...                                                    ...      ...    ...
1049965  Almış olduğum hediye güzel ve anlamlı oldu ala...       es      4
1049972  yıllardır kullandığım marka kalıcılığı hiç yok...       de      1
1049981  %15 yok %20 indirim bugüne özel diye yazıyor e...       de      3
1049983  Ürünü bugün aldım, bir tıraşlık yetecek kadar ...       es      4
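
The corrupted fraction can also be computed directly instead of being read off the row count. The following is a small sketch on made-up toy series (the names `original` and `corrupted` are hypothetical); the same comparison applied to `data["language"]` and `corrupted_data["language"]` should give 210000 and 0.2, since 0.2 × 1050000 = 210000.

```python
import pandas as pd

# Toy stand-ins for data["language"] and corrupted_data["language"].
original = pd.Series(["de", "de", "en", "fr", "tr"])
corrupted = pd.Series(["en", "de", "en", "es", "tr"])

# Element-wise comparison gives a boolean mask of the changed rows.
changed = original != corrupted

print(changed.sum())   # number of rows whose label changed
print(changed.mean())  # fraction of rows whose label changed
```

Note that this counts only rows whose value actually differs, so it would undercount if a replacement happened to equal the original label.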


Check that the rows whose indices the corruptor reports as corrupted are indeed the rows that were corrupted


In [6]:
corrupted_rows = list(
    corrupted_data[corrupted_data["language"] != data["language"]].index
)
set(corrupted_rows) == set(corruptor.affected_rows_indices)

True

Here we compare a corrupted row with the same row in the original dataset. As can be seen, the language label has changed while the rest of the row has remained the same.


In [7]:
corrupted_row_index = corruptor.affected_rows_indices[0]
print(corrupted_data.iloc[corrupted_row_index])
print(data.iloc[corrupted_row_index])

review_body    Pour info, il laisse des traces blanches sur l...
language                                                      en
stars                                                          2
Name: 661237, dtype: object
review_body    Pour info, il laisse des traces blanches sur l...
language                                                      fr
stars                                                          2
Name: 661237, dtype: object
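
Rather than inspecting a single row, one can check across the whole dataframe that only the language column was touched. This is a sketch on a hypothetical three-row frame; it assumes both frames share the same index and columns and contain no missing values (NaN compares as unequal to itself and would be reported as a difference).

```python
import pandas as pd

# Toy stand-ins for data and corrupted_data.
original = pd.DataFrame(
    {
        "review_body": ["gut", "Pour info", "nice"],
        "language": ["de", "fr", "en"],
        "stars": [1, 2, 5],
    }
)
corrupted = original.copy()
corrupted.loc[1, "language"] = "en"  # simulate one corrupted label

# For each column: does at least one cell differ between the frames?
differs = (original != corrupted).any(axis=0)
changed_columns = differs[differs].index.tolist()
print(changed_columns)
```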


We now show how to specify a list from which a random value is chosen as the error value.


In [8]:
corruptor = CategoricalLabelTransformer(
    0.2, "language", ["turkish", "german", "spanish", "english"]
)
corruptor.fit(data)
corrupted_data = corruptor.transform(data)
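
A quick sanity check for the list option is that every affected row now holds a value from the supplied list. The sketch below uses made-up labels and a hypothetical list of affected positions; on the real data, the positions would come from `corruptor.affected_rows_indices`.

```python
import pandas as pd

allowed = ["turkish", "german", "spanish", "english"]

# Toy corrupted column and the (hypothetical) positions that were affected.
corrupted_labels = pd.Series(["german", "de", "turkish", "fr", "english"])
affected_positions = [0, 2, 4]

# True only if every affected row holds a value from the allowed list.
all_from_list = corrupted_labels.iloc[affected_positions].isin(allowed).all()
print(all_from_list)
```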

Print the corrupted rows and the original rows


In [9]:
corrupted_rows = corrupted_data.iloc[corruptor.affected_rows_indices]
original_rows = data.iloc[corruptor.affected_rows_indices]
print(corrupted_rows)
print(original_rows)

                                               review_body language  stars
1017075  Sevgililer günü için eşime hediye aldım, fiyat...  turkish      4
861213   İyi ebat olarak tam istediğim gibi 1 haftadan ...   german      5
593735   Conectado para salida de aire junto a otro de ...  spanish      5
35935    Das Produkt ist defekt, ich hätte es gern erse...   german      1
540881   Casi perfecta. Si no fuera por el diseño de de...   german      4
...                                                    ...      ...    ...
191359   Der absolute Knaller für unsere kleine Maus! S...  spanish      5
143347   Sehr zufrieden mit der Auswahl der Nägel und d...  turkish      4
134366   sehr praktisch und angenehm, aber ein wenig kl...  spanish      4
217003                   Not worth the price..............  english      1
349256   I loved the material of the dress, it’s so sof...   german      4

[210000 rows x 3 columns]
                                               review_body language  star

The last option is to give the CategoricalLabelTransformer a dictionary that specifies the value with which each original value should be replaced.
Here we create a CategoricalLabelTransformer that replaces the values in the language column in a fraction of 0.2 of the rows. The dictionary we specify mimics the error where the full language name is given instead of the language abbreviation in some rows. Note that the dictionary should contain an entry for every category that occurs in the column.


In [11]:
corruptor = CategoricalLabelTransformer(
    0.2,
    "language",
    {"en": "english", "es": "spanish", "tr": "turkish", "de": "german", "fr": "french"},
)
corruptor.fit(data)
corrupted_data = corruptor.transform(data)
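
Because the dictionary needs an entry for every category, it can help to check coverage before fitting. The following is a minimal sketch on toy labels (the names `labels` and `mapping` are hypothetical); how the transformer itself reacts to a missing key is not covered here.

```python
import pandas as pd

# Toy stand-in for data["language"].
labels = pd.Series(["de", "en", "fr", "tr", "es", "de"])

# A mapping that deliberately lacks an entry for "fr".
mapping = {"en": "english", "es": "spanish", "tr": "turkish", "de": "german"}

# Categories present in the column but absent from the mapping's keys.
missing = sorted(set(labels.unique()) - set(mapping))
print(missing)
```

An empty list means every category in the column has a replacement defined.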

Print the corrupted dataframe


In [12]:
print(corrupted_data)

                                               review_body language  stars
0           Armband ist leider nach 1 Jahr kaputt gegangen   german      1
1                       In der Lieferung war nur Ein Akku!       de      1
2        Ein Stern, weil gar keine geht nicht. Es hande...       de      1
3        Dachte, das wären einfach etwas festere Binden...       de      1
4        Meine Kinder haben kaum damit gespielt und nac...       de      1
...                                                    ...      ...    ...
1049995  Bir çok ürün denedim bu kategoride baya başarı...       tr      5
1049996  Çift hat özelliğinin olması dışında uygulamala...       tr      5
1049997  arkadas bu ne güzel bi kumaştır yaw çok begend...       tr      5
1049998  Çok kullanışlı bir ürün. içinde ekstra gözler ...       tr      5
1049999  çok araştırdım. hepsiburada güvencesi ve yorum...       tr      5

[1050000 rows x 3 columns]


Print the corrupted rows and the original rows


In [13]:
corrupted_rows = corrupted_data.iloc[corruptor.affected_rows_indices]
original_rows = data.iloc[corruptor.affected_rows_indices]
print(corrupted_rows)
print(original_rows)

                                              review_body language  stars
7304    Ich habe den Artikel nach einem Monat immer no...   german      1
629050  A éviter. J'ai utilisée la caméra sous l'eau t...   french      1
308042  The stereo is good ! fit and worked well on my...  english      3
925621                            Piyasada daha ucuzu yok  turkish      5
939011  birkaç aydır araştırdığım telefon. farklı yerl...  turkish      5
...                                                   ...      ...    ...
258414  DO NOT WASTE YOU MONEY! They sent it in a bag ...  english      2
881096  ürün 1.5 ve 3.5 olarak toplam 2 parca 5cm  gel...  turkish      3
269512  It does not work, it's nonsense, it's constant...  english      2
845097  worl exel bedava.ios da para ile satın alıyors...  turkish      5
39965   Das Ventil war direkt defekt. Verursachte Kurz...   german      1

[210000 rows x 3 columns]
                                              review_body language  stars
7304    Ich