# Compare data sources

The __Tamilmixsentiment dataset__ presented in
 [Chakravarthi et al.](https://aclanthology.org/2020.sltu-1.28.pdf)
 was first contributed to the
 [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/TamilSentiMix)
 and later published in the [Hugging Face Data Repository](https://huggingface.co/datasets/tamilmixsentiment).

In this notebook I compare both datasets and check that there are no differences between them.

In [1]:
import datasets
import pandas as pd

datasets.logging.set_verbosity_error()

Read from Hugging Face

In [2]:
ds = datasets.load_dataset('tamilmixsentiment')
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 11335
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 3149
    })
})

and merge the three splits into a single pandas DataFrame

In [3]:
hf = pd.concat([v.to_pandas() for k, v in ds.items()])
hf

Unnamed: 0,text,label
0,Trailer late ah parthavanga like podunga,0
1,Move pathutu vanthu trailer pakurvnga yaru,0
2,Puthupetai dhanush ah yarellam pathinga,0
3,"Dhanush oda character ,puthu sa erukay , mass ta",0
4,vera level ippa pesungada mokka nu thalaivaaaaaa,0
...,...,...
3144,Tamil krish ah irukum oh...,0
3145,Thalaivaaaaaa... trailer ye pattaiya kelapudhe...,0
3146,Innum neraya neraya neraya neraya neraya,2
3147,1:05 to 1:30 Vere level masss,0


Recode the integer labels as class names

In [4]:
hf['label'] = hf['label'].replace(dict(enumerate(ds['test'].features['label'].names)))
print(hf['label'].value_counts())

Positive          10559
Negative           2037
Mixed_feelings     1801
unknown_state       850
not-Tamil           497
Name: label, dtype: int64
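
The recoding cell relies on `dict(enumerate(names))` to turn a list of class names into an id-to-name mapping. A minimal sketch of that idea on toy data follows; the `names` list and the integer labels here are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical class names, standing in for ds['test'].features['label'].names
names = ['Positive', 'Negative', 'Mixed_feelings']

# enumerate pairs each name with its position: {0: 'Positive', 1: 'Negative', ...}
id_to_name = dict(enumerate(names))

# Toy integer-coded labels (made up for illustration)
toy = pd.DataFrame({'label': [0, 2, 1, 0]})

# Replace every integer id with the matching class name
toy['label'] = toy['label'].replace(id_to_name)
print(toy['label'].tolist())
# ['Positive', 'Mixed_feelings', 'Negative', 'Positive']
```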


and strip leading and trailing whitespace

In [5]:
hf['text'] = hf['text'].str.strip()

Read from UCI

In [6]:
df = pd.read_csv(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/00610/Tamil_first_ready_for_sentiment.csv',
    sep='\t', names=['label', 'text']
)

Strip whitespace from both columns

In [7]:
df['label'] = df['label'].str.strip()
df['text'] = df['text'].str.strip()
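
The `read_csv` call above treats the UCI file as a headerless, tab-separated table with the label in the first column. The sketch below reproduces the read-then-strip steps on an in-memory string; the two rows in `raw` are invented for illustration, and the padding spaces are an assumption about why stripping is needed.

```python
import io

import pandas as pd

# Made-up sample in the assumed layout: label, a tab, then text, with stray spaces
raw = "Positive \tsemma trailer \nNegative \tmokka padam \n"

# names= supplies the column names because the file has no header row
toy = pd.read_csv(io.StringIO(raw), sep='\t', names=['label', 'text'])

# Remove the padding so values compare equal across sources
toy['label'] = toy['label'].str.strip()
toy['text'] = toy['text'].str.strip()
print(toy)
```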

and reorder columns

In [8]:
df = df[hf.columns]
df

Unnamed: 0,text,label
0,Enna da ellam avan seyal Mari iruku,Negative
1,This movei is just like ellam avan seyal,Negative
2,Padam vanthathum 13k dislike pottavaga yellam ...,Positive
3,Neraya neraya neraya... ... V era level...thala,Positive
4,wow thavala sema mass....padam oru pundaikum a...,Positive
...,...,...
15739,ivaru cinemala laam nalla tha prasuraaru...aan...,Mixed_feelings
15740,Pattaya Kilaputhupaa trailer... !!!!! Get Rajn...,Positive
15741,En innum trending la varala? Ennada panringa Y...,Mixed_feelings
15742,Rajnikant sir plz aap india ke pm ban jaao,not-Tamil


The UCI labels need no recoding (the replacement below is left commented out), so just check the label counts

In [9]:
df['label'] = df['label']  # .replace('unknown_state', 'Neutral').replace('not-Tamil', 'Other_language')
print(df['label'].value_counts())

Positive          10559
Negative           2037
Mixed_feelings     1801
unknown_state       850
not-Tamil           497
Name: label, dtype: int64


Let's arrange both datasets in the same order

In [10]:
hf = hf.sort_values(hf.columns.tolist()).reset_index(drop=True)
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
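
Sorting by every column and resetting the index gives both frames the same row order, so an element-wise comparison becomes meaningful. A small sketch with two toy frames (made-up rows) that hold the same data in a different order:

```python
import pandas as pd

# Same rows, different order (toy data for illustration)
left = pd.DataFrame({'text': ['b', 'a'], 'label': ['Negative', 'Positive']})
right = pd.DataFrame({'text': ['a', 'b'], 'label': ['Positive', 'Negative']})

# Before sorting, the row-by-row comparison reports a difference
before = bool((left == right).all().all())

# Sort by all columns and rebuild a clean 0..n-1 index on both sides
left_s = left.sort_values(left.columns.tolist()).reset_index(drop=True)
right_s = right.sort_values(right.columns.tolist()).reset_index(drop=True)

after = bool((left_s == right_s).all().all())
print(before, after)
# False True
```

The `reset_index(drop=True)` step matters because `==` between two DataFrames requires identical row and column labels.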

and compare both dataframes.

In [11]:
(hf == df).all()

text     True
label    True
dtype: bool
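
As an optional cross-check, pandas also offers `DataFrame.equals` and `pd.testing.assert_frame_equal`, which additionally compare dtypes and report where two frames differ. This is a sketch on toy frames (made-up rows), not part of the original comparison:

```python
import pandas as pd

# Two identical toy frames (made-up rows)
x = pd.DataFrame({'text': ['a', 'b'], 'label': ['Positive', 'Negative']})
y = x.copy()

# equals returns a single boolean; assert_frame_equal raises on any difference
identical = x.equals(y)
pd.testing.assert_frame_equal(x, y)

# Introduce one differing label and confirm the check catches it
y.loc[1, 'label'] = 'Positive'
mismatch_detected = False
try:
    pd.testing.assert_frame_equal(x, y)
except AssertionError:
    mismatch_detected = True

print(identical, mismatch_detected)
# True True
```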

In [12]:
hf

Unnamed: 0,text,label
0,# Apdi yellam nadakaathu # Naadaka kudathu,Positive
1,# Court la ippudillam kaththa kudathu Mr. vand...,Positive
2,# I set your screens on # Vaa thalaiva#,Positive
3,### Pink tamil version ## thalassemia,Positive
4,#0.32 that moment Kottaisami thalaikeela dhan ...,Positive
...,...,...
15739,🤣🤣🤣Maran waiting la panavendam poi saavunga Aj...,Positive
15740,🤣🤣🤣Trailer ah da edu edho ad patha mari eruku ...,Mixed_feelings
15741,🤣🤣🤣🤣🤣 Batman begins!!!! Dawn of Krish!!!! Muga...,Positive
15742,🤩Naan Tamil Cinema Fan... Semma Trailer #Thala,Positive


In [13]:
df

Unnamed: 0,text,label
0,# Apdi yellam nadakaathu # Naadaka kudathu,Positive
1,# Court la ippudillam kaththa kudathu Mr. vand...,Positive
2,# I set your screens on # Vaa thalaiva#,Positive
3,### Pink tamil version ## thalassemia,Positive
4,#0.32 that moment Kottaisami thalaikeela dhan ...,Positive
...,...,...
15739,🤣🤣🤣Maran waiting la panavendam poi saavunga Aj...,Positive
15740,🤣🤣🤣Trailer ah da edu edho ad patha mari eruku ...,Mixed_feelings
15741,🤣🤣🤣🤣🤣 Batman begins!!!! Dawn of Krish!!!! Muga...,Positive
15742,🤩Naan Tamil Cinema Fan... Semma Trailer #Thala,Positive


In conclusion, both datasets are identical after recoding the labels and stripping whitespace.