## In the previous notebooks we removed duplicated instances that shared a sentiment but differed in polarity, as well as duplicates with the same polarity

## What we did not address is that a PROFANITY or VIOLENCE example can also appear under GENERAL, which leads to misclassification. Let's fix that here


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("./Data/Cleaned_Nepali_dataset.csv")
df.loc[:,"Target"].value_counts()

Target
1    1358
0     901
2     250
5     159
4     108
3      83
Name: count, dtype: int64

In [2]:
df.head()

Unnamed: 0,Text,Target
0,गुठी विधेक ल्याएर ठमेल राज गुठि जग्गा छाया सेन...,0
1,दले देश सकेछन बेचे खान सुरू गरेछन दले लखेटनु पछ ।,1
2,नेपाल ससकृती ध्वस्त पार्ने योजना !,1
3,मठ मन्दिर गुम्बा जग्गा हरु भुमाफिया नजर परे हु...,1
4,नेपाल कल कर्खाना नदि नाला बेची सके मठ मन्दीर ब...,1


## Here the target is already label encoded: 0 and 1 are GENERAL, 2 and 3 are PROFANITY, and 4 and 5 are VIOLENCE
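The label grouping used in this notebook can be written out as a small lookup. This is a minimal sketch: the category names come from the notebook, while the dictionary name `TARGET_TO_CATEGORY`, the `Category` column, and the toy rows are assumptions made for illustration only.

```python
import pandas as pd

# Assumed mapping, inferred from the checks in this notebook:
# 0/1 -> GENERAL, 2/3 -> PROFANITY, 4/5 -> VIOLENCE.
TARGET_TO_CATEGORY = {
    0: "GENERAL", 1: "GENERAL",
    2: "PROFANITY", 3: "PROFANITY",
    4: "VIOLENCE", 5: "VIOLENCE",
}

# Toy stand-in for the real dataset (hypothetical rows).
toy = pd.DataFrame({"Text": ["a", "b", "c"], "Target": [1, 3, 4]})

# Add a readable category next to the encoded target.
toy["Category"] = toy["Target"].map(TARGET_TO_CATEGORY)
print(toy)
```

Keeping the grouping in one place makes the later pair lists easier to read and check.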

## First for profanity

In [3]:
# Label sets that mark a text as appearing under both GENERAL (0/1) and PROFANITY (2/3)
profanity_duplicates_condition = [{0, 2}, {0, 3}, {1, 2}, {1, 3}]

# For each distinct text, check whether its set of labels is exactly one of those pairs
profanity_duplicates_in_general = (
    df.groupby('Text')['Target'].apply(lambda x: set(x) in profanity_duplicates_condition)
)

# Keep only the texts that matched one of the pairs
duplicates = profanity_duplicates_in_general[profanity_duplicates_in_general].index

# Select every row of the original DataFrame whose text was flagged
duplicates_full = df[df['Text'].isin(duplicates)]

print(duplicates_full)

                                                   Text  Target
9     साला नाके बाहुन गूठी बिदयेक पारित भु माफिया सग...       1
15    कमिसन खोरी हरु बाटो खोजी रहे लाटा बुजे लाजमर्द...       1
56    पाखन्डी भन्डारी क्रिश्चियन दलाल ओलि मुत पिएर ह...       1
83    राजा हुदा देश बिचियो सरकारि जग्गा बिचियो खाते ...       1
88    हाम्रो देश दलाल भर्स्त हरु राज भयो देश कहिले ब...       1
...                                                 ...     ...
2580                                रन्डी भालु सत्यता ।       3
2582  आईमाई 100% पागलखाना लानु पर्न्ने भो तेति बेला ...       2
2583  अरू जान्दैनो हामी जनता यस्ता खाल कंलड़कित आईमाई...       2
2584     यस्ता मुजी आइमाइ , , बरू एउटा अड्डा खोलेर बस ।       2
2590                            मुजी आईमाई फटाई रहिछे !       2

[260 rows x 2 columns]
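The check above flags a text only when its label set is exactly one of the listed pairs, so a text carrying three labels such as {0, 1, 2} would not be matched. Whether such texts exist in this dataset is not known from the notebook. A hedged alternative tests for overlap with each group instead; the names `GENERAL`, `PROFANITY`, `has_conflict` and the toy frame are hypothetical.

```python
import pandas as pd

GENERAL = {0, 1}      # assumed GENERAL labels
PROFANITY = {2, 3}    # assumed PROFANITY labels

def has_conflict(labels, group_a=GENERAL, group_b=PROFANITY):
    """Return True if the labels of one text touch both groups."""
    seen = set(labels)
    return bool(seen & group_a) and bool(seen & group_b)

# Toy data: "x" has three labels, "y" has an exact pair, "z" has one label.
toy = pd.DataFrame({
    "Text":   ["x", "x", "x", "y", "y", "z"],
    "Target": [0, 1, 2, 1, 3, 2],
})

flags = toy.groupby("Text")["Target"].apply(has_conflict)
conflicting_texts = flags[flags].index.tolist()
print(conflicting_texts)
```

On the toy data both "x" and "y" are flagged, whereas the exact-pair check would flag only "y".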


In [4]:
# Create a mask for rows that are in duplicates_full and have Target 0 or 1
mask = df.index.isin(duplicates_full.index) & df['Target'].isin([0, 1])

# Filter out those rows from the original df
df = df[~mask]

df.loc[:,"Target"].value_counts()

Target
1    1239
0     890
2     250
5     159
4     108
3      83
Name: count, dtype: int64

## Profanity examples are now removed from GENERAL; next, violence

In [5]:
# Label sets that mark a text as appearing under both GENERAL (0/1) and VIOLENCE (4/5)
violence_duplicates_condition = [{0, 4}, {0, 5}, {1, 4}, {1, 5}]

# For each distinct text, check whether its set of labels is exactly one of those pairs
violence_duplicates_in_general = (
    df.groupby('Text')['Target'].apply(lambda x: set(x) in violence_duplicates_condition)
)

# Keep only the texts that matched one of the pairs
duplicates2 = violence_duplicates_in_general[violence_duplicates_in_general].index

# Select every row of the original DataFrame whose text was flagged
duplicates_full2 = df[df['Text'].isin(duplicates2)]

print(duplicates_full2)

                                                   Text  Target
12    गान्धी भारत anti हिन्दु हुदा नाथुराम गोद्से गन...       1
67    बाहुन हु बाहुन क्षेत्री नाम कलङ्क हरुलाइ चरम य...       1
140   टुड़ीखेल लखेटी लखेटी जुत्ता पिट्नु पर्ने मान्छे...       1
169   लुट्न लाइसन्स पाए नेता लाखाै लाख जनता मारेर कम...       1
183   सरकार अदालत प्रहरी सेना अखितियार सरकार निकाय ह...       0
...                                                 ...     ...
2840  नेपाल प्रहरी लुटेरा , सून तस्करी , सुन्तली धाम...       5
2841  डिआई जि झुटो बोल्दै छँस ज्ञानेन्द्र शाई तिमि ह...       4
2842  डिपार्टमेंट बचाउन गलत बयान जनता सुरक्षा दिन मर...       5
2852  भ्रष्टाचारी , घुस्खोरी , नेपाल प्रहरी भाला मत्...       5
2853  हिरासत यातना हैन प्रहरी भ्रष्टाचार प्रहरी आखा ...       5

[178 rows x 2 columns]


In [6]:
# Create a mask for rows that are in duplicates_full2 and have Target 0 or 1 (GENERAL)
mask2 = df.index.isin(duplicates_full2.index) & df['Target'].isin([0, 1])

# Filter out those rows from the original df
df = df[~mask2]

df.loc[:,"Target"].value_counts()

Target
1    1165
0     875
2     250
5     159
4     108
3      83
Name: count, dtype: int64
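The two passes above follow the same pattern, so they could be folded into one helper. This is a sketch under stated assumptions: `drop_general_conflicts` is a hypothetical name, the toy frame stands in for the real data, and the helper uses an overlap test (any GENERAL label together with any label from the other group) rather than the exact-pair match used in the cells above.

```python
import pandas as pd

def drop_general_conflicts(frame, general=(0, 1), other=(2, 3)):
    """Drop GENERAL rows whose text also appears with a label in `other`."""
    # Texts that occur at least once with a label from the other group.
    texts_in_other = set(frame.loc[frame["Target"].isin(other), "Text"])
    is_general = frame["Target"].isin(general)
    conflicted = is_general & frame["Text"].isin(texts_in_other)
    return frame[~conflicted]

# Toy data (hypothetical): "x" conflicts with profanity, "z" with violence.
toy = pd.DataFrame({
    "Text":   ["x", "x", "y", "z", "z", "w"],
    "Target": [0, 2, 1, 1, 5, 4],
})

cleaned = drop_general_conflicts(toy, other=(2, 3))      # profanity pass
cleaned = drop_general_conflicts(cleaned, other=(4, 5))  # violence pass
print(cleaned)
```

Only the GENERAL copies are dropped; the PROFANITY and VIOLENCE rows for the same text are kept, which matches the intent of the masks above.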

## Now we save this dataset

In [7]:
df.to_csv("./Data/Cleaned_Nepali_dataset_v2.csv", index=False)

## Let's rerun the profanity check above to be sure the duplicates were removed

In [8]:
# Label sets that mark a text as appearing under both GENERAL (0/1) and PROFANITY (2/3)
profanity_duplicates_condition = [{0, 2}, {0, 3}, {1, 2}, {1, 3}]

# For each distinct text, check whether its set of labels is exactly one of those pairs
profanity_duplicates_in_general = (
    df.groupby('Text')['Target'].apply(lambda x: set(x) in profanity_duplicates_condition)
)

# Keep only the texts that matched one of the pairs
duplicates = profanity_duplicates_in_general[profanity_duplicates_in_general].index

# Select every row of the original DataFrame whose text was flagged
duplicates_full = df[df['Text'].isin(duplicates)]

print(duplicates_full)

Empty DataFrame
Columns: [Text, Target]
Index: []
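The rerun above covers only the profanity pairs. A hedged sketch of a check that covers both groups in one go is shown below; `conflicting_texts`, `OTHER_GROUPS`, and the toy frames are hypothetical names and data, not part of the original notebook.

```python
import pandas as pd

GENERAL = (0, 1)                                          # assumed GENERAL labels
OTHER_GROUPS = {"PROFANITY": (2, 3), "VIOLENCE": (4, 5)}  # assumed groupings

def conflicting_texts(frame, general, other):
    """Texts that appear under a GENERAL label and under a label in `other`."""
    general_texts = set(frame.loc[frame["Target"].isin(general), "Text"])
    other_texts = set(frame.loc[frame["Target"].isin(other), "Text"])
    return general_texts & other_texts

# Toy frames (hypothetical): one already cleaned, one with a leftover conflict.
clean = pd.DataFrame({"Text": ["x", "y", "z", "w"], "Target": [2, 1, 5, 4]})
dirty = pd.DataFrame({"Text": ["x", "x", "y"], "Target": [1, 4, 0]})

report_clean = {name: conflicting_texts(clean, GENERAL, labels)
                for name, labels in OTHER_GROUPS.items()}
report_dirty = {name: conflicting_texts(dirty, GENERAL, labels)
                for name, labels in OTHER_GROUPS.items()}
print(report_clean)
print(report_dirty)
```

An empty set for every group means no text is shared between GENERAL and that group.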


## Looks good: no text remains labelled as both GENERAL and PROFANITY


In [9]:
!pip install nephased

Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple


In [10]:
from nephased import Nephased

clf = Nephased()
# English: "will be burned alive"
clf.predict("ज़िन्दा जलाईनेछ")

Device set to use cpu


'GENERAL'

In [11]:
# English: "A person who does such a thing should be stripped naked, in front of everyone"
clf.predict('''यस्तो काम गर्ने मान्छेलाई त नांगै पार्नुपर्छ, सबका अगाडि''')

'GENERAL'
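To spot-check more than a couple of sentences, the predictions can be tallied per label. This is a minimal sketch: it assumes only what the two calls above show, namely that `predict` takes one string and returns a label string. `tally_predictions`, `stub_predict`, and the sample texts are hypothetical; the stub lets the sketch run without loading the model.

```python
from collections import Counter

def tally_predictions(predict_fn, texts):
    """Count how often each label is predicted for the given texts."""
    return Counter(predict_fn(text) for text in texts)

def stub_predict(text):
    # Hypothetical stand-in for a real classifier's predict method.
    return "GENERAL"

samples = ["example one", "example two"]
counts = tally_predictions(stub_predict, samples)
print(counts)
```

In the notebook, `clf.predict` could be passed in place of `stub_predict`, with a list of texts known to be violent as `samples`.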