<a href="https://colab.research.google.com/github/abhinavshrivastva/Assignment/blob/main/Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hindi Text Classification as NSFW or SFW
You are provided with a dataset containing sentences and their corresponding labels. Your task is to train a machine learning model that accurately predicts the label of each sentence based on its content. You also have the option to fine-tune any open-source model. Ensure that your solution includes preprocessing, model selection, training, and evaluation steps. Once done, share the complete code and a brief summary of your approach and results. You are expected to share a GitHub repository link. Good luck!

The dataset contains only two columns, `sentence` and `label`. The label values are:

- 0 -> SFW
- 1 -> NSFW


For example:
```
मुझे आंटी की बातों का मर्म ये समझ में आया कि इस सब गुस्से की वजह रात को अंकल ज्यादा देर तक आंटी की चुदाई नहीं कर पाते हैं और आंटी को शांत नहीं कर पाते हैं - 1
सब तुम्हारी वजह से - 0
```


## Uploading the dataset

In [1]:
pip install pandas




In [2]:
import pandas as pd

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv('30k_sample.csv')

# Check the first few rows of the DataFrame
print(df.head())

   Unnamed: 0                                           sentence  label
0       27869     रहने दो तुम रेस्ट करो।दीप्ति - ठीक है डार्लिंग      1
1       22322  मैं आज एक हिजड़े से चूत चुदवा कर वो खुशी पा चुक...      1
2       27125   मेरी प्रियंका दीदी के ऊपरी बदन का तो और भी बु...      1
3       38243  मुझे आंटी की बातों का मर्म ये समझ में आया कि इ...      1
4        4637                                 सब तुम्हारी वजह से      0


## Data Cleaning

Sentences that are empty or contain only whitespace are removed from the dataset.

In [3]:
# Remove sentences with all spaces
df = df[df['sentence'].str.strip() != '']

# Check the first few rows of the cleaned DataFrame
print("After removing sentences with all spaces:")
print(df.head())

# To check the number of rows and columns in the cleaned DataFrame
num_rows, num_columns = df.shape
print(f"Number of rows after cleaning step 1: {num_rows}")
print(f"Number of columns: {num_columns}")


After removing sentences with all spaces:
   Unnamed: 0                                           sentence  label
0       27869     रहने दो तुम रेस्ट करो।दीप्ति - ठीक है डार्लिंग      1
1       22322  मैं आज एक हिजड़े से चूत चुदवा कर वो खुशी पा चुक...      1
2       27125   मेरी प्रियंका दीदी के ऊपरी बदन का तो और भी बु...      1
3       38243  मुझे आंटी की बातों का मर्म ये समझ में आया कि इ...      1
4        4637                                 सब तुम्हारी वजह से      0
Number of rows after cleaning step 1: 29257
Number of columns: 3


The `remove_emojis` function below uses a regular expression that covers four Unicode ranges (U+1F600 to U+1F64F, U+1F300 to U+1F5FF, U+1F680 to U+1F6FF, and U+1F700 to U+1F77F) and replaces every match with an empty string. It is applied to the `sentence` column, and the first few rows are printed to check the result. Emojis outside these four ranges are left in place.

In [4]:
import re

# Compile the pattern once; it covers four Unicode blocks
EMOJI_PATTERN = re.compile("["
                           "\U0001F600-\U0001F64F"  # Emoticons
                           "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
                           "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
                           "\U0001F700-\U0001F77F"  # Alchemical Symbols
                           "]+", flags=re.UNICODE)

# Function to remove emojis from a text using the regular expression above
def remove_emojis(text):
    return EMOJI_PATTERN.sub('', text)

# Apply the remove_emojis function to the 'sentence' column
df['sentence'] = df['sentence'].apply(remove_emojis)

# Check the first few rows of the DataFrame after removing emojis
print("After removing emojis:")
print(df.head())


After removing emojis:
   Unnamed: 0                                           sentence  label
0       27869     रहने दो तुम रेस्ट करो।दीप्ति - ठीक है डार्लिंग      1
1       22322  मैं आज एक हिजड़े से चूत चुदवा कर वो खुशी पा चुक...      1
2       27125   मेरी प्रियंका दीदी के ऊपरी बदन का तो और भी बु...      1
3       38243  मुझे आंटी की बातों का मर्म ये समझ में आया कि इ...      1
4        4637                                 सब तुम्हारी वजह से      0
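
The four ranges used by `remove_emojis` do not cover every emoji. As a quick check, the sketch below applies the same ranges to two sample strings. The strings and the helper name `strip_emojis` are illustrative and not taken from the dataset.

```python
import re

# Same four ranges as in remove_emojis above.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F700-\U0001F77F"
    "]+"
)

def strip_emojis(text):
    """Remove characters that fall inside the four ranges."""
    return emoji_pattern.sub("", text)

inside = strip_emojis("ठीक है \U0001F600")   # U+1F600 lies inside the first range
outside = strip_emojis("ठीक है \U0001F923")  # U+1F923 lies outside all four ranges

print(repr(inside))
print(repr(outside))
```

The second string comes back unchanged. If such characters matter for the task, the pattern could be extended with further ranges, for example U+1F900 to U+1F9FF (Supplemental Symbols and Pictographs).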


## Data Preprocessing

The Indic NLP library is a Python library designed to work with Indian languages. It offers a range of language processing tools and resources for Indian languages, including tokenization, sentence segmentation, and transliteration. Indic NLP facilitates natural language processing tasks in languages such as Hindi, Bengali, Telugu, and many others, making it a valuable resource for researchers and developers working on Indian language-based applications, text analysis, and linguistic studies.

In [None]:
pip install indic-nlp-library


Collecting indic-nlp-library
  Downloading indic_nlp_library-0.92-py3-none-any.whl (40 kB)
Collecting sphinx-argparse (from indic-nlp-library)
  Downloading sphinx_argparse-0.4.0-py3-none-any.whl (12 kB)
Collecting sphinx-rtd-theme (from indic-nlp-library)
  Downloading sphinx_rtd_theme-1.3.0-py2.py3-none-any.whl (2.8 MB)
Collecting morfessor (from indic-nlp-library)
  Downloading Morfessor-2.0.6-py3-none-any.whl (35 kB)
Collecting sphinxcontrib-jquery<5,>=4 (from sphinx-rtd-theme->indic-nlp-library)
  Downloading sphinxcontrib_jquery-4.1-py2.py3-none-any.whl (121 kB)

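I installed the `stop-words` package in order to remove Hindi stopwords. The tokenization cell below iterates through each sentence and splits it into words on whitespace. As shown, it does not filter any words, so common words such as `है` and `से` still appear in its output.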

In [5]:
pip install stop-words


Collecting stop-words
  Downloading stop-words-2018.7.23.tar.gz (31 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: stop-words
  Building wheel for stop-words (setup.py) ... done
  Created wheel for stop-words: filename=stop_words-2018.7.23-py3-none-any.whl size=32896 sha256=3cecadec29e0a1ec96e86d42da243b288abcab3dd6eb1f47b9fd727aa487c171
  Stored in directory: /root/.cache/pip/wheels/d0/1a/23/f12552a50cb09bcc1694a5ebb6c2cd5f2a0311de2b8c3d9a89
Successfully built stop-words
Installing collected packages: stop-words
Successfully installed stop-words-2018.7.23


In [6]:
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load your dataset
data = df  # Assuming 'df' is your DataFrame with 'sentence' and 'label' columns

# Create lists to store tokenized sentences and labels
tokenized_sentences = []
labels = data['label']

# Initialize a Tokenizer
tokenizer = Tokenizer()

# Iterate through each sentence in the DataFrame
for sentence in data['sentence']:
    # Tokenize the sentence into words using whitespace as a separator
    words = sentence.split()

    # Join the words to create a tokenized sentence
    tokenized_sentence = ' '.join(words)

    # Append the tokenized sentence to the list
    tokenized_sentences.append(tokenized_sentence)

# Update the tokenizer with the tokenized sentences
tokenizer.fit_on_texts(tokenized_sentences)

# Create a new DataFrame with tokenized sentences and labels
tokenized_data = pd.DataFrame({'sentence': tokenized_sentences, 'label': labels})

# You can access specific columns like this
sentences = tokenized_data['sentence']
labels = tokenized_data['label']

# To check the number of rows and columns in the DataFrame
num_rows, num_columns = tokenized_data.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

# Display the first 5 rows of the tokenized DataFrame
print(tokenized_data.head())


Number of rows: 29257
Number of columns: 2
                                            sentence  label
0     रहने दो तुम रेस्ट करो।दीप्ति - ठीक है डार्लिंग      1
1  मैं आज एक हिजड़े से चूत चुदवा कर वो खुशी पा चुक...      1
2  मेरी प्रियंका दीदी के ऊपरी बदन का तो और भी बुर...      1
3  मुझे आंटी की बातों का मर्म ये समझ में आया कि इ...      1
4                                 सब तुम्हारी वजह से      0
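
The tokenization cell above splits each sentence on whitespace and rejoins it, leaving stopwords in place. A minimal sketch of a filtering step could look like the following. The stopword set here is a small hand-written assumption for illustration, not the list shipped with any package, and `remove_stopwords` is a hypothetical helper name.

```python
# Illustrative stopword set (an assumption); a real run would load a fuller Hindi list.
HINDI_STOPWORDS = {"से", "है", "दो", "के", "का", "की", "को", "और", "में"}

def remove_stopwords(sentence, stopwords=HINDI_STOPWORDS):
    """Split on whitespace and drop every token found in the stopword set."""
    return " ".join(tok for tok in sentence.split() if tok not in stopwords)

filtered = remove_stopwords("सब तुम्हारी वजह से")
print(filtered)  # सब तुम्हारी वजह
```

It could be applied with `df['sentence'].apply(remove_stopwords)`. Tokens that are glued to punctuation would not match the set, so a tokenizer that separates punctuation first may give cleaner results.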


In [7]:
tokenized_sentences

['रहने दो तुम रेस्ट करो।दीप्ति - ठीक है डार्लिंग',
 'मैं आज एक हिजड़े से चूत चुदवा कर वो खुशी पा चुकी थी',
 'मेरी प्रियंका दीदी के ऊपरी बदन का तो और भी बुरा हाल था',
 'मुझे आंटी की बातों का मर्म ये समझ में आया कि इस सब गुस्से की वजह रात को अंकल ज्यादा देर तक आंटी की चुदाई नहीं कर पाते हैं और आंटी को शांत नहीं कर पाते हैं',
 'सब तुम्हारी वजह से',
 'अगले ही पल मैंने बोला- हां चाची … मैं आपकी रसीली चूचियों को मुंह में लेकर उनके निप्पल काटना चाहता हूं',
 'मेरी बात सुनकर वो चुप हुईं और मंडी के उलटे हाथ वाली गली में जाने लगी',
 'उसने तुरंत कागज पर अपना नंबर लिख कर मेरी तरफ फेंक दिया',
 'जानकारी पसंद आती है तो अपनेयार दोस्तों में',
 'मैं उसे देखता ही रह गया क्योंकि उसने एक झीना सा गाउन अपने जिस्म पर डाला हुआ था; उसमें से उसकी रेड ब्रा और पैंटी साफ नज़र आ रही थी',
 'तापसी भी अब इतनी ज्यादा मूडी हो चुकी थी की वो वंश पर बस टूट पड़ी थी',
 'उसका लम्बा तना हुआ लण्ड किसी बेलन से कम नहीं लग रहा था।वो मेरी पैंटी के ऊपर से ही मेरी चूत को चाटने लगा। उसने धीरे से मेरी पैंटी को मेरी टांगों के बीच से निकाल 

Next, I removed duplicate rows based on the `sentence` column, keeping the first occurrence of each sentence. I then reset the index with `reset_index(drop=True)` so that it is contiguous again. This reduces the dataset from 29,257 to 27,254 rows, as the printed DataFrame confirms.

In [8]:
# Remove duplicates based on the 'sentence' column
tokenized_data = tokenized_data.drop_duplicates(subset='sentence')

# Reset the index of the DataFrame
tokenized_data.reset_index(drop=True, inplace=True)

# Print the cleaned DataFrame
print(tokenized_data)

                                                sentence  label
0         रहने दो तुम रेस्ट करो।दीप्ति - ठीक है डार्लिंग      1
1      मैं आज एक हिजड़े से चूत चुदवा कर वो खुशी पा चुक...      1
2      मेरी प्रियंका दीदी के ऊपरी बदन का तो और भी बुर...      1
3      मुझे आंटी की बातों का मर्म ये समझ में आया कि इ...      1
4                                     सब तुम्हारी वजह से      0
...                                                  ...    ...
27249  तुमसे ज्यादा हॉट है और तुमसे ज्यादा हुस्न वाली...      1
27250                             वो जोर से सिसकारने लगी      0
27251  मैंने किशन से कहा कि ये तेल उंगली में लेकर मेर...      1
27252  और लबलबा रही थी… मानो चीख चीख कर लंड माँग रही ...      1
27253           आपको ये भाभी सेक्स कहानी पसंद आई या नहीं      1

[27254 rows x 2 columns]
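
Dropping duplicates on `sentence` keeps the first row for each sentence, so if the same sentence appears with different labels, only the first label survives. The sketch below shows one way to look for such conflicts before dropping. The rows are toy placeholders, not data from this notebook.

```python
from collections import defaultdict

# Toy (sentence, label) rows for illustration only.
rows = [("a b c", 0), ("d e", 1), ("a b c", 1), ("d e", 1)]

# Collect the set of labels seen for each distinct sentence.
labels_by_sentence = defaultdict(set)
for sentence, label in rows:
    labels_by_sentence[sentence].add(label)

# A sentence is conflicting if it was seen with more than one label.
conflicting = [s for s, labels in labels_by_sentence.items() if len(labels) > 1]
print(conflicting)  # ['a b c']
```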


In [9]:
tokenized_data['sentence']

0           रहने दो तुम रेस्ट करो।दीप्ति - ठीक है डार्लिंग
1        मैं आज एक हिजड़े से चूत चुदवा कर वो खुशी पा चुक...
2        मेरी प्रियंका दीदी के ऊपरी बदन का तो और भी बुर...
3        मुझे आंटी की बातों का मर्म ये समझ में आया कि इ...
4                                       सब तुम्हारी वजह से
                               ...                        
27249    तुमसे ज्यादा हॉट है और तुमसे ज्यादा हुस्न वाली...
27250                               वो जोर से सिसकारने लगी
27251    मैंने किशन से कहा कि ये तेल उंगली में लेकर मेर...
27252    और लबलबा रही थी… मानो चीख चीख कर लंड माँग रही ...
27253             आपको ये भाभी सेक्स कहानी पसंद आई या नहीं
Name: sentence, Length: 27254, dtype: object

FastText is an open-source library for word embeddings and text classification developed by Facebook AI Research (FAIR). Unlike traditional word embeddings, it uses subword information: each word is broken into character n-grams (for example trigrams or four-grams), and the word is represented as the sum of the embeddings of these n-grams. This is especially useful for languages with rich morphology and word formation, such as Hindi, because the model can build a vector for an out-of-vocabulary word from its n-grams.

Pre-trained FastText embeddings for Hindi are available for download and can represent words that are rare or unseen in the task's training data. They are commonly used as a starting point for Hindi NLP tasks such as sentiment analysis, machine translation, and document classification. The cells below install the `fasttext` package and download the pre-trained `cc.hi.300.bin` model.
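
To make the subword idea concrete, the sketch below lists the character n-grams of a word. The `<` and `>` boundary markers and the 3 to 6 length range follow the description in the FastText paper. The function `char_ngrams` is a hypothetical helper for illustration and is not part of the `fasttext` package, which additionally hashes n-grams into buckets.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word` wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("where", 3, 3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# The same function works on Devanagari text; it slices by code point,
# so vowel signs count as separate characters.
print(char_ngrams("वजह", 3, 3))
```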

In [10]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.11.1-py3-none-any.whl (227 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp310-cp310-linux_x86_64.whl size=4199771 sha256=1c5c92c870210f43055990a4c8c332bfbbeaf230801e2eaacb33abad3538212b
  Stored in directory: /root/.cache/pip/wheels/a5/13/75/f811c84a8ab36eedbaef977a6a58a98990e8e0f1967f98f394
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.11.1


In [11]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.bin.gz

--2023-10-28 17:30:42--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.bin.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.24.87, 3.163.24.51, 3.163.24.93, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.24.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4371554972 (4.1G) [application/octet-stream]
Saving to: ‘cc.hi.300.bin.gz’


2023-10-28 17:31:05 (179 MB/s) - ‘cc.hi.300.bin.gz’ saved [4371554972/4371554972]



In [12]:
!gunzip cc.hi.300.bin.gz

In [None]:
# import fasttext

# # Load the fastText model
# model_path = 'cc.hi.300.bin'
# ft = fasttext.load_model(model_path)

# # Define a function to get the fastText embeddings for a sentence
# def get_sentence_embeddings(sentence):
#     # Use the fastText model to get sentence embeddings
#     embedding = ft.get_sentence_vector(sentence)
#     return embedding



# # Apply the embedding function to each sentence in the DataFrame
# tokenized_data['embedding'] = tokenized_data['sentence'].apply(get_sentence_embeddings)

# # Now, df contains sentence embeddings in the 'embedding' column
# print(tokenized_data)




                                                sentence  label  \
0             रहने तुम रेस्ट करो । दीप्ति - ठीक डार्लिंग      1   
1                 मैं आज हिजड़े चूत चुदवा वो खुशी पा चुकी      1   
2                   मेरी प्रियंका दीदी ऊपरी बदन बुरा हाल      1   
3      मुझे आंटी बातों मर्म समझ आया सब गुस्से वजह रात...      1   
4                                        सब तुम्हारी वजह      0   
...                                                  ...    ...   
27052  तुमसे ज्यादा हॉट तुमसे ज्यादा हुस्न वाली । सेक...      1   
27053                                वो जोर सिसकारने लगी      0   
27054      मैंने किशन तेल उंगली लेकर मेरी गांड अन्दर लगा      1   
27055  लबलबा रही थी… चीख चीख लंड माँग रही । मेरी रूपा...      1   
27056                      आपको भाभी सेक्स कहानी पसंद आई      1   

                                               embedding  
0      [-0.005927444, -0.038019452, 0.058730066, 0.01...  
1      [0.0040530004, -0.07248137, 0.014240338, -0.01...  
2      [0.01654801

In [None]:
# tokenized_data["embedding"]

0        [-0.005927444, -0.038019452, 0.058730066, 0.01...
1        [0.0040530004, -0.07248137, 0.014240338, -0.01...
2        [0.016548015, -0.056799635, 0.055843595, 4.532...
3        [-0.02205274, -0.07949906, 0.04177191, 0.01211...
4        [0.005723194, -0.11241563, -0.013619993, 0.010...
                               ...                        
27052    [-0.0062860674, -0.035024516, 0.028184963, 0.0...
27053    [-0.02357255, -0.06010562, 0.063375965, -0.022...
27054    [-0.016162643, -0.028356519, 0.05148187, 0.027...
27055    [0.02696366, -0.033749055, 0.03329916, 0.01336...
27056    [0.022650523, -0.08317428, 0.025835428, 0.0688...
Name: embedding, Length: 27057, dtype: object

# Model Selection

## Logistic regression

I first trained a basic logistic regression classifier as a baseline. The data is split 80/20 into training and test sets with `train_test_split`, and the sentences are converted to bag-of-words counts with `CountVectorizer`, limited to 10,000 features. The model is fitted on the training vectors and then used to predict labels for the test set. It reaches an accuracy of about 71.3% on the test set, as measured by `accuracy_score`. The classification report gives precision, recall, and F1-score for both classes.

Because the two classes are roughly balanced (2,743 vs. 2,708 test examples), 71% is a modest baseline rather than a strong result, and it leaves clear room for improvement. The run also emitted a convergence warning, so raising `max_iter` may help.

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# The 'sentence' column holds the text and the 'label' column holds the target.

# Split the data into training and testing sets
train_data, test_data = train_test_split(tokenized_data, test_size=0.2, random_state=42)

# Define your features (X) and labels (y)
X_train = train_data['sentence']
y_train = train_data['label']
X_test = test_data['sentence']
y_test = test_data['label']

# Create a CountVectorizer to convert text data to numerical features
vectorizer = CountVectorizer(max_features=10000)  # You can adjust the number of features as needed
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Create a logistic regression model
logistic_model = LogisticRegression()


# Fit the model on the training data
logistic_model.fit(X_train_vectorized, y_train)

# Predict the labels on the test data
y_pred = logistic_model.predict(X_test_vectorized)

# Calculate accuracy and print a classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy*100}%")
print(report)


Accuracy: 71.326362135388%
              precision    recall  f1-score   support

           0       0.69      0.77      0.73      2743
           1       0.74      0.65      0.69      2708

    accuracy                           0.71      5451
   macro avg       0.72      0.71      0.71      5451
weighted avg       0.72      0.71      0.71      5451



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## FastText - model by Facebook




I then trained FastText's supervised text classifier. The data is split 85/15 and written to `train.txt` and `test.txt` in FastText's format: one example per line, with the label prefixed by `__label__`. After training, each test line is split into its label and sentence, the model predicts a label for the sentence, and the `__label__` prefixes are stripped before comparison. Accuracy and weighted precision, recall, and F1 score are computed with scikit-learn. The model reaches about 98.1% accuracy. Note that this 15% test split differs from the 20% split used for the other models.

In [17]:
df = tokenized_data
# First, split the dataset into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.15, random_state=42)  # Adjust the test_size as needed

# Define the file paths for the train and test files
train_file_path = "train.txt"
test_file_path = "test.txt"

# Write the training data to the train.txt file in the FastText format
with open(train_file_path, 'w', encoding='utf-8') as train_file:
    for index, row in train_df.iterrows():
        sentence = row['sentence']
        label = row['label']
        train_file.write(f"__label__{label} {sentence}\n")

# Write the testing data to the test.txt file in the FastText format
with open(test_file_path, 'w', encoding='utf-8') as test_file:
    for index, row in test_df.iterrows():
        sentence = row['sentence']
        label = row['label']
        test_file.write(f"__label__{label} {sentence}\n")

In [18]:
import fasttext
model = fasttext.train_supervised(input='train.txt')


In [19]:
# Load the test data
test_data = 'test.txt'  # Replace with the path to your test data file

# Initialize variables for tracking evaluation metrics
true_labels = []
predicted_labels = []

# Evaluate the model on the test data
with open(test_data, 'r', encoding='utf-8') as test_file:
    for line in test_file:
        parts = line.strip().split(' ', 1)
        if len(parts) == 2:
            label, sentence = parts
            true_labels.append(label[9:])  # Extract the true label (remove '__label__')
            prediction = model.predict(sentence.strip())  # Get model's prediction
            predicted_labels.append(prediction[0][0][9:])  # Extract the predicted label (remove '__label__')

# Calculate evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='weighted')
recall = recall_score(true_labels, predicted_labels, average='weighted')
f1 = f1_score(true_labels, predicted_labels, average='weighted')

# Print evaluation metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")


Accuracy: 0.9814135485448765
Precision: 0.9816321606520968
Recall: 0.9814135485448765
F1 Score: 0.9814090953522737
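
The file format used above expects exactly one example per line, so a sentence containing a newline would be split into two examples. The helpers below are a small sketch of writing and parsing such lines defensively. `to_fasttext_line` and `parse_fasttext_line` are hypothetical names, not functions from the `fasttext` package.

```python
LABEL_PREFIX = "__label__"

def to_fasttext_line(label, sentence):
    """Format one example; runs of whitespace (including newlines) become one space."""
    clean = " ".join(str(sentence).split())
    return f"{LABEL_PREFIX}{label} {clean}"

def parse_fasttext_line(line):
    """Split a line back into (label, sentence), stripping the label prefix."""
    tag, _, sentence = line.strip().partition(" ")
    if tag.startswith(LABEL_PREFIX):
        tag = tag[len(LABEL_PREFIX):]
    return tag, sentence

line = to_fasttext_line(0, "सब तुम्हारी\nवजह से")
parsed = parse_fasttext_line(line)
print(line)
print(parsed)
```

As an alternative to the manual loop, the trained model's `test` method accepts a file path and returns the number of examples together with precision and recall at k=1.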


## Naive Bayes

In the code below, I used count vectorization (a bag-of-words representation) to turn the text into numbers. `CountVectorizer` builds a vocabulary of all unique tokens and counts how many times each token occurs in each sentence. This differs from TF-IDF, which additionally down-weights tokens that occur in many sentences. The resulting count matrix is passed to a multinomial Naive Bayes classifier, which learns patterns from token frequencies.
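
One detail worth checking with Devanagari text is how tokens are extracted. `CountVectorizer` and `TfidfVectorizer` use the default `token_pattern` `(?u)\b\w\w+\b`, and in Python's `re` module `\w` does not match combining vowel signs or the virama. The sketch below applies that pattern directly to one sentence and compares it with a whitespace split. It may be one reason the bag-of-words models here score well below FastText and the CNN, but that link is an assumption and is not measured in this notebook.

```python
import re

# The documented default token_pattern of scikit-learn's CountVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sentence = "सब तुम्हारी वजह से"

regex_tokens = re.findall(DEFAULT_TOKEN_PATTERN, sentence)
whitespace_tokens = sentence.split()

print(regex_tokens)       # words containing vowel signs are dropped
print(whitespace_tokens)  # all four words are kept
```

Passing `token_pattern=r"\S+"` to the vectorizer would keep whole whitespace-delimited tokens instead.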

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

# The 'sentence' column holds the text and the 'label' column holds the target.

# Split the data into training and testing sets
train_data, test_data = train_test_split(tokenized_data, test_size=0.2, random_state=42)

# Define your features (X) and labels (y)
X_train = train_data['sentence']
y_train = train_data['label']
X_test = test_data['sentence']
y_test = test_data['label']

# Create a pipeline with Count Vectorization and Multinomial Naive Bayes
text_clf = Pipeline([
    ('vectorizer', CountVectorizer()),  # You can customize vectorization options here
    ('classifier', MultinomialNB())
])

# Fit the model
text_clf.fit(X_train, y_train)

# Predict the labels
y_pred = text_clf.predict(X_test)

# Calculate accuracy and print a classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy*100}%")
print(report)


Accuracy: 65.76774903687397%
              precision    recall  f1-score   support

           0       0.69      0.59      0.63      2743
           1       0.64      0.73      0.68      2708

    accuracy                           0.66      5451
   macro avg       0.66      0.66      0.66      5451
weighted avg       0.66      0.66      0.66      5451



## Decision tree classification

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline  # needed for the pipeline built below
from sklearn.metrics import accuracy_score, classification_report

# The 'sentence' column holds the text and the 'label' column holds the target.

# Split the data into training and testing sets
train_data, test_data = train_test_split(tokenized_data, test_size=0.2, random_state=42)

# Define your features (X) and labels (y)
X_train = train_data['sentence']
y_train = train_data['label']
X_test = test_data['sentence']
y_test = test_data['label']

# Create a pipeline with TF-IDF Vectorization and Decision Tree Classifier
text_clf = Pipeline([
    ('vectorizer', TfidfVectorizer()),  # You can customize vectorization options here
    ('classifier', DecisionTreeClassifier())
])

# Fit the model
text_clf.fit(X_train, y_train)

# Predict the labels
y_pred = text_clf.predict(X_test)

# Calculate accuracy and print a classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy*100}%")
print(report)


Accuracy: 63.32782975600807%
              precision    recall  f1-score   support

           0       0.63      0.66      0.64      2743
           1       0.64      0.61      0.62      2708

    accuracy                           0.63      5451
   macro avg       0.63      0.63      0.63      5451
weighted avg       0.63      0.63      0.63      5451



## CNN

In [22]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# The 'sentence' column holds the text and the 'label' column holds the target.

# Split the data into training and testing sets
train_data, test_data = train_test_split(tokenized_data, test_size=0.2, random_state=42)

# Define your features (X) and labels (y)
X_train = train_data['sentence']
y_train = train_data['label']
X_test = test_data['sentence']
y_test = test_data['label']

# Tokenize the text data and convert it to sequences
max_sequence_length = 100  # Adjust based on the desired sequence length
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

# Pad sequences to ensure consistent input size
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_sequence_length, padding='post')
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_sequence_length, padding='post')

# Create a CNN model
model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=300, input_length=max_sequence_length))
model.add(Conv1D(128, 5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train_padded, y_train, epochs=5, batch_size=64)

# Evaluate the model
y_pred = model.predict(X_test_padded)
y_pred_binary = [1 if pred > 0.5 else 0 for pred in y_pred]

accuracy = accuracy_score(y_test, y_pred_binary)
report = classification_report(y_test, y_pred_binary)

print(f"Accuracy: {accuracy*100}%")
print(report)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy: 98.45899834892681%
              precision    recall  f1-score   support

           0       0.99      0.98      0.98      2743
           1       0.98      0.99      0.98      2708

    accuracy                           0.98      5451
   macro avg       0.98      0.98      0.98      5451
weighted avg       0.98      0.98      0.98      5451



In [23]:
X_train

7889     फिर थोड़ी देर बिना कोई हरकत किए ऐसे ही में पड़...
1443     इस रिश्ते को सम्पूर्ण करने के लिए आज रात पति प...
19548    मेरे गांव जाने के लिए शहर से 18 किलोमीटर एक छो...
12789    यहां मैं आपको बता दूँ कि मोनाली को गांड में लं...
11480    डैनी _ बेबी उस रोज कि तरह डांस करो ना ताकि मेर...
                               ...                        
21575        मगर हम दोनों खुल्लम खुल्ला चुदाई किया करते थे
5390     चौधरी जी ने अपना मोटा लंड मेरी गांड में डाला औ...
860                वो मुझसे कई बार काम भी बता दिया करती थी
15795    और फिर एक दिन बाथरूम में उसे नंगी नहाते हुए दे...
23654    <span;>वो मेरी आंखों में देख रही थी। <span;>मै...
Name: sentence, Length: 21803, dtype: object

In [24]:
len(X_train_padded[0])

100
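
Every padded row has length 100 because `pad_sequences` both pads and truncates. With `padding='post'` and the default `truncating='pre'`, short sequences get zeros appended, while sequences longer than `maxlen` lose tokens from the start. The pure-Python function below is a sketch that mimics this behaviour for a single sequence. `pad_post` is a hypothetical name, not a Keras function.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(padding='post') with the default truncating='pre'."""
    if len(seq) > maxlen:
        seq = seq[-maxlen:]  # 'pre' truncation drops tokens from the start
    return list(seq) + [value] * (maxlen - len(seq))

short_seq = pad_post([5, 8, 2], maxlen=5)
long_seq = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_seq)  # [5, 8, 2, 0, 0]
print(long_seq)   # [3, 4, 5, 6, 7]
```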

## XGBoost

In [25]:
import xgboost as xgb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# The 'sentence' column holds the text and the 'label' column holds the target.

# Split the data into training and testing sets
train_data, test_data = train_test_split(tokenized_data, test_size=0.2, random_state=42)

# Define your features (X) and labels (y)
X_train = train_data['sentence']
y_train = train_data['label']
X_test = test_data['sentence']
y_test = test_data['label']

# Convert text data to TF-IDF features
tfidf_vectorizer = TfidfVectorizer(max_features=10000)  # You can adjust the max_features as needed
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Create an XGBoost classifier
xgb_model = xgb.XGBClassifier()

# Train the model
xgb_model.fit(X_train_tfidf, y_train)

# Predict the labels
y_pred = xgb_model.predict(X_test_tfidf)

# Calculate accuracy and print a classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy*100}%")
print(report)


Accuracy: 69.52852687580261%
              precision    recall  f1-score   support

           0       0.68      0.73      0.71      2743
           1       0.71      0.66      0.68      2708

    accuracy                           0.70      5451
   macro avg       0.70      0.70      0.69      5451
weighted avg       0.70      0.70      0.69      5451

