<a href="https://www.kaggle.com/code/aniruddhapa/youtube-toxic-comment-classification?scriptVersionId=161949551" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn import metrics

import string
import spacy

np.random.seed(42)

In [2]:
data=pd.read_csv('/kaggle/input/youtube-toxicity-data/youtoxic_english_1000.csv')
data.head()

Unnamed: 0,CommentId,VideoId,Text,IsToxic,IsAbusive,IsThreat,IsProvocative,IsObscene,IsHatespeech,IsRacist,IsNationalist,IsSexist,IsHomophobic,IsReligiousHate,IsRadicalism
0,Ugg2KwwX0V8-aXgCoAEC,04kJtp6pVXI,If only people would just take a step back and...,False,False,False,False,False,False,False,False,False,False,False,False
1,Ugg2s5AzSPioEXgCoAEC,04kJtp6pVXI,Law enforcement is not trained to shoot to app...,True,True,False,False,False,False,False,False,False,False,False,False
2,Ugg3dWTOxryFfHgCoAEC,04kJtp6pVXI,\nDont you reckon them 'black lives matter' ba...,True,True,False,False,True,False,False,False,False,False,False,False
3,Ugg7Gd006w1MPngCoAEC,04kJtp6pVXI,There are a very large number of people who do...,False,False,False,False,False,False,False,False,False,False,False,False
4,Ugg8FfTbbNF8IngCoAEC,04kJtp6pVXI,"The Arab dude is absolutely right, he should h...",False,False,False,False,False,False,False,False,False,False,False,False


In [3]:
data['IsToxic']=data['IsToxic'].astype('int')

In [4]:
data.head()

Unnamed: 0,CommentId,VideoId,Text,IsToxic,IsAbusive,IsThreat,IsProvocative,IsObscene,IsHatespeech,IsRacist,IsNationalist,IsSexist,IsHomophobic,IsReligiousHate,IsRadicalism
0,Ugg2KwwX0V8-aXgCoAEC,04kJtp6pVXI,If only people would just take a step back and...,0,False,False,False,False,False,False,False,False,False,False,False
1,Ugg2s5AzSPioEXgCoAEC,04kJtp6pVXI,Law enforcement is not trained to shoot to app...,1,True,False,False,False,False,False,False,False,False,False,False
2,Ugg3dWTOxryFfHgCoAEC,04kJtp6pVXI,\nDont you reckon them 'black lives matter' ba...,1,True,False,False,True,False,False,False,False,False,False,False
3,Ugg7Gd006w1MPngCoAEC,04kJtp6pVXI,There are a very large number of people who do...,0,False,False,False,False,False,False,False,False,False,False,False
4,Ugg8FfTbbNF8IngCoAEC,04kJtp6pVXI,"The Arab dude is absolutely right, he should h...",0,False,False,False,False,False,False,False,False,False,False,False


In [5]:
# Load the small English pipeline provided by spaCy. It includes pre-trained components
# for tasks such as part-of-speech tagging, named entity recognition, and more.
nlp = spacy.load("en_core_web_sm")

# Retrieve the default stop words for the English language
stop_words = nlp.Defaults.stop_words
print(stop_words)

{'myself', 'now', 'ours', 'full', 'himself', 'was', 'several', 'make', 'than', 'moreover', 'though', 'against', 'what', 'whereby', 'when', 'us', 'his', 'whether', 'over', 'back', 'enough', 'will', 'without', 'twelve', 'every', 'thence', 'twenty', 'n’t', 'except', 'even', 'used', 'been', 'since', 'beforehand', 'many', 'always', 'such', 'can', 'ten', 'show', 'nevertheless', 'would', 'nothing', 'latter', 'behind', 'fifty', 'your', 'still', 'may', 'whither', 'done', 'three', 'often', 'yourselves', 'further', 'me', 'become', 'it', 'therefore', 'our', 'never', 'becoming', 'last', 'alone', 'own', 'through', 'thereafter', 'he', 'off', 'and', 'whose', 'serious', "'ll", 'hence', 'various', 'ever', '’s', 'eleven', '’m', 'until', 'about', 'next', 'but', 'wherever', 'whereafter', 'in', 'forty', 'regarding', 'because', 'of', 'else', 'few', 'only', 'third', 'both', 'hers', 'if', '‘d', "'ve", "'d", 'amount', 'from', 'anyone', 'others', 'herself', 'very', 'one', 'less', 'more', 'where', 'whence', 'at',

In [6]:
punctuations = string.punctuation
print(punctuations)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


# The spacy_tokenizer function tokenizes and preprocesses text using the spaCy library

Here's a step-by-step explanation of what it does:

**Tokenization:** It takes a sentence as input and runs it through spaCy's processing pipeline, which splits the sentence into individual tokens (words and punctuation marks).

**Lemmatization:** For each token, it retrieves the lemma (base form) that spaCy assigns. Reducing words to their base form means that inflected forms such as "eating" and "eat" end up as the same feature.

**Lowercasing:** It converts each lemma to lowercase with lower() and trims surrounding whitespace with strip(). This standardizes the text so that words that differ only in case are treated identically.

**Stopwords and Punctuation Removal:** It removes stopwords and punctuation from the list of tokens. Stopwords are common words that carry little meaning on their own in most natural language processing tasks, such as "and," "the," "is," etc. Punctuation marks are removed because they generally add little semantic value in a bag-of-words representation.

**Returning Preprocessed Tokens:** Finally, it returns the preprocessed list of tokens: lemmatized, lowercase tokens with stopwords and punctuation removed.

In [7]:
# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Run the spaCy pipeline to get a Doc: a sequence of tokens with linguistic annotations.
    doc = nlp(sentence)

    # Lemmatize each token, convert it to lowercase, and trim surrounding whitespace
    mytokens = [ word.lemma_.lower().strip() for word in doc ]

    # Remove stop words and punctuation. `punctuations` is a string, so the second
    # check is a substring test; it also drops tokens that are empty after strip().
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # Return the preprocessed list of tokens
    return mytokens

In [8]:
sentence="I am eating Apple"
spacy_tokenizer(sentence)

['eat', 'apple']
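
One detail of the punctuation filter is worth seeing in isolation. Because `punctuations` is a plain string, `word not in punctuations` is a substring test rather than a per-character membership test. The short check below uses only the standard library, and the sample tokens are made up for illustration rather than taken from the dataset.

```python
import string

punctuations = string.punctuation

# Illustrative tokens (invented, not from the dataset)
samples = [",", "", "()", "...", "eat"]

# Mirrors the filter in spacy_tokenizer: keep a token only if it is
# NOT a substring of the punctuation string
kept = [tok for tok in samples if tok not in punctuations]
print(kept)  # ['...', 'eat']
```

Single marks, the empty string, and adjacent pairs such as "()" are dropped, while a multi-character token such as "..." passes through. If that matters for a given corpus, one possible alternative (not used in this notebook) is to filter on spaCy's `token.is_punct` attribute before lemmatizing.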

> By setting the tokenizer parameter to spacy_tokenizer, the CountVectorizer calls this custom function on each document instead of applying its default token pattern, and then counts the returned tokens to build the matrix of token counts. This ensures that every document, in training and in testing, goes through the same tokenization and preprocessing before further analysis or modeling.

In [9]:
count_vector = CountVectorizer(tokenizer = spacy_tokenizer)

In [10]:
count_vector.fit_transform(["I am eating apple, I like apple","I am playing cricket"]).toarray() 



array([[2, 0, 1, 1, 0],
       [0, 1, 0, 0, 1]])

The get_feature_names_out() method of scikit-learn's CountVectorizer returns an array of the feature names generated during vectorization, in the same order as the columns of the count matrix.

**Purpose:** This method is useful for inspecting the vocabulary (i.e., unique tokens) learned by the CountVectorizer when it was fitted. It shows which words or phrases are used as features in the vectorized representation of the text data. Here, the first column of the matrix above counts 'apple', which appears twice in the first sentence.

In [11]:
count_vector.get_feature_names_out()

array(['apple', 'cricket', 'eat', 'like', 'play'], dtype=object)

The vocabulary_ attribute of scikit-learn's CountVectorizer is a dictionary that maps each unique token (word or phrase) to its column index in the feature matrix.

In [12]:
count_vector.vocabulary_

{'eat': 2, 'apple': 0, 'like': 3, 'play': 4, 'cricket': 1}
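
To make the link between the count matrix, the feature names, and `vocabulary_` concrete, here is a minimal pure-Python sketch of the same bookkeeping. It assumes the two sentences have already been reduced to the token lists shown, which is what `spacy_tokenizer` would be expected to return given the vocabulary above. The helper name `build_count_matrix` is hypothetical and is not part of scikit-learn.

```python
from collections import Counter

# Token lists assumed to come out of spacy_tokenizer for the two sentences above
docs = [
    ["eat", "apple", "like", "apple"],
    ["play", "cricket"],
]

def build_count_matrix(tokenized_docs):
    """Hypothetical helper: map sorted tokens to column indices, then count per document."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # One row per document, one column per vocabulary term
        rows.append([counts.get(tok, 0) for tok in vocab])
    return index, rows

vocabulary, matrix = build_count_matrix(docs)
print(vocabulary)  # {'apple': 0, 'cricket': 1, 'eat': 2, 'like': 3, 'play': 4}
print(matrix)      # [[2, 0, 1, 1, 0], [0, 1, 0, 0, 1]]
```

The sorted vocabulary gives the same column order as get_feature_names_out(), and each row matches the corresponding row of the array returned by fit_transform above.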

# Train Test Split

In [13]:
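from sklearn.model_selection import train_test_split
X=data['Text']
y=data['IsToxic']

# Hold out 20% of the comments for testing; stratify=y keeps the toxic/non-toxic ratio
# the same in both splits. No random_state is passed, so the split relies on the
# global NumPy seed set in the first cell.
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,stratify=y)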

# Creating the Model

In [14]:
from sklearn.linear_model import LogisticRegression
classifier=LogisticRegression()

CountVectorizer is used to convert the raw text data into a numerical format suitable for machine learning models. The training data is used to learn the vocabulary and transform the text into token counts, while the testing data is transformed using the same vocabulary learned from the training data.

In [15]:
X_train.head()

101    I wonder what the police expect will happen wh...
67     bassem I think walmart is hiring for the holid...
541                                 Anybody got a cigar?
616           *Monkey screamin bout honkies intensifies*
917    5:53 did you see that brick hit that white pig...
Name: Text, dtype: object

In [16]:
# Fit count_vector on the training text (learning the vocabulary) and transform it into a matrix of token counts.
X_train_vectors=count_vector.fit_transform(X_train)

# Transform the test text into a document-term matrix using the vocabulary learned from the training data.
X_test_vectors=count_vector.transform(X_test)



In [17]:
type(X_train_vectors)

scipy.sparse._csr.csr_matrix

In [18]:
X_test_vectors.shape

(200, 3153)

In [19]:
classifier.fit(X_train_vectors,y_train)

In [20]:
predicted=classifier.predict(X_test_vectors)
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

Logistic Regression Accuracy: 0.705
Logistic Regression Precision: 0.7538461538461538
Logistic Regression Recall: 0.532608695652174
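
The three scores can be traced back to confusion-matrix counts. The counts below are not printed by the notebook; they are inferred from the reported precision (49/65) and recall (49/92) together with the 200-row test set, so treat them as a reconstruction rather than notebook output.

```python
# Confusion-matrix counts inferred from the scores above (assumption, not notebook output)
tp, fp, fn, tn = 49, 16, 43, 92

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all comments classified correctly
precision = tp / (tp + fp)                  # share of predicted-toxic comments that are toxic
recall = tp / (tp + fn)                     # share of toxic comments that were caught

print(accuracy, precision, recall)
```

Read this way, the model misses 43 of the 92 toxic comments in the test set, which is why recall is much lower than precision.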


In [21]:
# Same tokenizer as before, but features are TF-IDF weights instead of raw counts
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
X_train_tfidf = tfidf_vector.fit_transform(X_train)
X_test_tfidf = tfidf_vector.transform(X_test)

In [22]:
classifier = LogisticRegression()
classifier.fit(X_train_tfidf,y_train)
predicted = classifier.predict(X_test_tfidf)
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

Logistic Regression Accuracy: 0.695
Logistic Regression Precision: 0.746031746031746
Logistic Regression Recall: 0.5108695652173914
