# **1. Words Vectorizing**

# Outcome
---

1. Understand how to vectorize text
2. Train a simple toxic comment classification model

---
## Outline:

1. Background
2. Simplified Workflows.
3. Importing Data
4. Data Preparations
5. Data Preprocessing
6. Modeling

# **Background**
---

## Problem Description
---

- A video streaming platform, **youtube.com**, has a community safety problem.
- Users can comment on uploaded videos.
- The comment feature on YouTube can foster both good and bad environments.
- One harmful form of comment is `Toxicity`.

Toxicity ~ *rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion*

Of course, this leads to a bad experience for content creators and other users.

## Business Objective
---

Our objective is to reduce the percentage of toxic comments that appear on videos.

## Solution
---


We can classify which comments are `toxic` (prediction) and remove them from the YouTube video's comment section (policy).

## Data Description
---

- The data is obtained from [Youtube Toxic Comment](https://www.kaggle.com/datasets/reihanenamdari/youtube-toxicity-data).

The dataset that we use is:

**toxic comment dataset** : `toxic_comment.csv`

<center>

|Features|Descriptions|Data Type|
|:--|:--|:--:|
|`CommentId`|The comment ID|`str`|
|`VideoId`|The video ID|`str`|
|`Text`|The comment text|`str`|
|`IsToxic`|Whether the comment text is toxic (`True`) or not (`False`)|`boolean`|


# **Workflow** (Simplified)
---

## <font color='blue'>1. Importing Data</font>

```
1. Load the data.
2. Check the shape & type of data.
3. Handle duplicate data to maintain data validity.
```

## <font color='blue'>2. Data Preparation</font>

```
Prepare the text for modeling
1. Split the data into training and test sets.
2. Preprocess the text (lowercase, remove digits and punctuation).
3. Vectorize the text (one-hot, bag of words, TF-IDF).
```

# **1. Importing Data**
---

What do we do?
1. Load the data.
2. Check the shape & type of data.
3. Handle duplicate data to maintain data validity.

## Load the data
---

In [1]:
# Load this library
import numpy as np
import pandas as pd

Load the data from given data path

In [2]:
toxic_comment_path = '../data/toxic_comment.csv'

toxic_comment_data = pd.read_csv(toxic_comment_path)

In [3]:
toxic_comment_data.head()

Unnamed: 0,CommentId,VideoId,Text,IsToxic
0,Ugg2KwwX0V8-aXgCoAEC,04kJtp6pVXI,If only people would just take a step back and...,False
1,Ugg2s5AzSPioEXgCoAEC,04kJtp6pVXI,Law enforcement is not trained to shoot to app...,True
2,Ugg3dWTOxryFfHgCoAEC,04kJtp6pVXI,\nDont you reckon them 'black lives matter' ba...,True
3,Ugg7Gd006w1MPngCoAEC,04kJtp6pVXI,There are a very large number of people who do...,False
4,Ugg8FfTbbNF8IngCoAEC,04kJtp6pVXI,"The Arab dude is absolutely right, he should h...",False


## Check data shapes & types
---

First we check the data shape

In [4]:
# Check the shape of data
toxic_comment_data.shape

(1000, 4)

There are 1,000 comments in the dataset, with 4 columns.

In [5]:
# Check the type of data
toxic_comment_data.dtypes

CommentId    object
VideoId      object
Text         object
IsToxic        bool
dtype: object

## Handling duplicate data
---

What do we currently know?
- The same comment could be recorded more than once in the dataset.
- Duplicated rows add no new information.

Hence, we need to check for fully duplicated rows first.

In [6]:
# Check duplicate data
toxic_comment_data.duplicated().sum()

np.int64(0)

There are no duplicated rows in this dataset. We still keep a `drop_duplicates` step in the load function as a safeguard.

## Create load function
---

Finally, we can create a function to load the data

In [7]:
def load_toxic_data(toxic_path, seed=42):
    """
    Function to load the toxic comment data,
        drop duplicated rows & drop the ID columns

    Parameters
    ----------
    toxic_path : str
        The path of the toxic comment dataset (.csv)

    seed : int, default=42
        For reproducibility (currently unused in this function)

    Returns
    -------
    toxic_data : pandas DataFrame
        YouTube toxic comment data with `Text` and `IsToxic` columns
    """
    # Load data
    toxic_data = pd.read_csv(toxic_path)
    print('Original data shape                 :', toxic_data.shape)

    # Drop duplicates & keep the first item
    toxic_data = (toxic_data
                  .drop_duplicates())

    print('Data shape after dropping duplicate :', toxic_data.shape)

    # drop columns
    toxic_data = toxic_data.drop(['CommentId','VideoId'],axis=1)

    print('Data shape after dropping columns :', toxic_data.shape)
    return toxic_data


In [8]:
toxic_comment_data = load_toxic_data(toxic_path = toxic_comment_path)

Original data shape                 : (1000, 4)
Data shape after dropping duplicate : (1000, 4)
Data shape after dropping columns : (1000, 2)


In [9]:
toxic_comment_data.head()

Unnamed: 0,Text,IsToxic
0,If only people would just take a step back and...,False
1,Law enforcement is not trained to shoot to app...,True
2,\nDont you reckon them 'black lives matter' ba...,True
3,There are a very large number of people who do...,False
4,"The Arab dude is absolutely right, he should h...",False


# **2. Data Preparation**
---

## Data Splitting

Our goal is to develop a model that is robust on future data. To make sure the model is good, we have to evaluate its performance on data it has not seen.

To do this, we train on one segment of the data (training data) and test on the rest (test data).

<img src="https://upload.wikimedia.org/wikipedia/commons/b/bb/ML_dataset_training_validation_test_sets.png">
<center><a href="https://upload.wikimedia.org/wikipedia/commons/b/bb/ML_dataset_training_validation_test_sets.png">Source </a></center>

In [10]:
# map the target value
toxic_comment_data['IsToxic'] = toxic_comment_data['IsToxic'].map({False : 0 , True : 1})

In [11]:
from sklearn.model_selection import train_test_split
# split data into target and features
target_col = 'IsToxic'
X = toxic_comment_data.drop(target_col,axis=1)
y = toxic_comment_data[target_col]

# split data into Training and Testing
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25)




In [12]:
# check shape
print(f'Training Shape : X {X_train.shape}, y : {y_train.shape}')
print(f'Test Shape : X {X_test.shape}, y : {y_test.shape}')

Training Shape : X (750, 1), y : (750,)
Test Shape : X (250, 1), y : (250,)


Before we dive into modeling, **remember** that:
1. Our machine learning model cannot directly model text data.
2. We need to **convert** the text data into a numerical representation.

This conversion is called vectorizing text. So what do we need to do?
1. Preprocess the text
2. Vectorize the text

## Text Preprocessing


---

### Lowercase

A word can be written in varying ways, one of which is capitalization.

`hello` is not equal to `Hello` or `heLLO`. We can solve this by lowercasing.

In [13]:

X_train['Text'] = X_train['Text'].str.lower()


### Removing Digits

We can remove digits, since they carry little meaning for the model on their own. We can do this with a `regular expression`.

In [14]:
import re

def remove_digits(text):

    text_without_digits = re.sub(r'\d', '', text)
    return text_without_digits

X_train['Text'] = X_train['Text'].apply(remove_digits)
X_train

Unnamed: 0,Text
602,blacks are able to do this because it was whit...
959,: what fucking evidence of by evidence you mea...
953,darren wilson should have had to stand trial a...
163,isn't this the dude who got knocked out and ha...
351,"very moving, very thought out!!!"
...,...
775,amen! tell it like it is sister amen!
571,move or be tazed then arrested.
910,everyone is saying micheal brown deserved it b...
678,the crazy thing is i thought offices never do ...


### Removing Punctuation
For now, our model does not know what punctuation means, so we remove it as well, again using a regular expression.

In [15]:
def remove_punctuation(text):
    # Use regular expression to remove punctuation
    text_without_punctuation = re.sub(r'[^\w\s]', '', text)
    return text_without_punctuation

X_train['Text'] = X_train['Text'].apply(remove_punctuation)
X_train['Text']

602    blacks are able to do this because it was whit...
959     what fucking evidence of by evidence you mean...
953    darren wilson should have had to stand trial a...
163    isnt this the dude who got knocked out and had...
351                         very moving very thought out
                             ...                        
775                  amen tell it like it is sister amen
571                       move or be tazed then arrested
910    everyone is saying micheal brown deserved it b...
678    the crazy thing is i thought offices never do ...
398    dear mr stephen molyneux  minutes is a bit lon...
Name: Text, Length: 750, dtype: object
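
Removing digits and punctuation can leave runs of spaces behind (see `molyneux  minutes` in the last row above). Splitting such text with `split(' ')` then produces empty-string tokens. The sketch below shows one possible cleanup step. `normalize_whitespace` is a hypothetical helper that is not part of this notebook, and the sample sentence (including the number in it) is made up for illustration.

```python
import re

def remove_digits(text):
    return re.sub(r'\d', '', text)

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

def normalize_whitespace(text):
    # Hypothetical helper: collapse any run of whitespace into one space
    # and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample sentence, used only for illustration
raw = 'dear mr. stephen molyneux, 45 minutes is a bit long'
cleaned = remove_punctuation(remove_digits(raw))

print(repr(cleaned))                        # still contains a double space
print(cleaned.split(' '))                   # includes an empty-string token
print(repr(normalize_whitespace(cleaned)))  # single spaces only
```

If this step were adopted, it would go after the digit and punctuation removal, since those are the steps that create the extra spaces.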

### Removing Stopwords

Stopwords are words that appear very frequently in sentences while carrying little meaning. In English, stopwords include:
1. Pronouns
2. Articles
3. Conjunctions
4. Prepositions

We can find a list of English stopwords in NLTK.

Next, we will create a function to remove stopwords.
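
A minimal sketch of such a function is shown here. To keep it self-contained, it uses a tiny hand-written stopword set (`EXAMPLE_STOPWORDS`, an assumption made for this example). In practice the list would more likely come from NLTK, for example `nltk.corpus.stopwords.words('english')` after downloading the `stopwords` resource.

```python
# Tiny illustrative stopword set (an assumption for this example);
# a real English stopword list is much longer
EXAMPLE_STOPWORDS = {'a', 'an', 'the', 'and', 'or', 'but', 'in', 'on',
                     'to', 'of', 'i', 'you', 'it', 'is', 'are'}

def remove_stopwords(text, stopwords=EXAMPLE_STOPWORDS):
    """Drop every whitespace-separated token that is in `stopwords`."""
    kept = [word for word in text.split() if word not in stopwords]
    return ' '.join(kept)

print(remove_stopwords('move or be tazed then arrested'))
print(remove_stopwords('amen tell it like it is sister amen'))
```

The function expects text that is already lowercased, because the stopword set contains lowercase words only.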

### Tokenize Text
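
Tokenizing means splitting a text into smaller units (tokens), here words. The sketch below is one simple way to do it and assumes that whitespace is a good enough word boundary for this dataset. `tokenize` is a hypothetical helper name.

```python
def tokenize(text):
    """Split text into tokens on any run of whitespace."""
    # str.split() with no argument never returns empty strings,
    # unlike split(' '), which does when spaces are repeated
    return text.split()

print(tokenize('very moving  very thought out'))
```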

### Create Processing function

In [16]:
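def preprocess_text(data,text_col) :
    """
    Function to preprocess text data

    Parameters :
    ----------
    data : pd.DataFrame
         dataframe that contains the text data
    text_col : str
          column name of the text

    Returns :
    ----------
    data : pd.DataFrame
         the same dataframe, with `text_col` cleaned in place
    """
    # note: lowercasing is not applied here. X_train was lowercased in an
    # earlier cell; X_test keeps its casing until the scikit-learn
    # vectorizers lowercase it (their default is lowercase=True)

    # remove digits
    data[text_col] = data[text_col].apply(remove_digits)

    # remove punctuation
    data[text_col] = data[text_col].apply(remove_punctuation)

    return data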



In [17]:
X_train_processed = preprocess_text(data= X_train, text_col= 'Text')
X_train_processed

Unnamed: 0,Text
602,blacks are able to do this because it was whit...
959,what fucking evidence of by evidence you mean...
953,darren wilson should have had to stand trial a...
163,isnt this the dude who got knocked out and had...
351,very moving very thought out
...,...
775,amen tell it like it is sister amen
571,move or be tazed then arrested
910,everyone is saying micheal brown deserved it b...
678,the crazy thing is i thought offices never do ...


In [18]:
X_test_processed = preprocess_text(data= X_test, text_col= 'Text')
X_test_processed

Unnamed: 0,Text
7,I would LOVE to see this pussy go to Staten Is...
631,There should be a law where you wont be prosec...
931,I like how the begining it was peaceful and no...
682,Tell them they are breaking the law by blockin...
699,RUN THEM OVER
...,...
413,Fascinating to hear actual facts especially co...
282,Is the cell phone video being shown anymore Al...
164,Bassem Masri is an uneducated idiot Just liste...
505,If you watch the video that Stef mentions at ...


## Vectorizing Text ( From Scratch )
---

Data is the input to a *machine learning* model, and it comes in various types.

Some of them:
1. Numerical
2. Categorical
3. String
4. etc.

**Problem**:

However, a machine learning model only accepts a **numerical representation** of the data. So what if we have non-numerical data such as text?

**Solution**:

For now, we can convert each text into a numerical representation by recording whether each word / token is present.

### One Hot Encoding

The first approach is **One-Hot Encoding**, which creates a vector for a sentence / text based on word presence.

<center>
<img src="../assets/word_vectorizing/one-hot.webp" img>

[Source](https://www.mygreatlearning.com/blog/basics-of-building-an-artificial-intelligence-chatbot/)

What do we have to do?

1. Collect the unique vocabulary
2. Create a mapping from word to index and vice versa
3. Create a lookup table to store a vector for each sentence
4. Loop over every sentence and check each word

    if the word exists, fill in 1

In [19]:
corpus = [
    'Hello World',
    'Welcomee to Disney World',
    'The flight is on delay'
]

In [20]:
# collect unique words
vocab = []
for sentence in corpus :
    # lowercase
    text = sentence.lower()
    word_split = text.split(' ')
    #append word
    for word in word_split :
        vocab.append(word)

# remove duplicate using set
vocab_ = set(vocab)


In [21]:
vocab_

{'delay',
 'disney',
 'flight',
 'hello',
 'is',
 'on',
 'the',
 'to',
 'welcomee',
 'world'}

In [22]:
# create mapping
word_to_idx = {word:idx for idx,word in enumerate(vocab_)}
idx_to_word = {idx:word for idx,word in enumerate(vocab_)}
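
One caveat: a Python `set` has no guaranteed iteration order, and string hashing is randomized per process by default, so the index assigned to each word can change between runs. A small sketch of one way to make the mapping reproducible is to sort the vocabulary first. The names `vocab_example`, `word_to_idx_sorted` and `idx_to_word_sorted` are made up for this example.

```python
vocab_example = {'world', 'hello', 'disney', 'to', 'welcomee'}

# Sorting fixes the column order, so the mapping is identical on every run
word_to_idx_sorted = {word: idx
                      for idx, word in enumerate(sorted(vocab_example))}
idx_to_word_sorted = {idx: word for word, idx in word_to_idx_sorted.items()}

print(word_to_idx_sorted)
print(idx_to_word_sorted)
```

The rest of this notebook keeps the unsorted mapping, so the column order in the outputs shown here may differ from what you see when you run it.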

In [23]:
# create presence matrix: rows --> number of sentences, columns --> unique vocab
n_vocab = len(vocab_)
n_sentences = len(corpus)
word_presence_matrix = np.zeros(shape=(n_sentences,n_vocab))
word_presence_matrix

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [24]:
for idx,sentence in enumerate(corpus) :
    text = sentence.lower()
    #append word
    for word in text.split(' ') :
        #get the index of the word
        word_index = word_to_idx[word]
        print(f'Word Index : {word_index}, Sentence Index : {idx}')
        #change the empty matrix
        word_presence_matrix[idx,word_index] = 1

Word Index : 0, Sentence Index : 0
Word Index : 8, Sentence Index : 0
Word Index : 6, Sentence Index : 1
Word Index : 1, Sentence Index : 1
Word Index : 5, Sentence Index : 1
Word Index : 8, Sentence Index : 1
Word Index : 4, Sentence Index : 2
Word Index : 7, Sentence Index : 2
Word Index : 2, Sentence Index : 2
Word Index : 3, Sentence Index : 2
Word Index : 9, Sentence Index : 2


In [25]:
word_presence_matrix

array([[1., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 1., 1., 0., 1., 0.],
       [0., 0., 1., 1., 1., 0., 0., 1., 0., 1.]])

Let's visualize it as a DataFrame

In [26]:
onehot_vector = pd.DataFrame(word_presence_matrix)
onehot_vector.columns = list(word_to_idx.keys())
onehot_vector.index = corpus
onehot_vector

Unnamed: 0,hello,to,is,on,the,disney,welcomee,flight,world,delay
Hello World,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
Welcomee to Disney World,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
The flight is on delay,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0


Now our sentences have a feature representation

**OOP Version**

In [27]:
class OneHotVectorizer :

    def generate_vocabulary(self) :
        """
            Function to create vocabulary
        """
        self.vocabulary = set()
        for sentence in self.corpus:
            #perform tokenization , by splitting by space
            word_split = sentence.lower().split(' ')
            #append word
            for word in word_split :
                self.vocabulary.add(word)

    def fit(self,corpus) :
        """
        Function to create vocabulary and saving corpus

        Parameters :
        ----------

        corpus : list / iterables
            contain collection of text

        Returns  :
        ----------

        """
        self.corpus = corpus
        self.generate_vocabulary()
        self.word_idx = {word : idx for idx,word in enumerate(self.vocabulary)}
        self.idx_word  = {idx : word for idx,word in enumerate(self.vocabulary)}

    def get_vocab(self) :
        return self.vocabulary

    def get_mapping(self) :
        return self.word_idx,self.idx_word


    def transform(self,text) :
        """
        Function to convert text into vector by assigning word that are present as 1

        Parameters :
        ----------

        text : list
            list contain texts

        Returns  :
        ----------
        text_features : numpy.array
                   array of features from vocabulary
        """
        print(len(self.vocabulary))
        text_features = np.zeros(shape=(len(text),len(self.vocabulary)))
        for idx_sentence,sentence in enumerate(text) :
            for word in sentence.lower().split(' ') :
                #get the index of the word
                if word not in self.word_idx.keys() :
                    continue
                word_index = self.word_idx[word]

                #assign value

                text_features[idx_sentence,word_index]  = 1
        return text_features

In [28]:
# let's try to validate the result
onehot_vectorizer = OneHotVectorizer()
onehot_vectorizer.fit(corpus)
onehot_textfeatures = onehot_vectorizer.transform(corpus)


10


In [29]:
onehot_df = pd.DataFrame(onehot_textfeatures)
onehot_df.columns = list(onehot_vectorizer.get_mapping()[0].keys())
onehot_df.index = corpus
onehot_df

Unnamed: 0,hello,to,is,on,the,disney,welcomee,flight,world,delay
Hello World,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
Welcomee to Disney World,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
The flight is on delay,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0


In [30]:
# validate
onehot_df.to_numpy()==onehot_vector.to_numpy()

array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True]])

### Count Based Vector or Bag of Words

Previously, our approach only detected whether a vocabulary word is present in a sentence. Now we also record how often it occurs.

In [31]:
corpus_bow = [
    'Hello World',
    'Welcomee to Disney World',
    'The flight is on delay',
    'The flight because the runaway is being used for another flight departure'
]

In [32]:
# collect unique words
vocab_bow = []
for sentence in corpus_bow :
    # lowercase
    text = sentence.lower()
    word_split = text.split(' ')
    #append word
    for word in word_split :
        vocab_bow.append(word)

# remove duplicate using set
vocab_bow = set(vocab_bow)


In [33]:
word_to_idx = {word:idx for idx,word in enumerate(vocab_bow)}
idx_to_word = {idx:word for idx,word in enumerate(vocab_bow)}

In [34]:
# create count matrix: rows --> number of sentences, columns --> unique vocab
n_vocab = len(vocab_bow)
n_sentences = len(corpus_bow)
bow_matrix = np.zeros(shape=(n_sentences,n_vocab))
bow_matrix

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.]])

In [35]:
for idx,sentence in enumerate(corpus_bow) :
    text = sentence.lower()
    #append word
    for word in text.split(' ') :
        #get the index of the word
        word_index = word_to_idx[word]
        print(f'Word Index : {word_index}, Sentence Index : {idx}')
        #change the empty matrix
        bow_matrix[idx,word_index] += 1

Word Index : 0, Sentence Index : 0
Word Index : 13, Sentence Index : 0
Word Index : 10, Sentence Index : 1
Word Index : 2, Sentence Index : 1
Word Index : 8, Sentence Index : 1
Word Index : 13, Sentence Index : 1
Word Index : 6, Sentence Index : 2
Word Index : 11, Sentence Index : 2
Word Index : 3, Sentence Index : 2
Word Index : 5, Sentence Index : 2
Word Index : 14, Sentence Index : 2
Word Index : 6, Sentence Index : 3
Word Index : 11, Sentence Index : 3
Word Index : 4, Sentence Index : 3
Word Index : 6, Sentence Index : 3
Word Index : 7, Sentence Index : 3
Word Index : 3, Sentence Index : 3
Word Index : 9, Sentence Index : 3
Word Index : 1, Sentence Index : 3
Word Index : 12, Sentence Index : 3
Word Index : 15, Sentence Index : 3
Word Index : 11, Sentence Index : 3
Word Index : 16, Sentence Index : 3


In [36]:
bow_matrix

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 1., 0., 0.,
        0.],
       [0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
        0.],
       [0., 1., 0., 1., 1., 0., 2., 1., 0., 1., 0., 2., 1., 0., 0., 1.,
        1.]])

In [37]:
bow_vector = pd.DataFrame(bow_matrix)
bow_vector.columns = list(word_to_idx.keys())
bow_vector.index = corpus_bow
bow_vector

Unnamed: 0,hello,used,to,is,because,on,the,runaway,disney,being,welcomee,flight,for,world,delay,another,departure
Hello World,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Welcomee to Disney World,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
The flight is on delay,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
The flight because the runaway is being used for another flight departure,0.0,1.0,0.0,1.0,1.0,0.0,2.0,1.0,0.0,1.0,0.0,2.0,1.0,0.0,0.0,1.0,1.0


**OOP Version**

In [38]:
class CountVectorizer :

    def generate_vocabulary(self) :
        """
            Function to create vocabulary
        """
        self.vocabulary = set()
        for sentence in self.corpus:
            #perform tokenization , by splitting by space
            word_split = sentence.lower().split(' ')
            #append word
            for word in word_split :
                self.vocabulary.add(word)

    def fit(self,corpus) :
        """
        Function to create vocabulary and saving corpus

        Parameters :
        ----------

        corpus : list / iterables
            contain collection of text

        Returns  :
        ----------

        """
        self.corpus = corpus
        self.generate_vocabulary()
        self.word_idx = {word : idx for idx,word in enumerate(self.vocabulary)}
        self.idx_word  = {idx : word for idx,word in enumerate(self.vocabulary)}

    def get_vocab(self) :
        return self.vocabulary

    def get_mapping(self) :
        return self.word_idx,self.idx_word


    def transform(self,text) :
        """
        Function to convert text into vectors by counting how many times
        each vocabulary word appears

        Parameters :
        ----------

        text : list
            list contain texts

        Returns  :
        ----------
        text_features : numpy.array
                   array of word counts, one column per vocabulary word
        """
        print(len(self.vocabulary))
        text_features = np.zeros(shape=(len(text),len(self.vocabulary)))
        for idx_sentence,sentence in enumerate(text) :
            for word in sentence.lower().split(' ') :
                #get the index of the word
                if word not in self.word_idx.keys() :
                    continue
                word_index = self.word_idx[word]

                #assign value

                text_features[idx_sentence,word_index]+= 1
        return text_features

In [39]:
# let's try to validate the result
count_vectorizer = CountVectorizer()
count_vectorizer.fit(corpus_bow)
count_textfeatures = count_vectorizer.transform(corpus_bow)


17


In [40]:
count_pd = pd.DataFrame(count_textfeatures)
count_pd.columns = list(count_vectorizer.get_mapping()[0].keys())
count_pd.index = corpus_bow
count_pd

Unnamed: 0,hello,used,to,is,because,on,the,runaway,disney,being,welcomee,flight,for,world,delay,another,departure
Hello World,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Welcomee to Disney World,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
The flight is on delay,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
The flight because the runaway is being used for another flight departure,0.0,1.0,0.0,1.0,1.0,0.0,2.0,1.0,0.0,1.0,0.0,2.0,1.0,0.0,0.0,1.0,1.0


In [41]:
count_pd.to_numpy() == bow_vector.to_numpy()

array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True]])

We get the same result.

So far we have learned two count-based approaches:
1. One-Hot Encoding
2. Count Vectorizer

These approaches result in:
1. One-Hot Encoding produces only `0` or `1` values, so every present word gets the same weight.
2. Count Vectorizing produces values with a minimum of `0` and no upper bound, so frequent words can get very high values.




Can we have vector values that represent the importance of each word, without the wide value range of `CountVectorizer`?

### TF-IDF

Here comes TF-IDF.

TF-IDF stands for Term Frequency - Inverse Document Frequency and is a *weighting* function.

Why do we need it? Because not all terms are equally relevant for describing a document. In short, we measure the term frequency, weighted by the term's rarity across documents.

$$
\text{IDF}_{\text{term}}
=
\log
\left(
    \cfrac
    {\text{Total Documents}}
    {\text{Documents With Term}}
\right)
$$

Thus:

$$
\text{TF-IDF}_{\text{term}}
=
\text{TF}_{\text{term}}
\cdot
\text{IDF}_{\text{term}}
$$
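
As a worked example of the two formulas, the sketch below computes TF-IDF for the words of the first sentence in a small three-sentence corpus. It assumes that TF is the word count divided by the number of words in the sentence and that the logarithm is the natural log. The helper names `tf` and `idf` are made up for this example.

```python
import math

corpus_example = [
    'hello world',
    'welcomee to disney world',
    'the flight is on delay',
]
documents = [sentence.split() for sentence in corpus_example]

def tf(term, tokens):
    # share of the sentence's tokens that equal `term`
    return tokens.count(term) / len(tokens)

def idf(term, documents):
    # natural log of (total documents / documents containing the term);
    # the term must appear in at least one document
    n_containing = sum(1 for tokens in documents if term in tokens)
    return math.log(len(documents) / n_containing)

for term in ('hello', 'world'):
    score = tf(term, documents[0]) * idf(term, documents)
    print(term, round(score, 6))
```

`world` appears in two of the three sentences, so its IDF is lower and it receives a smaller weight than `hello`, even though both occur once in the first sentence.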

The output of `TF-IDF` has the same shape as the previous vectorization methods: a matrix of size `<number of sentences x number of vocab>`, filled with the TF-IDF values.

**First**, we create a matrix of zeros to hold the features, with size `<number of sentences x number of vocab>`

In [42]:
text_features = np.zeros(shape=(len(corpus),
                                  len(vocab_)))

In [43]:
text_features

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

**Second**, we calculate the **Term Frequency**: the number of times each vocabulary word appears in a sentence, divided by the number of words in that sentence. We can do this by looping over every sentence and its words.
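
Written in the same notation as the IDF formula, this is only a restatement of the sentence above:

$$
\text{TF}_{\text{term}}
=
\cfrac
{\text{Occurrences of Term in Sentence}}
{\text{Total Words in Sentence}}
$$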

In [44]:
from collections import Counter

In [45]:
word_idx = {word : idx for idx,word in enumerate(vocab_)}
idx_word  = {idx : word for idx,word in enumerate(vocab_)}

for idx_sentence,sentence in enumerate(corpus) :
    #use this
    word_count = Counter(sentence.lower().split(' '))
    for word in sentence.lower().split(' ') :
        #get the index of the word
        if word not in word_idx.keys() :
            continue
        word_count_in_sentence = word_count[word]
        word_index = word_idx[word]

        #assign value
        text_features[idx_sentence,word_index] = word_count_in_sentence / len(sentence.split(' '))

We will check the non-zero values in sentence 1

In [46]:
location_idx = []
for idx in range(len(text_features[0])) :

    if text_features[0,int(idx)]==0 :
        continue
    else :
        location_idx.append(idx)


In [47]:
sentence_1_non_zero = text_features[0,location_idx]
sentence_1_non_zero

array([0.5, 0.5])

For easier comparison, we can use a DataFrame

In [48]:
sentence_1_df = pd.DataFrame(index=[0])
sentence_1_df['sentence'] = corpus[0]
for idx in location_idx :
    word = idx_word[idx]
    sentence_1_df[word] = text_features[0,int(idx)]


In [49]:
sentence_1_df

Unnamed: 0,sentence,hello,world
0,Hello World,0.5,0.5


We can wrap this operation in a function that inspects the non-zero values of the matrix.
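
A possible version of that helper is sketched here. `inspect_nonzero` is a hypothetical name, and the toy inputs stand in for `text_features` and `idx_word`. The function only indexes a row and iterates over it, so it should accept a NumPy array as well as a list of lists.

```python
def inspect_nonzero(features, idx_to_word, row):
    """Return {word: value} for the non-zero entries of one row."""
    return {idx_to_word[col]: value
            for col, value in enumerate(features[row])
            if value != 0}

# Toy stand-ins for `text_features` and `idx_word`
toy_features = [[0.5, 0.0, 0.5],
                [0.0, 1.0, 0.0]]
toy_idx_word = {0: 'hello', 1: 'disney', 2: 'world'}

print(inspect_nonzero(toy_features, toy_idx_word, row=0))
print(inspect_nonzero(toy_features, toy_idx_word, row=1))
```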

**Next**, we calculate the **Inverse Document Frequency** (IDF): the logarithm of the total number of documents (sentences / texts in the corpus) divided by the number of documents that contain the word.

To calculate this, we first need the number of documents that mention each word, which we store in a dictionary.

Each word is counted at most once per sentence, which is why we loop over the `set` of words in each sentence.

In [50]:
word_mention = {}
for sentence in corpus :
    # loop over the unique words in the sentence
    for word in set(sentence.lower().split(' ')) :
        if word not in word_mention.keys() :
            word_mention[word] =1
        else :
            word_mention[word] +=1


In [51]:
word_mention

{'world': 2,
 'hello': 1,
 'disney': 1,
 'to': 1,
 'welcomee': 1,
 'is': 1,
 'on': 1,
 'the': 1,
 'delay': 1,
 'flight': 1}

Time to combine both TF and IDF

In [52]:
total_document = len(corpus)
for idx_sentence,sentence in enumerate(corpus) :
    # loop over the unique lowercased words, so each cell is scaled only once
    for word in set(sentence.lower().split(' ')) :
        #get the index of the word
        if word not in word_idx.keys() :
            continue
        word_index = word_idx[word]
        number_word_mention  = word_mention[word]

        idf = np.log(total_document / number_word_mention)
        #previously we already computed tf, now multiply it by idf
        text_features[idx_sentence,word_index]*=idf

In [53]:
text_features

array([[0.54930614, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.20273255, 0.        ],
       [0.        , 0.27465307, 0.        , 0.        , 0.        ,
        0.27465307, 0.27465307, 0.        , 0.10136628, 0.        ],
       [0.        , 0.        , 0.21972246, 0.21972246, 0.21972246,
        0.        , 0.        , 0.21972246, 0.        , 0.21972246]])

In [54]:
tfidf_vector = pd.DataFrame(text_features)
tfidf_vector.columns = list(word_idx.keys())
tfidf_vector.index = corpus
tfidf_vector

Unnamed: 0,hello,to,is,on,the,disney,welcomee,flight,world,delay
Hello World,0.549306,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.202733,0.0
Welcomee to Disney World,0.0,0.274653,0.0,0.0,0.0,0.274653,0.274653,0.0,0.101366,0.0
The flight is on delay,0.0,0.0,0.219722,0.219722,0.219722,0.0,0.0,0.219722,0.0,0.219722


Now it's time to wrap this into a class

In [55]:
class TFIDFVectorizer :
    """
        Class to create TF-IDF vectorizer
    """
    def generate_vocabulary(self) :
        self.vocabulary = set()
        for sentence in self.corpus:
            #perform tokenization , by splitting by space
            word_split = sentence.lower().split(' ')
            #append word
            for word in word_split :
                self.vocabulary.add(word)
    def generate_number_word_mention(self) :
        self.word_mention = {}
        for sentence in self.corpus :
            # loop over the unique words in the sentence
            for word in set(sentence.lower().split(' ')) :
                if word not in self.word_mention.keys() :
                    self.word_mention[word] =1
                else :
                    self.word_mention[word] +=1

    def fit(self,corpus) :
        """
        Function to create vocabulary and saving corpus

        Parameters :
        ----------

        corpus : list / iterables
            contain collection of text

        Returns  :
        ----------

        """
        self.corpus = corpus
        self.generate_vocabulary()
        self.generate_number_word_mention()
        self.word_idx = {word : idx for idx,word in enumerate(self.vocabulary)}
        self.idx_word  = {idx : word for idx,word in enumerate(self.vocabulary)}
#         self.sentence_idx = {sentence:idx for idx,sentence in enumerate(self.corpus)}
#         self.idx_sentence = {idx:sentence for idx,sentence in enumerate(self.corpus)}

    def get_vocab(self) :
        return self.vocabulary

    def get_mapping(self) :
        return self.word_idx,self.idx_word


    def transform(self,text) :
        """
        Function to convert text into TF-IDF vectors

        Parameters :
        ----------

        text : list
            list contain texts

        Returns  :
        ----------
        text_features : numpy.array
                   array of TF-IDF values, one column per vocabulary word
        """
        text_features = np.zeros(shape=(len(text),len(self.vocabulary)))
        for idx_sentence,sentence in enumerate(text) :
            word_count = Counter(sentence.lower().split(' '))
            for word in sentence.lower().split(' ') :
                #get the index of the word
                if word not in self.word_idx.keys() :
                    continue
                word_count_in_sentence = word_count[word]
                #use the mapping built in fit(), not the global `word_idx`
                word_index = self.word_idx[word]

                #calculate tf
                tf = word_count_in_sentence / len(sentence.lower().split(' '))

                #calculate idf from the fitted corpus, not from the input size
                number_word_mention  = self.word_mention[word]
                idf = np.log(len(self.corpus) / number_word_mention)

                #calculate tfxidf
                text_features[idx_sentence,word_index] = tf*idf
        return text_features



In [56]:
tfidf_vectorizer = TFIDFVectorizer()

In [57]:

tfidf_vectorizer.fit(corpus)
text_features_tfidf = tfidf_vectorizer.transform(corpus)

In [58]:
tfidf_df = pd.DataFrame(text_features_tfidf)
tfidf_df.columns = list(tfidf_vectorizer.get_mapping()[0])
tfidf_df.index = corpus

In [59]:
tfidf_df

Unnamed: 0,hello,to,is,on,the,disney,welcomee,flight,world,delay
Hello World,0.549306,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.202733,0.0
Welcomee to Disney World,0.0,0.274653,0.0,0.0,0.0,0.274653,0.274653,0.0,0.101366,0.0
The flight is on delay,0.0,0.0,0.219722,0.219722,0.219722,0.0,0.0,0.219722,0.0,0.219722
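
To see where these numbers come from, a couple of cells can be re-derived by hand. The sketch below is a minimal standalone check, assuming (as the output suggests) that `tf` is the word count divided by the sentence length and `idf` is the natural log of the number of documents divided by the number of documents containing the word. The helper name `tf_idf` is illustrative only and is not part of the class above.

```python
import math

corpus = ["Hello World", "Welcomee to Disney World", "The flight is on delay"]
# Same tokenization as the class above: lowercase, split on single spaces
docs = [doc.lower().split(" ") for doc in corpus]


def tf_idf(word, doc_tokens, all_docs):
    """Return tf * idf of `word` for one tokenized document."""
    tf = doc_tokens.count(word) / len(doc_tokens)
    # Document frequency: number of documents that contain the word
    df = sum(1 for tokens in all_docs if word in tokens)
    idf = math.log(len(all_docs) / df)
    return tf * idf


hello_score = tf_idf("hello", docs[0], docs)  # (1/2) * ln(3/1)
world_score = tf_idf("world", docs[0], docs)  # (1/2) * ln(3/2)
print(round(hello_score, 6), round(world_score, 6))  # 0.549306 0.202733
```

`world` appears in two of the three sentences, so its idf is smaller and it gets a lower weight than `hello` in the same sentence.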


### Notes

All 3 approaches above are still based on word / token frequency, so they may not capture the contextual meaning of a word. In addition, a word / token that does not exist in the vocabulary has no numerical representation (vector) at all.
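
The out-of-vocabulary limitation can be seen in a small sketch. It uses a toy corpus and a hypothetical helper `count_vector` (neither is from the notebook), and a plain count representation for simplicity.

```python
from collections import Counter

# Toy training corpus (hypothetical, for illustration only)
train_corpus = ["hello world", "welcome to disney world"]
vocabulary = sorted({word for doc in train_corpus for word in doc.split(" ")})
# vocabulary is ['disney', 'hello', 'to', 'welcome', 'world']


def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split(" "))
    return [counts[word] for word in vocabulary]


known = count_vector("hello disney", vocabulary)
unseen = count_vector("flight delayed", vocabulary)
print(known)   # [1, 1, 0, 0, 0]
print(unseen)  # [0, 0, 0, 0, 0]
```

A text made only of unseen words collapses to an all-zero vector, so the model gets no information from it.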

### Experiment

In this phase we will experiment with and compare several text vectorization techniques.



## Vectorizing Text
---

### TF-IDF Vectorizer

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

tfidf.fit(X_train_processed['Text'])

In [61]:
X_train_tfidf = tfidf.transform(X_train_processed['Text'])
X_train_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 19086 stored elements and shape (750, 3969)>

In [62]:
X_test_tfidf = tfidf.transform(X_test_processed['Text'])
X_test_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 5520 stored elements and shape (250, 3969)>
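
Note that scikit-learn's `TfidfVectorizer` with default settings does not use exactly the same formula as the hand-written class earlier: it uses the raw count as tf, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and then L2-normalizes each row. The sketch below reproduces that default formula in plain Python on the small three-sentence corpus. The function name `sklearn_style_tfidf` is hypothetical, and the simple `split(" ")` tokenization is an assumption that happens to match scikit-learn's tokenizer for these sentences.

```python
import math
from collections import Counter

corpus = ["Hello World", "Welcomee to Disney World", "The flight is on delay"]
docs = [doc.lower().split(" ") for doc in corpus]
n_docs = len(docs)
# Document frequency: in how many documents each word appears
doc_freq = Counter(word for tokens in docs for word in set(tokens))


def sklearn_style_tfidf(tokens):
    """TF-IDF with smoothed idf and L2 row normalization."""
    counts = Counter(tokens)
    raw = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}


weights = sklearn_style_tfidf(docs[0])
print({word: round(value, 3) for word, value in weights.items()})
# {'hello': 0.796, 'world': 0.605}
```

These values differ from the 0.549306 and 0.202733 produced by the hand-written class, which is expected because the two formulas are not the same.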

### Bag of Words / Count Vectorizer

In [63]:
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer()

bow.fit(X_train_processed['Text'])

In [64]:
X_train_bow = bow.transform(X_train_processed['Text'])
X_train_bow

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 19086 stored elements and shape (750, 3969)>

In [65]:
X_test_bow = bow.transform(X_test_processed['Text'])
X_test_bow

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 5520 stored elements and shape (250, 3969)>

# **3. Modelling**
---

Now that we know how to convert text data into vectors / features, it's time to model our data.

**Baseline Model**

In [66]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

baseline_model = DummyClassifier(strategy="most_frequent")

baseline_model_cv = cross_val_score(estimator=baseline_model,X=X_train_tfidf,y=y_train,cv=5).mean()

In [67]:
baseline_model_cv

np.float64(0.5519999999999999)
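
A `most_frequent` baseline ignores the features and always predicts the majority class, so its accuracy is roughly the share of that class in the data. The sketch below shows the idea with a hypothetical list of labels (not the notebook's `y_train`).

```python
from collections import Counter

# Hypothetical labels: 1 = toxic, 0 = not toxic
toy_labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

# The majority class is the label that occurs most often
majority_class, majority_count = Counter(toy_labels).most_common(1)[0]

# Predict the majority class for every sample
toy_predictions = [majority_class] * len(toy_labels)
toy_baseline_acc = sum(
    pred == true for pred, true in zip(toy_predictions, toy_labels)
) / len(toy_labels)
print(majority_class, toy_baseline_acc)  # 1 0.6
```

Any model we train should beat this number, otherwise it has learned nothing useful from the text.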

### TF-IDF Vectorizer

**Logistic Regression**

In [68]:
from sklearn.linear_model import LogisticRegression

log_reg_tfidf = LogisticRegression()


cv_tfidf_logreg = cross_val_score(estimator=log_reg_tfidf,X=X_train_tfidf,y=y_train,cv=5).mean()
cv_tfidf_logreg

np.float64(0.6506666666666667)

**Support Vector Classifier**

In [69]:
from sklearn.svm import SVC
svc = SVC()
cv_tfidf_svc = cross_val_score(estimator=svc,X=X_train_tfidf,y=y_train,cv=5).mean()
cv_tfidf_svc

np.float64(0.6266666666666667)

### Count Vectorizer

**Logistic Regression**

In [70]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

log_reg_bow = LogisticRegression()


cv_bow_logreg = cross_val_score(estimator=log_reg_bow,X=X_train_bow,y=y_train,cv=5).mean()
cv_bow_logreg

np.float64(0.6826666666666668)

**Support Vector Classifier**

In [71]:
from sklearn.svm import SVC
svc_bow = SVC()
cv_bow_svc = cross_val_score(estimator=svc_bow,X=X_train_bow,y=y_train,cv=5).mean()
cv_bow_svc

np.float64(0.5666666666666667)

### Model Selection Summary

In [72]:
summary = pd.DataFrame({
    'vectorizer' : ['baseline','bag of words','bag of words','tfidf','tfidf'],
    'model_name' : ['majority class','logistic regression','svm','logistic regression','svm'],
    'accuracy-cv5' : [baseline_model_cv,cv_bow_logreg,cv_bow_svc,cv_tfidf_logreg,cv_tfidf_svc]
})
summary

Unnamed: 0,vectorizer,model_name,accuracy-cv5
0,baseline,majority class,0.552
1,bag of words,logistic regression,0.682667
2,bag of words,svm,0.566667
3,tfidf,logistic regression,0.650667
4,tfidf,svm,0.626667


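`Logistic Regression` with bag of words yields the best cross-validation accuracy (0.683), followed by `Logistic Regression` with `TF-IDF` (0.651). All models beat the majority-class baseline (0.552). The sections below continue with the `Logistic Regression` + `TF-IDF` combination.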
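
The scores above were computed on matrices that were vectorized once on the full training set, so inside each cross-validation fold the vocabulary and idf weights have already seen the validation rows. One common way to avoid this is to wrap the vectorizer and the model in a scikit-learn pipeline, so the vectorizer is refit on the training part of every fold. The sketch below uses hypothetical toy texts and labels, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical): 1 = toxic, 0 = not toxic
toy_texts = [
    "you are awful", "great video thanks", "this is garbage",
    "love this channel", "stupid people everywhere", "very helpful content",
    "shut up idiot", "nice explanation", "worst video ever", "thank you so much",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# The pipeline refits the vectorizer on the training part of each fold
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels, cv=5)
print(toy_scores.mean())
```

In the notebook this would mean passing the raw `X_train_processed['Text']` column to `cross_val_score` together with a pipeline, instead of the precomputed `X_train_tfidf`.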

### Training Best Model


Having compared the models using cross-validation, now it's time to train the chosen model (`Logistic Regression` with `TF-IDF`) on the training data and evaluate it on the test data.

In [73]:
from sklearn.linear_model import LogisticRegression

best_model = LogisticRegression()


best_model.fit(X_train_tfidf,y_train)


### Evaluate on Test Data

In [74]:
y_pred = best_model.predict(X_test_tfidf)

In [75]:
# evaluate
from sklearn.metrics  import accuracy_score,confusion_matrix


test_acc = accuracy_score(y_test,y_pred)

In [76]:
# summary
best_model_summary = pd.DataFrame(
    {
        'model_name' : ['Logistic Regression'],
        'cv5 accuracy': [cv_tfidf_logreg],
        'test accuracy' : [test_acc],
        'notes' : ['TF-IDF vectorizer']

    }
)
best_model_summary

Unnamed: 0,model_name,cv5 accuracy,test accuracy,notes
0,Logistic Regression,0.650667,0.72,TF-IDF vectorizer
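
Accuracy alone does not say which kind of mistake the model makes. `confusion_matrix` is imported above but not used. For binary labels 0/1, `confusion_matrix(y_test, y_pred)` returns the counts laid out as `[[tn, fp], [fn, tp]]`. The sketch below computes the same four counts by hand on hypothetical labels and predictions, not on the notebook's test set.

```python
# Hypothetical true labels and predictions: 1 = toxic, 0 = not toxic
toy_y_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(toy_y_true, toy_y_pred))
tp = sum(true == 1 and pred == 1 for true, pred in pairs)  # toxic, caught
fp = sum(true == 0 and pred == 1 for true, pred in pairs)  # clean, flagged
fn = sum(true == 1 and pred == 0 for true, pred in pairs)  # toxic, missed
tn = sum(true == 0 and pred == 0 for true, pred in pairs)  # clean, kept

toy_accuracy = (tp + tn) / len(pairs)
toy_precision = tp / (tp + fp)
toy_recall = tp / (tp + fn)
print(tp, fp, fn, tn)                           # 3 1 1 3
print(toy_accuracy, toy_precision, toy_recall)  # 0.75 0.75 0.75
```

For a policy that drops comments, false positives remove harmless comments and false negatives let toxic ones through, so both counts are worth checking next to accuracy.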


### Model Analysis


Let's pick a toxic comment from our test data and see how the model handles it.

#### Save for Later

In [77]:
toxic_text = ["your content is suck!"]

In [78]:
# convert as text features using tf idf
text_features = tfidf.transform(toxic_text)

In [79]:
# look at text features
text_features_np = text_features.toarray()

# show as dataframe
text_feature_df = pd.DataFrame(text_features_np)
# column name = vocabulary
text_feature_df.columns = tfidf.get_feature_names_out()

text_feature_df.index = toxic_text
text_feature_df

Unnamed: 0,aaannnyything,abilities,ability,able,about,aboutdemocrats,above,absolutely,absurd,abuse,...,yourselfi,youth,youtube,youve,ypu,yr,yup,zimmerman,zimmermans,zionist
your content is suck!,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, we check which features have non-zero values.

In [80]:
text_data = text_feature_df.T.reset_index()
text_data

Unnamed: 0,index,your content is suck!
0,aaannnyything,0.0
1,abilities,0.0
2,ability,0.0
3,able,0.0
4,about,0.0
...,...,...
3964,yr,0.0
3965,yup,0.0
3966,zimmerman,0.0
3967,zimmermans,0.0


In [81]:
text_data.columns

Index(['index', 'your content is suck!'], dtype='object')

In [82]:
text_data.loc[text_data['your content is suck!']>0]

Unnamed: 0,index,your content is suck!
1841,is,0.283035
3379,suck,0.845182
3956,your,0.453386


We see that `is`, `suck`, and `your` are in our vocabulary and get non-zero weights, while `content` is not in the vocabulary, so it is ignored.
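
The non-zero table can be checked with a short sketch. It assumes the default `TfidfVectorizer` settings used in this notebook: lowercasing, the default token pattern `(?u)\b\w\w+\b` (words of two or more characters, punctuation dropped), and L2 normalization of each row. The weights are copied from the output above.

```python
import math
import re

text = "your content is suck!"
# Default token pattern of scikit-learn's text vectorizers
tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
print(tokens)  # ['your', 'content', 'is', 'suck']

# Non-zero weights reported in the table above
reported = {"is": 0.283035, "suck": 0.845182, "your": 0.453386}

# Tokens without a weight are not in the fitted vocabulary
missing = [token for token in tokens if token not in reported]
l2_norm = math.sqrt(sum(weight * weight for weight in reported.values()))
print(missing)            # ['content']
print(round(l2_norm, 4))  # 1.0
```

The three weights form a unit-length vector, and `suck` carries most of that length, which suggests it is the rarest of the three words in the training data.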

In [83]:
best_model.predict(text_features)

array([1])

The model predicts this text as toxic (class `1`), which is the expected result.
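
`predict` only returns the final label. For a binary logistic regression, that label comes from a weighted sum of the features passed through a sigmoid, with 0.5 as the decision threshold. The sketch below illustrates the mechanics using the three non-zero TF-IDF weights from above and made-up coefficients. The values in `hypothetical_coef` and `hypothetical_intercept` are assumptions for illustration, not the fitted parameters of `best_model`.

```python
import math

# Non-zero TF-IDF features of "your content is suck!" (from the output above)
features = {"is": 0.283035, "suck": 0.845182, "your": 0.453386}

# Made-up model parameters, NOT the fitted ones
hypothetical_coef = {"is": -0.2, "suck": 2.5, "your": 0.4}
hypothetical_intercept = -0.3

# Linear score: intercept + sum of coefficient * feature value
score = hypothetical_intercept + sum(
    hypothetical_coef[word] * value for word, value in features.items()
)
# Sigmoid turns the score into a probability of the toxic class
probability_toxic = 1 / (1 + math.exp(-score))
predicted_label = int(probability_toxic >= 0.5)
print(predicted_label)  # 1
```

In the notebook, `best_model.predict_proba(text_features)` would give the actual class probabilities, and `best_model.coef_` holds the fitted coefficients.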