### About Project
This project is divided into **three notebooks** that explain how to use TF-IDF on an **English** corpus. The workflow runs from data collection (building the corpus) through pre-processing to algorithm fitting, in the following steps:
1. **Data Collection (self-produced)**
2. **Text Pre-Processing (Case Folding, Punctuation Removal, Tokenizing, Stop-Word Removal, Stemming)**
3. **TF-IDF Algorithm & VSM Implementation**
4. **Boolean Retrieval Algorithm Implementation**

#### The notebooks cover three sub-processes:
1. text-preprocessing-english.ipynb
2. implementation-tf-idf-and-vsm.ipynb
3. implementation-boolean.ipynb

## Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. **It measures how important a term is within a document relative to a collection of documents** (i.e., relative to a corpus).
- **Term Frequency:** The TF of a term is the number of times the term appears in a document, often divided by the total number of words in that document. (This notebook uses the raw count.)
- **Inverse Document Frequency:** The IDF of a term is the logarithm of the total number of documents divided by the number of documents that contain the term, so it decreases as the term becomes more common. Terms found in only a small share of documents (e.g., technical jargon) receive higher importance values than words common across all documents (e.g., a, the, and).
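
As a rough summary of the weighting computed in the cells that follow (raw counts for TF and a base-10 logarithm for IDF), the quantities can be written as below. The symbols $f_{t,d}$, $N$ and $\mathrm{df}_t$ are notation introduced here for illustration and do not appear in the notebook code:

```latex
\mathrm{tf}(t, d) = f_{t,d}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{\mathrm{df}_t}, \qquad
w(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}_t$ is the number of documents that contain $t$.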

### Library Initialization

In [1]:
import numpy as np
import pandas as pd
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bagussatya/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bagussatya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Defining Preprocess

In [3]:
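def preprocess(text):
    """Clean each string in `text`: case folding, punctuation and digit removal,
    tokenizing, stop-word removal, and stemming."""
    clean_text = []
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    for t in text:
        # Lowercase, then drop punctuation and digits
        clean = re.sub(r'[^\w\s]', '', t.lower())
        clean = re.sub(r'\d+', '', clean)
        tokens = word_tokenize(clean)
        # Remove stop words, then stem what is left
        stemmed_tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
        clean_text.append(' '.join(stemmed_tokens))

    return clean_text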
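
To make the individual cleaning steps easier to follow, here is a minimal standard-library sketch of case folding, punctuation removal, digit removal, tokenizing and stop-word removal. It is not the notebook's function: `clean_sentence` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny hand-picked stand-in for NLTK's English list, tokenizing is a plain whitespace split, and stemming is left out.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook itself relies on NLTK's English stop-word list.
DEMO_STOP_WORDS = {"they", "him", "a", "because", "of", "his"}


def clean_sentence(sentence, stop_words=DEMO_STOP_WORDS):
    """Case-fold, strip punctuation and digits, tokenize, and drop stop words."""
    lowered = sentence.lower()                   # case folding
    no_punct = re.sub(r"[^\w\s]", "", lowered)   # punctuation removal
    no_digits = re.sub(r"\d+", "", no_punct)     # digit removal
    tokens = no_digits.split()                   # whitespace tokenizing (assumption)
    return [tok for tok in tokens if tok not in stop_words]


tokens = clean_sentence("They called him a bird, because of his habit")
print(tokens)  # ['called', 'bird', 'habit']
```

In the notebook, the Porter stemmer would additionally reduce `called` to `call`, which is why the cleaned corpus shows `call bird habit`.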

### Importing Cleaned Corpus

In [4]:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', 1000)

dataset = pd.read_csv("corpus/clean-corpus-inggris.csv")
dataset

Unnamed: 0,teks
0,call bird habit
1,brother like bird month father gave black bird
2,antoni bird lost
3,greedi characterist hate


### Importing Un-Preprocessed Corpus

In [5]:
validate = pd.read_csv("corpus/corpus-inggris.csv").head(4)
validate

Unnamed: 0,id,text,topic
0,ENG1,"They called him a bird, because of his habit",bird
1,ENG2,My brother likes bird and after a month my father gave him a black bird,bird
2,ENG3,Antony has a bird and he lost it,bird
3,ENG4,Greedy is the most characteristic that I hate,hate


### Corpus Preparation

In [6]:
corpus = dataset.teks.tolist()
corpus

['call bird habit',
 'brother like bird month father gave black bird',
 'antoni bird lost',
 'greedi characterist hate']

#### Displaying the unique words

In [7]:
unique_words = set()
for sentence in corpus:
    words = sentence.split()
    unique_words.update(words)
unique_words

{'antoni',
 'bird',
 'black',
 'brother',
 'call',
 'characterist',
 'father',
 'gave',
 'greedi',
 'habit',
 'hate',
 'like',
 'lost',
 'month'}

### Query input and preprocessing

In [8]:
clean_query = []

# Keep prompting until at least one query term survives preprocessing
# and also appears in the corpus vocabulary.
while True:
    print("Insert a query:")
    query = input()
    list_query = query.split(' ')
    clean_query = preprocess(list_query)
    clean_query = [word for word in clean_query if word != '']
    clean_query = [word for word in clean_query if word in unique_words]
    if clean_query:
        break

print("List of words in the query:\n", list_query)
print("List of the queries:\n", clean_query)

Insert a query:


 antony is my favorite bird, antony have a red colored feather that can change color every month like a chamaleon


List of words in the query:
 ['antony', 'is', 'my', 'favorite', 'bird,', 'antony', 'have', 'a', 'red', 'colored', 'feather', 'that', 'can', 'change', 'color', 'every', 'month', 'like', 'a', 'chamaleon']
List of the queries:
 ['antoni', 'bird', 'antoni', 'month', 'like']


### TF (Term Frequency)

In [9]:
def tf(text):
    word_count_per_document = {}

    for i, sentence in enumerate(text, start=0):
        words = sentence.split()
        for word in words:
            if word in word_count_per_document:
                if i in word_count_per_document[word]:
                    word_count_per_document[word][i] += 1
                else:
                    word_count_per_document[word][i] = 1
            else:
                word_count_per_document[word] = {i: 1}

    df_term_frequency = pd.DataFrame(word_count_per_document)
    df_term_frequency.fillna(0, inplace=True)
    return df_term_frequency.T
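
The same raw term counts can also be collected without pandas. The sketch below is an alternative illustration, not the notebook's implementation: `term_counts` and `demo_corpus` are hypothetical names, and the corpus is copied from the cleaned sentences shown earlier.

```python
from collections import Counter


def term_counts(documents):
    """Return {doc_index: Counter(term -> raw count)} for whitespace-tokenized documents."""
    return {i: Counter(doc.split()) for i, doc in enumerate(documents)}


# The four cleaned sentences from the corpus above.
demo_corpus = [
    "call bird habit",
    "brother like bird month father gave black bird",
    "antoni bird lost",
    "greedi characterist hate",
]

counts = term_counts(demo_corpus)
print(counts[1]["bird"])  # 2
print(counts[3]["bird"])  # 0 (a Counter returns 0 for terms it has not seen)
```

A `Counter` per document avoids the nested `if`/`else` bookkeeping, at the cost of not producing the term-by-document table directly.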

#### Computing the document term frequencies

In [10]:
tf_document = tf(corpus).T.sort_index().T
tf_document

Unnamed: 0,0,1,2,3
call,1.0,0.0,0.0,0.0
bird,1.0,2.0,1.0,0.0
habit,1.0,0.0,0.0,0.0
brother,0.0,1.0,0.0,0.0
like,0.0,1.0,0.0,0.0
month,0.0,1.0,0.0,0.0
father,0.0,1.0,0.0,0.0
gave,0.0,1.0,0.0,0.0
black,0.0,1.0,0.0,0.0
antoni,0.0,0.0,1.0,0.0


#### Computing the query term frequencies

In [11]:
tf_query = pd.DataFrame(tf(clean_query).sum(axis=1), columns=["query"])
tf_query

Unnamed: 0,query
antoni,2.0
bird,1.0
month,1.0
like,1.0


#### Concatenated DataFrame of the documents and query

In [12]:
concatenated_tf = pd.concat([tf_query, tf_document], axis=1)
concatenated_tf.fillna(0, inplace=True)
concatenated_tf

Unnamed: 0,query,0,1,2,3
antoni,2.0,0.0,0.0,1.0,0.0
bird,1.0,1.0,2.0,1.0,0.0
month,1.0,0.0,1.0,0.0,0.0
like,1.0,0.0,1.0,0.0,0.0
call,0.0,1.0,0.0,0.0,0.0
habit,0.0,1.0,0.0,0.0,0.0
brother,0.0,0.0,1.0,0.0,0.0
father,0.0,0.0,1.0,0.0,0.0
gave,0.0,0.0,1.0,0.0,0.0
black,0.0,0.0,1.0,0.0,0.0


### IDF (Inverse Document Frequency)

#### Converting the frequencies to 1 and 0
Note: this binarization makes the document frequency (df) easier to count. It modifies `tf_document` in place.

In [13]:
tf_document[tf_document != 0] = 1
tf_document

Unnamed: 0,0,1,2,3
call,1.0,0.0,0.0,0.0
bird,1.0,1.0,1.0,0.0
habit,1.0,0.0,0.0,0.0
brother,0.0,1.0,0.0,0.0
like,0.0,1.0,0.0,0.0
month,0.0,1.0,0.0,0.0
father,0.0,1.0,0.0,0.0
gave,0.0,1.0,0.0,0.0
black,0.0,1.0,0.0,0.0
antoni,0.0,0.0,1.0,0.0


#### Computing df (the number of documents that contain each term)

In [14]:
document_frequency = pd.DataFrame(tf_document.sum(axis=1), columns=['document frequency'])
document_frequency.index.names = ["terms"]
document_frequency

terms,document frequency
call,1.0
bird,3.0
habit,1.0
brother,1.0
like,1.0
month,1.0
father,1.0
gave,1.0
black,1.0
antoni,1.0


#### Computing the D/df value

In [15]:
D_per_df = len(corpus) / document_frequency
D_per_df.columns = ['D/df']
D_per_df

terms,D/df
call,4.0
bird,1.333333
habit,4.0
brother,4.0
like,4.0
month,4.0
father,4.0
gave,4.0
black,4.0
antoni,4.0


#### Calculating the IDF value

In [16]:
idf = np.log10(D_per_df)
idf.columns = ['IDF']
idf

terms,IDF
call,0.60206
bird,0.124939
habit,0.60206
brother,0.60206
like,0.60206
month,0.60206
father,0.60206
gave,0.60206
black,0.60206
antoni,0.60206
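
The binarize, sum, divide and log steps above can be condensed into one small function. The following is a sketch using only the standard library; `inverse_document_frequency` and `demo_corpus` are hypothetical names, and the corpus is copied from the cleaned sentences shown earlier.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: log10(N / df)}, where df is the number of documents containing the term."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # set() ensures a term is counted once per document, however often it occurs
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log10(n_docs / df) for term, df in doc_freq.items()}


demo_corpus = [
    "call bird habit",
    "brother like bird month father gave black bird",
    "antoni bird lost",
    "greedi characterist hate",
]

idf_values = inverse_document_frequency(demo_corpus)
print(round(idf_values["bird"], 6))    # 0.124939
print(round(idf_values["antoni"], 6))  # 0.60206
```

`bird` appears in three of the four documents, so its IDF is log10(4/3); every other term appears in exactly one document and gets log10(4).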


### TF-IDF Weighting

In [17]:
concatenated_tf.sort_index()

Unnamed: 0,query,0,1,2,3
antoni,2.0,0.0,0.0,1.0,0.0
bird,1.0,1.0,2.0,1.0,0.0
black,0.0,0.0,1.0,0.0,0.0
brother,0.0,0.0,1.0,0.0,0.0
call,0.0,1.0,0.0,0.0,0.0
characterist,0.0,0.0,0.0,0.0,1.0
father,0.0,0.0,1.0,0.0,0.0
gave,0.0,0.0,1.0,0.0,0.0
greedi,0.0,0.0,0.0,0.0,1.0
habit,0.0,1.0,0.0,0.0,0.0


#### Calculating the TF-IDF weight

In [18]:
weight = pd.DataFrame(index=concatenated_tf.index, columns=concatenated_tf.columns)
for index, row in concatenated_tf.iterrows():
    weight.loc[index] = row * idf.loc[index, 'IDF']

weight.sort_index()

Unnamed: 0,query,0,1,2,3
antoni,1.20412,0.0,0.0,0.60206,0.0
bird,0.124939,0.124939,0.249877,0.124939,0.0
black,0.0,0.0,0.60206,0.0,0.0
brother,0.0,0.0,0.60206,0.0,0.0
call,0.0,0.60206,0.0,0.0,0.0
characterist,0.0,0.0,0.0,0.0,0.60206
father,0.0,0.0,0.60206,0.0,0.0
gave,0.0,0.0,0.60206,0.0,0.0
greedi,0.0,0.0,0.0,0.0,0.60206
habit,0.0,0.60206,0.0,0.0,0.0


In [19]:
weight_without_query = weight.T.drop(['query'])
query_weight = weight_without_query.filter(items=clean_query)
query_weight

Unnamed: 0,antoni,bird,month,like
0,0.0,0.124939,0.0,0.0
1,0.0,0.249877,0.60206,0.60206
2,0.60206,0.124939,0.0,0.0
3,0.0,0.0,0.0,0.0


In [20]:
sum_of_weight = pd.DataFrame(query_weight.T.sum(), columns=['tf-idf weight'])
sum_of_weight

Unnamed: 0,tf-idf weight
0,0.124939
1,1.453997
2,0.726999
3,0.0


In [21]:
df_combined = pd.concat([sum_of_weight, validate['text']], axis=1)
tfidf_result= df_combined.sort_values(['tf-idf weight'], ascending = False)
tfidf_result

Unnamed: 0,tf-idf weight,text
1,1.453997,My brother likes bird and after a month my father gave him a black bird
2,0.726999,Antony has a bird and he lost it
0,0.124939,"They called him a bird, because of his habit"
3,0.0,Greedy is the most characteristic that I hate
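
The ranking above adds up, for each document, the TF-IDF weights of the distinct query terms. A compact way to express the same scoring rule is sketched below; `rank_by_tfidf_sum`, `demo_corpus` and `demo_query` are hypothetical names, and the inputs are copied from the cleaned corpus and cleaned query shown earlier.

```python
import math
from collections import Counter


def rank_by_tfidf_sum(documents, query_terms):
    """Score each document by the summed TF-IDF weight of the distinct query terms."""
    n_docs = len(documents)
    counts = [Counter(doc.split()) for doc in documents]
    scores = []
    for i, doc_counts in enumerate(counts):
        score = 0.0
        for term in set(query_terms):  # each query term contributes once
            df = sum(1 for c in counts if term in c)
            if df:  # skip terms that occur in no document
                score += doc_counts[term] * math.log10(n_docs / df)
        scores.append((i, score))
    # Highest score first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


demo_corpus = [
    "call bird habit",
    "brother like bird month father gave black bird",
    "antoni bird lost",
    "greedi characterist hate",
]
demo_query = ["antoni", "bird", "antoni", "month", "like"]

ranking = rank_by_tfidf_sum(demo_corpus, demo_query)
print([doc_id for doc_id, _ in ranking])  # [1, 2, 0, 3]
```

Note that this score ignores how often a term occurs in the query and does not normalize for document length, which is one reason the longest document ranks first here.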


## Vector Space Model (VSM)
The Vector Space Model (VSM) represents the query and every document as vectors of term weights (here, TF-IDF weights) and ranks documents by how closely each document vector points in the same direction as the query vector.

- **Term vector**: Each document, and the query, becomes a vector with one component per term in the vocabulary. The component is the TF-IDF weight of that term.

- **Vector length**: The Euclidean norm of a vector, i.e. the square root of the sum of its squared weights.

- **Cosine similarity**: The dot product of the query and document vectors divided by the product of their lengths. It is 0 when the two share no terms and approaches 1 as their weight profiles align.

Because cosine similarity normalizes by vector length, a long document does not score higher merely because it contains more terms. This is why the VSM ranking can differ from the plain TF-IDF sum computed above.

### Vector Lengths of the Query and Documents (Squaring the TF-IDF Weights)

In [22]:
weight_square = weight ** 2
weight_square

Unnamed: 0,query,0,1,2,3
antoni,1.449905,0.0,0.0,0.362476,0.0
bird,0.01561,0.01561,0.062439,0.01561,0.0
month,0.362476,0.0,0.362476,0.0,0.0
like,0.362476,0.0,0.362476,0.0,0.0
call,0.0,0.362476,0.0,0.0,0.0
habit,0.0,0.362476,0.0,0.0,0.0
brother,0.0,0.0,0.362476,0.0,0.0
father,0.0,0.0,0.362476,0.0,0.0
gave,0.0,0.0,0.362476,0.0,0.0
black,0.0,0.0,0.362476,0.0,0.0


### Taking the Square Root of the Summed Squares (Vector Lengths)

In [23]:
sqrt = pd.DataFrame(weight_square.T.sum(axis=1) ** 0.5, columns=['sqrt'])
sqrt

Unnamed: 0,sqrt
query,1.480023
0,0.860559
1,1.495759
2,0.860559
3,1.042798


### Dot Product Calculation

In [24]:
dot_query_documents= pd.DataFrame(weight['query'].dot(weight))
dot_query_documents=dot_query_documents.drop(['query'])
dot_query_documents.rename(columns = {'query':'query dot documents'},inplace = True)
dot_query_documents

Unnamed: 0,query dot documents
0,0.01561
1,0.756172
2,0.740562
3,0.0


In [25]:
sqrt_query=float(sqrt.loc['query'].values[0])
print(sqrt_query)

1.480022664303569


In [26]:
sqrt_document=sqrt.drop(['query'])
sqrt_document.rename(columns = {'sqrt':'sqrt of documents'}, inplace = True)
sqrt_document

Unnamed: 0,sqrt of documents
0,0.860559
1,1.495759
2,0.860559
3,1.042798


### Cosine Similarity Calculation

In [27]:
# Despite the variable name, this value is the cosine similarity (higher means more similar).
multiplication_query_documents = sqrt_document * sqrt_query
cosine_distance = pd.DataFrame(dot_query_documents.values / multiplication_query_documents.values, columns=['cosine'])
cosine_distance

Unnamed: 0,cosine
0,0.012256
1,0.341578
2,0.58145
3,0.0


In [28]:
cosine_distance_concated = pd.concat([cosine_distance, validate['text']], axis=1)
cosine_distance_concated.rename(columns = {'cosine':'vsm weight'}, inplace = True)
vsm_result= cosine_distance_concated.sort_values(['vsm weight'], ascending = False)
vsm_result

Unnamed: 0,vsm weight,text
2,0.58145,Antony has a bird and he lost it
1,0.341578,My brother likes bird and after a month my father gave him a black bird
0,0.012256,"They called him a bird, because of his habit"
3,0.0,Greedy is the most characteristic that I hate
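
The squaring, square-root, dot-product and division steps above amount to a cosine similarity between TF-IDF vectors. The sketch below puts them together using sparse dictionaries instead of DataFrames; `tfidf_vector`, `cosine_similarity`, `demo_corpus` and `demo_query` are hypothetical names, and the inputs are copied from the cleaned corpus and cleaned query shown earlier.

```python
import math
from collections import Counter


def tfidf_vector(term_counts, idf):
    """Map raw term counts to TF-IDF weights, ignoring terms without an IDF value."""
    return {t: c * idf[t] for t, c in term_counts.items() if t in idf}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


demo_corpus = [
    "call bird habit",
    "brother like bird month father gave black bird",
    "antoni bird lost",
    "greedi characterist hate",
]
demo_query = ["antoni", "bird", "antoni", "month", "like"]

counts = [Counter(doc.split()) for doc in demo_corpus]
n_docs = len(demo_corpus)
# IDF = log10(N / df), with df counted over the documents only
idf = {t: math.log10(n_docs / sum(1 for c in counts if t in c))
       for c in counts for t in c}

query_vec = tfidf_vector(Counter(demo_query), idf)
similarities = [cosine_similarity(query_vec, tfidf_vector(c, idf)) for c in counts]
order = sorted(range(n_docs), key=lambda i: similarities[i], reverse=True)
print(order)  # [2, 1, 0, 3]
```

The resulting order should match the `vsm weight` table above, with the short document that shares `antoni` and `bird` with the query ranked first.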


## Result Comparison of TF-IDF and VSM

In [29]:
tfidf_result

Unnamed: 0,tf-idf weight,text
1,1.453997,My brother likes bird and after a month my father gave him a black bird
2,0.726999,Antony has a bird and he lost it
0,0.124939,"They called him a bird, because of his habit"
3,0.0,Greedy is the most characteristic that I hate


In [30]:
vsm_result

Unnamed: 0,vsm weight,text
2,0.58145,Antony has a bird and he lost it
1,0.341578,My brother likes bird and after a month my father gave him a black bird
0,0.012256,"They called him a bird, because of his habit"
3,0.0,Greedy is the most characteristic that I hate
