### About the Project
This project is divided into **three notebooks** that explain the use of TF-IDF on **English** text. The process flow runs from data collection (the corpus) through pre-processing to algorithm fitting. The detailed steps are:
1. **Data Collection (self-produced)**
2. **Text Pre-Processing (Case Folding, Punctuation Removal, Tokenizing, Stop-Words Removal, Stemming)**
3. **TF-IDF Algorithm & VSM Implementation**
4. **Boolean Retrieval Algorithm Implementation**

#### The notebooks are divided into three sub-processes:
1. text-preprocessing-english.ipynb
2. implementation-tf-idf-and-vsm.ipynb
3. implementation-boolean.ipynb

## Text Preprocessing
Text preprocessing is a crucial step in natural language processing and information retrieval. It involves transforming raw text into a more suitable format for analysis. The key steps in text preprocessing are:

- **Case Folding**: This step converts all characters in the text to lowercase to ensure uniformity.
- **Punctuation Removal**: This step eliminates punctuation marks from the text. Punctuation often does not contribute to the meaning in most text analysis tasks and can be removed to reduce noise.

- **Tokenizing**: Tokenizing is the process of splitting the text into individual words or tokens. This allows for easier manipulation and analysis of the text. For example, once case folding and punctuation removal have been applied, the sentence "Natural language processing is fun!" is tokenized into ["natural", "language", "processing", "is", "fun"].

- **Stopwords Removal**: Stopwords are common words that usually do not carry significant meaning and can be removed to focus on the more meaningful words in the text. Examples of stopwords include "a", "the", "and", "in".

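- **Stemming**: Stemming reduces words to a common stem by stripping suffixes. For instance, the Porter stemmer used in this project reduces "running" and "runs" to "run". Irregular forms such as "ran" are left unchanged, and a stem need not be a dictionary word (in this corpus, "greedy" becomes "greedi"). This helps in reducing the dimensionality of the text data and focusing on the core meaning of the words.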

These preprocessing steps shrink the vocabulary and remove noise, so that documents and queries are compared on their meaningful terms rather than on surface differences such as capitalization or word endings.
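
As a rough illustration of the first four steps on the example sentence, here is a minimal sketch that uses only the standard library. The function name `toy_preprocess` and the tiny `TOY_STOP_WORDS` set are assumptions made for this example; the notebook itself relies on NLTK's tokenizer and stop-word list, and stemming is left out here.

```python
import re

# Assumption: a tiny hand-picked stop-word set, for illustration only.
TOY_STOP_WORDS = {"a", "the", "and", "in", "is"}


def toy_preprocess(sentence):
    """Case-fold, strip punctuation, split on whitespace, drop stop words."""
    folded = sentence.lower()                   # case folding
    no_punct = re.sub(r"[^\w\s]", "", folded)   # punctuation removal
    tokens = no_punct.split()                   # naive whitespace tokenizing
    return [t for t in tokens if t not in TOY_STOP_WORDS]


print(toy_preprocess("Natural language processing is fun!"))
# ['natural', 'language', 'processing', 'fun']
```

Case folding comes first because the stop-word comparison is case sensitive: "Is" would not match the lowercase entry "is".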

### Importing the libraries used

In [1]:
import numpy as np
import pandas as pd

import regex as re
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
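
`word_tokenize` and `stopwords` depend on NLTK data packages that are not bundled with the library itself. The sketch below fetches them only when they are missing. The package names are the ones NLTK commonly uses; whether `punkt` or `punkt_tab` is needed depends on the installed NLTK version, so treat the list as an assumption and adjust it to your environment.

```python
import nltk

# Assumption: data path -> package name, as commonly laid out in NLTK's data directory.
REQUIRED_RESOURCES = {
    "tokenizers/punkt": "punkt",
    "tokenizers/punkt_tab": "punkt_tab",  # looked up by newer NLTK releases
    "corpora/stopwords": "stopwords",
}

for path, package in REQUIRED_RESOURCES.items():
    try:
        nltk.data.find(path)                # raises LookupError if not installed
    except LookupError:
        nltk.download(package, quiet=True)  # needs network access
```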

### Importing the dataset used on this project
P.S. For the sake of simplicity, only four of the available documents are used: the shortest ones, each consisting of a single sentence.

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', 1000)

dataset = pd.read_csv('corpus/corpus-inggris.csv').head(4)
dataset

|   | id | text | topic |
|---|---|---|---|
| 0 | ENG1 | They called him a bird, because of his habit | bird |
| 1 | ENG2 | My brother likes bird and after a month my father gave him a black bird | bird |
| 2 | ENG3 | Antony has a bird and he lost it | bird |
| 3 | ENG4 | Greedy is the most characteristic that I hate | hate |


### Final Function for Text Preprocess
This function is used throughout the project to clean queries. The process below (shown as broken-down steps) is the one used to clean the corpus before feeding it into the algorithm.

In [3]:
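def preprocess(text):
    """Clean an iterable of strings (e.g. a list or a pandas Series).

    Note: pass a single query as a one-element list, not as a bare string,
    otherwise the loop iterates over its characters.
    """
    clean_text = []
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    for t in text:
        # Case folding, then drop everything that is not a word character or whitespace
        clean = re.sub(r'[^\w\s]', '', t.lower())
        # Drop digits
        clean = re.sub(r'\d+', '', clean)
        tokens = word_tokenize(clean)
        # Remove stop words, then stem the remaining tokens
        stemmed_tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
        clean_text.append(' '.join(stemmed_tokens))

    return clean_text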

### Step-by-step Preprocessing
#### Casefolding and Punctuation Removal

In [4]:
def casefolding_punctuationremoval(text):
    clean_text = []
    for t in text:
        clean = re.sub(r'[^\w\s]', '', t.lower())
        clean = re.sub(r'\d+', '', clean)
        clean_text.append(clean)
    return clean_text

In [5]:
clean_text = casefolding_punctuationremoval(dataset.text)
clean_text

['they called him a bird because of his habit',
 'my brother likes bird and after a month my father gave him a black bird',
 'antony has a bird and he lost it',
 'greedy is the most characteristic that i hate']
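
The two substitutions in this step do slightly more than the heading suggests: the second pattern also removes digits, and `\w` counts the underscore as a word character, so underscores survive. A small sketch with the standard-library `re` module (the notebook imports `regex as re`, which behaves the same for these patterns); the sample string is made up for illustration.

```python
import re

sample = "Call me at 555-1234, user_name!"  # hypothetical input

# Step 1: case folding, then remove anything that is not a word character or whitespace
step1 = re.sub(r"[^\w\s]", "", sample.lower())
print(step1)  # call me at 5551234 user_name

# Step 2: remove digits; this can leave consecutive spaces behind
step2 = re.sub(r"\d+", "", step1)
print(step2.split())  # ['call', 'me', 'at', 'user_name']
```

The leftover double space is harmless here because the tokenizing step that follows splits on any run of whitespace.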

#### Tokenizing

In [6]:
def tokenizing(clean_text):
    tokenize = []
    for t in clean_text:
        tokens = word_tokenize(t)
        tokenize.append(tokens)
    return tokenize

In [7]:
tokenize = tokenizing(clean_text)

print('\n'.join(map(str, tokenize)))

['they', 'called', 'him', 'a', 'bird', 'because', 'of', 'his', 'habit']
['my', 'brother', 'likes', 'bird', 'and', 'after', 'a', 'month', 'my', 'father', 'gave', 'him', 'a', 'black', 'bird']
['antony', 'has', 'a', 'bird', 'and', 'he', 'lost', 'it']
['greedy', 'is', 'the', 'most', 'characteristic', 'that', 'i', 'hate']


#### Stopwords Removal

In [8]:
def stopwords_removal(tokenize):
    no_stopwords = []
    stop_words = set(stopwords.words('english'))
    for sentence in tokenize:
        removed_stopwords = [word for word in sentence if word not in stop_words]
        no_stopwords.append(removed_stopwords)
    return no_stopwords

In [9]:
removed_stopwords = stopwords_removal(tokenize)
print('\n'.join(map(str, removed_stopwords)))

['called', 'bird', 'habit']
['brother', 'likes', 'bird', 'month', 'father', 'gave', 'black', 'bird']
['antony', 'bird', 'lost']
['greedy', 'characteristic', 'hate']


#### Stemming

In [10]:
def stemming(removed_stopwords):
    stemmed = []
    stemmer = PorterStemmer()
    for sentence in removed_stopwords:
        stemmed_tokens = [stemmer.stem(token) for token in sentence]
        stemmed.append(stemmed_tokens)
    return stemmed

In [11]:
stemmed = stemming(removed_stopwords)
print('\n'.join(map(str, stemmed)))

['call', 'bird', 'habit']
['brother', 'like', 'bird', 'month', 'father', 'gave', 'black', 'bird']
['antoni', 'bird', 'lost']
['greedi', 'characterist', 'hate']
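
Stems such as 'antoni' and 'greedi' are expected: the Porter stemmer strips suffixes by fixed rules and does not look words up in a dictionary. The short sketch below, with a word list chosen only for illustration, shows that regular inflections are merged while an irregular form like "ran" is not.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "runner", "ran", "birds"]  # illustrative words

stems = {word: stemmer.stem(word) for word in words}
print(stems)
# {'running': 'run', 'runs': 'run', 'runner': 'runner', 'ran': 'ran', 'birds': 'bird'}
```

Because stems are not always real words, a query has to go through the same stemmer as the corpus; otherwise a query term such as "greedy" would never match the indexed term "greedi".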


#### Joining the Results into Sentences

In [12]:
preprocess_result = [' '.join(map(str, sentence)) for sentence in stemmed]
preprocess_result

['call bird habit',
 'brother like bird month father gave black bird',
 'antoni bird lost',
 'greedi characterist hate']

### Transform the clean dataset and export it to CSV

In [13]:
clean_dataset = pd.DataFrame(preprocess_result)
clean_dataset = clean_dataset.rename(columns={0: 'teks'})

In [14]:
clean_dataset

|   | teks |
|---|---|
| 0 | call bird habit |
| 1 | brother like bird month father gave black bird |
| 2 | antoni bird lost |
| 3 | greedi characterist hate |


In [15]:
clean_dataset.to_csv('corpus/clean-corpus-inggris.csv', index=False)
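
Passing `index=False` keeps the row index out of the file, so the next notebook reads back only the `teks` column. The sketch below illustrates the difference with an in-memory buffer instead of a real file; the helper name `round_trip` is an assumption for this example, and the two rows are taken from the cleaned corpus above.

```python
import io

import pandas as pd

clean = pd.DataFrame({"teks": ["call bird habit", "antoni bird lost"]})


def round_trip(frame, index):
    """Write the frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=index)
    buffer.seek(0)
    return pd.read_csv(buffer)


print(list(round_trip(clean, index=False).columns))  # ['teks']
print(list(round_trip(clean, index=True).columns))   # ['Unnamed: 0', 'teks']
```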