### About Project
This project is divided into **three notebooks** that explain the use of TF-IDF on **English** text. The workflow runs from data collection (the corpus) through pre-processing to algorithm fitting. The steps are:
1. **Data Collection (self-produced)**
2. **Text Pre-Processing (Case Folding, Punctuation Removal, Tokenizing, Stop-Words Removal, Stemming)**
3. **TF-IDF Algorithm & VSM Implementation**
4. **Boolean Retrieval Algorithm Implementation**

#### The notebooks are divided into three sub-processes:
1. text-preprocessing-english.ipynb
2. implementation-tf-idf-and-vsm.ipynb
3. implementation-boolean.ipynb

## Boolean Information Retrieval
Boolean Information Retrieval is a foundational model in information retrieval that uses Boolean logic to match documents against a user’s query. It relies on the principles of Boolean algebra, using logical operators to define the relationships between terms.

- **Boolean Operators**: The primary Boolean operators are AND, OR, and NOT. These operators are used to combine or exclude terms in a query:

> **AND: All terms must be present in the document.**
> 
> For example, the query "cats AND dogs" will retrieve documents containing both "cats" and "dogs".


> **OR: At least one of the terms must be present in the document.**
> 
> For example, the query "cats OR dogs" will retrieve documents containing either "cats" or "dogs" or both.


> **NOT: Excludes documents containing the term.**
> 
> For example, the query "cats NOT dogs" will retrieve documents containing "cats" but not "dogs".

- **Query Formulation**: Users formulate queries using Boolean operators to express their information needs precisely. This method allows for complex and highly specific queries.

- **Document Representation**: Documents are typically represented as sets of terms. The presence or absence of terms in a document is used to determine if the document matches the query.

Boolean Information Retrieval provides a straightforward and powerful way to perform text searches, especially in structured databases and for well-defined queries. Despite its simplicity, it forms the basis for more advanced retrieval models and remains an important tool in the field of information retrieval.
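
The three operators map directly onto set operations when each document is represented as a set of terms. The snippet below is a minimal sketch of that idea; the toy collection `docs` and the helper `postings` are illustrative names that do not appear in the notebooks.

```python
# Toy collection: each document is represented as a set of terms (illustrative data).
docs = {
    0: {"cats", "dogs", "play"},
    1: {"cats", "sleep"},
    2: {"dogs", "bark"},
    3: {"birds", "sing"},
}


def postings(term):
    """Return the set of document ids whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


cats_and_dogs = postings("cats") & postings("dogs")  # AND -> intersection
cats_or_dogs = postings("cats") | postings("dogs")   # OR  -> union
cats_not_dogs = postings("cats") - postings("dogs")  # NOT -> set difference

print(sorted(cats_and_dogs))  # [0]
print(sorted(cats_or_dogs))   # [0, 1, 2]
print(sorted(cats_not_dogs))  # [1]
```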

## Library Initialization

In [1]:
import numpy as np
import pandas as pd
import regex as re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bagussatya/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bagussatya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Importing Dataset for English

In [3]:
def preprocess(t):
  clean_text = []
  stemmer = PorterStemmer()
  clean = re.sub(r'[^\w\s]', '', t.lower())
  clean = re.sub(r'\d+', '', clean)
  tokens = word_tokenize(clean)
  stemmed_tokens = [stemmer.stem(token) for token in tokens]
  clean_text.append(' '.join(stemmed_tokens))

  return clean_text
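
To see what the first steps of `preprocess` do to a sentence, here is a rough standard-library sketch. It assumes whitespace splitting in place of `word_tokenize` and leaves out stemming, so its output is not identical to the notebook's; `normalize` is a hypothetical name.

```python
import re


def normalize(text):
    """Case-fold, strip punctuation and digits, then split on whitespace."""
    lowered = text.lower()                      # case folding
    no_punct = re.sub(r"[^\w\s]", "", lowered)  # punctuation removal
    no_digits = re.sub(r"\d+", "", no_punct)    # digit removal
    return no_digits.split()                    # naive tokenizing


tokens = normalize("They called him a bird, because of his habit")
print(tokens)
# ['they', 'called', 'him', 'a', 'bird', 'because', 'of', 'his', 'habit']
```

Like `preprocess`, this sketch keeps stop words. That matters for the query later on, because the operators `and`, `or`, and `not` must survive pre-processing.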

### Importing Cleaned Corpus

In [4]:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', 1000)

dataset = pd.read_csv("clean-corpus-inggris.csv")
dataset

| | teks |
|---|---|
| 0 | call bird habit |
| 1 | brother like bird month father gave black bird |
| 2 | antoni bird lost |
| 3 | greedi characterist hate |


### Importing Un-Preprocessed Corpus

In [5]:
validate = pd.read_csv("corpus-inggris.csv").head(4)
validate

| | id | text | topic |
|---|---|---|---|
| 0 | ENG1 | They called him a bird, because of his habit | bird |
| 1 | ENG2 | My brother likes bird and after a month my father gave him a black bird | bird |
| 2 | ENG3 | Antony has a bird and he lost it | bird |
| 3 | ENG4 | Greedy is the most characteristic that I hate | hate |


### Corpus Preparations

In [6]:
corpus = dataset.teks.tolist()
corpus

['call bird habit',
 'brother like bird month father gave black bird',
 'antoni bird lost',
 'greedi characterist hate']

#### Displaying the unique words

In [7]:
unique_words = set()
for sentence in corpus:
    words = sentence.split()
    unique_words.update(words)
unique_words

{'antoni',
 'bird',
 'black',
 'brother',
 'call',
 'characterist',
 'father',
 'gave',
 'greedi',
 'habit',
 'hate',
 'like',
 'lost',
 'month'}

### TF (Term Frequency)

In [8]:
def tf(text):
    word_count_per_document = {}

    for i, sentence in enumerate(text, start=0):
        words = sentence.split()
        for word in words:
            if word in word_count_per_document:
                if i in word_count_per_document[word]:
                    word_count_per_document[word][i] += 1
                else:
                    word_count_per_document[word][i] = 1
            else:
                word_count_per_document[word] = {i: 1}

    df_term_frequency = pd.DataFrame(word_count_per_document)
    df_term_frequency.fillna(0, inplace=True)
    return df_term_frequency.T
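
The same per-document counts can be sketched without pandas using `collections.Counter`. This is an alternative illustration rather than the notebook's method; the corpus is copied from the cleaned corpus listed earlier, and `term_counts` is a name chosen for this example.

```python
from collections import Counter

corpus = [
    "call bird habit",
    "brother like bird month father gave black bird",
    "antoni bird lost",
    "greedi characterist hate",
]

# One Counter per document: term -> number of occurrences in that document.
term_counts = [Counter(sentence.split()) for sentence in corpus]

print(term_counts[1]["bird"])  # 2
print(term_counts[3]["bird"])  # 0, because Counter returns 0 for missing terms
```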

#### Computing the term frequency of each document

In [9]:
tf_document = tf(corpus).sort_index(axis=1)
tf_document

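| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| call | 1.0 | 0.0 | 0.0 | 0.0 |
| bird | 1.0 | 2.0 | 1.0 | 0.0 |
| habit | 1.0 | 0.0 | 0.0 | 0.0 |
| brother | 0.0 | 1.0 | 0.0 | 0.0 |
| like | 0.0 | 1.0 | 0.0 | 0.0 |
| month | 0.0 | 1.0 | 0.0 | 0.0 |
| father | 0.0 | 1.0 | 0.0 | 0.0 |
| gave | 0.0 | 1.0 | 0.0 | 0.0 |
| black | 0.0 | 1.0 | 0.0 | 0.0 |
| antoni | 0.0 | 0.0 | 1.0 | 0.0 |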


#### Transforming the frequencies into 1 and 0
Note: this step discards the counts, so each cell only records whether a term is present (1) or absent (0) in a document.

In [10]:
tf_document[tf_document != 0] = 1
tf_document

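| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| call | 1.0 | 0.0 | 0.0 | 0.0 |
| bird | 1.0 | 1.0 | 1.0 | 0.0 |
| habit | 1.0 | 0.0 | 0.0 | 0.0 |
| brother | 0.0 | 1.0 | 0.0 | 0.0 |
| like | 0.0 | 1.0 | 0.0 | 0.0 |
| month | 0.0 | 1.0 | 0.0 | 0.0 |
| father | 0.0 | 1.0 | 0.0 | 0.0 |
| gave | 0.0 | 1.0 | 0.0 | 0.0 |
| black | 0.0 | 1.0 | 0.0 | 0.0 |
| antoni | 0.0 | 0.0 | 1.0 | 0.0 |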


### Availability Mapping

In [11]:
# Map each term to its 0/1 availability vector across the documents.
documents = {}
for term, row in tf_document.iterrows():
    documents[term] = row.values.astype(int).tolist()

In [12]:
documents

{'call': [1, 0, 0, 0],
 'bird': [1, 1, 1, 0],
 'habit': [1, 0, 0, 0],
 'brother': [0, 1, 0, 0],
 'like': [0, 1, 0, 0],
 'month': [0, 1, 0, 0],
 'father': [0, 1, 0, 0],
 'gave': [0, 1, 0, 0],
 'black': [0, 1, 0, 0],
 'antoni': [0, 0, 1, 0],
 'lost': [0, 0, 1, 0],
 'greedi': [0, 0, 0, 1],
 'characterist': [0, 0, 0, 1],
 'hate': [0, 0, 0, 1]}
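
For comparison, the same term-to-documents mapping can be built straight from the cleaned corpus, skipping the frequency table. This is a minimal sketch under that assumption; `vocabulary` and `incidence` are illustrative names.

```python
corpus = [
    "call bird habit",
    "brother like bird month father gave black bird",
    "antoni bird lost",
    "greedi characterist hate",
]

# Vocabulary in first-seen order; dict.fromkeys drops duplicates but keeps order.
vocabulary = list(dict.fromkeys(word for sentence in corpus for word in sentence.split()))

# For every term, one 0/1 flag per document: 1 if the term occurs in it.
incidence = {
    term: [1 if term in sentence.split() else 0 for sentence in corpus]
    for term in vocabulary
}

print(incidence["bird"])  # [1, 1, 1, 0]
print(incidence["lost"])  # [0, 0, 1, 0]
```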

### Query input and preprocessing

In [13]:
query = input("Insert a query: ")
clean_query = preprocess(query)
print("pre-process query: ", clean_query)

Insert a query:  call and bird not greedy


pre-process query:  ['call and bird not greedi']


In [21]:
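def boolean_rules():
    # Evaluate the query left to right; each operator applies to the term after it.
    tokens = clean_query[0].lower().split()
    n_docs = len(next(iter(documents.values())))
    # Start from "every document matches" so that a query may begin with NOT.
    result = np.ones(n_docs, dtype=int)

    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "not":
            i += 1
            next_token = tokens[i]
            # Keep only the documents that do NOT contain the term.
            # (1 - x) flips a 0/1 vector; ~x would yield -1/-2 instead.
            result = result & (1 - np.array(documents[next_token]))
        elif token == "and":
            i += 1
            next_token = tokens[i]
            result = result & np.array(documents[next_token])
        elif token == "or":
            i += 1
            next_token = tokens[i]
            result = result | np.array(documents[next_token])
        else:
            result = np.array(documents[token])
        i += 1

    return result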
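
As a point of comparison, the left-to-right evaluation can also be sketched with plain Python sets. This is an illustrative alternative, not the notebook's implementation: `evaluate` and `N_DOCS` are hypothetical names, the mapping is a three-term excerpt of `documents`, and treating unknown terms as matching nothing is an assumption of the sketch.

```python
# Excerpt of the term -> availability mapping used in this notebook.
documents = {
    "call": [1, 0, 0, 0],
    "bird": [1, 1, 1, 0],
    "greedi": [0, 0, 0, 1],
}
N_DOCS = 4


def evaluate(query):
    """Evaluate a flat Boolean query left to right; return matching document indices."""
    all_docs = set(range(N_DOCS))

    def postings(term):
        # Unknown terms match no document (an assumption of this sketch).
        return {i for i, flag in enumerate(documents.get(term, [])) if flag}

    result = None
    operator = "and"  # adjacent terms without an operator are combined with AND
    negate = False
    for token in query.lower().split():
        if token in ("and", "or"):
            operator = token
        elif token == "not":
            negate = True
        else:
            matches = postings(token)
            if negate:
                matches = all_docs - matches  # complement of the term's postings
                negate = False
            if result is None:
                result = matches
            elif operator == "and":
                result &= matches
            else:
                result |= matches
            operator = "and"
    return sorted(result or [])


print(evaluate("call and bird not greedi"))  # [0]
print(evaluate("bird not call"))             # [1, 2]
print(evaluate("call or greedi"))            # [0, 3]
```

Returning a sorted list of indices, instead of a 0/1 vector, makes it explicit that a Boolean query can match several documents or none.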

In [23]:
search = boolean_rules()

In [48]:
# argmax picks the first highest-valued entry of the match vector.
best = search.argmax()
print(f"Most Relevant Document Located at Document {best}")
validate.iloc[[best]][["text"]]

Most Relevant Document Located at Document 0


| | text |
|---|---|
| 0 | They called him a bird, because of his habit |
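
Boolean retrieval does not rank documents: every document either satisfies the query or it does not. Because `argmax` returns only the first maximum, it reports a single document even when several match, and it returns 0 when nothing matches. A small sketch of listing every match instead, using a hypothetical match vector in place of the real `search` result:

```python
# Hypothetical match vector: 1 means the document satisfies the query.
search = [1, 0, 1, 0]

matching_ids = [i for i, flag in enumerate(search) if flag]

if matching_ids:
    print(f"Matching documents: {matching_ids}")  # Matching documents: [0, 2]
else:
    print("No document matches the query")
```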
