### About Project
This project devided into **three notebooks** that explained the usage of TF-IDF using **English.** The process flow of this project start from data collection (corpus) to pre-processing and algorithm fitting, the detailed steps explained below:
1. **Data Collection (self-produce)**
2. **Text Pre-Processing (Case Folding, Punctuation Removal, Tokenizing,Stop-Words Removal, Stemming)**
3. **TF-IDF Algorithm & VSM Implementation**
4. **Boolean Retrieval Algorithm Implementation**

#### The Notebook Divided into three sub-process:
1. text-preprocessing-english.ipynb
2. implementation-tf-idf-and-vsm.ipynb
3. implementation-boolean.ipynb

## Boolean Information Retrieval
Boolean Information Retrieval is a foundational model in information retrieval that uses Boolean logic to match documents against a user’s query. It relies on the principles of Boolean algebra, using logical operators to define the relationships between terms.

- **Boolean Operators**: The primary Boolean operators are AND, OR, and NOT. These operators are used to combine or exclude terms in a query:

> **AND: All terms must be present in the document.**
> 
> For example, the query "cats AND dogs" will retrieve documents containing both "cats" and "dogs".


> **OR: At least one of the terms must be present in the document.**
> 
> For example, the query "cats OR dogs" will retrieve documents containing either "cats" or "dogs" or both.


> **NOT: Excludes documents containing the term.**
> 
> For example, the query "cats NOT dogs" will retrieve documents containing "cats" but not "dogs".

- **Query Formulation**: Users formulate queries using Boolean operators to express their information needs precisely. This method allows for complex and highly specific queries.

- **Document Representation**: Documents are typically represented as sets of terms. The presence or absence of terms in a document is used to determine if the document matches the query.

Boolean Information Retrieval provides a straightforward and powerful way to perform text searches, especially in structured databases and for well-defined queries. Despite its simplicity, it forms the basis for more advanced retrieval models and remains an important tool in the field of information retrieval.

## Library Initialization

In [1]:
import numpy as np
import pandas as pd
import regex as re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bagussatya/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bagussatya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Importing Dataset for English

In [3]:
def preprocess(text):
  clean_text = []
  stemmer = PorterStemmer()

  for t in text:
      clean = re.sub(r'[^\w\s]', '', t.lower())
      clean = re.sub(r'\d+', '', clean)
      tokens = word_tokenize(clean)
      stemmed_tokens = [stemmer.stem(token) for token in tokens]
      clean_text.append(' '.join(stemmed_tokens))

  return clean_text

### Importing Cleaned Corpus

In [4]:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', 1000)

dataset = pd.read_csv("clean-corpus-inggris.csv")
dataset

Unnamed: 0,teks
0,call bird habit
1,brother like bird month father gave black bird
2,antoni bird lost
3,greedi characterist hate


### Importing Un-Preprocessed Corpus

In [5]:
validate = pd.read_csv("corpus-inggris.csv").head(4)
validate

Unnamed: 0,id,text,topic
0,ENG1,"They called him a bird, because of his habit",bird
1,ENG2,My brother likes bird and after a month my father gave him a black bird,bird
2,ENG3,Antony has a bird and he lost it,bird
3,ENG4,Greedy is the most characteristic that I hate,hate


### Corpus Preparations

In [6]:
corpus = dataset.teks.tolist()
corpus

['call bird habit',
 'brother like bird month father gave black bird',
 'antoni bird lost',
 'greedi characterist hate']

#### Looking to display the unique words

In [7]:
unique_words = set()
for sentence in corpus:
    words = sentence.split()
    unique_words.update(words)
unique_words

{'antoni',
 'bird',
 'black',
 'brother',
 'call',
 'characterist',
 'father',
 'gave',
 'greedi',
 'habit',
 'hate',
 'like',
 'lost',
 'month'}

### TF (Term Frequency)

In [82]:
def tf(text):
    word_count_per_document = {}

    for i, sentence in enumerate(text, start=0):
        words = sentence.split()
        for word in words:
            if word in word_count_per_document:
                if i in word_count_per_document[word]:
                    word_count_per_document[word][i] += 1
                else:
                    word_count_per_document[word][i] = 1
            else:
                word_count_per_document[word] = {i: 1}

    df_term_frequency = pd.DataFrame(word_count_per_document)
    df_term_frequency.fillna(0, inplace=True)
    return df_term_frequency.T

#### Searching for the document term frequency

In [83]:
tf_document= tf(corpus).T.sort_index().T
tf_document

Unnamed: 0,0,1,2,3
call,1.0,0.0,0.0,0.0
bird,1.0,2.0,1.0,0.0
habit,1.0,0.0,0.0,0.0
brother,0.0,1.0,0.0,0.0
like,0.0,1.0,0.0,0.0
month,0.0,1.0,0.0,0.0
father,0.0,1.0,0.0,0.0
gave,0.0,1.0,0.0,0.0
black,0.0,1.0,0.0,0.0
antoni,0.0,0.0,1.0,0.0


#### Transforming the Frequency into only 1 and 0
p.s. This process done to eliminate the frequency so that it only shows "available" or "not available"

In [85]:
tf_document[tf_document != 0] = 1
tf_document

Unnamed: 0,0,1,2,3
call,1.0,0.0,0.0,0.0
bird,1.0,1.0,1.0,0.0
habit,1.0,0.0,0.0,0.0
brother,0.0,1.0,0.0,0.0
like,0.0,1.0,0.0,0.0
month,0.0,1.0,0.0,0.0
father,0.0,1.0,0.0,0.0
gave,0.0,1.0,0.0,0.0
black,0.0,1.0,0.0,0.0
antoni,0.0,0.0,1.0,0.0


### Availablity Mapping

In [101]:
list_mapping = []
documents = {}
for i in range(len(unique_words)):
    mapp = tf_document.iloc[i].values.astype(int).tolist()
    documents[tf_document.index[i]] = mapp

### Query imputation and preprocessing

In [14]:
query = input("Insert a query: ")
query

Insert a query:  bird and list


'bird and list'

In [None]:
def boolean_rules():
    

Insert a query:


 bird and elephant


List of words in the query:
 ['bird', 'and', 'elephant']
List of the queries:
 ['bird', 'and', 'elephant']


In [77]:
tf_document.loc["call"].values.astype(int) & tf_document.loc["lost"].values.astype(int)

array([0, 0, 0, 0])

In [86]:
~ np.array([1,1,0,1,0]) * np.array([0,0,0,1,1]) & np.array([0,0,1,1,1])

array([0, 0, 0, 0, 1])

In [None]:
result = result.intersection(documents[next_token])

### Scikit-learn TF-IDF

In [38]:
bag_of_words = []
for index, var in enumerate(corpus_eng):
    dict_of_words = var.split()
    bag_of_words.append(dict_of_words)

In [40]:
print(bag_of_words, sep=', ')

[['call', 'bird', 'habit'], ['brother', 'like', 'bird', 'month', 'father', 'gave', 'black', 'bird'], ['antoni', 'bird', 'lost'], ['greedi', 'characterist', 'hate'], ['bird', 'two', 'wing', 'two', 'leg'], ['sustain', 'becom', 'key', 'focu', 'consum', 'electron', 'nokia', 'set', 'apart', 'priorit', 'ecofriendli', 'practic', 'manufactur', 'process', 'align', 'grow', 'demand', 'ethic', 'sourc', 'recycl', 'mobil', 'devic'], ['advent', 'augment', 'realiti', 'applic', 'smartphon', 'like', 'iphon', 'transform', 'versatil', 'tool', 'blur', 'line', 'digit', 'physic', 'realm', 'offer', 'user', 'immers', 'experi', 'previous', 'unimagin'], ['smartphon', 'manufactur', 'strive', 'market', 'domin', 'iphon', 'distinguish', 'seamless', 'ecosystem', 'integr', 'hardwar', 'softwar', 'servic', 'creat', 'cohes', 'user', 'experi', 'unparallel', 'competitor'], ['era', 'privaci', 'concern', 'loom', 'larg', 'iphon', 'stand', 'robust', 'secur', 'featur', 'provid', 'user', 'peac', 'mind', 'amidst', 'grow', 'threat

In [41]:
bag_of_words[0]

['call', 'bird', 'habit']

In [48]:
token_bank = {}
for index, var in enumerate(bag_of_words):
    for parameter, temp in enumerate(var):
        if temp not in token_bank:
            token_bank[temp] = 1
        else:
            token_bank[temp]+=1

In [50]:
print(token_bank, sep=', ')

{'call': 1, 'bird': 5, 'habit': 1, 'brother': 1, 'like': 2, 'month': 1, 'father': 1, 'gave': 1, 'black': 1, 'antoni': 1, 'lost': 1, 'greedi': 1, 'characterist': 1, 'hate': 1, 'two': 2, 'wing': 1, 'leg': 1, 'sustain': 2, 'becom': 1, 'key': 1, 'focu': 1, 'consum': 1, 'electron': 1, 'nokia': 2, 'set': 1, 'apart': 1, 'priorit': 1, 'ecofriendli': 1, 'practic': 1, 'manufactur': 3, 'process': 2, 'align': 1, 'grow': 2, 'demand': 1, 'ethic': 1, 'sourc': 1, 'recycl': 1, 'mobil': 2, 'devic': 1, 'advent': 1, 'augment': 1, 'realiti': 1, 'applic': 1, 'smartphon': 3, 'iphon': 3, 'transform': 1, 'versatil': 1, 'tool': 1, 'blur': 1, 'line': 1, 'digit': 1, 'physic': 1, 'realm': 1, 'offer': 1, 'user': 3, 'immers': 1, 'experi': 3, 'previous': 1, 'unimagin': 1, 'strive': 1, 'market': 1, 'domin': 1, 'distinguish': 1, 'seamless': 1, 'ecosystem': 1, 'integr': 1, 'hardwar': 1, 'softwar': 1, 'servic': 1, 'creat': 2, 'cohes': 1, 'unparallel': 1, 'competitor': 1, 'era': 1, 'privaci': 1, 'concern': 1, 'loom': 1, '

In [51]:
print(f"count of unique words: {len(token_bank)}")

count of unique words: 181


In [53]:
import re
import nltk
from nltk.stem import PorterStemmer

In [55]:
bop = {
    'not':3,
    'and':2,
    'or':1,
    '(':0,
    ')':0
}

BO = list(bop.keys())
BO

['not', 'and', 'or', '(', ')']

In [91]:
masuk = ""
masukan = input(f"masukan kata:{masuk}")

masukan kata: Fundamental, Or Memory


In [92]:
list_of_masukan = masukan.split()
list_of_masukan

['Fundamental,', 'Or', 'Memory']

In [93]:
st = PorterStemmer()

In [94]:
clean_input = []
for index, teks in enumerate(list_of_masukan):
    clean = re.sub(r'[^\w\s]','',teks.lower())
    stemword = st.stem(clean)
    clean_input.append(stemword)
print(clean_input, sep=', ')

['fundament', 'or', 'memori']


In [95]:
infix = clean_input
postfix = []
stack = []
scanned_operator = ""

if len(infix) > 0:
    II = 0
    while II < len(infix):
        if infix[II] not in BO:
            postfix.append(infix[II])
            II+=1
        else:
            scanned_operator = infix[II]
            if len(stack) == 0:
                stack.append(scanned_operator)
                II+=1
            elif scanned_operator == "(":
                stack.append(scanned_operator)
                II+=1
            elif scanned_operator ==")":
                postfix.append(stack.pop())
                while len(stack) != 0:
                    if stack[-1] != "(":
                        postfix.append(stack.pop())
                    else:
                        stack.pop()
                II+=1
            II+=1
elif BOP[stack[-1]] >= BOP[scanned_operator]:
    postfix.append(stack.pop())
    stack.append(scanned_operator)
    II+=1
else:
    stack.append(scanned_operator)
    II+=1
while len(stack)>0:
    postfix.append(stack.pop())

In [96]:
postfix

['fundament', 'or']

In [110]:
mappedQuery = {}
BO = ['and', 'or', 'not']
for index, var in enumerate(bag_of_words):
    templist = []
    for item in postfix:
        if item in BO:
            templist.append(item)
        else:
            if item in var:
                templist.append(True)
            else:
                templist.append(False)
    mappedQuery[index] = templist

In [111]:
print(mappedQuery, sep=', ')

{0: [False, 'or'], 1: [False, 'or'], 2: [False, 'or'], 3: [False, 'or'], 4: [False, 'or'], 5: [False, 'or'], 6: [False, 'or'], 7: [False, 'or'], 8: [False, 'or'], 9: [False, 'or'], 10: [True, 'or'], 11: [False, 'or'], 12: [False, 'or'], 13: [False, 'or'], 14: [False, 'or']}


In [115]:
evaluatedQuery = {}
BO = ['and', 'or', 'not']

for var in (mappedQuery):
    print(var)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
