## Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. **It measures how important a term is within a document relative to a collection of documents** (i.e., relative to a corpus).
- **Term Frequency:** The TF of a term is the number of times the term appears in a document divided by the total number of words in that document.
- **Inverse Document Frequency:** The IDF of a term is inversely related to the proportion of documents in the corpus that contain the term. Words that appear in only a small percentage of documents (e.g., technical jargon) receive higher importance values than words common across all documents (e.g., a, the, and).
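
The two definitions above can be worked through by hand. The sketch below is only an illustration: it assumes the plain `log(N / df)` form of IDF (libraries such as scikit-learn use a smoothed variant, so their numbers differ), and the helper names `term_frequency`, `inverse_document_frequency`, and `tf_idf` are made up for this example.

```python
import math
from collections import Counter

# Three short, already pre-processed documents (lists of tokens)
docs = [
    "call bird habit".split(),
    "antoni bird lost".split(),
    "greedi characterist hate".split(),
]

def term_frequency(term, doc):
    # Occurrences of the term divided by the document length
    return Counter(doc)[term] / len(doc)

def inverse_document_frequency(term, docs):
    # log(N / df): N documents in total, df of them contain the term.
    # Assumes the term occurs in at least one document (df > 0).
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

bird_score = tf_idf("bird", docs[0], docs)  # "bird" appears in 2 of 3 documents
call_score = tf_idf("call", docs[0], docs)  # "call" appears in 1 of 3 documents
print(bird_score, call_score)
```

Both terms have the same TF in the first document, so the rarer term `call` ends up with the higher score purely because of its IDF.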

### About Project
This project is divided into **three notebooks** that explain the usage of TF-IDF on **Bahasa Indonesia and English** text. The process flow runs from data collection (building the corpus) through pre-processing to fitting the algorithm. The detailed steps are:
1. **Data Collection (self-produced)**
2. **Text Pre-Processing (Case Folding, Punctuation Removal, Tokenizing, Applying Stop Words, Stemming)**
3. **Fitting the TF-IDF Algorithm**
4. **Testing for Input and Output**

#### The notebooks are divided into three sub-processes:
1. text-preprocessing-english.ipynb
2. text-preprocessing-indonesia.ipynb
3. implementation.ipynb

### Libraries used in this project

In [1]:
# General text-processing using NLTK
!pip install nltk



In [2]:
# Punctuation removal using regex
!pip install regex



In [3]:
# TF-IDF Algorithm using sklearn
!pip install scikit-learn



In [4]:
# Numpy for numerical manipulation
!pip install numpy



In [5]:
# Dataframe manipulation using Pandas
!pip install pandas



### Importing the libraries

In [6]:
import numpy as np
import pandas as pd

import regex as re
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer

### Loading the dataset used in this project

In [7]:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', 1000)

dataset = pd.read_csv('corpus-inggris.csv')
dataset

Unnamed: 0,id,text,topic
0,ENG1,"They called him a bird, because of his habit",bird
1,ENG2,My brother likes bird and after a month my father gave him a black bird,bird
2,ENG3,Antony has a bird and he lost it,bird
3,ENG4,Greedy is the most characteristic that I hate,hate
4,ENG5,Bird has two wings and two legs,bird
5,ENG6,"As sustainability becomes a key focus in consumer electronics, Nokia sets itself apart by prioritizing eco-friendly practices in its manufacturing processes, aligning with the growing demand for ethically sourced and recyclable mobile devices.",nokia
6,ENG7,"With the advent of augmented reality applications, smartphones like the iPhone are transforming into versatile tools that blur the lines between digital and physical realms, offering users immersive experiences previously unimaginable.",smartphone
7,ENG8,"As smartphone manufacturers strive for market dominance, the iPhone distinguishes itself with its seamless ecosystem, where integration between hardware, software, and services creates a cohesive user experience unparalleled by its competitors.",smartphone
8,ENG9,"In an era where privacy concerns loom large, the iPhone stands out for its robust security features, providing users with peace of mind amidst growing threats to personal data on smartphones.",smartphone
9,ENG10,"Nokia, once synonymous with mobile innovation, is undergoing a resurgence in the telecommunications industry, leveraging its heritage to reintroduce iconic designs infused with modern technological advancements.",nokia


### Helper function

In [8]:
def listToString(s):
    # Join a list of strings into a single space-separated string
    return " ".join(s)

### Text Pre-processing

In [9]:
text = dataset.text.to_list()

In [10]:
pd.DataFrame(text)

Unnamed: 0,0
0,"They called him a bird, because of his habit"
1,My brother likes bird and after a month my father gave him a black bird
2,Antony has a bird and he lost it
3,Greedy is the most characteristic that I hate
4,Bird has two wings and two legs
5,"As sustainability becomes a key focus in consumer electronics, Nokia sets itself apart by prioritizing eco-friendly practices in its manufacturing processes, aligning with the growing demand for ethically sourced and recyclable mobile devices."
6,"With the advent of augmented reality applications, smartphones like the iPhone are transforming into versatile tools that blur the lines between digital and physical realms, offering users immersive experiences previously unimaginable."
7,"As smartphone manufacturers strive for market dominance, the iPhone distinguishes itself with its seamless ecosystem, where integration between hardware, software, and services creates a cohesive user experience unparalleled by its competitors."
8,"In an era where privacy concerns loom large, the iPhone stands out for its robust security features, providing users with peace of mind amidst growing threats to personal data on smartphones."
9,"Nokia, once synonymous with mobile innovation, is undergoing a resurgence in the telecommunications industry, leveraging its heritage to reintroduce iconic designs infused with modern technological advancements."


#### Data Cleaning
This step **removes punctuation** and **normalizes every letter to lowercase.**

In [11]:
clean_text = []
for teks in text:
    # Lowercase, then drop every character that is neither a word character nor whitespace
    clean = re.sub(r'[^\w\s]', '', teks.lower())
    clean_text.append(clean)
print(clean_text)

['they called him a bird because of his habit', 'my brother likes bird and after a month my father gave him a black bird', 'antony has a bird and he lost it', 'greedy is the most characteristic that i hate', 'bird has two wings and two legs', 'as sustainability becomes a key focus in consumer electronics nokia sets itself apart by prioritizing ecofriendly practices in its manufacturing processes aligning with the growing demand for ethically sourced and recyclable mobile devices', 'with the advent of augmented reality applications smartphones like the iphone are transforming into versatile tools that blur the lines between digital and physical realms offering users immersive experiences previously unimaginable', 'as smartphone manufacturers strive for market dominance the iphone distinguishes itself with its seamless ecosystem where integration between hardware software and services creates a cohesive user experience unparalleled by its competitors', 'in an era where privacy concerns l
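
One side effect visible in the output above is that `eco-friendly` becomes `ecofriendly`, because the hyphen is deleted rather than replaced. The sketch below reproduces this with the standard-library `re` module (which behaves the same as `regex` for this pattern) and shows a hypothetical variant that substitutes a space instead; the `clean` helper and its `replacement` parameter are illustrative names, not part of the notebook.

```python
import re

def clean(text, replacement=""):
    # Lowercase, then replace every character that is not a word character or whitespace.
    # Note that \w also matches digits and the underscore, so those are kept.
    return re.sub(r"[^\w\s]", replacement, text.lower())

sample = "Nokia sets itself apart by prioritizing eco-friendly practices."

# Behaviour used in the notebook: punctuation is deleted
stripped = clean(sample)
print(stripped)

# Hypothetical variant: substitute a space, then collapse repeated whitespace
spaced = " ".join(clean(sample, " ").split())
print(spaced)
```

Which variant is preferable depends on whether hyphenated compounds should count as one term or two in the TF-IDF vocabulary.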

#### Sentence Tokenizing
This step splits each document in the corpus into its individual sentences. Because punctuation was already removed during cleaning, each document comes back as a single sentence.

In [12]:
split_sentences = []
for teks in clean_text:
    # sent_tokenize returns a list of sentences for one document
    sentences = sent_tokenize(teks)
    split_sentences.append(sentences)
print(split_sentences)

[['they called him a bird because of his habit'], ['my brother likes bird and after a month my father gave him a black bird'], ['antony has a bird and he lost it'], ['greedy is the most characteristic that i hate'], ['bird has two wings and two legs'], ['as sustainability becomes a key focus in consumer electronics nokia sets itself apart by prioritizing ecofriendly practices in its manufacturing processes aligning with the growing demand for ethically sourced and recyclable mobile devices'], ['with the advent of augmented reality applications smartphones like the iphone are transforming into versatile tools that blur the lines between digital and physical realms offering users immersive experiences previously unimaginable'], ['as smartphone manufacturers strive for market dominance the iphone distinguishes itself with its seamless ecosystem where integration between hardware software and services creates a cohesive user experience unparalleled by its competitors'], ['in an era where p
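
Every document above yields exactly one sentence because the sentence boundaries were stripped during cleaning. The sketch below illustrates the effect of that ordering with a deliberately naive splitter; `naive_sentences` is a hypothetical stand-in for `sent_tokenize`, used here only so the example runs without NLTK data.

```python
import re

def naive_sentences(text):
    # Hypothetical stand-in for sent_tokenize: split after '.', '!' or '?' followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

raw = "Antony has a bird. He lost it."
cleaned = re.sub(r"[^\w\s]", "", raw.lower())

before_cleaning = naive_sentences(raw)      # boundaries still present
after_cleaning = naive_sentences(cleaned)   # boundaries already removed

print(before_cleaning)
print(after_cleaning)
```

If per-sentence structure matters, sentence splitting would have to run before punctuation removal; for TF-IDF over whole documents the order makes no difference to the final tokens.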

#### Word Tokenizing
This step splits each sentence into its individual words.

In [13]:
split_words = []
for sentences in split_sentences:
    # Re-join the sentences of one document, then split them into word tokens
    words = word_tokenize(listToString(sentences))
    split_words.append(words)
print(split_words)

[['they', 'called', 'him', 'a', 'bird', 'because', 'of', 'his', 'habit'], ['my', 'brother', 'likes', 'bird', 'and', 'after', 'a', 'month', 'my', 'father', 'gave', 'him', 'a', 'black', 'bird'], ['antony', 'has', 'a', 'bird', 'and', 'he', 'lost', 'it'], ['greedy', 'is', 'the', 'most', 'characteristic', 'that', 'i', 'hate'], ['bird', 'has', 'two', 'wings', 'and', 'two', 'legs'], ['as', 'sustainability', 'becomes', 'a', 'key', 'focus', 'in', 'consumer', 'electronics', 'nokia', 'sets', 'itself', 'apart', 'by', 'prioritizing', 'ecofriendly', 'practices', 'in', 'its', 'manufacturing', 'processes', 'aligning', 'with', 'the', 'growing', 'demand', 'for', 'ethically', 'sourced', 'and', 'recyclable', 'mobile', 'devices'], ['with', 'the', 'advent', 'of', 'augmented', 'reality', 'applications', 'smartphones', 'like', 'the', 'iphone', 'are', 'transforming', 'into', 'versatile', 'tools', 'that', 'blur', 'the', 'lines', 'between', 'digital', 'and', 'physical', 'realms', 'offering', 'users', 'immersive'

#### Implementing Stop Words in English
Below are the stop words used in this project, available from the `nltk.corpus` module.

In [14]:
english_stop = stopwords.words('english')
print(len(english_stop), "stopwords in english:\n", english_stop)

179 stopwords in english:
 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'ow

The stop words are then removed from the dataset:

In [15]:
split_stop = []
for words in split_words:
    # Keep only the tokens that are not in the stop word list
    wordsFiltered = [w for w in words if w not in english_stop]
    split_stop.append(wordsFiltered)
print(split_stop)

[['called', 'bird', 'habit'], ['brother', 'likes', 'bird', 'month', 'father', 'gave', 'black', 'bird'], ['antony', 'bird', 'lost'], ['greedy', 'characteristic', 'hate'], ['bird', 'two', 'wings', 'two', 'legs'], ['sustainability', 'becomes', 'key', 'focus', 'consumer', 'electronics', 'nokia', 'sets', 'apart', 'prioritizing', 'ecofriendly', 'practices', 'manufacturing', 'processes', 'aligning', 'growing', 'demand', 'ethically', 'sourced', 'recyclable', 'mobile', 'devices'], ['advent', 'augmented', 'reality', 'applications', 'smartphones', 'like', 'iphone', 'transforming', 'versatile', 'tools', 'blur', 'lines', 'digital', 'physical', 'realms', 'offering', 'users', 'immersive', 'experiences', 'previously', 'unimaginable'], ['smartphone', 'manufacturers', 'strive', 'market', 'dominance', 'iphone', 'distinguishes', 'seamless', 'ecosystem', 'integration', 'hardware', 'software', 'services', 'creates', 'cohesive', 'user', 'experience', 'unparalleled', 'competitors'], ['era', 'privacy', 'concer
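
The filtering step can be sketched in isolation as follows. The stop list here is a tiny hypothetical one (the notebook uses the 179-entry NLTK list), and it is stored as a `set`, an optional design choice that makes each membership check a hash lookup instead of a scan through a list.

```python
# Hypothetical, tiny stop list for illustration only
stop_words = {"they", "him", "a", "because", "of", "his"}

def remove_stop_words(tokens, stop_words):
    # Keep the original token order, dropping anything found in the stop list
    return [t for t in tokens if t not in stop_words]

tokens = ['they', 'called', 'him', 'a', 'bird', 'because', 'of', 'his', 'habit']
filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

In the notebook the same idea would amount to wrapping the list once, e.g. `set(stopwords.words('english'))`, before the loop; on a corpus this small the difference is negligible.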

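#### Stemming
Stemming reduces each word to its stem using NLTK's Porter stemmer (e.g., `called` becomes `call`).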

In [16]:
st = PorterStemmer()

In [17]:
dic_stem = {"nomor": [], "parameter": [], "kata": []}
for index, stop in enumerate(split_stop):
    for parameter, kata in enumerate(stop):
        stemword = st.stem(kata)
        dic_stem["nomor"].append(index)          # document index
        dic_stem["parameter"].append(parameter)  # word position within the document
        dic_stem["kata"].append(stemword)        # stemmed word

In [18]:
df_stem = pd.DataFrame(dic_stem)
df_stem.head(10)

Unnamed: 0,nomor,parameter,kata
0,0,0,call
1,0,1,bird
2,0,2,habit
3,1,0,brother
4,1,1,like
5,1,2,bird
6,1,3,month
7,1,4,father
8,1,5,gave
9,1,6,black
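
To see what the Porter stemmer does to individual words, the short sketch below stems a handful of tokens from the corpus. It assumes only that NLTK is installed; `PorterStemmer` is rule-based and needs no additional data download.

```python
from nltk.stem import PorterStemmer

st = PorterStemmer()

# A few tokens taken from the corpus after stop word removal
words = ["called", "likes", "wings", "antony", "greedy", "characteristic"]
stems = {w: st.stem(w) for w in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stems such as `antoni` or `greedi` are not dictionary words. That is expected: the stemmer strips and rewrites suffixes by rule, and for TF-IDF it only matters that variants of a word map to the same string.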


In [19]:
stembeneran = []
for index in range(len(split_stop)):
    # Rebuild each document as a space-separated string of its stems
    var = ' '.join(df_stem[df_stem.nomor == index].kata)
    stembeneran.append(var)
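
The same regrouping can also be expressed with a pandas `groupby`, which avoids filtering the DataFrame once per document. This is an alternative sketch, not the notebook's method; the small `df_stem` below is an abbreviated sample built for the example.

```python
import pandas as pd

# Abbreviated sample with the same columns as df_stem
df_stem = pd.DataFrame({
    "nomor":     [0, 0, 0, 1, 1, 1],
    "parameter": [0, 1, 2, 0, 1, 2],
    "kata":      ["call", "bird", "habit", "brother", "like", "bird"],
})

# Group the stems by document index and join each group into one string
stembeneran = df_stem.groupby("nomor")["kata"].agg(" ".join).tolist()
print(stembeneran)
```

One difference to keep in mind: a document whose tokens were all removed as stop words has no rows in `df_stem`, so `groupby` would skip it, whereas the index-based loop would emit an empty string for it.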

##### Comparison before (`split_stop`) and after (`stembeneran`) stemming

In [20]:
print(split_stop)

[['called', 'bird', 'habit'], ['brother', 'likes', 'bird', 'month', 'father', 'gave', 'black', 'bird'], ['antony', 'bird', 'lost'], ['greedy', 'characteristic', 'hate'], ['bird', 'two', 'wings', 'two', 'legs'], ['sustainability', 'becomes', 'key', 'focus', 'consumer', 'electronics', 'nokia', 'sets', 'apart', 'prioritizing', 'ecofriendly', 'practices', 'manufacturing', 'processes', 'aligning', 'growing', 'demand', 'ethically', 'sourced', 'recyclable', 'mobile', 'devices'], ['advent', 'augmented', 'reality', 'applications', 'smartphones', 'like', 'iphone', 'transforming', 'versatile', 'tools', 'blur', 'lines', 'digital', 'physical', 'realms', 'offering', 'users', 'immersive', 'experiences', 'previously', 'unimaginable'], ['smartphone', 'manufacturers', 'strive', 'market', 'dominance', 'iphone', 'distinguishes', 'seamless', 'ecosystem', 'integration', 'hardware', 'software', 'services', 'creates', 'cohesive', 'user', 'experience', 'unparalleled', 'competitors'], ['era', 'privacy', 'concer

In [21]:
print(stembeneran)

['call bird habit', 'brother like bird month father gave black bird', 'antoni bird lost', 'greedi characterist hate', 'bird two wing two leg', 'sustain becom key focu consum electron nokia set apart priorit ecofriendli practic manufactur process align grow demand ethic sourc recycl mobil devic', 'advent augment realiti applic smartphon like iphon transform versatil tool blur line digit physic realm offer user immers experi previous unimagin', 'smartphon manufactur strive market domin iphon distinguish seamless ecosystem integr hardwar softwar servic creat cohes user experi unparallel competitor', 'era privaci concern loom larg iphon stand robust secur featur provid user peac mind amidst grow threat person data smartphon', 'nokia synonym mobil innov undergo resurg telecommun industri leverag heritag reintroduc icon design infus modern technolog advanc', 'cellular biolog research uncov intric mechan govern cell signal pathway shed light fundament process crucial understand human health d

### Transform the clean dataset and export it to CSV

In [22]:
clean_dataset = pd.DataFrame(stembeneran)
clean_dataset = clean_dataset.rename(columns={0: 'teks'})

In [23]:
clean_dataset

Unnamed: 0,teks
0,call bird habit
1,brother like bird month father gave black bird
2,antoni bird lost
3,greedi characterist hate
4,bird two wing two leg
5,sustain becom key focu consum electron nokia set apart priorit ecofriendli practic manufactur process align grow demand ethic sourc recycl mobil devic
6,advent augment realiti applic smartphon like iphon transform versatil tool blur line digit physic realm offer user immers experi previous unimagin
7,smartphon manufactur strive market domin iphon distinguish seamless ecosystem integr hardwar softwar servic creat cohes user experi unparallel competitor
8,era privaci concern loom larg iphon stand robust secur featur provid user peac mind amidst grow threat person data smartphon
9,nokia synonym mobil innov undergo resurg telecommun industri leverag heritag reintroduc icon design infus modern technolog advanc


In [24]:
clean_dataset.to_csv('clean-corpus-inggris.csv')
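
By default `to_csv` also writes the row index, which shows up as an extra unnamed column when the file is read back. The sketch below demonstrates this with in-memory buffers instead of files; whether the index should be kept is a choice left to the downstream notebook, and passing `index=False` is only one option.

```python
import io
import pandas as pd

clean_dataset = pd.DataFrame({"teks": ["call bird habit", "antoni bird lost"]})

# Default: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
clean_dataset.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False writes only the data columns
without_index = io.StringIO()
clean_dataset.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

Alternatively, the index can be kept in the file and restored on load with `pd.read_csv(..., index_col=0)`.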