# Preprocessing Data Text

Preprocessing data teks adalah tahapan penting dalam analisis teks dan pemrosesan bahasa alami (Natural Language Processing/NLP). Tujuannya adalah untuk membersihkan, mentransformasi, dan mengorganisasi data teks sehingga menjadi lebih sesuai untuk analisis atau penggunaan di berbagai model pembelajaran mesin. Berikut ini adalah beberapa tahapan umum dalam preprocessing data teks:

1. **Pembersihan Teks (Text Cleaning):**
   - Menghapus Karakter Khusus: Menghilangkan karakter khusus seperti tanda baca, simbol, atau karakter yang tidak relevan.

2. **Tokenisasi:**
  - Memecah teks menjadi kata-kata atau token-token yang lebih kecil. Tokenisasi biasanya melibatkan pemisahan berdasarkan spasi, tetapi dapat juga melibatkan pemisahan berdasarkan tanda baca.

3. **Menghilangkan Kata Stop (Stopword Removal):**
   - Menghapus kata-kata umum yang tidak memiliki nilai informasi tinggi, seperti "dan", "atau", "di", "sebuah", dll.

4. **Stemming atau Lemmatization:**
  - Mengubah kata-kata menjadi bentuk dasarnya. Stemming adalah pendekatan lebih kasar yang memotong akhiran kata, sedangkan lemmatization lebih canggih dan mengembalikan kata ke bentuk kata kerja atau kata benda dasarnya.

5. **Vektorisasi:**
   - Mengonversi setiap kata atau token ke dalam representasi numerik yang dapat digunakan oleh model pembelajaran mesin. Salah satu pendekatan umum adalah TF-IDF (Term Frequency-Inverse Document Frequency) dan Word Embeddings (seperti Word2Vec atau GloVe).

6. **Pengelompokan (Text Categorization):**
   - Mengkategorikan atau mengelompokkan teks ke dalam kelas atau kategori tertentu. Ini sering digunakan dalam tugas klasifikasi teks.

7. **Pemisahan Data:**
   - Memisahkan dataset menjadi set pelatihan, validasi, dan pengujian, jika diperlukan.

8. **Pengujian dan Evaluasi:**
   - Menguji dan mengevaluasi hasil preprocessing untuk memastikan data teks telah siap digunakan dalam analisis atau pemodelan.

Preprocessing data teks adalah langkah kunci dalam pemrosesan bahasa alami dan dapat memiliki dampak besar pada performa model dan analisis yang dibuat. Itu sebabnya penting untuk memahami berbagai tahapan preprocessing dan menerapkan mereka dengan cermat sesuai dengan tujuan analisis Anda.

## Install & Import Library

In [2]:
!pip install Sastrawi

Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m209.7/209.7 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1


Ini adalah bagian awal dari kode yang mengimpor semua pustaka, modul, dan dependencies yang akan digunakan dalam analisis teks, seperti NLTK, Scikit-Learn, dan Pandas.


In [3]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

import warnings
import pandas as pd
import numpy as np
import nltk
import re
import csv

nltk.download('stopwords')
nltk.download('punkt')
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Load Dataset

Ini adalah langkah untuk membaca data dari file CSV menggunakan Pandas. Data tersebut dimuat ke dalam DataFrame dengan nama `df`. DataFrame ini digunakan sebagai basis untuk analisis teks yang akan dilakukan.

In [4]:
df = pd.read_csv('/content/drive/MyDrive/Kuliah /Semester 7/Pencarian Dan Penambangan Web/ppw/Dataset/crawling_pta.csv')
df

Unnamed: 0,Judul,Penulis,Dosen Pembimbing I,Dosen Pembimbing II,Abstrak
0,PERANCANGAN DAN IMPLEMENTASI SISTEM DATABASE ...,A.Ubaidillah S.Kom,Budi Setyono M.T,Hermawan S.T,Sistem informasi akademik (SIAKAD) merupaka...
1,APLIKASI KONTROL DAN MONITORING JARINGAN KOMPU...,"M. Basith Ardianto,","Drs. Budi Soesilo, MT","Koko Joni, ST",Berjalannya koneksi jaringan komputer dengan l...
2,RANCANG BANGUN APLIKASI PROXY SERVER UNTUK ENK...,"Akhmad Suyandi, S.Kom","Drs. Budi Soesilo, M.T","Hermawan, ST, MT",Web server adalah sebuah perangkat lunak serve...
3,SISTEM PENDUKUNG KEPUTUSAN OPTIMASI PENJADWALA...,Heri Supriyanto,"Mulaab, S.Si., M.Kom","Firli Irhamni, ST., M.Kom",Penjadwalan kuliah di Perguruan Tinggi me...
4,SISTEM AUGMENTED REALITY ANIMASI BENDA BERGERA...,Septian Rahman Hakim,"Arik Kurniawati, S.Kom., M.T.","Haryanto, S.T., M.T.",Seiring perkembangan teknologi yang ada diduni...
...,...,...,...,...,...
495,SISTEM INFORMASI REKOMENDASI CUSTOMER BATIK ME...,ANDIKA PRASETYA PUTRA,"BAIN KHUSNUL KHOTIMAH,S.T.,M.Kom","RIKA YUNITARINI,S.T.,M.T",Ukm Batik Tresna Art Bangkalan memiliki Sistem...
496,PEMODELAN SISTEM DINAMIK UNTUK MENINGKATKAN PR...,Rasminawati,"Sigit Susanto Putro, S.Kom., M.Kom","Achmad Jauhari, S.T., M.Kom",Keuntungan pada sebuah industri merupakan sal...
497,CUSTOMER PROFILING MENGGUNAKAN ALGORITMA K-MEA...,Eka Fitriadi,"Bain Khusnun Khotimah, S.T., M.Kom.","Rika Yunitarini, S.T., M.T.",Customer Profiling adalah profil pelanggan. Pr...
498,Analytical Hierarchy Process (AHP) Untuk Evalu...,Sulaiman Wakhid,"Dr. Arif Muntasa, S.Si., M.T.","Moh. Kautsar Sophan, S.Kom., M.MT.",IT Master Plan adalah sebuah perencanaan strat...


## 1. Cleaning Data

### Menghapus Data Null

Kode ini memeriksa dan mengatasi data yang hilang (NaN) dalam DataFrame `df`. Data yang hilang dihapus dari DataFrame menggunakan `df.dropna()`.


In [6]:
df.isnull().sum()

Judul                  0
Penulis                0
Dosen Pembimbing I     0
Dosen Pembimbing II    1
Abstrak                5
dtype: int64

In [7]:
df = df.dropna()
df.isnull().sum()

Judul                  0
Penulis                0
Dosen Pembimbing I     0
Dosen Pembimbing II    0
Abstrak                0
dtype: int64

### Menghapus Karakter Tertentu

Fungsi `cleaning` digunakan untuk membersihkan teks dalam kolom 'Abstrak'. Ini menghapus karakter-karakter yang tidak relevan seperti tanda baca dan mengubah teks menjadi daftar kata-kata.

In [8]:
def cleaning(text):
  text = re.sub(r'[^a-zA-Z\s]', '', text).strip()
  return text

df['Cleaning'] = df['Abstrak'].apply(cleaning)
df['Cleaning']

0      Sistem  informasi  akademik  SIAKAD merupakan ...
1      Berjalannya koneksi jaringan komputer dengan l...
2      Web server adalah sebuah perangkat lunak serve...
3      Penjadwalan  kuliah  di  Perguruan  Tinggi  me...
4      Seiring perkembangan teknologi yang ada diduni...
                             ...                        
495    Ukm Batik Tresna Art Bangkalan memiliki Sistem...
496    Keuntungan  pada sebuah industri merupakan sal...
497    Customer Profiling adalah profil pelanggan Pro...
498    IT Master Plan adalah sebuah perencanaan strat...
499    Dewasa ini perkembangan teknologi semakin maju...
Name: Cleaning, Length: 494, dtype: object

Fungsi `cek_specialCharacter` digunakan untuk mendeteksi karakter khusus dalam teks yang telah dibersihkan. Jika karakter khusus ditemukan, teks tersebut dicetak.

In [9]:
def cek_specialCharacter(dokumen):
  karakter = ['!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '-', '_', '+', '=', '{', '}', '[', ']', '|', '\\', ':', ';', '"', "'", '<', '>', ',', '.', '?', '/', '`', '~']
  for i in dokumen:
    if i in karakter :
      print(dokumen)
df['Cleaning'].apply(cek_specialCharacter)

0      None
1      None
2      None
3      None
4      None
       ... 
495    None
496    None
497    None
498    None
499    None
Name: Cleaning, Length: 494, dtype: object

## 2. Tokenizing

Fungsi `tokenizer` digunakan untuk melakukan tokenisasi dan juga proses case folding pada teks yang telah dibersihkan. Ini mengubah teks menjadi token-token kata.

In [10]:
def tokenizer(text):
  text = text.lower()
  return word_tokenize(text)

df['Tokenizing'] = df['Cleaning'].apply(tokenizer)
df['Tokenizing']

0      [sistem, informasi, akademik, siakad, merupaka...
1      [berjalannya, koneksi, jaringan, komputer, den...
2      [web, server, adalah, sebuah, perangkat, lunak...
3      [penjadwalan, kuliah, di, perguruan, tinggi, m...
4      [seiring, perkembangan, teknologi, yang, ada, ...
                             ...                        
495    [ukm, batik, tresna, art, bangkalan, memiliki,...
496    [keuntungan, pada, sebuah, industri, merupakan...
497    [customer, profiling, adalah, profil, pelangga...
498    [it, master, plan, adalah, sebuah, perencanaan...
499    [dewasa, ini, perkembangan, teknologi, semakin...
Name: Tokenizing, Length: 494, dtype: object

Menghitung jumlah kata dalam tiap abstrak

In [13]:
def count_word(dokumens):
  return len(dokumens)

df['Count Word'] = df['Tokenizing'].apply(count_word)
df

Unnamed: 0,Judul,Penulis,Dosen Pembimbing I,Dosen Pembimbing II,Abstrak,Cleaning,Tokenizing,Count Word
0,PERANCANGAN DAN IMPLEMENTASI SISTEM DATABASE ...,A.Ubaidillah S.Kom,Budi Setyono M.T,Hermawan S.T,Sistem informasi akademik (SIAKAD) merupaka...,Sistem informasi akademik SIAKAD merupakan ...,"[sistem, informasi, akademik, siakad, merupaka...",150
1,APLIKASI KONTROL DAN MONITORING JARINGAN KOMPU...,"M. Basith Ardianto,","Drs. Budi Soesilo, MT","Koko Joni, ST",Berjalannya koneksi jaringan komputer dengan l...,Berjalannya koneksi jaringan komputer dengan l...,"[berjalannya, koneksi, jaringan, komputer, den...",204
2,RANCANG BANGUN APLIKASI PROXY SERVER UNTUK ENK...,"Akhmad Suyandi, S.Kom","Drs. Budi Soesilo, M.T","Hermawan, ST, MT",Web server adalah sebuah perangkat lunak serve...,Web server adalah sebuah perangkat lunak serve...,"[web, server, adalah, sebuah, perangkat, lunak...",182
3,SISTEM PENDUKUNG KEPUTUSAN OPTIMASI PENJADWALA...,Heri Supriyanto,"Mulaab, S.Si., M.Kom","Firli Irhamni, ST., M.Kom",Penjadwalan kuliah di Perguruan Tinggi me...,Penjadwalan kuliah di Perguruan Tinggi me...,"[penjadwalan, kuliah, di, perguruan, tinggi, m...",134
4,SISTEM AUGMENTED REALITY ANIMASI BENDA BERGERA...,Septian Rahman Hakim,"Arik Kurniawati, S.Kom., M.T.","Haryanto, S.T., M.T.",Seiring perkembangan teknologi yang ada diduni...,Seiring perkembangan teknologi yang ada diduni...,"[seiring, perkembangan, teknologi, yang, ada, ...",137
...,...,...,...,...,...,...,...,...
495,SISTEM INFORMASI REKOMENDASI CUSTOMER BATIK ME...,ANDIKA PRASETYA PUTRA,"BAIN KHUSNUL KHOTIMAH,S.T.,M.Kom","RIKA YUNITARINI,S.T.,M.T",Ukm Batik Tresna Art Bangkalan memiliki Sistem...,Ukm Batik Tresna Art Bangkalan memiliki Sistem...,"[ukm, batik, tresna, art, bangkalan, memiliki,...",122
496,PEMODELAN SISTEM DINAMIK UNTUK MENINGKATKAN PR...,Rasminawati,"Sigit Susanto Putro, S.Kom., M.Kom","Achmad Jauhari, S.T., M.Kom",Keuntungan pada sebuah industri merupakan sal...,Keuntungan pada sebuah industri merupakan sal...,"[keuntungan, pada, sebuah, industri, merupakan...",143
497,CUSTOMER PROFILING MENGGUNAKAN ALGORITMA K-MEA...,Eka Fitriadi,"Bain Khusnun Khotimah, S.T., M.Kom.","Rika Yunitarini, S.T., M.T.",Customer Profiling adalah profil pelanggan. Pr...,Customer Profiling adalah profil pelanggan Pro...,"[customer, profiling, adalah, profil, pelangga...",175
498,Analytical Hierarchy Process (AHP) Untuk Evalu...,Sulaiman Wakhid,"Dr. Arif Muntasa, S.Si., M.T.","Moh. Kautsar Sophan, S.Kom., M.MT.",IT Master Plan adalah sebuah perencanaan strat...,IT Master Plan adalah sebuah perencanaan strat...,"[it, master, plan, adalah, sebuah, perencanaan...",180


## 3. Stopword

Stopword adalah kata-kata umum yang sering tidak memiliki nilai dalam analisis teks. Fungsi `stopwordText` digunakan untuk menghapus stopword dari token-token kata yang telah dihasilkan.

Token-token kata yang telah dibersihkan dan memiliki stopword yang dihapus digabungkan kembali menjadi teks utuh dan disimpan dalam kolom 'Full Text'.

In [None]:
corpus = stopwords.words('indonesian')

def stopwordText(words):
 return [word for word in words if word not in corpus]

df['Stopword Removal'] = df['Tokenizing'].apply(stopwordText)

# Gabungkan kembali token menjadi kalimat utuh
df['Full Text'] = df['Stopword Removal'].apply(lambda x: ' '.join(x))
df['Full Text']

0      sistem informasi akademik siakad sistem inform...
1      berjalannya koneksi jaringan komputer lancar g...
2      web server perangkat lunak server berfungsi me...
3      penjadwalan kuliah perguruan kompleks permasal...
4      seiring perkembangan teknologi didunia muncul ...
                             ...                        
495    ukm batik tresna art bangkalan memiliki sistem...
496    keuntungan industri salah kemajuan industri te...
497    customer profiling profil pelanggan profiling ...
498    it master plan perencanaan strategis terkait p...
499    dewasa perkembangan teknologi maju salah tekno...
Name: Full Text, Length: 494, dtype: object

## 4. Stemming



```
def stemmingText(dokumens):
  factory = StemmerFactory()
  stemmer = factory.create_stemmer()

  return [stemmer.stem(i) for i in dokumens]

df['Stemming'] = df['Stopword Removal'].apply(stemmingText)
df['Stemming']
```



# VSM (Vector Space Model)

## 1. One Hot Encoding

### Fungsi One Hot Encoder Using Pandas

Fungsi `pandasOneHotEncoder` digunakan untuk melakukan one-hot encoding pada token-token kata yang telah dibersihkan dan memiliki stopword yang dihapus. Hasilnya adalah DataFrame yang mewakili keberadaan atau ketiadaan setiap kata dalam setiap dokumen.

In [None]:
def pandasOneHotEncoder(dokumens):
  encoder  = pd.get_dummies(dokumens.apply(pd.Series).stack()).sum(level=0)
  df = pd.concat([dokumens, encoder], axis=1)

  return df

oneHotEncoder = pandasOneHotEncoder(df['Stopword Removal'])
oneHotEncoder

Unnamed: 0,Stopword Removal,a,aam,ability,abjad,absensi,absolut,absolute,abstract,abstractpada,...,yogyakarta,yunitarini,z,zachman,zaman,zat,zf,zona,zucara,zungu
0,"[sistem, informasi, akademik, siakad, sistem, ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"[berjalannya, koneksi, jaringan, komputer, lan...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"[web, server, perangkat, lunak, server, berfun...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"[penjadwalan, kuliah, perguruan, kompleks, per...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"[seiring, perkembangan, teknologi, didunia, mu...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,"[ukm, batik, tresna, art, bangkalan, memiliki,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
496,"[keuntungan, industri, salah, kemajuan, indust...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
497,"[customer, profiling, profil, pelanggan, profi...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
498,"[it, master, plan, perencanaan, strategis, ter...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Save into CSV

In [None]:
oneHotEncoder.to_csv('OneHotEncoder.csv', index=False)

## 2. TF IDF

### Fungsi TF IDF

Fungsi `tfidf` digunakan untuk melakukan TF-IDF vectorization pada teks yang telah dibersihkan dan memiliki stopword yang dihapus. Hasilnya adalah representasi numerik dari dokumen menggunakan skema TF-IDF.

In [None]:
def tfidf(dokumen):
  vectorizer = TfidfVectorizer()
  x = vectorizer.fit_transform(dokumen).toarray()
  terms = vectorizer.get_feature_names_out()

  final_tfidf = pd.DataFrame(x, columns=terms)
  final_tfidf.insert(0, 'Dokumen', dokumen)

  return (vectorizer, final_tfidf)

tfidf_vectorizer, final_tfidf = tfidf(df['Full Text'])
final_tfidf

Unnamed: 0,Dokumen,aam,ability,abjad,absensi,absolut,absolute,abstract,abstractpada,abstrak,...,yl,yogyakarta,yunitarini,zachman,zaman,zat,zf,zona,zucara,zungu
0,sistem informasi akademik siakad sistem inform...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,berjalannya koneksi jaringan komputer lancar g...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,web server perangkat lunak server berfungsi me...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,penjadwalan kuliah perguruan kompleks permasal...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,seiring perkembangan teknologi didunia muncul ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
489,perkembangan internet pesat didukungnya media ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
490,kemajuan teknologi sebanding kebutuhan pelayan...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
491,media pembelajaran interaktif salah satunya pe...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
492,perkembangan teknologi menyebabkan mudahnya me...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Save into CSV

In [None]:
final_tfidf.to_csv('TF IDF.csv', index=False)

## 3. Term Frequensi

### Fungsi Term Frequensi

Fungsi `term_freq` digunakan untuk melakukan Term Frequency (TF) vectorization pada teks yang telah dibersihkan dan memiliki stopword yang dihapus. Hasilnya adalah representasi numerik dari dokumen menggunakan skema Term Frequency.

In [None]:
def term_freq(dokumens):
  # Buat objek CountVectorizer
  vectorizer = CountVectorizer()
  tf_matrix = vectorizer.fit_transform(dokumens).toarray()
  terms = vectorizer.get_feature_names_out()

  final_tf = pd.DataFrame(tf_matrix, columns=terms)
  final_tf.insert(0, 'Dokumen', dokumens)

  return (vectorizer, final_tf, tf_matrix, terms)

tf_vectorizer, final_tf, tf_matrix, tf_terms = term_freq(df['Full Text'])
final_tf

Unnamed: 0,Dokumen,aam,ability,abjad,absensi,absolut,absolute,abstract,abstractpada,abstrak,...,yl,yogyakarta,yunitarini,zachman,zaman,zat,zf,zona,zucara,zungu
0,sistem informasi akademik siakad sistem inform...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,berjalannya koneksi jaringan komputer lancar g...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,web server perangkat lunak server berfungsi me...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,penjadwalan kuliah perguruan kompleks permasal...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,seiring perkembangan teknologi didunia muncul ...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
489,perkembangan internet pesat didukungnya media ...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
490,kemajuan teknologi sebanding kebutuhan pelayan...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
491,media pembelajaran interaktif salah satunya pe...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
492,perkembangan teknologi menyebabkan mudahnya me...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Save into CSV

In [None]:
final_tf.to_csv('Term Frequensi.csv', index=False)


## 4. Logarithm Freqency

Fungsi `logarithm_freq` digunakan untuk melakukan transformasi frekuensi logaritmik pada data Term Frequency. Ini membantu dalam mengurangi dampak dominasi kata-kata yang sangat umum dalam analisis teks.

### Fungsi Logarithm Frequensi

In [None]:
def logarithm_freq(dokumens):
  return np.log10(dokumens + 1)

df_logarithm_freq = pd.DataFrame(tf_matrix, columns=tf_terms).apply(logarithm_freq)
df_logarithm_freq.insert(0, 'Dokumen', df['Full Text'])
df_logarithm_freq

Unnamed: 0,Dokumen,aam,ability,abjad,absensi,absolut,absolute,abstract,abstractpada,abstrak,...,yl,yogyakarta,yunitarini,zachman,zaman,zat,zf,zona,zucara,zungu
0,sistem informasi akademik siakad sistem inform...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,berjalannya koneksi jaringan komputer lancar g...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,web server perangkat lunak server berfungsi me...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,penjadwalan kuliah perguruan kompleks permasal...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,seiring perkembangan teknologi didunia muncul ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
489,perkembangan internet pesat didukungnya media ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
490,kemajuan teknologi sebanding kebutuhan pelayan...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
491,media pembelajaran interaktif salah satunya pe...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
492,perkembangan teknologi menyebabkan mudahnya me...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Save into CSV

In [None]:
df_logarithm_freq.to_csv('Logarithm Frequensi.csv', index=False)