Thu, 12 Oct 2023
# Text Mining
---


Main Challenges of Text Mining: 
- **Unstructured** data
- **High Dimensionality** (many features)
- **Sparsity** (many zeros)
- **Ambiguity** (the same text can be read in more than one way depending on context)
- **Synonymy** (different words share the same or a similar meaning)
- **Polysemy** (a single word has several related meanings)
- **Collocation** (words take on a different meaning when combined with other words)
- **Domain-specific** (many words are specific to a domain)
- **Dynamic** (language changes over time)


In [57]:
#Library

!pip3 install sastrawi
!pip3 install nltk
!pip3 install nlp-id


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


**Keywords**

- Document: 1 or more paragraphs
- Corpus: Collection of documents

## Use Cases of Text Mining

- Sentiment Analysis
- Topic Modeling
- Text Classification


## Text Preprocessing
---

### Part 1: Tokenization

Tokenization: splitting a string into a list of words (tokens). It is usually combined with the following cleaning steps:
- Converting to lowercase (case folding)
- Expanding contractions / normalization (e.g., "don't" to "do not")
- Removing punctuation (?, !, commas, etc.)
- Removing numbers or converting numbers to words
- Removing extra white space
- Removing stop words
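
The cleaning steps listed above can be sketched on a single sentence using only the standard library. The contraction map and the stop word set here are small hypothetical stand-ins chosen for illustration, not the lists used later in this notebook.

```python
import re
import string

# Hypothetical contraction map for illustration; a real one would be much larger.
CONTRACTIONS = {"don't": "do not", "i'd": "i would"}

def clean_and_tokenize(text, stop_words):
    text = text.lower()                                               # case folding
    for short, full in CONTRACTIONS.items():                          # expand contractions
        text = text.replace(short, full)
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    text = re.sub(r"\s+", " ", text).strip()                          # normalise white space
    return [tok for tok in text.split() if tok not in stop_words]     # tokenize, drop stop words

STOP_WORDS = {"i", "he", "to", "do"}  # hypothetical stop word set
tokens = clean_and_tokenize("Nah I don't think he goes to usf 2 times!", STOP_WORDS)
print(tokens)  # ['nah', 'not', 'think', 'goes', 'usf', 'times']
```

Contractions are expanded before punctuation is removed; otherwise "don't" would already have become "dont" and the map would no longer match.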


In [58]:
#Import Library 
import pandas as pd
import numpy as np

from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

In [59]:
#Load Dataset
sms = pd.read_csv('/Users/Dwika/My Projects/DATASETS/sms_spam_collection.csv')

In [60]:
sms

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [61]:
news = pd.read_csv('/Users/Dwika/My Projects/DATASETS/Data_berita.csv', encoding='latin-1')

In [62]:
#Limit data to Judul & Category columns
news = news[['judul', 'kategori']]
news

Unnamed: 0,judul,kategori
0,Baekhyun EXO Berikan Contoh Baik Pencegahan Vi...,non_clickbait
1,Lee Seung Gi Akhirnya Beri Kabar Setelah Dicar...,clickbait
2,"UPDATE: Jaejoong JYJ Ngaku Kena Corona, Idol K...",clickbait
3,"Virus Corona Masuk ke Korea, Kenapa Banyak Ora...",clickbait
4,"16 Seleb yang Lagunya Bertema Virus Corona, Ad...",non_clickbait
...,...,...
2024,Kemenkes Permudah Lansia dan Pelayan Publik Me...,non_clickbait
2025,Satgas: Jakarta dan Jabar Penyumbang Kasus Cov...,clickbait
2026,Awal Mula Varian Baru Virus Corona Masuk ke In...,non_clickbait
2027,Jokowi Sebut Pengangguran di Indonesia Hampir ...,non_clickbait


In [63]:
#Data Preprocessing
#------------------

import string
import re #Standard-library regex module; the third-party regex package is not needed here

#Case Folding
def to_lower(text):
    return str(text).lower()

#Remove Punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('','',string.punctuation)) #Build a translation table that deletes every character in string.punctuation


#Remove Punctuation (alternative using a character filter)
def remove_punctuation2(text):
    return ''.join([char for char in text if char not in string.punctuation]) #Keep only the characters that are not punctuation

#Remove Number
def remove_number(text):
    return ''.join([char for char in text if not char.isdigit()]) #Keep only the characters that are not digits

#Remove Whitespace leading & trailing
def remove_whitespace_LT(text):
    return text.strip()

#Collapse repeated whitespace inside the text
def remove_whitespace_multiple(text):
    return re.sub(r'\s+',' ',text) #\s+ matches a run of one or more whitespace characters; each run becomes a single ' '

#Remove Single Char
def remove_singl_char(text):
    return re.sub(r"\b[a-zA-Z]\b", "", text) #\b is a word boundary, so this matches a letter that stands alone as a word and removes it
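
As a quick illustration of the two regular expressions above, the sketch below applies them to a made-up SMS-style string. It shows that removing one-letter words leaves gaps behind, which a later whitespace collapse (or a `split()`/`join()` round trip) cleans up.

```python
import re

text = "u c already then say"

# Drop letters that stand alone as words (\b marks a word boundary).
no_single = re.sub(r"\b[a-zA-Z]\b", "", text)

# Collapse the leftover runs of spaces and trim the ends.
collapsed = re.sub(r"\s+", " ", no_single).strip()

print(repr(no_single))   # '  already then say'
print(repr(collapsed))   # 'already then say'
```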



### Removing Stopwords

In [64]:
#Stopwords from Sastrawi
#-----------------------

stopword_factory = StopWordRemoverFactory()
stopword_1 = stopword_factory.get_stop_words()


In [65]:
import requests

#Get Stopword from URL
stopword_url = 'https://raw.githubusercontent.com/masdevid/ID-Stopwords/master/id.stopwords.02.01.2016.txt'

def get_stopword(stopword_url):
    stopwords = requests.get(stopword_url)
    return stopwords.text

stopwords_2 = get_stopword(stopword_url)
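stopwords_2 = stopwords_2.split('\n') #split() already returns a flat list; wrapping it in [] would nest it and none of these words would ever match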

In [66]:
#Combine Stopword
stopwords_all = stopword_1 + stopwords_2


#Remove stopword function
def remove_stopword(text):
    text = [word for word in text.split() if word not in stopwords_all]
    return " ".join(text)
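
One thing to watch for when combining stopword lists is that the result must be a flat list of strings. The sketch below uses tiny hypothetical lists (not the Sastrawi or GitHub lists) to show how a nested list silently disables part of the filter.

```python
raw = "yang\ndan\ndi"          # stand-in for the text of a downloaded stopword file
list_a = ["ke", "dari"]        # stand-in for a library-provided stopword list

flat = list_a + raw.split("\n")       # ['ke', 'dari', 'yang', 'dan', 'di']
nested = list_a + [raw.split("\n")]   # ['ke', 'dari', ['yang', 'dan', 'di']]

def remove_stopword(text, stop_words):
    # Keep only the words that are not in the stopword collection.
    return " ".join(w for w in text.split() if w not in stop_words)

title = "virus corona masuk ke korea dan jepang"
print(remove_stopword(title, flat))     # virus corona masuk korea jepang
print(remove_stopword(title, nested))   # virus corona masuk korea dan jepang
```

With the nested version no error is raised; "dan" simply never equals the inner list, so it stays in the text. Converting the flat list to a `set` would also make each lookup faster on long lists.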

In [None]:
#Stemming (Sastrawi is a stemmer; the function keeps the name lemma used in the rest of the notebook)

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

#Create stemmer
stem_factory = StemmerFactory().create_stemmer()

def lemma(text):
    return " ".join([stem_factory.stem(word) for word in text.split()])


In [77]:
#Compile all preprocessing
def prepro(text):
    pre = to_lower(text)
    pre = remove_punctuation(pre)
    pre = remove_number(pre)
    pre = remove_whitespace_LT(pre)
    pre = remove_whitespace_multiple(pre)
    pre = remove_singl_char(pre)
    pre = remove_stopword(pre)
    pre = lemma(pre)
    return pre

In [78]:
news['judul_preprocessed'] = news['judul'].apply(prepro)
news

Unnamed: 0,judul,kategori,judul_preprocessed
0,Baekhyun EXO Berikan Contoh Baik Pencegahan Vi...,non_clickbait,baekhyun exo ikan contoh baik cegah virus coro...
1,Lee Seung Gi Akhirnya Beri Kabar Setelah Dicar...,clickbait,lee seung gi akhir beri kabar cari banyak oran...
2,"UPDATE: Jaejoong JYJ Ngaku Kena Corona, Idol K...",clickbait,update jaejoong jyj ngaku kena corona idol kpo...
3,"Virus Corona Masuk ke Korea, Kenapa Banyak Ora...",clickbait,virus corona masuk korea banyak orang cari lee...
4,"16 Seleb yang Lagunya Bertema Virus Corona, Ad...",non_clickbait,seleb lagu tema virus corona bimbo rhoma irama
...,...,...,...
2024,Kemenkes Permudah Lansia dan Pelayan Publik Me...,non_clickbait,kemenkes mudah lansia layan publik dapat vaksin
2025,Satgas: Jakarta dan Jabar Penyumbang Kasus Cov...,clickbait,satgas jakarta jabar sumbang kasus covid banya...
2026,Awal Mula Varian Baru Virus Corona Masuk ke In...,non_clickbait,awal mula varian baru virus corona masuk indon...
2027,Jokowi Sebut Pengangguran di Indonesia Hampir ...,non_clickbait,jokowi sebut anggur indonesia hampir juta akib...


### Part 2: Lemmatization & Stemming


- Lemmatization: reverts words to their dictionary form (lemma)
    
    caring - cares - cared - caringly - carefully --> care - care - care - caringly - carefully

- Stemming: a heuristic process that strips word endings to reduce a word to its root form

    caring - cares - cared - caringly - carefully --> care - care - care - care - care
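
To make the contrast concrete, here is a deliberately tiny sketch: a toy suffix-stripping stemmer next to a toy dictionary-lookup lemmatizer. Both the suffix list and the lemma dictionary are hypothetical and far cruder than real tools such as Sastrawi or NLTK, so the stems differ from the idealised example above.

```python
# Toy rules, checked in order; longer suffixes come first.
SUFFIXES = ["ingly", "fully", "ing", "ed", "es", "s"]

# Toy dictionary: only words listed here are mapped to a lemma.
LEMMAS = {"caring": "care", "cares": "care", "cared": "care"}

def toy_stem(word):
    # Heuristic: chop the first matching suffix if at least 3 letters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Lookup: unknown words are returned unchanged.
    return LEMMAS.get(word, word)

words = ["caring", "cares", "cared", "caringly", "carefully"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)    # ['car', 'car', 'car', 'car', 'care']
print(lemmas)   # ['care', 'care', 'care', 'caringly', 'carefully']
```

The stemmer groups more word forms together but can produce strings that are not real words ("car" for "caring"), while the lemmatizer only returns dictionary forms and leaves anything it does not know untouched.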




In [73]:
#Stemming with Sastrawi (same definition as in the preprocessing section)

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

#Create stemmer
stem_factory = StemmerFactory().create_stemmer()

def lemma(text):
    return " ".join([stem_factory.stem(word) for word in text.split()])


In [79]:
#Test
display(news['judul'][99])
lemma(news['judul_preprocessed'][99])

'Jokowi Tetapkan Pembatasan Sosial Berskala Besar Hadapi Corona Covid-19, Apa Maksudnya?'

'jokowi tetap batas sosial skala besar hadap corona covid apa maksud'

### Part 3: Bag of Words

List all words in a document and count the frequency of each word
- Tokenization: splitting a string into a list of words
- Vocabulary building: collecting the list of all unique words across the corpus
- Vectorization: counting how often each vocabulary word appears in each document

Each unique word becomes a column, and each document becomes a row
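
The three steps can be sketched in plain Python on a made-up three-document corpus. This is only an illustration of the idea; `CountVectorizer` in scikit-learn does the same job with more options and a sparse output.

```python
from collections import Counter

# Hypothetical mini corpus (already lowercased and cleaned).
docs = ["virus corona masuk korea", "corona masuk indonesia", "vaksin corona"]

# 1. Tokenization: split each document into words.
tokenized = [doc.split() for doc in docs]

# 2. Vocabulary building: every unique word, in a fixed (sorted) order.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# 3. Vectorization: one row per document, one count per vocabulary word.
vectors = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)  # ['corona', 'indonesia', 'korea', 'masuk', 'vaksin', 'virus']
for row in vectors:
    print(row)
# [1, 0, 1, 1, 0, 1]
# [1, 1, 0, 1, 0, 0]
# [1, 0, 0, 0, 1, 0]
```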

TF (Term Frequency) : Number of times a word appears in a document

IDF (Inverse Document Frequency) : A weight that shrinks as the number of documents containing the word grows, so rarer words score higher

**TF-IDF (Term Frequency - Inverse Document Frequency) : TF * IDF**

Rescale features by how informative we expect them to be
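-  Give a high weight to any term that appears often in a particular document but not in many documents
-  $\text{tfidf}(word, doc) = \text{tf}(word, doc) \cdot \left(\ln\frac{N + 1}{N_w + 1} + 1\right)$ (the smoothed form that scikit-learn's `TfidfVectorizer` uses by default, before L2 normalization), with
    -  $\text{tf}(word, doc)$: number of times the word appears in the document
    -  $N_w$: number of documents in which the word appears
    -  $N$: number of documents in the training set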
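
As a worked example of the smoothed formula, the sketch below computes the raw score for a word that appears in every document and for one that appears in only one, assuming a hypothetical corpus of 3 documents and a term frequency of 1. The values are before the L2 row normalization that `TfidfVectorizer` applies by default.

```python
import math

def tfidf(tf, n_docs, n_docs_with_word):
    # Smoothed IDF: ln((N + 1) / (Nw + 1)) + 1, then multiplied by the term frequency.
    idf = math.log((n_docs + 1) / (n_docs_with_word + 1)) + 1
    return tf * idf

N = 3  # hypothetical number of documents

common = tfidf(tf=1, n_docs=N, n_docs_with_word=3)  # ln(4/4) + 1 = 1.0
rare = tfidf(tf=1, n_docs=N, n_docs_with_word=1)    # ln(4/2) + 1 = 1.693...

print(round(common, 3))  # 1.0
print(round(rare, 3))    # 1.693
```

The "+ 1" outside the logarithm keeps a word that occurs in every document from being zeroed out; it just receives the lowest possible weight.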


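**Bag of Multiple Words: n-Grams**

Single words lose context: "bad" and "not bad" both contain the token "bad". An n-gram keeps n consecutive words together as one feature:

- bad
- not bad --> not_bad 

For the text "not bad":

if n-gram = (1,1) then (unigrams only)
- not
- bad

if n-gram = (1,2) then (unigrams and bigrams)
- not
- bad
- not bad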
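
A minimal sketch of how an n-gram range expands a text into features. The helper `ngrams` is a hypothetical illustration written for this note; it mimics the meaning of `ngram_range` in scikit-learn's vectorizers but is not their implementation.

```python
def ngrams(text, ngram_range=(1, 1)):
    # Return every run of n consecutive tokens, for each n in the inclusive range.
    tokens = text.split()
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

print(ngrams("not bad", (1, 1)))  # ['not', 'bad']
print(ngrams("not bad", (1, 2)))  # ['not', 'bad', 'not bad']
```

Word order is preserved inside each n-gram, so "not bad" is a feature but "bad not" is not.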





In [81]:
#TF IDF

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split


In [82]:
#Split data

X = news['judul_preprocessed']
y = np.where(news['kategori'] == 'clickbait', 1, 0)

Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    random_state=2023)


In [92]:
#TF IDF Converter

tfidf = TfidfVectorizer(ngram_range=(1,2))
Xtrain_tfidf = tfidf.fit_transform(Xtrain)
Xtrain_tfidf.shape

(1623, 12113)

In [96]:
Xtrain_tfidf

<1623x12113 sparse matrix of type '<class 'numpy.float64'>'
	with 26461 stored elements in Compressed Sparse Row format>
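
The shape and the stored-element count reported above are enough to quantify the sparsity mentioned at the start of these notes. A small arithmetic sketch using those three numbers:

```python
n_rows, n_cols, stored = 1623, 12113, 26461  # values from the output above

total = n_rows * n_cols          # number of cells in the dense equivalent
density = stored / total         # fraction of cells that are non-zero
per_row = stored / n_rows        # average non-zero features per title

print(total)                     # 19659399
print(f"{density:.4%}")          # 0.1346%
print(round(per_row, 1))         # 16.3
```

Roughly 99.87% of the cells are zero, which is why the vectorizer returns a sparse matrix and why calling `.toarray()` on a much larger corpus can exhaust memory.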

In [93]:
Xtrain_tfidf.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [100]:
from scipy.sparse import csr_matrix
df_tfidf = pd.DataFrame(Xtrain_tfidf.toarray(), columns=tfidf.get_feature_names_out())
df_tfidf

Unnamed: 0,aa,aa gym,aa umbara,aali,aali malah,ab,ab bagaimana,abai,abai corona,abai pasien,...,zubir tak,zubir umum,zuckerberg,zuckerberg balik,zulkifli,zulkifli sebut,zumba,zumba medan,zumi,zumi zola
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1618,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1619,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1620,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1621,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [118]:
#TF IDF Converter

tfidf = TfidfVectorizer(ngram_range=(1,2))

#Pipeline
model_pipe = Pipeline([
    ('vect', tfidf),
    ('clf', DecisionTreeClassifier())
])

model_pipe

In [127]:
#Apply model

model_pipe.fit(Xtrain, ytrain)

In [128]:
pred = model_pipe.predict(Xtest)
pred

array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,

In [129]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [130]:
print(confusion_matrix(ytest, pred))

[[205  81]
 [ 63  57]]


In [131]:
print(classification_report(ytest, pred))

              precision    recall  f1-score   support

           0       0.76      0.72      0.74       286
           1       0.41      0.47      0.44       120

    accuracy                           0.65       406
   macro avg       0.59      0.60      0.59       406
weighted avg       0.66      0.65      0.65       406



In [132]:
print('Accuracy : ', accuracy_score(ytest, pred))

Accuracy :  0.645320197044335
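
The figures in the classification report can be reproduced by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with class 1 (clickbait) as the positive class.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label.
cm = [[205, 81],
      [63, 57]]

tn, fp = cm[0]   # true non-clickbait: predicted 0, predicted 1
fn, tp = cm[1]   # true clickbait:     predicted 0, predicted 1

precision = tp / (tp + fp)                           # 57 / 138
recall = tp / (tp + fn)                              # 57 / 120
f1 = 2 * precision * recall / (precision + recall)   # 114 / 258
accuracy = (tp + tn) / (tn + fp + fn + tp)           # 262 / 406

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

The model finds fewer than half of the clickbait titles (recall 57/120), and most of its clickbait predictions are wrong (precision 57/138), so the overall accuracy of about 0.65 is carried mainly by the majority class.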


# Testing 

In [177]:
news_headline = [
    'Ketika Jokowi Minta Masyarakat Tak Takut Corona, Ini Kata Dokter Tirta',
    'Viral Video Pembukaan Mal di India Diserbu hingga Dijarah Pengunjung, Warga Ambil Makanan Tak Bayar',

    'Siaran Pers: Kemenparekraf Perkuat Pasar Asia Pasifik Melalui PATA Travel Mart di India',
    'GEGER: Kemenparekraf VIRAL Perkuat KACAU Pasar Asia HEBOH Pasifik Melalui PATA Travel Mart di India',

    'Heboh Xiaomi dan Vivo Dituduh Tebar Propaganda China',
    'Cara Bikin Judul Clickbait Lebih Menarik, tapi Tetap Aman',

    'Elite Politik Yakin PM Ardern Mundur karena Ancaman dan Pelecehan',
    'Clickbait banget ini Elite Politik Yakin PM Ardern Mundur karena Ancaman dan Pelecehan',

    'Soal Dukun Cabul di Aceh Berjuluk Pesulap Hijau, Pesulap Merah Keberatan: Kacau Banget',
]

In [179]:
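#Note: the model was trained on preprocessed titles (judul_preprocessed), but these headlines are passed in raw;
#applying prepro() to each headline first would match the training input
clickbait_status = model_pipe.predict(news_headline)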

In [180]:
pd.DataFrame({'Judul': news_headline, 'Clickbait?': clickbait_status})

Unnamed: 0,Judul,Clickbait?
0,Ketika Jokowi Minta Masyarakat Tak Takut Coron...,1
1,Viral Video Pembukaan Mal di India Diserbu hin...,1
2,Siaran Pers: Kemenparekraf Perkuat Pasar Asia ...,0
3,GEGER: Kemenparekraf VIRAL Perkuat KACAU Pasar...,1
4,Heboh Xiaomi dan Vivo Dituduh Tebar Propaganda...,0
5,"Cara Bikin Judul Clickbait Lebih Menarik, tapi...",1
6,Elite Politik Yakin PM Ardern Mundur karena An...,0
7,Clickbait banget ini Elite Politik Yakin PM Ar...,1
8,Soal Dukun Cabul di Aceh Berjuluk Pesulap Hija...,1
