#### Table of Contents:
- [1. Data Acquisition](#1-akusisi-data)
- [2. Building the Slang Dictionary](#2-pembentukan-kamus-slang)
- [3. Data Annotation](#3-anotasi-data)
- [4. Preprocessing](#4-prapemrosesan)
- [5. SVM Modeling](#5-pemodelan-svm)
- [6. Performance Measurement](#6-pengukuran-performa)
    - [6.1 Performance Measurement of SVM + BOW](#61-pengukuran-performa-svm--bow)
    - [6.2 Performance Measurement of SVM + TF-IDF](#62-pengukuran-performa-svm--tf-idf)
- [7. Testing](#7-pengujian)

___Internal links do not work on the GitHub page.___

***

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%load_ext watermark
%watermark -a "F. Waskito" -n -t -u -v

Author: F. Waskito

Last updated: Mon Jun 05 2023 21:26:49

Python implementation: CPython
Python version       : 3.9.16
IPython version      : 8.12.0



#### __1. Data Acquisition__

In [None]:
from collection.scrape import TweetScraper

scraper = TweetScraper(
    "depresi OR bipolar",
    "id",
    "2022-05-10",
    "2022-05-11",
)

scraper.scrape()

Scraping: 100%|██████████| 476/476 [00:39<00:00, 12.17it/s]


In [None]:
path = "data/tweet/scrape/depresi_or_bipolar_tweets_id_220510_with_irrelevant.csv"
scraper.tweets_table.to_csv(path, index=False)

In [None]:
print(f"Number of tweets before removal: {scraper.n_tweets}")

irrelevant_tweets_table = scraper.remove_irrelevant()
print(f"Number of tweets after removal: {scraper.n_tweets}")

Number of tweets before removal: 476


Removing irrelevant: 100%|██████████| 476/476 [07:44<00:00,  1.03it/s]

Number of tweets after removal: 428





In [None]:
path = "data/tweet/scrape/depresi_or_bipolar_tweets_id_220510_irrelevant.csv"
irrelevant_tweets_table.to_csv(path, index=False)

In [None]:
path = "data/tweet/scrape/depresi_or_bipolar_tweets_id_220510.csv"
scraper.tweets_table.to_csv(path, index=False)

#### __2. Building the Slang Dictionary__

Jump to:
- [1. Data Acquisition](#1-akusisi-data)
- [3. Data Annotation](#3-anotasi-data)

In [1]:
import pandas

path = "data/tweet/scrape/depresi_or_bipolar_tweets_id_01-10.csv"
tweets_table = pandas.read_csv(path)
tweets_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3409 entries, 0 to 3408
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Tweet_ID  3409 non-null   int64 
 1   Datetime  3409 non-null   object
 2   Username  3409 non-null   object
 3   Text      3409 non-null   object
dtypes: int64(1), object(3)
memory usage: 106.7+ KB


In [2]:
from collection.slang.template import KamusSlangTemplate

template = KamusSlangTemplate(tweets_table['Text'])
template.create()

100%|██████████| 3409/3409 [4:22:04<00:00,  4.61s/it]  


__Note__: The slang search requires text preprocessing all the way through *stemming* (using Sastrawi), so the long runtime (about 4.6 s per tweet, over 4 hours for 3409 tweets) is the price paid for it.
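
Because tweets reuse many of the same words, one possible way to reduce this cost is to cache the stem of each distinct token. The sketch below is not part of `KamusSlangTemplate`; `slow_stem` is a hypothetical stand-in for an expensive stemmer call, and its suffix rule is purely illustrative rather than Sastrawi's actual algorithm.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the expensive function really runs


def slow_stem(token: str) -> str:
    """Hypothetical stand-in for an expensive stemmer call (not Sastrawi itself)."""
    CALLS["count"] += 1
    # Illustrative rule only: strip one common Indonesian suffix.
    for suffix in ("nya", "kan", "lah"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


@lru_cache(maxsize=None)
def cached_stem(token: str) -> str:
    """Memoized wrapper: each distinct token is stemmed only once."""
    return slow_stem(token)


def stem_tokens(tokens):
    """Stem a list of tokens through the cache."""
    return [cached_stem(token) for token in tokens]


tweets = [
    ["rasanya", "depresi", "banget"],
    ["depresi", "itu", "rasanya", "berat"],
]
stemmed = [stem_tokens(tokens) for tokens in tweets]
print(stemmed)
print("expensive calls:", CALLS["count"])
```

With seven tokens but only five distinct ones, the expensive function runs five times. How much this helps on the real corpus depends on how often tokens repeat, which is not measured here.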

In [7]:
template.template.info()
template.template.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5357 entries, 0 to 5356
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Slang       5357 non-null   object 
 1   Makna       0 non-null      float64
 2   No_Konteks  5357 non-null   int64  
 3   Konteks     5357 non-null   object 
dtypes: float64(1), int64(1), object(2)
memory usage: 167.5+ KB


| | Slang | Makna | No_Konteks | Konteks |
|---|---|---|---|---|
| 0 | ilux | NaN | 0 | Padahal ilux baru KB/TK ya mana ngerti begitua... |
| 1 | kb | NaN | 0 | Padahal ilux baru KB/TK ya mana ngerti begitua... |
| 2 | tk | NaN | 0 | Padahal ilux baru KB/TK ya mana ngerti begitua... |
| 3 | ya | NaN | 0 | Padahal ilux baru KB/TK ya mana ngerti begitua... |
| 4 | ngerti | NaN | 0 | Padahal ilux baru KB/TK ya mana ngerti begitua... |


In [None]:
path = "data/dictionary/kamus_slang.csv"
template.template.to_csv(path, index=False)

#### __3. Data Annotation__

Jump to:
- [2. Building the Slang Dictionary](#2-pembentukan-kamus-slang)
- [4. Preprocessing](#4-prapemrosesan)

In [14]:
import pandas

path = "data/tweet/scrape/depresi_or_bipolar_tweets_id_01-10.csv"
tweets_table = pandas.read_csv(path)
tweets_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3409 entries, 0 to 3408
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Tweet_ID  3409 non-null   float64
 1   Datetime  3409 non-null   object 
 2   Username  3409 non-null   object 
 3   Text      3409 non-null   object 
dtypes: float64(1), object(3)
memory usage: 106.7+ KB


In [2]:
from collection.annotation import BlobLabeler

anotator = BlobLabeler(tweets_table["Text"])
anotator.generate()

labeling: 100%|██████████| 3409/3409 [31:51<00:00,  1.78it/s]  


__Note__: Labeling time depends mainly on two factors: the internet connection and the Python version. This run (about 32 minutes) was almost twice as fast as the previous one. With exactly the same *dependencies*, the previous run took more than 55 minutes when carried out during peak hours on Python 3.10.x.

In [5]:
from collection import analysis

analysis.get_distribution(anotator.labels)

Distribution:
	('positive', 864)
	('neutral', 1253)
	('negative', 1292)


In [10]:
tweets_table["Sentiment"] = anotator.labels

tweets_table.info()
tweets_table.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3409 entries, 0 to 3408
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Tweet_ID   3409 non-null   float64
 1   Datetime   3409 non-null   object 
 2   Username   3409 non-null   object 
 3   Text       3409 non-null   object 
 4   Sentiment  3409 non-null   object 
dtypes: float64(1), object(4)
memory usage: 133.3+ KB


| | Tweet_ID | Datetime | Username | Text | Sentiment |
|---|---|---|---|---|---|
| 0 | 1.520554e+18 | 2022-05-01 00:00:12+00:00 | yfnasa | Padahal ilux baru KB/TK ya mana ngerti begitua... | positive |
| 1 | 1.520561e+18 | 2022-05-01 00:30:46+00:00 | SoleilLumina | Et dah gw jadi ngefollow akun quotes depresi (... | neutral |
| 2 | 1.520563e+18 | 2022-05-01 00:36:48+00:00 | petitegeeky | Rossy setahun di laut betah, cuma pas pulang a... | neutral |
| 3 | 1.520565e+18 | 2022-05-01 00:43:40+00:00 | raniapj | Sebenarnya aku jarang jbjb. Aku toh lagi stres... | neutral |
| 4 | 1.520565e+18 | 2022-05-01 00:43:50+00:00 | Jawaban | Apakah kamu sedang banyak masalah, sampai-samp... | neutral |


In [None]:
path = "data/tweet/depresi_or_bipolar_tweets_id_01-10.csv"
tweets_table.to_csv(path, index=False)

#### __4. Preprocessing__

Jump to:
- [3. Data Annotation](#3-anotasi-data)
- [5. SVM Modeling](#5-pemodelan-svm)

In [21]:
from IPython.core.interactiveshell import InteractiveShell
import sklearnex

InteractiveShell.ast_node_interactivity = "all"
sklearnex.patch_sklearn()

%load_ext watermark
%watermark -a "F. Waskito" -n -t -u -v

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
Author: F. Waskito

Last updated: Thu Jun 08 2023 07:16:59

Python implementation: CPython
Python version       : 3.9.16
IPython version      : 8.12.0



Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [2]:
import pandas

path = "data/tweet/depresi_or_bipolar_tweets_id_01-10.csv"
tweets_table = pandas.read_csv(path)
tweets_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3409 entries, 0 to 3408
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Tweet_ID   3409 non-null   float64
 1   Datetime   3409 non-null   object 
 2   Username   3409 non-null   object 
 3   Text       3409 non-null   object 
 4   Sentiment  3409 non-null   object 
dtypes: float64(1), object(4)
memory usage: 133.3+ KB


In [3]:
from collection import analysis

texts = tweets_table.loc[:, "Text"].copy().to_list()
labels = tweets_table.loc[:, "Sentiment"].copy().to_list()

analysis.get_shape(texts)
analysis.get_distribution(labels)

Shape: (3409,)
Distribution:
	('positive', 864)
	('neutral', 1253)
	('negative', 1292)


##### 4.1 Pre-Numeric Processing

In [4]:
import time
from tqdm import tqdm
from preprocess.preprocessing import TextPreprocessor

preprocessor = TextPreprocessor()
for i, text in enumerate(tqdm(texts)):
    text = preprocessor.clean(text)
    text = preprocessor.standardize(text)
    tokens = preprocessor.tokenize(text)
    tokens = preprocessor.filter(tokens)
    texts[i] = preprocessor.stem(tokens)
    time.sleep(0.001)

100%|██████████| 3409/3409 [10:36<00:00,  5.36it/s]


__Note__: For data of the same size (3409 tweets), the *runtime* of pre-numeric processing (with or without *negation handling*) on Python 3.10.x or 3.11.x almost always falls between 12m 30s and 13m.
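
The internals of `TextPreprocessor` are not shown in this notebook. As a rough illustration of what the clean, standardize, tokenize, and filter steps might do, here is a minimal sketch using only the standard library. The `SLANG` and `STOPWORDS` tables and all function bodies are assumptions made for this example, and stemming is left out because it depends on Sastrawi.

```python
import re

# Hypothetical, tiny lookup tables used only for this illustration.
SLANG = {"gw": "saya", "ngerti": "mengerti"}
STOPWORDS = {"ya", "yang", "di", "dan"}


def clean(text: str) -> str:
    """Lowercase and drop URLs, mentions, and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def standardize(text: str) -> str:
    """Replace slang words with their standard form."""
    return " ".join(SLANG.get(word, word) for word in text.split())


def tokenize(text: str) -> list:
    """Split cleaned text on whitespace."""
    return text.split()


def filter_tokens(tokens: list) -> list:
    """Remove stopwords."""
    return [token for token in tokens if token not in STOPWORDS]


raw = "Gw ngerti ya @teman, depresi itu berat! https://t.co/abc"
tokens = filter_tokens(tokenize(standardize(clean(raw))))
print(tokens)
```

In the notebook's loop a stemming step follows the filter, and that step is likely where most of the runtime goes.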

##### 4.2 Feature Extraction

4.2.1 Feature Extraction with Bag of Words (BOW)

In [5]:
from preprocess.feature.extraction import TextVectorizer

In [13]:
extractor = TextVectorizer(texts)
extractor.transform(target="bow", min_df=1)
vector_texts = extractor.vectors

analysis.get_shape(vector_texts)

Shape: (3409, 4949)


In [15]:
extractor.vocabs[1005]

'depresi'

In [14]:
vector_texts[4][1000:1010]
vector_texts[1151][1000:1010]

array([0, 0, 0, 0, 0, 2, 0, 0, 0, 0], dtype=int64)

array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0], dtype=int64)

In [16]:
# with normalization
extractor.transform(target="bow", min_df=1, norm=True)
extractor.vectors[4][1000:1010]
extractor.vectors[1151][1000:1010]

array([0. , 0. , 0. , 0. , 0. , 0.5, 0. , 0. , 0. , 0. ])

array([0.  , 0.  , 0.  , 0.  , 0.  , 0.25, 0.  , 0.  , 0.  , 0.  ])

The BOW vectors that represent the documents (tweets) from here on are the ones that have gone through two additional steps:
- Normalization with "Min-Max" scaling
- Feature reduction by raising the minimum DF (document frequency, the number of documents in which *term* t appears) to 2. In other words, any feature (*term*/vector element) that is nonzero in only 1 (one) vector is removed.
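
The two steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the code behind `TextVectorizer`: it assumes the Min-Max scaling is applied per feature (column), which is consistent with the outputs above where a count of 2 becomes 0.5 and a count of 1 becomes 0.25.

```python
from collections import Counter


def bow_vectors(docs, min_df=1, norm=False):
    """Build count vectors; optionally drop rare terms and min-max scale columns."""
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(term for term, n in df.items() if n >= min_df)
    index = {term: i for i, term in enumerate(vocab)}

    vectors = []
    for doc in docs:
        row = [0] * len(vocab)
        for term in doc:
            if term in index:
                row[index[term]] += 1
        vectors.append(row)

    if norm:
        # Column-wise min-max scaling: (x - min) / (max - min).
        for j in range(len(vocab)):
            column = [row[j] for row in vectors]
            low, high = min(column), max(column)
            span = (high - low) or 1  # avoid division by zero for constant columns
            for row in vectors:
                row[j] = (row[j] - low) / span
    return vocab, vectors


docs = [
    ["depresi", "depresi", "berat"],
    ["depresi", "bipolar"],
    ["bipolar", "depresi", "depresi", "depresi", "depresi"],
    ["tidur"],
]
vocab, vectors = bow_vectors(docs, min_df=2, norm=True)
print(vocab)
print(vectors)
```

In this toy corpus "berat" and "tidur" each occur in a single document, so `min_df=2` removes them, and the remaining indices shift in the same way the index of "depresi" shifts in the notebook.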

In [22]:
# with normalization and a minimum DF of 2
extractor = TextVectorizer(texts)
extractor.transform(target="bow", min_df=2, norm=True)
vector_texts = extractor.vectors

analysis.get_shape(vector_texts)
vector_texts[4][1000:1010]
vector_texts[1151][1000:1010]

Shape: (3409, 2205)


array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [24]:
# feature indices shift after raising the minimum DF to 2
extractor.vocabs[1005]
extractor.vocabs.index("depresi")

'kognitif'

450

In [25]:
vector_texts[4][445:455]
vector_texts[1151][445:455]

array([0. , 0. , 0. , 0. , 0. , 0.5, 0. , 0. , 0. , 0. ])

array([0.  , 0.  , 0.  , 0.  , 0.  , 0.25, 0.  , 0.  , 0.  , 0.  ])

4.2.2 Feature Extraction with Term Frequency-Inverse Document Frequency (TF-IDF)

In [6]:
extractor = TextVectorizer(texts)
extractor.transform(target="tfidf", min_df=1)
vector_texts = extractor.vectors

analysis.get_shape(vector_texts)

Shape: (3409, 4949)


In [21]:
vector_texts[4][1000:1010]
vector_texts[1151][1000:1010]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       2.24489866, 0.        , 0.        , 0.        , 0.        ])

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       1.12244933, 0.        , 0.        , 0.        , 0.        ])

In [22]:
# with normalization
extractor.transform(target="tfidf", min_df=1, norm=True)
extractor.vectors[4][1000:1010]
extractor.vectors[1151][1000:1010]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.11688664, 0.        , 0.        , 0.        , 0.        ])

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.03385865, 0.        , 0.        , 0.        , 0.        ])

The TF-IDF vectors that represent the documents (tweets) from here on are the ones that have gone through two additional steps:
- Normalization with "L2" scaling (the Euclidean norm, i.e. the square root of the sum of squares)
- Feature reduction, done the same way as for the BOW vectors.
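
A plain-Python sketch of TF-IDF weighting with optional L2 normalization is given below. The smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` is an assumption borrowed from scikit-learn's default; whether `TextVectorizer` uses exactly this form is not shown in the notebook, although the unnormalized outputs above (a count of 2 giving twice the weight of a count of 1) are consistent with a raw-count TF multiplied by a per-term IDF.

```python
import math
from collections import Counter


def tfidf_vectors(docs, min_df=1, norm=False):
    """Weight raw term counts by a smoothed IDF; optionally L2-normalize rows."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(term for term, n in df.items() if n >= min_df)
    # Smoothed IDF (assumed form, same as scikit-learn's default).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        if norm:
            # L2 norm: square root of the sum of squared weights.
            length = math.sqrt(sum(x * x for x in row)) or 1.0
            row = [x / length for x in row]
        vectors.append(row)
    return vocab, vectors


docs = [
    ["depresi", "depresi", "berat"],
    ["depresi", "bipolar"],
]
vocab, raw = tfidf_vectors(docs)
_, unit = tfidf_vectors(docs, norm=True)
print(vocab)
print(raw)
print(unit)
```

After L2 normalization every non-empty row has length 1, so long and short tweets become comparable in scale.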

In [38]:
extractor = TextVectorizer(texts)
extractor.transform(target="tfidf", min_df=2, norm=True)
vector_texts = extractor.vectors

analysis.get_shape(vector_texts)

Shape: (3409, 2205)


##### 4.3 Label Transformation

In [23]:
from preprocess.encoding import LabelEncoder

In [39]:
encoder = LabelEncoder(labels)
encoder.transform(target="integer")
encoded_labels = encoder.encoded_labels

analysis.get_distribution(encoded_labels)

Distribution:
	(2, 864)
	(1, 1253)
	(0, 1292)
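
Comparing these counts with the earlier distribution suggests the mapping negative → 0, neutral → 1, positive → 2, which is what an alphabetical ordering of the class names yields. The sketch below reproduces that mapping in plain Python; it is an assumption about how the project's `LabelEncoder` behaves, not its actual code.

```python
labels = ["positive", "neutral", "negative", "neutral"]

# Sort the distinct class names so the mapping is deterministic.
classes = sorted(set(labels))
to_int = {name: i for i, name in enumerate(classes)}
to_name = {i: name for name, i in to_int.items()}

encoded = [to_int[name] for name in labels]
decoded = [to_name[i] for i in encoded]

print(to_int)
print(encoded)
```

Keeping the inverse mapping (`to_name`) around makes it easy to turn predictions back into readable labels later.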


##### 4.4 Data Set Split

In [25]:
from sklearn.model_selection import train_test_split

In [40]:
X_train, X_test, y_train, y_test = train_test_split(
    vector_texts,
    encoded_labels,
    test_size = 0.3,
    random_state = 42
)

print("> Train set:")
analysis.get_shape(X_train)
analysis.get_distribution(y_train)
print("\n> Test set:")
analysis.get_shape(X_test)
analysis.get_distribution(y_test)

> Train set:
Shape: (2386, 2205)
Distribution:
	(1, 866)
	(2, 619)
	(0, 901)

> Test set:
Shape: (1023, 2205)
Distribution:
	(0, 391)
	(1, 387)
	(2, 245)
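
The `train_test_split` call above passes no `stratify` argument, so the class shares in the two subsets are not guaranteed to match. A quick check using the counts printed above (the class meanings in the comment are inferred from section 4.3) could look like this:

```python
# Class counts copied from the output above
# (inferred mapping: 0 = negative, 1 = neutral, 2 = positive).
train_counts = {0: 901, 1: 866, 2: 619}
test_counts = {0: 391, 1: 387, 2: 245}


def shares(counts):
    """Convert absolute class counts to fractions of the whole set."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


train_shares = shares(train_counts)
test_shares = shares(test_counts)

for label in sorted(train_counts):
    gap = abs(train_shares[label] - test_shares[label])
    print(
        f"class {label}: train={train_shares[label]:.3f} "
        f"test={test_shares[label]:.3f} gap={gap:.3f}"
    )
```

If the gaps were considered too large, passing `stratify=encoded_labels` to `train_test_split` would be one option; the notebook does not do this.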


#### __5. SVM Modeling__

Jump to:
- [4. Preprocessing](#4-prapemrosesan)
- [6. Performance Measurement](#6-pengukuran-performa)

In [27]:
from sklearn.svm import SVC

In [257]:
c = 100.0
gamma = 1.0
degree = 3

In [41]:
linear_svm = SVC(kernel="linear")

In [42]:
rbf_svm = SVC(kernel="rbf")

In [43]:
poly_svm = SVC(kernel="poly")

#### __6. Performance Measurement__

Jump to:
- [5. SVM Modeling](#5-pemodelan-svm)
- [6.2 Performance Measurement of SVM + TF-IDF](#62-pengukuran-performa-svm--tf-idf)

In [18]:
from validation.cross import ImbalancedCV

In [31]:
n_fold = 5
scoring = ["accuracy", "precision", "recall", "f1",]
random_state = 42

In [32]:
linear_perform = ImbalancedCV(
    model = linear_svm,
    n_fold = n_fold,
    scoring = scoring,
    scoring_avg = "macro",
    random_state = random_state,
)

linear_perform.validate(X_train, y_train)

CV: 100%|██████████| 5/5 [00:17<00:00,  3.44s/it]


In [33]:
rbf_perform = ImbalancedCV(
    model = rbf_svm,
    n_fold = n_fold,
    scoring = scoring,
    scoring_avg = "macro",
    random_state = random_state,
)

rbf_perform.validate(X_train, y_train)

CV: 100%|██████████| 5/5 [00:17<00:00,  3.57s/it]


In [34]:
poly_perform = ImbalancedCV(
    model = poly_svm,
    n_fold = n_fold,
    scoring = scoring,
    scoring_avg = "macro",
    random_state = random_state,
)

poly_perform.validate(X_train, y_train)

CV: 100%|██████████| 5/5 [00:19<00:00,  3.88s/it]
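
All three runs use `scoring_avg = "macro"`. `ImbalancedCV` is the project's own class and its internals are not shown here, so the following is only a sketch of what macro averaging usually means: precision, recall, and F1 are computed per class and then averaged with equal weight, regardless of class size.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1: unweighted mean over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
scores = macro_scores(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
```

Because every class counts equally, a model that predicts one class well and the others poorly can show high macro precision together with low macro recall and F1.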


***

##### 6.1 Performance Measurement of SVM + BOW

Jump to:
- [6. Performance Measurement](#6-pengukuran-performa)
- [6.2 Performance Measurement of SVM + TF-IDF](#62-pengukuran-performa-svm--tf-idf)

6.1.1 SVM-Linear Performance

In [35]:
linear_perform.get_score() # default

{'mean_accuracy': 0.672,
 'mean_precision': 0.662,
 'mean_recall': 0.657,
 'mean_f1': 0.654}

In [131]:
linear_perform.get_score() # C= 1.0

{'mean_accuracy': 0.692,
 'mean_precision': 0.687,
 'mean_recall': 0.681,
 'mean_f1': 0.678}

6.1.2 SVM-RBF Performance

In [36]:
rbf_perform.get_score() # default

{'mean_accuracy': 0.637,
 'mean_precision': 0.633,
 'mean_recall': 0.606,
 'mean_f1': 0.602}

In [132]:
rbf_perform.get_score() # C= 1.0; gamma= 0.01

{'mean_accuracy': 0.523,
 'mean_precision': 0.596,
 'mean_recall': 0.508,
 'mean_f1': 0.468}

6.1.3 SVM-Polynomial Performance

In [37]:
poly_perform.get_score() # default

{'mean_accuracy': 0.387,
 'mean_precision': 0.761,
 'mean_recall': 0.361,
 'mean_f1': 0.234}

In [133]:
poly_perform.get_score() # C= 1.0; gamma= 0.01; degree= 3

{'mean_accuracy': 0.368,
 'mean_precision': 0.256,
 'mean_recall': 0.335,
 'mean_f1': 0.182}

##### 6.2 Performance Measurement of SVM + TF-IDF

Jump to:
- [5. SVM Modeling](#5-pemodelan-svm)

6.2.1 SVM-Linear Performance

In [48]:
linear_perform.get_score() # default

{'mean_accuracy': 0.696,
 'mean_precision': 0.685,
 'mean_recall': 0.682,
 'mean_f1': 0.683}

In [491]:
linear_perform.get_score() # C= 1.0

{'mean_accuracy': 0.712,
 'mean_precision': 0.701,
 'mean_recall': 0.698,
 'mean_f1': 0.698}

6.2.2 SVM-RBF Performance

In [49]:
rbf_perform.get_score() # default

{'mean_accuracy': 0.7,
 'mean_precision': 0.707,
 'mean_recall': 0.672,
 'mean_f1': 0.674}

In [492]:
rbf_perform.get_score() # C= 1.0; gamma= 0.01

{'mean_accuracy': 0.456,
 'mean_precision': 0.646,
 'mean_recall': 0.496,
 'mean_f1': 0.419}

6.2.3 SVM-Polynomial Performance

In [50]:
poly_perform.get_score() # default

{'mean_accuracy': 0.585,
 'mean_precision': 0.647,
 'mean_recall': 0.55,
 'mean_f1': 0.541}

In [493]:
poly_perform.get_score() # C= 1.0; gamma= 0.01; degree= 3

{'mean_accuracy': 0.427,
 'mean_precision': 0.753,
 'mean_recall': 0.39,
 'mean_f1': 0.291}

Jump to:
- [6.2 Performance Measurement of SVM + TF-IDF](#62-pengukuran-performa-svm--tf-idf)

***

#### __7. Testing__

Jump to:
- [6.2 Performance Measurement of SVM + TF-IDF](#62-pengukuran-performa-svm--tf-idf)
- [Table of Contents](#daftar-isi)

In [140]:
from sklearn.metrics import confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE

In [None]:
smoter = SMOTE(random_state=42)  # fixed seed so the resampling is reproducible
balanced_X_train, balanced_y_train = smoter.fit_resample(
    X_train,
    y_train
)

analysis.get_shape(X_train)
analysis.get_distribution(y_train)
analysis.get_shape(balanced_X_train)
analysis.get_distribution(balanced_y_train)
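
For intuition, SMOTE creates each synthetic sample by picking a minority-class point, choosing one of its nearest minority neighbors, and interpolating between the two. The function below is a simplified plain-Python sketch of that idea, not the imbalanced-learn implementation; the name `smote_like` and its parameters are made up for this example.

```python
import random


def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating toward a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x within the minority class (excluding x itself).
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbor = rng.choice(others[:k])
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=5)
for point in new_points:
    print(point)
```

Every synthetic point lies on the segment between two existing minority points, so no values outside the observed feature range are produced. With its default settings, `SMOTE` in imbalanced-learn oversamples every class except the majority one.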

In [None]:
# print(f"Confusion Matrix:\n {confusion_matrix(y_test, )}\n")
# print(f"Report:\n {classification_report(y_test, )}")
