# **Capstone Project: Sentiment & Topic Analysis of r/CryptoMarkets with IBM Granite**

**Name:** [Gilhan Shaleh Hamidif]
**Program:** Hacktiv8 x IBM SkillsBuild - Student Developer Initiative

This notebook contains the full workflow for the capstone project: loading the data, cleaning it, analyzing it through the IBM Granite API, and saving the final results.

In [28]:
# =================================================================
# STEP 1: SET UP THE ENVIRONMENT
# =================================================================
# Install libraries that may be missing (rarely needed in Colab)
# !pip install requests pandas

# Import the libraries used throughout the notebook up front
import pandas as pd
import requests
import time

# Import 'userdata' to read the API key securely from Colab Secrets
from google.colab import userdata

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


## **Stage 1: Loading and Cleaning the Data**

In this stage we load the raw dataset from a CSV file, run a brief exploratory data analysis (EDA) to understand it, and clean it so it is ready to be processed by the model.

In [51]:
# =================================================================
# STEP 2: LOAD THE DATASET
# =================================================================
# Download the CSV from https://github.com/gilhan94/capstone-crypto-analysis/blob/main/R_cryptomarkets.csv
# and upload it to the Colab "Files" panel
file_path = 'R_cryptomarkets.csv'

try:
    df = pd.read_csv(file_path)
    print(f"✅ Dataset '{file_path}' loaded successfully.")
    print(f"Rows: {len(df)}, Columns: {len(df.columns)}")
except FileNotFoundError:
    print(f"❌ ERROR: File '{file_path}' not found. Make sure it has been uploaded to Colab.")

✅ Dataset 'R_cryptomarkets.csv' loaded successfully.
Rows: 37584, Columns: 27


In [30]:
# =================================================================
# STEP 3: BRIEF EXPLORATORY DATA ANALYSIS (EDA)
# =================================================================
print("--- DataFrame Info ---")
df.info()

print("\n\n--- First 5 Rows ---")
display(df.head())

print("\n\n--- Missing-Value Check ---")
print(df.isnull().sum())

--- DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37584 entries, 0 to 37583
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   submission             37584 non-null  object 
 1   subreddit              37584 non-null  object 
 2   author                 37584 non-null  object 
 3   created                37584 non-null  int64  
 4   retrieved              37584 non-null  int64  
 5   edited                 37584 non-null  int64  
 6   pinned                 37584 non-null  int64  
 7   archived               37584 non-null  int64  
 8   locked                 37584 non-null  int64  
 9   removed                37584 non-null  int64  
 10  deleted                37584 non-null  int64  
 11  is_self                37584 non-null  int64  
 12  is_video               37584 non-null  int64  
 13  is_original_content    37584 non-null  int64  
 14  title                  375

Unnamed: 0,submission,subreddit,author,created,retrieved,edited,pinned,archived,locked,removed,...,score,gilded,total_awards_received,num_comments,num_crossposts,selftext,thumbnail,shortlink,full_text,text_len
0,rt6qtq,cryptomarkets,Aggravating_One8942,1640995371,1641139707,0,0,0,0,1,...,1,0,0,0,0,,default,https://redd.it/rt6qtq,Amp Price Predictions: Where Is the AMP Crypto...,71
1,rt6snq,cryptomarkets,[deleted],1640995522,1641139707,0,0,0,0,1,...,1,0,0,0,0,[removed],default,https://redd.it/rt6snq,Hasbulla Inu $HASB\n[removed],28
2,rt6w7m,cryptomarkets,markbrown0795,1640995838,1641139707,0,0,0,0,1,...,1,0,0,0,0,[removed],default,https://redd.it/rt6w7m,What can I buy with dogecoin?\n[removed],39
3,rt6wtb,cryptomarkets,economicsdesign,1640995888,1641139707,0,0,0,0,0,...,0,0,0,18,0,We provide over 100+ FREE crypto articles on ...,self,https://redd.it/rt6wtb,Economics Of Public Goods\n We provide over 10...,2219
4,rt78ah,cryptomarkets,Aggravating_One8942,1640996907,1641139707,0,0,0,0,1,...,1,0,0,0,0,,default,https://redd.it/rt78ah,"If You See the Pattern, This is the Smart Way ...",63




--- Missing-Value Check ---
submission                   0
subreddit                    0
author                       0
created                      0
retrieved                    0
edited                       0
pinned                       0
archived                     0
locked                       0
removed                      0
deleted                      0
is_self                      0
is_video                     0
is_original_content          0
title                       11
link_flair_text          16002
upvote_ratio                 0
score                        0
gilded                       0
total_awards_received        0
num_comments                 0
num_crossposts               0
selftext                 20609
thumbnail                    0
shortlink                    0
full_text                    0
text_len                     0
dtype: int64


In [31]:
# =================================================================
# STEP 4: PREPROCESSING & CLEANING
# =================================================================
print("Starting data cleaning...")

# 1. Drop rows with a missing title (there are only a few)
df.dropna(subset=['title'], inplace=True)
print("- Rows with a missing title have been dropped.")

# 2. Combine 'title' and 'selftext' into 'full_text'
# (this overwrites the 'full_text' column already present in the CSV)
# .fillna('') turns NaN (missing) selftext values into empty strings
df['full_text'] = df['title'] + ' ' + df['selftext'].fillna('')
print("- Column 'full_text' has been created.")

# 3. Basic text cleaning
df['full_text'] = df['full_text'].str.lower()  # Convert to lowercase
# Remove URLs; the dot in 'www\.' is escaped so it only matches a literal dot
df['full_text'] = df['full_text'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
print("- Text has been lowercased and URLs have been removed.")

# 4. Show a sample of the cleaned data
print("\n--- Sample of Cleaned Data ---")
display(df[['title', 'selftext', 'full_text']].head())

Starting data cleaning...
- Rows with a missing title have been dropped.
- Column 'full_text' has been created.
- Text has been lowercased and URLs have been removed.

--- Sample of Cleaned Data ---


Unnamed: 0,title,selftext,full_text
0,Amp Price Predictions: Where Is the AMP Crypto...,,amp price predictions: where is the amp crypto...
1,Hasbulla Inu $HASB,[removed],hasbulla inu $hasb [removed]
2,What can I buy with dogecoin?,[removed],what can i buy with dogecoin? [removed]
3,Economics Of Public Goods,We provide over 100+ FREE crypto articles on ...,economics of public goods we provide over 100...
4,"If You See the Pattern, This is the Smart Way ...",,"if you see the pattern, this is the smart way ..."
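
The sample above shows that removed posts keep the placeholder `[removed]` in `selftext`, so it ends up in `full_text` and is sent to the model. Below is a minimal sketch of blanking such placeholders before the columns are combined. It assumes `[removed]` and `[deleted]` are the only placeholder values; the helper name `build_full_text` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Assumption: these are the only placeholder strings left in 'selftext'.
PLACEHOLDERS = {"[removed]", "[deleted]"}

def build_full_text(frame: pd.DataFrame) -> pd.Series:
    """Combine title and selftext, ignoring missing bodies and placeholders."""
    body = frame["selftext"].fillna("")
    # Blank out bodies that consist only of a placeholder.
    body = body.where(~body.str.strip().isin(PLACEHOLDERS), "")
    text = (frame["title"] + " " + body).str.lower()
    # Same URL pattern as the cleaning step, with the dot escaped.
    text = text.str.replace(r"http\S+|www\.\S+", "", regex=True)
    return text.str.strip()

demo = pd.DataFrame({
    "title": ["Hasbulla Inu $HASB", "What can I buy with dogecoin?", "Read https://example.com now"],
    "selftext": ["[removed]", None, "Details at www.example.org today"],
})
cleaned = build_full_text(demo)
print(cleaned.tolist())
```

Whether to drop the placeholder or drop the whole row is a judgment call; this sketch keeps the row because the title may still carry a usable signal.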


## **Stage 2: Analysis with IBM Granite**

The data is now clean. Next we set up the call to the IBM Granite API, define the prompts, and run the analysis.

In [41]:
# =================================================================
# STEP 5 (NEW VERSION): INSTALL & SET UP LANGCHAIN
# =================================================================
!pip install replicate langchain_community -q

import os
from google.colab import userdata
from langchain_community.llms import Replicate

# Read the token from Colab Secrets
try:
    api_token = userdata.get('api_token')  # The Colab secret must be stored under this name
    os.environ["REPLICATE_API_TOKEN"] = api_token
    print("✅ REPLICATE_API_TOKEN is set.")
except Exception as e:
    print(f"❌ Failed to read the API key. Make sure the Colab secret is named 'api_token'. Details: {e}")

# Define the LLM we will use
llm_granite = Replicate(
    model="ibm-granite/granite-3.3-8b-instruct",
    model_kwargs={"temperature": 0.1, "max_new_tokens": 50}
)
print("✅ IBM Granite LLM is ready to use via LangChain.")

✅ REPLICATE_API_TOKEN is set.
✅ IBM Granite LLM is ready to use via LangChain.


## **Stage 3: Running the Analysis and Saving the Results**

**WARNING:** Running the analysis on the entire dataset takes a **VERY LONG** time (potentially many hours). Try it on a small sample first.
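
A run of this length makes occasional transient API failures likely. Below is a minimal sketch of a retry wrapper with exponential backoff. It assumes only that the model object exposes `.invoke(prompt)`, as the LangChain LLM in this notebook does; `invoke_with_retry`, its parameters, and the `FlakyLLM` stand-in are hypothetical names used for illustration.

```python
import time

def invoke_with_retry(llm, prompt, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call llm.invoke(prompt), retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return llm.invoke(prompt)
        except Exception:
            if attempt == max_attempts:
                raise  # Give up and surface the last error to the caller.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


class FlakyLLM:
    """Stand-in for a real model: fails twice, then answers (demonstration only)."""
    def __init__(self):
        self.calls = 0

    def invoke(self, prompt):
        self.calls += 1
        if self.calls < 3:
            raise RuntimeError("temporary failure")
        return "Bullish"

delays = []  # Record the requested waits instead of actually sleeping.
flaky = FlakyLLM()
answer = invoke_with_retry(flaky, "demo prompt", sleep=delays.append)
print(answer, delays)
```

Passing `sleep` as a parameter keeps the demonstration instant; in a real run the default `time.sleep` would apply.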

In [40]:
# =================================================================
# STEP 6 (NEW VERSION WITH IMPROVED PROMPTS)
# =================================================================

# --- NEW, STRICTER PROMPTS ---
# The prompts are kept in Indonesian, exactly as they were run. Each asks the model to
# answer with ONE label and no explanation.
# Sentiment labels: Bullish, Bearish, Netral (neutral).
prompt_sentiment = """Analisis teks berikut dan berikan HANYA SATU dari kategori ini: Bullish, Bearish, atau Netral. JANGAN berikan penjelasan.
Teks: "[TEKS_POST]"
Kategori:"""

# Topic labels: Analisis Teknikal (technical analysis), Analisis Fundamental (fundamental
# analysis), Berita Altcoin (altcoin news), Regulasi (regulation), Diskusi Umum (general discussion).
prompt_topic = """Klasifikasikan teks berikut ke dalam HANYA SATU dari kategori ini: Analisis Teknikal, Analisis Fundamental, Berita Altcoin, Regulasi, atau Diskusi Umum. JANGAN berikan penjelasan.
Teks: "[TEKS_POST]"
Kategori:"""
# ------------------------------------

# Take a small sample for testing (the rest of the code below is unchanged)
df_sample = df.head(5).copy()
df_sample['sentiment'] = ''
df_sample['topic'] = ''

print("Starting analysis of 5 sample rows with LangChain (improved prompts)...")
for index, row in df_sample.iterrows():
    text = row['full_text']

    print(f"Analyzing row {index}... ", end='')

    # Insert the post text into each prompt template
    full_prompt_sentiment = prompt_sentiment.replace("[TEKS_POST]", text)
    full_prompt_topic = prompt_topic.replace("[TEKS_POST]", text)

    # Call the model with .invoke()
    sentiment_result = llm_granite.invoke(full_prompt_sentiment)
    df_sample.loc[index, 'sentiment'] = sentiment_result.strip().lower()

    topic_result = llm_granite.invoke(full_prompt_topic)
    df_sample.loc[index, 'topic'] = topic_result.strip().lower()

    print(f"Result: sentiment='{sentiment_result.strip()}', topic='{topic_result.strip()}'")

print("\n--- ✅ Sample Analysis Complete ---")
display(df_sample[['full_text', 'sentiment', 'topic']])

Starting analysis of 5 sample rows with LangChain (improved prompts)...
Analyzing row 0... Result: sentiment='Bullish', topic='Analisis Teknikal'
Analyzing row 1... Result: sentiment='Bullish', topic='Diskusi Umum'
Analyzing row 2... Result: sentiment='Bearish', topic='Diskusi Umum'
Analyzing row 3... Result: sentiment='Bearish', topic='Diskusi Umum'
Analyzing row 4... Result: sentiment='Bullish', topic='Analisis Teknikal'

--- ✅ Sample Analysis Complete ---


Unnamed: 0,full_text,sentiment,topic
0,amp price predictions: where is the amp crypto...,bullish,analisis teknikal
1,hasbulla inu $hasb [removed],bullish,diskusi umum
2,what can i buy with dogecoin? [removed],bearish,diskusi umum
3,economics of public goods we provide over 100...,bearish,diskusi umum
4,"if you see the pattern, this is the smart way ...",bullish,analisis teknikal
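
The loop body above (fill in both prompts, call the model twice, normalize the answers) is repeated in the later cells. One way to factor it out is sketched below. `classify_post` and `EchoLLM` are hypothetical names, the short English templates are placeholders rather than the notebook's prompts, and the only assumption about the model is that it exposes `.invoke(prompt)` and returns a string.

```python
def classify_post(text, llm, sentiment_template, topic_template, placeholder="[TEKS_POST]"):
    """Return (sentiment, topic) for one post by filling both templates and calling the model."""
    sentiment = llm.invoke(sentiment_template.replace(placeholder, text))
    topic = llm.invoke(topic_template.replace(placeholder, text))
    return sentiment.strip().lower(), topic.strip().lower()


class EchoLLM:
    """Stand-in model for a dry run: picks its answer from the start of the prompt."""
    def invoke(self, prompt):
        return " Bullish\n" if prompt.startswith("Sentiment") else " Diskusi Umum "

sentiment, topic = classify_post(
    "btc to the moon",
    EchoLLM(),
    sentiment_template='Sentiment of: "[TEKS_POST]"',
    topic_template='Topic of: "[TEKS_POST]"',
)
print(sentiment, "|", topic)
```

With a helper like this, the sample run and the full run could share one code path and differ only in which rows they iterate over.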


In [46]:
# =================================================================
# STEP 7 (RANDOM-SAMPLE ANALYSIS)
# =================================================================

# --- SET THE SAMPLE SIZE HERE ---
# Change this number as needed. 1000 rows take roughly 30-40 minutes.
jumlah_sampel = 10
# ----------------------------------

# Build a new DataFrame holding a random sample of the main data
# random_state=42 ensures the same sample is drawn every time the code runs
df_sampel = df.sample(n=jumlah_sampel, random_state=42).copy()

print(f"STARTING ANALYSIS OF {jumlah_sampel} RANDOM SAMPLES. Estimated time: {round(jumlah_sampel * 1.5 / 60)} minutes.")

# Make sure the result columns exist in df_sampel
df_sampel['sentiment'] = ''
df_sampel['topic'] = ''

# Main loop, which runs ONLY over the sampled rows
for index, row in df_sampel.iterrows():
    # Print progress every 20 rows
    # (df_sampel.index.get_loc(index) gives the row's position within the sample)
    current_pos = df_sampel.index.get_loc(index) + 1
    if current_pos % 20 == 0:
        print(f"Processing row {current_pos} of {len(df_sampel)}...")

    text = row['full_text']

    # Insert the post text into each prompt template
    full_prompt_sentiment = prompt_sentiment.replace("[TEKS_POST]", text)
    full_prompt_topic = prompt_topic.replace("[TEKS_POST]", text)

    try:
        # Call the model with .invoke()
        sentiment_result = llm_granite.invoke(full_prompt_sentiment)
        df_sampel.loc[index, 'sentiment'] = sentiment_result.strip().lower()

        topic_result = llm_granite.invoke(full_prompt_topic)
        df_sampel.loc[index, 'topic'] = topic_result.strip().lower()

    except Exception as e:
        print(f"Error at row {index}: {e}")
        df_sampel.loc[index, 'sentiment'] = 'ERROR'
        df_sampel.loc[index, 'topic'] = 'ERROR'

    time.sleep(1)

print("✅ SAMPLE ANALYSIS COMPLETE!")
display(df_sampel.head())

# Save the results to a separate CSV file
df_sampel.to_csv('hasil_analisis_sampel.csv', index=False)
print("Sample analysis results saved to 'hasil_analisis_sampel.csv'")

STARTING ANALYSIS OF 10 RANDOM SAMPLES. Estimated time: 0 minutes.
✅ SAMPLE ANALYSIS COMPLETE!


Unnamed: 0,submission,subreddit,author,created,retrieved,edited,pinned,archived,locked,removed,...,total_awards_received,num_comments,num_crossposts,selftext,thumbnail,shortlink,full_text,text_len,sentiment,topic
879,rxsh4d,cryptomarkets,Year-Ecstatic4,1641512497,1641556023,0,0,0,0,0,...,0,4,0,,https://a.thumbs.redditmedia.com/mqOkrJSkHuP0M...,https://redd.it/rxsh4d,putting crowns on the throne in 2022,37,bullish,diskusi umum
29217,wxhil0,cryptomarkets,nayan742,1661442050,1661502559,0,0,0,0,0,...,0,8,0,Walking together instead of walking alone; \...,self,https://redd.it/wxhil0,y5 crypto cex has now listed luna classic (lun...,2137,bullish,diskusi umum
11546,tam7y8,cryptomarkets,13tom13,1646873043,1646918813,0,0,0,0,0,...,0,7,0,I'm moving to crypto.com from coinbase so what...,self,https://redd.it/tam7y8,some other koinly.io questions i'm moving to c...,779,bearish\n\n(note: the categorization is based ...,diskusi umum
21519,v7qivh,cryptomarkets,stanley9528,1654697347,1654751437,0,0,0,0,0,...,0,7,0,,https://b.thumbs.redditmedia.com/1PDzc_C8vOhrl...,https://redd.it/v7qivh,what is a cross-chain bridge and how are bridg...,57,bearish,diskusi umum
20353,v0aszi,cryptomarkets,Openingcrypto,1653829588,1653870703,0,0,0,0,1,...,0,0,0,,default,https://redd.it/v0aszi,crypto news economy 3.0 create million jobs #c...,76,bullish,berita altcoin


Sample analysis results saved to 'hasil_analisis_sampel.csv'
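
One row in the sample output above carries extra text after the label (`bearish\n\n(note: the categorization is based ...`), so the model does not always follow the "no explanation" instruction. Below is a minimal sketch of mapping raw answers onto the allowed labels. `normalize_label` and the `unknown` fallback are hypothetical choices, and the label tuple simply mirrors the lowercase sentiment categories from the prompt.

```python
ALLOWED_SENTIMENTS = ("bullish", "bearish", "netral")

def normalize_label(raw, allowed, fallback="unknown"):
    """Map a raw model answer onto one of the allowed labels, or return the fallback."""
    lines = raw.strip().lower().splitlines()
    first_line = lines[0] if lines else ""
    for label in allowed:
        # Accept answers that start with a known label, e.g. 'netral.'
        if first_line.startswith(label):
            return label
    return fallback

noisy = normalize_label("bearish\n\n(note: the categorization is based ...", ALLOWED_SENTIMENTS)
dotted = normalize_label("  Netral.", ALLOWED_SENTIMENTS)
refused = normalize_label("I cannot classify this text.", ALLOWED_SENTIMENTS)
print(noisy, dotted, refused)
```

The same function could be applied to the topic column by passing the five lowercase topic labels instead.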


In [50]:
# =================================================================
# STEP 8 (FULL-DATASET ANALYSIS)
# =================================================================
import math

# --- Estimate the run time ---
jumlah_total_data = len(df)
detik_per_data = 1.5  # Average seconds per row (including time.sleep)
total_menit = (jumlah_total_data * detik_per_data) / 60
total_jam = total_menit / 60

print(f"STARTING ANALYSIS OF {jumlah_total_data} ROWS.")
print(f"Estimated completion time: {math.ceil(total_menit)} minutes, or about {total_jam:.2f} hours.")
print("=================================================================")

# Make sure the result columns exist in the main df
df['sentiment'] = ''
df['topic'] = ''

# Main loop, which runs over every row in df
for index, row in df.iterrows():
    # Print progress every 100 rows ('index' is the DataFrame row label)
    if index > 0 and index % 100 == 0:
        print(f"--> Processing row {index} of {jumlah_total_data}...")

    text = row['full_text']

    # Insert the post text into each prompt template
    full_prompt_sentiment = prompt_sentiment.replace("[TEKS_POST]", text)
    full_prompt_topic = prompt_topic.replace("[TEKS_POST]", text)

    try:
        # Call the model with .invoke()
        sentiment_result = llm_granite.invoke(full_prompt_sentiment)
        df.loc[index, 'sentiment'] = sentiment_result.strip().lower()

        topic_result = llm_granite.invoke(full_prompt_topic)
        df.loc[index, 'topic'] = topic_result.strip().lower()

    except Exception as e:
        print(f"Error at row {index}: {e}")
        df.loc[index, 'sentiment'] = 'ERROR'
        df.loc[index, 'topic'] = 'ERROR'

    time.sleep(1)

print("\n✅✅✅ FULL ANALYSIS COMPLETE! ✅✅✅")

# Save the final results to a new CSV file
df.to_csv('hasil_analisis_FINAL.csv', index=False)
print("Final analysis results saved to 'hasil_analisis_FINAL.csv'")

display(df.head())

STARTING ANALYSIS OF 37573 ROWS.
Estimated completion time: 940 minutes, or about 15.66 hours.


KeyboardInterrupt: 
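
The full run above was interrupted, and because the cell writes the CSV only after the loop finishes, no results were saved. Below is a minimal sketch of periodic checkpointing with resume. It assumes a per-row function `classify` that returns a `(sentiment, topic)` pair and a frame that does not yet have `sentiment` or `topic` columns; `run_with_checkpoints`, `fake_classify`, and the checkpoint filename are hypothetical.

```python
import os
import tempfile
import pandas as pd

def run_with_checkpoints(frame, classify, checkpoint_path, every=100):
    """Classify rows of `frame`, saving progress to CSV so an interrupted run can resume."""
    results = {}
    if os.path.exists(checkpoint_path):
        # Resume: reload rows finished by an earlier run, keyed by row label.
        saved = pd.read_csv(checkpoint_path, index_col=0, keep_default_na=False)
        results = saved.to_dict("index")

    def save():
        pd.DataFrame.from_dict(results, orient="index").to_csv(checkpoint_path)

    try:
        for n, (idx, text) in enumerate(frame["full_text"].items(), start=1):
            if idx in results:
                continue  # Already processed in a previous run.
            sentiment, topic = classify(text)
            results[idx] = {"sentiment": sentiment, "topic": topic}
            if n % every == 0:
                save()
    finally:
        save()  # Also runs on KeyboardInterrupt, so finished rows are kept.
    return frame.join(pd.DataFrame.from_dict(results, orient="index"))


calls = []

def fake_classify(text):
    """Stand-in for the two model calls (demonstration only)."""
    calls.append(text)
    return "bullish", "diskusi umum"

demo = pd.DataFrame({"full_text": ["a", "b", "c"]}, index=[10, 11, 12])
path = os.path.join(tempfile.mkdtemp(), "checkpoint.csv")
first = run_with_checkpoints(demo.iloc[:2], fake_classify, path, every=1)
second = run_with_checkpoints(demo, fake_classify, path, every=1)
print(len(calls), second["sentiment"].tolist())
```

The second call only classifies the one row the first call had not seen, which is the behavior that would let a multi-hour run continue after an interruption.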

In [47]:
# ===================================================
# DEDICATED CELL FOR TESTING THE API CONNECTION
# ===================================================
import os
from google.colab import userdata
from langchain_community.llms import Replicate

print("Starting a simple connection test...")

try:
    # 1. Try to read the token from Colab Secrets
    token = userdata.get('api_token')
    os.environ["REPLICATE_API_TOKEN"] = token  # Use the token just read, not a variable from another cell
    print("✅ Token found and set in the environment.")

    # 2. Try to initialize the model
    llm_test = Replicate(
        model="ibm-granite/granite-3.3-8b-instruct",
        model_kwargs={"temperature": 0.1, "max_new_tokens": 50}
    )
    print("✅ Model initialized successfully.")

    # 3. Try calling the model with a simple question
    print("\nCalling the model...")
    output = llm_test.invoke("Hello, who are you?")
    print("✅✅✅ SUCCESS! Model response:")
    print(output)

except Exception as e:
    print("\n❌❌❌ FAILED!")
    print(e)

Starting a simple connection test...
✅ Token found and set in the environment.
✅ Model initialized successfully.

Calling the model...
✅✅✅ SUCCESS! Model response:
Hello! I am Granite, an AI assistant developed by IBM, designed to provide information and support. How can I assist you today?
