<a href="https://colab.research.google.com/github/fangkiigopramana/Fangki_CP_DATA_SCI_2025/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🛒 **Materials**

## **Data Cleaning Code**

In [69]:
import pandas as pd
import re
import json
import pprint

# --- CONFIGURATION ---
# Replace with the path to the CSV file
FILE_INPUT = '/content/drive/MyDrive/dataset/debate.csv'
FILE_OUTPUT = '/content/drive/MyDrive/dataset/debate_cleaned.json'

# Speakers to analyze.
SPEAKERS_TO_KEEP = ['Trump', 'Clinton']


def clean_debate_text(text):
    """
    Clean a single transcript utterance.
    """
    text = str(text)

    # 1. Remove non-verbal text in parentheses, such as (APPLAUSE)
    text = re.sub(r'\s*\(.*?\)\s*', ' ', text)

    # 2. Convert to lowercase
    text = text.lower()

    # 3. Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # 4. Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# --- MAIN PROCESS ---

# 1. Load the data from the CSV
try:
    # Read the file with 'latin-1' encoding
    df = pd.read_csv(FILE_INPUT, encoding='latin-1')
except FileNotFoundError:
    print(f"Error: File '{FILE_INPUT}' not found. Make sure the file exists at that path.")
    exit()

# 2. Keep only rows from the selected speakers
df_filtered = df[df['Speaker'].isin(SPEAKERS_TO_KEEP)].copy()


# 3. Apply clean_debate_text to every utterance
df_filtered['text_cleaned'] = df_filtered['Text'].apply(clean_debate_text)


# 4. Drop rows whose text became empty
df_cleaned = df_filtered[df_filtered['text_cleaned'] != ''].copy()

# 5. Convert to a list of dictionaries and save as JSON

# a. Select only the relevant columns
df_final = df_cleaned[['Speaker', 'text_cleaned', 'Date']]

# b. Rename the columns to match the output format
df_final = df_final.rename(columns={
    'Speaker': 'speaker',
    'text_cleaned': 'text',
    'Date': 'date'
})

# c. Convert the DataFrame to a list of dictionaries
output_list = df_final.to_dict('records')

# d. Save the list to a JSON file
with open(FILE_OUTPUT, 'w', encoding='utf-8') as f:
    json.dump(output_list, f, indent=2, ensure_ascii=False)

print(f"\nClean data in JSON format saved to file: '{FILE_OUTPUT}'")


Clean data in JSON format saved to file: '/content/drive/MyDrive/dataset/debate_cleaned.json'
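
To see what `clean_debate_text` does to a single utterance, here is a minimal, self-contained sketch. The function body is restated so the block runs on its own, and the sample sentence is made up for illustration, not taken from the dataset. Note one side effect of stripping punctuation: apostrophes disappear, so "we're" becomes "were".

```python
import re


def clean_debate_text(text):
    """Same cleaning steps as the cell above, restated for a standalone run."""
    text = str(text)
    text = re.sub(r'\s*\(.*?\)\s*', ' ', text)  # drop parenthesized cues like (APPLAUSE)
    text = text.lower()                           # normalize case
    text = re.sub(r'[^\w\s]', '', text)          # strip punctuation
    return re.sub(r'\s+', ' ', text).strip()     # collapse whitespace


# Hypothetical utterance, not a row from debate.csv
sample = "Thank you. (APPLAUSE) We're going to win!"
cleaned = clean_debate_text(sample)
print(cleaned)  # thank you were going to win
```

An utterance that consists only of a parenthesized cue cleans to an empty string, which is why step 4 of the main process drops empty rows.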


## **Dataset (Storing the Transcript Text in a Variable)**

In [75]:
import json


grouped_data_json = []
try:
    with open(FILE_OUTPUT, 'r', encoding='utf-8') as f:
        datas = json.load(f)
        df = pd.DataFrame(datas)

    grouped_by_date = df.groupby('date')

    for date, group in grouped_by_date:
        group_list_of_dicts = group.to_dict(orient='records')

        grouped_data_json.append({
            'date': date,
            'entries': group_list_of_dicts
        })
    datas = grouped_data_json
    print("Data grouped successfully and stored in the variable 'datas'.")
    # print(len(datas[0]['entries']))

except FileNotFoundError:
    print(f"Error: File '{FILE_OUTPUT}' not found. Make sure the file path is correct.")
except json.JSONDecodeError:
    print(f"Error: Failed to decode JSON from file '{FILE_OUTPUT}'. Make sure the JSON is valid.")
except Exception as e:
    print(f"Another error occurred: {e}")

Data grouped successfully and stored in the variable 'datas'.
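
One thing to keep in mind: `groupby('date')` returns groups sorted by key, and if the `date` column holds plain strings the order is lexicographic rather than chronological. The sketch below illustrates the effect with only the standard library. The date strings are hypothetical (an assumed `M/D/YY` format, not confirmed from the CSV), and the variable names are illustrative.

```python
from datetime import datetime

# Hypothetical date strings in an assumed M/D/YY format
dates = ["9/26/16", "10/9/16", "10/19/16"]

# String sorting compares character by character, so "10/..." sorts before "9/..."
lexicographic = sorted(dates)

# Parsing the strings first yields calendar order
chronological = sorted(dates, key=lambda d: datetime.strptime(d, "%m/%d/%y"))

print(lexicographic)  # ['10/19/16', '10/9/16', '9/26/16']
print(chronological)  # ['9/26/16', '10/9/16', '10/19/16']
```

If the dates in this dataset follow that pattern, `datas[0]` would not be the earliest debate. Converting the column with `pd.to_datetime` before grouping is one way to get calendar order.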


# 🛠 **Data Processing**

## **Install Package**

In [None]:
!pip install langchain_community
!pip install replicate



## **Set API Token and Granite Model**

In [91]:
from langchain_community.llms import Replicate
import os
from google.colab import userdata

# Set the API token
api_token = userdata.get('api_token')
os.environ["REPLICATE_API_TOKEN"] = api_token

# Model setup
model = "ibm-granite/granite-3.3-8b-instruct"
output = Replicate(
  model=model,
  replicate_api_token=api_token,
)

## **Process Configuration**

In [None]:
# CONFIGURE PROCESS
parameters = {
 "top_k": 0,
 "top_p": 0.9,
 "max_tokens": 2,
 "min_tokens": 0,
 "random_seed": 90,
 "repetition_penalty": 1.0,
 "stopping_criteria": None,
 "stopping_sequence": None
}

# **Prompting**

## Analytical result

### Percentage of dialogue turns per speaker

Process

In [74]:
import pandas as pd

# --- Function to count dialogue turns ---
def calculate_dialogue_counts(conversation_entries):
    """
    Count the number of dialogue turns for each speaker in a list of conversation entries.
    Consecutive entries by the same speaker count as a single turn.

    Args:
        conversation_entries (list): List of dictionaries, each representing a speaking turn.
                                     Expected format: [{'speaker': 'Name', 'text': '...'}]

    Returns:
        dict: A dictionary where keys are speaker names and values are their dialogue counts.
              Returns an empty dictionary if input is invalid or empty.
    """
    dialogue_counts = {}
    last_speaker = None

    if not isinstance(conversation_entries, list) or not conversation_entries:
        print("Warning: Invalid or empty conversation_entries provided. Returning empty counts.")
        return {}

    for entry in conversation_entries:
        speaker_name = entry.get('speaker')

        if speaker_name is None:
            continue

        if speaker_name != last_speaker:
            dialogue_counts[speaker_name] = dialogue_counts.get(speaker_name, 0) + 1
            last_speaker = speaker_name

    return dialogue_counts

counts_0 = calculate_dialogue_counts(datas[0]['entries'])
counts_1 = calculate_dialogue_counts(datas[1]['entries'])
counts_2 = calculate_dialogue_counts(datas[2]['entries'])

clinton_count = counts_0.get('Clinton', 0) + counts_1.get('Clinton', 0) + counts_2.get('Clinton', 0)
trump_count = counts_0.get('Trump', 0) + counts_1.get('Trump', 0) + counts_2.get('Trump', 0)
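
As a quick illustration of the counting rule, consecutive entries from the same speaker are merged into one turn. The sketch below is an equivalent standalone formulation using `itertools.groupby`, not the function above; `count_turns` and the sample entries are hypothetical.

```python
from collections import Counter
from itertools import groupby


def count_turns(entries):
    """Count speaking turns, treating consecutive entries by one speaker as a single turn."""
    # Skip entries without a speaker, mirroring the `continue` in the original loop
    speakers = [e.get('speaker') for e in entries if e.get('speaker') is not None]
    # groupby yields one key per run of identical consecutive speakers
    return dict(Counter(speaker for speaker, _ in groupby(speakers)))


# Hypothetical entries for illustration
toy = [
    {'speaker': 'Clinton', 'text': 'first part'},
    {'speaker': 'Clinton', 'text': 'second part'},  # same turn as the previous entry
    {'speaker': 'Trump', 'text': 'reply'},
    {'speaker': 'Clinton', 'text': 'rebuttal'},
]
print(count_turns(toy))  # {'Clinton': 2, 'Trump': 1}
```

Because moderator rows were filtered out during cleaning, entries from one candidate that were separated only by a moderator become consecutive and count as a single turn. With two speakers left, the turns then alternate and the two totals can differ by very little.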

Show the output

In [107]:
prompt = f"""
Calculate the percentage contribution of each speaker based on their dialogue counts.
Given the following dialogue counts:
Clinton: {clinton_count} dialogues
Trump: {trump_count} dialogues

Return only the numeric values.

Output Format:
* Clinton: [percentage]
* Trump: [percentage]
"""

response = output.invoke(prompt)

print("Granite Model Response:\n")
print(response)

Granite Model Response:

* Clinton: 50%
* Trump: 50%
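
The same percentages can be computed directly in Python, which gives a deterministic cross-check for the model's answer. A minimal sketch, where `turn_percentages` and the sample counts are hypothetical:

```python
def turn_percentages(counts):
    """Return each speaker's share of the total turns as a percentage, rounded to 2 places."""
    total = sum(counts.values())
    if total == 0:
        # Avoid division by zero when no turns were counted
        return {speaker: 0.0 for speaker in counts}
    return {speaker: round(100 * n / total, 2) for speaker, n in counts.items()}


# Hypothetical counts, not the values from this notebook
shares = turn_percentages({'Clinton': 3, 'Trump': 1})
print(shares)  # {'Clinton': 75.0, 'Trump': 25.0}
```

In the notebook this would be called as `turn_percentages({'Clinton': clinton_count, 'Trump': trump_count})`.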


### Summarize the topic

Process

In [103]:
# Prompting
# Build one "speaker: text" transcript string per debate
data1 = "\n".join(
    f"{entry['speaker']}: {entry['text']}"
    for entry in datas[0]['entries']
)

data2 = "\n".join(
    f"{entry['speaker']}: {entry['text']}"
    for entry in datas[1]['entries']
)

data3 = "\n".join(
    f"{entry['speaker']}: {entry['text']}"
    for entry in datas[2]['entries']
)

prompt1 = f"""
Provide a concise, two-sentence summary of the key arguments and main points raised by each speaker in the following conversation transcript.
Focus on their distinct contributions and perspectives.

---
**Conversation Transcript:**
{data1}
---

**Output Format:**
- **[Name of Speaker 1]:** Summarize their key arguments in 2 concise sentences.
- **[Name of Speaker 2]:** Summarize their key arguments in 2 concise sentences.
"""

prompt2 = f"""
Provide a concise, two-sentence summary of the key arguments and main points raised by each speaker in the following conversation transcript.
Focus on their distinct contributions and perspectives.

---
**Conversation Transcript:**
{data2}
---

**Output Format:**
- **[Name of Speaker 1]:** Summarize their key arguments in 2 concise sentences.
- **[Name of Speaker 2]:** Summarize their key arguments in 2 concise sentences.
"""

prompt3 = f"""
Provide a concise, two-sentence summary of the key arguments and main points raised by each speaker in the following conversation transcript.
Focus on their distinct contributions and perspectives.

---
**Conversation Transcript:**
{data3}
---

**Output Format:**
- **[Name of Speaker 1]:** Summarize their key arguments in 2 concise sentences.
- **[Name of Speaker 2]:** Summarize their key arguments in 2 concise sentences.
"""

parameters = {
 "top_k": 0,
 "top_p": 1.0,
 "max_tokens": 2,
 "min_tokens": 0,
 "random_seed": None,
 "repetition_penalty": 1.0,
 "stopping_criteria": None,
 "stopping_sequence": None
}
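
The three prompts in this cell differ only in the transcript they embed, so they could be produced by one helper. A sketch under that assumption; `SUMMARY_TEMPLATE` and `build_summary_prompt` are hypothetical names and the sample entries are made up:

```python
SUMMARY_TEMPLATE = """
Provide a concise, two-sentence summary of the key arguments and main points raised by each speaker in the following conversation transcript.
Focus on their distinct contributions and perspectives.

---
**Conversation Transcript:**
{transcript}
---

**Output Format:**
- **[Name of Speaker 1]:** Summarize their key arguments in 2 concise sentences.
- **[Name of Speaker 2]:** Summarize their key arguments in 2 concise sentences.
"""


def build_summary_prompt(entries):
    """Join entries into 'speaker: text' lines and embed them in the shared template."""
    transcript = "\n".join(f"{e['speaker']}: {e['text']}" for e in entries)
    return SUMMARY_TEMPLATE.format(transcript=transcript)


# Hypothetical entries standing in for datas[i]['entries']
sample_entries = [
    {'speaker': 'Clinton', 'text': 'hello'},
    {'speaker': 'Trump', 'text': 'hi'},
]
prompt = build_summary_prompt(sample_entries)
print(prompt)
```

With the notebook's data, `prompts = [build_summary_prompt(d['entries']) for d in datas]` would yield the same three prompts without repeating the template.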


Output

In [109]:

# Invoke the model once per debate
response_summarize1 = output.invoke(prompt1, parameters=parameters)
response_summarize2 = output.invoke(prompt2, parameters=parameters)
response_summarize3 = output.invoke(prompt3, parameters=parameters)

# Print the response
print("Granite Model Response:\n")
print('A. Summary of first debate')
print(response_summarize1)
print(' ')
print('B. Summary of second debate')
print(response_summarize2)
print(' ')
print('C. Summary of third debate')
print(response_summarize3)

Granite Model Response:

A. Summary of first debate
**Clinton:** Clinton emphasized the importance of the Supreme Court upholding citizens' rights, including women's rights, LGBT rights, and the affordability of healthcare. She advocated for a Supreme Court that stands against Citizens United, supports Roe v. Wade, and opposes reversing marriage equality. She also highlighted the need for reasonable gun control measures while respecting the Second Amendment.

**Trump:** Trump underscored the necessity of appointing conservative justices who will uphold the Second Amendment and interpret the Constitution as the Founding Fathers intended. He pledged to protect the Second Amendment from being "a very small replica" under his opponent and criticized the Heller decision, which he claimed Clinton disagreed with due to its impact on reasonable regulations. Trump also emphasized the importance of strong borders and securing them, suggesting that he would build a wall and appoint pro-life judge