<a href="https://colab.research.google.com/github/alice410451027/LLM-questiongen-news/blob/main/LLM_questiongen_news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#1. News Vectorization (Embedding)
In this step, the content field of each news article is converted into a vector, which is essentially a list of numbers. These vectors let a program compare articles mathematically: if two articles discuss similar topics, their vectors lie closer together in the embedding space.


> Use cases:


1. News recommendation systems

2. Topic clustering

3. Similar article retrieval

4. Article classification

5. Response generation
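
To make the "closer together" idea concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and their names are hypothetical stand-ins for illustration only; real embedding models return vectors with hundreds of dimensions, and this notebook does not prescribe a particular similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example.
coffee_chain = [0.9, 0.1, 0.0]
coffee_shop = [0.8, 0.2, 0.1]
bank_reform = [0.1, 0.9, 0.3]

sim_related = cosine_similarity(coffee_chain, coffee_shop)
sim_unrelated = cosine_similarity(coffee_chain, bank_reform)

# The two coffee articles score higher than the coffee/bank pair.
print(sim_related > sim_unrelated)
```

A score near 1 means the vectors point in almost the same direction, which is the usual proxy for "similar topic".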

#2. Customized Question Generation (Add Question)
This step uses Gemini (or another large language model) to automatically generate a relevant question for each news article based on its content. The goal is to create questions that help users understand the news through a Q&A approach.

> Use cases:

* Building a news Q&A system

* Creating reading comprehension datasets

* Enabling semantic search by combining with vector data

#3. Possible Project Intentions (Inferred Goals):

* Create a vector database of news content → Let users input a question and retrieve the most semantically relevant articles.

* Enhance retrieval-augmented generation (RAG) systems → Combine generated questions with vector search to support queries like “Which article discusses the cold weather?”

* Enable AI to answer questions about news content → For reading comprehension, interactive news education, or conversational applications.
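
The retrieval goal above can be sketched as a top-k search over stored vectors. This is only an illustration under stated assumptions: the article IDs and vectors are made up, `top_k` is a hypothetical helper, and the query vector stands in for an embedded user question. A production setup would more likely use a vector index such as FAISS, which the install cell pulls in.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the IDs of the k articles most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), article_id)
              for article_id, vec in index]
    scored.sort(reverse=True)
    return [article_id for _, article_id in scored[:k]]


# Hypothetical (article_id, embedding) pairs.
index = [
    ("cold-weather", [0.9, 0.1, 0.0]),
    ("bank-merger", [0.0, 0.2, 0.9]),
    ("heating-costs", [0.7, 0.3, 0.1]),
]

# Stand-in for the embedding of "Which article discusses the cold weather?"
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, index, k=2)
print(results)
```

In a RAG system, the retrieved articles would then be passed to the language model as context for answering the question.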

#4. Project Workflow
* Load a .jsonl file containing 408 news articles

* Vectorize the content field of each article (using Gemini or any embedding API)

* Generate a relevant question for each article's content (using Gemini)

* Add a new question field to the original dataset

* Save the enriched data as a new .jsonl or .csv file
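
The load, enrich, and save steps above can be sketched end to end without any API calls. Everything here is an assumption made for illustration: `generate_question` is a placeholder for the LLM call, the helper names are hypothetical, and the two sample records are invented.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def save_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def generate_question(content):
    # Placeholder: a real pipeline would call an LLM here.
    return f"What is the main point of: {content[:20]}?"


def enrich(records):
    """Return copies of the records with a new 'question' field."""
    return [{**r, "question": generate_question(r["content"])} for r in records]


# Demo on two invented records in a temporary directory.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "news.jsonl")
out_path = os.path.join(tmp_dir, "news_with_questions.jsonl")

save_jsonl([{"title": "A", "content": "First article."},
            {"title": "B", "content": "Second article."}], in_path)

enriched = enrich(load_jsonl(in_path))
save_jsonl(enriched, out_path)
round_trip = load_jsonl(out_path)
print(round_trip[0]["question"])
```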

#1. Install packages

In [2]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

from IPython.display import clear_output
!pip install -q google-generativeai langchain langchain-google-genai langchain_huggingface sentence-transformers faiss-cpu pypdf==4.1.0 -U
!pip install -q langchain-community -U
!pip install --upgrade --quiet "unstructured[pdf]" "unstructured[txt]"
!apt-get install -y poppler-utils
!apt-get install -y tesseract-ocr
!pip install pytesseract
import warnings

warnings.filterwarnings("ignore", message="Convert_system_message_to_human will be deprecated!", category=UserWarning)

clear_output()
print("packages are installed")

packages are installed


#2. Import news file

In [4]:
import pandas as pd
import json
import google.generativeai as genai
from tqdm import tqdm
import time

In [5]:
file_path = "/content/all_ettoday_news.jsonl"

data = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

df = pd.DataFrame(data)
df.head()

Unnamed: 0,publish_date,publish_time,title,keywords,content,source
0,2021-09-17,11:09:00,路易莎「插座經濟學」創19億咖啡王國　逆勢展店穩坐觀光股后,"路易莎咖啡,星巴克,黃銘賢,開箱雲愛美食",路易莎「插座經濟學」創19億咖啡王國逆勢展店穩坐觀光股后 攝影：屠惠剛、謝婷婷 九月十七日，...,https://finance.ettoday.net/news/2081674
1,2021-08-06,07:30:00,第一志業並非金融業！謝長融立志用「變形蟲管理」　5年內將新光銀變「一線銀行」,"新光銀,謝長融,元大銀行,台新銀行",第一志業並非金融業！謝長融立志用「變形蟲管理」5年內將新光銀變「一線銀行」 「其實研究所畢業...,https://finance.ettoday.net/news/2048694
2,2021-06-13,09:26:00,逆轉女王！法務女將賣能量衣　公司電話織進去「不怕你打」,"能量襪,能量內衣,京美,呂麗美",逆轉女王！法務女將賣能量衣公司電話織進去「不怕你打」 「個性使然，我看不到環境的困難，只聚焦...,https://finance.ettoday.net/news/1995195
3,2021-06-09,09:47:00,「評議」公親難當！接任中心董座9個月　林志潔：不讓消費者二度受傷是堅持,"金融消費評議中心,林志潔,金管會,黃天牧",「評議」公親難當！接任中心董座9個月林志潔：不讓消費者二度受傷是堅持 陽明交通大學科法所的特...,https://finance.ettoday.net/news/2001540
4,2021-05-31,21:35:00,獨／味全轉虧為盈「四大心法拚認同！」　董座陳宏裕打造台灣潮味道,,獨／味全轉虧為盈「四大心法拚認同！」董座陳宏裕打造台灣潮味道 當你身為市場中常年的第一名，卻...,https://finance.ettoday.net/news/1995726


#Test: Gemini (Success)
1. batch: 15
2. delay_time: 30 sec
3. temperature: 0.7
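
As a rough check on what these settings imply, the arithmetic below estimates the minimum time the run spends sleeping. It assumes the 408 articles reported when the dataset is loaded and the 1.2-second pause the generation loop applies after each request; time spent waiting on the API itself is not included.

```python
import math

n_articles = 408          # articles in the dataset
batch_size = 15           # articles per batch
delay_time = 30           # seconds to wait between batches
per_request_sleep = 1.2   # seconds to wait after each request

# Ceiling division: the last batch may be smaller than batch_size.
total_batches = math.ceil(n_articles / batch_size)

# There is no wait after the final batch.
batch_waits = (total_batches - 1) * delay_time
request_waits = n_articles * per_request_sleep
min_sleep_seconds = batch_waits + request_waits

print(total_batches, round(min_sleep_seconds / 60, 1))
```

With these values the run needs 28 batches and spends about 21.7 minutes sleeping, so the full job takes at least that long.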

In [6]:
import json
import pandas as pd
import time
import google.generativeai as genai
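import os

# Set up the Gemini API.
# Read the key from an environment variable instead of hardcoding it in the notebook.
# The variable name GOOGLE_API_KEY is an assumption; use whatever name you set.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel(model_name="gemini-1.5-flash")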

file_path = "/content/all_ettoday_news.jsonl"
data = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

df = pd.DataFrame(data)

print(f"Total: {len(df)} articles loaded")
print(df.columns)

batch_size = 15
delay_time = 30
total_batches = (len(df) + batch_size - 1) // batch_size

all_questions = []  # List to hold all the questions
all_records = []  # List to hold all the records (for JSONL)

for batch_index in range(total_batches):
    start = batch_index * batch_size
    end = min(start + batch_size, len(df))
    print(f"▶️ Processing batch {batch_index + 1}: articles {start} to {end - 1}")

    batch_df = df.iloc[start:end].copy()
    questions = []

    for i, row in batch_df.iterrows():
        title = row['title']
        content = row['content']

        prompt = f"""
                You are a professional question generator. Based on the following news "title and content", generate ONE question that is highly relevant, constructive, and specific.
                The question must meet the following criteria:
                1. It can only be answered based on the news content. Do not infer anything beyond the article.
                2. It should not be a yes/no, vague, or overly broad question.
                3. It should be constructive — encouraging the reader to understand the article's key points, impacts, causes, context, or consequences.
                Please generate the question **in Traditional Chinese**, without any extra explanation or commentary.
                Title: {title}
                Content: {content}
                Question:
                """
        try:
            response = model.generate_content(prompt, generation_config={"temperature": 0.7})
            question = response.text.strip()
        except Exception as e:
            print(f"Error at index {i}: {e}")
            question = ""
        questions.append(question)
        time.sleep(1.2)

    batch_df['question'] = questions

    # Append to the final lists for later merging
    all_questions.extend(questions)
    all_records.extend(batch_df.to_dict(orient='records'))

    print(f"✅ Batch {batch_index + 1} processed.")

    if batch_index < total_batches - 1:
        print(f"⏳ Waiting {delay_time} seconds to avoid API overload...")
        time.sleep(delay_time)

# After all batches are processed, merge and save the results
# Merge CSV
merged_df = pd.DataFrame(all_records)
merged_csv_path = "/content/news_with_questions_Gemini(0.7).csv"
merged_df.to_csv(merged_csv_path, index=False, encoding='utf-8-sig')
print(f"✅ All CSV batches merged and saved as: {merged_csv_path}")

# Merge JSONL
merged_jsonl_path = "/content/news_with_questions_Gemini(0.7).jsonl"
with open(merged_jsonl_path, 'w', encoding='utf-8') as outfile:
    for record in all_records:
        outfile.write(json.dumps(record, ensure_ascii=False) + "\n")
print(f"✅ All JSONL batches merged and saved as: {merged_jsonl_path}")

print("🎉 All question generation completed!")


Total: 408 articles loaded
Index(['publish_date', 'publish_time', 'title', 'keywords', 'content',
       'source'],
      dtype='object')
▶️ Processing batch 1: articles 0 to 14
✅ Batch 1 processed.
⏳ Waiting 30 seconds to avoid API overload...
▶️ Processing batch 2: articles 15 to 29
✅ Batch 2 processed.
⏳ Waiting 30 seconds to avoid API overload...
▶️ Processing batch 3: articles 30 to 44
✅ Batch 3 processed.
⏳ Waiting 30 seconds to avoid API overload...
▶️ Processing batch 4: articles 45 to 59
✅ Batch 4 processed.
⏳ Waiting 30 seconds to avoid API overload...
▶️ Processing batch 5: articles 60 to 74
✅ Batch 5 processed.
⏳ Waiting 30 seconds to avoid API overload...
▶️ Processing batch 6: articles 75 to 89
✅ Batch 6 processed.
⏳ Waiting 30 seconds to avoid API overload...
▶️ Processing batch 7: articles 90 to 104
✅ Batch 7 processed.
⏳ Waiting 30 seconds to avoid API overload...
▶️ Processing batch 8: articles 105 to 119
✅ Batch 8 processed.
⏳ Waiting 30 seconds to avoid API overload
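
The loop above records an empty question whenever a request fails. One alternative, sketched here as an assumption rather than as part of the original run, is to retry a failed call a few times with exponentially growing waits. `call_with_retry` and `flaky` are hypothetical names; `flaky` simulates an API call that fails twice before succeeding.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt, then try again.

    The last failure is re-raised so the caller can decide how to handle it.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Simulated flaky call: fails on the first two attempts.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


# Record the waits instead of actually sleeping, to keep the demo fast.
waits = []
result = call_with_retry(flaky, sleep=waits.append)
print(result, waits)
```

In the notebook, `fn` could be a lambda wrapping `model.generate_content(prompt, generation_config={"temperature": 0.7})`, with the existing `except` branch kept as the fallback once all attempts fail.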

#Test: Groq (Success)

In [7]:
pip install groq

Collecting groq
  Downloading groq-0.22.0-py3-none-any.whl.metadata (15 kB)
Downloading groq-0.22.0-py3-none-any.whl (126 kB)
Installing collected packages: groq
Successfully installed groq-0.22.0


In [11]:
# ✅ Install SDK (only needed once)
!pip install -q groq

# ✅ Import Groq SDK
import groq
import getpass
import pandas as pd
import json

# Initialize the Groq API client.
# Prompt for the key at runtime instead of hardcoding it in the notebook.
api_key = getpass.getpass("Enter your Groq API key: ")
client = groq.Groq(api_key=api_key)

# Set the model name
model = "meta-llama/llama-4-scout-17b-16e-instruct"

# System prompt
system_prompt = '''
You are a senior news editor with strong background knowledge in finance and politics. You are also an excellent creator of news perspectives, capable of summarizing complex concepts in an easy-to-understand way.

Your goal is to explain the following news content concisely and accurately to help others understand the news event. Please provide a summary for the article below and include the Source (with a link and citation) in your answer:
'''

# Load all article data (simulate reading all in memory, no batches)
# Here we assume there is a list of news data, either from CSVs, databases, or any source
# Let's say `news_articles` is a list of dicts like:
# [{'title': ..., 'content': ..., 'publish_date': ..., 'keywords': ..., 'source': ...}, ...]

# For demo purposes, you may load from a single file if needed
# news_articles = pd.read_csv("your_data.csv").to_dict(orient='records')
# Simulated:
news_articles = [
    {
        'title': '中東媒體看大選：台灣盼更高國際地位與政治透明',
        'content': '中央社記者施婉清開羅13日專電中東的半島電視台派員在台灣連線，現場直擊台灣總統大選投開票結果...',
        'publish_date': '2024-01-14 10:52:00',
        'keywords': '2024總統,半島電視台',
        'source': 'https://www.ettoday.net/news/20240114/2663948.htm'
    }
]

# Create a list to collect processed results
results = []

# Process each article
for idx, article in enumerate(news_articles):
    print(f"Processing article {idx + 1}/{len(news_articles)}...")

    # Construct user prompt
    user_prompt = f'''Title: {article['title']}

Content: {article['content']}

News publish date: {article['publish_date']}

Keywords: {article['keywords']}

Source: {article['source']}
'''

    # Send request to Groq API
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.7,
        max_tokens=500
    )

    answer = response.choices[0].message.content.strip()

    # Append result
    results.append({
        "title": article["title"],
        "summary": answer,
        "source": article["source"],
        "publish_date": article["publish_date"],
        "keywords": article["keywords"]
    })

print("✅ All articles processed. Merging results...")

# Save to CSV
csv_path = "/content/news_with_questions_Groq(0.7).csv"
pd.DataFrame(results).to_csv(csv_path, index=False, encoding='utf-8-sig')
print(f"✅ All results saved as CSV: {csv_path}")

# Save to JSONL
jsonl_path = "/content/news_with_questions_Groq(0.7).jsonl"
with open(jsonl_path, 'w', encoding='utf-8') as f:
    for item in results:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
print(f"✅ All results saved as JSONL: {jsonl_path}")


Processing article 1/1...
✅ All articles processed. Merging results...
✅ All results saved as CSV: /content/news_with_questions_Groq(0.7).csv
✅ All results saved as JSONL: /content/news_with_questions_Groq(0.7).jsonl
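
The Groq cell runs on one simulated article. To feed it rows from the loaded dataset instead, each row needs the same dict shape. The sketch below is one possible mapping, not the author's method: `to_article` is a hypothetical helper, the sample row is invented, and it assumes the dataset's separate `publish_date` and `publish_time` columns should be joined into the single timestamp the prompt uses.

```python
def to_article(row):
    """Map one dataset row to the dict shape used by the summarization loop."""
    date = row.get("publish_date", "")
    time_of_day = row.get("publish_time", "")
    return {
        "title": row.get("title", ""),
        "content": row.get("content", ""),
        # Join date and time; strip() handles a missing half.
        "publish_date": f"{date} {time_of_day}".strip(),
        # Some rows have no keywords; fall back to an empty string.
        "keywords": row.get("keywords") or "",
        "source": row.get("source", ""),
    }


# Invented sample row with missing keywords.
row = {
    "publish_date": "2021-05-31",
    "publish_time": "21:35:00",
    "title": "Sample title",
    "keywords": None,
    "content": "Sample content",
    "source": "https://example.com/news/1",
}

article = to_article(row)
print(article["publish_date"])
```

With the DataFrame from the import step, `news_articles = [to_article(r) for r in df.to_dict(orient='records')]` would produce the list the loop expects.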
