<a href="https://colab.research.google.com/github/Walla17x/Amazon-word-/blob/main/amazon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Project: What's Up, Docs? - Advanced LLM Summarization Pipeline with GPU + Batching

# Step 1: Install Required Libraries
# Run in Google Colab
!pip install transformers datasets rouge-score

# Step 2: Upload Data
# Use the left panel in Colab to upload `train.csv` and `test_features.csv`

# Step 3: Load Data
import pandas as pd
train_df = pd.read_csv('/content/train.csv')
test_df = pd.read_csv('/content/test_features.csv')

# Step 4: Load Summarization Model on GPU
from transformers import pipeline
import torch

device = 0 if torch.cuda.is_available() else -1
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)

# Step 5: Define Summarization Function
def summarize_texts(texts, batch_size=8):
    # truncation=True clips each input at the model's token limit;
    # slicing the string (t[:1024]) would cut characters, not tokens
    summaries = summarizer(texts, max_length=130, min_length=30, do_sample=False,
                           truncation=True, batch_size=batch_size)
    return [s['summary_text'] for s in summaries]

# Step 6: Batch Process Summaries
submission = []
batch_size = 8

for i in range(0, len(test_df), batch_size):
    batch = test_df.iloc[i:i+batch_size]
    paper_ids = batch['paper_id'].tolist()
    texts = batch['text'].tolist()
    try:
        summaries = summarize_texts(texts)
    except Exception as e:
        print(f"Batch starting at row {i} failed: {e}")
        # Keep one (empty) row per paper so the submission stays complete
        summaries = [""] * len(paper_ids)
    for pid, summary in zip(paper_ids, summaries):
        submission.append({'paper_id': pid, 'summary': summary})

    # i advances in steps of batch_size (8), so this fires only at multiples of 200
    if i % 100 == 0:
        print(f"Processed {i}/{len(test_df)} papers...")

# Step 7: Save Submission File
submission_df = pd.DataFrame(submission)
submission_df.to_csv("submission.csv", index=False)

# Step 8: Download the Submission File (Colab only)
from google.colab import files
files.download("submission.csv")
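
The batching loop in Step 6 can also be factored into a small helper, which makes it easier to test without a GPU. The following is a minimal sketch, not the notebook's own code: `iter_batches` and `summarize_all` are hypothetical names, and `summarize_fn` stands in for any function with the same shape as `summarize_texts` (list of texts in, list of summaries out). It assumes that a failed batch should still produce one row per paper, with an empty summary.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def summarize_all(
    paper_ids: List[str],
    texts: List[str],
    summarize_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[Dict[str, str]]:
    """Return one {'paper_id', 'summary'} row per paper, even if a batch fails."""
    rows: List[Dict[str, str]] = []
    pairs = list(zip(paper_ids, texts))
    for batch in iter_batches(pairs, batch_size):
        ids = [pid for pid, _ in batch]
        try:
            summaries = summarize_fn([text for _, text in batch])
        except Exception as exc:
            print(f"Batch starting at {ids[0]} failed: {exc}")
            # Assumption: an empty placeholder is acceptable for failed papers
            summaries = [""] * len(ids)
        rows.extend({"paper_id": pid, "summary": s} for pid, s in zip(ids, summaries))
    return rows
```

In the notebook this could be called as `summarize_all(test_df['paper_id'].tolist(), test_df['text'].tolist(), summarize_texts)`, and the result passed to `pd.DataFrame` as in Step 7.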


Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Device set to use cuda:0


Processed 0/345 papers...


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
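
The warning above appears when the pipeline is called many times in a row on a GPU. One way to address it, sketched here under assumptions, is to make a single pipeline call over a generator of texts and let the pipeline group inputs itself via `batch_size`. `stream_summaries` is a hypothetical helper name; the shape of each yielded item (a dict, or a one-element list of dicts) is an assumption that may vary across `transformers` versions, so the code accepts both.

```python
from typing import Callable, Iterable, Iterator, List


def stream_summaries(summarizer: Callable, texts: Iterable[str], batch_size: int = 8) -> List[str]:
    """Feed all texts to `summarizer` in one call and collect the summaries."""

    def text_stream() -> Iterator[str]:
        for text in texts:
            yield text

    results = summarizer(
        text_stream(),
        batch_size=batch_size,  # inputs grouped per forward pass
        truncation=True,        # clip over-long inputs at the tokenizer
        max_length=130,
        min_length=30,
        do_sample=False,
    )
    summaries = []
    for item in results:
        # Assumption: each item is a dict or a one-element list of dicts
        record = item[0] if isinstance(item, list) else item
        summaries.append(record["summary_text"])
    return summaries
```

With the notebook's objects this might be called as `stream_summaries(summarizer, test_df['text'].tolist())`. The trade-off is that a single failing input is no longer isolated to one batch by a `try`/`except`.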


Processed 200/345 papers...
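
Only two progress lines appear because the loop index advances in steps of 8 while the check is `i % 100 == 0`, which holds only at multiples of 200. The sketch below reproduces that arithmetic and shows one possible alternative that reports every N batches instead; both function names are hypothetical.

```python
from typing import List


def progress_rows(total: int, batch_size: int, every: int = 100) -> List[int]:
    """Row offsets at which `i % every == 0` is true in the batch loop."""
    return [i for i in range(0, total, batch_size) if i % every == 0]


def progress_rows_by_batch(total: int, batch_size: int, every_batches: int = 10) -> List[int]:
    """Row offsets reported when counting batches rather than rows."""
    return [
        i
        for batch_idx, i in enumerate(range(0, total, batch_size))
        if batch_idx % every_batches == 0
    ]


print(progress_rows(345, 8))           # [0, 200]
print(progress_rows_by_batch(345, 8))  # [0, 80, 160, 240, 320]
```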

