In [1]:
%pip install torch numpy transformers datasets huggingface_hub requests beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


### Importing the Required Libraries

A Hugging Face access token is required.

[Hugging Face](https://huggingface.co/)

Use the link above to sign up and create an access token free of charge.


In [2]:
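import os

import torch
from datasets import load_dataset
from huggingface_hub import login
from transformers import BartForConditionalGeneration, BartTokenizer

# Read the access token from the environment instead of hardcoding it in the notebook.
# The HF_TOKEN environment variable must be set before this cell runs.
huggingfacekey = os.environ["HF_TOKEN"]
login(token=huggingfacekey)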



The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to C:\Users\ajukh\.cache\huggingface\token
Login successful


## Random Index Generation and Dataset Loading Example

### Random Index Generation:
- Uses np.random.randint() to generate random integers.
- Generates n = 10 random integers from 0 to 29999 (the upper bound 30000 is exclusive).

### CNN/DailyMail Dataset:
- Loaded using load_dataset("cnn_dailymail", "3.0.0").
- Contains news articles and their reference summaries (the highlights field), used for NLP tasks such as summarization.


In [3]:
import numpy as np
n = 10
randindex = np.random.randint(30000, size=n)  # n random indices in [0, 30000)
print(randindex)
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)

[11553 22988  2665 17719 15816 16634 21188 27230 16303  4220]


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Generating train split: 100%|██████████| 287113/287113 [00:04<00:00, 65356.66 examples/s]
Generating validation split: 100%|██████████| 13368/13368 [00:00<00:00, 65539.22 examples/s]
Generating test split: 100%|██████████| 11490/11490 [00:00<00:00, 66540.86 examples/s]


DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})
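
np.random.randint() samples with replacement, so the ten indices can occasionally contain duplicates, and they change on every run. The following is a minimal sketch of one alternative, assuming distinct and reproducible indices are wanted. The seed value 42 and the names rng, pool_size, and unique_index are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Assumption: a fixed seed (42 is arbitrary) so the same indices are drawn on every run.
rng = np.random.default_rng(42)

n = 10
pool_size = 30000  # same exclusive upper bound as in the cell above

# replace=False guarantees n distinct indices in [0, pool_size)
unique_index = rng.choice(pool_size, size=n, replace=False)
print(unique_index)
```

The resulting array can be used in place of randindex when selecting articles.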


##### Selecting Random Articles and Loading the BART Model

The random indices generated above are used to select the corresponding articles from the training split of the dataset.

The BART (Bidirectional and Auto-Regressive Transformers) model is loaded from the Hugging Face Hub.
This model is used for text summarization.

##### Hugging Face Pretrained Models:

The "facebook/bart-large-cnn" model is a BART checkpoint fine-tuned on the CNN/DailyMail dataset and is used here for summarizing news articles.

In [4]:
articles = dataset['train'][randindex]['article']
# Load the BART model and tokenizer
model_name = "facebook/bart-large-cnn"

tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)



**Purpose:** The code below defines a function summarize_text() that takes a text input and generates a summary using the pre-trained BART model loaded above.

**Inputs:** The function accepts a text string and tokenizes it into the tensor format the model expects. The input is truncated to a maximum of 1024 tokens, the longest sequence this model accepts, so any text beyond that point is not summarized.

**Model Summarization:** It uses the model's generate() method with beam search (num_beams=4), a summary length bounded by min_length=40 and max_length=150 tokens, and length_penalty=2.0, which favors longer candidates when beams are scored. With early_stopping=True, the search ends as soon as num_beams complete candidate summaries have been found.

**Output:** The function decodes the generated token IDs into a readable text summary and returns it, excluding any special tokens.

In [20]:
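# Function to summarize text
def summarize_text(text):
    # Tokenize the input, truncating it to the model's 1024-token limit
    inputs = tokenizer(text, max_length=1024, return_tensors="pt", truncation=True)
    # Beam search over the encoded input; the attention mask marks the real (non-padding) tokens
    summary_ids = model.generate(
        inputs["input_ids"], attention_mask=inputs["attention_mask"],
        max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True,
    )
    # Decode the best beam to text, dropping special tokens
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary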
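
Because the input is truncated at 1024 tokens, a long document is only summarized from its opening portion. One possible workaround, sketched here under stated assumptions, is to split the text into chunks, summarize each chunk, and join the partial summaries. The helper names split_into_chunks and summarize_long_text are hypothetical, and the 600-word default is only a rough stand-in for the token limit, since word counts and token counts differ. A trivial stand-in summarizer is used so the sketch runs without the model.

```python
def split_into_chunks(text, max_words=600):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long_text(text, summarize_fn, max_words=600):
    """Apply summarize_fn to each chunk and join the partial summaries with spaces."""
    chunks = split_into_chunks(text, max_words)
    return " ".join(summarize_fn(chunk) for chunk in chunks)


# Stand-in summarizer for demonstration only: keeps the first three words of a chunk.
def first_three_words(chunk):
    return " ".join(chunk.split()[:3])


demo_text = " ".join(f"w{i}" for i in range(10))  # "w0 w1 ... w9"
demo_chunks = split_into_chunks(demo_text, max_words=4)
demo_summary = summarize_long_text(demo_text, first_three_words, max_words=4)
print(demo_chunks)
print(demo_summary)
```

In the notebook, summarize_text could be passed as summarize_fn. Summaries produced this way are simply concatenated, so they may repeat information across chunks.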

**Purpose:** The loop over the dataset articles (left commented out in the final cell) applies the summarize_text() function to each article and prints both the original article and its summary.

**Iteration over Articles:** The code iterates through the articles list using enumerate(), which yields both the index (i) and the article text.

**Summarization Process:** For each article, summarize_text(article) is called to generate a summary with the pre-trained BART model.

**Output:** For each article, the code prints:
- The original article (Original Article {i+1}).
- The generated summary (Summary {i+1}).
- A separator line ("-" * 80) for readability between articles.

As written, the break statement ends the loop after the first article.

The cells below summarize a web page instead: scraping() downloads the page at url and returns the text of its paragraph elements, which is then passed to summarize_text().

In [21]:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Fine-tuning_(deep_learning)"

def scraping(url):
    # Define browser-like headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    }

    try:
        # Sending request to get the data from the webpage
        scraped_data = requests.get(url, headers=headers, timeout=10)
        scraped_data.raise_for_status()  # Raise an HTTPError for 4xx/5xx responses

        # Parse the content with BeautifulSoup
        soup = BeautifulSoup(scraped_data.content, 'html.parser')

        # Find all paragraphs and extract text
        paragraphs = soup.find_all('p')
        article_context = ""

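        for p in paragraphs:
            # get_text(strip=True) would glue words together around inline tags such as links,
            # so take the full text first and strip only the paragraph's outer whitespace
            article_context += p.get_text().strip() + " "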

        return article_context.strip()  # Return the article content, ensuring no leading/trailing whitespace

    except requests.exceptions.RequestException as e:
        # Handle any exception raised by requests
        print(f"An error occurred: {e}")
        return None
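
Paragraph text scraped from Wikipedia usually still contains bracketed citation markers such as [1], which add noise to the summarizer input. The following is a minimal sketch of an optional cleanup step, assuming only numeric markers need to be removed. The function name clean_scraped_text and the sample string are made up for illustration.

```python
import re


def clean_scraped_text(text):
    """Remove bracketed numeric citation markers and collapse runs of whitespace."""
    without_markers = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s+", " ", without_markers).strip()


# Made-up sample imitating scraped paragraph text
sample = (
    "Fine-tuning is a transfer learning approach.[1]  It can be applied to "
    "the whole network[2][3] or to a subset of layers."
)
cleaned = clean_scraped_text(sample)
print(cleaned)
```

The cleaned string could be passed to summarize_text() in place of the raw scraping() result.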

In [23]:
# for i , article in enumerate(articles):
#     summary = summarize_text(article)
#     print(f"Original Article {i+1}:\n{article}\n")
#     print(f"Summary {i+1}:\n{summary}\n")
#     print("-" * 80)
#     break
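article = scraping(url)

def WriteFile(final_summary):
    # Append the summary to summary.txt, creating the file if it does not exist
    with open('./summary.txt', mode='a', encoding='utf-8') as file:
        file.write(f"\n\n\nThe LLM (BART) summarized text is as follows:\n{final_summary}")

# scraping() returns None on a request error and an empty string if no paragraphs were found
if article:
    summary = summarize_text(article)
    WriteFile(summary)
else:
    print("No article text was retrieved; nothing to summarize.")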