<a href="https://colab.research.google.com/github/fahimku2020/fahimku2020/blob/main/Beast_bart_extractive_summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install wikipedia
!pip install transformers
!pip install sentencepiece



Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11679 sha256=7d94f1ef3878f1874eafe9729f777ee1de4d70274934254480355aa7a5856b1f
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [6]:
import requests
import wikipediaapi
import nltk
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BartForConditionalGeneration, BartTokenizer, BertTokenizer, BertModel
import torch
import torch.nn.functional as F

# Download necessary NLTK resources
nltk.download('punkt_tab')

class BartSummarizer:
    def __init__(self, num_clusters=5):
        # Initialize Wikipedia API
        self.wiki_wiki = wikipediaapi.Wikipedia(language='en',
            extract_format=wikipediaapi.ExtractFormat.WIKI,
            user_agent='MySummarizationBot/1.0 (https://mywebsite.com; myemail@example.com)' # Replace with your own details
        )

        # Initialize BERT embedding model
        self.bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.bert_model = BertModel.from_pretrained('bert-base-uncased')

        # Initialize BART summarization model
        self.bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
        self.bart_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

        # Clustering parameters
        self.num_clusters = num_clusters

    def fetch_wikipedia_document(self, topic):
        """Fetch Wikipedia document for a given topic."""
        page = self.wiki_wiki.page(topic)

        if not page.exists():
            raise ValueError(f"No Wikipedia page found for topic: {topic}")

        return page.text

    def split_into_sentences(self, text):
        """Split text into sentences."""
        return nltk.sent_tokenize(text)

    def get_bert_embeddings(self, sentences):
        """Generate BERT embeddings for sentences."""
        embeddings = []

        for sentence in sentences:
            # Tokenize and encode the sentence
            inputs = self.bert_tokenizer(sentence, return_tensors='pt',
                                         max_length=512, truncation=True, padding=True)

            # Get BERT embeddings
            with torch.no_grad():
                outputs = self.bert_model(**inputs)
                # Use the [CLS] token embedding (first token)
                embedding = outputs.last_hidden_state[:, 0, :].numpy()
                embeddings.append(embedding.flatten())

        return np.array(embeddings)

    def cluster_sentences(self, embeddings):
        """Cluster sentences using KMeans."""
        kmeans = KMeans(n_clusters=self.num_clusters, random_state=42)
        return kmeans.fit_predict(embeddings)

    def generate_cluster_summaries(self, sentences, cluster_labels):
        """Generate summaries for each cluster using BART."""
        cluster_summaries = {}

        for cluster in range(self.num_clusters):
            # Get sentences in this cluster
            cluster_sentences = [
                sentences[i] for i in range(len(sentences)) if cluster_labels[i] == cluster
            ]

            # Combine cluster sentences
            cluster_text = " ".join(cluster_sentences)

            # Generate summary
            inputs = self.bart_tokenizer(
                cluster_text,
                max_length=1024,
                return_tensors='pt',
                truncation=True
            )

            summary_ids = self.bart_model.generate(
                inputs['input_ids'],
                num_beams=4,
                max_length=150,
                early_stopping=True
            )

            summary = self.bart_tokenizer.decode(
                summary_ids[0],
                skip_special_tokens=True
            )

            cluster_summaries[cluster] = {
                'original_sentences': cluster_sentences,
                'summary': summary
            }

        return cluster_summaries

    def summarize(self, topic):
        """Main method to summarize a Wikipedia topic."""
        # Fetch document
        document = self.fetch_wikipedia_document(topic)

        # Split into sentences
        sentences = self.split_into_sentences(document)

        # Get sentence embeddings
        embeddings = self.get_bert_embeddings(sentences)

        # Cluster sentences
        cluster_labels = self.cluster_sentences(embeddings)

        # Generate summaries for each cluster
        cluster_summaries = self.generate_cluster_summaries(sentences, cluster_labels)

        return cluster_summaries

# Example usage
if __name__ == "__main__":
    summarizer = BartSummarizer(num_clusters=5)
    topic = "Artificial Intelligence"

    try:
        results = summarizer.summarize(topic)

        print(f"Summaries for topic: {topic}")
        print("-" * 50)

        for cluster, data in results.items():
            print(f"\nCluster {cluster}:")
            print("Original Sentences:")
            for sentence in data['original_sentences']:
                print(f"- {sentence}")
            print("\nCluster Summary:")
            print(data['summary'])
            print("-" * 50)

    except Exception as e:
        print(f"An error occurred: {e}")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Summaries for topic: Artificial Intelligence
--------------------------------------------------

Cluster 0:
Original Sentences:
- Some high-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); interacting via human speech (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., ChatGPT, and AI art); and superhuman play and analysis in strategy games (e.g., chess and Go).
- However, many AI applications are not perceived as AI: "A lot of cutting edge AI has filtered into general applications, often without being called AI because once something becomes useful enough and common enough it's not labeled AI anymore."
- The various subfields of AI research are centered around particular goals and the use of particular tools.
- The traditional goals of AI research include reasoning, knowledge representation, planning, learning, natural langu

Build a BART summarizer: fetch a document from Wikipedia, split it into sentences, and cluster the sentences using BERT embeddings. Print the top 5 bigram keywords of each cluster and generate a BART summary for each cluster.
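
As a rough illustration of the "top 5 bigram keywords" step, here is a minimal sketch that counts bigrams with the standard library only. The names `top_bigrams` and `STOP_WORDS` are hypothetical, and the tiny stop-word set is an assumption made so the snippet needs no downloads; the notebook cell below uses NLTK's English stop words instead. As in that cell, stop words are removed before pairing, so a bigram may join two words that were not adjacent in the original sentence.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list so the sketch needs no downloads;
# the notebook itself relies on nltk's English stop words.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def top_bigrams(sentences, k=5):
    """Return the k most frequent (bigram, count) pairs across the sentences."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase, keep alphabetic tokens only, and drop stop words.
        words = [w for w in re.findall(r"[a-z]+", sentence.lower())
                 if w not in STOP_WORDS]
        # zip(words, words[1:]) pairs each remaining word with its successor.
        counts.update(zip(words, words[1:]))
    return counts.most_common(k)

example = [
    "Machine learning is a field of artificial intelligence.",
    "Artificial intelligence includes machine learning.",
]
print(top_bigrams(example, k=2))
```

Calling `top_bigrams(cluster_sentences, k=5)` once per cluster would give the per-cluster keyword lists the prompt asks for.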

In [None]:
import requests
import re
import numpy as np
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
from transformers import BartForConditionalGeneration, BartTokenizer
import nltk

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

def fetch_wikipedia_content(topic):
    """Fetch Wikipedia content for a given topic."""
    try:
        # Construct Wikipedia API URL
        base_url = "https://en.wikipedia.org/w/api.php"
        params = {
            "action": "query",
            "format": "json",
            "titles": topic,
            "prop": "extracts",
            "exintro": True,
            "explaintext": True
        }

        response = requests.get(base_url, params=params, timeout=10)
        response.raise_for_status()  # Surface HTTP errors before parsing the body as JSON
        data = response.json()

        # Extract page content
        page = next(iter(data['query']['pages'].values()))
        return page.get('extract', '')
    except Exception as e:
        print(f"Error fetching Wikipedia content: {e}")
        return ""

def preprocess_text(text):
    """Preprocess text by removing special characters and converting to lowercase."""
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    return text

def sentence_clustering(sentences, n_clusters=5):
    """Cluster sentences using TF-IDF and KMeans."""
    # Create TF-IDF vectorizer
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # Perform KMeans clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(tfidf_matrix)

    return kmeans.labels_

def extract_top_bigrams(sentences, n=5):
    """Return the n most frequent bigrams across the given sentences."""
    stop_words = set(stopwords.words('english'))

    def get_bigrams(sentence):
        # Drop stop words, then pair each remaining word with its successor
        words = [word for word in sentence.split() if word not in stop_words]
        return list(ngrams(words, 2))  # n-gram size 2 = bigrams

    bigram_counts = {}
    for sentence in sentences:
        sentence_bigrams = get_bigrams(preprocess_text(sentence))
        for bigram in sentence_bigrams:
            bigram_counts[bigram] = bigram_counts.get(bigram, 0) + 1

    # Sort by frequency (highest first) and keep the top n
    return sorted(bigram_counts.items(), key=lambda x: x[1], reverse=True)[:n]

def generate_bart_summary(text, max_length=450, min_length=250):
    """Generate summary using BART model."""
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

    inputs = tokenizer(text, max_length=1024, return_tensors='pt', truncation=True)
    summary_ids = model.generate(
        inputs['input_ids'],
        num_beams=4,
        max_length=max_length,
        min_length=min_length
    )

    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

def main(topic):
    # Fetch Wikipedia content
    text = fetch_wikipedia_content(topic)

    # Split into sentences
    sentences = sent_tokenize(text)

    # Perform sentence clustering
    cluster_labels = sentence_clustering(sentences)

    # Group sentences by cluster
    clustered_sentences = {}
    for label, sentence in zip(cluster_labels, sentences):
        if label not in clustered_sentences:
            clustered_sentences[label] = []
        clustered_sentences[label].append(sentence)

    # Print cluster information and top bigrams
    print(f"Analysis for topic: {topic}\n")
    for cluster, cluster_sents in clustered_sentences.items():
        print(f"Cluster {cluster}:")
        print("Sentences:")
        for sent in cluster_sents[:5]:  # Print up to the first five sentences of each cluster
            print(f"- {sent}")

        print("\nTop Bigrams:")
        top_bigrams = extract_top_bigrams(cluster_sents)
        for bigram, count in top_bigrams:
            print(f"  {' '.join(bigram)}: {count}")
        print("\n")

    # Generate BART summary
    summary = generate_bart_summary(text)
    print("BART Summary:")
    print(summary)

# Example usage
if __name__ == "__main__":
    main("Artificial intelligence")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Analysis for topic: Artificial intelligence

Cluster 2:
Sentences:
- Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
- It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.
- Such machines may be called AIs.
- Some high-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); interacting via human speech (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., ChatGPT, and AI art); and superhuman play and analysis in strategy games (e.g., chess and Go).
- General intelligence—the ability to complete any task performed by a human on an at least equal level—is among the fiel
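
The cell above fetches only the article intro (`exintro`), so a short article can yield fewer sentences than the default `n_clusters=5`, in which case scikit-learn's `KMeans` raises a `ValueError`. The sketch below shows one possible guard, together with a helper that groups sentences by cluster label. Both function names (`safe_cluster_count`, `group_by_cluster`) are hypothetical and not part of the notebook.

```python
from collections import defaultdict

def safe_cluster_count(num_sentences, requested=5):
    """Clamp the requested cluster count to the number of sentences (at least 1)."""
    return max(1, min(requested, num_sentences))

def group_by_cluster(sentences, labels):
    """Map each cluster label to its sentences, preserving document order."""
    groups = defaultdict(list)
    for label, sentence in zip(labels, sentences):
        # int() converts NumPy integer labels into plain Python ints.
        groups[int(label)].append(sentence)
    return dict(groups)

sentences = ["First sentence.", "Second sentence.", "Third sentence."]
k = safe_cluster_count(len(sentences), requested=5)  # clamped to 3 here
groups = group_by_cluster(sentences, [0, 1, 0])      # labels as KMeans might return them
print(k, groups)
```

Joining each group with `" ".join(...)` and passing the result to `generate_bart_summary` would give one summary per cluster rather than a single summary of the whole text. The default `min_length=250` is probably too long for small clusters, so a lower value may be needed there.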

In [3]:
!pip install wikipedia-api # Install wikipedia-api
!pip install transformers
!pip install sentencepiece

Collecting wikipedia-api
  Downloading wikipedia_api-0.7.1.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia-api
  Building wheel for wikipedia-api (setup.py) ... done
  Created wheel for wikipedia-api: filename=Wikipedia_API-0.7.1-py3-none-any.whl size=14346 sha256=b8ffb843aad8fd05064f3190028d660eaae80e56748d2411fb6a586e91c4c9d3
  Stored in directory: /root/.cache/pip/wheels/4c/96/18/b9201cc3e8b47b02b510460210cfd832ccf10c0c4dd0522962
Successfully built wikipedia-api
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.7.1
