# Text summarization

## Loading data (scraping from Wikipedia)

In [None]:
import requests

def get_wikipedia_articles(topic):
    """
    Fetches text from up to 5 Wikipedia articles related to the given topic.

    Parameters:
    topic (str): The main topic to search articles for.

    Returns:
    dict: A dictionary where keys are article titles and values are article content.
    """

    # Wikipedia API URL for searching articles
    search_url = "https://en.wikipedia.org/w/api.php"

    # Custom headers with User-Agent
    headers = {
        "User-Agent": "MyWikipediaScraper/1.0 (contact@example.com)"  # Replace with your actual email
    }

    # Parameters for searching articles related to the topic
    search_params = {
        "action": "query",
        "list": "search",
        "srsearch": topic,
        "format": "json",
        "srlimit": 5  # Limit to 5 articles
    }

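    # Make the search request (time out rather than hang, and fail on HTTP errors)
    search_response = requests.get(search_url, headers=headers, params=search_params, timeout=10)
    search_response.raise_for_status()
    search_data = search_response.json()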

    # Dictionary to store titles and content of articles
    articles = {}

    # Loop over search results and fetch content for each article
    for result in search_data["query"]["search"]:
        title = result["title"]

        # Parameters for fetching the page content
        content_params = {
            "action": "query",
            "prop": "extracts",
            "explaintext": True,
            "titles": title,
            "format": "json"
        }

        # Make the request for article content (same timeout and HTTP error check)
        content_response = requests.get(search_url, headers=headers, params=content_params, timeout=10)
        content_response.raise_for_status()
        content_data = content_response.json()

        # Extract page content
        page = next(iter(content_data["query"]["pages"].values()))
        if "extract" in page:
            articles[title] = page["extract"]  # Store the title and content

    return articles

# Example usage
topic = "Artificial Intelligence"
articles = get_wikipedia_articles(topic)

# Print article titles and a preview of their content
for title, content in articles.items():
    print(f"Title: {title}\nContent Preview: {content[:1000]}...\n")

Title: Artificial intelligence
Content Preview: Artificial intelligence (AI) refers to the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals. Such machines may be called AIs.
High-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); virtual assistants (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., ChatGPT and AI art); and superhuman play and analysis in strategy games (e.g., chess and Go). However, many AI applications are not perceived as AI: "A lot of c
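The loop above indexes straight into `content_data["query"]["pages"]`, which raises a `KeyError` if the API returns an error payload instead. A minimal sketch of more defensive parsing follows; `extract_page_text` is a hypothetical helper that is not part of the notebook, and the two sample dictionaries are made-up stand-ins for parsed JSON responses.

```python
def extract_page_text(content_data):
    """Return the first non-empty plain-text extract in a parsed response, or None."""
    # .get() with defaults avoids a KeyError when "query" or "pages" is missing
    pages = content_data.get("query", {}).get("pages", {})
    for page in pages.values():
        extract = page.get("extract")
        if extract:
            return extract
    return None


# Illustrative payloads (hypothetical values, shaped like the responses used above)
ok_response = {"query": {"pages": {"12345": {"title": "Example", "extract": "AI is ..."}}}}
error_response = {"error": {"code": "example", "info": "illustrative error payload"}}

print(extract_page_text(ok_response))
print(extract_page_text(error_response))
```

Inside `get_wikipedia_articles`, a helper like this could replace the `next(iter(...))` lookup so that a failed request for one title does not abort the whole loop.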

## Data cleaning

In [None]:
import re

def remove_references(text):
    """
    Removes reference tags (e.g., [1], [citation needed]) from Wikipedia text.

    Parameters:
    text (str): The input text containing references.

    Returns:
    str: Cleaned text without references.
    """
    # Remove patterns like [1], [12], [citation needed]
    return re.sub(r'\[.*?\]', '', text)


def remove_special_characters(text):
    """
    Removes special characters like newline and tab characters from text.

    Parameters:
    text (str): The input text to clean.

    Returns:
    str: Cleaned text without special characters.
    """
    # Replace newlines and tabs with spaces
    text = text.replace('\n', ' ').replace('\t', ' ')

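    # Keep only letters, digits, basic punctuation, and spaces (extend the whitelist as needed).
    # Note: this also strips hyphens and parentheses, so "problem-solving" becomes "problemsolving".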
    text = re.sub(r'[^A-Za-z0-9.,;:!?\'" ]+', '', text)

    return text


def normalize_whitespace(text):
    """
    Normalizes whitespace by collapsing any run of whitespace into a single space.

    Parameters:
    text (str): The input text with extra spaces.

    Returns:
    str: Text with normalized whitespace.
    """
    # Collapse runs of whitespace into one space and trim both ends
    return re.sub(r'\s+', ' ', text).strip()


def clean_text(text):
    """
    Cleans Wikipedia text by removing references, special characters, and normalizing whitespace.

    Parameters:
    text (str): The raw Wikipedia article text.

    Returns:
    str: Fully cleaned text.
    """
    text = remove_references(text)
    text = remove_special_characters(text)
    text = normalize_whitespace(text)
    return text

# Example Usage
raw_text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by animals including humans.[1] Leading AI textbooks define the field as the study of "intelligent agents"[citation needed]: any system that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.
"""

# Clean the text using our utility function
cleaned_text = clean_text(raw_text)
print("Cleaned Text:", cleaned_text)

Cleaned Text: Artificial intelligence AI is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by animals including humans. Leading AI textbooks define the field as the study of "intelligent agents": any system that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.
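The character whitelist in `remove_special_characters` also drops hyphens and parentheses, which is why "(AI)" becomes "AI" here and "problem-solving" shows up as "problemsolving" later in the notebook. A minimal sketch of a gentler variant, assuming you want to keep those characters; the function name `remove_special_characters_keep_hyphens` is hypothetical and is not used elsewhere in the notebook.

```python
import re


def remove_special_characters_keep_hyphens(text):
    """Hypothetical variant of remove_special_characters that keeps '-', '(' and ')'."""
    # Replace newlines and tabs with spaces
    text = text.replace('\n', ' ').replace('\t', ' ')
    # Same whitelist as before, extended with parentheses and the hyphen
    return re.sub(r'[^A-Za-z0-9.,;:!?\'"()\- ]+', '', text)


sample = "Artificial intelligence (AI) covers problem-solving\nand decision-making."
cleaned_sample = remove_special_characters_keep_hyphens(sample)
print(cleaned_sample)
```

Swapping this in would change the tokenized sentences and therefore the scores and outputs recorded in the cells below.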


In [None]:
import nltk
import sklearn


nltk.__version__, sklearn.__version__

('3.9.1', '1.6.1')

In [None]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import sent_tokenize
import numpy as np

# nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
first_article = list(articles.keys())[0]
cleaned_text = clean_text(articles[first_article])

sentences = sent_tokenize(cleaned_text)
sentences[:2]

['Artificial intelligence AI refers to the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problemsolving, perception, and decisionmaking.',
 'It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.']

In [None]:
len(sentences)

575

## Let's create embeddings (vectors)

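TF-IDF assigns each word a weight that grows with how often the word appears in a document and shrinks with how many documents contain it. Here each sentence is treated as its own document, so every sentence becomes a TF-IDF vector, and the sentences with the highest total weight are the most likely to carry the main points.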

In [None]:
# Initialize TfidfVectorizer with English stop words
vectorizer = TfidfVectorizer(stop_words='english')

# Compute TF-IDF scores for each sentence
tfidf_matrix = vectorizer.fit_transform(sentences)

# Display TF-IDF matrix shape to understand the output
print("TF-IDF Matrix shape:", tfidf_matrix.shape)

TF-IDF Matrix shape: (575, 2987)


In [None]:
np.array(tfidf_matrix.sum(axis=1)).flatten()[:5]

array([3.98990294, 4.50122519, 1.69971014, 5.71521362, 3.50986723])

### Summarization using TF-IDF scores

Here we rank sentences by their importance in the article. As the importance measure we use the sum of the TF-IDF scores in each sentence's row, and we keep the top-ranked sentences as the summary. Note that a sum favors sentences with many distinct terms, so long sentences tend to rank higher.

In [None]:
# Sum the TF-IDF scores for each sentence (row)
sentence_scores = np.array(tfidf_matrix.sum(axis=1)).flatten()

# Get the indices of the 3 highest-scoring sentences, best first
top_sentence_indices = sentence_scores.argsort()[-3:][::-1]

# Display the top-ranked sentence indices
print("Top sentence indices:", top_sentence_indices)

Top sentence indices: [560 153 398]


In [None]:
import textwrap

## Let's build a Summary

In [None]:
# Extract the top sentences for the summary
summary_sentences = [sentences[i] for i in top_sentence_indices]

# Join the sentences to form the summary text
summary = ' '.join(summary_sentences)

wrapped_summary = textwrap.fill(summary, width=90)
print("Summary:")
print(wrapped_summary)

Summary:
See also Artificial intelligence and elections Use and impact of AI on political elections
Artificial intelligence content detection Software to detect AIgenerated content Behavior
selection algorithm Algorithm that selects actions for intelligent agents Business process
automation Automation of business processes Casebased reasoning Process of solving new
problems based on the solutions of similar past problems Computational intelligence
Ability of a computer to learn a specific task from data or experimental observation
Digital immortality Hypothetical concept of storing a personality in digital form Emergent
algorithm Algorithm exhibiting emergent behavior Female gendering of AI technologies
Gender biases in digital technologyPages displaying short descriptions of redirect targets
Glossary of artificial intelligence List of definitions of terms and concepts commonly
used in the study of artificial intelligence Intelligence amplification Use of information
technology to augm

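## Problem with the TF-IDF approach - non-grammatical summary

Summaries built this way are often disjointed because sentences are selected solely by their TF-IDF scores. A high score says nothing about whether the selected sentences flow when placed together, since each was written to be read in context. The output above shows a second problem: the top-ranked "sentence" is the article's entire "See also" list. Cleaning replaced the line breaks with spaces and the list entries have no terminal punctuation, so the tokenizer treated the whole list as one very long sentence, and summing TF-IDF scores favors long sentences.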
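One way to soften this, sketched here under stated assumptions rather than as a recommended fix: skip sentences above a word-count cutoff, then emit the selected sentences in their original document order. The helper `select_summary_indices` and its `max_words` parameter are hypothetical, the cutoff of 60 words is arbitrary, and the scores and lengths below are toy values.

```python
def select_summary_indices(scores, lengths, k=3, max_words=60):
    """Pick the k best-scoring sentences that are not overly long, in document order."""
    # Drop empty sentences and anything longer than the cutoff (e.g. a flattened list)
    candidates = [i for i, n in enumerate(lengths) if 0 < n <= max_words]
    # Rank the remaining sentences by score, best first
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    # Sort the winners by position so the summary follows the article's order
    return sorted(ranked[:k])


toy_scores = [3.9, 4.5, 1.7, 25.0, 3.5]   # sentence 3 has a huge summed score...
toy_lengths = [28, 35, 6, 400, 20]        # ...only because it is 400 words long
selected = select_summary_indices(toy_scores, toy_lengths, k=2)
print(selected)  # [0, 1]
```

With the notebook's variables, `scores` could be the `sentence_scores` array and `lengths` could be `[len(s.split()) for s in sentences]`.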

---

## Some production-grade tooling - TextRank

TextRank is a graph-based algorithm that ranks each sentence by how similar it is to the other sentences in the text. Because it favors sentences that are central to the document as a whole, it often produces more readable and representative summaries.

<br>

Pros of Using TextRank for Extractive Summarization:
- Readability: TextRank scores each sentence by its similarity to the rest of the text, which tends to yield a more cohesive selection.

- Concise yet Coherent Summaries: TextRank aims to capture the central idea of the text in a few sentences.

In [None]:
# Sumy

!pip install sumy

Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl.metadata (7.5 kB)
Collecting docopt<0.7,>=0.6.1 (from sumy)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting breadability>=0.1.20 (from sumy)
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting pycountry>=18.2.23 (from sumy)
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Downloading sumy-0.11.0-py2.py3-none-any.whl (97 kB)
Downloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)
Building wheels for collected packages: breadability, docopt
  Building wheel for breadability (setup.py) ... done
  Created wheel for breadability: filename=brea

In [None]:
import sumy

sumy.__version__

'0.11.0'

In [None]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

def extractive_summary_textrank(text, sentence_count=3):
    """
    Generates an extractive summary of the text using the TextRank algorithm from Sumy.

    Parameters:
    text (str): The input text to summarize.
    sentence_count (int): The number of sentences to include in the summary.

    Returns:
    str: The generated summary.
    """
    # Create a parser for the input text
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize the TextRank summarizer
    summarizer = TextRankSummarizer()

    # Generate summary with the specified number of sentences
    summary = summarizer(parser.document, sentence_count)

    # Join the summarized sentences into a single string
    summary_text = " ".join([str(sentence) for sentence in summary])

    return summary_text

In [None]:
# Get the summary with 3 sentences
summary = extractive_summary_textrank(cleaned_text, sentence_count=3)

wrapped_summary = textwrap.fill(summary, width=90)
print("Summary:")
print(wrapped_summary)

Summary:
The emergence of advanced generative AI in the midst of the AI boom and its ability to
create and modify content exposed several unintended consequences and harms in the present
and raised concerns about the risks of AI and its longterm effects in the future,
prompting discussions about regulatory policies to ensure the safety and benefits of the
technology. An AI framework such as the Care and Act Framework containing the SUM
valuesdeveloped by the Alan Turing Institute tests projects in four main areas: Respect
the dignity of individual people Connect with other people sincerely, openly, and
inclusively Care for the wellbeing of everyone Protect social values, justice, and the
public interest Other developments in ethical frameworks include those decided upon during
the Asilomar Conference, the Montreal Declaration for Responsible AI, and the IEEE's
Ethics of Autonomous Systems initiative, among others; however, these principles are not
without criticism, especially regards 

### Theory time!

#### What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) is a technique to evaluate the importance of a word in a document relative to a collection of documents (corpus).

1. Term Frequency (TF): Measures how frequently a word appears in a document. The more frequent the word, the higher its TF score for that document.

2. Inverse Document Frequency (IDF): Measures how unique a word is across all documents in the corpus. Common words like "the" and "is" are less informative, so they get a lower IDF score, while unique terms get higher scores.


TF-IDF combines these two metrics by calculating:

  `TF-IDF = TF × IDF`


Words with high TF-IDF scores in a document are considered more important for that document, making TF-IDF useful for identifying key words and phrases within texts.
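To make the formula concrete, here is a minimal from-scratch sketch using the textbook definitions: TF as the relative frequency of a term in a document and IDF as `ln(N / df)`. These definitions, the whitespace tokenization, and the `tf_idf` function name are assumptions made for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document, with score = TF x ln(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = ["the cat sat", "the dog barked", "the cat purred"]
doc_scores = tf_idf(docs)
for doc, weights in zip(docs, doc_scores):
    print(doc, "->", {term: round(w, 3) for term, w in weights.items()})
```

In this toy corpus "the" appears in every document, so its IDF is `ln(3/3) = 0` and its score vanishes, while "sat" appears in only one document and gets the highest score in the first one.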

---


#### TF-IDF vs. TextRank in Extractive Summarization

In extractive summarization, TF-IDF can help by identifying the most informative words within a document. A common approach is to:

- Compute the TF-IDF scores of words,

- Identify sentences that contain the highest scoring words,

- Extract these top sentences to form a summary.

However, this approach often produces disjointed summaries that may not be grammatically coherent, as it doesn't consider sentence-to-sentence relationships. This is where TextRank provides an advantage.

---

#### TextRank: A Graph-Based Summarization Algorithm

TextRank doesn’t directly use TF-IDF. Instead, it builds a graph of sentence relationships based on pairwise sentence similarity. Here’s how it works:

1. Convert Text into Sentences: Split the text into individual sentences.

2. Build Sentence Graph: Treat each sentence as a node in a graph, and connect pairs of sentences with edges weighted by their similarity. The original TextRank formulation measures similarity by the number of words two sentences share, normalized by their lengths; other variants use cosine similarity between TF-IDF vectors or sentence embeddings (e.g., from Word2Vec or BERT).

3. Rank Sentences: TextRank applies a PageRank-style algorithm, so a sentence scores highly when it is similar to many other sentences and those sentences are themselves highly ranked. The result is a measure of centrality: highly ranked sentences are central to the text’s meaning.

4. Extract Top Sentences: Finally, extract the top-ranked sentences as the summary.

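By capturing sentence relationships, TextRank often produces summaries that are more representative of the document than ranking by summed TF-IDF scores. Both methods are extractive, so each selected sentence is copied verbatim; neither rewrites text to improve flow.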
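The four steps can be sketched in plain Python. This is a simplified illustration, not Sumy's implementation: it assumes regex tokenization with no stemming or stop-word removal, uses the word-overlap similarity from the original TextRank formulation, and runs a fixed number of PageRank-style iterations. The names `overlap_similarity` and `textrank` are hypothetical.

```python
import math
import re


def overlap_similarity(words_a, words_b):
    """Number of shared words, normalized by the log of both sentence lengths."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator, since log(1) + log(1) = 0
    shared = len(set(words_a) & set(words_b))
    return shared / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Return one centrality score per sentence using weighted power iteration."""
    n = len(sentences)
    if n == 0:
        return []
    words = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    # Step 2: weighted sentence graph as a symmetric matrix without self-loops
    weights = [[overlap_similarity(words[i], words[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    # Step 3: each sentence receives score from its neighbors, in proportion
    # to the edge weight and the neighbor's own score
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


example_sentences = [
    "Cats chase mice in the barn.",
    "Mice hide from cats in the barn.",
    "Stock prices fell sharply today.",
]
example_scores = textrank(example_sentences)
for sentence, score in zip(example_sentences, example_scores):
    print(f"{score:.3f}  {sentence}")
```

The two barn sentences share five words and reinforce each other, while the unrelated third sentence keeps only the baseline score. For step 4, take the indices of the highest scores and sort them to restore document order.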


#### Why We Used TextRank Instead of TF-IDF for Summarization

In our case, Sumy’s TextRank summarizer provided better coherence and readability by selecting sentences based on their centrality in the document rather than just their term importance, as TF-IDF would.

---

TextRank is better suited for extractive summarization because:

- It favors sentences that are similar to many others, which tends to give a more cohesive summary.
- It selects sentences that are representative of the entire document’s content.