In [36]:
# Install the required libraries

%pip install nltk beautifulsoup4 requests

Note: you may need to restart the kernel to use updated packages.


### Importing the Required Libraries

In [37]:
import nltk
from bs4 import BeautifulSoup
import requests
from string import punctuation
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from heapq import nlargest


1. **Overview**: 
   This code snippet provides a function `get_word_weights(context)` that calculates and returns the normalized frequency of each unique word in a given text input. It filters out common stop words and punctuation to focus on meaningful terms.

2. **Dependencies**: 
   The code requires the Natural Language Toolkit (NLTK) library. The cell below downloads the NLTK resources needed for tokenization and stop word filtering:
   - `punkt`: Tokenizer models used by `word_tokenize` and `sent_tokenize`.
   - `stopwords`: The list of common English words that are removed because they contribute little to the text's meaning.
   - `wordnet`, `omw-1.4`, and `punkt_tab`: Left commented out in the cell. `wordnet` and `omw-1.4` would support lemmatization and synonym lookups, which this function does not use.

3. **Function Usage**: 
   - To use the function, import the necessary libraries and run the NLTK download commands before invoking `get_word_weights()`.
   - Call the function with a string argument containing the text you want to analyze, e.g., `weights = get_word_weights("Your text here")`.

4. **Output**: 
   The function returns a dictionary where each key is a unique word from the input text, and the corresponding value is its count divided by the count of the most frequent word. Values therefore lie in the range (0, 1], with the most frequent word scoring exactly 1.
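
The normalization described above can be illustrated with a small sketch that uses only the standard library. It is a simplified stand-in, not the notebook's implementation: `simple_word_weights`, the regex tokenizer, and the tiny `STOP_WORDS` set are hypothetical substitutes for NLTK's `word_tokenize` and its English stop word list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word set (NLTK's list is much longer).
STOP_WORDS = {"the", "is", "a", "of", "and", "on"}


def simple_word_weights(text):
    """Return each non-stop word's count divided by the highest count."""
    # Assumption: a token is a run of lower-case letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    if not counts:
        return {}  # nothing to score
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}


weights = simple_word_weights("The cat sat on the mat. The cat is a cat.")
print(weights)
```

Because every count is divided by the largest count, the most frequent word always receives a weight of 1.0 and rarer words receive proportionally smaller values.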

In [38]:
nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# Newer NLTK releases may need 'punkt_tab'; uncomment if tokenizing raises a LookupError
# nltk.download('punkt_tab')
nltk.download('stopwords')


def get_word_weights(context):
    word_tokens = word_tokenize(context)
    # Punctuation characters to ignore, e.g. . , $ [ ]
    punctuations = set(punctuation) | {'\n'}
    # Common English words to ignore, e.g. "is", "this", "they"
    stop_words = set(stopwords.words('english'))

    word_frequencies = {}
    for word in word_tokens:
        # Store lower-cased keys so they match the lower-cased lookup in get_sentence_weights
        word = word.lower()
        if word not in stop_words and word not in punctuations:
            word_frequencies[word] = word_frequencies.get(word, 0) + 1

    if not word_frequencies:
        return {}  # no scorable words (empty or stop-word-only text)

    # Normalize by the highest count so the most frequent word gets weight 1.0
    max_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] /= max_frequency

    return word_frequencies

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ajukh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ajukh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Sentence Weight Calculation Function

1. **Purpose**: The `get_sentence_weights` function calculates the weight of each sentence in a given text based on the frequency of words, helping to identify the most significant sentences.

2. **Parameters**:
   - `context`: A string containing the text for which sentence weights are to be computed.
   - `word_frequency`: A dictionary where each key represents a word and its value indicates the word's frequency within the text.

3. **Process**: The function tokenizes the input text into sentences and iterates through each sentence. It calculates the weight of each sentence by summing the frequencies of the words it contains, excluding stop words.

4. **Output**: The function returns a dictionary where each key is a sentence from the input text, and the corresponding value is the cumulative weight based on the word frequencies. Sentences that contain no weighted words are omitted from the dictionary.

5. **Usage**: This function can be used in text summarization tasks to determine which sentences carry the most weight and should be included in a summary.
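
As a rough illustration of the scoring step, the sketch below sums word weights per sentence using only the standard library. The names `simple_sentence_weights` and the regex-based sentence splitter are assumptions made for this example; the notebook itself relies on NLTK's `sent_tokenize` and `word_tokenize`.

```python
import re


def simple_sentence_weights(text, word_frequency):
    """Sum the weights of the known words in each sentence."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    sentence_weight = {}
    for sentence in sentences:
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        score = sum(word_frequency[t] for t in tokens if t in word_frequency)
        if score > 0:
            # Sentences without any weighted word are left out
            sentence_weight[sentence] = score
    return sentence_weight


word_frequency = {"cat": 1.0, "sat": 0.5, "mat": 0.5}
scores = simple_sentence_weights(
    "The cat sat on the mat. The dog barked. A cat!", word_frequency
)
print(scores)
```

Since the score is a plain sum with no length normalization, longer sentences tend to rank higher than short ones that mention the same key words.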


In [39]:
def get_sentence_weights(context, word_frequency):
    sentences = sent_tokenize(context)
    sentence_weight = dict()
    for line in sentences:
        # Lower-case the sentence before tokenizing it for the word-weight lookup
        for word in word_tokenize(line.lower()):
            if word in word_frequency:
                # Add this word's weight to the sentence's running total
                sentence_weight[line] = sentence_weight.get(line, 0) + word_frequency[word]

    return sentence_weight

# Sentence Summary Generation Function

1. **Purpose**: The `get_sentence_summary` function generates a summary of text based on the weights of its sentences, allowing for concise representation of the original content.

2. **Parameters**:
   - `sentence_weight`: A dictionary where each key is a sentence and the value represents its calculated weight.
   - `summary_len`: A float (default value 0.5) indicating the proportion of sentences to include in the summary relative to the total number of sentences.

3. **Process**: The function calculates the number of sentences to include in the summary by multiplying the total number of sentences by `summary_len`. It then selects the top-weighted sentences using the `nlargest` function.

4. **Output**: The function returns a single string containing the selected sentences, joined with spaces in descending order of weight rather than in their original order.

5. **Usage**: This function is useful in text summarization tasks, enabling the extraction of key information from larger texts by selecting the most important sentences based on their weights.
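
The selection step can be tried on a hand-made dictionary. The sentences and weights below are invented for illustration; the last two lines show an optional variant, not part of the notebook, that restores the original sentence order.

```python
from heapq import nlargest

# Made-up sentence weights, listed in document order.
sentence_weight = {
    "Sentence A.": 2.0,
    "Sentence B.": 0.5,
    "Sentence C.": 3.5,
    "Sentence D.": 1.0,
    "Sentence E.": 1.5,
}

summary_len = 0.5
# int() truncates, so 5 sentences * 0.5 selects 2 sentences
select_len = int(len(sentence_weight) * summary_len)
top_sentences = nlargest(select_len, sentence_weight, key=sentence_weight.get)
summary = " ".join(top_sentences)
print(summary)

# Optional variant: keep the chosen sentences in their original order.
ordered = [s for s in sentence_weight if s in top_sentences]
print(" ".join(ordered))
```

Note that a very short text combined with a small `summary_len` can truncate `select_len` to 0, which yields an empty summary.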


In [40]:
def get_sentence_summary(sentence_weight,summary_len = 0.5):
    select_len = int(len(sentence_weight) * summary_len)
    summary = nlargest(select_len,sentence_weight,key = sentence_weight.get)
    # A list of the sentences with the highest weights, heaviest first
    final_summary = " ".join(summary)
    return final_summary

# Web Scraping Function

1. **Purpose**: The `scraping` function retrieves text content from a specified webpage URL by making an HTTP request and parsing the HTML content.

2. **Headers**: The function utilizes browser-like headers to mimic a real browser request, which helps prevent potential blocking by the server due to automated scraping.

3. **Data Retrieval**: The function sends a GET request to the specified URL. `raise_for_status()` raises an exception for 4xx and 5xx responses; otherwise the function proceeds to parse the HTML content.

4. **Content Extraction**: Using BeautifulSoup, the function searches for all paragraph (`<p>`) tags in the HTML and concatenates their text content into a single string, stripping any leading or trailing whitespace.

5. **Error Handling**: The function catches `requests.exceptions.RequestException` (network failures, invalid URLs, HTTP error responses), prints the error, and returns `None`, so callers should check the result before summarizing.
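
To experiment with the paragraph-extraction step without network access or extra packages, the sketch below swaps BeautifulSoup for the standard library's `html.parser`. This is a substitute technique chosen only so the example is self-contained; `ParagraphExtractor`, `extract_paragraph_text`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the text pieces first, then strip only the paragraph's ends
            self.paragraphs.append("".join(self._chunks).strip())
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def extract_paragraph_text(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(p for p in parser.paragraphs if p)


sample_html = (
    "<html><body><h1>Title</h1>"
    "<p> First <b>bold</b> paragraph. </p>"
    "<div>skip me</div>"
    "<p>Second one.</p></body></html>"
)
article_text = extract_paragraph_text(sample_html)
print(article_text)
```

Joining the pieces of a paragraph before stripping keeps the spaces around inline elements such as `<b>` or `<a>`, so neighbouring words are not glued together.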


In [41]:

def scraping(url):
    # Define browser-like headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    }

    try:
        # Sending request to get the data from the webpage
        scraped_data = requests.get(url, headers=headers, timeout=10)  # timeout avoids hanging indefinitely
        scraped_data.raise_for_status()  # Raise an HTTPError for 4xx/5xx responses

        # Parse the content with BeautifulSoup
        soup = BeautifulSoup(scraped_data.content, 'html.parser')

        # Find all paragraphs and extract text
        # p stands for paragraph element in html
        paragraphs = soup.find_all('p')
        article_context = ""

        for p in paragraphs:
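            # get_text(strip=True) would strip every text node and glue words around inline tags together,
            # so take the full text and strip only the paragraph's ends
            article_context += p.get_text().strip() + " "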

        return article_context.strip()  # Return the article content, ensuring no leading/trailing whitespace

    except requests.exceptions.RequestException as e:
        # Handle any exception raised by requests
        print(f"An error occurred: {e}")
        return None

# File Loading Function

1. **Purpose**: The `load_files` function is designed to read and load text files from a specified directory, storing their contents in a list.

2. **Directory Input**: The function accepts a single argument, `directory`, which is the path to the directory containing the text files to be loaded.

3. **File Reading**: It iterates through all files in the given directory. For each file, it opens the file in read mode and appends its contents to a list.

4. **Data Storage**: The contents of each file are collected in the `files_data` list, which holds the text from all files in the specified directory.

5. **Return Value**: The function returns the `files_data` list, providing the caller with the text content of all files in the specified directory for further processing.
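
The loading loop can be exercised against a temporary directory. The sketch below is a hypothetical variant named `load_text_files`: unlike the notebook's function it sorts the file names, skips subdirectories, and assumes UTF-8 input, all of which are assumptions made for this example.

```python
import os
import tempfile


def load_text_files(directory):
    """Read every regular file in `directory` and return the contents as a list."""
    files_data = []
    for fname in sorted(os.listdir(directory)):  # sorted for a stable order
        path = os.path.join(directory, fname)
        if os.path.isfile(path):  # skip subdirectories
            with open(path, "r", encoding="utf-8") as f:
                files_data.append(f.read())
    return files_data


with tempfile.TemporaryDirectory() as tmp:
    # Create two sample files and one subdirectory to load from
    for name, text in [("a.txt", "first mail"), ("b.txt", "second mail")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)
    os.mkdir(os.path.join(tmp, "nested"))
    loaded = load_text_files(tmp)

print(loaded)
```

`os.listdir` returns entries in arbitrary order, which is why the sketch sorts them before reading.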


In [42]:

import os
def load_files(directory):
    files_data = []

    for fname in os.listdir(directory):
        with open(os.path.join(directory, fname), 'r') as f:
            files_data.append(f.read())
    
    return files_data


# File Writing Function

1. **Purpose**: The `WriteFile` function writes the original context and its summary to a file named `summary.txt` in the parent directory (`../summary.txt`).

2. **Parameters**: The function accepts two parameters: 
   - `context`: The original text that has been summarized.
   - `final_summary`: The summarized version of the original text.

3. **File Handling**: The function opens (or creates) the `summary.txt` file in append mode (`'a'`), ensuring that new summaries are added without overwriting existing content.

4. **Content Structure**: Each call writes the following to the file, in order:
   - Four separator lines of dashes.
   - A header for the original text, followed by the original context.
   - Four more separator lines.
   - A header for the summarized text, followed by the summary itself.

5. **Encoding**: The file is written using UTF-8 encoding to support a wide range of characters and ensure proper storage of the text content.
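
The effect of append mode can be checked with a temporary file. This sketch mirrors the write calls of the notebook's `WriteFile` cell but takes the output path as a parameter; `append_summary`, `SEPARATOR`, and the sample strings are hypothetical names introduced for the example.

```python
import os
import tempfile

SEPARATOR = "--------------\n" * 4  # four separator lines


def append_summary(path, context, final_summary):
    """Append the original text and its summary to the file at `path`."""
    with open(path, mode="a", encoding="utf-8") as file:
        file.write("\n" + SEPARATOR)
        file.write(f"The original text is as follows: \n {context}")
        file.write("\n" + SEPARATOR)
        file.write(f"The summarized text is as follows : \n {final_summary}")


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "summary.txt")
    append_summary(out_path, "Long text one.", "Short one.")
    append_summary(out_path, "Long text two.", "Short two.")
    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(written)
```

Each call adds a new entry after the existing ones, so running the pipeline repeatedly makes the file grow rather than replacing earlier results.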


In [43]:
filesdata = load_files('./sample_mail_data')

def WriteFile(context,final_summary):
    with open('../summary.txt',mode='a', encoding='utf-8') as file:
        file.write("\n--------------\n")
        file.write("--------------\n")
        file.write("--------------\n")
        file.write("--------------\n")
        file.write(f"The original text is as follows: \n {context}")
        file.write("\n--------------\n")
        file.write("--------------\n")
        file.write("--------------\n")
        file.write("--------------\n")
        file.write(f"The summarized text is as follows : \n {final_summary}")

        

# Main Function for Summarization

1. **Purpose**: The `main` function orchestrates the text summarization process by invoking various helper functions to compute word and sentence weights, generate a summary, and write the result to a file.

2. **Steps in Summarization**:
   - **Word Weights**: It calls `get_word_weights(context)` to calculate the frequency of words, excluding stop words and punctuation, which forms the basis for sentence weighting.
   - **Sentence Weights**: It uses `get_sentence_weights(context, word_weights)` to evaluate each sentence based on word frequencies.
   - **Summary Generation**: It calls `get_sentence_summary(sentence_weight=sentence_weights, summary_len=0.5)` to generate a summary made of the top-weighted half of the scored sentences.

3. **Writing Results**: After generating the summary, it calls `WriteFile(context, final_summary)` to store both the summarized and original texts in an output file.

4. **Input**: The `context` parameter contains the original text that is to be summarized.

5. **Output**: The function does not return a value; it appends the original and summarized texts to `../summary.txt`.


In [44]:
def main(context):
    word_weights = get_word_weights(context)
    sentence_weights = get_sentence_weights(context,word_weights)
    final_summary = get_sentence_summary(sentence_weight=sentence_weights, summary_len=0.5)
    WriteFile(context,final_summary)


**Sample URLs for summarization**


In [45]:
Umbriel_moon = 'https://en.wikipedia.org/wiki/Umbriel'
Fine_tuning = "https://en.wikipedia.org/wiki/Fine-tuning_(deep_learning)"
llama3  = 'https://en.wikipedia.org/wiki/Llama_(language_model)'
hachi = "https://en.wikipedia.org/wiki/Hachi:_A_Dog%27s_Tale"
oppenheimer = "https://en.wikipedia.org/wiki/J._Robert_Oppenheimer"
Beethoven = "https://en.wikipedia.org/wiki/Ludwig_van_Beethoven"
context = scraping(Fine_tuning)
if context:  # scraping() returns None when the request fails
    main(context)

# Wikipedia Scraping and Summarization Script

This script is designed to scrape Wikipedia pages, generate summaries, and save both the summary and original content in a file. The code flow can be described in the following steps:

### 1. **Web Scraping with `scraping` Function**
   - The `scraping` function sends a GET request to a Wikipedia URL.
   - It uses the `BeautifulSoup` library to parse the HTML and extract text from all paragraph (`<p>`) elements.
   - This extracted text (referred to as `context`) will be summarized in the subsequent steps.

### 2. **Summarization Process (`main` Function)**
   - The extracted `context` is passed to the `main(context)` function for summarization.
   - Steps involved in summarization:
     - **Word Frequencies**: The function `get_word_weights(context)` is used to calculate the importance of each word based on its frequency.
     - **Sentence Weights**: Sentences are then ranked based on the importance of the words they contain using the `get_sentence_weights()` function.
     - **Summary Creation**: A summary is generated that keeps a given fraction of the sentences (`summary_len=0.5`, i.e. 50%, in `main`) using `get_sentence_summary()`.
   
### 3. **Writing Output to File (`WriteFile` Function)**
   - The `WriteFile` function appends the original content and the summarized text to `../summary.txt`.
   - Each appended entry contains:
     - A section for the original content.
     - A section for the summarized text, with each section preceded by four lines of dashes for clarity.

### 4. **Usage Example**
   - You can scrape and summarize content from different Wikipedia pages by passing their URLs to the `scraping` function.
   - For example, to summarize the Beethoven Wikipedia page:
     ```python
     context = scraping(Beethoven)
     main(context)
     ```

### 5. **Output File**
   - The final output is appended to `../summary.txt` with the following structure:
     ```
     --------------
     --------------
     --------------
     --------------
     The original text is as follows: 
      [Original Article]
     --------------
     --------------
     --------------
     --------------
     The summarized text is as follows : 
      [Summary]
     ```

