📘 Jupyter Notebook: Text Extraction and Analysis
=====================================
Author: Satyam Mohapatra
-------------------------------------
Date: 2024-12-21
-------------------------------------
Objective:
To extract article text from URLs provided in Input.xlsx, clean the text using customized stopwords, and perform sentiment and readability analysis.

📂 Table of Contents:
   1. Introduction
   2. Setup and Imports
   3. Data Extraction
   4. Text Cleaning and Stopwords
   5. Sentiment and Readability Analysis
   6. Saving Results
   7. Conclusion

1. 📖 Introduction <a id="introduction"></a>

    This notebook automates the process of web article extraction and analysis. Key tasks include:
    - Extracting text from URLs.
    - Cleaning and preprocessing the text by removing stopwords (currencies, dates, generic words, etc.).
    - Performing sentiment analysis (positive/negative scores, polarity, subjectivity).
    - Computing readability metrics (fog index, sentence length, word count).
    - Exporting results to Output Data Structure.xlsx.

2. ⚙️ Setup and Imports <a id="setup-and-imports"></a>

In [1]:
# Install necessary packages:
%pip install beautifulsoup4
%pip install selenium 
%pip install nltk 
%pip install openpyxl 
%pip install pandas 
%pip install syllapy 
%pip install requests
%pip install --upgrade setuptools

Note: you may need to restart the kernel to use updated packages.

In [1]:
# Import required libraries:
import os
import re

import nltk
import pandas as pd
import requests
import syllapy
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Show the working directory, since relative paths below are resolved against it
print(os.getcwd())

C:\Users\smt93\Test Assignment\Data Extraction and NLP Blackcoffer


In [5]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\smt93\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\smt93\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

3. 🌐 Data Extraction <a id="data-extraction"></a>

    Goal: Extract text content from URLs provided in Input.xlsx.

In [9]:
#Code:
# Load URLs from Excel
df = pd.read_excel(r"C:\Users\smt93\Test Assignment\Input.xlsx")
df.to_csv(r"C:\Users\smt93\Test Assignment\Input.csv", index=False)

# Directory to store extracted articles
os.makedirs('extracted_articles', exist_ok=True)

# Extract article text from URL
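def extract_article(url):
    try:
        # A timeout keeps one unresponsive server from stalling the whole loop
        response = requests.get(url, timeout=30)
        # Treat HTTP errors (404, 500, ...) as failures instead of parsing the error page
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Join the text of every <p> element into one string
        article_text = ' '.join([p.text for p in soup.find_all('p')])
        return article_text.strip()
    except Exception as e:
        # On any failure, report it and return an empty article
        print(f"Failed to extract {url}: {e}")
        return ''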
    
# Extract and save articles
for index, row in df.iterrows():
    article = extract_article(row['URL'])
    with open(f"extracted_articles/{row['URL_ID']}.txt", 'w', encoding='utf-8') as f:
        f.write(article)
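
To see what the paragraph join in `extract_article` produces without making a network request, the sketch below applies the same idea to an in-memory HTML string. It is only an illustration: it uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup, and the class `ParagraphCollector`, the helper `paragraphs_to_text`, and the sample HTML are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside at least one <p>
        self.paragraphs = []    # one string per top-level <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            if self.depth == 1:
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the open <p>
        if self.depth > 0:
            self.paragraphs[-1] += data


def paragraphs_to_text(html):
    """Join the text of every <p> with single spaces, then strip the ends."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.paragraphs).strip()


# Made-up page: a heading, two paragraphs, and a navigation <div>
sample_html = (
    "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p>"
    "<div>menu</div><p>Second one.</p></body></html>"
)
print(paragraphs_to_text(sample_html))
```

The heading and the `menu` text are dropped because only `<p>` content is collected. A page that keeps its body text in other tags would therefore produce an empty string, which may be one reason some extracted files end up empty.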


4. ✂️ Text Cleaning and Stopwords <a id="text-cleaning-and-stopwords"></a>

    Goal: Remove irrelevant words (stopwords) from extracted text using custom stopword lists.
    
    Stopwords Include:
    - Auditor names
    - Currencies
    - Dates and numbers
    - Generic terms
    - Geographic locations
    - Common names

In [10]:
#Code:
STOPWORDS_DIR = r"C:\Users\smt93\Test Assignment\StopWords"

def load_word_set(path):
    # Read as bytes and decode with replacement so non-UTF-8 characters do not raise
    with open(path, 'rb') as file:
        text = file.read().decode('utf-8', errors='replace')
    # Lowercase so entries match the lowercased tokens used in analyze_sentiment
    return {word.lower() for word in text.split()}

stopword_files = [
    "StopWords_Auditor.txt",
    "StopWords_Currencies.txt",
    "StopWords_DatesandNumbers.txt",
    "StopWords_Generic.txt",
    "StopWords_GenericLong.txt",
    "StopWords_Geographic.txt",
    "StopWords_Names.txt",
]

# Merge all lists into one set for fast membership tests
custom_stopwords = set()
for name in stopword_files:
    custom_stopwords |= load_word_set(os.path.join(STOPWORDS_DIR, name))
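
Because `analyze_sentiment` lowercases every token before filtering, stopword entries only match if they are compared in lowercase too. The sketch below shows one way to normalise a stopword list held in memory. It assumes, hypothetically, that some lines may carry an annotation after a `|` (for example `SMITH | surname`); this notebook does not confirm that format. `parse_stopwords` and the sample text are illustrative names.

```python
def parse_stopwords(raw_text):
    """Return a set of lowercase stopwords from newline-separated text.

    Assumption: anything after a '|' on a line is an annotation, not a stopword.
    """
    words = set()
    for line in raw_text.splitlines():
        entry = line.split("|", 1)[0]      # drop the optional annotation
        for token in entry.split():        # a line may hold several words
            words.add(token.lower())
    return words


# Made-up file content with mixed case, one annotated line, and a blank line
sample = "ERNST\nYOUNG\nSMITH | surname\n\nrupee\n"
stopwords_demo = parse_stopwords(sample)
print(sorted(stopwords_demo))

# Tokens are lowercase, as they are after word_tokenize(text.lower())
tokens = ["ernst", "reported", "rupee", "growth"]
kept = [t for t in tokens if t not in stopwords_demo]
print(kept)
```

Without the `lower()` call, the uppercase entries `ERNST` and `YOUNG` would never match the lowercase tokens and those words would stay in the text.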

5. 📊 Sentiment and Readability Analysis <a id="sentiment-and-readability-analysis"></a>

   Goal: Perform sentiment and readability analysis to compute the following variables:

   - POSITIVE SCORE – Total count of positive words.
   - NEGATIVE SCORE – Total count of negative words.
   - POLARITY SCORE – Measures overall positivity or negativity of the text.
   - SUBJECTIVITY SCORE – Indicates how subjective or objective the text is.
   - AVG SENTENCE LENGTH – Average number of words per sentence.
   - PERCENTAGE OF COMPLEX WORDS – Proportion of words with more than two syllables.
   - FOG INDEX – Readability score indicating text complexity.
   - AVG NUMBER OF WORDS PER SENTENCE – Average word count across sentences.
   - COMPLEX WORD COUNT – Total count of words with more than two syllables.
   - WORD COUNT – Total number of words (excluding stopwords).
   - SYLLABLE PER WORD – Average syllable count per word.
   - PERSONAL PRONOUNS – Count of personal pronouns like I, we, my, ours, us.
   - AVG WORD LENGTH – Average character length of words.


In [21]:
#Code:
def read_words(path):
    # Decode with replacement so non-UTF-8 bytes do not raise; lowercase to match the tokens
    with open(path, 'rb') as file:
        return set(file.read().decode('utf-8', errors='replace').lower().split())

pos_dict = read_words(r"C:\Users\smt93\Test Assignment\MasterDictionary\positive-words.txt")
neg_dict = read_words(r"C:\Users\smt93\Test Assignment\MasterDictionary\negative-words.txt")

def analyze_sentiment(text):
    words = word_tokenize(text.lower())
    sentences = sent_tokenize(text)
    words = [word for word in words if word.isalpha() and word not in custom_stopwords]
    
    # 1. Sentiment Analysis
    positive_score = sum(1 for word in words if word in pos_dict)
    negative_score = sum(1 for word in words if word in neg_dict)
    polarity_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)
    subjectivity_score = (positive_score + negative_score) / (len(words) + 0.000001)
    
    # 2. Readability and Complexity Analysis
    if len(sentences) > 0:
        avg_sentence_length = len(words) / len(sentences)
    else:
        avg_sentence_length = 0
    
    complex_words = [word for word in words if syllapy.count(word) > 2]
    percentage_complex = len(complex_words) / len(words) if len(words) > 0 else 0
    fog_index = 0.4 * (avg_sentence_length + percentage_complex)
    
    # 3. Additional Metrics
    avg_number_of_words_per_sentence = len(words) / len(sentences) if len(sentences) > 0 else 0
    complex_word_count = len(complex_words)
    word_count = len(words)
    syllable_per_word = sum(syllapy.count(word) for word in words) / word_count if word_count > 0 else 0
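    # Match pronouns case-insensitively, but skip the all-caps "US", which is likely the country abbreviation
    personal_pronouns = sum(1 for m in re.finditer(r'\b(?:I|we|my|ours|us)\b', text, re.I) if m.group() != 'US')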
    avg_word_length = sum(len(word) for word in words) / word_count if word_count > 0 else 0
    
    # Return results as a dictionary
    return {
        'POSITIVE SCORE': positive_score,
        'NEGATIVE SCORE': negative_score,
        'POLARITY SCORE': polarity_score,
        'SUBJECTIVITY SCORE': subjectivity_score,
        'AVG SENTENCE LENGTH': avg_sentence_length,
        'PERCENTAGE OF COMPLEX WORDS': percentage_complex,
        'FOG INDEX': fog_index,
        'AVG NUMBER OF WORDS PER SENTENCE': avg_number_of_words_per_sentence,
        'COMPLEX WORD COUNT': complex_word_count,
        'WORD COUNT': word_count,
        'SYLLABLE PER WORD': syllable_per_word,
        'PERSONAL PRONOUNS': personal_pronouns,
        'AVG WORD LENGTH': avg_word_length
    }
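
The complex-word metrics depend on a per-word syllable count, which this notebook gets from `syllapy.count`. To make the "more than two syllables" threshold concrete without that dependency, the sketch below substitutes a rough vowel-group heuristic. `estimate_syllables` is a hypothetical stand-in, not the `syllapy` algorithm, and it will disagree with `syllapy` on some words.

```python
import re


def estimate_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing silent 'e' (as in "make") as not adding a syllable,
    # but keep it for "-le" endings such as "table".
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


# Made-up word list
words = ["make", "table", "text", "analysis", "company", "readability"]
for w in words:
    print(w, estimate_syllables(w))

# Same threshold as the notebook: complex means more than two syllables
complex_words = [w for w in words if estimate_syllables(w) > 2]
print(complex_words)
print(len(complex_words) / len(words))
```

Here three of the six words cross the threshold, so the complex-word share is 0.5.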

6. 💾 Saving Results <a id="saving-results"></a>

    Goal: Save analysis results to an Excel file.

In [22]:
results = []

for file in os.listdir('extracted_articles'):
    # Only process .txt files
    if file.endswith('.txt'):
        try:
            with open(f"extracted_articles/{file}", 'r', encoding='utf-8') as f:
                text = f.read()
                
                if len(text.strip()) == 0:  # Skip empty files
                    print(f"Skipping empty file: {file}")
                    continue

                # Analyze the sentiment and readability
                analysis = analyze_sentiment(text)
                # Strip only the final extension so an ID that contains a dot is kept whole
                analysis['URL_ID'] = os.path.splitext(file)[0]
                results.append(analysis)
        except Exception as e:
            print(f"Error processing file {file}: {e}")

# Create a DataFrame from the results and save it to an Excel file
output_df = pd.DataFrame(results)

# Ensure the DataFrame is not empty before saving
if not output_df.empty:
    output_df.to_excel('Output Data Structure.xlsx', index=False)
    print("Results saved to 'Output Data Structure.xlsx'")
else:
    print("No valid data to save.")

Skipping empty file: Netclan20241053.txt
Skipping empty file: Netclan20241054.txt
Skipping empty file: Netclan20241055.txt
Skipping empty file: Netclan20241056.txt
Skipping empty file: Netclan20241057.txt
Skipping empty file: Netclan20241058.txt
Results saved to 'Output Data Structure.xlsx'
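
The skipped files above never reach the output workbook, so it has fewer rows than Input.xlsx. A small sketch for listing which IDs are missing is shown below. It works on plain lists; the function name `missing_ids` and the sample IDs are hypothetical. In the notebook the two lists would presumably come from `df['URL_ID']` and `output_df['URL_ID']`.

```python
def missing_ids(input_ids, output_ids):
    """Return input IDs with no matching output row, in input order."""
    # Compare as strings so numeric and text IDs are treated alike
    seen = {str(i) for i in output_ids}
    return [str(i) for i in input_ids if str(i) not in seen]


# Made-up IDs for illustration
input_ids = ["ID001", "ID002", "ID003", "ID004"]
output_ids = ["ID001", "ID003"]
missing = missing_ids(input_ids, output_ids)
print(missing)
```

A list like this makes it easier to re-run extraction for just the failed URLs, or to record them alongside the results.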


7. ✅ Conclusion <a id="conclusion"></a>

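   This notebook automates the extraction and analysis of web articles: it downloads each URL listed in Input.xlsx, removes the custom stopwords, computes the thirteen sentiment and readability metrics listed in Section 5, and saves them to Output Data Structure.xlsx. Articles whose extracted text is empty are skipped and do not appear in the output.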