
📊 Text Analysis and Sentiment Analysis Tool

A powerful Python-based tool for comprehensive text analysis, sentiment analysis, and readability assessment of web articles. Extract, process, and analyze content with ease.

✨ Features

  • 🌐 Web Scraping

    • Extracts article content from any URL
    • Handles both static and JavaScript-heavy websites
    • Built-in fallback mechanisms for robust content extraction
  • 📊 Sentiment Analysis

    • Positive/Negative sentiment scoring
    • Polarity and subjectivity metrics
    • Emotion detection
  • 📝 Text Complexity Analysis

    • Flesch Reading Ease score
    • Average sentence and word length
    • Percentage of complex words
    • Fog Index calculation
    • Syllable analysis
    • Personal pronoun counting
  • 💾 Output

    • Clean CSV export
    • Structured data format
    • Easy integration with data analysis tools

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • Chrome WebDriver (for Selenium)
  • UV (Ultra-fast Python package installer and resolver)

Installation with UV (Recommended)

  1. Install UV if you haven't already:

    curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone and set up the project:

    git clone https://github.com/codeMaestro78/Text-Analysis-and-Sentiment-Analysis-Tool.git
    cd Text-Analysis-and-Sentiment-Analysis-Tool
  3. Install dependencies with UV:

    uv pip install -r requirements.txt
  4. Download required NLTK data:

    python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"

Running the Application

To run the analysis using UV:

uv run main.py

Or for verbose output:

uv run main.py --verbose

📁 Project Structure

assignment/
├── Input.csv                  # Input file with URLs to analyze
├── Output_Data_Structure.csv  # Template for output data
├── main.py                    # Main application script
├── requirements.txt           # Python dependencies
├── README.md                  # This documentation
├── MasterDictionary/          # Sentiment analysis word lists
│   ├── positive-words.txt     # Positive sentiment words
│   └── negative-words.txt     # Negative sentiment words
└── StopWords/                 # Text processing stop words
    └── *.txt                  # Various stop word lists
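
The sentiment and stop-word files are plain text with one word per line. Below is a minimal loading sketch assuming that layout; the file names come from the tree above, but this is not necessarily how main.py reads them:

    from pathlib import Path

    def load_words(path):
        # One word per line; some published word lists are latin-1 encoded
        # and use ';' for comment lines, so read permissively and skip those.
        return {
            line.strip().lower()
            for line in path.read_text(encoding="latin-1").splitlines()
            if line.strip() and not line.startswith(";")
        }

    positive_words = load_words(Path("MasterDictionary/positive-words.txt"))
    negative_words = load_words(Path("MasterDictionary/negative-words.txt"))
    stop_words = set()
    for wordlist in Path("StopWords").glob("*.txt"):
        stop_words |= load_words(wordlist)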

🔍 Output Metrics

The analysis generates comprehensive metrics for each URL, including:

Metric                              Description
POSITIVE SCORE                      Score indicating positive sentiment
NEGATIVE SCORE                      Score indicating negative sentiment
POLARITY SCORE                      Overall sentiment polarity (-1 to 1)
SUBJECTIVITY SCORE                  How subjective the text is (0 to 1)
AVG SENTENCE LENGTH                 Average number of words per sentence
PERCENTAGE OF COMPLEX WORDS         Percentage of complex words in the text
FOG INDEX                           Gunning Fog readability score
AVG NUMBER OF WORDS PER SENTENCE    Average number of words per sentence
COMPLEX WORD COUNT                  Number of complex words
WORD COUNT                          Total number of words
SYLLABLE PER WORD                   Average syllables per word
PERSONAL PRONOUNS                   Count of personal pronouns
AVG WORD LENGTH                     Average word length in characters
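
The exact formulas main.py applies are not documented here; the sketch below uses the standard definitions commonly associated with these metrics and is an illustration, not the tool's authoritative implementation:

    # Illustrative only -- the actual computation in main.py may differ.
    def sentiment_and_readability(tokens, sentence_count,
                                  positive_words, negative_words,
                                  complex_word_count):
        word_count = len(tokens)
        positive_score = sum(1 for w in tokens if w in positive_words)
        negative_score = sum(1 for w in tokens if w in negative_words)
        # Polarity in [-1, 1]; the small constant guards against division by zero.
        polarity = (positive_score - negative_score) / (positive_score + negative_score + 1e-6)
        # Subjectivity in [0, 1]: share of words that carry any sentiment.
        subjectivity = (positive_score + negative_score) / (word_count + 1e-6)
        avg_sentence_length = word_count / max(sentence_count, 1)
        pct_complex = complex_word_count / max(word_count, 1)
        # Standard Gunning Fog Index: 0.4 * (words per sentence + % complex words).
        fog_index = 0.4 * (avg_sentence_length + pct_complex * 100)
        return {
            "POSITIVE SCORE": positive_score,
            "NEGATIVE SCORE": negative_score,
            "POLARITY SCORE": polarity,
            "SUBJECTIVITY SCORE": subjectivity,
            "AVG SENTENCE LENGTH": avg_sentence_length,
            "PERCENTAGE OF COMPLEX WORDS": pct_complex * 100,
            "FOG INDEX": fog_index,
        }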

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🛠️ Usage

Input Format

Create an Input.csv file in the project root with the following format:

URL_ID,URL
id1,https://example.com/article1
id2,https://example.com/article2
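
For reference, here is a minimal sketch of reading this file with pandas; the URL_ID and URL column names come from the format above, while the use of pandas itself is an assumption about main.py:

    import pandas as pd

    urls = pd.read_csv("Input.csv")
    for _, row in urls.iterrows():
        print(row["URL_ID"], row["URL"])  # each pair is scraped and analysed in turn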

Running the Analysis

  1. Basic Usage:

    uv run main.py
  2. With Custom Input File:

    uv run main.py --input custom_input.csv
  3. Verbose Mode (for debugging):

    uv run main.py --verbose

What Happens Next?

The script will (see the pipeline sketch after this list):

  1. Read URLs from the input file
  2. Extract and clean article content
  3. Perform comprehensive text and sentiment analysis
  4. Generate output files in the project directory
  5. Display progress and summary statistics
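
In outline, the pipeline looks roughly like the sketch below. The helpers extract_article and analyse_text are hypothetical placeholders, not the actual functions in main.py:

    import pandas as pd

    def extract_article(url):
        # Hypothetical placeholder: the real scraper lives in main.py.
        return ""

    def analyse_text(text):
        # Hypothetical placeholder: the real analysis lives in main.py.
        return {}

    def run_pipeline(input_path="Input.csv", output_path="Output.csv", verbose=False):
        urls = pd.read_csv(input_path)                      # 1. read URLs from the input file
        results = []
        for _, row in urls.iterrows():
            text = extract_article(row["URL"])              # 2. extract and clean article content
            metrics = analyse_text(text)                    # 3. sentiment and readability analysis
            results.append({"URL_ID": row["URL_ID"], "URL": row["URL"], **metrics})
            if verbose:
                print(f"Processed {row['URL_ID']}")         # 5. progress reporting
        pd.DataFrame(results).to_csv(output_path, index=False)  # 4. write the output CSV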

Output

The analysis writes one row per URL to the output CSV, containing all of the metrics listed in the 📊 Output Metrics section above: sentiment scores, polarity, subjectivity, readability measures, word, syllable and pronoun counts, and average word length.

Configuration

You can modify the following in the code if needed:

  • use_selenium in the WebScraper class to toggle between Selenium and requests (see the sketch after this list)
  • Timeout values for web requests
  • Output file names and locations
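
As a rough sketch of what that toggle might look like, assuming the scraper switches between requests and Selenium (the WebScraper name and use_selenium flag are mentioned above; everything else is illustrative, not the actual code in main.py):

    import requests
    from bs4 import BeautifulSoup

    class WebScraper:
        def __init__(self, use_selenium=False, timeout=10):
            self.use_selenium = use_selenium   # toggle Selenium vs. plain requests
            self.timeout = timeout             # raise this for slow sites

        def fetch(self, url):
            if self.use_selenium:
                from selenium import webdriver  # imported lazily; needs ChromeDriver
                driver = webdriver.Chrome()
                try:
                    driver.get(url)
                    html = driver.page_source
                finally:
                    driver.quit()
            else:
                html = requests.get(url, timeout=self.timeout).text
            # Crude article-body extraction: keep only paragraph text.
            soup = BeautifulSoup(html, "html.parser")
            return " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))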

Troubleshooting

  • SSL Certificate Errors: If you encounter SSL errors, you can pass verify=False to the requests calls while debugging (see the snippet after this list); this is not recommended for production
  • WebDriver Issues: Ensure Chrome and ChromeDriver versions are compatible
  • Missing Dependencies: Make sure all required packages are installed
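
For example, a quick way to disable certificate verification while debugging, assuming the fetch goes through requests:

    import requests
    import urllib3

    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)  # silence the insecure-request warnings
    url = "https://example.com/article1"
    html = requests.get(url, timeout=10, verify=False).text  # verify=False: debugging only, never in production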

Acknowledgments

  • NLTK for natural language processing
  • TextStat for readability calculations
  • BeautifulSoup and Selenium for web scraping
