A powerful Python-based tool for comprehensive text analysis, sentiment analysis, and readability assessment of web articles. Extract, process, and analyze content with ease.
🌐 Web Scraping
- Extracts article content from any URL
- Handles both static and JavaScript-heavy websites
- Built-in fallback mechanisms for robust content extraction
📊 Sentiment Analysis
- Positive/Negative sentiment scoring
- Polarity and subjectivity metrics
- Emotion detection
📝 Text Complexity Analysis
- Flesch Reading Ease score
- Average sentence and word length
- Percentage of complex words
- Fog Index calculation
- Syllable analysis
- Personal pronoun counting
💾 Output
- Clean CSV export
- Structured data format
- Easy integration with data analysis tools
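Because the export is a flat CSV, results drop straight into standard tooling. A minimal sketch with the standard library (column names follow the output metrics listed later in this README; the values below are illustrative only):

```python
import csv
import io

# Stand-in for the exported CSV (real runs read the generated file instead).
exported = io.StringIO(
    "URL_ID,POSITIVE SCORE,NEGATIVE SCORE\n"
    "id1,12,4\n"
    "id2,7,9\n"
)
rows = list(csv.DictReader(exported))

# Example downstream computation: net sentiment per URL.
net = [int(r["POSITIVE SCORE"]) - int(r["NEGATIVE SCORE"]) for r in rows]
print(net)  # [8, -2]
```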
- Python 3.8 or higher
- Chrome WebDriver (for Selenium)
- UV (Ultra-fast Python package installer and resolver)
Install UV if you haven't already:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Clone and set up the project:

```bash
git clone https://github.com/codeMaestro78/Text-Analysis-and-Sentiment-Analysis-Tool.git
cd blackcoffer_assignment/assignment
```

Install dependencies with UV:

```bash
uv pip install -r requirements.txt
```
Download required NLTK data:

```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
```

To run the analysis using UV:

```bash
uv run main.py
```

Or for verbose output:

```bash
uv run main.py --verbose
```

```
assignment/
├── Input.csv                   # Input file with URLs to analyze
├── Output_Data_Structure.csv   # Template for output data
├── main.py                     # Main application script
├── requirements.txt            # Python dependencies
├── README.md                   # This documentation
├── MasterDictionary/           # Sentiment analysis word lists
│   ├── positive-words.txt      # Positive sentiment words
│   └── negative-words.txt      # Negative sentiment words
└── StopWords/                  # Text processing stop words
    └── *.txt                   # Various stop word lists
```
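The word lists under MasterDictionary/ are plain newline-delimited files. A minimal parser might look like the sketch below; treating `;`-prefixed lines as comments is an assumption (it matches the widely used Hu & Liu sentiment lists, but check the bundled files):

```python
def parse_word_list(text: str) -> set[str]:
    """Parse a newline-delimited word list into a lowercase set.
    Blank lines and ';'-prefixed comment lines (an assumption about
    these files) are skipped."""
    return {
        line.strip().lower()
        for line in text.splitlines()
        if line.strip() and not line.startswith(";")
    }

# Usage with the files from the tree above, e.g.:
#   positive = parse_word_list(
#       Path("MasterDictionary/positive-words.txt").read_text(encoding="latin-1"))
sample = "; header comment\ngood\nGREAT\n\n"
print(sorted(parse_word_list(sample)))  # ['good', 'great']
```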
The analysis generates comprehensive metrics for each URL, including:
| Metric | Description |
|---|---|
| POSITIVE SCORE | Score indicating positive sentiment |
| NEGATIVE SCORE | Score indicating negative sentiment |
| POLARITY SCORE | Overall sentiment polarity (-1 to 1) |
| SUBJECTIVITY SCORE | How subjective the text is (0 to 1) |
| AVG SENTENCE LENGTH | Average number of words per sentence |
| PERCENTAGE OF COMPLEX WORDS | Percentage of complex words in the text |
| FOG INDEX | Readability metric |
| AVG NUMBER OF WORDS PER SENTENCE | Average word count per sentence |
| COMPLEX WORD COUNT | Number of complex words |
| WORD COUNT | Total number of words |
| SYLLABLE PER WORD | Average syllables per word |
| PERSONAL PRONOUNS | Count of personal pronouns |
| AVG WORD LENGTH | Average length of words in characters |
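The exact formulas live in main.py; as a rough sketch, the headline scores are commonly computed as below. The small epsilon guard and the Gunning Fog constant 0.4 are the conventional choices and are assumed here, not confirmed from the code:

```python
def polarity(pos: float, neg: float) -> float:
    # Ranges over -1..1; epsilon guards against division by zero.
    return (pos - neg) / ((pos + neg) + 1e-6)

def subjectivity(pos: float, neg: float, total_words: int) -> float:
    # Ranges over 0..1: fraction of words carrying sentiment.
    return (pos + neg) / (total_words + 1e-6)

def fog_index(avg_sentence_len: float, pct_complex: float) -> float:
    # Gunning Fog: 0.4 * (avg sentence length + % complex words).
    return 0.4 * (avg_sentence_len + pct_complex)

print(round(polarity(12, 4), 3))        # 0.5
print(round(fog_index(18.0, 22.0), 1))  # 16.0
```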
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Create an Input.csv file in the project root with the following format:

```csv
URL_ID,URL
id1,https://example.com/article1
id2,https://example.com/article2
```
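It can be worth sanity-checking input rows before a long run. A small validator for the format above (`valid_row` is a hypothetical helper, not part of main.py):

```python
from urllib.parse import urlparse

def valid_row(row: dict) -> bool:
    """Check one Input.csv row: non-empty URL_ID and an http(s) URL."""
    u = urlparse(row.get("URL", ""))
    return bool(row.get("URL_ID")) and u.scheme in ("http", "https") and bool(u.netloc)

print(valid_row({"URL_ID": "id1", "URL": "https://example.com/article1"}))  # True
print(valid_row({"URL_ID": "id2", "URL": "not-a-url"}))                     # False
```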
Basic Usage:

```bash
uv run main.py
```

With Custom Input File:

```bash
uv run main.py --input custom_input.csv
```

Verbose Mode (for debugging):

```bash
uv run main.py --verbose
```
The script will:
- Read URLs from the input file
- Extract and clean article content
- Perform comprehensive text and sentiment analysis
- Generate output files in the project directory
- Display progress and summary statistics
The analysis generates the following metrics for each URL:
- POSITIVE SCORE: Score indicating positive sentiment
- NEGATIVE SCORE: Score indicating negative sentiment
- POLARITY SCORE: Overall sentiment polarity (-1 to 1)
- SUBJECTIVITY SCORE: How subjective the text is (0 to 1)
- AVG SENTENCE LENGTH: Average number of words per sentence
- PERCENTAGE OF COMPLEX WORDS: Percentage of complex words in the text
- FOG INDEX: Readability metric
- AVG NUMBER OF WORDS PER SENTENCE: Average word count per sentence
- COMPLEX WORD COUNT: Number of complex words
- WORD COUNT: Total number of words
- SYLLABLE PER WORD: Average syllables per word
- PERSONAL PRONOUNS: Count of personal pronouns
- AVG WORD LENGTH: Average length of words in characters
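Two of the less obvious metrics above can be sketched in a few lines. The exact pronoun list, the case-sensitive exclusion of "US" (the country), and the vowel-group syllable rule are assumptions about this implementation, not confirmed from main.py:

```python
import re

def personal_pronouns(text: str) -> int:
    # Count I, we, my, ours, us as whole words; drop uppercase "US",
    # which is almost always the country rather than a pronoun.
    matches = re.findall(r"\b(I|we|my|ours|us)\b", text, flags=re.IGNORECASE)
    return sum(1 for m in matches if m != "US")

def syllables(word: str) -> int:
    # Heuristic: count vowel groups, with a silent-e adjustment.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)

print(personal_pronouns("We took my notes to the US office."))  # 2
print(syllables("readable"))                                    # 3
```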
You can modify the following in the code if needed:
use_seleniuminWebScraperclass to toggle between Selenium and requests- Timeout values for web requests
- Output file names and locations
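As a hypothetical sketch of how such a toggle is typically wired (the class and attribute names follow the list above, but the body and helper names are illustrative, not the actual main.py code):

```python
class WebScraper:
    """Illustrative shape of the scraper configuration described above."""

    def __init__(self, use_selenium: bool = False, timeout: int = 10):
        # use_selenium=True routes fetches through a browser driver for
        # JavaScript-heavy pages; timeout applies to plain HTTP requests.
        self.use_selenium = use_selenium
        self.timeout = timeout

    def fetch(self, url: str) -> str:
        if self.use_selenium:
            return self._fetch_with_selenium(url)  # hypothetical helper
        return self._fetch_with_requests(url)      # hypothetical helper
```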
- SSL Certificate Errors: If you encounter SSL errors, try running with `verify=False` in the requests calls (not recommended for production)
- WebDriver Issues: Ensure Chrome and ChromeDriver versions are compatible
- Missing Dependencies: Make sure all required packages are installed
- NLTK for natural language processing
- TextStat for readability calculations
- BeautifulSoup and Selenium for web scraping