File Statistics Analyzer 📊

A powerful Python command-line tool that analyzes text files and provides comprehensive statistics including word frequency, character count, vocabulary richness, and more. Export results to JSON for further analysis.

Features

✅ Line Count - Total number of lines in the file
✅ Word Count - Total number of words (punctuation-free)
✅ Character Count - Total alphabetic characters
✅ Most Common Words - Identify top N most frequently used words
✅ Average Word Length - Calculate mean word length
✅ Detailed Statistics - Longest/shortest words, unique word count, vocabulary richness
✅ JSON Export - Export all statistics to structured JSON format
✅ Clean Output - Formatted, easy-to-read terminal output

Installation

Prerequisites

Python 3.7 or higher
No external dependencies required (uses only Python standard library)

Setup

Clone the repository:

git clone https://github.com/yourusername/file-stats-analyzer.git
cd file-stats-analyzer

(Optional) Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

You're ready to go! No pip install needed.

Usage

Basic Usage

Analyze a text file:

python file_stats.py sample.txt

Output:

📊 File Statistics for 'sample.txt'
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Lines:           245
Words:           1,823
Characters:      9,456
Most common:     "the" (45 times)
Avg word length: 4.2 characters

Show Top N Words

Display the top 5 most common words:

python file_stats.py sample.txt --top 5

Output:

📊 File Statistics for 'sample.txt'
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Lines:           245
Words:           1,823
Characters:      9,456

Top 5 Most Common Words:
  1. "the" - 45 times
  2. "and" - 32 times
  3. "to" - 28 times
  4. "a" - 25 times
  5. "in" - 22 times

Avg word length: 4.2 characters

Detailed Statistics

Get in-depth analysis:

python file_stats.py sample.txt --detailed

Output:

📊 File Statistics for 'sample.txt'
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Lines:           245
Words:           1,823
Characters:      9,456
Most common:     "the" (45 times)
Avg word length: 4.2 characters

📋 Detailed Statistics:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Longest word:    "understanding" (13 chars)
Shortest word:   "a" (1 char)
Unique words:    523
Vocabulary:      28.7% (523 unique / 1,823 total)

Export to JSON

Save results to a JSON file:

python file_stats.py sample.txt --export results.json

JSON Output (results.json):

{
  "file": "sample.txt",
  "statistics": {
    "lines": 245,
    "words": 1823,
    "characters": 9456,
    "average_word_length": 4.2
  },
  "most_common_word": {
    "word": "the",
    "count": 45
  }
}

Combine All Features

Get the complete analysis:

python file_stats.py sample.txt --top 10 --detailed --export complete.json

Command-Line Options

Option	Type	Description
`filename`	Required	Path to the text file to analyze
`--top N`	Optional	Show top N most common words (default: 1)
`--detailed`	Flag	Show detailed statistics (longest/shortest word, unique count, etc.)
`--export FILE`	Optional	Export results to JSON file

Examples

Example 1: Quick Analysis

python file_stats.py mybook.txt

Example 2: Top 20 Words with Export

python file_stats.py research_paper.txt --top 20 --export paper_analysis.json

Example 3: Complete Analysis

python file_stats.py article.txt --top 15 --detailed --export article_stats.json

Example 4: Help

python file_stats.py --help

JSON Export Format

Basic Export

{
  "file": "sample.txt",
  "statistics": {
    "lines": 245,
    "words": 1823,
    "characters": 9456,
    "average_word_length": 4.2
  },
  "most_common_word": {
    "word": "the",
    "count": 45
  }
}

With Top Words (--top 5)

{
  "file": "sample.txt",
  "statistics": {
    "lines": 245,
    "words": 1823,
    "characters": 9456,
    "average_word_length": 4.2
  },
  "top_words": [
    {"word": "the", "count": 45},
    {"word": "and", "count": 32},
    {"word": "to", "count": 28},
    {"word": "a", "count": 25},
    {"word": "in", "count": 22}
  ]
}

With Detailed Stats (--detailed)

{
  "file": "sample.txt",
  "statistics": {
    "lines": 245,
    "words": 1823,
    "characters": 9456,
    "average_word_length": 4.2
  },
  "most_common_word": {
    "word": "the",
    "count": 45
  },
  "detailed": {
    "longest_word": "understanding",
    "shortest_word": "a",
    "unique_words": 523,
    "vocabulary_richness": 0.287
  }
}

Project Structure

file-stats-analyzer/
├── file_stats.py          # Main application
├── sample.txt             # Sample test file
├── tests/                 # Test files and scripts
│   ├── test_basic.sh      # Basic functionality tests
│   ├── test_edge_cases.sh # Edge case tests
│   ├── empty.txt          # Empty file test
│   ├── single_word.txt    # Single word test
│   └── large.txt          # Large file test
├── README.md              # This file
├── .gitignore            # Git ignore rules
└── LICENSE               # MIT License

How It Works

Algorithm

The tool performs a single-pass analysis for optimal performance:

Read File Line by Line - Memory efficient, handles large files
Simultaneous Counting - Lines, words, and characters in one pass
Word Normalization - Lowercase conversion and punctuation removal
Frequency Analysis - Uses Python's Counter for efficient word counting
Statistical Calculation - Computes averages and finds extremes
Export - Structures data and writes to JSON

Time Complexity

Reading & Counting: O(n) where n = number of characters
Finding Top Words: O(k log k) where k = unique words
Overall: O(n + k log k)

Space Complexity

Memory Usage: O(w) where w = number of words
Streaming: Only one line in memory at a time during file reading

Technical Details

Word Definition

A "word" is any sequence of alphabetic characters
Punctuation is automatically removed
Case-insensitive (all words converted to lowercase)
Empty strings after cleaning are ignored

Character Count

Only alphabetic characters (a-z, A-Z) are counted
Numbers, punctuation, and whitespace are excluded

Vocabulary Richness

Calculated as: unique_words / total_words

0.0 - 0.3: Low vocabulary (repetitive)
0.3 - 0.6: Moderate vocabulary
0.6 - 1.0: High vocabulary (diverse)

Use Cases

📚 Writers - Analyze writing style and vocabulary usage
🎓 Students - Check essay complexity and word diversity
📊 Researchers - Process and analyze text documents
💼 Data Analysts - Quick text file statistics
🔍 SEO Specialists - Keyword frequency analysis
📝 Content Creators - Content quality metrics

Error Handling

The tool gracefully handles common errors:

# File not found
python file_stats.py nonexistent.txt
❌ Error: File doesn't exist: nonexistent.txt

# Invalid arguments
python file_stats.py sample.txt --top abc
usage: file_stats.py [-h] [--top TOP] [--detailed] [--export FILE] filename
file_stats.py: error: argument --top: invalid int value: 'abc'

Performance

Tested on various file sizes:

File Size	Lines	Words	Processing Time
1 KB	20	150	< 0.01s
100 KB	2,000	15,000	< 0.1s
1 MB	20,000	150,000	~0.5s
10 MB	200,000	1,500,000	~5s
100 MB	2,000,000	15,000,000	~50s

Tested on: Intel i5 8th Gen, 8GB RAM, SSD

Future Enhancements

Planned features for future releases:

Development

Running Tests

# Run all tests
cd tests
bash test_basic.sh
bash test_edge_cases.sh

# Or individually
python file_stats.py tests/empty.txt
python file_stats.py tests/single_word.txt
python file_stats.py tests/large.txt --top 10 --detailed

Code Style

Follows PEP 8 guidelines
Type hints where applicable
Docstrings for all methods
Single responsibility principle

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Your Name

GitHub: @ahmed amin
Email: ahmedamin1700@gmail.com

Acknowledgments

Built as part of a progressive Python learning path
Inspired by Unix wc (word count) command
Uses Python's collections.Counter for efficient frequency counting
Thanks to the Python community for excellent documentation

Changelog

Version 1.1.0 (Current)

✨ Added --top N flag for multiple most common words
✨ Added --detailed flag for extended statistics
✨ Added --export flag for JSON output
🐛 Fixed edge case handling for empty files
📚 Improved documentation

Version 1.0.0

🎉 Initial release
✅ Basic file statistics
✅ Word frequency analysis
✅ Command-line interface

⭐ If you found this useful, please star the repository!

📧 Questions or suggestions? Open an issue or reach out directly.

🚀 Happy analyzing!

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
tests		tests
.gitignore		.gitignore
Pipfile		Pipfile
Pipfile.lock		Pipfile.lock
README.md		README.md
file_stats.py		file_stats.py
sample.txt		sample.txt

ahmedamin1700/python-file-stats-analyzer

Folders and files

Latest commit

History

Repository files navigation

File Statistics Analyzer 📊

Features

Installation

Prerequisites

Setup

Usage

Basic Usage

Show Top N Words

Detailed Statistics

Export to JSON

Combine All Features

Command-Line Options

Examples

Example 1: Quick Analysis

Example 2: Top 20 Words with Export

Example 3: Complete Analysis

Example 4: Help

JSON Export Format

Basic Export

With Top Words (--top 5)

With Detailed Stats (--detailed)

Project Structure

How It Works

Algorithm

Time Complexity

Space Complexity

Technical Details

Word Definition

Character Count

Vocabulary Richness

Use Cases

Error Handling

Performance

Future Enhancements

Development

Running Tests

Code Style

Contributing

License

Author

Acknowledgments

Changelog

Version 1.1.0 (Current)

Version 1.0.0

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages