A collection of Python scripts that analyze web server traffic logs and report on traffic and security observations.

focusedhunts/WebLogAnalysis

📊 Web Log Analyzer v2.3

Transform raw server logs into AI-powered business insights, security intelligence, and SEO analysis.

Web Log Analyzer is a comprehensive Python tool that reads website server logs to generate easy-to-understand reports. It features AI-powered narratives, content classification, and sitemap analysis to provide deep insights into visitor behavior, security threats, and SEO health. It's perfect for small business owners who want professional-grade analytics without the complexity or cost of enterprise solutions.


🚀 TL;DR - Quick Setup

⚡ Essential 5-Minute Setup:

  1. Copy configuration file:

    cp config_template.py config.py # Create your own config from the template
  2. Edit these MUST-CHANGE settings in config.py:

    # Your website's developer/admin IPs (REQUIRED - replace with your IP)
    DEVELOPER_IPS = ['YOUR.IP.ADDRESS.HERE']  # Find your IP at whatismyip.com
    
    # Your website domain (Recommended for sitemap & SEO analysis)
    SITEMAP_DOMAIN = "yourdomain.com"
    
    # Enrichment API Keys (Optional but recommended for rich insights)
    IPINFO_TOKEN = "your_token_here"      # Free at ipinfo.io
    ABUSEIPDB_KEY = "your_key_here"       # Free at abuseipdb.com
    
    # AI Narrative Generation (Optional - requires API key)
    AI_PROVIDER = "gemini"  # or "claude"
    GOOGLE_AI_API_KEY = "your_gemini_api_key"
    # ANTHROPIC_API_KEY = "your_claude_api_key"
  3. Put your log files in the input/ directory

  4. Run analysis:

    # Generate standard unified reports (recommended)
    python weblog_analyzer.py
    
    # For more options, like forcing a refresh of all data:
    python weblog_analyzer.py --force
    python weblog_analyzer.py --help
  5. View your reports in output/ directory:

    • output/website_analytics_report_business.md - For business insights
    • output/website_analytics_report_security.md - For security analysis
    • output/index.html - A dashboard linking to all reports

🎯 Critical Settings to Change:

  • DEVELOPER_IPS: Replace 'YOUR.IP.ADDRESS.HERE' with your actual IP address to exclude your own visits from analytics.
  • SITEMAP_DOMAIN: Add your domain for sitemap and SEO analysis.
  • API Keys: Add free API keys for AI summaries, geolocation, and threat intelligence.
  • IP Enrichment: The tool will automatically download a local IP database (.mmdb) on first run to input/IPinfo/ to reduce API calls.

That's it! Everything else has sensible defaults.


🎯 Who Is This For?

👔 Business Owners & Marketers

  • See how many real people (not bots) visit your site
  • Discover which pages are most popular with actual customers
  • Understand where your visitors come from geographically
  • Track growth trends over time
  • Get insights typically only available with expensive analytics tools

🛡️ Website Owners & Developers

  • Identify security threats and attack patterns before they become problems
  • Analyze bot traffic and distinguish between helpful and harmful bots
  • Get specific recommendations for improving site security
  • Monitor site health and technical performance
  • Understand visitor device preferences (mobile vs desktop)

📈 Marketing Teams

  • See which content drives the most engagement
  • Understand visitor behavior patterns
  • Identify your most loyal visitors (return customers)
  • Track referral sources and marketing campaign effectiveness

✨ What Makes This Special

🔍 Deep Intelligence with Free APIs

Unlike basic log analyzers, this tool enriches your data with:

  • Geographic insights - See exactly where your visitors are located
  • Threat intelligence - Know which IPs are potentially dangerous
  • Network analysis - Understand if visitors are on residential, business, or hosting networks
  • Device breakdown - Mobile vs desktop usage patterns

📊 Business-Focused Reporting

Two separate reports tailored for different audiences:

  • Business Report: Easy-to-read insights for owners and marketers
  • Security Report: Technical details for developers and IT teams

🤖 Smart Bot Detection

Sophisticated bot classification that separates:

  • Beneficial bots (Google, Bing search crawlers)
  • Neutral bots (SEO tools, monitoring services)
  • Suspicious bots (potential scrapers or attackers)
  • Malicious tools (known attack frameworks)

📈 Historical Trending

  • Compare this month to previous months automatically
  • Identify growth patterns and seasonal trends
  • Get warnings when unusual activity occurs
  • Build a picture of your site's growth over time

🚀 Key Features

| Feature | What It Does | Business Value |
|---|---|---|
| Visitor Separation | Filters out bots to show real human traffic | Know your actual customer count |
| Geographic Insights | Shows where visitors come from | Target marketing by location |
| Device Analysis | Mobile vs desktop breakdown | Optimize for your visitors' preferences |
| Popular Content | Which pages get the most real visitors | Focus on content that works |
| Security Monitoring | Detects attacks and threats automatically | Protect your business reputation |
| Return Visitor Tracking | Identifies loyal customers | Understand customer loyalty |
| Historical Trends | Compares current to past performance | Track business growth |
| API Quota Management | Uses free tiers efficiently | Get enterprise insights at no cost |
| Page-Level Trends | Tracks performance of individual pages over time | Optimize high-performing content |

💻 Quick Start

Prerequisites

  • Python 3.8 or higher
  • Access to your website's server logs
  • 10 minutes of setup time

Installation

  1. Download and setup

    git clone https://github.com/focused-hunts/weblog-analyzer.git
    cd weblog-analyzer
    pip install -r requirements.txt
  2. Configure the tool (Critical Step)

    cp config_template.py config.py
    # Edit config.py with your settings (see TL;DR section above)
  3. Add your log files

    • Place your server log files in the input/ directory
    • Supports both .log and .log.gz (compressed) files
    • Works with Apache and Nginx Combined Log Format by default
  4. Run your first analysis

    # Default unified report generation
    python weblog_analyzer.py
    
    # Explicitly run with switches
    # (Note: 'unified' is the default command and can be omitted)
    python weblog_analyzer.py unified --cache-only
    python weblog_analyzer.py unified --force

Your First Reports

After running, check the output/ directory for:

  • website_analytics_report_business.md - Your business intelligence report
  • website_analytics_report_security.md - Security analysis and threats
  • index.html - Quick overview dashboard
  • website_analytics_report.json - Raw data for debugging or other tools

🔑 API Keys Setup (Optional but Recommended)

While the tool works without API keys, adding them provides much richer insights:

🌍 IPinfo (Geographic Data)

  • What it adds: Visitor locations, ISP information, network details
  • Free tier: 50,000 lookups/month
  • Get key: ipinfo.io
  • Configuration: IPINFO_TOKEN in config.py
  • Business value: Understand your market geography, optimize for local customers

🛡️ AbuseIPDB (Threat Intelligence)

  • What it adds: Security threat scores, IP reputation data
  • Free tier: 1,000 lookups/day
  • Get key: abuseipdb.com
  • Business value: Identify high-risk visitors, protect against fraud
  • Configuration: ABUSEIPDB_KEY in config.py

🔍 GreyNoise (Scanner Detection)

  • What it adds: Identifies automated scanners vs real visitors
  • Free tier: Community access
  • Get key: greynoise.io
  • Business value: Better bot detection, cleaner visitor metrics

🤖 AI Narratives (Gemini or Claude)

  • What it adds: Generates executive summaries and insights in plain English.
  • Get key: Google AI Studio (for Gemini) or Anthropic (for Claude).
  • Business value: Turns complex data into easy-to-understand reports.
  • Configuration: Set AI_PROVIDER to "gemini" or "claude" in config.py and add the corresponding API key (GOOGLE_AI_API_KEY or ANTHROPIC_API_KEY).

📍 IPGeolocation (Validation)

  • What it adds: Validates location data for accuracy
  • Free tier: 1,000 lookups/month
  • Get key: ipgeolocation.io
  • Configuration: IPGEOLOCATION_API_KEY in config.py
  • Business value: More accurate geographic insights

πŸ’‘ Tip: Start with just IPinfo and AbuseIPDB - they provide 90% of the value!


📋 Understanding Your Reports

📈 Business Report Highlights

Visitor Overview

🏠 Real Visitors This Month: 1,247 people
πŸ”„ Return Customers: 23% (287 visitors came back)
🌍 Top Countries: United States (45%), Canada (12%), UK (8%)
πŸ“± Mobile Users: 67% of your visitors prefer mobile

Content Performance

  • Which pages get the most real visitor attention
  • How long people spend on different sections
  • Which content drives return visits

Growth Trends

  • Month-over-month visitor growth
  • Seasonal patterns in your traffic
  • Comparison to your historical performance

🛡️ Security Report Highlights

Threat Level Assessment

🟒 Security Status: NORMAL
🚨 Attack Attempts: 12 blocked (0.3% of traffic)
🤖 Bot Traffic: 52% (mostly search engines)
⚠️  High-Risk IPs: 3 identified and flagged

Attack Analysis

  • Types of attacks attempted (SQL injection, etc.)
  • Geographic sources of threats
  • Recommendations for additional protection
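
The attack types listed above can be illustrated with a small pattern matcher. This is a minimal sketch, not the tool's actual detection logic: the `ATTACK_PATTERNS` table and the `detect_attacks` function are hypothetical names, and the handful of signatures shown would produce both misses and false positives on real traffic.

```python
import re
from urllib.parse import unquote_plus

# Illustrative signatures only (assumption); real detection needs a much broader set.
ATTACK_PATTERNS = {
    "sql_injection": re.compile(r"union\s+select|\bor\s+1\s*=\s*1|information_schema", re.I),
    "path_traversal": re.compile(r"\.\./|\.\.\\"),
    "xss": re.compile(r"<script\b", re.I),
    "sensitive_file": re.compile(r"/\.env\b|/\.git/|/wp-config\.php", re.I),
}


def detect_attacks(request_path):
    """Return the sorted names of attack patterns found in a request path."""
    # Decode %XX escapes and '+' first, since payloads are usually URL-encoded.
    decoded = unquote_plus(request_path)
    return sorted(
        name for name, pattern in ATTACK_PATTERNS.items() if pattern.search(decoded)
    )
```

Counting the returned labels per source IP or per country would give the kind of breakdown the security report describes.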

🛠️ Configuration Deep Dive

Essential Settings

See config_template.py for a full list of options. The most critical ones to change are DEVELOPER_IPS and SITEMAP_DOMAIN.

AI & SEO Configuration

To enable AI-powered narratives and SEO analysis, set the following in config.py:

# AI provider for generating report narratives ('gemini' or 'claude')
AI_PROVIDER = "gemini" 

# Corresponding API Keys for the selected provider
GOOGLE_AI_API_KEY = "your_gemini_key_here"
ANTHROPIC_API_KEY = "your_claude_key_here"

Report & Cache Settings

# What reports to generate
GENERATE_BUSINESS_REPORT = True          # Always recommended
GENERATE_SECURITY_REPORT = True          # Recommended for all sites
GENERATE_HTML = True                     # Nice overview page

# Historical analysis (how many months back to compare)
UNIFIED_HISTORICAL_MONTHS = 6            # 6 months gives good context for trends

# How long to cache API responses (saves money!)
CACHE_TTL_DAYS = 90                      # 90 days is recommended

Example: Content Creator

"Analytics revealed our 'how-to' posts had 3x higher return visitor rates than news posts. We shifted our content strategy and built a more loyal audience."

Example: Security Discovery

"We found IP addresses from a known hosting provider making 500+ requests per day to admin pages. We blocked the range and reduced server load by 15%."


🔧 Advanced Features

📱 Device & Browser Analysis

  • Desktop vs Mobile vs Tablet breakdown
  • Browser version tracking
  • Operating system distribution
  • User agent anomaly detection (fake browsers, bots)
  • Clearer Reporting: Accounts for traffic from unknown device types.

🌐 Network Intelligence

  • ISP and hosting provider analysis
  • VPN/Proxy detection
  • Residential vs business network identification
  • Risk scoring for different network types

🤖 Sophisticated Bot Classification

  • Beneficial: Google, Bing, Facebook crawlers
  • Neutral: SEO tools, monitoring services
  • Research: Academic institutions, security research
  • Suspicious: Scrapers, unknown automation
  • Malicious: Known attack tools and frameworks
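
To make the categories above concrete, here is a minimal sketch of user-agent classification. The `BOT_RULES` list and `classify_user_agent` function are hypothetical and the signatures are a tiny illustrative sample; the "Research" category is omitted, and because user agents are trivially spoofed, a real classifier would combine this with IP reputation and request behavior.

```python
import re

# Rules are checked in order, so more specific (and more dangerous) matches come first.
# The signature lists are illustrative assumptions, not the tool's real rule set.
BOT_RULES = [
    ("malicious", re.compile(r"sqlmap|nikto|masscan", re.I)),
    ("beneficial", re.compile(r"googlebot|bingbot|facebookexternalhit", re.I)),
    ("neutral", re.compile(r"ahrefsbot|semrushbot|uptimerobot", re.I)),
    ("suspicious", re.compile(r"python-requests|curl|wget|scrapy|bot|crawler|spider", re.I)),
]


def classify_user_agent(user_agent):
    """Return a bot category for a User-Agent string, or 'human' if no rule matches."""
    if not user_agent or user_agent == "-":
        return "suspicious"  # browsers virtually always send a User-Agent
    for label, pattern in BOT_RULES:
        if pattern.search(user_agent):
            return label
    return "human"
```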

📈 Historical Context

  • Automatically compares current month to previous months
  • Identifies growth trends and seasonal patterns
  • Alerts for unusual activity spikes
  • Builds long-term performance picture
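
The month-to-month comparison described above boils down to a relative change per month plus a threshold for flagging spikes. The sketch below assumes a hypothetical `month_over_month` helper and a 50% threshold chosen purely for illustration; the tool's own thresholds are not documented here.

```python
def month_over_month(monthly_visitors, spike_threshold=0.5):
    """Compute relative change between consecutive months.

    monthly_visitors: list of (label, count) tuples in chronological order.
    spike_threshold: fraction (0.5 = 50%) above which a change is flagged (assumed value).
    """
    results = []
    for (_, previous), (label, current) in zip(monthly_visitors, monthly_visitors[1:]):
        # Relative change is undefined when the previous month had no visitors.
        change = None if previous == 0 else (current - previous) / previous
        results.append({
            "month": label,
            "change": change,
            "unusual": change is not None and abs(change) >= spike_threshold,
        })
    return results
```

For example, going from 1,100 to 2,200 visitors is a change of 1.0 (100%), which this sketch would flag as unusual.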

📄 Page-Level Trend Analysis

  • Tracks monthly views for your most important pages.
  • Identifies which content is gaining or losing popularity.
  • Helps you focus content strategy on what works.

💾 How Log Processing Works

1. Log Discovery

Automatically finds all log files in your input directory:

  • access.log, ssl_log, access-Nov-2024.log.gz
  • Groups files by month based on timestamps or filenames
  • Handles compressed (.gz) files automatically
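
The discovery step can be sketched roughly as follows. The helper names (`open_log`, `month_of_first_entry`, `discover_logs`) are hypothetical, and this version groups files only by the first timestamp it finds in each file, whereas the tool is described as also using filenames.

```python
import gzip
import re
from collections import defaultdict
from pathlib import Path

# Matches the date part of a Combined Log Format timestamp, e.g. [10/Oct/2024:13:55:36
TIMESTAMP_RE = re.compile(r"\[(\d{2})/([A-Za-z]{3})/(\d{4}):")


def open_log(path):
    """Open a plain or gzip-compressed log file as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def month_of_first_entry(path):
    """Return 'Mon-YYYY' for the first timestamped line, or None if there is none."""
    with open_log(path) as handle:
        for line in handle:
            match = TIMESTAMP_RE.search(line)
            if match:
                _day, month, year = match.groups()
                return f"{month}-{year}"
    return None


def discover_logs(input_dir):
    """Group the files in input_dir by the month of their first log entry."""
    groups = defaultdict(list)
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file():
            groups[month_of_first_entry(path) or "unknown"].append(path.name)
    return dict(groups)
```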

2. Smart Parsing

  • Extracts visitor IPs, timestamps, pages visited, devices used
  • Detects attack patterns automatically
  • Classifies traffic as human visitors vs bots
  • Identifies suspicious activity and security threats
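
As an illustration of the parsing step, the sketch below extracts the main fields from one Combined Log Format line. `COMBINED_RE` and `parse_line` are hypothetical names, and the pattern is deliberately simple: it does not handle escaped quotes inside the request or user-agent fields.

```python
import re
from datetime import datetime

# Combined Log Format: ip ident user [time] "request" status size "referrer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # The size field is "-" when no response body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    parts = entry["request"].split()
    entry["method"], entry["path"] = (parts[0], parts[1]) if len(parts) >= 2 else (None, None)
    return entry
```

Returning None for unparseable lines lets the caller count and report them instead of aborting the whole run.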

3. Data Enrichment

  • Looks up IP locations and threat scores using APIs
  • Validates data across multiple sources for accuracy
  • Caches results to minimize API usage and costs
  • Respects free-tier limits automatically
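
The caching behavior described above can be approximated with a small JSON-backed cache that expires entries after a time-to-live. This is a sketch under stated assumptions: `LookupCache` and `enrich` are hypothetical names, the on-disk layout is invented for illustration, and `lookup` stands in for whichever API client performs the real request.

```python
import json
import time
from pathlib import Path

CACHE_TTL_DAYS = 90  # same default as the CACHE_TTL_DAYS setting in config.py


class LookupCache:
    """A tiny JSON-file cache keyed by IP address (illustrative layout)."""

    def __init__(self, path, ttl_days=CACHE_TTL_DAYS, now=time.time):
        self.path = Path(path)
        self.ttl_seconds = ttl_days * 86400
        self.now = now  # injectable clock, which makes expiry easy to test
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, ip):
        """Return cached data for ip, or None if missing or older than the TTL."""
        record = self.entries.get(ip)
        if record is None or self.now() - record["stored_at"] > self.ttl_seconds:
            return None
        return record["data"]

    def put(self, ip, data):
        """Store data for ip and persist the whole cache to disk."""
        self.entries[ip] = {"stored_at": self.now(), "data": data}
        self.path.write_text(json.dumps(self.entries))


def enrich(ip, cache, lookup):
    """Return enrichment data for ip, calling lookup(ip) only on a cache miss."""
    cached = cache.get(ip)
    if cached is not None:
        return cached
    data = lookup(ip)
    cache.put(ip, data)
    return data
```

Rewriting the whole file on every `put` keeps the sketch short; a larger cache would batch writes or use a small database instead.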

4. Business Intelligence

  • Separates real customers from bots and crawlers
  • Tracks visitor loyalty and return rates
  • Analyzes content performance and popular pages
  • Generates trend comparisons with historical data
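
One way to compute a return rate from parsed entries is shown below. It assumes a "return visitor" is an IP address seen on more than one calendar day, which is only an approximation (shared and dynamic IPs blur the picture), and `return_visitor_rate` is a hypothetical helper rather than the tool's own function.

```python
from collections import defaultdict


def return_visitor_rate(visits):
    """Fraction of visitor IPs seen on more than one distinct day.

    visits: iterable of (ip, day) pairs for human traffic only, where day is
    any hashable date value such as a datetime.date or an ISO date string.
    """
    days_seen = defaultdict(set)
    for ip, day in visits:
        days_seen[ip].add(day)
    if not days_seen:
        return 0.0
    returning = sum(1 for days in days_seen.values() if len(days) > 1)
    return returning / len(days_seen)
```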

📁 Project Structure

weblog-analyzer/
├── weblog_analyzer.py          # Main application
├── config_template.py          # Copy this to config.py
├── config.py                   # Your settings (don't commit to git!)
├── modules/
│   ├── log_parser.py          # Reads and parses log files
│   ├── analyzer.py            # Core data analysis
│   ├── enrichment.py          # Adds geographic and threat data
│   ├── reporter.py            # Creates business and security reports
│   ├── trend_manager.py       # Handles historical comparisons
│   ├── log_registry.py        # Tracks processed files
│   ├── content_classifier.py  # Classifies content (business vs. technical)
│   ├── sitemap_analyzer.py    # Analyzes sitemap coverage
│   ├── seo_analyzer.py        # Provides SEO health insights
│   ├── ai_narrator.py         # Generates AI-powered summaries
│   └── logger_setup.py        # Configures logging
├── input/                      # 📁 Put your log files here
├── output/                     # 📁 Generated reports appear here
├── cache/                      # 📁 Caches API data and AI narratives
└── README.md                   # This file

🚨 Security & Privacy

For Your Business

  • Sensitive data filtering: Your admin/developer IP addresses are automatically excluded from visitor analytics
  • Threat identification: Potential attackers are flagged but their attempts are already blocked by your web server
  • Privacy compliance: The tool analyzes visitor patterns but doesn't store personally identifiable information
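
Excluding admin and developer traffic amounts to a membership test against DEVELOPER_IPS. The sketch below uses the standard `ipaddress` module; accepting CIDR ranges in the list is an assumption made for illustration, `is_developer` is a hypothetical name, and the addresses are reserved documentation placeholders.

```python
import ipaddress

# Placeholder values (assumption); config.py holds your real addresses.
DEVELOPER_IPS = ["203.0.113.10", "198.51.100.0/24"]


def is_developer(ip, developer_ips=DEVELOPER_IPS):
    """Return True if ip matches any single address or CIDR range in developer_ips."""
    address = ipaddress.ip_address(ip)
    # A bare address parses as a /32 (or /128) network, so one check covers both forms.
    return any(
        address in ipaddress.ip_network(entry, strict=False) for entry in developer_ips
    )
```

Parsed log entries could then be filtered with something like `[e for e in entries if not is_developer(e["ip"])]` before any visitor metrics are computed.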

For the Tool

  • API key protection: Never commit your config.py file to version control
  • Local processing: All analysis happens on your server - no data is sent to third parties except the IP lookup APIs and, if AI narratives are enabled, the configured AI provider
  • Local caching: Sensitive lookup data is cached locally in cache/ to minimize external API calls

Recommendations Based on Your Reports

Your security report will provide specific recommendations like:

  • "Consider enabling Cloudflare for additional protection"
  • "IP range X.X.X.X/24 shows attack patterns - consider blocking"
  • "Your mobile visitors are growing - ensure mobile security features are enabled"

⚡ Performance & Costs

Typical Performance

  • Small sites (1,000-5,000 visitors/month): 2-5 minutes
  • Medium sites (10,000-50,000 visitors/month): 10-20 minutes
  • Larger sites (100,000+ visitors/month): 30-60 minutes

API Cost Management

  • Smart caching: 70-90% of repeat IP lookups use cached data (free)
  • Quota monitoring: Automatic warnings before hitting free-tier limits
  • Selective validation: Only validates high-value IPs to conserve quota
  • Progressive enhancement: Works without API keys, gets better with them
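
Quota monitoring of the kind listed above can be sketched as a per-day counter that warns as the limit approaches and refuses further calls once it is reached. `DailyQuota` is a hypothetical class and the 80% warning ratio is an assumed default; the limit itself would come from the provider's free tier, such as AbuseIPDB's 1,000 lookups per day.

```python
import datetime


class DailyQuota:
    """Counts lookups per calendar day and refuses calls past the limit."""

    def __init__(self, limit, warn_ratio=0.8, today=datetime.date.today):
        self.limit = limit
        self.warn_at = int(limit * warn_ratio)
        self.today = today  # injectable so the day rollover can be tested
        self.day = today()
        self.used = 0

    def try_consume(self):
        """Record one lookup and return 'ok', 'warn', or 'exhausted'."""
        if self.today() != self.day:  # a new day started: reset the counter
            self.day, self.used = self.today(), 0
        if self.used >= self.limit:
            return "exhausted"  # caller should fall back to cached data
        self.used += 1
        return "warn" if self.used >= self.warn_at else "ok"
```

A caller would check the result before each API request and skip enrichment for that IP when the quota is exhausted, which matches the "partial enrichment" behavior described in Troubleshooting.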

Typical Monthly Costs

  • No API keys: $0 (basic analytics)
  • Free tiers only: $0 (rich analytics for most small businesses)
  • Growing business: $5-15/month (if you exceed free tiers)

🎯 Use Cases by Business Type

🛒 E-commerce Sites

  • Track real customer visits vs bot traffic
  • Identify your most loyal customers (return visitors)
  • Understand geographic distribution for shipping planning
  • Monitor for payment page attacks or card testing

📝 Content Sites & Blogs

  • See which articles drive the most engagement
  • Understand your audience's device preferences
  • Track return readers vs one-time visitors
  • Identify content that builds audience loyalty

🏢 Local Service Businesses

  • Geographic analysis for local market understanding
  • Mobile usage patterns (important for local search)
  • Contact page and service page performance tracking
  • Local competition insights through referrer analysis

💼 B2B Companies

  • Business vs residential visitor identification
  • Content performance for different decision-makers
  • International market analysis
  • Lead quality assessment through visit patterns

🎨 Creative Professionals

  • Portfolio page performance analysis
  • Client geographic distribution
  • Mobile vs desktop portfolio viewing preferences
  • Contact form and inquiry pattern analysis

📞 Troubleshooting

"No log files found"

Solution:

  • Ensure log files are in the input/ directory
  • Check that files have .log or .log.gz extensions
  • Verify the log format is Apache or Nginx Combined Log Format

"No real visitors in reports"

Solution:

  • Check that DEVELOPER_IPS includes your actual IP address
  • Verify your site actually has human visitors (not just bots)
  • Look at the raw data in the JSON export to understand the traffic

"API quota exceeded"

Solution:

  • This is normal! The tool will use cached data for repeat IPs
  • Wait 24 hours for daily quotas to reset
  • Consider upgrading to paid tiers if you need real-time data
  • The reports will still be valuable with partial enrichment

"Reports show mostly bot traffic"

This is normal!

  • 50-80% bot traffic is typical for most websites
  • The reports separate bot and human traffic clearly
  • Focus on the "human visitors" metrics for business insights

🤝 Contributing & Support

This tool is built for small business owners. Contributions and feature requests are welcome!

  1. Check existing issues on GitHub before opening a new one.
  2. For bugs, please provide a log sample, your configuration (with keys removed), and the error message.
  3. We are considering features like WordPress integration, email scheduling, and more. Feel free to contribute or make a request!

📄 License & Credits

License: MIT License - free for personal and commercial use

Built With:

  • IPinfo for geographic data
  • AbuseIPDB for threat intelligence
  • GreyNoise for scanner identification
  • Shodan InternetDB for infrastructure analysis
  • BGPView for network information

🚀 What's Next?

After running your first analysis:

  1. Review both reports - Business for growth insights, Security for peace of mind
  2. Set up API keys if you haven't already - the insights get much richer
  3. Run monthly to build historical context and trend analysis
  4. Act on insights - optimize for mobile, improve popular content, address security issues
  5. Track improvements - use month-over-month comparisons to measure success

Built for small business owners who want enterprise-grade insights without the enterprise complexity or cost.

Ready to understand your website's true performance? Download, configure, and run your first analysis in under 10 minutes.
