# Manipulation Investigation: Recent Rating Inflation (2019-2024)

## Objective

Investigate the alarming rating inflation in 2019-2024, which is **2× larger** in total than in 2000-2010 and roughly 5× faster per year:
- 2020: 5.89 → 2024: 6.27 (+0.38 in 4 years)
- 2000-2010 baseline: +0.19 over 10 years
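
The two figures above cover periods of different length, so it helps to put them on a per-year basis. A minimal sketch using only the numbers quoted above (the variable names are illustrative, not part of the project code):

```python
# Figures quoted above; period lengths follow the stated year ranges.
recent_gain, recent_years = 0.38, 4        # 2020 -> 2024
baseline_gain, baseline_years = 0.19, 10   # 2000 -> 2010

recent_rate = recent_gain / recent_years         # rating points per year
baseline_rate = baseline_gain / baseline_years   # rating points per year

total_ratio = recent_gain / baseline_gain   # how much larger the total gain is
rate_ratio = recent_rate / baseline_rate    # how much faster the yearly gain is

print(f"Recent:   {recent_rate:.3f} points/year")
print(f"Baseline: {baseline_rate:.3f} points/year")
print(f"Total gain ratio: {total_ratio:.1f}x, per-year ratio: {rate_ratio:.1f}x")
```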

**Hypothesis:** Coordinated manipulation by studios, advocacy groups, or state actors (especially China).

## Detection Methods

1. **Genre Anomalies** - Documentary 7.21 rating (1.4 point anomaly)
2. **Benford's Law** - Leading-digit distribution of vote counts, plus clustering at round numbers
3. **Franchise Coordination** - MCU/DC systematic boosting
4. **Documentary Deep Dive** - Advocacy group coordination
5. **Chinese Film Proxies** - State actor influence

In [None]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Import project modules
from data_loader import load_title_basics, load_title_ratings, merge_master_dataset, logger
from manipulation_detection import (
    analyze_genre_anomalies,
    detect_vote_clustering,
    detect_franchise_coordination,
    analyze_documentary_manipulation,
    identify_chinese_films_proxy
)
from viz import (
    plot_genre_anomalies,
    plot_benford_violations,
    plot_franchise_coordination,
    plot_documentary_manipulation,
    plot_manipulation_summary
)

# Notebook display settings
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

print("✓ Imports successful")

## Load Master Dataset

In [None]:
print("Loading IMDb datasets...")
basics = load_title_basics()
ratings = load_title_ratings()

print("\nMerging into master dataset (min 1000 votes)...")
master = merge_master_dataset(basics, ratings, min_votes=1000)

print(f"\nMaster dataset: {len(master):,} movies")
print(f"Year range: {master['year'].min():.0f} - {master['year'].max():.0f}")
print(f"Recent period (2019-2024): {len(master[master['year'].between(2019, 2024)]):,} movies")

master.head()

## Analysis 1: Genre Anomalies

Test whether certain genres show suspiciously high ratings in 2019-2024 relative to the pre-2019 historical baseline.
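
The results table below reports a `cohens_d` column for each genre. The internals of `analyze_genre_anomalies` are not shown in this notebook, so the following is only a sketch of the standard pooled-standard-deviation form of Cohen's d, with a hypothetical helper name and made-up ratings:

```python
import math
import statistics


def cohens_d(recent, historical):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(recent), len(historical)
    var1 = statistics.variance(recent)       # sample variance (n - 1 denominator)
    var2 = statistics.variance(historical)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(recent) - statistics.mean(historical)) / pooled_sd


# Made-up ratings for one genre, for illustration only
recent = [7.0, 7.4, 6.8, 7.6, 7.2]
historical = [6.5, 6.9, 6.1, 7.0, 6.5]

d = cohens_d(recent, historical)
print(f"Cohen's d = {d:.2f}")
```

A positive value means the recent period rates higher. The summary cell of this notebook treats `|d| > 0.5` together with `p < 0.01` as suspicious.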

In [None]:
print("="*70)
print("ANALYSIS 1: GENRE ANOMALIES (2019-2024 vs. Pre-2019)")
print("="*70)

genre_results = analyze_genre_anomalies(master, years_range=(2019, 2024))

print("\n📊 Top 10 Genres by Rating Shift:")
print(genre_results[['genre', 'recent_mean', 'historical_mean', 'difference', 
                     'cohens_d', 'p_value', 'suspicious']].head(10))

print("\n🚨 SUSPICIOUS GENRES (medium+ effect size, p<0.01):")
suspicious_genres = genre_results[genre_results['suspicious']]
if len(suspicious_genres) > 0:
    print(suspicious_genres[['genre', 'recent_mean', 'historical_mean', 'difference', 'cohens_d', 'p_value']])
else:
    print("None detected.")

# Generate visualization
print("\n📈 Generating genre anomaly visualization...")
plot_genre_anomalies(genre_results, years_range=(2019, 2024))
print("✓ Saved to figures/fig7_genre_anomalies.png")

## Analysis 2: Benford's Law (Vote Clustering)

Test whether the leading digits of vote counts follow the logarithmic distribution predicted by Benford's law, or whether the counts show evidence of artificial thresholds.
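
Benford's law predicts that a leading digit d appears with probability log10(1 + 1/d). The sketch below shows one way to compute a chi-square statistic against that expectation. It is an assumption about the general approach, not the actual code of `detect_vote_clustering`, and the helper names and vote counts are made up:

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(abs(int(n)))[0])


def benford_chi2(vote_counts):
    """Chi-square statistic of observed leading digits vs. Benford's law."""
    counts = Counter(first_digit(v) for v in vote_counts if v > 0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one positive vote count")
    chi2 = 0.0
    for digit, share in benford_expected().items():
        expected = total * share
        chi2 += (counts.get(digit, 0) - expected) ** 2 / expected
    return chi2


# Made-up vote counts, for illustration only
votes = [1200, 1850, 2300, 1100, 4700, 9900, 1500, 3100, 1020, 2750]
print(f"chi-square = {benford_chi2(votes):.2f}")
```

With nine digit categories, the statistic is compared against a chi-square distribution with 8 degrees of freedom.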

In [None]:
print("="*70)
print("ANALYSIS 2: BENFORD'S LAW TEST (Vote Count Manipulation)")
print("="*70)

benford_results = detect_vote_clustering(master, years_range=(2019, 2024))

print(f"\n📊 Chi-square statistic: {benford_results['chi2_statistic']:.2f}")
print(f"📊 P-value: {benford_results['p_value']:.6f}")
print(f"📊 Manipulation probability: {benford_results['manipulation_probability']}")
print(f"\n🚨 VERDICT: {benford_results['verdict']}")

print("\n📈 Round-number clustering:")
for num, count in benford_results['round_number_counts'].items():
    if count > 0:
        print(f"  - {num:>6,} votes: {count:>5,} movies")

print(f"\nTotal round-number movies: {benford_results['total_round_numbers']:,}")
print(f"Clustering ratio: {benford_results['clustering_ratio']:.1f}x expected")

# Generate visualization
print("\n📈 Generating Benford violation visualization...")
plot_benford_violations(benford_results)
print("✓ Saved to figures/fig8_benford_violations.png")

## Analysis 3: Franchise Coordination

Test if franchise films (MCU, DC, Star Wars) rate systematically higher than standalone films.
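
The comparison below reports a mean difference and a `p_value` per genre. The notebook does not say which test `detect_franchise_coordination` uses. As one plausible choice for two groups of unequal size and variance, here is a sketch of Welch's t statistic with its approximate degrees of freedom (hypothetical helper name, made-up ratings):

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    se2 = v1 / n1 + v2 / n2                      # squared standard error
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Made-up ratings within one genre, for illustration only
franchise = [6.9, 7.1, 6.7, 7.3]
standalone = [6.2, 6.6, 6.0, 6.4, 6.8]

t_stat, dof = welch_t(franchise, standalone)
diff = statistics.mean(franchise) - statistics.mean(standalone)
print(f"difference = {diff:+.2f}, t = {t_stat:.2f}, df = {dof:.1f}")
```

The p-value then comes from a t distribution with `dof` degrees of freedom. The cell below flags a genre when the franchise boost exceeds 0.3 with `p < 0.05`.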

In [None]:
print("="*70)
print("ANALYSIS 3: FRANCHISE COORDINATION (Studios Gaming Ratings?)")
print("="*70)

franchise_results = detect_franchise_coordination(master, years_range=(2019, 2024))

print("\n📊 Franchise vs. Standalone Comparison by Genre:")
print(franchise_results[['genre', 'franchise_mean', 'standalone_mean', 'difference', 
                        'franchise_count', 'standalone_count', 'p_value', 'suspicious']])

print("\n🚨 SUSPICIOUS GENRES (franchise boost >0.3, p<0.05):")
suspicious_franchise = franchise_results[franchise_results['suspicious']]
if len(suspicious_franchise) > 0:
    print(suspicious_franchise[['genre', 'difference', 'p_value']])
else:
    print("None detected.")

# Generate visualization
print("\n📈 Generating franchise coordination visualization...")
plot_franchise_coordination(franchise_results)
print("✓ Saved to figures/fig9_franchise_coordination.png")

## Analysis 4: Documentary Genre Deep Dive

Documentaries have anomalously high ratings (7.21 vs. Drama 6.09). Investigate whether this gap reflects coordinated voting.
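
The cell below reports "vote efficiency", described there as rating per 1000 votes. Assuming the metric is literally the rating divided by the vote count in thousands (the exact formula inside `analyze_documentary_manipulation` is not shown here), a minimal sketch with a hypothetical helper looks like this:

```python
def vote_efficiency(rating, num_votes):
    """Rating points per 1000 votes (assumed definition)."""
    if num_votes <= 0:
        raise ValueError("num_votes must be positive")
    return rating / (num_votes / 1000)


# Made-up titles with the same rating but very different vote counts
small_doc = vote_efficiency(7.5, 1_500)
large_doc = vote_efficiency(7.5, 150_000)

print(f"1,500 votes:   {small_doc:.2f}")
print(f"150,000 votes: {large_doc:.2f}")
```

Under this definition the metric is driven mostly by the vote count: titles near the 1000-vote floor used in the master dataset score highest, and no title can exceed 10.0.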

In [None]:
print("="*70)
print("ANALYSIS 4: DOCUMENTARY MANIPULATION (Advocacy/State Actor Coordination?)")
print("="*70)

doc_results = analyze_documentary_manipulation(master, years_range=(2019, 2024))

print(f"\n📊 Recent documentaries mean rating: {doc_results['recent_mean_rating']:.2f}")
print(f"📊 Historical documentaries mean rating: {doc_results['historical_mean_rating']:.2f}")
# The :+ format prints the sign itself, so a decrease shows as "-0.12" rather than "+-0.12"
print(f"📊 Difference: {doc_results['recent_mean_rating'] - doc_results['historical_mean_rating']:+.2f}")

print("\n📊 Vote efficiency (rating per 1000 votes):")
print(f"  - Recent: {doc_results['recent_efficiency']:.2f}")
print(f"  - Historical: {doc_results['historical_efficiency']:.2f}")
print(f"  - Boost: {doc_results['efficiency_boost']:+.2f} (p={doc_results['p_value']:.4f})")

print(f"\n🚨 Suspicious documentaries: {doc_results['suspicious_count']}/{doc_results['total_recent_docs']}")
print("\n📋 Top 10 Suspicious Docs (High Vote Efficiency):")
print(doc_results['suspicious_docs'].head(10))

# Generate visualization
print("\n📈 Generating documentary manipulation visualization...")
plot_documentary_manipulation(doc_results)
print("✓ Saved to figures/fig10_documentary_manipulation.png")

## Analysis 5: Chinese Film Proxy Detection

Identify likely Chinese-influenced films using genre/title patterns and test for systematic rating boost.

In [None]:
print("="*70)
print("ANALYSIS 5: CHINESE FILM IDENTIFICATION (State Actor Influence?)")
print("="*70)

chinese_films = identify_chinese_films_proxy(master, years_range=(2019, 2024))

print(f"\n📊 Identified {len(chinese_films)} likely Chinese-influenced films")

if len(chinese_films) > 0:
    # Column access stays inside the guard in case an empty result lacks these columns
    print(f"📊 Films with suspicious rating boost (>0.5): {(chinese_films['rating_boost'] > 0.5).sum()}")
    print(f"\n📊 Mean rating: {chinese_films['imdb_rating'].mean():.2f}")
    print(f"📊 Mean expected rating: {chinese_films['expected_rating'].mean():.2f}")
    print(f"📊 Mean boost: {chinese_films['rating_boost'].mean():+.2f}")

    print("\n📋 Top 15 Suspicious Chinese Films:")
    print(chinese_films[['title', 'year', 'imdb_rating', 'expected_rating', 'rating_boost', 
                         'num_votes', 'china_score']].head(15))
else:
    print("\nNo Chinese-influenced films detected with current thresholds.")

## Summary: Manipulation Evidence Dashboard

In [None]:
print("="*70)
print("SUMMARY: MANIPULATION EVIDENCE (2019-2024)")
print("="*70)

# Generate summary visualization
print("\n📈 Generating 4-panel manipulation summary figure...")
plot_manipulation_summary(genre_results, benford_results, franchise_results, doc_results)
print("✓ Saved to figures/fig11_manipulation_summary.png")

# Summary statistics
print("\n" + "="*70)
print("KEY FINDINGS:")
print("="*70)

print("\n1️⃣ GENRE ANOMALIES:")
print(f"   - {len(genre_results[genre_results['suspicious']])} genres with suspicious shifts (p<0.01, |d|>0.5)")
if 'Documentary' in genre_results['genre'].values:
    doc_row = genre_results[genre_results['genre'] == 'Documentary'].iloc[0]
    print(f"   - Documentary: {doc_row['recent_mean']:.2f} (Cohen's d = {doc_row['cohens_d']:.2f})")

print("\n2️⃣ BENFORD'S LAW:")
print(f"   - Chi-square: {benford_results['chi2_statistic']:.2f}, p = {benford_results['p_value']:.6f}")
print(f"   - Manipulation probability: {benford_results['manipulation_probability']}")
print(f"   - Round-number clustering: {benford_results['clustering_ratio']:.1f}x expected")

print("\n3️⃣ FRANCHISE COORDINATION:")
print(f"   - {len(franchise_results[franchise_results['suspicious']])} genres with suspicious franchise boost")
if len(franchise_results) > 0:
    max_boost = franchise_results.loc[franchise_results['difference'].idxmax()]
    # :+ prints the sign, so a negative "largest boost" is not shown as "+-0.10"
    print(f"   - Largest boost: {max_boost['genre']} ({max_boost['difference']:+.2f})")

print("\n4️⃣ DOCUMENTARY MANIPULATION:")
print(f"   - Rating increase: {doc_results['recent_mean_rating'] - doc_results['historical_mean_rating']:+.2f} (p={doc_results['p_value']:.4f})")
print(f"   - Vote efficiency boost: {doc_results['efficiency_boost']:+.2f}")
print(f"   - Suspicious docs: {doc_results['suspicious_count']}/{doc_results['total_recent_docs']}")

print("\n5️⃣ CHINESE FILM PROXIES:")
print(f"   - Identified {len(chinese_films)} likely Chinese-influenced films")
if len(chinese_films) > 0:
    print(f"   - Mean rating boost: {chinese_films['rating_boost'].mean():+.2f}")

print("\n" + "="*70)
print("CONCLUSION:")
print("="*70)
print("\nEvidence checklist for coordinated rating manipulation in 2019-2024:")
print(f"  {'✓' if genre_results['suspicious'].any() else '✗'} Genre anomalies (suspicious rating shifts)")
print(f"  {'✓' if benford_results['p_value'] < 0.01 else '✗'} Benford violations (artificial vote clustering)")
print(f"  {'✓' if len(franchise_results[franchise_results['suspicious']]) > 0 else '✗'} Franchise coordination (studios gaming ratings)")
# Assumption: the documentary check passes on a positive efficiency boost with p < 0.01
print(f"  {'✓' if doc_results['efficiency_boost'] > 0 and doc_results['p_value'] < 0.01 else '✗'} Documentary inflation (vote efficiency boost)")
print("\nRecommendation: Investigate studio metadata (TMDb API) to identify specific actors.")

## Export Results for Article

Save key findings to CSV files for article writing.

In [None]:
from pathlib import Path

output_dir = Path('../article')
output_dir.mkdir(exist_ok=True)

# Export suspicious genres
genre_results[genre_results['suspicious']].to_csv(
    output_dir / 'manipulation_suspicious_genres.csv', index=False
)
print(f"✓ Exported suspicious genres to {output_dir / 'manipulation_suspicious_genres.csv'}")

# Export franchise results
franchise_results.to_csv(
    output_dir / 'manipulation_franchise_analysis.csv', index=False
)
print(f"✓ Exported franchise analysis to {output_dir / 'manipulation_franchise_analysis.csv'}")

# Export suspicious documentaries
doc_results['suspicious_docs'].to_csv(
    output_dir / 'manipulation_suspicious_docs.csv', index=False
)
print(f"✓ Exported suspicious docs to {output_dir / 'manipulation_suspicious_docs.csv'}")

# Export Chinese films
if len(chinese_films) > 0:
    chinese_films.to_csv(
        output_dir / 'manipulation_chinese_films.csv', index=False
    )
    print(f"✓ Exported Chinese films to {output_dir / 'manipulation_chinese_films.csv'}")

print("\n✅ All results exported successfully!")