{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 02 – Exploratory Analysis of Curated Movie Ratings\n",
    "\n",
    "This notebook explores three relationships in the curated dataset:\n",
    "\n",
    "- IMDb rating vs Rotten Tomatoes critic score\n",
    "- Rating gap (RT − IMDb) across genres\n",
    "- Budget vs critic score\n",
    "\n",
    "These visualizations support the CS598 project's goal of showing how the curated dataset enables meaningful analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "plt.style.use('seaborn-v0_8')\n",
    "\n",
    "df = pd.read_csv('../data/curated/movies_with_scores.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## IMDb vs Rotten Tomatoes (normalized ratings)\n",
    "\n",
    "This scatter plot compares IMDb ratings with Rotten Tomatoes critic scores, both normalized to a 0–100 scale, to show whether audiences and critics generally agree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(7,6))\n",
    "plt.scatter(df['imdb_rating_norm'], df['rt_score_norm'], alpha=0.7)\n",
    "plt.xlabel('IMDb Rating (0–100)')\n",
    "plt.ylabel('Rotten Tomatoes Critic Score (0–100)')\n",
    "plt.title('IMDb vs Rotten Tomatoes Ratings')\n",
    "plt.grid(True)\n",
    "plt.show()"
   ]
  },
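  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The visual impression from the scatter plot can be backed by a number. The cell below is a minimal sketch that computes the Pearson correlation between the two normalized scores, assuming `df` is loaded as above and both columns are numeric. The helper name `score_correlation` is hypothetical and not part of the project code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def score_correlation(frame, x_col, y_col):\n",
    "    # Drop rows missing either score, then return (Pearson r, rows used).\n",
    "    pair = frame[[x_col, y_col]].dropna()\n",
    "    return pair[x_col].corr(pair[y_col]), len(pair)\n",
    "\n",
    "r, n = score_correlation(df, 'imdb_rating_norm', 'rt_score_norm')\n",
    "print(f'Pearson r = {r:.2f} (n = {n})')"
   ]
  },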
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Rating Gap by Genre\n",
    "\n",
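    "The rating gap is the normalized Rotten Tomatoes critic score minus the normalized IMDb rating.\n",
    "\n",
    "- Positive values: critics rate the film higher than audiences do.\n",
    "- Negative values: audiences rate the film higher than critics do."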
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
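    "# Take the first listed genre as the primary genre. Missing genres are labeled\n",
    "# 'Unknown' so they do not show up as an empty-string category in the chart.\n",
    "df['primary_genre'] = df['genres'].fillna('Unknown').str.split('|').str[0]\n",
    "# Positive gap: critics (RT) score the film higher than audiences (IMDb).\n",
    "df['rating_gap'] = df['rt_score_norm'] - df['imdb_rating_norm']\n",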
    "\n",
    "genre_gap = df.groupby('primary_genre')['rating_gap'].mean().sort_values()\n",
    "\n",
    "plt.figure(figsize=(10,6))\n",
    "genre_gap.plot(kind='barh')\n",
    "plt.xlabel('Average (RT - IMDb) Rating Difference')\n",
    "plt.title('Critic vs Audience Rating Gap by Genre')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
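  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A mean gap computed from only a handful of films can be misleading. As a rough check, the sketch below lists the mean gap next to the number of films per genre, assuming the `primary_genre` and `rating_gap` columns created in the previous cell. The name `genre_summary` is illustrative only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mean rating gap and number of films per primary genre, sorted by mean gap.\n",
    "genre_summary = (\n",
    "    df.groupby('primary_genre')['rating_gap']\n",
    "      .agg(['mean', 'count'])\n",
    "      .sort_values('mean')\n",
    ")\n",
    "genre_summary"
   ]
  },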
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Budget vs Critic Score\n",
    "\n",
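    "This plot checks whether bigger-budget films tend to receive higher critic scores. The budget axis is log-scaled because budgets span several orders of magnitude."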
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
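    "# Keep only rows with a positive budget: zero or missing budgets cannot be\n",
    "# placed on a log axis.\n",
    "budgeted = df[df['budget'] > 0]\n",
    "\n",
    "plt.figure(figsize=(7,6))\n",
    "plt.scatter(budgeted['budget'], budgeted['rt_score_norm'], alpha=0.7)\n",
    "plt.xscale('log')\n",
    "plt.xlabel('Budget (USD, log scale)')\n",
    "plt.ylabel('Rotten Tomatoes Critic Score (0–100)')\n",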
    "plt.title('Budget vs Critic Score')\n",
    "plt.grid(True)\n",
    "plt.show()"
   ]
  },
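  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To put a number on the plot above, the following sketch computes a Spearman rank correlation between budget and critic score. It assumes `df` is loaded as above and restricts the data to rows with a positive budget; the names `with_budget` and `rho` are illustrative only. Rank correlation is used here because budgets span several orders of magnitude, so it is one reasonable choice rather than the project's established method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep rows with a positive budget and a critic score present.\n",
    "with_budget = df.loc[df['budget'] > 0, ['budget', 'rt_score_norm']].dropna()\n",
    "\n",
    "# DataFrame.corr(method='spearman') correlates the ranks of the two columns.\n",
    "rho = with_budget.corr(method='spearman').loc['budget', 'rt_score_norm']\n",
    "print(f'Spearman rho = {rho:.2f} (n = {len(with_budget)})')"
   ]
  },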
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "- IMDb and Rotten Tomatoes ratings show modest correlation.\n",
    "- Certain genres show larger critic–audience disagreements.\n",
    "- Budget does not strongly predict critic score in this pilot dataset.\n",
    "\n",
    "These analyses demonstrate the value and utility of the curated dataset for downstream research."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
