# Frequent GDP Analysis (2020–2025) — Interactive Notebook

This notebook performs an end-to-end interactive analysis of the Kaggle dataset "GDP per country (2020–2025)" (https://www.kaggle.com/datasets/codebynadiia/gdp-per-country-20202025/data).

Goals:
- Load the dataset automatically from `/kaggle/input` (or accept a downloaded CSV) and coerce it to a consistent long format (columns: `country`, `year`, `gdp`).
- Explore the data with interactive Plotly visualizations (histograms, pair matrix, heatmaps, subplots, 3D, boxplots, choropleth where possible).
- Compute growth metrics (YoY, CAGR) and create a year-by-year animation saved as `/kaggle/working/gdp_animation.mp4` (and an interactive HTML).

Notes:
- All visualizations use Plotly (no matplotlib). Light/pastel palettes are used for pleasant interactive visuals.
- Outputs (clean CSV, HTMLs, frames, final MP4) are saved under `/kaggle/working/`.

Estimated runtime for full export (frames + mp4): ~1–5 minutes depending on environment and whether `kaleido` and `imageio` are available to export images and stitch video.

## How to run

- Run cells top-to-bottom in a Kaggle/Colab/Jupyter environment.
- If running outside Kaggle, ensure you have the dataset CSV available and placed somewhere accessible; the notebook auto-searches `/kaggle/input` first.
- If `kaleido` or `imageio-ffmpeg` are not installed and you want to export PNG frames and MP4, uncomment the pip install lines in the "Install checks" cell.
- Sampling: heavy visuals sample up to 2,500 countries/rows for performance (stable reproducible sampling with `RANDOM_STATE = 42`).
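
The sampling rule in the last bullet can be sketched as follows. This is a minimal illustration on made-up data; `capped_sample` and the toy frame are hypothetical names that do not appear in the notebook, and the cap is shrunk to 3 rows so the effect is visible.

```python
import pandas as pd

RANDOM_STATE = 42   # same seed the notebook uses
MAX_SAMPLE = 3      # tiny cap for illustration; the notebook uses 2,500

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({"country": list("ABCDEFGH"), "gdp": range(8)})

def capped_sample(frame, max_rows, seed):
    """Return the frame unchanged if it is small enough, else a seeded sample."""
    if len(frame) <= max_rows:
        return frame
    return frame.sample(n=max_rows, random_state=seed)

first = capped_sample(toy, MAX_SAMPLE, RANDOM_STATE)
second = capped_sample(toy, MAX_SAMPLE, RANDOM_STATE)
print(first.index.tolist() == second.index.tolist())
```

Because the seed is fixed, repeated runs pick the same rows, which keeps the sampled plots comparable between sessions.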

In [None]:
# Environment & Imports
import sys
import os
import math
import glob
import json
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Export/animation libs (optional)
try:
    import kaleido  # used by Plotly to write images
    _HAS_KALEIDO = True
except Exception:
    _HAS_KALEIDO = False

try:
    import imageio
    _HAS_IMAGEIO = True
except Exception:
    _HAS_IMAGEIO = False

RANDOM_STATE = 42

# Print versions
import plotly   # the version string lives on the top-level package
import sklearn

versions = {
    'python': sys.version.split()[0],
    'pandas': getattr(pd, '__version__', 'unknown'),
    'numpy': getattr(np, '__version__', 'unknown'),
    'plotly': getattr(plotly, '__version__', 'unknown'),
    'scikit-learn': getattr(sklearn, '__version__', 'unknown'),
    'kaleido': 'installed' if _HAS_KALEIDO else 'missing',
    'imageio': 'installed' if _HAS_IMAGEIO else 'missing'
}
print('Library versions:')
for k, v in versions.items():
    print(f"- {k}: {v}")

### Install checks (optional)

If you need image export (PNG frames) and MP4 creation, ensure `kaleido` and `imageio-ffmpeg` are installed.
Uncomment and run the commands below in an environment where installing packages is allowed.

In [None]:
# Example installs (do NOT run in restricted environments unless you want to install packages):
# !pip install kaleido imageio imageio-ffmpeg
# After installing, restart the kernel to ensure Plotly finds kaleido.
print('kaleido available:', _HAS_KALEIDO)
print('imageio available:', _HAS_IMAGEIO)

## Load & Inspect Data

This cell attempts to auto-detect CSV files under `/kaggle/input` and load the most likely main CSV. If no CSV is found, it will print a clear message so you can place the CSV manually or provide a path.
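
As a compact sketch of the same idea, the search and the "largest file wins" heuristic can be expressed with `pathlib`. The helper name `largest_csv` is hypothetical, and the demo runs on a temporary directory rather than `/kaggle/input`.

```python
import os
import tempfile
from pathlib import Path

def largest_csv(base_path):
    """Return the biggest *.csv (by bytes on disk) under base_path, or None."""
    base = Path(base_path)
    if not base.exists():
        return None
    candidates = [p for p in base.rglob("*") if p.is_file() and p.suffix.lower() == ".csv"]
    return max(candidates, key=lambda p: p.stat().st_size, default=None)

# Demo on a throwaway directory with one small and one larger CSV
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "small.csv").write_text("a\n1\n")
    (Path(tmp) / "nested" / "big.CSV").write_text("a,b\n" + "1,2\n" * 50)
    picked_name = largest_csv(tmp).name
    missing = largest_csv(os.path.join(tmp, "does_not_exist"))

print(picked_name, missing)
```

File size is only a proxy for "main table"; if a dataset ships several CSVs of similar size, setting the path manually is the safer choice.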

In [None]:
# Auto-detect CSV under /kaggle/input (recursive search)
def find_csvs(base_path='/kaggle/input'):
    csv_paths = []
    if not os.path.exists(base_path):
        return csv_paths
    for root, dirs, files in os.walk(base_path):
        for f in files:
            if f.lower().endswith('.csv'):
                csv_paths.append(os.path.join(root, f))
    return csv_paths

csv_files = find_csvs('/kaggle/input')
print('Found CSVs (sample up to 10):')
for p in csv_files[:10]:
    print('-', p)

df = None
if csv_files:
    # Heuristic: pick the largest CSV (by file size in bytes) as the main table
    sizes = [(p, os.path.getsize(p)) for p in csv_files]
    sizes.sort(key=lambda x: x[1], reverse=True)
    candidate = sizes[0][0]
    print('\nLoading candidate CSV:', candidate)
    try:
        df = pd.read_csv(candidate)
    except Exception as e:
        print('CSV load failed:', e)
        # fallback: try pandas read_table
        try:
            df = pd.read_table(candidate)
        except Exception as e2:
            print('Fallback read_table failed:', e2)

if df is None:
    print('\nNo CSV found under /kaggle/input or load failed.\nPlease download the Kaggle dataset and place the CSV under /kaggle/input/<dataset>/ or set a path manually and re-run this cell.')
else:
    print('\ndf.shape =>', df.shape)
    display(df.head())
    print('\nColumns:')
    print(list(df.columns))

## Normalize Data Format (wide ↔ long)

This cell detects whether the data is in wide format (year columns like `2020`, `2021` or `GDP_2020`), or long format (rows for `country`, `year`, `gdp`). It standardizes to `df_long` with three columns: `country`, `year` (int), `gdp` (float).

Rows with a missing country or an unparseable year are dropped, non-numeric GDP values become NaN, and the detected format and resulting shape are printed.
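
To make the wide-to-long step concrete, here is a minimal sketch on a hypothetical two-country table. The column names and figures are invented for illustration and are not taken from the dataset.

```python
import re
import pandas as pd

# Hypothetical wide table; the figures are made up for illustration
wide = pd.DataFrame({
    "Country": ["Aland", "Borduria"],
    "GDP_2020": [100.0, 50.0],
    "GDP_2021": [110.0, None],
})

# Any column whose name contains a 4-digit year counts as a year column
year_cols = [c for c in wide.columns if re.search(r"(19|20)\d{2}", c)]

long = wide.melt(id_vars=["Country"], value_vars=year_cols,
                 var_name="year_raw", value_name="gdp")
long["year"] = long["year_raw"].str.extract(r"((?:19|20)\d{2})", expand=False).astype(int)
long = long.rename(columns={"Country": "country"})[["country", "year", "gdp"]]
print(long)
```

Each country-year pair becomes one row, and the missing 2021 value survives as NaN so that later cells can decide how to treat it.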

In [None]:
# Normalize to long format with columns: country, year, gdp
def normalize_gdp_dataframe(df_raw):
    df_work = df_raw.copy()
    # Standardize column names to strings
    df_work.columns = [str(c).strip() for c in df_work.columns]
    cols = df_work.columns.tolist()

    import re  # local import keeps the function self-contained

    # Heuristics to find country column name
    country_col_candidates = [c for c in cols if c.lower() in ('country', 'country_name', 'countryname', 'country/region', 'countries')]
    if country_col_candidates:
        country_col = country_col_candidates[0]
    else:
        # fallback: pick the first column whose name does not contain a 4-digit year
        non_year_cols = [c for c in cols if not re.search(r"(19|20)\d{2}", c)]
        country_col = non_year_cols[0] if non_year_cols else cols[0]

    # Detect year-like columns (e.g. '2020', '2021', 'GDP_2020', '2020_gdp')
    year_like = []
    for c in cols:
        token = c.strip()
        # direct numeric year
        if token.isdigit() and 1900 <= int(token) <= 2100:
            year_like.append(c)
            continue
        # patterns like GDP_2020 or gdp2020
        if re.search(r"(19|20)\d{2}", token):
            year_like.append(c)
    
    # Decide wide vs long
    if year_like and len(year_like) >= 2:
        detected_format = 'wide'
        # melt keeping country_col
        id_vars = [country_col]
        df_melt = df_work.melt(id_vars=id_vars, value_vars=year_like, var_name='year_raw', value_name='gdp_raw')
        # Extract year integer
        def extract_year(x):
            import re
            s = str(x)
            m = re.search(r"(19|20)\d{2}", s)
            return int(m.group(0)) if m else None
        df_melt['year'] = df_melt['year_raw'].apply(extract_year)
        df_melt.rename(columns={country_col: 'country'}, inplace=True)
        df_melt['gdp'] = pd.to_numeric(df_melt['gdp_raw'], errors='coerce')
        df_long = df_melt[['country','year','gdp']].copy()
    else:
        # Try to detect long format
        possible_year_cols = [c for c in cols if c.lower() in ('year','yr')]
        possible_gdp_cols = [c for c in cols if c.lower() in ('gdp','gdp_value','gdp_usd','gdp_usd_millions')]
        if possible_year_cols and possible_gdp_cols:
            detected_format = 'long'
            df2 = df_work.rename(columns={possible_year_cols[0]: 'year', possible_gdp_cols[0]: 'gdp'})
            df2 = df2.rename(columns={country_col: 'country'})
            df2['year'] = pd.to_numeric(df2['year'], errors='coerce')
            df2['gdp'] = pd.to_numeric(df2['gdp'], errors='coerce')
            df_long = df2[['country','year','gdp']].copy()
        else:
            # Try common wide layout where years are column names directly like '2020','2021'
            year_cols = [c for c in cols if c.strip().isdigit() and 1900 <= int(c.strip()) <= 2100]
            if year_cols:
                detected_format = 'wide'
                df_melt = df_work.melt(id_vars=[country_col], value_vars=year_cols, var_name='year', value_name='gdp')
                df_melt.rename(columns={country_col: 'country'}, inplace=True)
                df_melt['year'] = pd.to_numeric(df_melt['year'], errors='coerce')
                df_melt['gdp'] = pd.to_numeric(df_melt['gdp'], errors='coerce')
                df_long = df_melt[['country','year','gdp']].copy()
            else:
                # Last fallback: try to find probable year and gdp columns heuristically
                detected_format = 'unknown'
                # attempt to find any column with years inside values
                # If nothing else, return an empty normalized dataframe
                df_long = pd.DataFrame(columns=['country','year','gdp'])

    # Clean country names and enforce dtypes
    if not df_long.empty:
        # Drop missing countries before casting: astype(str) would turn NaN into the string 'nan'
        df_long = df_long.dropna(subset=['country']).copy()
        df_long['country'] = df_long['country'].astype(str).str.strip()
        df_long['year'] = pd.to_numeric(df_long['year'], errors='coerce')
        df_long['gdp'] = pd.to_numeric(df_long['gdp'], errors='coerce')
        df_long = df_long.dropna(subset=['year'])
        df_long['year'] = df_long['year'].astype(int)
    
    return df_long, detected_format

df_long = None
detected_format = None
if df is not None:
    df_long, detected_format = normalize_gdp_dataframe(df)
    print('Detected format:', detected_format)
    print('df_long.shape =>', df_long.shape)
    display(df_long.head())
    # Show available years
    print('Years present (sample):', sorted(df_long['year'].unique())[:20])

### Post-normalization checks & enforcement of 2020..2025

Check that `year` covers the expected range 2020–2025. If some of these years are missing, a warning is printed; years outside the range are kept, and later cells restrict the analysis to the 2020–2025 intersection.

In [None]:
REQUIRED_YEARS = list(range(2020, 2026))
if df_long is None or df_long.empty:
    raise RuntimeError('No normalized data available. Please check the dataset file and re-run the previous cells.')

present_years = sorted(df_long['year'].unique())
print('Present years in data:', present_years)
missing_required = [y for y in REQUIRED_YEARS if y not in present_years]
if missing_required:
    print('\nWarning: expected years 2020..2025 not all present. Missing:', missing_required)
    # If years outside this range exist, we keep them but later we will restrict animation to 2020..2025 if any of these are present.
else:
    print('\nAll required years (2020..2025) are present.')

## Handle Missing Values & Interpolation Strategy

Strategy implemented:
- For each country, compute the proportion of missing years within the analysis years. If more than 30% are missing, exclude the country from the interpolated frame (`df_interp`) and flag it as `dropped_for_animation` in the cleaned CSV.
- For remaining countries, linearly interpolate small gaps across years and forward/backfill only at the edges if necessary.
- Document dropped countries and reasons.
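
The strategy above can be sketched on a toy panel. The values are hypothetical: country "A" has a single interior gap (about 17% missing), while "B" is missing three of six years (50%) and therefore crosses the 30% threshold.

```python
import numpy as np
import pandas as pd

years = [2020, 2021, 2022, 2023, 2024, 2025]
# Hypothetical values: "A" has one interior gap, "B" is missing 3 of 6 years
toy = pd.DataFrame({
    "country": ["A"] * 6 + ["B"] * 6,
    "year": years * 2,
    "gdp": [10, np.nan, 14, 16, 18, 20,
            np.nan, np.nan, 5, np.nan, 7, 8],
})

# Share of missing years per country, then the 30% cut-off
missing_pct = toy.groupby("country")["gdp"].apply(lambda s: s.isna().mean())
dropped = missing_pct[missing_pct > 0.3].index.tolist()

# Linear interpolation within each remaining country; edges are filled in both directions
kept = toy[~toy["country"].isin(dropped)].sort_values(["country", "year"]).copy()
kept["gdp"] = kept.groupby("country")["gdp"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)
print(dropped)
print(kept)
```

The 2021 gap for "A" is filled with 12, the midpoint of its neighbours 10 and 14, which is what linear interpolation over evenly spaced years gives.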

In [None]:
# Build a complete grid of country x year over the full year range in the data to evaluate missingness
countries = sorted(df_long['country'].unique())
years_all = sorted(df_long['year'].unique())

# We'll create df_grid which contains all country-year combos for the years we have in the dataset
year_min, year_max = min(years_all), max(years_all)
grid_years = list(range(year_min, year_max+1))
grid = pd.MultiIndex.from_product([countries, grid_years], names=['country','year']).to_frame(index=False)
df_full = pd.merge(grid, df_long, on=['country','year'], how='left')

# Compute missingness per country for the YEARS RANGE we will analyze (restrict to 2020..2025 intersection)
analysis_years = [y for y in REQUIRED_YEARS if y >= year_min and y <= year_max]
print('Analysis years used:', analysis_years)

def country_missing_stats(df_full, analysis_years):
    stats = []
    for c in df_full['country'].unique():
        sub = df_full[(df_full['country']==c) & (df_full['year'].isin(analysis_years))]
        total = len(analysis_years)
        missing = sub['gdp'].isna().sum()
        stats.append({'country': c, 'total_years': total, 'missing_years': int(missing), 'missing_pct': missing/total if total>0 else 1.0})
    return pd.DataFrame(stats)

missing_stats = country_missing_stats(df_full, analysis_years)
dropped_countries = missing_stats[missing_stats['missing_pct'] > 0.3]['country'].tolist()
print('Countries dropped from trend/animation due to >30% missing years (count={}):'.format(len(dropped_countries)))
print(dropped_countries[:20])

# Interpolate small gaps for remaining countries
keep_countries = [c for c in countries if c not in dropped_countries]
df_interp = df_full[df_full['country'].isin(keep_countries)].copy()
df_interp = df_interp.sort_values(['country','year'])

# Linear interpolation per country across the full year range we have (not restricted to 2020..2025 to preserve continuity)
def interp_series(s):
    # Interpolate only if at least two numeric values exist; otherwise leave as is
    if s.count() >= 2:
        return s.interpolate(method='linear', limit_direction='both')
    return s

# transform keeps every column and the existing row order (rows are already sorted by country, year),
# and does not depend on how groupby.apply treats the grouping column across pandas versions
df_interp['gdp'] = df_interp.groupby('country')['gdp'].transform(interp_series)
df_interp = df_interp.reset_index(drop=True)

# Mark dropped countries in the full df_long as well for CSV output
df_long_clean = df_long.copy()
df_long_clean['dropped_for_animation'] = df_long_clean['country'].isin(dropped_countries)

print('After interpolation, sample:')
display(df_interp.head())

## Summary Stats & Quick EDA

Compute key summary statistics and show top-10 countries by average GDP (2020–2025). We'll save the top-10 bar chart as an interactive HTML.

In [None]:
# Compute stats for analysis years (intersection)
df_analysis = df_long_clean[df_long_clean['year'].isin(analysis_years)].copy()
grouped = df_analysis.groupby('country', as_index=False)['gdp'].mean()
grouped = grouped.rename(columns={'gdp':'gdp_avg'})
top10 = grouped.sort_values('gdp_avg', ascending=False).head(10)
print('Top 10 countries by average GDP ({} years):'.format(len(analysis_years)))
display(top10)

# Interactive pastel bar chart
fig_top10 = px.bar(top10, x='country', y='gdp_avg', title='Top 10 countries by average GDP ({}-{})'.format(min(analysis_years), max(analysis_years)),
                  color='gdp_avg', color_continuous_scale=px.colors.sequential.Peach, text='gdp_avg')
fig_top10.update_layout(coloraxis_showscale=False, template='simple_white')

OUT_DIR = '/kaggle/working'
os.makedirs(os.path.join(OUT_DIR, 'frames'), exist_ok=True)
top10_html = os.path.join(OUT_DIR, 'fig_top10_avg.html')
fig_top10.write_html(top10_html)
print('Saved interactive top10 HTML to', top10_html)
fig_top10

Interpretation:

- The bar chart above shows the top 10 countries by average GDP across the analysis years. Large economies appear at the top; this chart helps identify the dominant economies in absolute terms. Note: using absolute GDP favors large-population / large-economy countries; per-capita normalization is recommended for fairness.

## Distribution & Yearly Histograms

We create histograms for each analysis year (2020..2025 intersection). For consistency we use the same bin edges across years and pastel colors. If the dataset is extremely large, we sample to 2500 rows for plotting performance.
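
To illustrate what "the same bin edges across years" means, the sketch below computes edges once from pooled values and reuses them for each year. The arrays are hypothetical and the bin count is fixed at 4 to keep the example small.

```python
import numpy as np

# Hypothetical GDP values for two years
gdp_2020 = np.array([1.0, 2.0, 2.5, 10.0])
gdp_2021 = np.array([1.5, 3.0, 9.0, 12.0])

# Edges come from the pooled values, then get reused for every year
edges = np.histogram_bin_edges(np.concatenate([gdp_2020, gdp_2021]), bins=4)
counts_2020, _ = np.histogram(gdp_2020, bins=edges)
counts_2021, _ = np.histogram(gdp_2021, bins=edges)

print(edges)
print(counts_2020, counts_2021)
```

With shared edges, a bar at the same x-position refers to the same GDP interval in every year, so shifts between years reflect the data rather than the binning.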

In [None]:
MAX_SAMPLE = 2500
df_for_hist = df_analysis.copy()
if len(df_for_hist) > MAX_SAMPLE:
    df_for_hist = df_for_hist.sample(n=MAX_SAMPLE, random_state=RANDOM_STATE)

gdp_vals = df_for_hist['gdp'].dropna()
bins = np.histogram_bin_edges(gdp_vals, bins='auto')

hist_html_paths = []
for y in analysis_years:
    sub = df_for_hist[df_for_hist['year']==y]
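    fig = px.histogram(sub, x='gdp', title=f'GDP distribution - {y}',
                       labels={'gdp':'GDP'}, template='simple_white')
    # Apply the shared edges explicitly; nbins only caps the bin count, so edges could differ per year.
    # The end is padded by one bin so the largest value stays inside the range.
    bin_size = float(bins[1] - bins[0])
    fig.update_traces(marker_color='rgb(179,205,227)',
                      xbins=dict(start=float(bins[0]), end=float(bins[-1]) + bin_size, size=bin_size))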
    html_path = os.path.join(OUT_DIR, f'fig_hist_{y}.html')
    fig.write_html(html_path)
    hist_html_paths.append(html_path)
    display(fig)
    print('Saved', html_path)

Interpretation:

- These histograms show GDP distribution for each year. Expect right-skew: a few very large GDPs (USA, China) and many smaller economies. Look for changes in spread and tail behavior year-to-year; sudden shifts usually indicate data issues or re-basing.

## Pair Plot / Scatter Matrix

We pivot countries into columns for the year-wise GDP values and build a scatter matrix for pairwise relationships across years. We sample up to 2500 countries/rows to keep interactivity responsive.

In [None]:
# Pivot: countries as rows, years as columns
pivot = df_long.pivot_table(index='country', columns='year', values='gdp')
pivot_cols = [c for c in pivot.columns if c in analysis_years]
pivot_sub = pivot[pivot_cols].dropna(axis=0, how='any')  # only countries with full data for analysis years
print('Pivot shape (countries x years):', pivot_sub.shape)

pivot_for_matrix = pivot_sub.reset_index()
if len(pivot_for_matrix) > MAX_SAMPLE:
    pivot_for_matrix = pivot_for_matrix.sample(n=MAX_SAMPLE, random_state=RANDOM_STATE)

if len(pivot_cols) >= 2:
    # Cast the integer year labels to strings so Plotly Express treats them as column names
    pivot_for_matrix.columns = [str(c) for c in pivot_for_matrix.columns]
    fig = px.scatter_matrix(pivot_for_matrix, dimensions=[str(c) for c in pivot_cols],
                            color_discrete_sequence=px.colors.qualitative.Pastel, title='Scatter matrix across years')
    fig.update_layout(width=900, height=900)
    scatter_matrix_html = os.path.join(OUT_DIR, 'fig_scatter_matrix.html')
    fig.write_html(scatter_matrix_html)
    print('Saved scatter matrix to', scatter_matrix_html)
    display(fig)  # a bare `fig` inside an if-block is not rendered by Jupyter
else:
    print('Not enough year columns for a scatter matrix. Need at least 2.')

Interpretation:

- The scatter matrix helps detect strong linear relationships between GDPs across years. High correlation across adjacent years is expected for GDP; look for weak correlations that might indicate outliers or data inconsistency.

## Correlation Heatmap (years correlation)

Compute a correlation matrix among years (countries are observations). For performance, limit to top-25 countries by average GDP.
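
The orientation of the table matters here. As a small sketch on hypothetical numbers, calling `.corr()` on a countries x years table correlates the year columns, whereas transposing first correlates countries with each other.

```python
import pandas as pd

# Hypothetical countries x years table
pivot_demo = pd.DataFrame(
    {2020: [1.0, 2.0, 3.0, 4.0], 2021: [1.1, 2.1, 3.2, 4.5], 2022: [4.0, 3.0, 2.0, 1.0]},
    index=["A", "B", "C", "D"],
)

year_corr = pivot_demo.corr()        # 3 x 3: years vs years (countries are observations)
country_corr = pivot_demo.T.corr()   # 4 x 4: countries vs countries (years are observations)

print(year_corr.round(2))
```

In this toy table 2022 is an exact reversal of 2020, so their correlation is -1; real GDP data would normally show values close to +1 between adjacent years.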

In [None]:
top25_countries = grouped.sort_values('gdp_avg', ascending=False).head(25)['country'].tolist()
pivot_top25 = pivot.loc[top25_countries, pivot_cols]
corr = pivot_top25.corr()  # columns are years, so this is the year-by-year correlation
fig = px.imshow(corr, text_auto='.2f', aspect='auto', color_continuous_scale='RdBu', origin='lower',
                title='Correlation across years (top 25 countries)')
heatmap_html = os.path.join(OUT_DIR, 'fig_corr_heatmap.html')
fig.write_html(heatmap_html)
print('Saved heatmap to', heatmap_html)
fig

Interpretation:

- The heatmap typically shows very high correlations between adjacent years; this is expected for annual GDP values. A lower-correlation cell marks a pair of years between which relative GDP levels across countries shifted more.

## Subplots Example (global average, distribution boxplot, top-5 lines)
Create a multi-panel interactive Plotly figure combining three perspectives.

In [None]:
# Global average across years
global_avg = df_analysis.groupby('year')['gdp'].mean().reset_index()

# Boxplot data preparation
df_box = df_analysis.copy()

# Top-5 countries lines
top5 = grouped.sort_values('gdp_avg', ascending=False).head(5)['country'].tolist()
df_top5 = df_long[df_long['country'].isin(top5)].sort_values(['country','year'])

fig = make_subplots(rows=3, cols=1, shared_xaxes=True, vertical_spacing=0.08,
                    subplot_titles=('Global average GDP (years)', 'GDP distribution per year (boxplot)', 'Top-5 countries - GDP over time'))

fig.add_trace(go.Scatter(x=global_avg['year'], y=global_avg['gdp'], mode='lines+markers', name='Global Avg', line=dict(color='rgb(179,205,227)')) , row=1, col=1)

# A single Box trace with x=year lets Plotly draw one box per year
fig.add_trace(go.Box(x=df_box['year'], y=df_box['gdp'], name='Distribution', marker_color='rgb(204,235,197)', boxmean='sd'), row=2, col=1)

for c in top5:
    tmp = df_top5[df_top5['country']==c]
    fig.add_trace(go.Scatter(x=tmp['year'], y=tmp['gdp'], mode='lines+markers', name=c), row=3, col=1)

fig.update_layout(height=1000, showlegend=True, template='simple_white', title_text='Subplots: global avg, distribution and top-5 trends')
subplots_html = os.path.join(OUT_DIR, 'fig_subplots.html')
fig.write_html(subplots_html)
print('Saved subplots to', subplots_html)
fig

Interpretation:

- The top panel shows the global average GDP over years; the boxplot summarizes per-year distribution and dispersion; the bottom panel shows trend lines for the top-5 economies, highlighting their relative stability or growth.

## Boxplot / Violin per year

Interactive boxplots or violins help identify distribution shape and outliers per year.

In [None]:
fig_box = px.box(df_analysis, x='year', y='gdp', points='outliers', title='GDP distribution per year (boxplot)',
                 color_discrete_sequence=['rgb(209,229,240)'])
box_html = os.path.join(OUT_DIR, 'fig_box.html')
fig_box.write_html(box_html)
print('Saved boxplot to', box_html)
fig_box

Interpretation:

- Boxplots show central tendency and spread of GDP across countries for each year. Watch for increasing dispersion or new outliers which may signal data issues or real economic divergence.

## Barplot & Dotplot (selected years)

Compare top countries in 2020 vs 2025 (if 2025 exists). We'll create interactive bar and dot plots and save them.

In [None]:
def top_n_by_year(df_long, year, n=20):
    sub = df_long[df_long['year']==year].copy()
    sub = sub.dropna(subset=['gdp'])
    return sub.sort_values('gdp', ascending=False).head(n)

plots_saved = []
for y in [min(analysis_years), max(analysis_years)]:
    top20 = top_n_by_year(df_long, y, n=20)
    fig_bar = px.bar(top20, x='country', y='gdp', title=f'Top 20 countries by GDP - {y}',
                     color='gdp', color_continuous_scale=px.colors.sequential.Peach)  # Pastel exists only as a qualitative palette
    path = os.path.join(OUT_DIR, f'fig_top20_{y}.html')
    fig_bar.write_html(path)
    plots_saved.append(path)
    display(fig_bar)
    # dotplot
    fig_dot = px.scatter(top20, x='gdp', y='country', size='gdp', title=f'Dotplot - Top 20 GDP - {y}',
                        color_discrete_sequence=['rgb(197,218,235)'])
    path2 = os.path.join(OUT_DIR, f'fig_dot_top20_{y}.html')
    fig_dot.write_html(path2)
    plots_saved.append(path2)
    display(fig_dot)
print('Saved plots:', plots_saved[:6])

Interpretation:

- These visuals make it simple to compare the ranking and relative size of the largest economies in a single year and observe rank changes across years.

## Pie Chart for top-10 country share (selected year)

Create a pie chart for a selected year and group the rest as `Other`.
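
The "top N plus `Other`" grouping can be sketched without any plotting. The helper `top_n_with_other` and the five-row table below are hypothetical and only illustrate the aggregation step.

```python
import pandas as pd

# Hypothetical single-year table
toy = pd.DataFrame({"country": list("ABCDE"), "gdp": [50.0, 30.0, 10.0, 6.0, 4.0]})

def top_n_with_other(frame, n):
    """Keep the n largest rows and collapse the remainder into one 'Other' row."""
    top = frame.nlargest(n, "gdp")[["country", "gdp"]]
    rest = frame.loc[~frame["country"].isin(top["country"]), "gdp"].sum()
    other = pd.DataFrame([{"country": "Other", "gdp": rest}])
    return pd.concat([top, other], ignore_index=True)

shares = top_n_with_other(toy, n=2)
print(shares)
```

The total is preserved by construction, so the pie slices still sum to the full-year GDP of all countries in the table.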

In [None]:
def pie_top10(df_long, year):
    sub = df_long[df_long['year']==year].dropna(subset=['gdp'])
    top10 = sub.sort_values('gdp', ascending=False).head(10).copy()
    others = sub[~sub['country'].isin(top10['country'])]['gdp'].sum()
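    # DataFrame.append was removed in pandas 2.0; build the extra row and concatenate instead
    other_row = pd.DataFrame([{'country': 'Other', 'gdp': others}])
    top10 = pd.concat([top10, other_row], ignore_index=True)
    fig = px.pie(top10, values='gdp', names='country', title=f'Top-10 share of global GDP - {year}',
                 color_discrete_sequence=px.colors.qualitative.Pastel)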
    return fig

year_choice = min(analysis_years)
fig_pie = pie_top10(df_long, year_choice)
pie_html = os.path.join(OUT_DIR, f'fig_pie_top10_{year_choice}.html')
fig_pie.write_html(pie_html)
print('Saved pie to', pie_html)
fig_pie

Interpretation:

- Pie chart emphasizes share concentration; the top 10 countries commonly account for a large portion of global GDP. Use this with caution because pie charts can hide distribution nuances; combine with bar charts for clarity.

## 3D Plot (if applicable)

If additional numeric columns are present (e.g., GDP per capita or growth rate), we can create a 3D scatter. Otherwise we'll construct a temporal 3D view with `year` on the x-axis, `country` on the y-axis, and GDP on the z-axis for a selected subset of countries.

In [None]:
# Create a 3D plot (x=year, y=country, z=gdp) for the top 10 countries
sample_countries = grouped.sort_values('gdp_avg', ascending=False).head(10)['country'].tolist()
df_3d = df_long[df_long['country'].isin(sample_countries)].dropna(subset=['gdp'])
fig_3d = px.scatter_3d(df_3d, x='year', y='country', z='gdp', color='country',
                     title='3D temporal view (year, country, gdp)',
                     color_discrete_sequence=px.colors.qualitative.Pastel)
fig_3d.update_traces(marker=dict(size=6))
three_d_html = os.path.join(OUT_DIR, 'fig_3d.html')
fig_3d.write_html(three_d_html)
print('Saved 3D plot to', three_d_html)
fig_3d

Interpretation:

- The 3D scatter uses year as the x-axis, country (categorical) on the y-axis and GDP on the z-axis to show how top-country GDPs evolve over time. Use rotation to explore temporal depth.

## Trend Analysis & Growth Rates

Compute year-over-year (YoY) growth rates and CAGR (2020→2025) when both endpoints are available. Plot YoY lines for top-10 countries.
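
As a worked example of both metrics, the sketch below uses plain Python on hypothetical values. YoY growth is `(current / previous - 1) * 100`, and CAGR is `((end / start) ** (1 / periods) - 1) * 100`, where `periods` counts the intervals between the endpoints (5 for 2020→2025). The function names are illustrative only.

```python
def yoy_pct(series):
    """Percent change between consecutive values; the first entry has no predecessor."""
    return [None] + [(cur / prev - 1) * 100 for prev, cur in zip(series, series[1:])]

def cagr_pct(start, end, periods):
    """Compound annual growth rate in percent over `periods` yearly intervals."""
    return ((end / start) ** (1 / periods) - 1) * 100

gdp = [100.0, 110.0, 121.0]   # hypothetical values for three consecutive years
growth = yoy_pct(gdp)
cagr = cagr_pct(gdp[0], gdp[-1], periods=len(gdp) - 1)

print(growth, round(cagr, 6))
```

When growth is constant, as in this toy series, the CAGR matches the YoY rate of roughly 10%; with uneven growth it smooths the path between the two endpoints.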

In [None]:
df_analysis_sorted = df_long.sort_values(['country','year']).copy()
df_analysis_sorted['gdp_lag1'] = df_analysis_sorted.groupby('country')['gdp'].shift(1)
df_analysis_sorted['yoy_pct'] = (df_analysis_sorted['gdp'] / df_analysis_sorted['gdp_lag1'] - 1) * 100

# Compute CAGR for countries that have both 2020 and 2025 values
pivot_for_cagr = df_long[df_long['year'].isin([2020, 2025])].pivot(index='country', columns='year', values='gdp')
if 2020 in pivot_for_cagr.columns and 2025 in pivot_for_cagr.columns:
    pivot_for_cagr = pivot_for_cagr.dropna(subset=[2020, 2025])
    pivot_for_cagr['abs_change'] = pivot_for_cagr[2025] - pivot_for_cagr[2020]
    pivot_for_cagr['CAGR_2020_2025_pct'] = ((pivot_for_cagr[2025] / pivot_for_cagr[2020]) ** (1/5) - 1) * 100
    cagr_summary = pivot_for_cagr[['abs_change','CAGR_2020_2025_pct']].sort_values('CAGR_2020_2025_pct', ascending=False)
    display(cagr_summary.head(10))
else:
    print('Not enough data to compute 2020-2025 CAGR for countries (missing either 2020 or 2025).')

# Plot YoY for top 10 average GDP countries
top10_countries = top10['country'].tolist()
df_top10_yoy = df_analysis_sorted[df_analysis_sorted['country'].isin(top10_countries)]
fig_yoy = px.line(df_top10_yoy, x='year', y='yoy_pct', color='country', title='YoY growth percentage - Top 10 countries',
                  color_discrete_sequence=px.colors.qualitative.Pastel)
yoy_html = os.path.join(OUT_DIR, 'fig_yoy_top10.html')
fig_yoy.write_html(yoy_html)
print('Saved YoY plot to', yoy_html)
fig_yoy

Interpretation:

- YoY lines reveal year-to-year volatility. Watch for extreme spikes or negative values indicating contraction. CAGR gives a compact view of multi-year growth.

## Country Comparison Panel (subplot function)
Function `plot_country_comparison` creates an interactive subplot showing each selected country's GDP across years with a consistent y-axis for easier comparisons.
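
The panel layout boils down to mapping a flat list of countries onto 1-based (row, col) grid positions, as Plotly's `make_subplots` expects. A minimal sketch, with the hypothetical helper `grid_positions`:

```python
import math

def grid_positions(n_items, cols=2):
    """Return the row count and the 1-based (row, col) slot for each item, filled row by row."""
    rows = math.ceil(n_items / cols)
    positions = [(i // cols + 1, i % cols + 1) for i in range(n_items)]
    return rows, positions

rows, positions = grid_positions(5, cols=2)
print(rows, positions)
```

With five countries and two columns this yields three rows, and the last row has a single occupied slot.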

In [None]:
def plot_country_comparison(countries_list):
    n = len(countries_list)
    cols = 2
    rows = math.ceil(n/cols)
    # shared_yaxes='all' links the y-range across every panel; True would only share within a row
    fig = make_subplots(rows=rows, cols=cols, subplot_titles=countries_list, shared_yaxes='all')
    r = 1
    c = 1
    for i, country in enumerate(countries_list):
        tmp = df_long[df_long['country']==country].sort_values('year')
        trace = go.Scatter(x=tmp['year'], y=tmp['gdp'], mode='lines+markers', name=country)
        fig.add_trace(trace, row=r, col=c)
        c += 1
        if c > cols:
            r += 1
            c = 1
    fig.update_layout(height=300*rows, title='Country comparison panel')
    return fig

sample_countries = top10['country'].tolist()[:6]
fig_cmp = plot_country_comparison(sample_countries)
cmp_html = os.path.join(OUT_DIR, 'fig_country_comparison.html')
fig_cmp.write_html(cmp_html)
print('Saved comparison panel to', cmp_html)
fig_cmp

Interpretation:

- Each subplot shows one country's GDP over time. Shared y-axis makes relative comparisons easier; look for diverging or converging trends.

## Animation → Create `.mp4`

Approach used:
- For robustness, we export frame by frame: for each analysis year we render a Plotly figure (top-N bar chart plus a global-average inset) and save a PNG frame using `fig.write_image(...)` (requires `kaleido`).
- Frames are stitched into `/kaggle/working/gdp_animation.mp4` using `imageio` (MP4 writing also needs `imageio-ffmpeg`). If either `kaleido` or `imageio` is missing, the notebook skips the PNG export and saves an interactive HTML animation via `fig.write_html(...)` instead.
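
The decision between the two export paths can be sketched as a small check. `has_module` and `choose_export_mode` are hypothetical helpers for illustration; the notebook itself relies on the `_HAS_KALEIDO` and `_HAS_IMAGEIO` flags set in the imports cell.

```python
import importlib.util

def has_module(name):
    """True if `name` is importable, checked without actually importing it."""
    return importlib.util.find_spec(name) is not None

def choose_export_mode(have_kaleido, have_imageio):
    """Both libraries are needed for PNG frames plus MP4; otherwise fall back to HTML."""
    return "mp4" if (have_kaleido and have_imageio) else "html"

mode = choose_export_mode(has_module("kaleido"), has_module("imageio"))
print("Export mode:", mode)
```

Keeping the decision in one place makes it easy to log which path a given run took.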

In [None]:
ANIM_YEARS = sorted(analysis_years)
FRAMES_DIR = os.path.join(OUT_DIR, 'frames')
os.makedirs(FRAMES_DIR, exist_ok=True)

def make_frame_for_year(year, top_n=20):
    # Create a combined figure showing top-N bar and a small inset global average marker
    topn = top_n_by_year(df_long, year, n=top_n)
    fig = make_subplots(rows=1, cols=2, column_widths=[0.7, 0.3], specs=[[{"type":"xy"}, {"type":"xy"}]])
    bar = go.Bar(x=topn['country'], y=topn['gdp'], marker_color='rgb(179,205,227)', name=f'Top {top_n}')
    fig.add_trace(bar, row=1, col=1)
    # Inset: global average point
    ga = global_avg[global_avg['year']==year]['gdp'].values
    ga_val = float(ga[0]) if len(ga)>0 else None
    if ga_val is not None:
        fig.add_trace(go.Scatter(x=[0], y=[ga_val], mode='markers+text', text=[f'Global avg: {ga_val:,.0f}'], textposition='top center', marker=dict(size=12, color='rgb(204,235,197)')), row=1, col=2)
    fig.update_layout(title_text=f'GDP snapshot - {year}', showlegend=False, template='simple_white')
    fig.update_xaxes(tickangle=45, row=1, col=1)
    return fig

frame_paths = []
if _HAS_KALEIDO and _HAS_IMAGEIO:
    print('kaleido and imageio present; exporting frames to PNG and stitching to MP4')
    for y in ANIM_YEARS:
        fig = make_frame_for_year(y, top_n=20)
        frame_path = os.path.join(FRAMES_DIR, f'frame_{y}.png')
        try:
            fig.write_image(frame_path, engine='kaleido', width=1280, height=720)
            print('Wrote frame', frame_path)
            frame_paths.append(frame_path)
        except Exception as e:
            print('Error writing frame with kaleido:', e)
    # Stitch frames to mp4
    mp4_path = os.path.join(OUT_DIR, 'gdp_animation.mp4')
    try:
        with imageio.get_writer(mp4_path, fps=1) as writer:
            for fp in frame_paths:
                img = imageio.imread(fp)
                writer.append_data(img)
        print('Saved animation to', mp4_path)
    except Exception as e:
        print('Failed to write MP4 via imageio:', e)
else:
    print('kaleido or imageio missing. Creating interactive frame-based HTML animation fallback.')
    # Create a Plotly animation (interactive) using frames and save HTML
    # Build a dataframe for animation (top 20 per year) and animate by year as a bar chart race
    
    anim_df_list = []
    for y in ANIM_YEARS:
        tmp = top_n_by_year(df_long, y, n=20).copy()
        tmp['year'] = y
        anim_df_list.append(tmp)
    anim_df = pd.concat(anim_df_list, ignore_index=True)
    fig_anim = px.bar(anim_df, x='country', y='gdp', color='country', animation_frame='year', range_y=[0, anim_df['gdp'].max()*1.05],
                      title='GDP Top-20 - animation (interactive)')
    anim_html = os.path.join(OUT_DIR, 'gdp_animation_interactive.html')
    fig_anim.write_html(anim_html)
    print('Saved interactive animation to', anim_html)
    display(fig_anim)  # a bare expression inside an else-block is not rendered by Jupyter

If frames were created and stitched to MP4 above, the video will be saved under `/kaggle/working/gdp_animation.mp4`. If not, an interactive HTML animation has been created as a fallback.

In [None]:
# Display mp4 if exists, otherwise show interactive animation HTML path
from IPython.display import HTML, Video, display
mp4_path = os.path.join(OUT_DIR, 'gdp_animation.mp4')
interactive_path = os.path.join(OUT_DIR, 'gdp_animation_interactive.html')
if os.path.exists(mp4_path):
    print('Displaying MP4:')
    # Embed the video bytes; an absolute filesystem path in a <video src=...> tag is not served to the browser
    display(Video(mp4_path, embed=True, width=800))
elif os.path.exists(interactive_path):
    print('Interactive animation HTML available at:', interactive_path)
    display(HTML(f"<a href='{interactive_path}' target='_blank'>Open interactive animation</a>"))
else:
    print('No animation output found. Check previous cell logs for issues (kaleido/imageio availability).')

## Save outputs & reproducibility

- Save cleaned long-format CSV to `/kaggle/working/df_long.csv`.
- Already saved interactive HTMLs for key figures under `/kaggle/working/`.
- If frames were generated, they are under `/kaggle/working/frames/` and the final MP4 is `/kaggle/working/gdp_animation.mp4`.

In [None]:
clean_csv_path = os.path.join(OUT_DIR, 'df_long.csv')
df_long_clean.to_csv(clean_csv_path, index=False)
print('Saved cleaned long-format CSV to', clean_csv_path)

# Summarize saved files
saved_files = [os.path.join(OUT_DIR, p) for p in os.listdir(OUT_DIR) if p.endswith('.html') or p.endswith('.csv') or p.endswith('.mp4')]
print('Saved output files:')
for f in saved_files:
    print('-', f)

## Final Results & Business Insights

- **Top economies dominate absolute GDP:** The top 10 countries account for a large share of global GDP; use per-capita or PPP normalization to compare productivity.
- **High year-to-year correlation:** GDP values are highly correlated year-to-year; large changes usually indicate real shocks or data/re-basing.
- **Missing data matters:** Countries with more than 30% missing years are flagged via `dropped_for_animation`; surface this flag when running automated reports.
- **Follow-ups:** Merge population for GDP per capita, integrate inflation / exchange-rate adjustments, and add more historical years for robust forecasting.
- **Forecasting caveat:** 6 annual points (2020–2025) are short for strong forecasting — consider longer time series for temporal models like Prophet or ARIMA.

## Reproducibility / README

To reproduce fully:
1. Ensure the dataset CSV is available under `/kaggle/input/<dataset>/` or update the path in the loading cell.
2. (Optional) Install `kaleido` and `imageio-ffmpeg` if you want per-frame PNG export and MP4 output:
   - `!pip install kaleido imageio imageio-ffmpeg`
3. Run notebook top-to-bottom. Exporting frames and writing MP4 may take a few minutes depending on CPU.

Extension ideas:
- Add population data to compute GDP per capita and re-run the visualizations.
- Add regional grouping and produce side-by-side choropleths.
- Implement clustering by GDP growth patterns and annotate clusters in visuals.
