# **Assignment 3: Using Plotly's Heatmap with ipywidgets Interactivity<br>to Analyze Changes in PolitiFact Claim Sources Over Time**

### Exploring PolitiFact fact checks of 21,152 statements from 2008 to 2022 using Plotly, an open-source graphing library
#### _Frankie Pike, The University of Michigan, SIADS 521, Data Source: [Kaggle](https://www.kaggle.com/datasets/rmisra/politifact-fact-check-dataset?resource=download)_
---

## Visualization Technique
An **interactive heatmap** is a graphical representation of data in which individual or aggregate values are represented by colors. Each cell in the grid corresponds to a data point or aggregate, and its color intensity reflects the magnitude of the value. The interactive component lets users filter by an additional variable to dynamically change the data being displayed, making it easier to explore and analyze patterns within the dataset. I implemented this technique by preparing the data in a Pandas DataFrame, building a static heatmap with the go.Heatmap trace from Plotly's Graph Objects module, and then making the chart interactive with the interactive function from ipywidgets.

Interactive heatmaps work well for datasets with multiple categorical variables or dimensions (e.g., different sources, verdicts, years). They help users drill down into specific subsets of data and see trends on multiple levels. They do have limitations: they are a poor choice when a categorical variable has too many unique values, or when the data ranges are highly variable or skewed, because overall trends become difficult to perceive quickly. Purely qualitative data is also not a good fit.

I also considered using a bar chart, which is straightforward and good for categorical data, or a bubble chart, which would add an extra dimension of information (size) to a scatter plot, but I went with an interactive heatmap because it allows clear visualization of patterns in aggregated data, which is useful for summarizing a large dataset such as this one. Additionally, a bar chart did not allow me to express all the variables I wanted to, and a bubble chart became cluttered and hard to interpret with so many data points.

## Visualization Library
In this demonstration, we're going to be using the **Plotly** and **ipywidgets** libraries, both of which are open source. Plotly was created by Plotly Inc., a company that develops interactive data visualization tools. ipywidgets was created by the IPython development team, which is part of Project Jupyter. You can install both with simple pip commands (below).

In [None]:
!pip install plotly
!pip install ipywidgets

Once installed, load the libraries. From Plotly, we will need Express and Graph Objects. From ipywidgets, we will need the package itself (imported as widgets) and its interactive function.

In [None]:
import plotly.express as px
import plotly.graph_objects as go

import ipywidgets as widgets
from ipywidgets import interactive

Plotly is primarily a declarative library. You define what you want to plot, and Plotly handles the rendering and interactions. ipywidgets, on the other hand, is a procedural library. It allows users to create interactive widgets (e.g., sliders, dropdowns) and link them to Python code. This procedural approach lets you define the interaction logic and update visualizations dynamically. Both libraries are easy to use in Jupyter. In fact, ipywidgets was specifically made for Jupyter. 

When choosing which visualization libraries to use, I also considered Seaborn for its ease of creating heatmaps and Bokeh for its advanced interactivity. But ultimately, I chose Plotly with ipywidgets for its seamless integration of interactive features directly within Jupyter notebooks, offering a more user-friendly and dynamic experience for exploring data. It is complicated to make Seaborn's static visualizations interactive, even integrated with other libraries, and I found Bokeh to be overly complex and difficult to implement in Jupyter.

## Demonstration 
I already downloaded the data from Kaggle and imported it to the workspace. It is saved as politifact_factcheck_data.json. The first thing we need to do is import the other necessary libraries and modules.

In [None]:
import pandas as pd
import numpy as np
import os

Let's read in the data to a Pandas DataFrame. While it is a JSON file, there are some anomalies in the formatting, so we'll read it as a string, fix the anomalies, then convert the fixed JSON to a DataFrame. After we do that, we'll preview it.

In [None]:
filename = 'politifact_factcheck_data.json'

# Read the entire JSON file as a string
with open(filename, 'r') as f:
    json_str = f.read()

# Replace consecutive JSON objects with comma-separated versions within square brackets
fixed_json_str = '[' + json_str.replace('}\n{', '},{') + ']'

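# Load the JSON data into a DataFrame
# (wrapped in StringIO because recent pandas versions deprecate passing a literal JSON string)
from io import StringIO
df = pd.read_json(StringIO(fixed_json_str))

# View DataFrame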
df.head()
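The string repair in the cell above can be checked on a tiny sample. This is a sketch using a made-up two-record string (not the real file) and only the standard library; it also shows that parsing the text line by line gives the same records, assuming each object sits on its own line.

```python
import json

# Hypothetical sample mimicking newline-separated JSON objects (values are made up)
sample = (
    '{"verdict": "true", "statement_source": "news"}\n'
    '{"verdict": "false", "statement_source": "speech"}'
)

# Same repair as in the cell above: join objects with commas and wrap in brackets
fixed = '[' + sample.replace('}\n{', '},{') + ']'
records = json.loads(fixed)

# Alternative under the one-object-per-line assumption: parse each line separately
records_by_line = [json.loads(line) for line in sample.splitlines() if line.strip()]

print(len(records), records == records_by_line)
```

If the real file does follow the one-object-per-line layout, pandas' read_json with lines=True may be a simpler route, but that depends on the anomalies in the file.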

According to Kaggle's documentation of the dataset, each record (i.e. row) consists of 8 attributes (i.e. columns):

+ **verdict:** the verdict of the fact check, in one of 6 categories: true, mostly-true, half-true, mostly-false, false, and pants-fire
+ **statement_originator:** the person who made the statement being fact checked
+ **statement:** the statement being fact checked
+ **statement_date:** the date when the statement being fact checked was made
+ **statement_source:** the source where the statement was made, in one of 13 categories: speech, television, news, blog, social_media, advertisement, campaign, meeting, radio, email, testimony, statement, and other
+ **factchecker:** the name of the person who fact checked the claim
+ **factcheck_date:** the date when the fact check article was published
+ **factcheck_analysis_link:** a link to the fact check analysis article

Based on this understanding, let's clean the data. We want the statement_date column to hold datetime values and the statement_source column to be categorical. We'll also keep only statements made in 2008 or later, matching the period under study. Finally, we won't need the factchecker, factcheck_date, or factcheck_analysis_link columns, so we'll drop them, which also reduces the memory footprint and speeds up computations.

In [None]:
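# Drop columns that are not needed for this analysis
df = df.drop(columns=['factchecker', 'factcheck_date', 'factcheck_analysis_link'])
# Convert statement_date to the datetime data type
df['statement_date'] = pd.to_datetime(df['statement_date'])
# Keep only statements made in 2008 or later (.copy() avoids a SettingWithCopyWarning below)
df = df[df['statement_date'].dt.year >= 2008].copy()
# Store statement_source as a categorical column
df['statement_source'] = pd.Categorical(df['statement_source'])
df.head()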
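As a quick sanity check of the cleaning steps, the sketch below applies the same three operations to a made-up three-row DataFrame (the rows are hypothetical, not taken from the dataset) and confirms the resulting data types.

```python
import pandas as pd

# Hypothetical rows; one of them falls before 2008 and should be filtered out
toy = pd.DataFrame({
    'statement_date': ['2008-06-11', '2020-11-03', '2007-01-05'],
    'statement_source': ['speech', 'social_media', 'speech'],
})

toy['statement_date'] = pd.to_datetime(toy['statement_date'])
toy = toy[toy['statement_date'].dt.year >= 2008].copy()
toy['statement_source'] = pd.Categorical(toy['statement_source'])

print(toy.dtypes)
print(list(toy['statement_source'].cat.categories))
```

Running df.dtypes on the real DataFrame after the cell above should show the same two data types for these columns.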

Before we get to the core question (and more complex implementation), let's do some brief exploratory data analysis to investigate how the number of statements from various sources has changed over time. We can aggregate the statement counts by date and source. While these statements don't encompass every potentially false claim, the fact checking experts at PolitiFact "select the most newsworthy and significant ones," according to their [website](https://www.politifact.com/article/2018/feb/12/principles-truth-o-meter-politifacts-methodology-i/). This selection process provides a meaningful representation of where the most newsworthy claims are being made.

With our aggregated data, we can use Plotly Express to make a simple line chart to visualize statement counts over time from the various sources.

In [None]:
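# Calculate the count of statements for each date and statement source
# (observed=False is set explicitly because pandas is changing the default for categorical keys;
# it keeps every source for every date, with a count of 0 where absent)
df_count = df.groupby(['statement_date', 'statement_source'], observed=False).size().reset_index(name='count')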

# Create the line plot using Plotly Express
fig = px.line(
    df_count,
    x='statement_date',
    y='count',
    color='statement_source',
    title='Statement Counts by Source Over Time',
    labels={
        'statement_date': 'Date',
        'count': 'Count',
        'statement_source': 'Statement Source'
    },
    template='plotly_white'
)


# Adjust layout to make the plot less wide
fig.update_layout(
    autosize=True,
    margin=dict(l=40, r=40, t=40, b=40),  # Adjust margins as needed
    width=1000,  # Adjust the width to make it less wide
    height=400,  # Adjust the height if needed
    legend_title_text='Statement Source'
)

# Show the plot
fig.show()
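To make the aggregation step in the cell above concrete, here is a sketch on a made-up four-row DataFrame. The source column holds plain strings here, so only the combinations that actually occur appear in the result.

```python
import pandas as pd

# Hypothetical statements: three on one day, one on the next
toy = pd.DataFrame({
    'statement_date': pd.to_datetime(
        ['2020-11-03', '2020-11-03', '2020-11-03', '2020-11-04']
    ),
    'statement_source': ['social_media', 'social_media', 'news', 'news'],
})

# One row per (date, source) pair, with the number of statements in 'count'
toy_count = (
    toy.groupby(['statement_date', 'statement_source'])
       .size()
       .reset_index(name='count')
)
print(toy_count)
```

This long format, with one row per date and source, is the shape px.line expects when color is mapped to a column.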

We can see that in the early to mid-2010s, most statements came from speeches or the news, but since 2019, social media statements have dominated. There was a notable spike in November 2020, during the U.S. presidential election. Conveniently, Plotly shows details about each data point on hover without any extra configuration.

That's great, but what if we want a bird's-eye view? With so many lines, it's a little hard to see changes. We can aggregate the data by year instead and replot it for a bigger-picture view.

In [None]:
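# Aggregate counts by calendar year and source
# (pandas 2.2+ prefers freq='YE'; the 'Y' alias is deprecated there but still works)
df_yearly = df.groupby([pd.Grouper(key='statement_date', freq='Y'), 'statement_source'], observed=False).size().reset_index(name='count')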

# Create the line plot using Plotly Express
fig = px.line(
    df_yearly,
    x='statement_date',
    y='count',
    color='statement_source',
    title='Statement Counts by Source Over Time (Yearly)',
    labels={
        'statement_date': 'Year',
        'count': 'Count',
        'statement_source': 'Statement Source'
    },
    template='plotly_white' 
)

# Adjust layout to make the plot less wide
fig.update_layout(
    autosize=True,
    margin=dict(l=40, r=40, t=40, b=40),  # Adjust margins as needed
    width=1000,  # Adjust the width to make it less wide
    height=400,  # Adjust the height if needed
    legend_title_text='Statement Source'
)

# Show the plot
fig.show()
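The yearly roll-up can also be sketched on made-up data. This version groups on the calendar year extracted with .dt.year, which is a swapped-in alternative to pd.Grouper that avoids frequency aliases; it yields integer years rather than year-end timestamps.

```python
import pandas as pd

# Hypothetical statements spread over two years
toy = pd.DataFrame({
    'statement_date': pd.to_datetime(
        ['2019-03-01', '2019-07-15', '2020-01-10', '2020-11-03', '2020-11-04']
    ),
    'statement_source': ['news', 'news', 'social_media', 'social_media', 'news'],
})

# Group by the calendar year of each statement and by source
toy_yearly = (
    toy.groupby([toy['statement_date'].dt.year.rename('year'), 'statement_source'])
       .size()
       .reset_index(name='count')
)
print(toy_yearly)
```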

This plot makes the gradual rise of social media statements, and the corresponding decline of speech and news statements, more evident. The chart illustrates four stages: 1) News dominated from 2010 to 2015, 2) Speeches briefly became the most common source during the 2016 Trump vs. Clinton election, 3) News regained prominence through 2017 and 2018, and 4) Social media has been the leading source since early 2019.

Finally, let's get to the core of the demonstration: making the interactive heatmap. My goal when creating this chart was to answer the question: From 2008 to 2022, how did the number of newsworthy statements from each source change by year and, in the aggregate, how accurate were they?

In [None]:
# Set verdict order from least true ('Liar, Liar, Pants on Fire') to most true ('True.')
verdict_order = ['pants-fire', 'false', 'mostly-false', 'half-true', 'mostly-true', 'true']

# Adjust DataFrame for heatmap by aggregating by year, verdict, and source
# (observed=False keeps every source for every year and verdict, with a count of 0 where absent)
df['statement_date_year'] = df['statement_date'].dt.to_period('Y').astype(str)
chartdf = df.groupby(['statement_date_year', 'verdict', 'statement_source'], observed=False).size().reset_index(name='count')

# Create a function to plot the heatmap
def plot_heatmap(selected_source):
    filtered_df = chartdf[chartdf['statement_source'] == selected_source]
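    heatmap_data = filtered_df.pivot_table(
        index='verdict',
        columns='statement_date_year',
        values='count',
        aggfunc='sum',  # counts should be summed, not averaged (the pivot_table default)
        fill_value=0
    )
    # Set the order for the y-axis; verdicts missing for this source become rows of 0
    heatmap_data = heatmap_data.reindex(verdict_order, axis=0, fill_value=0)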
    
    # Create heatmap using Plotly
    fig = go.Figure(data=go.Heatmap(
        z=heatmap_data.values,
        x=heatmap_data.columns,
        y=heatmap_data.index,
        colorscale='dense',
        text=heatmap_data.values,
        texttemplate="%{text}",
        textfont={"size":12}
    ))

    # Update layout to adjust width and height
    fig.update_layout(
        title=f'Heatmap for PolitiFact Claims from Source: {selected_source}',
        xaxis_nticks=len(heatmap_data.columns),
        yaxis_nticks=len(heatmap_data.index),
        xaxis_title='Year',
        yaxis_title='Verdict',
        font=dict(family="Helvetica, DejaVu Sans", size=12),
        title_font=dict(size=20, family='Helvetica, DejaVu Sans', color='black'),
        width=1200,  # Adjust the width as needed
        height=600  # Adjust the height as needed
    )

    fig.show()

# Create the dropdown widget
statement_sources = chartdf['statement_source'].unique().tolist()  # plain list for the widget options
source_dropdown = widgets.Dropdown(
    options=statement_sources,
    value=statement_sources[0],
    description='Source:'
)

# Link the dropdown to the plotting function
interactive_plot = interactive(plot_heatmap, selected_source=source_dropdown)
output = interactive_plot.children[-1]
output.layout.height = '600px'
interactive_plot
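The reshaping inside plot_heatmap is the step that turns long-format counts into the grid a heatmap needs. Here is a sketch on made-up counts for a single source (the numbers are hypothetical) that can be run without Plotly or ipywidgets to inspect the matrix that would be passed as z.

```python
import pandas as pd

verdict_order = ['pants-fire', 'false', 'mostly-false', 'half-true', 'mostly-true', 'true']

# Hypothetical aggregated counts for one source
toy = pd.DataFrame({
    'statement_date_year': ['2020', '2020', '2021', '2021'],
    'verdict': ['false', 'true', 'false', 'pants-fire'],
    'count': [5, 2, 7, 1],
})

# Rows become verdicts, columns become years, cells hold the counts
matrix = toy.pivot_table(
    index='verdict',
    columns='statement_date_year',
    values='count',
    aggfunc='sum',
    fill_value=0,
)

# Put verdicts in truthfulness order and add all-zero rows for missing verdicts
matrix = matrix.reindex(verdict_order, axis=0, fill_value=0)
print(matrix)
```

In this sketch, matrix.values, matrix.columns, and matrix.index play the roles of z, x, and y in go.Heatmap.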