### Harris vs Trump Debate Dataset
- **Description:** Transcript of a political debate between Kamala Harris and Donald Trump.
- **Data Source:** [harris_vs_trump_debate_with_extras.parquet](https://www.dropbox.com/scl/fi/tp551hfzo5xp20urs7b8x/harris_vs_trump_debate_with_extras.parquet?rlkey=4gep2vn60vv3wx5q11iq6hc3j&dl=1)
- **Potential columns for visualization:**
  - **X & Y Coordinates:** `tsne__x`, `tsne__y`, `pca__x`, `pca__y`
  - **Point Size:** `certainty`
  - **Color:** `speaker_color`
  - **Label:** `text`
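
Once the table is loaded, it can help to confirm that the columns listed above are actually present before passing them to a plot. The following is a minimal sketch: `ROLE_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration and are not part of cosmograph or cosmodata.

```python
# Hypothetical mapping from visual role to the candidate columns listed above
ROLE_COLUMNS = {
    "xy": ["tsne__x", "tsne__y", "pca__x", "pca__y"],
    "size": ["certainty"],
    "color": ["speaker_color"],
    "label": ["text"],
}


def missing_columns(available, role_columns=ROLE_COLUMNS):
    """Return {role: [columns not found in `available`]} for roles with gaps."""
    available = set(available)
    gaps = {}
    for role, columns in role_columns.items():
        absent = [c for c in columns if c not in available]
        if absent:
            gaps[role] = absent
    return gaps


# Made-up column list for the demo; with a real DataFrame, pass `data.columns`
example_columns = ["id", "text", "pca__x", "pca__y", "certainty"]
print(missing_columns(example_columns))
```

An empty dictionary means every candidate column was found.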

## Get data

### Data parameters

In [1]:
ext = '.parquet'
src = 'https://www.dropbox.com/scl/fi/tp551hfzo5xp20urs7b8x/harris_vs_trump_debate_with_extras.parquet?rlkey=4gep2vn60vv3wx5q11iq6hc3j&dl=1'
target_filename = 'harris_vs_trump_debate_with_extras.parquet'

### Install and import

In [2]:
import os
if not os.getenv('IN_COSMO_DEV_ENV'):
    %pip install -q cosmograph tabled cosmodata

import tabled
import cosmodata

from functools import partial
from cosmograph import cosmo

### Load data

In [3]:
if ext:
    getter = partial(tabled.get_table, ext=ext)
else:
    getter = tabled.get_table
# acquire_data takes care of caching locally too, so next time access will be faster
# (If you want a fresh copy, you can delete the local cache file manually.)
data = cosmodata.acquire_data(src, target_filename, getter=getter)

Fetching data from https://www.dropbox.com/scl/fi/tp551hfzo5xp20urs7b8x/harris_vs_trump_debate_with_extras.parquet?rlkey=4gep2vn60vv3wx5q11iq6hc3j&dl=1...
Data cached at: /Users/thorwhalen/.local/share/cosmodata/datasets/harris_vs_trump_debate_with_extras.parquet.pkl
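
The comment in the cell above says that `acquire_data` caches the data locally. The general fetch-once-then-reuse pattern can be sketched as below. This is an illustration only, not cosmodata's implementation: `cached_fetch` and `fake_fetch` are hypothetical names, and the sketch stores raw bytes, whereas the `.pkl` suffix in the log above suggests the real cache may store a pickled object.

```python
import tempfile
from pathlib import Path


def cached_fetch(key, fetch, cache_dir):
    """Return bytes for `key`, calling `fetch()` only when no cached copy exists."""
    cache_path = Path(cache_dir) / key
    if cache_path.exists():
        return cache_path.read_bytes()  # cache hit: skip the download
    content = fetch()  # cache miss: fetch once
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(content)  # store a copy for the next call
    return content


# Stand-in fetcher so the demo needs no network access
calls = []


def fake_fetch():
    calls.append(1)
    return b"fake parquet bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("demo.parquet", fake_fetch, tmp)
    second = cached_fetch("demo.parquet", fake_fetch, tmp)
    print(first == second, len(calls))
```

In this pattern, deleting the cached file forces the next call to fetch a fresh copy, which matches the manual refresh described in the cell comment.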


## Peek at the data

In [4]:
mode = 'short'  # one of 'short', 'sample', 'stats'
exclude_cols = []
cosmodata.print_dataframe_info(data, exclude_cols, mode=mode)

DataFrame shape: (1141, 20)
First row
------------------------------------------------------------
id                                                                     14
speaker                                                     KAMALA HARRIS
text                               So, I was raised as a middle-class kid
topic                                                             economy
token                   ['so', 'be', 'raise', 'a', 'middle-class', 'kid']
polarity                                                              0.0
subjectivity                                                          0.0
certainty                                                             1.0
pca__x                                                          -0.143514
pca__y                                                           0.135135
random_projection__x                                            -1.130084
random_projection__y                                              1.32939
tsne__x      

## Visualize data

### PCA Visualization of Speakers

This visualization places each text segment in PCA space: the X and Y coordinates are the two PCA-reduced dimensions (`pca__x`, `pca__y`). Points are colored by `speaker_color`, labeled with the speaker's name, and sized by `subjectivity` (substitute `certainty` to size by that score instead).

In [5]:
cosmo(
    data,
    point_x_by="pca__x",
    point_y_by="pca__y",
    point_color_by="speaker_color",
    point_size_by="subjectivity",
    point_id_by="id",
    point_label_by="speaker",
    point_color_palette=["red", "blue", "green", "orange"],
    point_size_range=[2, 10],
    point_size_scale=3,
)

Cosmograph(background_color=None, components_display_state_mode=None, focused_point_ring_color=None, hovered_p…
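
To build intuition for how a numeric column such as `subjectivity` could translate into point sizes within a range like `[2, 10]`, here is a small sketch of a linear min-max rescaling. This is an assumption made for illustration: `rescale_sizes` is a hypothetical helper, and cosmograph's internal size scaling may work differently.

```python
def rescale_sizes(values, size_range=(2, 10)):
    """Linearly map `values` onto `size_range` (smallest value -> smallest size)."""
    lo, hi = min(values), max(values)
    small, large = size_range
    if hi == lo:
        # No spread in the data: give every point the smallest size
        return [small for _ in values]
    return [small + (v - lo) / (hi - lo) * (large - small) for v in values]


# Made-up subjectivity scores for the demo
subjectivity = [0.0, 0.25, 0.5, 1.0]
print(rescale_sizes(subjectivity))  # [2.0, 4.0, 6.0, 10.0]
```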

### t-SNE Visualization of Topics

This plot shows how text segments cluster by topic in the t-SNE-reduced space, which gives a sense of which topics are discussed in similar terms. Each point is one text segment, colored by `topic_id`, labeled with its `topic`, and sized by its `certainty` score from the sentiment analysis.

In [6]:
cosmo(
    data,
    point_x_by="tsne__x",
    point_y_by="tsne__y",
    point_color_by="topic_id",
    point_size_by="certainty",
    point_id_by="id",
    point_label_by="topic",
    point_color_palette=["#FF9999", "#99FF99", "#9999FF"],
    point_size_range=[1, 8],
    point_size_scale=1,
)

Cosmograph(background_color=None, components_display_state_mode=None, focused_point_ring_color=None, hovered_p…
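
The palette in the cell above lists three colors, so it may be worth checking how many distinct topics the data contains. A minimal sketch follows, assuming the topic labels are available as a plain list (for example `data["topic"].tolist()` on the loaded DataFrame). `topic_counts` is a hypothetical helper, and the sample labels other than `economy` are made up.

```python
from collections import Counter


def topic_counts(topics):
    """Return (topic, count) pairs, most frequent first."""
    return Counter(topics).most_common()


# Made-up sample; only "economy" appears in the notebook output above
sample_topics = ["economy", "economy", "topic_b", "topic_c", "economy"]
counts = topic_counts(sample_topics)
print(counts)
print(len(counts), "distinct topics")
```

If the number of distinct topics exceeds the palette length, a longer palette would likely be needed to keep topics visually distinct.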

### Random Projection Visualization

This visualization shows the data points in the random-projection space, which gives a rough view of how well the projection preserves distances between points. Each point is colored by `topic_id`, labeled with the speaker, and sized by the `polarity` of its sentiment.

In [7]:
cosmo(
    data,
    point_x_by="random_projection__x",
    point_y_by="random_projection__y",
    point_color_by="topic_id",
    point_size_by="polarity",
    point_id_by="id",
    point_label_by="speaker",
    point_color_palette=["purple", "aqua", "yellow"],
    point_size_range=[1, 10],
    point_size_scale=2,
)

Cosmograph(background_color=None, components_display_state_mode=None, focused_point_ring_color=None, hovered_p…
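
If `polarity` is a signed score (this notebook does not state its range), sizing points by the raw value would make strongly negative segments the smallest. One option is to size by magnitude instead. The sketch below uses a hypothetical helper, `polarity_magnitude`, on made-up values. With the loaded DataFrame, a comparable step would be `data.assign(polarity_magnitude=data["polarity"].abs())`, followed by passing the new column name as `point_size_by`.

```python
def polarity_magnitude(polarity_values):
    """Return the absolute value of each polarity score."""
    return [abs(p) for p in polarity_values]


# Made-up polarity scores for the demo
sample_polarity = [-0.8, 0.0, 0.3]
print(polarity_magnitude(sample_polarity))  # [0.8, 0.0, 0.3]
```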