# Visualization

## Loading the Palmer Penguins Data

In [1]:
import polars as pl

pengs = pl.read_csv("../data/penguins.csv")
pengs.sample(3)

| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | i64 | i64 | str |
| "Gentoo" | "Biscoe" | 49.9 | 16.1 | 213 | 5400 | "MALE" |
| "Gentoo" | "Biscoe" | 49.8 | 15.9 | 229 | 5950 | "MALE" |
| "Chinstrap" | "Dream" | 48.1 | 16.4 | 199 | 3325 | "FEMALE" |


### Behold. Crazy simple plotting.
Polars uses HoloViz (`hvplot`) as the plotting backend here.  
No explicit import of it is required, but the library must be installed in our .venv so that Polars can call it.

Note how simple and descriptive the plotting syntax is.

In [2]:
# note: requires pyarrow be installed (not in notebook, but in venv)
pengs.plot.scatter(x='bill_length_mm', y='bill_depth_mm', by='species')

### Slightly more complicated plotting
`holoviz` is again the backend, but here it uses `scipy` to estimate the probability density implied by the data.
No explicit import is required, but if `scipy` were not installed in our .venv, `holoviz` could not call it and the cell would raise an error.

(KDE: kernel density estimate)

In [3]:
# note: requires scipy be installed (not in notebook, but in venv)
pengs["flipper_length_mm"].plot.kde()
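
For intuition about what a kernel density estimate is, here is a minimal pure-Python sketch of a Gaussian KDE. It is not what `holoviz` or `scipy` run internally: the sample values, the hand-picked `bandwidth`, and the helper name `gaussian_kde_sketch` are all assumptions made for illustration.

```python
import math


def gaussian_kde_sketch(samples, bandwidth):
    """Return a function that estimates the density at x from `samples`.

    Hypothetical helper: each sample contributes one Gaussian bump of
    width `bandwidth`, and the bumps are averaged.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up flipper lengths in mm, for illustration only (not the real dataset).
flippers = [181.0, 186.0, 195.0, 193.0, 190.0, 213.0, 229.0, 199.0]

density = gaussian_kde_sketch(flippers, bandwidth=5.0)  # bandwidth chosen by hand

# Evaluate the estimate on a 1 mm grid that comfortably covers the samples.
grid = [float(x) for x in range(150, 261)]
curve = [density(x) for x in grid]

# With a 1 mm step, the sum of the curve approximates the area under it.
print(round(sum(curve), 3))
```

A smaller `bandwidth` gives a bumpier curve and a larger one a smoother curve; real libraries usually pick it from the data rather than by hand.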

In [4]:
irisodes = pl.read_csv("../data/iris.csv")
irisodes.sample(3)
irisodes.plot.scatter(x='sepal.length', y='sepal.width', by='variety')

## Plotting Temporal Data

### Cleaning up some stock data
- convert time strings to Datetimes
- convert some pretty-printed numbers to actual numbers
- add a column with the company name
  - re-cast that column as a categorical
    - Note: there is almost certainly a simpler way to do this, but optimization is for later
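
As a rough illustration of the first two conversions, here is what they amount to for a single row in plain Python. The raw strings are assumed values shaped like the `%m/%d/%Y` dates and comma-separated volumes that the cells below parse; the Polars expressions apply the same idea to whole columns.

```python
from datetime import datetime

# Assumed raw values, shaped like the date and volume strings in the CSV.
raw_date = "04/06/2023"
raw_volume = "931,563"

parsed_date = datetime.strptime(raw_date, "%m/%d/%Y")  # string -> datetime
parsed_volume = int(raw_volume.replace(",", ""))       # "931,563" -> 931563

print(parsed_date, parsed_volume)
```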

**PagerDuty**

In [5]:
pduty = (
    pl.scan_csv("../data/stocks/stock_pagerduty.csv")
    .with_columns([
        pl.col("Date").str.strptime(pl.Datetime, "%m/%d/%Y"),        # convert string to datetime
        pl.col("Volume").str.replace_all(",", "").cast(pl.UInt32),   # convert "123,456" to unsigned int
        pl.lit("PagerDuty").alias("Company").cast(pl.Categorical),   # add company name as a categorical
    ])
    .collect()
)
print(pduty.sample(3))
print("Full dataframe: ", pduty.shape[0], " rows by ", pduty.shape[1], " cols")
pduty.plot.line(x='Date', y='Open', label="PagerDuty Stock Opening Prices")

shape: (3, 7)
┌─────────────────────┬───────┬───────┬───────┬───────┬─────────┬───────────┐
│ Date                ┆ Open  ┆ High  ┆ Low   ┆ Close ┆ Volume  ┆ Company   │
│ ---                 ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---     ┆ ---       │
│ datetime[μs]        ┆ f64   ┆ f64   ┆ f64   ┆ f64   ┆ u32     ┆ cat       │
╞═════════════════════╪═══════╪═══════╪═══════╪═══════╪═════════╪═══════════╡
│ 2023-04-06 00:00:00 ┆ 31.58 ┆ 32.42 ┆ 30.97 ┆ 32.2  ┆ 931563  ┆ PagerDuty │
│ 2023-03-31 00:00:00 ┆ 33.1  ┆ 35.33 ┆ 32.97 ┆ 34.98 ┆ 3114053 ┆ PagerDuty │
│ 2024-01-04 00:00:00 ┆ 21.02 ┆ 21.67 ┆ 20.81 ┆ 21.5  ┆ 1893486 ┆ PagerDuty │
└─────────────────────┴───────┴───────┴───────┴───────┴─────────┴───────────┘
Full dataframe:  252  rows by  7  cols


**PetCo** ("WOOF")

In [6]:
petco = (
    pl.scan_csv("../data/stocks/stock_petco.csv")
    .with_columns([
        pl.col("Date").str.strptime(pl.Datetime, "%m/%d/%Y"),        # convert string to datetime
        pl.col("Volume").str.replace_all(",", "").cast(pl.UInt32),   # convert "123,456" to unsigned int
        pl.lit("PetCo").alias("Company").cast(pl.Categorical),       # add company name as a categorical
    ])
    .collect()
)
print(petco.sample(3))
print("Full dataframe: ", petco.shape[0], " rows by ", petco.shape[1], " cols")
petco.plot.line(x='Date', y='Open', label="PetCo ('woof') Stock Opening Prices")

shape: (3, 7)
┌─────────────────────┬──────┬───────┬───────┬───────┬─────────┬─────────┐
│ Date                ┆ Open ┆ High  ┆ Low   ┆ Close ┆ Volume  ┆ Company │
│ ---                 ┆ ---  ┆ ---   ┆ ---   ┆ ---   ┆ ---     ┆ ---     │
│ datetime[μs]        ┆ f64  ┆ f64   ┆ f64   ┆ f64   ┆ u32     ┆ cat     │
╞═════════════════════╪══════╪═══════╪═══════╪═══════╪═════════╪═════════╡
│ 2023-07-11 00:00:00 ┆ 9.31 ┆ 9.595 ┆ 9.31  ┆ 9.47  ┆ 1173954 ┆ PetCo   │
│ 2023-03-28 00:00:00 ┆ 7.95 ┆ 8.475 ┆ 7.945 ┆ 8.47  ┆ 4466267 ┆ PetCo   │
│ 2023-04-17 00:00:00 ┆ 9.6  ┆ 9.88  ┆ 9.52  ┆ 9.63  ┆ 3514961 ┆ PetCo   │
└─────────────────────┴──────┴───────┴───────┴───────┴─────────┴─────────┘
Full dataframe:  252  rows by  7  cols


### Concatenate our Data, and plot by the Company Names we added

In [7]:
pdpc = pl.concat([pduty,petco])
pdpc.plot.line(x="Date", y="Open", by="Company", label="PagerDuty & PetCo -  Raw Stock Value")



### Let's do some basic analysis
It would be fun to compare stocks on 'their own scale', as it were.
Let's just compute each company's overall mean and normalize its prices by it.

Logic note:
(If we were looking for subtler effects, there would be better ways of doing this that don't succumb to edge effects or temporal outliers, but dividing by the mean is a great, simple scaling operation for getting a general sense of the data.)

Machine note:
(This does not use Polars' performance optimizations: the transforms run as separate eager steps rather than through the 'lazy' API, which would allow internal optimization. Storing the means repeatedly on every row also adds inefficiency, both in storage and in unnecessary fetches. But, to make a lesson of it: don't over-optimize when exploring. Just explore. There's certainly more syntax to learn ahead, though.)

In [8]:

# per-company mean of each price column
pdpc_means = (
    pdpc.group_by("Company").agg([
        pl.col('Open').mean().alias('mean_open'),
        pl.col('High').mean().alias('mean_high'),
        pl.col('Low').mean().alias('mean_low'),
        pl.col('Close').mean().alias('mean_close'),
    ])
)

# attach each company's means to every one of its rows
pdpc_ext = pdpc.join(pdpc_means, on="Company")

# divide each price by its company's mean
pdpc_ext = (
    pdpc_ext.with_columns([
        (pl.col('Open') / pl.col('mean_open')).alias('normd_open'),
        (pl.col('High') / pl.col('mean_high')).alias('normd_high'),
        (pl.col('Low') / pl.col('mean_low')).alias('normd_low'),
        (pl.col('Close') / pl.col('mean_close')).alias('normd_close'),
    ])
)

pdpc_ext.sample(3)

| Date | Open | High | Low | Close | Volume | Company | mean_open | mean_high | mean_low | mean_close | normd_open | normd_high | normd_low | normd_close |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| datetime[μs] | f64 | f64 | f64 | f64 | u32 | cat | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 2023-08-30 00:00:00 | 5.36 | 5.41 | 5.23 | 5.25 | 4744295 | "PetCo" | 7.215837 | 7.376 | 7.045042 | 7.207857 | 0.742811 | 0.73346 | 0.742366 | 0.728372 |
| 2023-10-04 00:00:00 | 3.75 | 3.82 | 3.68 | 3.77 | 6784946 | "PetCo" | 7.215837 | 7.376 | 7.045042 | 7.207857 | 0.51969 | 0.517896 | 0.522353 | 0.52304 |
| 2023-05-30 00:00:00 | 8.07 | 8.195 | 7.82 | 7.88 | 2727189 | "PetCo" | 7.215837 | 7.376 | 7.045042 | 7.207857 | 1.118373 | 1.111036 | 1.11 | 1.093251 |


In [9]:
pdpc_ext.plot.line(x="Date", y="normd_open", by="Company", label="PagerDuty & PetCo -  Normalized Stock Value")



## Interactives with `Panel`

In [10]:
# import panel as pn

## Other

In [11]:
import hvplot.polars
pengs.hvplot()