## Pandas with some seaborn

#### Topics covered:


- the dataframe: basic properties and manipulations
- IO
- intermediate dataframe manipulation
- visualization with Seaborn
- my opinion of Pandas has actually improved significantly since last year :o

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
pet_data = pd.DataFrame(
    {
        "animal": "cat dog cat fish dog cat cat".split(),
        "size": list("SSMMMLL"),
        "weight": [8, 10, 11, 1, 20, 12, 12],
        "adult": [False] * 5 + [True] * 2,
    }
)

In [None]:
pet_data

In [None]:
pet_data.head()

In [None]:
pet_data.describe()

#### accessing data: loc vs. iloc

In [None]:
pet_sub_data = pet_data.iloc[2:4, :2]
pet_sub_data

Note that the index does not reset after slicing, which can be confusing and lead to problems later!
For example, df.loc[] with a number looks that number up among the actual index labels rather than the positions 0, 1, 2, ...
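
A minimal sketch of that pitfall. The small `pets` frame is a stand-in rebuilt here so the block runs on its own; it is not the notebook's `pet_data`.

```python
import pandas as pd

pets = pd.DataFrame(
    {
        "animal": "cat dog cat fish dog".split(),
        "weight": [8, 10, 11, 1, 20],
    }
)

# Positional slice: keeps rows 2 and 3, and they keep their original labels.
subset = pets.iloc[2:4]
print(subset.index.tolist())

# .iloc counts positions within the subset, so 0 is the first remaining row.
first_by_position = subset.iloc[0]["animal"]

# .loc looks up index labels, so label 2 refers to that same row...
first_by_label = subset.loc[2]["animal"]

# ...while label 0 no longer exists in the subset.
try:
    subset.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position, first_by_label, label_zero_found)
```

Calling reset_index(drop=True) on the subset, as in the next cell, makes labels and positions line up again.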

In [None]:
pet_sub_data_reset = pet_data.iloc[2:4, :2].reset_index(drop=True)
pet_sub_data_reset

In [None]:
pet_sub_data.iloc[1]

In [None]:
pet_data.loc[2:4, ["animal", "size"]]

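Why is the slice suddenly inclusive? Because .loc slices by label rather than by position, and label slices include both endpoints, unlike .iloc and plain Python slicing.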
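
A small sketch of the difference between the two slicing rules. The frames `lettered` and `numbered` are made up for illustration.

```python
import pandas as pd

lettered = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=list("abcde"))

# Positional slicing follows Python rules: the stop position is excluded.
by_position = lettered.iloc[1:3]

# Label slicing includes both endpoints.
by_label = lettered.loc["b":"d"]

print(by_position.index.tolist())
print(by_label.index.tolist())

# With the default integer index the two look alike, which is where the surprise comes from.
numbered = pd.DataFrame({"value": [10, 20, 30, 40, 50]})
n_rows_iloc = len(numbered.iloc[2:4])
n_rows_loc = len(numbered.loc[2:4])
print(n_rows_iloc, n_rows_loc)
```

One way to read the design choice: with arbitrary labels there is no general notion of "the label after d", so an exclusive stop would be awkward to express.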

In [None]:
pet_data.loc[pet_data["weight"] > 10, ["animal", "weight", "adult"]]

### IO with Pandas

In [None]:
tennis_data = pd.read_csv("tennis_example.csv")

In [None]:
tennis_data

In [None]:
tennis_data.dtypes

In [None]:
tennis_data["Date"] = pd.to_datetime(tennis_data["Date"], format="%d/%m/%y")

In [None]:
tennis_data

- If your data is large, please avoid CSVs: they are slow to read, take up a lot of space, and don't store your data types. There are lots of file formats you can use instead; the most common seems to be Parquet.
- DataFrame.to_latex may be of use. For example, I have to generate a fair number of confusion tables for my work: I build them in NumPy, add column and index labels when instantiating the DataFrame, then export with to_latex. Extremely situationally useful.
- read_csv can read directly from a Google Sheet :o
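
To make the point about data types concrete, here is a sketch of a CSV round trip through an in-memory buffer. The frame and its column names are invented for the example. Parquet is not shown because it needs an optional engine such as pyarrow to be installed.

```python
import io

import pandas as pd

original = pd.DataFrame(
    {
        "when": pd.to_datetime(["2023-08-25", "2023-08-27"]),
        "size": pd.Categorical(["S", "M"]),
        "weight": [8.0, 10.5],
    }
)

# Write to CSV and read straight back with default settings.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(original.dtypes)
print(restored.dtypes)  # the datetime and categorical types are gone

# The types have to be re-declared by hand on every read.
buffer.seek(0)
reparsed = pd.read_csv(buffer, parse_dates=["when"], dtype={"size": "category"})
print(reparsed.dtypes)
```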

### intermediate data manipulation

In [None]:
survey_data = pd.read_csv("https://docs.google.com/spreadsheets/d/1j9SmPqO514jTJ1IECrQikXRZ89dEEhQ96SzZoZi7njI/export?format=csv")
survey_data = survey_data.drop(["Email Address", "What is your name?"], axis=1)
survey_data["Timestamp"] = pd.to_datetime(survey_data["Timestamp"])

In [None]:
survey_data

In [None]:
# Keep only the last space-separated token of each column name.
survey_data = survey_data.rename(columns=lambda col: col.split(" ")[-1])

In [None]:
survey_data

In [None]:
# Mark every response as early, then unmark those submitted after the start of 26 Aug 2023.
survey_data["early"] = True
survey_data.loc[survey_data["Timestamp"] > pd.Timestamp("2023-08-26"), "early"] = False

In [None]:
survey_data

### Group by: the split-apply-combine paradigm

In [None]:
# split: groupby("early") / apply: agg(["mean", "min"]) on each group / combine: results are stitched into one frame
survey_transformed = survey_data.groupby("early")[survey_data.columns[1:-2]].agg(["mean", "min"])
survey_transformed
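
For intuition, here is a sketch of the three steps done by hand and then with groupby. The `pets` frame is a self-contained stand-in based on the pet example earlier in the notebook.

```python
import pandas as pd

pets = pd.DataFrame(
    {
        "animal": "cat dog cat fish dog cat cat".split(),
        "weight": [8, 10, 11, 1, 20, 12, 12],
    }
)

# Split: one sub-frame per distinct animal.
pieces = {name: pets[pets["animal"] == name] for name in pets["animal"].unique()}

# Apply: reduce each piece to a single number.
means = {name: piece["weight"].mean() for name, piece in pieces.items()}

# Combine: gather the per-group results into one labelled object.
by_hand = pd.Series(means, name="weight").rename_axis("animal")

# groupby does all three steps in one expression.
one_liner = pets.groupby("animal")["weight"].mean()

print(by_hand)
print(one_liner)
```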

In [None]:
survey_transformed.T.plot(kind="bar")
plt.show()

Also, the columns of this transformed data are now a MultiIndex! That is Pandas's way of representing higher-dimensional data.
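
A small sketch of what that looks like and a few ways to pull pieces out of it. The `pets` frame and its `age` column are made up for illustration.

```python
import pandas as pd

pets = pd.DataFrame(
    {
        "animal": "cat dog cat fish dog cat cat".split(),
        "weight": [8, 10, 11, 1, 20, 12, 12],
        "age": [1, 3, 2, 1, 5, 7, 9],  # hypothetical column
    }
)

summary = pets.groupby("animal")[["weight", "age"]].agg(["mean", "min"])

# Each column label is now a tuple: (original column, aggregation).
print(summary.columns.tolist())

# A full tuple selects a single column as a Series.
weight_mean = summary[("weight", "mean")]

# An outer label alone selects every aggregation for that column.
weight_all = summary["weight"]

# xs takes a cross-section on an inner level, here every "mean" column.
all_means = summary.xs("mean", axis=1, level=1)

print(weight_mean)
print(all_means)
```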

In [None]:
survey_transformed

In [None]:
survey_transformed.stack(level=-1)

In [None]:
survey_transformed.stack().stack()

In [None]:
survey_transformed

In [None]:
islice = pd.IndexSlice
survey_transformed.loc[:, (islice["[PyCharm]":"[Numba]"], "mean")]
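
The same IndexSlice pattern on a toy frame, as a sketch. The column names `a` to `d` are placeholders, not the survey columns.

```python
import pandas as pd

columns = pd.MultiIndex.from_product([["a", "b", "c", "d"], ["mean", "min"]])
frame = pd.DataFrame([[1, 2, 3, 4, 5, 6, 7, 8]], columns=columns)

idx = pd.IndexSlice

# Outer level: labels "b" through "c" (inclusive); inner level: only "mean".
selected = frame.loc[:, idx["b":"c", "mean"]]

# A bare colon on a level means "everything on that level".
every_min = frame.loc[:, idx[:, "min"]]

print(selected)
print(every_min)
```

IndexSlice only exists to let you write slice syntax inside the tuple; idx["b":"c", "mean"] is the same object as (slice("b", "c"), "mean").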

An alternative for higher-dimensional data: xarray

Its dimensions have names, so you can write x.sum('time') instead of remembering an axis number. That would be nice in NumPy!!

WARNING: created by geophysicists

An alternative for speed: Polars

(though Pandas 3.0 will be MUCH faster thanks to PyArrow)

Note: I wrote this a year ago and it is still not out, so uh

Also, Pandas is still MUCH faster than NumPy for non-numeric data: roughly 10x, depending on what you're doing.

### a few seaborn things


Seaborn is built around a few different meta-plot-types, each with their own subtypes:
![seaborn_plot_types](https://seaborn.pydata.org/_images/function_overview_8_0.png)
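
In that diagram the large boxes are figure-level functions and the entries inside them are axes-level functions. A sketch of the practical difference, using a made-up `toy` frame:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 1, 4, 3], "group": list("aabb")})

# Figure-level: relplot creates its own figure and returns a FacetGrid.
grid = sns.relplot(data=toy, x="x", y="y", hue="group", kind="scatter")

# Axes-level: scatterplot draws onto an Axes you give it and returns that Axes.
fig, ax = plt.subplots()
returned = sns.scatterplot(data=toy, x="x", y="y", hue="group", ax=ax)

print(type(grid).__name__, type(returned).__name__)
```

This is why the axes-level functions are the ones to reach for when placing seaborn plots inside an existing matplotlib layout.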

In [None]:
sns.lmplot(data=survey_data, x="[PyCharm]", y="[argparse]", hue="early")
plt.show()

In [None]:
penguins = sns.load_dataset("penguins")
penguins

In [None]:
sns.pairplot(penguins, hue="species")
plt.show()

Seaborn interacts nicely with matplotlib: for example, you can draw seaborn plots into existing matplotlib figures.

In [None]:
f, axs = plt.subplots(1, 2, figsize=(8, 4), gridspec_kw=dict(width_ratios=[4, 3]), layout="constrained")
sns.scatterplot(data=penguins, x="flipper_length_mm", y="bill_length_mm", hue="species", ax=axs[0])
sns.histplot(data=penguins, x="species", hue="species", shrink=.8, alpha=.8, legend=False, ax=axs[1])
plt.show()

Also, for many of the modifications you might want, you don't have to get into the weeds of matplotlib!

In [None]:
peng_plot = sns.relplot(
    data=penguins,
    x="bill_length_mm", y="bill_depth_mm", hue="body_mass_g")
peng_plot.set_axis_labels("Bill length (mm)", "Bill depth (mm)", labelpad=10)
peng_plot.legend.set_title("Body mass (g)")
peng_plot.figure.set_size_inches(6.5, 4.5)
peng_plot.ax.margins(.15)
peng_plot.despine(trim=True)
plt.show()

Many more examples can be found at https://seaborn.pydata.org/examples/index.html