Do we remember our Python?
--------------------------

Remembering Python and good programming practices

* docstrings
* variable names
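
As a minimal sketch of both practices, the function below pairs a docstring with descriptive variable names. The function name `mean_sepal_length` and the sample values are made up for illustration.

```python
def mean_sepal_length(sepal_lengths_cm):
    """Return the arithmetic mean of a list of sepal lengths.

    Args:
        sepal_lengths_cm: Non-empty list of sepal lengths in centimeters.

    Returns:
        The mean length in centimeters, as a float.
    """
    total_length_cm = sum(sepal_lengths_cm)
    number_of_flowers = len(sepal_lengths_cm)
    return total_length_cm / number_of_flowers


# Hypothetical measurements, for illustration only
sample_lengths_cm = [5.0, 4.5, 5.5]
print(mean_sepal_length(sample_lengths_cm))

# The docstring is what help(mean_sepal_length) would display
print(mean_sepal_length.__doc__)
```

Names such as `total_length_cm` carry the unit and the meaning, so a reader does not have to trace the code to learn what the value holds.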


Do we know how GitHub works?
----------------------------------------------

Git-based workflow:
* Make your own copy on GitHub: fork the repository (a GitHub feature, not a git command)
* Get the code to work on: git clone
* Work, work (not too long!)
* Save your changes: git add, then git commit
* Send your changes to GitHub: git push origin main

GitHub access control:
* SSH keys (preferred!)
* Tokens

(See also: https://docs.github.com/en/authentication/connecting-to-github-with-ssh; https://code.visualstudio.com/docs/editor/versioncontrol)


Looking at Data with numpy
----------------------------------------

Do *not* use this for your project (yet!)

In [1]:
import numpy
import seaborn

In [2]:
# Let's load the iris data

iris = seaborn.load_dataset('iris')

iris

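|  | sepal_length | sepal_width | petal_length | petal_width | species |
|----|----|----|----|----|----|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |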


What are some types of data?
----------------------------

| Type | Example Values | Example Application | 
|----|-------------|--------------|
| Quantitative Continuous |  |  |
| Quantitative Discrete |  |  |
| Qualitative Nominal |  |  |
| Qualitative Ordinal |  |   |

What are the types of the columns in the iris dataset?
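
One way to start on this question is to ask pandas how it stores each column. The sketch below uses a tiny hand-made table rather than the real iris data: the rows, the `petal_count` column, and the `size_class` column are all hypothetical and exist only to show one column per row of the table above.

```python
import pandas as pd

# Hypothetical rows, for illustration only; not taken from the iris dataset
flowers = pd.DataFrame({
    "sepal_length": [5.1, 6.3, 7.0],                    # quantitative continuous
    "petal_count": [5, 5, 6],                           # quantitative discrete
    "species": ["setosa", "virginica", "versicolor"],   # qualitative nominal
})

# An ordered categorical records that small < medium < large (qualitative ordinal)
flowers["size_class"] = pd.Categorical(
    ["small", "medium", "large"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(flowers.dtypes)
print(flowers["size_class"].max())  # meaningful only because the categories are ordered
```

The storage type is only a hint. An integer column can still hold category codes, so the statistical type has to come from knowing what the values mean.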

What about these forms of data?
* Text
* Dates/times
* Images

In [3]:
# Let's get some summary statistics for this data

iris['sepal_length'].agg(['min', 'mean', 'median', 'max', 'var'])

min       4.300000
mean      5.843333
median    5.800000
max       7.900000
var       0.685694
Name: sepal_length, dtype: float64

In [None]:
# What about features / variables that are not numeric?

iris['species'].unique()
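
For a qualitative column, the mean and variance are not defined, but counts are. A minimal sketch, using a short hypothetical Series in place of the real `species` column:

```python
import pandas as pd

# Hypothetical labels, for illustration only (not the real iris column)
species = pd.Series(["setosa", "setosa", "virginica", "versicolor", "setosa"])

print(species.nunique())       # number of distinct categories
print(species.value_counts())  # how often each category appears
print(species.mode()[0])       # most frequent category
```

The same three calls should work on `iris['species']`.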

Summary statistics can mislead
------------------------------

The example below comes from this great seaborn documentation: https://seaborn.pydata.org/examples/anscombes_quartet.html

In [None]:
seaborn.set_theme(style="ticks")

# Load the example dataset for Anscombe's quartet
df = seaborn.load_dataset("anscombe")

df

In [None]:
df.groupby('dataset').agg(['min', 'mean', 'median', 'max', 'var'])

In [None]:
# Show the results of a linear regression within each dataset
seaborn.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="colorblind", height=4,
           scatter_kws={"s": 50, "alpha": 1})

What... just happened?
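
A rough way to see it without plotting: the sketch below compares the y-values of the first two datasets using only the standard library. The numbers were typed in by hand from the commonly published version of Anscombe's quartet, on the assumption that they match the copy seaborn loads.

```python
import statistics

# y-values of datasets I and II from Anscombe's quartet, typed in by hand
y_first = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_second = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for label, values in [("I", y_first), ("II", y_second)]:
    print(
        label,
        round(statistics.mean(values), 2),      # nearly identical across datasets
        round(statistics.variance(values), 2),  # nearly identical across datasets
        statistics.median(values),              # differs between datasets
    )
```

The mean and variance agree to two decimal places while the median does not, and none of these numbers says anything about the shape that the plots reveal.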

Let's explore data visualization
----------------------

In [None]:
iris = seaborn.load_dataset('iris')
 
seaborn.pairplot(data=iris, x_vars=["petal_width"], y_vars=["petal_length"], height=5)

In [None]:
help(seaborn.pairplot)

What are some types of visualization?
----------------------------------------------------

Hint: What happens if you go to https://seaborn.pydata.org/examples/index.html or type help(seaborn)?

In [None]:
# Let's explore seaborn

Lying with visualizations
-------------------------

Hint: https://uxdesign.cc/a-beginners-guide-to-identifying-misleading-data-visualizations-d82a93211ac6

Being a good visualization creator
----------------------------------

* https://uxdesign.cc/how-to-design-data-visualizations-that-are-actually-valuable-e8b752835b9a
* https://www.tableau.com/about/blog/examining-data-viz-rules-dont-use-red-green-together
* https://seaborn.pydata.org/tutorial/color_palettes.html