
This repository contains illustrations to explain concepts in data (science).


cosimameyer/illustrations


Illustrations

License: CC BY 4.0

This folder contains illustrations that I generated to explain concepts in #stats, #rstats, and/or #python.

I'm very happy if you find these resources useful. I created the illustrations to make (more or less) complex topics more understandable, and you're more than welcome to use them under the CC BY license. Please attribute them by citing "Illustration by @cosima_meyer".

This work is licensed under a Creative Commons Attribution 4.0 International License.

General art

What I enjoy doing: Creativity, code, and puzzles

ALT: Image showing two people holding two puzzle pieces up to the sky (on one piece it says "Creativity", on the other "Code"), with a subtitle below the two people saying "What I enjoy doing ♥️"

R and Python 💛💙

ALT: Image showing a blue R with a pirate's hat and eye patch, with a snake (Python) wrapped around the R's leg

Hello Mastodon

ALT: A blue mastodon/elephant holding up a sign with "hello" written on it in handwriting

R

ALT: Image showing a blue R with a pirate's hat and eye patch

Writing functions in R

CheatSheet

ALT: Image showing what a general function in R looks like (a function has arguments, a function statement, and usually a return statement). Good practices when writing functions are: use meaningful names for your functions (it's good to use verbs); keep your function short and simple - each function should do one thing at a time; use an explicit return statement; and write assertions, warnings, and stops where helpful.
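These practices carry over beyond R. A minimal Python sketch of the same ideas (`count_words` is a made-up example function):

```python
def count_words(text):
    """Count the words in a text string.

    Illustrates the practices above: a meaningful, verb-based name,
    one job per function, an assertion on the input, and an
    explicit return statement.
    """
    # Fail early with a clear message instead of silently misbehaving.
    assert isinstance(text, str), "text must be a string"
    return len(text.split())
```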

Debugging in R

CheatSheet

ALT: Image showing a mole as an analogy for the debugging process (a mole digs in using debug(), stops when there is a browser(), and leaves the tunnel when you call undebug()). It also shows how the flow package works and that it gives you a visual overview of the "flow" of your package.

Writing a package in R

CheatSheet

ALT: A summary reiterating the basic structure in package development (DESCRIPTION, NAMESPACE, R/, man/, and tests/) as well as helpful packages (devtools, usethis, roxygen2, testthat, xpectr, covr, goodpractice, inteRgrate).

Shiny

UI and Server

ALT: An image showing a pseudo UI: ui <- fluidPage(titlePanel("Your Title"), sidebarLayout(sidebarPanel(...some content...), mainPanel(...place-your-plot...)))

ALT: An image showing a pseudo server: server <- function(input, output) { output$first_plot <- renderPlot({...create-your-plot...}) }

Visualization of reactivity (based on the excellent description by Garrett Grolemund)

ALT: GIF showing a carrier pigeon flying to the server to update a visualization when relevant

CheatSheet

ALT: A visual summary of Shiny apps

left side: the user interface (the body), which defines the outer appearance of the app. An image showing a pseudo UI: ui <- fluidPage(titlePanel("Your Title"), sidebarLayout(sidebarPanel(...some content...), mainPanel(...place-your-plot...)))

right side: the server (the brain), where all the calculation happens. An image showing a pseudo server: server <- function(input, output) { output$first_plot <- renderPlot({...create-your-plot...}) }

Git(Hub)

Workflow

ALT: Image showing the git workflow from the working directory to the remote repo (working directory → staging area → local repo → remote repo) along with common git commands (git add code.R, git commit -m "Update", git push, git pull, git checkout, git merge)

Branches

ALT: GIF showing how a feature branch evolves from a main branch and is then guided back (merged) into the main branch

GitHub and RStudio

ALT: Visualization showing a typical workflow when using GitHub in RStudio with a new project: 1) create a new repository on GitHub, 2) open the .Rproj in RStudio, 3) connect with GitHub - and now it's time to pull, commit, and push :)

GitHub and VS Code

ALT: Visualization showing a typical workflow when using GitHub in VS Code with a new project: 1) create a new repository on GitHub, 2) clone the repository in VS Code, 3) connect with GitHub - and now it's time to pull, commit, and push :)

CheatSheet

ALT: Visual summary of how to use GitHub in and with RStudio. Left side: image showing the git workflow from the working directory to the remote repo (working directory → staging area → local repo → remote repo) and common git commands (git add code.R, git commit -m "Update", git push, git pull, git checkout, git merge). Right side: visualization showing a typical workflow when using GitHub in RStudio with a new project: 1) create a new repository on GitHub, 2) open the .Rproj in RStudio, 3) connect with GitHub - and now it's time to pull, commit, and push :)

NLP

Terms and concepts

ALT: Image showing a visual overview of terms and concepts explaining a corpus, tokens, tokenization, DFM, stemming, and lemmatization. The verbalized version:
  • Corpus: When you have your text data ready, you have your corpus. It's a collection of documents.
  • Tokens: Tokens define each word in a text (but a token could also be a sentence, paragraph, or character).
  • Tokenization: Tokenization means splitting the sentences up into single words (tokens) and turning them into a bag of words. You can take this quite literally - a bag of words does not really take the order of the words into account. There are ways to account for the order using n-grams (for instance, a bigram would turn the sentence "Rory lives in a world of books" into "Rory lives", "lives in", "in a", "a world", "world of", "of books"), but it's limited.
  • Document-feature matrix (DFM): To generate the DFM, you first split the text into its single terms (tokens), then count how frequently each token occurs in each document.
  • Stemming: With stemming, you reduce a word to its stem.
  • Lemmatization: With lemmatization, it's slightly different. Instead of "stud" (the stem of the study terms), you end up with a meaningful base form - "study".
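The terms above can be sketched in a few lines of plain Python (the toy corpus is invented for illustration):

```python
from collections import Counter

# A corpus is just a collection of documents.
corpus = ["Rory lives in a world of books", "Rory reads books"]

# Tokenization: split each document into single words (tokens).
tokens = [doc.lower().split() for doc in corpus]

# Bigrams keep a little word order: pairs of adjacent tokens.
bigrams = [list(zip(t, t[1:])) for t in tokens]

# A document-feature matrix boils down to per-document token counts.
dfm = [Counter(t) for t in tokens]
```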

BERT

ALT: Image showing two different workflows (bag of words and BERT). The main difference is that with BERT you build on a pre-trained model and tokenizer, while with BOW you often have to train a model from scratch.

ALT: Image showing three important components to know when training a BERT model. First, with BERT, you identify the order of the input. You give the model information about different embedding layers: the token embedding (BERT uses special tokens, [CLS] and [SEP], to make sense of the sentence), the positional embedding (where each token is placed in the sentence), and the segment embedding (which tells you which sentence each token belongs to). And then there is the training: the first half involves masking words (Masked LM). During this phase you mask one word at a time and the model learns which word usually follows. During the second half, you train the model to predict the next sentence. This way, the model learns which sentences usually follow each other.
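As a toy illustration of how the special tokens frame a sentence pair (not a real BERT tokenizer, which would also split words into subword pieces):

```python
def frame_sentence_pair(sentence_a, sentence_b):
    """Frame two sentences the way BERT expects its input.

    [CLS] marks the start of the input; [SEP] separates (and ends)
    the two sentences, e.g. for next-sentence prediction.
    """
    return (["[CLS]"] + sentence_a.split() + ["[SEP]"]
            + sentence_b.split() + ["[SEP]"])
```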

These visualizations are also available in blue:

Explainable AI/ML

ALT: A visualization of six model-agnostic approaches to explain machine learning models post hoc:
  • Feature importance: Feature importance is based on the idea of permutation where you shuffle the values of a feature. If this change increases the model error, the feature is perceived to be important.
  • Shapley value: Shapley values are based on a game theoretical approach that calculates the average of all marginal contributions to all possible outcomes.
  • LIME: LIME plots tell you locally around a data point what the most important feature is. While they may look similar to SHAP, they are only an approximation (calculated on a small set of features) and do not provide a guarantee of accuracy and consistency.
  • ICE: ICE plots show the individual conditional expectation where all other features are kept the same and the effects for one feature are calculated.
  • Partial dependence: Partial dependence plots visualize the average output of the model for each target feature value across the entire dataset.
  • Breakdown plot: Breakdown plots show the contribution of every variable to the final prediction.
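Permutation feature importance, for instance, can be sketched in a few lines of plain Python (the toy data and stand-in model are made up for illustration):

```python
import random

random.seed(0)

# Toy data: the target depends on feature 0 only; feature 1 is noise.
X = [[random.random(), random.random()] for _ in range(200)]
y = [3 * row[0] for row in X]

def model(row):
    # A stand-in for a fitted model; here it happens to be exact.
    return 3 * row[0]

def mean_squared_error(X, y):
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(X, y, feature):
    """Shuffle one feature's values; the increase in error is its importance."""
    shuffled = [row[:] for row in X]
    column = [row[feature] for row in shuffled]
    random.shuffle(column)
    for row, value in zip(shuffled, column):
        row[feature] = value
    return mean_squared_error(shuffled, y) - mean_squared_error(X, y)
```

Shuffling the informative feature raises the error noticeably, while shuffling the noise feature leaves it unchanged.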

ALT: The visualization shows the logic of integrated gradients. You start with your baseline, which does not have any effect on the model classification, and continue stepwise using linear interpolation to get to the original input. Along the way, you calculate the model's prediction, compare it to the baseline, and derive the integrated gradients for each input feature by summing up the results of these calculations.
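The stepwise idea can be sketched for a one-dimensional input, using the hypothetical function f(x) = x², whose gradient is 2x:

```python
def integrated_gradients(grad, baseline, x, steps=100):
    """Approximate integrated gradients for a one-dimensional input.

    Walk from the baseline to the input via linear interpolation,
    average the gradient along the path, and scale by the total
    change in the input.
    """
    total = 0.0
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline)
        total += grad(point)
    return (x - baseline) * total / steps

# For f(x) = x^2 and baseline 0, the attribution should approximate
# the change in the prediction: f(3) - f(0) = 9.
attribution = integrated_gradients(lambda z: 2 * z, baseline=0.0, x=3.0)
```

By the completeness property, the attributions sum to the difference between the prediction at the input and at the baseline, which the approximation recovers up to discretization error.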

Amazing Women

The following illustrations are part of a larger project in which I aim to make women more visible in the world of programming, statistics, and STEM in general.
