## Jupyter notebooks

Documentation: https://docs.jupyter.org/en/latest/

Jupyter notebooks are arguably among the most widely used web applications in the Python ecosystem!

<img src="https://jupyter.org/assets/homepage/main-logo.svg" alt="Jupyter Logo" width="200"/>

Jupyter notebooks support three main types of cells:

- **Code Cells**: Used to write and execute code (e.g., Python, R). The output appears directly below the cell.
- **Markdown Cells**: Used for formatted text, documentation, equations (LaTeX), images, and links.
- **Raw Cells**: Contain plain text that is not executed or rendered as Markdown. Useful for notes, instructions, or content to be processed by external tools.

You can switch between cell types using the toolbar or keyboard shortcuts.

#### Jupyter Server and Kernel

- **Jupyter Server**: The Jupyter server is the backend application that manages your notebooks, files, and computational resources. It provides the web interface, handles requests from your browser, and communicates with kernels to execute code. When you start Jupyter Notebook or JupyterLab, you are launching a Jupyter server.

- **Kernel**: A kernel is a separate process that executes your code. Each notebook is connected to one kernel, and kernels are available for different programming languages (e.g., Python, R, Julia). The kernel receives code from the notebook interface, executes it, and returns the output (results, plots, errors) back to the notebook.

**How they work together:**  
When you run a code cell in a notebook, the Jupyter server sends the code to the kernel. The kernel executes the code and sends the output back to the server, which then displays it in your browser.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

#### Understanding the available datasets

#### 1. Penguins dataset

- The penguins dataset contains measurements for three species of penguins (Adelie, Chinstrap, Gentoo) observed on different islands in the Palmer Archipelago, Antarctica.
- It includes features such as bill length and depth, flipper length, body mass, sex, and species.
- This dataset is commonly used for data visualization and machine learning exercises as an alternative to the classic iris dataset.
- It provides a real-world example for exploring classification, visualization, and data cleaning techniques.

In [None]:
penguins_df = pd.read_csv('data/penguins.csv')

In [None]:
penguins_df.head()
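
Before plotting, it often helps to check the size, column types, and missing values of a freshly loaded frame. The sketch below uses a tiny hand-made frame rather than `data/penguins.csv`. Its column names follow the seaborn version of the dataset and its values are invented for illustration, so treat both as assumptions about the actual file.

```python
import pandas as pd

# Hypothetical stand-in for penguins_df; values are illustrative, not real measurements.
sample_df = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
        "island": ["Torgersen", "Torgersen", "Biscoe", "Dream"],
        "body_mass_g": [3750.0, None, 5000.0, 3500.0],
        "sex": ["Male", None, "Female", "Female"],
    }
)

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # column types as inferred by pandas

# Count missing values per column, then drop rows with any missing value.
missing_per_column = sample_df.isna().sum()
clean_df = sample_df.dropna()

print(missing_per_column)
print(len(clean_df))
```

The same calls should work on `penguins_df` once the CSV is loaded. Whether dropping incomplete rows is appropriate depends on the analysis.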

#### 2. Car Crashes Dataset

- The car crashes dataset contains data on traffic accidents, including the number of crashes, injuries, and fatalities by state or region.
- It typically includes features such as total crashes, alcohol-involved crashes, speeding-related crashes, and population statistics.
- This dataset is widely used for data visualization, exploratory data analysis, and statistical modeling to understand factors contributing to road accidents.
- It provides a practical example for learning about correlation, regression, and geospatial analysis in Python.

In [None]:
# Load and explore the car crashes dataset
car_crashes = pd.read_csv('data/car_crashes.csv')

car_crashes.head(10)
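
Since this dataset is suggested for learning about correlation, here is a minimal sketch of computing a Pearson correlation between two numeric columns. The column names `total`, `speeding`, and `abbrev` are assumptions based on the seaborn version of the dataset, and the numbers are invented placeholders.

```python
import pandas as pd

# Hypothetical stand-in for car_crashes; numbers are invented for illustration.
sample_crashes = pd.DataFrame(
    {
        "abbrev": ["AA", "BB", "CC", "DD"],
        "total": [10.0, 15.0, 20.0, 25.0],
        "speeding": [3.0, 5.0, 6.0, 9.0],
    }
)

# Pearson correlation between two numeric columns (a value between -1 and 1).
corr_value = sample_crashes["total"].corr(sample_crashes["speeding"])
print(round(corr_value, 3))
```

For a full matrix across every numeric column, `DataFrame.corr(numeric_only=True)` is one option in recent pandas versions.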

#### 3. Chlorophyll Concentration Analysis

The `data/chla_subset.csv` dataset contains chlorophyll-a (chla) predictions for various water bodies, such as lakes and reservoirs. Each row represents a measurement event, including the following columns:

- `gnis_name`: Name of the water body (e.g., "Pepacton Reservoir", "Lake Montauk").
- `comid`: Unique identifier for the water body.
- `centroid_longitude` and `centroid_latitude`: Geographic coordinates of the water body's centroid.
- `date_acquired`: Date when the measurement or prediction was made.
- `predictions`: Predicted chlorophyll-a concentration (likely in µg/L).

This dataset is useful for analyzing spatial and temporal patterns of chlorophyll-a, which is an important indicator of water quality and algal biomass.

In [None]:
# Load and describe the chla dataset
chla = pd.read_csv('data/chla_subset.csv')
chla['date_acquired'] = pd.to_datetime(chla['date_acquired'])

# Show the first few rows
display(chla.head())
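
Converting `date_acquired` to a datetime type is what makes time-based grouping possible. As a rough sketch of that idea, the example below averages `predictions` per calendar month on a small invented frame. The water body names and values are placeholders, not rows from `data/chla_subset.csv`.

```python
import pandas as pd

# Hypothetical stand-in for chla; names and values are invented for illustration.
sample_chla = pd.DataFrame(
    {
        "gnis_name": ["Lake A", "Lake A", "Lake B", "Lake B"],
        "date_acquired": ["2020-06-01", "2020-06-15", "2020-07-01", "2020-07-20"],
        "predictions": [4.0, 6.0, 10.0, 14.0],
    }
)
sample_chla["date_acquired"] = pd.to_datetime(sample_chla["date_acquired"])

# Group by calendar month and average the predictions within each month.
monthly_mean = (
    sample_chla
    .groupby(sample_chla["date_acquired"].dt.to_period("M"))["predictions"]
    .mean()
)
print(monthly_mean)
```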

#### 4. Any other dataset of your choice

Here are some Seaborn datasets for inspiration: https://github.com/mwaskom/seaborn-data

## Exercise!

#### Objective: Create some data visualizations that we can reuse for the rest of the tutorial.

### If you selected the `Penguins` dataset

#### 1. Create a bar chart showing how many penguins of each species are in the dataset.
Hint: plt.bar(), sns.countplot() or px.bar().
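
One possible starting point with `plt.bar()` is to count rows per category first and then plot the counts. The sketch below runs on a small invented frame; the `species` column name is an assumption about `data/penguins.csv`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for penguins_df; only the species column matters here.
sample_species = pd.DataFrame(
    {"species": ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo"]}
)

# value_counts() returns one entry per species, sorted by frequency.
species_counts = sample_species["species"].value_counts()

fig, ax = plt.subplots()
bars = ax.bar(species_counts.index, species_counts.values)
ax.set_xlabel("Species")
ax.set_ylabel("Number of penguins")
ax.set_title("Penguins per species")
```

`sns.countplot()` does the counting step internally, so it needs only the raw column.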

#### 2. Make a histogram of penguin body mass to see the distribution.
Hint: plt.hist(), sns.histplot() or px.histogram().

#### 3. Create a scatter plot comparing bill length vs bill depth, with different colors for each species.
Hint: plt.scatter(), sns.scatterplot() or px.scatter().

#### 4. Make a box plot showing flipper length for each penguin species.
Hint: plt.boxplot(), sns.boxplot() or px.box().

#### 5. Create a line plot showing the average body mass for each species across different islands.
Hint: plt.plot(), sns.lineplot() or px.line().

### If you selected the `crashes` dataset

#### 1. Create a bar chart showing the total number of car crashes by state.
Hint: plt.bar(), sns.barplot() or px.bar().

#### 2. Make a histogram of insurance premiums to see the distribution across states.
Hint: plt.hist(), sns.histplot() or px.histogram().

#### 3. Create a scatter plot comparing total crashes vs speeding-related crashes, with state abbreviations as labels.
Hint: plt.scatter(), sns.scatterplot() or px.scatter().
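
Labeling points is the less obvious part of this exercise. With Matplotlib, one approach is to call `ax.annotate()` once per row, as sketched below. The column names `total`, `speeding`, and `abbrev` are assumed from the seaborn version of the dataset, and the values are invented.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for car_crashes; numbers are invented for illustration.
sample_crashes = pd.DataFrame(
    {
        "abbrev": ["AA", "BB", "CC", "DD"],
        "total": [10.0, 15.0, 20.0, 25.0],
        "speeding": [3.0, 5.0, 6.0, 9.0],
    }
)

fig, ax = plt.subplots()
ax.scatter(sample_crashes["total"], sample_crashes["speeding"])

# Place each abbreviation slightly up and to the right of its point.
for row in sample_crashes.itertuples(index=False):
    ax.annotate(
        row.abbrev,
        (row.total, row.speeding),
        xytext=(4, 4),
        textcoords="offset points",
    )

ax.set_xlabel("Total crashes")
ax.set_ylabel("Speeding-related crashes")
```

With Plotly, passing `text="abbrev"` to `px.scatter()` is an alternative that avoids the explicit loop.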

#### 4. Make a box plot showing the distribution of alcohol-related crashes.
Hint: plt.boxplot(), sns.boxplot() or px.box().

#### 5. Create a bar chart comparing insurance premiums vs insurance losses by state.
Hint: plt.bar(), sns.barplot() or px.bar().

### If you selected the `chlorophyll` dataset

#### 1. Create a histogram of chlorophyll-a predictions to see the distribution of concentration levels.
Hint: plt.hist(), sns.histplot() or px.histogram().

#### 2. Make a scatter plot showing the geographic distribution of water bodies using longitude and latitude coordinates.
Hint: plt.scatter(), sns.scatterplot() or px.scatter().

#### 3. Create a line plot showing how chlorophyll-a predictions change over time (by date_acquired).
Hint: plt.plot(), sns.lineplot() or px.line().

#### 4. Make a box plot comparing chlorophyll-a predictions across different water bodies (top 10 most frequent).
Hint: plt.boxplot(), sns.boxplot() or px.box().
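
Restricting the data to the most frequent water bodies can be done before plotting. A minimal sketch, on an invented frame with placeholder names, combines `value_counts()` with `isin()`:

```python
import pandas as pd

# Hypothetical stand-in for chla; names and values are invented for illustration.
sample_chla = pd.DataFrame(
    {
        "gnis_name": ["Lake A", "Lake A", "Lake A", "Lake B", "Lake B", "Lake C"],
        "predictions": [4.0, 6.0, 5.0, 10.0, 14.0, 2.0],
    }
)

top_n = 2  # the exercise asks for 10; 2 keeps this toy example readable

# value_counts() sorts by frequency, so head(top_n) gives the most frequent names.
top_names = sample_chla["gnis_name"].value_counts().head(top_n).index

# Keep only the rows that belong to those water bodies.
top_chla = sample_chla[sample_chla["gnis_name"].isin(top_names)]
print(sorted(top_chla["gnis_name"].unique()))
```

The filtered frame can then be passed to the plotting function of your choice with `gnis_name` on one axis and `predictions` on the other.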

#### 5. Create a scatter plot comparing longitude vs chlorophyll-a predictions, with different colors for different concentration ranges.
Hint: plt.scatter(), sns.scatterplot() or px.scatter().
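
To color by concentration range, the numeric `predictions` column first needs to be turned into categories. One way is `pd.cut()`, sketched below on invented values. The bin edges and labels are arbitrary placeholders chosen for this example, not established water-quality thresholds.

```python
import pandas as pd

# Hypothetical stand-in for chla; coordinates and values are invented.
sample_chla = pd.DataFrame(
    {
        "centroid_longitude": [-75.0, -74.5, -73.9, -72.1],
        "predictions": [1.5, 8.0, 25.0, 60.0],
    }
)

# Placeholder bin edges and labels; adjust them to the ranges you care about.
bin_edges = [0, 5, 20, float("inf")]
bin_labels = ["low", "medium", "high"]

# Each value is assigned to the interval (left, right] that contains it.
sample_chla["chla_range"] = pd.cut(
    sample_chla["predictions"], bins=bin_edges, labels=bin_labels
)
print(sample_chla)
```

The new `chla_range` column can be passed as `hue=` to `sns.scatterplot()` or as `color=` to `px.scatter()`.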

### If you selected a different dataset

- What are the 5 most important insights we can get from this?