# The Absolute Basics (Recap)

**Anyone who has already taken a Python course with me and/or is familiar with Jupyter Notebooks like this one can skip to the next section.**

This is a review / introduction to the Jupyter Notebooks user interface. It's meant to help you quickly get (re)acquainted with the setup. If you're already comfortable with Python & Jupyter Notebooks, feel free to go through this section more quickly or skip it entirely.

## Layout
This tutorial is set up in what's called a Jupyter Notebook (file type .ipynb). It contains two types of "cells":
- Markdown cells (like this one) contain text and instructions.
- Code cells (see below) contain... well... code. You can run them by clicking on a small play button or by clicking inside them and pressing SHIFT+ENTER or CTRL+ENTER.

What's nice about Jupyter Notebooks is that Markdown cells allow us to explain code easily, but more importantly, each code cell can be run individually. This way, you can write and test your code piece by piece and see your output (or error message) immediately. This makes it easy for us to debug and learn.

<font color='green'>**Try it out by running the code block below.**</font>

In [None]:
print("Welcome to the first hands-on exercise in the course 'Advanced Python - Data Analysis, Visualization, and Statistics in Python!'")

You'll see the printed output just below the code cell.

<font color='green'>**Use the empty cell below to try the print function yourself.**</font>

## Defining Variables

Python, like most programming languages, relies on variables (or "objects") for calculations. In simple terms, every variable has a *value* and a *type*. In Python, the type is inferred from the value you assign, so you don't have to state the data type of every object explicitly. This makes it a more accessible language for beginners.

Below, we define two variables, `a` and `b`.

- `a` is an *integer* with the value 3.
- `b` is a *string* (=text) with the value "Hello".

<font color='green'>**Run the cell below.**</font>

In [None]:
a = 3
b = "Hello"
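
To check which type Python inferred, the built-in `type()` function can be used. A minimal sketch that repeats the two definitions so it runs on its own:

```python
# Repeat the definitions so this snippet is self-contained
a = 3
b = "Hello"

# type() reports the type Python inferred from the assigned value
print(type(a))  # <class 'int'>
print(type(b))  # <class 'str'>
```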

## Calling Variables

We have now stored the values 3 and "Hello" in the variables `a` and `b`. We can retrieve these stored variables by *calling* them.

<font color='green'>**Run the cell below to call `a` and display its value on the screen.**</font>

(Note: Jupyter runs all the code in your code cell but, by default, only displays the value of the last expression. To show anything else, use `print()` explicitly.)

In [None]:
a

Now we also want to call the variable `b`.

<font color='green'>**Create a new code cell right below this one and call the variable `b`.**</font>


You can create new cells by clicking the plus (+) button in the toolbar.

We can define new variables based on previously defined ones. For example, we can define `c = a + 10`.

In [None]:
c = a + 10
c

Great job! 🥳 You've now grasped the basics of the Jupyter interface and how to define variables.

## Handling Errors
Python may be a forgiving programming language, but it's still a programming language. That means if you don't follow its rules, it will return an error and won't run your code.

This can be tricky when you're starting out, as even a single typo can stop your code from working. Fortunately, Python gives some hints about why your code failed. So, when you encounter an error, try to follow these steps:
- Relax. Errors happen all the time, even to the best of us.
- Read the error message (the beginning and the end of the message especially provide helpful clues).
- Try to determine the cause of the error.
  - Sometimes the error message might tell you something.
  - Sometimes you just have to look closely.
  - Sometimes it can be helpful to break down a more complex sequence of steps to see exactly where your code is failing.
- If you can't fix the code yourself, Stack Overflow (stackoverflow.com) is your friend. Search for your error message and see what advice you find there. Everyone does this.
- ChatGPT is also an excellent debugging tool. You can paste in code, ask it to explain what each line does, or have it automatically fix your code.
  - But try not to just copy-paste; really try to understand what makes the code work and what doesn't.
- If all else fails, try asking another person for help. Sometimes a fresh pair of eyes is all it takes.

<font color='green'>**Now, see what's wrong with the following code and fix the error.**</font>

In [None]:
print("This is a really important message. Unfortunately, it won't execute.)

Remember your variables `a` and `b` that you defined above?

<font color='green'>**Call them again to remind yourself of their values.**</font>

<font color='green'>**Define the variable `d = a * 2`.**</font>

<font color='green'>**What happens now if you calculate `b * d`?**</font>

### Using Code Comments
Sometimes it's helpful to write little notes in your code to explain what's happening. To do this, type `#` at the beginning of your line. Such lines will not be executed as code.

In [None]:
# This line is just a comment and will not be executed
print("This line will be printed")

You can also use comments to "comment out" entire blocks of code, which is helpful when debugging.

In [None]:
#print("Line 1")
#print("Line 2")
print("Line 3")
#print("Line 4")

## Hooray! 🥳🎉
You've mastered the basic steps. Now that you're settled in, let's do something more interesting.

# Exploratory Data Analysis
We will delve into the beginnings of exploratory data analysis and learn how to calculate descriptive statistics for a variable as well as a correlation between two variables.

*Note*: This course is designed for advanced users. It assumes that you are already familiar with the basics of Python, especially `pandas`. If you don't know how to select columns and filter datasets with `pandas`, I recommend that you first take one of our basic courses. If you still want to participate, AI assistants like ChatGPT or Google Gemini can help you in areas where you have difficulties.

## Reading in Data
For practice, we'll use [this data](https://www.kaggle.com/datasets/steve1215rogg/student-lifestyle-dataset) from Kaggle. I have added one more column to it.

<font color='green'>**Read in the file `"01_student_lifestyle.csv"`.**</font>

Keep in mind that you first need to upload the file to the folder area (on the left). The CSV file must be unzipped for this (i.e., not packed inside a ZIP file). There is a short demo video in the course materials on GitHub if you have trouble with this.

In [None]:
import pandas as pd

In [None]:
# df = pd.read_csv('01_student_lifestyle.csv')
# Your code here

<font color='green'>**Briefly familiarize yourself with the DataFrame by looking at the columns, rows, and interesting values.**</font>

How and how deeply you do this is up to you.

In [None]:
# Your code here

## Descriptive Statistics
When we analyze data, we are often interested in the distribution of certain variables. A simple measure for this is provided by descriptive statistics such as mean, minimum, maximum, quantiles, etc.

For example, we might want to find out the average shopping cart value of our e-commerce customers, the earliest goal of our favorite team, or the top 1% of movies in a database.

Most of these descriptive statistics can only be applied to **numerical variables**. For **categorical and ordinal** variables, the mode, which gives the most frequent value, is particularly useful.

### Minimum and Maximum
Self-explanatory. Applies to numerical and ordinal variables.

We can find out how diligent the most and least hardworking students in our sample are.

In [None]:
df.Study_Hours_Per_Day.max()

In [None]:
df.Study_Hours_Per_Day.min()

If we want to know which row the values belong to, we use `idxmax()` and `idxmin()`. They return the **index** label, which is the identifier of the row, for the first occurrence of the maximum or minimum.

In [None]:
df.Study_Hours_Per_Day.idxmax()

In [None]:
# This is what the life of the most diligent student looks like.
# No wonder they're stressed, studying 10 hours a day ;)
# idxmax() returns an index label, so we look the row up with .loc
df.loc[df.Study_Hours_Per_Day.idxmax()]
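
As a side note, `idxmax()` returns an index *label*, not a row position. With the default numeric index the two coincide, but not in general. The following sketch uses a small made-up DataFrame (the names and the `hours` column are purely illustrative) to show why `.loc` is the matching lookup:

```python
import pandas as pd

# Made-up example data; names and values are purely illustrative
toy = pd.DataFrame(
    {"hours": [5.0, 9.5, 7.0]},
    index=["anna", "ben", "cara"],
)

label = toy.hours.idxmax()  # index label of the row with the largest value
print(label)

# .loc looks rows up by label, which is what idxmax() returns
print(toy.loc[label])
```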

<font color='green'>**Try these functions for another column of your choice.**</font>

In [None]:
# Your code here

### Averages
We've learned to identify the extremes of the distribution. Now let's get to know the center of the distribution.

Many of you might think of the mean when you hear "average." However, there are other averages that are useful in data analysis: the median and the mode.

A brief explanation:
- **Mean**
  - The sum of all values divided by the number of values. Also called the arithmetic mean.
  - Sensitive to outliers.
  - Applies only to numerical variables.
- **Median**
  - The value for which half of the observations are above it and the other half are below it.
  - Is not distorted by outliers.
  - Applies to ordinal and numerical variables, but is more meaningful for numerical ones.
- **Mode**
  - The most frequently occurring value. A variable can have more than one mode, which is why `.mode()` returns a Series rather than a single value.
  - Unlike the others, it can also be applied to categorical variables.
  - Applies to all variable types, but is more meaningful for categorical & ordinal ones.
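
To make the difference between the three averages more tangible, here is a small sketch with made-up numbers (not the course data). It shows how a single extreme value pulls the mean but leaves the median untouched, and that the mode also works for text categories:

```python
import pandas as pd

# Made-up study hours; the values are purely illustrative
hours = pd.Series([2, 3, 3, 4, 5])
hours_with_outlier = pd.Series([2, 3, 3, 4, 50])  # last value replaced by an extreme one

print(hours.mean(), hours.median())
print(hours_with_outlier.mean(), hours_with_outlier.median())

# The mode also works for categorical data
stress = pd.Series(["High", "Low", "High", "Moderate"])
print(stress.mode()[0])
```

The mean jumps from 3.4 to 12.4 once the outlier is included, while the median stays at 3.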

In [None]:
df.Study_Hours_Per_Day.mean()

In [None]:
df.Study_Hours_Per_Day.median()

In [None]:
df.Study_Hours_Per_Day.mode()

<font color='green'>**Try out the average statistics for variables of your choice.**</font>

In [None]:
# Your code here

### Other Statistics
- `.count()`: Returns the number of observations.
- `.std()`: Returns the standard deviation.
  - This measures the spread of the distribution. A small standard deviation means the values are close to the mean; a large one means they are spread far from it.
- `.quantile(q)`: Returns the value below which a share `q` of the observations falls.
  - For example, for `q=0.99`, we get the threshold that separates the top 1% of the distribution from the remaining 99%.
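
A minimal sketch of these three functions on made-up data (the whole numbers from 1 to 100, chosen only because the results are easy to check by hand):

```python
import pandas as pd

# Made-up data: the whole numbers from 1 to 100
values = pd.Series(range(1, 101))

print(values.count())         # number of observations
print(values.std())           # sample standard deviation
print(values.quantile(0.5))   # the 50% quantile is the median
print(values.quantile(0.99))  # threshold that separates the top 1%
```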

<font color='green'>**Try out the other statistics.**</font>

In [None]:
# Your code here

In [None]:
# Your code here

In [None]:
# Your code here

### A Quick Solution for Everything

Because the above statistics are often calculated together, there's a function that bundles this for us.

In [None]:
df.Study_Hours_Per_Day.describe()

<font color='green'>**Do the same for another column.**</font>

What happens if you choose a categorical column?

In [None]:
# Your code here

In [None]:
# Your code here

## Correlations

Correlations are a fundamental tool in statistics to examine the relationship between two variables. They indicate whether and how strongly two quantities are related to each other.

The correlation is described by the correlation coefficient, which can take values between -1 and +1:

- **+1**: Perfect **positive correlation** (as one variable increases, the other also increases proportionally).
- **0**: **No correlation** (no discernible linear relationship).
- **-1**: Perfect **negative correlation** (as one variable increases, the other decreases proportionally).

In practice, we often use the **Pearson correlation coefficient** to measure linear relationships. The prerequisite for this is that we have two numerical variables.
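
To get a feel for the two extremes of the scale, here is a short sketch with made-up numbers in which one variable is an exact straight-line function of the other:

```python
import pandas as pd

# Made-up data for illustration
x = pd.Series([1, 2, 3, 4, 5])
y_up = 2 * x + 1      # rises along a straight line
y_down = 10 - 3 * x   # falls along a straight line

r_up = x.corr(y_up)      # perfect positive correlation
r_down = x.corr(y_down)  # perfect negative correlation

print(round(r_up, 3))
print(round(r_down, 3))
```

Real data is never this tidy, so in practice the coefficient lands somewhere between the two extremes.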

In Python, there are various ways to calculate a correlation. We start with `pandas`.

We see that longer study time is associated with a higher GPA (Grade Point Average). This probably matches our expectations. However, we cannot yet confirm causality from this.

Note on interpreting `GPA`: Unlike some other school systems, a *higher* `GPA` corresponds to *better* performance.

In [None]:
# With pandas, we select one column
# Then we specify the function to be used (corr())
# And as the argument for the corr() function, we select the second column
df.Study_Hours_Per_Day.corr(df.GPA)

Alternatively, we can also determine the Pearson correlation coefficient using `scipy`.

In [None]:
# We import the "pearsonr" function from the "stats" module of the "scipy" package
from scipy.stats import pearsonr

In [None]:
# We determine the result of the correlation
result = pearsonr(df.Study_Hours_Per_Day, df.GPA)
print(result[0])

The `pearsonr` function from `scipy.stats` also gives us a second value, called the p-value. For now, just know that a very small p-value (like the one we will see below) means that a correlation this strong would be very unlikely to arise by chance alone if the two variables were actually unrelated.

Even if the correlation is strong, we still can't speak of causality here. More on that in the next lesson. 😊

In [None]:
result
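
The result of `pearsonr` can also be unpacked into two separate variables, which is often easier to read than indexing with `[0]` and `[1]`. A minimal sketch with made-up numbers (not the course data):

```python
from scipy.stats import pearsonr

# Made-up data for illustration, not taken from the course dataset
study_hours = [1, 2, 3, 4, 5, 6]
gpa = [2.1, 2.4, 2.6, 3.1, 3.3, 3.6]

# pearsonr returns the correlation coefficient and the p-value
r, p_value = pearsonr(study_hours, gpa)

print(f"r = {r:.3f}")
print(f"p-value = {p_value:.5f}")
```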

<font color='green'>**Try to determine the correlation between two variables yourself.**</font>

Hint: Correlations are symmetrical. So it doesn't matter which variable you take first and which second.

In [None]:
# Your code here

☕❗ <font color='red'>**Take a break here and let me know you're done.**</font> ❗☕

<font color='green'>**Check whether someone else might need help.**</font>

---

# Checking Our Assumptions

❗ <font color='green'>**Only continue here after we have discussed the assumptions for correlation tests.**</font> ❗

Okay, that was pretty straightforward. But we left out a few details: the conditions that make a Pearson correlation test valid. Before we fully trust our correlation score, it's good practice to quickly check a few things.

For beginners, the best way to do this is visually.

- **1) Is the relationship linear?**
  - Pearson's correlation only captures straight-line relationships.

- **2) Are there any major outliers?**
  - Extreme data points can distort the result.

- **3) Are both columns normally distributed?**
  - The significance test for Pearson's correlation (the p-value) assumes that both variables are roughly normally distributed, i.e., follow the familiar bell shape.

Let's investigate these three points for `Study_Hours_Per_Day` and `GPA`.

## 1) Linearity
Pearson's r can only detect linear relationships, so it is poorly suited for relationships that have a curve.

For example, the relationship between temperature and well-being is usually non-linear: many people feel comfortable at 20-25°C, but not at very cold or very hot temperatures.
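
The temperature example can be sketched in a few lines. The numbers below are invented (a comfort score that peaks at 22.5°C and falls off symmetrically on both sides), so treat this as an illustration rather than real data:

```python
import pandas as pd

# Made-up data: comfort peaks at 22.5 degrees and drops off on both sides
temperature = pd.Series([5, 10, 15, 20, 25, 30, 35, 40])
comfort = -((temperature - 22.5) ** 2)

# Comfort is fully determined by temperature, yet Pearson's r is practically zero
r = temperature.corr(comfort)
print(abs(round(r, 3)))
```

A coefficient near zero therefore does not mean "no relationship", only "no straight-line relationship".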

We can easily assess linearity by eye by looking at the variables in a scatter plot.

Indeed, we observe an upward trend, without curves. So there appears to be a linear relationship between the two variables.

In [None]:
df.plot(kind="scatter", x="Study_Hours_Per_Day", y="GPA")

## 2) Outlier Analysis
Outliers are data points that deviate significantly from the distribution of the rest of the data points. These can distort the result and may indicate a measurement error.

For example, if we observe data points of someone studying 22 hours a day, that is likely an error in data collection and should be investigated. But be careful: just because a data point is unusual, we shouldn't necessarily remove it. Unusual things happen, and if we subsequently remove them (or correct them otherwise), we might be the ones distorting the result and drawing false conclusions.

Therefore, the rule of thumb is: outliers should be investigated, but we should only make corrections if we
- a) have a reason to believe that there is a measurement error
- b) want to limit our statistical statements to the "majority" and consciously disregard the extremes.

*So how do we find out if a data point is an outlier?*

### Domain Knowledge
If we know that certain measurements are unrealistic or impossible (e.g., 22 or even 26 hours of study time per day), then we can confidently make corrections. Such interventions merely require a good understanding of the data and the data collection process, as well as some common sense.
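
Such a domain-knowledge check can be as simple as a filter. A small sketch with made-up data (the column name `study_hours` and the values are assumptions for illustration):

```python
import pandas as pd

# Made-up data for illustration; one entry is deliberately impossible
toy = pd.DataFrame({"study_hours": [3.5, 6.0, 26.0, 8.5]})

# A day only has 24 hours, so anything above that must be an error
impossible = toy[toy.study_hours > 24]
print(impossible)
```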

When we look at the scatterplot above, we can see that the measured values do not appear to be unrealistic or impossible.

### Visual Inspection

With **scatter plots** or **box plots**, we can visually represent the distribution of our data points and thus quickly see if individual points deviate greatly from the usual distribution.

In [None]:
# If we want to check two variables at once, a scatter plot is suitable
df.plot(kind="scatter", x="Study_Hours_Per_Day", y="GPA", title="Scatterplot for two variables");

A **box plot** is another great tool. It shows several statistical values in one visualization:
- **Median**: The line in the middle of the box.
- **25% and 75% quantiles**: The edges of the box. The distance between them is called the interquartile range (IQR).
- **Whiskers**: The lines extending from the box, which represent the typical range of the data. By default, they reach to the most extreme data point within 1.5 × IQR of the box edges.
- **Outliers**: The dots that fall outside the whiskers.

Any data points beyond the whiskers are classified as **outliers**.

However, outliers do **not necessarily require action**. They are just conspicuous points that could be investigated further.

In [None]:
# A box plot is suitable for examining individual variables in detail; here we draw one box per column
df.boxplot(column=["Study_Hours_Per_Day", "GPA"])
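
The dots in a box plot can also be reproduced numerically. The sketch below assumes the common 1.5 × IQR rule (the default whisker setting in matplotlib, which pandas uses for plotting) and applies it to made-up numbers:

```python
import pandas as pd

# Made-up data for illustration; 20 is a deliberately extreme value
hours = pd.Series([2, 3, 3, 4, 4, 5, 5, 6, 20])

q1 = hours.quantile(0.25)
q3 = hours.quantile(0.75)
iqr = q3 - q1

# Points beyond these limits would be drawn as dots in the box plot
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = hours[(hours < lower_limit) | (hours > upper_limit)]
print(outliers)
```

As with the plot itself, a flagged value is only a candidate for closer inspection, not automatically an error.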

Looking at our plots, nothing seems impossible or extremely out of place. The data points look reasonable, so we can proceed with confidence.

## 3) Normal Distribution
Statistical tests often require that the variables being tested are normally distributed.
This distribution is widespread and also known as the Gaussian distribution or bell curve.

_How do we check whether variables are normally distributed?_


### Visual Inspection
There are statistical tests of normality which we will learn about later in this course. We can, however, also get a good impression by inspecting the data visually to see if it roughly follows the familiar bell shape. 

A histogram gives us an initial indication of whether a variable is normally distributed. In this case, we see the classic bell shape. The variable `GPA` therefore appears to be normally distributed.

In [None]:
df.GPA.hist()
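
As a rough numeric complement to the histogram, we can compare mean and median: for a symmetric, bell-shaped distribution they lie close together, while for a skewed one they drift apart. This is only a quick plausibility check, not a test of normality. The sketch uses simulated data (the parameters are arbitrary assumptions), not the course dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Simulated data; the parameters are arbitrary choices for illustration
bell_shaped = pd.Series(rng.normal(loc=3.0, scale=0.3, size=1000))
skewed = pd.Series(rng.exponential(scale=1.0, size=1000))

# Symmetric distribution: mean and median are nearly identical
gap_bell = bell_shaped.mean() - bell_shaped.median()
print(gap_bell)

# Right-skewed distribution: the mean is pulled above the median
gap_skewed = skewed.mean() - skewed.median()
print(gap_skewed)
```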

<font color='green'>**Do the same for the other variable we are using in our correlation.**</font>

## What if the assumptions aren't met?
Our checks looked fairly good and so **Pearson correlation** might be the right test. What would we do, however, if one of our assumptions failed? In such cases, we should be cautious with the **Pearson correlation**.

Instead, we could use a correlation test that requires fewer assumptions: **Spearman's Rank Correlation**.

# Spearman's Rank Correlation
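Spearman's Rank Correlation is a non-parametric version of a correlation test. Instead of the raw values, it works with their ranks, so it measures whether two variables consistently move in the same (or opposite) direction, even if not along a straight line, and it is less sensitive to outliers. The interpretation of the coefficient (from -1 to +1) remains the same, but Spearman's test can also be applied when the assumptions for Pearson's correlation test are not met.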

As a rule of thumb, if the assumptions for a parametric test (like Pearson's) are met, it is more suitable. If they are not met, we turn to the non-parametric test (like Spearman's).
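
The difference between the two coefficients shows up most clearly when a relationship is consistently increasing but curved. A minimal sketch with made-up numbers (`y` is simply `x` cubed):

```python
from scipy.stats import pearsonr, spearmanr

# Made-up data: y always increases with x, but along a curve
x = list(range(1, 11))
y = [value ** 3 for value in x]

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)

print(f"Pearson r:    {r:.3f}")    # below 1, because the relationship is not a straight line
print(f"Spearman rho: {rho:.3f}")  # 1.000, because the ranks agree perfectly
```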

In [None]:
from scipy.stats import spearmanr

# Spearman's rank correlation
result = spearmanr(df.Study_Hours_Per_Day, df.GPA)

print(f"Spearman correlation coefficient: {result[0]}")
print(f"P-value: {result[1]}")

<font color='green'>**How do you interpret this result?**</font>

## Now You!
<font color='green'>**Check the visual assumptions mentioned above for a pair of variables of your choice and run the appropriate correlation test.**</font>

Hint: The variables must be numerical.