# 1 Plotting and Data Visualization

In this notebook we want to get familiar with plotting and the Python library `matplotlib`. Plotting is an essential tool for data exploration that can help you to get an intuition about certain characteristics and features of data.


[Matplotlib](https://matplotlib.org/) is probably the most widely used Python plotting library and is the one we will use in this course. There are alternatives that might interest you, for instance [Seaborn](https://seaborn.pydata.org/) or [Plotly](https://plotly.com/), but for this course we expect you to stick to matplotlib.

Let's install and import `matplotlib`:

In [1]:
!pip install matplotlib==3.5.1



In [2]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np #numpy will always have our back

## 1.1 Obtaining a dataset

For illustration purposes, we will use datasets that are canonical in machine learning and data science. These datasets are already preprocessed and easily obtainable through the [scikit-learn](https://scikit-learn.org/stable/) library.

Now, let's install and import `scikit-learn` and load the dataset.

In [3]:
!pip install scikit-learn==1.0.2


In [4]:
from sklearn import datasets
diabetes = datasets.load_diabetes()

Now we have the Diabetes Patients dataset. Let's try to explore it with `matplotlib`.

In [5]:
print(diabetes["data"].shape)

(442, 10)


As you can see, we have 442 samples with 10 features. The dataset also contains a target, stored separately under `diabetes["target"]`, which is the response variable: a measure of disease progression one year after baseline. Now, have a look at what features we are dealing with:

In [6]:
print(diabetes["feature_names"])

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']


Some of the given abbreviations may rightfully seem cryptic to you. More information can be found in the dataset description:

In [7]:
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

## 1.2 Visualize features

Now, we want to get to know more about a feature: how it is distributed and what we can learn from it. Therefore, we will pick out one of the features and plot it in different ways.

As an example feature, we will take "bp", the blood pressure measurement.

### Task 1.2.1: Get the feature
Isolate the feature "bp" from the data and save the vector into the provided variable _bp_.
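
As a hedged illustration of the general pattern (not the solution itself), the following sketch selects one column by name from a small made-up array. The names `toy_data` and `toy_feature_names`, and the values in them, are hypothetical and only stand in for a data matrix and its list of feature names.

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows) x 3 features (columns).
toy_data = np.array([
    [1.70, 65.0, 30.0],
    [1.82, 80.0, 45.0],
    [1.65, 54.0, 22.0],
    [1.76, 72.0, 51.0],
])
toy_feature_names = ["height", "weight", "age"]

# Look up the column index that belongs to the feature name ...
col = toy_feature_names.index("weight")

# ... and slice out that column for all samples (all rows).
weight = toy_data[:, col]
print(weight.shape)
```

The slice `[:, col]` keeps every row and a single column, so the result is a one-dimensional vector with one entry per sample.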

In [None]:
bp = # Your Code here

### Task 1.2.2: Describe the data
Use the skills you learnt in the lecture and the last assignment, together with `numpy`, to extract meaningful properties from the data:
- Attribute type (scale) of the data
- Mean
- Median
- Maximum value
- Minimum value
- Variance

_Note that the data in this ready-to-use dataset have already been mean-centered and scaled in proportion to the standard deviation. Nevertheless, extracting common properties such as the mean and median gives you a good first impression of the data._
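
The sketch below shows the relevant `numpy` calls on a small made-up vector `x` (hypothetical values, chosen so the results are easy to check by hand). It is meant as a reminder of the functions, under the assumption that the population variance is what is wanted.

```python
import numpy as np

# Hypothetical toy vector with one large value.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

x_mean = np.mean(x)      # arithmetic mean
x_median = np.median(x)  # middle value of the sorted data
x_max = np.max(x)        # largest value
x_min = np.min(x)        # smallest value

# np.var divides by n by default (population variance, ddof=0).
x_var = np.var(x)
# Passing ddof=1 divides by n - 1 instead (sample variance).
x_var_sample = np.var(x, ddof=1)

print(x_mean, x_median, x_max, x_min, x_var, x_var_sample)
```

Note how the single large value pulls the mean above the median. Comparing the two is a quick first check for skew or outliers.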

In [None]:
# attribute type: your answer here
bp_mean = # Your Code here
bp_median = # Your Code here
bp_max = # Your Code here
bp_min = # Your Code here
bp_var = # Your Code here

### Task 1.2.3: Show the distribution
Now, we are interested in the distribution of blood pressure in diabetes patients and therefore want to sort patients' blood pressure data into "buckets" in a histogram. 

_Have a look at the sample plots on the [matplotlib website](https://matplotlib.org/stable/gallery/statistics/histogram_features.html)._

**Do not forget to label your axes correctly and give your plot a suitable title!**
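
For orientation, here is a minimal histogram sketch on synthetic data. The variable `samples`, the seed, and the choice of 20 bins are assumptions made for illustration and are not taken from the diabetes dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: 500 draws from a standard normal distribution.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)

# plt.hist sorts the values into equally wide bins and returns
# the count per bin, the bin edges, and the drawn bar patches.
n_bins = 20
counts, bin_edges, _ = plt.hist(samples, bins=n_bins, edgecolor="black")

plt.title("Histogram of synthetic normal samples")
plt.xlabel("Value")
plt.ylabel("Number of samples")
plt.show()
```

The number of bins changes how the distribution looks: too few bins hide structure, too many make the plot noisy, so it is worth trying a few values.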

In [None]:
plt.title("Your Title here")
plt.xlabel("Your X Label here")
plt.ylabel("Your Y Label here")
plt.show()

### Bonus Task 1.2.4: What type of function could describe the data approximately?

In [None]:
# Your Answer here

### Task 1.2.5: Show the boxplot and describe it
Now that you know how to plot with matplotlib, create a box-and-whisker plot of the data. Have a look at the [official matplotlib documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html) if you need some guidance. Additionally, give a quick description of the plot and what you learn from it about the data, especially in terms of _outliers_.
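
To make the notion of outliers concrete, the sketch below draws a boxplot of synthetic data and also computes the whisker limits by hand. It assumes the common 1.5 × IQR rule, which matches the default `whis=1.5` of `plt.boxplot`; the data, including the two extreme values 8.0 and -9.0, are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: 200 normal samples plus two deliberately extreme values.
rng = np.random.default_rng(seed=0)
values = np.concatenate([rng.normal(0.0, 1.0, size=200), [8.0, -9.0]])

# Quartiles and interquartile range (IQR).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are commonly flagged as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]
print("Flagged as outliers:", np.sort(outliers))

plt.boxplot(values)
plt.title("Boxplot of synthetic samples")
plt.ylabel("Value")
plt.show()
```

The box spans the first to third quartile, the line inside marks the median, and points drawn individually beyond the whiskers are the candidates for outliers.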

In [None]:
plt.title("Your Title here")
plt.xlabel("Your X Label here")
plt.ylabel("Your Y Label here")
plt.show()

# Your Description here

## 1.3 Visualize 2D data
After we have looked at the characteristics of a single feature, let's see how `matplotlib` can help to visualise two-dimensional data and single samples. Prominent 2D data are greyscale photos or images, where the two dimensions are the x and y positions of the pixel values.

### 1.3.1 Obtain an image dataset
Fortunately, `scikit-learn` also provides a dataset with pixel images as samples.

In [None]:
digits = datasets.load_digits()
print(digits.keys())

Let's go the usual route and have a look at the shape of the data.

In [None]:
print(digits["data"].shape)

So, we have 1797 samples with 64 features each. That means, if we isolate a feature vector for a single sample, it has 64 features. But aren't we dealing with images that are usually 2-dimensional?

Perhaps the feature names can give more insight:

In [None]:
print(digits["feature_names"])

From the feature names, we can conclude that the pixel values are represented in a vector, row for row. In order to plot it as a picture, we need a 2D representation, though.

Therefore, our tasks are now:
1. isolate the feature vector of a single sample
2. reshape the vector into a 2D matrix
3. plot the image using `matplotlib`
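
The reshape and plot steps can be sketched on a made-up vector. The array `flat` and its 4 × 4 target shape are hypothetical and deliberately differ from the digits data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical flat "image": 16 pixel values stored row by row.
flat = np.arange(16)

# reshape fills the new array row by row (row-major order),
# so the first 4 values become the first image row, and so on.
grid = flat.reshape(4, 4)
print(grid.shape)

# imshow draws each matrix entry as one pixel; cmap="gray" maps
# low values to black and high values to white.
plt.imshow(grid, cmap="gray")
plt.colorbar(label="Pixel value")
plt.title("Toy 4x4 greyscale image")
plt.xlabel("Column (x)")
plt.ylabel("Row (y)")
plt.show()
```

Because the values increase along each row and from row to row, the resulting image gets brighter from the top left to the bottom right, which is a quick way to check that the reshape used the intended order.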

### Task 1.3.1 Isolate a feature vector
This task can be seen as the "inverse" of task 1.2.1. But now, instead of a single feature over all samples, we want all the features for a single sample!

Isolate a sample of your choice and save it in the variable _digit_.

In [None]:
digit = # Your Code here
print(digit.shape)

### Task 1.3.2 Reshape the vector into a 2D matrix
Now, you should have a vector of length 64. The image samples of the dataset are square, so you need to reshape the vector into the appropriate shape using NumPy. Save the resulting matrix into the variable _im_.

Hint: a helpful function is [numpy.reshape](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html).

In [None]:
im = # Your Code here
print(im.shape)

### Task 1.3.3 Plot the image
Use the skills obtained above to plot the sample using the `imshow` [function from matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imshow.html).

**Keep in mind: No plot without title and axes labels!**

In [None]:
# Your Code here

## 1.4 Visual recognition of correlations

In this task your job is to plot different attributes against each other in a scatterplot to find out if the selected attributes are linearly correlated.
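
As a rough guide to what a linear correlation looks like in a scatterplot, the sketch below uses synthetic data in which `y` is constructed to depend linearly on `x` plus noise. The variables, the slope of 2.0, and the noise level are assumptions for illustration. The Pearson coefficient from `np.corrcoef` is included as a numerical cross-check of the visual impression.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: y depends linearly on x, with added Gaussian noise.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry is the Pearson correlation of x and y.
r = np.corrcoef(x, y)[0, 1]

plt.scatter(x, y, s=10)
plt.title(f"Synthetic data, Pearson r = {r:.2f}")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```

A point cloud that is stretched along a rising line indicates a positive correlation, a falling line a negative one, and a roughly round cloud suggests little or no linear correlation.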

### Task 1.4.1 BMI and Blood Pressure

We assume a positive correlation between the body mass index (BMI) of a person and their blood pressure.
Plot the associated attributes against each other and analyze the plot to find out if this is true.

In [None]:
# Your Code here

# Is the assumption true? Your Answer here

### Task 1.4.2 Age and Diabetes Disease Progression

As a layman, one could assume that the older a patient is, the further their diabetes has progressed. Therefore, we assume that age and disease progression have a positive linear correlation. Plot the associated attributes against each other and analyze the plot to find out if this assumption is true.

In [None]:
# Your Code here

# Is the assumption true? Your Answer here

### Task 1.4.3 High-Density Lipoproteins and Diabetes Disease Progression
Some correlations between features might be difficult to estimate or assume when dealing with new data. Let's explore the influence of high-density lipoproteins on diabetes disease progression.

Plot the associated attributes against each other and analyze the plot to find out if there is some correlation.

In [None]:
# Your Code here

# Is there a correlation? Your Answer here