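# AI in Medicine: Data Science, Basics 2

## Python programming: `numpy` and `pandas`

- Lecturer: itwangyang (itwangyang@gmail.com)

- Target audience: medical students

- Course date: 13 September 2024

## 1. Aim of this session

In this session, you will get in touch with **data science**. You will use the **Python packages `numpy` and `pandas`** to load and process a COVID-19 dataset.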

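## 2. Learning goals
### Theory
* Data science
* `numpy` library
* `pandas` library
### Practical
1. Dataset
2. Read data as a `DataFrame` with `pandas`
3. Look at the data
4. Select columns
5. Get unique entries in a column
6. Select rows
7. Group data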

## 3. References

- Data science, machine learning, artificial intelligence
    - http://varianceexplained.org/r/ds-ml-ai/
- Vectors, matrices, tensors
    - https://www.quantstart.com/articles/scalars-vectors-matrices-and-tensors-linear-algebra-for-deep-learning-part-1/
    - https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66
- `numpy`
    - https://numpy.org/doc/stable/user/absolute_beginners.html
    - https://scipy-lectures.org/intro/numpy/array_object.html
- `pandas`
    - https://medium.com/dunder-data/how-to-learn-pandas-108905ab4955
    - https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/#iloc-selection
    - https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c
- Datasets
    - COVID-19 cases: [raw RKI data](https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv)
    - Vaccination progress in Germany: [raw RKI data](https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Daten/Impfquoten-Tab.html) and [processed data](https://github.com/ard-data/2020-rki-impf-archive)
- Data visualization
    - [RKI COVID-19 dashboard](https://corona.rki.de/)
    - [COVID-19 cases in Berlin's districts > Bezirke > Übersicht](https://www.berlin.de/corona/lagebericht/)
    - [Vaccination dashboard](https://impfdashboard.de/)

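## 4. Theory

### Data science

#### What is the difference between data science, machine learning and artificial intelligence?

Adapted from [David Robinson's blog post](http://varianceexplained.org/r/ds-ml-ai/).

The fields of data science, machine learning and artificial intelligence have a lot of **overlap**, but they are **not interchangeable**.

#### **Data science** produces **insights**
- "The average patient has a 70% chance of survival" (descriptive: describe a dataset)
- "Different patients have different chances of survival" (exploratory: discover relationships you did not know about)
- "A randomized experiment shows that patients assigned to Alice are more likely to survive than those assigned to Bob" (causal: find out what happens to one variable when another one is changed)

#### **Machine learning** (ML) produces **predictions**

- "Predict whether this patient will develop sepsis"
- "Predict whether this image has a bird in it"

#### **Artificial intelligence** (AI) produces **actions**

- Game-playing algorithms (Deep Blue, AlphaGo)
- Robotics and control theory (motion planning, walking a bipedal robot)
- Optimization (Google Maps choosing a route)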

### `numpy` library

#### Overview

* Role: scientific computing (with arrays)
* Website: https://numpy.org/
* Description (taken from [here](https://numpy.org/doc/stable/user/absolute_beginners.html)):
> NumPy (Numerical Python) is an open source Python library that’s used in almost every field of science and engineering. It’s the universal standard for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems. NumPy users include everyone from beginning coders to experienced researchers doing state-of-the-art scientific and industrial research and development. The NumPy API is used extensively in Pandas, SciPy, Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages.
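* Documentation: https://numpy.org/devdocs/

#### Applications

- Create vectors (1D), matrices (2D) and tensors (3D and higher) as arrays
- Operate on these arrays with a large collection of high-level mathematical functions
- Used extensively in `pandas`, `scipy`, `matplotlib`, `scikit-learn` and most other data science and scientific Python packages
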
![](https://res.cloudinary.com/practicaldev/image/fetch/s--oTgfo1EL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/adhiraiyan/DeepLearningWithTF2.0/master/notebooks/figures/fig0201a.png)

Figure source: https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66
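
To make the vector/matrix/tensor distinction concrete, here is a minimal sketch with `numpy`. The numbers and the variable names (`vector`, `matrix`, `tensor`) are invented for illustration only.

```python
import numpy as np

# Invented example values, not taken from any dataset
vector = np.array([36.6, 37.2, 38.1])      # 1D: e.g. three body temperatures
matrix = np.array([[1, 2, 3], [4, 5, 6]])  # 2D: 2 rows x 3 columns
tensor = np.zeros((2, 3, 4))               # 3D: 2 x 3 x 4 array filled with zeros

# `ndim` is the number of dimensions, `shape` the size along each dimension
for name, array in [("vector", vector), ("matrix", matrix), ("tensor", tensor)]:
    print(name, "ndim =", array.ndim, "shape =", array.shape)

# Mathematical functions operate on whole arrays at once
mean_temperature = vector.mean()   # mean over all elements
column_sums = matrix.sum(axis=0)   # sum down each column
print(mean_temperature, column_sums)
```

The `axis` argument decides along which dimension a function is applied; `axis=0` runs down the rows and therefore yields one value per column.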

### `pandas` library

#### Overview

* Role: data manipulation and analysis
* Website: https://pandas.pydata.org/
* Description (taken from [here](https://pandas.pydata.org/)):
> pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.
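* Documentation: https://pandas.pydata.org/pandas-docs/stable/

#### Applications

Taken from: https://medium.com/dunder-data/how-to-learn-pandas-108905ab4955

> `pandas` is capable of many tasks including
>
> - Reading/writing many different data formats
> - Selecting subsets of data
> - Calculating across rows and down columns
> - Finding and filling missing data
> - Applying operations to independent groups within the data
> - Reshaping data into different forms
> - Visualization through matplotlib and seaborn

#### `DataFrame` and `Series`

The `pandas` library has two main data containers: the `DataFrame` (2D) and the `Series` (1D).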

- `DataFrame` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html):
  > Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure
- `Series` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html):
  > One-dimensional ndarray with axis labels (including time series).


The `DataFrame` is used more often than the `Series`, so let's take a look at its components.

![DataFrame anatomy](https://raw.githubusercontent.com/volkamerlab/ai_in_medicine/master/images/dataframe_anatomy.png)

Figure source: https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c
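
The components named in the figure can be inspected directly. The following is a small sketch on an invented three-person table (the names and values are placeholders), showing the index, the columns, the underlying values, and how a single column comes back as a `Series`.

```python
import pandas as pd

# Invented toy table; the row labels (index) are the patients' names
patients = pd.DataFrame(
    {"age": [20, 25, 35], "gender": ["female", "male", "female"]},
    index=["Helen", "Paul", "Kim"],
)

print(patients.index)       # row labels
print(patients.columns)     # column labels
print(patients.to_numpy())  # the values without any labels

# Selecting one column returns a Series that shares the DataFrame's index
ages = patients["age"]
print(type(ages))
```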

## 5. Practical

<div class="alert alert-block alert-info">
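    <b>Our goal:</b> We will introduce all the <code>pandas</code> functionality that we need to visualize the latest COVID-19 case numbers in Berlin by age group and district. Once you have seen how this is done, you will get up-to-date data on the vaccination progress in Germany and plot the time course of first/second vaccinations yourself.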
</div>

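### 5.1. Dataset
We will use the COVID-19 case data published daily by the Robert Koch Institute (RKI), which is visualized very nicely on the RKI COVID-19 dashboard (https://corona.rki.de). In this notebook, we will focus on the data for Berlin.

The dataset is freely available, and we can load it directly into `pandas` from the following URL: https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv

### 5.2. Read data as a `DataFrame` with `pandas`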

In [1]:
import numpy as np
import pandas as pd

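First, we import the libraries `numpy` and `pandas` (abbreviated as `np` and `pd` so that we can write shorter code). A library is a collection of functionality that lets you perform many common tasks without writing all the code from scratch.

For example, the `pandas` library offers the function `read_csv()`, which reads a comma-separated values (csv) file into a so-called `DataFrame`.

**Tip**: In this Jupyter notebook, you can check which functionality a library offers by writing a dot after the library name and pressing the Tab key. All available functionality will pop up for you to explore. Since there are many options, you can narrow them down by typing a few letters, e.g. `read`, while the pop-up is open.

**Note**: If you are using Google Colab, you first have to disable "Automatically trigger code completions" under "Tools" > "Settings" > "Editor" for this to work.

Check out all the file formats that you can read with `pandas` by typing `pd.read` and pressing Tab.

If we execute a cell that contains only `pd.read` (with `Shift` + `Enter`), we get an `AttributeError` because the module `pandas` has no attribute called `read`.

We can also use `?` (e.g. `pd.read_csv?`) to get a function's docstring, i.e. a description of what the function does and which arguments it accepts.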
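
Tab completion and `?` only work interactively. As a rough sketch of the same exploration in plain Python, the snippet below lists every `pandas` name starting with `read_` and prints the first line of the `read_csv` docstring; the variable names are chosen for illustration.

```python
import pandas as pd

# All names in the pandas namespace that start with "read_"
reader_names = [name for name in dir(pd) if name.startswith("read_")]
print(reader_names)

# There is no plain `pd.read`, which is why `pd.read` alone raises an AttributeError
print(hasattr(pd, "read"))

# The docstring that `pd.read_csv?` displays is stored in `__doc__`
first_doc_line = pd.read_csv.__doc__.strip().splitlines()[0]
print(first_doc_line)
```

Python's built-in `help(pd.read_csv)` prints the full docstring and works outside of Jupyter as well.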

Now, we will use the `read_csv()` function ([see docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)) to load the content of the csv file as a `DataFrame` into the variable `data_raw`.

In [4]:
%%time
# `read_csv` can read from a path on your computer but also from a URL on the internet!
# Reading the remote csv file takes only a few seconds
data_raw = pd.read_csv("https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv", sep=',')

HTTPError: HTTP Error 404: Not Found
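
The cell output above shows that the remote file is not always reachable. To see what `read_csv()` does without any network access, here is a minimal sketch that parses a few lines of csv text from memory. The three rows are invented placeholders that only borrow column names from the RKI dataset; they are not real case numbers.

```python
import io

import pandas as pd

# Invented mini dataset in csv format (NOT real RKI numbers)
csv_text = """Bundesland,Landkreis,Altersgruppe,AnzahlFall
Berlin,SK Berlin Mitte,A35-A59,3
Berlin,SK Berlin Pankow,A15-A34,1
Bayern,SK München,A60-A79,2
"""

# `read_csv` accepts file-like objects in addition to paths and URLs
toy_data = pd.read_csv(io.StringIO(csv_text), sep=",")
print(toy_data)
print(toy_data.dtypes)  # `AnzahlFall` is parsed as an integer column
```

The first line of the text becomes the column names, and each following line becomes one row. A local copy of the dataset could be read the same way by passing its file path instead of the `StringIO` object.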

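Let's take a look at the `DataFrame` stored in `data_raw`.

### 5.3. Look at the data

#### `DataFrame` head/tail

Let's look at the first rows of the table using the `head()` function ([see docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)).

**Note**: We will use this command a lot to avoid printing large tables in this Jupyter notebook.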

In [None]:
data_raw.head()  # Shows by default the first 5 entries

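Let's look at the last rows of the table using the `tail()` function ([see docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html)). Note that you can pass a number to `head()` and `tail()` to specify how many of the first/last rows you want to see.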

In [None]:
data_raw.tail(2)

#### `DataFrame` dimension

Let's use `shape` to show the dimension (shape) of the table in the form `(number of rows, number of columns)`.

In [None]:
data_raw.shape

#### `DataFrame` column names

We can get all column names using `columns`.

In [None]:
data_raw.columns
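
The four inspection tools from this chapter can be tried on any table. Below is a small sketch on an invented ten-row `DataFrame` (the column names `case_id` and `cases` are hypothetical), so that the results are easy to verify by eye.

```python
import pandas as pd

# Invented toy table with 10 rows and 2 columns
toy = pd.DataFrame(
    {"case_id": list(range(1, 11)), "cases": [3, 1, 2, 5, 4, 2, 1, 3, 2, 6]}
)

first_rows = toy.head()      # first 5 rows by default
last_rows = toy.tail(2)      # last 2 rows
n_rows, n_columns = toy.shape

print(first_rows)
print(last_rows)
print(n_rows, n_columns)
print(list(toy.columns))
```

Note that `shape` and `columns` are attributes, so they are written without parentheses, whereas `head()` and `tail()` are functions.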

Let's list here the meaning of a few criteria (see full list on [RKI COVID-19 data download website](https://www.arcgis.com/home/item.html?id=dd4580c810204019a7b8eb3e0b329dd6)):

- `Bundesland`: State name
- `Landkreis`: District name
- `Altersgruppe`: Age group (6 groups, labeled `A00-A04`, `A05-A14`, `A15-A34`, `A35-A59`, `A60-A79` and `A80+` in the data, plus `unbekannt`=unknown)
- `Geschlecht`: Gender (`M`=male, `W`=female and `unbekannt`=unknown)
- `AnzahlFall`: Number of cases in group
- `AnzahlTodesfall`: Number of deaths in group
- `AnzahlGenesen`: Number of recovered cases in group
- `Meldedatum`: Date when case was reported to the Gesundheitsamt (you will use this in the next lesson on data visualization with `matplotlib`)
- `Datenstand`: Date when data was updated

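We can think of a `DataFrame` as a list of lists (where each list can contain different data types) that is displayed as a table and carries metadata such as column names and index names.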

In [None]:
list_of_lists = [['Helen', 20, 'female'], ['Paul', 25, 'male'], ['Kim', 35, 'female']]
list_of_lists

In [None]:
pd.DataFrame(list_of_lists, columns=['name', 'age', 'gender'])
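
A `DataFrame` can also be built column by column from a dictionary. The sketch below, which reuses the three example people from the cell above, suggests that both routes lead to the same table; the variable names `from_rows` and `from_columns` are chosen for illustration.

```python
import pandas as pd

# Row-wise: one inner list per person
list_of_lists = [["Helen", 20, "female"], ["Paul", 25, "male"], ["Kim", 35, "female"]]
from_rows = pd.DataFrame(list_of_lists, columns=["name", "age", "gender"])

# Column-wise: one dictionary entry per column
from_columns = pd.DataFrame(
    {
        "name": ["Helen", "Paul", "Kim"],
        "age": [20, 25, 35],
        "gender": ["female", "male", "female"],
    }
)

# `equals` compares values, labels and data types
print(from_rows.equals(from_columns))
print(from_rows.dtypes)  # each column has its own data type
```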

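#### _Your turn_: Exercises

__Exercise 1__: Get (a) the first 4 rows and (b) the last 5 rows of `data_raw`.

__Exercise 2__: Get (a) the number of columns in `data_raw` and (b) the name of the third column.

__Exercise 3__: Build a `DataFrame` with data on 4 countries:

- Name of the country
- What you like most about the country
- Have you been there?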


### 5.4. Select columns

#### By column name

Let's select some interesting columns! The `DataFrame` is quite large and we are only interested in a subset of the offered criteria. With `pandas`, it is very easy to slice the columns that you want by the following syntax:

```python
data_raw[list_of_interesting_columns]
```

The list of column names of interest could look like this:
```python
list_of_interesting_columns = ['Bundesland', 'Landkreis']
```

Taking both steps together it looks like this (note the two sets of `[]`, the inner `[]` is part of the list, the outer `[]` is the syntax for `DataFrame` slicing):

In [None]:
data_raw[['Bundesland', 'Landkreis']].head()  # Note the use of .head() to show only the first 5 rows

We see in the following that it is possible to write a command over multiple lines to make it easier to read.

Let's write this operation's output into the variable `data`; we will use this variable from now on.

In [None]:
data = data_raw[
    [
        'Bundesland', 
        'Landkreis', 
        'Altersgruppe', 
        'Geschlecht', 
        'AnzahlFall', 
        'AnzahlTodesfall', 
        'AnzahlGenesen', 
        'Datenstand'
    ]
]

You will have noticed that there is no cell output as before. This happens when the output is saved in a variable (here `data`). Let's inspect the content of `data`:

In [None]:
data.head()

#### By column AND index names/indices using `loc/iloc`

1. __Recap__

So far we sliced columns using column names like this:

In [None]:
data[['Bundesland', 'Landkreis']].head()

2. __`loc`__

The above code is a shorter form for using `loc`:
```python
dataframe.loc[list_of_row_names, list_of_column_names]
```
`list_of_row_names` or `list_of_column_names` can be replaced by `:` if we want to select all rows or all columns, respectively.

In [None]:
data.loc[:, ['Bundesland', 'Landkreis']].head()

3. __`iloc`__

Or, instead of row and column names, we can use their indices (like you learnt on day 1 where you selected elements from a list). 

```python
dataframe.iloc[list_of_row_indices, list_of_column_indices].head()
```

Remember that indices in Python start at 0.

In [None]:
# Check out index of columns of interest
data.columns

In [None]:
data.iloc[:, [2, 3]].head()

__Note__: You will use `loc/iloc` in the notebooks to come in the next lessons, but for this lesson here, we will use column selection by column names as discussed first:

In [None]:
data[['Bundesland', 'Landkreis']].head()
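
To compare the three selection styles side by side, here is a minimal sketch on an invented three-row table. The row labels `a`, `b`, `c` and all values are placeholders; string row labels are used on purpose so that names (`loc`) and positions (`iloc`) cannot be confused.

```python
import pandas as pd

# Invented toy table with string row labels
toy = pd.DataFrame(
    {
        "Bundesland": ["Berlin", "Bayern", "Berlin"],
        "Landkreis": ["SK Berlin Mitte", "SK München", "SK Berlin Pankow"],
        "AnzahlFall": [3, 2, 1],
    },
    index=["a", "b", "c"],
)

by_name = toy[["Bundesland", "AnzahlFall"]]            # column names only
by_loc = toy.loc[:, ["Bundesland", "AnzahlFall"]]      # all rows, columns by name
by_iloc = toy.iloc[:, [0, 2]]                          # all rows, columns by position

# All three selections return the same table
print(by_name.equals(by_loc) and by_loc.equals(by_iloc))

# A single cell: row "b" is at position 1, column "AnzahlFall" at position 2
single_value_loc = toy.loc["b", "AnzahlFall"]
single_value_iloc = toy.iloc[1, 2]
print(single_value_loc, single_value_iloc)
```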

#### _Your turn_: Exercises

__Exercise 4__: Select the columns listing the number of cases, deaths and recoveries using their __column names__.

__Exercise 5__: Do the same as in Exercise 4 but this time use __`loc`__.

__Exercise 6__: Do the same as in Exercise 4 and 5 but this time use __`iloc`__.

### 5.5. Get unique entries in a column

Now, we'd like to check what kind of entries we can find in a column. 

First, we select a column, similar to how we learned it in *Chapter 5.4*. Since we select this time only **one** column, we do not pass the column name as a list but as a simple string.

In [None]:
data['Bundesland'].head()

This returns a `Series` (instead of a `DataFrame`):

In [None]:
type(data['Bundesland'])

Now let's apply the `unique()` function ([see docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html)) and check the states in our dataset.

In [None]:
data['Bundesland'].unique()  # Note: Here we pass the single column as string not as list (as shown in Chapter 5.4)

There should be 16 states, let's check with Python's built-in function `len` ([see docs](https://docs.python.org/3/library/functions.html#len)) that returns the length of e.g. list-like objects:

In [None]:
len(data['Bundesland'].unique())
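
Besides `unique()`, a `Series` also offers `nunique()` and `value_counts()`. The sketch below shows how the three relate, using an invented `Series` of five state names.

```python
import pandas as pd

# Invented toy Series of state names
states = pd.Series(["Berlin", "Bayern", "Berlin", "Hessen", "Berlin"], name="Bundesland")

unique_states = states.unique()   # distinct entries in order of first appearance
n_unique = states.nunique()       # number of distinct entries
counts = states.value_counts()    # how often each entry occurs

print(unique_states)
print(n_unique)
print(counts)
```

`nunique()` gives the same number as `len()` applied to the result of `unique()` as long as the column has no missing values, because `nunique()` ignores missing values by default.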

#### _Your turn_: Exercises

__Exercise 7__: Select the column on age groups (`'Altersgruppe'`) - which age groups are monitored?

__Exercise 8__: Select the column on districts (`'Landkreis'`) - how many districts are monitored?

### 5.6. Select rows (by conditions)

Very often, you not only have more criteria (columns) in your dataset than you are actually interested in but also more data points (rows) than you need. Let's say for instance, that we are mainly interested in data points regarding Berlin. Since we have a dataset for Germany, we will need to do some (row) filtering.

Let's select only the state column (`Bundesland`).

In [None]:
data['Bundesland']

With a `Series`, it is very easy to check for each row whether it fulfills a given condition. As an example, let's ask for "Thüringen".

In [None]:
data['Bundesland'] == 'Thüringen'

You can see that this operation returns a `Series` of boolean values (`True` or `False`) with the same length and index as our initial `Series`.

How can we now use this boolean `Series` to subset `data` for data points concerning Berlin (i.e. filter `data` for rows concerning Berlin)? We use the following syntax:

```python
data[condition]
```

In [None]:
# Condition
state_is_berlin = data['Bundesland'] == 'Berlin'

# Subset dataset by condition
data[state_is_berlin]  # equals
data[data['Bundesland'] == 'Berlin']

#### _Your turn_: Exercises

__Exercise 9__: Select only data points for Berlin Mitte (one condition).

__Exercise 10__: Select only data points for Berlin and patients between 35 and 59 years old (two conditions).

```python
# Use one condition
data[condition]

# Use multiple conditions
data[condition1 & condition2]  # Fulfill condition 1 AND 2
data[condition1 & ~condition2]  # Fulfill condition 1 AND NOT 2 (`~` negates a boolean Series)
data[condition1 | condition2]  # Fulfill condition 1 OR 2
```
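
The syntax above can be tried on a small table first. The following sketch uses four invented rows and deliberately different conditions from the exercises; the variable names are hypothetical.

```python
import pandas as pd

# Invented toy table (NOT real RKI numbers)
toy = pd.DataFrame(
    {
        "Bundesland": ["Bayern", "Bayern", "Berlin", "Hessen"],
        "Geschlecht": ["W", "M", "W", "M"],
        "AnzahlFall": [3, 1, 2, 4],
    }
)

is_bayern = toy["Bundesland"] == "Bayern"
is_female = toy["Geschlecht"] == "W"

both = toy[is_bayern & is_female]            # Bayern AND female
bayern_not_female = toy[is_bayern & ~is_female]  # Bayern AND NOT female
either = toy[is_bayern | is_female]          # Bayern OR female

# Written inline, each comparison needs its own parentheses,
# because `&` binds more tightly than `==`
inline = toy[(toy["Bundesland"] == "Bayern") & (toy["Geschlecht"] == "W")]

print(both)
print(bayern_not_female)
print(either)
```

Python's keywords `and`, `or` and `not` do not work element-wise on a `Series`, which is why `&`, `|` and `~` are used here.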

### 5.7. Group data

From here on, we will continue to work only with data for Berlin, so we will save the subset to the new variable `data_berlin`.

In [None]:
data_berlin = data[data['Bundesland'] == 'Berlin']
data_berlin.shape

From https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html:

> By `groupby()` we are referring to a process involving one or more of the following steps:
> * **Splitting** the data into groups based on some criteria.
> * **Applying** a function to each group independently.
> * **Combining** the results into a data structure.


#### Example: Get group sum with `sum()`

**Splitting**: Split data into groups based on a criterion.

In [None]:
data_berlin.groupby('Altersgruppe')

In [None]:
type(data_berlin.groupby('Altersgruppe'))

Look at one of the groups (= subset of the full DataFrame).

In [None]:
data_berlin.groupby('Altersgruppe').get_group("A00-A04")

**Applying and combining**: Apply function to each group, e.g. get the sum of numerical values in each group using `sum()`.

In [None]:
data_berlin.groupby('Altersgruppe').sum(numeric_only=True)  # Sum only the numerical columns
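
To see what `groupby` does behind the scenes, the three steps can be spelled out by hand. The following is a sketch on five invented rows (not real RKI numbers) that compares a manual split-apply-combine loop with the one-line `groupby` version.

```python
import pandas as pd

# Invented toy table (NOT real RKI numbers)
toy = pd.DataFrame(
    {
        "Altersgruppe": ["A00-A04", "A35-A59", "A35-A59", "A80+", "A00-A04"],
        "AnzahlFall": [1, 3, 4, 2, 1],
        "AnzahlTodesfall": [0, 0, 1, 1, 0],
    }
)

# Manual version
manual = {}
for age_group in toy["Altersgruppe"].unique():
    subset = toy[toy["Altersgruppe"] == age_group]   # splitting
    manual[age_group] = subset["AnzahlFall"].sum()   # applying
manual_result = pd.Series(manual)                    # combining

# The same with groupby
grouped_result = toy.groupby("Altersgruppe")["AnzahlFall"].sum()

print(manual_result)
print(grouped_result)
```

Selecting the column before calling `sum()` restricts the calculation to that column, which is usually faster on large tables than summing every column first.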

With `pandas` it is very easy to quickly plot data using the `plot()` function ([see docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html)) - with the parameter `kind` you can specify what plot type you want to plot (in our case we want a barplot). Note that the index labels will serve as x-axis labels.

Select `AnzahlFall` for the plot.

In [None]:
data_berlin.groupby('Altersgruppe')['AnzahlFall'].sum().plot(
    kind='bar', 
    title=f'Number of all COVID19 cases in Berlin since the beginning of the pandemic ({data["Datenstand"].unique()[0]})'
);

Compare this plot with the [RKI Dashboard](https://corona.rki.de/).

#### _Your turn_: Exercises

__Exercise 11__: Since the `groupby` functionality is very powerful but at first difficult to wrap your head around, go through the example above again in your group and discuss any questions.

__Exercise 12__: Get number of total COVID-19 cases by Berlin's districts and compare your findings to the [official COVID-19 table for Berlin > Bezirke > Übersicht](https://www.berlin.de/corona/lagebericht/).

__Exercise 13__: Plot the number of total COVID-19 cases in Berlin grouped by Berlin's districts (barplot).

## 6. Discussion

In this notebook, we saw how quickly we can read a csv file into a `DataFrame` (*Chapter 5.2*) and start working with it. 
- We got a first impression of our COVID-19 dataset. We looked at the number of data points (`DataFrame` rows) and criteria (`DataFrame` columns) as well as some example data points, see *Chapter 5.3*.
- We selected interesting columns and checked what kind of column entries we can expect, see *Chapter 5.4 and 5.5*. 
- We filtered rows by conditions, e.g. to keep only the data points for Berlin, see *Chapter 5.6*.
- We grouped data by certain criteria (columns) and applied operations to these groups, e.g. we calculated the sum within each group. We also took some first steps towards plotting with `pandas`, see *Chapter 5.7*. 

## 7. Final exercise

As promised at the beginning, you will get your own dataset now :)

Last year during the course, we could only work with COVID-19 cases data but luckily, this year, we have something positive to look at as well - the vaccination progress in Germany! You can find that data online again at the [RKI website](https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Daten/Impfquoten-Tab.html) (under the "Daten" section). 

The provided Excel file is a bit difficult to handle, thus many GitHub repos have been set up to process the dataset into formats that are easier to work with, e.g. https://github.com/ard-data/2020-rki-impf-archive.

1. Let's load the cumulative vaccination progress for Germany. 

In [None]:
vaccination_cumulative = pd.read_csv(
    "https://raw.githubusercontent.com/ard-data/2020-rki-impf-archive/master/data/9_csv_v3/region_DE.csv"
)
vaccination_cumulative.head()

2. Let the `date` column know that it represents dates (change the data type from `object` to `datetime`). This will help us later during plotting because `pandas` will not try to label each day in the plot but rather, for example, every month (depending on the range of dates).

In [None]:
vaccination_cumulative["date"] = pd.to_datetime(vaccination_cumulative["date"])
vaccination_cumulative.head()

In [None]:
vaccination_cumulative.dtypes
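
The effect of `pd.to_datetime()` is easier to see on a tiny example. This sketch uses three invented dates and a hypothetical `doses` column; once the column holds real dates, date arithmetic and the `.dt` accessor become available.

```python
import pandas as pd

# Invented toy table; the dates start out as plain text
toy = pd.DataFrame(
    {"date": ["2021-01-01", "2021-01-02", "2021-01-03"], "doses": [10, 25, 45]}
)

dtype_before = toy["date"].dtype
toy["date"] = pd.to_datetime(toy["date"])
dtype_after = toy["date"].dtype
print(dtype_before, "->", dtype_after)

# Date arithmetic and date properties now work
days_covered = (toy["date"].max() - toy["date"].min()).days
weekdays = toy["date"].dt.day_name().tolist()
print(days_covered, weekdays)
```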

3. Set the date as the `DataFrame` index. Use `your_dataframe.set_index(column_name)` for that.

4. Select only the columns containing the cumulative number of people who are fully vaccinated or vaccinated once/twice (`personen_voll_kumulativ`, `personen_erst_kumulativ`, and `personen_zweit_kumulativ`).

5. Plot the cumulative time series.

6. Compare your results to the data on the BMG website: https://impfdashboard.de/

__Solutions__

__Words of encouragement :)__ 

Before you take a look at the solutions, try to solve the exercises yourself. 

All the information needed lives in _5. Practical_ - if you are stuck, first take a look at the material there. Talk to your fellow students. If you have a solution, then go ahead and take a look here.

Also note that the solutions given here show only one possibility; most of the time, there are multiple ways to achieve the same end result.

<details>
<summary> > Solution 1</summary>
    
```python
data_raw.head(4)
data_raw.tail()
```
    
</details>

<details>
<summary> > Solution 2</summary>
    
```python
len(data_raw.columns)
data_raw.columns[2]
```
    
</details>

<details>
<summary> > Solution 3</summary>
    
```python
pd.DataFrame(
    [
        ["France", "Gewürztraminer", True], 
        ["Australia", "beautiful nature", True], 
        ["Israel", "hummus", True], 
        ["Iceland", "language", False]
    ], 
    columns=["country", "awesome because of", "been there"]
)
```
    
</details>

<details>
<summary> > Solution 4</summary>
    
```python
data[["AnzahlFall", "AnzahlTodesfall", "AnzahlGenesen"]]
```
    
</details>

<details>
<summary> > Solution 5</summary>
    
```python
data.loc[:, ["AnzahlFall", "AnzahlTodesfall", "AnzahlGenesen"]]
```
    
</details>

<details>
<summary> > Solution 6</summary>
    
```python
data.iloc[:, [4, 5, 6]]
```
    
</details>

<details>
<summary> > Solution 7</summary>
    
```python
data["Altersgruppe"].unique()
```
    
</details>

<details>
<summary> > Solution 8</summary>
    
```python
len(data["Landkreis"].unique())
```
    
</details>

<details>
<summary> > Solution 9</summary>
    
```python
data[data["Landkreis"] == "SK Berlin Mitte"]
```
    
</details>

<details>
<summary> > Solution 10</summary>
    
```python
data[
    (data["Bundesland"] == "Berlin") & 
    (data["Altersgruppe"] == "A35-A59")
]
```
    
</details>

<details>
<summary> > Solution 11</summary>
    
Go through _Chapter 5.7._ one more time.
    
</details>

<details>
<summary> > Solution 12</summary>
    
```python
data_berlin.groupby('Landkreis')['AnzahlFall'].sum()
```
    
</details>

<details>
<summary> > Solution 13</summary>
    
```python
data_berlin.groupby('Landkreis')['AnzahlFall'].sum().plot(
    kind='bar', title='Number of COVID-19 cases in Berlin'
)
```
    
</details>

<details>
<summary> > Solution to final exercise</summary>
    
```python
vaccination_cumulative = vaccination_cumulative.set_index("date")
vaccination_cumulative = vaccination_cumulative[["personen_erst_kumulativ", "personen_voll_kumulativ", "personen_zweit_kumulativ"]]
vaccination_cumulative.plot()
```
    
</details>