# Week 3: Identify Risk Factors for Infection

<span style="color:red">
**UPDATE**

Thank you again for the previous analysis. We will next be publishing a public health advisory that warns of specific infection risk factors of which individuals should be aware. Please advise as to which population characteristics are associated with higher infection rates. 
</span>

Your goal for this notebook will be to identify key potential demographic and economic risk factors for infection by comparing the infected and uninfected populations.

In [21]:
# 导入必要的库
import cudf
import cuml

# 加载数据
file_path = './data/week3.csv'
df = cudf.read_csv(file_path)

# 将'infected'列转换为float32类型
df['infected'] = df['infected'].astype('float32')

# 计算每个就业类型的感染率
employment_groups = df.groupby('employment').agg({'infected': 'mean', 'employment': 'count'}).rename(columns={'infected': 'infection_rate', 'employment': 'total_count'})
employment_groups = employment_groups.sort_values(by='infection_rate', ascending=False).reset_index()

# 读取就业代码指南
code_guide = cudf.read_csv('./data/code_guide.csv')

# 合并感染率数据和就业代码指南
employment_infection_rates = employment_groups.merge(code_guide, left_on='employment', right_on='Code')

# 打印感染率最高的就业类型
print("按就业类型排序的感染率：")
print(employment_infection_rates)

# 计算每个就业类型和性别的感染率
employment_sex_groups = df.groupby(['employment', 'sex']).agg({'infected': 'mean', 'employment': 'count'}).rename(columns={'infected': 'infection_rate', 'employment': 'total_count'})
employment_sex_groups = employment_sex_groups.sort_values(by='infection_rate', ascending=False).reset_index()

# 合并感染率数据和就业代码指南
employment_sex_infection_rates = employment_sex_groups.merge(code_guide, left_on='employment', right_on='Code')

# 打印按就业类型和性别排序的感染率
print("\n按就业类型和性别排序的感染率：")
print(employment_sex_infection_rates)

按就业类型排序的感染率：
   employment  infection_rate  total_count     Code  \
0           U        0.000217     12459115        U   
1           V        0.007590     10098466        V   
2           X        0.004539       181988        X   
3           Z        0.005655      7161907        Z   
4           A        0.003853       305755        A   
5     B, D, E        0.003774       486785  B, D, E   
6           C        0.003882      2653753        C   
7           F        0.003182      2075628        F   
8           G        0.004948      3549465        G   
9           H        0.003388      1398342        H   
10          I        0.010354      1556575        I   
11          J        0.003939      1180372        J   
12          K        0.004772      1122406        K   
13          L        0.004970       346470        L   
14          M        0.004777      2214336        M   
15          N        0.004784      1367137        N   
16          O        0.005284      1843446        O 

In [3]:
# 导入必要的库
import cudf
import cuml

# 加载数据
file_path = './data/week3.csv'
df = cudf.read_csv(file_path)

# 将'infected'列转换为float32类型
df['infected'] = df['infected'].astype('float32')

# 计算每个就业类型的感染率
employment_groups = df.groupby('employment_code').agg({'infected': 'mean', 'employment_code': 'count'}).rename(columns={'infected': 'infection_rate', 'employment_code': 'total_count'})
employment_groups = employment_groups.sort_values(by='infection_rate', ascending=False).reset_index()

# 读取就业代码指南
code_guide = cudf.read_csv('./data/code_guide.csv')

# 合并感染率数据和就业代码指南
employment_infection_rates = employment_groups.merge(code_guide, on='employment_code')

# 打印感染率最高的就业类型
print("按就业类型排序的感染率：")
print(employment_infection_rates)

# 计算每个就业类型和性别的感染率
employment_sex_groups = df.groupby(['employment_code', 'sex']).agg({'infected': 'mean', 'employment_code': 'count'}).rename(columns={'infected': 'infection_rate', 'employment_code': 'total_count'})
employment_sex_groups = employment_sex_groups.sort_values(by='infection_rate', ascending=False).reset_index()

# 合并感染率数据和就业代码指南
employment_sex_infection_rates = employment_sex_groups.merge(code_guide, on='employment_code')

# 打印按就业类型和性别排序的感染率
print("\n按就业类型和性别排序的感染率：")
print(employment_sex_infection_rates)

ValueError: Grouper and object must have same length

In [2]:
# 导入必要的库
import cudf
import cuml

# 加载数据
file_path = './data/week3.csv'
df = cudf.read_csv(file_path)

# 将'infected'列转换为float32类型
df['infected'] = df['infected'].astype('float32')

# 计算每个就业类型的感染率
employment_groups = df.groupby('employment_code').agg({'infected': 'mean', 'employment_code': 'count'}).rename(columns={'infected': 'infection_rate', 'employment_code': 'total_count'})
employment_groups = employment_groups.sort_values(by='infection_rate', ascending=False).reset_index()

# 读取就业代码指南
code_guide = cudf.read_csv('./data/code_guide.csv')

# 合并感染率数据和就业代码指南
employment_infection_rates = employment_groups.merge(code_guide, on='employment_code')

# 打印感染率最高的就业类型
print("按就业类型排序的感染率：")
print(employment_infection_rates)

# 计算每个就业类型和性别的感染率
employment_sex_groups = df.groupby(['employment_code', 'sex']).agg({'infected': 'mean', 'employment_code': 'count'}).rename(columns={'infected': 'infection_rate', 'employment_code': 'total_count'})
employment_sex_groups = employment_sex_groups.sort_values(by='infection_rate', ascending=False).reset_index()

# 合并感染率数据和就业代码指南
employment_sex_infection_rates = employment_sex_groups.merge(code_guide, on='employment_code')

# 打印按就业类型和性别排序的感染率
print("\n按就业类型和性别排序的感染率：")
print(employment_sex_infection_rates)

ValueError: Grouper and object must have same length

## Imports

In [4]:
import cudf
import cuml

## Load Data

Begin by loading the data you've received about week 3 of the outbreak into a cuDF data frame. The data is located at `./data/week3.csv`. For this notebook you will need all columns of the data.

In [5]:
# 加载数据
file_path = './data/week3.csv'
df = cudf.read_csv(file_path)

## Calculate Infection Rates by Employment Code

Convert the `infected` column to type `float32`. For people who are not infected, the float32 `infected` value should be `0.0`, and for infected people it should be `1.0`.

In [6]:
# 将'infected'列转换为float32类型
df['infected'] = df['infected'].astype('float32')

Now, produce a list of employment types and their associated **rates** of infection, sorted from highest to lowest rate of infection.

**NOTE**: The infection **rate** for each employment type should be the percentage of total individuals within an employment type who are infected. Therefore, if employment type "X" has 1000 people, and 10 of them are infected, the infection **rate** would be .01. If employment type "Z" has 10,000 people, and 50 of them are infected, the infection rate would be .005, and would be **lower** than for type "X", even though more people within that employment type were infected.

In [8]:
df.columns

Index(['age', 'sex', 'employment', 'infected'], dtype='object')

In [9]:
# 计算每个就业类型的感染率
employment_groups = df.groupby('employment').agg({'infected': 'mean', 'employment': 'count'}).rename(columns={'infected': 'infection_rate', 'employment_code': 'total_count'})
employment_groups = employment_groups.sort_values(by='infection_rate', ascending=False).reset_index()


Finally, read in the employment codes guide from `./data/code_guide.csv` to interpret which employment types are seeing the highest rates of infection.

In [15]:
# 读取就业代码指南
code_guide = cudf.read_csv('./data/code_guide.csv')

## Calculate Infection Rates by Employment Code and Sex

We want to see if there is an effect of `sex` on infection rate, either in addition to `employment` or confounding it. Group by both `employment` and `sex` simultaneously to get the infection rate for the intersection of those categories.

In [16]:
code_guide.columns

Index(['Code', 'Field'], dtype='object')

In [18]:
code_guide

Unnamed: 0,Code,Field
0,A,"Agriculture, forestry & fishing"
1,"B, D, E","Mining, energy and water supply"
2,C,Manufacturing
3,F,Construction
4,G,"Wholesale, retail & repair of motor vehicles"
5,H,Transport & storage
6,I,Accommodation & food services
7,J,Information & communication
8,K,Financial & insurance activities
9,L,Real estate activities


In [20]:
employment_groups.columns

Index(['employment', 'infection_rate'], dtype='object')

In [19]:
# 合并感染率数据和就业代码指南
employment_infection_rates = employment_groups.merge(code_guide, )
employmet_infection_rates=

ValueError: No common columns to perform merge on

## Take the Assessment

After completing the work above, visit the *Launch Section* web page that you used to launch this Jupyter Lab. Scroll down below where you launched Jupyter Lab, and answer the question *Week 3 Assessment*. You can view your overall progress in the assessment by visiting the same *Launch Section* page and clicking on the link to the *Progress* page. On the *Progress* page, if you have successfully answered all the assessment questions, you can click on *Generate Certificate* to receive your certificate in the course.

![launch_task_page](./images/launch_task_page.png)

<div align="center"><h2>Optional: Restart the Kernel</h2></div>

If you plan to continue work in other notebooks, please shutdown the kernel.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)