# Quick note on assignments:

I got the following question during office hours:

"Am I allowed to use the same codes on these lecture notes to solve HW problems? Also, am I allowed to use the codes on the internet"

### Answer:

YES, ABSOLUTELY! In fact, I highly encourage you to copy and paste the code that I provide, since most of the HW problems can be solved by slightly modifying it. The problems can be extremely difficult if you try to write everything from scratch.

You are also welcome to use code from the internet, but it may not be as useful as the code in this notebook: you won't find code that solves the HW problems exactly, and adapting code from the internet to your purpose can take some time.

### Explanation:

Most research projects in Python, such as deep learning or RNA sequencing projects, reuse code that was developed in previous research projects. Code is just a tool for getting things done, and I think the best way to improve your coding is to copy and paste prewritten code and adapt it to your purpose. That said, no HW problem can be solved by copying and pasting code from here or the internet alone; you will have to modify it.

# Time spent on assignments

The grade distribution of this course is the following:

- 68% Lab Project (The details will be given to you soon) [17% for 4 points]
- 32% Class Assignments [8% for 4 points]

Most of your grade is determined by the lab project, and the weekly HWs should just serve as a tool to check your understanding of the concepts covered in the labs. These assignments are designed to be solvable in 1-2 hours, and in no more than 6 hours even if you are new to programming.

## Poll:

How much time do you spend on these assignments? Please be honest, as the responses are totally anonymous (we can't see your names in the responses).
- A: Less than 2 hours
- B: 2-6 hours (Less than an entire day)
- C: 6-10 hours (More than an entire day)
- D: 10+ hours (Your other studies are affected by this assignment)

# Today's agenda:

1. Data preparation / review from last time
2. Grouped bar chart
3. Stacked bar chart
4. Summary statistics (numpy)
5. Data analysis with numpy
6. COVID-19 practice (if time permits)

# Data preparation

Previously, we only created bar charts for one variable, but here we will split the data into several groups.

Remember that we obtained a two-way contingency table in the previous lab. We will rebuild it here.

In [2]:
import pandas as pd
df = pd.read_csv("Data/NHANES.csv")

One thing to note:

```Python
df.head()
```
shows the first 5 rows, which makes it easier to see the overall picture of the dataframe.

In [4]:
df.head()

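| | ID | SurveyYr | Gender | Age | AgeDecade | AgeMonths | Race1 | Race3 | Education | MaritalStatus | ... | RegularMarij | AgeRegMarij | HardDrugs | SexEver | SexAge | SexNumPartnLife | SexNumPartYear | SameSex | SexOrientation | PregntNow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51624 | 2009_10 | male | 34 | 30-39 | 409.0 | White | NaN | High School | Married | ... | No | NaN | Yes | Yes | 16.0 | 8.0 | 1.0 | No | Heterosexual | NaN |
| 1 | 51624 | 2009_10 | male | 34 | 30-39 | 409.0 | White | NaN | High School | Married | ... | No | NaN | Yes | Yes | 16.0 | 8.0 | 1.0 | No | Heterosexual | NaN |
| 2 | 51624 | 2009_10 | male | 34 | 30-39 | 409.0 | White | NaN | High School | Married | ... | No | NaN | Yes | Yes | 16.0 | 8.0 | 1.0 | No | Heterosexual | NaN |
| 3 | 51625 | 2009_10 | male | 4 | 0-9 | 49.0 | Other | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 51630 | 2009_10 | female | 49 | 40-49 | 596.0 | White | NaN | Some College | LivePartner | ... | No | NaN | Yes | Yes | 12.0 | 10.0 | 1.0 | Yes | Heterosexual | NaN |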


## Refresher on pandas

The original dataframe contains all the data, but we are only interested in `Race3` and `BMI`. We can use the following code to select certain columns from the original dataframe:

```Python
df[['Column1', 'Column2', ...]]
```
where `'Column1'`, `'Column2'` are the column names.
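
As a small illustration of the selection syntax, here is a sketch on a made-up dataframe (the name `toy` and its values are hypothetical, chosen only to resemble the NHANES columns). It shows that a list of names inside the brackets gives back a dataframe, while a single name gives back a Series:

```python
import pandas as pd

# Hypothetical toy data standing in for the NHANES dataframe
toy = pd.DataFrame({
    "BMI": [32.22, 15.30, 30.57],
    "Race3": ["White", None, "Asian"],
    "Age": [34, 4, 49],
})

# A list of column names (double brackets) returns a DataFrame
subset = toy[["BMI", "Race3"]]

# A single column name returns a Series
bmi_only = toy["BMI"]

print(type(subset).__name__)    # DataFrame
print(type(bmi_only).__name__)  # Series
print(list(subset.columns))     # ['BMI', 'Race3']
```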

In [13]:
data = df[['BMI', 'Race3']]

In [14]:
data.head()

| | BMI | Race3 |
|---|---|---|
| 0 | 32.22 | NaN |
| 1 | 32.22 | NaN |
| 2 | 32.22 | NaN |
| 3 | 15.3 | NaN |
| 4 | 30.57 | NaN |


Looks great! We will now drop the rows that contain NaNs by using:
```Python
df.dropna()
```

In [15]:
data = data.dropna()
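
To see what `dropna()` does, here is a minimal sketch on a made-up dataframe (`toy`, `cleaned`, and `bmi_known` are hypothetical names). It also shows the `subset=` parameter, which is not used in this lab but can help when you only care about NaNs in certain columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is missing BMI, row 2 is missing Race3
toy = pd.DataFrame({
    "BMI": [32.22, np.nan, 30.57, 21.0],
    "Race3": ["White", "Asian", np.nan, "Black"],
})

# dropna() returns a NEW dataframe without any row that has a NaN;
# the original dataframe is left unchanged
cleaned = toy.dropna()
print(len(toy), len(cleaned))  # 4 2

# subset= only checks the listed columns for NaNs
bmi_known = toy.dropna(subset=["BMI"])
print(len(bmi_known))  # 3
```

This is why the cell above writes `data = data.dropna()`: without the assignment, the result would be discarded.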

### Remember:

This code adds an obesity category for each row to a new column called `Obesity`.

In [16]:
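# This was covered before
def function(row):
    """Return the obesity category for one row, based on its BMI."""
    # Thresholds are checked from highest to lowest, so the first match wins
    if row['BMI'] >= 40:
        return 'Severe Obesity'
    elif row['BMI'] >= 30:
        return 'Obesity'
    elif row['BMI'] >= 25:
        return 'Overweight'
    elif row['BMI'] >= 18.5:
        return 'Normal'
    elif row['BMI'] < 18.5:
        return 'Underweight'
    else:
        # Only reached when BMI is missing (comparisons with NaN are False)
        return 'NaN'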

In [17]:
data['Obesity'] = data.apply(function, axis=1)
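
As an alternative to the row-by-row function, `pd.cut` can assign the same categories in one call. This is only a sketch on made-up BMI values (`toy`, `bins`, and `labels` are hypothetical names), not the method used in the rest of this lab. Setting `right=False` makes each bin include its lower edge, which matches the `>=` comparisons above:

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values, including values exactly on the thresholds
toy = pd.DataFrame({"BMI": [17.0, 18.5, 24.9, 25.0, 30.0, 39.9, 40.0]})

bins = [-np.inf, 18.5, 25, 30, 40, np.inf]
labels = ["Underweight", "Normal", "Overweight", "Obesity", "Severe Obesity"]

# right=False gives the intervals [18.5, 25), [25, 30), [30, 40), [40, inf)
toy["Obesity"] = pd.cut(toy["BMI"], bins=bins, labels=labels, right=False)
print(toy)
```

Unlike the function above, `pd.cut` leaves a missing BMI as a real missing value instead of the string `'NaN'`.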

## Create two-way frequency table

In this week's assignment (HW3), you have to create a frequency table of counts. Here, we will focus on two-way frequency tables to examine obesity level broken down by race.

In [18]:
table = pd.crosstab(index=data["Obesity"], columns=data["Race3"]) 

In [19]:
table

| Obesity (rows) / Race3 (columns) | Asian | Black | Hispanic | Mexican | Other | White |
|---|---|---|---|---|---|---|
| Normal | 130 | 135 | 86 | 122 | 46 | 974 |
| Obesity | 25 | 162 | 87 | 115 | 27 | 711 |
| Overweight | 72 | 131 | 105 | 119 | 34 | 858 |
| Severe Obesity | 4 | 45 | 6 | 16 | 13 | 140 |
| Underweight | 46 | 95 | 56 | 80 | 30 | 364 |
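
To make it easier to see how `pd.crosstab` counts, here is a sketch on six made-up rows (`toy` and `counts` are hypothetical names). The optional `margins=True` argument, which is not used in the cell above, adds an `All` row and column with the totals:

```python
import pandas as pd

# Hypothetical toy data: one row per person
toy = pd.DataFrame({
    "Obesity": ["Normal", "Normal", "Obesity", "Overweight", "Normal", "Obesity"],
    "Race3":   ["Asian",  "White",  "White",   "Asian",      "White",  "Black"],
})

# Each cell counts the rows with that (Obesity, Race3) combination
counts = pd.crosstab(index=toy["Obesity"], columns=toy["Race3"], margins=True)
print(counts)

print(counts.loc["Normal", "White"])  # 2
print(counts.loc["All", "All"])       # 6
```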


The rows are sorted alphabetically, so we reorder them from the lowest to the highest BMI category.

In [20]:
table = table.reindex(['Underweight','Normal','Overweight','Obesity','Severe Obesity'])
table

| Obesity (rows) / Race3 (columns) | Asian | Black | Hispanic | Mexican | Other | White |
|---|---|---|---|---|---|---|
| Underweight | 46 | 95 | 56 | 80 | 30 | 364 |
| Normal | 130 | 135 | 86 | 122 | 46 | 974 |
| Overweight | 72 | 131 | 105 | 119 | 34 | 858 |
| Obesity | 25 | 162 | 87 | 115 | 27 | 711 |
| Severe Obesity | 4 | 45 | 6 | 16 | 13 | 140 |
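
One thing to watch out for with `reindex`: the labels must match the existing row names exactly. The sketch below uses a small made-up table (`counts`, `ordered`, and `with_typo` are hypothetical names) to show that a label that does not exist produces a row of NaNs instead of an error:

```python
import pandas as pd

# Hypothetical one-column count table with alphabetically sorted rows
counts = pd.DataFrame(
    {"Asian": [130, 25, 46]},
    index=["Normal", "Obesity", "Underweight"],
)

# Rows come back in the order given in the list
ordered = counts.reindex(["Underweight", "Normal", "Obesity"])
print(ordered)

# A misspelled label does not raise an error; it creates a NaN row
with_typo = counts.reindex(["Underweight", "Normal", "Obesty"])
print(with_typo)
```

If a row of NaNs shows up after reordering, check the spelling of the labels first.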


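The groups differ a lot in size, so comparing the raw counts is not a good idea. Instead, we will divide each count by its column total to get the proportion of each race that falls in each obesity category (multiply by 100 for a percentage).

By the way, a very similar problem in HW3 asks you to do this.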

In [21]:
table['Asian_per'] = table['Asian'] / sum(table['Asian'])
table['Black_per'] = table['Black'] / sum(table['Black'])
table['Hispanic_per'] = table['Hispanic'] / sum(table['Hispanic'])
table['Mexican_per'] = table['Mexican'] / sum(table['Mexican'])
table['Other_per'] = table['Other'] / sum(table['Other'])
table['White_per'] = table['White'] / sum(table['White'])

In [22]:
table

| Obesity (rows) / Race3 (columns) | Asian | Black | Hispanic | Mexican | Other | White | Asian_per | Black_per | Hispanic_per | Mexican_per | Other_per | White_per |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Underweight | 46 | 95 | 56 | 80 | 30 | 364 | 0.166065 | 0.167254 | 0.164706 | 0.176991 | 0.2 | 0.119462 |
| Normal | 130 | 135 | 86 | 122 | 46 | 974 | 0.469314 | 0.237676 | 0.252941 | 0.269912 | 0.306667 | 0.319659 |
| Overweight | 72 | 131 | 105 | 119 | 34 | 858 | 0.259928 | 0.230634 | 0.308824 | 0.263274 | 0.226667 | 0.281588 |
| Obesity | 25 | 162 | 87 | 115 | 27 | 711 | 0.090253 | 0.285211 | 0.255882 | 0.254425 | 0.18 | 0.233344 |
| Severe Obesity | 4 | 45 | 6 | 16 | 13 | 140 | 0.01444 | 0.079225 | 0.017647 | 0.035398 | 0.086667 | 0.045947 |
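
The same proportions can be computed for all columns at once by dividing the table by its column totals. The sketch below rebuilds just the `Asian` and `Black` counts from the table above (`counts` and `props` are hypothetical names) so that it runs on its own:

```python
import pandas as pd

# The Asian and Black counts copied from the table above
counts = pd.DataFrame(
    {"Asian": [46, 130, 72, 25, 4], "Black": [95, 135, 131, 162, 45]},
    index=["Underweight", "Normal", "Overweight", "Obesity", "Severe Obesity"],
)

# counts.sum() gives one total per column; the division is done column by column
props = counts / counts.sum()
print(props.round(6))

# Each column of proportions should add up to 1
print(props.sum())
```

`pd.crosstab` also accepts `normalize='columns'`, which returns these column proportions directly instead of the counts.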


One thing:

You can access the index (row names) of the dataframe by using the following code:

```Python
df.index
```

In [24]:
table.index

Index(['Underweight', 'Normal', 'Overweight', 'Obesity', 'Severe Obesity'], dtype='object', name='Obesity')

I hope it was a good refresher. This dataset is stored as `Data/BMI_table.csv`.

You do not have to do it now, but this code saves the dataframe as a csv file for future use:

```Python
df.to_csv('name of csv file.csv')
```

I used the following code to save this as `BMI_table.csv` in the `Data` folder:

```Python
table.to_csv('Data/BMI_table.csv')
```
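
When you load a saved table back in, the row names come back as an ordinary first column unless you tell pandas otherwise. The following sketch saves a small made-up table to a temporary folder (so it does not touch your `Data` folder) and reads it back with `index_col=0`; the names `path` and `loaded` are hypothetical:

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row table with the obesity categories as row names
table = pd.DataFrame(
    {"Asian": [46, 130], "Black": [95, 135]},
    index=pd.Index(["Underweight", "Normal"], name="Obesity"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "BMI_table.csv")
    table.to_csv(path)  # the row names are written as the first column
    # index_col=0 tells pandas to use the first column as the row names again
    loaded = pd.read_csv(path, index_col=0)

print(loaded)
print(loaded.index.tolist())
```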

## One note:

I'm covering this again, because I personally think that data preparation is the most important thing for conducting data analysis in a computer program like Python, especially if you are interested in analyzing your OWN dataset.

Notice that I dealt with:

- NaNs
- Categorizing data
- Selecting a subset of data
- Creating a dataset that can be easily analyzed (a two-way contingency table in this example)

We will be providing raw datasets in many HW problems, and I really hope that you will have mastered data preparation by the end of this semester.