# Python Lab 3: Categorical Data Visualization

In this lab, I hope we can tackle the following question:

# "How likely are we to be infected by the coronavirus?"

<img src="Images/virus.png" width=400>

This question is simple to ask but can be extremely difficult to answer. Before we dig into the code, let me introduce some of the statistical concepts behind the analysis.

## How can we draw conclusions from data?

To draw conclusions from data, we usually have to conduct a hypothesis test (this will be the main focus of later labs). However, we should always visualize the data before conducting any analysis on it, because visualizations help us see the patterns contained in the data.

## Randomized Experiments vs Observational Studies

## 1. What is a randomized experiment?

Review from the slide:

<img src="Images/randomized.png" width=400>

What is a confounding variable?

<img src="Images/confounding.png" width=400>

Example:

African Americans have a higher COVID-19 mortality rate than Whites. However, we cannot conclude that race itself makes African Americans more likely to die from COVID-19, because confounding factors may explain the difference:

- Socioeconomic factors
- Health conditions
- Some factors that we don't know
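
To see how a confounding variable can create a gap between two groups, here is a minimal sketch. All counts are hypothetical and invented purely for illustration. In this toy setup the death rate depends only on whether a person has a health condition, yet the two groups end up with very different overall death rates because the condition is more common in one of them.

```Python
# Hypothetical numbers, invented for illustration only (not real data).
# The death rate depends ONLY on the health condition, not on the group.
death_rate = {'condition': 0.10, 'no condition': 0.01}

# The two groups differ only in how common the health condition is.
group_sizes = {
    'Group A': {'condition': 800, 'no condition': 200},
    'Group B': {'condition': 200, 'no condition': 800},
}

overall_rate = {}
for group, sizes in group_sizes.items():
    # Expected deaths in each stratum, added up over the group
    deaths = sum(n * death_rate[status] for status, n in sizes.items())
    total = sum(sizes.values())
    overall_rate[group] = deaths / total * 100
    print('{}: {:.1f} % overall death rate'.format(group, overall_rate[group]))
```

Group A comes out at about 8.2 % and Group B at about 2.8 %, even though people with the same health status have the same risk in both groups. Comparing only the overall rates would wrongly suggest that the group itself matters.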

## 2. What is an observational study?

<img src="Images/observational.png" width=400>

### Important:

- Statistical inferences of cause-and-effect relationships can be drawn from randomized experiments, but not from observational studies.
- Inferences to populations can be drawn from random sampling studies, but not otherwise.

Definition:

An inference is a conclusion that patterns in the data are present in some broader context.

We always have to think about our research question before conducting the analysis. It would not be a good idea to analyze big data from observational studies that won't answer your research question.

<img src="Images/bigdata.png" width=400>

## Why do we care about observational studies?

1. Observational studies can be easier to conduct
2. Observational studies can be used to find potential factors that may affect the outcome
3. Randomized experiments are often not possible to conduct

## Example: COVID-19 Data

Remember from the previous lab that we can load a csv file in Python by using the following command:

```Python
import pandas as pd
df = pd.read_csv('data.csv')
```
where ```'data.csv' ``` represents the name of the csv file.

In [6]:
import pandas as pd
df = pd.read_csv('Data/COVID.csv')

This code loads the ```COVID.csv``` file from the ```Data``` folder.

In [7]:
df

| Unnamed: 0 | Race | Cases | Deaths | Population |
|---|---|---|---|---|
| 0 | White | 2010404 | 96979 | 197277789 |
| 1 | Black | 824169 | 39648 | 39445495 |
| 2 | Latin | 1030154 | 36882 | 56510571 |
| 3 | Asian | 123389 | 6891 | 16989540 |
| 4 | Other | 452013 | 8694 | 10781012 |
| 5 | Unknown | 2646058 | 11052 | 0 |


These are the exact counts as of September 27, 2020.
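
The ```Unnamed: 0``` column in the table above usually appears when a CSV file was saved together with its row index. The sketch below assumes that is what happened here: it uses a stand-in string typed from the table (with an empty first header field, which is an assumption about the real file) and shows how the ```index_col``` parameter of ```pd.read_csv``` treats that column as the row index instead.

```Python
import io
import pandas as pd

# Stand-in for Data/COVID.csv, typed from the table above.
# ASSUMPTION: the real file has an empty first header field like this one.
csv_text = """,Race,Cases,Deaths,Population
0,White,2010404,96979,197277789
1,Black,824169,39648,39445495
2,Latin,1030154,36882,56510571
3,Asian,123389,6891,16989540
4,Other,452013,8694,10781012
5,Unknown,2646058,11052,0
"""

# Default behavior: the unlabeled first column is named 'Unnamed: 0'
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

With the real file, the equivalent call would be ```pd.read_csv('Data/COVID.csv', index_col=0)```, provided the file is laid out as assumed.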

### Question: Is this data from a randomized experiment or an observational study?

Answer: This is from an observational study. The cases and deaths were recorded as they occurred; nobody was randomly assigned to a group or a treatment.

## Major reasons why randomized experiments / random sampling are difficult

### 1. Taking a random sample from the entire population in the U.S. is not practical

There is no single list of all the people in the U.S. However, let's assume that we could draw a random sample of the U.S. population from a list of social security numbers. Imagine that you randomly select 1,000 people. If one of them lives in rural Alaska, would you fly all the way there to collect data from one person? That would be too expensive.
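
Drawing the sample itself is the easy part. As a minimal sketch, assume a complete registry existed in which every person had an ID number (both the registry and its size are hypothetical). Python's ```random.sample``` then picks 1,000 distinct IDs:

```Python
import random

# ASSUMPTION: a complete registry exists and each person has an ID
# from 0 to registry_size - 1. The registry size is hypothetical.
registry_size = 300_000_000

random.seed(0)  # fixed seed only so the sketch is repeatable

# Draw 1,000 distinct IDs (sampling without replacement)
sample_ids = random.sample(range(registry_size), k=1000)

print(len(sample_ids), 'people selected')
print(sample_ids[:5])
```

The expensive step is what comes after: actually reaching every selected person, wherever they live.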

### 2. We cannot randomly assign race to different people

Let's imagine that we gathered some people for the experiment. Can we randomly assign a race to each of them? We cannot: race is not something an experimenter can assign.

### 3. We need a large sample size

To illustrate this, let's analyze the dataset that was given before:

Remember that we can get all the values in a column in a dataframe by using the following code:
```Python
df['Name of Column']
```
Also,
```Python
sum()
```
can be used to find the sum of all the values in the list.
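
As a side note, a pandas column also has its own ```.sum()``` method, which gives the same total as the built-in ```sum()```. The sketch below uses a small stand-in dataframe (the name ```df_demo``` is only for illustration) with the ```Cases``` values from the table above:

```Python
import pandas as pd

# Small stand-in dataframe using the Cases column from the table above
df_demo = pd.DataFrame({
    'Race': ['White', 'Black', 'Latin', 'Asian', 'Other', 'Unknown'],
    'Cases': [2010404, 824169, 1030154, 123389, 452013, 2646058],
})

total_builtin = sum(df_demo['Cases'])   # Python's built-in sum()
total_method = df_demo['Cases'].sum()   # pandas Series.sum() method

print(total_builtin, total_method)
```

Either form works for this lab; the method form is common in pandas code.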

In [10]:
# Total number of COVID cases and deaths in the U.S.
cases = sum(df['Cases'])
deaths = sum(df['Deaths'])

# Find total U.S. population
population = sum(df['Population'])

# Find percentages by dividing these numbers by the total population in the U.S.
# and multiplying by 100
cases_rate = cases / population * 100
deaths_rate = deaths / population * 100

print('{} % of U.S. population is affected by COVID-19'.format(cases_rate))
print('{} % of U.S. population died from COVID-19'.format(deaths_rate))

2.207504584197188 % of U.S. population is affected by COVID-19
0.06234992281585717 % of U.S. population died from COVID-19
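
The same division can be done row by row to compare the groups. This is a sketch rather than part of the original analysis: it rebuilds the table in a stand-in dataframe (```df_demo```), drops the ```Unknown``` row because its population is 0, and expresses the rates per 100,000 people, which is a unit chosen here only for readability.

```Python
import pandas as pd

# Stand-in for the dataframe shown earlier (values copied from the table)
df_demo = pd.DataFrame({
    'Race': ['White', 'Black', 'Latin', 'Asian', 'Other', 'Unknown'],
    'Cases': [2010404, 824169, 1030154, 123389, 452013, 2646058],
    'Deaths': [96979, 39648, 36882, 6891, 8694, 11052],
    'Population': [197277789, 39445495, 56510571, 16989540, 10781012, 0],
})

# The 'Unknown' row has population 0, so drop it to avoid dividing by zero
known = df_demo[df_demo['Population'] > 0].copy()

# Rates per 100,000 people in each group
known['Cases per 100k'] = known['Cases'] / known['Population'] * 100_000
known['Deaths per 100k'] = known['Deaths'] / known['Population'] * 100_000

print(known[['Race', 'Cases per 100k', 'Deaths per 100k']].round(1))
```

These are raw rates. They leave out the cases and deaths recorded as ```Unknown```, and they do not account for any confounding variables.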


If we were to conduct a randomized experiment on COVID-19, we would do the following:

1. Take a sample from the population
2. Randomly assign a treatment to individuals
3. Continue to observe those people for 1-2 years and compare the proportion of COVID-19 deaths
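
Step 2 can be sketched in a few lines. The participant IDs below are hypothetical, and splitting a shuffled list in half is just one simple way to randomize:

```Python
import random

# Hypothetical participant IDs; a real study would use its own sample
participants = list(range(1, 21))

random.seed(1)  # fixed seed only so the sketch is repeatable
random.shuffle(participants)

# First half of the shuffled list gets the treatment, the rest is the control
half = len(participants) // 2
treatment_group = sorted(participants[:half])
control_group = sorted(participants[half:])

print('Treatment:', treatment_group)
print('Control:  ', control_group)
```

Because chance alone decides who gets the treatment, confounding variables tend to balance out between the two groups.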

We can see that the probability of dying from COVID-19 is very low (about 0.06% in the data above), so we would need an extremely large sample to observe a sufficient number of COVID-19 deaths. There are also ethical issues with such an experiment.
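
A rough calculation shows the scale of the problem. The sketch below reuses the totals of the ```Deaths``` and ```Population``` columns and assumes that the death probability during the study would be about the same as the rate observed so far. The target of 100 deaths is an arbitrary number chosen only for illustration.

```Python
import math

# Totals of the Deaths and Population columns in the dataset above
deaths = 200146
population = 321004407
death_prob = deaths / population

# Expected number of deaths in a sample of 1,000 people
sample_size = 1000
expected_deaths = sample_size * death_prob
print('Expected deaths among {} people: {:.2f}'.format(sample_size, expected_deaths))

# ASSUMPTION: we want to observe about 100 deaths (arbitrary target)
target_deaths = 100
needed_sample = math.ceil(target_deaths / death_prob)
print('People needed to expect {} deaths: {}'.format(target_deaths, needed_sample))
```

With 1,000 people we would expect fewer than one death, and a sample on the order of 160,000 people would be needed to expect 100.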

### Does this mean that we cannot get any information from COVID-19 data?

No. While we can't draw conclusions about the entire U.S. population, we can use the dataset to look for potential factors behind COVID-19 deaths. Several tools, including ANOVA and regression analysis, let us account for the effects of confounding variables, and we will cover them in later labs.