## STATISTICS WORKSHOP  
__TEST__  

__USING THE NOTEBOOK__  
The present notebook is composed of text and code cells. The former include the instructions for the activity and look just like regular text in a webpage. The code cells look like gray squares with empty square brackets to their left ([ ]). To run the code inside a code cell, click on it and then press "Shift" + "Enter"; the output of the code will appear underneath the cell.  
We will be using the software package R to do the statistical analysis.  


__LOADING THE DATABASE__  
In this exercise we will use a database of patients evaluated for obstructive sleep apnea syndrome (OSAS). Each patient filled out a survey where epidemiological characteristics and symptoms were recorded. The database contains some of those characteristics along with whether the patient had OSAS or not, and its severity, based on a measure of how frequently the patient stops breathing during the night called the Apnea-Hypopnea Index (ahi).  
We will load the data we'll work with into memory from a CSV file hosted on GitHub.

In [None]:
# header=TRUE: the first row of the CSV file holds the column names
data = read.csv("https://raw.githubusercontent.com/gapatino/stats-notebooks/master/stats_workshop_database.csv", 
                header=TRUE)

To check that the file was properly loaded, we will display its first few rows.

In [None]:
head(data)

As you can see, the variable _data_ is a spreadsheet where each row is a different patient, and each column is a different characteristic of the patient (the variables).  
Some of the variables seem to be numerical, while others are likely categorical.  
To get a better understanding of the data we can use the <code>summary</code> function.

In [None]:
summary(data)

The summary shows that age, bmi, and ahi are numerical variables. Age is discrete, while the other two are continuous. For all of them we get some basic statistics like the mean, median, minimum and maximum values. This is a good time to do quality control of your data, by making sure that the values reported in each variable (especially the maximum and minimum) make sense.   
Gender, snoring, apnea, alcohol, smoking, hypertension, and osas are binary variables. osas_severity is the only categorical variable with more than two levels (it has 4). Summary displays how many patients are in each category for the different categorical variables.   
Another quality control you should do at this point is to check if there are any cells with missing values. These would show up as "NA's" counts in the respective column. None of our columns have them, so we can proceed with the analysis.
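
The two quality-control checks described above can also be done explicitly. The sketch below uses a small made-up data frame (`toy`, with invented values that are not from the workshop database) so that it runs on its own; the same calls should work on _data_. The plausible age range of 0 to 120 is an assumption chosen only for illustration.

```r
# Made-up example data (hypothetical values, not from the workshop database)
toy <- data.frame(
  age = c(45, 61, NA, 38, 230),        # NA = missing value, 230 = implausible entry
  bmi = c(27.1, 33.4, 29.8, NA, 31.0)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Minimum and maximum of age, ignoring missing values
age_range <- range(toy$age, na.rm = TRUE)
print(age_range)

# Row numbers whose age falls outside the assumed plausible range of 0 to 120
implausible <- which(toy$age < 0 | toy$age > 120)
print(implausible)
```

A column with a non-zero count in `na_counts`, or a row listed in `implausible`, would need to be reviewed before continuing with the analysis.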

__Checking numerical variables for normal distribution__  
The first step is to decide if we need to use parametric or non-parametric methods in our statistical analysis, and this depends on whether our data is normally distributed or not.
A first approximation to this is to plot a histogram of the distribution of age. Since we will compare the ages of people with and without OSAS, we have to check for normal distribution in each group separately.

In [None]:
hist(data$age[data$osas==TRUE])

The <code>hist()</code> function produces the histogram of the data inside the parentheses. In R, <code>data\$age</code> means to take the column _age_ from the variable _data_, while <code>data\$age\[data\$osas==TRUE\]</code> means to take only the values in the column _age_ from patients (rows) that had a _TRUE_ value in the _osas_ column.  
In the following code cell make the histogram for the ages of patients that did not have a diagnosis of OSAS.
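
As a reminder of how the logical subsetting described above works, here is a minimal sketch on a small made-up data frame (`toy`, with invented values that are not from the workshop database). It only illustrates the indexing step, not the histogram itself.

```r
# Made-up example data (hypothetical values, not from the workshop database)
toy <- data.frame(
  age  = c(45, 61, 52, 38, 70),
  osas = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# The comparison produces one TRUE/FALSE value per row
keep <- toy$osas == TRUE
print(keep)

# Using that logical vector inside [ ] keeps only the ages where it is TRUE
ages_with_osas <- toy$age[keep]
print(ages_with_osas)

# In this toy data frame osas is already logical, so toy$age[toy$osas]
# selects the same values
same <- identical(ages_with_osas, toy$age[toy$osas])
print(same)
```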

__Comparing numerical variables from two groups__  
If we want to know if the patients diagnosed with OSAS tend to be younger or older than those without the disease, then the variable we are comparing (age, the dependent variable) is numerical, while the variable that defines the two groups (osas, the independent variable) is binary. For this situation we can use a _t_-test if age is normally distributed in each group, or a Wilcoxon rank-sum test if not.

In [None]:
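# Independent (unpaired) two-sample t-test of age by OSAS status.
# No `paired` argument is passed: the groups are independent, which is the
# default, and newer R versions may reject `paired` with a formula.
t.test(data$age~data$osas, 
      alternative="two.sided",
      var.equal=TRUE,
      conf.level=0.95)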
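
To make the role of `var.equal=TRUE` more concrete, the following sketch computes the equal-variance (pooled) two-sample _t_ statistic by hand and compares it with the value reported by `t.test`. The vectors `g1` and `g2` are invented for illustration and are not taken from the workshop database.

```r
# Made-up ages for two independent groups (hypothetical values)
g1 <- c(45, 52, 70, 58, 63)
g2 <- c(61, 38, 49, 44, 55)
n1 <- length(g1)
n2 <- length(g2)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)

# t statistic: difference in means divided by its standard error
t_manual <- (mean(g1) - mean(g2)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n1 + n2 - 2)

# Same test with the built-in function
res <- t.test(g1, g2, var.equal = TRUE)
print(c(manual = t_manual, builtin = unname(res$statistic)))
print(c(manual = p_manual, builtin = res$p.value))
```

Both pairs of printed values should agree, since with `var.equal = TRUE` the built-in test uses this same pooled-variance formula.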

In [None]:
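# Wilcoxon rank-sum test of age by OSAS status (independent groups).
# conf.int=TRUE also reports a confidence interval for the location shift.
wilcox.test(data$age~data$osas, 
            alternative="two.sided",
            conf.level=0.95,
            conf.int=TRUE)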
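
The Wilcoxon rank-sum test works on the ranks of the observations rather than on their values. As a minimal sketch, the statistic `W` that `wilcox.test` reports can be reproduced by hand on two small made-up vectors (`g1` and `g2` are invented for illustration, with no tied values).

```r
# Made-up ages for two independent groups (hypothetical values, no ties)
g1 <- c(45, 52, 70, 58, 63)
g2 <- c(61, 38, 49, 44, 55)
n1 <- length(g1)

# Rank all observations together, from smallest (1) to largest
r <- rank(c(g1, g2))

# W: sum of the ranks of the first group minus its smallest possible value
w_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Same statistic from the built-in function
res <- wilcox.test(g1, g2)
print(c(manual = w_manual, builtin = unname(res$statistic)))
```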

In [None]:
barplot(c(mean(data$age[data$osas==TRUE]),
          mean(data$age[data$osas==FALSE])),
        names.arg=c("OSAS", "No OSAS"), ylab="Mean age")