# Reviewing Statistics and R, part B

This is a continuation of your first lab for the Introduction to Statistical and Mathematical Foundations of Data Science course. 
You can refer to chapters 1 to 3 of the [Intro to Statistics textbook](http://onlinestatbook.com/2/index.html) for reference. 


**We now continue with the final portion of the introductory lab**

## Inferential Statistics

Inferential statistics help us draw inferences about a larger population from sample data. 
You rarely have access to full population datasets. 

Consider the following: 
The evidence linking cigarette smoking and lung disease is almost irrefutable. 
Based upon this information, what proportion of Americans have given up smoking? 
One way to answer this question would be to survey the entire population of the United States. 
Doing so would be impractical in terms of both time and cost. 
Mathematicians have created methods of estimating population parameters from samples drawn from target populations that adequately represent the larger population. 

In inferential statistics we will be answering questions or testing hypotheses about populations based upon samples or prior data. 
Since parameters of populations are generally not available we must rely on sampling techniques to estimate them.

### Sampling Distributions

A sampling distribution is a theoretical distribution that would result if we were to take all possible samples of a given size from a population. 
Suppose we fix the sample size at 20 individuals (N=20). 
If we computed a statistic such as the mean for every possible sample of that size, the distribution of those values would be the sampling distribution. 
In practice we draw far fewer samples, often just one, knowing that there will be some error. 
This error term can be calculated and will be used in inference. 

For example, how do we know the average height of males in this country is 5'9"? 
If we were to draw one hundred different samples of 10 males at random, 
we would find a certain amount of difference among the means and standard deviations of the samples.
The standard deviation of those sample means, say 2.25", 
is what is called the standard error of the mean. 
It can be defined as the theoretical standard deviation of the means of samples of a given size drawn from a population. 
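
The idea can be sketched with a small simulation. This is only an illustration under stated assumptions: the population is taken to be normal with a mean of 69" and a hypothetical standard deviation of 3" (neither value comes from real data), and the variable names are made up for this example.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

pop_mean  <- 69    # assumed population mean height in inches
pop_sd    <- 3     # assumed population standard deviation (hypothetical)
n         <- 10    # size of each sample
n_samples <- 100   # number of samples drawn

# Draw 100 samples of 10 heights and record the mean of each sample
sample_means <- replicate(n_samples, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))

empirical_se   <- sd(sample_means)   # observed spread of the sample means
theoretical_se <- pop_sd / sqrt(n)   # population sd divided by sqrt(sample size)

print(c(empirical = empirical_se, theoretical = theoretical_se))
```

The two printed values should be close but not identical, since 100 samples only approximate the full sampling distribution.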

When researchers assert something they are inferring, 
they do so with the knowledge that there will be a calculated error. 
They generally use one of two cutoff points for error based upon the normal distribution, called significance levels. 
Some researchers conclude that if the event would occur by chance 5% of the time or less, then the event can be attributed to non-chance factors. 
In other settings, researchers conclude that if the event would occur by chance 1% of the time or less, then the event can be attributed to non-chance factors. 
These are the 0.05 and 0.01 significance levels. 
The problem/data domain and other factors drive the required significance levels for a particular statistical evaluation.

#### Example 

_In the case of male heights, if we choose 10 males at random, we expect their average height to fall within 69" $\pm$ 1.96 $\times$ 2.25", or between 64.59" and 73.41". 
The standard error of the mean is multiplied by 1.96 since z-scores of $\pm$1.96 encompass 95% of the normal distribution. 
Here, if we are using the 0.05 significance level, we would theoretically be correct 95% of the time. 
Also, we would expect 5% of the samples we choose to have a mean height greater than 73.41" or less than 64.59"._
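
The interval in this example can be reproduced in a few lines of R. This is a minimal sketch: the mean of 69" and the standard error of 2.25" come from the example, while the variable names are chosen here for illustration.

```r
mean_height <- 69     # mean height in inches (from the example)
se_mean     <- 2.25   # standard error of the mean in inches (from the example)
alpha       <- 0.05   # significance level

z_crit <- qnorm(1 - alpha / 2)            # two-tailed critical value, about 1.96
lower  <- mean_height - z_crit * se_mean  # lower bound of the 95% band
upper  <- mean_height + z_crit * se_mean  # upper bound of the 95% band

print(round(c(z_crit = z_crit, lower = lower, upper = upper), 2))
```

Changing `alpha` to 0.01 widens the band, because `qnorm()` then returns a larger critical value.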

When dealing with two-tailed tests, alpha (significance) levels are often expressed as confidence bands or confidence intervals. 
A 0.05 alpha level corresponds to a 95% confidence band. 
You might say that you are 95% confident in asserting your hypothesis.


In order to perform inferential statistics or parametric tests of significance, 
we'll have to use sampling distributions. 
In order to create a sampling distribution we would need to draw all possible samples of size N from a given population. 
Once we have calculated the mean of each sample, the distribution of those means is the sampling distribution of means.

Sampling distributions have 3 characteristics:
* The mean of the sampling distribution will not change with a change in sample size. If the mean from the sampling distribution of means is 20 when N=10, it will remain 20 whether you increase or decrease the size of the samples. Simply put, the mean of the sampling distribution is equal to the mean of the population.
* As the sample size in the sampling distribution of means increases, the dispersion of sample means will become less. The larger the N, the more compact the distribution of sample means. As N increases, standard error of the mean decreases.
* If the samples are drawn from a normally distributed population, the distribution of sample means will also be bell shaped.

Building on the three characteristics above, the **Central Limit Theorem** states: 
If random samples of fixed N are drawn from any population, as N becomes larger, 
the distribution of sample means approaches normality with the overall mean approaching $\mu$.
The standard error of the sample means is equal to

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$$

and the z-statistic is given by 

$$Z = \frac{\bar{X}-\mu}{\sigma_{\bar{x}}}$$

where,
* $\bar{X}$ = sample mean,
* $\mu$ = population mean and  
* $\sigma_{\bar{x}}$ = standard error of the mean  
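
The formulas above can be checked with a short simulation. The sketch below assumes an exponential population with rate 1 (so $\mu = 1$ and $\sigma = 1$), chosen here only because it is clearly not normal; the sample sizes, the number of repetitions, and the sample mean of 1.2 used for the z-statistic are all hypothetical.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

mu    <- 1   # mean of an exponential population with rate 1
sigma <- 1   # standard deviation of the same population

sample_sizes <- c(5, 30, 200)

# For each N, draw 5000 samples and summarise the distribution of their means
results <- sapply(sample_sizes, function(N) {
  means <- replicate(5000, mean(rexp(N, rate = 1)))
  c(N = N,
    mean_of_means = mean(means),
    observed_se   = sd(means),
    predicted_se  = sigma / sqrt(N))
})
print(round(t(results), 3))

# z-statistic for one hypothetical sample mean of 1.2 with N = 30
z <- (1.2 - mu) / (sigma / sqrt(30))
print(z)
```

The mean of the sample means stays near $\mu$ for every N, while the observed standard error shrinks in step with $\sigma / \sqrt{N}$.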



**Reference:** [z-score](http://www.statisticshowto.com/probability-and-statistics/z-score/)

### Hypotheses

_Null Hypothesis ($H_0$):_ 
The null hypothesis specifies values for population parameters. 
It is generally referred to as the "no significant difference" hypothesis. 
Most analyses are set up to either reject or fail to reject the null hypothesis.

_Alternate Hypothesis ($H_1$):_ 
The alternate hypothesis states that the population parameters are something other than the values hypothesized under $H_0$. 
A statement like "this class is different from other statistics classes" is an example of an alternate hypothesis. 


#### Statistical error (Type I, Alpha & Type II, Beta)

_Type I_: 
When we reject the null hypothesis even though it is really true, it is called a Type I error. 
The probability of a Type I error is equal to the alpha level set, so it is sometimes referred to as an alpha ($\alpha$) error. 
If we set the alpha level at 0.05, our chance of making a Type I error is 5%. 

_Type II_: 
When we fail to reject the null hypothesis even though it is actually false, it is called a Type II error. 
This is also called beta ($\beta$) error. 

**Note:** 
In practice, a Type II error is often more likely to be made than a Type I error. 
For a fixed sample size, the lower we set the alpha level, 
the less likely we are to make a Type I error and the more likely we are to make a Type II error. 
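
The trade-off between the two error types can be illustrated by simulation. The sketch below is not part of the lab data: it assumes a hypothetical null mean of 50, a known standard deviation of 10, samples of 25, and a true mean of 53 for the case where the null hypothesis is false.

```r
set.seed(123)  # fixed seed so the sketch is reproducible

mu0 <- 50; sigma <- 10; n <- 25   # hypothetical null mean, known sd, sample size
n_trials <- 10000                 # number of simulated experiments

# Fraction of simulated experiments in which a two-tailed z-test rejects H0
rejection_rate <- function(true_mean, alpha) {
  z_crit <- qnorm(1 - alpha / 2)
  rejected <- replicate(n_trials, {
    x <- rnorm(n, mean = true_mean, sd = sigma)
    z <- (mean(x) - mu0) / (sigma / sqrt(n))
    abs(z) > z_crit
  })
  mean(rejected)
}

type1_rate    <- rejection_rate(mu0, alpha = 0.05)     # H0 true: rejections are Type I errors
type2_rate_05 <- 1 - rejection_rate(53, alpha = 0.05)  # H0 false: non-rejections are Type II errors
type2_rate_01 <- 1 - rejection_rate(53, alpha = 0.01)  # same, with a stricter alpha

print(c(type1 = type1_rate, type2_alpha05 = type2_rate_05, type2_alpha01 = type2_rate_01))
```

The Type I rate should land near the chosen alpha, and the Type II rate should rise when alpha is lowered from 0.05 to 0.01.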

<img src="../images/hypothesis_test_results.PNG">

----
Let’s see what inferences we can draw from the body dimensions (_bdims_) data set we are going to work on. 
The dataset contains body dimensions data from 247 men and 260 women. 

Let’s say we want to check whether the variable `sex` is significant, using hypothesis testing. 
Assume that males (`sex=1`) weigh more than the average for the population.

To verify this assumption, let’s use a z-test and see if males are actually heavier than the overall population.

$H_0$: There is no significant difference between the mean weight of men and the mean weight of the overall population

$H_1$: Men weigh more, on average, than the overall population

In [None]:
download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")

A quick peek into the first few rows of data...  
Note that the weight (`wgt`) and `sex` columns are among the last three columns of the data.

In [None]:
head(bdims)
summary(bdims)

From the above result you see that every observation has 25 measurements. 
The variable names description can be found at http://www.openintro.org/stat/data/bdims.php. 
We will work with just two columns for now: weight in kg (`wgt`) and `sex` (1 indicates male, 0 indicates female).

Let's go ahead and create two different data sets: one for men and one for women.

In [None]:
male <- subset(bdims, sex == 1)
female <- subset(bdims, sex == 0)

### Calculating Z-score

A z-score measures how many standard deviations an observation lies below or above the population mean. 

Recall the descriptive statistics and dispersion concepts from the Introduction to Statistics for Analytics course (or boot camp). 
Variance is the average squared difference from the mean, 
and the standard deviation is the square root of the variance.

The code below computes some means, variances, and z-scores for the bdims data set.

In [None]:
sample_mean = mean(male$wgt)
pop_mean = mean(bdims$wgt)
pop_var = var(bdims$wgt)
print(paste("sample mean : ", sample_mean))
print(paste("population mean : ", pop_mean))
print(paste("population variance : ", pop_var))
zscore = (sample_mean - pop_mean) / sqrt(pop_var)  # sqrt(pop_var) is the standard deviation
print(paste("Z-score : ", zscore))

In [None]:
# The below function calculates the z score
#   You can write a function since R is just like any other programming language. 
#   This function evaluates the mean of the sample and the population, the standard deviation 
#    of population and calculates the z score.
z.score = function(sam, pop){
 sample_mean = mean(sam)
 pop_mean = mean(pop)
 pop_var = var(pop)
 zscore = (sample_mean - pop_mean) / (sqrt(pop_var))
 return(zscore)
}

#call function
#    We are using male weight below 
z.score(male$wgt, bdims$wgt)

The z-score is 0.67 after rounding to 2 decimals. 
Now we need to work out the percentage of the population that weighs less than the average man. 
We refer to the standard normal distribution table to find this percentage. 

![Standard Normal Distribution Table](../images/normal-table-large.png)

To read the table, we break our z-score into two parts: 0.67 = 0.6 (_tenths_) + 0.07 (_hundredths_)

The tenths component is used to find the appropriate row in the table.  
The hundredths component is used for the column. 
You then find the cell in the table for that row and column, and this represents the percentage of the population with a smaller value.

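Using the table, we can see that, assuming weights are normally distributed, about 74.86% of the population weighs less than the average man.

Note that this z-score expresses the gap between the male mean and the population mean in units of the population standard deviation. 
A formal one-sample z-test divides by the standard error $\sigma / \sqrt{N}$ instead; with N = 247 men the test statistic is roughly $0.67 \times \sqrt{247} \approx 10.5$, far beyond the one-tailed critical value of 1.645 at the 0.05 level. 
This allows us to reject the null hypothesis in favor of $H_1$ above: males tend to weigh more than the general population.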
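
A one-sample z-test that divides by the standard error can be sketched as follows. The data here are simulated stand-ins rather than the bdims measurements, the means and standard deviations are hypothetical, and `z_test` is a helper name invented for this example.

```r
set.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical stand-ins for the lab data: population weights and a subgroup sample
population <- rnorm(500, mean = 70, sd = 13)
subgroup   <- rnorm(250, mean = 78, sd = 10)

z_test <- function(sam, pop) {
  se <- sd(pop) / sqrt(length(sam))     # standard error of the mean
  z  <- (mean(sam) - mean(pop)) / se    # distance in standard errors
  p  <- pnorm(z, lower.tail = FALSE)    # one-tailed p-value: is the sample mean larger?
  list(z = z, p_value = p)
}

result <- z_test(subgroup, population)
print(result)
```

With the real data, the same function could be called as `z_test(male$wgt, bdims$wgt)`, keeping in mind that it treats the standard deviation of the full data set as the known population value.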

In [None]:
sample_mean = mean(male$wgt)
pop_mean = mean(bdims$wgt)
pop_sd = sd(bdims$wgt)

# Instead of a table, use R
# probability under normal distribution for sample measure, mean, standard deviation.
p = pnorm(sample_mean, mean=pop_mean, sd=pop_sd, lower.tail=TRUE) 

# p is a probability; multiply it by 100 to express it as a percentage
print(paste("Probability=",p))
print(mean(male$wgt))
print(mean(bdims$wgt))

### UNCOMMENT THIS LINE TO READ DOCUMENTATION on pnorm, dnorm, etc.
# help(pnorm)


### Chi-Square analysis

A nonparametric test of significance is one that makes no assumption concerning the shape of the population distribution; it is commonly referred to as a _distribution-free_ test of significance. 
Nonparametric procedures are more suitable when the data are categorical and for group comparison research.

Chi square ($\chi^2$) allows us to determine whether or not the proportion of observations 
within mutually exclusive categories differs significantly from the proportions expected by statistical chance. 

### Chi-Square Goodness of Fit

This is the one variable (univariate) case and is used to determine whether significant differences occur within a single group. 
The null hypothesis can be tested by applying the formula below. 

$$\chi^2 = \sum\frac{(f_o - f_e)^2}{f_e}$$

where,

* $f_o$ = the observed number in a given category, and
* $f_e$ = the expected number in that category

For example, a psychology department at a university has three emphasis area options available for incoming students: clinical psychology, educational psychology and counseling psychology. 
If students were to select an area of study purely at random,
probability would suggest that one-third would choose clinical psychology, 
one-third would choose educational psychology and one-third would choose counseling psychology. 
Let's say there are 100 incoming students, of whom 45 choose clinical psychology, 
30 choose educational psychology and 25 choose counseling psychology. 
Then we can construct a set of hypotheses to determine whether the selection of emphasis areas is random.
We start with our Null Hypothesis and Alternative Hypothesis.

$H_0$: No significant differences exist among students choosing academic options within psychology.

$H_1$: Significant differences exist among students choosing academic options within psychology as compared to expectations (1/3 in each category).

Test: $\chi^2$

$\alpha$: 0.05

Sampling Distribution: $\chi^2$ with K-1 = 2 degrees of freedom, where K is the number of groups


<table>
<tr>
<td>Cell 1 <br> Clinical <br> Psychology </td>
<td>Cell 2 <br> Educational <br> Psychology </td>
<td>Cell 3 <br> Counseling <br> Psychology </td>
</tr>
<tr>
<td>$f_o$=45 <br> $f_e$=33.3 </td>
<td>$f_o$=30 <br> $f_e$=33.3 </td>
<td>$f_o$=25 <br> $f_e$=33.3 </td>
</tr>
</table>

$$\chi^2 = \frac{(45 - 33.3)^2}{33.3} + \frac{(30 - 33.3)^2}{33.3} + \frac{(25 - 33.3)^2}{33.3}$$

$$ = 4.11 + 0.33 + 2.07$$

$$ = 6.51$$

Decision: 
Since the calculated $\chi^2$ value of 6.51 exceeds the table value of 5.991 (2 degrees of freedom, $\alpha$ = 0.05), we reject the null hypothesis: the choice of emphasis area does not appear to be random. 
  * [See Chi Square table here](http://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm)
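
The same goodness-of-fit test can be run with R's built-in `chisq.test()` and the critical value looked up with `qchisq()`. This is a minimal sketch using the counts from the example; the variable names are chosen here for illustration. Because R uses the exact expected count of 100/3 rather than the rounded 33.3, it reports a statistic of 6.5 instead of the hand-calculated 6.51.

```r
# Observed counts from the example above
observed   <- c(clinical = 45, educational = 30, counseling = 25)
expected_p <- rep(1 / 3, 3)   # equal proportions expected under H0

fit      <- chisq.test(observed, p = expected_p)
critical <- qchisq(0.95, df = length(observed) - 1)   # table value for alpha = 0.05, df = 2

print(fit$statistic)   # chi-square statistic
print(critical)        # about 5.991
print(fit$p.value)     # below 0.05, so H0 is rejected
```

Comparing the statistic with the critical value and comparing the p-value with $\alpha$ are two views of the same decision.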

### Chi-Square Test of Independence

The chi-square test of independence is used to assess whether two categorical variables, cross-classified in a two-way table, are related. 
It returns the p-value of the computed chi-square statistic for the corresponding degrees of freedom.

Probability close to 0: It indicates that the two categorical variables are dependent.

Probability close to 1: It indicates no evidence against the independence of the two variables.

Probability less than 0.05: It indicates that the relationship between the variables is significant at the 95% confidence level.
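
As an illustration of the input a test of independence expects, the sketch below uses `HairEyeColor`, a contingency table that ships with R, instead of the lab data. Using this data set is a choice made for this example only.

```r
# HairEyeColor is a built-in 3-way table (Hair x Eye x Sex);
# collapse it to a two-way Hair x Eye table of counts
hair_eye <- margin.table(HairEyeColor, c(1, 2))
print(hair_eye)

# Test whether hair colour and eye colour are independent
test <- chisq.test(hair_eye)
print(test)

# Expected counts under independence, for comparison with the observed table
print(round(test$expected, 1))
```

The degrees of freedom are (rows - 1) x (columns - 1), which is 9 for this 4 x 4 table.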

We will use the built-in function `chisq.test()`

In [None]:
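help(chisq.test)
# chisq.test() expects a contingency table of categorical variables, so
# continuous measurements are binned into quartiles first (one possible choice)
quartile_bins <- function(x) cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
chisq.test(table(quartile_bins(bdims$wgt), bdims$sex))

In [None]:
chisq.test(table(quartile_bins(bdims$wgt), quartile_bins(bdims$hgt)))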

Since the **p-values** in both cases are < 0.05,
we can infer that height and sex are significantly associated with weight and should be considered for our final data modeling stage.
We should perform chi-squared tests on other variables in the dataset to see whether they are 
significant and should be included in any data modeling.

### Correlation

Correlation determines the level of association between two variables. 
A scatter plot among the variables is one of the ways to find correlations between variables. 

In [None]:
plot(bdims$wgt, bdims$hgt, xlab = 'weight', ylab = 'height')

We see a positive correlation between the `hgt` and `wgt` variables. 
R has a built-in function to measure the correlation. 
Let's use the `cor.test()` function to verify that height and weight variables are correlated.

In [None]:
help(cor.test)
cor.test(bdims$wgt, bdims$hgt, method = 'pearson')

As the scatter plot suggests, the correlation test supports our assumption that height and weight are associated. 
The level of correlation is 0.717. 
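
To see what the Pearson coefficient measures, it can be computed by hand as the covariance divided by the product of the two standard deviations. The sketch below uses simulated heights and weights (hypothetical values, not the bdims data) and compares the manual result with `cor()` and `cor.test()`.

```r
set.seed(2024)  # fixed seed so the sketch is reproducible

# Simulated stand-ins for height (cm) and weight (kg); values are hypothetical
height <- rnorm(200, mean = 170, sd = 9)
weight <- 0.9 * height - 85 + rnorm(200, sd = 8)

r_builtin <- cor(weight, height)                               # Pearson by default
r_manual  <- cov(weight, height) / (sd(weight) * sd(height))   # definition of Pearson's r
r_test    <- cor.test(weight, height, method = "pearson")$estimate

print(c(builtin = r_builtin, manual = r_manual, from_test = unname(r_test)))
```

All three values agree, since `cor.test()` reports the same coefficient and adds a significance test on top of it.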

You can perform tests on other variables in the dataset and similarly find associations among other variables. 
Variables that are highly correlated with each other, e.g., 0.99 correlation, add little additional information to a predictive model. 
Therefore, when you have two independent variables that are highly correlated, you can usually drop one of them from your final model input design.

### This concludes the whirlwind review of descriptive and inferential statistics