## 2. Contingency tables

We have already seen some contingency tables and their uses in calculating different types of probabilities. Let's explore code to produce and analyze contingency tables in R.



First, let's define joint, marginal, and conditional probabilities:

### Joint probability: the chance of two events occurring at the same time, written $P(A \cap B)$ or P(A and B), i.e., the intersection of the two events. The events may or may not be mutually exclusive; if they are, the joint probability is 0.

### Marginal probability: the probability of a single event occurring, regardless of the other variable. It can be calculated by adding all the joint probabilities that involve that event.

### Conditional probability: as we have seen, the probability of event A occurring given that event B has occurred.

The equation: $$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$
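
To make the three definitions concrete, here is a minimal sketch in base R using a fair six-sided die (an illustrative assumption, unrelated to the dataset used later). Event A is "the roll is even" and event B is "the roll is greater than 3"; all variable names are made up for this example.

```r
# Sample space of a fair six-sided die: every outcome has probability 1/6
outcomes <- 1:6

# Logical vectors marking which outcomes belong to each event
A <- outcomes %% 2 == 0   # even roll: 2, 4, 6
B <- outcomes > 3         # roll greater than 3: 4, 5, 6

# Joint probability P(A and B): share of outcomes in both events (4 and 6)
p_joint <- mean(A & B)

# Marginal probability P(B), computed directly and also by summing
# the joint probabilities of B with A and of B with not-A
p_B <- mean(B)
p_B_from_joints <- mean(A & B) + mean(!A & B)

# Conditional probability P(A | B) = P(A and B) / P(B)
p_A_given_B <- p_joint / p_B
p_A_given_B
```

Two of the three outcomes in B are even, so the conditional probability is 2/3.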

Let's create a contingency table in R. For this we will use a different way of writing R code, based on the dplyr package, which is very useful for data analysis. Here is the introductory vignette:

[https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html)

This exercise was modified from [http://tinyheero.github.io/2016/03/20/basic-prob.html](http://tinyheero.github.io/2016/03/20/basic-prob.html)

For other examples, see
[https://www.youtube.com/watch?time_continue=236&v=pIfpHdGVwLU](https://www.youtube.com/watch?time_continue=236&v=pIfpHdGVwLU)

### Remember, it is extremely important to know the direction of a conditional probability. P(A|B), e.g. P(it is going to rain given that it is spring), is not the same as P(B|A), e.g. P(it is spring given that it is going to rain).

Always ask: which variable is hidden? Which is our prior information (the observed variable)? What is the direction of the probability?
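
A small worked example makes the direction issue concrete. It assumes a fair six-sided die (not part of the BCHI data), with A = "the roll is even" and B = "the roll is greater than 4", so that "A and B" contains only the outcome 6:

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{1/6}{2/6} = \frac{1}{2}, \qquad P(B|A) = \frac{P(A \text{ and } B)}{P(A)} = \frac{1/6}{3/6} = \frac{1}{3}$$

The joint probability in the numerator is the same in both cases; only the denominator (the event we condition on) changes.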

### Let's use the attached dataset BCHI2.csv to explore contingency tables and several probability calculations.

## The Big Cities Health Coalition maintains a set of databases on multiple issues related to health and safety in the country's largest metropolitan areas. See [http://www.bigcitieshealth.org/](http://www.bigcitieshealth.org/).

### Today we will be using a dataset that explores the major components of opioid addiction and unintentional opioid overdoses. It contains more than 50 health status indicators, death rates, and socio-economic and demographic factors, and it covers multiple years.

In [1]:
library("ggplot2")
library("dplyr")
library("reshape2")
library("knitr")


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [2]:
BCHI = read.csv(file = "BCHI2.csv")

In [3]:
# Count records for each Sex-Region combination,
# excluding the aggregated "Both" category
BCHI.sex.region.df <-
  BCHI %>%
  filter(Sex != "Both") %>%
  group_by(Sex, Region) %>%
  summarize(n = n())

In [6]:
df <- BCHI.sex.region.df %>%
  dcast(Sex ~ Region, value.var = "n") %>%
  kable(align = "l", format = "markdown",
        table.attr='class="table table-striped table-hover"')



In [7]:
df



|Sex    |East |West |
|:------|:----|:----|
|Female |1580 |1471 |
|Male   |1137 |1056 |
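
As a cross-check, base R can build the same kind of table directly with `table()`. The sketch below uses a tiny made-up data frame (`toy`, with hypothetical values) rather than BCHI2.csv, so that it runs on its own; only the column names mirror the ones used above.

```r
# Hypothetical toy records; these values are NOT from BCHI2.csv
toy <- data.frame(
  Sex    = c("Female", "Female", "Male", "Male", "Female", "Both"),
  Region = c("East",   "West",   "East", "West", "East",   "East"),
  stringsAsFactors = FALSE
)

# Drop the aggregated "Both" rows, as the dplyr pipeline does
toy <- toy[toy$Sex != "Both", ]

# Cross-tabulate: rows are Sex, columns are Region, cells are counts
counts <- table(Sex = toy$Sex, Region = toy$Region)
counts
```

The dplyr and `dcast()` route used in this exercise returns a data frame, which is easier to keep piping into further steps; `table()` is handy for a quick look.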


## Now let's calculate the joint probabilities:
A joint probability is the probability of two different events occurring at the same time.

In [8]:
BCHI.sex.region.prop.df <- 
  BCHI.sex.region.df %>%
  ungroup() %>%
  mutate(prop = n / sum(n))

BCHI.sex.region.prop.df %>%
  dcast(Sex ~ Region, value.var = "prop") %>%
  kable(align = "l", format = "markdown", 
        table.attr = 'class="table table-striped table-hover"')



|Sex    |East      |West      |
|:------|:---------|:---------|
|Female |0.3012967 |0.2805111 |
|Male   |0.2168192 |0.2013730 |

Each joint probability is the number of times a specific Sex-Region combination occurs divided by the total count across all Sex-Region combinations (i.e., its relative frequency).
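
For instance, using the counts from the contingency table above, the Female-East cell works out as:

$$P(\text{Female}, \text{East}) = \frac{1580}{1580 + 1471 + 1137 + 1056} = \frac{1580}{5244} \approx 0.3013$$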

### What can we say about this trend?

## How about the marginal probabilities?
Marginal probabilities focus on single events. To calculate a marginal probability, we hold constant the value of the random variable we are interested in (for example, Sex = Male) and sum its joint probabilities over all values of the other random variable (Region).

In [9]:
Sex.marginal.df <- 
  BCHI.sex.region.prop.df %>%
  group_by(Sex) %>%
  summarize(marginal = sum(prop))

Region.marginal.df <- 
  BCHI.sex.region.prop.df %>%
  group_by(Region) %>%
  summarize(marginal = sum(prop))

In [10]:
BCHI.sex.region.prop.df %>%
  dcast(Sex ~ Region, value.var = "prop") %>%
  left_join(Sex.marginal.df, by = "Sex") %>%
  bind_rows(
    Region.marginal.df %>%
      mutate(Sex = "marginal") %>%
      dcast(Sex ~ Region, value.var = "marginal")
  ) %>%
  kable(align = "l", format = "markdown",
        table.attr = 'class="table table-striped table-hover"')

“binding character and factor vector, coercing into character vector”



|Sex      |East      |West      |marginal  |
|:--------|:---------|:---------|:---------|
|Female   |0.3012967 |0.2805111 |0.5818078 |
|Male     |0.2168192 |0.2013730 |0.4181922 |
|marginal |0.5181159 |0.4818841 |NA        |
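
Base R offers a compact alternative for the same bookkeeping. In this sketch the counts are typed in by hand from the contingency table shown earlier (the object names are assumptions of the example): `prop.table()` turns them into joint probabilities, and `addmargins()` appends the row and column sums.

```r
# Counts copied by hand from the Sex x Region contingency table
counts <- matrix(
  c(1580, 1471,
    1137, 1056),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("Female", "Male"), Region = c("East", "West"))
)

# Joint probabilities: each cell divided by the grand total
joint <- prop.table(counts)

# Marginals: row sums give P(Sex), column sums give P(Region)
sex_marginal    <- rowSums(joint)
region_marginal <- colSums(joint)

# Same table with a "Sum" row and a "Sum" column appended
round(addmargins(joint), 4)
```

Here the bottom-right cell holds the grand total of the joint probabilities, 1, where the dplyr version shows `NA`.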


### Conditional Probabilities

Let's calculate the conditional probability that a record is Female given that it is from the East region. Let's review the equation:

$$P(\text{Female} \mid \text{East}) = \frac{P(\text{Female}, \text{East})}{P(\text{East})}$$

P(Female, East) is the joint probability, and P(East) is the marginal probability of the East region (summed over both sexes).

In [11]:
joint.prob <- 
  BCHI.sex.region.prop.df %>%
  filter(Sex == "Female", Region == "East") %>%
  .$prop

marg.prob <- 
  Region.marginal.df %>%
  filter(Region == "East") %>%
  .$marginal

cond.prob <- joint.prob / marg.prob
cond.prob
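
The same conditional probability can be cross-checked in base R with `prop.table()` and its `margin` argument. This is a sketch with the counts entered by hand from the contingency table above; `counts` and the other object names are chosen for the example.

```r
# Counts copied by hand from the Sex x Region contingency table
counts <- matrix(
  c(1580, 1471,
    1137, 1056),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("Female", "Male"), Region = c("East", "West"))
)

# margin = 2 divides each column by its total: P(Sex | Region)
sex_given_region <- prop.table(counts, margin = 2)
sex_given_region["Female", "East"]   # 1580 / (1580 + 1137)

# margin = 1 divides each row by its total: P(Region | Sex),
# the opposite direction, which gives a different number
region_given_sex <- prop.table(counts, margin = 1)
region_given_sex["Female", "East"]   # 1580 / (1580 + 1471)
```

The first value, about 0.58, should agree with `cond.prob`. The second answers a different question, P(East | Female), which is why the direction of the conditioning matters.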