### Unit I. Regressions and dimensionality reduction.

## Independence of variables and measures of association.

- Joint distribution of random variables.
- Contingency tables.
- Concept of independence.

The [**joint distribution**](https://en.wikipedia.org/wiki/Joint_probability_distribution) (or *multivariate* distribution; *bivariate* when there are exactly two variables) of two or more random variables is the distribution of the **intersection** of the events they define, that is, the probability that all the variables take given values simultaneously. For two random variables $X$ and $Y$:

$$P(X=x, Y=y) = P(X=x|Y=y) \cdot P(Y=y) = P(Y=y|X=x) \cdot P(X=x)$$

where $P(x|y)$ is the [**conditional probability**](https://en.wikipedia.org/wiki/Conditional_probability) of $x$ given $y$.  
The joint probabilities sum to one over all pairs of values:  
$$\sum_{i}\sum_{j} P(X=x_{i}, Y=y_{j}) = 1$$

If $X$ and $Y$ are <a href="https://en.wikipedia.org/wiki/Independence_(probability_theory)">**independent** variables</a>, then:

$$P(X=x, Y=y) = P(X=x) \cdot P(Y=y)$$
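As a minimal sketch of how the product rule can be checked numerically, the Julia snippet below compares a joint probability table with the outer product of its marginals. The function name `is_independent`, the tolerance and both example tables are hypothetical, chosen only for illustration.

```julia
# Hypothetical helper: true when P(x, y) ≈ P(x) * P(y) holds in every cell.
function is_independent(P::AbstractMatrix{<:Real}; atol::Real = 1e-9)
    px = sum(P, dims = 2)   # 2-D column of marginals P(X = x)
    py = sum(P, dims = 1)   # 2-D row of marginals P(Y = y)
    return all(isapprox.(P, px * py; atol = atol))
end

# Built as an outer product of marginals, so independent by construction.
P_indep = [0.3, 0.7] * [0.4 0.6]

# Same marginals, but the mass is rearranged so that X and Y are associated.
P_dep = [0.25 0.05;
         0.15 0.55]

println(is_independent(P_indep))
println(is_independent(P_dep))
```

With real data the estimated probabilities are never exactly equal to the product of the marginals, so a fixed tolerance is only a didactic device; a statistical test is the usual way to judge the discrepancy.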

The definitions above are stated for two categorical (discrete) variables; the same idea extends to continuous variables, for which the joint probability density function (*PDF*) is defined as:


$$PDF_{X,Y}(x,y) = PDF_{X|Y}(x|y) \cdot PDF_{Y}(y) = PDF_{Y|X}(y|x) \cdot PDF_{X}(x)$$

where $PDF_{X}(x)$ is the marginal probability density function, or [**marginal distribution**](https://en.wikipedia.org/wiki/Marginal_distribution), of $X$, obtained by integrating the joint density over $y$:

$$PDF_{X}(x) = \int_{y} PDF_{X,Y}(x,y) dy = \int_{y} PDF_{X|Y}(x|y) \cdot PDF_{Y}(y) dy$$

For categorical variables, the integral becomes a sum:

$$ P(X=x) = \sum_{y} P(X=x, Y=y) = \sum_{y} P(X=x|Y=y) \cdot P(Y=y)$$
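As a worked example with hypothetical numbers, take two binary variables with joint probabilities $P(X=0,Y=0)=0.1$, $P(X=0,Y=1)=0.3$, $P(X=1,Y=0)=0.2$ and $P(X=1,Y=1)=0.4$. The marginal and conditional probabilities then follow directly from the definitions:

$$
\begin{aligned}
P(X=0) &= P(X=0,Y=0) + P(X=0,Y=1) = 0.1 + 0.3 = 0.4 \\
P(Y=0) &= P(X=0,Y=0) + P(X=1,Y=0) = 0.1 + 0.2 = 0.3 \\
P(X=0|Y=0) &= \frac{P(X=0,Y=0)}{P(Y=0)} = \frac{0.1}{0.3} = \frac{1}{3}
\end{aligned}
$$

In this made-up table $P(X=0) \cdot P(Y=0) = 0.12 \neq 0.1 = P(X=0,Y=0)$, so the two variables are not independent.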

The joint cumulative distribution function (*CDF*) of two variables is defined as:

$$CDF_{X,Y}(x,y) = P(X \le x, Y \le y)$$
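For discrete variables whose values can be ordered, the joint CDF can be obtained from the joint probability table by accumulating along both dimensions. The sketch below assumes a hypothetical $2 \times 2$ table whose rows and columns are already sorted by increasing value.

```julia
# Hypothetical joint pmf: rows index ordered values x1 < x2, columns y1 < y2.
P = [0.1 0.3;
     0.2 0.4]

# Accumulate down the rows, then across the columns, so that
# F[i, j] = P(X <= x_i, Y <= y_j).
F = cumsum(cumsum(P, dims = 1), dims = 2)

println(F)
```

The bottom-right entry of `F` accumulates the whole table, so it should equal 1 up to floating-point error.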

### Contingency Tables

[**Contingency tables**](https://en.wikipedia.org/wiki/Contingency_table) show the number of observations for every combination of levels of a given set of categorical variables. For example, suppose we have three categorical variables $X$, $Y$ and $Z$ with $i$, $j$ and $k$ levels respectively. The contingency table that cross-classifies $X$ and $Y$ is an $i \times j$ table (*two-way table*), and the table that also includes $Z$ is an $i \times j \times k$ table (*three-way contingency table*).
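To make the construction concrete, the following sketch builds a two-way table by hand using only Base Julia. The function `crosstab` and the six toy observations are hypothetical; they are not taken from any dataset.

```julia
# Hypothetical helper: cross-classify two equally long categorical vectors.
function crosstab(x::AbstractVector, y::AbstractVector)
    length(x) == length(y) || throw(ArgumentError("x and y must have the same length"))
    xlevels = sort(unique(x))
    ylevels = sort(unique(y))
    # Map each level to its row or column position.
    xindex = Dict(level => i for (i, level) in enumerate(xlevels))
    yindex = Dict(level => j for (j, level) in enumerate(ylevels))
    counts = zeros(Int, length(xlevels), length(ylevels))
    for (xi, yi) in zip(x, y)
        counts[xindex[xi], yindex[yi]] += 1
    end
    return xlevels, ylevels, counts
end

# Toy data, invented for illustration only.
sex  = ["Female", "Male", "Male", "Female", "Male", "Female"]
hand = ["Right", "Left", "Right", "Right", "Right", "Left"]

xlevels, ylevels, counts = crosstab(sex, hand)
println(xlevels, " x ", ylevels)
println(counts)
```

A variable with $i$ levels crossed with one with $j$ levels yields an $i \times j$ matrix of counts, here $2 \times 2$.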

In [1]:
using RDatasets
survey = dataset("MASS", "survey")

first(survey, 6)  # first six rows; older DataFrames versions used head(survey)

| Row | Sex | WrHnd | NWHnd | WHnd | Fold | Pulse | Clap | Exer | Smoke | Height | MI | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Female | 18.5 | 18.0 | Right | R on L | 92.0 | Left | Some | Never | 173.0 | Metric | 18.25 |
| 2 | Male | 19.5 | 20.5 | Left | R on L | 104.0 | Left | | Regul | 177.8 | Imperial | 17.583 |
| 3 | Male | 18.0 | 13.3 | Right | L on R | 87.0 | Neither | | Occas | | | 16.917 |
| 4 | Male | 18.8 | 18.9 | Right | R on L | | Neither | | Never | 160.0 | Metric | 20.333 |
| 5 | Male | 20.0 | 20.0 | Right | Neither | 35.0 | Right | Some | Never | 165.0 | Metric | 23.667 |
| 6 | Female | 18.0 | 17.7 | Right | L on R | 64.0 | Right | Some | Never | 172.72 | Imperial | 21.0 |


In [7]:
using FreqTables;

In [8]:
freqtable(survey, :Sex)

2-element NamedArray{Int64,1}:
Sex    │ 
───────┼────
Female │ 118
Male   │ 118

In [9]:
freqtable(survey, :WHnd)

2-element NamedArray{Int64,1}:
WHnd  │ 
──────┼────
Left  │  18
Right │ 218

In [22]:
# N_xy: joint counts for each (Sex, WHnd) combination
conteos = freqtable(survey, :Sex, :WHnd)

2x2 NamedArray{Int64,2}:
Sex ╲ WHnd │  Left  Right
───────────┼─────────────
Female     │     7    110
Male       │    10    108

In [23]:
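# N_x.: row marginals (count per Sex, summed over WHnd)
marginal_Sex = sum(conteos, dims=2)  # Julia >= 0.7; older versions used sum(conteos, 2)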

2x1 NamedArray{Int64,2}:
Sex ╲ WHnd │ sum(WHnd)
───────────┼──────────
Female     │       117
Male       │       118

In [24]:
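# N_.y: column marginals (count per WHnd, summed over Sex)
marginal_WHnd = sum(conteos, dims=1)  # Julia >= 0.7; older versions used sum(conteos, 1)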

1x2 NamedArray{Int64,2}:
Sex ╲ WHnd │  Left  Right
───────────┼─────────────
sum(Sex)   │    17    218

In [25]:
# N: total number of observations in the table
total = sum(conteos)

235

$$ P(\text{Right},\text{Female}) = \frac{N_{\text{Right},\text{Female}}}{N}$$

In [26]:
# P(x,y): joint probabilities, each count divided by the grand total
probabilidades = conteos ./ total

2x2 NamedArray{Float64,2}:
Sex ╲ WHnd │      Left      Right
───────────┼─────────────────────
Female     │ 0.0297872   0.468085
Male       │ 0.0425532   0.459574

In [27]:
sum(probabilidades)

1.0

$$ P(\text{Right}|\text{Female}) = \frac{N_{\text{Right},\text{Female}}}{N_{\text{Female}}}$$

In [31]:
# P(y|x): each row of counts divided by its row total
condicionales_Sex = conteos ./ marginal_Sex



2x2 Array{Float64,2}:
 0.0598291  0.940171
 0.0847458  0.915254

In [33]:
sum(condicionales_Sex, dims=2)  # each row of P(y|x) should sum to 1

2x1 Array{Float64,2}:
 1.0
 1.0
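The same counts can also be conditioned on handedness instead of sex. The sketch below uses plain matrices with the counts copied from the Sex × WHnd table above, so it needs no packages; the variable names are hypothetical. It also checks that both factorizations of the joint distribution, $P(y|x) \cdot P(x)$ and $P(x|y) \cdot P(y)$, recover the same joint table.

```julia
# Counts copied from the Sex x WHnd table above
# (rows: Female, Male; columns: Left, Right).
counts_sex_hand = [ 7 110;
                   10 108]

joint  = counts_sex_hand ./ sum(counts_sex_hand)   # P(x, y)
p_sex  = sum(joint, dims = 2)                      # P(x), 2x1
p_hand = sum(joint, dims = 1)                      # P(y), 1x2

hand_given_sex = joint ./ p_sex     # P(y | x): every row sums to 1
sex_given_hand = joint ./ p_hand    # P(x | y): every column sums to 1

# Both factorizations should reproduce the joint table.
println(isapprox(hand_given_sex .* p_sex, joint))
println(isapprox(sex_given_hand .* p_hand, joint))
println(sum(sex_given_hand, dims = 1))
```

Dividing by the row totals answers "given the sex, how likely is each writing hand?", while dividing by the column totals answers "given the writing hand, how likely is each sex?".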

### Pearson's Chi-square Test

In [10]:
tabla = freqtable(survey, :Smoke, :Exer)

4x3 NamedArray{Int64,2}:
Smoke ╲ Exer │ Freq  None  Some
─────────────┼─────────────────
Heavy        │    7     1     3
Never        │   87    18    84
Occas        │   12     3     4
Regul        │    9     1     7

In [11]:
using HypothesisTests;



In [12]:
ChisqTest(tabla)

Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.022712582591209424,0.3902434645216892,0.039230824475725366,0.03510126400459638,0.004542516518241884,0.07804869290433784,0.007846164895145074,0.0070202528009192765,0.019355070382074117,0.33255530020109164,0.03343148520540075,0.02991238149956909]
    point estimate:          [0.029661016949152543,0.3686440677966102,0.05084745762711865,0.038135593220338986,0.00423728813559322,0.07627118644067797,0.012711864406779662,0.00423728813559322,0.012711864406779662,0.3559322033898305,0.01694915254237288,0.029661016949152543]
    95% confidence interval: [(0.0,0.1779660476915742),(0.21186440677966104,0.5169490985390318),(0.0,0.1991524883695403),(0.0,0.18644062396276065),(0.0,0.15254231887801487),(0.0,0.2245762171830996),(0.0,0.1610168951492013),(0.0,0.15254231887801487),(0.0,0.1610168951492013),(0.19915254237288135,0.5042372341322521),(0.0,0.

In [27]:
using RCall

In [28]:
R"chisq.test($tabla)"

  Chi-squared approximation may be incorrect


RCall.RObject{RCall.VecSxp}

	Pearson's Chi-squared test

data:  ##RCall##11004
X-squared = 5.4885, df = 6, p-value = 0.4828
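To show where the statistic comes from, the sketch below recomputes it by hand from the Smoke × Exer counts printed above, using only Base Julia. Under independence the expected count of a cell is $E_{ij} = N_{i\cdot} N_{\cdot j} / N$, and the statistic is $\sum_{ij} (O_{ij} - E_{ij})^2 / E_{ij}$ with $(r-1)(c-1)$ degrees of freedom. The variable names are hypothetical, and the p-value is left out because it would need a chi-square distribution from an extra package.

```julia
# Observed counts copied from the Smoke x Exer table above
# (rows: Heavy, Never, Occas, Regul; columns: Freq, None, Some).
observed = [ 7  1  3;
            87 18 84;
            12  3  4;
             9  1  7]

n = sum(observed)
row_totals = sum(observed, dims = 2)   # 4x1
col_totals = sum(observed, dims = 1)   # 1x3

# Expected counts under independence: E_ij = N_i. * N_.j / N
expected = (row_totals * col_totals) ./ n

chi2 = sum((observed .- expected) .^ 2 ./ expected)
df = (size(observed, 1) - 1) * (size(observed, 2) - 1)

println("X-squared = ", round(chi2, digits = 4), ", df = ", df)
println("cells with expected count below 5: ", count(<(5), expected))
```

Several expected counts in this table fall below 5, which is the usual reason R prints the "Chi-squared approximation may be incorrect" warning shown above.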



### Fisher's Exact Test 

In [29]:
tabla = freqtable(survey, :Sex, :WHnd)

2x2 NamedArray{Int64,2}:
Sex ╲ WHnd │  Left  Right
───────────┼─────────────
Female     │     7    110
Male       │    10    108

In [33]:
FisherExactTest(tabla'...)

Fisher's exact test
-------------------
Population details:
    parameter of interest:   Odds ratio
    value under h_0:         1.0
    point estimate:          0.6883618650408287
    95% confidence interval: (0.2140339458713404,2.0876577726326677)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.6287994157715476 (not significant)

Details:
    contingency table:
         7  110
        10  108


In [32]:
R"fisher.test($tabla)"

RCall.RObject{RCall.VecSxp}

	Fisher's Exact Test for Count Data

data:  ##RCall##11378
p-value = 0.6158
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.2140244 2.0876656
sample estimates:
odds ratio 
 0.6883665 
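The two outputs above report different two-sided p-values (0.6288 from Julia, 0.6158 from R). One plausible explanation is that they use different two-sided conventions: doubling the smaller one-sided tail ("central") versus summing all tables that are no more probable than the observed one ("minimum likelihood"). The sketch below computes both by hand from the hypergeometric distribution, using only Base Julia; all names are hypothetical, and which convention each library uses by default is an assumption to check against its documentation.

```julia
# Counts copied from the Sex x WHnd table above.
a, b = 7, 110    # Female: Left, Right
c, d = 10, 108   # Male:   Left, Right

n = a + b + c + d
row1 = a + b     # number of Female respondents
col1 = a + c     # number of left-handed respondents

# Hypergeometric probability of k left-handed women with all margins fixed.
# BigInt binomials avoid overflow; the exact ratio is converted to Float64 last.
hyper(k) = Float64(binomial(big(col1), k) * binomial(big(n - col1), row1 - k) //
                   binomial(big(n), row1))

ks = max(0, row1 - (n - col1)):min(col1, row1)   # feasible values of the top-left cell
probs = [hyper(k) for k in ks]
p_obs = hyper(a)

# "Minimum likelihood": add up every table no more probable than the observed one.
p_minlike = sum(p for p in probs if p <= p_obs * (1 + 1e-7))

# "Central": twice the smaller one-sided tail, capped at 1.
lower = sum(probs[i] for i in eachindex(probs) if ks[i] <= a)
upper = sum(probs[i] for i in eachindex(probs) if ks[i] >= a)
p_central = min(1.0, 2 * min(lower, upper))

println("minimum-likelihood p-value: ", round(p_minlike, digits = 4))
println("central p-value:            ", round(p_central, digits = 4))
```

Under these assumptions the minimum-likelihood value comes out at about 0.6158 and the central value at about 0.6288, in line with the R and Julia outputs respectively. Either way the conclusion is the same: the data give no evidence against independence of sex and writing hand.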

