## Probability

* $\Omega$: sample space
* $\omega$: outcome
* $A$: event (subset of $\Omega$)
* $|A|$: number of points in $A$
* $A^c$: complement of $A$ **not A**
* $A \cup B$: union **A or B**
* $A \cap B$ or $AB$: intersection **A and B**
* $A - B$: set difference **A and not B**
* $A \subset B$: inclusion **A is a subset of B**
* $A \subseteq B$: inclusion **A is a subset of or equal to B**
* $\emptyset$: null event, empty set
* $\mathbb{P}$: probability

### Probability axioms

- $\mathbb{P}(A) \geq 0$ for every $A$
- $\mathbb{P}(\Omega) = 1$
- If $A_1, A_2, \ldots$ are pairwise disjoint, then $\mathbb{P}(\cup_{i=1}^{\infty}A_i) = \sum_{i=1}^{\infty}\mathbb{P}(A_i)$


#### Properties derived from the axioms

- $\mathbb{P}(\emptyset) = 0$
- $A \subset B \rightarrow \mathbb{P}(A) \leq \mathbb{P}(B)$
- $0 \leq \mathbb{P}(A) \leq 1$
- $\mathbb{P}(A^c) = 1 - \mathbb{P}(A)$
- $A \cap B = \emptyset \rightarrow \mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B)$
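As a short sketch of how two of these properties follow from the axioms, assuming only additivity applied to two disjoint events:

$$
\begin{aligned}
\Omega = A \cup A^c,\ A \cap A^c = \emptyset &\;\Rightarrow\; 1 = \mathbb{P}(\Omega) = \mathbb{P}(A) + \mathbb{P}(A^c) \;\Rightarrow\; \mathbb{P}(A^c) = 1 - \mathbb{P}(A) \\
A \subset B,\ B = A \cup (B - A) &\;\Rightarrow\; \mathbb{P}(B) = \mathbb{P}(A) + \mathbb{P}(B - A) \geq \mathbb{P}(A)
\end{aligned}
$$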

### Probability on finite sample spaces

If $\Omega$ is finite and every outcome is equally likely, $\mathbb{P}(A) = \frac{|A|}{|\Omega|}$, where  
- $|A|$: number of points in $A$
- $|\Omega|$: number of points in the entire sample space
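The counting formula can be tried on a tiny example before moving to the survey data. The sketch below assumes a fair six-sided die; the helper `prob` and the event names are hypothetical and only serve as illustration.

```python
from fractions import Fraction

def prob(event, omega):
    """P(A) = |A| / |Omega|, assuming equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

# Sample space of one fair die (an assumption for this example)
omega = frozenset(range(1, 7))
even = frozenset({2, 4, 6})
at_least_five = frozenset({5, 6})

p_even = prob(even, omega)                   # 3 of 6 outcomes
p_union = prob(even | at_least_five, omega)  # {2, 4, 5, 6}: 4 of 6 outcomes
p_complement = prob(omega - even, omega)     # odd outcomes

print(p_even, p_union, p_complement)
```

Using `Fraction` keeps the results exact, so properties such as $\mathbb{P}(A^c) = 1 - \mathbb{P}(A)$ can be checked with plain equality.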

In [120]:
import pandas as pd

In [121]:
df = pd.read_csv('sysarmy_tech_salary_survey_2022.csv')

In [122]:
df.head()

Unnamed: 0,work_province,work_dedication,work_contract_type,monthly_gross_salary,monthly_net_salary,numero,salary_in_usd,salary_last_dollar_value,salary_pay_cripto,crypto_salary_percentage,...,profile_studies_level_state,profile_career,profile_university,profile_boot_camp,profile_boot_camp_carrer,work_on_call_duty,salary_on_call_duty_charge,work_on_call_duty_charge_type,profile_age,profile_gender
0,Catamarca,Full-Time,Staff (planta permanente),300000.0,245000.0,True,,,,,...,Completo,Licenciatura en redes y comunicación de datos,UP - Universidad de Palermo,,,,,,35,varon cis
1,Chaco,Full-Time,Remoto (empresa de otro país),900000.0,850000.0,True,Cobro todo el salario en dólares,300.0,,,...,,,,,,,,,31,varon cis
2,Chaco,Full-Time,Staff (planta permanente),120000.0,115000.0,True,,,,,...,,,,,,,,,27,varon cis
3,Chaco,Full-Time,Remoto (empresa de otro país),440000.0,0.0,True,Cobro todo el salario en dólares,220.0,Cobro todo el salario criptomonedas,100%,...,,,,,,,,,21,varon cis
4,Chaco,Full-Time,Staff (planta permanente),140000.0,125000.0,True,,,,,...,,,,,,,,,32,varon cis


In [123]:
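# Note: df.size counts cells (rows * columns), not rows; len(df) is the row count.
# A ratio of two .size values taken on frames with the same columns still equals the ratio of row counts.
n_rows = df.size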

- What is the probability that a salary is above the average?

In [124]:
avg_salary = df.monthly_net_salary.mean()

above_avg_salary_mask = df.monthly_net_salary > avg_salary
above_avg_salary_applied = df[above_avg_salary_mask]

In [125]:
round(avg_salary, 2)

277010.79

In [126]:
above_avg_salary_applied

Unnamed: 0,work_province,work_dedication,work_contract_type,monthly_gross_salary,monthly_net_salary,numero,salary_in_usd,salary_last_dollar_value,salary_pay_cripto,crypto_salary_percentage,...,profile_studies_level_state,profile_career,profile_university,profile_boot_camp,profile_boot_camp_carrer,work_on_call_duty,salary_on_call_duty_charge,work_on_call_duty_charge_type,profile_age,profile_gender
1,Chaco,Full-Time,Remoto (empresa de otro país),900000.0,850000.0,True,Cobro todo el salario en dólares,300,,,...,,,,,,,,,31,varon cis
5,Chaco,Full-Time,Staff (planta permanente),633000.0,395000.0,True,Cobro parte del salario en dólares,,,,...,Completo,Licenciatura en Sistemas de Información,UNNE - Universidad Nacional Del Nordeste,,,,,,31,varon cis
12,Chaco,Full-Time,Staff (planta permanente),329850.0,315000.0,True,Cobro parte del salario en dólares,,,,...,Completo,Ingeniería en Sistemas de Información,UTN - Universidad Tecnológica Nacional,,,No,0.0,Porcentaje de mi sueldo bruto,45,varon cis
13,Chaco,Full-Time,Remoto (empresa de otro país),625000.0,625000.0,True,Cobro todo el salario en dólares,250,,,...,En curso,Ingeniería en Sistemas de Información,UTN - Universidad Tecnológica Nacional,,,,,,23,varon cis
15,Chaco,Full-Time,Staff (planta permanente),380000.0,317000.0,True,Cobro todo el salario en dólares,,,,...,En curso,Ingeniería en Sistemas de Información,UTN - Universidad Tecnológica Nacional,,,,,,29,varon cis
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5331,Tucumán,Full-Time,Staff (planta permanente),396000.0,290000.0,True,,,,,...,Incompleto,Licenciatura en Informática,universidad nacional de tucuman,,,No,0.0,Neto,36,varon cis
5337,Tucumán,Full-Time,Staff (planta permanente),546000.0,384000.0,True,Cobro parte del salario en dólares,,,,...,Completo,Analista de Sistemas,,,,"Sí, activa",0.0,Bruto,35,varon cis
5344,Tucumán,Full-Time,Staff (planta permanente),465000.0,340000.0,True,,,,,...,,,,,,,,,32,varon cis
5348,Tucumán,Full-Time,Remoto (empresa de otro país),300000.0,294000.0,True,Cobro todo el salario en dólares,,,,...,Incompleto,Licenciatura en Psicologia,UNT - Universidad Nacion de Tucuman,,,No,0.0,Porcentaje de mi sueldo bruto,30,varon cis


$\mathbb{P}(A_a) = \frac{|A_a|}{|\Omega|}$, where  
- $|A_a|$: number of rows whose net salary is above $avg\_salary$
- $|\Omega|$: number of rows in the entire sample space

In [127]:
prob_salary_above_avg = above_avg_salary_applied.size / n_rows
prob_salary_above_avg

0.28368794326241137

In [128]:
round(prob_salary_above_avg * 100, 2)

28.37
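An equivalent and shorter way to get this kind of ratio is to take the mean of the boolean mask, since `True` counts as 1. The sketch below uses a small made-up DataFrame (the columns `salary` and `years` and all values are hypothetical, not taken from the survey) and also shows the difference between `len(df)` and `df.size`.

```python
import pandas as pd

# Hypothetical toy data, not from the survey
toy = pd.DataFrame({
    "salary": [100, 200, 300, 400, 1000],
    "years": [1, 3, 6, 8, 10],
})

# The toy mean salary is 400, so only the 1000 row is above it
above_avg_mask = toy["salary"] > toy["salary"].mean()

n_rows = len(toy)    # number of rows
n_cells = toy.size   # rows * columns

prob_by_count = int(above_avg_mask.sum()) / n_rows  # |A| / |Omega|
prob_by_mean = float(above_avg_mask.mean())         # same ratio via the mask mean

print(n_rows, n_cells, prob_by_count, prob_by_mean)
```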

## Conditional probability

The conditional probability of $A$ given $B$, defined when $\mathbb{P}(B) > 0$, is:

$\mathbb{P}(A | B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$

![](https://i.imgur.com/FKNyeqX.png)

- What is the probability that a salary is above the average, given *more than 5 years of experience*?

- $A: earnings > avg\_salary$
- $B: years\_of\_exp > 5$

In [129]:
more_than_5_exp_mask = df.profile_years_experience > 5
more_than_5_exp_applied = df[more_than_5_exp_mask]
more_than_5_exp_n_rows = more_than_5_exp_applied.size

In [130]:
above_salary_avg_and_5_years = df[above_avg_salary_mask & more_than_5_exp_mask]

In [131]:
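# P(A | B) = |A and B| / |B|: the denominator is the subset with more than 5 years, not the whole sample
prob_above_salary_avg_and_5_years = above_salary_avg_and_5_years.size / more_than_5_exp_n_rows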
prob_above_salary_avg_and_5_years

0.429225645295587

In [132]:
round(prob_above_salary_avg_and_5_years * 100, 2)

42.92
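Dividing the count of $A \cap B$ by the count of $B$ is the same as applying the definition $\mathbb{P}(A \cap B) / \mathbb{P}(B)$. A minimal sketch of that equivalence, on hypothetical records (the tuples below are invented for illustration and are not survey rows):

```python
from fractions import Fraction

# Hypothetical records: (net_salary, years_of_experience)
records = [(100, 1), (200, 3), (300, 6), (400, 8), (1000, 10), (900, 2)]

avg = Fraction(sum(s for s, _ in records), len(records))

# Events as sets of row indices
a = {i for i, (s, _) in enumerate(records) if s > avg}  # A: salary above average
b = {i for i, (_, y) in enumerate(records) if y > 5}    # B: more than 5 years

n = len(records)
p_b = Fraction(len(b), n)
p_a_and_b = Fraction(len(a & b), n)

p_a_given_b = p_a_and_b / p_b                      # definition
p_a_given_b_direct = Fraction(len(a & b), len(b))  # counting inside B

print(p_a_given_b, p_a_given_b_direct)
```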

### Independent events

Two events $A$ and $B$ are **independent** if $\mathbb{P}(AB) = \mathbb{P}(A) \mathbb{P}(B)$. When $\mathbb{P}(B) > 0$, this is equivalent to $\mathbb{P}(A | B) = \mathbb{P}(A)$.

In [133]:
prob_more_than_5_exp = more_than_5_exp_applied.size / n_rows
round(prob_more_than_5_exp * 100, 2)

44.83

In [134]:
prob_salary_above_avg, prob_more_than_5_exp

(0.28368794326241137, 0.4483016050765211)

In [135]:
# prob_above_salary_avg_and_5_years holds P(A | B), so independence would mean P(A | B) == P(A)
prob_above_salary_avg_and_5_years == prob_salary_above_avg

False
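Comparing floats with `==` is fragile, because rounding alone can make two mathematically equal products differ. One way around that is to work with exact fractions. The sketch below checks the product rule on a hypothetical sample space of two fair dice; the events and the helpers `prob` and `independent` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two fair dice (an assumption for this example)
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """P(A) = |A| / |Omega| as an exact fraction."""
    return Fraction(len(event), len(omega))

def independent(a, b):
    """True when P(A and B) == P(A) * P(B) exactly."""
    return prob(a & b) == prob(a) * prob(b)

first_even = {w for w in omega if w[0] % 2 == 0}
sum_is_7 = {w for w in omega if sum(w) == 7}
sum_at_least_10 = {w for w in omega if sum(w) >= 10}

indep_even_7 = independent(first_even, sum_is_7)          # 1/12 == 1/2 * 1/6
indep_even_10 = independent(first_even, sum_at_least_10)  # 1/9 != 1/2 * 1/6

print(indep_even_7, indep_even_10)
```

With float probabilities, as in the cells above, `math.isclose` is a reasonable substitute for exact equality.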

- What is the probability that a respondent earns above the average and is a woman/man, and are those two events independent?

In [136]:
df.profile_gender.unique()

array(['varon cis', 'mujer cis', 'diversidades'], dtype=object)

In [137]:
fem_mask, mas_mask = df.profile_gender == 'mujer cis', df.profile_gender == 'varon cis'

In [138]:
fem_applied = df[fem_mask]
fem_prob = fem_applied.size / n_rows

fem_above_avg_salary = df[fem_mask & above_avg_salary_mask]
prob_fem_above_avg_salary = fem_above_avg_salary.size / n_rows
round(prob_fem_above_avg_salary * 100, 2)

2.71

In [139]:
prob_fem_above_avg_salary == prob_salary_above_avg * fem_prob

False

In [140]:
mas_applied = df[mas_mask]
mas_prob = mas_applied.size / n_rows

mas_above_avg_salary = df[mas_mask & above_avg_salary_mask]
prob_mas_above_avg_salary = mas_above_avg_salary.size / n_rows
round(prob_mas_above_avg_salary * 100, 2)

25.25

In [141]:
prob_mas_above_avg_salary == prob_salary_above_avg * mas_prob

False

## Bayes' theorem

$\mathbb{P}(A|B) = \frac{\mathbb{P}(B|A) \mathbb{P}(A)}{\mathbb{P}(B)}$
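As a worked example, the formula can be rearranged to reverse the conditioning and obtain $\mathbb{P}(B | A)$, the probability of more than 5 years of experience given an above-average salary. The sketch below simply reuses the values printed earlier in this notebook, with $A$ and $B$ as defined in the conditional probability section; the variable names are chosen here for illustration.

```python
# Values printed earlier in this notebook
p_a_given_b = 0.429225645295587  # P(A | B): above average given more than 5 years
p_a = 0.28368794326241137        # P(A): above average
p_b = 0.4483016050765211         # P(B): more than 5 years

# Bayes' theorem with the roles of A and B swapped:
# P(B | A) = P(A | B) * P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a

print(round(p_b_given_a * 100, 2))  # about 68 percent
```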