# PCS5024 - Assignment for 14/03/2017

## Arthur Colombini Gusmão - 6851136

Assignment: Download the *adult* dataset from the *UCI Repository*, understand the data, and build a Bayesian network
with 5 variables (of your choice) in SamIam, using your own knowledge. Write a one-page report
covering the chosen variables and the values that were discretized. Try
a few inferences to check whether the results match what you expected.

Note: this file can be accessed at https://github.com/arthurcgusmao/PCS5708_adult_bayes_net/blob/master/analysis.ipynb
so that the images can be viewed more clearly.

# Investigating the adult database from UCI repository

First we'll import the data and describe each feature:

In [1]:
# importing dependencies
import numpy as np
import pandas as pd
from IPython.display import display
import re

In [2]:
# loading the data
columns_names = ["age", "workclass", "fnlwgt", "education", "education-num", \
                 "marital-status", "occupation", "relationship", "race", "sex", \
                 "capital-gain", "capital-loss", "hours-per-week", "native-country", "label"]

train_data = pd.read_csv('adult.data', header=None)
train_data.columns = columns_names
display(train_data.head())

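|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | label |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|-------|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |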
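
The string values loaded this way keep a leading space (for example `' Private'`), which is why the comparisons later in this notebook include it. As a minimal sketch of an alternative, assuming you would rather strip that space at load time, `pd.read_csv` accepts `skipinitialspace=True`. The two-row sample below is copied from the table above and only stands in for the real file; the rest of this notebook is written for the unstripped values, so this is not what the notebook itself does.

```python
import io
import pandas as pd

columns_names = ["age", "workclass", "fnlwgt", "education", "education-num",
                 "marital-status", "occupation", "relationship", "race", "sex",
                 "capital-gain", "capital-loss", "hours-per-week", "native-country", "label"]

# Two rows in a "comma + space" layout (illustrative sample, not the full file).
sample = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# Default parsing keeps the space that follows each comma in string fields.
raw = pd.read_csv(io.StringIO(sample), header=None, names=columns_names)

# skipinitialspace=True drops the whitespace that follows each delimiter.
stripped = pd.read_csv(io.StringIO(sample), header=None, names=columns_names,
                       skipinitialspace=True)

print(repr(raw.loc[0, "workclass"]))
print(repr(stripped.loc[0, "workclass"]))
```

If the stripped variant were used, every string literal in the later cells (such as `' Federal-gov'` or `' >50K'`) would need its leading space removed as well.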


As we can see we have the following features in our dataset:

- age
- workclass (the working class of the person)
- fnlwgt (final weight)
- education
- education-num (the same as education, but ordered according to the person's education level)
- marital-status
- occupation
- relationship
- race
- sex
- capital-gain (assumed to be income from investment sources, apart from wages/salary)
- capital-loss (assumed to be losses from investment sources, apart from wages/salary)
- hours-per-week (hours worked per week)
- native-country
- label (whether or not the person earns more than 50k a year)

For the numeric features, `describe()` gives us summary statistics:

In [3]:
train_data.describe()

|       | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|-------|-----|--------|---------------|--------------|--------------|----------------|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean  | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std   | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min   | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25%   | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50%   | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75%   | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max   | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


For this task, the goal is to select by intuition 5 features that you think are the most relevant for finding out if the person earns more than 50k dollars a year. The ones I selected are:

- age
- workclass
- education-num
- occupation
- sex

Since we don't know the degree of correlation between them initially, we are going to assume that they are independent given the label. In other words, we are using the Naive Bayes assumption. Therefore, our Bayesian network will look like this:

![Our Bayesian network](images/bayes_net.png)

The idea here is to make a very simple model, so we are going to discretize all variables into 2 possible values. Currently, we have for each feature the following possible values:

In [4]:
feats = ["age", "workclass", "education-num", "occupation", "sex"]
for feat in feats:
    print(feat+':', '\n', sorted(train_data[feat].unique()), '\n')

age: 
 [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 90] 

workclass: 
 [' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private', ' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'] 

education-num: 
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] 

occupation: 
 [' ?', ' Adm-clerical', ' Armed-Forces', ' Craft-repair', ' Exec-managerial', ' Farming-fishing', ' Handlers-cleaners', ' Machine-op-inspct', ' Other-service', ' Priv-house-serv', ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support', ' Transport-moving'] 

sex: 
 [' Female', ' Male'] 



As we can see, the selected variables can have many different values. We are going to assume the following separation:

- state 0 indicates that the person is less likely to earn 50k+/year
- state 1 indicates that the person is more likely to earn 50k+/year

| feature       | state 0                                              | state 1                                                        |
|---------------|------------------------------------------------------|----------------------------------------------------------------|
| age           | between 17 and 25, or greater than 65                | between 26 and 65                                              |
| workclass     | Never-worked, Private, Self-emp-not-inc, Without-pay | Federal-gov, Local-gov, Self-emp-inc, State-gov                |
| education-num | <= 9                                                 | >= 10 (some college or greater)                                |
| occupation    | all others not included in state 1                   | Exec-managerial, Prof-specialty, Protective-serv, Tech-support |
| sex           | Female                                               | Male                                                           |
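
The mapping in the table can be sketched as a small function that takes one raw record and returns its five binary states. This is only an illustration: the helper name `discretize` and the constant names are hypothetical, and it assumes that anything not listed under state 1 (including the unknown value `' ?'`) falls into state 0, which the table does not state explicitly for workclass.

```python
# State-1 value sets taken from the table (raw values carry a leading space).
WORKCLASS_STATE_1 = {' Federal-gov', ' Local-gov', ' Self-emp-inc', ' State-gov'}
OCCUPATION_STATE_1 = {' Exec-managerial', ' Prof-specialty',
                      ' Protective-serv', ' Tech-support'}


def discretize(row):
    """Map one raw record (a dict) to the binary states of the table.

    Assumption: every value not listed under state 1 is treated as state 0.
    """
    return {
        'age': int(26 <= row['age'] <= 65),
        'workclass': int(row['workclass'] in WORKCLASS_STATE_1),
        'education-num': int(row['education-num'] >= 10),
        'occupation': int(row['occupation'] in OCCUPATION_STATE_1),
        'sex': int(row['sex'] == ' Male'),
    }


# First record of the dataset, restricted to the five chosen features.
first_row = {'age': 39, 'workclass': ' State-gov', 'education-num': 13,
             'occupation': ' Adm-clerical', 'sex': ' Male'}
states = discretize(first_row)
print(states)
```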

Now we have 5 binary variables, and we can calculate their states from the table above.
What remains is to calculate the prior probability of each variable. The conditional probabilities of the label given the 5 variables are the values that, in this exercise, we are going to guess using our intuition.

First, calculating the probability of each variable is straightforward: we simply count how many rows fall into state 1 and divide by the total number of rows:

In [5]:
# counting the rows that fall into state 1 for each feature
# (string values keep the leading space they have in the loaded data)
age_state_1_count = train_data[train_data['age'].between(26, 65)].shape[0]

workclass_state_1_count = train_data[train_data['workclass'].isin(
    [' Federal-gov', ' State-gov', ' Local-gov', ' Self-emp-inc'])].shape[0]

education_num_state_1_count = train_data[train_data['education-num'] >= 10].shape[0]

occupation_state_1_count = train_data[train_data['occupation'].isin(
    [' Exec-managerial', ' Prof-specialty', ' Protective-serv', ' Tech-support'])].shape[0]

sex_state_1_count = train_data[train_data['sex'] == ' Male'].shape[0]


# now we calculate the probabilities:
total_rows = train_data.shape[0]
prob_age = age_state_1_count / total_rows
prob_workclass = workclass_state_1_count / total_rows
prob_education_num = education_num_state_1_count / total_rows
prob_occupation = occupation_state_1_count / total_rows
prob_sex = sex_state_1_count / total_rows

#printing the results:
print('age -- state 1 count:', age_state_1_count)
print('workclass -- state 1 count:', workclass_state_1_count)
print('education-num -- state 1 count:', education_num_state_1_count)
print('occupation -- state 1 count:', occupation_state_1_count)
print('sex -- state 1 count:', sex_state_1_count)

print('\n-------------\n')

print('P(age == state 1) =', prob_age)
print('P(workclass == state 1) =', prob_workclass)
print('P(education-num == state 1) =', prob_education_num)
print('P(occupation == state 1) =', prob_occupation)
print('P(sex == state 1) =', prob_sex)

age -- state 1 count: 24992
workclass -- state 1 count: 5467
education-num -- state 1 count: 17807
occupation -- state 1 count: 9783
sex -- state 1 count: 21790

-------------

P(age == state 1) = 0.767543994349068
P(workclass == state 1) = 0.16790024876385862
P(education-num == state 1) = 0.5468812382912073
P(occupation == state 1) = 0.30045146033598474
P(sex == state 1) = 0.6692054912318418


Now, we guess the conditional probabilities of the label given the five variables:

![The conditional probability table](images/conditional_prob_table.png)

State 1 for the label indicates that the person earns more than 50k dollars a year. Now, we are able to make some inferences and compare the results with the actual labels:

![Values of the Bayes net without evidence](images/monitor_no_evidence.png)
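
A marginal like the one in the monitor can also be computed by hand, by enumerating all 2^5 feature configurations. The sketch below assumes that the five features are root nodes with the priors computed earlier and that the label is their common child, which is how the priors and the conditional table are described here. The function `cpt_label_1` is a purely hypothetical placeholder and does not reproduce the guessed table from the screenshot, so its numbers should not be compared with the SamIam output.

```python
from itertools import product

# Priors P(feature == state 1), rounded from the values printed earlier.
priors = {
    'age': 0.7675,
    'workclass': 0.1679,
    'education-num': 0.5469,
    'occupation': 0.3005,
    'sex': 0.6692,
}
features = list(priors)


def cpt_label_1(states):
    """HYPOTHETICAL P(label = 1 | states); a placeholder, not the guessed table."""
    return 0.02 + 0.12 * sum(states)


def prob_label_1(evidence=None):
    """P(label = 1 | evidence on features), by summing over feature configurations."""
    evidence = evidence or {}
    numerator = 0.0
    denominator = 0.0
    for states in product([0, 1], repeat=len(features)):
        assignment = dict(zip(features, states))
        # Skip configurations that contradict the observed feature states.
        if any(assignment[f] != v for f, v in evidence.items()):
            continue
        # Root nodes are independent, so the weight is a product of priors.
        weight = 1.0
        for f, s in assignment.items():
            weight *= priors[f] if s == 1 else 1.0 - priors[f]
        numerator += weight * cpt_label_1(states)
        denominator += weight
    return numerator / denominator


print('no evidence:', prob_label_1())
print('sex = state 0, age = state 1:', prob_label_1({'sex': 0, 'age': 1}))
```

Replacing the placeholder with the values actually entered in SamIam would let the same loop cross-check the tool's answers.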

We can already see that the guessed values diverge noticeably from the true ones: the network puts around 41% of the population above 50k/year, while in the actual dataset this figure is around 24% (according to the adult.names description). Nevertheless, we are going to make some inferences to see the network in action:

![Query 01](images/query_01.png)
![Query 02](images/query_02.png)
![Query 03](images/query_female_age.png)
![Query 04](images/query_edu_num.png)

## Preparing the data

Now we are going to create another dataset containing only our chosen variables, so we can use it for the next two tasks:
- Use SamIam to learn the parameters
- Use R's bnlearn package to learn both the structure and parameters

We begin by dropping all unnecessary features and converting the remaining ones into binary variables. Then we save the data into a file that will be used by both SamIam and R.

In [6]:
# creating our custom train data DataFrame
# (.copy() makes it independent of train_data, which avoids pandas' SettingWithCopyWarning)
col_list = ['age', 'workclass', 'education-num', 'occupation', 'sex', 'label']
our_train_data = train_data[col_list].copy()


# setting values
# (rows whose workclass is ' ?' match neither workclass list and are left unchanged)
workclass_state_1_values = [' Federal-gov', ' State-gov', ' Local-gov', ' Self-emp-inc']
workclass_state_0_values = [' Never-worked', ' Private', ' Self-emp-not-inc', ' Without-pay']
occupation_state_1_values = [' Exec-managerial', ' Prof-specialty', ' Protective-serv', ' Tech-support']
occupation_state_0_values = [' ?', ' Adm-clerical', ' Armed-Forces', ' Craft-repair', ' Farming-fishing',
                             ' Handlers-cleaners', ' Machine-op-inspct', ' Other-service', ' Priv-house-serv',
                             ' Sales', ' Transport-moving']

# discretizing age: ages outside 26..65 become 0, every remaining (positive) age becomes 1
our_train_data.loc[our_train_data['age'] < 26, 'age'] = 0
our_train_data.loc[our_train_data['age'] > 65, 'age'] = 0
our_train_data.loc[our_train_data['age'] > 0, 'age'] = 1

# discretizing sex
our_train_data.loc[our_train_data['sex'] == ' Male', 'sex'] = 1
our_train_data.loc[our_train_data['sex'] == ' Female', 'sex'] = 0

# discretizing workclass
for val in workclass_state_1_values:
    our_train_data.loc[our_train_data['workclass'] == val, 'workclass'] = 1
for val in workclass_state_0_values:
    our_train_data.loc[our_train_data['workclass'] == val, 'workclass'] = 0

# discretizing education-num
our_train_data.loc[our_train_data['education-num'] < 10, 'education-num'] = 0
our_train_data.loc[our_train_data['education-num'] >= 10, 'education-num'] = 1

# discretizing occupation
for val in occupation_state_1_values:
    our_train_data.loc[our_train_data['occupation'] == val, 'occupation'] = 1
for val in occupation_state_0_values:
    our_train_data.loc[our_train_data['occupation'] == val, 'occupation'] = 0

# discretizing labels
our_train_data.loc[our_train_data['label'] == ' <=50K', 'label'] = 0
our_train_data.loc[our_train_data['label'] == ' >50K', 'label'] = 1

our_train_data.head()


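|   | age | workclass | education-num | occupation | sex | label |
|---|-----|-----------|---------------|------------|-----|-------|
| 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 1 | 1 | 0 | 0 |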


In [7]:
# writing data to file
#our_train_data.to_csv('/home/arthurcgusmao/our_train_data.dat', header=None, index=None, sep=' ', mode='a')
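
The commented-out call above uses `mode='a'`, which appends to the file, so running the cell more than once would duplicate every row. A minimal sketch of the same export with `mode='w'` (overwrite) is shown below. The two-row frame stands in for `our_train_data`, and the temporary output path is a hypothetical location chosen so the snippet can run anywhere. Whether SamIam or bnlearn expect a header row is not covered here, so the headerless, space-separated layout simply mirrors the original call.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for our_train_data (first two rows of the table above).
toy = pd.DataFrame({
    'age': [1, 1],
    'workclass': [1, 0],
    'education-num': [1, 1],
    'occupation': [0, 1],
    'sex': [1, 1],
    'label': [0, 0],
})

# Hypothetical output location; replace with the path the tools should read.
out_path = os.path.join(tempfile.mkdtemp(), 'our_train_data.dat')

# mode='w' overwrites the file, so re-running does not duplicate rows.
toy.to_csv(out_path, header=False, index=False, sep=' ', mode='w')

with open(out_path) as f:
    lines = f.read().splitlines()
print(lines)
```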

## Using SamIam EM to learn parameters

...

...

## Using bnlearn from R to learn the structure and the parameters

...

...