## **Project - Statistical Learning I**

### Project Development

#### Packages Used

In [2]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.preprocessing import PolynomialFeatures
import math
from scipy.stats import norm
from datetime import datetime

In [3]:
if tf.__version__.startswith("2."):
    import tensorflow.compat.v1 as tf
    tf.compat.v1.disable_v2_behavior()
    tf.compat.v1.disable_eager_execution()

Instructions for updating:
non-resource variables are not supported in the long term


#### Data Loading

In [4]:
data_titanic = pd.read_csv('data_titanic_proyecto.csv')
data_titanic

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,passenger_class,passenger_sex,passenger_survived
0,1,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,S,Lower,M,N
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,Upper,F,Y
2,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,S,Lower,F,Y
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,S,Upper,F,Y
4,5,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,S,Lower,M,N
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,S,Middle,M,N
887,888,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,S,Upper,F,Y
888,889,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,S,Lower,F,N
889,890,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,C,Upper,M,Y


#### Data Cleaning

Convert NaN values to zero.

In [5]:
data_titanic = data_titanic.fillna(0)
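
Filling with zero is simple, but for a numeric column such as `Age` it treats every missing value as a real 0, which pulls the column statistics down. The sketch below uses a made-up five-value series (not the project data) to compare zero filling with median filling; the median is only one possible alternative and is an assumption here, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy ages with two missing entries (illustrative values only).
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan], name="Age")

# Strategy used in the notebook: replace NaN with 0.
zero_filled = ages.fillna(0)

# Hypothetical alternative: replace NaN with the median of the observed ages.
median_filled = ages.fillna(ages.median())

print(zero_filled.mean())    # mean after zero filling
print(median_filled.mean())  # mean after median filling
```

On this toy series the zero-filled mean is 19.0 and the median-filled mean is 33.0, while the mean of the observed ages is about 31.7.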

#### Variable Selection

Candidate predictor variables
- Age
- Fare
- passenger_class
- passenger_sex

Target variable

- passenger_survived

#### Converting Variables to Measure Correlation

The categorical variables are converted to numeric codes below. This conversion is used only to measure how strongly each independent variable correlates with the dependent variable; it is not the encoding used for training.

*The Age variable needs no conversion because it is already numeric.*

In [6]:
data_titanic['passenger_survived_codes'] = data_titanic['passenger_survived'].astype('category').cat.codes
data_titanic['passenger_sex_codes'] = data_titanic['passenger_sex'].astype('category').cat.codes
data_titanic['passenger_class_codes'] = data_titanic['passenger_class'].astype('category').cat.codes
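
The codes produced by `cat.codes` follow the sorted order of the category labels, which matters when reading the sign of a correlation. A minimal sketch on a toy frame (the three rows are illustrative and only reuse the label values seen in the dataset):

```python
import pandas as pd

# Toy rows that reuse the dataset's label values (illustrative only).
toy = pd.DataFrame({
    "passenger_sex": ["M", "F", "F"],
    "passenger_class": ["Lower", "Upper", "Middle"],
    "passenger_survived": ["N", "Y", "Y"],
})

mappings = {}
for column in toy.columns:
    categorical = toy[column].astype("category")
    # Categories are stored sorted, so code k is the k-th label in sorted order.
    mappings[column] = dict(enumerate(categorical.cat.categories))

print(mappings)
```

Under this mapping `F` is 0 and `M` is 1, `N` is 0 and `Y` is 1, and `Lower`, `Middle`, `Upper` are 0, 1, 2. A negative correlation between `passenger_sex_codes` and `passenger_survived_codes` therefore means that male passengers survived less often.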

#### Variable Correlation

In [7]:
data_titanic[data_titanic.columns[1:]].corr()['passenger_survived_codes'][:]

Age                         0.010539
SibSp                      -0.035322
Parch                       0.081629
Fare                        0.257307
passenger_survived_codes    1.000000
passenger_sex_codes        -0.543351
passenger_class_codes       0.338481
Name: passenger_survived_codes, dtype: float64

#### Feature Cleanup

Features that are identifiers, names, or labels are removed.

In [8]:
X=data_titanic.drop(['passenger_survived_codes','PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked', 'passenger_class', 'passenger_sex', 'passenger_survived'], axis=1)
Y=data_titanic['passenger_survived_codes']

#### Selecting the Best Features

scikit-learn's `SelectKBest` is used to pick the 3 best features, and the rest of the project works with those.

In [9]:
best=SelectKBest(k=3)
best.fit_transform(X, Y)
selected = best.get_support(indices=True)
print(X.columns[selected])

Index(['Fare', 'passenger_sex_codes', 'passenger_class_codes'], dtype='object')
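
When no `score_func` is passed, `SelectKBest` scores each feature with `f_classif` (an ANOVA F-test between the feature and the class label) and keeps the `k` highest scores. The sketch below shows this on synthetic data, where only the middle column depends on the label; the data and the names `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic binary target and three features; only the middle one tracks the target.
y_demo = rng.integers(0, 2, size=200)
informative = y_demo + rng.normal(scale=0.1, size=200)
noise = rng.normal(size=(200, 2))
X_demo = np.column_stack([noise[:, 0], informative, noise[:, 1]])

# f_classif is the default score function; it is passed here only to make that explicit.
selector = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)

print(selector.scores_)                    # one F statistic per column
print(selector.get_support(indices=True))  # index of the selected column
```

Inspecting `scores_` in the notebook as well would show how far apart the selected features are from `Age`, `SibSp`, and `Parch`.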


The following variables will be used:

- Fare
- passenger_sex
- passenger_class

They show the strongest correlation with the target (in absolute value), and `SelectKBest` also ranks them as the best features.

In [10]:
used_features = X.columns[selected]

From the candidate features in `X`, those that will not take part in prediction are removed.

In [11]:
X = X.drop(['Age', 'SibSp', 'Parch'], axis=1)

#### Splitting Data into Training, Validation, and Test Sets

In [12]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
x_train, x_validate, y_train, y_validate = train_test_split(x_train, y_train, test_size=0.3)
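
The two calls above hold out 20% for testing and then 30% of the remainder for validation, which is roughly a 56/24/20 split. Because no `random_state` is given, every run produces different subsets. The sketch below reproduces the sizes for 891 rows on placeholder arrays; `random_state=42` and `stratify` are assumptions added for reproducibility and class balance, not settings used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (values are arbitrary).
X_demo = np.arange(891 * 3).reshape(891, 3)
y_demo = np.arange(891) % 2

# First split: 80% train+validation, 20% test.
demo_rest, demo_test, y_rest, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Second split: 70% train, 30% validation, taken from the remaining 80%.
demo_train, demo_validate, y_demo_train, y_demo_validate = train_test_split(
    demo_rest, y_rest, test_size=0.3, random_state=42, stratify=y_rest)

print(len(demo_train), len(demo_validate), len(demo_test))  # 498 214 179
```

The validation size of 214 matches the validation accuracy reported later in the notebook, 0.6308..., which equals 135/214.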

#### Encoding the Categorical Variables **"X"** and **"Y"**

In [None]:
one_hot_passenger_class = pd.get_dummies(data_titanic["passenger_sex"])
x = one_hot_passenger_class.to_numpy()

one_hot_passenger_survived = pd.get_dummies(data_titanic["passenger_survived"])
y = one_hot_passenger_survived.to_numpy()

#### Partitioning Data into Training, Validation, and Test Sets

In [None]:
# Variable(s) "x" 
len_train_validate = int(len(x)*0.8)
x_train_validate = x[0:len_train_validate]
x_test = x[len_train_validate:]

len_train = int(len(x_train_validate)*0.7)
x_train = x_train_validate[0:len_train]
x_validate = x_train_validate[len_train:]

# Variable "y"
len_train_validate = int(len(y)*0.8)
y_train_validate = y[0:len_train_validate]
y_test = y[len_train_validate:]

len_train = int(len(y_train_validate)*0.7)
y_train = y_train_validate[0:len_train]
y_validate = y_train_validate[len_train:]

#### Decoding Methods for Categorical Variables

In [13]:
def decode(encode, dict_label_code):
    label_decode = []
    for i in range(len(encode)):
        label = [key  for (key, value) in dict_label_code.items() if (value == encode[i]).all()]
        label_decode = label_decode + label    
    return (label_decode)    

In [14]:
def get_label_x_code(data, label, encode, length):
    label_code_append = np.column_stack([data[label].to_numpy(), encode])
    label_code_unique = pd.DataFrame(label_code_append).drop_duplicates().to_numpy()
    keys = label_code_unique[:,0]
    values = label_code_unique[:,1:length+1]
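    # Pair each label with its own encoded row; zip(*values) would pair labels with columns instead.
    dict_label_code = dict(zip(keys, map(tuple, values)))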
    return dict_label_code
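
When the encoding comes from `pd.get_dummies`, the column order already records which label each one-hot position stands for, so decoding can also be done with `argmax`. This is a hedged alternative to the dictionary approach above, shown on a made-up four-row series:

```python
import pandas as pd

# Toy labels (illustrative only).
labels = pd.Series(["N", "Y", "Y", "N"], name="passenger_survived")
one_hot = pd.get_dummies(labels)

# Position of the active column in each row, then back to the column label.
positions = one_hot.to_numpy().argmax(axis=1)
decoded_labels = one_hot.columns[positions].tolist()

print(decoded_labels)  # ['N', 'Y', 'Y', 'N']
```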

#### Metrics Function

In [15]:
def get_metrics(y_true, y_predict):
    accuracy = accuracy_score(y_true, y_predict)
    error = mean_squared_error(y_true, y_predict)
    precision = precision_score(y_true, y_predict, average='weighted')
    recall = recall_score(y_true, y_predict, average='weighted')
    f1 = f1_score(y_true, y_predict, average="weighted")
    
    return accuracy, error, precision, recall, f1    
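
For labels coded as 0 and 1, each squared error is 1 for a wrong prediction and 0 for a correct one, so the `error` returned by `get_metrics` is the misclassification rate, that is, one minus the accuracy. A small check on invented labels:

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Invented labels: 3 of 5 predictions are correct.
y_true_demo = [0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 0, 0]

accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)
error_demo = mean_squared_error(y_true_demo, y_pred_demo)

print(accuracy_demo, error_demo)
```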

#### Model - Decision Tree

In [16]:
def model_decision_tree(x_train, y_train, x_validate, y_validate):
    
    tree_model = tree.DecisionTreeClassifier()
    tree_model = tree_model.fit(x_train, y_train)
    y_predict = tree_model.predict(x_validate)
    
    return y_predict, tree_model, get_metrics(y_validate, y_predict)

#### Model - SVM

In [17]:
def model_svm(x_train, y_train, x_validate, y_validate):

    svm_model = svm.SVC()
    svm_model = svm_model.fit(x_train, y_train)
    y_predict = svm_model.predict(x_validate)
    
    return y_predict, svm_model, get_metrics(y_validate, y_predict)

#### Model - Naive Bayes

In [18]:
def predict_naive_bayes(model, x_validate):
    # model = [mean, stdev, prior, classes]; mean and stdev hold one row per class.
    y_predict = []
    for i in range(x_validate.shape[0]):
        probability={}
        for y_class in model[3]:
            # Start from the class prior and multiply by one Gaussian likelihood per feature.
            probability[y_class] = model[2].iloc[y_class]
            for index, value in enumerate(x_validate.iloc[i]):
                probability[y_class] *= norm.pdf(value, model[0].iloc[y_class, index], model[1].iloc[y_class, index])
        y_predict.append(get_argmax(probability))
    return y_predict

In [19]:
def get_argmax(probability):
    # Return the class with the largest score.
    return max(probability, key=probability.get)

In [20]:
def model_naive_bayes(x_train, y_train, x_validate, y_validate):
    
    # Per-class mean, population standard deviation (ddof=0, as np.std computes), and class prior.
    mean = x_train.groupby(y_train).mean()
    stdev = x_train.groupby(y_train).std(ddof=0)
    probabilities = y_train.value_counts(normalize=True).sort_index()
    y_class = np.unique(y_train)
    bayes_model = [mean, stdev, probabilities, y_class]    
    y_predict = predict_naive_bayes(bayes_model, x_validate)
        
    return y_predict, bayes_model, get_metrics(y_validate, y_predict)
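
Multiplying many small densities can underflow to zero, so Gaussian Naive Bayes is often computed with sums of log densities instead. The following is a vectorized sketch of that idea on synthetic, well-separated data; the function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical and are not part of the notebook.

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_nb(x, y):
    """Estimate per-class means, standard deviations, and log priors."""
    classes = np.unique(y)
    means = np.array([x[y == c].mean(axis=0) for c in classes])
    stdevs = np.array([x[y == c].std(axis=0) for c in classes])
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, means, stdevs, log_priors

def predict_gaussian_nb(model, x):
    """Pick the class with the largest log prior plus summed log likelihood."""
    classes, means, stdevs, log_priors = model
    # Shape (samples, classes, features), summed over features.
    log_likelihood = norm.logpdf(x[:, None, :], means[None, :, :], stdevs[None, :, :]).sum(axis=2)
    return classes[np.argmax(log_likelihood + log_priors, axis=1)]

rng = np.random.default_rng(0)
# Two synthetic classes centred at 0 and 5 in both features.
x_nb_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                       rng.normal(5.0, 1.0, size=(100, 2))])
y_nb_demo = np.repeat([0, 1], 100)

nb_model = fit_gaussian_nb(x_nb_demo, y_nb_demo)
nb_predictions = predict_gaussian_nb(nb_model, x_nb_demo)
print((nb_predictions == y_nb_demo).mean())
```

The arg max is the same as with products of densities, because the logarithm is monotonic.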

#### Model - Logistic Regression

In [21]:
#polynomial_features = PolynomialFeatures(2)
array_x = x_train.values #polynomial_features.fit_transform(x_train.values)
array_y = y_train.values

#### Graph Definition

In [22]:
tf.reset_default_graph()

weight = tf.Variable(tf.truncated_normal([3, 1]), name = "weight", dtype = tf.float32)
bias = tf.Variable(tf.zeros([]), name = "bias", dtype = tf.float32)

learning_rate = tf.placeholder(shape = [], name = "learning_rate", dtype = tf.float32)
regularization_factor = tf.placeholder(tf.float32)
tensor_x = tf.placeholder(shape = [None, 3], name = "tensor_x", dtype = tf.float32)
tensor_y = tf.placeholder(shape = [None, 1], name = "tensor_y", dtype = tf.float32)

with tf.name_scope("logits"):
    logits = tf.matmul(tensor_x, weight) + bias
    
with tf.name_scope("cross_entropy"):
    regularization = tf.nn.l2_loss(weight);
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = tensor_y)) + (regularization_factor*regularization)
    cross_entropy_summary = tf.summary.scalar(name="cross_entropy",tensor=cross_entropy)

with tf.name_scope("accuracy"):
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits,1), tf.argmax(tensor_y,1)), tf.float32))
    accuracy_summary = tf.summary.scalar(name="accuracy",tensor=accuracy)

with tf.name_scope("gradient"):
    gradient = tf.gradients(cross_entropy, weight)

with tf.name_scope("new_weight"):
    new_weight = tf.assign(weight, weight - learning_rate * gradient[0])

init = tf.global_variables_initializer()

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.
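
A note on the loss defined in this graph: `logits` has shape `[None, 1]`, and a softmax over a single column is always 1. The softmax cross-entropy term is therefore 0 for every sample, and `tf.argmax(..., 1)` returns 0 for both the logits and the labels, which would explain a constant accuracy of 1.0 in the training log. For a binary target with one logit, the usual choice is the sigmoid cross entropy (`tf.nn.sigmoid_cross_entropy_with_logits` in TensorFlow). The NumPy sketch below compares both losses on invented values; it is an illustration, not the notebook's graph.

```python
import numpy as np

# Invented single-column logits and binary labels.
demo_logits = np.array([[2.0], [-1.0], [0.5]])
demo_labels = np.array([[1.0], [0.0], [0.0]])

# Softmax across one column: exp(z) / exp(z) = 1 in every row.
softmax = np.exp(demo_logits) / np.exp(demo_logits).sum(axis=1, keepdims=True)
softmax_loss = -(demo_labels * np.log(softmax)).sum(axis=1)

# Sigmoid cross entropy in its numerically stable form:
# max(z, 0) - z * y + log(1 + exp(-|z|))
sigmoid_loss = (np.maximum(demo_logits, 0) - demo_logits * demo_labels
                + np.log1p(np.exp(-np.abs(demo_logits))))

print(softmax_loss)           # zero for every sample
print(sigmoid_loss.ravel())   # positive, larger for worse predictions
```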



In [25]:
regularizators = [pow(10,i) for i in np.arange(-0.05, -0.01, 0.01)]
regularizators

[0.8912509381337456, 0.9120108393559098, 0.933254300796991, 0.954992586021436]

#### Mini Batch Gradient Descent

In [23]:
batch_size = 20
sample_size = len(x_train)
total_iterations = int(sample_size / batch_size)

def train(epochs, lr):
    with tf.train.MonitoredSession() as session:
        session.run(init)
        #writer = tf.summary.FileWriter("./graphs/"+datetime.now().strftime("%Y%m%d_%H%M%S")+"_lr="+str(lr), session.graph)
        for regul_factor in regularizators:
            for epoch in range(epochs):
                for i in range(total_iterations):
                    start_index = i*batch_size
                    end_index = start_index+batch_size
                    x = np.array(array_x[start_index:end_index])
                    # Reshape to a column vector without hard-coding the batch size.
                    y = np.array(array_y[start_index:end_index]).reshape(-1, 1)

                    feed_dict = {tensor_x:x, tensor_y:y, learning_rate:lr, regularization_factor:regul_factor}
                    entropy_summary = session.run(cross_entropy_summary, feed_dict=feed_dict)
                    acc_summary = session.run(accuracy_summary, feed_dict=feed_dict)

                    _, c, a, w, b= session.run([new_weight, cross_entropy, accuracy, weight, bias],feed_dict=feed_dict)
                    if (i % 20 == 0):
                        print("Epoch: {} Iteration: {} Cross Entropy: {} Accuracy: {}".format(epoch,i,c,a))               
                        #writer.add_summary(entropy_summary, i)
                        #writer.add_summary(acc_summary, i)
            #writer.close()
    return (w,b)

In [47]:
w,b=train(10, 0.00001)

INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Epoch: 0 Iteration: 0 Cross Entropy: 0.8877283334732056 Accuracy: 1.0
Epoch: 0 Iteration: 20 Cross Entropy: 0.8901100158691406 Accuracy: 1.0
Epoch: 1 Iteration: 0 Cross Entropy: 0.8906721472740173 Accuracy: 1.0
Epoch: 1 Iteration: 20 Cross Entropy: 0.8930597305297852 Accuracy: 1.0
Epoch: 2 Iteration: 0 Cross Entropy: 0.8936232328414917 Accuracy: 1.0
Epoch: 2 Iteration: 20 Cross Entropy: 0.8960162401199341 Accuracy: 1.0
Epoch: 3 Iteration: 0 Cross Entropy: 0.8965811133384705 Accuracy: 1.0
Epoch: 3 Iteration: 20 Cross Entropy: 0.8989799618721008 Accuracy: 1.0
Epoch: 4 Iteration: 0 Cross Entropy: 0.8995461463928223 Accuracy: 1.0
Epoch: 4 Iteration: 20 Cross Entropy: 0.9019507765769958 Accuracy: 1.0
Epoch: 5 Iteration: 0 Cross Entropy: 0.9025182127952576 Accuracy: 1.0
Epoch: 5 Iteration: 20 Cross Entropy: 0.9049287438392639 Accuracy: 1.0
Epoch: 6 Iteration: 0 Cross Entrop

In [48]:
w

array([[-1.2494775 ],
       [-0.79637516],
       [-0.268143  ]], dtype=float32)

In [33]:
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

In [721]:
sigmoid(-7.1061424175)

0.0008193803830555108
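
`math.exp(-x)` raises `OverflowError` once `-x` exceeds roughly 709, which can happen with large unscaled inputs and negative weights. A common workaround is to branch on the sign of `x`; the name `stable_sigmoid` below is hypothetical.

```python
import math

def stable_sigmoid(x):
    """Sigmoid that never calls exp on a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is at most 1, so it cannot overflow.
    z = math.exp(x)
    return z / (1.0 + z)

print(stable_sigmoid(-7.1061424175))  # same value as sigmoid(-7.1061424175) above
print(stable_sigmoid(-1000.0))        # 0.0 instead of an OverflowError
```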

In [36]:
def predict_reg_logistic(x_validate, weight):
    y_predict = []
    for feature in x_validate.values:
        value = 0
        for i in range(len(feature)):
            value += feature[i] * weight[i][0]
        value_sigmoid = sigmoid(value)
        if value_sigmoid >= 0.5:
            y_predict.append(1)
        else:
            y_predict.append(0)
    return y_predict        

In [49]:
y_predict = predict_reg_logistic(x_validate, w)

In [50]:
accuracy_score(y_validate, y_predict)

0.6308411214953271
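
The prediction loop can also be written as a single matrix product. The sketch below is a vectorized equivalent under the same 0.5 threshold; `predict_logistic_vectorized` and its `bias` parameter are hypothetical additions (the original function does not use the bias), and the feature rows are invented.

```python
import numpy as np

def predict_logistic_vectorized(features, weight, bias=0.0, threshold=0.5):
    """Return 0/1 predictions from a feature matrix and a weight column."""
    scores = np.asarray(features, dtype=float) @ np.asarray(weight, dtype=float).reshape(-1) + bias
    probabilities = 1.0 / (1.0 + np.exp(-scores))
    return (probabilities >= threshold).astype(int)

# Weights as printed for `w` earlier in the notebook.
w_demo = np.array([[-1.2494775], [-0.79637516], [-0.268143]])

# Invented rows with three non-negative features each.
rows = [[7.25, 1, 0], [0.0, 0, 0], [71.2833, 0, 2]]
demo_predictions = predict_logistic_vectorized(rows, w_demo)
print(demo_predictions)  # [0 1 0]
```

With all three weights negative and all three features non-negative, the score is never positive, so the model outputs 1 only when every feature is exactly zero. It is worth comparing the reported validation accuracy with the class balance of `y_validate` before reading it as evidence of learning.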

In [533]:
prediction, model, metrics = model_decision_tree(x_train[used_features], y_train, x_validate[used_features], y_validate)

In [535]:
prediction, model, metrics = model_svm(x_train[used_features], y_train, x_validate[used_features], y_validate)

In [554]:
prediction, model, metrics = model_naive_bayes(x_train[used_features], y_train, x_validate[used_features], y_validate)

#### Bootstrapping

This method resamples the data with the aim of reducing bias. The main idea is to draw a sample with replacement from the original sample, taking N observations, where N is the size of the original sample. Each resample of the same size as the original yields one estimate, and the procedure is repeated to obtain many estimates.

In this project, Bootstrapping would have been implemented as follows:

From the population of 891 passengers, a random sample is drawn, here of size 500, and several estimates are obtained from it. Each resample must have the same size as that sample and is obtained by repeatedly drawing at random with replacement. Assuming the resampling is done 100 times, this yields 100 estimates, each with its own statistic, which together help determine the survivor prediction more accurately.
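
The procedure described above can be sketched with NumPy. The labels here are synthetic stand-ins for the 891 survival values, the statistic is the mean survival rate, and the 95% percentile interval is an assumed way of summarizing the 100 estimates; none of this was run in the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 0/1 labels standing in for the 891 passengers.
population = rng.integers(0, 2, size=891)

# Random sample of 500 drawn from the population without replacement.
sample = rng.choice(population, size=500, replace=False)

n_resamples = 100
estimates = np.empty(n_resamples)
for b in range(n_resamples):
    # Each bootstrap resample has the same size as the sample and is drawn with replacement.
    resample = rng.choice(sample, size=len(sample), replace=True)
    estimates[b] = resample.mean()

# Spread of the statistic across the resamples.
lower, upper = np.percentile(estimates, [2.5, 97.5])
print(estimates.mean(), lower, upper)
```

The same loop works for a model metric: fit on each resample and store the validation accuracy instead of the mean.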

#### K-Folds Cross Validation

This technique


#### Decoding the Prediction

In [None]:
label_x_code = get_label_x_code(data_validate, "passenger_survived", y_validate, 2)
label_decoded = decode(tree_predict, label_x_code)

In [None]:
# Stack each label next to its encoded row, then keep one row per distinct label.
label_x_code = np.column_stack([data_train['passenger_survived'].to_numpy(), passenger_survived_encoded])
df = pd.DataFrame(label_x_code)
survived = df.drop_duplicates().to_numpy()
keys = survived[:,0]
values = survived[:,1:3]
# Pair each label with its own encoded row (no transpose).
dict_survived = dict(zip(keys, map(tuple, values)))