<h1>Sklearn</h1>
<br>
scikit-learn (imported as sklearn) is one of the main machine learning libraries for Python. It implements most of the popular algorithms, with the exception of <i>Deep Learning</i> techniques. For those, use libraries such as <b>TensorFlow</b> or <b>PyTorch</b>.

In [31]:
# -*- coding: utf-8 -*-

<h2>Test datasets</h2>
<br>
For testing purposes, we will use some of the sample datasets that ship with sklearn. We start with the iris dataset, which contains measurements used to classify flowers.
<br>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Iris_versicolor_3.jpg/1920px-Iris_versicolor_3.jpg" height="500" width="500">
<i>Image source: Wikipedia: https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Iris_versicolor_3.jpg/1920px-Iris_versicolor_3.jpg</i>

In [32]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data
Y = iris.target

print('SHAPE X', np.shape(X))
print('SHAPE Y', np.shape(Y))

tag_set = set(Y)
print('TAG SET', tag_set)

SHAPE X (150, 4)
SHAPE Y (150,)
TAG SET {0, 1, 2}
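The four columns of X and the three integer tags can be mapped back to readable names. The short sketch below only reads the `feature_names` and `target_names` attributes of the object returned by `load_iris`; the variable names are chosen here for illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Names of the four measured features (columns of iris.data)
feature_names = list(iris.feature_names)
# Names of the three species; the position in the list is the integer tag in iris.target
target_names = list(iris.target_names)

print('FEATURES', feature_names)
print('CLASSES', target_names)
```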


<h2>Creating training, test, and validation sets</h2>

In [33]:
from sklearn.model_selection import train_test_split


X_train_initial, X_test, Y_train_initial, Y_test = train_test_split(X, Y, 
                                                                    test_size=0.30, 
                                                                    stratify=Y,
                                                                    shuffle=True)

print('SHAPE X_train_test', np.shape(X_train_initial))
print('SHAPE X_test', np.shape(X_test))
print('SET Y_train', set(Y_train_initial))
print('SET Y_test', set(Y_test))

X_train, X_validation, Y_train, Y_validation = train_test_split(X_train_initial, Y_train_initial, 
                                                                test_size=0.2, 
                                                                stratify=Y_train_initial,
                                                                shuffle=True)

print('SHAPE train', np.shape(X_train))
print('SHAPE validation', np.shape(X_validation))
print('SHAPE test', np.shape(X_test))

SHAPE X_train_test (105, 4)
SHAPE X_test (45, 4)
SET Y_train {0, 1, 2}
SET Y_test {0, 1, 2}
SHAPE train (84, 4)
SHAPE validation (21, 4)
SHAPE test (45, 4)
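Because the split is shuffled, each run of the cell above produces different subsets. A minimal sketch of a reproducible variant follows, assuming a fixed `random_state` (a value picked here, not part of the original cell). It also counts the classes in each subset to show what `stratify` does.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, Y = datasets.load_iris(return_X_y=True)

# random_state=0 is an assumption added so that repeated runs give the same split
X_rest, X_test, Y_rest, Y_test = train_test_split(
    X, Y, test_size=0.30, stratify=Y, random_state=0)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X_rest, Y_rest, test_size=0.2, stratify=Y_rest, random_state=0)

# With stratify, every subset keeps the 1:1:1 class ratio of the full dataset
train_counts = np.bincount(Y_train)
validation_counts = np.bincount(Y_validation)
test_counts = np.bincount(Y_test)

print('train', train_counts)
print('validation', validation_counts)
print('test', test_counts)
```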


<h2>Evaluating classifiers on the validation set and on the test set</h2>

In [34]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB

model_1 = GaussianNB()
model_1.fit(X_train, Y_train)

model_2 = BernoulliNB()
model_2.fit(X_train, Y_train)

accuracy_validation_1 = model_1.score(X_validation, Y_validation)
accuracy_validation_2 = model_2.score(X_validation, Y_validation)

print('GaussianNB - accuracy on the validation set', accuracy_validation_1)
print('BernoulliNB - accuracy on the validation set', accuracy_validation_2)

# To exercise the second branch (BernoulliNB), uncomment:
# accuracy_validation_2 = 1.0

if accuracy_validation_1 > accuracy_validation_2:
    accuracy_test = model_1.score(X_test, Y_test)
    print('GaussianNB - accuracy on the test set', accuracy_test)
else:
    accuracy_test = model_2.score(X_test, Y_test)
    print('BernoulliNB - accuracy on the test set', accuracy_test)

GaussianNB - accuracy on the validation set 0.9047619047619048
BernoulliNB - accuracy on the validation set 0.3333333333333333
GaussianNB - accuracy on the test set 0.9555555555555556
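A single validation set of 21 examples gives a noisy estimate. One possible alternative, sketched here and not part of the original workflow, is k-fold cross-validation with `cross_val_score`. For brevity the sketch uses the whole iris dataset; in the workflow above it would be applied to the training portion only, with the test set kept apart.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB

X, Y = datasets.load_iris(return_X_y=True)

# cv=5: each model is trained five times, each time scored on a different fifth of the data
scores_gaussian = cross_val_score(GaussianNB(), X, Y, cv=5)
scores_bernoulli = cross_val_score(BernoulliNB(), X, Y, cv=5)

print('GaussianNB mean accuracy', scores_gaussian.mean())
print('BernoulliNB mean accuracy', scores_bernoulli.mean())
```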


<h2>Running the classifier on a set of examples and saving the model</h2>

In [52]:
# Predicting on a set of examples

prediction = model_1.predict(X_test)
print('Prediction:', prediction[:10], 'vs Target:', Y_test[:10])

Prediction: [2 0 2 2 2 0 1 1 0 2] vs Target: [2 0 2 2 2 0 1 1 0 2]


In [53]:
# Saving the model

# There are two different ways

# Saving
import pickle
with open('model_1.pickle', 'wb') as out:
    pickle.dump(model_1, out)

# Loading (the with block closes the file once the model is read)
with open('model_1.pickle', 'rb') as saved:
    model_saved = pickle.load(saved)
prediction = model_saved.predict(X_test)
print('Prediction:', prediction[:10], 'vs Target:', Y_test[:10])

Prediction: [2 0 2 2 2 0 1 1 0 2] vs Target: [2 0 2 2 2 0 1 1 0 2]


In [54]:
# Second way

# sklearn.externals.joblib was removed from scikit-learn; import joblib directly
import joblib

# Saving
joblib.dump(model_1, 'model_1.pkl')

# Loading
model_saved2 = joblib.load('model_1.pkl')
prediction = model_saved2.predict(X_test)
print('Prediction:', prediction[:10], 'vs Target:', Y_test[:10])

Prediction: [2 0 2 2 2 0 1 1 0 2] vs Target: [2 0 2 2 2 0 1 1 0 2]


<h2>Computing other metrics for the classifiers</h2>

In [35]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix

Y_pred_1 = model_1.predict(X_test)
Y_pred_2 = model_2.predict(X_test)

precision_1 = precision_score(Y_test, Y_pred_1, average=None)
precision_1_average = precision_score(Y_test, Y_pred_1, average='weighted')
print('GaussianNB precision per class', precision_1)
print('GaussianNB weighted-average precision', precision_1_average)

precision_2 = precision_score(Y_test, Y_pred_2, average=None)
precision_2_average = precision_score(Y_test, Y_pred_2, average='weighted')
print('BernoulliNB precision per class', precision_2)
print('BernoulliNB weighted-average precision', precision_2_average)

recall_1 = recall_score(Y_test, Y_pred_1, average=None)
recall_1_average = recall_score(Y_test, Y_pred_1, average='weighted')
print('GaussianNB recall per class', recall_1)
print('GaussianNB weighted-average recall', recall_1_average)

recall_2 = recall_score(Y_test, Y_pred_2, average=None)
recall_2_average = recall_score(Y_test, Y_pred_2, average='weighted')
print('BernoulliNB recall per class', recall_2)
print('BernoulliNB weighted-average recall', recall_2_average)

# Rows are the true classes, columns are the predicted classes
cm_1 = confusion_matrix(Y_test, Y_pred_1)
print('Confusion matrix - GaussianNB\n', cm_1)

cm_2 = confusion_matrix(Y_test, Y_pred_2)
print('Confusion matrix - BernoulliNB\n', cm_2)

GaussianNB precision per class [1.         0.88235294 1.        ]
GaussianNB weighted-average precision 0.9607843137254902
BernoulliNB precision per class [0.33333333 0.         0.        ]
BernoulliNB weighted-average precision 0.1111111111111111
GaussianNB recall per class [1.         1.         0.86666667]
GaussianNB weighted-average recall 0.9555555555555556
BernoulliNB recall per class [1. 0. 0.]
BernoulliNB weighted-average recall 0.3333333333333333
Confusion matrix - GaussianNB
 [[15  0  0]
 [ 0 15  0]
 [ 0  2 13]]
Confusion matrix - BernoulliNB
 [[15  0  0]
 [15  0  0]
 [15  0  0]]
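The BernoulliNB confusion matrix shows every example assigned to a single class. By default BernoulliNB binarizes each feature at 0.0, and every iris measurement is positive, so all inputs look identical to the model. The sketch below reproduces this on a fresh split (the `random_state` is an assumption) and uses `classification_report` with `zero_division=0`, which reports 0.0 for classes that were never predicted instead of emitting a warning.

```python
from sklearn import datasets
from sklearn.metrics import classification_report, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

X, Y = datasets.load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, stratify=Y, random_state=0)

# Default binarize=0.0 turns every positive measurement into 1
model = BernoulliNB().fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# zero_division=0 sets precision to 0.0 for classes with no predicted samples
precision_per_class = precision_score(Y_test, Y_pred, average=None, zero_division=0)
print('Distinct predicted classes', set(Y_pred.tolist()))
print(classification_report(Y_test, Y_pred, zero_division=0))
```

Passing a positive threshold through the `binarize` parameter, or using GaussianNB, avoids the problem for continuous features like these.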




<h1>Activity: Try several classifiers on the digits dataset</h1>
<br>
<img src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png" height="500" width="500">
<i>Image source: Wikipedia: https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png</i>
<br>
<h4>Look up classifiers here:</h4>
<a href="https://scikit-learn.org/stable/supervised_learning.html" target="_blank">sklearn reference</a>
    

In [36]:
digits = datasets.load_digits()
X = digits.data
Y = digits.target

print('SHAPE X', np.shape(X))
print('SHAPE Y', np.shape(Y))

tag_set = set(Y)
print('TAG SET', tag_set)


SHAPE X (1797, 64)
SHAPE Y (1797,)
TAG SET {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
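One possible starting point for the activity is sketched below. The choice of classifiers, their hyperparameters, and the `random_state` values are assumptions made for illustration; any estimator from the sklearn reference with `fit` and `score` can be added to the dictionary.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, Y = datasets.load_digits(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0)

# Candidate models, keyed by a display name
candidates = {
    'GaussianNB': GaussianNB(),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=3),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0),
    'LogisticRegression': LogisticRegression(max_iter=5000),
}

# Train each candidate and record its accuracy on the validation set
validation_scores = {}
for name, model in candidates.items():
    model.fit(X_train, Y_train)
    validation_scores[name] = model.score(X_validation, Y_validation)
    print(name, validation_scores[name])

best_name = max(validation_scores, key=validation_scores.get)
print('Best on validation:', best_name)
```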


<h2>Scaling up training - minibatches</h2>

In [3]:
# The yield keyword

def create_squares_1(my_list):
    squared_list = []
    for elem in my_list:
        squared_list.append(elem ** 2)
    return squared_list

squared_list = create_squares_1([1,2,3,4])

for elem in squared_list:
    print(elem)

1
4
9
16


In [4]:
# Alternative: a generator function

def create_squares_2(my_list):
    # Each value is produced on demand; no intermediate list is built
    for elem in my_list:
        yield elem ** 2

squared_list_gen = create_squares_2([1,2,3,4])

for elem in squared_list_gen:
    print(elem)

1
4
9
16
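Both cells print the same values, so the difference between a list and a generator is easy to miss. The sketch below steps through a generator by hand with `next` to show that values are computed only when requested and that a generator can be consumed only once. The function name is chosen here for illustration.

```python
def create_squares(my_list):
    for elem in my_list:
        yield elem ** 2

squares_gen = create_squares([1, 2, 3, 4])

first = next(squares_gen)       # runs the body up to the first yield
second = next(squares_gen)      # resumes right after the previous yield
remaining = list(squares_gen)   # consumes whatever is left
exhausted = list(squares_gen)   # nothing is left on a second pass

print(first, second, remaining, exhausted)
```

This on-demand behavior is what makes generators suitable for feeding batches to a model without holding every batch in memory at once.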


<h3>Example of minibatch training with the digits dataset</h3>

In [6]:
# Loading the digits dataset
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
import numpy as np
import math

digits = datasets.load_digits()
X = digits.data
Y = digits.target

X_train_initial, X_test, Y_train_initial, Y_test = train_test_split(X, Y, 
                                                                    test_size=0.30, 
                                                                    stratify=Y,
                                                                    shuffle=True)

X_train, X_validation, Y_train, Y_validation = train_test_split(X_train_initial, Y_train_initial, 
                                                                test_size=0.2, 
                                                                stratify=Y_train_initial,
                                                                shuffle=True)


In [11]:
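# Run this cell more than once: the result changes between runs

def create_batches(X_set, Y_set, size_batch):
    # Yields consecutive (x, y) batches of size_batch examples.
    # Examples left over after the last full batch are dropped.
    len_X = np.shape(X_set)[0]
    num_batches = int(math.floor(len_X/size_batch))
    start = 0
    for i in range(num_batches):
        x_batch = X_set[start:start + size_batch]
        y_batch = Y_set[start:start + size_batch]
        start += size_batch
        yield x_batch, y_batch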

model = SGDClassifier(loss='hinge')
num_epochs = 3

my_classes = (0,1,2,3,4,5,6,7,8,9)

for epoch in range(num_epochs):
    print('Epoch', epoch, 'of', num_epochs)
    batch_generator = create_batches(X_train, Y_train, 30)
    
    for batch_x, batch_y in batch_generator:
        model.partial_fit(batch_x, batch_y, classes=my_classes)

accuracy_validation = model.score(X_validation, Y_validation)

print('Accuracy', accuracy_validation)

Epoch 0 of 3
Epoch 1 of 3
Epoch 2 of 3
Accuracy 0.9047619047619048
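The batch generator above always visits the examples in the same order and skips the leftover examples at the end. A possible variant, sketched here under the assumption that reshuffling on every epoch is wanted, draws a new random order each time and keeps the final, smaller batch. The function name, the seeds, and the single train/validation split are choices made for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split


def create_shuffled_batches(X_set, Y_set, size_batch, rng):
    """Yield (x, y) batches in random order; the last batch may be smaller."""
    order = rng.permutation(len(X_set))
    for start in range(0, len(order), size_batch):
        idx = order[start:start + size_batch]
        yield X_set[idx], Y_set[idx]


X, Y = datasets.load_digits(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0)

rng = np.random.default_rng(0)
model = SGDClassifier(loss='hinge', random_state=0)
all_classes = np.unique(Y)  # partial_fit must be told every class up front

for epoch in range(3):
    # A new generator per epoch gives a new random order per epoch
    for batch_x, batch_y in create_shuffled_batches(X_train, Y_train, 30, rng):
        model.partial_fit(batch_x, batch_y, classes=all_classes)

accuracy_validation = model.score(X_validation, Y_validation)
batch_sizes = [len(bx) for bx, _ in create_shuffled_batches(X_train, Y_train, 30, rng)]

print('Accuracy', accuracy_validation)
print('Batches per epoch', len(batch_sizes), 'last batch size', batch_sizes[-1])
```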




<h1>Text classification example with sklearn</h1>

In [37]:
import json

# The with block closes the file once the JSON is parsed
with open('datasets/reviews.json', 'r') as f:
    as_json = json.load(f)
num_texts = len(as_json['paper'])

print(num_texts)

172


In [38]:
# Inspecting the data

entries = [j for j in as_json['paper']]
entries[0]

{'id': 1,
 'preliminary_decision': 'accept',
 'review': [{'confidence': '4',
   'evaluation': '1',
   'id': 1,
   'lan': 'es',
   'orientation': '0',
   'remarks': '',
   'text': '- El artículo aborda un problema contingente y muy relevante, e incluye tanto un diagnóstico nacional de uso de buenas prácticas como una solución (buenas prácticas concretas). - El lenguaje es adecuado.  - El artículo se siente como la concatenación de tres artículos diferentes: (1) resultados de una encuesta, (2) buenas prácticas de seguridad, (3) incorporación de buenas prácticas. - El orden de las secciones sería mejor si refleja este orden (la versión revisada es #2, #1, #3). - El artículo no tiene validación de ningún tipo, ni siquiera por evaluación de expertos.',
   'timespan': '2010-07-05'},
  {'confidence': '4',
   'evaluation': '1',
   'id': 2,
   'lan': 'es',
   'orientation': '1',
   'remarks': '',
   'text': 'El artículo presenta recomendaciones prácticas para el desarrollo de software seguro. S

In [39]:
import numpy as np

texts = [' '.join([x['text'] for x in j['review']]) for j in entries]
classifications = [j['preliminary_decision'] for j in entries]

class_set = list(set(classifications))
print('SET classifications', class_set)

numeric_classifications = [class_set.index(c) for c in classifications]

Y = np.array(numeric_classifications).ravel()

SET classifications ['accept', 'no decision', 'reject', 'probably reject']


<h3>Creating numeric features</h3>

In [40]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_1 = CountVectorizer()
vectorizer_2 = TfidfVectorizer()

X_1 = vectorizer_1.fit_transform(texts)
X_2 = vectorizer_2.fit_transform(texts)

Example

In [41]:
X_1[0]

<1x6704 sparse matrix of type '<class 'numpy.int64'>'
	with 192 stored elements in Compressed Sparse Row format>

In [42]:
X_1[0].todense()

matrix([[0, 0, 0, ..., 0, 0, 0]])

In [43]:
np.sum(X_1[0].todense())

363
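The outputs above are easier to read on a corpus small enough to inspect by eye. The sketch below runs CountVectorizer on two made-up sentences (hypothetical data, not taken from the reviews) and shows the vocabulary and the count matrix. Each row holds the token counts of one text, which is why the sum of a row equals the number of tokens counted in that text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
toy_texts = ['good paper good results', 'weak paper']

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_texts)

# vocabulary_ maps each token to its column index; columns follow alphabetical order
vocabulary = {token: int(index) for token, index in toy_vectorizer.vocabulary_.items()}
# toarray() converts the sparse matrix to a regular NumPy array
dense_counts = toy_counts.toarray()

print(vocabulary)
print(dense_counts)
```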

<h3>Testing the classifiers</h3>

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X_train_1, X_test_1, Y_train_1, Y_test_1 = train_test_split(X_1, Y, 
                                                    test_size=0.30, 
                                                    stratify=Y,
                                                    shuffle=True,
                                                    random_state=42)

X_train_2, X_test_2, Y_train_2, Y_test_2 = train_test_split(X_2, Y, 
                                                    test_size=0.30, 
                                                    stratify=Y,
                                                    shuffle=True,
                                                    random_state=42)

model1_1 = GaussianNB()
model1_2 = GaussianNB()
model2_1 = LogisticRegression()
model2_2 = LogisticRegression()
model3_1 = MLPClassifier((10,), activation='logistic')
model3_2 = MLPClassifier((10,), activation='logistic')

# GaussianNB needs dense input; toarray() returns a NumPy array,
# whereas todense() returns np.matrix, which recent sklearn versions reject
model1_1.fit(X_train_1.toarray(), Y_train_1)
model1_2.fit(X_train_2.toarray(), Y_train_2)
model2_1.fit(X_train_1, Y_train_1)
model2_2.fit(X_train_2, Y_train_2)
model3_1.fit(X_train_1, Y_train_1)
model3_2.fit(X_train_2, Y_train_2)

accuracy1_1 = model1_1.score(X_test_1.toarray(), Y_test_1)
accuracy1_2 = model1_2.score(X_test_2.toarray(), Y_test_2)
accuracy2_1 = model2_1.score(X_test_1, Y_test_1)
accuracy2_2 = model2_2.score(X_test_2, Y_test_2)
accuracy3_1 = model3_1.score(X_test_1, Y_test_1)
accuracy3_2 = model3_2.score(X_test_2, Y_test_2)

print('Accuracy GaussianNB + CountVectorizer', accuracy1_1)
print('Accuracy GaussianNB + Tfidf', accuracy1_2)
print('Accuracy LogisticRegression + CountVectorizer', accuracy2_1)
print('Accuracy LogisticRegression + Tfidf', accuracy2_2)
print('Accuracy MLPClassifier + CountVectorizer', accuracy3_1)
print('Accuracy MLPClassifier + Tfidf', accuracy3_2)



Accuracy GaussianNB + CountVectorizer 0.5384615384615384
Accuracy GaussianNB + Tfidf 0.5192307692307693
Accuracy LogisticRegression + CountVectorizer 0.7115384615384616
Accuracy LogisticRegression + Tfidf 0.6730769230769231
Accuracy MLPClassifier + CountVectorizer 0.6730769230769231
Accuracy MLPClassifier + Tfidf 0.6730769230769231
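In the cells above the vectorizers are fitted on all texts before the split, so the vocabulary and the IDF weights also reflect the test texts. One way to avoid this, sketched here with hypothetical toy data in place of the reviews, is to chain the vectorizer and the classifier with `make_pipeline`. Calling `fit` on the pipeline then fits the vectorizer on the training texts only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the review texts and decisions
toy_texts = ['excellent clear novel contribution', 'clear and solid work',
             'poor validation weak results', 'weak and unclear paper']
toy_labels = ['accept', 'accept', 'reject', 'reject']

# The pipeline takes raw strings: it vectorizes first, then classifies
text_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
text_model.fit(toy_texts, toy_labels)

# New texts are transformed with the vocabulary learned during fit
predictions = text_model.predict(['solid contribution', 'weak validation'])
print(predictions)
```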


<h1>Activity: Build a classifier to tell individuals from companies</h1>

In [48]:
import pickle

with open('datasets/name_company_corpus.pickle', 'rb') as f:
    corpus = pickle.load(f)

names = [x[0] for x in corpus if x[1] == 'NAME']
companies = [x[0] for x in corpus if x[1] == 'COMPANY']

all_texts = names + companies
labels = [0]*(len(names)) + [1]*(len(companies))
print('NUM PEOPLE', len(names))
print('NUM COMPANIES', len(companies))

NUM PEOPLE 12000
NUM COMPANIES 13085
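One possible approach to the activity is sketched below. Names are short, so character n-grams may capture more signal than whole words (for example, suffixes such as LTDA). The toy examples are hypothetical and only keep the sketch self-contained; with the real corpus, `all_texts` and `labels` from the cell above would take their place, after a train/test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical examples; the real corpus comes from name_company_corpus.pickle
toy_texts = ['Maria Silva', 'Joao Souza', 'Ana Pereira',
             'Silva Transportes LTDA', 'Souza Comercio SA', 'Pereira Engenharia LTDA']
toy_labels = [0, 0, 0, 1, 1, 1]  # 0 = person, 1 = company

# char_wb builds n-grams of 2 to 4 characters inside word boundaries
name_model = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4)),
    LogisticRegression())
name_model.fit(toy_texts, toy_labels)

predictions = name_model.predict(['Carla Souza', 'Carla Alimentos LTDA'])
print(predictions)
```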
