Curso de Especialização de Inteligência Artificial Aplicada

Setor de Educação Profissional e Tecnológica - SEPT

Universidade Federal do Paraná - UFPR

---

**IAA003 - Linguagem de Programação Aplicada**

Prof. Alexander Robert Kutzke

# Implementation with Scikit-Learn

Using the dataset in the repository:

1. Write a *text classification pipeline* to classify movie reviews as positive or negative;
2. Find a good set of parameters using `GridSearchCV`;
3. Evaluate the classifier on the part of the dataset previously set aside for testing;
4. Repeat steps 1, 2, and 3 using a different classification algorithm;
5. Write a short text comparing the results obtained for each algorithm.

The text can be written in a "Jupyter Notebook" alongside the code, or in any other type of document.


Student: Brunno Cunha Mousquer de Oliveira

In [2]:
import random
import pandas as pd
import numpy as np
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

In [3]:
random.seed(42)
np.random.seed(42)

## Common Functions

In [4]:
def get_train_test_data(verbose=True,
                        data_path=r"lpa1/sklearn/sklearn-assignment/data"):
    """Load the review files and return a fixed 75/25 train/test split."""
    dataset = load_files(data_path, shuffle=False)
    x_train, x_test, y_train, y_test = train_test_split(
        dataset.data, dataset.target, test_size=0.25, random_state=42)
    if verbose:
        print(f"n_samples: {len(dataset.data)}")
        print(f"Train data: features: {len(x_train)} | target: {len(y_train)}")
        print(f"Test data: features: {len(x_test)} | target: {len(y_test)}")
    return x_train, x_test, y_train, y_test

In [24]:
def grid_search(model, x_train, y_train):
    # `model` is a ModelBase wrapper: model() returns its Pipeline and
    # model.params() returns the parameter grid to search.
    gs = GridSearchCV(model(), model.params(), n_jobs=-1, verbose=10)
    gs = gs.fit(x_train, y_train)
    print(f'Best score: {gs.best_score_} \n Best Params: {gs.best_params_}')
    # results = pd.DataFrame(gs.cv_results_)
    return gs

In [6]:
def print_metrics(model, predicted, y_test):
    print(f'Acertos: {round(np.mean(predicted == y_test) * 100,2)}%')
    print()
    print("Classification report for classifier %s:\n%s\n"
          % (model, metrics.classification_report(y_test, predicted)))
    print()
    print("Confusion matrix:\n%s" % metrics.confusion_matrix(y_test, predicted))


## Models

In [7]:
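class ModelBase:
    """Thin wrapper around a scikit-learn Pipeline stored in `self.model`."""

    def __init__(self):
        self.model = None

    def __call__(self):
        # Calling the wrapper returns the underlying Pipeline.
        return self.model

    def fit(self, x_train, y_train):
        self.model.fit(x_train, y_train)

    def predict(self, x_test):
        return self.model.predict(x_test)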

In [40]:
class Model_A(ModelBase):
    def __init__(self):
        super().__init__()
        self.model = Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
            ('clf', MultinomialNB())])

    def params(self):
        return {
            'tfidf__norm': ['l1', 'l2', 'max'],
            'tfidf__use_idf': [False, True],
            'tfidf__smooth_idf': [False, True],
            'tfidf__sublinear_tf': [False, True],
            # 'clf__alpha': [v/10 for v in range(11)],  # disabled: could not use alpha < 1.0
            'clf__fit_prior': [False, True]
        }

In [89]:
class Model_B(ModelBase):
    def __init__(self):
        super().__init__()
        self.model = Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
            ('clf', LinearSVC())])

    def params(self):
        return {
            'tfidf__norm': ['l1', 'l2', 'max'],
            'tfidf__use_idf': (False, True),
            'tfidf__smooth_idf': (False, True),
            'tfidf__sublinear_tf': (False, True),
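            # Some penalty/loss/dual combinations are unsupported by LinearSVC
            # (e.g. penalty='l1' with loss='hinge'); those fits fail during the search.
            'clf__penalty' : ['l1', 'l2'],
            'clf__loss' : ['hinge', 'squared_hinge'],
            'clf__dual': [False, True]
            #'clf__C' : [v/10 for v in range(1, 21)],
            #'clf__multi_class': ['ovr']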
        }

In [37]:
class Model_C(ModelBase):
    def __init__(self):
        super().__init__()
        self.model = Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
            ('clf', DecisionTreeClassifier())])

    def params(self):
        return {
            'tfidf__norm': ['l1', 'l2', 'max'],
            'tfidf__use_idf': [False, True],
            'tfidf__smooth_idf': [False, True],
            'tfidf__sublinear_tf': [False, True],
            'clf__criterion' : ['gini', 'entropy'],
            'clf__splitter' : ['best', 'random'],
            # Note: the recorded run below reports 96 candidates, which matches
            # this grid without the two min_samples_* entries.
            'clf__min_samples_split' : list(range(5, 100)),
            'clf__min_samples_leaf' : list(range(5, 20))
        }
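
Before launching a search over a grid like the one in `Model_C.params`, it can help to count how many candidates it contains. The sketch below copies that grid into a standalone dictionary (the variable names `grid`, `n_candidates` and `n_fits` are illustrative, not part of the notebook) and uses `ParameterGrid` to enumerate it, assuming the 5-fold cross-validation reported in the recorded output.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the Model_C grid, written out as a plain dictionary (illustrative).
grid = {
    'tfidf__norm': ['l1', 'l2', 'max'],
    'tfidf__use_idf': [False, True],
    'tfidf__smooth_idf': [False, True],
    'tfidf__sublinear_tf': [False, True],
    'clf__criterion': ['gini', 'entropy'],
    'clf__splitter': ['best', 'random'],
    'clf__min_samples_split': list(range(5, 100)),
    'clf__min_samples_leaf': list(range(5, 20)),
}

# ParameterGrid enumerates the Cartesian product of all value lists.
n_candidates = len(ParameterGrid(grid))

# Assumption: 5 cross-validation folds, as in the recorded "Fitting 5 folds" logs.
n_fits = n_candidates * 5

print(f"candidates: {n_candidates}, fits with 5 folds: {n_fits}")
```

With the two `min_samples_*` ranges included, the product grows to 136,800 candidates, so a narrower range or a coarser step would likely be needed for the search to finish in reasonable time.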

## RUN

In [9]:
x_train, x_test, y_train, y_test = get_train_test_data(data_path=r'data/')

n_samples: 2000
Train data: features: 1500 | target: 1500
Test data: features: 500 | target: 500


### Model A

In [41]:
# Default Params
model_a = Model_A()
model_a.fit(x_train, y_train)
predicted = model_a.predict(x_test)
print_metrics(model_a, predicted, y_test)

Acertos: 80.0%

Classification report for classifier <__main__.Model_A object at 0x000001A089D17B48>:
              precision    recall  f1-score   support

           0       0.80      0.81      0.81       257
           1       0.80      0.79      0.79       243

    accuracy                           0.80       500
   macro avg       0.80      0.80      0.80       500
weighted avg       0.80      0.80      0.80       500



Confusion matrix:
[[209  48]
 [ 52 191]]


In [42]:
# Best parameters (grid search)
gs = grid_search(model_a, x_train, y_train)
best_params_predicted = gs.predict(x_test)
print_metrics(gs, best_params_predicted, y_test)


Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best score: 0.8313333333333335 
 Best Params: {'clf__fit_prior': True, 'tfidf__norm': 'max', 'tfidf__smooth_idf': False, 'tfidf__sublinear_tf': True, 'tfidf__use_idf': False}
Acertos: 82.2%

Classification report for classifier GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                    

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    5.7s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    7.1s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   10.1s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   12.3s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   15.5s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   17.9s
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:   21.9s
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:   25.3s
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:   29.5s
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:   33.3s
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed:   38.2s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   42.7s
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed:   

The hyperparameter search raised test accuracy by about 2 percentage points (from 80.0% to 82.2%)
relative to the model with default parameters.
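
To see how each candidate fared, not only the best one, the `cv_results_` attribute of a fitted `GridSearchCV` can be loaded into a pandas DataFrame. The following is a minimal sketch on a tiny made-up corpus (the documents, labels, reduced grid and `cv=3` are assumptions for illustration, not the assignment data), using the same pipeline steps as Model A.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): 6 positive and 6 negative "reviews".
docs = [
    "great movie loved it", "wonderful acting great plot",
    "loved the wonderful story", "great fun and wonderful cast",
    "a great and moving film", "loved every wonderful minute",
    "terrible movie hated it", "awful acting boring plot",
    "hated the boring story", "terrible pacing and awful cast",
    "a boring and awful film", "hated every terrible minute",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Reduced grid, only to keep the example small.
param_grid = {
    'tfidf__use_idf': [False, True],
    'clf__fit_prior': [False, True],
}

gs = GridSearchCV(pipeline, param_grid, cv=3)
gs.fit(docs, labels)

# One row per candidate, ordered from best to worst mean validation score.
results = pd.DataFrame(gs.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(results)
```

Looking at `std_test_score` next to `mean_test_score` gives a rough sense of whether a gain of a couple of points between candidates is larger than the fold-to-fold variation.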

### Model B

In [90]:
# Default Params
model_b = Model_B()
model_b.fit(x_train, y_train)
predicted = model_b.predict(x_test)
print_metrics(model_b, predicted, y_test)

Acertos: 82.2%

Classification report for classifier <__main__.Model_B object at 0x000001A08FC63A48>:
              precision    recall  f1-score   support

           0       0.83      0.82      0.83       257
           1       0.81      0.82      0.82       243

    accuracy                           0.82       500
   macro avg       0.82      0.82      0.82       500
weighted avg       0.82      0.82      0.82       500



Confusion matrix:
[[211  46]
 [ 43 200]]


In [91]:
# Predict with best parameters
gs = grid_search(model_b, x_train, y_train)
best_params_predicted = gs.predict(x_test)
print_metrics(gs, best_params_predicted, y_test)

Fitting 5 folds for each of 192 candidates, totalling 960 fits
Best score: 0.8726666666666667 
 Best Params: {'clf__dual': True, 'clf__loss': 'hinge', 'clf__penalty': 'l2', 'tfidf__norm': 'l2', 'tfidf__smooth_idf': True, 'tfidf__sublinear_tf': True, 'tfidf__use_idf': True}
Acertos: 84.6%

Classification report for classifier GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
               

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    4.3s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    7.9s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    9.7s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   12.5s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   14.6s
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:   17.9s
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:   20.6s
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:   24.2s
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:   27.3s
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed:   31.5s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   35.3s
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed:   

Model B with default parameters reaches the same test accuracy (82.2%) as Model A with tuned hyperparameters.
Model B uses LinearSVC as its classifier, while Model A uses MultinomialNB.

After the hyperparameter search, Model B reaches 84.6% accuracy.
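
One caveat about the Model B grid: `LinearSVC` is documented to reject some `penalty`/`loss`/`dual` combinations, so part of the 192 candidates cannot be fitted at all. The sketch below checks this empirically on toy data (the arrays `X` and `y` are made up) by trying every combination and collecting those that raise `ValueError`. It then writes the supported ones as a list of dictionaries, a form that `GridSearchCV` also accepts as `param_grid`. The name `valid_clf_grid` is illustrative.

```python
import warnings

import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.svm import LinearSVC

# All 8 combinations of the three classifier options used in Model_B.params.
full_grid = ParameterGrid({
    'penalty': ['l1', 'l2'],
    'loss': ['hinge', 'squared_hinge'],
    'dual': [False, True],
})

# Tiny toy data, used only to trigger LinearSVC's argument check.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
y = np.array([0, 1, 0, 1])

supported, unsupported = [], []
for params in full_grid:
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")  # silence convergence warnings
            LinearSVC(**params).fit(X, y)
        supported.append(params)
    except ValueError:
        unsupported.append(params)

print("supported:", supported)
print("unsupported:", unsupported)

# The supported combinations expressed as a list of sub-grids for a Pipeline
# whose final step is named 'clf' (as in Model_B).
valid_clf_grid = [
    {'clf__penalty': ['l2'], 'clf__loss': ['hinge'], 'clf__dual': [True]},
    {'clf__penalty': ['l2'], 'clf__loss': ['squared_hinge'], 'clf__dual': [False, True]},
    {'clf__penalty': ['l1'], 'clf__loss': ['squared_hinge'], 'clf__dual': [False]},
]
```

Restricting the search to supported combinations would avoid spending fits on candidates that can only fail. The TF-IDF options can be added to each sub-grid in the same way.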

### Model C

In [38]:
# Default Params
model_c = Model_C()
model_c.fit(x_train, y_train)
predicted = model_c.predict(x_test)
print_metrics(model_c, predicted, y_test)

Acertos: 64.8%

Classification report for classifier <__main__.Model_C object at 0x000001A08A0F8A88>:
              precision    recall  f1-score   support

           0       0.65      0.67      0.66       257
           1       0.64      0.62      0.63       243

    accuracy                           0.65       500
   macro avg       0.65      0.65      0.65       500
weighted avg       0.65      0.65      0.65       500



Confusion matrix:
[[173  84]
 [ 92 151]]


In [39]:
# Predict with best parameters
gs = grid_search(model_c, x_train, y_train)
best_params_predicted = gs.predict(x_test)
print_metrics(gs, best_params_predicted, y_test)

Fitting 5 folds for each of 96 candidates, totalling 480 fits
Best score: 0.6719999999999999 
 Best Params: {'clf__criterion': 'gini', 'clf__splitter': 'best', 'tfidf__norm': 'max', 'tfidf__smooth_idf': False, 'tfidf__sublinear_tf': False, 'tfidf__use_idf': False}
Acertos: 60.4%

Classification report for classifier GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                        

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    3.7s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    5.8s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    9.9s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   12.3s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   18.7s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   23.4s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   29.2s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   33.9s
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:   41.8s
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:   49.5s
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:   58.7s
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed:  1

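Model C had the worst performance of the three: 64.8% accuracy with default parameters and 60.4% after
the grid search, so here the search did not improve the test result. Decision trees may not be a good
option for this kind of classification.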
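
The accuracy, precision and recall figures reported for the default-parameter runs can be recomputed directly from the recorded confusion matrices, which makes the comparison between the three models easier to verify. The sketch below copies those matrices from the outputs above; the helper `summarize` and the dictionary names are illustrative.

```python
import numpy as np

# Confusion matrices copied from the recorded default-parameter runs
# (rows = true class, columns = predicted class).
confusion = {
    "Model A (MultinomialNB)": np.array([[209, 48], [52, 191]]),
    "Model B (LinearSVC)": np.array([[211, 46], [43, 200]]),
    "Model C (DecisionTreeClassifier)": np.array([[173, 84], [92, 151]]),
}


def summarize(cm):
    """Return (accuracy, per-class precision, per-class recall)."""
    accuracy = np.trace(cm) / cm.sum()         # correct predictions / all samples
    precision = np.diag(cm) / cm.sum(axis=0)   # correct / predicted as that class
    recall = np.diag(cm) / cm.sum(axis=1)      # correct / truly in that class
    return accuracy, precision, recall


summary = {name: summarize(cm) for name, cm in confusion.items()}
for name, (acc, prec, rec) in summary.items():
    print(f"{name}: accuracy={acc:.3f} "
          f"precision={np.round(prec, 2)} recall={np.round(rec, 2)}")
```

For example, Model A classifies 209 + 191 = 400 of the 500 test reviews correctly, which is the 80.0% reported above.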


