# Building a Church Classifier with SnorkelML

Hello!

In this notebook we will use the fantastic [SnorkelML](snorkel.org/get-started/) tool to build a classifier for Portuguese-language evangelical churches.

This work is part of the dadascope project.

## Summary

1. Open the database and select the unique company names (razões sociais).
2. Create the labeling functions (LFs) to be tested.
3. Build the models and apply them to the development and test datasets.
4. Evaluate training.
5. Create the labeled dataset.

In [30]:
import pandas as pd
import os
import re
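# Depending on the installed Snorkel version, LabelModel and MajorityLabelVoter
# may need to be imported from snorkel.labeling.model instead.
from snorkel.labeling import (LabelModel,
                              PandasLFApplier,
                              labeling_function,
                              LabelingFunction,
                              LFAnalysis,
                              MajorityLabelVoter,
                              filter_unlabeled_dataframe)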

from snorkel.analysis import get_label_buckets

from sklearn.feature_extraction.text import CountVectorizer

os.getcwd()

'/home/henrique/github_repos/igrejasevangelicas'

In [32]:
# Read dataframe
data = pd.read_csv('data/igreja_cnpj_ativo.zip', 
                   encoding='ISO-8859-1',
                   compression='zip')

# Build the training set from the unique company names (razao_social)
df_copy  = data['razao_social'].unique()
df_copy  = pd.Series(df_copy)
df_copy  = pd.DataFrame({'index':df_copy.index, 'razao_social':df_copy.values})

df_train = df_copy.sample(frac=0.95, random_state=0)

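# Draw a sample from the training data (replaced below by the hand-labeled dev file)
df_dev   = df_train.sample(n=300)


# After hand labeling is done, load the labeled data and split it into valid and test halves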
df_copy = pd.read_csv('data/human_labelling/df_valid.csv', 
                      sep=';')
df_valid = df_copy.sample(frac=0.5, random_state=0)
df_test = df_copy.drop(df_valid.index)
df_dev = pd.read_csv('data/human_labelling/df_dev.csv',
                    sep=';')

# Create outcomes
Y_dev   = df_dev.label.values
Y_valid = df_valid.label.values
Y_test  = df_test.label.values



# Create labeling functions

Following the Snorkel documentation, we will write deliberately 'crude' functions that try to classify the churches using keywords and regular expressions.
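The keyword idea can be made stricter with regular expressions. The sketch below is only an illustration and is not part of the original notebook: the helper name `regex_lookup` and the sample company names are hypothetical, and it takes a plain string instead of a DataFrame row. It matches whole words only, so a short keyword such as `sao` does not fire inside `missao`.

```python
import re

ABSTAIN = -1
NOT_TARGET = 0
TARGET = 1


def regex_lookup(razao_social, keywords, label):
    """Return `label` if any keyword appears as a whole word, else ABSTAIN.

    Hypothetical helper for illustration; it is not a Snorkel function.
    """
    # \b anchors each keyword at word boundaries; re.escape guards special characters.
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    if re.search(pattern, razao_social.lower()):
        return label
    return ABSTAIN


# Made-up company names, used only to show the behavior.
print(regex_lookup("IGREJA MISSAO DA FE", ["sao"], NOT_TARGET))  # abstains: "sao" is inside "missao"
print(regex_lookup("PAROQUIA SAO JOSE", ["sao"], NOT_TARGET))    # fires: "sao" is a whole word
```

To use something like this with a `LabelingFunction`, the function would first need to read `x.razao_social` from the row it receives.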

In [20]:
# Define the label mappings for convenience
ABSTAIN = -1
NOT_TARGET = 0
TARGET = 1

In [21]:
def keyword_lookup(x, keywords, label):
    if any(word in x.razao_social.lower() for word in keywords):
        return label
    return ABSTAIN

def make_keyword_lf(keywords, label=TARGET):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )


# Define Targets
keyword_assembleia = make_keyword_lf(keywords=["assembleia", "ass", "assem",
                                               "assemb", "asembleia"])

keyword_evangelica = make_keyword_lf(keywords=["evangelica", "evang"])

keyword_pentecostal = make_keyword_lf(keywords=["pentecostal"])

keyword_iurd = make_keyword_lf(keywords=["universal do reino de deus"])

keyword_internacional = make_keyword_lf(keywords=["internacional"])

keyword_ministerio = make_keyword_lf(keywords=["ministerio", "minist", "min"])

keyword_igreja = make_keyword_lf(keywords=["igreja"])

keyword_sara = make_keyword_lf(keywords=["sara"])

keyword_protestantes = make_keyword_lf(keywords=["luterana", "batista", "metodista", 
                                                "presbiteriana"])


# Define non-targets
keyword_funerario = make_keyword_lf(keywords=["funerarios", "servico", "servicos",
                                             "funeraria", "funeral"],
                                      label=NOT_TARGET)

keyword_associacao = make_keyword_lf(keywords=["associacao"],
                                      label=NOT_TARGET)


keyword_espiritas = make_keyword_lf(keywords=["espirita"],
                                      label=NOT_TARGET)

keyword_catolica = make_keyword_lf(keywords=["catolica"],
                                      label=NOT_TARGET)

keyword_irmas = make_keyword_lf(keywords=["irmas"],
                                      label=NOT_TARGET)

keyword_moradores = make_keyword_lf(keywords=["moradores"],
                                      label=NOT_TARGET)

keyword_senhora = make_keyword_lf(keywords=["senhora"],
                                      label=NOT_TARGET)

keyword_paroquia = make_keyword_lf(keywords=["paroquia"],
                                      label=NOT_TARGET)

keyword_sao = make_keyword_lf(keywords=["sao"],
                              label=NOT_TARGET)

keyword_santo = make_keyword_lf(keywords=["santo"],
                              label=NOT_TARGET)

keyword_maconica = make_keyword_lf(keywords=["maconica"],
                              label=NOT_TARGET)

keyword_educacional = make_keyword_lf(keywords=["educacional"],
                              label=NOT_TARGET)

keyword_comunitario = make_keyword_lf(keywords=["comunitario"],
                              label=NOT_TARGET)

keyword_treinamento = make_keyword_lf(keywords=["treinamento"],
                              label=NOT_TARGET)

keyword_instituto = make_keyword_lf(keywords=["instituto"],
                              label=NOT_TARGET)


keyword_kardec = make_keyword_lf(keywords=["kardec"],
                              label=NOT_TARGET)

keyword_umbanda = make_keyword_lf(keywords=["umbanda"],
                              label=NOT_TARGET)

keyword_caboclo = make_keyword_lf(keywords=["caboclo"],
                              label=NOT_TARGET)

keyword_tenda = make_keyword_lf(keywords=["tenda"],
                              label=NOT_TARGET)

keyword_ogum = make_keyword_lf(keywords=["ogum"],
                              label=NOT_TARGET)

keyword_ubirajara = make_keyword_lf(keywords=["ubirajara"],
                              label=NOT_TARGET)

keyword_oxala = make_keyword_lf(keywords=["oxala"],
                              label=NOT_TARGET)

keyword_cacique = make_keyword_lf(keywords=["cacique"],
                              label=NOT_TARGET)

keyword_iemanja = make_keyword_lf(keywords=["yemanja", "iemanja"],
                              label=NOT_TARGET)

keyword_oxossi = make_keyword_lf(keywords=["oxossi"],
                              label=NOT_TARGET)

keyword_coral = make_keyword_lf(keywords=["coral"],
                              label=NOT_TARGET)

In [22]:
# Create a list of labeling functions
lfs = [keyword_assembleia,
keyword_evangelica,
keyword_pentecostal, 
keyword_iurd,
keyword_internacional,
keyword_ministerio,
keyword_igreja,
keyword_sara,
keyword_protestantes,
keyword_espiritas,
keyword_funerario,
keyword_associacao,
keyword_catolica,
keyword_senhora,
keyword_paroquia, 
keyword_sao,
keyword_santo, 
keyword_maconica, 
keyword_educacional, 
keyword_comunitario,
keyword_treinamento,
keyword_instituto,
keyword_kardec,
keyword_umbanda, 
keyword_caboclo,
keyword_tenda,
keyword_ogum,
keyword_ubirajara,
keyword_oxala,
keyword_cacique, 
keyword_iemanja,
keyword_oxossi,
keyword_coral]

In [23]:
# Apply the labeling functions to each dataset
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_dev   = applier.apply(df=df_dev)
L_valid = applier.apply(df=df_valid)

100%|██████████| 73688/73688 [01:47<00:00, 684.34it/s]
100%|██████████| 100/100 [00:00<00:00, 992.30it/s]
100%|██████████| 219/219 [00:00<00:00, 1014.47it/s]


In [24]:
# Summarize labeling function performance on the dev set
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)

| Labeling function | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| keyword_assembleia | 0 | [1] | 0.2 | 0.18 | 0.05 | 17 | 3 | 0.85 |
| keyword_evangelica | 1 | [1] | 0.31 | 0.28 | 0.05 | 31 | 0 | 1.0 |
| keyword_pentecostal | 2 | [1] | 0.19 | 0.18 | 0.0 | 19 | 0 | 1.0 |
| keyword_universal do reino de deus | 3 | [] | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 |
| keyword_internacional | 4 | [] | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 |
| keyword_ministerio | 5 | [1] | 0.16 | 0.13 | 0.01 | 16 | 0 | 1.0 |
| keyword_igreja | 6 | [1] | 0.61 | 0.54 | 0.07 | 61 | 0 | 1.0 |
| keyword_sara | 7 | [1] | 0.03 | 0.03 | 0.0 | 3 | 0 | 1.0 |
| keyword_luterana | 8 | [1] | 0.22 | 0.22 | 0.06 | 21 | 1 | 0.954545 |
| keyword_espirita | 9 | [0] | 0.06 | 0.01 | 0.01 | 6 | 0 | 1.0 |
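The Coverage and Emp. Acc. columns in the summary above can be checked by hand. The sketch below is a minimal illustration, assuming coverage is the share of records where the function does not abstain and empirical accuracy is correct votes divided by non-abstain votes. The helper name `coverage_and_accuracy` is hypothetical, and the votes are synthetic, shaped to match the `keyword_assembleia` row (100 dev records, 17 correct, 3 incorrect).

```python
ABSTAIN = -1


def coverage_and_accuracy(lf_votes, gold):
    """Coverage = share of non-abstain votes; accuracy = correct / non-abstain."""
    labeled = [(v, y) for v, y in zip(lf_votes, gold) if v != ABSTAIN]
    coverage = len(labeled) / len(lf_votes)
    correct = sum(1 for v, y in labeled if v == y)
    accuracy = correct / len(labeled) if labeled else 0.0
    return coverage, accuracy


# Synthetic data: 20 votes for class 1 (17 right, 3 wrong) and 80 abstains.
lf_votes = [1] * 20 + [ABSTAIN] * 80
gold = [1] * 17 + [0] * 3 + [0] * 80

coverage, accuracy = coverage_and_accuracy(lf_votes, gold)
print(coverage, accuracy)  # 0.2 0.85
```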


# Build models to classify churches

Time to build two models and compare the accuracy of their classifications.
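As a rough intuition for the simpler of the two models, a majority vote takes each record's row of labeling function outputs and returns the most common non-abstain label. The sketch below is a plain-Python illustration with made-up votes, not Snorkel's implementation. Resolving ties by abstaining is an assumption of this sketch, and the name `majority_vote` is hypothetical.

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(votes):
    """Return the most common non-abstain vote, or ABSTAIN on a tie or no votes."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# One row per record, one column per labeling function (made-up values).
L_example = [
    [1, 1, -1, 0],     # two votes for 1, one for 0
    [-1, -1, -1, -1],  # every function abstains
    [1, 0, -1, -1],    # tie between 1 and 0
    [0, 0, 1, -1],     # two votes for 0, one for 1
]
print([majority_vote(row) for row in L_example])  # [1, -1, -1, 0]
```

The label model, by contrast, does not weight every function equally; it estimates how reliable each one is from the label matrix.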

In [25]:
# Majority Model
majority_model = MajorityLabelVoter()
preds_train = majority_model.predict(L=L_train)

# Snorkel's LabelModel
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=1000, lr=0.001, log_freq=100, seed=123)

In [26]:
# Compare Model Metrics
majority_acc = majority_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")

label_model_acc = label_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")

Majority Vote Accuracy:   90.4%
Label Model Accuracy:     91.8%


# Apply the model and save the dataset

Time to apply the model with the best accuracy and save the dataset for later analysis.

In [27]:
# Label matrix for the full dataset (kept separate from the training matrix)
L_data = applier.apply(df=data)
data['is_evangelic'] = label_model.predict(L=L_data, tie_break_policy="abstain")

100%|██████████| 152269/152269 [02:41<00:00, 944.43it/s] 


In [28]:
data['is_evangelic'].value_counts()

 1    106941
-1     23966
 0     21362
Name: is_evangelic, dtype: int64
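The counts above are easier to read as shares of the full dataset. The short sketch below recomputes them from the printed numbers, where `-1` is the abstain label defined earlier in the notebook.

```python
# Counts copied from the value_counts() output above.
counts = {1: 106941, -1: 23966, 0: 21362}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)   # 152269
print(shares)  # {1: 70.2, -1: 15.7, 0: 14.0}
```

So about 70% of the records are labeled as evangelical churches, about 14% as non-targets, and the model abstains on the remaining 16% or so.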

In [29]:
data.to_csv('data/final/cnae_labelled_data.csv.gz', 
           compression='gzip')