# Text Classifier using `sklearn`

## Import `fetch_20newsgroups` from sklearn, which loads a dataset of posts labeled by 20 newsgroups

In [1]:
#Loading the data set - training data.
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

## The newsgroups in the dataset are:

In [2]:
# Inspect the target names (categories) and some of the data files with the following commands.
twenty_train.target_names #prints all the categories

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## Inspecting the data

### The header and format of each post look like this

In [3]:
print("\n".join(twenty_train.data[1].split("\n")[:3])) # prints the first three lines of the second post (index 1)

From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: SI Clock Poll - Final Call
Summary: Final call for SI clock reports


In [4]:
twenty_train.data[1]

"From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.\n\nGuy Kuo <guykuo@u.washington.edu>\n"
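As the output above shows, each post is a block of `Key: value` header lines, then a blank line, then the body. A minimal sketch of splitting the two parts, using a shortened copy of the post above (the variable names `sample`, `headers` and `body` are illustrative, not part of the original notebook):

```python
# Shortened copy of the post shown above; headers and body are
# separated by the first blank line.
sample = (
    "From: guykuo@carson.u.washington.edu (Guy Kuo)\n"
    "Subject: SI Clock Poll - Final Call\n"
    "Lines: 11\n"
    "\n"
    "A fair number of brave souls who upgraded their SI clock oscillator have\n"
    "shared their experiences for this poll.\n"
)

# Split once at the first blank line.
header_text, _, body = sample.partition("\n\n")

# Each header line has the form "Key: value".
headers = dict(line.split(": ", 1) for line in header_text.split("\n"))

print(headers["Subject"])
print(body.split("\n")[0])
```

The header fields can leak information about the newsgroup, so `fetch_20newsgroups` also accepts a `remove=('headers', 'footers', 'quotes')` argument to strip them; this notebook keeps them.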

## The words in each text are counted and the texts are vectorized

In [5]:
# Extracting features from text files
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)
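The shape above means 11314 documents and 130107 distinct tokens. To make the idea concrete, here is a small pure-Python sketch of the same bag-of-words counting on a toy corpus. It assumes tokenization similar to the `CountVectorizer` defaults (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the corpus and helper names are invented for illustration.

```python
import re
from collections import Counter

# Toy corpus (illustrative only).
docs = ["the cat sat", "the cat saw the dog", "a dog barked"]

def tokenize(text):
    # Lowercase and keep tokens of at least two word characters,
    # so the single-letter word "a" is dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted alphabetically.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def count_row(doc):
    # One row of the document-term matrix: the count of each vocabulary word.
    token_counts = Counter(tokenize(doc))
    return [token_counts.get(tok, 0) for tok in vocab]

counts = [count_row(doc) for doc in docs]
shape = (len(counts), len(vocab))

print(vocab)
print(counts)
print(shape)
```

The real matrix is stored as a sparse matrix, since almost all of its entries are zero.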

### TF-IDF reweights the word counts: term frequencies are normalized per document so that long documents with many repeated words do not dominate, and words that occur in many documents are down-weighted

In [6]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)
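The reweighting can be worked through by hand on a tiny count matrix. The sketch below assumes the formula documented for `TfidfTransformer` with its default settings (smoothed IDF, then L2 normalization of each row); the matrix and names are made up for illustration.

```python
import math

# Toy document-term counts: 3 documents x 3 terms.
# Term 0 occurs in every document; terms 1 and 2 occur in one document each.
counts = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term occurs.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    # Weight each count by its IDF, then scale the row to unit L2 norm.
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted

tfidf = [tfidf_row(row) for row in counts]

for row in tfidf:
    print([round(v, 3) for v in row])
```

In the first row both terms have a raw count of 1, but the rarer term ends up with the larger weight. The matrix shape is unchanged, which matches the output above.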

# We compare two different methods

## Naive Bayes

### Import the Naive Bayes classifier, create the model and train it on the *training set*

In [7]:
# Machine Learning
# Training Naive Bayes (NB) classifier on training data.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [8]:
# Building a pipeline: the steps above can be chained so that less code is needed.
# The names 'vect', 'tfidf' and 'clf' are arbitrary labels for the pipeline steps.
# We will be using 'text_clf' going forward.
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
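The same three-step pipeline can be tried on a toy corpus to see what the step names are for. This is a sketch with invented documents and labels, not the notebook's data: the names address a step's hyperparameters as `<step>__<param>` and give access to the fitted components through `named_steps`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data: label 0 = graphics, label 1 = baseball.
train_docs = [
    "the gpu renders graphics on the card",
    "new graphics card with fast gpu",
    "the pitcher threw the ball in the game",
    "baseball game with a great pitcher",
]
train_labels = [0, 0, 1, 1]
target_names = ["graphics", "baseball"]

toy_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Step names address hyperparameters as "<step>__<param>".
toy_clf.set_params(clf__alpha=0.5)
toy_clf.fit(train_docs, train_labels)

# Step names also give access to the fitted components.
vocabulary = toy_clf.named_steps["vect"].vocabulary_

# The pipeline takes raw strings and applies every step in order.
new_docs = ["a fast gpu for graphics", "pitcher wins the baseball game"]
predicted_names = [target_names[i] for i in toy_clf.predict(new_docs)]
print(predicted_names)
```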

### Accuracy is evaluated on the *testing set*

In [9]:
# Performance of NB Classifier
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.7738980350504514
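The value above is plain accuracy: the fraction of test posts whose predicted label equals the true label. A single number can hide classes that do poorly, so a per-class breakdown is often worth a look. The sketch below uses made-up labels, not the notebook's predictions.

```python
from collections import defaultdict

# Made-up labels for illustration.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 0, 2, 1]

# Accuracy: share of positions where the prediction matches the truth.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class recall: of the samples that truly belong to a class,
# the share that was predicted correctly.
hits = defaultdict(int)
totals = defaultdict(int)
for t, p in zip(y_true, y_pred):
    totals[t] += 1
    hits[t] += int(t == p)
per_class_recall = {c: hits[c] / totals[c] for c in sorted(totals)}

print(accuracy)
print(per_class_recall)
```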

## SVM

### Import `SGDClassifier` (with hinge loss it trains a linear SVM), create the model and train it on the *training set*

In [10]:
# Training Support Vector Machines - SVM and calculating its performance

from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=15, random_state=42))])

text_clf_svm = text_clf_svm.fit(twenty_train.data, twenty_train.target)
predicted_svm = text_clf_svm.predict(twenty_test.data)
np.mean(predicted_svm == twenty_test.target)

0.8240839086563994
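`SGDClassifier` with `loss='hinge'` and `penalty='l2'` fits a linear SVM by stochastic gradient descent. The sketch below shows the quantity being minimized in the binary case, assuming labels in {-1, +1} and an L2 term of the form `0.5 * alpha * ||w||^2`; the weights and samples are invented, and the multi-class handling (one-vs-rest) is left out.

```python
def hinge_loss(y, score):
    # Zero when the sample is on the correct side with margin >= 1,
    # growing linearly otherwise. y must be -1 or +1.
    return max(0.0, 1.0 - y * score)

def objective(weights, bias, samples, alpha):
    # Mean hinge loss over the samples plus the L2 penalty on the weights.
    def score(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias

    data_term = sum(hinge_loss(y, score(x)) for x, y in samples) / len(samples)
    penalty = 0.5 * alpha * sum(w * w for w in weights)
    return data_term + penalty

# Invented example: two samples with two features each.
samples = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
weights = [2.0, -0.5]

print(hinge_loss(1, 2.5))   # correct side, beyond the margin
print(hinge_loss(1, 0.3))   # correct side, inside the margin
print(hinge_loss(-1, 0.3))  # wrong side
print(objective(weights, 0.0, samples, alpha=1e-3))
```

A larger `alpha` pushes the weights toward zero, which trades training fit for a wider margin.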

## Since the SVM scores higher on the test set (0.824 vs. 0.774), we use it to classify our own texts

In [11]:
text_1 = "A multidisciplinary team from two Johns Hopkins University institutions, including neurotoxicologists and virologists from the Bloomberg School of Public Health and infectious disease specialists from the school of medicine, has found that organoids (tiny tissue cultures made from human cells that simulate whole organs) known as mini-brains can be infected by the SARS-CoV-2 virus that causes COVID-19. \n The results, which suggest that the virus can infect human brain cells, were published online June 26, 2020, in the journal ALTEX: Alternatives to Animal Experimentation. \n Early reports from Wuhan, China, the origin of the COVID-19 pandemic, have suggested that 36% of patients with the disease show neurological symptoms, but it has been unclear whether or not the virus infects human brain cells. In their study, the Johns Hopkins researchers demonstrated that certain human neurons express a receptor, ACE2, which is the same one that the SARS-CoV-2 virus uses to enter the lungs. Therefore, they surmised, ACE2 also might provide access to the brain."
class_1 = "Medicina"

text_2 = "Vatican City, Jun 30, 2020 / 07:10 am MT (CNA).- The world needs media that can help young people to distinguish between good and evil, presenting the facts in a clear and unbiased way, Pope Francis said Tuesday.In a June 30 message to members of the Catholic Press Association, the pope appealed to Catholic journalists to help break down barriers of misunderstanding between people. We need media capable of building bridges, defending life and breaking down the walls, visible and invisible, that prevent sincere dialogue and truthful communication between individuals and communities he wrote.\n We need media that can help people, especially the young, to distinguish good from evil, to develop sound judgments based on a clear and unbiased presentation of the facts, and to understand the importance of working for justice, social concord and respect for our common home."
class_2 = "Religion"

text_3 = "High school baseball in Japan is a noisy, communal, hyperkinetic affair. It starts right away, as coaches hit pregame infield practice with three baseballs going at once, one after another after another, and the constant collision of chatter resembles an aviary. It's difficult to square this raucous spectacle with the perception of baseball in America, where detractors see a staid, inert sport incapable of capturing the attention of the terminally inattentive. \n Every baseball field in Japan is considered sacred ground, and before each game the players gather in a ruler-straight line in front of their dugouts and bow to the earth to thank it for providing the canvas for their endeavor. It's deep, and it's portrayed in riveting detail in Koshien: Japan's Field of Dreams, a documentary that premieres at 7 p.m. Monday on ESPN. It highlights a difficult truth: There is very little levity that comes with playing high school baseball at a high level in Japan; it is a task filled with regimented training and complete devotion, heavy responsibility and historical accountability."
class_3 = "Baseball"

text = (text_1, text_2, text_3)

In [12]:
predict_workshop = text_clf_svm.predict(text)

In [13]:
predict = [twenty_train.target_names[element] for element in predict_workshop]
print(list(zip((class_1, class_2, class_3), predict)))

[('Medicina', 'sci.med'), ('Religion', 'soc.religion.christian'), ('Baseball', 'rec.sport.baseball')]


Based on *https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a*