# Tutorial 1: Text Classification


----------------------------------

## Brief summary of the previous lectures


### What is text classification?

According to [Wikipedia](https://en.wikipedia.org/wiki/Document_classification):

    "The task is to assign a document to one or more classes or categories"


Some examples:

- Assigning subject categories, topics, or genres
- Spam detection 
- Authorship identification
- Age/gender identification
- Language Identification
- Sentiment analysis
- ...

### Formal definition

Input:
- A document $d$
- A fixed set of classes $C=\{c_{1}, c_{2}, \dots, c_{J}\}$

Output:    

- A predicted class $c \in C$

### Types of classification techniques

- Hand-coded rules (not covered in this tutorial).
- Supervised Machine Learning: 
    - Naïve Bayes
    - Logistic regression
    - Support vector machines
    - k-Nearest Neighbors


## Goal of this tutorial

Introduce you to the first topics and common tools of NLP.

To do so, we will implement several text classification models aimed at **predicting the category of news articles from Radio Biobio**.

The techniques we will use are the ones covered in class:

- Naive Bayes
- Logistic regression

The tools we will use (which are **required** to run this notebook) are:

- Pandas
- Scikit-Learn
- Spacy
- NLTK



## Credits

All the extracted news articles belong to [Biobio Chile](https://www.biobiochile.cl/), who kindly license all their material under the [Creative Commons license (CC-BY-NC)](https://creativecommons.org/licenses/by-nc/2.0/cl/).

## References

Course GitHub:
- https://github.com/dccuchile/CC6205

Slides:
- https://web.stanford.edu/~jurafsky/slp3/slides/7_NB.pdf


Miscellaneous code:
- https://affectivetweets.cms.waikato.ac.nz/benchmark/

Some useful resources:
- [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [Scikit-learn Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf)
- [Spacy Tutorial](https://www.datacamp.com/community/blog/spacy-cheatsheet)
- [NLTK Cheat sheet](http://sapir.psych.wisc.edu/programming_for_psychologists/cheat_sheets/Text-Analysis-with-NLTK-Cheatsheet.pdf)

## Imports

In [1]:
import pandas as pd    
import spacy
import nltk

from sklearn.feature_extraction.text import CountVectorizer  
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, cohen_kappa_score, classification_report

from nltk.stem import SnowballStemmer
from spacy.lang.es.stop_words import STOP_WORDS

nlp = spacy.load("es_core_news_sm", disable=['ner', 'parser', 'tagger'])
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pablo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Data

### Load the datasets

In [2]:
nacional = pd.read_json("./datasets/biobio_nacional.json", encoding ='utf-8')
internacional = pd.read_json("./datasets/biobio_internacional.json", encoding ='utf-8')
economia = pd.read_json("./datasets/biobio_economia.json", encoding ='utf-8')
sociedad = pd.read_json("./datasets/biobio_sociedad.json", encoding ='utf-8')
opinion = pd.read_json("./datasets/biobio_opinion.json", encoding ='utf-8')

In [3]:
sociedad.describe()

| | publication_date | publication_hour | author | author_link | title | link | category | subcategory | content | tags | embedded_links |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 |
| unique | 42 | 167 | 13 | 13 | 200 | 200 | 1 | 1 | 196 | 191 | 3 |
| top | 26/06/2019 | 15:12 | César Vega Martínez | /lista/autores/cevega | Arqueólogos aseguran haber hallado el lugar ex... | https://www.biobiochile.cl/noticias/sociedad/c... | Sociedad | Sociedad | | [] | [] |
| freq | 10 | 3 | 78 | 78 | 1 | 1 | 200 | 200 | 5 | 9 | 198 |


#### Example of a news article in the Sociedad category:

In [4]:
sample = sociedad.iloc[19:20]

In [5]:
sample

| | publication_date | publication_hour | author | author_link | title | link | category | subcategory | content | tags | embedded_links |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 02/08/2019 | 15:08 | Emilio Contreras | /lista/autores/Econtreras | Chile deja de utilizar 16.170 toneladas de bol... | https://www.biobiochile.cl/noticias/sociedad/d... | Sociedad | Sociedad | Chile dejó de utilizar 16.170 toneladas de b... | [#16.170 toneladas, #balance, #bolsas plástica... | [] |


In [6]:
# Select the fields by column name instead of by position.
sample_content = sample["content"].iloc[0]
sample_category = sample["category"].iloc[0]

In [7]:
print("\033[1mContenido:\033[0m\n\n", sample_content.strip(), "\n\n\033[1mClase:\033[0m\n\n", sample_category)

Contenido:

 Chile dejó de utilizar 16.170 toneladas de bolsas plásticas desde que hace un año implementó una ley que prohíbe su entrega en supermercados y retails, informó este viernes el ministerio del Medio Ambiente.  Tras un periodo de prueba de seis meses, el gobierno chileno puso en vigencia en agosto del año pasado la ley que evitó el consumo de unas 2.200 millones de bolsas plásticas, una reducción significativa tomando en cuenta que, hasta la promulgación de la norma, Chile producía 3.200 millones de bolsas anuales .  “Si consideramos el peso de estas bolsas que se dejaron de entregar, unas 16.170 toneladas, equivalen a 13.940 autos”, indicó el ministerio en un comunicado.  Desde la puesta en marcha de la norma, los chilenos asumieron el hábito de utilizar bolsas reutilizables, de tela o material reciclable, lo cual ha colaborado en reducir la contaminación que producen los sacos plásticos principalmente en los océanos, en los que yacen unos 13 millones de toneladas de plástic

### Tokenization

What was tokenization?

    In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning).
    
Reference: [Tokenization on Wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization)
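As a rough illustration of what a tokenizer does, the following sketch splits text with a regular expression. It is a simplification and not how spaCy tokenizes; the pattern and the `simple_tokenize` name are assumptions made for this example only.

```python
import re

# Hypothetical pattern: numbers with "." or "," separators, then words,
# then any single punctuation character.
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into number, word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Chile dejó de utilizar 16.170 toneladas.")
print(tokens)
```

A real tokenizer such as spaCy's also handles abbreviations, URLs and language-specific exceptions, which this sketch ignores.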

In [8]:
tokenized_content = [word.text for word in  nlp(sample_content)]

In [9]:
pd.DataFrame(tokenized_content)

| | 0 |
|---|---|
| 0 | |
| 1 | Chile |
| 2 | dejó |
| 3 | de |
| 4 | utilizar |
| 5 | 16.170 |
| 6 | toneladas |
| 7 | de |
| 8 | bolsas |
| 9 | plásticas |


### Stopwords

What were stopwords?

    In computing, stop words are words which are filtered out before or after processing of natural language data (text). Stop words are generally the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools avoid removing stop words to support phrase search.

Reference: [Stop words on Wikipedia](https://en.wikipedia.org/wiki/Stop_words)

In [10]:
pd.DataFrame(STOP_WORDS).sample(20)

| | 0 |
|---|---|
| 127 | estaba |
| 336 | de |
| 272 | podrian |
| 367 | estamos |
| 160 | hizo |
| 445 | los |
| 340 | ningunos |
| 314 | hacen |
| 95 | última |
| 449 | últimos |


In [11]:
tokenized_content_no_stop_words = [token for token in tokenized_content if token not in STOP_WORDS ]

In [12]:
pd.DataFrame(tokenized_content_no_stop_words).sample(20)

| | 0 |
|---|---|
| 105 | toneladas |
| 64 | 16.170 |
| 6 | plásticas |
| 128 | importante |
| 215 | polietileno |
| 9 | ley |
| 168 | % |
| 111 | |
| 148 | año |
| 191 | o |


### Stemming

What was stemming?

    Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.
    
Reference: [Stemming on Wikipedia](https://en.wikipedia.org/wiki/Stemming)

#### Examples


| word | stem of the word |
|---|---|
| working | work |
| worked | work |
| works | work |

In [13]:
stemmer = SnowballStemmer('spanish')
stemmed_content = [stemmer.stem(word) for word in tokenized_content]

In [14]:
pd.DataFrame(zip(tokenized_content, stemmed_content), columns=['original', 'stem']).sample(15)

| | original | stem |
|---|---|---|
| 260 | que | que |
| 139 | el | el |
| 31 | ministerio | ministeri |
| 313 | % | % |
| 40 | de | de |
| 68 | bolsas | bols |
| 95 | “ | “ |
| 316 | envases | envas |
| 292 | en | en |
| 309 | espera | esper |


### Lemmatization

What was lemmatization?


    Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.


Reference: [Lemmatization on Wikipedia](https://en.wikipedia.org/wiki/Lemmatisation)

#### Examples

| word | lemma |
|---|---|
| dije | decir |
| guapas | guapo |
| mesas | mesa |


In [15]:
lemmatized_content = [word.lemma_ for word in nlp(sample_content)]

In [16]:
# Visualize the lemmatization
pd.DataFrame(zip(tokenized_content, lemmatized_content), columns=['original', 'lemma']).sample(15)

| | original | lemma |
|---|---|---|
| 62 | consumo | consumir |
| 197 | | |
| 335 | la | lo |
| 324 | “ | “ |
| 359 | del | del |
| 316 | envases | envase |
| 27 | informó | informar |
| 308 | gobierno | gobernar |
| 3 | de | de |
| 53 | agosto | agostar |


### Bag of Words

What is it?


    
    The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words 
    
Reference: [BoW on Wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model)

### Example

- Doc1: 'I love dogs'
- Doc2: 'I hate dogs and knitting.'
- Doc3: 'Knitting is my hobby and my passion.'

![BOW](https://i1.wp.com/datameetsmedia.com/wp-content/uploads/2017/05/bagofwords.004.jpeg?resize=1024%2C260)
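The three documents of the example can be turned into count vectors by hand. The following is a minimal sketch using only the standard library; the `bag_of_words` helper and its very crude tokenization (lowercasing, dropping periods, splitting on whitespace) are assumptions made for this illustration, not what `CountVectorizer` does internally.

```python
from collections import Counter

docs = {
    "Doc1": "I love dogs",
    "Doc2": "I hate dogs and knitting.",
    "Doc3": "Knitting is my hobby and my passion.",
}

def bag_of_words(text):
    """Lowercase, drop periods, split on whitespace and count the tokens."""
    tokens = text.lower().replace(".", "").split()
    return Counter(tokens)

bags = {name: bag_of_words(text) for name, text in docs.items()}

# Shared vocabulary, sorted so every document uses the same column order.
vocabulary = sorted(set().union(*bags.values()))

# One count vector per document; missing words count as 0.
vectors = {name: [bag[word] for word in vocabulary] for name, bag in bags.items()}

print(vocabulary)
for name, vector in vectors.items():
    print(name, vector)
```

Word order is lost in this representation: only how many times each vocabulary word occurs in each document is kept.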

In [17]:
# Tokenizers for CountVectorizer

def tokenizer(doc):
    # Plain spaCy tokenization.
    return [x.orth_ for x in nlp(doc)]

def tokenizer_with_stopwords(doc):
    # Tokenize and drop the tokens found in spaCy's Spanish stop-word list.
    return [x.orth_ for x in nlp(doc) if x.orth_ not in STOP_WORDS]

def tokenizer_with_lemmatization(doc):
    # Replace each token with its lemma.
    return [x.lemma_ for x in nlp(doc)]

def tokenizer_with_stemming(doc):
    # Tokenize with spaCy, then stem each token with NLTK's Snowball stemmer.
    stemmer = SnowballStemmer('spanish')
    return [stemmer.stem(x.orth_) for x in nlp(doc)]

In [18]:
vectorizer = CountVectorizer(analyzer='word', tokenizer = tokenizer, ngram_range=(1,2))  
bow = vectorizer.fit_transform(opinion.sample(4).content)
bow[0]

<1x2305 sparse matrix of type '<class 'numpy.int64'>'
	with 216 stored elements in Compressed Sparse Row format>
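The `ngram_range=(1,2)` argument used above asks the vectorizer to count single tokens (unigrams) and pairs of consecutive tokens (bigrams). A small sketch of that idea in plain Python follows; the `word_ngrams` helper is a hypothetical name for this example and is not part of scikit-learn.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all n-grams with min_n <= n <= max_n, joined by a space."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

tokens = ["chile", "dejó", "de", "utilizar"]
features = word_ngrams(tokens, 1, 2)
print(features)
```

Each distinct n-gram becomes one column of the bag-of-words matrix, which is why the vocabulary grows quickly as the upper bound of the range increases.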

## Process the datasets

Select only the relevant columns and split into training and test sets.

In [19]:
def process_datasets(datasets):
    dataset = pd.concat(datasets)
    X_train, X_test, y_train, y_test = train_test_split(dataset.content, dataset.category, test_size=0.33, random_state=42)
    
    return X_train, X_test, y_train, y_test


In [20]:
datasets = [nacional, internacional, economia, sociedad, opinion]
X_train, X_test, y_train, y_test = process_datasets(datasets)

## Topic classification with Naive Bayes

- Simple (“naïve”) classification method based on Bayes rule
- Relies on very simple representation of document
- Bag of words

For a document $d$ and a class $c$, the probability of the class given the document is:

$$P(c|d) = \frac{P(d|c)P(c)}{P(d)}$$

Let MAP stand for maximum a posteriori, that is, the most probable class:

$$ c_{MAP} = \operatorname*{argmax}_{c \in C} P(c|d)$$

Applying Bayes' theorem:

$$ c_{MAP} = \operatorname*{argmax}_{c \in C} \frac{P(d|c)P(c)}{P(d)} $$

We drop the denominator, since $P(d)$ is the same for every class:

$$ c_{MAP} = \operatorname*{argmax}_{c \in C} P(d|c)P(c) $$

If we now represent the document $d$ as a sequence of words $x_1, x_2, ..., x_n$ (thinking of bag of words):

$$ c_{MAP} = \operatorname*{argmax}_{c\in C} P(x_1, x_2, ..., x_n | c)P(c) $$

Estimating $P(x_1, x_2, ..., x_n | c)$ directly would require $O(|X|^n \cdot |C|)$ parameters, which could only be estimated from a very large number of training examples.
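
A step the derivation above leaves implicit is how Naive Bayes keeps the number of parameters manageable. A sketch of the standard argument, assuming that word positions do not matter (bag of words) and that the words are conditionally independent given the class:

$$ P(x_1, x_2, ..., x_n | c) = P(x_1|c) \cdot P(x_2|c) \cdots P(x_n|c) = \prod_{i=1}^{n} P(x_i|c) $$

Under that assumption the decision rule becomes:

$$ \operatorname*{argmax}_{c \in C} P(c) \prod_{i=1}^{n} P(x_i|c) $$

Only $O(|X| \cdot |C|)$ word probabilities plus $|C|$ class priors then need to be estimated.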

### Set up the pipeline

In [21]:
# Which tokenizer will we use?
TOKENIZER = tokenizer_with_lemmatization

vectorizer = CountVectorizer(analyzer='word', tokenizer = TOKENIZER, ngram_range=(1,3))  
clf = MultinomialNB()   

text_clf = Pipeline([('vect', vectorizer), ('clf', clf)])
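
To make the role of `MultinomialNB` in this pipeline more concrete, here is a from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing, assuming words are conditionally independent given the class. The function names and the toy training data are made up for this illustration; it is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(c) and log P(w|c) with add-alpha smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab

def predict_nb(tokens, log_prior, log_likelihood, vocab):
    """Return the class with the highest log P(c) + sum of log P(w|c)."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in vocab
        )
    return max(scores, key=scores.get)

# Toy, made-up training data (not taken from the Biobio datasets).
train_docs = [
    ["banco", "inflación", "dólar"],
    ["dólar", "mercado", "banco"],
    ["presidente", "cumbre", "países"],
    ["países", "guerra", "cumbre"],
]
train_labels = ["Economia", "Economia", "Internacional", "Internacional"]

model = train_nb(train_docs, train_labels)
prediction = predict_nb(["banco", "dólar", "cumbre"], *model)
print(prediction)
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long documents.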

### Train

In [22]:
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 3), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_with_lemmatization at 0x000001E085876158>,
                                 vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

### Evaluation

In [23]:
predicted = text_clf.predict(X_test)

conf = confusion_matrix(y_test, predicted)
kappa = cohen_kappa_score(y_test, predicted) 
class_rep = classification_report(y_test, predicted)

print('\nConfusion Matrix for Naive Bayes + ngram features:')
print(conf)
print('\nClassification Report')
print(class_rep)
print('\nkappa:'+str(kappa))



Confusion Matrix for Naive Bayes + ngram features:
[[56  0  2 11  1]
 [ 0 64  1  2 10]
 [ 1  0 37 19  1]
 [ 0  0  0 66  0]
 [ 0  5  0  3 51]]

Classification Report
               precision    recall  f1-score   support

     Economia       0.98      0.80      0.88        70
Internacional       0.93      0.83      0.88        77
     Nacional       0.93      0.64      0.76        58
      Opinion       0.65      1.00      0.79        66
     Sociedad       0.81      0.86      0.84        59

     accuracy                           0.83       330
    macro avg       0.86      0.83      0.83       330
 weighted avg       0.86      0.83      0.83       330


kappa:0.7873270881763988
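
The figures in the report can be recomputed by hand from the confusion matrix. The sketch below copies the matrix printed above and assumes that rows are true classes, columns are predicted classes, and both follow the class order of the report; it uses only plain Python rather than scikit-learn.

```python
# Confusion matrix copied from the output above
# (assumed layout: rows = true class, columns = predicted class).
labels = ["Economia", "Internacional", "Nacional", "Opinion", "Sociedad"]
conf = [
    [56, 0, 2, 11, 1],
    [0, 64, 1, 2, 10],
    [1, 0, 37, 19, 1],
    [0, 0, 0, 66, 0],
    [0, 5, 0, 3, 51],
]

total = sum(sum(row) for row in conf)
row_sums = [sum(row) for row in conf]        # true examples per class (support)
col_sums = [sum(col) for col in zip(*conf)]  # predictions per class

# Per-class precision and recall from the diagonal.
for i, label in enumerate(labels):
    precision = conf[i][i] / col_sums[i]
    recall = conf[i][i] / row_sums[i]
    print(f"{label:>13}  precision={precision:.2f}  recall={recall:.2f}")

# Cohen's kappa: observed agreement corrected by the agreement expected by chance.
observed = sum(conf[i][i] for i in range(len(conf))) / total
expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
kappa = (observed - expected) / (1 - expected)
print(f"accuracy={observed:.2f}  kappa={kappa:.4f}")
```

The fourth column shows why Opinion has perfect recall but low precision: all 66 Opinion articles are recognized, yet 35 articles from other classes are also labeled as Opinion.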


### Examples

In [24]:
text_clf.predict(["En puerto montt se encontró un perrito, que aparentemente, habría consumido drogas de alto calibre. Producto de esto, padecera severa caña durante varios dias."])

array(['Sociedad'], dtype='<U13')

In [25]:
text_clf.predict(["kim jong un será el próximo candidato a ministro de educación."])

array(['Internacional'], dtype='<U13')

In [26]:
text_clf.predict(["El banco mundial presentó para chile un decrecimiento económico de 92% y una inflación de 8239832983289%."])

array(['Economia'], dtype='<U13')

## Logistic Regression

(Detailed explanation to be added here.)

### Pipeline

In [27]:
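# Which tokenizer will we use?
# Note: the pipeline below reuses the `vectorizer` built earlier, so TOKENIZER only takes effect if the vectorizer is rebuilt.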
TOKENIZER = tokenizer_with_lemmatization

log_mod = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter = 1000)   
log_pipe = Pipeline([('vect', vectorizer), ('clf', log_mod)])
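
The `multi_class='ovr'` option above stands for one-vs-rest: one binary logistic model is trained per class and the class with the highest score wins. The following sketch shows only the prediction step, with hypothetical hand-picked weights over a three-word vocabulary; in the notebook the weights are learned by `LogisticRegression`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_ovr(x, weights, biases):
    """Score x with one binary logistic model per class and return the best class."""
    scores = {}
    for label, w in weights.items():
        z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + biases[label]
        scores[label] = sigmoid(z)
    return max(scores, key=scores.get), scores

# Hypothetical weights over the vocabulary ["banco", "cumbre", "opinión"].
weights = {
    "Economia": [2.0, -1.0, -0.5],
    "Internacional": [-1.0, 2.0, -0.5],
    "Opinion": [-0.5, -0.5, 2.0],
}
biases = {"Economia": 0.0, "Internacional": 0.0, "Opinion": 0.0}

x = [2, 0, 1]  # bag-of-words counts for one document
label, scores = predict_ovr(x, weights, biases)
print(label, scores)
```

Unlike Naive Bayes, the weights are fitted jointly to separate the classes, so correlated features (such as overlapping n-grams) do not have their evidence counted twice in the same way.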

### Train

In [28]:
log_pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 3), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_with_lemmatization at 0x000001E085876158>,
                                 vocabulary=None)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=1000,
    

### Evaluate

In [29]:
predicted = log_pipe.predict(X_test)

conf = confusion_matrix(y_test, predicted)
kappa = cohen_kappa_score(y_test, predicted) 
class_rep = classification_report(y_test, predicted)

print('\nConfusion Matrix for Logistic Regression + ngram features:')
print(conf)
print('\nClassification Report')
print(class_rep)
print('\nkappa:'+str(kappa))


Confusion Matrix for Logistic Regression + ngram features:
[[58  1  6  1  4]
 [ 5 61  5  1  5]
 [ 2  0 53  1  2]
 [ 1  1  2 61  1]
 [ 2  6  2  1 48]]

Classification Report
               precision    recall  f1-score   support

     Economia       0.85      0.83      0.84        70
Internacional       0.88      0.79      0.84        77
     Nacional       0.78      0.91      0.84        58
      Opinion       0.94      0.92      0.93        66
     Sociedad       0.80      0.81      0.81        59

     accuracy                           0.85       330
    macro avg       0.85      0.85      0.85       330
 weighted avg       0.85      0.85      0.85       330


kappa:0.8142510884174009


### Examples

In [30]:
log_pipe.predict(["En puerto montt se encontró un perrito, que aparentemente, habría consumido drogas de alto calibre. Producto de esto, padecera severa caña durante varios dias."])

array(['Sociedad'], dtype=object)

In [31]:
log_pipe.predict(["kim jong un será el próximo candidato a ministro de educación."])

array(['Sociedad'], dtype=object)

## Authorship classification

Is there a pattern in how journalists write that lets us identify them from their texts?

In [32]:
pd.concat(datasets).author.unique()

array(['Gonzalo Cifuentes', 'Felipe Delgado', 'Nicolás Parra',
       'Catalina Díaz', 'Valentina González', 'Emilio Lara',
       'Manuel Stuardo', 'Nicolás Díaz', 'Sandar Oporto',
       'María José Villarroel', 'Catalina Sánchez', 'Matías Vega',
       'Manuel Cabrera', 'Periodismo UCSC', 'Diego Vera', 'Yerko Roa',
       'Felipe Díaz Montero', 'Ariela Muñoz', 'Yessenia Márquez',
       'Gerson Guzmán D.', 'Paola Alemán', 'Sebastián Asencio',
       'Claudia Miño', 'Camilo Suazo', 'Verónica Reyes', 'Max Duhalde',
       'Francisca Rivas', 'Hernán Bustamante', 'Leonardo Casas',
       'Alberto González', 'Jonathan Flores', 'Scarlet Stuardo',
       'Gerson Guzmán', 'Bernardita Villa', 'César Vega Martínez',
       'Camila Álvarez', 'Jaime Parra', 'Emilio Contreras',
       'Fabián Barría', 'Denisse Charpentier', 'Nicole Briones', 'Tu Voz',
       'Natalia Muñoz', 'Tamara Rojas', 'Alejandra Soto', 'Pablo Cabeza'],
      dtype=object)

In [33]:
def process_datasets_by_author(datasets):
    dataset = pd.concat(datasets)
    X_train, X_test, y_train, y_test = train_test_split(dataset.content, dataset.author, test_size=0.33, random_state=42)
    
    return X_train, X_test, y_train, y_test


In [34]:
X_train_2, X_test_2, y_train_2, y_test_2 = process_datasets_by_author(datasets)

### Define the pipeline

In [35]:
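# Which tokenizer will we use?
# Note: the pipeline below reuses the `vectorizer` built earlier, so TOKENIZER only takes effect if the vectorizer is rebuilt.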
TOKENIZER = tokenizer_with_lemmatization

log_mod_by_author = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter = 1000)   
log_pipe_by_author = Pipeline([('vect', vectorizer), ('clf', log_mod_by_author)])

### Train

In [36]:
log_pipe_by_author.fit(X_train_2, y_train_2)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 3), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_with_lemmatization at 0x000001E085876158>,
                                 vocabulary=None)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=1000,
    

### Evaluate

In [37]:
predicted = log_pipe_by_author.predict(X_test_2)

conf = confusion_matrix(y_test_2, predicted)
kappa = cohen_kappa_score(y_test_2, predicted) 
class_rep = classification_report(y_test_2, predicted)

print('\nConfusion Matrix for Logistic Regression + ngram features:')
print(conf)
print('\nClassification Report')
print(class_rep)
print('\nkappa:'+str(kappa))




Confusion Matrix for Logistic Regression + ngram features:
[[ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]
 [ 0  0  1 ...  0  0  0]
 ...
 [ 0  0  1 ... 34  0  0]
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]]

Classification Report
                       precision    recall  f1-score   support

       Alejandra Soto       0.00      0.00      0.00         3
         Ariela Muñoz       0.00      0.00      0.00         3
     Bernardita Villa       0.12      0.25      0.17         4
       Camila Álvarez       0.00      0.00      0.00         1
         Camilo Suazo       0.50      0.11      0.18         9
        Catalina Díaz       0.00      0.00      0.00         3
         Claudia Miño       0.00      0.00      0.00         3
  César Vega Martínez       0.70      0.76      0.73        25
  Denisse Charpentier       0.00      0.00      0.00         1
           Diego Vera       0.45      0.75      0.56        44
     Emilio Contreras       0.00      0.00      0.00         2
    