# FDS Final Project - Template
I'll drop some template code here for easy implementation.

In [14]:
import pandas as pd
import numpy as np

import pickle
import time

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

## Datasets

Here we prepare the datasets used to test the models. We can use several of them during the analysis and see how the models perform on each one. If performance does not vary across datasets, we may remove all but one before the final submission, or keep them as evidence of robustness.

I am intentionally not changing the columns for now, so you can see the data as it is. Later on, each dataset will have one **text** column and one **label** column.

In [2]:
# Data from Huffington Post, no null values
huff_post = pd.read_json('data/huff_post.json', lines=True)
huff_post.head(1)

| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |


In [3]:
# Data from Bancolombia, no null values
bancolombia = pd.read_csv('data/bancolombia.csv')
bancolombia.head(1)

| | url | news | Type |
|---|---|---|---|
| 0 | https://www.larepublica.co/redirect/post/3201905 | Durante el foro La banca articulador empresari... | Otra |
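
A minimal sketch of how the **text** / **label** standardization mentioned above might look. The helper name `to_text_label` and the tiny demo frames are made up for illustration, and using `headline` as the text column for the Huffington Post data is only an assumption (`short_description` would be another candidate).

```python
import pandas as pd


def to_text_label(df, text_col, label_col):
    """Return a two-column copy of df with columns renamed to 'text' and 'label'."""
    out = df[[text_col, label_col]].rename(
        columns={text_col: 'text', label_col: 'label'}
    )
    return out.reset_index(drop=True)


# Tiny stand-in frames that only mimic the column names shown above (made-up rows)
huff_demo = pd.DataFrame({
    'link': ['a', 'b'],
    'headline': ['Example headline one', 'Example headline two'],
    'category': ['U.S. NEWS', 'POLITICS'],
})
banco_demo = pd.DataFrame({
    'url': ['c'],
    'news': ['Texto de ejemplo'],
    'Type': ['Otra'],
})

huff_std = to_text_label(huff_demo, text_col='headline', label_col='category')
banco_std = to_text_label(banco_demo, text_col='news', label_col='Type')

print(huff_std.columns.tolist())
print(banco_std.columns.tolist())
```

With the real data, the same helper would be called on `huff_post` and `bancolombia`, so the rest of the notebook only ever needs to refer to `text` and `label`.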


## Classifiers

Here we set up the classifiers we are going to use. Since scikit-learn has a consistent interface, we can put the classifiers in a list and iterate over them: each one has a `fit` method and a `predict` method, so we can call them in a loop and collect the results. That is the approach I've seen in other notebooks.

In [7]:
# we will import Naive Bayes, softmax classifier, and svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [15]:
classifiers = [
    MultinomialNB(),
    # l1_ratio=1 makes the elastic-net penalty equivalent to a pure L1 penalty
    LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=1, C=0.1),
    LogisticRegression(penalty='l2', C=0.1),
    SVC()
]

classifier_names = ['Naive Bayes', 'Softmax ElasticNet', 'Softmax L2', 'SVM']

## Fit and Predict

Here we fit the models and predict the results. I am just printing the accuracy on the training set and the test set, but we can also print other metrics. scikit-learn also has a `classification_report` function that prints the precision, recall, and F1-score for each class; just writing this down so I don't forget.

In [9]:
# we will import CountVectorizer to vectorize our text
from sklearn.feature_extraction.text import CountVectorizer

# we will import train_test_split to split our data
from sklearn.model_selection import train_test_split

In [10]:
# load stopwords
with open('stopwords/spanish', 'rb') as f:
    spanish_stopwords = pickle.load(f)
    
vectorizer = CountVectorizer(min_df=5, stop_words=spanish_stopwords)
X = vectorizer.fit_transform(bancolombia['news'])

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, bancolombia['Type'], test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((973, 7661), (244, 7661), (973,), (244,))
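
One possible variation, not what the cell above does: there the vectorizer is fitted on all documents before the split, so the vocabulary also reflects the test documents. A hedged sketch of the alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the vocabulary is learned from the training documents only. The toy corpus and all variable names below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for bancolombia['news'] / bancolombia['Type']
texts = [
    'banco lanza nueva app digital',
    'clientes usan datos y tecnologia digital',
    'innovacion digital para clientes del banco',
    'nueva plataforma de datos para clientes',
    'inflacion sube por precios de alimentos',
    'crecimiento de la economia y la inflacion',
    'precios de alimentos aumentan este ano',
    'la economia muestra mayor crecimiento',
]
labels = ['Innovacion'] * 4 + ['Macroeconomia'] * 4

# Split the raw text first, keeping the class balance in both parts
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# fit() learns the vocabulary and the classifier from the training texts only
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(txt_train, lab_train)

test_acc = pipe.score(txt_test, lab_test)
print(f'Test data accuracy:\t {test_acc:.3f}')
```

A pipeline also exposes `fit`, `predict` and `score`, so it could be dropped into the same kind of classifier list used in this notebook.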

In [16]:
# Train and test each classifier
for name, clf in zip(classifier_names, classifiers):
    # time only the call to fit
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    print(f'Classifier:\t\t {name}')
    print(f'Training time:\t\t {end - start:.3f} seconds')
    print(f'Training data accuracy:\t {clf.score(X_train, y_train):.3f}')
    print(f'Test data accuracy:\t {clf.score(X_test, y_test):.3f}')
    print()

Classifier:		 Naive Bayes
Training time:		 0.013 seconds
Training data accuracy:	 0.952
Test data accuracy:	 0.865

Classifier:		 Softmax ElasticNet
Training time:		 1.692 seconds
Training data accuracy:	 0.899
Test data accuracy:	 0.844

Classifier:		 Softmax L2
Training time:		 1.147 seconds
Training data accuracy:	 0.997
Test data accuracy:	 0.844

Classifier:		 SVM
Training time:		 3.175 seconds
Training data accuracy:	 0.955
Test data accuracy:	 0.799
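
Following up on the note above about `classification_report`, here is a minimal, self-contained sketch of how it could be called. The data is synthetic (generated with `make_classification`) and every variable name is a stand-in; in this notebook the fitted `clf`, `X_test` and `y_test` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic three-class data, used only so the example runs on its own
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_clf = LogisticRegression(max_iter=1000).fit(Xd_train, yd_train)
yd_pred = demo_clf.predict(Xd_test)

# Text table with precision, recall, F1-score and support for each class
report_text = classification_report(yd_test, yd_pred)
print(report_text)

# The same numbers as a nested dict, convenient for building comparison tables
report_dict = classification_report(yd_test, yd_pred, output_dict=True)
print(report_dict['macro avg'])
```

The dictionary form may be the handier one here, since the per-class scores of several classifiers can then be collected into one DataFrame.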



## Evaluation

Here we can look deeper into the models. As an example, below I am printing the most important words (those with the highest conditional probability given a class) for the Naive Bayes classifier.

In [22]:
nbc = MultinomialNB()
nbc.fit(X_train, y_train)
phi = nbc.feature_log_prob_
phi.shape

In [30]:
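# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()

print('Most informative features for each class for Naive Bayes', end='\n\n')
for i, cls in enumerate(nbc.classes_):
    if cls == 'Otra':
        print(f'{cls}:\t\t', end=' ')
    else:
        print(f'{cls}:\t', end=' ')
    # ten words with the highest log-probability, in ascending order (most probable last)
    print(', '.join(feature_names[j] for j in phi[i].argsort()[-10:]))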

Most informative features for each class for Naive Bayes

Alianzas:	 parte, presidente, año, mercado, empresas, 000, país, millones, alianza, colombia
Innovacion:	 tecnología, forma, través, innovación, digital, empresas, banco, datos, clientes, bbva
Macroeconomia:	 aumento, economía, alimentos, 2022, mayor, crecimiento, bbva, año, precios, inflación
Otra:		 000, crédito, financiera, entidad, año, clientes, colombia, millones, banco, bbva
Regulaciones:	 año, si, servicios, gobierno, mercado, país, dijo, empresas, colombia, regulación
Reputacion:	 mejor, puesto, colombia, país, 10, millones, posición, sector, reputación, empresas
Sostenibilidad:	 puede, millones, cada, además, cambio, agua, sostenibilidad, sostenible, energía, bbva
