# Machine Learning Second Term Final Exam 
# Camilo Andrés Romero Maldonado
# **Email** camiloa.romero@correo.usa.edu.co
# **CC** 1020844233

## Required Imports & General Configurations

In [87]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

from nltk import PorterStemmer
from nltk.tokenize import RegexpTokenizer

from google.colab import drive
import json

## Drive mounting

In [88]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Pretty print dictionaries function

In [89]:
def printDict(d):
    print(json.dumps(d,sort_keys=True, indent=4))

# Dataset News All categories


## Initial exploration

### Dataset reading & exploration

In [56]:
dataset = pd.read_json('/content/drive/MyDrive/ML/Segundo Corte/news_category_dataset_v2.json', lines=True)

dataset

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26
...,...,...,...,...,...,...
200848,TECH,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,"Reuters, Reuters",https://www.huffingtonpost.com/entry/rim-ceo-t...,Verizon Wireless and AT&T are already promotin...,2012-01-28
200849,SPORTS,Maria Sharapova Stunned By Victoria Azarenka I...,,https://www.huffingtonpost.com/entry/maria-sha...,"Afterward, Azarenka, more effusive with the pr...",2012-01-28
200850,SPORTS,"Giants Over Patriots, Jets Over Colts Among M...",,https://www.huffingtonpost.com/entry/super-bow...,"Leading up to Super Bowl XLVI, the most talked...",2012-01-28
200851,SPORTS,Aldon Smith Arrested: 49ers Linebacker Busted ...,,https://www.huffingtonpost.com/entry/aldon-smi...,CORRECTION: An earlier version of this story i...,2012-01-28


### Dataset information cleaning

#### Registers filtering and dataset cleaning

Check whether any record has all of the columns relevant for classification (authors, headline, short description) empty, so that it can be removed.

In [57]:
(dataset[((dataset.authors == '') & (dataset.headline == '') & (dataset.short_description == ''))]).count()

category             0
headline             0
authors              0
link                 0
short_description    0
date                 0
dtype: int64

In [58]:
(dataset[dataset.authors == '']).count()

category             36620
headline             36620
authors              36620
link                 36620
short_description    36620
date                 36620
dtype: int64

In [59]:
(dataset[dataset.headline == '']).count()

category             6
headline             6
authors              6
link                 6
short_description    6
date                 6
dtype: int64

In [60]:
(dataset[dataset.short_description == '']).count()

category             19712
headline             19712
authors              19712
link                 19712
short_description    19712
date                 19712
dtype: int64
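
The four checks above can also be done in a single pass. The sketch below is only an illustration: it uses a small hypothetical frame (toy), not the real dataset, and counts the empty strings per column as well as the rows in which all three text columns are empty.

```python
import pandas as pd

# Hypothetical toy data standing in for the news dataset.
toy = pd.DataFrame({
    'headline': ['A', '', 'C', ''],
    'authors': ['', '', 'Ann', ''],
    'short_description': ['x', 'y', '', ''],
})

text_columns = ['headline', 'authors', 'short_description']

# Boolean mask: True wherever a cell holds an empty string.
is_empty = toy[text_columns].eq('')

# Number of empty strings in each column.
empty_per_column = is_empty.sum()

# Number of rows in which every text column is empty.
fully_empty_rows = int(is_empty.all(axis=1).sum())

print(empty_per_column)
print(fully_empty_rows)
```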

Fill the empty values of each column with a NO_[COLUMN_NAME] placeholder.

In [61]:
dataset.loc[dataset['authors'] == '' ,'authors'] = 'NO_AUTHOR'
dataset.loc[dataset['headline'] == '' ,'headline'] = 'NO_HEADLINE'
dataset.loc[dataset['short_description'] == '' ,'short_description'] = 'NO_DESCRIPTION'

dataset

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26
...,...,...,...,...,...,...
200848,TECH,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,"Reuters, Reuters",https://www.huffingtonpost.com/entry/rim-ceo-t...,Verizon Wireless and AT&T are already promotin...,2012-01-28
200849,SPORTS,Maria Sharapova Stunned By Victoria Azarenka I...,NO_AUTHOR,https://www.huffingtonpost.com/entry/maria-sha...,"Afterward, Azarenka, more effusive with the pr...",2012-01-28
200850,SPORTS,"Giants Over Patriots, Jets Over Colts Among M...",NO_AUTHOR,https://www.huffingtonpost.com/entry/super-bow...,"Leading up to Super Bowl XLVI, the most talked...",2012-01-28
200851,SPORTS,Aldon Smith Arrested: 49ers Linebacker Busted ...,NO_AUTHOR,https://www.huffingtonpost.com/entry/aldon-smi...,CORRECTION: An earlier version of this story i...,2012-01-28


Describe the dataset to discover its columns, the number of entries, and some general statistics.

#### Dataset exploration

In [62]:
dataset.describe()

Unnamed: 0,category,headline,authors,link,short_description,date
count,200853,200853,200853,200853,200853,200853
unique,41,199344,27993,200812,178353,2309
top,POLITICS,Sunday Roundup,NO_AUTHOR,https://www.huffingtonpost.comhttps://www.wash...,NO_DESCRIPTION,2013-01-17 00:00:00
freq,32739,90,36620,2,19712,100
first,,,,,,2012-01-28 00:00:00
last,,,,,,2018-05-26 00:00:00


General information about the dataset, showing that no column contains null entries.

In [63]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200853 entries, 0 to 200852
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   category           200853 non-null  object        
 1   headline           200853 non-null  object        
 2   authors            200853 non-null  object        
 3   link               200853 non-null  object        
 4   short_description  200853 non-null  object        
 5   date               200853 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.2+ MB


As stated above, no column of the dataset contains null entries, so no further cleaning of null values is needed. The following snippet confirms it.

In [64]:
dataset.isna().sum()

category             0
headline             0
authors              0
link                 0
short_description    0
date                 0
dtype: int64

There are 41 categories that the model will predict

In [65]:
len(dataset['category'].unique())

41

##### Dataset's empty values

36620 news items have no author.

In [66]:
dataset.value_counts(dataset['authors'])

authors
NO_AUTHOR                                               36620
Lee Moran                                                2423
Ron Dicker                                               1913
Reuters, Reuters                                         1562
Ed Mazza                                                 1322
                                                        ...  
Lily Kuo, Quartz Africa                                     1
Lily Remi, ContributorYoung Aussie writer & student         1
Lin Stranberg, Contributor\nWriter/Editor/Enthusiast        1
Lina Esco, ContributorDirector, 'Free the Nipple'           1
 Basil Kreimendahl, Contributor\nPlaywright                 1
Length: 27993, dtype: int64

19712 news items have no short description.

In [67]:
dataset.value_counts(dataset['short_description'])

short_description
NO_DESCRIPTION                                                                                                                   19712
Welcome to the HuffPost Rise Morning Newsbrief, a short wrap-up of the news to help you start your day.                            192
The stress and strain of constantly being connected can sometimes take your life -- and your well-being -- off course. GPS         125
Want more? Be sure to check out HuffPost Style on Twitter, Facebook, Tumblr, Pinterest and Instagram at @HuffPostStyle. -- Do       91
Do you have a home story idea or tip? Email us at homesubmissions@huffingtonpost.com. (PR pitches sent to this address will         75
                                                                                                                                 ...  
Take a Twitter poetry break in honor of National Poetry Day!                                                                         1
Take This Waltz 1n 1978, Diane was of

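Nearly all news items have a headline: only the 6 empty ones found earlier carry the NO_HEADLINE placeholder, too few to appear among the most frequent values.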

In [68]:
dataset.value_counts(dataset['headline'])

headline
Sunday Roundup                                                                                   90
The 20 Funniest Tweets From Women This Week                                                      80
Weekly Roundup of eBay Vintage Clothing Finds (PHOTOS)                                           59
Weekly Roundup of eBay Vintage Home Finds (PHOTOS)                                               54
Watch The Top 9 YouTube Videos Of The Week                                                       46
                                                                                                 ..
Quvenzhané Wallis Oscar Dress 2013: Actress Looks Adorable In Armani And Puppy Purse (PHOTOS)     1
Quvenzhané Wallis Has Her Fingers Crossed For Another Oscar Nod                                   1
Quvenzhané Wallis Had A Dance-Off To End All Dance-Offs (VIDEO)                                   1
Qutenza, Chili Pepper Drug, Gets Mixed Review For Treating HIV-Related Pain                

## Pre-Processing
The dataset does not need to be preprocessed again for each iteration, so it is preprocessed only once and reused in every iteration.

### Categorize category column

The category column is the value the models will predict, so its text labels were converted to numeric codes.
The transformation was done with the ordinal encoder (OrdinalEncoder).

In [69]:
categories = dataset[['category']]
encoder = OrdinalEncoder()
transformed_categories = encoder.fit_transform(categories)
transformed_categories

array([[ 6.],
       [10.],
       [10.],
       ...,
       [28.],
       [28.],
       [28.]])
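
To see which number corresponds to which category name, the fitted encoder can be inspected. The sketch below uses a hypothetical three-label frame (toy) rather than the real data. The categories_ attribute and the inverse_transform method are standard OrdinalEncoder members, and string labels are sorted before being numbered.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical labels; the real dataset has 41 of them.
toy = pd.DataFrame({'category': ['SPORTS', 'CRIME', 'TECH', 'CRIME']})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy[['category']])

# Map each integer code back to its category name.
code_to_name = dict(enumerate(encoder.categories_[0]))
print(code_to_name)

# Recover the original strings from the numeric codes.
restored = encoder.inverse_transform(codes).ravel()
print(restored)
```

Keeping such a mapping around makes the numeric labels in the later tables readable again.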

In [70]:
dataset['category'] = transformed_categories
dataset

Unnamed: 0,category,headline,authors,link,short_description,date
0,6.0,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,10.0,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,10.0,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,10.0,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,10.0,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26
...,...,...,...,...,...,...
200848,32.0,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,"Reuters, Reuters",https://www.huffingtonpost.com/entry/rim-ceo-t...,Verizon Wireless and AT&T are already promotin...,2012-01-28
200849,28.0,Maria Sharapova Stunned By Victoria Azarenka I...,NO_AUTHOR,https://www.huffingtonpost.com/entry/maria-sha...,"Afterward, Azarenka, more effusive with the pr...",2012-01-28
200850,28.0,"Giants Over Patriots, Jets Over Colts Among M...",NO_AUTHOR,https://www.huffingtonpost.com/entry/super-bow...,"Leading up to Super Bowl XLVI, the most talked...",2012-01-28
200851,28.0,Aldon Smith Arrested: 49ers Linebacker Busted ...,NO_AUTHOR,https://www.huffingtonpost.com/entry/aldon-smi...,CORRECTION: An earlier version of this story i...,2012-01-28


### Split between training and test
The train_test_split function is used to obtain the training and test subsets.

In [71]:
train, test = train_test_split(dataset, test_size = 0.3)
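
The split above is unseeded, so the scores will differ between runs. A possible variant, shown here on hypothetical toy data, fixes random_state and passes stratify so that each class keeps its share in both subsets. The seed value 42 is an arbitrary choice, and this is a sketch rather than the configuration used for the results reported in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 rows of class 0, 20 rows of class 1.
toy = pd.DataFrame({
    'headline': [f'headline {i}' for i in range(100)],
    'category': [0.0] * 80 + [1.0] * 20,
})

train_part, test_part = train_test_split(
    toy,
    test_size=0.3,
    random_state=42,           # arbitrary seed, makes the split repeatable
    stratify=toy['category'],  # keep class proportions in both subsets
)

print(len(train_part), len(test_part))
print(test_part['category'].value_counts())
```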

## Selection and Evaluation of models

### Utility functions

Function that builds the base pipeline. It receives the classifier class, which is the only element that changes between pipelines, and instantiates it as the last step.

In [72]:
def getPipeline(classifier):
    return Pipeline(steps=[
        ('stemmer', CustomStemmer()),
        ('count', CountVectorizer(stop_words='english')),
        ('scaler', MaxAbsScaler()),
        ('clf', classifier()),
    ])

### Evaluation

In [73]:
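class CustomStemmer(BaseEstimator, TransformerMixin):
    """Tokenize each document on word characters and Porter-stem every token."""

    def __init__(self):
        self.stemmer = PorterStemmer()
        self.tokenizer = RegexpTokenizer(r'\w+')

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data.
        return self

    def transform(self, X, y=None):
        # Rebuild each document as a space-separated string of stemmed tokens.
        stemmed = []
        for text in np.asarray(X):
            tokens = self.tokenizer.tokenize(text)
            stemmed.append(' '.join(self.stemmer.stem(token) for token in tokens))
        return np.array(stemmed)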

#### Pipelines definitions

The pipelines' classifiers selected were:

* Decision Tree
* Multinomial Naive Bayes
* K Nearest Neighbors

A dictionary of pipelines was defined so that they can be iterated over, avoiding code duplication.

In [74]:
tree_pipeline = getPipeline(DecisionTreeClassifier)
NB_pipeline = getPipeline(MultinomialNB)
KNN_pipeline = getPipeline(KNeighborsClassifier)
pipe_dict = {
    'Decision_Tree': tree_pipeline,
    'Multinomial Naive Bayes': NB_pipeline,
    'KNN': KNN_pipeline
}

#### Hyperparams definitions

A hyperparameters dictionary with the same keys is defined, so that it can be iterated alongside the pipelines and code duplication is avoided.

In [75]:
hyperparams_dict = {
    'Decision_Tree': [{"clf__max_depth":[6,8,10]}],
    'Multinomial Naive Bayes': [{"clf__fit_prior":[True, False]}],
    'KNN': [{"clf__n_neighbors":[3,5,7]}]
}
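
The clf__ prefix in each key is how GridSearchCV routes a parameter to the pipeline step named clf. A minimal sketch of that mechanism follows. It runs on a tiny hypothetical corpus and leaves out the stemming and scaling steps; the documents, the labels and cv=2 are assumptions made only to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two classes.
docs = [
    'team wins the final match', 'player scores late goal',
    'coach praises the team', 'striker signs new contract',
    'senate passes the budget bill', 'president vetoes new law',
    'voters elect the governor', 'congress debates tax bill',
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline(steps=[
    ('count', CountVectorizer()),
    ('clf', MultinomialNB()),
])

# '<step name>__<parameter>' targets MultinomialNB.fit_prior inside the pipeline.
param_grid = {'clf__fit_prior': [True, False]}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)

# The winning setting and its mean cross-validated accuracy.
print(search.best_params_)
print(search.best_score_)
```

After fitting, best_params_ reports which value of each grid entry was selected, which is useful to record next to the accuracy scores.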

#### Training alternatives

A dictionary of training inputs is defined in the same way, with one entry per column or concatenation of columns, so that it can be iterated over without duplicating code.

An empty dictionary is also defined to store the accuracy results of the different trainings.

In [76]:
trainings_dict = {
    'Headline': train['headline'],
    'Short Description': train['short_description'],
    'Authors': train['authors'],
    'Headline-Short Description': train['headline'] + ' ' + train['short_description'],
    'Headline-Authors': train['headline'] + ' ' + train['authors'],
    'Authors-Short Description': train['authors'] + ' ' + train['short_description'],
    'All': train['authors'] + ' ' + train['headline'] + ' ' + train['short_description']
}

models_accuracy = {}

Loop over the trainings, pipelines and hyperparameters dictionaries, run a grid search for each combination, and store the resulting accuracy score in the models_accuracy dictionary under its corresponding training and pipeline keys.

In [260]:
y = train['category']
for training in trainings_dict:
    models_accuracy[training] = {}
    X = trainings_dict[training]
    # Hold out 30% of the training data to score each tuned model.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    for pipeline in pipe_dict:
        print(training, pipeline)
        # Tune the pipeline's hyperparameters with a cross-validated grid search.
        gs = GridSearchCV(pipe_dict[pipeline], hyperparams_dict[pipeline], n_jobs=-1)
        gs.fit(X_train, y_train)
        y_predict = gs.predict(X_test)
        models_accuracy[training][pipeline] = accuracy_score(y_test, y_predict)

Headline Decision_Tree
Headline Multinomial Naive Bayes
Headline KNN
Short Description Decision_Tree
Short Description Multinomial Naive Bayes
Short Description KNN
Authors Decision_Tree
Authors Multinomial Naive Bayes
Authors KNN
Headline-Short Description Decision_Tree
Headline-Short Description Multinomial Naive Bayes
Headline-Short Description KNN
Headline-Authors Decision_Tree
Headline-Authors Multinomial Naive Bayes
Headline-Authors KNN
Authors-Short Description Decision_Tree
Authors-Short Description Multinomial Naive Bayes
Authors-Short Description KNN
All Decision_Tree
All Multinomial Naive Bayes
All KNN


In [82]:
printDict(models_accuracy)

{
    "All": {
        "Decision_Tree": 0.33499288762446655,
        "KNN": 0.4517306780464675,
        "Multinomial Naive Bayes": 0.6324087245139877
    },
    "Authors": {
        "Decision_Tree": 0.28584637268847796,
        "KNN": 0.48909435751541014,
        "Multinomial Naive Bayes": 0.5433380749170222
    },
    "Authors-Short Description": {
        "Decision_Tree": 0.30436225699383596,
        "KNN": 0.4395448079658606,
        "Multinomial Naive Bayes": 0.5953295400663822
    },
    "Headline": {
        "Decision_Tree": 0.22219061166429588,
        "KNN": 0.2843764817449028,
        "Multinomial Naive Bayes": 0.5228544333807492
    },
    "Headline-Authors": {
        "Decision_Tree": 0.33079658605974394,
        "KNN": 0.5382408724513987,
        "Multinomial Naive Bayes": 0.6701280227596017
    },
    "Headline-Short Description": {
        "Decision_Tree": 0.23824087245139877,
        "KNN": 0.2293978188715031,
        "Multinomial Naive Bayes": 0.5020862968231389
    },


# Dataset News first 10 categories

For this section only the dataset changes, so the variables defined previously are reused. Note that the filter category <= 10 keeps the encoded labels 0 through 10, which is 11 categories rather than 10.

## Dataset transformation

In [32]:
dataset = dataset[dataset['category'] <= 10]
dataset



Unnamed: 0,category,headline,authors,link,short_description,date
0,6.0,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,10.0,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,10.0,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,10.0,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,10.0,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26
...,...,...,...,...,...,...
200838,10.0,"Sundance, Ice-T, and Shades of the American Ra...","Courtney Garcia, Contributor\nI tell stories a...",https://www.huffingtonpost.com/entry/sundance-...,Representation of the collective diaspora has ...,2012-01-28
200839,10.0,'Girl With the Dragon Tattoo' India Release Ca...,NO_AUTHOR,https://www.huffingtonpost.com/entry/girl-with...,"""Sony Pictures will not be releasing The Girl ...",2012-01-28
200840,7.0,'Don't Think': A Look At The Chemical Brothers...,Kia Makarechi,https://www.huffingtonpost.com/entry/dont-thin...,"Amid cheers and the occasional ""Here we go!"" f...",2012-01-28
200841,7.0,Matthew Marks Discusses His New LA Gallery,NO_AUTHOR,https://www.huffingtonpost.com/entry/matthew-m...,Was it an obvious choice to recruit Ellsworth ...,2012-01-28
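
Because OrdinalEncoder numbers the labels starting from 0, the comparison <= 10 retains eleven distinct codes. The hypothetical sketch below illustrates the count on toy codes and shows < 10 as the variant that would keep exactly ten. It is not a re-run of the experiment, whose reported scores correspond to the filter as written above.

```python
import pandas as pd

# Hypothetical encoded labels 0..40, one row per label.
toy = pd.DataFrame({'category': [float(code) for code in range(41)]})

kept_inclusive = toy[toy['category'] <= 10]   # codes 0 through 10
kept_exclusive = toy[toy['category'] < 10]    # codes 0 through 9

print(kept_inclusive['category'].nunique())
print(kept_exclusive['category'].nunique())
```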


### Train test split

In [33]:
train, test = train_test_split(dataset, test_size = 0.3)

### Training dictionary redefinition for new dataset

In [34]:
trainings_dict = {
    'Headline': train['headline'],
    'Short Description': train['short_description'],
    'Authors': train['authors'],
    'Headline-Short Description': train['headline'] + ' ' + train['short_description'],
    'Headline-Authors': train['headline'] + ' ' + train['authors'],
    'Authors-Short Description': train['authors'] + ' ' + train['short_description'],
    'All': train['authors'] + ' ' + train['headline'] + ' ' + train['short_description']
}

models_accuracy_10categories = {}

## Evaluation

In [35]:
y = train['category']
for training in trainings_dict:
    models_accuracy_10categories[training] = {}
    X = trainings_dict[training]
    # Hold out 30% of the training data to score each tuned model.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    for pipeline in pipe_dict:
        print(training, pipeline)
        # Tune the pipeline's hyperparameters with a cross-validated grid search.
        gs = GridSearchCV(pipe_dict[pipeline], hyperparams_dict[pipeline], n_jobs=-1)
        gs.fit(X_train, y_train)
        y_predict = gs.predict(X_test)
        models_accuracy_10categories[training][pipeline] = accuracy_score(y_test, y_predict)

Headline Decision_Tree
Headline Multinomial Naive Bayes
Headline KNN
Short Description Decision_Tree
Short Description Multinomial Naive Bayes
Short Description KNN
Authors Decision_Tree
Authors Multinomial Naive Bayes
Authors KNN
Headline-Short Description Decision_Tree
Headline-Short Description Multinomial Naive Bayes
Headline-Short Description KNN
Headline-Authors Decision_Tree
Headline-Authors Multinomial Naive Bayes
Headline-Authors KNN
Authors-Short Description Decision_Tree
Authors-Short Description Multinomial Naive Bayes
Authors-Short Description KNN
All Decision_Tree
All Multinomial Naive Bayes
All KNN


In [55]:
printDict(models_accuracy_10categories)

{
    "All": {
        "Decision_Tree": 0.501656513839906,
        "KNN": 0.5732606604680988,
        "Multinomial Naive Bayes": 0.7817676605749706
    },
    "Authors": {
        "Decision_Tree": 0.4329379074489687,
        "KNN": 0.606711552848135,
        "Multinomial Naive Bayes": 0.6404830608100887
    },
    "Authors-Short Description": {
        "Decision_Tree": 0.42545687720423214,
        "KNN": 0.5331837127284386,
        "Multinomial Naive Bayes": 0.7281179865341456
    },
    "Headline": {
        "Decision_Tree": 0.4682056214598696,
        "KNN": 0.3540664742973175,
        "Multinomial Naive Bayes": 0.663567382708133
    },
    "Headline-Authors": {
        "Decision_Tree": 0.47248049588543334,
        "KNN": 0.6079940151758042,
        "Multinomial Naive Bayes": 0.7872181254675644
    },
    "Headline-Short Description": {
        "Decision_Tree": 0.4906487121940793,
        "KNN": 0.36432617291867053,
        "Multinomial Naive Bayes": 0.6689109757400876
    },
    "Sh

# Final Analysis

## Results comparison

### All categories Dataset

In [86]:
printDict(models_accuracy)

{
    "All": {
        "Decision_Tree": 0.33499288762446655,
        "KNN": 0.4517306780464675,
        "Multinomial Naive Bayes": 0.6324087245139877
    },
    "Authors": {
        "Decision_Tree": 0.28584637268847796,
        "KNN": 0.48909435751541014,
        "Multinomial Naive Bayes": 0.5433380749170222
    },
    "Authors-Short Description": {
        "Decision_Tree": 0.30436225699383596,
        "KNN": 0.4395448079658606,
        "Multinomial Naive Bayes": 0.5953295400663822
    },
    "Headline": {
        "Decision_Tree": 0.22219061166429588,
        "KNN": 0.2843764817449028,
        "Multinomial Naive Bayes": 0.5228544333807492
    },
    "Headline-Authors": {
        "Decision_Tree": 0.33079658605974394,
        "KNN": 0.5382408724513987,
        "Multinomial Naive Bayes": 0.6701280227596017
    },
    "Headline-Short Description": {
        "Decision_Tree": 0.23824087245139877,
        "KNN": 0.2293978188715031,
        "Multinomial Naive Bayes": 0.5020862968231389
    },


The main issue with this dataset appeared while fitting the models: with GridSearchCV, some trainings took more than half an hour each. This time would grow further if the dataset kept growing.
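
The running time follows from the number of models that GridSearchCV has to fit. A rough count is sketched below. It assumes 5 cross-validation folds (the default in recent scikit-learn releases when cv is not given) and one final refit per search; the grids mirror hyperparams_dict.

```python
from sklearn.model_selection import ParameterGrid

# Grids mirroring hyperparams_dict from the notebook.
hyperparams_dict = {
    'Decision_Tree': [{'clf__max_depth': [6, 8, 10]}],
    'Multinomial Naive Bayes': [{'clf__fit_prior': [True, False]}],
    'KNN': [{'clf__n_neighbors': [3, 5, 7]}],
}

n_folds = 5       # assumption: GridSearchCV default when cv is not given
n_trainings = 7   # number of entries in trainings_dict

# Each candidate is fitted once per fold, plus one final refit per search.
fits_per_training = sum(
    len(ParameterGrid(grid)) * n_folds + 1
    for grid in hyperparams_dict.values()
)
total_fits = fits_per_training * n_trainings

print(fits_per_training)  # 43
print(total_fits)         # 301
```

Every one of those fits also repeats the stemming step over its share of the data, which is a likely contributor to the long running times.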

An important observation is that most accuracy scores were below 60% (in some results, as low as 15.9%). These are poor results, since the models fail in at least 40% of the cases.
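
A useful yardstick for these scores is the majority-class baseline, that is, the accuracy of always predicting the most frequent category. Using the figures reported by dataset.describe() earlier (POLITICS appears 32739 times among 200853 rows), the arithmetic is:

```python
# Figures taken from the describe() output earlier in the notebook.
total_rows = 200853
politics_rows = 32739   # frequency of the most common category

# Accuracy of a model that always answers POLITICS, over the full dataset.
majority_baseline = politics_rows / total_rows

# Accuracy expected from uniform random guessing among 41 categories.
uniform_baseline = 1 / 41

print(f'{majority_baseline:.3f}')  # 0.163
print(f'{uniform_baseline:.3f}')   # 0.024
```

On that scale, a score near 16% is roughly what always predicting POLITICS would achieve. The exact baseline on a random subset of the data will vary slightly.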

In every training, the Multinomial Naive Bayes classifier gave the best accuracy score. Therefore, this classifier is the go-to approach.

Finally, among the results shown above, the best column combination was Headline-Authors, with an accuracy score of 67.0%. Concatenating all the relevant columns (Authors, Headline, Short Description) scored lower, at 63.2%.
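
The nested dictionary is easier to compare as a table. The sketch below rebuilds the entries printed above, rounded to four decimals and leaving out the part of the output that is cut off, and then looks up the best cell with pandas. The name shown_accuracy is introduced only for this illustration.

```python
import pandas as pd

# Accuracy scores copied from the printed output, rounded to four decimals.
# Entries cut off in the printed output are omitted.
shown_accuracy = {
    'All': {'Decision_Tree': 0.3350, 'KNN': 0.4517, 'Multinomial Naive Bayes': 0.6324},
    'Authors': {'Decision_Tree': 0.2858, 'KNN': 0.4891, 'Multinomial Naive Bayes': 0.5433},
    'Authors-Short Description': {'Decision_Tree': 0.3044, 'KNN': 0.4395, 'Multinomial Naive Bayes': 0.5953},
    'Headline': {'Decision_Tree': 0.2222, 'KNN': 0.2844, 'Multinomial Naive Bayes': 0.5229},
    'Headline-Authors': {'Decision_Tree': 0.3308, 'KNN': 0.5382, 'Multinomial Naive Bayes': 0.6701},
    'Headline-Short Description': {'Decision_Tree': 0.2382, 'KNN': 0.2294, 'Multinomial Naive Bayes': 0.5021},
}

# Rows: column combinations; columns: classifiers.
table = pd.DataFrame(shown_accuracy).T

# Best classifier for each column combination.
best_per_training = table.idxmax(axis=1)

# Single best (column combination, classifier) pair.
best_training, best_classifier = table.stack().idxmax()

print(table)
print(best_training, best_classifier, table.loc[best_training, best_classifier])
```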

### 10 categories Dataset

In [84]:
printDict(models_accuracy_10categories)

{
    "All": {
        "Decision_Tree": 0.501656513839906,
        "KNN": 0.5732606604680988,
        "Multinomial Naive Bayes": 0.7817676605749706
    },
    "Authors": {
        "Decision_Tree": 0.4329379074489687,
        "KNN": 0.606711552848135,
        "Multinomial Naive Bayes": 0.6404830608100887
    },
    "Authors-Short Description": {
        "Decision_Tree": 0.42545687720423214,
        "KNN": 0.5331837127284386,
        "Multinomial Naive Bayes": 0.7281179865341456
    },
    "Headline": {
        "Decision_Tree": 0.4682056214598696,
        "KNN": 0.3540664742973175,
        "Multinomial Naive Bayes": 0.663567382708133
    },
    "Headline-Authors": {
        "Decision_Tree": 0.47248049588543334,
        "KNN": 0.6079940151758042,
        "Multinomial Naive Bayes": 0.7872181254675644
    },
    "Headline-Short Description": {
        "Decision_Tree": 0.4906487121940793,
        "KNN": 0.36432617291867053,
        "Multinomial Naive Bayes": 0.6689109757400876
    },
    "Sh

With this dataset, the fitting-time issue of the previous dataset was mitigated by reducing the number of categories to predict, although fitting still took a significant amount of time (about 1 hour for all the models).

Compared to the previous dataset, the lowest accuracy score was 16% higher, and accuracy also increased for all classifiers.

As in the previous dataset, the Multinomial Naive Bayes classifier gave the best accuracy score in every training. Therefore, this classifier is the go-to approach.

Finally, the best column combination was Headline-Authors, with an accuracy score of 78.7%.

# Final Observation

As the accuracy scores from both datasets show, the best results were obtained with the reduced set of categories. A likely reason is that there are fewer possible classes to choose from: the best classifier reached 78.7% on the reduced dataset, against 67.0% for the same classifier and column combination (Headline-Authors) on all the categories.

Finally, a reasonable recommendation is to train several models, each with at most 10 categories, so that the results become more accurate.