## NLP News Topic Analysis & Classification

The goal of this project is to create a model that maximizes accuracy in classifying news articles based on their headlines.  The dataset, cited below, provides article information in `.json` format, with each article labeled by its predominant topic.  Each record includes a link to the article, the headline, the date published, the authors, and a short description.  I'll use the NLTK library to tokenize the text, determine the lemma and part of speech for each word, and engineer features from this information.  I'll apply several different kinds of machine learning classifiers to see how high an accuracy score we can reach on a 'test' portion of the data, training/validating on the remaining portion.

**Kaggle page: https://www.kaggle.com/datasets/rmisra/news-category-dataset**

**Dataset source: https://rishabhmisra.github.io/publications**

1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).
2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

Let's import pandas, NumPy, and the initial NLTK libraries we'll be using for tokenization, lemmatization, and part-of-speech identification.

In [1]:
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
file_path = 'News_Category_Dataset_v3.json'

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dmark\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dmark\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dmark\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


We'll read the json into a dataframe and check to see how large it is.

In [2]:
data = pd.read_json(file_path, lines=True)
rows_num = int(data.shape[0])
data.shape

(209527, 6)

Fairly large-- more than 200,000 instances.  Let's get a look at those 6 columns and their datatypes:

In [3]:
print(list(data.columns))
data.dtypes

['link', 'headline', 'category', 'short_description', 'authors', 'date']


link                         object
headline                     object
category                     object
short_description            object
authors                      object
date                 datetime64[ns]
dtype: object

Now we can take a look at what kind of entries fall under each column.

In [4]:
data.head()

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


Let's check if any records have null values:

In [5]:
data.isnull().sum().sum()

0

Nope-- we'll be able to keep all these records.  Now we can get a closer look at the topics that fall under the `category` column.

In [6]:
data['category'].value_counts()

POLITICS          35602
WELLNESS          17945
ENTERTAINMENT     17362
TRAVEL             9900
STYLE & BEAUTY     9814
PARENTING          8791
HEALTHY LIVING     6694
QUEER VOICES       6347
FOOD & DRINK       6340
BUSINESS           5992
COMEDY             5400
SPORTS             5077
BLACK VOICES       4583
HOME & LIVING      4320
PARENTS            3955
THE WORLDPOST      3664
WEDDINGS           3653
WOMEN              3572
CRIME              3562
IMPACT             3484
DIVORCE            3426
WORLD NEWS         3299
MEDIA              2944
WEIRD NEWS         2777
GREEN              2622
WORLDPOST          2579
RELIGION           2577
STYLE              2254
SCIENCE            2206
TECH               2104
TASTE              2096
MONEY              1756
ARTS               1509
ENVIRONMENT        1444
FIFTY              1401
GOOD NEWS          1398
U.S. NEWS          1377
ARTS & CULTURE     1339
COLLEGE            1144
LATINO VOICES      1130
CULTURE & ARTS     1074
EDUCATION       

Many of these are ambiguous, and some seem to overlap.  Let's make our life a little easier and get rid of the instances/rows whose category labels don't seem to be distinct categories.  These might overlap with other categories or might not, so we'll simply jettison them for now.

In [7]:
# Categories whose instances we'll remove
DROP_CATS = ['COLLEGE','GOOD NEWS','FIFTY','WEIRD NEWS','TASTE','WOMEN','IMPACT','GREEN','WEDDINGS','MONEY',
            'LATINO VOICES','MEDIA','DIVORCE']
# Collect the index labels of the rows whose category label is in our drop list
DROP_IDX = data[data['category'].isin(DROP_CATS)].index
data.drop(index=DROP_IDX, inplace=True)
data.shape

(178124, 6)

Looking at the shape, we got rid of a decent number of instances, but we still have the vast majority.  Now let's unite some categories that seem to cover the same topics:

In [8]:
COMBINE_CATS = {'SCIENCE':'SCIENCE & TECH','TECH':'SCIENCE & TECH',
               'STYLE':'STYLE & BEAUTY','PARENTS':'PARENTING',
               'THE WORLDPOST':'WORLD NEWS','WORLDPOST':'WORLD NEWS',
               'ARTS':'ARTS & CULTURE','CULTURE & ARTS':'ARTS & CULTURE'}
# If the topic is in the dictionary, we convert it; otherwise, we keep the topic as is
data['category'] = data['category'].map(lambda x: COMBINE_CATS.get(x, x))
data['category'].value_counts()

POLITICS          35602
WELLNESS          17945
ENTERTAINMENT     17362
PARENTING         12746
STYLE & BEAUTY    12068
TRAVEL             9900
WORLD NEWS         9542
HEALTHY LIVING     6694
QUEER VOICES       6347
FOOD & DRINK       6340
BUSINESS           5992
COMEDY             5400
SPORTS             5077
BLACK VOICES       4583
HOME & LIVING      4320
SCIENCE & TECH     4310
ARTS & CULTURE     3922
CRIME              3562
RELIGION           2577
ENVIRONMENT        1444
U.S. NEWS          1377
EDUCATION          1014
Name: category, dtype: int64
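
Since accuracy is the target metric, it helps to know what a classifier would score by always guessing the largest category. The sketch below computes that majority-class baseline from the counts printed above. The `counts` dictionary is copied by hand from that output, and the baseline is only an extra reference point, not a step the rest of the notebook depends on.

```python
# Category counts copied from the value_counts() output above
counts = {
    'POLITICS': 35602, 'WELLNESS': 17945, 'ENTERTAINMENT': 17362,
    'PARENTING': 12746, 'STYLE & BEAUTY': 12068, 'TRAVEL': 9900,
    'WORLD NEWS': 9542, 'HEALTHY LIVING': 6694, 'QUEER VOICES': 6347,
    'FOOD & DRINK': 6340, 'BUSINESS': 5992, 'COMEDY': 5400,
    'SPORTS': 5077, 'BLACK VOICES': 4583, 'HOME & LIVING': 4320,
    'SCIENCE & TECH': 4310, 'ARTS & CULTURE': 3922, 'CRIME': 3562,
    'RELIGION': 2577, 'ENVIRONMENT': 1444, 'U.S. NEWS': 1377,
    'EDUCATION': 1014,
}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the most common category
baseline = counts[majority_label] / total
print(f'{majority_label}: {baseline:.3f} of {total} headlines')
```

Any model whose accuracy sits close to this share is probably doing little more than predicting the largest category.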

Now we can start to build out a feature set based on the text of the headlines and/or the descriptions.  I'm not going to look at the descriptions for now, for two reasons.  First, just parsing and working with the headline text is going to take a substantial amount of processing power.  Second, I'm curious how well we can build a classifier on more limited information (i.e., only the headline).  We'll drop the columns we don't care about.

In [9]:
data.drop(columns=['link','authors','date','short_description'], inplace=True)

In [10]:
data.head()

Unnamed: 0,headline,category
0,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS
1,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS
2,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY
3,The Funniest Tweets From Parents This Week (Se...,PARENTING
4,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS


Now we can do some work.  I'm going to first create tokens using NLTK's `word_tokenize` function, and then convert each token list to NLTK's `Text` class.  I'll also count the tokens in each headline (punctuation marks count as tokens) and add this as a feature column.  If we look at the data, we can see the new columns and their formats.

In [11]:
data['headline_text'] = data['headline'].map(nltk.word_tokenize).map(nltk.Text)
data['headline_len'] = data['headline_text'].map(len)
data.head()

Unnamed: 0,headline,category,headline_text,headline_len
0,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,"(Over, 4, Million, Americans, Roll, Up, Sleeve...",11
1,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,"(American, Airlines, Flyer, Charged, ,, Banned...",14
2,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"(23, Of, The, Funniest, Tweets, About, Cats, A...",15
3,The Funniest Tweets From Parents This Week (Se...,PARENTING,"(The, Funniest, Tweets, From, Parents, This, W...",11
4,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,"(Woman, Who, Called, Cops, On, Black, Bird-Wat...",11


## Lemmatization & POS Identification

I want to assign a part of speech to each word too, since some words in English can be a noun or a verb and be more or less useful to a classifier depending on which.  I at least want to capture that difference so it can be seen by the models.  First we'll identify our list of parts of speech (`poss`) that NLTK's lemmatizer can work with.

In [12]:
# Create a list of the key parts of speech we can lemmatize off of.  I'll drop 's' as in my testing of this 
# program, I noticed that no instances of 's' show up, and we'll want to save some time when we loop through this
# massive dataset
poss = ['v','n','a','r']#,'s']

Next we'll define a function that will take the part of speech tag that NLTK's `pos_tag` provides and convert it to the tag that NLTK's lemmatizer works with.

In [13]:
def convert_tag(tag):
    # Map the first letter of a Penn Treebank tag to a WordNet pos; None if unsupported
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    return tag_dict.get(tag[0])
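
As a quick illustration of that conversion, the sketch below applies the same mapping to a handful of Penn Treebank tags and then builds word-plus-pos tokens of the kind used later. The `(word, tag)` pairs are hard-coded stand-ins for what `nltk.pos_tag` would return, and `pos_coded` is a hypothetical helper name, so treat this as a sketch rather than the notebook's exact pipeline.

```python
def convert_tag(tag):
    """Map a Penn Treebank tag to a WordNet pos letter, or None if unsupported."""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    return tag_dict.get(tag[:1])


def pos_coded(tagged_words, pos):
    """Join the words whose converted tag equals `pos`, appending `pos` to each."""
    return ' '.join(word + pos for word, tag in tagged_words if convert_tag(tag) == pos)


# Hard-coded (word, tag) pairs standing in for nltk.pos_tag output (assumption)
tagged = [('american', 'JJ'), ('airline', 'NN'), ('charge', 'VBD'), ('for', 'IN')]

for pos in ['v', 'n', 'a', 'r']:
    print(pos, repr(pos_coded(tagged, pos)))
```

Words whose tag has no WordNet counterpart, such as the preposition `for`, drop out of every column.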

Lemmatization takes each word down to its root, depending on the part of speech we apply.  So, our plan is to go through each part of speech and each row of the data.  For each headline, we'll lemmatize every word under the current part of speech, tag the resulting lemmas with `pos_tag`, and keep only the lemmas whose tag matches that part of speech, each concatenated with its part-of-speech letter.  We'll store the results in a column named for that part of speech, so each column holds a string of lemmatized words coded by their pos.  These will eventually be the features our models train on.

In [14]:
### TESTING 
# def lemm(pos,row,l):
#     # create a list of lemmatized words based on the pos, and making sure it is an actual word and not an empty string
#     lemmas = [lemmatizer.lemmatize(word.lower(),pos) for word in l if (word and word[0].isalpha())]
#     # Create a list of pos tags from the lemmatized words
#     pos_tags = nltk.pos_tag(lemmas, lang='eng')
#     # Save a string of the words + the nltk-identified pos
#     return ' '.join([word+pos for word, pos_tag in pos_tags if convert_tag(pos_tag) == pos])

In [15]:
### TESTING 
# from multiprocessing import Pool

# lemmatizer = WordNetLemmatizer()
# for pos in poss:
#     data[f'headline_lemmas_{pos}'] = [list() for x in range(len(data.index))]

# print('Headline Lemmatization & Word POS Classification')
# input = [(pos,row,data.loc[row,'headline_text']) for pos in poss for row in data.index]

# if __name__ == '__main__':
#     with Pool(8) as p:
#         results = p.map(lemm, input)

In [16]:
# I've run this before and it takes a while, so I've saved previous results in a csv file and will load it 
# if it exists
try:
    data = pd.read_csv('data.csv')
except FileNotFoundError:
    lemmatizer = WordNetLemmatizer()
    print('Headline Lemmatization & Word POS Classification')
    # Pick part of speech for lemmatization
    for pos in poss:
        print(f'Working on {pos}')
        # Create an empty column related to the lemmatization pos
        data[f'headline_lemmas_{pos}'] = [list() for x in range(len(data.index))]
        # Parse each row
        for row in data.index:
            if row % int(rows_num/4) == 0:
                print(f'{row//int(rows_num/4)*25}% complete')
            # create a list of lemmatized words based on the pos, and making sure it is an actual word and not an empty string
            lemmas = [lemmatizer.lemmatize(word.lower(),pos) for word in data.at[row,'headline_text'] if (word and word[0].isalpha())]
            # Create a list of pos tags from the lemmatized words
            pos_tags = nltk.pos_tag(lemmas, lang='eng')
            # Save a string of the words + the nltk-identified pos
            data.at[row,f'headline_lemmas_{pos}'] = ' '.join([word+pos for word, pos_tag in pos_tags if convert_tag(pos_tag) == pos])
# The .csv file-save-to-pd-reload changes blank strings to NaN, so changing those back
data.fillna('',inplace=True)
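
The cell above loads `data.csv` when it exists but doesn't show how that file was written. A minimal sketch of the full load-or-compute pattern might look like the following. `load_or_build`, `build_fn`, and `build_demo` are hypothetical names, and writing with `index=False` is an assumption that keeps extra `Unnamed: 0` index columns from appearing on reload.

```python
import os
import tempfile

import pandas as pd


def load_or_build(path, build_fn):
    """Return the cached frame at `path`, or build it, save it, and return it."""
    try:
        cached = pd.read_csv(path)
    except FileNotFoundError:
        cached = build_fn()
        # index=False avoids writing the row index as an extra unnamed column
        cached.to_csv(path, index=False)
    # Empty strings come back from CSV as NaN, so restore them
    return cached.fillna('')


def build_demo():
    # Stand-in for the slow lemmatization loop (hypothetical data)
    return pd.DataFrame({'category': ['COMEDY'], 'headline_lemmas_v': ['']})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.csv')
    first = load_or_build(path, build_demo)    # builds and saves
    second = load_or_build(path, build_demo)   # loads from disk
    print(list(second.columns))
```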

We can move on to our models. . .

## Feature Engineering & Train/Test Split

We'll import the libraries we need.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

I'll work from a copy of the data so far, in case we need to come back to it.  `data2` will be what we build our feature set and targets from.

In [18]:
data2 = data.copy()

Our original texts have served their purpose, so we'll drop them off to make working with data2 a little easier.

In [19]:
# errors='ignore' skips any of these columns that are already gone
data2.drop(columns=['headline','headline_text'], errors='ignore', inplace=True)

In [20]:
data2.head(3)

Unnamed: 0.1,Unnamed: 0,category,headline_len,headline_lemmas_v,headline_lemmas_n,headline_lemmas_a,headline_lemmas_r
0,0,U.S. NEWS,11,rollv,rolln sleeven covidn boostern,omicron-targeteda,upr
1,1,U.S. NEWS,14,chargev,airlinen flyern lifen flightn attendantn videon,americana,flyerr
2,2,COMEDY,15,,tweetn catn dogn weekn sept.n,funnya,


I want to combine all of the lemmatized, pos-coded words into one string to allow our vectorizer to work on it.  I'll work with just the columns that have those terms, and then I'll concatenate all of these words into a new column called 'headline_lemmas.'

In [21]:
headline_lemma_col_idx = [i for i,col in enumerate(data2.columns) if col.find('headline_lemmas') != -1]
headline_lemma_cols = [col for i,col in enumerate(data2.columns) if col.find('headline_lemmas') != -1]

In [22]:
data2['headline_lemmas'] = ''
for col_idx in headline_lemma_col_idx:
    data2['headline_lemmas'] += data2.iloc[:,col_idx] + ' '
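
The same concatenation can be written without positional indexes. As a sketch on a toy frame (the two rows are made-up values in the same format), a row-wise join that skips empty strings also avoids the doubled and trailing spaces the loop above leaves behind:

```python
import pandas as pd

toy = pd.DataFrame({
    'category': ['U.S. NEWS', 'COMEDY'],
    'headline_lemmas_v': ['rollv', ''],
    'headline_lemmas_n': ['rolln sleeven', 'tweetn catn'],
    'headline_lemmas_a': ['', 'funnya'],
})

# Select the lemma columns by name prefix
lemma_cols = [col for col in toy.columns if col.startswith('headline_lemmas_')]

# Join the non-empty pieces of each row with single spaces
toy['headline_lemmas'] = toy[lemma_cols].apply(
    lambda row: ' '.join(piece for piece in row if piece), axis=1
)
print(toy['headline_lemmas'].tolist())
```

The extra spaces are harmless to a whitespace-splitting vectorizer, so this is a tidiness choice rather than a fix.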

Now we'll drop the columns that had the lemmas-pos words, as we don't need them any more.  We'll take a look at our data again.

In [23]:
data2.drop(columns=headline_lemma_cols,inplace=True)

In [24]:
data2.head(5)

Unnamed: 0.1,Unnamed: 0,category,headline_len,headline_lemmas
0,0,U.S. NEWS,11,rollv rolln sleeven covidn boostern omicron-ta...
1,1,U.S. NEWS,14,chargev airlinen flyern lifen flightn attendan...
2,2,COMEDY,15,tweetn catn dogn weekn sept.n funnya
3,3,PARENTING,11,tweetn parentn weekn sept.n funnya
4,4,U.S. NEWS,11,callv womann copn losesn lawsuitn ex-employern...


Let's construct the feature set and target, essentially just separating the category label column from the rest of the dataframe. 

In [25]:
X = data2.copy()
y = X.pop('category')

We'll create a split of our data and use the Tfidf Vectorizer to take each term in our lemmas-pos combo and create a sparse matrix relating to the occurrence of each term in a given headline/instance.

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=126)
X_train1 = X_train.pop('headline_lemmas')
# Instantiate vectorizer and fit it on the lemma-pos terms
vect_Tfidf = TfidfVectorizer().fit(X_train1)
features = list(vect_Tfidf.get_feature_names_out())
X_train1_v = vect_Tfidf.transform(X_train1)

I want to add the length of the headline back in, so we'll create a function that will let us do that while keeping the sparse nature of the matrix.

In [72]:
def add_feature(X, feature_to_add):
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
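
A quick check of what `add_feature` does, using a tiny made-up matrix: the extra values become one new rightmost column, and the result stays in CSR format.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack


def add_feature(X, feature_to_add):
    # Turn the 1-D feature into an (n_rows, 1) sparse column and append it
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')


X_demo = csr_matrix(np.array([[0.0, 0.5], [0.7, 0.0], [0.0, 0.0]]))
lengths = [11, 14, 15]   # made-up headline lengths

X_plus = add_feature(X_demo, lengths)
print(X_plus.shape)
print(X_plus.toarray())
```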

Now we can add that length of the headline back in.

In [28]:
X_train_v = add_feature(X_train1_v, X_train['headline_len'])

We'll take the same steps as above, now for our testing set.

In [29]:
X_test1 = X_test.pop('headline_lemmas')
X_test1_v = vect_Tfidf.transform(X_test1)
X_test_v = add_feature(X_test1_v, X_test['headline_len'])

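Tf-Idf isn't the only vectorizer we can use.  We can also try out the CountVectorizer, which stores raw counts instead of Tf-Idf weights.  Here I'll also have it consider n-grams: with `analyzer='char_wb'` and `ngram_range=(2,5)`, these are sequences of 2 to 5 characters taken within word boundaries.  This will help to see if sub-word character sequences are helpful.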

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
vect_ct = CountVectorizer(min_df=2, ngram_range=(2,5), analyzer='char_wb').fit(X_train1)
features = list(vect_ct.get_feature_names_out())
X_train1_vc = vect_ct.transform(X_train1)
X_train_vc = add_feature(X_train1_vc, X_train['headline_len'])
X_test1_vc = vect_ct.transform(X_test1)
X_test_vc = add_feature(X_test1_vc, X_test['headline_len'])
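
With `analyzer='char_wb'`, the n-grams are runs of characters taken inside each whitespace-delimited word (padded with a space on each side), not sequences of words. A short sketch with a narrower range, `ngram_range=(2, 3)`, chosen only to keep the output readable:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same analyzer as above, but a smaller n-gram range for readability (assumption)
demo_vect = CountVectorizer(analyzer='char_wb', ngram_range=(2, 3))
analyze = demo_vect.build_analyzer()

# The analyzer shows how one string is split before any counting happens
grams = analyze('catn dogn')
print(grams)
```

Because no n-gram spans the space between `catn` and `dogn`, these features capture word fragments rather than word order.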

## Grid Search & Model Parameter Tuning

Time to build some classifiers.  We'll try 4 different classifiers, and we'll keep track of how they do to see which is best (based on accuracy on the testing split).  I'll use `GridSearchCV` to tune the parameters to get the best cross-validated fit we can on the training data, and then guard against overfitting by judging the "best" model on the testing split, as noted.  I'll print a classification report for each as well, but mostly for reference.  Again, accuracy is what I'm looking for.

In [31]:
best_test_score = 0
best_model = None
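
Each model below repeats the same fit, report, and compare steps. As a sketch of that pattern, the helper below wraps one round. `evaluate_grid` is a hypothetical name, and the toy headlines and labels are made up so the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB


def evaluate_grid(estimator, params, X_tr, y_tr, X_te, y_te, cv=2):
    """Grid-search on the training split, then score the refit best model on the held-out split."""
    grid = GridSearchCV(estimator, param_grid=params, cv=cv).fit(X_tr, y_tr)
    return grid.best_estimator_, grid.best_estimator_.score(X_te, y_te)


# Made-up pos-coded headlines and labels (assumption, for illustration only)
train_docs = ['votev senaten', 'electv presidentn', 'senaten billn', 'votev billn',
              'bakev caken', 'cookv pastan', 'caken recipen', 'bakev pastan']
train_y = ['POLITICS'] * 4 + ['FOOD & DRINK'] * 4
test_docs = ['senaten votev', 'pastan recipen']
test_y = ['POLITICS', 'FOOD & DRINK']

vect = CountVectorizer().fit(train_docs)
X_tr, X_te = vect.transform(train_docs), vect.transform(test_docs)

# Keep whichever model scores highest on the held-out split
best_test_score, best_model = 0, None
model, test_score = evaluate_grid(MultinomialNB(), {'alpha': [0.1, 1.0]},
                                  X_tr, train_y, X_te, test_y)
if test_score > best_test_score:
    best_test_score, best_model = test_score, model
print(best_model, best_test_score)
```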

#### Classifier 1 -- Multinomial Naive Bayes

We'll run this twice, once with the training/test sets vectorized with Tf-Idf, and the other with the sets vectorized with Count.

In [32]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
cv = 3
mnb_params = {'alpha':[0.01, 0.1, 1.0]}

In [33]:
# Tf-Idf-Vectorized sets
grid = GridSearchCV(mnb, param_grid=mnb_params, cv=cv, n_jobs=-1, verbose=2).fit(X_train_v, y_train)
print('Best (Training) Params:', grid.best_params_)
print('Best (Training) Estimator:', grid.best_estimator_)
print('Best (Training) Score:', grid.best_estimator_.score(X_train_v, y_train))
print('Validation Score:',grid.best_estimator_.score(X_test_v,y_test),'\n')
test_score = grid.best_estimator_.score(X_test_v,y_test)
print(classification_report(y_test,grid.best_estimator_.predict(X_test_v)))
# Save the best version on the validation data
if test_score > best_test_score:
    best_test_score = test_score
    best_model = grid.best_estimator_
    best_vect = vect_Tfidf

Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best (Training) Params: {'alpha': 0.01}
Best (Training) Estimator: MultinomialNB(alpha=0.01)
Best (Training) Score: 0.8281843992108175
Validation Score: 0.5926868520528463 

                precision    recall  f1-score   support

ARTS & CULTURE       0.52      0.30      0.38      1187
  BLACK VOICES       0.45      0.24      0.31      1350
      BUSINESS       0.55      0.34      0.42      1802
        COMEDY       0.47      0.27      0.34      1564
         CRIME       0.56      0.45      0.50      1074
     EDUCATION       0.39      0.08      0.13       300
 ENTERTAINMENT       0.56      0.69      0.62      5190
   ENVIRONMENT       0.55      0.20      0.30       485
  FOOD & DRINK       0.74      0.66      0.70      1886
HEALTHY LIVING       0.30      0.13      0.18      1991
 HOME & LIVING       0.71      0.52      0.60      1323
     PARENTING       0.54      0.57      0.56      3837
      POLITICS       0.61      0.85   

In [34]:
# Count-Vectorized sets
mnbc = MultinomialNB()
grid = GridSearchCV(mnbc, param_grid=mnb_params, cv=cv, n_jobs=-1, verbose=2).fit(X_train_vc, y_train)
print('Best (Training) Params:', grid.best_params_)
print('Best (Training) Estimator:', grid.best_estimator_)
print('Best (Training) Score:', grid.best_estimator_.score(X_train_vc, y_train))
print('Validation Score:',grid.best_estimator_.score(X_test_vc,y_test),'\n')
test_score = grid.best_estimator_.score(X_test_vc,y_test)
print(classification_report(y_test,grid.best_estimator_.predict(X_test_vc)))
# Save the best version on the validation data
if test_score > best_test_score:
    best_test_score = test_score
    best_model = grid.best_estimator_
    best_vect = vect_ct

Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best (Training) Params: {'alpha': 1.0}
Best (Training) Estimator: MultinomialNB()
Best (Training) Score: 0.6676611648460934
Validation Score: 0.609622366106516 

                precision    recall  f1-score   support

ARTS & CULTURE       0.49      0.37      0.42      1187
  BLACK VOICES       0.41      0.38      0.39      1350
      BUSINESS       0.46      0.49      0.48      1802
        COMEDY       0.44      0.41      0.42      1564
         CRIME       0.46      0.60      0.52      1074
     EDUCATION       0.59      0.12      0.20       300
 ENTERTAINMENT       0.60      0.64      0.62      5190
   ENVIRONMENT       0.62      0.19      0.29       485
  FOOD & DRINK       0.68      0.68      0.68      1886
HEALTHY LIVING       0.29      0.20      0.24      1991
 HOME & LIVING       0.63      0.66      0.65      1323
     PARENTING       0.55      0.66      0.60      3837
      POLITICS       0.78      0.72      0.75     

#### Classifier 2 -- Decision Tree

In [35]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
cv = 3
tree_params = {'max_depth':[6,12,18]}

In [36]:
grid = GridSearchCV(tree, param_grid=tree_params, cv=cv, n_jobs=-1, verbose=2).fit(X_train_v, y_train)
print('Best (Training) Params:', grid.best_params_)
print('Best (Training) Estimator:', grid.best_estimator_)
print('Best (Training) Score:', grid.best_estimator_.score(X_train_v, y_train))
print('Validation Score:',grid.best_estimator_.score(X_test_v,y_test),'\n')
test_score = grid.best_estimator_.score(X_test_v,y_test)
print(classification_report(y_test,grid.best_estimator_.predict(X_test_v)))
# Save the best version on the validation data
if test_score > best_test_score:
    best_test_score = test_score
    best_model = grid.best_estimator_
    best_vect = vect_Tfidf

Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best (Training) Params: {'max_depth': 18}
Best (Training) Estimator: DecisionTreeClassifier(max_depth=18)
Best (Training) Score: 0.30942527629405064
Validation Score: 0.30075975897301543 



                precision    recall  f1-score   support

ARTS & CULTURE       0.98      0.03      0.07      1187
  BLACK VOICES       0.52      0.16      0.24      1350
      BUSINESS       0.28      0.00      0.01      1802
        COMEDY       0.68      0.08      0.14      1564
         CRIME       0.46      0.01      0.01      1074
     EDUCATION       0.00      0.00      0.00       300
 ENTERTAINMENT       0.46      0.01      0.01      5190
   ENVIRONMENT       0.96      0.10      0.18       485
  FOOD & DRINK       0.90      0.17      0.29      1886
HEALTHY LIVING       0.00      0.00      0.00      1991
 HOME & LIVING       0.89      0.18      0.30      1323
     PARENTING       0.67      0.29      0.40      3837
      POLITICS       0.26      0.83      0.40     10606
  QUEER VOICES       0.80      0.17      0.28      1885
      RELIGION       0.87      0.05      0.09       728
SCIENCE & TECH       0.25      0.00      0.00      1297
        SPORTS       0.00      0.00      0.00  


In [37]:
treec = DecisionTreeClassifier()
grid = GridSearchCV(treec, param_grid=tree_params, cv=cv, n_jobs=-1, verbose=2).fit(X_train_vc, y_train)
print('Best (Training) Params:', grid.best_params_)
print('Best (Training) Estimator:', grid.best_estimator_)
print('Best (Training) Score:', grid.best_estimator_.score(X_train_vc, y_train))
print('Validation Score:',grid.best_estimator_.score(X_test_vc,y_test),'\n')
test_score = grid.best_estimator_.score(X_test_vc,y_test)
print(classification_report(y_test,grid.best_estimator_.predict(X_test_vc)))
# Save the best version on the validation data
if test_score > best_test_score:
    best_test_score = test_score
    best_model = grid.best_estimator_
    best_vect = vect_ct

Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best (Training) Params: {'max_depth': 18}
Best (Training) Estimator: DecisionTreeClassifier(max_depth=18)
Best (Training) Score: 0.3288901721123462
Validation Score: 0.3146824357198997 

                precision    recall  f1-score   support

ARTS & CULTURE       0.38      0.01      0.01      1187
  BLACK VOICES       0.53      0.02      0.04      1350
      BUSINESS       0.38      0.00      0.01      1802
        COMEDY       0.65      0.09      0.17      1564
         CRIME       0.40      0.00      0.01      1074
     EDUCATION       0.00      0.00      0.00       300
 ENTERTAINMENT       0.30      0.01      0.01      5190
   ENVIRONMENT       0.84      0.10      0.17       485
  FOOD & DRINK       0.88      0.18      0.30      1886
HEALTHY LIVING       0.45      0.01      0.01      1991
 HOME & LIVING       0.89      0.17      0.28      1323
     PARENTING       0.68      0.32      0.43      3837
      POLITICS       0.31

#### Classifier 3 -- Random Forest

In [38]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
cv = 2
mf = int(np.sqrt(len(features)))
rf_params = {
    'n_estimators':[100,200],
    'max_depth':[15,25],
    'max_features':[mf]
            }

In [39]:
grid = GridSearchCV(rf, param_grid=rf_params, cv=cv, n_jobs=-1, verbose=2).fit(X_train_v, y_train)
print('Best (Training) Params:', grid.best_params_)
print('Best (Training) Estimator:', grid.best_estimator_)
print('Best (Training) Score:', grid.best_estimator_.score(X_train_v, y_train))
print('Validation Score:',grid.best_estimator_.score(X_test_v,y_test),'\n')
test_score = grid.best_estimator_.score(X_test_v,y_test)
print(classification_report(y_test,grid.best_estimator_.predict(X_test_v)))
# Save the best version on the validation data
if test_score > best_test_score:
    best_test_score = test_score
    best_model = grid.best_estimator_
    best_vect = vect_Tfidf

Fitting 2 folds for each of 4 candidates, totalling 8 fits
Best (Training) Params: {'max_depth': 25, 'max_features': 344, 'n_estimators': 200}
Best (Training) Estimator: RandomForestClassifier(max_depth=25, max_features=344, n_estimators=200)
Best (Training) Score: 0.2778820396836854
Validation Score: 0.2704442531531869 



                precision    recall  f1-score   support

ARTS & CULTURE       0.00      0.00      0.00      1187
  BLACK VOICES       0.00      0.00      0.00      1350
      BUSINESS       0.00      0.00      0.00      1802
        COMEDY       0.00      0.00      0.00      1564
         CRIME       0.00      0.00      0.00      1074
     EDUCATION       0.00      0.00      0.00       300
 ENTERTAINMENT       0.81      0.06      0.12      5190
   ENVIRONMENT       0.00      0.00      0.00       485
  FOOD & DRINK       0.92      0.16      0.27      1886
HEALTHY LIVING       0.00      0.00      0.00      1991
 HOME & LIVING       1.00      0.04      0.08      1323
     PARENTING       0.76      0.24      0.37      3837
      POLITICS       0.22      1.00      0.36     10606
  QUEER VOICES       0.89      0.05      0.09      1885
      RELIGION       0.00      0.00      0.00       728
SCIENCE & TECH       0.00      0.00      0.00      1297
        SPORTS       0.00      0.00      0.00  


In [40]:
rfc = RandomForestClassifier()
grid = GridSearchCV(rfc, param_grid=rf_params, cv=cv, n_jobs=-1, verbose=2).fit(X_train_vc, y_train)
print('Best (Training) Params:', grid.best_params_)
print('Best (Training) Estimator:', grid.best_estimator_)
print('Best (Training) Score:', grid.best_estimator_.score(X_train_vc, y_train))
print('Validation Score:',grid.best_estimator_.score(X_test_vc,y_test),'\n')
test_score = grid.best_estimator_.score(X_test_vc,y_test)
print(classification_report(y_test,grid.best_estimator_.predict(X_test_vc)))
# Save the best version on the validation data
if test_score > best_test_score:
    best_test_score = test_score
    best_model = grid.best_estimator_
    best_vect = vect_ct

Fitting 2 folds for each of 4 candidates, totalling 8 fits
Best (Training) Params: {'max_depth': 25, 'max_features': 344, 'n_estimators': 100}
Best (Training) Estimator: RandomForestClassifier(max_depth=25, max_features=344)
Best (Training) Score: 0.4236401841425661
Validation Score: 0.39019050114150977 



                precision    recall  f1-score   support

ARTS & CULTURE       1.00      0.01      0.01      1187
  BLACK VOICES       0.54      0.14      0.22      1350
      BUSINESS       0.90      0.03      0.05      1802
        COMEDY       0.79      0.09      0.16      1564
         CRIME       0.00      0.00      0.00      1074
     EDUCATION       0.00      0.00      0.00       300
 ENTERTAINMENT       0.68      0.23      0.34      5190
   ENVIRONMENT       1.00      0.10      0.18       485
  FOOD & DRINK       0.90      0.23      0.37      1886
HEALTHY LIVING       0.00      0.00      0.00      1991
 HOME & LIVING       0.89      0.23      0.36      1323
     PARENTING       0.66      0.47      0.55      3837
      POLITICS       0.28      0.96      0.43     10606
  QUEER VOICES       0.84      0.41      0.55      1885
      RELIGION       1.00      0.06      0.11       728
SCIENCE & TECH       1.00      0.02      0.04      1297
        SPORTS       0.74      0.02      0.03  


#### Classifier 4 -- AdaBoost

In [41]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier()
cv = 2
ada_params = {
#     'n_estimators':[50,150],
#     'learning_rate':[0.1,0.3],
}

In [42]:
grid = GridSearchCV(ada, param_grid=ada_params, cv=cv, n_jobs=-1, verbose=2).fit(X_train_v, y_train)
print('Best (Training) Params:', grid.best_params_)
print('Best (Training) Estimator:', grid.best_estimator_)
print('Best (Training) Score:', grid.best_estimator_.score(X_train_v, y_train))
print('Validation Score:',grid.best_estimator_.score(X_test_v,y_test),'\n')
test_score = grid.best_estimator_.score(X_test_v,y_test)
print(classification_report(y_test,grid.best_estimator_.predict(X_test_v)))
# Save the best version on the validation data
if test_score > best_test_score:
    best_test_score = test_score
    best_model = grid.best_estimator_
    best_vect = vect_Tfidf

Fitting 2 folds for each of 1 candidates, totalling 2 fits
Best (Training) Params: {}
Best (Training) Estimator: AdaBoostClassifier()
Best (Training) Score: 0.32892225269877934
Validation Score: 0.3297279089786294 



                precision    recall  f1-score   support

ARTS & CULTURE       0.54      0.16      0.25      1187
  BLACK VOICES       0.47      0.17      0.25      1350
      BUSINESS       0.56      0.06      0.11      1802
        COMEDY       0.23      0.19      0.21      1564
         CRIME       0.34      0.27      0.31      1074
     EDUCATION       0.30      0.14      0.19       300
 ENTERTAINMENT       0.43      0.00      0.00      5190
   ENVIRONMENT       0.33      0.00      0.01       485
  FOOD & DRINK       0.61      0.26      0.37      1886
HEALTHY LIVING       0.00      0.00      0.00      1991
 HOME & LIVING       0.55      0.37      0.44      1323
     PARENTING       0.63      0.41      0.50      3837
      POLITICS       0.28      0.77      0.42     10606
  QUEER VOICES       0.79      0.36      0.50      1885
      RELIGION       0.55      0.24      0.33       728
SCIENCE & TECH       0.61      0.05      0.10      1297
        SPORTS       0.59      0.06      0.10  



In [43]:
adac = AdaBoostClassifier()
grid = GridSearchCV(adac, param_grid=ada_params, cv=cv, n_jobs=-1, verbose=2).fit(X_train_vc, y_train)
print('Best (Training) Params:', grid.best_params_)
print('Best (Training) Estimator:', grid.best_estimator_)
print('Best (Training) Score:', grid.best_estimator_.score(X_train_vc, y_train))
print('Validation Score:',grid.best_estimator_.score(X_test_vc,y_test),'\n')
test_score = grid.best_estimator_.score(X_test_vc,y_test)
print(classification_report(y_test,grid.best_estimator_.predict(X_test_vc)))
# Save the best version on the validation data
if test_score > best_test_score:
    best_test_score = test_score
    best_model = grid.best_estimator_
    best_vect = vect_ct

Fitting 2 folds for each of 1 candidates, totalling 2 fits
Best (Training) Params: {}
Best (Training) Estimator: AdaBoostClassifier()
Best (Training) Score: 0.34437707521293487
Validation Score: 0.3426213555896553 





                precision    recall  f1-score   support

ARTS & CULTURE       0.49      0.17      0.25      1187
  BLACK VOICES       0.45      0.18      0.26      1350
      BUSINESS       0.57      0.06      0.11      1802
        COMEDY       0.15      0.10      0.12      1564
         CRIME       0.35      0.23      0.27      1074
     EDUCATION       0.31      0.16      0.21       300
 ENTERTAINMENT       0.00      0.00      0.00      5190
   ENVIRONMENT       0.33      0.00      0.01       485
  FOOD & DRINK       0.64      0.35      0.45      1886
HEALTHY LIVING       0.00      0.00      0.00      1991
 HOME & LIVING       0.57      0.41      0.48      1323
     PARENTING       0.60      0.43      0.50      3837
      POLITICS       0.33      0.66      0.44     10606
  QUEER VOICES       0.71      0.47      0.56      1885
      RELIGION       0.52      0.28      0.36       728
SCIENCE & TECH       0.00      0.00      0.00      1297
        SPORTS       0.53      0.19      0.28  



### Analysis

Let's see who "won":

In [44]:
print("Best Model:",best_model)
print("Best Model Parameters:",best_model.get_params())
print("Best Test Score:",round(best_test_score,3))
print("Best Vectorizer:",best_vect)

Best Model: MultinomialNB()
Best Model Parameters: {'alpha': 1.0, 'class_prior': None, 'fit_prior': True, 'force_alpha': 'warn'}
Best Test Score: 0.61
Best Vectorizer: CountVectorizer(analyzer='char_wb', min_df=2, ngram_range=(2, 5))
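
This winner comes out of the same "keep the best so far" check that is repeated at the end of each classifier cell. As a minimal sketch, that bookkeeping could be folded into one small helper. `BestRun` and its `update` method are hypothetical names of mine, the strings stand in for fitted estimators and vectorizers, and the scores are rounded from the ones reported in this notebook.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BestRun:
    """Holds the best-scoring (model, vectorizer) pair seen so far."""
    score: float = 0.0
    model: Optional[Any] = None
    vect: Optional[Any] = None

    def update(self, score, model, vect):
        """Replace the stored run only if `score` is strictly higher."""
        if score > self.score:
            self.score, self.model, self.vect = score, model, vect
            return True
        return False


best = BestRun()
# Placeholder strings instead of real fitted objects (illustration only).
best.update(0.610, 'MultinomialNB + counts', 'vect_ct')
best.update(0.330, 'AdaBoost + tf-idf', 'vect_Tfidf')   # lower score, ignored
best.update(0.343, 'AdaBoost + counts', 'vect_ct')      # lower score, ignored
print(best.score, best.model)
```

With something like this, each cell would end in a single `best.update(test_score, grid.best_estimator_, vect_ct)` call instead of a four-line `if` block.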


The best accuracy score, 0.61, isn't that great, to be honest.  It comes from the CountVectorizer paired with the Multinomial Naive Bayes classifier.  Let's see if we can do a little better by iterating over a few parameters of both the vectorizer and the model.

### Refinement - Classifier & Vectorizer

In [60]:
mnb_params = {'alpha':[0.1, 1.0, 10]}
min_dfs = [1,2,3,4]
ngram_ranges = [(x,y) for x in [2,3,4] for y in [5,6]]
# One progress step per vectorizer setting; alpha is searched inside GridSearchCV
iters = len(min_dfs) * len(ngram_ranges)
count = 1
best_mnb_score = 0
best_mnb = None
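
As a rough check on how much work this search implies, the sketch below counts the vectorizer settings and the underlying model fits. It assumes `cv = 2` (as set in the AdaBoost cell) and that `GridSearchCV` fits every `alpha` once per fold and then refits the winner once, which is its default `refit=True` behavior. The names `vectorizer_combos`, `n_alphas`, and `fits_per_combo` are mine.

```python
from itertools import product

mnb_params = {'alpha': [0.1, 1.0, 10]}
min_dfs = [1, 2, 3, 4]
ngram_ranges = [(x, y) for x in [2, 3, 4] for y in [5, 6]]
cv = 2  # assumed: same fold count as the earlier cells

# One CountVectorizer is fitted per (min_df, ngram_range) pair.
vectorizer_combos = list(product(min_dfs, ngram_ranges))

# Number of alpha candidates tried inside each grid search.
n_alphas = len(mnb_params['alpha'])

# Each grid search fits every alpha on every fold, then refits the winner once.
fits_per_combo = n_alphas * cv + 1

print(len(vectorizer_combos))                   # 24
print(len(vectorizer_combos) * fits_per_combo)  # 168
```

So the 24 progress steps correspond to vectorizer settings, while the three `alpha` values are compared inside each step.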

In [61]:
print('Starting first vectorizer/model combo. . .')
for min_df in min_dfs:
    for ngram_range in ngram_ranges:
        vect_ct = CountVectorizer(min_df=min_df, ngram_range=ngram_range, analyzer='char_wb').fit(X_train1)
        features = list(vect_ct.get_feature_names_out())
        X_train1_vc = vect_ct.transform(X_train1)
        X_train_vc = add_feature(X_train1_vc, X_train['headline_len'])
        X_test1_vc = vect_ct.transform(X_test1)
        X_test_vc = add_feature(X_test1_vc, X_test['headline_len'])
        mnbc = MultinomialNB()
        grid = GridSearchCV(mnbc, param_grid=mnb_params, cv=cv, n_jobs=-1, verbose=0).fit(X_train_vc, y_train)
        test_score = grid.best_estimator_.score(X_test_vc,y_test)
        # Save the best version on the validation data
        if test_score > best_mnb_score:
            best_mnb_score = test_score
            best_mnb = grid.best_estimator_
            best_vect_min_df, best_vect_ngram_range = (min_df, ngram_range)
            best_vect = vect_ct
        print(count,'of',iters,'complete')
        count += 1

Starting first vectorizer/model combo. . .
1 of 24 complete
2 of 24 complete
3 of 24 complete
4 of 24 complete
5 of 24 complete
6 of 24 complete
7 of 24 complete
8 of 24 complete
9 of 24 complete
10 of 24 complete
11 of 24 complete
12 of 24 complete
13 of 24 complete
14 of 24 complete
15 of 24 complete
16 of 24 complete
17 of 24 complete
18 of 24 complete
19 of 24 complete
20 of 24 complete
21 of 24 complete
22 of 24 complete
23 of 24 complete
24 of 24 complete


In [66]:
print("Best Model Parameters:",best_mnb.get_params())
print("Best Test Score:",round(best_mnb_score,3))
print("Best Count Vectorizer:",best_vect)
print("Best Count Vectorizer min_df:",best_vect_min_df)
print("Best Count Vectorizer ngram_range:",best_vect_ngram_range)

Best Model Parameters: {'alpha': 1.0, 'class_prior': None, 'fit_prior': True, 'force_alpha': 'warn'}
Best Test Score: 0.622
Best Count Vectorizer: CountVectorizer(analyzer='char_wb', min_df=2, ngram_range=(4, 6))
Best Count Vectorizer min_df: 2
Best Count Vectorizer ngram_range: (4, 6)
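
To make the winning vectorizer setting concrete, here is a minimal pure-Python sketch of what `analyzer='char_wb'` with `ngram_range=(4, 6)` extracts from text. It is my approximation of scikit-learn's behavior (each whitespace-separated word is padded with one space on each side, and n-grams never cross word boundaries), not the library's own code. It lowercases the input, as `CountVectorizer` does by default, and `char_wb_ngrams` is a hypothetical helper name.

```python
def char_wb_ngrams(text, ngram_range=(4, 6)):
    """Approximate char_wb analysis: character n-grams inside padded words."""
    min_n, max_n = ngram_range
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            if n >= len(padded):
                # The padded word is no longer than n: emit it once and stop.
                ngrams.append(padded)
                break
            for i in range(len(padded) - n + 1):
                ngrams.append(padded[i:i + n])
    return ngrams


print(char_wb_ngrams("data"))
# [' dat', 'data', 'ata ', ' data', 'data ', ' data ']
```

Each word therefore contributes several overlapping features, which is why even a short headline produces a long feature vector.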


This additional iteration didn't do a whole lot better, unfortunately: the test score only moved from 0.61 to 0.622.  I suspect that there are other classifiers and larger parameter sets we could try.  For the model we have, let's take a look at classifying an example headline.  I've pulled it from [the Oregonian](https://www.oregonlive.com/silicon-forest/2024/02/amazon-will-start-buying-clean-power-for-oregon-data-centers.html) in February 2024:

**"Amazon will start buying clean power for Oregon data centers"**

In [89]:
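headline = ['Amazon will start buying clean power for Oregon data centers']
headline_matrix = best_vect.transform(headline)
# Caution: len(headline) is the number of headlines in the list (1),
# whereas training used X_train['headline_len'] for this extra feature
headline_matrix = add_feature(headline_matrix,len(headline))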

In [112]:
best_mnb.predict(headline_matrix)

array(['BUSINESS'], dtype='<U14')

We can look at the probabilities that the classifier assigned to the categories:

In [123]:
probs = best_mnb.predict_proba(headline_matrix).tolist()[0]
categories = best_mnb.classes_
cat_probs = list(zip(probs,categories))
cat_probs_sorted = sorted(cat_probs, key=lambda x:x[0], reverse=True)
cat_probs_sorted

[(0.8004469844534391, 'BUSINESS'),
 (0.19955296623348417, 'HOME & LIVING'),
 (4.90455007158537e-08, 'SCIENCE & TECH'),
 (2.642809075639102e-10, 'CRIME'),
 (1.718235586647692e-12, 'TRAVEL'),
 (4.081351228446005e-13, 'FOOD & DRINK'),
 (3.534918987687584e-13, 'QUEER VOICES'),
 (3.2660277963371936e-13, 'SPORTS'),
 (1.7792301849396431e-13, 'POLITICS'),
 (1.617376383771402e-13, 'BLACK VOICES'),
 (1.416503103489661e-13, 'U.S. NEWS'),
 (4.146021706449459e-15, 'ARTS & CULTURE'),
 (2.5904616444714967e-15, 'HEALTHY LIVING'),
 (1.7679519983495307e-15, 'ENVIRONMENT'),
 (5.358270905661502e-16, 'COMEDY'),
 (1.523251423355252e-16, 'PARENTING'),
 (3.793212117096159e-17, 'STYLE & BEAUTY'),
 (5.975856729346842e-18, 'WORLD NEWS'),
 (2.282262281622486e-21, 'RELIGION'),
 (2.1229280193398124e-21, 'WELLNESS'),
 (8.48840044116647e-23, 'ENTERTAINMENT'),
 (1.4969827965042132e-24, 'EDUCATION')]

The classifier puts about 80% of the probability on BUSINESS (nearly all of the rest goes to HOME & LIVING), and I'd say it got it right.  I think one of the other difficulties with this project is that many of the categories have significant overlap.  Looking at the article descriptions in addition to the headlines might help refine our predictions further.
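
Probabilities this lopsided (the third-ranked category is already below 1e-7) are typical of naive Bayes. Each class score is a log prior plus a sum of per-feature log-likelihoods, so many small per-feature preferences compound into a very peaked posterior, and overlapping character n-grams are far from independent. The toy sketch below illustrates the effect with made-up numbers: two classes, and every feature only 1.5 times more likely under class `A` than under class `B`. The function name `posterior` and the class labels are mine.

```python
import math


def posterior(log_joint):
    """Normalize per-class joint log-likelihoods with the log-sum-exp trick."""
    m = max(log_joint.values())
    log_total = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
    return {c: math.exp(v - log_total) for c, v in log_joint.items()}


# Made-up setup: each feature favors A over B by a factor of only 1.5.
log_ratio_per_feature = math.log(1.5)

for n_features in (1, 10, 100):
    log_joint = {'A': n_features * log_ratio_per_feature, 'B': 0.0}
    print(n_features, posterior(log_joint))
```

With one feature the split is 60/40, but with a hundred features class `B` is left with a vanishingly small probability, so the ranking of categories is more informative than the raw probability values.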

Let's try one more.  This one is from the [NYT](https://www.nytimes.com/2024/02/09/world/europe/ukraine-oleksandr-syrsky-war-russia.html):

**"Ukraine Has a New Military Commander but the Problems Haven't Changed"** 

In [124]:
headline = ["Ukraine Has a New Military Commander but the Problems Haven't Changed"]
headline_matrix = best_vect.transform(headline)
headline_matrix = add_feature(headline_matrix,len(headline))

In [125]:
best_mnb.predict(headline_matrix)

array(['WORLD NEWS'], dtype='<U14')

In [127]:
probs = best_mnb.predict_proba(headline_matrix).tolist()[0]
categories = best_mnb.classes_
cat_probs = list(zip(probs,categories))
cat_probs_sorted = sorted(cat_probs, key=lambda x:x[0], reverse=True)
cat_probs_sorted

[(0.9999999999998863, 'WORLD NEWS'),
 (6.314513004240485e-14, 'POLITICS'),
 (1.3508612418655215e-34, 'BUSINESS'),
 (3.724944661110938e-37, 'SCIENCE & TECH'),
 (2.8724250308358754e-37, 'WELLNESS'),
 (2.076527274719885e-37, 'QUEER VOICES'),
 (2.059427430381407e-39, 'COMEDY'),
 (6.943702428223902e-41, 'RELIGION'),
 (3.138214188894739e-41, 'HEALTHY LIVING'),
 (8.911636673159475e-43, 'PARENTING'),
 (3.7026321481975896e-44, 'BLACK VOICES'),
 (2.4183324491183298e-45, 'ARTS & CULTURE'),
 (6.165830573136674e-46, 'ENVIRONMENT'),
 (5.772458470157757e-48, 'ENTERTAINMENT'),
 (8.445010613882301e-49, 'U.S. NEWS'),
 (1.4385972766589937e-49, 'CRIME'),
 (1.561874305038305e-51, 'SPORTS'),
 (3.7476205431717555e-52, 'STYLE & BEAUTY'),
 (4.586233166200243e-53, 'EDUCATION'),
 (1.589651020776957e-54, 'TRAVEL'),
 (3.464690527668069e-58, 'HOME & LIVING'),
 (7.396299667826973e-60, 'FOOD & DRINK')]

On the money