# Exploring the Data

We will explore classifying this news data with a simple classifier: logistic regression.
The classifier's input will be a bag-of-words representation built from each article's headline and short description.

In [24]:
import numpy as np
import pandas as pd
import time

## Load the Data

In [25]:
# The file is not a single valid JSON document: each line holds one
# JSON object. Join the lines into a JSON array so pandas can ingest it.

with open('./News_Category_Dataset_v2.json') as file:
    lines = file.readlines()
    json = f'[{",".join(lines)}]'


In [26]:
data_raw = pd.read_json(json, orient='records')
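
The join-and-wrap step above works because each line of the file is a standalone JSON object. As a minimal sketch of the same idea using only the standard library, the lines can also be parsed one at a time. The two sample records below are made up for illustration and are not taken from the dataset. If the file really is newline-delimited JSON, `pd.read_json(path, lines=True)` should be able to read it directly as well.

```python
import json as jsonlib  # aliased because the notebook reuses the name `json` for a string

# Two made-up records in the one-object-per-line layout the file appears to use.
sample_lines = [
    '{"category": "CRIME", "headline": "Example headline one"}\n',
    '{"category": "TRAVEL", "headline": "Example headline two"}\n',
]

# Parse each non-empty line on its own instead of building one big JSON array.
records = [jsonlib.loads(line) for line in sample_lines if line.strip()]

print(records[0]['category'])  # CRIME
print(len(records))            # 2
```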

## Exploring Categories

In [27]:
data_raw.head()

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26


In [28]:
data_raw['category'].value_counts()

POLITICS          32739
WELLNESS          17827
ENTERTAINMENT     16058
TRAVEL             9887
STYLE & BEAUTY     9649
PARENTING          8677
HEALTHY LIVING     6694
QUEER VOICES       6314
FOOD & DRINK       6226
BUSINESS           5937
COMEDY             5175
SPORTS             4884
BLACK VOICES       4528
HOME & LIVING      4195
PARENTS            3955
THE WORLDPOST      3664
WEDDINGS           3651
WOMEN              3490
IMPACT             3459
DIVORCE            3426
CRIME              3405
MEDIA              2815
WEIRD NEWS         2670
GREEN              2622
WORLDPOST          2579
RELIGION           2556
STYLE              2254
SCIENCE            2178
WORLD NEWS         2177
TASTE              2096
TECH               2082
MONEY              1707
ARTS               1509
FIFTY              1401
GOOD NEWS          1398
ARTS & CULTURE     1339
ENVIRONMENT        1323
COLLEGE            1144
LATINO VOICES      1129
CULTURE & ARTS     1030
EDUCATION          1004
Name: category, dtype: int64

We can see that the dominant class is Politics. What portion of news articles are classified as Politics?

In [29]:
f'{float((data_raw["category"] == "POLITICS").sum()) / len(data_raw["category"]) * 100:.02f}%'

'16.30%'

So as a baseline, we would expect our model to reach an accuracy of at least *16.30%*, which is what we would get by classifying every news article as Politics.
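
The baseline can be sketched as a tiny helper that scores an "always predict the most common label" classifier. The function name `majority_baseline_accuracy` and the toy labels are assumptions made for illustration; the final line only redoes the arithmetic with the counts reported in this notebook.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

# Toy labels, made up for illustration.
toy_labels = ['POLITICS', 'POLITICS', 'WELLNESS', 'TRAVEL']
print(majority_baseline_accuracy(toy_labels))  # 0.5

# Same ratio from the notebook's numbers: 32739 POLITICS articles out of 200853.
print(f'{32739 / 200853 * 100:.2f}%')  # 16.30%
```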

## Exploring Authors

In [30]:
data_raw['authors'].describe()

count     200853
unique     27993
top             
freq       36620
Name: authors, dtype: object

In [31]:
data_raw['authors'].value_counts()

                                                                                                 36620
Lee Moran                                                                                         2423
Ron Dicker                                                                                        1913
Reuters, Reuters                                                                                  1562
Ed Mazza                                                                                          1322
                                                                                                 ...  
Bonnie St. John, ContributorOlympic Ski Medalist, Amputee, Rhodes Scholar, former White Ho...        1
Basil Soper, ContributorTrans writer, intersectional activist, double cancer, and anim...            1
Jacqueline Herrera, Contributor\nCo-founder, Kitechild                                               1
Irwin Zalkin, ContributorAttorney                                        

A large portion of articles (36620 of 200853, about 18%) have no listed author. It would be good to get a sense of the distribution of articles written by repeat authors.

In [32]:
authors_dist = data_raw['authors'].value_counts()
authors_dist = authors_dist.drop('')

authors_dist.describe()

count    27992.000000
mean         5.867141
std         41.929860
min          1.000000
25%          1.000000
50%          1.000000
75%          3.000000
max       2423.000000
Name: authors, dtype: float64

There appears to be a long tail of single-article authors. What portion of articles contain an author? What portion of articles contain a repeating author?

In [33]:
repeat_authors = authors_dist[authors_dist > 1].index.values

count_articles = len(data_raw)
count_articles_with_authors = len(data_raw[data_raw['authors'] != ''])
count_articles_with_repeat_authors = len(data_raw[data_raw['authors'].isin(repeat_authors)])

print(f'{count_articles_with_authors / count_articles * 100:.02f}% of articles contain authors.')
print(f'{count_articles_with_repeat_authors / count_articles * 100:.02f}% of articles contain repeat authors.')

81.77% of articles contain authors.
73.45% of articles contain repeat authors.


## Tokenizing and Exploring Vocabulary

In [34]:
import string

from nltk.tokenize.regexp import WordPunctTokenizer

In [35]:
tokenizer = WordPunctTokenizer()

In [36]:
def cleanup_and_tokenize_text(text):
    cleaned = ''.join([c for c in text if c not in string.punctuation]).lower()
    return tokenizer.tokenize(cleaned)
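
To see what this cleanup does to a piece of text, here is a rough standard-library sketch. It assumes the regular expression below mirrors the pattern behind nltk's `WordPunctTokenizer`; the name `cleanup_and_tokenize_sketch` is hypothetical and is not part of the notebook.

```python
import re
import string

# Assumption: this pattern mirrors the one used by nltk's WordPunctTokenizer.
WORD_PUNCT_PATTERN = re.compile(r'\w+|[^\w\s]+')

def cleanup_and_tokenize_sketch(text):
    """Strip ASCII punctuation, lowercase, then split into tokens."""
    cleaned = text.translate(str.maketrans('', '', string.punctuation)).lower()
    return WORD_PUNCT_PATTERN.findall(cleaned)

print(cleanup_and_tokenize_sketch('Of course it has a song.'))
# ['of', 'course', 'it', 'has', 'a', 'song']
print(cleanup_and_tokenize_sketch("Don't stop"))
# ['dont', 'stop']
```

Because punctuation is deleted rather than replaced with a space, contractions such as "Don't" collapse into a single token. `string.punctuation` only covers ASCII characters, so other symbols (curly quotes, for example) would survive the cleanup and come out as separate tokens.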


In [37]:
def tokenize_rows(df):
    tokenized_headlines = df['headline'].apply(cleanup_and_tokenize_text).tolist()
    tokenized_desc = df['short_description'].apply(cleanup_and_tokenize_text).tolist()

    return [tokens1 + tokens2 for tokens1, tokens2 in zip(tokenized_headlines, tokenized_desc)]
    

In [70]:
def create_unigram_counts(rows):
    # Flatten
    tokens = [t for tokens in rows for t in tokens]
    
    counts = {}

    for token in tokens:
        if token not in counts:
            counts[token] = 0
        counts[token] += 1

    return counts
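
The same counting can likely be expressed more compactly with `collections.Counter`. This is an alternative sketch, not the notebook's implementation, and the function name is made up for illustration.

```python
from collections import Counter
from itertools import chain

def create_unigram_counts_with_counter(rows):
    """Count tokens across all rows without building an intermediate flat list."""
    return Counter(chain.from_iterable(rows))

toy_rows = [['the', 'cat'], ['the', 'dog', 'the']]
counts = create_unigram_counts_with_counter(toy_rows)
print(counts['the'])   # 3
print(counts['bird'])  # 0, a Counter reports missing tokens as zero
```

One behavioral difference to keep in mind: looking up an unseen token in a `Counter` returns 0, whereas the plain dictionary above raises a `KeyError`.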
    

In [71]:
def create_encoder_and_decoder(unigram_counts):
    encoder = {t:i for i,t in enumerate(unigram_counts.keys())}
    decoder = {i:t for t,i in encoder.items()}
    
    return encoder, decoder
    

In [92]:
def create_bow_dataframe(encoded_token_rows, encoder, decoder):
    bows = np.zeros((len(encoded_token_rows), len(encoder)))

    for i, encoded_tokens in enumerate(encoded_token_rows):
        for encoded in encoded_tokens:
            bows[i, encoded] += 1
    
    df = pd.DataFrame(data=bows)
    df.columns = [decoder[i] for i in range(len(decoder))]
    
    return df
    

### Unigram Counts

In [77]:
start_time = time.time()

print('[1/2] Tokenizing rows...')
token_rows = tokenize_rows(data_raw)

print('[2/2] Generating global unigram count...')
unigram_counts = create_unigram_counts(token_rows)

end_time = time.time()

print('Done!')
print(f'Ran in {(end_time - start_time)/60:.02f}m')


[1/2] Tokenizing rows...
[2/2] Generating global unigram count...
Done!
Ran in 0.20m


In [78]:
print(f'There are {len(unigram_counts)} unique tokens.')

There are 112586 unique tokens.


In [79]:
low_count_words = [w for w,c in unigram_counts.items() if c <= 5]

print(f'There are {len(low_count_words)} low count words.')

There are 80763 low count words.


### Removing Low-Frequency Words

More than two-thirds of our vocabulary (80763 of 112586 tokens) consists of words that appear 5 or fewer times in the entire corpus. These words could slow down learning dramatically while providing little signal, so we will collapse them into a single special token.

In [80]:
# Special token for tokens that occur 5 or fewer times in the
# entire corpus.
__LOW_FREQ_TOKEN__ = '__LOW_FREQ_TOKEN__'

In [81]:
start_time = time.time()

print('[1/2] Filtering out low-frequency words...')
token_rows = [
    [token if unigram_counts[token] > 5 else __LOW_FREQ_TOKEN__ for token in tokens]
    for tokens in token_rows
]

print('[2/2] Re-computing unigram counts...')
unigram_counts = create_unigram_counts(token_rows)

end_time = time.time()

print('Done!')
print(f'Ran in {(end_time - start_time)/60:.02f}m')


[1/2] Filtering out low-frequency words...
[2/2] Re-computing unigram counts...
Done!
Ran in 0.05m


In [82]:
print(f'There are {len(unigram_counts)} unique tokens.')


There are 31824 unique tokens.
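
The filtering step can be illustrated on a toy corpus. The helper name `replace_low_frequency` and the threshold of 1 are assumptions chosen to keep the example small; the notebook itself uses a threshold of 5. The last line checks that the vocabulary sizes printed above are consistent with each other.

```python
from collections import Counter

LOW_FREQ = '__LOW_FREQ_TOKEN__'

def replace_low_frequency(rows, counts, threshold):
    """Swap every token seen `threshold` or fewer times for the shared placeholder."""
    return [[t if counts[t] > threshold else LOW_FREQ for t in row] for row in rows]

toy_rows = [['a', 'b', 'a'], ['a', 'c']]
toy_counts = Counter(t for row in toy_rows for t in row)
print(replace_low_frequency(toy_rows, toy_counts, threshold=1))
# [['a', '__LOW_FREQ_TOKEN__', 'a'], ['a', '__LOW_FREQ_TOKEN__']]

# Vocabulary size check with the numbers printed above:
# all tokens, minus the low-count ones, plus the one placeholder.
print(112586 - 80763 + 1)  # 31824
```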


In [93]:
# Fully process the text in the data frame into a bag-of-words
# representation: one row of token counts per article.

start_time = time.time()

print('[1/3] Create encoder / decoder...')
encoder, decoder = create_encoder_and_decoder(unigram_counts)

print('[2/3] Encoding Token Rows...')
encoded_token_rows = [[encoder[t] for t in tokens] for tokens in token_rows]

print('[3/3] Creating Bag Of Words DataFrame...')
data_bow = create_bow_dataframe(encoded_token_rows, encoder, decoder)

end_time = time.time()

print('Done!')
print(f'Ran in {(end_time - start_time)/60:.02f}m')

[1/3] Create encoder / decoder...
[2/3] Encoding Token Rows...
[3/3] Creating Bag Of Words DataFrame...
Done!
Ran in 0.23m


In [94]:
data_bow.head()

Unnamed: 0,there,were,2,mass,shootings,in,texas,last,week,but,...,oosthuizen,sozzani,flajnik,alber,elbaz,qnexa,garance,mengelt,collegehoopsnet,xlvi
0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
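
Almost every cell in this table is zero, which suggests a sparse representation may be worth considering. The sketch below stores only the non-zero counts per row using the standard library, and estimates the size of the dense `float64` matrix from the row count (200853) and filtered vocabulary size (31824) reported in this notebook. The helper name `sparse_bow_rows` is hypothetical; in practice a `scipy.sparse` matrix would be a more typical choice, which scikit-learn's `LogisticRegression` can consume directly.

```python
from collections import Counter

def sparse_bow_rows(encoded_token_rows):
    """Return one {column_index: count} mapping per row, keeping only non-zero counts."""
    return [Counter(encoded_tokens) for encoded_tokens in encoded_token_rows]

# Toy input: three "articles" already encoded as vocabulary indices.
toy_encoded_rows = [[0, 1, 1, 4], [2], []]
sparse_rows = sparse_bow_rows(toy_encoded_rows)
print(sparse_rows[0][1])    # 2, index 1 appears twice in the first row
print(len(sparse_rows[2]))  # 0, an empty article stores nothing

# Dense size for the shapes reported in this notebook (float64 = 8 bytes per cell).
n_rows, n_cols = 200853, 31824
dense_bytes = n_rows * n_cols * 8
print(f'Dense matrix: {dense_bytes / 1e9:.1f} GB')  # Dense matrix: 51.1 GB
```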


### Total Vocab Size

In [63]:
len(encoder)

112586

Note that this count matches the unfiltered vocabulary. Judging by its execution number, this cell appears to have been run before the low-frequency filtering above, after which the vocabulary holds 31824 tokens.

### Most Frequent Words

In [64]:
def k_most_frequent(unigram_counts, k):
    top_words = []

    # The "candidate" is the word in the list that is
    # next up to get replaced if we find a better word.
    candidate_count = 0
    candidate_index = -1

    # We want to support k most and k least frequent words.
    pos_k = k if k >= 0 else -k
    min_or_max = min if k >= 0 else max

    for word, count in unigram_counts.items():

        if len(top_words) < pos_k or min_or_max(count, candidate_count) == candidate_count:
            top_words.append(word)
        else:
            continue

        if len(top_words) > pos_k:
            # Evict the current candidate: the least frequent kept word
            # (or the most frequent one when k is negative).
            del top_words[candidate_index]
            
        counts = [unigram_counts[w] for w in top_words]
        candidate_count = min_or_max(counts)
        candidate_index = counts.index(candidate_count)
        
    return top_words


In [96]:
# Most Frequent Words
k_most_frequent(unigram_counts, k=10)

['in',
 'and',
 'for',
 'the',
 'of',
 'a',
 '__LOW_FREQ_TOKEN__',
 'to',
 'is',
 'that']

In [97]:
# Least Frequent Words
k_most_frequent(unigram_counts, k=-10)

['trop',
 'clementis',
 'krentcil',
 'keiko',
 'prettify',
 'flajnik',
 'alber',
 'qnexa',
 'mengelt',
 'collegehoopsnet']
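
The lists above come back in no particular order, because `k_most_frequent` only tracks which words to keep. If a sorted result is preferred, `heapq` offers a shorter route. This is an alternative sketch rather than the notebook's method; the function name `k_most_frequent_sorted` and the toy counts are made up for illustration.

```python
import heapq

def k_most_frequent_sorted(unigram_counts, k):
    """Return the k most frequent words (or the -k least frequent if k is negative), sorted."""
    by_count = unigram_counts.get
    if k >= 0:
        return heapq.nlargest(k, unigram_counts, key=by_count)
    return heapq.nsmallest(-k, unigram_counts, key=by_count)

toy_counts = {'the': 50, 'of': 30, 'cat': 3, 'zebra': 1}
print(k_most_frequent_sorted(toy_counts, k=2))   # ['the', 'of']
print(k_most_frequent_sorted(toy_counts, k=-2))  # ['zebra', 'cat']
```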