# Exploring the Data

We will explore classifying this news data with a simple classifier: logistic regression.
The classifier will be given a bag-of-words representation of each article.

In [146]:
import numpy as np
import pandas as pd
import time

## Load the Data

In [147]:
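# The dataset stores one JSON object per line rather than a single JSON
# array, so it needs to be wrapped into an array before pandas can ingest it.
from io import StringIO

with open('./data/News_Category_Dataset_v2.json') as file:
    lines = file.readlines()
    json_text = f'[{",".join(lines)}]'


In [148]:
# Wrap the text in StringIO: recent pandas versions deprecate passing
# literal JSON strings directly to read_json.
data = pd.read_json(StringIO(json_text), orient='records')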

## Downsampling

Due to limited compute, I need to downsample the data. The goal is to shrink the dataset while making sure enough samples remain in every category. Each row is kept with a probability that decreases with the number of rows in the data frame that share its label, so common categories are downsampled more aggressively than rare ones. A parameter, C, controls how aggressive the downsampling is: a row whose category has n rows is kept with probability (n_min / n)^(1/C), where n_min is the size of the smallest category. A C value of 1 means all categories will have the same expected number of rows. Larger values of C mean less downsampling (as C approaches infinity, downsampling does not happen at all).
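
To make the role of C concrete, here is a small sketch that evaluates the keep probability on made-up category sizes. The counts and the helper name `keep_probability` are illustrative assumptions, not values from the dataset.

```python
def keep_probability(count, min_count, c):
    """Probability of keeping a row whose category has `count` rows."""
    return (min_count / count) ** (1 / c)

# Hypothetical category sizes, for illustration only.
toy_counts = {'A': 10000, 'B': 2500, 'C': 100}
min_count = min(toy_counts.values())

# Expected number of surviving rows per category for several values of C.
expected_rows = {}
for c in (1, 2, 100):
    expected_rows[c] = {
        cat: n * keep_probability(n, min_count, c)
        for cat, n in toy_counts.items()
    }
    print(c, {cat: round(v) for cat, v in expected_rows[c].items()})
```

With C = 1 every toy category has an expected 100 rows. With C = 2 the largest category keeps about 1,000 rows against 100 for the smallest, so the 100:1 imbalance shrinks to 10:1. With C = 100 almost nothing is dropped.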

In [149]:
def downsample(df, c):
    category_counts = df['category'].value_counts()
    min_count = category_counts.min()

    # Calculate the probability of keeping a row
    # of a given category.
    category_probs = (min_count / category_counts) ** (1/c)

    # This is a series holding the probability that each row is kept.
    # Each row's probability depends only on its category, so look it
    # up by category label rather than by position.
    prob_mask = df['category'].map(category_probs)

    keep_mask = np.random.rand(len(df)) <= prob_mask
    
    return df[keep_mask]


In [150]:
data = downsample(data, c=2)

## Exploring Categories

In [151]:
data.head()

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26
12,IMPACT,"With Its Way Of Life At Risk, This Remote Oyst...",Karen Pinchin,https://www.huffingtonpost.com/entry/remote-oy...,The revolution is coming to rural New Brunswick.,2018-05-26
17,POLITICS,Ireland Votes To Repeal Abortion Amendment In ...,Laura Bassett,https://www.huffingtonpost.com/entry/results-f...,Irish women will no longer have to travel to t...,2018-05-26
20,WEIRD NEWS,Weird Father's Day Gifts Your Dad Doesn't Know...,David Moye,https://www.huffingtonpost.com/entry/weird-fat...,Why buy a boring tie when you can give him tes...,2018-05-26


In [152]:
data['category'].value_counts()

POLITICS          5810
WELLNESS          4256
ENTERTAINMENT     3883
TRAVEL            3131
STYLE & BEAUTY    3123
PARENTING         2948
HEALTHY LIVING    2650
QUEER VOICES      2555
FOOD & DRINK      2540
BUSINESS          2418
COMEDY            2302
SPORTS            2229
BLACK VOICES      2204
HOME & LIVING     2044
PARENTS           1971
WOMEN             1923
THE WORLDPOST     1914
WEDDINGS          1905
IMPACT            1868
DIVORCE           1844
CRIME             1838
MEDIA             1696
WEIRD NEWS        1644
RELIGION          1617
GREEN             1613
WORLDPOST         1601
STYLE             1543
TASTE             1481
SCIENCE           1476
WORLD NEWS        1462
TECH              1431
MONEY             1311
ARTS              1215
GOOD NEWS         1199
FIFTY             1187
ENVIRONMENT       1177
ARTS & CULTURE    1156
COLLEGE           1077
LATINO VOICES     1064
CULTURE & ARTS    1018
EDUCATION         1004
Name: category, dtype: int64

We can see that the dominant class is Politics. What portion of news articles are classified as Politics?

In [153]:
f'{float((data["category"] == "POLITICS").sum()) / len(data["category"]) * 100:.02f}%'

'7.06%'

So as a baseline, we would expect our model to reach an accuracy of at least *7.06%*, which is what we would get by classifying every news article as Politics.
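
The majority-class baseline can be computed directly from the labels. Below is a minimal sketch on a hypothetical label list; `majority_baseline` is an illustrative helper name, not part of the notebook.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical labels, for illustration only.
toy_labels = ['POLITICS'] * 3 + ['WELLNESS'] * 2 + ['TRAVEL']
label, accuracy = majority_baseline(toy_labels)
print(f'{label}: {accuracy:.2%}')
```

On the notebook's data frame, the same fraction should be available as `data['category'].value_counts(normalize=True).max()`.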

## Exploring Authors

In [154]:
data['authors'].describe()

count     82328
unique    16679
top            
freq      15252
Name: authors, dtype: object

In [155]:
data['authors'].value_counts()

                                                                                                          15252
Lee Moran                                                                                                  1094
Ron Dicker                                                                                                  867
Reuters, Reuters                                                                                            651
Ed Mazza                                                                                                    571
                                                                                                          ...  
By Michelle Nichols, Reuters                                                                                  1
Karen E. Quinones Miller, ContributorJournalist, Best-Selling Author, Activist, An All-around Angry...        1
Hale Dwoskin, Contributor\nAuthor, 'The Sedona Method'                                                  

A large portion of articles have missing authors. It would be good to get a sense of the distribution of articles written by repeat authors.

In [156]:
authors_dist = data['authors'].value_counts()
authors_dist = authors_dist.drop('')

authors_dist.describe()

count    16678.000000
mean         4.021825
std         22.699956
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max       1094.000000
Name: authors, dtype: float64

There appears to be a long tail of single-article authors. What portion of articles contain an author? What portion of articles contain a repeating author?

In [157]:
repeat_authors = authors_dist[authors_dist > 1].index.values

count_articles = len(data)
count_articles_with_authors = len(data[data['authors'] != ''])
count_articles_with_repeat_authors = len(data[data['authors'].isin(repeat_authors)])

print(f'{float(count_articles_with_authors) / count_articles * 100:.02f}% of articles contain authors.')
print(f'{float(count_articles_with_repeat_authors) / count_articles * 100:.02f}% of articles contain repeat authors.')

81.47% of articles contain authors.
68.15% of articles contain repeat authors.


## Tokenizing and Exploring Vocabulary

In [158]:
import string

from nltk.tokenize.regexp import WordPunctTokenizer

In [159]:
tokenizer = WordPunctTokenizer()

In [160]:
def cleanup_and_tokenize_text(text):
    cleaned = ''.join([c for c in text if c not in string.punctuation]).lower()
    return tokenizer.tokenize(cleaned)
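
To illustrate what this cleanup does without requiring NLTK, here is a sketch that assumes `WordPunctTokenizer` behaves like the regular expression `\w+|[^\w\s]+`, which is how NLTK describes it. The function name `cleanup_and_tokenize_sketch` is hypothetical.

```python
import re
import string

# Assumed equivalent of WordPunctTokenizer's pattern.
WORD_PUNCT_PATTERN = re.compile(r'\w+|[^\w\s]+')

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCTUATION = str.maketrans('', '', string.punctuation)

def cleanup_and_tokenize_sketch(text):
    cleaned = text.translate(STRIP_PUNCTUATION).lower()
    return WORD_PUNCT_PATTERN.findall(cleaned)

tokens = cleanup_and_tokenize_sketch("Weird Father's Day Gifts!")
print(tokens)

# string.punctuation is ASCII-only, so a curly apostrophe is not removed
# and ends up as a token of its own.
curly_tokens = cleanup_and_tokenize_sketch('it\u2019s')
print(curly_tokens)
```

Because punctuation is deleted rather than replaced with a space, "Father's" collapses into the single token `fathers`. Non-ASCII punctuation survives the cleanup and is split off by the tokenizer.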


In [161]:
def tokenize_rows(df):
    tokenized_headlines = df['headline'].apply(cleanup_and_tokenize_text).tolist()
    tokenized_desc = df['short_description'].apply(cleanup_and_tokenize_text).tolist()

    return [tokens1 + tokens2 for tokens1, tokens2 in zip(tokenized_headlines, tokenized_desc)]
    

In [162]:
def create_unigram_counts(rows):
    # Flatten
    tokens = [t for tokens in rows for t in tokens]
    
    counts = {}

    for token in tokens:
        if token not in counts:
            counts[token] = 0
        counts[token] += 1

    return counts
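
The same counts can be obtained with `collections.Counter` from the standard library. The sketch below is an alternative, not the notebook's implementation; `create_unigram_counts_counter` is a hypothetical name.

```python
from collections import Counter
from itertools import chain

def create_unigram_counts_counter(rows):
    # Flatten the rows lazily and count every token in one pass.
    return Counter(chain.from_iterable(rows))

# Hypothetical tokenized rows, for illustration only.
sample_rows = [['the', 'cat'], ['the', 'dog', 'the']]
sample_counts = create_unigram_counts_counter(sample_rows)
print(sample_counts.most_common(1))
```

A `Counter` is a `dict` subclass, so it should work anywhere the plain dictionary of counts is used.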
    

In [163]:
def create_encoder_and_decoder(unigram_counts):
    encoder = {t:i for i,t in enumerate(unigram_counts.keys())}
    decoder = {i:t for t,i in encoder.items()}
    
    return encoder, decoder
    

In [164]:
def create_bow_dataframe(encoded_token_rows, encoder, decoder):
    bows = np.zeros((len(encoded_token_rows), len(encoder)))

    for i, encoded_tokens in enumerate(encoded_token_rows):
        for encoded in encoded_tokens:
            bows[i, encoded] += 1
    
    df = pd.DataFrame(data=bows)
    df.columns = [decoder[i] for i in range(len(decoder))]
    
    return df
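
As a small worked example of the encoding, the sketch below builds an encoder and count vectors for two toy rows using plain Python lists. The helper name `encode_rows` and the toy tokens are assumptions made for illustration.

```python
def encode_rows(token_rows):
    """Build a token -> column index encoder and one count vector per row."""
    encoder = {}
    for tokens in token_rows:
        for token in tokens:
            # Assign the next free column index to each unseen token.
            encoder.setdefault(token, len(encoder))

    bows = [[0] * len(encoder) for _ in token_rows]
    for i, tokens in enumerate(token_rows):
        for token in tokens:
            bows[i][encoder[token]] += 1
    return encoder, bows

# Hypothetical tokenized rows, for illustration only.
toy_token_rows = [['in', 'texas', 'in'], ['texas', 'votes']]
toy_encoder, toy_bows = encode_rows(toy_token_rows)
print(toy_encoder)
print(toy_bows)
```

Each entry is a count, not a 0/1 flag: the repeated token `in` contributes 2 to its column in the first row.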
    

### Unigram Counts

In [165]:
start_time = time.time()

print('[1/2] Tokenizing rows...')
token_rows = tokenize_rows(data)

print('[2/2] Generating global unigram count...')
unigram_counts = create_unigram_counts(token_rows)

end_time = time.time()

print('Done!')
print(f'Ran in {(end_time - start_time)/60:.02f}m')


[1/2] Tokenizing rows...
[2/2] Generating global unigram count...
Done!
Ran in 0.07m


In [166]:
print(f'There are {len(unigram_counts)} unique tokens.')

There are 75412 unique tokens.


### Removing Low-Frequency Words


In [167]:
MIN_WORD_FREQ = 5

In [168]:
low_count_tokens = [t for t,c in unigram_counts.items() if c <= MIN_WORD_FREQ]

print(f'There are {len(low_count_tokens)} low count tokens.')

There are 54986 low count tokens.


Roughly 73% of our vocabulary (54,986 of 75,412 tokens) consists of words that show up `MIN_WORD_FREQ` or fewer times throughout the corpus. These words could slow down learning dramatically while providing little signal, so we will collapse them into a single special token.
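
The idea of collapsing rare words into one placeholder can be sketched on toy data. The function name `replace_rare_tokens` and the toy rows are hypothetical; the placeholder string matches the one used in this notebook.

```python
from collections import Counter

LOW_FREQ = '__LOW_FREQ_TOKEN__'

def replace_rare_tokens(token_rows, min_freq):
    """Replace tokens occurring `min_freq` or fewer times with a placeholder."""
    counts = Counter(t for tokens in token_rows for t in tokens)
    return [
        [t if counts[t] > min_freq else LOW_FREQ for t in tokens]
        for tokens in token_rows
    ]

# Hypothetical tokenized rows, for illustration only.
rare_demo_rows = [['the', 'cat', 'sat'], ['the', 'dog'], ['the', 'cat']]
filtered = replace_rare_tokens(rare_demo_rows, min_freq=1)
print(filtered)
```

Here `sat` and `dog` each occur once, so both map to the placeholder, while `the` and `cat` are kept. The vocabulary shrinks from four tokens to three.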

In [169]:
# Special token for tokens that occur MIN_WORD_FREQ or fewer times in the
# entire corpus.
__LOW_FREQ_TOKEN__ = '__LOW_FREQ_TOKEN__'

In [170]:
start_time = time.time()

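print('[1/2] Filtering out low-frequency words...')
# NOTE: the cutoff here is hard-coded to 10 rather than MIN_WORD_FREQ (5),
# so tokens that occur 6 to 10 times are replaced as well.
token_rows = [[token if unigram_counts[token] > 10 else __LOW_FREQ_TOKEN__ for token in tokens] for tokens in token_rows]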

print(f'[2/2] Re-computing unigram counts...')
unigram_counts = create_unigram_counts(token_rows)

end_time = time.time()

print('Done!')
print(f'Ran in {(end_time - start_time)/60:.02f}m')


[1/2] Filtering out low-frequency words...
[2/2] Re-computing unigram counts...
Done!
Ran in 0.03m


In [171]:
print(f'There are {len(unigram_counts)} unique tokens.')


There are 13996 unique tokens.


### Creating New Data Frame

In [172]:
# Fully process the text in the data frame into a bag-of-words
# count-vector representation (one column per vocabulary token).

start_time = time.time()

print('[1/3] Create encoder / decoder...')
encoder, decoder = create_encoder_and_decoder(unigram_counts)

print('[2/3] Encoding Token Rows...')
encoded_token_rows = [[encoder[t] for t in tokens] for tokens in token_rows]

print('[3/3] Creating Bag Of Words DataFrame...')
data_bow = create_bow_dataframe(encoded_token_rows, encoder, decoder)

end_time = time.time()

print('Done!')
print(f'Ran in {(end_time - start_time)/60:.02f}m')

[1/3] Create encoder / decoder...
[2/3] Encoding Token Rows...
[3/3] Creating Bag Of Words DataFrame...
Done!
Ran in 0.08m


In [173]:
data_bow.head()

Unnamed: 0,there,were,2,mass,shootings,in,texas,last,week,but,...,vases,leann,printable,g8,trierweiler,dolomites,jubilee,stylelist,donnas,psychometer
0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Total Vocab Size

In [174]:
len(encoder)

13996

### Most Frequent Words

In [175]:
def k_most_frequent(unigram_counts, k):
    top_words = []

    # The "candidate" is the word in the list that is
    # next up to get replaced if we find a better word.
    candidate_count = 0
    candidate_index = -1

    # We want to support k most and k least frequent words.
    pos_k = k if k >= 0 else -k
    min_or_max = min if k >= 0 else max

    for word, count in unigram_counts.items():

        if len(top_words) < pos_k or min_or_max(count, candidate_count) == candidate_count:
            top_words.append(word)
        else:
            continue

        if len(top_words) > pos_k:
            # Evict the current candidate to keep the list at size k.
            del top_words[candidate_index]
            
        counts = [unigram_counts[w] for w in top_words]
        candidate_count = min_or_max(counts)
        candidate_index = counts.index(candidate_count)
        
    return top_words
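
For comparison, the same selection can be sketched with `heapq` from the standard library. This is an alternative under my own naming (`k_most_frequent_heap` is hypothetical), not the implementation used in the cells that follow.

```python
import heapq

def k_most_frequent_heap(unigram_counts, k):
    """Positive k: the k most frequent words; negative k: the |k| least frequent."""
    select = heapq.nlargest if k >= 0 else heapq.nsmallest
    # Iterating a dict yields its keys; rank them by their counts.
    return select(abs(k), unigram_counts, key=unigram_counts.get)

# Hypothetical counts, for illustration only.
toy_unigrams = {'the': 50, 'of': 30, 'cat': 3, 'zebra': 1}
most = k_most_frequent_heap(toy_unigrams, k=2)
least = k_most_frequent_heap(toy_unigrams, k=-2)
print(most, least)
```

Unlike the loop above, this version returns the words sorted by frequency, and ties may be resolved differently.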


In [176]:
# Most Frequent Words
k_most_frequent(unigram_counts, k=10)

['in',
 '__LOW_FREQ_TOKEN__',
 'to',
 'the',
 'is',
 'a',
 'of',
 'for',
 'and',
 'that']

In [177]:
# Least Frequent Words
k_most_frequent(unigram_counts, k=-10)

['divorcee',
 'fabrics',
 'cm',
 'summery',
 'huffpostbeauty',
 'gwist',
 'printable',
 'trierweiler',
 'donnas',
 'psychometer']