# Exploring the Data

We will explore classifying this news data with a simple classifier: logistic regression.
The model will take a bag-of-words representation of the text as its input.
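
To make the term concrete: a bag of words discards word order and keeps only how often each word occurs. The sketch below builds one by hand with a deliberately naive tokenizer (lowercase, then split on whitespace). The helper name `bag_of_words` and the sample headlines are made up for illustration; a real pipeline would more likely rely on a library vectorizer.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count matrix) for a list of strings.

    Tokenization is naive: lowercase, then split on whitespace.
    """
    tokenized = [text.lower().split() for text in texts]
    # A sorted vocabulary gives every word a stable column index.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per text, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical headlines, not taken from the dataset.
headlines = ["Senate passes budget bill", "Budget talks stall in Senate"]
vocabulary, matrix = bag_of_words(headlines)
```

Each row of `matrix` is the feature vector a classifier such as logistic regression would receive for one text.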

In [1]:
import numpy as np
import pandas as pd

## Load the Data

In [7]:
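# The dataset file holds one JSON object per line rather than a single JSON
# array, so it is not valid JSON as a whole. Join the lines with commas and
# wrap them in brackets so that pandas can ingest the result as an array.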

with open('./News_Category_Dataset_v2.json') as file:
    lines = file.readlines()
    json = f'[{",".join(lines)}]'


In [8]:
data_raw = pd.read_json(json, orient='records')
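
Since the file appears to hold one JSON object per line (the JSON Lines layout), pandas can probably read it directly with `lines=True`, which avoids building the array string by hand. The sketch below demonstrates this on a small temporary file with made-up records. Whether it parses the real dataset identically (for example, how the `date` column is converted) is an assumption worth checking.

```python
import os
import tempfile

import pandas as pd

# Two made-up records in JSON Lines form: one JSON object per line.
sample = (
    '{"category": "CRIME", "headline": "Example headline A"}\n'
    '{"category": "TRAVEL", "headline": "Example headline B"}\n'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.json')
    with open(path, 'w') as file:
        file.write(sample)
    # lines=True tells pandas to parse each line as a separate record.
    sample_df = pd.read_json(path, lines=True)
```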

## Exploring Categories

In [9]:
data_raw.head()

| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... | Melissa Jeltsen | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... | 2018-05-26 |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... | Andy McDonald | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. | 2018-05-26 |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 | Ron Dicker | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... | 2018-05-26 |
| 3 | ENTERTAINMENT | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | Ron Dicker | https://www.huffingtonpost.com/entry/jim-carre... | The actor gives Dems an ass-kicking for not fi... | 2018-05-26 |
| 4 | ENTERTAINMENT | Julianna Margulies Uses Donald Trump Poop Bags... | Ron Dicker | https://www.huffingtonpost.com/entry/julianna-... | The "Dietland" actress said using the bags is ... | 2018-05-26 |


In [13]:
data_raw['category'].value_counts()

POLITICS          32739
WELLNESS          17827
ENTERTAINMENT     16058
TRAVEL             9887
STYLE & BEAUTY     9649
PARENTING          8677
HEALTHY LIVING     6694
QUEER VOICES       6314
FOOD & DRINK       6226
BUSINESS           5937
COMEDY             5175
SPORTS             4884
BLACK VOICES       4528
HOME & LIVING      4195
PARENTS            3955
THE WORLDPOST      3664
WEDDINGS           3651
WOMEN              3490
IMPACT             3459
DIVORCE            3426
CRIME              3405
MEDIA              2815
WEIRD NEWS         2670
GREEN              2622
WORLDPOST          2579
RELIGION           2556
STYLE              2254
SCIENCE            2178
WORLD NEWS         2177
TASTE              2096
TECH               2082
MONEY              1707
ARTS               1509
FIFTY              1401
GOOD NEWS          1398
ARTS & CULTURE     1339
ENVIRONMENT        1323
COLLEGE            1144
LATINO VOICES      1129
CULTURE & ARTS     1030
EDUCATION          1004
Name: category, dtype: int64

We can see that the dominant class is Politics. What portion of news articles are classified as Politics?

In [18]:
f'{(data_raw["category"] == "POLITICS").sum() / len(data_raw) * 100:.2f}%'

'16.30%'

So as a baseline, we would expect our model to reach an accuracy of at least *16.3%*, which is what we would get by classifying every news article as Politics.
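
The same majority-class baseline can be computed for any label column. The following is a minimal sketch, where `majority_baseline` is a hypothetical helper and the toy labels are for illustration only:

```python
import pandas as pd


def majority_baseline(labels):
    """Return (most frequent label, its share of all labels)."""
    # value_counts sorts by frequency, so the first entry is the majority class.
    shares = labels.value_counts(normalize=True)
    return shares.index[0], float(shares.iloc[0])


# Toy labels for illustration only.
toy_labels = pd.Series(['POLITICS', 'POLITICS', 'TRAVEL', 'CRIME'])
label, accuracy = majority_baseline(toy_labels)
```

Applied to the notebook's data, `majority_baseline(data_raw['category'])` should reproduce the Politics share computed above.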

## Exploring Authors

In [23]:
data_raw['authors'].describe()

count     200853
unique     27993
top             
freq       36620
Name: authors, dtype: object

In [25]:
data_raw['authors'].value_counts()

                                                                                                         36620
Lee Moran                                                                                                 2423
Ron Dicker                                                                                                1913
Reuters, Reuters                                                                                          1562
Ed Mazza                                                                                                  1322
                                                                                                         ...  
Arthur Delaney and Amanda Terkel                                                                             1
Claudine Chicheportiche, ContributorWriter. Yogi. Karate gal. Boxer. Fighter. Lover. Infinite Beli...        1
Ambassador Eric Goosby, MD and Anthony S. Fauci, M.D., Guest Writers                                         1

A large portion of articles (36,620 of 200,853, or about 18%) have no author listed. It would be good to get a sense of how the remaining articles are distributed across authors.

In [33]:
authors_dist = data_raw['authors'].value_counts()
authors_dist = authors_dist.drop('')

authors_dist.describe()

count    27992.000000
mean         5.867141
std         41.929860
min          1.000000
25%          1.000000
50%          1.000000
75%          3.000000
max       2423.000000
Name: authors, dtype: float64

There appears to be a long tail of single-article authors: the median author has written just one article. What share of articles list an author? What share come from an author who appears more than once?

In [43]:
repeat_authors = authors_dist[authors_dist > 1].index.values

count_articles = len(data_raw)
count_articles_with_authors = len(data_raw[data_raw['authors'] != ''])
count_articles_with_repeat_authors = len(data_raw[data_raw['authors'].isin(repeat_authors)])

print(f'{count_articles_with_authors / count_articles * 100:.2f}% of articles contain authors.')
print(f'{count_articles_with_repeat_authors / count_articles * 100:.2f}% of articles contain repeat authors.')

81.77% of articles contain authors.
73.45% of articles contain repeat authors.
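
As a cross-check, the same two shares can be derived with boolean masks instead of filtered frames. This is only a sketch: `author_coverage` is a hypothetical helper, the toy data is made up, and, like the cell above, it treats each distinct `authors` string as one author.

```python
import pandas as pd


def author_coverage(authors):
    """Return (share of rows with an author, share with a repeat author)."""
    has_author = authors != ''
    # Number of articles credited to each row's author string.
    per_row_count = authors.map(authors.value_counts())
    has_repeat_author = has_author & (per_row_count > 1)
    # The mean of a boolean mask is the fraction of True values.
    return float(has_author.mean()), float(has_repeat_author.mean())


# Toy data: two rows without an author, 'A' appears twice, 'B' once.
toy_authors = pd.Series(['', 'A', 'A', 'B', ''])
with_author, with_repeat = author_coverage(toy_authors)
```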


## Exploring Headline and Description