# Workflow:
1. Randomly sample 200 articles from each outlet and concatenate them into one dataframe
2. Label each outlet with its associated entry in outlet_labels and merge
3. Compute the average bias score per outlet from reference_df and merge it with the data
4. Compare outlet_labels with the bias scores to determine left/right bias
5. Start preprocessing and model building
6. Account for class imbalance and perform feature selection

### Notes:
- outlet_labels are based on mediabiasfactcheck.com
- reference_df is based on adfontesmedia.com
- df is based on kaggle.com/snapcrack/all-the-news

# Data importing and creation

In [1]:
import numpy as np
import pandas as pd

In [3]:
df1 = pd.read_csv("../data/articles1.csv")
df2 = pd.read_csv("../data/articles2.csv")
df3 = pd.read_csv("../data/articles3.csv")
df = pd.concat([df1, df2, df3])

reference_df = pd.read_csv("../data/Interactive Media Bias Chart - Ad Fontes Media.csv")

In [4]:
df.head()

,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [5]:
# Getting unique outlets
outlets = sorted(df['publication'].unique())

In [6]:
outlets

['Atlantic',
 'Breitbart',
 'Business Insider',
 'Buzzfeed News',
 'CNN',
 'Fox News',
 'Guardian',
 'NPR',
 'National Review',
 'New York Post',
 'New York Times',
 'Reuters',
 'Talking Points Memo',
 'Vox',
 'Washington Post']

In [7]:
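# Possible labels: Extreme Left, Left, Left-Center, Center, Right-Center, Right, Extreme Right
# The list below is positional: entry i is the label for outlets[i] (alphabetical order)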
outlet_labels = ['Left-Center', 'Extreme Right', 'Left-Center', 'Left-Center', 'Left', 'Right', 'Left-Center', 'Left-Center',
               'Right', 'Right-Center', 'Left-Center', 'Center', 'Left', 'Left', 'Left-Center']

outlet_df = pd.DataFrame({'outlet': outlets, 
                          'label':outlet_labels})

In [8]:
outlet_df.head()

,outlet,label
0,Atlantic,Left-Center
1,Breitbart,Extreme Right
2,Business Insider,Left-Center
3,Buzzfeed News,Left-Center
4,CNN,Left
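
Because `outlet_labels` is matched to `outlets` purely by position, a reordered or newly added outlet would shift every label after it. A minimal sketch of an alternative, assuming the same pairs as the two lists above: keep them in a dict so each label stays attached to its outlet name. The names `outlet_to_label` and `outlet_df_alt` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Outlet -> label pairs, copied from the positional lists above.
outlet_to_label = {
    "Atlantic": "Left-Center",
    "Breitbart": "Extreme Right",
    "Business Insider": "Left-Center",
    "Buzzfeed News": "Left-Center",
    "CNN": "Left",
    "Fox News": "Right",
    "Guardian": "Left-Center",
    "NPR": "Left-Center",
    "National Review": "Right",
    "New York Post": "Right-Center",
    "New York Times": "Left-Center",
    "Reuters": "Center",
    "Talking Points Memo": "Left",
    "Vox": "Left",
    "Washington Post": "Left-Center",
}

# Building the frame from (outlet, label) pairs keeps each label
# tied to its outlet regardless of ordering.
outlet_df_alt = pd.DataFrame(
    sorted(outlet_to_label.items()), columns=["outlet", "label"]
)
```

The resulting frame has the same two columns as `outlet_df`, so it could be swapped in for the later merge without other changes.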


In [9]:
# A negative bias score is left-leaning; a positive bias score is right-leaning
reference_df.head()

,Source,Url,Bias,Quality
0,ABC,https://abcnews.go.com/Politics/us-disrupted-a...,-5.33,52.33
1,ABC,https://abcnews.go.com/Politics/appeals-court-...,0.67,51.67
2,ABC,https://abcnews.go.com/Politics/electoral-coll...,-10.0,32.0
3,ABC,https://abcnews.go.com/Politics/facebook-agree...,-2.33,52.33
4,ABC,https://abcnews.go.com/Politics/donald-trump-t...,-4.33,52.67


In [10]:
reference_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1916 entries, 0 to 1915
Data columns (total 4 columns):
Source     1916 non-null object
Url        1916 non-null object
Bias       1916 non-null float64
Quality    1916 non-null float64
dtypes: float64(2), object(2)
memory usage: 60.0+ KB


# Preprocessing
##### We have 3 dataframes:
1. df -> main data, with articles and features
2. outlet_df -> political labels for each outlet
3. reference_df -> political bias scores for each article from each outlet

### Processing reference_df to get the average bias score per outlet

In [11]:
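# Renaming entities in reference_df to match data
# (assigning the result back avoids calling inplace=True on a column selection,
# which may not update reference_df in newer pandas versions)
reference_df['Source'] = reference_df['Source'].replace({'The Atlantic': 'Atlantic',
                                                         'BuzzFeed': 'Buzzfeed News',
                                                         'The Guardian': 'Guardian',
                                                         'National Public Radio': 'NPR'})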

In [12]:
# Filtering to outlets present in data
reference_df = reference_df[reference_df['Source'].isin(outlets)]

In [13]:
# Getting mean bias score for each outlet
reference_df = reference_df[['Source', 'Bias']].groupby('Source').mean()

In [14]:
# Moving Source from the index back to a column and renaming it to match outlet_df
reference_df.reset_index(level=0, inplace=True)
reference_df.rename(columns={"Source": "outlet"}, inplace=True)

In [15]:
reference_df.head()

,outlet,Bias
0,Atlantic,-6.410714
1,Breitbart,18.987857
2,Business Insider,-0.378
3,Buzzfeed News,-7.061333
4,CNN,-8.553827
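
The four cells above (rename, filter, average, reset) can also be read as one pipeline. The following is a minimal sketch on made-up data, not the real Ad Fontes export: `toy_reference`, `toy_outlets`, and `outlet_bias` are hypothetical names, and the bias values are invented purely to illustrate the steps.

```python
import pandas as pd

# Toy stand-in for the Ad Fontes export; these rows are made up.
toy_reference = pd.DataFrame({
    "Source": ["The Atlantic", "The Atlantic", "CNN", "ABC"],
    "Bias": [-6.0, -8.0, -9.0, -2.0],
})
# Toy stand-in for the outlets present in the article data.
toy_outlets = ["Atlantic", "CNN"]

outlet_bias = (
    toy_reference
    # 1. Rename sources so they match the publication names in the data.
    .assign(Source=lambda d: d["Source"].replace({"The Atlantic": "Atlantic"}))
    # 2. Keep only outlets that appear in the data, and only the needed columns.
    .loc[lambda d: d["Source"].isin(toy_outlets), ["Source", "Bias"]]
    # 3. Average the per-article bias scores for each outlet.
    .groupby("Source", as_index=False)["Bias"].mean()
    # 4. Rename the key column so it can be merged on 'outlet'.
    .rename(columns={"Source": "outlet"})
)
```

Passing `as_index=False` keeps `Source` as a regular column, so no separate `reset_index` step is needed in this variant.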


### Sampling 200 articles from each outlet and concatenating to dataframe

In [16]:
df_list = []
for outlet in outlets:
    df_list.append(df[df['publication'] == outlet].sample(200))

df_sample = pd.concat(df_list)
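
The loop above draws a different sample on every run because no seed is set. One possible variant, sketched here on made-up data, uses `groupby(...).sample` with a fixed `random_state` so the draw is repeatable. The frame `toy_df`, the sample size of 2, and the seed value 0 are all illustrative assumptions; the notebook itself samples 200 rows per outlet without a seed.

```python
import pandas as pd

# Toy stand-in for df: 3 outlets with 5 articles each (made-up rows).
toy_df = pd.DataFrame({
    "publication": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "title": [f"article {i}" for i in range(15)],
})

# n=2 stands in for the 200 used above; random_state is an arbitrary seed
# that makes the same rows come back on every run.
toy_sample = toy_df.groupby("publication").sample(n=2, random_state=0)
```

As with the loop, sampling without replacement raises an error if any outlet has fewer rows than the requested sample size.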

In [17]:
# Dropping unused features
df_sample.drop(columns=['Unnamed: 0', 'id', 'date', 'month', 'url'], inplace=True)

In [18]:
df_sample.head()

,title,publication,author,year,content
3860,Can Trump TV Succeed?,Atlantic,Nora Kelly,2016.0,I want to receive updates from partners and...
3965,Europe’s Counterrevolution Has Begun,Atlantic,Uri Friedman,2016.0,I want to receive updates from partners and...
1539,"Donald Trump, Inevitable Hawk",Atlantic,McKay Coppins,2017.0,"That’s because, as with everything else, Trump..."
5002,RuPaul’s Drag Race Claims Its Queer Cultural C...,Atlantic,Spencer Kornhaber,2016.0,For us to continue writing great stori...
3243,Hillary Clinton Has Enough Delegates to Claim ...,Atlantic,Nora Kelly,2016.0,I want to receive updates from partners and...


### Merging the 3 dataframes

In [19]:
final_reference_df = reference_df.merge(outlet_df, on='outlet')

In [20]:
final_reference_df.head()

,outlet,Bias,label
0,Atlantic,-6.410714,Left-Center
1,Breitbart,18.987857,Extreme Right
2,Business Insider,-0.378,Left-Center
3,Buzzfeed News,-7.061333,Left-Center
4,CNN,-8.553827,Left
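
With the label and the mean bias score now side by side, step 4 of the workflow (comparing the two sources) can be checked mechanically. The sketch below is one possible way to do it, not the notebook's own method: it uses the five rows displayed above and the hypothetical helper `label_direction`, which maps a label to -1 (left), 0 (center), or +1 (right), and compares that with the sign of the bias score.

```python
import numpy as np
import pandas as pd

# First five rows of final_reference_df, as displayed above.
check = pd.DataFrame({
    "outlet": ["Atlantic", "Breitbart", "Business Insider", "Buzzfeed News", "CNN"],
    "Bias": [-6.410714, 18.987857, -0.378, -7.061333, -8.553827],
    "label": ["Left-Center", "Extreme Right", "Left-Center", "Left-Center", "Left"],
})

def label_direction(label):
    """Map a label to -1 (left), +1 (right), or 0 (plain 'Center')."""
    if "Left" in label:
        return -1
    if "Right" in label:
        return 1
    return 0

check["label_sign"] = check["label"].map(label_direction)
# Negative scores are left-leaning, positive scores are right-leaning.
check["bias_sign"] = np.sign(check["Bias"]).astype(int)
check["agrees"] = check["label_sign"] == check["bias_sign"]
```

A sign comparison ignores magnitude: Business Insider, for example, agrees in direction even though its score of -0.378 is close to zero, so a threshold around zero may be worth considering.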


In [21]:
df_sample.rename(columns={"publication": "outlet"}, inplace=True)
df_sample = df_sample.merge(final_reference_df, on='outlet')

In [22]:
df_sample.head(10)

,title,outlet,author,year,content,Bias,label
0,Can Trump TV Succeed?,Atlantic,Nora Kelly,2016.0,I want to receive updates from partners and...,-6.410714,Left-Center
1,Europe’s Counterrevolution Has Begun,Atlantic,Uri Friedman,2016.0,I want to receive updates from partners and...,-6.410714,Left-Center
2,"Donald Trump, Inevitable Hawk",Atlantic,McKay Coppins,2017.0,"That’s because, as with everything else, Trump...",-6.410714,Left-Center
3,RuPaul’s Drag Race Claims Its Queer Cultural C...,Atlantic,Spencer Kornhaber,2016.0,For us to continue writing great stori...,-6.410714,Left-Center
4,Hillary Clinton Has Enough Delegates to Claim ...,Atlantic,Nora Kelly,2016.0,I want to receive updates from partners and...,-6.410714,Left-Center
5,From Special Education to Suspicious Science: ...,Atlantic,Hayley Glatter,2017.0,"Dale Russakoff | The New York Times Magazine, ...",-6.410714,Left-Center
6,The End of a Political Revolution,Atlantic,Clare Foran,2016.0,", I want to receive updates from partners and ...",-6.410714,Left-Center
7,’There’s Enough Time to Change Everything’,Atlantic,Conor Friedersdorf,2017.0,It is hard to imagine a more misleading treatm...,-6.410714,Left-Center
8,Avenging a One-Star Review With Digital Sabotage,Atlantic,Kaveh Waddell,2017.0,"On Saturday, an unhappy customer vented onli...",-6.410714,Left-Center
9,The Radical Anti-Conservatism of Stephen Bannon,Atlantic,Conor Friedersdorf,2016.0,For us to continue writing great stori...,-6.410714,Left-Center
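
`merge` defaults to an inner join, so any outlet missing from `final_reference_df` (for instance, because its name in the Ad Fontes data was not covered by the renaming step) would be dropped silently along with all of its sampled articles. A minimal sketch of a guard, on made-up frames: a left join with `indicator=True` exposes unmatched outlets, and `validate="many_to_one"` checks that each outlet appears only once on the right-hand side. `toy_articles`, `toy_scores`, and the values in them are hypothetical.

```python
import pandas as pd

# Toy article sample; 'Vox' deliberately has no entry in toy_scores.
toy_articles = pd.DataFrame({
    "outlet": ["Atlantic", "Atlantic", "CNN", "Vox"],
    "title": ["a", "b", "c", "d"],
})
# Toy per-outlet reference frame (made-up scores).
toy_scores = pd.DataFrame({
    "outlet": ["Atlantic", "CNN"],
    "Bias": [-6.4, -8.6],
    "label": ["Left-Center", "Left"],
})

merged = toy_articles.merge(
    toy_scores,
    on="outlet",
    how="left",              # keep every article, matched or not
    validate="many_to_one",  # raise if an outlet is duplicated in toy_scores
    indicator=True,          # adds a '_merge' column describing each row's match
)

# Outlets whose articles found no reference row.
unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "outlet"].unique())
```

Here `unmatched` contains only `'Vox'`; in the real data, an empty list would confirm that all 15 outlets carried through the merge.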


In [24]:
df_sample.to_csv('../outlet_data.csv')
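
Step 6 of the workflow mentions class imbalance, and its extent follows directly from the label list. As a rough sketch, assuming all 15 outlets survive the merge with 200 articles each, the expected number of articles per label can be computed as below. The names `articles_per_outlet` and `label_counts` are hypothetical.

```python
import pandas as pd

# Labels copied from the outlet_labels cell, in outlet order.
outlet_labels = ['Left-Center', 'Extreme Right', 'Left-Center', 'Left-Center',
                 'Left', 'Right', 'Left-Center', 'Left-Center',
                 'Right', 'Right-Center', 'Left-Center', 'Center',
                 'Left', 'Left', 'Left-Center']

# Sample size per outlet used in the sampling step.
articles_per_outlet = 200

# Outlets per label times articles per outlet = expected articles per label.
label_counts = pd.Series(outlet_labels).value_counts() * articles_per_outlet
```

Under that assumption, Left-Center accounts for 1,400 of the 3,000 articles, while Center, Right-Center, and Extreme Right have 200 each.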