In [7]:
import pandas as pd
import plotly.express as px
import plotly.subplots as sp

from IPython.display import display, Image

This notebook explores and transforms the data from the [Media Bias Detection group](https://github.com/Media-Bias-Group/Neural-Media-Bias-Detection-Using-Distant-Supervision-With-BABE/tree/main/data) so that it can be used to identify bias in text generated with ChatGPT. The data also contains political bias ratings of media outlets from [AllSides](https://www.allsides.com/media-bias/media-bias-chart).

*This data set was also used in the paper "Neural Media Bias Detection Using Distant Supervision With BABE - Bias Annotations By Experts" by T. Spinde et al. to develop a media bias detection algorithm based on a pre-trained BERT model with added distant supervision. Their models will only be used here as a comparison point for the performance of my ML pipeline.*

# 1. Explore data set

There are several data files in the full dataset: the MBIC data, which contains bias labels obtained by crowdsourcing, and the SG1 and SG2 datasets, which contain bias labels assigned by a small set of experts. It also contains a lexicon of bias words and sets of biased and neutral news headlines. Let's load and explore these data files in detail to identify which ones are appropriate for designing a ChatGPT bias detection model.

In [6]:
df_mbic = pd.read_excel('../data/final_labels_MBIC.xlsx')
df_sg1 = pd.read_excel('../data/final_labels_SG1.xlsx')
df_sg2 = pd.read_excel('../data/final_labels_SG2.xlsx')
df_final_sg1 = pd.read_excel('../data/dt_final_SG1.xlsx')
df_final_sg2 = pd.read_excel('../data/dt_final_SG2.xlsx')
lexicon = pd.read_excel('../data/bias_word_lexicon.xlsx')
headlines_biased = pd.read_csv('../data/news_headlines_usa_biased.csv')
headlines_neutral = pd.read_csv('../data/news_headlines_usa_neutral.csv')





In [9]:
for name, df in zip(['MBIC', 'SG1', 'SG2', 'SG1-dt-final', 'SG2-dt-final', 'Lexicon', 'Biased headlines', 'Neutral Headlines'],
                    [df_mbic, df_sg1, df_sg2, df_final_sg1, df_final_sg2, lexicon, headlines_biased, headlines_neutral]):
    print(name)
    display(df.head(5))

MBIC


Unnamed: 0,text,news_link,outlet,topic,type,group_id,num_sent,label_bias,label_opinion,article,biased_words
0,YouTube is making clear there will be no “birt...,https://eu.usatoday.com/story/tech/2020/02/03/...,usa-today,elections-2020,center,1,1,Biased,Somewhat factual but also opinionated,YouTube says no ‘deepfakes’ or ‘birther’ video...,"['belated', 'birtherism']"
1,So while there may be a humanitarian crisis dr...,https://www.alternet.org/2019/01/here-are-5-of...,alternet,immigration,left,1,1,Biased,Expresses writer’s opinion,Speaking to the country for the first time fro...,['crisis']
2,"Looking around the United States, there is nev...",https://thefederalist.com/2020/03/11/woman-who...,federalist,abortion,right,1,1,Biased,Somewhat factual but also opinionated,The left has a thing for taking babies hostage...,"['killing', 'never', 'developing', 'humans', '..."
3,The Republican president assumed he was helpin...,http://www.msnbc.com/rachel-maddow-show/auto-i...,msnbc,environment,left,1,1,Biased,Expresses writer’s opinion,"In Barack Obama’s first term, the administrati...","['rejects', 'happy', 'assumed']"
4,The explosion of the Hispanic population has l...,https://www.breitbart.com/politics/2015/02/26/...,breitbart,student-debt,right,1,1,Biased,No agreement,"Republicans should stop fighting amnesty, Pres...",['explosion']


SG1


Unnamed: 0,text,news_link,outlet,topic,type,label_bias,label_opinion,biased_words
0,The Republican president assumed he was helpin...,http://www.msnbc.com/rachel-maddow-show/auto-i...,msnbc,environment,left,Biased,Expresses writer’s opinion,[]
1,Though the indictment of a woman for her own p...,https://eu.usatoday.com/story/news/nation/2019...,usa-today,abortion,center,Non-biased,Somewhat factual but also opinionated,[]
2,Ingraham began the exchange by noting American...,https://www.breitbart.com/economy/2020/01/12/d...,breitbart,immigration,right,No agreement,No agreement,['flood']
3,The tragedy of America’s 18 years in Afghanist...,http://feedproxy.google.com/~r/breitbart/~3/ER...,breitbart,international-politics-and-world-news,right,Biased,Somewhat factual but also opinionated,"['tragedy', 'stubborn']"
4,The justices threw out a challenge from gun ri...,https://www.huffpost.com/entry/supreme-court-g...,msnbc,gun-control,left,Non-biased,Entirely factual,[]


SG2


Unnamed: 0,text,news_link,outlet,topic,type,label_bias,label_opinion,biased_words
0,"""Orange Is the New Black"" star Yael Stone is r...",https://www.foxnews.com/entertainment/australi...,Fox News,environment,right,Non-biased,Entirely factual,[]
1,"""We have one beautiful law,"" Trump recently sa...",https://www.alternet.org/2020/06/law-and-order...,Alternet,gun control,left,Biased,Somewhat factual but also opinionated,"['bizarre', 'characteristically']"
2,"...immigrants as criminals and eugenics, all o...",https://www.nbcnews.com/news/latino/after-step...,MSNBC,white-nationalism,left,Biased,Expresses writer’s opinion,"['criminals', 'fringe', 'extreme']"
3,...we sounded the alarm in the early months of...,https://www.alternet.org/2019/07/fox-news-has-...,Alternet,white-nationalism,left,Biased,Somewhat factual but also opinionated,[]
4,[Black Lives Matter] is essentially a non-fals...,http://feedproxy.google.com/~r/breitbart/~3/-v...,Breitbart,marriage-equality,,Biased,Expresses writer’s opinion,['cult']


SG1-dt-final


Unnamed: 0.1,Unnamed: 0,sentence,outlet,topic,type,article,biased_words2,text,text_low,pos,...,ne_NORP_context,ne_ORDINAL_context,ne_ORG_context,ne_PERCENT_context,ne_PERSON_context,ne_PRODUCT_context,ne_QUANTITY_context,ne_TIME_context,ne_WORK_OF_ART_context,ne_LANGUAGE_context
0,0,YouTube is making clear there will be no “birt...,usa-today,elections-2020,center,YouTube is making clear there will be no “birt...,[],YouTube,youtube,PROPN,...,0,0,0,0,0,0,0,0,0,0
1,2,YouTube is making clear there will be no “birt...,usa-today,elections-2020,center,YouTube is making clear there will be no “birt...,[],making,making,VERB,...,0,0,0,0,0,0,0,0,0,0
2,3,YouTube is making clear there will be no “birt...,usa-today,elections-2020,center,YouTube is making clear there will be no “birt...,[],clear,clear,ADJ,...,0,0,0,0,0,0,0,0,0,0
3,8,YouTube is making clear there will be no “birt...,usa-today,elections-2020,center,YouTube is making clear there will be no “birt...,[],birtherism,birtherism,NOUN,...,0,0,0,0,0,0,0,0,0,0
4,11,YouTube is making clear there will be no “birt...,usa-today,elections-2020,center,YouTube is making clear there will be no “birt...,[],platform,platform,NOUN,...,0,0,0,0,0,0,0,0,0,0


SG2-dt-final


Unnamed: 0.1,Unnamed: 0,sentence,outlet,topic,type,article,biased_words2,text,text_low,pos,...,ne_NORP_context,ne_ORDINAL_context,ne_ORG_context,ne_PERCENT_context,ne_PERSON_context,ne_PRODUCT_context,ne_QUANTITY_context,ne_TIME_context,ne_WORK_OF_ART_context,ne_LANGUAGE_context
0,0,"""Orange Is the New Black"" star Yael Stone is r...",Fox News,environment,right,"""Orange Is the New Black"" star Yael Stone is r...",[],Orange,orange,PROPN,...,0,0,0,0,0,0,0,0,1,0
1,3,"""Orange Is the New Black"" star Yael Stone is r...",Fox News,environment,right,"""Orange Is the New Black"" star Yael Stone is r...",[],New,new,PROPN,...,0,0,0,0,0,0,0,0,1,0
2,4,"""Orange Is the New Black"" star Yael Stone is r...",Fox News,environment,right,"""Orange Is the New Black"" star Yael Stone is r...",[],Black,black,PROPN,...,0,0,0,0,1,0,0,0,1,0
3,5,"""Orange Is the New Black"" star Yael Stone is r...",Fox News,environment,right,"""Orange Is the New Black"" star Yael Stone is r...",[],star,star,NOUN,...,0,0,0,0,1,0,0,0,1,0
4,6,"""Orange Is the New Black"" star Yael Stone is r...",Fox News,environment,right,"""Orange Is the New Black"" star Yael Stone is r...",[],Yael,yael,PROPN,...,0,0,0,0,1,0,0,0,1,0


Lexicon


Unnamed: 0,slavishness
0,abhorring
1,passivism
2,discomfits
3,consequentialist
4,judgmentalism


Biased headlines


Unnamed: 0,stories_id,publish_date,title,url,language,ap_syndicated,themes,media_id,media_name,media_url
0,900954563,2020-04-15 00:00:00,One Day U: Going Back to College for One Day |...,https://www.huffingtonpost.com/renee-fisher/on...,en,False,,27502,HuffPost,http://www.huffingtonpost.com/#
1,802518660,2020-11-15 00:00:00,Watch Donald Trump Compassionately Defend Drea...,https://www.huffingtonpost.com/entry/trump-def...,en,False,,27502,HuffPost,http://www.huffingtonpost.com/#
2,869218430,2020-04-18 00:00:00,'Whitewashing War Crimes': How UK Academics Pr...,https://www.huffingtonpost.co.uk/entry/uk-acad...,en,False,,27502,HuffPost,http://www.huffingtonpost.com/#
3,1001112150,2020-06-18 00:00:00,Mexican Official Says Girl With Down Syndrome ...,https://www.huffingtonpost.co.uk/entry/mexican...,en,False,,27502,HuffPost,http://www.huffingtonpost.com/#
4,1483890044,,"Todd Bensman, Author at The Federalist",https://thefederalist.com/author/toddbensman/,en,False,,366282,Federalist,http://thefederalist.com/


Neutral Headlines


Unnamed: 0,stories_id,publish_date,title,url,language,ap_syndicated,themes,media_id,media_name,media_url
0,663692044,2020-07-12 08:00:00,"""Don't Tase Me, Bro"" tops '07 memorable quote ...",http://www.reuters.com/article/us-quotes-odd-i...,en,False,,4442,Reuters,http://www.reuters.com
1,668844035,2020-07-11 08:00:00,U.S. healthcare falls short in survey of 7 nat...,http://www.reuters.com/article/us-healthcare-s...,en,False,,4442,Reuters,http://www.reuters.com
2,669761491,2020-07-10 08:00:00,Nicotine may ease Parkinson's symptoms: U.S. s...,https://www.reuters.com/article/us-parkinsons-...,en,False,,4442,Reuters,http://www.reuters.com
3,680460617,2020-11-05 07:00:00,Experts Divided Over Safety of Indian Point Nu...,http://www.reuters.com/article/idUS38169798202...,en,False,,4442,Reuters,http://www.reuters.com
4,803694205,2020-06-15 00:00:00,"Reports: Petraeus off the list, Trump down to ...",http://thehill.com/blogs/blog-briefing-room/ne...,en,False,,18364,Hill,http://thehill.com/rss/syndicator/19109


So, the dt_final_SG1 and dt_final_SG2 data files seem to contain already encoded data. For now, I won't use them because I will build my own encoding pipeline. The lexicon file contains only a list of words and no header row, so the first word is read as the column name, which will need to be fixed if we use the lexicon.
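One way to handle the missing header is sketched below on a toy frame rather than the real file. The helper `restore_header_row` and the column name `word` are my own placeholders, not part of the dataset. When reading the real file, passing `header=None` and `names=['word']` to `pd.read_excel` should avoid the problem in the first place.

```python
import pandas as pd


def restore_header_row(df, column_name="word"):
    """Move a value that was misread as the column name back into the data."""
    misread_header = df.columns[0]
    values = [misread_header] + df.iloc[:, 0].tolist()
    return pd.DataFrame({column_name: values})


# Toy stand-in for the lexicon as loaded above (the first word became the header).
misread = pd.DataFrame({"slavishness": ["abhorring", "passivism", "discomfits"]})
lexicon_fixed = restore_header_row(misread)
print(lexicon_fixed["word"].tolist())
```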

So let's focus on the three main data files (MBIC, SG1 and SG2) and the headlines, and check some of their attributes.

In [17]:
for name, df in zip(['MBIC', 'SG1', 'SG2', 'Biased headlines', 'Neutral headlines'],[df_mbic, df_sg1, df_sg2, headlines_biased, headlines_neutral]):
    print(f'{name} size: {df.shape}')

MBIC size: (1700, 11)
SG1 size: (1700, 8)
SG2 size: (3674, 8)
Biased headlines size: (45605, 10)
Neutral headlines size: (83143, 10)


In [22]:
overlap_mbic_sg1 = df_mbic['text'][df_mbic['text'].isin(df_sg1['text'])]
# MBIC vs SG2 must be checked against df_sg2; the counts printed below were recorded with df_sg1 here
overlap_mbic_sg2 = df_mbic['text'][df_mbic['text'].isin(df_sg2['text'])]
overlap_sg1_sg2 = df_sg1['text'][df_sg1['text'].isin(df_sg2['text'])]

In [25]:
print(overlap_mbic_sg1.shape, overlap_mbic_sg2.shape, overlap_sg1_sg2.shape)

(1698,) (1698,) (1694,)


SG2 contains nearly all of the SG1 sentences (1694 of 1700), and SG1 in turn overlaps almost completely with MBIC (1698 of 1700), so we can simply use SG2, the largest dataset, without merging it with the other two. The headlines data is much larger, but it does not contain themes or any indication of the political bias of the media outlets the headlines come from. Additionally, the SG2 data has a label for whether a sentence is factual or opinionated, which is another target we can classify on.
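The pairwise overlap check can be generalized so that every pair of datasets is compared in both directions. This is only a sketch on toy frames: the function name `overlap_summary` and the toy sentences are assumptions of mine, and exact string matching on `text` is assumed to be a good enough notion of "same sentence".

```python
import pandas as pd


def overlap_summary(frames, column="text"):
    """For each ordered pair of frames, count rows of the first found in the second."""
    rows = []
    for name_a, df_a in frames.items():
        for name_b, df_b in frames.items():
            if name_a == name_b:
                continue
            shared = df_a[column].isin(df_b[column]).sum()
            rows.append({
                "source": name_a,
                "found_in": name_b,
                "shared": int(shared),
                "source_size": len(df_a),
            })
    return pd.DataFrame(rows)


# Toy data; with the real data this would be {"MBIC": df_mbic, "SG1": df_sg1, "SG2": df_sg2}.
toy_frames = {
    "A": pd.DataFrame({"text": ["s1", "s2", "s3"]}),
    "B": pd.DataFrame({"text": ["s2", "s3", "s4", "s5"]}),
}
summary = overlap_summary(toy_frames)
print(summary)
```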

One thing we can do with the headlines, though, is add a column for the political bias of their media outlets, based on the other data, and explore them in a little more detail before moving on.

In [34]:
list(headlines_biased['media_name'].unique())+list(headlines_neutral['media_name'].unique())

['HuffPost',
 'Federalist',
 'Daily Beast',
 'Alternet',
 'Breitbart',
 'New Yorker',
 'American Greatness',
 'Daily Caller',
 'Daily Wire',
 'Slate',
 'Reuters',
 'Hill',
 'USA Today',
 'CNBC',
 'Yahoo News - Latest News & Headlines',
 'AP',
 'Bloomberg']

In [28]:
df_sg2['outlet'].unique()

array(['Fox News', 'Alternet', 'MSNBC', 'Breitbart', 'Federalist',
       'Reuters', 'USA Today', 'Daily Beast', 'HuffPost', 'Daily Stormer',
       'New York Times'], dtype=object)

It looks like we can't simply reuse the outlet types from the SG2 data, since several of the headline outlets are missing there, so let's build a map from the AllSides media bias chart.
![AllSides Media Bias Chart](../data/allsides_media_bias_chart_version_80.png)
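The gap between the two outlet lists printed above can be made explicit with a set difference. The lists below are copied from those two outputs; the variable names are my own.

```python
# Outlet names copied from the two outputs above.
headline_outlets = [
    "HuffPost", "Federalist", "Daily Beast", "Alternet", "Breitbart",
    "New Yorker", "American Greatness", "Daily Caller", "Daily Wire", "Slate",
    "Reuters", "Hill", "USA Today", "CNBC",
    "Yahoo News - Latest News & Headlines", "AP", "Bloomberg",
]
sg2_outlets = [
    "Fox News", "Alternet", "MSNBC", "Breitbart", "Federalist", "Reuters",
    "USA Today", "Daily Beast", "HuffPost", "Daily Stormer", "New York Times",
]

# Headline outlets for which SG2 provides no 'type' value.
missing_in_sg2 = sorted(set(headline_outlets) - set(sg2_outlets))
print(len(missing_in_sg2), missing_in_sg2)
```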

In [35]:
media_bias_map = {
    'HuffPost': 'left',
    'Federalist': 'right',
    'Daily Beast': 'left',
    'Alternet': 'left',
    'Breitbart': 'right',
    'New Yorker': 'left',
    'American Greatness': 'right',  # from https://mediabiasfactcheck.com/american-greatness/
    'Daily Caller': 'right',
    'Daily Wire': 'right',
    'Slate': 'left',
    'Reuters': 'center',
    'Hill': 'center',  # from https://mediabiasfactcheck.com/the-hill/
    'USA Today': 'lean left',
    'CNBC': 'lean left',
    'Yahoo News - Latest News & Headlines': 'lean left',
    'AP': 'lean left',
    'Bloomberg': 'lean left',
}

In [39]:
headlines_biased['type'] = headlines_biased['media_name'].map(media_bias_map)
headlines_neutral['type'] = headlines_neutral['media_name'].map(media_bias_map)
headlines_biased['label_bias'] = 'biased'
headlines_neutral['label_bias'] = 'neutral'

headlines = pd.concat([headlines_biased, headlines_neutral]).sample(frac=1, random_state=42).reset_index(drop=True)
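Because `Series.map` with a dict yields `NaN` for any key that is not in the dict, it may be worth checking for unmapped outlets after a step like the one above. A minimal sketch on toy data follows; `toy_map` and the outlet "Unknown Outlet" are made up for illustration.

```python
import pandas as pd

# Hypothetical subset of a bias map and a toy frame with one outlet not in the map.
toy_map = {"Reuters": "center", "Slate": "left"}
toy = pd.DataFrame({"media_name": ["Reuters", "Slate", "Unknown Outlet"]})
toy["type"] = toy["media_name"].map(toy_map)

# Outlets that silently received NaN because they have no entry in the map.
unmapped = toy.loc[toy["type"].isna(), "media_name"].unique().tolist()
print(unmapped)
```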

In [41]:
headlines.to_csv('../data/headlines_merged.csv')

In [42]:
df_sg2.to_csv('../data/sentences_data.csv')

## Visualize distributions and correlations
We have two datasets to work with now: the labeled sentences and headlines data. To ensure better performance of the ML model, let's see if the data is balanced in the categories we're interested in and decide how to sample it for training, validation and testing.
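If the labels turn out to be imbalanced, one option is a stratified split that keeps the label proportions the same in every subset. The sketch below uses only pandas on toy data. The function `stratified_split`, the 80/20 class ratio and the 20% test fraction are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd


def stratified_split(df, label_col, test_frac=0.2, seed=42):
    """Sample the same fraction from every label group for the test set."""
    test = df.groupby(label_col, group_keys=False).sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train, test


# Toy data with an 80/20 class imbalance.
toy = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(100)],
    "label_bias": ["Biased"] * 80 + ["Non-biased"] * 20,
})
train, test = stratified_split(toy, "label_bias")

# Both subsets should show the same label proportions as the full data.
print(train["label_bias"].value_counts(normalize=True))
print(test["label_bias"].value_counts(normalize=True))
```

The same helper could be applied a second time to the training portion to carve out a validation set.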

In [43]:
df1 = pd.read_csv('../data/sentences_data.csv')
df2 = pd.read_csv('../data/headlines_merged.csv')

### Bias

In [78]:
fig1 = px.bar(df1['label_bias'].value_counts(), y='count')
fig2 = px.bar(df2['label_bias'].value_counts(), y='count')

subplot_fig = sp.make_subplots(rows=1, cols=2, specs=[[{'type': 'xy'}, {'type': 'xy'}]],
                                subplot_titles=("Sentence data", "Headlines data"))
subplot_fig.add_trace(fig1['data'][0], row=1, col=1)
subplot_fig.add_trace(fig2['data'][0], row=1, col=2)


subplot_fig.update_layout(
    title_text='Distribution of bias labels',
    title_x=0.5,
    title_y=0.95,
    showlegend=False
)

subplot_fig.show()

### Factuality

In [55]:
px.histogram(df1, x='label_opinion', title='Distribution of factuality labels in Sentence data')

### Bias VS type of outlet

In [98]:
px.histogram(df1, x='type', color='label_bias', barmode='group', title='Sentence data')
# fig2 = px.histogram(df2, x='type', color='label_bias')

The distributions of sentence bias with respect to type of outlet are close to what I would expect: most sentences from center outlets are labeled as non-biased, with a small set of biased sentences that may come from opinion pieces or human oversight. The right- and left-leaning outlets, in contrast, have a larger share of biased sentences than non-biased ones.

In [99]:
px.histogram(df2, x='type', color='label_bias', barmode='group', title='Headlines data')

From the distributions of bias label and outlet type, it seems that in the headlines data the bias label is assigned solely based on the type of media outlet: headlines from left and right outlets are labeled as biased, while center and lean-left headlines are labeled as neutral. Since the purpose of this project is to detect bias as perceived by humans from the words used in a sentence, this dataset is not appropriate. I will therefore move forward only with the sentence data, where bias and factuality are labeled by expert human annotators.
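The observation that the label follows the outlet type can also be checked numerically with a contingency table: if every outlet type has a non-zero count in exactly one label column, the label carries no information beyond the type. The sketch below uses toy rows; with the real data the inputs would be `df2['type']` and `df2['label_bias']`.

```python
import pandas as pd

# Toy rows mimicking the pattern seen in the headlines data.
toy = pd.DataFrame({
    "type": ["left", "left", "right", "center", "lean left", "center"],
    "label_bias": ["biased", "biased", "biased", "neutral", "neutral", "neutral"],
})

table = pd.crosstab(toy["type"], toy["label_bias"])
# Number of distinct labels observed for each outlet type.
labels_per_type = (table > 0).sum(axis=1)
label_follows_type = bool((labels_per_type == 1).all())

print(table)
print("Label fully determined by outlet type:", label_follows_type)
```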

# 2. Transform data

To get embedding similarities in the training data set that are closer to how ChatGPT operates, I will use the OpenAI API to compute the sentence embeddings and then use them to train a classifier for bias detection in ChatGPT-generated content.

In [24]:
import os
import numpy as np
import plotly.express as px
import openai
import tiktoken
from openai.embeddings_utils import get_embedding  # helper module shipped with openai<1.0

In [16]:
api_key = os.getenv('OPENAI_API_KEY')
openai.api_key = api_key

In [12]:
df = pd.read_csv('../data/sentences_data.csv').drop(columns=['Unnamed: 0'])

In [9]:
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000

In [18]:
encoding = tiktoken.get_encoding(embedding_encoding)

# omit sentences that are too long to embed
df["n_tokens"] = df.text.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].copy()  # copy so later column assignments don't act on a slice
len(df)

3674

In [19]:
df["embedding"] = df.text.apply(lambda x: get_embedding(x, engine=embedding_model))

In [28]:
embeddings = np.vstack(df['embedding'].values)
np.save('../data/sentences_embeddings.npy', embeddings)

In [113]:
df.to_csv('../data/sentences_embeddings.csv')
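One caveat when saving a list-valued column such as `embedding` to CSV: the lists are written as text, so they come back as strings when the file is read again. The sketch below shows one way to recover them, using a toy frame and an in-memory buffer instead of the real file. It assumes the embeddings were plain Python lists when they were saved.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy frame standing in for df; the two-dimensional "embeddings" are made up.
toy = pd.DataFrame({"text": ["a", "b"], "embedding": [[0.1, 0.2], [0.3, 0.4]]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip each embedding is a string such as "[0.1, 0.2]".
first_raw = reloaded.loc[0, "embedding"]

# Parse the strings back into lists and stack them into a 2-D array.
reloaded["embedding"] = reloaded["embedding"].apply(ast.literal_eval)
matrix = np.vstack(reloaded["embedding"].values)
print(type(first_raw).__name__, matrix.shape)
```

Loading the `.npy` file written in the previous cell avoids this parsing step entirely.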

In [29]:
embeddings.shape

(3674, 1536)

In [31]:
from umap import UMAP



In [105]:
embeddings_3d = UMAP(n_components=3).fit_transform(embeddings)

In [106]:
df['umap_3d_x'] = embeddings_3d[:,0]
df['umap_3d_y'] = embeddings_3d[:,1]
df['umap_3d_z'] = embeddings_3d[:,2]

In [107]:
fig = px.scatter_3d(df, x='umap_3d_x', y='umap_3d_y', z='umap_3d_z', color='topic', width=1000, height=680)
fig.update_traces(marker_size=1.5)
fig.update_layout(
    scene=dict(
        xaxis=dict(showticklabels=False, title=''),
        yaxis=dict(showticklabels=False, title=''),
        zaxis=dict(showticklabels=False, title=''),
    )
)

In [112]:
fig = px.scatter_3d(df[df['topic']=='taxes'], x='umap_3d_x', y='umap_3d_y', z='umap_3d_z', color='label_bias', width=1000, height=680)
fig.update_traces(marker_size=1.5)
fig.update_layout(
    scene=dict(
        xaxis=dict(showticklabels=False, title=''),
        yaxis=dict(showticklabels=False, title=''),
        zaxis=dict(showticklabels=False, title=''),
    )
)

From the UMAP dimensionality reduction plots above, it's clear that the main driver of similarity between the embedded sentences is their topic. A distance-based model like kNN will therefore most likely be biased towards detecting topics rather than bias. We'll test a couple of different models in the MLP notebook before moving on to finalizing a classification pipeline.
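The visual impression could be backed by a simple number: for each sentence, find its nearest neighbor in embedding space and check how often the two share a topic versus a bias label. The sketch below runs on synthetic vectors that are built to cluster by topic, so it only illustrates the measurement and says nothing about the real embeddings. The helper `nearest_neighbor_agreement` is my own, and cosine similarity is an assumed choice of metric.

```python
import numpy as np


def nearest_neighbor_agreement(embeddings, labels):
    """Fraction of points whose nearest neighbor (cosine) carries the same label."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)  # a point must not be its own neighbor
    nearest = sims.argmax(axis=1)
    labels = np.asarray(labels)
    return float(np.mean(labels == labels[nearest]))


# Synthetic data: three well-separated "topic" clusters and a random binary "bias" label.
rng = np.random.default_rng(0)
topic = np.repeat([0, 1, 2], 40)
bias = rng.integers(0, 2, size=topic.size)
centers = rng.normal(size=(3, 16)) * 5
toy_embeddings = centers[topic] + rng.normal(size=(topic.size, 16))

topic_agreement = nearest_neighbor_agreement(toy_embeddings, topic)
bias_agreement = nearest_neighbor_agreement(toy_embeddings, bias)
print(f"topic agreement: {topic_agreement:.2f}, bias agreement: {bias_agreement:.2f}")
```

On the real data this would be called with `embeddings` and `df['topic']` or `df['label_bias']`. A topic agreement much higher than the bias agreement would support the concern about kNN.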