### Headline network for left and right news media

<span style="font-family: 'Verdana', sans-serif; font-size: 16px;line-height: 1.5;">
This notebook builds word co-occurrence networks from news article titles. It cleans the titles and splits them into two subsets by media ideology (left and right), tokenizes each title, counts how often each pair of words appears in the same title, and keeps only the pairs above a frequency threshold. The surviving pairs are stored as edge DataFrames with a weight column. Word counts and Shifterator scores are then attached to the corresponding words as node attributes. The final output is one edge table and one node table per ideology, exported as CSV files for import into Gephi.</span>

In [9]:
# Importing modules
import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from itertools import combinations
from collections import defaultdict
import operator

In [10]:
# load categorised data by media ideological affiliation
df = pd.read_csv('C:/Users/2146806A/OneDrive - University of Glasgow/University of Glasgow/Amsterdam Visit/GITHUB FILES/Data/preprocessed_nov23.csv', encoding='latin-1')
nodes_df = pd.read_csv('C:/Users/2146806A/OneDrive - University of Glasgow/University of Glasgow/Amsterdam Visit/GITHUB FILES/Data/shifterator_scores_text_nov23.csv', encoding='latin-1')
df.head()  


Unnamed: 0,date,maintext,title,source,media_name,ideology,Congress,maintext_fullstop
0,03/01/2013,greg gutfeld co host hello america greg gutfel...,reagan era over al jazeera buy current tv,C:/Users/2146806A/OneDrive - University of Gla...,Fox News,right,113th,greg gutfeld co host hello america . greg gutf...
1,11/01/2013,late threat leadership easily speaker house jo...,job safe straus challenge gop,C:/Users/2146806A/OneDrive - University of Gla...,The New York Times,left,113th,late threat leadership easily speaker house jo...
2,18/01/2013,story highlight mill remember march selma mont...,year mlk march voting rights vulnerable,https://edition.cnn.com/2013/01/18/opinion/mil...,CNN,left,113th,story highlight mill remember march selma mont...
3,20/01/2013,barack michelle obama spend thousand day displ...,change come year friend shift obama,C:/Users/2146806A/OneDrive - University of Gla...,The New York Times,left,113th,barack michelle obama spend thousand day displ...
4,21/01/2013,sean_hannity host hannity moment ago president...,karl rove,C:/Users/2146806A/OneDrive - University of Gla...,Fox News,right,113th,sean_hannity host hannity moment ago president...


## Co-occurrences in headlines

In [6]:
df['title'] = df['title'].str.replace("'", "")

In [18]:
# Split corpus into right and left news media
left = df[df['ideology'].isin(['left'])]
right = df[df['ideology'].isin(['right'])]

# we want to find co-occurrences of words in headlines so we convert the title column into a list
right_list = right['title'].tolist()
left_list = left['title'].tolist()

In [19]:
tokenized_left = [word_tokenize(item) for item in left['title'] if isinstance(item, str) and item.strip()]
tokenized_right = [word_tokenize(item) for item in right['title'] if isinstance(item, str) and item.strip()]

We then create an edge-list dictionary. Each key is a pair of words that appear in the same headline, and its value is incremented by 1 every time that pair co-occurs. <br> 
The result looks like this: <br> 
 {('ways', 'league'): 2, <br> 
   ('ways', 'women'): 2, <br> 
  ('ways', 'voters'): 3, ...} <br> 
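
To make the counting concrete, here is a minimal sketch on three made-up headlines. It uses `str.split` as a stand-in for `word_tokenize` so that it runs without NLTK data; the headlines and the names `toy_titles` and `toy_edgelist` are illustrative only and are not part of the notebook.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical, already-cleaned headlines (not taken from the corpus)
toy_titles = [
    "voter identification law",
    "texas voter law",
    "voter law challenge",
]

toy_edgelist = defaultdict(int)
for title in toy_titles:
    tokens = title.split()  # stand-in for word_tokenize
    # every pair of distinct words in the same headline adds 1 to that pair
    for node1, node2 in combinations(tokens, 2):
        if node1 != node2:
            toy_edgelist[(node1, node2)] += 1

print(dict(toy_edgelist))
```

The pair `('voter', 'law')` ends up with a value of 3 because it appears in all three toy headlines, while every other pair appears once.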

In [20]:
# edgelist dictionary for right
edgelist_right = defaultdict(int)
for text in tokenized_right:
    for node1, node2 in combinations(text, 2):
        if node1 != node2:  # Skip pairs with the same node
            edgelist_right[(node1, node2)] += 1

# left
edgelist_left = defaultdict(int)
for text in tokenized_left:
    for node1, node2 in combinations(text, 2):
        if node1 != node2:  # Skip pairs with the same node
            edgelist_left[(node1, node2)] += 1

In [21]:
# RIGHT
sorted_edges_r = sorted(edgelist_right.items(), key=operator.itemgetter(1), reverse = True)
sorted_edges_r

# LEFT
sorted_edges_l = sorted(edgelist_left.items(), key=operator.itemgetter(1), reverse = True)
sorted_edges_l

[(('voter', 'identification'), 235),
 (('identification', 'law'), 111),
 (('voter', 'law'), 103),
 (('biden', 'coronavirus'), 88),
 (('democrats', 'biden'), 85),
 (('biden', 'border'), 82),
 (('coronavirus', 'biden'), 73),
 (('biden', 'united_states'), 70),
 (('election', 'law'), 66),
 (('democrats', 'election'), 66),
 (('biden', 'administration'), 65),
 (('trump', 'biden'), 61),
 (('democrats', 'bill'), 61),
 (('biden', 'democrats'), 61),
 (('georgia', 'law'), 60),
 (('biden', 'republican'), 60),
 (('democrats', 'republican'), 56),
 (('biden', 'crisis'), 56),
 (('voting', 'law'), 55),
 (('election', 'integrity'), 53),
 (('election', 'bill'), 52),
 (('voting', 'bill'), 51),
 (('democrats', 'coronavirus'), 50),
 (('biden', 'people'), 50),
 (('trump', 'election'), 48),
 (('biden', 'media'), 46),
 (('gop', 'biden'), 46),
 (('trump', 'united_states'), 45),
 (('election', 'biden'), 45),
 (('democrats', 'voting'), 44),
 (('republican', 'biden'), 44),
 (('biden', 'election'), 43),
 (('biden',
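
The listing above contains both `('biden', 'coronavirus')` and `('coronavirus', 'biden')`, because `combinations` keeps the words in headline order. Since the edges are later labelled `Undirected`, it may be preferable to merge such mirrored pairs. The sketch below shows one way to do that by sorting each pair before counting; whether merging is wanted is an assumption about the analysis, and the names `directed_counts` and `undirected_counts` are hypothetical.

```python
from collections import Counter

# Three counts copied from the listing above
directed_counts = {
    ("biden", "coronavirus"): 88,
    ("coronavirus", "biden"): 73,
    ("trump", "biden"): 61,
}

undirected_counts = Counter()
for (node1, node2), weight in directed_counts.items():
    key = tuple(sorted((node1, node2)))  # same key for both word orders
    undirected_counts[key] += weight

print(undirected_counts)
```

With this key, the two orderings of biden and coronavirus collapse into a single edge whose weight is the sum of both counts.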

In [27]:
#RIGHT
# we filter the number of edges based on weight (aka the frequency of co-occurrence) 
filtered_r = {k: v for k, v in edgelist_right.items() if v > 3}

#LEFT
filtered_l = {k: v for k, v in edgelist_left.items() if v > 2}

# Print lengths
print("Length of filtered_l:", len(filtered_l))
print("Length of filtered_r:", len(filtered_r))

Length of filtered_l: 2013
Length of filtered_r: 7021
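
The cut-offs used above (weight > 3 for right, weight > 2 for left) determine how large each network becomes. A small helper like the one below could be used to see how many edges survive at several candidate thresholds before settling on one. The function name `edges_kept` and the toy dictionary are assumptions for illustration.

```python
def edges_kept(edgelist, thresholds):
    """Return {threshold: number of edges whose weight is > threshold}."""
    return {
        t: sum(1 for weight in edgelist.values() if weight > t)
        for t in thresholds
    }

# Toy edge list with made-up weights
toy = {("a", "b"): 1, ("a", "c"): 3, ("b", "c"): 4, ("c", "d"): 10}
summary = edges_kept(toy, [1, 2, 3])
print(summary)
```

Calling `edges_kept(edgelist_right, range(1, 6))` on the real dictionary would give the same kind of summary for the right-leaning subset.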


In [28]:
# we create a dataframe from the edgelist dictionary with a weight column denoting the number of co-occurrences of a specific pairing
edges_right = pd.DataFrame.from_dict({'Weight': list(filtered_r.values()), 'Edge': list(filtered_r.keys())})
edges_right.head(20)

Unnamed: 0,Weight,Edge
0,4,"(radical, run)"
1,4,"(house, senator)"
2,8,"(senator, scott)"
3,31,"(texas, voter)"
4,27,"(texas, identification)"
5,23,"(texas, law)"
6,235,"(voter, identification)"
7,103,"(voter, law)"
8,4,"(voter, decision)"
9,111,"(identification, law)"


In [29]:
# we create a dataframe from the edgelist dictionary with a weight column denoting the number of co-occurrences of a specific pairing
edges_left = pd.DataFrame.from_dict({'Weight': list(filtered_l.values()), 'Edge': list(filtered_l.keys())})
edges_left.head(20)

Unnamed: 0,Weight,Edge
0,3,"(challenge, gop)"
1,5,"(year, march)"
2,7,"(year, voting)"
3,7,"(year, rights)"
4,7,"(march, voting)"
5,8,"(march, rights)"
6,206,"(voting, rights)"
7,3,"(virginia, photo)"
8,4,"(virginia, identification)"
9,3,"(lawmaker, pass)"


In [30]:
# we extract the words from the column edge so that we have a source and target word
weight_r = edges_right['Weight']
source_r = [x[0] for x in edges_right['Edge']]
target_r = [x[1] for x in edges_right['Edge']]

weight_r = weight_r.to_list()

# we create a dataframe of edges
df_edges_r = {
    'col1': source_r,
    'col2': target_r,
    'col3': weight_r,
}

df_edges_r = pd.DataFrame(df_edges_r)
df_edges_r.rename(columns={'col1': 'Source', 'col2': 'Target', 'col3':'Weight'}, inplace=True)

df_edges_r = df_edges_r.assign(Type='Undirected')

In [31]:
# we extract the words from the column edge so that we have a source and target word
weight_l = edges_left['Weight']
source_l = [x[0] for x in edges_left['Edge']]
target_l = [x[1] for x in edges_left['Edge']]

weight_l = weight_l.to_list()

# we create a dataframe of edges
df_edges_l = {
    'col1': source_l,
    'col2': target_l,
    'col3': weight_l,
}

df_edges_l = pd.DataFrame(df_edges_l)
df_edges_l.rename(columns={'col1': 'Source', 'col2': 'Target', 'col3':'Weight'}, inplace=True)

df_edges_l

df_edges_l = df_edges_l.assign(Type='Undirected')
df_edges_l


Unnamed: 0,Source,Target,Weight,Type
0,challenge,gop,3,Undirected
1,year,march,5,Undirected
2,year,voting,7,Undirected
3,year,rights,7,Undirected
4,march,voting,7,Undirected
...,...,...,...,...
2008,court,power,3,Undirected
2009,weigh,rule,4,Undirected
2010,power,rule,4,Undirected
2011,power,case,3,Undirected
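
The steps above (dictionary, then an `Edge`/`Weight` frame, then `Source`/`Target`/`Weight`/`Type`) can also be collapsed into one function that is applied to both subsets. This is only a sketch of an equivalent shortcut; the helper name `edges_to_frame` is hypothetical.

```python
import pandas as pd

def edges_to_frame(filtered):
    """Turn {(source, target): weight} into a Gephi-style edge table."""
    rows = [(source, target, weight) for (source, target), weight in filtered.items()]
    frame = pd.DataFrame(rows, columns=["Source", "Target", "Weight"])
    return frame.assign(Type="Undirected")

# Two pairs and weights taken from the left-leaning output above
toy_filtered = {("voting", "rights"): 206, ("year", "march"): 5}
toy_edges = edges_to_frame(toy_filtered)
print(toy_edges)
```

Using one helper for both `filtered_l` and `filtered_r` avoids keeping two near-identical cells in sync.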


#### Add word count as node attribute

In [32]:
words_right = right['title'].str.split().explode()
word_counts_r = words_right.value_counts().reset_index() # Calculate the word counts
word_counts_r.columns = ['word', 'count'] # Rename the columns
word_counts_r

Unnamed: 0,word,count
0,biden,389
1,voter,376
2,election,337
3,identification,289
4,trump,260
...,...,...
4860,agitate,1
4861,uagitate,1
4862,type,1
4863,gradual,1


In [33]:
words_left = left['title'].str.split().explode()
word_counts_l = words_left.value_counts().reset_index() # Calculate the word counts
word_counts_l.columns = ['Id', 'count'] # Rename the columns
word_counts_l

Unnamed: 0,Id,count
0,voter,548
1,voting,430
2,law,333
3,vote,283
4,identification,279
...,...,...
3414,purple,1
3415,tampering,1
3416,ugreater,1
3417,exclusionist,1


In [34]:
filtered_word_counts_l = word_counts_l[word_counts_l['Id'].isin(df_edges_l['Source']) | word_counts_l['Id'].isin(df_edges_l['Target'])]
len(filtered_word_counts_l)

534

In [35]:
words_right = right['title'].str.split().explode()
word_counts_r = words_right.value_counts().reset_index() # Calculate the word counts
word_counts_r.columns = ['Id', 'count'] # Rename the columns
word_counts_r

Unnamed: 0,Id,count
0,biden,389
1,voter,376
2,election,337
3,identification,289
4,trump,260
...,...,...
4860,agitate,1
4861,uagitate,1
4862,type,1
4863,gradual,1


In [36]:
filtered_word_counts_r = word_counts_r[word_counts_r['Id'].isin(df_edges_r['Source']) | word_counts_r['Id'].isin(df_edges_r['Target'])]
len(filtered_word_counts_r)

1342

## Shifterator values

In [37]:
nodes_df

Unnamed: 0,score,word,left_right,norm_score
0,0.000049,scratch,right,-0.239889
1,0.000049,diagnose,right,-0.239889
2,-0.000018,resist,left,0.090298
3,-0.000068,reconfigure,left,0.330187
4,-0.000018,darn,left,0.090298
...,...,...,...,...
6254,0.000245,chicago,right,-1.199446
6255,0.000147,harass,right,-0.719668
6256,-0.000068,tampering,left,0.330187
6257,-0.000068,taunts,left,0.330187


In [38]:
nodes_df['score'] = abs(nodes_df['norm_score']) # overwrite 'score' with the absolute value of the normalised Shifterator score so that all values are non-negative
nodes_df

Unnamed: 0,score,word,left_right,norm_score
0,0.239889,scratch,right,-0.239889
1,0.239889,diagnose,right,-0.239889
2,0.090298,resist,left,0.090298
3,0.330187,reconfigure,left,0.330187
4,0.090298,darn,left,0.090298
...,...,...,...,...
6254,1.199446,chicago,right,-1.199446
6255,0.719668,harass,right,-0.719668
6256,0.330187,tampering,left,0.330187
6257,0.330187,taunts,left,0.330187
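
Note that the cell above replaces the raw `score` column. If the raw values are still needed later, the absolute value can be written to a separate column instead. The sketch below uses two rows from the table above; the column name `abs_norm_score` is an assumption, not something the notebook defines.

```python
import pandas as pd

# Two rows copied from the nodes_df preview above
toy_nodes = pd.DataFrame({
    "word": ["scratch", "resist"],
    "score": [0.000049, -0.000018],
    "left_right": ["right", "left"],
    "norm_score": [-0.239889, 0.090298],
})

# Keep the raw score and store the magnitude in a new column
toy_nodes["abs_norm_score"] = toy_nodes["norm_score"].abs()
print(toy_nodes)
```

In the rows shown, the direction of the shift remains available through `left_right`, so the magnitude column can be used for node size without losing that information.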


In [39]:
# we first remove words from the nodes list that do not appear in the edges list
nodes_df_l = nodes_df[nodes_df['word'].isin(df_edges_l['Source']) | nodes_df['word'].isin(df_edges_l['Target'])]
nodes_df_r = nodes_df[nodes_df['word'].isin(df_edges_r['Source']) | nodes_df['word'].isin(df_edges_r['Target'])]
# we then filter the edges list against the nodes list
# note: `|` keeps an edge when at least one of its two words is in the nodes list; use `&` to require both
Edges_df_filtered_l = df_edges_l[df_edges_l['Source'].isin(nodes_df['word'])| df_edges_l['Target'].isin(nodes_df['word'])]
Edges_df_filtered_r = df_edges_r[df_edges_r['Source'].isin(nodes_df['word'])| df_edges_r['Target'].isin(nodes_df['word'])]
Edges_df_filtered_l


Unnamed: 0,Source,Target,Weight,Type
0,challenge,gop,3,Undirected
1,year,march,5,Undirected
2,year,voting,7,Undirected
3,year,rights,7,Undirected
4,march,voting,7,Undirected
...,...,...,...,...
2008,court,power,3,Undirected
2009,weigh,rule,4,Undirected
2010,power,rule,4,Undirected
2011,power,case,3,Undirected
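
The filtered left edge table has the same index range as the unfiltered one. One possible reason is that the edge filter combines its two conditions with `|`, which keeps an edge as soon as one endpoint has a score. The toy example below contrasts `|` with `&`; the words, weights and the set `scored_words` are made up for illustration.

```python
import pandas as pd

toy_edges = pd.DataFrame({
    "Source": ["voter", "voter", "foo"],
    "Target": ["law", "bar", "baz"],
    "Weight": [103, 4, 3],
})
scored_words = {"voter", "law"}  # hypothetical words that have a score

src_ok = toy_edges["Source"].isin(scored_words)
tgt_ok = toy_edges["Target"].isin(scored_words)

either = toy_edges[src_ok | tgt_ok]  # at least one endpoint is scored
both = toy_edges[src_ok & tgt_ok]    # both endpoints are scored

print(len(either), len(both))
```

Requiring both endpoints guarantees that every edge in the exported table connects two words that also appear in the node table.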


In [40]:
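# Group by 'Source' and 'Target' and sum 'Weight'; groupby returns a new DataFrame, so the result is assigned back
# ('Type' is included in the keys to keep that column; sort=False keeps the original row order)
cols = ['Source', 'Target', 'Weight', 'Type']
df_edges_l = df_edges_l.groupby(['Source', 'Target', 'Type'], sort=False, as_index=False)['Weight'].sum()[cols]
df_edges_r = df_edges_r.groupby(['Source', 'Target', 'Type'], sort=False, as_index=False)['Weight'].sum()[cols]

# Each DataFrame now has a single row for each combination of 'Source' and 'Target' with the 'Weight' values summed
df_edges_l
df_edges_r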

Unnamed: 0,Source,Target,Weight,Type
0,radical,run,4,Undirected
1,house,senator,4,Undirected
2,senator,scott,8,Undirected
3,texas,voter,31,Undirected
4,texas,identification,27,Undirected
...,...,...,...,...
7016,border,tiktok,4,Undirected
7017,border,youngkin,4,Undirected
7018,border,one,4,Undirected
7019,tiktok,youngkin,4,Undirected


In [41]:
#nodes_df_filtered['Id'] = range(1, len(nodes_df_filtered) + 1) # create new ID column
# rename returns a new DataFrame; assigning it avoids the SettingWithCopyWarning raised by inplace=True on a slice
nodes_df_l = nodes_df_l.rename(columns={'word': 'Id'})
nodes_df_r = nodes_df_r.rename(columns={'word': 'Id'})
nodes_df_r


Unnamed: 0,score,Id,left_right,norm_score
9,0.719668,pour,right,-0.719668
21,13.642681,crisis,right,-13.642681
22,4.834227,session,left,4.834227
24,0.358475,gubernatorial,right,-0.358475
45,0.118585,universal,right,-0.118585
...,...,...,...,...
6234,0.719668,oakland,right,-0.719668
6240,0.301900,review,left,0.301900
6246,2.644218,set,left,2.644218
6254,1.199446,chicago,right,-1.199446


#### Add counts as attributes

In [42]:
words_r = right['title'].str.split().explode()
words_l = left['title'].str.split().explode()

# Calculate the word counts
word_counts_r = words_r.value_counts().reset_index()
word_counts_l = words_l.value_counts().reset_index()

# Rename the columns
word_counts_r.columns = ['Id', 'count']
word_counts_l.columns = ['Id', 'count']

In [43]:
word_counts_l

Unnamed: 0,Id,count
0,voter,548
1,voting,430
2,law,333
3,vote,283
4,identification,279
...,...,...
3414,purple,1
3415,tampering,1
3416,ugreater,1
3417,exclusionist,1


In [44]:
# Get unique words from 'source' and 'target' columns in the edgelist dataframe
unique_words_l = set(df_edges_l['Source']).union(set(df_edges_l['Target']))
unique_words_r = set(df_edges_r['Source']).union(set(df_edges_r['Target']))

# Filter the word counts dataframe based on unique words
filtered_word_counts_l = word_counts_l[word_counts_l['Id'].isin(unique_words_l)]
filtered_word_counts_r = word_counts_r[word_counts_r['Id'].isin(unique_words_r)]

In [45]:
nodes_df_l

Unnamed: 0,score,Id,left_right,norm_score
21,13.642681,crisis,right,-13.642681
22,4.834227,session,left,4.834227
28,1.080861,norm,left,1.080861
48,22.463622,supreme_court,left,22.463622
54,1.650937,stress,left,1.650937
...,...,...,...,...
6214,4.173852,it,left,4.173852
6222,1.163004,real,right,-1.163004
6224,1.047137,attempt,right,-1.047137
6240,0.301900,review,left,0.301900


In [46]:
nodes_df_filtered_l = nodes_df_l[nodes_df_l['Id'].isin(unique_words_l)]
nodes_df_filtered_r = nodes_df_r[nodes_df_r['Id'].isin(unique_words_r)]

In [47]:
nodes_df_filtered_r = pd.merge(nodes_df_filtered_r, filtered_word_counts_r, on='Id', how='left')
nodes_df_filtered_l = pd.merge(nodes_df_filtered_l, filtered_word_counts_l, on='Id', how='left')
nodes_df_filtered_r

Unnamed: 0,score,Id,left_right,norm_score,count
0,0.719668,pour,right,-0.719668,3
1,13.642681,crisis,right,-13.642681,61
2,4.834227,session,left,4.834227,6
3,0.358475,gubernatorial,right,-0.358475,7
4,0.118585,universal,right,-0.118585,6
...,...,...,...,...,...
1335,0.719668,oakland,right,-0.719668,3
1336,0.301900,review,left,0.301900,7
1337,2.644218,set,left,2.644218,11
1338,1.199446,chicago,right,-1.199446,5


In [48]:
nodes_df_filtered_l

Unnamed: 0,score,Id,left_right,norm_score,count
0,13.642681,crisis,right,-13.642681,3
1,4.834227,session,left,4.834227,19
2,1.080861,norm,left,1.080861,4
3,22.463622,supreme_court,left,22.463622,100
4,1.650937,stress,left,1.650937,5
...,...,...,...,...,...
528,4.173852,it,left,4.173852,17
529,1.163004,real,right,-1.163004,19
530,1.047137,attempt,right,-1.047137,7
531,0.301900,review,left,0.301900,6


In [49]:
nodes_df_filtered_l['count'].isna().sum()

0
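
The `isna().sum()` check above confirms that every left-leaning node received a count. A related option is to pass `indicator=True` to `pd.merge`, which adds a `_merge` column showing which rows found a match. The frames below are toy data with made-up values, and the variable names are hypothetical.

```python
import pandas as pd

toy_nodes = pd.DataFrame({"Id": ["crisis", "session", "norm"], "score": [13.6, 4.8, 1.1]})
toy_counts = pd.DataFrame({"Id": ["crisis", "session"], "count": [61, 6]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(toy_nodes, toy_counts, on="Id", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Id"].tolist()

print(unmatched)
```

Listing the unmatched words by name makes it easier to see why a node is missing a count than a bare NaN total does.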

In [50]:
nodes_df_filtered_l = nodes_df_filtered_l.drop('norm_score', axis=1)
nodes_df_filtered_r = nodes_df_filtered_r.drop('norm_score', axis=1)


In [51]:
nodes_df_filtered_r = nodes_df_filtered_r.dropna(subset=['count'])
nodes_df_filtered_r.head()

Unnamed: 0,score,Id,left_right,count
0,0.719668,pour,right,3
1,13.642681,crisis,right,61
2,4.834227,session,left,6
3,0.358475,gubernatorial,right,7
4,0.118585,universal,right,6


In [52]:
nodes_df_filtered_l = nodes_df_filtered_l.dropna(subset=['count'])
nodes_df_filtered_l.head()

Unnamed: 0,score,Id,left_right,count
0,13.642681,crisis,right,3
1,4.834227,session,left,19
2,1.080861,norm,left,4
3,22.463622,supreme_court,left,100
4,1.650937,stress,left,5


In [53]:
len(nodes_df_filtered_l)

533

In [54]:
nodes_df_filtered_l

Unnamed: 0,score,Id,left_right,count
0,13.642681,crisis,right,3
1,4.834227,session,left,19
2,1.080861,norm,left,4
3,22.463622,supreme_court,left,100
4,1.650937,stress,left,5
...,...,...,...,...
528,4.173852,it,left,17
529,1.163004,real,right,19
530,1.047137,attempt,right,7
531,0.301900,review,left,6


In [55]:
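# put 'Id' first in both node tables so the exported CSV files share the same column order
nodes_df_filtered_l = nodes_df_filtered_l[['Id', 'score', 'left_right', 'count']]
nodes_df_filtered_r = nodes_df_filtered_r[['Id', 'score', 'left_right', 'count']]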

In [56]:
Edges_df_filtered_l

Unnamed: 0,Source,Target,Weight,Type
0,challenge,gop,3,Undirected
1,year,march,5,Undirected
2,year,voting,7,Undirected
3,year,rights,7,Undirected
4,march,voting,7,Undirected
...,...,...,...,...
2008,court,power,3,Undirected
2009,weigh,rule,4,Undirected
2010,power,rule,4,Undirected
2011,power,case,3,Undirected


In [71]:
#save edgelist as csv file to import into Gephi
df_edges_l.to_csv('C:/Users/2146806A/OneDrive - University of Glasgow/University of Glasgow/Amsterdam Visit/GITHUB FILES/Data/edges_left_title_nov23.csv', index=False)
df_edges_r.to_csv('C:/Users/2146806A/OneDrive - University of Glasgow/University of Glasgow/Amsterdam Visit/GITHUB FILES/Data/edges_right_title_nov23.csv', index=False)

In [72]:
nodes_df_filtered_l.to_csv('C:/Users/2146806A/OneDrive - University of Glasgow/University of Glasgow/Amsterdam Visit/GITHUB FILES/Data/nodes_left_title_nov23.csv', index=False) #export to csv
nodes_df_filtered_r.to_csv('C:/Users/2146806A/OneDrive - University of Glasgow/University of Glasgow/Amsterdam Visit/GITHUB FILES/Data/nodes_right_title_nov23.csv', index=False) #export to csv
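
Before importing the files, it may be worth checking that every word used as an edge endpoint also has a row in the node table. Otherwise some nodes in the network would have no score or count attributes. The helper `missing_node_ids` and the toy frames below are assumptions for illustration, with made-up counts.

```python
import pandas as pd

def missing_node_ids(edges, nodes):
    """Return edge endpoints in `edges` that have no row in `nodes`."""
    endpoints = set(edges["Source"]) | set(edges["Target"])
    return sorted(endpoints - set(nodes["Id"]))

toy_edges = pd.DataFrame({
    "Source": ["voter", "texas"],
    "Target": ["law", "voter"],
    "Weight": [103, 31],
})
toy_nodes = pd.DataFrame({"Id": ["voter", "law"], "count": [10, 5]})

print(missing_node_ids(toy_edges, toy_nodes))
```

On the real data the call would be `missing_node_ids(df_edges_l, nodes_df_filtered_l)`, and an empty list would mean the two exported files are consistent.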

In [None]:
# FOR MAIN BODY OF TEXT
# Split corpus into right and left news media
left = df[df['ideology'].isin(['left'])]
right = df[df['ideology'].isin(['right'])]

# we want to find co-occurrences of words in sentences so we convert the maintext column into a list
right_list = right['maintext'].tolist()
left_list = left['maintext'].tolist()

In [None]:
# split the articles into sentences and create a dataframe with the sentences
sentences_left = []
for article in left_list:
    if isinstance(article, str) and article.strip():
        article_sentences = sent_tokenize(article)
        sentences_left.extend(article_sentences)

df_sentences_left= pd.DataFrame({'sentences': sentences_left})

sentences_right = []
for article in right_list:
    if isinstance(article, str) and article.strip():
        article_sentences = sent_tokenize(article)
        sentences_right.extend(article_sentences)

df_sentences_right= pd.DataFrame({'sentences': sentences_right})

# we create a list of individual sentences
sentences_left = df_sentences_left['sentences'].tolist()
sentences_right = df_sentences_right['sentences'].tolist()

In [None]:
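# MEMORY ERROR HERE: counting every word pair raised a MemoryError (see the output below)
# note: tokenized_right and tokenized_left are assumed to hold the tokenized sentences here; that step is not shown above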
# edgelist dictionary for right
edgelist_right = defaultdict(int)
for text in tokenized_right:
    for node1, node2 in combinations(text, 2):
        if node1 != node2:  # Skip pairs with the same node
            edgelist_right[(node1, node2)] += 1

# left
edgelist_left = defaultdict(int)
for text in tokenized_left:
    for node1, node2 in combinations(text, 2):
        if node1 != node2:  # Skip pairs with the same node
            edgelist_left[(node1, node2)] += 1

MemoryError: 
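
A text with n tokens produces n(n-1)/2 pairs, so long texts make the pair dictionary grow very quickly. In the data preview at the top, `maintext` contains no full stops while `maintext_fullstop` does, so sentence splitting on `maintext` may return each article as a single very long "sentence". One common way to bound memory is to count only pairs that fall within a fixed window of words. The sketch below illustrates that idea; it changes the definition of co-occurrence and is not the method used in this notebook. The function name `window_cooccurrences` and the window size are assumptions.

```python
from collections import Counter

def window_cooccurrences(tokens, window=5):
    """Count unordered pairs of distinct words at most `window - 1` positions apart."""
    counts = Counter()
    for i, left_word in enumerate(tokens):
        # only look ahead within the window instead of across the whole text
        for right_word in tokens[i + 1 : i + window]:
            if left_word != right_word:
                counts[tuple(sorted((left_word, right_word)))] += 1
    return counts

# First words of one article from the preview above
toy_sentence = "late threat leadership easily speaker house".split()
counts = window_cooccurrences(toy_sentence, window=3)
print(len(counts))
```

With a window of 3, the six-word example yields 9 pairs instead of the 15 that `combinations` would produce, and the saving grows with text length.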