# Clustering
Here we remove from the dataset the news channels that are not in English and, more generally, not based in the US.

Then we try to find communities within our filtered dataset.

To detect the communities we use the Louvain algorithm.

In [1]:
import numpy as np
import networkx as nx
import networkx.algorithms.community as nx_comm
import pandas as pd


## Load the graph generated previously

In [2]:
df_edges = pd.read_csv("data/graph.csv", sep=';')
display(df_edges.head())


Unnamed: 0,source,target,weight
0,0,1,2
1,0,2,1
2,0,3,5
3,0,4,3
4,0,12,1


In [3]:
# convert to networkx graph
G = nx.from_pandas_edgelist(df_edges, edge_attr=True)
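
With `edge_attr=True`, every column other than `source` and `target` is copied onto the edges, so the index column saved in the CSV (`Unnamed: 0`) also becomes an edge attribute. A minimal sketch on a made-up three-edge table (not the project data) illustrates this, along with a variant that keeps only `weight`:

```python
import networkx as nx
import pandas as pd

# Toy edge list with the same columns as data/graph.csv (values are made up).
toy_edges = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2],
        "source": [0, 0, 1],
        "target": [1, 2, 2],
        "weight": [2, 1, 5],
    }
)

# edge_attr=True attaches every non source/target column to each edge.
g_all = nx.from_pandas_edgelist(toy_edges, edge_attr=True)
attrs_all = sorted(g_all.edges[0, 1])

# Naming the column keeps only the weight.
g_weight = nx.from_pandas_edgelist(toy_edges, edge_attr="weight")
attrs_weight = sorted(g_weight.edges[0, 1])

print(attrs_all)
print(attrs_weight)
```

The extra attribute should not affect the clustering, since `louvain_communities` reads the `weight` attribute by default.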

## Louvain algorithm on the unfiltered data

We chose the parameters so that the algorithm favors larger communities: in networkx, a `resolution` below 1 (here 0.1) has this effect.

In [4]:
louvain_partitions = nx_comm.louvain_communities(G, resolution=1e-1, threshold=1e-100, seed=1)
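
To get a feel for the `resolution` parameter, here is a small sketch on a toy graph (a ring of cliques, not our channel graph). It only compares how many communities come out at two resolution values; the exact counts depend on the networkx version and the seed, so none are stated here.

```python
import networkx as nx
import networkx.algorithms.community as nx_comm

# Toy graph: 8 cliques of 5 nodes joined in a ring (not the project data).
toy = nx.ring_of_cliques(8, 5)

# A lower resolution penalises large communities less, so groups tend to merge.
coarse = nx_comm.louvain_communities(toy, resolution=0.1, seed=1)
fine = nx_comm.louvain_communities(toy, resolution=1.0, seed=1)

print(len(coarse), "communities at resolution=0.1")
print(len(fine), "communities at resolution=1.0")
```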

In [5]:
for idx, community in enumerate(louvain_partitions):
    print(f"Community number {idx} has {len(community)} members")

Community number 0 has 83 members
Community number 1 has 3 members
Community number 2 has 243 members
Community number 3 has 2 members
Community number 4 has 15 members
Community number 5 has 6 members
Community number 6 has 2 members
Community number 7 has 2 members
Community number 8 has 2 members
Community number 9 has 3 members
Community number 10 has 5 members
Community number 11 has 2 members


We now print the members of the two largest communities; the others are omitted for brevity. Since we replaced the channel id strings with our own channel numbers, finding a channel takes two steps: look up the channel number in the channels.csv file to get the corresponding channel id, then open https://www.youtube.com/channel/"channel_id".

The first community contains channels such as IndiaTV or ABP News, which are outside the scope of what we want to do.
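
The lookup from channel number to URL could be scripted roughly as below. This is only a sketch: the layout of channels.csv is not shown in this notebook, so the column names `channel_num` and `channel_id`, the `;` separator, and the ids themselves are placeholders.

```python
import io

import pandas as pd

# Stand-in for channels.csv; column names and ids are hypothetical.
channels_csv = io.StringIO(
    "channel_num;channel_id\n"
    "6;UC_placeholder_id_006\n"
    "7;UC_placeholder_id_007\n"
)
channels = pd.read_csv(channels_csv, sep=";").set_index("channel_num")


def channel_url(channel_num: int) -> str:
    """Return the YouTube URL for one of our channel numbers."""
    channel_id = channels.loc[channel_num, "channel_id"]
    return f"https://www.youtube.com/channel/{channel_id}"


print(channel_url(6))
```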

In [6]:
print(louvain_partitions[0])

{0, 256, 1, 3, 4, 2, 776, 12, 13, 272, 1040, 18, 19, 535, 25, 28, 30, 34, 36, 292, 39, 43, 44, 45, 47, 303, 51, 54, 316, 60, 317, 63, 65, 70, 327, 329, 79, 595, 84, 86, 89, 94, 98, 99, 100, 357, 359, 104, 105, 112, 113, 133, 134, 395, 654, 146, 149, 150, 156, 159, 418, 420, 170, 172, 686, 174, 178, 440, 702, 447, 451, 454, 456, 461, 206, 464, 470, 216, 220, 991, 236, 237, 254}


Community number 2 contains what we are interested in. Channel number 6 is CNN, and all the channels we have checked (about 60 out of 243) are in English.

For the next part we will use these channels for our graph.

In [7]:
print(louvain_partitions[2])

# convert the set to a list before building the DataFrame
filtered_channels = pd.DataFrame(list(louvain_partitions[2]))

{2049, 1027, 517, 6, 7, 8, 9, 10, 519, 11, 2061, 1039, 16, 15, 530, 529, 21, 534, 23, 539, 29, 31, 32, 546, 37, 2086, 554, 53, 1077, 56, 1084, 61, 62, 64, 576, 1602, 67, 582, 583, 1611, 76, 2124, 78, 1107, 1110, 87, 88, 90, 92, 93, 96, 97, 611, 612, 617, 1133, 109, 1134, 622, 626, 627, 1141, 629, 120, 633, 1146, 122, 121, 125, 634, 639, 645, 137, 651, 1163, 141, 142, 140, 1680, 145, 144, 147, 148, 661, 1165, 152, 1177, 153, 155, 1180, 672, 1185, 165, 1190, 169, 2218, 173, 176, 691, 692, 179, 1203, 1205, 187, 701, 191, 192, 195, 1220, 709, 199, 200, 201, 202, 203, 714, 717, 2255, 207, 208, 209, 723, 211, 722, 212, 215, 725, 221, 1249, 741, 229, 744, 238, 240, 754, 1778, 257, 770, 259, 264, 265, 266, 267, 1800, 1806, 273, 276, 1813, 278, 280, 794, 803, 293, 1833, 813, 307, 308, 309, 310, 824, 313, 1337, 315, 314, 318, 1349, 328, 331, 333, 1360, 339, 852, 341, 342, 855, 340, 859, 1377, 1379, 867, 360, 365, 1390, 366, 878, 1392, 370, 374, 376, 889, 378, 1920, 1408, 1413, 391, 904, 903, 394

In [8]:
filtered_channels = filtered_channels[0].sort_values(ascending=True)


In [9]:
display(filtered_channels)

3         6
4         7
5         8
6         9
7        10
       ... 
10     2061
25     2086
41     2124
95     2218
117    2255
Name: 0, Length: 243, dtype: int64

In [10]:
filtered_channels.to_csv("data/louvain_filtered_channels.csv", sep=';', index=False)

We filter our graph to keep only the channels that interest us (community number 2).

In [11]:
def filter_function(n):
    return n in louvain_partitions[2]

sub_G = nx.subgraph_view(G, filter_node=filter_function)
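
Note that `nx.subgraph_view` returns a read-only view of `G` rather than a new graph. The sketch below, on a toy path graph (not our data), shows the difference between the view and an independent copy made with `.copy()`:

```python
import networkx as nx

# Toy graph: a path 0-1-2-3-4 (not the project data).
toy = nx.path_graph(5)
keep = {0, 1, 2}

# Read-only view restricted to the kept nodes; no data is copied.
view = nx.subgraph_view(toy, filter_node=lambda n: n in keep)

# Independent graph holding the same nodes and edges as the view.
standalone = view.copy()

print(sorted(view.nodes))
print(sorted(view.edges))

# The view follows later changes to the original graph, the copy does not.
toy.add_edge(0, 2)
print(view.has_edge(0, 2), standalone.has_edge(0, 2))
```

A copy may be preferable if the filtered graph is going to be reused many times, since a view re-applies the filter on each access.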

We run the Louvain algorithm on our filtered channels and print how many communities it detects.

In [12]:
louvain_communities = nx_comm.louvain_communities(sub_G, resolution=0.9, threshold=1e-7, seed=1)
print("We have detected {num} communities".format(num=len(louvain_communities)))

We have detected 7 communities


In [13]:
for idx, community in enumerate(louvain_communities):
    print(f"Community number {idx} has {len(community)} members")

Community number 0 has 52 members
Community number 1 has 43 members
Community number 2 has 42 members
Community number 3 has 4 members
Community number 4 has 31 members
Community number 5 has 52 members
Community number 6 has 19 members
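
For later analysis it can be convenient to keep the community label next to each channel number instead of only the sizes. One possible way, sketched on a made-up partition in the same format as the output of `louvain_communities` (a list of sets); the column names `channel` and `community` are our own choice for illustration:

```python
import pandas as pd

# Made-up partition in the same format as louvain_communities' output.
toy_communities = [{6, 8, 257}, {9, 264}, {200}]

# One row per channel, labelled with the index of its community.
membership = pd.DataFrame(
    [
        {"channel": channel, "community": idx}
        for idx, members in enumerate(toy_communities)
        for channel in sorted(members)
    ]
)

print(membership.groupby("community").size().to_dict())
```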


## Analysis of the communities

We print the first channel numbers of each community and list, for each community, the names of the best-known or most defining channels we have found.
Note that the order in which the communities appear depends on the random choices made by the Louvain algorithm. In our case, changing the seed changes the order in which the communities appear, but not the result.
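
Because the order of the communities is arbitrary, two runs are easier to compare in an order-independent form. A minimal sketch with made-up partitions (the helper name `canonical` is ours, not part of networkx):

```python
def canonical(partition):
    """Order-independent form of a partition: a set of frozensets."""
    return {frozenset(community) for community in partition}


# Two made-up partitions with the same groups listed in a different order.
run_a = [{1, 2, 3}, {4, 5}, {6}]
run_b = [{6}, {3, 2, 1}, {5, 4}]
# A genuinely different grouping.
run_c = [{1, 2}, {3, 4, 5}, {6}]

print(canonical(run_a) == canonical(run_b))  # True
print(canonical(run_a) == canonical(run_c))  # False
```

The same helper can be applied to the outputs of `louvain_communities` obtained with two different seeds to check whether only the order differs.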

### Leaning right

We have found these channels: Philip DeFranco, Drama Alert, IntMensOrg.

We found that these channels were not the most "serious" news channels, even though some of them are very well known. After some research we found that they were leaning right and that IntMensOrg is probably a misogynistic channel.

In [14]:
pd.Series(list(louvain_communities[0])).head()

0     519
1     903
2     265
3      11
4    1165
dtype: int64

### Clearly left

We have found these channels: CNN, Vox, MSNBC, The Young Turks, which we have found to be clearly polarised to the left. These were more "serious" news channels.

In [15]:
pd.Series(list(louvain_communities[1])).head()

0    257
1    770
2    517
3      6
4      8
dtype: int64

### Far Right

In this community we have found Inside Edition, Fox News, Daily Wire, Rebel News. We have found these channels to be far right. Note that Rebel News is a Canadian channel, yet it is far right as well. This shows that far-right communities cross borders, but also that we might have to clean our dataset a bit more.

In [16]:
pd.Series(list(louvain_communities[2])).head()

0     645
1    1800
2     264
3     904
4       9
dtype: int64

### A few unwanted channels

In this community we have found: Pat Condell (a sort of conspiracy theorist), Sky News Australia, National Post. These channels are to be removed. Most of them are not well known and are not in the US: they are Canadian, Australian, etc.

In [17]:
pd.Series(list(louvain_communities[3])).head()

0    1337
1    1524
2     617
3     407
dtype: int64

### Business oriented?

In this community we have found: Today, China Uncensored, Fox Business. Further investigation is needed; maybe we will reach more meaningful conclusions once we use the whole dataset. This community might be business oriented due to its interest in China (China Uncensored covers what is happening in China, and Fox Business talks about China).

In [18]:
pd.Series(list(louvain_communities[4])).head()

0    1920
1     195
2     199
3     203
4    2124
dtype: int64

### Leaning Left

In this group we have found: Truly, ABC News, BBC News, True Crime Daily, Business Insider, New York Times. These channels are all very well known. On AllSides they are classified as left, leaning left, or center. More data might classify this better; if not, we will have to investigate.

In [19]:
pd.Series(list(louvain_communities[5])).head()

0    2049
1     259
2    1413
3    1146
4     391
dtype: int64

### International Channels

In this community we have found: Al Jazeera (Qatari-owned news channel), African Diaspora News, France 24 English, Visual Politik EN. All these channels are in English and might talk about what is happening in the US, but they might be outside the scope of what we want to see.

In [20]:
pd.Series(list(louvain_communities[6])).head()

0    1408
1    1027
2     200
3     651
4     396
dtype: int64

In [21]:
df_tosave = nx.to_pandas_edgelist(sub_G)
df_tosave.to_csv('data/louvain_filtered_graph.csv', sep=';', index=False)