## **Assignment 2**  
**02805 Social Graphs & Interactions**  

**Group 13** 
- Anna Bøgevang Ekner (s193396)
- Morten Møller Christensen (s204258)


**Feedback from Assignment 1**
- Do not rescale the axes of the distributions (especially not the x-axis), because this distorts the distributions and leads to wrong conclusions. According to the feedback, our in- and out-degree distributions look unusual and not as expected. 
- Instead of using `plt.hist`, bin the data as described in the Week 2 notebook and then plot with `plt.bar`. When comparing distributions, make sure the data is binned in the same way. 
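
The second feedback point can be sketched as follows. This is only a minimal illustration under our own assumptions: the helper names `shared_bins` and `binned_counts` and the toy degree lists are hypothetical, and unit-width bins centred on the integers are assumed, which may differ from the exact binning prescribed in the Week 2 notebook.

```python
import numpy as np

def shared_bins(*samples):
    """Return unit-width bin edges, centred on the integers, that span all samples."""
    lo = min(int(np.min(s)) for s in samples)
    hi = max(int(np.max(s)) for s in samples)
    # Edges at lo-0.5, lo+0.5, ..., hi+0.5 give one bin per integer value
    return np.arange(lo, hi + 2) - 0.5

def binned_counts(sample, edges):
    """Count how many values of the sample fall into each bin."""
    counts, _ = np.histogram(sample, bins=edges)
    return counts

# Toy data (made up for illustration only)
in_degrees = [0, 1, 1, 2, 5]
out_degrees = [1, 1, 3, 3, 3]

# Both distributions are binned with the SAME edges, so they are comparable
edges = shared_bins(in_degrees, out_degrees)
in_counts = binned_counts(in_degrees, edges)
out_counts = binned_counts(out_degrees, edges)
```

The counts can then be drawn without rescaling the axes, for example with `plt.bar(edges[:-1], in_counts, width=np.diff(edges), align='edge')`.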

In [6]:
import warnings
warnings.filterwarnings('ignore')
import random
import os
import re
import json
import scipy
import urllib.request
import urllib.parse
import numpy as np
import networkx as nx
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from tqdm import tqdm
from itertools import count

sns.set_style('darkgrid')
sns.set(font_scale=1.)

# **Part 1: Genres and communities and plotting**

### **Loading and pre-processing the undirected graph**

In [27]:
import ast

def load_data():
    """
    Load the artist genre data and the undirected country musician graph

    Returns
        artist_genres_dict: dictionary of artist genres (dict)
        country_performer_undir_graph: undirected graph of country musicians (nx.Graph)
    """
    # Load the artist genre data (the with-block closes the file afterwards)
    with open('artists_genres_dictionary.txt', 'r') as f:
        raw_data = f.read()

    # Convert into dictionary; ast.literal_eval parses the dictionary literal
    # without executing arbitrary code, unlike eval
    artist_genres_dict = ast.literal_eval(raw_data)

    # Load the undirected graph from the json file
    with open('undirected_graph.txt', 'r', encoding='utf-8') as f:
        undirected_graph_json = json.load(f)

    # The file holds a JSON-encoded string, so decode it a second time
    # to obtain the node-link data
    undirected_graph_data = json.loads(undirected_graph_json)
    country_performer_undir_graph = nx.node_link_graph(undirected_graph_data)

    return artist_genres_dict, country_performer_undir_graph

artist_genres_dict, country_performer_undir_graph = load_data()

In [35]:
def preprocess_graph(artist_genres_dict, country_performer_undir_graph):
    """ 
    Preprocess the graph by keeping only the nodes that have genre information, 
    and also remove the article_length attribute from all nodes.

    Args
        artist_genres_dict: dictionary of artist genres (dict)
        country_performer_undir_graph: undirected graph of country musicians (nx.Graph)

    Returns
        artist_undirected_graph: preprocessed undirected graph of country musicians (nx.Graph)
    """
    # Work on a copy so that the input graph is left unchanged
    artist_undirected_graph = country_performer_undir_graph.copy()

    for node in list(country_performer_undir_graph.nodes()):

        # Remove the article_length attribute (nothing happens if it is absent)
        artist_undirected_graph.nodes[node].pop('article_length', None)

        # Keep only the nodes that have genre information
        # (node names use underscores where the dictionary keys use spaces)
        if node.replace('_', ' ') not in artist_genres_dict:
            artist_undirected_graph.remove_node(node)

    return artist_undirected_graph

artist_undirected_graph_preprocessed = preprocess_graph(artist_genres_dict, country_performer_undir_graph)

print(f'Artist genre dictionary')
print(f'\tNumber of artists: {len(artist_genres_dict)}\n')

print(f'Artist graph (undirected)')
print(f'\tNumber of nodes: {artist_undirected_graph_preprocessed.number_of_nodes()}')
print(f'\tNumber of edges: {artist_undirected_graph_preprocessed.number_of_edges()}')

Artist genre dictionary
	Number of artists: 1833

Artist graph (undirected)
	Number of nodes: 1833
	Number of edges: 13943


### **Exercise 1.1: Genres and modularity**

Write about genres and modularity. 

See Network Science, Section 9.4, and reference it with: [[Network Science, Section 9.4]](https://networksciencebook.com/chapter/9#modularity) 

Mention Equation (9.12): 

\begin{equation}
M = \sum\limits^{n_c}_{c=1} \left[ \dfrac{L_c}{L} - \left(\dfrac{k_c}{2L}\right)^2 \right], \quad (\text{Equation } 9.12)
\end{equation}

where $L$ is the total number of links in the network and $n_c$ is the number of communities in the partition. For each community $c$, $L_c$ is the number of links between nodes within $c$, and $k_c$ is the total degree of the nodes in $c$. 
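
As a small worked example of Equation (9.12), the sketch below evaluates the sum by hand for a toy network. The toy edge list, the partition and the helper name `community_term` are our own illustrative assumptions and are not part of the artist graph.

```python
# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single link 2-3
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Partition under test: one community per triangle
partition = [{0, 1, 2}, {3, 4, 5}]

L = len(edges)  # Total number of links, here 7

def community_term(community):
    """Contribution of one community to the sum in Equation (9.12)."""
    # L_c: links with both endpoints inside the community
    L_c = sum(1 for u, v in edges if u in community and v in community)
    # k_c: total degree, i.e. every link end that touches the community
    k_c = sum((u in community) + (v in community) for u, v in edges)
    return L_c / L - (k_c / (2 * L)) ** 2

M = sum(community_term(c) for c in partition)
print(f'M = {M:.4f}')  # M = 0.3571
```

Each triangle has $L_c = 3$ and $k_c = 7$, so $M = 2 \left[ 3/7 - (7/14)^2 \right] = 5/14 \approx 0.357$. Placing all six nodes in a single community instead gives $M = 1 - 1 = 0$.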

### **Exercise 1.2: Detecting communities and the value of modularity in comparison to the genres**

Detect the communities, discuss the value of modularity in comparison to the genres. 

In [91]:
def add_genre_attribute(artist_genres_dict, artist_undirected_graph, selection_method = 'first'):
    """
    Add the genre attribute to the nodes in the graph, either by selecting the 
    first genre in the list or by selecting the first genre that is not 'country'.

    Args
        artist_genres_dict: dictionary of artist genres (dict)
        artist_undirected_graph: undirected graph of country musicians (nx.Graph)
        selection_method: method to select the genre, either 'first' or 'not_country' (str)

    Returns
        artist_undirected_graph: the same graph object, modified in place, with the genre attribute set on every node (nx.Graph)
    """

    for node in list(artist_undirected_graph.nodes()):

        # Get the genres of the artist
        genres = artist_genres_dict[node.replace('_', ' ')]

        # Select the genre based on the selection method
        if selection_method == 'first':
            genre = genres[0]

        elif selection_method == 'not_country':

            # Select the first genre that is not 'country'; fall back to the
            # first genre if the artist has no other genre
            genre = next((g for g in genres if g != 'country'), genres[0])

        else:
            raise ValueError(f'Unknown selection method: {selection_method}')
        
        # Add the genre attribute to the node
        artist_undirected_graph.nodes[node]['genre'] = genre

    return artist_undirected_graph

def find_communities(artist_undirected_graph):
    """
    Find the communities in the graph, i.e. groups of nodes that share the same genre.
    Each node is characterized by its 'genre' node attribute, as set by add_genre_attribute.

    Args
        artist_undirected_graph: undirected graph of country musicians (nx.Graph)

    Returns
        communities: dictionary of communities, where keys are genres and values are 
                     nodes characterized by that genre (dict)
    """

    # Get all genres present in the graph
    all_genres = set(nx.get_node_attributes(artist_undirected_graph, 'genre').values())

    # Dictionary for the communities with each genre as key
    communities = {genre: [] for genre in all_genres}

    for node in artist_undirected_graph.nodes():

        # Get genre of the node
        genre = artist_undirected_graph.nodes[node]['genre']

        # Add the node to the corresponding community
        communities[genre].append(node)

    return communities

def compute_modularity(artist_undirected_graph, communities):
    """
    Compute the modularity of the graph, i.e. the strength of the partition of the graph into communities.

    Args
        artist_undirected_graph: undirected graph of country musicians (nx.Graph)
        communities: dictionary of communities (dict)

    Returns
        modularity: modularity of the graph (float)
    """

    L = artist_undirected_graph.number_of_edges()  # Total number of links in the graph
    n_c = len(communities)                         # Number of communities
    community_modularities = np.zeros(n_c)         # Array for storing the modularity values of each community
    
    for i, c in enumerate(communities.keys()):
        
        # Number of links in the community
        L_c = artist_undirected_graph.subgraph(communities[c]).number_of_edges()

        # Sum of the degrees of the nodes in the community
        k_c = np.sum([artist_undirected_graph.degree[node] for node in communities[c]])

        # Modularity value for the community
        community_modularities[i] = L_c / L - (k_c / (2 * L))**2

    # Modularity of the partition
    modularity = np.sum(community_modularities)

    return modularity
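
One way to sanity-check a hand-written modularity function is to compare it with the `modularity` function from `networkx.algorithms.community` on a small graph. The sketch below is only an illustration: it re-implements Equation (9.12) in a compact helper (the hypothetical `modularity_eq_9_12`) so that the example is self-contained, and it uses Zachary's karate club graph as a stand-in for the artist graph.

```python
import networkx as nx
from networkx.algorithms.community import modularity as nx_modularity

def modularity_eq_9_12(graph, communities):
    """Equation (9.12) for a dict mapping community label -> list of nodes."""
    L = graph.number_of_edges()
    total = 0.0
    for nodes in communities.values():
        # Links inside the community and total degree of its nodes
        L_c = graph.subgraph(nodes).number_of_edges()
        k_c = sum(degree for _, degree in graph.degree(nodes))
        total += L_c / L - (k_c / (2 * L)) ** 2
    return total

# Stand-in graph: Zachary's karate club, partitioned by the 'club' attribute
G = nx.karate_club_graph()
clubs = {}
for node, club in G.nodes(data='club'):
    clubs.setdefault(club, []).append(node)

ours = modularity_eq_9_12(G, clubs)

# weight=None makes NetworkX count every link as 1, as Equation (9.12) does
reference = nx_modularity(G, [set(nodes) for nodes in clubs.values()], weight=None)

print(f'Equation (9.12): {ours:.4f}, NetworkX: {reference:.4f}')
```

If the two values agree up to floating-point error, the hand-written implementation is consistent with the library on this graph.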

**Communities and modularity when each artist is characterized by their first genre**

In [92]:
# Add the genre attribute to the nodes in the graph (selecting the first genre)
artist_undirected_graph = add_genre_attribute(artist_genres_dict, artist_undirected_graph_preprocessed, selection_method = 'first')

# Find communities in the graph based on genre
communities = find_communities(artist_undirected_graph)

# Compute the modularity of the partition
modularity = compute_modularity(artist_undirected_graph, communities)

print(f'Partition of the artist graph based on genre')
print(f'\tSelection method: first genre')
print(f'\tNumber of communities: {len(communities)} communities')
print(f'\tLargest community: {max([len(communities[genre]) for genre in communities])} artists')
print(f'\tSmallest community: {min([len(communities[genre]) for genre in communities])} artist')
print(f'\tModularity: {modularity:.4f}')

Partition of the artist graph based on genre
	Selection method: first genre
	Number of communities: 112 communities
	Largest community: 1221 artists
	Smallest community: 1 artist
	Modularity: 0.0713


**Communities and modularity when each artist is characterized by their first genre that is not `country`**

In [93]:
# Add the genre attribute to the nodes in the graph (selecting the first genre that is not 'country')
artist_undirected_graph = add_genre_attribute(artist_genres_dict, artist_undirected_graph_preprocessed, selection_method = 'not_country')

# Find communities in the graph based on genre
communities = find_communities(artist_undirected_graph)

# Compute the modularity of the partition
modularity = compute_modularity(artist_undirected_graph, communities)

print(f'Partition of the artist graph based on genre')
print(f'\tSelection method: first genre that is not "country"')
print(f'\tNumber of communities: {len(communities)} communities')
print(f'\tLargest community: {max([len(communities[genre]) for genre in communities])} artists')
print(f'\tSmallest community: {min([len(communities[genre]) for genre in communities])} artist')
print(f'\tModularity: {modularity:.4f}')

Partition of the artist graph based on genre
	Selection method: first genre that is not "country"
	Number of communities: 141 communities
	Largest community: 765 artists
	Smallest community: 1 artist
	Modularity: 0.0842


### **Exercise 1.3: Calculating the matrix $D$ and discussing the findings**

### **Exercise 1.4: Plotting the communities**

# **Part 2: TF-IDF to understand genres and communities**

### **Exercise 2.1: The concept of TF-IDF**

### **Exercise 2.2: Calculating and visualizing TF-IDF for the genres and communities**

### **Exercise 2.3: Discussing the difference between the word-clouds between genres and communities**

# **Part 3: Sentiment of the artists and communities**

### **Exercise 3.1: Sentiment of the artist pages**

### **Exercise 3.2: Sentiment of the largest communities**