# Spotify Artist Feature Collaboration Network Analysis
## Part 1

---

## Introduction

In this notebook, I analyze the Spotify Artist Feature Collaboration Network. The dataset contains artist data for approximately 20,000 artists whose songs reached the Spotify weekly charts, plus approximately 136,000 additional artists who had at least one feature with one of the chart artists. Together this yields a network with more than 150,000 musicians as nodes and more than 300,000 collaboration edges between them.

---

## About the Author

**Name:** Daniele Borghesi

**Affiliation:** University of Pisa & Technical University of Valencia

**Contact:** dborghe@etsinf.upv.es

**GitHub:** https://github.com/danieleborghe

**LinkedIn:** https://www.linkedin.com/in/danieleborghesi/


---

## 1 Data Download and Cleaning


In [5]:
import pandas as pd

### 1.1 Data Download
- **Dataset:** Spotify Artist Feature Collaboration Network
- **Source:** https://www.kaggle.com/datasets/jfreyberg/spotify-artist-feature-collaboration-network?resource=download
- **Description:** The dataset consists of two CSV files:
  - **edges.csv:** Contains the edges (features) between artists listed in nodes.csv. Edges are undirected and stored only once, with the two IDs sorted so that id_0 < id_1 in alphabetical order; a collaboration between artist A and artist B is stored as id_0:A, id_1:B. Only the features of the original 20k seed artists were scraped, so the dataset contains no edges between two non-seed artists, even where such collaborations exist in reality.
  - **nodes.csv:** Contains artist information scraped from the Spotify API and kworb.net.
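
The storage convention described for edges.csv can be checked directly. The following is a minimal sketch on a small hypothetical edge list; `toy_edges` and `check_edge_convention` are illustrative names, not part of the dataset or of this notebook. It relies on Python's default string comparison, which may differ from the ordering the dataset author applied.

```python
import pandas as pd

def check_edge_convention(edges: pd.DataFrame) -> dict:
    """Count rows that break the 'stored once, id_0 < id_1' convention."""
    # Self-loops: an artist paired with themselves
    self_loops = int((edges["id_0"] == edges["id_1"]).sum())
    # Rows where the pair is not sorted (plain string comparison)
    unsorted_rows = int((edges["id_0"] > edges["id_1"]).sum())
    # Pairs stored more than once, regardless of direction
    canonical = pd.Series([tuple(sorted(pair)) for pair in zip(edges["id_0"], edges["id_1"])])
    repeated_pairs = int(canonical.duplicated().sum())
    return {
        "self_loops": self_loops,
        "unsorted_rows": unsorted_rows,
        "repeated_pairs": repeated_pairs,
    }

# Hypothetical toy data: ("B", "A") is unsorted and repeats the pair ("A", "B")
toy_edges = pd.DataFrame({"id_0": ["A", "B", "C"], "id_1": ["B", "A", "D"]})
report = check_edge_convention(toy_edges)
print(report)
```

Applied to `edges_df`, all three counts should be zero if the file follows the description.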

In [6]:
edges_df = pd.read_csv("data/edges.csv")
nodes_df = pd.read_csv("data/nodes.csv")

In [7]:
print("Number of nodes (artists): ", len(nodes_df))
nodes_df.head()

Number of nodes (artists):  156422


| | spotify_id | name | followers | popularity | genres | chart_hits |
|---|---|---|---|---|---|---|
| 0 | 48WvrUGoijadXXCsGocwM4 | Byklubben | 1738.0 | 24 | ['nordic house', 'russelater'] | ['no (3)'] |
| 1 | 4lDiJcOJ2GLCK6p9q5BgfK | Kontra K | 1999676.0 | 72 | ['christlicher rap', 'german hip hop'] | ['at (44)', 'de (111)', 'lu (22)', 'ch (31)', ... |
| 2 | 652XIvIBNGg3C0KIGEJWit | Maxim | 34596.0 | 36 | [] | ['de (1)'] |
| 3 | 3dXC1YPbnQPsfHPVkm1ipj | Christopher Martin | 249233.0 | 52 | ['dancehall', 'lovers rock', 'modern reggae', ... | ['at (1)', 'de (1)'] |
| 4 | 74terC9ol9zMo8rfzhSOiG | Jakob Hellman | 21193.0 | 39 | ['classic swedish pop', 'norrbotten indie', 's... | ['se (6)'] |


In [8]:
print("Number of edges (collaborations between artists): ", len(edges_df))
edges_df.head()

Number of edges (collaborations between artists):  300386


| | id_0 | id_1 |
|---|---|---|
| 0 | 76M2Ekj8bG8W7X2nbx2CpF | 7sfl4Xt5KmfyDs2T3SVSMK |
| 1 | 0hk4xVujcyOr6USD95wcWb | 7Do8se3ZoaVqUt3woqqSrD |
| 2 | 38jpuy3yt3QIxQ8Fn1HTeJ | 4csQIMQm6vI2A2SCVDuM2z |
| 3 | 6PvcxssrQ0QaJVaBWHD07l | 6UCQYrcJ6wab6gnQ89OJFh |
| 4 | 2R1QrQqWuw3IjoP5dXRFjt | 4mk1ScvOUkuQzzCZpT6bc0 |


### 1.2 Context Explanation

The Spotify Artist Feature Collaboration Network dataset provides valuable insights into the collaborative relationships between artists on the Spotify platform. It comprises artist data for approximately 20,000 artists whose songs have made it to the Spotify weekly charts, along with additional data for about 136,000 artists who have had at least one feature with one of the chart artists. This dataset enables the construction of a comprehensive network representation of the collaboration patterns within the music industry.

Understanding the structure and dynamics of this network can offer insights into various aspects, including the influence of artists, the prevalence of collaborations, and the emergence of musical trends. By analyzing the network properties and centrality measures, we can identify key players, influential nodes, and community structures, shedding light on the underlying mechanisms driving collaboration and success in the music industry.

**Data Dictionary for nodes.csv:**

- **spotify_id**: ID of the artist on Spotify.
- **name**: Name of the artist.
- **followers**: Number of followers of the artist on Spotify.
- **popularity**: Popularity score of the artist on Spotify.
- **genres**: List of genres associated with the artist, obtained from the Spotify API.
- **chart_hits**: List showing the number of Spotify chart hits in different countries, obtained from kworb.net.
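
The preview of nodes.csv suggests that `genres` and `chart_hits` are stored as strings that look like Python lists, not as actual lists. Assuming that is the case, a minimal sketch for turning them into usable values could look like this; `parse_list_field` and `parse_chart_hits` are hypothetical helpers, and the `'cc (n)'` pattern is inferred from the preview rows only.

```python
import ast
import re

def parse_list_field(value):
    """Turn a stringified list such as "['dancehall', 'lovers rock']" into a Python list."""
    if not isinstance(value, str):   # missing values (NaN) are floats, not strings
        return []
    return ast.literal_eval(value)

def parse_chart_hits(value):
    """Map country code -> number of chart hits, e.g. "['at (1)', 'de (1)']" -> {'at': 1, 'de': 1}."""
    hits = {}
    for entry in parse_list_field(value):
        match = re.fullmatch(r"(\S+) \((\d+)\)", entry)
        if match:                    # skip entries that do not follow the 'cc (n)' pattern
            hits[match.group(1)] = int(match.group(2))
    return hits

# Values copied from the preview rows above
genres = parse_list_field("['nordic house', 'russelater']")
chart_hits = parse_chart_hits("['at (1)', 'de (1)']")
no_hits = parse_chart_hits(float("nan"))
print(genres, chart_hits, no_hits)
```

With pandas, the same functions could be applied column-wise, for example `nodes_df['chart_hits'].apply(parse_chart_hits)`.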

**Data Dictionary for edges.csv:**

- **id_0**: ID of the artist on Spotify for the first artist involved in a collaboration (edge).
- **id_1**: ID of the artist on Spotify for the second artist involved in a collaboration (edge).

### 1.3 Data Cleaning
**Preprocessing Steps:**
- Handling missing values
- Removing duplicates
- Standardizing data formats, if necessary

In [9]:
# Check for missing values in nodes_df
nodes_missing_values = nodes_df.isnull().sum()
print("Missing Values in Nodes DataFrame:")
print(nodes_missing_values)

print()

# Check for missing values in edges_df
edges_missing_values = edges_df.isnull().sum()
print("Missing Values in Edges DataFrame:")
print(edges_missing_values)

Missing Values in Nodes DataFrame:
spotify_id         0
name               4
followers          4
popularity         0
genres             0
chart_hits    136781
dtype: int64

Missing Values in Edges DataFrame:
id_0    0
id_1    0
dtype: int64


In [10]:
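# Replace missing values in 'name' column with 'spotify_id'
# (assign the result back: fillna(inplace=True) on a selected column is unreliable in recent pandas versions)
nodes_df['name'] = nodes_df['name'].fillna(nodes_df['spotify_id'])

# Replace missing values in 'followers' column with 0
nodes_df['followers'] = nodes_df['followers'].fillna(0)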

# Check again for missing values in nodes_df
nodes_missing_values = nodes_df.isnull().sum()
print("Missing Values in Nodes DataFrame after the replacement:")
print(nodes_missing_values)

Missing Values in Nodes DataFrame after the replacement:
spotify_id         0
name               0
followers          0
popularity         0
genres             0
chart_hits    136781
dtype: int64


In [11]:
# Check for duplicates
duplicate_rows = nodes_df[nodes_df.duplicated()]

# Print duplicate rows if any
if not duplicate_rows.empty:
    print("Duplicate rows found in Nodes DataFrame:")
    print(duplicate_rows)
else:
    print("No duplicate rows found in Nodes DataFrame.")

# Check for duplicates
duplicate_rows = edges_df[edges_df.duplicated()]

# Print duplicate rows if any
if not duplicate_rows.empty:
    print("Duplicate rows found in Edges DataFrame:")
    print(duplicate_rows)
else:
    print("No duplicate rows found in Edges DataFrame.")

No duplicate rows found in Nodes DataFrame.


No duplicate rows found in Edges DataFrame.
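
`DataFrame.duplicated()` without arguments only flags rows that are identical in every column. Two rows for the same artist with different follower counts would pass that check. The sketch below shows a key-based variant on hypothetical toy data; the choice to keep the first row per ID is an assumption for illustration, not something this notebook does.

```python
import pandas as pd

# Hypothetical toy frame: the same spotify_id appears twice with different follower counts
toy_nodes = pd.DataFrame({
    "spotify_id": ["a1", "a1", "b2"],
    "name": ["Artist A", "Artist A", "Artist B"],
    "followers": [100.0, 120.0, 50.0],
})

# Full-row check: finds nothing, because the two 'a1' rows differ in 'followers'
full_row_duplicates = int(toy_nodes.duplicated().sum())

# Key-based check: flags every repeat of an ID after its first occurrence
id_duplicates = int(toy_nodes.duplicated(subset="spotify_id").sum())

# One possible resolution (an assumption): keep the first row per ID
deduplicated = toy_nodes.drop_duplicates(subset="spotify_id", keep="first")

print(full_row_duplicates, id_duplicates, len(deduplicated))
```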


In [12]:
# Save the cleaned dataframes
nodes_df.to_csv("data/cleaned_nodes.csv", index=False)
edges_df.to_csv("data/cleaned_edges.csv", index=False)

## 2 Network Characterization

In [13]:
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import random

### 2.1 Extract Giant Component and Explanation
- Identify and extract the giant component (if any) from the network.
- Explain the findings: Compare the number of nodes and edges before and after extracting the giant component. Discuss the significance of the giant component in the context of the network.

In [14]:
# Construct the network from edges.csv
G = nx.from_pandas_edgelist(edges_df, 'id_0', 'id_1')

# Extract the giant component
giant_component = max(nx.connected_components(G), key=len)

# Create a subgraph containing only the giant component
giant_subgraph = G.subgraph(giant_component)

# Display the number of nodes and edges before and after extraction
print("Number of NODES before extraction:", G.number_of_nodes())
print("Number of EDGES before extraction:", G.number_of_edges())
print()
print("Number of NODES in the giant component:", giant_subgraph.number_of_nodes())
print("Number of EDGES in the giant component:", giant_subgraph.number_of_edges())

Number of NODES before extraction: 153327
Number of EDGES before extraction: 300386

Number of NODES in the giant component: 148386
Number of EDGES in the giant component: 296770


The giant component contains 148,386 of the 153,327 nodes (about 96.8%) and 296,770 of the 300,386 edges (about 98.8%), so nearly every artist in the dataset can reach nearly every other through some chain of collaborations. The remaining roughly 4,900 nodes sit in small, isolated components. Part of this connectivity follows from how the data was collected: every non-seed artist was included precisely because of a feature with a chart artist, so each one is attached to the seed set by construction. Within the giant component, a small number of highly connected artists can be expected to act as hubs; Section 3 identifies them with centrality measures.
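
The shares of the network that fall inside the giant component follow directly from the counts printed above. A short sketch of the arithmetic, using only those reported counts (the variable names are illustrative):

```python
# Counts reported by the cell above
nodes_total, edges_total = 153_327, 300_386
nodes_giant, edges_giant = 148_386, 296_770

node_share = nodes_giant / nodes_total
edge_share = edges_giant / edges_total
nodes_outside = nodes_total - nodes_giant
edges_outside = edges_total - edges_giant

print(f"Nodes in giant component: {node_share:.1%}")
print(f"Edges in giant component: {edge_share:.1%}")
print("Nodes outside the giant component:", nodes_outside)
print("Edges outside the giant component:", edges_outside)
```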

### 2.2 Characterize the Network with Relevant Measures
Calculate and discuss the following network measures:
  - Average degree
  - Average shortest path length (calculated on the giant component of a sample of 25% of the original network)
  - Network diameter (calculated on the giant component of a sample of 25% of the original network)
  - Clustering coefficient

Because of the size of the network, computing the average shortest path length and the diameter exactly on the full graph is computationally expensive. As a workaround, 25% of the nodes are sampled at random and both measures are computed on the giant component of the induced subgraph. This keeps the computation tractable, but the results should be read with caution: keeping a quarter of the nodes removes most of the edges, so many short paths are cut, and both measures are likely to overestimate distances in the full network.
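
How much of the network a 25% node sample retains can be estimated before running it. An edge survives in the induced subgraph only if both of its endpoints are sampled, which happens with probability of roughly the sampling fraction squared. The sketch below compares that expectation with the counts reported by the sampling cell in this section; it is a back-of-the-envelope check, not part of the original analysis.

```python
sample_fraction = 0.25

# Probability that both endpoints of an edge are in the sample (approximately)
expected_edge_share = sample_fraction ** 2

# Counts reported in this notebook: full network and one 25% sample
edges_full, edges_sample, nodes_sample = 300_386, 18_596, 38_331
observed_edge_share = edges_sample / edges_full

# Average degree inside the sample (each edge contributes to two nodes)
sample_avg_degree = 2 * edges_sample / nodes_sample

print(f"expected edge share {expected_edge_share:.4f}, observed {observed_edge_share:.4f}")
print(f"average degree in the sample: {sample_avg_degree:.2f}")
```

The sample is therefore much sparser than the original network, which is the main reason its path lengths should not be read as estimates for the full graph.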

In [18]:
# Subsample the full network G (the giant component of the sample is extracted below)

# Randomly select 25% of the nodes for subsampling
sampled_nodes = random.sample(list(G.nodes()), int(G.number_of_nodes() * 0.25))

# Create a subgraph containing only the sampled nodes
subgraph_sampled = G.subgraph(sampled_nodes)

# Extract the giant component
giant_component_sampled = max(nx.connected_components(subgraph_sampled), key=len)

# Create a subgraph containing only the giant component
giant_subgraph_sampled = subgraph_sampled.subgraph(giant_component_sampled)

print("Number of NODES before extraction:", subgraph_sampled.number_of_nodes())
print("Number of EDGES before extraction:", subgraph_sampled.number_of_edges())
print()
print("Number of NODES in the giant component:", giant_subgraph_sampled.number_of_nodes())
print("Number of EDGES in the giant component:", giant_subgraph_sampled.number_of_edges())

Number of NODES before extraction: 38331
Number of EDGES before extraction: 18596

Number of NODES in the giant component: 11725
Number of EDGES in the giant component: 16432
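
The sample depends on the state of the random number generator, so re-running the cell changes the node counts and every measure derived from them. A minimal sketch of a seeded variant follows; `sample_nodes` and the seed value 42 are illustrative choices, and the outputs shown in this notebook were not produced with them.

```python
import random

def sample_nodes(nodes, fraction, seed):
    """Return a reproducible random sample containing `fraction` of `nodes`."""
    rng = random.Random(seed)          # local generator, leaves the global random state untouched
    ordered = sorted(nodes)            # fixed order so that the same seed gives the same sample
    return rng.sample(ordered, int(len(ordered) * fraction))

# Hypothetical node IDs standing in for G.nodes()
toy_nodes = [f"artist_{i}" for i in range(20)]
first = sample_nodes(toy_nodes, 0.25, seed=42)
second = sample_nodes(toy_nodes, 0.25, seed=42)
print(first == second, len(first))
```

In the notebook, the call would take the form `sample_nodes(G.nodes(), 0.25, seed=42)`.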


In [None]:
# Plot the sampled subgraph (spring_layout is slow on a graph of this size)
plt.figure(figsize=(12, 12))
pos = nx.spring_layout(subgraph_sampled)  # Positions for all nodes
nx.draw_networkx_nodes(subgraph_sampled, pos, node_size=10, alpha=0.3)
nx.draw_networkx_edges(subgraph_sampled, pos, alpha=0.1)

# Highlight the giant subgraph
nx.draw_networkx_nodes(giant_subgraph_sampled, pos, node_size=10, node_color='r', alpha=0.5)
nx.draw_networkx_edges(giant_subgraph_sampled, pos, alpha=0.5, edge_color='r')

plt.title("Spotify Artist Feature Collaboration Network with Giant Component Highlighted")
plt.show()

In [21]:
# Calculate relevant network measures
avg_degree = sum(dict(G.degree()).values()) / len(G)
clustering_coefficient = nx.average_clustering(G)

# Calculate relevant network measures for the subsampled giant component
avg_shortest_path_length = nx.average_shortest_path_length(giant_subgraph_sampled)
diameter = nx.diameter(giant_subgraph_sampled)


print("Average Degree:", avg_degree)
print("Clustering Coefficient:", clustering_coefficient)

print()

print("Average Shortest Path Length (Subsampled Giant Component):", avg_shortest_path_length)
print("Diameter (Subsampled Giant Component):", diameter)

Average Degree: 3.9182401012215724
Clustering Coefficient: 0.08239452364659447

Average Shortest Path Length (Subsampled Giant Component): 7.560911899647855
Diameter (Subsampled Giant Component): 24


#### Interpretation of Network Characterization Measures

##### Average Degree:
The average degree of approximately 3.92 means that an artist in the network has, on average, about four distinct collaborators. This is a mean over what is likely a very uneven distribution: non-seed artists enter the dataset only through their features with chart artists, so many of them are likely to have just one or two edges, while a small number of chart artists account for a large share of the collaborations. The average therefore describes the density of the dataset more than the behavior of a typical artist.
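
The average degree can be cross-checked from the node and edge counts alone, because every undirected edge adds one to the degree of each of its two endpoints. A short sketch using the counts reported for `G` earlier in this notebook:

```python
# Counts reported for the full network G earlier in the notebook
n_nodes, n_edges = 153_327, 300_386

# Each undirected edge is counted once at each endpoint, hence the factor 2
avg_degree_from_counts = 2 * n_edges / n_nodes
print(avg_degree_from_counts)
```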

##### Clustering Coefficient:
The average clustering coefficient of approximately 0.082 means that, averaged over artists, about 8% of the pairs of an artist's collaborators have also collaborated with each other. This points to local groups, such as genre scenes, labels, or collectives, in which artists repeatedly work with one another. The value likely understates the real level of clustering: the dataset contains no edges between two non-seed artists, so any triangle that involves two or more of them cannot appear.

##### Average Shortest Path Length and Diameter:
Within the subsampled giant component, the average shortest path length is approximately 7.56 and the diameter is 24. Two randomly chosen artists in the sample are therefore separated by seven to eight collaboration steps on average, and the most distant pair by 24. Because the sample drops most of the edges of the original network, distances in the full network are likely shorter than these figures. The large gap between the average and the diameter indicates that a few artists sit on long, thin chains at the periphery, for example in niche genres or regional scenes with few links to the core.
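
An alternative that avoids cutting edges is to keep the whole giant component and sample only the starting points of the breadth-first searches. This is a different technique from the node sampling used in this notebook, shown here as a sketch; `estimate_avg_path_length` and its parameters are hypothetical names. It estimates the average path length only; the longest distance it encounters is merely a lower bound on the diameter.

```python
import random
import networkx as nx

def estimate_avg_path_length(graph, n_sources, seed=0):
    """Estimate the average shortest path length of a connected graph
    from breadth-first searches started at randomly chosen source nodes."""
    rng = random.Random(seed)
    nodes = list(graph.nodes())
    sources = rng.sample(nodes, min(n_sources, len(nodes)))
    total, pairs = 0, 0
    for source in sources:
        # Distances from `source` to every reachable node (the source itself has distance 0)
        lengths = nx.single_source_shortest_path_length(graph, source)
        total += sum(lengths.values())
        pairs += len(lengths) - 1
    return total / pairs

# Toy check: with every node used as a source, the estimate equals the exact value
toy_graph = nx.path_graph(5)
estimate = estimate_avg_path_length(toy_graph, n_sources=5)
exact = nx.average_shortest_path_length(toy_graph)
print(estimate, exact)
```

On the real data the call would look like `estimate_avg_path_length(giant_subgraph, n_sources=500)`, where 500 is an arbitrary illustrative value; each source costs one traversal of the graph.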

In [22]:
# Create a DataFrame with the results
results_df = pd.DataFrame({
    'Measure': ['Average Degree', 'Clustering Coefficient', 'Average Shortest Path Length (Subsampled Giant Component)', 'Diameter (Subsampled Giant Component)'],
    'Value': [avg_degree, clustering_coefficient, avg_shortest_path_length, diameter]
})

# Save the DataFrame to a CSV file
results_df.to_csv('network_measures.csv', index=False)

## 3 Centrality Measures and Identification of Relevant Nodes

### 3.1 Centrality Measures Analysis
Calculate centrality measures such as:
- degree centrality
- closeness centrality
- betweenness centrality
- PageRank
- HITS (Hubs and Authorities)
- eigenvector centrality.

In [23]:
# Calculate centrality measures
closeness_centrality = nx.closeness_centrality(giant_subgraph_sampled)
betweenness_centrality = nx.betweenness_centrality(giant_subgraph_sampled)

degree_centrality = nx.degree_centrality(giant_subgraph_sampled)
pagerank = nx.pagerank(giant_subgraph_sampled)
# On an undirected graph the adjacency matrix is symmetric, so hub and authority scores coincide
hubs, authorities = nx.hits(giant_subgraph_sampled)
eigenvector_centrality = nx.eigenvector_centrality(giant_subgraph_sampled)

# Create DataFrames for each centrality measure
closeness_centrality_df = pd.DataFrame(list(closeness_centrality.items()), columns=['Node', 'Closeness Centrality'])
betweenness_centrality_df = pd.DataFrame(list(betweenness_centrality.items()), columns=['Node', 'Betweenness Centrality'])
degree_centrality_df = pd.DataFrame(list(degree_centrality.items()), columns=['Node', 'Degree Centrality'])
pagerank_df = pd.DataFrame(list(pagerank.items()), columns=['Node', 'PageRank'])
hubs_df = pd.DataFrame(list(hubs.items()), columns=['Node', 'Hub Score'])
authorities_df = pd.DataFrame(list(authorities.items()), columns=['Node', 'Authority Score'])
eigenvector_centrality_df = pd.DataFrame(list(eigenvector_centrality.items()), columns=['Node', 'Eigenvector Centrality'])

# Merge all DataFrames
centrality_df = pd.merge(closeness_centrality_df, betweenness_centrality_df, on='Node')
centrality_df = pd.merge(centrality_df, degree_centrality_df, on='Node')
centrality_df = pd.merge(centrality_df, pagerank_df, on='Node')
centrality_df = pd.merge(centrality_df, hubs_df, on='Node')
centrality_df = pd.merge(centrality_df, authorities_df, on='Node')
centrality_df = pd.merge(centrality_df, eigenvector_centrality_df, on='Node')

# Save the DataFrame to a CSV file
centrality_df.to_csv('centrality_measures.csv', index=False)

In [27]:
centrality_df.head(5)

| | Node | Closeness Centrality | Betweenness Centrality | Degree Centrality | PageRank | Hub Score | Authority Score | Eigenvector Centrality |
|---|---|---|---|---|---|---|---|---|
| 0 | 6HObKCGcJkXr84jyo0ZzPp | 0.13191 | 0.0 | 8.4e-05 | 4.3e-05 | 1.901105e-07 | 1.901105e-07 | 3.559657e-06 |
| 1 | 2asG9hKxCIfcTnoHGezUW9 | 0.09167 | 0.0 | 8.4e-05 | 4.7e-05 | 7.20647e-12 | 7.20647e-12 | 1.291454e-10 |
| 2 | 2XRVqqdNOt779uXNFA1Fhv | 0.145692 | 0.000335 | 0.000167 | 6.8e-05 | 1.568094e-06 | 1.568094e-06 | 2.678956e-05 |
| 3 | 5IHqlcCbQkyhWl0KmIwgeq | 0.183836 | 0.001335 | 0.000586 | 0.00015 | 3.497802e-05 | 3.497802e-05 | 0.0006229484 |
| 4 | 00w9sdZ78mWArooTmiSTld | 0.105827 | 0.000404 | 0.000251 | 0.000109 | 1.343074e-10 | 1.343074e-10 | 2.459117e-09 |


### 3.2 Identification of Relevant Nodes
- Analyze the centrality measures to identify the most relevant nodes in the network.
- Discuss the significance of these nodes and their roles within the network.

In [30]:
# Map Spotify IDs to artist names once (first occurrence per ID, matching .iloc[0])
id_to_name = nodes_df.drop_duplicates('spotify_id').set_index('spotify_id')['name']

# Centrality dictionaries, keyed by the label used in the printout
centrality_measures = {
    "Degree Centrality": degree_centrality,
    "Closeness Centrality": closeness_centrality,
    "Betweenness Centrality": betweenness_centrality,
    "PageRank": pagerank,
    "Hubs": hubs,
    "Authorities": authorities,
    "Eigenvector Centrality": eigenvector_centrality,
}

# Display the ten highest-scoring artists for each measure
print("\nMost Central Nodes:")
for i, (label, scores) in enumerate(centrality_measures.items()):
    print(label + ":" if i == 0 else "\n" + label + ":")
    for node in sorted(scores, key=scores.get, reverse=True)[:10]:
        print(id_to_name[node], scores[node])


Most Central Nodes:
Degree Centrality:
Mc Gw 0.018583626318432946
Jean Sibelius 0.01247279424075004
Diplo 0.010798593671521847
Snoop Dogg 0.009124393102293654
Pritam 0.007450192533065461
Andrea Bocelli 0.007366482504604051
G. V. Prakash 0.006110832077682906
Rick Ross 0.005441151849991628
Lil Wayne 0.004938891679223171
Don Diablo 0.0048551816507617605

Closeness Centrality:
Diplo 0.2278118921392883
Snoop Dogg 0.21639344262295082
Anitta 0.21559674421122924
Lil Wayne 0.21520059087388085
Dillon Francis 0.21054301274255802
Calvin Harris 0.21036134394590406
Khalid 0.20914597850064778
Yellow Claw 0.20824544582933846
G-Eazy 0.20819463566810156
Sia 0.20789029462436698

Betweenness Centrality:
Diplo 0.20754309192373555
Snoop Dogg 0.1218207762854149
Anitta 0.07018396939313132
Pritam 0.05520175485507936
MC Lan 0.04571958559131806
Yellow Claw 0.04474073544084224
Lukas Graham 0.04394836203276162
Don Diablo 0.043915028839767764
Tropkillaz 0.04245343525967242
G-Eazy 0.04118628728126513

PageRank:
Jea

#### Detailed Interpretation of Centrality Measures:

1. **Degree Centrality**:
   - Mc Gw, Jean Sibelius, Diplo, and Snoop Dogg have the highest degree centrality in the sampled giant component, meaning they have the largest number of distinct collaborators. The presence of Jean Sibelius, a composer who died in 1957, shows that an edge is not always an active collaboration: classical recordings most likely credit the composer alongside the performers, so each orchestra or soloist that records his works adds an edge. Diplo (electronic) and Snoop Dogg (hip-hop) are artists known for frequent features across genres.

2. **Closeness Centrality**:
   - Diplo, Snoop Dogg, Anitta, and Lil Wayne have the highest closeness centrality: their average distance to every other artist in the sampled component is the smallest. Diplo, a producer and DJ, and Snoop Dogg, a veteran rapper, collaborate across several genres, which keeps them few steps away from most of the network. Anitta, a Brazilian pop star, links Brazilian artists to the international part of the network, and Lil Wayne holds a similarly central position within rap.

3. **Betweenness Centrality**:
   - Diplo, Snoop Dogg, Anitta, and Pritam have the highest betweenness centrality, meaning a large share of the shortest paths between other artists pass through them. Diplo stands out: his score of about 0.21 is almost twice that of Snoop Dogg (about 0.12), which marks him as the main connector between otherwise distant parts of the sample. Anitta most likely links Brazilian and Latin artists to the English-language core, and Pritam, a prominent Bollywood composer, most likely connects the Indian film-music cluster to the rest of the network.

4. **PageRank**:
   - Jean Sibelius, Mc Gw, Andrea Bocelli, and Diplo rank highest in PageRank. On an undirected graph PageRank tracks degree closely, so this list largely repeats the degree ranking instead of adding a separate notion of prestige. For Jean Sibelius and Andrea Bocelli, the high scores most likely reflect many recordings that each credit a different performer or duet partner, not influence on contemporary artists. Diplo's score reflects his large number of collaborations across electronic and pop music.

5. **Hubs and Authorities**:
   - Mc Gw, Mc Kitinho, MC CH da Z.O, and Mc Cyclope have the highest hub and authority scores. Because the collaboration graph is undirected, HITS assigns every node the same hub and authority score, as the identical Hub Score and Authority Score columns in the table above show, so the two lists carry no separate information. The top names most likely form a single densely interconnected group around Mc Gw; such a group dominates the leading eigenvector on which HITS is based.

6. **Eigenvector Centrality**:
   - Mc Gw, Mc Kitinho, MC CH da Z.O, and Mc Cyclope also have the highest eigenvector centrality. On an undirected graph, eigenvector centrality and HITS rank nodes by essentially the same leading eigenvector, so the agreement with the previous list is expected. A high score means being connected to other high-scoring artists, so this measure picks out the densest cluster in the sample and not the artists with the widest reach: apart from Mc Gw, none of these artists appears in the top ten for degree, closeness, or betweenness.
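
The identical Hub Score and Authority Score columns in the centrality table are a property of HITS on undirected graphs, not a coincidence of this dataset. The sketch below runs a plain HITS power iteration on a small hypothetical undirected graph to illustrate this. It is a textbook-style implementation written for this example, not the routine NetworkX uses internally.

```python
def hits_power_iteration(adjacency, iterations=200):
    """Plain HITS power iteration on an adjacency dict {node: set(neighbors)}.
    An edge u -> v exists when v is in adjacency[u]."""
    nodes = list(adjacency)
    hubs = {n: 1.0 for n in nodes}
    auths = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the nodes pointing at the node
        auths = {n: sum(hubs[u] for u in nodes if n in adjacency[u]) for n in nodes}
        # Hub score: sum of the authority scores of the nodes it points to
        hubs = {n: sum(auths[v] for v in adjacency[n]) for n in nodes}
        # Normalize so that each score vector sums to 1
        a_sum, h_sum = sum(auths.values()), sum(hubs.values())
        auths = {n: s / a_sum for n, s in auths.items()}
        hubs = {n: s / h_sum for n, s in hubs.items()}
    return hubs, auths

# Hypothetical undirected toy graph: every edge is listed in both directions
undirected = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
hub_scores, authority_scores = hits_power_iteration(undirected)
max_gap = max(abs(hub_scores[n] - authority_scores[n]) for n in undirected)
print(max_gap)
```

After convergence the largest difference between a node's hub and authority score is negligible, so on this kind of graph the two rankings can be reported as one.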


## 4 Comparison with a Random Network and Small-World Phenomenon

### 4.1 Justification of Significance Compared with a Random Network
- Generate an equivalent random network and compare its properties with the observed network.
- Justify the significance of the observed network's properties compared to the random network.
- Discuss any deviations or similarities observed and their implications for the network's structure and behavior.

In [38]:
# Generate an equivalent random network
random_network = nx.gnm_random_graph(G.number_of_nodes(), G.number_of_edges())

# Subsample the random network in the same way as the observed network

# Randomly select 25% of the nodes for subsampling
random_sampled_nodes = random.sample(list(random_network.nodes()), int(random_network.number_of_nodes() * 0.25))

# Create a subgraph containing only the sampled nodes
random_subgraph_sampled = random_network.subgraph(random_sampled_nodes)

# Extract the giant component
random_giant_component_sampled = max(nx.connected_components(random_subgraph_sampled), key=len)

# Create a subgraph containing only the giant component
random_giant_subgraph_sampled = random_subgraph_sampled.subgraph(random_giant_component_sampled)

print("Number of NODES before extraction:", random_subgraph_sampled.number_of_nodes())
print("Number of EDGES before extraction:", random_subgraph_sampled.number_of_edges())
print()
print("Number of NODES in the giant component:", random_giant_subgraph_sampled.number_of_nodes())
print("Number of EDGES in the giant component:", random_giant_subgraph_sampled.number_of_edges())

Number of NODES before extraction: 38331
Number of EDGES before extraction: 18727

Number of NODES in the giant component: 435
Number of EDGES in the giant component: 434


In [None]:
# Calculate relevant network measures
random_avg_degree = np.mean(list(dict(random_network.degree()).values()))
random_avg_clustering_coefficient = nx.average_clustering(random_network)

# Calculate relevant network measures for the subsampled giant component
random_avg_shortest_path = nx.average_shortest_path_length(random_giant_subgraph_sampled)
random_diameter = nx.diameter(random_giant_subgraph_sampled)

In [40]:
# Compare properties with the observed network
print("Observed Network:")
print("Average Degree:", avg_degree)
print("Clustering Coefficient:", clustering_coefficient)
print("Average Shortest Path Length (Subsampled Giant Component):", avg_shortest_path_length)
print("Diameter (Subsampled Giant Component):", diameter)

print("\nRandom Network:")
print("Average Degree of the random network:", random_avg_degree)
print("Clustering Coefficient of the random network:", random_avg_clustering_coefficient)
print("Average Shortest Path Length of the random network (Subsampled Giant Component):", random_avg_shortest_path)
print("Diameter of the random network (Subsampled Giant Component):", random_diameter)

Observed Network:
Average Degree: 3.9182401012215724
Clustering Coefficient: 0.08239452364659447
Average Shortest Path Length (Subsampled Giant Component): 7.560911899647855
Diameter (Subsampled Giant Component): 24

Random Network:
Average Degree of the random network: 3.9182401012215724
Clustering Coefficient of the random network: 2.024410739118521e-05
Average Shortest Path Length of the random network (Subsampled Giant Component): 29.75081307272631
Diameter of the random network (Subsampled Giant Component): 68


#### Comparison of Network Measures: Observed vs. Random Network

Comparing the measures of the observed network with those of the random network gives the following picture:

1. **Average Degree**:
   - Both networks have an average degree of approximately 3.92. This holds by construction: the random network was generated with the same number of nodes and edges as the observed one, so the equality says nothing about the structure of the collaborations. It only guarantees that the remaining comparisons are made at equal density.

2. **Clustering Coefficient**:
   - The clustering coefficient of the observed network (0.082) is roughly 4,000 times that of the random network (2.02e-05). Artists whose collaborators also collaborate with each other are therefore far more common than chance would produce at this density, which is consistent with tightly knit groups such as genre scenes or label rosters. In the random network, edges are placed independently of each other and triangles are rare.

3. **Average Shortest Path Length**:
   - The average shortest path length in the observed sample (7.56) is shorter than in the random sample (29.75), but this comparison needs care. The 25% sample of the random network has an average degree below 1 (2 × 18,727 / 38,331 ≈ 0.98), so it falls apart into small tree-like pieces: its largest component has only 435 nodes and 434 edges. The value 29.75 describes that fragment, not the random network as a whole. The observed sample has a similar average degree but keeps a giant component of 11,725 nodes, most likely because its edges are concentrated around high-degree hubs that survive sampling together with many of their neighbors.

4. **Diameter**:
   - The diameter of the observed sample (24) is smaller than that of the random sample (68). The same caveat applies: a diameter of 68 in a component of 435 nodes and 434 edges is what one expects from a sparse tree, and it does not show that the full random network has long paths. What the comparison does show is that the observed network stays connected under node sampling far better than a random network of equal density.
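
For a uniform random graph with N nodes and mean degree ⟨k⟩, two standard approximations give a baseline without any sampling: the clustering coefficient is close to ⟨k⟩/N, and the typical distance in the giant component is on the order of ln N / ln ⟨k⟩. The sketch below evaluates both for the counts of this network. These are leading-order approximations for large sparse graphs, offered as a sanity check and not as a replacement for measurement.

```python
import math

# Node and edge counts of the observed network (shared by the random network)
n_nodes, n_edges = 153_327, 300_386
avg_degree = 2 * n_edges / n_nodes

# Expected clustering: the probability that two given nodes are linked
expected_clustering = avg_degree / (n_nodes - 1)

# Rough estimate of the typical distance in the giant component
approx_path_length = math.log(n_nodes) / math.log(avg_degree)

# Average degree inside the 25% sample of the random network (counts from the cell above)
random_sample_avg_degree = 2 * 18_727 / 38_331

print(f"expected clustering   ~ {expected_clustering:.2e}")
print(f"approx. path length   ~ {approx_path_length:.1f}")
print(f"sample average degree = {random_sample_avg_degree:.2f}")
```

The expected clustering has the same order of magnitude as the measured 2.02e-05, while the approximate path length of the full random network comes out far below the 29.75 measured on the sampled fragment.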

### 4.2 Presence of Small-World Phenomenon
- Investigate whether the observed network exhibits the small-world phenomenon.
- Calculate relevant measures such as average clustering coefficient and average shortest path length.
- Compare these measures to those of a random network and discuss the presence or absence of the small-world phenomenon in the observed network.

In [3]:
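# Check for small-world phenomenon: clustering above the random baseline
# and an average path length below that of the random baseline
if clustering_coefficient > random_avg_clustering_coefficient and avg_shortest_path_length < random_avg_shortest_path: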
    print("The observed network exhibits the small-world phenomenon.")
else:
    print("The observed network does not exhibit the small-world phenomenon.")

The observed network exhibits the small-world phenomenon.


The small-world phenomenon observed in our Spotify Artist Feature Collaboration Network suggests an intriguing interplay between local artist communities and global connectivity. 

Artists form tightly knit clusters, as the high clustering coefficient shows, while the paths that connect artists remain short relative to the size of the network. The second half of this claim rests on path lengths measured in 25% samples, so it is best treated as an indication and not as a precise measurement.

In the context of the music industry, the small-world property means that dense local scenes are tied together by a relatively small number of artists who collaborate across genres and countries. Through these connectors, collaborations link otherwise separate scenes, which gives artists routes to new audiences and new musical directions.
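
The boolean check in Section 4.2 can be replaced by a single number. One common choice is the small-world index σ = (C / C_rand) / (L / L_rand), where values above 1 are read as evidence of small-world structure. The sketch below evaluates it with the figures reported in Section 4.1; `small_world_sigma` is an illustrative helper, and the second call uses an assumed baseline of 8.7 steps (roughly ln N / ln ⟨k⟩ for this network) to show that the conclusion does not depend on the path length of the sampled random fragment.

```python
def small_world_sigma(c_obs, c_rand, l_obs, l_rand):
    """Small-world index: clustering ratio divided by path-length ratio."""
    return (c_obs / c_rand) / (l_obs / l_rand)

# Values reported in Section 4.1 of this notebook
C_OBS, C_RAND = 0.08239452364659447, 2.024410739118521e-05
L_OBS, L_RAND_SAMPLED = 7.560911899647855, 29.75081307272631

sigma = small_world_sigma(C_OBS, C_RAND, L_OBS, L_RAND_SAMPLED)

# Assumed, more conservative random baseline for the path length
sigma_conservative = small_world_sigma(C_OBS, C_RAND, L_OBS, 8.7)

print(sigma > 1, sigma_conservative > 1)
```

Both variants exceed 1 by a wide margin because the clustering ratio dominates, which supports the small-world reading even under the more cautious path-length baseline.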