## Implementation Technique of Problem 1

### Part A: Choosing the Romania Deezer Gemsec Dataset
We selected the **Romania Deezer Gemsec Dataset**, which contains:
- **41,773 nodes**
- **125,826 edges**

The dataset was sourced from Stanford. To represent the graph, we:
1. Initialized the network class that defines the nodes and edges.
2. Created the adjacency matrix with the following line of code:

```python
self.adjacency_matrix = self.create_adjacency_matrix()
```

This adjacency matrix is the basis for the visualization and for every computation on the network in the parts that follow.
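As a minimal sketch of what such a construction step can look like, the snippet below builds a symmetric 0/1 matrix from an undirected edge list. The helper name `build_adjacency_matrix` and the four-node toy graph are assumptions made for illustration; they are not the network class mentioned above.

```python
import numpy as np

def build_adjacency_matrix(edges, num_nodes):
    """Return a symmetric 0/1 matrix for an undirected edge list (hypothetical helper)."""
    adj = np.zeros((num_nodes, num_nodes), dtype=np.uint8)
    for u, v in edges:
        adj[u, v] = 1  # edge u -> v
        adj[v, u] = 1  # mirror entry, because the graph is undirected
    return adj

# Toy graph (illustrative only): a triangle 0-1-2 with a pendant node 3.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
toy_adj = build_adjacency_matrix(toy_edges, num_nodes=4)
print(toy_adj)
```

Each undirected edge fills two entries, so the sum of the matrix is twice the number of edges.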

---

### Part B: Visualizing the Graph
To visualize the graph, we plotted the nonzero entries of the adjacency matrix created in the previous step. Each marker at position \((i, j)\) represents an edge between nodes \(i\) and \(j\), so the plot shows the overall connection pattern of the network.

---

### Part C: Sparsity in the Network
**Sparsity** in a network occurs when the number of actual links between nodes is much smaller than the total possible links.

The sparsity is calculated using the formula:

$$
\text{sparseness} = 1 - \frac{\text{actual\_Edges}}{\frac{\text{number of nodes} \times (\text{number of nodes} - 1)}{2}}
$$
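To make the formula concrete, the following sketch plugs in the node and edge counts quoted in Part A. The variable names are illustrative assumptions; only the two counts come from the text.

```python
# Node and edge counts as quoted for the dataset in Part A.
num_nodes = 41_773
actual_edges = 125_826

# Maximum number of edges in a simple undirected graph: N(N-1)/2.
max_edges = num_nodes * (num_nodes - 1) // 2

density = actual_edges / max_edges
sparseness_value = 1 - density

print(f"possible edges: {max_edges}")
print(f"density:        {density:.6e}")
print(f"sparseness:     {sparseness_value:.6f}")
```

With these counts the sparseness is above 0.999, meaning that only a tiny fraction of the possible links is actually present.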


---

### Part D: Average Degree of the Network
The **average degree** of a network is computed as:

$$
\text{Average Degree} = \frac{2 \times \text{actual\_Edges}}{\text{number of nodes}}
$$

This represents the average number of connections each node has within the network.
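The factor of 2 appears because every edge contributes to the degree of both of its endpoints. The sketch below checks this on a hypothetical four-node toy graph by comparing the formula with the mean of the adjacency-matrix row sums; the graph and variable names are assumptions for illustration.

```python
import numpy as np

# Toy undirected graph (hypothetical): 4 nodes, 4 edges.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
num_nodes = 4

adj = np.zeros((num_nodes, num_nodes), dtype=int)
for u, v in toy_edges:
    adj[u, v] = adj[v, u] = 1

avg_from_formula = 2 * len(toy_edges) / num_nodes  # 2E / N
avg_from_matrix = adj.sum(axis=1).mean()           # mean of the row sums (degrees)

print(avg_from_formula, avg_from_matrix)
```

Both expressions give the same value, which is why the average degree can be read directly off the degree vector of the adjacency matrix.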

---

### Part E: Plotting Scaled Degree Distribution
The **degree distribution** describes the probability \(P(k)\) that a randomly chosen node has degree \(k\). It is computed as:

$$
P(k) = \frac{N_k}{N}
$$
Where:
- \(N_k\): Number of nodes with degree \(k\)
- \(N\): Total number of nodes

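We plotted the scaled degree distribution \(k \cdot P(k)\) against \(k\), switching an axis to a logarithmic scale when its values span more than one order of magnitude. The result aligns with the distribution expected from the lectures.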
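As a small worked example, the sketch below computes \(P(k)\) and the scaled values \(k \cdot P(k)\) (the quantity plotted by `plotDegDis` in the code) for a hypothetical degree list. The degree values and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Degrees of a hypothetical 4-node toy graph with edges (0,1), (0,2), (1,2), (2,3).
degrees = np.array([2, 2, 3, 1])

k, n_k = np.unique(degrees, return_counts=True)  # distinct degrees and N_k
p_k = n_k / degrees.size                         # P(k) = N_k / N
scaled = k * p_k                                 # k * P(k)

for ki, pi, si in zip(k, p_k, scaled):
    print(f"k={ki}  P(k)={pi:.2f}  k*P(k)={si:.2f}")
```

Two sanity checks follow from the definitions: the \(P(k)\) values sum to 1, and the scaled values sum to the average degree.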

---

### Part F: Computing and Plotting Average Path Length
The **average path length** is the mean length of the shortest paths over all connected pairs of nodes. To compute it, we:
1. Implemented **Breadth-First Search (BFS)** to find the shortest-path length from each node to every other node.
2. Computed the average path length by summing the lengths of all shortest paths and dividing by the number of paths.
3. Plotted the distribution of path lengths to show how they are spread across the network.
4. Because this step runs a BFS from every node, it is slow: computing the average path length took about 2.5 hours, and plotting the distribution of path lengths took about 2 more.
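The steps above can be sketched on a small graph as follows. This version stores the graph as an adjacency list (a dictionary of neighbour lists), which is an assumption made here so that each BFS only visits actual neighbours; it is not the matrix-based representation used in this notebook, and the function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj_list, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj_list[node]:
            if neighbor not in dist:       # first visit is the shortest route
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def average_path_length(adj_list):
    """Mean shortest-path length over all connected, unordered node pairs."""
    total, pairs = 0, 0
    for source in adj_list:
        for target, d in bfs_distances(adj_list, source).items():
            if target > source:            # count each unordered pair once
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")

# Toy graph (illustrative only): a triangle 0-1-2 with a pendant node 3.
toy_adj_list = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
toy_apl = average_path_length(toy_adj_list)
print(toy_apl)
```

In the toy graph four pairs are at distance 1 and two pairs are at distance 2, so the average is 8/6.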

---

### Part G: Computing and Plotting Clustering Coefficient
The **clustering coefficient** measures how connected a node's neighbors are to each other. For a node \(i\), it is defined as:

$$
C_i = \frac{2L_i}{k_i(k_i - 1)}
$$

Where:
- \(L_i\): Number of links between the \(k_i\) neighbors of node \(i\)
- \(k_i\): Degree of node \(i\)

To compute the overall clustering coefficient:
1. Calculated the clustering coefficient \(C_i\) for each node and stored the values in a dictionary for easy lookup.
2. Averaged these values to find the **average clustering coefficient**.
3. Plotted the clustering coefficient distribution to visualize the local connectivity of nodes within the network.
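For a concrete instance of the formula, the sketch below evaluates the local clustering coefficient on a hypothetical four-node graph stored as neighbour sets. The function name, the set-based representation, and the toy graph are assumptions for illustration; nodes with fewer than two neighbours are assigned 0 here.

```python
from itertools import combinations

def local_clustering(adj_sets, node):
    """C_i = 2 * L_i / (k_i * (k_i - 1)), or 0.0 when k_i < 2."""
    neighbors = adj_sets[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # L_i: number of neighbour pairs that are themselves connected.
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj_sets[u])
    return 2.0 * links / (k * (k - 1))

# Toy graph (illustrative only): a triangle 0-1-2 with a pendant node 3.
toy_adj_sets = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

coeffs = {node: local_clustering(toy_adj_sets, node) for node in toy_adj_sets}
print(coeffs)
```

Node 2 has three neighbours with only one link among them (between 0 and 1), so its coefficient is \(2 \cdot 1 / (3 \cdot 2) = 1/3\), while nodes 0 and 1 each reach the maximum of 1.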

---

In [21]:
!pip install numpy pandas matplotlib



In [22]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter, deque, defaultdict

In [23]:
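def data(file_path):
    # Read a whitespace-separated edge list, one "n1 n2" pair per line.
    edges = []
    with open(file_path, 'r') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines instead of failing on them
            n1, n2 = map(int, line.split())
            edges.append((n1, n2))
    return edges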

def adjMat(edges):
    max_node = max(max(i) for i in edges)
    adj_matrix = np.zeros((max_node + 1, max_node + 1), dtype=int)
    
    for n1, n2 in edges:
        adj_matrix[n1][n2] = 1
        adj_matrix[n2][n1] = 1
    
    return adj_matrix

def edgeList(edges, output_file):
    
    df = pd.DataFrame(edges, columns=['Source', 'Target'])
    df.to_csv(output_file, index=False)

def saveadjMat(adj_matrix, output_file):
    df = pd.DataFrame(adj_matrix)
    df.to_csv(output_file)

def visualizeNetwork(adjMatrix, output_path='B_network_visualization.png'):
    plt.figure(figsize=(12, 8))
    plt.spy(adjMatrix, markersize=0.1)
    plt.title(f'Network Visualization (Nodes: {len(adjMatrix)})')
    plt.xlabel('Node Index')
    plt.ylabel('Node Index')
    plt.tight_layout()
    plt.savefig(output_path, dpi=300)
    plt.close()
    
def sparseness(adjMatrix):
    n = len(adjMatrix)
    actEdges = np.sum(adjMatrix) / 2
    maxEdges = (n * (n - 1)) / 2
    sparseness = 1 - (actEdges / maxEdges)
    return sparseness

def avgDeg(adjacency_matrix):
    degrees = np.sum(adjacency_matrix, axis=1)
    avgDeg = np.mean(degrees)
    return avgDeg

def degDis(adjacency_matrix):
    degList = np.sum(adjacency_matrix, axis=1)
    N = len(degList)  
    degCnt = Counter(degList)
    degrees = np.array(sorted(degCnt.keys()))
    Nk = np.array([degCnt[k] for k in degrees])
    
    pk = Nk / N
    
    return degrees, pk

def plotDegDis(adjacency_matrix, output_path='E_degree_distribution.png'):
    k, pk = degDis(adjacency_matrix)
    
    actpk = pk * k
    
    plt.figure(figsize=(10, 6))
    
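    # The 'o:' format string already requests a dotted line, so a separate
    # linestyle argument would be redundant.
    plt.plot(k, actpk, 'o:', color='blue', markersize=8,
             markerfacecolor='white', markeredgecolor='blue',
             markeredgewidth=2)
    
    plt.xlabel('k')
    plt.ylabel('k × Pₖ')
    plt.title('Scaled Degree Distribution')
    
    # Compare positive values only: isolated nodes (k = 0) would otherwise
    # cause a division by zero, and zeros cannot be drawn on a log axis.
    pos_k = k[k > 0]
    pos_actpk = actpk[actpk > 0]
    if pos_k.size and pos_k.max() / pos_k.min() > 10:
        plt.xscale('log')
    if pos_actpk.size and pos_actpk.max() / pos_actpk.min() > 10:
        plt.yscale('log')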
    
    plt.grid(True, which='both', linestyle='--', alpha=0.7)
    
    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches='tight')
    plt.close()

    return k, pk, actpk

def bfs(adjMatrix, start_node):
    # Hop distance from start_node to every node; unreachable nodes stay at infinity.
    n = len(adjMatrix)
    dis = {i: float('inf') for i in range(n)} 
    dis[start_node] = 0
    
    queue = deque([start_node])
    visited = {start_node}
    
    while queue:
        curr = queue.popleft()
        for neighbor in range(n):
            if adjMatrix[curr][neighbor] == 1 and neighbor not in visited:
                visited.add(neighbor)
                dis[neighbor] = dis[curr] + 1
                queue.append(neighbor)
    
    return dis

def storeShortestPath(adj_matrix):
    n = len(adj_matrix)
    all_paths = []
    path_dict = {} 
    
    for i in range(n):
        distances = bfs(adj_matrix, i)
        for end in range(i + 1, n):  
            if distances[end] != float('inf'):  
                path_length = distances[end]
                all_paths.append(path_length)
                path_dict[(i, end)] = path_length
    
    avgPathLen = np.mean(all_paths)
    return all_paths, avgPathLen, path_dict

def plotPathLen(path_lengths, output_path='F_path_length_distribution.png'):
    lengths, counts = np.unique(path_lengths, return_counts=True)
    probabilities = counts / len(path_lengths)
    
    plt.figure(figsize=(10, 6))
    
    plt.plot(lengths, probabilities, 'o:', color='blue', 
            markersize=8, markerfacecolor='white', 
            markeredgecolor='blue', markeredgewidth=2)
    
    plt.xlabel('Path Length')
    plt.ylabel('Probability')
    plt.title('Path Length Distribution')
    plt.grid(True, linestyle='--', alpha=0.7)
    
    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches='tight')
    plt.close()
    
def get_neighbors(adj_matrix, node):
    return np.where(adj_matrix[node] == 1)[0]

def links(adj_matrix, neighbors):
    # Count the edges that connect pairs of the given neighbors (L_i).
    if len(neighbors) < 2:
        return 0
    
    count = 0
    for i in range(len(neighbors)):
        for j in range(i + 1, len(neighbors)):
            if adj_matrix[neighbors[i]][neighbors[j]] == 1:
                count += 1
    return count

def clusCoeff(adj_matrix, node):
    neighbors = get_neighbors(adj_matrix, node)
    ki = len(neighbors) 
    
    if ki < 2:  
        return 0.0
    
    Li = links(adj_matrix, neighbors)
    Ci = (2.0 * Li) / (ki * (ki - 1))
    return Ci

def avgClusCoeff(adj_matrix):
    n = len(adj_matrix)
    degree_clustering = defaultdict(list)  
    
    all_coefficients = []
    for node in range(n):
        degree = np.sum(adj_matrix[node])
        if degree > 1:  
            ci = clusCoeff(adj_matrix, node)
            degree_clustering[degree].append(ci)
            all_coefficients.append(ci)
    
    result = {
        k: np.mean(coefficients) 
        for k, coefficients in degree_clustering.items()
    }
    
    overall = np.mean(all_coefficients)
    
    return result, overall

def plotClusCoeff(avg_clustering_by_degree,
                  output_path='G_clustering_coefficient.png'):
    # Plot the degree-averaged clustering coefficient, scaled by k, against k.
    degrees = np.array(list(avg_clustering_by_degree.keys()))
    clustering_coeffs = np.array(list(avg_clustering_by_degree.values()))
    
    scaled_clustering = clustering_coeffs * degrees
    
    plt.figure(figsize=(10, 6))
    
    plt.plot(degrees, scaled_clustering, 'o:', color='blue', 
            markersize=8, markerfacecolor='white', 
            markeredgecolor='blue', markeredgewidth=2)
    
    plt.xlabel('k')
    plt.ylabel('Cₖ × k')
    plt.title('Scaled Clustering Coefficient Distribution')
    
    if max(degrees) / min(degrees) > 10:
        plt.xscale('log')
    if max(scaled_clustering) / min(scaled_clustering[scaled_clustering > 0]) > 10:
        plt.yscale('log')
    
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches='tight')
    plt.close()
    

if __name__ == '__main__':
    edges = data('data/1684.edges')
    
    adj_matrix = adjMat(edges)
    
    edgeList(edges, 'edge_list.csv')
    
    saveadjMat(adj_matrix, 'adjacency_matrix.csv')
    
    # part B
    visualizeNetwork(adj_matrix)
    
    # part C
    # Results get their own names so the functions are not overwritten;
    # rebinding `sparseness` would break this cell when it is run a second time.
    sparseness_value = sparseness(adj_matrix)
    
    # part D
    avg_degree = avgDeg(adj_matrix)
    
    # part E
    k, pk, scaled_pk = plotDegDis(adj_matrix)
    
    # part F
    path_lengths, avg_path_length, path_dict = storeShortestPath(adj_matrix)
    plotPathLen(path_lengths)
    
    # part G
    clustering_by_degree, overall_avg_clustering = avgClusCoeff(adj_matrix)
    plotClusCoeff(clustering_by_degree)
    
    with open('network_stats.txt', 'w') as f:
        f.write(f"Number of edges: {len(edges)}\n")
        f.write(f"Number of nodes: {adj_matrix.shape[0]}\n")
        f.write(f"Matrix size: {adj_matrix.shape[0]} x {adj_matrix.shape[1]}\n")
        f.write(f"Network sparseness: {sparseness_value:.6f}\n")
        f.write(f"Average degree <k>: {avg_degree:.2f}\n")
        f.write("Degree (k)\tP(k)\tk×P(k)\n")
        for i in range(len(k)):
            f.write(f"{k[i]}\t{pk[i]:.6f}\t{scaled_pk[i]:.6f}\n")
        f.write(f"Average Path Length: {avg_path_length:.4f}\n\n")
        f.write("Shortest Paths between all pairs:\n")
        for (start, end), length in sorted(path_dict.items()):
            f.write(f"Node {start} to Node {end}: {length}\n")
        
        f.write("\nPath Length Distribution:\n")
        unique_lengths, counts = np.unique(path_lengths, return_counts=True)
        for length, count in zip(unique_lengths, counts):
            f.write(f"Length {length}: {count} paths\n")
        f.write(f"Overall Average Clustering Coefficient: {overall_avg_clustering:.6f}\n\n")
        f.write("Average Clustering Coefficient by Degree:\n")
        # `deg` avoids reusing `k`, which still holds the degree array from part E.
        for deg in sorted(clustering_by_degree.keys()):
            f.write(f"Degree {deg}: C({deg}) = {clustering_by_degree[deg]:.6f}, "
                    f"C({deg}) × k = {clustering_by_degree[deg] * deg:.6f}\n")

