## Introduction

In this tutorial, we'll analyze relationships between companies and technologies using network analysis. We'll use a large language model (LLM) to extract company-technology relationships from news articles, build a network from the extracted relationships, and visualize how companies are connected through their shared technology focus.

## Setup

First, let's install the required packages:

In [None]:
# Install required packages
!pip install ollama pandas networkx matplotlib tqdm -q

## Data Extraction

### Setting up Ollama

We'll use Ollama, an open-source large language model framework, to extract relationships from our text data. First, we need to set up the Ollama server:

In [None]:
# Install Ollama
!sudo apt-get install -y pciutils
!curl -fsSL https://ollama.com/install.sh | sh

In [None]:
import os
import threading
import subprocess

def start_ollama():
    os.environ['OLLAMA_HOST'] = '0.0.0.0:11434'
    os.environ['OLLAMA_ORIGINS'] = '*'
    subprocess.Popen(["ollama", "serve"])

ollama_thread = threading.Thread(target=start_ollama)
ollama_thread.start()
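
`subprocess.Popen` returns immediately, so the server may not yet be accepting requests when the next cell runs. The following is a minimal sketch of a readiness check, assuming the server is reachable at the default `http://localhost:11434`. The helper names `server_is_up` and `wait_until`, and the timeout values, are illustrative and not part of Ollama:

```python
import time
import urllib.error
import urllib.request

def server_is_up(url="http://localhost:11434"):
    """Return True if `url` answers an HTTP request (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=2):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timed out, or unreachable

def wait_until(check, timeout=30.0, interval=0.5):
    """Call `check()` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Usage (assumes the server was started as shown above):
# if not wait_until(server_is_up):
#     raise RuntimeError("Ollama server did not start in time")
```

Separating the probe from the polling loop keeps the loop easy to reuse, for example to wait for a model download to finish.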

In [None]:
# Download the model used below (run once, after the server has started)
!ollama pull qwen2.5

In [None]:
import json
import ollama
import networkx as nx
from tqdm.notebook import tqdm
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

### Defining the Extraction Schema

We'll use a structured schema to extract relationships between companies and technologies. Each relationship will include:

- Source company (`from`)
- Technology (`to`)
- Relationship type (`type`: owns/develops/implements)
- Technology category (`tech_type`)
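
Because the model's output is not guaranteed to follow this schema, it can help to check each extracted edge before using it. The following is a minimal sketch of such a check; the function name `is_valid_edge` and the constants are assumptions introduced here for illustration:

```python
ALLOWED_TYPES = {"owns", "develops", "implements"}
REQUIRED_KEYS = ("from", "to", "type", "tech_type")

def is_valid_edge(edge):
    """Return True if `edge` is a dict whose four schema fields are non-empty
    strings and whose relationship type is one of the allowed values."""
    if not isinstance(edge, dict):
        return False
    for key in REQUIRED_KEYS:
        value = edge.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    return edge["type"] in ALLOWED_TYPES

example = {
    "from": "Adobe",
    "to": "Firefly AI for text-to-video generation",
    "type": "develops",
    "tech_type": "AI Audio and Video Generation",
}
print(is_valid_edge(example))                       # True
print(is_valid_edge({"from": "Adobe", "to": "X"}))  # False: fields missing
```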

Here's our system prompt that defines the extraction rules:

In [None]:
SYSTEM_PROMPT = """Extract relationships between companies and technologies from the given text. Focus only on relationships where a company owns, develops, or implements a specific technology. Provide output in this JSON format:
{
 "edges": [
 {"from": "Company Name", "to": "Technology Name", "type": "relationship_type", "tech_type": "Technology Category"}
 ]
}
The "type" field should be "owns", "develops", or "implements".
The "tech_type" field should categorize the technology into one of these types:
1. Customer Service and Support AI
2. AI Infrastructure and Operations
3. Robotics and Autonomous Systems
4. Construction and Manufacturing AI
5. Healthcare AI Applications
6. Business Process and Workflow Automation
7. Extended Reality (AR/VR) and Immersive Technologies
8. AI in Mobile and Imaging
9. AI Audio and Video Generation
10. Search and Information Retrieval AI
11. Financial Technology (FinTech) and Financial AI
12. Smart Home and IoT AI
13. E-Commerce AI Solutions
14. Cybersecurity AI Solutions
15. Recruitment and Human Resources (HR) AI
16. Media and Content Personalization AI
17. Data Analytics and Business Intelligence
18. Software Development and DevOps AI Tools
19. Generative and Multimodal AI
20. Educational and Training AI

Ensure a valid JSON object with an 'edges' array, even if empty. English output only.

Examples based on the input articles:
1. {"from": "Google", "to": "AI-powered conversational chatbot", "type": "develops", "tech_type": "Customer Service and Support AI"}
2. {"from": "OpenAI", "to": "ChatGPT desktop app for macOS", "type": "develops", "tech_type": "AI Infrastructure and Operations"}
3. {"from": "YouTube", "to": "AI chatbot for Premium subscribers", "type": "implements", "tech_type": "Customer Service and Support AI"}
4. {"from": "Apple", "to": "AI training curriculum for Developer Academy", "type": "develops", "tech_type": "Educational and Training AI"}
5. {"from": "Adobe", "to": "Firefly AI for text-to-video generation", "type": "develops", "tech_type": "AI Audio and Video Generation"}
"""

### Extracting Relationships

We'll create a function to process each article and extract the relationships:

In [None]:
def extract_relationships(article):
    prompt = f"""
    Extract key relationships between companies and technologies from this text:
    Title: {article['title']}
    Text: {article['text']}
    Focus on relationships where a company owns, develops, or implements a specific technology.
    Categorize each technology according to the tech_type categories provided.
    """
    response = ollama.chat(
        model='qwen2.5',
        messages=[
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': prompt},
        ],
        format='json',
        options={"temperature":0.1}
    )
    return response['message']['content']
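
The function returns the model's reply as a raw string, which may be malformed JSON or may lack the `edges` array. The following is a small sketch of a defensive parser; the helper name `parse_edges` is hypothetical and not part of the `ollama` package:

```python
import json

def parse_edges(raw):
    """Parse a raw model response and return its list of edge dicts.

    Returns an empty list when the response is not valid JSON, is not a
    JSON object, or has no list under the "edges" key.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(data, dict):
        return []
    edges = data.get("edges", [])
    if not isinstance(edges, list):
        return []
    # Keep only entries that are JSON objects
    return [edge for edge in edges if isinstance(edge, dict)]

good = '{"edges": [{"from": "OpenAI", "to": "ChatGPT desktop app for macOS", "type": "develops", "tech_type": "AI Infrastructure and Operations"}]}'
print(len(parse_edges(good)))       # 1
print(parse_edges("not json"))      # []
print(parse_edges('{"nodes": []}')) # []
```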

Next, we download the dataset and extract relationships from the first 50 articles:

In [None]:
!wget https://raw.githubusercontent.com/aaubs/ds-master/refs/heads/main/data/paraphrased_articles.jsonl

In [None]:
# Read input articles
with open('paraphrased_articles.jsonl', 'r', encoding='utf-8') as f:
    articles = [json.loads(line) for line in f][:50]

# Process articles
results = []
for article in tqdm(articles, desc="Processing articles"):
    try:
        extracted_data = json.loads(extract_relationships(article))
        results.append({
            'title': article['title'],
            'extracted_data': extracted_data
        })
    except json.JSONDecodeError as e:
        print(f"Error processing article '{article['title']}': {str(e)}")

# Display first 5 results
for i, result in enumerate(results[:5]):
    print(f"Article {i+1}: {result['title']}")
    print(json.dumps(result['extracted_data'], indent=2))
    print("\n" + "="*50 + "\n")

# Save all results
with open('company_technology_relationships.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

print("Results saved to company_technology_relationships.json")

## Network Analysis

### Creating the Bipartite Graph

We'll first create a bipartite graph in which one set of nodes represents companies and the other represents technology types, and then project it onto the company layer:

In [None]:
# Create a bipartite graph from the extracted relationships
B = nx.Graph()
companies = set()
tech_types = set()

print("Building bipartite graph...")
for result in tqdm(results, desc="Processing Results"):
    for edge in result['extracted_data'].get('edges', []):  # tolerate a missing 'edges' key
        company = edge.get('from')
        tech_type = edge.get('tech_type')
        if not company or not tech_type:
            continue
        companies.add(company)
        tech_types.add(tech_type)
        B.add_edge(company, tech_type)

print(f"\nNumber of companies: {len(companies)}")
print(f"Number of technology types: {len(tech_types)}")
print(f"Number of edges in the bipartite graph: {B.number_of_edges()}")

# Project onto the company layer using weighted_projected_graph
print("\nProjecting bipartite graph onto company layer...")
company_graph = nx.bipartite.weighted_projected_graph(B, companies)
print(f"Number of nodes in company graph: {company_graph.number_of_nodes()}")
print(f"Number of edges in company graph: {company_graph.number_of_edges()}")
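
To see what the projection does, here is a toy example with made-up company names. By default, `weighted_projected_graph` sets each edge weight to the number of neighbors the two nodes share in the bipartite graph, which here is the number of shared technology types:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite graph: hypothetical companies on one side, tech types on the other
toy = nx.Graph()
toy.add_edges_from([
    ("Company A", "Healthcare AI Applications"),
    ("Company A", "Generative and Multimodal AI"),
    ("Company B", "Healthcare AI Applications"),
    ("Company B", "Generative and Multimodal AI"),
    ("Company C", "Generative and Multimodal AI"),
])
toy_companies = {"Company A", "Company B", "Company C"}

# Project onto the company side; weights count shared technology types
projected = bipartite.weighted_projected_graph(toy, toy_companies)

print(projected["Company A"]["Company B"]["weight"])  # 2
print(projected["Company A"]["Company C"]["weight"])  # 1
print(projected["Company B"]["Company C"]["weight"])  # 1
```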

### Graph Projection and Edge Trimming

The projected company graph links two companies whenever they share at least one technology type, and the edge weight counts the types they share. To focus on the stronger relationships, we'll trim weak connections:

In [None]:
def trim_edges(graph, percentile=20):
    """
    Removes edges from the graph that have weights below the specified percentile.

    Parameters:
    - graph (networkx.Graph): The input graph with weighted edges.
    - percentile (float): The percentile threshold below which edges will be removed.

    Returns:
    - networkx.Graph: The trimmed graph.
    """
    if not graph.edges(data=True):
        print("The graph has no edges to trim.")
        return graph

    weights = [data['weight'] for u, v, data in graph.edges(data=True)]
    threshold = np.percentile(weights, percentile)
    print(f"Trimming edges with weights below the {percentile}th percentile (threshold: {threshold:.4f})")

    trimmed_graph = nx.Graph()
    for u, v, data in graph.edges(data=True):
        if data['weight'] >= threshold:
            trimmed_graph.add_edge(u, v, **data)

    trimmed_graph.add_nodes_from(graph.nodes(data=True))
    return trimmed_graph
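
Projected edge weights are small integers, so many edges tie at the lowest weight. Since `trim_edges` keeps every edge whose weight is greater than or equal to the threshold, a low percentile may remove nothing at all. The sketch below illustrates this with made-up weights:

```python
import numpy as np

weights = [1, 1, 1, 2, 3]  # made-up edge weights with several ties at 1

low = np.percentile(weights, 10)   # threshold equals the minimum weight
high = np.percentile(weights, 80)  # threshold falls between 2 and 3

# Same rule as trim_edges: keep weights at or above the threshold
kept_low = [w for w in weights if w >= low]
kept_high = [w for w in weights if w >= high]

print(kept_low)   # [1, 1, 1, 2, 3]  (nothing removed)
print(kept_high)  # [3]
```

If trimming appears to have no effect on your graph, compare the printed threshold with the minimum edge weight and raise the percentile accordingly.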

In [None]:
# Trim edges with lowest weights before plotting
percentile = 10  # Adjust this value as needed (e.g., remove bottom 10% of edges)
print("\nTrimming low-weight edges...")
trimmed_company_graph = trim_edges(company_graph, percentile=percentile)
print(f"Number of nodes after trimming: {trimmed_company_graph.number_of_nodes()}")
print(f"Number of edges after trimming: {trimmed_company_graph.number_of_edges()}")

# Calculate centralities on the trimmed graph
print("\nCalculating centrality measures...")
degree_centrality = nx.degree_centrality(trimmed_company_graph)
betweenness_centrality = nx.betweenness_centrality(trimmed_company_graph)
eigenvector_centrality = nx.eigenvector_centrality(trimmed_company_graph, max_iter=1000)

# Combine centralities
combined_centrality = {
    node: (degree_centrality.get(node, 0) +
           betweenness_centrality.get(node, 0) +
           eigenvector_centrality.get(node, 0)) / 3
    for node in trimmed_company_graph.nodes()
}

# Sort companies by combined centrality
sorted_companies = sorted(combined_centrality.items(), key=lambda x: x[1], reverse=True)

# Select top N companies for visualization
N = 75  # Adjust as needed
top_companies = [company for company, _ in sorted_companies[:N]]

# Create a subgraph with only the top companies
subgraph = trimmed_company_graph.subgraph(top_companies).copy()
print(f"\nNumber of nodes in subgraph: {subgraph.number_of_nodes()}")
print(f"Number of edges in subgraph: {subgraph.number_of_edges()}")

### Calculating Centrality Measures

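We use three centrality measures to identify important companies in the network: degree, betweenness, and eigenvector centrality. The helper below bundles the same calculations used in the previous cell so they can be reused on other graphs: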

In [None]:
def calculate_centralities(graph):
    return {
        'degree': nx.degree_centrality(graph),
        'betweenness': nx.betweenness_centrality(graph),
        'eigenvector': nx.eigenvector_centrality(graph, max_iter=1000)
    }
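
As a quick sanity check of these measures, consider a small star graph, where the hub should rank highest on all three. This toy example is an illustration only and is not taken from the dataset:

```python
import networkx as nx

star = nx.star_graph(3)  # node 0 is the hub; nodes 1, 2, 3 are leaves

degree = nx.degree_centrality(star)
betweenness = nx.betweenness_centrality(star)
eigenvector = nx.eigenvector_centrality(star, max_iter=1000)

# Average the three measures, as in the combined score used in this tutorial
combined = {
    node: (degree[node] + betweenness[node] + eigenvector[node]) / 3
    for node in star.nodes()
}

print(degree[0], betweenness[0])            # 1.0 1.0
print(round(degree[1], 3), betweenness[1])  # 0.333 0.0
print(max(combined, key=combined.get))      # 0
```

The hub is connected to every other node (degree centrality 1.0) and lies on every shortest path between leaves (betweenness 1.0), while the leaves have no betweenness at all.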

## Visualization

Finally, we'll create a network visualization of the most central companies:

In [None]:
# Prepare for visualization
print("\nPreparing visualization...")
pos = nx.spring_layout(subgraph, k=0.3, iterations=50)  # Adjust layout parameters as needed
sizes = [combined_centrality[node] * 5000 for node in subgraph.nodes()]  # Scale node sizes
labels = {node: node for node in subgraph.nodes()}

# Visualize the network
plt.figure(figsize=(15, 20))  # Adjust figure size as needed
nx.draw_networkx_nodes(subgraph, pos, node_size=sizes, alpha=0.7, node_color='skyblue')
nx.draw_networkx_edges(
    subgraph, pos, alpha=0.2,
    width=[data['weight']/10 for _, _, data in subgraph.edges(data=True)],
    edge_color='gray'
)
nx.draw_networkx_labels(subgraph, pos, labels, font_size=10)

plt.title("Company Relationships based on Shared Technology Types", fontsize=20)
plt.axis('off')
plt.tight_layout()
plt.savefig('company_network.png', dpi=300, bbox_inches='tight')
plt.show()

Finally, we print the top companies with their technology type distribution and save the centrality scores:

In [None]:
# Print top N companies and their centralities
print(f"\nTop {N} Companies by Combined Centrality:")
for company, centrality in sorted_companies[:N]:
    print(f"{company}: {centrality:.4f}")

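# Calculate and print technology type distribution for top companies.
# Count from the extracted edges: B is a simple graph, so counting its
# neighbors would give every technology type a count of exactly 1.
top_company_set = set(top_companies)
tech_type_distribution = defaultdict(lambda: defaultdict(int))
for result in results:
    for edge in result['extracted_data'].get('edges', []):
        company, tech_type = edge.get('from'), edge.get('tech_type')
        if company in top_company_set and tech_type:
            tech_type_distribution[company][tech_type] += 1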

print(f"\nTechnology Type Distribution for Top {N} Companies:")
for company in top_companies:
    print(f"\n{company}:")
    total = sum(tech_type_distribution[company].values())
    sorted_tech = sorted(tech_type_distribution[company].items(), key=lambda x: x[1], reverse=True)
    for tech_type, count in sorted_tech:
        percentage = (count / total) * 100
        print(f"  {tech_type}: {percentage:.2f}%")

# Save centrality data
centrality_data = {
    'degree': degree_centrality,
    'betweenness': betweenness_centrality,
    'eigenvector': eigenvector_centrality,
    'combined': combined_centrality
}

with open('company_centralities.json', 'w') as f:
    json.dump(centrality_data, f, indent=2)

print("\nCentrality data saved to company_centralities.json")