# Star Wars Data Science
## Network Analysis, Topic Modeling, and a Wordcloud!
https://linkedin.com/in/dennisbakhuis

## 5. Star Wars Network analysis

We collected a total of 5334 characters, so wouldn't it be great to analyze the relations between them? As a first attempt at working with graph networks, I want to visualize the network around characters. For this we need nodes, which are the characters, and their relations, which are called edges in a graph. For example, Anakin Skywalker would have a relation called 'father of' to Luke Skywalker. Extracting the various kinds of relations would require extensive natural language processing to reduce the corpus to node-edge-node triples, which is far from trivial.

To make it a bit easier, we collapse all relations into a single kind of relation which we call 'connected to'. To find out whether a character is connected to another, we check whether its page contains a link to the other character's page. We expect, for example, that the page of Anakin Skywalker links to the page of Luke Skywalker. All these links were collected during the scraping process as a list which we call crosslinks.
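
As a rough illustration of this idea, the sketch below turns a tiny, made-up crosslink table into undirected 'connected to' pairs. The names `toy_crosslinks`, `toy_characters` and `connected_pairs` are hypothetical and only serve the example. Unlike the notebook code further down, this sketch also skips links from a page to itself.

```python
# Toy crosslink table: page key -> keys of the pages it links to (made-up data).
toy_crosslinks = {
    "Anakin_Skywalker": ["Luke_Skywalker", "Tatooine", "Obi-Wan_Kenobi"],
    "Luke_Skywalker": ["Anakin_Skywalker", "Lightsaber"],
    "Obi-Wan_Kenobi": ["Anakin_Skywalker", "Luke_Skywalker"],
}
toy_characters = set(toy_crosslinks)  # in this toy example only these pages are characters


def connected_pairs(crosslinks, characters):
    """Return the undirected 'connected to' edges implied by the crosslinks."""
    edges = set()
    for source, links in crosslinks.items():
        for target in links:
            # Skip non-character pages and links from a page to itself.
            if target in characters and target != source:
                edges.add(frozenset((source, target)))
    return edges


toy_edges = connected_pairs(toy_crosslinks, toy_characters)
for pair in sorted(sorted(edge) for edge in toy_edges):
    print(" <-> ".join(pair))
```

Storing each pair as a `frozenset` means that a link from Anakin to Luke and a link from Luke to Anakin count as the same edge.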

In [None]:
import pickle
from pathlib import Path
import urllib.parse
import collections


import pandas as pd
from tqdm import tqdm
import networkx as nx


import matplotlib.pyplot as plt
import seaborn as sns
from pyvis.network import Network
sns.set_context('notebook', font_scale=1.5, rc={'lines.linewidth': 2.5})

Let's start by preparing the raw data:

In [None]:
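# Load every scraped part and merge them into a single dict keyed by page name
files = sorted(Path('../Dataset').glob('*.pickle'))
data = {}
for fn in files:
    with open(fn, 'rb') as f:
        part = pickle.load(f)
    data.update(part)

def remove_url_shizzle(text):
    """Decode percent-escapes (e.g. %27) and strip single and double quotes."""
    return urllib.parse.unquote(text).replace('"', '').replace("'", '')

# Clean the page keys and the crosslinks in the same way so that they keep matching
cleaned = {}
for key, value in tqdm(data.items()):
    new_key = remove_url_shizzle(key)
    cleaned[new_key] = value
    cleaned[new_key]['crosslinks'] = [remove_url_shizzle(crosslink) for crosslink in value['crosslinks']]
data = cleaned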

characters = pd.read_parquet('../Dataset/StarWars_Characters.parquet')['key'].tolist()
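
To see what the cleaning step does to a single key, here is a small standalone sketch. `clean_key` is a hypothetical stand-in for the helper above, and the second input is a made-up key chosen only to show the quote stripping.

```python
from urllib.parse import unquote


def clean_key(text):
    """Decode percent-escapes, then drop single and double quotes."""
    return unquote(text).replace('"', "").replace("'", "")


print(clean_key("Padm%C3%A9_Amidala"))        # percent-encoded UTF-8 "é"
print(clean_key("Some_%22Nickname%22_Name"))  # made-up key containing quotes
```

Applying the same function to both the page keys and the crosslinks matters: a link only matches a character if both were cleaned identically.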


Now we can build a graph network, for which we will use the excellent networkx package. A graph is a mathematical structure that organizes data into nodes and edges. In our case, the nodes are the characters and the edges are the relations between them. Edges usually also carry a weight, which can be read as the strength of a bond or as a distance. These weights enable many analysis methods, such as finding the shortest path. As we cannot attach a measure to our relation (at least not that I know of from this data), all the weights are set to unity.
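
With every weight equal to one, the shortest path between two characters is simply the path with the fewest hops, which a breadth-first search finds. The following is a minimal sketch on a made-up graph; `toy_graph` and `shortest_path` are hypothetical names, and for the real network one would more likely reach for networkx's `nx.shortest_path`.

```python
from collections import deque

# Made-up toy graph as an adjacency mapping; every edge has weight 1.
toy_graph = {
    "Anakin": {"Obi-Wan", "Padme"},
    "Obi-Wan": {"Anakin", "Luke"},
    "Padme": {"Anakin"},
    "Luke": {"Obi-Wan", "Han"},
    "Han": {"Luke"},
}


def shortest_path(graph, start, goal):
    """Breadth-first search: return one fewest-hop path, or None if unreachable."""
    previous = {start: None}  # also doubles as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the chain of predecessors back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


path = shortest_path(toy_graph, "Padme", "Han")
print(path, "hops:", len(path) - 1)
```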

When building the network, we will scan the crosslinks of each character. These can point to other characters, but also to any other page. Therefore, we need to check if the crosslink is a character and ignore non-characters:

In [None]:
graph = nx.Graph()
character_set = set(characters)  # set membership is much faster than scanning the list
for key in tqdm(characters):
    for crosslink in data[key]['crosslinks']:
        # Only links that point to a character page become edges
        if crosslink in character_set:
            graph.add_edge(key, crosslink)

print(f'Nodes: {graph.number_of_nodes()}, Links: {graph.number_of_edges()}')

After the creation we end up with 4794 nodes. Apparently, about 500 characters do not have a single relation to another character. The characters that do have relations share almost 20k edges in total, or about 4 edges per node. Since every edge connects two characters, this corresponds to an average degree of roughly 8.
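
The difference between edges per node and average degree is easy to check on a small example. The sketch below uses a made-up edge list (`toy_edges` is hypothetical) and counts degrees by hand. Every edge contributes to the degree of both its endpoints, so the degrees always sum to twice the number of edges.

```python
from collections import Counter

# Made-up undirected edge list.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

degree = Counter()
for u, v in toy_edges:
    degree[u] += 1  # each edge adds one to the degree of both endpoints
    degree[v] += 1

n_nodes, n_edges = len(degree), len(toy_edges)
mean_degree = sum(degree.values()) / n_nodes
degree_histogram = Counter(degree.values())  # degree -> number of nodes with it

print(dict(degree))
print(f"edges per node: {n_edges / n_nodes:.2f}, mean degree: {mean_degree:.2f}")
print(sorted(degree_histogram.items()))
```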

Let's first have a look at the degree distribution, which shows the number of connections each node has to other nodes. A node of degree 1 has only one connection, while a node of degree 10 has 10 connections to other nodes. The code is shown below.

In [None]:
# Count how many nodes have each degree
degree_count = collections.Counter(d for _, d in graph.degree())
degree_df = pd.DataFrame(sorted(degree_count.items()), columns=['degree', 'count'])

fig, ax = plt.subplots(1, 1, figsize=(12, 8))
ax.set_facecolor("white")
# Only degrees below 32 are plotted; the long tail of hubs is left out
sns.barplot(x='degree', y='count', data=degree_df.loc[degree_df.degree < 32], ax=ax)
_, _ = ax.set_ylabel("Count"), ax.set_xlabel("Degree")
sns.despine()
ax.grid(False)

fig.savefig('../Assets/sw_network1.png', bbox_inches='tight')

Most of the nodes have only a few connections, which is no surprise for a dataset of 5334 characters: many characters make only a single appearance. Characters like Anakin Skywalker have connections to more than 100 other characters, and this is probably also the case for the other stars of Star Wars.

In [None]:
cut_off = 2

# Repeat until stable: removing a node lowers the degree of its neighbours,
# which can push them to or below the cut-off as well
while True:
    nodes_to_remove = [
        node for node, degree in graph.degree()
        if degree <= cut_off
    ]
    if not nodes_to_remove:
        break
    print(f'Removing {len(nodes_to_remove)} nodes')
    graph.remove_nodes_from(nodes_to_remove)
    print(f'Remaining Nodes: {graph.number_of_nodes()} Links: {graph.number_of_edges()}')

Repeatedly removing every character with two or fewer connections brings the number of characters down to just below 3000. The number of edges dropped only slightly and we still have about 17k of those. This of course makes sense, as we only removed low-degree nodes.
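
Why the removal has to be repeated becomes clear on a small example. The sketch below applies the same pruning to a made-up graph stored as a plain dict (`prune_low_degree` and `toy_graph` are hypothetical names). Node E survives the first round with three connections, but loses two of them when F and G disappear and is removed in the second round. With a cut-off of 2, what remains is what graph theory calls the 3-core.

```python
def prune_low_degree(graph, cut_off):
    """Repeatedly drop nodes with degree <= cut_off and return what survives."""
    graph = {node: set(neighbours) for node, neighbours in graph.items()}  # work on a copy
    while True:
        doomed = [node for node, neighbours in graph.items() if len(neighbours) <= cut_off]
        if not doomed:
            return graph
        for node in doomed:
            for neighbour in graph.pop(node):
                # The neighbour may already have been removed in this round.
                if neighbour in graph:
                    graph[neighbour].discard(node)


# Made-up graph: a tight group A-B-C-D with a small branch D-E-(F, G) hanging off it.
toy_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D", "F", "G"},
    "F": {"E"},
    "G": {"E"},
}

core = prune_low_degree(toy_graph, cut_off=2)
print(sorted(core))
```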

The power of graphs is that you can use a whole bag of tools to investigate a node's importance. One of these tools is called betweenness centrality. It measures how often a node lies on the shortest paths between other nodes, i.e. how much it serves as a bridge or relay station for information. In our example, we use betweenness centrality to find which characters link various other characters together, and hopefully find some key characters.
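
To make the definition concrete, here is a small pure-Python sketch that computes betweenness for an unweighted, undirected graph using Brandes' algorithm. The function `betweenness` and the graph `toy_graph` are hypothetical and only meant as an illustration. The scores are raw counts of node pairs whose shortest paths pass through a node, whereas networkx normalises its result by default, so the absolute numbers will differ.

```python
from collections import deque


def betweenness(graph):
    """Unnormalised betweenness centrality of an unweighted, undirected graph."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        # Breadth-first search that counts the shortest paths from `source`.
        order = []
        parents = {node: [] for node in graph}
        n_paths = dict.fromkeys(graph, 0)
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in graph[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    n_paths[neighbour] += n_paths[node]
                    parents[neighbour].append(node)
        # Walk back from the farthest nodes, passing path shares on to the parents.
        share = dict.fromkeys(graph, 0.0)
        for node in reversed(order):
            for parent in parents[node]:
                share[parent] += n_paths[parent] / n_paths[node] * (1 + share[node])
            if node != source:
                score[node] += share[node]
    # Every pair was visited from both ends, so halve the totals.
    return {node: value / 2 for node, value in score.items()}


# Made-up graph: two triangles that only meet in the node "Bridge".
toy_graph = {
    "A": {"B", "Bridge"},
    "B": {"A", "Bridge"},
    "Bridge": {"A", "B", "C", "D"},
    "C": {"Bridge", "D"},
    "D": {"Bridge", "C"},
}

scores = betweenness(toy_graph)
for name, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {value}")
```

Only "Bridge" gets a non-zero score here: the four pairs that cross from one triangle to the other all have to pass through it.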

In [None]:
# Betweenness centrality per node, returned as a {name: score} dict
bc = nx.betweenness_centrality(graph)

bc_df = pd.DataFrame({
    'name': list(bc.keys()),
    'betweenness': list(bc.values()),
}).sort_values('betweenness', ascending=False)
bc_df.head(10)

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

ax.set_facecolor("white")
sns.barplot(x='name', y='betweenness', data=bc_df.iloc[:10], ax=ax)

_, _ = ax.set_ylabel("Betweenness centrality"), ax.set_xlabel(None)
plt.xticks(rotation=60, ha='right')
sns.despine()
ax.grid(False)

fig.savefig('../Assets/sw_network2.png', bbox_inches='tight')

No surprises here. I had to look up Ezra Bridger, but he was very prominent in the Star Wars Rebels series. There are many other algorithms, such as closeness centrality or PageRank, but these gave similar results.

To play a bit more with the graph network, we will visualize it using PyVis. PyVis is built around VisJS and makes it extremely easy to plot these graphs and play around with the result. There is, however, one thing we need to take care of: we have far too many nodes and edges. While PyVis seems to have no problem with large numbers of them, the overlap of all the edges makes the plot not very interesting. Therefore, we need an algorithm that selects a subgroup from our network.

The next piece of code needs three parameters: a starting node, a maximum level of penetration, and a maximum number of crosslinks. The algorithm starts from the starting node and selects at most the maximum number of neighboring nodes. If more neighbors are available than the maximum number of crosslinks, it selects the ones whose own pages contain the most crosslinks, which serves as a stand-in for the largest degree. Next, the algorithm jumps to the selected nodes (this is a level increase) and repeats the process until the maximum level is reached. The number of selected nodes grows exponentially with the maximum level, so be careful with that parameter.

The result is an interactive web page that can be viewed directly in the Jupyter notebook or in any web browser.

👉 [Interactive network graph of Anakin Skywalker!](https://dennisbakhuis.github.io/wookieepediascience/)
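
Before wiring this into PyVis, the selection procedure can be sketched on a tiny, made-up link table. The names `toy_links`, `select_subgraph` and `max_nodes` are hypothetical. The sketch only collects `(source, target, level)` edges, and `max_nodes` gives the worst-case number of nodes for a given setting.

```python
# Made-up link table: character -> linked characters.
toy_links = {
    "Anakin": ["Obi-Wan", "Padme", "Palpatine", "Luke"],
    "Obi-Wan": ["Anakin", "Luke", "Yoda"],
    "Padme": ["Anakin", "Palpatine"],
    "Palpatine": ["Anakin"],
    "Luke": ["Anakin", "Obi-Wan", "Yoda", "Han"],
    "Yoda": ["Obi-Wan", "Luke"],
    "Han": ["Luke"],
}


def select_subgraph(links, start, max_level, n_crosslinks):
    """Collect (source, target, level) edges by expanding the best-linked neighbours."""
    edges = []

    def expand(key, level, ignore):
        if level >= max_level:
            return
        candidates = [k for k in links[key] if k not in ignore]
        # Rank by how many links the neighbour's own page has, highest first.
        chosen = sorted(candidates, key=lambda k: len(links[k]), reverse=True)[:n_crosslinks]
        for nxt in chosen:
            edges.append((key, nxt, level + 1))
            # Siblings and the current node may not be selected again further down.
            expand(nxt, level + 1, ignore | set(chosen) | {key})

    expand(start, 0, frozenset())
    return edges


def max_nodes(max_level, n_crosslinks):
    """Worst case: 1 + n + n**2 + ... + n**max_level nodes."""
    return sum(n_crosslinks ** level for level in range(max_level + 1))


edges = select_subgraph(toy_links, "Anakin", max_level=2, n_crosslinks=2)
for source, target, level in edges:
    print(f"level {level}: {source} -> {target}")
print("worst case for max_level=2, n_crosslinks=15:", max_nodes(2, 15))
```

Going from two to three levels with 15 crosslinks raises the worst case from 241 to 3616 nodes, which is why the maximum level is the parameter to be careful with.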

### Interactive Network plot using PyVis

In [None]:
def get_crosslink_table(key, n=30, ignore_keys=()):
    """Return up to n character keys linked from `key`, ranked by the crosslink count of their own page."""
    result = [
        {'key': link, 'n_links': len(data[link]['crosslinks'])}
        for link in data[key]['crosslinks']
        if link in characters and link not in ignore_keys
    ]
    if not result:  # nothing left to rank; an empty DataFrame would have no 'key' column
        return []
    return pd.DataFrame(result).sort_values('n_links', ascending=False)['key'].head(n).tolist()

In [None]:
level_colors =  {
    0:'#7A84DD',
    1:"#B15B60", 
    2:'#8ACAE5', 
    3:'#BD9267', 
    4:'#F1A54D', 
    5:'#020104',
}

def add_node(net, key, level, max_level=2, n_crosslinks=10, ignore_keys=()):
    """Add `key` to the PyVis network `net` and recurse into its top crosslinks."""
    ignore_keys = list(ignore_keys)
    label = key.replace('_', ' ')
    # Hover text: the first paragraph of the page, rendered as a small HTML table
    char = [{'name': label, 'description': data[key]['paragraph'].strip()}]
    textblock = pd.DataFrame(char).to_html(header=False, index=False, columns=['description'])
    net.add_node(
        label,
        title=textblock,
        size=10,
        color=level_colors[level],
        label=label,
    )
    if level < max_level:
        next_nodes = get_crosslink_table(key, n=n_crosslinks, ignore_keys=ignore_keys)
        # Siblings, the current node and everything already ignored stay off limits further down
        ignore_next = next_nodes + [key] + ignore_keys
        for next_key in next_nodes:
            add_node(net, next_key, level + 1, max_level=max_level, n_crosslinks=n_crosslinks, ignore_keys=ignore_next)
            next_label = next_key.replace('_', ' ')
            net.add_edge(
                label,
                next_label,
                weight=max_level / (1 + level),
                title=label+' -> '+next_label,
                width=1.5,
            )

max_level = 2
n_crosslinks = 15
start_key = 'Anakin_Skywalker'

# G = Network(height="1000px", width="100%", bgcolor="#222222", font_color="white",notebook=True)
G = Network(height="1000px", width="100%", bgcolor="#000000", font_color="white",notebook=True)
add_node(G, start_key, 0, max_level=max_level, n_crosslinks=n_crosslinks)

In [None]:
G.barnes_hut(gravity=-5000, central_gravity=0, spring_length=200, spring_strength=0.009, damping=0.025, overlap=0)
G.show('../Docs/index.html')

Each node can be dragged with the mouse and will pull its connections along with it. Hovering over a node shows the first paragraph of its Wookieepedia page, which makes it easy to read who that character is. The color of a node indicates its level. The color of an edge is the color of its source node. Feel free to investigate other characters as well!

## Round up
I had a lot of fun playing around with this dataset. There is still a lot more possible, as I barely scratched the surface. But even with some straightforward methods we found quite a few cool things.

'this is the way' - Din Djarin

Please let me know if you have any comments! [Feel free to connect on LinkedIn](https://linkedin.com/in/dennisbakhuis).