<div style="text-align: center;" >
<h1 style="margin-top: 0.2em; margin-bottom: 0.1em;">Network Analysis in Python</h1>
<h4 style="margin-top: 0.7em; margin-bottom: 0.3em; font-style:italic">Investigating relationships between entities.</h4>
</div>
<br>

This tutorial notebook is about network analysis. We are going to learn what a network is, which options exist for modeling one, and how to do all of that in Python.

## __Structure__

1. Abstract Data Structures 

2. Graph Theory

    2.1 Network Theory


3. The NetworkX package

    3.1 Building a Graph

    3.2 Directed Graphs
    
    3.3 Multi Graphs
    
    3.4 Layout Options

4. Interactive Example


## __1. Abstract Data Structures__

Recap: 
- We already got to know different types of data. The primitive data types are boolean, bytes, integer, floating-point numbers etc.
- We also know different operations that can be performed (depending on the data type) like addition, multiplication, division, smaller/greater, equals or not, modulo, logical or bitwise operations.



#### Data Structures: 

*Data types* together with *operations defined on these data* that enable and realize access to and management of these data.


**Examples** of abstract data structures we already know (from the lecture):
* Arrays: systematic arrangement, usually in rows and columns.
* ((Doubly/Circular) Linked) Lists: Linear arrangement of data elements (called nodes), each with a pointer to the next node.
* Dictionaries: collection of key-value pairs, where each key is unique (appears at most once in the collection).
* Binary Trees: either empty or a structure that consists of one node with at most two children (the left and the right child), which are binary trees themselves.
* Hash Tables

Some **new abstract data structures** are:
* Queues: linear data structure, following the First-In-First-Out principle (FIFO)
    * Operations are enqueue and dequeue
* Stacks: linear data structure, following the Last-In-First-Out principle (LIFO)
    * Operations are push and pop
* **Graphs**: very intuitive way of modeling networks or relations (interdependencies) between entities
    * graphs consist of vertices and edges
    * Vertices: represent any kind of entity and can be labeled or not, e.g. cities
    * Edges: represent the relationship (labeled or not, directed or undirected) between the vertices, e.g. roads between cities
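
The queue and stack operations named above can be sketched with Python's standard library. This is only a minimal illustration: `collections.deque` is one common choice for a queue and a plain list for a stack, and the element names are made up.

```python
from collections import deque

# Queue (FIFO): enqueue at the right end, dequeue from the left end
queue = deque()
queue.append("first")      # enqueue
queue.append("second")
queue.append("third")
served = queue.popleft()   # dequeue: returns the element that entered first

# Stack (LIFO): push and pop both happen at the same end
stack = []
stack.append("first")      # push
stack.append("second")
stack.append("third")
top = stack.pop()          # pop: returns the element that entered last

print(served, top)
```

Graphs need more than the standard library offers out of the box, which is why the rest of this notebook relies on a dedicated package.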

## __2. Graph Theory__

*Graph Theory* is the study of graphs, which are data structures to model relationships between objects or entities.

Formally, a graph is defined as G = (V, E), where V is the set of all vertices and E the set of all edges.
* If we have a **directed graph**, the tuples in E are ordered pairs (v_1, v_2) (--> edge from v_1 to v_2). 
* If the graph is **undirected**, the tuples are unordered, i.e. {v_1, v_2} 

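As a data structure, a graph is usually represented by an **adjacency matrix** or an **adjacency list**:

* Adjacency matrix: the entry in a given row and column marks an edge from the row vertex to the column vertex (read from row to column)
* Adjacency list: for every vertex, the list of vertices it is connected to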
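
Both representations can be built by hand in a few lines. The following is a minimal sketch in plain Python for a made-up directed graph with three vertices; the vertex names and edges are assumptions chosen only for illustration.

```python
# A small directed graph, given as a list of (source, target) edges
vertices = ["A", "B", "C"]
edges = [("A", "B"), ("A", "C"), ("C", "B")]

# Adjacency list: every vertex maps to the vertices its outgoing edges point to
adjacency_list = {v: [] for v in vertices}
for source, target in edges:
    adjacency_list[source].append(target)

# Adjacency matrix: entry [row][column] is 1 if there is an edge row -> column
index = {v: i for i, v in enumerate(vertices)}
adjacency_matrix = [[0] * len(vertices) for _ in vertices]
for source, target in edges:
    adjacency_matrix[index[source]][index[target]] = 1

print(adjacency_list)
for row in adjacency_matrix:
    print(row)
```

For an undirected graph, every edge would be entered in both directions, which makes the matrix symmetric.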


#### __2.1 Networks__

*Network Theory* is a part of graph theory, in which networks are modeled as graphs:

- A collection of interconnected entities (the nodes)
- Edges express the relationship between these entities
- Can be found in many contexts 
    * Social science: Social Media Networks or Friendships $\to$ group dynamics/find key influencers, behaviour adaption, spread of ideas/knowledge
    * Transportation: Streets/Railways connecting cities $\to$ what are efficient routes, what are bottlenecks
    * Technology: Laptops/technical devices being connected over e.g. the internet $\to$ understand the structure of an organisation
    * Biological systems: Cells or proteins and their (chemical) interactions

- Networks can be analysed with respect to 
    * the type of connection (directed/undirected, weighted, how many edges?) 
    * how information/ideas are spread: who is influencing whom?
    * the number of nodes they contain: how does the network change over e.g. time (new nodes coming in, old ones dropping out, how do the patterns change)?
    * the pattern: how close are nodes, are there communities or groups of individuals?

- Analyzing networks can be done by
    * network visualization: overall structure/organisation of the network (our focus today)
    * centrality measures: how central/important is a certain node? (learn more e.g. in "Social Media Data Analysis" next semester)


Graphic representation, Adjacency Matrix (read row to column), Adjacency List

![title](src/Network_Illustration.png)
![title](src/Adjacency_Matrix.png)
![title](src/Adjacency_List.png)


## __3. The NetworkX package__

The package we are going to use for our network analysis in Python is `networkx`. See the [NetworkX tutorial](https://networkx.org/documentation/stable/tutorial.html) for further information.

In [None]:
# uncomment & run, in case you haven't installed it yet

#!pip install networkx

In [None]:
# Import the required packages
import networkx as nx
import pandas as pd
import numpy as np

# and later, for plotting
import matplotlib.pyplot as plt
import matplotlib

#### __3.1 Building a Graph:__

In order to create a network, we start off by initializing an empty network without nodes or edges.

In [None]:
G = nx.Graph()

# empty graph
nx.draw(G)

__Adding nodes:__

Since a network without nodes and edges is kind of useless and boring, we should start by adding nodes. 
Let's assume we are modelling a group of friends.

Adding nodes (people) can be done either node by node...

In [None]:
# Adding a first node
G.add_node('Peter')

# draw
nx.draw(G, with_labels = True)

In [None]:
# Adding a second node
G.add_node('Thomas')

nx.draw(G, with_labels = True)

...or by adding the nodes from an iterable, like a list:

In [None]:
# adding multiple nodes (people) at once (from an iterable element, like a list)
G.add_nodes_from(['Peter', 'Thomas', 'Anna'])

nx.draw(G, with_labels = True)

Using the `.nodes` attribute, we can see the nodes currently present in a graph.

In [None]:
# View nodes
G.nodes

__Adding attributes:__

Since nodes represent entities, they can have certain characteristics: e.g. politicians have a party affiliation, people have a gender, age etc.
We might want to store these attributes/metadata along with the respective node.

These attributes can also be passed in iterables of 2-tuples, where the second element is a dictionary of attributes: `[(node, attribute_dict), (node, attribute_dict), (node, attribute_dict), ...]`

In [None]:
# a new node with attribute
G.add_node('Tabea', age='27')

# adding an attribute to an existing node
G.nodes['Thomas']['gender'] = 'male'

# attributes for several nodes
G.add_nodes_from([('Peter', {"gender": "male", "age" : "23"}), ('Anna', {"gender": "female"})])

# get information on all nodes with attributes
# Peter, Thomas and Anna have a gender; Peter and Tabea have an age
G.nodes.data()

#G.nodes['Peter']
#G.nodes(data = True) # equivalent

In [None]:
# nodes can also have multiple attributes
# also adding Tabea's gender
G.nodes['Tabea']['gender'] = 'female'

# adding a favorite color of every person
G.add_nodes_from([('Peter', {"color": "blue"}), ('Thomas', {'color' : 'purple'}), ('Anna', {"color": "green"}), ('Tabea', {'color' : 'orange'})])


G.nodes.data()

In [None]:
G.nodes()

color_list = [c for n,c in G.nodes(data='color')]
color_list

In [None]:
# in order to visualize the people in their favorite color, we first need to extract the colors from the nodes attributes into a list

nx.draw(G, node_color = [c for n,c in G.nodes(data='color')], with_labels = True)

In [None]:
# Get nodes directly connected to another

G.adj['Peter']  # or list(G.neighbors('Peter'))

# empty, we have to add edges!

__Adding edges:__

Well, usually we will construct networks to visualize relationships. So we need to add edges, representing these connections.
In our example, we need to add the relationships between the people.

Like nodes, edges can also be added to the graph step by step:

In [None]:
# add an edge between Thomas and Peter
G.add_edge('Thomas', 'Peter')

nx.draw(G, node_color = [c for n,c in G.nodes(data='color')], with_labels = True) 

Of course, edges can have attributes too, for example a weight indicating how strong the connection is, or a label describing the kind of relationship between people.

In [None]:
# edge with an attribute indicating how close people (weight) are and what their relationship is (label)

G.add_edge('Thomas', 'Peter', weight = 8, label = 'Kindergarden Friends')
G.add_edge('Thomas', 'Anna', weight = 30, label = 'Siblings')

nx.draw(G, node_color = [c for n,c in G.nodes(data='color')], with_labels = True)

***
### **Note on Edge Weights**

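In the plot above, the edge weights are not yet visible. This is because we did not pass an explicit layout via `pos`, and the edges are drawn without labels or weight-dependent widths.

To my knowledge, `nx.draw` then falls back to a spring layout with random starting positions that is recomputed on every call, so the node positions change from plot to plot.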
***




To visualize the relationships with their attributes, we need some extra steps:

In [None]:
# inspect edges
G.edges(data = True)

In [None]:
# First, we need to extract the relationship labels for plotting into a separate dictionary
# (loop over edges (consisting of person1, person2, attribute_dict) and extract p1 and p2 together with their relation (a['label']))

edge_labels = dict([((p1, p2), a['label']) for p1, p2, a in G.edges(data=True)])

edge_labels
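
As an aside, NetworkX also ships a helper for this extraction, `nx.get_edge_attributes`. The sketch below rebuilds the two labeled edges in a separate graph (the name `G_demo` is an assumption, chosen so that the notebook's `G` stays untouched) and compares both ways of collecting the labels.

```python
import networkx as nx

# Stand-alone copy of the two labeled edges from above
G_demo = nx.Graph()
G_demo.add_edge('Thomas', 'Peter', weight=8, label='Kindergarden Friends')
G_demo.add_edge('Thomas', 'Anna', weight=30, label='Siblings')

# Manual version: loop over (node_1, node_2, attribute_dict) triples
manual_labels = {(p1, p2): a['label'] for p1, p2, a in G_demo.edges(data=True)}

# Built-in helper: returns {(node_1, node_2): value} for edges that carry the attribute
builtin_labels = nx.get_edge_attributes(G_demo, 'label')

print(builtin_labels)
```

One difference worth knowing: the helper silently skips edges without a `label`, whereas the manual loop raises a `KeyError` for them.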

***
### **Visualize Edge Weights:**

In order to visualize the edge weights, we now define an explicit layout, namely the spring layout (read more [here](https://networkx.org/documentation/latest/reference/generated/networkx.drawing.layout.spring_layout.html)).

This layout does take weights into account. More specifically, the **weights influence the distance of nodes inversely**: the larger the weight of an edge, the more strongly its two nodes are pulled together. This means:
* Thomas and Anna, as siblings, have a weight of e.g. 30, so they are pulled together strongly and end up relatively close
* compared to that, Thomas and Peter, whose weight is 8, are pulled together less strongly and tend to end up further apart


***
Be aware that this changes for other layout algorithms, e.g. [kamada_kawai_layout](https://networkx.org/documentation/stable/reference/generated/networkx.drawing.layout.kamada_kawai_layout.html). ([fruchterman_reingold_layout](https://networkx.org/documentation/networkx-1.11/reference/generated/networkx.drawing.layout.fruchterman_reingold_layout.html), despite its different name, is the same algorithm as `spring_layout` and treats weights in the same way.)

Here, the **weights are interpreted as actual distance**, meaning a higher weight places nodes further apart.
To use edge weights inversely in such a layout, you'll need to explicitly define this: for example, by extracting an edge weight and mapping its inverse to a new edge attribute 'distance', which can then be passed as the weight parameter of the layout.

```
# store the inverse of each edge weight as a new edge attribute 'distance'
for node_1, node_2, attributes in G.edges(data = True):
   G[node_1][node_2]['distance'] = 1 / attributes['weight']
```
***

In [None]:
# Second, we need to define a fixed layout, to ensure that the plotting of the 
# networks nodes and edge-labels happens in the same position

layout = nx.spring_layout(G, weight='weight') # taking a layout that can consider weights

# plot network
nx.draw(G, pos = layout, node_color = [c for n,c in G.nodes(data='color')], with_labels = True)
# add edge labels
nx.draw_networkx_edge_labels(G, pos = layout, edge_labels = edge_labels, font_size=8)

Again, we can also use an iterable object of 2-tuples (the two nodes that shall be connected) or 3-tuples (having an additional attribute dictionary) to add edges:

In [None]:
# adding edges from iterables
#G.add_edges_from([('Peter', 'Anna'), ('Peter', 'Tabea')])

# and with attributes
G.add_edges_from([('Peter', 'Anna', {'weight': 70, 'label' : 'Couple'}), ('Peter', 'Tabea', {'weight': 1, 'label' : 'Ex-Partner'})])


# update relations-dictionary
edge_labels = dict([((p1, p2), a['label']) for p1, p2, a in G.edges(data=True)])

# plot
nx.draw(G, pos = layout, node_color = [c for n,c in G.nodes(data='color')], with_labels = True)
nx.draw_networkx_edge_labels(G, pos = layout, edge_labels = edge_labels, font_size=8)

In [None]:
# inspecting edges
G.edges()



print('Number of edges incident to node Peter: ', end='')
print(G.degree['Peter'])  # the number of edges incident to the node 'Peter' only
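
The degree can also take edge weights into account. The sketch below rebuilds the friendship edges in a separate graph (the name `friends` is an assumption, so the notebook's `G` is not modified) and compares the plain degree with the weighted one.

```python
import networkx as nx

# Stand-alone copy of the weighted friendship edges used above
friends = nx.Graph()
friends.add_edge('Thomas', 'Peter', weight=8)
friends.add_edge('Thomas', 'Anna', weight=30)
friends.add_edge('Peter', 'Anna', weight=70)
friends.add_edge('Peter', 'Tabea', weight=1)

n_edges = friends.degree['Peter']                     # counts the incident edges
strength = friends.degree('Peter', weight='weight')   # sums the weights of these edges
all_strengths = dict(friends.degree(weight='weight'))  # weighted degree of every node

print(n_edges, strength)
print(all_strengths)
```

The weighted degree is sometimes called the strength of a node: Peter has three connections, but most of his total weight comes from the edge to Anna.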


__Removing nodes or edges:__

In general, this works analogously to adding nodes or edges.

Tabea is quite far from the rest of the group. She and Peter broke off contact years ago, when Peter met Anna. Therefore, she is not relevant anymore and we can remove her from the network.

In [None]:
# remove a single node
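G.remove_node('Tabea')

# keep the edge labels in sync with the graph (the Peter-Tabea edge is gone)
edge_labels = dict([((p1, p2), a['label']) for p1, p2, a in G.edges(data=True)])

# plot
nx.draw(G, pos = layout, node_color = [c for n,c in G.nodes(data='color')], with_labels = True)
nx.draw_networkx_edge_labels(G, pos = layout, edge_labels = edge_labels, font_size=8)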

In [None]:
# removing all nodes and edges
G.clear()

In [None]:
nx.draw(G)
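
Besides `remove_node` and `clear`, there are removal counterparts to the other adding functions as well. A minimal sketch on a separate graph (the name `group` and its unlabeled edges are assumptions for illustration):

```python
import networkx as nx

group = nx.Graph()
group.add_edges_from([('Peter', 'Thomas'), ('Thomas', 'Anna'),
                      ('Peter', 'Anna'), ('Peter', 'Tabea')])

group.remove_edge('Peter', 'Anna')              # drops one edge, keeps both nodes
group.remove_node('Tabea')                      # drops the node and all edges attached to it
group.remove_edges_from([('Thomas', 'Anna')])   # several edges at once
group.remove_nodes_from(['Anna'])               # several nodes at once

print(group.nodes)
print(group.edges)
```

Note that removing an edge never removes its end nodes, while removing a node always removes its edges.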

#### __3.2 Directed Graphs:__

[NetworkX](https://networkx.org/documentation/stable/reference/classes/digraph.html) also enables us to build directed graphs. This means the edges don't just connect two nodes (without a direction), but are an arrow pointing from the first node to the second one.

For example, 
* While Peter is now in love with Anna (and vice versa), Tabea might still have feelings for him, but he does not have feelings for her.
* If we were to model a Twitter network in terms of followers, for example, we would use directed graphs as a person A can follow B without B also following A.

To see how to build such directed graphs, we now want to model a supply chain.

In [None]:
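# relabel_nodes returns a relabeled copy by default and leaves G itself unchanged
# (note that G is empty at this point, because it was cleared above)
nx.relabel_nodes(G, {"Peter" : "PETER"})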

In [None]:
# initialize an empty directed graph
DG = nx.DiGraph()

# add edges (will also automatically add the nodes)
DG.add_weighted_edges_from([('Supplier1', 'Production', 1), ('Supplier2', 'Production', 1), ('Production', 'Retailer', 1.5), ('Retailer', 'Customer', 1000)])


layout = nx.random_layout(DG, seed = 28) # Set a seed to fix the random layout for reproducibility --> in the random layout, weights are not considered

nx.draw(DG, pos = layout, with_labels = True)
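
In a directed graph, the direction also shows up when querying neighbors and degrees. The sketch below rebuilds the supply chain in a separate graph (the name `chain` is an assumption) and inspects the node 'Production'.

```python
import networkx as nx

chain = nx.DiGraph()
chain.add_weighted_edges_from([('Supplier1', 'Production', 1), ('Supplier2', 'Production', 1),
                               ('Production', 'Retailer', 1.5), ('Retailer', 'Customer', 1000)])

incoming = list(chain.predecessors('Production'))   # nodes with an edge into 'Production'
outgoing = list(chain.successors('Production'))     # nodes that 'Production' points to
in_deg = chain.in_degree('Production')
out_deg = chain.out_degree('Production')

# An edge only exists in the direction in which it was added
forward = chain.has_edge('Production', 'Retailer')
backward = chain.has_edge('Retailer', 'Production')

print(incoming, outgoing)
print(in_deg, out_deg, forward, backward)
```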

#### __3.3 Multi Graphs:__

With [Multi Graphs](https://networkx.org/documentation/stable/reference/classes/multigraph.html) you can add several, distinct edges connecting the same two nodes.

[Multi Directed Graphs](https://networkx.org/documentation/stable/reference/classes/multidigraph.html) allow several, directed edges between two nodes.

Visualizing these different edges can get quite tricky.



In [None]:
MG = nx.MultiGraph()

MG.add_edges_from([(1, 2), (1, 2), (2, 3), (1, 3)])

print(MG.edges) # there actually exist two edges between 1 and 2

nx.draw(MG, with_labels = True)
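
Parallel edges are told apart by a key. The following sketch uses a separate multigraph (the name `contacts`, the key 'work' and the label are assumptions for illustration) to show how to count parallel edges and how to give one of them its own key and attributes.

```python
import networkx as nx

contacts = nx.MultiGraph()
contacts.add_edges_from([(1, 2), (1, 2), (2, 3), (1, 3)])

# Number of parallel edges between node 1 and node 2
parallel = contacts.number_of_edges(1, 2)
print(parallel)

# Edges including their keys (0, 1, ... unless you pass your own key)
print(list(contacts.edges(keys=True)))

# A third edge between the same two nodes, with a custom key and an attribute
contacts.add_edge(1, 2, key='work', label='Colleagues')
work_edge = contacts[1][2]['work']
print(work_edge)
```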

#### __3.4 Layout Options__

When visualizing the networks we constructed, a useful parameter is the `pos` parameter, where you can pass the node positions computed by a layout.
The layout is an algorithm that determines the position of each node in the drawing.
If `pos` is not specified, `nx.draw` computes a spring layout from random starting positions (as a result, graphs may look different with every code re-run).

Layout options are: [more details](https://networkx.org/documentation/stable/reference/drawing.html)
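* `shell_layout`: positions nodes in concentric circles
* `circular_layout`: positions nodes on a circle
* `planar_layout`: positions nodes without edge intersections (only possible for planar graphs)
* `kamada_kawai_layout`: positions nodes using the Kamada-Kawai path-length cost function -> good for community detection
* `spring_layout`: positions nodes using the Fruchterman-Reingold force-directed algorithm
* `spectral_layout`: positions nodes using the eigenvectors of the graph Laplacian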

Usually, you will need to research an appropriate layout for your current analysis or play around with different layouts until you reach an insightful plot.
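
Every layout function returns a dictionary that maps each node to its coordinates, and this dictionary is what `pos` expects. A minimal sketch on a made-up ring of five nodes (the graph and the variable names are assumptions; the seed only fixes the spring layout's random start):

```python
import networkx as nx

ring = nx.cycle_graph(5)   # toy graph: five nodes connected in a circle

layouts = {
    'circular': nx.circular_layout(ring),
    'shell': nx.shell_layout(ring),
    'spring': nx.spring_layout(ring, seed=28),
}

# Each layout is a dict {node: array([x, y])}
for name, pos in layouts.items():
    print(name, {node: xy.round(2).tolist() for node, xy in pos.items()})

# Any of them can be passed to the drawing function, e.g.:
# nx.draw(ring, pos=layouts['circular'], with_labels=True)
```

Computing the layout once and reusing the dictionary is also what keeps nodes and edge labels aligned when they are drawn in separate calls.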

***

## __4. Interactive Tutorial Part__

In this part we are going to investigate the network structure behind Swiss politicians' tweets.

#### a) Import the data


Import the two provided csv-files containing 
1. the user profiles
2. the twitter timeline between 2021-07-12 and 2022-07-12

Make sure to load the columns 
* "id" 
* "author_id" and 
* "retweet_user_id" 
as strings (loading them as integers might cause issues).

*Hint:* the `converters` argument in pd.read_csv might be helpful

In [None]:
# import packages
import networkx as nx
import pandas as pd

In [None]:
# Let's get active

users = pd.read_csv("data/users.csv", converters={'id': str})
timelines_all = pd.read_csv("data/timelines.csv", converters = {'author_id' : str, 'retweet_user_id':str})

In [None]:
users

In [None]:
timelines_all

#### b) Filter out the irrelevant tweets from the timelines dataframe. 

Only those tweets should remain in the dataframe, where the "retweet_user_id" corresponds to the "user_id" of one of the politicians. \
*Hint:* the `.isin()` function might come in handy and you can find the "user_id" of the politicians in the "users" dataframe.
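
How `.isin()` produces a filter can be seen on toy data. The two small dataframes below are made up (all ids and names are assumptions) and only mimic the relevant columns; this is a sketch of the mechanism, not of the real data.

```python
import pandas as pd

# Toy stand-ins for the two dataframes
toy_users = pd.DataFrame({'id': ['1', '2', '3'], 'username': ['a', 'b', 'c']})
toy_timelines = pd.DataFrame({
    'author_id':       ['1', '1', '2', '3'],
    'retweet_user_id': ['2', '99', '3', ''],
})

# Boolean mask: True where the retweeted account is one of the known users
mask = toy_timelines['retweet_user_id'].isin(toy_users['id'])

# Keep only the rows where the mask is True
toy_retweets = toy_timelines[mask]
print(toy_retweets)
```

The comparison only works if both columns have the same type, which is one reason for loading the ids as strings.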

In [None]:
# Let's get active


#### c) Construct a list of vertices (nodes) 

Each node should carry the user id, the screen name (username), and the political party label of the politician.

The node list should have the form:

[('25254764', {'username': 'andreaskirstein', 'party': 'AL'}),
('472372843', {'username': 'bergerwthur', 'party': 'AL'}),
...]

In [None]:
# From looking at the df
users
# we can see that we are interested in the columns: id, username, party

In [None]:
# We can access them as
users['id']

# and one entry
users['id'].iloc[100]

Utilize this to iterate over each user, and extract his/her id, username and party affiliation into the desired format (shown above).
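
One possible way to iterate over several columns at once is `zip`. The sketch below uses a toy dataframe that contains only the two example rows shown above; the names `toy_users` and `toy_vertices` are assumptions, and other approaches (e.g. looping over the index with `.iloc`) work just as well.

```python
import pandas as pd

toy_users = pd.DataFrame({
    'id': ['25254764', '472372843'],
    'username': ['andreaskirstein', 'bergerwthur'],
    'party': ['AL', 'AL'],
})

# zip walks through the three columns in parallel, one user per step
toy_vertices = [
    (user_id, {'username': username, 'party': party})
    for user_id, username, party in zip(toy_users['id'], toy_users['username'], toy_users['party'])
]
print(toy_vertices)
```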

In [None]:
# Let's get active
vertices = []




#### d) Then build a list of connections between politicians (the edges), 

Every edge is a pair of two users that exchanged *at least* one retweet with each other (*regardless* of the direction). 

The edge list should have the form: 

[('25254764', '2353332248'),
('25254764', '778497337'),
...]


__Think for a moment:__

How would you conceptually approach this task?

* Where do you need to loop over? 
* What do you need to compare? 
* How can you ensure each pair is only listed once (independent of the order)?
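
For the last question, one common trick is to bring every pair into a canonical order before storing it in a set. The sketch below shows the idea on made-up retweet records; the ids, the variable names and the decision to skip self-retweets are assumptions.

```python
# Toy retweet records as (author_id, retweet_user_id)
toy_retweets = [('1', '2'), ('2', '1'), ('1', '3'), ('1', '2'), ('3', '3')]

unique_pairs = set()
for author, retweeted in toy_retweets:
    if author == retweeted:
        continue  # skip self-retweets (an assumption; drop this check if self-loops are wanted)
    # sorting makes ('2', '1') and ('1', '2') the same tuple
    pair = tuple(sorted((author, retweeted)))
    unique_pairs.add(pair)  # a set stores every pair only once

toy_edges = sorted(unique_pairs)
print(toy_edges)
```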

In [None]:
# New list, to store the unique retweet-relations, the edges
edges = []

# Let's get active




#### e) Create an empty graph object, and then add the nodes and the edges (from the lists you created).

In [None]:
# Let's get active


#### f) Plot the network. 

Make sure to color the nodes according to the political party label of the politician and add a legend to the plot.

*Hint:* use the optional function parameters `nodelist` and `node_color` to pass a list of nodes and a list of corresponding colors to the drawing function.


In [None]:
# Import packages
import matplotlib.pyplot as plt
import matplotlib

##### 1) Color-Dictionary depending on party

In [None]:
# Dictionary with the party colors (manually defined)
colors = {'AL': "firebrick", 'BDP': 'yellow', 'CVP' : "orange", 'EDU': "red", 'EVP': "gold", 
          'FDP': 'deepskyblue', 'GLP' : "limegreen", 'Green' : "greenyellow", 'SP' : "coral", 
          'SVP' : "seagreen", 'UP' : "goldenrod"}


##### 2) Create a list of colors, in the same order as the nodes (politicians) $\to$ needed for plotting

To do so, iterate over the vertices (nodes), get the party affiliation of the current politician and, depending on it, add the matching color from the dictionary.
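
The lookup itself can be sketched on toy data. The shortened color dictionary, the two made-up users and the gray fallback color for unknown parties are assumptions for illustration only.

```python
toy_colors = {'AL': 'firebrick', 'SP': 'coral'}

toy_vertices = [
    ('25254764', {'username': 'andreaskirstein', 'party': 'AL'}),
    ('111', {'username': 'made_up_user', 'party': 'SP'}),
    ('222', {'username': 'another_made_up_user', 'party': 'XYZ'}),
]

# One color per node, in the same order as the nodes;
# .get() returns the fallback color if a party is missing from the dictionary
toy_node_color = [toy_colors.get(attrs['party'], 'lightgray') for _, attrs in toy_vertices]
print(toy_node_color)
```

Keeping the colors in the same order as the node list matters, because the drawing function matches them by position.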

In [None]:
# List to store the colors
node_color = []


In [None]:
# inspect the vertices, how can we access the party label of a single politician?

vertices

In [None]:
# Let's get active


Now, we have the nodes, the edges and the corresponding color for each node (politician).

We can start plotting!

Play around with layouts, the `node_size` and `width` parameters, and don't forget to add the colors.

In [None]:
# Set a figure size
plt.figure(figsize = (15, 10))



# Choose/Play around with a layout 



# Plot the network



To make the plot easier to understand, you can also add a title as well as a legend explaining the party colors.

In [None]:
# set figure size
plt.figure(figsize = (15, 10))


# define a legend:


# make the legend


# set a title


# draw

