# Nodes and Edges

<b>Intent:</b> A look at some of the things you may need Python for when creating your network graphs.

<b>Data from:</b> Gabriella's spreadsheet

In [None]:
# import packages
from datascience import *
import pandas as pd

### Importing CSVs
We are going to read in a CSV that we downloaded from Google Sheets, using the `read_table` method from `datascience`. Below is a picture of what it originally looked like.

![](Google Sheets.png)

In [None]:
graffiti = Table().read_table('WellonsG Graffiti mochicas in the Huaca Cao Viejo, El Brujo - Sheet1.csv')
graffiti

In [None]:
# selecting out the columns that we want to turn into our node / edge pairs
temple_and_code = graffiti.select('Temple', 'Code')
temple_and_code

### Adding IDs

We want to add a unique numerical ID to each of our items so that we can make our network graph. In order to do that, we will use a dictionary to keep track of what ID we assign to each different element. We are going to then define functions for adding elements to our dictionary and for getting the values that we assign to our keys.

In [None]:
# we are going to count up from zero as we assign IDs to keys
next_id = 0
# initializing the dictionary that we will keep our key-value pairs in
dictionary = {}

# making a function for adding keys
def add_to_dictionary(key):
    global next_id
    if key not in dictionary:
        dictionary[key] = next_id
        next_id = next_id + 1

# a function to get back values from the dictionary
def get_id(key):
    return dictionary[key]

<b>Note:</b> \>>> (three greater-than signs) marks a line that we typed; a line without it shows the result that came back.

#### <font color='blue'> Step by step of what we just defined</font>


```
>>> add_to_dictionary('E')
```

When we add 'E' to our dictionary, we assign a unique ID to it and store the two together. Think of this process as the computer remembering it like:

```
'E' = 0
```
We then want to be able to get back an ID if we pass in a key:

```
>>> get_id('E')
0
```
This is a repeatable process that holds true as we continue to pass more things in.

```
>>> add_to_dictionary('D')
>>> add_to_dictionary('E1')
>>> add_to_dictionary('E2')
>>> get_id('E2')
3
>>> get_id('D')
1
```
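
The same bookkeeping can also be sketched without a global counter. The version below is only an illustration: `assign_ids` is a hypothetical helper name, not something defined elsewhere in this notebook, and it relies on the fact that the number of keys already stored is always the next unused ID.

```python
# Hypothetical helper: assign consecutive integer IDs without a global counter.
def assign_ids(keys):
    ids = {}
    for key in keys:
        # setdefault stores a value only the first time a key is seen;
        # len(ids) is the next unused integer at that moment
        ids.setdefault(key, len(ids))
    return ids

ids = assign_ids(['E', 'D', 'E1', 'E2', 'D', 'E'])
print(ids)
```

Repeated keys such as the second 'D' and 'E' keep the ID they were given the first time, which matches the behaviour of `add_to_dictionary`.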

In [None]:
# telling the computer, 'for each label in the labels
# of our temple_and_code table repeat this process':
for label in temple_and_code.labels:
    # apply the add_to_dictionary function to each value in the column 'label' of this table
    temple_and_code.apply(add_to_dictionary, label)

In [None]:
# creating ID columns for our nodes: 'Source' holds each Temple's ID, 'Edge' holds each Code's ID
nodes_and_edges = temple_and_code.with_columns([
        'Source', temple_and_code.apply(get_id, 'Temple'),
        'Edge', temple_and_code.apply(get_id, 'Code')
    ])
nodes_and_edges

### Getting Weights

We now have IDs for our table, but we also want weights that represent the strength of each relationship. There are many different ways to quantify a relationship, but we will base ours on frequency of appearance in our table. We will use the `group` function to count how many times each Temple-Code combination appears in our table.

In [None]:
weights = temple_and_code.group(['Temple', 'Code'])
weights.show()
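
To see what this kind of grouping computes, here is a rough plain-Python sketch of the same idea using `collections.Counter`. The `rows` list is made-up sample data standing in for the Temple and Code columns, and `pair_counts` is an illustrative name.

```python
from collections import Counter

# Made-up (Temple, Code) rows standing in for the real spreadsheet
rows = [('A', 'E'), ('A', 'E'), ('A', 'D'), ('B', 'E1'), ('A', 'E')]

# Counter tallies how many times each distinct pair appears
pair_counts = Counter(rows)
for (temple, code), count in pair_counts.items():
    print(temple, code, count)
```

Each distinct pair ends up with one count, and that count is what we will use as the weight.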

We will then use the `pandas` function `merge` to join the two tables together so that we have the weights alongside our node and edge info. Notice that we need to convert our tables to `pandas` DataFrames so that they will work with the function.

In [None]:
with_weights = pd.merge(nodes_and_edges.to_df(), weights.to_df(), how='right', on=['Temple', 'Code'])
# converting back to a datascience table
with_weights = Table().from_df(with_weights)
with_weights

In [None]:
# relabeling a column
with_weights.relabel('count', 'Weight')
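
One thing to watch for: `nodes_and_edges` still has one row per spreadsheet row, so if a Temple-Code pair appears several times, the merge above will likely return that pair several times as well, each copy carrying the same weight. If you want exactly one row per edge, the idea can be sketched in plain Python as follows. The sample `rows` are made up, and `ids`, `weights`, and `edges` are illustrative names rather than part of this notebook.

```python
from collections import Counter

# Made-up (Temple, Code) rows; the pair ('A', 'E') appears twice
rows = [('A', 'E'), ('A', 'E'), ('A', 'D'), ('B', 'E')]

# Assign IDs column by column, mirroring the loop over table labels above
ids = {}
for column in zip(*rows):
    for value in column:
        ids.setdefault(value, len(ids))

# Count each distinct pair once, then translate names into IDs
weights = Counter(rows)
edges = [(ids[temple], ids[code], weight)
         for (temple, code), weight in weights.items()]
print(edges)
```

Each tuple in `edges` is (source ID, target ID, weight), with one entry per distinct pair.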

In [None]:
#if you would like to save the file as a csv, uncomment the following line (remove the #)
#with_weights.to_csv('temple_nodes_edges.csv')

The example above demonstrates how to convert a set of nodes and edges to unique IDs. With these conversions, we can also organize and weight our data in a way that helps us create network graphs. In this example, however, we were given a spreadsheet that had already identified the important nodes and edges. For most of you, identifying the important nodes and edges is the truly hard part of the data analysis. This is where the material from the earlier modules comes into play. Using tools such as regular expressions and NLTK, try to write relevant and specific queries that return particular connections and correlations in your data sets or texts; these connections and correlations can eventually be transformed into nodes and edges! To review how this process works, let us look at a powerful example from a previous notebook.

In [None]:
#importing relevant packages
import re
import codecs
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nltk import tokenize
from collections import Counter
import pprint
pp = pprint.PrettyPrinter()

%matplotlib inline

In [None]:
#We are going to revisit our texts regarding Philip II of Spain
with codecs.open('Grand Strategy of Phillip II Text.txt', 'r', encoding='utf-8', errors='ignore') as f:
    read_text = f.read()

In [None]:
#Let's start off by getting a rough list of proper nouns with their counts - this will help us narrow down people, places and their relations
# [A-Z] matches one capital letter; lowercase '\w' matches word characters (letters, digits or underscore)
pp.pprint(Counter(re.findall(r'[A-Z]\w+', read_text)))

In [None]:
# making a table out of the above dictionary
names_table = Table(['Name', 'Count'])
names = Counter(re.findall(r'[A-Z]\w+', read_text))
for name in names.keys():
    row = [name, names[name]]
    names_table.append(row)
names_table.sort('Count', descending=True)
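
A capitalized-word pattern also picks up ordinary words that happen to start a sentence, such as "The". One possible refinement, sketched here on a made-up sentence, is to drop a small set of such words before counting. `STOPWORDS` is a hypothetical, deliberately tiny list; a real one would need to be much longer.

```python
import re
from collections import Counter

# Made-up sample text standing in for read_text
sample = "Philip wrote to Parma. The fleet left Lisbon. Philip blamed England."

# Hypothetical stop list of capitalized words that are not names
STOPWORDS = {'The', 'A', 'An', 'In', 'On'}

capitalized = re.findall(r'[A-Z]\w+', sample)
name_counts = Counter(word for word in capitalized if word not in STOPWORDS)
print(name_counts.most_common(3))
```

`most_common` returns the entries sorted from the highest count down, much like sorting the table by 'Count'.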

Even with this simple query, we are able to gain important information about the most commonly occurring names and proper nouns in the work. If we analyze these parts of the text further, we may be able to develop clear nodes and edges between relevant pieces of information!

In [None]:
# while we're at it, let's get a list of all the four-digit numbers (mostly years) that appear in the text along with their counts
pp.pprint(Counter(re.findall(r'\d{4}', read_text)))
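
A bare four-digit pattern will also match the first four digits of any longer number. The sketch below, run on a made-up sentence, compares it with a stricter pattern that uses word boundaries (`\b`). The stricter pattern assumes the years of interest start with 1, which suits a sixteenth-century text but is an assumption on our part.

```python
import re
from collections import Counter

# Made-up sample text; 15880 is a page number, not a year
sample = "In 1588 the Armada sailed; page 15880 cites 1571 and 1588 again."

# Matches any run of four digits, even inside a longer number
loose = re.findall(r'\d{4}', sample)

# \b requires the four digits to stand alone as a whole "word"
strict = re.findall(r'\b1\d{3}\b', sample)

print(Counter(loose))
print(Counter(strict))
```

The loose pattern counts 1588 three times because it also matches inside 15880, while the strict pattern counts it twice.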

One of the most important aspects of data analysis is CREATIVITY! Analysis, especially of dense works, is never black-and-white. Rather, the person analyzing the data must think of meaningful queries and searches that will yield the most useful results. Take some time to brainstorm like a data scientist! Using the lists of the most common years and proper nouns as a starting point, what else do you think you could analyze or query with our data analysis tools? THINK BIG! People often feel that their idea is too ambitious; most of the time, however, with enough work, we can get a computer to make that vision a reality. Share your ideas with the class!

We are going to revisit an example from a previous notebook.

In [None]:
#let's see the context surrounding the instances in the text where our most common name and most common country (besides Spain)
#are mentioned in close proximity
info_england = re.findall(r'[\S\s]{,45}England[\S\s]{,45}', read_text)
info_england_parsed = []
for elem in info_england:
    if 'Philip' in elem:
        info_england_parsed.append(elem)
info_england_parsed
#What are ways that you could improve this query?
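
Here is one possible direction, offered as a sketch rather than the answer. Because `re.findall` returns non-overlapping matches, two mentions of England that sit within 45 characters of each other get folded into a single match. Taking a slice of text around every occurrence avoids that. `contexts` is a hypothetical helper name, and the sample sentence is made up.

```python
import re

def contexts(text, target, other, width=45):
    """Return a window of text around each `target` that also mentions `other`."""
    found = []
    for match in re.finditer(re.escape(target), text):
        # Slice `width` characters on either side of this occurrence
        start = max(match.start() - width, 0)
        window = text[start:match.end() + width]
        if other in window:
            found.append(window)
    return found

sample = "Philip planned to invade England. England, he wrote, must fall."
hits = contexts(sample, 'England', 'Philip')
print(len(hits))
```

On this sample the helper returns one window per mention of England, whereas the single-pattern query finds only one match.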

Let's organize this!

In [None]:
date_words = [re.findall(r'[A-Z][a-z]+', elem) + re.findall(r'\d{4}', elem) for elem in info_england_parsed if re.search(r'[A-Z][a-z]+', elem)]
date_words
#Using this query, we can associate proper nouns with the year around which they were referenced

In [None]:
#Finally, let's make a dictionary that organizes this data! This dictionary is created in a similar manner to 
#the other dictionaries above. In this one, our "keys" are years. The keys are associated with words that are 
#mentioned in close proximity to that year.
word_date_dict = {}
for x in range(1500,1600):
    for elem in date_words:
        if str(x) in elem:
            word_date_dict[str(x)] = elem
word_date_dict
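
Note that the loop above stores a single list per year, so when the same year shows up in several snippets, only the last one is kept. If you would rather collect every word seen near a year, one option is a `defaultdict` of sets, sketched below. `sample_date_words` is made-up data in the same shape as `date_words`, and `words_by_year` is an illustrative name.

```python
from collections import defaultdict

# Made-up stand-in for date_words: each inner list mixes words and years
sample_date_words = [['Philip', 'England', '1588'],
                     ['Philip', 'Parma', '1588'],
                     ['England', 'Elizabeth', '1559']]

words_by_year = defaultdict(set)
for elem in sample_date_words:
    # Keep only years from 1500 to 1599, as in the loop above
    years = [t for t in elem if t.isdigit() and 1500 <= int(t) < 1600]
    words = [t for t in elem if not t.isdigit()]
    for year in years:
        # update() adds to the existing set instead of replacing it
        words_by_year[year].update(words)

print(dict(words_by_year))
```

Using sets also means a word mentioned twice near the same year is stored only once.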

Wow! Look what we were able to do with only a few simple, rough queries! Creating similar sets of organized info will help greatly in finding node/edge relationships. What are some relationships that you can pick out from the above information?

Keep in mind that you are not limited to only names and years! By analyzing the patterns in the text, you can pull out relationships between names, places, events, artifacts, etc. These relationships will help you greatly in creating your network structure.

Experiment with all that you've learned! Ask questions if you have them!