# Explainer Notebook

## 1. Motivation
What is your dataset?

Why did you choose this/these particular dataset(s)?

What was your goal for the end user’s experience?

Our dataset consists of animal pages on Wikipedia, which we collected using web scraping and the WikiData API.
We chose this dataset because Wikipedia pages almost always link to other pages, which in turn link to further pages. We thought this link structure would be interesting to study within the animal kingdom, because some groups of animals are more likely to mention each other in their Wikipedia articles. We first analyzed all of the approximately 30k Wikipedia pages and then restricted the network to reptiles to obtain a more manageable dataset.
Our goal with the analysis of our Wikipedia animal dataset was to show the user how connected these animals actually are on Wikipedia.

## 2. Basic stats
Write about your choices in data cleaning and preprocessing

Write a short section that discusses the dataset stats (here you can recycle the work you did for Project Assignment A)


The data used in the analysis was gathered through the WikiData API, using its query builder. Because Wikipedia has no category called “Animals” that lets you collect all the animal pages, we instead query for the Wikidata items that have an “Animal Diversity Web (ADW) taxon ID”. The queries used to gather the data are:

In [None]:
query = '''
    SELECT DISTINCT ?item ?itemLabel WHERE {
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
      {
        SELECT DISTINCT ?item WHERE {
          ?item p:P4024 ?statement0.
          ?statement0 (ps:P4024) _:anyValueP4024.
        }
      }
    }
    '''
query_reptile = '''
    SELECT DISTINCT ?item ?itemLabel WHERE {
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
      {
        SELECT DISTINCT ?item WHERE {
          ?item p:P4024 ?statement0.
          ?statement0 (ps:P4024) _:anyValueP4024.
          ?item p:P5473 ?statement1.
          ?statement1 (ps:P5473) _:anyValueP5473.
        }
      }
    }
    '''

In [None]:
from tqdm import tqdm
import requests
import json
from bs4 import BeautifulSoup
import re
from get_links import links_on_page
import networkx as nx
import matplotlib.pyplot as plt
from netwulf import visualize
import pickle
import numpy as np

The first query requests all items that have an Animal Diversity Web (ADW) taxon ID (P4024); it was generated by the Wikidata query builder.
The second query requests all items that have both an ADW taxon ID and a Reptile Database ID (P5473).

The get_wiki_links function takes a list of Wikidata IDs and joins them with | in the request, which allows us to query up to 50 IDs at a time. We then loop over the response and check whether each entity has a link to an English Wikipedia page.

In [None]:
def get_wiki_links(item_ids):
    url = f"https://www.wikidata.org/w/api.php?action=wbgetentities&ids={'|'.join(item_ids)}&props=sitelinks/urls&format=json&sitefilter=enwiki"
    response = requests.get(url)
    data = json.loads(response.text)
    wiki_links = []
    for item_id in item_ids:
        # Entities that are missing or have no sitelinks are skipped instead of raising a KeyError
        sitelinks = data.get("entities", {}).get(item_id, {}).get("sitelinks", {})
        if "enwiki" in sitelinks: # only save if the item links to an English Wikipedia page
            wiki_links.append(sitelinks["enwiki"]["url"])
    return wiki_links

For the actual querying, we start by getting all the Wikidata items that match our query (query or query_reptile). We then loop over 50 items at a time, get their Wikipedia links with get_wiki_links, and write the results to a .txt file with one link per line.

In [None]:
url = 'https://query.wikidata.org/sparql'
r = requests.get(url, params = {'format': 'json', 'query': query_reptile})
data = r.json()

data = data["results"]["bindings"]
temp = []
# Process the items in batches of 50; slicing past the end of a list is safe,
# so the last (shorter) batch needs no special case
for start in tqdm(range(0, len(data), 50)):
    # Take the Q-id from the entity URI (the last path segment) rather than relying on the label
    sub_list = [ids["item"]["value"].split("/")[-1] for ids in data[start:start + 50]]
    temp += get_wiki_links(sub_list)
# Written to data/ because the next step reads the file from there
with open('data/animal_links_reptile.txt', 'w') as file:
    for item in temp:
        file.write(item + "\n")

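The batching used in the loop above can be sketched in isolation. This is a minimal illustration and not part of the original pipeline: the helper name `batched_ids` and the example IDs are hypothetical, and the batch size of 50 is the limit mentioned earlier.

```python
def batched_ids(item_ids, batch_size=50):
    """Yield consecutive slices of at most batch_size IDs (hypothetical helper)."""
    for start in range(0, len(item_ids), batch_size):
        # Slicing past the end of a list is safe, so the last batch may be shorter
        yield item_ids[start:start + batch_size]

# Example with made-up Wikidata IDs
example_ids = [f"Q{i}" for i in range(1, 121)]
batches = list(batched_ids(example_ids))
batch_sizes = [len(b) for b in batches]
joined_first = "|".join(batches[0][:3])  # how IDs are joined for the ids= parameter
print(batch_sizes)   # [50, 50, 20]
print(joined_first)  # Q1|Q2|Q3
```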
Now for creating our nodes and edges. We open the .txt file from before, and prepare for the web scraping:

In [None]:
edgelist_weights = {}
edgelist_weights_long = {}
#animal_name = "Elephant"
names = names_from_table() # Gets all the entries in the larger table on https://en.wikipedia.org/wiki/List_of_animal_names
names_long = {}
with open('data/animal_links_reptile.txt', 'r') as f:
    entries = f.read().splitlines()
for name in tqdm(entries):
    name_temp = name.split("/")[-1]
    names_long[name_temp] = 0 # Saving all the entries in the txt file in a dict for fast lookups (i.e. only make pairs
                                #  between animals and not with unrelated wiki pages)
attributes_dict = {}

For each Wikipedia link in the .txt file, we collect all the links on the page and look for an infobox using links_on_page. This function uses the Wikipedia API to get the parsed page in JSON format; we then find every href in the page HTML and save those that point to other wiki pages.
Next we look for an infobox. Since Wikipedia marks these up in two ways, we check whether the first lookup returned None, which indicates that the infobox is of the other type.
We then go through the infobox (a table) and extract the predefined values we want to use as attributes.
If there is no infobox on the page, i.e. info still holds its default None values, we discard the page, as it is most likely a redirect.

In [None]:
def links_on_page(animal_name="Elephant"):
    url = "https://en.wikipedia.org/w/api.php?action=parse&page="+animal_name+"&format=json" # we can either put the title page or the URL version
                                                                                             # of the title, ex. Malayan softshell turtle = Malayan_softshell_turtle
    response = requests.get(url)
    html = json.loads(response.content.decode('utf-8'))['parse']['text']['*'] # Default way the result comes in
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', href=lambda href: href and href.startswith('/wiki/') and not href.endswith('.jpg') and not href.endswith('.png'))
    # ^ sometimes the links link to jpg and png file, sort them out
    result = [link.get('href') for link in links] # Keep only the link targets
    info = {"Class:": None, "Order:":None, "Superfamily:":None, "Family:":None,"Name:": None} # Default dict allows us to see if its redirect later
    infobox = soup.find('table', {'class': 'infobox biota biota-infobox'}) # One type of infoboxes wiki uses
    if infobox is None:
        infobox = soup.find('table', {'class': 'infobox biota'}) # Other type of infoboxes wiki uses
    if infobox:
        rows = infobox.find_all('tr') # Going through the rows in the infobox
        for row in rows:
            td = row.find_all('td')
            if len(td) >= 2: # Need both columns: the rank label (Class, Order, ...) and its value
                key = td[0].text.strip()
                if key in ["Class:", "Order:", "Superfamily:", "Family:"]: # the info we want is stored one row at a time
                    info[key] = td[1].text.strip() # Making the category the key, and the attribute value the value
        info["Name:"] = animal_name # Updating from default 

    return result, info

One of the graphs we build only has edges to the 224 animal names listed on https://en.wikipedia.org/wiki/List_of_animal_names. This Wikipedia page contains two tables, one of which lists the general animal names (no subspecies).
Because it is a Wikipedia page, the table cells contain references and other text we do not want in our scraping; these are removed with re.sub using a set of predefined patterns.

In [None]:
def names_from_table():
    url = "https://en.wikipedia.org/wiki/List_of_animal_names"
    
    response = requests.get(url)
    html_content = response.content
    
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Find table on the page
    table = soup.find_all('table',{"class":"wikitable"})[1]
    
    # Find all the rows in the table
    t_rows = table.find_all('tr')
    
    ths = t_rows[0].find_all('th')
    
    header = [th.text.replace("\n","") for th in ths]
    
    rows = []
    names = {}
    for tr in t_rows[1:]:
        tds = tr.find_all('td')
        if tds:
            links = tds[0].find_all('a')
            if links:
                row = [td.text.replace("\n","") for td in tds]
                rows.append(row)
                cleaned_row = re.sub(r'\(.*\)|Also see.*|\[\d+\]|See.*', '', row[0]) # The table is not clean, many unwanted formating we remove here
                for link in links:
                    link_href = link.get('href')
                    if link_href and link_href.startswith('/wiki/'): # Skip anchors without an href
                        names[link_href] = cleaned_row
                        break

    return names

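To see what the cleaning pattern in names_from_table does, the sketch below applies it to a few cell texts. The sample strings are made up in the style of the table rather than taken from the scraped page, and the `clean_name` helper with its extra `.strip()` is an assumption added for illustration. Note that `\(.*\)` is greedy, so it removes everything from the first opening parenthesis to the last closing one.

```python
import re

# Same pattern as in names_from_table
PATTERN = r'\(.*\)|Also see.*|\[\d+\]|See.*'

def clean_name(cell_text):
    """Remove parentheticals, footnote markers and 'See ...' notes, then trim whitespace."""
    return re.sub(PATTERN, '', cell_text).strip()

# Made-up cell texts, not actual scraped rows
samples = ["Alligator[1]", "Cattle (list)", "Dog Also see Wolf", "Seal"]
cleaned = [clean_name(s) for s in samples]
print(cleaned)  # ['Alligator', 'Cattle', 'Dog', 'Seal']
```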
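Now for the actual scraping: we loop over all the entries from our .txt file. We treat a page as a redirect if info["Name:"] is still None (its default value); otherwise we add its links to our edge lists. The first edge list (edgelist_weights) treats a pair and its inverse as the same edge and is used for the undirected graph, while the second (edgelist_weights_long) keeps the direction of each link and is used for the directed graph.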

In [None]:
for name in tqdm(entries):
    temp_string = name.split("/")[-1]
    result, info = links_on_page(animal_name=temp_string)
    if info["Name:"] is not None: # A way to remove wiki redirects from the final result, as redirects dont have the infoboxes we want
        attributes_dict[info["Name:"]] = info # Making nested dicts to quickly get attributes later
        for entry in result:
            if entry in names: # making one graph where we only make edges to the table from https://en.wikipedia.org/wiki/List_of_animal_names
                pair = ("/wiki/"+temp_string,entry) # Making pairs to compare for the dict
                pair_inverted = (entry,"/wiki/"+temp_string)
                if pair in edgelist_weights:
                    edgelist_weights[pair] += 1 # If the pair is already in the dict, the weight is increased
                elif pair_inverted in edgelist_weights:
                    edgelist_weights[pair_inverted] += 1 # If the inverted pair is already in the dict, the weight is increased
                else:
                    edgelist_weights[pair] = 1 # If the pair is not in the dict, the weight is 1
            if entry.split("/")[-1] in names_long: # making other graph where we make edges between entries from the .txt file
                pair = (temp_string, entry.split("/")[-1]) # Only taking the last part of the URL (the URL title of the page)
                if pair in edgelist_weights_long:
                    edgelist_weights_long[pair] += 1 # If the pair is already in the dict, the weight is increased
                else:
                    edgelist_weights_long[pair] = 1 # If the pair is not in the dict, the weight is 1

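The weight counting above can also be sketched with collections.Counter. This is an illustrative alternative under assumptions, not the code used for the results: the page names are made up, and undirected pairs are normalized by sorting so that (a, b) and (b, a) share one key.

```python
from collections import Counter

# Made-up (source page, linked page) pairs, one per link found while scraping
links = [
    ("Gecko", "Lizard"),
    ("Lizard", "Gecko"),
    ("Gecko", "Lizard"),
    ("Gecko", "Snake"),
]

# Directed weights: the order of the pair is kept
directed = Counter(links)

# Undirected weights: sort each pair so both directions map to the same key
undirected = Counter(tuple(sorted(pair)) for pair in links)

print(directed[("Gecko", "Lizard")])    # 2
print(undirected[("Gecko", "Lizard")])  # 3
```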

To avoid repeating the scraping every time, we dump the results as pickle files.

In [None]:
with open('data/data_plain_reptile_test.pickle', 'wb') as fp:
    pickle.dump(edgelist_weights, fp, protocol=pickle.HIGHEST_PROTOCOL)
with open('data/data_plain_long_reptile_test.pickle', 'wb') as fp:
    pickle.dump(edgelist_weights_long, fp, protocol=pickle.HIGHEST_PROTOCOL)
with open('data/Reptile_attributes.pickle', 'wb') as fp:
    pickle.dump(attributes_dict, fp, protocol=pickle.HIGHEST_PROTOCOL)


To visualize them, we first load the pickle files, loop over all the entries to build an edge list, and construct the graph from it.
We then add all our attributes to the graph and, as a sanity check, remove nodes that don't have attributes saved in our .pickle file.

In [None]:
with open('data/all_animal_to_all_animal.pickle', 'rb') as handle:
    b = pickle.load(handle)
with open('data/animal_attributes.pickle', 'rb') as handle:
    c = pickle.load(handle)

values = list(b.values())
plt.hist(values, bins=np.arange(1, max(values) + 2) - 0.5, edgecolor='black') # Bin edges at k-0.5 so every integer weight, including the maximum, gets its own bin
plt.yscale("log")
plt.xticks(range(1, max(values) + 1))
plt.xlabel('Number of references')
plt.ylabel('Frequency')
plt.show()

edgelist = [None]*len(b)
for i,items in enumerate(b):
    edgelist[i] = (items[0].replace("/wiki/", ""),items[1].replace("/wiki/", ""),int(b[items]))
G = nx.DiGraph()

G.add_weighted_edges_from(edgelist)
print(G)

to_remove =[]
for Names in tqdm(G.nodes):
    if Names in c:
        G.nodes[Names]['Class'], G.nodes[Names]['Order'], G.nodes[Names]['Superfamily'], G.nodes[Names]['Family'], _ = c[Names].values()
    else:
        to_remove.append(Names) # Some nodes get added to graph even though they are redirects, the cause is known but no good way to handle it
for names in to_remove:
    G.remove_node(names)
print(G)
network, config = visualize(G)

degree = [d for _, d in G.degree()]
# Bin edges at k-0.5 so each integer degree gets its own bin; the range follows the degrees, not the edge weights
plt.hist(degree, bins=np.arange(max(degree) + 2) - 0.5, edgecolor='black')
plt.title("Degree distribution")
plt.xlabel('Degree')
plt.ylabel('Frequency')
plt.show()

## 3. Tools, theory and analysis. Describe the process of theory to insight
Describe which network science tools and data analysis strategies you’ve used, how those network science measures work, and why the tools you’ve chosen are right for the problem you’re solving.

Talk about how you’ve worked with text, including regular expressions, unicode, etc.

How did you use the tools to understand your dataset?


### Network theory

Fraction of edges

Assortativity

Modularity

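$M=\sum_{c=1}^{n_c}\left[\frac{L_c}{L}-\left(\frac{k_c}{2L}\right)^2\right]$

Here $n_c$ is the number of communities, $L_c$ is the number of links inside community $c$, $k_c$ is the total degree of the nodes in community $c$, and $L$ is the total number of links in the network.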

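As a worked example of the modularity formula above, the sketch below evaluates it by hand on a toy graph. The graph (two triangles joined by one edge) and the partition are made up for illustration and are not taken from our dataset.

```python
from collections import defaultdict

# Toy undirected graph: two triangles joined by a single edge (made-up example)
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
community_of = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

L = len(edges)                    # total number of links
links_within = defaultdict(int)   # L_c: links with both ends in community c
total_degree = defaultdict(int)   # k_c: summed degree of the nodes in community c

for u, v in edges:
    total_degree[community_of[u]] += 1
    total_degree[community_of[v]] += 1
    if community_of[u] == community_of[v]:
        links_within[community_of[u]] += 1

M = sum(links_within[c] / L - (total_degree[c] / (2 * L)) ** 2 for c in total_degree)
print(round(M, 4))  # 0.3571
```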
##### Louvain
The Louvain algorithm is a community detection algorithm used to identify the communities or groups within a network.

The algorithm works by optimizing a modularity function that measures the quality of the community structure. The modularity function quantifies the extent to which the number of edges within communities is higher than the expected number in a random network with the same degree sequence.

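A minimal sketch of running Louvain on a toy graph is shown below. It assumes a recent networkx version that provides `louvain_communities`; the toy graph, the name `G_toy` and the seed value are made up for illustration and are not the settings used for our results.

```python
import networkx as nx

# Toy graph: two triangles joined by a single edge (made-up example)
G_toy = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])

# louvain_communities returns a list of sets of nodes; seed makes the run reproducible
communities = nx.community.louvain_communities(G_toy, seed=42)

# Modularity of the partition that was found
score = nx.community.modularity(G_toy, communities)

print(communities)
print(score)
```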
### Text theory 

TF-IDF 

## 4. Discussion. Think critically about your creation
What went well?

What is still missing? 

What could be improved? Why?