# Initial data exploration
This is a simple initial data exploration created by Nick.

The objectives of this notebook are the following:
- Know how to read in the three different datasets that we have
    - We might restrict this to the plain text and navigation paths, since the HTML might be too much
- Find a way of linking the two datasets
- Enrich the graph with the shortest path, or find a way of adding that info as well
- Maybe see whether some very basic explorations can already be done with NetworkX

Might also be worth it to create the environment we'll use for this...

In [None]:
import numpy as np
import pandas as pd
import networkx
import matplotlib.pyplot as plt


# How to use data readers
Most of this notebook is just the template code that was used to make sure everything works.

The important part is the following code block, which shows how to read in each of the different datasets that we have.

In [None]:
import data_readers

# The links and edges
wikispeedia = data_readers.read_wikispeedia_graph()

# The finished paths
finished_paths = data_readers.read_finished_paths()

# The unfinished paths
unfinished_paths = data_readers.read_unfinished_paths()

# The shortest path matrix
# This one is the slowest to read by far, probably due to the weird parsing that has to be done!
shortest_path_df = data_readers.read_shortest_path_df()

# Searching for the string of a given article. It has to be formatted like the article name
# Which shouldn't be a problem, as we'll probably usually retrieve them internally
obi_wan_text = data_readers.plaintext_article_finder('Obi-Wan_Kenobi')

In [None]:
shortest_path_df[('Actor',)][('Japan',)]
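
As a cross-check on the precomputed matrix, the same distances could be recomputed from the link graph with NetworkX. This is only a sketch on a toy graph with made-up node names, not the real Wikispeedia data, and it does not assume anything about which axis of `shortest_path_df` is the source article.

```python
import networkx as nx

# Toy directed graph standing in for the Wikispeedia link graph (hypothetical nodes).
graph = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])

# Shortest path length between every reachable ordered pair of nodes.
lengths = dict(nx.all_pairs_shortest_path_length(graph))

print(lengths["A"]["D"])    # 3, via A -> B -> C -> D
print("A" in lengths["D"])  # False, D has no outgoing links
```

Unreachable pairs are simply absent from the inner dictionaries, so they need explicit handling before comparing against a matrix that stores a placeholder value for them.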

In [None]:
shortest_path_df

There are four really important files in the Wikispeedia dataset:
- links.tsv: Contains the actual edges
- paths_finished.tsv: Contains the winning games
- paths_unfinished.tsv: Contains the losing games
- shortest-path-distance-matrix.txt: Contains info on the shortest path

The first step is checking that the data is read correctly. We know the number of nodes and edges in the dataset, so we'll just compare against that.

In [None]:
# Read the links as a directed graph: each row of links.tsv is one edge
wikispeedia = networkx.read_edgelist('datasets/wikispeedia_paths-and-graph/links.tsv',
                                     create_using=networkx.DiGraph)
print("Dataset has", len(wikispeedia.nodes), "nodes")
print("Dataset has", len(wikispeedia.edges), "edges")

This is fewer nodes than the reported number, which should be 4,604.

The 119,882 edges are correct, though.

The difference is small, so for now I'll just ignore it and focus on reading in the other datasets.
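
One possible explanation, stated here as an assumption rather than something verified in this notebook: an edge list can only contain nodes that take part in at least one link, so articles without any links would never show up. If that is the cause, the missing nodes could be added from the full article list. The sketch below uses toy data; `edges` and `all_articles` are hypothetical stand-ins for links.tsv and articles.tsv.

```python
import networkx as nx

# Toy stand-ins for links.tsv and articles.tsv (hypothetical names).
edges = [("A", "B"), ("B", "C")]
all_articles = ["A", "B", "C", "D"]  # "D" has no links at all

graph = nx.DiGraph()
graph.add_edges_from(edges)
nodes_from_edges = graph.number_of_nodes()

# add_nodes_from leaves existing nodes untouched and adds the missing ones.
graph.add_nodes_from(all_articles)
isolated = sorted(nx.isolates(graph))

print(nodes_from_edges, graph.number_of_nodes(), isolated)
```

Adding the nodes does not change the edge count, so the edge check above would still hold.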

In [None]:
# skiprows=15 skips the header block at the top of the file
paths_finished = pd.read_csv('datasets/wikispeedia_paths-and-graph/paths_finished.tsv', sep='\t', skiprows=15,
                             names=['hashedIpAddress', 'timestamp', 'durationInSec', 'path', 'rating'])
paths_finished.head()

In [None]:
# Same layout as the finished paths, plus the target article and the reason the game ended
paths_unfinished = pd.read_csv('datasets/wikispeedia_paths-and-graph/paths_unfinished.tsv', sep='\t', skiprows=16,
                               names=['hashedIpAddress', 'timestamp', 'durationInSec', 'path', 'target', 'type'])

paths_unfinished.head()
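
To link the paths to the graph, each `path` string has to be turned into the links that were actually clicked. The sketch below assumes the format is article names separated by `;`, with `<` marking a back click that returns to the previous page. That format is an assumption here and should be checked against the file header. The helper name `path_to_edges` is hypothetical.

```python
def path_to_edges(path):
    """Turn a ';'-separated path into the list of (source, target) links clicked."""
    stack = []  # pages from the start to the current page
    edges = []
    for step in path.split(";"):
        if step == "<":
            if len(stack) > 1:
                stack.pop()  # go back to the previous page
            continue
        if stack:
            edges.append((stack[-1], step))
        stack.append(step)
    return edges


example = "A;B;<;C;D"
print(path_to_edges(example))  # [('A', 'B'), ('A', 'C'), ('C', 'D')]
```

Each resulting pair can then be looked up as an edge in the `wikispeedia` graph.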

The last part left is the shortest-distance matrix. For this part, I just need to link its rows and columns to articles.tsv to find the article name corresponding to each one.

Just reading the shortest distances is already a bit of a pain, though...



In [None]:
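# delimiter=1 makes genfromtxt read fixed-width fields of one character each,
# so every character in a row becomes one matrix entry
shortest_path = np.genfromtxt("datasets/wikispeedia_paths-and-graph/shortest-path-distance-matrix.txt",
                              delimiter=1, missing_values=-1, dtype=int)
# One article name per row, in the same order as the matrix rows and columns
articles = pd.read_csv('datasets/wikispeedia_paths-and-graph/articles.tsv', sep='\t', skiprows=12,
                       names=["article_name"])

# Label both axes of the matrix with the article names
shortest_path_df = pd.DataFrame(shortest_path, index=articles.values, columns=articles.values)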
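
Since this read is slow, a hand-rolled parser might be worth trying. The sketch below assumes that every non-comment line is one matrix row, that each distance is a single digit, and that `_` marks an unreachable pair. These are assumptions about the file format, and `parse_distance_matrix` is a hypothetical helper, not part of `data_readers`.

```python
import io

import numpy as np


def parse_distance_matrix(lines):
    """Parse rows of single-character distances; '_' becomes -1."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        # View the ASCII characters as bytes, then map digit characters to ints.
        codes = np.frombuffer(line.encode("ascii"), dtype=np.uint8)
        row = codes.astype(np.int16) - ord("0")
        row[codes == ord("_")] = -1
        rows.append(row)
    return np.vstack(rows)


# Tiny in-memory sample standing in for the real file.
sample = io.StringIO("# comment\n\n012\n_01\n3_0\n")
matrix = parse_distance_matrix(sample)
print(matrix)
```

On the real file, the same function could be called on an open file handle instead of the `StringIO` sample.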

A common file reader has already been created; I'll just have to move things into a Python file.

## Plain text reader
Another very basic reader. The objective is simply to find the relevant text file given the article name. This is annoying because of the URL-encoded format the names were given in.

In [None]:
# Read the whole file into a string; the with block closes the file automatically
with open("datasets/plaintext_articles/%C3%85land.txt", "r", encoding="utf8") as text_file:
    data = text_file.read()

In [None]:
def plaintext_article_finder(article_name: str) -> str:
    """Return the plain text of an article, given its (URL-encoded) name."""
    art_file_name = "datasets/plaintext_articles/" + article_name + ".txt"
    # The with block closes the file even if reading fails
    with open(art_file_name, "r", encoding="utf8") as text_file:
        return text_file.read()
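
Because the file names are URL-encoded, a lookup with a decoded name such as `Åland` would fail. A possible variant accepts either form by decoding first and then re-encoding. This is a sketch: `article_path` and `read_article` are hypothetical helpers, and it assumes the file names match the default escaping of `urllib.parse.quote`, which should be checked against the actual directory.

```python
from pathlib import Path
from urllib.parse import quote, unquote


def article_path(article_name, base_dir="datasets/plaintext_articles"):
    """Build the file path for a name given in encoded or decoded form."""
    # unquote first so an already-encoded name is not encoded twice.
    encoded = quote(unquote(article_name))
    return Path(base_dir) / f"{encoded}.txt"


def read_article(article_name, base_dir="datasets/plaintext_articles"):
    """Read the plain text of an article as a single string."""
    return article_path(article_name, base_dir).read_text(encoding="utf8")


print(article_path("Åland").name)
print(article_path("%C3%85land").name)
```

Both calls resolve to the same file name. A name that contains a literal `%` would be misread by this approach.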

# Convert URL-encoded string to a regular Python string

In [None]:
from urllib.parse import unquote

# Decodes URL-encoded article names, restoring accented characters.
# For example, %C3%81ed%C3%A1n_mac_Gabr%C3%A1in becomes Áedán_mac_Gabráin.
def decode_article(article_name: str) -> str:
    return unquote(article_name)
    
decode_article(articles.iloc[0,0])
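
Decoding one name at a time is enough for display, but for lookups it may help to keep a mapping from decoded names back to the encoded ones used in the files. A minimal sketch with a hypothetical list `encoded_names` standing in for the `article_name` column:

```python
from urllib.parse import unquote

# Hypothetical sample of encoded names, as they would appear in articles.tsv.
encoded_names = ["%C3%81ed%C3%A1n_mac_Gabr%C3%A1in", "Obi-Wan_Kenobi"]

# Map each readable name back to its encoded form.
decoded_to_encoded = {unquote(name): name for name in encoded_names}

print(decoded_to_encoded["Áedán_mac_Gabráin"])
```

On the DataFrame itself, `articles["article_name"].map(unquote)` would produce the decoded column in one step.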