
<img src="https://cloud.archivesunleashed.org/assets/logo-8d2126e162dc682078284bb8f5585e4365fbad6dc04aa2afbae747626bd815ea.png" height="100px" width="100px">

# Welcome

Welcome to the Archives Unleashed Cloud (AUK) Visualization Demo in Jupyter Notebook for your collection { Collection Name } id { Collection Id }. 

This demonstration takes the main derivatives from AUK and uses Python data analysis approaches to produce information about the collection you analysed.

This product is in beta. If you run into bugs or have features you would like to see included, please open an [issue in our GitHub repository](https://github.com/archivesunleashed/auk/issues) to let us know.

If you have some basic Python coding experience, you can change the code provided here to suit your own needs. Unfortunately, we cannot support code that you write yourself. We recommend using `File > Make a Copy` before changing the code in this notebook, so that you can always return to the basic visualizations offered here. You can also re-download the Jupyter Notebook from your Archives Unleashed Cloud account.

### How Jupyter Notebooks Work:

If you have no previous experience with Jupyter Notebooks, the most important thing to understand is that pressing `Shift` + `Enter` (or `Return`) runs the Python code inside a window and displays its output directly below that window.
    
The window titled `# RUN THIS FIRST` should be the first one you run. It imports all the libraries and sets basic variables (e.g., where your derivative files are located) for the notebook. After that, the other windows should be able to run on their own.


In [91]:
# RUN THIS FIRST

# This window sets up all the necessary libraries and dependencies
# for your collection.
coll_id = "4656"
auk_fp = "data/"
# Every derivative path is built from the folder, the collection id
# and a fixed suffix.
auk_full_text = auk_fp + coll_id + "-fulltext.txt"
auk_gephi = auk_fp + coll_id + "-gephi.gexf"
auk_graphml = auk_fp + coll_id + "-gephi.graphml"
auk_domains = auk_fp + coll_id + "-fullurls.txt"
auk_filtered_text = auk_fp + coll_id + "-filtered_text.zip"

# The following script will attempt to install the necessary dependencies
# for the visualisations. You may prefer to install these on your
# own in the command line.
import sys
from collections import Counter

try:  # a library for manipulating column data
    import pandas as pd
except ImportError:
    !{sys.executable} -m pip install pandas
    import pandas as pd  # retry the import after installing

try:
    import matplotlib.pyplot as plt  # a library for plotting
except ImportError:
    !{sys.executable} -m pip install matplotlib
    import matplotlib.pyplot as plt

try:
    import numpy as np  # a library for numerical computing
except ImportError:
    !{sys.executable} -m pip install numpy
    import numpy as np

try:  # a library for natural language processing
    import nltk
except ImportError:
    !{sys.executable} -m pip install nltk
    import nltk

from nltk.tokenize import word_tokenize
from nltk.draw.dispersion import dispersion_plot as dp

# word_tokenize relies on the 'punkt' tokenizer models.
nltk.download('punkt')
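
Before running the rest of the notebook, it can help to confirm that the derivative files are where the setup window expects them. The following is a minimal sketch, not part of the AUK notebook itself: `missing_derivatives` is a hypothetical helper, and the example path simply repeats the full-text path built above.

```python
import os


def missing_derivatives(paths):
    """Return the entries of `paths` that do not exist as files on disk."""
    return [p for p in paths if not os.path.isfile(p)]


# Hypothetical check, reusing the values from the setup window.
coll_id = "4656"
auk_fp = "data/"
expected = [auk_fp + coll_id + "-fulltext.txt"]

for path in missing_derivatives(expected):
    print("Missing derivative file:", path)
```

If nothing is printed, every listed file was found. You can add the other derivative paths to `expected` in the same way.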

# Text Analysis

The following functions use the NLTK Python library to find the most frequently used words in the collection.

In [110]:
# You can change the value of `top` to get more results. 
top = 30

def clean_domain(s):
    stop_words = ["com", "org", "net", "edu"]
    ret = ""
    dom = s.split(".")
    if len(dom) <3:
        ret = dom[0]
    elif dom[-2] in stop_words:
        ret = dom[-3]
    else:
        ret = dom[1]
    return ret

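def get_textfile():
    """Return every token found in the text field of the full-text file."""
    tokens = []
    with open(auk_full_text) as fin:
        for line in fin:
            # Field 3 of each comma-separated line holds the page text.
            tokens += word_tokenize(str(line).split(",")[3])
    return tokens

def get_text_domains():
    """Return (cleaned domain, text) pairs, one per line of the file."""
    tokens = []
    with open(auk_full_text) as fin:
        for line in fin:
            split_line = str(line).split(',')
            tokens.append((clean_domain(split_line[1]), split_line[3]))
    return tokens

def get_text_years():
    """Return (year, text) pairs, one per line of the file."""
    tokens = []
    with open(auk_full_text) as fin:
        for line in fin:
            split_line = str(line).split(',')
            # Characters 1-4 of the first field are the crawl year.
            tokens.append((split_line[0][1:5], split_line[3]))
    return tokens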

def get_top_tokens(total=20):
    """Return the `total` most frequent tokens as (count, token) pairs.

    Tokens of three characters or fewer are skipped.
    """
    counts = Counter(get_textfile())
    pairs = [(count, token) for token, count in counts.items() if len(token) > 3]
    pairs.sort(reverse=True)
    return pairs[:total]

def get_top_tokens_by_year(total=20):
    """Return (year, [(token, count), ...]) pairs with the top tokens per year."""
    tokens = get_text_years()
    sep = {key: "" for key, value in tokens}
    for year, text in tokens:
        sep[str(year)] = sep[str(year)] + " " + text
    ret = [(key, Counter(word_tokenize(val)).most_common(total)) for key, val in sep.items()]
    return ret

def get_top_tokens_by_domain(total=20):
    """Return (domain, [(token, count), ...]) pairs with the top tokens per domain."""
    tokens = get_text_domains()
    sep = {key: "" for key, value in tokens}
    for domain, text in tokens:
        sep[str(domain)] = sep[str(domain)] + " " + text
    ret = [(key, Counter(word_tokenize(val)).most_common(total)) for key, val in sep.items()]
    return ret
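
The helpers above index into `line.split(",")`, which suggests that each line of the full-text file has the form `(crawl_date,domain,url,text)`. Under that assumption, splitting on every comma keeps only the part of the page text that comes before its first comma. The sketch below shows one possible alternative that limits the number of splits; `parse_fulltext_line` is a hypothetical helper, the sample line is made up, and the line format itself is an assumption inferred from the code rather than a documented specification.

```python
def parse_fulltext_line(line):
    """Split one full-text line into (crawl_date, domain, url, text).

    Assumes the line looks like "(crawl_date,domain,url,text)".
    Raises ValueError if the line has fewer than four fields.
    """
    line = line.strip()
    # Drop the surrounding parentheses if both are present.
    if line.startswith("(") and line.endswith(")"):
        line = line[1:-1]
    # maxsplit=3 stops after three commas, so commas inside the
    # text field are kept as part of the text.
    crawl_date, domain, url, text = line.split(",", 3)
    return crawl_date, domain, url, text


# Made-up sample line for illustration only.
sample = "(20150312,www.example.com,http://www.example.com/a,Hello, world)"
crawl_date, domain, url, text = parse_fulltext_line(sample)
print(crawl_date[:4], domain, text)
```

This still assumes that the URL field contains no commas. A URL with a comma in it would shift the remaining fields, so treat the result as a starting point rather than a robust parser.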


Once you have run the window above, you can use its functions like so:



In [None]:
# Get the set of available years in the collection 
set([x[0] for x in get_text_years()])

In [None]:
# Get a list of the top words in the collection
# (regardless of year).
get_top_tokens(top)

In [None]:
# Get a list of the top tokens, separated by year.
get_top_tokens_by_year(top)
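
Note that `get_top_tokens` skips tokens of three characters or fewer, while the per-year and per-domain functions count every token, so punctuation marks can show up in their results. A minimal sketch of one way to filter them out follows. `top_words` and its `min_length` parameter are hypothetical names introduced here, and the toy list stands in for real `word_tokenize` output.

```python
import string
from collections import Counter


def top_words(tokens, total=20, min_length=4):
    """Count tokens, skipping short tokens and punctuation-only tokens."""
    kept = [
        tok for tok in tokens
        if len(tok) >= min_length
        and not all(ch in string.punctuation for ch in tok)
    ]
    return Counter(kept).most_common(total)


# Toy token list standing in for the output of word_tokenize.
toy = ["News", ")", "News", "Sports", ".", "on", "----", "Sports", "News"]
print(top_words(toy, total=2))  # → [('News', 3), ('Sports', 2)]
```

The length threshold is a blunt instrument: it also removes short but meaningful words such as "BC", so adjust `min_length` to suit your collection.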

In [112]:
# Get a list of the top tokens, separated by domain.
get_top_tokens_by_domain(top)

[('bcdailybuzz',
  [(')', 8),
   ('Not', 8),
   ('Found', 8),
   ('404', 4),
   ('The', 4),
   ('requested', 4),
   ('URL', 4),
   ('was', 4),
   ('not', 4),
   ('found', 4),
   ('on', 4),
   ('this', 4),
   ('server', 4),
   ('.', 4),
   ('/thumbs/c3e0677be2f6.jpg', 1),
   ('/thumbs/575280c2a866.jpg', 1),
   ('/thumbs/6a5e80c11f3d.jpg', 1),
   ('/thumbs/eb55c88d287c.png', 1)]),
 ('nanaimodailynews',
  [('BC', 58638),
   ('News', 32906),
   ('Sports', 26951),
   ('World', 17705),
   ('Our', 17554),
   ('Business', 17512),
   ('Entertainment', 17502),
   ('Us', 14803),
   ('Classifieds', 14711),
   ('-', 13487),
   ('Nanaimo', 13461),
   ('Daily', 13125),
   ("'s", 10501),
   ('on', 10120),
   ('Canada', 9244),
   ('Games', 8830),
   ('Vancouver', 8823),
   ('Team', 8794),
   ('Letters', 8779),
   ('Home', 8777),
   ('Contact', 8776),
   ('Jobs', 8771),
   ('Advertising', 8768),
   ('Info', 8764),
   ('Opinion', 8764),
   ('Browse', 8758),
   ('Opinions', 8755),
   ('Poll', 8753),
   ('

In [None]:
# Create a dispersion plot, showing where the list of words appear
# in the text.
text = get_textfile()
dp(text, ["he", "she"]) # uses the nltk dispersion plot library (dp).
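
A dispersion plot marks, for each target word, the positions in the token list where that word occurs. The sketch below computes those positions for a toy token list without drawing anything. `word_offsets` is a hypothetical helper written for illustration and is not part of NLTK.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Toy token list standing in for the output of get_textfile().
toy = ["she", "said", "he", "left", "and", "she", "stayed"]
print(word_offsets(toy, ["he", "she"]))  # → {'he': [2], 'she': [0, 5]}
```

This sketch matches words case-sensitively, so "She" and "she" would be counted as different words.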

# Bibliography

Bird, Steven, Edward Loper and Ewan Klein (2009), *Natural Language Processing with Python*. O’Reilly Media Inc.

Archives Unleashed Project. (2018). Archives Unleashed Toolkit (Version 0.17.0). Apache License, Version 2.0.