# Interacting with the Manuscript object in Python

If you are a little savvy with Python, you can interact directly with the Manuscript in a Python interpreter.
Open a Python interpreter, Jupyter Notebook, or IPython session in the manuscript-object directory and enter the following:

In [21]:
from manuscript import *
m = Manuscript(utils.ms_xml_path)

Generating Manuscript object from /mnt/c/code/mkp/manuscript-object/m-k-manuscript-data/ms-xml...
Generating entries from files in folder /mnt/c/code/mkp/manuscript-object/m-k-manuscript-data/ms-xml/tc...
Generated 0 entries.
Generating entries from files in folder /mnt/c/code/mkp/manuscript-object/m-k-manuscript-data/ms-xml/tcn...
Generated 0 entries.
Generating entries from files in folder /mnt/c/code/mkp/manuscript-object/m-k-manuscript-data/ms-xml/tl...
Generated 0 entries.


Now the Manuscript is held in memory with the variable name `m`. You can look at a particular entry like this:

In [None]:
e = m.entries['tl']['p005r_2']

You can then inspect its various attributes:

In [None]:
e.text
e.xml_string
e.properties

There are also several functions which are useful when interacting with entries:

In [None]:
find_terms(e.xml, "env")

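To give a feel for what a tag search like this involves, here is a minimal sketch using only the standard library. It is not the project's actual `find_terms` implementation: the name `find_terms_sketch`, the sample XML, and the choice to normalize whitespace are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def find_terms_sketch(root, tag):
    """Return the whitespace-normalized text of every <tag> element under root.

    Hypothetical stand-in for illustration only; the real find_terms may differ.
    """
    terms = []
    for element in root.iter(tag):
        # itertext() also gathers text from any nested child elements
        text = " ".join("".join(element.itertext()).split())
        terms.append(text)
    return terms


# Made-up sample markup, not taken from the manuscript data
sample = ET.fromstring(
    "<div><ab>Sand from <env>a river</env> and earth from "
    "<env>the   mountains</env>.</ab></div>"
)
print(find_terms_sketch(sample, "env"))
```

On the made-up sample, this collects the text of the two `env` elements and returns an empty list for a tag that does not occur.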
With a bit of Python, you can make complex queries about the manuscript this way.

In [None]:
for entry_id, entry in m.entries['tl'].items():
    # entry_id avoids shadowing Python's built-in id()
    if len(find_terms(entry.xml, "env")) > 0:
        print(entry_id)

Just like that, you have printed the ID of every entry that contains environment tags!

If we store the IDs and counts in two lists, we can plot the number of `env` tag occurrences by entry:

In [None]:
import matplotlib.pyplot as plt

ids = []
n_terms = []
for entry_id, entry in m.entries['tl'].items():
    terms = find_terms(entry.xml, "env")
    # keep only the entries that contain at least one env tag
    if len(terms) > 0:
        ids.append(entry_id)
        n_terms.append(len(terms))

plt.scatter(ids, n_terms)
plt.show()

![scatter plot](https://raw.githubusercontent.com/cu-mkp/manuscript-object/master/projects/visualizations/scatter.png)

With a little extra formatting, you have a visualization of roughly where `env` tags appear in the manuscript!
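Instead of reading the outliers off the plot, you could also rank the entries by tag count. The sketch below uses a small made-up dictionary in place of the real data; `terms_by_entry`, its IDs, and `rank_by_term_count` are hypothetical names introduced only for this example.

```python
# Hypothetical stand-in for {entry_id: find_terms(entry.xml, "env")};
# the IDs and terms here are invented for illustration.
terms_by_entry = {
    "p001r_1": ["river"],
    "p002v_3": [],
    "p003r_2": ["mountain", "field", "sea"],
}


def rank_by_term_count(terms_by_entry):
    """Return (entry_id, count) pairs, highest count first, skipping empty entries."""
    counts = {
        entry_id: len(terms)
        for entry_id, terms in terms_by_entry.items()
        if terms
    }
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_term_count(terms_by_entry)
print(ranking[0])  # the entry with the most tags
```

With the real object in memory, the dictionary could presumably be built from `m.entries['tl'].items()` using `find_terms`, as in the loop shown earlier.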

We see that entry 17r_1 has a ton of environment tags. Why is this?

In [None]:
e = m.entries['tl']['p017r_1']
e.title
e.properties["environment"]

So this is an entry discussing how gunners interacted with various environments in order to defend or attack them!
It looks pretty long. How many characters is it?

In [None]:
len(e.text)

Looks like a big number, but how much is that in context?

In [None]:
lengths = [len(entry.text) for entry in m.entries["tl"].values()]
average = sum(lengths) / len(lengths)
average

Wow! Compared to the average, this entry is super long! But that doesn't tell us anything about the actual distribution.

In [None]:
import math
sd = math.sqrt(sum((x - average)**2 for x in lengths) / len(lengths))
sd

Unsurprisingly, we have a pretty significant standard deviation.
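The same numbers can also be obtained from Python's standard `statistics` module. The sketch below uses a short made-up list in place of the real entry lengths and checks that `statistics.pstdev` (the population standard deviation, dividing by n) agrees with the manual formula used above.

```python
import math
import statistics

# Made-up lengths standing in for [len(entry.text) for entry in ...]
lengths = [120, 340, 560, 980, 14000]

average = statistics.mean(lengths)
sd = statistics.pstdev(lengths)  # population SD: divides by n

# The manual formula from the text, for comparison
manual_sd = math.sqrt(sum((x - average) ** 2 for x in lengths) / len(lengths))

print(average, sd, manual_sd)
```

Note that `statistics.stdev` would give a slightly larger value, because it computes the sample standard deviation and divides by n - 1 instead.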

In [None]:
(len(e.text) - average) / sd

So the length of entry 17r_1 is roughly 13 standard deviations above the average entry length in the manuscript!
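Measuring a value's distance from the mean in units of standard deviation is usually called a z-score. Here is a minimal sketch of that calculation as a reusable function; `z_score` is a hypothetical helper, and the sample list is made up so that the arithmetic is easy to check by hand (mean 5, population standard deviation 2).

```python
import statistics


def z_score(value, values):
    """Return how many population standard deviations value lies from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("standard deviation is zero; z-score is undefined")
    return (value - mean) / sd


# Made-up data: mean is 5 and population standard deviation is 2
sample_lengths = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(9, sample_lengths))  # (9 - 5) / 2
```

For the manuscript, the equivalent call would be `z_score(len(e.text), lengths)`, assuming `lengths` is the list built earlier.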

It's very easy to go from here to a simple histogram showing the length distribution:

In [None]:
plt.hist(lengths, bins=100)
plt.axvline(average, color="orange")
for x in range(1,14):
    plt.axvline(average + x*sd, color="purple", linewidth=0.5)

![histogram](https://raw.githubusercontent.com/cu-mkp/manuscript-object/master/projects/visualizations/hist.png)

The orange line is the mean; the purple lines mark one through thirteen standard deviations above it. That tiny blue blip around 14000 must be entry 17r_1.

Admittedly, this sort of statistic is not terribly informative on this kind of dataset, but the possibilities abound. Holding the entries in memory as a Python object makes interacting with the manuscript simple and powerful.