# Exploring NLP on project XML files 
Start with installs and imports.
You can begin on your local computer with a pip install.
* In the Python environment you've set up, run `pip install saxonche` (or `pip3 install saxonche`, as needed).

You should be able to run this notebook on your local computer: 
* Navigate to the Class Examples/Python directory in your Git Bash (Windows) or Terminal (Mac),
* Type in `jupyter lab` and press enter
* Then open the localhost address you're given in your web browser. 

In [2]:
!pip install saxonche
import os
import spacy
import re as regex
# re lets us work with regular expressions in Python
from saxonche import PySaxonProcessor
# You may need to pip install saxonche at the command line if the install doesn't work in the notebook here.
# This lets us use Saxon XPath parsers over XML files



Remember the spaCy language models? Let's try loading the large one to get the maximum amount of information from it!
There's a lot we can experiment with from spaCy, so here's a link to the documentation for our ready reference:
<https://spacy.io/usage/spacy-101> 

We're going to start by just reviewing its POS (part of speech) and NER (named entity recognition) taggers to see what we can see in your project files.
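
To make the POS and NER review concrete, here is a minimal sketch of two helpers that list what the taggers found. The names `pos_tags`, `named_entities`, and `demo`, and the sample sentence, are made up for illustration; `token.pos_`, `doc.ents`, and `ent.label_` are the standard spaCy attributes. The sketch assumes `en_core_web_lg` is already downloaded before `demo()` is called.

```python
def pos_tags(doc):
    """Return (text, coarse part-of-speech tag) pairs for every token in a spaCy Doc."""
    return [(token.text, token.pos_) for token in doc]


def named_entities(doc):
    """Return (text, entity label) pairs for every named entity in a spaCy Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def demo(text="Fry delivers pizza in New York."):
    # spaCy is imported here so the helpers above can be read and tried
    # even where the library or model is not installed yet.
    import spacy
    nlp = spacy.load("en_core_web_lg")  # assumption: the large model is available
    doc = nlp(text)
    print(pos_tags(doc))
    print(named_entities(doc))
```

Calling `demo()` in a notebook cell prints both lists, which is a quick way to see how the model labels a sentence before pointing it at a whole project file.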


In [6]:
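# spacy.cli.download("en_core_web_lg")
# Uncomment the line above the FIRST time you run this, to download the model.
# REMEMBER: COMMENT IT OUT AGAIN THE NEXT TIME YOU RUN THIS.
# If spacy.load() raises OSError [E050], the model has not been downloaded in this environment yet.
nlp = spacy.load('en_core_web_lg')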

OSError: [E050] Can't find model 'en_core_web_lg'. It doesn't seem to be a Python package or a valid path to a data directory.

Okay, let's explore some project files!
We've loaded the XML directory prepared by the Futurama team for our example here. 

* If you have some basic XML right now, like the Futurama team has prepared, we can easily zoom in on tagged sections of your collection. Swap out the Futurama collection for yours, and adjust the Python code below accordingly.
* If you don't have XML at this point, you can work with plain text files instead, or just explore the Futurama collection.
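
For the plain-text route, a minimal sketch of a reader might look like the following. The function name `read_plain_text_files` and the `.txt` extension are assumptions for illustration, not part of the class code; the function simply returns each file's contents so they can be passed to spaCy later.

```python
import os


def read_plain_text_files(input_path):
    """Return a {filename: text} dictionary for every .txt file in input_path."""
    texts = {}
    for name in sorted(os.listdir(input_path)):
        if name.endswith('.txt'):
            # The with-block closes each file as soon as it has been read.
            with open(os.path.join(input_path, name), encoding='utf-8') as f:
                texts[name] = f.read()
    return texts
```

With a directory of text files in place, `read_plain_text_files('my-text-dir')` (a hypothetical directory name) gives you strings you can explore the same way as the XPath results.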

In [4]:
# DEFINE SOME FILE PATHS FOR INPUT, AND (ONCE WE'RE READY) OUTPUT
InputPath = 'futurama-xml'
OutputPath = 'testOutput' 

Now, here are some functions to: 
* read input files
* pull from the XML elements with some simple XPath
* run stuff through spaCy's NLP

Read (and adapt) the functions in the following cell from the bottom up.

In [8]:
def readTextFiles(InputPath):
    # This function uses XPath to read the XML input
    for file in os.listdir(InputPath):
        if file.endswith('.xml'):
            filepath = os.path.join(InputPath, file)
            with PySaxonProcessor(license=False) as proc:
                # Read the file inside a with-block so the file handle is closed promptly.
                with open(filepath, encoding='utf-8') as f:
                    xml = f.read()
                # ebb: Here we apply the Saxon processor to read files with XPath.
                xp = proc.new_xpath_processor()
                node = proc.parse_xml(xml_text=xml)
                xp.set_context(xdm_item=node)

                # From here on, we select the string that Python will send to NLP. 
                # xpath = xp.evaluate('//your/xpath/here')
                xpath = xp.evaluate('(//speak[@who = "BENDER"]) => string-join()')
                # //speak[@who] finds every speech that names its speaker, and //speak/* finds
                # every child element inside a speech (in this collection, the tagged adverbs).
                string = str(xpath)
                print(string)
                
                
readTextFiles(InputPath)


Lousy stinking rip-off!  Well I didn't 
have anything else planned for today. 
Let's go get drunk!
I don't need to drink, I can quit anytime 
I want!  So they made you a delivery 
boy, huh? Man, that's as bad as my job.
I'm a bender. I bend girders, that's 
all I'm programmed to do.
You kidding? I was a star! I could bend 
a girder to any angle: 30 degrees, 32 
degrees, you name it! (unsure) 31. (normal) 
But I couldn't go on living once I found 
out what the girders were for.
Suicide booths!  Well, Fry, it was a 
pleasure meeting you, I'm gonna go kill 
myself.
You really want a robot for a friend?
Well, OK. But I don't want people thinking 
we're robo-sexuals, so if anyone asks, 
you're my debugger.
I'm not looking!
We can hide in here, it's free on Tuesdays.
Oh, we're trapped!
Dream on, skin tube. I'm only programmed 
to bend for constructive purposes. What 
do I look like, a de-bender?
I'll have to check my program...yep.
You're full of crap, Fry!  You make 
a persuasive argument,
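
The printed speeches keep the hard line breaks and doubled spaces from the source files. Before sending a string like this to spaCy, it may help to collapse that whitespace. This is a minimal sketch using the `regex` alias imported at the top of the notebook; `normalize_whitespace` is a hypothetical helper name, not part of the class code.

```python
import re as regex


def normalize_whitespace(text):
    """Collapse every run of spaces, tabs, and line breaks into one space."""
    return regex.sub(r'\s+', ' ', text).strip()


sample = ("Lousy stinking rip-off!  Well I didn't \n"
          "have anything else planned for today. \n"
          "Let's go get drunk!")
print(normalize_whitespace(sample))
# Lousy stinking rip-off! Well I didn't have anything else planned for today. Let's go get drunk!
```

Inside `readTextFiles`, the same call could be applied to `string` before it is printed or passed to `nlp()`.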