# XQuery in Python: Process a whole collection at once on project XML files 

Source: <https://stackoverflow.com/questions/75142854/how-to-use-the-collection-function-with-saxonche> 

## New pip installs
* For this homework, we use the **pathlib** library so we can run XQuery across platforms on Mac *and* Windows. `pathlib` is part of Python's standard library (3.4 and later), so it does not need a `pip install`.
* You do need `saxonche`: in your Python environment for this series of notebooks, run `pip install saxonche` if you haven't already.
* If you didn't install spaCy before, be sure to `pip install spacy`.
* Also, if you didn't download the spaCy large language model, uncomment the line to download the spaCy large model to your virtual environment.



In [1]:
# !pip install saxonche
# pathlib is part of the standard library, so it needs no pip install
import os
import pathlib
from pathlib import Path
import spacy
import re as regex
# re lets us work with regular expressions in Python
from saxonche import PySaxonProcessor
from os import getcwd
# this lets us retrieve the current working directory

Remember the spaCy language models? Let's try loading the large one to get the maximum amount of information from it!
There's a lot we can experiment with from spaCy, so here's a link to the documentation for our ready reference:
<https://spacy.io/usage/spacy-101> 

We're going to start by just reviewing its POS (part of speech) and NER (named entity recognition) taggers to see what we can see in your project files.


In [13]:
# spacy.cli.download("en_core_web_lg")
# ONLY NEED THE ABOVE LINE ONCE. REMEMBER: COMMENT IT OUT AGAIN THE NEXT TIME YOU RUN THIS.
nlp = spacy.load('en_core_web_lg')

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Okay, let's explore some project files!
We've loaded the XML directory prepared by the Futurama team for our example here. 

* If you have some basic XML right now, like the Futurama team has prepared, we can easily zero in on tagged sections of your collection. Swap out the Futurama collection with yours, and adjust the Python code below accordingly.
* If you don't have XML at this point, you can work around this with plain text files, or just explore the Futurama collection.
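
If you are working from plain text files rather than XML, one possible workaround is to gather the text with `pathlib` and join it into a single string for spaCy. This is only a sketch of how that might look, not part of the original notebook: the function name `read_text_collection` and the sample file names are hypothetical.

```python
from pathlib import Path
import tempfile

def read_text_collection(input_dir):
    """Return the contents of every .txt file in input_dir as one string."""
    texts = []
    # sorted() keeps the file order stable across platforms
    for path in sorted(Path(input_dir).glob('*.txt')):
        texts.append(path.read_text(encoding='utf-8'))
    return '\n'.join(texts)

# Demonstration on a throwaway directory with two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'a.txt').write_text('first file', encoding='utf-8')
    Path(tmp, 'b.txt').write_text('second file', encoding='utf-8')
    combined = read_text_collection(tmp)

print(combined)
```

In a project you would pass your own directory name instead of the temporary one.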

In [3]:
# DEFINE SOME FILE PATHS FOR INPUT, AND (ONCE WE'RE READY) OUTPUT
InputPath = 'futurama-xml'
OutputPath = 'testOutput' 

The next cell demonstrates the XPath processor, set up to run over individual files one at a time.
Let's look at how it returns information about distinct values of speakers. We're exploring distinct-values() and count() functions here. Try removing them and putting them back to see what the effect of distinct-values() is on the count. 


In [5]:
def readTextFiles(InputPath):
    # This function uses XPath to read the XML input, one file at a time.
    # One Saxon processor is enough for the whole loop.
    with PySaxonProcessor(license=False) as proc:
        for file in os.listdir(InputPath):
            if file.endswith('.xml'):
                # read_text() opens and closes the file for us
                xml = (Path(InputPath) / file).read_text(encoding='utf-8')
                # ebb: Here we apply the Saxon processor to read files with XPath.
                xp = proc.new_xpath_processor()
                node = proc.parse_xml(xml_text=xml)
                xp.set_context(xdm_item=node)

                # From here on, we select the string that Python will send to NLP.
                # xpath = xp.evaluate('//your/xpath/here')
                xpath = xp.evaluate('//speak/@who => distinct-values() => sort()')
                count = xp.evaluate('//speak/@who => distinct-values() => count()')
                string = str(xpath)
                print(count)
                # xpath is going to go file by file, so we get one count per file.
             
readTextFiles(InputPath)

24
24
22
17
39
28
21
18
21
19
23
20
25
17
15
32
25
21
20
34
34
27
22
20
14
21
28
32
17
23
20
21
17
23
20
25
20
21
21
33
32
32
21
36
15
37
23
19
28
17
30
26
30
29
22
21
30
23
15
35
18
28
20
26
16
22
34
20
22
22
30
27
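
The effect of `distinct-values()` on `count()` in the cell above can also be seen on a tiny example without touching the project files. The sketch below uses only Python's standard library on a made-up XML snippet. The sample markup is invented for illustration, and `xml.etree.ElementTree` stands in for Saxon here; it is not what the notebook cells use.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, shaped like the <speak who="..."> elements above.
sample = """
<episode>
  <speak who="FRY">Hello</speak>
  <speak who="LEELA">Hi</speak>
  <speak who="FRY">Bye</speak>
</episode>
"""

root = ET.fromstring(sample)

# Similar in spirit to //speak/@who
all_speakers = [sp.get("who") for sp in root.iter("speak")]

# Similar in spirit to //speak/@who => distinct-values() => sort()
distinct_speakers = sorted(set(all_speakers))

print(len(all_speakers))       # counts every @who attribute, repeats included
print(len(distinct_speakers))  # counts each speaker name only once
```

Without the de-duplication step, the count reflects the number of speeches rather than the number of speakers.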


## Introducing XQuery! 
XQuery is what we want to use to help read data from across a whole directory, or "corpus" collection of files. 
XQuery can be written as a separate file (with an .xql or .xquery extension) in oXygen and run over a directory. But we will find it more useful 
to apply it in Python if we're working on natural language processing applications. 

### Setting up XQuery in Python 
We use the same "boilerplate" PySaxonProcessor lines, but switch from xpath to the xquery processor.


Requirements: 
* We need all the `xq` lines to plug this processing into Python. You can use this code as a starter for your projects.
* The XQuery script goes inside the `set_query_content()` function, which takes a quoted string, just like in the cell below.
* We need the `run_query_to_value()` function to execute the script we're writing.
  

* **Reading the collection**: We're setting this up with the `collection()` function, which reads a whole directory of XML files at once.
* **Writing XQuery**: This involves setting simple variables equal to XPath expressions. XQuery variable names start with a `$`.
* **XQuery Comments**: These look like sideways smiley faces: `(: I'm an XQuery comment :)`
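
The `collection()` call takes a relative URI, so the query needs a base URI to resolve it against, and that is what `set_query_base_uri()` supplies. The sketch below shows roughly what that resolution looks like. It uses `urllib.parse.urljoin` purely as a stand-in for illustration; that is an assumption on my part and not a claim about how Saxon resolves URIs internally.

```python
from pathlib import Path
from urllib.parse import urljoin

# Base URI for the current working directory. The trailing slash matters:
# without it, the last path segment would be dropped during resolution.
base_uri = Path('.').absolute().as_uri() + '/'

# Roughly how the relative collection URI resolves against that base.
collection_uri = urljoin(base_uri, 'futurama-xml/?select=*.xml')

print(base_uri)
print(collection_uri)
```

`Path.as_uri()` produces a well-formed `file:` URI on both Windows and Mac / Linux, which is why the notebook prefers it over concatenating `'file://'` with `getcwd()`.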

Let's take a look:



In [8]:
def xqueryOverFiles(InputPath):
    # This time, let's try XQuery over a collection of files:
    with PySaxonProcessor(license=False) as proc:
        print(proc.version)
        xq = proc.new_xquery_processor()
        # This only works on Mac / Linux: xq.set_query_base_uri('file://'+getcwd()+'/')
        xq.set_query_base_uri(Path('.').absolute().as_uri() + '/')
        xq.set_query_content('''
let $futurama := collection('futurama-xml/?select=*.xml')
let $speakers := $futurama//speak/@who => distinct-values() => sort()
let $count := count($speakers)
return $speakers

   (: ebb: I'm writing in an XQuery comment, and pointing out you can define and return any variable you want in the XQuery zone. :)
   (: ebb: Try changing this one to return $count instead of the $speakers variable. :)
   (: ebb: We're writing in a query syntax called FLWOR (pronounced "flower"), and every FLWOR requires a return statement at the end. :)
    
''')
        r = xq.run_query_to_value()
        print(r)  
                               
xqueryOverFiles(InputPath)

SaxonC-HE 12.4.2 from Saxonica
"5-EYES"
"AARON JR"
"AARON SR"
"AD ROCK"
"ADLAI"
"AHAB"
"AKI"
"ALBERT"
"ALBRIGHTBOT"
"ALEX"
"ALIEN"
"ALIENS"
"ALKAZAR"
"ALL"
"ALPHABOT"
"AMAZONIAN"
"AMAZONIANS"
"AMBASSADOR MOIVIN"
"AMY"
"AMY 1"
"AMY 420"
"ANDERSON"
"ANDREW"
"ANDY"
"ANGLEYNE"
"ANNOUNCER"
"ANNOUNCER #1"
"ANNOUNCER #2"
"ANTONIO"
"ARACHNEON"
"ARMY ROBOT"
"ARTHUR"
"ATILLA THE HUN"
"AUCTIONEER"
"AUDIENCE"
"AUSTRALIAN GUY"
"AUTOPILOT"
"BABE"
"BAILIFF"
"BAILLIF"
"BARBADOS SLIM"
"BARKER"
"BARMAN"
"BARRIERBOT #1"
"BARRIERBOT #2"
"BARTENDER"
"BARTENDERBOT"
"BEASTIE BOYS"
"BECK"
"BEE"
"BEELER"
"BEES"
"BENDER"
"BENDER 1"
"BENDER 1729"
"BENDER AND BECK"
"BENDER AND BENDER 1"
"BENDER FIGURINE"
"BENDING UNIT"
"BESERK"
"BETABOT"
"BETAMAX PLAYER"
"BIDDER #1"
"BIDDER #2"
"BIG BRAIN"
"BIG EARED MUTANT"
"BIG MOUTHED MUTANT"
"BIKE THIEF"
"BILL"
"BILLIONAIREBOT"
"BILLY"
"BLUE ELDER"
"BOLT"
"BONT"
"BOOTH VOICE"
"BOY"
"BRAIN #1"
"BRAIN #2"
"BRAIN #3"
"BRAIN BALL #1"
"BRAIN BALL #2"
"BRET"
"BROKERBOT #1"
"BROKERB

## What's the difference? 

Notice that when you return `$count`, you get just one count for the entire collection, not 72 different counts, one for the speakers in each file.
How can we use this? 

We can pull information from across the entire collection and find out literally who has the most speeches in the whole series. 
Take a look at this code. Don't worry if you don't know how to write it yet. I just want to show you for demonstration purposes! 
What's here is a nearly full "FLWOR statement", which stands for:

* For
* Let
* Where
* Order by
* Return

The only things required in a FLWOR statement are at least one `for` or `let` clause and a `return`. The others give you extra powers like we're seeing here.

**IMPORTANT: You're only allowed one return per FLWOR.**



In [9]:
def xqueryOverFiles(InputPath):
    # This time, let's try XQuery over a collection of files:
    with PySaxonProcessor(license=False) as proc:
        print(proc.version)
        xq = proc.new_xquery_processor()
        xq.set_query_base_uri(Path('.').absolute().as_uri() + '/')
        xq.set_query_content('''
let $futurama := collection('futurama-xml/?select=*.xml')
let $speakers := $futurama//speak/@who => distinct-values() => sort()
let $count := count($speakers)
for $sp in $speakers
    let $count := $futurama//speak[@who = $sp] => count()
    where $count > 100
    order by $count descending
    return ($sp || ':  ' || $count)
 
''')
        r = xq.run_query_to_value()
        print(r)  
                            

xqueryOverFiles(InputPath)


SaxonC-HE 12.4.2 from Saxonica
"FRY:  2684"
"BENDER:  2360"
"LEELA:  2117"
"FARNSWORTH:  990"
"ZOIDBERG:  581"
"AMY:  499"
"HERMES:  408"
"ZAPP:  336"
"KIF:  197"
"CALCULON:  114"


Notice how we can move **deliberately** in XQuery from information on the whole collection to information about individuals in the series.
The counts we're seeing are NOT based on files, but on info about each speaker! 
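
If you are more at home in Python, the FLWOR query above maps loosely onto a count, a filter, and a sort. The sketch below is only an analogy on made-up data: the list `speeches_who` is a hypothetical stand-in for `$futurama//speak/@who`, and the threshold is lowered from 100 so the tiny sample returns something.

```python
from collections import Counter

# Hypothetical stand-in for $futurama//speak/@who: one entry per speech.
speeches_who = ["FRY", "BENDER", "FRY", "LEELA", "FRY", "BENDER", "AMY"]

# let $count := ... => count()   -> tally the speeches for each speaker
counts = Counter(speeches_who)

threshold = 1  # the notebook query uses 100; lowered for this tiny sample

# for $sp in $speakers / where / order by ... descending / return
report = [
    f"{speaker}:  {count}"
    for speaker, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if count > threshold
]

print(report)
```

The single list comprehension plays the role of the single `return`: each speaker that survives the `where` filter contributes one string to the result.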

## For statements, measurements, order by, return


* Let's choose a character from Futurama who gets a LOT of speeches and use XQuery to pull all their speeches from across the entire collection into one return.
The next code block is written to show you how to get all the speeches of one character in the WHOLE collection. You can change this to any other character you wish.
* Let's go through each speech individually and measure it:
    * Try defining a new variable in the XQuery that uses the `string-length()` function, so you can measure the text of the speeches.

* We're going to write a `for` statement so we can evaluate each speech, and order our results based on how long the speeches are:
     * We'll use the XPath `string-length()` function to measure each speech
     * We'll work through an XQuery `for` and `order by`
     * Let's see if we can return the speeches from shortest to longest, then try adding the word `descending` to order them longest to shortest!


* YOUR TURN: Edit the code below to return some alternative information! 

In [10]:
def xqueryAndNLP(InputPath):
    # This time, let's try XQuery over a collection of files:
    with PySaxonProcessor(license=False) as proc:
        print(proc.version)
        xq = proc.new_xquery_processor()
        xq.set_query_base_uri(Path('.').absolute().as_uri() + '/')
        xq.set_query_content('''
let $futurama := collection('futurama-xml/?select=*.xml')
let $zoidbergSpeeches := $futurama//speak[@who="ZOIDBERG"]
let $zoidbergTextsOnly := $zoidbergSpeeches/text() 
for $zt in $zoidbergTextsOnly
   let $length := $zt ! string-length()
   where $length gt 10
   order by $length descending
   return $zt
 
''')
        r = xq.run_query_to_value()
        print(r)  
                            

xqueryAndNLP(InputPath)

SaxonC-HE 12.4.2 from Saxonica

I've got just the thing, genuine miracle 
cream I bought from a travelling salesman. 
"Come one come all," he said, "Step 
right up!" This sounds too good to be 
true I thought. He said I looked like 
a smart young man. "So is it a deal?" 
I enquired. Two hours later he was gone, 
with 60 of my dollars. But I have the 
miracle cream...

And the nominees for Best Actor are: 
Sir Lawrence...... in The Merchant Of 
Venus, Hive Mind Gamma 7X in Bikini 
Party Summer, the Soda Machine Robot 
in Bikini Party Summer, Mark Jones in 
How Beige Was My Jacket and, instead 
of the fifth guy - Calculon, for his 
powerhouse performance in The Magnificent 
Three.

Good evening ladies and germs.  That 
wasn't a joke, I was talking to Dean 
Streptococcus.  Now I'm not saying Professor 
Farnsworth is old, but if you consider 
his age he's likely to die soon!  Hey 
Ringo, that was the joke. Oh, it's showtime 
at the Apollo all over again.

So, now Zoidberg is big, huh? That

## Experimenting with some functions


Remember how we were going to apply some NLP from spaCy? The next part of this exercise is to take the output of this XQuery function and pass it to the spaCy language model!

We need the XQuery in our function above to return just one string if we want to deliver that to spaCy and NLP tools. The way I left this code, it's returning hundreds of strings.

* Find out **how many strings** the XQuery is returning. Modify the XQuery to show you this `count()`.
* To bundle them all together in one string, **add a `string-join()`** function.


* Then we're going to pass them on to other tools.
* So here's your challenge: Write some code that returns all the text of one speaker. (Or adjust the code to use in your project.)
* Then apply `string-join()` to make the XQuery return a single string.
* **Write some new code** that delivers this single string to be processed with spaCy in some way.
    * You can build your code in a new cell block below this one.
    *  You'll probably want to review [this spaCy NLP assignment](https://github.com/newtfire/textAnalysis-Hub/blob/main/python-nlp-exercise1.md#write-some-python-code-to-do-the-following)
    * Write your code to return any information of interest from spaCy, following the spaCy documentation as we did in that earlier assignment! Try looking for one or two different kinds of language. What are the most popular verbs, adjectives, nouns? Try out the NER (named entity recognition)...Write your code in your copy of this Jupyter Notebook to complete this exercise. 
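
The speech text in this collection is hard-wrapped, so a joined string carries line breaks and runs of spaces along with the words. One optional cleanup step is to collapse that whitespace in Python before handing the string to spaCy. This is a suggestion rather than a step the notebook takes, and the helper name `collapse_whitespace` is hypothetical.

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with a single space."""
    return ' '.join(text.split())

# Sample shaped like the hard-wrapped speech text in the collection.
raw = "\nNow open your mouth and let's have a \nlook at that brain.  No no no\n"
clean = collapse_whitespace(raw)

print(clean)
```

The XPath function `normalize-space()` does the same job if you would rather clean up inside the XQuery itself.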

In [11]:
def xqueryAndNLP(InputPath):
    # This time, let's try XQuery over a collection of files:
    with PySaxonProcessor(license=False) as proc:
        print(proc.version)
        xq = proc.new_xquery_processor()
        xq.set_query_base_uri(Path('.').absolute().as_uri() + '/')
        xq.set_query_content('''
let $futurama := collection('futurama-xml/?select=*.xml')
let $zoidbergSpeeches := $futurama//speak[@who="ZOIDBERG"]
let $zoidbergTextsOnly := $zoidbergSpeeches/text()
let $zcount := $zoidbergSpeeches/text() => count()
return $zcount
 
''')
        r = xq.run_query_to_value()
        print(r)  
                            

xqueryAndNLP(InputPath)

SaxonC-HE 12.4.2 from Saxonica
646


In [12]:
def xqueryAndNLP(InputPath):
    # This time, let's try XQuery over a collection of files:
    with PySaxonProcessor(license=False) as proc:
        print(proc.version)
        xq = proc.new_xquery_processor()
        xq.set_query_base_uri(Path('.').absolute().as_uri() + '/')
        xq.set_query_content('''
let $futurama := collection('futurama-xml/?select=*.xml')
let $zoidbergSpeeches := $futurama//speak[@who="ZOIDBERG"]
let $zoidbergTextsOnly := $zoidbergSpeeches/text()
let $zjoin := $zoidbergSpeeches/text() => string-join()
return $zjoin
 
''')
        r = xq.run_query_to_value()
        print(r)  
                            

xqueryAndNLP(InputPath)

SaxonC-HE 12.4.2 from Saxonica

Excellent, excellent!
Now open your mouth and let's have a 
look at that brain.  No no no no no 
not that mouth!
Really?
Young lady, I'm an expert on humans. 
Now pick a mouth, open it and say... 
What? My mother was a saint! Get out!
Thank you, I made them myself.
Goodbye.
The female Leela's problem is purely 
medical. Soon she will drop her eggs 
and they will hatch and all will be 
well.
The rotting carcass of a whale.
I'd like a jumbo squid log, please.
Alright, alright. Let me have one of 
your young on a roll.
Fine! Just give me something crawling 
with parasites.
Yes, please, popcorn!
 I'm not on trial here.
Stop! Stop! I admit it! My people ate 
them all! We kept saying "One more can't 
hurt" and then they were gone. We're 
sorry!
That stench. That heavenly stench!  
More!
More! More! More! More!
Alright.
Uh-oh, I shouldn't have had seconds.
They are mild. In fact, you're soaking 
in one right now.
A fancy dress gala! I'll wear my formal 
shell.


In [30]:
def xqueryAndNLP(InputPath):
    # This time, let's try XQuery over a collection of files:
    with PySaxonProcessor(license=False) as proc:
        print(proc.version)
        xq = proc.new_xquery_processor()
        xq.set_query_base_uri(Path('.').absolute().as_uri() + '/')
        xq.set_query_content('''
let $futurama := collection('futurama-xml/?select=*.xml')
let $zoidbergSpeeches := $futurama//speak[@who="ZOIDBERG"]
let $zoidbergTextsOnly := $zoidbergSpeeches/text()
let $zjoin := $zoidbergSpeeches/text() => string-join()
return $zjoin
 
''')
        r = xq.run_query_to_value()
        wordstrings = str(r)

        # start playing with spaCy and nlp:
        futuramaWords = nlp(wordstrings)
        for token in futuramaWords:
            if token.pos_ == "ADJ":
                print(token.text, " = ", token.pos_)
                            

xqueryAndNLP(InputPath)

SaxonC-HE 12.4.2 from Saxonica
Excellent  =  ADJ
excellent  =  ADJ
female  =  ADJ
medical  =  ADJ
jumbo  =  ADJ
young  =  ADJ
more  =  ADJ
sorry  =  ADJ
heavenly  =  ADJ
More  =  ADJ
More  =  ADJ
More  =  ADJ
More  =  ADJ
More  =  ADJ
mild  =  ADJ
fancy  =  ADJ
formal  =  ADJ
high  =  ADJ
denser  =  ADJ
hot  =  ADJ
pretty  =  ADJ
dead  =  ADJ
fine  =  ADJ
tight  =  ADJ
talented  =  ADJ
hideous  =  ADJ
third  =  ADJ
next  =  ADJ
cute  =  ADJ
least  =  ADJ
cold  =  ADJ
blooded  =  ADJ
humorous  =  ADJ
beautiful  =  ADJ
handy  =  ADJ
new  =  ADJ
Excellent  =  ADJ
excellent  =  ADJ
frisky  =  ADJ
More  =  ADJ
Enough  =  ADJ
More  =  ADJ
More  =  ADJ
More  =  ADJ
normal  =  ADJ
normal  =  ADJ
better  =  ADJ
Excellent  =  ADJ
excellent  =  ADJ
less  =  ADJ
nuts  =  ADJ
cloacal  =  ADJ
erotic  =  ADJ
unknown  =  ADJ
female  =  ADJ
genetic  =  ADJ
old  =  ADJ
scuttling  =  ADJ
bigger  =  ADJ
tough  =  ADJ
Same  =  ADJ
complete  =  ADJ
high  =  ADJ
few  =  ADJ
puny  =  ADJ
other  =  ADJ
giant  
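
Printing every adjective gives a long list with many repeats. One way to answer the "most popular" question is to tally the tokens instead. The sketch below is a possible approach, not the assignment's official solution: `most_common_by_pos` is a hypothetical helper, and it runs here on stand-in objects so it works without a spaCy model. It relies only on the `.text` and `.pos_` attributes that spaCy tokens expose.

```python
from collections import Counter
from types import SimpleNamespace

def most_common_by_pos(tokens, pos, n=5):
    """Count lowercased token texts whose coarse POS tag equals pos."""
    counts = Counter(tok.text.lower() for tok in tokens if tok.pos_ == pos)
    return counts.most_common(n)

# Stand-in tokens so the sketch runs without a spaCy model installed.
fake_doc = [
    SimpleNamespace(text="Excellent", pos_="ADJ"),
    SimpleNamespace(text="excellent", pos_="ADJ"),
    SimpleNamespace(text="brain", pos_="NOUN"),
    SimpleNamespace(text="More", pos_="ADJ"),
]

print(most_common_by_pos(fake_doc, "ADJ"))
```

With the notebook's model loaded, the call would look like `most_common_by_pos(nlp(wordstrings), "ADJ")`, and swapping `"ADJ"` for `"VERB"` or `"NOUN"` gives the other tallies.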