Here's a quick little notebook where I chart my experiments with prepping texts for text mining. Primarily, this means stripping XML and headers. Let's get started!

In [103]:
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

In [104]:
from bs4 import BeautifulSoup

In [105]:
with open("C:/Users/ASUS/Documents/iPython/data/DREaM/withXML/traveltraining/1698 1698 - T C - The New atlas or Travels.xml", "r", encoding="utf8") as f:
    testFile = f.read()

In [106]:
print(testFile[:400])

<?xml version='1.0'?><EEBO><dreamheader>
  <tcpdata>
    <tcpid phase="2">A31298.hdr</tcpid>
    <fileDesc>
      <titleStmt>
        <title>The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America, thro' the most renowned parts of the world ... performed by an English gentleman, in nine years travel and voyages, more exact than ever.</title>
        <author>T. C.</author>
     


Let's try removing the XML tags.

In [107]:
soup = BeautifulSoup(testFile, "xml") # testFile is the source doc

for text in soup.find_all("TEXT"):
    print(text.string)

None
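
A likely reason for the None above: in Beautiful Soup, .string only returns text when a tag has a single child, and the TEXT element here has many. The sketch below runs on a made-up miniature document (the sample XML is hypothetical, and the "xml" parser assumes lxml is installed, as the cells above already do) and shows get_text() collecting the nested text instead.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the EEBO layout, for illustration only.
sample = (
    "<EEBO><dreamheader><title>Header title</title></dreamheader>"
    "<TEXTS><TEXT LANG='eng'><P>THE NEW ATLAS</P>"
    "<TRAILER>FINIS.</TRAILER></TEXT></TEXTS></EEBO>"
)

soup = BeautifulSoup(sample, "xml")
text_tag = soup.find("TEXT")

# .string is None because TEXT has more than one child element.
string_value = text_tag.string

# get_text() joins every nested string; the header lives outside TEXT,
# so it is not included.
body = text_tag.get_text(separator="\n")

print(string_value)
print(body)
```

Selecting the TEXT element first also sidesteps the header in this toy case, though I have not checked that against the real files.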


In [108]:
testSoup = BeautifulSoup(testFile, "xml")
testSoupText = testSoup.text
print(testSoupText[:400]," &&&&&&& ", testSoupText[-400:]) # have a peek



A31298.hdr


The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America, thro' the most renowned parts of the world ... performed by an English gentleman, in nine years travel and voyages, more exact than ever.
T. C.


1698.
London :
Printed for J. Cleave ... and A. Roper ...,

12251058
Wing C139.
Arber's Term cat. III 138.
57084
A31298






The New atlas, or, Travels and voyag  &&&&&&&   Questions, of where I had been, and the particulars of what I observed, at the Request of those that had done so well for me, and now so Lovingly received me, I took leisure time to Write what you have perused, hoping it will give my Country-Men as an entire satisfaction as I have had, in the undertaking and performing my Travels, and then no doubt but both of us will be well pleased.
FINIS.







Great! But the header is still in there. Let's try to remove it. First, it seems easiest to just select the text using tags that all the documents will have: for example, the closing </dreamheader> tag at the beginning and the closing </EEBO> tag at the end.

In [109]:
start = testFile.find("</dreamheader>")
end = testFile.find("></EEBO>")
testFileNoHead = testFile[start:end]

In [110]:
print(testFileNoHead[:400], " |||||||||||||||| ", testFileNoHead[-400:])

</dreamheader><TEXTS type='varded' set='cleaned' match='45' date='2015-06-10' version='2.0' metadata='dream'><TEXT LANG="eng">
<FRONT>
<DIV1 TYPE="title page">
<PB REF="1"/>
<PB REF="1" MS="y"/>
<P> THE NEW ATLAS: OR, Travels and Voyages IN Europe, Asia, Africa and America, <normalised orig="Thro'" auto="true">Through</normalised> the most Renowned Parts of the WORLD, VIZ.</P>
<P>From England to t  ||||||||||||||||  at the Request of those that had done so well for me, and now so Lovingly received me, I took leisure time to Write what you have perused, hoping it will give my Country-Men as an entire satisfaction as I have had, in the undertaking and performing my Travels, and then no doubt but both of us will be well pleased.</P>
<TRAILER>FINIS.</TRAILER>
<PB REF="126"/>
</DIV2>
</DIV1>
</BODY>
</TEXT></TEXTS


That seems to give the text that we want! Hurrah. Now let's remove the XML from our core text.

In [111]:
soupNoHead = BeautifulSoup(testFileNoHead, "xml") # testFileNoHead is the source doc

for text in soupNoHead.find_all("TEXT"):
    print(text.string)

In [112]:
testSoupNoHead = BeautifulSoup(testFileNoHead, "xml")
testSoupTextNoHead = testSoupNoHead.text
print(testSoupTextNoHead[:40000]," |||||||||||| ", testSoupTextNoHead[-400:]) # have a peek

  ||||||||||||  


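Hmm. I tried removing the header before removing the XML, but that...did not work. Looking at the slice printed above, it starts with a stray </dreamheader> closing tag and ends with a truncated </TEXTS, so it is no longer well-formed XML. That is probably why Beautiful Soup's XML parser comes back empty: maybe it can't pick anything up unless the tags are complete? Unsure.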
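
One way to test the incomplete-tag guess is to check whether a slice is still well-formed XML. The sketch below uses the standard library's xml.etree.ElementTree rather than Beautiful Soup (a swapped-in tool, chosen because it raises an error on malformed input instead of quietly returning nothing). The sample document is made up, and is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the EEBO layout, for illustration only.
sample = (
    "<EEBO><dreamheader><title>Header title</title></dreamheader>"
    "<TEXTS><TEXT><P>THE NEW ATLAS</P></TEXT></TEXTS></EEBO>"
)

def is_well_formed(fragment):
    """Return True if the fragment parses as a single XML element."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# Same slice as in the cells above: it keeps the stray </dreamheader>
# and drops the final ">" of </TEXTS>.
bad = sample[sample.find("</dreamheader>"):sample.find("></EEBO>")]

# Start just after </dreamheader> and stop just before </EEBO>.
head_end = "</dreamheader>"
good = sample[sample.find(head_end) + len(head_end):sample.find("</EEBO>")]

print(is_well_formed(bad))
print(is_well_formed(good))
```

If the real slice behaves like the toy one, moving the start index past the closing tag and ending at </EEBO> should be enough to make it parseable.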

In any case, does it matter if the headers are in there, if every single text will have nearly the same header? Hmm, yes, I think it will still be a problem, because they won't be exactly the same. Not sure if that will be enough to skew results.

What if we do a start/end with the plain text?

In [113]:
start = testSoupTextNoHead.find("tcpid.") # find() returns -1 if the marker is absent
end = testSoupTextNoHead.find("FINIS") # how to get to the end
testNoHeader2 = testSoupTextNoHead[start:end]

In [114]:
print(testNoHeader2[:200]) # Yup, doesn't look like there is anything in there.




In [115]:
print(testSoupText[:5000]," &&&&&&& ", testSoupText[-400:]) # have a peek



A31298.hdr


The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America, thro' the most renowned parts of the world ... performed by an English gentleman, in nine years travel and voyages, more exact than ever.
T. C.


1698.
London :
Printed for J. Cleave ... and A. Roper ...,

12251058
Wing C139.
Arber's Term cat. III 138.
57084
A31298






The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America, thro' the most renowned parts of the world
The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America, thro' the most renowned parts of the world ... performed by an English gentleman, in nine years travel and voyages, more exact than ever
New atlas

T. C.
unknown
0
0
93425138



1698
1698
London


Roper, Abel, 1665-1726
unknown
1665-01-01
1726-02-05
77718116


Röper, A.
unknown
0
0
194410204


Roper, Abel.
unknown
0
0
250424859




j. cleave



12251058
832951456
861624617
891541373
A31298

12251058
832951456
861624617
891541373
934251

In [116]:
if "as an attribute in tcpid." in testSoupText:
    param, value = testSoupText.split("as an attribute in tcpid.",1)
    print(value[:600])






2015-11-16








 THE NEW ATLAS: OR, Travels and Voyages IN Europe, Asia, Africa and America, Through the most Renowned Parts of the WORLD, VIZ.
From England to the Dardanelles, thence to Constantinople, Egypt, Palestine, or the Holy Land, Syria, Mesopotamia, Child, Persia, East-India, China, Tartary, Muscovy, and by Poland; the German Empire, Flanders and Holland, to Spain and the West-Indies; with a brief Account of Aethiopia, and the Pilgrimages to Mecha and Medina in Arabia, containing what is Rare and Worthy of Remarks in those vast Countries; relating to Building, Antiquities, Rel


Well, that kind of worked...but there must be a cleaner way to do it, and perhaps a way to get that date out of there? I hesitate to use the date as the splitting point, since it seems likely to change depending on the file (maybe? A cursory look at 5 random files suggests otherwise, but this seems like a *good question for Stefan*). It is awkward that the splitting can't be done until after the XML is stripped, although that makes sense considering how BS parses things and converts tags.
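
One possibly cleaner route, sketched under assumptions: since the header sits in its own dreamheader element, it could be dropped from the parsed tree before any text is extracted, so neither the date nor a sentence is needed as a splitting point. This uses the standard library's xml.etree.ElementTree instead of Beautiful Soup, on a made-up miniature document. body_text is a hypothetical helper, and whether every DREaM file has exactly this layout is untested.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the EEBO layout, for illustration only.
sample = (
    "<?xml version='1.0'?><EEBO><dreamheader><date>2015-11-16</date>"
    "</dreamheader><TEXTS><TEXT><P>THE NEW ATLAS</P>"
    "<TRAILER>FINIS.</TRAILER></TEXT></TEXTS></EEBO>"
)

def body_text(xml_string):
    """Remove any top-level dreamheader element, then collect the text."""
    root = ET.fromstring(xml_string)
    for header in root.findall("dreamheader"):
        root.remove(header)
    # itertext() yields every text node in document order.
    pieces = (piece.strip() for piece in root.itertext())
    return "\n".join(piece for piece in pieces if piece)

print(body_text(sample))
```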

Hypothetically, I should be able to put the stripping and then the header cuts into a loop and run my texts through them?
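
A rough sketch of what that loop might look like, assuming the header always lives in a top-level dreamheader element and using the standard library's xml.etree.ElementTree in place of Beautiful Soup. strip_file and prep_corpus are hypothetical names, and the folder paths in the usage comment are placeholders.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

def strip_file(xml_string):
    """Drop the dreamheader element and return the remaining text."""
    root = ET.fromstring(xml_string)
    for header in root.findall("dreamheader"):
        root.remove(header)
    return "".join(root.itertext()).strip()

def prep_corpus(in_dir, out_dir):
    """Write one .txt per .xml file; return the names of files that failed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for path in sorted(Path(in_dir).glob("*.xml")):
        try:
            text = strip_file(path.read_text(encoding="utf8"))
        except ET.ParseError:
            failed.append(path.name)  # keep going; inspect these by hand
            continue
        (out / (path.stem + ".txt")).write_text(text, encoding="utf8")
    return failed

# Usage (placeholder paths):
# failed = prep_corpus("data/withXML", "data/plain")
# print(len(failed), "files could not be parsed")
```

Collecting the failures instead of stopping means one malformed file doesn't halt a long run.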

So, as a wrap-up to this little experiment, what questions and thoughts do I have?

- How many resources is it going to take to prep all my data (stripping XML, stripping headers, etc.)?
- Is it better to use plain text (VARDed), considering that otherwise I'm going to have to strip not only these files, but also the ~40k files that I'll be running through machine learning?
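
For the resources question, one rough way to get a number is to time the stripping step on a small sample and scale up. The sketch below assumes the cost grows linearly with the number of files, which real files of very different sizes may not respect. estimate_total_seconds is a hypothetical helper, and the tiny repeated document is a stand-in for a sample of real files.

```python
import time
import xml.etree.ElementTree as ET

def estimate_total_seconds(sample_docs, n_total):
    """Time parsing + text extraction on a sample and scale linearly."""
    start = time.perf_counter()
    for doc in sample_docs:
        root = ET.fromstring(doc)
        "".join(root.itertext())  # the text-extraction step being timed
    elapsed = time.perf_counter() - start
    return elapsed / len(sample_docs) * n_total

# Stand-in sample; swap in the contents of a few dozen real files.
docs = ["<EEBO><TEXTS><TEXT><P>THE NEW ATLAS</P></TEXT></TEXTS></EEBO>"] * 50
print(estimate_total_seconds(docs, 40_000))
```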