#Implementation
I have done my best, through my thought processes and the documentation below, to follow the principles for digital visualization set out in the [London Charter](http://www.londoncharter.org/fileadmin/templates/main/docs/london_charter_2_1_en_edits.pdf).
To be honest, this section reads to me more like a statement that people either agree or disagree with, one which governs the whole of the project. I'm not sure what other documentation belongs in this section.
For clarification, and to aid the understanding of this document, please refer below to the questions I began with and those I ended with.
##Beginning Questions
I am going to use the papers from 1897-1902 (6 years) to see if there is an increase in French/English conflict in Canada from before to after the Second Boer War begins in South Africa, when there is a difference of opinion between the two groups about supporting the British troops. This will be especially interesting because, without having done any research at this point, I suspect the Shawville Equity, as an English paper within Quebec, will have an interesting take on the matter.
##Ending Questions
How often are well-known people vs. regular people mentioned?
How might this speak to the function or readership of the paper?
How frequently are locations mentioned? Does this speak to the relative "world" they lived in? E.g. closer contact with locals - what they were interested in?
Are some items sold more than others? What and why?
#Aims and Methods
1. Wget
I needed to download potentially hundreds of newspapers within a fairly short amount of time. I could have gone to the hosting website and individually downloaded them, but that would have taken a long time and been unnecessary manual labour.
I chose to use wget, a command line tool that pulls information off the web and stores it neatly in a folder on your computer. This was useful because it would be semi-automatic, require little external monitoring, and, with the limits and restrictions built into the command, be minimally invasive for the hosting website. Its use fit my purposes well, and with the download restrictions correctly applied I ran into no practical limitations.
2. Regex with Python
I wanted to clean up the OCR text files I downloaded and identified regex as a simple way to do that. Regex lets you write patterns that a program can then search for and use to identify and change matching text. I also identified Python as a programming language that would let me write or modify a program to run through the regex with minimal interference. Using these two tools together was appropriate and was suggested by a tutorial; however, I had difficulty understanding the script and abandoned it.
Later on, I did use regex to clean up simple symbols that the .xml file would not read, such as stray unicode characters, "*", and "&". I did not use a Python program to do this, however, because not all the OCR mistakes took place within the same context, and it was important that I maintain the document's authenticity (though not its accuracy) by replacing symbols with things that made sense, or at least did not harm the integrity of the paper.
I later wanted to use Python to isolate and count the tags generated from the TEI. This was an appropriate use of Python because I would have been able to create many regular expressions and then order them in a script for execution. By running the program and searching for the tags, the information I wanted would have been easily displayed, and I could have used it as a basis for the deep reading this project would have done had it continued. Unfortunately, my brother and I were unable to make the Python program do exactly what I wanted it to.
3. RStudio
I tried to use RStudio to begin some topic modelling and data analysis. RStudio is a statistical and programming package that lets you sort, group, and analyse data. It would have been appropriate for the next stage of the research, but it was not appropriate at this point because it needed a .csv file, which I did not have. As well, it turned out that the data was too messy to be ethically used for data analysis at this stage, and its use would have been methodologically questionable because it would not have actually answered the questions I had at that point either.
4. TEI
I needed to do a significant amount of work cleaning up the data and making it presentable before I could do any deep work with it. The alternative was manually going through 1800 lines of text and trying to figure it out: changing words to be spelled correctly, identifying misplaced characters, and removing symbols that the .xml file could not read. I chose to use TEI because it is a universally-accepted markup language, has easy-to-follow conventions, and would allow for further work. Specifically, the tagging would allow me to categorize people, places, things, ideas, and events. Crucially, it would do this without significantly altering the text itself, which I was reluctant to do because of the ethical and methodological complications of altering digitally-rendered text and presenting it to the reader/researcher as unbiased text. TEI would also allow the text to be placed on the web and quickly searched and categorized by others. I found this method of cleanup to be a compromise between "fixing" poorly rendered OCR and maintaining my commitment to refrain from modifying the materials as much as possible. This way, I could search the tags for the information I wanted without worrying about spelling mistakes. In addition, the tags allowed me to add information on top of the text to aid comprehension and provide a starting point for future research on any of the major questions I had throughout this project.
5. Google's OpenRefine
OpenRefine is a tool, run through the web browser, that allows for simplified data cleanup. Primarily, my intent was to use it to identify words that were spelled or OCR'd incorrectly but were most likely the same word. OpenRefine allows for easy regrouping of these types of terms and provides a count of instances, a feature that would have helped in determining which sale items were sold most frequently, among other things. It would have been most appropriate to use this tool before any major data cleaning (I ended up using TEI instead), but I could not use it at that stage because the newspaper source was not well suited to being turned into a .csv file and analysed by column. By the time I did have information in .csv format that would have made sense to put through OpenRefine, I had already finished the TEI work and no longer needed any extensive data cleaning.
6. Voyant
Within Voyant, I used Cirrus, Links, and Terms. These are all slightly different ways of analysing and presenting word frequencies, which I used to do preliminary research on my questions about the form and function of the paper. Voyant was a good choice because it is highly visual: it would not only help me distill the paper into "top #" lists, but also act as a visual aid for the blog posts I was going to write to present the assignment. I also found Voyant intuitive and easy to use, with a slider to increase or decrease the number of terms analysed. Significantly, Voyant allowed me to confirm what I had already observed through the close reading of the paper I had done during the encoding phase of the TEI work. It confirmed the main topics of the newspaper and affirmed that the questions I asked about who was represented most frequently in the paper, or which cities were most important to the people of Shawville, were valid. The Links tool, which allowed me to see the connections between different words, revealed more nuanced levels of analysis and alerted me to flaws in my initial suppositions and research methodology. For example, I was pleased to notice that John was a common name, but on closer analysis I realised that there were multiple people named John, and it was not possible to use the Links tool to determine which Johns were truly linked to the connections that the tool suggested. Thus, I had to think critically about my approach and modify my assumptions and methodology, as well as present these thoughts in writing for those following along with my blog posts.
7. Catmandu
Catmandu was the program the professor suggested I use to search the tags in the .xml file rather than the newspaper text itself. I wanted to do this because the tags allowed me to categorize the people, places, and sale items in the paper. This was vital to my research questions because, as much as I was interested in what the paper said, I was also interested in how its format and content changed over time. I wanted to know about the relationships between things in the paper and between other aspects of 1897 life in Shawville and the surrounding region. Thus, I wanted to know how many cities were local vs. international, how many people could be considered wealthy political or business elites vs. ordinary residents, and how many of the huge January sale items were seasonal. Ultimately, I wanted to do this comparison over time and see the changes a globalizing world was making on Shawville and the Ottawa region. From what I could see, and because the professor suggested it for this purpose, I was confident that it was the right program. However, I had difficulty running it and decided to try using Python to do something similar.
8. Blogging
I decided to present the findings of my digital history project to date through a series of blog posts. I did this because much of the class reading we did was blogs, and I found that the shorter and more personal nature of the posts was easier to read and more engaging. This is especially true for new digital history work, which I found extremely complicated and difficult to grasp. By breaking the blog posts up so that they usually covered only one day, idea, and new tool, I hoped to make them easier to digest and more engaging. I also found that blog posts allowed for easy insertion of links and media (I used pictures of my findings and screenshots). This increased the potential level of comprehension by providing visuals and reduced the amount of background knowledge I had to provide, because I could include useful links. Finally, the blog format makes the material freely accessible on the internet and allows the reader to interact with the material and the author through the comment section. These are principles that are integral to the field of digital history and which set it apart from academic history.
#Research Sources
**The following is a list of the sources, tutorials, and support I received in completing this project (in the order they appear in the text).** *Please also note that, to the best of my knowledge, this list is complete.*
##Tutorial and Assistance
http://workbook.craftingdigitalhistory.ca/
https://programminghistorian.org/lessons/cleaning-ocrd-text-with-regular-expressions
http://workbook.craftingdigitalhistory.ca/supporting%20materials/topicmodel-r-dhbox/
http://workbook.craftingdigitalhistory.ca/supporting%20materials/tei/
https://github.com/craftingdigitalhistory/module3-wranglingdata/blob/master/tei-hist3907/blanktemplate.txt
https://github.com/claremaier/Final_Project/blob/master/collections.banq.qc.ca:8008/jrn03/equity/src/1897/01/14/83471_1897-01-14.txt
http://regexr.com/
https://github.com/craftingdigitalhistory/module3-wranglingdata/blob/master/tei-hist3907/000style.xsl
https://hist3814o.slack.com/files/dr.graham/F6P7GSXM1/Untitled.r
http://historyinthecity.blogspot.ca/2013/12/corpus-linguistics-for-historians.html
http://workbook.craftingdigitalhistory.ca/supporting%20materials/open-refine/
https://hist3814o.slack.com/archives/C0GDSE1B8/p1503065858000029
https://librecatproject.wordpress.com/2014/12/04/day-4-grep-less-and-wc/amp/
http://www.londoncharter.org/fileadmin/templates/main/docs/london_charter_2_1_en_edits.pdf
Isaac Maier
##People
http://ottwatch.ca/meetings/file/264178
http://www.biographi.ca/en/bio/hay_george_13E.html
http://www.biographi.ca/en/bio/bryson_george_12E.html
http://hwtproject.ca/directory/fraser/
http://www.biographi.ca/en/bio/mather_john_13E.html
https://www.geni.com/people/David-MacLaren/6000000000407790682
https://books.google.ca/books?id=IZFXAAAAMAAJ&pg=PA248&lpg=PA248&dq=sj+mcnally+ottawa&source=bl&ots=jUoOqXoBYi&sig=55PDw9IKjON8xpqFN8qJmbzDljA&hl=en&sa=X&ved=0ahUKEwi0tMWovdLVAhUL7YMKHbYgBWwQ6AEIPzAF#v=onepage&q=sj%20mcnally%20ottawa&f=false
http://www.glengarrycountyarchives.ca/Glengarry_pdf/The-Glengarry-News/1892-1900/1897/Jul/07-30-1897.pdf
https://gist.github.com/shawngraham/8323899898cf016d5829f68394e63699
https://www.newspapers.com/newspage/48094345/
https://www.royal.uk/victoria-r-1837-1901
https://www.whitehouse.gov/1600/presidents/grovercleveland22
https://www.newspapers.com/newspage/43432598/
http://www.biographi.ca/en/bio/hays_charles_melville_14E.html
##Places
http://town.shawville.qc.ca/web/
http://www.municipalitepontiac.com/en/
http://www.pembroke.ca/
http://arnprior.ca/town/
http://campbellsbay.ca/
http://bristolmunicipality.qc.ca/
http://www.thecanadianencyclopedia.ca/en/article/hull/
https://www.cityofnorthbay.ca/
http://www.heritagepontiac.ca/hhf.htm
http://www.dixville.ca/
http://www.renfrew.ca/
https://www.limerick.ie/
https://en.wikipedia.org/wiki/Nepean,_Ontario
https://en.wikipedia.org/wiki/Brooklyn
http://www.ny.gov/
https://en.wikipedia.org/wiki/Fort-Coulonge
https://en.wikipedia.org/wiki/L%27%C3%8Ele-du-Grand-Calumet,_Quebec
https://www.mindat.org/loc-256565.html
https://www.gov.mb.ca/
http://www.winnipeg.ca/interhom/
https://www.newarknj.gov/
https://www.ci.buffalo.ny.us/
##Medicines
http://www.hairquackery.com/historical-quackery/halls-hair-renewer.shtml
http://digital.lib.ecu.edu/20944
http://www.centerforinquiry.net/blogs/entry/warners_safe_cures/
http://oldnews.aadl.org/node/151451
http://oldnews.aadl.org/node/151120
http://www.centerforinquiry.net/blogs/entry/shilohs_consumption_cure/
http://www.asylumeclectica.com/garretdom/quackery/dreadfully.htm
https://en.wikipedia.org/wiki/Dr._Williams%27_Pink_Pills_for_Pale_People
https://news.google.com/newspapers?nid=37&dat=19000709&id=SIAdAAAAIBAJ&sjid=QykDAAAAIBAJ&pg=3546,2480365&hl=en
http://digging-history.com/2013/11/16/home-remedies-and-quack-cures-halls-catarrh-cure/
https://news.google.com/newspapers?nid=1633&dat=18970311&id=C586AAAAIBAJ&sjid=OyoMAAAAIBAJ&pg=373,15290730&hl=en
https://www.drugs.com/mtm/doans-pills.html
#Documentation
**The following documentation is my original research and process notes. I consciously structured them in a way that made sense to me, but incorporated the methodological, ethical, and technological challenges I was having with the project as it progressed. For clarity, I will restate the research questions I ultimately ended with at the beginning of this section.**
##Research Questions
**How often are well-known people vs. regular people mentioned?**
**How might this speak to the function or readership of the paper?**
**How frequently are locations mentioned? Does this speak to the relative "world" they lived in? E.g. closer contact with locals - what they were interested in?**
**Are some items sold more than others? What and why?**
###August 5, 2017
STEP 1: Downloading the Equity files 1887-1902
-in DH Box, created directory called Final
-in Final, created directory for Equity Papers, called Equity_Papers_1899
-I will put all my Equity papers in this folder
*I am going to use the papers from 1897-1902 (6 years) to see if there is an increase in French/English conflict in Canada from before to after the Second Boer War begins in South Africa, when there is a difference of opinion between the two groups about supporting the British troops. This will be especially interesting because, without having done any research at this point, I suspect the Shawville Equity, as an English paper within Quebec, will have an interesting take on the matter.*
*Perhaps this will not even be in the papers, in which case I will focus on whatever the paper focuses on.*
*I recognize that six years' worth of data is not a large sample, but I hope that the change will be great enough to notice over the six years. I also made this decision practically, because I do not have unlimited internet and did not want to download a decade's worth of materials.*
####Fail
-I used the wget command from Module 2, modifying it so it would take the 1897 files. I received the following error:
claremaier@3bcf141bb995:~/Final/Equity_Papers_1899$ wget http://collections.banq.qc.ca:8008
/jrn03/equity/src/1897/ -A .txt -r --no-parent -nd âw 2 --limit-rate=20k
--2017-08-05 21:57:09-- http://collections.banq.qc.ca:8008/jrn03/equity/src/188397-A
Resolving collections.banq.qc.ca (collections.banq.qc.ca)... 198.168.27.56
Connecting to collections.banq.qc.ca (collections.banq.qc.ca)|198.168.27.56|:8008... connec
ted.
HTTP request sent, awaiting response... 404 Not Found
2017-08-05 21:57:09 ERROR 404: Not Found.
--2017-08-05 21:57:09-- http://.txt/
Resolving .txt (.txt)... failed: Name or service not known.
wget: unable to resolve host address ‘.txt’
idn_encode failed (1): ‘String preparation failed’
idn_encode failed (1): ‘String preparation failed’
--2017-08-05 21:57:09-- http://%C3%A2%C2%80%C2%93w/
Resolving â\302\200\302\223w (â\302\200\302\223w)... failed: Name or service not known.
wget: unable to resolve host address ‘â\302\200\302\223w’
--2017-08-05 21:57:09-- http://2/
Resolving 2 (2)... 0.0.0.2
Connecting to 2 (2)|0.0.0.2|:80... failed: Invalid argument.
*I returned to the tutorial to see what I did wrong. It appears that I had to remove the "-A" and that I had somehow accidentally entered "âw" (a mis-encoded dash) instead of just "-w".* I replaced that command with:
wget -r --no-parent -w 2 --limit-rate=20k http://collections.banq.qc.ca:8008/jrn03/equity/src/1897/
-It appeared to be working, but the files were downloading very slowly, and when opened they were just the original website pages. Not what I wanted, so I cancelled the download and tried again, this time without DH Box, on my local computer.
I tried: wget -r --no-parent -w 2 --limit-rate=20k http://collections.banq.qc.ca:8008/jrn03/equity/src/1897/ -A .txt
*This finally worked - I am downloading each of the six years independently and verifying that the files are complete and placed in the correct folders.*
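To help with that verification, here is a quick sketch of how the downloaded issues could be counted per year. The base path is an assumption about where wget mirrors the site on my machine:

```python
from pathlib import Path

# Sketch: count the downloaded .txt issues for each year to check that the
# downloads are complete. The base path is an assumption about where wget
# mirrored the site locally.
base = Path("collections.banq.qc.ca:8008/jrn03/equity/src")

for year in range(1897, 1903):  # 1897 through 1902
    issues = list((base / str(year)).rglob("*.txt"))
    print(year, "-", len(issues), "issues downloaded")
```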
###August 6, 2017
Step 2: Regex with Python **Attempt 1: Fail**
-I wanted to clean up the OCR data before I analysed it, so I looked at the old exercises and identified TEI. That exercise looked like it was for one document.
I found Regex and saw the link for the [Cleaning OCR'd Text with Regular Expressions](https://programminghistorian.org/lessons/cleaning-ocrd-text-with-regular-expressions)
I opened a text file and started to work off the python script provided in the tutorial. I changed the names of files, copied and pasted portions and tried to understand it.
When I ran it in the terminal, I received error message after error message.
They all basically looked like this:
clare@clare-fun1:~/School/Final_EquityProject$ python PthyonScript.py
File "PthyonScript.py", line 19
nodash = re.sub('.(-+)', ',', line)
^
IndentationError: expected an indented block
My brother explained that indents need to be very precise, so I set about making sure they all lined up properly.
It still didn't work, so I just copied and pasted the whole script in and changed the file names.
Somehow the terminal was working off a different version of my file, PthyonScript.py (which I had also misspelled), and wouldn't pick up the changes.
When I made sure all the files were in the right place, I ran it again. It correctly filled the csv, but I realised I wanted a txt file. I ran it again and wrote the changed text into a text file.
When I opened the file, I found that the script had deleted 1700 lines of text, leaving me with 14.
**What I learned**
I learned that python scripts are very exact and that I need to be very careful when I build them.
I learned that carefully naming and locating files is key to getting work done quickly and without unnecessary frustration.
I learned that it is a bad idea to use other people's stuff without understanding it and making sure it works for my project.
I was reminded of how important it is to make backups (glad that I did).
**Next steps**
Carefully examine several sample txt files, noting where I think the most common OCR mistakes are, and then use python to carefully write useful regex expressions (see the sketch below).
My goal is to clean up the whole library (not sure if this is possible or if each file needs to be done individually).
Then I want to do some topic modeling and see what else seems useful.
**python script is in 1897/01/07**
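For next time, here is a minimal sketch of the overall shape I think such a script needs to take. This is not the tutorial's script or my own; the output filename is a placeholder and the only substitution shown is the nodash line from my error message above:

```python
import re

# Sketch of the general shape of an OCR-cleanup script, with the indentation
# that tripped me up done correctly. The only substitution shown is the
# "nodash" line from the tutorial; real cleanup would add more re.sub calls.
infile = "83471_1897-01-14.txt"            # one downloaded Equity issue
outfile = "83471_1897-01-14_cleaned.txt"   # placeholder output name

cleaned = []
with open(infile, encoding="utf-8") as f:
    for line in f:
        # replace any character followed by a run of dashes with a comma
        nodash = re.sub('.(-+)', ',', line)
        cleaned.append(nodash)

with open(outfile, "w", encoding="utf-8") as f:
    f.writelines(cleaned)
```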
###August 7, 2017
Step 3: RStudio
I attempted to try out the readability of some of the Equity papers.
Because it was a new DH Box, I had to install rJava and Mallet again.
I ran the following commands, modified from their original [tutorial](http://workbook.craftingdigitalhistory.ca/supporting%20materials/topicmodel-r-dhbox/)
documents <- mallet.read.dir("/home/claremaier/Equity_Papers1897") >
I then tried to run
topic.model$loadDocuments(mallet.instances),
but kept receiving this error:
Error in lapply(list(...), ._java_valid_object) :
object 'mallet.instances' not found
I continued having similar errors:
documents <- read.csv(text = x, col.names=c("Article_ID", "Newspaper Title", "Newspaper City", "Newspaper Province", "Newspaper Country", "Year", "Month", "Day", "Article Type", "Text", "Keywords"), colClasses=rep("character", 3), sep=",", quote="")
Error in textConnection(text, encoding = "UTF-8") : object 'x' not found
> topic.model$loadDocuments(mallet.instances)
Loading required package: rJava
Error in topic.model$loadDocuments :
$ operator not defined for this S4 class >
-I got very confused and didn't know exactly what I was doing. Then I realized that I wasn't working with a csv file, which it appears you need in order for the program to work. I started thinking about making a csv and realized that it needed clear headings. My documents did not have clear headings. I tried to think of clear headings, but realized the data was too messy.
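Still, for what it is worth, here is a minimal sketch of how a .csv with the tutorial's column headings might be assembled once the data is cleaner. The folder path and output filename are assumptions, and the dates are parsed from filenames like 83471_1897-01-14.txt:

```python
import csv
import glob
import os

# Sketch: build a .csv with the tutorial's column headings from the downloaded
# issues. Each issue is treated as one row, with the issue number standing in
# for Article_ID; fields I cannot derive are left blank.
rows = []
for path in sorted(glob.glob("Equity_Papers1897/*.txt")):
    name = os.path.splitext(os.path.basename(path))[0]  # e.g. 83471_1897-01-14
    article_id, date = name.split("_", 1)
    year, month, day = date.split("-")
    with open(path, encoding="utf-8") as f:
        text = f.read()
    rows.append([article_id, "The Equity", "Shawville", "Quebec", "Canada",
                 year, month, day, "", text, ""])

with open("equity_1897.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Article_ID", "Newspaper Title", "Newspaper City",
                     "Newspaper Province", "Newspaper Country", "Year",
                     "Month", "Day", "Article Type", "Text", "Keywords"])
    writer.writerows(rows)
```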
I went back to my original plan of cleaning the data, but this time decided to use regex. I could not find what felt like enough consistent errors to justify writing a regex expression.
I looked for other things to work on, because right then I felt pretty overwhelmed and worried. I found the TEI tutorial (which I wasn't able to work through the first time) and noted the Prof's note that doing this with an Equity file would be appropriate.
**the current plan is to use the TEI tutorial to markup an Equity file for searching. I would also like to take it a few steps further, but I'm not sure how right now.**
*I'm still having trouble understanding the project requirements - re: research and how it should look. The whole thing scares me because I'm having trouble conceptualizing it, so I'm going to just focus on the TEI and documenting it well at this point.* **Perhaps OpenRefine could help remove some of the incorrectly spelled text.**
Step 4: TEI
I am following the tutorial from class, making changes as necessary.
[Tutorial](http://workbook.craftingdigitalhistory.ca/supporting%20materials/tei/)
I used the [blanktemplate.txt](https://github.com/craftingdigitalhistory/module3-wranglingdata/blob/master/tei-hist3907/blanktemplate.txt) before adding and modifying it.
Between <body> and </body>, I pasted in the 1800 lines of Equity file [83471_1897-01-14.txt](https://github.com/claremaier/Final_Project/blob/master/collections.banq.qc.ca:8008/jrn03/equity/src/1897/01/14/83471_1897-01-14.txt)
I then marked the beginning of every heading and paragraph with <p> and ended each heading or paragraph with </p>.
Decisions
I had to make decisions about where to make the paragraph breaks throughout the text. In some cases, it wasn't clear, so I chose to enclose the unreadable text within its own section, so that clear portions continued to be clear. Note: I did not take this approach if the unclear sections were within clear sections.
Sometimes it is difficult to tell if two topics have been merged by the OCR, so I do my best to group it all together - can manually go through it later
<p>1 hey are <>lltif remarks that the continued absence I Imi*ohtant to Farmer*.—L. D. Davis, being held in too school house and ®r,‘ of snow there is likely to prove a a**ri I ^ Shaw ville, has been operating a de* to till a 1«mg fuit want in that di«t our matter to the operations of lumber j horning machine in tins section for some trict. A church will likely be built there men jn that locality. time P‘#t with great success. All who
in the spring. | I have had their cattle dehorned are per-
yean, and 10 month*. Mis remains were ,,f th„ 0ttawa Hou.ti, win, will conduct
interred in Norway I$ay cemetery the the Young Hou.eat8.ncl Point n future.</p>
Observations
The OCR has mashed different advertisements and articles together, most likely as a result of too many small letters being placed close together and getting mistaken for one column of text.
Sometimes it is hard to figure out if the abbreviations are a result of the OCR, or if the original newspaper used them to save space.
Most of the entries are very small - have not yet come across the "large" feature articles we are currently used to
-there are some, but they are written as town meetings/ elections
*Perhaps this was before "feature articles" as we know them.*
This is painstaking work - increases admiration for librarians - so many decisions about where to put breaks - same if I tried to fix up the texts later - each decision has potentially huge consequences
**NO way that just this text can be relied on for any major data mining - it doesn't make much sense**
**It all seems rather subjective actually, and I don't see how I could properly explain every paragraph break I make**
Example: I thought this looked like a list of items on sale, so I separated it from what appears to be prizes for a contest, even though both sections contained indistinguishable characters.
<p>A large, fln«dy-enulpp*d. old established I lut Ion- NON! BETTER IN CANADA.
Bae-lnf##* Kdueatioo at LowofI PoudMe Graduates always eur#p -ful. Write Š oatnlo.ua W. J. Hl.LlOTT, Prlnolphi
Wrappers
Soap
Samuel Rogers, Pres
ïxjîTxwiwrï-six ilau».
DUNNS
BAKING
POWDER</p>
<p>me
as follows:
10 First Prizes, $100 Stsarns' Bicycle,! 1,000 26 Sececd " $25 Odd Watch Bleyolee and Watches given each month 1,625
Total given dur'gyear '97, $19,500
HOW TO For rules and full particulars, iiv tt Š v eee |hfl Toronto Ôlobb
ŠŠŠŠŠ</p>
Also base decisions on careful skimming - sometimes have to take context into consideration.
*A lot of in-depth and long medical potion adverts seem to be mixed up - perhaps they were side-by-side with condensed text
-they are different styles of writing (list, testimony, hyperbole, etc) and refer to different names (Kootenay, KarC Clover Root Tea, Dr. Williams' Medicine, etc)
###August 8, 2017
Continued TEI work
I'm not sure yet of the ethical restrictions on using this data, because all the stories about land valuation, medical breakthroughs, and huge store sales are intertwined - does this impact topic modeling?
This is why people have to be critical and include close reading of their texts - can't just begin with topic modeling
The beginnings and ends of sentences are good estimates of where paragraph breaks should go, although not every time. Some lines start mid-sentence. **Perhaps I can use OpenRefine or something to sort through**
The last 500 or so lines are written as a fantastical story (I think), so even when it doesn't make much sense, I skim a few lines and if it seems to be the same tone/story, I leave it - so it's a big chunk of text.
<p>Teacher—What is that letter? Pupil—I don't know.
Teacher—What is it that maker bon
Small boy )son of a manufacturer)-G lucoee.</p>
**I think this is a joke**
It's things like this that make it hard to know how to group them.
I finished what I hope was a good job with the <p> and </p>
####Fail-log
It was suggested that I not worry about trying to ensure everything was indented properly when adding the <p> and </p> tags, since there is a plugin that will do that automatically. I installed and tried to run the plugin when I was done with the initial markup.
The error indicated that the script couldn't be run because of the extra "<"'s spaced throughout the text because of poor OCR.
I realized I could use regex to find and replace these extra, useless characters. With my brother's help, I was able to construct a regex that found the "less than" symbol and, whenever it was connected to a character that wasn't "p" or "/", replaced it with the html version. This would allow it to be easily read and solve the problem.
*The regex looked like this: <([^p\/]), and was replaced by &lt;\1*
To ensure that it didn't accidentally replace the wrong thing, I did it one at a time.
I ran the plugin again, but this time the error identified the unicode and other "irregular" markings, such as "&", as problematic. I should have corrected for the "&" sign before changing the "<". Oh well.
The plan is to create regex expressions tomorrow to fix these new problems so that it can all be indented and move onto the next step.
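Thinking ahead, here is a minimal sketch of what those fixes might look like as a small Python script instead of one-at-a-time replacements by hand. The filenames are placeholders, and I have widened the less-than pattern so it would not touch real tags like <body> or the <?xml declaration:

```python
import re

# Sketch: escape the characters that break the XML before re-running the
# indenting plugin. Order matters: fix bare "&" first, so the "&lt;" entities
# added afterwards are not themselves re-escaped. Filenames are placeholders.
with open("equity_markup.xml", encoding="utf-8") as f:
    text = f.read()

# Escape "&" unless it already begins an entity such as &amp; or &lt;
text = re.sub(r'&(?!amp;|lt;|gt;|quot;|apos;|#)', '&amp;', text)

# Escape stray "<" from the OCR: only when it is NOT followed by a letter,
# "/", "?" or "!", so real tags and the XML declaration are left alone
# (a wider net than the <([^p/]) pattern I ran by hand above)
text = re.sub(r'<(?![A-Za-z/?!])', '&lt;', text)

with open("equity_markup_escaped.xml", "w", encoding="utf-8") as f:
    f.write(text)
```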
###August 9, 2017
Using [RegExr](http://regexr.com/) to try out different regex expressions in order to take out "*" and other similar symbols that seem to cause errors when indenting correctly.
I first tried "*" to locate the *. It didn't pick it up, so I put in brackets ("*"), but it still didn't work. I recalled the [], so I changed the regex to (["*"]). This worked, although it also picked up the " (quotes) line for some reason.
I created an expression to ignore the " ["*"])(\"). It didn't work and then I realized I could remove the markings in the gedit program using the find and replace function.
Using the find and replace function I looked up the unicode meaning of the following symbols and replaced "*", "&" with nothing. ’ With "'", “ with """, „ with """, ‘ with "'" , ” with """, — with "-", ™ with nothing because the suggested symbol was a TM and made no sense based on the context. • with nothing because the suggested symbol was a picture of a sword and made absolutely no sense. I made sure to examine each case before replacing it and I found that there was no real option.
I re-ran the command to automatically align the tags, but it presented me with a lot of errors. I will copy the errors into a separate document and clean them up manually. There are over 300 lines of errors similar to the ones below:
/dev/stdin:97: parser error : StartTag: invalid element name
established at Haley a Station.</p> <p>1 hey are <>lltif remarks that the contin
^
/dev/stdin:137: parser error : StartTag: invalid element name
. <4l,e, wjH, at request,attend Bllcouri. lent work. “The best annual meeting
^
/dev/stdin:147: parser error : Specification mandate value for attribute Physician
Physician, Surgeon and
^
/dev/stdin:147: parser error : attributes construct error
Physician, Surgeon and
^
/dev/stdin:147: parser error : Couldn't find end of Start Tag J line 146
Physician, Surgeon and
*The majority of the errors are caused by stray "<" characters introduced through the OCR process, so I removed those, because they are not meaningful symbols anyway. In the few instances where the word was clearly visible, I replaced the letter.*
Happily, it now comes back with far fewer errors than previously. Now, most of the errors read:
/dev/stdin:1867: parser error : Premature end of data in tag p line 1541
/dev/stdin:1867: parser error : Premature end of data in tag p line 1171
/dev/stdin:1867: parser error : Premature end of data in tag p line 1171
I had forgotten to type the "/" at the end of the closing tag </p>, which makes sense because I copied the opening tag and then just manually added the closing tags.
When I ran it again, it attached the tags together as <p></p>.
My brother thought it was probably an error with the text editor (gedit) I was using, so he sent the file through his own more advanced text editor (Atom) and it indented properly.
**-ethical ramifications**
As discussed in class, my decision to change the original data in any way is problematic because it alters the original text and may have a strong impact on the digital analysis I run on it. For example, when I decided to remove some incorrectly formatted unicode characters that were disrupting the TEI coding, I was doing something fairly straightforward on the surface. However, those unicode symbols stood for something, and I was making a conscious decision to remove some information from the text that I will theoretically be analysing later on. Since it was in unicode, I have no idea what it actually said; it could have been important or not. I tried to mitigate the damage I did by making clear notes on what I removed/replaced and why. The hope is that this paradata can be used by future students and historians not only to help them understand the steps I took and my thought process, but also to gain an understanding of the manipulations and analysis I ran, so that they can decide how it impacts their planned work and adjust accordingly. When it was obvious (e.g. "i& was a nice day"), I actually replaced the character so it read correctly ("it was a nice day"). However, I did this only when I noticed it and only when it made sense. I have not gone through the whole text, as this would have added a new layer of researcher subjectivity and bias to the process. In fact, I wonder if I haven't already done this. I hope whatever changes I have made, taking into consideration the purpose of this project (to clean and make usable data), represent a reasonable level of "risk". Ultimately, the goal would be to run this through OpenRefine too and work through it to clean up some of the data.
Open Refine
####Fail-log
I am going to try to use Open Refine to fix some of the spelling and easy to spot errors before I continue with TEI markup.
Not sure if it will work, but my thought is that I can improve the quality of the OCR somewhat and then it will be easier to search or to add to a collection of papers for visualisation later.
I converted the text file into a csv and put it into OpenRefine. The results did not make any sense, and looking at them, I realized that there were no headings or anything in the text, so it was using the initial <></> as columns. This is not what I wanted and it was not useful information. Now I remember why I didn't use it in the first place (I had this discussion above).
Still want to use OpenRefine at some point if possible. Might have to run it through the rest of the TEI first.
###August 11, 2017
TEI continued - Encoding/Markup
Now that the file is indented properly, I will encode it and then people will be able to quickly search it for people's names and such. If combined with many different but similar Equity files that have also been similarly coded, this can be visualized and used for much more useful research.
I used some of the Prof's examples for what to code and I created some that I figured would best suit the text, based on the skimming I did when adding the <p> </p> to separate into paragraphs.
####Fail-log
I used the formula below to encode the first name (Cation Thornloe, which appears to be a poorly rendered "Captain Thornloe").
<p> <CationThornloe <key="Thornloe, Cation" from="?" to="?" role="Bishop" ref="none"> </persName>
I received a parse error saying it was improperly formed.
I examined the format and noted that there was an extra ">" before the end tag </persName>. Even though it appeared to follow the format laid out in the template, I modified it to close out the tag:
<p> <CationThornloe/> <key="Thornloe, Cation" from="?" to="?" role="Bishop" ref="none"> </persName>
It then returned a parse error again, saying it was poorly formed.
I asked for help on our class' slack channel and Dr. Graham responded with the following:
Markup wraps information around text. So <markup>text</markup>
8:45
You are missing the text bit. It's telling you exactly where.
8:46
Move cation Thornloe to the space between ><
8:51
<p><persName blah blah>Cation Thornloe</persName> was the bishop
I reformatted the way he suggested and the file opened in Firefox without an error, but it did not include any highlighting or any visible "markup".
Prof. Graham asked if I had a stylesheet for the xml file. I did not, nor did I know what it was. He explained and provided a sample. To me, it seems like a legend that formats the tags we put in the xml file.
I'm still confused because the example doesn't seem to match his instructions, so I ask for clarification
"<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="000style.xsl"?>
<teiCorpus>
right? line 2, that tells the browser to use the stylesheet 000style.xsl to interpret your markup.
dr.graham
9:21 PM
so just put that stylesheet file in the same folder as your xml file, then try reloading
claremaier
9:22 PM
okay, thanks
9:26
there doesn't appear to be "href="000style.xsl"?>
9:26
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output method="html" version="4.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">"
The professor used my file in his browser and it worked, so I was super confused. Then my brother pointed out that the lines should be in the xml file and I mistakenly thought the professor meant the xsl file. Sometimes it is really annoying to get stuck on such a simple mistake for hours. Tomorrow I will try to encode everything I need, or at least get a significant start on it. I'm going to go through each type of code one at a time, because otherwise I'm quite sure I will get confused between what needs to be coded.
I see this section also includes some research, so that will take up time. Not sure some of the information, such as Grocery sales or specific medicines will be available online, but we'll see.
I used this style-package
https://github.com/craftingdigitalhistory/module3-wranglingdata/blob/master/tei-hist3907/000style.xsl
**Encoding Legend**
Persons <persName key="Last, First" from="YYYY" to="YYYY" role="Occupation" ref="http://www.website.com/webpage.html"> </persName>
Places <placeName key="Sheffield, United Kingdom" ref="http://tools.wmflabs.org/geohack/geohack.php?pagename=Sheffield&params=53_23_01_N_1_28_01_W_type:city_region:GB""> </placeName>
Medicine <medicineName key="Name" from="Business" claim="medProperties" ref="website"> </medicineName>
Sale <saleType key="type" from="Business" to="amount"> </saleType>
###August 12, 2017
Continuing Encoding the TEI - Focus on Names in First 201 lines
I started by looking up and encoding the names of people. This required some research on the internet. Sometimes it was easier than other times, depending on how hard it was to find alternative spellings of people's names and to track down historical documents containing the information I wanted about them.
For example: Charles M aukk, Esq. on the Board of Directors
I searched Charles Maukk, didn't find any results that were from the right timeframe and/or referenced him being on a board for anything. I added different search phrases to his name and tried different spellings. I finally found several entries for Charles Magee, President of the Bank of Ottawa during the time the paper was written. I noted the correct spelling and began searching for information on him as a person. I learned that he was a dry goods merchant, rich, and president of other prestigious groups. I kept looking for his birth and death dates because, as such a prominent figure, I was sure they were out there. *Here is where, if I were doing this seriously as part of a larger research project, I would have noted the name for further primary source research within a library or archives setting.* I did find his information in an online meeting document from the City of Ottawa in 2014 regarding heritage buildings; it was about some houses Magee built. *In the process, I learned about the history of the Bank of Ottawa and its merger with the Bank of Nova Scotia, as well as other interesting historical facts. I can also see just how time-consuming this project could be if it were undertaken professionally.*
The information I include in the brief description of the person also needs to be problematized because each person could have at least a full essay written about them, their politics, significance to early Ottawa history, influence at the Bank, trade influence, religious and social views, etc.
Each decision about what to include should be explicitly given, although I am not sure where it would be most appropriate to do so or how that type of paradata should be shared. For example, in looking up George Hay, I found one biography that seemed to give a rather complete history of his interactions with several different aspects of life. I chose not to look at any additional resources after briefly searching to make sure George Hay was the person I was looking for (context included his connection to the Bank of Ottawa and that he was wealthy and influential). I also chose to exclude some of the information about his complex political ties and more detailed religious leanings. I did include that he was the leader of the Ottawa Bible Society because it seemed to capture his broader position within religious circles around Ottawa. I also included it because, as a person of faith myself, I found myself identifying with him on this level. *While this admission is important for transparency, I do see how my bias could be problematic, because it influences how those who use my source learn about, interact with, and further research George Hay. However, I do not see a solution, other than to be honest about the processes I took and the decisions I made, because not to do so would be problematic, and as a subjective human, there is no way that the information could be presented in a non-subjective manner. Perhaps this is the danger of digital sources and the type of search function I am writing: people are so used to searching things on Google that they incorrectly assume Google (and every other type of search) is objective. It is not, as even Google recently got fined for favouring its own products in its search results. Just as we are critical about the value of primary sources, we need to be critical about what we find online and how we find it. I just don't know the most appropriate format or method, but I will try to highlight within this document any decisions I made that seem most significant in creating a bias within the text.*
When I could not be sure that the person whose information I found matched the person in the Equity article, I chose to not include the information. I did this to maintain a level of good scholarship and not present potentially misleading information that future researchers might take at face value. This decision applied to people who were in the paper, but were not necessarily "big players", such as local shopkeepers. If I were to focus on this project as a significant research project, I would have taken the time to go to local registry offices and done cross-referencing between numerous newspaper articles, school registries, church registries, property titles, etc, in order to determine who people are. This would be the next step if someone were going to focus on this newspaper entry as part of a larger effort to create an Equity-wide searchable database.
**I have spent most of the day looking up information on the people mentioned in the first 200 lines of the Equity article. At this point, it may be a better idea to only look at the first 400 lines and fully encode that, using it as an example of what the completed text would look like.**
*Note: While I did my best to highlight all the names that appeared in the first 201 lines, I may have missed a few - this oversight is entirely mine. I also did my best (by copying and pasting) to use the same tags for the same people, but again, some oversights may have slipped through. This is something people should be aware of when using these sources - it does not absolve them from doing their own critical thinking and fact checking.*
####Fail-log
I just double-checked to make sure my file worked and I received the parsing error again. It turns out the "&" symbol in the URL was throwing it off. I couldn't figure out how to fix it, so I removed the link and [replaced](http://www.res.parl.gc.ca/parlinfo/Files/Parliamentarian.aspx?Item=e04a4ab4-fa00-4451-904e-6c261abd68c0&Language=E&Section=ALL) it with "Parliament of Canada website, Murray, Thomas". I'm happy to fix it once I figure out how.
I had the same problem with a GoogleBooks [search](https://books.google.ca/books?id=IZFXAAAAMAAJ&pg=PA248&lpg=PA248&dq=sj+mcnally+ottawa&source=bl&ots=jUoOqXoBYi&sig=55PDw9IKjON8xpqFN8qJmbzDljA&hl=en&sa=X&ved=0ahUKEwi0tMWovdLVAhUL7YMKHbYgBWwQ6AEIPzAF#v=onepage&q=sj%20mcnally%20ottawa&f=false)
Canada Medical Record, Volume XXIV, Oct., 1895, to Sept., 1896
When I copied one of the tags near the beginning (to reuse by quickly modifying it), I also grabbed the sentence before it several times, and this was detected as incorrect parsing by my browser. I had to go through and remove the duplicated sentence: </persName> ex M. P., will grace the Mayor's chair, in the town of Pembroke for the current year.
**XML Parsing Error:** mismatched tag. Expected: </p>.
Location: file:///home/clare/School/Final_EquityProject/TEI_OCR_Tagging.xml
Line Number 829, Column 15: </body>
--------------^
**Not sure why this parsing error. Could be that it doesn't like the new alignment with all the added tags**
###August 13, 2017
Continuing Encoding the TEI
####Fail-log
I was able to fix the two URLs listed above. The xml wouldn't read the "&" symbol, so my brother suggested I replace it with "&amp;" which is essentially the same thing, but is recognized by xml.
Similarly, the last parsing error with the mismatched tag was fixed by looking through the file and carefully fixing the few tags that I hadn't closed properly.
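Since these parsing errors keep coming back after each round of tagging, a quicker way to catch them is to check the file's well-formedness directly. A minimal sketch using Python's built-in XML parser, with the filename taken from the error message above:

```python
import xml.etree.ElementTree as ET

# Sketch: check that the marked-up file is still well-formed after a round of
# tagging, instead of waiting for the browser or the indenting plugin to fail.
try:
    ET.parse("TEI_OCR_Tagging.xml")
    print("well-formed")
except ET.ParseError as err:
    # e.g. "mismatched tag: line 829, column 15"
    print("not well-formed:", err)
```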
###August 14, 2017
**Places**
<placeName key="Sheffield, United Kingdom" ref="http://tools.wmflabs.org/geohack/geohack.php?pagename=Sheffield&params=53_23_01_N_1_28_01_W_type:city_region:GB""> </placeName>
This went well. I decided that by places I meant geographical cities and regions and not "Shawville Skating Rink" or businesses. I did this because I did not see the value in encoding such obvious places and others, such as McGuire's Grocery, were not searchable using Google. I may revisit this decision later.
**Medicine**
Medicine <medicineName key="Name" from="maker" claim="medProperties" ref="website"> </medicineName>
*I have decided to highlight medicines within the text because it seemed to me that I was seeing several different medicine advertisements throughout the paper and I wanted to know how many, and exactly what they promised. This is a personal interest because as a person who watched Little House on the Prairie, I often learned about "phony" medicines that were mostly alcohol. I wanted to explore this further and see if it was widespread/common practice.*
I used the pdf copy of the paper to find all the mentions of medicines.
*I will do the actual research tomorrow at work*
**Sales/Items**
Sale <saleType key="type" from="Business"> </saleType>
I am not sure exactly how I want to approach this category - I can do it either by store/owner or by type of item. Each has advantages and disadvantages. The advantage of an itemized list is that it lets the researcher notice what items are commonly for sale and, if crossed with data from several months/years/seasons, whether that changed: learn about seasonal fruits and vegetables and whether they changed with the advent of longer-distance transportation or new imports. Is there a difference in the clothes or medicines being sold? I can't do that with this project because I'm focused on one newspaper, but it would be a good future use of this material.
**By Salesman**
**Item**
###August 15, 2017
Continue TEI tagging
I hope to finish TEI tagging by the end of the day
*It is possible that I didn't include some medicines if they were not recognisable as medicines or weren't picked up through a manual skimming through the paper.*
A problem with finding these items by keyword (e.g. "Hair" for Hall's Hair Renewer) is that, because the advertisements are often broken up, other references such as scalp, bald, re-growth, etc. are not caught. This is the type of thing that topic modeling would catch and group together to give a more accurate representation of how often Hall's Hair Renewer was mentioned and in what contexts (especially if the topic modeling were paired with positive and negative sentiment).
Interestingly, this is the case for "Warners safe kidney and liver cure". While "kidney" is sprinkled throughout the document, it appears without an explicit connection to Warner, so it would be unfair and methodologically unsound to assign it to his medicine. This is especially true because Doan also has a kidney cure advertised within this edition, and without the term being connected to a specific name, it would be poor methodology to include it. Now, I could pull up the searchable PDF version of this paper that I have on file, but again, this will not necessarily tell me which reference belongs to which medicine.
It doesn't look like there is much to find about the grocers or merchants listed in the paper. This information is not really necessary for my questions anyway, so I will not do additional research on it. However, because I want to know more about the number of advertisements and who published them, I will still encode for them; rather than include additional information about them, I will just use <merchant key="name"> </merchant>
I'm also going to continue to restrict my markup to the first 200 lines. This will ensure consistency and provide a proof of concept.
*Using keywords to search through the document is again problematic because it excludes clearly related text such as "insisting of Dry G-.ods, Furs, Millinery, Carpets, Clothing, Gent's Furnishings, Boots, Shoes, and Groceries,", which is not connected to a proper name and so cannot be included. It would be picked up by a topic model, however.*
For the items for sale, I will encode the type of product (clothing, meat, dairy, staples, luxuries, hides*) and business selling the item. Hopefully, this will allow for searches and later data mining to determine what type of product was most commonly sold (or rather, advertised) and by whom. This would be interesting to track over a longer period of time - years, decades.
Sale <saleType key="type" from="Business"> </saleType>
*Rationale for choosing these categories: they appear to be commonly used to categorize commodities and seem to fit the items located within this document.*
Of course, there are ethical and practical considerations when coding for these categories, such as: what if the researcher groups things differently? To what extent do my worldview, biases, and research methods affect the way I grouped things? These categories can't be used universally because of the ways different cultures group food (e.g. hot vs. cold, hard vs. soft, taboo and clean, etc.). These categories would also change depending on the research question being asked. This should be stated up front. My research question theoretically explores what general types of commodities are being sold at different times of the year and asks whether the types of items sold change over time. In addition, it is important to note that these are not necessarily a fair representation of items sold during the winter of 1897, but a representation of items that stores wished to put on sale for a variety of mostly unknown reasons. Their use would have to be coupled with a more detailed reading and research project.
**Note: Once again, I only coded to the first 200 lines, although the terms compiled in the ResearchFinalProject file accurately applies to the whole document**
There is also the issue of what qualifies as "luxury" vs. "staples", especially regarding sugar. I chose to classify it as a luxury because during that time period it would have been more of a luxury than it is today, when it is very much considered a staple. Similarly, coal oil was classified as a staple because it was needed to provide light, especially during the winter, when this paper was published.
*Line 164 includes a list of food items, but it also appears to be a list of prices, so I made sure not to include it in the sale count. I ensured it was actually a list of prices by checking it against the PDF version of the paper.*
*This also does not include the list of medicines, although they were also for sale, because medicines have their own research question and are classified differently.*
####Fail-log
I did all the encoding for Medicine and opened the web browser to make sure it all worked. I did not see anything highlighted. I realized that there must be something in the stylesheet that tells the xml file what colours to make things and to turn it into a link. I opened up the stylesheet and there are two parts that look like they are necessary for this.
<li style="color:blue;text-decoration:none;">Individual</li>
and then later on:
<xsl:template match="persName">
<a style="color:blue;text-decoration:none;" href="{@ref}" title="{@key}&#013;({@from}-{@to})&#013;{@role}"><xsl:value-of select="."/></a>
I'm going to try experimenting with them and ask my brother for help substituting in what looks like a red tag already in the stylesheet that I'm not using.
When I figure this out, I'll also have to add two new colours - Merchant and Sale.
I tried copying the red "Claim" group of both of these, changing the keywords in them and making them orange. Didn't work.
<li style="color:orange;text-decoration:none;">Medicine</li>
<xsl:template match="medicineName">
<a style="color:orange;text-decoration:none;" href="{@ref}" title="{@key}&#013;({@from}-{@claim})&#013;{@ingredients}"><xsl:value-of select="."/></a>
Medicine <medicineName key="Name" from="Business" claim="medProperties" ref="website"> </medicineName>
Sale <saleType key="type" from="Business" to="amount"> </saleType>
###August 16, 2017
####Fail-log
I received help from the professor [here](https://hist3814o.slack.com/files/dr.graham/F6P7GSXM1/Untitled.r) and was able to get the "medicine" code to turn red throughout the text.
However, my brother pointed out that I was missing the href="{@ref}" needed to actually link the item to a web page that would open when it was clicked on. I also wasn't able to get all the medicine categories to show up in the "hover" function. I realized that this was because I had only included the default variables from the professor, which were "key" and "ref". Then I forgot to add the "@" sign to the beginning of the item, further frustrating me until I re-read the document and noticed the discrepancy with the original examples. My brother also talked with me on the phone, guiding me towards the areas where the mistakes might be.
*This was a good reminder (as I have had previously) of the importance in digital research of making sure every note, command, and word is spelled correctly. Computers are not like humans and they cannot yet anticipate, infer, or change what we write with accuracy. Of course, the legal and ethical ramifications that would arise if they were able to change what we wrote speak to issues such as programming bias, transparency, and accuracy in making complex, multifaceted decisions.*
I then had to do something similar for the saleType tags, and I included the href and title. I did not include a reference because I really just wanted to create an easy way to track different categories of food, with the future plan of doing some historical analysis and data counting based on changes between summer and winter, as well as changing purchasing habits and similar research.
**I do not know what the proper way to make this type of data accessible is. I haven't decided if the method I used was correct or not, and tomorrow I will explore a few ways of using the tags to search for answers to the research questions above.**
I think the biggest difficulty with this last step was that, because I had received the stylesheet from the professor, I was using it without truly understanding what each component meant or how to modify it. I had no frame of reference, so I was totally confused when it didn't work the first time. I didn't panic, and reached out to my professor and brother. Once I understood the different components, I was able to closely compare the various mistakes I made over different layers of the project with the original versions. This led me to identify each little error, such as missing the "@" in the variables (e.g. `{@key}`).
**Post as blog posts by day, broken up into topics on Reclaim - make sure to mention that I created an initial document and then decided to post it all on the blogs - and the reasons why**
###August 17, 2017
Next steps - getting it ready for presentation
I have finished the encoding, so I am going to see what I can do with the data and whether I can find answers to the questions I have posed below. I will also start formatting these notes and putting them out on my blog.
Ideally, I would like to map people's names and compare the frequency of "important" vs "regular" people. I also want to see which towns are mentioned most frequently and what type of food is mentioned most.
#####Voyant
I started with Cirrus, the word cloud display. This provides a visual representation of the words the program counts as the most frequent within the text. The downside is that, because of the poor OCR, the word cloud is not entirely accurate, and this needs to be mentioned. Failing to mention it could influence the way researchers approach this tool and lead them to draw wrong conclusions. For example, the auto-generated word cloud identified "Mr" as the largest cluster of words. This may indicate that many people are referred to throughout the document in the formal sense, as in "Mr. Ayer". However, it does not provide context in this type of general distant reading. This becomes especially problematic when considering the word "time". Time in what context? By itself, the term is not useful.
Either way, the top key words include: Mr, new, old, time, years, house, men, coun, shawville, John, law, business, and January.
Taken critically and within the context of a paper written shortly after the start of the new year and largely focusing on recent municipal elections, the results make sense. Elected officials are referred to formally and respectfully. References to "new" include "New" York, "New" Years, new councillors and new buildings/work. Interestingly enough, the name "John" is 7th among the top 25 most frequent words. This means that either the same person is being referred to multiple times, or there are many different people named John. In examining the phrases in which the term occurs, it appears that both are the case. Without doing outside research, it is difficult to know if the name John was popular in the Shawville region during this time, if it was popular among the English-speaking elites, or if some other factor is at play.
The term "January" is also in the list of frequent words and speaks heavily to the advertising tactics of G. H. Hodgins.
In terms of the words associated with place names, Shawville is the most frequently cited. This makes sense because it is the town where the paper is published. I would like to see how frequently other towns appear and whether any relationship between them and Shawville can be deduced as a result. *Not sure how to do this, but I will continue to look into it*
When I broadened the keyword frequency search to 105 terms, it included variations on the words "council/councillor/municipal/elected/mayor". This points to the strong emphasis in the paper on the recent elections in the surrounding townships and also correlates with the frequency of "Mr.". Other strong words included terms associated with the January sale: "prices/goods/company/cent/sale/January". Again, this correlates to the top 25 list and its emphasis.
Other terms that came up frequently were associated with general public notices and common news items: "death/coming/home/court".
In terms of the tags I encoded for, I was happy to see that names of towns featured in the top 105 list. Bristol was mentioned 14 times, Arnprior 9 times, London 8 times, York (most likely New York) 9 times, and Ottawa 14 times. Closer reading would need to explore in what context these terms were mentioned. It surprised me to see London and (New) York mentioned so frequently. On closer examination, it appears many of these references were concentrated in two or three distinct stories about criminal proceedings or internationally significant events.
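One way to check that context without re-reading the whole issue would be a rough keyword-in-context list pulled from the OCR text with Python. The sketch below is only an illustration: `equity.txt` is a placeholder name for the plain-text OCR file, and the list of towns is the one Voyant surfaced.

```python
# Rough keyword-in-context (KWIC) sketch for the OCR text.
# "equity.txt" is a placeholder file name; the place list comes from the Voyant results above.
PLACES = ["Bristol", "Arnprior", "London", "York", "Ottawa"]

with open("equity.txt", encoding="utf-8", errors="ignore") as f:
    words = f.read().split()

for i, word in enumerate(words):
    for place in PLACES:
        if place.lower() in word.lower():
            # Show five words of context on either side of each match
            context = " ".join(words[max(0, i - 5):i + 6])
            print(f"{place}: ...{context}...")
```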
I clicked on the Links tab (a different tool). It appears to map connections between words based on their frequency. I started playing with the slider that controls how many words are included in the context, and it took what made sense (a collection of connected but easy-to-read words) and replaced it with hundreds of words and their connections.
####Fail-log
When I moved the slider in Links from 30 to 15, it actually seemed to increase the number of words being connected. I pressed clear, hoping to reset it to the way it originally was, and it reduced the display to a single word. I had to re-upload the file and then work on the Links tool again. I set it to 15 links, which was in the middle of the options and seemed like a good number. This generated a web of connected words.
#####Links
This was a different view from the previous visualization because it did not just present the most frequently used words, but searched out how they were connected. This type of visualization is useful in helping the reader or researcher see context and begin to look at alternative ways of broadening a search query. For example, "John" is now connected to "mother". If we continued to trace either John or mother, we could potentially see what John does or why he is mentioned so many times and take a deeper reading to figure out how he connects to mother. Similarly, if our focus was on mother, we could explore how she is related to John and start to ask broader questions. In this case, we could ask why the mother's identifier is simply that she is John's mother, asking questions about identity, gender, power, representation, and the role of newspapers/the printed word in perpetuating or addressing these issues. If we were looking at a huge collection of texts instead of one, we could do what [Michelle Moravec](http://historyinthecity.blogspot.ca/2013/12/corpus-linguistics-for-historians.html) does in her analysis of feminism and its representation through topic modeling, and do a deeper analysis.
Cautions to consider with this type of analysis include the possibility of program errors in connecting words and, in the case of this data in particular, the concern about poorly OCR'd text and missed or unrecognizable words. For this chart to be more than merely illustrative, it would have to be used critically, and the researcher would have to take great care and effort in cleaning up the data for more rigorous scholarly use.
#####Terms and CSV file
I then explored the Terms tool. This was a list of all the words and their frequencies. I selected all the terms that occurred eight or more times. I chose this number because it reflected the lower end of the "frequent" word list and represented 136 individual terms. That seemed to me to be enough to experiment with, but not so much as to get bogged down.
When I exported it as a text-separated file (so I could turn it into a csv and use it later), it actually exported all 600 terms. *I may use them all, or I may edit the file down and use only the selection I originally decided on.* I opened the file in my LibreOffice Calc program and saved it as a .csv file. Happily, it worked well and provided headings: Term, Count, Trend. The trend category will not be useful for my anticipated project, so I will not include that column in the working copy I save. I have also decided to only include the original 136 most frequent words.
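As a note to myself, trimming the export down to the terms with eight or more occurrences (and dropping the Trend column) could also be done with a few lines of Python instead of by hand in Calc. This is only a sketch: `voyant-terms.csv` and `terms-trimmed.csv` are placeholder file names, and it assumes the headings really are Term, Count, and Trend.

```python
# Sketch: keep only the Term and Count columns for terms occurring 8 or more times.
# File names are placeholders; assumes columns named Term, Count, Trend as in the export.
import csv

with open("voyant-terms.csv", newline="", encoding="utf-8") as infile, \
     open("terms-trimmed.csv", "w", newline="", encoding="utf-8") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow(["Term", "Count"])
    for row in reader:
        if int(row["Count"]) >= 8:
            writer.writerow([row["Term"], row["Count"]])
```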
I would like to take this csv file and open it in OpenRefine.
I loosely followed the [tutorial](http://workbook.craftingdigitalhistory.ca/supporting%20materials/open-refine/) and modified it for what I was looking for.
####Fail-log
I noticed right away that, unlike the Texas correspondence files, these terms are already sorted into fairly simple and clean categories (keywords). There were not really any "Johns"/"Johnston"/"Johston" variants to merge and rename. Once again, it doesn't really look like OpenRefine is the right tool for this project, no matter how much I seem to want to use it.
I tried the "text facet sort" on the file anyway and it found "friend" and "friends". I clicked that they should merge, but on second thought realized that, in the context of this newspaper, it might actually be unethical to do so. Unlike a series of correspondences where the names were almost certainly the same, the keywords were not necessarily within the same context. "Friend" might be part of a sentence that said "John went to visit his dear friend last week", whereas "friends" might refer to people in the more formal and impersonal voice: "at this grocery store, we treat all our customers as friends". I undid the merge and closed the file. I'm not sure if or how I will use it, but at least I learned how to get it into the format I wanted (finally) and am learning more about what the program can and can't do.
*It is also worth noting that I successfully opened OpenRefine and accessed the sort function with very little reference to the tutorial. This means I am making some progress, so I'm happy about that.*
Upon closer examination of the class coursebook, it looks like I was, to some extent, confusing OpenRefine with Gephi. Gephi is the network analysis tool, and I would like to use it to continue my exploration of words and how they are connected re: the questions I am asking. I may also look briefly at topic modelling to see if the computer program (I don't remember which one) makes the same connections between words as the "Links" tool and my own analysis do.
###August 18, 2017
Catmandu, regex, python
Following advice from the professor, I took a look at his advice from [link](https://hist3814o.slack.com/archives/C0GDSE1B8/p1503065858000029) and worked to install Catmandu. This took some time because the file was large and everything downloaded individually. I asked my brother to look at the tutorial with me because I was unfamiliar with the new program. The tutorial is found [here](https://librecatproject.wordpress.com/2014/12/04/day-4-grep-less-and-wc/amp/)
We decided after some exploration that we wanted not only to count the number of tags each word totalled, but also to sort the place names and determine their relationship to each other. We abandoned Catmandu and began to work with regex and python.
The plan was to create a regex searching for the `<placeName>` tag, capturing up to the end of the "ref" attribute, and then use a Python script to tabulate all instances of the same place name.
I wrote a regex that I thought would do this:
`("<placeName" + ">)`
####Fail-log
I was right about the beginning section of the regex and the later need for the "+" to grab additional information. I was not precise enough and while I remembered some of the conventions about writing the expression, I forgot others, such as the "\" and the strategic use of brackets.
My brother modified my expression so that it continued until the "ref" at the end: `<placeName key=\"(.+)\" ref` (the `">` was inserted to close the tag and prevent the rest of the text from changing colour). This correctly identified the place names, but the "ref" it was picking up was not always the one it started with, especially when two place names occurred on the same line. **insert screenshot**
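For my own future reference, the underlying problem seems to be that `(.+)` is "greedy": it keeps matching as far as it can and only stops at the last "ref" on the line, while a non-greedy `(.+?)` stops at the first one. A small illustration, using a made-up line with two place names:

```python
# Sketch: greedy (.+) runs past the first tag; non-greedy (.+?) stops at each tag.
import re

line = '<placeName key="Bristol" ref="url1">Bristol</placeName> and <placeName key="Ottawa" ref="url2">Ottawa</placeName>'

print(re.findall(r'<placeName key="(.+)" ref', line))   # greedy: one long, messy match
print(re.findall(r'<placeName key="(.+?)" ref', line))  # non-greedy: ['Bristol', 'Ottawa']
```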
We then turned to the Python part, looking for ways to isolate each line independently and then add recurring place names into a counter. This required creating a Python file that uses a "for" loop to run the regex over each line; an "if" statement checks whether a place name has already been seen, and if so, adds one to its count. If the place name is mentioned for the first time, it is simply noted. The structure that stores these counts is referred to as a dictionary.
He kindly created and sent me this python script: [insert screenshot]
It works in theory, but the "key=" portion causes problems with the script and it will not run. However, this process demonstrated the initial steps of this analysis. The next goal would be to learn more about why the `placenames[m.group(1)] = placenames[m.group(1)] + 1` section isn't working properly.
Theoretically, if it did, it would search for all place names that were in tags and tabulate how many times each was mentioned. I would then take that information and place it into context, asking what relationship, if any, there was between each location and Shawville. Other data analysis could also be conducted, similar to what I have discussed previously. The same type of process would also be done for well-known vs. regular people, and for items sold at the grocers, asking what, if anything, they had to do with seasonal products.
**Didn't get to this, discuss as next steps**
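For the write-up, here is a minimal sketch of the kind of script we were aiming for. It is my own reconstruction rather than my brother's actual script, it has not been run against the real file, and `equity.xml` is a placeholder name; it uses the non-greedy match and the dictionary approach described above.

```python
# Sketch: count how many times each tagged place name appears in the encoded file.
# My own reconstruction of the approach; "equity.xml" is a placeholder file name.
import re

# Non-greedy (.+?) so that two tags on the same line are matched separately
pattern = re.compile(r'<placeName key="(.+?)"')

placenames = {}
with open("equity.xml", encoding="utf-8") as f:
    for line in f:
        for m in pattern.finditer(line):
            name = m.group(1)
            if name in placenames:
                placenames[name] = placenames[name] + 1
            else:
                placenames[name] = 1

# Print the places from most to least mentioned
for name, count in sorted(placenames.items(), key=lambda item: item[1], reverse=True):
    print(name, count)
```

The same pattern could then be adapted for the `persName`, `medicineName`, and `saleType` tags to tackle the other research questions.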
#Goals for tomorrow (Friday, August 18, 2017)
1. Run several visualizations through Gephi in an effort to answer my questions.
2. Briefly try some topic modeling
3. Write up some preliminary conclusions answering the primary questions
#Goals for Saturday, August 19, 2017
1. Clean up this "Notes" document
2. Put the blog postings on a website
3. Finalize presentation of the data
4. Ensure it meets final project guidelines
###August 19, 2017
Writing blog posts, thinking everything through
I spent much of the day taking these notes and using them to form blog posts going back to the start of the project. They can be found [here](http://claremaier.ca/final-equity/august-18-a-series-of-errors-and-a-helpful-friend/)
I did my best to make it user friendly and interesting for the audience, who I envisioned to be students like me (new to digital history) or people from the Shawville area who stumbled across the site somehow through Google. I wrote in a narrative, first person style, describing my motivations, steps, failures, and successes. I also included links that I found helpful and visualisations of the work as appropriate.
**In the blog, include appropriate visualizations and code snippets to illustrate the chronological blog posts**
#Research Questions
**How often are well-known people vs. regular people mentioned?**
**How might this speak to the function or readership of the paper?**
**How frequently are locations mentioned? Does this speak to the relative "world" they lived in? E.g. closer contact with locals - what they were interested in?**
**Are some items sold more than others? What and why?**
#Sustainability and Access
This project and its results will remain sustainable as they will continue to be available on free, reputable, and open-source sites hosted on the internet. I have written all my notes associated with this project in accessible and sustainable formats (.txt, .md, .xsl, .xml) that can be opened and used by all computers and internet-enabled machines.
I have made careful notes and provided internal thoughts and commentary alongside the process, failures, and project results in the hope that my project is clear and replicable by both professional digital historians and beginners like myself.
The project results will also be sustainable because they are being archived by the professor as part of his marking process, and because I found the class-provided DH Box so complicated and not at all user-friendly, the materials are also on my personal computer and safely stored in my publicly accessible github account. I will provide links to my domain and github account below.
My work also contributes to the long-term accessibility of the *Equity* papers because the file itself is saved in many formats on my computer and in my github account. In addition, my documentation and blog posts should describe my process, with reference to the files, in enough detail that the conclusions I have drawn about the Shawville region remain applicable and understandable should the original edition of the January 14, 1897 *Shawville Equity* be destroyed.
Reclaim hosting (domain name Clare Maier): https://cpanel.outofstep.reclaimhosting.com/cpsess2656399109/3rdparty/installatron/index.cgi?login=1&post_login=28410085606181#/installs
Github: https://github.com/claremaier/Final_Project
Blog: http://claremaier.ca/