<h2>Digital Humanities and Data Science</h2>
<h4>Producing and Organizing Content in the Interest of Streamlining Preprocessing</h4>

<p>This notebook was written to address the specific issues of one digital humanities project. The project's input data were messy, and the variations among input files created technical debt; this notebook resolves both in the interest of producing a homogeneous data set. It then shows how homogeneous data can be used to generate relevant statistics and to group the data by time, place, or characteristics specific to a data point (in our case, one poem). While there is much in this notebook that those working in digital humanities may find useful, some of the code below exists only to handle inconsistencies that may have occurred in just a few data points. The comments in the code should help clear up any resulting confusion.</p>

<p>Some of the specific issues that I've attempted to address in this notebook are:</p>

<ul>
<li>Formatting XML in the interest of creating good "input data" from the standpoint of data analysis--see notes below</li>
<li>Creating equivalencies between data points--e.g., cannot==can not==can't</li>
<li>Generating relevant statistics about the data--as a whole, grouped by year, etc.</li>
</ul>
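
<p>As a minimal sketch of the second point, the equivalencies can be kept in one ordered table and applied by a single helper. The names <code>CONTRACTIONS</code> and <code>expand_contractions</code> are illustrative and not part of the notebook's own code; the rules cover only a subset of the substitutions used further down and would need extending for a real corpus.</p>

```python
import re

# Ordered (pattern, replacement) pairs; specific forms come before general ones.
# [’'] accepts both the typographic and the ASCII apostrophe.
CONTRACTIONS = [
    (r"\bwon[’']t\b", "will not"),
    (r"\bcan[’']t\b", "can not"),
    (r"\bcannot\b", "can not"),
    (r"n[’']t\b", " not"),
    (r"[’']re\b", " are"),
    (r"[’']ll\b", " will"),
    (r"[’']ve\b", " have"),
    (r"[’']m\b", " am"),
]

def expand_contractions(text):
    """Rewrite contractions so that equivalent forms share one spelling."""
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

# Four spellings of the same phrase collapse to a single normalized form.
forms = ["I cannot go", "I can not go", "I can't go", "I can’t go"]
normalized = {expand_contractions(form) for form in forms}
print(normalized)
```

<p>Keeping the rules in a list makes their order explicit, which matters because a general rule such as <code>n't</code> would otherwise turn "won't" into "wo not".</p>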

<p>Readers may also be interested in the lxml package. lxml extends the ElementTree API (used in this notebook) and supports full XPath 1.0 expressions. This writer chose to work with ElementTree because it is part of the Python standard library and can accomplish the necessary preprocessing.</p>
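
<p>For orientation, the sketch below shows part of the XPath subset that ElementTree does support, using a prefix map in place of the fully bracketed namespace. The sample document and the <code>TEI</code> dictionary are made up for illustration.</p>

```python
import xml.etree.ElementTree as et

# Hypothetical miniature TEI document, used only for this illustration.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <lg type="poem">
      <lg type="stanza">
        <l>first line</l>
        <l rend="indent">second line</l>
      </lg>
    </lg>
  </body></text>
</TEI>"""

root = et.fromstring(sample)

# Map a prefix to the namespace so that paths stay readable.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Descendant search, child steps, and attribute predicates are supported.
all_lines = root.findall(".//tei:lg/tei:lg/tei:l", TEI)
indented = root.findall(".//tei:l[@rend='indent']", TEI)

all_text = [line.text for line in all_lines]
indented_text = [line.text for line in indented]
print(all_text)
print(indented_text)
```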

<h4>Notes on examining XML with Python in the Digital Humanities</h4>

<ul>
<li>Do not use multiple, identical tags to convey a multiplicative meaning of the tag.
e.g. &lt;l rend='indent'&gt;&lt;l rend='indent'&gt;twice indented text&lt;/l&gt;&lt;/l&gt;
Instead, use a unique tag, e.g. &lt;l rend='indentx2'&gt;twice indented text&lt;/l&gt;</li>

<li>If using a boilerplate XML file to create new content, do not leave empty example tags.</li>

<li>Determine at the outset how linguistic equivalencies should be handled, e.g. are Arabic numerals equivalent to spelled-out numbers?</li>

<li>Determine at the outset what "counts" toward certain statistics, e.g. does the occurrence of 4 count toward a word count?</li>

<li>A "flatter" XML structure is preferable to a deeper one, i.e. semantically descriptive markup shouldn't rely on a series of nested XML tags. Instead, use XML attributes to reduce the number of nested tags. This simplifies XPath expressions and, in practice, helps ensure that similar XML content sits at the same level of the tree. For example, if a descriptive characteristic is expressed by wrapping content in its own tag, then content lacking that characteristic must be wrapped in a second tag representing the negation or absence of the characteristic; otherwise the two kinds of content end up at different depths.</li>

<li>Artistic license, dialects, slang, and even simple grammatical expressions represent an intractable problem when attempting to analyze digital texts. While variety is the spice of life, it is anathema to programmatic problem solving. Apostrophes are an example of how common grammatical forms can confuse a programmatic analysis. Determining whether a word is a contraction or a possessive is difficult without additional information: compare 'the doctor's out' (the doctor is out) with 'the doctor's pen'. A similar example in German arises in determining whether the phrase 'den deutschen' is accusative singular or dative plural. These considerations lead me to advocate for the use of 'regularizing' tags. The meaning of a specific word or phrase can be clarified by using some variant of the following in the XML: &lt;reg regular="von den Deutschen"&gt;den Deutschen&lt;/reg&gt;. Then, in certain instances, the contents of the regular attribute can be substituted for the contents of the reg tag.</li>
</ul>
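
<p>The regularizing tags recommended in the last note could be consumed roughly as in the sketch below. It assumes a <code>reg</code> element carrying a <code>regular</code> attribute, as in the example above; the helper name <code>regularized_text</code> and the sample line are hypothetical, and the notebook's own code does not perform this substitution.</p>

```python
import xml.etree.ElementTree as et

def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def regularized_text(element):
    """Return the text of element, using each reg tag's 'regular' attribute
    in place of the reg tag's own contents when the attribute is present."""
    parts = [element.text or ""]
    for child in element:
        if local_name(child.tag) == "reg" and child.get("regular") is not None:
            parts.append(child.get("regular"))
        else:
            parts.append(regularized_text(child))
        # Text that follows a child element is stored in the child's tail.
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical line; a real TEI file would also declare the TEI namespace.
line = et.fromstring('<l>the <reg regular="doctor is">doctor\'s</reg> out</l>')

as_written = "".join(line.itertext())
as_regularized = regularized_text(line)
print(as_written)
print(as_regularized)
```

<p>Because the original wording stays in the element's contents, an analysis can choose per statistic whether to count the text as written or as regularized.</p>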

In [8]:
import xml.etree.ElementTree as et
import os
import sys
import re
from collections import Counter
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import xlsxwriter
import warnings
import pandas as pd
import statsmodels as sm

In [9]:
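# Parse every XML file in the folder and store its tree under the file name.
dataDict = {}
xmlFolder = r'/path/to/XML/folder/'
for file in os.listdir(xmlFolder):
    tree = et.parse(os.path.join(xmlFolder, file))
    dataDict[file] = tree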

In [10]:
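# The pattern below matches a single alphanumeric character as found in German text:
# a-z, the accented letters in the range À-ÿ, and the digits 0-9.
# The negative lookahead excludes the symbols × and ÷ as well as Þ, ß, þ, and ø.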
allCharacters = r'(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ0-9])'
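
<p>A quick way to see what the pattern keeps is to run it over a few tokens, one character at a time, as the summary function further down does. The helper <code>keep_allowed</code> and the sample tokens are illustrative only. Note that the lookahead also excludes ß, so a word such as "Straße" loses that letter.</p>

```python
import re

# Same pattern as in the cell above, repeated so that this sketch runs on its own.
allCharacters = r'(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ0-9])'

def keep_allowed(word, pattern=allCharacters):
    """Lowercase word and drop every character the pattern does not match."""
    return "".join(ch for ch in word.lower() if re.search(pattern, ch))

for token in ["Über,", "Mädchen!", "1848.", "Straße"]:
    print(token, "->", keep_allowed(token))
```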

In [24]:
import xlsxwriter

def buildSummary(dictionaryOfDictionaries, allowedRegexExpressions, writePath):
    wordList = []
    # The variables below count poems, words, and lines across the whole corpus.
    poemCount = 0
    wordCount = 0
    lineCount = 0
    wordCounts = {}
    wordsInAPoem = []
    poemsDict = {}
    # Iterate over the dictionary passed in rather than the global dataDict.
    for k, v in dictionaryOfDictionaries.items():
        poemDict = {}
        poemCount += 1
        # Obtain the root element of the XML doc.
        root = v.getroot()
        wordsInThisPoem = 0
        '''
        Determine the language of the poem. Note, there may be more than one language present in a poem;
        therefore, create a list of languages if there are multiple. Store languages in variable: 'flag'.
        This step allows handling of multiple languages.
        '''
        flag =[]
        # When using ".findall" it's necessary to define the namespace of the document within brackets.
        languages = root.findall(".//{http://www.tei-c.org/ns/1.0}language")
        for language in languages:
            if language.text is not None:
                flag.append(language.text.lower())
                
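        # Remove annotations. Note, to delete a node, the parent node must be referenced.
        note_parents = root.findall(".//{http://www.tei-c.org/ns/1.0}note/..")
        for parent in note_parents:
            # Because an l tag may have more than one note, remove every direct child note.
            # Searching direct children only avoids a ValueError from parent.remove()
            # when a note is nested more than one level below this parent.
            for note in parent.findall("{http://www.tei-c.org/ns/1.0}note"):
                parent.remove(note)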
            
        # Edit the lines below to locate the particular xml Tag contents in which you are interested
        # Here, I want to find "l" tags that are children of "lg" tags
        for lg in root.findall(".//{http://www.tei-c.org/ns/1.0}lg/{http://www.tei-c.org/ns/1.0}lg"):
            for line in lg.findall(".//{http://www.tei-c.org/ns/1.0}l"):
                # The line below joins all text inside the l tag in document order,
                # including text held in nested child tags.
                # Using itertext() this way can avoid complicated XPath expressions.
                trueLine = "".join([x for x in line.itertext()])
                if ('english' in flag) and (trueLine is not None):
                    trueLine = re.sub(r"won't", "will not", trueLine)
                    trueLine = re.sub(r"can\'t", "can not", trueLine)
                    trueLine = re.sub(r"n\'t", " not", trueLine)
                    trueLine = re.sub(r"\'re", " are", trueLine)
                    trueLine = re.sub(r"\b([a-z]\w+)'s\b", r"\1 is", trueLine)
                    trueLine = re.sub(r"\'d", " would", trueLine)
                    trueLine = re.sub(r"\'ll", " will", trueLine)
                    trueLine = re.sub(r"\'t", " not", trueLine)
                    trueLine = re.sub(r"\'ve", " have", trueLine)
                    trueLine = re.sub(r"\'m", " am", trueLine)
                    trueLine = re.sub(r"won’t", "will not", trueLine)
                    trueLine = re.sub(r"can\’t", "can not", trueLine)
                    trueLine = re.sub(r"n\’t", " not", trueLine)
                    trueLine = re.sub(r"\’re", " are", trueLine)
                    trueLine = re.sub(r"\b([a-z]\w+)’s\b", r"\1 is",trueLine)
                    trueLine = re.sub(r"\’d", " would", trueLine)
                    trueLine = re.sub(r"\’ll", " will", trueLine)
                    trueLine = re.sub(r"\’t", " not", trueLine)
                    trueLine = re.sub(r"\’ve", " have", trueLine)
                    trueLine = re.sub(r"\’m", " am", trueLine)
                    wordsInLine = trueLine.split()
                    lineCount += 1
                    # Regularize each letter in a word and discard characters not in allowedRegexExpressions.
                    for word in wordsInLine:
                        no_punct = ''
                        for char in word:
                            char = char.lower()
                            if re.search(allowedRegexExpressions, char):
                                no_punct = no_punct + char
                        if no_punct != '':
                            wordList.append(no_punct)
                            wordsInThisPoem += 1
                            wordCount += 1
                            poemDict[wordsInThisPoem] = no_punct
                elif trueLine is not None:
                    wordsInLine = trueLine.split()
                    lineCount += 1
                    for word in wordsInLine:          
                        no_punct = ''
                        for char in word:
                            char = char.lower()
                            if re.search(allowedRegexExpressions, char):
                                no_punct = no_punct + char
                        # Check if no_punct is empty
                        if no_punct != '':
                            wordList.append(no_punct)
                            wordsInThisPoem += 1
                            wordCount += 1
                            poemDict[wordsInThisPoem] = no_punct
        # This block of text deals with XML files for which lg tags (stanzas) are not contained in a parent lg tag.
        if wordsInThisPoem == 0:
            for lg in root.findall(".//{http://www.tei-c.org/ns/1.0}lg"):
                for line in lg.findall(".//{http://www.tei-c.org/ns/1.0}l"):
                    trueLine = "".join([x for x in line.itertext()])
                    if 'english' in flag:
                        trueLine = re.sub(r"won't", "will not", trueLine)
                        trueLine = re.sub(r"can\'t", "can not", trueLine)
                        trueLine = re.sub(r"n\'t", " not", trueLine)
                        trueLine = re.sub(r"\'re", " are", trueLine)
                        trueLine = re.sub(r"\b([a-z]\w+)'s\b", r"\1 is", trueLine)
                        trueLine = re.sub(r"\'d", " would", trueLine)
                        trueLine = re.sub(r"\'ll", " will", trueLine)
                        trueLine = re.sub(r"\'t", " not", trueLine)
                        trueLine = re.sub(r"\'ve", " have", trueLine)
                        trueLine = re.sub(r"\'m", " am", trueLine)
                        trueLine = re.sub(r"won’t", "will not", trueLine)
                        trueLine = re.sub(r"can\’t", "can not", trueLine)
                        trueLine = re.sub(r"n\’t", " not", trueLine)
                        trueLine = re.sub(r"\’re", " are", trueLine)
                        trueLine = re.sub(r"\b([a-z]\w+)’s\b", r"\1 is",trueLine)
                        trueLine = re.sub(r"\’d", " would", trueLine)
                        trueLine = re.sub(r"\’ll", " will", trueLine)
                        trueLine = re.sub(r"\’t", " not", trueLine)
                        trueLine = re.sub(r"\’ve", " have", trueLine)
                        trueLine = re.sub(r"\’m", " am", trueLine)
                        wordsInLine = trueLine.split()
                        lineCount += 1
                        for word in wordsInLine:
                            no_punct = ''
                            for char in word:
                                char = char.lower()
                                if re.search(allowedRegexExpressions, char):
                                    no_punct = no_punct + char
                            if no_punct != '':
                                wordList.append(no_punct)
                                wordsInThisPoem += 1
                                wordCount += 1
                                poemDict[wordsInThisPoem] = no_punct
                    elif trueLine is not None:
                        wordsInLine = trueLine.split()
                        lineCount += 1
                        for word in wordsInLine:
                            no_punct = ''
                            for char in word:
                                char = char.lower()
                                if re.search(allowedRegexExpressions, char):
                                    no_punct = no_punct + char
                            # Check if no_punct is empty
                            if no_punct != '':
                                wordList.append(no_punct)
                                wordsInThisPoem += 1
                                wordCount += 1
                                poemDict[wordsInThisPoem] = no_punct
        wordsInAPoem.append(wordsInThisPoem)
        poemsDict[k] = poemDict                    
       
    counts = Counter(wordList)
    
    workbook = xlsxwriter.Workbook(r'{}'.format(writePath))
    worksheet1 = workbook.add_worksheet('Summary Statistics')
    row = 0
    col = 0
    worksheet1.write(0, 0, 'Poem Count')
    worksheet1.write(0, 1, poemCount)
    worksheet1.write(1, 0, 'Word Count')
    worksheet1.write(1, 1, wordCount)    
    worksheet1.write(2, 0, 'Line Count')
    worksheet1.write(2, 1, lineCount)
    worksheet1.write(3, 0, 'Unique Words')
    worksheet1.write(3, 1, len(set(wordList)))
    worksheet1.write(4, 0, 'Average Words Per Line')
    worksheet1.write(4, 1, wordCount/lineCount)
    worksheet1.write(5, 0, 'Average Words Per Poem')    
    worksheet1.write(5, 1, wordCount/poemCount) 
    
    worksheet2 = workbook.add_worksheet('Unique Word Counts')
    worksheet2.write(0, 0, 'Word')
    worksheet2.write(0, 1, 'Count')
    row = 0
    col = 0
    for key in counts.keys():
        row += 1
        worksheet2.write(row, col, key)
        item = counts[key]
        worksheet2.write(row, col + 1, item)
    plt.rcParams['figure.figsize'] = (16.0, 12.0)
    plt.style.use('ggplot')
    
    data = pd.Series(wordsInAPoem)
    # Plot for comparison
    plt.figure(figsize=(12,8))
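    # 'density' replaces the older 'normed' keyword, and the color cycle is read from 'axes.prop_cycle'.
    ax = data.plot(kind='hist', bins=200, density=True, alpha=0.5, color=plt.rcParams['axes.prop_cycle'].by_key()['color'][1])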
    # Save plot limits
    dataYLim = ax.get_ylim()

    # Find best fit distribution
    best_fit_name, best_fit_params = best_fit_distribution(data, 200, ax)
    best_dist = getattr(st, best_fit_name)

    print(best_dist)
    # Update plots
    ax.set_ylim(dataYLim)
    ax.set_title('Poem Word Count. All Fitted Distributions')
    ax.set_xlabel('Word Count')
    ax.set_ylabel('Frequency')
    plt.savefig('allDistributions.png', dpi=150)
    # Make PDF
    pdf = make_pdf(best_dist, best_fit_params)

    # Display
    plt.figure(figsize=(12,8))
    ax = pdf.plot(lw=2, label='PDF', legend=True)
    data.plot(kind='hist', bins=200, density=True, alpha=0.5, label='Data', legend=True, ax=ax)

    param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
    param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_fit_params)])
    dist_str = '{}({})'.format(best_fit_name, param_str)

    ax.set_title('Poem Word Count. Best Distribution:  \n' + dist_str)
    ax.set_xlabel('Word Count')
    ax.set_ylabel('Frequency')
    # Save the figure so that the workbook has an image file to embed.
    plt.savefig('bestDistribution.png', dpi=150)
    
    worksheet3 = workbook.add_worksheet('Fitted Distributions')
    worksheet3.insert_image('A1', 'bestDistribution.png')
    #worksheet3.insert_image('A36', 'allDistributions.png')
    
    workbook.close()
    
    return  wordsInAPoem, poemsDict, wordList
    
wordsInAPoem, poemsDict, wordList = buildSummary(dataDict, allCharacters, r'test.xlsx')



<scipy.stats._continuous_distns.frechet_l_gen object at 0x7f39669c20f0>


In [20]:
# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Distributions to check. They are looked up by name because some of them
    # (e.g. frechet_r, frechet_l, gilbrat) are not available in every SciPy release;
    # names missing from the installed version are skipped.
    DISTRIBUTION_NAMES = """
        alpha anglit arcsine beta betaprime bradford burr cauchy chi chi2 cosine
        dgamma dweibull erlang expon exponnorm exponweib exponpow f fatiguelife fisk
        foldcauchy foldnorm frechet_r frechet_l genlogistic genpareto gennorm genexpon
        genextreme gausshyper gamma gengamma genhalflogistic gilbrat gompertz gumbel_r
        gumbel_l halfcauchy halflogistic halfnorm halfgennorm hypsecant invgamma invgauss
        invweibull johnsonsb johnsonsu ksone kstwobign laplace levy levy_l levy_stable
        logistic loggamma loglaplace lognorm lomax maxwell mielke nakagami ncx2 ncf
        nct norm pareto pearson3 powerlaw powerlognorm powernorm rdist reciprocal
        rayleigh rice recipinvgauss semicircular t triang truncexpon truncnorm tukeylambda
        uniform vonmises vonmises_line wald weibull_min weibull_max wrapcauchy
    """.split()
    DISTRIBUTIONS = [getattr(st, name) for name in DISTRIBUTION_NAMES if hasattr(st, name)]

    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf

    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')

                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]

                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))

                # If an axis was passed in, add this fitted PDF to the plot
                try:
                    if ax is not None:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass

                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse

        except Exception:
            pass

    return (best_distribution.name, best_params)

def make_pdf(dist, params, size=10000):
    """Generate the distribution's probability density function (PDF) as a pandas Series"""

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf