# Structuring OCR'ed Text as Data

**<< Previous module: [How to OCR](04-HowToOCR.ipynb) <<**

*60-90 minutes*

<div class="alert alert-block alert-info">
    <strong>Learning Objectives:</strong>
    <p>By the end of this module, you should be able to</p>
    <ul>
        <li>differentiate between different types of OCR errors;</li>
        <li>decide which method(s) are best applied to correct OCR errors;</li>
        <li>explain reasons for creating structured data from OCR'ed text and describe possible use cases;</li>
        <li>create structured data from OCR'ed text.</li>
    </ul>
</div>

## Table of Contents

- [Correcting OCR Errors](#ocr-errors)
- [Structuring OCR'ed Text as Data](#text-as-data)
- [Resources](#resources)

In the [last module](04-HowToOCR.ipynb), we learned how to use Python and Tesseract to perform OCR on a sample selection from North Carolina's 1955 session laws. At the end of the tutorial, we evaluated the OCR'ed text for readability and produced a list of spelling "errors" that Python helped us identify. 

In this module, we'll look at a few methods for correcting errors in OCR'ed text. We'll also look at ways to structure OCR'ed text as data and consider why we might want to create structured data. We'll end by looking at a sample of the structured data from the *On The Books* corpus of Jim Crow laws. This will prepare us for the [final module](06-ExploratoryAnalysis.ipynb) in which we'll be using Python and other tools to perform exploratory analyses with the *On The Books* corpus dataset.

## Correcting OCR Errors <a class="anchor" id="ocr-errors"></a>

Let's begin by reviewing the errors that we found using spellcheck in the [last module](04-HowToOCR.ipynb). We saved the spellcheck output--including information about each page's readability--in a spreadsheet (.csv) file. We can use <a href="https://pandas.pydata.org/" target="blank">pandas</a>, a Python data analysis library, to review that output here:

In [None]:
# Import the pandas library.
import pandas

# We'll create a variable called "ocrErrors" that will hold the data from 
# sample_output_spellchecked.csv, the file where we stored our spellcheck
# and readability data. The "pandas.read_csv" function opens and reads the 
# file. "sep=","" specifies that each cell in the file is separated by a comma
# so that data are kept in their correct fields.
ocrErrors = pandas.read_csv("sample_output/sample_output_spellchecked.csv", sep=",")

# We'll ask Python to show us the data here:
ocrErrors

**Review each field again.** As a reminder, here is a description of each column:

- **file_name**: The name for the corresponding image file. For now, this is the only information in the table that identifies where the rest of the information in each row comes from (which page).
- **token_count**: The total number of tokens (words) found in each page.
- **unknown_count**: The number of unknown ("misspelled") words found in each page.
- **readability**: Think of this as the percentage of the page that was readable.
- **unknown_words**: A list of tokens (words or in some cases characters) that were not listed in the spellchecker. *Note that `b` appears in all of these because it's included at the beginning of each text field. We'll ignore `b` for now because it's part of how Python has read each page's text and isn't part of the OCR'ed text.*
- **text**: The OCR'ed text output from each page. The output here includes all <a href="https://en.wikipedia.org/wiki/Escape_character#JavaScript" target="blank">escape characters</a>, so it may look as if a lot of erroneous characters have been added. In the [next module](06-ExploratoryAnalysis.ipynb), we'll see how including these in our OCR'ed text can be useful.

We also have an **"Unnamed"** column, which holds an index, a unique number, assigned to each row. For the purposes of this module, we can ignore this column.

<div class="alert alert-block alert-warning">
<p><strong>Let's spend some time specifically with the "unknown_words" column.</strong> In this column, each row contains a list of words or characters that we identified in the last module as not matching words in our spellcheck list. Broadly speaking, each list contains <strong>two types of errors that we'll call <u>unique</u> and <u>recurring</u>.</strong></p>

<p><strong>Unique errors</strong> are those that occur infrequently or only on one page (in one row). <code>onehalf</code> and <code>nontax</code> are examples of unique errors.</p>

<p><em>What about <code>distriet</code>, <code>publie</code>, and <code>foree</code>? Do you think they are unique or recurring errors?</em></p>
    
<p><strong>Recurring errors</strong> are those that show up in many or all of the unknown_words lists in our table. <code>ch</code> is an example of a recurring error.</p>
    
<p><em>But is <code>ch</code> actually an error? What do you think?</em></p>

</div>
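
One way to explore the unique/recurring distinction is to count how many pages each unknown word appears on. The sketch below is only illustrative: the `pages` dictionary holds made-up lists standing in for the `unknown_words` column, and treating "appears on every page" as the cutoff for a recurring error is an assumption you may want to loosen.

```python
from collections import Counter

# Hypothetical unknown_words lists for three pages (illustrative data only,
# not the actual contents of sample_output_spellchecked.csv).
pages = {
    "page_a": ["ch", "publie", "onehalf"],
    "page_b": ["ch", "distriet"],
    "page_c": ["ch", "publie", "nontax"],
}

# Count how many pages each unknown word appears on.
# Using set() means a word is counted once per page.
page_counts = Counter()
for words in pages.values():
    page_counts.update(set(words))

# Assumption: a word found on every page is "recurring";
# anything found on fewer pages is treated as "unique".
recurring = sorted(w for w, n in page_counts.items() if n == len(pages))
unique = sorted(w for w, n in page_counts.items() if n < len(pages))

print("Recurring:", recurring)
print("Unique:", unique)
```

A count like this only tells us how often a token shows up. Deciding whether a recurring token such as `ch` is really an error still takes a human reader.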

### Unique Errors & Changes

#### Computer-identified errors.

`distriet`, `publie`, and `foree` are unknown, or misspelled, words that all have one thing in common: Tesseract recognized the letter `c` in each word as an `e`. While this error does repeat across our sample files, there are also many instances of *correctly used* `c` and `e` characters. So **we need to treat these as unique errors.**

There are at least **two ways to address unique computer-identified errors:**

1. Since we produced a list of unknown words in our readability test, we could simply open each file in a text editor and use find-and-replace functionalities (Command + F or Control + F) to locate and replace instances of unique errors.



2. We could use a little Python to find and replace these errors. This would be especially useful if we thought that individual errors such as `publie`, `foree`, or `distriet` might occur across the corpus. 

The following script runs through the entire sample output (and could be applied to an entire corpus) and checks for and replaces instances of `publie`: <a class="anchor" id="correction"></a>

In [None]:
# Import glob, a module that helps with file management.
import glob

# Identify the sample_output file path.
# Remember that our readability output is also stored 
# in this file as a .csv. We don't want to change it, 
# so we'll use glob to look for only .txt files.
filePath = glob.glob("sample_output/*.txt")

# Apply the following loop to one file at a time in filePath.
for file in filePath:
    
    # Open a file in "read" (r) mode. The "with" statement
    # closes the file for us when we're done reading.
    with open(file, "r") as textFile:
        
        # Read in the contents of that file.
        text = textFile.read()
    
    # Find instances of "publie" and change the word 
    # to "public".
    word = text.replace("publie", "public")
    
    # Reopen the file in "write" (w) mode.
    with open(file, "w") as textFile:
        
        # Write the corrected text back into the reopened file.
        textFile.write(word)

print("All instances of publie replaced with public.")

Check the [sample_output files](sample_output) to see if any instances of `publie` still appear. Note that you can replace `publie` and `public` in the script to locate and change other words. Consider how you would change `distriet`--remember that it was hyphenated, capitalized, and appeared at the end of a line.
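
If several known errors need fixing at once, one possible approach is to keep them in a dictionary and let regular expressions handle capitalization. This is a sketch under stated assumptions rather than the method used by the project: the `corrections` dictionary, the `fix_known_errors` function, and the sample sentence are all invented for illustration, and the sketch does not handle words split across lines by a hyphen.

```python
import re

# Hypothetical list of known OCR errors and their corrections.
corrections = {"publie": "public", "foree": "force", "distriet": "district"}


def fix_known_errors(text, corrections):
    """Replace each known error, keeping an initial capital letter if present."""
    for wrong, right in corrections.items():
        # \b restricts the match to whole words, and IGNORECASE
        # lets "distriet" also match "Distriet".
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", flags=re.IGNORECASE)

        def restore_case(match, right=right):
            # Capitalize the correction if the matched error was capitalized.
            if match.group(0)[0].isupper():
                return right.capitalize()
            return right

        text = pattern.sub(restore_case, text)
    return text


# Invented sample sentence (not taken from the session laws).
sample = "The publie schools of the Distriet shall enforee the law by foree."
fixed = fix_known_errors(sample, corrections)
print(fixed)
```

Because of the word boundaries, `enforee` is left alone even though it contains `foree`. Each error has to be listed explicitly, which is exactly why these count as unique errors.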

#### "Errors" that computers don't notice.

Let's compare what we found in the readability dataset with one of the original OCR'ed texts. Run this code to preview the text here. *What do you notice about the text? Do you see the same errors as those identified above? Do you see different errors?*

In [None]:
# We'll create a variable called "sampleText" to hold 
# the text from one of our OCR'ed text files. 
# When we open the file, we'll use "r" to specify that 
# we're only going to "read in" the text -- we're not going
# to be making any modifications to the file.

# If you'd like to view a different file, replace the file name 
# below ("sessionlaws...txt") with another file name 
# and rerun this script.
sampleText = open("sample_output/sessionlawsresol1955nort_0059.txt", "r")

# We'll use the "read" function to tell Python to 
# convert all of the content in the file into a string 
# (collection of characters) so that we can view it here.
sampleText = sampleText.read()

# This statement will display text from the file for us here.
print(sampleText)

As humans trained in reading, writing, and speaking the English language, we may notice at least one "error" that a computer would not: the word "session" appears as `SeSSION` -- our readability test has not recognized this as an error because the word is spelled correctly, albeit with unusual use of upper- and lowercase letters. **If our goal is to produce human readable and searchable digitized text,** then correcting errors such as changing the lowercase `e` to uppercase `E` in `SeSSION` might be a priority. **If our goal is to produce digitized text for computational analysis,** then fixing this "error" may not be a priority.

If we wanted to fix `SeSSION`, we could use the following functions to locate instances of irregularly capitalized words like session and make them fully capitalized (following the original text formatting). We won't do it here because it's not necessary for our purposes, but consider how you would use this script to fix case issues:

    # Open a file in "read" (r) mode.
    file = open("sample_output/sessionlawsresol1955nort_0059.txt", "r")
    text = file.read()
    
    # Find instances of "SeSSION" and make them all uppercase.
        # This function works with one known phrase or word at a time
        # in a single file. How might we do this differently if we 
        # needed to run this through many text files? How might we find
        # unidentified errors or variants such as "SeSSiON"?
    word = text.replace("SeSSION", "SESSION")
    
    # Close the file.
    file.close()
    
    # Reopen the file in "write" (w) mode.
    file = open("sample_output/sessionlawsresol1955nort_0059.txt", "w")
    
    # Add the changed word into the reopened file.
    file.write(word)
    
    # Close the file.
    file.close()
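
The comments above ask how we might find variants we haven't spotted yet, such as "SeSSiON". One possible approach, sketched here with an invented sample string and a hypothetical helper named `find_mixed_case`, is to list every word whose capitalization is neither all lowercase, all uppercase, nor title case, and then review that list by hand.

```python
import re


def find_mixed_case(text):
    """Return words with irregular capitalization, such as 'SeSSION'."""
    # Pull out runs of letters, ignoring digits and punctuation.
    words = re.findall(r"[A-Za-z]+", text)
    # Keep only words that are not lowercase, UPPERCASE, or Titlecase.
    return [w for w in words if not (w.islower() or w.isupper() or w.istitle())]


# Invented sample text for illustration.
sample = "SeSSION LAWS\n\nAn Act passed at the Session of 1955. SeSSiON LAWS"
suspects = find_mixed_case(sample)
print(suspects)
```

This only produces candidates. Legitimate names with internal capitals (for example, "McDowell") would be flagged too, so the list still needs human review before anything is replaced.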

### Recurring Errors & Changes

#### Specific Words & Phrases

Another issue you may have noticed is that our readability test missed `Cu. 4`, which should read as `Ch. 4`. We know this because a look at other pages tells us that sometimes `Ch` is read as `Cu` *and* our readability test has, interestingly, identified `ch` as an unknown word but not `cu`. (A digression into <a href="https://www.merriam-webster.com/dictionary/cu" target="blank"><em>Merriam Webster</em></a> may offer some explanation as to why `cu` is recognized and `ch` is not.)

Since we know that this error recurs, we can use the *same* script as we used to change `publie` to locate and change `Cu` to `Ch`. **[Scroll back up to the script we applied to unique errors](#correction) such as `publie` and change the word variable to `Cu` and `Ch` to see what happens.**
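
One caution before replacing `Cu` everywhere: `.replace()` changes every occurrence of those two letters, including ones inside longer words. The sketch below compares a plain replacement with a narrower regular expression. The sample string is invented for illustration, and the pattern assumes that a mistaken chapter abbreviation is always `Cu.` followed by a number, which may not hold for every page.

```python
import re

# Invented sample text: a header plus a word that legitimately contains "Cu".
sample = "SESSION LAWS Cu. 4\n\nthe commissioners of Cumberland County"

# A plain replacement changes every "Cu", including the one in "Cumberland".
naive = sample.replace("Cu", "Ch")

# A narrower pattern: "Cu." as a whole word, only when a digit follows.
# (?=\s*\d) is a lookahead, so the digit itself is not replaced.
careful = re.sub(r"\bCu\.(?=\s*\d)", "Ch.", sample)

print(naive)
print(careful)
```

Comparing the two printed results shows why it is worth checking a replacement against words you do *not* want to change.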

#### Word, Phrase, or Character Patterns

The above script only works when we know the specific word, characters, or phrase that we need to change. What if we need to fix an error that occurs in many different instances? What if this isn't an error but a change we want to make to the digitized text to help with both human reading and computational analysis? 

One example is the **hyphens that the printers used to break words at the end of lines.** Remember `Dis-trict` and `non-tax`? <em>We've already demonstrated how to remove these hyphens in the [previous module](04-HowToOCR.ipynb).</em> Here is the specific code. **Can you locate where to insert this in the same script we used to replace `Cu` above?**

        word = text.replace("-\n","")

Remember that `\n` is an "escape character," specifically the "newline character," which is usually invisible to human readers but that computers use to mark the end/beginning of a line. In a word processor such as Microsoft Word, each time you press the Enter/Return key on your keyboard, an invisible "\n" is created to mark the beginning of a new line. Since computers "see" `\n`, we can use this character to find only hyphenated words at the end of lines and replace them.
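
To see the effect without touching any files, the same replacement can be tried on a short string. The sample text here is invented for illustration.

```python
# Invented sample: two words broken across lines with a hyphen,
# plus one hyphen in the middle of a line.
sample = "for the Dis-\ntrict and all non-\ntax revenue of a so-called board\n"

# Remove only hyphens that come directly before a newline character.
joined = sample.replace("-\n", "")

print(joined)
```

Notice that the hyphen in `so-called` survives because it is not followed by `\n`. Notice also that `non-tax` becomes `nontax`: the replacement cannot tell a printer's line-break hyphen from a hyphen that belongs in the word, which is how `nontax` can end up in a list of unknown words.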

**Another change we might wish to make involves paratextual elements.** In this case, we are referring to the headers that appear at the top and bottom of each page showing chapter number, volume year, perhaps volume section name ("Session Laws"), and page number. If we don't want this information to be included in computational analysis, and/or if we plan to combine individual text files for each page into a single text file for an entire volume or corpus, then we might need to remove it:

<img src="images/09-data-01.jpeg" width="70%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="An excerpt from page 66 from the 1955 session laws showing the chapter, year, and volume section name." title="An excerpt from page 66 from the 1955 session laws showing the chapter, volume year, and volume section name." />

In the output .txt file (pg. 66), the header has been digitized as 

    CH. 17 1955—SEssIoN LAWS

A quick look through the other sample files, and we notice that the chapter numbers change on each page. If we were working with a larger number of pages, the volume year and volume section might also change. **How do we find and replace these elements if they change across the pages?**

Let's take a closer look. Here's the header with newline characters and some of the text from the page above shown, too:

    CH. 17 1955—SEssIoN LAWS\n
    \n
    main office, or at any branch office which he may establish. The return of
    every person reporting on a calendar year basis shall be filed on or before

*What do you notice?* 

Compare the above to this selection from another page in our sample (pg. 62):

    Cu. 9-10-11 1955—SEssIoN LAWS

    four per cent (4%) of the sales price as confirmed by the board of county commissioners.
  
These two samples may look a bit different, but *they do share several things in common:*

1. They share the same basic structure.
2. They are followed by an empty line before the main text on the page begins. The computer represents this as `\n\n`, or:

    `Cu. 9-10-11 1955—SEssIoN LAWS\n
    \n
    four per cent (4%) of the sales price as confirmed by the board of county commissioners.`

Although these two examples differ in the specific information they provide, each header has a similar construction: `Chapter Year—Session Laws\n\n`. Looking through the pages, we might notice that this pattern does not always occur at the beginning of each page. Let's run this script to see the first characters of each page:

In [None]:
# Import the regular expressions module (re), 
# which helps us use regex in Python.
import re

# Import glob, a module that helps with file management.
import glob

# Open the file folder where our sample output pages are 
# stored. Look for only files ending with the ".txt"
# file extension.
filePath = glob.glob("sample_output/*.txt")

# For each file in the sample_output folder:
for file in filePath:
    
    # Open a file.
    with open(file, "r") as textFile:
        
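        # Get a file name to help us identify which 
        # header comes from which file. We slice off the last
        # 4 characters (".txt") rather than using .strip(".txt"),
        # which removes any leading or trailing ".", "t", or "x" 
        # characters and not just the file extension.
        fileName = file[:-len(".txt")]
        
        # Read the first line of the file.
        firstLine = textFile.readline()
        
        # Print out the file name and first line in the file.
        print(fileName, ":", firstLine)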
        

**What do you notice about each header?** Is a chapter abbreviation always represented in the same way? Does the header always begin with a chapter abbreviation?

It so happens that those beginning with a chapter are even pages, and those beginning with a year are odd pages. 

Because of the different ways that Python has "read" each page, we'll need to use *patterns* to search for and replace these headers rather than using the .replace( ) function with a specific word as we did above. We can search for patterns using something called **regular expressions**.

<div class="alert alert-block alert-warning">
    <p><strong><a href="https://en.wikipedia.org/wiki/Regular_expression" target="blank">Regular Expressions</a></strong>, sometimes written or said as "regex," are sequences of characters used to create a search pattern. They are useful for finding not just one specific word or phrase but any word or phrase that broadly fits a pattern, meaning that different word forms (think about verb conjugation or tense) can be included in search results. Search engines use them to deliver search results, analysts can use them to find and change data, and researchers can use them in performing text analysis.</p>
    <p>Regular expressions work by using a <em>syntax</em> in which various characters are used to signify a specific pattern. One of these characters, a full stop or period, <code>.</code>, can be used to match any single character except for newline characters (<code>\n</code>) or other "terminators" (e.g., escape characters marking the end of a document).</p>
    <p>For example, if we were searching the corpus for any mention of the words <code>vote, votes, voted, voting, voter, voters</code>, we could search for each word one at a time, or we could use regular expressions to return all of these forms with one search. The phrase <code>vot.</code> would return any string of characters that include both <code>vot</code> and a single character following them--presumably an <code>e</code> but also any other character, which could be useful if we knew that OCR had in some cases mistaken <code>e</code> for <code>c</code>.</p>
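    <p>But how do we get other forms of the word, which have more than one additional character after <code>vot</code>? We can add the characters <code>*</code> (asterisk or star) and <code>?</code> (question mark) after <code>.</code> to match more than one character. <code>*</code> means "zero or more of the preceding character," so <code>.*</code> matches any run of characters on a line, and the <code>?</code> makes that run "lazy," meaning it stops as soon as the rest of the pattern can match. Because a lazy pattern matches as little as it can, <code>vot.*?</code> needs something after it to mark where the match should end. For example, <code>vot.*?\b</code> (where <code>\b</code> marks the edge of a word) will find the following terms: <code>vote, votes, voted, voting, voter, voters</code>.</p>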
</div>
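
These patterns are easy to try out on a short string. The sketch below uses an invented sentence and Python's `re.findall` function, which returns every match as a list. The word-boundary marker `\b` is one possible way to tell a lazy pattern where to stop.

```python
import re

# Invented sample sentence for illustration.
sample = "Every voter who voted may vote; voting is open to all voters."

# "vot." matches "vot" plus exactly one more character.
one_more = re.findall(r"vot.", sample)

# "vot.*?" on its own is lazy and has nothing after it,
# so it matches as little as possible: just "vot".
lazy_alone = re.findall(r"vot.*?", sample)

# "vot.*?\b" keeps going until it reaches the edge of the word.
whole_words = re.findall(r"vot.*?\b", sample)

print(one_more)
print(lazy_alone)
print(whole_words)
```

The same idea explains the header patterns used later in this module: there, the two newline characters `\n\n` play the role of the stopping point.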

Following what we've learned so far about regular expressions, how might we locate and remove headers from each page in our corpus? We can create regular expressions to match the pattern `Chapter Year—Session Laws\n\n` or the pattern `Year—Session Laws Chapter\n\n`.

To match a header such as `CH. 17 1955—SEssIoN LAWS\n\n`, we could try `C.*?\n\n`. This pattern breaks down in the following way:

- `C` = "C" is the first letter in the abbreviation for "chapter." This appears to be consistent for headers that begin with the chapter abbreviation.

- `.*?` = The regular expression syntax that will help us find any characters after "C".

- `\n\n` = The two newline characters that separate the paratextual header from the beginning of the main content on the page.

To match one of our year headers, such as `1955—SEsSION LAWS Cu. 4`, we could use `\d\d\d\d` to search for 4 digits (to match a year) followed by `.*?\n\n` from above.

Note that the first page in our sample, which begins `SESSION LAWS`, does not have a header. Instead, it has a title that is part of the main text.
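
Before applying these patterns to files, it can help to test them on short strings. In this sketch the header text comes from the examples above, while the body text after each header and the helper function `remove_header` are stand-ins invented for illustration. `\u2014` is the long dash that appears in the printed header.

```python
import re

# Headers copied from the examples above; the text after "\n\n" is placeholder body text.
even_page = "CH. 17 1955\u2014SEssIoN LAWS\n\nmain office, or at any branch office"
odd_page = "1955\u2014SEsSION LAWS Cu. 4\n\n(body text of the page)"
title_page = "SESSION LAWS\n\n(title text of the first page)"

# "^" anchors each pattern to the very start of the text.
header_ch = re.compile(r"^C.*?\n\n")
header_year = re.compile(r"^\d\d\d\d.*?\n\n")


def remove_header(page_text):
    """Strip a chapter or year header if the page starts with one."""
    for pattern in (header_ch, header_year):
        if pattern.search(page_text):
            # count=1 removes only the first (and only possible) match.
            return pattern.sub("", page_text, count=1)
    # No header found: return the page unchanged.
    return page_text


for page in (even_page, odd_page, title_page):
    print(repr(remove_header(page)))
```

Because `.` never matches a newline, each pattern only matches when the header sits alone on the first line and is directly followed by an empty line. A page whose first line happens to start with "C" and is followed by an empty line would also be treated as a header, so the results are worth spot-checking.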

**Let's use the following script to search for each possible header and create new text files without headers:**

In [None]:
# Import the regular expressions module (re), 
# which helps us use regex in Python.
import re

# Import glob, a module that helps with file management.
import glob

# Open the file folder where our sample output 
# pages are stored. Look for only files ending with ".txt".
filePath = glob.glob("sample_output/*.txt")

# Save the pattern for a chapter header (even pages) that we 
# want to search each page for. We've added "^" to our regular 
# expressions to be extra sure that Python searches only at the 
# beginning of each file. The "r" before the opening quote marks 
# a "raw" string, so Python passes backslashes to the regular
# expression unchanged.
headerCH = re.compile(r"^C.*?\n\n")

# Save the pattern for a year header (odd pages) that we want 
# to search each page for. "\d" in regular expressions represents 
# one digit (0-9).
headerYear = re.compile(r"^\d\d\d\d.*?\n\n")

# Create an empty string to replace the header text with.
# This will delete the header.
replacement = ""

# For each text file in the sample_output folder:
for file in filePath:
    
    # Create a file name for the new output file.
    
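    # First, get the existing file name 
    # (e.g. "sessionlawsresol1955nort_0066.txt")
    # & remove the file extension by slicing off the last 4
    # characters. (.strip(".txt") would instead remove any leading 
    # or trailing ".", "t", or "x" characters.)
    outFileName = file[:-len(".txt")]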
    
    # Then, concatenate (add) the existing file name with additional
    # information ("_noheader") and the file extension to create a
    # new name (e.g. "sessionlawsresol1955nort_0066_noheader.txt").
    outFileName = outFileName + "_noheader.txt"
    
    # Create and open a new "outFile" to save our results to.
    # "w" tells Python that we plan to write to this file.
    outFile = open(outFileName, "w")
    
    # Open as readable ("r") the existing text file and read its
    # contents into "inFile". The "with" statement closes the 
    # file for us once it has been read.
    with open(file, "r") as textFile:
        inFile = textFile.read()
        
    # Search inFile for the header beginning with a 
    # chapter abbreviation (such as CH).
    if re.search(headerCH, inFile):
        
        # If a chapter abbreviation header is found,
        # print a statement to let us know that there is a match.
        print(outFileName, "chapter header match")
        
        # And write the contents of the inFile WITHOUT the
        # header to the new outFile.
        outFile.write(re.sub(headerCH, replacement, inFile))
    
    # If a chapter abbreviation is not found, search the inFile
    # for the header beginning with a year.
    elif re.search(headerYear, inFile):
        
        # If a year header is found,
        # print a statement to let us know that there is a match.
        print(outFileName, "year header match")
        
        # And write the contents of the inFile WITHOUT the
        # header to the new outFile.
        outFile.write(re.sub(headerYear, replacement, inFile))
    
    # If neither the chapter nor the year header is found,
    else:
        
        # print a statement to let us know that no matches were found.
        print(outFileName, "no matches found")
        
        # And write all of the contents from the inFile to the outFile.
        outFile.write(inFile)
    
    # Close the current outFile and move to the next file.
    outFile.close()

# The loop will finish when Python has gone through all files in 
# the sample_output folder.

Open your [sample_output folder](sample_output) to view the new files without headers that you've created. Take a look at the entire contents of the files. What do you notice? Have we successfully removed *all* paratextual content (that is, information that structures the text but is not part of the main content)? 

**HINT:** What about those numbers at the bottom of each page? How might you change the above script to remove them? 

**See if you can modify the script above to remove page numbers.**
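
If you'd like a starting point, here is one possible sketch. It assumes that the page number is the last line of the page and consists only of digits, which you should check against your own files. The sample text ending is invented for illustration.

```python
import re

# Invented page ending: body text, an empty line, then a page number.
page = "shall be filed on or before\n\n66\n"

# One or more newlines, then digits (with optional surrounding whitespace)
# at the very end of the text. "\Z" matches only at the end of the string.
footer = re.compile(r"\n+\s*\d+\s*\Z")

# Replace the footer with a single newline so the page still ends cleanly.
cleaned = footer.sub("\n", page)

print(repr(cleaned))
```

A number that ends a sentence on the same line as other text is left alone, because the pattern requires the digits to sit on a line of their own at the end of the file.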

<div class="alert alert-block alert-success">
    <strong>Learn more about Regular Expressions:</strong> 
    <ul>
        <li><a href="https://www.dataquest.io/blog/regex-cheatsheet/" target="blank">Python Regex Cheat Sheet</a></li>
        <li><a href="https://programminghistorian.org/en/lessons/understanding-regular-expressions" target="blank">Understanding Regular Expressions</a></li>
        <li><a href="https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions" target="blank">Cleaning OCR'ed Text with Regular Expressions</a></li>
        <li><a href="https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf" target="blank">Regular Expressions: The Complete Tutorial</a></li>
    </ul>
</div>

## Structuring OCR'ed Text as Data <a class="anchor" id="text-as-data"></a>

We've spent some time practicing ways to correct OCR'ed text and begin to structure it for data analysis. In the [next module](06-ExploratoryAnalysis.ipynb) we'll begin exploring OCR'ed text as data. We'll look at ways to analyze text content and view laws temporally and spatially. We'll be working with the <strong><a href="https://pandas.pydata.org/" target="blank">Pandas</a></strong> library to prepare data for analysis. Pandas is an extremely useful library for working with data, and we'll just be skimming the surface of what it can do here.

Instead of continuing to structure our own sample from the *On The Books* corpus, we're going to begin working with the *entire* corpus of Jim Crow laws. We'll begin by looking at the laws, which have been stored together in one file, `on_the_books_text_jc_all_v1.txt`, which you'll find <a href="https://cdr.lib.unc.edu/concern/data_sets/nc580t06n?locale=en" target="blank">here</a>. (Click Download on the webpage that loads, or click [here](https://cdr.lib.unc.edu/downloads/6q182r84s?locale=en).)

*What do you notice about how the laws have been structured in this one file? How are they separated from one another? What other kinds of data accompany them? What kinds of data do NOT accompany them?*

<img src="images/09-data-02.jpeg" width="90%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="An excerpt from the corpus of North Carolina Jim Crow laws stored in a text file." title="An excerpt from the corpus of North Carolina Jim Crow laws stored in a text file." />

The above image shows an excerpt of the corpus file including 3 laws. Each law and its accompanying metadata (data about the law) are separated from the previous and next laws by 3 empty lines, which Python reads as 4 newline characters, `\n\n\n\n`.

Remember that in *On The Books*, Jim Crow laws were identified at the level of a chapter *section*, so each of the laws we are viewing here comes from a different chapter and/or volume. The first line includes 
- the volume, listed by year(s); 
- whether the law is considered "public" (public laws focus on public institutions and public property) or "private" law (private laws regulate individual private property, private companies, etc.);
- the chapter number;
- the section number.

In the screen capture above, each law has been identified as (`Identified by:`) a Jim Crow law by a human expert or by the algorithms used by the *On The Books* team ("model"). 

<div class="alert alert-block alert-warning">
    <p>The model was developed by feeding it a dataset of laws that human experts had identified as either being Jim Crow or not being Jim Crow. Two human experts went through a selection of laws to make these identifications, and they did not always agree. However, computers require clear forms of classification (e.g. yes/no). The team's programmers then created a model that could "read" the selected laws ("training set") and their classifications. As the computer "read," it "learned" how laws labeled "yes" (Jim Crow) were structured and the kinds of words that appear in those laws, and it did the same with "no" (not Jim Crow) laws. It used this information to go through the entire corpus and attempt to correctly identify all of the laws that human experts had not reviewed. There is more to this process, and if you wish to learn more, you'll find the documentation and code in <a href="https://github.com/UNC-Libraries-data/OnTheBooks" target="blank">the <em>On The Books</em> GitHub repository</a>.</p>
    </div>

After the `Identified by:` field, the metadata accompanying the law then includes the full name of the chapter followed by the full text of that section.

Note that the first line, Identified by, chapter title, and section text are all separated from one another by 2 `\n` characters.

**Currently, all of this information is recognized by our computers as a single file, or collection, of text.** If we want to be able to view laws, for example, by volume or by location, then we need to structure this text in a way that provides explicit information, such as dates, chapter names, etc., to our computers. *As it turns out, we've already done this once: remember the readability test? We used Pandas in* [that module](04-HowToOCR.ipynb) *to create the table that showed us the unknown words, readability score, etc.* We'll do that again here, but this time **we'll create a table that includes things like year, chapter, identified by, etc.--all information we can use in our analyses.**

### Step 1: Getting the Text File

The text file for the corpus is located in the Carolina Digital Repository (CDR). In order to work with it, let's download it and add it to our Binder or local workspace in Jupyter Notebooks. You may remember that we used similar code when we first [gathered a selection of the corpus from the Internet Archive](02-GatheringACorpus.ipynb).

**To get the data's URL**, navigate to the [Jim Crow Laws](https://onthebooks.lib.unc.edu/laws/jim-crow-laws/) page on the *On The Books* website. Click on the link in the top paragraph, "all laws identified as likely to be Jim Crow in plain text format." This will take you to the CDR's page for the dataset. On the CDR page, *find the "Download the file" button. Right click on it, and select "Copy Link" (sometimes "Copy Link Location").*

Paste the URL in the script below over the text `'INSERT LINK HERE'`. That line should then look something like this:

`url = 'https://cdr.lib.unc.edu/downloads/6q182r84s?locale=en'`

**The following script will only work if you have completed the instructions above and added the link to the data file stored in the CDR.**

In [None]:
# Requests helps us call up a webpage, or link to a file stored online,
# and access the content.
import requests

# Create a variable to hold the direct link to the text file.
# ADD THE LINK BETWEEN THE QUOTES BELOW.
url = 'INSERT LINK HERE'

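# Here's where we use the requests module to call up
# the content at the url we specified above.
r = requests.get(url)

# Stop with an error message if the download failed (for example,
# if the link is wrong) rather than saving an error page as our file.
r.raise_for_status()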

# Create and open a new empty text file.
with open('on_the_books_text_jc_all.txt', 'wb') as f:
    
    # Write the contents of the online file into the new file.
    f.write(r.content)

# When finished, print the following:
print('Corpus downloaded.')

Now, look in the [same folder where these modules are stored](oer) to find the file.

### Step 2: From Text File to Dataframe

We'll begin by creating a **dataframe** (think of a dataframe as a table) using Pandas that can store the 4 main sections of data that we've identified. To do this, we'll split the text file into a list of laws and their accompanying data. For example, one item in this list will contain:

<img src="images/09-data-03.jpeg" width="90%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="One law and its accompanying data from the corpus of North Carolina Jim Crow laws." title="One law and its accompanying data from the corpus of North Carolina Jim Crow laws." />

We'll then break this one law down into its own list so that each of the sections above becomes an item in that list:

<img src="images/09-data-04.jpeg" width="90%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Year/Volume, Law Type, Chapter, and Section of a Jim Crow law." title="Year/Volume, Law Type, Chapter, and Section of a Jim Crow law." />


<img src="images/09-data-05.jpeg" width="90%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Identified by field of a Jim Crow law." title="Identified by field of a Jim Crow law." />


<img src="images/09-data-06.jpeg" width="90%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Chapter title of a Jim Crow law." title="Chapter title of a Jim Crow law." />


<img src="images/09-data-07.jpeg" width="90%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Section text of a Jim Crow law." title="Section text of a Jim Crow law." />

We'll add each individual law's list to a compiled list of lists. Finally, we'll convert that list of lists into a spreadsheet (csv) file that we'll read back into a Pandas dataframe (table). Each row in the table will represent 1 law, and each column will hold one of the 4 parts above (Volume Law Type Chapter Section; Identified by; Chapter Title; and Section Text). It's OK if this doesn't fully make sense yet. Run the scripts below to create the dataframe, then go back and read through each step:

In [2]:
# Import the regular expressions library.
# We'll need regex again to help us find the beginning and end 
# of each piece of data.
import re

# Open and read the plain text file of all identified Jim Crow laws.
with open("on_the_books_text_jc_all.txt", "r") as file:
    jclaws = file.read()
    
    # First, we'll split up the laws into a list with each law and 
    # its accompanying data as a separate list item. 
    # We'll use a run of 4 newline characters ("\n\n\n\n") to find
    # the beginning and end of each law.
    laws_split = jclaws.split("\n\n\n\n")
    
    # If you'd like to see what this list looks like, delete the "#"
    # next to "print" in the line below before (re)running this script:
    # print("List of laws:", laws_split)
    
    # Next, we'll create an empty list that we'll fill below 
    # with one list per law holding its four main parts: 
    # Volume Law Type Chapter Section, Identified by, Chapter Title, 
    # and Section Text.
    laws_list = []
    
    # For each law (list item) in laws_split, split the law into its parts.
    for law in laws_split:
        
        # Convert the law from a list item to a string.
        law = str(law)
        
        # Split the law into its 4 main parts using the \n\n character 
        # pattern, and store these parts as a list.
        law_list = re.split("\n\n", law)
        
        # If you'd like to see each law in its new list form, delete
        # the "#" next to "print" in the line below before (re)running 
        # this script:
        # print(law_list)

        # Add the list of law parts above to the larger list that 
        # will contain all law lists, creating a list of lists.
        laws_list.append(law_list)

# If you'd like to see the list of lists, delete the "#" next to 
# "print" in the line below before (re)running this script:
# print(laws_list)    

Now that we have our list of laws (or list of lists), we'll write these into a spreadsheet:

In [3]:
# Import the csv module, which helps us create a csv (comma separated value) file.
import csv 

# Create the column headers for the file.
column_headers = ["VolumeLawTypeChapterSection","IdentifiedBy","ChapterTitle","SectionText"]  

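# Create a new csv file and open it. Passing newline='' lets the csv 
# module control line endings, which avoids blank rows on Windows.
with open('jc_laws_list.csv', 'w', newline='') as f: 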

    # This variable creates a writer that will add our laws to the csv file.
    write = csv.writer(f) 
    
    # Add the column headers first so that they appear in the first row.
    write.writerow(column_headers) 
    
    # Add each of the laws from laws_list.
    write.writerows(laws_list) 

You can [view the new sheet here](jc_laws_list.csv).

We've added this extra step (writing the laws to a .csv file and reading that file back in) to ensure that the data gets structured correctly. Some laws contain formatting that interferes with our four-column structure, creating more columns than we need.
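
If you're curious how many laws break the four-part pattern, one option is to count the number of parts in each law's list. The sketch below uses made-up data (`sample_laws_list` is a hypothetical stand-in for `laws_list`, with one deliberately irregular entry); the same two lines of counting could be pointed at `laws_list` instead.

```python
from collections import Counter

# Hypothetical stand-in for laws_list. The second "law" has an extra
# part, as if a blank line inside its text had caused an extra split.
sample_laws_list = [
    ["1900 Public Laws Ch. 1 Sec. 2", "Identified by: expert",
     "CHAPTER I.", "Sec. 2 Text."],
    ["1901 Public Laws Ch. 3 Sec. 4", "Identified by: model",
     "CHAPTER III.", "Sec. 4 Text.", "More text."],
]

# Tally how many laws have 4 parts, 5 parts, and so on.
parts_per_law = Counter(len(parts) for parts in sample_laws_list)
print(parts_per_law)

# Collect the laws that don't have exactly 4 parts for a closer look.
irregular = [parts for parts in sample_laws_list if len(parts) != 4]
print(len(irregular), "law(s) do not have exactly 4 parts")
```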

Now we'll read this spreadsheet back in so that we can work with the data. 

In [4]:
# Import the pandas library, which we'll use to structure our data.
import pandas as pd

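# Read the csv file into a dataframe. The "sep" argument tells Pandas
# that commas separate the cells.
df = pd.read_csv("jc_laws_list.csv", sep=",")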

# Print a preview of the dataframe for us to view here.
df.head()

Unnamed: 0,VolumeLawTypeChapterSection,IdentifiedBy,ChapterTitle,SectionText
0,1874/75 Private Laws Ch. 80 Sec. 4,Identified by: model and expert,CHAPTER LXXx. AN ACT CONCERNING THE CITY OF RA...,Sec. 4 Said registrars shall be furnished by s...
1,1874/75 Private Laws Ch. 138 Sec. 9,Identified by: model and expert,CHAPTER CXXXVIII. AN ACT TO AUTHORIZE THE ESTA...,Sec. 9 The board of school commissioners shall...
2,1874/75 Public Laws Ch. 89 Sec. 1,Identified by: model and expert,"CHAPTER LXXXIX. AN AOT TO AMEND CHAPTER FIVE, ...",Sec. 1 Zhe General Assembly of North Carolina ...
3,1876/77 Public Laws Ch. 139 Sec. 4,Identified by: model and expert,CHAPTER CXXXIX. AN ACT CONCERNING THE TOWN OF ...,Sec. 4 Said registrar shall be furnished by sa...
4,1876/77 Public Laws Ch. 162 Sec. 22,Identified by: model and expert,CHAPTER CLXII. AN ACT TO REVISE AND CONSOLIDAT...,Sec. 22 The county board of education shall co...


**Take a few moments to study the dataframe we've just created:** can you now see how the laws have been restructured from a text file into a table that separates different information into columns?

Note the numbers labeling each row of the dataframe on the left side. These are unique identifiers for each row. They don't relate to the chapter or section numbers, but they can help us differentiate between the laws.



### Step 3: Refining the Dataframe

We now have the laws in a table, but if we know we'll want to be able to analyze the laws by volume, by law type, or by some other feature, we'll want to add some more columns to our dataframe. Pandas lets us modify an existing dataframe *without* having to create a new one. 

We'll begin by splitting the "Volume Law Type Chapter Section" column into 4 separate columns. We'll use regular expressions again to help us get each piece of data. In Pandas, square brackets, `[ ]`, can be used to identify an existing or new column name. We'll create a new column for a piece of data using 1 line of code that looks like this:

`df["Volume"] = df["VolumeLawTypeChapterSection"].str.extract(r"(\d\d\d\d\/\d\d|\d\d\d\d)", expand=True)`

There's a lot happening in that 1 line, so let's break it down, from left to right, into the following steps:

1. `df["Volume"] =` - Name the new column that will store the result.


2. `df["VolumeLawTypeChapterSection"]` - Identify which *existing* column the new column should pull its data from. 


3. `.str.` - Treat the contents of the *existing* column as strings (text) so that we can use string methods on them.


4. `.extract(r"(\d\d\d\d\/\d\d|\d\d\d\d)",` - Extract only the characters that match a given regular expression. The `r` before the opening quote marks a "raw" string, so Python passes the backslashes to the regular expression unchanged.


5. `expand=True` - Finally, return the extracted data as a table with one column for each set of parentheses in the regular expression. Ours has a single set, so we get a single column, which step 1 stores under the new column name.
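
To see these steps in isolation, here is a small sketch that runs the same pattern on two made-up labels (the `labels` Series and its values are our own examples, written in the same layout as the tutorial data). It also compares `expand=True` with `expand=False`, which, as far as the Pandas documentation describes it, controls whether a table or a single column comes back.

```python
import pandas as pd

# Made-up labels in the same layout as the tutorial data.
labels = pd.Series([
    "1874/75 Private Laws Ch. 80 Sec. 4",
    "1955 Session Laws Ch. 12 Sec. 1",
])

# A raw string (r"...") keeps Python from interpreting the backslashes.
volume_pattern = r"(\d\d\d\d\/\d\d|\d\d\d\d)"

# expand=True returns a DataFrame with one column per capture group.
as_table = labels.str.extract(volume_pattern, expand=True)

# expand=False returns a Series when the pattern has a single group.
as_column = labels.str.extract(volume_pattern, expand=False)

print(type(as_table).__name__, as_table[0].tolist())
print(type(as_column).__name__, as_column.tolist())
```

Both versions contain the same extracted values; only the container differs.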

In the following script, we repeat the above steps 4 times to create new columns for Volume, Law Type, Chapter Number, and Section Number:

In [5]:
# First, let's get the volume. 
# Our regular expression (r"(\d\d\d\d\/\d\d|\d\d\d\d)") helps us get 
# volumes that are written as either "1874/75" or "1955". The "r" before 
# the opening quote marks a raw string, so Python passes the backslashes 
# to the regular expression unchanged. The pipe ("|") in the regular 
# expression represents "or". With "expand=True", the match comes back 
# as a one-column table that we store in the new column.
df["Volume"] = df["VolumeLawTypeChapterSection"].str.extract(r"(\d\d\d\d\/\d\d|\d\d\d\d)", expand=True)

# Let's repeat the above process for law types.
# In this case, we can search for specific labels, "Private Laws,"
# "Public Laws," or "Session Laws." Again, the pipe ("|") symbolizes "OR": 
# We're telling Pandas to search for and extract 
# Private Laws OR Public Laws OR Session Laws.
df["LawType"] = df["VolumeLawTypeChapterSection"].str.extract(r"(Private Laws|Public Laws|Session Laws)", expand=True)

# Now let's get the chapter number. This regular expression begins 
# with the characters "Ch". "\." tells Python to look for a period 
# (".") rather than using the period to symbolize any character. 
# And because chapter numbers vary in the number of digits 
# (e.g. 6 or 24 or 139), we use "\d" to look for 1 digit and "+" 
# to tell Python to look for additional digits until there are no more.
# The parentheses mark the part we want to keep: the digits only.
df["ChapterNum."] = df["VolumeLawTypeChapterSection"].str.extract(r"Ch\. (\d+)", expand=True)

# Next, the section number. We use the same basic construction as we 
# did for chapter, except we replace "Ch" with "Sec".
df["SectionNum."] = df["VolumeLawTypeChapterSection"].str.extract(r"Sec\. (\d+)", expand=True)

# Let's preview the dataframe again to see how it's changed.
df.head()

Unnamed: 0,VolumeLawTypeChapterSection,IdentifiedBy,ChapterTitle,SectionText,Volume,LawType,ChapterNum.,SectionNum.
0,1874/75 Private Laws Ch. 80 Sec. 4,Identified by: model and expert,CHAPTER LXXx. AN ACT CONCERNING THE CITY OF RA...,Sec. 4 Said registrars shall be furnished by s...,1874/75,Private Laws,80,4
1,1874/75 Private Laws Ch. 138 Sec. 9,Identified by: model and expert,CHAPTER CXXXVIII. AN ACT TO AUTHORIZE THE ESTA...,Sec. 9 The board of school commissioners shall...,1874/75,Private Laws,138,9
2,1874/75 Public Laws Ch. 89 Sec. 1,Identified by: model and expert,"CHAPTER LXXXIX. AN AOT TO AMEND CHAPTER FIVE, ...",Sec. 1 Zhe General Assembly of North Carolina ...,1874/75,Public Laws,89,1
3,1876/77 Public Laws Ch. 139 Sec. 4,Identified by: model and expert,CHAPTER CXXXIX. AN ACT CONCERNING THE TOWN OF ...,Sec. 4 Said registrar shall be furnished by sa...,1876/77,Public Laws,139,4
4,1876/77 Public Laws Ch. 162 Sec. 22,Identified by: model and expert,CHAPTER CLXII. AN ACT TO REVISE AND CONSOLIDAT...,Sec. 22 The county board of education shall co...,1876/77,Public Laws,162,22


**How has the dataframe changed?** Where are the new columns? What do they contain? 

Note that the original column, "Volume Law Type Chapter Section" has not been changed. We could remove it, but we might decide to come back to it later--we'll leave it as is for now.
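
If we did decide to remove the original column later, Pandas' `drop` method is one way to do it. This is only a sketch on a made-up, one-row dataframe (`toy` and `trimmed` are hypothetical names), so nothing here changes the tutorial's `df`.

```python
import pandas as pd

# A made-up, one-row dataframe with the combined column and one new column.
toy = pd.DataFrame({
    "VolumeLawTypeChapterSection": ["1874/75 Private Laws Ch. 80 Sec. 4"],
    "Volume": ["1874/75"],
})

# drop() returns a new dataframe without the named column.
# The original dataframe, toy, is left untouched.
trimmed = toy.drop(columns=["VolumeLawTypeChapterSection"])

print(list(trimmed.columns))
print(list(toy.columns))
```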

**We can also modify the contents of a column if we need to.** In the `IdentifiedBy` column, each cell contains "Identified by:" in addition to its value, e.g. "model" or "expert." Because we have a column label, we don't need the repeating "Identified by:" in each cell. Let's remove it:

In [6]:
# In this line, we'll remove "Identified by: " from the beginning of
# each cell in the IdentifiedBy column. The "^" in the regular 
# expression anchors the match to the start of the cell. (We match the
# exact prefix rather than using "lstrip", which removes a set of 
# characters, not a prefix.) Essentially, Python is overwriting the 
# existing column with the "model" or "expert" values without including
# "Identified by: ".
df["IdentifiedBy"] = df["IdentifiedBy"].str.replace(r"^Identified by: ", "", regex=True)

# Show us a preview of the updated dataframe.
df.head()

Unnamed: 0,VolumeLawTypeChapterSection,IdentifiedBy,ChapterTitle,SectionText,Volume,LawType,ChapterNum.,SectionNum.
0,1874/75 Private Laws Ch. 80 Sec. 4,model and expert,CHAPTER LXXx. AN ACT CONCERNING THE CITY OF RA...,Sec. 4 Said registrars shall be furnished by s...,1874/75,Private Laws,80,4
1,1874/75 Private Laws Ch. 138 Sec. 9,model and expert,CHAPTER CXXXVIII. AN ACT TO AUTHORIZE THE ESTA...,Sec. 9 The board of school commissioners shall...,1874/75,Private Laws,138,9
2,1874/75 Public Laws Ch. 89 Sec. 1,model and expert,"CHAPTER LXXXIX. AN AOT TO AMEND CHAPTER FIVE, ...",Sec. 1 Zhe General Assembly of North Carolina ...,1874/75,Public Laws,89,1
3,1876/77 Public Laws Ch. 139 Sec. 4,model and expert,CHAPTER CXXXIX. AN ACT CONCERNING THE TOWN OF ...,Sec. 4 Said registrar shall be furnished by sa...,1876/77,Public Laws,139,4
4,1876/77 Public Laws Ch. 162 Sec. 22,model and expert,CHAPTER CLXII. AN ACT TO REVISE AND CONSOLIDAT...,Sec. 22 The county board of education shall co...,1876/77,Public Laws,162,22


We can break down the line of code we just ran, from left to right, as follows:

`df["IdentifiedBy"] =` asks Python to overwrite the "IdentifiedBy" column with the result of what follows.

`df["IdentifiedBy"].str` reads the contents of the "IdentifiedBy" column as strings.

`.replace(r"^Identified by: ", "", regex=True)` looks at the beginning of each cell for "Identified by: " and replaces it with an empty string, which removes it.
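
A word of caution about `lstrip`, illustrated here in plain Python with a made-up cell value: `lstrip` treats its argument as a *set of characters* to remove from the left, not as an exact prefix. A value that begins with one of those characters, such as "expert", can therefore lose its first letter. Removing the exact prefix avoids this.

```python
# A made-up cell value whose label is followed by "expert".
value = "Identified by: expert"

# lstrip removes any leading characters found in the given set, so it
# keeps going past the label and also removes the "e" in "expert".
stripped = value.lstrip("Identified by: ")
print(stripped)   # xpert

# Removing the exact prefix leaves the value intact.
prefix = "Identified by: "
cleaned = value[len(prefix):] if value.startswith(prefix) else value
print(cleaned)    # expert
```

A value like "model and expert" is unaffected by `lstrip` only because "m" happens not to appear in "Identified by: ".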

**Finally, let's add one more column of data: the number of words in each law.**

In [7]:
# Count the words in each law's Section Text and store the count 
# in a new WordCount column.
df["WordCount"] = df["SectionText"].apply(lambda x: len(str(x).split(" ")))

# Show us a preview of the updated dataframe.
df.head()

Unnamed: 0,VolumeLawTypeChapterSection,IdentifiedBy,ChapterTitle,SectionText,Volume,LawType,ChapterNum.,SectionNum.,WordCount
0,1874/75 Private Laws Ch. 80 Sec. 4,model and expert,CHAPTER LXXx. AN ACT CONCERNING THE CITY OF RA...,Sec. 4 Said registrars shall be furnished by s...,1874/75,Private Laws,80,4,332
1,1874/75 Private Laws Ch. 138 Sec. 9,model and expert,CHAPTER CXXXVIII. AN ACT TO AUTHORIZE THE ESTA...,Sec. 9 The board of school commissioners shall...,1874/75,Private Laws,138,9,47
2,1874/75 Public Laws Ch. 89 Sec. 1,model and expert,"CHAPTER LXXXIX. AN AOT TO AMEND CHAPTER FIVE, ...",Sec. 1 Zhe General Assembly of North Carolina ...,1874/75,Public Laws,89,1,46
3,1876/77 Public Laws Ch. 139 Sec. 4,model and expert,CHAPTER CXXXIX. AN ACT CONCERNING THE TOWN OF ...,Sec. 4 Said registrar shall be furnished by sa...,1876/77,Public Laws,139,4,266
4,1876/77 Public Laws Ch. 162 Sec. 22,model and expert,CHAPTER CLXII. AN ACT TO REVISE AND CONSOLIDAT...,Sec. 22 The county board of education shall co...,1876/77,Public Laws,162,22,105


**How did we do that? Let's break it down:**

`df["WordCount"] =` creates a new column name.

`df["SectionText"].apply` tells Python to *apply* whatever instructions come after this part of the line to the entire SectionText column.

`(lambda x: len(str(x).split(" ")))` in a nutshell counts the number of words in each SectionText cell. It's using a number of functions to do that, though, so let's break things down a little further:

- `lambda x:` -- `lambda` in Python creates what's known as an "anonymous function" -- basically, it stores a given set of instructions and uses them on the value `x`. (Learn more about <a href="https://www.w3schools.com/python/python_lambda.asp" target="blank">lambda</a>.)


- `len()` gets the number of items in whatever it's given. For a string, it counts characters: `len("Hello World!")` would return the value `12`. For a list, it counts list items, which is how we'll use it here to count *words*, not characters.


- `str(x).` tells Python to read the contents of a cell, `x`, as a string (plain text).


- `split(" ")` tells Python to split that string by spaces, which creates a list of words. Punctuation is likely attached to some words, but because we're not counting individual word *length*, this doesn't matter. So `labor` and `labor,` both count as 1 word.

Working backwards, then, the `len()` function counts the number of *words* in the list created by `str(x).split(" ")`. `lambda x:` stores the instructions `len(str(x).split(" "))` to be used as many times as needed, and `apply` applies the instructions stored in `lambda x:` to every cell in the "SectionText" column. The word count for each Section is then stored in a cell on the *same row* in a new column called "WordCount".
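
Here is the same counting logic on a single made-up sentence, outside of Pandas. The sentence deliberately contains a double space, which is common in OCR'ed text, to show one caveat: `split(" ")` treats every single space as a boundary, so a run of spaces produces an empty "word". Calling `split()` with no argument splits on any run of whitespace instead. Which behavior is preferable depends on your data; this is only an illustration.

```python
# A made-up sentence with a double space between "of" and "school".
text = "Sec. 9 The board of  school commissioners shall meet."

# Splitting on a single space yields an empty string for the extra space.
words_by_space = text.split(" ")

# Splitting on any run of whitespace ignores the extra space.
words_by_whitespace = text.split()

print(len(words_by_space))       # 10
print(len(words_by_whitespace))  # 9
```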

Before we move on to the next and final module, let's save our dataframe as a .csv file so that we can access it:

In [None]:
# Let's write this dataframe in a csv file. 
# We'll use a pipe ("|") to separate cells for now to avoid interfering with 
# comma usage in the Section Text and Chapter Title columns.
# We'll also set index to False, which will exclude the row numbers above
# from the csv file.
df.to_csv('jclaws_dataset.csv', sep="|", index=False)

# Print a confirmation once the file has been created.
print("jclaws_dataset.csv created.")

Locate the [jclaws_dataset.csv file](jclaws_dataset.csv) in the same folder as this tutorial and open it in a text editor (such as Notepad or TextEdit). Note how the pipes "|" separate each cell, or each piece of data.
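
When we want to load a pipe-separated file back into Pandas, we need to tell `read_csv` which separator to expect. The sketch below uses a small made-up snippet in place of the real file (`pipe_text` and its contents are hypothetical), read through `io.StringIO` so that it runs without any file on disk. With the real file, you would pass the filename instead.

```python
import io
import pandas as pd

# A made-up, pipe-delimited snippet standing in for jclaws_dataset.csv.
pipe_text = (
    "ChapterTitle|SectionText|WordCount\n"
    "CHAPTER I. AN ACT, WITH A COMMA|Sec. 1 Some text, with commas.|6\n"
)

# sep="|" tells Pandas that pipes, not commas, separate the cells.
reloaded = pd.read_csv(io.StringIO(pipe_text), sep="|")

# The commas inside the cells stay part of the text.
print(reloaded.shape)
print(reloaded.loc[0, "ChapterTitle"])
```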

<div class="alert alert-block alert-success">
    <strong>Review &amp; Next Steps:</strong> 
    <p>In this module, we learned some basic methods to correct OCR errors and change OCR'ed text using regular expressions. We also began structuring OCR'ed text as data using Pandas. In the next module, we'll see how Pandas can be used to not only structure data but also to perform exploratory analyses.</p>
</div>

## Resources <a class="anchor" id="resources"></a>

### Regular Expressions

- Knox, Doug. ["Understanding Regular Expressions"](https://programminghistorian.org/en/lessons/understanding-regular-expressions). *Programming Historian*.

- Lovett, M. ["Regular Expressions: The Complete Tutorial"](https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf). Princeton University.

- O'Hara, Laura Turner. ["Cleaning OCR'ed Text with Regular Expressions"](https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions). *Programming Historian*.

- Yang, Alex. ["Python Regex Cheat Sheet"](https://www.dataquest.io/blog/regex-cheatsheet/). *Dataquest*.

### Pandas

- Burns, Halle. ["Crowdsourced-Data Normalization with Python and Pandas"](https://programminghistorian.org/en/lessons/crowdsourced-data-normalization-with-pandas#exploring-the-nypl-historical-menu-dataset). *Programming Historian*.

- ["Getting Started Tutorials"](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html). Pandas.pydata.org.

- ["Pandas"](https://unc-libraries-data.github.io/Python/Jupyter/Pandas.html). UNC Libraries Data.

- ["Pandas Tutorial"](https://www.w3schools.com/python/pandas/default.asp). W3Schools.

**>> Next module: [Exploring OCR'ed Text as Data](06-ExploratoryAnalysis.ipynb) >>**

*This module is licensed under the [GNU General Public License v3.0](https://github.com/UNC-Libraries-data/OnTheBooks/blob/master/LICENSE). Individual images and data files associated with this module may be subject to a different license. If so, we indicate this in the module text.*