# Sentence Splitting
This notebook takes the OCRed text for a volume year, splits it into sentences with NLTK's Punkt tokenizer, and then cleans the sentences using regular expression pattern matching.<br>
To run this notebook, you need an OCRed output folder containing a .txt file, a .tsv file, and an images sub-folder (more details in the notebook).

<b>Note:</b>
- If the Acts and Joints were mixed for the chosen year, the OCRed output will contain `{year}_Both.txt` and `{year}_Both_data.tsv`
- If the Acts and Joints were separate for the chosen year, the OCRed output will contain `{year}_Acts.txt` and `{year}_Acts_data.tsv`
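
The file-name convention above can be checked before any processing starts. The following is a minimal sketch, not part of the original notebook: the helper name `find_ocr_text` is hypothetical, and a temporary directory stands in for the real OCRed folder.

```python
from pathlib import Path
import tempfile


def find_ocr_text(dir_ocr, year):
    """Return (path, acts_sep) for the OCRed text file of `year`.

    Hypothetical helper: acts_sep is True when `{year}_Acts.txt` exists
    (Acts and Joints separate) and False when only `{year}_Both.txt` does.
    """
    base = Path(dir_ocr)
    acts = base / f"{year}_Acts.txt"
    both = base / f"{year}_Both.txt"
    if acts.is_file():
        return acts, True
    if both.is_file():
        return both, False
    raise FileNotFoundError(f"No OCRed text for {year} in {base}")


# Usage with a throwaway directory standing in for the OCRed folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "1892_Both.txt").write_text("page one\n\npage two", encoding="utf-8")
    path, acts_sep = find_ocr_text(tmp, 1892)
    found_name = path.name
    print(found_name, acts_sep)
```

Checking for the file explicitly, rather than relying on an exception, makes the "neither file exists" case fail with a clear message.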

In [None]:
from nltk.tokenize import PunktSentenceTokenizer
import pandas as pd
import matplotlib.pyplot as plt
import os
import numpy as np
import re

pd.set_option('display.max_colwidth', None)

<br>
Either get the year variable from elsewhere (such as when this notebook is accessed from another file) or specify the year.

In [None]:
# Get the year variable from somewhere else
%store -r year

In [None]:
# # If running this notebook independently,
# # Uncomment the following line of code
# year = 1892

In [None]:
# This is the directory that will contain the OCRed output:
dir_OCR = "/Users/nitingupta/Desktop/OTB/OCRed/" + str(year)

print(f"Working on {year} under {dir_OCR}")

In [None]:
# Try reading in "{year}_Acts.txt" if the Acts and Joints were separate for the year
try:
    acts_path = dir_OCR + "/" + str(year) + "_Acts.txt"
    with open(acts_path, 'r') as f:
        data = f.read()

    # If the read is successful, set a flag that identifies that the Acts and Joints are separate
    actsSep = True

# However, if the directory contains {year}_Both.txt instead, a FileNotFoundError will be returned for the above code.
# So, catch that error and read in "{year}_Both.txt"
except FileNotFoundError:
    acts_path = dir_OCR + "/" + str(year) + "_Both.txt"
    with open(acts_path, 'r') as f:
        data = f.read()
    
    actsSep = False  # The flag being False means that the Acts and Joints are not separate

# This variable holds all the OCRed text as a String
# data

In [None]:
print("The number of pages OCRed for {year} is: {count}".format(year = year, count = (data.count("\n\n")+1)))

<br>

## A. Training the tokenizer
Based on this [article](https://subscription.packtpub.com/book/application-development/9781782167853/1/ch01lvl1sec12/training-a-sentence-tokenizer),
- NLTK's default sentence tokenizer is general purpose and usually works quite well. But sometimes it might not be the best choice for our text if it uses nonstandard punctuation or is formatted in a unique way. In such cases, training your own sentence tokenizer can result in much more accurate sentence tokenization.
- The `PunktSentenceTokenizer` class uses an unsupervised learning algorithm to learn what constitutes a sentence break.
    - The specific technique used in this case is called sentence boundary detection. It works by counting punctuation and tokens that commonly end a sentence, such as a period or a newline, then using the resulting frequencies to decide the sentence boundaries.

In [None]:
sent_tokenizer = PunktSentenceTokenizer(data)
sentences = sent_tokenizer.tokenize(data)

# A list of tokens/sentences as separated by nltk's PunktSentenceTokenizer
# sentences

<br>

## B. Creating the dataframe
Make a new dataframe with the sentences and character lengths as attributes

In [None]:
# Add to a new DataFrame
df = pd.DataFrame()
df["sentence"] = sentences

In [None]:
df["length"] = [len(sentence) for sentence in sentences]
print("Length of the initial dataframe:", df.shape[0], "\nThis is the number of tokenized sentences.")

<br>

## C. Adding page file names
- Add features that specify which page each sentence starts and ends on.
- Reading only Acts. <b> Not reading Joints </b>
- The file names are read from the directory so that pages missing from the OCR output are still accounted for in the dataframe.

In [None]:
# This is the path to the directory that contains the images.
# NOTE: This directory is inside the OCRed output for the chosen year
dir_imgs = dir_OCR + "/images"
print(f"The images directory is {dir_imgs}")

In [None]:
imgs = os.listdir(dir_imgs)
imgs = [img for img in imgs if "jpg" in img or "tiff" in img or "JPG" in img or "TIFF" in img]
imgs.sort()
print("The number of image files for this year is:", len(imgs))

In [None]:
fileType = imgs[0].split(".")[1]
print(f"The files are of type: {fileType}")

<b>Note:</b>
- The OCR attempts to separate pages by adding "\n\n". However, the count of "\n\n" in the text does not equal the total number of pages, because the OCR does not add "\n\n" after every page.
- One way around this issue is to use the `{year}_Both_data.tsv` (if Acts and Joints are mixed) or `{year}_Acts_data.tsv` (if Acts and Joints are separate) file from the OCR output.
- This file contains each word (in the second-to-last column) and the file name of the page that word came from (last column).
- Also, since we are only working with Acts, if the Acts and Joints are separate, the last word in the `df_words` dataframe will not fall on the actual last page in the images sub-folder.
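
To illustrate why the TSV is a more reliable page index than counting "\n\n", here is a small sketch using only the standard library. The sample data is made up and keeps just the `text` and `name` columns; real OCR output has more columns.

```python
import csv
import io

# Tiny stand-in for `{year}_Acts_data.tsv` (assumed layout: word, then file name)
sample_tsv = (
    "text\tname\n"
    "AN\t00001.jpg\n"
    "ACT\t00001.jpg\n"
    "\t00002.jpg\n"
    "to\t00003.jpg\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# Pages that appear anywhere in the file, in first-seen order
pages = list(dict.fromkeys(row["name"] for row in rows))

# Pages that actually contribute at least one word
pages_with_words = list(
    dict.fromkeys(row["name"] for row in rows if row["text"].strip())
)

print(len(pages), len(pages_with_words))
```

In this made-up sample, page `00002.jpg` has no words, so it would vanish from any page count derived from the text alone.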

In [None]:
# Based on whether the Acts and Joints are mixed, read the appropriate tsv file
if actsSep:
    df_words = pd.read_table(f"{dir_OCR}/{year}_Acts_data.tsv")
else:
    df_words = pd.read_table(f"{dir_OCR}/{year}_Both_data.tsv")

df_words

So, to label the page numbers in the dataframe, we can go through the original dataframe and find the start and end words in each sentence.
<br>We can then look up the page numbers for those words in `df_words` and add them to the original dataframe, `df`.
<br>To start, we need to clean the two dataframes.

In [None]:
df['page'] = np.nan

# Drop the columns which are unnecessary for our analysis
df_words.drop(columns=["left", "top", "width", "height", "conf"], inplace=True)

# Drop the rows which don't contain a word in the "text" column
df_words.dropna(inplace=True)

# Relabel the "name" column to "page" column
df_words.rename(columns={"name": "page"}, inplace=True)

# Reassign index after dropping nas
df_words = df_words.assign(row_number=range(len(df_words)))
df_words.set_index('row_number', inplace=True)

# Drop the 'page' column from the org dataframe
df.drop(columns=['page'], inplace=True)

# Add an empty 'start_page' and 'end_page' column
df['start_page'] = np.nan
df['end_page'] = np.nan

Since a word can only exist on a single page, the pages of a sentence's first and last words uniquely identify its start and end pages.
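
The alignment idea can be sketched without pandas. This is an illustration under simplifying assumptions (made-up words and pages, and sentences that split into exactly the words listed), not the notebook's implementation: a sentence of `n` words whose first word is at offset `k` has its last word at offset `k + n - 1`.

```python
# Hypothetical stand-ins: one (word, page) pair per OCRed word, in reading order
words = [
    ("An", "00001.jpg"), ("act", "00001.jpg"), ("to", "00001.jpg"),
    ("amend.", "00002.jpg"), ("Be", "00002.jpg"), ("it", "00002.jpg"),
    ("enacted.", "00002.jpg"),
]
sentences = ["An act to amend.", "Be it enacted."]

spans = []
cursor = 0  # offset of the first word of the current sentence
for sentence in sentences:
    n_words = len(sentence.split(" "))
    # Clamp both offsets so a misaligned tail cannot run past the word list
    first = min(cursor, len(words) - 1)
    last = min(cursor + n_words - 1, len(words) - 1)
    # Keep only the file stem, e.g. "00001.jpg" -> "00001"
    start_page = words[first][1].split(".")[0]
    end_page = words[last][1].split(".")[0]
    spans.append((start_page, end_page))
    cursor += n_words

print(spans)
```

Here the first sentence starts on page 00001 and ends on page 00002, while the second sits entirely on page 00002.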

In [None]:
# Tracker for df_words:
words_trkr = 0

# Loop over the original dataframe
for i in range(0, df.shape[0]):
    
    # Remove "\n\n" from the original dataframe as they will interfere with the analysis
    df.at[i, 'sentence'] = df.iloc[i]['sentence'].replace("\n\n", "")

    # For each sentence, extract the first and last word
    tmp_sentence = df.iloc[i]['sentence'].split(" ")
    start, last = tmp_sentence[0], tmp_sentence[-1]

    # Get the page number for the start and end word
    start_page = df_words.iloc[words_trkr]['page']

    try:
        # The last word of the sentence sits at offset len(tmp_sentence) - 1
        end_page = df_words.iloc[words_trkr + len(tmp_sentence) - 1]['page']
    except IndexError:
        # Fall back to the start page if the offset runs past the last word
        end_page = df_words.iloc[words_trkr]['page']
        

    # Remove the file extension from the page names:
    start_page = start_page.split(".")[0]
    end_page = end_page.split(".")[0]

    
    # Assign the page number to their respective columns in the dataframe
    df.at[i, 'start_page'] = start_page
    df.at[i, 'end_page'] = end_page
    
    # Update tracker
    words_trkr += len(tmp_sentence)

In [None]:
df.tail(10)

<br>

## D. Cleaning on character length
Get rid of sentences with a low number of characters as they might not form meaningful sentences.

However, first, get the statistics on the length column to avoid removing meaningful sentences.

In [None]:
# Get the statistics for the length column
df["length"].describe()

In [None]:
# Plot a histogram for that column
fig, axes = plt.subplots(1, 2)
df.hist(column="length", ax=axes[0])
axes[0].set_title('Character lengths')
axes[0].set(xlabel="Character length of the sentences", ylabel="Frequency")

df.hist(column="length", ax=axes[1])

axes[1].set_title('Character lengths (zoomed)')
axes[1].set(xlabel="Character length of the sentences", ylabel="Frequency", xlim = [0, 1000])
fig.tight_layout()

Define a cutoff for the sentences. All sentences below this length will be removed.

In [None]:
cut_len = 50

Create a smaller dataframe, and export it to csv, that only contains the short length sentences.
Check the csv and change the length condition accordingly.

In [None]:
# # Uncomment the following line of code to create the csv which contains the short length sentences.

# testing_df = df[df['length'] < cut_len]
# testing_df.to_csv(f"{year}_len_testing.csv", index=False)

<br>
Once the cutoff is decided, create a new dataframe that keeps only the sentences at or above that length.

In [None]:
# Keep sentences at or above the cutoff; .copy() avoids modifying a view of df later
df_reduced = df[df["length"] >= cut_len].copy()
print("Length of the cleaned dataframe: ", df_reduced.shape[0])
print("Reduction of about {:.2f}%".format( (1 - df_reduced.shape[0]/df.shape[0]) * 100))

In [None]:
df_reduced.reset_index(drop=True, inplace=True)
df_reduced.index.name = "index"

<br>

## E. Regex Matching
Remove unnecessary words, which do not contribute to the overall meaning, from the sentences.

In [None]:
# New dataframe so that the results of the matching can be compared
df_cleaned = df_reduced.copy()

# A new dictionary to keep track of the number of errors
errorsDict = {}

In [None]:
# Create a new column that will contain the removed words that match the section pattern
df_cleaned['removed'] = np.nan

# Rename 'sentence' column to 'org_sent' to avoid confusion
df_cleaned.rename(columns={'sentence': 'org_sent'}, inplace=True)

In [None]:
df_cleaned.head()

In [None]:
def replaceInDF(rgx_match: re.Pattern, df: pd.DataFrame, prevAppend: bool):
    '''
    Find the provided regex pattern in the provided dataframe, record the
    matches, and remove them from the sentences.
    
    Parameters
    ----------
    rgx_match : re.Pattern
        A regular expression pattern that will be searched for and removed in the df
    df: pandas.DataFrame
        A pandas dataframe to search in
        Should contain:
            an 'org_sent' column, from which the matches will be removed
            a 'removed' column, in which the matched strings will be stored
    prevAppend: bool
        A flag for whether the first match should be appended to the end of the previous sentence
        
    Returns
    -------
    A tuple consisting of:
    
    df: pandas.Dataframe
        The modified Dataframe with the matches performed
    errorCount: int
        A count of how many times this error was found.
    '''
    
    errorCount = 0
    
    for i in range(0, df.shape[0]):
    
        # The value at this row's "removed" column
        removed_val = df.iloc[i]['removed']
        
        # The found matches
        found = [x.group() for x in re.finditer(rgx_match, df.iloc[i]['org_sent'])]
        matches = "; ".join(found)
        
        # if no match found...
        if not found:
            continue

        # Else if match is found...
        
        # Update the counter for the error with the number of matches found
        # (counting the list avoids miscounts when a match itself contains ";")
        errorCount += len(found)
        
        # Check if there is already a value in the 'removed' column for that row
        if removed_val != "" and not pd.isnull(removed_val):
            # Append the matches to the existing value separated by ";"
            df.at[i, 'removed'] = str(removed_val) + "; " + matches
        else:
            # Add the matched patterns to the "removed" column separated by ";"
            df.at[i, 'removed'] = matches
        
        if prevAppend and i != 0:
            
            m = re.search(rgx_match, df.iloc[i]['org_sent'])
            if m:
                # Add to the end of the previous sentence
                df.at[i-1, 'org_sent'] = df.iloc[i-1]['org_sent'] + " " + str(m.group())
            
        # Remove the matched patterns from sentences
        df.at[i, 'org_sent'] = re.sub(rgx_match, '', df.iloc[i]['org_sent'])    
        
    return df, errorCount

<br>

### 1. Removing section identifiers
The following code implements regex patterns to identify sections, such as "Section 1.", "Sec. 4.", etc. 
<br>Since most section identifiers that need to be removed appear either at the start or the end of the OCRed sentence, the pattern finds matches either at the start or the end of the sentence.
<br>Do note that the same pattern is repeated for the start and end of the sentence, and the two copies are separated by '|'.

Some notes about the pattern:
- `r'(S|s|E|e|C|c|T|t|I|i|O|o|N|n){2,}'` matches a run of two or more letters drawn from "section" in either case, which covers "Section", "SEC", and similar variants
- `r'(\.|,|:|;| ){0,2}'` matches up to two mistaken delimiters or spaces following "Section"
- `r'[0Oo1Iil!2Z5S6G\d]{1,2}'` matches the section number. Letters are included in this pattern to account for OCR mistakes
- `r'(. |.| |)'` matches the trailing period and space, if present. The `.` is unescaped, so it matches any single character, not only a period
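
The pieces above can be tried on a few strings. This sketch assembles the same pattern from its parts and applies it with `re.sub`; the sample sentences are made up for illustration.

```python
import re

letters = r"(S|s|E|e|C|c|T|t|I|i|O|o|N|n){2,}"
delims = r"(\.|,|:|;| ){0,2}"
number = r"[0Oo1Iil!2Z5S6G\d]{1,2}"
tail = r"(. |.| |)"
core = letters + delims + number + tail

# Same structure as the notebook's pattern: anchored at the start OR at the end
section_rgx = re.compile("^" + core + "|" + core + "$")

samples = [
    "Section 1. That the county shall levy a tax.",
    "SEC. l. Be it enacted.",  # OCR read the digit 1 as the letter l
    "This Act shall take effect immediately. Sec. 2.",
]

# strip() only tidies the space left behind by a trailing match
cleaned = [section_rgx.sub("", s).strip() for s in samples]
for line in cleaned:
    print(line)
```

Because the letter class is broad, ordinary words built from those letters can also match near the end of a sentence, so it is worth spot-checking the `removed` column after running this step.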

In [None]:
rgx_match = re.compile(r"^(S|s|E|e|C|c|T|t|I|i|O|o|N|n){2,}(\.|,|:|;| ){0,2}[0Oo1Iil!2Z5S6G\d]{1,2}(. |.| |)|(S|s|E|e|C|c|T|t|I|i|O|o|N|n){2,}(\.|,|:|;| ){0,2}[0Oo1Iil!2Z5S6G\d]{1,2}(. |.| |)$")

df_cleaned, errorsDict['section identifiers'] = replaceInDF(rgx_match, df_cleaned, False)

In [None]:
df_cleaned.head(10)

<br>

### 2. Removing end of line hyphenation
Whenever a word in the sentence continues from the end of a line to the beginning of the next line and is joined by a hyphen, the OCRed sentence also contains that hyphen and a space.
<br>For example, 'Commander-in-Chief' is OCRed as 'Com- mander-in-Chief'
<br>The following code implements regex patterns to remove "- " in the text since each hyphenated word is split with "- ".
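
A quick illustration of the substitution on made-up strings, including one case where removing every "- " is too eager:

```python
import re

hyphen_rgx = re.compile("[-][ ]")

# Line-break hyphenation is repaired, while in-word hyphens are kept
fixed = hyphen_rgx.sub("", "the Com- mander-in-Chief shall ap- point")
print(fixed)

# Caveat: a deliberate hyphen followed by a space is removed as well
broken = hyphen_rgx.sub("", "pre- and post-war")
print(broken)
```

The second call shows the trade-off of this simple rule: suspended hyphens such as "pre- and" are merged too.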

In [None]:
rgx_match = re.compile('[-][ ]')
df_cleaned, errorsDict['EOL hyphenation'] = replaceInDF(rgx_match, df_cleaned, False)

In [None]:
df_cleaned.head(10)

<br>

### 3. Relocating incorrect "Approved ..." phrases
The “Approved…” phrases are incorrectly attached to the start of the next law. They should be appended to the end of the previous law.
<br>Phrases might be of the format: 
- "Approved the 2oth day of February, A. D. 1901",
- "Approved December 15th, A. D. 1892.",
- "Approved December O5th, A. D. 1892.",
- "Approved December !2th, A. D. 1892.",
- "Approved December 6Gth, A. D. 1892.",
- "Approved December 05th, A. D. 1892.",

Since phrases might have either the month or the date after the "Approved" sub-string, the code below uses two patterns to account for both cases, separated by '|'.
<br>
Some notes about the pattern:
- `r'[0Oo1Iil!2Z5S6G\d]{1,2}'` matches the date. Letters are included in this pattern to account for OCR mistakes
- `r'(?:th|st|nd|rd)'` matches the ordinal suffixes for the dates
- `r'[A-Z][a-z]+'` matches the month
- `r'(?:18\d{2}|19\d{2}|2\d{3})'` matches years ranging from 18** to 2***
- `r'(. |.| |)'` matches the trailing period and space, if present. The `.` is unescaped, so it matches any single character, not only a period
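
The relocation step can be sketched on a plain list of sentences. This is a simplified illustration with made-up sentences, not the notebook's dataframe code; unlike `replaceInDF`, it strips the trailing space from the phrase before appending it.

```python
import re

num = r"[0Oo1Iil!2Z5S6G\d]{1,2}"
suffix = r"(?:th|st|nd|rd)"
year_rgx = r"(?:18\d{2}|19\d{2}|2\d{3})"
tail = r"(. |.| |)\b"

day_first = r"^Approved the " + num + suffix + r" day of [A-Z][a-z]+, A\. D\. " + year_rgx + tail
month_first = r"Approved [A-Z][a-z]+ " + num + suffix + r", A\. D\. " + year_rgx + tail
approved_rgx = re.compile(day_first + "|" + month_first)

sentences = [
    "AN ACT to incorporate the town of Example.",
    "Approved December 15th, A. D. 1892. AN ACT to amend the charter.",
]

for i in range(1, len(sentences)):
    m = approved_rgx.search(sentences[i])
    if m:
        # Move the phrase to the end of the previous sentence...
        sentences[i - 1] = sentences[i - 1] + " " + m.group().strip()
        # ...and remove it from the sentence it was wrongly attached to
        sentences[i] = approved_rgx.sub("", sentences[i])

for s in sentences:
    print(s)
```

After the loop, the approval date closes the first law and the second sentence begins directly with "AN ACT".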

In [None]:
rgx_match = re.compile(
    r'^Approved the [0Oo1Iil!2Z5S6G\d]{1,2}(?:th|st|nd|rd) day of [A-Z][a-z]+, A\. D\. (?:18\d{2}|19\d{2}|2\d{3})(. |.| |)\b|Approved [A-Z][a-z]+ [0Oo1Iil!2Z5S6G\d]{1,2}(?:th|st|nd|rd), A\. D\. (?:18\d{2}|19\d{2}|2\d{3})(. |.| |)\b')
df_cleaned, errorsDict['Approved phrases'] = replaceInDF(rgx_match, df_cleaned, True)

In [None]:
df_cleaned.head(10)

<br>

### 4. Removing Act separators
The horizontal lines differentiating one Act from another show up as U+2014 : EM DASH characters (one or multiple) in the OCR.
<br>For example, '——- —— AN ACT...' or '—— AN ACT...'

Some notes about the pattern:
- `r'^—+'` matches one or more consecutive "—" characters at the start of the sentence (no `re.MULTILINE` flag is set, so `^` anchors only at the start of the string).
- `r'(?=\s*[A-Za-z])'` is a positive lookahead (it is not part of the match) that checks that the "—" characters are followed by zero or more whitespace characters (`\s*`) and then a letter (`[A-Za-z]`).
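
A short check of the pattern on the two example strings. It also shows that the broken separator '——- ——' only matches once the "- " sequence has been removed, which the hyphenation step earlier in this notebook already does.

```python
import re

sep_rgx = re.compile(r"^—+(?=\s*[A-Za-z])")

samples = ["—— AN ACT to amend", "——- —— AN ACT to amend"]

# Applied directly: the second string is left unchanged because the
# hyphen breaks the run of em dashes before any letter is reached
direct = [sep_rgx.sub("", s) for s in samples]
print(direct)

# After removing "- " first, the dashes form one run and are stripped
after_hyphen_fix = sep_rgx.sub("", re.sub("[-][ ]", "", samples[1]))
print(repr(after_hyphen_fix))
```

The lookahead leaves the whitespace before "AN" in place, so the cleaned sentence keeps a leading space.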

In [None]:
rgx_match = re.compile(r'^—+(?=\s*[A-Za-z])')
df_cleaned, errorsDict['Act seperators'] = replaceInDF(rgx_match, df_cleaned, False)

In [None]:
df_cleaned.head()

<br>

### 5. Removing incorrect numbers at the start
Some numbers are incorrectly left at the start of the sentence by the OCR process, for example as "2," or "2.".

Some notes about the pattern:
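- `r'^[0Oo1Iil!2Z5S6G\d]{1,3}'` matches up to three digits (or letters that the OCR commonly confuses with digits) at the start of the string
- `r'(. |.| |)'` matches the trailing delimiter and space, if present. The `.` is unescaped, so it matches any single character, which is how "2," is caught as well as "2."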
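
Because the character class contains letters and the `.` matches any character, this pattern can also strip real words. The sketch below compares it with a stricter variant; `strict_rgx` is a hypothetical alternative that requires a period or comma plus a space, not the pattern used in this notebook.

```python
import re

loose_rgx = re.compile(r"^[0Oo1Iil!2Z5S6G\d]{1,3}(. |.| |)")
# Assumption: a stricter variant that insists on "." or "," followed by a space
strict_rgx = re.compile(r"^[0Oo1Iil!2Z5S6G\d]{1,3}[.,] ")

samples = ["2. That the board", "2, That the board", "It shall be lawful"]

loose = [loose_rgx.sub("", s) for s in samples]
strict = [strict_rgx.sub("", s) for s in samples]

print(loose)
print(strict)
```

With the loose pattern, "It " is removed from the third sample because "I" is in the digit look-alike class; the stricter variant leaves it alone, at the cost of missing stray numbers that have no punctuation after them.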

In [None]:
rgx_match = re.compile(r'^[0Oo1Iil!2Z5S6G\d]{1,3}(. |.| |)')
df_cleaned, errorsDict['Incorrect starting nums'] = replaceInDF(rgx_match, df_cleaned, False)

In [None]:
df_cleaned.head()

In [None]:
errorsDict

<br>

## F. Adding features
The features added below are:
- an id: a concatenation of the year and index number
- whether the sentence is an Act or a Joint
- the state that the law originates from

### 1. Adding ID

In [None]:
def addPrefix(fileName: str, nameLen: int, fileType: bool) -> str:
    '''
    Since the fileNames from the excel parsing could be of any length
    (ranging from 1-3), this function prepends a string of 0's to the 
    start of the input so that the name is the specified nameLen characters long.
    
    Parameters
    ----------
    fileName : str
        The file name that needs to be prefixed
    nameLen : int
        The length of the expected name of the file
        Ex. '00034.jpg' would have length of 5
        so nameLen should be 5
    fileType: bool
        True if the fileName contains a file type suffix such as '.tiff'

    Returns
    -------
    str
        The file name prefixed with 0's so that the name (excluding any file type) is nameLen characters long
    '''
    
    # Remove the file type
    if fileType:
        name = fileName.split(".")[0]
    else:
        name = fileName

    prefix_length = nameLen - len(name)
    prefix = "0" * prefix_length
    
    final_string = prefix + fileName
    return final_string

In [None]:
df_cleaned.reset_index(inplace=True)
df_cleaned.rename(columns={"index" : "id"}, inplace=True)
# df_cleaned.set_index('id', inplace=True)

In [None]:
# The length of the id of the last row in the dataframe, which is used to assess how many 0's will be prefixed to the other ids
maxNumLength = len(str(df_cleaned.last_valid_index()))

for i in range(0, df_cleaned.shape[0]):
    df_cleaned.at[i, 'id'] = str(year) + "_" + addPrefix(str(df_cleaned.iloc[i]['id']), maxNumLength, fileType=False)
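
For the case without a file type, the same padding can be expressed with Python's built-in `str.zfill`. A minimal sketch with made-up values (`example_year` and `n_rows` are illustrative, not taken from the data):

```python
example_year = 1892
n_rows = 120

# Digits in the largest index (119), mirroring maxNumLength above
width = len(str(n_rows - 1))

ids = [f"{example_year}_{str(i).zfill(width)}" for i in range(n_rows)]
print(ids[0], ids[7], ids[-1])
```

Zero-padding to a common width keeps the ids sortable as plain strings.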

In [None]:
df_cleaned

<br>

### 2. Adding the remaining identifiers

In [None]:
df_cleaned.insert(1, 'law_type', 'Acts')
df_cleaned.insert(2, 'state', 'SOUTH CAROLINA')

In [None]:
df_cleaned

<br>

## G. Exporting

In [None]:
# Drop the 'removed' column
df_cleaned.drop(['removed'], axis = 1, inplace=True)

# Rename the 'org_sent' column
df_cleaned.rename(columns={"org_sent": "sentence"}, inplace=True)

df_cleaned

In [None]:
# Export the final dataframe to csv for viewing
# df_cleaned.to_csv(f"{year}.csv", index=False)