# Aggregate Sentence Splitting

This notebook runs `SentenceSplitting.ipynb` for each of the years listed below and concatenates the results into one final, all-encompassing pandas dataframe.
<br>
NOTE: The `%%capture` command at the top of some cells suppresses their output messages.

In [1]:
import pandas as pd
import sys
from tqdm import tqdm  # For printing out progress bar
import os

In [2]:
# Manually define the list of years to aggregate
# years = [1892, 1901]

# Or, read all folder names in the OCR (or a specified) directory
years = [int(name) for name in os.listdir("/Users/nitingupta/Desktop/OTB/OCRed") if not name.startswith('.')]

years

[1956, 1894, 1893, 1892, 1948, 1901]
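
The directory listing above assumes that every non-hidden entry is a folder named after a year. A slightly more defensive variant is sketched below. The helper name `list_year_dirs` is hypothetical and not part of the original notebook. It keeps only sub-directories whose names are all digits and returns them sorted, so a stray file or a non-year folder does not raise a `ValueError`.

```python
import os


def list_year_dirs(root):
    """Return the sorted integer years of the all-digit sub-directories of `root`.

    Hypothetical helper for illustration only; `root` stands in for the OCR directory.
    """
    years = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        # Skip hidden files, ordinary files, and folders that are not named after a year
        if name.isdigit() and os.path.isdir(path):
            years.append(int(name))
    return sorted(years)
```

Sorting is optional. It only makes the processing order, and therefore the row order of the final dataframe, chronological.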

In [3]:
# A dictionary to count the number of errors for all years
errorCountsAgg = {}

In [4]:
%%capture cap --no-stderr

# Create an empty dataframe
df_final = pd.DataFrame()

# Set up the progress bar
progress_bar = tqdm(total=len(years), file=sys.stderr)

# Iterate over the list
for year in years:
    
    # Update the progress bar
    progress_bar.set_description(f"Processing year {year}")

    # The %store command lets you pass variables between two different notebooks.
    # Store the year so that it can be picked up by the other notebook
    %store year

    # Run the notebook
    %run SentenceSplitting.ipynb

    # All variables, including the final dataframe,
    # should now be available in this notebook's scope.

    # Append this year's dataframe to the final dataframe
    df_final = pd.concat([df_final, df_cleaned])
    
    # Loop over this year's error counting dictionary and
    # update the overall error counting dictionary
    for key, value in errorsDict.items():
        # Add to the running total, starting from 0 for keys seen for the first time
        errorCountsAgg[key] = errorCountsAgg.get(key, 0) + value

    # Update the progress bar
    progress_bar.update(1)

# Close the progress bar
progress_bar.set_description("Processed the list")
progress_bar.close()

Processed the list: 100%|█████████████████████████████████████████████████████████████████████████| 6/6 [00:10<00:00,  1.82s/it]
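
The loop above grows `df_final` with `pd.concat` on every iteration, which copies the accumulated rows each time. A common alternative, sketched here with toy data, is to collect the per-year dataframes in a list and concatenate once at the end. The function `combine_frames` and the toy frames are hypothetical stand-ins for the `df_cleaned` objects produced by `SentenceSplitting.ipynb`.

```python
import pandas as pd


def combine_frames(frames):
    """Concatenate per-year dataframes in a single call (illustrative helper)."""
    frames = list(frames)
    if not frames:
        # Nothing to concatenate: return an empty dataframe instead of raising
        return pd.DataFrame()
    # Like the loop above, this keeps each year's original index
    return pd.concat(frames)


# Toy stand-ins for two years' worth of cleaned sentences
toy_1892 = pd.DataFrame({"id": ["1892_0000", "1892_0001"], "sentence": ["a", "b"]})
toy_1901 = pd.DataFrame({"id": ["1901_0000"], "sentence": ["c"]})

combined = combine_frames([toy_1892, toy_1901])
```

In the notebook's loop this would mean appending `df_cleaned` to a list on each iteration and calling `pd.concat` once after the loop.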


In [5]:
# Get a total count of all the errors
errorCountsAgg

{'section identifiers': 2953,
 'EOL hyphenation': 14258,
 'Approved phrases': 531,
 'Act seperators': 69,
 'Incorrect starting nums': 3911,
 'Session headers': 9,
 'Uppercased': 619}
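
As an aside, the per-year error dictionaries can also be summed with `collections.Counter` from the standard library. This is a minimal sketch, assuming each year's dictionary maps an error label to an integer count, as `errorsDict` appears to. The function name and the toy counts are made up for illustration.

```python
from collections import Counter


def merge_error_counts(per_year_counts):
    """Sum a sequence of {error label: count} dictionaries into one dictionary."""
    total = Counter()
    for counts in per_year_counts:
        # Counter.update adds counts to existing keys instead of overwriting them
        total.update(counts)
    return dict(total)


# Toy per-year dictionaries (made-up numbers, not real results)
toy_counts = [
    {"EOL hyphenation": 3, "Uppercased": 1},
    {"EOL hyphenation": 2, "Session headers": 4},
]
merged = merge_error_counts(toy_counts)
```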

In [6]:
df_final

Unnamed: 0,id,law_type,state,sentence,length,start_page,end_page
0,1956_0000,Acts,SOUTH CAROLINA,"AND JOINT RESOLUTIONS OF THE General Assembly OF THE State of South Carolina Grorce Bett TIMMERMAN, JR., Governor; Ernest F. Ho.ines, Lieutenant Governor and ex officio President of Senate; EDGAR A. Brown, President pro tempore of Senate; SoLomon Briar, Speaker of House of Representatives ; Tracy J. GaINnEs, Speaker pro tempore of House of Representatives; L. O. THomas, Clerk of the Senate; Inkz Watson, Clerk of House of Representatives.",443,00055,00055
1,1956_0001,Acts,SOUTH CAROLINA,"Passed at the regular session, which was begun and held at the city of Columbia on the tenth day of January, A.D. 1956 and was adjourned sine die on the 10th day of April, A.D., 1956 Part I GENERAL AND PERMANENT LAWS (R614, S467) No.",233,00055,00055
2,1956_0002,Acts,SOUTH CAROLINA,"An Act To Provide For The Regulation Of Traffic Upon Roads Of The United States Government Within The Confines Of Land Acquired For Use Of The Atomic Energy Commission, And To Permit Special State Constables, Appointed Under Chapter 5.1, Title 53, Code Of Laws Of South Oarolina, 1952, To Issue Official Summons Without Bond For Appearance For Trial, As Set Forth Therein.",380,00055,00056
3,1956_0003,Acts,SOUTH CAROLINA,"Whereas, it is desirable to establish permanent regulations for the control of traffic, and to provide for the enforcement of such regulations by appropriate Special State Constables within the area of the Savannah River Plant; and Whereas, it is necessary for Special State Constables, authorized to act within the area, to have the power to issue Summons to apprehended persons for appearance for trial at a future date, so that such Constables will not be required to leave their posts.",493,00056,00056
4,1956_0004,Acts,SOUTH CAROLINA,"Now, therefore, Be it enacted by the General Assembly of the State of South Carolina: Regulation of traffic at Atomic Energy Plant: SECTION 1. All the provisions of Chapter 3, Title 46, of the Code of Laws of South Carolina, 1952, except Articles 3, 11, 15, 16 and 17 of said chapter, being the Uniform Act Regulating Traffic on Highways, shall apply to all roads within the confines of lands in Aiken, Allendale and Barnwell Counties, acquired or to be acquired by the United States Government for use of the Atomic Energy Commission.",539,00056,00056
...,...,...,...,...,...,...,...
1326,1901_1326,Acts,SOUTH CAROLINA,"nds adjacent to said River; And whereas by the construction of said dam or dams the navigation of said River may be increased and the public interest promoted by the construction thereof for the purpose and for the sake of such improvement in the navigability of said River and for the public purposes to be ,fulfilled and encouraged by the construction of said dam or dams and for the purpose of removing any doubt which may arise as to the power and authority of the Secretary of State in granting the charter to the said Twin City Power Company for the erection of said dam or dams to be built across the said River: Now, Section 1. Be it enacted by the General Assembly of the State of South Carolina: That the right, power and privilege to construct and maintain a dam or dams across the Savannah River, as hereinbefore mentioned, to Twin City Power Company, its successors or assigns, shall be and is hereby fully authorized, ratified and confirmed; and that the said Twin City Power Company shall have all rights, powers and privileges conferred for the purpose of the acquisition and condemnation of land which may be overflowed by the erection or construction of said dam or dams as are conferred by Sections 1743-1755, inclusive, of the Revised Statutes of South Carolina, 1893, upon railway, canal and turnpike companies in the State and all of the Acts amendatory thereof; it being the intention of this Act for the sake of the public purposes intended to be carried out by said company to confer upon it all the rights, privileges and authorities conferred by the laws of this State upon railway, canal and turnpike companies in the acquisition and condemnation of property for rights of way or other interests in lands. Approved the 2oth day of February, A. D. 1901",1751,00291,00291
1327,1901_1327,Acts,SOUTH CAROLINA,AN ACT TO EMPOWER AND AUTHORIZE THE COUNNTY BOARD OF COMMISSIONERS OF CHEROKEE COUNTY TO BUILD A BRIDGE ACROSS BROAAD RIVER AND BORROW MONEY THEREFOR FROM THE COMMISSIONERS OF THE SINKING FUND.,243,00291,00292
1328,1901_1328,Acts,SOUTH CAROLINA,"Be it enacted by the General Assembly of the State of South Carolina: That the County Board of Commissioners of Cherokee County be, and they are hereby, authorized, if in their discretion they deem that it is for the best interest of said County, to borrow a sum of money from the Sinking Fund of the State of South Carolina, not to exceed ten thousand dollars, at a rate of interest not to exceed five per centum per annum, for the purpose of building a bridge across Broad River, in said County, at such point on said river as they may deem most practicable, and a special tax of one-half mill on the dollar may be levied on all taxable property in the County of Cherokee, provided the Board of Commissioners so decide to build said bridge, for the said period of seven years, for the purpose of repaying said loan.",836,00292,00292
1329,1901_1329,Acts,SOUTH CAROLINA,"That the proceeds of said levy of onehalf mill shall be paid each year on said loan until the seventh year, in which year the balance remaining due on said loan shall be paid from said special levy, if any remain it shall-be turned into the County Treasury for ordinary County purposes, and if a’sufficient sum has not been realized by said special levy at the expiration of said seven years the deficiency shall be paid by the County Board of Commissioners out of the ordinary County funds.",493,00292,00292


<br>

## Checking for and dropping duplicates
Duplicates are likely to exist in the sentence column. They are removed here, instead of in `SentenceSplitting.ipynb`, because the same sentence can appear in different volumes.

In [7]:
print(f"The number of dropped sentences is: {df_final[df_final.duplicated(subset=['sentence'])].shape[0]}")

The number of dropped sentences is: 831


In [8]:
df_final.drop_duplicates(subset=['sentence'], ignore_index=True, inplace=True)
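
The two cells above can be checked on a small example. The sketch below uses a made-up dataframe with the same `id` and `sentence` columns and shows that the count reported by `duplicated` equals the number of rows removed by `drop_duplicates`. Both keep the first occurrence by default.

```python
import pandas as pd

# Toy data: two sentences repeat across the "1892" and "1901" volumes
toy = pd.DataFrame(
    {
        "id": ["1892_0000", "1892_0001", "1901_0000", "1901_0001", "1901_0002"],
        "sentence": [
            "Be it enacted.",
            "Approved.",
            "Be it enacted.",
            "Approved.",
            "Section 1.",
        ],
    }
)

# Rows flagged as repeats of an earlier sentence (the first occurrence is not flagged)
n_duplicates = int(toy.duplicated(subset=["sentence"]).sum())

# Same operation as in the notebook, but returning a new dataframe instead of inplace
deduped = toy.drop_duplicates(subset=["sentence"], ignore_index=True)
```

Because `ignore_index=True` renumbers the rows from 0, the `id` column is what identifies a sentence's original year and position afterwards.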