# Aggregate Sentence Splitting

This notebook runs `sentence_splitting.ipynb` for each of the years listed below and concatenates the results into a single, all-encompassing pandas dataframe.
<br>
NOTE: The `%%capture` command at the top of some cells suppresses their output messages.

In [1]:
import pandas as pd
import sys
from tqdm import tqdm  # For printing out progress bar
import os

In [2]:
print(pd.__version__)

1.3.5


In [3]:
# !pip install -U pandas --user

In [4]:
# Manually define the list of years to aggregate
# years = [1892, 1901]

# Or, read all folder names in the OCR (or a specified) directory
years = [name for name in os.listdir("/work/otb-lab/OCRed") if not name.startswith('.')]

print(years)

['1906', '1953', '1913', '1968', '1880', '1928', '1954', '1914', '1929', '1940', '1900', '1886-1887', '1896', '1915', '1962', '1901', '1897', '1916', '1963', '1890', '1938', '1902', '1964', '1924', '1873-1874', '1891', '1939', '1950', '1910', '1965', '1925', '1892', '1911', '1926', '1893', '1912', '1869-1870', '1927', '1934', '1949', '1960', '1920', '1935', '1961', '1921', '1936', '1922', '1875-76', '1878', '1937', '1959', '1923', '1930', '1879', '1945', '1931', '1946', '1958b', '1873', '1932', '1881-82', '1888', '1947', '1877-78', '1907', '1874', '1933', '1889', '1948', '1908', '1955', '1872-1873', '1868-69', '1941', '1909', '1956', '1883', '1942', '1898', '1957', '1917', '1870-1871', '1884', '1943', '1903', '1899', '1958', '1918', '1871-1872', '1885', '1944', '1904', '1951', '1871', '1919', '1966', '1905', '1952', '1967', '1894']
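
`os.listdir` returns entries in arbitrary order, which is why the years above are not chronological. If a chronological order is wanted, the folder names can be sorted by their leading four-digit year. The sketch below is only an illustration: the helper `leading_year` is hypothetical and not part of the notebook, and it assumes every folder name starts with a four-digit year (as in `'1886-1887'` or `'1958b'`).

```python
import re


def leading_year(name: str) -> int:
    """Return the leading four-digit year of a folder name (hypothetical helper)."""
    match = re.match(r"\d{4}", name)
    if match is None:
        raise ValueError(f"No leading year in folder name: {name!r}")
    return int(match.group())


# A few folder names taken from the listing above
sample_names = ['1906', '1958b', '1886-1887', '1958', '1875-76', '1868-69']

# Sort by leading year, breaking ties by the full folder name
sorted_names = sorted(sample_names, key=lambda n: (leading_year(n), n))
print(sorted_names)
```

The order does not affect the aggregated result beyond row order, so this step is optional.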


In [5]:
# A dictionary to count the number of errors for all years
errorCountsAgg = {}

In [6]:
%%capture cap --no-stderr

# Create an empty list to collect each year's dataframe
df_final = []

# Set up the progress bar
progress_bar = tqdm(total=len(years), file=sys.stderr)

# Iterate over the list
for year in years:
    
    # Update the progress bar
    progress_bar.set_description(f"Processing year {year}")

    # The %store command lets you pass variables between two different notebooks.
    # Store the year so that it can be picked up by the other notebook
    %store year

    # Run the faster notebook, since the outputs are not shown here anyways
    # %run SentenceSplitting.ipynb
    %run sentence_splitting.ipynb

    # All variables, including the final dataframe,
    # should now be available in this notebook's scope.

    # Append this year's dataframe to the list
    df_final.append(df_cleaned)
    
    # Loop over this year's error counting dictionary and
    # update the overall error counting dictionary
    for key, value in errorsDict.items():
        try:  # If the key already exists, add the new count to it
            errorCountsAgg[key] += value
        except KeyError:  # Otherwise, initialize the key with this count
            errorCountsAgg[key] = value

    # Update the progress bar
    progress_bar.update(1)

# Close the progress bar
progress_bar.set_description("Processed the list")
progress_bar.close()

# Concatenate the list of per-year dataframes into one dataframe
df_final = pd.concat(df_final, ignore_index=True)

Processed the list: 100%|██████████| 100/100 [07:21<00:00,  4.41s/it]      
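
The `try`/`except KeyError` accumulation in the loop above can also be written with `collections.Counter`, whose `update` method adds counts key by key. This is a minimal sketch of an alternative, not the notebook's own code: `per_year_errors` holds made-up counts shaped like `errorsDict`.

```python
from collections import Counter

# Hypothetical per-year error dictionaries, shaped like errorsDict
per_year_errors = [
    {"EOL hyphenation": 3, "Session headers": 1},
    {"EOL hyphenation": 2, "Approved phrases": 4},
]

totals = Counter()
for errors in per_year_errors:
    # Counter.update adds the counts; keys not seen before start at 0
    totals.update(errors)

print(dict(totals))
```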


In [7]:
# Get a total count of all the errors
errorCountsAgg

{'section identifiers': 104173,
 'EOL hyphenation': 338137,
 'Approved phrases': 8540,
 'Act seperators': 3753,
 'Incorrect starting nums': 172762,
 'Session headers': 135}

In [8]:
df_final

Unnamed: 0,id,law_type,state,sentence,length,start_page,end_page
0,1906_0000,Acts,SOUTH CAROLINA,Acts and]oint Resolutions OF THE General Assem...,279,00395,00395
1,1906_0001,Acts,SOUTH CAROLINA,"Joun T. Stoan, LieutenantGovernor and ex offic...",73,00395,00395
2,1906_0002,Acts,SOUTH CAROLINA,"M. L. SmirH, Speaker of the House of Represent...",53,00395,00395
3,1906_0003,Acts,SOUTH CAROLINA,"RosERT R. HEMPHILL, Clerk of the Senate.",40,00395,00395
4,1906_0004,Acts,SOUTH CAROLINA,"T. C. Hamer, Clerk of the House of Representat...",51,00395,00395
...,...,...,...,...,...,...,...
467404,1894_2072,Acts,SOUTH CAROLINA,But no such grant shall be made for a longer p...,70,479,479
467405,1894_2073,Acts,SOUTH CAROLINA,That this Act shall take effect from and after...,706,479,479
467406,1894_2074,Acts,SOUTH CAROLINA,"That this Act is a public Act, and shall conti...",202,479,479
467407,1894_2075,Acts,SOUTH CAROLINA,A JOINT RESOLUTION TO PRROVIPE FOR LOCATING TH...,133,479,479


<br>

## Checking for and dropping duplicates
Duplicates are very likely to exist in the sentence column. They are removed here, instead of in `SentenceSplitting.ipynb`, because duplicates can occur across different volumes.
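
As a small illustration of how the two pandas calls used in this section relate, the sketch below builds a made-up dataframe (`toy`, not real data) in which one sentence appears in two different years. `duplicated` flags every repeat after the first occurrence, and `drop_duplicates` removes exactly those rows.

```python
import pandas as pd

# Made-up rows: the same sentence appears in two different years
toy = pd.DataFrame({
    "id": ["1906_0000", "1906_0001", "1894_0000", "1894_0001"],
    "sentence": [
        "That this Act shall take effect.",
        "A unique sentence.",
        "That this Act shall take effect.",
        "Another unique sentence.",
    ],
})

# Count rows whose sentence already appeared earlier (first occurrence is kept)
n_duplicates = int(toy.duplicated(subset=["sentence"]).sum())

# Drop those rows and renumber the index from 0
deduped = toy.drop_duplicates(subset=["sentence"], ignore_index=True)

print(n_duplicates, len(toy), len(deduped))
```

The same relationship holds for the outputs in this notebook: 467,408 rows minus 91,415 duplicates leaves 375,993 rows, which matches the last index (375992) of the deduplicated dataframe.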

In [9]:
print(f"The number of dropped sentences is {df_final[df_final.duplicated(subset=['sentence'])].shape[0]}")

The number of dropped sentences is 91415


In [10]:
df_dropped = df_final.drop_duplicates(subset=['sentence'], ignore_index=True)

In [11]:
df_dropped

Unnamed: 0,id,law_type,state,sentence,length,start_page,end_page
0,1906_0000,Acts,SOUTH CAROLINA,Acts and]oint Resolutions OF THE General Assem...,279,00395,00395
1,1906_0001,Acts,SOUTH CAROLINA,"Joun T. Stoan, LieutenantGovernor and ex offic...",73,00395,00395
2,1906_0002,Acts,SOUTH CAROLINA,"M. L. SmirH, Speaker of the House of Represent...",53,00395,00395
3,1906_0003,Acts,SOUTH CAROLINA,"RosERT R. HEMPHILL, Clerk of the Senate.",40,00395,00395
4,1906_0004,Acts,SOUTH CAROLINA,"T. C. Hamer, Clerk of the House of Representat...",51,00395,00395
...,...,...,...,...,...,...,...
375989,1894_2072,Acts,SOUTH CAROLINA,But no such grant shall be made for a longer p...,70,479,479
375990,1894_2073,Acts,SOUTH CAROLINA,That this Act shall take effect from and after...,706,479,479
375991,1894_2074,Acts,SOUTH CAROLINA,"That this Act is a public Act, and shall conti...",202,479,479
375992,1894_2075,Acts,SOUTH CAROLINA,A JOINT RESOLUTION TO PRROVIPE FOR LOCATING TH...,133,479,479


<br>

## Exporting

In [15]:
df_dropped.to_csv("SC_acts.csv", index=False)

<br>

## Selective random sampling
Select 100 random sentences for each year
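
One possible way to draw such a sample is sketched below, under stated assumptions: the year is taken to be the part of `id` before the first underscore, and the function `sample_per_year` and the `toy` data are hypothetical, not the notebook's own code. The rows are shuffled once and the first `n` rows of each year group are kept, so years with fewer than `n` sentences contribute all of their rows instead of raising an error.

```python
import pandas as pd


def sample_per_year(df: pd.DataFrame, n: int = 100, seed: int = 0) -> pd.DataFrame:
    """Return up to n randomly chosen rows per year (hypothetical helper)."""
    # Shuffle all rows reproducibly
    shuffled = df.sample(frac=1, random_state=seed)
    # Assumption: ids look like '<year>_<number>', e.g. '1906_0000'
    years = shuffled["id"].str.split("_", n=1).str[0].rename("year")
    # Keep the first n shuffled rows of each year
    return shuffled.groupby(years).head(n)


# Made-up data: five sentences for 1906 and two for 1894
toy = pd.DataFrame({
    "id": [f"1906_{i:04d}" for i in range(5)] + [f"1894_{i:04d}" for i in range(2)],
    "sentence": [f"sentence {i}" for i in range(7)],
})

sampled = sample_per_year(toy, n=3)
print(sampled["id"].str[:4].value_counts().to_dict())
```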