# JES ADD and DEL Tag Search with BeautifulSoup

This Notebook contains code that runs through .sec files to find unprocessed \<ADD\> and \<DEL\> tags and exports them to Excel via Pandas DataFrames.

Jupyter Notebook written by Ben Fisher on 26 November 2024 <br>
**benjamin.s.fisher@usace.army.mil**

### Imports
The following packages are assumed to be installed already (to install one from within a Notebook, use *! pip install \<package\>*).

In [1]:
import os, warnings
import bs4 as bs
import lxml

import numpy as np
import pandas as pd
from pathlib import Path

##### Warning Suppression (Jupyter Notebooks only)

The following code will suppress the user warning generated by Beautiful Soup when parsing XML files with lxml: *XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?)...*

This is necessary **only** when running the script in a Notebook context. You can skip this cell if the script is run elsewhere. Note that `warnings.filterwarnings('ignore')` silences every warning, not just this one.

In [2]:
warnings.filterwarnings('ignore')
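
Because the blanket filter above hides all warnings, a narrower filter may be preferable. The following is a minimal sketch that uses only the standard `warnings` module and matches on the warning text quoted above; the function name `suppress_xml_parser_warning` is hypothetical, and the regular expression assumes the message wording does not change between Beautiful Soup versions.

```python
import warnings

def suppress_xml_parser_warning():
    """Ignore only warnings whose text matches the XML-parsed-as-HTML notice."""
    # The message argument is a regular expression matched against the start
    # of the warning text, so the leading '.*' lets the phrase appear after
    # the opening words of the message.
    warnings.filterwarnings(
        'ignore',
        message=r'.*XML document using an HTML parser',
    )

suppress_xml_parser_warning()
```

Other warnings, such as deprecation notices from pandas, remain visible with this approach.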

##### Directories
File paths are built relative to the current working directory, which is normally the folder where the Notebook (.ipynb) file is located.

In [3]:
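parent_folder = os.getcwd()
# os.path.join uses the correct path separator for the current operating system
file = os.path.join(parent_folder, 'SI Processing', '23 64 26.SEC')
before = os.path.join(parent_folder, 'SI Processing', '23 64 26_unprocessed.SEC')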

In [4]:
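def clean_file(file):
    """Remove every 0x81 byte from `file`, overwriting the file in place."""
    find_text = b'\x81'
    replace_text = b''

    # Read in binary mode so the byte is matched exactly, independent of text encoding
    with open(file, 'rb') as f:
        content = f.read()

    content = content.replace(find_text, replace_text)

    with open(file, 'wb') as f:
        f.write(content)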

In [5]:
clean_file(file)
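
Since `clean_file` overwrites its input, it can be useful to keep the original and see how many bytes were stripped. The following is a minimal sketch of a non-destructive variant; the name `clean_copy` and its parameters are hypothetical and not part of the original Notebook.

```python
from pathlib import Path

def clean_copy(src, dst, bad_byte=b'\x81'):
    """Write a copy of src to dst with bad_byte removed.

    Hypothetical helper: returns the number of bytes removed and
    leaves the source file untouched.
    """
    content = Path(src).read_bytes()
    removed = content.count(bad_byte)  # occurrences of the unwanted byte
    Path(dst).write_bytes(content.replace(bad_byte, b''))
    return removed
```

A return value of zero would indicate that the file contained no `0x81` bytes and needed no cleaning.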

### Read Files for Unprocessed Changes
Read through the files to check for any \<ADD\> or \<DEL\> tags. Leftover \<DEL\> tags have to be cleaned out because they interfere with later processing, and examples of their inconsistent application have been found in several sections.

In [6]:
with open(before, 'r') as doc:
    soup = bs.BeautifulSoup(doc.read(), 'lxml')
    add_tags = soup.find_all('add')
    del_tags = soup.find_all('del')
del_tags

[<del><ref><org>ASTM INTERNATIONAL (ASTM)</org><brk></brk>
 <brk></brk>
 <rid>ASTM E84</rid><rtl>(2018a) Standard Test Method for Surface Burning 
 Characteristics of Building Materials</rtl><brk></brk>
 <brk></brk></ref>
 </del>,
 <del><org>JAPANESE INDUSTRIAL STANDARDS (JIS)</org></del>,
 <del><rtl>(2014) Hexagon Head Bolts and Hexagon Head Screws</rtl></del>,
 <del><rtl>(2010) Low-Voltage Three-Phase Squirrel-Cage 
 High-Efficiency Induction Motors (Amendment 
 1)</rtl></del>,
 <del>17</del>,
 <del> 
 (Amendment 1) </del>,
 <del><rtl>(2019) Hot Dip Zinc Coated Steel Sheet and Strip</rtl></del>,
 <del>MLIT MECHANICAL STANDARD SPECIFICATION</del>,
 <del><rtl>(2019) Public Building Construction Standard 
 Specification</rtl></del>,
 <del><org>WATER SUPPLY LAW</org><brk></brk>
 <brk></brk>
 <rid>PSCP</rid><rtl>Performance Standard Compliant Product</rtl><brk></brk>
 <brk></brk>
 <rid>CCPV</rid><rtl>(2016) Construction Code of Pressure Vessel</rtl><brk></brk>
 <brk></brk></del>,
 <del>JA

In [7]:
with open(file, 'r') as doc:
    soup = bs.BeautifulSoup(doc.read(), 'lxml')
    tags = soup.find_all('add')
tags

[]
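
An empty list here suggests that the processed file contains no remaining `<ADD>` tags. As an independent cross-check that does not rely on Beautiful Soup or lxml, the tags can also be counted with the standard library's `html.parser`. This is a minimal sketch of a swapped-in technique, not the Notebook's own method; `TagCounter` and `count_change_tags` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count opening tags by lowercase name (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <ADD> is reported as 'add'
        self.counts[tag] += 1

def count_change_tags(text):
    """Return how many <ADD> and <DEL> opening tags appear in text."""
    parser = TagCounter()
    parser.feed(text)
    parser.close()
    return {name: parser.counts[name] for name in ('add', 'del')}
```

Calling `count_change_tags` on the contents of the processed and unprocessed files should agree with the lengths of the lists returned by `find_all`; a mismatch would be worth investigating.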