## Data Cleaning for SAP txt data

- More details can be found in my article: https://medium.com/@zaishanweng/data-cleaning-on-sap-data-extracts-with-regex-and-python-cd2afe73be14

In [12]:
import pandas as pd
import re

In [13]:
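def make_data_format(datalist, column_length):
    """Pad or truncate each string to its column width and wrap the row in pipes."""
    output_list = []
    for string, length in zip(datalist, column_length):
        # Pad with spaces, then cut to exactly `length` characters
        output_string = (string + " " * length)[0:length]
        output_list.append(output_string)
    return "|" + "|".join(output_list) + "|"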
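The helper above is not called anywhere else in this notebook; it appears to be a utility for producing rows in the same fixed-width, pipe-delimited layout as the sample file. The sketch below repeats the function so it runs on its own and feeds it illustrative values; the `widths` list is an assumption taken from the sample header (10, 4, 3, 10, 20).

```python
def make_data_format(datalist, column_length):
    """Pad or truncate each string to its column width and wrap the row in pipes."""
    output_list = []
    for string, length in zip(datalist, column_length):
        output_list.append((string + " " * length)[0:length])
    return "|" + "|".join(output_list) + "|"


# Assumed widths: DocumentNo, Year, Itm, Clrng doc., Text
widths = [10, 4, 3, 10, 20]

row = make_data_format(["10002001", "2022", "001", "60007000", "Item 1"], widths)
print(row)

# Values longer than the column width are cut off rather than widening the column
truncated = make_data_format(["123456789012345"], [10])
print(truncated)
```

Because every value is forced to its column width, each generated row has the same total length, which is what the fixed-width regex further down relies on.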

In [14]:
with open("data/Sample SAP Format.txt", encoding="utf-8") as f:
    content_raw = f.read()

In [19]:
content_raw

'DD.MM.YYYY           Dynamic List Display           1\n-----------------------------------------------------\n-----------------------------------------------------\n|DocumentNo|Year|Itm|Clrng doc.|Text                |\n-----------------------------------------------------\n|10002001  |2022|001|60007000  |Item 1              |\n|10002002  |2022|001|60007001  |Item A|B|C          |\n|10002004  |2022|001|60007006  |Item Z    \n         |\n|10002003  |2022|002|60007005  |Item A|ID:01        |\n-----------------------------------------------------\n'

In [15]:
# Alternative: read the file with CRLF (carriage return + line feed) line endings left intact
with open("data/Sample SAP Format.txt", encoding="utf-8", newline="") as f:
    content_raw_2 = f.read()

In [16]:
content_raw_2

'DD.MM.YYYY           Dynamic List Display           1\r\n-----------------------------------------------------\r\n-----------------------------------------------------\r\n|DocumentNo|Year|Itm|Clrng doc.|Text                |\r\n-----------------------------------------------------\r\n|10002001  |2022|001|60007000  |Item 1              |\r\n|10002002  |2022|001|60007001  |Item A|B|C          |\r\n|10002004  |2022|001|60007006  |Item Z    \r\n         |\r\n|10002003  |2022|002|60007005  |Item A|ID:01        |\r\n-----------------------------------------------------\r\n'

In [17]:
re.findall(r"\r\n", content_raw)

[]

In [18]:
re.findall(r"\r\n", content_raw_2)

['\r\n',
 '\r\n',
 '\r\n',
 '\r\n',
 '\r\n',
 '\r\n',
 '\r\n',
 '\r\n',
 '\r\n',
 '\r\n',
 '\r\n']
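The difference between the two reads comes from Python's newline translation: with the default `newline=None`, `\r\n` is converted to `\n` on reading, while `newline=""` returns the line endings untranslated. A minimal sketch on a throwaway temporary file (the file content here is made up for illustration, not the SAP sample):

```python
import os
import tempfile

# Write a tiny two-line file with Windows-style CRLF line endings.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"|A|\r\n|B|\r\n")
    path = tmp.name

try:
    # Default mode: universal newlines, "\r\n" is translated to "\n"
    with open(path, encoding="utf-8") as f:
        translated = f.read()
    # newline="": line endings are passed through unchanged
    with open(path, encoding="utf-8", newline="") as f:
        untranslated = f.read()
finally:
    os.remove(path)

print(repr(translated))
print(repr(untranslated))
```

The rest of the notebook works on `content_raw`, so the patterns only need to deal with `\n`.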

In [4]:
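# Replace newlines inside a field with a space; a newline is kept only after "|", "-" or the title's trailing "1" and before "|" or "-".
new_line_pattern = re.compile(r"([^1|-])\n(.)|(.)\n([^|-])")
# Groups 3 and 4 keep the characters of the second alternative (unmatched groups expand to "").
content_cleaned_newline = new_line_pattern.sub(r"\1\3 \2\4", content_raw)
content_split_line = content_cleaned_newline.split("\n")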

In [5]:
content_split_line

['DD.MM.YYYY           Dynamic List Display           1',
 '-----------------------------------------------------',
 '-----------------------------------------------------',
 '|DocumentNo|Year|Itm|Clrng doc.|Text                |',
 '-----------------------------------------------------',
 '|10002001  |2022|001|60007000  |Item 1              |',
 '|10002002  |2022|001|60007001  |Item A|B|C          |',
 '|10002004  |2022|001|60007006  |Item Z              |',
 '|10002003  |2022|002|60007005  |Item A|ID:01        |',
 '-----------------------------------------------------',
 '']
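To see the newline repair in isolation, here is a small sketch on a made-up three-line sample. It uses a lookaround variant of the idea rather than the exact pattern from the cell above, so only the newline itself is replaced and the neighbouring characters are never consumed; `stray_newline` and `sample` are names introduced here for illustration.

```python
import re

# A newline counts as "stray" when it is not sitting between row delimiters:
# either the character before it is not "1", "|" or "-",
# or the character after it is not "|" or "-".
stray_newline = re.compile(r"(?<=[^1|-])\n(?=.)|(?<=.)\n(?=[^|-])")

sample = (
    "|10002004  |Item Z    \n         |\n"  # record broken inside the Text field
    "|10002003  |Item A    |\n"
    "-----\n"
)

repaired = stray_newline.sub(" ", sample)
lines = repaired.split("\n")
print(lines)
```

The broken record is merged back into a single line, while the newlines that follow `|` or `-` are left alone.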

In [15]:
header_string = content_split_line[3]
print(header_string)

|DocumentNo|Year|Itm|Clrng doc.|Text                |


In [16]:
column_header = [column.strip() for column in header_string.split("|")][1:-1]

In [17]:
column_header

['DocumentNo', 'Year', 'Itm', 'Clrng doc.', 'Text']

In [18]:
list_column_width = [
    "(.{" + str(len(column)) + "})" for column in header_string.split("|")
][1:-1]

In [19]:
column_string_pattern = "[|]" + "[|]".join(list_column_width) + "[|]"

In [20]:
column_string_pattern

'[|](.{10})[|](.{4})[|](.{3})[|](.{10})[|](.{20})[|]'
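The reason for building a fixed-width pattern instead of simply splitting on `|` is that the Text column can itself contain pipes. A short comparison on one row from the sample illustrates this (the pattern is written out literally here so the snippet runs on its own):

```python
import re

# Row whose Text field contains pipe characters
row = "|10002002  |2022|001|60007001  |" + "Item A|B|C".ljust(20) + "|"

# Naive approach: splitting on every pipe also splits inside the Text field
naive_fields = row.split("|")[1:-1]

# Fixed-width approach: each group consumes exactly the width of its header column
column_pattern = re.compile(r"[|](.{10})[|](.{4})[|](.{3})[|](.{10})[|](.{20})[|]")
fixed_fields = [field.strip() for field in column_pattern.match(row).groups()]

print(len(naive_fields), naive_fields)
print(len(fixed_fields), fixed_fields)
```

The naive split yields seven fragments for a five-column row, whereas the fixed-width match returns exactly five fields with the embedded pipes preserved.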

In [21]:
column_pattern = re.compile(column_string_pattern)

In [22]:
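# Data rows start at index 5; the last two entries are the closing separator and an empty string
cleaned_content = [
    [field.strip() for field in column_pattern.match(row).groups()]
    for row in content_split_line[5:-2]
]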

In [23]:
cleaned_content

[['10002001', '2022', '001', '60007000', 'Item 1'],
 ['10002002', '2022', '001', '60007001', 'Item A|B|C'],
 ['10002004', '2022', '001', '60007006', 'Item Z'],
 ['10002003', '2022', '002', '60007005', 'Item A|ID:01']]
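The slice `[5:-2]` assumes the title, separators, and header always occupy the same positions, and `match(...)` returns `None` for any line that does not fit the layout. A possible, more defensive variant is sketched below; `parse_rows` is a hypothetical helper introduced here, and the two-column sample is made up to keep the example short.

```python
import re


def parse_rows(lines, column_pattern, header_line):
    """Return stripped field lists for every line that fits the column layout."""
    rows = []
    for line in lines:
        if line == header_line:
            continue  # the header fits the pattern too, so skip it explicitly
        match = column_pattern.match(line)
        if match is None:
            continue  # title, separator, or blank line
        rows.append([field.strip() for field in match.groups()])
    return rows


# Made-up two-column extract with widths 4 and 6
lines = [
    "Title       1",
    "-------------",
    "|Year|Text  |",
    "-------------",
    "|2022|Item 1|",
    "|2022|A|B   |",
    "-------------",
    "",
]
pattern = re.compile(r"[|](.{4})[|](.{6})[|]")
rows = parse_rows(lines, pattern, header_line=lines[2])
print(rows)
```

Skipping non-matching lines trades strictness for robustness: a malformed data row would be dropped silently, so logging or counting skipped lines may be worthwhile on real extracts.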

In [24]:
df_clean = pd.DataFrame(cleaned_content, columns=column_header)

In [25]:
df_clean

   DocumentNo  Year  Itm  Clrng doc.          Text
0    10002001  2022  001    60007000        Item 1
1    10002002  2022  001    60007001    Item A|B|C
2    10002004  2022  001    60007006        Item Z
3    10002003  2022  002    60007005  Item A|ID:01
