### Handling Irregular XPaths
 
 As noted in resmigazete_scrape.ipynb, the HTML structure of the pages is irregular, so the text of a link cannot always be read from a single XPath.

 E.g. in https://www.resmigazete.gov.tr/eskiler/2007/06/20070624-1.htm, the last letter "r" of the link text falls outside the "a" tag and is stored in the "p" tag instead.
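
 A letter that lands outside the "a" tag can be recovered from the text that follows the closing tag. The following is a minimal sketch using only the standard library, assuming markup shaped like the case above. The snippet, the word "Yönetmelikler", and the helper `link_text_with_tail` are invented for illustration and are not taken from the actual page:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the irregular markup:
# the final "r" sits after </a>, directly inside <p>.
snippet = '<p><a href="example.htm">Yönetmelikle</a>r</p>'


def link_text_with_tail(anchor):
    """Return the anchor text plus any characters glued to its closing tag."""
    text = "".join(anchor.itertext())
    # ElementTree keeps the text that follows a closing tag in `.tail`.
    # Heuristic (an assumption): only the leading run of non-whitespace
    # characters in the tail belongs to the link's last word.
    glued = re.match(r"\S*", anchor.tail or "").group()
    return text + glued


paragraph = ET.fromstring(snippet)
anchor = paragraph.find("a")
print(link_text_with_tail(anchor))
```

 Real pages are not guaranteed to be well-formed XML, so in practice an HTML parser such as lxml, which exposes the same `.tail` attribute, would likely be needed.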
 
 **Additionally, the text of some links is split across several XPaths, which adds further complexity:**

 _E.g. in https://www.resmigazete.gov.tr/eskiler/2000/07/20000727.htm#8, the text is split across two XPaths:_
 - /html[1]/body[1]/font[1]/font[3]/font[4]/dt[1]/a[1] : "....Eğitim-Öğretim Yönetmeliğinin"
 - /html[1]/body[1]/font[1]/a[1] : "2 nci ve 21 inci Maddelerinde Değişiklik..."

 
_Joining the fragments with a space is also problematic, because some words are themselves split across XPaths, e.g. in https://www.resmigazete.gov.tr/eskiler/2000/07/20000727.htm#4:_
 - /html[1]/body[1]/font[1]/font[3]/font[2]/div[1]/dt[1]/a[1] : "S"
 - /html[1]/body[1]/font[1]/font[3]/font[2]/div[1]/dt[1]/a[2] : "utopu Müsabaka Yönetmeliğinde..."
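
 The trade-off between the two joining strategies can be seen by applying both to the fragments quoted above. This is only an illustrative sketch; the variable names are made up, and the fragments are assumed to carry no surrounding whitespace:

```python
# Fragments copied from the two examples above, in document order.
split_between_words = [
    "....Eğitim-Öğretim Yönetmeliğinin",
    "2 nci ve 21 inci Maddelerinde Değişiklik...",
]
split_inside_word = [
    "S",
    "utopu Müsabaka Yönetmeliğinde...",
]

for fragments in (split_between_words, split_inside_word):
    # Compare joining with a space against joining with no separator.
    print(repr(" ".join(fragments)))
    print(repr("".join(fragments)))
```

 Joining with a space breaks the second case ("S utopu ..."), while joining without a separator fuses two words in the first ("Yönetmeliğinin2 nci ..."), so neither choice is correct for every link.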



In [14]:
import pandas as pd
import re

file_path = 'XPaths_resmigazete.txt'

data = []
current_entry = None

# Read the file line by line and parse each line
with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        if 'Link:' in line:
            # Split the line into components based on the format, catching key parts
            parts = re.split(r', (?=\b(?:URL|XPath|Tag|Text|Link)\b)', line.strip())
            entry = {}
            for part in parts:
                if ': ' in part:
                    key, value = part.split(': ', 1)
                    entry[key.strip()] = value.strip()

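            # Merge consecutive fragments that point to the same link
            if current_entry and current_entry['Link'] == entry['Link']:
                # Same link: append without a separator, because a single
                # word may be split across XPaths (e.g. "S" + "utopu ...")
                current_entry['Text'] = current_entry.get('Text', '') + entry.get('Text', '')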
            else:
                # New link, save old entry if it exists and start a new one
                if current_entry:
                    data.append(current_entry)
                current_entry = entry

# Add the last entry if not already added
if current_entry:
    data.append(current_entry)

# Convert the list of dictionaries to a DataFrame
df = pd.DataFrame(data)

# Keep only the columns of interest, in a fixed order
df_filtered = df[['URL', 'XPath', 'Tag', 'Text', 'Link']]

# Optionally, save this filtered DataFrame to a new CSV file
df_filtered.to_csv('resmigazete_linked_XPaths.csv', index=False)
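
The merging step can be checked on a few in-memory lines before running it on the full file. The sketch below mirrors the parsing loop above. The field order in `sample_lines`, the example.invalid URLs, and the helper names `parse_line` and `merge_consecutive` are assumptions made for illustration; the real contents of XPaths_resmigazete.txt may differ:

```python
import re

# Same split pattern as in the cell above.
FIELD_SPLIT = re.compile(r', (?=\b(?:URL|XPath|Tag|Text|Link)\b)')


def parse_line(line):
    """Turn one 'Key: value, Key: value' line into a dict."""
    entry = {}
    for part in FIELD_SPLIT.split(line.strip()):
        if ': ' in part:
            key, value = part.split(': ', 1)
            entry[key.strip()] = value.strip()
    return entry


def merge_consecutive(entries):
    """Concatenate the Text of adjacent entries that share the same Link."""
    merged = []
    for entry in entries:
        if merged and merged[-1]['Link'] == entry['Link']:
            merged[-1]['Text'] += entry['Text']
        else:
            merged.append(dict(entry))
    return merged


# Hypothetical input lines in the assumed format.
sample_lines = [
    "URL: https://example.invalid/page.htm, XPath: /html[1]/body[1]/a[1], "
    "Tag: a, Text: S, Link: https://example.invalid/doc-4.htm",
    "URL: https://example.invalid/page.htm, XPath: /html[1]/body[1]/a[2], "
    "Tag: a, Text: utopu Müsabaka Yönetmeliğinde, Link: https://example.invalid/doc-4.htm",
    "URL: https://example.invalid/page.htm, XPath: /html[1]/body[1]/a[3], "
    "Tag: a, Text: Another title, Link: https://example.invalid/doc-5.htm",
]

merged = merge_consecutive(parse_line(line) for line in sample_lines)
for row in merged:
    print(row['Link'], '->', row['Text'])
```

Note that a merged row keeps the XPath of its first fragment, and only adjacent lines are merged, so fragments of the same link that are separated by another link would stay split.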
