### Handling Irregular XPaths
 
As noted in `resmigazete_scrape.ipynb`, the HTML structure of the pages is irregular.

E.g., in https://www.resmigazete.gov.tr/eskiler/2007/06/20070624-1.htm, the final letter "r" of the link text falls outside the `a` tag and is stored in the `p` element instead.
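
The case above can be illustrated with a minimal sketch. It does not show how `ResmiGazeteLinkParser` handles the problem; it only demonstrates one possible approach using the standard library's `xml.etree.ElementTree`, where text that follows a closing tag is exposed as the element's `tail`. The snippet, the word in it, and the helper `text_with_tail` are hypothetical, and the real pages are not well-formed XML, so a lenient HTML parser would be needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet imitating the irregularity:
# the final "r" sits after </a>, directly inside <p>.
snippet = '<p><a href="20070624-1.htm">Yönetmelikle</a>r</p>'


def text_with_tail(anchor):
    """Return the anchor text, plus the trailing word fragment if there is one."""
    text = anchor.text or ""
    tail = anchor.tail or ""
    # Only glue on the tail when it continues the word,
    # i.e. when it does not start with whitespace.
    if tail and not tail[0].isspace():
        text += tail.split()[0]
    return text


p = ET.fromstring(snippet)
a = p.find("a")
print(text_with_tail(a))
```

Checking that the tail does not start with whitespace is a heuristic: it avoids pulling in an unrelated following word, but it is an assumption about how the pages are written.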
 
**Additionally, the text of some links is split across several XPaths, which adds further complexity:**

_E.g., in https://www.resmigazete.gov.tr/eskiler/2000/07/20000727.htm#8, the text is split across two XPaths:_
 - /html[1]/body[1]/font[1]/font[3]/font[4]/dt[1]/a[1] : "....Eğitim-Öğretim Yönetmeliğinin"
 - /html[1]/body[1]/font[1]/a[1] : "2 nci ve 21 inci Maddelerinde Değişiklik..."

 
_Concatenating the fragments with a space is also problematic, because some words are themselves split across XPaths. E.g., https://www.resmigazete.gov.tr/eskiler/2000/07/20000727.htm#4:_
 - /html[1]/body[1]/font[1]/font[3]/font[2]/div[1]/dt[1]/a[1] : "S"
 - /html[1]/body[1]/font[1]/font[3]/font[2]/div[1]/dt[1]/a[2] : "utopu Müsabaka Yönetmeliğinde..."
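
One possible way to merge such fragments is sketched below. It is not the method used in `resmigazete_module.py`. It rests on two assumptions that would have to be checked against the real pages: fragments of the same title share the same link target, and the raw (unstripped) text of each fragment keeps the whitespace from the HTML source. Under those assumptions, the fragments can be joined with an empty string and the whitespace collapsed afterwards, so "S" + "utopu" stays one word while separate words stay separated. The `fragments` list, the trailing newline in it, and the helper `merge_fragments` are hypothetical.

```python
import re
from itertools import groupby

# Hypothetical records in document order: (link target, raw text).
# The trailing "\n" on the third record is assumed source whitespace.
fragments = [
    ("20000727.htm#4", "S"),
    ("20000727.htm#4", "utopu Müsabaka Yönetmeliğinde..."),
    ("20000727.htm#8", "....Eğitim-Öğretim Yönetmeliğinin\n"),
    ("20000727.htm#8", "2 nci ve 21 inci Maddelerinde Değişiklik..."),
]


def merge_fragments(records):
    """Join consecutive fragments that share a link target."""
    merged = []
    for href, group in groupby(records, key=lambda r: r[0]):
        # Join without a separator, then collapse runs of whitespace.
        text = "".join(raw for _, raw in group)
        merged.append((href, re.sub(r"\s+", " ", text).strip()))
    return merged


merged = merge_fragments(fragments)
for href, text in merged:
    print(href, "->", text)
```

If the source HTML has no whitespace between two fragments that belong to different words, this approach would wrongly glue them together, so the result still needs spot checks.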



#### Test the parser for selected years

In [1]:
# The parser class has been moved to resmigazete_module.py.
# Example usage is shown below.

from resmigazete_module import ResmiGazeteLinkParser

In [2]:

import pandas as pd

# Parse the links for the year 2000 and save them as JSON
parser_test2000 = ResmiGazeteLinkParser(year=2000)
parser_test2000.parse_file()
parser_test2000.save_to_json()

file_path = 'resmigazete_all/links_resmigazete_2000.json'

# Load the JSON Lines data into a DataFrame, keeping 'Date' as a string
df = pd.read_json(file_path, lines=True, dtype={'Date': str})

# Print rows 7-14
print(df.iloc[7:15])



Data saved to resmigazete_all/links_resmigazete_2000.json
          Date                                              XPath Tag  \
7   2000-07-26  /html[1]/body[1]/font[1]/font[4]/font[1]/font[...   a   
8   2000-07-26  /html[1]/body[1]/font[1]/font[4]/font[1]/dt[1]...   a   
9   2000-07-26  /html[1]/body[1]/font[1]/font[4]/font[1]/font[...   a   
10  2000-07-26  /html[1]/body[1]/font[1]/font[4]/font[1]/font[...   a   
11  2000-07-26  /html[1]/body[1]/font[1]/font[4]/font[1]/font[...   a   
12  2000-07-26  /html[1]/body[1]/font[1]/font[4]/font[1]/font[...   a   
13  2000-07-26  /html[1]/body[1]/font[1]/font[4]/font[1]/font[...   a   
14  2000-07-26  /html[1]/body[1]/font[1]/font[4]/font[1]/font[...   a   

                                                 Link  \
7   https://www.resmigazete.gov.tr/eskiler/2000/07...   
8   https://www.resmigazete.gov.tr/eskiler/2000/07...   
9   https://www.resmigazete.gov.tr/eskiler/2000/07...   
10  https://www.resmigazete.gov.tr/eskiler/2000/07...   

In [3]:

import pandas as pd

# Parse the links for the year 2007 and save them as JSON
parser_test2007 = ResmiGazeteLinkParser(year=2007)
parser_test2007.parse_file()
parser_test2007.save_to_json()

file_path = 'resmigazete_all/links_resmigazete_2007.json'

# Load the JSON Lines data into a DataFrame, keeping 'Date' as a string
df = pd.read_json(file_path, lines=True, dtype={'Date': str})

# Print rows 7-14
print(df.iloc[7:15])

Data saved to resmigazete_all/links_resmigazete_2007.json
          Date                                              XPath Tag  \
7   2007-05-26  /html[1]/body[1]/table[1]/tr[1]/td[1]/p[11]/sp...   a   
8   2007-05-26  /html[1]/body[1]/table[1]/tr[1]/td[1]/p[12]/sp...   a   
9   2007-05-26  /html[1]/body[1]/table[1]/tr[1]/td[1]/p[15]/sp...   a   
10  2007-05-26  /html[1]/body[1]/table[1]/tr[1]/td[1]/p[16]/sp...   a   
11  2007-05-26  /html[1]/body[1]/table[1]/tr[1]/td[1]/p[19]/sp...   a   
12  2007-05-26  /html[1]/body[1]/table[1]/tr[1]/td[1]/p[24]/sp...   a   
13  2007-05-26  /html[1]/body[1]/table[1]/tr[1]/td[1]/p[26]/fo...   a   
14  2007-06-26  /html[1]/body[1]/table[1]/tr[1]/td[1]/table[2]...   a   

                                                 Link  \
7   https://www.resmigazete.gov.tr/eskiler/2007/05...   
8   https://www.resmigazete.gov.tr/eskiler/2007/05...   
9   https://www.resmigazete.gov.tr/eskiler/2007/05...   
10  https://www.resmigazete.gov.tr/eskiler/2007/05...   