> ## Summary of what was done, and what we have

> ## We hold thousands of complete files locally, each corresponding to one webpage. We select some of the variables in each file, then clean and rearrange them to build a table.

> ## Roughly speaking, the files are numbered in their URLs from 10_000_000 to 12_500_000, so the full range covers about 2.5 million files.



> ## Files vary slightly, but a typical file holds the information shown below. This example has 48 variables, not all of which contain information; other files can have fewer. We add one extra variable at the start: the URL from which the complete file was collected.

> **Note** The code below is kept minimal. For a cleaner presentation of the code, visit the PARES folder at https://github.com/aodhanlutetiae/sp_archive


In [30]:
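import os
import warnings
import xml.etree.ElementTree as et

import pandas as pd

# Silence all warnings to keep the notebook output short
warnings.simplefilter(action='ignore', category=Warning)
# Show wide tables and long cell values in full
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', 1000)

local_file = '/Users/aidanair/Desktop/test_files/paresbusquedas_10375724.xml'
xtree = et.parse(local_file)
xroot = xtree.getroot()
# The file number is the last '_'-separated piece of the file name
ending = os.path.splitext(os.path.basename(local_file))[0].split('_')[-1]
url = f'http://pares.mcu.es/ParesBusquedas20/catalogo/description/{ending}'
print(url)
# Print every element below the root as 'tag:text', without the '{namespace}' prefix
for c in xroot.findall('.//*'):
    print(c.tag.split('}')[1] + ":" + str(c.text))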

http://pares.mcu.es/ParesBusquedas20/catalogo/description/10375724
control:None
recordid:ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,20,F0101453
otherrecordid:ES-CDMH-37274-10375724
filedesc:None
titlestmt:None
titleproper:Ficha de  Ferrandiz 
maintenancestatus:revised
maintenanceagency:None
agencycode:ES-37008-CDMH
agencyname:CDMH
maintenancehistory:None
maintenanceevent:None
eventtype:created
eventdatetime:2020-12-22T11:05:16
agenttype:machine
agent:Generado automaticamente por el Portal de Archivos Españoles (PARES) con los datos de sus registros de autoridad.
archdesc:None
did:None
unitid:ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,20,F0101453
unittitle:Ficha de  Ferrandiz 
unitdatestructured:None
daterange:None
fromdate:1937
todate:1977
origination: 
corpname:None
part:Delegación Nacional de los Servicios Documentales (España)
physdescset:None
physdescstructured:None
quantity:1 
unittype:Ficha(s) 
physfacet:Papel 
descriptivenote: 
p:. Cartulina
physloc:DNSD-SECRETARIA,FIC

## We take the following variables from the list

- url 
- recordid
- otherrecordid
- titleproper
- agencycode
- agencyname
- fromdate
- todate
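
The selection step can be tried without any local files. The sketch below is only an illustration: the XML string is a heavily shortened, hypothetical stand-in for one downloaded file (the root element name, the namespace URI and the nesting are assumptions), and `local_name` and `pick_fields` are helper names invented here.

```python
import xml.etree.ElementTree as et

# Hypothetical, shortened stand-in for one downloaded file.
# The root name, namespace URI and nesting are assumptions for illustration.
SAMPLE_XML = """<ead xmlns="http://example.org/ns">
  <control>
    <recordid>ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,20,F0101453</recordid>
    <otherrecordid>ES-CDMH-37274-10375724</otherrecordid>
    <filedesc>
      <titlestmt>
        <titleproper>Ficha de  Ferrandiz </titleproper>
      </titlestmt>
    </filedesc>
  </control>
</ead>"""

# A subset of the variables listed above, enough for the sample
WANTED = {'recordid', 'otherrecordid', 'titleproper'}


def local_name(tag):
    """Return the tag name without its '{namespace}' prefix."""
    return tag.rsplit('}', 1)[-1]


def pick_fields(root, wanted):
    """Collect the text of every element whose tag name is in `wanted`."""
    return {
        local_name(el.tag): el.text
        for el in root.iter()
        if local_name(el.tag) in wanted
    }


record = pick_fields(et.fromstring(SAMPLE_XML), WANTED)
print(record)
```

If a tag occurred more than once in a file, this sketch would keep only the last value seen.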


## We break up the 'recordid' into its parts, while keeping a copy of the original 'recordid'


- url
- recordid
- recordid_a
- recordid_b
- recordid_c
- recordid_d
- recordid_e
- otherrecordid
- titleproper
- agencycode
- agencyname
- fromdate
- todate
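
As a rough illustration of how one 'recordid' breaks into five parts, here is a sketch that works on a single string rather than a whole table. The function name `split_recordid` is hypothetical, and the sketch assumes every 'recordid' has the same shape as the example above: two codes separated by '/', then '//', then four comma-separated fields, of which the first is not kept.

```python
def split_recordid(recordid):
    """Split one recordid string into five parts, labelled a to e."""
    # The text before '//' holds two '/'-separated codes
    head, _, tail = recordid.partition('//')
    part_a, part_b = head.split('/')
    # After '//' come four comma-separated fields; the first one is dropped
    _, part_c, part_d, part_e = tail.split(',')
    return {
        'recordid_a': part_a,
        'recordid_b': part_b,
        'recordid_c': part_c,
        'recordid_d': part_d,
        'recordid_e': part_e,
    }


parts = split_recordid('ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,20,F0101453')
print(parts)
```

A 'recordid' with a different number of separators would raise a `ValueError` here, which is one way of spotting files that do not follow the expected pattern.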


## We rename 'titleproper' to 'name' and remove the leading 'Ficha de' ...

## and we reorder the columns as follows


- name (previously 'titleproper')
- recordid_a
- recordid_b
- recordid_c
- recordid_d
- recordid_e
- fromdate
- todate
- agencycode
- agencyname
- recordid
- otherrecordid
- url


In [31]:
selected = []
path = '/Users/aidanair/Desktop/test_files/'
# Tags to keep from each file; 'url' is added separately below
key_list = ['recordid', 'otherrecordid', 'titleproper', 'agencycode', 'agencyname', 'fromdate', 'todate']
for file_name in os.listdir(path):
    # Skip anything that is not an XML file (e.g. hidden system files)
    if not file_name.endswith('.xml'):
        continue
    local_file = path + file_name
    xtree = et.parse(local_file)
    xroot = xtree.getroot()
    record = {}
    for c in xroot.findall('.//*'):
        k = c.tag.split('}')[1]  # tag name without its namespace
        if k in key_list:
            record[k] = c.text
    # Rebuild the source URL from the number at the end of the file name
    ending = os.path.splitext(file_name)[0].split('_')[-1]
    record['url'] = f'http://pares.mcu.es/ParesBusquedas20/catalogo/description/{ending}'
    selected.append(record)
# One row per file
df = pd.DataFrame(selected)
df

| | recordid | otherrecordid | titleproper | agencycode | agencyname | fromdate | todate | url |
|---|---|---|---|---|---|---|---|---|
| 0 | ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,40,M0070587 | ES-CDMH-37274-10616571 | Ficha de Martinez | ES-37008-CDMH | CDMH | 1937 | 1977 | http://pares.mcu.es/ParesBusquedas20/catalogo/description/10616571 |
| 1 | ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,20,F0100945 | ES-CDMH-37274-11578952 | Ficha de Ferran | ES-37008-CDMH | CDMH | 1937 | 1977 | http://pares.mcu.es/ParesBusquedas20/catalogo/description/11578952 |
| 2 | ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,20,F0101453 | ES-CDMH-37274-10375724 | Ficha de Ferrandiz | ES-37008-CDMH | CDMH | 1937 | 1977 | http://pares.mcu.es/ParesBusquedas20/catalogo/description/10375724 |


In [32]:

def remove_ficha(titleproper):
    # Remove the first 'Ficha de' and trim surrounding whitespace
    return titleproper.replace('Ficha de', '', 1).strip()

df['name'] = df.titleproper.apply(remove_ficha)
# Splitting on '/' gives: part a, part b, an empty string (from '//') and the rest
slash_parts = df.recordid.str.split('/', expand=True)
# Splitting on ',' gives: everything up to the first comma, then parts c, d and e
comma_parts = df.recordid.str.split(',', expand=True)
df['recordid_a'] = slash_parts[0]
df['recordid_b'] = slash_parts[1]
df['recordid_c'] = comma_parts[1]
df['recordid_d'] = comma_parts[2]
df['recordid_e'] = comma_parts[3]
# Final column order
df = df[['name', 'recordid_a', 'recordid_b', 'recordid_c', 'recordid_d', 'recordid_e', 'fromdate', 'todate', 'agencycode', 'agencyname', 'recordid', 'otherrecordid', 'url']]


## With three files gathered rather than one, this leaves us with the following dataset

In [33]:
df

| | name | recordid_a | recordid_b | recordid_c | recordid_d | recordid_e | fromdate | todate | agencycode | agencyname | recordid | otherrecordid | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Martinez | ES.37274.CDMH | 8.4.8.12 | FICHERO | 40 | M0070587 | 1937 | 1977 | ES-37008-CDMH | CDMH | ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,40,M0070587 | ES-CDMH-37274-10616571 | http://pares.mcu.es/ParesBusquedas20/catalogo/description/10616571 |
| 1 | Ferran | ES.37274.CDMH | 8.4.8.12 | FICHERO | 20 | F0100945 | 1937 | 1977 | ES-37008-CDMH | CDMH | ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,20,F0100945 | ES-CDMH-37274-11578952 | http://pares.mcu.es/ParesBusquedas20/catalogo/description/11578952 |
| 2 | Ferrandiz | ES.37274.CDMH | 8.4.8.12 | FICHERO | 20 | F0101453 | 1937 | 1977 | ES-37008-CDMH | CDMH | ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,20,F0101453 | ES-CDMH-37274-10375724 | http://pares.mcu.es/ParesBusquedas20/catalogo/description/10375724 |


This table is then exported as a CSV, named as follows:

> **one_pares_archive_10074724_10117072_03042021_LEN41442.csv**

The file name identifies:

- the number of the first file gathered in this CSV (the number found at the end of its URL)
- the number of the last file gathered in this CSV
- the date
- the number of files presented in the CSV. Note that this is not simply the difference between the first and last file numbers, because some files are empty and are ignored. For example, the first CSV covered about 40,000 file numbers, but 700 of them had no data and were skipped.
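
The naming convention can be sketched as a pair of helper functions. This is an illustration under stated assumptions, not the code used to produce the CSVs: `build_csv_name` and `parse_csv_name` are hypothetical names, and the date is assumed to be written as DDMMYYYY.

```python
from datetime import date


def build_csv_name(first_id, last_id, n_rows, when):
    """Assemble a CSV name from the first and last file numbers, date and row count."""
    # Assumption: the date is written as DDMMYYYY
    return f"one_pares_archive_{first_id}_{last_id}_{when:%d%m%Y}_LEN{n_rows}.csv"


def parse_csv_name(name):
    """Read the four pieces of information back out of a CSV name."""
    stem = name.rsplit('.', 1)[0]
    first_id, last_id, stamp, length = stem.split('_')[-4:]
    return {
        'first_id': int(first_id),
        'last_id': int(last_id),
        'date': stamp,               # left as text, since the format is an assumption
        'n_rows': int(length[3:]),   # drop the 'LEN' prefix
    }


info = parse_csv_name('one_pares_archive_10074724_10117072_03042021_LEN41442.csv')
# Size of the numbered range minus the rows actually present. This equals the
# number of skipped files only if every number in the range was attempted.
gap = (info['last_id'] - info['first_id'] + 1) - info['n_rows']
print(info, gap)
```

Keeping the row count in the name makes it possible to check a CSV against its expected length without opening it.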