> ## Summary of what was done, and what we have

<div class="alert alert-success">

> ## We hold thousands of complete files LOCALLY, each corresponding to one webpage. We will select some of the fields, then clean and rearrange them to build a table

</div>

> ## List all the fields in one file as an example

<div class="alert alert-success">

> ## There will be small variations between files, but one file will hold roughly the following information. This file has 48 variables, but not all of them contain information. Other files can have fewer variables. We add an extra variable at the start: the URL from which we collected the complete file.

> **Note** For a cleaner presentation of the code in this notebook, visit the PARES folder at https://github.com/aodhanlutetiae/sp_archive

</div>
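The extra URL variable can be rebuilt from the record number in the file name. The block below is a minimal sketch of that step, assuming every file is named like `paresbusquedas_<number>.xml`; the helper name `url_from_path` is hypothetical and is not part of the notebook.

```python
import os

# Base address as used in this notebook's own code
BASE_URL = 'http://pares.mcu.es/ParesBusquedas20/catalogo/description/'

def url_from_path(local_file):
    """Rebuild the source URL from a file named 'paresbusquedas_<number>.xml'.

    Hypothetical helper for illustration only.
    """
    # Work on the file name alone, so underscores in folder names cannot interfere
    file_name = os.path.basename(local_file)
    stem = os.path.splitext(file_name)[0]   # e.g. 'paresbusquedas_10074730'
    record_number = stem.split('_')[-1]     # e.g. '10074730'
    return BASE_URL + record_number

example_url = url_from_path('/Users/aidanair/Desktop/test_files/paresbusquedas_10074730.xml')
print(example_url)
```

Taking the base name first matters here because the folder `test_files` itself contains an underscore.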

In [44]:
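import os
import xml.etree.ElementTree as et
import pandas as pd
# Show up to 100 columns when a DataFrame is displayed
pd.set_option('display.max_columns', 100)
local_file = '/Users/aidanair/Desktop/test_files/paresbusquedas_10074730.xml'
xtree = et.parse(local_file)
xroot = xtree.getroot()
# The record number is the part of the file name after the underscore
ending = os.path.basename(local_file).split('_')[1].split('.')[0]
url = f'http://pares.mcu.es/ParesBusquedas20/catalogo/description/{ending}'
print(url)
# Print every descendant element as tag:text, with the '{namespace}' prefix removed
for c in xroot.findall(".//*"):
    print(c.tag.split('}')[1] + ":" + str(c.text))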

http://pares.mcu.es/ParesBusquedas20/catalogo/description/10074730
control:None
recordid:ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,1,A0000622
otherrecordid:ES-CDMH-37274-10074730
filedesc:None
titlestmt:None
titleproper:Ficha de Carmen Abad 
maintenancestatus:revised
maintenanceagency:None
agencycode:ES-37008-CDMH
agencyname:CDMH
maintenancehistory:None
maintenanceevent:None
eventtype:created
eventdatetime:2021-04-01T21:56:13
agenttype:machine
agent:Generado automaticamente por el Portal de Archivos Españoles (PARES) con los datos de sus registros de autoridad.
archdesc:None
did:None
unitid:ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,1,A0000622
unittitle:Ficha de Carmen Abad 
unitdatestructured:None
daterange:None
fromdate:1937
todate:1977
physloc:DNSD-SECRETARIA,FICHERO,1,A0000622
langmaterial: 
languageset:None
language:Español
script:Latino
repository:None
name:None
part:Centro Documental de la Memoria Histórica
relations:None
relation: 
relationentry:Represión política
de

## We take the following variables

<div class="alert alert-success">

- url 
- recordid
- otherrecordid
- titleproper
- agencycode
- agencyname
- fromdate
- todate

</div>
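Selecting these variables amounts to walking the XML tree and keeping only the tags of interest. The following is a rough sketch on an inline sample rather than a real file: the root element `ead` and the namespace URI are placeholders, and only the three field values are copied from the example printed earlier.

```python
import xml.etree.ElementTree as et

# Tiny stand-in for one downloaded file; root tag and namespace URI are placeholders
sample_xml = """<ead xmlns="http://example.org/placeholder-namespace">
  <control>
    <recordid>ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,1,A0000622</recordid>
    <otherrecordid>ES-CDMH-37274-10074730</otherrecordid>
    <filedesc><titlestmt><titleproper>Ficha de Carmen Abad </titleproper></titlestmt></filedesc>
    <maintenancestatus>revised</maintenancestatus>
  </control>
</ead>"""

# The variables kept for the table (the URL is added separately)
wanted = ['recordid', 'otherrecordid', 'titleproper', 'agencycode',
          'agencyname', 'fromdate', 'todate']

def local_name(tag):
    # '{namespace}recordid' -> 'recordid'; a tag without a namespace passes through
    return tag.rsplit('}', 1)[-1]

root = et.fromstring(sample_xml)
record = {}
for element in root.iter():
    name = local_name(element.tag)
    if name in wanted:
        record[name] = element.text

print(record)
```

If a wanted tag occurred more than once in a file, this loop would keep the last value seen.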


## We break up the 'recordid' into its parts, while keeping a copy of the full recordid

<div class="alert alert-success">

- url
- recordid
- recordid_a
- recordid_b
- recordid_c
- recordid_d
- recordid_e
- otherrecordid
- titleproper
- agencycode
- agencyname
- fromdate
- todate

</div>
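To make the split concrete, here is a small sketch on the sample `recordid` shown earlier. It assumes every `recordid` has the same shape (three slashes, three commas); the helper name `split_recordid` is hypothetical.

```python
sample_recordid = 'ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHERO,1,A0000622'

def split_recordid(recordid):
    """Return the five parts kept in the table (hypothetical helper)."""
    # Splitting on '/' gives four pieces; the third is empty because of the '//'
    slash_parts = recordid.split('/')
    # Splitting on ',' gives four pieces; the first is everything before the first comma
    comma_parts = recordid.split(',')
    # Keep the first two slash pieces and the last three comma pieces
    return slash_parts[:2] + comma_parts[1:]

parts = split_recordid(sample_recordid)
print(parts)
# ['ES.37274.CDMH', '8.4.8.12', 'FICHERO', '1', 'A0000622']
```

Under this scheme the `DNSD-SECRETARIA` segment is not kept as a part of its own; it remains available in the full `recordid` column.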

## We rename the 'titleproper' column to 'name', remove the leading 'Ficha de', and reorder the columns as follows

<div class="alert alert-success">

- name (previously 'titleproper')
- recordid_a
- recordid_b
- recordid_c
- recordid_d
- recordid_e
- fromdate
- todate
- agencycode
- agencyname
- recordid
- otherrecordid
- url

</div>
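The renaming and reordering can be sketched in plain Python on a single row. This is an illustration under assumptions, not the notebook's own code: `clean_name` and `reorder` are hypothetical helpers, and the second name in the usage lines is made up to show a title that contains its own 'de'.

```python
def clean_name(titleproper):
    """Strip surrounding whitespace and the leading 'Ficha de ' (hypothetical helper)."""
    prefix = 'Ficha de '
    text = titleproper.strip()
    # Remove the prefix only at the start, so a 'de' inside the name is left alone
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text

# Final column order, as listed above
column_order = ['name', 'fromdate', 'todate', 'url']

def reorder(row, order):
    # Build a new dict whose keys follow the requested order
    return {key: row[key] for key in order}

row = {'url': 'http://pares.mcu.es/ParesBusquedas20/catalogo/description/10074730',
       'titleproper': 'Ficha de Carmen Abad ',
       'fromdate': '1937',
       'todate': '1977'}
row['name'] = clean_name(row.pop('titleproper'))
ordered_row = reorder(row, column_order)

print(list(ordered_row))
print(clean_name('Ficha de Juan de la Cruz'))  # made-up name for illustration
```

Only four columns are used here to keep the sketch short; the same pattern extends to the full list.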

In [56]:
path = '/Users/aidanair/Desktop/test_files/'
# Tags to keep from each file; 'url' is not an XML tag and is added separately below
key_list = ['recordid', 'otherrecordid', 'titleproper', 'agencycode', 'agencyname', 'fromdate', 'todate']
selected = []
for file_name in os.listdir(path):
    # Skip anything that is not an XML record (for example .DS_Store on macOS)
    if not file_name.endswith('.xml'):
        continue
    xtree = et.parse(path + file_name)
    xroot = xtree.getroot()
    record = {}
    for c in xroot.findall(".//*"):
        k = c.tag.split('}')[1]
        if k in key_list:
            record[k] = c.text
    # Take the record number from the file name, not the full path,
    # because the folder name 'test_files' also contains an underscore
    ending = file_name.split('_')[1].split('.')[0]
    record['url'] = f'http://pares.mcu.es/ParesBusquedas20/catalogo/description/{ending}'
    selected.append(record)
df = pd.DataFrame(selected)


In [57]:

def remove_ficha(titleproper):
    # Drop only the leading 'Ficha de ', so a name that itself contains 'de ' stays whole
    return titleproper.replace('Ficha de ', '', 1).strip()
df['name'] = df.titleproper.apply(remove_ficha)
# Split on '/' into four pieces; recordid_x is the empty piece between the two slashes of '//'
df[['recordid_a', 'recordid_b', 'recordid_x', 'recordid_c']] = df.recordid.str.split("/", expand=True)
# Split on ',' into four pieces; recordid_d is everything before the first comma
df[['recordid_d', 'recordid_e', 'recordid_f', 'recordid_g']] = df.recordid.str.split(",", expand=True)
# Keep the first two '/' pieces and the last three ',' pieces, in the final column order
df = df[['name', 'recordid_a', 'recordid_b', 'recordid_e', 'recordid_f', 'recordid_g', 'fromdate', 'todate', 'agencycode', 'agencyname', 'recordid', 'otherrecordid', 'url']]
# Rename the five kept pieces to recordid_a through recordid_e
df.columns = ['name', 'recordid_a', 'recordid_b', 'recordid_c', 'recordid_d', 'recordid_e', 'fromdate', 'todate', 'agencycode', 'agencyname', 'recordid', 'otherrecordid', 'url']


## This leaves us with the following dataset when just three files are read

In [58]:
df

| | name | recordid_a | recordid_b | recordid_c | recordid_d | recordid_e | fromdate | todate | agencycode | agencyname | recordid | otherrecordid | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aureliano Abad | ES.37274.CDMH | 8.4.8.12 | FICHERO | 1 | A0000607 | 1937 | 1977 | ES-37008-CDMH | CDMH | ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHER... | ES-CDMH-37274-10074727 | http://pares.mcu.es/ParesBusquedas20/catalogo/... |
| 1 | Celedonio Abad | ES.37274.CDMH | 8.4.8.12 | FICHERO | 1 | A0000628 | 1937 | 1977 | ES-37008-CDMH | CDMH | ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHER... | ES-CDMH-37274-10074733 | http://pares.mcu.es/ParesBusquedas20/catalogo/... |
| 2 | Carmen Abad | ES.37274.CDMH | 8.4.8.12 | FICHERO | 1 | A0000622 | 1937 | 1977 | ES-37008-CDMH | CDMH | ES.37274.CDMH/8.4.8.12//DNSD-SECRETARIA,FICHER... | ES-CDMH-37274-10074730 | http://pares.mcu.es/ParesBusquedas20/catalogo/... |
