<a href="https://colab.research.google.com/github/charlotter62/EU-ETS-EUTL/blob/main/T1_transaction_xmls_byregistry_bydate_DOWNLOAD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading Transactions XML files


---


**Description**:

The following code downloads XML files from the [European Union Transaction Log](https://ec.europa.eu/clima/ets/transaction.do). The files are downloaded and organized by registry and type of file:
* The DetailsAll.xml files are downloaded with the "DetailsAll" button at the bottom of the search results.
* The TransactionsBasic.xml files are downloaded with the "Export" button.

A companion notebook parses the downloaded XML files into combined CSV files:
[xml-byregistry-bydate_Parse_XML.ipynb](https://colab.research.google.com/drive/1oJL9WKQtElhPaD1bn4wGke2FhaqkACFJ?usp=sharing)

**Author**: Charlotte Rivard
**Contact**: 15crivard@gmail.com
**Date**: 1/13/2022

*Please reach out with questions and coauthorship considerations if using this script for publications*

---

In [None]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [None]:
!pip install wget
import wget



In [None]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import os
from socket import error as SocketError
import errno

Function to create a folder if it does not exist

In [None]:
def createFolder(folder):
  if not os.path.isdir(folder):
    os.makedirs(folder)
    print("created folder : ", folder)

Function to check whether the XML can be downloaded

In [None]:
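def checkDownloadError(link):
  # Request the export link and look for an error box in the returned HTML
  page = requests.get(link)
  soup = BeautifulSoup(page.content, "html.parser")
  error = soup.find("pre", {"class":"errortext"})
  if error:
    error = error.get_text()  # .string can be None when the tag has nested children
  else:
    error = "No error"
  return(error)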

Function to check whether the file is an XML file or faulty file

In [None]:
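def isXML(filepath):
  # A valid export starts with an XML declaration; anything else is treated as faulty
  with open(filepath, "r") as f:  # the with-block closes the file handle
    xml = "<?xml" in f.readline()
  return(xml)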

Function to repeatedly attempt downloading the file (if it downloads a faulty XML)

In [None]:
def patientDownload(link,savename):
  success=0
  while(success!=1):
    try:
      wget.download(link,savename)
      if(isXML(savename)):
        success=1
      else:
        print("Download failed, attempting again")
        os.remove(savename)
        time.sleep(10)
    except SocketError as e:
      if e.errno != errno.ECONNRESET:
        raise # Not the error we are looking for
      print("Download failed, attempting again")
      time.sleep(10)
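
The loop above retries indefinitely. As a minimal sketch of the same retry-until-valid idea with an upper bound on attempts, the hypothetical helper below takes two caller-supplied callables (`fetch` and `is_valid` are illustrative names, standing in for `wget.download` and `isXML`) and is not part of the original notebook:

```python
import time

def download_with_retries(fetch, is_valid, max_attempts=5, wait_seconds=0.0):
    """Call fetch() until is_valid(result) is true or attempts run out.

    fetch and is_valid are hypothetical callables supplied by the caller.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch()
        except ConnectionResetError:
            result = None  # treat a reset connection like a faulty download
        if result is not None and is_valid(result):
            return attempt
        time.sleep(wait_seconds)  # pause before trying again
    raise RuntimeError(f"gave up after {max_attempts} attempts")

# Stand-in fetcher that returns a faulty page twice before returning XML text
responses = iter(["<html>busy</html>", "<html>busy</html>", "<?xml version='1.0'?>"])
attempts_needed = download_with_retries(
    lambda: next(responses),
    lambda text: text.startswith("<?xml"),
)
print(attempts_needed)
```

Raising after a fixed number of attempts keeps a permanently unavailable export from stalling the whole run; the right limit and wait time depend on how the server behaves.
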

Function to orchestrate the XML downloads by date range and registry

In [None]:
def getTransactionsXML(d1,m1,y1,d2,m2,y2,treg,savepath):
  link = "https://ec.europa.eu/clima/ets/exportEntry.do?form=transaction&endDate="+d2+"%2F"+m2+"%2F"+y2+ "&transactionStatus=4&suppTransactionType=-1&currentSortSettings=&originatingAccountType=-1&originatingAccountIdentifier=&languageCode=en&originatingAccountHolder=&destinationAccountIdentifier=&transactionID=&transactionType=-1&destinationAccountType=-1&toCompletionDate=&originatingRegistry=" + treg + "&destinationAccountHolder=&fromCompletionDate=&destinationRegistry=-1&startDate="+d1+"%2F"+m1+"%2F"+y1+"&exportType=1&exportAction=transactionAll&exportOK=exportOK"
  print(link)
  detailslink = "https://ec.europa.eu/clima/ets/exportEntry.do?originatingAccountNumber=&suppTransactionType=-1&endDate="+d2+"%2F"+m2+"%2F"+y2+"&currentSortSettings=&currentSortSettings=&originatingAccountType=-1&originatingAccountIdentifier=&originatingAccountHolder=&destinationAccountIdentifier="+"&transactionID=&transferringEsdRegistryCode=&toCompletionDate=&destinationRegistry=-1&transactionStatus=4&transferringEsdYear=&destinationAccountNumber="+"&languageCode=en&transactionType=-1&destinationAccountType=-1&acquiringEsdYear=&form=transactionAll&originatingRegistry=" + treg+ "&acquiringEsdRegistryCode=&destinationAccountHolder="+"&fromCompletionDate=&startDate="+d1+"%2F"+m1+"%2F"+y1+"&exportType=1&exportAction=transactionAll&exportOK=exportOK"
  print(detailslink)

  savename = savepath + treg + "_" + y1 + "-" + m1 + "-" + d1 +"_" + y2 + "-" + m2 + "-" + d2 +"_TransactionsBasic.xml"
  print(savename)
  detailsname = savename.replace("TransactionsBasic","DetailsAll")

  if(os.path.exists(savename)):
    errorcheck = "File already downloaded!"
  else:
    errorcheck = checkDownloadError(link).replace("\n","")
    if(errorcheck=="No error"):
      patientDownload(link,savename)
      patientDownload(detailslink,detailsname)

  return(errorcheck)

# erroroutput = getTransactionsXML("01","01","2015","01","01","2016","BG",workingdir)
# erroroutput
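
The export links above are assembled by string concatenation, with `%2F` written by hand for each `/` in the dates. As a hedged sketch, the standard library's `urlencode` can do that escaping. The helper name `build_export_query` is hypothetical, and it includes only a subset of the parameters from the original link; whether the server accepts a reduced parameter set is an assumption that has not been verified here.

```python
from urllib.parse import urlencode

def build_export_query(d1, m1, y1, d2, m2, y2, treg):
    """Return an encoded query string for a subset of the export parameters."""
    params = {
        "form": "transaction",
        "startDate": f"{d1}/{m1}/{y1}",   # "/" is percent-encoded by urlencode
        "endDate": f"{d2}/{m2}/{y2}",
        "transactionStatus": "4",
        "originatingRegistry": treg,
        "languageCode": "en",
        "exportType": "1",
        "exportAction": "transactionAll",
        "exportOK": "exportOK",
    }
    return urlencode(params)

query = build_export_query("01", "01", "2015", "01", "01", "2016", "BG")
print("https://ec.europa.eu/clima/ets/exportEntry.do?" + query)
```
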

Get a month of XMLs

In [None]:
def getMonth(y,m,code):
  folderbasic = workingdir + code +"/TransactionsBasic/"
  createFolder(folderbasic)
  folderdetails = workingdir + code +"/DetailsAll/"
  createFolder(folderdetails)

  m1 = "{:02d}".format(m)
  if(m==12):
    #December runs through the 1st of January of the following year
    erroroutput = getTransactionsXML("01",m1,str(y),"01","01",str(y+1),code,folderbasic)
  else:
    m2 = "{:02d}".format(m+1)
    erroroutput = getTransactionsXML("01",m1,str(y),"01",m2,str(y),code,folderbasic)
  #DetailsAll files are downloaded inside getTransactionsXML, so no separate call is needed
  return(erroroutput)

In [None]:
def getLastofMonth(m,y):
  lastofmonths = {
  1:31,
  2:28,
  3:31,
  4:30,
  5:31,
  6:30,
  7:31,
  8:31,
  9:30,
  10:31,
  11:30,
  12:31
  }

  lastday = lastofmonths[m]
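  if(m==2):
    y = int(y)
    #Gregorian rule: divisible by 4, except century years not divisible by 400
    if(y%4==0 and (y%100!=0 or y%400==0)):
      print("leap year!")
      lastday=29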

  return(lastday)
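
As an alternative sketch, the same lookup can be delegated to the standard library's `calendar.monthrange`, which already handles leap years. The function name `last_day_of_month` is illustrative and not part of the notebook:

```python
import calendar

def last_day_of_month(year, month):
    # monthrange returns (weekday of the first day, number of days in the month)
    return calendar.monthrange(int(year), month)[1]

print(last_day_of_month(2016, 2))  # 29, leap year
print(last_day_of_month(2015, 2))  # 28
print(last_day_of_month(2100, 2))  # 28, century year not divisible by 400
```
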

In [None]:
def getWeek(y,m,w,code):
  folderbasic = workingdir + code +"/TransactionsBasic/Weekly/"
  createFolder(folderbasic)
  folderdetails = workingdir + code +"/DetailsAll/Weekly/"
  createFolder(folderdetails)

  y1 = str(y)
  y2 = str(y) #starts the same

  m1 = "{:02d}".format(m)
  if(w==4): #If last week of month
    if(m==12): #If last week of december
      print("last week of december!")
      m2 = "01" #Next month is january
      y2 = str(y+1) #Year is next year
    else:
      print("last week not december")
      m2 = "{:02d}".format(m+1) #Month is next month
  else:
    m2=m1 #Month is same month

  weekbounds = {
  1:["01","07"],
  2:["08","14"],
  3:["15","21"],
  4:["22","01"],
  }
  d1 = weekbounds[w][0]
  d2 = weekbounds[w][1]

  weekoutput = getTransactionsXML(d1,m1,y1,d2,m2,y2, code, folderbasic)
  #DetailsAll files are downloaded inside getTransactionsXML, so no separate call is needed
  return(weekoutput)

#getWeek(2015,4,4,"BG")
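
To make the week boundaries easier to check without hitting the server, the date logic of `getWeek` can be isolated into a pure function. This is a sketch that mirrors the rules above (weeks 1 to 3 span seven days; week 4 runs from the 22nd to the 1st of the following month); `week_bounds` is a hypothetical name:

```python
def week_bounds(year, month, week):
    """Return ((d1, m1, y1), (d2, m2, y2)) as zero-padded strings for week 1-4."""
    starts = {1: 1, 2: 8, 3: 15, 4: 22}
    start = (f"{starts[week]:02d}", f"{month:02d}", str(year))
    if week < 4:
        # Weeks 1-3 end six days after they start, in the same month
        end = (f"{starts[week] + 6:02d}", f"{month:02d}", str(year))
    elif month == 12:
        # Last week of December ends on the 1st of January of the next year
        end = ("01", "01", str(year + 1))
    else:
        # Last week of any other month ends on the 1st of the next month
        end = ("01", f"{month + 1:02d}", str(year))
    return start, end

print(week_bounds(2015, 4, 4))
print(week_bounds(2015, 12, 4))
```
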

In [None]:
def getDay(y,m,d,code):
  folderbasic = workingdir + code +"/TransactionsBasic/Daily/"
  createFolder(folderbasic)
  folderdetails = workingdir + code +"/DetailsAll/Daily/"
  createFolder(folderdetails)

  y1 = str(y)
  y2 = y1
  m1 = "{:02d}".format(m)
  m2 = m1
  d1 = "{:02d}".format(d)
  d2 = "{:02d}".format(d)#+1)

  # if(d==getLastofMonth(m,y)):
  #   print("last of month")
  #   d2 = "01" #First of next month
  #   if(m==12):
  #     m2 = "01" #January
  #     y2 = str(y+1) #Next year
  #   else:
  #     m2 = "{:02d}".format(m+1)

  dayoutput = getTransactionsXML(d1,m1,y1,d2,m2,y2, code, folderbasic)
  return(dayoutput)

#getDay(2014,10,31,"FR")

In [None]:
def saveHighVolumeDay(daylist):
  #Append one [year, month, day, code] row to the log of days that exceed the export limit
  highvolumepath = os.path.join(workingdir,"HighVolumeDays.csv") #avoids a doubled slash
  highvolumedf = pd.DataFrame([daylist],columns=["Year","Month","Day","Code"])
  if(os.path.exists(highvolumepath)):
    print("reading file...")
    fullhighvolumedf = pd.read_csv(highvolumepath)
    pd.concat([fullhighvolumedf,highvolumedf]).to_csv(highvolumepath,index=False)
  else:
    print("creating file...")
    highvolumedf.to_csv(highvolumepath,index=False)

Downloading XML files by country and time frame

In [None]:
workingdir = "/gdrive/MyDrive/Brookings/XML_downloads/xml-byregistry-bydate/"
createFolder(workingdir)

In [None]:
registries = pd.read_csv(workingdir+'RegistryLookup.csv')
registries
#Could also just make registries a list of the country codes to be simpler
#registries = []

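| Unnamed: 0 | Country | Code | Large |
|---|---|---|---|
| 0 | Belgium | BE | 0 |
| 1 | Bulgaria | BG | 0 |
| 2 | Czech Republic | CZ | 0 |
| 3 | Czechia | CZ | 0 |
| 4 | CDM | CDM | 0 |
| ... | ... | ... | ... |
| 73 | South Korea | KR | 0 |
| 74 | Taiwan | TW | 0 |
| 75 | United States | US | 0 |
| 76 | Malta CP0 | MT0 | 0 |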


*Note: the folder layout could be made more consistent, either with separate yearly, monthly, weekly, and daily subfolders, or with all files together in the general registry folder. Currently, yearly and monthly files go to the main registry folder, while weekly and daily files go to their own subfolders.*

In [None]:
for i in range(0,len(registries)):
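  #Select by column name; positional [0],[1],[2] would pick up the "Unnamed: 0" index column first
  registry = registries.loc[i, "Country"]
  code = registries.loc[i, "Code"]
  large = registries.loc[i, "Large"]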

  folderbasic = workingdir +code+"/TransactionsBasic/"
  createFolder(folderbasic)
  folderdetails = workingdir +code+"/DetailsAll/"
  createFolder(folderdetails)

  for year in range(2005,2019):
    output = getTransactionsXML("01","01",str(year),"01","01",str(year+1), code,folderbasic)
    print(output)
    if("exceeds the predefined limit" in output):
      print("Attempting monthly...")

      for month in range(1,13):                   #Attempt monthly, 1,13
        output = getMonth(year,month,code)
        print(output)
        if("exceeds the predefined limit" in output):
          print("Attempting weekly...")

          for week in range(1,5):                 #Attempt weekly
            output = getWeek(year,month,week,code)
            print(output)
            if("exceeds the predefined limit" in output):
              wkstart = (7*(week-1)) + 1
              wkend = (7*week)
              lastday = getLastofMonth(month,year)
              if(week==4):
                wkend = lastday
              for day in range(wkstart,wkend+1):      #Attempt daily
                output = getDay(year,month,day,code)
                print(output)
                if("exceeds the predefined limit" in output):
                  print("High volume day!")
                  saveHighVolumeDay([year,month,day,code])
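
The nested loops above narrow the query window (year, then month, week, and day) whenever the server reports that a result exceeds the predefined limit. The sketch below isolates that drill-down logic for a single year so it can be exercised offline. `fetch` is a hypothetical callable standing in for the download functions, and `fake_fetch` simulates a server with one very busy day; neither is part of the notebook.

```python
import calendar

LIMIT_MESSAGE = "exceeds the predefined limit"

def too_large(output):
    return LIMIT_MESSAGE in output

def find_high_volume_days(fetch, year, days_in_month):
    """Narrow the window for one year; return days that exceed the limit on their own."""
    high_volume = []
    if not too_large(fetch("year", year, None, None, None)):
        return high_volume
    for month in range(1, 13):
        if not too_large(fetch("month", year, month, None, None)):
            continue
        for week in range(1, 5):
            if not too_large(fetch("week", year, month, week, None)):
                continue
            first = 7 * (week - 1) + 1
            # Week 4 absorbs the remaining days of the month
            last = days_in_month(year, month) if week == 4 else 7 * week
            for day in range(first, last + 1):
                if too_large(fetch("day", year, month, None, day)):
                    high_volume.append((year, month, day))
    return high_volume

# Simulated server: a window is "too large" if it contains a busy day
busy = {(2015, 3, 30)}

def fake_fetch(level, year, month, week, day):
    def hit(m, d):
        if level == "year":
            return True
        if level == "month":
            return m == month
        if level == "week":
            return m == month and min((d - 1) // 7 + 1, 4) == week
        return m == month and d == day
    found = any(y == year and hit(m, d) for (y, m, d) in busy)
    return LIMIT_MESSAGE if found else "No error"

def days_in_month(year, month):
    return calendar.monthrange(year, month)[1]

print(find_high_volume_days(fake_fetch, 2015, days_in_month))
```

Separating the narrowing logic from the network calls makes it possible to check the week and day ranges before starting a long download run.
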