Before you turn this assignment in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). Do NOT add any cells to the notebook!

Do not forget to submit both the notebook AND the files in the data/ subfolder according to the CoC!

Make sure you fill in any place that says `YOUR CODE HERE` or _YOUR ANSWER HERE_ , as well as your name and group below:

In [1]:
NAME = "Christoph Helmberger"
STUDENTID = "11915039"
GROUPID = "3";

# Assignment 2 (Group)
When carrying out a data science project, screening and selecting appropriate data sources for the tasks at hand comes at the beginning. This assignment is about accessing and characterising potential data sources in teams of three. The teams have been randomly assigned.

-----
## Step 0 (2 points)

Find two data sets online (from one or several sources) that would be interesting to combine. The data sets should fulfill the following requirements:

* Each data set must have a different file format (either CSV, XML, or JSON), please choose 
 - one CSV file (dataset1) 
 - and one JSON or XML file (dataset2)

* Workable data-set sizes: The selected or extracted data sets should have thousands of entries (>= 1000), but not more than (<=) 10000 entries. *If larger, use an excerpt from the original data set. Justify in detail the extraction criteria in the markdown cell below and add the code used for the extraction in the code cell.*
* You may start from (but you are not limited to) the resource collections hinted at [in the Unit 2 slides](https://datascience.ai.wu.ac.at/ws21/dataprocessing1/unit2.html#slide-53).

* Important: The use of datasets from kaggle.com and other curated collections (as highlighted to you in Unit 2) of datasets with accompanying tutorials on processing and analysis is discouraged. You are required to use primary data sources. See the policy on kaggle.com & friends at this assignment's submission site at MyLearn.

* Please adhere to the CoC - It is advised to already do so while working on the assignments.


[Data citations](http://blogs.nature.com/scientificdata/2016/07/14/data-citations-at-scientific-data/) must contain the following details:
- creator: provider organisation / author(s) of the data set, e.g. "Zentralanstalt für Meteorologie und Geodynamik (ZAMG)"
- catalogName: Name of the data repository and/or the Open Data portal used, e.g. "Open Data Österreich"
- catalogURL: URL of the repository / portal, e.g. "https://www.data.gv.at/"
- datasetID: (specific to the data repository), e.g. "https://www.data.gv.at/katalog/dataset/zamg_meteorologischemessdatenderzamg"
- resourceURL: a URL where the CSV, XML or JSON file can be downloaded, e.g. "https://www.football-data.co.uk/new/JPN.csv"
- pubYear: Dataset publication year, i.e. since when it is published, e.g. "2012"
- lastAccessed: when have you last accessed the dataset (i.e. datetime of accessing, obtaining a copy of the data set) in ISO Format? e.g. "2021-03-08T13:55:00"
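
The ISO format asked for in `lastAccessed` does not have to be typed by hand. As a minimal sketch, the standard `datetime` module can produce it; the helper name `iso_timestamp` is made up for this illustration and is not part of the assignment.

```python
from datetime import datetime

def iso_timestamp(moment):
    """Format a datetime as ISO 8601 with second precision."""
    return moment.isoformat(timespec="seconds")

# fixed example value taken from the assignment text
example = iso_timestamp(datetime(2021, 3, 8, 13, 55))
print(example)  # 2021-03-08T13:55:00

# when a file is actually downloaded, the current time would be used instead
now = iso_timestamp(datetime.now())
```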

Store the data citation in a dictionary for each of the datasets:

In [2]:
dataset1= {
    "creator" : "European Centre for Disease Prevention and Control" ,
    "catalogName" : "data.europa.eu" ,
    "catalogURL" : "https://data.europa.eu/" ,
    "datasetID" : "https://data.europa.eu/data/datasets/34ce6bfa-87f3-4f07-82a7-fb5decea1a18?locale=de" ,
    "resourceURL" : "https://opendata.ecdc.europa.eu/covid19/agecasesnational/csv/data.csv"  ,
    "pubYear" : "2021"  ,
    "lastAccessed" : "2021-10-23T17:35:00"
}

dataset2= {
    "creator" : "European Centre for Disease Prevention and Control" ,
    "catalogName" : "data.europa.eu" ,
    "catalogURL" : "https://data.europa.eu/" ,
    "datasetID" : "https://data.europa.eu/data/datasets/covid-19-testing?locale=en" ,
    "resourceURL" : "https://opendata.ecdc.europa.eu/covid19/testing/json/"  ,
    "pubYear" : "2021"  ,
    "lastAccessed" : "2021-10-23T17:35:00"
}

In [3]:
from nose.tools import assert_equal, assert_in, assert_true

assert_equal(type(dataset1), dict)
assert_equal(type(dataset2), dict)

Use the following structure for your answer below:

**Data set 1**

*(Describe the source and the general content of the dataset and why you chose it)*

**Data set 2**

*(Describe the source and the general content of the dataset and why you chose it)*

**Project ideas**

*(Describe in your own words, which kind of tasks could be addressed by combining the selected data sets, esp. how the two data sets fit together and what complementary information they contain; what question could be potentially answered by combining data from both datasets; how could the data sets be combined exactly? 250 words max.)*

**Data set 1**

It describes the data on the 14-day age-specific notification rate of new COVID-19 cases.
The data is structured as a CSV file with 8 columns. Because the file has over 17.000 rows, we exclude some data to bring the row count below 10.000. The data is updated weekly. For this task we exclude the age groups (15yr, 25-49yr, 65-79yr). We chose this data set because we believe it will give valuable insight into COVID-19 testing. Our excerpt of the data is in the data folder under the name dataset1_excerpt.csv. The code for the extraction is written as a comment in Step 1 - File Access.


**Data set 2**

The second data set contains data on COVID-19 testing by week and country in the EU. The file contains one dictionary for each week (starting with week 15 in 2020 and ending with week 42 in 2021 as of 01.11.2021; the data is updated weekly) and for each EU country on a national and subnational level. Each dictionary has 12 key-value pairs: country, country_code, year_week, level (subnational or national level data), region, region_name, new_cases, tests_done, population, testing_rate (per 100.000 population), positivity_rate and testing_data_source. We chose this data set because it can be linked well to data set 1 and provides additional information that can be used for predictions.

**Project ideas**

The idea for this project is to combine the two data sets in order to see what share of the new cases (in percent) belongs to each age group. This shows which age group is currently the most affected and how this changes over the weeks. The combined data could also be used to train a machine learning model that predicts the future COVID-19 cases per week or the number of cases per age group per week. Note that the data in both files is updated regularly.
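
How such a combination could look is sketched below. This is only an illustration under assumptions: it presumes that both data sets can be matched on `country_code` and `year_week`, that data set 1 has `age_group` and `new_cases` columns, and that the case counts of both files refer to the same period. The rows are made up and the function name `age_group_shares` is hypothetical.

```python
# made-up rows, only to illustrate the join; real values come from the two files
age_rows = [
    {"country_code": "AT", "year_week": "2021-W40", "age_group": "15-24yr", "new_cases": 300},
    {"country_code": "AT", "year_week": "2021-W40", "age_group": "25-49yr", "new_cases": 500},
]
testing_rows = [
    {"country_code": "AT", "year_week": "2021-W40", "new_cases": 1000, "tests_done": 50000},
]

def age_group_shares(age_rows, testing_rows):
    """Return {(country_code, year_week, age_group): share of the weekly cases in percent}."""
    # total cases per country and week, taken from the testing data
    totals = {(r["country_code"], r["year_week"]): r["new_cases"] for r in testing_rows}
    shares = {}
    for r in age_rows:
        key = (r["country_code"], r["year_week"])
        total = totals.get(key)
        if not total:  # no matching week, or zero cases reported
            continue
        shares[key + (r["age_group"],)] = 100 * r["new_cases"] / total
    return shares

shares = age_group_shares(age_rows, testing_rows)
```

Whether the weekly totals of data set 2 and the age-specific counts of data set 1 are directly comparable would have to be checked against the documentation of both files first.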

------
## Step 1 - File Access (3 points)

Write a Python function `accessData` that takes the dataset dictionary created in Step 0 as input and returns an extended version of that dictionary with the following additions:

* Write code that accesses the dataset from its `resourceURL`
 * detects whether it's an XML, CSV or JSON file by
     * checking whether the download URL ends with suffix ".xml", ".json", ".csv" or ".tsv" 
     * checking whether the "Content-Type" HTTP header field contains information about the format, hinting on XML, JSON or CSV.
 * Detects the file size (convert to KB) of each data set, clearly documenting your actions (e.g. through commented code).

The result of the code below should extend your dictionaries `dataset1` and `dataset2` with two keys named 
* `"detectedFormat"` (which has one of the following values: `"XML"`, `"JSON"`, `"CSV"`, or `"unknown"`, if nothing could be detected from checking the suffix or HTTP header, or if the information in both was inconsistent)
* and `"filesizeKB"` which contains the filesize in KB
* If the detected format is `"unknown"`, the expected filesize is 0
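
The detection rule described above (suffix hint, header hint, and `"unknown"` when the two disagree) can be sketched without any network access. This is a sketch under assumptions, not the reference solution: the function name `detect_format` is hypothetical, mapping a `.tsv` suffix to `"CSV"` is one possible reading of the task, and the header values in the usage lines are examples rather than what the servers really send.

```python
from urllib.parse import urlparse

# assumption: a ".tsv" suffix is reported as "CSV", since only three formats are allowed
SUFFIX_FORMATS = {"xml": "XML", "json": "JSON", "csv": "CSV", "tsv": "CSV"}

def detect_format(url, content_type):
    """Combine the URL suffix and the Content-Type header into one detected format."""
    # hint 1: suffix of the last path segment, e.g. ".../data.csv" -> "CSV"
    last_segment = urlparse(url).path.rstrip("/").split("/")[-1]
    suffix = last_segment.rsplit(".", 1)[-1].lower() if "." in last_segment else ""
    from_suffix = SUFFIX_FORMATS.get(suffix)

    # hint 2: media type without parameters, e.g. "text/csv; charset=utf-8" -> "text/csv"
    media_type = (content_type or "").split(";")[0].strip().lower()
    from_header = next(
        (fmt for key, fmt in (("xml", "XML"), ("json", "JSON"), ("csv", "CSV")) if key in media_type),
        None,
    )

    hints = {hint for hint in (from_suffix, from_header) if hint}
    # exactly one distinct hint: accept it; no hint or conflicting hints: "unknown"
    return hints.pop() if len(hints) == 1 else "unknown"

print(detect_format("https://example.org/files/data.csv", "text/csv"))
print(detect_format("https://example.org/testing/json/", "application/json; charset=utf-8"))
print(detect_format("https://example.org/files/data.csv", "application/json"))
```

Keeping the decision in a pure function makes the three cases (suffix only, header only, conflict) easy to check before any download happens.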

In [4]:
# Code for extraction of the data:
"""
import csv
with open('dataset1_valid.csv', 'r', newline='') as inp, open('dataset1_excerpt.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(inp):
        if row[3].strip() != "<15yr" and row[3].strip() != "80+yr" and row[3].strip() != "65-79yr":
            writer.writerow(row)
"""
import urllib.request

KNOWN_FORMATS = ("CSV", "JSON", "XML")

def accessData(datadict):
    url = datadict["resourceURL"]
    # a HEAD request fetches only the HTTP headers, not the file itself
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        header = resp.info()

    # 1) check the suffix of the last URL segment, e.g. "data.csv" -> "CSV"
    suffix = url.split("/")[-1].split(".")[-1].upper()
    if suffix in KNOWN_FORMATS:
        datadict["detectedFormat"] = suffix
    else:
        # 2) fall back to the Content-Type header; parameters such as
        #    "; charset=utf-8" are cut off before the subtype is compared
        contentType = header.get("Content-Type", "")
        subtype = contentType.split(";")[0].split("/")[-1].strip().upper()
        datadict["detectedFormat"] = subtype if subtype in KNOWN_FORMATS else "unknown"

    # Content-Length is given in bytes; 1 KB is taken as 1000 bytes here
    contentLength = header.get("Content-Length")
    if datadict["detectedFormat"] == "unknown" or contentLength is None:
        datadict["filesizeKB"] = 0
    else:
        datadict["filesizeKB"] = int(contentLength) / 1000

    return datadict

In [5]:
from nose.tools import assert_equal, assert_in, assert_true
dataset1= accessData(dataset1)
dataset2= accessData(dataset2)
assert_in(dataset1["detectedFormat"], ["XML", "JSON", "CSV", "unknown"])
assert_in(dataset2["detectedFormat"], ["XML", "JSON", "CSV", "unknown"])
assert_true(isinstance(dataset1["filesizeKB"], (int, float)))
assert_true(isinstance(dataset2["filesizeKB"], (int, float)))

Please explain your findings, using the following structure for your answer below (in "other remarks" you can explain for instance why you think your code did not detect the correct format, if needed)

**Data set 1**

*(format, size, other remarks)*


**Data set 2**

*(format, size, other remarks)*


**Data set 1**
Data set 1 is a CSV file. The size of the file is 1157.397KB (as of 01.11.2021). The size will differ after they update the dataset. Our code didn't have any issues detecting the format. When we only use the excerpt of the data set 1, our file size is 581.917KB (as of 01.11.2021).


**Data set 2**
Data set 2 is a JSON file. The size of the file is 3438.952KB.
At first, our code could not detect the JSON format from the URL, because the resource URL does not end with a .json file extension but with the path segment /json/. We fixed this by splitting the URL from the dataset2 dictionary above on "/" and then on ".", and by falling back to the Content-Type header when no known suffix is found.

-----
## Step 2  (5 points) - Format Validation

Establish that the two data files obtained are well-formed according to the detected data format (CSV, JSON, or XML). That is, the syntax used is valid according to accepted syntax definitions. Are there any violations of well-formedness?


Proceed as follows (for each data file, in turn): according to the "suspected" data format from Step 1:

  1. Use an _online validator_ for CSV, XML, and JSON, respectively, to confirm whether the files you downloaded in Step 1 are well-formed for the respective file format, document your findings and modify the file as described: 

   a. **Case 1**: no well-formedness errors were detected: 
    * Generally describe at least 3 well-formedness checks that your data sets, depending on its "suspected" format (against the background knowledge of Unit 2) should fulfill;
    * Store a local copy of the file called `dataset1_valid` (or, respectively, `dataset2_valid`) in the `data/` subfolder
    * Create another local copy of your data file called `dataset1_invalid` (or, respectively, `dataset2_invalid`) and introduce a selected well-formedness violation (one occurrence) therein;
    * document that the online validator you used finds the error you introduced

   b. **Case 2**: well-formedness errors occurred:
    * Document the occurrences by printing out the error message and describe the types of well-formedness violation that were reported to you.
    * Store a local copy called `dataset1_invalid` (or, respectively, `dataset2_invalid`) in the `data/` subfolder
    * Create another local copy called `dataset1_valid` (or, respectively, `dataset2_valid`), of your data file that fixes the well-formedness violations therein manually.  
    

  2. Write a Python function `parseFile(datadict, format)` that accesses the dataset from its `resourceURL`. The dataset should then be checked with the parser matching the parameter `format`, as follows:
     * CSV: Returns `True`, if a consistent delimiter out of `",",";","\t"` can be detected, such that each row has the same (> 1) number of elements, otherwise False
     * JSON: Returns `True` if the file can be parsed with the `json` package, catching any parsing exceptions.
     * XML: Returns `True` if the file can be parsed with the `xmltodict` package, catching any parsing exceptions.
     * Returns `False` if any other format is supplied by the parameter.
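
The CSV rule above (one consistent delimiter, same number of fields greater than 1 in every row) can be tried out on in-memory text. The following is a minimal sketch; the function name `detect_delimiter` and the sample strings are made up for illustration.

```python
import csv
import io

def detect_delimiter(text, candidates=(",", ";", "\t")):
    """Return the first candidate that splits every row into the same number (> 1) of fields, else None."""
    for delimiter in candidates:
        rows = csv.reader(io.StringIO(text), delimiter=delimiter)
        widths = {len(row) for row in rows}
        if len(widths) == 1 and widths.pop() > 1:
            return delimiter
    return None

well_formed = "a;b;c\n1;2;3\n4;5;6\n"
extra_field = "a;b;c\n1;2;3;99\n4;5;6\n"  # second line has one field too many

print(detect_delimiter(well_formed))
print(detect_delimiter(extra_field))
```

A check like this also rejects a file in which only a single row deviates, which matches the kind of error a CSV validator reports for rows of differing length.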
     
In order to handle parsing exceptions and errors from the used packages, you can use [catching exceptions](https://docs.python.org/3/tutorial/errors.html), so that the program does not simply fail while checking whether the file is parseable as the format specified in `format`.

Use the following structure for your answer in the cell below to document **Step 2.1**:

***Data set 1***

*(validator used, validation results, describe the modification to fix the file or to create an invalid version of it)*

***Data set 2***

*(validator used, validation results, describe the modification to fix the file or to create an invalid version of it)*


***Data set 1***
Case 1: Some well-formedness errors that could occur in a CSV file are: lines having a different number of fields than other lines in the file, records not each being on a separate line, or the fields within a record or the header not being separated by a consistent delimiter.
To validate the CSV file (dataset1), we used csvlint.io. There were no well-formedness errors detected. To modify the file (to create an invalid version of it), we just added an extra field in the third line, which caused the validator to detect a structural problem (Row 3 contains a different number of columns to the first row in the CSV file).


***Data set 2***
Case 1: Some well-formedness errors that could occur in JSON files are: a JSON object not being enclosed in {}, keys and values not being separated by : but something else or key value pairs not being separated by a comma (,).
To validate the JSON file (dataset2), we used jsonformatter.curiousconcept.com. There were no well-formedness errors detected. To modify the file (to create an invalid version of it), we removed a comma between two key-value pairs in one dictionary. The error we got is: Expecting comma or }, not string.
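
The same kind of violation can also be reproduced locally with the `json` package. The sketch below uses two made-up one-line documents; the helper name `is_well_formed_json` is hypothetical, and the wording of the error message comes from Python's parser, so it will differ from the message of the online validator.

```python
import json

valid_text = '{"country": "Austria", "year_week": "2021-W42"}'
# same object with the comma between the two key-value pairs removed
invalid_text = '{"country": "Austria" "year_week": "2021-W42"}'

def is_well_formed_json(text):
    """Return (True, None) if text parses as JSON, else (False, error description)."""
    try:
        json.loads(text)
        return True, None
    except json.JSONDecodeError as err:
        # msg, lineno and colno locate the first violation the parser ran into
        return False, f"{err.msg} (line {err.lineno}, column {err.colno})"

print(is_well_formed_json(valid_text))
print(is_well_formed_json(invalid_text))
```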


In [6]:
import requests
import csv
import json
import xmltodict
import urllib.request

def parseFile(datadict, format):
    url = datadict["resourceURL"]
    delimiters = [",", ";", "\t"]

    if format == "CSV":
        response = urllib.request.urlopen(url)
        try:
            lines = [line.decode('utf-8') for line in response.readlines()]
        except UnicodeDecodeError:
            return False

        # try each candidate delimiter; the file is accepted as soon as one of
        # them splits every row into the same number (> 1) of fields
        for delimiter in delimiters:
            try:
                rowLengths = {len(row) for row in csv.reader(lines, delimiter=delimiter)}
            except csv.Error:
                continue
            if len(rowLengths) == 1 and rowLengths.pop() > 1:
                return True
        return False
    elif format == "XML":
        response = requests.get(url)
        try:
            xmltodict.parse(response.content)
            return True
        except xmltodict.expat.ExpatError:
            return False
    elif format == "JSON":
        response = urllib.request.urlopen(url)
        try:
            json.loads(response.read())
            return True
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError
            return False
    else:
        return False

In [7]:
from nose.tools import assert_equal, assert_in, assert_true
assert_equal([parseFile(dataset1, "XML"),
    parseFile(dataset1, "JSON"),
    parseFile(dataset1, "CSV"),
    parseFile(dataset2, "XML"),
    parseFile(dataset2, "JSON"),
    parseFile(dataset2, "CSV")].count(True), 2)

-----
## Step 3 - Content analysis (5 points)

Similar to the Python function `parseFile(datadict,format)` above, now create a new Python function `describeFile(datadict)` that analyses the given file according to the respective format detected in Step 1 and returns a dictionary containing the following information:

* for CSV files: number of columns, number of rows, column number (from 0 to n) of the column which contains the longest text. That is, the resulting dictionary should have the following form:

    ```
    { "numberOfColumns":  ...,
       "numberOfRows":  ... ,
       "longestColumn" : ... }
    ```

* for JSON files: number of different attribute names, nesting depth, length of the longest list appearing in an attribute value. That is, the resulting dictionary should have the following form:

    ```
    { "numberOfAttributes": ... ,
      "nestingDepth":  ... ,
      "longestListLength" : ... }
     ```

  Here the `longestListLength` should be set to 0 if no list appears. [Nesting depth](https://www.tutorialspoint.com/find-depth-of-a-dictionary-in-python) is defined as follows: 
   * a flat JSON object with only atomic attribute values has depth 1. 
   * a JSON attribute with another object as value (or another object as member of a list value!) increases the depth by 1
   * and so on.


* for XML files: number of different element and attribute names (i.e. the sum of both), nesting depth, maximum number of child nodes in any element (including the root element). That is, the resulting dictionary should have the following form:

    ```
    { "numberOfElementsAttributes": ... ,
      "nestingDepth":  ... ,
      "maxChildren" : ... }
     ```

  Here the `maxChildren` should be set to 0 if only a root element appears. Nesting depth is defined as the nesting depth of elements.
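
The three XML values described above can be illustrated on a small in-memory document. This sketch uses the standard library's `xml.etree.ElementTree` instead of `xmltodict`, purely to keep the example self-contained. The function name `describe_xml` and the sample document are made up, and counting a same-named element and attribute separately is an assumption.

```python
import xml.etree.ElementTree as ET

def describe_xml(text):
    root = ET.fromstring(text)
    names = set()

    def walk(element):
        # returns (depth of this subtree, maximum number of children of any element in it)
        names.add(element.tag)
        # "@" keeps attribute names apart from element names of the same spelling
        names.update("@" + attr for attr in element.attrib)
        children = list(element)
        depth, max_children = 1, len(children)
        for child in children:
            child_depth, child_max = walk(child)
            depth = max(depth, 1 + child_depth)
            max_children = max(max_children, child_max)
        return depth, max_children

    depth, max_children = walk(root)
    return {"numberOfElementsAttributes": len(names), "nestingDepth": depth, "maxChildren": max_children}

sample = '<weeks><week id="1"><cases>5</cases><tests>9</tests></week><week id="2"/></weeks>'
result = describe_xml(sample)
root_only = describe_xml("<root/>")
```

For `sample`, the names are `weeks`, `week`, `@id`, `cases` and `tests`, the deepest path is `weeks/week/cases`, and no element has more than two children.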
  
For files that cannot be parsed with the respective given format, the function should simply return an empty dictionary (`{}`).
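
The JSON nesting-depth and longest-list definitions given above can be written as two small recursive functions. This is a sketch of one possible reading of those definitions; the function names and the sample objects are made up for illustration.

```python
def nesting_depth(value):
    """A flat object has depth 1; lists add no depth by themselves, objects inside them do."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((nesting_depth(v) for v in value), default=0)
    return 0  # atomic value

def longest_list_length(value):
    """Length of the longest list that appears as an attribute value, 0 if there is none."""
    if isinstance(value, dict):
        lengths = [len(v) for v in value.values() if isinstance(v, list)]
        lengths += [longest_list_length(v) for v in value.values()]
        return max(lengths, default=0)
    if isinstance(value, list):
        return max((longest_list_length(v) for v in value), default=0)
    return 0

flat = {"country": "Austria", "new_cases": 10}
nested = {"country": "Austria", "regions": [{"name": "Wien", "weeks": [1, 2, 3]}, {"name": "Tirol"}]}

print(nesting_depth(flat), longest_list_length(flat))      # 1 0
print(nesting_depth(nested), longest_list_length(nested))  # 2 3
```

Under this reading, a top-level list of flat objects has depth 1 and a longest list length of 0, because the outer list is not an attribute value.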

In [8]:
def describeFile(datadict):
    if parseFile(datadict, datadict["detectedFormat"]) == False:
        return {}
    
    url=datadict["resourceURL"]
    
    # describe CSV file
    if datadict["detectedFormat"] == "CSV":
        response = urllib.request.urlopen(url)
        lines = [line.decode('utf-8') for line in response.readlines()]
        cr = csv.reader(lines)

        maxColumn = {"columnNumber": 0, "length": 0}
        rowCount = 0
        columnCount = 0
        for row in cr:
            rowCount += 1  # the header line is counted as a row as well
            columnCount = len(row)

            # determine the longest column; the header (first row) is skipped
            if rowCount > 1:
                for i, value in enumerate(row):
                    if len(value) > maxColumn["length"]:
                        maxColumn["length"] = len(value)
                        maxColumn["columnNumber"] = i
        # note: longestColumn is reported 1-based here (column index + 1)
        return {"numberOfColumns": columnCount, "numberOfRows": rowCount, "longestColumn" : maxColumn["columnNumber"]+1}
    
    
    # describe JSON file
    elif datadict["detectedFormat"] == "JSON":
        response = urllib.request.urlopen(url)
        data = json.loads(response.read())

        attributeNames = set()

        # recursive function that collects the attribute names and returns
        # (nestingDepth, longestListLength) for a parsed JSON value
        def getElementInfos(element):
            if isinstance(element, dict):
                attributeNames.update(element.keys())
                maxDepth = 1
                maxList = 0
                for value in element.values():
                    # a list that is an attribute value counts towards longestListLength
                    if isinstance(value, list):
                        maxList = max(maxList, len(value))
                    childDepth, childList = getElementInfos(value)
                    maxDepth = max(maxDepth, 1 + childDepth)
                    maxList = max(maxList, childList)
                return maxDepth, maxList
            elif isinstance(element, list):
                # a list adds no depth by itself; only the objects inside it do
                infos = [getElementInfos(member) for member in element]
                return (max((d for d, _ in infos), default=0),
                        max((n for _, n in infos), default=0))
            else:
                # atomic value
                return 0, 0

        maxNestingDepth, maxListLength = getElementInfos(data)

        return {"numberOfAttributes": len(attributeNames), "nestingDepth": maxNestingDepth, "longestListLength" : maxListLength}
    
    
    # describe XML file
    elif datadict["detectedFormat"] == "XML":
        # xmltodict maps elements to dicts, attributes to "@name" keys,
        # text content to "#text" and repeated child elements to lists
        with urllib.request.urlopen(url) as f:
            data = xmltodict.parse(f.read())

        names = set()

        # recursive function that collects element and attribute names and returns
        # (nestingDepth, maxChildren) for the element(s) stored in node
        def getElementInfos(node):
            if isinstance(node, list):
                # several sibling elements sharing one tag name
                infos = [getElementInfos(member) for member in node]
                return (max((d for d, _ in infos), default=1),
                        max((c for _, c in infos), default=0))
            if not isinstance(node, dict):
                # element with text content only, or an empty element
                return 1, 0
            maxDepth = 1
            childCount = 0
            maxChildren = 0
            for key, value in node.items():
                if key == "#text":
                    continue
                names.add(key)
                if key.startswith("@"):
                    continue  # attribute, not a child element
                childCount += len(value) if isinstance(value, list) else 1
                childDepth, childMax = getElementInfos(value)
                maxDepth = max(maxDepth, 1 + childDepth)
                maxChildren = max(maxChildren, childMax)
            return maxDepth, max(maxChildren, childCount)

        # the parsed document has exactly one top-level key: the root element
        rootName, root = next(iter(data.items()))
        names.add(rootName)
        maxNestingDepth, maxChildren = getElementInfos(root)

        return {"numberOfElementsAttributes": len(names), "nestingDepth": maxNestingDepth, "maxChildren" : maxChildren}
    
    

In [9]:
from nose.tools import assert_equal, assert_in, assert_true
assert_equal(len(describeFile(dataset1)), 3)
assert_equal(len(describeFile(dataset2)), 3)

Use the following structure for your answer below:

**Data set 1**

*(number and types of items etc.)*


**Data set 2**

*(number and types of items etc.)*

**Data set 1**

At the time of writing (01.11.2021), data set 1 has 8 columns, 17101 rows and the longest length of a column is 7.

**Data set 2**

At the time of writing (01.11.2021), data set 2 has 12 attributes, a nesting depth of 1 and the longest list length is 0, meaning there are no lists (nested attributes).