# Assignment 2 

-----
## Step 0 (2 points)

Find two data sets online (from one or several sources) that would be interesting to combine and create ***data citations*** as Python dictionaries. 

Store the data citation in a dictionary for each of the datasets:

In [10]:
# YOUR CODE HERE


dataset1= {
    "creator" : "European Environment Agency (EEA)" ,
    "catalogName" : "Eurostat" ,
    "catalogURL" : "https://ec.europa.eu/eurostat/de/home" ,
    "datasetID" : "https://ec.europa.eu/eurostat/databrowser/view/env_air_emis__custom_13506018/default/table?lang=en" ,
    "resourceURL" : "https://github.com/crazy-donuts/Air-pollutants/raw/main/Air%20pollutants.csv"  ,
    "pubYear" : "2024"  ,
    "lastAccessed" : ""  ,
}

dataset2= {
    "creator" : "Eurostat" ,
    "catalogName" : "Eurostat" ,
    "catalogURL" : "https://ec.europa.eu/eurostat/de/home" ,
    "datasetID" : "https://ec.europa.eu/eurostat/databrowser/view/road_tf_vehmov__custom_13506734/default/table?lang=en" ,
    "resourceURL" : "https://github.com/crazy-donuts/Road-motor-vehicle-traffic-performance/raw/main/Road%20motor%20vehicle%20traffic%20performance.json"  ,
    "pubYear" : "2024"  ,
    "lastAccessed" : ""  ,
}


In [11]:
from nose.tools import assert_equal, assert_in, assert_true
import traceback
import sys
import os

assert_equal(type(dataset1), dict)
assert_equal(type(dataset2), dict)


Dataset 1: Air Pollutants by Source for European Countries
This dataset provides information on the trends in air pollutant levels across European countries from 1980 to 2022. No data condensation is required

Dataset 2: Road motor vehicle traffic performance by traffic, registration location and type of vehicle
This dataset captures the performance metrics (Million vehicle-kilometres) of road motor vehicle traffic. It provides information into traffic distribution and vehicle usage patterns across different regions and vehicle categories. No data condensation is required

Project Idea: 
We’re looking into whether there’s a link between how much traffic there is and the levels of air pollution in Europe. While we know that pollution can come from other sources like industry, we’re assuming here that traffic is the main contributor.

------
## Step 1 - File Access (3 points)

Write a Python function `accessData` that takes the dataset dictionary created in step 0 as an input and returns an extended dictionary including following additions:

* Write code that accesses the dataset from its `resourceURL` using the python `requests` package:
 * detects whether it's and XML, CSV or JSON file by
     * checking whether the download URL **ends** with suffix "xml", "json", "csv" (in either upper- or lowercase)
     * checking whether the "Content-Type" HTTP header field contains information about the format, hinting on XML, JSON or CSV, i.e., check whether the substring XML, JSON or CSV appears in the "Content-Type" header in either upper- or lowercase. 
 * Detects the file size from the HTTP header (converted to KB) of each data set, clearly documenting your actions (e.g. through commented code).

The result of the code below should extend your dictionaries `dataset1` and `dataset2` with two keys named 
* `"detectedFormat"` (which has one of the following values: `"XML"`, `"JSON"`, `"CSV"`, or `"unknown"`, if nothing could be detected from checking the suffix or HTTP header, or if the information in both was inconsistent)
* and `"filesizeKB"` which contains the filesize in KB (Conversion should be done accordingly to decimal SI prefixes) from the number of bytes in the header-information. If there is no respective header information return 0.
* If the detected format is `"unknown"`, the expected filesize to be returned is also 0


In [12]:
# YOUR CODE HERE 
import requests

def accessData(datadict):
    resourceURL = datadict["resourceURL"]
    #we retrieve the header info
    r = requests.get(resourceURL)
    rhead = r.headers 
    #lets check if the sufix indicates the file type
    suffix = resourceURL.split('.')[-1].lower()
    if "xml" in suffix:
        datadict["detectedFormat"] = "XML"
    elif "json" in suffix:
        datadict["detectedFormat"] = "JSON"
    elif "csv" in suffix:
        datadict["detectedFormat"] = "CSV"
    else:
        content_type = rhead.get("Content-Type", "").lower()
        if "xml" in content_type:
            datadict["detectedFormat"] = "XML"
        elif "json" in content_type:
            datadict["detectedFormat"] = "JSON"
        elif "csv" in content_type:
            datadict["detectedFormat"] = "CSV"
        else:
            datadict["detectedFormat"] = "unknown"
   
    
    if datadict["detectedFormat"] != "unknown":
        file_size = int(r.headers.get("Content-Length", 0))
        filesizeKB = int(file_size / 1024)
        datadict["filesizeKB"] = filesizeKB
    else:
        datadict["filesizeKB"] = 0
    return datadict


In [13]:
# Basic tests to see if your solution meets the foundational demands described in the task description
from nose.tools import assert_equal, assert_in, assert_true
dataset1= accessData(dataset1)
dataset2= accessData(dataset2)
assert_in(dataset1["detectedFormat"], ["XML", "JSON", "CSV", "unknown"])
assert_in(dataset2["detectedFormat"], ["XML", "JSON", "CSV", "unknown"])
assert_true(isinstance(dataset1["filesizeKB"], (int, float)))
assert_true(isinstance(dataset2["filesizeKB"], (int, float)))


Data Set 1
Format: CSV
Size: 47 KB


Data Set 2
Format: JSON
Size: 11 KB

-----
## Step 2  (5 points) - Format Validation

Establish that the two data files obtained are well-formed according to the detected data format (CSV, JSON, or XML). That is, the syntax used is valid according to accepted syntax definitions. Are there any violations of well-formedness?


Proceed as follows (for each data file, in turn): according to the "suspected" data format from Step 1:

  1. Use an _online validator_ for CSV, XML, and JSON, respectively, to confirm whether the files you downloaded in Step 1 are well-formed for the respective file format, document your findings and modify the file as described: 

   a. **Case 1**: no well-formedness errors were detected: 
    * Generally describe at least 3 well-formedness checks that your data sets, depending on its "suspected" format (against the background knowledge of Unit 2) should fulfill;
    * Store a local copy of the file called `data_notebook-[notebook-nr.]_[name].[file extension]` in the `data/` subfolder
    * Create another local copy of your data file called `data_notebook-[notebook-nr.]_[name]-invalid.[file extension]` and introduce a selected well-formedness violation (one occurrence) therein;
    * document that the online validator you used finds the error you introduced

   b. **Case 2**: well-formedness errors occurred:
    * Document the occurrences by printing out the error message and describe the types of well-formedness violation that were reported to you.
    * Store a local copy called `data_notebook-[notebook-nr.]_[name]-invalid.[file extension]`  in the `data/ subfolder`
    * Create another local copy called `data_notebook-[notebook-nr.]_[name].[file extension]`, of your data file that fixes the well-formedness violations therein manually.  
    
**Please note that the datasets in the `data/` subfolder are for documentation only. Do not access those for subsequent steps!**
    

  2. Write a Python function `parseFile(datadict, format)` that that accesses the dataset from its `resourceURL`. The dataset should then be checked accordingly the given parser for the parameter `format` to check the following:
     * CSV: Returns `True`, if a consistent delimiter out of `",",";","\t"` can be detected, such that each row has the same (> 1) number of elements, otherwise False
     * JSON: Returns `True` if the file can be parsed with the `json` package, catching any parsing exceptions.
     * XML: Returns `True` if the file can be parsed with the `xmltodict` package, catching any parsing exceptions.
     * Returns `False` if any other format is supplied by the parameter.
     
In order to handle parsing exceptions and errors from the used packages, you can use [catching exceptions](https://docs.python.org/3/tutorial/errors.html), such that the program does not simply fail to check whether the file is parseable as the format specified in `format`    

Data set 1

Validator Used: I used the parseFile function with the format set to "CSV."
Validation Results: The result was True, which shows the dataset was correctly identified as a CSV file.
Modification Description: To create an invalid version of this file, one could edit it by removing some of the delimiters or making the rows uneven in length. Adding some random characters that don’t fit the CSV structure would also help.

Data set 2

Validator Used: I used the parseFile function with the format set to "JSON."
Validation Results: The result was True, which shows the dataset was correctly identified as a JSON.
Modification Description: To make an invalid version, one could make syntax errors like deleting commas or brackets. This would mess up the JSON structure and help to check whether the validation function catches the errors.

In [5]:
import requests
import csv
import json
#import xmltodict

def parseFile(datadict, format):
    resourceURL = datadict["resourceURL"]
    response = requests.get(resourceURL)
    content = response.text.strip()

    if format == "JSON":
        if not content:
            return False                                          # empty links
        try:
            json_obj = json.loads(content)                        # Try to load JSON
            if type(json_obj) == dict or type(json_obj) == list:  # Check if it's a valid JSON object (dict or list)
                return True
        except Exception:
            return False                                          # JSON parsing failed

    elif format == "XML":
        if not content:
            return False 
        try:
            xmltodict.parse(content)                              # Try to parse XML
            return True
        except Exception:
            return False                                          # XML parsing failed

    elif format == "CSV":
        if not content or content.startswith('{') or content.startswith('['):
            return False                                          # Skip CSV check if content looks like JSON

        delimiters = [',', ';', '\t']
        for delimiter in delimiters:
            rows = content.splitlines()
            reader = csv.reader(rows, delimiter=delimiter)
            try:
                row_length = []
                for row in reader:
                    if row:                                       # Only counts non-empty rows
                        row_length.append(len(row))
                                                                  # Check if all rows have the same length and are non-trivial
                if row_length and len(set(row_length)) == 1 and row_length[0] > 1:
                    return True
            except Exception:
                continue                                          # If there's an error, try the next delimiter
        return False                                              # No valid CSV found

    return False                                                  # If format is none of the above


results = [
    parseFile(dataset1, "XML"),
    parseFile(dataset1, "JSON"),
    parseFile(dataset1, "CSV"),
    parseFile(dataset2, "XML"),
    parseFile(dataset2, "JSON"),
    parseFile(dataset2, "CSV")
]
print("Results:", results)
assert_equal(results.count(True), 2)

Results: [False, False, True, False, True, False]


In [6]:
from nose.tools import assert_equal, assert_in, assert_true
assert_equal([parseFile(dataset1, "XML"),
    parseFile(dataset1, "JSON"),
    parseFile(dataset1, "CSV"),
    parseFile(dataset2, "XML"),
    parseFile(dataset2, "JSON"),
    parseFile(dataset2, "CSV")].count(True), 2)

-----
## Step 3 - Content analysis (5 points)

Similar to the Python function `parseFile(datadict,format)` above, now create a new Python function `describeFile(datadict)` that analyses the given file according to the respective format detected in Step 1 and returns a dictionary containing the following information:

* for CSV files: number of columns, number of rows, column number (from 0 to n) of the column which contains the longest text. You do not have to try to transform any string to integer or float, simply take the values as is from the csv file. That is, the resulting dictionary should have the following form:

    ```
    { "numberOfColumns:"  ...,
       "numberOfRows":  ... ,
       "longestColumn" : ... }
    ```

* for JSON files: number of distinct attribute names, nesting depth, length of the longest list appearing in an attribute value. That is, the resulting dictionary should have the following form:

    ```
    { "numberOfAttributes:" ... ,
      "nestingDepth":  ... ,
      "longestListLength" : ... }
     ```

  Here the `longestListLength` should be set to 0 if no list appears. [Nesting depth](https://www.tutorialspoint.com/find-depth-of-a-dictionary-in-python) is defined as follows: 
   * a flat list of atomic values has depth 0, a flat JSON object with only atomic attribute values has depth 1. 
   * a JSON attribute with another object as value (or another object as member of a list value!) increases the depth by 1
   * and so on.


* for XML files: number of different element and attribute a names (i.e. the sum of both), nesting depth, maximum numeric value in the dataset. That is, the resulting dictionary should have the following form:

    ```
    { "numberOfElementsAttributes:" ... ,
      "nestingDepth":  ... ,
      "maxNumericValue" : ... }
     ```

  Here the `maxNumericValue` should be set to 0 if there are no numberic values present. Nesting depth is defined as the nesting depth of elements.
  
For files that cannot be parsed with respective given format, the function should simply return an empty dictionary (`{}`).

In [7]:
import codecs
import urllib.request

def describeFile(datadict):
   
    resourceURL = datadict["resourceURL"]
    detectedFormat = datadict["detectedFormat"]
    # Check file type 
    if detectedFormat == "CSV":
        return describeCSV(resourceURL)
    elif detectedFormat == "JSON":
        return describeJSON(resourceURL)
    elif detectedFormat == "XML":
        return describeXML(resourceURL)
    else:
        return {}

def describeCSV(resourceURL):
    
    with urllib.request.urlopen(resourceURL) as f:
        dialect = csv.Sniffer().sniff(f.read(5000).decode("utf-8"))
        delimiter = dialect.delimiter

    
    with urllib.request.urlopen(resourceURL) as f:
        csv_reader = csv.reader(codecs.iterdecode(f, "utf-8"), delimiter=delimiter)
        data = list(csv_reader)
    # Calculate Number of rows and columns
    numberOfColumns = len(data[0])
    numberOfRows = len(data)
    # Find column with the longest text
    longestColumn = 0
    Length = 0
    for col in range(numberOfColumns):
        for i in data:
            if len(i[col]) > Length:
                Length = len(i[col])
                longestColumn = col
    # Return the results as dictionary
    return {
        "numberOfColumns": numberOfColumns,
        "numberOfRows": numberOfRows,
        "longestColumn": longestColumn
    }

def describeJSON(resourceURL):
    # Open URL and read JSON content
    resp = urllib.request.urlopen(resourceURL)
    data = json.loads(resp.read())
    
    # Find depth with recursion
    def findNestingDepth(datJ):
        if isinstance(datJ, dict):
            return 1 + (max([findNestingDepth(x) for x in datJ.values()]) if datJ else 0)
        if isinstance(datJ, list):
            return 1 + (max([findNestingDepth(i) for i in datJ]) if datJ else 0)
        return 0
    
    # Find longest list also checking if the dictonairies contain lists 
    def findLongestList(datJ):
        if isinstance(datJ, list):
            return max(len(datJ), max([findLongestList(i) for i in datJ] if datJ else [0]))
        if isinstance(datJ, dict):
            return max([findLongestList(x) for x in datJ.values()] if datJ else [0])
        return 0

    numberOfAttributes = len(data)
    nestingDepth = findNestingDepth(data)
    longestListLength = findLongestList(data)

    return {
        "numberOfAttributes": numberOfAttributes,
        "nestingDepth": nestingDepth,
        "longestListLength": longestListLength
    }

def describeXML(resourceURL):
    resp = urllib.request.urlopen(resourceURL)
    data = xmltodict.parse(resp.read())

    # Find depth with recursion
    def findNestingDepth(datX):
        if isinstance(datX, dict):
            return 1 + max([findNestingDepth(x) for x in datX.values()] if datX else 0)
        if isinstance(datX, list):
            return 1 + max([findNestingDepth(i) for i in datX] if datX else 0)
        return 0
    # Function to find the maximum numeric Value
    def findMaxValue(datX):
        maxValue = 0
        if isinstance(datX, dict):
            for x in datX.values():
                maxValue = max(max_value, findMaxValue(x))
        elif isinstance(datX, list):
            for i in datX:
                maxValue = max(max_value, findMaxValue(i))
        else:
            if isinstance(datX, (int, float)):
                maxValue = max(max_value, datX)
        return maxValue
    
    numberOfElementsAttributes = len(data)
    nestingDepth = findNestingDepth(data)
    maxNumericValue = findMaxValue(data)
    
    return {
        "numberOfElementsAttributes": numberOfElementsAttributes,
        "nestingDepth": nestingDepth,
        "maxNumericValue": maxNumericValue
    }

In [8]:
from nose.tools import assert_equal, assert_in, assert_true
assert_equal(len(describeFile(dataset1)), 3)
assert_equal(len(describeFile(dataset2)), 3)

Dataset 1:
The dataset contains of 10 Columns and 6670 Rows. The 0 Column contains the longest value.

Dataset 2:
The second dataset contains of 11 Attributes. The deepest level of nested elements is 5 levels deep and the longest list contains 11 items.