# Documentation

## Notebook History

**Date | Version | Author | Comments**
- 2022-01-18 | 0.2 | Andre Buser | Updated basic analysis because GDPR article split was removed from the GDPR fine dataset. Split needs to be done here, now. Corrected df merge to use outer joins.
- 2022-01-15 | 0.1 | Andre Buser | Initial draft.

## Open tasks / Last Activities

Open tasks and/or the latest activities in this section:
- [ ] #TODO: SPLIT articles (see GDPR fine DCL notebook)
- [ ] #TODO: **Complete** Data Science Ethics Checklist

## Purpose

The objective of this **stage three** notebook is to conduct EDA.

## Data Science Ethics Checklist

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### C. Analysis
 - [**NA**] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [**NA**] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

*Data Science Ethics Checklist generated with [deon](http://deon.drivendata.org).*


# Setup Environment

In [1]:
!python --version

Python 3.7.12


## Install Modules

List `!pip install` commands for modules that are not part of the standard Google Colab environment. For a local environment, use the provided installation files and environments.

In [2]:
# document module versions
!pip install watermark

Collecting watermark
  Downloading watermark-2.3.0-py2.py3-none-any.whl (7.2 kB)
Collecting importlib-metadata<3.0
  Downloading importlib_metadata-2.1.2-py2.py3-none-any.whl (10 kB)
Installing collected packages: importlib-metadata, watermark
  Attempting uninstall: importlib-metadata
    Found existing installation: importlib-metadata 4.10.0
    Uninstalling importlib-metadata-4.10.0:
      Successfully uninstalled importlib-metadata-4.10.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
markdown 3.3.6 requires importlib-metadata>=4.4; python_version < "3.10", but you have importlib-metadata 2.1.2 which is incompatible.
Successfully installed importlib-metadata-2.1.2 watermark-2.3.0


## Import Modules

In [3]:
# Base libraries
import time
import datetime
import os
import sqlite3

# Scientific libraries
import numpy as np
import pandas as pd
#from empiricaldist import Cdf, Pmf

# Visual libraries
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
import missingno as msno  # Visualize missing values

# Helper libraries
#from tqdm.notebook import tqdm, trange
#from colorama import Fore, Back, Style
import gc # garbage collection to optimize memory usage, use gc.collect()
import warnings
warnings.filterwarnings('ignore')

# Visual setup
import altair as alt
import matplotlib.ticker as ticker
plt.style.use('ggplot')
rcParams['axes.spines.right'] = False
rcParams['axes.spines.top'] = False
rcParams['figure.figsize'] = [12, 9]
rcParams['font.size'] = 16
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)
custom_colors = ['#74a09e','#86c1b2','#98e2c6','#f3c969','#f2a553', '#d96548', '#c14953']
sns.set_palette(custom_colors)
%config InlineBackend.figure_format = 'retina'
%config Completer.use_jedi = False

# Pandas options
#pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.set_option('display.float_format',  '{:,}'.format)
pd.set_option('max_colwidth', 40)
pd.options.display.max_columns = None  # Possible to limit
pd.options.display.max_rows = None  # Possible to limit
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Load magic commands
%load_ext watermark


## Define Parameters

In [4]:
try:
    # Mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

    # Paths: Google Colabs Setup
    PATH_EXT = "/content/drive/MyDrive/MADS/SIADS591-592/Project/data/external/"
    PATH_RAW = "/content/drive/MyDrive/MADS/SIADS591-592/Project/data/raw/"
    PATH_INT = "/content/drive/MyDrive/MADS/SIADS591-592/Project/data/interim/"
    PATH_PRO = "/content/drive/MyDrive/MADS/SIADS591-592/Project/data/processed/"
    PATH_REP = "/content/drive/MyDrive/MADS/SIADS591-592/Project/reports/"
    PATH_FIGS = "/content/drive/MyDrive/MADS/SIADS591-592/Project/reports/figures/"

except ImportError:
    # Paths: Local Setup
    PATH_EXT = "../data/external/"
    PATH_RAW = "../data/raw/"
    PATH_INT = "../data/interim/"
    PATH_PRO = "../data/processed/"
    PATH_REP = "../reports/"
    PATH_FIGS = "../reports/figures/"

Mounted at /content/drive


In [5]:
# Set global seed
seed = 42

# Define available cpu cores
n_cpu = os.cpu_count()
print("Number of CPUs used:", n_cpu)

Number of CPUs used: 2


# Load Data

In [6]:
# Loading all tables from the sqlite database file
FILENAME = "project_GDPR-fines.sqlite"
data_path = os.path.join(PATH_PRO, FILENAME)

con = sqlite3.connect(data_path)
df_gdpr = pd.read_sql("select * from GDPR", con)
df_gdp = pd.read_sql("select * from GDP", con)
df_cpi = pd.read_sql("select * from CPI", con)
df_pop = pd.read_sql("select * from POP", con)
con.close()

In [7]:
# Merging all tables into one dataframe
df = pd.DataFrame()
df_gdp = df_gdp[['mapping_key','gdp','gdp_cat']]
df = df_gdpr.merge(df_gdp, on="mapping_key", how='outer')

df_cpi = df_cpi[['iso3','mapping_key','cpi_score','cpi_score_cat']]
df = df.merge(df_cpi, on="mapping_key", how='outer')

df_pop = df_pop[['mapping_key','population','population_cat']]
df = df.merge(df_pop, on="mapping_key", how='outer')
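
Because all three merges use `how='outer'`, a `mapping_key` that exists on only one side still produces a row, with missing values for the other side's columns. A minimal sketch of how `indicator=True` could be used to see where such rows come from; the two small frames are made-up stand-ins, not the project tables:

```python
import pandas as pd

# Toy stand-ins for the GDPR and GDP tables (values are invented for illustration)
fines = pd.DataFrame({
    "etid": ["ETid-1", "ETid-2", "ETid-3"],
    "mapping_key": ["GREECE-2021", "SPAIN-2020", "MALTA-2019"],
})
gdp = pd.DataFrame({
    "mapping_key": ["GREECE-2021", "SPAIN-2020", "SPAIN-2018"],
    "gdp": [1.0, 2.0, 3.0],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
merged = fines.merge(gdp, on="mapping_key", how="outer", indicator=True)
merge_counts = merged["_merge"].value_counts().to_dict()
print(merge_counts)
```

Rows tagged `right_only` are country/year keys without any fine, and rows tagged `left_only` are fines without a matching indicator value.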

In [8]:
df.head()

Unnamed: 0,etid,country,fine,controller_processor,article,violation_type,sector,summary,decision_date_imputed,decision_year,fine_cat,mapping_key,country_label,violation_type_label,sector_label,gdp,gdp_cat,iso3,cpi_score,cpi_score_cat,population,population_cat
0,ETid-986,GREECE,30000.0,Info Communication Services,"Art. 13 GDPR, Art. 14 GDPR, Art. 11 ...",Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,30000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000
1,ETid-985,GREECE,25000.0,Plus Real Advertisement,"Art. 13 GDPR, Art. 14 GDPR, Art. 11 ...",Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,25000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000
2,ETid-957,GREECE,30000.0,One Way Private Company,"Art. 28 (3) c) GDPR, Art. 32 (2), (4...",Insufficient technical and organisat...,"Media, Telecoms and Broadcasting",The Hellenic DPA has imposed a fine ...,No,2021.0,30000.0,GREECE-2021,11.0,7.0,6.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000
3,ETid-919,GREECE,20000.0,Καπα Λαμδα Ωμεγα Διαφημιστικη Εμπορι...,"Art. 6 GDPR, Art. 12 (2) GDPR, Art. ...",Insufficient legal basis for data pr...,Industry and Commerce,The Hellenic DPA has fined ΚΑΠΑ ΛΑΜΔ...,No,2021.0,20000.0,GREECE-2021,11.0,6.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000
4,ETid-897,GREECE,5000.0,Premiummedia Παραγωγη Οπτικο-Ακουστι...,"Art. 21 (3) GDPR, Art. 25 GDPR",Insufficient fulfilment of data subj...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,5000.0,GREECE-2021,11.0,3.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000


In [9]:
df.shape


(1021, 22)

In [10]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1021 entries, 0 to 1020
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   etid                   978 non-null    object 
 1   country                978 non-null    object 
 2   fine                   952 non-null    float64
 3   controller_processor   978 non-null    object 
 4   article                978 non-null    object 
 5   violation_type         978 non-null    object 
 6   sector                 978 non-null    object 
 7   summary                978 non-null    object 
 8   decision_date_imputed  978 non-null    object 
 9   decision_year          978 non-null    float64
 10  fine_cat               952 non-null    float64
 11  mapping_key            1021 non-null   object 
 12  country_label          978 non-null    float64
 13  violation_type_label   978 non-null    float64
 14  sector_label           978 non-null    float64
 15  gdp 

In [11]:
df.describe().round(0)

Unnamed: 0,fine,decision_year,fine_cat,country_label,violation_type_label,sector_label,gdp,gdp_cat,cpi_score,cpi_score_cat,population,population_cat
count,952.0,978.0,952.0,978.0,978.0,978.0,1016.0,1016.0,1012.0,1012.0,1021.0,1021.0
mean,1382942.0,2020.0,1382941.0,19.0,6.0,5.0,1022089522983.0,1021870078740.0,62.0,61.0,32681446.0,32690010.0
std,25363689.0,1.0,25363689.0,9.0,2.0,2.0,886300103286.0,886411177289.0,13.0,13.0,23372784.0,23376374.0
min,0.0,2018.0,0.0,0.0,0.0,0.0,6331996143.0,10000000000.0,42.0,40.0,37910.0,0.0
25%,3000.0,2020.0,3000.0,12.0,5.0,3.0,248715551367.0,250000000000.0,53.0,50.0,9684679.0,9700000.0
50%,10000.0,2020.0,10000.0,23.0,6.0,5.0,1281484640044.0,1280000000000.0,60.0,60.0,46736776.0,46700000.0
75%,50000.0,2021.0,50000.0,27.0,7.0,7.0,1320033318744.0,1320000000000.0,70.0,70.0,46771375.0,46800000.0
max,746000000.0,2021.0,746000000.0,30.0,9.0,10.0,3975347237443.0,3980000000000.0,89.0,90.0,84189092.0,84200000.0


# Project Objectives

**Objective**: The purpose of the project is to analyze GDPR fines issued since 2018 and to gain the following insights.

Basic insights regarding:
- Which industry sectors have been penalized the most?
- **CHANGE** Highest fined company and sector?
- Which EU countries have the most violations?
- **CHANGE** Which GDPR articles have been quoted the most?
- What are the “average costs” of a violation per sector?


Advanced insights: by correlating the GDPR fine dataset with the population (POP), gross domestic product (GDP), and corruption perception index (CPI) of each country, the project intends to verify the following assumptions:
- A higher GDP could lead to:
  - More violation cases, because a higher GDP could mean more companies in the country
  - Higher fines, because the maximum fine is linked to the total worldwide annual turnover
- A higher CPI could lead to:
  - Fewer violation cases, because the public sector may be influenced by the companies
- A higher population could lead to:
  - More reported cases, because more data subjects could exercise their rights
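
One way these assumptions could be checked is a rank correlation between a per-country case count and each country indicator. The sketch below is only an illustration: the `country_stats` frame and all of its numbers are invented, and the real analysis would first aggregate the merged dataframe per country.

```python
import pandas as pd

# Hypothetical per-country summary (numbers are invented for illustration)
country_stats = pd.DataFrame({
    "cases":      [5, 12, 20, 40],
    "gdp":        [1e11, 3e11, 9e11, 2e12],
    "cpi_score":  [80, 70, 60, 45],
    "population": [5e6, 1e7, 4e7, 6e7],
})

# Spearman's rank correlation of the case count against each indicator
rank_corr = country_stats.corr(method="spearman")["cases"].drop("cases")
print(rank_corr)
```

A coefficient close to +1 or -1 would be consistent with the assumed direction, while a value near 0 would not support it.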

# EDA

## Preparing dataset

In [12]:
# Drop the "Isle of Man" GDPR fines because of their missing values:
df = df[df['country'] != 'ISLE OF MAN']
assert len(df[df['country'] == 'ISLE OF MAN']) == 0

In [13]:
# Group features
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df.select_dtypes(include=['object']).columns.tolist()

numerical_features
categorical_features

['fine',
 'decision_year',
 'fine_cat',
 'country_label',
 'violation_type_label',
 'sector_label',
 'gdp',
 'gdp_cat',
 'cpi_score',
 'cpi_score_cat',
 'population',
 'population_cat']

['etid',
 'country',
 'controller_processor',
 'article',
 'violation_type',
 'sector',
 'summary',
 'decision_date_imputed',
 'mapping_key',
 'iso3']

## Univariate EDA for numerical features

In [14]:
def desc_num_feature(feature_name, df, bins=100, edgecolor="k", **kwargs):
    """Plot a histogram of one numerical feature next to its describe() summary."""
    fig, ax = plt.subplots(figsize=(8, 4))
    df[feature_name].hist(bins=bins, edgecolor=edgecolor, ax=ax, **kwargs)
    ax.set_title(feature_name, size=15)
    # Place the summary statistics to the right of the plot
    text = str(df[feature_name].describe())
    plt.figtext(1, 0.15, text)
    plt.show()


### Checking characteristics of all single numerical features for all years.

In [15]:
years_list = df['decision_year'].unique().tolist()
years_list = sorted(years_list)
years_list

[2018.0, 2019.0, 2020.0, 2021.0, nan]
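
The list ends with `nan` because some merged rows have no decision year. Since `nan == nan` evaluates to false, a filter such as `df['decision_year'] == year` matches no rows for that entry. A small sketch, on a toy series rather than the project data, of building the list without the missing value:

```python
import numpy as np
import pandas as pd

# Toy decision_year column with a missing value, as in the merged frame
decision_year = pd.Series([2019.0, 2018.0, np.nan, 2021.0, 2020.0, 2019.0])

# Drop missing years first, then cast to int and sort for cleaner labels
years = sorted(decision_year.dropna().unique().astype(int).tolist())
print(years)  # [2018, 2019, 2020, 2021]
```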

In [16]:
for year in years_list:
    print('-----'*20)
    print("Year:", year)
    print('-----'*20)
    print()

    # Filter on year
    df_eda = df[df['decision_year'] == year]#.set_index('decision_year')

    # Plot distribution and describe info via a custom function
    for numerical_feature in numerical_features:
        if not numerical_feature.endswith("_cat"):
            desc_num_feature(numerical_feature, df_eda)

    # Plot box plot
    for numerical_feature in numerical_features:
        if not numerical_feature.endswith("_cat"):
            fig, ax = plt.subplots(figsize=(8, 4))
            sns.boxplot(x=df_eda[numerical_feature], ax=ax)

    print()
    print()

Output hidden; open in https://colab.research.google.com to view.

## Answer Basic Questions

In [17]:
df_eda = df.copy()

In [18]:
#df_eda.head()
#df_eda.info()
df_eda.shape

(1019, 22)

### Which industry sectors have been penalized the most?

In [19]:
# Most penalized by count of GDPR cases:
print("Top 5 penalized sectors by count of GDPR cases")
#df_eda['sector'].value_counts().head(5).to_frame("count")
df_eda.groupby(['sector'])['etid'].count().sort_values(ascending=False).head(5).to_frame("count").reset_index()

Top 5 penalized sectors by count of GDPR cases


Unnamed: 0,sector,count
0,Industry and Commerce,211
1,"Media, Telecoms and Broadcasting",164
2,Public Sector and Education,130
3,"Finance, Insurance and Consulting",98
4,Health Care,80


In [20]:
# Most penalized by sum of GDPR fines:
print("Top 5 penalized sectors by sum of GDPR fines")
df_eda.groupby(['sector'])['fine'].sum().sort_values(ascending=False).head(5).to_frame("sum").reset_index()

Top 5 penalized sectors by sum of GDPR fines


Unnamed: 0,sector,sum
0,Industry and Commerce,767536542.0
1,"Media, Telecoms and Broadcasting",370180441.0
2,Transportation and Energy,53341369.0
3,Employment,47711677.0
4,"Finance, Insurance and Consulting",28536065.0


In [21]:
# Overview of sector
df_eda.groupby(['sector'])['fine'].agg(['sum','mean','size']).round(0).reset_index()#.sort_values(ascending=False).head(5).to_frame("sum")

Unnamed: 0,sector,sum,mean,size
0,Accomodation and Hospitalty,21461207.0,766472.0,28
1,Employment,47711677.0,701642.0,69
2,"Finance, Insurance and Consulting",28536065.0,303575.0,98
3,Health Care,12510933.0,158366.0,80
4,Individuals and Private Associations,1367646.0,19821.0,71
5,Industry and Commerce,767536542.0,3780968.0,211
6,"Media, Telecoms and Broadcasting",370180441.0,2285064.0,164
7,Public Sector and Education,12637213.0,100295.0,130
8,Real Estate,515970.0,19845.0,27
9,Transportation and Energy,53341369.0,1212304.0,44


### Highest fined company and sector?

In [22]:
# Highest single fine per company:
print("Top 5 highest fined companies")
df_eda.groupby(['controller_processor'])['fine'].max().sort_values(ascending=False).head(5).to_frame("max").reset_index()

Top 5 highest fined companies


Unnamed: 0,controller_processor,max
0,Amazon Europe Core S.À.R.L.,746000000.0
1,Whatsapp Ireland Ltd.,225000000.0
2,Google Llc,50000000.0
3,H&M Hennes & Mauritz Online Shop A.B...,35258708.0
4,Tim (Telecommunications Operator),27800000.0


In [23]:
# Highest single fine per sector:
print("Top 5 highest fined sectors")
df_eda.groupby(['sector'])['fine'].max().sort_values(ascending=False).head(5).to_frame("max").reset_index()

Top 5 highest fined sectors


Unnamed: 0,sector,max
0,Industry and Commerce,746000000.0
1,"Media, Telecoms and Broadcasting",225000000.0
2,Employment,35258708.0
3,Transportation and Energy,22046000.0
4,Accomodation and Hospitalty,20450000.0


### Which EU countries have the most violations?

In [24]:
# Most penalized by GDPR cases opened:
print("Top 5 penalized countries by count of GDPR cases")
#df_eda['sector'].value_counts().head(5).to_frame("count")
df_eda.groupby(['country','iso3'])['fine'].count().sort_values(ascending=False).head(5).to_frame("count").reset_index()

Top 5 penalized countries by count of GDPR cases


Unnamed: 0,country,iso3,count
0,SPAIN,ESP,352
1,ITALY,ITA,103
2,ROMANIA,ROU,69
3,HUNGARY,HUN,45
4,NORWAY,NOR,36


In [25]:
# Most penalized by sum of GDPR fines:
print("Top 5 penalized countries by sum of GDPR fines")
df_eda.groupby(['country','iso3'])['fine'].sum().sort_values(ascending=False).head(5).to_frame("sum").reset_index()

Top 5 penalized countries by sum of GDPR fines


Unnamed: 0,country,iso3,sum
0,LUXEMBOURG,LUX,746257900.0
1,IRELAND,IRL,225877900.0
2,ITALY,ITA,89804096.0
3,FRANCE,FRA,58194300.0
4,GERMANY,DEU,50159583.0


### Which violation types have been raised the most?

In [26]:
df_eda.groupby(['violation_type'])['violation_type'].count().sort_values(ascending=False).head(5).to_frame("count").reset_index()

Unnamed: 0,violation_type,count
0,Insufficient legal basis for data pr...,337
1,Non-compliance with general data pro...,197
2,Insufficient technical and organisat...,193
3,Insufficient fulfilment of data subj...,89
4,Insufficient fulfilment of informati...,82


In [27]:
#df.groupby(['violation_type','gdpr_article_short'])['violation_type'].count().sort_values(ascending=False).head(5).to_frame("count")

### What are the “average costs” of a violation per sector?

In [28]:
df_eda.groupby(['sector'])['fine'].mean().sort_values(ascending=False).head(5).to_frame("avg.").reset_index().round(0)

Unnamed: 0,sector,avg.
0,Industry and Commerce,3780968.0
1,"Media, Telecoms and Broadcasting",2285064.0
2,Transportation and Energy,1212304.0
3,Accomodation and Hospitalty,766472.0
4,Employment,701642.0
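
The mean is sensitive to a few very large fines: in the `df.describe()` output above, the mean fine is about 1.38 million while the median is 10,000. A hedged sketch, on invented toy values, of reporting the median next to the mean so that a single outlier does not define a sector's "average cost":

```python
import pandas as pd

# Toy fines: one very large fine dominates sector "A" (values are invented)
toy = pd.DataFrame({
    "sector": ["A", "A", "A", "B", "B", "B"],
    "fine":   [1_000, 2_000, 1_000_000, 5_000, 6_000, 7_000],
})

# Mean and median side by side, plus the number of fines per sector
summary = toy.groupby("sector")["fine"].agg(["mean", "median", "count"])
print(summary)
```

In the toy data, sector "A" has the higher mean but the lower median, which is the kind of reversal the mean alone would hide.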


### Which GDPR articles have been quoted the most?

In [29]:
# Duplicate the "Art." prefix after each separator so it survives the split on ", Art." (quick and dirty)
df['article'].replace(", Art.", ", Art. Art.", regex=True, inplace=True)

# Split articles
separator = ', Art.'
df['article'] = df['article'].str.split(separator)
df_split = df.explode('article')

# Create temp column to identify GDPR relevant articles
df_split['gdpr_article'] = df_split["article"].str.contains('GDPR')

# Remove all non-GDPR articles
df_split = df_split[df_split['gdpr_article'] == True]

# Extract GDPR article into separate column
pattern = r'(Art\.\s\d+)'
df_split['gdpr_article_short'] = df_split['article'].str.extract(pattern)

# Replace dummy Art. 00 with Unknown
df_split['gdpr_article_short'].replace("Art. 00", "Art. Unknown",inplace=True)
df_split['gdpr_article_short'].unique()

array(['Art. 13', 'Art. 14', 'Art. 28', 'Art. 32', 'Art. 6', 'Art. 12',
       'Art. 21', 'Art. 25', 'Art. 5', 'Art. 17', 'Art. 15', 'Art. 24',
       'Art. 33', 'Art. 34', 'Art. 31', 'Art. 58', 'Art. 29', 'Art. 7',
       'Art. 9', 'Art. 35', 'Art. 44', 'Art. 46', 'Art. 30', 'Art. 22',
       'Art. 37', 'Art. Unknown', 'Art. 16', 'Art. 8', 'Art. 38',
       'Art. 39', 'Art. 27', 'Art. 18', 'Art. 19', 'Art. 20', 'Art. 36'],
      dtype=object)

In [30]:
df_split.head()

Unnamed: 0,etid,country,fine,controller_processor,article,violation_type,sector,summary,decision_date_imputed,decision_year,fine_cat,mapping_key,country_label,violation_type_label,sector_label,gdp,gdp_cat,iso3,cpi_score,cpi_score_cat,population,population_cat,gdpr_article,gdpr_article_short
0,ETid-986,GREECE,30000.0,Info Communication Services,Art. 13 GDPR,Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,30000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 13
0,ETid-986,GREECE,30000.0,Info Communication Services,Art. 14 GDPR,Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,30000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 14
1,ETid-985,GREECE,25000.0,Plus Real Advertisement,Art. 13 GDPR,Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,25000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 13
1,ETid-985,GREECE,25000.0,Plus Real Advertisement,Art. 14 GDPR,Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,25000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 14
2,ETid-957,GREECE,30000.0,One Way Private Company,Art. 28 (3) c) GDPR,Insufficient technical and organisat...,"Media, Telecoms and Broadcasting",The Hellenic DPA has imposed a fine ...,No,2021.0,30000.0,GREECE-2021,11.0,7.0,6.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 28


In [31]:

#df['gdpr_article_short'].value_counts().to_frame("count").head(5)
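
The commented-out line refers to `df`, but the `gdpr_article_short` column exists only on `df_split`. A minimal sketch of how the most-quoted articles could be counted from an exploded frame; `toy_split` is a made-up stand-in, and counting each article at most once per case is an assumption about what "quoted" should mean here:

```python
import pandas as pd

# Toy version of the exploded frame: one row per (case, cited article)
toy_split = pd.DataFrame({
    "etid": ["ETid-1", "ETid-1", "ETid-2", "ETid-2", "ETid-3", "ETid-3"],
    "gdpr_article_short": ["Art. 13", "Art. 14", "Art. 13", "Art. 13", "Art. 5", "Art. 6"],
})

# Count each article at most once per case, then rank by number of cases
article_counts = (
    toy_split.drop_duplicates(["etid", "gdpr_article_short"])["gdpr_article_short"]
    .value_counts()
)
print(article_counts.head(5))
```

The de-duplication matters when a case cites several paragraphs of the same article, because they all shorten to the same `Art. N` label.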

Calculate the population and GDP ratios to normalize the fines by country size

In [32]:
df_display = df_split.groupby(['country', 'sector']).agg({'gdp':'mean', 'population':'mean', 'fine':'sum'}).sort_values(by=['fine'], ascending=False).reset_index()
df_display['ratio_pop'] = df_display['fine'] / (df_display['population'] / 100000) # Calculate ratio total fines per 100k population
df_display['ratio_gdp'] = df_display['fine'] / df_display['gdp'] * 100 # Calculate total fines as a percentage of GDP
df_display

Unnamed: 0,country,sector,gdp,population,fine,ratio_pop,ratio_gdp
0,IRELAND,"Media, Telecoms and Broadcasting",329493980013.3333,4990359.333333333,900451400.0,18043818.888659853,0.273283111261566
1,LUXEMBOURG,Industry and Commerce,64692749785.0,638549.0,746010600.0,116829029.56546795,1.153159515524217
2,ITALY,"Media, Telecoms and Broadcasting",1945369810134.0,60450189.5,418346618.0,692051.7891842176,0.0215047347718007
3,FRANCE,"Media, Telecoms and Broadcasting",2693947583327.6665,65301450.222222224,201500000.0,308568.9510941812,0.0074797297930756
4,GERMANY,Employment,3798878164314.6,83892622.6,91611416.0,109200.80116794442,0.0024115386710889
5,ITALY,Transportation and Energy,2004707022509.0,60512901.0,48568507.0,80261.4090506089,0.00242272344311
6,ITALY,Industry and Commerce,1956083363187.2727,60448073.77272727,44742000.0,74017.24688237546,0.0022873258288489
7,UNITED KINGDOM,Transportation and Energy,2707743777174.0,67886011.0,44092000.0,64950.05281721443,0.0016283667742749
8,SPAIN,"Media, Telecoms and Broadcasting",1307648297952.041,46757901.44520548,39666000.0,84832.72083218636,0.0030333844399998
9,SPAIN,"Finance, Insurance and Consulting",1309982048239.0232,46763976.093023255,34888410.0,74605.3114273254,0.0026632739011118
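
`df_split` holds one row per cited article (ETid-986 appears twice in `df_split.head()`), so summing `fine` over `df_split` adds a fine once for every article it cites. This is why the Ireland figure in this table (900,451,400) exceeds the country-level sum computed earlier from `df_eda` (225,877,900). A sketch on toy data of keeping one row per case before aggregating, assuming `etid` identifies a case:

```python
import pandas as pd

# Toy exploded frame: case ETid-1 cites two articles, so its fine appears twice
toy_split = pd.DataFrame({
    "etid":    ["ETid-1", "ETid-1", "ETid-2"],
    "country": ["GREECE", "GREECE", "GREECE"],
    "sector":  ["Industry and Commerce"] * 3,
    "fine":    [30_000.0, 30_000.0, 25_000.0],
})

# Summing the exploded rows counts ETid-1 twice
naive_total = toy_split["fine"].sum()

# Keeping one row per case before aggregating counts each fine once
per_case = toy_split.drop_duplicates(subset="etid")
fine_by_group = per_case.groupby(["country", "sector"])["fine"].sum()

print(naive_total)            # 85000.0
print(fine_by_group.iloc[0])  # 55000.0
```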


In [33]:
PATH_EXPORT = "/content/drive/MyDrive/MADS/SIADS591-592/Project/notebooks"

In [34]:
# Each frame gets its own archive; with a shared file name the second export would overwrite the first
df_display.to_csv(os.path.join(PATH_EXPORT, 'out_display.zip'), index=False,
                  compression=dict(method='zip', archive_name='out_display.csv'))

In [35]:
df_split.to_csv(os.path.join(PATH_EXPORT, 'out_split.zip'), index=False,
                compression=dict(method='zip', archive_name='out_split.csv'))

# Art. analysis

Create an indicator column for each categorical value (violation type and GDPR article)

In [36]:
df_split.head()

Unnamed: 0,etid,country,fine,controller_processor,article,violation_type,sector,summary,decision_date_imputed,decision_year,fine_cat,mapping_key,country_label,violation_type_label,sector_label,gdp,gdp_cat,iso3,cpi_score,cpi_score_cat,population,population_cat,gdpr_article,gdpr_article_short
0,ETid-986,GREECE,30000.0,Info Communication Services,Art. 13 GDPR,Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,30000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 13
0,ETid-986,GREECE,30000.0,Info Communication Services,Art. 14 GDPR,Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,30000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 14
1,ETid-985,GREECE,25000.0,Plus Real Advertisement,Art. 13 GDPR,Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,25000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 13
1,ETid-985,GREECE,25000.0,Plus Real Advertisement,Art. 14 GDPR,Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,25000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 14
2,ETid-957,GREECE,30000.0,One Way Private Company,Art. 28 (3) c) GDPR,Insufficient technical and organisat...,"Media, Telecoms and Broadcasting",The Hellenic DPA has imposed a fine ...,No,2021.0,30000.0,GREECE-2021,11.0,7.0,6.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 28


In [37]:
# Work on a copy so that later filters and new columns do not touch df_split
df_gdpr_art = df_split.copy()

# Remove Unknown violation types
#df_gdpr_art = df_gdpr_art[df_gdpr_art['sector'] != 'Unknown']

# Get values for Healthcare sector only
#df_gdpr_art = df_gdpr_art[df_gdpr_art['sector'] == 'Health Care']

# Remove fines below 100,000
# df_gdpr_art = df_gdpr_art[df_gdpr_art['fine'] >= 100000]



In [38]:
import plotly.express as px
import plotly.graph_objs as go

fig = px.density_heatmap(df_split, x="sector", y="violation_type", width=2000, height=800,
                         labels={
                             "violation_type": "Violation Type",
                             "sector": "Sector"
                             }, title="Heatmap violation types across all sectors")

fig.show()

In [39]:
import plotly.express as px

fig = px.density_heatmap(df_split, x="sector", y="gdpr_article_short", width=2000, height=800,
                         labels={
                             "gdpr_article_short": "GDPR Article",
                             "sector": "Sector"
                             }, title="Heatmap GDPR articles across all sectors")

fig.show()

In [40]:
fig = px.density_heatmap(df_gdpr_art, x="country", y="fine")
fig.show()

In [41]:
# One-hot encode the two categorical columns
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
train_enc = ohe.fit_transform(df_gdpr_art[['violation_type','gdpr_article_short']])
# Convert the encoded array back to a dataframe
df_gdpr_art_coef = pd.DataFrame(train_enc, columns=ohe.get_feature_names())
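
The `x0_` and `x1_` prefixes in the encoder's column names stand for the first and second input column. As an alternative to the encoder used above (not the notebook's method), `pd.get_dummies` produces the same kind of 0/1 columns with explicit prefixes. A minimal sketch on an invented toy frame; the prefix names `violation` and `article` are arbitrary choices:

```python
import pandas as pd

# Toy frame with the two categorical columns (values shortened for illustration)
toy = pd.DataFrame({
    "violation_type": ["Insufficient legal basis", "Insufficient security", "Insufficient legal basis"],
    "gdpr_article_short": ["Art. 6", "Art. 32", "Art. 5"],
})

# One indicator column per category; the prefixes keep the source column visible
encoded = pd.get_dummies(toy, prefix=["violation", "article"], dtype=int)
print(encoded.columns.tolist())
```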

In [42]:
# Pearson's r is not used because the variables are not normally distributed
corr_df = df_gdpr_art_coef.corr(method='spearman')
corr_df.style.background_gradient(cmap='coolwarm').set_precision(2)

Unnamed: 0,x0_Insufficient cooperation with supervisory authority,x0_Insufficient data processing agreement,x0_Insufficient fulfilment of data breach notification obligations,x0_Insufficient fulfilment of data subjects rights,x0_Insufficient fulfilment of information obligations,x0_Insufficient involvement of data protection officer,x0_Insufficient legal basis for data processing,x0_Insufficient technical and organisational measures to ensure information security,x0_Non-compliance with general data processing principles,x0_Unknown,x1_Art. 12,x1_Art. 13,x1_Art. 14,x1_Art. 15,x1_Art. 16,x1_Art. 17,x1_Art. 18,x1_Art. 19,x1_Art. 20,x1_Art. 21,x1_Art. 22,x1_Art. 24,x1_Art. 25,x1_Art. 27,x1_Art. 28,x1_Art. 29,x1_Art. 30,x1_Art. 31,x1_Art. 32,x1_Art. 33,x1_Art. 34,x1_Art. 35,x1_Art. 36,x1_Art. 37,x1_Art. 38,x1_Art. 39,x1_Art. 44,x1_Art. 46,x1_Art. 5,x1_Art. 58,x1_Art. 6,x1_Art. 7,x1_Art. 8,x1_Art. 9,x1_Art. Unknown
x0_Insufficient cooperation with supervisory authority,1.0,-0.01,-0.02,-0.05,-0.04,-0.02,-0.12,-0.08,-0.09,-0.01,-0.03,-0.05,-0.02,-0.03,-0.01,-0.02,-0.01,-0.01,-0.01,-0.03,-0.01,-0.02,-0.03,-0.0,-0.02,-0.01,-0.01,0.46,-0.06,-0.02,-0.02,-0.02,-0.01,-0.01,-0.01,-0.01,-0.01,-0.0,-0.1,0.74,-0.08,-0.02,-0.01,-0.03,-0.01
x0_Insufficient data processing agreement,-0.01,1.0,-0.01,-0.02,-0.02,-0.01,-0.04,-0.03,-0.03,-0.0,0.03,0.01,-0.01,-0.01,-0.0,-0.01,-0.0,-0.0,-0.0,-0.01,-0.0,-0.01,-0.01,-0.0,0.27,-0.0,-0.0,-0.01,-0.02,-0.01,-0.01,-0.01,-0.0,-0.01,-0.0,-0.0,-0.0,-0.0,-0.02,-0.01,-0.03,-0.01,-0.0,-0.01,-0.0
x0_Insufficient fulfilment of data breach notification obligations,-0.02,-0.01,1.0,-0.05,-0.04,-0.01,-0.1,-0.06,-0.08,-0.01,-0.03,-0.04,-0.02,-0.02,-0.01,-0.02,-0.01,-0.0,-0.0,-0.02,-0.01,-0.02,-0.02,-0.0,-0.02,-0.01,-0.01,-0.02,-0.03,0.55,0.41,-0.01,-0.0,-0.01,-0.01,-0.01,-0.01,-0.0,-0.07,0.01,-0.07,-0.02,-0.0,-0.02,-0.01
x0_Insufficient fulfilment of data subjects rights,-0.05,-0.02,-0.05,1.0,-0.09,-0.03,-0.25,-0.16,-0.19,-0.02,0.21,-0.06,0.05,0.43,0.02,0.25,0.09,-0.01,-0.01,0.2,-0.01,0.01,0.02,-0.01,0.0,-0.02,0.0,0.01,-0.11,-0.05,-0.03,-0.03,-0.01,-0.03,-0.02,-0.02,0.06,-0.01,-0.14,-0.04,-0.12,-0.04,-0.01,-0.05,-0.02
x0_Insufficient fulfilment of information obligations,-0.04,-0.02,-0.04,-0.09,1.0,-0.03,-0.2,-0.13,-0.15,-0.01,0.1,0.47,0.15,-0.04,-0.01,-0.04,-0.01,-0.01,-0.01,-0.03,-0.01,-0.03,-0.01,-0.01,0.0,-0.01,-0.02,-0.01,-0.1,-0.04,-0.03,-0.03,-0.01,0.0,-0.01,-0.01,-0.01,-0.01,-0.11,-0.03,-0.1,0.04,0.06,-0.04,-0.02
x0_Insufficient involvement of data protection officer,-0.02,-0.01,-0.01,-0.03,-0.03,1.0,-0.07,-0.05,-0.06,-0.0,-0.02,-0.03,-0.01,-0.02,-0.0,-0.01,-0.01,-0.0,-0.0,-0.02,-0.0,-0.01,-0.02,-0.0,-0.01,-0.01,-0.01,0.04,-0.04,-0.01,-0.01,-0.01,-0.0,0.41,0.51,0.51,-0.0,-0.0,-0.06,0.02,-0.05,-0.01,-0.0,-0.02,-0.01
x0_Insufficient legal basis for data processing,-0.12,-0.04,-0.1,-0.25,-0.2,-0.07,1.0,-0.35,-0.41,-0.03,-0.06,-0.1,-0.02,-0.11,-0.03,-0.04,-0.04,-0.02,-0.02,-0.01,-0.03,-0.04,-0.05,-0.02,-0.03,0.0,-0.03,-0.06,-0.26,-0.11,-0.07,-0.04,0.04,-0.07,-0.04,-0.04,-0.03,-0.02,0.05,-0.1,0.49,0.06,0.01,0.09,-0.04
x0_Insufficient technical and organisational measures to ensure information security,-0.08,-0.03,-0.06,-0.16,-0.13,-0.05,-0.35,1.0,-0.26,-0.02,-0.1,-0.13,-0.07,-0.08,-0.02,-0.06,-0.02,-0.02,-0.02,-0.07,-0.02,0.06,0.04,-0.01,0.03,0.06,-0.03,-0.05,0.65,0.06,0.02,-0.02,-0.02,-0.04,-0.02,-0.02,-0.02,-0.01,-0.03,-0.06,-0.21,-0.05,-0.02,-0.05,-0.03
x0_Non-compliance with general data processing principles,-0.09,-0.03,-0.08,-0.19,-0.15,-0.06,-0.41,-0.26,1.0,-0.03,-0.03,0.04,-0.01,-0.06,0.05,-0.02,0.02,0.06,0.06,-0.03,0.07,0.03,0.03,0.04,-0.01,-0.03,0.09,-0.06,-0.11,-0.04,-0.02,0.12,-0.02,0.05,-0.03,-0.03,0.03,0.04,0.21,-0.07,-0.15,-0.0,-0.02,0.02,0.03
x0_Unknown,-0.01,-0.0,-0.01,-0.02,-0.01,-0.0,-0.03,-0.02,-0.03,1.0,-0.01,-0.01,-0.01,-0.01,-0.0,-0.01,-0.0,-0.0,-0.0,-0.01,-0.0,-0.01,-0.01,-0.0,-0.01,-0.0,-0.0,-0.0,-0.02,-0.01,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.03,-0.01,-0.02,-0.01,-0.0,-0.01,0.76


In [43]:
df_gdpr_art_coef.head()

Unnamed: 0,x0_Insufficient cooperation with supervisory authority,x0_Insufficient data processing agreement,x0_Insufficient fulfilment of data breach notification obligations,x0_Insufficient fulfilment of data subjects rights,x0_Insufficient fulfilment of information obligations,x0_Insufficient involvement of data protection officer,x0_Insufficient legal basis for data processing,x0_Insufficient technical and organisational measures to ensure information security,x0_Non-compliance with general data processing principles,x0_Unknown,x1_Art. 12,x1_Art. 13,x1_Art. 14,x1_Art. 15,x1_Art. 16,x1_Art. 17,x1_Art. 18,x1_Art. 19,x1_Art. 20,x1_Art. 21,x1_Art. 22,x1_Art. 24,x1_Art. 25,x1_Art. 27,x1_Art. 28,x1_Art. 29,x1_Art. 30,x1_Art. 31,x1_Art. 32,x1_Art. 33,x1_Art. 34,x1_Art. 35,x1_Art. 36,x1_Art. 37,x1_Art. 38,x1_Art. 39,x1_Art. 44,x1_Art. 46,x1_Art. 5,x1_Art. 58,x1_Art. 6,x1_Art. 7,x1_Art. 8,x1_Art. 9,x1_Art. Unknown
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [44]:
# Build fine-weighted indicator columns for each GDPR article and violation type
#df_gdpr_art = df_split

# Get values for Healthcare sector only
df_gdpr_art = df_gdpr_art[df_gdpr_art['sector'] == 'Health Care']

# Remove fines below 100,000
df_gdpr_art = df_gdpr_art[df_gdpr_art['fine'] >= 100000]

# Remove unknown GDPR articles
df_gdpr_art = df_gdpr_art[~df_gdpr_art['gdpr_article_short'].str.contains('Unknown', na=True)].copy()

# Generate the list of unique GDPR article values
unique_art = list(df_gdpr_art['gdpr_article_short'].unique())

# Add a 0/1 indicator column for each GDPR article
for art in unique_art:
    df_gdpr_art[art] = (df_gdpr_art['gdpr_article_short'] == art).astype(int)

# Generate the list of unique violation type values
unique_violation = list(df_gdpr_art['violation_type'].unique())

# Add a 0/1 indicator column for each violation type
for violation in unique_violation:
    df_gdpr_art[violation] = (df_gdpr_art['violation_type'] == violation).astype(int)

unique_value = unique_art + unique_violation

# Weight each indicator by the fine amount of its row
for col in unique_value:
    df_gdpr_art[col] = df_gdpr_art[col] * df_gdpr_art['fine']

df_gdpr_art.head()

Unnamed: 0,etid,country,fine,controller_processor,article,violation_type,sector,summary,decision_date_imputed,decision_year,fine_cat,mapping_key,country_label,violation_type_label,sector_label,gdp,gdp_cat,iso3,cpi_score,cpi_score_cat,population,population_cat,gdpr_article,gdpr_article_short,Art. 5,Art. 9,Art. 32,Art. 33,Art. 34,Art. 6,Art. 13,Art. 14,Art. 28,Art. 30,Art. 35,Non-compliance with general data processing principles,Insufficient technical and organisational measures to ensure information security,Insufficient legal basis for data processing
69,ETid-773,ITALY,150000.0,Azienda Provinciale Per I Servizi Sa...,"Art. 5 (1) a), f) GDPR",Non-compliance with general data pro...,Health Care,The Italian DPA (Garante) has fined ...,No,2021.0,150000.0,ITALY-2021,16.0,8.0,3.0,2004294351927.0,2000000000000.0,ITA,47.0,50.0,60438553,60400000,True,Art. 5,150000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,150000.0,0.0,0.0
69,ETid-773,ITALY,150000.0,Azienda Provinciale Per I Servizi Sa...,Art. 9 GDPR,Non-compliance with general data pro...,Health Care,The Italian DPA (Garante) has fined ...,No,2021.0,150000.0,ITALY-2021,16.0,8.0,3.0,2004294351927.0,2000000000000.0,ITA,47.0,50.0,60438553,60400000,True,Art. 9,0.0,150000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,150000.0,0.0,0.0
70,ETid-772,ITALY,120000.0,Azienda Usl Della Romagna,Art. 5 (1) f) GDPR,Non-compliance with general data pro...,Health Care,The Italian DPA (Garante) has fined ...,No,2021.0,120000.0,ITALY-2021,16.0,8.0,3.0,2004294351927.0,2000000000000.0,ITA,47.0,50.0,60438553,60400000,True,Art. 5,120000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,120000.0,0.0,0.0
70,ETid-772,ITALY,120000.0,Azienda Usl Della Romagna,Art. 9 GDPR,Non-compliance with general data pro...,Health Care,The Italian DPA (Garante) has fined ...,No,2021.0,120000.0,ITALY-2021,16.0,8.0,3.0,2004294351927.0,2000000000000.0,ITA,47.0,50.0,60438553,60400000,True,Art. 9,0.0,120000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,120000.0,0.0,0.0
341,ETid-857,DENMARK,107000.0,Danish Cancer Society,Art. 32 GDPR,Insufficient technical and organisat...,Health Care,The Danish DPA has fined the Danish ...,No,2021.0,107000.0,DENMARK-2021,6.0,7.0,3.0,336902718261.0,340000000000.0,DNK,89.0,90.0,5813128,5800000,True,Art. 32,0.0,0.0,107000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,107000.0,0.0
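
The fine-weighted indicator columns shown above can also be built in one step with `pd.get_dummies`. The following is a minimal sketch on a hypothetical three-row frame: the values in `toy` (including the shortened violation names) are illustrative and not taken from the dataset, and only the column names `gdpr_article_short`, `violation_type` and `fine` are assumed to match the notebook.

```python
import pandas as pd

# Hypothetical toy data: one fine split across two article rows, plus a single-article fine
toy = pd.DataFrame({
    'etid': ['A', 'A', 'B'],
    'fine': [100.0, 100.0, 50.0],
    'gdpr_article_short': ['Art. 5', 'Art. 9', 'Art. 32'],
    'violation_type': ['Principles', 'Principles', 'Security'],
})

# One indicator column per distinct value; the empty prefix keeps the raw value as column name
dummies = pd.get_dummies(
    toy[['gdpr_article_short', 'violation_type']], prefix='', prefix_sep=''
).astype(float)

# Weight each indicator by the fine of its row
weighted = dummies.mul(toy['fine'], axis=0)
toy_weighted = pd.concat([toy, weighted], axis=1)
print(toy_weighted)
```

This avoids the explicit loops, at the cost of column names that are ordered alphabetically rather than by first appearance.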


In [45]:
# Aggregate to one row per fine (etid, country) by averaging the fine-weighted indicator columns.
# Note: when a fine spans several article rows, each article amount is averaged over those rows.
df_gdpr_art = df_gdpr_art.groupby(['etid', 'country'], as_index=False)[unique_value].agg('mean')

df_gdpr_art.head()

Unnamed: 0,etid,country,Art. 5,Art. 9,Art. 32,Art. 33,Art. 34,Art. 6,Art. 13,Art. 14,Art. 28,Art. 30,Art. 35,Non-compliance with general data processing principles,Insufficient technical and organisational measures to ensure information security,Insufficient legal basis for data processing
0,ETid-122,GERMANY,0.0,0.0,105000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,105000.0,0.0
1,ETid-158,UNITED KINGDOM,0.0,0.0,320000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,320000.0,0.0
2,ETid-321,NORWAY,0.0,0.0,112000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,112000.0,0.0
3,ETid-45,PORTUGAL,200000.0,0.0,200000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,400000.0,0.0
4,ETid-466,SWEDEN,731500.0,0.0,731500.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1463000.0,0.0
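
To see what the `mean` aggregation does to a fine that spans several article rows, the sketch below uses a hypothetical fine of 400 that cites two articles under one violation type (toy values and names, not from the dataset). Each article column averages to half the fine, while the violation-type column keeps the full amount because it is set on both rows. The ETid-45 and ETid-466 rows above are consistent with this pattern.

```python
import pandas as pd

# Hypothetical fine-weighted indicator rows for a single fine citing two articles
toy = pd.DataFrame({
    'etid': ['X-1', 'X-1'],
    'country': ['ANYLAND', 'ANYLAND'],
    'Art. 5': [400.0, 0.0],
    'Art. 32': [0.0, 400.0],
    'Security': [400.0, 400.0],
})
value_cols = ['Art. 5', 'Art. 32', 'Security']

# Mean halves the per-article amounts; max would keep the full fine per cited article
per_fine_mean = toy.groupby(['etid', 'country'], as_index=False)[value_cols].agg('mean')
per_fine_max = toy.groupby(['etid', 'country'], as_index=False)[value_cols].agg('max')
print(per_fine_mean)
print(per_fine_max)
```

Whether `mean` or an alternative such as `max` is appropriate is a modelling choice; the comparison is shown only to make the effect visible.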


In [46]:
# Pearson's r is not appropriate because the variables are not normally distributed; use Spearman's rank correlation
corr_df = df_gdpr_art.corr(method='spearman')
corr_df.style.background_gradient(cmap='coolwarm').set_precision(2)

Unnamed: 0,Art. 5,Art. 9,Art. 32,Art. 33,Art. 34,Art. 6,Art. 13,Art. 14,Art. 28,Art. 30,Art. 35,Non-compliance with general data processing principles,Insufficient technical and organisational measures to ensure information security,Insufficient legal basis for data processing
Art. 5,1.0,0.17,0.32,0.23,0.23,0.03,0.09,-0.16,-0.16,-0.16,-0.16,0.21,0.27,-0.16
Art. 9,0.17,1.0,-0.22,-0.09,-0.09,0.24,0.4,-0.09,-0.09,-0.09,-0.09,0.76,-0.45,-0.16
Art. 32,0.32,-0.22,1.0,-0.28,-0.28,-0.33,0.01,-0.16,-0.16,-0.16,-0.16,-0.38,0.9,-0.51
Art. 33,0.23,-0.09,-0.28,1.0,1.0,-0.1,-0.07,-0.05,-0.05,-0.05,-0.05,0.44,-0.25,-0.09
Art. 34,0.23,-0.09,-0.28,1.0,1.0,-0.1,-0.07,-0.05,-0.05,-0.05,-0.05,0.44,-0.25,-0.09
Art. 6,0.03,0.24,-0.33,-0.1,-0.1,1.0,0.34,-0.1,-0.1,-0.1,-0.1,0.12,-0.53,0.79
Art. 13,0.09,0.4,0.01,-0.07,-0.07,0.34,1.0,0.65,0.65,0.65,0.65,0.58,-0.36,-0.13
Art. 14,-0.16,-0.09,-0.16,-0.05,-0.05,-0.1,0.65,1.0,1.0,1.0,1.0,0.3,-0.25,-0.09
Art. 28,-0.16,-0.09,-0.16,-0.05,-0.05,-0.1,0.65,1.0,1.0,1.0,1.0,0.3,-0.25,-0.09
Art. 30,-0.16,-0.09,-0.16,-0.05,-0.05,-0.1,0.65,1.0,1.0,1.0,1.0,0.3,-0.25,-0.09
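
The choice of Spearman over Pearson rests on the fine columns not being normally distributed. One way to check such an assumption is a Shapiro-Wilk test. The sketch below runs it on synthetic, right-skewed amounts (an assumption standing in for a fine column, not the notebook's data) and also shows that Spearman's coefficient is Pearson's coefficient computed on ranks. `scipy` is assumed to be available in the environment.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, right-skewed amounts standing in for a fine column (illustrative only)
amounts = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=200))

# Shapiro-Wilk: a small p-value is evidence against normality
stat, p_value = stats.shapiro(amounts)
print(f"W={stat:.3f}, p={p_value:.3g}")

# Spearman's rho equals Pearson's r computed on the ranks
other = amounts ** 2 + rng.normal(0, 1, size=200)
spearman = amounts.corr(other, method='spearman')
pearson_on_ranks = amounts.rank().corr(other.rank(), method='pearson')
print(spearman, pearson_on_ranks)
```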


In [47]:
# Encode each short GDPR article name as an integer label
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_split['gdpr_article_short_label'] = le.fit_transform(df_split['gdpr_article_short'].values)


In [48]:
df_split.head()

Unnamed: 0,etid,country,fine,controller_processor,article,violation_type,sector,summary,decision_date_imputed,decision_year,fine_cat,mapping_key,country_label,violation_type_label,sector_label,gdp,gdp_cat,iso3,cpi_score,cpi_score_cat,population,population_cat,gdpr_article,gdpr_article_short,gdpr_article_short_label
0,ETid-986,GREECE,30000.0,Info Communication Services,Art. 13 GDPR,Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,30000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 13,1
0,ETid-986,GREECE,30000.0,Info Communication Services,Art. 14 GDPR,Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,30000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 14,2
1,ETid-985,GREECE,25000.0,Plus Real Advertisement,Art. 13 GDPR,Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,25000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 13,1
1,ETid-985,GREECE,25000.0,Plus Real Advertisement,Art. 14 GDPR,Insufficient fulfilment of informati...,Industry and Commerce,The Hellenic DPA has imposed a fine ...,No,2021.0,25000.0,GREECE-2021,11.0,4.0,5.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 14,2
2,ETid-957,GREECE,30000.0,One Way Private Company,Art. 28 (3) c) GDPR,Insufficient technical and organisat...,"Media, Telecoms and Broadcasting",The Hellenic DPA has imposed a fine ...,No,2021.0,30000.0,GREECE-2021,11.0,7.0,6.0,212266363527.0,210000000000.0,GRC,44.0,40.0,10376349,10400000,True,Art. 28,14
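
`LabelEncoder` assigns integer codes in sorted order of the distinct strings, so the codes are nominal identifiers rather than article numbers. The sketch below uses a handful of article names (chosen for illustration) to show the lexicographic ordering, where `Art. 5` sorts after `Art. 46`, and how to map codes back with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

articles = ['Art. 5', 'Art. 13', 'Art. 46', 'Art. 6', 'Art. 13']

le = LabelEncoder()
codes = le.fit_transform(articles)

# Distinct labels in sorted (lexicographic) order; a label's position is its code
classes = list(le.classes_)
print(classes)
print(list(codes))

# Codes can be mapped back to the original strings
restored = list(le.inverse_transform(codes))
print(restored)
```

Because the codes carry no numeric meaning, a correlation computed on `gdpr_article_short_label` would reflect this arbitrary ordering rather than any property of the articles.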


In [49]:
# Pearson's r is not appropriate because the variables are not normally distributed; use Spearman's rank correlation
#corr_df = df_split.corr(method='spearman')
#corr_df.style.background_gradient(cmap='coolwarm').set_precision(2)

## Answer Hypothesis

## ...

# Watermark

In [50]:
%watermark

Last updated: 2022-01-21T20:46:28.418974+00:00

Python implementation: CPython
Python version       : 3.7.12
IPython version      : 5.5.0

Compiler    : GCC 7.5.0
OS          : Linux
Release     : 5.4.144+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit



In [51]:
%watermark --iversions

matplotlib: 3.2.2
plotly    : 4.4.1
IPython   : 5.5.0
pandas    : 1.1.5
google    : 2.0.3
altair    : 4.2.0
numpy     : 1.19.5
missingno : 0.5.0
seaborn   : 0.11.2
sqlite3   : 2.6.0



-----
