# wipEU: Women in Power and Decision-Making in the EU - Project Documentation
<a href="https://github.com/matteo-guenci/Open_access_project">wipEU</a> is a project developed by Salvatore di Marzo, Matteo Guenci, and Alice Picco for the final exam of the course <a href="https://www.unibo.it/it/didattica/insegnamenti/insegnamento/2023/424645">"Open Access and Digital Ethics"</a> held by professor Monica Palmirani within the <a href="https://corsi.unibo.it/2cycle/DigitalHumanitiesKnowledge">Digital Humanities and Digital Knowledge Master Degree</a> (University of Bologna), during the A.Y. 2023/2024.

## Introduction
The European Institute for Gender Equality (EIGE) is an agency of the European Union (EU) that was established in 2006 with the aim of promoting gender equality and combating discrimination based on sex. EIGE collects, analyzes, and disseminates data on various aspects of gender equality to support evidence-based policymaking within the EU. 

One key area of focus for EIGE is the representation of women and men in decision-making positions. The agency's Gender Statistics Database provides a comprehensive collection of statistical information related to gender equality, including data on women and men in decision-making roles across different sectors at European level. In terms of decision-making positions, EIGE monitors indicators such as the proportion of women and men in political, economic, and social leadership roles. This includes positions in national governments, parliaments, and local authorities, as well as leadership roles in business and academia.

EIGE's database helps identify gender imbalances and trends over time, providing policymakers with valuable insights to design and implement measures that promote greater gender equality. Achieving gender balance in decision-making positions is a key goal for the EU, as it is seen as essential for democratic representation and effective governance.

EIGE's work contributes to tracking progress and identifying areas where further efforts are needed to address gender disparities. By promoting transparency and accountability, the agency plays a crucial role in advancing gender equality within the European Union. Policymakers can use the data and analyses provided by EIGE to develop targeted interventions and policies aimed at addressing the underrepresentation of women in decision-making positions and promoting equal opportunities for both genders.

It's important to note that the data in EIGE's Gender Statistics Database is regularly updated to reflect the evolving landscape of gender equality in the European Union. Policymakers, researchers, and the public can access this valuable resource to stay informed about progress and challenges in achieving gender equality, particularly in decision-making roles.

In this context, we opted to formulate our unique conceptual framework to delve into the data, exploring statistical correlations and longitudinal trends. Our objective was to visualize the temporal evolution through the creation of choropleth maps and line charts. Additionally, we aimed to identify clusters within the dataset and analyze correlations to derive deeper insights.

RQ: "How do the levels of representation of women in different decision-making positions correlate within the European Union? Is there a discernible pattern or trend in the relationships between women occupying various roles, and how have these correlations evolved over time?"

### Statement of responsibility
Team member | Task | Contact
--- | --- | ---
Salvatore di Marzo | Project ideation, data retrieval, mashup datasets, technical analysis | [contact](mailto:salvatore.dimarzo2@studio.unibo.it)
Matteo Guenci | Project ideation, data retrieval, mashup datasets, technical analysis | [contact](mailto:matteo.guenci@studio.unibo.it)
Alice Picco | Project ideation, data retrieval, visualizations, quality and legal analyses, website development | [contact](mailto:alice.picco@studio.unibo.it)


## Original and mashup datasets
The project uses <b>10 different datasets</b>, counting both source and mashup datasets.
Each source dataset was divided into three parts (women, men, and total values), so in practice we worked with 18 source datasets and 2 mashup datasets.

The <b>source datasets</b> have been downloaded in .csv and .json format from the Women and men in decision-making section in EIGE's Gender Statistics Database:

Id | Dataset | Description (factor of interest) | Provenience | Link / Path
--- | --- | --- | --- | --- 
D1 | European financial institutions: presidents and members | number of men, women, total in the domain | EIGE | [Link](https://eige.europa.eu/gender-statistics/dgs/indicator/wmidm_bus_fin__wmid_fineur/hbar)
D2 | European agencies: presidents, members and executive heads | number of men, women, total in the domain | EIGE | [Link](https://eige.europa.eu/gender-statistics/dgs/indicator/wmidm_adm_eur__wmid_euadmin_eurag)
D3 | Research funding organisations: presidents and members of the highest decision-making body | number of men, women, total in the domain | EIGE | [Link](https://eige.europa.eu/gender-statistics/dgs/indicator/wmidm_educ__wmid_resfund)
D4 | European agencies working in areas related to environment and climate change| number of men, women, total in the domain | EIGE | [Link](https://eige.europa.eu/gender-statistics/dgs/indicator/wmidm_env_euinst__wmid_env_eu_ag)
D5 | European courts: presidents and members | number of men, women, total in the domain | EIGE | [Link](https://eige.europa.eu/gender-statistics/dgs/indicator/wmidm_jud_eucrt__wmid_eucrt/hbar/year:2009/geo:EU/EGROUP:CRTS_EUR/sex:M,W/UNIT:PC/POSITION:MEMB_CRT/ENTITY:CST,ECHR,GC,CJEU,ECJ)
D6 | Major political parties: leader and deputy leaders | number of men, women, total in the domain | EIGE | [Link](https://eige.europa.eu/gender-statistics/dgs/indicator/wmidm_pol_part__wmid_polpart)

We proceeded with the <b>mashup phase</b>, creating the final three main mashup datasets used to answer our research question. As with source and clean datasets, we distinguished between the three years of our time span of interest: in this way, we ended up with 9 final mashup datasets (three for each factor of interest).

Id | Dataset | Description (factor of interest) | Original source datasets | Year
--- | --- | --- | --- | ---
MD1_17 | Religious observance in each region | RELIGION - % of religious observance in each region (over the total population) | D1, D3 | 2017
MD1_18 | Religious observance in each region | RELIGION - % of religious observance in each region (over the total population) | D2, D3 | 2018
MD1_19 | Religious observance in each region | RELIGION - % of religious observance in each region (over the total population) | D2, D3 | 2019
MD2_17 | Pregnancy rates in young women in each region | PREGNANCY - % of pregnancies in young women (15-25) in each region (over the total population of young women aged 15-25) | D4, D5, D6 | 2017
MD2_18 | Pregnancy rates in young women in each region | PREGNANCY - % of pregnancies in young women (15-25) in each region (over the total population of young women aged 15-25) | D4, D5, D6 | 2018
MD2_19 | Pregnancy rates in young women in each region | PREGNANCY - % of pregnancies in young women (15-25) in each region (over the total population of young women aged 15-25) | D4, D5, D6 | 2019
MD3_17 | (Higher) education rates in young women in each region | EDUCATION - % of women early leavers (18-24) in each region (over the total population) | D1, D7 | 2017
MD3_18 | (Higher) education rates in young women in each region | EDUCATION - % of women early leavers (18-24) in each region (over the total population) | D2, D7 | 2018
MD3_19 | (Higher) education rates in young women in each region | EDUCATION - % of women early leavers (18-24) in each region (over the total population) | D2, D7 | 2019

The code and more detailed documentation for the clean-up and mashup phases are freely downloadable and can be found in `documentation > CLEAN.ipynb` and `documentation > MASHUP.ipynb`.

## Quality analysis

EIGE collects data on decision-making to provide reliable statistics for monitoring the current situation and trends over time.
The data on national environment ministries is collected on an annual basis.
No legal acts are applicable; the CEU monitors the correct implementation of the Beijing Platform for Action (which sets out a series of principles on equality between women and men).
Confidentiality and data treatment policies are not applicable. This means that the dataset is exempt from certain data protection regulations and confidentiality policies that normally apply to personal data; the exemptions cover public interest, official authority, and research purposes.
Data is generally released within one month of the data collection.
All the data we gathered follows the WMID (Women and Men in Decision-Making) methodology, a tool developed by EIGE to monitor the participation of women in decision-making, from local to national levels of governance. The method rests on three key principles: it is data-driven, since it relies on reliable and comparable data from a variety of sources, mainly the websites and other publications of the organisations concerned, with some data collected through direct contact with those organisations; it is gender-sensitive, since it takes into account the different experiences and perspectives of women and men in decision-making; and it is comparative, since it allows comparison between countries and regions, even though there are some limitations on the extent to which data can be considered fully comparable between countries.
EIGE's data is collected by researchers and is subject to routine validation, including: cross-checking, by another researcher, of the data relating to at least 10% of the organisations covered; verification of the data with the organisation concerned; and comparison of the data with previous periods, with a review whenever the data changes drastically.
We performed a quality analysis of our datasets by looking at four factors:
<ul>
<li><b>Accuracy</b>: the data and its attributes correctly represent the real-world concepts they describe, and the data is free of errors. In EIGE's data gathering, the coverage of some organisations is restricted because of cost limits and the burden of data collection, and for some national environment ministries the data refers to previous years or is not fully up to date; apart from this, the data can be considered accurate.</li>
<li><b>Coherence</b>: the data does not show contradictions; this is ensured by the WMID methodology and the routine validation of the data.</li>
<li><b>Completeness</b>: the data is exhaustive for the values shown. Overall the datasets are complete, except for some countries where data is missing, as mentioned above.</li>
<li><b>Timeliness</b>: the data refers to the period it claims to describe and is released on a punctual schedule.</li>
</ul>

The following table showcases the quality of each of the source datasets and highlights possible flaws.
Id | Accuracy | Coherence | Completeness | Timeliness
--- | --- | --- | --- | --- 
D1 - Financial Institutions | Satisfied | Satisfied | Satisfied | Satisfied
D2 - European Agencies (Public Administration)| Satisfied | Satisfied | Satisfied | Satisfied
D3 - Research Funding Organizations | Satisfied | Satisfied | Satisfied | Satisfied
D4 - European Agencies (Climate Change) | Satisfied | Satisfied | Satisfied | Not Satisfied
D5 - European Courts | Satisfied | Satisfied | Satisfied | Satisfied
D6 - Major Political Parties | Satisfied | Satisfied | Satisfied | Satisfied


## Legal analysis
The legal analysis of the source datasets, fundamental to obtain <b>sustainability over time</b> of the production process and of the publication of datasets and to guarantee a <b>balanced service</b>  in compliance with the public function and with individual rights, was carried out using a reference checklist consisting of a series of binary questions regarding the topics of:
<ul>
<li><b>Privacy issues</b></li>
<li><b>IPR policy</b></li>
<li><b>Licences</b></li>
<li><b>Limitations on public access</b></li>
<li><b>Economical conditions</b></li>
<li><b>Temporal aspects</b></li>
</ul>

### Privacy Issues
To check: | D1 - Financial Institutions | D2 - European Agencies (PA) | D3 - Research Funding | D4 - European Agencies (CC) | D5 - European Courts | D6 - Major Political Parties 
--- | --- | --- | --- | --- | --- | --- 
Is the dataset free of any personal data as defined in the Regulation (EU) 2016/679? | Yes | Yes | Yes | Yes | Yes | Yes 
Is the dataset free of any indirect personal data that could be used for identifying the natural person? | Yes | Yes | Yes | Yes | Yes | Yes 
Is the dataset free of any particular personal data (art. 9 GDPR)? | Yes | Yes | Yes | Yes | Yes | Yes 
Is the dataset free of any information that combined with common data available in the web, could identify the person? | Yes | Yes | Yes | Yes | Yes | No 
Is the dataset free of any information related to human rights (e.g., refugees, witness protection, etc.) | Yes | Yes | Yes | Yes | Yes | Yes 
Did you use a tool for calculating the range of the risk of deanonymization? | Not needed | Not needed | Not needed | Not needed | Not needed | Not needed 
Are you using geolocalization capabilities? | Yes | Yes | Yes | Yes | Yes | Yes 
Did you check that the open data platform respect all the privacy regulations? | Yes | Yes | Yes | Yes | Yes | Yes 
Did you know who is, in your open data platform, the Controller and Processor of the privacy data of the system? | Yes | Yes | Yes | Yes | Yes | Yes 
Have you checked the privacy regulation of the country where the dataset are physically stored? | Yes | Yes | Yes | Yes | Yes | Yes 
Did you have non-personal data? | Yes | Yes | Yes | Yes | Yes | Yes 

### Intellectual Property Rights
To check: | D1 - Financial Institutions | D2 - European Agencies (PA) | D3 - Research Funding |D4 - European Agencies (CC) | D5 - European Courts | D6 - Major Political Parties 
--- | --- | --- | --- | --- | --- | --- 
Have you created and generated the dataset? | No | No | No | No | No | No 
Are you the owner of the dataset? | No | No | No | No | No | No 
Is the dataset free from third party licenses or patents? | Yes | Yes | Yes | Yes | Yes | Yes 
Have you checked if there are any limitations in your national legal system for releasing some kind of datasets with open license? | Yes | Yes | Yes | Yes | Yes | Yes 

### Licences
To check: | D1 - Financial Institutions | D2 - European Agencies (PA) | D3 - Research Funding | D4 - European Agencies (CC) | D5 - European Courts | D6 - Major Political Parties 
--- | --- | --- | --- | --- | --- | --- 
Did you release the dataset with an open data licence? | Yes | Yes | Yes | Yes | Yes | Yes 
Did you include the clause: "In any case the dataset can't be used for re-identifying the person"? | No | No | No | No | No | No 
Did you release the API (in case you have it) with an open source license? | Yes | Yes | Yes | Yes | Yes | Yes 
Have you checked that the open data/API platform licence regime is in compliance with your IPR policy? | Yes | Yes | Yes | Yes | Yes | Yes 


### Limitations on public access
To check: |D1 - Financial Institutions | D2 - European Agencies (PA) | D3 - Research Funding | D4 - European Agencies (CC) | D5 - European Courts | D6 - Major Political Parties 
--- | --- | --- | --- | --- | --- | --- 
Did you check that the dataset concerns your institutional competences, scope and finality? | Yes | Yes | Yes | Yes | Yes | Yes 
Did you check the limitations for the publication stated by your national legislation or by the EU directives? | Yes | Yes | Yes | Yes | Yes | Yes 
Did you check if there are some limitations connected to the international relations, public security or national defence? | Yes | Yes | Yes | Yes | Yes | Yes 
Did you check if there are some limitations concerning the public interest? | Yes | Yes | Yes | Yes | Yes | Yes 
Did you check the international law limitations? | Yes | Yes | Yes | Yes | Yes | Yes 
Did you check the INSPIRE law limitations for the spatial data? | Yes | Yes | Yes | Yes | Yes | Yes 


### Economical conditions
To check: | D1 - Financial Institutions | D2 - European Agencies (PA) | D3 - Research Funding | D4 - European Agencies (CC) | D5 - European Courts | D6 - Major Political Parties 
--- | --- | --- | --- | --- | --- | --- 
Did you check that the dataset could be released for free? | Yes | Yes | Yes | Yes | Yes | Yes 
Did you check if there are some agreements with some other partners in order to release the dataset with a reasonable price? | Not needed | Not needed | Not needed | Not needed | Not needed | Not needed 
Did you check if the open data platform terms of service include a clause of "non liability agreement" regarding the dataset and API provided? | Yes | Yes | Yes | Yes | Yes | Yes 
In case you decide to release the dataset to a reasonable price did you check if the limitation imposed by the new directive 2019/1024/EU are respected? | Not needed | Not needed | Not needed | Not needed | Not needed | Not needed 
In case you decide to release the dataset to a reasonable price did you check the e-Commerce directive and regulation? | Not needed | Not needed | Not needed | Not needed | Not needed | Not needed 

### Temporal aspects
To check: |  D1 - Financial Institutions |  D2 - European Agencies (PA) | D3 - Research Funding | D4 - European Agencies (CC) | D5 - European Courts | D6 - Major Political Parties 
--- | --- | --- | --- | --- | --- | --- 
Did you have a temporary policy for updating the dataset? | No | Yes | Yes | Yes | Yes | Yes 
Did you have some mechanism for informing the end-user that the dataset is updated at a given time to avoid mis-usage and so potential risk of damage? | Yes | Yes | Yes | Yes | Yes | Yes 
Did you check if the dataset for some reason cannot be indexed by the research engines (e.g., Google, Yahoo, etc.)? | Yes | Yes | Yes | Yes | Yes | Yes 
In case of personal data, do you have a reasonable technical mechanism for collecting request of deletion (e.g., right to be forgotten)? | Not needed | Not needed | Not needed | Not needed | Not needed | Not needed 

## Ethical analysis


EIGE adheres to strict ethical principles and guidelines in its data collection and analysis practices, ensuring that data is collected responsibly, using sampling methods and respecting data confidentiality. It also prioritizes transparency and accessibility by making its data public and by clearly explaining its methodology in the documentation.
For the individuals whose data is used in these studies, strict data protection rules apply: EIGE uses robust security measures to safeguard personal information, anonymizes data whenever possible, removes personal identifiers from the materials, and obtains consent from the individuals or agencies concerned before collecting the data.
Data integrity and privacy were respected, as most of the data we collected does not mention specific individuals but only gives the number of members, linking them either to their country or to the agency they work for. As a European institution, EIGE receives its data from national ministries, which already ensure that individuals' personal data is protected.
For the datasets we used: in research funding organisations, national ministries dealing with climate change, and major political parties, the data is linked to the country it describes; for the European agencies, European financial institutions, and courts, the data is linked to the agency it describes.

## Technical analysis

All source datasets have been evaluated based on the <b><a href="https://docs.italia.it/italia/daf/lg-patrimonio-pubblico/it/stabile/modellometadati.html" target="_blank">metadata model</a> provided by AGID</b>, which classifies metadata quality on a range of 4 levels according to two factors: <i>data-metadata bond</i> and <i>detail level</i>.

Source datasets:
Id | Provenience | Format | Metadata | URI | Licence
--- | --- | --- | --- | --- | ---
D1 | [demo](https://demo.istat.it/?l=en) | .csv, .xlsx, .pdf | Level 2: A weak data-metadata bond since an external <b>pdf</b> with additional information and methodology reports is accessible; Dataset detail level, information are shared by all dataset data |  [Link](https://demo.istat.it/app/?i=RIC&l=en) | CC BY 3.0
D2 | [demo](https://demo.istat.it/?l=en) | .csv, .xlsx, .pdf | Level 1: Not provided |  [Link](https://demo.istat.it/app/?i=POS&l=en) | CC BY 3.0
D3 | [I.Stat](http://dati.istat.it/?lang=en) |  .csv, .xlsx, .px, .xml | Level 4: An SDMX structured file is downloadable with a strong data-metadata bond and a datum-level detail of description. They are machine readable.<br> Level 2: Additional metadata to provide transparent information about sources and methodologies are available in a separated [webpage](https://siqual.istat.it/SIQual/visualizza.do?id=0058000), accessible through a sidebar menu |  [Link](http://dati.istat.it/index.aspx?queryid=24349) | CC BY 3.0
D4 | [IstatData](https://esploradati.istat.it/databrowser/#/) | .json, .xml, .xlsx, .csv | Level 4: An SDMX structured file is downloadable with a strong data-metadata bond and a datum-level detail of description. They are machine readable. | [Link](https://esploradati.istat.it/databrowser/#/en/dw/categories/IT1,POP,1.0/POP_BIRTHFERT/DCIS_NATI1/DCIS_NATI1_PARENT_CHARACT/IT1,25_74_DF_DCIS_NATI1_8,1.0) | CC BY 3.0
D5 | [I.Stat](http://dati.istat.it/?lang=en) | .csv, .xlsx, .px, .xml | Level 4: An SDMX structured file is downloadable with a strong data-metadata bond and a datum-level detail of description. They are machine readable.<br> Level 2: Additional metadata to provide transparent information about sources and methodologies are available in a separated [webpage](https://siqual.istat.it/SIQual/visualizza.do?id=5000132&refresh=true&language=EN), accessible through a sidebar menu |  [Link](http://dati.istat.it/index.aspx?queryid=29218) | CC BY 3.0
D6 | [I.Stat](http://dati.istat.it/?lang=en) | .csv, .xlsx, .px, .xml | Level 4: An SDMX structured file is downloadable with a strong data-metadata bond and a datum-level detail of description. They are machine readable.<br> Level 2: Additional metadata to provide transparent information about sources and methodologies are available in a separated [webpage](https://siqual.istat.it/SIQual/visualizza.do?id=0038900&refresh=true&language=EN), accessible through a sidebar menu |  [Link](http://dati.istat.it/index.aspx?queryid=7098) | CC BY 3.0
D7 | [I.Stat](http://dati.istat.it/?lang=en) | .csv, .xlsx, .px, .xml | Level 4: An SDMX structured file is downloadable with a strong data-metadata bond and a datum-level detail of description. They are machine readable.<br> Level 2: Additional metadata to provide transparent information about sources and methodologies are available in a separated [webpage](https://siqual.istat.it/SIQual/sintesi.do?id=5000098), accessible through a sidebar menu |  [Link](http://dati.istat.it/Index.aspx?DataSetCode=DCCV_ESL_UNT2020) | CC BY 3.0

Mashup datasets:
Id | Creation date | Format | Metadata | URI | Licence
--- | --- | --- | --- | --- | ---
MD1 | creation_date | .csv | Provided | [MD1_17](https://github.com/OrsolaMBorrini/blessedfruit/blob/main/data/mashupDS/MD1_17.csv), [MD1_18](https://github.com/OrsolaMBorrini/blessedfruit/blob/main/data/mashupDS/MD1_18.csv), [MD1_19](https://github.com/OrsolaMBorrini/blessedfruit/blob/main/data/mashupDS/MD1_19.csv) | CC BY 4.0
MD2 | creation_date | .csv | Provided | [MD2_17](https://github.com/OrsolaMBorrini/blessedfruit/blob/main/data/mashupDS/MD2-PERC-2017.csv), [MD2_18](https://github.com/OrsolaMBorrini/blessedfruit/blob/main/data/mashupDS/MD2-PERC-2018.csv), [MD2_19](https://github.com/OrsolaMBorrini/blessedfruit/blob/main/data/mashupDS/MD2-PERC-2019.csv) | CC BY 4.0
MD3 | creation_date | .csv | Provided | [MD3_17](https://github.com/OrsolaMBorrini/blessedfruit/blob/main/data/mashupDS/MD3_17.csv), [MD3_18](https://github.com/OrsolaMBorrini/blessedfruit/blob/main/data/mashupDS/MD3_18.csv), [MD3_19](https://github.com/OrsolaMBorrini/blessedfruit/blob/main/data/mashupDS/MD3_19.csv) | CC BY 4.0





## Technical analysis 


### Introduction

This report outlines the step-by-step process of analyzing the datasets on women in influential roles across various domains: research funding organizations, major political parties, financial institutions, European courts, and national ministries dealing with the environment and climate change. The objective is to understand correlations and trends within the data and to identify clusters representing similar temporal patterns.

### First Step: Data Preparation

The initial step involves reading the data in <code>.xlsx</code> format and converting it to <code>.csv</code> for efficient analysis using the **`pandas`** library. In this phase, several crucial functions are also defined:

- **`force_to_numeric`**: This function addresses the challenge of inconsistent data types by converting values in selected columns to numeric. This is particularly important as some datasets may have non-numeric values, possibly due to NaN values.

- **`knn_impute`**: This multifaceted function first determines the ideal number of neighbors and then utilizes this parameter to impute missing data in the merged datasets. The algorithm includes selecting numeric data, cross-validation for k selection, identifying optimal k, and final imputation.

- **`iterative_impute`** and **`mice_impute`**: both functions implement the MICE (Multiple Imputation by Chained Equations) approach, which iteratively imputes the missing values of each variable in the dataset from the observed values of the other variables. The process is typically repeated until convergence or until a specified maximum number of iterations is reached. Each function aligns with the MICE approach as follows:

  - <code>iterative_impute</code>: uses a custom iterative approach built on a while loop, imputing missing values with the <code>IterativeImputer</code> class and a <code>RandomForestRegressor</code> model. Iteration continues until either the difference between consecutive imputed matrices falls below a specified tolerance (<code>tol</code>) or the maximum number of iterations (<code>max_iter</code>) is reached.

  - <code>mice_impute</code>: directly uses the <code>IterativeImputer</code> class with a <code>RandomForestRegressor</code> and performs a fixed number of iterations specified by the <code>max_iter</code> parameter.

  Both functions adhere to the core idea of MICE by iteratively imputing missing values while considering the observed values of the other variables in the dataset. The primary difference lies in the stopping criterion: <code>iterative_impute</code> stops dynamically on convergence, while <code>mice_impute</code> uses a fixed number of iterations, which is the better choice when computational power or time is limited.

- **`column_remove`**: Designed to equalize datasets, this function limits the analysis to columns referring to data from 2018 onwards (the analyses were also applied to the complete datasets, without major changes in the overall results).

- **`generate_overall_trend_correlation`**: A function dedicated to calculating the overall correlation of a dataset based on the difference of individual values taken and compared year by year.

- **`generate_yearly_correlations`**: Similar to the previous function, this one generates correlations between the absolute values of each instance taken individually year by year.
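The two clean-up helpers described in this list can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function signatures, the `first_year` and `id_column` parameters, the assumption that year columns are labelled with the year itself, and the toy values are all hypothetical.

```python
import pandas as pd


def force_to_numeric(df, columns):
    """Return a copy of df with the given columns coerced to numbers.

    Values that cannot be parsed (for example a ':' placeholder) become NaN.
    """
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


def column_remove(df, first_year=2018, id_column="Geographic region"):
    """Keep the identifier column plus the year columns from first_year on.

    Assumes (hypothetically) that year columns are labelled "2018", "2019", ...
    """
    keep = [id_column]
    for col in df.columns:
        if str(col).isdigit() and int(col) >= first_year:
            keep.append(col)
    return df[keep]


# Toy data: invented numbers and region codes, not taken from the EIGE datasets.
raw = pd.DataFrame({
    "Geographic region": ["AA", "BB"],
    "2017": ["20.0", "31.5"],
    "2018": ["22.5", ":"],
    "2019": ["25.0", "33.0"],
})

trimmed = column_remove(raw)
clean = force_to_numeric(trimmed, ["2018", "2019"])
print(clean)
```

Coercing with `errors="coerce"` turns unparseable cells into NaN instead of raising, which leaves them available for the imputation step later on.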

### Correlation Analysis

Once the data is prepared, the analysis begins by examining correlations using the `column_remove` and `force_to_numeric` functions to focus on data from 2018 to 2023. Two main functions are employed:

- **`generate_overall_trend_correlation`**: Looks at the correlation between the trends, or changes in values over time, for each column. It is particularly useful for understanding how trends in the common columns of two dataframes correlate year by year.

- **`generate_yearly_correlations`**: Examines the correlation between the absolute values of the columns, providing insights into how the datasets relate to each other in each individual year.

Before applying these functions, the `column_remove` and `force_to_numeric` functions are used to ensure a standardized dataset for analysis.
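One possible reading of the two correlation functions is sketched below, following their names and the definitions given with the data-preparation functions: yearly correlations compare the values of two indicators across regions for each year, while the overall trend correlation compares their year-on-year changes. The signatures, the choice of Pearson correlation, and the toy values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def generate_yearly_correlations(df_a, df_b):
    """Correlate the values of two indicators across regions, one year at a time."""
    years = [c for c in df_a.columns if c in df_b.columns]
    return pd.Series({year: df_a[year].corr(df_b[year]) for year in years})


def generate_overall_trend_correlation(df_a, df_b):
    """Correlate year-on-year changes of two indicators, pooled over all regions.

    Assumes both frames have no missing values and share the same regions.
    """
    years = [c for c in df_a.columns if c in df_b.columns]
    df_b = df_b.reindex(df_a.index)
    # diff(axis=1) subtracts the previous year; the first year has no change.
    change_a = df_a[years].diff(axis=1).iloc[:, 1:].to_numpy().ravel()
    change_b = df_b[years].diff(axis=1).iloc[:, 1:].to_numpy().ravel()
    return float(np.corrcoef(change_a, change_b)[0, 1])


# Toy data: invented percentages for three placeholder regions.
regions = ["AA", "BB", "CC"]
parties = pd.DataFrame(
    {"2018": [10, 20, 30], "2019": [12, 21, 33], "2020": [15, 25, 34]},
    index=regions,
)
research = pd.DataFrame(
    {"2018": [5, 9, 16], "2019": [6, 10, 17], "2020": [8, 12, 18]},
    index=regions,
)

yearly = generate_yearly_correlations(parties, research)
trend = generate_overall_trend_correlation(parties, research)
print(yearly)
print(trend)
```

Keeping the two measures separate matters because two indicators can be strongly correlated in level every year while their changes from one year to the next are only loosely related.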

### Merging Datasets

To conduct broader analyses, datasets sharing common columns, such as 'Geographic region', are merged. This step is crucial for identifying correlations and patterns across the different influential roles of women in various regions. After the merge, several approaches to the missing values in this first merged dataset were tried, in order to impute them as accurately as possible. The first choice was `knn_impute`: for each data point with missing values, it computes the distance (commonly the Euclidean distance) to all other data points in the dataset and identifies the k nearest neighbors (the data points with the smallest distances). The algorithm then fills each missing value with the average (or weighted average) of that variable's values among the k nearest neighbors.
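The merge and the k-NN imputation can be sketched as follows, using scikit-learn's `KNNImputer`. This is an illustrative sketch under stated assumptions: the column names and values are invented, and a single hold-out check (hiding a few known cells and measuring the error for each candidate k) stands in for the cross-validation used to choose k in the project.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute(df, candidate_ks=(1, 2, 3), n_hidden=4, seed=0):
    """Pick k by hiding known cells, then impute the real gaps with the best k."""
    values = df.to_numpy(dtype=float)
    rng = np.random.default_rng(seed)

    # Hide a few observed cells so the imputation error can be measured.
    known = np.argwhere(~np.isnan(values))
    picked = rng.choice(len(known), size=n_hidden, replace=False)
    rows, cols = known[picked, 0], known[picked, 1]
    masked = values.copy()
    masked[rows, cols] = np.nan

    errors = {}
    for k in candidate_ks:
        filled = KNNImputer(n_neighbors=k).fit_transform(masked)
        diff = filled[rows, cols] - values[rows, cols]
        errors[k] = float(np.sqrt(np.mean(diff ** 2)))  # RMSE on hidden cells

    best_k = min(errors, key=errors.get)
    imputed = KNNImputer(n_neighbors=best_k).fit_transform(values)
    return pd.DataFrame(imputed, columns=df.columns, index=df.index), best_k


# Toy data: invented values for placeholder regions.
regions = ["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH"]
parties = pd.DataFrame({
    "Geographic region": regions,
    "parties_2019": [20.0, 35.0, np.nan, 41.0, 28.0, 33.0, 24.0, 38.0],
})
research = pd.DataFrame({
    "Geographic region": regions,
    "research_2019": [25.0, 38.0, 30.0, np.nan, 31.0, 36.0, 27.0, 40.0],
    "research_2020": [27.0, 39.0, 31.0, 45.0, np.nan, 37.0, 29.0, 42.0],
})

merged = parties.merge(research, on="Geographic region", how="outer")
numeric = merged.set_index("Geographic region")
imputed, best_k = knn_impute(numeric)
print(best_k)
print(imputed)
```

An outer merge keeps regions that appear in only one dataset, which is why missing values have to be handled after merging rather than before.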

The subsequent choice was the <code>mice_impute</code> function that utilizes the Multiple Imputation by Chained Equations (MICE) algorithm. This algorithm employs the <code>IterativeImputer</code> with a <code>RandomForestRegressor</code> as an estimator. The choice of a Random Forest model is preferred over the k-NN algorithm due to its superior ability to handle non-linear relationships within the data. The Random Forest model excels in capturing complex patterns, making it more suitable for scenarios where the relationships between variables are not strictly linear.

The algorithm iteratively imputes missing values by predicting each missing variable conditioned on the observed values of the other variables. This iterative process continues until convergence or reaching the specified maximum number of iterations.
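A minimal sketch of this kind of imputation with scikit-learn is shown below. The function body, the parameter values (`n_estimators=20`, `max_iter=5`), and the synthetic data are assumptions chosen to keep the example small; they are not the settings used in the project.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the import below)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def mice_impute(values, max_iter=5, seed=0):
    """Impute missing cells with IterativeImputer driven by a random forest."""
    estimator = RandomForestRegressor(n_estimators=20, random_state=seed)
    imputer = IterativeImputer(
        estimator=estimator, max_iter=max_iter, random_state=seed
    )
    return imputer.fit_transform(values)


# Synthetic data: three correlated columns with a few cells removed.
rng = np.random.default_rng(0)
base = rng.uniform(15, 45, size=12)
data = np.column_stack([
    base,
    base + rng.normal(0, 2, size=12),
    base + rng.normal(0, 2, size=12),
])
data_missing = data.copy()
data_missing[[1, 5], 0] = np.nan
data_missing[3, 1] = np.nan
data_missing[8, 2] = np.nan

completed = mice_impute(data_missing)
print(np.round(completed, 2))
```

Note that `IterativeImputer` returns a single completed matrix; it is inspired by MICE but does not by itself produce multiple imputed datasets.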

During the process of choosing which one could provide better data also the <code>iterative_impute</code> function was taken in account. This function also uses the IterativeImputer with a Random Forest Regressor but it differs in its approach. This function incorporates a custom iterative scheme, repeatedly fitting and transforming the data until either convergence or the maximum number of iterations is reached.

The iterative nature involves fitting the imputer on the data, calculating the difference between consecutive imputations, and repeating the process until the change falls below a specified tolerance or the maximum iteration limit is reached. Despite its potential for achieving better results, this function is computationally expensive due to the repeated fitting and transforming of the imputer. This approach was ultimately abbandoned due to its HUGE computational cost and its use was not really necessary (it is recomended to not apply the function to the dataset unless a high computational power is available for the user). 

It is important to note that the classes in `scikit-learn` handles the complexities of the imputation algorithms. This library's website and other websites like <a href="https://medium.com/capital-one-tech/random-forest-algorithm-for-machine-learning-c4b2c8cc9feb">this</a> or <a href="https://www.machinelearningplus.com/machine-learning/mice-imputation/">this</a> were taken as inspiration

In general, this is the flow of any iterative imputer that uses a Random Forest Regressor as its model:

### IterativeImputer with Random Forest Regressor

The iterative process can be expressed mathematically as:

$$
X_i^{(t+1)} = f^{(t)}(X_{\neg i}^{(t)}),
$$

where $(X_i^{(t)})$ represents the imputation for variable $(X_i)$ at iteration $(t)$, and $(f^{(t)})$ denotes the Random Forest Regressor trained up to iteration $(t)$. This process iterates until convergence or reaches a specified maximum number of iterations.

- Random Forest Regressor

The Random Forest Regressor is a key component of the imputation process. In mathematical terms, a Random Forest regression model can be expressed as an ensemble of decision trees whose predictions are averaged:

$$
f(x) = \frac{1}{K} \sum_{k=1}^{K} h_k(x),
$$

where $(K)$ is the number of trees, and $(h_k(x))$ represents the prediction of the $(k)$-th decision tree. Each tree is trained on a random (bootstrap) sample of the data, and all trees contribute equally to the final prediction.
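The ensemble idea can be checked directly in `scikit-learn`, where the regression forest's prediction is the mean of the predictions of its individual trees. The short sketch below uses synthetic data (an assumption for this example, not the project's data) to show this.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predictions of each fitted tree h_k(x) for the first five rows.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])

# Averaging over the K trees reproduces the forest's own prediction.
ensemble_mean = per_tree.mean(axis=0)
print(np.allclose(ensemble_mean, forest.predict(X[:5])))
```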

- Regression Modeling

At each iteration of the MICE algorithm, a regression model is constructed for each variable with missing values. For variable $(X_i)$ with missing values, the regression model can be written as:

$$
X_i = f(X_{\neg i}),
$$

Here, $(X_{\neg i})$ denotes all variables except $(X_i)$. The imputation process involves iteratively updating estimates based on these regression models.


### Spearman's Correlation

To further investigate and identify interesting correlation trends, Spearman's correlation is applied. This non-parametric measure of statistical dependence between two variables is particularly useful when dealing with non-linear data. Unlike Pearson's correlation, Spearman's correlation does not assume a linear relationship between variables.

The Spearman's correlation process involves ranking data points for each variable, ordering them from lowest to highest, computing the differences between the ranks, squaring these differences, and calculating Spearman's rank correlation coefficient.

The formula for Spearman's rank correlation coefficient (ρ) is given by:

$$
\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
$$
 
where $n$ is the number of data points, and $d_i$ represents the difference between the two ranks of the $i$-th data point. This form of the formula assumes that there are no tied ranks.
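As a small worked example of the rank-difference formula, the sketch below computes the coefficient by hand on five invented data points without ties and compares it with `scipy.stats.spearmanr`. The helper name `spearman_rho` and the data are assumptions for illustration.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr


def spearman_rho(x, y):
    """Rank-difference formula; only valid when there are no tied ranks."""
    d = rankdata(x) - rankdata(y)  # differences between the ranks
    n = len(x)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

# Rank differences are -1, 1, -1, 1, 0, so the sum of squares is 4
# and rho = 1 - (6 * 4) / (5 * 24) = 0.8.
rho_manual = spearman_rho(x, y)
rho_scipy, _ = spearmanr(x, y)
print(rho_manual, rho_scipy)
```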

Spearman's correlation is advantageous when dealing with non-linear relationships, capturing monotonic (increasing or decreasing) relationships. This makes it more suitable for assessing associations in datasets where the relationship between variables is not strictly linear.

The numerical analysis of the Spearman correlation matrix provides details on the strength of correlations and potential influences between categories. Key observations from the numerical values include:

- Temporal Correlations within the Same Categories: Generally high correlations within the same categories over the years, indicating a strong positive correlation. This suggests that gender dynamics in one category positively influence dynamics in similar categories in subsequent years.

- Correlations between Different Categories in the Same Year: Varied correlations, often significant. A high value indicates a positive relationship, while a low or negative value indicates a weaker or even negative relationship.

- Correlations Between Different Categories Over the Years: Significant variations over time, suggesting that gender dynamics in an agency in one year may influence other categories differently in subsequent years.

- Negative Values in Year-to-Year Correlations: Indicate an inverse relationship between categories in specific years. This suggests that, in certain periods, an increase in the number of women in one category is associated with a decrease in other categories and vice versa.

- Correlations between Agencies in the Same Year: Reflect how gender dynamics in one agency may be linked to those in other agencies during the same period.

### Heatmap of Spearman's Correlation for Non-linear Variables

A heatmap of Spearman's correlation provides a visual representation of non-linear relationships within the dataset. The heatmap is useful for understanding correlations within the same domains across years, between different domains in the same year, and between different domains across years.

- Correlation within the Same Domains Across Years: Warm colors (e.g., red) along the diagonal indicate a strong positive correlation within the same domain in consecutive years.

- Correlation between Different Domains in the Same Year: Cells not on the diagonal represent the correlation between different domains in the same year. Warm colors indicate a positive relationship, while cool colors indicate a weaker or negative relationship.

- Correlation between Different Domains Across Years: Cells neither on the diagonal nor in the same year show the correlation between different domains across years. Cool colors suggest that these domains are not strongly correlated over time.
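A correlation matrix and heatmap of this kind could be produced roughly as follows. This is a sketch under assumptions: the column names are invented, and the use of Matplotlib with the `coolwarm` colormap is a choice made for this example, not necessarily the one used in the project.

```python
import numpy as np
import pandas as pd


def spearman_matrix(df):
    """Pairwise Spearman correlations between all numeric columns."""
    return df.corr(method="spearman")


def plot_heatmap(corr):
    """Draw the matrix with warm colors for positive and cool colors for negative values."""
    import matplotlib.pyplot as plt  # imported lazily so the matrix step works without it

    fig, ax = plt.subplots(figsize=(6, 5))
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax, label="Spearman's rho")
    fig.tight_layout()
    return fig


# Toy data with hypothetical domain/year columns.
rng = np.random.default_rng(0)
base = rng.normal(size=20)
toy = pd.DataFrame(
    {
        "Politics_2020": base,
        "Politics_2021": base ** 3,  # non-linear but monotonic in Politics_2020
        "Business_2020": rng.normal(size=20),
    }
)

corr = spearman_matrix(toy)
print(corr.round(2))
```

The two "Politics" columns have a Spearman coefficient of 1 even though their relationship is not linear, because the coefficient only depends on the ordering of the values. Calling `plot_heatmap(corr)` returns the corresponding figure.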

### Clusters and Visualization

In this section, a different analysis approach is employed to identify patterns in the dataset. The clusters were first calculated with K-Means, after applying the elbow method to decide the right number of clusters to create and display. We then also considered DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which could be better suited to our case.

The choice between K-Means and DBSCAN boils down to their distinctive traits. K-Means excels in handling spherical clusters with similar sizes and works best with globular or isotropic shapes. However, it is sensitive to outliers and requires specifying the number of clusters beforehand. DBSCAN, on the other hand, is well-suited for clusters with arbitrary shapes and densities, making it effective in dealing with irregular cluster shapes. It is robust to outliers, adapts to varying cluster densities, and does not demand a predefined number of clusters.

We opted for DBSCAN due to its adaptability, robustness, and automatic determination of the number of clusters. In short, DBSCAN clusters data based on density: the algorithm groups together data points that are close to each other and have a sufficient number of neighbors within a specified distance (epsilon or eps). This approach allows DBSCAN to identify dense regions as clusters and to label points in low-density regions as noise or outliers.

The key criteria for DBSCAN are the following:

- Core points: data points with at least a specified number of neighbors within a defined distance (eps).
- Border points: points that are within the specified distance (eps) of a core point but do not have enough neighbors to be core points themselves.
- Noise (outliers): points that are neither core points nor border points.

The algorithm forms clusters by connecting core points and border points that are close to each other. The result is a set of clusters with varying shapes and sizes, adapting to the density of the underlying data distribution.
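The three kinds of points can be seen on a tiny hand-made example with `sklearn.cluster.DBSCAN`. The coordinates and the values of `eps` and `min_samples` below are assumptions chosen for illustration, not the settings used in the project. Note that in `scikit-learn` the `min_samples` count includes the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array(
    [
        [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],  # dense group A
        [0.55, 0.05],                                     # close to the edge of group A
        [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],  # dense group B
        [10.0, 0.0],                                      # isolated point
    ]
)

model = DBSCAN(eps=0.5, min_samples=4).fit(points)
labels = model.labels_  # cluster index per point, -1 means noise

is_core = np.zeros(len(points), dtype=bool)
is_core[model.core_sample_indices_] = True
is_noise = labels == -1
is_border = ~is_core & ~is_noise  # assigned to a cluster without being a core point

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(labels, n_clusters)
```

Here the point at `[0.55, 0.05]` has too few neighbors to be a core point but lies within `eps` of a core point of group A, so it joins that cluster as a border point, while the isolated point is labeled as noise.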

### 3D Visualization with Principal Component Analysis (PCA)

To gain insights into the dataset's patterns, a 3D visualization is created using Principal Component Analysis (PCA). PCA is employed to reduce the dimensionality of the dataset, allowing for better visualization and understanding of relationships between variables. The reduction of dimensionality is crucial for several reasons, including data visualization, noise reduction and computational efficiency.

The first principal component is a linear combination of the original variables that captures the largest share of the variance in the data. It is determined by the PCA algorithm during the analysis, and each term in the expression is a weight multiplied by the value of the corresponding original variable. The sign of each weight indicates the direction of the relationship between the original variable and the principal component, while its absolute value indicates the strength of that relationship.

The 3D graph represents each category by a point, with the position along the X-axis determined by the value of the first principal component, the position along the Y-axis determined by the second principal component, and the position along the Z-axis determined by the third principal component. The visualization is obtained with <code>plotly</code>, a versatile Python library.
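A minimal sketch of this reduction and of the 3D plot is given below. The standardization step, the helper names `pca_3d` and `plot_3d`, and the toy data are assumptions for this example; only the use of PCA and `plotly` comes from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_3d(df):
    """Project the rows of df onto the first three principal components."""
    scaled = StandardScaler().fit_transform(df)  # assumption: variables are standardized
    pca = PCA(n_components=3)
    coords = pca.fit_transform(scaled)
    scores = pd.DataFrame(coords, columns=["PC1", "PC2", "PC3"], index=df.index)
    # One row of weights per component, one column per original variable.
    weights = pd.DataFrame(pca.components_, columns=df.columns, index=scores.columns)
    return scores, weights, pca.explained_variance_ratio_


def plot_3d(scores, cluster_labels):
    """Interactive 3D scatter plot, one point per row, colored by cluster label."""
    import plotly.express as px  # imported lazily so the PCA step works without plotly

    return px.scatter_3d(
        scores, x="PC1", y="PC2", z="PC3",
        color=[str(label) for label in cluster_labels],
    )


# Toy data (not the EIGE data): 12 hypothetical regions, 5 yearly columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(12, 5)),
    columns=[f"year_{year}" for year in range(2018, 2023)],
    index=[f"region_{i}" for i in range(12)],
)

scores, weights, ratio = pca_3d(toy)
print(weights.loc["PC1"])  # weights of the first principal component
print(ratio)               # share of variance captured by each component
```

Each score is the weighted sum of the standardized variables of that row, using the weights of the corresponding component. `plot_3d(scores, labels)` returns a figure that can be displayed with `.show()`.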

### Cluster Analysis and Interpretation

The obtained clusters represent different geographical regions that exhibit similar patterns in the number of women in influential roles over the years. Regions within the same cluster have more similar patterns of change over time than regions in different clusters. In general, the visualization shows that the majority of the data is associated with a single cluster, which could suggest that most of the geographical regions taken into account show similar patterns and trends in the number of women in influential roles in different fields.

## Second set of datasets
Before proceeding with the analysis of this second set of datasets, it is necessary to specify some peculiarities that have influenced various choices made during the analyses. First of all, this second analysis block is intended to be the more "experimental" of the two. By this, it is meant that the results obtained by applying the same workflow used for the first set of datasets could be, and indeed should be, subject to questioning. This is due to the nature of the datasets analyzed: the substantial difference in the quantity of data they collect made the analyses quite challenging and in need of some interventions. For this reason, the steps of the following workflow (the same one applied before), from the union of the datasets into a single final dataset through to the end, can be skipped in order to obtain more concrete and realistic data that, however, lack interconnectivity between categories.

To better explain this last consideration, consider that the rows in the first dataset of this second block are in the order of 40 units, while the remaining two are in the order of a few tens. This implies that a merge operation based on the column common to all three datasets creates multiple rows with NaN values for the second and third datasets, indicating missing values, since those values do not exist. This inconvenience arises because, unlike the first set of datasets, where the "Geographical region" column had the same values in all three datasets, the agencies listed here (in the column renamed "Agency") refer to different sectors and therefore differ from one dataset to another. The algorithm applied by pandas in merge operations thus generates additional rows, even where there are no values to list, in order to retain every instance of these agencies.
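The effect can be reproduced on a toy example. The sketch below assumes an outer merge on the "Agency" column, which matches the behaviour described above. The agency names, column names, and values are invented.

```python
from functools import reduce

import pandas as pd

# Three hypothetical datasets whose "Agency" values do not overlap.
first = pd.DataFrame(
    {"Agency": ["Agency A", "Agency B", "Agency C", "Agency D"],
     "first_2022": [10, 12, 7, 9]}
)
second = pd.DataFrame({"Agency": ["Agency X", "Agency Y"], "second_2022": [3, 5]})
third = pd.DataFrame({"Agency": ["Agency Z"], "third_2022": [4]})

# An outer merge keeps every agency found in any of the datasets.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="Agency", how="outer"),
    [first, second, third],
)

print(merged)
print(merged.isna().sum())  # missing cells per column
```

With no agency in common, the merged table has 7 rows and 21 value cells, of which only 7 are observed; the remaining 14 are `NaN` and would have to be filled by the imputation step.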

It goes without saying that, therefore, as will be seen later in the following analysis, especially regarding the section related to data imputation, the results are "artificial." This is because, for significant sections of the correlation matrix (clearly visible in the heatmap), the data is entirely "guessed" by the selected model through its problem-solving logic. To obtain real results, one only needs to comment out the specific line of code related to data imputation to obtain realistic correlation matrices. Essentially, this solution was not chosen as the default in conducting this research for two main reasons: firstly, because the curiosity that drives any hypothesis led us to test the previously used data imputation algorithms to see if they could indeed provide a realistic view of the situation (by comparing the two heatmaps and noting the differences where they became particularly evident). Secondly, limiting the analysis to a simple correlation operation on the dataset without imputation (and thus with all its NaN values where real values do not exist) produces a correlation matrix that shows influences (which, as always, do not imply a causation relationship) within the categories themselves, but not across them. This results in a less profound outcome.

### Correlation Analysis

The initial step, as before, involves examining the correlations between absolute values and trends over time using the `generate_overall_trend_correlation` and `generate_yearly_correlations` functions. The `column_remove` and `force_to_numeric` functions are applied to standardize the dataset. However, it is crucial to note that the results may be less trustworthy due to the significant difference in the number of rows between the datasets. This issue poses a challenge in conducting a reliable analysis, especially for values that require imputation.

### Merging Datasets

To gain a broader perspective on potential trends, the datasets are merged on the renamed "Agency" column. The challenge, as noted above, lies in the substantial discrepancy in the number of rows across the individual datasets. As with the previous set, the different imputation functions are applied to fill in the missing values in the merged dataset. It is worth mentioning that imputation takes longer for this merged dataset, since the model has to guess a greater number of values than for the other merged dataset.

### Spearman's Correlation

Spearman's correlation coefficient is then applied to identify interesting correlations between variables. The results align with the patterns observed in the previous set of datasets. Noteworthy observations include strong temporal correlations within the same categories, varying correlations between different categories in the same year, and fluctuating correlations between different categories over the years. Negative values suggest intricate dynamics where an increase in the number of women in one category may counterbalance variations in other categories.

### Heatmap of Spearman's Correlation for Non-linear Variables

Similar to the previous set, a heatmap of Spearman's correlation is generated for visual representation of non-linear relationships within the dataset. Warm colors along the diagonal indicate strong positive correlations within the same domain in consecutive years. The analysis reveals positive correlations within specific domains, with weaker correlations between different domains. In general, applying this correlation method to the "guessed" values in the merged dataset shows a trend not too dissimilar from the previous one, which of course does not imply that reality is exactly as displayed in this analysis.

### Clusters and Visualization

The same criteria for forming and visualizing clusters are applied once again, resulting in the same number of clusters as before. This again indicates that the second set of data shows patterns similar to those seen in the first.

### Conclusion

In conclusion, the analyses applied to both final datasets reveal similar patterns and correlations (and hence possible influences) that appear more as isolated events within the analyzed categories. These correlations are less likely to produce a chain reaction that extends across categories. The formation of some clusters demonstrates patterns and trends related to the number of women occupying influential roles within the agencies or institutions under consideration. Clearly, as specified earlier, the results of the analyses on the second set would benefit from further in-depth study, and what is presented here is just a preliminary step for further exploration. That being said, the results appear consistent with the hypothesis that a rise in the number of women holding important roles in some institutions, agencies, public administrations, etc. does not generate a rise in the numbers in other fields.

Nevertheless, it can be asserted with some degree of confidence that the patterns shown in the second set are not vastly dissimilar (at least in their behavior, if not in the actual values) from those observed in the first section. Therefore, one could hypothesize the existence of fairly similar dynamics in the second section under consideration.


## Visualizations

The visualisation of the final datasets includes:
<ul>
<li> Data processing and filtering, in order to select specific values in the JSON files</li>
<li> Data visualisation using different graphs and various libraries.</li>
</ul>
    
In detail, the following libraries were used:
<ul>
<li> leaflet.js - for creating the interactive choropleth maps - <a href= "https://github.com/Leaflet/Leaflet/blob/main/LICENSE">BSD 2-Clause "Simplified" License</a></li>
<li> plotly.js - for the interactive line charts - <a href="https://github.com/plotly/plotly.js/blob/master/LICENSE">MIT License</a></li>
</ul>



<br>
The code for creating the JSON files can be found in our GitHub repository's <a href="https://github.com/OrsolaMBorrini/blessedfruit/tree/main/visualisations">scripts section</a>.

The JavaScript files for creating and displaying the visualisations can be found in the <a href="https://github.com/OrsolaMBorrini/blessedfruit/tree/main/assets/js">assets folder</a> of our website.

## RDF Assertion of the metadata

All produced mashup datasets have been thoroughly described with metadata, following the specification of the <a href="https://docs.italia.it/italia/daf/linee-guida-cataloghi-dati-dcat-ap-it/it/stabile/index.html" target="_blank"><b>DCAT-AP_IT</b></a> standard as recommended by <b>AGID's public information heritage valorization guidelines</b>.<br><br>
Since all our datasets contain data of specific national interest and are derived from datasets by Istat, an Italian public research institution, we decided to adopt <b>DCAT-AP_IT</b> (2016), the national standard. <br>Although it is based on the first version of the European standard DCAT and has more constraints than the more flexible and recent European standard <a href="https://www.w3.org/TR/vocab-dcat-2/" target="_blank">DCAT-AP 2.0</a>, we considered DCAT-AP_IT the more suitable standard for our mashup datasets: on the one hand, because in the Italian public sector <b>an increasing number of Public Administrations are adopting DCAT-AP_IT</b></span>; on the other hand, because this allowed us to follow more detailed national guidelines and therefore <b>ensure interoperability and harmonization with other data on a national level</b>.
            
Moreover:<br>

* To describe in full transparency the sources and the activities underlying the creation of our mashup datasets, we adopted <a href="https://www.w3.org/TR/prov-o/" target="_blank"><b>PROV-O - the provenance ontology</b></a>, as strongly recommended at the European level and also allowed by the DCAT-AP_IT <a href="https://docs.italia.it/italia/daf/lg-patrimonio-pubblico/it/stabile/modellometadati.html" target="_blank">metadata model</a>.
* Since all our mashup datasets are series containing individual datasets for each year (2017, 2018, 2019) and only DCAT-AP 3.0 currently provides a <code>dcat:DatasetSeries</code> with related properties, we again followed the instructions of AGID's <a href="https://docs.italia.it/italia/daf/lg-patrimonio-pubblico/it/stabile/modellometadati.html" target="_blank">metadata model</a> on how to handle relationships between datasets.<br>

We emphasized the individual elements of the series and created, inside each mashup dataset RDF assertion, a triple with a series dataset as subject, connected through the Dublin Core property <code>dct:type</code> to the value &lt;http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series&gt; .<br> 

Then we specified which datasets belong to the series by means of the <code>dct:hasPart</code> property.<br>

Finally, every individual yearly mashup dataset is in turn connected to the related series by means of <code>dct:isPartOf</code>.

Find the downloadable RDF assertions on <a href="https://orsolamborrini.github.io/blessedfruit/" target="_blank">Blessed be the fruit</a>

## Sustainability of the update


This is the final project for the course "Open Access and Digital Ethics" of the Master's degree in "Digital Humanities and Digital Knowledge" at the University of Bologna for the A.Y. 2023/2024, and there is no intention of updating the gathered resources in the future. However, all of the source datasets for "wipEU" come from <a href="https://eige.europa.eu/">EIGE</a>, which maintains and updates them yearly. On the EIGE website it is possible to download the data year by year.