## Final datasets 

| Original / intermediate dataset | Description | Name |
| --- | ----------- | --- |
| D2, D3 | Economic independence | I1 |
| D4, D6 | Working status by age | I2 |
| D5, D6, D7 | Work wages across sectors | I3 |
| D4, D7, D17 | Graduates' employment status | I4 |
| D1, I1 | Emigration and economic independence | I5 |


**With which license do we release the dataset?**
Original licenses:
- CC-BY-3.0 for Istat
- "Unless otherwise indicated, reproduction for non-commercial purposes with citation of the source is authorized" for AlmaLaurea
 
For each of the datasets created, only percentages, numbers (thousands) or prices are considered. The original datasets contained no names or other sensitive information about individuals, so none is present in the mash-up datasets. There is therefore no risk of de-anonymization: no individual can be identified from this data.

For each dataset created, the data taken from both Istat and AlmaLaurea is current as of 19/1/2024.

### I1 : Economic independence

The goal of this dataset is to merge the house price indices with the share of young people still living with their parents. The main focus is to understand whether rising house prices make it harder for young people to achieve economic independence. The data covers the years 2014-2022 and is analyzed by Italian geographical macro-area (North-West, North-East, Center and South).

In [2]:
import pandas as pd

young = pd.read_csv("../datasets/cleaned_csv/D2.csv", encoding="utf-8")
young.head(5)

Unnamed: 0,Area,Time period,Young people still living with their parents (%)
0,Nord-ovest,2014,60.6
1,Nord-ovest,2015,58.4
2,Nord-ovest,2016,59.3
3,Nord-ovest,2017,57.2
4,Nord-ovest,2018,57.6


In [3]:
house = pd.read_csv("../datasets/cleaned_csv/D3.csv", encoding="utf-8")
house.head(5)

Unnamed: 0,Area,Time period,House prices indices (2015 = 100)
0,Nord-ovest,2014,104.3
1,Nord-ovest,2015,100.0
2,Nord-ovest,2016,100.2
3,Nord-ovest,2017,99.5
4,Nord-ovest,2018,99.4


In [4]:
i1 = pd.merge(young, house, on=["Time period", "Area"])
i1.to_csv("../datasets/intermediate_datasets/i1.csv", index=False)
i1.head(5)

Unnamed: 0,Area,Time period,Young people still living with their parents (%),House prices indices (2015 = 100)
0,Nord-ovest,2014,60.6,104.3
1,Nord-ovest,2015,58.4,100.0
2,Nord-ovest,2016,59.3,100.2
3,Nord-ovest,2017,57.2,99.5
4,Nord-ovest,2018,57.6,99.4
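
The merge above assumes that each (`Area`, `Time period`) pair appears exactly once in both inputs. A minimal sketch of how that assumption could be checked, using the five Nord-ovest rows previewed above as sample data. The `validate` argument is a standard `pandas.merge` option; the check itself is an addition and not part of the original notebook.

```python
import pandas as pd

# Sample rows copied from the D2 and D3 previews above (Nord-ovest, 2014-2018).
young = pd.DataFrame({
    "Area": ["Nord-ovest"] * 5,
    "Time period": [2014, 2015, 2016, 2017, 2018],
    "Young people still living with their parents (%)": [60.6, 58.4, 59.3, 57.2, 57.6],
})
house = pd.DataFrame({
    "Area": ["Nord-ovest"] * 5,
    "Time period": [2014, 2015, 2016, 2017, 2018],
    "House prices indices (2015 = 100)": [104.3, 100.0, 100.2, 99.5, 99.4],
})

# validate="one_to_one" raises pandas.errors.MergeError if a key pair is
# duplicated on either side, so rows cannot be multiplied silently.
i1 = pd.merge(young, house, on=["Time period", "Area"], validate="one_to_one")

# An inner merge drops unmatched keys; comparing row counts exposes any loss.
rows_lost = len(young) - len(i1)
print(i1.shape, rows_lost)
```

If `rows_lost` is greater than zero, some area/year pairs exist in only one of the two sources and are missing from the merged dataset.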


### I2 : Working status by age

This dataset merges data on young people's satisfaction at work with their gross hourly wage. The goal is to understand whether wage and job satisfaction are correlated, and whether this could be a factor influencing emigration to foreign countries. 

Note that the age span analyzed in D4 is 15-24, while D6 considers the 15-29 age bracket. 
Even though the two age spans do not coincide, we treated both as representative of young people; since the data could not be aligned, we chose to work with the different age spans. 

Furthermore, D6 does not contain data for 2022, so this mash-up dataset covers only 2014 to 2021.
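
The restriction to 2014-2021 follows from intersecting the years covered by the two sources. A small sketch of that check, assuming the year ranges described in this section (D4 reaching 2022, D6 stopping at 2021); the variable names are illustrative only.

```python
import pandas as pd

# Year coverage as described in the text: D4 spans 2014-2022, D6 spans 2014-2021.
d4_years = pd.Series(range(2014, 2023), name="Time period")
d6_years = pd.Series(range(2014, 2022), name="Time period")

# Years usable in the mash-up are those present in both sources.
common_years = sorted(set(d4_years) & set(d6_years))

# Years that an inner merge on "Time period" will silently drop from D4.
only_in_d4 = sorted(set(d4_years) - set(d6_years))

print(common_years[0], common_years[-1])
print(only_in_d4)
```

With real data the two series would come from the `Time period` columns of the loaded frames rather than from `range`.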

In [5]:
satisfaction = pd.read_csv("../datasets/cleaned_csv/D4.csv", encoding="utf-8")
satisfaction

Unnamed: 0,Level of satisfaction,Qualification,Time period,Satisfaction (%)
0,very much,"Primary school certificate, no educational degree",2014,31.3
1,very much,"Primary school certificate, no educational degree",2015,18.7
2,very much,"Primary school certificate, no educational degree",2016,0.0
3,very much,"Primary school certificate, no educational degree",2017,32.3
4,very much,"Primary school certificate, no educational degree",2018,44.7
...,...,...,...,...
175,not at all,Total,2018,4.6
176,not at all,Total,2019,4.1
177,not at all,Total,2020,2.9
178,not at all,Total,2021,2.7


In [6]:
# Drop the rows for the single qualifications, keeping the overall rows
single_qualifications = [
    "Tertiary (university, doctoral and specialization courses)",
    "Primary school certificate, no educational degree",
    "Upper and post secondary",
    "Lower secondary school certificate",
]
satisfaction = satisfaction[~satisfaction["Qualification"].isin(single_qualifications)]

# One row per year, with one column per satisfaction level
s_df = satisfaction[["Time period"]].drop_duplicates().copy()
for level in ["very much", "quite", "not much", "not at all"]:
    # The level labels are matched with a leading space, as in the original comparison
    is_level = satisfaction["Level of satisfaction"] == " " + level
    s_df["Satisfaction (%): " + level] = satisfaction.loc[is_level, "Satisfaction (%)"].tolist()

s_df

Unnamed: 0,Time period,Satisfaction (%): very much,Satisfaction (%): quite,Satisfaction (%): not much,Satisfaction (%): not at all
36,2014,18.2,52.1,19.0,4.3
37,2015,19.9,55.2,18.5,2.2
38,2016,17.3,53.2,18.3,2.5
39,2017,16.9,57.1,16.2,3.5
40,2018,17.7,59.2,15.5,4.6
41,2019,20.6,55.1,13.4,4.1
42,2020,21.4,58.4,13.9,2.9
43,2021,19.9,58.5,13.9,2.7
44,2022,18.7,63.3,10.8,1.3
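
The long-to-wide reshaping above could also be expressed with `DataFrame.pivot`. The sketch below is an alternative formulation, not the notebook's method. It uses the 2014 and 2015 values from the output above as sample data and assumes, based on the comparison strings in the cell, that the level labels carry a leading space.

```python
import pandas as pd

# Long-format sample: overall rows for 2014 and 2015, values copied from the output above.
long_df = pd.DataFrame({
    "Level of satisfaction": [" very much", " very much", " quite", " quite",
                              " not much", " not much", " not at all", " not at all"],
    "Time period": [2014, 2015] * 4,
    "Satisfaction (%)": [18.2, 19.9, 52.1, 55.2, 19.0, 18.5, 4.3, 2.2],
})

# Strip stray whitespace so the labels make clean column names.
long_df["Level of satisfaction"] = long_df["Level of satisfaction"].str.strip()

# One row per year, one column per satisfaction level.
wide_df = long_df.pivot(index="Time period",
                        columns="Level of satisfaction",
                        values="Satisfaction (%)")

# pivot sorts the columns alphabetically: restore the level order and prefix the names.
levels = ["very much", "quite", "not much", "not at all"]
wide_df = wide_df[levels].add_prefix("Satisfaction (%): ").reset_index()
wide_df.columns.name = None
print(wide_df)
```

`pivot` raises an error when a (year, level) pair occurs more than once, so the frame has to be filtered to a single qualification first.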


In [22]:
wageage = pd.read_csv("../datasets/cleaned_csv/D6.csv", encoding="utf-8")

# Keep only the national total across all economic activities
wageage = wageage[(wageage["Economic activity"] == "TOTAL") & (wageage["Area"] == "Italy")]


wageage

Unnamed: 0,Area,Economic activity,Time period,Hourly wage by age (15-29) (€)
0,Italy,TOTAL,2014,9.76
1,Italy,TOTAL,2015,9.9
2,Italy,TOTAL,2016,9.92
3,Italy,TOTAL,2017,10.03
4,Italy,TOTAL,2018,10.06
5,Italy,TOTAL,2019,10.15
6,Italy,TOTAL,2020,10.35
7,Italy,TOTAL,2021,10.37


In [23]:
i2 = pd.merge(wageage, s_df, on="Time period")
i2.to_csv("../datasets/intermediate_datasets/i2.csv", index=False)
i2

Unnamed: 0,Area,Economic activity,Time period,Hourly wage by age (15-29) (€),Satisfaction (%): very much,Satisfaction (%): quite,Satisfaction (%): not much,Satisfaction (%): not at all
0,Italy,TOTAL,2014,9.76,18.2,52.1,19.0,4.3
1,Italy,TOTAL,2015,9.9,19.9,55.2,18.5,2.2
2,Italy,TOTAL,2016,9.92,17.3,53.2,18.3,2.5
3,Italy,TOTAL,2017,10.03,16.9,57.1,16.2,3.5
4,Italy,TOTAL,2018,10.06,17.7,59.2,15.5,4.6
5,Italy,TOTAL,2019,10.15,20.6,55.1,13.4,4.1
6,Italy,TOTAL,2020,10.35,21.4,58.4,13.9,2.9
7,Italy,TOTAL,2021,10.37,19.9,58.5,13.9,2.7
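
To explore the stated question of whether wage and satisfaction move together, one option is a Pearson correlation between the wage column and each satisfaction column. This is a rough sketch rather than the project's analysis. It uses the 2014-2021 wage values and overall satisfaction values shown in the outputs earlier in this section; with only eight yearly observations, any coefficient is indicative at best and says nothing about causation.

```python
import pandas as pd

# Values copied from the D6 and overall satisfaction outputs above (2014-2021).
df = pd.DataFrame({
    "Time period": list(range(2014, 2022)),
    "Hourly wage by age (15-29) (€)": [9.76, 9.9, 9.92, 10.03, 10.06, 10.15, 10.35, 10.37],
    "Satisfaction (%): very much": [18.2, 19.9, 17.3, 16.9, 17.7, 20.6, 21.4, 19.9],
    "Satisfaction (%): quite": [52.1, 55.2, 53.2, 57.1, 59.2, 55.1, 58.4, 58.5],
    "Satisfaction (%): not much": [19.0, 18.5, 18.3, 16.2, 15.5, 13.4, 13.9, 13.9],
    "Satisfaction (%): not at all": [4.3, 2.2, 2.5, 3.5, 4.6, 4.1, 2.9, 2.7],
})

wage_col = "Hourly wage by age (15-29) (€)"
satisfaction_cols = [c for c in df.columns if c.startswith("Satisfaction")]

# Pearson correlation of each satisfaction level with the hourly wage.
correlations = df[satisfaction_cols].corrwith(df[wage_col])
print(correlations.round(2))
```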


### I3 : Work wages across sectors

This dataset combines data on the average salary by type of contract, age of the employees and educational qualification. The final dataset takes into consideration not only the time span, but also the Italian geographical macro-areas (North-West, North-East, Center and South) and the economic activity.

In all three starting datasets, only data for 2014-2021 was available. 

In [9]:
contract = pd.read_csv("../datasets/cleaned_csv/D5.csv", encoding="utf-8")
# Keep only the rows broken down by macro-area and by single economic activity
contract = contract[~contract["Economic activity"].str.startswith("TOTAL")]
contract = contract[contract["Area"] != "Italy"]


contract

Unnamed: 0,Area,Economic activity,Time period,Hourly wage for temporary employees (€),Hourly wage for permanent employees (€)
192,Nord-ovest,Mining and quarrying,2014,13.20,21.56
193,Nord-ovest,Mining and quarrying,2015,13.84,21.51
194,Nord-ovest,Mining and quarrying,2016,12.57,21.98
195,Nord-ovest,Mining and quarrying,2017,12.50,22.07
196,Nord-ovest,Mining and quarrying,2018,12.36,23.14
...,...,...,...,...,...
835,Sud,Other service activities,2017,8.13,8.17
836,Sud,Other service activities,2018,8.08,8.18
837,Sud,Other service activities,2019,8.06,8.24
838,Sud,Other service activities,2020,8.31,8.45


In [10]:
wageage = pd.read_csv("../datasets/cleaned_csv/D6.csv", encoding="utf-8")
# Keep only the rows broken down by macro-area and by single economic activity
wageage = wageage[~wageage["Economic activity"].str.startswith("TOTAL")]
wageage = wageage[wageage["Area"] != "Italy"]


wageage

Unnamed: 0,Area,Economic activity,Time period,Hourly wage by age (15-29) (€)
160,Nord-ovest,Mining and quarrying,2014,14.83
161,Nord-ovest,Mining and quarrying,2015,16.85
162,Nord-ovest,Mining and quarrying,2016,15.32
163,Nord-ovest,Mining and quarrying,2017,16.33
164,Nord-ovest,Mining and quarrying,2018,14.54
...,...,...,...,...
755,Sud,Other service activities,2017,7.72
756,Sud,Other service activities,2018,7.67
757,Sud,Other service activities,2019,7.70
758,Sud,Other service activities,2020,7.81


In [11]:
educationalq = pd.read_csv("../datasets/cleaned_csv/D7.csv", encoding="utf-8")
# Keep only the rows broken down by macro-area and by single economic activity
educationalq = educationalq[~educationalq["Economic activity"].str.startswith("TOTAL")]
educationalq = educationalq[educationalq["Area"] != "Italy"]


educationalq

Unnamed: 0,Area,Economic activity,Time period,"Hourly wage no educational degree, primary and lower secondary school certificate (€)",Hourly wage upper and post secondary (€),"Hourly wage tertiary (university, doctoral and specialization courses) (€)"
192,Nord-ovest,Mining and quarrying,2014,13.39,21.33,30.27
193,Nord-ovest,Mining and quarrying,2015,13.62,21.16,30.75
194,Nord-ovest,Mining and quarrying,2016,13.59,22.02,30.45
195,Nord-ovest,Mining and quarrying,2017,13.68,21.34,29.43
196,Nord-ovest,Mining and quarrying,2018,13.68,21.86,31.22
...,...,...,...,...,...,...
835,Sud,Other service activities,2017,8.02,8.17,9.64
836,Sud,Other service activities,2018,8.05,8.12,9.53
837,Sud,Other service activities,2019,8.09,8.14,9.59
838,Sud,Other service activities,2020,8.27,8.38,10.45


In [12]:
i3 = pd.merge(pd.merge(contract, educationalq, on=["Time period", "Area", "Economic activity"]), wageage, on=["Time period", "Area", "Economic activity"])

i3.to_csv("../datasets/intermediate_datasets/i3.csv", index=False)

i3.head(5)


Unnamed: 0,Area,Economic activity,Time period,Hourly wage for temporary employees (€),Hourly wage for permanent employees (€),"Hourly wage no educational degree, primary and lower secondary school certificate (€)",Hourly wage upper and post secondary (€),"Hourly wage tertiary (university, doctoral and specialization courses) (€)",Hourly wage by age (15-29) (€)
0,Nord-ovest,Mining and quarrying,2014,13.2,21.56,13.39,21.33,30.27,14.83
1,Nord-ovest,Mining and quarrying,2015,13.84,21.51,13.62,21.16,30.75,16.85
2,Nord-ovest,Mining and quarrying,2016,12.57,21.98,13.59,22.02,30.45,15.32
3,Nord-ovest,Mining and quarrying,2017,12.5,22.07,13.68,21.34,29.43,16.33
4,Nord-ovest,Mining and quarrying,2018,12.36,23.14,13.68,21.86,31.22,14.54
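
The three-way merge above shares the same key columns at every step, so it could also be written generically with `functools.reduce`. The sketch below is one possible formulation under that assumption, using a single sample row per input copied from the previews above (Nord-ovest, Mining and quarrying, 2014).

```python
from functools import reduce

import pandas as pd

keys = ["Time period", "Area", "Economic activity"]

# One sample row per source, copied from the D5, D7 and D6 previews above.
contract = pd.DataFrame({
    "Area": ["Nord-ovest"], "Economic activity": ["Mining and quarrying"],
    "Time period": [2014],
    "Hourly wage for temporary employees (€)": [13.20],
    "Hourly wage for permanent employees (€)": [21.56],
})
educationalq = pd.DataFrame({
    "Area": ["Nord-ovest"], "Economic activity": ["Mining and quarrying"],
    "Time period": [2014],
    "Hourly wage no educational degree, primary and lower secondary school certificate (€)": [13.39],
    "Hourly wage upper and post secondary (€)": [21.33],
    "Hourly wage tertiary (university, doctoral and specialization courses) (€)": [30.27],
})
wageage = pd.DataFrame({
    "Area": ["Nord-ovest"], "Economic activity": ["Mining and quarrying"],
    "Time period": [2014],
    "Hourly wage by age (15-29) (€)": [14.83],
})

# Fold the list of frames into one, merging pairwise on the shared keys.
frames = [contract, educationalq, wageage]
i3 = reduce(lambda left, right: pd.merge(left, right, on=keys), frames)
print(i3.shape)
```

Adding a fourth source then only requires appending it to `frames`.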


### I4 : Graduates' employment status

This dataset focuses on graduates' working conditions, in particular their level of satisfaction at work, their employment rate and their average wage. 
The analysis is quite general, since neither geographical subdivisions nor economic activities are taken into consideration: this information is not present in D4 and D17. 

Moreover, the age span considered in this analysis is not consistent:
- D4 covers the 15-24 age span. 
- D17 considers the employment rate of graduates 3 years after graduation. No data about their age is present. 
- D7 considers only the hourly wage by educational qualification, with no data related to age. 

Although we considered dropping the age range in D4, removing this variable would have meant losing the focus on young people. D4 does not provide any age group that could be considered "youth" other than the one indicated, so the chosen age range has been retained.

D17 provides information on the 3 years following the completion of a master's degree. According to AlmaLaurea data, a master's degree is obtained on average at around the age of 27, so these data refer to graduates who are about 30 years old.

The hourly wage by level of education is reported as the median, which describes the central position of the ordered data and is not sensitive to outliers. 
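
A short illustration of why the median is robust to outliers while the mean is not. The wage values below are hypothetical and chosen only for the example; they are not taken from D7.

```python
import statistics

# Hypothetical hourly wages in euros (illustrative values, not from D7).
wages = [9.5, 10.0, 10.5, 11.0, 11.5]

# The same wages plus one extreme value.
wages_with_outlier = wages + [80.0]

# Mean and median agree on the symmetric sample.
print(statistics.mean(wages), statistics.median(wages))

# A single outlier more than doubles the mean, while the median barely moves.
print(statistics.mean(wages_with_outlier), statistics.median(wages_with_outlier))
```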

As before, the time span considered for this analysis is 2014-2021, since D7 contains data only for this period.

In [13]:
satisfaction = pd.read_csv("../datasets/cleaned_csv/D4.csv", encoding="utf-8")
satisfaction

Unnamed: 0,Level of satisfaction,Qualification,Time period,Satisfaction (%)
0,very much,"Primary school certificate, no educational degree",2014,31.3
1,very much,"Primary school certificate, no educational degree",2015,18.7
2,very much,"Primary school certificate, no educational degree",2016,0.0
3,very much,"Primary school certificate, no educational degree",2017,32.3
4,very much,"Primary school certificate, no educational degree",2018,44.7
...,...,...,...,...
175,not at all,Total,2018,4.6
176,not at all,Total,2019,4.1
177,not at all,Total,2020,2.9
178,not at all,Total,2021,2.7


In [14]:
# Keep only the rows for tertiary education
tertiary = "Tertiary (university, doctoral and specialization courses)"
satisfaction = satisfaction[satisfaction["Qualification"] == tertiary]

# One row per year, with one column per satisfaction level
s_df = satisfaction[["Time period"]].drop_duplicates().copy()
for level in ["very much", "quite", "not much", "not at all"]:
    # The level labels are matched with a leading space, as in the original comparison
    is_level = satisfaction["Level of satisfaction"] == " " + level
    s_df["Satisfaction (%): " + level] = satisfaction.loc[is_level, "Satisfaction (%)"].tolist()

s_df

Unnamed: 0,Time period,Satisfaction (%): very much,Satisfaction (%): quite,Satisfaction (%): not much,Satisfaction (%): not at all
27,2014,17.4,43.9,19.5,2.2
28,2015,31.2,55.4,13.4,0.0
29,2016,22.3,53.5,16.6,0.0
30,2017,16.1,48.7,22.5,5.9
31,2018,18.5,65.6,8.7,2.6
32,2019,20.9,58.0,13.9,0.0
33,2020,24.9,68.5,2.7,1.6
34,2021,17.0,61.9,11.7,5.1
35,2022,20.6,70.8,6.7,0.0


In [15]:
graduates = pd.read_csv("../datasets/cleaned_csv/D17.csv", encoding="utf-8")
graduates = graduates[['Graduates occupational rate (%)', 'Time period']]
graduates

Unnamed: 0,Graduates occupational rate (%),Time period
0,799,2014
1,796,2015
2,801,2016
3,825,2017
4,818,2018
5,844,2019
6,839,2020
7,857,2021
8,861,2022


In [16]:
graduateswage = pd.read_csv("../datasets/cleaned_csv/D7.csv", encoding="utf-8")
graduateswage = graduateswage[['Area', 'Economic activity', 'Time period', 'Hourly wage tertiary (university, doctoral and specialization courses) (€)']]
graduateswage

Unnamed: 0,Area,Economic activity,Time period,"Hourly wage tertiary (university, doctoral and specialization courses) (€)"
0,Italy,TOTAL,2014,13.83
1,Italy,TOTAL,2015,13.77
2,Italy,TOTAL,2016,13.80
3,Italy,TOTAL,2017,13.85
4,Italy,TOTAL,2018,13.86
...,...,...,...,...
835,Sud,Other service activities,2017,9.64
836,Sud,Other service activities,2018,9.53
837,Sud,Other service activities,2019,9.59
838,Sud,Other service activities,2020,10.45


In [17]:
# Keep only the national total across all economic activities
graduateswage = graduateswage[(graduateswage["Economic activity"] == "TOTAL") & (graduateswage["Area"] == "Italy")]

        
graduateswage

Unnamed: 0,Area,Economic activity,Time period,"Hourly wage tertiary (university, doctoral and specialization courses) (€)"
0,Italy,TOTAL,2014,13.83
1,Italy,TOTAL,2015,13.77
2,Italy,TOTAL,2016,13.8
3,Italy,TOTAL,2017,13.85
4,Italy,TOTAL,2018,13.86
5,Italy,TOTAL,2019,14.11
6,Italy,TOTAL,2020,14.57
7,Italy,TOTAL,2021,14.75


In [18]:
i4 = pd.merge(pd.merge(s_df, graduates, on="Time period"), graduateswage, on="Time period")

i4.to_csv("../datasets/intermediate_datasets/i4.csv", index=False)

i4

Unnamed: 0,Time period,Satisfaction (%): very much,Satisfaction (%): quite,Satisfaction (%): not much,Satisfaction (%): not at all,Graduates occupational rate (%),Area,Economic activity,"Hourly wage tertiary (university, doctoral and specialization courses) (€)"
0,2014,17.4,43.9,19.5,2.2,799,Italy,TOTAL,13.83
1,2015,31.2,55.4,13.4,0.0,796,Italy,TOTAL,13.77
2,2016,22.3,53.5,16.6,0.0,801,Italy,TOTAL,13.8
3,2017,16.1,48.7,22.5,5.9,825,Italy,TOTAL,13.85
4,2018,18.5,65.6,8.7,2.6,818,Italy,TOTAL,13.86
5,2019,20.9,58.0,13.9,0.0,844,Italy,TOTAL,14.11
6,2020,24.9,68.5,2.7,1.6,839,Italy,TOTAL,14.57
7,2021,17.0,61.9,11.7,5.1,857,Italy,TOTAL,14.75


### I5 : Emigration and economic independence

This dataset contains the number of young people who emigrated from Italy, the house price indices and the share of young people still living at home with their parents, from 2014 to 2022.

In [19]:
emigrated = pd.read_csv("../datasets/cleaned_csv/D1.csv", encoding="utf-8")
emigrated.head(5)

Unnamed: 0,Area,Time period,Number of emigrates
0,Italy,2014,45074
1,Italy,2015,51048
2,Italy,2016,60788
3,Italy,2017,61553
4,Italy,2018,63570


In [20]:
young = pd.read_csv("../datasets/intermediate_datasets/i1.csv", encoding="utf-8")
young.head(5)

Unnamed: 0,Area,Time period,Young people still living with their parents (%),House prices indices (2015 = 100)
0,Nord-ovest,2014,60.6,104.3
1,Nord-ovest,2015,58.4,100.0
2,Nord-ovest,2016,59.3,100.2
3,Nord-ovest,2017,57.2,99.5
4,Nord-ovest,2018,57.6,99.4


In [21]:
i5 = pd.merge(young, emigrated, on=["Time period", "Area"])
i5.to_csv("../datasets/intermediate_datasets/i5.csv", index=False)
i5.head(5)

Unnamed: 0,Area,Time period,Young people still living with their parents (%),House prices indices (2015 = 100),Number of emigrates
0,Nord-ovest,2014,60.6,104.3,13043
1,Nord-ovest,2015,58.4,100.0,14567
2,Nord-ovest,2016,59.3,100.2,16597
3,Nord-ovest,2017,57.2,99.5,16818
4,Nord-ovest,2018,57.6,99.4,17253
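
Because the merge above is an inner join, rows whose (`Area`, `Time period`) pair exists in only one input are dropped without notice. The D1 preview shows rows with `Area` equal to `Italy`, which have no counterpart in I1. A sketch of how such rows could be listed with the standard `indicator` option of `pandas.merge`, using a few sample rows copied from the outputs above:

```python
import pandas as pd

# Sample rows copied from the I1 and D1 outputs above.
young = pd.DataFrame({
    "Area": ["Nord-ovest", "Nord-ovest"],
    "Time period": [2014, 2015],
    "Young people still living with their parents (%)": [60.6, 58.4],
    "House prices indices (2015 = 100)": [104.3, 100.0],
})
emigrated = pd.DataFrame({
    "Area": ["Italy", "Nord-ovest", "Nord-ovest"],
    "Time period": [2014, 2014, 2015],
    "Number of emigrates": [45074, 13043, 14567],
})

# An outer merge keeps every key; the "_merge" column records where each came from.
check = pd.merge(young, emigrated, on=["Time period", "Area"],
                 how="outer", indicator=True)

# Rows found in only one input are the ones an inner merge would discard.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Area", "Time period", "_merge"]])
```

Here the only unmatched row is the national total for 2014, which is expected since I1 is broken down by macro-area.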
