### D1. Young people (18-39) of Italian citizenship who emigrated abroad from 2014 to 2022


This dataset has been downloaded from Istat and has already been filtered based on the following criteria:

- Indicator: deregistrations
- Type of transfer:  abroad
- Citizenship: Italian
- Gender: all variables
- Age: 18-39

Within the dataset, the following columns are redundant, because each one either holds the same value in every row or repeats, as a code, the content of a descriptive column:

- FREQ - frequency
- REF_AREA - territory of origin
- CHANGE_OF_RESIDENCE - type of transfer
- CITIZENSHIP - citizenship
- SEX - gender
- AGE - age
- OBS_STATUS - observation status

The last column, OBS_STATUS, contains NaN values, which are not of interest for our research purposes. Therefore, the column is dropped outright rather than having its values replaced.

Since these columns carry no additional information, they are removed. The cleaned dataset keeps only:

- Territory of origin (including both Italy and the subdivision into macro-areas: Northwest, Northeast, Center, South)
- Time period (spanning from 2014 to 2022)
- Observed data, in this case, the number of emigrated individuals.
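
One way to check which columns are redundant before dropping them is to count the distinct values in each column. The sketch below runs on a small hand-made frame, not on the Istat file: the column names mirror the export, the values are illustrative, and `constant_columns` is a hypothetical helper.

```python
import pandas as pd

# Toy frame mimicking the layout of the Istat export (values are illustrative).
sample = pd.DataFrame({
    "FREQ": ["A", "A", "A"],
    "Territory of previous residence": ["Italy", "Italy", "Nord-ovest"],
    "Time (TIME_PERIOD)": [2014, 2015, 2014],
    "Observation": [10, 20, 5],
    "OBS_STATUS": [None, None, None],
})


def constant_columns(df):
    """Return the columns that hold at most one distinct non-null value."""
    counts = df.nunique(dropna=True)
    return counts[counts <= 1].index.tolist()


redundant = constant_columns(sample)
print(redundant)  # ['FREQ', 'OBS_STATUS']
```

Columns reported by such a check are candidates for removal. Code columns such as REF_AREA vary from row to row, so they would still need to be dropped by name.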

In [1]:
import pandas as pd

original = pd.read_csv("../datasets/emigrazione/Emigrants - province of origin (IT1,28_185_DF_DCIS_MIGRAZIONI_7,1.0).csv", encoding="utf-8")
original.head(5)

Unnamed: 0,FREQ,Frequency,REF_AREA,Territory of previous residence,DATA_TYPE,Indicator,CHANGE_OF_RESIDENCE,Change of residence,CITIZENSHIP,Citizenship (DESC),SEX,Gender,AGE,Age (DESC),Time (TIME_PERIOD),Observation,OBS_STATUS,Observation status
0,A,Annual,IT,Italy,TDEREG,Deregistrations,FREIGN,Abroad,IT,Italy,9,Total,Y18-39,18-39 years,2014,45074,,
1,A,Annual,IT,Italy,TDEREG,Deregistrations,FREIGN,Abroad,IT,Italy,9,Total,Y18-39,18-39 years,2015,51048,,
2,A,Annual,IT,Italy,TDEREG,Deregistrations,FREIGN,Abroad,IT,Italy,9,Total,Y18-39,18-39 years,2016,60788,,
3,A,Annual,IT,Italy,TDEREG,Deregistrations,FREIGN,Abroad,IT,Italy,9,Total,Y18-39,18-39 years,2017,61553,,
4,A,Annual,IT,Italy,TDEREG,Deregistrations,FREIGN,Abroad,IT,Italy,9,Total,Y18-39,18-39 years,2018,63570,,


In [2]:

selected_columns = ['Territory of previous residence', 'Time (TIME_PERIOD)', "Observation"]  
d1 = original[selected_columns].copy()

column_name_mapping = {
    'Territory of previous residence': 'Area',
    'Time (TIME_PERIOD)': 'Time period',
    'Observation': 'Number of emigrates',
}


d1.rename(columns=column_name_mapping, inplace=True)

d1.to_csv("../datasets/cleaned_csv/D1.csv", index=False)

d1.head(5)


Unnamed: 0,Area,Time period,Number of emigrates
0,Italy,2014,45074
1,Italy,2015,51048
2,Italy,2016,60788
3,Italy,2017,61553
4,Italy,2018,63570


### D2. Young people (18-34) still living with at least one of their parents (2014-2022)


This dataset has been downloaded from Istat and has already been filtered based on the following criteria:

- Indicator: young people who still live with at least one of their parents
- Type: married and unmarried
- Measure: per 100 people with the same characteristics
- Gender: all variables
- Age: 18-34

Because some values in the dataset contain a comma, the semicolon (";") has been chosen as the delimiter to prevent ambiguity.
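
As a minimal illustration of why the delimiter matters, the sketch below parses a hypothetical two-row extract in which one label contains a comma. The rows are made up for the example. Only `pd.read_csv` with `sep=";"` corresponds to what the notebook does.

```python
import io

import pandas as pd

# Hypothetical extract: the first label contains a comma.
raw = (
    "Label;Observation\n"
    "Primary school certificate, no educational degree;31.3\n"
    "Total;4.6\n"
)

# With sep=";" the comma stays inside the label instead of splitting the row.
parsed = pd.read_csv(io.StringIO(raw), sep=";")
print(parsed.shape)  # (2, 2)
```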

Within the dataset, the following columns are redundant, because each one either holds the same value in every row or repeats, as a code, the content of a descriptive column:

- FREQ - frequency
- REF_AREA - territory
- MEASURE - measure
- SEX - gender
- AGE - age
- OBS_STATUS - observation status

The last column, OBS_STATUS, contains NaN values, which are not of interest for our research purposes. Therefore, the column is dropped outright rather than having its values replaced.

Since these columns carry no additional information, they are removed. The cleaned dataset keeps only:

- Territory (including both Italy and the subdivision into macro-areas: Northwest, Northeast, Center, South)
- Time period (spanning from 2014 to 2022)
- Observed data, in this case, the percentage of young people still living with their parents.

In [3]:
original = pd.read_csv("../datasets/giovani_casa/Young people living in family - reg. (IT1,83_63_DF_DCCV_AVQ_PERSONE_121,1.0).csv", encoding="utf-8",  sep=';')

original.head(5)

Unnamed: 0,FREQ,Frequency,REF_AREA,Territory,DATA_TYPE,Indicator,MEASURE,Measure (DESC),SEX,Gender,AGE,Age (DESC),Time (TIME_PERIOD),Observation,OBS_STATUS,Observation status
0,A,Annual,ITC,Nord-ovest,18_YOUNG,Young unmarried people aged 18-34 years living...,HSC,Per 100 people with the same characteristics,9,Total,Y18-34,18-34 years,2014,60.6,,
1,A,Annual,ITC,Nord-ovest,18_YOUNG,Young unmarried people aged 18-34 years living...,HSC,Per 100 people with the same characteristics,9,Total,Y18-34,18-34 years,2015,58.4,,
2,A,Annual,ITC,Nord-ovest,18_YOUNG,Young unmarried people aged 18-34 years living...,HSC,Per 100 people with the same characteristics,9,Total,Y18-34,18-34 years,2016,59.3,,
3,A,Annual,ITC,Nord-ovest,18_YOUNG,Young unmarried people aged 18-34 years living...,HSC,Per 100 people with the same characteristics,9,Total,Y18-34,18-34 years,2017,57.2,,
4,A,Annual,ITC,Nord-ovest,18_YOUNG,Young unmarried people aged 18-34 years living...,HSC,Per 100 people with the same characteristics,9,Total,Y18-34,18-34 years,2018,57.6,,


In [4]:
selected_columns = ['Territory', 'Time (TIME_PERIOD)', "Observation"]  
d2 = original[selected_columns].copy()

column_name_mapping = {
    'Territory': 'Area',
    'Time (TIME_PERIOD)': 'Time period',
    'Observation': 'Young people still living with their parents (%)',
}

d2.rename(columns=column_name_mapping, inplace=True)

d2.to_csv("../datasets/cleaned_csv/D2.csv", index=False)

d2


Unnamed: 0,Area,Time period,Young people still living with their parents (%)
0,Nord-ovest,2014,60.6
1,Nord-ovest,2015,58.4
2,Nord-ovest,2016,59.3
3,Nord-ovest,2017,57.2
4,Nord-ovest,2018,57.6
5,Nord-ovest,2019,58.7
6,Nord-ovest,2020,57.6
7,Nord-ovest,2021,63.3
8,Nord-ovest,2022,63.7
9,Nord-est,2014,59.5


### D3. House prices


This dataset has been downloaded from Istat and has already been filtered based on the following criteria:

- Indicator: house price index (base 2015 = 100)
- Dwellings sold: all categories (both existing and new)


Within the dataset, the following columns are redundant, because each one either holds the same value in every row or repeats, as a code, the content of a descriptive column:

- FREQ - frequency
- REF_AREA - territory
- MEASURE - measure
- PURCHASES_DWELLINGS - dwellings sold
- OBS_STATUS - observation status

The last column, OBS_STATUS, contains NaN values, which are not of interest for our research purposes. Therefore, the column is dropped outright rather than having its values replaced.

Since these columns carry no additional information, they are removed. The cleaned dataset keeps only:

- Territory (including both Italy and the subdivision into macro-areas: Northwest, Northeast, Center, South)
- Time period (spanning from 2014 to 2022)
- Observed data, in this case, the house price index.
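
Because the index is expressed relative to a base of 100 in 2015, each value can be read as a percentage difference from the 2015 price level. The short sketch below makes that reading explicit. `percent_vs_base` is a hypothetical helper, and the two inputs are the Nord-ovest values for 2014 and 2018 from the preview output in this section.

```python
def percent_vs_base(index_value, base=100.0):
    """Percentage difference between an index value and its base-year level."""
    return (index_value - base) / base * 100.0


# Nord-ovest: 104.3 in 2014 and 99.4 in 2018 (base 2015 = 100).
change_2014 = percent_vs_base(104.3)
change_2018 = percent_vs_base(99.4)
print(round(change_2014, 1), round(change_2018, 1))  # 4.3 -0.6
```

Read this way, Nord-ovest prices in 2014 were about 4.3% above the 2015 level, and in 2018 about 0.6% below it.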

In [5]:
import pandas as pd

original = pd.read_csv("../datasets/prezzi_case/Annual average from 2010 onwards (base 2015) (IT1,143_497_DF_DCSP_IPAB_2,1.0).csv", encoding="utf-8")

original.head(5)

Unnamed: 0,FREQ,Frequency,REF_AREA,Territory,DATA_TYPE,Indicator,MEASURE,Measure (DESC),PURCHASES_DWELLINGS,Purchases of dwellings,Time (TIME_PERIOD),Observation,OBS_STATUS,Observation status
0,A,Annual,ITC,Nord-ovest,60,House price index (base 2015=100) - annual ave...,4,Index number,ALL,H1 - all items,2014,104.3,,
1,A,Annual,ITC,Nord-ovest,60,House price index (base 2015=100) - annual ave...,4,Index number,ALL,H1 - all items,2015,100.0,,
2,A,Annual,ITC,Nord-ovest,60,House price index (base 2015=100) - annual ave...,4,Index number,ALL,H1 - all items,2016,100.2,,
3,A,Annual,ITC,Nord-ovest,60,House price index (base 2015=100) - annual ave...,4,Index number,ALL,H1 - all items,2017,99.5,,
4,A,Annual,ITC,Nord-ovest,60,House price index (base 2015=100) - annual ave...,4,Index number,ALL,H1 - all items,2018,99.4,,


In [6]:
selected_columns = ['Territory', 'Time (TIME_PERIOD)', "Observation"]  
d3 = original[selected_columns].copy()

column_name_mapping = {
    'Territory': 'Area',
    'Time (TIME_PERIOD)': 'Time period',
    'Observation': 'House prices indices (2015 = 100)',
}

d3.rename(columns=column_name_mapping, inplace=True)

d3.to_csv("../datasets/cleaned_csv/D3.csv", index=False)
d3.head(5)

Unnamed: 0,Area,Time period,House prices indices (2015 = 100)
0,Nord-ovest,2014,104.3
1,Nord-ovest,2015,100.0
2,Nord-ovest,2016,100.2
3,Nord-ovest,2017,99.5
4,Nord-ovest,2018,99.4


### D4. Work satisfaction


This dataset has been downloaded from Istat and has already been filtered based on the following criteria:

- Indicator: employed persons aged 15 years and over
- Age: 15-24
- Gender: all variables


Within the dataset, the following columns are redundant, because each one either holds the same value in every row or repeats, as a code, the content of a descriptive column:

- FREQ - frequency
- DATA_TYPE - indicator
- MEASURE - measure
- SEX - gender
- AGE - age
- OBS_STATUS - observation status

The last column, OBS_STATUS, contains NaN values, which are not of interest for our research purposes. Therefore, the column is dropped outright rather than having its values replaced.

Since these columns carry no additional information, they are removed. The cleaned dataset keeps only:

- Level of satisfaction (taken from the indicator description)
- Educational qualification
- Time period (spanning from 2014 to 2022)
- Observed data, in this case, the percentage of employed people at each level of satisfaction.
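
The cleaning cell for this dataset extracts the satisfaction level from the indicator description by splitting it on the colon. The sketch below shows the idea on hypothetical labels: the full Istat text is truncated in the preview, so only the "description: level" shape is assumed. The `.str.strip()` call is an addition that removes any space left after the colon.

```python
import pandas as pd

# Hypothetical labels: only the "<description>: <level>" shape is assumed.
labels = pd.Series([
    "Employed persons by level of satisfaction: very much",
    "Employed persons by level of satisfaction: not at all",
])

# Keep the part after the first colon and trim surrounding whitespace.
levels = labels.str.split(":", n=1).str[1].str.strip()
print(levels.tolist())  # ['very much', 'not at all']
```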

In [7]:
original = pd.read_csv("../datasets/soddisfazione/Work satisfaction - age, educational level (IT1,83_63_DF_DCCV_AVQ_PERSONE_152,1.0).csv", encoding="utf-8" ,sep=";")

original.head(5)

Unnamed: 0,FREQ,Frequency,DATA_TYPE,Indicator,MEASURE,Measure (DESC),SEX,Gender,AGE,Age (DESC),EDU_LEV_HIGHEST,Highest level of education attained,Time (TIME_PERIOD),Observation,OBS_STATUS,Observation status
0,A,Annual,15_EMPL_SVER,Employed persons aged 15 years and over by lev...,HSC,Per 100 people with the same characteristics,9,Total,Y15-24,15-24 years,3,"Primary school certificate, no educational degree",2014,31.3,,
1,A,Annual,15_EMPL_SVER,Employed persons aged 15 years and over by lev...,HSC,Per 100 people with the same characteristics,9,Total,Y15-24,15-24 years,3,"Primary school certificate, no educational degree",2015,18.7,,
2,A,Annual,15_EMPL_SVER,Employed persons aged 15 years and over by lev...,HSC,Per 100 people with the same characteristics,9,Total,Y15-24,15-24 years,3,"Primary school certificate, no educational degree",2016,0.0,,
3,A,Annual,15_EMPL_SVER,Employed persons aged 15 years and over by lev...,HSC,Per 100 people with the same characteristics,9,Total,Y15-24,15-24 years,3,"Primary school certificate, no educational degree",2017,32.3,,
4,A,Annual,15_EMPL_SVER,Employed persons aged 15 years and over by lev...,HSC,Per 100 people with the same characteristics,9,Total,Y15-24,15-24 years,3,"Primary school certificate, no educational degree",2018,44.7,,


In [8]:
selected_columns = ['Indicator', 'Highest level of education attained', 'Time (TIME_PERIOD)', 'Observation']
d4 = original[selected_columns].copy()

column_name_mapping = {
    'Indicator': 'Level of satisfaction',
    'Highest level of education attained': 'Qualification',
    'Time (TIME_PERIOD)': 'Time period',
    'Observation': 'Satisfaction (%)',
}

d4.rename(columns=column_name_mapping, inplace=True)

d4['Level of satisfaction'] = d4['Level of satisfaction'].str.split(':').str[1]

d4.to_csv("../datasets/cleaned_csv/D4.csv", index=False)

d4


Unnamed: 0,Level of satisfaction,Qualification,Time period,Satisfaction (%)
0,very much,"Primary school certificate, no educational degree",2014,31.3
1,very much,"Primary school certificate, no educational degree",2015,18.7
2,very much,"Primary school certificate, no educational degree",2016,0.0
3,very much,"Primary school certificate, no educational degree",2017,32.3
4,very much,"Primary school certificate, no educational degree",2018,44.7
...,...,...,...,...
175,not at all,Total,2018,4.6
176,not at all,Total,2019,4.1
177,not at all,Total,2020,2.9
178,not at all,Total,2021,2.7


### D5. Hourly wage based on type of contract (2014-2021)


This dataset has been downloaded from Istat and has already been filtered based on the following criteria:

- Indicator: Hourly gross pay per compensated hour for employee positions in euros (median).
- Gender: all variables
- Employee class: all variables
- Employment status: total


Within the dataset, the following columns are redundant, because each one either holds the same value in every row or repeats, as a code, the content of a descriptive column:

- FREQ - frequency
- REF_AREA - territory
- DATA_TYPE - indicator type
- TYPE_OF_CONTRACT - type of contract
- NOTE_EMPLOYESS_CLASS - employee class
- CONTARCTUAL_OCCUPATION - contractual occupation
- ECON_ACTIVITY_NACE_2007 - economic activity (ATECO 2007)
- NOTE_TIME_PERIOD - time
- BASE_PER - base year
- UNIT_MEAS - unit of measure
- UNIT_MULT - multiplication unit
- OBS_STATUS - observation status

The last six columns contain NaN values, which are not of interest for our research purposes. Therefore, these columns are dropped outright rather than having their values replaced.

Since these columns carry no additional information, they are removed. The cleaned dataset keeps only:

- Territory (including both Italy and the subdivision into macro-areas: Northwest, Northeast, Center, South)
- Type of contract (temporary or permanent)
- Time period (spanning from 2014 to 2021)
- Economic activity
- Observed data, in this case, the hourly wage.
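
The cleaned D5 table stores the temporary and permanent wages in two separate columns. As an alternative to the row-by-row approach used in the notebook, a pivot can produce the same wide layout while matching rows by key rather than by position. The sketch below is an illustration under assumptions: the column name `Hourly wage (€)` is hypothetical, the four wage values are the Italy/TOTAL figures for 2014 and 2015 from this section's output, and passing a list to `index` requires pandas 1.1 or later.

```python
import pandas as pd

# Toy long-format data: one row per (area, activity, year, contract type).
long_df = pd.DataFrame({
    "Area": ["Italy"] * 4,
    "Economic activity": ["TOTAL"] * 4,
    "Time period": [2014, 2014, 2015, 2015],
    "Type of contract": ["Temporary employees", "Permanent employees"] * 2,
    "Hourly wage (€)": [10.05, 11.69, 10.33, 11.70],
})

# Spread the contract types into columns, one row per key combination.
wide_df = long_df.pivot(
    index=["Area", "Economic activity", "Time period"],
    columns="Type of contract",
    values="Hourly wage (€)",
).reset_index()
wide_df.columns.name = None  # drop the leftover "Type of contract" axis label

print(wide_df.columns.tolist())
```

If a key combination lacks one contract type, the pivot leaves a NaN in that cell instead of shifting the remaining values.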

In [9]:
original = pd.read_csv("../datasets/tipo_contratto/Kind of labour contract (IT1,533_957_DF_DCSC_RACLI_4,1.0).csv", encoding="utf-8" ,sep=";")

original.head(5)


Unnamed: 0,FREQ,Frequency,REF_AREA,Territory,DATA_TYPE,Indicator,SEX,Sex (DESC),TYPE_OF_CONTRACT,Type of employment contract,...,NOTE_EMPLOYESS_CLASS,Employees class (NOTE_EMPLOYESS_CLASS),NOTE_TIME_PERIOD,Time,BASE_PER,Base year,UNIT_MEAS,Measure unit,UNIT_MULT,Multiplication unit
0,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,1,Temporary employees,...,,,,,,,,,,
1,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,1,Temporary employees,...,,,,,,,,,,
2,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,1,Temporary employees,...,,,,,,,,,,
3,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,1,Temporary employees,...,,,,,,,,,,
4,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,1,Temporary employees,...,,,,,,,,,,


In [10]:
selected_columns = ['Territory', 'Type of employment contract','Economic activity (NACE Rev. 2)', 'Time (TIME_PERIOD)', 'Observation']
d5 = original[selected_columns].copy()

column_name_mapping = {
    'Territory': 'Area',
    'Type of employment contract': 'Type of contract',
    'Economic activity (NACE Rev. 2)' : 'Economic activity',
    'Time (TIME_PERIOD)': 'Time period',
    'Observation': 'Hourly wage by type of contract (€)',
}

d5.rename(columns=column_name_mapping, inplace=True)
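wage_col = 'Hourly wage by type of contract (€)'

# Drop the aggregate "Total" rows, keeping the original index labels.
d5 = d5[d5['Type of contract'] != "Total"]

# Collect the wages of each contract type in row order.
x = d5.loc[d5['Type of contract'] == "Temporary employees", wage_col].tolist()
y = d5.loc[d5['Type of contract'] == "Permanent employees", wage_col].tolist()
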

df = d5[['Area', 'Economic activity', 'Time period']]
s_d5 = df.drop_duplicates().copy()
s_d5['Hourly wage for temporary employees (€)'] = x
s_d5['Hourly wage for permanent employees (€)'] = y
       
s_d5.to_csv("../datasets/cleaned_csv/D5.csv", index=False)

s_d5

Unnamed: 0,Area,Economic activity,Time period,Hourly wage for temporary employees (€),Hourly wage for permanent employees (€)
0,Italy,TOTAL,2014,10.05,11.69
1,Italy,TOTAL,2015,10.33,11.70
2,Italy,TOTAL,2016,10.24,11.81
3,Italy,TOTAL,2017,10.25,12.03
4,Italy,TOTAL,2018,10.27,12.10
...,...,...,...,...,...
2179,Sud,Other service activities,2017,8.13,8.17
2180,Sud,Other service activities,2018,8.08,8.18
2181,Sud,Other service activities,2019,8.06,8.24
2182,Sud,Other service activities,2020,8.31,8.45


### D6. Hourly wage by age bracket (2014-2021)


This dataset has been downloaded from Istat and has already been filtered based on the following criteria:

- Indicator: Hourly gross pay per compensated hour for employee positions in euros (median).
- Gender: all variables
- Age: 15-29
- Employee class: all variables
- Employment status: total


Within the dataset, the following columns are redundant, because each one either holds the same value in every row or repeats, as a code, the content of a descriptive column:

- FREQ - frequency
- REF_AREA - territory
- DATA_TYPE - indicator type
- SEX - sex
- AGE - age
- NOTE_ECON_ACTIVITY_NACE_2007 - economic activity (ATECO 2007)
- NOTE_CONTARCTUAL_OCCUPATION - contractual occupation
- NOTE_EMPLOYESS_CLASS - employee class
- NOTE_TIME_PERIOD - time
- BASE_PER - base year
- UNIT_MEAS - unit of measure
- UNIT_MULT - multiplication unit
- OBS_STATUS - observation status

The last six columns contain NaN values, which are not of interest for our research purposes. Therefore, these columns are dropped outright rather than having their values replaced.
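
The notebook removes these columns implicitly, by selecting only the columns it wants to keep. For comparison, the sketch below shows how columns consisting entirely of missing values could be dropped explicitly with `DataFrame.dropna`. The frame is a toy example: the column names follow the export, and the values are illustrative.

```python
import pandas as pd

nan = float("nan")

# Toy frame: two informative columns and two columns that are entirely NaN.
sample = pd.DataFrame({
    "Territory": ["Italy", "Sud"],
    "Observation": [9.76, 7.72],
    "BASE_PER": [nan, nan],
    "UNIT_MEAS": [nan, nan],
})

# how="all" drops a column only when every value in it is missing.
trimmed = sample.dropna(axis=1, how="all")
print(trimmed.columns.tolist())  # ['Territory', 'Observation']
```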

Since these columns carry no additional information, they are removed. The cleaned dataset keeps only:

- Territory (including both Italy and the subdivision into macro-areas: Northwest, Northeast, Center, South)
- Time period (spanning from 2014 to 2021)
- Economic activity
- Observed data, in this case, the hourly wage.

In [11]:
original = pd.read_csv("../datasets/retribuzione/Age class (IT1,533_957_DF_DCSC_RACLI_1,1.0)(1).csv", encoding="utf-8", sep=";")

original.head(5)

Unnamed: 0,FREQ,Frequency,REF_AREA,Territory,DATA_TYPE,Indicator,SEX,Sex (DESC),AGE,Age (DESC),...,NOTE_EMPLOYESS_CLASS,Employees class (NOTE_EMPLOYESS_CLASS),NOTE_TIME_PERIOD,Time,BASE_PER,Base year,UNIT_MEAS,Measure unit,UNIT_MULT,Multiplication unit
0,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,Y15-29,15-29 years,...,,,,,,,,,,
1,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,Y15-29,15-29 years,...,,,,,,,,,,
2,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,Y15-29,15-29 years,...,,,,,,,,,,
3,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,Y15-29,15-29 years,...,,,,,,,,,,
4,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,Y15-29,15-29 years,...,,,,,,,,,,


In [12]:
selected_columns = ['Territory','Economic activity (NACE Rev. 2)', 'Time (TIME_PERIOD)', 'Observation']
d6 = original[selected_columns].copy()

column_name_mapping = {
    'Territory': 'Area',
    'Economic activity (NACE Rev. 2)' : 'Economic activity',
    'Time (TIME_PERIOD)': 'Time period',
    'Observation': 'Hourly wage by age (15-29) (€)',
}

d6.rename(columns=column_name_mapping, inplace=True)

d6.to_csv("../datasets/cleaned_csv/D6.csv", index=False)

d6

Unnamed: 0,Area,Economic activity,Time period,Hourly wage by age (15-29) (€)
0,Italy,TOTAL,2014,9.76
1,Italy,TOTAL,2015,9.90
2,Italy,TOTAL,2016,9.92
3,Italy,TOTAL,2017,10.03
4,Italy,TOTAL,2018,10.06
...,...,...,...,...
755,Sud,Other service activities,2017,7.72
756,Sud,Other service activities,2018,7.67
757,Sud,Other service activities,2019,7.70
758,Sud,Other service activities,2020,7.81


### D7. Hourly wage by educational qualification (2014-2021)


This dataset has been downloaded from Istat and has already been filtered based on the following criteria:

- Indicator: Hourly gross pay per compensated hour for employee positions in euros (median).
- Gender: all variables
- Employee class: all variables
- Employment status: total


Within the dataset, the following columns are redundant, because each one either holds the same value in every row or repeats, as a code, the content of a descriptive column:

- FREQ - frequency
- REF_AREA - territory
- DATA_TYPE - indicator type
- SEX - sex
- NOTE_ECON_ACTIVITY_NACE_2007 - economic activity (ATECO 2007)
- NOTE_CONTARCTUAL_OCCUPATION - contractual occupation
- NOTE_EMPLOYESS_CLASS - employee class
- NOTE_TIME_PERIOD - time
- BASE_PER - base year
- UNIT_MEAS - unit of measure
- UNIT_MULT - multiplication unit
- OBS_STATUS - observation status

The last six columns contain NaN values, which are not of interest for our research purposes. Therefore, these columns are dropped outright rather than having their values replaced.

Since these columns carry no additional information, they are removed. The cleaned dataset keeps only:

- Territory (including both Italy and the subdivision into macro-areas: Northwest, Northeast, Center, South)
- Educational level
- Time period (spanning from 2014 to 2021)
- Economic activity
- Observed data, in this case, the hourly wage.
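
The reshaping cell for this dataset fills the three wage columns from lists built in row order. That lines up only if every combination of area, economic activity and time period has exactly one row per qualification level. A quick check of that assumption could look like the sketch below, which runs on toy data with shortened, hypothetical qualification labels and is not part of the original notebook.

```python
import pandas as pd

# Toy data: the 2015 group is deliberately missing one qualification level.
toy = pd.DataFrame({
    "Area": ["Italy"] * 5,
    "Economic activity": ["TOTAL"] * 5,
    "Time period": [2014, 2014, 2014, 2015, 2015],
    "Qualification": ["Lower", "Upper", "Tertiary", "Lower", "Upper"],
})

keys = ["Area", "Economic activity", "Time period"]

# Count the distinct qualification levels present in each key combination.
sizes = toy.groupby(keys)["Qualification"].nunique()

# Key combinations that do not contain all three levels.
incomplete = sizes[sizes != 3]
print(incomplete)
```

An empty `incomplete` result would indicate that the positional assignment is safe. Any listed group would need to be handled before the lists are attached.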

In [13]:

original = pd.read_csv("../datasets/retribuzione/Level of education (IT1,533_957_DF_DCSC_RACLI_3,1.0).csv", encoding="utf-8", sep=";")

original.head(5)

Unnamed: 0,FREQ,Frequency,REF_AREA,Territory,DATA_TYPE,Indicator,SEX,Sex (DESC),EDU_LEV_HIGHEST,Highest level of education attained,...,NOTE_EMPLOYESS_CLASS,Employees class (NOTE_EMPLOYESS_CLASS),NOTE_TIME_PERIOD,Time,BASE_PER,Base year,UNIT_MEAS,Measure unit,UNIT_MULT,Multiplication unit
0,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,13,"No educational degree, primary and lower secon...",...,,,,,,,,,,
1,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,13,"No educational degree, primary and lower secon...",...,,,,,,,,,,
2,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,13,"No educational degree, primary and lower secon...",...,,,,,,,,,,
3,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,13,"No educational degree, primary and lower secon...",...,,,,,,,,,,
4,A,Annual,IT,Italy,HOUWAG_ENTEMP_MED_MI,Gross hourly wage per hour paid of employee jo...,9,Total,13,"No educational degree, primary and lower secon...",...,,,,,,,,,,


In [14]:
selected_columns = ['Territory','Economic activity (NACE Rev. 2)', 'Highest level of education attained', 'Time (TIME_PERIOD)', 'Observation']
d7 = original[selected_columns].copy()

column_name_mapping = {
    'Territory': 'Area',
    'Economic activity (NACE Rev. 2)' : 'Economic activity',
    'Highest level of education attained' : 'Qualification',
    'Time (TIME_PERIOD)': 'Time period',
    'Observation': 'Hourly wage for type of education (€)',
}

d7.rename(columns=column_name_mapping, inplace=True)


wage_col = 'Hourly wage for type of education (€)'

# Drop the aggregate "Total" rows and the unclassified "N.a." rows.
d7 = d7[~d7['Qualification'].isin(["Total", "N.a."])]

# Collect the wages of each qualification level in row order.
x = d7.loc[d7['Qualification'] == "No educational degree, primary and lower secondary school certificate", wage_col].tolist()
y = d7.loc[d7['Qualification'] == "Upper and post secondary", wage_col].tolist()
z = d7.loc[d7['Qualification'] == "Tertiary (university, doctoral and specialization courses)", wage_col].tolist()



df = d7[['Area', 'Economic activity', 'Time period']]
s_d7 = df.drop_duplicates().copy()
s_d7["Hourly wage no educational degree, primary and lower secondary school certificate (€)"] = x
s_d7["Hourly wage upper and post secondary (€)"] = y
s_d7["Hourly wage tertiary (university, doctoral and specialization courses) (€)"] = z

       
s_d7.to_csv("../datasets/cleaned_csv/D7.csv", index=False)

s_d7

Unnamed: 0,Area,Economic activity,Time period,"Hourly wage no educational degree, primary and lower secondary school certificate (€)",Hourly wage upper and post secondary (€),"Hourly wage tertiary (university, doctoral and specialization courses) (€)"
0,Italy,TOTAL,2014,10.53,11.41,13.83
1,Italy,TOTAL,2015,10.70,11.53,13.77
2,Italy,TOTAL,2016,10.69,11.53,13.80
3,Italy,TOTAL,2017,10.73,11.54,13.85
4,Italy,TOTAL,2018,10.74,11.56,13.86
...,...,...,...,...,...,...
2935,Sud,Other service activities,2017,8.02,8.17,9.64
2936,Sud,Other service activities,2018,8.05,8.12,9.53
2937,Sud,Other service activities,2019,8.09,8.14,9.59
2938,Sud,Other service activities,2020,8.27,8.38,10.45
