<h1>CEU Master Thesis</h1>
<br>
<font size="4">
    The Effects of Migration on Attitudes towards the European Union: Extent, Dynamics and Causality<br>
    by Alina Cherkas
</font>

_The notebook can be used to replicate the **multilevel** dataset used in the master's thesis submitted to CEU in a.y. 2019/2020._

<h2>Table of Contents</h2>

- [Data Sources](#Data-Sources)
- [Preliminaries](#Preliminaries)
- [Data Cleaning](#Data-Cleaning)
    - [1. Eurobarometer (Individual)](#1.-Eurobarometer-(Individual))
    - [2. OECD (Regional)](#2.-OECD-(Regional))
    
<h2>Output:</h2>

Running this script will create an output file `CEU_Thesis_Multilevel.feather`.
- `.feather` is an efficient cross-platform format that can be used in both `Python` and `R` languages.
- To use `.feather`, you need to install `pyarrow` (v.0.17.1 or higher). Use `conda install -c conda-forge pyarrow==0.17.1`.
- The file combines country-level data from _DatasetFinal_ (`CEU_Thesis_Data.xlsx`) with individual-level Eurobarometer data.

<div class="alert alert-warning">
    <b>Note:</b> Individual-level data is not provided for replication due to data sharing restrictions, but it can be obtained from the source. See Data Sources below.
</div>

## Data Sources

- [Eurobarometer](https://dbk.gesis.org/dbksearch/gdesc2.asp?no=0008&search=&search2=&db=e&tab=0&notabs=&nf=1&af=&ll=10)
    - **Type:** Cross-section
    - **Level:** Individual-level Data
    - **Date Range:** 28/02/2015-07/12/2015
    - **Source:** [GESIS Datenarchiv](https://www.gesis.org/eurobarometer-data-service/survey-series/standard-special-eb)
    - **Edition:** Standard Eurobarometer
    - **Editions:** 83.1-84.4    
        - ZA5964 Eurobarometer 83.1 (2015)
        - ZA5965 Eurobarometer 83.2 (2015)
        - ZA5998 Eurobarometer 83.3 (2015)
        - ZA6595 Eurobarometer 83.4 (2015)
        - ZA6596 Eurobarometer 84.1 (2015)*
        - ZA6642 Eurobarometer 84.2 (2015)*
        - ZA6643 Eurobarometer 84.3 (2015)
        - ZA6644 Eurobarometer 84.4 (2015)
        - *excluded since the attitude question was not asked.
    - **Files***:
        - `ZA5964_v2-0-0.dta`
        - `ZA5965_v2-0-0.dta`
        - `ZA5998_v2-0-0.dta`
        - `ZA6595_v3-0-0.dta`
        - `ZA6643_v4-0-0.dta`
        - `ZA6644_v4-0-0.dta`
        - *files are not directly provided for replication due to data sharing restrictions.

- [Database on Migrants in OECD Regions](https://stats.oecd.org/Index.aspx?DataSetCode=REGION_MIGRANTS)
    - **Type:** Cross-section
    - **Level:** Regional-level Data
    - **Date Range:** 2015
    - **Source:** [OECD.Stat](https://stats.oecd.org/)
    - **Indicators:** Full Database Export
    - **File:** `REGION_MIGRANTS_24052020163240973.csv`

## Preliminaries

In [1]:
# Standard library imports
import os, sys
from collections import Counter

# Third party imports
import numpy as np
import pandas as pd

print('Loaded!')

Loaded!


In [2]:
# System information
print(f'Executable: {sys.executable}\nPython version: {sys.version}')
print(f'\nPackage versions:\n- Numpy: {np.__version__}\n- Pandas: {pd.__version__}')

Executable: /Users/alinacherkas/opt/anaconda3/bin/python
Python version: 3.7.7 (default, Mar 26 2020, 10:32:53) 
[Clang 4.0.1 (tags/RELEASE_401/final)]

Package versions:
- Numpy: 1.18.1
- Pandas: 1.0.3


**Helper File**

In [3]:
# NUTS name mapping from previous classification to the 2016 one
print('Source: https://simap.ted.europa.eu/web/simap/nuts')
df_NUTS = pd.read_excel('./Source Data/Auxiliary Data.xlsx', sheet_name = 'NUTS_Names')
print(f'Shape:{df_NUTS.shape}')
display(df_NUTS.head())

Source: https://simap.ted.europa.eu/web/simap/nuts
Shape:(1845, 6)


Unnamed: 0,Previously used code,Previously used name,Change,2016 NUTS,2016 NUTS name,Mapping with previous
0,AT,ÖSTERREICH,,AT,ÖSTERREICH,AT
1,AT1,OSTÖSTERREICH,,AT1,OSTÖSTERREICH,AT1
2,AT11,Burgenland (A),Name change,AT11,Burgenland,AT11
3,AT111,Mittelburgenland,,AT111,Mittelburgenland,AT111
4,AT112,Nordburgenland,,AT112,Nordburgenland,AT112


In [4]:
# 'Mapping with previous' contains comma-separated codes
df_NUTS['Mapping with previous'] = df_NUTS['Mapping with previous'].str.split(',')

In [5]:
print(f'Shape before:{df_NUTS.shape}')
df_NUTS = df_NUTS.explode('Mapping with previous')
df_NUTS['Mapping with previous'] = df_NUTS['Mapping with previous'].str.strip()
print(f'Shape after:{df_NUTS.shape}')
display(df_NUTS.head())

Shape before:(1845, 6)
Shape after:(1952, 6)


Unnamed: 0,Previously used code,Previously used name,Change,2016 NUTS,2016 NUTS name,Mapping with previous
0,AT,ÖSTERREICH,,AT,ÖSTERREICH,AT
1,AT1,OSTÖSTERREICH,,AT1,OSTÖSTERREICH,AT1
2,AT11,Burgenland (A),Name change,AT11,Burgenland,AT11
3,AT111,Mittelburgenland,,AT111,Mittelburgenland,AT111
4,AT112,Nordburgenland,,AT112,Nordburgenland,AT112
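
The split-and-explode step above can be illustrated on a toy frame. This is only a sketch: the codes (`XX1`, `XB2`, ...) and the column names `new_code` and `old_codes` are made up for illustration and do not come from the auxiliary file.

```python
import pandas as pd

# Toy data: one new code maps to one or several previous codes (hypothetical values)
toy = pd.DataFrame({
    'new_code': ['XX1', 'XX2'],
    'old_codes': ['XA1', 'XB1, XB2'],
})

# Split the comma-separated string into a list, then emit one row per list element
toy['old_codes'] = toy['old_codes'].str.split(',')
toy = toy.explode('old_codes')

# Remove the whitespace left over after splitting on ','
toy['old_codes'] = toy['old_codes'].str.strip()
toy = toy.reset_index(drop=True)

print(toy)
```

`explode` repeats the original index for every new row, so resetting the index is optional and only matters if later steps rely on unique index labels.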


In [6]:
# Complete set of all available names
NUTSset = set(df_NUTS['Previously used code']) | set(df_NUTS['2016 NUTS']) | set(df_NUTS['Mapping with previous'])
print('Number of all unique NUTS names:', len(NUTSset))

Number of all unique NUTS names: 2358


In [7]:
# Mapping all old names to new ones
NUTS_mapper = dict(zip(df_NUTS['Previously used code'], df_NUTS['2016 NUTS']))
NUTS_mapper2 = dict(zip(df_NUTS['Mapping with previous'], df_NUTS['2016 NUTS']))
print('Number of Previous NUTS-codes in one column:', len(NUTS_mapper))
print('Number of Previous NUTS-codes in another column:', len(NUTS_mapper2))

Number of Previous NUTS-codes in one column: 1686
Number of Previous NUTS-codes in another column: 1793


In [8]:
# Creating a combined mapper of old names to new names
NUTS_mapper = {**NUTS_mapper, **NUTS_mapper2}
print('Combined Number of Previous NUTS-codes:', len(NUTS_mapper))

Combined Number of Previous NUTS-codes: 1799
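
Most keys appear in both mappers, which is why the combined mapper has 1799 keys rather than the sum of the two counts. When a key is repeated, the value from the right-hand dictionary wins. A minimal sketch with hypothetical codes:

```python
# Hypothetical codes for illustration only
first = {'XA1': 'XX1', 'XB1': 'XX2'}
second = {'XB1': 'XX3', 'XB2': 'XX3'}

# Unpacking merges both dictionaries; for a repeated key the later one takes precedence
combined = {**first, **second}

# Keys whose target differs between the two sources are worth inspecting by hand
conflicts = {k for k in first.keys() & second.keys() if first[k] != second[k]}

print(combined)
print(conflicts)
```

Listing the conflicting keys before merging is a cheap way to confirm that the precedence order does not silently change any mapping.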


**Country-level Data**

In [9]:
df_countries = pd.read_excel('CEU_Thesis_Data.xlsx', sheet_name = 'DatasetFinal')
print(f'Shape:{df_countries.shape}')
display(df_countries.head())

Shape:(198, 44)


Unnamed: 0,ISO,Country,Year,Eurobarometer,ImmigrantStock,InflowsOECD,InflowsEurostat,InflowsEU28,InflowsNonEU28,ShiftShare,...,Pop25_49,PopTotal,PopDensity,PopOver65,YearJoined,Eurozone,Schengen,M49Standard,IV_Sample,EUnonEU_Sample
0,AT,Austria,2009,71.85,10.317981,1.0997,0.831373,,,0.904602,...,37.2,83.35003,101.2,17.4,14,1,1,Western,True,False
1,AT,Austria,2010,73.8,10.579703,1.160203,0.849869,,,0.956444,...,36.9,83.51643,101.5,17.6,15,1,1,Western,True,False
2,AT,Austria,2011,75.8,10.903703,1.312464,0.981832,,,1.072163,...,36.5,83.75164,101.8,17.6,16,1,1,Western,True,False
3,AT,Austria,2012,77.7,11.315596,1.493853,1.088912,,,1.135971,...,36.2,84.08121,102.3,17.8,17,1,1,Western,True,False
4,AT,Austria,2013,79.7,11.882213,1.599979,1.205249,0.712494,0.381466,1.209287,...,35.8,84.5186,102.9,18.1,18,1,1,Western,True,True


In [10]:
# Subsetting records for the year of 2015
print(f'Shape before:{df_countries.shape}')
df_countries = df_countries.query('Year == 2015')[['ISO', 'InflowsOECD']].copy()
print(f'Shape after:{df_countries.shape}')
display(df_countries.head())

Shape before:(198, 44)
Shape after:(22, 2)


Unnamed: 0,ISO,InflowsOECD
6,AT,2.314033
15,BE,1.145847
24,BG,0.438602
33,DE,2.483131
42,DK,1.037066


## Data Cleaning

In this part, I read and clean individual-level data from Eurobarometer and combine it with regional OECD data as well as country-level records from above. The output data is used for robustness checks in the thesis.

### 1. Eurobarometer (Individual)

In [11]:
datafiles = [x for x in os.listdir('./Source Data/Eurobarometer (GESIS)') if x.endswith('.dta')]
datafiles.sort()
datafiles

['ZA5964_v2-0-0.dta',
 'ZA5965_v2-0-0.dta',
 'ZA5998_v2-0-0.dta',
 'ZA6595_v3-0-0.dta',
 'ZA6643_v4-0-0.dta',
 'ZA6644_v4-0-0.dta']

In [12]:
# Name of the dependent variable in each file
files = {'ZA5964_v2-0-0.dta':'qa7',
         'ZA5965_v2-0-0.dta':'d78',
         'ZA5998_v2-0-0.dta':'qa9',
         'ZA6595_v3-0-0.dta':'d78',
         'ZA6643_v4-0-0.dta':'qa9',
         'ZA6644_v4-0-0.dta':'d78'}

to_keep = ['doi', 'version', 'survey', 'caseid',
           'country', 'isocntry', 'nuts', 'nutslvl',
           'd10', 'd11', 'd25', 'd70', 'd72_1', 'Attitude']

# Reference mapping of source variables to readable names
# (the columns themselves are renamed by position in a later cell)
to_rename = {'qa7':'Attitude',
             'd78':'Attitude',
             'qa9':'Attitude',
             'd10':'Gender',
             'd11':'Age',
             'd25':'Residence',
             'd70':'LifeSatisfaction',
             'd72_1':'VoiceCounts'}

In [13]:
df_list = []

# Reading individual-level Eurobarometer data
for file, var in files.items():
    df_lambda = pd.read_stata(f'./Source Data/Eurobarometer (GESIS)/{file}')
    df_lambda.rename({var:'Attitude'}, axis = 1, inplace = True)
    df_list.append(df_lambda[to_keep])
    
df_eurobar = pd.concat(df_list, ignore_index = True)
del df_list
print(f'Shape:{df_eurobar.shape}')
display(df_eurobar.head())

Shape:(177153, 14)


Unnamed: 0,doi,version,survey,caseid,country,isocntry,nuts,nutslvl,d10,d11,d25,d70,d72_1,Attitude
0,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),1,BE - Belgium,BE,BE23,NUTS level 2,Man,75,Rural area or village,Fairly satisfied,Totally disagree,Neutral
1,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),2,BE - Belgium,BE,BE23,NUTS level 2,Woman,51,Rural area or village,Very satisfied,Tend to agree,Neutral
2,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),3,BE - Belgium,BE,BE23,NUTS level 2,Woman,40,Rural area or village,Fairly satisfied,Totally disagree,Neutral
3,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),4,BE - Belgium,BE,BE23,NUTS level 2,Woman,54,Rural area or village,Not at all satisfied,Totally disagree,Neutral
4,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),5,BE - Belgium,BE,BE33,NUTS level 2,Woman,52,Rural area or village,Fairly satisfied,Totally disagree,Fairly positive
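
The loop above harmonises the name of the attitude question before stacking the waves. The sketch below shows the same pattern on in-memory toy frames instead of Stata files. The variable names `qa7` and `d78` mirror the notebook, while the labels `wave_a`, `wave_b` and the extra `wave` column are assumptions added for illustration.

```python
import pandas as pd

# Two toy "waves" in which the same question is stored under different names
# (the values are made up)
waves = {
    'wave_a': (pd.DataFrame({'caseid': [1, 2], 'qa7': ['Neutral', 'Very positive']}), 'qa7'),
    'wave_b': (pd.DataFrame({'caseid': [1, 2], 'd78': ['Fairly negative', 'Neutral']}), 'd78'),
}

frames = []
for name, (frame, var) in waves.items():
    # Rename the wave-specific variable to a common name before stacking
    frame = frame.rename(columns={var: 'Attitude'})
    # Keep track of which wave each row came from
    frame['wave'] = name
    frames.append(frame[['wave', 'caseid', 'Attitude']])

stacked = pd.concat(frames, ignore_index=True)
print(stacked)
```

Selecting the same column list from every frame guarantees that `pd.concat` does not introduce columns that exist in only some of the waves.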


In [14]:
df_eurobar.columns = ['DOI', 'Version', 'Survey', 'Case_ID', 'Country', 'ISO', 'NUTS', 'NUTS_Level',
                      'Gender', 'Age', 'Residence', 'LifeSatisfaction', 'VoiceCounts', 'Attitude']

In [15]:
to_replace = {'DE-W':'DE',
             'GB-GBN':'UK',
             'DE-E':'DE',
             'GB-NIR':'UK',
             'CY-TCC':'CY'}

In [16]:
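# The date in parentheses is the release date of the dataset version
df_eurobar['Date'] = pd.to_datetime(df_eurobar['Version'].str.slice(7,-1))
# Collapsing sub-national samples (e.g. East and West Germany) into their countries
df_eurobar['ISO'].replace(to_replace, inplace = True)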
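
The fixed slice `str.slice(7,-1)` works because every version string has the form `2.0.0 (2018-08-10)`. A regular expression is a slightly more defensive alternative if the version number ever has a different length. The helper name `parse_release_date` below is hypothetical.

```python
import re
from datetime import date

def parse_release_date(version: str) -> date:
    """Extract the ISO date in parentheses from a string like '2.0.0 (2018-08-10)'."""
    match = re.search(r'\((\d{4})-(\d{2})-(\d{2})\)', version)
    if match is None:
        raise ValueError(f'No date found in {version!r}')
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)

print(parse_release_date('2.0.0 (2018-08-10)'))
```

Note that the parsed value is the date on which that version of the file was released (2018 for the 2015 surveys shown in the output), not the fieldwork date.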

In [17]:
print('Countries in the Dataset but not in Eurobarometer:', set(df_countries['ISO']) - set(df_eurobar['ISO']))
print('Countries in Eurobarometer but not in the Dataset:', set(df_eurobar['ISO']) - set(df_countries['ISO']))

Countries in the Dataset but not in Eurobarometer: set()
Countries in Eurobarometer but not in the Dataset: {'CZ', 'RO', 'TR', 'AL', 'HR', 'RS', 'MT', 'CY', 'LT', 'MK', 'ME'}


In [18]:
# Filtering data for countries of interest
print(f'Shape before:{df_eurobar.shape}')
df_eurobar = df_eurobar.query('ISO in @df_countries.ISO').copy()
print(f'Shape after:{df_eurobar.shape}')
display(df_eurobar.head())

Shape before:(177153, 15)
Shape after:(136302, 15)


Unnamed: 0,DOI,Version,Survey,Case_ID,Country,ISO,NUTS,NUTS_Level,Gender,Age,Residence,LifeSatisfaction,VoiceCounts,Attitude,Date
0,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),1,BE - Belgium,BE,BE23,NUTS level 2,Man,75,Rural area or village,Fairly satisfied,Totally disagree,Neutral,2018-08-10
1,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),2,BE - Belgium,BE,BE23,NUTS level 2,Woman,51,Rural area or village,Very satisfied,Tend to agree,Neutral,2018-08-10
2,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),3,BE - Belgium,BE,BE23,NUTS level 2,Woman,40,Rural area or village,Fairly satisfied,Totally disagree,Neutral,2018-08-10
3,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),4,BE - Belgium,BE,BE23,NUTS level 2,Woman,54,Rural area or village,Not at all satisfied,Totally disagree,Neutral,2018-08-10
4,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),5,BE - Belgium,BE,BE33,NUTS level 2,Woman,52,Rural area or village,Fairly satisfied,Totally disagree,Fairly positive,2018-08-10


In [19]:
display(df_eurobar.groupby('ISO').agg({'NUTS_Level':Counter, 'NUTS':'nunique'}))

Unnamed: 0_level_0,NUTS_Level,NUTS
ISO,Unnamed: 1_level_1,Unnamed: 2_level_1
AT,{'NUTS level 2': 6099},9
BE,{'NUTS level 2': 6107},11
BG,{'NUTS level 2': 6293},6
DE,{'NUTS level 1': 9298},16
DK,{'NUTS level 2': 6069},5
EE,{'NUTS level 3': 6056},5
ES,{'NUTS level 2': 6030},17
FI,{'NUTS level 2': 6057},4
FR,{'NUTS level 2': 6092},21
GR,{'NUTS level 2': 6025},10


In [20]:
print('NUTS not found in the official dataset:', set(df_eurobar['NUTS']) - NUTSset)

print('\nNUTS in Eurobarometer that are not found in 2016 classification:')
print('\n- Before replacing:', set(df_eurobar['NUTS']) - set(df_NUTS['2016 NUTS']))
df_eurobar['NUTS'].replace(NUTS_mapper, inplace = True)
print('\n- After replacing:', set(df_eurobar['NUTS']) - set(df_NUTS['2016 NUTS']))

NUTS not found in the official dataset: {'EL11', 'EL23', 'EL14', 'EL12', 'EL21', 'EL24', 'EL13', 'EL25'}

NUTS in Eurobarometer that are not found in 2016 classification:

- Before replacing: {'FR43', 'FR72', 'FR63', 'IE013', 'FR71', 'SI022', 'FR53', 'SI012', 'FR26', 'SI016', 'SI024', 'IE012', 'FR62', 'SI021', 'IE025', 'EL23', 'PL31', 'FR22', 'SI017', 'FR41', 'FR61', 'FR52', 'HU10', 'FR24', 'EL13', 'PL32', 'PL33', 'EL25', 'SI014', 'FR51', 'IE024', 'SI015', 'SI011', 'EL21', 'FR42', 'PL11', 'SI013', 'FR82', 'IE011', 'FR25', 'EL11', 'FR21', 'EL14', 'FR30', 'IE021', 'EL12', 'IE023', 'SI018', 'EL24', 'IE022', 'SI023', 'FR23', 'PL12', 'FR81', 'PL34'}

- After replacing: {'EL11', 'EL23', 'EL14', 'EL12', 'EL21', 'EL24', 'EL13', 'EL25'}
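
The recode-then-check pattern used above can be sketched in plain Python. Like `Series.replace` with a dictionary, the comprehension leaves codes without a mapping unchanged, so anything still outside the valid set is easy to list. All codes below are hypothetical.

```python
# Hypothetical codes for illustration only
old_to_new = {'XA1': 'XX1', 'XB1': 'XX2'}
valid_codes = {'XX1', 'XX2'}
observed = ['XA1', 'XB1', 'XX2', 'ZZ9']

# Replace codes that have a mapping and leave all others unchanged
recoded = [old_to_new.get(code, code) for code in observed]

# Anything still outside the valid set needs a manual decision
unresolved = sorted(set(recoded) - valid_codes)

print(recoded)
print(unresolved)
```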


In [21]:
print(f'Shape before:{df_eurobar.shape}')

# Excluding Greece: its EL-prefixed NUTS codes are absent from the NUTS mapping file
df_eurobar = df_eurobar.query('ISO != "GR"')
df_eurobar.reset_index(drop = True, inplace = True)

assert len(set(df_eurobar['NUTS']) - set(df_NUTS['2016 NUTS'])) == 0, 'Unknown NUTS'
print(f'Shape after:{df_eurobar.shape}')
display(df_eurobar.head())

Shape before:(136302, 15)
Shape after:(130277, 15)


Unnamed: 0,DOI,Version,Survey,Case_ID,Country,ISO,NUTS,NUTS_Level,Gender,Age,Residence,LifeSatisfaction,VoiceCounts,Attitude,Date
0,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),1,BE - Belgium,BE,BE23,NUTS level 2,Man,75,Rural area or village,Fairly satisfied,Totally disagree,Neutral,2018-08-10
1,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),2,BE - Belgium,BE,BE23,NUTS level 2,Woman,51,Rural area or village,Very satisfied,Tend to agree,Neutral,2018-08-10
2,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),3,BE - Belgium,BE,BE23,NUTS level 2,Woman,40,Rural area or village,Fairly satisfied,Totally disagree,Neutral,2018-08-10
3,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),4,BE - Belgium,BE,BE23,NUTS level 2,Woman,54,Rural area or village,Not at all satisfied,Totally disagree,Neutral,2018-08-10
4,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),5,BE - Belgium,BE,BE33,NUTS level 2,Woman,52,Rural area or village,Fairly satisfied,Totally disagree,Fairly positive,2018-08-10


In [22]:
# Merging country-level data with individual-level data
print(f'Shape before:{df_eurobar.shape}')
df_eurobar = df_eurobar.merge(df_countries, on = 'ISO')
print(f'Shape after:{df_eurobar.shape}')
display(df_eurobar.head())

Shape before:(130277, 15)
Shape after:(130277, 16)


Unnamed: 0,DOI,Version,Survey,Case_ID,Country,ISO,NUTS,NUTS_Level,Gender,Age,Residence,LifeSatisfaction,VoiceCounts,Attitude,Date,InflowsOECD
0,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),1,BE - Belgium,BE,BE23,NUTS level 2,Man,75,Rural area or village,Fairly satisfied,Totally disagree,Neutral,2018-08-10,1.145847
1,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),2,BE - Belgium,BE,BE23,NUTS level 2,Woman,51,Rural area or village,Very satisfied,Tend to agree,Neutral,2018-08-10,1.145847
2,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),3,BE - Belgium,BE,BE23,NUTS level 2,Woman,40,Rural area or village,Fairly satisfied,Totally disagree,Neutral,2018-08-10,1.145847
3,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),4,BE - Belgium,BE,BE23,NUTS level 2,Woman,54,Rural area or village,Not at all satisfied,Totally disagree,Neutral,2018-08-10,1.145847
4,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),5,BE - Belgium,BE,BE33,NUTS level 2,Woman,52,Rural area or village,Fairly satisfied,Totally disagree,Fairly positive,2018-08-10,1.145847
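
The unchanged row count confirms that each respondent matched exactly one country record. As an optional safeguard, `merge` can enforce this itself through its `validate` argument. The sketch below uses made-up toy tables, and the column name `Inflows` is an assumption.

```python
import pandas as pd

# Toy individual-level and country-level tables (values are made up)
people = pd.DataFrame({'ISO': ['AT', 'AT', 'BE'], 'Age': [30, 45, 52]})
countries = pd.DataFrame({'ISO': ['AT', 'BE'], 'Inflows': [2.3, 1.1]})

# validate='m:1' raises pandas.errors.MergeError if 'ISO' is not unique in `countries`
merged = people.merge(countries, on='ISO', how='inner', validate='m:1')

# With unique keys and full coverage, the row count must not change
assert len(merged) == len(people)
print(merged)
```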


### 2. OECD (Regional)

In [23]:
to_replace = {'AUS':np.nan,
             'AUT':'AT',
             'BEL':'BE',
             'CAN':np.nan,
             'CHE':np.nan,
             'CZE':np.nan,
             'DEU':'DE',
             'DNK':'DK',
             'ESP':'ES',
             'EST':'EE',
             'FIN':'FI',
             'FRA':'FR',
             'GBR':'UK',
             'GRC':'GR',
             'HUN':'HU',
             'IRL':'IE',
             'ITA':'IT',
             'LUX':'LU',
             'LVA':'LV',
             'MEX':np.nan,
             'NLD':'NL',
             'NOR':np.nan,
             'POL':'PL',
             'PRT':'PT',
             'SVK':'SK',
             'SVN':'SI',
             'SWE':'SE',
             'USA':np.nan}

In [24]:
df_regions = pd.read_csv('./Source Data/OECD Migration/REGION_MIGRANTS_24052020163240973.csv')
print(f'Shape:{df_regions.shape}')
display(df_regions.head())

Shape:(1277, 11)


Unnamed: 0,REG_ID,Region,ORIGIN,Place of birth,VAR,Indicator,TIME,Year,Value,Flag Codes,Flags
0,DE5,Bremen,FB,Foreign-Born,ALL_T_SH,Share of Foreign-Born Population,2015,2015,20.7,,
1,DE5,Bremen,FB,Foreign-Born,ALL_HEDU_SH,Share of Highly Educated,2015,2015,19.5,,
2,DE5,Bremen,FB,Foreign-Born,ALL_T_1564UNEMP_RA,"Unemployment rate, both sex",2015,2015,9.9,,
3,DE6,Hamburg,FB,Foreign-Born,ALL_T_SH,Share of Foreign-Born Population,2015,2015,19.6,,
4,DE6,Hamburg,FB,Foreign-Born,ALL_HEDU_SH,Share of Highly Educated,2015,2015,23.1,,


In [25]:
to_drop = ['ORIGIN', 'VAR', 'TIME', 'Year', 'Flag Codes', 'Flags']

In [26]:
print(f'Shape before:{df_regions.shape}')

# Dropping columns and pivoting the table
df_regions.drop(to_drop, axis = 1, inplace = True)
df_regions = df_regions.pivot_table(index = ['REG_ID', 'Region'], columns = ['Place of birth', 'Indicator'])
df_regions.reset_index(inplace = True)

print(f'Shape after:{df_regions.shape}')
display(df_regions.head())

Shape before:(1277, 11)
Shape after:(208, 9)


Unnamed: 0_level_0,REG_ID,Region,Value,Value,Value,Value,Value,Value,Value
Place of birth,Unnamed: 1_level_1,Unnamed: 2_level_1,Foreign-Born,Foreign-Born,Foreign-Born,Foreign-Born,Foreign-Born,Native-Born,Native-Born
Indicator,Unnamed: 1_level_2,Unnamed: 2_level_2,Share of EU Foreign-Born Population,Share of Foreign-Born Population,Share of Highly Educated,Share of non-EU Foreign-Born,"Unemployment rate, both sex",Share of Highly Educated,"Unemployment rate, both sex"
0,AT1,Eastern Austria,8.7,20.9,19.0,12.2,,26.2,
1,AT11,Burgenland,8.7,20.9,29.3,12.2,,25.6,4.7
2,AT12,Lower Austria,8.7,20.9,26.6,12.2,10.2,28.2,4.5
3,AT13,Vienna,8.7,20.9,32.7,12.2,13.5,37.2,8.7
4,AT2,Southern Austria,5.4,10.1,13.4,4.7,,20.2,
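
The reshaping from one row per region and indicator to one row per region can be sketched on a toy table. The values and the short indicator labels below are made up.

```python
import pandas as pd

# Toy long-format table: one row per region and indicator (values are made up)
long = pd.DataFrame({
    'REG_ID': ['R1', 'R1', 'R2', 'R2'],
    'Indicator': ['Share', 'Unemployment', 'Share', 'Unemployment'],
    'Value': [20.0, 9.5, 10.0, 4.0],
})

# One row per region, one column per indicator; 'mean' is the default aggregation
wide = long.pivot_table(index='REG_ID', columns='Indicator', values='Value')

# Drop the columns' axis name and turn the index back into a regular column
wide.columns.name = None
wide = wide.reset_index()
print(wide)
```

Because `pivot_table` aggregates with the mean by default, duplicate rows for the same region and indicator would be averaged silently. `DataFrame.pivot` raises an error in that situation and may be preferable when duplicates are not expected.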


In [27]:
df_regions.columns = ['NUTS', 'NUTS_Name', 'StockEU', 'StockTotal', '_ImmEdu', 'StockNonEU',
                      '_ImmUnempl', 'NativeEdu', 'NativeUnempl',]

In [28]:
to_drop = ['_ImmEdu', '_ImmUnempl']
to_order = ['NUTS', 'NUTS_Name', 'StockTotal', 'StockEU', 'StockNonEU', 'NativeEdu', 'NativeUnempl']

In [29]:
print(f'Shape before:{df_regions.shape}')

df_regions.drop(to_drop, axis = 1, inplace = True)
df_regions = df_regions.reindex(to_order, axis = 1)

assert df_regions['NUTS'].nunique() == df_regions.shape[0], 'NUTS contains duplicates'
print(f'Shape after:{df_regions.shape}')
display(df_regions.head())

Shape before:(208, 9)
Shape after:(208, 7)


Unnamed: 0,NUTS,NUTS_Name,StockTotal,StockEU,StockNonEU,NativeEdu,NativeUnempl
0,AT1,Eastern Austria,20.9,8.7,12.2,26.2,
1,AT11,Burgenland,20.9,8.7,12.2,25.6,4.7
2,AT12,Lower Austria,20.9,8.7,12.2,28.2,4.5
3,AT13,Vienna,20.9,8.7,12.2,37.2,8.7
4,AT2,Southern Austria,10.1,5.4,4.7,20.2,


In [30]:
# Excluding country-level observations
print(f'Shape before:{df_regions.shape}')
df_regions['NUTS'].replace(to_replace, inplace = True)
df_regions = df_regions.query('NUTS not in @to_replace.values()').copy()
print(f'Shape after:{df_regions.shape}')
display(df_regions.head())

Shape before:(208, 7)
Shape after:(187, 7)


Unnamed: 0,NUTS,NUTS_Name,StockTotal,StockEU,StockNonEU,NativeEdu,NativeUnempl
0,AT1,Eastern Austria,20.9,8.7,12.2,26.2,
1,AT11,Burgenland,20.9,8.7,12.2,25.6,4.7
2,AT12,Lower Austria,20.9,8.7,12.2,28.2,4.5
3,AT13,Vienna,20.9,8.7,12.2,37.2,8.7
4,AT2,Southern Austria,10.1,5.4,4.7,20.2,


In [31]:
print('NUTS not found in the official dataset:', set(df_regions['NUTS']) - NUTSset)

print('\nNUTS in OECD data that are not found in 2016 classification:')
print('\n- Before replacing:', set(df_regions['NUTS']) - set(df_NUTS['2016 NUTS']))

df_regions['NUTS'].replace(NUTS_mapper, inplace = True)
assert df_regions['NUTS'].nunique() == df_regions.shape[0], 'NUTS contains duplicates'

print('\n- After replacing:', set(df_regions['NUTS']) - set(df_NUTS['2016 NUTS']))

NUTS not found in the official dataset: {'IE02'}

NUTS in OECD data that are not found in 2016 classification:

- Before replacing: {'FR43', 'IE01', 'FR72', 'FR63', 'FR71', 'FR53', 'FR26', 'FR62', 'PL31', 'FR22', 'FR41', 'FR61', 'FR52', 'HU10', 'FR24', 'FR83', 'PL32', 'PL33', 'FR51', 'FR42', 'PL11', 'FR82', 'FR25', 'FR21', 'FR30', 'FR23', 'IE02', 'PL12', 'FR81', 'PL34'}

- After replacing: {'IE02'}


In [32]:
print('NUTS overlap:', len(set(df_eurobar['NUTS']) & set(df_regions['NUTS'])))
print('Missing NUTS:', sorted(set(df_eurobar['NUTS']) - set(df_regions['NUTS'])))

NUTS overlap: 136
Missing NUTS: ['BE10', 'BE21', 'BE22', 'BE23', 'BE24', 'BE25', 'BE31', 'BE32', 'BE33', 'BE34', 'BE35', 'BG31', 'BG32', 'BG33', 'BG34', 'BG41', 'BG42', 'EE001', 'EE004', 'EE006', 'EE007', 'EE008', 'IE041', 'IE042', 'IE051', 'IE052', 'IE053', 'IE061', 'IE062', 'IE063', 'ITC', 'ITF', 'ITG', 'ITH', 'ITI', 'LU', 'LV003', 'LV005', 'LV006', 'LV007', 'LV008', 'LV009', 'SI031', 'SI032', 'SI033', 'SI034', 'SI035', 'SI036', 'SI037', 'SI038', 'SI041', 'SI042', 'SI043', 'SI044']


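The finer-grained NUTS codes in Eurobarometer can be recoded to their parent regions without losing information. The reverse is not possible: separate regional records in the OECD data cannot be aggregated into larger regions.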

In [33]:
# Merging NUTS in Eurobarometer
to_replace = {'BE10':'BE1',
              'BE21':'BE2',
              'BE22':'BE2',
              'BE23':'BE2',
              'BE24':'BE2',
              'BE25':'BE2',
              'BE31':'BE3',
              'BE32':'BE3',
              'BE33':'BE3',
              'BE34':'BE3',
              'BE35':'BE3',
              'EE001':'EE00',
              'EE004':'EE00',
              'EE006':'EE00',
              'EE007':'EE00',
              'EE008':'EE00',
              'IE041':'IE04',
              'IE042':'IE04',
              'IE051':'IE02',
              'IE052':'IE02',
              'IE053':'IE02',
              'IE061':'IE02',
              'IE062':'IE02',
              'IE063':'IE02',
              'LU':'LU00',
              'LV003':'LV00',
              'LV005':'LV00',
              'LV006':'LV00',
              'LV007':'LV00',
              'LV008':'LV00',
              'LV009':'LV00',
              'SI031':'SI03',
              'SI032':'SI03',
              'SI033':'SI03',
              'SI034':'SI03',
              'SI035':'SI03',
              'SI036':'SI03',
              'SI037':'SI03',
              'SI038':'SI03',
              'SI041':'SI04',
              'SI042':'SI04',
              'SI043':'SI04',
              'SI044':'SI04'}

In [34]:
df_eurobar['NUTS'].replace(to_replace, inplace = True)

In [35]:
print('NUTS overlap:', len(set(df_eurobar['NUTS']) & set(df_regions['NUTS'])))
print('Missing NUTS:', sorted(set(df_eurobar['NUTS']) - set(df_regions['NUTS'])))

NUTS overlap: 146
Missing NUTS: ['BG31', 'BG32', 'BG33', 'BG34', 'BG41', 'BG42', 'ITC', 'ITF', 'ITG', 'ITH', 'ITI']
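
Many entries in the manual mapping above simply shorten a code until it reaches a region that exists in the OECD data. The sketch below automates that prefix rule. The function name `collapse_to_parent` is hypothetical, and `available` only stands in for the set of OECD region codes.

```python
def collapse_to_parent(code, available):
    """Shorten a NUTS-style code one character at a time until it is in `available`.

    Returns None if no prefix longer than the two-letter country code matches.
    """
    while len(code) > 2:
        if code in available:
            return code
        code = code[:-1]
    return None

# Codes taken from the mapping above; `available` stands in for the OECD region codes
available = {'BE2', 'EE00', 'SI03'}
for fine in ['BE23', 'EE004', 'SI035', 'BG31']:
    print(fine, '->', collapse_to_parent(fine, available))
```

The prefix rule does not cover every case handled in the notebook: `LU` is lengthened to `LU00`, and the Irish codes are assigned to `IE02` explicitly, so a hand-written dictionary is still needed for those.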


In [36]:
# Merging regional-level data into the multilevel data (inner join drops unmatched NUTS)
print(f'Shape before:{df_eurobar.shape}')
df_eurobar = df_eurobar.merge(df_regions, on = 'NUTS')
print(f'Shape after:{df_eurobar.shape}')
display(df_eurobar.head())

Shape before:(130277, 16)
Shape after:(117880, 22)


Unnamed: 0,DOI,Version,Survey,Case_ID,Country,ISO,NUTS,NUTS_Level,Gender,Age,...,VoiceCounts,Attitude,Date,InflowsOECD,NUTS_Name,StockTotal,StockEU,StockNonEU,NativeEdu,NativeUnempl
0,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),1,BE - Belgium,BE,BE2,NUTS level 2,Man,75,...,Totally disagree,Neutral,2018-08-10,1.145847,Flemish Region,9.9,4.3,5.6,37.3,4.2
1,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),2,BE - Belgium,BE,BE2,NUTS level 2,Woman,51,...,Tend to agree,Neutral,2018-08-10,1.145847,Flemish Region,9.9,4.3,5.6,37.3,4.2
2,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),3,BE - Belgium,BE,BE2,NUTS level 2,Woman,40,...,Totally disagree,Neutral,2018-08-10,1.145847,Flemish Region,9.9,4.3,5.6,37.3,4.2
3,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),4,BE - Belgium,BE,BE2,NUTS level 2,Woman,54,...,Totally disagree,Neutral,2018-08-10,1.145847,Flemish Region,9.9,4.3,5.6,37.3,4.2
4,doi:10.4232/1.13071,2.0.0 (2018-08-10),Eurobarometer 83.1 (February-March 2015),16,BE - Belgium,BE,BE2,NUTS level 2,Woman,88,...,Totally disagree,Neutral,2018-08-10,1.145847,Flemish Region,9.9,4.3,5.6,37.3,4.2
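
`merge` performs an inner join by default, so respondents whose NUTS code has no regional record are dropped, which explains the fall from 130277 to 117880 rows. One way to inspect what is lost before committing to the inner join is a left join with `indicator=True`. The sketch below uses made-up toy data.

```python
import pandas as pd

# Toy data: region 'R3' has respondents but no regional record (values are made up)
respondents = pd.DataFrame({'NUTS': ['R1', 'R1', 'R2', 'R3'], 'Case_ID': [1, 2, 3, 4]})
regions = pd.DataFrame({'NUTS': ['R1', 'R2'], 'StockTotal': [20.9, 10.1]})

# A left join with indicator=True adds a '_merge' column that flags unmatched rows
check = respondents.merge(regions, on='NUTS', how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'NUTS'].value_counts()
print(unmatched)

# Keeping only matched rows reproduces what an inner join would return
matched = check[check['_merge'] == 'both'].drop(columns='_merge')
```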


In [37]:
display(df_eurobar['ISO'].value_counts())

DE    9298
UK    7877
SE    6270
HU    6266
SK    6166
SI    6135
NL    6129
BE    6107
PT    6105
AT    6099
FR    6092
DK    6069
FI    6057
EE    6056
PL    6035
IE    6031
ES    6030
LV    6025
LU    3033
Name: ISO, dtype: int64

In [38]:
to_drop = ['DOI', 'Version', 'Date',  'NUTS_Level', 'StockEU', 'StockNonEU']
to_order = ['Survey', 'ISO', 'Country', 'NUTS', 'NUTS_Name', 'Case_ID', # Metadata
            'Attitude', 'Gender', 'Age', 'Residence', 'LifeSatisfaction', 'VoiceCounts', # Individual-level vars
            'StockTotal', 'NativeEdu', 'NativeUnempl', # NUTS-level controls
            'InflowsOECD'] # Country-level treatment

In [39]:
print(f'Shape before:{df_eurobar.shape}')
df_eurobar.drop(to_drop, axis = 1, inplace = True)
df_eurobar = df_eurobar.reindex(to_order, axis = 1)
print(f'Shape after:{df_eurobar.shape}')
display(df_eurobar.head())

Shape before:(117880, 22)
Shape after:(117880, 16)


Unnamed: 0,Survey,ISO,Country,NUTS,NUTS_Name,Case_ID,Attitude,Gender,Age,Residence,LifeSatisfaction,VoiceCounts,StockTotal,NativeEdu,NativeUnempl,InflowsOECD
0,Eurobarometer 83.1 (February-March 2015),BE,BE - Belgium,BE2,Flemish Region,1,Neutral,Man,75,Rural area or village,Fairly satisfied,Totally disagree,9.9,37.3,4.2,1.145847
1,Eurobarometer 83.1 (February-March 2015),BE,BE - Belgium,BE2,Flemish Region,2,Neutral,Woman,51,Rural area or village,Very satisfied,Tend to agree,9.9,37.3,4.2,1.145847
2,Eurobarometer 83.1 (February-March 2015),BE,BE - Belgium,BE2,Flemish Region,3,Neutral,Woman,40,Rural area or village,Fairly satisfied,Totally disagree,9.9,37.3,4.2,1.145847
3,Eurobarometer 83.1 (February-March 2015),BE,BE - Belgium,BE2,Flemish Region,4,Neutral,Woman,54,Rural area or village,Not at all satisfied,Totally disagree,9.9,37.3,4.2,1.145847
4,Eurobarometer 83.1 (February-March 2015),BE,BE - Belgium,BE2,Flemish Region,16,Neutral,Woman,88,Rural area or village,,Totally disagree,9.9,37.3,4.2,1.145847


In [40]:
# Recoding variables

df_eurobar['Attitude'].replace({'Very positive':2,
                                'Fairly positive':1,
                                'Neutral':0,
                                'Fairly negative':-1,
                                'Very negative':-2}, inplace = True)

df_eurobar['Gender'].replace({'Man':0, 'Woman':1}, inplace = True)

df_eurobar['Residence'].replace({'Rural area or village':0,
                                 'Small or middle sized town':0,
                                 'Large town':1,}, inplace = True)

df_eurobar['VoiceCounts'].replace({'Totally agree':1,
                                   'Tend to agree':1,
                                   'Tend to disagree':0, 
                                   'Totally disagree':0}, inplace = True)

df_eurobar['LifeSatisfaction'].replace({'Very satisfied':1,
                                        'Fairly satisfied':1,
                                        'Not very satisfied':0,
                                        'Not at all satisfied':0}, inplace = True)
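
`Series.replace` with a dictionary leaves labels that are not listed untouched, whereas `Series.map` turns them into missing values. The latter makes it easy to audit which labels a recoding scheme does not cover. In this sketch the label `DK` is a hypothetical "don't know" answer, not a value taken from the survey files.

```python
import pandas as pd

# Toy answers; 'DK' stands for a hypothetical "don't know" label
answers = pd.Series(['Very satisfied', 'Not very satisfied', 'DK'])
codes = {'Very satisfied': 1, 'Fairly satisfied': 1,
         'Not very satisfied': 0, 'Not at all satisfied': 0}

# map() turns every label that is absent from `codes` into NaN
recoded = answers.map(codes)

# List the labels that had no code so the omission is explicit
unmapped = sorted(set(answers[recoded.isna()]))
print(recoded.tolist(), unmapped)
```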

In [41]:
# Recoding Age
df_eurobar['Age'].replace({'[NOT CLEARLY DOCUMENTED]':np.nan}, inplace = True)
df_eurobar['Age'] = df_eurobar['Age'].astype(str).str.split().str.get(0).astype(float)
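
The age recoding keeps the first whitespace-separated token of each label and converts it to a number. A plain-Python version of the same idea is sketched below. The helper name `parse_age` and the example labels are assumptions about how ages may be stored, and unlike the `astype(float)` call above, the helper returns NaN for non-numeric labels instead of raising an error.

```python
import math

def parse_age(label):
    """Return the leading number of an age label as a float, or NaN if there is none."""
    tokens = str(label).split()
    try:
        return float(tokens[0])
    except (IndexError, ValueError):
        return math.nan

# Example labels are assumptions about how ages may be stored in the source files
for label in [75, '15 years', 'nan']:
    print(repr(label), '->', parse_age(label))
```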

In [42]:
# Missing values per column
display(df_eurobar.isna().sum())

Survey                 0
ISO                    0
Country                0
NUTS                   0
NUTS_Name              0
Case_ID                0
Attitude            2026
Gender                 0
Age                   16
Residence             62
LifeSatisfaction     456
VoiceCounts         7563
StockTotal          6305
NativeEdu              0
NativeUnempl           0
InflowsOECD            0
dtype: int64

In [43]:
df_eurobar.to_feather('CEU_Thesis_Multilevel.feather')
print('Saved!')

Saved!


## End of the Script