# Immigration into Spain: A Case Study
*This is the main notebook for this project. All code, analysis, and discussion are contained within this document.*

## Introduction 
This summer, I was an intern at a small non-profit organization based out of Barcelona called the **Open Cultural Center (OCC)**. Its main goal is to support migrants and refugees coming into Spain through free language classes, career workshops, and skill-based classes, as long as need is demonstrated. I took a particular interest in **Migracode**, a daughter project specializing in coding skills classes. Given my background as a female computer science major, I was curious about the general demographics of these classes, and whether this sample reflects those of Barcelona's immigrant population as a whole. I also wondered if any origin countries were increasingly represented among Migracode students. 

These are important questions to ask, as [the educational level of the immigrant population in Spain has remained stagnant since the beginning of the century](https://blog.funcas.es/el-nivel-educativo-de-la-poblacion-inmigrante-en-espana-permanece-estancado-desde-principios-de-siglo/) *despite* positive population growth solely due to immigration. Certain South American countries are eligible for [expedited Spanish nationality applications](https://corralinternational.com/en/complete-guide-for-latin-american-citizens-emigrating-to-spain/#:~:text=Requirements:%20After%20residing%20legally%20in%20Spain%20for,only%20requires%20two%20years%20of%20legal%20residency.), and arrivals continue to break records each year. For these reasons and many others, the OCC needs to be able to recognize whether it effectively attracts immigrants in need so that it can continue to receive donor support. More importantly, it needs to make sure that no eligible immigrant feels deterred from applying on the basis of exclusion or cultural expectations. 

Luckily, it was easy to obtain the data for all of the students who have taken classes through **Migracode** from 2021 (its founding year) through 2024. 

Next, I took a look at the **Migracode** user-facing website and walked through the application process myself. There are four requirements in order to apply.
*Here is the link if you would like to look at the website:*
 
#### Requirements to be a Migracode Student:

**(1)** Be ONE of the following: a refugee or asylum seeker, a migrant from outside the EU who cannot find work, or a migrant from outside the EU in a difficult/vulnerable living situation

**(2)** Be fluent in **English** or **Spanish**

**(3)** Have **documentation** to live/work in Spain 

**(4)** Live in/around **Barcelona**

<u>Henceforth, I will be referring to people fulfilling these four criteria as **4 Criteria People**. I will refer to people who are or have been Migracode students as **Migracode Students**.</u>

The form on the Migracode website adds a new row to the data table each time a student applies. This data table is my starting point.

## Research Question: How do Migracode Students differ from 4 Criteria People? 

I want to know if **Migracode Students** represent the general population of **4 Criteria People**, or if certain groups are overrepresented among **Migracode Students**. 

This may help to determine which groups respond to Migracode recruitment, in order to better target groups which are underrepresented. 


## Some sub-questions: 
### Do Migracode Students already have technical experience? 

This question aims to address the approachability of Migracode classes. The courses are intended to be easy to apply for and welcoming to applicants with little to no technical background (the advanced courses are the exception). 

If most **Migracode Students** indicate that they already have experience, there may be other factors that discourage inexperienced people from applying, like marketing or website interface. 

### Where do Migracode students come from? 

Are certain countries of origin more heavily represented among **Migracode Students** than among **4 Criteria People**?  

### What portion of students are over the age of 30? 

It's possible that filling out an online application or embarking on a course in technical skills is more approachable for a younger generation. 

## Methods / The Data
I will be using the following datasets/documents, described in more detail below:

**(1)** Migracode's Students Database (modified for anonymity, cleanliness, and application)

**(2)** Ajuntament de Barcelona's (City Hall of Barcelona) public Immigrants by Nationality, Sex, and Age dataset, 2023

**(3)** Spain Profiling of new arrivals (January - December 2023) 

### Limitations
My main research question is a lofty and largely unattainable ask for the following reasons: 

**There is no way to find data on broader 4 Criteria People.** That is, people who are (1) immigrants/asylum seekers in sub-optimal financial/employment situations, (2) speakers of English or Spanish, (3) documented, and (4) living in Barcelona. 

**Why?**
- There is *yearly population data* on immigration into Barcelona, but it doesn't get into income, employment or spoken language. Neighborhoods are described, which can hint at financial situation, but ultimately are unreliable indicators. 
- There is *asylum seeker annual arrival data*, but it doesn't describe individual cities or work documentation.
- Annual reports don't describe immigrants who may have arrived a long time ago.


### Solutions

I will describe Migracode students as they appear in the applications data table. Then, I will use various sources for information about arrivals or immigrants in Spain to hopefully describe its immigrant population, though it will be impossible to describe the very specific **4 Criteria Person** demographic. 

### Migracode's Students Database

#### Cleaning
Migracode's data table was originally populated through the use of a form, intended to register students for individual classes. This means that if a returning student filled out the form for a new class, they were added to the table as a new entry. In order to process each entry as **one individual**, I needed to find a way to combine entries.

I alphabetized the entries by name so that it was clearer which people had taken multiple classes over time, and then manually removed extraneous entries by the same person. I replaced a field originally called **Course** with a field called **Most Basic Course Taken**, which describes whether the person *ever* took a basic level course. 

To preserve anonymity, after each person only had one entry, I removed the names from the sheet and replaced them with an index. This process reduced the entries from 1731 to 1390. 
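
The cleaning above was done by hand, but the same idea can be sketched in pandas. This is only an illustration on made-up rows: the names, the `raw` table, and the assumption that the lowest course code is always the most basic course are hypothetical, and exact name matching would miss misspelled names.

```python
import pandas as pd

# Hypothetical raw form export: one row per class registration, not per person.
raw = pd.DataFrame({
    "Name": ["Amal K.", "Amal K.", "Bilal R.", "Chen W.", "Chen W."],
    "Course": [1, 0, 2, 3, 1],  # assumed coding: 0 = Beginner ... 3 = Self-Learning
    "Country of Birth": ["Syria", "Syria", "Pakistan", "China", "China"],
})

# Collapse to one row per person; the lowest code stands for the most basic course.
people = raw.groupby("Name", as_index=False).agg(
    **{
        "Most Basic Course Taken": ("Course", "min"),
        "Country of Birth": ("Country of Birth", "first"),
    }
)

# Drop the names so only a positional index identifies each person.
anon = people.drop(columns="Name").reset_index(drop=True)
print(anon)
```

Grouping on the exact name string only works when names are entered consistently, which is why the real table was reviewed manually.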

I also filled empty fields with explicit *Did not report* or *Unknown* values, since empty values can get in the way of code execution. 
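
As a rough sketch of that step, assuming the placeholder values match the encodings listed in the next section (`Unknown` for country, `3` for an unreported gender), missing fields could be filled like this. The `df_raw` rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows with blanks left by skipped form questions.
df_raw = pd.DataFrame({
    "Country of Birth": ["Egypt", None, "Morocco"],
    "Gender": [1, None, 0],
})

# Assumed placeholders: 'Unknown' for country, 3 ('Did not report') for gender.
filled = df_raw.fillna({"Country of Birth": "Unknown", "Gender": 3})

# The blank forced the Gender column to float, so cast it back to integer codes.
filled["Gender"] = filled["Gender"].astype(int)
print(filled)
```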

#### Encodings 

**Most Basic Course Taken:**

| Code    | Courses |
| -------- | ------- |
| 0  | Applied for a Beginner class   |
| 1 | Only applied for Intermediate     |
| 2    | Only applied for AWS, ADA     |
| 3    | Only applied for Self-Learning Class |

*By this classification, a value of 0 means the student has applied for at least one basic level class, and any other value means the student has not applied for a basic level class. As this exploration is not concerned with retention, progress of students over time, or popularity of class topics, I simply note the most basic course taken overall.*

**Gender**
| Code    | Gender |
| -------- | ------- |
| 0  | Woman  |
| 1 | Man     |
| 2    | Other   |
| 3 | Did not report | 

**Country of Birth**

A string containing the name of the country. If the applicant left it blank, the value is `Unknown`.


**Languages**
| Code    | Languages |
| -------- | ------- |
| 0  | English  |
| 1 | English Or Spanish     |
| 2    | Spanish   |
| 3 | Unknown | 

**Laptop**
| Code    | Meaning |
| -------- | ------- |
| 0  | I don't have a laptop  |
| 1 | I have a good laptop to use     |
| 2    | I'm not sure/did not select an option  |


**Self-Reported Experience:**

| Code    | Self-Reported Experience |
| -------- | ------- |
| 0  | I don't have previous experience with coding   |
| 1 | I know a little bit about the basics     |
| 2    | I know already about coding   |
| 3 | Did not select an option | 

*To stay true to the form responses, I have copied their original values here, which are awkwardly translated. Language barriers contribute to form confusion as well.*

**Educational Background**
| Code    | Education |
| -------- | ------- |
| 0  | (Online) Courses or Elementary School  |
| 1 | High School     |
| 2    | University   |
| 3 | Other Education | 

**Work Permit**
| Code    | Permit |
| -------- | ------- |
| 0  | I don't have a work permit in Spain  |
| 1 | No, but within 8 months from now I can get it     |
| 2    | I have a work permit in Spain   |
| 3 | I don't know/did not select an option | 

**Age**

`-1` if the age was not reported or the value is nonsensical (e.g. 0 or 110+); otherwise a two-digit number.
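
A minimal sketch of this rule is shown here. The function name `clean_age` and the exact cutoffs (`min_age=1`, `max_age=109`) are assumptions based on the examples above, not the thresholds actually used when the table was cleaned.

```python
def clean_age(value, min_age=1, max_age=109):
    """Return the age as an int, or -1 if it is missing or implausible."""
    try:
        age = int(float(value))
    except (TypeError, ValueError, OverflowError):
        # Blank cells, NaN, infinities, and free text all end up here.
        return -1
    return age if min_age <= age <= max_age else -1


for raw_value in ["28", 0, 110, None, "abc"]:
    print(repr(raw_value), "->", clean_age(raw_value))
```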

**Nationality**
| Code    | Nationality |
| -------- | ------- |
| 1 | Spain  |
| 2 | Rest of the EU     |
| 3 | Rest of the world  |
| 4 | No value     |


**Younger than 30**
| Code | Age group | 
| ------- | -------- |
| -1 | Age not reported (Age is -1) |
| 0 | Age is equal to or greater than 30 |
| 1 | Age is less than 30 |

#### Preview of Data
*Including modifications to assist data visualization creation*

In [131]:
##import cell 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [164]:
df = pd.read_csv('Students_Migracode_Anon_Encoded.csv')

# Create a Nationality column from country of birth
# (1 = Spain, 2 = rest of the EU, 3 = rest of the world, 4 = unknown)
eu_countries = [
    'Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 
    'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 
    'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 
    'Netherlands', 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 
    'Spain', 'Sweden'
]
def classify_country(country):
    if country == 'Spain':
        return 1
    elif country in eu_countries:
        return 2
    elif country == 'Unknown':
        return 4
    else:
        return 3
df['Nationality'] = df['Country of Birth'].apply(classify_country)

#Create a new column that reports whether the person is under 30
def age_status(age):
    age = int(age)
    if age == -1:
        return -1
    elif age < 30:
        return 1
    else:
        return 0
        
df['Younger than 30'] = df['Age'].apply(age_status)

# Preview the data. 
df.head(10)


Unnamed: 0,Application date,Selected course,Gender,Country of Birth,Language(s),Laptop,Previous Experience,Educational background,Work permit,Age,Nationality,Younger than 30
0,18/7/2023,1,1,Egypt,0,1,1,2,2.0,-1,3,-1
1,19/3/2021,1,1,Morocco,0,0,1,1,0.0,38,3,0
2,30/8/2021,1,1,Morocco,0,1,2,2,1.0,28,3,1
3,5/10/2023,2,1,Morocco,0,0,2,2,0.0,25,3,1
4,1/5/2023,0,1,Morocco,1,1,2,0,2.0,33,3,0
5,29/6/2022,0,1,Syria,1,1,2,3,2.0,26,3,1
6,29/3/2022,0,1,Afghanistan,0,0,0,2,1.0,27,3,1
7,4/1/2022,0,1,Syria,0,1,2,1,2.0,24,3,1
8,7/11/2023,2,1,Pakistan,0,1,2,2,0.0,19,3,1
9,28/10/2023,1,1,Afghanistan,0,1,1,2,2.0,37,3,0
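
To address the technical experience sub-question, one option is to map the numeric codes back to readable labels and look at proportions. The sketch below runs on a toy series standing in for `df['Previous Experience']`; the shortened labels and the sample values are illustrative only, not results.

```python
import pandas as pd

# Shortened versions of the Self-Reported Experience labels in the encoding table.
experience_labels = {
    0: "No previous experience",
    1: "Knows a little",
    2: "Already knows coding",
    3: "Did not select",
}

# Toy stand-in for df['Previous Experience'] (not the real data).
sample = pd.Series([1, 1, 2, 2, 2, 0, 3, 2], name="Previous Experience")

# Share of each answer among all rows.
shares = sample.map(experience_labels).value_counts(normalize=True)
print(shares)
```

The same pattern applies to the other encoded columns, which keeps plot legends readable without changing the stored codes.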


### Ajuntament de Barcelona's public Immigrants by Nationality, Sex, and Age dataset 
The City Hall of Barcelona provides many public datasets on annual immigration. This one had the most qualities that were relevant for this exploration: nationality, age, and sex. 

The dataset and its values have been modified from the original: I dropped the columns that were irrelevant to this exploration and renamed the Catalan column names and values.


**Sex**
| Code    | Sex |
| -------- | ------- |
| 1 | Female  |
| 2 | Male     |

*This dataset did not have values for nonbinary or other gender expressions*

**Nationality**
| Code    | Nationality |
| -------- | ------- |
| 1 | Spain  |
| 2 | Rest of the EU     |
| 3 | Rest of the world  |
| 4 | No value     |

**Age Range**
| Code    | Age Range |
| -------- | ------- |
| 0 | <5 years  |
| 1 | 5-9    |
...
| 20 | >=100  |
| 21| No value     |

**Younger than 30**
| Code | Age group | 
| ------- | -------- |
| -1 | Age range not reported |
| 0 | Age is equal to or greater than 30 |
| 1 | Age is less than 30 |


https://opendata-ajuntament.barcelona.cat/data/en/dataset/pad_imm_mdbas_sexe_edat-q_nacionalitat-g


In [200]:
bcn = pd.read_csv('Barcelona_Immigration_City_Hall_2023.csv')
print(bcn.columns)

# Create a 'Younger than 30' column that matches the Migracode dataset
def age_status(age_code):
    # Codes 0-5 cover the five-year bands from <5 up to 25-29
    if 0 <= age_code <= 5:
        return 1
    # Codes 6-20 cover 30 years and above
    elif 6 <= age_code <= 20:
        return 0
    # Anything else (including 21, 'No value') is treated as unknown
    else:
        return -1
bcn['Younger than 30'] = bcn['EDAT_Q'].apply(age_status)

# Recode sex to match the Migracode gender codes: 1 (Female) -> 0, 2 (Male) -> 1
def sex_same(sex_code):
    if sex_code == 1:
        return 0
    else:
        return 1
bcn['Sex'] = bcn['SEXE'].apply(sex_same)
bcn['Age Range'] = bcn['EDAT_Q']
bcn['Nationality'] = bcn['NACIONALITAT_G']


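# Assumption to verify: 'Valor' may hold the number of people each row represents.
# It is dropped here, so later summaries would count rows rather than people.
columns_to_delete = ['Any', 'Codi_Districte', 'Nom_Districte', 'Codi_Barri', 'Nom_Barri', 'AEB', 'Seccio_Censal', 'Valor', 'EDAT_Q', 'SEXE', 'NACIONALITAT_G' ]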
bcn.drop(columns=columns_to_delete, inplace=True)
print("There are ", len(bcn.axes[0]), " entries in this dataset. ")
bcn.head(20)


Index(['Any', 'Codi_Districte', 'Nom_Districte', 'Codi_Barri', 'Nom_Barri',
       'AEB', 'Seccio_Censal', 'Valor', 'NACIONALITAT_G', 'EDAT_Q', 'SEXE'],
      dtype='object')
There are  49261  entries in this dataset. 


Unnamed: 0,Younger than 30,Sex,Age Range,Nationality
0,1,0,0,1
1,1,1,0,1
2,1,1,1,1
3,1,1,3,1
4,1,0,4,1
5,1,1,4,1
6,1,0,5,1
7,1,1,5,1
8,0,0,6,1
9,0,1,6,1
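
With a `Younger than 30` column in both tables, the share of under-30s can be compared directly. The sketch below uses two toy series in place of `df['Younger than 30']` and `bcn['Younger than 30']`; the helper name `share_under_30` is hypothetical and the numbers are not results. It counts rows equally, so if a table holds aggregated counts rather than one row per person, the rows would need to be weighted first.

```python
import pandas as pd

def share_under_30(flags: pd.Series) -> float:
    """Share of 1s among rows with a known age group (rows coded -1 are ignored)."""
    known = flags[flags != -1]
    return float((known == 1).mean())


# Toy stand-ins for the two 'Younger than 30' columns (not the real data).
students_toy = pd.Series([1, 1, 0, -1, 1, 0])
city_toy = pd.Series([1, 0, 0, 0, -1, 1])

print("Students under 30:", share_under_30(students_toy))
print("City under 30:", share_under_30(city_toy))
```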



### United Nations High Commissioner for Refugees 
https://www.worlddata.info/europe/spain/asylum.php


The most important comparison I wanted to make was between 

## Results

### Visualizations


## Discussion

### Aside
During the internship, while I was often tasked with web development issues, my coworker was instructed to improve many of the data tables, as there are many across OCC's various other projects. He ran into issues when attempting to automate what I have manually done here, as people can misspell their names, use the same email for different people, or use many different formats for phone numbers. OCC doesn't currently use a unique identifier such as a student ID. Because it's such a small organization, we quickly realized that we were the only people trying to improve this (me by looking over his shoulder). This makes it difficult for the people at OCC to draw conclusions about their students, and for me to do the same throughout this project.  
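
To give a sense of what automating this would involve, here is a rough sketch of normalizing names and phone numbers before comparing two entries. It is not what my coworker built. The function names, the 0.85 similarity threshold, and the choice to compare only the last 9 digits of a phone number are all assumptions for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalize_name(name: str) -> str:
    """Lowercase, strip accents, and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())

def normalize_phone(phone: str, digits_kept: int = 9) -> str:
    """Keep digits only and compare on the last `digits_kept` digits."""
    digits = re.sub(r"\D", "", phone)
    return digits[-digits_kept:]

def likely_same_person(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Flag two entries as a probable match: same phone and similar name."""
    phone_a, phone_b = normalize_phone(a["phone"]), normalize_phone(b["phone"])
    same_phone = bool(phone_a) and phone_a == phone_b
    score = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    return same_phone and score >= threshold


entry_1 = {"name": "José  García", "phone": "+34 612 345 678"}
entry_2 = {"name": "jose garcia", "phone": "612-345-678"}
print(likely_same_person(entry_1, entry_2))
```

Even with normalization, shared emails or phones between different people would still need a manual check, which is part of why a unique student ID would help.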

From a web development standpoint, the websites for OCC, Migracode and other projects can be visually confusing, outdated, or, simply put, bad. This can make it harder to apply for these classes and to fill out the forms properly. 

For these reasons, errors may have been introduced when I combined entries so that each one corresponds to one individual. 