<img title="a title" alt="Alt text" src="https://upload.wikimedia.org/wikipedia/commons/1/1e/Erasmus%2B_Logo.svg">

# 👨‍🎓 What is Erasmus+?
> `Erasmus+` is the EU's programme to support education, training, youth and sport in Europe.<br>
It supports priorities and activities set out in the European
Education Area, Digital Education Action Plan and the European
Skills Agenda. The programme also:
 > - Supports the European Pillar of Social Rights
 > - Implements the EU Youth Strategy 2019-2027
 > - Develops the European dimension in sport
<br>
<br>

> `Erasmus+` offers mobility and cooperation opportunities in:
>  - higher education
>  - vocational education and training
>  - school education (including early childhood education and care)
>  - adult education
>  - youth
>  - sport

# 📝 Business Understanding
> I have an exchange; which country is the most appropriate to receive it?
As a data scientist, I want to explore this dataset, model it with machine learning, and then predict the three countries best suited to receive a given exchange: the model takes the exchange data as input and suggests the most suitable receiving countries.

# Data Description

<table border="3">
  <thead>
    <tr style="text-align: left;">
      <th>#</th>
      <th>Column Name</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>Project Reference</td>
      <td>A specific number for each exchange separately</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Academic Year</td>
      <td>The academic year in which the study took place</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Mobility Start Month</td>
      <td>The month in which the exchange began</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Mobility End Month</td>
      <td>The month in which the exchange ended</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Mobility Duration</td>
      <td>The number of days from the beginning of the exchange to its end</td>
    </tr>
    <tr>
      <th>6</th>
      <td>Activity (mob)</td>
      <td>Exchange activity such as training, study, work, and so on</td>
    </tr>
    <tr>
      <th>7</th>
      <td>Field of Education</td>
      <td>The field of education such as business administration, physical education, science, and so on</td>
    </tr>
    <tr>
      <th>8</th>
      <td>Participant Nationality </td>
      <td>The home country from which the participant participated</td>
    </tr>
    <tr>
      <th>9</th>
      <td>Education Level</td>
      <td>Education level such as master's, bachelor's, doctorate, diploma, and so on</td>
    </tr>
    <tr>
      <th>10</th>
      <td>Participant Gender</td>
      <td>Male or Female</td>
    </tr>
    <tr>
      <th>11</th>
      <td>Participant Profile</td>
      <td>The person sent in the exchange is classified as a learner or staff</td>
    </tr>
    <tr>
      <th>12</th>
      <td>Special Needs</td>
      <td>Describe whether the person has special needs or not</td>
    </tr>
    <tr>
      <th>13</th>
      <td>Fewer Opportunities</td>
      <td>Whether the participant faces obstacles, such as health problems or cultural differences, that limit their opportunities</td>
    </tr>
    <tr>
      <th>14</th>
      <td>Group Leader</td>
      <td>Is the person a group leader or not?</td>
    </tr>
    <tr>
      <th>15</th>
      <td>Participant Age</td>
      <td>The age of the participant</td>
    </tr>
    <tr>
      <th>16</th>
      <td>Sending Country Code</td>
      <td>The code of the country sending the exchange</td>
    </tr>
    <tr>
      <th>17</th>
      <td>Receiving Country Code</td>
      <td>The code of the country receiving the exchange</td>
    </tr>
    <tr>
      <th>18</th>
      <td>Sending City</td>
      <td>The city from which the exchange is sent</td>
    </tr>
    <tr>
      <th>19</th>
      <td>Receiving City</td>
      <td>The city receiving the exchange</td>
    </tr>
    <tr>
      <th>20</th>
      <td>Sending Organization</td>
      <td>The name of the organization sending the exchange</td>
    </tr>
    <tr>
      <th>21</th>
      <td>Receiving Organization</td>
      <td>The name of the organization receiving the exchange</td>
    </tr>
    <tr>
      <th>22</th>
      <td>Sending Organisation Erasmus Code</td>
      <td>The code of the organization sending the exchange</td>
    </tr>
    <tr>
      <th>23</th>
      <td>Receiving Organisation Erasmus Code</td>
      <td>The code of the organization receiving the exchange</td>
    </tr>
    <tr>
      <th>24</th>
      <td>Participants</td>
      <td>The number of participants in the exchange (from 1 to 13)</td>
    </tr>
  </tbody>
</table>


# ⚙️ Setting Up our libraries and files

In [None]:
# Pandas for data processing 
import pandas as pd


import numpy as np
from numpy import random

# Library to get countries codes
import pycountry

# Regex for string values
import re

### Functions used for processing

In [None]:
# Return the country name for an ISO alpha-2 code, or the code itself if it is unknown
def countries(x):
    country = pycountry.countries.get(alpha_2=x)
    if country is not None:
        return country.name
    else:
        return x

# Handle age column 
def drop_ages(x):
    if x > 90 or x < 7:
        return np.nan
    else:
        return x
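
The same age filter can also be written without a row-by-row `apply`. The sketch below is only an illustration on a toy series: `mask_invalid_ages` is a hypothetical helper (not part of the notebook) that keeps the 7 to 90 bounds used by `drop_ages`.

```python
import pandas as pd

def mask_invalid_ages(ages, low=7, high=90):
    """Replace ages outside [low, high] with NaN (bounds are inclusive).

    Hypothetical vectorized counterpart of drop_ages.
    """
    ages = pd.to_numeric(ages, errors="coerce")  # non-numeric entries become NaN
    return ages.where(ages.between(low, high))   # keep in-range values only

toy_ages = pd.Series([5, 7, 25, 90, 91, 1525])
cleaned_ages = mask_invalid_ages(toy_ages)
print(cleaned_ages.tolist())
```

On a dataset with millions of rows, a vectorized mask like this is usually much faster than calling a Python function once per value.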

### Read The Dataset

In [None]:
erasmus_df = pd.read_csv("/kaggle/input/erasmus-mobility-statistics-2014-2019/Erasmus_mobility_statistics_2014_2019.gzip", compression='gzip')

### Columns

In [None]:
for i, column in enumerate(erasmus_df.columns, start=1):
    print(i, "-", column)

#### We drop the columns that are not useful for our EDA: they are identifiers internal to the E+ system, and removing them reduces memory use while processing the data
- Project Reference
- Sending Organisation Erasmus Code
- Receiving Organisation Erasmus Code

The two country-code columns (Sending Country Code and Receiving Country Code) are kept for now, because they are converted to country names later.


In [None]:
erasmus_df = erasmus_df.drop(columns = ['Project Reference',
                                        'Sending Organisation Erasmus Code',
                                        'Receiving Organisation Erasmus Code'])

In [None]:
print("There are ( {} ) exchanges in our Dataset".format(len(erasmus_df))) 

# ⚙️ Data Transformation

#### Handle missing values and Change Dtype

In [None]:
print(erasmus_df.isnull().sum())

<strong>Very few values are missing, so we delete the records that contain missing data; this will barely affect the dataset</strong>

In [None]:
erasmus_df = erasmus_df.dropna()
print("There are ( {} ) exchanges in our Dataset after dropping Null values".format(len(erasmus_df))) 
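
Before dropping rows, it can help to quantify how much data would be lost. The following is a minimal sketch on a made-up frame (`toy_df` and its values are assumptions, not rows from the Erasmus dataset).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for erasmus_df
toy_df = pd.DataFrame({
    "Participant Age": [21, np.nan, 30, 25],
    "Activity": ["Study", "Training", None, "Study"],
})

# True for every row that has at least one missing value
rows_with_missing = toy_df.isnull().any(axis=1)
missing_share = rows_with_missing.mean()
print(f"{missing_share:.1%} of rows would be dropped")

cleaned_df = toy_df.dropna()
```

If the share is small, as reported above for the real data, dropping the rows is a reasonable choice.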

#### Let us start with the columns

<strong>1) Academic Year</strong> 

In [None]:
erasmus_df['Academic Year'].unique()

The Academic Year column is clean: no missing and no wrong values

<strong>2) Mobility Start Month </strong> 

In [None]:
print(sorted(erasmus_df['Mobility Start Month'].unique()))

The Mobility Start Month column is clean too, ranging from **2014-05** to **2019-12**

**3) Mobility End Month**

In [None]:
print(sorted(erasmus_df['Mobility End Month'].unique()))

The Mobility End Month column is clean too, ranging from **2014-06** to **2021-04**


**4) Mobility Duration**

In [None]:
print(sorted(erasmus_df['Mobility Duration'].unique()))

- We need to convert it to the correct data type (integer)

In [None]:
# Change the data type for Mobility Duration
erasmus_df['Mobility Duration'] = erasmus_df['Mobility Duration'].astype('int')
print(sorted(erasmus_df['Mobility Duration'].unique()))

There are **zero** values and values of less than **10 days**; we will handle them when we reach the **ML** step.<br>
For now the column is ready for **EDA**.


**5) Activity (mob)**

In [None]:
sorted(erasmus_df['Activity (mob)'].unique())

There are duplicated values here (**same meaning but different spelling**), so we only need to replace them with a single form 

In [None]:
# Change column name
erasmus_df = erasmus_df.rename(columns = {'Activity (mob)':'Activity'})

# Replace the values
erasmus_df['Activity'] = erasmus_df['Activity'].replace(
                        {"Advance Planning Visit – EVS":"Advance Planning Visit - EVS",
                         "Training/teaching assignments abroad":"Teaching/training assignments abroad"}
                            )

len(sorted(erasmus_df['Activity'].unique()))
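
Duplicates like the first pair above differ only in a dash character, so they can also be caught with a general normalization step rather than a hand-written mapping. This is a sketch on toy values; the third entry is a made-up label used only to show whitespace trimming.

```python
import pandas as pd

toy_activity = pd.Series([
    "Advance Planning Visit – EVS",   # en dash
    "Advance Planning Visit - EVS",   # plain hyphen
    " Some other activity ",          # hypothetical value with stray spaces
])

normalized = (
    toy_activity
    .str.replace("–", "-", regex=False)  # en dash -> hyphen
    .str.strip()                         # drop surrounding whitespace
)
print(normalized.nunique())
```

Values that differ in wording, such as "Training/teaching" versus "Teaching/training", still need an explicit mapping like the one used in the cell above.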

**6) Participant Age**

In [None]:
erasmus_df['Participant Age'].unique()

The **Age** column contains many wrong and missing values, such as ages above 100 or implausibly low ages. Valid ages should fall within a plausible range; the `drop_ages` function defined above keeps ages from 7 to 90. We will replace the invalid values with **the mean age of similar exchanges**. <br>
The best column to group by is **Activity**, because it has no missing or wrong values, so we will take **the mean age** of every activity

In [None]:
# Mark wrong and missing ages as NaN so they can be filled in later.
# '-' marks a missing age: replace it with an out-of-range sentinel (1525)
# so the column can be cast to int, then let drop_ages turn it into NaN.
erasmus_df['Participant Age'] = erasmus_df['Participant Age'].replace('-',1525)
erasmus_df['Participant Age'] = erasmus_df['Participant Age'].astype('int')
erasmus_df['Participant Age'] = erasmus_df['Participant Age'].apply(drop_ages)
erasmus_df['Participant Age'].unique()

Now we will find the mean ages for every activity

In [None]:
mean_ages_from_activity = erasmus_df[['Participant Age','Activity']].dropna()
mean_ages_from_activity['Participant Age'] = mean_ages_from_activity['Participant Age'].astype('int')
mean_ages_from_activity = round(mean_ages_from_activity.groupby(['Activity']).mean())

mean_ages_from_activity

The table above shows the mean age for each activity. Now we will fill the **nan** values with the mean age of the corresponding activity

In [None]:
# Per-activity mean age as a Series indexed by Activity
mean_age_by_activity = mean_ages_from_activity['Participant Age']

# Look up the mean age of each row's activity and use it to fill the NaN ages.
# (Assigning to the rows yielded by iterrows() is not guaranteed to update the DataFrame.)
age_fill = erasmus_df['Activity'].map(mean_age_by_activity)
erasmus_df['Participant Age'] = erasmus_df['Participant Age'].fillna(age_fill)

# Fall back to a random age in [17, 30) for activities without a known mean
still_missing = erasmus_df['Participant Age'].isna()
erasmus_df.loc[still_missing, 'Participant Age'] = random.randint(17, 30, size=still_missing.sum())

# Set the correct type for the values
erasmus_df['Participant Age'] = erasmus_df['Participant Age'].astype(int)

print(sorted(erasmus_df['Participant Age'].unique()))
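
Group-wise mean imputation can also be expressed compactly with `groupby(...).transform`. The sketch below runs on a toy frame with made-up ages; it illustrates the idea and is not the notebook's exact procedure.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Activity": ["Study", "Study", "Study", "Training", "Training"],
    "Participant Age": [20.0, 22.0, np.nan, 40.0, np.nan],
})

# Mean age of each activity, broadcast back to every row of that activity
group_mean = toy_df.groupby("Activity")["Participant Age"].transform("mean")

# Only the NaN entries are replaced; known ages stay untouched
toy_df["Participant Age"] = toy_df["Participant Age"].fillna(group_mean.round())
print(toy_df)
```

Because `transform` returns a result aligned with the original index, no explicit loop or dictionary lookup is needed.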


**The Participant Age column is clean now**

**7) Participant Nationality**

In [None]:
print(sorted(erasmus_df['Participant Nationality'].unique()))

We will drop the rows where the nationality is "-", because there are only 2000 of them

In [None]:
erasmus_df = erasmus_df[erasmus_df['Participant Nationality'] != '-']
print(sorted(erasmus_df['Participant Nationality'].unique()))

The column is clean now, but we will replace each country code with the country name

In [None]:
# New column for countries names
correct_cantries_names = {
    "Palestine, State of":"Palestine",
    "Virgin Islands, U.S.": "Virgin Islands",
    "Taiwan, Province of China": "Taiwan",
    "Holy See (Vatican City State)": "Holy See"
}
erasmus_df["Nationality"] = erasmus_df['Participant Nationality'].apply(lambda x:countries(x))
erasmus_df["Nationality"] = erasmus_df["Nationality"].replace(correct_cantries_names)
erasmus_df = erasmus_df.drop(columns = "Participant Nationality")

In [None]:
print(sorted(erasmus_df['Nationality'].unique(),key=len))

We will change these values manually: **['XK', 'EL', 'UK', 'TP', 'AN', 'AB', 'CP']**

In [None]:
erasmus_df["Nationality"] = erasmus_df["Nationality"].replace({
                                        "XK":"Kosovo",
                                        "EL":"Greece",
                                        "UK":"United Kingdom",
                                        "TP":"East Timor",
                                        "AN":"Netherlands Antilles",
                                        "AB":"Albania",
                                        "CP":"Clipperton Island",
})
print(sorted(erasmus_df['Nationality'].unique(), key=len))
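
Instead of spotting leftover codes by eye in a list sorted by length, unresolved codes can be listed programmatically. The sketch below uses a tiny stand-in lookup table (`code_to_name` is hypothetical; in the notebook the names come from `pycountry`) together with a few of the manual overrides shown above.

```python
import pandas as pd

# Hypothetical stand-in for the pycountry lookup
code_to_name = {"DE": "Germany", "FR": "France"}
# Non-ISO codes taken from the manual replacement above
manual_overrides = {"EL": "Greece", "UK": "United Kingdom", "XK": "Kosovo"}

codes = pd.Series(["DE", "EL", "UK", "ZZ"])   # "ZZ" is a made-up unknown code
lookup = {**code_to_name, **manual_overrides}

# map() yields NaN for unknown codes; fillna() keeps the raw code in that case
names = codes.map(lookup).fillna(codes)

# Codes that were left unchanged still need a manual decision
unresolved = sorted(names[names == codes].unique())
print(unresolved)
```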

Nationality column is perfect now

**8) Field of Education**

In [None]:
erasmus_df['Field of Education'].value_counts()

There are **848635** `? Unknown ?` values in the Field of Education column. **That is a large number**, and we cannot recover the real values, so we will replace them with `Other`.

In [None]:
erasmus_df['Field of Education'] = erasmus_df['Field of Education'].replace("? Unknown ?","Other")

Field of Education is perfect now

**9) Education Level**

In [None]:
unknown = len(erasmus_df[erasmus_df['Education Level'] == '??? - ? Unknown ?'])
print("There are ({}) of Education Level are Unknown values ".format(unknown))
sorted(erasmus_df['Education Level'].unique())

**We will replace the current values with these values to be more readable:**
- ISCED 0 = Early childhood education
- ISCED 1 = Primary Education
- ISCED 2 = Lower Secondary Education
- ISCED 3 = Upper Secondary Education
- ISCED 4 = Post-secondary non-Tertiary Education
- ISCED 5 = Short-cycle tertiary education
- ISCED 6 = Bachelors degree or equivalent tertiary education level
- ISCED 7 = Masters degree or equivalent tertiary education level
- ISCED 8 = Doctoral degree or equivalent tertiary education level

In [None]:
erasmus_df['Education Level'] = erasmus_df['Education Level'].replace({
"??? - ? Unknown ?":"Unknown",
"ISCED-2 - Lower secondary education":"Lower Secondary Education",
"ISCED-3 - Upper secondary education":"Upper Secondary Education",
"ISCED-4 - Post-secondary non-tertiary education":"Post-secondary non-Tertiary Education",
"ISCED-5 - Short-cycle within the first cycle / Short-cycle tertiary education (EQF-5)":"Short-cycle tertiary education",
"ISCED-6 - First cycle / Bachelor’s or equivalent level (EQF-6)":"Bachelors degree or equivalent tertiary education level",
"ISCED-7 - Second cycle / Master’s or equivalent level (EQF-7)":"Masters degree or equivalent tertiary education level",
"ISCED-8 - Third cycle / Doctoral or equivalent level (EQF-8)":"Doctoral degree or equivalent tertiary education level",
"ISCED-9 - Not elsewhere classified":"Not classified",
})
sorted(erasmus_df['Education Level'].unique())
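
If an ordered numeric level is useful later (for example as a model feature), the ISCED digit can be extracted from the raw labels before they are renamed. This is a sketch on three of the raw labels shown in the mapping above; `isced_level` is a name introduced here for illustration.

```python
import pandas as pd

toy_levels = pd.Series([
    "ISCED-6 - First cycle / Bachelor’s or equivalent level (EQF-6)",
    "ISCED-3 - Upper secondary education",
    "??? - ? Unknown ?",
])

# Capture the digit right after "ISCED-"; rows without a match become NaN
extracted = toy_levels.str.extract(r"^ISCED-(\d)", expand=False)
isced_level = pd.to_numeric(extracted)
print(isced_level.tolist())
```

Unlike an arbitrary counter, this number preserves the natural ordering of the education levels.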

Education Level is perfect now

**10) Participant Gender**

In [None]:
unknown_gender = len(erasmus_df[erasmus_df['Participant Gender'] == 'Undefined'])
print("There are ({}) Undefined values in Participant Gender".format(unknown_gender))
sorted(erasmus_df['Participant Gender'].unique())

There are only a few Undefined values, so we will fill them with the most frequent value (the mode)

In [None]:
erasmus_df['Participant Gender'] = erasmus_df['Participant Gender'].replace("Undefined",np.nan)

# Fill NaN values with the most frequent value (assignment instead of inplace on a column)
erasmus_df['Participant Gender'] = erasmus_df['Participant Gender'].fillna(erasmus_df['Participant Gender'].mode()[0])
sorted(erasmus_df['Participant Gender'].unique())

Participant Gender is perfect now

**11) Participant Profile**

In [None]:
sorted(erasmus_df['Participant Profile'].unique())

Participant Profile is cool

**12) Special Needs**

In [None]:
sorted(erasmus_df['Special Needs'].unique())

Special Needs is cool too

**13) Fewer Opportunities**

In [None]:
sorted(erasmus_df['Fewer Opportunities'].unique())

Fewer Opportunities is cool too

**14) GroupLeader**

In [None]:
sorted(erasmus_df['GroupLeader'].unique())

Group Leader is cool too

**15) Sending Country Code**

In [None]:
print(sorted(erasmus_df['Sending Country Code'].unique()))

We will replace the country code with the country name

In [None]:
# New column for countries names
erasmus_df["Sending Country Name"] = erasmus_df['Sending Country Code'].apply(lambda x:countries(x))
erasmus_df["Sending Country Name"] = erasmus_df["Sending Country Name"].replace({
                                        "XK":"Kosovo",
                                        "EL":"Greece",
                                        "UK":"United Kingdom"
})
erasmus_df["Sending Country Name"] = erasmus_df["Sending Country Name"].replace(correct_cantries_names)

erasmus_df = erasmus_df.drop(columns = "Sending Country Code")
print(sorted(erasmus_df['Sending Country Name'].unique(), key=len ))

Sending Country Name is perfect now

**16) Receiving Country Code**

In [None]:
print(sorted(erasmus_df['Receiving Country Code'].unique(), key = len))

We will replace the country code with the country name

In [None]:
erasmus_df["Receiving Country Name"] = erasmus_df['Receiving Country Code'].apply(lambda x:countries(x))
erasmus_df["Receiving Country Name"] = erasmus_df["Receiving Country Name"].replace({
                                        "XK":"Kosovo",
                                        "EL":"Greece",
                                        "UK":"United Kingdom"
})
erasmus_df["Receiving Country Name"] = erasmus_df["Receiving Country Name"].replace(correct_cantries_names)

print(sorted(erasmus_df['Receiving Country Name'].unique(), key = len))

This is the **class label for our ML algorithm**, and it is clean now

**17) Participants**

In [None]:
print(sorted(erasmus_df['Participants'].unique()))

We will convert the values to the correct type (integer) 

In [None]:
erasmus_df['Participants'] = erasmus_df['Participants'].astype(int)
print(sorted(erasmus_df['Participants'].unique()))

Participants is perfect now

**18) Sending City** <br>
**19) Receiving City**<br>
**20) Sending Organization**<br>
**21) Receiving Organization**<br>
**These columns are very dirty: cleaning them would take a lot of time and the values would still not be reliable afterwards, so we will drop them from the dataset**   

In [None]:
print("There are ({}) unique Receiving organization names ".format(erasmus_df['Receiving Organization'].nunique()))
print("There are ({}) unique Sending organization names ".format(erasmus_df['Sending Organization'].nunique()))
print("There are ({}) unique Sending City names ".format(erasmus_df['Sending City'].nunique()))
print("There are ({}) unique Receiving City names ".format(erasmus_df['Receiving City'].nunique()))
erasmus_df = erasmus_df.drop(columns = ["Receiving Organization",
                                        "Sending Organization",
                                        "Sending City",
                                        "Receiving City",
                                        'Receiving Country Code'])

**This code could be used to clean the city names**

In [None]:
# from fuzzywuzzy import fuzz
# from fuzzywuzzy import process
# !pip install swifter
# import swifter
# !pip install allcities
# from allcities import cities

# # Store correct cities names in a list.
# correct_cities_names = []
# for i in list(cities):
#     correct_cities_names.append(i.name)

# # Compare the wrong city name with the correct ones.
# def fuzzywuzzy_city_name(city):
#     return process.extractOne(city, correct_cities_names, scorer=fuzz.token_set_ratio)[0]
   
# # Will process the city name and get the right name.
# def correct_city_name(x):
#     if x.replace(" ","").isalpha():
#         return x.capitalize()
#     else:
#         x_city = " ".join(re.findall("[a-zA-Z]+",x))
#         if len(x_city) == 0:
#             return "Unknown"
#         else:
#             return fuzzywuzzy_city_name(x_city)

# # Process and replace the wrong city name and set the right name on the dataframe.
# erasmus_df['Sending City'] = erasmus_df['Sending City'].swifter.apply(correct_city_name)
# erasmus_df['Receiving City'] = erasmus_df['Receiving City'].swifter.apply(correct_city_name)

**The last step, correcting the data types**

In [None]:
# list of Category types
category_type = ['Academic Year', 'Mobility Start Month', 'Mobility End Month',
                 'Activity', 'Field of Education','Education Level', 'Participant Gender',
                 'Participant Profile','Special Needs', 'Fewer Opportunities', 
                 'GroupLeader', 'Nationality','Sending Country Name', 
                 'Receiving Country Name']
erasmus_df[category_type] = erasmus_df[category_type].astype("category")
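
The benefit of the `category` dtype for low-cardinality columns can be checked directly by comparing memory usage. The sketch below uses a toy column of repeated academic-year labels; the exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Three distinct labels repeated many times, like Academic Year in the dataset
toy_col = pd.Series(["2014-2015", "2015-2016", "2016-2017"] * 1000)

# deep=True counts the actual string storage, not just the references
object_bytes = toy_col.memory_usage(deep=True)
category_bytes = toy_col.astype("category").memory_usage(deep=True)
print(object_bytes, category_bytes)
```

A categorical column stores each distinct label once plus a small integer code per row, which is why it is smaller when a few values repeat often.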

# 📊📈 Exploratory Data Analysis

In [None]:
print("There are ({}) exchange in our dataset, after processing and cleaning".format(len(erasmus_df)))
print("There are ({}) features (columns) in our dataset".format(len(erasmus_df.columns)))

In [None]:
erasmus_df.info()

In [None]:
erasmus_df.describe()

In [None]:
print("Describe the Numerical Columns")
print("The minimum value of [Mobility Duration] is : {} ".format(erasmus_df['Mobility Duration'].min()))
print("The maximum value of [Mobility Duration] is : {} ".format(erasmus_df['Mobility Duration'].max()))
print("The mean value of [Mobility Duration] is : {} ".format(round(erasmus_df['Mobility Duration'].mean())))
print("-"*50)
print("The minimum value of [Participant Age] is : {} ".format(erasmus_df['Participant Age'].min()))
print("The maximum value of [Participant Age] is : {} ".format(erasmus_df['Participant Age'].max()))
print("The mean value of [Participant Age] is : {} ".format(round(erasmus_df['Participant Age'].mean())))
print("-"*50)
print("The minimum value of [Participants] is : {} ".format(erasmus_df['Participants'].min()))
print("The maximum value of [Participants] is : {} ".format(erasmus_df['Participants'].max()))
print("The mean value of [Participants] is : {} ".format(round(erasmus_df['Participants'].mean())))

In [None]:
# Import libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns


**1) Data Distribution of Participant Age**

In [None]:
# Set-up the size and plots 
fig, axs = plt.subplots()
fig.set_size_inches([15, 8])


axs.hist(erasmus_df['Participant Age'],color="#004494")

axs.set_title("Data Distribution of Participant Age", fontsize=20)
axs.set_xticks([i for i in range(0,80,3)])
axs.set_xlabel("Participant Age", color="#004494", fontsize=17)
axs.set_ylabel("Count of Participant Age", color="#004494", fontsize=17)

plt.show()

- Ages above 60 and below 16 may be outliers
- Ages 15 to 22 have the highest counts

**2) Data Distribution of Participants**

In [None]:
# Set-up the size and plots 
plt.figure(figsize=(15,8))

gfg = sns.countplot(x = "Participants", data=erasmus_df, color="#004494")
gfg.set(xlabel ="Participants", ylabel = "Count of Participants", title ='Data Distribution of Participants')

plt.show()

Most exchanges have fewer than 2 participants, because most exchanges are for a single participant

**3) Data Distribution of Mobility Duration**

In [None]:
# Set-up the size and plots 
fig, axs = plt.subplots()
fig.set_size_inches([15, 8])


axs.hist(erasmus_df['Mobility Duration'],color="#004494")

axs.set_title("Data Distribution of Mobility Duration", fontsize=20)
axs.set_xticks([i for i in range(0,500,20)])
axs.set_xlabel("Mobility Duration", color="#004494", fontsize=17)
axs.set_ylabel("Count of Mobility Duration", color="#004494", fontsize=17)

plt.show()

Most durations are between 0 and 80 days 

**4) Let's check whether there is any relationship between the numerical columns**

In [None]:
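# Restrict to the numerical columns: the categorical columns cannot be correlated directly
erasmus_df[['Mobility Duration', 'Participant Age', 'Participants']].corr()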

There is no strong linear relationship between the numerical columns

**5) Data Distribution of Academic Year**

In [None]:
# Set-up the size and plots 
plt.figure(figsize=(15,8))

gfg = sns.countplot(x = "Academic Year", data=erasmus_df, color="#004494")
gfg.set(xlabel ="Academic Year", 
        ylabel = "Count of Academic Year", 
        title ='Data Distribution of Academic Year')

plt.show()

- The number of participants increases year by year from 2014 to 2020
- From 2014 to 2016 the count increased by more than half

**6) Difference between Sending and Receiving Countries**

In [None]:
sending = (erasmus_df['Sending Country Name'].value_counts()
           .rename_axis('Country').reset_index(name='Sending Amount'))
receiving = (erasmus_df['Receiving Country Name'].value_counts()
             .rename_axis('Country').reset_index(name='Receiving Amount'))

In [None]:
# Merge on the country name so the result does not depend on pandas' default column names
sending_receiving = sending.merge(receiving, on='Country')
# Keep only the 20 largest sending countries
sending_receiving = sending_receiving.head(20)

sending_receiving = sending_receiving.set_index('Country')
sending_receiving

In [None]:
ax = sending_receiving.plot(kind='bar', figsize=(15, 8), rot=0)
ax.set_title('Sending and Receiving Amount by Country')
ax.set_xlabel('Country')
ax.set_ylabel('Amount')
plt.xticks(rotation=90)
plt.show()
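
To make the difference between sending and receiving explicit, a net balance per country can be computed from the two count columns. The sketch below uses made-up counts and placeholder country labels; `Net Balance` is a column name introduced here for illustration.

```python
import pandas as pd

# Toy counts; the real values come from the merged sending/receiving table
toy_counts = pd.DataFrame(
    {"Sending Amount": [100, 80, 50], "Receiving Amount": [70, 120, 50]},
    index=pd.Index(["A", "B", "C"], name="Country"),
)

# Positive balance: the country receives more participants than it sends
toy_counts["Net Balance"] = toy_counts["Receiving Amount"] - toy_counts["Sending Amount"]
net_receivers = toy_counts[toy_counts["Net Balance"] > 0].index.tolist()
print(toy_counts.sort_values("Net Balance", ascending=False))
```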

**7) Data Distribution of Education Level**

In [None]:
plt.figure(figsize=(15,8))

# count the frequency of each category
counts = erasmus_df["Education Level"].value_counts()

# create a bar chart of the counts
plt.bar(counts.index, counts.values)

# add axis labels and title
plt.xlabel('Education Level')
plt.ylabel('Count of Education Level (Millions)')
plt.title('Frequency of Education Level')
plt.xticks(rotation=90)

plt.show()
counts = None

**8) Group Leader** <br>
**9) Participant Gender** <br>
**10) Participant Profile** <br>
**11) Special Needs** <br>
#### I will plot them as donut charts

In [None]:
fig, axs = plt.subplots(2, 2)
fig.set_size_inches([15, 8])

# Gender 
gender = erasmus_df['Participant Gender'].value_counts()
axs[0, 0].pie(gender.values,
              labels = gender.index,
              wedgeprops=dict(width=0.5),
              autopct='%1.1f%%')
axs[0, 0].set_title('Gender %')

# Group Leader
GroupLeader = erasmus_df['GroupLeader'].value_counts()
axs[0, 1].pie(GroupLeader.values,
              labels = GroupLeader.index,
              wedgeprops=dict(width=0.5),
              autopct='%1.1f%%')
axs[0, 1].set_title('Group Leader %')

# Participant Profile
ParticipantProfile = erasmus_df['Participant Profile'].value_counts()
axs[1, 0].pie(ParticipantProfile.values,
              labels = ParticipantProfile.index,
              wedgeprops=dict(width=0.5),
              autopct='%1.1f%%')
axs[1, 0].set_title('Participant Profile %')


# Special Needs
SpecialNeeds = erasmus_df['Special Needs'].value_counts()
axs[1, 1].pie(SpecialNeeds.values,
              labels = SpecialNeeds.index,
              wedgeprops=dict(width=0.5),
              autopct='%1.1f%%')
axs[1, 1].set_title('Special Needs %')

plt.show()
gender,GroupLeader,ParticipantProfile,SpecialNeeds = None,None,None,None

**Data Distribution of Education Level and Academic Year**

In [None]:
# group the dataframe by academic year and education level
grouped = erasmus_df[['Academic Year','Education Level']]
grouped = grouped.groupby(['Academic Year', 'Education Level']).size().unstack(fill_value=0)

# create a stacked bar chart
fig, axs = plt.subplots(figsize=(15, 11))

grouped.plot(kind='bar', stacked=True, ax = axs)

# set the x-label and y-label
axs.set_xlabel('Academic Year')
axs.set_ylabel('Count')

# display the plot
plt.show()
grouped = None

- **The counts across education levels increased over 2016 - 2020**
- **From 2014-2015 to 2015-2016 the number of exchanges doubled** 
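
Growth statements like these can be checked numerically with a year-over-year percentage change. The sketch below uses illustrative counts only (they are not taken from the dataset).

```python
import pandas as pd

# Toy academic-year column: 2, 4 and 5 exchanges in three consecutive years
toy_years = pd.Series(
    ["2014-2015"] * 2 + ["2015-2016"] * 4 + ["2016-2017"] * 5
)

# Count exchanges per year in chronological order
per_year = toy_years.value_counts().sort_index()

# Relative change versus the previous year; the first year has no reference
growth = per_year.pct_change()
print(growth)
```

A value of 1.0 means the count doubled relative to the previous year.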

**12) The five most popular fields of education**

In [None]:
fig, axs = plt.subplots(figsize=(15, 11))
FieldofEducation = erasmus_df['Field of Education'].value_counts().head(5)
axs.pie(FieldofEducation.values,
              labels = FieldofEducation.index,
              wedgeprops=dict(width=0.6),
              autopct='%1.1f%%',
              textprops={'fontsize': 25})
axs.set_title('Field of Education %')
plt.show()
FieldofEducation = None

**13) Data Distribution of Activity**

In [None]:
plt.figure(figsize=(15,8))

# count the frequency of each category
Activity = erasmus_df["Activity"].value_counts()

# create a bar chart of the counts
plt.bar(Activity.index, Activity.values)

# add axis labels and title
plt.xlabel('Activity')
plt.ylabel('Count of Activity (Millions)')
plt.title('Frequency of Activity')
plt.xticks(rotation=90)

plt.show()
Activity = None

**14) Data Distribution of Sending Country Name (biggest 30) only**

In [None]:
import squarify
plt.figure(figsize=(16,10))

# count exchanges for the 30 largest sending countries
SendingCountryName = erasmus_df["Sending Country Name"].value_counts().head(30)

# create treemap using squarify
squarify.plot(sizes=SendingCountryName.values, label=SendingCountryName.index, alpha=.8 )

# create a legend and title
plt.axis('off')
plt.title("Data Distribution of Sending Country Name (biggest 30) only")

# display the plot
plt.show()

SendingCountryName = None

**15) Data Distribution of Receiving Country Name (biggest 30) only**

In [None]:
plt.figure(figsize=(16,10))

# count exchanges for the 30 largest receiving countries
ReceivingCountryName = erasmus_df["Receiving Country Name"].value_counts().head(30)

# create treemap using squarify
squarify.plot(sizes=ReceivingCountryName.values, label=ReceivingCountryName.index, alpha=.8 )

# create a legend and title
plt.axis('off')
plt.title("Data Distribution of Receiving Country Name (biggest 30) only")

# display the plot
plt.show()
ReceivingCountryName = None

# 🐍 ML Clustering

In [None]:
fig, axs = plt.subplots(figsize=(15, 11))
axs.boxplot([erasmus_df['Mobility Duration'],
             erasmus_df['Participant Age'],
             erasmus_df['Participants']])
axs.set_xticklabels(['Mobility Duration','Participant Age','Participants'])
plt.show()

In [None]:
erasmus_clusters_df = erasmus_df.copy()
changed_categories = {
    "Yes":1,
    "No":0,
    "Female":1,
    "Male":2,
    "Learner":1,
    "Staff":2
}
# Replace all (Yes,No), (Female,Male), (Learner, Staff)
erasmus_clusters_df = erasmus_clusters_df.replace(changed_categories)

# Replace Mobility Start Month with the month only
erasmus_clusters_df['Mobility Start Month'] = erasmus_clusters_df['Mobility Start Month'].apply(lambda x:x[-2:])

# Replace Mobility End Month with the month only
erasmus_clusters_df['Mobility End Month'] = erasmus_clusters_df['Mobility End Month'].apply(lambda x:x[-2:])

# Replace Academic Year with the second year only
erasmus_clusters_df['Academic Year'] = erasmus_clusters_df['Academic Year'].apply(lambda x:x[-4:])

# Replace Activity with counter from (1 to N)
cnt = 1
changed_values = {}
for i in erasmus_df['Activity'].unique():
    changed_values[i] = cnt
    cnt += 1
    
erasmus_clusters_df['Activity'] = erasmus_clusters_df['Activity'].replace(changed_values)

# Replace Field of Education with counter from (1 to N)
cnt = 1
changed_values = {}
for i in erasmus_clusters_df['Field of Education'].unique():
    changed_values[i] = cnt
    cnt += 1
erasmus_clusters_df['Field of Education'] = erasmus_clusters_df['Field of Education'].replace(changed_values)

# Replace Education Level with counter from (1 to N)
cnt = 1
changed_values = {}
for i in erasmus_clusters_df['Education Level'].unique():
    changed_values[i] = cnt
    cnt += 1
erasmus_clusters_df['Education Level'] = erasmus_clusters_df['Education Level'].replace(changed_values)
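
The three counter loops above follow the same pattern, which `pd.factorize` provides out of the box. The sketch below shows it on a toy column with made-up field names.

```python
import pandas as pd

toy_field = pd.Series(["Law", "Arts", "Law", "Health", "Arts"])

# factorize returns 0-based codes in order of first appearance, plus the unique values
codes, uniques = pd.factorize(toy_field)

# Shift to 1..N to match the counters used in the loops above
encoded = pd.Series(codes + 1, index=toy_field.index)
print(encoded.tolist(), list(uniques))
```

Note that integer codes give nominal categories an arbitrary order and distance, which distance-based methods such as K-means will treat as meaningful; one-hot encoding is a common alternative when that matters.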


In [None]:
nationalities = list(erasmus_clusters_df['Nationality'].unique())
sending_country_name = list(erasmus_clusters_df['Sending Country Name'].unique())
receiving_country_name = list(erasmus_clusters_df['Receiving Country Name'].unique())
countries_names = list(set(nationalities + sending_country_name + receiving_country_name))

In [None]:
from geopy.geocoders import Nominatim

# Create a Geopy geocoder instance
geolocator = Nominatim(user_agent="my_app")

# Return the latitude of a country, or the name itself when it cannot be geocoded
def get_latitude(country_name):
    location = geolocator.geocode(country_name)
    if location is not None:
        return location.latitude
    else:
        return country_name

country_latitude = {}

for country_name in countries_names:
    country_latitude[country_name] = get_latitude(country_name)
country_latitude.values()

In [None]:
erasmus_clusters_df['Nationality'] = erasmus_clusters_df['Nationality'].replace(country_latitude)
erasmus_clusters_df['Sending Country Name'] = erasmus_clusters_df['Sending Country Name'].replace(country_latitude)
erasmus_clusters_df['Receiving Country Name'] = erasmus_clusters_df['Receiving Country Name'].replace(country_latitude)

In [None]:
# erasmus_clusters_df.head()
erasmus_clusters_df = erasmus_clusters_df.astype(float)

In [None]:
erasmus_clusters_df.head()

In [None]:
from sklearn.preprocessing import MinMaxScaler

# create a MinMaxScaler object
scaler = MinMaxScaler(feature_range=(-1,1))

# fit and transform the data using the scaler
df_scaled = pd.DataFrame(scaler.fit_transform(erasmus_clusters_df), columns=erasmus_clusters_df.columns)

df_scaled

In [None]:
import pandas as pd
from sklearn.cluster import KMeans

# erasmus_df_k_means = erasmus_df[['Participant Age','Participants','Mobility Duration']]
# scaler = MinMaxScaler(feature_range=(0,1))

# erasmus_df_k_means = pd.DataFrame(scaler.fit_transform(erasmus_df_k_means), columns=erasmus_df_k_means.columns)
# create the K-means object with the desired number of clusters
kmeans = KMeans(n_clusters=3, random_state=42)

# fit the K-means object to the data
kmeans.fit(df_scaled)

# get the cluster labels for each data point
labels = kmeans.predict(df_scaled)

# get the centroids of the clusters
centroids = kmeans.cluster_centers_

df_scaled['Clusters'] = labels
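
The cell above fixes `n_clusters=3` without showing how that value was chosen. One common check is to compare silhouette scores across several candidate values. The sketch below is not the notebook's method: it uses synthetic data from `make_blobs` as a stand-in for `df_scaled`, and `X`, `scores` and `best_k` are illustrative names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic stand-in for df_scaled: three well-separated groups (assumption)
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    cluster_labels = model.fit_predict(X)
    # mean silhouette over all points: higher means tighter, better-separated clusters
    scores[k] = silhouette_score(X, cluster_labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On a large table the silhouette computation is expensive, since it compares every pair of points. The `sample_size` argument of `silhouette_score` scores a random subset instead.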

In [None]:
centroids

In [None]:
df_cluster_1 = df_scaled[df_scaled.Clusters == 0]
df_cluster_2 = df_scaled[df_scaled.Clusters == 1]
df_cluster_3 = df_scaled[df_scaled.Clusters == 2]
# df_cluster_4 = df_scaled[df_scaled.Clusters == 3]
# df_cluster_5 = df_scaled[df_scaled.Clusters == 4]

# centroid columns follow the column order of the data K-means was fitted on,
# so look up the positions of the two plotted features
x_idx = df_scaled.columns.get_loc('Participants')
y_idx = df_scaled.columns.get_loc('Participant Age')

fig, axs = plt.subplots(figsize=(15, 11))

plt.scatter(df_cluster_1['Participants'], df_cluster_1['Participant Age'], color="green", label="cluster 1")
plt.scatter(df_cluster_2['Participants'], df_cluster_2['Participant Age'], color="red", label="cluster 2")
plt.scatter(df_cluster_3['Participants'], df_cluster_3['Participant Age'], color="blue", label="cluster 3")
# plt.scatter(df_cluster_4['Participant Age'],df_cluster_4['Mobility Duration'],color = "yellow")
# plt.scatter(df_cluster_5['Participant Age'],df_cluster_5['Mobility Duration'],color = "black")
plt.scatter(centroids[:, x_idx], centroids[:, y_idx], color="purple", marker="*", label="centroid")
plt.xlabel('Participants')
plt.ylabel('Participant Age')
plt.legend()

In [None]:
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

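# select the features to cluster, as in the commented-out K-means selection above
erasmus_df_k_means = erasmus_df[['Participant Age', 'Participants', 'Mobility Duration']]

# standardize the data using StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(erasmus_df_k_means)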

# create the DBSCAN object with the desired parameters
dbscan = DBSCAN(eps=0.1, min_samples=5)

# fit the DBSCAN object to the data
dbscan.fit(data_scaled)

# get the cluster labels for each data point
labels = dbscan.labels_

# get the number of clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
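
DBSCAN results depend heavily on `eps`, and a small value such as 0.1 on standardized data may mark many points as noise. A rough sketch for checking this counts clusters and noise points for a few `eps` values. It runs on toy points rather than the Erasmus data, and `summarize_dbscan` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def summarize_dbscan(labels):
    """Return (number of clusters, number of noise points); DBSCAN labels noise as -1."""
    labels = np.asarray(labels)
    n_noise = int(np.sum(labels == -1))
    n_clusters = len(set(labels.tolist())) - (1 if n_noise else 0)
    return n_clusters, n_noise

# toy data (assumption): two dense rows of points plus one isolated outlier
row_a = [(0.1 * i, 0.0) for i in range(10)]
row_b = [(10 + 0.1 * i, 10.0) for i in range(10)]
points = np.array(row_a + row_b + [(50.0, 50.0)])

for eps in (0.05, 0.5):
    found = DBSCAN(eps=eps, min_samples=5).fit(points).labels_
    print(eps, summarize_dbscan(found))
```

Here the smaller `eps` is below the spacing between neighbouring points, so no point has enough neighbours to start a cluster. The larger value recovers both rows and leaves only the outlier as noise.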

In [None]:
get_latitude("Holy See")

In [None]:
# from kmodes.kprototypes import KPrototypes
# customers_norm = pd.get_dummies(erasmus_df_clustering,
#                                 columns = [['Academic Year', 
#                                             'Mobility Start Month', 
#                                             'Mobility End Month',
#                                             'Activity', 
#                                             'Field of Education',
#                                             'Education Level', 
#                                             'Participant Gender', 
#                                             'Participant Profile',
#                                             'Special Needs', 
#                                             'Fewer Opportunities', 
#                                             'GroupLeader',
#                                             'Participant Age', 
#                                             'Participants', 
#                                             'Nationality',
#                                             'Sending Country Name', 
#                                             'Receiving Country Name']])
# kproto = KPrototypes(n_clusters=3, init='Cao')
# clusters = kproto.fit_predict(erasmus_df_clustering, categorical=[0, 1])

# #join data with labels 
# erasmus_df_clustering['Cluster'] = clusters
# erasmus_df_clustering['Cluster']

In [None]:
range(1,2)

In [None]:
erasmus_df.columns
# erasmus_df_clustering['Mobility Start Month'].unique()

# 🌍 ML Receiving Country Prediction