# Introduction

## Dataset

https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0003

Number of students in Germany by subject of study, nationality, and gender for the winter semesters 2018/19 to 2023/24. I will focus on the most recent semester, 2023/24, for a start. (Will I?)

See this document for the official subject codes and subject cluster classifications: https://www.destatis.de/DE/Methoden/Klassifikationen/Bildung/studenten-pruefungsstatistik.pdf?__blob=publicationFile&v=12

## Questions 

1. What was the total number of students in Germany in 2023/24?
2. What were the 10 subjects with the highest number of students in 2023/24?
3. How was the gender distribution in 2023/24?
4. How was the gender distribution in the 5 most studied subjects?
5. What were the top subjects by gender?
6. Which were the top 5 subjects studied by non-citizens? 
7. (Sort the subjects into clusters and provide a cluster identifier to the dataframe.) How are the student numbers distributed across subject clusters?
8. How does the number of students change over the time period by cluster?
9. How does the number of students change over the time period for language-related subjects?
10. How does the number of students change for linguistics in the narrow sense?





# Setup

## Load libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Load dataset(s)

In [2]:
# fields of study, all of Germany, 2018-2024
# loading both the flat and the non-flat csv export
# https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0003
stud = pd.read_csv('./datasets/21311-0003_de_flat_2018-2024_GER.csv',sep=';')
stud_nofl = pd.read_csv('./datasets/21311-0003_de_2018-2024_GER.csv',sep=';')

# for alternative table with data per state see:
# https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0006

# Preprocessing

In [3]:
stud_nofl.head(10)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Tabelle: 21311-0003
"Studierende: Deutschland, Semester, Nationalität,",,,,,,,,,,,
"Geschlecht, Studienfach",,,,,,,,,,,
Statistik der Studenten,,,,,,,,,,,
Deutschland,,,,,,,,,,,
Studierende (Anzahl),,,,,,,,,,,
,,,Deutsche,,,Ausländer,,,Insgesamt,,
,,,männlich,weiblich,Insgesamt,männlich,weiblich,Insgesamt,männlich,weiblich,Insgesamt
WS 2018/19,SF141,Abfallwirtschaft,80,34,114,5,-,5,85,34,119
,SF002,Afrikanistik,395,711,1106,120,168,288,515,879,1394
,SF138,Agrarbiologie,256,259,515,184,137,321,440,396,836


The non-flat csv seems to be aimed at presentation in spreadsheet editors. Reformatting might be possible, but potentially complex. Let's try the flat csv instead.
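Should the non-flat export ever be needed, a reshaping along the following lines might work. This is only a sketch on a synthetic excerpt that mimics the layout shown above. The number of rows to skip, the `;` separator inside the excerpt, and the column names (`de_m`, `foreign_total`, ...) are assumptions for illustration and have not been checked against the actual file.

```python
import io
import pandas as pd

# Synthetic excerpt (NOT the real file): one title row, two header rows,
# then data rows where the semester is only given on the first row of a block.
raw = """Studierende (Anzahl);;;;;;;;;;;
;;;Deutsche;;;Ausländer;;;Insgesamt;;
;;;männlich;weiblich;Insgesamt;männlich;weiblich;Insgesamt;männlich;weiblich;Insgesamt
WS 2018/19;SF141;Abfallwirtschaft;80;34;114;5;-;5;85;34;119
;SF002;Afrikanistik;395;711;1106;120;168;288;515;879;1394
"""

# Hypothetical column names replacing the two-level header.
names = ['semester', 'subj_code', 'subj_name',
         'de_m', 'de_w', 'de_total',
         'foreign_m', 'foreign_w', 'foreign_total',
         'all_m', 'all_w', 'all_total']

# Skip title and header rows (3 in this excerpt; the real file has more).
body = pd.read_csv(io.StringIO(raw), sep=';', skiprows=3, header=None, dtype=str)
body.columns = names

# The semester is blank on continuation rows, so carry it forward.
body['semester'] = body['semester'].ffill()

# One row per (semester, subject, group), similar to the flat export.
long = body.melt(id_vars=['semester', 'subj_code', 'subj_name'],
                 var_name='group', value_name='stud_count')
```

Even in this small form, the header handling has to be hard-coded, which is the main argument for preferring the flat export.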

In [4]:
stud.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
statistics_code,21311,21311,21311,21311,21311,21311,21311,21311,21311,21311
statistics_label,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten
time_code,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST
time_label,Semester,Semester,Semester,Semester,Semester,Semester,Semester,Semester,Semester,Semester
time,2018-10P6M,2020-10P6M,2019-10P6M,2022-10P6M,2020-10P6M,2022-10P6M,2019-10P6M,2019-10P6M,2022-10P6M,2023-10P6M
1_variable_code,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG
1_variable_label,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt
1_variable_attribute_code,DG,DG,DG,DG,DG,DG,DG,DG,DG,DG
1_variable_attribute_label,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland
2_variable_code,NAT,NAT,NAT,NAT,NAT,NAT,NAT,NAT,NAT,NAT


The structure of the flat csv is also rather complex, which probably makes sense for standardisation purposes at the Statistisches Bundesamt, but for present purposes it makes sense to create a more transparent dataframe. Its advantage over the non-flat csv is that every line seems to correspond cleanly to one datapoint. Codes and labels are intermingled, so some cleanup will be required.

The values of `4_variable_attribute_label` also hint at a problem to be encountered later, namely shifts in the subject code allocation. The label in the first row carries the note "ab 2020 zu SF...", so that course of study seems to have been assigned to a new code in 2020. There are likely to be other datapoints with this issue, to be checked later.
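One way to list the affected subjects later might be to search the labels for the reassignment note. The sketch below assumes that such notes always follow the pattern `(ab YYYY zu SFnnn)`, as in the label `Körperbehindertenpädagogik (ab 2016 zu SF190)` that occurs in this dataset; the variable names are made up for illustration.

```python
import pandas as pd

# Small sample of subject labels; in the notebook this would be
# the `4_variable_attribute_label` column.
labels = pd.Series([
    'Körperbehindertenpädagogik (ab 2016 zu SF190)',
    'Schiffbau/Schiffstechnik',
    'Versorgungstechnik',
])

# Assumed note format: "(ab <year> zu <new subject code>)".
pattern = r'\(ab (?P<from_year>\d{4}) zu (?P<new_code>SF\d+)\)'

moved = labels.str.extract(pattern)          # NaN where the pattern is absent
moved['label'] = labels
moved = moved.dropna(subset=['new_code'])    # keep only reassigned subjects
```

Labels whose note deviates from this format would be missed, so the result should be treated as a lower bound.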

In [5]:
stud.columns

Index(['statistics_code', 'statistics_label', 'time_code', 'time_label',
       'time', '1_variable_code', '1_variable_label',
       '1_variable_attribute_code', '1_variable_attribute_label',
       '2_variable_code', '2_variable_label', '2_variable_attribute_code',
       '2_variable_attribute_label', '3_variable_code', '3_variable_label',
       '3_variable_attribute_code', '3_variable_attribute_label',
       '4_variable_code', '4_variable_label', '4_variable_attribute_code',
       '4_variable_attribute_label', 'value', 'value_unit',
       'value_variable_code', 'value_variable_label'],
      dtype='object')

## Variables to keep and rename


**Rough overview of columns**

```python
allcols = ['statistics_code', 'statistics_label',                                                                   # identifier of statistic
            'time_code', 'time_label', 'time',                                                                      # time label 
            '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',       # datascope
            '2_variable_code', '2_variable_label', '2_variable_attribute_code', '2_variable_attribute_label',       # nationality
            '3_variable_code', '3_variable_label', '3_variable_attribute_code', '3_variable_attribute_label',       # gender
            '4_variable_code', '4_variable_label', '4_variable_attribute_code', '4_variable_attribute_label',       # subject
            'value', 'value_unit', 'value_variable_code', 'value_variable_label']                                   # value = number of students
```

In [6]:
# checking that variable 1 only has one distinct value, signifying that the scope of the data is all of Germany
stud['1_variable_code'].unique()

array(['DINSG'], dtype=object)

The identifiers for the statistic can be dropped, as can the code and label for time. Since we only include data for all of Germany for now, all `1_variable` columns can also be removed.
The general strategy for the next two variables (nationality and gender) is to keep only the `[23]_variable_attribute_code`s. These are NaN in the rows that hold totals, which makes it relatively easy to filter out those rows later and avoid double counting.
For variable 4, we keep `4_variable_attribute_code` and `4_variable_attribute_label`. The former may be helpful for clustering subjects later on, the latter is more transparent.
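The single-column check above could be generalised to find every column that carries only one distinct value and is therefore safe to drop. This is a minimal sketch on a made-up sample frame; `sample` and `constant_cols` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Made-up sample with two constant columns and two informative ones.
sample = pd.DataFrame({
    'statistics_code': [21311, 21311, 21311],
    '1_variable_code': ['DINSG', 'DINSG', 'DINSG'],
    '2_variable_attribute_code': ['NATD', None, 'NATA'],
    'value': ['3', '928', '21'],
})

# Count distinct values per column, treating NaN as a value of its own.
n_unique = sample.nunique(dropna=False)

# Columns with exactly one distinct value carry no information.
constant_cols = n_unique[n_unique == 1].index.tolist()
```

Columns such as the labels for nationality and gender are not constant, so they would still have to be dropped by hand.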


In [7]:
remove_cols = [
    'statistics_code', 'statistics_label',                                                              # identifier of statistic
    'time_code', 'time_label',                                                                          # time metadata (`time` itself is kept)
    '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',   # datascope
    '2_variable_code', '2_variable_label', '2_variable_attribute_label',                                # nationality
    '3_variable_code', '3_variable_label', '3_variable_attribute_label',                                # gender
    '4_variable_code', '4_variable_label',                                                              # subject
    'value_unit', 'value_variable_code', 'value_variable_label',                                        # value metadata
]


Below we identify the columns that should be kept and create a dictionary for a more transparent naming scheme.

In [8]:
stud.time.unique()      # checking unique values in `time`

array(['2018-10P6M', '2020-10P6M', '2019-10P6M', '2022-10P6M',
       '2023-10P6M', '2021-10P6M'], dtype=object)


`time` identifies the semester of record and can be mapped to a plain year for simplicity. '2018-10P6M' presumably denotes the six-month period starting in October 2018, i.e. the winter semester 2018/19 (listed as "WS 2018/19" in the non-flat csv). This could be mapped to the integer 2018 (or to '2018/19', but the starting year should be a sufficient identifier). The values are effectively categorical, but treating them as integers is more memory efficient and fine for sorting.
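The mapping in both directions could look roughly like this. The helper names `start_year` and `semester_label` are hypothetical, and the sketch assumes that every code has the form `YYYY-10P6M`; anything else raises an error instead of being converted silently.

```python
import re
import pandas as pd

def start_year(time_code: str) -> int:
    """Extract the starting year from a code such as '2018-10P6M'."""
    match = re.fullmatch(r'(\d{4})-10P6M', time_code)
    if match is None:
        raise ValueError(f'unexpected time code: {time_code!r}')
    return int(match.group(1))

def semester_label(year: int) -> str:
    """Render a starting year as a label such as 'WS 2018/19'."""
    return f'WS {year}/{(year + 1) % 100:02d}'

codes = pd.Series(['2018-10P6M', '2023-10P6M'])
years = codes.map(start_year)        # integer starting years
labels = years.map(semester_label)   # readable labels for plots and tables
```

Keeping the integer in the dataframe and generating the label only for display avoids storing the same information twice.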



In [9]:
colname_remap = {
    'time': 'acyear',
    '2_variable_attribute_code': 'nationality',
    '3_variable_attribute_code': 'gender',
    '4_variable_attribute_code': 'subj_code',
    '4_variable_attribute_label': 'subj_name',
    'value': 'stud_count'
    }      # variable for remapping the column names

In [10]:
df = stud.drop(remove_cols,axis='columns')
df = df.rename(columns=colname_remap)

In [11]:
df.head()

Unnamed: 0,acyear,nationality,gender,subj_code,subj_name,stud_count
0,2018-10P6M,NATD,GESM,SF241,Kerntechnik/Kernverfahrenstechn.(ab 2020 zu SF...,3
1,2020-10P6M,,,SF142,Schiffbau/Schiffstechnik,928
2,2019-10P6M,,GESW,SF370,Wirtschaftsingenieurw. m. ingenieurwiss.Schwer...,15987
3,2022-10P6M,NATD,GESW,SF220,Milch- und Molkereiwirtschaft,21
4,2020-10P6M,,GESW,SF213,Versorgungstechnik,514


Now, we need to make sure `stud_count` is an integer and simplify `acyear`.

In [12]:
# check non-digit values for `stud_count`
df.loc[df.stud_count.str.isdigit() == False,'stud_count'].unique()

array(['-'], dtype=object)

In [13]:
df.loc[df.stud_count == '0']

Unnamed: 0,acyear,nationality,gender,subj_code,subj_name,stud_count


No row contains the string '0', and the only non-digit value is '-', so zero counts are evidently represented by '-'. We therefore replace all instances of '-' by 0.

In [14]:
# replace all cells in stud_count that consist of '-' by '0' (whole-value match, not substring)
df.stud_count = df.stud_count.replace('-','0')
df.loc[df.stud_count == '0']

Unnamed: 0,acyear,nationality,gender,subj_code,subj_name,stud_count
12,2018-10P6M,NATA,,SF262,Bibliothekswesen,0
73,2018-10P6M,NATA,GESM,SF087,Körperbehindertenpädagogik (ab 2016 zu SF190),0
87,2018-10P6M,NATA,GESW,SF429,Stahlbau,0
111,2019-10P6M,NATA,,SF063,Geistigbeh./Prakt.-Bildb.-Pädag.(ab 2016 zu SF...,0
139,2018-10P6M,NATA,GESM,SF180,Kaukasistik,0
...,...,...,...,...,...,...
15742,2021-10P6M,NATA,GESM,SF087,Körperbehindertenpädagogik (ab 2016 zu SF190),0
15773,2022-10P6M,NATD,GESM,SF196,Studienkolleg,0
15804,2020-10P6M,NATD,GESM,SF041,Sonstiges Orientierungsstudium,0
15813,2023-10P6M,NATA,GESW,SF061,Meliorationswesen,0


In [15]:
# now we can cast as type int
df.stud_count = df.stud_count.astype(int)
df.dtypes

acyear         object
nationality    object
gender         object
subj_code      object
subj_name      object
stud_count      int64
dtype: object

In [16]:
df.head()

Unnamed: 0,acyear,nationality,gender,subj_code,subj_name,stud_count
0,2018-10P6M,NATD,GESM,SF241,Kerntechnik/Kernverfahrenstechn.(ab 2020 zu SF...,3
1,2020-10P6M,,,SF142,Schiffbau/Schiffstechnik,928
2,2019-10P6M,,GESW,SF370,Wirtschaftsingenieurw. m. ingenieurwiss.Schwer...,15987
3,2022-10P6M,NATD,GESW,SF220,Milch- und Molkereiwirtschaft,21
4,2020-10P6M,,GESW,SF213,Versorgungstechnik,514


To take care of the year, we can just split at the hyphen to keep only the year and then cast as int as well.

In [17]:
df.acyear = df.acyear.str.split('-').str[0].astype(int)


In [18]:
df.dtypes

acyear          int64
nationality    object
gender         object
subj_code      object
subj_name      object
stud_count      int64
dtype: object

Now all datatypes should be fine. Let's check that we can indeed remove the rows with NaN for `gender` or `nationality`. These should correspond to the totals, which we can easily reconstruct from the remaining rows.

In [19]:
df.loc[(df.subj_code =='SF142') & (df.acyear == 2018)].sort_values(['nationality','gender'])

Unnamed: 0,acyear,nationality,gender,subj_code,subj_name,stud_count
2326,2018,NATA,GESM,SF142,Schiffbau/Schiffstechnik,201
12783,2018,NATA,GESW,SF142,Schiffbau/Schiffstechnik,46
8180,2018,NATA,,SF142,Schiffbau/Schiffstechnik,247
896,2018,NATD,GESM,SF142,Schiffbau/Schiffstechnik,711
1937,2018,NATD,GESW,SF142,Schiffbau/Schiffstechnik,130
2738,2018,NATD,,SF142,Schiffbau/Schiffstechnik,841
8072,2018,,GESM,SF142,Schiffbau/Schiffstechnik,912
6463,2018,,GESW,SF142,Schiffbau/Schiffstechnik,176
13395,2018,,,SF142,Schiffbau/Schiffstechnik,1088


If this reading is right, the sum of `stud_count` over all rows where neither `gender` nor `nationality` is NaN should be 1088, the value in the overall total row.

In [20]:
df.loc[(df.subj_code == 'SF142') & (df.acyear == 2018) & df.nationality.notna() & df.gender.notna()].stud_count.sum()

np.int64(1088)

This checks out, so we can (and should) indeed remove all rows with NaN in either of those two columns. 
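The check above covers a single subject and year. Before dropping the rows, the same comparison could be run for every subject and year at once, roughly as sketched below. The sample reuses the nine SF142 rows shown above; `check` and `mismatches` are illustrative names, and the sketch assumes exactly one overall total row per subject and year.

```python
import pandas as pd

# The nine rows for SF142 in 2018, as displayed above.
sample = pd.DataFrame({
    'acyear': [2018] * 9,
    'subj_code': ['SF142'] * 9,
    'nationality': ['NATA', 'NATA', 'NATA', 'NATD', 'NATD', 'NATD', None, None, None],
    'gender': ['GESM', 'GESW', None, 'GESM', 'GESW', None, 'GESM', 'GESW', None],
    'stud_count': [201, 46, 247, 711, 130, 841, 912, 176, 1088],
})
keys = ['acyear', 'subj_code']

# Sum of the detail rows (both nationality and gender present) per group.
detail = (sample.dropna(subset=['nationality', 'gender'])
                .groupby(keys)['stud_count'].sum())

# Reported overall total per group (both nationality and gender missing).
is_total = sample['nationality'].isna() & sample['gender'].isna()
total = sample[is_total].set_index(keys)['stud_count']

# Align both on (acyear, subj_code) and keep the groups that disagree.
check = pd.concat([detail.rename('detail_sum'),
                   total.rename('reported_total')], axis=1)
mismatches = check[check['detail_sum'] != check['reported_total']]
```

An empty `mismatches` frame on the full dataset would confirm that no information is lost by dropping the totals.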

In [21]:
df.isna().sum()

acyear            0
nationality    5274
gender         5274
subj_code         0
subj_name         0
stud_count        0
dtype: int64

In [22]:
df = df.dropna(subset=['gender','nationality'])

In [23]:
df.isna().sum()

acyear         0
nationality    0
gender         0
subj_code      0
subj_name      0
stud_count     0
dtype: int64

In [24]:
df.head()

Unnamed: 0,acyear,nationality,gender,subj_code,subj_name,stud_count
0,2018,NATD,GESM,SF241,Kerntechnik/Kernverfahrenstechn.(ab 2020 zu SF...,3
3,2022,NATD,GESW,SF220,Milch- und Molkereiwirtschaft,21
8,2022,NATA,GESW,SF280,Kartografie,44
16,2023,NATA,GESM,SF086,"Katholische Theologie, - Religionslehre",407
18,2023,NATA,GESW,SF272,Alte Geschichte,18


Great, no NaN values left! The dataset should be usable now (barring further extension for subject clustering). Let's reset the index and save the cleaned-up version for easier access.

In [27]:
df.reset_index(drop=True,inplace=True)
# index=False avoids writing the index as an extra unnamed column
df.to_csv('./datasets/GER_students_2018_2023_cleaned.csv', index=False)
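A quick round trip can confirm that the saved file reloads with the same columns and values. The sketch below writes to an in-memory buffer instead of the actual path and uses two rows taken from the cleaned frame above; `cleaned` and `reloaded` are illustrative names. It writes without the index, on the assumption that the default `RangeIndex` carries no information.

```python
import io
import pandas as pd

# Two rows from the cleaned dataframe shown above.
cleaned = pd.DataFrame({
    'acyear': [2022, 2022],
    'nationality': ['NATD', 'NATA'],
    'gender': ['GESW', 'GESW'],
    'subj_code': ['SF220', 'SF280'],
    'subj_name': ['Milch- und Molkereiwirtschaft', 'Kartografie'],
    'stud_count': [21, 44],
})

# Write to an in-memory buffer (stand-in for the csv file) and read it back.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

If the index is written as well, the reloaded frame gains an extra `Unnamed: 0` column that has to be dropped or passed to `index_col`.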

# Further EDA on cleaned dataset

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   acyear       7032 non-null   int64 
 1   nationality  7032 non-null   object
 2   gender       7032 non-null   object
 3   subj_code    7032 non-null   object
 4   subj_name    7032 non-null   object
 5   stud_count   7032 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 329.8+ KB


In [29]:
df.describe()

Unnamed: 0,acyear,stud_count
count,7032.0,7032.0
mean,2020.5,2479.224261
std,1.707947,8134.877221
min,2018.0,0.0
25%,2019.0,44.0
50%,2020.5,286.0
75%,2022.0,1245.75
max,2023.0,114157.0
