# Introduction

## Dataset

https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0003

Number of students in Germany by subject of study, nationality, and gender, 2018-2024. I will focus on the most recent academic year, 2023/24, for a start. (Will I?)

Check this document for official practice of subject codes and subject cluster classifications: https://www.destatis.de/DE/Methoden/Klassifikationen/Bildung/studenten-pruefungsstatistik.pst_all?__blob=publicationFile&v=12

## Questions 

1. What was the total number of students in Germany in 2023/24?
2. What were the 10 subjects with the highest number of students in 2023/24?
3. How was the gender distribution of students in 2023/24?
4. How was the gender distribution in the 5 most studied subjects?
5. What were the top subjects by gender?
6. Which were the top 5 subjects studied by non-citizens? 
7. (Sort the subjects into clusters and provide a cluster identifier to the dataframe.) How are the student numbers distributed across subject clusters?
8. How does the number of students change over the time period by cluster?
9. How does the number of students change over the time period for language related subjects?
10. How does the number of students change for linguistics in a narrow perspective?





# Setup

## Load libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Load dataset(s)

In [2]:
# fields of study, all Germany 2018-2024
# using the flat csv
# https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0003
stud = pd.read_csv('./datasets/raw/21311-0003_de_flat_allstudents.csv',sep=';')

# for alternative table with data per state see:
# https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0006

In [3]:
# additional datasets
# freshers by course of study
# https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0012
freshers = pd.read_csv('./datasets/raw/21311-0012_de_flat_freshers.csv',sep=';')

# employees at higher education institutions by subject group
# https://www-genesis.destatis.de/datenbank/online/statistic/21341/table/21341-0002
pers = pd.read_csv('./datasets/raw/21341-0002_de_flat_personnel.csv',sep=';')


# professors by subject group
# https://www-genesis.destatis.de/datenbank/online/statistic/21341/table/21341-0003
prof = pd.read_csv('./datasets/raw/21341-0003_de_flat_profs.csv',sep=';')


# exams by subject
# https://www-genesis.destatis.de/datenbank/online/statistic/21321/table/21321-0003
exams = pd.read_csv('./datasets/raw/21321-0003_de_flat_exams.csv',sep=';')



## Creating a list of relations for subjects, subject clusters and subject groups

Prompt given to Claude for extracting the data from pst_all (the request for CSV output came from an earlier context). Initially, I tried ChatGPT, but it struggled with the pst_all format (presumably the three columns) and hit its limit while processing.

> **Prompt**:
>
> The uploaded pst_all contains three columns per page in the main text body. The document is in German. Do not change any of the subject titles, but drop any footnote markers. Extract the following data into a csv-file:
> - "Fächergruppe" into the columns 'subj_group_code' for the numerical code and 'subj_group_name' for the name (e.g. Geisteswissenschaften)
> - "Studienbereich" (underlined in the pst_all) into columns 'subj_cluster_code' and 'subj_cluster_name' respectively
> - "Studienfach" into columns 'subj_code' and 'subj_name'
>
> For each of the three levels, also provide a column '..._name_eng' with an approximate English translation for the respective German '..._name'. The translation need not be literal, but can consider more common English terminology as long as the content is coherent. Make sure to use the same translation for any recurring term.


**Note for future reference: I had originally suspected that a dictionary/JSON would be the simplest representation of the data (albeit requiring some post-processing to generate a table later), but ChatGPT had suggested generating a csv in an earlier chat. However, it may have been considering how easily I could use the data rather than its own computational efficiency.**

(Note that I was not able to use this prompt in the same chat that generated the csv-file due to length restrictions on the chat. So I am still not quite sure if repeating values may have been stored because this answer is a bit more general.)

> **Prompt**
> 
> I just had you create a csv file with slightly under 300 rows of data extracted from a pst_all involving an underlying nested structure of subject groups and their codes, subject clusters the groups consist of including their numerical code and subjects including their code. Since the pst_all was in German, I additionally asked for an additional column in the csv file with an English translation for each name column of the three structural levels. For the higher level values, there was naturally a lot of repetition due to the large number of subjects. For future reference, in such contexts, do the repeating names and translations have to be recomputed for every line of the csv? Would a different output formatting have been less computationally expensive for you (i.e. using less energy and/or tokens), e.g. as a python dictionary?

> **Claude reply**:
> 
> When working with hierarchical data like you described, you're right to consider computational efficiency, especially for a large dataset with repetitive elements.
> 
> For your specific case with grouped subject data in German/English:
> 
> CSV format is indeed less efficient when dealing with hierarchical data with repetition. Each time a higher-level group value appears in multiple rows, I needed to:
> 1. Store that repeated text in each row
> 2. Translate the same text multiple times (or recall the previous translation)
> 
> A more efficient approach would have been:
> 
> **Nested data structures** like a Python dictionary or JSON would better preserve the hierarchy and eliminate repetition:
> 
> ```
> {
>   "group_code1": {
>     "name_de": "Gruppe Name",
>     "name_en": "Group Name",
>     "clusters": {
>       "cluster_code1": {
>         "name_de": "Cluster Name",
>         "name_en": "Cluster Name",
>         "subjects": [
>           {"code": "subj1", "name_de": "Fach1", "name_en": "Subject1"},
>           {"code": "subj2", "name_de": "Fach2", "name_en": "Subject2"}
>         ]
>       }
>     }
>   }
> }
> ```
> 
> This structure would:
> - Store each name/translation only once
> - Preserve the hierarchy naturally
> - Require significantly fewer tokens to represent
> - Be more computationally efficient to generate
> - Use less memory/energy overall
> 
> If you need to work with this data in the future, requesting structured formats that match the natural hierarchy of your data will be both more efficient for AI processing and likely more useful for your subsequent data handling.


So for the future, generating a JSON or dictionary for this type of data and then doing the post-processing locally might be the more computationally efficient and responsible option.
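The local post-processing step could look roughly like the sketch below. It assumes a nested dictionary shaped like the one in the reply above; the codes and names are placeholders rather than real classification entries, and the output column names follow the ones requested in the extraction prompt.

```python
import pandas as pd

# Hypothetical nested structure (placeholder codes and names, not real entries).
nested = {
    "G1": {
        "name_de": "Gruppe Name",
        "name_en": "Group Name",
        "clusters": {
            "C1": {
                "name_de": "Cluster Name",
                "name_en": "Cluster Name",
                "subjects": [
                    {"code": "S1", "name_de": "Fach1", "name_en": "Subject1"},
                    {"code": "S2", "name_de": "Fach2", "name_en": "Subject2"},
                ],
            }
        },
    }
}

# Walk the three levels and emit one flat row per subject.
rows = []
for group_code, group in nested.items():
    for cluster_code, cluster in group["clusters"].items():
        for subject in cluster["subjects"]:
            rows.append({
                "subj_group_code": group_code,
                "subj_group_name": group["name_de"],
                "subj_group_name_eng": group["name_en"],
                "subj_cluster_code": cluster_code,
                "subj_cluster_name": cluster["name_de"],
                "subj_cluster_name_eng": cluster["name_en"],
                "subj_code": subject["code"],
                "subj_name": subject["name_de"],
                "subj_name_eng": subject["name_en"],
            })

flat = pd.DataFrame(rows)
print(flat)
```

The repetition of group and cluster names then happens locally, where it costs next to nothing.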

In [4]:
# loading subject code lookup table
subjectcodes = pd.read_csv('./datasets/german_subject_classification.csv')

# Preprocessing

## Student data total exploration

*This subsection outlines my original data exploration to understand the source data format. The cells are commented out, but kept for documentation. In the subsection [Create Cleaning function](#create-cleaning-function) further below, I define a function for automated data cleaning and reformatting before combining some of the source tables as appropriate to create the working data frames.*

In [5]:
# not used, kept for documentation
#stud_nofl = pd.read_csv('./datasets/21311-0003_de_2018-2024_GER.csv',sep=';')
#stud_nofl.head(10)

The non-flat csv seems to be aimed at presentation in spreadsheet editors. Reformatting might be possible, but potentially complex. Let's use the flat csv instead.

In [6]:
stud.head(10).T

The data structure of the flat csv is also rather complex, which probably makes sense for standardisation purposes at the Statistisches Bundesamt, but for present purposes it makes sense to create a more transparent dataframe. The advantage over the non-flat csv is that every line seems to correspond cleanly to a datapoint. Labels are intermingled, so some cleanup will be required.

The value of `4_variable_attribute_label` also hints at a problem to be encountered later, namely shifts in the subject code allocation. That course of study seems to have been assigned to a new code in 2020; there are likely to be other datapoints with this issue, to be checked later.
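One way to look for such reallocations later is to list the subject codes that do not occur in every year of the period. The sketch below uses invented codes and assumes the columns have already been reduced to hypothetical `subj_code` and `year` columns; a code missing from some years is only a hint, since a subject can also be newly introduced or discontinued.

```python
import pandas as pd

# Toy data with invented codes: SF001 disappears after 2019, SF900 starts in 2020.
toy = pd.DataFrame({
    "subj_code": ["SF001", "SF001", "SF900", "SF002", "SF002", "SF002"],
    "year": [2018, 2019, 2020, 2018, 2019, 2020],
})

n_years = toy["year"].nunique()

# Number of distinct years in which each code occurs.
years_per_code = toy.groupby("subj_code")["year"].nunique()

# Codes that are not present throughout the whole period.
partial_codes = years_per_code[years_per_code < n_years].index.tolist()

# First and last year per code, to see where a code starts or ends.
code_span = toy.groupby("subj_code")["year"].agg(["min", "max"])

print(partial_codes)
print(code_span.loc[partial_codes])
```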

In [7]:
stud.columns

### Variables to keep and rename


**Rough overview of columns**

```python
allcols = ['statistics_code', 'statistics_label',                                                                   # identifier of statistic
            'time_code', 'time_label', 'time',                                                                      # time label 
            '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',       # datascope
            '2_variable_code', '2_variable_label', '2_variable_attribute_code', '2_variable_attribute_label',       # nationality
            '3_variable_code', '3_variable_label', '3_variable_attribute_code', '3_variable_attribute_label',       # gender
            '4_variable_code', '4_variable_label', '4_variable_attribute_code', '4_variable_attribute_label',       # subject
            'value', 'value_unit', 'value_variable_code', 'value_variable_label']                                   # value = number of students
```

In [8]:
# checking that variable 1 only has one distinct value, signifying that the scope of the data is all of Germany
# stud['1_variable_code'].unique()

The identifiers for the statistic can be dropped, as can the code and label for time. Since we only include data for all of Germany for now, all `1_variable` columns can also be removed.
The general strategy for the next two variables is to only keep the `[23]_variable_attribute_code`s. They contain NaN for the total values, which will allow relatively easy filtering later to remove these "totals" lines to avoid double counting issues.
For variable 4, we keep `variable_attribute_code` and `variable_attribute_label`. The former may be helpful for clustering subjects later on, the latter is more transparent.


In [9]:
# remove_cols = ['statistics_code', 'statistics_label', 
#             'time_code', 'time_label',
#             '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',       # datascope
#             '2_variable_code', '2_variable_label', '2_variable_attribute_label',       # 
#             '3_variable_code', '3_variable_label', '3_variable_attribute_label',
#             '4_variable_code', '4_variable_label', 
#             'value_unit', 'value_variable_code', 'value_variable_label']


Below we identify the columns that should be kept and create a dictionary for a more transparent naming scheme.

In [10]:
# stud.time.unique()      # checking unique values in `time`


`time` corresponds to the year of record. Can be mapped to plain year for simplicity. Currently, '2018-10P6M' presumably indicates the academic year 2018/2019, which began in October 2018. This could be mapped to the integer 2018 (or '2018/19', but the year of start should be a sufficient identifier). These are effectively categorical variables, but treating them as integer is more memory efficient and fine for sorting.
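Both mappings mentioned here can be sketched as follows. Only the value '2018-10P6M' is taken from the data; the other sample values are assumed to follow the same pattern.

```python
import pandas as pd

# Sample values; only the first one is confirmed, the rest assume the same format.
time_raw = pd.Series(["2018-10P6M", "2019-10P6M", "2023-10P6M"])

# Starting year of the academic year as an integer.
start_year = time_raw.str.extract(r"^(\d{4})", expand=False).astype(int)

# Optional human-readable label such as '2018/19'.
acyear_label = (
    start_year.astype(str)
    + "/"
    + ((start_year + 1) % 100).astype(str).str.zfill(2)
)

print(start_year.tolist())
print(acyear_label.tolist())
```

The integer is enough for sorting and filtering; the label could be derived on demand for plot axes.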



In [11]:
# colname_remap = {
#     'time': 'acyear',
#     '2_variable_attribute_code': 'nationality',
#     '3_variable_attribute_code': 'gender',
#     '4_variable_attribute_code': 'subj_code',
#     '4_variable_attribute_label': 'subj_name',
#     'value': 'stud_count'
#     }      # dict for remapping the column names
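Since the keys of the renaming dictionary are exactly the columns to keep, selecting and renaming could also be done in one step without a separate list of columns to drop. This is a sketch on a toy frame with invented values and only a subset of the real columns, not the approach used in the rest of this notebook.

```python
import pandas as pd

# Toy frame with a few of the source column names; all values are invented.
toy = pd.DataFrame({
    "statistics_code": ["21311", "21311"],
    "time": ["2018-10P6M", "2019-10P6M"],
    "2_variable_attribute_code": ["NATD", "NATA"],
    "3_variable_attribute_code": ["GESM", "GESW"],
    "4_variable_attribute_code": ["SF001", "SF002"],
    "4_variable_attribute_label": ["Fach A", "Fach B"],
    "value": ["10", "-"],
    "value_unit": ["Anzahl", "Anzahl"],
})

colname_remap = {
    "time": "acyear",
    "2_variable_attribute_code": "nationality",
    "3_variable_attribute_code": "gender",
    "4_variable_attribute_code": "subj_code",
    "4_variable_attribute_label": "subj_name",
    "value": "stud_count",
}

# Keep only the columns named in the dict, then rename them.
slim = toy[list(colname_remap)].rename(columns=colname_remap)
print(slim.columns.tolist())
```

A missing source column raises a `KeyError` here, which doubles as a check that the table has the expected layout.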

In [12]:
# st_all = stud.drop(remove_cols,axis='columns')
# st_all = st_all.rename(columns=colname_remap)

In [13]:
# st_all.head()

Now, we need to make sure `stud_count` is an integer and simplify `acyear`.

In [14]:
# check non-digit values for `stud_count`
# st_all.loc[st_all.stud_count.str.isdigit() == False,'stud_count'].unique()

In [15]:
# st_all.loc[st_all.stud_count == '0']

Currently, zero values are represented by '-', so we replace all instances of '-' by 0.

In [16]:
# replace all instances of '-' in stud_count by 0
# st_all.stud_count = st_all.stud_count.str.replace('-','0')
# st_all.loc[st_all.stud_count == '0']
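A slightly stricter variant, as a sketch with invented sample values: `Series.str.replace('-', '0')` substitutes every hyphen it finds inside a string, whereas `Series.replace` with a scalar only matches cells whose entire value is '-'. That makes no difference for the values seen so far, but it is a little safer if another table ever contains hyphens elsewhere.

```python
import pandas as pd

# Invented sample of raw count strings, with '-' standing for zero.
raw = pd.Series(["1088", "-", "25", "-"])

# List the non-numeric placeholders before replacing anything.
placeholders = raw[~raw.str.isdigit()].unique().tolist()

# Replace only cells that consist of exactly '-', then cast to int.
counts = raw.replace("-", "0").astype(int)

print(placeholders)
print(counts.tolist())
```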

In [17]:
# now we can cast as type int
# st_all.stud_count = st_all.stud_count.astype(int)
# st_all.dtypes

In [18]:
# st_all.head()

To take care of the year, we can just split at the hyphen to keep only the year and then cast as int as well.

In [19]:
# st_all.acyear = st_all.acyear.str.split('-').str[0].astype(int)


In [20]:
# st_all.dtypes

### Removing totals rows

*This can actually be achieved by changing selections at dataset download. I leave the code below for documentation, but comment it out.*

Now all datatypes should be fine, let's check that we can indeed remove the rows with NULL for `gender` or `nationality`. These should correspond to the totals, which we can easily reconstruct.

In [21]:
#st_all.loc[(st_all.subj_code =='SF142') & (st_all.acyear == 2018)].sort_values(['nationality','gender'])

If the concept is right, the sum of all stud_counts where neither `gender` nor `nationality` is na should be 1088.

In [22]:
#st_all.loc[(st_all.subj_code =='SF142') & (st_all.acyear == 2018) & (st_all.nationality.isna() == False) & (st_all.gender.isna() == False)].stud_count.sum()

This checks out, so we can (and should) indeed remove all rows with NaN in either of those two columns. 
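The same consistency check can be written so that it compares the reported grand total with the sum of the detail rows directly. The sketch below uses invented numbers and invented nationality and gender codes; rows with NaN in either column play the role of the totals.

```python
import numpy as np
import pandas as pd

# Toy rows for one subject and year; codes and counts are invented.
toy = pd.DataFrame({
    "nationality": ["NATD", "NATD", "NATA", "NATA", "NATD", "NATA", np.nan, np.nan, np.nan],
    "gender": ["GESM", "GESW", "GESM", "GESW", np.nan, np.nan, "GESM", "GESW", np.nan],
    "stud_count": [40, 50, 8, 12, 90, 20, 48, 62, 110],
})

# Grand total: the row where both attributes are missing.
is_grand_total = toy["nationality"].isna() & toy["gender"].isna()
grand_total = toy.loc[is_grand_total, "stud_count"].sum()

# Detail rows: both attributes present.
detail = toy.dropna(subset=["nationality", "gender"])
detail_sum = detail["stud_count"].sum()

print(grand_total, detail_sum)
```

If the two numbers agree for every subject and year, dropping the totals rows loses no information.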

In [23]:
#st_all.isna().sum()

In [24]:
#st_all = st_all.dropna(subset=['gender','nationality'])

In [25]:
#st_all.isna().sum()

In [26]:
#st_all.head()

### Saving

Great, no na values left! The dataset should be usable now (barring further extension for subject clustering). Let's reset the index and save the cleaned up version for easier access.

In [27]:
#st_all.reset_index(drop=True,inplace=True)
#st_all.to_csv('./datasets/GER_2018_2023_students_all_cleaned.csv')

## Create cleaning function

Based on the findings above, I create a function to encapsulate cleaning this particular data format, since I have at least one other student dataset. I leave the code above intact for documentation purposes.

Different tables from the Statistisches Bundesamt might require further adaptations of the function.

In [None]:
# cleaning and saving function
def clean_dat(in_df, dropcols, rencols, csvname='',complexdate=False,insgesamt=False):
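    '''Function to streamline cleaning student data from Statistisches Bundesamt
    
    in_df: the input dataframe
    dropcols: list of column names to drop
    rencols: dictionary of column name changes
    csvname: string for the path and name of the cleaned csv file (nothing is saved if empty)
    complexdate: if True, `time` has the form '2018-10P6M' and is reduced to the starting year
    insgesamt: if True, drop the "Insgesamt"/total rows (NaN in `subj_grp_code`)
    '''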
    df = in_df.copy()
    
    
    # replace all instances of '-' in stud_count by '0' and cast as int
    # doing this before renaming to slightly increase chances of portability of code
    # (still assumes that column 'value' exists and works the same)
    if 'value' in df.columns:
        df.value = df.value.str.replace('-','0').astype(int)
    else:
        print('Column `value` not found in input dataframe')
    
    # transform `time` to a plain year integer, corresponding to starting date of academic year
    if 'time' in df.columns:
        if complexdate:
            df.time = df.time.str.split('-').str[0].astype(int)
        else:                                                       # if complexdate flag is not set, simply cast `time` to int
            df.time = df.time.astype(int)
    else:
        print('Column `time` not found in input dataframe')

    # removing columns
    df = df.drop(dropcols,axis='columns')
    
    # renaming columns
    df = df.rename(columns=rencols)
    
    # special check for tables with na values due to datarows with na in subj_grp_code (the "Insgesamt"/total rows we don't need)
    if insgesamt:
        df.dropna(subset=['subj_grp_code'],inplace=True)

    
    # reset index
    df.reset_index(drop=True,inplace=True)
    # optionally save csv file if path is provided
    if csvname:                                          
        df.to_csv(csvname)
    
    return df
    
    


## Merging and cleaning student data

Combining the datasets for students by subject overall and freshers.
Then apply the cleaning function.

First, create identifying columns for the stud and freshers dataframes. (One could use the `value_variable_label` or `code`, but the values are a bit unwieldy, so I'll just create new columns and continue to get rid of these.)

In [29]:
stud['subgroup'] = 'stud_all'
freshers['subgroup'] = 'stud_y1'

In [30]:
# list and dictionary for data cleaning
stud_drop = ['statistics_code', 'statistics_label', 
            'time_code', 'time_label',
            '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',   
            '2_variable_code', '2_variable_label', '2_variable_attribute_label',     
            '3_variable_code', '3_variable_label', '3_variable_attribute_label',
            '4_variable_code', '4_variable_label', 
            'value_unit', 'value_variable_code', 'value_variable_label']

stud_remap = {
    'time': 'year',
    '2_variable_attribute_code': 'nationality',
    '3_variable_attribute_code': 'gender',
    '4_variable_attribute_code': 'subj_code',
    '4_variable_attribute_label': 'subj_name',
    'value': 'number'
    }      # dict for remapping the column names

In [31]:
dfs = clean_dat(pd.concat([stud,freshers]),stud_drop,stud_remap,'./datasets/clean_stud_GER_2018_2023.csv',complexdate=True)

In [32]:
dfs.head()

In [33]:
dfs.isna().sum()

Great, concatenation and cleaning seem to have worked as intended.

Student dataset is clean in `dfs`.

## Exam data

In [34]:
exams.head().T

In [35]:
ex_drop = ['statistics_code', 'statistics_label', 
            'time_code', 'time_label',
            '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',   
            '2_variable_code', '2_variable_label', '2_variable_attribute_label',     
            '3_variable_code', '3_variable_label', '3_variable_attribute_label',
            '4_variable_code', '4_variable_label', '4_variable_attribute_label',
            '5_variable_code', '5_variable_label',
            'value_unit', 'value_variable_code', 'value_variable_label']

ex_remap = {
    'time': 'year',
    '2_variable_attribute_code': 'nationality',
    '3_variable_attribute_code': 'gender',
    '4_variable_attribute_code': 'passed',
    '5_variable_attribute_code': 'subj_code',
    '5_variable_attribute_label': 'subj_name',
    'value': 'number'
    } 

In [36]:
dfex = clean_dat(exams,ex_drop,ex_remap,'./datasets/clean_exams_GER_2018_2023.csv')

In [37]:
dfex.isna().sum()

In [38]:
dfex.head()

Exam data is student-related, and since people typically have to be registered students to take exams, the examinees are probably a subset of the `stud_all` subgroup of the `dfs` (students) dataset. However, I'll keep the dataset separate for now, not least because of the additional `passed` property.

Exam data is clean in dataframe `dfex`.

## Combining personnel data

Similar to the student datasets above, the general personnel data and the professorial counts are in a subset relation, so we'll combine them with labels.

In [39]:
pers['subgroup'] = 'p_all'
prof['subgroup'] = 'p_prof'

In [40]:
# inspecting personnel data structure to identify column structure
pers.head().T

In [41]:
pers.columns

In [42]:
# list and dict for column dropping and renaming
pers_drop = [
    'statistics_code', 'statistics_label', 'time_code', 'time_label',
    '1_variable_code', '1_variable_label',
       '1_variable_attribute_code', '1_variable_attribute_label',
       '2_variable_code', '2_variable_label', '3_variable_code', '3_variable_label',
        '3_variable_attribute_label',
       'value_unit', 'value_variable_code', 'value_variable_label'
]

pers_rename ={
    'time': 'year',
    '2_variable_attribute_label': 'subj_grp_name',
    '2_variable_attribute_code': 'subj_grp_code',
    '3_variable_attribute_code': 'gender',
    'value': 'number'
}



In [47]:
dfp = clean_dat(pd.concat([pers,prof]),pers_drop,pers_rename,'./datasets/clean_personnel_GER_2018_2023.csv',insgesamt=True)

In [None]:
# # keeping this code here for documentation, cleaning implemented in function `clean_dat` now

# display(dfp.isna().sum())
# display(dfp.loc[dfp.subj_grp_code.isna()])
# display(dfp.loc[dfp.subj_grp_name=='Insgesamt'])

# # The problem is fundamentally the same in both `subj_grp_code` and `subj_grp_name` columns, we don't need these "Total" rows, so drop them.  
# dfp.dropna(subset=['subj_grp_code'],inplace=True)
# dfp.isna().sum()

In [51]:
dfp.info()

In [52]:
pers.info()

Personnel data is clean in dataframe `dfp`.

In [None]:
dfp.loc[dfp.subgroup == 'p_prof','subj_grp_code'].unique()

# Further EDA and processing on cleaned datasets

In [None]:
dfs.describe()
