# Introduction

## Dataset

https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0003

Number of students in Germany by subject of study, nationality, gender from 2018-2024. I will focus on the final year 2023/24 for a start. (Will I?)

Check this document for official practice of subject codes and subject cluster classifications: https://www.destatis.de/DE/Methoden/Klassifikationen/Bildung/studenten-pruefungsstatistik.pst_all?__blob=publicationFile&v=12

## Questions 

1. What was the total number of students in Germany in 2023/24?
2. What were the 10 subjects with the highest number of students in 2023/24?
3. How was the gender distribution of students in 2023/24?
4. How was the gender distribution in the 5 most studied subjects?
5. What were the top subjects by gender?
6. Which were the top 5 subjects studied by non-citizens? 
7. (Sort the subjects into clusters and provide a cluster identifier to the dataframe.) How are the student numbers distributed across subject clusters?
8. How does the number of students change over the time period by cluster?
9. How does the number of students change over the time period for language related subjects?
10. How does the number of students change for linguistics in a narrow perspective?





# Setup

## Load libraries

In [34]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Load dataset(s)

In [35]:
# fields of study, all Germany 2018-2024
# using the flat csv
# https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0003
stud = pd.read_csv('./datasets/raw/21311-0003_de_flat_allstudents.csv',sep=';')

# for alternative table with data per state see:
# https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0006

In [36]:
# additional datasets
# freshers by course of study
# https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0012
freshers = pd.read_csv('./datasets/raw/21311-0012_de_flat_freshers.csv',sep=';')

# employees at higher education institutions by subject group
# https://www-genesis.destatis.de/datenbank/online/statistic/21341/table/21341-0002
pers = pd.read_csv('./datasets/raw/21341-0002_de_flat_personnel.csv',sep=';')


# professors by subject group
# https://www-genesis.destatis.de/datenbank/online/statistic/21341/table/21341-0003
prof = pd.read_csv('./datasets/raw/21341-0003_de_flat_profs.csv',sep=';')


# exams by subject
# https://www-genesis.destatis.de/datenbank/online/statistic/21321/table/21321-0003
exams = pd.read_csv('./datasets/raw/21321-0003_de_flat_exams.csv',sep=';')



## Creating a list of relations for subjects, subject clusters and subject groups

Frustratingly, there are two distinct systems for coding subjects, subject clusters and subject fields for a) subjects of study and b) personnel.

### Taxonomy/lookup table for subject codes

Prompt given to Claude for data extraction from pst_all (the csv was requested in a previous context). Initially, I tried ChatGPT, but it struggled with the pst_all format (presumably the three columns) and hit its limit while processing.

> **Prompt**:
>
> The uploaded pdf contains three columns per page in the main text body. The document is in German. Do not change any of the subject titles, but drop any footnote markers. Extract the following data into a csv-file:
> - "Fächergruppe" into the columns 'subj_group_code' for the numerical code and 'subj_group_name' for the name (e.g. Geisteswissenschaften)
> - "Studienbereich" (underlined in the pst_all) into columns 'subj_cluster_code' and 'subj_cluster_name' respectively
> - "Studienfach" into columns 'subj_code' and 'subj_name'
>
> For each of the three levels, also provide a column '..._name_eng' with an approximate English translation for the respective German '..._name'. The translation need not be literal, but can consider more common English terminology as long as the content is coherent. Make sure to use the same translation for any recurring term.


**To note for future reference. I had originally suspected a dictionary/JSON would be the simplest representation of the data (albeit requiring some post-processing to generate a table later), but ChatGPT had suggested I generate a csv in an earlier chat. However, it may actually have considered the ease with which I would be able to use the data rather than its own computational efficiency.**

(Note that I was not able to use this prompt in the same chat that generated the csv-file due to length restrictions on the chat. So I am still not quite sure if repeating values may have been stored because this answer is a bit more general.)

> **Prompt**
> 
> I just had you create a csv file with slightly under 300 rows of data extracted from a pst_all involving an underlying nested structure of subject groups and their codes, subject clusters the groups consist of including their numerical code and subjects including their code. Since the pst_all was in German, I additionally asked for an additional column in the csv file with an English translation for each name column of the three structural levels. For the higher level values, there was naturally a lot of repetition due to the large number of subjects. For future reference, in such contexts, do the repeating names and translations have to be recomputed for every line of the csv? Would a different output formatting have been less computationally expensive for you (i.e. using less energy and/or tokens), e.g. as a python dictionary?

> **Claude reply**:
> 
> When working with hierarchical data like you described, you're right to consider computational efficiency, especially for a large dataset with repetitive elements.
> 
> For your specific case with grouped subject data in German/English:
> 
> CSV format is indeed less efficient when dealing with hierarchical data with repetition. Each time a higher-level group value appears in multiple rows, I needed to:
> 1. Store that repeated text in each row
> 2. Translate the same text multiple times (or recall the previous translation)
> 
> A more efficient approach would have been:
> 
> **Nested data structures** like a Python dictionary or JSON would better preserve the hierarchy and eliminate repetition:
> 
> ```
> {
>   "group_code1": {
>     "name_de": "Gruppe Name",
>     "name_en": "Group Name",
>     "clusters": {
>       "cluster_code1": {
>         "name_de": "Cluster Name",
>         "name_en": "Cluster Name",
>         "subjects": [
>           {"code": "subj1", "name_de": "Fach1", "name_en": "Subject1"},
>           {"code": "subj2", "name_de": "Fach2", "name_en": "Subject2"}
>         ]
>       }
>     }
>   }
> }
> ```
> 
> This structure would:
> - Store each name/translation only once
> - Preserve the hierarchy naturally
> - Require significantly fewer tokens to represent
> - Be more computationally efficient to generate
> - Use less memory/energy overall
> 
> If you need to work with this data in the future, requesting structured formats that match the natural hierarchy of your data will be both more efficient for AI processing and likely more useful for your subsequent data handling.


So for the future, generating a JSON or dictionary for this type of data and then doing the post-processing locally might be the more computationally efficient and responsible option.
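A minimal sketch of what that local post-processing could look like, assuming the nested layout from the reply above (group -> clusters -> subjects). The sample codes and names are placeholders, `flatten_taxonomy` is a hypothetical helper, and the column names follow the lookup table used in this notebook:

```python
import pandas as pd

# Hypothetical miniature of the nested layout; codes and names are placeholders
nested = {
    "01": {
        "name_de": "Gruppe Name",
        "name_en": "Group Name",
        "clusters": {
            "001": {
                "name_de": "Cluster Name",
                "name_en": "Cluster Name",
                "subjects": [
                    {"code": "004", "name_de": "Fach1", "name_en": "Subject1"},
                    {"code": "090", "name_de": "Fach2", "name_en": "Subject2"},
                ],
            }
        },
    }
}


def flatten_taxonomy(tree):
    """Turn the nested group -> cluster -> subject dict into one row per subject."""
    rows = []
    for grp_code, grp in tree.items():
        for cl_code, cl in grp["clusters"].items():
            for subj in cl["subjects"]:
                rows.append({
                    "subj_grp_code": grp_code,
                    "subj_grp_name": grp["name_de"],
                    "subj_grp_name_eng": grp["name_en"],
                    "subj_cluster_code": cl_code,
                    "subj_cluster_name": cl["name_de"],
                    "subj_cluster_name_eng": cl["name_en"],
                    "subj_code": subj["code"],
                    "subj_name": subj["name_de"],
                    "subj_name_eng": subj["name_en"],
                })
    return pd.DataFrame(rows)


taxonomy = flatten_taxonomy(nested)
print(taxonomy)
```

The repetition of group and cluster names then only happens locally, when the table is built.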

In [37]:
# loading subject code lookup table
subjectcodes = pd.read_csv('./datasets/german_subject_classification.csv')

### Taxonomy/lookup table for personnel

> **Prompt**
>
> The uploaded pdf contains three columns per page in the main text body. The document is in German. Do not change any of the subject titles and retain commas inside them, but drop any footnote markers. The main body contains information about a hierachical structure of subject groups (Fächergruppen), which contain subject clusters (Lehr- und Forschungsbereich, underlined in the pdf) and actual subjects (Fachgebiete). Each entry has a code and a name. Extract all codes and names in a way that a) reflects the hierarchical structure and b) is computationally efficient to you, e.g. into structured JSON file, but also allows relatively easy further processing in Python (e.g. import as dictionary and further local processing into dataframe by looping). For each name in German, also provide an appropriate English translation in an additional field (or key-value pair, whichever format is preferred). The translations need not be literal, but can consider more common English terminology as long as the content is coherent. Make sure to use the same translation for any recurring term.

# Preprocessing

## Student data exploration

*This subsection outlines my original data exploration to understand the source data format. The cells are commented out, but kept for documentation. In the subsection further below, [Create Cleaning function](#create-cleaning-function), I define a function for automated data cleaning and reformatting before combining some of the source tables as appropriate to create the working data frames.*

In [38]:
# not used, kept for documentation
#stud_nofl = pd.read_csv('./datasets/21311-0003_de_2018-2024_GER.csv',sep=';')
#stud_nofl.head(10)

The non-flat csv seems to be aimed at presentation in spreadsheet editors. Reformatting might be possible, but potentially complex. Let's use the flat csv instead.

In [39]:
stud.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
statistics_code,21311,21311,21311,21311,21311,21311,21311,21311,21311,21311
statistics_label,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten
time_code,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST
time_label,Semester,Semester,Semester,Semester,Semester,Semester,Semester,Semester,Semester,Semester
time,2018-10P6M,2022-10P6M,2022-10P6M,2023-10P6M,2023-10P6M,2018-10P6M,2019-10P6M,2020-10P6M,2022-10P6M,2021-10P6M
1_variable_code,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG
1_variable_label,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt
1_variable_attribute_code,DG,DG,DG,DG,DG,DG,DG,DG,DG,DG
1_variable_attribute_label,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland
2_variable_code,NAT,NAT,NAT,NAT,NAT,NAT,NAT,NAT,NAT,NAT


The data structure is also rather complex, which probably makes sense for standardisation purposes at the Statistisches Bundesamt, but for present purposes it makes sense to create a more transparent dataframe. The advantage of the flat csv over the non-flat one is that every line seems to correspond cleanly to one datapoint. Labels are intermingled, so some cleanup will be required.

The value of `4_variable_attribute_label` also hints at a problem to be encountered later, namely shifts in the subject code allocation. That course of study seems to have been assigned to a new code in 2020; there are likely to be other datapoints with this issue, to be checked later.

In [40]:
stud.columns

Index(['statistics_code', 'statistics_label', 'time_code', 'time_label',
       'time', '1_variable_code', '1_variable_label',
       '1_variable_attribute_code', '1_variable_attribute_label',
       '2_variable_code', '2_variable_label', '2_variable_attribute_code',
       '2_variable_attribute_label', '3_variable_code', '3_variable_label',
       '3_variable_attribute_code', '3_variable_attribute_label',
       '4_variable_code', '4_variable_label', '4_variable_attribute_code',
       '4_variable_attribute_label', 'value', 'value_unit',
       'value_variable_code', 'value_variable_label'],
      dtype='object')

### Variables to keep and rename


**Rough overview of columns**

```python
allcols = ['statistics_code', 'statistics_label',                                                                   # identifier of statistic
            'time_code', 'time_label', 'time',                                                                      # time label 
            '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',       # datascope
            '2_variable_code', '2_variable_label', '2_variable_attribute_code', '2_variable_attribute_label',       # nationality
            '3_variable_code', '3_variable_label', '3_variable_attribute_code', '3_variable_attribute_label',       # gender
            '4_variable_code', '4_variable_label', '4_variable_attribute_code', '4_variable_attribute_label',       # subject
            'value', 'value_unit', 'value_variable_code', 'value_variable_label']                                   # value = number of students
```

In [41]:
# checking that variable 1 only has one distinct value, signifying that the scope of the data is all of Germany
# stud['1_variable_code'].unique()

The identifiers for the statistic can be dropped, as can the code and label for time. Since we only include data for all of Germany for now, all `1_variable` columns can also be removed.
The general strategy for the next two variables is to only keep the `[2|3]_variable_attribute_code`s. They contain NaN for the total values, which will allow relatively easy filtering later to remove these "totals" lines to avoid double counting issues.
For variable 4, we keep `variable_attribute_code` and `variable_attribute_label`. The former may be helpful for clustering subjects later on, the latter is more transparent.


In [42]:
# remove_cols = ['statistics_code', 'statistics_label', 
#             'time_code', 'time_label',
#             '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',       # datascope
#             '2_variable_code', '2_variable_label', '2_variable_attribute_label',       # 
#             '3_variable_code', '3_variable_label', '3_variable_attribute_label',
#             '4_variable_code', '4_variable_label', 
#             'value_unit', 'value_variable_code', 'value_variable_label']


Below we identify the columns that should be kept and create a dictionary for a more transparent naming scheme.

In [43]:
# stud.time.unique()      # checking unique values in `time`


`time` corresponds to the year of record. Can be mapped to plain year for simplicity. Currently, '2018-10P6M' presumably indicates the academic year 2018/2019, which began in October 2018. This could be mapped to the integer 2018 (or '2018/19', but the year of start should be a sufficient identifier). These are effectively categorical variables, but treating them as integer is more memory efficient and fine for sorting.
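A small sketch of that mapping on a few sample values from the `time` column. The variable names are only for illustration, and the '2018/19' label is the optional alternative mentioned above:

```python
import pandas as pd

# Sample values as they appear in the `time` column
time = pd.Series(['2018-10P6M', '2022-10P6M', '2023-10P6M'])

# Everything before the first hyphen is the starting year of the academic year
acyear = time.str.split('-').str[0].astype(int)

# Optional human-readable label, e.g. '2018/19'
label = acyear.astype(str) + '/' + ((acyear + 1) % 100).astype(str).str.zfill(2)

print(acyear.tolist())
print(label.tolist())
```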



In [44]:
# colname_remap = {
#     'time': 'acyear',
#     '2_variable_attribute_code': 'nationality',
#     '3_variable_attribute_code': 'gender',
#     '4_variable_attribute_code': 'subj_code',
#     '4_variable_attribute_label': 'subj_name',
#     'value': 'stud_count'
#     }      # dict for remapping the column names

In [45]:
# st_all = stud.drop(remove_cols,axis='columns')
# st_all = st_all.rename(columns=colname_remap)

In [46]:
# st_all.head()

Now, we need to make sure `stud_count` is an integer and simplify `acyear`.

In [47]:
# check non-digit values for `stud_count`
# st_all.loc[st_all.stud_count.str.isdigit() == False,'stud_count'].unique()

In [48]:
# st_all.loc[st_all.stud_count == '0']

Currently, zero values are represented by '-', so we replace all instances of '-' by 0.
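One thing to keep in mind: `.str.replace('-', '0')` substitutes every hyphen anywhere in a string. That is fine here because the counts are never negative, but an exact-match replacement is a slightly safer variant. A sketch on toy values (the numbers are made up):

```python
import pandas as pd

# Toy values; '-' is the placeholder for zero
raw = pd.Series(['3', '-', '407', '-'])

# Series.replace matches whole cell values, so only cells that are exactly '-' change
counts = raw.replace('-', '0').astype(int)
print(counts.tolist())
```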

In [49]:
# replace all instances of '-' in stud_count by 0
# st_all.stud_count = st_all.stud_count.str.replace('-','0')
# st_all.loc[st_all.stud_count == '0']

In [50]:
# now we can cast as type int
# st_all.stud_count = st_all.stud_count.astype(int)
# st_all.dtypes

In [51]:
# st_all.head()

To take care of the year, we can just split at the hyphen to keep only the year and then cast as int as well.

In [52]:
# st_all.acyear = st_all.acyear.str.split('-').str[0].astype(int)


In [53]:
# st_all.dtypes

### Removing totals rows

*This can actually be achieved by changing selections at dataset download. I leave the code below for documentation, but comment it out.*

Now all datatypes should be fine, let's check that we can indeed remove the rows with NULL for `gender` or `nationality`. These should correspond to the totals, which we can easily reconstruct.

In [54]:
#st_all.loc[(st_all.subj_code =='SF142') & (st_all.acyear == 2018)].sort_values(['nationality','gender'])

If the concept is right, the sum of all stud_counts where neither `gender` nor `nationality` is na should be 1088.

In [55]:
#st_all.loc[(st_all.subj_code =='SF142') & (st_all.acyear == 2018) & (st_all.nationality.isna() == False) & (st_all.gender.isna() == False)].stud_count.sum()

This checks out, so we can (and should) indeed remove all rows with NaN in either of those two columns. 
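The spot check above covers a single subject and year. A sketch of how the same check could be run for all subjects at once, using toy data in the cleaned layout (the subject code and counts are invented for illustration):

```python
import pandas as pd

# Toy data; None in `nationality` or `gender` marks a totals row
toy = pd.DataFrame({
    'acyear':      [2018] * 9,
    'subj_code':   ['SF001'] * 9,
    'nationality': ['NATD', 'NATD', 'NATA', 'NATA', 'NATD', 'NATA', None, None, None],
    'gender':      ['GESM', 'GESW', 'GESM', 'GESW', None, None, 'GESM', 'GESW', None],
    'stud_count':  [10, 20, 3, 4, 30, 7, 13, 24, 37],
})

keys = ['acyear', 'subj_code']

# Detail rows have both attributes set; grand totals have neither
detail = toy.dropna(subset=['nationality', 'gender'])
grand = toy[toy.nationality.isna() & toy.gender.isna()]

detail_sum = detail.groupby(keys).stud_count.sum()
grand_total = grand.set_index(keys).stud_count.reindex(detail_sum.index)

# Subject/year combinations whose detail rows do not add up to the reported total
mismatch = detail_sum[detail_sum != grand_total]
print(mismatch.empty)
```

If `mismatch` is empty, dropping the totals rows loses no information.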

In [56]:
#st_all.isna().sum()

In [57]:
#st_all = st_all.dropna(subset=['gender','nationality'])

In [58]:
#st_all.isna().sum()

In [59]:
#st_all.head()

### Saving

Great, no na values left! The dataset should be usable now (barring further extension for subject clustering). Let's reset the index and save the cleaned up version for easier access.

In [60]:
#st_all.reset_index(drop=True,inplace=True)
#st_all.to_csv('./datasets/GER_2018_2023_students_all_cleaned.csv')

## Create cleaning function

Based on the findings above, I create a function to encapsulate cleaning this particular data format, with some switches for the different subtypes of data I have. I leave the code above intact for documentation purposes.

Different tables from the Statistisches Bundesamt might require further adaptations of the function.

In [61]:
# cleaning and saving function
def clean_dat(in_df, dropcols, rencols, csvname='',complexdate=False,insgesamt=False):
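    '''Function to streamline cleaning student data from Statistisches Bundesamt
    
    in_df: the input dataframe
    dropcols: list of column names to drop
    rencols: dictionary of column name changes
    csvname: string for the path and name of the cleaned csv file (nothing is saved if empty)
    complexdate: if True, `time` has the form '2018-10P6M' and only the leading year is kept;
                 if False, `time` is cast to int directly
    insgesamt: if True, drop rows with na in `subj_grp_code` (the "Insgesamt"/total rows)
    '''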
    df = in_df.copy()
    
    
    # replace all instances of '-' in `value` by '0' and cast as int
    # doing this before renaming to slightly increase chances of portability of code
    # (still assumes that column 'value' exists and works the same)
    if 'value' in df.columns:
        df.value = df.value.str.replace('-','0').astype(int)
    else:
        # plain message: interpolating the dataframe itself would dump its whole content
        print('Column `value` not found in input dataframe')
    
    # transform `time` to a plain year integer, corresponding to starting date of academic year
    if 'time' in df.columns:
        if complexdate:
            df.time = df.time.str.split('-').str[0].astype(int)
        else:                                                       # if complexdate flag is not set, simply cast `time` to int
            df.time = df.time.astype(int)
    else:
        print('Column `time` not found in input dataframe')

    # removing columns
    df = df.drop(dropcols,axis='columns')
    
    # renaming columns
    df = df.rename(columns=rencols)
    
    # special check for tables with na values due to datarows with na in subj_grp_code (the "Insgesamt"/total rows we don't need)
    if insgesamt:
        df.dropna(subset=['subj_grp_code'],inplace=True)

    
    # reset index
    df.reset_index(drop=True,inplace=True)
    # optionally save csv file if path is provided
    if csvname:                                          
        df.to_csv(csvname)
    
    return df
    
    


## Cleaning student data

The datasets for students by subject overall and for freshers are in a subset relation. While it is tempting to concatenate them because of their parallel structure, I am keeping the datasets distinct after all.
This section generates the clean dataframes, saves the respective csv files and does a sanity check for na-values.

In [62]:
# list and dictionary for data cleaning
stud_drop = ['statistics_code', 'statistics_label', 
            'time_code', 'time_label',
            '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',   
            '2_variable_code', '2_variable_label', '2_variable_attribute_label',     
            '3_variable_code', '3_variable_label', '3_variable_attribute_label',
            '4_variable_code', '4_variable_label', 
            'value_unit', 'value_variable_code', 'value_variable_label']

stud_remap = {
    'time': 'year',
    '2_variable_attribute_code': 'nationality',
    '3_variable_attribute_code': 'gender',
    '4_variable_attribute_code': 'subj_code',
    '4_variable_attribute_label': 'subj_name',
    'value': 'number'
    }      # dict for remapping the column names

In [63]:
df_s = clean_dat(stud,stud_drop,stud_remap,'./datasets/clean_studall_GER_2018_2023.csv',complexdate=True)
df_f = clean_dat(freshers,stud_drop,stud_remap,'./datasets/clean_freshers_GER_2018_2023.csv',complexdate=True)

In [64]:
df_s.head()

Unnamed: 0,year,nationality,gender,subj_code,subj_name,number
0,2018,NATD,GESM,SF241,Kerntechnik/Kernverfahrenstechn.(ab 2020 zu SF...,3
1,2022,NATD,GESW,SF220,Milch- und Molkereiwirtschaft,21
2,2022,NATA,GESW,SF280,Kartografie,44
3,2023,NATA,GESM,SF086,"Katholische Theologie, - Religionslehre",407
4,2023,NATA,GESW,SF272,Alte Geschichte,18


In [65]:
print(df_s.isna().sum())
print(df_f.isna().sum())

year           0
nationality    0
gender         0
subj_code      0
subj_name      0
number         0
dtype: int64
year           0
nationality    0
gender         0
subj_code      0
subj_name      0
number         0
dtype: int64


Great, cleaning seems to have worked as intended.

Student datasets are clean in `df_s` for students overall and `df_f` for freshers.

## Exam data

In [66]:
exams.head().T

Unnamed: 0,0,1,2,3,4
statistics_code,21321,21321,21321,21321,21321
statistics_label,Statistik der Prüfungen,Statistik der Prüfungen,Statistik der Prüfungen,Statistik der Prüfungen,Statistik der Prüfungen
time_code,JAHR,JAHR,JAHR,JAHR,JAHR
time_label,Jahr,Jahr,Jahr,Jahr,Jahr
time,2022,2021,2021,2021,2022
1_variable_code,DINSG,DINSG,DINSG,DINSG,DINSG
1_variable_label,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt
1_variable_attribute_code,DG,DG,DG,DG,DG
1_variable_attribute_label,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland
2_variable_code,NAT,NAT,NAT,NAT,NAT


In [67]:
ex_drop = ['statistics_code', 'statistics_label', 
            'time_code', 'time_label',
            '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',   
            '2_variable_code', '2_variable_label', '2_variable_attribute_label',     
            '3_variable_code', '3_variable_label', '3_variable_attribute_label',
            '4_variable_code', '4_variable_label', '4_variable_attribute_label',
            '5_variable_code', '5_variable_label',
            'value_unit', 'value_variable_code', 'value_variable_label']

ex_remap = {
    'time': 'year',
    '2_variable_attribute_code': 'nationality',
    '3_variable_attribute_code': 'gender',
    '4_variable_attribute_code': 'passed',
    '5_variable_attribute_code': 'subj_code',
    '5_variable_attribute_label': 'subj_name',
    'value': 'number'
    } 

In [68]:
dfex = clean_dat(exams,ex_drop,ex_remap,'./datasets/clean_exams_GER_2018_2023.csv')

In [69]:
dfex.isna().sum()

year           0
nationality    0
gender         0
passed         0
subj_code      0
subj_name      0
number         0
dtype: int64

In [70]:
dfex.head()

Unnamed: 0,year,nationality,gender,passed,subj_code,subj_name,number
0,2022,NATA,GESW,PRUEFBESTN,SF181,Wirtschaftspädagogik,1
1,2021,NATA,GESW,PRUEFBESTJ,SF042,Wirtschaftsrecht,193
2,2021,NATD,GESW,PRUEFBESTJ,SF294,Materialwissenschaften,81
3,2021,NATA,GESW,PRUEFBESTN,SF008,Anglistik/Englisch,6
4,2022,NATA,GESM,PRUEFBESTN,SF255,Archivwesen,0


Exam data is student-related, and since people typically have to be registered students to take exams, the examinees are probably a subset of the students counted in `df_s`. However, I'll keep the dataset separate for now, not least because of the additional `passed` property.

Exam data is clean in dataframe `dfex`.

## Cleaning personnel and professorial data

Similar to the student datasets above, the general personnel data and the professorial counts are pretty parallel, but in a subset relation, so I'll keep them distinct.

In [71]:
# inspecting personnel data structure to identify column structure
pers.head().T

Unnamed: 0,0,1,2,3,4
statistics_code,21341,21341,21341,21341,21341
statistics_label,Statistik des Hochschulpersonals,Statistik des Hochschulpersonals,Statistik des Hochschulpersonals,Statistik des Hochschulpersonals,Statistik des Hochschulpersonals
time_code,JAHR,JAHR,JAHR,JAHR,JAHR
time_label,Jahr,Jahr,Jahr,Jahr,Jahr
time,2018,2018,2021,2022,2020
1_variable_code,DINSG,DINSG,DINSG,DINSG,DINSG
1_variable_label,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt
1_variable_attribute_code,DG,DG,DG,DG,DG
1_variable_attribute_label,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland
2_variable_code,BERLF1,BERLF1,BERLF1,BERLF1,BERLF1


In [72]:
pers.columns

Index(['statistics_code', 'statistics_label', 'time_code', 'time_label',
       'time', '1_variable_code', '1_variable_label',
       '1_variable_attribute_code', '1_variable_attribute_label',
       '2_variable_code', '2_variable_label', '2_variable_attribute_code',
       '2_variable_attribute_label', '3_variable_code', '3_variable_label',
       '3_variable_attribute_code', '3_variable_attribute_label', 'value',
       'value_unit', 'value_variable_code', 'value_variable_label'],
      dtype='object')

In [73]:
# list and dict for column dropping and renaming
pers_drop = [
    'statistics_code', 'statistics_label', 'time_code', 'time_label',
    '1_variable_code', '1_variable_label',
       '1_variable_attribute_code', '1_variable_attribute_label',
       '2_variable_code', '2_variable_label', '3_variable_code', '3_variable_label',
        '3_variable_attribute_label',
       'value_unit', 'value_variable_code', 'value_variable_label'
]

pers_rename ={
    'time': 'year',
    '2_variable_attribute_label': 'subj_grp_name',
    '2_variable_attribute_code': 'subj_grp_code',
    '3_variable_attribute_code': 'gender',
    'value': 'number'
}



In [74]:
df_pers = clean_dat(pers,pers_drop,pers_rename,'./datasets/clean_pers_all_GER_2018_2023.csv',insgesamt=True)
df_prof = clean_dat(prof,pers_drop,pers_rename,'./datasets/clean_prof_GER_2018_2023.csv',insgesamt=True)

In [75]:
# # keeping this code here for documentation, cleaning implemented in function `clean_dat` now

# display(dfp.isna().sum())
# display(dfp.loc[dfp.subj_grp_code.isna()])
# display(dfp.loc[dfp.subj_grp_name=='Insgesamt'])

# # The problem is fundamentally the same in both `subj_grp_code` and `subj_grp_name` columns, we don't need these "Total" rows, so drop them.  
# dfp.dropna(subset=['subj_grp_code'],inplace=True)
# dfp.isna().sum()

In [76]:
print(df_pers.isna().sum())
print(df_prof.isna().sum())

year             0
subj_grp_code    0
subj_grp_name    0
gender           0
number           0
dtype: int64
year             0
subj_grp_code    0
subj_grp_name    0
gender           0
number           0
dtype: int64


Overall personnel data is now clean in `df_pers`, professorial data in `df_prof`.

## Subject codes

We have a table of subject codes, but unfortunately the format in the dataframes is not exactly the same. This section takes care of harmonising this.

In [149]:
subjectcodes.head()

Unnamed: 0,subj_grp_code,subj_grp_name,subj_grp_name_eng,subj_cluster_code,subj_cluster_name,subj_cluster_name_eng,subj_code,subj_name,subj_name_eng
0,1,Geisteswissenschaften,Humanities,1,Geisteswissenschaften allgemein,Humanities General,4,Interdisziplinäre Studien (Schwerpunkt Geistes...,Interdisciplinary Studies (Focus on Humanities)
1,1,Geisteswissenschaften,Humanities,1,Geisteswissenschaften allgemein,Humanities General,90,Lernbereich Geisteswissenschaften,Learning Area Humanities
2,1,Geisteswissenschaften,Humanities,2,Evang. Theologie -Religionslehre,Protestant Theology and Religious Education,161,Diakoniewissenschaft,Diaconal Studies
3,1,Geisteswissenschaften,Humanities,2,Evang. Theologie -Religionslehre,Protestant Theology and Religious Education,544,Evang. Religionspädagogik kirchliche Bildungsa...,Protestant Religious Education and Church Educ...
4,1,Geisteswissenschaften,Humanities,2,Evang. Theologie -Religionslehre,Protestant Theology and Religious Education,53,Evang. Theologie -Religionslehre,Protestant Theology and Religious Education


### Subject group/cluster codes in personnel and professorial data 

In [152]:
print(subjectcodes.subj_grp_code.unique())
print(subjectcodes.subj_grp_name.unique())


[ 1  2  3  4  5  7  8  9 10]
['Geisteswissenschaften' 'Sport'
 'Rechts- Wirtschafts- und Sozialwissenschaften'
 'Mathematik Naturwissenschaften' 'Humanmedizin/Gesundheitswissenschaften'
 'Agrar- Forst- und Ernährungswissenschaften Veterinärmedizin'
 'Ingenieurwissenschaften' 'Kunst Kunstwissenschaft'
 'Außerhalb der Studienbereichsgliederung']


Frustratingly, the codes are not consistent between the personnel and professor datasets:

In [87]:
print(df_prof.subj_grp_code.nunique())
print(df_prof.subj_grp_code.unique())
print(df_prof.subj_grp_name.unique())

12
['FACHGR-SPR' 'FACHGR-ZE-02' 'FACHGR-AGRW' 'FACHGR-MEDVET' 'FACHGR-ZE-01'
 'FACHGR-RWSW' 'FACHGR-NATW' 'FACHGR-INGW' 'FACHGR-KUNSTW' 'FACHGR-MEDHUM'
 'FACHGR-AGRVET' 'FACHGR-SKW']
['Sport' 'Zentr.Einricht.d.Hochschulkliniken (nur Humanmed.)'
 'Agrar-, Forst- und Ernährungswissenschaften' 'Veterinärmedizin'
 'Zentr.Einrichtungen (o. klinikspez. Einrichtungen)'
 'Rechts-, Wirtschafts- und Sozialwissenschaften'
 'Mathematik, Naturwissenschaften' 'Ingenieurwissenschaften'
 'Kunst, Kunstwissenschaft' 'Humanmedizin/Gesundheitswissenschaften'
 'Agrar-, Forst- u.Ernährungswiss., Veterinärmedizin'
 'Geisteswissenschaften']


In [86]:
print(df_pers.subj_grp_code.nunique())
print(df_pers.subj_grp_code.unique())
print(df_pers.subj_grp_name.unique())

92
['SB-NATW-08' 'SB-KUNSTW-03' 'SB-NATW-05' 'SB-KUNSTW-04' 'SB-INGW-09'
 'HRZ-01' 'SB-NATW-01' 'KL-ZD-01' 'SB-NATW-06' 'THEOL-KT-01' 'SB-RWSW-05'
 'SB-INGW-02' 'EINR-WISS-01' 'MEDHUM-KP-01' 'SB-NATW-07' 'SF013'
 'SB-SKW-09' 'SB-AGRW-01' 'SB-AGRW-03' 'KL-EINR-AUSB-01' 'SB-INGW-04'
 'SB-SKW-01' 'SB-RWSW-10' 'SKW-SONST-01' 'MEDHUM-VK-01' 'MEDHUM-ALLG-01'
 'FACHGR-INGW-01' 'SB-INGW-08' 'SB-MEDHUM-01' 'PHIL-KLASS-01' 'SB-RWSW-12'
 'SB-INGW-07' 'SB-RWSW-02' 'SB-NATW-02' 'SB-RWSW-11' 'EINR-SOZ-01'
 'MEDVET-ALLG-01' 'EINR-VERS-01' 'EINR-KLFRMD-02' 'SB-AGRW-02'
 'MEDHUM-KT-01' 'MEDZN-KP-01' 'SB-INGW-12' 'FACHGR-RWSW-01' 'AFEW-ALLG-01'
 'SB-RWSW-04' 'SB-KUNSTW-02' 'SB-AGRW-04' 'SB-INGW-05' 'KL-EINR-SOZ-01'
 'SB-KUNSTW-01' 'THEOL-EV-01' 'SB-NATW-04' 'SB-KUNSTW-05' 'SB-RWSW-03'
 'SB-SKW-14' 'MEDVET-VK-01' 'SB-NATW-09' 'EINR-HSFRMD-02' 'SB-SKW-17'
 'MEDVET-KT-01' 'SB-INGW-11' 'SB-INGW-03' 'SB-SKW-07' 'BIB-ZENTRAL-01'
 'SB-SKW-18' 'HS-VERW-02' 'MEDVET-KP-01' 'EINR-SONST-01' 'SB-RWSW-09'
 'SB-RWSW-0

It seems that the personnel data contain both subject group and subject cluster categorisation in the `subj_grp...` columns.

Suspicion: The values corresponding to subject group codes might actually be aggregate data from the clusters. If so, we shall have to remove those rows.

In [None]:
df_pers.loc[(df_pers.subj_grp_code.isin(df_prof.subj_grp_code.unique()))]

Unnamed: 0,year,subj_grp_code,subj_grp_name,gender,number
996,2019,FACHGR-AGRVET,"Agrar-, Forst- u.Ernährungswiss., Veterinärmed...",GESM,3175
997,2023,FACHGR-ZE-02,Zentr.Einricht.d.Hochschulkliniken (nur Humanm...,GESW,1716
998,2019,FACHGR-ZE-01,Zentr.Einrichtungen (o. klinikspez. Einrichtun...,GESW,7148
999,2020,FACHGR-SKW,Geisteswissenschaften,GESW,10797
1000,2021,FACHGR-ZE-01,Zentr.Einrichtungen (o. klinikspez. Einrichtun...,GESW,7565
...,...,...,...,...,...
1099,2021,FACHGR-ZE-01,Zentr.Einrichtungen (o. klinikspez. Einrichtun...,GESM,7507
1100,2022,FACHGR-MEDHUM,Humanmedizin/Gesundheitswissenschaften,GESM,34325
1101,2022,FACHGR-RWSW,"Rechts-, Wirtschafts- und Sozialwissenschaften",GESW,22639
1102,2020,FACHGR-AGRVET,"Agrar-, Forst- u.Ernährungswiss., Veterinärmed...",GESW,3374


The relatively large numbers for, e.g., Geisteswissenschaften strongly suggest that this is an aggregate. Before throwing the rows out, let's verify though. 
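One way to run that verification is to sum the detailed rows under each group and compare the result with the reported group row. The sketch below uses invented toy numbers and a hypothetical `parent_of` mapping from detailed codes to group codes. The real mapping would have to come from the official classification document, since the code prefixes do not line up cleanly (e.g. `SB-AGRW-...` vs. `FACHGR-AGRVET`).

```python
import pandas as pd

# Toy stand-in for df_pers; all numbers are invented for illustration.
toy_pers = pd.DataFrame({
    "year": [2020, 2020, 2020, 2020],
    "subj_grp_code": ["SB-SKW-01", "SB-SKW-07", "FACHGR-SKW", "SB-NATW-01"],
    "gender": ["GESW", "GESW", "GESW", "GESW"],
    "number": [40, 60, 100, 25],
})

# Hypothetical mapping from detailed code to its parent group code (assumption).
parent_of = {
    "SB-SKW-01": "FACHGR-SKW",
    "SB-SKW-07": "FACHGR-SKW",
    "SB-NATW-01": "FACHGR-NATW",
}

# Assumption for the toy data: group-level rows are the ones starting with FACHGR.
is_group = toy_pers["subj_grp_code"].str.startswith("FACHGR")

# Reported group-level values.
reported = toy_pers[is_group].rename(
    columns={"subj_grp_code": "parent", "number": "reported"}
)

# Sum of the detailed rows under each parent group.
detail_sums = (
    toy_pers[~is_group]
    .assign(parent=lambda d: d["subj_grp_code"].map(parent_of))
    .groupby(["year", "parent", "gender"], as_index=False)["number"]
    .sum()
    .rename(columns={"number": "detail_sum"})
)

# Compare reported group values with the sums of their detailed rows.
check = reported.merge(detail_sums, on=["year", "parent", "gender"], how="left")
check["matches"] = check["reported"] == check["detail_sum"]
print(check)
```

If `matches` is `True` for every group row in the real data, the group rows are pure aggregates and can be dropped without losing information.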

### Student data subject codes

Things look a bit better for the subject codes in the student data. Here only the format differs: the `subjectcodes` table uses plain integers, while the `df_s`/`df_f` tables use strings starting with `SF`. We should convert one of them to match the other.
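A minimal sketch of the conversion in both directions, on toy frames rather than the real tables. It assumes the string codes are always `SF` followed by the integer zero-padded to three digits, which matches the values printed in this notebook (`SF001`, `SF086`) but has not been checked against the full classification. The column name `subj_code_sf` is made up for the example.

```python
import pandas as pd

# Toy stand-ins for the real tables.
subjectcodes_toy = pd.DataFrame({"subj_code": [4, 90, 161, 544]})
df_s_toy = pd.DataFrame({"subj_code": ["SF004", "SF090", "SF161", "SF544"]})

# Direction 1: integer -> 'SF' string, zero-padded to three digits (assumed format).
subjectcodes_toy["subj_code_sf"] = (
    "SF" + subjectcodes_toy["subj_code"].astype(str).str.zfill(3)
)

# Direction 2: 'SF' string -> integer, by dropping the two-letter prefix.
df_s_toy["subj_code_int"] = df_s_toy["subj_code"].str[2:].astype(int)

print(subjectcodes_toy)
print(df_s_toy)
```

Converting the integers to strings keeps the student tables untouched, which may be the less invasive direction since `df_s` and `df_f` already agree with each other.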

In [158]:
subjectcodes.subj_code.unique()

array([  4,  90, 161, 544,  53, 162, 545,  86, 169, 127, 136, 272,  12,
        68, 273, 548, 183, 275,  37,  22, 188, 152, 284,  18, 160,  31,
        70,   5,  95,  43,  34, 271,  67, 189, 119, 120,   6,   8,  59,
        84, 131, 137, 150,  16,  56, 206, 139, 146, 207, 153, 209, 130,
         1,   2,  10,  15,  73,  78,  81,  83,  85, 180, 122, 145, 158,
       187,  24, 173, 174, 292, 302,  98,  29,  30, 154,  55,  38,  44,
        36, 129, 147, 148, 149, 208, 245, 253, 135,  42, 257, 258, 255,
       259, 265, 262, 260, 266, 261, 168, 263, 256, 264, 268, 172, 269,
        11,  21, 167, 304, 182, 166, 274, 210, 175, 181, 184, 464, 179,
       132, 117, 270, 321,  52, 365, 115, 254, 361, 190, 303,  49, 186,
        19, 312, 105, 118, 276,  14, 128,  25,  32,  96, 126,   9,  26,
       300, 282,  65,  66, 385,  39, 110, 111, 124,  50, 283, 195, 232,
       233, 234, 107, 156,  93,  61,  64, 138, 125,   3,  28,  60,  97,
       220, 353, 371, 227,  58,  75, 320,  71, 140,  72, 199, 38

In [82]:
# compare as sets so the check does not depend on the order of the codes
print('Are the subject codes identical for df_s and df_f?', set(df_f.subj_code) == set(df_s.subj_code))
print(df_s.subj_code.unique())


Are the subject codes identical for df_s and df_f? True
['SF241' 'SF220' 'SF280' 'SF086' 'SF272' 'SF001' 'SF233' 'SF142' 'SF040'
 'SF074' 'SF076' 'SF126' 'SF032' 'SF173' 'SF026' 'SF186' 'SF178' 'SF134'
 'SF181' 'SF212' 'SF048' 'SF282' 'SF120' 'SF273' 'SF119' 'SF033' 'SF087'
 'SF231' 'SF075' 'SF266' 'SF104' 'SF188' 'SF429' 'SF094' 'SF085' 'SF148'
 'SF082' 'SF146' 'SF084' 'SF195' 'SF042' 'SF205' 'SF139' 'SF137' 'SF179'
 'SF211' 'SF242' 'SF365' 'SF232' 'SF180' 'SF143' 'SF027' 'SF073' 'SF002'
 'SF117' 'SF035' 'SF185' 'SF077' 'SF172' 'SF049' 'SF135' 'SF125' 'SF333'
 'SF361' 'SF194' 'SF079' 'SF204' 'SF121' 'SF385' 'SF118' 'SF034' 'SF138'
 'SF083' 'SF145' 'SF003' 'SF274' 'SF105' 'SF187' 'SF095' 'SF149' 'SF041'
 'SF170' 'SF088' 'SF210' 'SF008' 'SF230' 'SF136' 'SF116' 'SF144' 'SF036'
 'SF268' 'SF078' 'SF124' 'SF171' 'SF028' 'SF271' 'SF189' 'SF031' 'SF122'
 'SF197' 'SF044' 'SF207' 'SF235' 'SF275' 'SF177' 'SF264' 'SF102' 'SF051'
 'SF284' 'SF092' 'SF024' 'SF080' 'SF029' 'SF123' 'SF277' 'SF009' 'S

# Further EDA and processing on cleaned datasets

Since each row of the dataset holds an aggregate count rather than a single observation, getting the typical descriptive statistics takes a bit of care.

In [None]:
# we can ignore the stats for year
df_s.describe()

Unnamed: 0,year,number
count,7032.0,7032.0
mean,2020.5,2479.224261
std,1.707947,8134.877221
min,2018.0,0.0
25%,2019.0,44.0
50%,2020.5,286.0
75%,2022.0,1245.75
max,2023.0,114157.0


The `describe()` method summarises the `number` column, but the result needs careful reading. There are 7032 data rows, yet that count says nothing about the total number of students. The mean is the average size of a single row, i.e. the average number of students per combination of year, subject, nationality and gender, and the quartiles are row-level in the same way. For a more directly interpretable overview, we need to do some aggregation of our own. (Some of this may overlap with what will be queried later in SQL.)
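To make the difference concrete, here is a small sketch on invented toy data. The column names `year`, `subj_code` and `number` follow `df_s`; the `gender` column name and all values are assumptions for illustration. It also assumes the rows are disjoint counts, i.e. that no row is itself a total of other rows.

```python
import pandas as pd

# Toy data: 2 years x 2 subjects x 2 genders, one count per row.
toy_s = pd.DataFrame({
    "year": [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "subj_code": ["SF001", "SF001", "SF002", "SF002"] * 2,
    "gender": ["m", "w"] * 4,
    "number": [10, 20, 30, 40, 15, 25, 35, 45],
})

# What describe() reports: the mean size of a single row.
row_mean = toy_s["number"].mean()

# What we usually want: totals after summing over the breakdown columns.
total_per_year = toy_s.groupby("year")["number"].sum()

print(row_mean)
print(total_per_year)
```

The row mean depends on how finely the table is broken down, while the yearly totals do not, which is why the totals are the safer basis for answering the questions in the introduction.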

In [104]:
# mean row value per subject, averaged over years, nationalities and genders
# (note: this is not the total number of students in the subject)
df_s.groupby('subj_code')['number'].mean()

subj_code
SF001     180.166667
SF002     335.125000
SF003    3982.208333
SF004    5326.666667
SF005     101.625000
            ...     
SF458     700.250000
SF464     895.666667
SF544     171.250000
SF545     123.208333
SF548     257.625000
Name: number, Length: 293, dtype: float64

In [None]:
# how about the overall average number of students

In [None]:
def describe_agr(df,targetcol,aggrcols):
    '''Provides some descriptive statistics using 
    '''
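The stub above is unfinished. One possible reading, offered here only as a guess at the intent, is a helper that first sums `targetcol` within each combination of `aggrcols` and then describes those sums. The name `describe_agr_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def describe_agr_sketch(df, targetcol, aggrcols):
    """Sum `targetcol` within each combination of `aggrcols`,
    then return descriptive statistics of those sums."""
    totals = df.groupby(aggrcols)[targetcol].sum()
    return totals.describe()

# Toy data with invented values; column names mirror df_s where known.
toy = pd.DataFrame({
    "year": [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "subj_code": ["SF001", "SF001", "SF002", "SF002"] * 2,
    "number": [10, 20, 30, 40, 15, 25, 35, 45],
})

# Statistics over the per-year, per-subject totals instead of the raw rows.
stats = describe_agr_sketch(toy, "number", ["year", "subj_code"])
print(stats)
```

Called as `describe_agr_sketch(df_s, 'number', ['year', 'subj_code'])`, this would describe the distribution of subject sizes per year, provided the rows of `df_s` do not overlap.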

In [100]:
df_f.describe()

Unnamed: 0,year,number
count,7032.0,7032.0
mean,2020.5,352.98194
std,1.707947,1084.899318
min,2018.0,0.0
25%,2019.0,5.0
50%,2020.5,44.0
75%,2022.0,200.0
max,2023.0,15178.0
