# Introduction

## Dataset

https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0003

Number of students in Germany by subject of study, nationality, gender from 2018-2024. I will focus on the final year 2023/24 for a start. (Will I?)

See this document for the official subject codes and subject cluster classifications: https://www.destatis.de/DE/Methoden/Klassifikationen/Bildung/studenten-pruefungsstatistik.pdf?__blob=publicationFile&v=12

## Questions 

1. What was the total number of students in Germany in 2023/24?
2. What were the 10 subjects with the highest number of students in 2023/24?
3. What was the gender distribution of students in 2023/24?
4. What was the gender distribution in the 5 most studied subjects?
5. What were the top subjects by gender?
6. Which were the top 5 subjects studied by non-citizens? 
7. (Sort the subjects into clusters and provide a cluster identifier to the dataframe.) How are the student numbers distributed across subject clusters?
8. How does the number of students change over the time period by cluster?
9. How does the number of students change over the time period for language related subjects?
10. How does the number of students change for linguistics in a narrow perspective?





# Setup

## Load libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Load dataset(s)

In [2]:
# fields of study, all Germany 2018-2024
# using the flat csv
# https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0003
stud = pd.read_csv('./datasets/raw/21311-0003_de_flat_allstudents.csv',sep=';')

# for alternative table with data per state see:
# https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0006

In [52]:
# additional datasets
# freshers by course of study
# https://www-genesis.destatis.de/datenbank/online/statistic/21311/table/21311-0012
freshers = pd.read_csv('./datasets/raw/21311-0012_de_flat_freshers.csv',sep=';')

# employees at higher education institutions by subject group
# https://www-genesis.destatis.de/datenbank/online/statistic/21341/table/21341-0002
pers = pd.read_csv('./datasets/raw/21341-0002_de_flat_personell.csv',sep=';')


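# professors by subject group
# https://www-genesis.destatis.de/datenbank/online/statistic/21341/table/21341-0003
# NOTE: this path is the same file that `pers` is read from (table 21341-0002);
# it should presumably point to the flat csv for table 21341-0003 instead
prof = pd.read_csv('./datasets/raw/21341-0002_de_flat_personell.csv',sep=';')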


# exams by subject
# https://www-genesis.destatis.de/datenbank/online/statistic/21321/table/21321-0003
exams = pd.read_csv('./datasets/raw/21321-0003_de_flat_exams.csv',sep=';')


#

# Preprocessing

## Student data total

In [4]:
# not used, kept for documentation
#stud_nofl = pd.read_csv('./datasets/21311-0003_de_2018-2024_GER.csv',sep=';')
#stud_nofl.head(10)

The non-flat csv seems to be aimed at presentation in spreadsheet editors. Reformatting might be possible, but potentially complex. Let's use the flat csv instead.

In [5]:
stud.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
statistics_code,21311,21311,21311,21311,21311,21311,21311,21311,21311,21311
statistics_label,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten,Statistik der Studenten
time_code,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST,SEMEST
time_label,Semester,Semester,Semester,Semester,Semester,Semester,Semester,Semester,Semester,Semester
time,2018-10P6M,2022-10P6M,2022-10P6M,2023-10P6M,2023-10P6M,2018-10P6M,2019-10P6M,2020-10P6M,2022-10P6M,2021-10P6M
1_variable_code,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG,DINSG
1_variable_label,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt
1_variable_attribute_code,DG,DG,DG,DG,DG,DG,DG,DG,DG,DG
1_variable_attribute_label,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland
2_variable_code,NAT,NAT,NAT,NAT,NAT,NAT,NAT,NAT,NAT,NAT


The data structure is also rather complex. This probably makes sense for standardisation purposes at the Statistisches Bundesamt, but for present purposes a more transparent dataframe is preferable. The advantage over the non-flat csv is that every line seems to correspond cleanly to one datapoint. Labels are intermingled, so some cleanup will be required.

The value of `4_variable_attribute_label` also hints at a problem to be encountered later, namely shifts in the subject code allocation. That course of study seems to have been assigned to a new code in 2020; there are likely to be other datapoints with this issue, to be checked later.
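To get an overview of such reassignments later, the notes embedded in the labels can be pulled out with a regular expression. This is a minimal sketch on toy labels: the first label appears in the data in this form, the second is a made-up placeholder, and the column names `moved_year` and `moved_to` are my own. It assumes that all reassignment notes follow the pattern `(ab YYYY zu SFnnn)`, which still has to be verified against the full dataset.

```python
import pandas as pd

# Toy labels: one real label from the data, one hypothetical label without a
# space before the bracket, and one label without any reassignment note.
labels = pd.Series([
    'Körperbehindertenpädagogik (ab 2016 zu SF190)',
    'Beispielfach(ab 2020 zu SF999)',   # hypothetical subject and code
    'Kartografie',
])

# Assumed pattern of the note: "(ab <year> zu <new subject code>)"
pattern = r'\(ab (?P<moved_year>\d{4}) zu (?P<moved_to>SF\d+)\)'

moved = labels.str.extract(pattern)   # one column per named group, NaN if no match
moved['subj_name'] = labels

# keep only the subjects that carry a reassignment note
print(moved.dropna(subset=['moved_to']))
```

Subjects without a note end up with NaN in both extracted columns, so they are easy to separate from the reassigned ones.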

In [6]:
stud.columns

Index(['statistics_code', 'statistics_label', 'time_code', 'time_label',
       'time', '1_variable_code', '1_variable_label',
       '1_variable_attribute_code', '1_variable_attribute_label',
       '2_variable_code', '2_variable_label', '2_variable_attribute_code',
       '2_variable_attribute_label', '3_variable_code', '3_variable_label',
       '3_variable_attribute_code', '3_variable_attribute_label',
       '4_variable_code', '4_variable_label', '4_variable_attribute_code',
       '4_variable_attribute_label', 'value', 'value_unit',
       'value_variable_code', 'value_variable_label'],
      dtype='object')

## Variables to keep and rename


**Rough overview of columns**

```python
allcols = ['statistics_code', 'statistics_label',                                                                   # identifier of statistic
            'time_code', 'time_label', 'time',                                                                      # time label 
            '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',       # datascope
            '2_variable_code', '2_variable_label', '2_variable_attribute_code', '2_variable_attribute_label',       # nationality
            '3_variable_code', '3_variable_label', '3_variable_attribute_code', '3_variable_attribute_label',       # gender
            '4_variable_code', '4_variable_label', '4_variable_attribute_code', '4_variable_attribute_label',       # subject
            'value', 'value_unit', 'value_variable_code', 'value_variable_label']                                   # value = number of students
```

In [7]:
# checking that variable 1 only has one distinct value, signifying that the scope of the data is all of Germany
stud['1_variable_code'].unique()

array(['DINSG'], dtype=object)

The identifiers for the statistic can be dropped, as can the code and label for time. Since we only include data for all of Germany for now, all `1_variable` columns can also be removed.
The general strategy for the next two variables is to only keep the `[23]_variable_attribute_code`s. They contain NaN for the total values, which will allow relatively easy filtering later to remove these "totals" lines to avoid double counting issues.
For variable 4, we keep `variable_attribute_code` and `variable_attribute_label`. The former may be helpful for clustering subjects later on, the latter is more transparent.


In [8]:
remove_cols = ['statistics_code', 'statistics_label', 
            'time_code', 'time_label',
            '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',       # datascope
            '2_variable_code', '2_variable_label', '2_variable_attribute_label',       # 
            '3_variable_code', '3_variable_label', '3_variable_attribute_label',
            '4_variable_code', '4_variable_label', 
            'value_unit', 'value_variable_code', 'value_variable_label']


Below we identify the columns that should be kept and create a dictionary for a more transparent naming scheme.

In [9]:
stud.time.unique()      # checking unique values in `time`

array(['2018-10P6M', '2022-10P6M', '2023-10P6M', '2019-10P6M',
       '2020-10P6M', '2021-10P6M'], dtype=object)


`time` corresponds to the year of record. Can be mapped to plain year for simplicity. Currently, '2018-10P6M' presumably indicates the academic year 2018/2019, which began in October 2018. This could be mapped to the integer 2018 (or '2018/19', but the year of start should be a sufficient identifier). These are effectively categorical variables, but treating them as integer is more memory efficient and fine for sorting.
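Both options mentioned above can be sketched on a few toy values. The snippet assumes that the part before the first hyphen is always the starting year of the academic year; the variable names are only for illustration.

```python
import pandas as pd

# A few of the raw `time` codes from the dataset
time_codes = pd.Series(['2018-10P6M', '2022-10P6M', '2023-10P6M'])

# Option 1: plain integer for the starting year
start_year = time_codes.str.split('-').str[0].astype(int)

# Option 2: a '2018/19'-style label built from the starting year
acyear_label = (
    start_year.astype(str)
    + '/'
    + ((start_year + 1) % 100).astype(str).str.zfill(2)
)

print(start_year.tolist())     # [2018, 2022, 2023]
print(acyear_label.tolist())   # ['2018/19', '2022/23', '2023/24']
```

The integer version sorts naturally and is used below; the string label would mainly be useful for plot axes.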



In [10]:
colname_remap = {
    'time': 'acyear',
    '2_variable_attribute_code': 'nationality',
    '3_variable_attribute_code': 'gender',
    '4_variable_attribute_code': 'subj_code',
    '4_variable_attribute_label': 'subj_name',
    'value': 'stud_count'
    }      # dict for remapping the column names

In [11]:
st_all = stud.drop(remove_cols,axis='columns')
st_all = st_all.rename(columns=colname_remap)

In [12]:
st_all.head()

Unnamed: 0,acyear,nationality,gender,subj_code,subj_name,stud_count
0,2018-10P6M,NATD,GESM,SF241,Kerntechnik/Kernverfahrenstechn.(ab 2020 zu SF...,3
1,2022-10P6M,NATD,GESW,SF220,Milch- und Molkereiwirtschaft,21
2,2022-10P6M,NATA,GESW,SF280,Kartografie,44
3,2023-10P6M,NATA,GESM,SF086,"Katholische Theologie, - Religionslehre",407
4,2023-10P6M,NATA,GESW,SF272,Alte Geschichte,18


Now, we need to make sure `stud_count` is an integer and simplify `acyear`.

In [13]:
# check non-digit values for `stud_count`
st_all.loc[st_all.stud_count.str.isdigit() == False,'stud_count'].unique()

array(['-'], dtype=object)

In [14]:
st_all.loc[st_all.stud_count == '0']

Unnamed: 0,acyear,nationality,gender,subj_code,subj_name,stud_count


Currently, zero values are represented by '-', so we replace all instances of '-' by 0.
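One caveat on the replacement: a substring replacement would also hit a minus sign if a table ever contained negative values. That is not the case here, since '-' is the only non-digit value, but a whole-cell replacement is the more defensive variant. A minimal sketch on toy values:

```python
import pandas as pd

# Toy values in the same shape as the raw `value` column (strings, '-' for zero)
raw = pd.Series(['3', '21', '-', '407'])

# Series.replace matches whole cell values, so only cells that are exactly '-' change
counts = raw.replace('-', '0').astype(int)
print(counts.tolist())    # [3, 21, 0, 407]

# By contrast, .str.replace works on substrings and would also touch a minus sign
mangled = pd.Series(['-5']).str.replace('-', '0', regex=False)
print(mangled.tolist())   # ['05']
```

With the whole-cell variant, `astype(int)` raises an error for any other unexpected token instead of silently converting it.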

In [15]:
# replace all instances of '-' in stud_count by 0
st_all.stud_count = st_all.stud_count.str.replace('-','0')
st_all.loc[st_all.stud_count == '0']

Unnamed: 0,acyear,nationality,gender,subj_code,subj_name,stud_count
26,2018-10P6M,NATA,GESM,SF087,Körperbehindertenpädagogik (ab 2016 zu SF190),0
33,2018-10P6M,NATA,GESW,SF429,Stahlbau,0
56,2018-10P6M,NATA,GESM,SF180,Kaukasistik,0
58,2019-10P6M,NATA,GESW,SF027,Blinden-/Sehbehindertenpädagogik(ab 2016 zu SF...,0
69,2023-10P6M,NATD,GESM,SF027,Blinden-/Sehbehindertenpädagogik(ab 2016 zu SF...,0
...,...,...,...,...,...,...
6996,2021-10P6M,NATA,GESM,SF087,Körperbehindertenpädagogik (ab 2016 zu SF190),0
7010,2022-10P6M,NATD,GESM,SF196,Studienkolleg,0
7022,2020-10P6M,NATD,GESM,SF041,Sonstiges Orientierungsstudium,0
7026,2023-10P6M,NATA,GESW,SF061,Meliorationswesen,0


In [16]:
# now we can cast as type int
st_all.stud_count = st_all.stud_count.astype(int)
st_all.dtypes

acyear         object
nationality    object
gender         object
subj_code      object
subj_name      object
stud_count      int64
dtype: object

In [17]:
st_all.head()

Unnamed: 0,acyear,nationality,gender,subj_code,subj_name,stud_count
0,2018-10P6M,NATD,GESM,SF241,Kerntechnik/Kernverfahrenstechn.(ab 2020 zu SF...,3
1,2022-10P6M,NATD,GESW,SF220,Milch- und Molkereiwirtschaft,21
2,2022-10P6M,NATA,GESW,SF280,Kartografie,44
3,2023-10P6M,NATA,GESM,SF086,"Katholische Theologie, - Religionslehre",407
4,2023-10P6M,NATA,GESW,SF272,Alte Geschichte,18


To take care of the year, we can just split at the hyphen to keep only the year and then cast as int as well.

In [18]:
st_all.acyear = st_all.acyear.str.split('-').str[0].astype(int)


In [19]:
st_all.dtypes

acyear          int64
nationality    object
gender         object
subj_code      object
subj_name      object
stud_count      int64
dtype: object

### Removing totals rows

*This can actually be achieved by changing selections at dataset download. I leave the code below for documentation, but comment it out.*

Now all datatypes should be fine, let's check that we can indeed remove the rows with NULL for `gender` or `nationality`. These should correspond to the totals, which we can easily reconstruct.

In [20]:
#st_all.loc[(st_all.subj_code =='SF142') & (st_all.acyear == 2018)].sort_values(['nationality','gender'])

If the concept is right, the sum of all stud_counts where neither `gender` nor `nationality` is na should be 1088.

In [21]:
#st_all.loc[(st_all.subj_code =='SF142') & (st_all.acyear == 2018) & (st_all.nationality.isna() == False) & (st_all.gender.isna() == False)].stud_count.sum()

This checks out, so we can (and should) indeed remove all rows with NaN in either of those two columns. 
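The logic of this check can be illustrated on a toy frame. Only the grand total of 1088 is taken from above; the breakdown into the four detail rows is invented, and the snippet assumes that totals rows carry NaN in the column they aggregate over.

```python
import numpy as np
import pandas as pd

# Toy data for one subject and year: 4 detail rows, 4 marginal totals, 1 grand total
toy = pd.DataFrame({
    'nationality': ['NATD', 'NATD', 'NATA', 'NATA', 'NATD', 'NATA', np.nan, np.nan, np.nan],
    'gender':      ['GESM', 'GESW', 'GESM', 'GESW', np.nan, np.nan, 'GESM', 'GESW', np.nan],
    'stud_count':  [400, 500, 100, 88, 900, 188, 500, 588, 1088],
})

# detail rows: neither nationality nor gender is missing
detail = toy.dropna(subset=['nationality', 'gender'])

# grand total row: both columns are missing
grand_total = toy.loc[toy.nationality.isna() & toy.gender.isna(), 'stud_count'].iloc[0]

# the detail rows should add up to the grand total
print(detail.stud_count.sum() == grand_total)

# marginal totals can be reconstructed from the detail rows when needed
by_nationality = detail.groupby('nationality')['stud_count'].sum()
print(by_nationality)
```

Since every total can be rebuilt with a `groupby`, keeping only the detail rows loses no information and avoids double counting.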

In [22]:
#st_all.isna().sum()

In [23]:
#st_all = st_all.dropna(subset=['gender','nationality'])

In [24]:
#st_all.isna().sum()

In [25]:
#st_all.head()

### Saving

Great, no na values left! The dataset should be usable now (barring further extension for subject clustering). Let's reset the index and save the cleaned up version for easier access.

In [26]:
#st_all.reset_index(drop=True,inplace=True)
#st_all.to_csv('./datasets/GER_2018_2023_students_all_cleaned.csv')

## Create cleaning function

Based on the findings above, I create a function to encapsulate the cleaning of this particular data format; I have at least one other student dataset, after all. I leave the code above intact for documentation purposes.

Different tables from the Statistisches Bundesamt might require further adaptations of the function.

In [36]:
# cleaning functions
def clean_dat(in_df, dropcols, rencols, csvname, complexdate=False):
    '''Function to streamline cleaning student data from Statistisches Bundesamt
    
    in_df: the input dataframe
    dropcols: list of column names to drop
    rencols: dictionary of column name changes
    csvname: string for the path and name of the cleaned csv file
    complexdate: set to True if `time` looks like '2018-10P6M' rather than a plain year
    '''
    df = in_df.copy()
    
    # replace all instances of '-' in `value` by '0' and cast as int
    # doing this before renaming to slightly increase chances of portability of code
    # (still assumes that column 'value' exists and works the same)
    if 'value' in df.columns:
        df['value'] = df['value'].str.replace('-','0').astype(int)
    else:
        # only print a short message; interpolating the dataframe would dump its contents
        print('Column `value` not found in input dataframe')
    
    # transform `time` to a plain year integer, corresponding to starting date of academic year
    if 'time' in df.columns:
        if complexdate:
            df['time'] = df['time'].str.split('-').str[0].astype(int)
        else:                                   # if complexdate flag is not set, simply cast `time` to int
            df['time'] = df['time'].astype(int)
    else:
        print('Column `time` not found in input dataframe')

    # removing columns
    df = df.drop(dropcols,axis='columns')
    
    # renaming columns
    df = df.rename(columns=rencols)
    
    # reset index and save
    df.reset_index(drop=True,inplace=True)
    df.to_csv(csvname)
    
    return df
    
    


## Student data - total

Applying the cleaning function.

In [None]:
stud_drop = ['statistics_code', 'statistics_label', 
            'time_code', 'time_label',
            '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',   
            '2_variable_code', '2_variable_label', '2_variable_attribute_label',     
            '3_variable_code', '3_variable_label', '3_variable_attribute_label',
            '4_variable_code', '4_variable_label', 
            'value_unit', 'value_variable_code', 'value_variable_label']

stud_remap = {
    'time': 'year',
    '2_variable_attribute_code': 'nationality',
    '3_variable_attribute_code': 'gender',
    '4_variable_attribute_code': 'subj_code',
    '4_variable_attribute_label': 'subj_name',
    'value': 'number'
    }      # dict for remapping the column names

In [38]:
st_all = clean_dat(stud,stud_drop,stud_remap,'./datasets/clean_stud_all_GER_2018_2023.csv',complexdate=True)

In [39]:
st_all.dtypes

acyear          int64
nationality    object
gender         object
subj_code      object
subj_name      object
stud_count      int64
dtype: object

## Student data - freshers

In [33]:
freshers.head()

Unnamed: 0,statistics_code,statistics_label,time_code,time_label,time,1_variable_code,1_variable_label,1_variable_attribute_code,1_variable_attribute_label,2_variable_code,...,3_variable_attribute_code,3_variable_attribute_label,4_variable_code,4_variable_label,4_variable_attribute_code,4_variable_attribute_label,value,value_unit,value_variable_code,value_variable_label
0,21311,Statistik der Studenten,SEMEST,Semester,2018-10P6M,DINSG,Deutschland insgesamt,DG,Deutschland,NAT,...,GESM,männlich,BILSF1,Studienfach,SF241,Kerntechnik/Kernverfahrenstechn.(ab 2020 zu SF...,-,Anzahl,BIL016,Studienanfänger
1,21311,Statistik der Studenten,SEMEST,Semester,2022-10P6M,DINSG,Deutschland insgesamt,DG,Deutschland,NAT,...,GESW,weiblich,BILSF1,Studienfach,SF220,Milch- und Molkereiwirtschaft,3,Anzahl,BIL016,Studienanfänger
2,21311,Statistik der Studenten,SEMEST,Semester,2022-10P6M,DINSG,Deutschland insgesamt,DG,Deutschland,NAT,...,GESW,weiblich,BILSF1,Studienfach,SF280,Kartografie,11,Anzahl,BIL016,Studienanfänger
3,21311,Statistik der Studenten,SEMEST,Semester,2023-10P6M,DINSG,Deutschland insgesamt,DG,Deutschland,NAT,...,GESM,männlich,BILSF1,Studienfach,SF086,"Katholische Theologie, - Religionslehre",32,Anzahl,BIL016,Studienanfänger
4,21311,Statistik der Studenten,SEMEST,Semester,2023-10P6M,DINSG,Deutschland insgesamt,DG,Deutschland,NAT,...,GESW,weiblich,BILSF1,Studienfach,SF272,Alte Geschichte,2,Anzahl,BIL016,Studienanfänger


In [40]:
# clean freshers data
st_fresh = clean_dat(freshers,stud_drop,stud_remap,'./datasets/clean_stud_fresh_GER_2018_2023.csv',complexdate=True)

In [41]:
st_fresh.head()

Unnamed: 0,acyear,nationality,gender,subj_code,subj_name,stud_count
0,2018,NATD,GESM,SF241,Kerntechnik/Kernverfahrenstechn.(ab 2020 zu SF...,0
1,2022,NATD,GESW,SF220,Milch- und Molkereiwirtschaft,3
2,2022,NATA,GESW,SF280,Kartografie,11
3,2023,NATA,GESM,SF086,"Katholische Theologie, - Religionslehre",32
4,2023,NATA,GESW,SF272,Alte Geschichte,2


In [44]:
st_fresh.isna().sum()

acyear         0
nationality    0
gender         0
subj_code      0
subj_name      0
stud_count     0
dtype: int64

`st_fresh` is clean.

## Exam data

In [53]:
exams.head().T

Unnamed: 0,0,1,2,3,4
statistics_code,21321,21321,21321,21321,21321
statistics_label,Statistik der Prüfungen,Statistik der Prüfungen,Statistik der Prüfungen,Statistik der Prüfungen,Statistik der Prüfungen
time_code,JAHR,JAHR,JAHR,JAHR,JAHR
time_label,Jahr,Jahr,Jahr,Jahr,Jahr
time,2022,2021,2021,2021,2022
1_variable_code,DINSG,DINSG,DINSG,DINSG,DINSG
1_variable_label,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt
1_variable_attribute_code,DG,DG,DG,DG,DG
1_variable_attribute_label,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland
2_variable_code,NAT,NAT,NAT,NAT,NAT


In [54]:
ex_drop = ['statistics_code', 'statistics_label', 
            'time_code', 'time_label',
            '1_variable_code', '1_variable_label', '1_variable_attribute_code', '1_variable_attribute_label',   
            '2_variable_code', '2_variable_label', '2_variable_attribute_label',     
            '3_variable_code', '3_variable_label', '3_variable_attribute_label',
            '4_variable_code', '4_variable_label', '4_variable_attribute_label',
            '5_variable_code', '5_variable_label',
            'value_unit', 'value_variable_code', 'value_variable_label']

ex_remap = {
    'time': 'year',
    '2_variable_attribute_code': 'nationality',
    '3_variable_attribute_code': 'gender',
    '4_variable_attribute_code': 'passed',
    '5_variable_attribute_code': 'subj_code',
    '5_variable_attribute_label': 'subj_name',
    'value': 'number'
    } 

In [55]:
exams_cl = clean_dat(exams,ex_drop,ex_remap,'./datasets/clean_exams_GER_2018_2023.csv')

In [57]:
exams_cl.isna().sum()

year           0
nationality    0
gender         0
passed         0
subj_code      0
subj_name      0
number         0
dtype: int64

In [59]:
exams_cl.head()

Unnamed: 0,year,nationality,gender,passed,subj_code,subj_name,number
0,2022,NATA,GESW,PRUEFBESTN,SF181,Wirtschaftspädagogik,1
1,2021,NATA,GESW,PRUEFBESTJ,SF042,Wirtschaftsrecht,193
2,2021,NATD,GESW,PRUEFBESTJ,SF294,Materialwissenschaften,81
3,2021,NATA,GESW,PRUEFBESTN,SF008,Anglistik/Englisch,6
4,2022,NATA,GESM,PRUEFBESTN,SF255,Archivwesen,0


## Personnel data

In [None]:
# inspecting personnel data structure
pers.head().T

Unnamed: 0,0,1,2,3,4
statistics_code,21341,21341,21341,21341,21341
statistics_label,Statistik des Hochschulpersonals,Statistik des Hochschulpersonals,Statistik des Hochschulpersonals,Statistik des Hochschulpersonals,Statistik des Hochschulpersonals
time_code,JAHR,JAHR,JAHR,JAHR,JAHR
time_label,Jahr,Jahr,Jahr,Jahr,Jahr
time,2018,2018,2021,2022,2020
1_variable_code,DINSG,DINSG,DINSG,DINSG,DINSG
1_variable_label,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt
1_variable_attribute_code,DG,DG,DG,DG,DG
1_variable_attribute_label,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland
2_variable_code,BERLF1,BERLF1,BERLF1,BERLF1,BERLF1


In [34]:
pers.columns

Index(['statistics_code', 'statistics_label', 'time_code', 'time_label',
       'time', '1_variable_code', '1_variable_label',
       '1_variable_attribute_code', '1_variable_attribute_label',
       '2_variable_code', '2_variable_label', '2_variable_attribute_code',
       '2_variable_attribute_label', '3_variable_code', '3_variable_label',
       '3_variable_attribute_code', '3_variable_attribute_label', 'value',
       'value_unit', 'value_variable_code', 'value_variable_label'],
      dtype='object')

In [None]:
pers_drop = [
    'statistics_code', 'statistics_label', 'time_code', 'time_label',
    '1_variable_code', '1_variable_label',
       '1_variable_attribute_code', '1_variable_attribute_label',
       '2_variable_code', '2_variable_label', '3_variable_code', '3_variable_label',
        '3_variable_attribute_label',
       'value_unit', 'value_variable_code', 'value_variable_label'
]

pers_rename ={
    'time': 'year',
    '2_variable_attribute_label': 'subj_grp_name',
    '2_variable_attribute_code': 'subj_grp_code',
    '3_variable_attribute_code': 'gender',
    'value': 'number'
}



In [42]:
pers_cl = clean_dat(pers,pers_drop,pers_rename,'./datasets/clean_personell_GER_2018_2023.csv')

In [45]:
pers_cl.isna().sum()

year              0
subj_grp_code    12
subj_grp_name     0
gender            0
number            0
dtype: int64

In [46]:
pers_cl.loc[pers_cl.subj_grp_code.isna()]

Unnamed: 0,year,subj_grp_code,subj_grp_name,gender,number
1002,2023,,Insgesamt,GESM,161195
1012,2019,,Insgesamt,GESM,155848
1024,2021,,Insgesamt,GESM,161505
1035,2023,,Insgesamt,GESW,120393
1041,2020,,Insgesamt,GESM,159568
1045,2018,,Insgesamt,GESM,154020
1054,2022,,Insgesamt,GESW,117180
1079,2018,,Insgesamt,GESW,101522
1083,2020,,Insgesamt,GESW,109707
1086,2022,,Insgesamt,GESM,161055


In [47]:
pers_cl.loc[pers_cl.subj_grp_name=='Insgesamt']

Unnamed: 0,year,subj_grp_code,subj_grp_name,gender,number
1002,2023,,Insgesamt,GESM,161195
1012,2019,,Insgesamt,GESM,155848
1024,2021,,Insgesamt,GESM,161505
1035,2023,,Insgesamt,GESW,120393
1041,2020,,Insgesamt,GESM,159568
1045,2018,,Insgesamt,GESM,154020
1054,2022,,Insgesamt,GESW,117180
1079,2018,,Insgesamt,GESW,101522
1083,2020,,Insgesamt,GESW,109707
1086,2022,,Insgesamt,GESM,161055


Both filters return the same rows: the missing codes belong to the "Insgesamt" (total) rows. We don't need these, so we drop them.

In [48]:
pers_cl.dropna(subset=['subj_grp_code'],inplace=True)
pers_cl.isna().sum()

year             0
subj_grp_code    0
subj_grp_name    0
gender           0
number           0
dtype: int64

`pers_cl` is now clean.

## Professorial data

In [49]:
prof.head().T

Unnamed: 0,0,1,2,3,4
statistics_code,21341,21341,21341,21341,21341
statistics_label,Statistik des Hochschulpersonals,Statistik des Hochschulpersonals,Statistik des Hochschulpersonals,Statistik des Hochschulpersonals,Statistik des Hochschulpersonals
time_code,JAHR,JAHR,JAHR,JAHR,JAHR
time_label,Jahr,Jahr,Jahr,Jahr,Jahr
time,2018,2018,2021,2022,2020
1_variable_code,DINSG,DINSG,DINSG,DINSG,DINSG
1_variable_label,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt,Deutschland insgesamt
1_variable_attribute_code,DG,DG,DG,DG,DG
1_variable_attribute_label,Deutschland,Deutschland,Deutschland,Deutschland,Deutschland
2_variable_code,BERLF1,BERLF1,BERLF1,BERLF1,BERLF1


The structure looks to be the same as for `pers`, so we treat it in parallel.

In [50]:
prof_cl = clean_dat(prof,pers_drop,pers_rename,'./datasets/clean_prof_GER_2018_2023.csv')
prof_cl.isna().sum()

year              0
subj_grp_code    12
subj_grp_name     0
gender            0
number            0
dtype: int64

In [51]:
prof_cl.dropna(subset=['subj_grp_code'],inplace=True)
prof_cl.isna().sum()

year             0
subj_grp_code    0
subj_grp_name    0
gender           0
number           0
dtype: int64

We could consider merging the two student datasets and the two personnel-related datasets with the addition of a distinguishing column. However, for the students the freshers presumably represent a subset of the total number of students -- is that a problem? Maybe not with labels 'total' vs 'fresher'.

Same holds for professors actually. So maybe integrating both pairs of datasets is a good idea. 
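A minimal sketch of what such an integration could look like for the student data. The frames, the toy numbers, and the column name `group` are only for illustration; the real frames would be `st_all` and `st_fresh`.

```python
import pandas as pd

# Toy stand-ins for the two cleaned student frames
all_students = pd.DataFrame({
    'year': [2023, 2023],
    'subj_code': ['SF280', 'SF272'],
    'number': [44, 18],
})
new_students = pd.DataFrame({
    'year': [2023, 2023],
    'subj_code': ['SF280', 'SF272'],
    'number': [11, 2],
})

# stack both frames and mark where each row comes from
combined = pd.concat(
    [all_students.assign(group='total'), new_students.assign(group='fresher')],
    ignore_index=True,
)

# freshers are a subset of all students, so filter on `group` before summing
total_only = combined.loc[combined.group == 'total', 'number'].sum()

# a wide view puts both counts side by side and allows computing shares
wide = combined.pivot(index=['year', 'subj_code'], columns='group', values='number')
wide['fresher_share'] = wide['fresher'] / wide['total']
print(wide)
```

The subset relation only becomes a problem if rows from both groups are summed together, so any aggregation on the combined frame has to filter or group by the distinguishing column first.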

# Further EDA on cleaned dataset

In [None]:
st_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   acyear       7032 non-null   int64 
 1   nationality  7032 non-null   object
 2   gender       7032 non-null   object
 3   subj_code    7032 non-null   object
 4   subj_name    7032 non-null   object
 5   stud_count   7032 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 329.8+ KB


In [None]:
st_all.describe()

Unnamed: 0,acyear,stud_count
count,7032.0,7032.0
mean,2020.5,2479.224261
std,1.707947,8134.877221
min,2018.0,0.0
25%,2019.0,44.0
50%,2020.5,286.0
75%,2022.0,1245.75
max,2023.0,114157.0


In [30]:
dictionaire = {
    'mainclass1': {
        'subclass1': {
            'subjcode111': 'subjectname111',
            'subjcode112': 'subjectname112',
        },
        'subclass2': {
            'subjcode121': 'subjectname121',
            'subjcode122': 'subjectname122',
        }
    }
}

In [31]:
test = pd.DataFrame(dictionaire)
test.head()

Unnamed: 0,mainclass1
subclass1,"{'subjcode111': 'subjectname111', 'subjcode112..."
subclass2,"{'subjcode121': 'subjectname121', 'subjcode122..."


Prompt for data extraction from the PDF (request for csv in previous context), given to Claude. Initially, I tried ChatGPT, but it struggled with the PDF format (presumably the three columns) and hit its limit while processing.

> **Prompt**:
>
> The uploaded pdf contains three columns per page in the main text body. The document is in German. Do not change any of the subject titles, but drop any footnote markers. Extract the following data into a csv-file:
> - "Fächergruppe" into the columns 'subj_group_code' for the numerical code and 'subj_group_name' for the name (e.g. Geisteswissenschaften)
> - "Studienbereich" (underlined in the pdf) into columns 'subj_cluster_code' and 'subj_cluster_name' respectively
> - "Studienfach" into columns 'subj_code' and 'subj_name'
>
> For each of the three levels, also provide a column '..._name_eng' with an approximate English translation for the respective German '..._name'. The translation need not be literal, but can consider more common English terminology as long as the content is coherent. Make sure to use the same translation for any recurring term.


**To note for future reference. I had originally suspected a dictionary/JSON would be the simplest representation of the data (albeit requiring some post-processing to generate a table later), but ChatGPT had suggested I generate a csv in an earlier chat. However, it may actually have considered the ease with which I would be able to use the data rather than its own computational efficiency.**

(Note that I was not able to use this prompt in the same chat that generated the csv-file due to length restrictions on the chat. So I am still not quite sure if repeating values may have been stored because this answer is a bit more general.)

> **Prompt**
> 
> I just had you create a csv file with slightly under 300 rows of data extracted from a pdf involving an underlying nested structure of subject groups and their codes, subject clusters the groups consist of including their numerical code and subjects including their code. Since the pdf was in German, I additionally asked for an additional column in the csv file with an English translation for each name column of the three structural levels. For the higher level values, there was naturally a lot of repetition due to the large number of subjects. For future reference, in such contexts, do the repeating names and translations have to be recomputed for every line of the csv? Would a different output formatting have been less computationally expensive for you (i.e. using less energy and/or tokens), e.g. as a python dictionary?

> **Claude reply**:
> 
> When working with hierarchical data like you described, you're right to consider computational efficiency, especially for a large dataset with repetitive elements.
> 
> For your specific case with grouped subject data in German/English:
> 
> CSV format is indeed less efficient when dealing with hierarchical data with repetition. Each time a higher-level group value appears in multiple rows, I needed to:
> 1. Store that repeated text in each row
> 2. Translate the same text multiple times (or recall the previous translation)
> 
> A more efficient approach would have been:
> 
> **Nested data structures** like a Python dictionary or JSON would better preserve the hierarchy and eliminate repetition:
> 
> ```
> {
>   "group_code1": {
>     "name_de": "Gruppe Name",
>     "name_en": "Group Name",
>     "clusters": {
>       "cluster_code1": {
>         "name_de": "Cluster Name",
>         "name_en": "Cluster Name",
>         "subjects": [
>           {"code": "subj1", "name_de": "Fach1", "name_en": "Subject1"},
>           {"code": "subj2", "name_de": "Fach2", "name_en": "Subject2"}
>         ]
>       }
>     }
>   }
> }
> ```
> 
> This structure would:
> - Store each name/translation only once
> - Preserve the hierarchy naturally
> - Require significantly fewer tokens to represent
> - Be more computationally efficient to generate
> - Use less memory/energy overall
> 
> If you need to work with this data in the future, requesting structured formats that match the natural hierarchy of your data will be both more efficient for AI processing and likely more useful for your subsequent data handling.

So for the future, generating a JSON or dictionary for this type of data and then doing the post-processing locally might be the more computationally efficient and responsible option.
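The local post-processing step could look roughly like this. The sketch assumes the nested layout from the reply above (with its placeholder codes and names) and flattens it into one row per subject, using the column names I asked for in the extraction prompt.

```python
import pandas as pd

# Placeholder hierarchy in the layout suggested in the reply above
nested = {
    'group_code1': {
        'name_de': 'Gruppe Name',
        'name_en': 'Group Name',
        'clusters': {
            'cluster_code1': {
                'name_de': 'Cluster Name',
                'name_en': 'Cluster Name',
                'subjects': [
                    {'code': 'subj1', 'name_de': 'Fach1', 'name_en': 'Subject1'},
                    {'code': 'subj2', 'name_de': 'Fach2', 'name_en': 'Subject2'},
                ],
            }
        },
    }
}

# walk the three levels and repeat the higher-level values for every subject
rows = []
for group_code, group in nested.items():
    for cluster_code, cluster in group['clusters'].items():
        for subj in cluster['subjects']:
            rows.append({
                'subj_group_code': group_code,
                'subj_group_name': group['name_de'],
                'subj_group_name_eng': group['name_en'],
                'subj_cluster_code': cluster_code,
                'subj_cluster_name': cluster['name_de'],
                'subj_cluster_name_eng': cluster['name_en'],
                'subj_code': subj['code'],
                'subj_name': subj['name_de'],
                'subj_name_eng': subj['name_en'],
            })

subjects = pd.DataFrame(rows)
print(subjects)
```

Unlike passing the nested dictionary to `pd.DataFrame` directly, this yields a flat lookup table that could be merged onto the student data via `subj_code`.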