## 1. Data access
- Source: https://www.kaggle.com/datasets/thedevastator/higher-education-predictors-of-student-retention
- Create a Kaggle account and log in
- Download the zip file, 'archive.zip'
- Extract it; the extracted file is called 'dataset.csv'
- Copy 'dataset.csv' to the location of this notebook
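
The unzip-and-copy steps above can also be scripted. This is a minimal sketch, assuming `archive.zip` has already been downloaded from Kaggle by hand; the helper name `extract_dataset` is ours and not part of any library.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, member="dataset.csv", dest_dir="."):
    """Extract one file from a zip archive and return the path it was written to."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member!r} not found in {zip_path}")
        # ZipFile.extract() returns the path of the file it created
        return Path(archive.extract(member, path=dest))


# Usage, from the notebook directory:
# extract_dataset("archive.zip")
```

Checking `namelist()` first gives a clear error if Kaggle ever ships the file under a different name.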

The official data repository is Zenodo, and the dataset can be cited as: Valentim Realinho, Jorge Machado, Luís Baptista, & Mónica V. Martins. (2021). Predict students' dropout and academic success (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5777340


## 2. Data exploration

### MDPI Data Descriptor 
The data description, the meaning of each category, and a detailed explanation of the columns and values were found here:
- https://doi.org/10.3390/data7110146


### Early Prediction of student’s Performance in Higher Education: A Case Study
- We read the linked paper, DOI: 10.1007/978-3-030-72657-7_16
- The dataset relates to this paper, which was published as a book chapter by Mónica V. Martins, Daniel Tolledo, Jorge Machado, Luís M. T. Baptista, and Valentim Realinho
- Our initial plan is to reproduce the figures of the paper



Dataset description, taken from https://www.kaggle.com/datasets/thedevastator/higher-education-predictors-of-student-retention:

This dataset contains data from a higher education institution on various variables related to undergraduate students, including demographics, socio-economic factors, and academic performance, to investigate the impact of these factors on student dropout and academic success.

| Column name | Description | Type | Our comments |
|----|-----|-----|----|
| Marital status | The marital status of the student. | Categorical | Family responsibilities impacting dropouts? |
| Application mode | The method of application used by the student. | Categorical | |
| Application order | The order in which the student applied. | Numerical | Do students choose several programs when applying? No information listed in MDPI |
| Course | The course taken by the student. | Categorical | |
| Daytime/evening attendance | Whether the student attends classes during the day or in the evening. | Categorical | Implications for dropouts? Daytime - 1, Evening - 0 |
| Previous qualification | The qualification obtained by the student before enrolling in higher education. | Categorical | Implications for dropouts? |
| Nacionality | The nationality of the student. | Categorical | More commitment from foreign students? |
| Mother's qualification | The qualification of the student's mother. | Categorical | See MDPI for what the numbers mean |
| Father's qualification | The qualification of the student's father. | Categorical | See MDPI for what the numbers mean |
| Mother's occupation | The occupation of the student's mother. | Categorical | See MDPI for what the numbers mean |
| Father's occupation | The occupation of the student's father. | Categorical | See MDPI for what the numbers mean |
| Displaced | Whether the student is a displaced person. | Categorical | Refugee? Not described in MDPI. Yes - 1, No - 0 |
| Educational special needs | Whether the student has any special educational needs. | Categorical | Yes - 1, No - 0 |
| Debtor | Whether the student is a debtor. | Categorical | Yes - 1, No - 0 |
| Tuition fees up to date | Whether the student's tuition fees are up to date. | Categorical | Yes - 1, No - 0 |
| Gender | The gender of the student. | Categorical | Male - 1, Female - 0 |
| Scholarship holder | Whether the student is a scholarship holder. | Categorical | Yes - 1, No - 0 |
| Age at enrollment | The age of the student at the time of enrollment. | Numerical | |
| International | Whether the student is an international student. | Categorical | Yes - 1, No - 0 |
| Curricular units 1st sem (credited) | The number of curricular units credited by the student in the first semester. | Numerical | |
| Curricular units 1st sem (enrolled) | The number of curricular units enrolled by the student in the first semester. | Numerical | We assumed this means registered for a course |
| Curricular units 1st sem (evaluations) | The number of curricular units evaluated by the student in the first semester. | Numerical | |
| Curricular units 1st sem (approved) | The number of curricular units approved by the student in the first semester. | Numerical | We assumed this means passed a course |


In [9]:
import pandas as pd
import numpy as np

In [10]:
# open the dataset and display the first rows
df = pd.read_csv("dataset.csv", sep=',')
df.head(4)

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Nacionality,Mother's qualification,Father's qualification,Mother's occupation,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,8,5,2,1,1,1,13,10,6,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,6,1,11,1,1,1,1,3,4,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,5,1,1,1,22,27,10,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,8,2,15,1,1,1,23,27,6,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate


In [32]:
df.shape

(4424, 35)

In [33]:
df = df.dropna()  # drop rows containing NA values; an unchanged shape means there were none

In [34]:
df.shape

(4424, 35)
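
The shape is the same before and after `dropna()`, so no rows were removed. Counting missing values directly makes that check explicit and does not modify the frame. A minimal sketch on a toy frame (the values are made up for illustration); on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "Age at enrollment": [20, 19, np.nan],
    "Target": ["Dropout", "Graduate", "Enrolled"],
})

missing_per_column = toy.isna().sum()          # NA count for each column
total_missing = int(missing_per_column.sum())  # NA count for the whole frame
print(missing_per_column)
print("total missing:", total_missing)
```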

In [35]:
df['Target'].value_counts()

Graduate    2209
Dropout     1421
Enrolled     794
Name: Target, dtype: int64

The authors stated that “The final dataset consisted of 3623 records and 25 independent variables” (page 169).
However, the downloaded file has 4424 rows and 35 columns (34 features plus `Target`).
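
The arithmetic behind this mismatch can be checked from the class counts printed above. The sketch below assumes, without confirmation from the paper, that the authors might have excluded the `Enrolled` class. Even under that assumption the count is 3630 rather than 3623, so the difference remains unexplained.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
target_counts = pd.Series({"Graduate": 2209, "Dropout": 1421, "Enrolled": 794})

total_rows = int(target_counts.sum())
# Hypothetical filter: keep only students with a final outcome.
without_enrolled = int(target_counts.drop("Enrolled").sum())

print("all rows:", total_rows)
print("without 'Enrolled':", without_enrolled)
print("difference to the paper's 3623:", without_enrolled - 3623)
```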

In [36]:
for i in df.columns:
    print(i) 
    #print (df[i].value_counts())

Marital status
Application mode
Application order
Course
Daytime/evening attendance
Previous qualification
Nacionality
Mother's qualification
Father's qualification
Mother's occupation
Father's occupation
Displaced
Educational special needs
Debtor
Tuition fees up to date
Gender
Scholarship holder
Age at enrollment
International
Curricular units 1st sem (credited)
Curricular units 1st sem (enrolled)
Curricular units 1st sem (evaluations)
Curricular units 1st sem (approved)
Curricular units 1st sem (grade)
Curricular units 1st sem (without evaluations)
Curricular units 2nd sem (credited)
Curricular units 2nd sem (enrolled)
Curricular units 2nd sem (evaluations)
Curricular units 2nd sem (approved)
Curricular units 2nd sem (grade)
Curricular units 2nd sem (without evaluations)
Unemployment rate
Inflation rate
GDP
Target


In [37]:
# taken from the paper; grouping of the variables follows the MDPI data descriptor
DemographicFactors = ['Marital status',
                      'Gender',
                      'Nacionality',        # spelled 'Nacionality' (sic) in the dataset's column name
                      'Displaced',
                      'Age at enrollment']

In [38]:
#taken from the paper
socioEconomicFactors = ["Mother's qualification",   # straight apostrophes, to match df.columns
                        "Father's qualification",
                        "Mother's occupation",
                        "Father's occupation",
                        'Educational special needs',
                        'Debtor',
                        'Tuition fees up to date',
                        'Scholarship holder']
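
Hand-typed column lists like these are easy to get subtly wrong, so it can help to compare each list against the real column names before indexing `df` with it. A minimal sketch; the helper `missing_columns` is a hypothetical name of ours, and the example deliberately shows two typical mistakes.

```python
def missing_columns(available, wanted):
    """Return the names in `wanted` that are not present in `available`."""
    available = set(available)
    return [name for name in wanted if name not in available]


# A subset of the real column names, as printed by the loop over df.columns.
columns = ["Mother's qualification", "Debtor", "Tuition fees up to date"]

# Two typical mistakes: a typographic apostrophe (’) instead of ('), and a
# missing comma, which makes Python join adjacent string literals into one.
wanted = ["Mother’s qualification", "Debtor" "Tuition fees up to date"]

print(missing_columns(columns, wanted))
```

On the real data the call would be `missing_columns(df.columns, socioEconomicFactors)`; an empty list means every name matches.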

In [39]:
#taken from the paper
macroEconomicFactors = ['Unemployment rate',                     
                       'Inflation rate', 
                       'GDP'] 


In [40]:
#taken from the paper
academicDataAtEnrollment =  ['Application mode', 
                           'Application order',           
                           'Course',             
                           'Daytime/evening attendance',    
                           'Previous qualification']

In [None]:
# The Portuguese grading system
# https://www.studyineurope.eu/study-in-portugal/grades


In [41]:
#taken from the paper
academicData1stSemester =  ['Curricular units 1st sem (credited)', 
                           'Curricular units 1st sem (enrolled)',           
                           'Curricular units 1st sem (evaluations)',             
                           'Curricular units 1st sem (approved)',    
                           'Curricular units 1st sem (grade)',                    
                           'Curricular units 1st sem (without evaluations)']                   

In [42]:
#taken from the paper
academicData2ndSemester =  ['Curricular units 2nd sem (credited)', 
                           'Curricular units 2nd sem (enrolled)',           
                           'Curricular units 2nd sem (evaluations)',             
                           'Curricular units 2nd sem (approved)',    
                           'Curricular units 2nd sem (grade)',                    
                           'Curricular units 2nd sem (without evaluations)']                   

In [43]:
for i in academicData1stSemester: 
    print (df[i].value_counts())

0     3847
2       94
1       85
3       69
6       51
4       47
7       41
5       41
8       31
9       27
11      17
10      15
14      15
13      13
12      12
15       5
18       4
17       3
16       3
19       2
20       2
Name: Curricular units 1st sem (credited), dtype: int64
6     1910
5     1010
7      656
8      296
0      180
12      66
10      52
11      45
9       36
15      25
14      22
4       21
13      20
18      19
17      16
16      13
3       10
2        9
1        7
21       6
19       2
23       2
26       1
Name: Curricular units 1st sem (enrolled), dtype: int64
8     791
7     703
6     598
9     402
0     349
10    340
11    239
12    223
5     220
13    140
14    105
15     70
16     47
17     33
18     30
19     23
4      19
21     17
20     12
22     10
23      9
2       8
3       6
24      6
1       6
26      4
25      3
27      2
29      2
45      2
32      1
36      1
31      1
28      1
33      1
Name: Curricular units 1st sem (evaluations), dtype: i

In [46]:
df[academicData1stSemester].head(10)

Unnamed: 0,Curricular units 1st sem (credited),Curricular units 1st sem (enrolled),Curricular units 1st sem (evaluations),Curricular units 1st sem (approved),Curricular units 1st sem (grade),Curricular units 1st sem (without evaluations)
0,0,0,0,0,0.0,0
1,0,6,6,6,14.0,0
2,0,6,0,0,0.0,0
3,0,6,8,6,13.428571,0
4,0,6,9,5,12.333333,0
5,0,5,10,5,11.857143,0
6,0,7,9,7,13.3,0
7,0,5,5,0,0.0,0
8,0,6,8,6,13.875,0
9,0,6,9,5,11.4,0


In [44]:
for i in academicData2ndSemester: 
    print (df[i].value_counts())

0     3894
1      107
2       92
4       78
5       68
3       49
6       26
11      20
7       16
9       15
12      14
10      13
8       12
13       9
14       4
15       2
16       2
18       2
19       1
Name: Curricular units 2nd sem (credited), dtype: int64
6     1913
5     1054
8      661
7      304
0      180
11      60
9       50
10      48
12      44
13      37
14      22
4       17
17      12
2        5
19       3
3        3
1        3
15       2
23       2
18       2
16       1
21       1
Name: Curricular units 2nd sem (enrolled), dtype: int64
8     792
6     614
7     563
9     456
0     401
10    355
5     288
11    255
12    226
13    126
14     98
15     73
16     49
17     25
18     22
19     19
4      10
21     10
22     10
20      8
2       4
23      4
26      3
24      3
1       3
3       2
27      2
28      1
25      1
33      1
Name: Curricular units 2nd sem (evaluations), dtype: int64
6     965
0     870
5     726
4     414
7     331
8     321
3     285
2     19

In [None]:
#https://towardsdatascience.com/performance-metrics-confusion-matrix-precision-recall-and-f1-score-a8fe076a2262

#The Confusion Matrix & Precision-Recall Tradeoff
#https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/confusion-matrix-precision-recall-tradeoff/
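
The metrics discussed in the two links above can be computed by hand from the four confusion-matrix cells. A minimal sketch in plain Python, treating one class as "positive"; the labels and predictions below are made up for illustration and are not model results.

```python
def binary_metrics(y_true, y_pred, positive):
    """Confusion-matrix cells plus precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only.
y_true = ["Dropout", "Dropout", "Graduate", "Graduate", "Dropout", "Graduate"]
y_pred = ["Dropout", "Graduate", "Graduate", "Dropout", "Dropout", "Graduate"]

metrics = binary_metrics(y_true, y_pred, positive="Dropout")
print(metrics)
```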

In [None]:
# SMOTE 
#https://www.geeksforgeeks.org/ml-handling-imbalanced-data-with-smote-and-near-miss-algorithm-in-python/
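
The `Target` classes are imbalanced (2209 Graduate, 1421 Dropout, 794 Enrolled), which is why SMOTE is of interest here. Its core idea is to create synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that idea with NumPy; `smote_sketch` is a hypothetical helper of ours, not a library function, and a maintained implementation such as the one in the `imbalanced-learn` package would be the more sensible choice for real experiments.

```python
import numpy as np


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows
    and their k nearest minority neighbours (the core idea of SMOTE)."""
    X = np.asarray(minority, dtype=float)
    if len(X) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(X) - 1)
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X), size=n_new)   # random base sample per new row
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # position along the connecting segment
    return X[base] + gap * (X[picked] - X[base])


# Illustrative minority samples with two numeric features.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sketch(minority, n_new=10, k=2)
print(synthetic.shape)
```

Plain interpolation only makes sense for numeric features. Many columns in this dataset are integer-coded categories, so they would need different handling before any oversampling.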