In [1]:
from functions import *

# Student Info

---

The student info dataframe contains information about students including the module and presentation they took, demographic information and the final result of their studies.

In [2]:
# looking at the student_info dataframe
student_info.head()

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass


## Student Info Contents

* **code_module**: The code module represents the course the student is taking.
* **code_presentation**: The code presentation is the year and semester in which the student took the course.
* **id_student**: The student ID is a unique identifier for each student.
* **gender**: The gender of the student, recorded as a binary value: 'M' for students who identify as male and 'F' for students who identify as female.
* **region**: The location of the student when they took the module. All regions are in England, Scotland, Wales or Ireland.
* **highest_education**: The student's highest level of formal academic achievement.
    - Education levels in order from least to most formal education: 
        - No formal quals (qualifications)
        - Lower Than A Level, which is nearly but not quite analogous to below high school level
        - A Level or Equivalent, which is roughly analogous to high school level, but closer to college-ready
        - HE Qualification which stands for higher education qualification
        - Post Graduate Qualification
* **imd_band**: The Index of Multiple Deprivation (IMD) band of the area where the student lived. The IMD is a commonly used measure of poverty or deprivation in UK areas. The lower the band, the more 'deprived' the area is.
* **age_band**: There are only three bins for age: 0-35, 35-55 and 55 or over.
* **num_of_prev_attempts**: The number of times the student has attempted the course previously.
* **studied_credits**: The total number of credits for the modules the student is currently studying.
* **disability**: Disability status is a binary value: 'Y' if the student identifies as having a disability and 'N' if they do not.
* **final_result**: The final result is the student's overall result in the course.
    - Possible results include:
        - Pass: The student passed the course
        - Fail: The student did not pass the course
        - Withdrawn: The student withdrew before the course term ended
        - Distinction: The student passed the course with distinction
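
The ordering of education levels described above can be encoded directly in pandas. The following is a minimal sketch, not part of the original analysis: the `EDUCATION_ORDER` list and the toy series are assumptions for illustration, with spellings taken from the values that appear in the data.

```python
import pandas as pd

# Education levels from least to most formal, as listed above.
# Spellings follow the values found in the data (e.g. 'No Formal quals').
EDUCATION_ORDER = [
    "No Formal quals",
    "Lower Than A Level",
    "A Level or Equivalent",
    "HE Qualification",
    "Post Graduate Qualification",
]

# Toy stand-in for student_info['highest_education'] (hypothetical rows).
education = pd.Series(
    ["HE Qualification", "A Level or Equivalent", "Lower Than A Level"]
)

# An ordered categorical makes sorting follow the list above
# instead of alphabetical order.
education_ordered = pd.Series(
    pd.Categorical(education, categories=EDUCATION_ORDER, ordered=True)
)
sorted_levels = education_ordered.sort_values().tolist()
print(sorted_levels)
```

With an ordered categorical, plots and group summaries can be arranged from least to most formal education without manual reordering.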

* `num_of_prev_attempts` will be renamed to `prev_attempts` to save space.

In [3]:
# rename num_of_prev_attempts column to prev_attempts to save space
student_info = student_info.rename(columns={'num_of_prev_attempts':'prev_attempts'})

---

## Student Info Information

In [4]:
# get size counts of student_info
get_size(student_info)

Unnamed: 0,Count
Columns,12
Rows,32593


In [5]:
md(f'''
Student Info has {len(student_info.columns)} columns and {"{:,}".format(len(student_info))} rows
''')


Student Info has 12 columns and 32,593 rows


In [6]:
# show student info data types
student_info.dtypes

code_module          object
code_presentation    object
id_student            int64
gender               object
region               object
highest_education    object
imd_band             object
age_band             object
prev_attempts         int64
studied_credits       int64
disability           object
final_result         object
dtype: object

* `id_student` is currently an `int64`, but it is an identifier rather than a quantity, so it is more appropriate to recast it as a `string`.
* `object` columns can hold mixed types and behave unexpectedly, so they are recast to the dedicated `string` dtype.


In [7]:
# cast id_student to str, then convert all columns to pandas' nullable dtypes (object -> string, int64 -> Int64)
student_info['id_student'] = student_info['id_student'].astype(str)
student_info = student_info.convert_dtypes()
student_info.dtypes

code_module          string
code_presentation    string
id_student           string
gender               string
region               string
highest_education    string
imd_band             string
age_band             string
prev_attempts         Int64
studied_credits       Int64
disability           string
final_result         string
dtype: object
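
One practical effect of the recast can be checked on a small example. This is an illustrative sketch on a toy frame (the rows are made up for the demonstration, not read from the dataset): once `id_student` is a string, it no longer appears in numeric summaries such as `describe()`.

```python
import pandas as pd

# Toy frame with the same kinds of columns as student_info (hypothetical rows).
toy = pd.DataFrame({
    "id_student": [11391, 28400, 30268],
    "studied_credits": [240, 60, 60],
    "final_result": ["Pass", "Pass", "Withdrawn"],
})

# While id_student is int64, describe() treats it as a numeric variable.
numeric_before = list(toy.describe().columns)

# Same two steps as in the cell above.
toy["id_student"] = toy["id_student"].astype(str)
toy = toy.convert_dtypes()

# After the cast only the genuinely numeric column is summarised.
numeric_after = list(toy.describe().columns)
print(numeric_before, numeric_after)
```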

**Null Values**

In [8]:
null_vals(student_info)

index,Null Values
code_module,0
code_presentation,0
id_student,0
gender,0
region,0
highest_education,0
imd_band,1111
age_band,0
prev_attempts,0
studied_credits,0


In [9]:
# store sum of imd null values
imd_null = student_info['imd_band'].isnull().sum()
md(f'''The imd_band variable has {imd_null} null values which we may have to work around.''')

The imd_band variable has 1111 null values which we may have to work around.
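
The `null_vals` helper comes from the `functions` module, which is not shown here. A minimal sketch of how such a per-column null count could be produced is given below; the name `null_counts` and the toy data are assumptions, and the real helper may differ.

```python
import pandas as pd

def null_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its number of missing values."""
    return df.isnull().sum().to_frame(name="Null Values")

# Toy frame with one missing imd_band value (hypothetical rows).
toy = pd.DataFrame({
    "imd_band": ["90-100%", None, "20-30%"],
    "age_band": ["55<=", "35-55", "35-55"],
})

counts = null_counts(toy)
print(counts)
```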

**Duplicate Values**

In [10]:
# show duplicate values in student info if any
get_dupes(student_info)

There are no Duplicate Values
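
`get_dupes` is also defined in the unseen `functions` module. As a rough sketch of the underlying check, assuming it looks for fully identical rows, pandas' `DataFrame.duplicated` can be used; the function name `duplicate_rows` and the toy data are hypothetical.

```python
import pandas as pd

def duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return every row that is an exact copy of another row."""
    # keep=False marks all members of a duplicate group, not just the repeats
    return df[df.duplicated(keep=False)]

# Toy frame in which the first and last rows are identical (hypothetical rows).
toy = pd.DataFrame({
    "id_student": ["11391", "28400", "11391"],
    "code_module": ["AAA", "AAA", "AAA"],
})

dupes = duplicate_rows(toy)
print(dupes if not dupes.empty else "There are no Duplicate Values")
```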

**Unique Counts**

In [11]:
# Get number of unique values per variable in student info
count_unique(student_info)

index,Count
code_module,7
code_presentation,4
id_student,28785
gender,2
region,13
highest_education,5
imd_band,10
age_band,3
prev_attempts,7
studied_credits,61


In [12]:
# store count of total student ids
total_students = student_info['id_student'].count()
# store count of unique student ids
unique_students = student_info['id_student'].nunique()

In [19]:
md(f'''
* There are {"{:,}".format(total_students)} entries for students but only {"{:,}".format(unique_students)} unique student IDs.
* This may represent students who have taken the course more than once or who are taking multiple modules
''')


* There are 32,593 entries for students but only 28,785 unique student IDs.
* This may represent students who have taken the course more than once or who are taking multiple modules
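
The two explanations for repeated student IDs can be told apart by counting entries and distinct modules per student. The sketch below uses a toy frame with made-up rows; on the real data, the same grouping could be applied to `student_info`.

```python
import pandas as pd

# Toy stand-in for student_info (hypothetical rows).
toy = pd.DataFrame({
    "id_student": ["1", "1", "2", "2", "3"],
    "code_module": ["AAA", "AAA", "BBB", "CCC", "DDD"],
    "code_presentation": ["2013J", "2014J", "2013J", "2013J", "2014B"],
})

# Per student: number of rows and number of distinct modules.
per_student = toy.groupby("id_student").agg(
    entries=("code_module", "size"),
    modules=("code_module", "nunique"),
)

# Students who appear more than once.
repeat = per_student[per_student["entries"] > 1]

# More entries than distinct modules means some module was taken again.
retook_same_module = repeat[repeat["modules"] < repeat["entries"]].index.tolist()
# More than one distinct module means the student took several modules.
took_multiple_modules = repeat[repeat["modules"] > 1].index.tolist()

print(retook_same_module, took_multiple_modules)
```

A student can fall into both groups, for example by retaking one module while also taking another.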


**Unique Categorical Values**

In [20]:
unique_vals(student_info)

index,Values
code_module,"['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG']"
code_presentation,"['2013J', '2014J', '2013B', '2014B']"
gender,"['M', 'F']"
region,"['East Anglian Region', 'Scotland', 'North Western Region', 'South East Region', 'West Midlands Region', 'Wales', 'North Region', 'South Region', 'Ireland', 'South West Region', 'East Midlands Region', 'Yorkshire Region', 'London Region']"
highest_education,"['HE Qualification', 'A Level or Equivalent', 'Lower Than A Level', 'Post Graduate Qualification', 'No Formal quals']"
imd_band,"['90-100%', '20-30%', '30-40%', '50-60%', '80-90%', '70-80%', , '60-70%', '40-50%', '10-20%', '0-10%']"
age_band,"['55<=', '35-55', '0-35']"
disability,"['N', 'Y']"
final_result,"['Pass', 'Withdrawn', 'Fail', 'Distinction']"


In `imd_band`, the `%` sign is missing from the `10-20` band. We will add it for consistency and clarity.

In [30]:
# changing all 10-20 values in student_info imd_band to 10-20% for consistency's sake
student_info.loc[student_info['imd_band'] == '10-20', 'imd_band'] = '10-20%'
# making sure it updated
dataframe(student_info['imd_band'].explode().unique(), columns=['imd_band']).sort_values(by='imd_band').reset_index(drop=True)

Unnamed: 0,imd_band
0,0-10%
1,10-20%
2,20-30%
3,30-40%
4,40-50%
5,50-60%
6,60-70%
7,70-80%
8,80-90%
9,90-100%


**Numerical Values**

In [35]:
# show statistical breakdown of numerical values in student info
student_info.describe().round(1)

Unnamed: 0,prev_attempts,studied_credits
count,32593.0,32593.0
mean,0.2,79.8
std,0.5,41.1
min,0.0,30.0
25%,0.0,60.0
50%,0.0,60.0
75%,0.0,120.0
max,6.0,655.0


In [49]:
# store the highest number of previous attempts and the max and min studied credits
max_attempts = student_info['prev_attempts'].max()
max_credits = student_info['studied_credits'].max()
min_credits = student_info['studied_credits'].min()

In [55]:
md(f'''
* Most students do not have a previous attempt, but there is a high of {max_attempts} attempts.
    * We can only have data for up to two of a student's attempts since we only have two years' worth of data.
* The maximum number of credits a student was studying was {max_credits}.
    * This is over twenty times the minimum of {min_credits} credits.
* It is unknown how these courses were weighted, but studying this many credits at the same time may have influenced student success.
''')


* Most students do not have a previous attempt, but there is a high of 6 attempts.
    * We can only have data for up to two of a student's attempts since we only have two years' worth of data.
* The maximum number of credits a student was studying was 655.
    * This is over twenty times the minimum of 30 credits.
* It is unknown how these courses were weighted, but studying this many credits at the same time may have influenced student success.
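
The "over twenty times" comparison follows from the maximum and minimum reported in the summary table. A quick check of the arithmetic, using those two values directly:

```python
# Values taken from the describe() output above.
max_credits = 655
min_credits = 30

# Ratio of the largest to the smallest credit load.
ratio = max_credits / min_credits
print(f"{ratio:.1f}")  # 21.8
```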
