# PredictStudentsDropoutAndAcademicSuccess

# Polytechnic Institute of Portalegre

### TableOfContents

* [Initialization](#Initialization)

* [RandomForest](#RandomForest)

* [Next](#Next)

# Initialization
[TableOfContents](#TableOfContents)

In [40]:
#import
#--------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns

from mlxtend.plotting import plot_decision_regions

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, balanced_accuracy_score, f1_score
#--------------------------------------------------

#global constants
#--------------------------------------------------

#--------------------------------------------------

#load csv
#--------------------------------------------------
DF = pd.read_csv("PredictStudentsDropoutAndAcademicSuccess.csv", sep=";")
df_rf = pd.read_csv("PredictStudentsDropoutAndAcademicSuccess.csv", sep=";")
df_02 = pd.read_csv("PredictStudentsDropoutAndAcademicSuccess.csv", sep=";")
df_03 = pd.read_csv("PredictStudentsDropoutAndAcademicSuccess.csv", sep=";")

#--------------------------------------------------

#data mapping
#--------------------------------------------------
# Function to convert col to camelCase
conversion_dict = {
    "Marital status": "maritalStatus",
    "Application mode": "applicationMode",
    "Application order": "applicationOrder",
    "Course": "course",
    "Daytime/evening attendance": "daytimeEveningAttendance",
    "Previous qualification": "previousQualification",
    "Previous qualification (grade)": "previousQualificationGrade",
    "Nacionality": "nationality",
    "Mothers qualification": "motherQualification",
    "Fathers qualification": "fatherQualification",
    "Mothers occupation": "motherOccupation",
    "Fathers occupation": "fatherOccupation",
    "Admission grade": "admissionGrade",
    "Displaced": "displaced",
    "Educational special needs": "educationalSpecialNeeds",
    "Debtor": "debtor",
    "Tuition fees up to date": "tuitionFeesUpToDate",
    "Gender": "gender",
    "Scholarship holder": "scholarshipHolder",
    "Age at enrollment": "ageAtEnrollment",
    "International": "international",
    "Curricular units 1st sem (credited)": "curricularUnits1stSemCredited",
    "Curricular units 1st sem (enrolled)": "curricularUnits1stSemEnrolled",
    "Curricular units 1st sem (evaluations)": "curricularUnits1stSemEvaluations",
    "Curricular units 1st sem (approved)": "curricularUnits1stSemApproved",
    "Curricular units 1st sem (grade)": "curricularUnits1stSemGrade",
    "Curricular units 1st sem (without evaluations)": "curricularUnits1stSemWithoutEvaluations",
    "Curricular units 2nd sem (credited)": "curricularUnits2ndSemCredited",
    "Curricular units 2nd sem (enrolled)": "curricularUnits2ndSemEnrolled",
    "Curricular units 2nd sem (evaluations)": "curricularUnits2ndSemEvaluations",
    "Curricular units 2nd sem (approved)": "curricularUnits2ndSemApproved",
    "Curricular units 2nd sem (grade)": "curricularUnits2ndSemGrade",
    "Curricular units 2nd sem (without evaluations)": "curricularUnits2ndSemWithoutEvaluations",
    "Unemployment rate": "unemploymentRate",
    "Inflation rate": "inflationRate",
    "GDP": "gdp",
    "Target": "target"
}

# Marital Status
maritalStatus = {
    1: "single",
    2: "married",
    3: "widower",
    4: "divorced",
    5: "facto union",
    6: "legally separated"
}

# Application Mode
applicationMode = {
    1: "1st phase - general contingent",
    2: "Ordinance No. 612/93",
    5: "1st phase - special contingent (Azores Island)",
    7: "Holders of other higher courses",
    10: "Ordinance No. 854-B/99",
    15: "International student (bachelor)",
    16: "1st phase - special contingent (Madeira Island)",
    17: "2nd phase - general contingent",
    18: "3rd phase - general contingent",
    26: "Ordinance No. 533-A/99, item b2) (Different Plan)",
    27: "Ordinance No. 533-A/99, item b3 (Other Institution)",
    39: "Over 23 years old",
    42: "Transfer",
    43: "Change of course",
    44: "Technological specialization diploma holders",
    51: "Change of institution/course",
    53: "Short cycle diploma holders",
    57: "Change of institution/course (International)"
}

# Course
course = {
    33: "Biofuel Production Technologies",
    171: "Animation and Multimedia Design",
    8014: "Social Service (evening attendance)",
    9003: "Agronomy",
    9070: "Communication Design",
    9085: "Veterinary Nursing",
    9119: "Informatics Engineering",
    9130: "Equinculture",
    9147: "Management",
    9238: "Social Service",
    9254: "Tourism",
    9500: "Nursing",
    9556: "Oral Hygiene",
    9670: "Advertising and Marketing Management",
    9773: "Journalism and Communication",
    9853: "Basic Education",
    9991: "Management (evening attendance)"
}

# Nationality
nationality = {
    1: "Portuguese",
    2: "German",
    6: "Spanish",
    11: "Italian",
    13: "Dutch",
    14: "English",
    17: "Lithuanian",
    21: "Angolan",
    22: "Cape Verdean",
    24: "Guinean",
    25: "Mozambican",
    26: "Santomean",
    32: "Turkish",
    41: "Brazilian",
    62: "Romanian",
    100: "Moldova (Republic of)",
    101: "Mexican",
    103: "Ukrainian",
    105: "Russian",
    108: "Cuban",
    109: "Colombian"
}

# Binary attributes
daytimeEveningAttendance = {1: "daytime", 0: "evening"}
displaced = {1: "yes", 0: "no"}
educationalSpecialNeeds = {1: "yes", 0: "no"}
debtor = {1: "yes", 0: "no"}
tuitionFeesUpToDate = {1: "yes", 0: "no"}
gender = {1: "male", 0: "female"}
scholarshipHolder = {1: "yes", 0: "no"}
international = {1: "yes", 0: "no"}

# Sample dictionary to map macro data triplets to a year
econ_to_year = {
    (10.8, 1.4, 1.74): 2010,
    (13.9, -0.3, 0.79): 2011,
    (9.4, -0.8, -3.12): 2012,
    (16.2, 0.3, -0.92): 2013,
    (15.5, 2.8, -4.06): 2014,
    (8.9, 1.4, 3.51): 2015,
    (12.7, 3.7, -1.70): 2016,
    (11.1, 0.6, 2.02): 2017,
    (7.6, 2.6, 0.32): 2018,
    (12.4, 0.5, 1.79): 2019
}

# Target
targetMap = {
    "Dropout": 0,
    "Enrolled": 1,
    "Graduate": 2,
}

targetMapReverse = {v: k for k, v in targetMap.items()}

# Previous Qualification
previousQualification = {
    1: "Secondary education",
    2: "Higher education - bachelor's degree",
    3: "Higher education - degree",
    4: "Higher education - master's",
    5: "Higher education - doctorate",
    6: "Frequency of higher education",
    9: "12th year of schooling - not completed",
    10: "11th year of schooling - not completed",
    12: "Other - 11th year of schooling",
    14: "10th year of schooling",
    15: "10th year of schooling - not completed",
    19: "Basic education 3rd cycle (9th/10th/11th year) or equiv.",
    38: "Basic education 2nd cycle (6th/7th/8th year) or equiv.",
    39: "Technological specialization course",
    40: "Higher education - degree (1st cycle)",
    42: "Professional higher technical course",
    43: "Higher education - master (2nd cycle)"
}

# Mother's Qualification
motherQualification = {
    1: "Secondary Education - 12th Year of Schooling or Eq.",
    2: "Higher Education - Bachelor's Degree",
    3: "Higher Education - Degree",
    4: "Higher Education - Master's",
    5: "Higher Education - Doctorate",
    6: "Frequency of Higher Education",
    9: "12th Year of Schooling - Not Completed",
    10: "11th Year of Schooling - Not Completed",
    11: "7th Year (Old)",
    12: "Other - 11th Year of Schooling",
    14: "10th Year of Schooling",
    18: "General commerce course",
    19: "Basic Education 3rd Cycle (9th/10th/11th Year) or Equiv.",
    22: "Technical-professional course",
    26: "7th year of schooling",
    27: "2nd cycle of the general high school course",
    29: "9th Year of Schooling - Not Completed",
    30: "8th year of schooling",
    34: "Unknown",
    35: "Can't read or write",
    36: "Can read without having a 4th year of schooling",
    37: "Basic education 1st cycle (4th/5th year) or equiv.",
    38: "Basic Education 2nd Cycle (6th/7th/8th Year) or Equiv.",
    39: "Technological specialization course",
    40: "Higher education - degree (1st cycle)",
    41: "Specialized higher studies course",
    42: "Professional higher technical course",
    43: "Higher Education - Master (2nd cycle)",
    44: "Higher Education - Doctorate (3rd cycle)"
}

# Father's Qualification
fatherQualification = {
    1: "Secondary Education - 12th Year of Schooling or Eq.",
    2: "Higher Education - Bachelor's Degree",
    3: "Higher Education - Degree",
    4: "Higher Education - Master's",
    5: "Higher Education - Doctorate",
    6: "Frequency of Higher Education",
    9: "12th Year of Schooling - Not Completed",
    10: "11th Year of Schooling - Not Completed",
    11: "7th Year (Old)",
    12: "Other - 11th Year of Schooling",
    13: "2nd year complementary high school course",
    14: "10th Year of Schooling",
    18: "General commerce course",
    19: "Basic Education 3rd Cycle (9th/10th/11th Year) or Equiv.",
    20: "Complementary High School Course",
    22: "Technical-professional course",
    25: "Complementary High School Course - not concluded",
    26: "7th year of schooling",
    27: "2nd cycle of the general high school course",
    29: "9th Year of Schooling - Not Completed",
    30: "8th year of schooling",
    31: "General Course of Administration and Commerce",
    33: "Supplementary Accounting and Administration",
    34: "Unknown",
    35: "Can't read or write",
    36: "Can read without having a 4th year of schooling",
    37: "Basic education 1st cycle (4th/5th year) or equiv.",
    38: "Basic Education 2nd Cycle (6th/7th/8th Year) or Equiv.",
    39: "Technological specialization course",
    40: "Higher education - degree (1st cycle)",
    41: "Specialized higher studies course",
    42: "Professional higher technical course",
    43: "Higher Education - Master (2nd cycle)",
    44: "Higher Education - Doctorate (3rd cycle)"
}

# Combine Qualification
combineQualification = {
    1: "Secondary Education - 12th Year of Schooling or Eq.",
    2: "Higher Education - Bachelor's Degree",
    3: "Higher Education - Degree",
    4: "Higher Education - Master's",
    5: "Higher Education - Doctorate",
    6: "Frequency of Higher Education",
    9: "12th Year of Schooling - Not Completed",
    10: "11th Year of Schooling - Not Completed",
    11: "7th Year (Old)",
    12: "Other - 11th Year of Schooling",
    13: "2nd year complementary high school course",
    14: "10th Year of Schooling",
    15: "10th Year of Schooling - Not Completed",
    18: "General commerce course",
    19: "Basic Education 3rd Cycle (9th/10th/11th Year) or Equiv.",
    20: "Complementary High School Course",
    22: "Technical-professional course",
    25: "Complementary High School Course - not concluded",
    26: "7th year of schooling",
    27: "2nd cycle of the general high school course",
    29: "9th Year of Schooling - Not Completed",
    30: "8th year of schooling",
    31: "General Course of Administration and Commerce",
    33: "Supplementary Accounting and Administration",
    34: "Unknown",
    35: "Can't read or write",
    36: "Can read without having a 4th year of schooling",
    37: "Basic education 1st cycle (4th/5th year) or equiv.",
    38: "Basic Education 2nd Cycle (6th/7th/8th Year) or Equiv.",
    39: "Technological specialization course",
    40: "Higher education - degree (1st cycle)",
    41: "Specialized higher studies course",
    42: "Professional higher technical course",
    43: "Higher Education - Master (2nd cycle)",
    44: "Higher Education - Doctorate (3rd cycle)"
}

# Ordinal grouping mapping
qualificationOrdinal = {
    **{code: 0 for code in [34, 35]}, #No documented education
    **{code: 1 for code in [11, 12, 26, 27, 29, 30, 31, 37, 38, 36]}, #Basic/primary schooling
    **{code: 2 for code in [9, 10, 14, 15, 13, 18, 20, 22, 25, 19, 33, 41]}, #Interrupted or completed
    **{code: 3 for code in [39, 42]}, #Technological or sub-degree
    **{code: 4 for code in [1, 2, 3, 40, 4, 43, 5, 44, 6]} #Degree and beyond
}

# Ordinal grouping mapping name
qualificationOrdinalName = {
    0: "None. No documented education",
    1: "Basic. Basic/primary schooling",
    2: "Secondary. Interrupted or completed",
    3: "Post-Secondary. Technological or sub-degree",
    4: "Higher Ed. Degree and beyond",
}

# Mother's Occupation
motherOccupation = {
    0: "Student",
    1: "Representatives of the Legislative Power and Executive Bodies, Directors, Directors and Executive Managers",
    2: "Specialists in Intellectual and Scientific Activities",
    3: "Intermediate Level Technicians and Professions",
    4: "Administrative staff",
    5: "Personal Services, Security and Safety Workers and Sellers",
    6: "Farmers and Skilled Workers in Agriculture, Fisheries and Forestry",
    7: "Skilled Workers in Industry, Construction and Craftsmen",
    8: "Installation and Machine Operators and Assembly Workers",
    9: "Unskilled Workers",
    10: "Armed Forces Professions",
    90: "Other Situation",
    99: "(blank)",
    122: "Health professionals",
    123: "teachers",
    125: "Specialists in information and communication technologies (ICT)",
    131: "Intermediate level science and engineering technicians and professions",
    132: "Technicians and professionals, of intermediate level of health",
    134: "Intermediate level technicians from legal, social, sports, cultural and similar services",
    141: "Office workers, secretaries in general and data processing operators",
    143: "Data, accounting, statistical, financial services and registry-related operators",
    144: "Other administrative support staff",
    151: "personal service workers",
    152: "sellers",
    153: "Personal care workers and the like",
    171: "Skilled construction workers and the like, except electricians",
    173: "Skilled workers in printing, precision instrument manufacturing, jewelers, artisans and the like",
    175: "Workers in food processing, woodworking, clothing and other industries and crafts",
    191: "cleaning workers",
    192: "Unskilled workers in agriculture, animal production, fisheries and forestry",
    193: "Unskilled workers in extractive industry, construction, manufacturing and transport",
    194: "Meal preparation assistants"
}

# Father's Occupation
fatherOccupation = {
    0: "Student",
    1: "Representatives of the Legislative Power and Executive Bodies, Directors, Directors and Executive Managers",
    2: "Specialists in Intellectual and Scientific Activities",
    3: "Intermediate Level Technicians and Professions",
    4: "Administrative staff",
    5: "Personal Services, Security and Safety Workers and Sellers",
    6: "Farmers and Skilled Workers in Agriculture, Fisheries and Forestry",
    7: "Skilled Workers in Industry, Construction and Craftsmen",
    8: "Installation and Machine Operators and Assembly Workers",
    9: "Unskilled Workers",
    10: "Armed Forces Professions",
    90: "Other Situation",
    99: "(blank)",
    101: "Armed Forces Officers",
    102: "Armed Forces Sergeants",
    103: "Other Armed Forces personnel",
    112: "Directors of administrative and commercial services",
    114: "Hotel, catering, trade and other services directors",
    121: "Specialists in the physical sciences, mathematics, engineering and related techniques",
    122: "Health professionals",
    123: "teachers",
    124: "Specialists in finance, accounting, administrative organization, public and commercial relations",
    131: "Intermediate level science and engineering technicians and professions",
    132: "Technicians and professionals, of intermediate level of health",
    134: "Intermediate level technicians from legal, social, sports, cultural and similar services",
    135: "Information and communication technology technicians",
    141: "Office workers, secretaries in general and data processing operators",
    143: "Data, accounting, statistical, financial services and registry-related operators",
    144: "Other administrative support staff",
    151: "personal service workers",
    152: "sellers",
    153: "Personal care workers and the like",
    154: "Protection and security services personnel",
    161: "Market-oriented farmers and skilled agricultural and animal production workers",
    163: "Farmers, livestock keepers, fishermen, hunters and gatherers, subsistence",
    171: "Skilled construction workers and the like, except electricians",
    172: "Skilled workers in metallurgy, metalworking and similar",
    174: "Skilled workers in electricity and electronics",
    175: "Workers in food processing, woodworking, clothing and other industries and crafts",
    181: "Fixed plant and machine operators",
    182: "assembly workers",
    183: "Vehicle drivers and mobile equipment operators",
    192: "Unskilled workers in agriculture, animal production, fisheries and forestry",
    193: "Unskilled workers in extractive industry, construction, manufacturing and transport",
    194: "Meal preparation assistants",
    195: "Street vendors (except food) and street service providers"
}

# Combine Occupation
combineOccupation = {
    0: "Student",
    1: "Representatives of the Legislative Power and Executive Bodies, Directors, Directors and Executive Managers",
    2: "Specialists in Intellectual and Scientific Activities",
    3: "Intermediate Level Technicians and Professions",
    4: "Administrative staff",
    5: "Personal Services, Security and Safety Workers and Sellers",
    6: "Farmers and Skilled Workers in Agriculture, Fisheries and Forestry",
    7: "Skilled Workers in Industry, Construction and Craftsmen",
    8: "Installation and Machine Operators and Assembly Workers",
    9: "Unskilled Workers",
    10: "Armed Forces Professions",
    90: "Other Situation",
    99: "(blank)",
    101: "Armed Forces Officers",
    102: "Armed Forces Sergeants",
    103: "Other Armed Forces personnel",
    112: "Directors of administrative and commercial services",
    114: "Hotel, catering, trade and other services directors",
    121: "Specialists in the physical sciences, mathematics, engineering and related techniques",
    122: "Health professionals",
    123: "teachers",
    124: "Specialists in finance, accounting, administrative organization, public and commercial relations",
    125: "Specialists in information and communication technologies (ICT)",
    131: "Intermediate level science and engineering technicians and professions",
    132: "Technicians and professionals, of intermediate level of health",
    134: "Intermediate level technicians from legal, social, sports, cultural and similar services",
    135: "Information and communication technology technicians",
    141: "Office workers, secretaries in general and data processing operators",
    143: "Data, accounting, statistical, financial services and registry-related operators",
    144: "Other administrative support staff",
    151: "personal service workers",
    152: "sellers",
    153: "Personal care workers and the like",
    154: "Protection and security services personnel",
    161: "Market-oriented farmers and skilled agricultural and animal production workers",
    163: "Farmers, livestock keepers, fishermen, hunters and gatherers, subsistence",
    171: "Skilled construction workers and the like, except electricians",
    172: "Skilled workers in metallurgy, metalworking and similar",
    173: "Skilled workers in printing, precision instrument manufacturing, jewelers, artisans and the like",
    174: "Skilled workers in electricity and electronics",
    175: "Workers in food processing, woodworking, clothing and other industries and crafts",
    181: "Fixed plant and machine operators",
    182: "assembly workers",
    183: "Vehicle drivers and mobile equipment operators",
    191: "cleaning workers",
    192: "Unskilled workers in agriculture, animal production, fisheries and forestry",
    193: "Unskilled workers in extractive industry, construction, manufacturing and transport",
    194: "Meal preparation assistants",
    195: "Street vendors (except food) and street service providers"
}

# Occupation Income Level
occupationOrdinal = {
    **{code: 0 for code in [0, 90, 99]},
    **{code: 1 for code in [9, 191, 192, 193, 194, 195, 163]},
    **{code: 2 for code in [6, 7, 171, 172, 173, 174, 175, 8, 181, 182, 183, 152, 153, 154, 151]},
    **{code: 3 for code in [3, 131, 132, 134, 135, 4, 141, 143, 144, 5, 10, 101, 102, 103, 161]},
    **{code: 4 for code in [2, 121, 122, 123, 124, 125]},
    **{code: 5 for code in [1, 112, 114]}
}

# Occupation Income Level Name
occupationOrdinalName = {
    0: "None/Unknown",
    1: "Low income",
    2: "Lower-Middle income",
    3: "Middle income",
    4: "Upper-Middle income",
    5: "High income"
}
#--------------------------------------------------

#common functions
#--------------------------------------------------
globalHeader = ""
isHeader = True
def __________section__________(header = globalHeader, width = 50):
    global globalHeader, isHeader
    
    GREEN = "\033[32m"
    RED = "\033[31m"
    BOLD = "\033[1m"
    RESET = "\033[0m"
    
    if isHeader:
        bar = ">" * width
        print(GREEN + BOLD + "\n" + bar + RESET)
        print(GREEN + BOLD + f">[SECTION]{header.upper()}" + RESET)
        globalHeader = header
        isHeader = False

    else:
        bar = "<" * width
        print(RED + BOLD + f"<[SECTION]{globalHeader.upper()}" + RESET)
        print(RED + BOLD + bar + "\n" + RESET)
        isHeader = True

def _c_(comment):

    BLUE = "\033[34m"
    ITALIC = "\033[3m"
    RESET = "\033[0m"

    print(BLUE + ITALIC + f"-------[C]{comment.lower()}" + RESET)
        
def nanCheck():
    _c_("Check how many missing values per column")
    print(df.isnull().sum())

    _c_("Show only columns that have missing values")
    print(df.isnull().sum()[df.isnull().sum() > 0])

    _c_("Check if any missing value exists in the entire DataFrame")
    print(df.isnull().values.any())

    _c_("List all rows that contain any NaN")
    print(df[df.isnull().any(axis=1)])

# RandomForest
[TableOfContents](#TableOfContents)

In [41]:

__________section__________("general check")
#check new column names
df_rf.rename(columns=conversion_dict, inplace=True)
#print(df_rf.columns)

_c_("check type")
print(df_rf.dtypes[df_rf.dtypes == "object"])

nanCheck()
__________section__________()




__________section__________("econ and year")
# Map the tuple of 3 macro features to year
df_rf["year"] = df_rf.apply(lambda row: econ_to_year.get((row["unemploymentRate"], row["inflationRate"], row["gdp"])),axis=1)

#check missing year
print("Missing year:", df_rf["year"].isnull().sum())

#EconomicStressIndex
df_rf["economicStressIndex"] = df_rf["unemploymentRate"] + df_rf["inflationRate"] - df_rf["gdp"]

#Is_Economy_Good binary flag
df_rf["isEconomyGood"] = ((df_rf["gdp"] > 1.5) & (df_rf["unemploymentRate"] < 10)).astype(int)

#normalizedUnemploymentByYear
df_rf["normalizedUnemploymentByYear"] = df_rf.groupby("year")["unemploymentRate"].transform(lambda x: (x - x.mean()) / x.std())
nanCheck()
__________section__________()




__________section__________("target")
#check all row got target
print("Target unique:", df_rf["target"].unique())
print("Unmapped target labels:", sorted(set(df_rf["target"].dropna().astype(str).str.strip().str.title()) - set(targetMap.keys())))

#convert target to targetInt
df_rf["targetInt"] = (df_rf["target"].map(targetMap).astype("Int64"))
__________section__________()




__________section__________("parent qualification and occupation")
# create a new column based on parent qualification mapped to ordinal
df_rf["motherQualificationOrdinal"] = df_rf["motherQualification"].map(qualificationOrdinal)
df_rf["fatherQualificationOrdinal"] = df_rf["fatherQualification"].map(qualificationOrdinal)

# create a new column based on parent income mapped to ordinal
df_rf["motherOccupationOrdinal"] = df_rf["motherOccupation"].map(occupationOrdinal)
df_rf["fatherOccupationOrdinal"] = df_rf["fatherOccupation"].map(occupationOrdinal)

#average parent edu and occupation
df_rf["avgParentalEducation"] = df_rf[["motherQualificationOrdinal", "fatherQualificationOrdinal"]].mean(axis=1)
df_rf["avgParentalIncome"] = df_rf[["motherOccupationOrdinal", "fatherOccupationOrdinal"]].mean(axis=1)

#parentalDisparity
df_rf["parentalEduDisparity"] = abs(df_rf["motherQualificationOrdinal"] - df_rf["fatherQualificationOrdinal"])
df_rf["parentalIncomeDisparity"] = abs(df_rf["motherOccupationOrdinal"] - df_rf["fatherOccupationOrdinal"])
__________section__________()




__________section__________("academic activity and performance")
# Define inactive students (zero academic activity)
inactive_mask = (
    (df_rf["curricularUnits1stSemEnrolled"] == 0) &
    (df_rf["curricularUnits2ndSemEnrolled"] == 0) &
    (df_rf["curricularUnits1stSemEvaluations"] == 0) &
    (df_rf["curricularUnits2ndSemEvaluations"] == 0) &
    (df_rf["curricularUnits1stSemGrade"] == 0) &
    (df_rf["curricularUnits2ndSemGrade"] == 0)
)

# Subset those inactive students
inactive_students = df_rf[inactive_mask]

# Print how many
print(f"Fully inactive students: {len(inactive_students)}")

# Check their target label distribution
print("\nTarget label distribution for inactive students:")
print(inactive_students["target"].value_counts())

# Check other features for patterns
print("\nDescriptive stats for selected features of inactive students:")
#print(inactive_students[["applicationOrder", "applicationMode", "ageAtEnrollment", "scholarshipHolder", "international", "displaced", "motherQualification", "fatherOccupation", "year", "isEconomyGood"]].describe(include="all"))

# Add binary flag to main dataframe
df_rf["noAcademicActivity"] = inactive_mask.astype(int)

# See how 'noAcademicActivity' affects the target label
print("\nTarget distribution grouped by noAcademicActivity flag:")
#print(df_rf.groupby("noAcademicActivity")["target"].value_counts(normalize=True))

#Approved Rate
df_rf["approvedRate1stSem"] = df_rf["curricularUnits1stSemApproved"] / df_rf["curricularUnits1stSemEnrolled"].replace(0, np.nan)
df_rf["approvedRate2ndSem"] = df_rf["curricularUnits2ndSemApproved"] / df_rf["curricularUnits2ndSemEnrolled"].replace(0, np.nan)

# Performance Index
df_rf["performanceIndex1stSem"] = df_rf["approvedRate1stSem"] * df_rf["curricularUnits1stSemGrade"]
df_rf["performanceIndex2ndSem"] = df_rf["approvedRate2ndSem"] * df_rf["curricularUnits2ndSemGrade"]

#Credit Load Reduction
df_rf["creditLoadReduction1stSem"] = df_rf["curricularUnits1stSemCredited"] / df_rf["curricularUnits1stSemEnrolled"].replace(0, np.nan)
df_rf["creditLoadReduction2ndSem"] = df_rf["curricularUnits2ndSemCredited"] / df_rf["curricularUnits2ndSemEnrolled"].replace(0, np.nan)

# Evalution rate
df_rf["evalRate1stSem"] = df_rf["curricularUnits1stSemEvaluations"] / (df_rf["curricularUnits1stSemEvaluations"] + df_rf["curricularUnits1stSemWithoutEvaluations"])
df_rf["evalRate2ndSem"] = df_rf["curricularUnits2ndSemEvaluations"] / (df_rf["curricularUnits2ndSemEvaluations"] + df_rf["curricularUnits2ndSemWithoutEvaluations"])

#noAcademicActivity flag student
df_rf["noAcademicActivity"] = inactive_mask.astype(int)
__________section__________()




__________section__________("Application Mode")
# Application Mode shifted cause 0 is first choice unsuitable
df_rf["applicationOrderShifted"] = df_rf["applicationOrder"] + 1
__________section__________()




__________section__________("drop columnn")
columns_to_drop = [
    "target",
    "motherQualification",
    "fatherQualification",
    "motherOccupation",
    "fatherOccupation",
    "applicationOrder"
]

df_model = df_rf.drop(columns=columns_to_drop)

columns_optional = [
    "unemploymentRate", "inflationRate", "gdp",
    "curricularUnits1stSemEnrolled", "curricularUnits2ndSemEnrolled",
    "curricularUnits1stSemApproved", "curricularUnits2ndSemApproved",
    "curricularUnits1stSemGrade", "curricularUnits2ndSemGrade",
    "curricularUnits1stSemCredited", "curricularUnits2ndSemCredited",
    "curricularUnits1stSemEvaluations", "curricularUnits1stSemWithoutEvaluations",
    "curricularUnits2ndSemEvaluations", "curricularUnits2ndSemWithoutEvaluations"
]

# Drop these only if confirmed redundant
#df_model = df_model.drop(columns=columns_optional)

#df = df_rf.drop("columnName", axis=1)
#axis=1 → means you're dropping a column (not a row)
#df = df_rf.drop(5, axis=0)
#df = df_rf.drop(5, axis=0)
__________section__________()




__________section__________("Save snapshot")
df_rf.to_csv("student_dataset_engineered.csv", index=False)
__________section__________()




__________section__________("training")
X = df_model.drop(columns=["targetInt"])
y = df_model["targetInt"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

#Train the Random Forest Model
model = RandomForestClassifier(
    n_estimators=300, 
    max_depth=None, 
    random_state=42, 
    class_weight='balanced')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
__________section__________()




__________section__________("evaluate")

#2D Scatter Plot (with PCA)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_test)
plt.figure(figsize=(8,6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred, cmap='viridis', alpha=0.7)
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("2D PCA Scatter Plot (Predicted Classes)")
plt.colorbar(scatter, ticks=[0, 1, 2], label="Predicted Class")
plt.grid(True)
plt.show()

#Feature Importance Bar Plot
importances = model.feature_importances_
features = X.columns if hasattr(X, 'columns') else [f'Feature {i}' for i in range(X.shape[1])]
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
plt.barh(np.array(features)[indices], importances[indices])
plt.xlabel("Importance Score")
plt.title("Random Forest Feature Importances")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

#Confusion Matrix Heatmap
cm = confusion_matrix(y_test, y_pred)
class_names = ["Dropout", "Enrolled", "Graduate"]
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

#Decision Boundary (2D only)
# Reduce to 2D
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
X_train_2d, X_test_2d, y_train_2d, y_test_2d = train_test_split(X_2d, y, test_size=0.2, random_state=42, stratify=y)
# Train lightweight RF model
rf_2d = RandomForestClassifier(n_estimators=50, random_state=42)
rf_2d.fit(X_train_2d, y_train_2d)
# Plot decision boundary
plot_decision_regions(X_test_2d, y_test_2d.to_numpy(), clf=rf_2d, legend=2)
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("Decision Boundary (2D PCA Projection)")
plt.show()

#Prediction Confidence Histogram
probs = model.predict_proba(X_test)
confidences = probs.max(axis=1)
plt.figure(figsize=(8, 5))
plt.hist(confidences, bins=20, color='skyblue', edgecolor='black')
plt.xlabel("Prediction Confidence")
plt.ylabel("Number of Predictions")
plt.title("Distribution of Prediction Confidence")
plt.grid(True)
plt.show()







print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Dropout", "Enrolled", "Graduate"]))

print("\nAccuracy:", accuracy_score(y_test, y_pred))

importances = model.feature_importances_
features = X.columns

sns.barplot(x=importances, y=features)
plt.title("Feature Importance")
plt.show()

print("Balanced Accuracy:", balanced_accuracy_score(y_test, y_pred))
print("Macro F1 Score:", f1_score(y_test, y_pred, average='macro'))

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["Dropout", "Enrolled", "Graduate"], yticklabels=["Dropout", "Enrolled", "Graduate"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
__________section__________()



[32m[1m
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>[0m
[32m[1m>[SECTION]GENERAL CHECK[0m
[34m[3m-------[C]check type[0m
target    object
dtype: object
[34m[3m-------[C]check how many missing values per column[0m
maritalStatus                              0
applicationMode                            0
applicationOrder                           0
course                                     0
daytimeEveningAttendance                   0
previousQualification                      0
previousQualificationGrade                 0
nationality                                0
motherQualification                        0
fatherQualification                        0
motherOccupation                           0
fatherOccupation                           0
admissionGrade                             0
displaced                                  0
educationalSpecialNeeds                    0
debtor                                     0
tuitionFeesUpToDate                        0
gend

ValueError: Input X contains NaN.
PCA does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

# Next
[TableOfContents](#TableOfContents)

In [None]:
# RandomForest
### ----------
🔧 Random Forest Key Parameters

🔹 n_estimators=100
Number of trees in the forest.
More trees = better generalization, but longer training time.
Rule of thumb: start with 100–300; increase if your dataset is large.

🔹 max_depth=None
How deep each tree is allowed to grow.
None means nodes are expanded until pure or until all leaves contain < min_samples_split.
Deeper trees → more complex rules → risk of overfitting.

🔹 random_state=42
Makes the model reproducible.
Fixes the randomness so you get the same result every time you run it.

🔹 class_weight='balanced'
Adjusts weights inversely proportional to class frequencies.
Useful when classes are imbalanced (e.g. too many graduates vs. few dropouts).
Other option: class_weight={0:2, 1:1, 2:1} (manual tuning).
### ----------
📥 X (IInput Features) — Conditions & Types
Random Forest can handle mixed data types, but you must still preprocess appropriately:

Feature Type	Examples	Notes
✅ Numeric	Age, GPA, Grade average	OK without encoding
✅ Binary	Gender (0/1), Scholarship (Yes/No)	Convert Yes/No to 0/1
⚠️ Categorical	Course Name, Nationality, Father’s Job	Must be encoded
❌ Text	Free-text essay, names	Must be vectorized or dropped
❌ High-cardinality	100+ unique categories (e.g. 500 job titles)	Risk of overfitting — avoid or group them
### ----------
✅ How to Encode Categorical Columns:
Use LabelEncoder or pd.get_dummies() for most.
Avoid keeping strings directly.
### ----------
🎯 y (Target Variable) — Conditions
Requirement	Example
✅ Must be categorical or discrete	e.g., 0 = dropout, 1 = enrolled, 2 = graduated
❌ No continuous/float values	Use regression instead of classification
Can be:	integers, strings (if using LabelEncoder)
### ----------
⚠️ What to Avoid
Mistake	Why It’s a Problem
Keeping raw strings or object dtype	Trees can't handle text directly
Too many missing values	Causes unreliable splits
No feature scaling	Not needed for trees — unlike SVM/LogReg
Unbalanced classes + no class_weight	Model may ignore minority class
Using irrelevant or highly correlated features	Adds noise; harms performance
Data leakage	Don't include features that leak the target (e.g. final grade when predicting dropout)
### ----------
🎯 What to Aim for in X (Input)
Best Practice	Description
Cleaned & preprocessed	No NaN, strings, or irrelevant cols
Compact	10–100 features is good starting point
Encoded properly	Categorical → OneHot / LabelEncoded
Balanced information	Avoid too many similar/redundant features
Derived Features	Feature engineering like ratios or differences can help (e.g. "approved/enrolled")
### ----------
💡 Other Things You Should Know
📊 Evaluation Metrics (for multi-class)
Use more than just accuracy:
classification_report(): shows precision, recall, f1-score per class
confusion_matrix(): shows how each class is predicted
Visual tools: confusion matrix heatmap, feature importance bar chart
🧪 Feature Importance
Random Forest can rank features by importance:
model.feature_importances_
This helps you:
Understand what the model "cares about"
Reduce features for faster training
🧠 Hyperparameter Tuning (optional)
Use GridSearchCV to find the best combo of:
n_estimators, max_depth, min_samples_split, max_features





In [None]:
7. GridSearchCV — How to Use

✅ GridSearchCV helps you find the best hyperparameters by trying out multiple combinations and cross-validating them.

🛠️ Example for Random Forest:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'class_weight': ['balanced']
}

rf = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,  # 5-fold cross validation
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)