<a href="https://www.kaggle.com/code/gizemnalbantarslan/btk-datathon-2024-pycaret-catboostregressor?scriptVersionId=199056568" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **About Me** 
Hello. I am Gizem Nalbant Arslan, a data scientist who loves to draw insights using the power of data and is always ready to explore new horizons in the world of machine learning.

After graduating from Industrial Engineering, I joined the continuous improvement department in the automotive industry. Leading projects here allowed me to discover my passion for data science, and I embarked on a fresh start in my career. Your support along this journey means a lot to me.

You can reach my [LinkedIn](https://www.linkedin.com/in/gizem-nalbant-arslan/) profile. 🔗

I look forward to meeting you.

Skills:

* Data Science 📊
* SQL 🗄️
* Python 🐍
* Machine Learning 🤖
* Statistical Models 📈

# About the competition
In the data, we observe the applications received by the Entrepreneurship Foundation since 2014 and the column named Evaluation Score. This file also contains anonymously shared university, family details, residence details, etc. of the applicants.

For the 11,049 people who applied in 2023, we have all the data again except for the Evaluation Score column (test_x.csv). Our task is to predict the column named Evaluation Score of these 11.049 people.

# Import

In [1]:
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor
import optuna

In [2]:
pip install pycaret

Note: you may need to restart the kernel to use updated packages.


# Load datasets and drop

In [3]:
# Load the dataset
df_ = pd.read_csv("/kaggle/input/datathon-btk-2024/train.csv")
test_df = pd.read_csv("/kaggle/input/datathon-btk-2024/test_x.csv")

df_.dropna(subset=["Degerlendirme Puani"], inplace=True)
df = pd.concat([df_, test_df], ignore_index=True)

drop_columns = [
    'Dogum Yeri',
    'Ikametgah Sehri',
    'Universite Adi',
    'Bölüm',
    'Lise Adi',
    'Lise Adi Diger',
    'Lise Sehir',
    'Lise Bolumu',
    'Lise Bolum Diger',
    'Burslu ise Burs Yuzdesi',
    'Burs Aldigi Baska Kurum',
    'Baska Kurumdan Aldigi Burs Miktari',
    'Daha Once Baska Bir Universiteden Mezun Olmus',
    'Uye Oldugunuz Kulubun Ismi',
    'Stk Projesine Katildiniz Mi?',
    'Girisimcilikle Ilgili Deneyiminizi Aciklayabilir misiniz?',
    'Ingilizce Seviyeniz?',
    "Hangi STK'nin Uyesisiniz?",
    'Daha Önceden Mezun Olunduysa, Mezun Olunan Üniversite'
]

df.drop(columns=drop_columns, inplace=True)

def convert_strings_to_lowercase(df):
    turkish_chars = {'ı': 'i', 'ü': 'u', 'ö': 'o', 'ğ': 'g', 'ş': 's', 'ç': 'c'}
    for column in df.columns:
        if df[column].dtype == 'object':
            df[column] = df[column].str.lower().replace(turkish_chars, regex=True)
    return df
df = convert_strings_to_lowercase(df)

df.columns

  df_ = pd.read_csv("/kaggle/input/datathon-btk-2024/train.csv")


Index(['Basvuru Yili', 'Degerlendirme Puani', 'Cinsiyet', 'Dogum Tarihi',
       'Universite Turu', 'Burs Aliyor mu?', 'Universite Kacinci Sinif',
       'Universite Not Ortalamasi', 'Lise Turu', 'Lise Mezuniyet Notu',
       'Baska Bir Kurumdan Burs Aliyor mu?', 'Anne Egitim Durumu',
       'Anne Calisma Durumu', 'Anne Sektor', 'Baba Egitim Durumu',
       'Baba Calisma Durumu', 'Baba Sektor', 'Kardes Sayisi',
       'Girisimcilik Kulupleri Tarzi Bir Kulube Uye misiniz?',
       'Profesyonel Bir Spor Daliyla Mesgul musunuz?',
       'Spor Dalindaki Rolunuz Nedir?', 'Aktif olarak bir STK üyesi misiniz?',
       'Girisimcilikle Ilgili Deneyiminiz Var Mi?',
       'Ingilizce Biliyor musunuz?', 'id'],
      dtype='object')

# EDA

In [4]:
#Editing date data

#Edit Date of Birth - Create variable N_application_year
def get_year(tarih):
    if pd.isna(tarih):  # NaN değerler
        return tarih

    parcalar = str(tarih).replace('/', ' ').replace('.', ' ').replace('-', ' ').split()
    if len(parcalar) == 3:  # Tarih formatı uygunsa
        return parcalar[-1]  # Son eleman yıl olacaktır
    if len(parcalar) > 3: #D:M:Y 00:00 formatında olanlar için
        return parcalar[2]
    return None

def two_digit_years(year):
    year = str(year).strip('_')  # Alt çizgileri temizle
    if len(year) == 2 and year.isdigit():
        if year.startswith == 0:
            return int('20' + year)
        # İki basamaklı yıl
        return int('19' + year)
    return year

def clean_invalid_years(year):
    year = str(year).strip('_')  # Alt çizgileri temizle
    if year.isdigit() and 1960 <= int(year) <= 2016:  # Geçerli yıl
        return int(year)
    return np.nan  # Geçersiz yıl

# Removing the year from the 'Date of Birth' column
df['Dogum Tarihi'] = df['Dogum Tarihi'].apply(get_year)

#Convert two-digit years to 1900s or 2000s
df['Dogum Tarihi'] = df['Dogum Tarihi'].apply(two_digit_years)

# Clear invalid data and check for leftovers
df['Dogum Tarihi'] = df['Dogum Tarihi'].apply(clean_invalid_years)

#Create age of applicant variable
df["N_basvuru_yası"] = df["Basvuru Yili"] - df["Dogum Tarihi"]

In [5]:
# Drop NA values
df.dropna(subset=['Cinsiyet', 'Baska Bir Kurumdan Burs Aliyor mu?', 'Universite Turu'], inplace=True)

# Handle missing values
df['Ingilizce Biliyor musunuz?'] = df['Ingilizce Biliyor musunuz?'].fillna('nk')
df['Anne Sektor'] = df['Anne Sektor'].fillna('nk')
df['Baba Sektor'] = df['Baba Sektor'].fillna('nk')
df['Spor Dalindaki Rolunuz Nedir?'] = df['Spor Dalindaki Rolunuz Nedir?'].fillna('nk')

df.head(20)

na_columns = [col for col in df.columns if df[col].isnull().sum() > 0]

if "Degerlendirme Puani" in na_columns:
    na_columns.remove("Degerlendirme Puani")

for col in na_columns:
    mode_value = df[col].mode()[0]
    df[col] = df[col].fillna(mode_value)

# Drop id column
df.drop(columns='id', inplace=True)

In [6]:
# There were too many different values in the university GPA section. We have categorized them into simpler categories.
bulunmuyor = ["NaN", "hazirligim", "not ortalamasi yok", "ortalama bulunmuyor"]
kotu = ["0 - 1.79","1.00 - 2.50","1.80 - 2.49","2.00 - 2.50","2.50 ve altı"]
orta = ["2.50 - 2.99","2.50 - 3.00","2.50 -3.00","3.00-2.50"]
iyi = ["3.00 - 3.49","3.00 - 3.50","3.00 - 4.00","3.50 - 4.00","3.50-3","4-3.5","4.0-3.5"]

def kategorize_et(col):
    if col in bulunmuyor:
        return "bulunmuyor"
    elif col in kotu:
        return "kotu"
    elif col in orta:
        return "orta"
    else:
        return "iyi"

# Update the 'University Grade Point Average' column with apply()
df["Universite Not Ortalamasi"] = df["Universite Not Ortalamasi"].apply(kategorize_et)


In [7]:
#We also categorize high school GPAs. 
#Here we will use new lists as the grade ranges are different from the university.
lise_bulunmuyor = ["NaN", "not ortalamasi yok"]
lise_kotu = ["0 - 25", "0 - 24", "25 - 49", "25 - 50", "2.50 ve altı", "44-0"]
lise_orta = ["3.00-2.50", "50 - 74", "50 - 75", "54-45", "69-55"]
lise_iyi = ["100-85", "3.00 - 4.00", "3.50-3", "3.50-3.00", "4.00-3.50", "75 - 100", "84-70"]

def lise_not_kategorize_et(col):
    if col in lise_bulunmuyor:
        return "bulunmuyor"
    elif col in lise_kotu:
        return "kotu"
    elif col in lise_orta:
        return "orta"
    else:
        return "iyi"

# We update the “High School Graduation Grade” column.
df["Lise Mezuniyet Notu"] = df["Lise Mezuniyet Notu"].apply(lise_not_kategorize_et)

In [8]:
#The types of high schools were written in different characters.
def lise_turu_kategorize_et(lise):
    if isinstance(lise, float):  # If the value is float, we assign it to the “Other” category.
        return "diger"
    if "fen" in lise:
        return "fen lisesi"
    elif "anadolu" in lise:
        return "anadolu lisesi"
    elif "düz" in lise:
        return "duz lise"
    elif "imam" in lise:
        return "imam hatip lisesi"
    elif "meslek" in lise:
        return "meslek lisesi"
    elif "ozel" in lise:
        return "ozel lise"
    else:
        return "diger"

df["Lise Turu"] = df["Lise Turu"].apply(lise_turu_kategorize_et)

In [9]:
# Identify categorical and numerical columns
def grab_col_names(dataframe, cat_th=12, car_th=81):
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtype == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and dataframe[col].dtype != "O"]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and dataframe[col].dtype == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]
    num_cols = [col for col in dataframe.columns if dataframe[col].dtype != "O" and dataframe[col].dtype != "datetime64[ns]"]
    num_cols = [col for col in num_cols if col not in num_but_cat]
    return cat_cols, num_cols, cat_but_car

cat_cols, num_cols, cat_but_car = grab_col_names(df)

# Encoding

In [10]:
# One-hot encode categorical variables
def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

df = one_hot_encoder(df, cat_cols, drop_first=True)

df.head()

Unnamed: 0,Degerlendirme Puani,Dogum Tarihi,N_basvuru_yası,Cinsiyet_erkek,Cinsiyet_kadin,Universite Turu_ozel,Burs Aliyor mu?_hayir,Universite Kacinci Sinif_1,Universite Kacinci Sinif_2,Universite Kacinci Sinif_3,...,Ingilizce Biliyor musunuz?_nk,Basvuru Yili_2015,Basvuru Yili_2016,Basvuru Yili_2017,Basvuru Yili_2018,Basvuru Yili_2019,Basvuru Yili_2020,Basvuru Yili_2021,Basvuru Yili_2022,Basvuru Yili_2023
0,52.0,1994.0,20.0,True,False,True,False,False,False,True,...,True,False,False,False,False,False,False,False,False,False
1,30.0,1993.0,21.0,True,False,True,True,False,False,True,...,True,False,False,False,False,False,False,False,False,False
2,18.0,1986.0,28.0,True,False,True,True,True,False,False,...,True,False,False,False,False,False,False,False,False,False
3,40.0,1991.0,23.0,True,False,True,False,False,False,True,...,True,False,False,False,False,False,False,False,False,False
4,24.0,1992.0,22.0,True,False,True,False,False,True,False,...,True,False,False,False,False,False,False,False,False,False


# Modelling

In [11]:
# Separate train and test data
train_df = df[df['Degerlendirme Puani'].notnull()]
test_df = df[df['Degerlendirme Puani'].isnull()]

y = train_df['Degerlendirme Puani']
X = train_df.drop(['Degerlendirme Puani'], axis=1)

In [12]:
from pycaret.regression import *

In [13]:
# PyCaret setup: Regresyon modeli için setup oluşturuyoruz
regression_setup = setup(data = train_df, target = 'Degerlendirme Puani', session_id=123)

regression_setup = setup(data=train_df,
                         target='Degerlendirme Puani',
                         session_id=42,  
                         train_size=0.8,  
                         normalize=True,  
                         verbose=False,  
                         )

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Degerlendirme Puani
2,Target type,Regression
3,Original data shape,"(64518, 120)"
4,Transformed data shape,"(64518, 120)"
5,Transformed train set shape,"(45162, 120)"
6,Transformed test set shape,"(19356, 120)"
7,Numeric features,2
8,Preprocess,True
9,Imputation type,simple


In [14]:
# We compare some of the models we have chosen and choose the best one.
best = compare_models(include=["gbr","xgboost","lightgbm","catboost"])

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
catboost,CatBoost Regressor,5.3681,47.2174,6.8711,0.8559,0.294,0.2565,8.909
lightgbm,Light Gradient Boosting Machine,5.422,47.9915,6.9272,0.8535,0.2958,0.2591,2.313
xgboost,Extreme Gradient Boosting,5.4439,48.5559,6.9679,0.8518,0.2967,0.2588,1.73
gbr,Gradient Boosting Regressor,5.7423,53.339,7.3027,0.8372,0.3163,0.288,6.065


Processing:   0%|          | 0/21 [00:00<?, ?it/s]

In [15]:
# Let's display the best model
print(best)

<catboost.core.CatBoostRegressor object at 0x795a569fe6b0>


In [16]:
# Optimize the best model
tuned_model = tune_model(best)

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,5.5137,47.8911,6.9203,0.8562,0.3003,0.2726
1,5.5679,50.5607,7.1106,0.8386,0.3055,0.2724
2,5.5901,50.3893,7.0985,0.846,0.2972,0.2652
3,5.5315,50.8957,7.1341,0.8426,0.3089,0.2741
4,5.4977,49.5499,7.0392,0.8421,0.3064,0.2666
5,5.6015,51.1733,7.1536,0.8443,0.3092,0.2802
6,5.6818,52.0018,7.2112,0.8502,0.3099,0.279
7,5.5921,50.5071,7.1068,0.8464,0.309,0.2772
8,5.6085,50.8805,7.1331,0.8453,0.3078,0.2797
9,5.4718,48.6872,6.9776,0.8542,0.2989,0.262


Processing:   0%|          | 0/7 [00:00<?, ?it/s]

Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


In [17]:
#Training the all data.
final_model = finalize_model(tuned_model)

In [18]:
# We remove the target column (Assessment Score) because it is completely missing in the test data.
test_df = test_df.drop(columns=['Degerlendirme Puani'], errors='ignore')

# We make predictions on test data.
test_predictions = predict_model(final_model, data=test_df)

test_df['Degerlendirme Puani'] = test_predictions["prediction_label"]

In [19]:
dictionary = {"Degerlendirme Puani":test_predictions["prediction_label"]}
dfSubmission = pd.DataFrame(dictionary)
dfSubmission.head()

Unnamed: 0,Degerlendirme Puani
65124,37.615803
65125,27.328562
65126,9.811973
65127,19.849055
65128,40.392236
