<a href="https://www.kaggle.com/code/gizemnalbantarslan/btk-datathon-2024-pycaret-catboostregressor?scriptVersionId=248813502" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **About Me** 
Hello. I am Gizem Nalbant Arslan, a data scientist who loves to draw insights using the power of data and is always ready to explore new horizons in the world of machine learning.

After graduating with a degree in Industrial Engineering, I joined the continuous improvement department of a company in the automotive industry. Leading projects there allowed me to discover my passion for data science, and I embarked on a fresh start in my career. Your support along this journey means a lot to me.

You can reach me through my [LinkedIn](https://www.linkedin.com/in/gizem-nalbant-arslan/) profile. 🔗

I look forward to meeting you.

Skills:

* Data Science 📊
* SQL 🗄️
* Python 🐍
* Machine Learning 🤖
* Statistical Models 📈

# About the competition
The data contain the applications received by the Entrepreneurship Foundation since 2014, together with a column named Evaluation Score (`Degerlendirme Puani`). The file also includes anonymized details about each applicant, such as university, family, and residence information.

For the 11,049 people who applied in 2023, we have the same data except for the Evaluation Score column (test_x.csv). Our task is to predict the Evaluation Score for these 11,049 applicants.

# Import

In [None]:
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor
import optuna

In [None]:
%pip install pycaret

# Load datasets and drop columns

In [None]:
# Load the dataset
df_ = pd.read_csv("/kaggle/input/datathon-btk-2024/train.csv")
test_df = pd.read_csv("/kaggle/input/datathon-btk-2024/test_x.csv")

df_.dropna(subset=["Degerlendirme Puani"], inplace=True)
df = pd.concat([df_, test_df], ignore_index=True)

drop_columns = [
    'Dogum Yeri',
    'Ikametgah Sehri',
    'Universite Adi',
    'Bölüm',
    'Lise Adi',
    'Lise Adi Diger',
    'Lise Sehir',
    'Lise Bolumu',
    'Lise Bolum Diger',
    'Burslu ise Burs Yuzdesi',
    'Burs Aldigi Baska Kurum',
    'Baska Kurumdan Aldigi Burs Miktari',
    'Daha Once Baska Bir Universiteden Mezun Olmus',
    'Uye Oldugunuz Kulubun Ismi',
    'Stk Projesine Katildiniz Mi?',
    'Girisimcilikle Ilgili Deneyiminizi Aciklayabilir misiniz?',
    'Ingilizce Seviyeniz?',
    "Hangi STK'nin Uyesisiniz?",
    'Daha Önceden Mezun Olunduysa, Mezun Olunan Üniversite'
]

df.drop(columns=drop_columns, inplace=True)

def convert_strings_to_lowercase(df):
    turkish_chars = {'ı': 'i', 'ü': 'u', 'ö': 'o', 'ğ': 'g', 'ş': 's', 'ç': 'c'}
    for column in df.columns:
        if df[column].dtype == 'object':
            df[column] = df[column].str.lower().replace(turkish_chars, regex=True)
    return df
df = convert_strings_to_lowercase(df)

df.columns
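
To see what the normalization step does to the text columns, here is a minimal sketch on a few made-up strings (the sample values are assumptions, not rows from the dataset). It uses the same lowercase plus character-replacement logic as `convert_strings_to_lowercase`.

```python
import pandas as pd

# Same character mapping as in convert_strings_to_lowercase
turkish_chars = {'ı': 'i', 'ü': 'u', 'ö': 'o', 'ğ': 'g', 'ş': 's', 'ç': 'c'}

# Made-up sample values for illustration
sample = pd.Series(["Düz Lise", "Özel Lise", "2.50 ve altı"])

# Lowercase first, then replace each Turkish character wherever it occurs
normalized = sample.str.lower().replace(turkish_chars, regex=True)
print(normalized.tolist())  # ['duz lise', 'ozel lise', '2.50 ve alti']
```

One practical consequence is that any later lookup or keyword match against these columns has to use the normalized ASCII spelling, for example `alti` rather than `altı`.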

# EDA

In [None]:
# Clean up the date columns

# Parse the date of birth; it is used below to create the applicant age variable N_basvuru_yası
def get_year(tarih):
    if pd.isna(tarih):  # Keep NaN values as they are
        return tarih

    parcalar = str(tarih).replace('/', ' ').replace('.', ' ').replace('-', ' ').split()
    if len(parcalar) == 3:  # The date has the expected three parts
        return parcalar[-1]  # The last element is the year
    if len(parcalar) > 3:  # Dates written as "D M Y 00:00" carry a trailing time part
        return parcalar[2]
    return None

def two_digit_years(year):
    year = str(year).strip('_')  # Strip underscores
    if len(year) == 2 and year.isdigit():
        # Two-digit year: a leading zero ('00'-'09') is read as 20xx, anything else as 19xx
        if year.startswith('0'):
            return int('20' + year)
        return int('19' + year)
    return year

def clean_invalid_years(year):
    year = str(year).strip('_')  # Strip underscores
    if year.isdigit() and 1960 <= int(year) <= 2016:  # Valid year
        return int(year)
    return np.nan  # Invalid year

# Extract the year from the 'Dogum Tarihi' (date of birth) column
df['Dogum Tarihi'] = df['Dogum Tarihi'].apply(get_year)

#Convert two-digit years to 1900s or 2000s
df['Dogum Tarihi'] = df['Dogum Tarihi'].apply(two_digit_years)

# Clear invalid data and check for leftovers
df['Dogum Tarihi'] = df['Dogum Tarihi'].apply(clean_invalid_years)

#Create age of applicant variable
df["N_basvuru_yası"] = df["Basvuru Yili"] - df["Dogum Tarihi"]
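
The three helpers above can be read as one pipeline: split the raw string, take the third token as the year, expand two-digit years, and discard anything outside 1960 to 2016. The sketch below folds those steps into a single function to make the behaviour easy to check on a few sample strings. The name `parse_birth_year` and the sample inputs are hypothetical, and the rule that only `00` to `09` map to the 2000s is an assumption about the intended behaviour.

```python
import numpy as np

def parse_birth_year(raw, min_year=1960, max_year=2016):
    """Hypothetical single-pass version of get_year, two_digit_years and clean_invalid_years."""
    if raw is None or (isinstance(raw, float) and np.isnan(raw)):
        return np.nan
    parts = str(raw).replace('/', ' ').replace('.', ' ').replace('-', ' ').split()
    if len(parts) < 3:
        return np.nan  # Not enough parts to contain day, month and year
    year = parts[2].strip('_')
    if not year.isdigit():
        return np.nan
    if len(year) == 2:
        # Assumption: '00'-'09' belong to the 2000s, every other two-digit year to the 1900s
        year = ('20' if year.startswith('0') else '19') + year
    year = int(year)
    return year if min_year <= year <= max_year else np.nan

for sample in ["12.05.1998", "3/7/99", "5.5.05", "01-01-2001 00:00", "1998", "1.1.1900"]:
    print(sample, "->", parse_birth_year(sample))
```

A string with fewer than three parts, or a year outside the accepted range, ends up as NaN and is handled later by the missing-value step.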

In [None]:
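# Drop rows with missing values in these columns.
# Note: df holds both train and test rows, so test rows with missing values here are dropped as well.
df.dropna(subset=['Cinsiyet', 'Baska Bir Kurumdan Burs Aliyor mu?', 'Universite Turu'], inplace=True)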

# Handle missing values
df['Ingilizce Biliyor musunuz?'] = df['Ingilizce Biliyor musunuz?'].fillna('nk')
df['Anne Sektor'] = df['Anne Sektor'].fillna('nk')
df['Baba Sektor'] = df['Baba Sektor'].fillna('nk')
df['Spor Dalindaki Rolunuz Nedir?'] = df['Spor Dalindaki Rolunuz Nedir?'].fillna('nk')

df.head(20)

na_columns = [col for col in df.columns if df[col].isnull().sum() > 0]

if "Degerlendirme Puani" in na_columns:
    na_columns.remove("Degerlendirme Puani")

for col in na_columns:
    mode_value = df[col].mode()[0]
    df[col] = df[col].fillna(mode_value)

# Drop id column
df.drop(columns='id', inplace=True)
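
The loop above fills every remaining column that has missing values with its most frequent value, while leaving the target untouched so that test rows can still be identified later. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: one feature with a gap, and a target that is missing for "test" rows
toy = pd.DataFrame({
    "Lise Turu": ["anadolu lisesi", np.nan, "anadolu lisesi", "fen lisesi"],
    "Degerlendirme Puani": [55.0, 40.0, np.nan, np.nan],
})

# Columns with missing values, excluding the target
na_cols = [c for c in toy.columns
           if toy[c].isnull().any() and c != "Degerlendirme Puani"]

# Fill each of them with its mode (most frequent value)
for c in na_cols:
    toy[c] = toy[c].fillna(toy[c].mode()[0])

print(toy)
```

Because the train and test rows are concatenated at this point, the mode is computed over both sets together. Numeric columns are filled with their mode as well, which may or may not be the best choice for continuous features.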

In [None]:
# The university GPA column contains many differently written ranges, so we group them into a few simpler categories.
# Labels are written in their normalized form (lowercase, ASCII), because convert_strings_to_lowercase has already been applied.
bulunmuyor = ["NaN", "hazirligim", "not ortalamasi yok", "ortalama bulunmuyor"]
kotu = ["0 - 1.79","1.00 - 2.50","1.80 - 2.49","2.00 - 2.50","2.50 ve alti"]
orta = ["2.50 - 2.99","2.50 - 3.00","2.50 -3.00","3.00-2.50"]
iyi = ["3.00 - 3.49","3.00 - 3.50","3.00 - 4.00","3.50 - 4.00","3.50-3","4-3.5","4.0-3.5"]

def kategorize_et(col):
    if col in bulunmuyor:
        return "bulunmuyor"
    elif col in kotu:
        return "kotu"
    elif col in orta:
        return "orta"
    else:
        return "iyi"

# Update the 'University Grade Point Average' column with apply()
df["Universite Not Ortalamasi"] = df["Universite Not Ortalamasi"].apply(kategorize_et)


In [None]:
# We also categorize high school GPAs.
# Here we use new lists, as the grade ranges differ from the university ones.
# Labels are written in their normalized form (lowercase, ASCII), because convert_strings_to_lowercase has already been applied.
lise_bulunmuyor = ["NaN", "not ortalamasi yok"]
lise_kotu = ["0 - 25", "0 - 24", "25 - 49", "25 - 50", "2.50 ve alti", "44-0"]
lise_orta = ["3.00-2.50", "50 - 74", "50 - 75", "54-45", "69-55"]
lise_iyi = ["100-85", "3.00 - 4.00", "3.50-3", "3.50-3.00", "4.00-3.50", "75 - 100", "84-70"]

def lise_not_kategorize_et(col):
    if col in lise_bulunmuyor:
        return "bulunmuyor"
    elif col in lise_kotu:
        return "kotu"
    elif col in lise_orta:
        return "orta"
    else:
        return "iyi"

# We update the “High School Graduation Grade” column.
df["Lise Mezuniyet Notu"] = df["Lise Mezuniyet Notu"].apply(lise_not_kategorize_et)
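
Both grade-bucketing functions send any label that is not listed to "iyi" through the final `else` branch, so a label with a slightly different spelling is silently counted as a good grade. A small audit before applying them can reveal such labels. The helper `unmatched_labels`, the shortened lists, and the sample values below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def unmatched_labels(series, *label_lists):
    """Count values that appear in none of the given label lists (hypothetical helper)."""
    known = set().union(*label_lists)
    return series[~series.isin(known)].value_counts()

# Shortened lists and made-up grades for illustration
kotu = ["0 - 1.79", "2.50 ve alti"]
orta = ["2.50 - 2.99"]
iyi = ["3.00 - 3.49", "3.50 - 4.00"]
grades = pd.Series(["3.00 - 3.49", "2.50 ve alti", "1.80 - 2.49", "1.80 - 2.49", "2.50 - 2.99"])

leftover = unmatched_labels(grades, kotu, orta, iyi)
print(leftover)  # "1.80 - 2.49" appears twice and is not covered by any list
```

Running such a check on the real column, with the full lists, shows which labels would fall into the default category.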

In [None]:
# High school types were entered with inconsistent spellings, so we map them to a fixed set of categories by keyword.
def lise_turu_kategorize_et(lise):
    if isinstance(lise, float):  # If the value is float, we assign it to the “Other” category.
        return "diger"
    if "fen" in lise:
        return "fen lisesi"
    elif "anadolu" in lise:
        return "anadolu lisesi"
    elif "duz" in lise:  # Text is already normalized to ASCII, so "düz" appears as "duz"
        return "duz lise"
    elif "imam" in lise:
        return "imam hatip lisesi"
    elif "meslek" in lise:
        return "meslek lisesi"
    elif "ozel" in lise:
        return "ozel lise"
    else:
        return "diger"

df["Lise Turu"] = df["Lise Turu"].apply(lise_turu_kategorize_et)
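
Since the categorization checks keywords one after another, the order of the checks decides the category when a name contains more than one keyword. The sketch below expresses the same idea as an ordered rule list, assuming the normalized ASCII spelling of the text. The names `RULES` and `categorize_school` and the sample school names are hypothetical.

```python
# Ordered (keyword, category) pairs; the first keyword found in the name wins
RULES = [
    ("fen", "fen lisesi"),
    ("anadolu", "anadolu lisesi"),
    ("duz", "duz lise"),
    ("imam", "imam hatip lisesi"),
    ("meslek", "meslek lisesi"),
    ("ozel", "ozel lise"),
]

def categorize_school(name):
    """Hypothetical rule-list version of the keyword matching above."""
    if not isinstance(name, str):  # Missing values arrive as float NaN
        return "diger"
    for keyword, category in RULES:
        if keyword in name:
            return category
    return "diger"

print(categorize_school("ozel fen lisesi"))            # fen lisesi
print(categorize_school("anadolu imam hatip lisesi"))  # anadolu lisesi
print(categorize_school("kolej"))                      # diger
```

With this ordering, an "anadolu imam hatip lisesi" is counted as an Anatolian high school; placing the "imam" rule first would change that.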

In [None]:
# Identify categorical and numerical columns
def grab_col_names(dataframe, cat_th=12, car_th=81):
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtype == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and dataframe[col].dtype != "O"]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and dataframe[col].dtype == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]
    num_cols = [col for col in dataframe.columns if dataframe[col].dtype != "O" and dataframe[col].dtype != "datetime64[ns]"]
    num_cols = [col for col in num_cols if col not in num_but_cat]
    return cat_cols, num_cols, cat_but_car

cat_cols, num_cols, cat_but_car = grab_col_names(df)

# Encoding

In [None]:
# One-hot encode categorical variables
def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

df = one_hot_encoder(df, cat_cols, drop_first=True)

df.head()
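
As a quick illustration of what `drop_first=True` does, here is a minimal sketch on a toy frame. The category values are made up for the example.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (values are illustrative)
toy = pd.DataFrame({
    "Cinsiyet": ["kadin", "erkek", "kadin"],
    "Basvuru Yili": [2021, 2022, 2023],
})

# drop_first=True drops the first category (in sorted order) of each encoded column
encoded = pd.get_dummies(toy, columns=["Cinsiyet"], drop_first=True)
print(encoded.columns.tolist())  # ['Basvuru Yili', 'Cinsiyet_kadin']
```

Encoding the concatenated train and test frame in one call, as done above, has the side benefit that both sets end up with exactly the same dummy columns.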

# Modelling

In [None]:
# Separate train and test data
train_df = df[df['Degerlendirme Puani'].notnull()]
test_df = df[df['Degerlendirme Puani'].isnull()]

y = train_df['Degerlendirme Puani']
X = train_df.drop(['Degerlendirme Puani'], axis=1)

In [None]:
from pycaret.regression import *

In [None]:
# PyCaret setup for the regression task: hold out 20% of the rows and normalize the features.
regression_setup = setup(data=train_df,
                         target='Degerlendirme Puani',
                         session_id=42,
                         train_size=0.8,
                         normalize=True,
                         verbose=False,
                         )

In [None]:
# We compare some of the models we have chosen and choose the best one.
best = compare_models(include=["gbr","xgboost","lightgbm","catboost"])
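
`compare_models` ranks the candidate models by cross-validated scores. To make the idea of k-fold cross-validation concrete, the sketch below computes an average RMSE over 5 folds with plain NumPy, using an ordinary least-squares fit on synthetic data as a stand-in model. This is not PyCaret's implementation; the function `kfold_rmse` and the data are assumptions made for illustration.

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=0):
    """Average validation RMSE of a least-squares fit over k folds (illustrative only)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds, with a column of ones for the intercept
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        # Score on the held-out fold
        A_val = np.column_stack([np.ones(len(val)), X[val]])
        pred = A_val @ coef
        scores.append(np.sqrt(np.mean((y[val] - pred) ** 2)))
    return float(np.mean(scores))

# Synthetic data: a linear signal plus a small amount of noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 10 + rng.normal(scale=0.1, size=200)

score = kfold_rmse(X, y)
print(f"5-fold RMSE: {score:.3f}")
```

Each row is used for validation exactly once, so the averaged score is less dependent on a single lucky or unlucky split than one hold-out evaluation.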

In [None]:
# Let's display the best model
print(best)

In [None]:
# Optimize the best model
tuned_model = tune_model(best)

In [None]:
# Retrain the tuned model on the full training data, including the hold-out split.
final_model = finalize_model(tuned_model)

In [None]:
# We remove the target column (Evaluation Score) because it is completely missing in the test data.
test_df = test_df.drop(columns=['Degerlendirme Puani'], errors='ignore')

# We make predictions on test data.
test_predictions = predict_model(final_model, data=test_df)

test_df['Degerlendirme Puani'] = test_predictions["prediction_label"]

In [None]:
dictionary = {"Degerlendirme Puani":test_predictions["prediction_label"]}
dfSubmission = pd.DataFrame(dictionary)
dfSubmission.head()
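
Before writing the submission file, it is worth checking that the number of predictions matches the number of rows the competition expects, because rows with missing values in some columns were dropped from the combined frame earlier and that step also affects test rows. The sketch below shows such a check on made-up predictions. The helper `build_submission` is hypothetical, and the exact submission layout (for example whether an id column is required) should be taken from the competition page.

```python
import pandas as pd

def build_submission(predictions, expected_rows, target="Degerlendirme Puani"):
    """Build a one-column submission frame after checking the row count (hypothetical helper)."""
    if len(predictions) != expected_rows:
        raise ValueError(
            f"Expected {expected_rows} predictions, got {len(predictions)}"
        )
    return pd.DataFrame({target: list(predictions)})

# Made-up predictions for illustration
submission = build_submission([41.2, 37.5, 58.0], expected_rows=3)

# to_csv without a path returns the CSV text; pass a file name to write it to disk
csv_text = submission.to_csv(index=False)
print(csv_text)
```

For the real data, `expected_rows` would be the length of the original test_x.csv, and the frame would be written with something like `submission.to_csv("submission.csv", index=False)`, where the file name is an assumption.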