## Imports
This section contains all the required imports for this model.

In [1]:
from sklearn.svm import SVC
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

## Data Pre-processing
The data for this project can be found [here](https://www.kaggle.com/datasets/asinow/schizohealth-dataset). Once the data has been imported, we drop all rows that contain an empty or NaN value. We then rename the columns from Turkish to English. The `Age` column is converted into categories (1–6) representing age groups. Similarly, the symptom score columns (`Positive_Symptom_Score`, `Negative_Symptom_Score`) and the `GAF_Score` column (Global Assessment of Functioning, a mental health assessment metric) are binned into numbered categories (1–5) based on severity ranges (e.g., scores ≤ 20 become category 1, and scores above 20 and up to 40 become category 2). This simplifies the data and makes it easier to analyze.

In [2]:
df = pd.read_csv("schizophrenia_dataset.csv")
df.dropna(inplace=True)
df.rename(columns={
    "Hasta_ID": "Patient_ID",
    "Yaş": "Age",
    "Cinsiyet": "Gender",
    "Eğitim_Seviyesi": "Education_Level",
    "Medeni_Durum": "Marital_Status",
    "Meslek": "Occupation",
    "Gelir_Düzeyi": "Income_Level",
    "Yaşadığı_Yer": "Living_Area",
    "Tanı": "Diagnosis",
    "Hastalık_Süresi": "Disease_Duration",
    "Hastaneye_Yatış_Sayısı": "Hospitalizations",
    "Ailede_Şizofreni_Öyküsü": "Family_History",
    "Madde_Kullanımı": "Substance_Use",
    "İntihar_Girişimi": "Suicide_Attempt",
    "Pozitif_Semptom_Skoru": "Positive_Symptom_Score",
    "Negatif_Semptom_Skoru": "Negative_Symptom_Score",
    "GAF_Skoru": "GAF_Score",
    "Sosyal_Destek": "Social_Support",
    "Stres_Faktörleri": "Stress_Factors",
    "İlaç_Uyumu": "Medication_Adherence"
}, inplace=True)

def categorize_ages(age):
    if age < 18:
        return 1
    elif 18 <= age < 25:
        return 2
    elif 25 <= age < 35:
        return 3
    elif 35 <= age < 45:
        return 4
    elif 45 <= age < 60:
        return 5
    else:
        return 6

df['Age'] = df["Age"].apply(categorize_ages)

def categorize_score(score):
    # Bands are right-inclusive: <=20 -> 1, (20, 40] -> 2, (40, 60] -> 3, (60, 80] -> 4, >80 -> 5.
    if score <= 20:
        return 1
    elif score <= 40:
        return 2
    elif score <= 60:
        return 3
    elif score <= 80:
        return 4
    else:
        return 5

# The two symptom scores and the GAF score share the same severity bands.
for column in ["Positive_Symptom_Score", "Negative_Symptom_Score", "GAF_Score"]:
    df[column] = df[column].apply(categorize_score)

df.head()

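
The binning rules described above can also be expressed as a lookup against a sorted list of band edges. The following is a minimal sketch using only the standard library; the names `SCORE_EDGES`, `AGE_EDGES`, `bin_score` and `bin_age` are hypothetical and not part of the notebook, and the edges are copied from the categorization functions in the cell above.

```python
from bisect import bisect_left, bisect_right

# Upper edges of the score bands (upper edge inclusive):
# <=20 -> 1, (20, 40] -> 2, (40, 60] -> 3, (60, 80] -> 4, >80 -> 5.
SCORE_EDGES = [20, 40, 60, 80]

# Lower edges of the age bands (lower edge inclusive):
# <18 -> 1, [18, 25) -> 2, [25, 35) -> 3, [35, 45) -> 4, [45, 60) -> 5, >=60 -> 6.
AGE_EDGES = [18, 25, 35, 45, 60]


def bin_score(score):
    """Return the 1-based band for a score; a score equal to an edge stays in the lower band."""
    return bisect_left(SCORE_EDGES, score) + 1


def bin_age(age):
    """Return the 1-based band for an age; an age equal to an edge moves to the upper band."""
    return bisect_right(AGE_EDGES, age) + 1


print([bin_score(s) for s in (0, 20, 21, 40, 80, 81)])
print([bin_age(a) for a in (17, 18, 24, 25, 59, 60)])
```

Assuming the DataFrame from the cell above, these helpers could be applied in the same way as the original functions, for example `df["GAF_Score"].apply(bin_score)`. Keeping the edges in one list makes the boundary behaviour (which side of an edge is inclusive) explicit in a single place.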

## Data Fitting
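Once the data has been pre-processed, we define the feature matrix `X` and the target `Y` (`Diagnosis`). We then split the data, using 20% for training and the remaining 80% for testing (`train_size=0.2`). Next, a `ColumnTransformer` standardizes the `Age`, `Hospitalizations`, `Positive_Symptom_Score`, `Negative_Symptom_Score` and `GAF_Score` columns and one-hot encodes the `Gender` and `Living_Area` columns. Because `ColumnTransformer` drops unlisted columns by default (`remainder="drop"`), the other columns selected in `X` are not passed to the model. The transformer is fitted on the training set only and then applied to the test set. We then apply the Synthetic Minority Oversampling Technique (SMOTE) to the training data, oversampling every class except the majority class (`sampling_strategy="not majority"`) to address class imbalance; the test set is left untouched. Finally, we fit a support vector classifier (`SVC`) with a linear kernel and `C=5` on the resampled training data.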

In [7]:
X = df[[
    "Age", "Gender", "Hospitalizations", "Occupation", "Education_Level",
    "Positive_Symptom_Score", "Negative_Symptom_Score", "GAF_Score",
    "Stress_Factors", "Medication_Adherence", "Family_History",
    "Social_Support", "Living_Area", "Suicide_Attempt", "Substance_Use"
]]
Y = df["Diagnosis"]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.2)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age", "Hospitalizations", "Positive_Symptom_Score", "Negative_Symptom_Score", "GAF_Score"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender", "Living_Area"])]
)

X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

smote = SMOTE(sampling_strategy="not majority")
X_resampled, y_resampled = smote.fit_resample(X_train_preprocessed, Y_train)

svc = SVC(kernel="linear", C=5)
svc.fit(X_resampled, y_resampled)
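
To illustrate what SMOTE does to the minority class, here is a minimal standard-library sketch of its core idea: each synthetic row is placed at a random position on the line segment between a minority row and one of its nearest minority neighbours. This is not the `imblearn` implementation; the function `smote_like_samples`, its parameters and the sample rows are hypothetical and chosen only for illustration.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (illustrative sketch only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Choose a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical, already-scaled minority-class rows with two numeric features.
minority_rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_rows = smote_like_samples(minority_rows, n_new=3)
print(new_rows)
```

Because the new rows are linear interpolations, one-hot encoded columns such as `Gender` and `Living_Area` can take fractional values in the synthetic samples. For mixed numeric and categorical data, `imblearn` also provides `SMOTENC`, which may be worth considering.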

## Predictions
Once the model has been fitted, we predict on the test set and evaluate the predictions with a classification report, which lists precision, recall, F1-score and support for each class along with the overall accuracy.

In [8]:
y_pred = svc.predict(X_test_preprocessed)
print(classification_report(Y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5700
           1       1.00      1.00      1.00      2300

    accuracy                           1.00      8000
   macro avg       1.00      1.00      1.00      8000
weighted avg       1.00      1.00      1.00      8000
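
For reference, the per-class columns of the report can be derived from simple counts of true positives, false positives and false negatives. The sketch below shows that arithmetic with the standard library only; the function name `precision_recall_f1` and the label lists are hypothetical and are not taken from the dataset or from scikit-learn.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly predicted positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # predicted positive, actually not
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels, used only to demonstrate the calculation.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred, positive=1))
```

The `support` column in the report is the number of true instances of each class in `Y_test`, and the macro average is the unweighted mean of the per-class values.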

