<Input vectors>

0. patient_number(int)
1. Basic information: csv input (col 2-5)
=> info_vector = [patient_age, patient_sex, height_percentile, weight_percentile] *in order: int, 0 (male) or 1 (female), float, float
2. Bruise: manual input via Streamlit (col 6-16)
=> bruise_vector = [head_count, head_length, arms_count, arms_length, legs_count, legs_length, torso_count, torso_length, buttocks_count, buttocks_length, specific_shape] *_count (int), _length (float), specific_shape (0 or 1)
3. History taking: manual input via Streamlit (col 17-25)
=> response_vector = [consciousness_state, guardian_status, abuse_likely, match_explanation, developemental_stage, treatment_delayed, consistent_history, poor_condition, inappropriate_relationship] *0 (yes), 1 (no), or None (judgement withheld)
4. Lab: csv input (col 26-44)
=> lab_vector = [CBC_RBC, CBC_WBC, CBC_Platelet, Hb, PT_INR, aPTT, AST, ALT, ALP, Na, K, Cl, Calcium, Phosphorus, 25hydroxyvitaminD, Serum_albumin, Pre_albumin, Transferrin, Glucose] *all float
5. X-ray assessment: txt input (the .txt radiology reports for the individual body regions are merged into a single .txt file) (col 45-53)
=> xray_vector = [skull, ribs, humerus, radius_ulna, femur, tibia_fibula, pelvis, spiral_fx, metaphyseal_fx] *the first 7 are 0-10 (int), with 0 also used when there is no imaging for that site; the last 2 are 0 (no, or no imaging) or 1
6. Video/Audio: video (.mp4) input (col 54-83)
=> emotion_vector = [Happiness, Sadness, Anger, Surprise, Fear] *all float (0-1); omitted when the patient is unconscious
7. true_label: 0 (child abuse), 1 (not child abuse)
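
The column ranges in the list above are 1-based and inclusive, while `DataFrame.iloc` takes 0-based, end-exclusive slices. The following is a minimal sketch of that conversion; `COLUMN_LAYOUT`, `to_slice`, and `block_widths` are hypothetical names introduced here for illustration, and the video block is labelled `video_vector` as an assumption.

```python
# 1-based, inclusive column ranges copied from the list above.
# The dictionary and helper names are illustrative, not part of the original notebook.
COLUMN_LAYOUT = {
    "info_vector": (2, 5),
    "bruise_vector": (6, 16),
    "response_vector": (17, 25),
    "lab_vector": (26, 44),
    "xray_vector": (45, 53),
    "video_vector": (54, 83),
}


def to_slice(first_col, last_col):
    """Convert a 1-based inclusive column range to a 0-based, end-exclusive slice."""
    return slice(first_col - 1, last_col)


def block_widths(layout):
    """Number of columns in each block."""
    return {name: last - first + 1 for name, (first, last) in layout.items()}


widths = block_widths(COLUMN_LAYOUT)
print(widths)
print(sum(widths.values()))  # feature columns, excluding patient_number and true_label
```

With a DataFrame `data` loaded from the CSV, `data.iloc[:, to_slice(2, 5)]` would select the same four columns as `data.iloc[:, 1:5]`.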

<Output vector>

=> abuse_risk_score(int), abuse_cause = [cause1(str), involvement_rate1(int), cause2(str), involvement_rate2(int), cause3(str), involvement_rate3(int)]
*For now, an XGBoost model trained on arbitrary placeholder data (x_train, y_train) is used
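
As a rough illustration of the output format described above, the sketch below builds both outputs from plain Python values. It assumes the classifier returns P(true_label == 1), that is, the probability of "not child abuse", and that involvement rates are integer percentages of the total feature importance. `risk_score` and `build_abuse_cause` are hypothetical helper names, and the importance values in the example are made up.

```python
def risk_score(p_not_abuse):
    """Abuse risk in percent, as an int, from P(true_label == 1)."""
    return round(100 * (1 - p_not_abuse))


def build_abuse_cause(importance, k=3):
    """Flat [cause1, rate1, cause2, rate2, ...] list with integer percent rates.

    `importance` maps a feature name to a non-negative importance value.
    Rates are relative to the sum over all features, so the top-k rates
    do not necessarily add up to 100.
    """
    total = sum(importance.values())
    if total == 0:
        return []
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)[:k]
    abuse_cause = []
    for name, value in ranked:
        abuse_cause += [name, round(100 * value / total)]
    return abuse_cause


# Made-up importance values, for illustration only.
example = {"head_count": 6, "PT_INR": 3, "ribs": 1, "Na": 0}
print(risk_score(0.25))
print(build_abuse_cause(example))
```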

In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Read the CSV file; the first row is used as the column names
file_path = "./final_files/combined.CSV"  # set the file path here
data = pd.read_csv(file_path)

# Split the columns into one block per vector
info_vector = data.iloc[:, 1:5]  # column 2-5
bruise_vector = data.iloc[:, 5:16]  # column 6-16
response_vector = data.iloc[:, 16:25]  # column 17-25
lab_vector = data.iloc[:, 25:44]  # column 26-44
xray_vector = data.iloc[:, 44:53] # column 45-53
video_vector = data.iloc[:, 53:83] # column 54-83


# Zero-pad every block to the width of the widest one
blocks = [info_vector, bruise_vector, response_vector, lab_vector, xray_vector, video_vector]
max_len = max(block.shape[1] for block in blocks)
padded_blocks = [
    np.pad(block.to_numpy(), ((0, 0), (0, max_len - block.shape[1])), 'constant')
    for block in blocks
]

# Concatenate the padded blocks column-wise into a single DataFrame
X = pd.concat([pd.DataFrame(block) for block in padded_blocks], axis=1)

# The first column (patient_number) and the last column (true_label) are not features
y = data.iloc[:, -1]  # the last column is the label y

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=None)

# Cast features and labels to float
X_train = X_train.astype(float)
X_test = X_test.astype(float)
y_train = y_train.astype(float)
y_test = y_test.astype(float)

# Convert to numpy arrays
X_train_np = X_train.values  # or X_train.to_numpy()
y_train_np = y_train.values

# Convert to DMatrix
dtrain = xgb.DMatrix(X_train_np, label=y_train_np)
dtest = xgb.DMatrix(X_test.values, label=y_test.values)


# Model parameters
params = {
    'objective': 'binary:logistic',
    'max_depth': 5,
    'eta': 0.1,
    'verbosity': 0
}
num_round = 100

# Train the model
model = xgb.train(params, dtrain, num_round)

# With 'binary:logistic', predict() returns P(true_label == 1), i.e. "not child abuse",
# so the abuse risk of the first test sample is 100 * (1 - p), rounded to an int.
abuse_risk_score = int(round(100 * (1 - float(model.predict(dtest)[0]))))

# Extract the top 3 causes from the feature importances
importance = model.get_score(importance_type='weight')

# Normalize the importance values to sum to 1 for involvement rate calculation
total_importance = sum(importance.values())
normalized_importance = {k: v / total_importance for k, v in importance.items()}

# Sort and get the top 3 important features
sorted_importance = sorted(normalized_importance.items(), key=lambda x: x[1], reverse=True)
top_3_causes = sorted_importance[:3]

# Name every column of X. X is six blocks, each zero-padded to max_len columns,
# so the names come from the CSV header of each block plus placeholders for the padding.
blocks = [info_vector, bruise_vector, response_vector, lab_vector, xray_vector, video_vector]
feature_names = []
for block in blocks:
    names = list(block.columns)
    feature_names += names + ['(padding)'] * (max_len - len(names))

# Create abuse_cause vector with the top 3 features and their involvement rates
abuse_cause = []
for feature, rate in top_3_causes:
    # Extract the column index from the default feature key (e.g., 'f0' -> 0)
    feature_index = int(feature[1:])  # Remove 'f' and convert to int
    feature_name = feature_names[feature_index]  # Map to human-readable feature name
    involvement_rate = round(rate, 3)  # Already normalized to 0-1; round for readability
    abuse_cause.append((feature_name, involvement_rate))

# Output the normalized abuse risk score and top 3 causes
print("Abuse Risk Score:", abuse_risk_score, "%")
for i, cause in enumerate(abuse_cause, 1):
    print(f"Cause {i}: {cause[0]}, Involvement Rate: {cause[1]}")
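
Because the `DMatrix` is built from a plain numpy array, `get_score` reports features under default keys such as `f0`, `f1`, and so on, which index columns of the padded matrix rather than columns of the CSV. The sketch below shows, under the assumption that each block is zero-padded to a common width as in the cell above, how such a key decomposes into a block and an offset. `locate_feature` and the `widths` dictionary are hypothetical and written out here only for illustration.

```python
# Block widths derived from the column ranges in the description (assumed layout).
widths = {
    "info_vector": 4,
    "bruise_vector": 11,
    "response_vector": 9,
    "lab_vector": 19,
    "xray_vector": 9,
    "video_vector": 30,
}
max_len = max(widths.values())  # every block is padded to this many columns


def locate_feature(key, widths, max_len):
    """Map a default feature key such as 'f35' to (block_name, offset).

    Returns None when the index falls in a zero-padded column.
    """
    index = int(key[1:])  # strip the leading 'f'
    block_index, offset = divmod(index, max_len)
    name, width = list(widths.items())[block_index]
    return (name, offset) if offset < width else None


print(locate_feature("f0", widths, max_len))
print(locate_feature("f35", widths, max_len))
print(locate_feature("f4", widths, max_len))
```

An index past the real columns of a block, such as `f4` here, lands in padding, which is why a flat list of feature names cannot be indexed directly with the number taken from the key.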