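Column names: an error is raised if they do not match the columns in the actual dataset, so adjust them to your data.  

External variables: use only values dated before the news publication date.  

Batch size / epochs: adjust to fit the available memory.  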
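
The first two notes can be checked mechanically before any training starts. The sketch below is one possible way to do it, not part of the original notebook: the helper names `missing_columns` and `future_columns` are hypothetical, and the second check assumes that post-publication columns always carry the `D_plus_` prefix used for the target columns further down.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in their original order."""
    have = set(available)
    return [c for c in required if c not in have]


def future_columns(columns, marker="D_plus_"):
    """Return columns whose name marks them as post-publication values (assumed naming rule)."""
    return [c for c in columns if marker in c]


# Illustrative column lists; in the notebook these would be df.columns and external_cols
available = ["article_preprocessed", "fx", "D_minus_1_date_close", "D_plus_1_date_close"]
required = ["article_preprocessed", "fx", "bond10y", "D_minus_1_date_close"]

missing = missing_columns(available, required)
leaky = future_columns(required)
print("missing:", missing)
print("post-publication features:", leaky)
```

In the notebook this would be called as `missing_columns(df.columns, external_cols + target_cols + ['article_preprocessed'])`, raising an error if the returned list is not empty.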

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics import r2_score


In [2]:
# 1. Load the data and build the targets
df = pd.read_csv("C:/Users/user/fin_project/db/news(23-25)_summarized_external_clean.csv")  # load the dataset

In [3]:
# For a quick test run: sample 1,000 rows
df = df.sample(1000, random_state=42).reset_index(drop=True)

In [4]:
# Use closing prices at several future horizons as the targets
target_cols = ['D_plus_1_date_close', 'D_plus_2_date_close', 'D_plus_3_date_close', 'D_plus_7_date_close', 'D_plus_14_date_close']
df['target'] = df[target_cols].values.tolist()  # each row holds its five D+N closing prices as a list


In [5]:
from sklearn.preprocessing import StandardScaler

# 1) Prepare the target scaler
target_scaler = StandardScaler()

# 2) Convert the target column to a NumPy array
target_array = np.array(df['target'].tolist())  # shape: (N_samples, 5)

# 3) Apply the scaling (note: fitted on all rows, before the train/validation split)
target_scaled = target_scaler.fit_transform(target_array)

# 4) Store in df (or pass straight to the Dataset)
df['target_scaled'] = target_scaled.tolist()


In [6]:
df['target_scaled']

0      [-0.6135334347162632, -0.5944887074609537, -0....
1      [-0.5614470148344942, -0.5414732724923887, -0....
2      [0.7564483848853699, 0.7237498535052072, 0.812...
3      [-0.44173943370143653, -0.43251126071805424, -...
4      [-0.27605035252330734, -0.28037862731967794, -...
                             ...                        
995    [-0.3333187262191037, -0.3170179002847319, -0....
996    [-0.6072061134283199, -0.5872989138965069, -0....
997    [-0.3338747298472182, -0.3159558923727014, -0....
998    [-0.4409054282592647, -0.4279977270919244, -0....
999    [-0.5559425789161604, -0.5379686463826879, -0....
Name: target_scaled, Length: 1000, dtype: object

In [7]:

# External-variable column names (only data from before the news publication date)
external_cols = [
    'fx', 'bond10y', 'base_rate',
    'D_minus_14_date_close', 'D_minus_14_date_volume', 'D_minus_14_date_foreign',
    'D_minus_14_date_institution', 'D_minus_14_date_individual',
    'D_minus_7_date_close', 'D_minus_7_date_volume', 'D_minus_7_date_foreign',
    'D_minus_7_date_institution', 'D_minus_7_date_individual',
    'D_minus_3_date_close', 'D_minus_3_date_volume', 'D_minus_3_date_foreign',
    'D_minus_3_date_institution', 'D_minus_3_date_individual',
    'D_minus_2_date_close', 'D_minus_2_date_volume', 'D_minus_2_date_foreign',
    'D_minus_2_date_institution', 'D_minus_2_date_individual',
    'D_minus_1_date_close', 'D_minus_1_date_volume', 'D_minus_1_date_foreign',
    'D_minus_1_date_institution', 'D_minus_1_date_individual'
]
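
The list above mixes rates, prices, and trading volumes, which are likely to live on very different numeric scales, yet the notebook feeds them to the model unscaled. A minimal sketch of standardizing them follows. It assumes the scaler is fitted on the training rows only and uses synthetic numbers in place of `df[external_cols].values`; the name `external_scaler` and the index split are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for df[external_cols].values:
# one rate-like column and one volume-like column (made-up magnitudes)
external = np.column_stack([
    rng.normal(3.5, 0.1, size=100),
    rng.normal(1e7, 2e6, size=100),
])

# Hypothetical split; the notebook uses train_test_split for this
train_idx, val_idx = np.arange(80), np.arange(80, 100)

# Fit on training rows only, then reuse the same statistics for validation rows
external_scaler = StandardScaler()
train_scaled = external_scaler.fit_transform(external[train_idx])
val_scaled = external_scaler.transform(external[val_idx])

print("train mean per column:", train_scaled.mean(axis=0))
print("train std per column:", train_scaled.std(axis=0))
```

Fitting on the training rows alone keeps validation statistics out of the features; the same reasoning would apply to `target_scaler`.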

In [8]:
# 2. Dataset class
class NewsDataset(Dataset):
    def __init__(self, df, tokenizer, external_cols, max_len=512):
        # Store the preprocessed article texts as a list
        self.texts = df['article_preprocessed'].tolist()
        # External variables named in external_cols, converted to a float tensor
        self.external = torch.tensor(df[external_cols].values, dtype=torch.float32)
        # Scaled multi-horizon closing prices built above, converted to a float tensor
        self.targets = torch.tensor(df['target_scaled'].values.tolist(), dtype=torch.float32)  # (N, 5)
        
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        # Number of samples in the dataset
        return len(self.texts)

    # Encode the news text with the tokenizer
    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'external': self.external[idx]
        }, self.targets[idx]

In [9]:
# 3. Model definition (multi-output), revised version
class NewsImportancePredictor(nn.Module):
    def __init__(self, embedding_model_name='klue/bert-base', 
                 external_feature_dim=28, ae_hidden=256, ae_bottleneck=32,  # 28 = len(external_cols) in this notebook
                 fcl_hidden=512, dropout=0.2):
        super().__init__()
        
        self.bert = AutoModel.from_pretrained(embedding_model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
        
        # autoencoder
        self.encoder = nn.Sequential(
            nn.Linear(768, ae_hidden),
            nn.GELU(),
            nn.Linear(ae_hidden, ae_bottleneck)
        )
        self.decoder = nn.Sequential(
            nn.Linear(ae_bottleneck, ae_hidden),
            nn.GELU(),
            nn.Linear(ae_hidden, 768)
        )

        # Predictor, revised: the output range is bounded by a Hardtanh
        self.predictor = nn.Sequential(
            nn.Linear(ae_bottleneck + external_feature_dim, fcl_hidden),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(fcl_hidden, fcl_hidden//2),
            nn.GELU(),
            nn.Linear(fcl_hidden//2, 5),
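            # nn.Tanh()  # alternative to try, to keep outputs from blowing up
            nn.Hardtanh(min_val=-2, max_val=10)  # clamps outputs to [-2, 10]; the gradient is zero outside that range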
        )

    def forward(self, input_ids, attention_mask, external):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        embeddings = outputs.last_hidden_state[:,0,:]
        latent = self.encoder(embeddings)
        reconstructed = self.decoder(latent)
        combined = torch.cat([latent, external], dim=1)
        pred = self.predictor(combined)
        return pred, reconstructed
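
The training log further down shows predictions sitting exactly at -2.0 and 10.0 while the validation loss stays unchanged between epochs. That pattern would be consistent with the final `Hardtanh` being saturated, although the notebook does not confirm the cause. The standalone sketch below only illustrates the mechanism: inputs outside the clamp range receive a zero gradient, so the layers before the clamp get no learning signal from those samples.

```python
import torch
import torch.nn as nn

# Same clamp range as the predictor head
clip = nn.Hardtanh(min_val=-2.0, max_val=10.0)

# Three pre-activation values: below the range, inside it, above it
x = torch.tensor([-5.0, 0.5, 25.0], requires_grad=True)
y = clip(x)
y.sum().backward()

outputs = y.detach().tolist()
grads = x.grad.tolist()
print("outputs:  ", outputs)  # [-2.0, 0.5, 10.0]
print("gradients:", grads)    # [0.0, 1.0, 0.0]
```

If saturation is indeed the problem here, dropping the clamp during training or standardizing the inputs to the predictor are two options worth trying.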


In [10]:
# 4. Autoencoder pre-training
# Train only the AE so that it reconstructs the BERT embeddings as they are
# BERT stays frozen
def train_autoencoder(model, dataloader, epochs=10, device='cuda'):
    ae_optimizer = torch.optim.AdamW(
        list(model.encoder.parameters()) + list(model.decoder.parameters()), lr=1e-4)
    criterion = nn.MSELoss()
    model.train()
    model.bert.eval()  # keep BERT's dropout off so the frozen embeddings are deterministic
    for epoch in range(epochs):
        total_loss = 0
        for batch in dataloader:
            input_ids = batch[0]['input_ids'].to(device)
            attention_mask = batch[0]['attention_mask'].to(device)
            # BERT is frozen here: no gradients flow through it
            with torch.no_grad():
                bert_outputs = model.bert(input_ids=input_ids, attention_mask=attention_mask)
                bert_embeds = bert_outputs.last_hidden_state[:,0,:]
            latent = model.encoder(bert_embeds)
            reconstructed = model.decoder(latent)
            loss = criterion(reconstructed, bert_embeds)
            ae_optimizer.zero_grad()
            loss.backward()
            ae_optimizer.step()
            total_loss += loss.item()
        print(f"[AE] Epoch {epoch+1}/{epochs} | Loss: {total_loss/len(dataloader):.4f}")
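
Freezing a backbone involves two separate switches: `torch.no_grad()` (or `requires_grad_(False)`) stops gradient computation, while `.eval()` turns off dropout. The sketch below shows both on a small stand-in module, since loading `klue/bert-base` is not needed to see the effect; `backbone` is a hypothetical substitute for `model.bert`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for model.bert: a layer followed by dropout
backbone = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))

backbone.requires_grad_(False)  # parameters are excluded from gradient computation
backbone.eval()                 # dropout becomes a no-op

x = torch.randn(4, 8)
with torch.no_grad():
    first = backbone(x)
    second = backbone(x)

frozen = all(not p.requires_grad for p in backbone.parameters())
print("all parameters frozen:", frozen)
print("same output on repeated calls:", torch.equal(first, second))
```

With the module left in training mode, the two forward passes would generally differ because dropout samples a new mask on each call, which makes the reconstruction target of the autoencoder noisy.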

In [11]:
import matplotlib.pyplot as plt

def evaluate_metrics(preds, targets):
    mse = nn.MSELoss()(preds, targets).item()
    mae = nn.L1Loss()(preds, targets).item()
    rmse = mse ** 0.5
    return mae, rmse

def plot_loss(train_losses, val_losses):
    plt.figure(figsize=(8, 6))
    plt.plot(train_losses, label='Train Loss')
    plt.plot(val_losses, label='Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Training & Validation Loss')
    plt.legend()
    plt.grid(True)
    plt.show()


In [12]:
def evaluate_r2(preds, targets):
    r2 = r2_score(targets, preds)
    return r2

def train_predictor(model, train_loader, val_loader, target_scaler, device='cuda', epochs=50, patience=5):
    optimizer = torch.optim.AdamW(model.predictor.parameters(), lr=1e-4)
    criterion = nn.MSELoss()
    best_val_loss = float('inf')
    patience_counter = 0
    best_model_state = None

    train_losses = []
    val_losses = []

    for epoch in range(epochs):
        model.train()
        model.bert.eval()  # BERT is not in the optimizer; keep its dropout off as well
        total_loss = 0

        for batch in train_loader:
            input_ids = batch[0]['input_ids'].to(device)
            attention_mask = batch[0]['attention_mask'].to(device)
            external = batch[0]['external'].to(device)
            targets = batch[1].to(device)

            preds, _ = model(input_ids, attention_mask, external)
            loss = criterion(preds, targets)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)
        train_losses.append(avg_train_loss)
        print(f"[Main] Epoch {epoch+1}/{epochs} | Train Loss: {avg_train_loss:.4f}")

        model.eval()
        with torch.no_grad():
            val_loss = 0
            all_preds = []
            all_targets = []

            for batch in val_loader:
                input_ids = batch[0]['input_ids'].to(device)
                attention_mask = batch[0]['attention_mask'].to(device)
                external = batch[0]['external'].to(device)
                targets = batch[1].to(device)

                preds, _ = model(input_ids, attention_mask, external)
                loss = criterion(preds, targets)
                val_loss += loss.item()

                all_preds.append(preds.cpu())
                all_targets.append(targets.cpu())

            avg_val_loss = val_loss / len(val_loader)
            val_losses.append(avg_val_loss)

            all_preds = torch.cat(all_preds, dim=0).numpy()
            all_targets = torch.cat(all_targets, dim=0).numpy()

            all_preds_inverse = target_scaler.inverse_transform(all_preds)
            all_targets_inverse = target_scaler.inverse_transform(all_targets)

            print("all_preds shape:", all_preds.shape)
            print("all_targets shape:", all_targets.shape)
            print("all_preds range: min =", all_preds.min(), ", max =", all_preds.max())
            print("all_targets range: min =", all_targets.min(), ", max =", all_targets.max())

            # MAE and RMSE in the original price units
            mae, rmse = evaluate_metrics(
                torch.tensor(all_preds_inverse, dtype=torch.float32),
                torch.tensor(all_targets_inverse, dtype=torch.float32)
            )
            
            # R2 as an additional metric
            r2 = evaluate_r2(all_preds_inverse, all_targets_inverse)

            print(f"        Validation Loss: {avg_val_loss:.4f} | MAE: {mae:.4f} | RMSE: {rmse:.4f} | R2: {r2:.4f}")

        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            patience_counter = 0
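            # state_dict() returns references to the live tensors, so clone them to keep a real snapshot
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}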
        else:
            patience_counter += 1
            print(f"        EarlyStopping patience {patience_counter}/{patience}")
            if patience_counter >= patience:
                print(f"        EarlyStopping triggered at epoch {epoch+1}!")
                if best_model_state is not None:
                    model.load_state_dict(best_model_state)
                break
        
    plot_loss(train_losses, val_losses)
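
One detail of the early-stopping logic above deserves care: the tensors returned by `state_dict()` share storage with the model's parameters, so a "best" state saved without copying keeps changing as training continues. The sketch below demonstrates this on a tiny layer; `copy.deepcopy` is used here as one way to take an independent snapshot.

```python
import copy

import torch
import torch.nn as nn

layer = nn.Linear(2, 1)

live_ref = layer.state_dict()                 # tensors share storage with the parameters
snapshot = copy.deepcopy(layer.state_dict())  # independent copy

# Simulate a later optimizer step that changes the weights in place
with torch.no_grad():
    layer.weight.add_(1.0)

ref_follows = torch.equal(live_ref["weight"], layer.weight)
snapshot_follows = torch.equal(snapshot["weight"], layer.weight)
print("plain reference follows the update:", ref_follows)       # True
print("deep-copied snapshot follows it:   ", snapshot_follows)  # False
```

Restoring with `load_state_dict(snapshot)` then brings back the weights from the best epoch rather than the most recent ones.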

In [13]:
# 6. Run the full pipeline
if __name__ == "__main__":
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    model = NewsImportancePredictor(external_feature_dim=len(external_cols)).to(device)
    dataset = NewsDataset(df, model.tokenizer, external_cols)
    
    # Train / Val Split
    train_idx, val_idx = train_test_split(np.arange(len(dataset)), test_size=0.2, random_state=42)
    train_set = torch.utils.data.Subset(dataset, train_idx)
    val_set = torch.utils.data.Subset(dataset, val_idx)

    # DataLoader for the AE only; a batch size of 16 is recommended
    ae_train_loader = DataLoader(train_set, batch_size=16, shuffle=True)

    # DataLoaders for predictor training, unchanged
    train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=32)

    # 1) Autoencoder pre-training (uses ae_train_loader)
    train_autoencoder(model, ae_train_loader, epochs=5, device=device)

    # 2) Train the main predictor (with early stopping)
    train_predictor(model, train_loader, val_loader, target_scaler, device=device, epochs=10, patience=2)


[AE] Epoch 1/5 | Loss: 0.6702
[AE] Epoch 2/5 | Loss: 0.3828
[AE] Epoch 3/5 | Loss: 0.3158
[AE] Epoch 4/5 | Loss: 0.2947
[AE] Epoch 5/5 | Loss: 0.2836
[Main] Epoch 1/10 | Train Loss: 41.2540
all_preds shape: (200, 5)
all_targets shape: (200, 5)
all_preds range: min = -2.0 , max = 10.0
all_targets range: min = -0.65086555 , max = 7.851671
        Validation Loss: 40.7646 | MAE: 906044.5625 | RMSE: 1155572.6007 | R2: -29.2957
[Main] Epoch 2/10 | Train Loss: 41.9861
all_preds shape: (200, 5)
all_targets shape: (200, 5)
all_preds range: min = -2.0 , max = 10.0
all_targets range: min = -0.65086555 , max = 7.851671
        Validation Loss: 40.7646 | MAE: 906044.5625 | RMSE: 1155572.6007 | R2: -29.2957
        EarlyStopping patience 1/2
[Main] Epoch 3/10 | Train Loss: 42.2840


KeyboardInterrupt: 

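Early stopping: train_predictor defaults to patience=5, meaning training stops once the validation loss has failed to improve for 5 consecutive epochs. The run above passes patience=2.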
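
The patience rule can be isolated from the training loop. The class below is a minimal sketch with a hypothetical name (`EarlyStopper`), not code from the notebook. It follows the same rule as `train_predictor`: an epoch counts as "bad" unless the validation loss is strictly lower than the best seen so far. The first two loss values are taken from the log above; the third is an assumed repeat for illustration.

```python
class EarlyStopper:
    """Track validation loss and signal when patience is exhausted."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopper(patience=2)
val_losses = [40.7646, 40.7646, 40.7646]  # third value is hypothetical

stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print("stopped at epoch:", stopped_at)  # 3
```

Because the comparison is strict, a loss that merely repeats counts against patience, which matches the "EarlyStopping patience 1/2" line printed after epoch 2.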

Independent variables: the text embedding plus external variables from before the news publication date only.