# Small_Business_Customer_Order_QnA
- train data: 2303354
- validation data: 295153
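
The validation figure can be cross-checked against the per-file conversation counts that the conversion cell prints in its output. The sketch below simply sums those printed counts; the dictionary name `validation_counts` is introduced here for illustration and is not part of the original notebook.

```python
# Per-file conversation counts copied from the conversion cell's output
# (validation split, one entry per generated .jsonl file).
validation_counts = {
    "가구인테리어": 31963,
    "건강": 10240,
    "기타": 9965,
    "디지털가전": 47231,
    "병원": 8159,
    "뷰티": 16065,
    "생활잡화": 41651,
    "슈퍼": 17504,
    "식품": 27690,
    "음식점": 12100,
    "의류": 26940,
    "출산육아": 23099,
    "카페": 9583,
    "패션": 12963,
}

# Sum over all 14 files; this should equal the validation total listed above.
total = sum(validation_counts.values())
print(len(validation_counts), total)  # → 14 295153
```

The sum matches the 295153 listed for the validation split, which suggests the headline numbers count conversations rather than raw CSV rows.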

In [3]:
import pandas as pd
import json
import os
import glob
from tqdm import tqdm

# Collect the CSV paths in sorted order
# train_path = sorted(glob.glob("../data/Small_Business_Customer_Order_QnA/Training/라벨링데이터_train/*.csv"))
train_path = sorted(glob.glob("../data/Small_Business_Customer_Order_QnA/Validation/라벨링데이터_validation/*.csv"))

# Extract the CSV file names (without the extension)
csv_names = [os.path.splitext(os.path.basename(path))[0] for path in train_path]

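# Define the preprocessing function
def process_df(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only the Q/A flag ("QA여부") and the utterance text ("발화문")
    df_processed = df[["QA여부", "발화문"]].copy()
    # Assign the result back instead of calling replace(..., inplace=True) on a
    # column selection, which raises the pandas chained-assignment FutureWarning
    df_processed["QA여부"] = df_processed["QA여부"].replace({"q": "user", "a": "system"})
    # Drop rows with a missing flag or utterance, then renumber the index
    df_processed = df_processed.dropna().reset_index(drop=True)
    return df_processed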

# Create the output directory
# output_dir = "../data/Small_Business_Customer_Order_QnA/Training/labeling_data_train_json"
output_dir = "../data/Small_Business_Customer_Order_QnA/Validation/labeling_data_val_json"

os.makedirs(output_dir, exist_ok=True)

# Process each CSV file and save the result as JSONL
for path, name in tqdm(zip(train_path, csv_names), total=len(train_path)):
    df = process_df(pd.read_csv(path))
    all_conversations = []
    current_dialogue = []

    for _, row in df.iterrows():
        role = row["QA여부"]
        content = row["발화문"].strip()

        if role not in ["user", "system"]:
            continue

        current_dialogue.append({
            "role": role,
            "content": content
        })

        # Close the conversation once a "system" reply directly follows a "user" turn
        if len(current_dialogue) >= 2 and \
           current_dialogue[-2]["role"] == "user" and current_dialogue[-1]["role"] == "system":
            all_conversations.append({"messages": current_dialogue.copy()})
            current_dialogue = []

    output_path = os.path.join(output_dir, f"{name}.jsonl")
    with open(output_path, "w", encoding="utf-8") as f:
        for convo in all_conversations:
            f.write(json.dumps(convo, ensure_ascii=False) + "\n")

    # Message reads: "<name>.jsonl saved! <N> conversations in total"
    print(f"✅ {name}.jsonl 저장 완료! 총 {len(all_conversations)}개 대화")


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_processed["QA여부"].replace({"q": "user", "a": "system"}, inplace=True)
  7%|▋         | 1/14 [00:02<00:28,  2.16s/it]

✅ 가구인테리어_validation.jsonl 저장 완료! 총 31963개 대화


 14%|█▍        | 2/14 [00:02<00:15,  1.30s/it]

✅ 건강_validation.jsonl 저장 완료! 총 10240개 대화


 21%|██▏       | 3/14 [00:03<00:11,  1.02s/it]

✅ 기타_validation.jsonl 저장 완료! 총 9965개 대화


 29%|██▊       | 4/14 [00:06<00:19,  1.91s/it]

✅ 디지털가전_validation.jsonl 저장 완료! 총 47231개 대화


 36%|███▌      | 5/14 [00:07<00:12,  1.42s/it]

✅ 병원_validation.jsonl 저장 완료! 총 8159개 대화


 43%|████▎     | 6/14 [00:08<00:10,  1.31s/it]

✅ 뷰티_validation.jsonl 저장 완료! 총 16065개 대화


 50%|█████     | 7/14 [00:11<00:12,  1.81s/it]

✅ 생활잡화_validation.jsonl 저장 완료! 총 41651개 대화


 57%|█████▋    | 8/14 [00:12<00:09,  1.59s/it]

✅ 슈퍼_validation.jsonl 저장 완료! 총 17504개 대화


 64%|██████▍   | 9/14 [00:14<00:08,  1.69s/it]

✅ 식품_validation.jsonl 저장 완료! 총 27690개 대화


 71%|███████▏  | 10/14 [00:15<00:05,  1.49s/it]

✅ 음식점_validation.jsonl 저장 완료! 총 12100개 대화


 79%|███████▊  | 11/14 [00:17<00:04,  1.57s/it]

✅ 의류_validation.jsonl 저장 완료! 총 26940개 대화


 86%|████████▌ | 12/14 [00:18<00:03,  1.59s/it]

✅ 출산육아_validation.jsonl 저장 완료! 총 23099개 대화


 93%|█████████▎| 13/14 [00:19<00:01,  1.32s/it]

✅ 카페_validation.jsonl 저장 완료! 총 9583개 대화


100%|██████████| 14/14 [00:20<00:00,  1.45s/it]

✅ 패션_validation.jsonl 저장 완료! 총 12963개 대화
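
To make the grouping rule in the cell above easier to inspect, here is a minimal sketch that runs the same logic on a toy DataFrame and round-trips the result through a temporary JSONL file. The helper name `to_conversations` and the toy rows are hypothetical; only the column names `QA여부` and `발화문` and the role mapping come from the notebook. It also assigns the `replace` result back to the column, which is the form the pandas warning recommends.

```python
import json
import os
import tempfile

import pandas as pd


def to_conversations(df: pd.DataFrame) -> list:
    """Group rows into conversations, closing one when a "system" turn follows a "user" turn."""
    turns = df[["QA여부", "발화문"]].copy()
    # Assign back instead of replace(..., inplace=True) on a column selection.
    turns["QA여부"] = turns["QA여부"].replace({"q": "user", "a": "system"})
    turns = turns.dropna().reset_index(drop=True)

    conversations, current = [], []
    for role, content in zip(turns["QA여부"], turns["발화문"]):
        if role not in ("user", "system"):
            continue
        current.append({"role": role, "content": str(content).strip()})
        if len(current) >= 2 and current[-2]["role"] == "user" and current[-1]["role"] == "system":
            conversations.append({"messages": current})
            current = []
    # Turns still in `current` here (no closing "system" reply) are discarded,
    # mirroring the loop in the cell above.
    return conversations


# Hypothetical toy rows; the real CSVs are not reproduced here.
toy = pd.DataFrame({
    "QA여부": ["q", "a", "q", "q", "a", "q", None],
    "발화문": [
        "Is this in stock?", "Yes, it is.",
        "How much is delivery?", "And how long does it take?", " Free, two days. ",
        "Thanks", "row with a missing flag",
    ],
})

conversations = to_conversations(toy)

# Write one JSON object per line, then read the file back to check the format.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy.jsonl")
    with open(out_path, "w", encoding="utf-8") as f:
        for convo in conversations:
            f.write(json.dumps(convo, ensure_ascii=False) + "\n")
    with open(out_path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f]

print(len(loaded), [len(c["messages"]) for c in loaded])  # → 2 [2, 3]
```

Two consecutive `user` rows end up in the same conversation, and the final unanswered `user` turn is dropped. Both behaviors follow from the rule that a conversation only closes on a `user` turn followed by a `system` turn.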



