# Building the information files for PAS

## Item key file

In [5]:
import re
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

RAW_TEXT = r"""
## [Academic Self-Efficacy] - 41 items

I usually do well in mathematics.
Mathematics is harder for me than for most others. -
Math is not one of my strengths. -
Mathematics is easy for me.

I am good at working out difficult mathematics problems.

I am good at explaining mathematics problems.

Mathematics is harder for me than any other subject. -

Mathematics makes me confused. -

How confident in math tasks: working out how long it would take to get from one place to another using a timetable

How confident in math tasks: Calculating how much more expensive a computer would be after adding tax

How confident in math tasks: Calculating how many square metres of tiles you need to cover a floor

How confident in math tasks: Solving an equation like 6x²+5=29

How confident in math tasks: Finding the actual distance between two places on a map with a 1:10,000 scale

How confident in math tasks: Solving an equation like 2(x+3) = (x+3)(x-3)

How confident in math tasks: Calculating the power consumption of an electronic appliance per week

How confident in math tasks: Solving an equation like 3x+5=17

How confident in math tasks: Extracting mathematical information from diagrams, graphs, or simulations

How confident in math tasks: Interpreting mathematical solutions in the context of a real-life challenge

How confident in math tasks: Using the concept of statistical variation to make a decision

How confident in math tasks: Identifying mathematical aspects of a real-world problem

How confident in math tasks: Identifying constraints and assumptions behind mathematical modelling

How confident in math tasks: Representing a situation mathematically using variables, symbols, or diagrams

How confident in math tasks: Evaluating the significance of observed patterns in data

How confident in math tasks: Calculating the properties of an irregularly shaped object

Confident can do in future: Planning when to do school work on my own

Confident can do in future: Completing school work independently

Confident can do in future: Assessing my progress with learning

I believe I am competent in math area.

In comparison to other classmates, I believe I did fairly well in math task.

Mathematics is a task that I believed I was pretty good at.

I felt very competent after working on the math task for a long time.

Even though there are many school assignments, I can finish them on time.

I am capable of completing difficult tasks.

 I know how to schedule my time to accomplish my tasks.

I know how to take notes.

I know how to study to perform well on tests.

I am good at finding information and writing academic assignments.

I am a very good student.

I usually do very well in school and at academic tasks.

I understand my academic tasks.

I am very capable of succeeding at school.

## [Intrinsic Motivation] - 37 items

I enjoy learning mathematics
I wish I did not have to study mathematics. -
Mathematics is boring. -
I learn many interesting things in mathematics.
I like mathematics.
I like any schoolwork that involves numbers.
I like to solve mathematics problems.
I look forward to mathematics class.
Mathematics is one of my favorite subjects.

I experience pleasure and satisfaction while learning new things.

I feel pleasure when I discover things I have never seen before.

I feel pleasure when I surpass myself in my studies.

I experience pleasure when I read subjects that I find interesting.

I feel pleasure when I surpass myself in my personal accomplishments.

I experience pleasure in broadening my knowledge about subjects that appeal to me.

I feel completely absorbed when I read things that deeply interest me.

I experience intense positive feelings when I communicate my own ideas to others.

I was thinking about how much I appreciated the math task while I worked on it.

I believed it was my decision to do the math assignment.

The math assignment seemed fascinating to me.

It was enjoyable to do the math work.

I had a great time doing the math work.

I am pleased with my performance on the math assignment.

I found the math to be tedious. -

While working on the math task, I felt like I was doing what I wanted to do.

I found the math work to be quite intriguing.

I found the math assignment to be quite enjoyable.

I am curious about many different things.

I am more curious than most people I know.

I spend time to find more information about things that interest me.

I like learning new things.

This school year, how often: I lost interest during mathematics lessons. -

I enjoy projects that require creative solutions.

I enjoy thinking about new ways to solve problems.

I enjoy solving complex problems.

I enjoy learning new things.

During COVID closures, agree/disagree: I enjoyed learning by myself.

## [academic stress] - 36 items

I found it hard to wind down.

I tended to over-react to situations.

I felt that I was using a lot of nervous energy.

I found myself getting agitated.

I found it difficult to relax.

I was intolerant of anything that kept me from getting on with what I was doing.

I felt that I was rather touchy.

I have done my schoolwork to the fullest, but my grades are still bad; because of that, it took me a long time to get excited again.

It is very difficult to accept when you experience failure to achieve academic achievement.

Poor grades on schoolwork have affected my confidence.

I evaluate myself when I get an unsatisfactory grade.

I feel stressed because my grades this semester need to be better than previous semester.

I can't finish my schoolwork optimally and on time if there are distractions.

I choose not to do and submit assignments because these assignments are beyond my ability.

When I have difficulty doing schoolwork, it is very difficult for me to find a solution.

For me, doing difficult school assignments is a precious life experience. -

Study stress has taken over me, so I am lazy to study and do assignments.

It was hard for me to face the pressure of work from school teachers.

The pressure of the current assignments makes me not sure I can complete the existing subject scores.

I feel nervous about approaching exams.

I often worry that it will be difficult for me in mathematics classes.

I get very tense when I have to do mathematics homework.

I get very nervous doing mathematics problems.

I worry about performing poorly in mathematics.

I feel anxious about failing in mathematics.

I worry that I am not prepared for life after finishing my current stage of education.

I feel pressure from my family to follow a specific educational or career path.

I worry that I will not have enough money to do what I would like to do in the future.

I worry that my academic performance will affect my future opportunities.

Worries about mathematics performance cause me stress.

During COVID closures, I felt anxious about school work.

How did you feel the last time you attended a mathematics class at school: Nervous or tense

How did you feel the last time you did your homework/studied for school: Nervous or tense

While performing the math work, I felt tense.

During the math task, I was nervous.

During the math assignment, I felt under duress.
"""


header_pattern = re.compile(r"^##\s*\[(.+?)\]", re.IGNORECASE)
rev_dash_pattern = re.compile(r"\s*-\s*$")
bold_wrapper_pattern = re.compile(r"^\*{1,2}(.+?)\*{1,2}$")

def clean_line(line: str) -> str:
    line = line.strip()
    if not line:
        return ""
    m = bold_wrapper_pattern.match(line)
    if m:
        line = m.group(1).strip()
    return line.strip()

def detect_reverse(line: str) -> tuple:
    sign_char = "+"
    if rev_dash_pattern.search(line):
        sign_char = "-"
        line = rev_dash_pattern.sub("", line).rstrip()
    return line.strip(), sign_char

def normalize_group(g: str) -> tuple:
    gl = g.lower()
    if "self-efficacy" in gl:
        return "Academic Self-Efficacy", "se"
    if "intrinsic" in gl:
        return "Intrinsic Motivation", "im"
    if "stress" in gl:
        return "Academic Stress", "as"
    return g, "x"

# Parse RAW_TEXT line by line
lines = [ln.rstrip() for ln in RAW_TEXT.splitlines()]
rows = []
current_group = None

for raw in lines:
    ln = raw.strip()
    hm = header_pattern.match(ln)
    if hm:
        current_group = hm.group(1).strip()
        continue
    if not ln or current_group is None:
        continue
    ln = clean_line(ln)
    if not ln:
        continue
    item_text, sign_char = detect_reverse(ln)
    if item_text:
        rows.append({"Group_Raw": current_group, "Item": item_text, "SignChar": sign_char})

df_raw = pd.DataFrame(rows)

# Normalize group names and build the keys
df_raw[["Facet", "Prefix"]] = df_raw["Group_Raw"].apply(lambda x: pd.Series(normalize_group(x)))
df_raw["Counter"] = df_raw.groupby("Prefix").cumcount() + 1
df_raw["Key"] = df_raw["Prefix"].astype(str)  # facet-level key only: se, im, as (lowercase); Counter is not appended
df_raw["Sign"] = df_raw["SignChar"] + df_raw["Key"]  # sign + key, e.g. +se, -se

df_raw = df_raw.reset_index(drop=True)
df_raw["Full#"] = range(len(df_raw))
df_raw["Short#"] = range(len(df_raw))

# Final ItemKey DataFrame
df_item_key = df_raw[["Full#", "Short#", "Sign", "Key", "Facet", "Item"]].copy()

print("=" * 80)
print("ItemKey created")
print("=" * 80)
print(f"Total items: {len(df_item_key)}")
print("\nItems per group:")
print(df_item_key.groupby('Facet').size())
# Count '-' signs directly so this line also works when there are no reverse-keyed items
print(f"\nReverse-keyed (-) items: {(df_item_key['Sign'].str[0] == '-').sum()}")
df_item_key.head(10)

output_file = "all_data_ItemKey.xlsx"
df_item_key.to_excel(output_file, index=False)
print(f"✅ Saved: {output_file}")

ItemKey created
Total items: 114

Items per group:
Facet
Academic Self-Efficacy    41
Academic Stress           36
Intrinsic Motivation      37
dtype: int64

Reverse-keyed (-) items: 9
✅ Saved: all_data_ItemKey.xlsx
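
The reverse-keyed marker convention used in `RAW_TEXT` (a trailing ` -` after the item text) can be checked in isolation. The sketch below is a minimal illustration rather than part of the notebook: `split_reverse_marker` is a hypothetical helper name, and it reuses the same regular expression as `rev_dash_pattern`.

```python
import re

# Same pattern as rev_dash_pattern in the cell above:
# optional whitespace, a dash, optional whitespace, end of line.
REV_DASH = re.compile(r"\s*-\s*$")

def split_reverse_marker(line: str) -> tuple:
    """Return (item_text, sign); sign is '-' for reverse-keyed items.

    Hypothetical helper, for illustration only.
    """
    line = line.strip()
    if REV_DASH.search(line):
        return REV_DASH.sub("", line).strip(), "-"
    return line, "+"

examples = [
    "Mathematics is boring. -",
    "I like mathematics.",
    "How confident in math tasks: Solving an equation like 3x+5=17",
]
results = [split_reverse_marker(s) for s in examples]
for text, sign in results:
    print(sign, text)
```

Note that the pattern matches any trailing dash, so an item whose text legitimately ends in `-` would be treated as reverse-keyed; dashes inside the text, as in `(x+3)(x-3)`, are not affected.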


## Test-set split file

In [2]:
"""
TIMSS Item Key - build a balanced train/test split (balanced sampling across 3 scales)

Requirements covered:
1) Among Academic self-efficacy / Intrinsic motivation / Academic stress, take the
   group with the fewest items as the reference (min_n) and randomly sample the
   other groups down to min_n items
   (e.g. 41, 37, 37 -> use 37 from each)
2) Split each group into (roughly) 80% train and 20% test
   - test_n = round(min_n * 0.2)
   - train_n = min_n - test_n
3) Save train_index / test_index to JSON, expressed as indices (Id) into the full original item list
   + (optional) also record which Ids were sampled/split in each group

Notes:
- This code assumes the Item Key Excel file (or DataFrame) has at least the [Id, Group] columns.
- It was written to read the earlier 'timss_ItemKey_extended.xlsx' (Id, Key, Item, Sign, Group) as-is.
"""

import json
import random
import pandas as pd

# =============================================================================
# 0) Settings
# =============================================================================
RANDOM_SEED = 42          # set to None (or delete the seed call) if reproducibility is not needed
ITEMKEY_FILE = "all_data_ItemKey.xlsx"   # ItemKey file created above
OUTPUT_JSON = "traintest_split_balanced_80_20.json"

# Group names must match the Group values in the ItemKey file
GROUPS = [
    "Academic Self-Efficacy",
    "Intrinsic Motivation",
    "Academic Stress",
]

TRAIN_RATIO = 0.80
TEST_RATIO  = 0.20

if RANDOM_SEED is not None:
    random.seed(RANDOM_SEED)

# =============================================================================
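# 1) Load the ItemKey (needs Id and Group)
# =============================================================================
df_item = pd.read_excel(ITEMKEY_FILE)

# The all_data_ItemKey.xlsx written by the item key cell stores these columns as
# Full# and Facet; map them to Id / Group (skipped if Id / Group already exist).
for src, dst in {"Full#": "Id", "Facet": "Group"}.items():
    if src in df_item.columns and dst not in df_item.columns:
        df_item = df_item.rename(columns={src: dst})

required_cols = {"Id", "Group"}
missing = required_cols - set(df_item.columns)
if missing:
    raise ValueError(f"The ItemKey file is missing required columns: {missing}. "
                     f"Columns found: {list(df_item.columns)}")

# Cast Id to int / check for duplicates
df_item["Id"] = df_item["Id"].astype(int)
if df_item["Id"].duplicated().any():
    dup = df_item[df_item["Id"].duplicated()]["Id"].tolist()
    raise ValueError(f"Duplicate Ids found. Examples: {dup[:10]}")

# Filter to the target groups and check that each one is present
df_item = df_item[df_item["Group"].isin(GROUPS)].copy()
for g in GROUPS:
    if (df_item["Group"] == g).sum() == 0:
        raise ValueError(f"Group '{g}' has 0 items. "
                         f"Make the Group labels in the ItemKey match the GROUPS list.")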

# =============================================================================
# 2) Count items per group, then balance by sampling down to the minimum (min_n)
# =============================================================================
group_to_ids = {g: sorted(df_item.loc[df_item["Group"] == g, "Id"].tolist()) for g in GROUPS}
group_sizes = {g: len(ids) for g, ids in group_to_ids.items()}
min_n = min(group_sizes.values())

print("=" * 80)
print("Computing the balanced-sampling reference (min_n)")
print("=" * 80)
for g in GROUPS:
    print(f"- {g:22s}: {group_sizes[g]} items")
print(f"\n=> min_n = {min_n} (all groups matched to the smallest group)")

# Randomly sample min_n items from each group (a group already at min_n is used as-is)
balanced_ids = {}
for g, ids in group_to_ids.items():
    if len(ids) == min_n:
        chosen = ids[:]  # keep as-is
    else:
        chosen = random.sample(ids, min_n)
        chosen.sort()
    balanced_ids[g] = chosen

# =============================================================================
# 3) Per-group train/test (80/20) split
# =============================================================================
# The ratio only needs to be approximate, so test_n is rounded to land near 20%
test_n = int(round(min_n * TEST_RATIO))
test_n = max(1, min(test_n, min_n - 1))   # safeguard: never put everything in test or in train
train_n = min_n - test_n

print("\n" + "=" * 80)
print("Train/Test sizes")
print("=" * 80)
print(f"min_n={min_n} -> train_n={train_n} (~{TRAIN_RATIO:.0%}), test_n={test_n} (~{TEST_RATIO:.0%})")

split_by_group = {}
train_index = []
test_index = []

for g, ids in balanced_ids.items():
    test_ids = random.sample(ids, test_n)
    train_ids = [i for i in ids if i not in test_ids]

    test_ids.sort()
    train_ids.sort()

    split_by_group[g] = {
        "balanced_ids": ids,
        "train_ids": train_ids,
        "test_ids": test_ids,
        "n_balanced": len(ids),
        "n_train": len(train_ids),
        "n_test": len(test_ids),
    }

    train_index.extend(train_ids)
    test_index.extend(test_ids)

train_index = sorted(train_index)
test_index  = sorted(test_index)

# =============================================================================
# 4) Save JSON
# =============================================================================
split_data = {
    "config": {
        "seed": RANDOM_SEED,
        "train_ratio": TRAIN_RATIO,
        "test_ratio": TEST_RATIO,
        "min_n_per_group": min_n,
        "train_n_per_group": train_n,
        "test_n_per_group": test_n,
        "groups": GROUPS,
        "itemkey_file": ITEMKEY_FILE,
    },
    "train_index": train_index,
    "test_index": test_index,
    # for debugging/reproducibility (safe to remove)
    "by_group": split_by_group,
}

with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
    json.dump(split_data, f, indent=4, ensure_ascii=False)

# =============================================================================
# 5) Summary output
# =============================================================================
print("\n" + "=" * 80)
print("Summary")
print("=" * 80)
print(f"- Total balanced items: {min_n * len(GROUPS)}")
print(f"- Train total: {len(train_index)}")
print(f"- Test  total: {len(test_index)}")
print(f"\n✓ Saved: {OUTPUT_JSON}")

print("\nPer-group details:")
for g in GROUPS:
    info = split_by_group[g]
    print(f"  [{g}] balanced={info['n_balanced']}, train={info['n_train']}, test={info['n_test']}")


Computing the balanced-sampling reference (min_n)
- Academic Self-Efficacy: 41 items
- Intrinsic Motivation  : 37 items
- Academic Stress       : 36 items

=> min_n = 36 (all groups matched to the smallest group)

Train/Test sizes
min_n=36 -> train_n=29 (~80%), test_n=7 (~20%)

Summary
- Total balanced items: 108
- Train total: 87
- Test  total: 21

✓ Saved: traintest_split_balanced_80_20.json

Per-group details:
  [Academic Self-Efficacy] balanced=36, train=29, test=7
  [Intrinsic Motivation] balanced=36, train=29, test=7
  [Academic Stress] balanced=36, train=29, test=7
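
The size arithmetic and the per-group sampling above can be condensed into a single function. This is a sketch under stated assumptions, not the notebook's code: `balanced_split` is a hypothetical name, the toy ids are made up for illustration, and it uses a local `random.Random` instance instead of the module-level seed, so it will not reproduce the exact ids saved in the JSON.

```python
import random

def balanced_split(group_to_ids, test_ratio=0.2, seed=42):
    """Sample every group down to the smallest group size, then split each group.

    Hypothetical helper; returns {group: {"train": [...], "test": [...]}}.
    """
    rng = random.Random(seed)  # local RNG, independent of random.seed()
    min_n = min(len(ids) for ids in group_to_ids.values())
    # Keep at least one item on each side of the split.
    test_n = max(1, min(round(min_n * test_ratio), min_n - 1))
    out = {}
    for group, ids in group_to_ids.items():
        chosen = rng.sample(sorted(ids), min_n)
        test = set(rng.sample(chosen, test_n))
        out[group] = {
            "train": sorted(i for i in chosen if i not in test),
            "test": sorted(test),
        }
    return out

# Toy ids with the same group sizes as in the output above (41, 37, 36).
toy = {
    "Academic Self-Efficacy": list(range(0, 41)),
    "Intrinsic Motivation": list(range(41, 78)),
    "Academic Stress": list(range(78, 114)),
}
split = balanced_split(toy)
for group, parts in split.items():
    print(group, len(parts["train"]), len(parts["test"]))
```

With group sizes 41/37/36, `min_n` is 36 and `round(36 * 0.2)` is 7, so every group ends up with 29 train and 7 test items, matching the sizes reported above.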


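# Generating stylized response-pattern samples (Synthetic Test Dataset)

## Purpose
Combine positive/negative response patterns for each of the three variables SE, IM, and AS to generate 2³=8 stylized samples.

## Response patterns
- **Positive response**: forward-keyed items (+) score 2, reverse-keyed items (-) score 1
- **Negative response**: forward-keyed items (+) score 1, reverse-keyed items (-) score 2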
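
One way to see why the two rules mirror each other: on a two-point scale coded 1 and 2, reverse-keyed items are commonly rescored as `3 - value`, after which a positive responder scores 2 on every item and a negative responder scores 1. The sketch below illustrates this; `raw_response` and `rescore` are hypothetical names, and the rescoring step is an assumption about downstream scoring, not something this notebook performs.

```python
def raw_response(sign: str, positive: bool) -> int:
    """Raw answer (1 or 2) for a '+' or '-' keyed item."""
    # A positive responder agrees with '+' items and disagrees with '-' items.
    agrees = positive if sign == "+" else not positive
    return 2 if agrees else 1

def rescore(sign: str, value: int) -> int:
    """Reverse-score '-' items so that 2 always means 'high on the construct'."""
    return value if sign == "+" else 3 - value

table = {
    (sign, positive): (
        raw_response(sign, positive),
        rescore(sign, raw_response(sign, positive)),
    )
    for sign in "+-"
    for positive in (True, False)
}
for (sign, positive), (raw, scored) in table.items():
    print(f"sign={sign} positive={positive} raw={raw} rescored={scored}")
```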

## The 8 patterns
1. SE+, IM+, AS+ (all positive)
2. SE+, IM+, AS-
3. SE+, IM-, AS+
4. SE+, IM-, AS-
5. SE-, IM+, AS+
6. SE-, IM+, AS-
7. SE-, IM-, AS+
8. SE-, IM-, AS- (all negative)

In [1]:
import pandas as pd
import json
from itertools import product

# Load the Excel file
df_itemkey = pd.read_excel('all_data_ItemKey.xlsx')

print("=" * 80)
print("Source Excel file info")
print("=" * 80)
print(f"Total items: {len(df_itemkey)}")
print(f"Unique Key values: {df_itemkey['Key'].unique()}")
print("\nProblem: Key only contains 'se', 'im', 'as', so individual items cannot be told apart")
print("Fix: use the Full# value to build keys of the form 'i0', 'i1', 'i2' ... 'i113'")

# Build keys from Full# (i0, i1, i2, ..., i113)
df_itemkey['ResponseKey'] = 'i' + df_itemkey['Full#'].astype(str)

print("\n" + "=" * 80)
print("Converted key structure")
print("=" * 80)
print(df_itemkey[['Full#', 'Sign', 'Key', 'ResponseKey', 'Facet']].head(15))


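def get_response_value(sign_char, is_positive):
    """
    Return the response value for an item given its Sign and the response pattern (Y/N scale).
    
    Args:
        sign_char: '+' or '-' (forward-keyed / reverse-keyed)
        is_positive: True (positive response) or False (negative response)
    
    Returns:
        Response value (1 or 2)
    """
    if is_positive:
        # Positive response: forward-keyed (+) items get 2, reverse-keyed (-) items get 1
        return 2 if sign_char == '+' else 1
    else:
        # Negative response: forward-keyed (+) items get 1, reverse-keyed (-) items get 2
        return 1 if sign_char == '+' else 2


# Build the 8 patterns (True = positive, False = negative)
patterns = list(product([True, False], repeat=3))

print("\n" + "=" * 80)
print("The 8 patterns to generate")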
print("=" * 80)
for i, (se_pos, im_pos, as_pos) in enumerate(patterns, 1):
    se_label = 'SE+' if se_pos else 'SE-'
    im_label = 'IM+' if im_pos else 'IM-'
    as_label = 'AS+' if as_pos else 'AS-'
    print(f"Pattern {i}: {se_label}, {im_label}, {as_label}")


# Generate the samples
synthetic_samples = []

for pattern_idx, (se_positive, im_positive, as_positive) in enumerate(patterns, 1):
    sample = {
        'case': f'synthetic_pattern_{pattern_idx}',
        'pattern': {
            'SE': 'positive' if se_positive else 'negative',
            'IM': 'positive' if im_positive else 'negative',
            'AS': 'positive' if as_positive else 'negative'
        }
    }
    
    # Generate a response value for each item (accounting for reverse-keyed items)
    for _, row in df_itemkey.iterrows():
        response_key = row['ResponseKey']  # i0, i1, i2, ..., i113
        sign_char = row['Sign'][0]  # '+' or '-'
        facet = row['Facet']
        
        # Determine which variable this item belongs to
        if facet == 'Academic Self-Efficacy':
            is_positive = se_positive
        elif facet == 'Intrinsic Motivation':
            is_positive = im_positive
        elif facet == 'Academic Stress':
            is_positive = as_positive
        else:
            is_positive = True  # default
        
        # Actual response value, accounting for reverse keying
        sample[response_key] = get_response_value(sign_char, is_positive)
    
    synthetic_samples.append(sample)

print(f"\n✅ Generated {len(synthetic_samples)} stylized samples")


# Check the mean response per group for each pattern
print("\n" + "=" * 80)
print("Mean response per group for each pattern")
print("=" * 80)

for sample in synthetic_samples:
    case = sample['case']
    pattern = sample['pattern']
    
    se_keys = df_itemkey[df_itemkey['Facet'] == 'Academic Self-Efficacy']['ResponseKey'].tolist()
    im_keys = df_itemkey[df_itemkey['Facet'] == 'Intrinsic Motivation']['ResponseKey'].tolist()
    as_keys = df_itemkey[df_itemkey['Facet'] == 'Academic Stress']['ResponseKey'].tolist()
    
    se_mean = sum(sample[k] for k in se_keys) / len(se_keys)
    im_mean = sum(sample[k] for k in im_keys) / len(im_keys)
    as_mean = sum(sample[k] for k in as_keys) / len(as_keys)
    
    print(f"\n{case}")
    print(f"  Pattern: SE={pattern['SE']}, IM={pattern['IM']}, AS={pattern['AS']}")
    print(f"  Mean: SE={se_mean:.2f}, IM={im_mean:.2f}, AS={as_mean:.2f}")


# Save Test-set.json
test_set = synthetic_samples

with open('Test-set.json', 'w', encoding='utf-8') as f:
    json.dump(test_set, f, ensure_ascii=False, indent=2)

print("\n" + "=" * 80)
print("✅ Test-set.json saved")
print("=" * 80)
print(f"Total samples: {len(test_set)}")
print(f"Each sample holds the actual responses (reverse keying applied) for {len(df_itemkey)} items")
print("Key format: i0, i1, i2, ..., i113")
print("\nSample cases:")
for sample in test_set:
    pattern = sample['pattern']
    print(f"  - {sample['case']}: SE={pattern['SE']}, IM={pattern['IM']}, AS={pattern['AS']}")


# Sample verification
print("\n" + "=" * 80)
print("Sample verification")
print("=" * 80)

for sample_idx in [0, -1]:
    sample = test_set[sample_idx]
    print(f"\n[{sample['case']}]")
    print(f"Pattern: {sample['pattern']}")
    print("\nSample responses per group (first 5 items):")
    
    for facet in df_itemkey['Facet'].unique():
        subset = df_itemkey[df_itemkey['Facet'] == facet].head(5)
        print(f"\n  {facet}:")
        for _, row in subset.iterrows():
            response_key = row['ResponseKey']
            sign = row['Sign']
            value = sample[response_key]
            item_text = row['Item'][:40]
            print(f"    {sign:5s} | {response_key:5s} | response={value} | {item_text}...")

print("\n" + "=" * 80)
print("Done!")
print("=" * 80)

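Source Excel file info
Total items: 114
Unique Key values: ['se' 'im' 'as']

Problem: Key only contains 'se', 'im', 'as', so individual items cannot be told apart
Fix: use the Full# value to build keys of the form 'i0', 'i1', 'i2' ... 'i113'

Converted key structure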
    Full# Sign Key ResponseKey                   Facet
0       0  +se  se          i0  Academic Self-Efficacy
1       1  -se  se          i1  Academic Self-Efficacy
2       2  -se  se          i2  Academic Self-Efficacy
3       3  +se  se          i3  Academic Self-Efficacy
4       4  +se  se          i4  Academic Self-Efficacy
5       5  +se  se          i5  Academic Self-Efficacy
6       6  -se  se          i6  Academic Self-Efficacy
7       7  -se  se          i7  Academic Self-Efficacy
8       8  +se  se          i8  Academic Self-Efficacy
9       9  +se  se          i9  Academic Self-Efficacy
10     10  +se  se         i10  Academic Self-Efficacy
11     11  +se  se         i11  Academic Self-Efficacy
12     12  +se  se         i12  Academic Self-Efficacy
13     13  +se  se         i13  Academic Self-Efficacy
14     14  