# "Bank Marketing" Data

Analyze customer information and marketing campaign responses to build a model that predicts whether the marketing succeeds.

## 1. Loading the Data

In [1]:
from pyspark.sql import SparkSession

# Create the SparkSession
spark = SparkSession.builder.appName("BankMarketing-Analy").getOrCreate()

In [2]:
file_path = './bank.csv'

# The file is semicolon-delimited, so pass sep=';'
df = spark.read.csv(file_path, inferSchema=True, header=True, sep=';')
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: integer (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- y: string (nullable = true)



Bank marketing campaign dataset

It contains customer attributes, campaign contact records, and whether the latest campaign succeeded (y).

In [3]:
df.show()

+---+-------------+-------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
|age|          job|marital|education|default|balance|housing|loan| contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
+---+-------------+-------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
| 30|   unemployed|married|  primary|     no|   1787|     no|  no|cellular| 19|  oct|      79|       1|   -1|       0| unknown| no|
| 33|     services|married|secondary|     no|   4789|    yes| yes|cellular| 11|  may|     220|       1|  339|       4| failure| no|
| 35|   management| single| tertiary|     no|   1350|    yes|  no|cellular| 16|  apr|     185|       1|  330|       1| failure| no|
| 30|   management|married| tertiary|     no|   1476|    yes| yes| unknown|  3|  jun|     199|       4|   -1|       0| unknown| no|
| 59|  blue-collar|married|secondary|     no|      0|    yes|  no| unknown| 

In [4]:
df.select("duration", "y").groupBy("duration", "y").count().orderBy("duration").show()

+--------+---+-----+
|duration|  y|count|
+--------+---+-----+
|       4| no|    1|
|       5| no|    9|
|       6| no|    2|
|       7| no|    6|
|       8| no|    9|
|       9| no|   10|
|      10| no|    9|
|      11| no|    8|
|      12| no|    5|
|      13| no|    9|
|      14| no|   10|
|      15| no|   10|
|      16| no|   11|
|      17| no|    5|
|      18| no|    7|
|      19| no|    7|
|      20| no|   11|
|      21| no|    8|
|      22| no|   11|
|      23| no|    6|
+--------+---+-----+
only showing top 20 rows



In [5]:
df.groupBy("duration", "y").count().filter("y = 'yes'").orderBy("duration").show(100)

+--------+---+-----+
|duration|  y|count|
+--------+---+-----+
|      30|yes|    1|
|      76|yes|    1|
|      78|yes|    1|
|      80|yes|    1|
|      87|yes|    1|
|      91|yes|    2|
|      93|yes|    3|
|      96|yes|    1|
|      97|yes|    3|
|     100|yes|    1|
|     103|yes|    1|
|     104|yes|    2|
|     106|yes|    1|
|     107|yes|    1|
|     109|yes|    1|
|     110|yes|    1|
|     120|yes|    1|
|     121|yes|    1|
|     124|yes|    1|
|     125|yes|    1|
|     128|yes|    1|
|     129|yes|    1|
|     130|yes|    1|
|     132|yes|    1|
|     134|yes|    1|
|     138|yes|    1|
|     142|yes|    1|
|     144|yes|    2|
|     146|yes|    1|
|     147|yes|    1|
|     149|yes|    1|
|     152|yes|    2|
|     154|yes|    1|
|     157|yes|    1|
|     158|yes|    1|
|     159|yes|    1|
|     161|yes|    3|
|     164|yes|    1|
|     166|yes|    1|
|     167|yes|    1|
|     169|yes|    1|
|     170|yes|    1|
|     171|yes|    2|
|     177|yes|    1|
|     178|yes

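### The duration column
With this column included, accuracy came out as 1.
In other words, this column alone is enough to predict the answer, so we drop it and redo the analysis.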
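
One rough way to see how strongly a single column tracks the label is to compare the share of "yes" labels across value ranges. The sketch below does this in plain Python on a handful of made-up (duration, y) pairs. The data, the bucket width, and the helper name `yes_rate_by_bucket` are illustrative assumptions, not values taken from bank.csv.

```python
from collections import defaultdict

def yes_rate_by_bucket(rows, width):
    """Group (value, label) pairs into fixed-width buckets and return
    {bucket_start: share of 'yes' labels in that bucket}."""
    totals = defaultdict(int)
    yes = defaultdict(int)
    for value, label in rows:
        start = (value // width) * width
        totals[start] += 1
        if label == "yes":
            yes[start] += 1
    return {start: yes[start] / totals[start] for start in sorted(totals)}

# Made-up sample: short calls end in "no", longer calls more often in "yes".
sample = [(12, "no"), (45, "no"), (80, "no"), (130, "no"),
          (150, "yes"), (310, "yes"), (320, "no"), (640, "yes")]

rates = yes_rate_by_bucket(sample, width=100)
for start, rate in rates.items():
    print(f"duration >= {start}: yes rate {rate:.2f}")
```

A rate that climbs steeply from one bucket to the next would be a hint that the column carries most of the label information on its own.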

In [6]:
df = df.drop("duration")

## 2. Exploratory Data Analysis (EDA)

In [7]:
# Check for missing values
from pyspark.sql.functions import col, isnan, when, count

df.select([
    count(when(col(c).isNull() | isnan(c), c)).alias(c)
    for c in df.columns
]).show()

+---+---+-------+---------+-------+-------+-------+----+-------+---+-----+--------+-----+--------+--------+---+
|age|job|marital|education|default|balance|housing|loan|contact|day|month|campaign|pdays|previous|poutcome|  y|
+---+---+-------+---------+-------+-------+-------+----+-------+---+-----+--------+-----+--------+--------+---+
|  0|  0|      0|        0|      0|      0|      0|   0|      0|  0|    0|       0|    0|       0|       0|  0|
+---+---+-------+---------+-------+-------+-------+----+-------+---+-----+--------+-----+--------+--------+---+



In [8]:
df.groupBy("job").count().orderBy("count", ascending=False).show()

+-------------+-----+
|          job|count|
+-------------+-----+
|   management|  969|
|  blue-collar|  946|
|   technician|  768|
|       admin.|  478|
|     services|  417|
|      retired|  230|
|self-employed|  183|
| entrepreneur|  168|
|   unemployed|  128|
|    housemaid|  112|
|      student|   84|
|      unknown|   38|
+-------------+-----+



## 3. Preprocessing the Data

In [9]:
# Encode the target variable
from pyspark.ml.feature import StringIndexer

label_indexer = StringIndexer(inputCol = 'y', outputCol = 'label_y')
df = label_indexer.fit(df).transform(df)

In [10]:
df = df.drop("y",'day')
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- month: string (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- label_y: double (nullable = false)



In [11]:
df.schema.fields

[StructField('age', IntegerType(), True),
 StructField('job', StringType(), True),
 StructField('marital', StringType(), True),
 StructField('education', StringType(), True),
 StructField('default', StringType(), True),
 StructField('balance', IntegerType(), True),
 StructField('housing', StringType(), True),
 StructField('loan', StringType(), True),
 StructField('contact', StringType(), True),
 StructField('month', StringType(), True),
 StructField('campaign', IntegerType(), True),
 StructField('pdays', IntegerType(), True),
 StructField('previous', IntegerType(), True),
 StructField('poutcome', StringType(), True),
 StructField('label_y', DoubleType(), False)]

In [12]:
df.show()

+---+-------------+-------+---------+-------+-------+-------+----+--------+-----+--------+-----+--------+--------+-------+
|age|          job|marital|education|default|balance|housing|loan| contact|month|campaign|pdays|previous|poutcome|label_y|
+---+-------------+-------+---------+-------+-------+-------+----+--------+-----+--------+-----+--------+--------+-------+
| 30|   unemployed|married|  primary|     no|   1787|     no|  no|cellular|  oct|       1|   -1|       0| unknown|    0.0|
| 33|     services|married|secondary|     no|   4789|    yes| yes|cellular|  may|       1|  339|       4| failure|    0.0|
| 35|   management| single| tertiary|     no|   1350|    yes|  no|cellular|  apr|       1|  330|       1| failure|    0.0|
| 30|   management|married| tertiary|     no|   1476|    yes| yes| unknown|  jun|       4|   -1|       0| unknown|    0.0|
| 59|  blue-collar|married|secondary|     no|      0|    yes|  no| unknown|  may|       1|   -1|       0| unknown|    0.0|
| 35|   manageme

In [13]:
# Split the data
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

In [14]:
# Encode categorical variables
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml import Pipeline

# Collect the names of the string-typed columns
string_cols = [field.name for field in df.schema.fields if field.dataType.simpleString() == 'string']

# Build one StringIndexer per string column
indexers = [StringIndexer(inputCol = c, outputCol = c+'_index', handleInvalid = 'keep') for c in string_cols]

# One-hot encoding
# handleInvalid='keep' avoids an error when the test data contains values not seen during fitting
encoder = OneHotEncoder(inputCols = [c+'_index' for c in string_cols],
                        outputCols = [c +'_onehot' for c in string_cols],
                        handleInvalid = 'keep')

# Define the pipeline stages
stages = indexers + [encoder]
pipeline = Pipeline(stages = stages)

# Fit and transform
model = pipeline.fit(train_df) # fit the indexers and the encoder on the training data only
df_encoded = model.transform(train_df)
df_encoded_test = model.transform(test_df)

# Check the result
df_encoded.select(string_cols + [c+'_index' for c in string_cols] + [c+ '_onehot' for c in string_cols]).show(5)

+-------+-------+---------+-------+-------+----+---------+-----+--------+---------+-------------+---------------+-------------+-------------+----------+-------------+-----------+--------------+---------------+--------------+----------------+--------------+--------------+-------------+--------------+--------------+---------------+
|    job|marital|education|default|housing|loan|  contact|month|poutcome|job_index|marital_index|education_index|default_index|housing_index|loan_index|contact_index|month_index|poutcome_index|     job_onehot|marital_onehot|education_onehot|default_onehot|housing_onehot|  loan_onehot|contact_onehot|  month_onehot|poutcome_onehot|
+-------+-------+---------+-------+-------+----+---------+-----+--------+---------+-------------+---------------+-------------+-------------+----------+-------------+-----------+--------------+---------------+--------------+----------------+--------------+--------------+-------------+--------------+--------------+---------------+
|stu

⭐ In Spark, even one-hot encoding involves a **'fit'** step.

So the data must be split into train/test first, and the encoder must be fitted on the train data only and then applied to the test data!
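
The idea can be sketched without Spark: "fitting" means recording which categories occur in the training data, and "transforming" means encoding any dataset against that fixed list. The helpers `fit_categories` and `one_hot` below are hypothetical stand-ins, not PySpark APIs, and they use first-seen order as a simplification. The extra last slot loosely imitates what `handleInvalid='keep'` is used for in the cell above: giving unseen values a place to go instead of raising an error.

```python
def fit_categories(train_values):
    """'Fit' step: remember each category seen in training, in first-seen order."""
    categories = []
    for value in train_values:
        if value not in categories:
            categories.append(value)
    return categories

def one_hot(value, categories):
    """'Transform' step: encode a value against the fitted categories.
    The final slot is reserved for values never seen during fitting."""
    vector = [0] * (len(categories) + 1)
    index = categories.index(value) if value in categories else len(categories)
    vector[index] = 1
    return vector

train_jobs = ["management", "services", "management", "student"]
test_jobs = ["services", "housemaid"]  # "housemaid" does not occur in train_jobs

categories = fit_categories(train_jobs)  # learned from the training data only
encoded_test = [one_hot(job, categories) for job in test_jobs]

print(categories)
print(encoded_test)
```

Because the category list comes from the training split alone, nothing about the test split influences the encoding.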

In [15]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- month: string (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- label_y: double (nullable = false)



In [16]:
df.schema.fields

[StructField('age', IntegerType(), True),
 StructField('job', StringType(), True),
 StructField('marital', StringType(), True),
 StructField('education', StringType(), True),
 StructField('default', StringType(), True),
 StructField('balance', IntegerType(), True),
 StructField('housing', StringType(), True),
 StructField('loan', StringType(), True),
 StructField('contact', StringType(), True),
 StructField('month', StringType(), True),
 StructField('campaign', IntegerType(), True),
 StructField('pdays', IntegerType(), True),
 StructField('previous', IntegerType(), True),
 StructField('poutcome', StringType(), True),
 StructField('label_y', DoubleType(), False)]

In [17]:
for field in df_encoded.schema.fields:
    print(f"Column Name: {field.name}, Data Type: {field.dataType.simpleString()}")

Column Name: age, Data Type: int
Column Name: job, Data Type: string
Column Name: marital, Data Type: string
Column Name: education, Data Type: string
Column Name: default, Data Type: string
Column Name: balance, Data Type: int
Column Name: housing, Data Type: string
Column Name: loan, Data Type: string
Column Name: contact, Data Type: string
Column Name: month, Data Type: string
Column Name: campaign, Data Type: int
Column Name: pdays, Data Type: int
Column Name: previous, Data Type: int
Column Name: poutcome, Data Type: string
Column Name: label_y, Data Type: double
Column Name: job_index, Data Type: double
Column Name: marital_index, Data Type: double
Column Name: education_index, Data Type: double
Column Name: default_index, Data Type: double
Column Name: housing_index, Data Type: double
Column Name: loan_index, Data Type: double
Column Name: contact_index, Data Type: double
Column Name: month_index, Data Type: double
Column Name: poutcome_index, Data Type: double
Column Name: job_

In [21]:
# Build the feature vector
from pyspark.ml.feature import VectorAssembler

# Numeric columns
integer_cols = [field.name for field in df_encoded.schema.fields if field.dataType.simpleString() == 'int']

# One-hot encoded columns (loop variable named c to avoid confusion with pyspark.sql.functions.col)
onehot_cols = [c for c in df_encoded.columns if c.endswith('_onehot')]

# Combine the numeric and the encoded categorical columns
all_features = [c for c in (integer_cols + onehot_cols) if c != 'label_y']

# Assemble into a single vector column
assembler = VectorAssembler(inputCols=all_features, outputCol='features')
df_final = assembler.transform(df_encoded)
df_final_test = assembler.transform(df_encoded_test)

print(all_features)

['age', 'balance', 'campaign', 'pdays', 'previous', 'job_onehot', 'marital_onehot', 'education_onehot', 'default_onehot', 'housing_onehot', 'loan_onehot', 'contact_onehot', 'month_onehot', 'poutcome_onehot']


## 4. Building, Inspecting, and Evaluating Models

### Logistic Regression

In [22]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='features', labelCol='label_y')
pipeline = Pipeline(stages=[lr])
model = pipeline.fit(df_final)

In [23]:
# Check the results
predictions = model.transform(df_final_test)
predictions.select('features', 'label_y', 'prediction', 'probability').show(5, truncate=False)

+----------------------------------------------------------------------------------------------------+-------+----------+-----------------------------------------+
|features                                                                                            |label_y|prediction|probability                              |
+----------------------------------------------------------------------------------------------------+-------+----------+-----------------------------------------+
|(58,[0,2,3,15,19,25,27,31,33,36,46,53],[19.0,3.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])         |0.0    |0.0       |[0.8104264939386411,0.18957350606135892] |
|(58,[0,1,2,3,15,19,22,27,31,33,36,46,53],[20.0,1191.0,1.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|0.0    |0.0       |[0.6655290841626393,0.33447091583736066] |
|(58,[0,1,2,3,9,19,22,27,30,33,37,40,53],[21.0,1903.0,2.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]) |0.0    |0.0       |[0.9738706598246873,0.026129340175312654]|
|(58,[0,1,2,3,15

In [24]:
# Evaluate model performance
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    labelCol = 'label_y',
    rawPredictionCol='rawPrediction', 
    metricName= 'areaUnderROC'
)

auc = evaluator.evaluate(predictions)
print(f'ROC AUC: {auc:.4f}')

ROC AUC: 0.6593


In [25]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Accuracy
accuracy_evaluator = MulticlassClassificationEvaluator(
    labelCol='label_y', predictionCol='prediction', metricName='accuracy'
)
accuracy = accuracy_evaluator.evaluate(predictions)
print(f'Accuracy: {accuracy:.4f}')

Accuracy: 0.8917


### Random Forest

In [26]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol='features', labelCol='label_y', numTrees=100)
pipeline = Pipeline(stages=[rf])
model = pipeline.fit(df_final)
predictions = model.transform(df_final_test)

auc = evaluator.evaluate(predictions)
print(f'ROC AUC: {auc:.4f}')

accuracy = accuracy_evaluator.evaluate(predictions)
print(f'Accuracy: {accuracy:.4f}')

ROC AUC: 0.7076
Accuracy: 0.8882


In [27]:
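# Assume the last stage of the pipeline is the classifier
rf_model = model.stages[-1]  # take out the last-stage model

# Print feature importances
# Caveat: featureImportances has one entry per vector slot (58 here), so zip()
# pairs the 14 names in all_features with the first 14 slots only.
importances = rf_model.featureImportances

for name, imp in zip(all_features, importances):
    print(f"{name}: {imp}")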

age: 0.07208277274083542
balance: 0.044496703825274105
campaign: 0.023260679470037186
pdays: 0.08745533676086642
previous: 0.05944854743959491
job_onehot: 0.006047665515205972
marital_onehot: 0.010606755644969518
education_onehot: 0.004747390029002987
default_onehot: 0.0022706076994766086
housing_onehot: 0.001201413960583692
loan_onehot: 0.025695794252365723
contact_onehot: 0.001736960612841787
month_onehot: 0.00037799249910343383
poutcome_onehot: 0.003189676777006131
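
Because the importances are reported per slot of the assembled vector (58 slots in the `features` output shown earlier) rather than per input column, each one-hot column spans several slots. Assuming the slot width of each input column is known, a per-column importance can be obtained by summing over that column's slots. The widths and importance values below are made up for illustration, and `sum_by_column` is a hypothetical helper.

```python
def sum_by_column(widths, importances):
    """widths: list of (column_name, number_of_vector_slots) in assembler order.
    importances: flat list with one value per vector slot.
    Returns {column_name: summed importance over that column's slots}."""
    expected = sum(width for _, width in widths)
    if expected != len(importances):
        raise ValueError(f"expected {expected} slots, got {len(importances)}")
    totals, start = {}, 0
    for name, width in widths:
        totals[name] = sum(importances[start:start + width])
        start += width
    return totals

# Made-up example: two numeric columns (1 slot each) and one 3-slot one-hot column.
widths = [("age", 1), ("balance", 1), ("job_onehot", 3)]
importances = [0.25, 0.25, 0.125, 0.25, 0.125]

totals = sum_by_column(widths, importances)
print(totals)
```

In PySpark the slot names are typically recorded in the metadata of the assembled column (for example `df_final.schema['features'].metadata`), though the exact layout is worth checking in the running session before relying on it.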


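**Accuracy** is fairly high for both models and barely differs:
0.8917 (logistic regression) vs. 0.8882 (random forest), a gap of less than 0.01.

If the data is imbalanced, accuracy is not a reliable metric.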
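
A small made-up example shows why: when one class dominates, a model that always predicts the majority class already scores a high accuracy while finding none of the positives. The labels below use an illustrative 9:1 ratio and are not taken from bank.csv.

```python
def accuracy(labels, predictions):
    """Share of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def recall(labels, predictions, positive=1):
    """Share of actual positives that were predicted as positive."""
    for_positives = [p for y, p in zip(labels, predictions) if y == positive]
    return sum(1 for p in for_positives if p == positive) / len(for_positives)

labels = [0] * 18 + [1] * 2      # 90% negative, 10% positive
always_no = [0] * len(labels)    # trivial model: predict "no" for everyone

baseline_accuracy = accuracy(labels, always_no)
baseline_recall = recall(labels, always_no)
print(f"accuracy={baseline_accuracy:.2f}, recall={baseline_recall:.2f}")
```

Comparing a model's accuracy against this kind of majority-class baseline is a quick sanity check before reading much into the number.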

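AUC is a metric that shows how well a classification model separates the classes.

🔍 **The data is the same and only the model differs, so why does the AUC differ?**

✅ 1. The way predictions are made depends on the model structure

AUC is not computed from whether each prediction was simply right or wrong; it is computed from the distribution of the "probabilities" the model predicts, that is, from how well those scores rank positives above negatives.

Logistic Regression: based on a linear decision boundary

Random Forest: an ensemble of trees with nonlinear boundaries

Gradient Boosting: trains iteratively in the direction that reduces the error

➡️ Even on the same data, each model produces a different probability distribution.

As a result, the shape of the ROC curve and the AUC value differ as well.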
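
To make this concrete, the sketch below computes ROC AUC directly from scores as the share of (positive, negative) pairs in which the positive example gets the higher score, counting ties as half. The two score lists are made up: both give the same predictions at a 0.5 threshold, and therefore the same accuracy, yet they rank the examples differently and so get different AUC values.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy_at(labels, scores, threshold=0.5):
    """Accuracy after turning scores into 0/1 predictions at the threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)

labels   = [0, 0, 0, 0, 1, 1]
scores_a = [0.10, 0.20, 0.30, 0.40, 0.35, 0.45]  # positives ranked near the top
scores_b = [0.10, 0.20, 0.30, 0.40, 0.15, 0.25]  # positives ranked near the bottom

for name, scores in [("model A", scores_a), ("model B", scores_b)]:
    print(name, f"accuracy={accuracy_at(labels, scores):.3f}",
          f"auc={roc_auc(labels, scores):.3f}")
```

This pairwise loop is quadratic in the number of rows, so it is only meant for illustration on small lists.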

### Gradient Boosting 

In [28]:
from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(labelCol="label_y", featuresCol="features", maxIter=50, maxDepth=5)
model = gbt.fit(df_final)
predictions = model.transform(df_final_test)

auc = evaluator.evaluate(predictions)
print(f'ROC AUC: {auc:.4f}')

accuracy = accuracy_evaluator.evaluate(predictions)
print(f'Accuracy: {accuracy:.4f}')

ROC AUC: 0.6799
Accuracy: 0.8813


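**Without further preprocessing, the AUC falls between roughly 0.66 and 0.71, and the accuracy between roughly 0.88 and 0.89.**

So the next step is more EDA on the data to improve these scores! 🏋️‍♀️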

In [29]:
spark.stop()