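### Predict sales from the mart sales data.
- Provided data: mart_train.csv (training data), mart_test.csv (evaluation data)
- Column to predict: total (total sales amount)
Build a model that predicts total sales from the training data (mart_train.csv), apply it to the evaluation data (mart_test.csv), and save the resulting predictions as a CSV file in the following format.
- The submission file must contain exactly one column.
- pred: predicted total sales amount
- Submission file name: 'result.csv'
- The submitted model is scored by RMSE (Root Mean Square Error).
- Submission CSV file name and layout: result.csv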

~~~
pred
10000
20000
30000
40000
...
~~~

### Submission notes
- Check the submitted file with pd.read_csv('result.csv')
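
The format requirements above (a single `pred` column, one row per test record) can be checked mechanically before submitting. The following is a minimal sketch, not part of the original task: `check_submission` is a hypothetical helper, and the temporary files stand in for the real `result.csv`.

```python
import os
import tempfile

import pandas as pd


def check_submission(path, expected_rows):
    """Return a list of format problems found in a submission CSV (empty if none)."""
    problems = []
    df = pd.read_csv(path)
    if list(df.columns) != ["pred"]:
        problems.append(f"expected a single 'pred' column, got {list(df.columns)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    if "pred" in df.columns and df["pred"].isnull().any():
        problems.append("'pred' contains missing values")
    return problems


# Toy demonstration: temporary files stand in for the real result.csv.
tmp_dir = tempfile.mkdtemp()

good_path = os.path.join(tmp_dir, "result.csv")
pd.DataFrame({"pred": [10000, 20000, 30000]}).to_csv(good_path, index=False)
good_report = check_submission(good_path, expected_rows=3)

# Forgetting index=False writes an extra unnamed index column.
bad_path = os.path.join(tmp_dir, "result_with_index.csv")
pd.DataFrame({"pred": [10000, 20000, 30000]}).to_csv(bad_path)
bad_report = check_submission(bad_path, expected_rows=3)

print(good_report)
print(bad_report)
```

For this task the call would presumably be `check_submission('result.csv', expected_rows=300)`, since the test set has 300 rows.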

# 1. Problem definition
- Metric: RMSE
- target: total
- Prediction file name: result.csv
- One column (pred)

# 2. Load libraries and data

In [2]:
# Load the data
import pandas as pd
train = pd.read_csv("mart_train.csv")
test = pd.read_csv("mart_test.csv")

# 3. Exploratory data analysis (EDA)

In [3]:
# Check the data sizes
print(train.shape)
print(test.shape)

(700, 10)
(300, 9)


In [4]:
# Inspect train samples
print(train.head())

  branch       city customer_type  gender            product_line      total  \
0      A     Yangon        Member  Female       Health and beauty  823457.25   
1      C  Naypyitaw        Normal  Female  Electronic accessories  120330.00   
2      A     Yangon        Normal    Male      Home and lifestyle  510788.25   
3      A     Yangon        Member    Male       Health and beauty  733572.00   
4      A     Yangon        Normal    Male       Sports and travel  951567.75   

  payment_method  rating time_of_day  day_name  
0        Ewallet     9.1   afternoon  Saturday  
1           Cash     9.6     morning    Friday  
2    Credit card     7.4   afternoon    Sunday  
3        Ewallet     8.4     evening    Sunday  
4        Ewallet     5.3     morning    Friday  


In [5]:
# Inspect test samples
print(test.head())

  branch       city customer_type  gender         product_line payment_method  \
0      C  Naypyitaw        Normal  Female  Fashion accessories        Ewallet   
1      B   Mandalay        Normal    Male   Food and beverages    Credit card   
2      B   Mandalay        Member  Female  Fashion accessories    Credit card   
3      B   Mandalay        Member    Male    Health and beauty           Cash   
4      B   Mandalay        Member  Female   Home and lifestyle           Cash   

   rating time_of_day   day_name  
0     9.6   afternoon   Thursday  
1     4.3     evening  Wednesday  
2     5.0     evening  Wednesday  
3     9.2     morning     Sunday  
4     6.3   afternoon   Saturday  


In [6]:
# Check dtypes
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   branch          700 non-null    object 
 1   city            700 non-null    object 
 2   customer_type   700 non-null    object 
 3   gender          700 non-null    object 
 4   product_line    700 non-null    object 
 5   total           700 non-null    float64
 6   payment_method  700 non-null    object 
 7   rating          700 non-null    float64
 8   time_of_day     700 non-null    object 
 9   day_name        700 non-null    object 
dtypes: float64(2), object(8)
memory usage: 54.8+ KB


In [7]:
# Basic statistics for train
train.describe()

           total    rating
count      700.0     700.0
mean    485078.0  7.003429
std     364390.7  1.713078
min     19041.75       4.0
25%     200119.5       5.5
50%     381874.5       7.0
75%     706127.6     8.425
max    1563975.0      10.0


In [None]:
# Basic statistics for test


In [None]:
# Basic statistics for train (object columns)


In [None]:
# Basic statistics for test (object columns)


In [10]:
# Missing values in train
train.isnull().sum()

branch            0
city              0
customer_type     0
gender            0
product_line      0
total             0
payment_method    0
rating            0
time_of_day       0
day_name          0
dtype: int64

In [11]:
# Missing values in test
test.isnull().sum()

branch            0
city              0
customer_type     0
gender            0
product_line      0
payment_method    0
rating            0
time_of_day       0
day_name          0
dtype: int64

In [8]:
# target (value_counts())
train['total'].value_counts(normalize=True)

total
283641.75     0.002857
263875.50     0.002857
415422.00     0.002857
326450.25     0.002857
130851.00     0.002857
                ...   
293391.00     0.001429
137103.75     0.001429
348232.50     0.001429
104107.50     0.001429
1535625.00    0.001429
Name: proportion, Length: 695, dtype: float64
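
Because `total` is continuous, `value_counts()` returns almost one row per record (695 distinct values out of 700), so it says little about the shape of the distribution. A minimal sketch of a more informative summary, using toy values rather than the real `train['total']`:

```python
import pandas as pd

# Toy stand-in for train['total']; these values are illustrative only.
total = pd.Series(
    [19000.0, 120000.0, 200000.0, 380000.0, 510000.0,
     700000.0, 820000.0, 950000.0, 1500000.0],
    name="total",
)

summary = total.describe()   # count, mean, std, quartiles
skewness = total.skew()      # a positive value indicates a longer right tail
binned = pd.cut(total, bins=4).value_counts().sort_index()  # counts per equal-width bin

print(summary)
print(skewness)
print(binned)
```

On the real data the same three calls would be applied to `train['total']`. The number of bins (4 here) is an arbitrary choice for the toy example.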

In [None]:
# target (distribution)


In [None]:
# target (distribution, plot)


# 4. Data preprocessing

In [12]:
# target
target=train.pop('total')

In [13]:
# One-hot encoding
print(train.shape, test.shape)
train=pd.get_dummies(train)
test=pd.get_dummies(test)
print(train.shape, test.shape)


(700, 9) (300, 9)
(700, 30) (300, 30)
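
Here both frames end up with 30 columns, but encoding train and test separately does not guarantee that in general: a category that appears in only one of them produces a column the other lacks. A minimal sketch of one way to guard against this, aligning test to the train columns with `reindex` (toy frames, not the mart data):

```python
import pandas as pd

# Toy frames: 'Naypyitaw' appears only in train.
train_toy = pd.DataFrame({"city": ["Yangon", "Mandalay", "Naypyitaw"],
                          "rating": [9.1, 7.4, 5.3]})
test_toy = pd.DataFrame({"city": ["Mandalay", "Yangon"],
                         "rating": [4.3, 9.6]})

train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)
print(train_enc.shape, test_enc.shape)  # the column counts differ

# Give test the same columns, in the same order, as train.
# Columns missing from test are filled with 0; columns only in test are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

Comparing `list(train.columns) == list(test.columns)` after encoding is a cheap way to confirm whether this step is needed at all.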


# 5. Validation split

In [14]:
# Split off validation data
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(train,target,test_size=0.2, random_state=2024)
print(X_tr.shape, y_tr.shape)
print(X_val.shape, y_val.shape)

(560, 30) (560,)
(140, 30) (140,)


# 6. Model training and evaluation

In [18]:
# RMSE
from sklearn.metrics import mean_squared_error
def rmse(y_true,y_pred):
  return mean_squared_error(y_true,y_pred)**0.5
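
To make the metric concrete, here is a small worked example of the same calculation in plain Python. The numbers are made up for illustration.

```python
import math

y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

# Squared error per sample: (-10)^2, 10^2, (-30)^2
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

mse = sum(squared_errors) / len(squared_errors)  # mean of the squared errors
rmse_value = math.sqrt(mse)                      # back in the units of the target

print(squared_errors)
print(mse)
print(rmse_value)
```

The single error of 30 contributes 900 of the 1100 total squared error, which shows how RMSE weights large misses more heavily than small ones.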


In [19]:
# Linear regression
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(X_tr,y_tr)
pred=lr.predict(X_val)
print(pred.shape)
print(pred[:10])
print(rmse(y_val,pred))

(140,)
[507509.39393016 444657.20847826 441391.89647437 648394.87878023
 504307.48658152 538876.73315236 494515.40070588 498537.8252635
 595614.15586321 567093.56240393]
408435.6813769399
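
The validation RMSE of about 408,436 is larger than the standard deviation of `total` in train (364,390.7 from `describe()`), which suggests the linear model may do no better than always predicting the mean. A baseline makes that comparison explicit. The sketch below uses synthetic data and scikit-learn's `DummyRegressor`; it is an illustration of the check, not a result for the mart data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred) ** 0.5


# Synthetic data with a real linear signal (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=1.0, size=200)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

baseline_rmse = rmse(y_val, baseline.predict(X_val))
model_rmse = rmse(y_val, model.predict(X_val))
print(baseline_rmse, model_rmse)
```

On the notebook's own split, fitting `DummyRegressor(strategy="mean")` on `X_tr, y_tr` and scoring it on `X_val` gives the number any candidate model should beat.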


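In [22]:
# Random forest (total is continuous, so use the regressor)
from sklearn.ensemble import RandomForestRegressor
rf=RandomForestRegressor(random_state=0)
rf.fit(X_tr,y_tr)
pred=rf.predict(X_val)
print(pred.shape)
print(rmse(y_val,pred))
# A RandomForestClassifier trained on a label-encoded y_tr predicts class
# indices rather than sales amounts; an earlier run of this cell did that, and
# the RMSE it printed (628842.0560755332) is not a meaningful score.


(140,)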


In [None]:
# XGBoost


In [None]:
# LightGBM


# 7. Prediction and result file

In [34]:
# Predict on test and write the CSV
pred=lr.predict(test)
print(pred.shape)
result=pd.DataFrame({
    'pred':pred
})
result.to_csv('result.csv',index=False)


(300,)


In [35]:
# Check result.csv
r=pd.read_csv('result.csv')
r.head()

            pred
0  442235.174292
1  443894.978268
2   429116.49303
3  506176.487154
4  575431.899942


# [Advanced] Improving performance

In [None]:
# Load the data
import pandas as pd
train = pd.read_csv("mart_train.csv")
test = pd.read_csv("mart_test.csv")

# target data


# Label encoding


# Validation split


# Linear regression


# Random forest


# XGBoost


# LightGBM


# Final submission file
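
The outline above can be sketched end to end on toy data. This is one possible shape for the label-encoding variant, not the notebook's solution: the toy frames, the explicit `cat_cols` list, and fitting each `LabelEncoder` on train and test combined are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for mart_train.csv / mart_test.csv (illustrative values only).
train_toy = pd.DataFrame({
    "branch": ["A", "C", "A", "B", "B", "C"],
    "gender": ["Female", "Female", "Male", "Male", "Female", "Male"],
    "rating": [9.1, 9.6, 7.4, 8.4, 5.3, 6.0],
    "total": [820000.0, 120000.0, 510000.0, 730000.0, 950000.0, 300000.0],
})
test_toy = pd.DataFrame({
    "branch": ["B", "A"],
    "gender": ["Male", "Female"],
    "rating": [4.3, 9.2],
})

# target data
target = train_toy.pop("total")

# Label encoding: fit each encoder on train and test together so that both
# frames share one category-to-integer mapping.
cat_cols = ["branch", "gender"]
for col in cat_cols:
    le = LabelEncoder()
    le.fit(pd.concat([train_toy[col], test_toy[col]]))
    train_toy[col] = le.transform(train_toy[col])
    test_toy[col] = le.transform(test_toy[col])

# Random forest regressor on the encoded features.
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(train_toy, target)
pred = rf.predict(test_toy)

# Final submission frame: a single 'pred' column.
result = pd.DataFrame({"pred": pred})
print(result)
```

Label encoding imposes an arbitrary order on categories, which tree models tolerate but linear regression does not, so the linear model is usually better served by the one-hot features from section 4.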
