### Understanding the data: mtcars.csv
1. mpg : fuel economy (miles per gallon) >> target (dependent variable)
2. Unnamed: 0 : car model name
3. cyl : number of engine cylinders
4. disp : displacement
5. hp : gross horsepower
6. drat : rear axle ratio
7. wt : weight
8. qsec : 1/4 mile time
9. vs : engine shape (V engine / straight engine)
10. am : transmission
11. gear : number of forward gears
12. carb : number of carburetors

## Data analysis process
1. Prepare the data
2. Observe and process the data
3. Split the data (training data / test data)
4. Train and evaluate (validate)
5. Print and save the results

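## tips
- Because of how the actual exam environment works, you must use the print() function.
- print() is required whenever you check the result of type(), head(), shape, and so on! (shape is an attribute, so it takes no parentheses.)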
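
The tips above can be sketched as follows. This is a minimal illustration, assuming a tiny stand-in frame (`sample`, a hypothetical name) built from two rows of this notebook's data rather than the full file.

```python
import pandas as pd

# Stand-in frame (hypothetical name) with two rows taken from the mtcars data.
sample = pd.DataFrame({
    "mpg": [21.0, 22.8],
    "hp": [110, 93],
})

# shape is an attribute holding a tuple, so it is read without parentheses.
shape_value = sample.shape
print(shape_value)

# type() and head() only return objects; in a script they show up only via print().
print(type(sample))
print(sample.head())

# Calling shape like a function fails because a tuple is not callable.
try:
    sample.shape()
except TypeError as err:
    print("TypeError:", err)
```

In a notebook the last expression of a cell is echoed automatically, which is why forgetting print() tends to go unnoticed until the code runs as a script.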

### 1. Preparing the data: Data Load

In [2]:
import pandas as pd
data = pd.read_csv('bigData-main/mtcars.csv')

In [6]:
# Print the first 5 rows of data
display(data.head())

,Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6.0,160.0,110,3.9,2.62,16.46,0,manual,4,4
1,Mazda RX4 Wag,21.0,6.0,160.0,110,3.9,2.875,17.02,0,manual,4,4
2,Datsun 710,22.8,4.0,108.0,93,3.85,2.32,18.61,1,manual,4,1
3,Hornet 4 Drive,21.4,6.0,258.0,110,3.08,3.215,0.1,1,auto,3,1
4,Hornet Sportabout,18.7,8.0,360.0,175,3.15,3.44,17.02,0,auto,3,2
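
The column name `Unnamed: 0` is the placeholder pandas assigns when a header cell is blank. The sketch below reproduces that behavior, assuming mtcars.csv stores the model names under a blank header (the raw file is not shown here, so this is a guess). The in-memory `csv_text` and the new name `model` are hypothetical.

```python
import io
import pandas as pd

# Stand-in for the file: three rows from the head() output,
# with a blank first header cell (an assumption about the real file).
csv_text = """,mpg,cyl,hp
Mazda RX4,21.0,6.0,110
Mazda RX4 Wag,21.0,6.0,110
Datsun 710,22.8,4.0,93
"""

loaded = pd.read_csv(io.StringIO(csv_text))
# The blank header cell becomes the placeholder name 'Unnamed: 0'.
print(loaded.columns.tolist())

# Optional: give the column a readable name ('model' is a hypothetical choice).
renamed = loaded.rename(columns={"Unnamed: 0": "model"})
print(renamed.head())
```

Renaming is optional; the rest of this notebook keeps the original `Unnamed: 0` name.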


### Exploring the data

In [19]:
# Check the shape of the data (rows, columns)

print(data.shape)

(32, 12)


In [15]:
# Check the data type

print(type(data))

<class 'pandas.core.frame.DataFrame'>


In [16]:
# Check the column names

print(data.columns)

Index(['Unnamed: 0', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs',
       'am', 'gear', 'carb'],
      dtype='object')


In [17]:
# Check summary statistics

print(data.describe())

             mpg        cyl        disp          hp       drat         wt  \
count  32.000000  30.000000   32.000000   32.000000  32.000000  32.000000   
mean   20.090625   7.600000  230.721875  146.687500   3.596563   3.217250   
std     6.026948   8.194195  123.938694   68.562868   0.534679   0.978457   
min    10.400000   4.000000   71.100000   52.000000   2.760000   1.513000   
25%    15.425000   4.000000  120.825000   96.500000   3.080000   2.581250   
50%    19.200000   6.000000  196.300000  123.000000   3.695000   3.325000   
75%    22.800000   8.000000  326.000000  180.000000   3.920000   3.610000   
max    33.900000  50.000000  472.000000  335.000000   4.930000   5.424000   

             qsec         vs     carb  
count   31.000000  32.000000  32.0000  
mean    19.866774   0.437500   2.8125  
std     15.310469   0.504016   1.6152  
min      0.100000   0.000000   1.0000  
25%     16.785000   0.000000   2.0000  
50%     17.600000   0.000000   2.0000  
75%     18.755000   1.0000
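
In the describe() output above, the count row shows 30 for cyl and 31 for qsec even though the frame has 32 rows, which points to missing values. A minimal sketch of how to confirm this, using a hypothetical 4-row frame whose NaN positions are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the NaN positions are invented, not taken from mtcars.csv.
sample = pd.DataFrame({
    "cyl": [6.0, np.nan, 4.0, 8.0],
    "qsec": [16.46, 17.02, np.nan, 17.02],
    "hp": [110, 110, 93, 175],
})

# Count the missing values per column directly.
missing = sample.isnull().sum()
print(missing)

# describe() counts only non-null values, so rows minus count gives the same numbers.
counts = sample.describe().loc["count"]
gap = len(sample) - counts
print(gap)
```

Applied to the real frame, the same `isnull().sum()` call would be expected to report 2 for cyl and 1 for qsec, matching the counts printed above.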

In [18]:
# Check summary statistics for the numeric variable hp

print(data['hp'].describe())

count     32.000000
mean     146.687500
std       68.562868
min       52.000000
25%       96.500000
50%      123.000000
75%      180.000000
max      335.000000
Name: hp, dtype: float64


In [22]:
# List the distinct values (duplicates removed): unique()

print('am   : ', data['am'].unique())
print('gear : ', data['gear'].unique())
print('vs   : ', data['vs'].unique())

am   :  ['manual' 'auto']
gear :  ['4' '3' '*3' '5' '*5']
vs   :  [0 1]
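
The gear output above contains '*3' and '*5' next to '3' and '5', which is why the column is stored as text. One possible cleanup is sketched below, assuming the asterisk is a typing error and '*3' really means 3 (the notebook has not confirmed this at this point). The series is rebuilt by hand from the printed values.

```python
import pandas as pd

# The distinct gear values printed above, rebuilt by hand.
gear = pd.Series(["4", "3", "*3", "5", "*5"], name="gear")

# Remove the literal asterisk (regex=False treats '*' as plain text),
# then convert the cleaned strings to integers.
cleaned = gear.str.replace("*", "", regex=False).astype(int)

print(cleaned.unique())
print(cleaned.dtype)
```

Passing `regex=False` matters here because a bare `*` is not a valid regular expression on its own.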


In [23]:
# Check summary info (info() prints by itself and returns None, hence the trailing None in the output)

print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  32 non-null     object 
 1   mpg         32 non-null     float64
 2   cyl         30 non-null     float64
 3   disp        32 non-null     float64
 4   hp          32 non-null     int64  
 5   drat        32 non-null     float64
 6   wt          32 non-null     float64
 7   qsec        31 non-null     float64
 8   vs          32 non-null     int64  
 9   am          32 non-null     object 
 10  gear        32 non-null     object 
 11  carb        32 non-null     int64  
dtypes: float64(6), int64(3), object(3)
memory usage: 3.1+ KB
None
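
The trailing `None` above appears because `info()` writes its summary directly and returns nothing, so `print()` then prints that return value. A small sketch of the behavior, using a hypothetical two-row frame; the `buf` parameter is a real option of `DataFrame.info` for capturing the text instead:

```python
import io
import pandas as pd

# Hypothetical two-row frame, just enough to call info().
sample = pd.DataFrame({"mpg": [21.0, 22.8], "am": ["manual", "manual"]})

# info() prints the summary itself; its return value is None.
result = sample.info()
print(result)

# To keep the summary as a string, direct it into a text buffer.
buffer = io.StringIO()
sample.info(buf=buffer)
summary_text = buffer.getvalue()
print(summary_text)
```

Calling `data.info()` without `print()` therefore gives the same summary without the extra `None` line.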


In [25]:
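# Compute correlations between the numeric columns
# numeric_only=True (pandas 1.5+) skips the object columns; pandas 2.0+ raises an error without it

display(data.corr(numeric_only=True))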

,mpg,cyl,disp,hp,drat,wt,qsec,vs,carb
mpg,1.0,-0.460227,-0.847551,-0.776168,0.681172,-0.867659,0.013668,0.664039,-0.550925
cyl,-0.460227,1.0,0.544876,0.323293,-0.372671,0.53369,-0.012755,-0.32396,0.23998
disp,-0.847551,0.544876,1.0,0.790949,-0.710214,0.88798,0.18181,-0.710416,0.394977
hp,-0.776168,0.323293,0.790949,1.0,-0.448759,0.658748,0.010807,-0.723097,0.749812
drat,0.681172,-0.372671,-0.710214,-0.448759,1.0,-0.712441,-0.120283,0.440278,-0.09079
wt,-0.867659,0.53369,0.88798,0.658748,-0.712441,1.0,0.0939,-0.554916,0.427606
qsec,0.013668,-0.012755,0.18181,0.010807,-0.120283,0.0939,1.0,-0.112146,-0.120312
vs,0.664039,-0.32396,-0.710416,-0.723097,0.440278,-0.554916,-0.112146,1.0,-0.569607
carb,-0.550925,0.23998,0.394977,0.749812,-0.09079,0.427606,-0.120312,-0.569607,1.0
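
In the table above, wt (-0.867659) and disp (-0.847551) are the columns most strongly correlated with mpg. A ranking like that can be read off programmatically, as in the sketch below. It uses only the five rows shown by head(), so its numbers will differ from the full-data table. `select_dtypes` is used here so the text column is excluded on any pandas version.

```python
import pandas as pd

# Five rows copied from the head() output; 'am' is kept to show it gets excluded.
sample = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
    "disp": [160.0, 160.0, 108.0, 258.0, 360.0],
    "hp": [110, 110, 93, 110, 175],
    "wt": [2.62, 2.875, 2.32, 3.215, 3.44],
    "am": ["manual", "manual", "manual", "auto", "auto"],
})

# Keep numeric columns only, then take the mpg column of the correlation matrix.
corr_with_mpg = sample.select_dtypes(include="number").corr()["mpg"].drop("mpg")

# Rank features by the absolute strength of their correlation with the target.
ranked = corr_with_mpg.abs().sort_values(ascending=False)
print(ranked)
```

Ranking by absolute value treats strong negative and strong positive relationships alike, which suits a quick first look at candidate features.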


### Separating the dependent and independent variables

In [26]:
X = data.drop(columns = 'mpg')
y = data['mpg']

In [27]:
# Check the first 5 rows of the independent variables X
print(X.head())

          Unnamed: 0  cyl   disp   hp  drat     wt   qsec  vs      am gear  \
0          Mazda RX4  6.0  160.0  110  3.90  2.620  16.46   0  manual    4   
1      Mazda RX4 Wag  6.0  160.0  110  3.90  2.875  17.02   0  manual    4   
2         Datsun 710  4.0  108.0   93  3.85  2.320  18.61   1  manual    4   
3     Hornet 4 Drive  6.0  258.0  110  3.08  3.215   0.10   1    auto    3   
4  Hornet Sportabout  8.0  360.0  175  3.15  3.440  17.02   0    auto    3   

   carb  
0     4  
1     4  
2     1  
3     1  
4     2  


In [28]:
# Print the column names of the independent variables X
print(X.columns)

Index(['Unnamed: 0', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am',
       'gear', 'carb'],
      dtype='object')


In [29]:
print(y.head())

0    21.0
1    21.0
2    22.8
3    21.4
4    18.7
Name: mpg, dtype: float64
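
After a split like the one above, a few quick checks help confirm that X and y still line up. The sketch below is one way to do that, assuming a stand-in frame built from the five head() rows with only a few of the columns.

```python
import pandas as pd

# Five rows from the head() output, restricted to a few columns.
sample = pd.DataFrame({
    "Unnamed: 0": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                   "Hornet 4 Drive", "Hornet Sportabout"],
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
    "cyl": [6.0, 6.0, 4.0, 6.0, 8.0],
    "hp": [110, 110, 93, 110, 175],
})

# drop() returns a new frame; the original keeps its mpg column.
X = sample.drop(columns="mpg")
y = sample["mpg"]

print(X.shape, y.shape)                 # same number of rows in both
print("mpg" in X.columns)               # the target must not leak into X
print(X.index.equals(y.index))          # rows are still aligned
```

The target column staying out of X is the check that matters most, since a leaked target makes any later evaluation meaningless.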


### 2. Observing and processing the data: preprocessing