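### [ Statistics Methods ]
- Provided for analyzing the state of the data column by column
    * mean / median / standard deviation / minimum / maximum
    * relationship between columns: correlation coefficient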
    

[1] Load modules <hr>

In [37]:
## [1] Load modules
import pandas as pd

In [38]:
## [2] Prepare and load the data
DATA_FILE = '../Data/auto_mpg.csv'

mpgDF = pd.read_csv(DATA_FILE)

In [39]:
## [3] Check the loaded data
## Check the summary information
mpgDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


In [40]:
## Check whether the actual data matches the dtypes in the summary
mpgDF.head(3)

## -> horsepower object ==> int64 ==> category  type conversion [to be decided]
## -> cylinders  int64  ==> category type conversion
## -> origin     int64  ==> category type conversion

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |


In [41]:
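## Categorical data
## - e.g. blood type (A, B, O, AB), sex (male, female), religion (Protestant, Catholic, Buddhist, cult)
##                   (1, 2, 3,  4)      (1,    2)                (1,          2,        3,        4)
## - Even when stored as numbers, the values do not carry a numeric meaning!
## - Data must not be computed in a way that contradicts its meaning ==> assign the dtype based on meaning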
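
The conversions noted above could look roughly like the sketch below. It runs on a small hypothetical frame (`demoDF`), not on the real file, and it assumes that a placeholder such as `'?'` is what forces `horsepower` to dtype `object`; the notebook itself does not confirm the cause.

```python
import pandas as pd

# Hypothetical miniature of the mpg data. The '?' entry is an assumed
# placeholder that would make the whole column dtype object.
demoDF = pd.DataFrame({
    'horsepower': ['130', '165', '?', '150'],
    'cylinders': [8, 8, 4, 6],
    'origin': [1, 1, 3, 2],
})

# Text -> numbers; anything that cannot be parsed becomes NaN instead of raising.
demoDF['horsepower'] = pd.to_numeric(demoDF['horsepower'], errors='coerce')

# Codes that only label a group are stored as categories, not as numbers.
demoDF = demoDF.astype({'cylinders': 'category', 'origin': 'category'})

print(demoDF.dtypes)
print(demoDF.describe())   # only horsepower is summarized as a numeric column
```

After the conversion, `describe()` no longer reports a mean for `cylinders` or `origin`, which matches the point that such codes should not take part in arithmetic.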

In [42]:
## Check per-column distribution and basic statistics [default] numeric columns only
mpgDF.describe()

| | mpg | cylinders | displacement | weight | acceleration | model year | origin |
|---|---|---|---|---|---|---|---|
| count | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 |
| mean | 23.514573 | 5.454774 | 193.425879 | 2970.424623 | 15.56809 | 76.01005 | 1.572864 |
| std | 7.815984 | 1.701004 | 104.269838 | 846.841774 | 2.757689 | 3.697627 | 0.802055 |
| min | 9.0 | 3.0 | 68.0 | 1613.0 | 8.0 | 70.0 | 1.0 |
| 25% | 17.5 | 4.0 | 104.25 | 2223.75 | 13.825 | 73.0 | 1.0 |
| 50% | 23.0 | 4.0 | 148.5 | 2803.5 | 15.5 | 76.0 | 1.0 |
| 75% | 29.0 | 8.0 | 262.0 | 3608.0 | 17.175 | 79.0 | 2.0 |
| max | 46.6 | 8.0 | 455.0 | 5140.0 | 24.8 | 82.0 | 3.0 |


In [43]:
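## Check per-column distribution and basic statistics for numeric and text columns together [parameter include='all']
## -> numeric data : mean, standard deviation, min, max, quartiles
## -> text data    : number of unique values, most frequent value, its frequency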
mpgDF.describe(include='all')

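| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| count | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398 |
| unique | NaN | NaN | NaN | 94.0 | NaN | NaN | NaN | NaN | 305 |
| top | NaN | NaN | NaN | 150.0 | NaN | NaN | NaN | NaN | ford pinto |
| freq | NaN | NaN | NaN | 22.0 | NaN | NaN | NaN | NaN | 6 |
| mean | 23.514573 | 5.454774 | 193.425879 | NaN | 2970.424623 | 15.56809 | 76.01005 | 1.572864 | NaN |
| std | 7.815984 | 1.701004 | 104.269838 | NaN | 846.841774 | 2.757689 | 3.697627 | 0.802055 | NaN |
| min | 9.0 | 3.0 | 68.0 | NaN | 1613.0 | 8.0 | 70.0 | 1.0 | NaN |
| 25% | 17.5 | 4.0 | 104.25 | NaN | 2223.75 | 13.825 | 73.0 | 1.0 | NaN |
| 50% | 23.0 | 4.0 | 148.5 | NaN | 2803.5 | 15.5 | 76.0 | 1.0 | NaN |
| 75% | 29.0 | 8.0 | 262.0 | NaN | 3608.0 | 17.175 | 79.0 | 2.0 | NaN |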
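
Besides `include='all'`, `describe()` can be limited to one kind of column. A minimal sketch on a hypothetical two-column frame (`demoDF` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in with one numeric and one text column.
demoDF = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0],
    'car name': ['ford pinto', 'ford pinto', 'buick skylark 320'],
})

numDescDF = demoDF.describe()                      # default: numeric columns only
textDescDF = demoDF.describe(exclude='number')     # only the non-numeric columns

print(numDescDF)
print(textDescDF)   # rows: count, unique, top, freq
```

Splitting the summary this way avoids the NaN-filled cells that appear when numeric and text statistics share one table.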


[3] Using the statistics methods <hr>

In [44]:
## -------------------------------------------------------
## => Compute and print statistics for each column
## -------------------------------------------------------
## => Per-column calculation
for col in mpgDF.columns:
    print(mpgDF[col].dtype, mpgDF[col].dtype != object, sep='\t')
    print(f'[{col:12}] {mpgDF[col].dtype} --- {mpgDF[col].dtype != object}')

    if mpgDF[col].dtype != object:
        print(f'mean    : {mpgDF[col].mean(numeric_only=True)}')     ## numeric_only restricts the calculation to numeric data

        print(f'min     : {mpgDF[col].min(numeric_only=True)}')

        print(f'max     : {mpgDF[col].max(numeric_only=True)}')

        print(f'median  : {mpgDF[col].median(numeric_only=True)}')

        print(f'mode    : {mpgDF[col].mode()[0]}')                  ## <= mode() returns a Series, so take the first value

        print(f'sum     : {mpgDF[col].sum(numeric_only=True)}\n')
        


float64	True
[mpg         ] float64 --- True
mean    : 23.514572864321607
min     : 9.0
max     : 46.6
median  : 23.0
mode    : 13.0
sum     : 9358.8

int64	True
[cylinders   ] int64 --- True
mean    : 5.454773869346734
min     : 3
max     : 8
median  : 4.0
mode    : 4
sum     : 2171

float64	True
[displacement] float64 --- True
mean    : 193.42587939698493
min     : 68.0
max     : 455.0
median  : 148.5
mode    : 97.0
sum     : 76983.5

object	False
[horsepower  ] object --- False
int64	True
[weight      ] int64 --- True
mean    : 2970.424623115578
min     : 1613
max     : 5140
median  : 2803.5
mode    : 1985
sum     : 1182229

float64	True
[acceleration] float64 --- True
mean    : 15.568090452261307
min     : 8.0
max     : 24.8
median  : 15.5
mode    : 14.5
sum     : 6196.1

int64	True
[model year  ] int64 --- True
mean    : 76.01005025125629
min     : 70
max     : 82
median  : 76.0
mode    : 73
sum     : 30252

int64	True
[origin      ] int64 --- True
mean    : 1.5728643216080402
min
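
The per-column loop above can also be written without checking each dtype by hand. The sketch below is one possible alternative, shown on a hypothetical frame (`demoDF`) rather than the real data: `select_dtypes` keeps the numeric columns and `agg` applies several statistics at once.

```python
import pandas as pd

# Hypothetical stand-in for mpgDF with one text column.
demoDF = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0, 16.0],
    'cylinders': [8, 8, 8, 6],
    'car name': ['a', 'b', 'c', 'd'],
})

# Keep only numeric columns, then compute several statistics in one call.
numDF = demoDF.select_dtypes(include='number')
summaryDF = numDF.agg(['mean', 'min', 'max', 'median', 'sum'])
print(summaryDF)

# mode() may return several rows when values tie; take the first row.
modeSR = numDF.mode().iloc[0]
print(modeSR)
```

The result is a single table with one row per statistic, which is easier to compare across columns than the printed blocks.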

In [45]:
## ------------------------------------------------------
## => Unique values : the distinct values present in a column, unique() method
## ------------------------------------------------------
# mpgDF.unique() ## unique() exists only on Series -> loop over the columns

## Extract and print the unique values of each column
## nunique() gives the count; unique() gives the values themselves
for col in mpgDF.columns:
    print(f'[{col}] ------------- {mpgDF[col].nunique()} unique values', mpgDF[col].unique(), sep='\n')

[mpg] ------------- 129 unique values
[18.  15.  16.  17.  14.  24.  22.  21.  27.  26.  25.  10.  11.   9.
 28.  19.  12.  13.  23.  30.  31.  35.  20.  29.  32.  33.  17.5 15.5
 14.5 22.5 24.5 18.5 29.5 26.5 16.5 31.5 36.  25.5 33.5 20.5 30.5 21.5
 43.1 36.1 32.8 39.4 19.9 19.4 20.2 19.2 25.1 20.6 20.8 18.6 18.1 17.7
 27.5 27.2 30.9 21.1 23.2 23.8 23.9 20.3 21.6 16.2 19.8 22.3 17.6 18.2
 16.9 31.9 34.1 35.7 27.4 25.4 34.2 34.5 31.8 37.3 28.4 28.8 26.8 41.5
 38.1 32.1 37.2 26.4 24.3 19.1 34.3 29.8 31.3 37.  32.2 46.6 27.9 40.8
 44.3 43.4 36.4 44.6 40.9 33.8 32.7 23.7 23.6 32.4 26.6 25.8 23.5 39.1
 39.  35.1 32.3 37.7 34.7 34.4 29.9 33.7 32.9 31.6 28.1 30.7 24.2 22.4
 34.  38.  44. ]
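
For counting rather than listing distinct values, pandas also offers `nunique()` (available on a whole DataFrame) and `value_counts()`. A short sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical category-like columns.
demoDF = pd.DataFrame({
    'cylinders': [8, 8, 4, 6, 4, 4],
    'origin': [1, 1, 3, 2, 3, 1],
})

countSR = demoDF.nunique()                     # number of distinct values per column
valuesARR = demoDF['cylinders'].unique()       # distinct values in first-seen order
freqSR = demoDF['cylinders'].value_counts()    # frequency of each value, most common first

print(countSR, valuesARR, freqSR, sep='\n')
```

Unlike `unique()`, `nunique()` works directly on the DataFrame, so no loop over the columns is needed just to get the counts.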

In [46]:
## ----------------------------------------------------
## => Compute and print statistics for the whole DataFrame
## ----------------------------------------------------
for func in [mpgDF.mean, mpgDF.std, mpgDF.median, mpgDF.sum]:
    print(f'\n---------------------------')
    display(func(numeric_only=True))


---------------------------


mpg               23.514573
cylinders          5.454774
displacement     193.425879
weight          2970.424623
acceleration      15.568090
model year        76.010050
origin             1.572864
dtype: float64


---------------------------


mpg               7.815984
cylinders         1.701004
displacement    104.269838
weight          846.841774
acceleration      2.757689
model year        3.697627
origin            0.802055
dtype: float64


---------------------------


mpg               23.0
cylinders          4.0
displacement     148.5
weight          2803.5
acceleration      15.5
model year        76.0
origin             1.0
dtype: float64


---------------------------


mpg                9358.8
cylinders          2171.0
displacement      76983.5
weight          1182229.0
acceleration       6196.1
model year        30252.0
origin              626.0
dtype: float64

[4] Relationships between columns <hr>
- Correlation coefficient : corr() method
    * Range : -1 to 1
    * Types : negative correlation, positive correlation
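
By default `corr()` computes the Pearson coefficient. As a rough illustration of what that number is, the sketch below computes it by hand for two short hypothetical series (the values are made up, not taken from the file) and compares the result with `Series.corr`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values: heavier cars paired with lower mpg.
weightSR = pd.Series([3504.0, 3693.0, 3436.0, 2372.0, 2130.0])
mpgSR = pd.Series([18.0, 15.0, 18.0, 24.0, 27.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are the deviations from each mean.
dx = weightSR - weightSR.mean()
dy = mpgSR - mpgSR.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = weightSR.corr(mpgSR)   # method='pearson' is the default

print(r_manual, r_pandas)
```

A value near -1 means one column tends to fall as the other rises, a value near 1 means they rise together, and a value near 0 means no linear relationship.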

In [47]:
## All numeric columns
mpgDF.corr(numeric_only=True)

| | mpg | cylinders | displacement | weight | acceleration | model year | origin |
|---|---|---|---|---|---|---|---|
| mpg | 1.0 | -0.775396 | -0.804203 | -0.831741 | 0.420289 | 0.579267 | 0.56345 |
| cylinders | -0.775396 | 1.0 | 0.950721 | 0.896017 | -0.505419 | -0.348746 | -0.562543 |
| displacement | -0.804203 | 0.950721 | 1.0 | 0.932824 | -0.543684 | -0.370164 | -0.609409 |
| weight | -0.831741 | 0.896017 | 0.932824 | 1.0 | -0.417457 | -0.306564 | -0.581024 |
| acceleration | 0.420289 | -0.505419 | -0.543684 | -0.417457 | 1.0 | 0.288137 | 0.205873 |
| model year | 0.579267 | -0.348746 | -0.370164 | -0.306564 | 0.288137 | 1.0 | 0.180662 |
| origin | 0.56345 | -0.562543 | -0.609409 | -0.581024 | 0.205873 | 0.180662 | 1.0 |


In [49]:
## Correlation coefficients between fuel economy (mpg) and the other columns
mpg_corrSR = mpgDF.corr(numeric_only=True).loc['cylinders':,'mpg']
mpg_corrSR.sort_values()

weight         -0.831741
displacement   -0.804203
cylinders      -0.775396
acceleration    0.420289
origin          0.563450
model year      0.579267
Name: mpg, dtype: float64

In [50]:
## -> Columns strongly related to mpg: weight, displacement, cylinders [negative correlation]
##                                     model year, origin [positive correlation]

##    Overall, weight, displacement, and cylinders have the strongest relationship with fuel economy!
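
Because the sign only gives the direction, ranking by absolute value makes the strength of each relationship easier to compare. A small sketch, with the coefficients copied from the `mpg_corrSR` output shown earlier (`rankedSR` is a name introduced here for illustration):

```python
import pandas as pd

# Coefficients copied from the sorted mpg correlation output in this notebook.
mpg_corrSR = pd.Series({
    'weight': -0.831741, 'displacement': -0.804203, 'cylinders': -0.775396,
    'acceleration': 0.420289, 'origin': 0.563450, 'model year': 0.579267,
}, name='mpg')

# Order by strength regardless of sign, but keep the signed values for display.
order = mpg_corrSR.abs().sort_values(ascending=False).index
rankedSR = mpg_corrSR.loc[order]
print(rankedSR)
```

With this ordering the three negatively correlated columns come first, followed by `model year`, `origin`, and `acceleration`.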