## Correlation Analysis
Correlation analysis is performed to check whether two variables have a linear relationship.

### Pearson's Correlation Coefficient
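- Compares two numeric variables
- The closer the coefficient is to 0, the weaker the linear relationship; the closer its absolute value is to 1, the stronger the linear relationship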
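
As a minimal sketch of what the coefficient measures, the function below computes Pearson's r from scratch with NumPy: the covariance of the two variables divided by the product of their standard deviations. The helper name pearson_r and the small arrays are made up for illustration and are not part of the bike data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: sum of co-deviations over the root of the product of squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up illustrative data (hypothetical, not from bike.csv)
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])                     # perfectly increasing line
r_down = pearson_r(x, [10, 8, 6, 4, 2])                   # perfectly decreasing line
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])   # y = x**2, symmetric around 0
print(r_up, r_down, r_curve)
```

The last case illustrates why a coefficient near 0 only indicates a weak linear relationship: y is fully determined by x there, yet the deviations cancel out and r is 0.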

#### pandas - corr()
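A DataFrame method for quickly computing a matrix of pairwise correlation coefficients

The method argument accepts 'pearson', 'kendall', or 'spearman' and computes the corresponding coefficient (the default is 'pearson')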
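
The following sketch shows how the method argument changes the result. The DataFrame is made up for illustration (y is the square of x, so the relationship is monotonic but not linear); it is not taken from the bike data.

```python
import pandas as pd

# Made-up data: y grows monotonically, but not linearly, with x
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6], "y": [1, 4, 9, 16, 25, 36]})

pearson = df.corr(method="pearson")    # linear association (the default)
spearman = df.corr(method="spearman")  # rank-based association
kendall = df.corr(method="kendall")    # based on concordant and discordant pairs

# Each result is a symmetric matrix; pick the x-y cell by label
print(pearson.loc["x", "y"], spearman.loc["x", "y"], kendall.loc["x", "y"])
```

Because the ranks of x and y agree perfectly, the two rank-based coefficients reach 1, while Pearson's coefficient stays below 1. The same pattern appears later in this notebook, where the Spearman values for casual, registered, and count are higher than the Pearson values.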

#### scipy - pearsonr()
A scipy function that performs Pearson correlation analysis

It takes two one-dimensional vectors as input and returns the correlation coefficient followed by the p-value

#### scipy - spearmanr()
A scipy function that performs Spearman correlation analysis

It takes two one-dimensional vectors as input and returns the correlation coefficient followed by the p-value

#### scipy - kendalltau()
A scipy function that performs Kendall correlation analysis

It takes two one-dimensional vectors as input and returns the correlation coefficient followed by the p-value
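
Since the three scipy functions share the same call pattern, a single loop can sketch all of them. The data here is synthetic (an assumption for illustration: y is a noisy linear function of x), and the results dictionary is only a convenience for this example.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Synthetic data (hypothetical): y increases with x, plus random noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

results = {}
for name, func in [("pearson", pearsonr), ("spearman", spearmanr), ("kendall", kendalltau)]:
    stat, p = func(x, y)  # each result unpacks as (coefficient, p-value)
    results[name] = (stat, p)
    print(f"{name}: coefficient={stat:.3f}, p-value={p:.3g}")
```

Unlike corr(), these functions take exactly two vectors and also report a p-value, which makes them the better fit when a significance test for a single pair is needed.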

##### The trailing r and tau in the function names refer to the coefficient (test statistic) that each function computes

In [42]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from scipy.stats import kendalltau

In [6]:
bike = pd.read_csv('ex/bike.csv')
bike.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [8]:
# Correlation matrix of all numeric columns
bike.corr(numeric_only=True)

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
season,1.0,0.029368,-0.008126,0.008879,0.258689,0.264744,0.19061,-0.147121,0.096758,0.164011,0.163439
holiday,0.029368,1.0,-0.250491,-0.007074,0.000295,-0.005215,0.001929,0.008409,0.043799,-0.020956,-0.005393
workingday,-0.008126,-0.250491,1.0,0.033772,0.029966,0.02466,-0.01088,0.013373,-0.319111,0.11946,0.011594
weather,0.008879,-0.007074,0.033772,1.0,-0.055035,-0.055376,0.406244,0.007261,-0.135918,-0.10934,-0.128655
temp,0.258689,0.000295,0.029966,-0.055035,1.0,0.984948,-0.064949,-0.017852,0.467097,0.318571,0.394454
atemp,0.264744,-0.005215,0.02466,-0.055376,0.984948,1.0,-0.043536,-0.057473,0.462067,0.314635,0.389784
humidity,0.19061,0.001929,-0.01088,0.406244,-0.064949,-0.043536,1.0,-0.318607,-0.348187,-0.265458,-0.317371
windspeed,-0.147121,0.008409,0.013373,0.007261,-0.017852,-0.057473,-0.318607,1.0,0.092276,0.091052,0.101369
casual,0.096758,0.043799,-0.319111,-0.135918,0.467097,0.462067,-0.348187,0.092276,1.0,0.49725,0.690414
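registered,0.164011,-0.020956,0.11946,-0.10934,0.318571,0.314635,-0.265458,0.091052,0.49725,1.0,0.970948
count,0.163439,-0.005393,0.011594,-0.128655,0.394454,0.389784,-0.317371,0.101369,0.690414,0.970948,1.0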


In [9]:
bike[['casual', 'registered', 'count']].corr()

Unnamed: 0,casual,registered,count
casual,1.0,0.49725,0.690414
registered,0.49725,1.0,0.970948
count,0.690414,0.970948,1.0


In [10]:
bike[['casual', 'registered', 'count']].corr(method='spearman')

Unnamed: 0,casual,registered,count
casual,1.0,0.775785,0.847378
registered,0.775785,1.0,0.988901
count,0.847378,0.988901,1.0


In [11]:
pearsonr(bike['casual'], bike['registered'])

PearsonRResult(statistic=0.49724968508700823, pvalue=0.0)

In [12]:
stat,p = pearsonr(bike['casual'], bike['registered'])
print(stat) # correlation coefficient
print(p) # p-value

0.49724968508700823
0.0


In [24]:
bike[['temp', 'atemp', 'humidity', 'casual']].corr(numeric_only=True).round(2).min()

temp       -0.06
atemp      -0.04
humidity   -0.35
casual     -0.35
dtype: float64

In [34]:
bike.groupby('season')[['atemp', 'casual']].corr(numeric_only=True).round(3).reset_index()

Unnamed: 0,season,level_1,atemp,casual
0,1,atemp,1.0,0.478
1,1,casual,0.478,1.0
2,2,atemp,1.0,0.378
3,2,casual,0.378,1.0
4,3,atemp,1.0,0.381
5,3,casual,0.381,1.0
6,4,atemp,1.0,0.444
7,4,casual,0.444,1.0


In [36]:
bike_corr = bike[['season', 'atemp', 'casual']].groupby('season').corr(numeric_only=True)
bike_corr = bike_corr.reset_index()
bike_corr.loc[bike_corr['level_1'] == 'atemp', 'casual'].reset_index(drop=True)

0    0.478312
1    0.378122
2    0.381423
3    0.443751
Name: casual, dtype: float64
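
When only the correlation between two columns is needed per group, an alternative is to compute it directly with Series.corr() inside groupby().apply(), which avoids building and then filtering the full matrix. This is a sketch under assumptions: the small DataFrame below is a made-up stand-in that only reuses the column names of bike.csv.

```python
import pandas as pd

# Made-up stand-in for bike.csv with the same column names (hypothetical values)
df = pd.DataFrame({
    "season": [1, 1, 1, 1, 2, 2, 2, 2],
    "atemp":  [10.0, 12.0, 14.0, 16.0, 20.0, 22.0, 24.0, 26.0],
    "casual": [3, 5, 6, 9, 30, 28, 35, 40],
})

# One coefficient per group: correlate the two columns inside each season
by_season = (
    df.groupby("season")[["atemp", "casual"]]
      .apply(lambda g: g["atemp"].corr(g["casual"]))
)
print(by_season.round(3))
```

The result is a Series indexed by season, so no reset_index() or level_1 filter is required.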

In [51]:
bike['is_sunny'] = bike['weather'] == 1
bike_corr = bike.groupby('is_sunny')[['temp', 'casual']].corr().reset_index()
bike_corr

Unnamed: 0,is_sunny,level_1,temp,casual
0,False,temp,1.0,0.446361
1,False,casual,0.446361,1.0
2,True,temp,1.0,0.471053
3,True,casual,0.471053,1.0


In [62]:
round(abs(bike_corr.iloc[0,3] - bike_corr.iloc[2,3]), 5)

0.02469
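
Positional lookups such as iloc[0,3] depend on the row and column order of the flattened table. As a hedged alternative, the sketch below selects the temp rows by label with xs() before taking the difference. The DataFrame is a made-up stand-in that reuses the column names from the cell above.

```python
import pandas as pd

# Made-up stand-in data (hypothetical values, same column names as above)
df = pd.DataFrame({
    "is_sunny": [False] * 4 + [True] * 4,
    "temp":     [5.0, 8.0, 11.0, 14.0, 10.0, 15.0, 20.0, 25.0],
    "casual":   [2, 4, 3, 7, 10, 18, 25, 40],
})

# Grouped correlation keeps a two-level index: (is_sunny, variable name)
corr = df.groupby("is_sunny")[["temp", "casual"]].corr()

# Take the 'temp' row of every group, then its 'casual' column:
# one temp-casual coefficient per is_sunny value
pair = corr.xs("temp", level=1)["casual"]

# With exactly two groups, max - min equals the absolute difference
diff = round(pair.max() - pair.min(), 5)
print(pair)
print(diff)
```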