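## Basic analysis
Pandas is a package built to make structured data easier to work with. The statistical analysis it provides covers descriptive statistics and data summaries; more advanced statistical techniques can be carried out with scikit-learn or other statistics packages.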

In [1]:
from pandas import Series, DataFrame
import pandas as pd

In [2]:
# Load the German credit data
german = pd.read_csv('http://freakonometrics.free.fr/german_credit.csv')

In [4]:
german.columns.values

array(['Creditability', 'Account Balance', 'Duration of Credit (month)',
       'Payment Status of Previous Credit', 'Purpose', 'Credit Amount',
       'Value Savings/Stocks', 'Length of current employment',
       'Instalment per cent', 'Sex & Marital Status', 'Guarantors',
       'Duration in Current address', 'Most valuable available asset',
       'Age (years)', 'Concurrent Credits', 'Type of apartment',
       'No of Credits at this Bank', 'Occupation', 'No of dependents',
       'Telephone', 'Foreign Worker'], dtype=object)

In [5]:
# Convert the column names to a list
list(german.columns.values)

['Creditability',
 'Account Balance',
 'Duration of Credit (month)',
 'Payment Status of Previous Credit',
 'Purpose',
 'Credit Amount',
 'Value Savings/Stocks',
 'Length of current employment',
 'Instalment per cent',
 'Sex & Marital Status',
 'Guarantors',
 'Duration in Current address',
 'Most valuable available asset',
 'Age (years)',
 'Concurrent Credits',
 'Type of apartment',
 'No of Credits at this Bank',
 'Occupation',
 'No of dependents',
 'Telephone',
 'Foreign Worker']

In [6]:
# Select only the columns of interest
german_sample=german[['Creditability', 'Duration of Credit (month)','Purpose','Credit Amount']]
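As a small illustration of the selection above, the sketch below uses a hypothetical three-row frame (the values are made up, not taken from the German credit data). It shows that indexing with a list of names returns a DataFrame, while indexing with a single name returns a Series.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "Creditability": [1, 0, 1],
    "Purpose": [2, 0, 9],
    "Credit Amount": [1000, 2500, 800],
})

subset = toy[["Creditability", "Credit Amount"]]  # list of names -> DataFrame
amount = toy["Credit Amount"]                     # single name -> Series

print(type(subset).__name__, list(subset.columns))
print(type(amount).__name__, amount.name)
```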

In [7]:
german_sample

     Creditability  Duration of Credit (month)  Purpose  Credit Amount
0                1                          18        2           1049
1                1                           9        0           2799
2                1                          12        9            841
3                1                          12        0           2122
4                1                          12        0           2171
..             ...                         ...      ...            ...
995              0                          24        3           1987
996              0                          24        0           2303
997              0                          21        0          12680
998              0                          12        3           6468


In [8]:
# Minimum of each of the four selected columns
german_sample.min()

Creditability                   0
Duration of Credit (month)      4
Purpose                         0
Credit Amount                 250
dtype: int64

In [9]:
german_sample.max()

Creditability                     1
Duration of Credit (month)       72
Purpose                          10
Credit Amount                 18424
dtype: int64

In [11]:
# Mean of each column
german_sample.mean()

Creditability                    0.700
Duration of Credit (month)      20.903
Purpose                          2.828
Credit Amount                 3271.248
dtype: float64
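The minimum, maximum, and mean computed in the cells above can also be requested in one call. A minimal sketch, assuming a made-up frame (hypothetical values) and using `DataFrame.agg` with a list of function names:

```python
import pandas as pd

# Hypothetical toy data: values are made up for illustration.
toy = pd.DataFrame({
    "Duration of Credit (month)": [6, 12, 24, 18],
    "Credit Amount": [500, 1500, 4000, 2000],
})

# One row per statistic, one column per variable.
summary = toy.agg(["min", "max", "mean"])
print(summary)
```

The result is a DataFrame indexed by statistic name, so a single value can be read with `summary.loc["mean", "Credit Amount"]`.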

In [13]:
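# Overall summary
# Note: without parentheses this only displays the bound method (together with
# the frame itself); call german_sample.describe() to get the summary statistics.
german_sample.describe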

<bound method NDFrame.describe of      Creditability  Duration of Credit (month)  Purpose  Credit Amount
0                1                          18        2           1049
1                1                           9        0           2799
2                1                          12        9            841
3                1                          12        0           2122
4                1                          12        0           2171
..             ...                         ...      ...            ...
995              0                          24        3           1987
996              0                          24        0           2303
997              0                          21        0          12680
998              0                          12        3           6468
999              0                          30        2           6350

[1000 rows x 4 columns]>
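The cell above references `describe` without calling it, so the output is the bound method rather than the statistics. A minimal sketch of the call with parentheses, on a made-up frame (hypothetical values, not the German credit data):

```python
import pandas as pd

# Hypothetical toy data: values are made up for illustration.
toy = pd.DataFrame({
    "Duration of Credit (month)": [6, 12, 24, 18],
    "Credit Amount": [500, 1500, 4000, 2000],
})

# Parentheses are required to actually run the method.
stats = toy.describe()
print(stats)
```

For numeric columns the result has one row each for `count`, `mean`, `std`, `min`, the three quartiles, and `max`.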

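## Correlation and covariance

Covariance: the mean of the product of X's deviations and Y's deviations from their respective means. Its value depends on the units of X and Y.

Correlation coefficient: the covariance normalized so that it does not depend on the absolute scale of the variables, obtained by dividing it by the product of the standard deviations of X and Y.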
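To make the two definitions concrete, the sketch below computes both by hand on made-up numbers and compares them with pandas. One detail worth flagging: `Series.cov` divides by n - 1 (sample covariance) rather than n, so the manual version does the same; the correlation is unaffected by that choice.

```python
import math
import pandas as pd

# Hypothetical data: values are made up for illustration.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

dx = x - x.mean()          # deviations of x from its mean
dy = y - y.mean()          # deviations of y from its mean
n = len(x)

# Sample covariance (n - 1 in the denominator, matching pandas).
cov_manual = (dx * dy).sum() / (n - 1)
# Correlation: covariance divided by the product of the standard deviations.
corr_manual = cov_manual / (x.std() * y.std())

print(cov_manual, x.cov(y))
print(corr_manual, x.corr(y))

# Rescaling x (e.g. changing its unit) scales the covariance but not the correlation.
print((100 * x).cov(y), (100 * x).corr(y))
```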

In [16]:
german_sample=german[['Duration of Credit (month)','Credit Amount','Age (years)']]

In [18]:
# Correlation coefficients
german_sample.corr()

                            Duration of Credit (month)  Credit Amount  Age (years)
Duration of Credit (month)                    1.000000       0.624988    -0.037550
Credit Amount                                 0.624988       1.000000     0.032273
Age (years)                                  -0.037550       0.032273     1.000000


### Interpretation
Duration of Credit (month) and Credit Amount are positively correlated (r ≈ 0.62). Age (years) shows almost no linear correlation with either of them (|r| < 0.04).


In [19]:
german_sample=german[['Credit Amount','Type of apartment']]

In [20]:
german_sample

     Credit Amount  Type of apartment
0             1049                  1
1             2799                  1
2              841                  1
3             2122                  1
4             2171                  2
..             ...                ...
995           1987                  1
996           2303                  2
997          12680                  3
998           6468                  2


In [23]:
# There appear to be three housing types
# Compute Credit Amount statistics by housing type
german_grouped=german_sample['Credit Amount'].groupby(german_sample['Type of apartment'])
german_grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002508CF24AF0>

In [24]:
# Mean for each group
german_grouped.mean()

Type of apartment
1    3122.553073
2    3067.257703
3    4881.205607
Name: Credit Amount, dtype: float64
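The grouping above can also be written by passing the column name to `DataFrame.groupby`, and the guess about the number of housing types can be checked with `value_counts`. A sketch on made-up data (hypothetical values, not the German credit data):

```python
import pandas as pd

# Hypothetical toy data: values are made up for illustration.
toy = pd.DataFrame({
    "Credit Amount": [1000, 3000, 2000, 4000, 6000],
    "Type of apartment": [1, 1, 2, 2, 3],
})

# How many rows fall into each housing type?
counts = toy["Type of apartment"].value_counts().sort_index()

# Group by column name, then pick the column to aggregate.
means = toy.groupby("Type of apartment")["Credit Amount"].mean()

# Equivalent to the Series-based form used in the notebook.
same = toy["Credit Amount"].groupby(toy["Type of apartment"]).mean()

print(counts)
print(means)
```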

In [33]:
german_sample = german[['Credit Amount','Type of apartment','Purpose']]
german_grouped = german_sample['Credit Amount'].groupby(
                            [german_sample['Purpose'],
                            german_sample['Type of apartment']])
german_grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002508D1C3250>
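Grouping on two keys, as in the cell above, yields a result indexed by both keys. The sketch below tries this on made-up data (hypothetical values) and additionally uses `unstack` to lay the sub-group means out as a table, with one row per purpose and one column per housing type.

```python
import pandas as pd

# Hypothetical toy data: values are made up for illustration.
toy = pd.DataFrame({
    "Purpose": [0, 0, 0, 1, 1],
    "Type of apartment": [1, 1, 2, 1, 2],
    "Credit Amount": [1000, 3000, 2000, 4000, 6000],
})

# Mean Credit Amount for each (Purpose, Type of apartment) pair.
sub_means = toy.groupby(["Purpose", "Type of apartment"])["Credit Amount"].mean()

# Move the inner index level into columns for a table-like view.
table = sub_means.unstack("Type of apartment")

print(sub_means)
print(table)
```

Combinations that never occur in the data would appear as `NaN` in the unstacked table.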

In [34]:
# Grouping on a second key also yields the mean of each sub-group
german_grouped.mean()

Purpose  Type of apartment
0        1                    2597.225000
         2                    2811.024242
         3                    5138.689655
1        1                    5037.086957
         2                    4915.222222
         3                    6609.923077
2        1                    2727.354167
         2                    3107.450820
         3                    4100.181818
3        1                    2199.763158
         2                    2540.533040
         3                    2417.333333
4        1                    1255.500000
         2                    1546.500000
5        1                    1522.000000
         2                    2866.000000
         3                    2750.666667
6        1                    3156.444444
         2                    2492.423077
         3                    4387.266667
8        1                     902.000000
         2                    1243.875000
9        1                    5614.125000
       