## The Pandas Library

- A library that makes data analysis and data processing easier
- Commonly used alongside TensorFlow, Scikit-Learn, and Matplotlib in data analysis and visualization work

---


### Pandas Data Structures
- **Series** : a one-dimensional array with an index attached

|Index|Data|
|:---:|:--:|
|1| 'A' |
|2| 'B' |
|3| 'C' |
|4| 'D' |
|5| 'E' |

- **DataFrame** : a two-dimensional data structure made up of an index (rows) and columns

| |Gender|Age group|Sales amount|
|:------:|:--:|:----:|:------:|
|Customer 1| -- | ---- | ------ |
|Customer 2| -- | ---- | ------ |
|Customer 3| -- | ---- | ------ |
|Customer 4| -- | ---- | ------ |
|Customer 5| -- | ---- | ------ |
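
The two tables above can be sketched in code as follows. This is only an illustration: the column names `gender`, `age_group`, and `sales`, the row labels, and every cell value in the DataFrame are made up, since the table leaves them blank.

```python
import pandas as pd

# Series: the values 'A'..'E' labeled with the index 1..5, as in the first table
s = pd.Series(['A', 'B', 'C', 'D', 'E'], index=[1, 2, 3, 4, 5])

# DataFrame: one row per customer, one column per attribute.
# All names and values below are hypothetical.
customers = pd.DataFrame(
    {
        'gender': ['F', 'M', 'F', 'M', 'F'],
        'age_group': ['20s', '30s', '40s', '20s', '50s'],
        'sales': [12000, 8500, 23000, 4000, 15500],
    },
    index=['customer_1', 'customer_2', 'customer_3', 'customer_4', 'customer_5'],
)

print(s.loc[3])                   # look up a value by its index label
print(customers)
print(type(customers['sales']))   # a single column of a DataFrame is a Series
```

One way to read the relationship: a DataFrame behaves like a collection of Series that share the same row index.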

### Series Data

#### Creating Series Data

In [25]:
import pandas as pd
import numpy as np

In [26]:
# Create an array with NumPy's arange function
a = np.arange(5)
# Pass the array a to Series to create a Series object
s1 = pd.Series(a)

print(a)
print(s1)

[0 1 2 3 4]
0    0
1    1
2    2
3    3
4    4
dtype: int32


In [27]:
# Create a Python list
b = ['a', 'b', 'c', 'd', 'e']
# Create Series data from the list
s2 = pd.Series(b)
print(s2)

0    a
1    b
2    c
3    d
4    e
dtype: object


In [28]:
# Specify the index explicitly: values come from a, labels from b
s2 = pd.Series(a, index=b)
print(s2)

a    0
b    1
c    2
d    3
e    4
dtype: int32


In [29]:
c = [['a', 'b', 'c'], ['d', 'e', 'f']]
s3 = pd.Series(c)
print(s3)

0    [a, b, c]
1    [d, e, f]
dtype: object


In [30]:
# Create from a tuple, reusing the list b as the index
t = ('a', 'b', 'c', 'd', 'e')
s4 = pd.Series(t, index=b)
print(s4)

a    a
b    b
c    c
d    d
e    e
dtype: object


In [31]:
# Create Series data from a dictionary (keys: 국어 = Korean, 영어 = English, 수학 = math)
d = {'국어': [100, 30], '영어':80, '수학':90}
s5 = pd.Series(d)
print(s5)

국어    [100, 30]
영어           80
수학           90
dtype: object


#### Series Attributes and Index-Based Access

In [32]:
print(s5.index)
print(s5.values)
print(s5['국어'])
s5['영어'] = 90
print(s5['영어'])

Index(['국어', '영어', '수학'], dtype='object')
[list([100, 30]) 80 90]
[100, 30]
90
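
In the dictionary example above, one value is a list, so the Series ends up with dtype `object`. As a contrast, here is a minimal sketch with plain numeric scores (the values are hypothetical), where pandas can pick an integer dtype and numeric operations work directly.

```python
import pandas as pd

# Hypothetical scores; every value is a plain number
scores = pd.Series({'국어': 100, '영어': 80, '수학': 90})

print(scores.index)    # the labels
print(scores.values)   # the underlying NumPy array
print(scores.dtype)    # an integer dtype rather than object

# Assigning by label updates that entry in place
scores['영어'] = 90

# Numeric Series support element-wise arithmetic and aggregation
print(scores + 5)
print(scores.mean())
```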


---
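### Creating a DataFrame
- If a Series is one-dimensional, a DataFrame is its two-dimensional extension
- Because it is two-dimensional, it is indexed along both rows and columns
- Widely used for reshaping and transforming data in data analysis and machine learning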

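#### Reading a Data File
- Game statistics for NBA basketball players, 200 records
    - Player : player name
    - Pos : position
    - 3P : average number of 3-point shots made per game
    - 2P : average number of 2-point shots made per game
    - TRB : rebounds per game
    - AST : assists per game
    - STL : steals per game
    - BLK : blocks per game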

In [35]:
import pandas as pd

basket_ball = pd.read_csv('./basketball_stat.csv')
basket_ball

| |Player|Pos|3P|2P|TRB|AST|STL|BLK|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|Alex Abrines|SG|1.3|0.5|1.5|0.6|0.5|0.2|
|1|Steven Adams|C|0.0|6.0|9.5|1.6|1.5|1.0|
|2|Bam Adebayo|C|0.0|3.4|7.3|2.2|0.9|0.8|
|3|DeVaughn Akoon-Purcell|SG|0.0|0.4|0.6|0.9|0.3|0.0|
|4|LaMarcus Aldridge|C|0.1|8.3|9.2|2.4|0.5|1.3|
|...|...|...|...|...|...|...|...|...|
|195|Nik Stauskas|SG|1.0|1.0|1.9|1.2|0.3|0.1|
|196|D.J. Stephens|SG|0.0|1.0|0.0|0.0|1.0|0.0|
|197|Lance Stephenson|SG|1.1|1.6|3.2|2.1|0.6|0.1|
|198|Garrett Temple|SG|1.2|1.6|2.9|1.4|1.0|0.4|


In [36]:
basket_ball = pd.read_excel('./basketball_stat.xlsx')
basket_ball

| |Player|Pos|3P|2P|TRB|AST|STL|BLK|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|Alex Abrines|SG|1.3|0.5|1.5|0.6|0.5|0.2|
|1|Steven Adams|C|0.0|6.0|9.5|1.6|1.5|1.0|
|2|Bam Adebayo|C|0.0|3.4|7.3|2.2|0.9|0.8|
|3|DeVaughn Akoon-Purcell|SG|0.0|0.4|0.6|0.9|0.3|0.0|
|4|LaMarcus Aldridge|C|0.1|8.3|9.2|2.4|0.5|1.3|
|...|...|...|...|...|...|...|...|...|
|195|Nik Stauskas|SG|1.0|1.0|1.9|1.2|0.3|0.1|
|196|D.J. Stephens|SG|0.0|1.0|0.0|0.0|1.0|0.0|
|197|Lance Stephenson|SG|1.1|1.6|3.2|2.1|0.6|0.1|
|198|Garrett Temple|SG|1.2|1.6|2.9|1.4|1.0|0.4|
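
The data files themselves are not included here, so the following sketch feeds `pd.read_csv` an in-memory string instead of a path. The file layout (a header row and no separate index column) is an assumption; the three data rows are copied from the output shown above.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file; the layout is assumed, not taken
# from the real basketball_stat.csv
csv_text = """Player,Pos,3P,2P,TRB,AST,STL,BLK
Alex Abrines,SG,1.3,0.5,1.5,0.6,0.5,0.2
Steven Adams,C,0.0,6.0,9.5,1.6,1.5,1.0
Bam Adebayo,C,0.0,3.4,7.3,2.2,0.9,0.8
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# usecols keeps only the listed columns while parsing
subset = pd.read_csv(io.StringIO(csv_text), usecols=['Player', '3P'])
print(subset.columns.tolist())
```

Note that `pd.read_excel` additionally needs an Excel engine such as openpyxl installed in order to read `.xlsx` files.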


#### Inspecting DataFrame Data

- head(), tail() : show the first or last few rows (default: 5 rows)

In [38]:
basket_ball.head(3)

| |Player|Pos|3P|2P|TRB|AST|STL|BLK|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|Alex Abrines|SG|1.3|0.5|1.5|0.6|0.5|0.2|
|1|Steven Adams|C|0.0|6.0|9.5|1.6|1.5|1.0|
|2|Bam Adebayo|C|0.0|3.4|7.3|2.2|0.9|0.8|


In [39]:
basket_ball.tail()

| |Player|Pos|3P|2P|TRB|AST|STL|BLK|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|195|Nik Stauskas|SG|1.0|1.0|1.9|1.2|0.3|0.1|
|196|D.J. Stephens|SG|0.0|1.0|0.0|0.0|1.0|0.0|
|197|Lance Stephenson|SG|1.1|1.6|3.2|2.1|0.6|0.1|
|198|Garrett Temple|SG|1.2|1.6|2.9|1.4|1.0|0.4|
|199|Jared Terrell|SG|0.3|0.6|0.4|0.9|0.2|0.1|


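#### Getting to Know the DataFrame

- shape attribute : (number of rows, number of columns)
- index attribute : the label that identifies each row
- columns attribute : the names of the features recorded for each row
- describe method : computes summary statistics for the numeric columns
- info method : prints the data types, the non-null count of each column, and similar details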

In [40]:
basket_ball.shape

(200, 8)

In [41]:
basket_ball.columns

Index(['Player', 'Pos', '3P', '2P', 'TRB', 'AST', 'STL', 'BLK'], dtype='object')

In [43]:
basket_ball.describe()

| |3P|2P|TRB|AST|STL|BLK|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|count|200.0|200.0|200.0|200.0|200.0|200.0|
|mean|0.8265|2.445|3.8665|1.6495|0.592|0.454|
|std|0.84321|1.885591|2.853901|1.342988|0.383892|0.491316|
|min|0.0|0.0|0.0|0.0|0.0|0.0|
|25%|0.0|1.1|1.9|0.875|0.3|0.1|
|50%|0.65|1.9|3.2|1.2|0.5|0.3|
|75%|1.325|3.225|5.0|2.1|0.8|0.6|
|max|4.0|8.6|15.6|7.7|1.8|2.4|


In [44]:
basket_ball.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  200 non-null    object 
 1   Pos     200 non-null    object 
 2   3P      200 non-null    float64
 3   2P      200 non-null    float64
 4   TRB     200 non-null    float64
 5   AST     200 non-null    float64
 6   STL     200 non-null    float64
 7   BLK     200 non-null    float64
dtypes: float64(6), object(2)
memory usage: 12.6+ KB
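
To try these attributes and methods without the data file, here is a small sketch on a three-row frame built from rows shown earlier. The `include='all'` argument is an addition not used in the original notebook; it asks `describe` to summarize the non-numeric columns as well.

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['Alex Abrines', 'Steven Adams', 'Bam Adebayo'],
    'Pos': ['SG', 'C', 'C'],
    '3P': [1.3, 0.0, 0.0],
    '2P': [0.5, 6.0, 3.4],
})

print(df.shape)             # (number of rows, number of columns)
print(df.index)             # row labels
print(df.columns.tolist())  # column names
print(df.dtypes)            # data type of each column

# By default describe() summarizes only the numeric columns
numeric_summary = df.describe()
print(numeric_summary)

# include='all' adds the text columns to the summary
full_summary = df.describe(include='all')
print(full_summary)
```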


#### Selecting Columns

In [45]:
basket_ball.head()

| |Player|Pos|3P|2P|TRB|AST|STL|BLK|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|Alex Abrines|SG|1.3|0.5|1.5|0.6|0.5|0.2|
|1|Steven Adams|C|0.0|6.0|9.5|1.6|1.5|1.0|
|2|Bam Adebayo|C|0.0|3.4|7.3|2.2|0.9|0.8|
|3|DeVaughn Akoon-Purcell|SG|0.0|0.4|0.6|0.9|0.3|0.0|
|4|LaMarcus Aldridge|C|0.1|8.3|9.2|2.4|0.5|1.3|


In [46]:
basket_ball['3P']

0      1.3
1      0.0
2      0.0
3      0.0
4      0.1
      ... 
195    1.0
196    0.0
197    1.1
198    1.2
199    0.3
Name: 3P, Length: 200, dtype: float64

In [48]:
# Put the column names in a list inside the []; the result is a DataFrame.
basket_ball[['3P']]

| |3P|
|:---:|:---:|
|0|1.3|
|1|0.0|
|2|0.0|
|3|0.0|
|4|0.1|
|...|...|
|195|1.0|
|196|0.0|
|197|1.1|
|198|1.2|


In [49]:
basket_ball[['Player','AST', 'BLK']]

| |Player|AST|BLK|
|:---:|:---:|:---:|:---:|
|0|Alex Abrines|0.6|0.2|
|1|Steven Adams|1.6|1.0|
|2|Bam Adebayo|2.2|0.8|
|3|DeVaughn Akoon-Purcell|0.9|0.0|
|4|LaMarcus Aldridge|2.4|1.3|
|...|...|...|...|
|195|Nik Stauskas|1.2|0.1|
|196|D.J. Stephens|0.0|0.0|
|197|Lance Stephenson|2.1|0.1|
|198|Garrett Temple|1.4|0.4|
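
The difference between `basket_ball['3P']` and `basket_ball[['3P']]` is easy to miss in the printed output. A short sketch on a small frame (rows copied from above) makes the returned types explicit.

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['Alex Abrines', 'Steven Adams', 'Bam Adebayo'],
    '3P': [1.3, 0.0, 0.0],
    'AST': [0.6, 1.6, 2.2],
})

one = df['3P']                 # a single label returns a Series
one_df = df[['3P']]            # a list with one label returns a DataFrame
two = df[['Player', 'AST']]    # a list of labels returns a DataFrame

print(type(one).__name__, one.shape)
print(type(one_df).__name__, one_df.shape)
print(type(two).__name__, two.shape)
```

The distinction matters when later code expects a two-dimensional input, for example a feature matrix.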


#### Selecting Rows by Condition

- & : and, both conditions must hold
- | : or, at least one of the conditions must hold

In [53]:
basket_ball[basket_ball['3P'] > 1.0]

| |Player|Pos|3P|2P|TRB|AST|STL|BLK|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|Alex Abrines|SG|1.3|0.5|1.5|0.6|0.5|0.2|
|14|Will Barton|SG|1.6|2.7|4.6|2.9|0.4|0.5|
|16|Kent Bazemore|SG|1.4|2.7|3.9|2.3|1.3|0.6|
|17|Bradley Beal|SG|2.5|6.8|5.0|5.5|1.5|0.7|
|18|Malik Beasley|SG|2.0|2.3|2.5|1.2|0.7|0.1|
|...|...|...|...|...|...|...|...|...|
|189|Iman Shumpert|SG|1.5|1.2|3.0|1.8|1.0|0.4|
|192|Marcus Smart|SG|1.6|1.4|2.9|4.0|1.8|0.4|
|193|J.R. Smith|SG|1.1|1.4|1.6|1.9|1.0|0.3|
|197|Lance Stephenson|SG|1.1|1.6|3.2|2.1|0.6|0.1|


In [55]:
basket_ball[(basket_ball['3P'] > 1.0) & (basket_ball['2P'] > 5)]

| |Player|Pos|3P|2P|TRB|AST|STL|BLK|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|17|Bradley Beal|SG|2.5|6.8|5.0|5.5|1.5|0.7|
|28|Devin Booker|SG|2.1|7.0|4.1|6.8|0.9|0.2|
|71|Joel Embiid|C|1.2|7.8|13.6|3.7|0.7|1.9|
|106|Jrue Holiday|SG|1.8|6.4|5.0|7.7|1.6|0.8|
|129|Zach LaVine|SG|1.9|6.5|4.7|4.5|1.0|0.4|
|144|CJ McCollum|SG|2.4|5.8|4.0|3.0|0.8|0.4|
|152|Donovan Mitchell|SG|2.4|6.1|4.1|4.2|1.4|0.4|
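
The filtering above works because a comparison such as `basket_ball['3P'] > 1.0` produces a boolean Series, which is then used to pick rows. The sketch below spells that out on five rows copied from the outputs above, keeping their original index labels. The `~` (not) operator is an addition that the original notebook does not use.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Player': ['Alex Abrines', 'Steven Adams', 'Bam Adebayo',
                   'LaMarcus Aldridge', 'Bradley Beal'],
        '3P': [1.3, 0.0, 0.0, 0.1, 2.5],
        '2P': [0.5, 6.0, 3.4, 8.3, 6.8],
    },
    index=[0, 1, 2, 4, 17],
)

# Each comparison yields one True/False value per row
mask_3p = df['3P'] > 1.0
mask_2p = df['2P'] > 5

both = df[mask_3p & mask_2p]          # rows where both conditions hold
either = df[mask_3p | mask_2p]        # rows where at least one holds
neither = df[~(mask_3p | mask_2p)]    # rows where neither holds

print(both)
print(either)
print(neither)
```

When the comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `>` in Python.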


#### Selecting Rows

- Rows can be selected with .loc and .iloc
- loc : selects by index label
- iloc : selects by 0-based integer position

In [56]:
basket_ball.loc[0]

Player    Alex Abrines
Pos                 SG
3P                 1.3
2P                 0.5
TRB                1.5
AST                0.6
STL                0.5
BLK                0.2
Name: 0, dtype: object

In [57]:
basket_ball.loc[[0]]

| |Player|Pos|3P|2P|TRB|AST|STL|BLK|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|Alex Abrines|SG|1.3|0.5|1.5|0.6|0.5|0.2|


In [58]:
basket_ball.iloc[0]

Player    Alex Abrines
Pos                 SG
3P                 1.3
2P                 0.5
TRB                1.5
AST                0.6
STL                0.5
BLK                0.2
Name: 0, dtype: object

In [59]:
basket_ball.iloc[[0]]

| |Player|Pos|3P|2P|TRB|AST|STL|BLK|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|Alex Abrines|SG|1.3|0.5|1.5|0.6|0.5|0.2|


In [60]:
basket_ball.iloc[[0, 4, 5]]

| |Player|Pos|3P|2P|TRB|AST|STL|BLK|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|Alex Abrines|SG|1.3|0.5|1.5|0.6|0.5|0.2|
|4|LaMarcus Aldridge|C|0.1|8.3|9.2|2.4|0.5|1.3|
|5|Rawle Alkins|SG|0.3|1.0|2.6|1.3|0.1|0.0|
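
In the examples above, `loc` and `iloc` return the same rows because the default index labels happen to equal the row positions. The sketch below uses a frame whose labels are 17, 28, and 71 (three rows from the filtered output earlier) to show where the two diverge.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Player': ['Bradley Beal', 'Devin Booker', 'Joel Embiid'],
        '2P': [6.8, 7.0, 7.8],
    },
    index=[17, 28, 71],
)

print(df.loc[17])        # label 17
print(df.iloc[0])        # position 0, which is the same row here
print(df.iloc[[0, 2]])   # positions 0 and 2, i.e. labels 17 and 71

# No row carries the label 0, so .loc raises KeyError
try:
    df.loc[0]
except KeyError:
    print('label 0 not found')
```

This situation comes up whenever a frame has been filtered or sorted without resetting its index.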


#### Selecting Rows and Columns Together

In [61]:
basket_ball.loc[[0,3,6], ['Player', 'BLK']]

| |Player|BLK|
|:---:|:---:|:---:|
|0|Alex Abrines|0.2|
|3|DeVaughn Akoon-Purcell|0.0|
|6|Grayson Allen|0.2|


In [62]:
basket_ball.iloc[[0, 3, 5],[0, 4, 7]]

| |Player|TRB|BLK|
|:---:|:---:|:---:|:---:|
|0|Alex Abrines|1.5|0.2|
|3|DeVaughn Akoon-Purcell|0.6|0.0|
|5|Rawle Alkins|2.6|0.0|
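
Besides lists, `loc` and `iloc` also accept slices, and `loc` accepts a boolean mask for the rows. The original notebook does not cover these, so take the following as a supplementary sketch on four rows copied from above. The point to watch is that label slices include their end point while position slices do not.

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['Alex Abrines', 'Steven Adams', 'Bam Adebayo',
               'DeVaughn Akoon-Purcell'],
    'Pos': ['SG', 'C', 'C', 'SG'],
    'TRB': [1.5, 9.5, 7.3, 0.6],
    'BLK': [0.2, 1.0, 0.8, 0.0],
})

# loc slices by label and includes both ends:
# rows 0, 1, 2 and columns Player, Pos, TRB
by_label = df.loc[0:2, 'Player':'TRB']

# iloc slices by position and excludes the end:
# rows 0, 1 and columns 0, 1
by_pos = df.iloc[0:2, 0:2]

# A boolean mask picks the rows, a list of labels picks the columns
centers = df.loc[df['Pos'] == 'C', ['Player', 'BLK']]

print(by_label)
print(by_pos)
print(centers)
```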
