# Pandas DataFrame

## DataFrame

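- A data type similar to a 2-dimensional array

- Can be built from a nested (multi-dimensional) `list` or a `dict`

- Structurally similar to a relational database `table`, an MS Excel `.xlsx` sheet, or a `.csv` file

    - One `column` = one `Series` (a single extracted row is also returned as a `Series`)

    - One `DataFrame` = a collection of one or more `Series`

- Index characteristics

    - row index: `axis=0`

        - Even when a custom `label index` replaces the default RangeIndex `int index`, rows can still be accessed by integer position through `df.iloc`

    - column index: `axis=1`

        - Once columns have a custom `label index`, `df[...]` accepts only those labels; integer positions work only through `df.iloc`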
        
![df_example.PNG](https://github.com/insung-ethan-j/Numpy_and_Pandas/blob/b779370a58187754994887cca639105a8b7ff6c6/img/df_example.PNG?raw=true)
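
The index rules above can be checked with a small frame. This is a minimal sketch; the frame `df` and its labels `r1`..`r3`, `c1`, `c2` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical 3x2 frame with custom row and column labels.
df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=["r1", "r2", "r3"],
    columns=["c1", "c2"],
)

# Label-based access uses the names assigned to the rows and columns.
by_label = df.loc["r2", "c1"]

# Position-based access keeps working after labels are assigned.
by_position = df.iloc[1, 0]

# Plain brackets select a column by label only; df[0] fails here because
# no column is literally named 0.
first_column = df["c1"]
try:
    df[0]
    int_key_failed = False
except KeyError:
    int_key_failed = True

print(by_label, by_position, int_key_failed)
print(first_column.tolist())
```

Both `by_label` and `by_position` refer to the same cell, which is why `iloc` remains a safe fallback when the labels are not known in advance.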

In [1]:
import pandas as pd
import numpy as np

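## 1. Creating a DataFrame

- `pd.DataFrame(data)`

- data: a nested `list` or a `dict`

    - nested `list`: inner lists normally have the same length; mixed dtypes are allowed

    - `dict`: every value must have the same length; mixed dtypes are allowed

        - Note: mismatched item lengths are handled differently depending on the data type (a nested `list` is padded with `NaN`, a `dict` of lists raises `ValueError`)

<br>

- A `DataFrame` cell can hold any data type, and different data types can be mixed

    - The length along each `axis` must be consistent

<br>

- 2-dim list (3 rows, 4 columns): no `label index` given > a RangeIndex `int index` is created automatically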

In [2]:
list_2dim = [[1, 2, 3.5, 4],
        ['a', 'b', 'c', 'd'],
        [0.1, 3, 0.5, 8]]

In [3]:
df_2dim = pd.DataFrame(list_2dim)

df_2dim

Unnamed: 0,0,1,2,3
0,1,2,3.5,4
1,a,b,c,d
2,0.1,3,0.5,8


<br>

- A `list` whose rows have different lengths

    - The `DataFrame` structure is built from the longest row

    - Missing `cell`s in shorter rows are filled with `NaN`

In [4]:
len_list = [[1, 2, 3, 4, 5],
        ['a', 'b'],
        [0.1, 0.2, 0.5]]

In [5]:
df_len = pd.DataFrame(len_list)

df_len

Unnamed: 0,0,1,2,3,4
0,1,2,3.0,4.0,5.0
1,a,b,,,
2,0.1,0.2,0.5,,


<br>

- Creating a `DataFrame` from `dict` data

    - The `dict` keys automatically become the `DataFrame` column names

    - Every `dict` value must contain the same number of items

In [6]:
my_dict = {'a':[10, 20, 30, 40],
           'b':[1, 2, 3, 4],
           'c':[5, 6, 7, 8]}

df_dict = pd.DataFrame(my_dict)

print(type(df_dict))
df_dict

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,a,b,c
0,10,1,5
1,20,2,6
2,30,3,7
3,40,4,8


<br>

- `pd.DataFrame(dict)`: the values matched to each `key` must all have the same length

    - Creating a `DataFrame` from a `dict` whose values are short of items: **ValueError**

In [7]:
defi_dict = {'a':[10],
        'b':[1, 2, 3, 4],
        'c':[5, 6, 7]}

defi_dict

{'a': [10], 'b': [1, 2, 3, 4], 'c': [5, 6, 7]}

In [8]:
# defi_df = pd.DataFrame(defi_dict)

# ValueError: All arrays must be of the same length
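
If a ragged `dict` still has to become a `DataFrame`, one common workaround is to wrap each value in a `pd.Series` first, so that pandas aligns on the integer labels and pads the gaps with `NaN`. The sketch below reuses the `defi_dict` values from the cell above; the name `padded` is introduced here only for illustration.

```python
import pandas as pd

defi_dict = {"a": [10], "b": [1, 2, 3, 4], "c": [5, 6, 7]}

# Passing the ragged dict directly fails, as noted above.
try:
    pd.DataFrame(defi_dict)
    raised = False
except ValueError:
    raised = True

# Wrapping every value in a Series lets pandas align on the labels 0..3
# and fill the missing positions with NaN.
padded = pd.DataFrame({key: pd.Series(values) for key, values in defi_dict.items()})

print(raised)
print(padded)
```

Columns that receive `NaN` are upcast to float, so this is a convenience for inspection rather than a lossless conversion.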

<br>

- Even when the data is not a `dict`, a `row index` and `columns index` can be specified

    - Create a DataFrame with explicit indexes through the optional `pd.DataFrame()` parameters `index=[]` and `columns=[]`


- `columns=` parameter: sets the column index names   
  \> pass a list whose length equals the number of columns


- `index=` parameter: sets the row index names   
  \> pass a list whose length equals the number of rows

In [9]:
list_2dim

[[1, 2, 3.5, 4], ['a', 'b', 'c', 'd'], [0.1, 3, 0.5, 8]]

In [10]:
df_index = pd.DataFrame(list_2dim,
                   index=['r1', 'r2', 'r3'],
                   columns=['c1', 'c2', 'c3', 'c4'])

df_index

Unnamed: 0,c1,c2,c3,c4
r1,1,2,3.5,4
r2,a,b,c,d
r3,0.1,3,0.5,8


<br>

- Creating a `DataFrame` from a `dict`: the `column` (key) order can be changed and a `row index` can be specified at creation

In [11]:
my_dict

{'a': [10, 20, 30, 40], 'b': [1, 2, 3, 4], 'c': [5, 6, 7, 8]}

In [12]:
df_order = pd.DataFrame(my_dict, index=list('rows'), columns=list('cba'))

df_order

Unnamed: 0,c,b,a
r,5,1,10
o,6,2,20
w,7,3,30
s,8,4,40


<br>

- parameter `columns=[]`: passing a `list` of column names that are not in the data, or more names than the data has

    - A column is created for each new name and filled with `NaN` values

In [13]:
new_df = pd.DataFrame(my_dict, columns=list('abcd'))

new_df

Unnamed: 0,a,b,c,d
0,10,1,5,
1,20,2,6,
2,30,3,7,
3,40,4,8,


<br>

- Number of rows in the `dict` data (items per `key`) $\ne$ number of items in the `index=[]` list

   \> **ValueError**

In [14]:
my_dict

{'a': [10, 20, 30, 40], 'b': [1, 2, 3, 4], 'c': [5, 6, 7, 8]}

In [15]:
# df_item = pd.DataFrame(my_dict, index=list('qwerty'))

# ValueError: Length of values (4) does not match length of index (6)

## 2. `DataFrame` attributes

- Attributes are accessed without parentheses

    1. `df.index`: returns the row index of the df object

    2. `df.columns`: returns the column index of the df object

    3. `df.axes`: returns a list holding the row index and the column index

    4. `df.values`: returns the data (values) of the df object as a 2-dimensional array

    5. `df.dtypes`: returns the data type of each column

    6. `df.size`: returns the total number of data items (rows × columns)

    7. `df.shape`: returns the shape of the df object as `(number of rows, number of columns)`

    8. `df.T`: returns the transposed DataFrame (rows and columns swapped)

<br>

- `dict` to `DataFrame`

- Population inflow by region and year

In [16]:
pop_data = {'서울':[150, 180, 300],
            '경기':[200, 240, 450],
            '충청':[-10, 3, -13],
            '경상':[10, 20, 30],
            '전라':[5, 6, 7]
           }

pop_sample = pd.DataFrame(pop_data)
pop_sample

Unnamed: 0,서울,경기,충청,경상,전라
0,150,200,-10,10,5
1,180,240,3,20,6
2,300,450,-13,30,7


<br>

`DataFrame` attributes

1. `df.index`: returns the row index of the df object

    - Setting the `row index`: assign a `list` with as many items as there are rows

In [17]:
year_list= [2016, 2017, 2018]
pop_sample.index = year_list

pop_sample

Unnamed: 0,서울,경기,충청,경상,전라
2016,150,200,-10,10,5
2017,180,240,3,20,6
2018,300,450,-13,30,7


<br>

- Naming the `row index`: `df.index.name = 'name'`

In [18]:
pop_sample.index.name = 'year'

pop_sample

Unnamed: 0_level_0,서울,경기,충청,경상,전라
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2016,150,200,-10,10,5
2017,180,240,3,20,6
2018,300,450,-13,30,7


<br>

`DataFrame` attributes

2. `df.columns`: returns the `columns index`

In [19]:
pop_sample.columns

Index(['서울', '경기', '충청', '경상', '전라'], dtype='object')

<br>

- Naming the `columns index`: `df.columns.name = 'name'`

In [20]:
pop_sample.columns.name = 'location'

pop_sample

location,서울,경기,충청,경상,전라
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2016,150,200,-10,10,5
2017,180,240,3,20,6
2018,300,450,-13,30,7


<br>

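- Modifying the row index

- Using the `df.index` attribute

    1. Assign a `list` whose length equals the number of rows

    2. The index object behind `df.index` is immutable, so a single item cannot be changed; assign the whole index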

In [21]:
pop_sample.index = [1998, 1999, 2000]

pop_sample

location,서울,경기,충청,경상,전라
1998,150,200,-10,10,5
1999,180,240,3,20,6
2000,300,450,-13,30,7


<br>

**Index modification**

- DataFrame method: `df.rename(data, axis=0)`

    - `axis`: default `axis=0` = row index = `df.index`

        - To modify the column index: `axis=1` or `axis='columns'`

    - data: a `dict` of the form `{'old label': 'new label'}`

    - optional parameter `inplace=`: default `inplace=False` returns the result and leaves the original unchanged

        - `inplace=True` > the change is applied directly to the original

In [22]:
pop_sample.rename({1998:1990})

location,서울,경기,충청,경상,전라
1990,150,200,-10,10,5
1999,180,240,3,20,6
2000,300,450,-13,30,7


In [23]:
pop_sample

location,서울,경기,충청,경상,전라
1998,150,200,-10,10,5
1999,180,240,3,20,6
2000,300,450,-13,30,7


<br>

- Changing the column index: `axis=1` or `axis='columns'`

- `inplace=False` (default): the original is not changed

In [24]:
pop_sample.rename({"전라":"제주"}, axis=1)

location,서울,경기,충청,경상,제주
1998,150,200,-10,10,5
1999,180,240,3,20,6
2000,300,450,-13,30,7
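
`rename` also accepts the keyword arguments `index=` and `columns=`, which avoids the `axis` parameter and allows both axes to be renamed in one call. The sketch below uses a reduced, hypothetical copy of the population frame (`pop`) so that it runs on its own.

```python
import pandas as pd

# Reduced stand-in for pop_sample; only two columns and two rows.
pop = pd.DataFrame({"서울": [150, 180], "전라": [5, 6]}, index=[1998, 1999])

# Rename a row label and a column label in a single call.
renamed = pop.rename(index={1998: 1990}, columns={"전라": "제주"})

# With the default inplace=False the original keeps its labels.
columns_before = list(pop.columns)

# inplace=True modifies pop itself.
pop.rename(columns={"전라": "제주"}, inplace=True)
columns_after = list(pop.columns)

print(renamed)
print(columns_before, columns_after)
```

Assigning the returned frame (`pop = pop.rename(...)`) has the same effect as `inplace=True` and tends to be easier to follow in longer pipelines.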


<br>

`DataFrame` attributes

3. `df.axes`: returns a list holding the row index and the column index of the df object

    - returned list: first item = row index, second item = columns index

In [25]:
pop_sample.axes

[Int64Index([1998, 1999, 2000], dtype='int64'),
 Index(['서울', '경기', '충청', '경상', '전라'], dtype='object', name='location')]

<br>

- `df.reset_index(drop=False)`: resets the row index to the default RangeIndex

- optional parameter: `drop`

    - default `drop=False`: the existing row index is moved into a new column (named "index" when the index has no name), and a RangeIndex `int index` is created

    - `drop=True`: the existing row index is discarded, and a RangeIndex `int index` is created


- `drop` parameter documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html?highlight=reset_index

In [26]:
pop_sample.reset_index()

location,index,서울,경기,충청,경상,전라
0,1998,150,200,-10,10,5
1,1999,180,240,3,20,6
2,2000,300,450,-13,30,7


In [27]:
pop_sample

location,서울,경기,충청,경상,전라
1998,150,200,-10,10,5
1999,180,240,3,20,6
2000,300,450,-13,30,7


In [28]:
pop_sample.reset_index(drop=True)

location,서울,경기,충청,경상,전라
0,150,200,-10,10,5
1,180,240,3,20,6
2,300,450,-13,30,7


In [29]:
pop_sample

location,서울,경기,충청,경상,전라
1998,150,200,-10,10,5
1999,180,240,3,20,6
2000,300,450,-13,30,7


<br>

`DataFrame` object attributes

4. `df.values`: returns the data (values) of the object as a 2-dimensional array

In [30]:
pop_sample.values

array([[150, 200, -10,  10,   5],
       [180, 240,   3,  20,   6],
       [300, 450, -13,  30,   7]], dtype=int64)

<br>

`DataFrame` object attributes

5. `df.dtypes`: returns the data type of each column of the df object

In [31]:
pop_sample.dtypes

location
서울    int64
경기    int64
충청    int64
경상    int64
전라    int64
dtype: object

<br>

`DataFrame` object attributes

6. `df.size`: returns the total number of data items in the df object

In [32]:
pop_sample.size

15

<br>

- `len(df)`: returns only the length of the first dimension (`axis=0`), i.e. the number of rows

In [33]:
len(pop_sample)

3

<br>

`DataFrame` object attributes

7. `df.shape`: returns the shape of the df object as `(number of rows, number of columns)`

In [34]:
pop_sample.shape

(3, 5)

<br>

`DataFrame` object attributes

8. `df.T`: returns the transposed DataFrame (rows and columns swapped)

In [35]:
trans_df = pop_sample.T

trans_df

Unnamed: 0_level_0,1998,1999,2000
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
서울,150,180,300
경기,200,240,450
충청,-10,3,-13
경상,10,20,30
전라,5,6,7


In [36]:
trans_df.index

Index(['서울', '경기', '충청', '경상', '전라'], dtype='object', name='location')

In [37]:
trans_df.columns

Int64Index([1998, 1999, 2000], dtype='int64')

## 3. Indexing

- default: column indexing, returns a `Series`

    - `df[col]`

    - `df.col`

    - `df.get(col)`

- row indexing

    - `df.iloc[idx]`: integer position (`int index`)

    - `df.loc[label]`: the assigned `label index`; use `loc` whenever rows are selected by label

<br>

- Three ways to look up the "서울" column

1. Basic indexing operator: `df[col_name]`

2. `df.col_name` > works only when `col_name` is a valid Python identifier that does not clash with an existing DataFrame attribute or method

3. DataFrame method: `df.get(col_name)`

In [38]:
pop_sample['서울']

1998    150
1999    180
2000    300
Name: 서울, dtype: int64

In [39]:
pop_sample.서울

1998    150
1999    180
2000    300
Name: 서울, dtype: int64

In [40]:
pop_sample.get("서울")

1998    150
1999    180
2000    300
Name: 서울, dtype: int64
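
The three lookups behave differently at the edges. The sketch below is an illustration with a made-up frame whose column names (`size`, `unit price`) were chosen to trigger those edge cases; none of these names come from the original notebook.

```python
import pandas as pd

# Hypothetical frame: one column name clashes with a DataFrame attribute,
# the other contains a space and so is not a valid identifier.
df = pd.DataFrame({"size": [1, 2, 3], "unit price": [10, 20, 30]})

# Attribute access resolves to DataFrame.size (number of cells), not the column.
attr_size = df.size

# Bracket indexing always returns the column.
col_size = df["size"]
col_price = df["unit price"]

# get() returns None (or a chosen default) instead of raising KeyError.
missing = df.get("no_such_column")
fallback = df.get("no_such_column", default="n/a")

print(attr_size)
print(col_size.tolist(), col_price.tolist())
print(missing, fallback)
```

For that reason bracket indexing is the safest default, and `get` is handy when a column may or may not exist.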

In [41]:
pop_sample

location,서울,경기,충청,경상,전라
1998,150,200,-10,10,5
1999,180,240,3,20,6
2000,300,450,-13,30,7


<br>

- `df.iloc[idx]`: extracts the row at position idx

- `df.loc[label_index]`: extracts the row matching the `label index`

    - return value: `Series`

    - `Series` name: the `label index` of the referenced row in the original DataFrame

In [42]:
print(type(pop_sample.iloc[0]))
print(pop_sample.iloc[0].name)

pop_sample.iloc[0]

<class 'pandas.core.series.Series'>
1998


location
서울    150
경기    200
충청    -10
경상     10
전라      5
Name: 1998, dtype: int64

<br>

- Look up a row by `label index`: `df.loc[label_index]`

In [43]:
pop_sample.loc[1999]

location
서울    180
경기    240
충청      3
경상     20
전라      6
Name: 1999, dtype: int64

<br>

- Looking up several columns: pass the column names as a list inside the brackets

    - `df[[col_name1, col_name2, ...]]`: returns a `DataFrame`

In [44]:
pop_sample[['서울', '경기']]

location,서울,경기
1998,150,200
1999,180,240
2000,300,450


<br>

- Several columns + a single row > returns a `Series`

In [45]:
pop_sample[['경기', '경상']].loc[1999]

location
경기    240
경상     20
Name: 1999, dtype: int64

In [46]:
pop_sample.loc[1999][['경기', '경상']]

location
경기    240
경상     20
Name: 1999, dtype: int64

<br>

- Looking up two or more rows: `df.loc[[row1, row2, ...]]` > returns a `DataFrame`

In [47]:
pop_sample.loc[[1998, 2000]]

location,서울,경기,충청,경상,전라
1998,150,200,-10,10,5
2000,300,450,-13,30,7


In [48]:
pop_sample.loc[[1998, 2000]]['충청']

1998   -10
2000   -13
Name: 충청, dtype: int64

In [49]:
pop_sample['충청'].loc[[1998, 2000]]

1998   -10
2000   -13
Name: 충청, dtype: int64

<br>

- Indexing with a `list` (double brackets), even with a single item > returns a `DataFrame`

In [50]:
pop_sample['충청']

1998   -10
1999     3
2000   -13
Name: 충청, dtype: int64

In [51]:
pop_sample[['충청']]

location,충청
1998,-10
1999,3
2000,-13


In [52]:
pop_sample['충청'].loc[1999]

3

In [53]:
pop_sample[['충청']].loc[[1999]]

location,충청
1999,3


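## 4. Slicing

- row slicing

    - `df[start:stop]` slices rows, so rows can be sliced on their own

    - Basic slicing syntax: integer slices are applied by position (`int index`)

        - `int index` slicing: does not include the `stop index`

        - `label index` slicing: includes the `stop index`

- column slicing

    - `df[...]` with a slice never selects columns, so columns cannot be sliced on their own with plain brackets

    - Combine with a row selection: `df.loc[:, start:stop]` slices columns by `label index`, `df.iloc[:, start:stop]` by integer position

        - `label index` slicing (`loc`): includes the `stop index`

<br>

- row slicing: `df[start:stop:step]`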

In [54]:
pop_sample[0:2]

location,서울,경기,충청,경상,전라
1998,150,200,-10,10,5
1999,180,240,3,20,6


In [55]:
pop_sample[0:3:2]

location,서울,경기,충청,경상,전라
1998,150,200,-10,10,5
2000,300,450,-13,30,7


In [56]:
# An integer slice on an integer index is positional, so :2000 simply covers every row
pop_sample[:2000]

location,서울,경기,충청,경상,전라
1998,150,200,-10,10,5
1999,180,240,3,20,6
2000,300,450,-13,30,7


In [57]:
pop_sample[::-1]

location,서울,경기,충청,경상,전라
2000,300,450,-13,30,7
1999,180,240,3,20,6
1998,150,200,-10,10,5
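
Because plain-bracket slicing with integers can be ambiguous when the row labels are themselves integers, it may help to state the intent explicitly with `iloc` (position, stop excluded) or `loc` (label, stop included). The sketch below uses a reduced, hypothetical copy of the population frame named `pop`.

```python
import pandas as pd

# Reduced stand-in for pop_sample with integer year labels.
pop = pd.DataFrame(
    {"서울": [150, 180, 300], "경기": [200, 240, 450], "충청": [-10, 3, -13]},
    index=[1998, 1999, 2000],
)

# Positional slice: rows at positions 0 and 1; position 2 is excluded.
rows_by_position = pop.iloc[0:2]

# Label slice: both endpoints, 1998 and 1999, are included.
rows_by_label = pop.loc[1998:1999]

# Column slices follow the same rules on the second axis.
cols_by_label = pop.loc[:, "서울":"경기"]   # includes 경기
cols_by_position = pop.iloc[:, 0:1]         # excludes position 1

print(list(rows_by_position.index), list(rows_by_label.index))
print(list(cols_by_label.columns), list(cols_by_position.columns))
```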


<br>

- column slicing

    - Apply a column slice together with a row selection

    - `df.loc[:, start:stop:step]`: uses the `label index` only, not integer positions (use `df.iloc` for those)

In [58]:
pop_sample.loc[:, '서울':'경기']

location,서울,경기
1998,150,200
1999,180,240
2000,300,450


<br>

- `int index` slicing

    - Chained slices `df[:a][:b]` both act on rows; to slice rows and columns by position use `df.iloc[:stop_row, :stop_column]`

In [59]:
zr_data = np.zeros((4, 4))

zr_data

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [60]:
zr_df = pd.DataFrame(zr_data)

zr_df

Unnamed: 0,0,1,2,3
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0


In [61]:
zr_df[:3][:2]

Unnamed: 0,0,1,2,3
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0


In [62]:
zr_df.iloc[:3, :2]

Unnamed: 0,0,1
0,0.0,0.0
1,0.0,0.0
2,0.0,0.0


### > Exercises

1. Create the DataFrame shown below and reproduce the output exactly.

![df_prac](https://github.com/insung-ethan-j/Numpy_and_Pandas/blob/b779370a58187754994887cca639105a8b7ff6c6/img/df_practice1.PNG?raw=true)

In [63]:
Col1 = pd.Series([0, 3, 'ks01', 2, 5])
Col2 = pd.Series(["big", "data", "is", "very", "good"])
Col3 = pd.Series([2.7, -5.0, 2.12, 8.31, -1.34])
Col4 = pd.Series([True, True, False, False, True])

col_list = [Col1, Col2, Col3, Col4]

In [64]:
df_prac = pd.DataFrame({'Col1':col_list[0], 'Col2':col_list[1],
                       'Col3':col_list[2], 'Col4':col_list[3]})

df_prac

Unnamed: 0,Col1,Col2,Col3,Col4
0,0,big,2.7,True
1,3,data,-5.0,True
2,ks01,is,2.12,False
3,2,very,8.31,False
4,5,good,-1.34,True


In [65]:
df_prac.index = list("ABCDE")

df_prac

Unnamed: 0,Col1,Col2,Col3,Col4
A,0,big,2.7,True
B,3,data,-5.0,True
C,ks01,is,2.12,False
D,2,very,8.31,False
E,5,good,-1.34,True


<br>

2. Extract 'Col1' and 'Col3' together

In [66]:
df_prac[['Col1', 'Col3']]

Unnamed: 0,Col1,Col3
A,0,2.7
B,3,-5.0
C,ks01,2.12
D,2,8.31
E,5,-1.34


<br>

3. Extract rows 'A', 'C', 'D'

In [67]:
df_prac.loc[['A', 'C', 'D']]

Unnamed: 0,Col1,Col2,Col3,Col4
A,0,big,2.7,True
C,ks01,is,2.12,False
D,2,very,8.31,False


<br>

4. Extract columns 'Col1', 'Col2' for rows 'B', 'D'

In [68]:
df_prac[['Col1', 'Col2']].loc[['B', 'D']]

Unnamed: 0,Col1,Col2
B,3,data
D,2,very


## 5. Adding and modifying columns and rows

- Adding or modifying columns: column indexing, `df['column']`

    1. scalar value

    2. `ndarray`, `list` (number of items must match the number of rows)

    3. operation between columns (a col $\pm$ b col = c col)

    4. passing a `Series` object



- Adding or modifying rows: row indexing, `df.loc['row']`

    1. scalar value

    2. `ndarray`, `list`, `dict` (number of items must match the number of columns)

    3. operation between rows



- The meaning of rows and columns in data analysis

    - column: variable (characteristic)

    - row: the data of one observation (`record`)

> Adding or removing the variables (columns) that make up a dataset happens frequently,   
but adding or removing row data (records) by a specific index does not.   
Adding or deleting records in the middle of data processing is not recommended.

### 5.1. Adding columns

 1. Add a column that has the same value for every row: scalar value (single value)

- `df['column'] = scalar value`

In [69]:
pop_sample['제주'] = 1

pop_sample

location,서울,경기,충청,경상,전라,제주
1998,150,200,-10,10,5,1
1999,180,240,3,20,6,1
2000,300,450,-13,30,7,1


<br>

2. Add a column whose rows hold different values: `df['column'] = list or ndarray`

- Condition: the length of the `ndarray` or `list` must match the number of rows

In [70]:
print(len(pop_sample))

pop_sample['부산'] = np.random.randint(1, 10, len(pop_sample))

pop_sample

3


location,서울,경기,충청,경상,전라,제주,부산
1998,150,200,-10,10,5,1,5
1999,180,240,3,20,6,1,5
2000,300,450,-13,30,7,1,2


<br>

3. Add a column from an operation between columns: a `derived variable`

In [71]:
pop_sample['수도권'] = pop_sample['서울'] + pop_sample['경기']

pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권
1998,150,200,-10,10,5,1,5,350
1999,180,240,3,20,6,1,5,420
2000,300,450,-13,30,7,1,2,750


<br>

4. Passing a `Series` object

- Condition: check the length (number of items) of the target `DataFrame` and of the `Series` being added

- Data in the `Series` is mapped to the `DataFrame` by `label index`

    - The length of the `Series` does not have to match the length of the target `DataFrame`.

        - Row labels missing from the `Series` are filled with `NaN`

In [72]:
pop_sr = pd.Series([10, -10], index=[1998, 2000])

pop_sr

1998    10
2000   -10
dtype: int64

In [73]:
pop_sample["강원"] = pop_sr

pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권,강원
1998,150,200,-10,10,5,1,5,350,10.0
1999,180,240,3,20,6,1,5,420,
2000,300,450,-13,30,7,1,2,750,-10.0


<br>

- Even when the lengths match, mapping is done by `label index`: row labels missing from the `Series` > filled with `NaN`

In [74]:
no_label_sr = pd.Series([100, 200, 300])

no_label_sr

0    100
1    200
2    300
dtype: int64

In [75]:
pop_sample['test'] = no_label_sr

pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권,강원,test
1998,150,200,-10,10,5,1,5,350,10.0,
1999,180,240,3,20,6,1,5,420,,
2000,300,450,-13,30,7,1,2,750,-10.0,
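
When the values of a `Series` should be assigned by position rather than by label, one option is to strip the labels first with `Series.to_numpy()`, which turns the assignment into the plain `ndarray` case described earlier. The following is a minimal sketch on a reduced, hypothetical frame `df`.

```python
import pandas as pd

# Reduced stand-in for pop_sample with year labels.
df = pd.DataFrame({"서울": [150, 180, 300]}, index=[1998, 1999, 2000])

# A Series with the default labels 0, 1, 2: none of them match 1998..2000.
values = pd.Series([100, 200, 300])

# Label alignment finds no common labels, so the column is all NaN.
df["aligned"] = values

# Dropping the labels assigns the values in order, like a list or ndarray.
df["by_position"] = values.to_numpy()

print(df)
```

Positional assignment silently trusts the row order, so it is only appropriate when the `Series` is known to be in the same order as the frame.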


### 5.2. Adding rows

- Adding a row: row indexing

    1. scalar value

    2. operation between rows

    3. `ndarray`, `list`, `dict`: the number of items must match the number of columns

<br>

- Adding a row

    1. scalar value: `df.loc[idx] = scalar`, same as adding a column

In [76]:
pop_sample.loc[2001] = 0

pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권,강원,test
1998,150,200,-10,10,5,1,5,350,10.0,
1999,180,240,3,20,6,1,5,420,,
2000,300,450,-13,30,7,1,2,750,-10.0,
2001,0,0,0,0,0,0,0,0,0.0,0.0


In [77]:
pop_sample.shape

(4, 10)

<br>

- Adding a row

    2. data values: `ndarray`, `list`, `dict`

        - The number of data items must match the number of columns

<br>

- Adding with an `ndarray`

In [78]:
pop_sample.loc[2002] = np.random.randint(-100, 100, 10)

pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권,강원,test
1998,150,200,-10,10,5,1,5,350,10.0,
1999,180,240,3,20,6,1,5,420,,
2000,300,450,-13,30,7,1,2,750,-10.0,
2001,0,0,0,0,0,0,0,0,0.0,0.0
2002,-90,49,3,-77,-34,7,11,21,83.0,50.0


<br>

- Adding with a `dict`: `{'key': value, ...}` > `{'column': value, ...}`, so a value can be set per column

In [79]:
pop_sample.loc[2003] = {'서울':10, '경기':20, '충청':40, '경상':21, '전라':37,
                   '제주':103, '부산':28, '수도권':30, '강원':15, 'test':0}

pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권,강원,test
1998,150,200,-10,10,5,1,5,350,10.0,
1999,180,240,3,20,6,1,5,420,,
2000,300,450,-13,30,7,1,2,750,-10.0,
2001,0,0,0,0,0,0,0,0,0.0,0.0
2002,-90,49,3,-77,-34,7,11,21,83.0,50.0
2003,10,20,40,21,37,103,28,30,15.0,0.0


<br>

- Adding with a `list`

In [80]:
pop_sample.loc[2004] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]

pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권,강원,test
1998,150,200,-10,10,5,1,5,350,10.0,
1999,180,240,3,20,6,1,5,420,,
2000,300,450,-13,30,7,1,2,750,-10.0,
2001,0,0,0,0,0,0,0,0,0.0,0.0
2002,-90,49,3,-77,-34,7,11,21,83.0,50.0
2003,10,20,40,21,37,103,28,30,15.0,0.0
2004,1,2,3,4,5,6,7,8,9.0,0.0


<br>

- Adding a row

    3. operation between rows

In [81]:
pop_sample.loc[2005] = pop_sample.loc[2002] * pop_sample.loc[2004]

pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권,강원,test
1998,150.0,200.0,-10.0,10.0,5.0,1.0,5.0,350.0,10.0,
1999,180.0,240.0,3.0,20.0,6.0,1.0,5.0,420.0,,
2000,300.0,450.0,-13.0,30.0,7.0,1.0,2.0,750.0,-10.0,
2001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2002,-90.0,49.0,3.0,-77.0,-34.0,7.0,11.0,21.0,83.0,50.0
2003,10.0,20.0,40.0,21.0,37.0,103.0,28.0,30.0,15.0,0.0
2004,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,0.0
2005,-90.0,98.0,9.0,-308.0,-170.0,42.0,77.0,168.0,747.0,0.0


<br>

## 6. Deleting rows and columns

- Deleting columns

    1. `del` + column indexing `df['column']`

    2. `df.drop('column', axis=1)`

    3. `df.drop(columns='column')`



- Deleting rows

    - `df.drop(idx)`: default `axis=0`

In [82]:
pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권,강원,test
1998,150.0,200.0,-10.0,10.0,5.0,1.0,5.0,350.0,10.0,
1999,180.0,240.0,3.0,20.0,6.0,1.0,5.0,420.0,,
2000,300.0,450.0,-13.0,30.0,7.0,1.0,2.0,750.0,-10.0,
2001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2002,-90.0,49.0,3.0,-77.0,-34.0,7.0,11.0,21.0,83.0,50.0
2003,10.0,20.0,40.0,21.0,37.0,103.0,28.0,30.0,15.0,0.0
2004,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,0.0
2005,-90.0,98.0,9.0,-308.0,-170.0,42.0,77.0,168.0,747.0,0.0


<br>

### 6.1. Deleting columns

- Deleting columns

    1. `del` + `df['column']`

        - The original DataFrame is modified

In [83]:
del pop_sample['test']

pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권,강원
1998,150.0,200.0,-10.0,10.0,5.0,1.0,5.0,350.0,10.0
1999,180.0,240.0,3.0,20.0,6.0,1.0,5.0,420.0,
2000,300.0,450.0,-13.0,30.0,7.0,1.0,2.0,750.0,-10.0
2001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2002,-90.0,49.0,3.0,-77.0,-34.0,7.0,11.0,21.0,83.0
2003,10.0,20.0,40.0,21.0,37.0,103.0,28.0,30.0,15.0
2004,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
2005,-90.0,98.0,9.0,-308.0,-170.0,42.0,77.0,168.0,747.0


<br>

- Deleting columns

    2. `df.drop('column', axis=1)`

        - The original is not modified unless the optional parameter `inplace=True` is set

In [84]:
pop_sample.drop('강원', axis=1)

location,서울,경기,충청,경상,전라,제주,부산,수도권
1998,150.0,200.0,-10.0,10.0,5.0,1.0,5.0,350.0
1999,180.0,240.0,3.0,20.0,6.0,1.0,5.0,420.0
2000,300.0,450.0,-13.0,30.0,7.0,1.0,2.0,750.0
2001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2002,-90.0,49.0,3.0,-77.0,-34.0,7.0,11.0,21.0
2003,10.0,20.0,40.0,21.0,37.0,103.0,28.0,30.0
2004,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0
2005,-90.0,98.0,9.0,-308.0,-170.0,42.0,77.0,168.0


In [85]:
pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권,강원
1998,150.0,200.0,-10.0,10.0,5.0,1.0,5.0,350.0,10.0
1999,180.0,240.0,3.0,20.0,6.0,1.0,5.0,420.0,
2000,300.0,450.0,-13.0,30.0,7.0,1.0,2.0,750.0,-10.0
2001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2002,-90.0,49.0,3.0,-77.0,-34.0,7.0,11.0,21.0,83.0
2003,10.0,20.0,40.0,21.0,37.0,103.0,28.0,30.0,15.0
2004,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
2005,-90.0,98.0,9.0,-308.0,-170.0,42.0,77.0,168.0,747.0


<br>

- Deleting columns

    3. `df.drop(columns='column')`

        - The original is not modified unless the optional parameter `inplace=True` is set

In [86]:
pop_sample.drop(columns='강원', inplace=True)

pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권
1998,150.0,200.0,-10.0,10.0,5.0,1.0,5.0,350.0
1999,180.0,240.0,3.0,20.0,6.0,1.0,5.0,420.0
2000,300.0,450.0,-13.0,30.0,7.0,1.0,2.0,750.0
2001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2002,-90.0,49.0,3.0,-77.0,-34.0,7.0,11.0,21.0
2003,10.0,20.0,40.0,21.0,37.0,103.0,28.0,30.0
2004,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0
2005,-90.0,98.0,9.0,-308.0,-170.0,42.0,77.0,168.0


<br>

### 6.2. Deleting rows

- Deleting rows

    - `df.drop('row')`: default `axis=0`

        - The original is not modified unless `inplace=True` is set

In [87]:
pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권
1998,150.0,200.0,-10.0,10.0,5.0,1.0,5.0,350.0
1999,180.0,240.0,3.0,20.0,6.0,1.0,5.0,420.0
2000,300.0,450.0,-13.0,30.0,7.0,1.0,2.0,750.0
2001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2002,-90.0,49.0,3.0,-77.0,-34.0,7.0,11.0,21.0
2003,10.0,20.0,40.0,21.0,37.0,103.0,28.0,30.0
2004,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0
2005,-90.0,98.0,9.0,-308.0,-170.0,42.0,77.0,168.0


In [88]:
pop_sample.drop(2004, inplace=True)

pop_sample

location,서울,경기,충청,경상,전라,제주,부산,수도권
1998,150.0,200.0,-10.0,10.0,5.0,1.0,5.0,350.0
1999,180.0,240.0,3.0,20.0,6.0,1.0,5.0,420.0
2000,300.0,450.0,-13.0,30.0,7.0,1.0,2.0,750.0
2001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2002,-90.0,49.0,3.0,-77.0,-34.0,7.0,11.0,21.0
2003,10.0,20.0,40.0,21.0,37.0,103.0,28.0,30.0
2005,-90.0,98.0,9.0,-308.0,-170.0,42.0,77.0,168.0


<br>

### 6.3. Deleting two or more columns or rows

- To delete two or more columns or rows, pass them together in a `list`

In [89]:
pop_sample.drop(['제주', '수도권'], axis=1, inplace=True)

pop_sample

location,서울,경기,충청,경상,전라,부산
1998,150.0,200.0,-10.0,10.0,5.0,5.0
1999,180.0,240.0,3.0,20.0,6.0,5.0
2000,300.0,450.0,-13.0,30.0,7.0,2.0
2001,0.0,0.0,0.0,0.0,0.0,0.0
2002,-90.0,49.0,3.0,-77.0,-34.0,11.0
2003,10.0,20.0,40.0,21.0,37.0,28.0
2005,-90.0,98.0,9.0,-308.0,-170.0,77.0


In [90]:
pop_sample.drop([2003, 2005], inplace=True)

pop_sample

location,서울,경기,충청,경상,전라,부산
1998,150.0,200.0,-10.0,10.0,5.0,5.0
1999,180.0,240.0,3.0,20.0,6.0,5.0
2000,300.0,450.0,-13.0,30.0,7.0,2.0
2001,0.0,0.0,0.0,0.0,0.0,0.0
2002,-90.0,49.0,3.0,-77.0,-34.0,11.0


## 7. Arithmetic operations between `DataFrame`s

- Operation between `DataFrame` and `DataFrame`

    - The result holds the union of the row and column labels, sorted

    - The operation is performed by matching column index and row index

    - Labels not common to both operands: `NaN` is returned

    - `fill_value=value`: substitutes `value` for a label missing from one operand before the operation

<br>

- Operators

    1. add: `+`, `df1.add(df2)`

    2. subtract: `-`, `df1.sub(df2)`

    3. multiply: `*`, `df1.mul(df2)`

    4. divide: `/`, `df1.div(df2)`

    5. floor division: `//`, `df1.floordiv(df2)`

    6. modulo: `%`, `df1.mod(df2)`

In [91]:
op_df = pd.DataFrame(np.random.randint(1, 10, 9).reshape(3, 3),
                    index=list('abc'),
                    columns=['서울', '경기', '인천'])

op_df

Unnamed: 0,서울,경기,인천
a,1,6,6
b,3,7,2
c,5,4,7


In [92]:
nd_df = pd.DataFrame(np.random.randint(1, 10, (20)).reshape(4, 5),
                    columns=['서울', '경기', '인천', '대전', '부산'],
                    index=list('abcd'))

nd_df

Unnamed: 0,서울,경기,인천,대전,부산
a,4,9,4,8,5
b,5,5,9,2,3
c,5,7,6,9,3
d,8,8,2,5,9


<br>

1. add: `+`, `df1.add(df2)`

    - result: only the common index (row, column) is computed normally; everything else returns `NaN`

In [93]:
op_df + nd_df

Unnamed: 0,경기,대전,부산,서울,인천
a,15.0,,,5.0,10.0
b,12.0,,,8.0,11.0
c,11.0,,,10.0,13.0
d,,,,,


<br>

- `fill_value=value` optional parameter: fills in data missing from one operand with `value`

In [94]:
op_df.add(nd_df, fill_value=0)

Unnamed: 0,경기,대전,부산,서울,인천
a,15.0,8.0,5.0,5.0,10.0
b,12.0,2.0,3.0,8.0,11.0
c,11.0,9.0,3.0,10.0,13.0
d,8.0,5.0,9.0,8.0,2.0
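
In the example above every cell exists in at least one of the two frames, so `fill_value` removes all `NaN`s. When a cell is missing from both operands, the result stays `NaN` even with `fill_value`. The sketch below shows this with two small frames, `left` and `right`, that are made up for illustration.

```python
import pandas as pd

# Two hypothetical frames that share only the row label "b"
# and have no column in common.
left = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"y": [10, 20]}, index=["b", "c"])

total = left.add(right, fill_value=0)

# ("a", "x") exists only in left, so right contributes the fill value 0.
# ("a", "y") and ("c", "x") exist in neither frame, so they remain NaN.
print(total)
```

A trailing `.fillna(...)` on the result is one way to handle the cells that are missing on both sides, if that matches the intended meaning of the data.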


<br>

2. subtract: `-`, `df1.sub(df2)`

In [95]:
op_df - nd_df

Unnamed: 0,경기,대전,부산,서울,인천
a,-3.0,,,-3.0,2.0
b,2.0,,,-2.0,-7.0
c,-3.0,,,0.0,1.0
d,,,,,


In [96]:
op_df.sub(nd_df, fill_value=0)

Unnamed: 0,경기,대전,부산,서울,인천
a,-3.0,-8.0,-5.0,-3.0,2.0
b,2.0,-2.0,-3.0,-2.0,-7.0
c,-3.0,-9.0,-3.0,0.0,1.0
d,-8.0,-5.0,-9.0,-8.0,-2.0


<br>

- Operation between specific rows

In [97]:
op_df.loc[['a', 'c']] - nd_df.loc[['a', 'c']]

Unnamed: 0,경기,대전,부산,서울,인천
a,-3,,,-3,2
c,-3,,,0,1


<br>

3. multiply: `*`, `df1.mul(df2)`

In [98]:
op_df * nd_df

Unnamed: 0,경기,대전,부산,서울,인천
a,54.0,,,4.0,24.0
b,35.0,,,15.0,18.0
c,28.0,,,25.0,42.0
d,,,,,


In [99]:
op_df.mul(nd_df, fill_value=1)

Unnamed: 0,경기,대전,부산,서울,인천
a,54.0,8.0,5.0,4.0,24.0
b,35.0,2.0,3.0,15.0,18.0
c,28.0,9.0,3.0,25.0,42.0
d,8.0,5.0,9.0,8.0,2.0


<br>

4. divide: `/`, `df1.div(df2)`

In [100]:
op_df / nd_df

Unnamed: 0,경기,대전,부산,서울,인천
a,0.666667,,,0.25,1.5
b,1.4,,,0.6,0.222222
c,0.571429,,,1.0,1.166667
d,,,,,


In [101]:
op_df.div(nd_df, fill_value=1)

Unnamed: 0,경기,대전,부산,서울,인천
a,0.666667,0.125,0.2,0.25,1.5
b,1.4,0.5,0.333333,0.6,0.222222
c,0.571429,0.111111,0.333333,1.0,1.166667
d,0.125,0.2,0.111111,0.125,0.5


<br>

5. floor division: `//`, `df1.floordiv(df2)`

In [102]:
op_df.floordiv(nd_df, fill_value=1)

Unnamed: 0,경기,대전,부산,서울,인천
a,0.0,0.0,0.0,0.0,1.0
b,1.0,0.0,0.0,0.0,0.0
c,0.0,0.0,0.0,1.0,1.0
d,0.0,0.0,0.0,0.0,0.0


<br>

6. modulo: `%`, `df1.mod(df2)`

In [103]:
op_df.mod(nd_df, fill_value=1)

Unnamed: 0,경기,대전,부산,서울,인천
a,6.0,1.0,1.0,1.0,2.0
b,2.0,1.0,1.0,3.0,2.0
c,4.0,1.0,1.0,0.0,1.0
d,1.0,1.0,1.0,1.0,1.0


<br>

## 8. Arithmetic operations between `DataFrame` and `Series`

- Basic logic: the `Series` object's `row label index` is mapped to the `DataFrame` object's `column label index` > broadcasting occurs across every row

- Labels not common to both: `NaN` is returned

- When using the operation methods: the `axis=axis` parameter selects the axis whose labels are matched against the `Series` (0: row index, 1: column index)



- Operators

    1. add: `+`, `.add()`

    2. subtract: `-`, `.sub()`

    3. multiply: `*`, `.mul()`

In [104]:
my_df = pd.DataFrame(np.arange(12).reshape(3, 4),
                    index=[2010, 2011, 2012],
                    columns=list('abcd'))

my_df

Unnamed: 0,a,b,c,d
2010,0,1,2,3
2011,4,5,6,7
2012,8,9,10,11


In [105]:
my_sr = my_df.iloc[0]

my_sr

a    0
b    1
c    2
d    3
Name: 2010, dtype: int32

<br>

- Operation between `DataFrame` and `Series`: common `index name`

    \> maps the DataFrame object's `column label index` to the Series object's `row label index`

    \> the `name` of the `Series` object and the `row index` labels of the `DataFrame` do not matter

    \> the original is not modified; store the returned result if needed

In [106]:
my_df + my_sr

Unnamed: 0,a,b,c,d
2010,0,2,4,6
2011,4,6,8,10
2012,8,10,12,14


In [107]:
my_df

Unnamed: 0,a,b,c,d
2010,0,1,2,3
2011,4,5,6,7
2012,8,9,10,11


In [108]:
my_sr = my_sr.rename(2020)
print(my_sr)

my_df + my_sr

a    0
b    1
c    2
d    3
Name: 2020, dtype: int32


Unnamed: 0,a,b,c,d
2010,0,2,4,6
2011,4,6,8,10
2012,8,10,12,14


<br>

- Operation between `DataFrame` and `Series`

    - mapping DataFrame object `column label index` and Series object `row label index`

In [109]:
zr_df = pd.DataFrame(np.zeros(20).reshape(4,5),
                    columns=list('abcde'))

zr_df

Unnamed: 0,a,b,c,d,e
0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0


In [110]:
no_sr = pd.Series(np.arange(5))

no_sr

0    0
1    1
2    2
3    3
4    4
dtype: int32

In [111]:
zr_df.sub(no_sr)

Unnamed: 0,a,b,c,d,e,0,1,2,3,4
0,,,,,,,,,,
1,,,,,,,,,,
2,,,,,,,,,,
3,,,,,,,,,,


<br>

- Operation between `DataFrame` and `Series` in `axis=0`

    \> mapping DataFrame `row label index` and Series `row label index`

In [112]:
zr_df.sub(no_sr, axis=0)

Unnamed: 0,a,b,c,d,e
0,0.0,0.0,0.0,0.0,0.0
1,-1.0,-1.0,-1.0,-1.0,-1.0
2,-2.0,-2.0,-2.0,-2.0,-2.0
3,-3.0,-3.0,-3.0,-3.0,-3.0
4,,,,,
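
A typical use of the `axis` parameter is centering data: subtracting per-column means uses the default alignment, while subtracting per-row means needs `axis=0`. The sketch below rebuilds a frame equivalent to `my_df` under the hypothetical name `df` so that it is self-contained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=[2010, 2011, 2012],
    columns=list("abcd"),
)

# Series labelled by column names: matches the default column alignment.
col_means = df.mean()
centered_cols = df.sub(col_means)

# Series labelled by row labels: must be matched against the row index.
row_means = df.mean(axis=1)
centered_rows = df.sub(row_means, axis=0)

print(centered_cols)
print(centered_rows)
```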


<br>

- Operation with a Series whose `row label index` contains labels missing from the DataFrame `column label index`

In [113]:
defi_sr = pd.Series([3, 3, 3], index=list('ace'))

defi_sr

a    3
c    3
e    3
dtype: int64

In [114]:
my_df

Unnamed: 0,a,b,c,d
2010,0,1,2,3
2011,4,5,6,7
2012,8,9,10,11


<br>

- Labels that cannot be mapped (not common) > return `NaN`

- Operation between `Series` and `DataFrame`: the `fill_value=value` parameter cannot be used

In [115]:
my_df + defi_sr

Unnamed: 0,a,b,c,d,e
2010,3.0,,5.0,,
2011,7.0,,9.0,,
2012,11.0,,13.0,,


In [116]:
my_df.add(defi_sr)

Unnamed: 0,a,b,c,d,e
2010,3.0,,5.0,,
2011,7.0,,9.0,,
2012,11.0,,13.0,,


In [117]:
my_df.add(defi_sr, fill_value=0)

NotImplementedError: fill_value 0 not supported.
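
Since `fill_value` is rejected here, one possible workaround is to align the `Series` to the frame's columns beforehand with `Series.reindex(..., fill_value=...)`. This is a sketch under the assumption that labels absent from the frame (such as `e`) may be dropped; the name `aligned_sr` is introduced only for this example.

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=[2010, 2011, 2012],
    columns=list("abcd"),
)
defi_sr = pd.Series([3, 3, 3], index=list("ace"))

# Conform the Series to the frame's columns: b and d get 0, e is dropped.
aligned_sr = defi_sr.reindex(my_df.columns, fill_value=0)

result = my_df + aligned_sr

print(aligned_sr)
print(result)
```

The result keeps the frame's original four columns and contains no `NaN`, at the cost of ignoring Series labels the frame does not have.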