<a href="https://colab.research.google.com/github/dudgus1286/pandas/blob/main/05_NaN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### [Reference] <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Pandas Cheat Sheet</a>

https://pandas.pydata.org/docs/user_guide/missing_data.html

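#### NaN (Not a Number): a value that cannot be represented (an empty value)

- NaN: the default representation of a missing value
- NaN is itself a float, so it is handled as a float by default

#### NA (Not Available): a missing value
#### None: the absence of a value (it does not exist, is empty, or is undefined)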
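
As a quick check of the definitions above, the following sketch (all variable names are illustrative and not part of the original notebook) shows that `np.nan` is a Python float, that a numeric Series containing NaN or None is stored as `float64`, and that `pd.isna` recognizes all three markers.

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float
nan_type = type(np.nan)

# Integers mixed with NaN (or None) are upcast to float64
with_nan = pd.Series([1, 2, np.nan])
with_none = pd.Series([1, 2, None])

# pd.isna treats NaN, pd.NA, and None as missing
flags = [pd.isna(np.nan), pd.isna(pd.NA), pd.isna(None)]

print(nan_type, with_nan.dtype, with_none.dtype, flags)
```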

In [1]:
import pandas as pd
import numpy as np

### [Exercise 1]

#### 1) Create a DataFrame that contains missing data

In [3]:
df = pd.DataFrame(
    {
        "name":["Alfred", "Batman", "Catwoman"],
        "toy":[np.nan, "Batmobile", "Bullwhip"],
        "born":[None, pd.Timestamp("19400425"), pd.NA]
    }
)
df

Unnamed: 0,name,toy,born
0,Alfred,,
1,Batman,Batmobile,1940-04-25 00:00:00
2,Catwoman,Bullwhip,


#### 2) Check the data types

In [4]:
df.dtypes

name    object
toy     object
born    object
dtype: object
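
The `born` column is reported as `object` because it mixes `None`, a `Timestamp`, and `pd.NA`. One possible way to get a proper datetime column is `pd.to_datetime`, which should convert the missing entries to `NaT`. This is a sketch on a stand-alone Series (the names `born` and `born_dt` are illustrative), not a step from the original notebook.

```python
import pandas as pd

# Same values as the "born" column in the exercise
born = pd.Series([None, pd.Timestamp("19400425"), pd.NA])

# Convert to a datetime dtype; missing entries become NaT
born_dt = pd.to_datetime(born)

print(born.dtype)
print(born_dt)
```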

#### 3) Handle missing data

**dropna: remove missing values**

In [5]:
df.dropna?

df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

In [8]:
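# how='any' (default): drop a row or column that contains at least one null value
# how='all': drop it only when every value is null
# axis=0 (rows), axis=1 (columns)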

df.dropna(axis=1)

Unnamed: 0,name
0,Alfred
1,Batman
2,Catwoman


In [9]:
# how='any' (default): drop a row if it contains at least one null value
# axis=0 (rows, default), axis=1 (columns)

df.dropna()

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25 00:00:00


In [10]:
# how='all': drop a row only when every value in it is missing

df.dropna(how="all")

Unnamed: 0,name,toy,born
0,Alfred,,
1,Batman,Batmobile,1940-04-25 00:00:00
2,Catwoman,Bullwhip,
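
The `dropna` signature above also lists `subset` and `thresh`, which the cells above do not exercise. The sketch below rebuilds the same frame and tries both. The result names `by_toy` and `at_least_two` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alfred", "Batman", "Catwoman"],
        "toy": [np.nan, "Batmobile", "Bullwhip"],
        "born": [None, pd.Timestamp("19400425"), pd.NA],
    }
)

# subset: only the listed columns are checked when deciding what to drop
by_toy = df.dropna(subset=["toy"])

# thresh: keep rows that have at least this many non-missing values
at_least_two = df.dropna(thresh=2)

print(by_toy.index.tolist())
print(at_least_two.index.tolist())
```

In both cases only Alfred's row is dropped: it is missing `toy`, and it has a single non-missing value.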


**fillna: fill missing values with a chosen value**

In [13]:
df.fillna?

df.fillna(
    value=None,
    method=None,
    axis=None,
    inplace=False,
    limit=None,
    downcast=None,
)

In [14]:
# What should the missing values be filled with?
# value: a scalar, dict, Series, or DataFrame

df.fillna(0)

Unnamed: 0,name,toy,born
0,Alfred,0,0
1,Batman,Batmobile,1940-04-25 00:00:00
2,Catwoman,Bullwhip,0


In [16]:
# Fill with specific values; only entries that are NaN, NaT, NA, or None are filled

values={"name":"noname", "toy":"Bat", "born":pd.Timestamp("19000101")}

In [17]:
df.fillna(values)

Unnamed: 0,name,toy,born
0,Alfred,Bat,1900-01-01
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,1900-01-01
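
The `method` parameter in the `fillna` signature selects forward or backward filling. Recent pandas releases deprecate it in favor of the dedicated `ffill` and `bfill` methods. A minimal sketch on a small illustrative Series (not data from the notebook):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Carry the last valid value forward into the gaps
forward = s.ffill()

# Fill backward, but at most one consecutive missing value
backward_one = s.bfill(limit=1)

print(forward.tolist())
print(backward_one.tolist())
```

With `limit=1`, the first of the two consecutive gaps stays missing, which can help avoid stretching a single observation over a long run of NaN values.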


### [Exercise 2]

In [18]:
student_list = {
    "name":
    ["John", "Nate", "Yuna", "Abraham", "Brian", "Janny", "Nate", "John"],
    "job": [
        "teacher", "teacher", "teacher", "student", "student", "student",
        "teacher", "student"
    ],
    "age": [40, 35, 37, 10, 12, 11, None, None]
}
df = pd.DataFrame(student_list)
df

Unnamed: 0,name,job,age
0,John,teacher,40.0
1,Nate,teacher,35.0
2,Yuna,teacher,37.0
3,Abraham,student,10.0
4,Brian,student,12.0
5,Janny,student,11.0
6,Nate,teacher,
7,John,student,


In [19]:
# Check the total number of rows and columns

df.shape

(8, 3)

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    8 non-null      object 
 1   job     8 non-null      object 
 2   age     6 non-null      float64
dtypes: float64(1), object(2)
memory usage: 320.0+ bytes


In [21]:
df.isnull()

Unnamed: 0,name,job,age
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
6,False,False,True
7,False,False,True


In [22]:
df.isna()

Unnamed: 0,name,job,age
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
6,False,False,True
7,False,False,True
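
`isnull` and `isna` return the same boolean mask. Summing that mask is a common way to count missing values per column. The sketch below rebuilds the student frame so that it runs on its own; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["John", "Nate", "Yuna", "Abraham", "Brian", "Janny", "Nate", "John"],
        "job": ["teacher", "teacher", "teacher", "student", "student", "student",
                "teacher", "student"],
        "age": [40, 35, 37, 10, 12, 11, None, None],
    }
)

# True counts as 1, so sum() gives the number of missing values per column
missing_per_column = df.isna().sum()
total_missing = int(missing_per_column.sum())

# isnull() is an alias of isna(), so both masks are identical
same_result = df.isnull().equals(df.isna())

print(missing_per_column)
print(total_missing, same_result)
```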


Unnamed: 0,name,job,age
0,John,teacher,40.0
1,Nate,teacher,35.0
2,Yuna,teacher,37.0
3,Abraham,student,10.0
4,Brian,student,12.0
5,Janny,student,11.0
6,Nate,teacher,0.0
7,John,student,0.0


Unnamed: 0,name,job,age
0,John,teacher,40.0
1,Nate,teacher,35.0
2,Yuna,teacher,37.0
3,Abraham,student,10.0
4,Brian,student,12.0
5,Janny,student,11.0
6,Nate,teacher,
7,John,student,


0    37.0
1    37.0
2    37.0
3    11.0
4    11.0
5    11.0
6    37.0
7    11.0
Name: age, dtype: float64

Unnamed: 0,name,job,age
0,John,teacher,40.0
1,Nate,teacher,35.0
2,Yuna,teacher,37.0
3,Abraham,student,10.0
4,Brian,student,12.0
5,Janny,student,11.0
6,Nate,teacher,37.0
7,John,student,11.0
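
The unlabeled outputs above are consistent with filling `age` with 0 and then with the median age of each `job` group. The code that produced them is not shown in the notebook, so the following is only one plausible reconstruction, assuming `groupby(...).transform("median")` was used. The names `filled_zero`, `group_median`, and `filled` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["John", "Nate", "Yuna", "Abraham", "Brian", "Janny", "Nate", "John"],
        "job": ["teacher", "teacher", "teacher", "student", "student", "student",
                "teacher", "student"],
        "age": [40, 35, 37, 10, 12, 11, None, None],
    }
)

# Option 1: replace missing ages with 0 (returns a new Series, df is unchanged)
filled_zero = df["age"].fillna(0)

# Option 2: median age per job, aligned row by row with the original frame
group_median = df.groupby("job")["age"].transform("median")

# Fill each missing age with the median of that row's job group
filled = df.assign(age=df["age"].fillna(group_median))

print(group_median.tolist())
print(filled)
```

A group median is usually a more plausible stand-in than 0 here, since an age of 0 would distort any later statistics on the column.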


### Student Exercise

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5


Unnamed: 0,A,B,C,D
0,True,False,True,False
1,False,False,True,False
2,True,True,True,False


Unnamed: 0,A,B,C,D
0,True,False,True,False
1,False,False,True,False
2,True,True,True,False
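
The cells that built this frame are not included above. A frame matching the printed values could be created as sketched below; the construction is an assumption inferred from the output, and the two identical boolean tables are presumably `isnull()` and `isna()`.

```python
import numpy as np
import pandas as pd

# Values inferred from the printed table (assumed, not from the original cell)
df = pd.DataFrame(
    [
        [np.nan, 2, np.nan, 0],
        [3, 4, np.nan, 1],
        [np.nan, np.nan, np.nan, 5],
    ],
    columns=list("ABCD"),
)

# Boolean mask of missing values; isnull() would give the same result
mask = df.isna()

print(df)
print(mask)
```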


In [None]:
# Fill all missing values with 0


Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5


In [None]:
# Fill all missing values with a specific value for each column


{'A': 0, 'B': 1, 'C': 2, 'D': 3}

Unnamed: 0,A,B,C,D
0,0.0,2.0,2.0,0
1,3.0,4.0,2.0,1
2,0.0,1.0,2.0,5
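
One possible solution to the two exercises above, assuming the frame is built from the values shown in the first output table. The names `filled_zero` and `filled_by_col` are illustrative.

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the printed output (an assumption)
df = pd.DataFrame(
    [
        [np.nan, 2, np.nan, 0],
        [3, 4, np.nan, 1],
        [np.nan, np.nan, np.nan, 5],
    ],
    columns=list("ABCD"),
)

# A scalar fills every missing cell with the same value
filled_zero = df.fillna(0)

# A dict maps each column name to its own fill value
values = {"A": 0, "B": 1, "C": 2, "D": 3}
filled_by_col = df.fillna(value=values)

print(filled_zero)
print(filled_by_col)
```

Column `D` has no missing values, so its entry in `values` has no effect.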


In [None]:
# Fill all missing values with the median of column D



Unnamed: 0,A,B,C,D
0,1.0,2.0,1.0,0
1,3.0,4.0,1.0,1
2,1.0,1.0,1.0,5


In [None]:
# Fill all missing values with the maximum of column D



Unnamed: 0,A,B,C,D
0,5.0,2.0,5.0,0
1,3.0,4.0,5.0,1
2,5.0,5.0,5.0,5
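
A sketch of how the last two exercises might be solved, again assuming the frame reconstructed from the printed output. A statistic of column `D` is computed first and then passed to `fillna` as a scalar. The result names are illustrative.

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the printed output (an assumption)
df = pd.DataFrame(
    [
        [np.nan, 2, np.nan, 0],
        [3, 4, np.nan, 1],
        [np.nan, np.nan, np.nan, 5],
    ],
    columns=list("ABCD"),
)

# Median and maximum of column D, whose values are 0, 1, 5
d_median = df["D"].median()
d_max = df["D"].max()

filled_median = df.fillna(d_median)
filled_max = df.fillna(d_max)

print(filled_median)
print(filled_max)
```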


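### [Concept] NaN

- A missing value can be represented in several ways, including np.nan, pd.NaT, pd.NA, and None
- However, np.nan does not compare equal to itself when individual values are compared
- None does compare equal to itself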

np.nan != np.nan = True
np.nan == np.nan = False
pd.NaT == pd.NaT = False
pd.NA == pd.NA = <NA>
None == None = True


- If you need to compare values that cannot be represented with one another, it is convenient to store them as None
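
The comparison results listed above can be reproduced with the sketch below. It also shows `pd.isna`, which detects every one of these markers without relying on `==`. The names `results` and `all_missing` are illustrative.

```python
import numpy as np
import pandas as pd

# Equality behaves differently for each missing-value marker
results = {
    "np.nan == np.nan": np.nan == np.nan,   # False
    "pd.NaT == pd.NaT": pd.NaT == pd.NaT,   # False
    "pd.NA == pd.NA": pd.NA == pd.NA,       # <NA>, neither True nor False
    "None == None": None == None,           # True
}
for expression, value in results.items():
    print(expression, "=", value)

# pd.isna recognizes all four markers regardless of how == behaves
all_missing = all(pd.isna(v) for v in (np.nan, pd.NaT, pd.NA, None))
print(all_missing)
```

When the goal is only to test whether a value is missing, `pd.isna` may be a simpler option than converting everything to None first.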