## Processing data with the pandas library

### 1. Exploring a feature in more detail with Series
#### Learning the pandas library with coronavirus data
- Check the COVID-19-master folder

In [1]:
import pandas as pd
PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
doc = pd.read_csv(PATH + "06-17-2020.csv", encoding='utf-8-sig')

In [2]:
doc.head()

|  | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key | Incidence_Rate | Case-Fatality_Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 45001.0 | Abbeville | South Carolina | US | 2020-06-18 04:33:18 | 34.223334 | -82.461707 | 73 | 0 | 0 | 73 | Abbeville, South Carolina, US | 297.631182 | 0.0 |
| 1 | 22001.0 | Acadia | Louisiana | US | 2020-06-18 04:33:18 | 30.295065 | -92.414197 | 625 | 32 | 0 | 593 | Acadia, Louisiana, US | 1007.333387 | 5.12 |
| 2 | 51001.0 | Accomack | Virginia | US | 2020-06-18 04:33:18 | 37.767072 | -75.632346 | 1018 | 14 | 0 | 1004 | Accomack, Virginia, US | 3150.142344 | 1.375246 |
| 3 | 16001.0 | Ada | Idaho | US | 2020-06-18 04:33:18 | 43.452658 | -116.241552 | 986 | 22 | 0 | 964 | Ada, Idaho, US | 204.739746 | 2.231237 |
| 4 | 19001.0 | Adair | Iowa | US | 2020-06-18 04:33:18 | 41.330756 | -94.471059 | 12 | 0 | 0 | 12 | Adair, Iowa, US | 167.785235 | 0.0 |


#### Extracting a Series from a DataFrame
- Selecting a single feature (column) returns a Series

In [3]:
countries = doc['Country_Region']
countries.head()

0    US
1    US
2    US
3    US
4    US
Name: Country_Region, dtype: object

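#### Exploring a feature in more detail with a Series
- size: returns the total number of elements, missing values included
- count(): returns the number of elements, excluding missing values
- unique(): returns only the distinct values
- value_counts(): returns how many times each value occurs, excluding missing values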
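
The difference between these four only shows up when a column has missing values, which `Country_Region` does not. The following is a minimal sketch on a hypothetical toy Series (the values are made up for illustration and are not taken from the COVID-19 files):

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series standing in for a column such as doc['Country_Region']
s = pd.Series(["US", "Italy", "US", np.nan, "Japan", "US"])

size = s.size              # every element, missing or not
non_null = s.count()       # elements that are not missing
distinct = s.unique()      # distinct values in order of first appearance (NaN is kept)
counts = s.value_counts()  # occurrences of each non-missing value, most frequent first

print(size, non_null)
print(distinct)
print(counts)
```

When `size` and `count()` differ, the gap is the number of missing values in the column.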

In [4]:
print(countries.size, countries.count())

3747 3747


In [5]:
print(countries.unique(), len(countries.unique()))

['US' 'Italy' 'Brazil' 'Russia' 'Mexico' 'Japan' 'Canada' 'Colombia'
 'Peru' 'Spain' 'India' 'United Kingdom' 'China' 'Chile' 'Netherlands'
 'Australia' 'Pakistan' 'Germany' 'Sweden' 'Ukraine' 'Denmark' 'France'
 'Afghanistan' 'Albania' 'Algeria' 'Andorra' 'Angola'
 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Austria' 'Azerbaijan'
 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus' 'Belgium' 'Belize'
 'Benin' 'Bhutan' 'Bolivia' 'Bosnia and Herzegovina' 'Botswana' 'Brunei'
 'Bulgaria' 'Burkina Faso' 'Burma' 'Burundi' 'Cabo Verde' 'Cambodia'
 'Cameroon' 'Central African Republic' 'Chad' 'Comoros'
 'Congo (Brazzaville)' 'Congo (Kinshasa)' 'Costa Rica' "Cote d'Ivoire"
 'Croatia' 'Cuba' 'Cyprus' 'Czechia' 'Diamond Princess' 'Djibouti'
 'Dominica' 'Dominican Republic' 'Ecuador' 'Egypt' 'El Salvador'
 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Eswatini' 'Ethiopia' 'Fiji'
 'Finland' 'Gabon' 'Gambia' 'Georgia' 'Ghana' 'Greece' 'Grenada'
 'Guatemala' 'Guinea' 'Guinea-Bissau' 'Guyana' 'Haiti' ...

In [6]:
countries.value_counts()

US             3076
Russia           83
Japan            48
India            36
Colombia         34
               ... 
Switzerland       1
Eritrea           1
Montenegro        1
Dominica          1
Israel            1
Name: Country_Region, Length: 188, dtype: int64

### 2. Selecting only the columns you need
- Selecting multiple columns returns a separate DataFrame
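
Whether the result is a Series or a DataFrame depends on the brackets, not on the number of columns. A small sketch on a hypothetical toy frame (column names borrowed from the daily report, values made up):

```python
import pandas as pd

# Hypothetical toy frame; only the column names mirror the daily report
df = pd.DataFrame({"Confirmed": [73, 625], "Deaths": [0, 32], "Recovered": [0, 0]})

one = df["Confirmed"]               # a column label -> Series
one_df = df[["Confirmed"]]          # a list with one label -> one-column DataFrame
two = df[["Confirmed", "Deaths"]]   # a list of labels -> DataFrame

print(type(one).__name__, type(one_df).__name__, type(two).__name__)
```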

In [7]:
covid_stat = doc[['Confirmed', 'Deaths', 'Recovered']]
covid_stat.head()

|  | Confirmed | Deaths | Recovered |
| --- | --- | --- | --- |
| 0 | 73 | 0 | 0 |
| 1 | 625 | 32 | 0 |
| 2 | 1018 | 14 | 0 |
| 3 | 986 | 22 | 0 |
| 4 | 12 | 0 | 0 |


### 3. Finding rows that match a condition

In [8]:
doc = pd.read_csv(PATH + "04-01-2020.csv", encoding='utf-8-sig')

In [9]:
doc_us = doc[doc['Country_Region'] == 'US']
doc_us

|  | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 45001.0 | Abbeville | South Carolina | US | 2020-04-01 21:58:49 | 34.223334 | -82.461707 | 4 | 0 | 0 | 0 | Abbeville, South Carolina, US |
| 1 | 22001.0 | Acadia | Louisiana | US | 2020-04-01 21:58:49 | 30.295065 | -92.414197 | 47 | 1 | 0 | 0 | Acadia, Louisiana, US |
| 2 | 51001.0 | Accomack | Virginia | US | 2020-04-01 21:58:49 | 37.767072 | -75.632346 | 7 | 0 | 0 | 0 | Accomack, Virginia, US |
| 3 | 16001.0 | Ada | Idaho | US | 2020-04-01 21:58:49 | 43.452658 | -116.241552 | 195 | 3 | 0 | 0 | Ada, Idaho, US |
| 4 | 19001.0 | Adair | Iowa | US | 2020-04-01 21:58:49 | 41.330756 | -94.471059 | 1 | 0 | 0 | 0 | Adair, Iowa, US |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2246 | 66000.0 | NaN | Guam | US | 2020-04-01 21:58:49 | 13.444300 | 144.793700 | 77 | 3 | 0 | 0 | Guam, US |
| 2273 | NaN | NaN | Northern Mariana Islands | US | 2020-04-01 21:58:49 | 15.097900 | 145.673900 | 6 | 1 | 0 | 0 | ,Northern Mariana Islands,US |
| 2279 | NaN | NaN | Puerto Rico | US | 2020-04-01 21:58:49 | 18.220800 | -66.590100 | 286 | 11 | 0 | 0 | Puerto Rico, US |
| 2284 | NaN | NaN | Recovered | US | 2020-04-01 21:58:49 | 0.000000 | 0.000000 | 0 | 0 | 8474 | 0 | Recovered, US |
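
The filter in `In [9]` is boolean indexing: the comparison produces a True/False Series, and indexing the DataFrame with it keeps only the True rows. A minimal sketch on a hypothetical toy frame (values made up), including how two conditions might be combined:

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
df = pd.DataFrame({
    "Country_Region": ["US", "Italy", "US", "Japan"],
    "Confirmed": [4, 100, 195, 30],
})

mask = df["Country_Region"] == "US"   # boolean Series, one entry per row
us_rows = df[mask]                    # keeps the rows where mask is True

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
us_large = df[(df["Country_Region"] == "US") & (df["Confirmed"] >= 100)]

print(us_rows)
print(us_large)
```

The original row labels are kept in the result, which is why the filtered output above still shows index values such as 2246 and 2284.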


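### 4. Handling missing data (NaN)

- Checking whether there are missing values
  - isnull(): marks each value as missing or not (True or False)
  - sum(): adds up the True values, giving the number of missing values in each column
  - Usually chained as isnull().sum()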
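
As a rough illustration of what the chain computes, the sketch below uses a hypothetical three-row frame (column names follow the 01-22-2020 file, values are made up). It also shows one way to count rows, rather than cells, that contain a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only
df = pd.DataFrame({
    "Province/State": ["Anhui", None, "Hubei"],
    "Confirmed": [1.0, np.nan, 444.0],
    "Deaths": [np.nan, np.nan, 17.0],
})

mask = df.isnull()                         # same shape as df, True where a value is missing
missing_per_column = mask.sum()            # True counts as 1, so this is a per-column count
rows_with_missing = mask.any(axis=1).sum() # rows that have at least one missing value

print(missing_per_column)
print(rows_with_missing)
```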

In [10]:
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
doc.isnull().sum()

Province/State     3
Country/Region     0
Last Update        0
Confirmed          9
Deaths            37
Recovered         37
dtype: int64

#### Dropping missing data
- dropna(): drops every row that contains at least one missing value

In [11]:
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
doc = doc.dropna()
doc.head()

|  | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
| --- | --- | --- | --- | --- | --- | --- |
| 13 | Hubei | Mainland China | 1/22/2020 17:00 | 444.0 | 17.0 | 28.0 |


#### Dropping only the rows where a specific column is missing
- Pass the column(s) to check through the subset argument
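
To make the contrast concrete without the data files, here is a small sketch on a hypothetical toy frame (values made up) comparing `dropna()` with and without `subset`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only
df = pd.DataFrame({
    "Province/State": ["Anhui", "Gansu", "Hubei"],
    "Confirmed": [1.0, np.nan, 444.0],
    "Deaths": [np.nan, np.nan, 17.0],
})

all_complete = df.dropna()                       # keeps rows with no missing value at all
has_confirmed = df.dropna(subset=["Confirmed"])  # only checks the Confirmed column

print(all_complete)
print(has_confirmed)
```

With `subset`, a row may still contain NaN in the other columns, as the Deaths and Recovered columns do in the output of the next cell.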

In [12]:
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
doc = doc.dropna(subset=['Confirmed'])
doc.head()

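|  | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | NaN | NaN |
| 1 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | NaN | NaN |
| 2 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | NaN | NaN |
| 3 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | NaN | NaN |
| 5 | Guangdong | Mainland China | 1/22/2020 17:00 | 26.0 | NaN | NaN |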


#### Replacing all missing data (NaN) with a specific value
- fillna(value): replaces missing values with the given value

In [13]:
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
doc = doc.fillna(0)
doc.head()

|  | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 1 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | 0.0 | 0.0 |
| 2 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | 0.0 | 0.0 |
| 3 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 4 | Gansu | Mainland China | 1/22/2020 17:00 | 0.0 | 0.0 | 0.0 |


#### Replacing missing data (NaN) with a specific value in specific columns only
- Build a dictionary whose keys are the columns to fill and whose values are the replacement values, then pass it to fillna()
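
The sketch below contrasts a scalar fill with a dictionary fill on a hypothetical toy frame (values made up; the province names are used as the index only to keep the frame numeric):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only
df = pd.DataFrame(
    {"Confirmed": [1.0, np.nan, 444.0], "Deaths": [np.nan, np.nan, 17.0]},
    index=["Anhui", "Gansu", "Hubei"],
)

filled_all = df.fillna(0)               # every NaN in every column becomes 0
filled_some = df.fillna({"Deaths": 0})  # only the columns named in the dict are filled

print(filled_all)
print(filled_some)
```

Columns that are not keys of the dictionary keep their NaN values, which matches the Gansu row in the output of the next cell.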

In [14]:
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
nan_data = {'Deaths': 0, 'Recovered':0}
doc = doc.fillna(nan_data)
doc.head()

|  | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 1 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | 0.0 | 0.0 |
| 2 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | 0.0 | 0.0 |
| 3 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 4 | Gansu | Mainland China | 1/22/2020 17:00 | NaN | 0.0 | 0.0 |


### 5. Aggregating data by a key
- groupby(): works like GROUP BY in SQL, grouping rows by the given column
- sum(): adds up the values within each group
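
A minimal sketch of the pattern on a hypothetical toy frame (values made up). It passes `numeric_only=True`, on the assumption that text columns should be left out of the totals; selecting the numeric columns before summing is another option:

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
df = pd.DataFrame({
    "Country_Region": ["US", "Italy", "US", "Italy"],
    "Admin2": ["Abbeville", None, "Acadia", None],
    "Confirmed": [4, 10, 47, 20],
    "Deaths": [0, 1, 1, 2],
})

# One row per country; the group key becomes the index, sorted alphabetically
totals = df.groupby("Country_Region").sum(numeric_only=True)

# Alternative: pick the columns to aggregate explicitly
totals_alt = df.groupby("Country_Region")[["Confirmed", "Deaths"]].sum()

print(totals)
```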

In [15]:
doc = pd.read_csv(PATH + "04-01-2020.csv", encoding='utf-8-sig')
doc.head()

|  | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 45001.0 | Abbeville | South Carolina | US | 2020-04-01 21:58:49 | 34.223334 | -82.461707 | 4 | 0 | 0 | 0 | Abbeville, South Carolina, US |
| 1 | 22001.0 | Acadia | Louisiana | US | 2020-04-01 21:58:49 | 30.295065 | -92.414197 | 47 | 1 | 0 | 0 | Acadia, Louisiana, US |
| 2 | 51001.0 | Accomack | Virginia | US | 2020-04-01 21:58:49 | 37.767072 | -75.632346 | 7 | 0 | 0 | 0 | Accomack, Virginia, US |
| 3 | 16001.0 | Ada | Idaho | US | 2020-04-01 21:58:49 | 43.452658 | -116.241552 | 195 | 3 | 0 | 0 | Ada, Idaho, US |
| 4 | 19001.0 | Adair | Iowa | US | 2020-04-01 21:58:49 | 41.330756 | -94.471059 | 1 | 0 | 0 | 0 | Adair, Iowa, US |


In [16]:
doc = doc.groupby('Country_Region').sum(numeric_only=True)  # sum numeric columns only; text columns are left out
doc.head()

| Country_Region | FIPS | Lat | Long_ | Confirmed | Deaths | Recovered | Active |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Afghanistan | 0.0 | 33.93911 | 67.709953 | 237 | 4 | 5 | 228 |
| Albania | 0.0 | 41.1533 | 20.1683 | 259 | 15 | 67 | 177 |
| Algeria | 0.0 | 28.0339 | 1.6596 | 847 | 58 | 61 | 728 |
| Andorra | 0.0 | 42.5063 | 1.5218 | 390 | 14 | 10 | 366 |
| Angola | 0.0 | -11.2027 | 17.8739 | 8 | 2 | 1 | 5 |


- After groupby, the index becomes the distinct countries in Country_Region

In [17]:
doc.columns

Index(['FIPS', 'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active'], dtype='object')

In [18]:
doc.index

Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'Ukraine', 'United Arab Emirates', 'United Kingdom', 'Uruguay',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'West Bank and Gaza', 'Zambia',
       'Zimbabwe'],
      dtype='object', name='Country_Region', length=180)

- Filtering on the index retrieves the totals for the US

In [19]:
doc[doc.index == 'US']

| Country_Region | FIPS | Lat | Long_ | Confirmed | Deaths | Recovered | Active |
| --- | --- | --- | --- | --- | --- | --- | --- |
| US | 65168934.0 | 83042.72113 | -197696.886157 | 213372 | 4757 | 8474 | 0 |
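
Because the country names are now index labels, label-based lookup with `.loc` is another way to reach the same row. A small sketch on a hypothetical grouped result (values made up):

```python
import pandas as pd

# Hypothetical frame shaped like a groupby result; values are illustrative only
totals = pd.DataFrame(
    {"Confirmed": [30, 51], "Deaths": [3, 1]},
    index=pd.Index(["Italy", "US"], name="Country_Region"),
)

by_mask = totals[totals.index == "US"]  # boolean mask on the index -> one-row DataFrame
by_loc_df = totals.loc[["US"]]          # list of labels -> one-row DataFrame
by_loc_row = totals.loc["US"]           # single label -> Series of that row's values

print(by_loc_row["Confirmed"])
```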


### 6. Changing column types
- In pandas a data type is called a dtype; the main dtypes are:
  - object: Python str or mixed types (strings)
  - int64: Python int (integer)
  - float64: Python float (floating point)
  - bool: Python bool (True or False)
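
The sketch below, on hypothetical toy data, shows how these dtypes are inferred. It also illustrates why a column of counts such as Confirmed can show up as float64: NaN is itself a float, so an integer column with a missing value is stored as floats.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only
df = pd.DataFrame({
    "name": ["a", "b"],      # text (object in the pandas version used in this notebook)
    "n": [1, 2],             # int64
    "x": [1.5, 2.5],         # float64
    "flag": [True, False],   # bool
})
print(df.dtypes)

# An integer column with a missing value is stored as float64
with_nan = pd.Series([1, np.nan, 14])
print(with_nan.dtype)
```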

In [3]:
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
doc = doc[['Country/Region', 'Confirmed']]
doc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38 entries, 0 to 37
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country/Region  38 non-null     object 
 1   Confirmed       29 non-null     float64
dtypes: float64(1), object(1)
memory usage: 736.0+ bytes


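- astype({column_name: new_type}): changes the type of the given column
  - If the column contains missing data (NaN), the conversion can raise an error, for example when converting to int64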
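
A minimal sketch of that failure and two possible ways around it, on a hypothetical toy Series. The nullable `Int64` dtype (capital I) is not used in this notebook; it is shown only as an alternative to dropping rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
s = pd.Series([1.0, np.nan, 14.0], name="Confirmed")

try:
    s.astype("int64")    # NaN has no int64 representation
    failed = False
except ValueError as exc:
    failed = True
    print("conversion failed:", exc)

dropped = s.dropna().astype("int64")  # option 1: remove the missing rows first
nullable = s.astype("Int64")          # option 2: nullable integer dtype keeps the gap as <NA>

print(dropped.tolist())
print(nullable.dtype)
```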

In [4]:
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
doc = doc[['Country/Region', 'Confirmed']] # select only the columns needed
doc = doc.dropna(subset=['Confirmed'])     # drop rows where Confirmed is missing
doc = doc.astype({'Confirmed': 'int64'})   # change the dtype of the Confirmed column
doc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29 entries, 0 to 37
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Country/Region  29 non-null     object
 1   Confirmed       29 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 696.0+ bytes


In [5]:
doc.head()

|  | Country/Region | Confirmed |
| --- | --- | --- |
| 0 | Mainland China | 1 |
| 1 | Mainland China | 14 |
| 2 | Mainland China | 6 |
| 3 | Mainland China | 1 |
| 5 | Mainland China | 26 |


### 7. Renaming DataFrame columns
- Assign a list of names to the columns attribute to rename the columns
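
Assigning to `columns` replaces every name at once, so the list must have one entry per column, in order. When only some names change, `rename()` with a mapping is a possible alternative; the sketch below shows both on a hypothetical one-row frame:

```python
import pandas as pd

# Hypothetical toy frame; the single row is illustrative only
df = pd.DataFrame({"Country/Region": ["Mainland China"], "Confirmed": [1.0]})

# rename() returns a new frame and touches only the names in the mapping
renamed = df.rename(columns={"Country/Region": "Country_Region"})

# Assigning to columns changes the frame in place and must cover every column
df.columns = ["Country_Region", "Confirmed"]

print(renamed.columns.tolist(), df.columns.tolist())
```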

In [52]:
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
doc = doc[['Country/Region', 'Confirmed']] # select only the columns needed

In [54]:
doc.columns

Index(['Country/Region', 'Confirmed'], dtype='object')

In [55]:
doc.columns = ['Country_Region', 'Confirmed']

In [56]:
doc.columns

Index(['Country_Region', 'Confirmed'], dtype='object')

### 8. Finding and removing duplicate rows in a DataFrame
- duplicated(): marks each row that repeats an earlier row as True

In [13]:
doc = pd.read_csv("COVID-19-master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv", encoding='utf-8-sig')
doc = doc[['iso2', 'Country_Region']]
doc

|  | iso2 | Country_Region |
| --- | --- | --- |
| 0 | BW | Botswana |
| 1 | BI | Burundi |
| 2 | SL | Sierra Leone |
| 3 | AF | Afghanistan |
| 4 | AL | Albania |
| ... | ... | ... |
| 3555 | US | US |
| 3556 | US | US |
| 3557 | US | US |
| 3558 | US | US |


In [14]:
doc.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
3555     True
3556     True
3557     True
3558     True
3559     True
Length: 3560, dtype: bool

- Show only the duplicated rows

In [15]:
doc[doc.duplicated()]

|  | iso2 | Country_Region |
| --- | --- | --- |
| 194 | GB | United Kingdom |
| 200 | AU | Australia |
| 201 | AU | Australia |
| 202 | AU | Australia |
| 203 | AU | Australia |
| ... | ... | ... |
| 3555 | US | US |
| 3556 | US | US |
| 3557 | US | US |
| 3558 | US | US |


- drop_duplicates(): removes duplicate rows
  - Remove duplicates based on specific columns
    - subset=column_name
  - Choose whether to keep the first or the last row of each set of duplicates
    - First: keep='first' (default)
    - Last: keep='last'
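
To see how `subset` and `keep` interact, here is a small sketch on a hypothetical five-row frame (the codes and names mirror the lookup table, but the rows are made up):

```python
import pandas as pd

# Hypothetical toy frame; rows are illustrative only
df = pd.DataFrame({
    "iso2": ["GB", "TC", "AU", "AU", "US"],
    "Country_Region": ["United Kingdom", "United Kingdom", "Australia", "Australia", "US"],
})

dup_mask = df.duplicated()   # True only where the whole row repeats an earlier row
full = df.drop_duplicates()  # compares all columns, keeps the first occurrence

# Compare on Country_Region only, keeping the first or the last row per country
first = df.drop_duplicates(subset="Country_Region")
last = df.drop_duplicates(subset="Country_Region", keep="last")

print(first)
print(last)
```

With `keep='last'`, the surviving United Kingdom row is the TC one rather than GB, which is the same effect visible in the output of the next cell.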

In [16]:
doc = doc.drop_duplicates(subset='Country_Region', keep='last')
doc

|  | iso2 | Country_Region |
| --- | --- | --- |
| 0 | BW | Botswana |
| 1 | BI | Burundi |
| 2 | SL | Sierra Leone |
| 3 | AF | Afghanistan |
| 4 | AL | Albania |
| ... | ... | ... |
| 198 | TC | United Kingdom |
| 206 | AU | Australia |
| 221 | CA | Canada |
| 254 | CN | China |
