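## 5.1. What is pandas?

- The name pandas derives from "panel data"; it is also commonly read as shorthand for <b>"Python data analysis"</b>.
> pandas specializes in processing structured (tabular) data.

- pandas is also widely used together with many machine learning and scientific libraries.
> scikit-learn, SciPy, statsmodels, TensorFlow, PyTorch, ...


- Put simply, it lets you **use Excel-like functionality from Python**.
> pandas = Python + Excel // pandas & Excel // pandas vs. MS Excel

- However, because pandas is built on NumPy arrays and works together with the rest of the Python ecosystem, it goes well beyond what Excel offers.
> pandas is better suited than Excel to high-performance data processing.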

![numpy_data_type](../images/pandas/dataframe.png)

- The basic unit for handling data in pandas is the DataFrame. Conceptually it is the same as a familiar spreadsheet.


- Data of this shape is referred to as structured data, panel data, or tabular data.


- Learning pandas ultimately means learning how to use DataFrames and put them to work.


- Used well, pandas lets you handle most structured data with ease.

![pandas_files](../images/pandas/pandas_files.png)

## 5.2. Basic pandas data structures (Series, DataFrame)

In [4]:
# Import the pandas library, using pd as the conventional alias.
import pandas as pd
import numpy as np

print(pd.__version__)  # check the pandas version

1.3.4


- A DataFrame is a two-dimensional table, and a single line of that table (one row or one column) is a Series.


- In other words, a collection of Series makes up a DataFrame.
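The relationship between Series and DataFrame described above can be sketched as follows. The data and the names `height`, `weight`, and `people` are made up for illustration; this is a minimal sketch, not part of the original notebook.

```python
import pandas as pd

# Two Series sharing the same index labels (hypothetical example data).
height = pd.Series([170, 165, 180], index=["kim", "lee", "park"], name="height")
weight = pd.Series([65, 55, 80], index=["kim", "lee", "park"], name="weight")

# Placing the Series side by side as columns yields a DataFrame.
people = pd.concat([height, weight], axis=1)

# Selecting one column or one row gives back a Series.
col = people["height"]
row = people.loc["lee"]
print(type(people).__name__, type(col).__name__, type(row).__name__)
```

Each column keeps the name of the Series it came from, and a row Series is indexed by the column labels.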

In [5]:
# s is a pandas.Series with elements 1, 3, 5, np.nan, 6, 8 (np.nan makes the dtype float64)

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

- pandas also provides the date_range function, which makes it easy to generate a sequence of dates.

In [7]:
# pandas.date_range builds a range of 6 consecutive days starting from 2022-01-01

dates = pd.date_range('20220101', periods=6)
dates

DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05', '2022-01-06'],
              dtype='datetime64[ns]', freq='D')
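As a small sketch of other ways to call `pd.date_range`, the same six days can be requested by start and end date, and the `freq` argument changes the step. The variable names here are illustrative only.

```python
import pandas as pd

daily = pd.date_range("2022-01-01", periods=6)               # 6 consecutive days
by_end = pd.date_range("2022-01-01", "2022-01-06")           # same range, given start and end
weekly = pd.date_range("2022-01-01", periods=3, freq="7D")   # one date every 7 days

print(daily.equals(by_end))
print(weekly)
```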

In [80]:
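# Create a 6x4 DataFrame of random numbers drawn from the standard normal distribution
# (np.random.randn; the values are not limited to the range -1 to 1).
# The index is dates and the column labels are A, B, C, D. The labels are passed as a set,
# so their order is not guaranteed (C, A, B, D below) and newer pandas versions may reject it.

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns={'A', 'B', 'C', 'D'})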
df

Unnamed: 0,C,A,B,D
2022-01-01,0.869991,0.916606,1.523467,0.070347
2022-01-02,0.837946,-1.928735,0.978699,-0.007602
2022-01-03,1.653679,0.568465,-0.722478,0.071582
2022-01-04,-1.830667,-2.428996,0.202074,0.491838
2022-01-05,-0.365804,-0.475981,1.562783,2.418113
2022-01-06,-0.547971,0.532926,-0.814008,-0.21676
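Because the cell above passes the column labels as a set, the column order is arbitrary, and because the random generator is unseeded, every run gives different numbers. A minimal sketch of a reproducible variant follows; the seed value and the name `df_ordered` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=6)
rng = np.random.default_rng(seed=0)  # fixed seed so reruns give the same numbers

# standard_normal draws from N(0, 1), the same distribution as np.random.randn.
df_ordered = pd.DataFrame(
    rng.standard_normal((6, 4)),
    index=dates,
    columns=["A", "B", "C", "D"],  # a list keeps exactly this order
)
print(df_ordered.columns.tolist())
```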


## 5.3. Basic DataFrame methods

In [10]:
# head() shows the first five rows of the DataFrame

df.head()

Unnamed: 0,C,A,B,D
2022-01-01,0.368122,1.163565,-0.383972,1.367342
2022-01-02,-1.05862,0.47451,-0.863362,-0.3097
2022-01-03,0.781929,0.956419,1.043555,1.322336
2022-01-04,2.653835,1.074864,-1.046183,-0.401755
2022-01-05,0.72686,-1.460259,-0.381278,-1.3547


In [11]:
# only the first 3 rows

df.head(3)

Unnamed: 0,C,A,B,D
2022-01-01,0.368122,1.163565,-0.383972,1.367342
2022-01-02,-1.05862,0.47451,-0.863362,-0.3097
2022-01-03,0.781929,0.956419,1.043555,1.322336


In [12]:
# counting from the end: the last 3 rows

df.tail(3)

Unnamed: 0,C,A,B,D
2022-01-04,2.653835,1.074864,-1.046183,-0.401755
2022-01-05,0.72686,-1.460259,-0.381278,-1.3547
2022-01-06,-2.006947,0.159437,-1.097654,-0.751865


In [13]:
# dataframe index

df.index

DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05', '2022-01-06'],
              dtype='datetime64[ns]', freq='D')

In [15]:
# dataframe columns

df.columns

Index(['C', 'A', 'B', 'D'], dtype='object')

In [16]:
# dataframe values

df.values

array([[ 0.36812212,  1.16356461, -0.38397187,  1.36734166],
       [-1.05861994,  0.47450985, -0.86336162, -0.30969963],
       [ 0.7819286 ,  0.95641868,  1.04355527,  1.32233589],
       [ 2.65383451,  1.07486445, -1.04618339, -0.40175461],
       [ 0.72686029, -1.46025878, -0.38127771, -1.35469959],
       [-2.00694671,  0.15943706, -1.0976544 , -0.75186465]])

In [18]:
# Show an overall summary of the DataFrame:
# index, columns, non-null counts, dtypes, and memory usage.

df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2022-01-01 to 2022-01-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   C       6 non-null      float64
 1   A       6 non-null      float64
 2   B       6 non-null      float64
 3   D       6 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes


In [19]:
# Show summary statistics for each numeric column of the DataFrame.

df.describe()

Unnamed: 0,C,A,B,D
count,6.0,6.0,6.0,6.0
mean,0.244196,0.394756,-0.454816,-0.02139
std,1.619186,0.987106,0.798174,1.120114
min,-2.006947,-1.460259,-1.097654,-1.3547
25%,-0.701934,0.238205,-1.000478,-0.664337
50%,0.547491,0.715464,-0.623667,-0.355727
75%,0.768162,1.045253,-0.381951,0.914327
max,2.653835,1.163565,1.043555,1.367342


In [25]:
# sort by column B in descending order

df.sort_values(by='B', ascending=False)

Unnamed: 0,C,A,B,D
2022-01-03,0.781929,0.956419,1.043555,1.322336
2022-01-05,0.72686,-1.460259,-0.381278,-1.3547
2022-01-01,0.368122,1.163565,-0.383972,1.367342
2022-01-02,-1.05862,0.47451,-0.863362,-0.3097
2022-01-04,2.653835,1.074864,-1.046183,-0.401755
2022-01-06,-2.006947,0.159437,-1.097654,-0.751865


In [23]:
df.sort_values(by='B', ascending=False).head(3)
# show the 3 rows with the largest values in column B

Unnamed: 0,C,A,B,D
2022-01-03,0.781929,0.956419,1.043555,1.322336
2022-01-05,0.72686,-1.460259,-0.381278,-1.3547
2022-01-01,0.368122,1.163565,-0.383972,1.367342
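The sort-then-head pattern above can also be written with `DataFrame.nlargest`. The sketch below uses a small made-up frame (`scores`) to show that both spellings pick the same rows in the same order.

```python
import pandas as pd

# Hypothetical data, only to compare the two approaches.
scores = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [0.5, -1.0, 2.0, 1.5]},
    index=list("wxyz"),
)

top_sorted = scores.sort_values(by="B", ascending=False).head(3)
top_nlargest = scores.nlargest(3, "B")

print(top_nlargest)
```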


## 5.4. DataFrame Indexing

> Indexing: a way of finding the elements of the data that satisfy a given condition.

> It is useful for quickly locating and manipulating the rows and columns of a DataFrame that match a condition.

In [31]:
# A pandas DataFrame supports basic indexing by column name.
# Index column A.

df['A']  

2022-01-01    1.163565
2022-01-02    0.474510
2022-01-03    0.956419
2022-01-04    1.074864
2022-01-05   -1.460259
2022-01-06    0.159437
Freq: D, Name: A, dtype: float64

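df["2022-01-01"]

Indexing a DataFrame directly with [] looks up a column, just as indexing a dictionary looks up a key: the keys are the column labels. A date string is a row label, not a column label, so the expression above is expected to fail with a KeyError rather than return the row for that date.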
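The dictionary analogy can be checked with a small sketch. The frame below is made up for illustration; the `in` operator on a DataFrame tests column labels, while `.loc` addresses rows by label.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=3)
frame = pd.DataFrame(np.arange(6).reshape(3, 2), index=dates, columns=["A", "B"])

# `in` and [] on a DataFrame look at column labels, like dictionary keys.
has_col = "A" in frame                  # a column label
has_date_col = "2022-01-01" in frame    # a row label, so not found among the columns

# Rows are addressed by label through .loc instead.
first_row = frame.loc["2022-01-01"]
print(has_col, has_date_col)
print(first_row)
```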

In [33]:
# Indexing by a specific date (row-based)

df.loc['2022-01-01']  # returns a pd.Series

C    0.368122
A    1.163565
B   -0.383972
D    1.367342
Name: 2022-01-01 00:00:00, dtype: float64

In [34]:
# Indexing by position: the n-th row counting from the top (0-based)
 
df.iloc[2]

C    0.781929
A    0.956419
B    1.043555
D    1.322336
Name: 2022-01-03 00:00:00, dtype: float64
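The difference between label-based `.loc` and position-based `.iloc` is easiest to see when the index labels are integers that do not match the positions. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical frame whose index labels (10, 20, 30) differ from positions (0, 1, 2).
table = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=[10, 20, 30])

by_label = table.loc[20]      # the row whose index label is 20
by_position = table.iloc[2]   # the third row, whatever its label is

print(by_label["x"], by_position["x"])
```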

In [35]:
# Slicing a DataFrame with [] cuts it by rows.
# Slice the first 3 rows.

df[:3]

Unnamed: 0,C,A,B,D
2022-01-01,0.368122,1.163565,-0.383972,1.367342
2022-01-02,-1.05862,0.47451,-0.863362,-0.3097
2022-01-03,0.781929,0.956419,1.043555,1.322336


In [38]:
# Slicing by index value is also possible (still row-based).
# Cut from 2022-01-02 through 2022-01-04.
# When index values are used, the slice is label-based and includes the end label.

df['2022-01-02':'2022-01-04']

Unnamed: 0,C,A,B,D
2022-01-02,-1.05862,0.47451,-0.863362,-0.3097
2022-01-03,0.781929,0.956419,1.043555,1.322336
2022-01-04,2.653835,1.074864,-1.046183,-0.401755
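Label-based slices include the end label, whereas position-based slices exclude the end position. The sketch below, using a made-up frame, shows two slices that select the same three rows.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-01-01", periods=6)
frame = pd.DataFrame({"v": np.arange(6)}, index=dates)

by_label = frame.loc["2022-01-02":"2022-01-04"]  # end label included: 3 rows
by_position = frame.iloc[1:4]                     # end position excluded: rows 1, 2, 3

print(by_label)
```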


In [40]:
df.loc['2022-01-02']

C   -1.058620
A    0.474510
B   -0.863362
D   -0.309700
Name: 2022-01-02 00:00:00, dtype: float64

In [41]:
# df.loc indexes by label (a key-value lookup on the index).
# Fetch the row whose index value is 2022-01-01.

df.loc[dates[0]]

C    0.368122
A    1.163565
B   -0.383972
D    1.367342
Name: 2022-01-01 00:00:00, dtype: float64

In [43]:
# df.loc also supports 2D indexing.
# [:, ['A', 'D']] means: for all rows, take only columns A and D.

df.loc[:, ['A', 'D']]

Unnamed: 0,A,D
2022-01-01,1.163565,1.367342
2022-01-02,0.47451,-0.3097
2022-01-03,0.956419,1.322336
2022-01-04,1.074864,-0.401755
2022-01-05,-1.460259,-1.3547
2022-01-06,0.159437,-0.751865


In [45]:
# This time, slice a range of rows and take only columns A and C

df.loc['2022-01-03':'2022-01-05', ['A', 'C']]

Unnamed: 0,A,C
2022-01-03,0.956419,0.781929
2022-01-04,1.074864,2.653835
2022-01-05,-1.460259,0.72686


In [46]:
# Select one specific row by its index value

df.loc['2022-01-02',['A','B']]  # a single row, so the result is a Series

A    0.474510
B   -0.863362
Name: 2022-01-02 00:00:00, dtype: float64

In [51]:
# This now works on the same principle as indexing a 2D list.
df.loc['2022-01-03','A'] # the value of one column at one row (index label).

0.9564186846843563

In [52]:
# df.iloc indexes by integer position (row-based); 3 means the fourth row.
df.iloc[3]

C    2.653835
A    1.074864
B   -1.046183
D   -0.401755
Name: 2022-01-04 00:00:00, dtype: float64

In [54]:
# 2D indexing with iloc:
# take the rows at positions 3 and 4, and the columns at positions 0 and 1.
df.iloc[3:5, 0:2]    # slice-based df.iloc indexing works like 2D slicing of a numpy array

Unnamed: 0,C,A
2022-01-04,2.653835,1.074864
2022-01-05,0.72686,-1.460259


In [57]:
# Indexing with explicit lists of positions instead of slices
df.iloc[[1, 2, 4], [0, 3]]  # rows 1, 2, 4 crossed with columns 0 and 3

Unnamed: 0,C,D
2022-01-02,-1.05862,-0.3097
2022-01-03,0.781929,1.322336
2022-01-05,0.72686,-1.3547


In [82]:
# Q. What does a bare : in the second position of a 2D index mean?
df.iloc[1:3,  : ]

Unnamed: 0,C,A,B,D
2022-01-02,0.837946,-1.928735,0.978699,-0.007602
2022-01-03,1.653679,0.568465,-0.722478,0.071582


In [59]:
# Same as 2D slicing of a numpy array: all rows, columns at positions 1 and 2.
df.iloc[:, 1:3]

Unnamed: 0,A,B
2022-01-01,1.163565,-0.383972
2022-01-02,0.47451,-0.863362
2022-01-03,0.956419,1.043555
2022-01-04,1.074864,-1.046183
2022-01-05,-1.460259,-0.381278
2022-01-06,0.159437,-1.097654
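The correspondence between `df.iloc` and NumPy indexing can be sketched as below, with a made-up integer array so the selected values are easy to follow. One caveat worth noting: with two lists, `iloc` selects the cross product of rows and columns, which in NumPy corresponds to `np.ix_` rather than to plain `arr[[...], [...]]`.

```python
import numpy as np
import pandas as pd

arr = np.arange(24).reshape(6, 4)
frame = pd.DataFrame(arr, columns=["C", "A", "B", "D"])

# Slices behave the same in both libraries.
sub_frame = frame.iloc[3:5, 0:2]
sub_array = arr[3:5, 0:2]
same_slice = np.array_equal(sub_frame.to_numpy(), sub_array)

# Lists of positions: iloc takes every listed row with every listed column.
picked = frame.iloc[[1, 2, 4], [0, 3]]
same_lists = np.array_equal(picked.to_numpy(), arr[np.ix_([1, 2, 4], [0, 3])])

print(same_slice, same_lists)
```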


In [81]:
df

Unnamed: 0,C,A,B,D
2022-01-01,0.869991,0.916606,1.523467,0.070347
2022-01-02,0.837946,-1.928735,0.978699,-0.007602
2022-01-03,1.653679,0.568465,-0.722478,0.071582
2022-01-04,-1.830667,-2.428996,0.202074,0.491838
2022-01-05,-0.365804,-0.475981,1.562783,2.418113
2022-01-06,-0.547971,0.532926,-0.814008,-0.21676


In [63]:
# pandas supports boolean indexing (masking), which is often grouped under "fancy indexing".
# (It mirrors the boolean indexing that numpy provides.)
# A condition produces True/False values, and that boolean mask selects the matching elements.
# First, compare every element of df with 0 to build the mask.

df> 0


Unnamed: 0,C,A,B,D
2022-01-01,True,True,False,True
2022-01-02,False,True,False,False
2022-01-03,True,True,True,True
2022-01-04,True,True,False,False
2022-01-05,True,False,False,False
2022-01-06,False,True,False,False


In [65]:
df['A'] > 0

2022-01-01     True
2022-01-02     True
2022-01-03     True
2022-01-04     True
2022-01-05    False
2022-01-06     True
Freq: D, Name: A, dtype: bool

In [89]:
# boolean indexing: the elements of column A that are greater than 0
df['A'][df['A'] > 0]
# chained indexing on the DataFrame: the indexers are applied one after another, from left to right.

2022-01-01    0.916606
2022-01-03    0.568465
2022-01-06    0.532926
Name: A, dtype: float64
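The chained form above first selects column A and then applies the mask. The same selection can be written as a single `.loc` call, which is also the form that works reliably for assignment. A minimal sketch with made-up values:

```python
import pandas as pd

frame = pd.DataFrame({"A": [0.9, -1.9, 0.5], "B": [1.5, 0.9, -0.7]})
mask = frame["A"] > 0

chained = frame["A"][mask]      # two indexing steps, applied left to right
single = frame.loc[mask, "A"]   # one step: rows from the mask, column A

# Assigning through a single .loc call modifies frame itself.
frame.loc[mask, "B"] = 0.0
print(frame)
```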

In [86]:
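df[df > 0]  # builds a new, masked DataFrame; here the result is neither displayed nor stored
df          # df itself is unchanged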

Unnamed: 0,C,A,B,D
2022-01-01,0.869991,0.916606,1.523467,0.070347
2022-01-02,0.837946,-1.928735,0.978699,-0.007602
2022-01-03,1.653679,0.568465,-0.722478,0.071582
2022-01-04,-1.830667,-2.428996,0.202074,0.491838
2022-01-05,-0.365804,-0.475981,1.562783,2.418113
2022-01-06,-0.547971,0.532926,-0.814008,-0.21676


In [87]:
# cells that do not satisfy the condition become NaN
df[df > 0]

Unnamed: 0,C,A,B,D
2022-01-01,0.869991,0.916606,1.523467,0.070347
2022-01-02,0.837946,,0.978699,
2022-01-03,1.653679,0.568465,,0.071582
2022-01-04,,,0.202074,0.491838
2022-01-05,,,1.562783,2.418113
2022-01-06,,0.532926,,
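Masking a whole DataFrame with a boolean frame is equivalent to `DataFrame.where`, which additionally accepts a replacement value for the cells that fail the condition. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

masked = frame[frame > 0]               # failing cells become NaN
same = frame.where(frame > 0)           # equivalent spelling
filled = frame.where(frame > 0, 0.0)    # failing cells replaced by 0.0 instead

print(masked)
print(filled)
```

In all three cases `frame` itself is left unchanged, since each expression returns a new object.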


In [91]:
# Make a copy of the DataFrame: an independent (deep) copy, not a second reference to the same data.
df2 = df.copy()
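A short sketch of what "independent copy" means in practice, using a made-up frame: changes made to the copy do not show up in the original.

```python
import pandas as pd

original = pd.DataFrame({"A": [1, 2, 3]})
duplicate = original.copy()  # deep copy by default

# Modify only the copy.
duplicate["E"] = ["one", "two", "three"]
duplicate.loc[0, "A"] = 100

print(original)
print(duplicate)
```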

In [92]:
# A DataFrame supports assignment in a dictionary-like way.
# Add a column E to df2 whose values are the list
# ['one', 'one', 'two', 'three', 'four', 'three'].
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

Unnamed: 0,C,A,B,D,E
2022-01-01,0.869991,0.916606,1.523467,0.070347,one
2022-01-02,0.837946,-1.928735,0.978699,-0.007602,one
2022-01-03,1.653679,0.568465,-0.722478,0.071582,two
2022-01-04,-1.830667,-2.428996,0.202074,0.491838,three
2022-01-05,-0.365804,-0.475981,1.562783,2.418113,four
2022-01-06,-0.547971,0.532926,-0.814008,-0.21676,three


In [93]:
# Series.isin returns a boolean Series that is True for the rows whose value is in the given list.
df2['E'].isin(['two','four'])

2022-01-01    False
2022-01-02    False
2022-01-03     True
2022-01-04    False
2022-01-05     True
2022-01-06    False
Freq: D, Name: E, dtype: bool

In [94]:
df2[df2['E'].isin(['two','four'])]

Unnamed: 0,C,A,B,D,E
2022-01-03,1.653679,0.568465,-0.722478,0.071582,two
2022-01-05,-0.365804,-0.475981,1.562783,2.418113,four
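An `isin` mask can be negated with `~` and combined with other conditions using `&`. The following is a minimal sketch on a made-up frame; note the parentheses around the comparison, which are needed because `&` binds more tightly than `>`.

```python
import pandas as pd

frame = pd.DataFrame({
    "A": [0.9, -1.9, 0.5, -2.4],
    "E": ["one", "two", "four", "two"],
})

in_set = frame["E"].isin(["two", "four"])
selected = frame[in_set]                      # rows where E is 'two' or 'four'
excluded = frame[~in_set]                     # all other rows
both = frame[in_set & (frame["A"] > 0)]       # in the set AND A greater than 0

print(both)
```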


## 5.5. Reading and writing external data

In [97]:
# Load Iris.csv from the data folder.

data = pd.read_csv("data/Iris.csv")
data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [100]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [102]:
set(data['Species'])  # to see which distinct species exist

{'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'}

In [108]:
# Convert the Species column to numbers.
#Iris-setosa -> 0
#Iris-versicolor -> 1
#Iris-virginica -> 2

data.loc[data['Species'] == 'Iris-setosa', 'Species'] = 0
data.loc[data['Species'] == 'Iris-versicolor', 'Species'] = 1
data.loc[data['Species'] == 'Iris-virginica', 'Species'] = 2
data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,0
1,2,4.9,3.0,1.4,0.2,0
2,3,4.7,3.2,1.3,0.2,0
3,4,4.6,3.1,1.5,0.2,0
4,5,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,2
146,147,6.3,2.5,5.0,1.9,2
147,148,6.5,3.0,5.2,2.0,2
148,149,6.2,3.4,5.4,2.3,2


In [109]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [110]:
set(data['Species'])

{0, 1, 2}
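As the second `data.info()` output shows, the Species column still has dtype `object` after the three `.loc` assignments. One alternative, sketched here on a small made-up Series rather than the real file, is `Series.map` with a dictionary, which performs the whole replacement in one step and yields an integer column.

```python
import pandas as pd

# Hypothetical sample of the Species column.
species = pd.Series(["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"])

codes = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
encoded = species.map(codes)  # each label is looked up in the dictionary

print(encoded.tolist(), encoded.dtype)
```

A label that is missing from the dictionary would map to NaN, so the dictionary should cover every value in the column.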

In [111]:
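# Save the modified DataFrame as Iris_edited.csv.
# By default the index is written as well, which is why an "Unnamed: 0" column appears when the file is read back.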
data.to_csv('data/Iris_edited.csv')

In [113]:
data1= pd.read_csv('data/Iris_edited.csv')
data1

Unnamed: 0.1,Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,0,1,5.1,3.5,1.4,0.2,0
1,1,2,4.9,3.0,1.4,0.2,0
2,2,3,4.7,3.2,1.3,0.2,0
3,3,4,4.6,3.1,1.5,0.2,0
4,4,5,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...,...,...
145,145,146,6.7,3.0,5.2,2.3,2
146,146,147,6.3,2.5,5.0,1.9,2
147,147,148,6.5,3.0,5.2,2.0,2
148,148,149,6.2,3.4,5.4,2.3,2


In [115]:
# Load another file as well.
data2 = pd.read_csv('data/kaggle_survey_2020_responses.csv')
data2

Unnamed: 0,Time from Start to Finish (seconds),Q1,Q2,Q3,Q4,Q5,Q6,Q7_Part_1,Q7_Part_2,Q7_Part_3,...,Q35_B_Part_2,Q35_B_Part_3,Q35_B_Part_4,Q35_B_Part_5,Q35_B_Part_6,Q35_B_Part_7,Q35_B_Part_8,Q35_B_Part_9,Q35_B_Part_10,Q35_B_OTHER
0,Duration (in seconds),What is your age (# years)?,What is your gender? - Selected Choice,In which country do you currently reside?,What is the highest level of formal education ...,Select the title most similar to your current ...,For how many years have you been writing code ...,What programming languages do you use on a reg...,What programming languages do you use on a reg...,What programming languages do you use on a reg...,...,"In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor..."
1,1838,35-39,Man,Colombia,Doctoral degree,Student,5-10 years,Python,R,SQL,...,,,,TensorBoard,,,,,,
2,289287,30-34,Man,United States of America,Master’s degree,Data Engineer,5-10 years,Python,R,SQL,...,,,,,,,,,,
3,860,35-39,Man,Argentina,Bachelor’s degree,Software Engineer,10-20 years,,,,...,,,,,,,,,,
4,507,30-34,Man,United States of America,Master’s degree,Data Scientist,5-10 years,Python,,SQL,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20032,126,18-21,Man,Turkey,Some college/university study without earning ...,,,,,,...,,,,,,,,,,
20033,566,55-59,Woman,United Kingdom of Great Britain and Northern I...,Master’s degree,Currently not employed,20+ years,Python,,,...,,,,,,,,,,
20034,238,30-34,Man,Brazil,Master’s degree,Research Scientist,< 1 years,Python,,,...,,,,,,,,,,
20035,625,22-24,Man,India,Bachelor’s degree,Software Engineer,3-5 years,Python,,SQL,...,Weights & Biases,,,TensorBoard,,,Trains,,,


In [122]:
# Select only the respondents who hold a doctoral degree.
# Q4 -> 'Doctoral degree'
phd = data2[data2['Q4'] == 'Doctoral degree']

In [123]:
# Save only the doctoral-degree holders to kaggle_survey_2020_phd.csv.
phd.to_csv('data/kaggle_survey_2020_phd.csv')

In [127]:
set(phd['Q3'])

{'Argentina',
 'Australia',
 'Bangladesh',
 'Belarus',
 'Belgium',
 'Brazil',
 'Canada',
 'Chile',
 'China',
 'Colombia',
 'Egypt',
 'France',
 'Germany',
 'Ghana',
 'Greece',
 'India',
 'Indonesia',
 'Iran, Islamic Republic of...',
 'Ireland',
 'Israel',
 'Italy',
 'Japan',
 'Kenya',
 'Malaysia',
 'Mexico',
 'Morocco',
 'Nepal',
 'Netherlands',
 'Nigeria',
 'Other',
 'Pakistan',
 'Peru',
 'Philippines',
 'Poland',
 'Portugal',
 'Republic of Korea',
 'Romania',
 'Russia',
 'Saudi Arabia',
 'Singapore',
 'South Africa',
 'South Korea',
 'Spain',
 'Sri Lanka',
 'Sweden',
 'Switzerland',
 'Taiwan',
 'Thailand',
 'Tunisia',
 'Turkey',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom of Great Britain and Northern Ireland',
 'United States of America',
 'Viet Nam'}

In [130]:
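# (OPTIONAL) Select the doctoral-degree holders who reside in the Republic of Korea (Q3 asks for the country of residence).
# Note that Q3 also contains a separate 'South Korea' value, which this filter does not match.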
kor = phd[phd['Q3']== 'Republic of Korea']
kor

Unnamed: 0,Time from Start to Finish (seconds),Q1,Q2,Q3,Q4,Q5,Q6,Q7_Part_1,Q7_Part_2,Q7_Part_3,...,Q35_B_Part_2,Q35_B_Part_3,Q35_B_Part_4,Q35_B_Part_5,Q35_B_Part_6,Q35_B_Part_7,Q35_B_Part_8,Q35_B_Part_9,Q35_B_Part_10,Q35_B_OTHER
5897,201039,25-29,Man,Republic of Korea,Doctoral degree,Research Scientist,3-5 years,Python,,,...,,,,TensorBoard,,,,,,
7902,236,45-49,Man,Republic of Korea,Doctoral degree,Product/Project Manager,< 1 years,,,,...,,,,,,,,,,
9303,3560,35-39,Woman,Republic of Korea,Doctoral degree,Data Scientist,10-20 years,Python,R,SQL,...,,,,,,,,,,
9826,1565,60-69,Man,Republic of Korea,Doctoral degree,Data Scientist,3-5 years,Python,R,,...,,,,,,,,,,
11224,510,35-39,Man,Republic of Korea,Doctoral degree,Research Scientist,< 1 years,Python,,,...,,,,,,,,,,
11958,161,40-44,Man,Republic of Korea,Doctoral degree,Machine Learning Engineer,,,,,...,,,,,,,,,,
12191,707,35-39,Man,Republic of Korea,Doctoral degree,Research Scientist,10-20 years,Python,,,...,,,,,,,,,,
12999,458,25-29,Man,Republic of Korea,Doctoral degree,Student,10-20 years,Python,,,...,,,,,,,,,,
16643,1045,55-59,Man,Republic of Korea,Doctoral degree,Product/Project Manager,< 1 years,Python,R,,...,,,,TensorBoard,,,,,,
19867,708,25-29,Man,Republic of Korea,Doctoral degree,Data Scientist,3-5 years,Python,,,...,,,,,,,,,,
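The set of Q3 values printed earlier contains both 'Republic of Korea' and 'South Korea'. If both should be counted, the two filters can be combined into one expression with `isin` and `&`. This is a sketch on a tiny made-up frame (`survey`) with the same column names, not on the real survey file.

```python
import pandas as pd

# Hypothetical rows using the same Q3 / Q4 column names as the survey.
survey = pd.DataFrame({
    "Q3": ["Republic of Korea", "South Korea", "India", "Republic of Korea"],
    "Q4": ["Doctoral degree", "Doctoral degree", "Doctoral degree", "Master’s degree"],
})

is_phd = survey["Q4"] == "Doctoral degree"
in_korea = survey["Q3"].isin(["Republic of Korea", "South Korea"])

kor_phd = survey[is_phd & in_korea]                     # both conditions at once
countries = sorted(survey.loc[is_phd, "Q3"].unique())   # distinct countries among PhD holders

print(kor_phd)
print(countries)
```

`Series.unique()` gives the distinct values much like `set(...)` does in the cells above, but keeps them in a NumPy array that can be sorted or counted further.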


In [132]:
kor.to_csv('data/kaggle_kor.csv')

In [136]:
ko = pd.read_csv('data/kaggle_kor.csv')
ko

Unnamed: 0.1,Unnamed: 0,Time from Start to Finish (seconds),Q1,Q2,Q3,Q4,Q5,Q6,Q7_Part_1,Q7_Part_2,...,Q35_B_Part_2,Q35_B_Part_3,Q35_B_Part_4,Q35_B_Part_5,Q35_B_Part_6,Q35_B_Part_7,Q35_B_Part_8,Q35_B_Part_9,Q35_B_Part_10,Q35_B_OTHER
0,5897,201039,25-29,Man,Republic of Korea,Doctoral degree,Research Scientist,3-5 years,Python,,...,,,,TensorBoard,,,,,,
1,7902,236,45-49,Man,Republic of Korea,Doctoral degree,Product/Project Manager,< 1 years,,,...,,,,,,,,,,
2,9303,3560,35-39,Woman,Republic of Korea,Doctoral degree,Data Scientist,10-20 years,Python,R,...,,,,,,,,,,
3,9826,1565,60-69,Man,Republic of Korea,Doctoral degree,Data Scientist,3-5 years,Python,R,...,,,,,,,,,,
4,11224,510,35-39,Man,Republic of Korea,Doctoral degree,Research Scientist,< 1 years,Python,,...,,,,,,,,,,
5,11958,161,40-44,Man,Republic of Korea,Doctoral degree,Machine Learning Engineer,,,,...,,,,,,,,,,
6,12191,707,35-39,Man,Republic of Korea,Doctoral degree,Research Scientist,10-20 years,Python,,...,,,,,,,,,,
7,12999,458,25-29,Man,Republic of Korea,Doctoral degree,Student,10-20 years,Python,,...,,,,,,,,,,
8,16643,1045,55-59,Man,Republic of Korea,Doctoral degree,Product/Project Manager,< 1 years,Python,R,...,,,,TensorBoard,,,,,,
9,19867,708,25-29,Man,Republic of Korea,Doctoral degree,Data Scientist,3-5 years,Python,,...,,,,,,,,,,
