# Pandas

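### A Python library for processing structured data => Excel for Python

- A Python library that supports the processing of structured (tabular) data
- Integrates with NumPy, the high-performance array computation library, to provide powerful "spreadsheet"-style processing
- Provides indexing, computation functions, preprocessing functions, and more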

# 1. Pandas overview

### data loading

In [1]:
import pandas as pd

In [3]:
# Data URL
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
# Load the delimited text file; the separator is one or more whitespace characters and the file has no header row
df_data = pd.read_csv(data_url, sep=r'\s+', header=None)

In [4]:
# Show the first five rows
df_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2
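
The same whitespace-separated parsing can be tried without a network connection. The sketch below is only an illustration: `sample_text` is a hypothetical two-row excerpt, and the column names `CRIM`, `ZN`, `INDUS` are assumed from the dataset description rather than taken from this notebook.

```python
import io

import pandas as pd

# Hypothetical excerpt: the first three fields of the first two rows shown above
sample_text = "0.00632 18.0 2.31\n0.02731 0.0 7.07\n"

# sep=r"\s+" splits on any run of whitespace; header=None means "no header row"
df_sample = pd.read_csv(io.StringIO(sample_text), sep=r"\s+", header=None)

# Without a header the columns are numbered 0, 1, 2, ...; names can be assigned afterwards
df_sample.columns = ["CRIM", "ZN", "INDUS"]  # assumed names, for illustration only
print(df_sample)
```

Writing the pattern as a raw string (`r"\s+"`) keeps Python from interpreting `\s` as a string escape.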


![1](nb_images/pd1.png)

![2](nb_images/pd2.png)

![3](nb_images/pd3.png)

# 2. Series

## <span class="mark">Series = Numpy + Index</span>
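### A Series supports the vectorized operations that a NumPy array supports.
### The operations below are all available in Pandas, but <span class="mark">in practice Pandas is mostly used to load a whole file, such as a CSV, in one call</span>; building and computing on a single column or row by hand, as shown below, is uncommon.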

In [12]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

In [2]:
example_obj = Series(dtype="float64")  # empty Series; an explicit dtype avoids version-dependent defaults

![4](nb_images/pd4.png)

In [4]:
list_data = [1,2,3,4,5]
example_obj = Series(data = list_data)
print(example_obj)

0    1
1    2
2    3
3    4
4    5
dtype: int64


![5](nb_images/pd5.png)

![6](nb_images/pd6.png)

In [5]:
list_data = [1,2,3,4,5]
list_name = ["a", "b", "c", "d", "e"]
example_obj = Series(data = list_data, index=list_name)
print(example_obj)

a    1
b    2
c    3
d    4
e    5
dtype: int64


![7](nb_images/pd7.png)

In [13]:
dict_data = {"a":1, "b":2, "c":3, "d":4, "e":5}
example_obj = Series(data = dict_data, dtype=np.float32, name = "example_data")
print(example_obj)

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
Name: example_data, dtype: float32


![8](nb_images/pd8.png)

In [14]:
example_obj["a"]

1.0

In [15]:
example_obj["a"] = 3.2
print(example_obj)

a    3.2
b    2.0
c    3.0
d    4.0
e    5.0
Name: example_data, dtype: float32


![9](nb_images/pd9.png)

In [16]:
print(example_obj.values)
print(type(example_obj.values))

[3.2 2.  3.  4.  5. ]
<class 'numpy.ndarray'>


In [17]:
print(example_obj.index)
print(type(example_obj.index))

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
<class 'pandas.indexes.base.Index'>


In [18]:
example_obj.name = "number"
example_obj.index.name = "alphabet"
print(example_obj)

alphabet
a    3.2
b    2.0
c    3.0
d    4.0
e    5.0
Name: number, dtype: float32


![10](nb_images/pd10.png)

In [23]:
dict_data_1 = {"a":1, "b":2, "c":3, "d":4, "e":5}
indexes = ["a", "b", "c", "d", "e", "f", "g", "h"]
series_obj_1 = Series(data = dict_data_1, index = indexes)
print(series_obj_1)

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
f    NaN
g    NaN
h    NaN
dtype: float64


In [25]:
# boolean indexing: keep only the elements greater than 2
example_obj[example_obj > 2]

alphabet
a    3.2
c    3.0
d    4.0
e    5.0
Name: number, dtype: float32

In [26]:
example_obj * 2

alphabet
a     6.4
b     4.0
c     6.0
d     8.0
e    10.0
Name: number, dtype: float32

### Series operations

In [82]:
s1 = Series(range(1, 6), index = list("abcde"))
s1

a    1
b    2
c    3
d    4
e    5
dtype: int32

In [84]:
s2 = Series(range(5, 11), index = list("bcedef"))
s2

b     5
c     6
e     7
d     8
e     9
f    10
dtype: int32

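Adding two Series with `+` adds the values whose index labels match; a label that exists on only one side yields NaN. Because `s2` contains the label `e` twice, the result also has two `e` rows.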

In [85]:
s1 + s2

a     NaN
b     7.0
c     9.0
d    12.0
e    12.0
e    14.0
f     NaN
dtype: float64

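The `add` method does the same work as `+` and also accepts extra options, such as `fill_value`, which stands in for a value that is missing on one side.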

In [86]:
s1.add(s2)

a     NaN
b     7.0
c     9.0
d    12.0
e    12.0
e    14.0
f     NaN
dtype: float64

In [87]:
s1.add(s2, fill_value= 0)

a     1.0
b     7.0
c     9.0
d    12.0
e    12.0
e    14.0
f    10.0
dtype: float64
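
The other arithmetic methods follow the same alignment rule. A minimal sketch, using two small hypothetical Series (`a` and `b`) with unique labels so the result is easy to read:

```python
from pandas import Series

a = Series([1, 2, 3], index=list("abc"))
b = Series([10, 20, 30], index=list("bcd"))

plain = a + b                    # 'a' and 'd' exist on one side only, so they become NaN
filled = a.add(b, fill_value=0)  # the missing side is treated as 0 before adding
diff = a.sub(b, fill_value=0)    # sub (and mul, div) accept fill_value in the same way

print(plain)
print(filled)
print(diff)
```

Note that `fill_value` only fills a label missing from one operand; it is not a general NaN replacement applied to the result.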

# 3. DataFrame

![11](nb_images/pd11.jpeg)

As described above, a Pandas Series is a NumPy array plus an index. <br>
A NumPy array has exactly one data type. For example, array a may have dtype = np.float32 and array b may have dtype = str, but a single array carries only one dtype, so numbers and strings cannot be stored side by side with their own native types (they would have to be coerced to a common dtype such as object).

Because Series = NumPy array + index, a single Series also holds a single dtype. A DataFrame is a collection of Series, so it can be made up of Series (columns) that each have a different dtype.
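
A quick way to see this is to inspect `dtypes` on a small frame. The sketch below uses a hypothetical `people` frame; the `height` values are made up for illustration.

```python
from pandas import DataFrame

# Three columns, each stored as its own Series with its own dtype
people = DataFrame({
    "name": ["Jason", "Molly"],
    "age": [42, 52],
    "height": [1.80, 1.65],  # illustrative values only
})

print(people.dtypes)         # one dtype per column
print(people["age"].dtype)   # a single column is a Series with a single dtype
```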

### The following shows how to read Series data column by column.

In [2]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

In [7]:
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
        'age': [42, 52, 36, 24, 73],
        'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']}

df = DataFrame(data = raw_data)
df

Unnamed: 0,age,city,first_name,last_name
0,42,San Francisco,Jason,Miller
1,52,Baltimore,Molly,Jacobson
2,36,Miami,Tina,Ali
3,24,Douglas,Jake,Milner
4,73,Boston,Amy,Cooze


Passing only some of the column names through `columns` builds a DataFrame that contains just those columns.

In [8]:
DataFrame(data = raw_data, columns=['first_name', 'age'])

Unnamed: 0,first_name,age
0,Jason,42
1,Molly,52
2,Tina,36
3,Jake,24
4,Amy,73


If you request a column that does not exist in the data, the column is still created and filled with NaN.

In [9]:
DataFrame(data = raw_data, columns = ['first_name', 'last_name', 'age', 'salary'])

Unnamed: 0,first_name,last_name,age,salary
0,Jason,Miller,42,
1,Molly,Jacobson,52,
2,Tina,Ali,36,
3,Jake,Milner,24,
4,Amy,Cooze,73,


In [11]:
df.first_name

0    Jason
1    Molly
2     Tina
3     Jake
4      Amy
Name: first_name, dtype: object

In [12]:
df["first_name"]

0    Jason
1    Molly
2     Tina
3     Jake
4      Amy
Name: first_name, dtype: object

### Extracting data by row

![12](nb_images/pd12.jpeg)

In [13]:
df.loc[1]

age                  52
city          Baltimore
first_name        Molly
last_name      Jacobson
Name: 1, dtype: object

In [14]:
df.iloc[1]

age                  52
city          Baltimore
first_name        Molly
last_name      Jacobson
Name: 1, dtype: object

In [15]:
df["age"][1:]

1    52
2    36
3    24
4    73
Name: age, dtype: int64

## loc vs. iloc
- loc : selects by index label
- iloc : selects by integer position

In [17]:
s = Series(data = np.nan, index = [10, 11, 12, 13, 14, 1, 2, 3, 4, 5])
s.loc[3:]

3   NaN
4   NaN
5   NaN
dtype: float64

In [18]:
s.iloc[3:]

13   NaN
14   NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
dtype: float64
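
One more difference is worth a small experiment: a `loc` slice includes both end labels, while an `iloc` slice excludes the end position, as ordinary Python slicing does. A minimal sketch with a hypothetical Series `letters` whose integer labels do not match the positions:

```python
from pandas import Series

letters = Series(list("abcdef"), index=[10, 11, 12, 0, 1, 2])

by_label = letters.loc[1]         # the element whose label is 1
by_position = letters.iloc[1]     # the element at position 1 (the second one)

label_slice = letters.loc[12:1]   # from label 12 through label 1, both included
pos_slice = letters.iloc[2:4]     # positions 2 and 3; position 4 is excluded

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```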

### How to get rows / columns from a DataFrame

- Rows : df.iloc / df.loc
- Columns : df["column name"], df.column_name

### Assigning new values to a column

In [32]:
df["debt"] = np.nan
df

Unnamed: 0,age,city,first_name,last_name,debt
0,42,San Francisco,Jason,Miller,
1,52,Baltimore,Molly,Jacobson,
2,36,Miami,Tina,Ali,
3,24,Douglas,Jake,Milner,
4,73,Boston,Amy,Cooze,


A column can be filled with the result of a NumPy-style Boolean operation. This also works because a Series is built on a NumPy array.

In [34]:
df.debt = df.age > 40
df

Unnamed: 0,age,city,first_name,last_name,debt
0,42,San Francisco,Jason,Miller,True
1,52,Baltimore,Molly,Jacobson,True
2,36,Miami,Tina,Ali,False
3,24,Douglas,Jake,Milner,False
4,73,Boston,Amy,Cooze,True
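
When the column does not exist yet, bracket assignment is the dependable form, because attribute assignment such as `df.new_name = ...` sets a plain Python attribute instead of creating a column. A minimal sketch, where the frame `people` and the column names `over_40` and `age_group` are hypothetical:

```python
import numpy as np
from pandas import DataFrame

people = DataFrame({"first_name": ["Jason", "Molly", "Tina"],
                    "age": [42, 52, 36]})

# Bracket assignment creates the column from a Boolean Series
people["over_40"] = people["age"] > 40

# np.where turns the same condition into labels instead of True/False
people["age_group"] = np.where(people["age"] > 40, "senior", "junior")

print(people)
```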


In [37]:
# Transpose rows and columns
df.T

Unnamed: 0,0,1,2,3,4
age,42,52,36,24,73
city,San Francisco,Baltimore,Miami,Douglas,Boston
first_name,Jason,Molly,Tina,Jake,Amy
last_name,Miller,Jacobson,Ali,Milner,Cooze
debt,True,True,False,False,True


In [38]:
# DataFrame → NumPy array
df.values

array([[42, 'San Francisco', 'Jason', 'Miller', True],
       [52, 'Baltimore', 'Molly', 'Jacobson', True],
       [36, 'Miami', 'Tina', 'Ali', False],
       [24, 'Douglas', 'Jake', 'Milner', False],
       [73, 'Boston', 'Amy', 'Cooze', True]], dtype=object)

### Deleting a column

In [40]:
del df["debt"]
df

Unnamed: 0,age,city,first_name,last_name
0,42,San Francisco,Jason,Miller
1,52,Baltimore,Molly,Jacobson
2,36,Miami,Tina,Ali
3,24,Douglas,Jake,Milner
4,73,Boston,Amy,Cooze


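### With a nested dict, both rows and columns get index names: the outer keys become column names and the inner keys become row labels.
In practice, of course, you will rarely type data into Pandas by hand like this.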

In [41]:
data = {'Nevada': {2001: 2.4, 2002: 2.9},
 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

DataFrame(data)

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


### Adding a Series as a column

In [42]:
values = Series(data = ["M", "F", "M"], index = [0, 1, 3])
df["sex"] = values
df

Unnamed: 0,age,city,first_name,last_name,sex
0,42,San Francisco,Jason,Miller,M
1,52,Baltimore,Molly,Jacobson,F
2,36,Miami,Tina,Ali,
3,24,Douglas,Jake,Milner,M
4,73,Boston,Amy,Cooze,


### DataFrame operations

In [88]:
df1 = DataFrame(data = np.arange(9).reshape(3,3), index = list("abc"))
df1

Unnamed: 0,0,1,2
a,0,1,2
b,3,4,5
c,6,7,8


In [89]:
df2 = DataFrame(data = np.arange(16).reshape(4,4), index = list("abcd"))
df2

Unnamed: 0,0,1,2,3
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


In [90]:
df1 + df2

Unnamed: 0,0,1,2,3
a,0.0,2.0,4.0,
b,7.0,9.0,11.0,
c,14.0,16.0,18.0,
d,,,,


In [91]:
df1.add(df2, fill_value=0)

Unnamed: 0,0,1,2,3
a,0.0,2.0,4.0,3.0
b,7.0,9.0,11.0,7.0
c,14.0,16.0,18.0,11.0
d,12.0,13.0,14.0,15.0


![13](nb_images/pd13.jpeg)

In [99]:
df = DataFrame(np.arange(16).reshape(4,4), columns = list("abcd"))
df

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


In [100]:
s = Series(np.arange(10, 14), index = list("abcd"))
s

a    10
b    11
c    12
d    13
dtype: int32

In [101]:
df.add(s)

Unnamed: 0,a,b,c,d
0,10,12,14,16
1,14,16,18,20
2,18,20,22,24
3,22,24,26,28


In [102]:
df = DataFrame(np.arange(16).reshape(4,4), columns = list("abcd"))
df

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


In [103]:
s2 = Series(np.arange(10, 14))
s2

0    10
1    11
2    12
3    13
dtype: int32

In [104]:
df + s2

Unnamed: 0,a,b,c,d,0,1,2,3
0,,,,,,,,
1,,,,,,,,
2,,,,,,,,
3,,,,,,,,


In [110]:
df.add(s2, axis = 0)

Unnamed: 0,a,b,c,d
0,10,11,12,13
1,15,16,17,18
2,20,21,22,23
3,25,26,27,28
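
The two matching directions can be compared side by side on a smaller frame. In this sketch the names `table`, `per_column`, and `per_row` are hypothetical:

```python
import numpy as np
from pandas import DataFrame, Series

table = DataFrame(np.arange(6).reshape(2, 3), columns=list("abc"))
per_column = Series([10, 20, 30], index=list("abc"))
per_row = Series([100, 200])  # default index 0, 1 matches the row labels

# Default: the Series index is matched against the column labels
by_columns = table.add(per_column)

# axis=0: the Series index is matched against the row labels instead
by_rows = table.add(per_row, axis=0)

print(by_columns)
print(by_rows)
```

With the default axis, a Series indexed `0, 1` would find no matching column labels, which is why the plain `+` above produced a frame full of NaN.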


# 4. Selection 

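To load Excel data: <br>
Install the xlrd module by running conda install -y xlrd in cmd. <br>
In a Jupyter Notebook, !conda install -y xlrd installs it directly from a cell. Recent pandas versions read .xlsx files through openpyxl instead, so install that package if read_excel reports a missing engine.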

In [56]:
import pandas as pd
import numpy as np

df = pd.read_excel("data/excel-comp-data.xlsx")
df.head()

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar
0,211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
1,320563,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000
2,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
3,109996,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000
4,121213,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000


Data can be selected by index number or by column name.
When an Excel file is read directly, the first column displayed is the row index. It is not a value from the actual data; the row numbers are generated automatically.

In [57]:
df["account"].head(2)

0    211829
1    320563
Name: account, dtype: int64

In [58]:
df[["account", "street", "state"]].head(3)

Unnamed: 0,account,street,state
0,211829,34456 Sean Highway,Texas
1,320563,1311 Alvis Tunnel,NorthCarolina
2,648336,62184 Schamberger Underpass Apt. 231,Iowa


Slicing with index numbers, without a column name, selects by <span class="mark">row</span>.

In [59]:
df[:3]

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar
0,211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
1,320563,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000
2,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000


Combining a column name with a row slice returns only that column.
Think of it as fetching the Series first and then taking the required rows from that Series.

In [60]:
df["account"][:3]

0    211829
1    320563
2    648336
Name: account, dtype: int64

Next, select the required columns with a list first, and then take the required rows.

In [61]:
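# Current pandas treats an integer list inside df[...] as column labels, so select the columns by position with iloc
df.iloc[:, [0, 1, 2]][2:5]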

Unnamed: 0,account,name,street
2,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231
3,109996,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144
4,121213,Bauch-Goldner,7274 Marissa Common


In [62]:
account_series = df["account"]
account_series[account_series < 250000]

0     211829
3     109996
4     121213
5     132971
6     145068
7     205217
8     209744
9     212303
10    214098
11    231907
12    242368
Name: account, dtype: int64
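
The same Boolean mask can filter whole rows of a DataFrame, and several conditions can be combined with `&` and `|`. A minimal sketch, using a hypothetical four-row frame `accounts` built from the first rows shown above:

```python
from pandas import DataFrame

accounts = DataFrame({"account": [211829, 320563, 648336, 109996],
                      "Jan": [10000, 95000, 91000, 45000]})

# One condition: keep the rows whose account number is below 250000
small = accounts[accounts["account"] < 250000]

# Two conditions: each one needs its own parentheses, joined with & (and) or | (or)
both = accounts[(accounts["account"] < 250000) & (accounts["Jan"] > 20000)]

print(small)
print(both)
```

The parentheses matter because `&` binds more tightly than the comparison operators.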

Assigning to df.index changes the index values.

In [63]:
df.index = range(5, 20)
df

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar
5,211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
6,320563,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000
7,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
8,109996,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000
9,121213,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000
10,132971,"Williamson, Schumm and Hettinger",89403 Casimer Spring,Jeremieburgh,Arkansas,62785,150000,120000,35000
11,145068,Casper LLC,340 Consuela Bridge Apt. 400,Lake Gabriellaton,Mississipi,18008,62000,120000,70000
12,205217,Kovacek-Johnston,91971 Cronin Vista Suite 601,Deronville,RhodeIsland,53461,145000,95000,35000
13,209744,Champlin-Morar,26739 Grant Lock,Lake Juliannton,Pennsylvania,64415,70000,95000,35000
14,212303,Gerhold-Maggio,366 Maggio Grove Apt. 998,North Ras,Idaho,46308,70000,120000,35000


In [64]:
# Restore the original index
df.index = range(0, 15)
df.head(15)

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar
0,211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
1,320563,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000
2,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
3,109996,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000
4,121213,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000
5,132971,"Williamson, Schumm and Hettinger",89403 Casimer Spring,Jeremieburgh,Arkansas,62785,150000,120000,35000
6,145068,Casper LLC,340 Consuela Bridge Apt. 400,Lake Gabriellaton,Mississipi,18008,62000,120000,70000
7,205217,Kovacek-Johnston,91971 Cronin Vista Suite 601,Deronville,RhodeIsland,53461,145000,95000,35000
8,209744,Champlin-Morar,26739 Grant Lock,Lake Juliannton,Pennsylvania,64415,70000,95000,35000
9,212303,Gerhold-Maggio,366 Maggio Grove Apt. 998,North Ras,Idaho,46308,70000,120000,35000
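
Restoring a default `0..n-1` index does not require knowing the row count. One alternative, sketched here on a hypothetical three-row frame `frame`, is `reset_index(drop=True)`:

```python
from pandas import DataFrame

frame = DataFrame({"account": [211829, 320563, 648336]})
frame.index = range(5, 8)  # shifted index, as in the example above

# drop=True discards the old index instead of keeping it as a new column
restored = frame.reset_index(drop=True)

print(list(frame.index))
print(list(restored.index))
```

`reset_index` returns a new frame, so `frame` itself keeps the shifted index unless the result is assigned back.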


# 5. Drop

df.drop() removes rows or columns.
Drop a single row by its row number (its index label):

In [70]:
df.drop(1)

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar
0,211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
2,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
3,109996,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000
4,121213,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000
5,132971,"Williamson, Schumm and Hettinger",89403 Casimer Spring,Jeremieburgh,Arkansas,62785,150000,120000,35000
6,145068,Casper LLC,340 Consuela Bridge Apt. 400,Lake Gabriellaton,Mississipi,18008,62000,120000,70000
7,205217,Kovacek-Johnston,91971 Cronin Vista Suite 601,Deronville,RhodeIsland,53461,145000,95000,35000
8,209744,Champlin-Morar,26739 Grant Lock,Lake Juliannton,Pennsylvania,64415,70000,95000,35000
9,212303,Gerhold-Maggio,366 Maggio Grove Apt. 998,North Ras,Idaho,46308,70000,120000,35000
10,214098,"Goodwin, Homenick and Jerde",649 Cierra Forks Apt. 078,Rosaberg,Tenessee,47743,45000,120000,55000


Drop several rows with a list of row numbers:

In [66]:
df.drop([1, 3])

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar
0,211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
2,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
4,121213,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000
5,132971,"Williamson, Schumm and Hettinger",89403 Casimer Spring,Jeremieburgh,Arkansas,62785,150000,120000,35000
6,145068,Casper LLC,340 Consuela Bridge Apt. 400,Lake Gabriellaton,Mississipi,18008,62000,120000,70000
7,205217,Kovacek-Johnston,91971 Cronin Vista Suite 601,Deronville,RhodeIsland,53461,145000,95000,35000
8,209744,Champlin-Morar,26739 Grant Lock,Lake Juliannton,Pennsylvania,64415,70000,95000,35000
9,212303,Gerhold-Maggio,366 Maggio Grove Apt. 998,North Ras,Idaho,46308,70000,120000,35000
10,214098,"Goodwin, Homenick and Jerde",649 Cierra Forks Apt. 078,Rosaberg,Tenessee,47743,45000,120000,55000
11,231907,Hahn-Moore,18115 Olivine Throughway,Norbertomouth,NorthDakota,31415,150000,10000,162000


To remove columns instead, pass the column name together with axis = 1. Having to write axis = 1 is not very intuitive, though...

In [72]:
df.drop("account", axis = 1)

Unnamed: 0,name,street,city,state,postal-code,Jan,Feb,Mar
0,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
1,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000
2,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
3,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000
4,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000
5,"Williamson, Schumm and Hettinger",89403 Casimer Spring,Jeremieburgh,Arkansas,62785,150000,120000,35000
6,Casper LLC,340 Consuela Bridge Apt. 400,Lake Gabriellaton,Mississipi,18008,62000,120000,70000
7,Kovacek-Johnston,91971 Cronin Vista Suite 601,Deronville,RhodeIsland,53461,145000,95000,35000
8,Champlin-Morar,26739 Grant Lock,Lake Juliannton,Pennsylvania,64415,70000,95000,35000
9,Gerhold-Maggio,366 Maggio Grove Apt. 998,North Ras,Idaho,46308,70000,120000,35000


In [73]:
df.drop(["account", "name"], axis = 1)

Unnamed: 0,street,city,state,postal-code,Jan,Feb,Mar
0,34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
1,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000
2,62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
3,155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000
4,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000
5,89403 Casimer Spring,Jeremieburgh,Arkansas,62785,150000,120000,35000
6,340 Consuela Bridge Apt. 400,Lake Gabriellaton,Mississipi,18008,62000,120000,70000
7,91971 Cronin Vista Suite 601,Deronville,RhodeIsland,53461,145000,95000,35000
8,26739 Grant Lock,Lake Juliannton,Pennsylvania,64415,70000,95000,35000
9,366 Maggio Grove Apt. 998,North Ras,Idaho,46308,70000,120000,35000
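
Two details are easy to miss here: `drop` returns a new DataFrame and leaves the original untouched, and the `columns=` keyword is an alternative to `axis = 1` that some find easier to read. A minimal sketch on a hypothetical frame `frame`:

```python
from pandas import DataFrame

frame = DataFrame({"account": [1, 2, 3],
                   "name": ["x", "y", "z"],
                   "Jan": [10, 20, 30]})

without_row = frame.drop(1)                             # drops the row labelled 1
without_cols = frame.drop(columns=["account", "name"])  # same effect as axis=1

print(frame.shape)         # the original is unchanged
print(without_row)
print(without_cols)
```

To keep the change, assign the result back, for example `frame = frame.drop(1)`.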


# 6. lambda, map, apply

## lambda functions
A lambda is an anonymous function written as a single expression

    lambda argument : expression
    
ex) lambda x, y : x + y

In [1]:
def f1(x, y):
    return x + y

f2 = lambda x,y : x + y

print(f1(1,2))
print(f2(1,2))

3
3


A lambda also works without being bound to a name

In [2]:
(lambda x:x+1)(5)

6
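
An unnamed lambda is most often seen as an argument to another function. As a small illustration outside Pandas, the sketch below passes one as the `key` of the built-in `sorted`; the list `names` is a hypothetical example.

```python
names = ["Molly", "Amy", "Jason"]

# The lambda is never given a name; sorted calls it once per element
by_length = sorted(names, key=lambda name: len(name))

print(by_length)  # shortest first; ties keep their original order
```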

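## The map function

- Takes a function and sequence-type data as arguments
- Applies the function to each element; in Python 3 the result is a lazy iterator, so wrap it in list() to get a list
- The function is usually written as a lambda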

    map(function, sequence)
    

In [3]:
ex = [1,2,3,4,5]
f = lambda x:x**2
list(map(f, ex))

[1, 4, 9, 16, 25]

When the function takes two or more arguments, pass one sequence per argument

In [25]:
f = lambda x, y:x + y
list(map(f, ex, ex))

[2, 4, 6, 8, 10]
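
Because the built-in `map` returns an iterator in Python 3, its result can be consumed only once. A short sketch of that behavior, with hypothetical variable names:

```python
squares = map(lambda x: x ** 2, [1, 2, 3])

first_pass = list(squares)   # consumes the iterator
second_pass = list(squares)  # nothing is left the second time

print(first_pass)
print(second_pass)
```

Storing `list(map(...))` in a variable avoids the surprise when the values are needed more than once.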

## map for Series

- The map method also works on Pandas Series data
- A dict or another Series can be passed in place of a function

In [26]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

s1 = Series(np.arange(10))
s1.head(5)

0    0
1    1
2    2
3    3
4    4
dtype: int32

In [27]:
s1.map(lambda x:x**2).head(5)

0     0
1     1
2     4
3     9
4    16
dtype: int64

Replace values using a dict; values with no matching key become NaN

In [28]:
z = {1:'A', 2:'B', 3:'C'}
s1.map(z).head(5)

0    NaN
1      A
2      B
3      C
4    NaN
dtype: object

Pass another Series to replace each value with the s2 entry whose index label equals that value

In [29]:
s2 = Series(np.arange(10, 20))
s1.map(s2).head(5)

0    10
1    11
2    12
3    13
4    14
dtype: int32
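
A common use of a dict with `map` is turning category strings into numeric codes. The sketch below is illustrative only: the Series `sex` and the dict `codes` are hypothetical, and `"unknown"` is included to show what happens to a value with no key.

```python
from pandas import Series

sex = Series(["male", "female", "unknown"])
codes = {"male": 0, "female": 1}  # hypothetical encoding

coded = sex.map(codes)  # values missing from the dict become NaN

print(coded)
```

Since NaN is a float, the result is stored as floating point even though the dict values are integers.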

In [30]:
df = pd.read_excel("data/wages.xlsx")
df.head()

Unnamed: 0,earn,height,sex,race,ed,age
0,79571.299011,73.89,male,white,16,49
1,96396.988643,66.23,female,white,16,62
2,48710.666947,63.77,female,white,16,33
3,80478.096153,63.22,female,other,16,95
4,82089.345498,63.08,female,white,17,43
