# 1. Pandas

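- A Python library for working with structured data. Think of it as Excel for Python!

## What is pandas?

- A Python library for working with structured data
- Integrates with NumPy, a high-performance array computation library, to provide powerful spreadsheet-style processing
- Provides indexing, computation functions, preprocessing functions, and more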

In [1]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

These three imports are used throughout, so always keep them at the top.

## 데이터 로딩

In [2]:
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data' # Data URL

The data for the Boston housing problem looks like this.

<img src="../../img/Screen Shot 2019-03-17 at 5.06.15 PM.png" width="700">

In [3]:
df_data = pd.read_csv(data_url, sep=r'\s+', header=None) # load the data; fields are separated by whitespace and there is no header row

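In '\s+', '\s' is the regular-expression class for a whitespace character (space, tab, and so on) and '+' means "one or more". So '\s+' can be read as "split the line on every run of whitespace and take the pieces in between."

The header parameter tells read_csv which row holds the column names. If the file has no such row, pass None.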
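
As a minimal sketch of these two parameters, the snippet below parses a small in-memory string instead of the real URL. The sample text and the names raw_text and sample are made up for illustration; only read_csv, sep, and header come from the notes above.

```python
import io

import pandas as pd

# Hypothetical sample: two rows separated by uneven runs of spaces, no header row.
raw_text = "0.00632  18.0   2.31\n0.02731   0.0   7.07\n"

# sep=r"\s+" splits on any run of whitespace; header=None keeps row 0 as data.
sample = pd.read_csv(io.StringIO(raw_text), sep=r"\s+", header=None)

print(sample.shape)          # number of rows and columns parsed
print(list(sample.columns))  # pandas assigns integer column labels
```

Because no header row is consumed, both lines end up as data and the columns are simply numbered from 0.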

In [4]:
df_data.head() # show the first five rows

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


Calling head shows the first five rows.

The data came without column names, so let's set them ourselves.

In [5]:
df_data.columns = [
    'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# assign the column header names

In [6]:
df_data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


---

# 2. Series

## Structure of pandas

<img src="../../img/Screen Shot 2019-03-17 at 5.24.59 PM.png" width="700">

Pandas is built around two core objects: Series and DataFrame.

Put simply, a Series is the pandas counterpart of a single vector in NumPy.

## Typical pandas usage

<img src="../../img/Screen Shot 2019-03-17 at 5.28.58 PM.png" width="800">

## Series

- An object that represents a column vector

~~~python
example_obj = Series()
~~~

<img src="../../img/Screen Shot 2019-03-17 at 5.34.43 PM.png" width="700">

In [7]:
list_data = [1,2,3,4,5]

In [8]:
example_obj = Series(data = list_data) # data can also be a dict

In [9]:
print(example_obj)

0    1
1    2
2    3
3    4
4    5
dtype: int64


- index & values(data)

<img src="../../img/Screen Shot 2019-03-17 at 5.38.19 PM.png" width="500">

Index labels may be duplicated; this also applies to the DataFrame covered later.

In [10]:
list_name = ["a","b","c","d","e"]
list_data = [1,2,3,4,5]

In [11]:
example_obj = Series(data = list_data, index=list_name) # set the index labels

In [12]:
print(example_obj)

a    1
b    2
c    3
d    4
e    5
dtype: int64


The index and the values can also be accessed directly.

In [13]:
print(example_obj.index)

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')


In [14]:
print(example_obj.values)

[1 2 3 4 5]


In [15]:
print(type(example_obj.values))

<class 'numpy.ndarray'>


A dictionary works as input too.

In [16]:
dict_data = {"a":1, "b":2, "c":3, "d":4, "e":5}

In [17]:
example_obj = Series(dict_data, dtype=np.float32, name="example_data")
# name sets the Series name; think of it as the name of a single column.

In [18]:
print(example_obj)

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
Name: example_data, dtype: float32


- Indexing a Series

In [19]:
print(example_obj["a"])

1.0


You can also assign to an element directly to change its value.

In [20]:
example_obj["a"] = 3.2

In [21]:
print(example_obj)

a    3.2
b    2.0
c    3.0
d    4.0
e    5.0
Name: example_data, dtype: float32


Boolean filtering and element-wise arithmetic work as well.

In [22]:
print(example_obj[example_obj > 2])

a    3.2
c    3.0
d    4.0
e    5.0
Name: example_data, dtype: float32


In [23]:
print(example_obj * 2)

a     6.4
b     4.0
c     6.0
d     8.0
e    10.0
Name: example_data, dtype: float32


- Storing information about the data

In [24]:
example_obj.name = "number" # rename the Series
example_obj.index.name = "alphabet" # rename the index

In [25]:
print(example_obj)

alphabet
a    3.2
b    2.0
c    3.0
d    4.0
e    5.0
Name: number, dtype: float32


- Creating a Series aligned to a given list of index values

In [26]:
dict_data_1 = {"a":1, "b":2, "c":3, "d":4, "e":5}
indexes = ["a","b","c","d","e","f","g","h"]

In [27]:
series_obj_1 = Series(dict_data_1, index=indexes)

In [28]:
print(series_obj_1)

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
f    NaN
g    NaN
h    NaN
dtype: float64
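
A small sketch of the same alignment rule, using illustrative names (prices, subset) that are not part of the original notes: labels requested in index but missing from the dict become NaN, dict keys that are not requested are dropped, and the dtype is promoted to float64 because NaN is a float.

```python
import pandas as pd

prices = {"a": 1, "b": 2, "c": 3}

# "a" is not requested, so it is dropped; "z" has no value, so it becomes NaN.
subset = pd.Series(prices, index=["b", "c", "z"])

print(subset)
```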


---

# 3. Dataframe

- A data table built from a collection of Series; two-dimensional by default

<img src="../../img/Screen Shot 2019-03-17 at 6.46.23 PM.png" width="600">

In a DataFrame, data can be looked up by both index and columns.

~~~python
DataFrame()
~~~

<img src="../../img/Screen Shot 2019-03-17 at 7.52.22 PM.png" width="650">

In [29]:
# Example from - https://chrisalbon.com/python/pandas_map_values_to_values.html
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']}

In [30]:
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'city'])

In [31]:
print(df)

  first_name last_name  age           city
0      Jason    Miller   42  San Francisco
1      Molly  Jacobson   52      Baltimore
2       Tina       Ali   36          Miami
3       Jake    Milner   24        Douglas
4        Amy     Cooze   73         Boston


- Selecting columns

In [32]:
df = DataFrame(raw_data, columns = ["age", "city"])

In [33]:
print(df)

   age           city
0   42  San Francisco
1   52      Baltimore
2   36          Miami
3   24        Douglas
4   73         Boston


- Adding a new column

In [34]:
df = DataFrame(raw_data, columns = ["first_name","last_name", "age", "city", "debt"])

In [35]:
print(df)

  first_name last_name  age           city debt
0      Jason    Miller   42  San Francisco  NaN
1      Molly  Jacobson   52      Baltimore  NaN
2       Tina       Ali   36          Miami  NaN
3       Jake    Milner   24        Douglas  NaN
4        Amy     Cooze   73         Boston  NaN


- Selecting a column returns a Series

In [36]:
print(df.first_name)

0    Jason
1    Molly
2     Tina
3     Jake
4      Amy
Name: first_name, dtype: object


In [37]:
print(df["first_name"])

0    Jason
1    Molly
2     Tina
3     Jake
4      Amy
Name: first_name, dtype: object


- loc & iloc
    - loc (index location): selects by index label
    - iloc (index position): selects by integer position

In these notes, loc is used less often than iloc.

In [38]:
print(df.loc[2:]) # row values

  first_name last_name  age     city debt
2       Tina       Ali   36    Miami  NaN
3       Jake    Milner   24  Douglas  NaN
4        Amy     Cooze   73   Boston  NaN


In [39]:
print(df["first_name"].iloc[2:]) # column values

2    Tina
3    Jake
4     Amy
Name: first_name, dtype: object


Let's compare the two with another example.

In [40]:
# Example from - https://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation
s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])

In [41]:
print(s.loc[:3]) # everything up to and including the label 3

49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN
dtype: float64


As shown above, loc returns every row up to and including the one whose index label is 3.

In [42]:
print(s.iloc[:3])

49   NaN
48   NaN
47   NaN
dtype: float64


iloc, by contrast, returns positions 0, 1, and 2, that is, everything before position 3.
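
The difference in slice endpoints can be sketched with string labels, where labels and positions cannot be confused. The Series s_demo and the other names here are illustrative only.

```python
import pandas as pd

s_demo = pd.Series([10, 20, 30, 40], index=["w", "x", "y", "z"])

# Label-based slice: the end label "y" is included.
by_label = s_demo.loc[:"y"]

# Position-based slice: position 2 is excluded, as with Python lists.
by_position = s_demo.iloc[:2]

print(by_label)
print(by_position)
```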

- Assigning new data to a column

In [43]:
df["debt"] = df["age"] > 40 # bracket assignment also works when the column does not exist yet

In [44]:
print(df)

  first_name last_name  age           city   debt
0      Jason    Miller   42  San Francisco   True
1      Molly  Jacobson   52      Baltimore   True
2       Tina       Ali   36          Miami  False
3       Jake    Milner   24        Douglas  False
4        Amy     Cooze   73         Boston   True


- transpose

In [45]:
df = df.T

In [46]:
print(df)

                        0          1      2        3       4
first_name          Jason      Molly   Tina     Jake     Amy
last_name          Miller   Jacobson    Ali   Milner   Cooze
age                    42         52     36       24      73
city        San Francisco  Baltimore  Miami  Douglas  Boston
debt                 True       True  False    False    True


- Printing the values stored in a DataFrame

In [47]:
print(df.values)

[['Jason' 'Molly' 'Tina' 'Jake' 'Amy']
 ['Miller' 'Jacobson' 'Ali' 'Milner' 'Cooze']
 [42 52 36 24 73]
 ['San Francisco' 'Baltimore' 'Miami' 'Douglas' 'Boston']
 [True True False False True]]


- Converting to CSV

In [48]:
df_csv = df.to_csv()

In [49]:
print(df_csv)

,0,1,2,3,4
first_name,Jason,Molly,Tina,Jake,Amy
last_name,Miller,Jacobson,Ali,Milner,Cooze
age,42,52,36,24,73
city,San Francisco,Baltimore,Miami,Douglas,Boston
debt,True,True,False,False,True



- Deleting a column

In [50]:
df = df.T
print(df)

  first_name last_name age           city   debt
0      Jason    Miller  42  San Francisco   True
1      Molly  Jacobson  52      Baltimore   True
2       Tina       Ali  36          Miami  False
3       Jake    Milner  24        Douglas  False
4        Amy     Cooze  73         Boston   True


In [51]:
del df["debt"]

In [52]:
print(df)

  first_name last_name age           city
0      Jason    Miller  42  San Francisco
1      Molly  Jacobson  52      Baltimore
2       Tina       Ali  36          Miami
3       Jake    Milner  24        Douglas
4        Amy     Cooze  73         Boston


---

# 4. Selection & Drop

In [53]:
!pip install openpyxl # pandas reads .xlsx files through openpyxl; xlrd 2.0 and later only supports .xls



In [54]:
df = pd.read_excel("./excel-comp-data.xlsx")

In [55]:
print(df)

    account                              name  \
0    211829        Kerluke, Koepp and Hilpert   
1    320563                    Walter-Trantow   
2    648336        Bashirian, Kunde and Price   
3    109996       D'Amore, Gleichner and Bode   
4    121213                     Bauch-Goldner   
5    132971  Williamson, Schumm and Hettinger   
6    145068                        Casper LLC   
7    205217                  Kovacek-Johnston   
8    209744                    Champlin-Morar   
9    212303                    Gerhold-Maggio   
10   214098       Goodwin, Homenick and Jerde   
11   231907                        Hahn-Moore   
12   242368      Frami, Anderson and Donnelly   
13   268755                       Walsh-Haley   
14   273274                     McDermott PLC   

                                  street               city          state  \
0                     34456 Sean Highway         New Jaycob          Texas   
1                      1311 Alvis Tunnel      Port Khadijah

##  Selection with column names

- Selecting a single column name

In [56]:
df["account"].head(3)

0    211829
1    320563
2    648336
Name: account, dtype: int64

- Selecting multiple column names

In [57]:
df[["account", "street", "state"]].head(3)

Unnamed: 0,account,street,state
0,211829,34456 Sean Highway,Texas
1,320563,1311 Alvis Tunnel,NorthCarolina
2,648336,62184 Schamberger Underpass Apt. 231,Iowa


##  Selection with index number

- A slice of index numbers used without a column name selects rows

In [58]:
df[:3]

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar
0,211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
1,320563,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000
2,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000


- Combining a column name with a row slice returns only that column

In [59]:
df["account"][:3]

0    211829
1    320563
2    648336
Name: account, dtype: int64

## Series selection

In [60]:
account_serires = df["account"]

In [61]:
account_serires[:3]

0    211829
1    320563
2    648336
Name: account, dtype: int64

- Multiple index labels

In [62]:
account_serires[[1,5,2]]

1    320563
5    132971
2    648336
Name: account, dtype: int64

- Boolean index

In [63]:
account_serires[account_serires < 250000]

0     211829
3     109996
4     121213
5     132971
6     145068
7     205217
8     209744
9     212303
10    214098
11    231907
12    242368
Name: account, dtype: int64

## Changing the index

In [64]:
df.index = df["account"]

In [65]:
del df["account"]

In [66]:
df.head()

account,name,street,city,state,postal-code,Jan,Feb,Mar
211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
320563,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000
648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
109996,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000
121213,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000


## Basic, loc, iloc selection

- Column names with index numbers

In [67]:
df[["name","street"]][:2]

account,name,street
211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway
320563,Walter-Trantow,1311 Alvis Tunnel


- Column names with index labels

In [68]:
df.loc[[211829,320563],["name","street"]]

account,name,street
211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway
320563,Walter-Trantow,1311 Alvis Tunnel


- Column numbers with index numbers
 

In [69]:
df.iloc[:2,:2]

account,name,street
211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway
320563,Walter-Trantow,1311 Alvis Tunnel


Personally, I expect to use iloc a lot. It is also convenient for slicing within a single column, like this.

In [70]:
df["name"].iloc[2:]

account
648336          Bashirian, Kunde and Price
109996         D'Amore, Gleichner and Bode
121213                       Bauch-Goldner
132971    Williamson, Schumm and Hettinger
145068                          Casper LLC
205217                    Kovacek-Johnston
209744                      Champlin-Morar
212303                      Gerhold-Maggio
214098         Goodwin, Homenick and Jerde
231907                          Hahn-Moore
242368        Frami, Anderson and Donnelly
268755                         Walsh-Haley
273274                       McDermott PLC
Name: name, dtype: object

## Resetting the index

In [71]:
df.index = list(range(0,15))

In [72]:
df.head()

Unnamed: 0,name,street,city,state,postal-code,Jan,Feb,Mar
0,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
1,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000
2,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
3,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000
4,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000


Let's set the index name back to account.

In [73]:
df.index.name = "account"

In [74]:
df.head()

account,name,street,city,state,postal-code,Jan,Feb,Mar
0,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
1,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000
2,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
3,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000
4,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000
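
The index changes above were done by assigning to df.index and deleting the column by hand. As an alternative sketch (not the approach used in these notes), pandas also offers set_index and reset_index, which do the move in one step. The tiny accounts frame below is a hypothetical stand-in for the Excel data.

```python
import pandas as pd

# Hypothetical two-row stand-in for the Excel file used above.
accounts = pd.DataFrame({
    "account": [211829, 320563],
    "name": ["Kerluke, Koepp and Hilpert", "Walter-Trantow"],
})

# Move the "account" column into the index (it is removed from the columns).
by_account = accounts.set_index("account")

# Move the index back into a regular column and restore a 0..n-1 index.
restored = by_account.reset_index()

print(by_account)
print(restored)
```

Both methods return a new DataFrame by default, so the original accounts frame is left unchanged.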


##  Data drop

- Dropping a row by index number

In [75]:
df.drop(1)

account,name,street,city,state,postal-code,Jan,Feb,Mar
0,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
2,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
3,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000
4,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000
5,"Williamson, Schumm and Hettinger",89403 Casimer Spring,Jeremieburgh,Arkansas,62785,150000,120000,35000
6,Casper LLC,340 Consuela Bridge Apt. 400,Lake Gabriellaton,Mississipi,18008,62000,120000,70000
7,Kovacek-Johnston,91971 Cronin Vista Suite 601,Deronville,RhodeIsland,53461,145000,95000,35000
8,Champlin-Morar,26739 Grant Lock,Lake Juliannton,Pennsylvania,64415,70000,95000,35000
9,Gerhold-Maggio,366 Maggio Grove Apt. 998,North Ras,Idaho,46308,70000,120000,35000
10,"Goodwin, Homenick and Jerde",649 Cierra Forks Apt. 078,Rosaberg,Tenessee,47743,45000,120000,55000


- Dropping several rows by index number

In [76]:
df.drop([0,1,2,3])

account,name,street,city,state,postal-code,Jan,Feb,Mar
4,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000
5,"Williamson, Schumm and Hettinger",89403 Casimer Spring,Jeremieburgh,Arkansas,62785,150000,120000,35000
6,Casper LLC,340 Consuela Bridge Apt. 400,Lake Gabriellaton,Mississipi,18008,62000,120000,70000
7,Kovacek-Johnston,91971 Cronin Vista Suite 601,Deronville,RhodeIsland,53461,145000,95000,35000
8,Champlin-Morar,26739 Grant Lock,Lake Juliannton,Pennsylvania,64415,70000,95000,35000
9,Gerhold-Maggio,366 Maggio Grove Apt. 998,North Ras,Idaho,46308,70000,120000,35000
10,"Goodwin, Homenick and Jerde",649 Cierra Forks Apt. 078,Rosaberg,Tenessee,47743,45000,120000,55000
11,Hahn-Moore,18115 Olivine Throughway,Norbertomouth,NorthDakota,31415,150000,10000,162000
12,"Frami, Anderson and Donnelly",182 Bertie Road,East Davian,Iowa,72686,162000,120000,35000
13,Walsh-Haley,2624 Beatty Parkways,Goodwinmouth,RhodeIsland,31919,55000,120000,35000


- Dropping along a chosen axis by passing axis; here, the “city” column

In [77]:
df.drop("city", axis=1)

account,name,street,state,postal-code,Jan,Feb,Mar
0,"Kerluke, Koepp and Hilpert",34456 Sean Highway,Texas,28752,10000,62000,35000
1,Walter-Trantow,1311 Alvis Tunnel,NorthCarolina,38365,95000,45000,35000
2,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,Iowa,76517,91000,120000,35000
3,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Maine,46021,45000,120000,10000
4,Bauch-Goldner,7274 Marissa Common,California,49681,162000,120000,35000
5,"Williamson, Schumm and Hettinger",89403 Casimer Spring,Arkansas,62785,150000,120000,35000
6,Casper LLC,340 Consuela Bridge Apt. 400,Mississipi,18008,62000,120000,70000
7,Kovacek-Johnston,91971 Cronin Vista Suite 601,RhodeIsland,53461,145000,95000,35000
8,Champlin-Morar,26739 Grant Lock,Pennsylvania,64415,70000,95000,35000
9,Gerhold-Maggio,366 Maggio Grove Apt. 998,Idaho,46308,70000,120000,35000


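For the axis argument, 0 is always the outermost dimension, and larger numbers refer to progressively inner dimensions. In a DataFrame, axis=0 refers to the rows (index) and axis=1 to the columns.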
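
A minimal sketch of how axis behaves for both drop and an aggregation, on a made-up 2x3 frame (grid and the other names are illustrative):

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list("abc"))

# axis=0 works along the rows: one total per column.
col_totals = grid.sum(axis=0)

# axis=1 works along the columns: one total per row.
row_totals = grid.sum(axis=1)

# drop uses the same convention: axis=0 removes a row label, axis=1 a column label.
without_row = grid.drop(0, axis=0)
without_col = grid.drop("b", axis=1)

print(col_totals)
print(row_totals)
```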

---

# 5. Dataframe Operations

## Series operation

- Operations are aligned on the index
- Index labels that do not appear in both operands produce NaN

In [78]:
s1 = Series(range(1,6), index=list("abced"))

In [79]:
print(s1)

a    1
b    2
c    3
e    4
d    5
dtype: int64


In [80]:
s2 = Series(range(5,11), index=list("bcedef"))

In [81]:
print(s2)

b     5
c     6
e     7
d     8
e     9
f    10
dtype: int64


In [82]:
s1.add(s2)

a     NaN
b     7.0
c     9.0
d    13.0
e    11.0
e    13.0
f     NaN
dtype: float64

In [83]:
print(s1 + s2)

a     NaN
b     7.0
c     9.0
d    13.0
e    11.0
e    13.0
f     NaN
dtype: float64


## Dataframe operation

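- A DataFrame aligns on both columns and index
- With the add method and fill_value=0, entries missing from one operand are treated as 0 instead of producing NaN
- Operation types: add, sub, div, mul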

In [84]:
df1 = DataFrame(np.arange(9).reshape(3,3), columns=list("abc"))

In [85]:
print(df1)

   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8


In [86]:
df2 = DataFrame(np.arange(16).reshape(4,4), columns=list("abcd"))

In [87]:
print(df2)

    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15


In [88]:
print(df1 + df2)

      a     b     c   d
0   0.0   2.0   4.0 NaN
1   7.0   9.0  11.0 NaN
2  14.0  16.0  18.0 NaN
3   NaN   NaN   NaN NaN


In [89]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d
0,0.0,2.0,4.0,3.0
1,7.0,9.0,11.0,7.0
2,14.0,16.0,18.0,11.0
3,12.0,13.0,14.0,15.0
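
The same fill_value idea carries over to the other operation methods. The sketch below uses two small made-up frames (left, right); note that the neutral fill differs by operation, 0 for sub and 1 for mul.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.arange(4).reshape(2, 2), columns=list("ab"))
right = pd.DataFrame(np.ones((2, 3)), columns=list("abc"))

# Column "c" is missing from left, so it is filled before the operation.
diff = left.sub(right, fill_value=0)  # c becomes 0 - 1
prod = left.mul(right, fill_value=1)  # c becomes 1 * 1

print(diff)
print(prod)
```

A cell that is missing from both operands still comes out as NaN; fill_value only covers the case where one side has a value.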


## Series + Dataframe

- The Series index is matched against the DataFrame's column labels, and the Series is broadcast across every row

In [90]:
df = DataFrame(np.arange(16).reshape(4,4), columns=list("abcd"))

In [91]:
print(df)

    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15


In [92]:
s = Series(np.arange(10,14), index=list("abcd"))

In [93]:
print(s)

a    10
b    11
c    12
d    13
dtype: int64


In [94]:
print(df + s)

    a   b   c   d
0  10  12  14  16
1  14  16  18  20
2  18  20  22  24
3  22  24  26  28


Conversely, passing axis=0 matches the Series index against the row index and broadcasts across the columns.

In [95]:
s2 = Series(np.arange(10,14))

In [96]:
print(df)

    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15


In [97]:
print(s2)

0    10
1    11
2    12
3    13
dtype: int64


In [98]:
df.add(s2, axis=0)

Unnamed: 0,a,b,c,d
0,10,11,12,13
1,15,16,17,18
2,20,21,22,23
3,25,26,27,28
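
To see why axis=0 matters, here is a sketch of what happens without it. The frame and the labels w to z are made up so that row labels and column labels cannot be confused: the plain + operator matches the Series index against the columns, finds nothing in common, and returns only NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=list("wxyz"), columns=list("abcd"))
row_offsets = pd.Series(np.arange(10, 14), index=list("wxyz"))

# Default alignment: Series index vs. column labels, so nothing matches.
misaligned = frame + row_offsets

# axis=0: Series index vs. row index, so each row gets its own offset.
aligned = frame.add(row_offsets, axis=0)

print(misaligned)
print(aligned)
```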


---

# 6. Lambda, Map, Apply

## Lambda 함수

- A way to write an anonymous function in a single line
- Originated in Lisp and is widely used in modern languages today
- **lambda** argument : expression

In [99]:
f = lambda x,y: x + y

In [100]:
print(f(1,4))

5


A lambda can also be called without binding it to a name.

In [101]:
(lambda x: x +1)(5)

6

## Map 함수

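- Takes a function and sequence data as arguments
- Applies the function to **each element** and returns the results; in Python 3 the result is a lazy iterator, so wrap it in list() to get a list
- The function is usually written as a lambda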
- **map** (function, sequence)

In [102]:
ex = [1,2,3,4,5]
f = lambda x: x ** 2

In [103]:
print(list(map(f, ex)))

[1, 4, 9, 16, 25]


When the function takes two or more arguments, pass one sequence per argument.

In [104]:
f = lambda x, y: x + y

In [105]:
print(list(map(f, ex, ex)))

[2, 4, 6, 8, 10]


You can also pass an anonymous lambda inline.

In [106]:
list(map(lambda x: x+x, ex))

[2, 4, 6, 8, 10]
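
One detail worth a quick sketch: in Python 3 the object returned by map is a one-shot iterator. The names below are illustrative.

```python
numbers = [1, 2, 3]

# map does not compute anything yet; it returns an iterator.
squares = map(lambda x: x ** 2, numbers)

first_pass = list(squares)   # consumes the iterator
second_pass = list(squares)  # the iterator is now exhausted

print(first_pass)
print(second_pass)
```

If the results are needed more than once, store the list rather than the map object.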

## Map for series

- The map method is also available on pandas Series data
- A dict or another Series can be passed in place of a function

In [107]:
s1 = Series(np.arange(10))

In [108]:
s1.head(5)

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [109]:
s1.map(lambda x: x**2).head(5)

0     0
1     1
2     4
3     9
4    16
dtype: int64

- Replacing values using a dict; values with no matching key become NaN

In [110]:
z = {1: 'A', 2: 'B', 3: 'C'}

In [111]:
s1.map(z).head(5)

0    NaN
1      A
2      B
3      C
4    NaN
dtype: object

- Mapping through s2: each value in s1 is looked up as an index label in s2

In [112]:
s2 = Series(np.arange(10,20))

In [113]:
s1.map(s2).head(5)

0    10
1    11
2    12
3    13
4    14
dtype: int64

## Example - map for series

In [114]:
df = pd.read_csv("wages.csv")

In [115]:
df.head()

Unnamed: 0,earn,height,sex,race,ed,age
0,79571.299011,73.89,male,white,16,49
1,96396.988643,66.23,female,white,16,62
2,48710.666947,63.77,female,white,16,33
3,80478.096153,63.22,female,other,16,95
4,82089.345498,63.08,female,white,17,43


In [116]:
df.sex.unique()

array(['male', 'female'], dtype=object)

Let's use map to add a new Series called sex_code.

In [117]:
df["sex_code"] = df.sex.map({"male":0, "female":1})

In [118]:
df.head()

Unnamed: 0,earn,height,sex,race,ed,age,sex_code
0,79571.299011,73.89,male,white,16,49,0
1,96396.988643,66.23,female,white,16,62,1
2,48710.666947,63.77,female,white,16,33,1
3,80478.096153,63.22,female,other,16,95,1
4,82089.345498,63.08,female,white,17,43,1


## Replace function

- Covers **only the value-conversion part** of what map does
- Commonly used when converting data

In [119]:
df.sex.replace({"male":0, "female":1}).head()

0    0
1    1
2    1
3    1
4    1
Name: sex, dtype: int64

In [120]:
df["sex"] = df["sex"].replace(["male", "female"], # Target list
                              [0, 1])             # Conversion list
# Assigning the result back applies the change; inplace=True on df.sex is unreliable in recent pandas.

In [121]:
df.head()

Unnamed: 0,earn,height,sex,race,ed,age,sex_code
0,79571.299011,73.89,0,white,16,49,0
1,96396.988643,66.23,1,white,16,62,1
2,48710.666947,63.77,1,white,16,33,1
3,80478.096153,63.22,1,other,16,95,1
4,82089.345498,63.08,1,white,17,43,1
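
The practical difference between map and replace shows up when a value has no entry in the dict. A minimal sketch with made-up data (colors, codes):

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue"])
codes = {"red": 0, "green": 1}  # note: no entry for "blue"

mapped = colors.map(codes)        # unmatched values become NaN
replaced = colors.replace(codes)  # unmatched values are left as they are

print(mapped)
print(replaced)
```

So map suits a complete lookup table, while replace is the safer choice when only some values should change.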


## Apply for dataframe

- Unlike map, applies the function to an **entire Series (column)** at a time
- The function receives each column as a Series and can operate on it as a whole

In [122]:
df = pd.read_csv("wages.csv")
df.head()

Unnamed: 0,earn,height,sex,race,ed,age
0,79571.299011,73.89,male,white,16,49
1,96396.988643,66.23,female,white,16,62
2,48710.666947,63.77,female,white,16,33
3,80478.096153,63.22,female,other,16,95
4,82089.345498,63.08,female,white,17,43


In [123]:
df_info = df[["earn", "height","age"]]
df_info.head()

Unnamed: 0,earn,height,age
0,79571.299011,73.89,49
1,96396.988643,66.23,62
2,48710.666947,63.77,33
3,80478.096153,63.22,95
4,82089.345498,63.08,43


In [124]:
f = lambda x : x.max() - x.min()

As shown below, one result is returned per column.

In [125]:
df_info.apply(f)

earn      318047.708444
height        19.870000
age           73.000000
dtype: float64

- The built-in aggregation methods give the same result
- mean, std, and others are available as well

In [126]:
df_info.sum()

earn      4.474344e+07
height    9.183125e+04
age       6.250800e+04
dtype: float64

In [127]:
df_info.apply(sum)

earn      4.474344e+07
height    9.183125e+04
age       6.250800e+04
dtype: float64

- The function may also return a Series instead of a scalar

In [128]:
f = lambda x : Series([x.min(), x.max(), x.mean()], 
                      index=["min", "max", "mean"])

In [129]:
df_info.apply(f)

Unnamed: 0,earn,height,age
min,-98.580489,57.34,22.0
max,317949.127955,77.21,95.0
mean,32446.292622,66.59264,45.328499
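
The examples above apply the function to each column. As a sketch of the other direction, passing axis=1 hands each row to the function instead. The scores frame is made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({"math": [90, 60], "art": [70, 80]})

# Default (axis=0): the lambda receives one column at a time.
spread_per_column = scores.apply(lambda col: col.max() - col.min())

# axis=1: the lambda receives one row at a time.
spread_per_row = scores.apply(lambda row: row.max() - row.min(), axis=1)

print(spread_per_column)
print(spread_per_row)
```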


## Applymap for dataframe

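- Applies a function **element by element** rather than one Series at a time (pandas 2.1 and later deprecate applymap in favor of DataFrame.map)
- Same effect as calling apply on a single Series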

In [130]:
f = lambda x : -x

In [131]:
df_info.applymap(f).head(5)

Unnamed: 0,earn,height,age
0,-79571.299011,-73.89,-49
1,-96396.988643,-66.23,-62
2,-48710.666947,-63.77,-33
3,-80478.096153,-63.22,-95
4,-82089.345498,-63.08,-43


In [132]:
df_info["earn"].apply(f).head(5)

0   -79571.299011
1   -96396.988643
2   -48710.666947
3   -80478.096153
4   -82089.345498
Name: earn, dtype: float64
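
Since pandas 2.1 the element-wise method on a DataFrame is called map, and applymap is deprecated. The sketch below, on a made-up frame, picks whichever one the installed version provides; the hasattr check is just one possible way to stay compatible.

```python
import pandas as pd

values = pd.DataFrame({"x": [1, -2], "y": [3.5, -4.5]})
negate = lambda v: -v

# DataFrame.map exists from pandas 2.1; fall back to applymap on older versions.
elementwise = values.map if hasattr(values, "map") else values.applymap
negated = elementwise(negate)

print(negated)
```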

---

# 7. Pandas Built-in functions

## describe

- Shows summary statistics for the numeric columns

In [133]:
df = pd.read_csv("./wages.csv")
df.head()

Unnamed: 0,earn,height,sex,race,ed,age
0,79571.299011,73.89,male,white,16,49
1,96396.988643,66.23,female,white,16,62
2,48710.666947,63.77,female,white,16,33
3,80478.096153,63.22,female,other,16,95
4,82089.345498,63.08,female,white,17,43


In [134]:
df.describe()

Unnamed: 0,earn,height,ed,age
count,1379.0,1379.0,1379.0,1379.0
mean,32446.292622,66.59264,13.354605,45.328499
std,31257.070006,3.818108,2.438741,15.789715
min,-98.580489,57.34,3.0,22.0
25%,10538.790721,63.72,12.0,33.0
50%,26877.870178,66.05,13.0,42.0
75%,44506.215336,69.315,15.0,55.0
max,317949.127955,77.21,18.0,95.0


## unique

- Returns the distinct values of a Series as a NumPy array; in other words, it shows the set of unique values in the Series.

- Listing the unique values of race

In [135]:
df.race.unique()

array(['white', 'other', 'hispanic', 'black'], dtype=object)

- Indexing the values with a dict

In [136]:
np.array(dict(enumerate(sorted(df["race"].unique()))))

array({0: 'black', 1: 'hispanic', 2: 'other', 3: 'white'}, dtype=object)

- Extracting the label indices and the label values separately

In [137]:
value = list(map(int, np.array(list(enumerate(df["race"].unique())))[:, 0].tolist()))
key = np.array(list(enumerate(df["race"].unique())), dtype=str)[:, 1].tolist()

map is used here to convert the values to int. In Python 3 map returns an iterator, so it must be wrapped in list() to get a list.

In [138]:
print(value)
print(key)

[0, 1, 2, 3]
['white', 'other', 'hispanic', 'black']


- Converting label strings to index values

In [139]:
df["race"].head()

0    white
1    white
2    white
3    other
4    white
Name: race, dtype: object

In [140]:
df["race"] = df["race"].replace(to_replace=key, value=value) # assign back instead of calling inplace=True on a selected column

In [141]:
df.head()

Unnamed: 0,earn,height,sex,race,ed,age
0,79571.299011,73.89,male,0,16,49
1,96396.988643,66.23,female,0,16,62
2,48710.666947,63.77,female,0,16,33
3,80478.096153,63.22,female,1,16,95
4,82089.345498,63.08,female,0,17,43


- Applying the same conversion to sex

In [142]:
value = list(map(int, np.array(list(enumerate(df["sex"].unique())))[:, 0].tolist()))
key = np.array(list(enumerate(df["sex"].unique())), dtype=str)[:, 1].tolist()

In [143]:
print(value)
print(key)

[0, 1]
['male', 'female']


- Index labelling for the ”sex” and “race” columns

In [144]:
df["sex"].head()

0      male
1    female
2    female
3    female
4    female
Name: sex, dtype: object

In [145]:
df["sex"] = df["sex"].replace(to_replace=key, value=value) # assign back instead of calling inplace=True on a selected column

In [146]:
df.head()

Unnamed: 0,earn,height,sex,race,ed,age
0,79571.299011,73.89,0,0,16,49
1,96396.988643,66.23,1,0,16,62
2,48710.666947,63.77,1,0,16,33
3,80478.096153,63.22,1,1,16,95
4,82089.345498,63.08,1,0,17,43
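
The enumerate and replace steps above build the label codes by hand. As an alternative sketch (not the method used in these notes), pd.factorize produces the same kind of encoding in one call, numbering labels in order of first appearance. The races Series is a made-up sample.

```python
import pandas as pd

races = pd.Series(["white", "white", "other", "hispanic", "white"])

# codes: one integer per row; labels: the unique values, indexed by code.
codes, labels = pd.factorize(races)

print(codes)
print(list(labels))
```

Keeping labels around makes it possible to translate the integer codes back to the original strings later.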


## sum

- Supports basic aggregations over column or row values
- sum, mean, min, max, count, median, var, and so on (mad was removed in pandas 2.0)

In [147]:
df.sum(axis=0) # axis=0: add down the rows, producing one total per column

earn      4.474344e+07
height    9.183125e+04
sex       8.590000e+02
race      5.610000e+02
ed        1.841600e+04
age       6.250800e+04
dtype: float64

In [148]:
df.sum(axis=1).head() # axis=1: add across the columns, producing one total per row

0    79710.189011
1    96542.218643
2    48824.436947
3    80654.316153
4    82213.425498
dtype: float64

## isnull

- Returns a boolean mask of the same shape, with True wherever a value is NaN (null)

In [149]:
df.isnull().head()

Unnamed: 0,earn,height,sex,race,ed,age
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False


- Counting the null values per column

In [150]:
df.isnull().sum()

earn      0
height    0
sex       0
race      0
ed        0
age       0
dtype: int64
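
The wages data happens to contain no missing values, so the counts are all zero. A sketch on a made-up frame (people) that does contain gaps shows what the mask is useful for:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "age": [30, np.nan, 25],
    "city": ["Seoul", None, None],
})

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = people.isnull().sum()

# any(axis=1) flags rows with at least one missing value.
rows_with_missing = people[people.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```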

## sort_values

- Sorts the data by column values

In [151]:
df.sort_values(["age", "earn"], ascending=False).head(10) # ascending=True sorts in ascending order

Unnamed: 0,earn,height,sex,race,ed,age
3,80478.096153,63.22,1,1,16,95
809,42963.362005,72.94,0,0,12,95
331,39169.750135,64.79,1,0,12,95
102,39751.19403,67.14,0,0,12,93
993,32809.632677,59.61,1,1,16,92
1017,8942.806716,62.97,1,0,10,91
1192,39757.94721,64.79,0,0,16,90
952,8162.682672,58.09,1,0,5,89
827,55712.348432,70.13,0,0,9,88
939,40744.874765,59.15,1,0,15,87
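
ascending also accepts one flag per sort column. A minimal sketch on made-up data (staff), sorting by age in descending order and breaking ties by earn in ascending order:

```python
import pandas as pd

staff = pd.DataFrame({"age": [40, 25, 40], "earn": [300, 100, 200]})

# One boolean per column: age descending, earn ascending within equal ages.
ranked = staff.sort_values(["age", "earn"], ascending=[False, True])

print(ranked)
```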


---

# 8. Groupby

## Groupby

## Hierarchical index

## Hierarchical index – unstack()

## Hierarchical index – swaplevel

## Hierarchical index – operations

## Groupby – grouped

## Groupby – aggregation

## Groupby – transformation

## Groupby – filter

---

# 9. Case study

## Data

---

# 10. Pivot table Crosstab

## Pivot Table

## Crosstab

---

# 11. Merge & Concat

## Merge

## Join method

## Data

## Left join

## Right join

## Full(outer) join

## Inner join

## Index based join

## Concat

---

# 12. DB Persistence

## Database connection

## XLS persistence

## Pickle persistence