# Pandas

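NumPy excels at processing clean, well-organized numerical data,  
but it lacks several features needed for the messier, more varied data found in the real world  
(for example, handling null values, labeling rows and columns, and useful operations such as pivot).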

So a library appeared that extends NumPy with these extra features,  
and its glorious name is __Pandas__.

If the earlier NumPy session covered the features needed for linear algebra,  
this Pandas session covers the features you would want from Excel.  



In [1]:
import numpy as np
import pandas as pd
pd.__version__

'0.24.1'

---

# Pandas Objects
- Series
- DataFrame
- Index

---
# Series

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [3]:
print(data.index, '\n')
print(data.values)

RangeIndex(start=0, stop=4, step=1) 

[0.25 0.5  0.75 1.  ]


In [4]:
print(data[1], '\n')
print(data[1:3])

0.5 

1    0.50
2    0.75
dtype: float64


## ```Series``` as a generalized NumPy array

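A Series differs from a NumPy 1-D array in that its index is an explicit part of the object.  
If you do not specify an index, it defaults to integers (a RangeIndex),  
but if you specify one explicitly, the labels can be other values as well.  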

In [5]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [6]:
print(data['b'], '\n')
print(data['b':'d'])

0.5 

b    0.50
c    0.75
d    1.00
dtype: float64


In [7]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [8]:
data[5]

0.5

In [9]:
print(data[2], '\n')
print(data[1:3])

0.25 

5    0.50
3    0.75
dtype: float64


## ```Series``` as specialized dictionary

In [10]:
population_dict = {
    'California': 383325251,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135
}

population = pd.Series(population_dict)
population

California    383325251
Texas          26448193
New York       19651127
Florida        19552860
Illinois       12882135
dtype: int64

In [11]:
population['California']

383325251

In [12]:
# Unlike a dictionary, though, array-style slicing works
population['California':'Illinois']

California    383325251
Texas          26448193
New York       19651127
Florida        19552860
Illinois       12882135
dtype: int64
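
As a rough illustration of the difference noted above, the sketch below compares a plain dict with a Series built from it. The names `ages` and `ages_series` and their values are made up for this example. The dict rejects a key slice, while the Series slices by label and includes the end label.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
ages = {'ann': 31, 'bob': 25, 'cat': 47, 'dan': 52}
ages_series = pd.Series(ages)

# A plain dict does not support key-based slicing.
# Depending on the Python version this raises TypeError or KeyError.
try:
    ages['ann':'cat']
    dict_slicing_works = True
except (TypeError, KeyError):
    dict_slicing_works = False

# A Series slices by label, and the end label is included
sliced = ages_series['ann':'cat']

print(dict_slicing_works)
print(sliced)
```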

## Let's build some

In [13]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [14]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [15]:
pd.Series({2: 'a', 1: 'b', 3: 'c'})

2    a
1    b
3    c
dtype: object

In [16]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object
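
The last cell shows that `index` selects and reorders keys from the dict. A small follow-up sketch, assuming we also ask for a label (here `4`) that is not in the dict: that entry should come back as a missing value rather than raising an error. The variable names are illustrative only.

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# Only keys listed in `index` are kept, in the order given;
# label 4 is not a key of the dict, so its value is missing (NaN)
picked = pd.Series(letters, index=[3, 2, 4])

print(picked)
print(picked.isna().tolist())
```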

---
# DataFrame

## ```DataFrame``` as a generalized NumPy array

A DataFrame can be thought of as a collection of Series that share a common index.

In [17]:
area_dict = {
    'California': 423967,
    'Texas': 695662,
    'New York': 141297,
    'Florida': 170312,
    'Illinois': 149995
}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [18]:
states = pd.DataFrame({
    'population': population,
    'area': area
})
states

            population    area
California   383325251  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
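
In the example above both Series happen to have identical indexes. To see that the columns really are joined by index label, here is a minimal sketch with two made-up Series (`price` and `stock`, hypothetical data) whose indexes only partly overlap. Labels missing from one Series should show up as NaN in that column.

```python
import pandas as pd

# Hypothetical Series whose indexes overlap only partially
price = pd.Series({'apple': 1.2, 'banana': 0.5, 'cherry': 3.0})
stock = pd.Series({'banana': 40, 'cherry': 15, 'durian': 2})

# Rows are matched by label, not by position; a label missing
# from one Series is filled with NaN in that Series' column
fruit = pd.DataFrame({'price': price, 'stock': stock})

print(fruit)
```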


In [19]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [20]:
states.columns

Index(['population', 'area'], dtype='object')

## ```DataFrame``` as specialized dictionary

In [21]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

## Let's build some

In [22]:
pd.DataFrame(population, columns=['population'])

            population
California   383325251
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


In [23]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


In [24]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
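
The output above mixes `1.0` and `2` because of how missing keys are handled. The sketch below repeats the same construction and inspects the result. The variable names are illustrative. A column with a gap is stored as floating point so that it can hold NaN.

```python
import pandas as pd

records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
frame = pd.DataFrame(records)

# 'b' is present in every record, so it stays an integer column;
# 'a' and 'c' each have one gap, so they become float columns with NaN
print(frame.dtypes)
print(frame.isna())
```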


In [25]:
pd.DataFrame({'population': population,
              'area': area})

            population    area
California   383325251  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [26]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.431191  0.452349
b  0.953581  0.194748
c  0.288119  0.747566


# Index

An Index can be thought of as either an immutable array or an ordered set.

In [27]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

## ```Index``` as immutable array

In [28]:
ind[1], ind[::2]

(3, Int64Index([2, 5, 11], dtype='int64'))

In [29]:
# An Index also has many of the attributes of a NumPy array
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


In [30]:
# immutable
ind[1] = 0

TypeError: Index does not support mutable operations
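
Since an Index cannot be modified in place, one way to get a changed version is to build a new Index. The sketch below assumes the goal is to swap the element at position 1 for `0`. It uses `delete` and `insert`, which each return a new Index, and leaves the original untouched.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# In-place assignment is rejected with a TypeError
try:
    ind[1] = 0
    assignment_failed = False
except TypeError:
    assignment_failed = True

# delete() and insert() each return a new Index instead
replaced = ind.delete(1).insert(1, 0)

print(assignment_failed)
print(list(replaced))
print(list(ind))  # the original Index is unchanged
```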

## ```Index``` as ordered set

To talk about this, we first need to look at how Python's built-in set behaves!

In [31]:
# duplicates are not allowed

a = set()

a.add(1)
a.add(3)
a.add(5)
a.add(1)
a.add(3)
a

{1, 3, 5}

In [32]:
b = set([2, 3, 5, 7, 2, 3, 7])
b

{2, 3, 5, 7}

In [33]:
# and op, intersection

a & b

{3, 5}

In [34]:
# or op, union

a | b

{1, 2, 3, 5, 7}

In [35]:
# xor op, union minus intersection (symmetric difference)

a ^ b

{1, 2, 7}

In [36]:
indA = pd.Index([1, 3, 5, 9])
indB = pd.Index([2, 3, 5, 7])

indA, indB

(Int64Index([1, 3, 5, 9], dtype='int64'),
 Int64Index([2, 3, 5, 7], dtype='int64'))

In [37]:
indA & indB

Int64Index([3, 5], dtype='int64')

In [38]:
indA | indB

Int64Index([1, 2, 3, 5, 7, 9], dtype='int64')

In [39]:
indA ^ indB

Int64Index([1, 2, 7, 9], dtype='int64')
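
The operators `&`, `|` and `^` behave as set operations in the pandas version used here (0.24.1), but later pandas releases changed what these operators mean on Index objects. The sketch below shows the same three operations written with the named methods, which should be the safer spelling across versions. The result variable names are illustrative.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 9])
indB = pd.Index([2, 3, 5, 7])

common = indA.intersection(indB)             # in both
either = indA.union(indB)                    # in at least one
only_one = indA.symmetric_difference(indB)   # in exactly one

print(list(common), list(either), list(only_one))
```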

# Data Indexing & Selection

As we saw with NumPy:
- indexing ```arr[2, 1]```
- slicing ```arr[:, 1:5]```
- masking ```arr[arr > 0]```
- fancy indexing ```arr[0, [1, 5]]```
- combination ```arr[:, [1, 5]]```
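
As a quick refresher, the five patterns listed above can be sketched on a small array. The array `arr` and the result names are made up for this illustration.

```python
import numpy as np

# 3 rows, 6 columns, holding the values -6 through 11
arr = np.arange(-6, 12).reshape(3, 6)

single = arr[2, 1]         # indexing: row 2, column 1
block = arr[:, 1:5]        # slicing: every row, columns 1 to 4
positive = arr[arr > 0]    # masking: 1-D array of the elements > 0
picked = arr[0, [1, 5]]    # fancy indexing: row 0, columns 1 and 5
columns = arr[:, [1, 5]]   # combination: every row, columns 1 and 5

print(single, block.shape, positive.size, picked, columns.shape)
```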

Pandas is broadly similar, so here we check the parts that differ slightly.

## ```Series``` as dictionary

In [40]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [41]:
data['b']

0.5

In [42]:
'a' in data

True

In [43]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [44]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [45]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

## ```Series``` as one-dimensional array

In [46]:
# slicing
# WARNING: the final index is INCLUDED in the slice!
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [47]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [48]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [49]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

### Indexers: loc, iloc, and ix

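When a Series has an explicit integer index, there is one very confusing situation.  
Indexing such as ```data[1]``` uses the explicit index,  
but slicing such as ```data[1:3]``` uses the implicit, position-based offset!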

In [62]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [63]:
# explicit index when indexing
print(data[1])
print('-----')

# implicit index when slicing
print(data[1:3])

a
-----
3    b
5    c
dtype: object


In [64]:
# loc always indexes and slices by the explicit index

print(data.loc[1])
print('-----')
print(data.loc[1:3])

a
-----
1    a
3    b
dtype: object


In [66]:
# iloc always indexes and slices by the implicit, position-based offset

print(data.iloc[1])
print('-----')
print(data.iloc[1:3])

b
-----
3    b
5    c
dtype: object
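
A few edge cases make the difference between the two indexers more concrete. The sketch below reuses the same Series and assumes we probe it with the value `2`, which is a valid position but not a label in the index. The result names are illustrative.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# 2 is a valid position, so iloc succeeds
third_item = data.iloc[2]

# 2 is not a label in the index, so loc raises KeyError
try:
    data.loc[2]
    label_found = True
except KeyError:
    label_found = False

by_label = data.loc[1:3]       # labels 1 through 3, end label included
by_position = data.iloc[1:3]   # positions 1 and 2, end position excluded

print(third_item, label_found)
print(by_label.index.tolist(), by_position.index.tolist())
```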


## ```DataFrame``` as a dictionary

In [67]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [68]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [69]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [70]:
data.area is data['area']

True

In [71]:
data.pop is data['pop']

False
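
The `False` above comes from a name clash: `pop` is already a DataFrame method, so attribute-style access returns the method instead of the column. A minimal sketch with a made-up two-row frame (hypothetical values) shows the difference. Dictionary-style access is the form that works for any column name.

```python
import pandas as pd

frame = pd.DataFrame({'area': [10, 20], 'pop': [100, 400]})

# `area` does not clash with anything, so the attribute is the column
area_is_method = callable(frame.area)

# `pop` is an existing DataFrame method, so the attribute is the method
pop_is_method = callable(frame.pop)

# Dictionary-style access always returns the column
pop_column = frame['pop']

print(area_is_method, pop_is_method)
print(pop_column.tolist())
```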

In [72]:
help(data.pop)

Help on method pop in module pandas.core.generic:

pop(item) method of pandas.core.frame.DataFrame instance
    Return item and drop from frame. Raise KeyError if not found.
    
    Parameters
    ----------
    item : str
        Column label to be popped
    
    Returns
    -------
    popped : Series
    
    Examples
    --------
    >>> df = pd.DataFrame([('falcon', 'bird',    389.0),
    ...                    ('parrot', 'bird',     24.0),
    ...                    ('lion',   'mammal',   80.5),
    ...                    ('monkey', 'mammal', np.nan)],
    ...                   columns=('name', 'class', 'max_speed'))
    >>> df
         name   class  max_speed
    0  falcon    bird      389.0
    1  parrot    bird       24.0
    2    lion  mammal       80.5
    3  monkey  mammal        NaN
    
    >>> df.pop('class')
    0      bird
    1      bird
    2    mammal
    3    mammal
    Name: class, dtype: object
    
    >>> df
         name  max_speed
    0  falcon      389.0
 

In [73]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


## ```DataFrame``` as two-dimensional array

In [74]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [78]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [80]:
data.iloc[:3, :2]

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [81]:
data.loc[:'Illinois', :'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


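The ```ix``` indexer allowed a hybrid of these two approaches, but it has been deprecated since pandas 0.20. The familiar NumPy-style patterns, such as masking combined with fancy indexing, work with ```loc``` and ```iloc```: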

In [83]:
data.loc[data.density > 100, ['pop', 'density']]

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121
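
Where `ix` was once used to mix positions and labels in a single call, one possible substitute is to translate one side first and then use `loc` or `iloc` alone. The sketch below uses a three-row subset of the table above and assumes we want the first two rows by position and the `pop` column by label. The names `via_loc` and `via_iloc` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'pop': [38332521, 26448193, 19651127]},
                    index=['California', 'Texas', 'New York'])

# loc only: turn the row positions into labels first
via_loc = data.loc[data.index[:2], 'pop']

# iloc only: turn the column label into a position first
via_iloc = data.iloc[:2, data.columns.get_loc('pop')]

print(via_loc.equals(via_iloc))
```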


In [85]:
data.iloc[0, 2] = 90
data

              area       pop     density
California  423967  38332521   90.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


## Additional indexing conventions

In [86]:
data['Florida':'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


In [87]:
data[1:3]

            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746


In [88]:
data[data.density > 100]

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121
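
The three cells above rely on a convention that is easy to miss: with plain square brackets on a DataFrame, a single label selects a column, while a slice or a boolean mask selects rows. A small sketch on a three-row subset of the same table, with a made-up threshold of 20 million, puts the cases side by side.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'pop': [38332521, 26448193, 19651127]},
                    index=['California', 'Texas', 'New York'])

column = data['area']                  # a single label selects a column
rows = data[1:3]                       # a slice selects rows by position
big = data[data['pop'] > 20_000_000]   # a boolean mask selects rows

print(column.name)
print(rows.index.tolist())
print(big.index.tolist())
```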
