# Example objectives:

1. Convert between columns and index levels
2. Reorganize (reshape) data


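   1. `pd.MultiIndex`

      A MultiIndex is an extension of the standard Index object that supports hierarchical (multi-level) labels.

      You can think of a MultiIndex as an array of tuples, where each tuple is unique.

      - Creating a MultiIndex:

        From a list of arrays (`MultiIndex.from_arrays(arrays, sortorder=None, names=_NoDefault.no_default)`)

        From an array of tuples (`MultiIndex.from_tuples(tuples, sortorder=None, names=None)`)

        From the Cartesian product of iterables (`MultiIndex.from_product(iterables, sortorder=None, names=_NoDefault.no_default)`): every combination of the inputs

        From a DataFrame (`MultiIndex.from_frame(df, sortorder=None, names=None)`)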




   2. Column to index: `.stack()` moves a column level into the index

   3. Index to column: `.unstack()` moves an index level into the columns

   4. Unpivoting a table (wide to long):

      `pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)`

       https://medium.com/%E6%95%B8%E6%93%9A%E4%B8%8D%E6%AD%A2-not-only-data/%E5%B8%B6%E4%BD%A0%E5%BF%AB%E9%80%9F%E7%90%86%E8%A7%A3-pandas-melt-%E5%A6%82%E4%BD%95%E4%BD%BF%E7%94%A8-443976e00f2


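   5. Reorganizing a given DataFrame:

       `.pivot(index='', columns='', values='')`

       Reshapes the DataFrame according to the given index and column values.<br>
       index: the column whose values become the new index<br>
       columns: the column whose values become the new column labels<br>
       values: the column whose values fill the new table<br>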


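# Key points:
1. By default, both `.stack()` and `.unstack()` convert the **innermost (last) level** (`level=-1`); pass `level=` to pick a different one
2. When reorganizing data, make sure you understand what each parameter does; experimenting helps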

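# [Learning objectives]

* Read and write data with read_csv and to_csv
* Understand what missing values mean and the common strategies for handling them
* Know which external data formats pandas supports
  - pd.read_csv: customize missing-value markers with na_values
  - to_csv: write data out and set the compression format with the compression argument (the pandas documentation stores its options in a variable named compression_opts)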

In [1]:
import pandas as pd
import numpy as np

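`pd.MultiIndex`

A MultiIndex is an extension of the standard Index object that supports hierarchical (multi-level) labels.

You can think of a MultiIndex as an array of tuples, where each tuple is unique.

Creating a MultiIndex:

From a list of arrays (`MultiIndex.from_arrays(arrays, sortorder=None, names=_NoDefault.no_default)`)

From an array of tuples (`MultiIndex.from_tuples(tuples, sortorder=None, names=None)`)

From the Cartesian product of iterables (`MultiIndex.from_product(iterables, sortorder=None, names=_NoDefault.no_default)`): every combination of the inputs

From a DataFrame (`MultiIndex.from_frame(df, sortorder=None, names=None)`)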

In [3]:
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
index

MultiIndex([(2013, 1),
            (2013, 2),
            (2014, 1),
            (2014, 2)],
           names=['year', 'visit'])

In [4]:
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
columns

MultiIndex([(  'Bob',   'HR'),
            (  'Bob', 'Temp'),
            ('Guido',   'HR'),
            ('Guido', 'Temp'),
            (  'Sue',   'HR'),
            (  'Sue', 'Temp')],
           names=['subject', 'type'])
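The two cells above only use `from_product`. As a minimal sketch, the other three constructors listed earlier can build the same year/visit index; the variable names here (`years`, `visits`, `by_arrays`, and so on) are illustrative and not part of the original notebook.

```python
import pandas as pd

years = [2013, 2013, 2014, 2014]
visits = [1, 2, 1, 2]
names = ["year", "visit"]

# from_product: every combination of the input iterables
by_product = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=names)

# from_arrays: one array per level, aligned position by position
by_arrays = pd.MultiIndex.from_arrays([years, visits], names=names)

# from_tuples: one tuple per row label
by_tuples = pd.MultiIndex.from_tuples(list(zip(years, visits)), names=names)

# from_frame: each DataFrame column becomes a level,
# and the column names become the level names
frame = pd.DataFrame({"year": years, "visit": visits})
by_frame = pd.MultiIndex.from_frame(frame)

# All four hold the same labels
print(by_arrays.equals(by_product))
print(by_tuples.equals(by_product))
print(by_frame.equals(by_product))
```

`from_product` is the shortest when the index really is a full grid; `from_arrays` or `from_tuples` are needed when only some combinations occur.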

In [2]:
# mock some data
data = np.round(np.random.randn(4, 6), 1)
df = pd.DataFrame(data, index=index, columns=columns)
df

subject      Bob        Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1       1.3   0.2   0.8  -0.7  -0.4   1.3
     2      -1.0   0.6   0.7   0.3  -0.8   1.1
2014 1       2.1   0.2   1.1   0.1  -2.1  -1.6
     2       0.2   1.1   1.1   0.4   0.4   0.4


**Column to index: `.stack()` moves a column level into the index. Here it moves the `type` column level into the index.**

Index before: year, visit ->
index after: year, visit, type

In [5]:
df.stack()

subject            Bob Guido   Sue
year visit type
2013 1     HR      1.3   0.8  -0.4
           Temp    0.2  -0.7   1.3
     2     HR     -1.0   0.7  -0.8
           Temp    0.6   0.3   1.1
2014 1     HR      2.1   1.1  -2.1
           Temp    0.2   0.1  -1.6
     2     HR      0.2   1.1   0.4
           Temp    1.1   0.4   0.4


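Calling `.stack()` a second time moves `subject` as well: the index becomes year, visit, type, subject, and because no column level is left the result is a Series.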

In [6]:
df.stack().stack()

year  visit  type  subject
2013  1      HR    Bob        1.3
                   Guido      0.8
                   Sue       -0.4
             Temp  Bob        0.2
                   Guido     -0.7
                   Sue        1.3
      2      HR    Bob       -1.0
                   Guido      0.7
                   Sue       -0.8
             Temp  Bob        0.6
                   Guido      0.3
                   Sue        1.1
2014  1      HR    Bob        2.1
                   Guido      1.1
                   Sue       -2.1
             Temp  Bob        0.2
                   Guido      0.1
                   Sue       -1.6
      2      HR    Bob        0.2
                   Guido      1.1
                   Sue        0.4
             Temp  Bob        1.1
                   Guido      0.4
                   Sue        0.4
dtype: float64
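In the cells above, `.stack()` moved `type`, the innermost column level, because `level` defaults to `-1`. The sketch below shows how `level=` selects another level by name. It rebuilds a frame with the same layout under the hypothetical name `demo`, using a seeded generator, so its values differ from the output shown earlier.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=["year", "visit"])
columns = pd.MultiIndex.from_product([["Bob", "Guido", "Sue"], ["HR", "Temp"]],
                                     names=["subject", "type"])
rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
demo = pd.DataFrame(rng.normal(size=(4, 6)).round(1),
                    index=index, columns=columns)

# Default: the innermost column level ("type") moves into the index
by_default = demo.stack()

# Explicit: move "subject" instead; "type" stays in the columns
by_name = demo.stack(level="subject")

print(list(by_default.index.names))
print(list(by_name.index.names))
print(list(by_name.columns))
```

Depending on the installed pandas version, `stack()` may emit a `FutureWarning` about its newer implementation; the level selection shown here is the same either way.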

**Index to column: `.unstack()` moves an index level into the columns.**

Here it moves the `visit` index level into the columns.

Before: index year, visit; columns subject, type ->
after: index year; columns subject, type, visit

In [7]:
df.unstack()

subject   Bob                    Guido                     Sue
type       HR        Temp           HR        Temp          HR        Temp
visit       1     2     1     2     1     2     1     2     1     2     1     2
year
2013      1.3  -1.0   0.2   0.6   0.8   0.7  -0.7   0.3  -0.4  -0.8   1.3   1.1
2014      2.1   0.2   0.2   1.1   1.1   1.1   0.1   0.4  -2.1   0.4  -1.6   0.4
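`.unstack()` follows the same rule: with no argument it moves the innermost index level (`visit` above), and `level=` picks another one. A minimal sketch, again on a rebuilt frame with the hypothetical name `demo` and seeded random values:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=["year", "visit"])
columns = pd.MultiIndex.from_product([["Bob", "Guido", "Sue"], ["HR", "Temp"]],
                                     names=["subject", "type"])
rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
demo = pd.DataFrame(rng.normal(size=(4, 6)).round(1),
                    index=index, columns=columns)

# Default: the innermost index level ("visit") becomes the innermost column level
by_default = demo.unstack()

# Explicit: move "year" instead, leaving "visit" as the row index
by_year = demo.unstack(level="year")

print(list(by_default.columns.names), by_default.index.name)
print(list(by_year.columns.names), by_year.index.name)
```

In both cases the moved level is appended as the last column level, which is why the output above shows `visit` beneath `subject` and `type`.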


In [8]:
df = pd.DataFrame({'Name':{0:'John', 1:'Bob', 2:'Shiela'}, 
                   'Course':{0:'Masters', 1:'Graduate', 2:'Graduate'}, 
                   'Age':{0:27, 1:23, 2:21}}) 
df

     Name    Course  Age
0    John   Masters   27
1     Bob  Graduate   23
2  Shiela  Graduate   21


In [10]:
df = pd.DataFrame({'Name':['John', 'Bob', 'Shiela'], 
                   'Course':['Masters', 'Graduate', 'Graduate'], 
                   'Age':[27, 23, 21]}) 
df

     Name    Course  Age
0    John   Masters   27
1     Bob  Graduate   23
2  Shiela  Graduate   21


**With no arguments, `melt()` turns every column into (variable, value) rows**

**Unpivoting a table (wide to long)**
`pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)`

https://medium.com/%E6%95%B8%E6%93%9A%E4%B8%8D%E6%AD%A2-not-only-data/%E5%B8%B6%E4%BD%A0%E5%BF%AB%E9%80%9F%E7%90%86%E8%A7%A3-pandas-melt-%E5%A6%82%E4%BD%95%E4%BD%BF%E7%94%A8-443976e00f2

In [11]:
df.melt()

  variable     value
0     Name      John
1     Name       Bob
2     Name    Shiela
3   Course   Masters
4   Course  Graduate
5   Course  Graduate
6      Age        27
7      Age        23
8      Age        21


**Keep `Name` as the identifier column (`id_vars`) and melt the remaining columns**

In [12]:
df.melt(id_vars='Name')

     Name variable     value
0    John   Course   Masters
1     Bob   Course  Graduate
2  Shiela   Course  Graduate
3    John      Age        27
4     Bob      Age        23
5  Shiela      Age        21


**Melt only the `Name` column (`value_vars='Name'`); the other columns are dropped from the result**

In [13]:
df.melt(value_vars='Name')

  variable   value
0     Name    John
1     Name     Bob
2     Name  Shiela
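The cells above use `id_vars` and `value_vars` separately. The sketch below combines them and also renames the two generated columns with `var_name` and `value_name`, then pivots the long table back to check that nothing was lost. The names `students`, `attribute` and `val` are illustrative choices, not from the original notebook.

```python
import pandas as pd

students = pd.DataFrame({"Name": ["John", "Bob", "Shiela"],
                         "Course": ["Masters", "Graduate", "Graduate"],
                         "Age": [27, 23, 21]})

# Keep Name as the identifier, melt Course and Age,
# and give the generated columns descriptive names
long = students.melt(id_vars="Name",
                     value_vars=["Course", "Age"],
                     var_name="attribute",
                     value_name="val")
print(long)

# pivot() is the inverse operation: long back to wide
wide_again = long.pivot(index="Name", columns="attribute", values="val")
print(wide_again)
```

Because `Course` and `Age` hold different types, the melted `val` column mixes strings and integers, so the round-tripped columns come back with a generic object dtype.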


In [14]:
df = pd.DataFrame({'fff': ['one', 'one', 'one', 'two', 'two',
                           'two'],
                   'bbb': ['P', 'Q', 'R', 'P', 'Q', 'R'],
                   'baa': [2, 3, 4, 5, 6, 7],
                   'zzz': ['h', 'i', 'j', 'k', 'l', 'm']})
df

   fff bbb  baa zzz
0  one   P    2   h
1  one   Q    3   i
2  one   R    4   j
3  two   P    5   k
4  two   Q    6   l
5  two   R    7   m


`.pivot(index='', columns='', values='')`

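Reshapes the DataFrame according to the given index and column values.<br>
index: the column whose values become the new index<br>
columns: the column whose values become the new column labels<br>
values: the column whose values fill the new table<br>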


In [16]:
df.pivot(index='fff', columns='bbb', values='baa')

bbb  P  Q  R
fff
one  2  3  4
two  5  6  7
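`pivot` works above because every (`fff`, `bbb`) pair occurs exactly once. As a hedged sketch of what happens otherwise, the hypothetical frame `dupes` below repeats the pair (`one`, `P`): `pivot` raises a `ValueError`, while `pivot_table` aggregates the duplicates (the mean is chosen here purely for illustration).

```python
import pandas as pd

# Hypothetical data in which the pair ("one", "P") appears twice
dupes = pd.DataFrame({"fff": ["one", "one", "two"],
                      "bbb": ["P", "P", "Q"],
                      "baa": [2, 4, 6]})

# pivot cannot decide which value belongs in the ("one", "P") cell
try:
    dupes.pivot(index="fff", columns="bbb", values="baa")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table resolves duplicates with an aggregation function
summary = dupes.pivot_table(index="fff", columns="bbb",
                            values="baa", aggfunc="mean")
print(summary)
```

Cells with no matching rows, such as (`one`, `Q`), are filled with NaN.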


# Reading and writing data

Read data with read_csv

In [17]:
pd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/example.csv')  

   TOK    UPDATE      DATE   SHOT  TIME  AUXHEAT  PHASE  STATE  PGASA  PGASZ  ...    WFICRH  MEFF  ISEQ        WTH       WTOT      DWTOT          PL        PLTH  TAUTOT   TAUTH
0  JET  20031201  20001006  53521  10.0     NBIC  HSELM  TRANS    2.0    1.0  ...  731900.0   2.0  NONE  3715000.0  5381000.0  1282000.0  12970000.0  12100000.0  0.4445  0.2194


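**Customize missing-value markers with na_values**

The first read below uses the defaults, so the markers `--` and `na` survive as ordinary strings. The second read passes them in `na_values`; with `keep_default_na=True` (the default) they are added to pandas' built-in list of missing-value markers.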

In [18]:
df = pd.read_csv('https://raw.githubusercontent.com/dataoptimal/posts/master/data%20cleaning%20with%20python%20and%20pandas/property%20data.csv')
df

           PID  ST_NUM     ST_NAME  OWN_OCCUPIED  NUM_BEDROOMS  NUM_BATH  SQ_FT
0  100001000.0   104.0      PUTNAM             Y             3         1   1000
1  100002000.0   197.0   LEXINGTON             N             3       1.5     --
2  100003000.0     NaN   LEXINGTON             N           NaN         1    850
3  100004000.0   201.0    BERKELEY            12             1       NaN    700
4          NaN   203.0    BERKELEY             Y             3         2   1600
5  100006000.0   207.0    BERKELEY             Y           NaN         1    800
6  100007000.0     NaN  WASHINGTON           NaN             2    HURLEY    950
7  100008000.0   213.0     TREMONT             Y             1         1    NaN
8  100009000.0   215.0     TREMONT             Y            na         2   1800


In [19]:
df = pd.read_csv(
    'https://raw.githubusercontent.com/dataoptimal/posts/master/data%20cleaning%20with%20python%20and%20pandas/property%20data.csv',
    keep_default_na=True,
    na_values=['na', '--']
)
df

           PID  ST_NUM     ST_NAME  OWN_OCCUPIED  NUM_BEDROOMS  NUM_BATH   SQ_FT
0  100001000.0   104.0      PUTNAM             Y           3.0         1  1000.0
1  100002000.0   197.0   LEXINGTON             N           3.0       1.5     NaN
2  100003000.0     NaN   LEXINGTON             N           NaN         1   850.0
3  100004000.0   201.0    BERKELEY            12           1.0       NaN   700.0
4          NaN   203.0    BERKELEY             Y           3.0         2  1600.0
5  100006000.0   207.0    BERKELEY             Y           NaN         1   800.0
6  100007000.0     NaN  WASHINGTON           NaN           2.0    HURLEY   950.0
7  100008000.0   213.0     TREMONT             Y           1.0         1     NaN
8  100009000.0   215.0     TREMONT             Y           NaN         2  1800.0
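The same effect can be reproduced offline. The sketch below reads a small in-memory CSV (made-up rows that only imitate the markers in the property data) with and without `na_values`, and counts the missing values each read detects.

```python
import io
import pandas as pd

# Made-up sample imitating the markers seen above: "na", "--", "n/a" and a blank
csv_text = """PID,NUM_BEDROOMS,SQ_FT
1,3,1000
2,na,--
3,n/a,850
4,,700
"""

# Defaults: only built-in markers such as "n/a" and the empty field become NaN
raw = pd.read_csv(io.StringIO(csv_text))

# Extra markers: "na" and "--" are now treated as missing too
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])

print(raw.isna().sum())
print(cleaned.isna().sum())
print(cleaned.dtypes)
```

Once the stray markers are recognized as missing, `NUM_BEDROOMS` and `SQ_FT` can be parsed as floating-point columns instead of strings, which matches the change from `3` to `3.0` in the output above.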


Write data with to_csv

In [20]:
df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
                   'mask': ['red', 'purple'],
                   'weapon': ['sai', 'bo staff']})

df.to_csv(index=False)

'name,mask,weapon\r\nRaphael,red,sai\r\nDonatello,purple,bo staff\r\n'

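**Set the compression format with the `compression` argument** (`compression_opts` is only the variable name the pandas documentation uses for its options dict)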

In [21]:
df.to_csv('out.zip', compression='zip')
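Besides a plain string such as `'zip'`, `compression` also accepts a dict, which is where the name `compression_opts` comes from. A minimal sketch, assuming a temporary directory as the output location and the hypothetical frame name `turtles`, that writes a zip archive with a chosen inner file name and reads it back:

```python
import os
import tempfile
import pandas as pd

turtles = pd.DataFrame({"name": ["Raphael", "Donatello"],
                        "mask": ["red", "purple"],
                        "weapon": ["sai", "bo staff"]})

# method picks the format; archive_name is the file name inside the zip
compression_opts = dict(method="zip", archive_name="out.csv")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.zip")
    turtles.to_csv(path, index=False, compression=compression_opts)

    # read_csv infers zip compression from the .zip extension
    restored = pd.read_csv(path)

print(restored)
```

Passing `index=False` keeps the row index out of the file, so the restored frame has the same three columns as the original.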