In [1]:
import pandas as pd
import numpy as np

### Problems caused by Chinese file names

Opening a file whose name contains Chinese characters directly may trigger a bug in Pandas.

In [2]:
data = pd.read_csv('数据透视表.csv')

OSError: Initializing from file failed

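A safer approach is to open the file with Python's built-in `open` and pass the file object to the Pandas function.
However, this may run into a different problem caused by the file's encoding.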

In [3]:
with open('数据透视表.csv', 'rb') as inf:
    data = pd.read_csv(inf)
data.head()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 0: invalid start byte

The encoding can be specified explicitly.

In [4]:
with open('数据透视表.csv', 'rb') as inf:
    data = pd.read_csv(inf, encoding='cp936')
data.head()

Unnamed: 0,订购日期,所属区域,产品类别,数量,销售额,成本
0,3/21/2007,苏州,宠物用品,16,19269.69,18982.85
1,4/28/2007,苏州,宠物用品,40,39465.17,40893.08
2,4/28/2007,苏州,宠物用品,20,21015.94,22294.09
3,5/31/2007,苏州,宠物用品,20,23710.26,24318.37
4,6/13/2007,苏州,宠物用品,16,20015.07,20256.69
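When the encoding is not known in advance, one option is to try a short list of candidates. The following is only a minimal sketch: the helper name `read_csv_with_fallback`, the candidate list and the small demo file are assumptions made for illustration and are not part of the original notebook.

```python
import io
import os
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp936")):
    """Decode the file with the first encoding that works, then parse it."""
    raw_bytes = Path(path).read_bytes()
    for enc in encodings:
        try:
            text = raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
        return pd.read_csv(io.StringIO(text))
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo data (hypothetical): a small CSV saved with the cp936 encoding.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", encoding="cp936") as out:
    out.write("所属区域,数量\n苏州,16\n苏州,40\n")

demo = read_csv_with_fallback(demo_path)
print(demo)
```

The strict `utf-8` codec is tried first because it rejects most non-UTF-8 input, whereas `cp936` accepts many byte sequences and could silently produce garbled text.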


### Type inference problems
Pandas type inference may produce the wrong result.

In [5]:
xls = pd.ExcelFile('test-data1.xlsx')
data = xls.parse('Sheet1')
data.head()

Unnamed: 0,姓名,得分
0,小明,98
1,小红,87
2,小白,未考核


The table looks normal, but inspecting the column shows that its data type is wrong.

In [6]:
data['得分']

0     98
1     87
2    未考核
Name: 得分, dtype: object

Solution 1: replace all invalid values; Pandas will then automatically use an appropriate type.

In [7]:
data['得分']=data['得分'].replace('未考核', np.nan)
data

Unnamed: 0,姓名,得分
0,小明,98.0
1,小红,87.0
2,小白,


Solution 2: when loading the file, specify the values that should be replaced with NA.

In [8]:
data = xls.parse('Sheet1', na_values='未考核')
data['得分']

0    98.0
1    87.0
2     NaN
Name: 得分, dtype: float64
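If the invalid values are not known beforehand, or if the column stays `object` after `replace` (this may depend on the Pandas version), an explicit conversion with `pd.to_numeric` is another option. This is a sketch on a small hand-built frame that mimics the sheet above; the variable name `scores` is chosen here for illustration.

```python
import pandas as pd

# Hand-built stand-in for the sheet shown above.
scores = pd.DataFrame(
    {"姓名": ["小明", "小红", "小白"], "得分": [98, 87, "未考核"]}
)

# errors="coerce" turns every value that cannot be parsed as a number into NaN.
scores["得分"] = pd.to_numeric(scores["得分"], errors="coerce")
print(scores["得分"].dtype)
```

Note that `errors="coerce"` hides every unparseable value, not only '未考核', so it is worth checking which rows became NaN.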

### Loading complex tables

In [27]:
data = pd.read_table('test-data2.txt', index_col='Initial release date')

In [28]:
data

Unnamed: 0_level_0,Code name,Version number,API level,Security patches
Initial release date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"September 23, 2008",(No codename),1.0,1,Unsupported
"February 9, 2009","(Internally known as ""Petit Four"")",1.1,2,Unsupported
"April 27, 2009",Cupcake,1.5,3,Unsupported
"September 15, 2009",Donut,1.6,4,Unsupported
"October 26, 2009",Eclair,2.0 – 2.1,5 – 7,Unsupported
"May 20, 2010",Froyo,2.2 – 2.2.3,8,Unsupported
"December 6, 2010",Gingerbread,2.3 – 2.3.7,9 – 10,Unsupported
"February 22, 2011",Honeycomb,3.0 – 3.2.6,11 – 13,Unsupported
"October 18, 2011",Ice Cream Sandwich,4.0 – 4.0.4,14 – 15,Unsupported
"July 9, 2012",Jelly Bean,4.1 – 4.3.1,16 – 18,Unsupported


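Use the `skipfooter` parameter to ignore the footnotes. Because the C parser does not support this parameter, Pandas falls back to the Python parser, which can cause other problems. The first is that the encoding goes wrong on Windows, so the encoding is explicitly set to 'utf-8'.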

In [41]:
data = pd.read_table('test-data2.txt', skipfooter=2, encoding='utf-8')
data



Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Code name
(No codename),1.0,"September 23, 2008",1,Unsupported
"(Internally known as ""Petit Four"")",1.1,"February 9, 2009",2,Unsupported
Cupcake,1.5,"April 27, 2009",3,Unsupported
Donut,1.6,"September 15, 2009",4,Unsupported
Eclair,2.0 – 2.1,"October 26, 2009",5 – 7,Unsupported
Froyo,2.2 – 2.2.3,"May 20, 2010",8,Unsupported
Gingerbread,2.3 – 2.3.7,"December 6, 2010",9 – 10,Unsupported
Honeycomb,3.0 – 3.2.6,"February 22, 2011",11 – 13,Unsupported
Ice Cream Sandwich,4.0 – 4.0.4,"October 18, 2011",14 – 15,Unsupported
Jelly Bean,4.1 – 4.3.1,"July 9, 2012",16 – 18,Unsupported


The header was clearly parsed incorrectly, so the header is specified manually instead.

In [45]:
data = pd.read_table('test-data2.txt', skipfooter=2, encoding='utf-8', header=None, skiprows=1)
data.columns='Code name	Version number	Initial release date	API level	Security patches'.split('\t')
data



Unnamed: 0,Code name,Version number,Initial release date,API level,Security patches
0,(No codename),1.0,"September 23, 2008",1,Unsupported
1,"(Internally known as ""Petit Four"")",1.1,"February 9, 2009",2,Unsupported
2,Cupcake,1.5,"April 27, 2009",3,Unsupported
3,Donut,1.6,"September 15, 2009",4,Unsupported
4,Eclair,2.0 – 2.1,"October 26, 2009",5 – 7,Unsupported
5,Froyo,2.2 – 2.2.3,"May 20, 2010",8,Unsupported
6,Gingerbread,2.3 – 2.3.7,"December 6, 2010",9 – 10,Unsupported
7,Honeycomb,3.0 – 3.2.6,"February 22, 2011",11 – 13,Unsupported
8,Ice Cream Sandwich,4.0 – 4.0.4,"October 18, 2011",14 – 15,Unsupported
9,Jelly Bean,4.1 – 4.3.1,"July 9, 2012",16 – 18,Unsupported
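The same idea can be written in a single call by passing the column names through `names` and selecting the Python parser explicitly with `engine='python'`, which avoids the fallback warning. The sketch below uses a small in-memory sample rather than `test-data2.txt`; the sample text and its two footnote lines are made up for illustration.

```python
import io

import pandas as pd

# Made-up tab-separated sample: one header line, two data rows, two footnotes.
raw = (
    "Code name\tVersion number\tInitial release date\tAPI level\tSecurity patches\n"
    "Cupcake\t1.5\tApril 27, 2009\t3\tUnsupported\n"
    "Donut\t1.6\tSeptember 15, 2009\t4\tUnsupported\n"
    "Footnote line 1\n"
    "Footnote line 2\n"
)

columns = [
    "Code name",
    "Version number",
    "Initial release date",
    "API level",
    "Security patches",
]

demo = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    engine="python",  # skipfooter is only supported by the Python parser
    header=None,      # do not let Pandas infer the header
    skiprows=1,       # skip the original header line
    skipfooter=2,     # drop the two footnote lines
    names=columns,    # supply the column names directly
)
print(demo)
```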


Another workaround is to read only the number of rows needed, which avoids `skipfooter` altogether.

In [48]:
data = pd.read_table('test-data2.txt', nrows=16)
data

Unnamed: 0,Code name,Version number,Initial release date,API level,Security patches
0,(No codename),1.0,"September 23, 2008",1,Unsupported
1,"(Internally known as ""Petit Four"")",1.1,"February 9, 2009",2,Unsupported
2,Cupcake,1.5,"April 27, 2009",3,Unsupported
3,Donut,1.6,"September 15, 2009",4,Unsupported
4,Eclair,2.0 – 2.1,"October 26, 2009",5 – 7,Unsupported
5,Froyo,2.2 – 2.2.3,"May 20, 2010",8,Unsupported
6,Gingerbread,2.3 – 2.3.7,"December 6, 2010",9 – 10,Unsupported
7,Honeycomb,3.0 – 3.2.6,"February 22, 2011",11 – 13,Unsupported
8,Ice Cream Sandwich,4.0 – 4.0.4,"October 18, 2011",14 – 15,Unsupported
9,Jelly Bean,4.1 – 4.3.1,"July 9, 2012",16 – 18,Unsupported
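Instead of hard-coding `nrows`, the value can be derived from the number of lines in the file. This is a sketch on an in-memory sample; the sample text and the assumption of exactly one header line and two footnote lines are illustrative only.

```python
import io

import pandas as pd

# Made-up tab-separated sample: one header line, two data rows, two footnotes.
raw = (
    "Code name\tVersion number\tAPI level\n"
    "Cupcake\t1.5\t3\n"
    "Donut\t1.6\t4\n"
    "Footnote line 1\n"
    "Footnote line 2\n"
)

header_lines = 1  # assumed number of header lines
footer_lines = 2  # assumed number of footnote lines

# Count all lines, then subtract the header and the footnotes.
total_lines = sum(1 for _ in io.StringIO(raw))
data_rows = total_lines - header_lines - footer_lines

# nrows works with the default C parser, so no fallback is needed.
demo = pd.read_csv(io.StringIO(raw), sep="\t", nrows=data_rows)
print(demo)
```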
