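# Reading CSV Files
Let's practice reading CSV files with this small student scores dataset. As introduced earlier, `read_csv()` loads data from a CSV file into a Pandas DataFrame. All you need to specify is the file path. I stored `student_scores.csv` in the same directory as this Jupyter notebook, so the filename alone is enough.

Run each cell as you go through this Jupyter notebook.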

In [1]:
import pandas as pd

df = pd.read_csv('student_scores.csv')

`head()` is a useful method you can call on a DataFrame to display its first few rows. Let's use it to see what the data looks like.

In [2]:
df.head()

,ID,Name,Attendance,HW,Test1,Project1,Test2,Project2,Final
0,27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0
1,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0
2,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
3,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0
4,27492,Rick,0.32,0.85,98.0,100.0,73.0,82.0,88.0
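
To try `read_csv()` and `head()` without the course file, here is a minimal sketch that reads a small in-memory CSV through `io.StringIO`. The sample rows are copied from the output above; `csv_text`, `sample_df`, and `first_two` are hypothetical names used only for illustration.

```python
import io
import pandas as pd

# Sample rows copied from the output above; an in-memory stand-in for student_scores.csv.
csv_text = """ID,Name,Attendance,HW,Test1,Project1,Test2,Project2,Final
27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0
30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0
39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 when n is omitted).
first_two = sample_df.head(2)
print(first_two)
print(sample_df.shape)
```

Passing a number to `head()` is handy when five rows are more (or fewer) than you want to inspect.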


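Remember that CSV stands for comma-separated values, but the values can actually be separated by other characters, such as tabs, spaces, or colons. For example, if a file is separated by colons, you can still use `read_csv()` by passing the `sep` parameter.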

In [3]:
df = pd.read_csv('student_scores.csv', sep=':')
df.head()

,"ID,Name,Attendance,HW,Test1,Project1,Test2,Project2,Final"
0,"27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0"
1,"30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0"
2,"39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0"
3,"28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0"
4,"27492,Rick,0.32,0.85,98.0,100.0,73.0,82.0,88.0"


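That clearly didn't work, because the CSV file is separated by commas. Since the file contains no colons, nothing was split, and every line was read into a single column!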
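
For contrast, here is a small sketch of a case where `sep` is the right tool. The colon-separated text below is made up for illustration (it is not the course file), and the names `colon_text`, `wrong`, and `right` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical colon-separated data (not the course file).
colon_text = "ID:Name:Final\n27604:Joe:95.0\n30572:Alex:91.0\n"

# With the default comma separator, each line ends up in a single column.
wrong = pd.read_csv(io.StringIO(colon_text))

# Passing sep=':' splits the fields as intended.
right = pd.read_csv(io.StringIO(colon_text), sep=":")

print(wrong.shape)
print(right.shape)
print(right.columns.tolist())
```

The same idea applies to tab-separated files, where `sep='\t'` would be the matching value.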

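## Headers
Another thing `read_csv` lets you do is specify which line of the file is the header, which determines the column labels. The header is usually the first line, but sometimes a file has extra meta information at the top and we want a different line to be the header. We can do that like this.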

In [4]:
df = pd.read_csv('student_scores.csv', header=2)
df.head()

,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0.1,91.0
0,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
1,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0
2,27492,Rick,0.32,0.85,98.0,100.0,73.0,82.0,88.0


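Here, row 2 (the third line of the file, since counting starts at 0) is used as the header, and everything above it is dropped. Because column labels must be unique, pandas renamed the second `92.0` to `92.0.1`. By default, `read_csv` uses `header=0`, which takes the first line as the column labels.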
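
The more typical use of `header` is a file with meta information above the real header line. The sketch below assumes such a file; its contents and the names `meta_text` and `with_meta` are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical file with two lines of meta information above the real header.
meta_text = (
    "exported by gradebook\n"
    "term 1\n"
    "ID,Name,Final\n"
    "27604,Joe,95.0\n"
    "30572,Alex,91.0\n"
)

# header=2 uses the third line (zero-based index 2) as the column labels
# and discards the lines above it.
with_meta = pd.read_csv(io.StringIO(meta_text), header=2)

print(with_meta.columns.tolist())
print(len(with_meta))
```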

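If the file does not include column labels, you can use `header=None` to prevent the first line of data from being misinterpreted as column labels.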

In [7]:
df = pd.read_csv('student_scores.csv', header=None)
df.head()

,0,1,2,3,4,5,6,7,8
0,ID,Name,Attendance,HW,Test1,Project1,Test2,Project2,Final
1,27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0
2,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0
3,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
4,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0


You can also specify your own column labels like this.

In [8]:
labels = ['id', 'name', 'attendance', 'hw', 'test1', 'project1', 'test2', 'project2', 'final']
df = pd.read_csv('student_scores.csv', names=labels)
df.head()

,id,name,attendance,hw,test1,project1,test2,project2,final
0,ID,Name,Attendance,HW,Test1,Project1,Test2,Project2,Final
1,27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0
2,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0
3,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
4,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0


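Notice that the original header line was kept as the first row of data. If you want to tell pandas that the file already has a header line that you are replacing, specify the row of that line like this.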

In [9]:
labels = ['id', 'name', 'attendance', 'hw', 'test1', 'project1', 'test2', 'project2', 'final']
df = pd.read_csv('student_scores.csv', header=0, names=labels)
df.head()

,id,name,attendance,hw,test1,project1,test2,project2,final
0,27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0
1,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0
2,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
3,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0
4,27492,Rick,0.32,0.85,98.0,100.0,73.0,82.0,88.0
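
The three header options covered so far can be compared side by side. This is a small sketch on made-up in-memory data; `csv_text`, `kept`, `replaced`, and `unlabeled` are hypothetical names.

```python
import io
import pandas as pd

# Hypothetical three-column sample with a header line.
csv_text = "ID,Name,Final\n27604,Joe,95.0\n30572,Alex,91.0\n"
labels = ["id", "name", "final"]

# names alone: the original header line is kept as the first data row.
kept = pd.read_csv(io.StringIO(csv_text), names=labels)

# header=0 plus names: the original header line is replaced by labels.
replaced = pd.read_csv(io.StringIO(csv_text), header=0, names=labels)

# header=None: no line is treated as a header; columns get integer labels.
unlabeled = pd.read_csv(io.StringIO(csv_text), header=None)

print(len(kept), len(replaced), len(unlabeled))
print(unlabeled.columns.tolist())
```

A stray header row also forces every column to be read as text, so checking the row count or `dtypes` is a quick way to catch this mistake.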


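## Index
Instead of using the default index (integers incrementing by 1 from 0), you can specify one or more columns to be the index of your DataFrame.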

In [10]:
df = pd.read_csv('student_scores.csv', index_col='Name')
df.head()

Name,ID,Attendance,HW,Test1,Project1,Test2,Project2,Final
Joe,27604,0.96,0.97,87.0,98.0,92.0,93.0,95.0
Alex,30572,1.0,0.84,92.0,89.0,94.0,92.0,91.0
Avery,39203,0.84,0.74,68.0,70.0,84.0,90.0,82.0
Kris,28592,0.96,1.0,82.0,94.0,90.0,81.0,84.0
Rick,27492,0.32,0.85,98.0,100.0,73.0,82.0,88.0


In [11]:
df = pd.read_csv('student_scores.csv', index_col=['Name', 'ID'])
df.head()

Name,ID,Attendance,HW,Test1,Project1,Test2,Project2,Final
Joe,27604,0.96,0.97,87.0,98.0,92.0,93.0,95.0
Alex,30572,1.0,0.84,92.0,89.0,94.0,92.0,91.0
Avery,39203,0.84,0.74,68.0,70.0,84.0,90.0,82.0
Kris,28592,0.96,1.0,82.0,94.0,90.0,81.0,84.0
Rick,27492,0.32,0.85,98.0,100.0,73.0,82.0,88.0
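
One reason to set an index is label-based lookup with `.loc`. The sketch below uses a small made-up sample rather than the course file; the variable names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical three-column sample with a header line.
csv_text = "ID,Name,Final\n27604,Joe,95.0\n30572,Alex,91.0\n"

# A single column as the index: rows can be looked up by name.
by_name = pd.read_csv(io.StringIO(csv_text), index_col="Name")
joe_final = by_name.loc["Joe", "Final"]

# Two columns as the index produce a MultiIndex; look rows up with a tuple.
by_name_id = pd.read_csv(io.StringIO(csv_text), index_col=["Name", "ID"])
alex_final = by_name_id.loc[("Alex", 30572), "Final"]

print(joe_final, alex_final)
```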


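There is much more you can do with this function alone, such as parsing dates, filling null values, and skipping rows. Many of these are separate steps we will take after `read_csv()`. We are going to modify our data in other ways, but you can always look up how to do it with this function [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).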
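
As a rough illustration of those extra options, the sketch below combines `skiprows`, `parse_dates`, and `na_values` in a single call. The data, the `missing` marker, and the variable names are all invented for this example.

```python
import io
import pandas as pd

# Hypothetical data: a note on the first line, a date column,
# and the word "missing" standing in for an absent score.
raw_text = (
    "scores exported for illustration\n"
    "Name,Date,Score\n"
    "Joe,2020-01-15,87\n"
    "Alex,2020-01-16,missing\n"
)

parsed = pd.read_csv(
    io.StringIO(raw_text),
    skiprows=1,              # skip the first line of the file
    parse_dates=["Date"],    # convert the Date column to datetime
    na_values=["missing"],   # treat "missing" as a null value
)

print(parsed.dtypes)
print(parsed["Score"].isna().sum())
```

Note that `na_values` only marks values as null while reading; filling them in is a separate step, for example with `fillna()` on the resulting DataFrame.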

## Quiz #1
Use `read_csv()` to read in `cancer_data.csv`, using an appropriate column as the index. Then call `.head()` on the DataFrame to check that it worked.

In [17]:
df_cancer = pd.read_csv('cancer_data.csv', index_col='id')
df_cancer.head()

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,...,radius_max,texture_max,perimeter_max,area_max,smoothness_max,compactness_max,concavity_max,concave_points_max,symmetry_max,fractal_dimension_max
842302,M,17.99,,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
84348301,M,11.42,20.38,77.58,386.1,,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,,0.8663,0.6869,0.2575,0.6638,0.173
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


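## Quiz #2
Use `read_csv()` to read in `powerplant_data.csv` with more descriptive column names, based on the feature descriptions on this [website](http://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant). Then call `.head()` on the DataFrame to check that it worked. *Hint: first call `read_csv()` with only the filename to see what the data looks like.*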

In [27]:
# Labels follow the order of the columns in the file: AT, V, AP, RH, PE
labels = ['Temperature', 'Exhaust Vacuum', 'Ambient Pressure', 'Relative Humidity', 'Net hourly electrical energy output']
# header=0 marks the first line as a header, which names=labels then replaces
df_powerplant = pd.read_csv('powerplant_data.csv', header=0, names=labels)
df_powerplant.head()

,Temperature,Exhaust Vacuum,Ambient Pressure,Relative Humidity,Net hourly electrical energy output
0,8.34,40.77,1010.84,90.01,480.48
1,23.64,58.49,1011.4,74.2,445.75
2,29.74,56.9,1007.15,41.91,438.76
3,19.07,49.69,1007.22,76.79,453.09
4,11.8,40.66,1017.13,97.2,464.43


# Writing CSV Files
Awesome! Now we'll save the second DataFrame, with the power plant data, to a CSV file for the next section.

In [21]:
df_powerplant.to_csv('powerplant_data_edited.csv')

Let's see if that gives us the result we expect.

In [22]:
df = pd.read_csv('powerplant_data_edited.csv')
df.head()

,Unnamed: 0,AT,V,AP,RH,PE
0,0,8.34,40.77,1010.84,90.01,480.48
1,1,23.64,58.49,1011.4,74.2,445.75
2,2,29.74,56.9,1007.15,41.91,438.76
3,3,19.07,49.69,1007.22,76.79,453.09
4,4,11.8,40.66,1017.13,97.2,464.43


What's this `Unnamed: 0` column? `to_csv()` saves the index by default unless you tell it not to. To leave the index out, pass the parameter `index=False`.

In [23]:
df_powerplant.to_csv('powerplant_data_edited.csv', index=False)

In [24]:
df = pd.read_csv('powerplant_data_edited.csv')
df.head()

,AT,V,AP,RH,PE
0,8.34,40.77,1010.84,90.01,480.48
1,23.64,58.49,1011.4,74.2,445.75
2,29.74,56.9,1007.15,41.91,438.76
3,19.07,49.69,1007.22,76.79,453.09
4,11.8,40.66,1017.13,97.2,464.43
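
The round trip can also be reproduced without touching the disk, since `to_csv()` returns the CSV text when no path is given. This is a minimal sketch on a made-up two-row frame; `df_small` and the other variable names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row frame using two of the power plant column names.
df_small = pd.DataFrame({"AT": [8.34, 23.64], "PE": [480.48, 445.75]})

# With no path, to_csv() returns the CSV text; the index is included by default.
with_index = df_small.to_csv()

# index=False leaves the index out of the output.
without_index = df_small.to_csv(index=False)

reread_extra = pd.read_csv(io.StringIO(with_index))
reread_clean = pd.read_csv(io.StringIO(without_index))

# Alternative when a file already contains the index: read column 0 back as the index.
reread_as_index = pd.read_csv(io.StringIO(with_index), index_col=0)

print(reread_extra.columns.tolist())
print(reread_clean.columns.tolist())
print(reread_as_index.columns.tolist())
```

Writing with `index=False` is usually the simpler choice when the index is just the default row numbers; `index_col=0` is more useful when the index carries meaningful labels.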
