How to import data from a CSV (comma-separated values) file into a DataFrame

In [1]:
# Let's bring in pandas to work with
import pandas as pd

# Pandas makes it easy to turn a CSV into a DataFrame: we just call read_csv()
df = pd.read_csv('C:\\Users\\asus\\Desktop\\Coursera\\Applied Data Science with Python\\(1) Introduction to Data Science in Python\\dataset\\Admission_Predict.csv')

# Note: in NumPy, the counterpart for reading a CSV file is np.genfromtxt()

# And let's look at the first few rows
df.head()

,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65
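
To make the contrast between read_csv() and np.genfromtxt() concrete, here is a minimal sketch that loads the same text with both. The two-row csv_text sample and the names df_demo and arr_demo are assumptions made up for illustration; they are not the real Admission_Predict.csv.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file
csv_text = "Serial No.,GRE Score,CGPA\n1,337,9.65\n2,324,8.87\n"

# pandas: labelled columns, per-column dtypes, and an auto-generated 0-based index
df_demo = pd.read_csv(io.StringIO(csv_text))

# NumPy: a plain 2-D float array; the header row has to be skipped explicitly
arr_demo = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(df_demo.index.tolist())  # row labels generated by pandas
print(arr_demo.shape)          # (rows, columns) of the NumPy array
```

The DataFrame keeps the header as column labels, while the NumPy array keeps only the numbers.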


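Notice that the default index starts at 0, whereas the students' serial numbers start at 1. This is because pandas generated a new index of its own. We can use index_col to make the Serial No. column the index instead.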

In [2]:
df = pd.read_csv('C:\\Users\\asus\\Desktop\\Coursera\\Applied Data Science with Python\\(1) Introduction to Data Science in Python\\dataset\\Admission_Predict.csv', index_col=0)
df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
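
index_col accepts a column label as well as a position, which can be easier to read than a bare 0. A small sketch, again using a hypothetical in-memory sample (csv_text) rather than the real file:

```python
import io

import pandas as pd

# Hypothetical sample with the same first column as the admissions data
csv_text = "Serial No.,GRE Score,CGPA\n1,337,9.65\n2,324,8.87\n"

# Select the index column by position ...
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

# ... or by its label; both produce the same DataFrame here
by_name = pd.read_csv(io.StringIO(csv_text), index_col="Serial No.")

print(by_name.index.name)
print(by_position.equals(by_name))
```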


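Notice that two of the columns are named SOP and LOR. Not everyone will know what these stand for, so let's rename them. In pandas we use the rename() function. It takes a parameter called columns, to which we pass a dictionary whose keys are the old column names and whose values are the new column names.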

In [3]:
new_df = df.rename(columns={'GRE Score': 'GRE Score', 'TOEFL Score': 'TOEFL ',
                            'University Rating': 'University Rating',
                            'SOP': 'Statement of Purpose', 'LOR': 'Letter of Recommendation',
                            'CGPA': 'CGPA', 'Research': 'Research',
                            'Chance of Admit': 'Chance of Admit'})
new_df.head()

Serial No.,GRE Score,TOEFL,University Rating,Statement of Purpose,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


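We can see that only SOP changed; LOR did not. Why? To check exactly what the column names are, we can use the DataFrame's columns attribute, which returns an Index object holding the column labels.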

In [4]:
new_df.columns

Index(['GRE Score', 'TOEFL ', 'University Rating', 'Statement of Purpose',
       'LOR ', 'CGPA', 'Research', 'Chance of Admit '],
      dtype='object')

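There is a space to the right of LOR, which is why .rename() had no effect on it. There are several ways to fix this; one of them is to include the trailing space in the corresponding dictionary key.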
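
Stray whitespace is easy to miss by eye, so it can help to list the affected names programmatically. A minimal sketch, assuming a small hypothetical frame (demo) that carries the same padded names as the admissions data:

```python
import pandas as pd

# Hypothetical empty frame whose column names mimic the ones shown above
demo = pd.DataFrame(columns=["GRE Score", "LOR ", "CGPA", "Chance of Admit "])

# A name is "padded" if stripping whitespace changes it
padded = [name for name in demo.columns if name != name.strip()]

# repr() makes the trailing spaces visible inside the quotes
print([repr(name) for name in padded])
```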


In [5]:
new_df = new_df.rename(columns={'LOR ': 'Letter of Recommendation'})
new_df.head()

Serial No.,GRE Score,TOEFL,University Rating,Statement of Purpose,Letter of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


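But this is fragile: what if next time it's a tab, or two spaces? Another approach is to clean the names with the strip() function. We pass str.strip as the mapper parameter, and use the axis parameter to say whether it applies to the columns or to the index (row labels).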

In [6]:
new_df = df.rename(mapper=str.strip, axis='columns')

# Let's take a look at results
new_df.columns

# The surrounding whitespace has been stripped from new_df's column names

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR', 'CGPA',
       'Research', 'Chance of Admit'],
      dtype='object')
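
The same cleanup also covers the tab and double-space cases raised above, because str.strip removes any leading or trailing whitespace. The sketch below uses a hypothetical one-row frame (demo) with deliberately messy names. It also shows that passing the function through columns= is equivalent to mapper= with axis='columns'.

```python
import pandas as pd

# Hypothetical frame with a trailing space, a leading tab, and two trailing spaces
demo = pd.DataFrame(
    [[4.5, 0.92, 1]],
    columns=["LOR ", "\tChance of Admit  ", "Research"],
)

# Two equivalent spellings of the same rename
cleaned = demo.rename(columns=str.strip)
same = demo.rename(mapper=str.strip, axis="columns")

print(list(cleaned.columns))
```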

In [7]:
# .rename() does not modify the DataFrame it is called on: df is unchanged here, and only the copy, new_df, has the cleaned names.

df.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA',
       'Research', 'Chance of Admit '],
      dtype='object')
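
If the cleaned names should stick, the usual pattern is to assign the result of rename() back to the same variable. A minimal sketch, using a hypothetical one-row frame (demo) in place of df:

```python
import pandas as pd

# Hypothetical frame with padded column names
demo = pd.DataFrame([[4.5, 0.92]], columns=["LOR ", "Chance of Admit "])

# rename() returns a new DataFrame; demo itself is not touched
renamed = demo.rename(columns=str.strip)
names_after_rename = list(demo.columns)  # still the padded names

# Reassigning keeps the change under the original variable name
demo = demo.rename(columns=str.strip)

print(names_after_rename)
print(list(demo.columns))
```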

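We can also use the df.columns attribute and change the column names directly by assigning it a list of column names. The list must contain one name for every column. This modifies the original DataFrame in place, and it is very efficient when there are many columns and you only want to change a few. This approach is also unaffected by subtle errors in the column names, such as the trailing-space problem we just ran into.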

In [8]:
# As an example, let's change all of the column names to lower case. First we need to get our list
cols = list(df.columns)

# Then a little list comprehension
cols = [x.lower().strip() for x in cols]

# Then we just overwrite what is already in the .columns attribute
df.columns = cols

# And take a look at our results
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
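
As an alternative to the list comprehension, the column Index exposes vectorized string methods through its .str accessor. The sketch below applies them to a hypothetical one-row frame (demo). It also shows that assigning a list of the wrong length raises a ValueError, which is the main thing to watch for with this approach.

```python
import pandas as pd

# Hypothetical frame with mixed-case, padded column names
demo = pd.DataFrame(
    [[337, 4.5, 0.92]],
    columns=["GRE Score", "LOR ", "Chance of Admit "],
)

# Strip and lower-case every column name in one chained expression
demo.columns = demo.columns.str.strip().str.lower()
print(list(demo.columns))

# The new labels must match the number of columns exactly
try:
    demo.columns = ["only one name"]
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("rejected:", err)
```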


In this lecture, you've learned how to import a CSV file into a pandas DataFrame object, and how to do some
basic data cleaning to the column names. The CSV file import mechanisms in pandas have lots of different
options, and you really need to learn these in order to be proficient at data manipulation. Once you have
set up the format and shape of a DataFrame, you have a solid start to further actions such as conducting
data analysis and modeling.
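
To give a flavour of those options, here is a small sketch combining a few commonly used read_csv() parameters. The semicolon-delimited sample text, its comment line, and the "missing" marker are all assumptions invented for this example.

```python
import io

import pandas as pd

# Hypothetical export: one junk line on top, semicolons as delimiters,
# and the word "missing" standing in for an absent value
csv_text = (
    "exported by a hypothetical admissions system\n"
    "Serial No.;GRE Score;CGPA;Research\n"
    "1;337;9.65;1\n"
    "2;missing;8.87;1\n"
)

df_opts = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                                       # non-comma delimiter
    skiprows=1,                                    # skip the junk first line
    usecols=["Serial No.", "GRE Score", "CGPA"],   # load only these columns
    index_col="Serial No.",                        # use serial number as index
    na_values=["missing"],                         # treat this marker as NaN
)

print(df_opts)
```

Because one GRE score is missing, pandas stores that column as floats with a NaN in the second row.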

Now, there are other data sources you can load directly into dataframes as well, including HTML web pages,
databases, and other file formats. But the CSV is by far the most common data format you'll run into, and an
important one to know how to manipulate in pandas.
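
As one illustration of a non-CSV source, the sketch below loads a query result from an in-memory SQLite database with pd.read_sql_query(). The table name admissions and its columns are hypothetical and exist only inside this example.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database with a hypothetical table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (serial_no INTEGER, gre_score INTEGER)")
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?)",
    [(1, 337), (2, 324)],
)

# Load the query result straight into a DataFrame, indexed by serial_no
df_sql = pd.read_sql_query(
    "SELECT * FROM admissions ORDER BY serial_no",
    conn,
    index_col="serial_no",
)
conn.close()

print(df_sql)
```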