The DataFrame data structure is the heart of the pandas library. It is the primary object that you'll be working 
with in data analysis and cleaning tasks.

The DataFrame is conceptually a two-dimensional Series object, where there's an index and multiple columns of 
content, with each column having a label. In fact, the distinction between a column and a row is really only a 
conceptual one, and you can think of the DataFrame itself as simply a two-axis labeled array.

In [1]:
import pandas as pd

In [2]:
# I'm going to jump in with an example. Let's create three school records for students and their 
# class grades. I'll create each as a Series which has a student name, the class name, and the score. 

record1 = pd.Series({'Name': 'Alice',
                        'Class': 'Physics',
                        'Score': 85})
record2 = pd.Series({'Name': 'Jack',
                        'Class': 'Chemistry',
                        'Score': 82})
record3 = pd.Series({'Name': 'Helen',
                        'Class': 'Biology',
                        'Score': 90})

In [3]:
# Like a Series, the DataFrame object is indexed. Here I'll use a group of Series, where each Series 
# represents a row of data. Just like the Series function, we can pass in our individual items
# in an array, and we can pass in our index values as a second argument

df = pd.DataFrame([record1, record2, record3],
                  index=['school1', 'school2', 'school1'])

# And just like the Series we can use the head() function to see the first several rows of the
# dataframe, including indices from both axes, and we can use this to verify the columns and the rows
df.head()

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90
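
If you want to check the two labeled axes directly rather than reading them off the table, here is a minimal sketch. It rebuilds the same three records so it runs on its own; the names `df_axes`, `row_labels` and `column_labels` are just illustrative.

```python
import pandas as pd

# Rebuild the three student records used above, one Series per row
rows = [pd.Series({'Name': 'Alice', 'Class': 'Physics', 'Score': 85}),
        pd.Series({'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82}),
        pd.Series({'Name': 'Helen', 'Class': 'Biology', 'Score': 90})]
df_axes = pd.DataFrame(rows, index=['school1', 'school2', 'school1'])

row_labels = list(df_axes.index)       # labels along the row axis
column_labels = list(df_axes.columns)  # labels along the column axis
print(row_labels)
print(column_labels)
print(df_axes.shape)                   # (number of rows, number of columns)
```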


In [4]:
# Another approach is to create a DataFrame from a list of dictionaries, where each dictionary
# represents one row of the DataFrame

students = [{'Name': 'Alice',
             'Class': 'Physics',
             'Score': 85},
            {'Name': 'Jack',
             'Class': 'Chemistry',
             'Score': 82},
            {'Name': 'Helen',
             'Class': 'Biology',
             'Score': 90}]

# Then we pass this list of dictionaries into the DataFrame function

df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# And let's print the head again

df.head()

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90


In [5]:
# As with a Series, we can extract data using .iloc and .loc. Because the DataFrame is two-dimensional,
# passing a single value to .loc returns a Series (if only one row matches).

df.loc['school2']

Name          Jack
Class    Chemistry
Score           82
Name: school2, dtype: object
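
The cell above uses `.loc`, which looks rows up by label. As a rough sketch of the position-based counterpart, `.iloc`, the block below rebuilds the same data under the illustrative name `df_pos` and fetches rows by integer position. Position is also the only way to tell the two `school1` rows apart.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df_pos = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

by_label = df_pos.loc['school2']   # label-based lookup
by_position = df_pos.iloc[1]       # position-based lookup: the second row
print(by_label.equals(by_position))

# Both 'school1' rows share a label, but each has its own position
first_school1 = df_pos.iloc[0]
second_school1 = df_pos.iloc[2]
print(first_school1['Name'], second_school1['Name'])
```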

In [6]:
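# Importantly, both the index (the DataFrame's y axis) and the column names (its x axis) can be
# non-unique. In this example, two rows share the label 'school1'.
# If more than one row matches, .loc returns a sub-DataFrame made up of those rows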

df.loc['school1']

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school1,Helen,Biology,90


In [7]:
# We can check the data type of the return value using Python's type() function.

type(df.loc['school2'])

pandas.core.series.Series

In [8]:
type(df.loc['school1'])

pandas.core.frame.DataFrame
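
Because the return type depends on how many rows match, code that needs a DataFrame every time can pass the label inside a list. This is a small sketch on the same data, with `df_sel` as an illustrative name.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df_sel = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

single = df_sel.loc['school2']       # scalar label, one match: a Series
as_frame = df_sel.loc[['school2']]   # list of labels: always a DataFrame
print(type(single))
print(type(as_frame))
```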

A DataFrame lets you quickly look up data by row and column. For example, to list the names of the students at school1, pass two arguments to .loc: the row index and the column name.

In [9]:
df.loc['school1', 'Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

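What if we want to select a single column? There are several ways. For example, we can transpose the DataFrame with the .T attribute. Note that .T is an attribute that returns the transposed result; it does not change df.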

In [10]:
df.T

Unnamed: 0,school1,school2,school1
Name,Alice,Jack,Helen
Class,Physics,Chemistry,Biology
Score,85,82,90


In [11]:
df.T.loc['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

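Since iloc and loc are used for row selection, pandas reserves the indexing operator for column selection. In a pandas DataFrame, columns always have names, so this operation is always label-based. Because it is always label-based, it avoids the ambiguity that [] indexing can cause on a Series.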

In [12]:
df['Name']

# Using df.loc with a column name instead, for example df.loc['Name'], raises an error

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object
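
To see that error for yourself, here is a minimal sketch on the same data (the names `df_cols` and `lookup_failed` are illustrative). With a single argument, `.loc` searches the row index, and `'Name'` is not a row label, so pandas raises a `KeyError`.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df_cols = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

try:
    df_cols.loc['Name']        # looks for a ROW labeled 'Name'
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)

name_column = df_cols['Name']  # the indexing operator looks for a COLUMN
print(name_column.tolist())
```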

In [13]:
# Note that the result of extracting a single column is a Series object
type(df['Name'])

pandas.core.series.Series

In [14]:
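# Since the result of the indexing operator is either a DataFrame or a Series, you can chain
# iloc and loc with it. For example, we can first use .loc to select all rows for school1,
# then extract the Name column from those rows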

df.loc['school1']['Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

In [15]:
# If you get confused, use type to check the responses from resulting operations

print(type(df.loc['school1'])) #should be a DataFrame
print(type(df.loc['school1']['Name'])) #should be a Series

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


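Chaining operations like this has drawbacks. A chained selection tends to make pandas return a copy of the DataFrame rather than the DataFrame itself, so it is best avoided. When you are only selecting data, this is somewhat slower but otherwise harmless. When you want to change data, however, it leads to errors, because the change is applied to the copy rather than to the original.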
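
Here is a small sketch of that pitfall on the same data, using the illustrative name `df_chain`. The chained assignment writes to a temporary copy, so the original scores stay as they were, while a single `.loc` call with both row and column arguments updates the DataFrame itself. Warnings are silenced only to keep the output short; pandas normally warns about the chained form.

```python
import warnings
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df_chain = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    # Chained: .loc builds a copy of the matching rows, then [] assigns into that copy
    df_chain.loc['school1']['Score'] = 100
after_chained = df_chain['Score'].tolist()
print(after_chained)   # the original is unchanged

# Single .loc call: the assignment reaches the original DataFrame
df_chain.loc['school1', 'Score'] = 100
after_single = df_chain['Score'].tolist()
print(after_single)
```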

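Another approach: we know that .loc does row selection, and it can take two arguments, the row index and a list of column names. .loc also supports slicing.

To select all rows, use : to mean every item from start to end, then add the second argument: a list of column names, or a single column name as a string if you want just one column.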


In [16]:
# Here's an example, where we ask for all the names and scores for all schools using the .loc operator.

df.loc[:,['Name', 'Score']]

Unnamed: 0,Name,Score
school1,Alice,85
school2,Jack,82
school1,Helen,90
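
Slicing works on the column axis too. The sketch below uses the same data under the illustrative name `df_slice`. It selects a run of adjacent columns with a label slice, which includes both endpoints, and then a single column by passing a plain string.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df_slice = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# Label slices include both endpoints, so this keeps Class and Score
class_to_score = df_slice.loc[:, 'Class':'Score']
print(list(class_to_score.columns))

# A single column name as a string returns a Series
names = df_slice.loc[:, 'Name']
print(type(names))
```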


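Use the drop function to remove data from a Series or DataFrame. It takes one required argument, the index (row label) to remove. This can cause another surprise: drop does not change the DataFrame itself but returns a copy of the DataFrame with the data removed.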

In [17]:
df.drop('school1')

Unnamed: 0,Name,Class,Score
school2,Jack,Chemistry,82


In [18]:
# But our original DataFrame has not been changed

df

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90


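drop has two optional parameters worth knowing. The first is inplace: if set to True, the data is removed from the original DataFrame instead of a copy being returned. The second is axis,
which says which axis to drop from. The default is 0, the row axis. Set it to 1 to drop columns instead.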

In [19]:
# Make a copy of the DataFrame with .copy()

copy_df = df.copy()

# Now let's drop the name column in this copy

copy_df.drop("Name", inplace=True, axis=1)
copy_df

Unnamed: 0,Class,Score
school1,Physics,85
school2,Chemistry,82
school1,Biology,90
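
If you would rather not modify anything in place, a sketch of the non-mutating form is below (names such as `df_drop` are illustrative). The `columns` keyword of `drop` is an alternative spelling of `axis=1`, and both return a new DataFrame while leaving the original alone.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df_drop = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

without_name = df_drop.drop(columns='Name')   # same effect as drop('Name', axis=1)
same_result = without_name.equals(df_drop.drop('Name', axis=1))
print(list(without_name.columns))
print(same_result)
print('Name' in df_drop.columns)              # the original still has the column
```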


In [20]:
# Another way to drop a column is to use the indexing operator with del. This acts directly
# on the DataFrame itself

del copy_df['Class']
copy_df

Unnamed: 0,Score
school1,85
school2,82
school1,90


In [21]:
# The indexing operator can also add a column to a DataFrame. For example, to add a column
# called class ranking with a default value of None:

df['ClassRanking'] = None
df

Unnamed: 0,Name,Class,Score,ClassRanking
school1,Alice,Physics,85,
school2,Jack,Chemistry,82,
school1,Helen,Biology,90,
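
The new column does not have to start out empty. As one possible illustration, and not something the notebook itself does, the sketch below fills `ClassRanking` by ranking the three scores from highest to lowest. `df_rank` is an illustrative name, and the ranking runs across all three rows because each student here is in a different class.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df_rank = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# rank(ascending=False) gives 1 to the highest score; to_numpy() assigns by
# position, which sidesteps label alignment on the non-unique index
ranks = df_rank['Score'].rank(ascending=False).astype(int)
df_rank['ClassRanking'] = ranks.to_numpy()
print(df_rank)
```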
