In [1]:
import pandas as pd

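## Basic data access
You can display the entire DataFrame directly; for large datasets it is best to cap the number of rows shown. Alternatively, use head() to show only the first few rows.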

In [2]:
pd.set_option("display.max_rows", 5)    # the dataset is large, so show at most 5 rows

wine_reviews = pd.read_csv('winemag-data-50k-v2.csv', index_col = 0)
wine_reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...,...
49998,Italy,The beautiful thing about this wine is the ric...,Le Rive,90,45.0,Veneto,Soave Classico,,,,Suavia 2008 Le Rive (Soave Classico),Garganega,Suavia
49999,US,This is a particularly fine vintage for the po...,Dr. Wolfe's Family Red,90,15.0,Washington,Washington,Washington Other,Paul Gregutt,@paulgwine,Thurston Wolfe 2009 Dr. Wolfe's Family Red Red...,Red Blend,Thurston Wolfe


To select a single column, use DataFrame.column or DataFrame["column"]

In [3]:
wine_reviews.country

0           Italy
1        Portugal
           ...   
49998       Italy
49999          US
Name: country, Length: 50000, dtype: object
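
As a small illustration of the two access styles, the sketch below uses a hypothetical three-row frame named `toy` (not the real wine data). The column `taster name`, with a space in it, is an assumption made up for this example; it shows a case where only the bracket form can be used.

```python
import pandas as pd

# Hypothetical miniature stand-in for wine_reviews (not the real data).
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "taster name": ["Kerin", "Roger", "Paul"],  # label contains a space
})

by_attribute = toy.country        # works because "country" is a valid identifier
by_brackets = toy["country"]      # works for any column label
with_space = toy["taster name"]   # attribute access cannot express this label

same_column = by_attribute.equals(by_brackets)
```

Both forms return the same Series; the bracket form is the safer default when column names may contain spaces or clash with DataFrame method names.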

To select a single element from a given column and row, use DataFrame["column"][index]

In [4]:
wine_reviews['country'][0]

'Italy'

## Indexing in Pandas

Indexing in Pandas is done through **iloc** and **loc**.

### The first is **index-based selection**: selecting data by its numerical position (**DataFrame.iloc**)

In [5]:
wine_reviews.iloc[0]                # select the first row

country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object

Note that both loc and iloc are **row-first**, **column-second** (df.iloc[row, col]). This is the opposite of the bracket access used earlier, wine_reviews['country'][0], which names the column first and the row second.

In [6]:
wine_reviews.iloc[:, 0]            # select the first column; ":" means everything

0           Italy
1        Portugal
           ...   
49998       Italy
49999          US
Name: country, Length: 50000, dtype: object
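
The row-first, column-second order can be checked on a tiny frame. This is a minimal sketch; `toy` and its values are hypothetical and only stand in for the wine data.

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration.
toy = pd.DataFrame({"country": ["Italy", "Portugal"], "points": [87, 88]})

first_row = toy.iloc[0]      # one row, returned as a Series indexed by column name
first_col = toy.iloc[:, 0]   # one column, returned as a Series indexed by row label
cell = toy.iloc[1, 0]        # row position 1, column position 0

# Chained bracket access reaches the same cell, but column first.
same_cell = toy["country"][1]
```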

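The colon **:** used above means "everything". It comes from Python's slice syntax, and it can also select a range of values.

For example, to see only the first three rows of the first column: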

In [7]:
wine_reviews.iloc[:3, 0]

0       Italy
1    Portugal
2          US
Name: country, dtype: object

Or select only the second and third rows (positions 1 and 2):

In [8]:
wine_reviews.iloc[1:3, 0]

1    Portugal
2          US
Name: country, dtype: object

**Negative numbers** also work here: -n refers to the n-th value counting from the end:

In [9]:
wine_reviews.iloc[-2:]                # select the last two rows (from the second-to-last row through the end)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
49998,Italy,The beautiful thing about this wine is the ric...,Le Rive,90,45.0,Veneto,Soave Classico,,,,Suavia 2008 Le Rive (Soave Classico),Garganega,Suavia
49999,US,This is a particularly fine vintage for the po...,Dr. Wolfe's Family Red,90,15.0,Washington,Washington,Washington Other,Paul Gregutt,@paulgwine,Thurston Wolfe 2009 Dr. Wolfe's Family Red Red...,Red Blend,Thurston Wolfe
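
Besides slices, iloc also accepts a list of positions, and negative positions work for both rows and columns. The sketch below demonstrates this on a hypothetical five-row frame; the data is invented for the example.

```python
import pandas as pd

# Hypothetical five-row frame used only for illustration.
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "US", "France"],
    "points": [87, 87, 87, 87, 90],
})

picked_rows = toy.iloc[[0, 2, 4], 0]   # rows at positions 0, 2 and 4 of the first column
last_two = toy.iloc[-2:]               # the final two rows, all columns
last_cell = toy.iloc[-1, -1]           # bottom-right element
```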


### The second is **label-based selection**: selecting data by its labels or by a condition (**DataFrame.loc**)

In [10]:
wine_reviews.loc[0,'country'] # get the country of the record labeled 0 (the first record)

'Italy'

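**iloc** is conceptually simpler than **loc** because it ignores the dataset's labels (its index and column names). Using **iloc** treats the dataset like a matrix in which data is located purely by position. **loc**, by contrast, uses the label information to do the job.

If your dataset has many meaningful labels, **loc** is usually more convenient. For example: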

In [11]:
wine_reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']] # select all rows of the listed columns

Unnamed: 0,taster_name,taster_twitter_handle,points
0,Kerin O’Keefe,@kerinokeefe,87
1,Roger Voss,@vossroger,87
...,...,...,...
49998,,,90
49999,Paul Gregutt,@paulgwine,90


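## A brief summary of how iloc and loc compare
1. In common: both select row first, column second.
2. iloc selects by integer position; loc selects by the table's labels and also accepts conditions (boolean masks).
3. For a range such as [1:10], iloc is half-open (it returns positions 1 through 9), while loc includes both endpoints (labels 1 through 10).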
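
The slicing difference in point 3 is easy to verify. The following is a minimal sketch on hypothetical frames (`toy` and `lettered` are made up for this example); the second frame shows why an inclusive end is convenient for non-numeric labels.

```python
import pandas as pd

# Hypothetical frame with 12 rows and the default labels 0..11.
toy = pd.DataFrame({"points": list(range(80, 92))})

by_position = toy.iloc[1:10]   # positions 1..9, end excluded
by_label = toy.loc[1:10]       # labels 1..10, end included

# With string labels, an inclusive end avoids having to name "the label after c".
lettered = pd.DataFrame({"points": [87, 88, 89, 90]}, index=list("abcd"))
a_to_c = lettered.loc["a":"c"]
```

With the default integer index the two calls look alike but return 9 and 10 rows respectively, which is a common source of off-by-one mistakes.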

### Changing the index

In [12]:
wine_reviews.set_index('title') # use one of the columns as the index (returns a new DataFrame)

title,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,variety,winery
Nicosia 2013 Vulkà Bianco (Etna),Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,White Blend,Nicosia
Quinta dos Avidagos 2011 Avidagos Red (Douro),Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...
Suavia 2008 Le Rive (Soave Classico),Italy,The beautiful thing about this wine is the ric...,Le Rive,90,45.0,Veneto,Soave Classico,,,,Garganega,Suavia
Thurston Wolfe 2009 Dr. Wolfe's Family Red Red (Washington),US,This is a particularly fine vintage for the po...,Dr. Wolfe's Family Red,90,15.0,Washington,Washington,Washington Other,Paul Gregutt,@paulgwine,Red Blend,Thurston Wolfe
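
Note that set_index returns a new DataFrame and leaves the original untouched unless the result is assigned. A minimal sketch, using a hypothetical two-row frame with invented titles:

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration.
toy = pd.DataFrame({
    "title": ["Wine A", "Wine B"],
    "country": ["Italy", "US"],
})

by_title = toy.set_index("title")                  # new frame indexed by title
country_of_b = by_title.loc["Wine B", "country"]   # label-based lookup by title

original_columns = list(toy.columns)               # toy still has its "title" column
```

Once a meaningful column is the index, loc lookups can use those labels directly.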


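## Conditional selection
So far we have looked data up using the basic forms of indexing. In practice we usually phrase a question as a condition and select the rows that satisfy it.

Suppose we are particularly interested in wines from Italy, so we need the rows whose country column is Italy.

We can start by checking whether each wine's country is Italy: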

In [13]:
wine_reviews.country == 'Italy'

0         True
1        False
         ...  
49998     True
49999    False
Name: country, Length: 50000, dtype: bool
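
Because True counts as 1 and False as 0, a boolean Series like this can be summed or averaged to count matches. The sketch below assumes a hypothetical four-row frame; the numbers are not from the wine dataset.

```python
import pandas as pd

# Hypothetical four-row frame used only for illustration.
toy = pd.DataFrame({"country": ["Italy", "Portugal", "US", "Italy"]})

is_italian = toy.country == "Italy"        # boolean Series, one value per row

n_italian = int(is_italian.sum())          # number of True values
share_italian = float(is_italian.mean())   # fraction of True values
```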

This operation returns a Series of booleans (True or False), one per row. We can pass it to **loc** to select the matching rows (**loc** supports conditional selection):

In [14]:
wine_reviews.loc[wine_reviews.country == 'Italy']

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
49996,Italy,"Powerful, thick and concentrated, this has plu...",,90,60.0,Veneto,Amarone della Valpolicella,,,,San Cassiano 2006 Amarone della Valpolicella,"Corvina, Rondinella, Molinara",San Cassiano
49998,Italy,The beautiful thing about this wine is the ric...,Le Rive,90,45.0,Veneto,Soave Classico,,,,Suavia 2008 Le Rive (Soave Classico),Garganega,Suavia


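Comparing counts, the original dataset has 50,000 rows and 7,717 of them satisfy the condition, so about 15% of the wines come from Italy.

Next we want the better-rated wines. Every wine in this dataset is scored between 80 and 100, so let's find those that score at least 90.

**We can combine the two conditions with the AND operator (&)**: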

In [15]:
wine_reviews.loc[(wine_reviews.country == 'Italy') & (wine_reviews.points >= 90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
120,Italy,"Slightly backward, particularly given the vint...",Bricco Rocche Prapó,92,70.0,Piedmont,Barolo,,,,Ceretto 2003 Bricco Rocche Prapó (Barolo),Nebbiolo,Ceretto
130,Italy,"At the first it was quite muted and subdued, b...",Bricco Rocche Brunate,91,70.0,Piedmont,Barolo,,,,Ceretto 2003 Bricco Rocche Brunate (Barolo),Nebbiolo,Ceretto
...,...,...,...,...,...,...,...,...,...,...,...,...,...
49996,Italy,"Powerful, thick and concentrated, this has plu...",,90,60.0,Veneto,Amarone della Valpolicella,,,,San Cassiano 2006 Amarone della Valpolicella,"Corvina, Rondinella, Molinara",San Cassiano
49998,Italy,The beautiful thing about this wine is the ric...,Le Rive,90,45.0,Veneto,Soave Classico,,,,Suavia 2008 Le Rive (Soave Classico),Garganega,Suavia


Suppose we would buy any wine that is either made in Italy or rated at least 90. Here we use the pipe (|) operator to express "or":

In [16]:
wine_reviews.loc[(wine_reviews.country == 'Italy') | (wine_reviews.points >= 90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
49998,Italy,The beautiful thing about this wine is the ric...,Le Rive,90,45.0,Veneto,Soave Classico,,,,Suavia 2008 Le Rive (Soave Classico),Garganega,Suavia
49999,US,This is a particularly fine vintage for the po...,Dr. Wolfe's Family Red,90,15.0,Washington,Washington,Washington Other,Paul Gregutt,@paulgwine,Thurston Wolfe 2009 Dr. Wolfe's Family Red Red...,Red Blend,Thurston Wolfe
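
The two operators can be compared side by side on a small example. This is a sketch on a hypothetical four-row frame. Naming the masks first is only one possible style; when the comparisons are written inline, each one needs its own parentheses, because & and | bind more tightly than == and >= in Python.

```python
import pandas as pd

# Hypothetical four-row frame used only for illustration.
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "US", "France"],
    "points": [87, 92, 90, 85],
})

is_italian = toy.country == "Italy"
is_high = toy.points >= 90

both = toy.loc[is_italian & is_high]     # Italian AND rated at least 90
either = toy.loc[is_italian | is_high]   # Italian OR rated at least 90

# Equivalent inline form; the parentheses around each comparison are required.
both_inline = toy.loc[(toy.country == "Italy") & (toy.points >= 90)]
```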


### Pandas has many built-in conditional selectors; two are highlighted here: isin and isnull
**isin()** selects rows whose value is in a given list. For example, to keep only wines from Italy or France:

In [17]:
wine_reviews.loc[wine_reviews.country.isin(['Italy', 'France'])] # the list passed here contains only Italy and France

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
49996,Italy,"Powerful, thick and concentrated, this has plu...",,90,60.0,Veneto,Amarone della Valpolicella,,,,San Cassiano 2006 Amarone della Valpolicella,"Corvina, Rondinella, Molinara",San Cassiano
49998,Italy,The beautiful thing about this wine is the ric...,Le Rive,90,45.0,Veneto,Soave Classico,,,,Suavia 2008 Le Rive (Soave Classico),Garganega,Suavia
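
An isin mask can also be inverted with the ~ operator to keep everything that is not in the list. A minimal sketch, assuming a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical four-row frame used only for illustration.
toy = pd.DataFrame({"country": ["Italy", "France", "US", "Spain"]})

wanted = ["Italy", "France"]
in_list = toy.loc[toy.country.isin(wanted)]        # rows whose country is listed
not_in_list = toy.loc[~toy.country.isin(wanted)]   # ~ negates the boolean mask
```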


**isnull()** and its counterpart **notnull()** test whether a value is missing (NaN). For example, to keep only the wines that have a price:

In [18]:
wine_reviews.loc[wine_reviews.price.notnull()]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
...,...,...,...,...,...,...,...,...,...,...,...,...,...
49998,Italy,The beautiful thing about this wine is the ric...,Le Rive,90,45.0,Veneto,Soave Classico,,,,Suavia 2008 Le Rive (Soave Classico),Garganega,Suavia
49999,US,This is a particularly fine vintage for the po...,Dr. Wolfe's Family Red,90,15.0,Washington,Washington,Washington Other,Paul Gregutt,@paulgwine,Thurston Wolfe 2009 Dr. Wolfe's Family Red Red...,Red Blend,Thurston Wolfe
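
The two selectors split a column into complementary halves. The sketch below uses a hypothetical three-row frame in which one price is missing; the titles and prices are invented for the example.

```python
import pandas as pd

# Hypothetical three-row frame; the first price is missing (NaN).
toy = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "price": [float("nan"), 15.0, 14.0],
})

missing = toy.loc[toy.price.isnull()]      # rows without a price
priced = toy.loc[toy.price.notnull()]      # rows with a price
n_missing = int(toy.price.isnull().sum())  # count of missing prices
```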


## Assigning data
You can add a new column and assign it a constant value:

In [19]:
wine_reviews['critic'] = 'everyone'
wine_reviews['critic']

0        everyone
1        everyone
           ...   
49998    everyone
49999    everyone
Name: critic, Length: 50000, dtype: object

Or assign an iterable of values, one per row:

In [20]:
wine_reviews['index_backwards'] = range(len(wine_reviews), 0, -1)
wine_reviews['index_backwards']

0        50000
1        49999
         ...  
49998        2
49999        1
Name: index_backwards, Length: 50000, dtype: int32

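A note on the code above: Python's range function takes up to three arguments, namely start, stop (exclusive), and step (default 1). So range(len(wine_reviews), 0, -1) counts down from 50000 to 1.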
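
The same countdown can be traced on a frame small enough to read in full. This is a sketch on a hypothetical three-row frame, not the wine data.

```python
import pandas as pd

# Hypothetical three-row frame used only for illustration.
toy = pd.DataFrame({"country": ["Italy", "Portugal", "US"]})

# start = 3, stop = 0 (excluded), step = -1
countdown = list(range(len(toy), 0, -1))

toy["critic"] = "everyone"                        # constant broadcast to every row
toy["index_backwards"] = range(len(toy), 0, -1)   # one value per row
```

The iterable must supply exactly as many values as the frame has rows; a constant, by contrast, is repeated for every row.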