In [1]:
import pandas as pd

df = pd.read_csv('data/wine-reviews/winemag-data-130k-v2.csv', index_col='wine_id')
df.head()

wine_id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


# **Subsetting Data**

## Selecting Columns
There are two ways to select a column:
- **bracket notation**: more verbose, but always works
- **dot notation**: more concise, but does not work if:
    - the column name contains spaces or other characters that are invalid in a Python identifier (e.g. "person name", "km/h")
    - the column name shadows an existing method or attribute of the DataFrame (e.g. "sum", "count", ..)
    - you are trying to create a new column

When both work, they return the same result.
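
The limitations of dot notation listed above can be illustrated with a minimal sketch. The `toy` frame and its column names are made up for illustration and are not part of the wine dataset.

```python
import warnings

import pandas as pd

# Hypothetical toy frame (not the wine dataset) with awkward column names
toy = pd.DataFrame({"person name": ["Ann", "Bob"], "count": [3, 5]})

# A space in the name: only bracket notation can reach this column
names = toy["person name"]

# "count" shadows DataFrame.count, so dot notation returns the method
dot_is_method = callable(toy.count)
bracket_is_column = toy["count"].tolist()

# Assigning through dot notation sets a plain attribute, not a column
# (pandas usually warns about this, so the warning is silenced here)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy.total = [1, 2]
has_total_column = "total" in toy.columns

# Bracket notation creates the column as expected
toy["total"] = [1, 2]

print(dot_is_method, bracket_is_column, has_total_column)
```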

In [2]:
df['country']   # bracket notation
df.country      # dot notation

df['country'].equals(df.country)  # checking equality

True

## Selecting Rows
To select rows we can use either labels or positions:
```python
.loc[row_label/s]    # by label
.iloc[row_index/s]   # by position
```
Inside the brackets there can be:
- a **single value**, e.g. `df.loc['myrowlabel']`
- a **list** of values, e.g. `df.loc[['myrowlabel1', 'myrowlabel2']]`
- a **slice**, e.g. `df.loc['myrowlabel1':'myrowlabel5']`

**NOTE**: 
When using a slice, remember that `loc` includes both endpoints, while `iloc` excludes the right endpoint (like slicing a list).
```python
.loc[a:b]   # includes both a and b
.iloc[a:b]   # includes only a, excludes b
```
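
As a small sketch of this difference, the made-up `toy` frame below uses index labels (10, 20, ...) that deliberately differ from the row positions:

```python
import pandas as pd

# Hypothetical toy frame whose index labels differ from the positions
toy = pd.DataFrame({"price": [4.0, 15.0, 14.0, 13.0]}, index=[10, 20, 30, 40])

by_label = toy.loc[10:30]      # labels 10, 20 and 30: right edge included
by_position = toy.iloc[0:2]    # positions 0 and 1: right edge excluded

print(by_label.index.tolist())     # [10, 20, 30]
print(by_position.index.tolist())  # [10, 20]
```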

In [3]:
# to better demonstrate the difference, first re-sort the data so that index labels no longer coincide with positions
df_sorted = df.sort_values('price')

df_sorted.head(3)

wine_id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
20484,US,"Fruity, soft and rather sweet, this wine smell...",,85,4.0,California,Clarksburg,Central Valley,Jim Gordon,@gordone_cellars,Dancing Coyote 2015 White (Clarksburg),White Blend,Dancing Coyote
112919,Spain,"Nice on the nose, this has a leafy note and a ...",Estate Bottled,84,4.0,Levante,Yecla,,Michael Schachner,@wineschach,Terrenal 2010 Estate Bottled Tempranillo (Yecla),Tempranillo,Terrenal
59507,US,"Sweet and fruity, this canned wine feels soft ...",Unoaked,83,4.0,California,California,California Other,Jim Gordon,@gordone_cellars,Pam's Cuties NV Unoaked Chardonnay (California),Chardonnay,Pam's Cuties


In [4]:
terrenal_wine_loc = df_sorted.loc[112919]   # selecting by label (row with index label 112919)
terrenal_wine_iloc = df_sorted.iloc[1]      # selecting by position (second row of the re-sorted dataset)

terrenal_wine_loc

country                                                              Spain
description              Nice on the nose, this has a leafy note and a ...
designation                                                 Estate Bottled
points                                                                  84
price                                                                    4
province                                                           Levante
region_1                                                             Yecla
region_2                                                               NaN
taster_name                                              Michael Schachner
taster_twitter_handle                                          @wineschach
title                     Terrenal 2010 Estate Bottled Tempranillo (Yecla)
variety                                                        Tempranillo
winery                                                            Terrenal
Name: 112919, dtype: object

In [5]:
terrenal_wine_loc.equals(terrenal_wine_iloc)  # checking equality

True

## Selecting Rows & Columns
The `loc` and `iloc` indexers also accept a second argument that subsets the columns. To mix label-based and position-based selection, the two can be chained.

In [6]:
# selecting by row's and columns' labels
sel1 = df_sorted.loc[112919, 'country':'designation']  

# selecting by row's and columns' position
sel2 = df_sorted.iloc[1, 0:3]    

sel1

country                                                    Spain
description    Nice on the nose, this has a leafy note and a ...
designation                                       Estate Bottled
Name: 112919, dtype: object

In [7]:
sel1.equals(sel2)  # checking equality

True

In [8]:
# to mix positional and label selection we can just chain the two
sel3 = df_sorted.iloc[1].loc['country':'designation']  

sel1.equals(sel3) # checking equality

True
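
An alternative to chaining, sketched here on a made-up `toy` frame, is to translate a column label into its position with `Index.get_loc` and then use `iloc` for both axes. This is only one possible approach, not necessarily the preferred one.

```python
import pandas as pd

# Hypothetical toy frame (values chosen for illustration only)
toy = pd.DataFrame(
    {"country": ["US", "Spain"], "points": [85, 84], "price": [4.0, 4.0]},
    index=[20484, 112919],
)

# Translate the column label into its integer position...
col_pos = toy.columns.get_loc("points")

# ...then select the second row and that column purely by position
value = toy.iloc[1, col_pos]
print(col_pos, value)  # 1 84
```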

### ***EXERCISE 4.1***
Select all the geographical information for the 32nd row of `df`.

In [9]:
# insert solution here

## Filtering by condition
To filter by condition we can use the same square bracket notation, but with a condition in place of a column name.

Conditions can be created using the standard Python comparison operators:
- `==` is equal to
- `!=` is not equal to
- `<` is less than
- `<=` is less than or equal to
- `>` is greater than
- `>=` is greater than or equal to

In [10]:
first5_df = df.head()
first5_df

wine_id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [11]:
first5_df[ first5_df.country == 'Portugal' ]

wine_id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos


The condition is evaluated to a boolean Series (True where the condition is met, False otherwise), which is then used to select the matching rows:

In [12]:
(first5_df.country == 'Portugal')

wine_id
0    False
1     True
2    False
3    False
4    False
Name: country, dtype: bool

### Extra pandas conditions
There are also useful pandas-specific methods that can be called on a column, such as:
- `df.mycol.isna()` the value in the 'mycol' column is null
- `df.mycol.notna()` the value in the 'mycol' column is not null
- `df.mycol.isin(list_of_values)` the value in the 'mycol' column is among the 'list_of_values'
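
The two null checks can be sketched on a small, made-up frame (the `toy` name and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing price
toy = pd.DataFrame({"price": [np.nan, 15.0, 14.0]}, index=[0, 1, 2])

missing = toy.price.isna()    # True where the price is null
present = toy.price.notna()   # True where the price is not null

# Keep only the rows that have a price
priced = toy[present]
print(priced.index.tolist())  # [1, 2]
```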

In [58]:
first5_df.country.isin(['Portugal', 'Italy'])

wine_id
0     True
1     True
2    False
3    False
4    False
Name: country, dtype: bool

### Using string and datetime methods
If the column contains strings, the `.str.<method>` accessor applies a string method or attribute to each element in the column, for example:
```python
df['my_string_column'].str.startswith('I')
```
Something similar can be done for a column that has been converted to datetime, using the `.dt.<attribute>` accessor, for example:
```python
df['my_date_column'].dt.year == 2015
```
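
The wine dataset shown here has no date column, so the sketch below uses a made-up `review_date` column to show the conversion step followed by the `.dt` filter:

```python
import pandas as pd

# Hypothetical frame: "review_date" is not a column of the wine dataset
toy = pd.DataFrame({"review_date": ["2014-06-01", "2015-03-15", "2015-11-30"]})

# The .dt accessor only works once the column has a datetime dtype
toy["review_date"] = pd.to_datetime(toy["review_date"])

in_2015 = toy["review_date"].dt.year == 2015
print(in_2015.tolist())  # [False, True, True]
```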

In [15]:
first5_df.country.str.startswith('I')

wine_id
0     True
1    False
2    False
3    False
4    False
Name: country, dtype: bool

## Combining filters
To use multiple filters together we can use the bitwise operators, which pandas applies element-wise:
- `&` = and
- `|` = or
- `~` = not

**NOTE**: when combining filters we **need** to put parentheses around each condition, because `&`, `|` and `~` have higher precedence than comparison operators such as `==`
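
A minimal sketch of what goes wrong without the parentheses, using a made-up `toy` frame. The exact exception type may vary between pandas versions, so the example catches both likely candidates.

```python
import pandas as pd

# Hypothetical toy frame (values chosen for illustration only)
toy = pd.DataFrame({"country": ["US", "Italy", "US"], "points": [87, 90, 92]})

# With parentheses: each comparison is evaluated first, then combined
mask = (toy.country == "US") & (toy.points > 90)
print(mask.tolist())  # [False, False, True]

# Without parentheses Python parses the expression as
#   toy.country == ("US" & toy.points) > 90
# which raises an error instead of building a mask
try:
    toy.country == "US" & toy.points > 90
    failed = False
except (TypeError, ValueError):
    failed = True
print(failed)  # True
```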

In [16]:
# we could write this..
first5_df[
    (first5_df.country == 'US') & 
    ~(first5_df.variety.str.contains('Pinot'))
]

wine_id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian


In [57]:
# .. but for readability the following approach is usually better
is_from_us = (first5_df.country == 'US')
not_pinot = ~(first5_df.variety.str.contains('Pinot'))

first5_df[is_from_us & not_pinot]

wine_id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian


### ***EXERCISE 4.2***
Find the wines that meet the following conditions:
- the taster name starts with "J"
- the score (points) is either higher than 99 or lower than 81
- it does not come from France
- the province contains a "y" (anywhere in the name)

In [58]:
# insert solution here