**[Pandas Micro-Course Home Page](https://www.kaggle.com/learn/pandas)**

---


# Intro

You'll select specific values of a pandas `DataFrame` or `Series` to work on in most data operations, so it's a foundational skill for data science.

You will explore the [Wine Reviews dataset](https://www.kaggle.com/zynicide/wine-reviews) while practicing this skill.

# Relevant Resources
* **[Quickstart to indexing and selecting data](https://www.kaggle.com/residentmario/indexing-and-selecting-data/)** 
* [Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/indexing.html) section of pandas documentation
* [Pandas Cheat Sheet](https://assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf)




# Set Up

Run the following cell to load your data and some utility functions (including code to check your answers).

In [1]:
import pandas as pd

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.indexing_selecting_and_assigning import *
print("Setup complete.")

Setup complete.


Look at an overview of your data by running the following line:

In [2]:
reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


# Exercises

## 1.

Select the `description` column from `reviews` and assign the result to the variable `desc`.

In [3]:
# Your code here
desc = reviews['description']

q1.check()

# Since we've selected a single column, the result is a pandas Series


<span style="color:#33cc33">Correct</span>

Follow-up question: what type of object is `desc`? If you're not sure, you can check by calling Python's `type` function: `type(desc)`.

In [5]:
#q1.hint()
q1.solution()


<span style="color:#33cc99">Solution:</span> 
```python
desc = reviews.description
```
or 
```python
desc = reviews["description"]
```
`desc` is a pandas `Series` object, with an index matching the `reviews` DataFrame. 
In general, when we select a single column from a DataFrame, we'll get a Series.
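
As a small illustration of the point above, the sketch below uses a hypothetical three-row frame named `toy` (a stand-in with made-up values, not the real `reviews` data). It checks that attribute access and bracket access give the same `Series`, and that passing a list of column names gives a `DataFrame` instead.

```python
import pandas as pd

# Hypothetical stand-in for `reviews`; the values are made up for illustration.
toy = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US"], "points": [87, 87, 90]}
)

by_attribute = toy.country    # attribute access
by_key = toy["country"]       # bracket access with a single name
as_frame = toy[["country"]]   # bracket access with a list of names

print(isinstance(by_key, pd.Series))       # True
print(by_attribute.equals(by_key))         # True
print(by_key.index.equals(toy.index))      # True
print(isinstance(as_frame, pd.DataFrame))  # True
```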


## 2.

Select the first value from the description column of `reviews`, assigning it to variable `first_description`.

In [6]:
first_description = reviews['description'][0]

q2.check()
first_description

# Unlike R, which requires [row, column] subsetting, here we select the column first and then index into the resulting Series
# An alternative is reviews.description.iloc[0]; reviews.description[0] also works because the index label 0 exists
# Multiple ways to do this!


<span style="color:#33cc33">Correct:</span> 


```python
first_description = reviews.description.iloc[0]
```
Note that while this is the preferred way to obtain the entry in the DataFrame, many other options will return a valid result, such as `reviews.description.loc[0]`, `reviews.description[0]`, and more!  


"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity."

In [7]:
#q2.hint()
q2.solution()


<span style="color:#33cc99">Solution:</span> 
```python
first_description = reviews.description.iloc[0]
```
Note that while this is the preferred way to obtain the entry in the DataFrame, many other options will return a valid result, such as `reviews.description.loc[0]`, `reviews.description[0]`, and more!  
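
The options listed above agree here only because the index labels of `reviews` start at `0`. A minimal sketch, using a hypothetical `Series` named `s` whose labels start at `10`, shows how position-based and label-based lookups can diverge:

```python
import pandas as pd

# Hypothetical Series whose index labels do not start at 0.
s = pd.Series(["first", "second", "third"], index=[10, 11, 12])

by_position = s.iloc[0]   # position 0
by_label = s.loc[10]      # label 10

# Label 0 does not exist in this index, so a label lookup fails.
try:
    s.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_position)        # first
print(by_label)           # first
print(label_zero_found)   # False
```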


## 3. 

Select the first row of data (the first record) from `reviews`, assigning it to the variable `first_row`.

In [9]:
first_row = reviews.iloc[0]

q3.check()
first_row

# Keep in mind that while we could do reviews.description[0] above, we can't do reviews[0] to get first row!


<span style="color:#33cc33">Correct</span>

country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object

In [10]:
#q3.hint()
q3.solution()


<span style="color:#33cc99">Solution:</span> 
```python
first_row = reviews.iloc[0]
```

## 4.

Select the first 10 values from the `description` column in `reviews`, assigning the result to variable `first_descriptions`.

Hint: format your output as a `pandas` `Series`.

In [12]:
first_descriptions = reviews['description'][0:10]

q4.check()
first_descriptions

# I first thought 0:9 would be the first 10 and 0:10 the first 11.
    # Nope. Positional slices exclude the end point, so 0:10 gives positions 0 through 9. See the gotcha in question 7!
# Alternative: reviews.description.iloc[:10]
# desc.head(10) is probably the simplest though


<span style="color:#33cc33">Correct:</span> 


```python
first_descriptions = reviews.description.iloc[:10]
```
Note that many other options will return a valid result, such as `desc.head(10)` and `reviews.loc[:9, "description"]`.    


0    Aromas include tropical fruit, broom, brimston...
1    This is ripe and fruity, a wine that is smooth...
                           ...                        
8    Savory dried thyme notes accent sunnier flavor...
9    This has great depth of flavor with its fresh ...
Name: description, Length: 10, dtype: object

In [13]:
#q4.hint()
q4.solution()


<span style="color:#33cc99">Solution:</span> 
```python
first_descriptions = reviews.description.iloc[:10]
```
Note that many other options will return a valid result, such as `desc.head(10)` and `reviews.loc[:9, "description"]`.    


## 5.

Select the records with index labels `1`, `2`, `3`, `5`, and `8`, assigning the result to the variable `sample_reviews`.

In other words, generate the following DataFrame:

![](https://i.imgur.com/sHZvI1O.png)

In [26]:
# My attempt below
# sample_reviews = reviews.loc[1, 2, 3, 5, 8]

# Correct answer. My attempt fails because the comma-separated values are read as
# one tuple of per-axis indexers (row, column, ...), so the row labels must go in a single list
indices = [1, 2, 3, 5, 8]
sample_reviews = reviews.loc[indices]

q5.check()
sample_reviews


<span style="color:#33cc33">Correct</span>

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
5,Spain,Blackberry and raspberry aromas show a typical...,Ars In Vitro,87,15.0,Northern Spain,Navarra,,Michael Schachner,@wineschach,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem
8,Germany,Savory dried thyme notes accent sunnier flavor...,Shine,87,12.0,Rheinhessen,,,Anna Lee C. Iijima,,Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe...,Gewürztraminer,Heinz Eifel


In [21]:
q5.hint()
q5.solution()


<span style="color:#3366cc">Hint:</span> Use either the `loc` or `iloc` operator to select rows of a DataFrame.


<span style="color:#33cc99">Solution:</span> 
```python
indices = [1, 2, 3, 5, 8]
sample_reviews = reviews.loc[indices]
```
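
To see why the labels need to be wrapped in a list, here is a hedged sketch on a hypothetical nine-row frame named `toy`. The broad `except Exception` is a deliberate simplification: the exact exception class is not something this notebook specifies.

```python
import pandas as pd

# Hypothetical frame with default index labels 0..8.
toy = pd.DataFrame({"points": [87, 88, 89, 90, 91, 92, 93, 94, 95]})

indices = [1, 2, 3, 5, 8]
sample = toy.loc[indices]   # one list-like of row labels

# Without the list, the values form a single tuple of per-axis indexers,
# which is more indexers than a two-dimensional frame has axes.
try:
    toy.loc[1, 2, 3, 5, 8]
    bare_labels_ok = True
except Exception:
    bare_labels_ok = False

print(sample.index.tolist())  # [1, 2, 3, 5, 8]
print(bare_labels_ok)         # False
```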

## 6.

Create a variable `df` containing the `country`, `province`, `region_1`, and `region_2` columns of the records with the index labels `0`, `1`, `10`, and `100`. In other words, generate the following `DataFrame`:

![](https://i.imgur.com/FUCGiKP.png)

In [42]:
# My attempt
# cols = ['country', 'province', 'region_1', 'region_2']
# rows = [0, 1, 10, 100]
# df = reviews.loc[cols].loc[rows]

# Correct Answer
cols = ['country', 'province', 'region_1', 'region_2']
rows = [0, 1, 10, 100]
df = reviews.loc[rows, cols]

q6.check()
df

# Note that like R, it DOES need to be rows, cols here!


<span style="color:#33cc33">Correct</span>

Unnamed: 0,country,province,region_1,region_2
0,Italy,Sicily & Sardinia,Etna,
1,Portugal,Douro,,
10,US,California,Napa Valley,Napa
100,US,New York,Finger Lakes,Finger Lakes


In [39]:
q6.hint()
q6.solution()


<span style="color:#3366cc">Hint:</span> Use the `loc` operator.  (Note that it is also *possible* to solve this problem using the `iloc` operator, but this would require extra effort to convert each column name to a corresponding integer-valued index.)


<span style="color:#33cc99">Solution:</span> 
```python
cols = ['country', 'province', 'region_1', 'region_2']
indices = [0, 1, 10, 100]
df = reviews.loc[indices, cols]
```
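
The sketch below, on a hypothetical four-row frame `toy` with illustrative values, shows the `loc[rows, cols]` form and why passing column names alone to `loc` does not work: the first indexer always refers to row labels.

```python
import pandas as pd

# Hypothetical frame; the index labels mirror the exercise, the values are illustrative.
toy = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US", "US"],
        "province": ["Sicily & Sardinia", "Douro", "California", "New York"],
        "points": [87, 87, 87, 88],
    },
    index=[0, 1, 10, 100],
)
rows = [0, 10]
cols = ["country", "province"]

subset = toy.loc[rows, cols]   # rows first, then columns
print(subset.shape)            # (2, 2)

# Column names passed alone are looked up among the row labels and are not found.
try:
    toy.loc[cols]
    cols_as_rows_ok = True
except KeyError:
    cols_as_rows_ok = False
print(cols_as_rows_ok)         # False
```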

## 7.

Create a variable `df` containing the `country` and `variety` columns of the first 100 records. 

Hint: you may use `loc` or `iloc`. When working on this question and several of the ones that follow, keep in mind the following "gotcha" described in the [reference](https://www.kaggle.com/residentmario/indexing-selecting-assigning-reference) for this tutorial section:

> `iloc` uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So `0:10` will select entries `0,...,9`. `loc`, meanwhile, indexes inclusively. So `0:10` will select entries `0,...,10`.

> [...]

> ...[consider] when the DataFrame index is a simple numerical list, e.g. `0,...,1000`. In this case `reviews.iloc[0:1000]` will return 1000 entries, while `reviews.loc[0:1000]` returns 1001 of them! To get 1000 elements using `loc`, you will need to go one lower and ask for `reviews.loc[0:999]`.
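
The row counts can be checked directly. This is a minimal sketch on a hypothetical frame `toy` with a default integer index, not the real dataset:

```python
import pandas as pd

# Hypothetical frame with default index labels 0..1999.
toy = pd.DataFrame({"points": range(2000)})

n_iloc = len(toy.iloc[0:1000])       # end position excluded
n_loc = len(toy.loc[0:1000])         # end label included
n_loc_one_lower = len(toy.loc[0:999])

print(n_iloc)           # 1000
print(n_loc)            # 1001
print(n_loc_one_lower)  # 1000
```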

In [59]:
# My attempt
# cols = ['country', 'variety']
# rows = [0:99]
# df = reviews.loc[rows, cols]

# Correct Answer
cols = ['country', 'variety']
df = reviews.loc[:99, cols]
# My attempt fails because a bare slice like 0:99 is a syntax error inside a list literal; write it directly in the brackets or use slice(0, 99)
q7.check()
df


<span style="color:#33cc33">Correct:</span> 


```python
cols = ['country', 'variety']
df = reviews.loc[:99, cols]
```
or 
```python
cols_idx = [0, 11]
df = reviews.iloc[:100, cols_idx]
```


Unnamed: 0,country,variety
0,Italy,White Blend
1,Portugal,Portuguese Red
...,...,...
98,Italy,Sangiovese
99,US,Bordeaux-style Red Blend


In [53]:
q7.hint()
q7.solution()


<span style="color:#3366cc">Hint:</span> It is most straightforward to solve this problem with the `loc` operator.  (However, if you decide to use `iloc`, remember to first convert each column into a corresponding integer-valued index.)


<span style="color:#33cc99">Solution:</span> 
```python
cols = ['country', 'variety']
df = reviews.loc[:99, cols]
```
or 
```python
cols_idx = [0, 11]
df = reviews.iloc[:100, cols_idx]
```
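
Rather than counting column positions by hand for `iloc`, one option is to look them up with `Index.get_indexer`. The sketch below uses a hypothetical three-column frame `toy`, so the positions differ from the `[0, 11]` used for `reviews`:

```python
import pandas as pd

# Hypothetical frame; column order and values are made up for illustration.
toy = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US"],
        "points": [87, 87, 87],
        "variety": ["White Blend", "Portuguese Red", "Pinot Gris"],
    }
)
cols = ["country", "variety"]

# Translate column names into integer positions.
cols_idx = toy.columns.get_indexer(cols)
print(cols_idx.tolist())          # [0, 2]

via_iloc = toy.iloc[:2, cols_idx]  # first 2 rows by position
via_loc = toy.loc[:1, cols]        # labels 0 through 1, inclusive
print(via_iloc.equals(via_loc))    # True
```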


## 8.

Create a DataFrame `italian_wines` containing reviews of wines made in `Italy`. Hint: `reviews.country` equals what?

In [62]:
italian_wines = reviews[reviews['country'] == 'Italy']

q8.check()

# Alternative using dot notation
# italian_wines = reviews[reviews.country == 'Italy']


<span style="color:#33cc33">Correct</span>

In [64]:
q8.hint()
q8.solution()


<span style="color:#3366cc">Hint:</span> For more information, see the section on **Conditional selection** in the [reference component](https://www.kaggle.com/residentmario/indexing-selecting-assigning-reference).


<span style="color:#33cc99">Solution:</span> 
```python
italian_wines = reviews[reviews.country == 'Italy']
```
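
It can help to look at the intermediate boolean mask on its own. A small sketch on a hypothetical frame `toy` (made-up values):

```python
import pandas as pd

# Hypothetical frame; the values are made up for illustration.
toy = pd.DataFrame(
    {"country": ["Italy", "US", "Italy"], "points": [87, 90, 92]}
)

# The comparison yields one boolean per row.
is_italian = toy.country == "Italy"
print(is_italian.tolist())       # [True, False, True]

# Indexing with the mask keeps only the rows marked True.
italian = toy[is_italian]
print(italian.index.tolist())    # [0, 2]
```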

## 9.

Create a DataFrame `top_oceania_wines` containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand.

In [71]:
top_oceania_wines = reviews[((reviews.country == 'Australia') | (reviews.country == 'New Zealand')) & 
                            (reviews.points >= 95)]

q9.check()
top_oceania_wines

# Remember the annoying part: parentheses are needed around each condition!

# Alternative - easier when one column is compared against multiple values
#top_oceania_wines = reviews.loc[
    #(reviews.country.isin(['Australia', 'New Zealand']))
    #& (reviews.points >= 95)
#]


<span style="color:#33cc33">Correct</span>

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
345,Australia,This wine contains some material over 100 year...,Rare,100,350.0,Victoria,Rutherglen,,Joe Czerwinski,@JoeCz,Chambers Rosewood Vineyards NV Rare Muscat (Ru...,Muscat,Chambers Rosewood Vineyards
346,Australia,"This deep brown wine smells like a damp, mossy...",Rare,98,350.0,Victoria,Rutherglen,,Joe Czerwinski,@JoeCz,Chambers Rosewood Vineyards NV Rare Muscadelle...,Muscadelle,Chambers Rosewood Vineyards
...,...,...,...,...,...,...,...,...,...,...,...,...,...
122507,New Zealand,"This blend of Cabernet Sauvignon (62.5%), Merl...",SQM Gimblett Gravels Cabernets/Merlot,95,79.0,Hawke's Bay,,,Joe Czerwinski,@JoeCz,Squawking Magpie 2014 SQM Gimblett Gravels Cab...,Bordeaux-style Red Blend,Squawking Magpie
122939,Australia,Full-bodied and plush yet vibrant and imbued w...,The Factor,98,125.0,South Australia,Barossa Valley,,Joe Czerwinski,@JoeCz,Torbreck 2013 The Factor Shiraz (Barossa Valley),Shiraz,Torbreck


In [72]:
#q9.hint()
q9.solution()


<span style="color:#33cc99">Solution:</span> 
```python
top_oceania_wines = reviews.loc[
    (reviews.country.isin(['Australia', 'New Zealand']))
    & (reviews.points >= 95)
]
```
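
Python's `&` and `|` bind more tightly than comparisons such as `==` and `>=`, which is why each comparison has to be wrapped in parentheses. The sketch below, on a hypothetical four-row frame `toy`, checks that the chained `|` form and the `isin` form select the same rows:

```python
import pandas as pd

# Hypothetical frame; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "country": ["Australia", "New Zealand", "Italy", "Australia"],
        "points": [96, 95, 99, 90],
    }
)

# Each comparison is parenthesized so it is evaluated before & and |.
with_or = toy[
    ((toy.country == "Australia") | (toy.country == "New Zealand"))
    & (toy.points >= 95)
]

# isin replaces the chain of == comparisons joined by |.
with_isin = toy[
    toy.country.isin(["Australia", "New Zealand"]) & (toy.points >= 95)
]

print(with_or.equals(with_isin))   # True
print(with_isin.index.tolist())    # [0, 1]
```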

## Keep going

Great job. Next, learn about **[Summary functions and Map functions](https://www.kaggle.com/residentmario/summary-functions-and-maps-reference)** to get high-level insights and statistics about your data.


