**[Pandas Home Page](https://www.kaggle.com/learn/pandas)**

---


# Introduction

In these exercises we'll apply groupwise analysis to our dataset.

Run the code cell below to load the data before running the exercises.

In [1]:
import pandas as pd

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
#pd.set_option("display.max_rows", 5)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.grouping_and_sorting import *
print("Setup complete.")

Setup complete.


# Exercises

## 1.
Who are the most common wine reviewers in the dataset? Create a `Series` whose index is the `taster_twitter_handle` category from the dataset, and whose values count how many reviews each person wrote.

In [17]:
# Your code here
ra = reviews.taster_twitter_handle.value_counts()
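# value_counts() gives the same counts but orders them by frequency rather than by handle,
# which is possibly why the checker does not accept it.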
print(ra.head())
print()
rb = reviews.groupby('taster_twitter_handle').size()
print(rb.head())
print()
# Difference between size() and count(): size() counts every row in a group, including rows
# holding NaN values; count() counts only the non-null values of the selected column.
# https://riptutorial.com/pandas/example/6874/aggregating-by-size-versus-by-count
rc = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
print(rc.head())
print()
reviews_written = rc
q1.check()

@vossroger      25514
@wineschach     15134
@kerinokeefe    10776
@vboone          9537
@paulgwine       9532
Name: taster_twitter_handle, dtype: int64

taster_twitter_handle
@AnneInVino          3685
@JoeCz               5147
@bkfiona               27
@gordone_cellars     4177
@kerinokeefe        10776
dtype: int64

taster_twitter_handle
@AnneInVino          3685
@JoeCz               5147
@bkfiona               27
@gordone_cellars     4177
@kerinokeefe        10776
Name: taster_twitter_handle, dtype: int64



<span style="color:#33cc33">Correct:</span> 


```python
reviews_written = reviews.groupby('taster_twitter_handle').size()
```
or
```python
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
```


In [18]:
q1.hint()
q1.solution()

<span style="color:#3366cc">Hint:</span> Use the `groupby` operation and `size()` (or `count()`).

<span style="color:#33cc99">Solution:</span> 
```python
reviews_written = reviews.groupby('taster_twitter_handle').size()
```
or
```python
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
```
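
The difference between `size()`, `count()`, and `value_counts()` is easier to see on a small frame. The sketch below uses a made-up `toy` DataFrame (its column names and values are hypothetical, not taken from the wine reviews) and assumes the default `dropna=True` behaviour of `groupby` and `value_counts`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing handle and two missing scores.
toy = pd.DataFrame({
    "taster": ["@a", "@a", "@b", "@b", "@b", None],
    "points": [90, np.nan, 85, 88, np.nan, 91],
})

# size() counts rows per group; the row with a missing key is dropped by groupby.
by_size = toy.groupby("taster").size()

# count() counts only the non-null values of the selected column.
by_count = toy.groupby("taster")["points"].count()

# value_counts() counts the same rows as size(), but sorts by frequency, not by key.
by_value_counts = toy["taster"].value_counts()

print(by_size, by_count, by_value_counts, sep="\n\n")
```

When the counted column is the grouping key itself, as in the exercise, `size()` and `count()` agree because rows with a missing key never reach a group.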


## 2.
What is the best wine I can buy for a given amount of money? Create a `Series` whose index is wine prices and whose values are the maximum number of points awarded in a review to a wine at that price. Sort the values by price, ascending (so that `4.0` dollars is at the top and `3300.0` dollars is at the bottom).

In [62]:
best_rating_per_price = reviews.groupby('price')['points'].describe()
print(best_rating_per_price)
print('\n')
best_rating_per_price = reviews.groupby('price')['points'].value_counts()
print(best_rating_per_price)
print('\n')
best_rating_per_price = reviews.groupby('price')['points'].max()
print(best_rating_per_price)
print('\n')
best_rating_per_price = reviews.groupby('price')['points'].max().sort_index(ascending=True)
print(best_rating_per_price)
print('\n')
q2.check()

        count       mean       std   min    25%   50%    75%   max
price                                                             
4.0      11.0  84.272727  1.190874  82.0  84.00  84.0  85.00  86.0
5.0      46.0  83.586957  1.694122  80.0  83.00  83.5  85.00  87.0
6.0     120.0  84.341667  1.727172  80.0  83.00  84.0  85.00  88.0
7.0     433.0  84.450346  1.845404  80.0  83.00  84.0  86.00  91.0
8.0     892.0  84.628924  1.904621  80.0  83.00  85.0  86.00  91.0
...       ...        ...       ...   ...    ...   ...    ...   ...
1900.0    1.0  98.000000       NaN  98.0  98.00  98.0  98.00  98.0
2000.0    2.0  96.500000  0.707107  96.0  96.25  96.5  96.75  97.0
2013.0    1.0  91.000000       NaN  91.0  91.00  91.0  91.00  91.0
2500.0    2.0  96.000000  0.000000  96.0  96.00  96.0  96.00  96.0
3300.0    1.0  88.000000       NaN  88.0  88.00  88.0  88.00  88.0

[390 rows x 8 columns]


price   points
4.0     84        5
        85        2
        86        2
        82        1
        

<span style="color:#33cc33">Correct</span>

In [43]:
q2.hint()
q2.solution()

<span style="color:#3366cc">Hint:</span> Use `max()` and `sort_index()`.  The relevant columns in the DataFrame are `price` and `points`.

<span style="color:#33cc99">Solution:</span> 
```python
best_rating_per_price = reviews.groupby('price')['points'].max().sort_index()
```
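
As a rough illustration of the same pattern, the sketch below runs `groupby(...).max().sort_index()` on a made-up `toy` frame (hypothetical values, not the wine data).

```python
import pandas as pd

# Hypothetical prices and scores, deliberately out of order.
toy = pd.DataFrame({
    "price": [20.0, 4.0, 20.0, 4.0, 9.0],
    "points": [88, 82, 91, 86, 85],
})

# Best score seen at each price, with prices in ascending order.
best = toy.groupby("price")["points"].max().sort_index()
print(best)
```

`groupby` already sorts group keys by default, so `sort_index()` mainly makes the intended ordering explicit here.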

## 3.
What are the minimum and maximum prices for each `variety` of wine? Create a `DataFrame` whose index is the `variety` category from the dataset and whose columns are the `min` and `max` prices for that variety.

In [69]:
price_extremes = reviews.groupby('variety')['price'].describe()
print(price_extremes.head())
print('\n')
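# .agg applies one or more aggregation functions (min, max, sum, etc.) to each group.
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
# Passing the names as strings avoids the warning recent pandas versions emit for the built-ins min and max.
price_extremes = reviews.groupby('variety')['price'].agg(['min', 'max'])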
print(price_extremes.head())
print('\n')
q3.check()

             count       mean        std   min    25%   50%   75%    max
variety                                                                 
Abouriou       3.0  35.000000  34.641016  15.0  15.00  15.0  45.0   75.0
Agiorgitiko   63.0  23.571429  12.367640  10.0  15.00  20.0  27.0   66.0
Aglianico    294.0  38.887755  23.435723   6.0  22.25  33.5  49.0  180.0
Aidani         1.0  27.000000        NaN  27.0  27.00  27.0  27.0   27.0
Airen          3.0   9.000000   1.000000   8.0   8.50   9.0   9.5   10.0


              min    max
variety                 
Abouriou     15.0   75.0
Agiorgitiko  10.0   66.0
Aglianico     6.0  180.0
Aidani       27.0   27.0
Airen         8.0   10.0




<span style="color:#33cc33">Correct</span>

In [67]:
q3.hint()
#q3.solution()

<span style="color:#3366cc">Hint:</span> Use `agg()`.
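
A small sketch of `agg()` with several aggregations at once, run on a made-up `toy` frame (hypothetical varieties and prices). It assumes the aggregations are passed by name as strings, which yields one column per name.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price.
toy = pd.DataFrame({
    "variety": ["Merlot", "Merlot", "Syrah", "Syrah", "Syrah"],
    "price": [12.0, 30.0, 8.0, np.nan, 25.0],
})

# One row per variety, one column per aggregation; missing prices are skipped.
extremes = toy.groupby("variety")["price"].agg(["min", "max"])
print(extremes)
```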

## 4.
What are the most expensive wine varieties? Create a variable `sorted_varieties` containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).

In [87]:
sorted_varieties = reviews.groupby('variety')['price'].agg(['min'])
print(sorted_varieties.head(10))
print('\n')
# First attempt: two chained sort_values calls. The second sort does not reliably keep the order produced by the first.
sorted_varieties = reviews.groupby('variety')['price'].agg(['min', 'max']).sort_values(by='max', ascending=True).sort_values(by='min', ascending=False)
print(sorted_varieties.head(100))
print('\n')
# Passing both columns to a single sort_values call orders by min first, then by max within each min group.
# Sequential calls to sort_values do not achieve the same result.
sorted_varieties = reviews.groupby('variety')['price'].agg(['min', 'max']).sort_values(by=['min', 'max'], ascending=False)
print(sorted_varieties.head(100))
print('\n')
print(sorted_varieties.head(100))
print('\n')
q4.check()

              min
variety          
Abouriou     15.0
Agiorgitiko  10.0
Aglianico     6.0
Aidani       27.0
Airen         8.0
Albana       12.0
Albanello    20.0
Albariño     10.0
Albarossa    40.0
Aleatico     25.0


                             min    max
variety                                
Ramisco                    495.0  495.0
Terrantez                  236.0  236.0
Francisa                   160.0  160.0
Rosenmuskateller           150.0  150.0
Tinta Negra Mole           112.0  112.0
...                          ...    ...
Vespolina                   25.0   25.0
Sämling                     25.0   46.0
Magliocco                   25.0   35.0
Poulsard                    25.0   28.0
Chardonnay Weissburgunder   25.0   58.0

[100 rows x 2 columns]


                     min    max
variety                        
Ramisco            495.0  495.0
Terrantez          236.0  236.0
Francisa           160.0  160.0
Rosenmuskateller   150.0  150.0
Tinta Negra Mole   112.0  112.0
...         

<span style="color:#33cc33">Correct</span>

In [85]:
q4.hint()
q4.solution()

<span style="color:#3366cc">Hint:</span> Use `sort_values()`, and provide a list of names to sort by.

<span style="color:#33cc99">Solution:</span> 
```python
sorted_varieties = price_extremes.sort_values(by=['min', 'max'], ascending=False)
```
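
To see how a list of sort keys breaks ties, the sketch below sorts a made-up `extremes` frame (hypothetical labels and prices). It also shows that chained single-column sorts can reproduce the result only if the tie-breaker is sorted first and the final sort is stable, which is an assumption about `kind="mergesort"` rather than the default algorithm.

```python
import pandas as pd

# Hypothetical min/max prices; three varieties tie on min.
extremes = pd.DataFrame(
    {"min": [25.0, 25.0, 40.0, 25.0], "max": [46.0, 58.0, 40.0, 28.0]},
    index=["A", "B", "C", "D"],
)

# One call: order by min descending, then by max descending within ties.
multi = extremes.sort_values(by=["min", "max"], ascending=False)

# Chained equivalent: sort by the tie-breaker first, then by the primary key
# with a stable algorithm so that the earlier ordering survives among ties.
chained = (
    extremes.sort_values(by="max", ascending=False)
    .sort_values(by="min", ascending=False, kind="mergesort")
)

print(multi)
print(chained)
```

The single call with a list of columns is shorter and does not depend on the sorting algorithm, which is presumably why the hint recommends it.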

## 5.
Create a `Series` whose index is reviewers and whose values are the average review scores given out by each reviewer. Hint: you will need the `taster_name` and `points` columns.

In [None]:
reviewer_mean_ratings = ____

q5.check()

In [None]:
#q5.hint()
#q5.solution()

Are there significant differences in the average scores assigned by the various reviewers? Run the cell below to use the `describe()` method to see a summary of the range of values.

In [None]:
reviewer_mean_ratings.describe()
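
For orientation, here is a sketch of a per-group mean followed by `describe()` on a made-up `toy` frame (hypothetical reviewers and scores). It illustrates the general pattern only and is not the checked solution.

```python
import pandas as pd

# Hypothetical reviewers and scores.
toy = pd.DataFrame({
    "taster_name": ["Ann", "Ann", "Bo", "Bo", "Bo"],
    "points": [90, 86, 80, 85, 90],
})

# Average score per reviewer.
means = toy.groupby("taster_name")["points"].mean()

# Summary statistics across the per-reviewer averages.
summary = means.describe()
print(means)
print(summary)
```

A small gap between `min` and `max` in the summary would suggest that reviewers score on a similar scale.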

## 6.
Which combinations of country and variety are most common? Create a `Series` whose index is a `MultiIndex` of `{country, variety}` pairs. For example, a pinot noir produced in the US should map to `{"US", "Pinot Noir"}`. Sort the values in the `Series` in descending order based on wine count.

In [None]:
country_variety_counts = ____

q6.check()

In [None]:
#q6.hint()
#q6.solution()
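
Grouping by two columns produces a `MultiIndex`. The sketch below shows this on a made-up `toy` frame (hypothetical countries and varieties); it illustrates the pattern and is not the checked solution.

```python
import pandas as pd

# Hypothetical data: three US Pinot Noirs, two French Merlots, one Italian Merlot.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "France", "France", "Italy"],
    "variety": ["Pinot Noir", "Pinot Noir", "Pinot Noir", "Merlot", "Merlot", "Merlot"],
})

# Count rows per (country, variety) pair, most common first.
counts = toy.groupby(["country", "variety"]).size().sort_values(ascending=False)
print(counts)
```

Each index entry of `counts` is a `(country, variety)` tuple, so a single pair can be looked up with `counts.loc[("US", "Pinot Noir")]`.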

# Keep going

Move on to the [**data types and missing data**](https://www.kaggle.com/residentmario/data-types-and-missing-values).

*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*