# --------------------------------------
# Data Frame
# --------------------------------------

# pd.DataFrame(data= , columns= , index=)

**1. Create a DataFrame fruits that looks like this:**     
![](https://storage.googleapis.com/kaggle-media/learn/images/Ax3pp2A.png)


In [77]:
import pandas as pd


# data frame creation with default row labels
fruits = pd.DataFrame({
        'Apples':[30],
        'Bananas':[21]
    })
print(fruits)

# passing rows as a list of lists and naming the columns with the columns argument
fruits2 = pd.DataFrame([[30, 21]], columns=['Oranges','Strawberries'])
print(fruits2)

   Apples  Bananas
0      30       21
   Oranges  Strawberries
0       30            21


**2. Create a dataframe fruit_sales that matches the diagram below:**     
![](https://storage.googleapis.com/kaggle-media/learn/images/CHPn7ZF.png)

In [78]:
# DataFrame() with custom row labels
fruit_sales = pd.DataFrame(data=[[35, 21],[41, 34]], 
                           columns=['Apples', 'Bananas'], 
                           index=['2017 Sales','2018 Sales'])
fruit_sales

Unnamed: 0,Apples,Bananas
2017 Sales,35,21
2018 Sales,41,34


# pd.Series(data= , index=, name= )

**3. Create a variable ingredients with a Series that looks like:**   

Flour     4 cups    
Milk       1 cup     
Eggs     2 large     
Spam       1 can     
Name: Dinner, dtype: object

In [79]:
# pd.Series(data=, index=, name=) - Create a one-dimensional labeled array
ingredients = pd.Series( data=['4 cups','1 cup','2 large','1 can'],
                         index=['Flour','Milk','Eggs','Spam'],
                         name='Dinner')

ingredients


Flour     4 cups
Milk       1 cup
Eggs     2 large
Spam       1 can
Name: Dinner, dtype: object
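
The index labels of a Series can be used to look values up, much like keys in a dictionary. The sketch below rebuilds the same `ingredients` Series and reads from it by label and by position; the variable names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

ingredients = pd.Series(
    data=['4 cups', '1 cup', '2 large', '1 can'],
    index=['Flour', 'Milk', 'Eggs', 'Spam'],
    name='Dinner',
)

by_label = ingredients['Milk']       # look up by index label
by_position = ingredients.iloc[0]    # look up by integer position
print(by_label, by_position)
print(ingredients.name, list(ingredients.index))
```

The `name` of a Series plays the role that a column label plays in a DataFrame.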

# pd.read_csv(datafilepath, index_col=)

**4. Read the following csv dataset of wine reviews into a DataFrame called `reviews`:**    
filename: winemag-data_first150k.csv
![](https://storage.googleapis.com/kaggle-media/learn/images/74RCZtU.png)


In [80]:
# data file path
reviews_filepath = "/kaggle/input/wine-reviews/winemag-data_first150k.csv"

# read and save the csv data
reviews = pd.read_csv(reviews_filepath, index_col=0) # use the first (unnamed) column as the index

reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
...,...,...,...,...,...,...,...,...,...,...
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder


# DataFrame.to_csv()

**5. Create a DataFrame from a dictionary, then save it to a csv file**

In [81]:
animals = pd.DataFrame({'Cows':[12, 20], 'Goats':[22, 19]},
                         index=['Year 1', 'Year 2'])
animals

Unnamed: 0,Cows,Goats
Year 1,12,22
Year 2,20,19


In [82]:
# save this DataFrame to disk as a csv file with the name cows_and_goats.csv
animals.to_csv('cows_and_goats.csv')

# load the data from the saved csv file, use the first unnamed column as index_col
df = pd.read_csv("/kaggle/working/cows_and_goats.csv", index_col=0)
df

Unnamed: 0,Cows,Goats
Year 1,12,22
Year 2,20,19
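
To see what `index_col=0` actually does, here is a small round trip that stays in memory instead of touching the disk. It assumes nothing beyond pandas and the standard library; `csv_text`, `without_index` and `with_index` are names made up for this sketch.

```python
import io
import pandas as pd

animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]},
                       index=['Year 1', 'Year 2'])

csv_text = animals.to_csv()  # with no path given, to_csv returns the CSV text

# Without index_col, the saved row labels come back as an ordinary column
without_index = pd.read_csv(io.StringIO(csv_text))
# With index_col=0, the first column is restored as the row labels
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(without_index.columns))
print(list(with_index.columns))
print(list(with_index.index))
```

The extra `Unnamed: 0` column in the first result is the old index, which had no header in the file.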


# --------------------------------------
# Indexing, Selecting & Assigning
# --------------------------------------
Pro data scientists do this dozens of times a day. You can, too!    

**How to go about selecting the data points relevant to you quickly and effectively.**

In [83]:
import pandas as pd
reviews = pd.read_csv("/kaggle/input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)

In [84]:
reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit


# Native accessors
Native Python objects provide good ways of indexing data. Pandas carries all of these over, which helps make it easy to start with. In Python, we can access the property of an object by accessing it as an attribute. A book object, for example, might have a title property, which we can access as book.title. Columns in a pandas DataFrame work in much the same way.

In [85]:
# accessing column using dot notation, like accessing property of object
reviews.country  # returns pandas.core.series.Series

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

In [86]:
# accessing column using [] - indexing operator, like accessing key in dictionary
# prefer this method: it also works for column names with multiple words or names that clash with reserved words
reviews['country']

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

In [87]:
# drilling down to a specific value from the selected column using one more of []
reviews['country'][0]  # return the first value in 'country' column in the series

'Italy'
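
A small sketch of where dot access stops working. The data is a toy DataFrame, not the wine dataset, and the column names are chosen only to show the two problem cases: a name with a space, and a name (`count`) that collides with an existing DataFrame method.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['Italy', 'US'],
    'taster name': ['A', 'B'],   # contains a space
    'count': [3, 5],             # same name as the DataFrame.count method
})

same = toy.country.equals(toy['country'])   # both forms give the same Series
spaced = toy['taster name']                 # only the [] form can express this name
count_is_method = callable(toy.count)       # dot access finds the method, not the column
count_column = toy['count']                 # [] always finds the column
print(same, count_is_method, list(count_column))
```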

# Indexing in Pandas    
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, **loc** and **iloc**. For more advanced operations, **these are the ones you're supposed to be using**.   
* loc[] stands for "**location**" and is primarily **label-based indexing**. It allows you to access data by specifying the row and column labels explicitly.
* iloc[] stands for "**integer location**" and is primarily **position-based indexing** (called index-based selection below). It allows you to access data by specifying the integer position of the rows and columns.
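
The difference between labels and positions is easiest to see when the index is not simply 0, 1, 2, .... The sketch below uses a toy DataFrame (not the wine data) whose row labels are 10, 20, 30.

```python
import pandas as pd

toy = pd.DataFrame(
    {'points': [87, 90, 95]},
    index=[10, 20, 30],   # integer labels that differ from the positions 0, 1, 2
)

by_position = toy.iloc[0, 0]       # first row, first column
by_label = toy.loc[10, 'points']   # row labelled 10, column labelled 'points'

# There is no row labelled 0, so the label-based lookup fails
try:
    toy.loc[0]
    missing_label_error = False
except KeyError:
    missing_label_error = True

print(by_position, by_label, missing_label_error)
```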

**index-based selections**

In [88]:
print(reviews.iloc[0]) # first row
print(reviews.iloc[0:1, 0:2])  # first row only, first two columns
print(reviews.iloc[:, 0]) # all rows, first column
print(reviews.iloc[:3, 0]) # first three rows of the first column (country)
print(reviews.iloc[0:3, 0])  # same selection with an explicit start position
print(reviews.iloc[[0, 1, 2], 0])  # same selection using a list of positions
print(reviews.iloc[-5:])  # negative numbers count from the end: the last five rows

country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object
  country                                        description
0   Italy  Aromas include tropical fruit, broom, brimston...
0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object
0       Italy
1    Portugal
2          US
Name: country, dtype: object
0       Italy
1    Portugal
2          US
Name: country, dtype: object
0       Italy
1    Portugal
2          US
Name: country, dtype: object
        country                                        description  \
129966  Germany  Notes of honeysuckle and cantaloupe sweeten th...   
129967       US  Citation

In [89]:
reviews.iloc[:3, 0]

0       Italy
1    Portugal
2          US
Name: country, dtype: object

**label-based selections**

In [90]:
reviews.loc[0, 'country']

'Italy'

In [91]:
# loc[] can use 'meaningful' indices names
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

Unnamed: 0,taster_name,taster_twitter_handle,points
0,Kerin O’Keefe,@kerinokeefe,87
1,Roger Voss,@vossroger,87
...,...,...,...
129969,Roger Voss,@vossroger,90
129970,Roger Voss,@vossroger,90


**Choosing between** loc **and** iloc   
When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].

Otherwise, the semantics of using loc are the same as those for iloc.
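
The "gotcha" can be checked on two toy DataFrames. Both are made up for this sketch: `numbers` has the default 0..4 index, and `produce` has sorted string labels.

```python
import pandas as pd

numbers = pd.DataFrame({'value': range(5)})   # default index 0, 1, 2, 3, 4
n_iloc = len(numbers.iloc[0:3])   # positions 0, 1, 2 (end excluded)
n_loc = len(numbers.loc[0:3])     # labels 0, 1, 2, 3 (end included)

produce = pd.DataFrame(
    {'stock': [5, 7, 2, 9]},
    index=['Apples', 'Bananas', 'Potatoes', 'Squash'],
)
# A label slice includes its end label, so 'Potatoes' is part of the result
selected = list(produce.loc['Apples':'Potatoes'].index)

print(n_iloc, n_loc, selected)
```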

**Manipulating the index**

**df.set_index()** - **index can be changed**. This is useful if you can come up with an index for the dataset which is better than the current one.

In [92]:
# index can be changed
reviews.set_index("title")

title,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,variety,winery
Nicosia 2013 Vulkà Bianco (Etna),Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,White Blend,Nicosia
Quinta dos Avidagos 2011 Avidagos Red (Douro),Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...
Domaine Marcel Deiss 2012 Pinot Gris (Alsace),France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Pinot Gris,Domaine Marcel Deiss
Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caroline Gewurztraminer (Alsace),France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Gewürztraminer,Domaine Schoffit
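
Note that `set_index()` returns a new DataFrame by default and leaves the original untouched, so the result has to be assigned to a name to keep it. A minimal sketch on toy data (the names `toy` and `indexed` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'title': ['A', 'B'], 'points': [87, 90]})
indexed = toy.set_index('title')   # new DataFrame; toy keeps its old index

print(list(toy.index))         # original row labels are unchanged
print(list(indexed.index))     # the 'title' values are now the row labels
print(list(indexed.columns))   # 'title' is no longer a regular column
```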


**Conditional selection**

In [93]:
reviews.country == 'Italy'

0          True
1         False
          ...  
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

In [94]:
# This operation produced a Series of True/False booleans based on the country of each record. This result can then be used inside of loc to select the relevant data:
reviews.loc[reviews.country == 'Italy']

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129961,Italy,"Intense aromas of wild cherry, baking spice, t...",,90,30.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,COS 2013 Frappato (Sicilia),Frappato,COS
129962,Italy,"Blackberry, cassis, grilled herb and toasted a...",Sàgana Tenuta San Giacomo,90,40.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,Cusumano 2012 Sàgana Tenuta San Giacomo Nero d...,Nero d'Avola,Cusumano


In [95]:
# wines from Italy that scored 90 points or higher (both conditions must hold)
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
120,Italy,"Slightly backward, particularly given the vint...",Bricco Rocche Prapó,92,70.0,Piedmont,Barolo,,,,Ceretto 2003 Bricco Rocche Prapó (Barolo),Nebbiolo,Ceretto
130,Italy,"At the first it was quite muted and subdued, b...",Bricco Rocche Brunate,91,70.0,Piedmont,Barolo,,,,Ceretto 2003 Bricco Rocche Brunate (Barolo),Nebbiolo,Ceretto
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129961,Italy,"Intense aromas of wild cherry, baking spice, t...",,90,30.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,COS 2013 Frappato (Sicilia),Frappato,COS
129962,Italy,"Blackberry, cassis, grilled herb and toasted a...",Sàgana Tenuta San Giacomo,90,40.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,Cusumano 2012 Sàgana Tenuta San Giacomo Nero d...,Nero d'Avola,Cusumano


In [96]:
# wines that are from Italy or scored 90 points or higher (either condition is enough)
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit
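
The same logic on a four-row toy DataFrame, where the results are small enough to check by eye. The masks are stored in variables here (`is_italian`, `is_high` are made-up names); when conditions are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than `==` and `>=`.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['Italy', 'Italy', 'France', 'US'],
    'points': [87, 92, 95, 85],
})

is_italian = toy.country == 'Italy'   # boolean Series, one entry per row
is_high = toy.points >= 90

both = toy.loc[is_italian & is_high]           # and
either = toy.loc[is_italian | is_high]         # or
neither = toy.loc[~(is_italian | is_high)]     # not

print(list(both.index), list(either.index), list(neither.index))
```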


**A few built-in conditional selectors from Pandas**   
* **isin()**
* **isnull() / notnull()**

In [97]:
reviews.loc[reviews.country.isin(['Italy', 'France'])]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit


In [98]:
# keep only the wines that have a price, dropping rows where the price is missing (NaN)
reviews.loc[reviews.price.notnull()]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit
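
A compact sketch of these selectors on toy data with deliberately missing values. The DataFrame and the variable names are assumptions made for illustration, not part of the wine dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'country': ['Italy', 'France', 'US', None],
    'price': [np.nan, 15.0, 14.0, 20.0],
})

in_list = toy.loc[toy.country.isin(['Italy', 'France'])]   # value is one of the listed ones
priced = toy.loc[toy.price.notnull()]                      # price is present
unpriced = toy.loc[toy.price.isnull()]                     # price is missing

print(list(in_list.index), list(priced.index), list(unpriced.index))
```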


# Assigning Data

In [99]:
reviews['critic'] = 'everyone'
reviews['critic']

0         everyone
1         everyone
            ...   
129969    everyone
129970    everyone
Name: critic, Length: 129971, dtype: object

In [100]:
# assigning using an iterable of values
reviews['index_backward'] = range(len(reviews), 0, -1)
reviews # now has an 'index_backward' column (this bare expression is not displayed; only the last line of a cell is)
reviews.set_index('index_backward') # returns a copy indexed by 'index_backward'; reviews itself is unchanged

index_backward,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,critic
129971,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,everyone
129970,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,everyone
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss,everyone
1,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit,everyone
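
A minimal sketch of the two kinds of assignment on a three-row toy DataFrame: a single value is repeated for every row, while an iterable has to supply exactly one value per row. The `too_short` column name is hypothetical and only there to trigger the error.

```python
import pandas as pd

toy = pd.DataFrame({'country': ['Italy', 'US', 'France']})

toy['critic'] = 'everyone'                       # constant: broadcast to all rows
toy['index_backward'] = range(len(toy), 0, -1)   # iterable: one value per row

# An iterable of the wrong length is rejected
try:
    toy['too_short'] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

print(list(toy['critic']), list(toy['index_backward']), length_error)
```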


In [101]:
# import pd and set csv data file - winemag-data-130k-v2.csv
import pandas as pd
reviews = pd.read_csv("/kaggle/input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


**Exercise 1.   
Select the description column from reviews and assign the result to the variable desc.**

In [102]:
# using native accessor
desc = reviews['description']
print(desc)

# using label-based operator
desc = reviews.loc[:, 'description']
print(desc)

# using index-based operator
desc = reviews.iloc[:, 1]
print(desc)

0         Aromas include tropical fruit, broom, brimston...
1         This is ripe and fruity, a wine that is smooth...
                                ...                        
129969    A dry style of Pinot Gris, this is crisp with ...
129970    Big, rich and off-dry, this is powered by inte...
Name: description, Length: 129971, dtype: object
0         Aromas include tropical fruit, broom, brimston...
1         This is ripe and fruity, a wine that is smooth...
                                ...                        
129969    A dry style of Pinot Gris, this is crisp with ...
129970    Big, rich and off-dry, this is powered by inte...
Name: description, Length: 129971, dtype: object
0         Aromas include tropical fruit, broom, brimston...
1         This is ripe and fruity, a wine that is smooth...
                                ...                        
129969    A dry style of Pinot Gris, this is crisp with ...
129970    Big, rich and off-dry, this is powered by inte...
Na

**Exercise 2.    
Select the first value from the description column of reviews, assigning it to variable first_description.**

In [103]:
first_description = reviews.description.iloc[0]
print(first_description)
first_description = reviews.description.loc[0]
print(first_description)
first_description = reviews.description[0]
print(first_description)

Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.
Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.
Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.


**Exercise 3.    
Select the first row of data (the first record) from reviews, assigning it to the variable first_row.**

In [104]:
first_row = reviews.loc[0]
first_row

country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object

**Exercise 4.   
Select the first 10 values from the description column in reviews, assigning the result to variable first_descriptions.**

In [105]:
first_descriptions = reviews.description.iloc[:10]
print(first_descriptions)
first_descriptions = reviews.loc[:9, 'description']
print(first_descriptions)
print(reviews.description.head(10))

0    Aromas include tropical fruit, broom, brimston...
1    This is ripe and fruity, a wine that is smooth...
                           ...                        
8    Savory dried thyme notes accent sunnier flavor...
9    This has great depth of flavor with its fresh ...
Name: description, Length: 10, dtype: object
0    Aromas include tropical fruit, broom, brimston...
1    This is ripe and fruity, a wine that is smooth...
                           ...                        
8    Savory dried thyme notes accent sunnier flavor...
9    This has great depth of flavor with its fresh ...
Name: description, Length: 10, dtype: object
0    Aromas include tropical fruit, broom, brimston...
1    This is ripe and fruity, a wine that is smooth...
                           ...                        
8    Savory dried thyme notes accent sunnier flavor...
9    This has great depth of flavor with its fresh ...
Name: description, Length: 10, dtype: object


**Exercise 5.    
Select the records with index labels 1, 2, 3, 5, and 8, assigning the result to the variable sample_reviews.**

In [106]:
sample_reviews = reviews.loc[[1, 2, 3, 5, 8]]
sample_reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
5,Spain,Blackberry and raspberry aromas show a typical...,Ars In Vitro,87,15.0,Northern Spain,Navarra,,Michael Schachner,@wineschach,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem
8,Germany,Savory dried thyme notes accent sunnier flavor...,Shine,87,12.0,Rheinhessen,,,Anna Lee C. Iijima,,Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe...,Gewürztraminer,Heinz Eifel


**Exercise 6.   
Create a variable df containing the country, province, region_1, and region_2 columns of the records with the index labels 0, 1, 10, and 100. In other words, generate the following DataFrame:**   
![](https://storage.googleapis.com/kaggle-media/learn/images/FUCGiKP.png)

In [107]:
indices = [0, 1, 10, 100]
cols = ['country','province','region_1','region_2']
df = reviews.loc[indices, cols]
df

Unnamed: 0,country,province,region_1,region_2
0,Italy,Sicily & Sardinia,Etna,
1,Portugal,Douro,,
10,US,California,Napa Valley,Napa
100,US,New York,Finger Lakes,Finger Lakes


**Exercise 7.   
Create a variable df containing the country and variety columns of the first 100 records.**   
Hint: you may use `loc` or `iloc`. When working on this question and several of the ones that follow, keep in mind the following "gotcha" described in the tutorial:

> `iloc` uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. 
`loc`, meanwhile, indexes inclusively. 

> This is particularly confusing when the DataFrame index is a simple numerical list, e.g. `0,...,1000`. In this case `df.iloc[0:1000]` will return 1000 entries, while `df.loc[0:1000]` returns 1001 of them! To get 1000 elements using `loc`, you will need to go one lower and ask for `df.loc[0:999]`. 

In [108]:
cols = ['country', 'variety']
df = reviews.loc[0:99, cols]  # loc is end-inclusive, so labels 0..99 give 100 rows
print(df)

cols_idx = [0, 11]
df = reviews.iloc[:100, cols_idx]
print(df)

     country                   variety
0      Italy               White Blend
1   Portugal            Portuguese Red
..       ...                       ...
98     Italy                Sangiovese
99        US  Bordeaux-style Red Blend

[100 rows x 2 columns]
     country                   variety
0      Italy               White Blend
1   Portugal            Portuguese Red
..       ...                       ...
98     Italy                Sangiovese
99        US  Bordeaux-style Red Blend

[100 rows x 2 columns]


**Exercise 8.   
Create a DataFrame italian_wines containing reviews of wines made in Italy. Hint: reviews.country equals what?**

In [109]:
italian_wines = reviews.loc[reviews['country'] == 'Italy']
print(italian_wines)

italian_wines = reviews.loc[reviews.country == 'Italy']
print(italian_wines)

       country                                        description  \
0        Italy  Aromas include tropical fruit, broom, brimston...   
6        Italy  Here's a bright, informal red that opens with ...   
...        ...                                                ...   
129961   Italy  Intense aromas of wild cherry, baking spice, t...   
129962   Italy  Blackberry, cassis, grilled herb and toasted a...   

                      designation  points  price           province  region_1  \
0                    Vulkà Bianco      87    NaN  Sicily & Sardinia      Etna   
6                         Belsito      87   16.0  Sicily & Sardinia  Vittoria   
...                           ...     ...    ...                ...       ...   
129961                        NaN      90   30.0  Sicily & Sardinia   Sicilia   
129962  Sàgana Tenuta San Giacomo      90   40.0  Sicily & Sardinia   Sicilia   

       region_2    taster_name taster_twitter_handle  \
0           NaN  Kerin O’Keefe          @k

**Exercise 9.   
Create a DataFrame top_oceania_wines containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand.**

In [110]:
top_oceania_wines = reviews.loc[
    (reviews.country.isin(['Australia', 'New Zealand']))
    & (reviews.points >= 95)
]

top_oceania_wines

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
345,Australia,This wine contains some material over 100 year...,Rare,100,350.0,Victoria,Rutherglen,,Joe Czerwinski,@JoeCz,Chambers Rosewood Vineyards NV Rare Muscat (Ru...,Muscat,Chambers Rosewood Vineyards
346,Australia,"This deep brown wine smells like a damp, mossy...",Rare,98,350.0,Victoria,Rutherglen,,Joe Czerwinski,@JoeCz,Chambers Rosewood Vineyards NV Rare Muscadelle...,Muscadelle,Chambers Rosewood Vineyards
...,...,...,...,...,...,...,...,...,...,...,...,...,...
122507,New Zealand,"This blend of Cabernet Sauvignon (62.5%), Merl...",SQM Gimblett Gravels Cabernets/Merlot,95,79.0,Hawke's Bay,,,Joe Czerwinski,@JoeCz,Squawking Magpie 2014 SQM Gimblett Gravels Cab...,Bordeaux-style Red Blend,Squawking Magpie
122939,Australia,Full-bodied and plush yet vibrant and imbued w...,The Factor,98,125.0,South Australia,Barossa Valley,,Joe Czerwinski,@JoeCz,Torbreck 2013 The Factor Shiraz (Barossa Valley),Shiraz,Torbreck


# Summary Functions and Maps
- formatting the data to your needs

In [111]:
import pandas as pd
pd.set_option('display.max_rows', 10)
import numpy as np

reviews = pd.read_csv("/kaggle/input/wine-reviews/winemag-data-130k-v2.csv", index_col=0) # 129971 rows × 13 columns

In [112]:
reviews.head(20)


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15,Germany,Zesty orange peels and apple notes abound in t...,Devon,87,24.0,Mosel,,,Anna Lee C. Iijima,,Richard Böcking 2013 Devon Riesling (Mosel),Riesling,Richard Böcking
16,Argentina,"Baked plum, molasses, balsamic vinegar and che...",Felix,87,30.0,Other,Cafayate,,Michael Schachner,@wineschach,Felix Lavaque 2010 Felix Malbec (Cafayate),Malbec,Felix Lavaque
17,Argentina,Raw black-cherry aromas are direct and simple ...,Winemaker Selection,87,13.0,Mendoza Province,Mendoza,,Michael Schachner,@wineschach,Gaucho Andino 2011 Winemaker Selection Malbec ...,Malbec,Gaucho Andino
18,Spain,"Desiccated blackberry, leather, charred wood a...",Vendimia Seleccionada Finca Valdelayegua Singl...,87,28.0,Northern Spain,Ribera del Duero,,Michael Schachner,@wineschach,Pradorey 2010 Vendimia Seleccionada Finca Vald...,Tempranillo Blend,Pradorey


**summary functions** - pandas provides many of them to restructure the data in a useful way.

In [113]:
desc = reviews.points.describe()
print("Summary of reviews.points column ")
print(desc)


print("Series index: ")
print(desc.index)
print("Series values: ")
print(desc.values)


Summary of reviews.points column 
count    129971.000000
mean         88.447138
std           3.039730
min          80.000000
25%          86.000000
50%          88.000000
75%          91.000000
max         100.000000
Name: points, dtype: float64
Series index: 
Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')
Series values: 
[1.29971000e+05 8.84471382e+01 3.03973020e+00 8.00000000e+01
 8.60000000e+01 8.80000000e+01 9.10000000e+01 1.00000000e+02]


In [114]:
reviews.taster_name.describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object
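
The two `describe()` outputs above differ because the method is type-aware: a numeric column gets count/mean/std/quartiles, while a text column gets count/unique/top/freq. The sketch below reproduces this on a tiny made-up frame (the `toy` data and its values are assumptions for illustration, not part of the wine dataset). Note that `count` skips missing values, which is why `taster_name` reported fewer entries than the number of rows.

```python
import pandas as pd

# Hypothetical toy data: one numeric column, one text column with a missing value
toy = pd.DataFrame({
    "points": [87, 90, 90, 95],
    "taster_name": ["Ann", "Bob", None, "Ann"],
})

numeric_summary = toy.points.describe()    # numeric statistics
text_summary = toy.taster_name.describe()  # categorical-style statistics

print(list(numeric_summary.index))
print(list(text_summary.index))
print(text_summary["count"], text_summary["top"])  # missing value is not counted
```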

In [115]:
reviews.points.mean()

88.44713820775404

In [116]:
reviews.taster_name.unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

In [117]:
reviews.taster_name.value_counts()

Roger Voss            25514
Michael Schachner     15134
Kerin O’Keefe         10776
Virginie Boone         9537
Paul Gregutt           9532
                      ...  
Jeff Jenssen            491
Alexander Peartree      415
Carrie Dykes            139
Fiona Adams              27
Christina Pickard         6
Name: taster_name, Length: 19, dtype: int64

**Maps** - a map takes one set of values and "maps" them to another set of values.   
Two methods are available for mapping:
* map()
* apply()
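
A minimal sketch of the difference between the two methods, on made-up data (the `toy` frame and the `points_per_dollar` helper are hypothetical names used only here): `map()` calls the function once per value of a single column, while `apply()` with `axis='columns'` calls it once per row of the whole frame.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews
toy = pd.DataFrame({"points": [80, 90, 100], "price": [10.0, 20.0, 40.0]})
mean_points = toy.points.mean()  # 90.0

# map(): the lambda receives one value of the 'points' column at a time
centered = toy.points.map(lambda p: p - mean_points)

# apply(axis='columns'): the function receives one whole row at a time
def points_per_dollar(row):
    return row.points / row.price

ratio = toy.apply(points_per_dollar, axis="columns")

print(centered.tolist())
print(ratio.tolist())
```

Both calls return new objects; `toy` itself is left unchanged.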

In [132]:
# map(): the simpler of the two; transforms a single column (Series) with a function
reviews_points_mean = reviews.points.mean()
reviews_points_mean # 88.44713820775404

# Now re-center the scores so that their mean is 0:
# map each value in 'points' to 'value - 88.44713820775404'
new_points = reviews.points.map(lambda p: p - reviews_points_mean) # returns a new Series; reviews.points is unchanged

# original score_point vs new score_point
# for index in range(len(reviews.points)-1):
#    print("reviews.points[{}] = {} vs. new score = {}".format(index, reviews.points[index], new_points[index]))


In [119]:
#reviews.points

In [120]:
# apply(): transform the entire DataFrame by calling a function on each row
def remean_points(row):
    """Transform the row by subtracting the mean wine points score."""
    row.points = row.points - reviews_points_mean
    return row

# returns a new df with the transformation
new_reviews = reviews.apply(remean_points, axis='columns')  # axis='columns': the function receives each row

# raises an error: with axis='index' the function receives each column, which has no 'points' entry
#new_reviews2 = reviews.apply(remean_points, axis='index')

reviews.head(1) # the original df does not change
    

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia


In [121]:
# pandas understands the basic Python operators directly; they are much faster than map() or apply()
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean  # applies it to all the values in reviews.points
reviews.country + " - " + reviews.region_1

0                     Italy - Etna
1                              NaN
2           US - Willamette Valley
3         US - Lake Michigan Shore
4           US - Willamette Valley
                    ...           
129966                         NaN
129967                 US - Oregon
129968             France - Alsace
129969             France - Alsace
129970             France - Alsace
Length: 129971, dtype: object
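
The `NaN` rows in the output above appear because combining a string with a missing `region_1` yields a missing result. A small sketch on made-up rows, assuming a placeholder such as `"Unknown"` is acceptable (the `toy` frame and the placeholder are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data: the second row has no region
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "region_1": ["Etna", None, "Willamette Valley"],
})

# A missing operand makes the whole result missing for that row
combined = toy.country + " - " + toy.region_1

# Filling the gap first keeps every row as a string
combined_filled = toy.country + " - " + toy.region_1.fillna("Unknown")

print(combined.isna().tolist())
print(combined_filled.tolist())
```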

**Exercise setup**

In [122]:
import pandas as pd
pd.set_option('display.max_rows', 7)
reviews = pd.read_csv("/kaggle/input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
reviews.head()


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


**Exercise 1. What is the median of the points column in the reviews DataFrame?**

In [123]:
reviews.points.median()

88.0

**Exercise 2. What countries are represented in the dataset? (Your answer should not include any duplicates.)**

In [124]:
countries = reviews.country.unique()

In [125]:
countries

array(['Italy', 'Portugal', 'US', 'Spain', 'France', 'Germany',
       'Argentina', 'Chile', 'Australia', 'Austria', 'South Africa',
       'New Zealand', 'Israel', 'Hungary', 'Greece', 'Romania', 'Mexico',
       'Canada', nan, 'Turkey', 'Czech Republic', 'Slovenia',
       'Luxembourg', 'Croatia', 'Georgia', 'Uruguay', 'England',
       'Lebanon', 'Serbia', 'Brazil', 'Moldova', 'Morocco', 'Peru',
       'India', 'Bulgaria', 'Cyprus', 'Armenia', 'Switzerland',
       'Bosnia and Herzegovina', 'Ukraine', 'Slovakia', 'Macedonia',
       'China', 'Egypt'], dtype=object)

**Exercise 3. How often does each country appear in the dataset? Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.**

In [126]:
reviews_per_country = reviews.country.value_counts()
reviews_per_country

US          54504
France      22093
Italy       19540
            ...  
Slovakia        1
China           1
Egypt           1
Name: country, Length: 43, dtype: int64

**Exercise 4. Create variable centered_price containing a version of the price column with the mean price subtracted. (Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.)**

In [127]:
centered_price = reviews.price - reviews.price.mean()
centered_price

0               NaN
1        -20.363389
2        -21.363389
            ...    
129968    -5.363389
129969    -3.363389
129970   -14.363389
Name: price, Length: 129971, dtype: float64

**Exercise 5. I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.**

In [128]:
#reviews.description.iloc[3]
#reviews['p2p']= reviews.points / reviews.price
#reviews
#reviews.drop(columns = ['p2p'], inplace=True)
bargain_score = reviews.points / reviews.price
type(bargain_score)
bargain_wine = reviews.loc[bargain_score.idxmax(), 'title']
bargain_wine

'Bandit NV Merlot (California)'

**Exercise 6. There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)**

In [129]:

word_check = reviews.description.map(lambda desc: "tropical" in desc)
print(word_check)
print(type(word_check))

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
print(n_trop)
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
print(n_fruity)

descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
descriptor_counts



0          True
1         False
2         False
          ...  
129968    False
129969    False
129970    False
Name: description, Length: 129971, dtype: bool
<class 'pandas.core.series.Series'>
3607
9090


tropical    3607
fruity      9090
dtype: int64
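
An alternative sketch for the same count using the vectorized string accessor instead of `map()` with a lambda. The sample descriptions are made up for illustration; `na=False` is an assumption about how missing descriptions should be treated (counted as "does not contain"), whereas the lambda version would raise on a missing value.

```python
import pandas as pd

# Hypothetical sample descriptions
descriptions = pd.Series([
    "Aromas include tropical fruit",
    "This is ripe and fruity",
    "Tart and snappy",
    "A fruity wine with tropical notes",
    "Soft, fruity finish",
])

# str.contains returns a boolean Series; summing it counts the True values
n_trop = descriptions.str.contains("tropical", regex=False, na=False).sum()
n_fruity = descriptions.str.contains("fruity", regex=False, na=False).sum()

descriptor_counts = pd.Series([n_trop, n_fruity], index=["tropical", "fruity"])
print(descriptor_counts)
```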

**Exercise 7. We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star. Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points. Create a series star_ratings with the number of stars corresponding to each review in the dataset.**

In [130]:
def star_rating(row):
    """Return the star rating (1-3) for a single review row."""
    if row.country == "Canada" or row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

# send each row of reviews to star_rating (axis='columns')
star_ratings = reviews.apply(star_rating, axis='columns')

print(type(star_ratings))
print(star_ratings)


<class 'pandas.core.series.Series'>
0         2
1         2
2         2
         ..
129968    2
129969    2
129970    2
Length: 129971, dtype: int64
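
Row-wise `apply()` is easy to read but slow on a large frame. One possible vectorized variant, sketched here on made-up rows, uses `numpy.select`, which picks the first matching condition for each row. This is an alternative technique, not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering each branch of the rule
toy = pd.DataFrame({
    "country": ["Canada", "US", "France", "Italy"],
    "points": [82, 96, 88, 80],
})

# Conditions are checked in order; the first one that is True wins
conditions = [
    (toy.country == "Canada") | (toy.points >= 95),  # 3 stars
    toy.points >= 85,                                 # 2 stars
]
star_ratings = pd.Series(np.select(conditions, [3, 2], default=1), index=toy.index)

print(star_ratings.tolist())
```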


# Grouping and Sorting

In [3]:
import pandas as pd
reviews = pd.read_csv("/kaggle/input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)

**replicate value_counts() builtin function**
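
As a small self-contained illustration of the idea (toy data, not the wine dataset): grouping by a column and counting each group gives the same numbers as `value_counts()`, only ordered by the group key instead of by frequency.

```python
import pandas as pd

# Hypothetical toy scores
toy = pd.DataFrame({"points": [80, 81, 80, 82, 81, 80]})

# Count the rows in each group of equal 'points'
by_group = toy.groupby("points").points.count()

# value_counts() sorts by frequency; sort_index() puts it in key order for comparison
by_value_counts = toy.points.value_counts().sort_index()

print(by_group.tolist())
print(by_value_counts.tolist())
```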

In [4]:
reviews
df_groupby = reviews.groupby('points')
print(df_groupby)
print(df_groupby.points)

print(df_groupby.points.count())
print(reviews.groupby('points').count())
print(reviews.groupby('points').points.count())

print(type(reviews.groupby('points').points.count()))


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd0ecf24610>
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fd0ecf246d0>
points
80     397
81     692
      ... 
99      33
100     19
Name: points, Length: 21, dtype: int64
        country  description  designation  price  province  region_1  \
points                                                                 
80          397          397          250    395       397       328   
81          692          692          406    680       692       584   
...         ...          ...          ...    ...       ...       ...   
99           33           33           29     28        33        31   
100          19           19           13     19        19        17   

        region_2  taster_name  taster_twitter_handle  title  variety  winery  
points                                                                        
80           136          275                    271    397      397     397  
81           293

In [5]:
# apply summary functions to the grouped Series / DataFrame data.
reviews.groupby('points').price.min()

# each group is accessible through apply() for further manipulation or selection
# Select the name of the first wine reviewed from each winery in the dataset:
wineries = reviews.groupby('winery')
wineries.apply(lambda df: df.title.iloc[0])


winery
1+1=3                          1+1=3 NV Rosé Sparkling (Cava)
10 Knots                 10 Knots 2010 Viognier (Paso Robles)
                                  ...                        
àMaurice    àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka                         Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object

In [6]:
# grouping by multiple columns and applying a function
# Select the best wine by country and province

reviews.groupby(['country','province']).apply(lambda df: df.loc[df.points.idxmax()])

Unnamed: 0_level_0,Unnamed: 1_level_0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
country,province,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Argentina,Mendoza Province,Argentina,"If the color doesn't tell the full story, the ...",Nicasia Vineyard,97,120.0,Mendoza Province,Mendoza,,Michael Schachner,@wineschach,Bodega Catena Zapata 2006 Nicasia Vineyard Mal...,Malbec,Bodega Catena Zapata
Argentina,Other,Argentina,"Take note, this could be the best wine Colomé ...",Reserva,95,90.0,Other,Salta,,Michael Schachner,@wineschach,Colomé 2010 Reserva Malbec (Salta),Malbec,Colomé
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Uruguay,San Jose,Uruguay,"Baked, sweet, heavy aromas turn earthy with ti...",El Preciado Gran Reserva,87,50.0,San Jose,,,Michael Schachner,@wineschach,Castillo Viejo 2005 El Preciado Gran Reserva R...,Red Blend,Castillo Viejo
Uruguay,Uruguay,Uruguay,"Cherry and berry aromas are ripe, healthy and ...",Blend 002 Limited Edition,91,22.0,Uruguay,,,Michael Schachner,@wineschach,Narbona NV Blend 002 Limited Edition Tannat-Ca...,Tannat-Cabernet Franc,Narbona


**agg(): very useful; runs several functions on the grouped data at the same time.**

In [7]:
# group the data frame, and apply aggregation functions in bulk on df/column(s)
reviews.groupby(['country']).price.agg([len, min, max])  # apply aggregate functions

Unnamed: 0_level_0,len,min,max
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Argentina,3800,4.0,230.0
Armenia,2,14.0,15.0
...,...,...,...
Ukraine,14,6.0,13.0
Uruguay,109,10.0,130.0
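
The cell above passes the Python builtins `len`, `min` and `max` to `agg()`. A variant worth knowing, sketched on made-up data, passes pandas' own aggregation names as strings; recent pandas versions may warn when builtins are used. The sketch also shows that `'size'` (like `len`) counts every row, while `'count'` skips missing prices.

```python
import pandas as pd

# Hypothetical toy data: one US wine has no price
toy = pd.DataFrame({
    "country": ["US", "US", "France", "France", "France"],
    "price": [10.0, None, 20.0, 35.0, 15.0],
})

# String names select pandas' built-in group aggregations
summary = toy.groupby("country").price.agg(["size", "count", "min", "max"])
print(summary)
```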


**multi-index and convert to regular index**

In [10]:
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed  # result with multi-index

Unnamed: 0_level_0,Unnamed: 1_level_0,len
country,province,Unnamed: 2_level_1
Argentina,Mendoza Province,3264
Argentina,Other,536
...,...,...
Uruguay,San Jose,3
Uruguay,Uruguay,24


In [16]:
mi = countries_reviewed.index
print(mi)
print(type(mi))

MultiIndex([('Argentina',  'Mendoza Province'),
            ('Argentina',             'Other'),
            (  'Armenia',           'Armenia'),
            ('Australia',   'Australia Other'),
            ('Australia',   'New South Wales'),
            ('Australia',   'South Australia'),
            ('Australia',          'Tasmania'),
            ('Australia',          'Victoria'),
            ('Australia', 'Western Australia'),
            (  'Austria',           'Austria'),
            ...
            (       'US',        'Washington'),
            (       'US', 'Washington-Oregon'),
            (  'Ukraine',           'Ukraine'),
            (  'Uruguay',         'Atlantida'),
            (  'Uruguay',         'Canelones'),
            (  'Uruguay',           'Juanico'),
            (  'Uruguay',        'Montevideo'),
            (  'Uruguay',          'Progreso'),
            (  'Uruguay',          'San Jose'),
            (  'Uruguay',           'Uruguay')],
           names=['coun

In [18]:
# reset the above multi-index to a regular index
countries_reviewed.reset_index()


Unnamed: 0,country,province,len
0,Argentina,Mendoza Province,3264
1,Argentina,Other,536
...,...,...,...
423,Uruguay,San Jose,3
424,Uruguay,Uruguay,24
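
The same round trip on a frame small enough to check by eye (the `toy` data is made up for illustration): grouping by two columns produces a `MultiIndex`, and `reset_index()` turns the index levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "country": ["US", "US", "US", "France"],
    "province": ["Oregon", "Oregon", "California", "Alsace"],
    "description": ["a", "b", "c", "d"],
})

# Two grouping keys -> the result is indexed by (country, province) pairs
counts = toy.groupby(["country", "province"]).description.agg(["count"])
print(type(counts.index).__name__)

# reset_index() moves both levels into columns and restores a 0..n-1 index
flat = counts.reset_index()
print(list(flat.columns))
```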


**sort_values() / sort_index()**

In [24]:
countries_reviwed = countries_reviewed.reset_index() # reset the multi-index first
countries_reviwed.sort_values(by='len', ascending=False) # sort the df by the values in the 'len' column (descending here); unsorted, rows stay in index order

Unnamed: 0,country,province,len
1,Argentina,Other,536
0,Argentina,Mendoza Province,3264
...,...,...,...
424,Uruguay,Uruguay,24
419,Uruguay,Canelones,43


In [23]:
# sort back by index, restoring the default order
countries_reviwed.sort_index()

Unnamed: 0,country,province,len
0,Argentina,Mendoza Province,3264
1,Argentina,Other,536
...,...,...,...
423,Uruguay,San Jose,3
424,Uruguay,Uruguay,24


In [25]:
# sorting by multiple columns
countries_reviwed.sort_values(by=['country','len'])

Unnamed: 0,country,province,len
1,Argentina,Other,536
0,Argentina,Mendoza Province,3264
...,...,...,...
424,Uruguay,Uruguay,24
419,Uruguay,Canelones,43
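
A short sketch of the sorting calls on made-up data (the `toy` frame is an assumption for illustration). When sorting by several columns, `ascending` can also be a list with one direction per column, and `sort_index()` restores the original row order afterwards.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "country": ["US", "France", "US", "France"],
    "province": ["Oregon", "Alsace", "California", "Bordeaux"],
    "len": [5, 3, 9, 7],
})

# Country A-Z, and within each country the largest 'len' first
ordered = toy.sort_values(by=["country", "len"], ascending=[True, False])
print(ordered.province.tolist())

# Sorting by index brings the rows back to their original order
restored = ordered.sort_index()
print(restored.index.tolist())
```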


**Exercise 1. Who are the most common wine reviewers in the dataset? Create a Series whose index is the taster_twitter_handle category from the dataset, and whose values count how many reviews each person wrote.**

In [26]:
import pandas as pd

reviews = pd.read_csv("/kaggle/input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
#pd.set_option("display.max_rows", 5)

In [30]:
# using groupby.size() - the number of rows in each group
reviews_written = reviews.groupby('taster_twitter_handle').size()
print(reviews_written)

# OR

# using count() on the grouped column - the number of non-null values in each group
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
print(reviews_written)
reviews_written

taster_twitter_handle
@AnneInVino        3685
@JoeCz             5147
                   ... 
@winewchristina       6
@worldwineguys     1005
Length: 15, dtype: int64
taster_twitter_handle
@AnneInVino        3685
@JoeCz             5147
                   ... 
@winewchristina       6
@worldwineguys     1005
Name: taster_twitter_handle, Length: 15, dtype: int64


taster_twitter_handle
@AnneInVino        3685
@JoeCz             5147
                   ... 
@winewchristina       6
@worldwineguys     1005
Name: taster_twitter_handle, Length: 15, dtype: int64
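
The two approaches above agree here because the counted column is also the grouping key. They can differ when the counted column has gaps, as this sketch on made-up data suggests (the `toy` frame is hypothetical): `size()` counts rows, `count()` counts non-null values, and rows whose grouping key is missing are dropped from both by default.

```python
import pandas as pd

# Hypothetical toy data: one missing price, one missing handle
toy = pd.DataFrame({
    "taster_twitter_handle": ["@a", "@a", "@b", None],
    "price": [10.0, None, 20.0, 30.0],
})

# size(): every row in the group, regardless of missing values in other columns
rows_per_handle = toy.groupby("taster_twitter_handle").size()

# count(): only the non-null prices in each group
priced_per_handle = toy.groupby("taster_twitter_handle").price.count()

print(rows_per_handle.tolist())
print(priced_per_handle.tolist())
```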

**Exercise 2. What is the best wine I can buy for a given amount of money? Create a Series whose index is wine prices and whose values are the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0 dollars is at the top and 3300.0 dollars is at the bottom).**

In [39]:
# group the reviews by 'price' --> get the max points for each group
best_rating_per_price = reviews.groupby('price')['points'].max() # no sort_index() needed: groupby already sorts by the group key (price), ascending
print(best_rating_per_price)

# for comparison: the same Series with the highest price first (kept in a separate variable so the exercise answer stays ascending)
best_rating_desc = reviews.groupby('price').points.max().sort_index(ascending=False)
best_rating_desc

price
4.0       86
5.0       87
          ..
2500.0    96
3300.0    88
Name: points, Length: 390, dtype: int64


price
3300.0    88
2500.0    96
          ..
5.0       87
4.0       86
Name: points, Length: 390, dtype: int64

**Exercise 3. What are the minimum and maximum prices for each variety of wine? Create a DataFrame whose index is the variety category from the dataset and whose values are the min and max values thereof.**

In [40]:
reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit


In [51]:
price_extremes = reviews.groupby('variety')['price'].agg([min, max])
# or
price_extremes = reviews.groupby('variety').price.agg([min,max])

price_extremes

Unnamed: 0_level_0,min,max
variety,Unnamed: 1_level_1,Unnamed: 2_level_1
Abouriou,15.0,75.0
Agiorgitiko,10.0,66.0
...,...,...
Çalkarası,19.0,19.0
Žilavka,15.0,15.0


**Exercise 4. What are the most expensive wine varieties? Create a variable sorted_varieties containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).**

In [59]:
sorted_varieties = price_extremes.sort_values(by=['min', 'max'],ascending=False)
sorted_varieties


Unnamed: 0_level_0,min,max
variety,Unnamed: 1_level_1,Unnamed: 2_level_1
Ramisco,495.0,495.0
Terrantez,236.0,236.0
...,...,...
Vital,,
Zelen,,


**Exercise 5. Create a Series whose index is reviewers and whose values are the average review score given out by that reviewer. Hint: you will need the taster_name and points columns.**

In [62]:
reviewer_mean_ratings = reviews.groupby('taster_name').points.mean()
reviewer_mean_ratings

taster_name
Alexander Peartree    85.855422
Anna Lee C. Iijima    88.415629
                        ...    
Susan Kostrzewa       86.609217
Virginie Boone        89.213379
Name: points, Length: 19, dtype: float64

In [63]:
# Are there significant differences in the average scores assigned by the various reviewers?
reviewer_mean_ratings.describe()

count    19.000000
mean     88.233026
           ...    
75%      88.975256
max      90.562551
Name: points, Length: 8, dtype: float64

**Exercise 6. What combination of countries and varieties are most common? Create a Series whose index is a MultiIndex of {country, variety} pairs. For example, a pinot noir produced in the US should map to {"US", "Pinot Noir"}. Sort the values in the Series in descending order based on wine count.**

In [76]:
country_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)
country_variety_counts

country  variety           
US       Pinot Noir            9885
         Cabernet Sauvignon    7315
                               ... 
Mexico   Rosado                   1
Uruguay  White Blend              1
Length: 1612, dtype: int64