<a href="https://colab.research.google.com/github/chrismarkella/Kaggle-access-from-Google-Colab/blob/master/kaggle_pandas_indexing_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import os

import numpy as np
import pandas as pd

from getpass import getpass 

In [2]:
import json

def access_kaggle():
    """
    Set up Kaggle API access from Google Colab.

    If /root/.kaggle does not exist, prompt for the Kaggle username
    and API key, then write them to /root/.kaggle/kaggle.json.
    """
    KAGGLE_ROOT = os.path.join('/root', '.kaggle')
    KAGGLE_PATH = os.path.join(KAGGLE_ROOT, 'kaggle.json')

    if not os.path.exists(KAGGLE_ROOT):
        user = getpass(prompt='Kaggle username: ')
        key  = getpass(prompt='Kaggle API key: ')

        os.makedirs(KAGGLE_ROOT)
        # json.dump escapes any quotes or backslashes in the credentials.
        with open(KAGGLE_PATH, mode='w') as f:
            json.dump({'username': user, 'key': key}, f)
        # Restrict the credentials file to its owner.
        os.chmod(KAGGLE_PATH, 0o600)
        del user
        del key
        print('Kaggle is successfully set up. Good to go.')

access_kaggle()


Kaggle username: ··········
Kaggle API key: ··········
Kaggle is successfully set up. Good to go.


In [3]:
!kaggle datasets download zynicide/wine-reviews -f winemag-data-130k-v2.csv

!unzip winemag-data-130k-v2.csv.zip
!rm winemag-data-130k-v2.csv.zip

df = pd.read_csv('winemag-data-130k-v2.csv', sep=',',
                 index_col=0)
df.head()

Downloading winemag-data-130k-v2.csv.zip to /content
 55% 9.00M/16.4M [00:00<00:00, 42.8MB/s]
100% 16.4M/16.4M [00:00<00:00, 54.6MB/s]
Archive:  winemag-data-130k-v2.csv.zip
  inflating: winemag-data-130k-v2.csv  


| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |


# Introduction

Now you are ready to get a deeper understanding of your data.

The data is already loaded into `df` above. The following cell only limits how many rows pandas displays. The answer-checking calls (`q1.check()` and so on) are left commented out throughout this notebook.

In [0]:
pd.set_option("display.max_rows", 20)


# Exercises

## 1.

What is the median of the `points` column in the `reviews` DataFrame (loaded here as `df`)?

In [5]:
df.points.describe()

count    129971.000000
mean         88.447138
std           3.039730
min          80.000000
25%          86.000000
50%          88.000000
75%          91.000000
max         100.000000
Name: points, dtype: float64

In [6]:
median_points = df.points.median()
median_points
# Check your answer
# q1.check()

88.0
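
The `50%` row printed by `describe()` above and the value returned by `median()` are the same statistic. A minimal sketch on a hypothetical five-value Series (the numbers are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy scores standing in for df.points.
points = pd.Series([80, 86, 88, 91, 100])

median_points = points.median()
summary = points.describe()

# The "50%" row of describe() is the 0.5 quantile, i.e. the median.
print(median_points)
print(summary["50%"])
print(points.quantile(0.5))
```

When only the median is needed, calling `median()` directly avoids computing the other summary statistics.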

In [0]:
#q1.hint()
#q1.solution()

## 2. 
What countries are represented in the dataset? (Your answer should not include any duplicates.)

In [7]:
countries = df.country.unique()
countries
# Check your answer
# q2.check()

array(['Italy', 'Portugal', 'US', 'Spain', 'France', 'Germany',
       'Argentina', 'Chile', 'Australia', 'Austria', 'South Africa',
       'New Zealand', 'Israel', 'Hungary', 'Greece', 'Romania', 'Mexico',
       'Canada', nan, 'Turkey', 'Czech Republic', 'Slovenia',
       'Luxembourg', 'Croatia', 'Georgia', 'Uruguay', 'England',
       'Lebanon', 'Serbia', 'Brazil', 'Moldova', 'Morocco', 'Peru',
       'India', 'Bulgaria', 'Cyprus', 'Armenia', 'Switzerland',
       'Bosnia and Herzegovina', 'Ukraine', 'Slovakia', 'Macedonia',
       'China', 'Egypt'], dtype=object)
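
Note that `nan` appears in the array above: `unique()` keeps missing values as one of the distinct entries. A small sketch, on a hypothetical toy Series, of how the missing value could be dropped first if only real country names are wanted:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df.country.
country = pd.Series(["Italy", "US", np.nan, "US", "Italy", "Chile"])

with_nan = country.unique()              # includes nan as a distinct value
without_nan = country.dropna().unique()  # missing values removed first

print(with_nan)
print(without_nan)
print(country.nunique())  # nunique() ignores NaN by default
```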

In [0]:
#q2.hint()
#q2.solution()

## 3.
How often does each country appear in the dataset? Create a Series `reviews_per_country` mapping countries to the count of reviews of wines from that country.

In [8]:

reviews_per_country = df.country.value_counts()
reviews_per_country

# Check your answer
# q3.check()

US                        54504
France                    22093
Italy                     19540
Spain                      6645
Portugal                   5691
                          ...  
Armenia                       2
Bosnia and Herzegovina        2
Egypt                         1
China                         1
Slovakia                      1
Name: country, Length: 43, dtype: int64
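
The result has 43 entries, one fewer than the 44 values returned by `unique()` in the previous exercise, because `value_counts()` leaves out missing values by default. A sketch on hypothetical toy data showing the `dropna=False` option, which counts them as well:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df.country.
country = pd.Series(["US", "France", np.nan, "US", "Italy", np.nan, "US"])

counts = country.value_counts()                      # NaN rows are skipped
counts_with_nan = country.value_counts(dropna=False)  # NaN gets its own entry

print(counts)
print(counts_with_nan)
```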

In [0]:
#q3.hint()
#q3.solution()

## 4.
Create variable `centered_price` containing a version of the `price` column with the mean price subtracted.

(Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.) 

In [9]:
mean_price = df.price.mean()
centered_price = df.price - mean_price 
centered_price
# Check your answer
# q4.check()

0               NaN
1        -20.363389
2        -21.363389
3        -22.363389
4         29.636611
            ...    
129966    -7.363389
129967    39.636611
129968    -5.363389
129969    -3.363389
129970   -14.363389
Name: price, Length: 129971, dtype: float64
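
Row 0 stays `NaN` because its price is missing: `mean()` skips missing values, but the subtraction propagates them. The sketch below reuses only the five prices visible in the `df.head()` output, so its mean differs from the full-dataset mean used above:

```python
import numpy as np
import pandas as pd

# The five prices shown in df.head(); the first one is missing.
price = pd.Series([np.nan, 15.0, 14.0, 13.0, 65.0])

mean_price = price.mean()          # NaN is skipped: (15 + 14 + 13 + 65) / 4
centered_price = price - mean_price

print(mean_price)
print(centered_price)
print(centered_price.mean())       # centered values average to zero
```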

In [0]:
#q4.hint()
#q4.solution()

## 5.
I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable `bargain_wine` with the title of the wine with the highest points-to-price ratio in the dataset.

In [10]:
df['point_price_rate'] = df.points / df.price
max_index = df.point_price_rate.idxmax()

bargain_wine = df.loc[max_index, 'title']
bargain_wine

# Check your answer
# q5.check()

'Bandit NV Merlot (California)'
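
`idxmax()` returns the label of the first row that reaches the maximum, so a tie for the best ratio would go unnoticed. A sketch with hypothetical wines (names and numbers are made up) that lists every row sharing the top ratio, without adding a helper column to the DataFrame:

```python
import pandas as pd

# Hypothetical toy data; "Wine A" and "Wine B" tie for the best ratio.
wines = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "points": [86, 86, 90],
    "price": [4.0, 4.0, 20.0],
})

ratio = wines.points / wines.price

# idxmax() picks only the first row with the maximum ratio.
first_best = wines.loc[ratio.idxmax(), "title"]

# A boolean mask keeps every row that matches the maximum.
all_best = wines.loc[ratio == ratio.max(), "title"]

print(first_best)
print(all_best.tolist())
```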

In [0]:
#q5.hint()
#q5.solution()

## 6.
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series `descriptor_counts` counting how many times each of these two words appears in the `description` column in the dataset.

In [13]:
def word_count(word):
    """Count the descriptions that contain `word` as a substring."""
    return df.description.map(lambda d: word in d).sum()


descriptor_counts = pd.Series(
    data=[word_count('tropical'), word_count('fruity')],
    index=['tropical', 'fruity'],
)
descriptor_counts

# Check your answer
# q6.check()

tropical    3607
fruity      9090
dtype: int64
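
The counts above are the number of descriptions that contain each word at least once, matched case-sensitively. A sketch on three hypothetical descriptions showing how that differs from counting every occurrence with `str.count`, and how lowercasing first changes the result:

```python
import pandas as pd

# Hypothetical descriptions, not taken from the dataset.
description = pd.Series([
    "Tropical fruit aromas, then more tropical fruit on the palate.",
    "Ripe and fruity, with a fruity finish.",
    "Dry and crisp.",
])

def rows_containing(word):
    """Number of descriptions that contain `word` at least once."""
    return description.map(lambda d: word in d).sum()

def total_occurrences(word):
    """Total number of case-sensitive matches across all descriptions."""
    return description.str.count(word).sum()

print(rows_containing("fruity"), total_occurrences("fruity"))
print(rows_containing("tropical"), total_occurrences("tropical"))

# Lowercasing first also catches the capitalized "Tropical".
print(description.str.lower().str.count("tropical").sum())
```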

In [0]:
#q6.hint()
#q6.solution()

## 7.
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series `star_ratings` with the number of stars corresponding to each review in the dataset.

In [15]:
def get_stars(row) -> str:
    """
    Return the star rating, as a string of asterisks, from the points and the country.
    """
    if row.country == 'Canada' or row.points >= 95:
        return '***'
    elif row.points >= 85:
        return '**'
    else:
        return '*'

df['star_ratings'] = df.apply(func=get_stars, axis='columns')

df.sample(10).loc[:, ['country', 'points', 'star_ratings']]

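| | country | points | star_ratings |
|---|---|---|---|
| 47746 | Chile | 87 | ** |
| 103510 | Chile | 86 | ** |
| 10210 | US | 86 | ** |
| 21092 | Austria | 95 | *** |
| 90218 | France | 84 | * |
| 126346 | France | 87 | ** |
| 21468 | US | 92 | ** |
| 41877 | Italy | 85 | ** |
| 32894 | France | 91 | ** |
| 112353 | US | 92 | ** |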
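
The exercise asks for the number of stars, while the cell above stores them as strings of asterisks. One possible alternative, sketched here on a hypothetical five-row frame, builds an integer Series with `np.select`, which evaluates the conditions in order and avoids a row-by-row `apply`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy reviews standing in for df.
reviews = pd.DataFrame({
    "country": ["Chile", "Austria", "France", "Canada", "US"],
    "points": [87, 95, 84, 82, 92],
})

# Conditions are checked in order; the first match wins.
conditions = [
    (reviews.country == "Canada") | (reviews.points >= 95),  # 3 stars
    reviews.points >= 85,                                    # 2 stars
]

star_ratings = pd.Series(
    np.select(conditions, [3, 2], default=1),  # everything else: 1 star
    index=reviews.index,
    name="star_ratings",
)

print(star_ratings.tolist())
```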


In [0]:
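# The ratings were stored as a DataFrame column above;
# expose them under the name the exercise asks for.
star_ratings = df['star_ratings']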

# Check your answer
# q7.check()

In [0]:
#q7.hint()
#q7.solution()

# Keep going
Continue to **[grouping and sorting](https://www.kaggle.com/residentmario/grouping-and-sorting)**.

---
**[Pandas Home Page](https://www.kaggle.com/learn/pandas)**





*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*