## Data Set Cheat Sheet

This notebook is the basis for a dataset cheat sheet in Python.

First we import the scikit-learn library (`sklearn`): "Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities." https://scikit-learn.org/stable/about.html#citing-scikit-learn

In [1]:
import sklearn
sklearn.__version__

'0.23.1'

Then we import *pandas*: "pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language." https://pandas.pydata.org/

We also import *base64*, which is part of the Python standard library and needs no installation: "The base64 encoding scheme is used to convert arbitrary binary data to plain text." https://www.oreilly.com/library/view/python-standard-library/0596000960/ch04s11.html
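As a quick illustration of what the module does, the sketch below round-trips a short byte string through base64. The sample bytes are made up for the example and do not come from the dataset.

```python
import base64

raw = b"Pinot Noir"                    # arbitrary sample bytes (assumption, not dataset content)
encoded = base64.b64encode(raw)        # bytes made up of ASCII characters only
decoded = base64.b64decode(encoded)    # back to the original bytes

print(encoded.decode("ascii"))
print(decoded == raw)
```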

We import *enable_iterative_imputer* and *IterativeImputer*: "A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion." The estimator is still experimental, so *enable_iterative_imputer* must be imported before *IterativeImputer*. https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html

In [2]:
import pandas as pd
import base64
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
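The notebook imports `IterativeImputer` but does not show it in use. A minimal sketch of how it might be applied to numeric columns such as `points` and `price` follows. The toy values, `max_iter=10`, and `random_state=0` are assumptions chosen for illustration, not settings taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer

# Toy frame standing in for the numeric columns; the values are made up
toy = pd.DataFrame({
    "points": [87, 87, 88, 90, 92, 95],
    "price": [15.0, np.nan, 20.0, 35.0, np.nan, 80.0],
})

# Each column with missing values is modelled from the other columns
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Only numeric columns can be passed to the imputer, so text columns such as `country` would have to be dropped or encoded first.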

Next, we set the pandas option *display.max_columns* to *None* so that all the columns of our dataframe are shown.

We then read the dataset with pandas and print its first rows and column headers.

In [3]:
pd.set_option('display.max_columns', None)  # full option name; the short 'max_columns' is ambiguous in newer pandas
data = pd.read_csv('winemag-data-130k-v2.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [4]:
data.index

RangeIndex(start=0, stop=129971, step=1)

In [5]:
data.dtypes

Unnamed: 0                 int64
country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

In [6]:
data.columns

Index(['Unnamed: 0', 'country', 'description', 'designation', 'points',
       'price', 'province', 'region_1', 'region_2', 'taster_name',
       'taster_twitter_handle', 'title', 'variety', 'winery'],
      dtype='object')

In [7]:
data.nunique(dropna=False)

Unnamed: 0               129971
country                      44
description              119955
designation               37980
points                       21
price                       391
province                    426
region_1                   1230
region_2                     18
taster_name                  20
taster_twitter_handle        16
title                    118840
variety                     708
winery                    16757
dtype: int64
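The `dropna=False` argument makes `nunique` count NaN as one more distinct value. A small sketch on a made-up column shows the difference and, alongside it, how many cells are actually missing.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries
toy = pd.DataFrame({"region_2": ["Sonoma", np.nan, "Sonoma", "Napa", np.nan]})

with_nan = toy.nunique(dropna=False)   # NaN counted as a distinct value
without_nan = toy.nunique()            # default dropna=True ignores NaN
missing = toy.isna().sum()             # number of missing cells per column

print(with_nan["region_2"], without_nan["region_2"], missing["region_2"])
```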

In [8]:
data['country'].unique()

array(['Italy', 'Portugal', 'US', 'Spain', 'France', 'Germany',
       'Argentina', 'Chile', 'Australia', 'Austria', 'South Africa',
       'New Zealand', 'Israel', 'Hungary', 'Greece', 'Romania', 'Mexico',
       'Canada', nan, 'Turkey', 'Czech Republic', 'Slovenia',
       'Luxembourg', 'Croatia', 'Georgia', 'Uruguay', 'England',
       'Lebanon', 'Serbia', 'Brazil', 'Moldova', 'Morocco', 'Peru',
       'India', 'Bulgaria', 'Cyprus', 'Armenia', 'Switzerland',
       'Bosnia and Herzegovina', 'Ukraine', 'Slovakia', 'Macedonia',
       'China', 'Egypt'], dtype=object)

In [9]:
data['taster_name'].unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

In [10]:
data['variety'].unique()

array(['White Blend', 'Portuguese Red', 'Pinot Gris', 'Riesling',
       'Pinot Noir', 'Tempranillo-Merlot', 'Frappato', 'Gewürztraminer',
       'Cabernet Sauvignon', 'Nerello Mascalese', 'Chardonnay', 'Malbec',
       'Tempranillo Blend', 'Meritage', 'Red Blend', 'Merlot',
       "Nero d'Avola", 'Chenin Blanc', 'Gamay', 'Sauvignon Blanc',
       'Viognier-Chardonnay', 'Primitivo', 'Catarratto', 'Inzolia',
       'Petit Verdot', 'Monica', 'Bordeaux-style White Blend', 'Grillo',
       'Sangiovese', 'Cabernet Franc', 'Champagne Blend',
       'Bordeaux-style Red Blend', 'Aglianico', 'Petite Sirah',
       'Touriga Nacional', 'Carmenère', 'Albariño', 'Petit Manseng',
       'Rosé', 'Zinfandel', 'Vernaccia', 'Rosato', 'Grüner Veltliner',
       'Viognier', 'Vermentino', 'Grenache Blanc', 'Syrah', 'Nebbiolo',
       'Shiraz-Cabernet Sauvignon', 'Pinot Blanc', 'Alsace white blend',
       'Barbera', 'Rhône-style Red Blend', 'Portuguese White', 'Graciano',
       'Tannat-Cabernet', 'Sauvign

In [11]:
data.groupby('country').agg(Minimum=('price','min'), Median=('price','median'), Maximum=('price','max')).sort_values(by='Maximum',ascending=False)

country,Minimum,Median,Maximum
France,5.0,25.0,3300.0
US,4.0,30.0,2013.0
Austria,7.0,25.0,1100.0
Portugal,5.0,16.0,1000.0
Italy,5.0,28.0,900.0
Australia,5.0,21.0,850.0
Germany,5.0,26.0,775.0
Spain,4.0,18.0,770.0
Hungary,10.0,25.0,764.0
Chile,5.0,15.0,400.0
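The cell above uses pandas named aggregation, where each keyword becomes an output column built from a `(column, function)` pair. The same pattern is sketched here on a made-up frame small enough to check by hand.

```python
import pandas as pd

# Made-up prices for two countries
toy = pd.DataFrame({
    "country": ["US", "US", "France", "France", "France"],
    "price": [10.0, 30.0, 5.0, 25.0, 45.0],
})

summary = (
    toy.groupby("country")
       .agg(Minimum=("price", "min"),
            Median=("price", "median"),
            Maximum=("price", "max"))
       .sort_values(by="Maximum", ascending=False)   # most expensive maximum first
)

print(summary)
```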


In [18]:
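# Take the first 20xx year found in the title; earlier vintages (19xx) are not matched
data['Year'] = data['title'].str.extract(r'(20[0-9]{2})', expand=False)
# Titles without a matching year get the placeholder 0, then the column is stored as integers
data['Year'] = data['Year'].fillna(0).astype(int)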
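Using 0 for "no year found" is easy to mistake for a real value in later statistics. One possible alternative, sketched below, keeps missing years as `<NA>` with the nullable `Int64` dtype. The second title is hypothetical and only serves to show the missing case.

```python
import pandas as pd

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",   # title from the rows shown above
    "Example Winery NV Brut",             # hypothetical title without a year
])

year = titles.str.extract(r"(20[0-9]{2})", expand=False)   # strings, NaN when nothing matches
year = pd.to_numeric(year).astype("Int64")                  # nullable integers, <NA> when missing

print(year)
```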

In [13]:
data.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,Year
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,2013
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012


In [19]:
data['Year'].unique()


array([2013, 2011, 2012, 2010, 2007, 2009, 2008, 2014, 2015,    0, 2016,
       2004, 2003, 2006, 2001, 2005, 2002, 2000, 2017])

## Geographical Analysis

In [1]:
from geopy import Nominatim

In [2]:
geolocator = Nominatim(user_agent="geoapiExercises")

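In [1]:
# Needs `geolocator` from the cell above, created in the same kernel session
ladd1 = "Willamette Valley"
print("Location address:", ladd1)
location = geolocator.geocode(ladd1)  # returns None when no match is found
if location is not None:
    print("Latitude and Longitude of the said address:")
    print((location.latitude, location.longitude))

Location address: Willamette Valley


NameError: name 'geolocator' is not defined

The recorded `NameError` most likely comes from running this cell in a fresh kernel (the counter is back at In [1]) before `geolocator` was created. Running the two cells above first avoids it.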
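Geocoding every row would repeat the same lookup many times, since `region_1` has far fewer distinct values than the dataset has rows. The helper below is a hypothetical sketch (the name `geocode_unique` is not from geopy or this notebook) that looks up each distinct name once. It accepts any callable that returns an object with `latitude` and `longitude` attributes, or `None` when there is no match.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Optional[Tuple[float, float]]


def geocode_unique(names: Iterable, geocode: Callable) -> Dict[str, Coordinates]:
    """Look up each distinct name once; names with no match map to None."""
    results: Dict[str, Coordinates] = {}
    for name in names:
        if not isinstance(name, str) or name in results:
            continue  # skip NaN/None and names that were already looked up
        location = geocode(name)
        if location is None:
            results[name] = None
        else:
            results[name] = (location.latitude, location.longitude)
    return results
```

With the objects from this notebook it could be called as `geocode_unique(data['region_1'].unique(), geolocator.geocode)`. The public Nominatim service limits request rates, so wrapping `geolocator.geocode` in geopy's `RateLimiter` (from `geopy.extra.rate_limiter`) may be worth considering.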

In [2]:
import numpy as np
data['region'] = np.where(data['region_1'].notna(), data['region_1'], data['province'])  # fall back to province when region_1 is missing
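Missing values cannot be detected with `!= np.nan`, because NaN compares unequal to everything, itself included. The sketch below uses a made-up two-row frame to contrast that comparison with `notna()`, and shows `fillna` as one compact way to fall back from `region_1` to `province`.

```python
import numpy as np
import pandas as pd

# Made-up rows: the second one has no region_1
toy = pd.DataFrame({
    "region_1": ["Etna", np.nan],
    "province": ["Sicily & Sardinia", "Douro"],
})

always_true = toy["region_1"] != np.nan     # True for every row, even the missing one
really_present = toy["region_1"].notna()    # True only where a value exists

# Use region_1 where present, otherwise the province
toy["region"] = toy["region_1"].fillna(toy["province"])

print(always_true.tolist(), really_present.tolist(), toy["region"].tolist())
```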