## Which country do we want to visualize?

Link to dataset: https://www.kaggle.com/zynicide/wine-reviews

In [1]:
# Import packages that are used
import pandas as pd

In [2]:
wines = pd.read_csv('Data/winemag-data-130k-v2.csv', sep = ',')
wines.head(3)

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm


In [3]:
# What countries are in the dataframe?
wines.country.value_counts()[:10]

US           54504
France       22093
Italy        19540
Spain         6645
Portugal      5691
Chile         4472
Argentina     3800
Austria       3345
Australia     2329
Germany       2165
Name: country, dtype: int64

The countries of interest are the top three. Which one do we want to work with? Let's first look at how many regions each of them has in the dataset.

In [4]:
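# Use exact matches: str.contains() would also match any country name that merely contains the text.
# Rows with a missing country compare as False and are dropped.
us = wines[wines.country == 'US']
france = wines[wines.country == 'France']
italy = wines[wines.country == 'Italy']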
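
The per-country inspection that follows can also be condensed into one table. The snippet below is a minimal sketch of that idea, not part of the original analysis: `toy` is a made-up stand-in for `wines` with only three of its columns, and `top3` and `summary` are illustrative names. It counts the distinct non-missing `province` and `region_1` values per country.

```python
import pandas as pd

# Toy stand-in for the real `wines` dataframe (rows are made up for illustration).
toy = pd.DataFrame({
    "country":  ["US", "US", "US", "France", "France", "Italy"],
    "province": ["California", "Oregon", "California", "Bordeaux", "Alsace", "Piedmont"],
    "region_1": ["Napa Valley", "Willamette Valley", None, "Saint-Émilion", "Alsace", "Barolo"],
})

top3 = ["US", "France", "Italy"]

# Keep only the countries of interest, then count distinct values per country.
# nunique() ignores missing values by default.
summary = (
    toy[toy["country"].isin(top3)]
    .groupby("country")[["province", "region_1"]]
    .nunique()
)
print(summary)
```

On the real data, replacing `toy` with `wines` should give the same distinct counts that the cells below report one country at a time.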

### US

In [5]:
print('US Provinces:\n\n', us.province.value_counts(), '\n')
print('US Region 1:\n\n', us.region_1.value_counts(), '\n')
print('US Region 2:\n\n', us.region_2.value_counts(), '\n')
print('Missing data for US:')
us.isnull().sum()

US Provinces:

 California           36247
Washington            8639
Oregon                5373
New York              2688
Virginia               777
Idaho                  192
Michigan               114
America                 95
Texas                   94
Colorado                68
New Mexico              45
Arizona                 41
Missouri                33
North Carolina          23
Pennsylvania            18
Ohio                    12
New Jersey               8
Massachusetts            7
Washington-Oregon        7
Illinois                 6
Iowa                     4
Nevada                   4
Connecticut              3
Vermont                  3
Kentucky                 1
Hawaii                   1
Rhode Island             1
Name: province, dtype: int64 

US Region 1:

 Napa Valley                               4480
Columbia Valley (WA)                      4124
Russian River Valley                      3091
California                                2629
Paso Robles          

Unnamed: 0                   0
country                      0
description                  0
designation              17596
points                       0
price                      239
province                     0
region_1                   278
region_2                  3993
taster_name              16774
taster_twitter_handle    19763
title                        0
variety                      0
winery                       0
dtype: int64

The US has 50 states. The number of distinct `province` values for the US in the dataset is:

In [6]:
us.province.unique().size

27
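
Not every one of these 27 labels is a single state. As a rough check, the sketch below copies the province labels from the output above and sets aside the two that look like catch-all or combined labels. Treating `America` and `Washington-Oregon` as non-states is an assumption made here, and the variable names are illustrative.

```python
# Province labels copied from the US value_counts output above.
us_provinces = [
    "California", "Washington", "Oregon", "New York", "Virginia", "Idaho",
    "Michigan", "America", "Texas", "Colorado", "New Mexico", "Arizona",
    "Missouri", "North Carolina", "Pennsylvania", "Ohio", "New Jersey",
    "Massachusetts", "Washington-Oregon", "Illinois", "Iowa", "Nevada",
    "Connecticut", "Vermont", "Kentucky", "Hawaii", "Rhode Island",
]

# Assumption: these labels do not correspond to one individual state.
non_states = {"America", "Washington-Oregon"}

state_like = [p for p in us_provinces if p not in non_states]
print(len(us_provinces), len(state_like))
```

Under that assumption, 25 of the 27 labels correspond to individual states.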

### France

In [7]:
print('France Provinces:\n\n', france.province.value_counts(), '\n')
print('France Region 1:\n\n', france.region_1.value_counts(), '\n')
print('France Region 2:\n\n', france.region_2.value_counts(), '\n')
print('Missing data for France:')
france.isnull().sum()

France Provinces:

 Bordeaux                5941
Burgundy                3980
Alsace                  2440
Loire Valley            1856
Champagne               1613
Southwest France        1503
Provence                1346
Rhône Valley            1081
Beaujolais              1044
France Other             676
Languedoc-Roussillon     613
Name: province, dtype: int64 

France Region 1:

 Alsace                2163
Champagne             1613
Côtes de Provence      859
Saint-Émilion          561
Chablis                559
                      ... 
Burgundy                 1
Corton Perrières         1
Côtes de Forez           1
Cabernet de Saumur       1
Coteaux de Verdon        1
Name: region_1, Length: 386, dtype: int64 

France Region 2:

 Series([], Name: region_2, dtype: int64) 

Missing data for France:


Unnamed: 0                   0
country                      0
description                  0
designation               7563
points                       0
price                     4317
province                     0
region_1                    76
region_2                 22093
taster_name                265
taster_twitter_handle      265
title                        0
variety                      0
winery                       0
dtype: int64

France has 18 administrative regions. The number of distinct `province` values for France in the dataset is:

In [8]:
france.province.unique().size

11

### Italy

In [9]:
print('Italy Provinces:\n\n', italy.province.value_counts(), '\n')
print('Italy Region 1:\n\n', italy.region_1.value_counts(), '\n')
print('Italy Region 2:\n\n', italy.region_2.value_counts(), '\n')
print('Missing data for Italy:')
italy.isnull().sum()

Italy Provinces:

 Tuscany               5897
Piedmont              3729
Veneto                2716
Northeastern Italy    2138
Sicily & Sardinia     1797
Southern Italy        1349
Central Italy         1233
Lombardy               533
Italy Other            135
Northwestern Italy      13
Name: province, dtype: int64 

Italy Region 1:

 Barolo                               1599
Brunello di Montalcino               1470
Toscana                              1197
Chianti Classico                     1062
Sicilia                               925
                                     ... 
Coda di Volpe d'Irpinia                 1
Dolcetto d'Alba Superiore               1
Lamezia                                 1
Lambrusco Salamino di Santa Croce       1
Paestum                                 1
Name: region_1, Length: 381, dtype: int64 

Italy Region 2:

 Series([], Name: region_2, dtype: int64) 

Missing data for Italy:


Unnamed: 0                   0
country                      0
description                  0
designation               5651
points                       0
price                     2626
province                     0
region_1                    27
region_2                 19540
taster_name               8498
taster_twitter_handle     8498
title                        0
variety                      0
winery                       0
dtype: int64

Italy has 20 regions. The number of distinct `province` values for Italy in the dataset is:

In [10]:
italy.province.unique().size

10

### Summary

A larger share of France's regions is represented in the dataset (11 of 18, about 61%) than of the US states (27 of 50, 54%) or Italy's regions (10 of 20, 50%). This should lead to a better visualization, since we want to use a map to show the different regions and their wines: a country where most regions are represented gives a more complete-looking map.
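
The comparison can be reproduced from the counts printed earlier in the notebook. This is only a rough sketch: the dataset's `province` labels are wine regions rather than administrative units, so the ratios are an approximation, and the names `coverage` and `shares` are illustrative.

```python
# (distinct provinces in the dataset, number of states/regions in the country),
# taken from the notebook cells above.
coverage = {"US": (27, 50), "France": (11, 18), "Italy": (10, 20)}

shares = {country: found / total for country, (found, total) in coverage.items()}

for country, share in shares.items():
    found, total = coverage[country]
    print(f"{country}: {found}/{total} = {share:.0%}")

best = max(shares, key=shares.get)
print("Highest coverage:", best)
```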

Another advantage of France is that its regions are more consistently defined than those of the US. France only uses `region_1` (`region_2` is empty for every French wine), and the regions are usually well known because the wines carry the region in their names. For example, Champagne is only made in the Champagne region. The US uses less specific regions, and there are so many of them that it would be hard to keep track of and summarize everything in a visualization.

There is also very little missing data for France in the columns we are interested in. For example, `province` has no missing values and `region_1` is missing for only 76 of 22,093 rows.
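
To put numbers on this, the missing counts from the France output above can be turned into shares. The sketch below assumes that `province` and `region_1` are the columns that matter for the map. `price` and `designation` are included only for contrast, and the variable names are illustrative.

```python
# Number of French wines in the dataset (from the country value counts).
france_rows = 22093

# Missing-value counts copied from the "Missing data for France" output.
missing = {"province": 0, "region_1": 76, "price": 4317, "designation": 7563}

missing_share = {col: n / france_rows for col, n in missing.items()}

for col, share in missing_share.items():
    print(f"{col}: {share:.1%} missing")

# On the real dataframe the same shares should be available via france.isnull().mean().
```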

**Conclusion:** We will study France.