## Which country do we want to visualize?

In [43]:
# Import packages that are used
import pandas as pd

In [53]:
wines = pd.read_csv('Data/winemag-data_first150k.csv', sep = ',')
wines.head(3)

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley


In [54]:
# What countries are in the dataframe?
wines.country.value_counts()[:10]

US             62397
Italy          23478
France         21098
Spain           8268
Chile           5816
Argentina       5631
Portugal        5322
Australia       4957
New Zealand     3320
Austria         3057
Name: country, dtype: int64

The countries of interest are the top three. Which one should we work with? Let's first look at how many regions are included for each of them.

In [55]:
# Subset the three candidate countries; na=False treats rows with a missing country as non-matches
us = wines[wines.country.str.contains('US', na=False)]
france = wines[wines.country.str.contains('France', na=False)]
italy = wines[wines.country.str.contains('Italy', na=False)]
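
Note that `str.contains` does substring matching, so a row whose country label contains more than one name ends up in several subsets. This is presumably why `Santa Barbara County-Condrieu` is listed under both the US and France provinces in the outputs. A minimal sketch of the difference, using a made-up toy frame (the `toy` data and the combined `US-France` label are assumptions for illustration, not rows read from the CSV):

```python
import pandas as pd

# Toy data for illustration only; the combined 'US-France' label is an assumption
toy = pd.DataFrame({
    'country': ['US', 'France', 'US-France', None],
    'province': ['California', 'Bordeaux', 'Santa Barbara County-Condrieu', 'Unknown'],
})

# Substring match: the combined label is picked up by both filters
us_contains = toy[toy.country.str.contains('US', na=False)]
france_contains = toy[toy.country.str.contains('France', na=False)]

# Exact match: only rows whose country is exactly the given string
us_exact = toy[toy.country == 'US']
france_exact = toy[toy.country == 'France']

print(len(us_contains), len(us_exact))
print(len(france_contains), len(france_exact))
```

Whether exact matching is preferable depends on whether such combined rows should count towards a country; with exact matching the province counts for the affected countries would each drop by one.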

### US

In [56]:
print('US Provinces:\n\n', us.province.value_counts(), '\n')
print('US Region 1:\n\n', us.region_1.value_counts(), '\n')
print('US Region 2:\n\n', us.region_2.value_counts(), '\n')
print('Missing data for US:')
us.isnull().sum()

US Provinces:

 California                       44508
Washington                        9750
Oregon                            4589
New York                          2428
Virginia                           515
Idaho                              136
New Mexico                          95
Missouri                            60
Pennsylvania                        50
Texas                               41
Arizona                             39
Ohio                                34
Colorado                            30
America                             27
Michigan                            25
New Jersey                          24
North Carolina                      20
Massachusetts                       10
Iowa                                 4
Kentucky                             4
Washington-Oregon                    3
Vermont                              2
Connecticut                          2
Santa Barbara County-Condrieu        1
Nevada                               1
Name: pro

Unnamed: 0         0
country            0
description        0
designation    22053
points             0
price            258
province           0
region_1         137
region_2        1445
variety            0
winery             0
dtype: int64

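The US has 50 states. The dataset contains the following number of distinct `province` values (a few of them, such as `America` and `Washington-Oregon`, are not individual states):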

In [57]:
us.province.unique().size

25

### France

In [58]:
print('France Provinces:\n\n', france.province.value_counts(), '\n')
print('France Region 1:\n\n', france.region_1.value_counts(), '\n')
print('France Region 2:\n\n', france.region_2.value_counts(), '\n')
print('Missing data for France:')
france.isnull().sum()

France Provinces:

 Bordeaux                         6111
Burgundy                         4308
Loire Valley                     1786
Alsace                           1680
Southwest France                 1601
Champagne                        1370
Rhône Valley                     1318
Languedoc-Roussillon             1082
Provence                         1021
Beaujolais                        532
France Other                      289
Santa Barbara County-Condrieu       1
Name: province, dtype: int64 

France Region 1:

 Alsace                                    1574
Champagne                                 1369
Saint-Émilion                              836
Côtes de Provence                          696
Chablis                                    694
                                          ... 
Corton Perrières                             1
Mâcon-Uchizy                                 1
Gard                                         1
Côtes du Roussillon Villages Tautavel        1
Vin 

Unnamed: 0         0
country            0
description        0
designation     6592
points             0
price           6313
province           0
region_1          16
region_2       21099
variety            0
winery             0
dtype: int64

France has 18 regions and the dataset contains:

In [59]:
france.province.unique().size

12

### Italy

In [60]:
print('Italy Provinces:\n\n', italy.province.value_counts(), '\n')
print('Italy Region 1:\n\n', italy.region_1.value_counts(), '\n')
print('Italy Region 2:\n\n', italy.region_2.value_counts(), '\n')
print('Missing data for Italy:')
italy.isnull().sum()

Italy Provinces:

 Tuscany               7281
Piedmont              4093
Veneto                3962
Sicily & Sardinia     2545
Northeastern Italy    1982
Central Italy         1530
Southern Italy        1439
Lombardy               580
Italy Other             56
Northwestern Italy      10
Name: province, dtype: int64 

Italy Region 1:

 Toscana                   1885
Brunello di Montalcino    1746
Sicilia                   1701
Barolo                    1398
Chianti Classico          1029
                          ... 
Ramandolo                    1
Galluccio                    1
Loazzolo                     1
Valle d'Aosta                1
Trentino Superiore           1
Name: region_1, Length: 371, dtype: int64 

Italy Region 2:

 Series([], Name: region_2, dtype: int64) 

Missing data for Italy:


Unnamed: 0         0
country            0
description        0
designation     6588
points             0
price           4694
province           0
region_1           0
region_2       23478
variety            0
winery             0
dtype: int64

Italy has 20 regions and the dataset contains:

In [61]:
italy.province.unique().size

10

### Summary

A larger share of France's regions is represented in the dataset than for the US or Italy: 12 of 18 (about 67%), compared with 25 of 50 and 10 of 20 (50% each). Since we want to show the regions and their wines on a map, a country whose regions are mostly covered by the data will give a more complete-looking visualization.
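
As a rough check of that comparison, the coverage can be computed from the `province` counts printed above and the region totals quoted in the text. This is only a sketch: the dataset's `province` labels include catch-all entries such as `France Other` and `America`, so the ratios are approximate.

```python
# Distinct province counts from the notebook outputs, and the totals quoted in the text
represented = {'US': 25, 'France': 12, 'Italy': 10}
total = {'US': 50, 'France': 18, 'Italy': 20}

# Share of each country's regions that appear in the dataset
coverage = {country: represented[country] / total[country] for country in represented}

# Print from highest to lowest coverage
for country, share in sorted(coverage.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{country}: {share:.0%}')
```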

Another advantage of France is that its regions are better defined than those of the US. In the data, France only uses `region_1` (`region_2` is empty), and the regions are usually well known because the wines carry the region in their names; Champagne, for example, is only made in the Champagne region. The US regions are less specific, and there are so many of them that it would be hard to keep track of and summarize everything in a visualization.

There is also very little missing data for France in the columns we are interested in: `province` has no missing values and `region_1` only 16.
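
To put the missing values in proportion, the counts from the France output can be turned into percentages. A small sketch, assuming that `region_2` is missing in every France row so that its count (21099) equals the number of rows in the `france` subset:

```python
import pandas as pd

# Missing-value counts copied from the France output above
france_missing = pd.Series({'province': 0, 'region_1': 16, 'price': 6313, 'region_2': 21099})

# Assumption: region_2 is missing everywhere, so its count is the row count of the subset
n_rows = 21099

# Percentage of rows with a missing value, per column
missing_share = (france_missing / n_rows * 100).round(2)
print(missing_share)
```

In the notebook itself, `france[['province', 'region_1']].isnull().mean()` would give the same shares as fractions without hard-coding the counts.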

**Conclusion:** We will study France.