## Which country do we want to visualize?

Link to dataset: https://www.kaggle.com/zynicide/wine-reviews?fbclid=IwAR1gC2zwk_Km3CmF-C8OnCy7bd7NNfPlNGrBqnXD-lYbwFCHdoCSh0TX8ZI

In [108]:
# Import packages that are used
import pandas as pd

In [109]:
wines = pd.read_csv('Data/winemag-data_first150k.csv', sep = ',')
wines.head(3)

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley


In [110]:
# What countries are in the dataframe?
wines.country.value_counts()[:10]

US             62397
Italy          23478
France         21098
Spain           8268
Chile           5816
Argentina       5631
Portugal        5322
Australia       4957
New Zealand     3320
Austria         3057
Name: country, dtype: int64

The countries of interest are the top three. Which one do we want to work with? Let's first look at how many regions are included for each of them.

In [111]:
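# Note: str.contains does substring matching, so a row whose country is
# 'US-France' ends up in both `us` and `france`.
us = wines[wines.country.str.contains('US', na=False)]
france = wines[wines.country.str.contains('France', na=False)]
italy = wines[wines.country.str.contains('Italy', na=False)]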

### US

In [112]:
print('US Provinces:\n\n', us.province.value_counts(), '\n')
print('US Region 1:\n\n', us.region_1.value_counts(), '\n')
print('US Region 2:\n\n', us.region_2.value_counts(), '\n')
print('Missing data for US:')
us.isnull().sum()

US Provinces:

 California                       44508
Washington                        9750
Oregon                            4589
New York                          2428
Virginia                           515
Idaho                              136
New Mexico                          95
Missouri                            60
Pennsylvania                        50
Texas                               41
Arizona                             39
Ohio                                34
Colorado                            30
America                             27
Michigan                            25
New Jersey                          24
North Carolina                      20
Massachusetts                       10
Kentucky                             4
Iowa                                 4
Washington-Oregon                    3
Connecticut                          2
Vermont                              2
Nevada                               1
Santa Barbara County-Condrieu        1
Name: pro

Unnamed: 0         0
country            0
description        0
designation    22053
points             0
price            258
province           0
region_1         137
region_2        1445
variety            0
winery             0
dtype: int64

US has 50 states and the dataset contains:

In [113]:
us.province.unique().size

25

### France

In [114]:
print('France Provinces:\n\n', france.province.value_counts(), '\n')
print('France Region 1:\n\n', france.region_1.value_counts(), '\n')
print('France Region 2:\n\n', france.region_2.value_counts(), '\n')
print('Missing data for France:')
france.isnull().sum()

France Provinces:

 Bordeaux                         6111
Burgundy                         4308
Loire Valley                     1786
Alsace                           1680
Southwest France                 1601
Champagne                        1370
Rhône Valley                     1318
Languedoc-Roussillon             1082
Provence                         1021
Beaujolais                        532
France Other                      289
Santa Barbara County-Condrieu       1
Name: province, dtype: int64 

France Region 1:

 Alsace                                    1574
Champagne                                 1369
Saint-Émilion                              836
Côtes de Provence                          696
Chablis                                    694
                                          ... 
Vin de Pays de Sainte-Marie la Blanche       1
Mâcon-Uchizy                                 1
Clairette de Die                             1
Rasteau                                      1
Vin 

Unnamed: 0         0
country            0
description        0
designation     6592
points             0
price           6313
province           0
region_1          16
region_2       21099
variety            0
winery             0
dtype: int64

France has 18 regions and the dataset contains:

In [115]:
france.province.unique().size

12

### Italy

In [116]:
print('Italy Provinces:\n\n', italy.province.value_counts(), '\n')
print('Italy Region 1:\n\n', italy.region_1.value_counts(), '\n')
print('Italy Region 2:\n\n', italy.region_2.value_counts(), '\n')
print('Missing data for Italy:')
italy.isnull().sum()

Italy Provinces:

 Tuscany               7281
Piedmont              4093
Veneto                3962
Sicily & Sardinia     2545
Northeastern Italy    1982
Central Italy         1530
Southern Italy        1439
Lombardy               580
Italy Other             56
Northwestern Italy      10
Name: province, dtype: int64 

Italy Region 1:

 Toscana                                  1885
Brunello di Montalcino                   1746
Sicilia                                  1701
Barolo                                   1398
Chianti Classico                         1029
                                         ... 
Mitterberg                                  1
Recioto della Valpolicella Valpantena       1
Rosso del Salento                           1
Paestum                                     1
Alta Valle della Greve                      1
Name: region_1, Length: 371, dtype: int64 

Italy Region 2:

 Series([], Name: region_2, dtype: int64) 

Missing data for Italy:


Unnamed: 0         0
country            0
description        0
designation     6588
points             0
price           4694
province           0
region_1           0
region_2       23478
variety            0
winery             0
dtype: int64

Italy has 20 regions and the dataset contains:

In [117]:
italy.province.unique().size

10
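
The three separate `unique().size` cells could also be condensed into one grouped count. The snippet below is only a sketch on a small made-up frame (`toy` and its rows are hypothetical, not taken from the dataset); on the real data the same call would be applied to `wines`.

```python
import pandas as pd

# Hypothetical miniature of the wines frame, for illustration only
toy = pd.DataFrame({
    "country": ["US", "US", "US", "France", "France", "Italy"],
    "province": ["California", "Oregon", "California", "Bordeaux", "Alsace", "Tuscany"],
})

# Number of distinct provinces per country
provinces_per_country = toy.groupby("country")["province"].nunique()
print(provinces_per_country)
```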

### Summary

A larger share of France's regions is represented in the dataset than is the case for the US or Italy. This should lead to a better visualization, since we want to use a map to show the different regions and their wines: when most of a country's regions appear in the dataset, the map looks more complete.
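
As a rough check of that comparison, the counts printed above can be turned into shares. This is only a back-of-the-envelope sketch: the dataset's provinces do not map one-to-one onto states or administrative regions (for example `America` and `France Other` appear as provinces), so the percentages are indicative at best.

```python
# Province counts from the `unique().size` cells; totals are the figures
# quoted in the text (50 states, 18 regions, 20 regions).
represented = {"US": 25, "France": 12, "Italy": 10}
total = {"US": 50, "France": 18, "Italy": 20}

coverage = {country: represented[country] / total[country] for country in represented}
for country, share in coverage.items():
    print(f"{country}: {share:.0%}")
```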

Another advantage of France is that its regions are more clearly defined than those of the US. France only uses region_1, not region_2, and the regions are usually well known because the wines take their names from them; for example, Champagne is only made in the Champagne region. The US regions are less specific, and there are so many of them that it would be hard to keep track of and summarize everything in a visualization.

There is also very little data missing for France in the categories that we are interested in.

**Conclusion:** We will study France.

## What should we focus on in the data? 

In [196]:
# take a look at the other data in the dataset
france.head()
france.reset_index(drop = True, inplace=True)
france

Unnamed: 0,country,description,designation,points,price,province,region_1,variety,winery
0,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,Provence red blend,Domaine de la Bégude
1,France,This wine is in peak condition. The tannins an...,Château Montus Prestige,95,90.0,Southwest France,Madiran,Tannat,Vignobles Brumont
2,France,Coming from a seven-acre vineyard named after ...,Le Pigeonnier,95,290.0,Southwest France,Cahors,Malbec,Château Lagrézette
3,France,"Pale in color, this is nutty in character, wit...",Nonpareil Trésor Rosé Brut,90,22.0,France Other,Vin Mousseux,Sparkling Blend,Bouvet-Ladubay
4,France,Gingery spice notes accent fresh pear and melo...,,90,60.0,Rhône Valley,Châteauneuf-du-Pape,Rhône-style White Blend,Clos de L'Oratoire des Papes
...,...,...,...,...,...,...,...,...,...
21094,France,Shows some older notes: a bouquet of toasted w...,Blanc de Blancs Brut Mosaïque,91,38.0,Champagne,Champagne,Champagne Blend,Jacquart
21095,France,"Rich and toasty, with tiny bubbles. The bouque...",Demi-Sec,91,30.0,Champagne,Champagne,Champagne Blend,Jacquart
21096,France,"Really fine for a low-acid vintage, there's an...",Diamant Bleu,91,70.0,Champagne,Champagne,Champagne Blend,Heidsieck & Co Monopole
21097,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,Champagne Blend,H.Germain


From the earlier output, the missing values for France are:

- country:            0
- description:        0
- designation:     6592
- points:             0
- price:           6313
- province:           0
- region_1:          16
- region_2:       21099
- variety:            0
- winery:             0


Since there are 21099 rows in the dataset, it may be wise to steer clear of visualizing designation and price, as about 30% of the values are missing for each.
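
The "about 30%" figure can be checked from the counts listed above. The sketch below recomputes the percentages from those numbers only; on the actual frame, `france.isnull().mean()` should give the same fractions directly.

```python
import pandas as pd

n_rows = 21099  # number of rows in the France subset, as stated above

# Missing-value counts copied from the list above (non-zero entries only)
missing = pd.Series({
    "designation": 6592,
    "price": 6313,
    "region_1": 16,
    "region_2": 21099,
})

missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct)
```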

Since there are no values given for region_2, we remove this column. We also remove `Unnamed: 0`, which pandas creates when the first column of the CSV (typically a saved index) has no header.
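
A small sketch of where an `Unnamed: 0` column typically comes from and one way to avoid it. The frame `df` is a made-up example and the CSV is kept in memory; for the real file the equivalent would be passing `index_col=0` to the `pd.read_csv` call at the top, assuming the first column really is just a saved index.

```python
import io
import pandas as pd

# Hypothetical two-row frame, written to CSV with its index (the default)
df = pd.DataFrame({"country": ["France", "Italy"], "points": [95, 90]})
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()

# Reading it back without index_col turns the header-less first column
# into a regular column named 'Unnamed: 0'
plain = pd.read_csv(io.StringIO(csv_text))

# Telling pandas that the first column is the index avoids the extra column
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(indexed.columns.tolist())
```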

In [120]:
france = france.drop(['region_2', 'Unnamed: 0'], axis = 1)
france.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,variety,winery
0,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,Provence red blend,Domaine de la Bégude
1,France,This wine is in peak condition. The tannins an...,Château Montus Prestige,95,90.0,Southwest France,Madiran,Tannat,Vignobles Brumont
2,France,Coming from a seven-acre vineyard named after ...,Le Pigeonnier,95,290.0,Southwest France,Cahors,Malbec,Château Lagrézette
3,France,"Pale in color, this is nutty in character, wit...",Nonpareil Trésor Rosé Brut,90,22.0,France Other,Vin Mousseux,Sparkling Blend,Bouvet-Ladubay
4,France,Gingery spice notes accent fresh pear and melo...,,90,60.0,Rhône Valley,Châteauneuf-du-Pape,Rhône-style White Blend,Clos de L'Oratoire des Papes


### Variety 
Exploration of the different varieties

In [182]:
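# Note: [1:34] skips the first (most frequent) entry and keeps 33 varieties;
# use [:34] to keep the 34 most frequent ones.
varieties = france.variety.value_counts()[1:34]
varieties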

Chardonnay                    2892
Pinot Noir                    2073
Rosé                          1342
Bordeaux-style White Blend    1210
Champagne Blend               1104
Sauvignon Blanc                930
Rhône-style Red Blend          866
Gamay                          535
Malbec                         519
Riesling                       501
Syrah                          458
Chenin Blanc                   413
Gewürztraminer                 327
Red Blend                      305
Pinot Gris                     276
White Blend                    265
Sparkling Blend                223
Pinot Blanc                    214
Cabernet Franc                 197
Rhône-style White Blend        178
Merlot                         106
Melon                           97
Viognier                        83
Malbec-Merlot                   80
Marsanne                        74
Tannat                          58
Alsace white blend              52
Cabernet Sauvignon              50
Muscat              

There are 131 different varieties in total, which is a lot to visualize, so it makes sense to focus on the ones that occur most often in the dataset. A minimum threshold of 30 occurrences seems reasonable; that will include 34 varieties.
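
Instead of slicing `value_counts()` by position, the threshold can be applied to the counts themselves, which avoids having to count how many varieties pass it. This is a sketch on made-up data; `toy`, `min_count` and `frequent` are illustrative names, and the threshold is lowered to 3 so the tiny example has something to filter.

```python
import pandas as pd

# Hypothetical miniature of the France frame
toy = pd.DataFrame(
    {"variety": ["Chardonnay"] * 4 + ["Pinot Noir"] * 3 + ["Gamay"] * 1}
)

min_count = 3  # the text proposes 30 for the real data

counts = toy["variety"].value_counts()
frequent = counts[counts >= min_count]

# Keep only rows whose variety passes the threshold
toy_lim = toy[toy["variety"].isin(frequent.index)]

print(frequent)
print(len(toy_lim))
```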

In [159]:
# create dataset with only these varieties.
france_lim = france.loc[france.variety.isin(varieties.index)]

In [202]:
france_lim.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,variety,winery
0,France,Coming from a seven-acre vineyard named after ...,Le Pigeonnier,95,290.0,Southwest France,Cahors,Malbec,Château Lagrézette
1,France,"Delicious while also young and textured, this ...",Le Pavé,90,,Loire Valley,Sancerre,Sauvignon Blanc,Domaine Vacheron
2,France,The steely character of a young Chablis is ver...,Fourchaume Premier Cru,91,38.0,Burgundy,Chablis,Chardonnay,Louis Max
3,France,"Lightly structured, this is a balanced, ripe w...",,90,15.0,Bordeaux,Bordeaux Rosé,Rosé,Château Suau
4,France,"Beautifully balanced, this conveys both rich f...",Clos Häuserer Wintzenheim,93,65.0,Alsace,Alsace,Riesling,Domaine Zind-Humbrecht


### Winery
Exploration of the different wineries

In [178]:
wineries = france.winery.value_counts()[0:118]
wineries

Bouchard Père & Fils          203
Joseph Drouhin                189
Georges Duboeuf               188
Albert Bichot                 167
Louis Latour                  154
                             ... 
Morey-Blanc                    31
Domaine Jayer-Gilles           30
Domaine D'en Ségur             30
Laurent-Perrier                30
Château Smith Haut Lafitte     30
Name: winery, Length: 118, dtype: int64

In [190]:
# apply the same threshold as for varieties, 30.
france_lim = france_lim.loc[france_lim.winery.isin(wineries.index)]

# reset index
france_lim.reset_index(drop = True, inplace = True)
france_lim

Unnamed: 0,country,description,designation,points,price,province,region_1,variety,winery
0,France,Coming from a seven-acre vineyard named after ...,Le Pigeonnier,95,290.0,Southwest France,Cahors,Malbec,Château Lagrézette
1,France,"Delicious while also young and textured, this ...",Le Pavé,90,,Loire Valley,Sancerre,Sauvignon Blanc,Domaine Vacheron
2,France,The steely character of a young Chablis is ver...,Fourchaume Premier Cru,91,38.0,Burgundy,Chablis,Chardonnay,Louis Max
3,France,"Lightly structured, this is a balanced, ripe w...",,90,15.0,Bordeaux,Bordeaux Rosé,Rosé,Château Suau
4,France,"Beautifully balanced, this conveys both rich f...",Clos Häuserer Wintzenheim,93,65.0,Alsace,Alsace,Riesling,Domaine Zind-Humbrecht
...,...,...,...,...,...,...,...,...,...
6181,France,The muted nose offers faint hay and floral aro...,Brut,87,40.0,Champagne,Champagne,Champagne Blend,Nicolas Feuillatte
6182,France,A voluptuous blockbuster in the style we are b...,Le Pigeonnier,93,60.0,Southwest France,Cahors,Malbec,Château Lagrézette
6183,France,"Impressive dark purple in color, this powerful...",,87,20.0,Southwest France,Cahors,Red Blend,Château Lagrézette
6184,France,This one comes from the Mouton branch of the R...,,82,14.0,Bordeaux,Graves,White Blend,Baron Philippe de Rothschild


We have now reduced the dataset from 21099 to 6186 rows, which is also the number of rows after limiting the varieties. This means that no data is lost when removing the wineries with fewer than 30 occurrences from the *france_lim* dataset.

### Price & Designation

In [194]:
france_lim.price.value_counts(dropna = False)

NaN      1135
20.0      185
30.0      152
25.0      142
22.0      128
         ... 
255.0       1
167.0       1
520.0       1
236.0       1
121.0       1
Name: price, Length: 250, dtype: int64

Now, 18% of the prices are missing. It should not be a problem to include the prices at this rate, but it may depend on the choice of visualization. 

In [197]:
france_lim.designation.value_counts(dropna = False)

NaN                                  1394
Réserve                                61
Vieilles Vignes                        51
Brut                                   48
Premier Cru                            42
                                     ... 
Sancerre d'Antan Terroir de Silex       1
Clos Häuserer Wintzenheim               1
Les Abeilles                            1
Les Folatières                          1
Aile d'Argent Barrel Sample             1
Name: designation, Length: 1475, dtype: int64

About 22% of the designations are now missing. This is also something to consider when choosing a visualization.

### Province

In [199]:
france_lim.province.value_counts()

Burgundy                2877
Alsace                   804
Loire Valley             535
Southwest France         443
Champagne                375
Rhône Valley             360
Languedoc-Roussillon     231
Beaujolais               225
Bordeaux                 158
Provence                 134
France Other              44
Name: province, dtype: int64

When limiting the dataset based on the varieties and wineries, we lose one province, which is *Santa Barbara County-Condrieu*.

I am wondering if anything might be off about this province to start with, as it does not appear to be in France at all when searching on the internet. 

In [200]:
france.loc[france.province == 'Santa Barbara County-Condrieu']

Unnamed: 0,country,description,designation,points,price,province,region_1,variety,winery
20265,US-France,"Defies categorization, in more ways than one. ...",,88,50.0,Santa Barbara County-Condrieu,,Viognier,Deux C


The original data shows that the country for this row is listed as a mix of the US and France (`US-France`), which is why the substring filter on 'France' picked it up. As it only appears once, we can treat it as not relevant for us.
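
The difference between substring matching and exact matching can be seen on a tiny example. The series below is made up for illustration; whether the real `france` subset should use exact matching is a judgment call, since it would change the row counts shown earlier.

```python
import pandas as pd

# Hypothetical country column, including a mixed label and a missing value
country = pd.Series(["US", "France", "US-France", None, "Italy"])

# Substring match: also catches 'US-France'
by_substring = country[country.str.contains("France", na=False)]

# Exact match: only rows where the country is exactly 'France'
by_equality = country[country == "France"]

print(by_substring.tolist())
print(by_equality.tolist())
```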

One positive thing is that since we still have all the other provinces from the original dataset, we have not lost any representation of areas. 
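
A quick way to confirm which provinces disappear after filtering is a set difference between the province values before and after. The two series below are stand-ins for `france.province` and `france_lim.province`, with made-up contents.

```python
import pandas as pd

# Hypothetical stand-ins for the province columns before and after filtering
before = pd.Series(["Burgundy", "Alsace", "Santa Barbara County-Condrieu", "Burgundy"])
after = pd.Series(["Burgundy", "Alsace", "Burgundy"])

# Provinces present before the filter but not after it
lost = sorted(set(before.unique()) - set(after.unique()))
print(lost)
```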

### Ideas for visualisation

Just wanted to write down some thoughts I had on how this could be presented: 

Since many existing visualizations are basically Google Maps with locations on them, it could be cool to create a clean map (just the borders and names of the provinces). We could then add icons that signify the different varieties (a symbol of some kind, maybe their coat of arms). Clicking on a province would enlarge it and show its wineries, maybe with some information about them.

It would also be nice to show some statistics for that province and its wines.
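
One possible shape for such per-province statistics is a grouped summary. The sketch below uses a handful of rows modelled on the tables above; the choice of statistics (count, mean points, median price) and the name `sample` are only suggestions.

```python
import pandas as pd

# A few rows modelled on the france_lim output above (province, points, price)
sample = pd.DataFrame({
    "province": ["Southwest France", "Loire Valley", "Burgundy", "Bordeaux", "Alsace",
                 "Champagne", "Southwest France", "Southwest France", "Bordeaux"],
    "points": [95, 90, 91, 90, 93, 87, 93, 87, 82],
    "price": [290.0, None, 38.0, 15.0, 65.0, 40.0, 60.0, 20.0, 14.0],
})

# One row per province: number of wines, average score, median price
stats = (
    sample.groupby("province")
    .agg(
        n_wines=("points", "size"),
        mean_points=("points", "mean"),
        median_price=("price", "median"),
    )
    .sort_values("n_wines", ascending=False)
)
print(stats)
```

Missing prices are skipped by `median`, so a province with no known prices simply shows `NaN` in that column.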

