# Intro
Prepare the *Wine Spectator* Top 100 review set for training a text classification model.
Reference: [Dataset Splitting Best Practices in Python](https://www.kdnuggets.com/2020/05/dataset-splitting-best-practices-python.html)

# Load the *Wine Spectator* Top 100 dataset

## File setup

In [38]:
# import and initialize main python libraries
import numpy as np
import pandas as pd

# import libraries for file navigation
import os
import shutil
import glob
from pandas_ods_reader import read_ods

# import ML libraries
from sklearn.model_selection import train_test_split

## Load dataframe and reformat

In [39]:
# Note: save CSV files in UTF-8 format to preserve special characters.
df_Wine = pd.read_csv('./CSV_Wines.csv')

In [40]:
# CSV of wines is retaining a blank row at the end of the dataset. Remove the last row to prevent data type errors.

# number of rows to drop
n = 1

df_Wine.drop(df_Wine.tail(n).index, inplace = True)
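
Dropping a fixed number of trailing rows would remove a real record if the CSV were ever re-exported without the blank row. A minimal sketch of an alternative, assuming the stray row is entirely empty: `DataFrame.dropna(how='all')` removes only rows in which every column is missing. The small dataframe `df_demo` below is made-up illustration data, not the wine dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded CSV: two real rows and a trailing blank row
df_demo = pd.DataFrame({
    'Wine_Style': ['Red', 'White', np.nan],
    'Review': ['A lush, ripe style.', 'Crisp and floral.', np.nan],
})

# Drop only rows where every column is missing, then renumber the index
df_clean = df_demo.dropna(how='all').reset_index(drop=True)

print(df_clean.shape)
```

This keeps the cleaning step safe to re-run, since a dataframe with no blank rows passes through unchanged.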

In [41]:
df_Wine.shape

(3300, 18)

In [42]:
df_Wine.dtypes

Review_Year           float64
Rank                   object
Vintage                object
Score                 float64
Price                  object
Winemaker              object
Wine                   object
Wine_Style             object
Grape_Blend            object
Blend_List             object
Geography              object
Cases_Made            float64
Cases_Imported        float64
Reviewer               object
Drink_now             float64
Best_Drink_from       float64
Best_Drink_Through    float64
Review                 object
dtype: object

In [43]:
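# limit dataset to the review text and class label columns
# .copy() creates an independent dataframe, so later column assignments do not raise SettingWithCopyWarning
df_Wine = df_Wine[['Wine_Style', 'Review']].copy()
df_Wine.head()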

   Wine_Style  Review
0         Red  Maturing well, this round red is a lovely exam...
1         Red  Powerful and structured, with minerally richne...
2         Red  Effusive aromas of black currant, blueberry, v...
3         Red  This distinctive red throws a lot of wild sage...
4         Red  A lush, ripe style, with açaí berry, blueberry...


In [44]:
# convert the Wine_Style class labels from string to int for easier analysis.

# set up dictionary
style = {
    'Dessert & Fortified': 0,
    'Red': 1,
    'Rosé | Rosado': 2,
    'Sparkling': 3,
    'White': 4
}

# apply dictionary to Wine_Style column:
df_Wine.Wine_Style = [style[item] for item in df_Wine.Wine_Style]
df_Wine.head()

   Wine_Style  Review
0           1  Maturing well, this round red is a lovely exam...
1           1  Powerful and structured, with minerally richne...
2           1  Effusive aromas of black currant, blueberry, v...
3           1  This distinctive red throws a lot of wild sage...
4           1  A lush, ripe style, with açaí berry, blueberry...
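
The list comprehension above raises a `KeyError` at the first style that is missing from the dictionary. A minimal sketch of an alternative, assuming it is more useful to see every unknown label at once: `Series.map` returns `NaN` for unmapped values, which can then be listed. The `labels` series, including the `'Orange'` entry, is hypothetical sample data.

```python
import pandas as pd

# Same label dictionary as in the cell above
style = {
    'Dessert & Fortified': 0,
    'Red': 1,
    'Rosé | Rosado': 2,
    'Sparkling': 3,
    'White': 4
}

# Hypothetical sample of style labels, including one not in the dictionary
labels = pd.Series(['Red', 'White', 'Sparkling', 'Orange'])

# map() returns NaN for any label missing from the dictionary
encoded = labels.map(style)

# Collect the labels that could not be encoded
unmapped = labels[encoded.isna()].unique().tolist()
print(f'Unmapped labels: {unmapped}')
```

If `unmapped` is empty, the encoded column can be cast back to integers with `encoded.astype(int)`.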


# Explore dataset

## Randomly shuffle instances
Shuffle the records to prevent ordering bias. `train_test_split` shuffles the records by default before splitting, which reduces the chance that a class appears in the training set but not in the test set, or vice versa.

In [45]:
X, y = df_Wine.Review, df_Wine.Wine_Style
print(f'Dataset labels: {df_Wine.Wine_Style}')

Dataset labels: 0       1
1       1
2       1
3       1
4       1
5       4
6       1
7       1
8       1
9       3
10      1
11      1
12      1
13      1
14      1
15      1
16      4
17      4
18      1
19      1
20      4
21      1
22      4
23      1
24      1
25      4
26      1
27      1
28      1
29      1
       ..
3270    1
3271    4
3272    1
3273    1
3274    1
3275    1
3276    4
3277    1
3278    1
3279    4
3280    4
3281    1
3282    1
3283    4
3284    4
3285    1
3286    1
3287    1
3288    4
3289    4
3290    4
3291    1
3292    1
3293    4
3294    4
3295    4
3296    1
3297    1
3298    1
3299    3
Name: Wine_Style, Length: 3300, dtype: int64


In [46]:
# Split the dataset and shuffle the instances
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size = .75, 
                                                    random_state = 42)

print(f'Train labels:\n{y_train}')
print(f'Test labels:\n{y_test}')

Train labels:
650     1
227     1
365     1
3140    1
862     4
276     1
2196    1
1613    4
2216    1
1204    1
2415    4
407     1
2526    4
2322    1
2793    4
3115    1
847     4
2645    4
3225    1
844     1
7       1
1192    4
1477    1
1061    4
415     0
1159    1
990     1
1953    1
1018    1
199     1
       ..
747     1
2300    0
21      1
459     4
1184    1
2324    1
955     1
1215    1
2433    1
2853    4
1515    3
2391    1
769     1
1685    1
130     4
2919    1
3171    1
2135    3
1482    1
330     4
1238    1
466     1
2169    1
1638    1
3092    1
1095    1
1130    1
1294    1
860     1
3174    1
Name: Wine_Style, Length: 2475, dtype: int64
Test labels:
52      1
679     1
1253    4
2130    1
203     1
2451    1
2073    1
1488    1
1665    4
485     1
1511    4
511     1
1703    1
734     1
70      1
1812    1
2213    4
2780    1
2857    1
1233    1
2987    4
1411    1
2004    1
1178    4
1525    1
1590    1
3266    1
864     4
80      4
2899    1
       ..
1146    
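
Shuffling alone does not guarantee that the class mix is the same in both subsets. One way to check, sketched here on made-up imbalanced labels rather than the wine data, is to compare the share of each class with `value_counts(normalize = True)`. The names `X_demo`, `y_demo`, and the 90/10 class mix are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 of class 1, 10 of class 4
y_demo = pd.Series([1] * 90 + [4] * 10)
X_demo = pd.Series([f'review {i}' for i in range(len(y_demo))])

# Shuffled split without stratification, using the same settings as above
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          train_size = .75,
                                          random_state = 42)

# Share of each class in each subset; without stratify these need not match
print(y_tr.value_counts(normalize = True).sort_index())
print(y_te.value_counts(normalize = True).sort_index())
```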

## Stratify classes
Ensure that each class is represented in the same proportion in the training and test sets.

In [47]:
# Split the dataset, shuffle the instances, and stratify by class
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size = .75, 
                                                    random_state = 42,
                                                    stratify = y)

print(f'Train labels:\n{y_train}')
print(f'Test labels:\n{y_test}')

Train labels:
2742    1
1810    1
2685    4
1550    4
1086    1
1711    4
2104    1
1612    3
1072    1
1875    1
971     4
96      1
2863    4
3258    3
2240    4
883     3
2267    4
100     1
2653    4
518     1
2398    4
1608    0
2257    1
415     0
1722    1
2232    1
2052    1
2567    1
3149    4
1534    1
       ..
2404    4
409     1
1172    4
2647    4
2505    1
1437    1
821     1
2842    4
2794    3
2829    4
559     3
2860    1
1473    1
2155    4
2273    1
134     1
2235    4
1264    1
486     1
1252    1
579     4
2393    1
2719    1
2386    1
2942    1
3073    1
2790    1
774     4
726     1
1216    3
Name: Wine_Style, Length: 2475, dtype: int64
Test labels:
1454    1
89      4
3059    0
2448    1
2776    4
3253    1
1223    1
689     4
249     1
977     4
3228    1
993     4
1791    1
414     1
371     1
258     4
1700    1
3026    1
1000    1
3009    1
152     0
2394    3
1056    4
1643    4
1       1
3271    4
1026    4
853     1
3011    4
1841    4
       ..
865     

In [48]:
print(f'Number of train instances by class: {np.bincount(y_train)}')
print(f'Number of test instances by class: {np.bincount(y_test)}')

Number of train instances by class: [  59 1718    7   67  624]
Number of test instances by class: [ 20 573   2  22 208]
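
The raw counts are easier to compare as percentages. The short check below copies the two count arrays from the output above and converts each to a share of its subset; the variable names are illustrative.

```python
import numpy as np

# Class counts copied from the output above
train_counts = np.array([59, 1718, 7, 67, 624])
test_counts = np.array([20, 573, 2, 22, 208])

# Convert counts to percentages of each subset
train_pct = 100 * train_counts / train_counts.sum()
test_pct = 100 * test_counts / test_counts.sum()

print(np.round(train_pct, 1))
print(np.round(test_pct, 1))

# Largest gap between the two subsets, in percentage points
print(np.abs(train_pct - test_pct).max())
```

With stratification, every class share differs by less than 0.1 percentage points between the two subsets.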


## Split the test set into test and validation sets

In [49]:
# Split the held-out 25% evenly into test and validation sets, stratified by class
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test,
                                                train_size = .5,
                                                random_state = 42,
                                                stratify = y_test)

print(f'Test labels:\n{y_test}')
print(f'Validation labels:\n{y_val}')

Test labels:
1464    1
1185    1
2953    1
1215    1
3026    1
580     1
1233    1
952     1
3273    1
1223    1
926     1
568     1
654     1
1547    1
92      1
1935    1
69      4
1597    1
680     1
2552    1
2851    1
1716    4
1227    1
2602    1
2041    1
2158    4
557     1
2265    1
1695    4
2430    1
       ..
2676    1
2884    1
2628    0
221     1
1614    1
2483    1
1231    1
2633    1
315     1
2802    1
2454    4
79      4
3223    1
1158    1
2436    1
3011    4
2160    4
3091    4
2300    0
2641    4
2564    4
1000    1
970     4
2902    1
922     1
1964    1
2803    1
2141    4
2900    1
220     1
Name: Wine_Style, Length: 412, dtype: int64
Validation labels:
2018    1
1066    1
2416    4
2147    4
170     4
3217    1
3163    1
1420    1
292     2
2472    4
2212    0
1291    4
2203    1
2143    1
494     1
943     1
1110    1
2665    4
836     4
1056    4
2937    1
2376    1
2772    1
2312    1
123     1
2245    1
1899    1
280     4
824     1
126     1
       ..
2901    1

In [51]:
print(f'Number of train instances by class: {np.bincount(y_train)}')
print(f'Number of test instances by class: {np.bincount(y_test)}')
print(f'Number of val instances by class: {np.bincount(y_val)}')

Number of train instances by class: [  59 1718    7   67  624]
Number of test instances by class: [ 10 286   1  11 104]
Number of val instances by class: [ 10 287   1  11 104]
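
The two splitting steps can be wrapped in one function. This is a minimal sketch under stated assumptions: `train_test_val_split` is a hypothetical helper name, and the data is a made-up two-class array rather than the wine reviews. Stratifying the second split requires at least two members of every class in the held-out set, which holds for the counts shown above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def train_test_val_split(X, y, train_size = .75, random_state = 42):
    """Hypothetical helper: stratified train/test/validation split."""
    # First split: training set vs. a held-out set
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, train_size = train_size,
        random_state = random_state, stratify = y)
    # Second split: divide the held-out set evenly into test and validation
    X_test, X_val, y_test, y_val = train_test_split(
        X_hold, y_hold, train_size = .5,
        random_state = random_state, stratify = y_hold)
    return X_train, X_test, X_val, y_train, y_test, y_val

# Made-up data: 80 samples of class 0 and 40 of class 1
X_demo = np.arange(120)
y_demo = np.array([0] * 80 + [1] * 40)

parts = train_test_val_split(X_demo, y_demo)
y_tr, y_te, y_va = parts[3:]
print(np.bincount(y_tr), np.bincount(y_te), np.bincount(y_va))
```

Keeping both calls in one function also avoids reassigning `X_test` and `y_test` in place, so the cell can be re-run without splitting the test set a second time.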


# Set up folder structure for TensorFlow text_dataset_from_directory

In [213]:
# define the folder paths for the train, test, and validation sets and for each class

path_main = './review_data/'

path_train = 'train/'
path_test = 'test/'
path_val = 'val/'

path0 = '0/'
path1 = '1/'
path2 = '2/'
path3 = '3/'
path4 = '4/'

In [214]:
# Start with an empty set of folders, but a consistent folder structure

path_mains = [path_main]
paths = [path_train, path_test, path_val]
subpaths = [path0, path1, path2, path3, path4]

for path_main in path_mains:
    checkpath = path_main

    # Delete any existing root directory so that stale files are not carried over
    if os.path.isdir(checkpath):
        shutil.rmtree(checkpath)
        os.mkdir(checkpath)
        print(f'Directory {checkpath} deleted and recreated')
    else:
        os.mkdir(checkpath)
        print(f'Directory {checkpath} created')

    for path in paths:
        checkpath = path_main + path

        # Create the split directory if it does not already exist
        if not os.path.isdir(checkpath):
            os.mkdir(checkpath)
            print(f'Directory {checkpath} created')
        else:
            print(f'Directory {checkpath} already exists')

        for subpath in subpaths:
            checkpath = path_main + path + subpath

            # Create the class directory if it does not already exist
            if not os.path.isdir(checkpath):
                os.mkdir(checkpath)
                print(f'Directory {checkpath} created')
            else:
                print(f'Directory {checkpath} already exists')

Directory ./review_data/ deleted and recreated
Directory ./review_data/train/ created
Directory ./review_data/train/0/ created
Directory ./review_data/train/1/ created
Directory ./review_data/train/2/ created
Directory ./review_data/train/3/ created
Directory ./review_data/train/4/ created
Directory ./review_data/test/ created
Directory ./review_data/test/0/ created
Directory ./review_data/test/1/ created
Directory ./review_data/test/2/ created
Directory ./review_data/test/3/ created
Directory ./review_data/test/4/ created
Directory ./review_data/val/ created
Directory ./review_data/val/0/ created
Directory ./review_data/val/1/ created
Directory ./review_data/val/2/ created
Directory ./review_data/val/3/ created
Directory ./review_data/val/4/ created
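
The same structure can be built more compactly. A minimal sketch, assuming the per-directory log messages are not needed: `os.makedirs` creates intermediate directories in one call. The sketch writes to a temporary directory (an assumption for illustration) so that it does not touch `./review_data/`.

```python
import os
import shutil
import tempfile

# Temporary root so the sketch does not touch ./review_data/
root = os.path.join(tempfile.mkdtemp(), 'review_data')

splits = ['train', 'test', 'val']
classes = ['0', '1', '2', '3', '4']

# Start from an empty root directory
shutil.rmtree(root, ignore_errors = True)

# makedirs creates the intermediate directories in one call
for split in splits:
    for label in classes:
        os.makedirs(os.path.join(root, split, label), exist_ok = True)

print(sorted(os.listdir(root)))
```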


In [215]:
# iterate through the train, test, and validation datasets and create individual files. 
# Save in directory structure.

def populate_directory(X, y, train_test_val):
    """Write each review in X to ./review_data/<split>/<label>/<label>_<index>.txt."""

    y = y.astype('str')

    # Iterate over the index labels actually present in this split, rather than
    # over every row of df_Wine, so that no lookup fails and nothing is skipped.
    for i in X.index:
        X_item = X.loc[i]
        y_item = y.loc[i]

        path = './review_data/' + train_test_val + '/' + y_item
        file = y_item + '_' + str(i) + '.txt'

        # Create a file at the designated path; write UTF-8 to preserve special characters
        with open(os.path.join(path, file), 'w', encoding = 'utf-8') as fp:
            fp.write(X_item)

In [216]:
populate_directory(X_train, y_train, 'train')

In [217]:
populate_directory(X_test, y_test, 'test')

In [218]:
populate_directory(X_val, y_val, 'val')
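
After the files are written, it is worth confirming that each class folder holds the expected number of reviews. A minimal sketch using `glob`, which is imported above: `count_files_by_class` is a hypothetical helper, and the demo folder and file contents are made up so that the example runs on its own.

```python
import glob
import os
import tempfile

def count_files_by_class(split_dir):
    """Hypothetical helper: number of .txt files in each class subfolder."""
    counts = {}
    # A pattern ending in a path separator matches directories only
    for class_dir in sorted(glob.glob(os.path.join(split_dir, '*', ''))):
        label = os.path.basename(os.path.dirname(class_dir))
        counts[label] = len(glob.glob(os.path.join(class_dir, '*.txt')))
    return counts

# Build a tiny made-up split in a temporary directory
demo_dir = tempfile.mkdtemp()
for label, n_files in {'1': 3, '4': 2}.items():
    os.makedirs(os.path.join(demo_dir, label))
    for i in range(n_files):
        file = os.path.join(demo_dir, label, f'{label}_{i}.txt')
        with open(file, 'w', encoding = 'utf-8') as fp:
            fp.write('made-up review text')

print(count_files_by_class(demo_dir))
```

For the real data, the counts returned for `./review_data/train` should match `np.bincount(y_train)`, and likewise for the test and validation folders.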