# Beer ABV (Alcohol by Volume) Prediction

We'll predict whether a beer is highly alcoholic (ABV greater than 7 percent).

Dataset:
- Beer Reviews: https://cseweb.ucsd.edu/classes/fa23/cse258-a/data/beer_50000.json

## 1. Prepare the dataset

In [2]:
# Download the data
!wget 'https://cseweb.ucsd.edu/classes/fa23/cse258-a/data/beer_50000.json'

--2023-11-28 17:34:49--  https://cseweb.ucsd.edu/classes/fa23/cse258-a/data/beer_50000.json
Resolving cseweb.ucsd.edu (cseweb.ucsd.edu)... 132.239.8.30
Connecting to cseweb.ucsd.edu (cseweb.ucsd.edu)|132.239.8.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61156124 (58M) [application/json]
Saving to: ‘beer_50000.json’


2023-11-28 17:34:50 (36.7 MB/s) - ‘beer_50000.json’ saved [61156124/61156124]



In [4]:
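# Parse the dataset: each line of the file is a Python dict literal
import ast

def parseData(fname):
  with open(fname) as f:
    for l in f:
      # literal_eval only accepts literals, so it cannot run arbitrary code
      yield ast.literal_eval(l)

data = list(parseData('beer_50000.json'))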

In [5]:
data[0]

{'review/appearance': 2.5,
 'beer/style': 'Hefeweizen',
 'review/palate': 1.5,
 'review/taste': 1.5,
 'beer/name': 'Sausa Weizen',
 'review/timeUnix': 1234817823,
 'beer/ABV': 5.0,
 'beer/beerId': '47986',
 'beer/brewerId': '10325',
 'review/timeStruct': {'isdst': 0,
  'mday': 16,
  'hour': 20,
  'min': 57,
  'sec': 3,
  'mon': 2,
  'year': 2009,
  'yday': 47,
  'wday': 0},
 'review/overall': 1.5,
 'review/text': 'A lot of foam. But a lot.\tIn the smell some banana, and then lactic and tart. Not a good start.\tQuite dark orange in color, with a lively carbonation (now visible, under the foam).\tAgain tending to lactic sourness.\tSame for the taste. With some yeast and banana.',
 'user/profileName': 'stcules',
 'review/aroma': 2.0}

In [7]:
# Shuffle the data before splitting it into 50%/25%/25% train/validation/test fractions and creating the labels.
import random
random.seed(0)
random.shuffle(data)

In [8]:
# Split the dataset
dataTrain = data[:25000]
dataValid = data[25000:37500]
dataTest = data[37500:]

In [9]:
# Create the labels
yTrain = [d['beer/ABV'] > 7 for d in dataTrain]
yValid = [d['beer/ABV'] > 7 for d in dataValid]
yTest = [d['beer/ABV'] > 7 for d in dataTest]
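
The shuffle-then-slice steps above can also be written as one function driven by fractions rather than hard-coded indices. This is only a sketch: `shuffle_and_split` is a hypothetical helper that is not part of the notebook, and it shuffles a copy instead of modifying the list in place.

```python
import random

def shuffle_and_split(records, fractions=(0.5, 0.25, 0.25), seed=0):
    """Return (train, valid, test) lists split by the given fractions."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n = len(shuffled)
    n_train = int(n * fractions[0])
    n_valid = int(n * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains goes to test
    return train, valid, test

toy = list(range(20))                      # stand-in for the list of reviews
train, valid, test = shuffle_and_split(toy)
print(len(train), len(valid), len(test))   # 10 5 5
```

With 50,000 reviews the same fractions give the 25,000/12,500/12,500 split used above.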

## 2. Beer ABV prediction

We'll first use the style of the beer to predict whether its ABV exceeds 7 percent, and then extend the model with two additional features:

1. A vector of five ratings (review/aroma, review/overall, etc.)
2. The review length (in characters), scaled to lie between 0 and 1 by dividing by the maximum length seen during training.

We'll only consider the beer styles that appear in more than 1,000 reviews; every other style is encoded as all zeros.
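
To make the layout of the feature vector concrete, here is a toy version on made-up data. The three-style list, the `max_length` value of 200, and the sample review are all hypothetical; the notebook's own versions are built from the dataset further down.

```python
# Hypothetical stand-ins for the style IDs and maximum training length
styles = ['American IPA', 'Old Ale', 'Rye Beer']
style_id = {s: i for i, s in enumerate(styles)}
max_length = 200

RATING_KEYS = ('review/appearance', 'review/aroma', 'review/overall',
               'review/palate', 'review/taste')

def toy_features(review):
    # One-hot style encoding; a style outside the list leaves all zeros
    one_hot = [0] * len(style_id)
    idx = style_id.get(review['beer/style'])
    if idx is not None:
        one_hot[idx] = 1
    ratings = [review[k] for k in RATING_KEYS]
    length = [len(review['review/text']) / max_length]
    return one_hot + ratings + length + [1]   # trailing 1 is the offset term

review = {'beer/style': 'Old Ale',
          'review/appearance': 4.0, 'review/aroma': 3.5,
          'review/overall': 4.0, 'review/palate': 3.0,
          'review/taste': 4.5, 'review/text': 'x' * 50}
print(toy_features(review))
# [0, 1, 0, 4.0, 3.5, 4.0, 3.0, 4.5, 0.25, 1]

rare = dict(review, **{'beer/style': 'Hefeweizen'})  # style not in the list
print(toy_features(rare)[:3])
# [0, 0, 0]
```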

### 2.1 Get the beer styles with more than 1,000 reviews

In [11]:
# Count the number of reviews for each beer style in a dict {style: count}
from collections import defaultdict

categoryCounts = defaultdict(int)
for d in data:
    categoryCounts[d['beer/style']] += 1

# Get the styles that appear in more than 1,000 reviews. These are the styles we want to use
categories = [c for c in categoryCounts if categoryCounts[c] > 1000]

# Give each style an ID (starting from 0)
catID = dict(zip(list(categories),range(len(categories))))

In [13]:
catID

{'American Porter': 0,
 'Fruit / Vegetable Beer': 1,
 'English Pale Ale': 2,
 'Rauchbier': 3,
 'American Pale Ale (APA)': 4,
 'Scotch Ale / Wee Heavy': 5,
 'American IPA': 6,
 'Old Ale': 7,
 'American Double / Imperial IPA': 8,
 'American Double / Imperial Stout': 9,
 'Czech Pilsener': 10,
 'Rye Beer': 11,
 'Russian Imperial Stout': 12}

### 2.2 Create feature function and pipeline function

It's convenient to create a pipeline function so that we can decide which features to include.

For our model, we use a regularization constant of C = 10.

In [14]:
# For scaling the review length between 0 and 1
maxLength = max([len(d['review/text']) for d in dataTrain])

In [15]:
# We can decide which features to include in our feature function
def feat(d, includeCat = True, includeReview = True, includeLength = True):
    features = []
    if includeCat:
        # One-hot encoding of the style; styles not in catID stay all zeros
        features = [0] * len(catID)
        if d['beer/style'] in catID:
            features[catID[d['beer/style']]] = 1
    if includeReview:
        # The five ratings
        features += [d['review/appearance'],
                     d['review/aroma'],
                     d['review/overall'],
                     d['review/palate'],
                     d['review/taste']]
    if includeLength:
        # Review length in characters, scaled by the longest training review
        features += [len(d['review/text']) / maxLength]
    # Constant offset term
    return features + [1]

In [16]:
from sklearn import linear_model

# Define pipeline function
def pipeline(reg, includeCat = True, includeReview = True, includeLength = True):
    mod = linear_model.LogisticRegression(C=reg, class_weight='balanced')

    Xtrain = [feat(d, includeCat, includeReview, includeLength) for d in dataTrain]
    Xvalid = [feat(d, includeCat, includeReview, includeLength) for d in dataValid]
    Xtest = [feat(d, includeCat, includeReview, includeLength) for d in dataTest]

    mod.fit(Xtrain,yTrain)
    ypredValid = mod.predict(Xvalid)
    ypredTest = mod.predict(Xtest)

    # validation BER

    TP = sum([(a and b) for (a,b) in zip(yValid, ypredValid)])
    TN = sum([(not a and not b) for (a,b) in zip(yValid, ypredValid)])
    FP = sum([(not a and b) for (a,b) in zip(yValid, ypredValid)])
    FN = sum([(a and not b) for (a,b) in zip(yValid, ypredValid)])

    TPR = TP / (TP + FN)
    TNR = TN / (TN + FP)

    vBER = 1 - 0.5*(TPR + TNR)

    print("C = " + str(reg) + "; validation BER = " + str(vBER))

    # test BER

    TP = sum([(a and b) for (a,b) in zip(yTest, ypredTest)])
    TN = sum([(not a and not b) for (a,b) in zip(yTest, ypredTest)])
    FP = sum([(not a and b) for (a,b) in zip(yTest, ypredTest)])
    FN = sum([(a and not b) for (a,b) in zip(yTest, ypredTest)])

    TPR = TP / (TP + FN)
    TNR = TN / (TN + FP)

    tBER = 1 - 0.5*(TPR + TNR)

    print("C = " + str(reg) + "; test BER = " + str(tBER))

    return mod, vBER, tBER
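
The balanced error rate (BER) computed twice in the pipeline is 1 - 0.5 * (TPR + TNR). As a sketch, the same calculation can be factored into a standalone function; `balanced_error_rate` is a hypothetical helper, not something the notebook defines, and it assumes both classes occur in `y_true` (otherwise a denominator is zero).

```python
def balanced_error_rate(y_true, y_pred):
    """BER = 1 - 0.5 * (true positive rate + true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tpr = tp / (tp + fn)   # fraction of positives predicted positive
    tnr = tn / (tn + fp)   # fraction of negatives predicted negative
    return 1 - 0.5 * (tpr + tnr)

# Toy labels: 2 positives (1 found), 4 negatives (3 found)
y_true = [True, True, False, False, False, False]
y_pred = [True, False, False, False, False, True]
ber = balanced_error_rate(y_true, y_pred)
print(ber)   # 0.375
```

Here TPR = 0.5 and TNR = 0.75, so the BER is 0.375 even though the plain error rate is 2/6; the two classes are weighted equally regardless of their sizes.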

## 3. Train the model

Train a logistic regressor using this one-hot encoding to predict whether beers have an ABV greater than 7 percent (i.e., d['beer/ABV'] > 7).

### 3.1 Using only the style feature to predict beer ABV

We report the BER on the validation and test data.


In [17]:
mod, validBER, testBER = pipeline(10, True, False, False)

C = 10; validation BER = 0.16130237168160533
C = 10; test BER = 0.1607838024608832


### 3.2 Using all three features to predict beer ABV

In [18]:
mod, validBER, testBER = pipeline(10, True, True, True)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


C = 10; validation BER = 0.14173342357610152
C = 10; test BER = 0.14297185466520057
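
The warning printed above names two remedies: raising `max_iter` or scaling the data. In this notebook the ratings enter the model unscaled, while the length is already in [0, 1]. The following is a minimal sketch of min-max scaling fitted on training rows only; `fit_min_max` and `apply_min_max` are hypothetical helpers and the rating rows are made up. Whether this silences the warning here has not been checked.

```python
def fit_min_max(rows):
    """Per-column minimum and maximum, computed from training rows only."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def apply_min_max(rows, lows, highs):
    """Map each column to [0, 1] on the training range (constant columns -> 0)."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Made-up rating rows standing in for the training and validation features
train_ratings = [[2.5, 2.0, 1.5], [4.0, 4.5, 5.0], [3.0, 3.5, 4.0]]
valid_ratings = [[3.25, 3.25, 3.25]]

lows, highs = fit_min_max(train_ratings)
scaled_valid = apply_min_max(valid_ratings, lows, highs)
print(scaled_valid)   # [[0.5, 0.5, 0.5]]
```

Fitting the range on the training rows and reusing it for validation and test mirrors how `maxLength` is taken from `dataTrain` alone.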


### 3.3 Try different values of the regularization constant C

Consider values of C in {0.001, 0.01, 0.1, 1, 10} and report the validation BER for each.

Then decide which value of C to select for the model, and report its performance on the validation and test sets.

In [19]:
for c in [0.001, 0.01, 0.1, 1, 10]:
    pipeline(c, True, True, True)

C = 0.001; validation BER = 0.18963590685390597
C = 0.001; test BER = 0.1948467442774623
C = 0.01; validation BER = 0.14215569058816835
C = 0.01; test BER = 0.14364649970318144




C = 0.1; validation BER = 0.14163189531729137
C = 0.1; test BER = 0.14212756957605366




C = 1; validation BER = 0.1421549703471634
C = 1; test BER = 0.1427898122932576
C = 10; validation BER = 0.14173342357610152
C = 10; test BER = 0.14297185466520057
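
One way to turn the sweep into a selection is to keep the validation BER for each C and take the minimum. The sketch below hard-codes the validation values printed above; the `valid_ber` dictionary is introduced here purely for illustration.

```python
# Validation BERs copied from the output above, keyed by C
valid_ber = {
    0.001: 0.18963590685390597,
    0.01: 0.14215569058816835,
    0.1: 0.14163189531729137,
    1: 0.1421549703471634,
    10: 0.14173342357610152,
}

# Select on validation only; the test set is kept for the final report
best_c = min(valid_ber, key=valid_ber.get)
print(best_c)   # 0.1

# Gap between the best and worst of the four largest values of C
close = [valid_ber[c] for c in (0.01, 0.1, 1, 10)]
print(round(max(close) - min(close), 4))   # 0.0005
```

Every value of C from 0.01 upward lands within about 0.0005 of the minimum, so the choice among them matters little here.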




In [20]:
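# C = 0.1, 1, and 10 give nearly identical validation BERs (all between 0.1416 and 0.1422);
# C = 0.1 is marginally lowest, and we continue with C = 1
bestC = 1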
mod, validBER, testBER = pipeline(bestC, True, True, True)

C = 1; validation BER = 0.1421549703471634
C = 1; test BER = 0.1427898122932576




## 4. Ablation

An ablation study measures the marginal benefit of various features by re-training the model with one feature ‘ablated’ (i.e., deleted) at a time.

We ablate each of the three features in the classifier above (i.e., beer style, ratings, and length) in turn, setting C = 1.

In [22]:
mod, validBER, testBER_noCat = pipeline(bestC, False, True, True)

C = 1; validation BER = 0.300682433496804
C = 1; test BER = 0.3138624152215086


In [23]:
mod, validBER, testBER_noReview = pipeline(bestC, True, False, True)

C = 1; validation BER = 0.1605845486285633
C = 1; test BER = 0.16109632033831978


In [21]:
mod, validBER, testBER_noLength = pipeline(bestC, True, True, False)

C = 1; validation BER = 0.14384635345580388
C = 1; test BER = 0.14747098648986734




The model performs worst when the style feature is removed: the test BER rises to 0.314, versus 0.143 with all three features at C = 1.
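
To compare the three ablations directly, we can subtract the full model's test BER from each ablated run. The sketch below hard-codes the test BERs printed in this notebook for C = 1; the variable names are introduced here for illustration.

```python
# Test BERs copied from the outputs above (all at C = 1)
full_test_ber = 0.1427898122932576
ablated_test_ber = {
    'style': 0.3138624152215086,
    'ratings': 0.16109632033831978,
    'length': 0.14747098648986734,
}

# Marginal benefit of a feature = how much the BER rises when it is removed
increase = {name: ber - full_test_ber for name, ber in ablated_test_ber.items()}

for name, delta in sorted(increase.items(), key=lambda kv: kv[1], reverse=True):
    print(f"without {name}: test BER +{delta:.4f}")
# without style: test BER +0.1711
# without ratings: test BER +0.0183
# without length: test BER +0.0047
```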