In [12]:
!pip install gplearn

import pandas as pd
from gplearn.genetic import SymbolicTransformer
from sklearn.linear_model import Ridge
import numpy as np



# Wine Quality

The Wine Quality datasets [1] cover the red and white variants of the Portuguese "Vinho Verde" wine. This example uses the red variant.

The dataset's attributes are the following:

1.   fixed acidity
2.   volatile acidity
3.   citric acid
4.   residual sugar
5.   chlorides
6.   free sulfur dioxide
7.   total sulfur dioxide
8.   density
9.   pH
10.   sulphates
11.   alcohol
12.   quality (score between 0 and 10)

Attribute 12 (`quality`) will be used as the target for this example.

First, we read the file with the dataset.

In [13]:
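# Fields in this CSV are separated by semicolons
data = pd.read_csv('/content/winequality-red.csv', sep=";")

print(data)

# The last column (quality) is the target; the other 11 columns are the features
target = np.array(data.iloc[:, -1])
data = np.array(data.iloc[:, :-1])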

      fixed acidity  volatile acidity  citric acid  ...  sulphates  alcohol  quality
0               7.4             0.700         0.00  ...       0.56      9.4        5
1               7.8             0.880         0.00  ...       0.68      9.8        5
2               7.8             0.760         0.04  ...       0.65      9.8        5
3              11.2             0.280         0.56  ...       0.58      9.8        6
4               7.4             0.700         0.00  ...       0.56      9.4        5
...             ...               ...          ...  ...        ...      ...      ...
1594            6.2             0.600         0.08  ...       0.58     10.5        5
1595            5.9             0.550         0.10  ...       0.76     11.2        6
1596            6.3             0.510         0.13  ...       0.75     11.0        6
1597            5.9             0.645         0.12  ...       0.71     10.2        5
1598            6.0             0.310         0.47  ...       0.6

## Using a simple Ridge regressor

Without any feature engineering on the attributes, a plain linear regression fits this dataset poorly. Let's see how sklearn's Ridge estimator does when it is trained on the first 1200 samples and scored on the remaining ones.

In [20]:
est = Ridge()
est.fit(data[:1200, :], target[:1200])
print(est.score(data[1200:, :], target[1200:]))


0.27865563012532557


This $R^2$ of about $0.279$ on the held-out samples is our benchmark, and it is quite low.
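
For reference, `score` on a scikit-learn regressor returns the coefficient of determination $R^2$. The sketch below computes the same quantity by hand with NumPy. The helper name `r2_score_manual` and the toy arrays are illustrative only and are not part of this notebook.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy quality scores, made up for illustration only
y_true = np.array([5.0, 5.0, 6.0, 7.0])

# Predicting every value exactly gives the best possible score
perfect = r2_score_manual(y_true, y_true)

# Always predicting the mean of the evaluated targets gives a score of zero
baseline = r2_score_manual(y_true, np.full_like(y_true, y_true.mean()))
```

Read this way, a score near $0.28$ means the model explains only a little over a quarter of the variance in `quality` on the held-out samples.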

## Using the Symbolic Transformer

Now we'll train the symbolic transformer on the same 1200 samples to generate new features.

The transformer evolves a population of 2000 programs (`population_size`) over 20 generations. From the final generation, the 100 fittest programs are kept (`hall_of_fame`), and the 10 least correlated of those (`n_components`) become the new features.
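
To make the "least correlated" selection concrete, here is a minimal sketch of one greedy way to do it: repeatedly find the most correlated pair of candidates and drop the less fit of the two. It illustrates the idea and is not gplearn's actual code. The function name `prune_most_correlated` and the toy candidates are assumptions made for this example.

```python
import numpy as np

def prune_most_correlated(outputs, n_keep):
    """Greedily drop candidates until n_keep remain.

    outputs: array of shape (n_candidates, n_samples), with rows ordered
    from fittest to least fit. Returns the row indices that survive.
    """
    corr = np.abs(np.corrcoef(outputs))   # pairwise |Pearson r| between rows
    np.fill_diagonal(corr, 0.0)           # ignore self-correlation
    keep = list(range(outputs.shape[0]))
    while len(keep) > n_keep:
        sub = corr[np.ix_(keep, keep)]    # correlations among survivors only
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        # Rows are sorted by fitness, so the larger position is the weaker one
        keep.pop(max(i, j))
    return keep

# Four toy candidate features evaluated on 50 samples (illustrative only).
# Rows 0, 1 and 3 are linear functions of each other; row 2 is not.
x = np.linspace(0.0, 1.0, 50)
candidates = np.vstack([x, 2 * x + 1, np.sin(20 * x), -x])

kept = prune_most_correlated(candidates, n_keep=2)
```

Here `kept` ends up as rows 0 and 2: the redundant rescaled and negated copies of row 0 are discarded, which is the behaviour we want from the 100-to-10 reduction.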

In [16]:
function_set = ["add", "sub", "mul", "div", "sqrt", "log",
                "abs", "neg", "inv", "max", "min"]
gp = SymbolicTransformer(generations=20, population_size=2000,
                         hall_of_fame=100, n_components=10,
                         function_set=function_set,
                         parsimony_coefficient=0.0005,
                         max_samples=0.9, verbose=1,
                         random_state=0, n_jobs=3)
gp.fit(data[:1200, :], target[:1200])

    |   Population Average    |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    11.21          0.17389        7          0.54489         0.469925      1.21m
   1     7.93         0.386825        5         0.566093         0.539384      1.25m
   2     7.67         0.436177       15         0.579594         0.536366      1.35m
   3     9.78         0.448376       17         0.615545         0.368574      1.31m
   4    10.23         0.461024       10         0.609621         0.498668      1.26m
   5    13.04         0.474219       15         0.611811         0.508593      1.23m
   6    14.55         0.488978       22         0.616924         0.388375      1.20m
   7    15.13          0.48888       29         0.623092         0.358094      1.11m
   8    14.84         0.495252       14         0.623398         0.409266  

SymbolicTransformer(const_range=(-1.0, 1.0), feature_names=None,
                    function_set=['add', 'sub', 'mul', 'div', 'sqrt', 'log',
                                  'abs', 'neg', 'inv', 'max', 'min'],
                    generations=20, hall_of_fame=100, init_depth=(2, 6),
                    init_method='half and half', low_memory=False,
                    max_samples=0.9, metric='pearson', n_components=10,
                    n_jobs=3, p_crossover=0.9, p_hoist_mutation=0.01,
                    p_point_mutation=0.01, p_point_replace=0.05,
                    p_subtree_mutation=0.01, parsimony_coefficient=0.0005,
                    population_size=2000, random_state=0, stopping_criteria=1.0,
                    tournament_size=20, verbose=1, warm_start=False)
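
The fitted estimator reports `metric='pearson'` and `max_samples=0.9`. As we understand it, the Fitness column in the log is therefore the absolute Pearson correlation between a program's output and the target on a random 90% of the training rows, and OOB Fitness is the same measure on the rows left out. The sketch below illustrates that reading on synthetic data. The helper `abs_pearson`, the random arrays, and the fixed 180/20 split are assumptions for illustration, not gplearn internals.

```python
import numpy as np

def abs_pearson(y_pred, y):
    """Absolute Pearson correlation between a feature and the target."""
    return abs(np.corrcoef(y_pred, y)[0, 1])

rng = np.random.default_rng(0)
y = rng.normal(size=200)                  # synthetic target

# The sign and scale of a feature do not matter for this fitness:
# a negated, shifted copy of the target still scores (almost exactly) 1
score_linear = abs_pearson(-2.0 * y + 3.0, y)

# Mimic in-sample vs out-of-bag evaluation with a 90/10 split of the rows
in_bag, out_of_bag = np.arange(180), np.arange(180, 200)
noisy_feature = y + rng.normal(size=200)  # an imperfect candidate feature
fit_in = abs_pearson(noisy_feature[in_bag], y[in_bag])
fit_oob = abs_pearson(noisy_feature[out_of_bag], y[out_of_bag])
```

Because a linear model can rescale and flip a feature freely, correlation is a natural fitness for features that will be fed to Ridge afterwards.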

We now apply the trained transformer to the whole dataset and append the 10 new features to the 11 original ones.

In [17]:
gp_features = gp.transform(data)
new_data = np.hstack((data, gp_features))
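
As a quick check of what the stacking does to the array shapes, here is a small sketch with placeholder arrays. The names ending in `_demo` are hypothetical stand-ins for `data` and `gp_features`; the sizes come from the 1599 rows printed earlier, the 11 input attributes, and `n_components = 10`.

```python
import numpy as np

# Placeholder arrays with the shapes used in this notebook:
# 1599 wines, 11 original attributes, 10 evolved features
data_demo = np.zeros((1599, 11))
gp_features_demo = np.ones((1599, 10))

# hstack concatenates along the column axis, keeping one row per wine
new_data_demo = np.hstack((data_demo, gp_features_demo))
n_rows, n_cols = new_data_demo.shape
```

The original attributes stay in the first 11 columns, so the augmented matrix can be sliced by row exactly as before.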

A Ridge regressor is then fit on the first 1200 samples of the augmented dataset and scored on the remaining ones.

In [21]:
est = Ridge()
est.fit(new_data[:1200, :], target[:1200])
print(est.score(new_data[1200:, :], target[1200:]))

0.3101965363632304


The score rises from about $0.279$ to about $0.310$. That is a modest gain over the benchmark, but the evolved features do help.

# References

*   [1] Dataset taken from: https://archive.ics.uci.edu/ml/datasets/Wine+Quality. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.