# Language Identification

- https://arxiv.org/pdf/1701.03682.pdf
- https://cs229.stanford.edu/proj2015/324_report.pdf
- https://cs229.stanford.edu/proj2015/324_poster.pdf
- https://sites.google.com/view/vardial2021/home
- http://ttg.uni-saarland.de/resources/DSLCC/
- https://mzampieri.com/publications.html
- https://mzampieri.com/papers/dsl2016.pdf

## Imports

In [7]:
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

## Make workspace

In [8]:
# Make directories if they don't exist
os.makedirs('datasets/DSLCC-v2.0', exist_ok=True)
os.makedirs('models', exist_ok=True)

## Download dataset

We use the [DSLCC v2.0](https://github.com/alvations/bayesmax/tree/master/bayesmax/data/DSLCC-v2.0) dataset from the [DSL Shared Task 2015](http://ttg.uni-saarland.de/lt4vardial2015/dsl.html).

In [9]:
# DSLCC v2.0
if not os.path.exists('datasets/DSLCC-v2.0/train.txt'):
    !cd datasets/DSLCC-v2.0 && curl -o train.txt https://raw.githubusercontent.com/alvations/bayesmax/master/bayesmax/data/DSLCC-v2.0/train-dev/train.txt
if not os.path.exists('datasets/DSLCC-v2.0/devel.txt'):
    !cd datasets/DSLCC-v2.0 && curl -o devel.txt https://raw.githubusercontent.com/alvations/bayesmax/master/bayesmax/data/DSLCC-v2.0/train-dev/devel.txt
if not os.path.exists('datasets/DSLCC-v2.0/test.txt'):
    !cd datasets/DSLCC-v2.0 && curl -o test.txt https://raw.githubusercontent.com/alvations/bayesmax/master/bayesmax/data/DSLCC-v2.0/test/test.txt

## Data

The corpus contains 20,000 instances per language (18,000 training + 2,000 development). Each instance is an excerpt from a journalistic text, contains 20 to 100 tokens, and is tagged with the country of origin of the text. The languages and their corresponding codes are listed in the following table:

<table>
    <tr>
        <th>Group Name</th>
        <th>Language Name</th>
        <th>Language Code</th>
    </tr>
    <tr>
        <td rowspan=2>South Eastern Slavic</td>
        <td>Bulgarian</td>
        <td>bg</td>
    </tr>
    <tr>
        <td>Macedonian</td>
        <td>mk</td>
    </tr>
    <tr>
        <td rowspan=3>South Western Slavic</td>
        <td>Bosnian</td>
        <td>bs</td>
    </tr>
    <tr>
        <td>Croatian</td>
        <td>hr</td>
    </tr>
    <tr>
        <td>Serbian</td>
        <td>sr</td>
    </tr>
    <tr>
        <td rowspan=2>West Slavic</td>
        <td>Czech</td>
        <td>cz</td>
    </tr>
    <tr>
        <td>Slovak</td>
        <td>sk</td>
    </tr>
    <tr>
        <td rowspan=2>Ibero-Romance (Spanish)</td>
        <td>Peninsular Spanish</td>
        <td>es-ES</td>
    </tr>
    <tr>
        <td>Argentinian Spanish</td>
        <td>es-AR</td>
    </tr>
    <tr>
        <td rowspan=2>Ibero-Romance (Portuguese)</td>
        <td>Brazilian Portuguese</td>
        <td>pt-BR</td>
    </tr>
    <tr>
        <td>European Portuguese</td>
        <td>pt-PT</td>
    </tr>
    <tr>
        <td rowspan=2>Austronesian</td>
        <td>Indonesian</td>
        <td>id</td>
    </tr>
    <tr>
        <td>Malay</td>
        <td>my</td>
    </tr>
    <tr>
        <td>Other</td>
        <td>Various Languages</td>
        <td>xx</td>
    </tr>
</table>

In [10]:
train = pd.read_csv('datasets/DSLCC-v2.0/train.txt', sep='\t', names=['sentence', 'language'])
validation = pd.read_csv('datasets/DSLCC-v2.0/devel.txt', sep='\t', names=['sentence', 'language'])
test = pd.read_csv('datasets/DSLCC-v2.0/test.txt', sep='\t', names=['sentence', 'language'])

In [11]:
print(f'Training set size:   {len(train)}')
print(f'Validation set size: {len(validation)}')
print(f'Test set size:       {len(test)}')

Training set size:   236135
Validation set size: 26335
Test set size:       13229


In [12]:
# Print number of instances per label
print(train['language'].value_counts())

es-ES    17990
es-AR    17808
sr       17442
pt-PT    17284
id       17195
mk       17154
bg       17151
xx       16920
bs       16858
hr       16792
pt-BR    16483
cz       16454
sk       16224
my       14380
Name: language, dtype: int64
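
The per-language counts above fall short of the 18,000 training instances given in the corpus description, and the total (236,135 rather than 14 × 18,000 = 252,000) is lower as well. One possible cause, stated here as an assumption and not verified against the files, is quote handling: with default settings a field that starts with `"` is read as quoted, so an unbalanced quote can swallow tabs and newlines and merge several lines into one row. The sketch below reproduces the effect on a made-up three-line sample with the standard-library `csv` module; `pd.read_csv` accepts the same `quoting=csv.QUOTE_NONE` setting.

```python
import csv
import io

# Made-up sample in the corpus layout: sentence <TAB> language code.
# The first line starts with a quote that is never closed on that line.
SAMPLE = (
    '"An unclosed quote at the start\tbg\n'
    'Second sentence\tmk\n'
    'Third "sentence\tsr\n'
)

def count_rows(text, quoting):
    """Parse tab-separated text and return the number of rows found."""
    reader = csv.reader(io.StringIO(text), delimiter='\t', quoting=quoting)
    return len(list(reader))

rows_default = count_rows(SAMPLE, csv.QUOTE_MINIMAL)  # default quote handling
rows_literal = count_rows(SAMPLE, csv.QUOTE_NONE)     # quotes are plain characters

print(f'default quoting: {rows_default} row(s)')
print(f'QUOTE_NONE:      {rows_literal} row(s)')
```

If the row counts change after passing `quoting=csv.QUOTE_NONE` to the three `pd.read_csv` calls, quoting was the cause. The outputs shown in this notebook were produced without that setting.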


In [13]:
train[train['language'] == 'xx'].head()

Unnamed: 0,sentence,language
219215,"Любые действия, направленные на инаугурацию Ви...",xx
219216,"Если мы посмотрим на то, как государство начин...",xx
219217,Ko sem letos po dolgih letih spet stala na vrh...,xx
219218,"Данный центр, расположенный в Эдинбурге (Шотла...",xx
219219,Ito ang paniniwala ni Iloilo Rep. Niel Tupas J...,xx


In [14]:
print(train.head())

                                            sentence language
0  Зад думите “просто искам да се махна от Българ...       bg
1  Сега нещата там леко потръгнаха с усилията на ...       bg
2  Хърватският филм "Пътят на дините" на режисьор...       bg
3  Министърът на правосъдието на РС Джерард Селма...       bg
4  В крайна сметка войната между Севера и Юга дов...       bg


In [15]:
CLASS_UNKNOWN = 'xx'
CLASSES = ['bg', 'mk', 'bs', 'hr', 'sr', 'cz', 'sk', 'es-ES', 'es-AR', 'pt-BR', 'pt-PT', 'id', 'my', CLASS_UNKNOWN]
CLASS_NAMES = [
    'Bulgarian', 'Macedonian', 'Bosnian', 'Croatian', 'Serbian', 'Czech', 'Slovak',
    'Peninsular Spanish', 'Argentinian Spanish', 'Brazilian Portuguese', 'European Portuguese',
    'Indonesian', 'Malay', 'Other'
]
NUM_CLASSES = len(CLASSES)

In [31]:
NUM_CLASSES

14

In [16]:
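# Change all other language codes to xx
def mark_unknown_languages(data):
    # Assign the result back instead of using inplace=True on a column,
    # which may act on a temporary copy in recent pandas versions
    known = data['language'].isin(CLASSES)
    data['language'] = data['language'].where(known, CLASS_UNKNOWN)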
mark_unknown_languages(train)
mark_unknown_languages(validation)
mark_unknown_languages(test)

## Common preprocessing

In [17]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

In [18]:
X_train = train['sentence']
y_train = train['language']
X_validation = validation['sentence']
y_validation = validation['language']

In [19]:
print(X_train.head())
print(y_train.head())

0    Зад думите “просто искам да се махна от Българ...
1    Сега нещата там леко потръгнаха с усилията на ...
2    Хърватският филм "Пътят на дините" на режисьор...
3    Министърът на правосъдието на РС Джерард Селма...
4    В крайна сметка войната между Севера и Юга дов...
Name: sentence, dtype: object
0    bg
1    bg
2    bg
3    bg
4    bg
Name: language, dtype: object


In [20]:
from sklearn.preprocessing import OneHotEncoder

In [37]:
# Use OneHotEncoder for the target variable.
# This is better than get_dummies because all classes are listed explicitly,
# so every class gets a column and the column order follows CLASSES.
# If a language code is unknown, an error is thrown.
# Note: in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`.
target_encoder = OneHotEncoder(categories=[CLASSES], sparse=False, dtype=np.int32)
target_encoder.fit(np.array(CLASSES).reshape(-1, 1))

In [38]:
# target_encoder.transform(np.asarray(y_train[30000:30010]).reshape(-1, 1))
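
To illustrate why a fixed class list matters, here is a minimal NumPy-only sketch of the same idea. It is not the scikit-learn implementation; `DEMO_CLASSES`, `CLASS_TO_COLUMN` and `one_hot` are hypothetical names used only for this example. Every listed class gets a column even if it never occurs in the data, and an unlisted label raises an error instead of silently creating a new column.

```python
import numpy as np

# Hypothetical three-class subset, used only for this illustration
DEMO_CLASSES = ['bg', 'mk', 'xx']
CLASS_TO_COLUMN = {code: column for column, code in enumerate(DEMO_CLASSES)}

def one_hot(labels, class_to_column):
    """Return an (n_labels, n_classes) int32 matrix; KeyError on unknown labels."""
    columns = [class_to_column[label] for label in labels]
    return np.eye(len(class_to_column), dtype=np.int32)[columns]

# 'xx' does not occur in the input but still gets its (all-zero) column
encoded = one_hot(['mk', 'bg', 'mk'], CLASS_TO_COLUMN)
print(encoded)
```

With `pd.get_dummies` on the same three labels, the result would have only two columns, because the set of columns is derived from the values that happen to be present.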

In [21]:
# Create, configure and train a tokenizer 
def get_tokenizer(data, num_words=None):
    tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n“„”–', num_words=num_words, oov_token='<OOV>')
    tokenizer.fit_on_texts(data)
    return tokenizer

In [22]:
# Drop sequences longer than max_length (with their labels), then zero-pad the rest at the end
def preprocess_data(X, y, max_length=None):
    if max_length is not None:
        y = y[[len(x)<=max_length for x in X]]
        X = [x for x in X if len(x)<=max_length]
    X = pad_sequences(X, padding='post')
    return X, y
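
The filtering and padding step can be sketched without Keras as follows. This is an illustrative re-implementation under the assumption that at least one sequence survives the filter; `filter_and_pad` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def filter_and_pad(sequences, labels, max_length):
    """Drop sequences longer than max_length, then zero-pad the rest on the right."""
    kept = [(seq, label) for seq, label in zip(sequences, labels) if len(seq) <= max_length]
    # Pad to the longest kept sequence, which may be shorter than max_length
    width = max(len(seq) for seq, _ in kept)
    padded = np.zeros((len(kept), width), dtype=np.int32)
    for row, (seq, _) in enumerate(kept):
        padded[row, :len(seq)] = seq
    return padded, [label for _, label in kept]

# The four-token sequence exceeds max_length=3 and is removed with its label
X_demo, y_demo = filter_and_pad([[5, 2], [7, 1, 4, 9], [3]], ['bg', 'mk', 'sr'], max_length=3)
print(X_demo)
print(y_demo)
```

Note that the padded width is the length of the longest remaining sequence, so two datasets processed separately can end up with different widths even when the same `max_length` is used.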

## Model

In [23]:
from tensorflow.keras import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, LSTM, GRU, Dropout, Dense
from tensorflow.keras.utils import Sequence

In [24]:
EPOCHS = 20
BATCH_SIZE = 32

In [25]:
def get_model(input_shape, hidden_layer_size, recurrent_dropout_rate=0.0, dropout_rate=0.0, use_lstm=False):
    model = Sequential()
    model.add(InputLayer(input_shape=input_shape))
    if use_lstm:
        model.add(LSTM(hidden_layer_size, recurrent_dropout=recurrent_dropout_rate, name='lstm'))
    else:
        model.add(GRU(hidden_layer_size, recurrent_dropout=recurrent_dropout_rate, name='gru'))
    model.add(Dropout(rate=dropout_rate, name='dropout'))
    model.add(Dense(NUM_CLASSES, activation='softmax', name='softmax'))
    
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

In [26]:
# Plot model training history
def plot_history(history):
    plt.plot(history.epoch, history.history['accuracy'])
    plt.plot(history.epoch, history.history['val_accuracy'])
    plt.legend(['Training accuracy', 'Validation accuracy'])
    plt.xlabel('epochs')
    plt.ylabel('accuracy')
    plt.xticks(history.epoch, np.arange(1, len(history.epoch)+1))

In [27]:
# Save trained model, compilation and training data and history plot
def save_model(model, batch_size, history, model_name=None):
    recurrent_layer = model.get_layer(index=0)
    recurrent_type = recurrent_layer.name
    recurrent_units = recurrent_layer.units
    recurrent_dropout_rate = recurrent_layer.recurrent_dropout
    
    dropout_layer = model.get_layer(index=1)
    dropout_rate = dropout_layer.rate

    epochs = len(history.epoch)
    
    if not model_name:
        model_name = f'model_{epochs}_{batch_size}_{recurrent_type}_{recurrent_units}_{int(100*recurrent_dropout_rate)}_{int(100*dropout_rate)}_{time.strftime("%Y%m%d_%H%M%S")}'
    model_path = f'models/{model_name}'
    
    model.save(model_path)
    with open(f'{model_path}/training.txt', 'w') as f:
        f.write(f'EPOCHS:            \t {epochs}\n')
        f.write(f'BATCH SIZE:        \t {batch_size}\n')
        f.write(f'RECURRENT LAYER:   \t {recurrent_type}\n')
        f.write(f'RECURRENT UNITS:   \t {recurrent_units}\n')
        f.write(f'RECURRENT DROPOUT: \t {recurrent_dropout_rate}\n')
        f.write(f'OUTPUT DROPOUT:    \t {dropout_rate}\n')
        model.summary(print_fn = lambda x: f.write(x + '\n'))
        f.write(f'ACCURACY:     \t {history.history["accuracy"]}\n')
        f.write(f'VAL ACCURACY: \t {history.history["val_accuracy"]}\n')
    plot_history(history)
    plt.title(model_name)
    plt.savefig(f'{model_path}/history.png')
    return model_name

In [28]:
class DataGenerator(Sequence):
    def __init__(self, input_sequences, vocabulary_size, labels, batch_size=32, shuffle=True):
        self.input_sequences = input_sequences
        self.vocabulary_size = vocabulary_size
        # TODO move to __getitem__
        self.labels = target_encoder.transform(np.asarray(labels).reshape(-1, 1))
        self.batch_size = batch_size
        self.shuffle = shuffle
        # TODO check does this get called automatically anyway
        self.on_epoch_end()

    # Number of batches per epoch
    def __len__(self):
        return int(np.floor(len(self.input_sequences) / self.batch_size))

    # Generate one batch
    def __getitem__(self, index):
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        X = to_categorical([self.input_sequences[index] for index in indexes], num_classes=self.vocabulary_size)
        # y = target_encoder.transform(np.asarray([labels[index] for index in indexes]).reshape(-1, 1))
        y = np.asarray([self.labels[index] for index in indexes])
        return X, y

    # Update indexes for next epoch
    def on_epoch_end(self):
        # TODO move to __init__, there is no need to re-arange indexes each epoch
        # they will either always or never be shuffled
        self.indexes = np.arange(len(self.input_sequences))
        if self.shuffle:
            np.random.shuffle(self.indexes)
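
The generator one-hot encodes token indices batch by batch. A rough estimate, using figures from this notebook, suggests why encoding the whole training set at once is impractical. The numbers are an upper bound, since sequences longer than 50 tokens are dropped before training, and the toy vocabulary of 6 at the end is made up for illustration.

```python
import numpy as np

# Figures taken from this notebook; float32 is the default dtype of to_categorical
NUM_SENTENCES = 236_135
SEQUENCE_LENGTH = 50
VOCABULARY_SIZE = 10_000
BYTES_PER_VALUE = np.dtype(np.float32).itemsize

full_bytes = NUM_SENTENCES * SEQUENCE_LENGTH * VOCABULARY_SIZE * BYTES_PER_VALUE
batch_bytes = 32 * SEQUENCE_LENGTH * VOCABULARY_SIZE * BYTES_PER_VALUE
print(f'whole training set: {full_bytes / 1e9:.0f} GB')
print(f'one batch of 32:    {batch_bytes / 1e6:.0f} MB')

# One-hot encoding a tiny batch of token indices (toy vocabulary of 6)
batch = np.array([[1, 4, 0], [2, 2, 5]])
one_hot_batch = np.eye(6, dtype=np.float32)[batch]
print(one_hot_batch.shape)  # (batch size, sequence length, vocabulary size)
```

An embedding layer fed with integer indices would avoid the one-hot tensor altogether; that is a different design from the one used here and is mentioned only as an alternative.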

In [29]:
class PredictGenerator(Sequence):
    def __init__(self, input_sequences, vocabulary_size, batch_size=32):
        self.input_sequences = input_sequences
        self.vocabulary_size = vocabulary_size
        self.batch_size = batch_size

    # Number of batches per epoch
    def __len__(self):
        return int(np.ceil(len(self.input_sequences) / self.batch_size))

    # Generate one batch
    def __getitem__(self, index):
        indexes = np.arange(index*self.batch_size, min((index+1)*self.batch_size, len(self.input_sequences)))
        X = to_categorical([self.input_sequences[index] for index in indexes], num_classes=self.vocabulary_size)
        return X

## Character n-grams

In [30]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from nltk.util import ngrams
from itertools import chain
from tqdm import tqdm
tqdm.pandas()

In [31]:
NGRAMS_MAX_WORDS = {
    2: None,
    3: 20000,
    4: None,
    5: None
}

def sentence_to_char_ngram(sentence, n):
    s = ''.join([c if c not in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n“„”–' else ' ' for c in sentence])
    tokens = text_to_word_sequence(s)
    ngrams_ = [[''.join(ng) for ng in list(ngrams(token, n))] for token in tokens if len(token) >= n]
    return ' '.join(chain.from_iterable(ngrams_))

def transform_to_char_ngrams(X, n):
    X_ngram_train = X.copy()
    print(f'{n} - gramming')
    return X_ngram_train.progress_apply(lambda sentence: sentence_to_char_ngram(sentence, n))

def get_char_ngram_tokenizer(X, n):
    return get_tokenizer(X, num_words=NGRAMS_MAX_WORDS[n])
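
To make the transformation concrete, here is a dependency-free approximation of `sentence_to_char_ngram` applied to a short made-up sentence. `char_ngrams` is a hypothetical stand-in that assumes lowercasing and whitespace splitting, as `text_to_word_sequence` does by default.

```python
# Same punctuation set as the tokenizer filters in this notebook
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n“„”–'

def char_ngrams(sentence, n):
    """Lowercase, strip punctuation, and emit within-word character n-grams."""
    cleaned = ''.join(' ' if c in PUNCTUATION else c for c in sentence.lower())
    grams = []
    for token in cleaned.split():
        # The range is empty for tokens shorter than n, so they contribute nothing
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return ' '.join(grams)

result = char_ngrams('Hello, my world!', 3)
print(result)  # → hel ell llo wor orl rld
```

N-grams never span word boundaries, and the two-letter word `my` is dropped entirely for `n = 3`. The resulting string is then tokenized on spaces, so each n-gram plays the role of a "word".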

In [32]:
max_lengths = {2: 150, 3: 150}
def prepare_n_gram_model(n):
    X_ngram_train_initial = transform_to_char_ngrams(X_train, n)
    n_tokenizer = get_char_ngram_tokenizer(X_ngram_train_initial, n)
    X_ngram_train_tokenized = n_tokenizer.texts_to_sequences(X_ngram_train_initial)
    X_ngram_train, y_ngram_train = preprocess_data(X_ngram_train_tokenized, y_train, max_length=max_lengths[n])

    print()

    # The one-hot width must exceed the largest token index: with num_words set,
    # indices stay below num_words; otherwise they run from 1 to
    # len(word_index) (which includes '<OOV>'), with 0 reserved for padding
    vocabulary_size = n_tokenizer.num_words or len(n_tokenizer.word_index) + 1

    n_model = get_model((X_ngram_train.shape[1], vocabulary_size), hidden_layer_size=1024, dropout_rate=0.45)
    def train_model():
        history = n_model.fit(
            DataGenerator(X_ngram_train, vocabulary_size, y_ngram_train, BATCH_SIZE),
#             validation_data=validation_generator,
            epochs=EPOCHS,    # callbacks=[checkpoint_callback]
        )
        return history

    return n_model, train_model



In [33]:
_, train_2 = prepare_n_gram_model(2)

history_2 = train_2()

2 - gramming


100%|██████████| 236135/236135 [00:19<00:00, 12030.28it/s]


## Word unigrams

### Preprocessing

In [48]:
NUM_UNIQUE_WORDS = 10_000

In [49]:
tokenizer_w1 = get_tokenizer(X_train, NUM_UNIQUE_WORDS)

In [None]:
X_train = tokenizer_w1.texts_to_sequences(X_train)

In [None]:
# Find max word dict index
max([max(x) for x in X_train])

In [None]:
# Find max length of train sequences
max([len(x) for x in X_train])

In [None]:
X_train, y_train = preprocess_data(X_train, y_train, 50)

In [None]:
X_train.shape

In [None]:
tokenizer_w1.num_words

In [None]:
print(len(X_train))
print(len(y_train))

In [None]:
len(tokenizer_w1.word_index)

In [None]:
tokenizer_w1.word_index

In [None]:
len([x for x in tokenizer_w1.word_counts.items() if x[1] > 10])

In [None]:
sorted(tokenizer_w1.word_counts.items(), key=lambda w: w[1], reverse=False)[:150]

In [None]:
X_train.shape

In [None]:
print(X_train[0])

### Model

In [None]:
# train_generator = DataGenerator(X_train, tokenizer_w1.num_words, y_train, batch_size=BATCH_SIZE)
# validation_generator = DataGenerator()

In [None]:
# model = get_model((X_train.shape[1], tokenizer_w1.num_words), HIDDEN_LAYER_SIZE, DROPOUT_RATE)

In [None]:
# checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
#     filepath='models/checkpoint',
#     save_weights_only=True,
#     monitor='val_accuracy',
#     mode='max',
#     verbose=1
# )

# history = model.fit(
#     train_generator,
# #     validation_data=validation_generator,
#     epochs=EPOCHS,
#     batch_size=BATCH_SIZE,
#     callbacks=[checkpoint_callback]
# )

In [None]:
# save_model(model, BATCH_SIZE, history)

## Hyperparameter search

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
def grid_search_model(devel_X_processor=lambda x: x):
    X = devel_X_processor(X_validation.copy())

    devel_train_X, devel_test_X, devel_train_Y, devel_test_Y = train_test_split(
        X, y_validation, train_size=0.75, stratify=y_validation
    )

    tokenizer = get_tokenizer(devel_train_X, 10_000)
    
    devel_train_X = tokenizer.texts_to_sequences(devel_train_X)
    devel_test_X = tokenizer.texts_to_sequences(devel_test_X)

    devel_train_X, devel_train_Y = preprocess_data(devel_train_X, devel_train_Y, 50)
    devel_test_X, devel_test_Y = preprocess_data(devel_test_X, devel_test_Y, 50)

    recurrent_layer_sizes = [768, 1024, 1280]
    dropout_rates = [0.2, 0.25, 0.35, 0.4, 0.45]
    
    results_acc = [[0 for _ in range(len(dropout_rates))] for _ in range(len(recurrent_layer_sizes))]
    results_val_acc = [[0 for _ in range(len(dropout_rates))] for _ in range(len(recurrent_layer_sizes))]

    for i, recurrent_layer_size in enumerate(recurrent_layer_sizes):
        for j, dropout_rate in enumerate(dropout_rates):
#             if not (i == 0 and j == 0):
#                 continue
            print('Training network with params:')
            print(f' - recurrent_layer_size = {recurrent_layer_size}')
            print(f' - dropout_rate      = {dropout_rate}')
            
            devel_train_generator = DataGenerator(devel_train_X, tokenizer.num_words, devel_train_Y, batch_size=BATCH_SIZE)
            devel_test_generator = DataGenerator(devel_test_X, tokenizer.num_words, devel_test_Y, batch_size=BATCH_SIZE)

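            # Pass dropout_rate by keyword: the third positional parameter is recurrent_dropout_rate
            model = get_model((devel_train_X.shape[1], tokenizer.num_words), recurrent_layer_size, dropout_rate=dropout_rate)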
            history = model.fit(
                devel_train_generator,
                validation_data=devel_test_generator,
                epochs=EPOCHS
            )
            model_name = save_model(model, BATCH_SIZE, history)
      
            # Keep the final-epoch values so each grid cell holds a single number
            results_acc[i][j] = history.history["accuracy"][-1]
            results_val_acc[i][j] = history.history["val_accuracy"][-1]
            print(f'Results for {recurrent_layer_size}, {dropout_rate} ({i}, {j}):')
            print(f'accuracy:     {history.history["accuracy"]}')
            print(f'val_accuracy: {history.history["val_accuracy"]}')

    grid_search_results_acc_df = pd.DataFrame(results_acc, index=recurrent_layer_sizes, columns=dropout_rates)
    grid_search_results_val_acc_df = pd.DataFrame(results_val_acc, index=recurrent_layer_sizes, columns=dropout_rates)
    grid_search_results_acc_df.to_csv('grid_search_acc.csv')
    grid_search_results_val_acc_df.to_csv('grid_search_val_acc.csv')

In [None]:
grid_search_model()

In [None]:
recurrent_layer_sizes = [768, 1024, 1280]
dropout_rates = [0.2, 0.25, 0.35, 0.4, 0.45]

In [None]:
grid_search_val_acc = np.asarray([
    [0.8253777623176575, 0.8014002442359924, 0.838083803653717,  0.8221153616905212, 0.8358948230743408],
    [0.8248626589775085, 0.8288934230804443, 0.8365384340286255, 0.8264079689979553, 0.8384562730789185],
    [0.8257211446762085, 0.8265796899795532, 0.8123282790184021, 0.8094093203544617, 0.8167691230773926]
])

In [None]:
plt.imshow(grid_search_val_acc)
plt.colorbar()
plt.yticks(np.arange(len(recurrent_layer_sizes)), recurrent_layer_sizes)
plt.xticks(np.arange(len(dropout_rates)), dropout_rates)
plt.show()

In [None]:
# argmax returns the flat (row-major) index of the best validation accuracy
best_index = np.argmax(grid_search_val_acc)
best_recurrent_layer_size = recurrent_layer_sizes[best_index // len(dropout_rates)]
best_dropout = dropout_rates[best_index % len(dropout_rates)]
print(best_index)
print(best_recurrent_layer_size)
print(best_dropout)
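
The conversion from a flat index to a grid position can also be written with `np.unravel_index`, which avoids the manual division and modulo. The sketch below uses the validation accuracies recorded above, rounded to four decimals for readability.

```python
import numpy as np

recurrent_layer_sizes = [768, 1024, 1280]
dropout_rates = [0.2, 0.25, 0.35, 0.4, 0.45]
# Validation accuracies copied from the grid above, rounded to four decimals
val_acc = np.array([
    [0.8254, 0.8014, 0.8381, 0.8221, 0.8359],
    [0.8249, 0.8289, 0.8365, 0.8264, 0.8385],
    [0.8257, 0.8266, 0.8123, 0.8094, 0.8168],
])

# Convert the flat argmax into (row, column) = (layer size index, dropout index)
row, col = np.unravel_index(np.argmax(val_acc), val_acc.shape)
print(recurrent_layer_sizes[row], dropout_rates[col])  # → 1024 0.45
```

The best cell leads the runner-up (768 units, dropout 0.35) by less than 0.001, so with a single run per configuration the ranking of the top cells should be read with some caution.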

## Ensemble

In [None]:
from sklearn.model_selection import KFold, train_test_split
from sklearn.linear_model import LogisticRegression
from tensorflow.keras.models import load_model

In [None]:
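# Split the raw sentences: X_train and y_train may have been overwritten with
# tokenized, padded data by the word-unigram preprocessing cells above
X_train_model, X_val_model, y_train_model, y_val_model = train_test_split(
    train['sentence'], train['language'], train_size=0.9, stratify=train['language']
)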

### Word unigram

In [None]:
tokenizer_w1 = get_tokenizer(X_train_model, 10_000)

X_train_model = tokenizer_w1.texts_to_sequences(X_train_model)
X_val_model = tokenizer_w1.texts_to_sequences(X_val_model)

X_train_model, y_train_model = preprocess_data(X_train_model, y_train_model, 50)
X_val_model, y_val_model = preprocess_data(X_val_model, y_val_model, 50)

In [None]:
# train the model
model_w1 = get_model((X_train_model.shape[1], tokenizer_w1.num_words), best_recurrent_layer_size, dropout_rate=best_dropout)
history_w1 = model_w1.fit(
    DataGenerator(X_train_model, tokenizer_w1.num_words, y_train_model, batch_size=BATCH_SIZE),
    validation_data=DataGenerator(X_val_model, tokenizer_w1.num_words, y_val_model, batch_size=BATCH_SIZE),
    epochs=EPOCHS
)
save_model(model_w1, BATCH_SIZE, history_w1, model_name='model_w1_final')

In [None]:
# load model_w1
# model_w1 = load_model('models/model_w1_final')
# model_w1.summary()

In [None]:
output_w1 = model_w1.predict(PredictGenerator(X_val_model, tokenizer_w1.num_words, batch_size=BATCH_SIZE))

### Logistic regression

In [None]:
# combine outputs
# TODO add other models
# X_ensemble = pd.DataFrame(np.hstack((output_w1, output_w1)))
X_ensemble = pd.DataFrame(output_w1)
y_ensemble = pd.Series(y_val_model)

In [None]:
print(len(X_ensemble))
print(len(y_ensemble))

In [None]:
example_index = 3
print(X_ensemble.iloc[example_index])
print(y_ensemble.iloc[example_index])

In [None]:
print(y_ensemble.value_counts())

In [None]:
scores = []
N_SPLITS = 5
kf = KFold(n_splits=N_SPLITS)

for i, (train_indexes, test_indexes) in enumerate(kf.split(X_ensemble, y_ensemble)):
    print(f'evaluating split {i+1} of {N_SPLITS}')
    
    X_ensemble_train = X_ensemble.iloc[train_indexes]
    y_ensemble_train = y_ensemble.iloc[train_indexes]
    
    X_ensemble_val = X_ensemble.iloc[test_indexes]
    y_ensemble_val = y_ensemble.iloc[test_indexes]
    
    ensemble = LogisticRegression()
    ensemble.fit(X_ensemble_train, y_ensemble_train)
    
    score = ensemble.score(X_ensemble_val, y_ensemble_val)
    
    scores.append(score)
    print(f'score: {score}')
    
print('all scores:')
print(scores)
print(f'average score: {np.mean(scores)}')

In [None]:
ensemble = LogisticRegression()
ensemble.fit(X_ensemble, y_ensemble)

In [None]:
y_pred = ensemble.predict(X_ensemble)

In [None]:
print(y_pred)

In [None]:
print(ensemble.coef_)
print(ensemble.intercept_)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix(y_ensemble, y_pred)

## Results