# 1. **Define dependencies and constraints**

To download tweets from Twitter, one must first create an account and apply for **developer privileges**. The application grants basic access to the [Twitter API](https://developer.twitter.com/en/docs/twitter-api), which is not enough because it only allows downloading tweets from the last 7 days. Therefore, I applied for the [Premium plan](https://developer.twitter.com/en/support/twitter-api/premium), which allows downloading 25k tweets per month and gives access to the _full archive_ and _30 days_ search APIs, although with a limited number of requests per month.

In [100]:
JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64"
COLLAB_DIR = "/content/"

RANDOM_SEED = 42

# File with Twitter project credentials
CREDENTIALS = '/content/credentials.yaml'
CREDENTIALS_KEY = 'search_tweets_30_day_dev'

# csv file where the downloaded tweets will be saved
DATASET = '/content/dataset.csv'
DATASET_ANNOTATED = '/content/dataset_annotated.csv'
SENTIPOLIC = '/content/sentipolic.csv'

In [101]:
!python --version

Python 3.7.13


### install libraries

In [102]:
!apt-get install libenchant1c2a
!pip install pyenchant
!apt-get install hunspell-it

Reading package lists... Done
Building dependency tree       
Reading state information... Done
libenchant1c2a is already the newest version (1.6.0-11.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 42 not upgraded.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
hunspell-it is already the newest version (1:6.0.3-3).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 42 not upgraded.


In [103]:
!apt install openjdk-8-jdk-headless -qq

import os
os.environ["JAVA_HOME"] = JAVA_HOME

openjdk-8-jdk-headless is already the newest version (8u312-b07-0ubuntu1~18.04).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 42 not upgraded.


In [104]:
!java -version

openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment (build 11.0.15+10-Ubuntu-0ubuntu0.18.04.1)
OpenJDK 64-Bit Server VM (build 11.0.15+10-Ubuntu-0ubuntu0.18.04.1, mixed mode, sharing)


In [105]:
!pip install pyspark==3.2.0
!pip install spark-nlp==3.4.4



In [106]:
!pip install keras-tqdm



### import libraries

In [107]:
# pyspark packages
from pyspark import *
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf
from pyspark.sql import DataFrame
from pyspark.ml import Pipeline

In [108]:
# data processing useful packages
from pyspark.sql.functions import udf, col, lower, trim, regexp_replace, transform
import enchant
from enchant.checker import SpellChecker

In [109]:
# libraries for feature engineering
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.annotator import Tokenizer
from sparknlp.base import LightPipeline

In [110]:
# useful imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import requests
import json
import yaml
import csv
import pdb
import pandas as pd

In [111]:
# python widgets
from ipywidgets import Button
import asyncio
from IPython.display import display, clear_output
import ipywidgets as widgets
from ipywidgets import HBox, Layout
import time as t
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [112]:
# keras 
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM
from keras.callbacks import ModelCheckpoint, EarlyStopping

In [113]:
# sklearn 
from sklearn.model_selection import train_test_split

In [114]:
from keras_tqdm import TQDMNotebookCallback

### PySpark configurations

In [115]:
spark = sparknlp.start(spark32=True)

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

AttributeError: ignored

In [None]:
! cd ~/.ivy2/cache/com.johnsnowlabs.nlp/spark-nlp_2.12/jars && ls -lt

In [None]:
sc = spark.sparkContext

## Download Files from GitHub

In [None]:
!wget https://github.com/deborahdore/italian-sarcastic-tweet-classification/raw/main/dataset/dataset.csv
!wget https://github.com/deborahdore/italian-sarcastic-tweet-classification/raw/main/dataset/other/sentipolic.csv
!wget https://raw.githubusercontent.com/deborahdore/italian-sarcastic-tweet-classification/main/credentials/credentials.yaml
!wget https://raw.githubusercontent.com/deborahdore/italian-sarcastic-tweet-classification/main/dataset/dataset_annotated.csv

In [None]:
# italian dictionary for lemmatization
!wget https://raw.githubusercontent.com/michmech/lemmatization-lists/master/lemmatization-it.txt

# 2. **Retrieve Tweet**


> Some of the following code cells are annotated with *%%script false* to prevent their execution. Those cells download the tweets from Twitter. I have used up the requests at my disposal, so calling the Twitter API will produce an error. Please don't run them, otherwise the saved output of the cells will be lost.



- First, we must retrieve and validate the credentials needed to access the Twitter API. I've stored the bearer token in a YAML file: *credentials.yaml*





In [None]:
def handle_credentials(credentials, key):
  with open(credentials, "r") as stream:
    try:
        credentials = yaml.safe_load(stream)
        return credentials[key]
    except yaml.YAMLError as exc:
        print(exc)

In [None]:
credentials = handle_credentials(CREDENTIALS, CREDENTIALS_KEY)
endpoint = credentials['endpoint'] # we will use this endpoint to search for the tweet
print(endpoint)

- Second we must create the header for the request

In [None]:
def handle_headers(credentials:dict):
  headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {credentials["bearer_token"]}'
  }
  return headers

In [None]:
headers = handle_headers(credentials)
headers

- Another parameter of the request is the query. The query determines which tweets are returned in the response. In our case, we have 2 types of queries: one that searches for sarcastic tweets and one that returns non-sarcastic tweets.

For the sarcastic query, I've chosen some keywords that, in my opinion, are used to express sarcasm and/or irony (sarcasm is a sub-type of irony):


1. sarcasmo (with or without #)
2. ironia (with or without #)
3. "*ridiamo per non piangere*"
4. #coincidenze (.. io non credo) is mostly used to express sarcasm
5. "*qualquadra non cosa*"

Many studies also suggest that sarcasm can be found in tweets related to politics. Therefore, these seem like very good starting points:
1. monti, draghi, berlusconi (well-known Italian prime ministers)
2. governo
3. premier


For non-sarcastic tweets, I've excluded all the possible words that may refer to sarcasm.

The list of operators used can be found in the [Twitter API documentation](https://developer.twitter.com/en/docs/twitter-api/enterprise/rules-and-filtering/operators-by-product).
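
As a rough sketch of how such query strings are put together, the snippet below joins the wanted keywords with `OR` inside one parenthesised group and prefixes the excluded terms with `-`. The helper names `quote_if_phrase`, `build_inclusive_query` and `build_exclusive_query` are hypothetical and exist only for illustration; the filters `lang:it` and `-has:media` are the ones used in this notebook.

```python
def quote_if_phrase(term: str) -> str:
    """Wrap multi-word terms in double quotes so they match as exact phrases."""
    return f'"{term}"' if " " in term else term


def build_inclusive_query(terms, filters=("lang:it", "-has:media")):
    """Match tweets containing at least one of the terms (hypothetical helper)."""
    group = " OR ".join(quote_if_phrase(t) for t in terms)
    return f"({group}) " + " ".join(filters)


def build_exclusive_query(terms, filters=("lang:it", "-has:media")):
    """Match tweets containing none of the terms (hypothetical helper)."""
    negated = " ".join("-" + quote_if_phrase(t) for t in terms)
    return f"{negated} " + " ".join(filters)


sarcasm_terms = ["#sarcasmo", "sarcasmo", "ridiamo per non piangere"]
print(build_inclusive_query(sarcasm_terms))
print(build_exclusive_query(["sarcasmo", "ironia"]))
```

Building the string from a list keeps the keywords easy to review and avoids stray whitespace inside the query.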

In [None]:
sarcasm_query = '(#sarcasmo OR sarcasmo OR #ironia OR ironia OR "ridiamo per non piangere" \
                  OR #coincidenze OR "qualquadra non cosa" OR draghi OR monti OR berlusconi \
                  OR governo OR premier) lang:it -has:media'

non_sarcasm_query = '-"ridiamo per non piangere" -sarcasmo -ironia -"qualquadra non cosa" lang:it -has:media'

- Now we can define the function that will handle the request and the DataFrame where the tweets will be stored.


> Other parameters that we need in order to process the request are:
- *max_result_per_page*: the maximum number of tweets per call
- *next_token*: a token that, if passed to the request, returns the next page of results
- *max_num_of_request*: a parameter I've defined that stops the download once the desired number of calls has been reached. This is necessary because the requests at our disposal are not unlimited, so we must be careful about how many we make
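
The interplay between the page token and the request budget can be sketched without touching the network. In the snippet below, `fetch_all` and `FAKE_PAGES` are hypothetical names used only for illustration: `fetch_page` stands in for a real API call and is assumed to return a dict with a `results` list and, when more pages exist, a `next` token.

```python
def fetch_all(fetch_page, max_num_of_request=20):
    """Collect results page by page until the token runs out or the budget is spent.

    `fetch_page` is a hypothetical callable: it takes a token (or None for the
    first page) and returns a dict with "results" and, optionally, "next".
    """
    collected, next_token, calls = [], None, 0
    while calls < max_num_of_request:
        page = fetch_page(next_token)
        calls += 1
        collected.extend(page.get("results", []))
        next_token = page.get("next")
        if next_token is None:  # last page reached
            break
    return collected, calls


# Fake three-page API used only to exercise the loop (no network involved).
FAKE_PAGES = {
    None: {"results": [1, 2], "next": "t1"},
    "t1": {"results": [3, 4], "next": "t2"},
    "t2": {"results": [5]},
}

tweets_sample, calls = fetch_all(FAKE_PAGES.get, max_num_of_request=2)
print(tweets_sample, calls)
```

With a budget of 2 the loop stops after the second page even though a third one exists. This iterative form is one possible alternative to a recursive download function: it behaves the same way but does not grow the call stack with the number of pages.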




In [None]:
def handle_request(endpoint, headers, query, max_result_per_page, next_token = None):
  
  if next_token is not None:
    payload = json.dumps({
      "maxResults": max_result_per_page,
      "query": query,
      "next": next_token
    })
  else:
    payload = json.dumps({
      "maxResults": max_result_per_page,
      "query": query,
    })
  
  response = requests.post(endpoint, headers=headers, data=payload)

  return response.text

In [None]:
def extract_tweet(response, label):
  tweets = []
  json_response = json.loads(response)
  
  if 'results' in json_response:
    results = json_response["results"]

    for tweet in results:
      # is tweet a retweet?
      if 'retweeted_status' in tweet:
        if tweet['retweeted_status']['truncated']:
          text = tweet['retweeted_status']['extended_tweet']['full_text']
        else:
          text = tweet['retweeted_status']['text']
      else:
        if tweet['truncated']:
          text = tweet['extended_tweet']['full_text']
        else:
          text = tweet['text']
        
      text = text.replace('"', "'")
      data = Tweet(tweet["id"], f"{text}", label)
      
      tweets.append(data)

  else:
    print("Request went wrong")
    print(response)

  return tweets
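
The text-selection logic of `extract_tweet` can be tried in isolation on hand-written payloads. This is a minimal sketch: `full_text` is a hypothetical helper, and the sample dictionaries only imitate the fields the function reads (`truncated`, `text`, `extended_tweet`, `retweeted_status`); they are not real API responses.

```python
def full_text(tweet: dict) -> str:
    """Return the untruncated text of a tweet payload, following retweets."""
    # For a retweet, the original tweet carries the complete text.
    source = tweet.get("retweeted_status", tweet)
    if source.get("truncated"):
        return source["extended_tweet"]["full_text"]
    return source["text"]


# Hand-written sample payloads (not real API responses).
plain = {"truncated": False, "text": "ciao"}
long_tweet = {"truncated": True, "text": "un testo mol...",
              "extended_tweet": {"full_text": "un testo molto lungo"}}
retweet = {"truncated": False, "text": "RT @x: un testo mol...",
           "retweeted_status": long_tweet}

for sample in (plain, long_tweet, retweet):
    print(full_text(sample))
```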

In [None]:
def download_tweet(endpoint, 
                   headers, 
                   query, 
                   label,
                   max_result_per_page,
                   tweet_list,
                   next_token = None, 
                   max_num_of_request = 20):

  if max_num_of_request <= 0:
    return tweet_list

  response = handle_request(endpoint, headers, query, max_result_per_page, next_token)

  tweet_list.extend(extract_tweet(response, label))

  try:
      next_token = json.loads(response)['next']
  except (KeyError, TypeError, ValueError):
      next_token = None

  if next_token is not None:
      return download_tweet(endpoint, headers, query, label, max_result_per_page,
                   tweet_list, next_token, max_num_of_request - 1)
  else:
      return tweet_list

In [None]:
# define tweet
Tweet = Row("id", "text", "sarcastic")

In [None]:
tweets = []

In [None]:
%%script false

# download sarcastic tweet
tweets = download_tweet(endpoint, 
                   headers, 
                   sarcasm_query, 
                   "Yes",
                   100,
                   [],
                   next_token = None, 
                   max_num_of_request = 40)

In [None]:
%%script false

# download non-sarcastic tweet
tweets.extend(
    download_tweet(endpoint, 
                   headers, 
                   non_sarcasm_query, 
                   "No",
                   100,
                   [],
                   next_token = None, 
                   max_num_of_request = 40))

In [None]:
%%script false
# create DataFrame
df = spark.createDataFrame(tweets)

In [None]:
%%script false
df.show(10, truncate=False)

In [None]:
%%script false

# create file
if not os.path.exists(DATASET):
  os.mknod(DATASET)

# save tweets
df.toPandas().to_csv(DATASET, header=True, index=False) 

# 3. **Annotate Tweet**

When we download tweets using a hashtag, we cannot be 100% sure that what we downloaded is correctly labelled. We must review at least the majority of the tweets to check the labels. Here is a little tool to help us with that.

In [None]:
Tweet = Row("id", "text", "sarcastic")

schema = StructType([StructField("id", StringType(), True)\
                   ,StructField("text", StringType(), True)\
                   ,StructField("sarcastic", StringType(), True)])

df = spark.createDataFrame(pd.read_csv(DATASET), schema=schema)
df.show(10)

In [None]:
def count_label(df, numeric=False):
  label_yes = 1 if numeric else "Yes"
  label_no = 0 if numeric else "No"
  return df.groupBy("sarcastic").agg(
      count(when(col("sarcastic") == label_yes, 1)),
      count(when(col("sarcastic") == label_no, 1)))

In [None]:
# count tweet
print(f'Total number of tweet retrieved {df.count()}')

In [None]:
# we want first to drop duplicates

print("Count before drop:")
count_label(df).show()

count_before_drop = df.count()
df = df.dropDuplicates(["text"])
print(f"Distinct count: {str(df.count())} \n")

print("Count after drop:")
count_label(df).show()

In [None]:
print(f'dropped {count_before_drop-df.count()} rows')
print(f'total count: {df.count()}')

In [None]:
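# visually
# index the counts by label so the chart does not depend on the row order
data = {row["sarcastic"]: row for row in count_label(df).collect()}

labels = ['sarcastic', 'non sarcastic']
colors = sns.color_palette('pastel')[0:5]

plt.pie([int(data["Yes"][1]), int(data["No"][2])], labels = labels, colors = colors, autopct='%.0f%%')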
plt.show()

In [None]:
tweets_annotated = []

In [None]:
def wait_for_change(widget1, widget2): 
    future = asyncio.Future()
    def getvalue(change):
        future.set_result(change.description)
        widget1.on_click(getvalue, remove=True)
        widget2.on_click(getvalue, remove=True) 
    widget1.on_click(getvalue)
    widget2.on_click(getvalue)
    return future

async def f(df):
  df_pandas = df.toPandas()
  for index, row in df_pandas.iterrows():
    print(f'Is this tweet sarcastic? \n {row.text} \n', flush=True)

    x = await wait_for_change(sarcastic,non_sarcastic)
    
    if x == "Yes":
      print("Tagged ", row.id, "with sarcastic \n")
      data = Tweet(row.id, row.text, "Yes")
      tweets_annotated.append(data)
    else:
      print("Tagged ", row.id, "with non-sarcastic \n")
      data = Tweet(row.id, row.text, "No")      
      tweets_annotated.append(data)

    clear_output()
    display(HBox([sarcastic,non_sarcastic]))

Before going forward, we want to ask ourselves: *How can we know if a tweet is sarcastic or not?*

*In Harry Potter and the Half Blood Prince, there is a scene where Harry is leaving the Weasley house and Mrs. Weasley says: “Promise me you will look after yourself…stay out of trouble….” Harry responds: “I always do Mrs. Weasley. I like a quiet life, you know me.” Anyone familiar with Harry Potter knows that his life is far from quiet, and so he must not really mean what he is saying. In fact, Harry is being sarcastic.*

[source](https://kids.frontiersin.org/articles/10.3389/frym.2018.00056)

Sarcasm is the use of words that say the opposite of what you really mean, often as a joke and with a tone of voice that shows this. It is often used to mock or criticize someone, to express disapproval, or as a defence mechanism.

For example:
> *Noi invece ce la caviamo con un grado in meno ai termosifoni d'inverno e spegnendo i condizionatori d'estate. Non è fantastico? (#Draghi è un cialtrone sesquipedale, nel caso aveste ancora qualche dubbio)*

Here we can imagine the sarcastic tone of the writer. He's obviously criticising the Italian prime minister, Mario Draghi, who said during an interview that we must make sacrifices, like lowering the radiators by one degree, in order to cope with the possibility of no longer receiving gas from Russia. Obviously, this won't be enough. *Isn't this great?*

Sometimes sarcasm is difficult to recognize even for a human, therefore I don't expect the following dataset to be 100% free from bias.

In [None]:
# tool used for annotation: it displays each tweet and the user has to click "Yes" 
# if the tweet was sarcastic, "No" otherwise

sarcastic=Button(description="Yes", button_style='info', layout=Layout(width='150px', height='50px'))
non_sarcastic=Button(description="No", button_style='info', layout=Layout(width='150px', height='50px'))

asyncio.create_task(f(df))
t.sleep(2)
display(HBox([sarcastic,non_sarcastic]))

In [None]:
%%script false
print(tweets_annotated)

In [None]:
%%script false
df_annotated = spark.createDataFrame(tweets_annotated)
df_annotated.tail(5)

In [None]:
%%script false
if not os.path.exists(DATASET_ANNOTATED):
  os.mknod(DATASET_ANNOTATED)

# save tweets
df_annotated.toPandas().to_csv(DATASET_ANNOTATED, header=True, index=False) 

# 4. **Extend Dataset**

In [None]:
schema = StructType([StructField("id", StringType(), True)\
                   ,StructField("text", StringType(), True)\
                   ,StructField("sarcastic", StringType(), True)])

df_annotated = spark.createDataFrame(pd.read_csv(DATASET_ANNOTATED), schema=schema)

In [None]:
print(f"Annotated tweets: {df_annotated.count()}")

As the counts below show, we lost several *tweets*.
First of all, many tweets labelled as sarcastic were not actually sarcastic. I've also dropped every tweet that contained only one word, that wasn't actually in Italian, or that made no sense.

In [None]:
count_label(df_annotated).show()

However, we can integrate some external datasets such as [SENTIPOLC](http://www.di.unito.it/~tutreeb/sentipolc-evalita16/index.html) from the EVALITA 2016 challenge, which contains several Italian tweets that are already classified.

In [None]:
df_sentipolic = spark.createDataFrame(pd.read_csv(SENTIPOLIC))

In [None]:
df_sentipolic.show(10)

In [None]:
# we will extract only the tweets which are ironic since we have plenty non-ironic
df_sentipolic = df_sentipolic.filter(col("iro")==1)

In [None]:
print(f"Ironic tweet retrieved: {df_sentipolic.count()}")

In [None]:
# drop columns that we don't need
df_sentipolic = df_sentipolic.drop(*('subj', 'opos', 'oneg', 'lpos', 'lneg', 'top'))

# rename columns
df_sentipolic = df_sentipolic.withColumnRenamed("idTwitter", "id")\
                              .withColumnRenamed("iro", "sarcastic")

# change order
df_sentipolic = df_sentipolic.select("id", "text", "sarcastic")

In [None]:
df_sentipolic.show(10)

In [None]:
# now we want to join the two dataset. However we must use the same label for both.
# Therefore if the tweet is sarcastic, the label will be 1, 0 otherwise.


df_annotated = df_annotated.withColumn("sarcastic", 
                                         when(df_annotated.sarcastic == "Yes", 1)
                                         .when(df_annotated.sarcastic == "No", 0)                                    
                                         .otherwise(df_annotated.sarcastic))

In [None]:
df_annotated.show()

In [None]:
# concatenate DataFrames

df_complete = df_annotated.union(df_sentipolic)
df_complete.show(5)

In [None]:
print(f'Now we have a total of {df_complete.count()} tweets')

In [None]:
count_label(df_complete, numeric=True).show()

The dataset is still unbalanced, but better than before.

# 5. **Data Processing**

First we want to clean the tweets: remove hashtags, links, emoji, extra whitespace, and mentions.
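
Before applying the steps on Spark columns, it can help to see their combined effect on a single string. The sketch below roughly mirrors the cleaning steps of this section using only the standard library `re` module; `clean_tweet` and the sample tweet are hypothetical, and the patterns are an approximation rather than the exact expressions used in the cells that follow.

```python
import re


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps in order using only the standard library."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)       # links
    text = re.sub(r"@\w+", "", text)          # mentions
    text = text.replace("#", "")              # hashtag symbol, keep the word
    text = re.sub(r"\brt\b", "", text)        # retweet marker (already lowercase)
    text = re.sub(r"[^a-z\s]", "", text)      # anything but ASCII letters and whitespace
    text = re.sub(r"\s+", " ", text).strip()  # collapse and trim whitespace
    return text


sample = "RT @utente: Che #ironia! Guarda qui https://t.co/abc 😂"
print(clean_tweet(sample))
```

Note that keeping only ASCII letters also drops accented characters, which are common in Italian, so a word like "perché" becomes "perch".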

### Convert to lowercase

In [None]:
df_lowercase = df_complete.withColumn('text', lower(col('text')))
df_lowercase.show(5)

### Remove Links

In [None]:
df_links = df_lowercase.withColumn('text', regexp_replace('text', r'http\S+', ''))
df_links.show(5)

### Remove mentions

In [None]:
df_mentions = df_links.withColumn('text', regexp_replace('text', r'@\w+', ''))
df_mentions.show(5)

### Remove hashtag, keeping the word

In [None]:
df_hashtag = df_mentions.withColumn('text', regexp_replace('text', '#', ''))
df_hashtag.show(5)

### Remove RT symbol

In [None]:
# the text is already lowercase at this point, so match "rt" as a whole word
df_RT = df_hashtag.withColumn('text', regexp_replace('text', r'\brt\b', ''))
df_RT.show(5)

### Remove punctuation

In [None]:
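# keep only ASCII letters and whitespace; note that this also strips digits and accented letters (e.g. è, à)
df_punctuation = df_RT.withColumn('text', regexp_replace('text', '[^a-zA-Z\\s]', ''))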
df_punctuation.show(5)

### Remove new line symbol

In [None]:
# replace line breaks with a space so that adjacent words are not glued together
df_new_line = df_punctuation.withColumn('text', regexp_replace('text', r'[\r\n]+', ' '))
df_new_line.show(5)

### Remove emoji

In [None]:
df_emoij = df_new_line.withColumn('text', regexp_replace('text', r"[^\x00-\x7F]+" , ''))
df_emoij.show(5)

### Remove Digits

In [None]:
df_digit = df_emoij.withColumn('text', regexp_replace('text', r'[0-9]{5,}', ''))
df_digit.show(5)

### Spell Checker

When annotating the tweets, I noticed that many of them contained spelling errors. It is advisable to correct those tweets before training the model.
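
The idea of replacing each unknown word with its best suggestion can be illustrated without a dictionary backend. The sketch below swaps the real spell checker for `difflib.get_close_matches` from the standard library and a tiny hypothetical vocabulary, so it only shows the principle; `VOCABULARY`, `correct_word` and `correct_text` are names invented for this example.

```python
import difflib

# Tiny hypothetical vocabulary; a real run would rely on a full Italian dictionary.
VOCABULARY = ["governo", "ironia", "sarcasmo", "premier"]


def correct_word(word, vocabulary=VOCABULARY):
    """Return the word if known, else the closest vocabulary entry, else the word unchanged."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else word


def correct_text(text):
    """Correct a text word by word, leaving unknown words without a close match untouched."""
    return " ".join(correct_word(w) for w in text.split())


print(correct_text("il goveno fa ironia"))
```

The high `cutoff` keeps short words such as "il" or "fa" from being replaced by unrelated entries, which is also a risk to keep in mind when always accepting the first suggestion of a spell checker.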

In [None]:
broker = enchant.Broker()
broker.describe()
broker.list_languages()

In [None]:
def spell_checker(text):
  checker = SpellChecker("it_IT", text)
  for err in checker:
    suggestions = err.suggest()
    if suggestions:
      # replace the misspelled word with the top suggestion
      err.replace(suggestions[0])
  return checker.get_text()

In [None]:
udf_spell_checker = udf(lambda x: spell_checker(x), StringType())
df_spell = df_digit.withColumn('text', udf_spell_checker(col('text')))

df_spell.cache()

df_spell.show(5)

### Remove excess whitespace

In [None]:
print("a. Trimming")
df_trimming = df_spell.withColumn('text', trim(col('text')))
df_trimming.show(5, truncate=False)

print("b. Filter out extra whitespaces")
df_cleaned = df_trimming.withColumn('text', regexp_replace(col("text"), " +", " "))

df_cleaned.show(5, truncate=False)

## Result

In [None]:
df = df_cleaned.select([col('text'), col('sarcastic')])

df.cache()
df.show(5, truncate=False)

df_spell.unpersist()

# 6. **Feature Engineering**

In [None]:
print("Starting feature engineering, constructing pipeline..")

## Document assembler
Each annotator in Spark NLP takes specific sorts of columns and produces new columns of a different type. We have the following types in Spark NLP: document, token, chunk, pos, word embeddings, date, entity, sentiment, named entity, dependency, labeled dependency.

To implement the solution in Spark NLP, we must first transform raw data into Document type. DocumentAssembler() is a special transformer that builds the initial annotation of type Document that annotators can utilize later on.

In [None]:
document_assembler = DocumentAssembler()\
                        .setInputCol('text')\
                        .setOutputCol('document')\
                        .setCleanupMode("shrink")

## Tokenizer
Tokenization is the process of breaking raw text into smaller pieces, typically words, known as tokens. These tokens help in understanding the context and in building the NLP model. Tokenization also aids in determining the meaning of the text by evaluating the word sequence.
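
As a minimal illustration of the idea, the snippet below splits a sentence into word tokens and standalone punctuation marks with a regular expression. It is a simplistic stand-in, not the rule set applied by the Spark NLP `Tokenizer`; the function name `tokenize` is hypothetical.

```python
import re


def tokenize(text: str) -> list:
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize("Non è fantastico? Certo, come no."))
```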

In [None]:
tokenizer = sparknlp.annotator.Tokenizer().setInputCols(["document"]).setOutputCol("token")

## Lemmatizer
Lemmatization is a technique for reducing words to their normalized form. The transformation of lemmatization employs a dictionary to map distinct versions of a word back to its base format. So, using this method, we may reduce non-trivial inflections like "is," "was," and "were" down to the root "be."
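
Dictionary-based lemmatization boils down to a lookup from an inflected form to its base form. The sketch below assumes a small hand-made mapping (`LEMMA_DICT`, hypothetical entries for illustration only) and leaves unknown tokens unchanged.

```python
# Hypothetical form -> lemma entries, for illustration only.
LEMMA_DICT = {
    "sono": "essere",
    "era": "essere",
    "governi": "governo",
    "ridiamo": "ridere",
}


def lemmatize(tokens, lemma_dict=LEMMA_DICT):
    """Replace each token by its lemma; unknown tokens are left unchanged."""
    return [lemma_dict.get(tok, tok) for tok in tokens]


print(lemmatize(["ridiamo", "per", "non", "piangere"]))
```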

In [None]:
lemma = Lemmatizer()\
     .setInputCols(['token'])\
     .setOutputCol('lemma')\
     .setDictionary("lemmatization-it.txt", "->", "\t")

## Stopwords cleaner
Removes from the text the stopwords that are not useful for our goal.

In [None]:
stopwords_cleaner = StopWordsCleaner.pretrained("stopwords_it", "it")\
     .setInputCols(['lemma'])\
     .setOutputCol('clean_lemma')

## Word Embedding using BERT pretrained for italian language
Word Embedding is a method that involves representing a word with a vector. The BERT model was used to construct these embeddings in the code below since it provides embeddings that allow us to have numerous vector representations for the same word depending on the context in which the word is used. BERT embeddings are thus context-dependent.

In [None]:
embeddings = BertEmbeddings.pretrained("bert_base_italian_cased", "it") \
      .setInputCols(["document", "clean_lemma"]) \
      .setOutputCol("embeddings")

In [None]:
sentence_embeddings = SentenceEmbeddings()\
                        .setInputCols(["document", "embeddings"])\
                        .setOutputCol("sentence_embeddings")
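
The sentence-embedding stage collapses the per-token vectors of a document into a single vector. Assuming an averaging pooling strategy (an assumption made for this example, to be checked against the Spark NLP documentation), the operation can be sketched in plain Python as follows; `average_pool` and the toy vectors are hypothetical.

```python
def average_pool(token_vectors):
    """Average equally sized token vectors component by component."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors)
            for i in range(dim)]


# Three toy 4-dimensional "token embeddings" (real BERT vectors are much larger).
toy_tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
print(average_pool(toy_tokens))  # [2.0, 1.0, 2.0, 2.0]
```

The result has the same dimensionality as each token vector, so every tweet ends up with a fixed-size representation regardless of its length.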

In [None]:
embeddingsFinisher = EmbeddingsFinisher() \
                      .setInputCols("sentence_embeddings") \
                      .setOutputCols("finished_sentence_embeddings") \
                      .setOutputAsVector(True) \
                      .setCleanAnnotations(False)

## Fitting pipeline

In [None]:
pipeline = Pipeline(stages=[document_assembler,
                            tokenizer,
                            lemma,
                            stopwords_cleaner,
                            embeddings,
                            sentence_embeddings,
                            embeddingsFinisher
                            ])

In [None]:
%%time
features = pipeline.fit(df)

In [None]:
embeddings = features.transform(df)

In [None]:
embeddings.cache()
print("word embeddings")
embeddings.select('embeddings').show(5, truncate=False)
print("sentence embeddings")
embeddings.select('sentence_embeddings').show(5, truncate=False)
print("finisher")
embeddings.select('finished_sentence_embeddings').show(5, truncate=False)

In [None]:
df.unpersist()

# 7. **Training the model**

In [None]:
df_pandas = embeddings.to_pandas_on_spark()
df_pandas.head(5)

In [None]:
embeddings.unpersist()

In [92]:
features = df_pandas['finished_sentence_embeddings']
target = df_pandas['sarcastic']

In [93]:
X_train, X_test, y_train, y_test = train_test_split(features,
                                                    target,
                                                    test_size=0.33,
                                                    random_state=42,
                                                    shuffle=True)

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-93-af41aa40cbd9>", line 4, in <module>
    random_state=42)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_split.py", line 2417, in train_test_split
    arrays = indexable(*arrays)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 378, in indexable
    check_consistent_length(*result)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 329, in check_consistent_length
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 329, in <listcomp>
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 267, in _num_

KeyboardInterrupt: ignored

In [None]:
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(50, 1)))
model.add(Dropout(0.5))
model.add(LSTM(units=50))
model.add(Dropout(0.5))
model.add(Dense(12))
model.compile(optimizer='adam', loss="mse", metrics=['mse', 'mae', 'mape'])

# callback stops the training when the val_loss is increasing
callback = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# fit the model with a validation dataset
# NOTE: X_train (as returned by train_test_split) must first be converted to a NumPy array whose shape matches input_shape above
base_history = model.fit(X_train[:, 1:].astype('float64'), y_train, epochs=30, batch_size=64, verbose=2,
                          validation_split=0.2, callbacks=[callback])