# ICE 5: Text Classification (Sequential Data)

With new tools for generating text, I've been thinking about inaccurate news and how to detect it automatically. Given the volume of news published these days, manually vetting every story is not feasible, so the ability to flag inaccurate stories automatically would be critical.

To this end, I found a dataset of real vs. fake news. To make the task a bit more difficult, I've downsampled the dataset dramatically, leaving you with 100 training observations. The tasks I'd like you to complete with this dataset are listed below.

For consistency, I also created validation and test splits. All three files can be found on Canvas under the datasets with the following names:

* `FakeNews_Train.csv`
* `FakeNews_Test.csv`
* `FakeNews_Val.csv`

## Tasks
1. Train an LSTM (including the embedding layer) to predict real vs fake news.
2. Train an LSTM using pre-trained GloVe embeddings to predict real vs fake news.
3. (Bonus) How little training data do you need to achieve an "acceptable" accuracy? This may require grabbing some more test data.

I created the fake news stories with ChatGPT. In short, I gave it specific scenarios and assigned it names to write about.

In [2]:
import pandas as pd
train = pd.read_csv('FakeNews_Train.csv')
val = pd.read_csv('FakeNews_Val.csv')
test = pd.read_csv('FakeNews_Test.csv')
created_test = pd.read_csv('news_stories_created_test_set.csv')
train.head()

Unnamed: 0,title,text,date,Label,BinLabel
0,MARIA BARTIROMO Gets Into Heated Interview Wit...,The DNC Chair Tom Perez took his delusional an...,8-Nov-17,False,0
1,WATCH: Watergate Reporter Carl Bernstein HAMM...,Legendary investigative reporter Carl Bernstei...,10-Jun-17,False,0
2,Biggest Leak EVER Exposes How The 1% Dodges M...,A secret cache of documents revealing the tax ...,4-Apr-16,False,0
3,[VIDEO] EMBOLDENED BY OBAMA’S LAWLESS AMERIC...,Obama has encouraged this type of behavior wit...,3-Sep-15,False,0
4,ELLEN Just Proved She’s A HUGE Hypocrite And...,Someone needs to educate Ellen DeGeneres on Pr...,5-May-17,False,0


In [3]:
train['text'][3]

'Obama has encouraged this type of behavior with his divisive rhetoric, as he proves himself to be a true disciple of the Reverend Wright s church of Black Liberation Theology (hate against whites) After a month that saw law enforcement officers from Texas to New Orleans being targeted and gunned down because they were cops, a new viral video features a self-described black supremacist calling for more deadly assaults on police.https://youtu.be/brX0XcmtVUY It s open season on killing white people and crackers,  King Noble says in a chilling Youtube rant.  The Black Lives Matter movement wasn t enough. Noble calls for more execution-style killings of police officers similar to the slaying of Texas sheriff s Deputy Darren Goforth. It s not safe no more to be white in America. Lurking behind any corner could be an angry black man ready to take yo ass out. It s a reality,  he said.  It s open season on killing whites and police. Via: Breitbart News'

In [4]:
print("Train shape:", train.shape, "Val shape: ", val.shape, "test shape:", test.shape)
train = pd.concat([train[['text','BinLabel']], test[['text','BinLabel']]], axis=0)
val = val[['text','BinLabel']]
test = created_test[['text','BinLabel']]

train.shape

Train shape: (100, 5) Val shape:  (1000, 5) test shape: (1000, 5)


(1100, 2)

In [5]:
train.head()

Unnamed: 0,text,BinLabel
0,The DNC Chair Tom Perez took his delusional an...,0
1,Legendary investigative reporter Carl Bernstei...,0
2,A secret cache of documents revealing the tax ...,0
3,Obama has encouraged this type of behavior wit...,0
4,Someone needs to educate Ellen DeGeneres on Pr...,0


In [6]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [7]:
max_words = 10000
max_length = 650

tokenizer = Tokenizer(num_words = max_words, oov_token = '<OOV>')
tokenizer.fit_on_texts(train['text'])

In [8]:
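# Note: `test` was reassigned to the created test set in cell 4, so its stories
# are appended to the training sequences here and are seen during training.
X_train_seq = tokenizer.texts_to_sequences(pd.concat((train['text'], test['text']), axis=0))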
X_val_seq = tokenizer.texts_to_sequences(val['text'])
X_test_seq = tokenizer.texts_to_sequences(test['text'])
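
To make the `num_words` and `oov_token` settings used above more concrete, here is a simplified pure-Python sketch of a frequency-ranked word index. It only approximates what the Keras `Tokenizer` does (no punctuation filtering, ties broken by first appearance), and `build_word_index` and `texts_to_ids` are hypothetical helper names, not library functions.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; id 1 is reserved for the OOV token, id 0 for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_ids(texts, word_index, num_words, oov_token="<OOV>"):
    """Map words to ids; unseen words and ids >= num_words fall back to the OOV id."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word, oov_id)
            ids.append(idx if idx < num_words else oov_id)
        sequences.append(ids)
    return sequences

toy_texts = ["the senate passed the bill", "the bill failed"]
toy_index = build_word_index(toy_texts)
toy_ids = texts_to_ids(["the senate rejected the bill"], toy_index, num_words=4)
print(toy_ids)  # [[2, 1, 1, 2, 3]]
```

In the toy run, "senate" is in the index but ranked below the `num_words` cutoff, and "rejected" was never seen, so both map to the OOV id 1.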

In [9]:
X_train_padded = pad_sequences(X_train_seq, maxlen = max_length, padding = 'post', truncating = 'post')
X_val_padded = pad_sequences(X_val_seq, maxlen = max_length, padding = 'post', truncating = 'post')
X_test_padded = pad_sequences(X_test_seq, maxlen = max_length, padding = 'post', truncating = 'post')
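
The `padding = 'post'` and `truncating = 'post'` arguments above both act on the end of each sequence. A minimal pure-Python sketch of that behavior for a single sequence follows; `pad_post` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Drop tokens past maxlen from the end, then fill the end with pad_value."""
    kept = list(sequence[:maxlen])
    return kept + [pad_value] * (maxlen - len(kept))

short_padded = pad_post([5, 8, 2], maxlen=5)
long_truncated = pad_post([5, 8, 2, 9, 4, 7], maxlen=5)
print(short_padded)    # [5, 8, 2, 0, 0]
print(long_truncated)  # [5, 8, 2, 9, 4]
```

With `max_length = 650`, a 300-token story would end with 350 zeros, while any story longer than 650 tokens would lose its tail.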

In [10]:
len(X_train_seq[0])

300

In [11]:
X_train_padded[0]

array([   2, 1458, 1839, 1783, 3577,  278,   20, 6643,    6,  167, 6644,
       4628,   10,  271,  280,   27,  815, 2916, 5409,  468,   77,   24,
         15,   18,    5, 8797,    4, 2569,  732,   39,    2, 2374,    2,
       7501,   28,  151,    7,  233,   99,   22,   46, 7502,   21,    2,
       2467,  237,  392,   70, 1840, 1701,  361,  388,  219,    3,    5,
       3356, 2468,  710,   18, 5409, 1252,    2, 1458,  530,   39,   20,
         99,    9, 4629,    7,  202,    9,  496,  145, 5409,  802,    5,
       7503,   60,   54, 1784,    2, 1176,   39,   74, 8798,    2,  964,
       3577,  777,  174, 2916,   37,  951,    8,   60,    2,  244,  440,
         19,   32,  125,  354,    2,  756,  105,  475,    8,   48,   38,
         91,    3, 1096,   10,   83,  361,  145,   53, 3577, 5410,  825,
          6, 5409,   14,    2,  334,  469,  436,   17,    1,    2,   87,
        449,  174,    6, 4036,   18, 3577, 2667,    3, 1276,    2, 1785,
        469,  284,    6, 5409,    1,   77,   10,  2

In [12]:
batch_size = 32
buffer_size = 10000

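# Labels are concatenated in the same order as the training texts
# (train rows first, then the created test set).
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_padded, pd.concat((train['BinLabel'], test['BinLabel']), axis=0)))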
val_dataset = tf.data.Dataset.from_tensor_slices((X_val_padded, val['BinLabel']))
test_dataset = tf.data.Dataset.from_tensor_slices((X_test_padded, test['BinLabel']))

train_dataset = train_dataset.shuffle(buffer_size).batch(batch_size)
val_dataset = val_dataset.batch(batch_size)
test_dataset = test_dataset.batch(batch_size)
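
Because the training tensors above are assembled from several frames, it can be worth checking whether any evaluation story also appears among the training texts. This is a minimal sketch with made-up strings; `overlap_count` is a hypothetical helper and the toy lists stand in for the real text columns.

```python
def overlap_count(train_texts, eval_texts):
    """Count evaluation texts that also occur verbatim in the training texts."""
    seen = set(train_texts)
    return sum(text in seen for text in eval_texts)

toy_train = ["story a", "story b", "story c"]
toy_eval = ["story c", "story d"]

partial_overlap = overlap_count(toy_train, toy_eval)
full_overlap = overlap_count(toy_train + toy_eval, toy_eval)
print(partial_overlap)  # 1
print(full_overlap)     # 2
```

A nonzero count means the evaluation score is measured partly on stories the model was fitted to, so it will tend to look better than performance on unseen stories.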

In [13]:
for t,f in val_dataset:
  print(t.shape)
  print(t)
  break

(32, 650)
tf.Tensor(
[[ 320 5609 3726 ...    0    0    0]
 [  56   48  154 ...    0    0    0]
 [  13  224  231 ...    0    0    0]
 ...
 [  68    5  641 ...    0    0    0]
 [   2   34   30 ...    0    0    0]
 [1048  762 1010 ...    0    0    0]], shape=(32, 650), dtype=int32)


In [14]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip


In [15]:
import os

path_to_glove_file = "/content/glove.6B.100d.txt"

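# Map each word to its 100-dimensional GloVe vector
embeddings_index = {}
with open(path_to_glove_file, encoding="utf8") as f:
    for line in f:
        # Each line holds a word followed by its space-separated coefficients
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs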

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.


In [16]:
num_tokens = len(tokenizer.word_index) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix: row i holds the GloVe vector for the word with index i
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index stay all-zeros.
        # The padding row (index 0) and the "<OOV>" row are all-zeros as well.
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 21438 words (3279 misses)


In [17]:
from tensorflow.keras import layers
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

In [26]:
inputs = keras.Input(shape=(None,), dtype = "int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(1, activation = "sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss = "binary_crossentropy",
              metrics = ['accuracy'])
model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 100)         2471900   
                                                                 
 bidirectional_4 (Bidirectio  (None, 64)               34048     
 nal)                                                            
                                                                 
 dropout_4 (Dropout)         (None, 64)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2,506,013
Trainable params: 34,113
Non-trainable params: 2,471,900
____________________________________________
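
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard LSTM count of four gates, each with input weights, recurrent weights, and a bias. The value `vocab_rows = 24719` is inferred from the summary (2,471,900 embedding parameters divided by 100 dimensions), and `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

vocab_rows, embedding_dim, units = 24719, 100, 32

embedding_params = vocab_rows * embedding_dim          # frozen GloVe matrix
bilstm_params = 2 * lstm_params(embedding_dim, units)  # forward + backward LSTM
dense_params = 2 * units + 1                           # 64 inputs plus one bias
total_params = embedding_params + bilstm_params + dense_params

print(embedding_params, bilstm_params, dense_params)  # 2471900 34048 65
print(total_params)                                   # 2506013
```

Only the 34,113 LSTM and dense parameters are trained, since the embedding layer was created with `trainable=False`.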

In [29]:
model.fit(train_dataset, validation_data = val_dataset, epochs = 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f3af8c6da30>

In [30]:
model.evaluate(X_test_padded, test['BinLabel'])



[0.6152816414833069, 0.75]

In [36]:
y_pred = model.predict(X_test_padded)
y_pred = np.round(y_pred)

y_test_labels = test['BinLabel']
for i in range(len(y_test_labels)):
    print(f"Actual: {y_test_labels[i]}, Predicted: {y_pred[i]}")

Actual: 1, Predicted: [1.]
Actual: 1, Predicted: [1.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [1.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [1.]
Actual: 1, Predicted: [1.]
Actual: 0, Predicted: [0.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [0.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [0.]
Actual: 1, Predicted: [1.]
Actual: 0, Predicted: [0.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [0.]
Actual: 1, Predicted: [0.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [0.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [0.]
Actual: 0, Predicted: [0.]
Actual: 0, Predicted: [0.]
Actual: 0, Predicted: [0.]
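
The printed pairs above are easier to read as a tally. This sketch hard-codes the 24 actual and predicted labels exactly as listed in that output and counts the errors in each direction; the variable names are my own.

```python
# Labels copied from the printed output above, in order
actual = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

wrong = [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]
missed_ones = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
false_ones = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
accuracy = 1 - len(wrong) / len(actual)

print(wrong)                              # [9, 11, 15, 16, 18, 20]
print(missed_ones, false_ones, accuracy)  # 6 0 0.75
```

All six errors go the same way: stories labeled 1 that the model predicted as 0. Every story labeled 0 was predicted correctly.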


These are the stories the model labeled incorrectly (indices 9, 11, 15, 16, 18, and 20):

In [46]:
created_test['text'][9]

'War has broken out in Israel. The sound of bombs and gunfire echoes through the streets. People run for cover as buildings crumble and smoke fills the air. The conflict seems to have no end in sight. Families are torn apart, lives are lost, and the future is uncertain. The fighting is intense and both sides seem determined to win. The conflict has sparked protests around the world, with people taking to the streets to demand an end to the violence. But as the days go by, it becomes clear that the situation is only getting worse. The international community has tried to intervene, but so far their efforts have been in vain. The United Nations has called for a ceasefire, but it has been ignored. The conflict has even spilled over into neighboring countries, further destabilizing an already volatile region. The impact of the war is felt everywhere. Schools are closed, businesses are shut down, and hospitals are overwhelmed with casualties. Families huddle in bomb shelters, praying for sa

In [47]:
created_test['text'][11]

"President Trump has just tweeted his intention to tackle the issue of federal student debt relief. In a series of early morning tweets, he stated that he wants to work with Congress to find a solution that will provide relief to millions of Americans burdened with student loan debt. The issue of student debt has become a major political issue in recent years, with many young people struggling to pay off their loans and build a stable financial future. Trump's tweet suggests that he recognizes the seriousness of the issue and wants to take action to address it. The details of his proposed solution are not yet clear, but his tweets indicate that he is open to a range of options, including loan forgiveness, lower interest rates, and expanded repayment plans. He also emphasized the need for accountability, saying that any solution should be designed to prevent future generations from being burdened with the same level of debt. The response to Trump's tweet has been mixed. Supporters have 

In [49]:
created_test['text'][15]

'A new gun law has been passed, which is set to have a significant impact on gun owners across the country. The law, which has been hotly debated in Congress for months, will impose tighter restrictions on the sale and ownership of firearms. Under the new law, background checks will be mandatory for all gun sales, including private sales and transfers. The law also bans the sale of high-capacity magazines and assault weapons, which have been used in many recent mass shootings. Supporters of the law argue that it is necessary to prevent gun violence and protect public safety. They point to the fact that the US has one of the highest rates of gun violence in the world, with mass shootings and other gun-related crimes becoming increasingly common. Opponents of the law, however, argue that it infringes on the Second Amendment rights of law-abiding gun owners. They argue that the vast majority of gun owners are responsible and pose no threat to public safety, and that the law unfairly penal

In [50]:
created_test['text'][16]

"The COVID-19 pandemic is once again wreaking havoc across Europe, with more than 10 million people succumbing to the virus over a six-month period. The fatality rate stands at a concerning 2.3%, causing widespread concern and alarm. The resurgence of the virus has been attributed to a number of factors, including the emergence of new, more contagious variants of the virus, the relaxation of lockdown measures, and a general complacency towards the threat posed by the virus. Governments across the region have been scrambling to contain the outbreak, with many reintroducing lockdown measures and imposing new restrictions on travel and gatherings. However, these measures have been met with resistance from some members of the public, who are frustrated with the ongoing disruption to their lives. The situation has been made even more challenging by the slow rollout of vaccines across the region. While some countries have made significant progress in vaccinating their populations, others are

In [51]:
created_test['text'][18]

"Kamala Harris is a lawyer and politician who served as the Attorney General of California from 2011 to 2017, and later as a United States Senator from California from 2017 to 2021. In 2020, she was chosen by Joe Biden as his running mate in the United States Presidential Election. She made history as the first woman of color to be chosen as a vice-presidential nominee by a major political party. Throughout her career, Kamala Harris has been a strong advocate for criminal justice reform, environmental protection, and healthcare access. She has also been a vocal critic of Donald Trump and his policies, particularly on issues related to immigration and civil rights. In recent years, Kamala Harris has also faced criticism from some conservatives for her handling of criminal cases as Attorney General of California and for her association with Hunter Biden. However, she has defended her record and emphasized her commitment to upholding the rule of law and ensuring justice for all. As Vice P

In [48]:
created_test['text'][20]

'There are several potential candidates who are considering running for President in 2024. One of them is Senator Cory Booker from New Jersey. He has a strong track record of advocating for civil rights, criminal justice reform, and economic equality. Senator Booker has been a leading voice in Congress on issues related to social justice and equity. He has championed legislation to address police brutality and systemic racism, and has been a vocal supporter of policies to promote economic opportunity and job growth. In addition to his policy positions, Senator Booker is known for his ability to connect with voters and inspire change. He has a powerful personal story, having grown up in poverty and risen to become a Rhodes Scholar and successful politician. If he were to run for President in 2024, Senator Booker would bring a unique perspective and a proven record of leadership to the race. He has demonstrated a strong commitment to public service and a willingness to fight for what is 

All six misclassified stories were labeled 1 and predicted as 0, but I could not find a reason why the model missed these articles and not the others.

This model performed moderately well on my test set (75% accuracy, 18 of 24 stories), but it did not do as well there as it had on the training and validation sets. I think this is because my test set was created differently from the training data. Capturing more of the patterns in the data may help close that gap.

I expect the model to perform better with more capacity, so I will try another model with an additional LSTM layer.

In [39]:
inputs = keras.Input(shape=(None,), dtype = "int64")
embedded = embedding_layer(inputs)
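# Note: the second LSTM below reads `embedded` rather than `x`, so the LSTM(64)
# branch is never connected to the output. The summary confirms a single
# bidirectional layer, i.e. the same architecture as the first model.
x = layers.Bidirectional(layers.LSTM(64))(embedded)
x = layers.Dropout(0.3)(x)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.3)(x)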
outputs = layers.Dense(1, activation = "sigmoid")(x)
large_model = keras.Model(inputs, outputs)

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

large_model.compile(optimizer="rmsprop",
              loss = "binary_crossentropy",
              metrics = ['accuracy'])

callback_list = [
    EarlyStopping(monitor='val_loss', patience=3),
    ModelCheckpoint(filepath='weights.{epoch:02d}-{val_loss:.2f}.h5',
    monitor='val_loss', verbose=1, save_best_only=True)
]

large_model.summary()

Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_7 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 100)         2471900   
                                                                 
 bidirectional_10 (Bidirecti  (None, 64)               34048     
 onal)                                                           
                                                                 
 dropout_10 (Dropout)        (None, 64)                0         
                                                                 
 dense_6 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2,506,013
Trainable params: 34,113
Non-trainable params: 2,471,900
____________________________________________
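
The summary above lists only one bidirectional layer. For two recurrent layers to be stacked, the first has to return its full output sequence (`return_sequences=True`) and the second has to take that output as its input. The following is a minimal sketch of that wiring, not the model that was trained here: it uses a small trainable stand-in embedding, and `build_stacked_bilstm` and `vocab_size` are hypothetical names.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_stacked_bilstm(vocab_size=1000, embedding_dim=100):
    """Two stacked bidirectional LSTMs feeding a sigmoid output."""
    inputs = keras.Input(shape=(None,), dtype="int64")
    # Stand-in embedding; the notebook's frozen GloVe layer could be used instead
    embedded = layers.Embedding(vocab_size, embedding_dim)(inputs)
    # First recurrent layer emits one vector per time step
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(embedded)
    x = layers.Dropout(0.3)(x)
    # Second recurrent layer consumes the first layer's sequence output
    x = layers.Bidirectional(layers.LSTM(32))(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs)

stacked_model = build_stacked_bilstm()
n_recurrent = sum(isinstance(layer, layers.Bidirectional) for layer in stacked_model.layers)
print(n_recurrent)  # 2
```

Calling `stacked_model.summary()` on this sketch should show two bidirectional rows, which is a quick way to confirm that both layers are actually connected.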

In [41]:
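# Note: callback_list is not passed here (callbacks=callback_list), so early
# stopping and checkpointing are inactive for this run.
large_model.fit(train_dataset, validation_data = val_dataset, epochs = 100)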

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x7f3a482a2df0>
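
The `EarlyStopping` entry in `callback_list` only has an effect when the list is passed to `fit` through its `callbacks` argument. To illustrate what `patience=3` means, here is a simplified pure-Python sketch of the stopping rule applied to a made-up list of validation losses. `early_stop_epoch` is a hypothetical helper and the loss values are invented for the example.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (1-based epoch where training would stop, best loss seen so far)."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best
    return len(val_losses), best

toy_losses = [0.60, 0.50, 0.45, 0.47, 0.46, 0.48, 0.40]
stop_epoch, best_loss = early_stop_epoch(toy_losses)
print(stop_epoch, best_loss)  # 6 0.45
```

In the toy run the best loss comes at epoch 3, the next three epochs fail to beat it, and training stops at epoch 6 without ever reaching the lower loss at epoch 7.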

In [44]:
large_model.evaluate(X_test_padded, test['BinLabel'])



[0.045223940163850784, 0.9583333134651184]

The second model, trained for many more epochs, performed wonderfully on the created test set: 95.8% accuracy (23 of 24 stories).

In [42]:
y_pred = large_model.predict(X_test_padded)
y_pred = np.round(y_pred)

y_test_labels = test['BinLabel']
for i in range(len(y_test_labels)):
    print(f"Actual: {y_test_labels[i]}, Predicted: {y_pred[i]}")

Actual: 1, Predicted: [1.]
Actual: 1, Predicted: [1.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [1.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [0.]
Actual: 1, Predicted: [1.]
Actual: 0, Predicted: [0.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [1.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [1.]
Actual: 1, Predicted: [1.]
Actual: 0, Predicted: [0.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [1.]
Actual: 1, Predicted: [1.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [1.]
Actual: 0, Predicted: [0.]
Actual: 1, Predicted: [1.]
Actual: 0, Predicted: [0.]
Actual: 0, Predicted: [0.]
Actual: 0, Predicted: [0.]


This is the incorrectly labeled article.

In [45]:
created_test['text'][5]

"During a recent debate, conservative commentator Ben Shapiro surprised many by expressing his support for a woman's right to choose. Shapiro, known for his conservative views on issues such as gun rights and taxation, made a passionate argument in favor of pro-choice policies, citing individual freedom and limited government as his main reasons. Shapiro's comments came during a heated discussion on abortion and reproductive rights, where he argued that the government should not be in the business of dictating what a woman can or cannot do with her body. He acknowledged that the issue of abortion is complex and emotionally charged, but maintained that individual freedom and personal choice should be the guiding principles in this debate. Shapiro's comments were met with surprise and some confusion by his fellow debaters, who questioned how his views on abortion fit into his broader conservative philosophy. Shapiro responded by reiterating his belief in limited government and individual

I cannot tell which kinds of articles the model does better or worse on. This story was predicted correctly by the smaller model, so its misclassification by the large model looks more like chance than a lack of ability.
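
One way to put a number on "chance" with only 24 test stories is to compare the two models' error sets and attach a rough uncertainty range to each accuracy. The sketch below uses the misclassified indices reported in this notebook and a Wilson score interval at roughly 95% confidence. The choice of interval is my own assumption, and `wilson_interval` is a hypothetical helper.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

small_errors = {9, 11, 15, 16, 18, 20}  # indices missed by the first model
large_errors = {5}                      # index missed by the second model
n_test = 24

shared_errors = small_errors & large_errors
small_ci = wilson_interval(n_test - len(small_errors), n_test)
large_ci = wilson_interval(n_test - len(large_errors), n_test)

print(shared_errors)  # set()
print(small_ci, large_ci)
```

The two models share no errors, and the two intervals overlap, so a test set of this size gives only weak evidence about which stories are genuinely hard.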