# D22124454 - Deep Learning Assignment - Part 3
# Writing news articles:

#### System specifics:
OS: Windows 11

RAM: 32 GB

GPU: RTX 3070

IDE: Models initially trained and evaluated on local - Jupyter IDE

#### Task overview:
This task generates news articles from the text column of the given data. We use only the articles from the two most common categories.

#### Note:
#### This notebook contains the training and validation comparisons, along with a graph for evaluation. The best-performing models will be evaluated again in the demo notebooks. Training was done locally with 10 epochs, so running with fewer epochs in Colab may not give similar results.

#### FAQ:
- If some of the Plotly graphs do not render, uncomment and run the renderer cell at the top of the imports section. Set the renderer to "iframe" when running locally and to "colab" when running in Colab.
- Please enable the GPU before running.

### Imports:

In [None]:
#IF PLOTLY GRAPHS DO NOT RENDER. SET TO COLAB WHEN RUN IN COLAB AND IFRAME FOR LOCAL (iframe)
#import plotly.io as pio
#pio.renderers.default = 'colab'

In [2]:
import pandas as pd
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk import word_tokenize
from nltk.corpus import stopwords
import string

from sklearn.model_selection import train_test_split

import tensorflow as tf
import numpy as np

In [3]:
import tensorflow_hub as hub

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Embedding, Conv1D, GlobalMaxPooling1D, GlobalAveragePooling1D, LSTM, SimpleRNN, MaxPooling1D, TimeDistributed, AveragePooling1D, Input
from tensorflow.keras import layers
from tensorflow.keras.datasets import imdb

import plotly.express as px
import matplotlib.pyplot as plt

In [4]:
from sklearn.preprocessing import OneHotEncoder

from nltk.util import ngrams

In [1]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [5]:
# credentials to get the data file
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

### Import the dataset

In [6]:
# https://drive.google.com/file/d/1sD2qf_JAOXKPQ6VuxpprYGfAZr1dAWmY/view?usp=sharing
bbcCsv = drive.CreateFile({'id':'1sD2qf_JAOXKPQ6VuxpprYGfAZr1dAWmY'})
bbcCsv.GetContentFile('bbc-text.csv')

In [7]:
raw_data = pd.read_csv("bbc-text.csv", delimiter =",", index_col=False)
display(raw_data)

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...
...,...,...
2220,business,cars pull down us retail figures us retail sal...
2221,politics,kilroy unveils immigration policy ex-chatshow ...
2222,entertainment,rem announce new glasgow concert us band rem h...
2223,politics,how political squabbles snowball it s become c...


### Identify the top 2 common categories and filter the dataset

In [None]:
topic_list_all = np.array(raw_data["category"])
unique_top, top_counts = np.unique(topic_list_all, return_counts=True)

In [None]:
# x and y are plain arrays (not columns of raw_data), so no data frame is passed
fig = px.bar(
    x=unique_top, y=top_counts,
    title = "Fig: Topics in raw data",
    labels={'x' : 'Topics', 'y': 'No. of occurrences'}
)

fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})

fig.show()

In [None]:
top_2_sections = ["sport", "business"]
filtered_df = raw_data.loc[raw_data['category'].isin(top_2_sections)]
display(filtered_df)

Unnamed: 0,category,text
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
7,sport,henman hopes ended in dubai third seed tim hen...
8,sport,wilkinson fit to face edinburgh england captai...
...,...,...
2214,business,bush budget seeks deep cutbacks president bush...
2218,sport,davies favours gloucester future wales hooker ...
2219,business,beijingers fume over parking fees choking traf...
2220,business,cars pull down us retail figures us retail sal...
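
The two category names above are typed in by hand after reading the bar chart. As a minimal sketch, they could instead be derived from the label counts. The helper name `top_categories` and the toy labels are assumptions for illustration, not part of the original notebook.

```python
from collections import Counter

def top_categories(categories, n=2):
    """Return the n most frequent category labels, most frequent first."""
    return [label for label, _ in Counter(categories).most_common(n)]

# Toy labels standing in for raw_data["category"] (illustrative values only)
toy_categories = ["sport", "business", "tech", "sport", "business", "sport", "politics"]
top_2 = top_categories(toy_categories)
print(top_2)
```

In the notebook, the same helper could be called on `raw_data["category"]` and its result passed to `isin` in place of the hard-coded list.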


### Sample and clean the data

In [None]:
# Sample 40% of the filtered articles (fixed seed for reproducibility)
sample_portion = filtered_df.sample(frac=0.40, random_state=10).copy()

# Drop non-ASCII characters (e.g. accented letters such as "â")
sample_portion['text'] = sample_portion['text'].apply(lambda a: str(a).encode('ascii', 'ignore').decode('ascii'))

In [None]:
# Remove punctuation
sample_portion["NoPunct"] = sample_portion['text'].apply(lambda a: re.sub(r'[^\w\s]','',a))

display(sample_portion)

Unnamed: 0,category,text,NoPunct
2187,sport,jones files conte lawsuit marion jones has fil...,jones files conte lawsuit marion jones has fil...
1202,sport,what now for british tennis tim henman s deci...,what now for british tennis tim henman s deci...
868,sport,moody joins up with england lewis moody has fl...,moody joins up with england lewis moody has fl...
1339,business,latin america sees strong growth latin america...,latin america sees strong growth latin america...
2077,sport,holmes starts 2005 with gb events kelly holmes...,holmes starts 2005 with gb events kelly holmes...
...,...,...,...
145,business,industrial output falls in japan japanese indu...,industrial output falls in japan japanese indu...
1210,business,us adds more jobs than expected the us economy...,us adds more jobs than expected the us economy...
577,sport,hearts of oak 3-2 cotonsport hearts of oak set...,hearts of oak 32 cotonsport hearts of oak set ...
810,sport,ronaldo considering new contract manchester un...,ronaldo considering new contract manchester un...
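
The cleaning steps above can be checked on a single string. The sketch below applies the same ASCII filter and regular expression to one headline from the table, and shows why the score "3-2" turns into "32". The function names are hypothetical, and the spaced variant is an alternative for comparison, not what the notebook does.

```python
import re

def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return str(text).encode("ascii", "ignore").decode("ascii")

def remove_punctuation(text):
    """Same pattern as the notebook: delete punctuation outright."""
    return re.sub(r"[^\w\s]", "", text)

def remove_punctuation_spaced(text):
    """Alternative: replace punctuation with a space, then collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text)).strip()

raw = "hearts of oak 3-2 cotonsport"
deleted = remove_punctuation(strip_non_ascii(raw))
spaced = remove_punctuation_spaced(strip_non_ascii(raw))
print(deleted)
print(spaced)
```

Deleting punctuation merges tokens that were separated only by a hyphen, while the spaced variant keeps them apart. Which behaviour is preferable depends on how scores and hyphenated words should be tokenized.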


### Generate the bigrams

Here, we first take the bigrams, and then split into X and Y. The idea is to create a set that says "Y occurs after X".

In [None]:
bigram_list = []
for line in sample_portion["NoPunct"]:
    token = word_tokenize(line)
    bigram = list(ngrams(token, 2))
    bigram_list.extend(bigram)
#print(bigram_list)

word_X = []
word_Y = []
for wordset in bigram_list:
    word_X.append(wordset[0])
    word_Y.append(wordset[1])

print(word_X[0])
print(word_Y[0])

jones
files
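
To make the "Y occurs after X" idea concrete, here is a minimal sketch on two toy token lists. It uses plain `zip` instead of `nltk.util.ngrams` so that it runs without NLTK. The token lists are illustrative, and the helper name `bigram_pairs` is an assumption.

```python
def bigram_pairs(tokens):
    """Pair every token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

# Two toy "articles", already tokenized (illustrative values only)
articles = [
    ["jones", "files", "conte", "lawsuit"],
    ["moody", "joins", "up"],
]

word_X, word_Y = [], []
for tokens in articles:
    for current, following in bigram_pairs(tokens):
        word_X.append(current)
        word_Y.append(following)

print(word_X)
print(word_Y)
```

Because the pairs are built per article, the last word of one article is never paired with the first word of the next.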


### Constructing the model

The idea is to predict the next word given the current word. The training set consists of pairs of a word (X) and the word that follows it (Y), and we treat next-word prediction as a classification task over the vocabulary.

In [None]:
X_trainwv, X_testwv, y_trainwv, y_testwv = train_test_split(word_X, word_Y, random_state=10, test_size=0.20)

In [None]:
# One-hot encode the target words; the column order defines the class indices
word_Y_le = pd.get_dummies(y_trainwv)
display(word_Y_le)
print(type(word_Y_le))

Unnamed: 0,0,00,000,000m,000seat,000strong,01,02,0227,027,...,zephyr,zero,zeros,zheng,zib,zimbabwe,zinc,zogbia,zone,zurich
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109213,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
109214,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
109215,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
109216,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


<class 'pandas.core.frame.DataFrame'>
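
The one-hot frame above has over 100,000 rows and more than 11,000 columns, nearly all of them `False`. A lighter representation stores one integer class index per target word. The sketch below shows that encoding on toy data. It assumes the model would then be compiled with Keras's `sparse_categorical_crossentropy` loss, which accepts integer labels; the notebook itself uses one-hot targets with `categorical_crossentropy`. The helper name `encode_labels` is hypothetical.

```python
def encode_labels(words):
    """Map each word to an integer class index based on the sorted vocabulary."""
    vocab = sorted(set(words))
    index = {word: i for i, word in enumerate(vocab)}
    return vocab, [index[word] for word in words]

# Toy target words standing in for y_trainwv (illustrative values only)
toy_targets = ["files", "conte", "lawsuit", "files", "the"]
vocab, labels = encode_labels(toy_targets)

print(vocab)
print(labels)

# Decoding a predicted class index back to a word is a plain list lookup
print(vocab[labels[2]])
```

With this encoding, `vocab` plays the role that the one-hot column headers play later in the notebook when predictions are mapped back to words.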


In [None]:
maxlen = 400
embedding_dims = 16
epochs = 5

max_features = 70000

category_no = 20

vectorizer = tf.keras.layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode="int",
)
vectorizer.adapt(X_trainwv)

In [None]:
gen_model = Sequential()

gen_model.add(vectorizer)
gen_model.add(Embedding(max_features, embedding_dims))
gen_model.add(LSTM(embedding_dims, return_sequences=True, dropout=0.0, recurrent_dropout=0.1))
gen_model.add(LSTM(embedding_dims, return_sequences=False, dropout=0.0, recurrent_dropout=0.1))
gen_model.add(Dense(400,activation="relu"))
gen_model.add(Dense(400,activation="relu"))
gen_model.add(Dense(400,activation="relu"))
# One output unit per target word (11257 in this run)
gen_model.add(Dense(word_Y_le.shape[1], activation="softmax"))

In [None]:
gen_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
             metrics=['accuracy'])

gen_model.summary()

# Inputs are single words (strings); targets are the one-hot encoded next words
gen_hist = gen_model.fit(np.array(X_trainwv), word_Y_le,
                         batch_size=32,
                         epochs=epochs)

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_4 (TextV  (None, None)             0         
 ectorization)                                                   
                                                                 
 embedding_6 (Embedding)     (None, None, 16)          1120000   
                                                                 
 lstm_12 (LSTM)              (None, None, 16)          2112      
                                                                 
 lstm_13 (LSTM)              (None, 16)                2112      
                                                                 
 dense_24 (Dense)            (None, 400)               6800      
                                                                 
 dense_25 (Dense)            (None, 400)               160400    
                                                      

### Generating a sentence from the trained model

We take a word from the dataset as the seed word. Each prediction for the next word then replaces the seed word, and this repeats until the target word count is reached. The predictions are joined to form the final sentence.

In [None]:
seed_word = ["Face"]
final_word_list = ["Face"]
# Class index -> word, in the same column order used for the one-hot targets
column_headers = list(word_Y_le.columns.values)

for i in range(20):
    # Predict the distribution over next words for the current seed word
    pred = gen_model.predict(np.array(seed_word))
    # Greedy choice: index of the most probable next word
    word_index = int(np.argmax(pred, axis=1)[0])
    word = column_headers[word_index]
    print(word)
    final_word_list.append(word)
    # The predicted word becomes the next seed
    seed_word = [word]

#print(final_word_list)

print("The generated sentence is:")
print(" ".join(final_word_list))

the
year
in
the
year
in
the
year
in
the
year
in
the
year
in
the
year
in
the
year
The generated sentence is:
Face the year in the year in the year in the year in the year in the year in the year
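
The repeating "the year in" pattern follows from the setup. The model sees only one word of context and always takes the most probable next word, so once a word recurs the output must cycle. The sketch below reproduces this with a hypothetical lookup table in place of the trained model, and contrasts it with sampling from the predicted distribution. All probabilities and names here are made up for illustration.

```python
import random

# Hypothetical next-word probabilities (NOT taken from the trained model)
toy_model = {
    "face": {"the": 0.7, "in": 0.3},
    "the": {"year": 0.6, "game": 0.4},
    "year": {"in": 0.8, "the": 0.2},
    "in": {"the": 0.9, "year": 0.1},
    "game": {"in": 0.5, "the": 0.5},
}

def generate(seed, steps, greedy=True, rng=None):
    """Generate `steps` words after `seed`, greedily or by weighted sampling."""
    rng = rng or random.Random(0)
    words = [seed]
    current = seed
    for _ in range(steps):
        dist = toy_model[current]
        if greedy:
            # Same rule as np.argmax in the notebook: take the most probable word
            current = max(dist, key=dist.get)
        else:
            # Draw the next word in proportion to its probability
            current = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        words.append(current)
    return words

greedy_words = generate("face", 6)
sampled_words = generate("face", 6, greedy=False, rng=random.Random(1))
print(" ".join(greedy_words))
print(" ".join(sampled_words))
```

Sampling is one common way to break such cycles. Giving the model more than one word of context is another, but both go beyond what this notebook trains.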


### Using pre-built MLE model

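In this section, we also run NLTK's pre-built MLE model to compare the text it produces with the sentence above. Because the articles are passed in as raw strings, the n-grams are built over characters rather than words.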

In [None]:
from nltk.lm.preprocessing import padded_everygram_pipeline
train, vocab = padded_everygram_pipeline(3, sample_portion["NoPunct"])

In [None]:
from nltk.lm import MLE
model2 = MLE(3) 
model2.fit(train, vocab)

In [None]:
word_list = model2.generate(40, random_seed=2)
print("".join(word_list))

ut asking of ists ge  inding  ecoxx togt
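
The output above is a run of characters because each article is passed in as a plain string, and iterating over a string yields characters. A word-level variant would split each text into tokens first. The sketch below does this on toy sentences. It uses `str.split` rather than `word_tokenize` to avoid needing NLTK's tokenizer data, and the names `toy_texts` and `word_model` are assumptions.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy sentences standing in for sample_portion["NoPunct"] (illustrative values only)
toy_texts = [
    "jones files conte lawsuit",
    "moody joins up with england",
    "latin america sees strong growth",
]

# Split each text into words so the n-grams are built over words, not characters
tokenized = [text.split() for text in toy_texts]

order = 3
train, vocab = padded_everygram_pipeline(order, tokenized)
word_model = MLE(order)
word_model.fit(train, vocab)

# Start from the sentence-start padding so generation begins like a sentence
generated = word_model.generate(10, text_seed=["<s>", "<s>"], random_seed=2)

# Drop the padding symbols before joining the words with spaces
sentence = " ".join(w for w in generated if w not in ("<s>", "</s>"))
print(sentence)
```

Note that a word-level model joins its output with spaces, whereas the character-level output above is joined with an empty string.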


### Saving the custom model

In [None]:
gen_model.save('saved_model/gen_model')



INFO:tensorflow:Assets written to: saved_model/gen_model\assets



### Saving the encoding set for generating

In [None]:
print(type(word_Y_le.columns.values))

<class 'numpy.ndarray'>


In [None]:
encodingcol = pd.DataFrame({
    'columns': word_Y_le.columns.values
})
display(encodingcol)

Unnamed: 0,columns
0,0
1,00
2,000
3,000m
4,000seat
...,...
11252,zimbabwe
11253,zinc
11254,zogbia
11255,zone


In [None]:
encodingcol.to_csv("encodings.csv")
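
When the encodings are read back for generation, some care is needed. By default `pandas.read_csv` treats strings such as `null` or `nan` as missing values, and both can occur as ordinary words in the vocabulary. The sketch below shows a round trip that keeps every entry as a string. It writes to an in-memory buffer instead of `encodings.csv`, and the toy vocabulary is an assumption for illustration.

```python
import io

import pandas as pd

# Toy vocabulary standing in for word_Y_le.columns.values (illustrative values only)
toy_columns = ["0", "00", "000m", "null", "zone"]
encodingcol = pd.DataFrame({"columns": toy_columns})

# Write to an in-memory buffer to keep the sketch self-contained
buffer = io.StringIO()
encodingcol.to_csv(buffer)
buffer.seek(0)

restored = pd.read_csv(
    buffer,
    index_col=0,              # the first CSV column is the saved row index
    dtype={"columns": str},   # keep entries such as "00" as text
    keep_default_na=False,    # do not turn words like "null" into NaN
)
column_headers = restored["columns"].tolist()
print(column_headers)
```

The resulting `column_headers` list can then be indexed with the predicted class index, as in the generation loop earlier in the notebook.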