# Suppose we are NLP engineers building an automatic text summarizer for a news channel. The channel uses flashcards of broad news articles to design the front page of its blog, which is read by more than a million readers across the globe. To develop the content for these flashcards, the editors manually summarize the prospective news articles. This process is very time-consuming, so it becomes important to build a text summarizer that can generate summaries automatically.

# The text summarizer we develop will play a crucial role in reducing the turnaround time for the news editors in developing the content for the flashcards.



# The dataset we will use to train our text summarizer is the DeepMind Q&A dataset. It is discussed in the following research publication:

https://arxiv.org/abs/1506.03340

# This dataset primarily contains documents and accompanying questions from the news articles of the CNN news channel. The URL of the dataset is:

https://cs.nyu.edu/~kcho/DMQA/

# Within the CNN column, we will use the section named "Stories" to fetch our dataset, which contains long paragraphs as well as their summaries.

# We have already downloaded the dataset from the above URL; now let's unzip it:

In [None]:
cd /content/drive/MyDrive

/content/drive/MyDrive


In [None]:
! tar -xvf /content/drive/MyDrive/cnn_stories.tgz

Streaming output truncated to the last 5000 lines.
./cnn/stories/776c7e45847c8c099657298bc6badfa229ad7d24.story
./cnn/stories/28ca66b78f32b0395bcf89658121708c12d2adaa.story
./cnn/stories/4ccc93faef1eeac4c661a4369edbf324bd81202b.story
./cnn/stories/2bba1b8102456ebcdee37b375cca7097956eff1d.story
./cnn/stories/6beafefada5649c04532798b97f57ce64691ef9b.story
./cnn/stories/18565de60a3da497f26c7998339cfa29817ae305.story
./cnn/stories/59bcc9136dec19cbac2d6bd4d6d13e3caaeb40cb.story
./cnn/stories/45409369da2dc0b467417b4623c26d46c0aa0944.story
./cnn/stories/22edd27f817234833c5cff3cc55bd278002dffba.story
./cnn/stories/8e20fca3d8c15ee2ea9c706d3130213198cf06af.story
./cnn/stories/1122704b29b7cc3034fdf40aae78070b5ec7fd79.story
./cnn/stories/0587309e69611037f4293a898fbb57f2761b6347.story
./cnn/stories/e0127880a6d047cf7adf5b8ceca22873c7b4c86b.story
./cnn/stories/a9a0d4af0c7ab1750911e3244afd881be730ce77.story
./cnn/stories/2718cfa01c16a5360fb4898aa27b010bafad35b7.story
./cnn/stories/dfeb9d

In [None]:
import pandas as pd
import numpy as np 
import os

# Let's create functions to load the dataset and split stories into news paragraphs as well as summaries or highlights. 

In [None]:
def load_story(single_story_path):

  # the context manager closes the file even if reading fails
  with open(single_story_path,encoding="utf-8") as file_handle:
    single_complete_story = file_handle.read()
  return single_complete_story

In [None]:
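def split_story_into_para_highlights(single_complete_story):

  # everything before the first "@highlight" marker is the article body
  highlight_loc = single_complete_story.find("@highlight")
  if highlight_loc == -1:
    # no marker at all: the whole story is the body and there are no highlights
    return single_complete_story, []
  para, highlights = single_complete_story[:highlight_loc], single_complete_story[highlight_loc:].split("@highlight")
  # drop the empty piece that split() leaves before the first marker
  highlights = [summary.strip() for summary in highlights if len(summary.strip()) > 0]

  return para,highlights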
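
# As a quick sanity check of the splitting logic, the sketch below applies the same idea to a tiny made-up story string. The sample text and the helper name `split_demo` are illustrative assumptions, not part of the dataset or of the pipeline above.

```python
# Toy story in the CNN ".story" layout: body first, then "@highlight" blocks.
# The text is invented purely for illustration.
sample_story = (
    "First sentence of the article.\n\n"
    "Second sentence of the article.\n\n"
    "@highlight\n\nFirst summary line\n\n"
    "@highlight\n\nSecond summary line\n"
)

def split_demo(story):
    """Split a story into its body and a list of non-empty highlights."""
    loc = story.find("@highlight")
    if loc == -1:
        return story, []
    body, rest = story[:loc], story[loc:]
    # split() leaves an empty first piece before the first marker, so filter it
    highlights = [h.strip() for h in rest.split("@highlight") if h.strip()]
    return body, highlights

body, highlights = split_demo(sample_story)
print(repr(body))
print(highlights)
```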

In [None]:
paragraphs = list()
summaries = list()

for story_filename in os.listdir("./cnn/stories"):

  single_story_path = os.path.join("./cnn/stories",story_filename)
  single_complete_story = load_story(single_story_path)

  para, highlights = split_story_into_para_highlights(single_complete_story)

  paragraphs.append(para)
  summaries.append(highlights)

stories = dict(zip(["Story_paragraphs","Abstractive_summaries"],[paragraphs,summaries]))

# We have now converted our data into an abstractive text summarization dataset.

In [None]:
import pickle

In [None]:
# the context manager flushes and closes the file before it is read back
with open("cnn_news_stories.pkl","wb") as pkl_file_handle:
  pickle.dump(stories,pkl_file_handle)

In [None]:
with open("./cnn_news_stories.pkl","rb") as pkl_file_handle:
  stories = pickle.load(pkl_file_handle)

In [None]:
stories["Story_paragraphs"][0]

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him.\n\nDaniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix"\n\nTo the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties.\n\n"I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant.\n\n"The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs."\n\nAt 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office c

# Let's create a small function to preprocess each line of the paragraphs as well as the abstractive summaries.

In [None]:
import string

In [None]:
def preprocess_single_sent_per_story(sents_per_story):

  cleaned_sents = list()
  # translation table that deletes every ASCII punctuation character
  # (string.punctuation covers the same 32 characters as ASCII 33-47, 58-64, 91-96 and 123-126)
  waste_tokens_ascii_values_mapping = str.maketrans("", "", string.punctuation)
  for sent in sents_per_story:
            
    loc = sent.find('(CNN) -- ')
    if loc > -1:
      sent = sent[loc+len('(CNN)'):]
        
    sent = sent.split()
    sent = [token.lower() for token in sent]
    sent = [token.translate(waste_tokens_ascii_values_mapping) for token in sent]
    sent = [token for token in sent if token.isalpha()]
    cleaned_sents.append(' '.join(sent))
    
  cleaned_sents = [sent for sent in cleaned_sents if len(sent) > 0]
  return cleaned_sents
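
# To make the effect of this cleaning concrete, here is a minimal standalone sketch of the same three steps (lowercasing, deleting ASCII punctuation, keeping purely alphabetic tokens) applied to one sentence. The helper name `clean_sentence` and the sample sentence are illustrative assumptions.

```python
import string

# Translation table that deletes every ASCII punctuation character.
strip_punct = str.maketrans("", "", string.punctuation)

def clean_sentence(sent):
    """Lowercase, strip punctuation, and keep purely alphabetic tokens."""
    tokens = [tok.lower().translate(strip_punct) for tok in sent.split()]
    return " ".join(tok for tok in tokens if tok.isalpha())

raw = 'He won\'t spend the reported £20 million ($41.1 million) on "fast cars".'
print(clean_sentence(raw))
# prints: he wont spend the reported million million on fast cars
```

# Note that tokens containing digits or non-ASCII symbols (such as "£20" or "$41.1") are dropped entirely, which is why amounts and ages disappear from the cleaned sentences shown in this notebook.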

In [None]:
from tqdm.notebook import tqdm

In [None]:
for i in tqdm(range(len(stories["Story_paragraphs"]))):

  stories["Story_paragraphs"][i] = preprocess_single_sent_per_story(stories["Story_paragraphs"][i].split("\n"))
  stories["Abstractive_summaries"][i] = preprocess_single_sent_per_story(stories["Abstractive_summaries"][i])

  0%|          | 0/92579 [00:00<?, ?it/s]

In [None]:
stories["Story_paragraphs"][0]

['london england reuters harry potter star daniel radcliffe gains access to a reported million million fortune as he turns on monday but he insists the money wont cast a spell on him',
 'daniel radcliffe as harry potter in harry potter and the order of the phoenix',
 'to the disappointment of gossip columnists around the world the young actor says he has no plans to fritter his cash away on fast cars drink and celebrity parties',
 'i dont plan to be one of those people who as soon as they turn suddenly buy themselves a massive sports car collection or something similar he told an australian interviewer earlier this month i dont think ill be particularly extravagant',
 'the things i like buying are things that cost about pounds books and cds and dvds',
 'at radcliffe will be able to gamble in a casino buy a drink in a pub or see the horror film hostel part ii currently six places below his number one movie on the uk box office chart',
 'details of how hell mark his landmark birthday are

In [None]:
stories["Abstractive_summaries"][0]

['harry potter star daniel radcliffe gets fortune as he turns monday',
 'young actor says he has no plans to fritter his cash away',
 'radcliffes earnings from first five potter films have been held in trust fund']

# Now we will make each story paragraph as short and concise as possible, using the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score. We use ROUGE to extract the most relevant sentences from each story paragraph, based on the summaries given for that story.

# For each story, we calculate the ROUGE-1 F-score between each sentence of the story paragraph and each of the summaries corresponding to that paragraph, and keep the highest score per sentence. We then select the top 5 sentences from the story paragraph by this score.

# In this manner, we make the story paragraphs concise.

# To learn more about ROUGE and extractive summarization, see:

https://arxiv.org/abs/1807.02305
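
# For intuition, the sketch below computes a ROUGE-1 F-score by hand from clipped unigram overlap. It is a simplified illustration: the function name `rouge1_f` is an assumption, and the `rouge` package used in this notebook may tokenize and count n-grams differently, so its scores need not match these values exactly.

```python
from collections import Counter

def rouge1_f(hypothesis, reference):
    """ROUGE-1 F1 from clipped unigram overlap of two whitespace-tokenized strings."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    # each shared unigram counts at most as often as it occurs in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

summary = "young actor says he has no plans to fritter his cash away"
sentence = "the young actor says he will not fritter his cash away on fast cars"
print(round(rouge1_f(summary, sentence), 3))
# prints: 0.615
```

# Here 8 unigrams are shared, the summary has 12 tokens and the sentence has 14, so precision is 8/12, recall is 8/14, and the F-score is 16/26.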

In [None]:
! pip install Rouge

Collecting Rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: Rouge
Successfully installed Rouge-1.0.1


In [None]:
from rouge import Rouge

In [None]:
rouge_obj = Rouge()

In [None]:
def compute_rouge_score(story_para_sent, abstractive_summaries):

  score_per_story_para_sent = list()

  for summary in abstractive_summaries:

    summary_scores = rouge_obj.get_scores(summary, story_para_sent)
    score_per_story_para_sent.append(summary_scores[0]['rouge-1']['f'])
    
  return max(score_per_story_para_sent)

In [None]:
def fetch_each_story_top5_para_sents(story_para, abstractive_summaries):

  story_para_sents = list()
  max_scores = list()

  for i in range(0, len(story_para)):

    story_para_sent = story_para[i]
    max_score = compute_rouge_score(story_para_sent, abstractive_summaries)

    story_para_sents.append(story_para_sent)
    max_scores.append(max_score)
        
  story_para_sents = np.array(story_para_sents)
    
  max_scores = np.array(max_scores)
  # indices of the sentences sorted by descending ROUGE score, truncated to the best five
  idx = np.argsort(max_scores)[::-1][0:5]

  return list(story_para_sents[idx]), max_scores[idx]

In [None]:
fetch_each_story_top5_para_sents(stories["Story_paragraphs"][0],stories["Abstractive_summaries"][0])

(['radcliffes earnings from the first five potter films have been held in a trust fund which he has not been able to touch',
  'to the disappointment of gossip columnists around the world the young actor says he has no plans to fritter his cash away on fast cars drink and celebrity parties',
  'london england reuters harry potter star daniel radcliffe gains access to a reported million million fortune as he turns on monday but he insists the money wont cast a spell on him',
  'daniel radcliffe as harry potter in harry potter and the order of the phoenix',
  'despite his growing fame and riches the actor says he is keeping his feet firmly on the ground'],
 array([0.74285714, 0.63157894, 0.51282051, 0.45454545, 0.28571428]))

In [None]:
len(stories["Story_paragraphs"])

92579

In [None]:
all_stories_top5_sents_dict = dict()
all_stories_top5_sents_scores = dict()

for story_idx in tqdm(range(0, len(stories["Story_paragraphs"]))):
    
  story_para_sents = stories["Story_paragraphs"][story_idx]
  abstractive_summaries = stories["Abstractive_summaries"][story_idx]
  top5_para_sents, top5_sents_scores = fetch_each_story_top5_para_sents(story_para_sents,abstractive_summaries)
  all_stories_top5_sents_dict[story_idx] = top5_para_sents
  all_stories_top5_sents_scores[story_idx] = top5_sents_scores

  0%|          | 0/92579 [00:00<?, ?it/s]

In [None]:
# context managers flush and close each file before it is read back
with open("./all_stories_top5_sents_dict.pkl","wb") as pkl_file_handle:
  pickle.dump(all_stories_top5_sents_dict,pkl_file_handle)

with open("./all_stories_top5_sents_scores.pkl","wb") as pkl_file_handle:
  pickle.dump(all_stories_top5_sents_scores,pkl_file_handle)

In [None]:
with open("./all_stories_top5_sents_dict.pkl","rb") as pkl_file_handle:
  all_stories_top5_sents_dict = pickle.load(pkl_file_handle)
with open("./all_stories_top5_sents_scores.pkl","rb") as pkl_file_handle:
  all_stories_top5_sents_scores = pickle.load(pkl_file_handle)

# Let's now create a Pandas DataFrame in which each row holds the story index (Story_idx), the index of the sentence within its story paragraph (Sent_idx), the sentence itself (Para_sents), and a label indicating whether the sentence is one of the top 5 sentences selected for the extractive summary (Extractive_label).
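
# The labelling rule can be sketched on a toy list of per-sentence scores. The function name `label_top_k` and the scores are illustrative assumptions. One difference to keep in mind: this sketch labels by position, whereas the notebook code labels by string membership, so there a repeated sentence gets the same label as its selected copy.

```python
def label_top_k(scores, k=5):
    """Return a 0/1 label per sentence: 1 for the k highest-scoring sentences."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    return [int(i in top) for i in range(len(scores))]

# made-up ROUGE scores for a story with eight sentences
scores = [0.51, 0.45, 0.63, 0.10, 0.05, 0.74, 0.28, 0.00]
print(label_top_k(scores))
# prints: [1, 1, 1, 0, 0, 1, 1, 0]
```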

In [None]:
len(stories["Story_paragraphs"])

92579

In [None]:
story_idx = list()
sent_idx = list()
sents_list = list()
extractive_label = list()

for i in tqdm(range(0, len(stories["Story_paragraphs"]))):
    
  top5_para_sents = all_stories_top5_sents_dict[i]
    
  for j, para_sent in enumerate(stories["Story_paragraphs"][i]):
        
    ohe_label =  int(para_sent in top5_para_sents)
    extractive_label.append(ohe_label)
    sents_list.append(para_sent)
    sent_idx.append(j)
    story_idx.append(i)

  0%|          | 0/92579 [00:00<?, ?it/s]

In [None]:
extractive_summaries_df = pd.DataFrame()
extractive_summaries_df["Story_idx"] = story_idx
extractive_summaries_df["Sent_idx"] = sent_idx
extractive_summaries_df["Para_sents"] = sents_list
extractive_summaries_df["Extractive_label"] = extractive_label

In [None]:
extractive_summaries_df.head()

Unnamed: 0,Story_idx,Sent_idx,Para_sents,Extractive_label
0,0,0,london england reuters harry potter star danie...,1
1,0,1,daniel radcliffe as harry potter in harry pott...,1
2,0,2,to the disappointment of gossip columnists aro...,1
3,0,3,i dont plan to be one of those people who as s...,0
4,0,4,the things i like buying are things that cost ...,0


In [None]:
len(extractive_summaries_df["Story_idx"].unique())

92465

In [None]:
extractive_summaries_df.to_pickle("extractive_summaries_df.pkl")

In [None]:
data = pd.read_pickle("extractive_summaries_df.pkl")

In [None]:
data.head()

Unnamed: 0,Story_idx,Sent_idx,Para_sents,Extractive_label
0,0,0,london england reuters harry potter star danie...,1
1,0,1,daniel radcliffe as harry potter in harry pott...,1
2,0,2,to the disappointment of gossip columnists aro...,1
3,0,3,i dont plan to be one of those people who as s...,0
4,0,4,the things i like buying are things that cost ...,0


In [None]:
len(data)

1972394

# Let's divide our data into Training, Cross Validation and Testing Data. 

In [None]:
data_story_sents_count = data.groupby("Story_idx").size().reset_index(name="Sentences_count")

In [None]:
data_story_sents_count.head()

Unnamed: 0,Story_idx,Sentences_count
0,0,17
1,1,20
2,2,23
3,3,17
4,4,34


In [None]:
selected_stories_idx = list(data_story_sents_count[data_story_sents_count["Sentences_count"] <= 20]["Story_idx"])

In [None]:
len(selected_stories_idx)

52030

In [None]:
train_story_ids = selected_stories_idx[:30000]
cv_story_ids = selected_stories_idx[30000:40000]
test_story_ids = selected_stories_idx[40000:]

training_data = data[data["Story_idx"].isin(train_story_ids)]
cv_data = data[data["Story_idx"].isin(cv_story_ids)]
testing_data = data[data["Story_idx"].isin(test_story_ids)]

In [None]:
selected_stories_idx

[0,
 1,
 3,
 7,
 9,
 10,
 12,
 13,
 14,
 16,
 17,
 19,
 21,
 23,
 25,
 26,
 27,
 29,
 31,
 35,
 38,
 39,
 42,
 44,
 46,
 48,
 51,
 52,
 54,
 59,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 73,
 74,
 76,
 77,
 78,
 79,
 81,
 82,
 84,
 85,
 87,
 89,
 90,
 91,
 92,
 93,
 94,
 99,
 102,
 103,
 104,
 105,
 107,
 109,
 113,
 115,
 116,
 118,
 119,
 121,
 122,
 123,
 127,
 129,
 138,
 140,
 142,
 143,
 146,
 147,
 150,
 151,
 152,
 153,
 155,
 156,
 157,
 160,
 162,
 163,
 164,
 167,
 168,
 170,
 174,
 175,
 176,
 178,
 182,
 183,
 185,
 186,
 188,
 189,
 192,
 193,
 195,
 196,
 197,
 200,
 201,
 202,
 204,
 205,
 206,
 207,
 208,
 209,
 212,
 216,
 217,
 218,
 219,
 220,
 221,
 222,
 225,
 228,
 229,
 231,
 232,
 233,
 234,
 237,
 238,
 240,
 241,
 242,
 243,
 244,
 246,
 247,
 248,
 249,
 250,
 251,
 252,
 253,
 254,
 255,
 256,
 258,
 259,
 262,
 264,
 265,
 266,
 268,
 269,
 270,
 275,
 276,
 277,
 278,
 279,
 284,
 285,
 286,
 287,
 288,
 290,
 292,
 293,
 295,
 299,
 300,
 301,
 302,
 303,

In [None]:
len(training_data["Story_idx"].unique())

30000

In [None]:
training_data.head()

Unnamed: 0,Story_idx,Sent_idx,Para_sents,Extractive_label
0,0,0,london england reuters harry potter star danie...,1
1,0,1,daniel radcliffe as harry potter in harry pott...,1
2,0,2,to the disappointment of gossip columnists aro...,1
3,0,3,i dont plan to be one of those people who as s...,0
4,0,4,the things i like buying are things that cost ...,0


In [None]:
len(cv_data["Story_idx"].unique())

10000

In [None]:
cv_data.head()

Unnamed: 0,Story_idx,Sent_idx,Para_sents,Extractive_label
993417,49793,0,ewcom bilbo baggins went after treasure this w...,1
993418,49793,1,smaug notched the fourthbest december opening ...,1
993419,49793,2,in order to differentiate it from the first ho...,0
993420,49793,3,in second place disneys animated musical froze...,0
993421,49793,4,tyler perrys a madea christmas unwrapped only ...,0


# Now, let's compute the maximum number of sentences that a story paragraph can have in the training data.

In [None]:
training_data = training_data.sort_values(["Story_idx","Sent_idx"])
sents_count = training_data.groupby("Story_idx").size().reset_index(name="Sentences_count")

In [None]:
sents_count["Sentences_count"].describe()

count    30000.000000
mean        13.424133
std          4.087346
min          1.000000
25%         10.000000
50%         14.000000
75%         17.000000
max         20.000000
Name: Sentences_count, dtype: float64

In [None]:
story_max_length = sents_count["Sentences_count"].max()

In [None]:
story_max_length

20

In [None]:
unique_sents = set(training_data["Para_sents"].tolist())

In [None]:
len(unique_sents)

372544

In [None]:
num_labels = len(training_data["Extractive_label"].unique())

In [None]:
num_labels

2

In [None]:
np.sort(training_data["Extractive_label"].unique())

array([0, 1])

In [None]:
labels2idx = {l: i+1 for i,l in enumerate(np.sort(training_data["Extractive_label"].unique()))}
labels2idx["PAD"] = 0
idx2labels = {i: l for l,i in labels2idx.items()}
print(labels2idx)

{0: 1, 1: 2, 'PAD': 0}


# Let's now add two more columns (Number_tokens and Tokens_list) to the training, cross-validation, and testing data.

In [None]:
training_data.head()

Unnamed: 0,Story_idx,Sent_idx,Para_sents,Extractive_label
0,0,0,london england reuters harry potter star danie...,1
1,0,1,daniel radcliffe as harry potter in harry pott...,1
2,0,2,to the disappointment of gossip columnists aro...,1
3,0,3,i dont plan to be one of those people who as s...,0
4,0,4,the things i like buying are things that cost ...,0


In [None]:
def create_token_count_list(df):

  # work on a copy: the inputs are filtered slices of `data`, and assigning
  # columns to a slice triggers pandas' SettingWithCopyWarning
  df = df.copy()
  df['Number_tokens'] = df["Para_sents"].apply(lambda x: len(x.split()))
  df['Tokens_list'] = df["Para_sents"].apply(lambda x: x.split())
  return df

In [None]:
training_data = create_token_count_list(training_data)
cv_data = create_token_count_list(cv_data)
testing_data = create_token_count_list(testing_data)


In [None]:
training_data.head()

Unnamed: 0,Story_idx,Sent_idx,Para_sents,Extractive_label,Number_tokens,Tokens_list
0,0,0,london england reuters harry potter star danie...,1,32,"[london, england, reuters, harry, potter, star..."
1,0,1,daniel radcliffe as harry potter in harry pott...,1,14,"[daniel, radcliffe, as, harry, potter, in, har..."
2,0,2,to the disappointment of gossip columnists aro...,1,29,"[to, the, disappointment, of, gossip, columnis..."
3,0,3,i dont plan to be one of those people who as s...,0,41,"[i, dont, plan, to, be, one, of, those, people..."
4,0,4,the things i like buying are things that cost ...,0,16,"[the, things, i, like, buying, are, things, th..."


# Now, let's compute the total number of unique tokens inside the training data paragraphs. 

In [None]:
from itertools import chain

In [None]:
total_unique_tokens = set(list(chain(*training_data['Tokens_list'].tolist())))
num_unique_tokens = len(total_unique_tokens)

token2idx = {token: i+2 for i,token in enumerate(total_unique_tokens)}
token2idx["UNK"] = 1
token2idx["PAD"] = 0

idx2token = {i: token for token, i in token2idx.items()}

In [None]:
len(idx2token)

145176

In [None]:
def create_sent_label_example(df):

  # copy first so that a filtered slice is never modified in place
  # (avoids pandas' SettingWithCopyWarning)
  df = df.copy()
  df["Sent_example"] = df[["Para_sents","Extractive_label"]].apply(tuple,axis=1)
  return df

In [None]:
training_data = create_sent_label_example(training_data)
cv_data = create_sent_label_example(cv_data)
testing_data = create_sent_label_example(testing_data)


In [None]:
training_data.iloc[0]["Sent_example"]

('london england reuters harry potter star daniel radcliffe gains access to a reported million million fortune as he turns on monday but he insists the money wont cast a spell on him',
 1)

In [None]:
max_sent_length = 40

def stories_representation(df):

  # group once instead of filtering the whole frame for every story;
  # sort=False keeps the stories in their order of first appearance
  stories_examples = [list(group["Sent_example"]) for _, group in df.groupby("Story_idx", sort=False)]

  # zero-initialised, so every position not filled below keeps the PAD index (0)
  X_token = np.zeros((len(stories_examples), story_max_length, max_sent_length), dtype="int32")

  # the array is filled in a single pass after all stories are collected
  for idx, story_example in enumerate(tqdm(stories_examples)):

    # upper bound on the number of sentences considered per story
    for i, (sent, _) in enumerate(story_example[:story_max_length]):

      # upper bound on the number of tokens considered per sentence
      for j, token in enumerate(sent.split()[:max_sent_length]):

        # tokens that were not seen in the training data map to UNK
        X_token[idx, i, j] = token2idx.get(token, token2idx["UNK"])

  return (X_token, stories_examples)
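
# The encoding of a single sentence can be illustrated in isolation. The sketch below assumes the same reserved indices as `token2idx` (0 for PAD, 1 for UNK); the helper name `encode_sentence` and the toy vocabulary are hypothetical.

```python
PAD, UNK = 0, 1  # same reserved indices as token2idx in the notebook

def encode_sentence(sentence, vocab, max_len):
    """Map tokens to indices, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(tok, UNK) for tok in sentence.split()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

toy_vocab = {"harry": 2, "potter": 3, "star": 4}  # hypothetical vocabulary
print(encode_sentence("harry potter film star", toy_vocab, max_len=6))
# prints: [2, 3, 1, 4, 0, 0]
```

# The out-of-vocabulary token "film" becomes UNK (1), and the two unused positions are filled with PAD (0). A full story is a stack of `story_max_length` such rows.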

In [None]:
X_train,Y_train = stories_representation(training_data)

  0%|          | 0/30000 [00:00<?, ?it/s]

In [None]:
X_train.shape

In [None]:
X_cv,Y_cv = stories_representation(cv_data)

In [None]:
X_cv.shape

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
def prepare_labels(story_examples):

    Y = [[labels2idx[ex_content[1]] for ex_content in sent_example] for sent_example in story_examples]
    Y = pad_sequences(maxlen=story_max_length, sequences=Y, value=labels2idx["PAD"], padding='post', truncating='post')
    Y = Y.reshape(-1, story_max_length, 1)
    
    return Y
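
# The label side works the same way: raw labels 0/1 are shifted to 1/2 so that index 0 stays reserved for padding, which is why the network predicts `num_labels + 1` classes. A minimal pure-Python sketch of that step follows; the helper name `pad_story_labels` is an assumption, and the notebook itself relies on `pad_sequences` for this.

```python
labels2idx = {0: 1, 1: 2, "PAD": 0}  # mirrors the mapping printed earlier

def pad_story_labels(labels, story_max_length):
    """Shift raw 0/1 labels to 1/2 and right-pad (or truncate) with PAD."""
    ids = [labels2idx[label] for label in labels[:story_max_length]]
    return ids + [labels2idx["PAD"]] * (story_max_length - len(ids))

print(pad_story_labels([1, 1, 0], story_max_length=5))
# prints: [2, 2, 1, 0, 0]
```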

In [None]:
train_labels = prepare_labels(Y_train)
cv_labels = prepare_labels(Y_cv)

In [None]:
import tensorflow as tf

In [None]:
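# pair the token tensors with the padded integer labels (train_labels / cv_labels);
# Y_train and Y_cv hold the raw (sentence, label) tuples and cannot be fed to the model
training_data_batch_gen = tf.data.Dataset.from_tensor_slices((X_train, train_labels))
training_data_batch_gen = (training_data_batch_gen.batch(64).cache().prefetch(tf.data.experimental.AUTOTUNE))

cv_data_batch_gen = tf.data.Dataset.from_tensor_slices((X_cv, cv_labels))
cv_data_batch_gen = (cv_data_batch_gen.batch(64).cache().prefetch(tf.data.experimental.AUTOTUNE))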

In [None]:
! wget https://nlp.stanford.edu/data/glove.6B.zip

In [None]:
! unzip /content/drive/MyDrive/glove.6B.zip

In [None]:
def create_embedding_matrix(token_idxes, embedding_path, topic_vector_dim):

  embedding_matrix_dict = dict()

  # read the embedding file as UTF-8 regardless of the platform's default encoding
  with open(embedding_path, encoding="utf-8") as file_handle:

    for line in file_handle:

      values = line.split()
      token = values[0]
      topic_vector = np.asarray(values[1:], dtype='float32')
      embedding_matrix_dict[token] = topic_vector

  num_words = len(token_idxes) 
  embedding_matrix = np.zeros((num_words, topic_vector_dim))

  for token, idx in token_idxes.items():

    topic_vector = embedding_matrix_dict.get(token)

    if topic_vector is not None:
      embedding_matrix[idx] = topic_vector
  
  return embedding_matrix

In [None]:
from tensorflow.keras.layers import Input, TimeDistributed, Embedding, Convolution1D, Dense, Flatten, Activation, RepeatVector, Permute, multiply
from tensorflow.keras.layers import Lambda, Bidirectional, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K

In [None]:
embedding_matrix_txt_path = "/content/drive/MyDrive/glove.6B.100d.txt"
topic_vectors_dim = 100

def text_summarization_model():

  token_input = Input(shape=(story_max_length, max_sent_length,))
  embedding_layer_out = TimeDistributed(Embedding(input_dim=(num_unique_tokens + 2), output_dim=topic_vectors_dim, input_length=max_sent_length,
                                      weights=[create_embedding_matrix(token2idx, embedding_matrix_txt_path, topic_vectors_dim)], trainable=True))(token_input)

  embedding_layer2_out = TimeDistributed(Convolution1D(32, 2, activation='relu',padding= 'same'))(embedding_layer_out)
    
  hidden_layer_out = TimeDistributed(Dense(1, activation='tanh'))(embedding_layer2_out)
  hidden_layer_out = TimeDistributed(Flatten())(hidden_layer_out)
  hidden_layer_out = TimeDistributed(Activation('softmax'))(hidden_layer_out)
  hidden_layer_out = TimeDistributed(RepeatVector(32))(hidden_layer_out)
  hidden_layer_out = TimeDistributed(Permute([2, 1]))(hidden_layer_out)
  hidden_layer_out = multiply([embedding_layer2_out,hidden_layer_out])
    
  sent_embedding = TimeDistributed(Lambda(lambda x: K.sum(x, axis=-2)))(hidden_layer_out)
    
  lstm_nw = Bidirectional(LSTM(units=16, return_sequences=True))(sent_embedding)
  nw_final_output = TimeDistributed(Dense(num_labels + 1, activation='softmax'))(lstm_nw)

  model = Model([token_input], nw_final_output)

  return model

In [None]:
model = text_summarization_model()

In [None]:
lr_start = 1e-5
lr_max = 1e-3
lr_rampup_epochs = 5
lr_to_sustain_epochs = 0
lr_step_decay = 0.75

In [None]:
def lr_scheduler(epoch):

  if epoch < lr_rampup_epochs:
    lr = (lr_max - lr_start) / lr_rampup_epochs * epoch + lr_start

  elif epoch < lr_rampup_epochs + lr_to_sustain_epochs:
    lr = lr_max

  else:
    lr = lr_max * lr_step_decay**((epoch - lr_rampup_epochs - lr_to_sustain_epochs)//10)

  return lr
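
# To see what this schedule produces, the standalone sketch below re-states it with the same constants and prints the learning rate at a few epochs. The function name `lr_at` is an assumption used only for this illustration.

```python
lr_start, lr_max = 1e-5, 1e-3
lr_rampup_epochs, lr_to_sustain_epochs, lr_step_decay = 5, 0, 0.75

def lr_at(epoch):
    """Linear warm-up, optional plateau, then a 0.75x step every 10 epochs."""
    if epoch < lr_rampup_epochs:
        return (lr_max - lr_start) / lr_rampup_epochs * epoch + lr_start
    if epoch < lr_rampup_epochs + lr_to_sustain_epochs:
        return lr_max
    steps = (epoch - lr_rampup_epochs - lr_to_sustain_epochs) // 10
    return lr_max * lr_step_decay ** steps

for epoch in (0, 1, 4, 5, 14, 15, 25):
    print(epoch, f"{lr_at(epoch):.6f}")
```

# The rate climbs linearly from 1e-5 at epoch 0 to 1e-3 at epoch 5, stays there through epoch 14, and is multiplied by 0.75 at epoch 15 and again at epoch 25.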

In [None]:
lr_scheduler_cb = tf.keras.callbacks.LearningRateScheduler(lr_scheduler, verbose=True)
early_stopping_cb = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

In [None]:
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])

In [None]:
model.fit(training_data_batch_gen, validation_data=cv_data_batch_gen,epochs=50,callbacks=[lr_scheduler_cb, early_stopping_cb], verbose=1)

# Write the code to perform the inference on this network and provide the output as extractive summary to the input paragraph. 
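
# One possible way to turn the network's output into an extractive summary is sketched below. It assumes `probs` holds the softmax output for a single story, for example `model.predict(X)[0]`, with one row per sentence slot and the class order of `labels2idx` (index 0 = PAD, 1 = not in summary, 2 = in summary). The helper name `extractive_summary_from_probs` and the `top_k` fallback are illustrative choices, not part of the original notebook.

```python
def extractive_summary_from_probs(sentences, probs, top_k=3):
    """Pick summary sentences from per-sentence class probabilities.

    sentences: the story's cleaned sentences, in original order.
    probs: one [p_pad, p_not_summary, p_summary] row per sentence slot;
           rows beyond len(sentences) are padding and are ignored.
    """
    scored = [(probs[i][2], i) for i in range(min(len(sentences), len(probs)))]
    # sentences whose most likely class is "in summary" (class index 2)
    chosen = [i for p, i in scored if p == max(probs[i])]
    if not chosen:
        # fallback (an assumption): keep the top_k most probable sentences
        chosen = [i for _, i in sorted(scored, reverse=True)[:top_k]]
    # restore document order so the summary reads naturally
    return [sentences[i] for i in sorted(chosen)]

# Toy probabilities standing in for a real model prediction.
toy_sentences = ["sentence a", "sentence b", "sentence c"]
toy_probs = [
    [0.0, 0.2, 0.8],   # predicted "in summary"
    [0.0, 0.9, 0.1],   # predicted "not in summary"
    [0.1, 0.3, 0.6],   # predicted "in summary"
    [1.0, 0.0, 0.0],   # padding slot
]
print(extractive_summary_from_probs(toy_sentences, toy_probs))
# prints: ['sentence a', 'sentence c']
```

# For a real story, the input paragraph would first need the same cleaning and token encoding as the training data (for instance via `preprocess_single_sent_per_story` and `stories_representation`) before being passed to `model.predict`.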