## CSCE 5290 - Fall 2021
### Final Project - CNN dataset preprocessing
After rerunning the same expensive preprocessing over and over, I decided to dedicate a notebook to producing a CSV that is ready for training. I should have done this much earlier.

### Preprocessing

In [1]:
%%capture
# Install the TF dependencies (%%capture has to be the first line of the cell)
!pip install tensorflow-datasets
!pip install tensorflow-text
!python -m spacy download en_core_web_sm

In [2]:
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow_datasets as tfds

In [18]:
# Load and split the dataset
# I wish I knew about this much easier TFDS approach when I started the project
raw_train_ds, raw_val_ds = tfds.load('cnn_dailymail', split=['train', 'validation'])
print(len(raw_train_ds))

287113


In [14]:
import re

# Slightly reworked to work with TF tensors
# Remove non-alphabetic characters (Data Cleaning)
# Use TF string processing instead of python built-ins
def clean_text(row):
    
  row = tf.strings.lower(row)

  # Replace tabs, carriage returns, and newlines with spaces
  row = tf.strings.regex_replace(row, "(\\t)", ' ')
  row = tf.strings.regex_replace(row, "(\\r)", ' ')
  row = tf.strings.regex_replace(row, "(\\n)", ' ')

  # CNN marker
  row = tf.strings.regex_replace(row, r"\(cnn\) --", ' ')
  
  # Remove _ if it occurs more than one time consecutively
  row = tf.strings.regex_replace(row, "(__+)", ' ')   
  
  # Remove - if it occurs more than one time consecutively
  row = tf.strings.regex_replace(row, "(--+)", ' ')
  
  # Remove ~ if it occurs more than one time consecutively
  row = tf.strings.regex_replace(row, "(~~+)", ' ')   
  
  # Remove + if it occurs more than one time consecutively
  row = tf.strings.regex_replace(row, r"(\+\++)", ' ')

  # Remove . if it occurs more than one time consecutively
  row = tf.strings.regex_replace(row, r"(\.\.+)", ' ')

  # Remove the characters <>()|&©ø[]'",;?~*!
  row = tf.strings.regex_replace(row, r"[<>()|&©ø\[\]\'\",;?~*!]", ' ')
  
  # Remove mailto:
  row = tf.strings.regex_replace(row, "(mailto:)", ' ') 
  
  # Remove \x9* in text
  row = tf.strings.regex_replace(row, r"(\\x9\d)", ' ') 
  
  # Remove punctuation (. - :) at the end of a word
  row = tf.strings.regex_replace(row, r"(\.\s+)", ' ')
  row = tf.strings.regex_replace(row, r"(\-\s+)", ' ')
  row = tf.strings.regex_replace(row, r"(\:\s+)", ' ')

  # Collapse runs of whitespace into a single space
  row = tf.strings.regex_replace(row, r"(\s+)", ' ')

  # Remove any single character left hanging between two spaces
  row = tf.strings.regex_replace(row, r"(\s+.\s+)", ' ')
  
  return row

def clean(row):
  row['article'] = clean_text(row['article'])
  row['highlights'] = clean_text(row['highlights'])
  return row
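
The regex rules in `clean_text` can be spot-checked without loading TensorFlow. The sketch below is not the notebook's function: it mirrors a handful of the same patterns, in the same order, using Python's `re` module. It assumes these simple patterns behave the same under `re` as under the RE2 engine behind `tf.strings.regex_replace`, and `clean_text_py` is a hypothetical name.

```python
import re

# Hypothetical pure-Python mirror of a few clean_text rules, for quick checks.
# Assumption: these simple patterns behave the same in `re` and in RE2.
RULES = [
    (r"[\t\r\n]", " "),                  # tabs, carriage returns, newlines
    (r"\(cnn\) --", " "),                # CNN marker
    (r"--+", " "),                       # runs of dashes
    (r"\.\.+", " "),                     # runs of periods
    (r"[<>()|&©ø\[\]'\",;?~*!]", " "),   # assorted punctuation
    (r"\.\s+", " "),                     # period at the end of a word
    (r"\s+", " "),                       # collapse whitespace
    (r"\s+.\s+", " "),                   # single character between spaces
]

def clean_text_py(text: str) -> str:
    text = text.lower()
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text)
    return text

sample = "(CNN) -- Hello, world!\nThis is a test... Done."
print(repr(clean_text_py(sample)))
# ' hello world this is test done.'
```

Two side effects are visible in the result: the removed CNN marker leaves a leading space, and the article "a" disappears because the last rule drops any single character that sits between two spaces.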

In [20]:
# Clean all datasets
train_ds = raw_train_ds.map(clean)
val_ds = raw_val_ds.map(clean)

In [33]:
from tqdm import tqdm
# Prepare as dataframe
train_df = tfds.as_dataframe(train_ds)

In [42]:
# Set a maximum amount of training examples. 
max_train = 50000

# Set length parameters (from EDA workbook)
min_text_len = 10
max_text_len = 500
min_summary_len = 10
max_summary_len = 50

import pandas as pd
print(f'filtering {len(train_df)} records')
rows = []  # collect matches in a list; DataFrame.append was removed in pandas 2.0
for _, r in tqdm(train_df.iterrows()):

  # Use the row yielded by iterrows() instead of re-indexing with iloc
  article = r['article'].decode('utf-8')
  summary = r['highlights'].decode('utf-8')

  a_len = len(article.split())
  s_len = len(summary.split())

  if min_text_len < a_len < max_text_len \
    and min_summary_len < s_len < max_summary_len:
    rows.append({'summary': summary, 'text': article})

  if len(rows) >= max_train:
    break

# Build the frame once; the column order matches the output shown below
clean_train_df = pd.DataFrame(rows, columns=['summary', 'text'])

clean_train_df.head(5)

filtering 287113 records


183769it [03:25, 892.27it/s]


| | summary | text |
|---|---|---|
| 0 | bishop john folda of north dakota is taking ti... | by associated press published 14:11 est 25 oct... |
| 1 | criminal complaint cop used his role to help c... | ralph mata was an internal affairs lieutenant... |
| 2 | prime minister and his family are enjoying an ... | he been accused of making many fashion faux pa... |
| 3 | airstrike kills nine syrians in refugee camp s... | beirut syria carried out an airstrike on refug... |
| 4 | china top security official visited afghanista... | kabul afghanistan china top security official ... |
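
The length filter in the loop above can also be written as a small predicate, which makes the bounds easy to test in isolation. This is a sketch rather than the notebook's code: `within_bounds` and `keep_pair` are hypothetical helper names, and the default bounds simply repeat the values set in the cell.

```python
def within_bounds(text: str, min_len: int, max_len: int) -> bool:
    """True when the whitespace-token count lies strictly between the bounds."""
    return min_len < len(text.split()) < max_len

def keep_pair(article: str, summary: str,
              min_text_len: int = 10, max_text_len: int = 500,
              min_summary_len: int = 10, max_summary_len: int = 50) -> bool:
    """Hypothetical helper: apply the same exclusive bounds as the loop above."""
    return (within_bounds(article, min_text_len, max_text_len)
            and within_bounds(summary, min_summary_len, max_summary_len))

article = " ".join(["word"] * 120)
print(keep_pair(article, " ".join(["word"] * 20)))  # True: 20-word summary
print(keep_pair(article, " ".join(["word"] * 10)))  # False: exactly 10 words
```

Because both comparisons are strict, a summary of exactly 10 words or an article of exactly 500 words is rejected.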


In [43]:
len(clean_train_df)

50000
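
The early `break` once `max_train` rows are collected amounts to taking the first N records that pass the filter. A minimal sketch of that idea on toy data, using `itertools.islice` over a lazy filter; `first_n_matching` and the toy pairs are hypothetical and stand in for the real dataset:

```python
from itertools import islice

def first_n_matching(pairs, predicate, n):
    """Hypothetical helper: lazily collect at most n pairs that satisfy predicate."""
    return list(islice((p for p in pairs if predicate(*p)), n))

# Toy (article, summary) pairs; keep pairs whose summary has more than 2 words.
pairs = [("a b c d", "x y z"), ("e f", "q"), ("g h i", "r s t u"), ("j k", "v w x")]
kept = first_n_matching(pairs, lambda art, summ: len(summ.split()) > 2, n=2)
print(kept)
# [('a b c d', 'x y z'), ('g h i', 'r s t u')]
```

Iteration stops as soon as `n` matches are found, so the fourth pair is never examined.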

In [44]:
clean_train_df.to_csv('cnn_cleaned_50k.csv')

In [46]:
train_X = clean_train_df['text']
train_y = clean_train_df['summary']

In [47]:
import spacy
from time import time

nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser']) 

# Process text as batches and yield Doc objects in order
text = [str(doc) for doc in nlp.pipe(train_X, batch_size=5000)]

summary = ['_START_ '+ str(doc) + ' _END_' for doc in nlp.pipe(train_y, batch_size=5000)]

In [48]:
text[0]

'by associated press published 14:11 est 25 october 2013 updated 15:36 est 25 october 2013 the bishop of the fargo catholic diocese in north dakota has exposed potentially hundreds of church members in fargo grand forks and jamestown to the hepatitis virus in late september and early october the state health department has issued an advisory of exposure for anyone who attended five churches and took communion bishop john folda pictured of the fargo catholic diocese in north dakota has exposed potentially hundreds of church members in fargo grand forks and jamestown to the hepatitis state immunization program manager molly howell says the risk is low but officials feel it important to alert people to the possible exposure the diocese announced on monday that bishop john folda is taking time off after being diagnosed with hepatitis the diocese says he contracted the infection through contaminated food while attending conference for newly ordained bishops in italy last month symptoms of h

In [49]:
summary[0]

'_START_ bishop john folda of north dakota is taking time off after being diagnosed he contracted the infection through contaminated food in italy church members in fargo grand forks and jamestown could have been exposed . _END_'

In [54]:
import pandas as pd
pre = pd.DataFrame()
pre['text'] = pd.Series(text)
pre['summary'] = pd.Series(summary)

In [55]:
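# Work on a copy so that `pre` is not modified and re-running this cell does not stack tokens
post_pre = pre.copy()
# Add sostok and eostok at the start and end of the summary
post_pre['summary'] = post_pre['summary'].apply(lambda x: 'sostok ' + x + ' eostok')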

In [56]:
post_pre.head(2)

| | text | summary |
|---|---|---|
| 0 | by associated press published 14:11 est 25 oct... | sostok _START_ bishop john folda of north dako... |
| 1 | ralph mata was an internal affairs lieutenant... | sostok _START_ criminal complaint cop used his... |
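
Note that each summary ends up wrapped twice: once with `_START_`/`_END_` in the spaCy cell, and again with `sostok`/`eostok`. The sketch below reproduces the two wrapping steps on a plain string so the final token sequence is easy to see. `wrap_summary` is a hypothetical helper, not part of the notebook.

```python
def wrap_summary(summary: str) -> str:
    """Hypothetical helper mirroring the two wrapping steps above."""
    inner = '_START_ ' + summary + ' _END_'   # first pass (spaCy cell)
    return 'sostok ' + inner + ' eostok'      # second pass (sostok/eostok cell)

wrapped = wrap_summary('bishop john folda is taking time off')
tokens = wrapped.split()
print(tokens[:2], tokens[-2:])
# ['sostok', '_START_'] ['_END_', 'eostok']
```

Every target sequence therefore carries two start markers and two end markers. If the training notebook only treats one pair as special, the other pair would presumably be learned as ordinary vocabulary.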


In [57]:
post_pre.to_csv('cnn_post_pre_50k.csv')