## CSCE 5290 - Fall 2021
### Final Project - CNN dataset preprocessing
After rerunning the same expensive preprocessing over and over, I dedicated a notebook to producing a CSV that is ready for training. I should have done this much earlier.

This notebook preprocesses the test set.

### Preprocessing

In [1]:
%%capture
# Install the TF dependencies (the %%capture cell magic has to be the first line of the cell)
!pip install tensorflow-datasets
!pip install tensorflow-text
!python -m spacy download en_core_web_sm

In [2]:
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow_datasets as tfds

In [4]:
# Load and split the dataset
# I wish I knew about this much easier TFDS approach when I started the project
raw_train_ds, raw_test_ds = tfds.load('cnn_dailymail', split=['train', 'test'])
print(len(raw_test_ds))

11490


In [5]:
import re

# Slightly reworked to work with TF tensors
# Remove non-alphabetic characters (Data Cleaning)
# Use TF string processing instead of python built-ins
def clean_text(row):
    
  row = tf.strings.lower(row)

  # Replace tabs, carriage returns, and newlines with spaces
  row = tf.strings.regex_replace(row, "(\\t)", ' ')
  row = tf.strings.regex_replace(row, "(\\r)", ' ')
  row = tf.strings.regex_replace(row, "(\\n)", ' ')

  # Remove the CNN dateline marker "(cnn) --"
  row = tf.strings.regex_replace(row, r"\(cnn\) --", ' ')
  
  # Remove _ if it occurs more than one time consecutively
  row = tf.strings.regex_replace(row, "(__+)", ' ')   
  
  # Remove - if it occurs more than one time consecutively
  row = tf.strings.regex_replace(row, "(--+)", ' ')
  
  # Remove ~ if it occurs more than one time consecutively
  row = tf.strings.regex_replace(row, "(~~+)", ' ')   
  
  # Remove + if it occurs more than one time consecutively
  row = tf.strings.regex_replace(row, r"(\+\++)", ' ')

  # Remove . if it occurs more than one time consecutively
  row = tf.strings.regex_replace(row, r"(\.\.+)", ' ')

  # Remove the characters - <>()|&©ø[]"',;?~*!
  row = tf.strings.regex_replace(row, r"[<>()|&©ø\[\]\'\",;?~*!]", ' ')
  
  # Remove mailto:
  row = tf.strings.regex_replace(row, "(mailto:)", ' ') 
  
  # Remove \x9* in text
  row = tf.strings.regex_replace(row, r"(\\x9\d)", ' ') 
  
  # Remove ., -, and : when followed by whitespace (punctuation at the end of a word)
  row = tf.strings.regex_replace(row, r"(\.\s+)", ' ')
  row = tf.strings.regex_replace(row, r"(\-\s+)", ' ')
  row = tf.strings.regex_replace(row, r"(\:\s+)", ' ')

  # Collapse runs of whitespace into a single space
  row = tf.strings.regex_replace(row, r"(\s+)", ' ')

  # Remove the single character hanging between any two spaces
  row = tf.strings.regex_replace(row, r"(\s+.\s+)", ' ')
  
  return row

def clean(row):
  row['article'] = clean_text(row['article'])
  row['highlights'] = clean_text(row['highlights'])
  return row
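
The cleaning rules can be spot-checked on a single string without building a TF pipeline. The sketch below mirrors a subset of them with Python's `re` module. `clean_text_py` and `RULES` are hypothetical names introduced only for this illustration, and Python's regex engine is not identical to the RE2 engine behind `tf.strings.regex_replace`, so treat it as an approximation rather than a drop-in replacement.

```python
import re

# Hypothetical plain-Python mirror of a subset of the clean_text rules.
# Order matters: it follows the order used in clean_text.
RULES = [
    (r"[\t\r\n]", " "),                        # tabs, carriage returns, newlines
    (r"\(cnn\) --", " "),                      # CNN dateline marker
    (r"--+", " "),                             # two or more consecutive hyphens
    (r"\.\.+", " "),                           # two or more consecutive periods
    (r"[<>()|&©ø\[\]\'\",;?~*!]", " "),        # individual punctuation characters
    (r"\.\s+", " "),                           # period followed by whitespace
    (r"\s+", " "),                             # collapse whitespace
    (r"\s+.\s+", " "),                         # single character between spaces
]

def clean_text_py(text: str) -> str:
    """Lower-case the text, then apply each rule in sequence."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

sample = "(CNN) -- Hello, world!  This is a test... Really?"
print(clean_text_py(sample).split())
```

Note that the last rule also drops one-letter words such as "a", which is why the cleaned articles further down read "said at public hearing" rather than "said at a public hearing".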

In [6]:
# Clean all datasets
train_ds = raw_train_ds.map(clean)
test_ds = raw_test_ds.map(clean)

In [7]:
from tqdm import tqdm
# Prepare as dataframe
test_df = tfds.as_dataframe(test_ds)

In [9]:
# Set a maximum number of examples to keep
max_examples = 10000

# Set length parameters (from EDA workbook)
min_text_len = 10
max_text_len = 500
min_summary_len = 10
max_summary_len = 50

import pandas as pd
rows = []
print(f'filtering {len(test_df)} records')
for _, r in tqdm(test_df.iterrows()):

  # The dataframe holds byte strings, so decode before counting words
  article = r['article'].decode('utf-8')
  summary = r['highlights'].decode('utf-8')

  a_len = len(article.split())
  s_len = len(summary.split())

  if min_text_len < a_len < max_text_len \
    and min_summary_len < s_len < max_summary_len:
    rows.append({'summary': summary, 'text': article})

  if len(rows) >= max_examples:
    break

# Build the frame once from the collected rows
# (DataFrame.append is not available in pandas >= 2.0)
clean_test_df = pd.DataFrame(rows, columns=['summary', 'text'])
clean_test_df.head(5)

filtering 11490 records


11490it [00:10, 1114.90it/s]


Unnamed: 0,summary,text
0,experts question if packed out planes are putt...,ever noticed how plane seats appear to be gett...
1,drunk teenage boy climbed into lion enclosure ...,a drunk teenage boy had to be rescued by secur...
2,nottingham forest are close to extending dougi...,dougie freedman is on the verge of agreeing ne...
3,fiorentina goalkeeper neto has been linked wit...,liverpool target neto is also wanted by psg an...
4,giant pig fell into the swimming pool at his h...,this is the moment that crew of firefighters s...
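
The same length filter can also be written without a Python-level loop. The sketch below is one possible vectorized version, assuming the columns already hold decoded `str` values named `text` and `summary`. `filter_by_length` and its `limit` parameter are hypothetical names, not part of the notebook above.

```python
import pandas as pd

def filter_by_length(df, min_text_len=10, max_text_len=500,
                     min_summary_len=10, max_summary_len=50, limit=10000):
    """Keep rows whose word counts fall strictly inside both ranges."""
    # Whitespace-split word counts, matching len(s.split()) in the loop version
    text_len = df['text'].str.split().str.len()
    summary_len = df['summary'].str.split().str.len()

    mask = (
        text_len.gt(min_text_len) & text_len.lt(max_text_len)
        & summary_len.gt(min_summary_len) & summary_len.lt(max_summary_len)
    )
    # head(limit) plays the role of the early break in the loop version
    return df[mask].head(limit).reset_index(drop=True)

# Toy data with small thresholds so the effect is easy to see
toy = pd.DataFrame({
    'text': ['one two three four', 'one two', 'a b c d e f g h'],
    'summary': ['x y', 'x y', 'x y z w v'],
})
kept = filter_by_length(toy, min_text_len=2, max_text_len=6,
                        min_summary_len=1, max_summary_len=4)
print(kept)
```

Only the first toy row survives: the second article is too short and the third is too long.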


In [None]:
len(clean_test_df)

In [10]:
clean_test_df.to_csv('cnn_cleaned_test_10k.csv')
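
One detail worth knowing when the CSV is read back for training: `to_csv` writes the row index by default, and `read_csv` then surfaces it as an `Unnamed: 0` column. The sketch below shows the difference on a toy frame, using in-memory buffers instead of files. Whether to pass `index=False` is a choice for the downstream notebook and is only an assumption here.

```python
import io
import pandas as pd

df = pd.DataFrame({'summary': ['s1', 's2'], 'text': ['t1', 't2']})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False keeps only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Alternatively, the index column can be absorbed on the reading side with `pd.read_csv(path, index_col=0)`.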

In [11]:
# Test-set inputs and targets (the train_ prefix is only a naming holdover)
train_X = clean_test_df['text']
train_y = clean_test_df['summary']

In [12]:
import spacy
from time import time

nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser']) 

# Process text as batches and yield Doc objects in order
text = [str(doc) for doc in nlp.pipe(train_X, batch_size=5000)]

summary = ['_START_ '+ str(doc) + ' _END_' for doc in nlp.pipe(train_y, batch_size=5000)]

In [13]:
text[0]

'ever noticed how plane seats appear to be getting smaller and smaller with increasing numbers of people taking to the skies some experts are questioning if having such packed out planes is putting passengers at risk they say that the shrinking space on aeroplanes is not only uncomfortable it putting our health and safety in danger more than squabbling over the arm rest shrinking space on planes putting our health and safety in danger this week u.s consumer advisory group set up by the department of transportation said at public hearing that while the government is happy to set standards for animals flying on planes it doesn stipulate minimum amount of space for humans in world where animals have more rights to space and food than humans said charlie leocha consumer representative on the committee.\xa0 it is time that the dot and faa take stand for humane treatment of passengers but could crowding on planes lead to more serious issues than fighting for space in the overhead lockers cra

In [14]:
summary[0]

'_START_ experts question if packed out planes are putting passengers at risk u.s consumer advisory group says minimum space must be stipulated safety tests conducted on planes with more leg room than airlines offer . _END_'

In [15]:
import pandas as pd
pre = pd.DataFrame()
pre['text'] = pd.Series(text)
pre['summary'] = pd.Series(summary)

In [16]:
# Work on a copy so that `pre` keeps the summaries without sostok/eostok
post_pre = pre.copy()
# Add sostok and eostok at the start and end of the summary
post_pre['summary'] = post_pre['summary'].apply(lambda x: 'sostok ' + x + ' eostok')

In [17]:
post_pre.head(2)

Unnamed: 0,text,summary
0,ever noticed how plane seats appear to be gett...,sostok _START_ experts question if packed out ...
1,a drunk teenage boy had to be rescued by secur...,sostok _START_ drunk teenage boy climbed into ...
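
As the preview shows, each summary now carries two pairs of markers: `_START_`/`_END_` from the spaCy cell and `sostok`/`eostok` from the cell above. A minimal sketch of the wrapping step on a toy series follows. `wrap_summary` is a hypothetical helper, and which marker pair the downstream model actually relies on is not stated in this notebook.

```python
import pandas as pd

def wrap_summary(summary, start='sostok', end='eostok'):
    """Surround a summary with start and end tokens, separated by spaces."""
    return f'{start} {summary} {end}'

demo = pd.Series(['_START_ a short summary _END_'])
wrapped = demo.apply(wrap_summary)
print(wrapped[0])  # sostok _START_ a short summary _END_ eostok
```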


In [18]:
post_pre.to_csv('cnn_post_pre_test_10k.csv')