### Feature Engineering--True and False News
The final step of feature engineering is to tokenize the text of the stories.  The raw text, a sequence of characters, cannot be fed directly to the algorithms, as most of them expect fixed-size numerical feature vectors rather than raw text documents of variable length.

I considered doing this using discrete Scikit-Learn modules, but TensorFlow 2.1 recently added support for a TextVectorization layer, and 2.3 adds experimental support for the new Keras Preprocessing Layers API. These layers allow you to package preprocessing logic inside the model for easier deployment, so the model can take raw strings, images, or rows from a table as input.  This module also includes a 

The processing of each sample contains the following steps:

1. Standardize each sample.  Lowercase all words and strip punctuation. 

2. Split each sample into substrings (usually words).

3. Recombine substrings into tokens (usually ngrams). Options here include determining how many words to include in each token.  Text classification tasks typically consider tokens of 1 or 2 words, but we may experiment with more than that.

4. Index tokens (associate a unique int value with each token).

5. Transform each sample using this index, either into a vector of ints or a dense float vector.  This layer can also set the length of the resulting vector, either truncating it or padding it with zeroes so it fits the size of our input layer.  It also has several output modes, including tf-idf, a weighting algorithm based on the frequency of words found in the dataset.
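
The five steps above can be sketched in plain Python. This is only an illustration of the idea, not the TextVectorization implementation: the function names (`standardize`, `make_ngrams`, `build_index`, `vectorize`) and the toy samples are hypothetical, and the choice of 0 for padding and 1 for unknown tokens is an assumption made for the sketch.

```python
import string
from collections import Counter


def standardize(text):
    """Step 1: lowercase and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def make_ngrams(words, max_n):
    """Step 3: recombine words into tokens of 1..max_n words."""
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


def build_index(samples, max_n):
    """Step 4: give every token a unique int (0 = padding, 1 = unknown)."""
    counts = Counter()
    for sample in samples:
        # Step 2 is the .split() call: break the sample into words
        counts.update(make_ngrams(standardize(sample).split(), max_n))
    # Most frequent tokens receive the smallest indices
    return {token: i + 2 for i, (token, _) in enumerate(counts.most_common())}


def vectorize(sample, index, max_n, length):
    """Step 5: map tokens to ints, then truncate or zero-pad to a fixed length."""
    tokens = make_ngrams(standardize(sample).split(), max_n)
    ids = [index.get(token, 1) for token in tokens][:length]
    return ids + [0] * (length - len(ids))


# Hypothetical toy samples, not taken from the news dataset
samples = ["The cat sat.", "The dog sat!"]
index = build_index(samples, max_n=1)
vectors = [vectorize(s, index, max_n=1, length=5) for s in samples]
print(index)
print(vectors)
```

Every vector has the same length regardless of how long the sample was, which is what lets the result feed a fixed-size input layer.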

From the Scikit-learn documentation: 
> The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.  In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.
> In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.

We will consider varying output modes as we go forward. 
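
To make the quoted idea concrete, here is a minimal sketch of a textbook tf-idf weighting, where the weight of a word in a document is its count in that document times ln(N / df), with N the number of documents and df the number of documents containing the word. The function name `tf_idf` and the toy documents are hypothetical. Libraries usually apply smoothed variants of this formula, so their numbers will differ from this sketch.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document, using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for words in tokenized for word in set(words))
    weights = []
    for words in tokenized:
        tf = Counter(words)  # raw count of each word in this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


# Hypothetical toy corpus: "the" appears in every document
docs = ["the senate passed the bill", "the team won the game", "the bill was vetoed"]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "the" appears in every document, so its weight drops to zero, while a word found in only one document ("senate") outweighs one found in two ("bill").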

Since the TextVectorization layer will allow us to convert the text of our stories to integer tensors, there is not much else for us to do with feature engineering.  

## Vocabulary Size
The default TextVectorization settings will retain all words found as part of our vocabulary.  In experimenting, we found some value in altering the vocabulary size.  
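
As a rough illustration of what limiting the vocabulary means, the sketch below keeps only the most frequent words and maps everything else to a single placeholder token. The function name `cap_vocabulary`, the `"[UNK]"` placeholder, and the toy texts are assumptions for this example. In TextVectorization itself the cap is set with the `max_tokens` argument, and the layer's exact handling of reserved indices is not reproduced here.

```python
from collections import Counter


def cap_vocabulary(texts, max_words):
    """Keep the max_words most frequent words; replace the rest with '[UNK]'."""
    counts = Counter(word for text in texts for word in text.lower().split())
    keep = {word for word, _ in counts.most_common(max_words)}
    capped = [
        [word if word in keep else "[UNK]" for word in text.lower().split()]
        for text in texts
    ]
    return keep, capped


# Hypothetical toy texts, not taken from the news dataset
texts = ["the bill passed", "the bill failed", "the vote passed"]
keep, capped = cap_vocabulary(texts, max_words=3)
print(sorted(keep))
print(capped)
```

Rare words all collapse into the same placeholder, so a smaller vocabulary trades away detail about infrequent words in exchange for fewer features.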

Below is the code we used to count words and find those that appeared only once, on the hypothesis that a word appearing in only one article cannot inform decisions about any other article. Some further discussion follows the code.

In [1]:
!pip install --upgrade numpy
!pip install --upgrade pandas

# we want tensorflow 2.3
!pip install --upgrade tensorflow  


In [2]:
import tensorflow as tf
print("Tensorflow version: ", tf.__version__)
if tf.__version__ != '2.3.0':
    raise ValueError('please upgrade to TensorFlow 2.3, or restart your Kernel (Kernel->Restart & Clear Output)')

Tensorflow version:  2.3.0


In [3]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from pprint import pprint
from time import time
import logging
import numpy as np
import pandas as pd
import string
import re

from keras.utils import to_categorical
from keras import models
from keras import layers

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

from sklearn.model_selection import train_test_split

from ibm_botocore.client import Config
import ibm_boto3

Using TensorFlow backend.


In [4]:
#Get our data
# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials_news = {
    'IAM_SERVICE_ID': 'iam-ServiceId-32e8ee67-397c-4ff1-b69b-543172331f43',
    'IBM_API_KEY_ID': 'Rx4FR4JSAueCnnIsoevsgYgOsuh8LCXtbkFpFpC0EmVU',
    'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.cloud.ibm.com/oidc/token',
    'BUCKET': 'advanceddatasciencecapstone-donotdelete-pr-tqabpnbxebk8rm',
    'FILE': 'dfTrueFalseNews.pkl'
}

def download_file_cos(credentials,local_file_name,key):  
    cos = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=credentials['IBM_API_KEY_ID'],
    ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
    ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
    config=Config(signature_version='oauth'),
    endpoint_url=credentials['ENDPOINT'])
    try:
        res=cos.download_file(Bucket=credentials['BUCKET'],Key=key,Filename=local_file_name)
    except Exception as e:
        print(Exception, e)
    else:
        print('File Downloaded')

def upload_file_cos(credentials,local_file_name,key):  
    cos = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=credentials['IBM_API_KEY_ID'],
    ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
    ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
    config=Config(signature_version='oauth'),
    endpoint_url=credentials['ENDPOINT'])
    try:
        res=cos.upload_file(Filename=local_file_name, Bucket=credentials['BUCKET'],Key=key)
    except Exception as e:
        print(Exception, e)
    else:
        print(' File Uploaded')
        
# download_file_cos returns nothing; it only writes the file to local disk
download_file_cos(credentials_news, "dfTrueFalseNews.pkl", "dfTrueFalseNews.pkl")

File Downloaded


In [5]:
dfNews = pd.read_pickle('dfTrueFalseNews.pkl')
#dfNews['truthvalue'] = pd.Categorical(dfNews['truthvalue'])

print (dfNews.shape, dfNews.columns, '\n',  dfNews.dtypes)

(1129, 3) Index(['text', 'source', 'truthvalue'], dtype='object') 
 text          object
source        object
truthvalue    object
dtype: object


In [28]:
dfNews.head()

| | text | source | truthvalue |
|---|---|---|---|
| tech010legit | AT&T pulls ads from YouTube, other Google site... | MihalceaNewsLegit | 1 |
| tech008legit | Are Autonomous Cars Ready to Go It Alone? Tra... | MihalceaNewsLegit | 1 |
| polit24legit | Back Channel to Trump: Loyal Aide in Trump Tow... | MihalceaNewsLegit | 1 |
| edu32legit | Students Experiment With Drones for 4-H Nation... | MihalceaNewsLegit | 1 |
| sports01legit | Basketball 'bible' auction sets sports memora... | MihalceaNewsLegit | 1 |


In [29]:
x = dfNews['text'].values
y = dfNews['truthvalue'].values
print(type(x), type(y))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>


In [30]:
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)

# Once we have our handles, we format the datasets in a Keras-fit compatible
# format: a tuple of the form (text_data, label).
def format_dataset(x, y):
  return (x, y)

train_dataset = list(map(format_dataset, X_train, y_train))
test_dataset = list(map(format_dataset, X_test, y_test))

# We also create a dataset with only the textual data in it. This will be used
# to build our vocabulary later on.
textL_dataset = list(x)


In [31]:
# text_dataset is created in the next cell (In [32]); run that cell before this one
print (len(X_train), len(X_test), len(y_train), len(y_test), len(text_dataset), '\n',
type(X_train), type(X_test), type(y_train), type(y_test), type(text_dataset))


903 226 903 226 1129 
 <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>


In [32]:
# move our numpy structures into Tensorflow datasets
Dataset = tf.data.Dataset
text_dataset = tf.data.Dataset.from_tensor_slices(textL_dataset)

features_dataset = Dataset.from_tensor_slices(X_train)
labels_dataset = Dataset.from_tensor_slices(list(y_train))
tfds_train = Dataset.zip((features_dataset, labels_dataset))

features_test_dataset = Dataset.from_tensor_slices(X_test)
labels_test_dataset = Dataset.from_tensor_slices(list(y_test))
tfds_test = Dataset.zip((features_test_dataset, labels_test_dataset))

# Try to determine the optimum vocabulary size.  
The results for model 2 below showed that varying the vocabulary size was productive, with the optimum size appearing to lie between 18,000 and 25,000.
We know there are a lot of junk words in our data, where spaces are missing and words appear only once in a story (for example, we find "dyma davi d la pajti" in our text, a French transliteration of "Dumas Davy de la Pailleterie"). 

Let's count our words and see what the vocabulary size would be if we removed these.

In [11]:
import re
import string
from collections import Counter
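def CleanUpPunctuation(pattern, rep, input_data):
    """Lowercase input_data and replace every match of pattern using rep."""
    lowercase = input_data.lower()
    # rep is keyed by the escaped character, so escape the match before the lookup
    return pattern.sub(lambda m: rep[re.escape(m.group(0))], lowercase)


#rep = {"condition1": "", "condition2": "text"} # define desired replacements here
# Map every (regex-escaped) punctuation character to the empty string
rep = {re.escape(s): "" for s in string.punctuation}
# Build one alternation pattern that matches any punctuation character
pattern = re.compile("|".join(rep.keys()))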

v = list(x)
cnt = Counter()
for a in v:
    a = CleanUpPunctuation(pattern, rep, a)
    # split returns a list of words delimited by sequences of whitespace (including tabs, newlines, etc, like re's \s) 
    alist = a.split()
    for word in alist:
        cnt[word] += 1


In [12]:
# we have 26884 unique words in our vocabulary
len(cnt)

26884

In [13]:
# Walk the frequency-sorted list backwards (rarest words first), down to index 201
cnt.most_common()[:200:-1]

[('planetcom', 1),
 ('indianas', 1),
 ('growsandstates', 1),
 ('biology', 1),
 ('crosses', 1),
 ('sameness', 1),
 ('socio', 1),
 ('indoctrinated', 1),
 ('indoctrination', 1),
 ('psychologists', 1),
 ('molestation', 1),
 ('desensitize', 1),
 ('exam', 1),
 ('correctness', 1),
 ('inclusiveness', 1),
 ('eviscerate', 1),
 ('federally', 1),
 ('passages', 1),
 ('coreapproved', 1),
 ('excelling', 1),
 ('denominator', 1),
 ('assailed', 1),
 ('buttocks', 1),
 ('insert', 1),
 ('pornographic', 1),
 ('dildos', 1),
 ('brook', 1),
 ('stony', 1),
 ('choking', 1),
 ('16yearold', 1),
 ('instinctively', 1),
 ('16yearolds', 1),
 ('patchogue', 1),
 ('42pound', 1),
 ('pix11', 1),
 ('newsday', 1),
 ('euthanized', 1),
 ('sociable', 1),
 ('reunions', 1),
 ('apollo', 1),
 ('armstrongs', 1),
 ('ops', 1),
 ('antarcticas', 1),
 ('spacewalked', 1),
 ('coordinate', 1),
 ('onsite', 1),
 ('deserts', 1),
 ('admunsenscott', 1),
 ('evacuating', 1),
 ('nsf', 1),
 ('precautionary', 1),
 ('amundsenscott', 1),
 ('christchurc

In [14]:
# We have 12,139 words that appear only once across all articles.
# 26,884 - 12,139 = 14,745 words that appear more than once.
cntd = dict(cnt)
sort_orders = sorted(cntd.items(), key=lambda x: x[1], reverse=False)
# Collect every word whose total count is exactly 1
singles = [word for word, count in sort_orders if count == 1]
len(singles)

12139

### Vocabulary Size
Of the 26,884 unique words in our articles, 12,139 are used only once, leaving 14,745 words that occur more than once.  Any analysis that depends on finding the same word in multiple articles will never match these 12,139 words, so from that standpoint they are just noise.  Some of the remaining 14,745 may also be confined to a single article, since a word can repeat within one story.
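
The hypothesis is about articles, while the counts above are totals over the whole corpus. A minimal sketch of counting per article instead (often called document frequency) is shown here. The function name `document_frequency` and the toy articles are hypothetical, and punctuation handling is left out for brevity.

```python
from collections import Counter


def document_frequency(articles):
    """Count, for each word, the number of articles that contain it at least once."""
    df = Counter()
    for article in articles:
        # set() ensures a word is counted once per article, however often it repeats
        df.update(set(article.lower().split()))
    return df


# Hypothetical toy articles: "pizza" occurs twice, but only in one article
articles = ["pizza pizza hoax", "hoax spreads online", "senate votes online"]
df = document_frequency(articles)
single_article_words = sorted(w for w, n in df.items() if n == 1)
shared_words = sorted(w for w, n in df.items() if n > 1)
print(single_article_words)
print(shared_words)
```

Here "pizza" has a total count of two yet still appears in only one article, so a per-article count flags it as unshared where a corpus-wide count would not.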

However, if we can find a way to analyze on a scope broader than word or ngram repetition, such as sentence structure or inferred parts of speech, even words that appear only once might be useful.



#### (Below is just kept as a note to myself on how to find a record based on its key value.)

In [15]:
i = dfNews.index.get_loc('biz01legit')
print (i)
#dfNews.iloc[i:i+2]
# Use the looked-up position rather than a hard-coded row number
dfNews['text'].iloc[i]



159


'Alex Jones Apologizes for Promoting \'Pizzagate\' Hoax  Alex Jones  a prominent conspiracy theorist and the host of a popular right-wing radio show  has apologized for helping to spread and promote the hoax known as Pizzagate. The admission on Friday by Mr. Jones  the host of "The Alex Jones Show" and the operator of the website Infowars  was striking. In addition to promoting the Pizzagate conspiracy theory  he has contended that the Sept. 11 attacks were inside jobs carried out by the United States government and that the 2012 shooting at Sandy Hook Elementary School in Newtown  Conn.  was a hoax concocted by those hostile to the Second Amendment. The Pizzagate theory  which posited with no evidence that top Democratic officials were involved with a satanic child pornography ring centered around Comet Ping Pong  a pizza restaurant in Washington  D.C.  grew in online forums before making its way to more visible venues  including Mr. Jones\'s show. And its prominence after the electio