<a href="https://colab.research.google.com/github/da-on-git/temp/blob/main/GSGCenv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting Started with Google Colab (GSGC) using TensorFlow
Manning Publications Live Project <br>
Student: David Aviel, September 2023

###Mount Google Drive to this Colab Notebook.
After mounting, /content/drive/MyDrive/ is the path of the Google Drive MyDrive directory inside Colab.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


###Pythagorean Theorem
In a right-angled triangle where c is the hypotenuse: $a^2 + b^2 = c^2$ inline <br>or
\begin{equation}
a^2 + b^2 = c^2
\end{equation}
<br> as a paragraph

In [None]:
def hyp(a,b):
  return (a**2 + b**2)**0.5

hyp(3,4)

5.0
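
As a cross-check, the Python standard library provides math.hypot, which computes the same quantity. The sketch below redefines hyp so that it runs on its own and compares the two results.

```python
import math

def hyp(a, b):
    # Same formula as the cell above: c = sqrt(a^2 + b^2)
    return (a**2 + b**2) ** 0.5

# math.hypot is the standard-library equivalent
same = math.isclose(hyp(3, 4), math.hypot(3, 4))
print(hyp(3, 4), math.hypot(3, 4), same)
```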

###To install a new package named PackageName, run
!pip install PackageName
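
Before installing, it can help to check whether a module is already importable in the runtime. This is a minimal sketch using only the standard library; is_installed is a hypothetical helper name, and the name passed to it is the import name, which can differ from the pip package name (for example, scikit-learn is imported as sklearn).

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be imported in this runtime."""
    return importlib.util.find_spec(module_name) is not None

print(is_installed("json"))  # standard-library module
print(is_installed("package_that_does_not_exist_123"))  # made-up name
```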

In [2]:
import sklearn as sk
import tensorflow as tf
import torch as pt
import numpy as np

###Naïve Bayes
I typed the ï with the Windows Character Map application, charmap: I searched for "Latin small i with Diaeresis" and copied the character here.

Read SMSSpamCollection.txt, the SMS Spam Collection dataset.<br>
Make two lists:<br>
a) A list that contains the category of the SMS (ham or spam),<br>
b) A list that contains the text of the SMS.<br>
Convert the text to lowercase and filter out punctuation marks.
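
The steps above can be sketched on two made-up lines that follow the same tab-separated "category, text" layout as the dataset file. The sample messages are hypothetical and are not taken from SMSSpamCollection.txt.

```python
import re

# Hypothetical sample lines in the "category<TAB>text" layout
sample_lines = [
    "ham\tHello, are we still meeting at 5?\n",
    "spam\tWINNER!! Call 0900 now, claim your prize.\n",
]

cat, txt = [], []
for line in sample_lines:
    # Split on the first tab only, so tabs inside the text would be kept
    label, text = line.rstrip("\n").split("\t", 1)
    cat.append(label)
    # Lowercase, then keep only letters, digits, and spaces
    txt.append(re.sub(r"[^a-z0-9 ]", "", text.lower()))

print(cat)
print(txt)
```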

In [3]:
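import random
# Read all lines of the dataset; the with-statement closes the file automatically
with open('/content/drive/MyDrive/Colab Notebooks/RecSpeech/SMSSpamCollection.txt','r') as file:
  lines = file.readlines()
# Shuffle with a fixed seed so that the later splits are reproducible
random.Random(1337).shuffle(lines)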


In [None]:
lists = [s.split("\t") for s in lines]
cat_t, txt_t = zip(*lists)
cat = list(cat_t) #List of SMS Category, "ham" or "spam".
txt = list(txt_t) #List of SMS Text string.
txt = [s.lower() for s in txt]
print(repr(txt[0])) #Print the text with escape sequences visible
import re
txt = [re.sub('[^a-z0-9 ]','',s) for s in txt] #Keep only lowercase letters, digits, and spaces.
print(repr(txt[0])) #The same SMS after filtering
from sklearn.model_selection import train_test_split
T_train, T_test, C_train, C_test = train_test_split(
    txt, cat, test_size=0.25, random_state=0)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(T_train)
X_test = vectorizer.transform(T_test)
import pickle
variables = {
    'X_train': X_train,
    'X_test': X_test,
    'C_train' : C_train,
    'C_test' : C_test
}
with open('/content/drive/MyDrive/Colab Notebooks/RecSpeech/SMSSpamCollection.pickle','wb') as f:
  pickle.dump(variables,f)
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB(force_alpha=True)
mnb.fit(X_train,C_train)
from joblib import dump, load
with open('/content/drive/MyDrive/Colab Notebooks/RecSpeech/SMSSpamCollection.joblib', 'wb') as f:
   dump(mnb, f)

###Evaluate the saved model

In [None]:
import pickle
with open('/content/drive/MyDrive/Colab Notebooks/RecSpeech/SMSSpamCollection.pickle','rb') as f:
  loaded_variables = pickle.load(f)
from joblib import dump, load
with open('/content/drive/MyDrive/Colab Notebooks/RecSpeech/SMSSpamCollection.joblib','rb') as f:
  mnb = load(f)
y_pred = mnb.predict(loaded_variables['X_test'])
y_true = loaded_variables['C_test']
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      1198
        spam       0.98      0.89      0.93       196

    accuracy                           0.98      1394
   macro avg       0.98      0.94      0.96      1394
weighted avg       0.98      0.98      0.98      1394
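
To read the report: a spam recall of 0.89 means that about 11% of the spam messages in the test set were classified as ham. The sketch below shows how precision, recall, and F1 are derived from the counts of true positives, false positives, and false negatives. It uses a tiny made-up label list, not the notebook's data, and precision_recall_f1 is a hypothetical helper name.

```python
def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: one spam message is misclassified as ham
y_true = ["ham", "ham", "ham", "spam", "spam", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "ham", "spam", "spam", "spam", "ham", "ham"]

p, r, f1 = precision_recall_f1(y_true, y_pred, positive="spam")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```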



###Making a repository in GitHub available in Colab and running notebooks from there


*   In Drive, create a new folder in the root folder. For this exercise, call the new folder dl_june_17.
*   Access the new folder, right-click the background, and choose More -> "Google Colaboratory" from the contextual menu. Colab opens in a new tab.
*   Choose Connect -> Connect to hosted runtime.
*   Make Drive accessible in Colab by running: drive.mount('/content/drive')
*   Set the new folder as the working directory by running: %cd /content/drive/My Drive/dl_june_17
*   Clone a GitHub repository into the new folder by running, for example: !git clone https://github.com/ryanmark1867/deep_learning_for_structured_data.git








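###Prepare the data for text_dataset_from_directory
To create a dataset object from text files, use the text_dataset_from_directory function (keras.utils.text_dataset_from_directory in the code further down; TensorFlow 2.x releases also expose it under tf.keras.preprocessing). This function reads files rather than a list of strings: it expects a directory structure where each subdirectory represents a class of data and holds one text file per sample. For example: <br>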
<br>
SMSSpamCollection/<br>
...train/ #70%<br>
......ham/<br>
......spam/<br>
...test/ #20%<br>
......ham/<br>
......spam/<br>
...val/ #10%<br>
......ham/<br>
......spam/<br>
<br>
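
When labels are inferred from such a tree, the class index follows the alphanumeric order of the subdirectory names, so ham becomes 0 and spam becomes 1. This agrees with the dataset output later in this notebook, where a spam message has target 1. The sketch below builds an empty copy of the tree in a temporary directory and derives that mapping with the standard library only; it does not call Keras.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp) / "SMSSpamCollection"
    # Create the train/test/val tree with one subdirectory per class
    for split in ("train", "test", "val"):
        for category in ("spam", "ham"):
            (base / split / category).mkdir(parents=True)
    # Class names sorted alphanumerically; the position is the integer label
    class_names = sorted(p.name for p in (base / "train").iterdir() if p.is_dir())
    label_of = {name: i for i, name in enumerate(class_names)}

print(class_names)
print(label_of)
```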



In [4]:
import os, pathlib, shutil
base_dir = pathlib.Path("SMSSpamCollection")
shutil.rmtree (base_dir, ignore_errors=True)
train_dir = base_dir / "train"
test_dir = base_dir / "test"
val_dir = base_dir / "val"

num_train_samples = int(0.7 * len(lines))
print(num_train_samples, "train samples")
train_lines = lines[:num_train_samples]
for category in ("ham", "spam"):
  os.makedirs(train_dir / category)
for ln, line in enumerate(train_lines):
  l = line.split("\t")
  lc = l[0]
  lt = l[1]
  fpath = train_dir / lc / "train" f'{ln:04}' ".txt"
  f = open (fpath, "w")
  f.write (lt)
  f.close ()

num_test_samples = int(0.2 * len(lines))
print(num_test_samples, "test samples")
test_lines = lines[num_train_samples :
                   num_train_samples + num_test_samples]
for category in ("ham", "spam"):
  os.makedirs(test_dir / category)
for ln, line in enumerate(test_lines):
  l = line.split("\t")
  lc = l[0]
  lt = l[1]
  fpath = test_dir / lc / "test" f'{ln:04}' ".txt"
  f = open (fpath, "w")
  f.write (lt)
  f.close ()

val_lines = lines[num_train_samples + num_test_samples :]
num_val_samples = len(val_lines)
print(num_val_samples, "val samples")
for category in ("ham", "spam"):
  os.makedirs(val_dir / category)
for ln, line in enumerate(val_lines):
  l = line.split("\t")
  lc = l[0]
  lt = l[1]
  fpath = val_dir / lc / "val" f'{ln:04}' ".txt"
  f = open (fpath, "w")
  f.write (lt)
  f.close ()

print(len(lines), "Total samples")

3901 train samples
1114 test samples
559 val samples
5574 Total samples
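
The split sizes above follow from truncating 70% and 20% of the total and giving the remainder to validation. A short check of the arithmetic, using the total printed above:

```python
total = 5574                                  # total samples printed above
num_train = int(0.7 * total)                  # int() truncates toward zero
num_test = int(0.2 * total)
num_val = total - num_train - num_test        # whatever is left goes to validation
print(num_train, num_test, num_val)
```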


###Load the data into Keras Dataset objects

In [5]:
from tensorflow import keras
batch_size = 32
train_ds = keras.utils.text_dataset_from_directory(
           train_dir, batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory(
         val_dir, batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory(
          test_dir, batch_size=batch_size)

for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

Found 3901 files belonging to 2 classes.
Found 559 files belonging to 2 classes.
Found 1114 files belonging to 2 classes.
inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'Urgent! call 09066350750 from your landline. Your complimentary 4* Ibiza Holiday or 10,000 cash await collection SAE T&Cs PO BOX 434 SK3 8WP 150 ppm 18+ \n', shape=(), dtype=string)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


###Preprocessing the datasets with a TextVectorization layer

In [6]:
from tensorflow.keras.layers import TextVectorization
# Define the layer
text_vectorization = TextVectorization(
    max_tokens=5000,
    output_mode="multi_hot",
)
# Create a text only dataset for building the vocabulary
text_only_train_ds = train_ds.map(lambda x, y: x)
# Build the vocabulary
text_vectorization.adapt(text_only_train_ds)
# Build the vectorized datasets
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

###Inspecting the binary unigram vectorized training dataset

In [7]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 5000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([0. 0. 1. ... 0. 0. 0.], shape=(5000,), dtype=float32)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)
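
Each row of inputs is a multi-hot vector: position i is 1 when vocabulary token i occurs in the message, however often, and 0 otherwise. The plain-Python sketch below illustrates the idea with a hypothetical five-token vocabulary. It assumes that index 0 is reserved for out-of-vocabulary tokens and that text is lowercased and stripped of punctuation before splitting on whitespace, which is my understanding of the layer's defaults. It is not the layer's actual implementation.

```python
import string

def standardize(text):
    # Lowercase and strip punctuation (assumed to mirror the layer's default)
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def multi_hot(text, vocabulary):
    index = {token: i for i, token in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for token in standardize(text).split():
        # Unknown tokens fall back to index 0, the "[UNK]" slot
        vector[index.get(token, 0)] = 1.0
    return vector

vocabulary = ["[UNK]", "call", "free", "now", "you"]  # hypothetical vocabulary
print(multi_hot("Call now, call FREE!", vocabulary))
print(multi_hot("See you tomorrow", vocabulary))
```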


###A model building utility

In [8]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers  # same keras namespace as the import above

def get_model(max_tokens=5000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
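
The size of this model can be worked out by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout adds no parameters. The sketch below does the arithmetic for the default arguments of get_model; dense_params is a hypothetical helper name.

```python
def dense_params(n_inputs, n_units):
    # Weight matrix (n_inputs x n_units) plus one bias per unit
    return n_inputs * n_units + n_units

max_tokens, hidden_dim = 5000, 16
hidden = dense_params(max_tokens, hidden_dim)  # Dense(16) on 5000 inputs
output = dense_params(hidden_dim, 1)           # Dense(1) on 16 inputs
print(hidden, output, hidden + output)
```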

###Train and test the model
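Note: in the TensorFlow version used here, keras.callbacks.ModelCheckpoint fails when the checkpoint filename ends in ".keras"; the linked question reports the error "The following arguments are not supported with the native Keras format". A workaround is to suffix the filename with ".tf". See this link <br>
[stackoverflow](https://stackoverflow.com/questions/76701617/the-following-arguments-are-not-supported-with-the-native-keras-format-opti) <br>
The same page also shows an example of how to write your own ModelCheckpoint function.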
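
The save_best_only=True option used in the next cell keeps a checkpoint only when the monitored value improves; by default ModelCheckpoint monitors val_loss, where lower is better. The bookkeeping behind that can be sketched in plain Python. BestTracker is a hypothetical class that only mirrors the on_epoch_end(epoch, logs) hook of a Keras callback. It is not a Keras callback and it saves nothing, and the loss values are made up.

```python
class BestTracker:
    """Hypothetical helper mimicking save_best_only with monitor='val_loss'."""

    def __init__(self, monitor="val_loss"):
        self.monitor = monitor
        self.best = float("inf")   # lower is better for a loss
        self.saved_epochs = []

    def on_epoch_end(self, epoch, logs):
        current = logs.get(self.monitor)
        if current is not None and current < self.best:
            self.best = current
            # A real callback would save the model here
            self.saved_epochs.append(epoch)

tracker = BestTracker()
for epoch, val_loss in enumerate([0.30, 0.21, 0.25, 0.18]):  # made-up losses
    tracker.on_epoch_end(epoch, {"val_loss": val_loss})
print(tracker.saved_epochs, tracker.best)
```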

In [26]:
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras.tf",
                                    save_best_only=True
                                    )
]
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras.tf")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")
!cp -r 'binary_1gram.keras.tf' '/content/drive/MyDrive/Colab Notebooks/RecSpeech/binary_1gram.keras.tf'