# **CSE164 Final Project: Semi-supervised Learning**
**This project is made to be run in Google Colab**

>Model File: https://drive.google.com/file/d/1GQHO3t8iA9F-1Wu1JxTjwY--9_bfdRbc/view?usp=sharing

> Credit to this tutorial for using TF Hub: https://www.tensorflow.org/hub/tutorials/tf2_image_retraining

How to Run
==================

To run this notebook, upload your kaggle.json API file to the content folder of the Colab runtime, then hit Run all. This should take approximately 10-12 minutes to complete.

Please note that running the notebook automatically creates a Kaggle submission. If you also want the full predictions.csv file, you can download it directly from the runtime files.

As a secondary note, Colab might tell you this notebook requires high RAM. That is only the case if you run the predictions multiple times, because TensorFlow is a memory hog. Running the whole thing once requires only about 9 GB of RAM.
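
Before hitting Run all, it can help to confirm that the credentials file is where the notebook expects it. The following is a minimal sketch, assuming the file sits in the current working directory and carries the usual `username` and `key` fields; the helper name `has_kaggle_credentials` is hypothetical and not part of the Kaggle CLI.

```python
import json
from pathlib import Path


def has_kaggle_credentials(path="kaggle.json"):
    """Return True if `path` exists and holds a username and key (hypothetical helper)."""
    file = Path(path)
    if not file.is_file():
        return False
    try:
        data = json.loads(file.read_text())
    except json.JSONDecodeError:
        return False
    # Assumption: the token file is a small JSON object with these two fields.
    return isinstance(data, dict) and bool(data.get("username")) and bool(data.get("key"))


if not has_kaggle_credentials():
    print("kaggle.json not found in the current directory; upload it before running the notebook.")
```

If the check fails, the `kaggle` commands further down stop with a "Could not find kaggle.json" error.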

Network Backbone
==================



Transfer Learning
> Supervised Learning
>> Semi-supervised learning would yield marginal returns given the size of the pretrained model: training time is about 45 min/epoch, with no noticeable gain in prediction score (~75%)
>
> Data Augmentation Layer + Preprocessing at the top
>
> Swin-t Transformer: https://tfhub.dev/sayakpaul/swin_base_patch4_window7_224/1
>
> Dense(1024) + Dropout(0.2) + Dense(10) (the last layer outputs logits; softmax is applied inside the loss)
>



Training Pipeline
==================

Google Colab
> Easier to work with + same compute power

Import data directly from Kaggle using CLI

Preprocess images (Normalize + Augment)
> The Swin-t does its own preprocessing, so it only needs normalized images

Batch and shuffle the sets
> Tuned to work with the small dataset

Use a 20% subset of the given training data as a validation set

Early Stopping + Model Checkpoint
> Avoid wasting training time and reduce overfitting
>
> Improve the odds of a good score by training multiple times and keeping the best checkpoint

Fix Seed for Reproducible Results
==================

In [None]:
import numpy as np
import os
import glob
from PIL import Image
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from functools import partial
import copy
import tensorflow_hub as hub

In [None]:
# Some stuff we'll need...
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Flatten, Dropout
from tensorflow.keras.layers.experimental import preprocessing

In [None]:
seed = 123
np.random.seed(seed)
tf.random.set_seed(seed)
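
The point of fixing the seed is that the same seed reproduces the same pseudo-random sequence. A minimal illustration using only Python's built-in `random` module follows; the notebook itself seeds NumPy and TensorFlow, and the `draw` helper here is a hypothetical stand-in rather than part of the original cells.

```python
import random


def draw(seed, n=5):
    """Return n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator; global state is untouched
    return [rng.randint(0, 99) for _ in range(n)]


# The same seed yields the same sequence on every call.
print(draw(123) == draw(123))
```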

In [None]:
! pip install -q kaggle

In [None]:
! mkdir -p ~/.kaggle

In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

Load Data From Kaggle
==================

In [None]:
! kaggle competitions download -c ucsc-cse-164-spring-2023-final-project

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.10/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.10/dist-packages/kaggle/api/kaggle_api_extended.py", line 403, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.


In [None]:
! unzip -q ucsc-cse-164-spring-2023-final-project.zip -d data

unzip:  cannot find or open ucsc-cse-164-spring-2023-final-project.zip, ucsc-cse-164-spring-2023-final-project.zip.zip or ucsc-cse-164-spring-2023-final-project.zip.ZIP.


Load datasets
==================

In [None]:
num_classes = 10
input_shape = (224, 224, 3)
BATCH_SIZE = 10

In [None]:
def build_dataset(subset):
  return tf.keras.preprocessing.image_dataset_from_directory(
      "data/CSE164_2023/Train set",
      label_mode="categorical",
      color_mode="rgb",
      class_names= ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"],
      batch_size=1,
      image_size=(224,224),
      seed=seed,
      validation_split=.2,
      subset=subset,)

train_ds = build_dataset("training")
class_names = tuple(train_ds.class_names)
train_size = train_ds.cardinality().numpy()
train_ds = train_ds.unbatch().batch(BATCH_SIZE)
train_ds = train_ds.repeat()

normalization_layer = tf.keras.layers.Rescaling(1. / 255)
preprocessing_model = tf.keras.Sequential([normalization_layer])
do_data_augmentation = False
if do_data_augmentation:
  preprocessing_model.add(
      tf.keras.layers.RandomRotation(40))
  preprocessing_model.add(
      tf.keras.layers.RandomTranslation(0, 0.2))
  preprocessing_model.add(
      tf.keras.layers.RandomTranslation(0.2, 0))
  # Like the old tf.keras.preprocessing.image.ImageDataGenerator(),
  # image sizes are fixed when reading, and then a random zoom is applied.
  # If all training inputs are larger than image_size, one could also use
  # RandomCrop with a batch size of 1 and rebatch later.
  preprocessing_model.add(
      tf.keras.layers.RandomZoom(0.2, 0.2))
  preprocessing_model.add(
      tf.keras.layers.RandomFlip(mode="horizontal"))
train_ds = train_ds.map(lambda images, labels:
                        (preprocessing_model(images), labels))

val_ds = build_dataset("validation")
valid_size = val_ds.cardinality().numpy()
val_ds = val_ds.unbatch().batch(BATCH_SIZE)
val_ds = val_ds.map(lambda images, labels:
                    (normalization_layer(images), labels))

Found 100 files belonging to 10 classes.
Using 80 files for training.
Found 100 files belonging to 10 classes.
Using 20 files for validation.
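
The split sizes printed above, and the step counts the training cells later derive from them, can be reproduced with a little arithmetic. This is a sketch, assuming the validation count is the truncated product of the file count and `validation_split`; `split_sizes` is a hypothetical helper, not a Keras function.

```python
def split_sizes(num_files, validation_split):
    """Return (train, validation) counts for a given validation fraction."""
    num_val = int(num_files * validation_split)
    return num_files - num_val, num_val


BATCH_SIZE = 10
train_size, valid_size = split_sizes(100, 0.2)

# Integer division, mirroring `train_size // BATCH_SIZE` in the training cells.
steps_per_epoch = train_size // BATCH_SIZE
validation_steps = valid_size // BATCH_SIZE
print(train_size, valid_size, steps_per_epoch, validation_steps)
```

With 100 labeled images this gives 8 training steps and 2 validation steps per epoch.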


In [None]:
data_augmentation = Sequential([
    preprocessing.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(40),
    preprocessing.RandomContrast(0.2),
    tf.keras.layers.RandomTranslation(0, 0.2),
    tf.keras.layers.RandomZoom(0.2, 0.2),
], name="data_augmentation")



Training SWIN-T
==================

In [None]:
do_fine_tuning = False
model = tf.keras.Sequential([
    # Explicitly define the input shape so the model can be properly
    # loaded by the TFLiteConverter
    tf.keras.layers.InputLayer(input_shape=input_shape),
    data_augmentation,
    hub.KerasLayer("https://tfhub.dev/sayakpaul/swin_base_patch4_window7_224/1", trainable=do_fine_tuning),
    tf.keras.layers.Dense(units=1024, activation='relu'),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(len(class_names),
                          kernel_regularizer=tf.keras.regularizers.l2())
])
model.build((None,)+input_shape)
model.summary()


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 data_augmentation (Sequenti  (None, 224, 224, 3)      0         
 al)                                                             
                                                                 
 keras_layer (KerasLayer)    (None, 1000)              87768224  
                                                                 
 dense (Dense)               (None, 1024)              1025024   
                                                                 
 dropout (Dropout)           (None, 1024)              0         
                                                                 
 dense_1 (Dense)             (None, 10)                10250     
                                                                 
Total params: 88,803,498
Trainable params: 1,035,274
Non-trainable params: 87,768,224
__________________________________
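
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. The sketch below takes the backbone count directly from the summary and recomputes the rest; `dense_params` is a hypothetical helper.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus biases (units) of a fully connected layer."""
    return inputs * units + units


backbone = 87_768_224               # frozen hub layer, copied from the summary
hidden = dense_params(1000, 1024)   # 1000 backbone outputs -> 1024 units
logits = dense_params(1024, 10)     # 1024 units -> 10 classes

trainable = hidden + logits
total = backbone + trainable
print(hidden, logits, trainable, total)
```

Only the two dense layers are trainable because the hub layer is loaded with `trainable=do_fine_tuning`, which is `False`.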

In [None]:
model.compile(
  optimizer=tf.keras.optimizers.Adam(),
  loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True, label_smoothing=0.1),
  metrics=['accuracy'])
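
To make the loss settings concrete, here is a pure-Python sketch of what `label_smoothing=0.1` and `from_logits=True` mean for a single example. It assumes the usual smoothing rule `y * (1 - s) + s / num_classes`; the function names are hypothetical, and this is an illustration rather than the Keras implementation.

```python
import math


def smooth_labels(one_hot, label_smoothing=0.1):
    """Move a little probability mass from the true class to all classes."""
    k = len(one_hot)
    return [y * (1.0 - label_smoothing) + label_smoothing / k for y in one_hot]


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target):
    """Cross-entropy where softmax is applied to the logits inside the loss."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))


one_hot = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
smoothed = smooth_labels(one_hot)
print(smoothed[2], smoothed[0])
```

With 10 classes the true class keeps about 0.91 of the mass and every other class gets about 0.01, which discourages overconfident predictions on a very small training set.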

In [None]:
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath="model",
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)
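
The stopping rule configured above can be sketched in plain Python: training halts once the monitored loss has failed to improve for `patience` consecutive epochs. This is a simplified illustration with a hypothetical function name and made-up loss values, not the Keras callback itself.

```python
def early_stopping_epoch(losses, patience=3):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            wait = 0          # improvement resets the counter
        else:
            wait += 1         # another epoch without improvement
            if wait >= patience:
                return epoch
    return len(losses)        # ran out of epochs before the rule fired


# Made-up loss curve: the best value is reached at epoch 3.
example_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
print(early_stopping_epoch(example_losses))
```

Note that the early-stopping callback monitors the training loss while the checkpoint callback monitors `val_accuracy`, so the two react to different signals.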

Run training 3 times to take advantage of the model checkpoint, so that the model can continue to train without overfitting. Repeating the run also reduces the role of chance in the final model accuracy.

In [None]:
steps_per_epoch = train_size // BATCH_SIZE
validation_steps = valid_size // BATCH_SIZE
hist = model.fit(
    train_ds,
    epochs=30, steps_per_epoch=steps_per_epoch,
    validation_data=val_ds,
    validation_steps=validation_steps,
    callbacks=[callback, model_checkpoint_callback]).history

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30


In [None]:
steps_per_epoch = train_size // BATCH_SIZE
validation_steps = valid_size // BATCH_SIZE
hist = model.fit(
    train_ds,
    epochs=30, steps_per_epoch=steps_per_epoch,
    validation_data=val_ds,
    validation_steps=validation_steps,
    callbacks=[callback, model_checkpoint_callback]).history

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30


In [None]:
steps_per_epoch = train_size // BATCH_SIZE
validation_steps = valid_size // BATCH_SIZE
hist = model.fit(
    train_ds,
    epochs=30, steps_per_epoch=steps_per_epoch,
    validation_data=val_ds,
    validation_steps=validation_steps,
    callbacks=[callback, model_checkpoint_callback]).history

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30


In [None]:
model.save("model", save_format='h5')

Submit Supervised Model
==================

In [None]:
cd "data/CSE164_2023/Test set"

/content/data/CSE164_2023/Test set


In [None]:


# Specify the directory
directory = ''

# Specify the pattern
pattern   = "**/*.jpeg"  # Matches any .jpeg file, including those in subdirectories

# Use glob to get all file paths
image_paths = glob.glob(os.path.join(directory, pattern), recursive=True)
# len(image_paths)
# print(image_paths[:5])
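
A small self-contained check of how the recursive pattern behaves, using a temporary directory with made-up file names. `find_images` is a hypothetical helper; it also sorts the result, which the cell above does not do.

```python
import glob
import os
import tempfile


def find_images(root, pattern="**/*.jpeg"):
    """Return sorted paths under `root` that match the recursive pattern."""
    return sorted(glob.glob(os.path.join(root, pattern), recursive=True))


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "nested"))
    for name in ("a.jpeg", os.path.join("nested", "b.jpeg"), "c.jpg"):
        open(os.path.join(root, name), "w").close()
    found = [os.path.relpath(p, root) for p in find_images(root)]

# "c.jpg" is skipped because the pattern only matches the ".jpeg" extension.
print(found)
```

`glob.glob` returns paths in arbitrary order, so sorting is one way to make the order of the submission rows repeatable.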

The next cell takes about 6 minutes to run, mainly because memory constraints force predictions to be made on small batches of images.

In [None]:
images_array = []

predictions = []
# Define batch size
batch_size = 10

# Create list of path batches
path_batches = [image_paths[i:i + batch_size] for i in range(0, len(image_paths), batch_size)]

for path_batch in path_batches:
    images_batch = []
    for path in path_batch:
        image_string = tf.io.read_file(path)
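        # Decode the image file into a uint8 RGB tensor
        image_decoded = tf.image.decode_jpeg(image_string, channels=3)

        # Resize to the model input size; the result is float32 in [0, 255]
        image_resized = tf.image.resize(image_decoded, [224, 224])
        # No-op here, since the resized image is already float32
        image_float = tf.image.convert_image_dtype(image_resized, tf.float32)
        # Rescale to [0, 1], matching the training pipeline
        image_float = normalization_layer(image_float)
        images_batch.append(image_float)

    images_batch = np.array(images_batch)
    # argmax returns a zero-based index into class_names, not the class name itself
    predictions.extend(tf.argmax(model.predict(images_batch, verbose=0), axis=-1).numpy())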

# print(images_batch[0])
# print(predictions[:10])
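
The batching used above can be checked in isolation. The sketch below uses made-up file names and a hypothetical `make_batches` helper to show that slicing into batches and appending results batch by batch preserves the original order, so the i-th prediction belongs to the i-th path.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive slices of at most `batch_size` items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


paths = [f"img_{i}.jpeg" for i in range(25)]  # hypothetical file names
batches = make_batches(paths, 10)

# Flattening the batches in order gives back the original list.
flattened = [p for batch in batches for p in batch]
print([len(b) for b in batches], flattened == paths)
```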

In [None]:
cd ..

/content/data/CSE164_2023


In [None]:
cd ..

/content/data


In [None]:
cd ..

/content


In [None]:
import pandas as pd

# `predictions` is the list of predicted class indices and `image_paths` is
# the list of test image paths; both come from the cells above.


# TODO
# Double-check that each image gets the right label. The two lists are only
# linked by position: predictions were appended in the same order as image_paths.


# Create a DataFrame
df = pd.DataFrame({
    'Image_id': image_paths,
    'label': predictions
})

# Export DataFrame to .csv file
df.to_csv('predictions.csv', index=False)
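
The `label` column above holds zero-based argmax indices, while the training folders are named "1" to "10". Whether the competition expects the folder names or the indices is not stated here, so the following is only a sketch of how the mapping could look if folder names are required. The helper names are hypothetical, and the standard `csv` module stands in for pandas to keep the example self-contained.

```python
import csv
import io

class_names = ("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")


def to_rows(image_ids, predicted_indices, names):
    """Pair each image id with the class name at its predicted index."""
    if len(image_ids) != len(predicted_indices):
        raise ValueError("image ids and predictions must have the same length")
    return [(image_id, names[idx]) for image_id, idx in zip(image_ids, predicted_indices)]


def write_csv(rows, handle):
    """Write a header plus one (Image_id, label) row per prediction."""
    writer = csv.writer(handle)
    writer.writerow(["Image_id", "label"])
    writer.writerows(rows)


rows = to_rows(["a.jpeg", "b.jpeg"], [0, 9], class_names)  # made-up inputs
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

The length check also guards against the misalignment the TODO above worries about.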

In [None]:
! kaggle competitions submit -c ucsc-cse-164-spring-2023-final-project -f predictions.csv -m "Message"

  0% 0.00/116k [00:00<?, ?B/s]100% 116k/116k [00:00<00:00, 584kB/s]
Successfully submitted to UCSC CSE 164 Spring 2023 Final Project