## Columbia University
### ECBM E4040 Neural Networks and Deep Learning. Fall 2023.

# ECBM E4040 - Assignment 2- Task 5: Kaggle Open-ended Competition

Kaggle is a platform for predictive modelling and analytics competitions in which companies and researchers post data and statisticians and data miners compete to produce the best models for predicting and describing the data.

If you don't have a Kaggle account, feel free to join at [www.kaggle.com](https://www.kaggle.com). To let the CAs do the grading more conveniently, please __use Lionmail to join Kaggle__ and __use UNI as your username__.

The competition is located here: https://www.kaggle.com/t/b70c5e10a0be487f937f2539dc5c3d04

You can find a detailed description of this in-class competition on the website above. Please read it carefully and follow the instructions.

<span style="color:red">__TODO__:</span>

Train a custom model for the bottle dataset classification problem.
- You are free to use any methods taught in the class or found by yourself on the Internet (ALWAYS provide reference to the source).
- General training methods include dropout, batch normalization, early stopping, L1/L2 regularization.
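
As one illustration of the methods listed above, early stopping can be sketched in a framework-agnostic way. The class `EarlyStopper`, its `patience` and `min_delta` parameters, and the helper `epochs_run` are hypothetical names chosen for this sketch; in Keras the built-in `tf.keras.callbacks.EarlyStopping` callback plays the same role.

```python
class EarlyStopper:
    """Signal a stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # minimum decrease that counts as an improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience=3):
    """Return how many epochs would run before early stopping triggers."""
    stopper = EarlyStopper(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)


print(epochs_run([1.0, 0.8, 0.9, 0.85, 0.81, 0.7], patience=3))
```

The validation losses in the last line are made-up numbers used only to exercise the logic.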

You are given the test set to generate your predictions:
- The splitting is 70% public + 30% private, but you don't know which ones are public/private.
- Students should achieve an accuracy on the public test set of at least 70%. Two points will be deducted for each 1% below 70% accuracy threshold (i.e. 65% accuracy will have 10 points deducted).
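
The deduction rule above can be worked through as simple arithmetic. This is a minimal sketch: the function name `accuracy_deduction` is hypothetical, and prorating fractional shortfalls linearly is an assumption, since the assignment only gives the whole-percent example.

```python
def accuracy_deduction(public_accuracy_pct, threshold_pct=70.0, points_per_pct=2.0):
    """Points deducted for falling below the public-leaderboard accuracy threshold."""
    # Assumption: fractional shortfalls are prorated linearly
    shortfall = max(0.0, threshold_pct - public_accuracy_pct)
    return points_per_pct * shortfall


print(accuracy_deduction(65))  # 10.0, matching the example in the text
print(accuracy_deduction(72))  # 0.0, at or above the threshold
```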

The accuracy will be shown on the public leaderboard once you submit your prediction .csv file. The private leaderboard will be released after the competition. The final ranking is based on the private leaderboard results, not the public leaderboard.

<span style="color:red">**Note**:</span>
* Report your results on Kaggle, for comparison with other students' best results (you can do this several times).
* Save your best model.

<span style="color:red">**Hint**:</span> You can start from what you have implemented in task 4. Students are allowed to use pretrained networks, and utilize transfer learning. 
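
A minimal transfer-learning sketch follows, assuming TensorFlow 2.x with Keras. The choice of MobileNetV2 as the pretrained backbone, the function name `build_transfer_model`, and the dropout rate are illustrative assumptions, not requirements of the assignment; any pretrained network you use must be referenced as described above.

```python
import tensorflow as tf


def build_transfer_model(num_classes=5, input_shape=(128, 128, 3), weights="imagenet"):
    """Frozen pretrained backbone with a small trainable classification head."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=weights)
    base.trainable = False  # keep the pretrained features fixed at first

    inputs = tf.keras.Input(shape=input_shape)
    # MobileNetV2 expects inputs scaled to [-1, 1]; this maps raw [0, 255] pixels
    x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```

Calling `build_transfer_model()` downloads the ImageNet weights on first use; passing `weights=None` builds the same architecture with random initialization, which is useful only for checking shapes.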

### HW Submission Details:

There are two components to reporting the results of this task: 

**(A) Submission (up to 20 submissions each day) of the .csv prediction file through the Kaggle platform**. You should start doing this __VERY early__, so that students can compare their work as they are making progress with model optimization.

**(B) Submitting your best CNN model through Github Classroom repo.**

**Note** that assignments are submitted through github classroom only. All code for training your kaggle model should be done in this task 5 jupyter notebook, or in a user defined module (.py file) that is imported for use in the jupyter notebook.

### Useful Information: 

1. Unzip zip files in GCP, or acquire administrator permission to install other applications. After you upload your dataset to your VM instance, you will probably want to unzip it, but the `unzip` command is not installed by default. To run `sudo apt install unzip`, or to install other applications later, you need to:
  - Log in as `ecbm4040`, as shown [in the tutorial](https://ecbme4040.github.io/2023_fall/EnvSetup/gcp.html#2_Connect_to_your_instance_22)
  - Run `sudo apt install unzip`
  - You might encounter different errors at this step, e.g.
    - Permission Denied: Command `apt(-get)` will only run with `sudo` or under `root`. Don't forget the `sudo` prefix or switch to root by `sudo su`.
    - Require password: Change the Linux username **before** SSH login, as shown in the tutorial. If you run `su ecbm4040` after logging in over SSH as another user, you will need a password to use `sudo`.
    - Resource Temporarily Unavailable: Restart the VM or kill the processes occupying the resource. Google for solutions or refer to [this link](https://itsfoss.com/could-not-get-lock-error/) for details.
    - Others: Google for solutions or contact TAs.

2. If the kernel crashes (or a cell never finishes running), consider using a machine type with more memory. Especially if you include a large network structure such as VGG, 15 GB of memory or more is recommended.

3. Some Python libraries that you might need to install first include pandas and scikit-learn. There are **2 OPTIONS** for installing them:
  - In the envTF24 environment in the Linux terminal, type: `pip install [PACKAGE]`
  - In the Jupyter notebook (i.e. this file), type `!pip install [PACKAGE]`. Sometimes you need to restart the virtual environment or even the instance before these packages become usable.

4. You might need extra pip packages to handle the dataset, import network architectures, etc. Follow step 3 to install them.
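
Before running the rest of the notebook, it can help to check which packages are still missing. This is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical. Note that the import name can differ from the pip name (for example, `scikit-learn` is imported as `sklearn` and `Pillow` as `PIL`).

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names, not pip names
required = ["pandas", "sklearn", "PIL"]
print("Missing:", missing_packages(required))
```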

### <span style="color:red">__Submission:__</span>

(i) In your Assignment 2 submission folder, create a subfolder called __KaggleModel__. Save your best model using `model.save()`. This will generate a `saved_model.pb` file, a folder called `variables`, and a folder called `checkpoints` all inside the __KaggleModel__ folder. Only upload your best model. 

(ii) <span style="color:red">If your saved model exceeds 100 MB, ".gitignore" it or you will get an error when pushing.</span> **Upload the model to Google Drive and explicitly provide the link under the 'Save your best model' cell.**

(iii) Remember to delete any intermediate results; we only want your best model. Do not upload any data files. The instructors will rerun the uploaded best model and verify it against the score that you reported on Kaggle.
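
To decide whether item (ii) applies, you can measure the saved model folder before committing. The sketch below uses only the standard library; the helper names `folder_size_bytes` and `exceeds_limit` are hypothetical, and treating 100 MB as 100 × 1024 × 1024 bytes is an assumption.

```python
import os


def folder_size_bytes(folder):
    """Total size of all files under `folder`, in bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def exceeds_limit(folder, limit_mb=100):
    """True if the folder is larger than `limit_mb` megabytes."""
    return folder_size_bytes(folder) > limit_mb * 1024 * 1024


# A missing folder simply reports a size of 0 bytes
print("KaggleModel too large for GitHub:", exceeds_limit("KaggleModel"))
```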

**The top 10 final submissions of the Kaggle competition will receive up to 10 bonus points proportional to the private test accuracy.**

## Load Data

There are two options to load Kaggle data.

**Option 1**:
1. Manually download the data from kaggle.
2. Upload the data to GCP via Jupyter or SSH.
3. Unzip the files using command `unzip` or a Python script (you may use the [`zipfile`](https://docs.python.org/3/library/zipfile.html) package).
4. Move the dataset to your favorite location for the following tasks. Be careful not to upload them to Github.
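
Step 3 of Option 1 can be done from Python with the standard-library `zipfile` module. This is a minimal sketch; the function name `extract_archive` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of `zip_path` into `dest_dir` and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

For example, `extract_archive("ecbm4040-fall2023.zip", "./assignment2-task5")` would unpack the competition archive into the folder that the loading cell below uses as `root`, assuming the archive sits in the current directory.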

**Option 2**:
1. Login as `ecbm4040`.
2. Upload your API key file to "**~/.kaggle/kaggle.json**" following the instructions in https://github.com/Kaggle/kaggle-api#api-credentials.
3. Run in console:
    ```
    chmod 600 ~/.kaggle/kaggle.json
    pip install kaggle
    kaggle competitions download -c ecbm4040-fall2023
    sudo apt install unzip
    unzip ecbm4040-fall2023.zip
    ```
4. Now the data should be under your current directory ("**/home/ecbm4040/**" by default).
5. Move the dataset to your favorite location for the following tasks. Be careful not to upload them to Github.
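
The `chmod 600` step in Option 2 makes the API key readable only by its owner. The sketch below checks that from Python on a Linux VM; the helper name `is_private_file` is hypothetical.

```python
import os
import stat


def is_private_file(path):
    """True if neither group nor others have any permission on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


key_path = os.path.expanduser("~/.kaggle/kaggle.json")
if os.path.exists(key_path):
    print("kaggle.json is private:", is_private_file(key_path))
else:
    print("kaggle.json not found at", key_path)
```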

In [None]:
!unzip ecbm4040-fall2023.zip 

In [1]:
#Generate dataset
import os
import pandas as pd
import numpy as np
from PIL import Image

root = "./assignment2-task5/" #TODO: Enter your path

#Load Training images and labels
train_directory = root + "kaggle_train_128/train_128"
image_list=[]
label_list=[]
for sub_dir in os.listdir(train_directory):
    print("Reading folder {}".format(sub_dir))
    sub_dir_name=os.path.join(train_directory,sub_dir)
    for file in os.listdir(sub_dir_name):
        filename = os.fsdecode(file)
        if filename.endswith(".jpg") or filename.endswith(".png"):
            image_list.append(np.array(Image.open(os.path.join(sub_dir_name,file))))
            label_list.append(int(sub_dir))
X_train = np.array(image_list)
y_train = np.array(label_list)

#Load Test images
test_directory = root + "kaggle_test_128/test_128"
test_rows = []
print("Reading Test Images")
for file in os.listdir(test_directory):
    filename = os.fsdecode(file)
    if filename.endswith(".jpg") or filename.endswith(".png"):
        test_rows.append({
            'Id': filename,
            'X': np.array(Image.open(os.path.join(test_directory,file)))
        })
# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
test_df = pd.DataFrame(test_rows, columns=['Id', 'X'])

test_df['s'] = [int(x.split('.')[0]) for x in test_df['Id']]
test_df = test_df.sort_values(by=['s'])
test_df = test_df.drop(columns=['s'])
X_test = np.stack(test_df['X'])

print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)

Reading folder 1
Reading folder 3
Reading folder 0
Reading folder 2
Reading folder 4
Reading Test Images
Training data shape:  (15000, 128, 128, 3)
Training labels shape:  (15000,)
Test data shape:  (3500, 128, 128, 3)


## Build and Train Your Model Here

In [2]:
# YOUR CODE HERE
!pip install split-folders
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
import pathlib, splitfolders, os, math
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import *
import matplotlib.pyplot as plt
from PIL import Image



In [3]:
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)

Training data shape:  (15000, 128, 128, 3)
Training labels shape:  (15000,)
Test data shape:  (3500, 128, 128, 3)


In [4]:
class CNNBlock(tf.keras.layers.Layer):
    def __init__(self, filters, kernel_size, strides, padding, pool_size, dropout_rate):
        super(CNNBlock, self).__init__()
        self.C1 = Conv2D(filters=filters, kernel_size=kernel_size, strides=strides, padding=padding)
        self.B1 = BatchNormalization()
        self.A1 = Activation('relu')
        self.P1 = MaxPooling2D(pool_size=pool_size, strides=2, padding=padding)
        self.Dr1 = Dropout(dropout_rate)
        
    def call(self, x):
        x = self.C1(x)
        x = self.B1(x)
        x = self.A1(x)
        x = self.P1(x)
        y = self.Dr1(x)
        return y

In [5]:
class DenseBlock(tf.keras.layers.Layer):
    def __init__(self, units, dropout_rate):
        super(DenseBlock, self).__init__()
        self.D1 = Dense(units, activation='relu')
        self.B1 = BatchNormalization()
        self.D2 = Dense(units * 2, activation='relu')
        self.D3 = Dense(units * 2, activation='relu')
        self.D4 = Dense(units, activation='relu')
        self.Dr1 = Dropout(dropout_rate)
        self.D5 = Dense(5, activation='softmax')
        
    def call(self, x):
        x = self.D1(x)
        x = self.B1(x)
        x = self.D2(x)
        x = self.D3(x)
        x = self.D4(x)
        x = self.Dr1(x)
        y = self.D5(x)
        return y

In [6]:
class NeuralNetwork(tf.keras.Model):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        # filters = number of output channels, kernel_size = (height, width) of each filter,
        # strides = step of the filter, padding='same' pads so the output size is ceil(input / stride),
        # pool_size = (height, width) of the pooling window
        self.C1 = Conv2D(filters=32, kernel_size=(3, 3), strides=1, padding='same', input_shape=input_shape)
        self.B1 = BatchNormalization()
        self.A1 = Activation('relu')
        
        self.layer1 = CNNBlock(filters=32, kernel_size=(3, 3), strides=1, padding='same', pool_size=(2, 2), dropout_rate=0.3)
        self.layer2 = CNNBlock(filters=64, kernel_size=(3, 3), strides=1, padding='same', pool_size=(2, 2), dropout_rate=0.4)
        self.layer3 = CNNBlock(filters=32, kernel_size=(3, 3), strides=1, padding='same', pool_size=(2, 2), dropout_rate=0.3)
        
        self.F1 = Flatten()
        self.layer4 = DenseBlock(units=64, dropout_rate=0.3)
        
    def call(self, x):
        x = self.C1(x)
        x = self.B1(x)
        x = self.A1(x)
        
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        
        x = self.F1(x)
        
        y = self.layer4(x)
        return y
    
    def __repr__(self):
        name = 'Bangle_Net'
        return name

    def build_graph(self):
        # Trace the model with the same input shape as the training images (128, 128, 3)
        x = tf.keras.Input(shape=input_shape)
        return tf.keras.Model(inputs=[x], outputs=self.call(x))
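
The spatial sizes inside the network above can be worked out by hand. With `padding='same'`, a layer with stride `s` maps an input of size `n` to `ceil(n / s)`, regardless of kernel size. The sketch below applies that rule to the stem convolution and the three `CNNBlock`s for 128 × 128 inputs; the helper names are hypothetical.

```python
import math


def same_padding_output(size, stride):
    """Output size of a conv/pool layer with padding='same'."""
    return math.ceil(size / stride)


def flattened_features(input_size=128, num_blocks=3, last_filters=32):
    """Spatial size and feature count entering Flatten for the network above."""
    size = same_padding_output(input_size, 1)   # stem convolution, stride 1
    for _ in range(num_blocks):
        size = same_padding_output(size, 1)     # block convolution, stride 1
        size = same_padding_output(size, 2)     # max pooling, stride 2
    return size, size * size * last_filters


print(flattened_features())  # (16, 8192): 128 -> 64 -> 32 -> 16, then 16 * 16 * 32
```

So the first `Dense` layer in `DenseBlock` receives 8192 features per image.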

In [None]:
X_train = tf.convert_to_tensor(X_train, dtype=tf.float32)

X_test = tf.convert_to_tensor(X_test, dtype=tf.float32)

In [None]:
y_train = tf.convert_to_tensor(y_train, dtype=tf.int32)

In [None]:
depth = 5
y_train_ohe = tf.one_hot(y_train,depth)

In [None]:
img_height, img_width = 128, 128
input_shape = (img_height, img_width, 3)

In [None]:
net = NeuralNetwork()

net.compile(optimizer='adam',
            loss='categorical_crossentropy',
            metrics=['accuracy'])

checkpoint_save_path = './checkpoint/Baseline.ckpt'
if os.path.exists(checkpoint_save_path + '.index'):
    net.load_weights(checkpoint_save_path)

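# No validation data is passed to fit(), so monitor the training loss; with the default
# monitor ('val_loss') the metric is never available and no checkpoint would be written.
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_save_path, save_weights_only=True,
                                                 save_best_only=True, monitor='loss')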

history = net.fit(X_train, y_train_ohe, epochs=20, batch_size=32, callbacks=[cp_callback])

net.summary()

# Dump every trainable variable (name, shape, values) to a text file
with open('./weights.txt', 'w') as file:
    for v in net.trainable_variables:
        file.write(str(v.name) + '\n')
        file.write(str(v.shape) + '\n')
        file.write(str(v.numpy()) + '\n')

## Save your best model

**Link to large model on Google Drive: [insert link here]**

In [None]:
# YOUR CODE HERE
net.save('final', save_format='tf')


## Generate .csv file for Kaggle

The following code snippet can be used to generate your prediction .csv file.

NOTE: If your Kaggle results are indicating random performance, then it's likely that the indices of your csv predictions are misaligned.
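
One common source of such misalignment is sorting filenames as strings instead of numbers. The sketch below shows the difference, assuming filenames of the form `<integer>.<extension>` as in the loading code of this notebook; the helper name `sort_ids_numerically` is hypothetical.

```python
def sort_ids_numerically(filenames):
    """Sort names like '12.png' by their integer prefix rather than alphabetically."""
    return sorted(filenames, key=lambda name: int(name.split(".")[0]))


names = ["10.png", "2.png", "1.png"]
print(sorted(names))                # ['1.png', '10.png', '2.png']  (string order)
print(sort_ids_numerically(names))  # ['1.png', '2.png', '10.png']  (numeric order)
```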

In [None]:
import pandas as pd
import numpy as np
import os
from PIL import Image
import tensorflow as tf

# Assuming 'root' is defined and points to the root directory
root = "./assignment2-task5/"  # replace with your root directory path
test_directory = root + 'kaggle_test_128/test_128'

In [None]:
# Load trained model
model = tf.keras.models.load_model('final')  # same path passed to net.save() above; replace with your model folder

In [None]:
test_rows = []
print("Reading Test Images")

In [None]:
for file in os.listdir(test_directory):
    filename = os.fsdecode(file)
    if filename.endswith(".jpg") or filename.endswith(".png"):
        image_path = os.path.join(test_directory, file)
        image = np.array(Image.open(image_path).convert('RGB'))  # Ensure images are in RGB
        test_rows.append({'Id': filename, 'X': image})

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
test_df = pd.DataFrame(test_rows, columns=['Id', 'X'])


In [None]:
# Sort the dataframe by image ID
test_df['s'] = [int(x.split('.')[0]) for x in test_df['Id']]
test_df = test_df.sort_values(by=['s'])
test_df = test_df.drop(columns=['s'])

# Stack the images into a numpy array
X_test = np.stack(test_df['X'])

# Cast to float32 without rescaling: the model above was trained on raw pixel values in [0, 255]
X_test = X_test.astype('float32')

print('Test data shape: ', X_test.shape)


In [None]:
# Make predictions
print("Making predictions")
predictions = model.predict(X_test)

# Convert predictions to labels
predicted_labels = np.argmax(predictions, axis=1)

# Create a DataFrame for the predictions
predictions_df = pd.DataFrame({
    'Id': test_df['Id'],
    'label': predicted_labels
})

In [None]:
# Save the predictions to a CSV file
predictions_df.to_csv('predict_labels.csv', index=False, header=True)
print("Predictions saved to predict_labels.csv")

In [None]:
#Google Drive Link
