# 9. Keras and deep learning
In this lab, we will learn how to use Keras to build deep learning models: an LSTM model for sentiment classification and a CNN model for digit recognition.
Complete the models by filling in the blocks between `## YOUR CODE HERE` and `## END OF YOUR CODE`.

## Installation
Before you can start using Keras, you'll need to install TensorFlow, which includes Keras as part of its core library.
```bash
source activate {your_env}
pip install tensorflow
pip install keras
```

## Basics of Keras
Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

### 1. Initialize a model
Start by creating a Sequential model and adding layers to it.
```python
from keras.models import Sequential
from keras.layers import Dense

# Initialize a model
model = Sequential()

# Add layers to the model
model.add(Dense(units=64, activation='relu', input_dim=100))
model.add(Dense(units=10, activation='softmax'))

# this is equivalent to the above
#model = Sequential([
#    Dense(64, activation='relu', input_dim=100),
#    Dense(10, activation='softmax')
#])
```
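
To make the layer sizes concrete, the sketch below counts the trainable parameters of the two `Dense` layers above. It assumes each layer holds one weight per input-unit pair plus one bias per unit (the Keras default, `use_bias=True`). The helper name `dense_param_count` is hypothetical and not part of Keras.

```python
def dense_param_count(input_dim, units):
    """Weights (input_dim * units) plus one bias per unit."""
    return input_dim * units + units


hidden = dense_param_count(100, 64)   # first Dense layer: 100 inputs -> 64 units
output = dense_param_count(64, 10)    # second Dense layer: 64 inputs -> 10 units
print(hidden, output, hidden + output)
```

Under these assumptions the two layers hold 6464 and 650 parameters, which is what `model.summary()` should report for them.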


In [35]:
from keras.models import Sequential
from keras.layers import Dense
import numpy as np

# Initialize a model
model = Sequential()

# Add layers to the model
model.add(Dense(units=64, activation='relu', input_dim=100))
model.add(Dense(units=10, activation='softmax'))


### 2. Compile the model
Compile the model with the appropriate loss function and optimizer.
```python
model.compile(loss='categorical_crossentropy', # loss function, binary_crossentropy for binary classification
              optimizer='sgd', # stochastic gradient descent
              metrics=['accuracy'])
```
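
As a rough illustration of what the `categorical_crossentropy` loss measures, here is a minimal pure-Python sketch of softmax followed by cross-entropy for a single sample. The function names and the `eps` clipping value are assumptions made for this example; Keras's own implementation works on batched tensors and may differ in detail.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


probs = softmax([2.0, 1.0, 0.1])
loss = categorical_crossentropy([1, 0, 0], probs)
print(probs, loss)
```

With a one-hot target, the loss reduces to the negative log of the probability assigned to the true class, so a confident correct prediction gives a loss near zero.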


In [36]:
model.compile(loss='categorical_crossentropy', # loss function, binary_crossentropy for binary classification
              optimizer='sgd', # stochastic gradient descent
              metrics=['accuracy'])


### 3. Train the model
Train the model with the training data.
```python
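import numpy as np

# Dummy data: 1000 samples with 100 features, one-hot labels over 10 classes
x_train = np.random.random((1000, 100))
y_train = np.eye(10)[np.random.randint(10, size=1000)]
model.fit(x_train, y_train, epochs=5, batch_size=32)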
```
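
The `epochs` and `batch_size` arguments determine how many gradient updates `fit` performs. The sketch below splits 1000 samples into batches of 32, ignoring the shuffling that `fit` applies by default. The helper `iter_batches` is a hypothetical name used only for illustration.

```python
import math


def iter_batches(n_samples, batch_size):
    """Yield (start, end) index pairs covering all samples in order."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


batches = list(iter_batches(1000, 32))
steps_per_epoch = len(batches)        # equals ceil(1000 / 32)
total_updates = 5 * steps_per_epoch   # 5 epochs
print(steps_per_epoch, batches[-1], total_updates)
```

The last batch is smaller than the others because 1000 is not a multiple of 32.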


In [37]:
x_train = np.random.random((1000, 100))
y_train = np.eye(10)[np.random.randint(10, size=1000)]  # one-hot labels, as categorical_crossentropy expects
model.fit(x_train, y_train, epochs=5, batch_size=32)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x1d1c020d250>

In [38]:
x_test = np.random.random((100, 100))
y_test = np.eye(10)[np.random.randint(10, size=100)]  # one-hot labels, as categorical_crossentropy expects
loss_and_metrics = model.evaluate(x_test, y_test, batch_size=128)




## Keras LSTM for IMDB sentiment classification
The IMDB dataset is in `datasets/` of this repository. Use the following code to load the dataset, then write an LSTM model to classify the sentiment of the reviews.
```python
import pandas as pd    # to load dataset
import nltk
from nltk.corpus import stopwords   # to get a collection of stopwords

data = pd.read_csv('../datasets/IMDB.csv')

# Directory (not the CSV file) where the NLTK stopwords corpus is stored
custom_path = '../datasets'

# Append your custom path to the NLTK data path so the corpus can be found
nltk.data.path.append(custom_path)

nltk.download('stopwords', download_dir=custom_path)
english_stops = set(stopwords.words('english'))

x_data = data['review']       # Reviews/Input
y_data = data['sentiment']    # Sentiment/Output
# PRE-PROCESS REVIEW
x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
# lower case first, so capitalized stop words such as "The" are also removed
x_data = x_data.apply(lambda review: [w for w in review.lower().split() if w not in english_stops])
```
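
To see what the pre-processing steps do to a single review, here is a small standalone sketch using only the standard library. The stop word set `DEMO_STOPS` and the function name `preprocess` are made up for this example (the lab itself uses the NLTK English stop word list), and the sketch lowercases before filtering so that capitalized stop words are caught.

```python
import re

# Tiny hypothetical stop word list, for illustration only
DEMO_STOPS = {"this", "is", "the", "a"}


def preprocess(review, stops=DEMO_STOPS):
    review = re.sub(r"<.*?>", "", review)          # remove html tags
    review = re.sub(r"[^A-Za-z]", " ", review)     # replace non-alphabet characters with spaces
    words = review.lower().split()                 # lower case and split on whitespace
    return [w for w in words if w not in stops]    # remove stop words


tokens = preprocess("<br />This movie is GREAT!!")
print(tokens)
```

For this input the result is `['movie', 'great']`.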


In [39]:
import pandas as pd    # to load dataset
import nltk
from nltk.corpus import stopwords   # to get a collection of stopwords

data = pd.read_csv('../datasets/IMDB.csv')
custom_path = '../datasets'
nltk.data.path.append(custom_path)   # let NLTK find the corpus downloaded to custom_path
nltk.download('stopwords', download_dir=custom_path)
english_stops = set(stopwords.words('english'))

x_data = data['review']       # Reviews/Input
y_data = data['sentiment']    # Sentiment/Output
# PRE-PROCESS REVIEW
x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
# lower case first, so capitalized stop words such as "The" are also removed
x_data = x_data.apply(lambda review: [w for w in review.lower().split() if w not in english_stops])


[nltk_data] Downloading package stopwords to ../datasets...
[nltk_data]   Package stopwords is already up-to-date!


The tokenization of the reviews is done by the following code:
```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=10000)    # num_words is the number of words to keep based on word frequency
tokenizer.fit_on_texts(x_data)            # fit tokenizer to our training text data

# retrieve the word index
word_index = tokenizer.word_index

x_data = tokenizer.texts_to_sequences(x_data)  # convert our text data to sequence of numbers
```
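
The following pure-Python sketch approximates what the tokenizer does with already-split reviews: it ranks words by frequency (index 1 is the most frequent) and then keeps only words whose index is below `num_words`. The helper names `build_word_index` and `to_sequences` are hypothetical, and this is a simplified stand-in for illustration rather than the Keras implementation.

```python
from collections import Counter


def build_word_index(token_lists):
    """Map each word to a 1-based rank; more frequent words get smaller indices."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    ranked = sorted(counts, key=lambda w: -counts[w])   # stable: ties keep first-seen order
    return {w: i for i, w in enumerate(ranked, start=1)}


def to_sequences(token_lists, word_index, num_words):
    """Replace words by their index, dropping words ranked num_words or higher."""
    return [
        [word_index[w] for w in tokens if w in word_index and word_index[w] < num_words]
        for tokens in token_lists
    ]


texts = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=3)
print(word_index, sequences)
```

Note that with `num_words=3` only indices 1 and 2 survive, which mirrors how `Tokenizer(num_words=10000)` keeps the 9999 most frequent words.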


In [19]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=10000)    # num_words is the number of words to keep based on word frequency
tokenizer.fit_on_texts(x_data)            # fit tokenizer to our training text data

# retrieve the word index
word_index = tokenizer.word_index

x_data = tokenizer.texts_to_sequences(x_data)  # convert our text data to sequence of numbers
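
The sequences produced by the tokenizer have different lengths, so they are padded to a fixed length before training. Below is a minimal sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`), written in plain Python for a single sequence. The function name `pad_pre` is hypothetical, and `maxlen` is assumed to be positive.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last maxlen items, then left-pad with value up to maxlen."""
    seq = list(seq)[-maxlen:]                        # truncate from the front
    return [value] * (maxlen - len(seq)) + seq       # pad at the front


print(pad_pre([5, 3], 4))
print(pad_pre([1, 2, 3, 4, 5, 6], 4))
```

Short reviews gain leading zeros and long reviews lose their earliest words, so the end of each review is always preserved.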


Now, complete the following code to create an LSTM model for the IMDB sentiment classification.

In [20]:
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, LSTM, Dense, GRU
from sklearn.model_selection import train_test_split
# Pad sequences to ensure uniform input size
max_length = 100  # Define sequence length
x_data = pad_sequences(x_data, maxlen=max_length)

# Convert sentiments to binary labels
y_data = np.where(y_data == 'positive', 1, 0)

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=42)

# Build the model
# NOTE: this uses a SimpleRNN layer; for the LSTM this lab asks for,
# replace it with LSTM(units=32) (LSTM is already imported above)
model = Sequential([
    Embedding(input_dim=10000, output_dim=32, input_length=max_length),
    SimpleRNN(units=32),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2, verbose=1)

# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test Accuracy: {test_acc}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy: 0.847100019454956
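
To compare the `SimpleRNN` layer used above with the LSTM this section is about, the sketch below counts trainable parameters for each layer type, assuming the standard formulations with biases: an LSTM has four gate blocks, each the size of one simple recurrent layer. The helper names are hypothetical and only illustrate the arithmetic.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + biases."""
    return units * (input_dim + units) + units


def lstm_params(input_dim, units):
    """Four gate blocks, each shaped like a simple recurrent layer."""
    return 4 * simple_rnn_params(input_dim, units)


print(embedding_params(10000, 32))   # Embedding(input_dim=10000, output_dim=32)
print(simple_rnn_params(32, 32))     # SimpleRNN(units=32) on 32-dimensional embeddings
print(lstm_params(32, 32))           # LSTM(units=32) on the same input
```

Under these assumptions the recurrent layer grows from 2080 to 8320 parameters when `SimpleRNN` is swapped for `LSTM`, while the embedding (320000 parameters) still dominates the model size.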


## Keras CNN for Digit Recognition
In lab 5, we used the digits dataset. Now, we will use the same dataset to train a CNN model to recognize the digits.
```python
import pandas as pd

X_train = pd.read_csv('../datasets/digits/Digits_X_train.csv').values
y_train = pd.read_csv('../datasets/digits/Digits_y_train.csv').values
X_test  = pd.read_csv('../datasets/digits/Digits_X_test.csv').values
y_test  = pd.read_csv('../datasets/digits/Digits_y_test.csv').values
```

In [33]:
import pandas as pd

X_train = pd.read_csv('../datasets/digits/Digits_X_train.csv').values
y_train = pd.read_csv('../datasets/digits/Digits_y_train.csv').values.ravel()
X_test  = pd.read_csv('../datasets/digits/Digits_X_test.csv').values
y_test  = pd.read_csv('../datasets/digits/Digits_y_test.csv').values.ravel()
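
The labels loaded above are assumed to be integer class ids from 0 to 9. Training with `categorical_crossentropy` expects them in one-hot form, which is what `to_categorical` produces. A minimal pure-Python sketch of that conversion follows; the function name `one_hot` is hypothetical.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a 1.0 at the label's index and 0.0 elsewhere."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[int(label)] = 1.0
        rows.append(row)
    return rows


print(one_hot([3, 0], num_classes=4))
```

Each row has exactly one non-zero entry, so it lines up with the 10-way softmax output of the model.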


Complete the following code to create a CNN model for the digit recognition.

In [34]:
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten, MaxPooling2D
from keras.utils import to_categorical

# One-hot encode the labels
y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)

# Reshape the data to 8 * 8 * 1
X_train = X_train.reshape(X_train.shape[0], 8, 8, 1)
X_test = X_test.reshape(X_test.shape[0], 8, 8, 1)
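# Normalize the image data. Dividing by 255 assumes 8-bit pixel values; if the
# CSVs hold scikit-learn's 8x8 digits (pixel values 0-16), divide by 16 instead.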
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255

## YOUR CODE HERE
# Create the model
model = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=(8, 8, 1), padding='same'),
    MaxPooling2D(2, 2),
    Conv2D(32, (3, 3), activation='relu', padding='same'),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax')
])


# Print the model summary (summary() prints directly and returns None)
model.summary()

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.2, verbose=1)

## END OF YOUR CODE

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print("Accuracy: ", accuracy)

Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_16 (Conv2D)          (None, 8, 8, 16)          160       
                                                                 
 max_pooling2d_14 (MaxPooli  (None, 4, 4, 16)          0         
 ng2D)                                                           
                                                                 
 conv2d_17 (Conv2D)          (None, 4, 4, 32)          4640      
                                                                 
 flatten_8 (Flatten)         (None, 512)               0         
                                                                 
 dense_19 (Dense)            (None, 64)                32832     
                                                                 
 dense_20 (Dense)            (None, 10)                650       
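
The parameter counts in the summary above can be checked by hand. The sketch below assumes the usual formulas with biases: a convolution holds one kernel of size `kernel_h * kernel_w * in_channels` plus one bias per filter, and a dense layer holds one weight per input-unit pair plus one bias per unit. The helper names are hypothetical.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """One kernel plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(input_dim, units):
    """One weight per input-unit pair plus one bias per unit."""
    return input_dim * units + units


conv1 = conv2d_params(3, 3, 1, 16)     # first Conv2D on a 1-channel image
conv2 = conv2d_params(3, 3, 16, 32)    # second Conv2D on 16 feature maps
flat = 4 * 4 * 32                      # Flatten output after 2x2 pooling of 8x8 maps
dense1 = dense_params(flat, 64)
dense2 = dense_params(64, 10)
total = conv1 + conv2 + dense1 + dense2
print(conv1, conv2, flat, dense1, dense2, total)
```

The results (160, 4640, 512, 32832 and 650) match the `Param #` and output shape columns of the summary, for a total of 38282 trainable parameters.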
                                                     