<a href="https://colab.research.google.com/github/arav-dhoot/DL-Workshop/blob/main/Celebrities_Classification_Algorithm_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Using Machine Learning to Detect Celebrities**

For an average film-goer, telling two celebrities apart is child's play. Computers, however, need help learning the difference. By building a simple machine learning model known as a Convolutional Neural Network (CNN), using common Python libraries such as NumPy, pandas, TensorFlow, Keras, and scikit-learn, we will train a computer to distinguish between different celebrities.

This notebook is shared by all students, so before you start, please make a copy of it in your own Google Drive. To do so, use the menu in the top left-hand corner: **File > Save a copy in Drive**

In [None]:
#These packages help with file management
import os
import glob as gb
from google.colab import files

##NumPy
* NumPy is a popular Python library for numerical computing and array manipulation.
* It provides powerful tools for efficient numerical operations on multi-dimensional arrays and matrices.
* NumPy's core feature is its ndarray (N-dimensional array) object, which allows for fast and memory-efficient array operations.
* It offers a wide range of mathematical functions for array manipulation, such as linear algebra, Fourier transforms, and random number generation.
* NumPy is widely used in scientific computing, data analysis, machine learning, and other domains that require high-performance numerical operations.
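
As a small illustration of the ndarray described above, the sketch below stores a tiny made-up "image" and rescales every pixel in one step. The array values are invented for the example and are not taken from the celebrity dataset.

```python
import numpy as np

# A tiny 2 x 3 single-channel "image" stored as an ndarray (made-up values).
pixels = np.array([[0, 128, 255],
                   [64, 32, 16]], dtype=np.uint8)

# Arithmetic is applied to every element at once, without a Python loop.
scaled = pixels / 255.0

print(pixels.shape)  # (rows, columns)
print(scaled.max())  # largest rescaled value
```

Later in this notebook the same idea is applied to real images, where each picture is an array of shape (height, width, 3).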

##Pandas
* Pandas is a powerful Python library for data manipulation, analysis, and exploration.
* It provides easy-to-use data structures, such as DataFrame and Series, for efficient handling of structured data.
* Pandas enables loading data from various sources like CSV, Excel, SQL databases, and more, allowing for seamless data ingestion and integration.
* It offers extensive functionality for data cleaning, transformation, and preprocessing, including missing data handling, data alignment, and data type conversion.
* Pandas supports flexible data indexing, slicing, and filtering operations, making it convenient for extracting and manipulating subsets of data.

##Matplotlib
* Matplotlib is a popular data visualization library in Python.
* It provides a comprehensive set of functions for creating various types of plots, charts, and graphs.
* Matplotlib allows for the creation of static, animated, and interactive visualizations with high customization options.
* It supports a wide range of plot types, including line plots, scatter plots, bar plots, histograms, pie charts, and more.
* Matplotlib provides fine-grained control over plot elements such as axes, labels, titles, colors, and styles.

##Seaborn
* Seaborn is a powerful data visualization library built on top of Matplotlib in Python.
* It provides a higher-level interface and simplifies the process of creating aesthetically pleasing statistical graphics.
* Seaborn offers a wide range of statistical visualization functions, including scatter plots, line plots, bar plots, histograms, box plots, heatmaps, and more.
* It specializes in creating informative and visually appealing visualizations with minimal code.

In [None]:
#These packages help with data handling (pandas, numpy) and data visualization (matplotlib, seaborn)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Pillow
* Pillow is a popular Python library for image processing and manipulation.
* It provides a wide range of functions and methods for opening, editing, and saving various image file formats.
* Pillow allows for basic image operations such as resizing, cropping, rotating, and flipping images.
* It supports advanced image processing tasks, including filtering, blending, enhancing, and applying effects to images.

## OpenCV
* OpenCV (Open Source Computer Vision Library) is a popular open-source computer vision and image processing library.
* It provides a wide range of functions and algorithms for tasks related to image and video analysis, object detection, and machine learning.
* OpenCV supports real-time computer vision applications and can process images and videos in real-time from various sources.
* It offers functionalities such as image and video capture, image filtering, feature detection, object recognition, and camera calibration.
* OpenCV provides efficient implementations of common computer vision algorithms, including edge detection, image segmentation, and optical flow.

In [None]:
#These packages help with managing and viewing images
from PIL import Image
import cv2

## TensorFlow
* TensorFlow is a popular open-source machine learning framework developed by Google.
* It provides a flexible and scalable ecosystem for building and deploying machine learning models.
* TensorFlow offers a computational graph abstraction, where models are represented as a series of connected nodes that perform mathematical operations.
* It supports both deep learning and traditional machine learning algorithms, allowing for a wide range of applications.
* TensorFlow provides extensive tools and libraries for tasks such as neural networks, natural language processing, computer vision, and reinforcement learning.

## Scikit-Learn
* Scikit-learn is a popular open-source machine learning library for Python.
* It provides a comprehensive set of tools and algorithms for machine learning tasks, including classification, regression, clustering, dimensionality reduction, and model evaluation.
* Scikit-learn offers a consistent and user-friendly API, making it easy to develop and experiment with machine learning models.
* It supports various supervised and unsupervised learning techniques, such as decision trees, support vector machines, random forests, k-means clustering, and principal component analysis.

In [None]:
#These packages help with building, training, and analyzing the CNN
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Flatten, GlobalAveragePooling2D
from keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

When you run the next code block, you will be prompted to upload files. Please upload the ```M17.h5``` file to the notebook. That file is essential in running further code blocks.

In [None]:
files.upload()

In the next code block, you will be prompted to give Google Colab access to your Google Drive. Please make sure you grant access to the Google account where the ```Celebrity_Faces_Dataset``` is stored.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

The network we train can only accept images of one fixed size, so every image must be resized to the same dimensions. Smaller images let the CNN use fewer computational resources during training. Larger images, on the other hand, retain more of the detail in the original picture. In the next block, you can choose the side length (in pixels) that the images will be resized to.
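
To get a feel for this trade-off, the sketch below counts how many numbers the network has to read for a square colour image of a given side length. The helper `values_per_image` is a hypothetical name introduced only for this illustration.

```python
def values_per_image(side, channels=3):
    """Number of pixel values in a square image with the given side length."""
    return side * side * channels

small = values_per_image(100)  # smallest slider setting
large = values_per_image(224)  # largest slider setting

print(small, large)
print(f'The larger image holds about {large / small:.1f} times as many values')
```

Roughly speaking, the more values per image, the more memory and computation each training step needs.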

In [None]:
ImageSize = 150 #@param {type:"slider", min:100, max:224, step:2}

In [None]:
#Declaring constants.
PATH='/content/drive/MyDrive/Celebrity_Faces_Dataset' #Change the path so that it points to the celebrity dataset on your Drive.

In [None]:
#Preliminary data exploration. It gives us a general idea of how the files are distributed between the different classes.
i = 0
for folder in os.listdir(PATH):
    image_paths = gb.glob(f'{PATH}/{folder}/*.jpg') #Named image_paths so the google.colab `files` module is not overwritten.
    print(f'There are {len(image_paths)} images in the {folder} folder')
    i+=1
print(f'There are a total of {i} folders!')

In [None]:
# Define a dictionary mapping each celebrity name
# to its "class", a numerical value representing the celebrity.
# The dictionary makes it easy to go from the numeric value to the corresponding name
# and vice versa.
# Numeric values are easier to handle for the DL algorithms we shall build.
# We have 17 celebrities and thus 17 classes: from 0 to 16.
CelebCodes = {'Sandra_Bullock':0,
        'Angelina_Jolie':1,
        'Natalie_Portman':2,
        'Megan_Fox':3,
        'Tom_Cruise':4,
        'Kate_Winslet':5,
        'Leonardo_DiCaprio':6,
        'Jennifer_Lawrence':7,
        'Brad_Pitt':8,
        'Hugh_Jackman':9,
        'Will_Smith':10,
        'Nicole_Kidman':11,
        'Johnny_Depp':12,
        'Robert_Downey_Jr':13,
        'Tom_Hanks':14,
        'Scarlett_Johansson':15,
        'Denzel_Washington':16}
CelebKeys = list(CelebCodes.keys())

In [None]:
#X is a list of pictures (celebrity images) in the form of numpy arrays.
X = []
#y is a list of the celebrity classes of the corresponding images.
y = []
for folder in os.listdir(PATH):
    image_paths = gb.glob(f'{PATH}/{folder}/*.jpg') #Named image_paths so the google.colab `files` module is not overwritten.
    for image_path in image_paths:
        image = cv2.imread(image_path)
        image_array = cv2.resize(image, (ImageSize,ImageSize)) #Each image is resized to a uniform ImageSize x ImageSize.
        X.append(list(image_array))
        y.append(CelebCodes[folder])

## Data Visualization

Now let's view a few images to see what our data looks like. In the next code block, you can input any number between 0 and 1699. The number you choose is an index and corresponds to a specific image of a celebrity. Enter a few numbers and run the cell to view different celebrities. What do you notice when you enter consecutive numbers? What do you notice when you enter numbers with a difference of 100 or more?

In [None]:
index = 1601 #@param {type:"integer"}
if index < len(X):
  plt.imshow(X[index])
  title = str(f'Class: {y[index]:02} -- Celebrity: {CelebKeys[y[index]]}')
  plt.title(title)
  plt.show()
else:
  print('Index out of range!')

## One-Hot Encoding
One-hot encoding is a way to represent categorical data in a format that a computer can understand. Imagine you have a list of categories, like colors: red, blue, and green. Instead of using words to represent these categories, we convert them into numbers.

In one-hot encoding, we create a new column for each category. If an item belongs to a specific category, we put a "1" in that column; otherwise, we put a "0". So, for our example of colors, the column for red would have a "1" if an item is red and "0" otherwise. The column for blue would have a "1" if an item is blue and "0" otherwise, and so on.
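
The colour example above can be sketched by hand with NumPy. This is only an illustration of the idea; the notebook itself uses scikit-learn's `OneHotEncoder`, and the list `colors` is made up for the example.

```python
import numpy as np

colors = ['red', 'blue', 'green', 'blue']

# One column per category, in alphabetical order: blue, green, red.
categories = sorted(set(colors))
column_of = {name: i for i, name in enumerate(categories)}

# Start with all zeros, then put a single 1 in the matching column of each row.
one_hot = np.zeros((len(colors), len(categories)), dtype=int)
for row, name in enumerate(colors):
    one_hot[row, column_of[name]] = 1

print(categories)
print(one_hot)
```

Each row contains exactly one 1, so the row sums are all 1.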

In [None]:
#Each value in the y array is converted to a one-hot encoded array.
encoder = OneHotEncoder()
Y = encoder.fit_transform(np.array(y).reshape(-1,1)).toarray()

In [None]:
#The X and Y lists are converted to numpy arrays.
X = np.asarray(X)
Y = np.asarray(Y)

In [None]:
#The X and Y arrays are split into train and test sub-datasets in a 7:3 ratio.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1 , shuffle=True)

#Printing the shapes of the train and test datasets.
print(X_train.shape)
print(X_test.shape)
print(y_test.shape)
print(y_train.shape)

###Batch Size
In machine learning, batch size refers to the number of data samples that are processed together in one iteration during the training of a model. It affects how the model learns from the data and how computational resources are utilized.

Imagine you have a large set of math problems to solve. Instead of solving them one by one, you decide to solve a group of problems together. The batch size is like the number of problems you solve in each round.

Once you have trained the model, you can come back and experiment with the batch size, changing it until you get a better test accuracy score.
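
The sketch below shows how the batch size determines the number of rounds needed to see every training image once. The figure of 1,190 training images is an assumption (70% of the 1,700 images implied by the index range above), and `batches_per_epoch` is a hypothetical helper name.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Rounds needed to see every sample once; the last batch may be smaller."""
    return math.ceil(num_samples / batch_size)

num_train = 1190  # assumption: 70% of 1,700 images

for batch_size in (10, 20, 30):
    print(f'batch size {batch_size}: {batches_per_epoch(num_train, batch_size)} batches')
```

A smaller batch size means more, smaller rounds; a larger one means fewer, bigger rounds that need more memory each.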

In [None]:
BatchSize = 20 #@param {type:"slider", min:10, max:30, step:1}

## Data Augmentation
Data augmentation is a technique used to increase the size and diversity of a dataset by applying various transformations to existing data samples. It helps in training machine learning models to perform better and generalize well on unseen data.

Imagine you have a small set of images to train a model, but you want to make it more robust and capable of recognizing different variations of those images. Instead of collecting more images, you can use data augmentation.

Data augmentation involves applying operations like flipping, rotating, scaling, cropping, or changing the colors of the existing images. By doing this, you create new versions of the same image with slight modifications.
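
A few of these operations can be sketched with plain NumPy on a tiny made-up image. This is an illustration under simplified assumptions only: the notebook itself uses Keras' `ImageDataGenerator`, which applies random shears and rotations rather than the fixed flip and 90-degree turn shown here.

```python
import numpy as np

# A tiny 2 x 2 RGB "image" with made-up pixel values 0..11.
image = np.arange(12).reshape(2, 2, 3)

flipped = np.fliplr(image)               # mirror left-to-right
rotated = np.rot90(image)                # rotate 90 degrees counter-clockwise
brighter = np.clip(image + 50, 0, 255)   # brighten, keeping values in range

for name, variant in [('flipped', flipped), ('rotated', rotated), ('brighter', brighter)]:
    print(name, variant.shape)
```

Every variant has the same shape as the original, so the model can train on all of them as if they were new pictures.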

In [None]:
#Creating a list of transformations that we will apply to each image. This process is known as image augmentation. It helps to make deep learning algorithms robust.
train_datagen = ImageDataGenerator(rescale = 1./255, #Scales each pixel value from the 0-255 range to the 0-1 range.
                                   shear_range = 20.0, #Randomly shears (slants) the image by up to the specified angle in degrees.
                                   rotation_range = 20, #Randomly rotates the image by up to the specified angle in degrees.
                                  )
test_datagen = ImageDataGenerator(rescale = 1./255) #Scales each pixel value from the 0-255 range to the 0-1 range.

#Creating iterators that apply the transformations to the images batch by batch.
train_iterator = train_datagen.flow(X_train, y_train, BatchSize)
test_iterator = test_datagen.flow(X_test, y_test, BatchSize)
print('Batches train=%d, test=%d, batch-size=%d' % (len(train_iterator), len(test_iterator), BatchSize)) #The number of train batches to test batches is in a ratio of about 7:3.
batchX, batchy = next(train_iterator) #The built-in next() works whether or not the iterator has a .next() method.


##**Building the Model**##

###Question Time!!

1. What should the input shape of the first convolution layer be? *Hint: the input shape is the same as the size of the image.*

In [None]:
x_axis_size = 224 #@param {type:"integer"}
y_axis_size = 224 #@param {type:"integer"}

Run this code block to find out if your answer is correct

In [None]:
#@title
if x_axis_size != ImageSize or y_axis_size != ImageSize:
  print(f'Oh no! The input size needs to be the same as the image size: {ImageSize}')
else:
  print('Yay! You are correct!')

2. How many neurons (outputs) should the Dense, or final, layer of the model have? *Hint: the model should have as many output neurons as there are classes in the data.*

In [None]:
output_neuron = 17 #@param {type:"integer"}

Run this code block to find out if your answer is correct

In [None]:
#@title
if output_neuron != len(CelebKeys):
  print(f'Oh no! The number of output neurons needs to be the same as the number of classes we are detecting: {len(CelebKeys)}')
else:
  print('Yay! Your answer is correct!')

In [None]:
#Creating our CNN model!!!
model = Sequential()
model.add(Conv2D(32,kernel_size=(3,3),activation='relu',input_shape=(ImageSize,ImageSize,3)))
model.add(MaxPooling2D(2,2))
model.add(Conv2D(32,kernel_size=(4,4),activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(Dropout(rate=0.5))
model.add(Flatten())
model.add(Dense(len(CelebKeys),activation='softmax'))
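
The final Dense layer above uses a softmax activation, which turns one raw score per class into probabilities that add up to 1. The sketch below illustrates this for three made-up scores; the function `softmax` is written by hand here for illustration and is not the Keras implementation.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])  # made-up outputs for three classes
probs = softmax(scores)

print(probs.round(3))
print('Predicted class:', probs.argmax())
```

The class with the highest probability is the model's prediction, which is why the number of output neurons must match the number of classes.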

### Learning Rate
The learning rate is a parameter used in training machine learning models, specifically in optimization algorithms like gradient descent. It determines the step size taken in adjusting the model's parameters during the learning process.

Imagine you are climbing a mountain, and you want to reach the top by taking small steps. The learning rate is like the size of the steps you take. If the learning rate is too large, you might overshoot the optimal point and miss the peak. On the other hand, if the learning rate is too small, you might take forever to reach the top or get stuck in a suboptimal position.
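
The effect of the step size can be sketched on a very simple problem: finding the lowest point of f(x) = x², which lies at x = 0. Note that this example walks downhill rather than up a mountain, but the role of the learning rate is the same. The function `gradient_descent` and the three learning rates are chosen purely for illustration.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimise f(x) = x**2, whose gradient is 2 * x, starting from `start`."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2 * x  # step against the gradient
    return x

for lr in (0.001, 0.1, 1.1):
    print(f'learning rate {lr}: x after 20 steps = {gradient_descent(lr):.4f}')
```

With the tiny learning rate, x barely moves from 5; with the moderate one, it gets close to 0; with the large one, every step overshoots and x moves further and further away.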

In the next block, you can choose the learning rate. Don't forget to come back and play with the learning rate.

In [None]:
LearningRate = 0.0717236 #@param {type:"slider", min:0.0000001, max:0.1, step:1e-8}

In [None]:
#Creating an Adam optimizer with the learning rate you have just chosen. The learning rate is a hyperparameter that can be adjusted to improve the performance of the model.
opt = tf.keras.optimizers.Adam(LearningRate)
model.compile(optimizer=opt, loss='categorical_crossentropy',metrics=['accuracy'])

In [None]:
#Visualizing the model.
print('Model Details are : ')
model.summary() #summary() prints the table itself; wrapping it in print() would also print "None".

## Epochs
In machine learning, an epoch refers to a complete pass through the entire training dataset during the model training process. It helps in determining how many times the model will iteratively learn from the training data.

Imagine you have a book that you want to read and understand thoroughly. Instead of reading it all at once, you decide to read it chapter by chapter. Each time you read through all the chapters of the book, you complete one epoch.
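
Epochs and batch size together determine how many times the model's weights are updated. The sketch below works this out; the 1,190 training images are an assumption (70% of 1,700 images), and `total_updates` is a hypothetical helper name.

```python
import math

def total_updates(num_samples, batch_size, num_epochs):
    """One weight update per batch, repeated for every epoch."""
    batches = math.ceil(num_samples / batch_size)
    return batches * num_epochs

# Assumed values: 1,190 training images, batch size 20, 10 epochs.
print(total_updates(1190, 20, 10))
```

Doubling the number of epochs doubles the number of updates, and with it the training time.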

In the next block, you can choose the number of epochs you want to train your model for. Don't forget to come back and experiment with the number of epochs.

In [None]:
NumEpochs = 10 #@param {type:"slider", min:2, max:20, step:1}

In [None]:
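#Train the model for the number of epochs chosen above (NumEpochs). Training for more epochs often, but not always, improves performance.
#batch_size is passed explicitly: X_train and y_train are plain arrays, so fit() would otherwise use its default batch size of 32 rather than BatchSize.
History = model.fit(X_train, y_train, batch_size=BatchSize, epochs=NumEpochs)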

In [None]:
#Testing the model on the testing dataset.
test_eval = model.evaluate(X_test, y_test, batch_size=BatchSize)
print('Test loss:', test_eval[0])
print(f'Test accuracy: {test_eval[1] * 100}%')

Our constructed model does not achieve a very high accuracy score because it has only been trained for a few epochs, and because it is a small network. To improve the accuracy of a model, you may choose to increase its size (the number and structure of its layers) and change its hyperparameters. For example, you may choose to modify the learning rate, or increase the number of epochs that you train the CNN for.

In the next section, you will go against a bigger model that has been trained for 500 epochs and achieves an accuracy score that is very close to 100%.


In [None]:
#Let us load a pretrained model that has good accuracy.
#Path of the file where the model is stored.
PATHModel= '/content/M17.h5'
#Load the model so we can work with it.
Model500 = keras.models.load_model(PATHModel)

In [None]:
#Testing the loaded model on the testing dataset.
test_eval = Model500.evaluate(X_test, y_test, batch_size=BatchSize)
#Predicting values on the test dataset.
pred_test = Model500.predict(X_test)
y_pred = encoder.inverse_transform(pred_test)
y_t = encoder.inverse_transform(y_test)
print('Test loss:', test_eval[0])
print(f'Test accuracy: {test_eval[1] * 100}%')

**Now let's test you against a computer. You will be shown an image, choose the name of the celebrity, and then let's see what the model we just trained thinks. Run the next code block as many times as you wish.**

In [None]:
#@title
import random
#random.choice picks a valid entry directly; random.randint(0, n) can also return n, which is out of range as a list index.
random_celebrity = f'{PATH}/{random.choice(os.listdir(PATH))}'
file = f'{random_celebrity}/{random.choice(os.listdir(random_celebrity))}'
img = Image.open(file)
img

In [None]:
#@title
YourChoice = 'Scarlett_Johansson' #@param ['Will_Smith','Scarlett_Johansson','Nicole_Kidman','Denzel_Washington','Johnny_Depp','Robert_Downey_Jr','Hugh_Jackman','Jennifer_Lawrence','Brad_Pitt','Leonardo_DiCaprio','Megan_Fox','Kate_Winslet','Angelina_Jolie','Tom_Cruise','Tom_Hanks','Sandra_Bullock','Natalie_Portman']

In [None]:
#@title
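def get_celebrity(n):
  #Look up the celebrity name that corresponds to class number n.
  for k, v in CelebCodes.items():
    if v == n: return k

def predict(img, model):
  #Resize the image to the size the network expects.
  img = img.resize((ImageSize, ImageSize), Image.LANCZOS)
  np_img = np.asarray(img)
  #Add a batch dimension, predict, and take the class with the highest score.
  return get_celebrity(np.argmax(model.predict(np.expand_dims(np_img, axis=0))))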

In [None]:
#Predict using both models.
GoodModelValue = predict(img, Model500)
BadModelValue = predict(img, model)

In [None]:
#@title
random_celebrity = random_celebrity.split('/')[-1]
print(f'The good pre-trained model predicted celebrity: {GoodModelValue}')
if GoodModelValue == random_celebrity:
    if YourChoice != random_celebrity:
      print('Oops! The model predicted correctly, but unfortunately, you did not')
    elif YourChoice== random_celebrity:
      print('Yay! Both you and the model predicted the celebrity correctly')
else:
    if YourChoice != random_celebrity:
      print('Oops! Both you and the model predicted incorrectly')
    elif YourChoice == random_celebrity:
      print('Yay! You predicted correctly, but unfortunately, the model did not')

print(f'The model that we just trained predicted celebrity: {BadModelValue}')

In [None]:
#@title
#Placing all the values from the model prediction onto a dataframe.
df = pd.DataFrame(columns=['Predicted Labels', 'Actual Labels'])
df['Predicted Labels'] = y_pred.flatten()
df['Actual Labels'] = y_t.flatten()

In [None]:
#@title
#Creating a confusion matrix. This helps visualize what the model predicted versus what the actual labels of the images were.
cm  = confusion_matrix(y_t, y_pred ,normalize='true')
plt.figure(figsize = (12, 10))

CNN=[cm[0,0],cm[1,1],cm[2,2],cm[3,3],cm[4,4]]
CNN= pd.DataFrame(CNN)

x_label=np.array(CelebKeys)
y_label=np.array(CelebKeys)

cmYP = pd.DataFrame(cm, index = x_label, columns = y_label,)

sns.heatmap(cmYP, linecolor='white', cmap='Blues', linewidth=2, annot=True, )
plt.title('Confusion Matrix', size=20)
plt.xlabel('Predicted Labels', size=14)
plt.ylabel('Actual Labels', size=14)
plt.show()
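
To see how the numbers in such a heatmap come about, the sketch below builds a row-normalized confusion matrix by hand for three made-up classes, mirroring what `normalize='true'` does above. The label arrays are invented for illustration and are not results from either model.

```python
import numpy as np

# Made-up labels for seven images across three classes.
actual    = np.array([0, 0, 1, 1, 2, 2, 2])
predicted = np.array([0, 1, 1, 1, 2, 2, 0])

num_classes = 3
cm = np.zeros((num_classes, num_classes))
for a, p in zip(actual, predicted):
    cm[a, p] += 1  # row = actual class, column = predicted class

# Divide each row by its total so every row sums to 1.
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
print(cm_normalized.round(2))
```

Values on the diagonal show the fraction of each class that was predicted correctly; anything off the diagonal is a mix-up between two classes.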