## Meme Image Classifier Trainer with Model Upload to Cloud Storage
#### Overview
This is an end-to-end trainer for an image classifier that recognizes memes. Once a model is trained, it is uploaded to Azure cloud storage. The general flow of the trainer is:

1. A list of meme titles is pulled from the imgflip API.
2. Each title is used to run a Bing image search, which returns a list of image URLs to download for training.
3. Finally, training is run and the model is saved to cloud storage, where another service can pull it in.

This notebook is run on a cadence so that a consumer of the model can pull in the latest model at any given point from cloud storage.

The model is fairly good at recognizing any meme in the titles list, but relying on the imgflip API limits which memes it can learn. The API is supposed to return roughly the top 100 memes that are popular on the site, but in testing this appeared to be somewhat hit or miss.

Training is built on the fastai library and uses resnet18.

---
#### Notebook Setup

The following environment variables (stored as Kaggle secrets when running on Kaggle) must be set to use Azure services:
```
AZURE_SEARCH_KEY
AZURE_STORAGE_CONNECTION_STRING
STORAGE_SHARE_NAME
```
The image search can return a flexible number of images, set with `TRAINING_DATA_SIZE` below (maximum of 150 images per search).

This notebook is more or less agnostic to the service it runs on; anything that is specific to Kaggle should be tagged with the comment `only for Kaggle`.
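
To make the "agnostic" claim concrete, here is a minimal sketch of how the three settings could be read either from Kaggle secrets or from plain environment variables. The helper name `get_secret` is hypothetical and not part of this notebook; the sketch assumes that `kaggle_secrets` is importable only inside a Kaggle session.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Kaggle's secret store if available, else from the environment."""
    try:
        # Assumption: this module can only be imported inside a Kaggle notebook.
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        value = os.environ.get(name)
        if value is None:
            raise KeyError(f"{name} is not set in the environment")
        return value
    return UserSecretsClient().get_secret(name)
```

With a helper like this, a call such as `get_secret("AZURE_SEARCH_KEY")` could stand in for the Kaggle-only lookup when the notebook runs elsewhere.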

In [1]:
!pip install --user torch==1.9.0 torchvision==0.10.0 torchaudio==0.9.0 torchtext==0.10.0 # only for Kaggle

import os  # used below for directory creation; otherwise only available via the fastai star import
import pandas as pd
import sys
import time
import requests
import urllib.request
from PIL import Image
from fastai.vision.all import *
from pandas.api.types import CategoricalDtype
from pathlib import Path



from kaggle_secrets import UserSecretsClient # only for Kaggle
user_secrets = UserSecretsClient()

secret_key = user_secrets.get_secret("AZURE_SEARCH_KEY")


api_url = "https://api.imgflip.com/get_memes"
OUTPUT_DIR = "./"
TRAINING_DATA_DIR = "meme_training_data"
TRAINING_DATA_PATH = OUTPUT_DIR+TRAINING_DATA_DIR
TRAINING_DATA_SIZE = 40

# create the training data directory if it doesn't exist
os.makedirs(TRAINING_DATA_PATH, exist_ok=True)

#### Helper Functions

In [2]:
!pip install azure-cognitiveservices-search-imagesearch

from azure.cognitiveservices.search.imagesearch import ImageSearchClient as api
from msrest.authentication import CognitiveServicesCredentials as auth

def search_images_bing(key, term, min_sz=128, max_images=150):
    params = {'q':term, 'count':max_images, 'min_height':min_sz, 'min_width':min_sz}
    headers = {"Ocp-Apim-Subscription-Key":key}
    search_url = "https://api.bing.microsoft.com/v7.0/images/search"
    response = requests.get(search_url, headers=headers, params=params)
    response.raise_for_status()
    search_results = response.json()
    return L(search_results['value'])

def get_image(row) -> str:
    url = row['url']
    name = row['name']
    filename = name.lower().replace(' ', '_') + '.jpg'
    r = requests.get(url)
    with open(OUTPUT_DIR+filename, 'wb') as outfile:
        outfile.write(r.content)
    return filename

def get_training_image_from_bing(row):
    # append ' meme' as a separate word so the query is e.g. "Drake Hotline Bling meme"
    search_term = row['name'] + ' meme'
    return search_images_bing(secret_key, search_term, max_images=TRAINING_DATA_SIZE)

def get_training_data(row):
    path = Path(TRAINING_DATA_PATH + '/' + row['name'].lower().replace(' ', '_'))
    if not path.exists():
        path.mkdir()
    results = get_training_image_from_bing(row)
    results_content_urls = results.attrgot('contentUrl')
    download_images(path, urls=results_content_urls)
    return results_content_urls

---
## Get Meme Types From API

In [3]:
response = requests.get(api_url)
response.json()['data']['memes'][0]

# Load response into dataframe
meme_df = pd.DataFrame(response.json()['data']['memes'])

pd.set_option('display.max_columns', None)
meme_df
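
Because the titles list drives everything downstream, it may be worth checking the response before loading it into a dataframe. The sketch below is an illustration only: `meme_names` is a hypothetical helper, and it assumes the payload carries a `success` flag (and an `error_message` on failure) alongside the `data.memes` list used above. The sample payload is hand-written, not real API output.

```python
def meme_names(payload: dict) -> list:
    """Return the meme names from a decoded get_memes response, or raise if it failed."""
    if not payload.get("success"):
        # Assumption: failed responses include an "error_message" field.
        message = payload.get("error_message", "unknown error")
        raise ValueError(f"imgflip request failed: {message}")
    return [meme["name"] for meme in payload["data"]["memes"]]


# Hand-written sample in the same shape as the response used above.
sample = {
    "success": True,
    "data": {"memes": [{"id": "1", "name": "Example Meme", "url": "https://example.com/a.jpg"}]},
}
print(meme_names(sample))  # ['Example Meme']
```

Checking `len(meme_names(response.json()))` on each run would also show how many titles the API actually returned.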

---
## Collect Training Data
For each meme type, run a Bing image search and download the images as training data.
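
Each meme's images are stored in a folder named after its title, so titles containing characters such as `/` or `?` could produce invalid paths. The following is a minimal sketch of a stricter naming rule, not the one this notebook uses: `folder_name` is a hypothetical helper, and it would change labels for any title containing punctuation.

```python
import re


def folder_name(meme_name: str) -> str:
    """Turn a meme title into a lowercase, filesystem-safe folder name."""
    # Collapse every run of characters other than a-z and 0-9 into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", meme_name.lower())
    # Fall back to a placeholder if nothing usable is left.
    return slug.strip("_") or "unnamed"


print(folder_name("Who Killed Hannibal"))  # who_killed_hannibal
```

Note that this rule also replaces non-ASCII letters, which may or may not be desirable depending on the titles returned.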

In [4]:
print('Downloading training data. This might take a few minutes...')
meme_df['example_urls'] = meme_df.apply(get_training_data, axis=1)
meme_csv_path = OUTPUT_DIR+'memes.csv'
meme_df.to_csv(meme_csv_path)
print('Training data downloads completed.')
print('Saving df to ' + meme_csv_path)

In [5]:
# remove any downloaded files that cannot be opened as images
fns = get_image_files(TRAINING_DATA_PATH)
failed = verify_images(fns)
failed.map(Path.unlink);

---
## Build model
Now that the training data is downloaded, we can start building the actual ML model.
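
Since each image's label comes from the name of its parent folder, a quick per-folder count can reveal empty or sparse classes before training starts. This is a sketch using only the standard library; `images_per_class` and the `IMAGE_SUFFIXES` set are assumptions for illustration and may not match the exact file types fastai picks up.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; adjust to match the downloaded data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def images_per_class(root) -> Counter:
    """Count image files under root, keyed by the name of their parent folder."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            counts[path.parent.name] += 1
    return counts
```

For example, `images_per_class("meme_training_data").most_common()` would list the classes from most to fewest images.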

In [6]:
memes_data_block = DataBlock(
    blocks=(ImageBlock, CategoryBlock), 
    get_items=get_image_files, 
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=RandomResizedCrop(224, min_scale=0.5),
    batch_tfms=aug_transforms()
    )
dls = memes_data_block.dataloaders(TRAINING_DATA_PATH)

dls.valid.show_batch(max_n=8, nrows=1)

In [7]:
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)

learn.export()

### Export Model and Save Offline
Uses the Azure [file share service](https://docs.microsoft.com/en-us/azure/storage/files/storage-python-how-to-use-file-storage?tabs=python) for persistent storage.

In [19]:
!pip install azure-storage-file-share

from azure.core.exceptions import (
    ResourceExistsError,
    ResourceNotFoundError
)

from azure.storage.fileshare import (
    ShareServiceClient,
    ShareClient,
    ShareDirectoryClient,
    ShareFileClient
)

STORAGE_CONNECTION_STRING = user_secrets.get_secret("AZURE_STORAGE_CONNECTION_STRING")
STORAGE_SHARE_NAME = user_secrets.get_secret("STORAGE_SHARE_NAME")

try:
    # Create a ShareClient from a connection string
    share_client = ShareClient.from_connection_string(STORAGE_CONNECTION_STRING, STORAGE_SHARE_NAME)
    print("Creating share:", STORAGE_SHARE_NAME)
    share_client.create_share()
except ResourceExistsError:
    print("Share already exists, moving on to upload ->")

def upload_to_cloud_storage(filename):
    local_file_path = OUTPUT_DIR + filename
    try:
        # Read the file into memory; the context manager closes the handle
        with open(local_file_path, "rb") as source_file:
            data = source_file.read()

        # Create a ShareFileClient from a connection string
        file_client = ShareFileClient.from_connection_string(STORAGE_CONNECTION_STRING, STORAGE_SHARE_NAME, filename)

        print("Uploading to:", STORAGE_SHARE_NAME + "/" + filename)
        file_client.upload_file(data)

    except ResourceExistsError as ex:
        print("ResourceExistsError:", ex.message)

    except ResourceNotFoundError as ex:
        print("ResourceNotFoundError:", ex.message)

filename = 'export.pkl'
upload_to_cloud_storage(filename)
upload_to_cloud_storage('memes.csv')
print('Model uploaded')
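
For the consuming side, a minimal sketch of pulling the latest model back down might look like the following. The helper `download_to` is hypothetical; it assumes it is given a client object exposing `download_file()`, whose result exposes `readall()`, as `ShareFileClient` does in the `azure-storage-file-share` package.

```python
from pathlib import Path


def download_to(file_client, dest_path) -> Path:
    """Write the remote file behind file_client to dest_path and return that path."""
    dest = Path(dest_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # download_file() returns a stream object; readall() loads it fully into memory.
    downloader = file_client.download_file()
    dest.write_bytes(downloader.readall())
    return dest


# Hypothetical usage with the real SDK (requires azure-storage-file-share):
# from azure.storage.fileshare import ShareFileClient
# client = ShareFileClient.from_connection_string(
#     STORAGE_CONNECTION_STRING, STORAGE_SHARE_NAME, "export.pkl")
# download_to(client, "models/export.pkl")
```

Passing the client in, rather than building it inside the function, keeps the connection details in one place and makes the helper easy to exercise with a stand-in object.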

---
## Manually Testing the Model

In [16]:
# from fastai.vision.widgets import *
# btn_upload = widgets.FileUpload()
# btn_upload

In [17]:
# img = PILImage.create(btn_upload.data[-1])
# out_pl = widgets.Output()
# out_pl.clear_output()
# with out_pl: display(img.to_thumb(128,128))
    
# pred,pred_idx,probs = learn.predict(img)
# print(f'Prediction: {pred}; Probability: {probs[pred_idx]:.04f}')
# out_pl
