# Deep Learning Project

In your NLP session, you developed a model that predicts the topic of a tweet from the text of the posting. In this project, you will create a model that classifies the topic of the pictures used in tweets. The two models can later be combined to predict a tweet's topic from both the textual and the visual content of the posting.

You will first execute Step 1 to download the images from the URL of each image contained in the tweets. Human editors have already labeled these images according to whether or not they belong to the `Nature` topic. The label `1` means the image belongs to the Nature topic and `2` means otherwise.

In Step 2, you will create a classification algorithm. Divide the dataset into train and test datasets. You may use your train dataset to validate the accuracy of your model while tuning its hyperparameters. Once you have finished training your model, evaluate it on your test dataset and report its accuracy. You may use different accuracy metrics.
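
Since more than one accuracy metric may be reported, here is a minimal plain-Python sketch of accuracy, precision, recall and F1 for a binary labeling. The function name `binary_metrics` and the toy label lists are illustrative assumptions and are not part of the project files.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("y_true and y_pred must not be empty")
    # Count the confusion-matrix cells
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only (1 = Nature, 0 = anything else)
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision and recall are computed for the positive class only, so the same function works whether the negative label is encoded as `0` or as `2`.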

## Step 1: Download the images


In [4]:
import pandas as pd
import logging, os
from _logging import set_logging
from _metrics import display_metrics
from _pckle import save_pickle_object, load_pickle_object
from _utility import gl, get_perc

set_logging(logging)


In [5]:
def get_photo_urls():
    df_photo_urls = load_pickle_object(gl.pkl_df_photo_urls)
    if not df_photo_urls.empty:
        logging.info("Topic assigned Photo Urls retrieved from storage")
        return df_photo_urls

    logging.info("Topic assigned Photo Urls not currently stored")
    filepath = os.path.join("Files", "tweet_data.csv")
    logging.info(f"Read data from {filepath}")
    df_tweets_all = pd.read_csv(filepath)
    # remove tweets with no photo urls
    df_tweets = df_tweets_all[~df_tweets_all[gl.photoUrl].isnull()]
    # select only relevant columns
    df_photo_urls = df_tweets[[gl.photoUrl, gl.topic]].copy()
    # we are interested in whether the topic is or is not Nature, so add a 0/1 flag column
    df_photo_urls[gl.is_nature] = (df_photo_urls[gl.topic] == gl.nature).astype(int)
    save_pickle_object(df_photo_urls, gl.pkl_df_photo_urls)
    return df_photo_urls


In [6]:
df_photo_urls = get_photo_urls()
df_photo_urls.head(15)

2023-01-30 21:10:38,907 | INFO : Loading pickle file from: pickle\pkl_df_photo_urls.pkl
2023-01-30 21:10:38,962 | INFO : Topic assigned Photo Urls retrieved from storage


|    | photoUrl | topicName | IsNature |
|----|----------|-----------|----------|
| 0  | https://pbs.twimg.com/media/Dtx8SiIWkAImVsb.jpg | Business | 0 |
| 1  | https://pbs.twimg.com/media/Dtx8yTyW4AEciqP.jpg | Business | 0 |
| 6  | https://pbs.twimg.com/media/Dtx83JvX4AE48aw.jpg | Business | 0 |
| 15 | https://pbs.twimg.com/media/DtyCu-ZXgAUimt7.jpg | Business | 0 |
| 21 | https://pbs.twimg.com/media/DtyA7KSW4AA5deF.jpg | Business | 0 |
| 28 | https://pbs.twimg.com/media/DtyAKObXQAAPUSN.jpg | Business | 0 |
| 40 | https://pbs.twimg.com/media/Dts0d_vWkAARiHg.jpg | Animal | 0 |
| 42 | https://pbs.twimg.com/media/DtyDOo0U8AAEl_o.jpg | Animal | 0 |
| 44 | https://pbs.twimg.com/media/DtuIrMXWoAA5Wve.jpg | Nature | 1 |
| 47 | https://pbs.twimg.com/media/Dcnhi_FXcAAQPXk.jpg | Nature | 1 |
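
The cells shown here only collect the photo URLs. The download itself could look roughly like the sketch below, which uses only the standard library. The helper name `download_image`, the destination folder, and the timeout value are assumptions for illustration; the notebook's own helper modules may already provide something similar.

```python
import logging
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def download_image(url, dest_dir, timeout=10):
    """Save the image at `url` into `dest_dir`; return its path, or None on failure."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path segment of the URL as the local file name
    filename = Path(urlparse(url).path).name
    if not filename:
        logging.warning("Skipping %s: no file name in URL", url)
        return None
    target = dest_dir / filename
    if target.exists():  # already fetched on an earlier run
        return target
    try:
        with urlopen(url, timeout=timeout) as response:
            data = response.read()
    except (OSError, ValueError) as exc:  # URLError and HTTPError are OSError subclasses
        logging.warning("Skipping %s: %s", url, exc)
        return None
    target.write_bytes(data)
    return target
```

Applied to the frame above, a call such as `df_photo_urls[gl.photoUrl].map(lambda u: download_image(u, "images"))` would fetch every photo into a hypothetical `images` folder. Some URLs may no longer resolve, so the function logs a failure and returns `None` instead of raising.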


In [10]:
total_urls = len(df_photo_urls)
total_is_nature = int(df_photo_urls[gl.is_nature].sum())
perc_nature = get_perc(total_is_nature, total_urls)
logging.info(f"There are {total_urls} photos, of which {total_is_nature} ({perc_nature}%) are of Nature")

2023-01-30 21:12:34,957 | INFO : There are 277896 photos, of which 10615 (3.82%) are of Nature
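
With only 3.82% of the photos labeled Nature, plain accuracy can be misleading. The sketch below uses the two counts from the log line above to compute the accuracy of a trivial model that always predicts "not Nature", together with one common class-weighting heuristic (`n_samples / (n_classes * n_samples_in_class)`). The variable names and the choice of heuristic are assumptions, not something the project prescribes.

```python
total_photos = 277896   # from the log line above
nature_photos = 10615   # from the log line above
other_photos = total_photos - nature_photos

# Accuracy of a trivial model that always predicts "not Nature"
baseline_accuracy = other_photos / total_photos

# "Balanced" weights: the rarer class gets the proportionally larger weight
class_weight = {
    0: total_photos / (2 * other_photos),   # not Nature
    1: total_photos / (2 * nature_photos),  # Nature
}

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"class weights: {class_weight}")
```

A trained classifier should be judged against this baseline, not against 50%. A dictionary of this shape can be passed to training APIs that accept per-class weights, for example the `class_weight` argument of Keras `Model.fit`.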


## Step 2: Classifier

Develop a classifier for the two categories. Create the necessary folders for the test and train datasets. Either create your own model or take a pretrained model and adapt it (transfer learning). Make sure you incorporate regularization, callbacks, etc., and use data augmentation. Because the images may not be very distinct with respect to their categories, you may not reach the level of performance you achieved in your assignments.
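
One way to create the train and test folders is sketched below: a stratified split with pandas, followed by copying each downloaded image into a `split/class` subfolder. The column names `photoUrl` and `IsNature` come from the table in Step 1. The function name, the folder names `nature` and `other`, the 20% test fraction, and the assumption that images were saved under the last segment of their URL are all illustrative. The sketch also assumes the DataFrame index is unique.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse

import pandas as pd


def build_image_folders(df, src_dir, out_dir, label_col="IsNature",
                        url_col="photoUrl", test_frac=0.2, seed=42):
    """Copy downloaded images into out_dir/{train,test}/{nature,other}."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    # Stratified split: draw the same fraction from each class for the test set
    test_rows = df.groupby(label_col).sample(frac=test_frac, random_state=seed)
    test_idx = set(test_rows.index)
    copied = {"train": 0, "test": 0}
    for idx, row in df.iterrows():
        split = "test" if idx in test_idx else "train"
        class_name = "nature" if row[label_col] == 1 else "other"
        source = src_dir / Path(urlparse(row[url_col]).path).name
        if not source.exists():  # the download may have failed for this URL
            continue
        target_dir = out_dir / split / class_name
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target_dir / source.name)
        copied[split] += 1
    return copied
```

A layout with one subfolder per class is what directory-based loaders such as Keras `image_dataset_from_directory` expect. That loader orders class names alphanumerically by default, so with these folder names `nature` would map to index 0 unless `class_names` is passed explicitly. A validation set can be carved out of the `train` folder in the same way.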