# Deep Learning Project

You have developed a model in your NLP session to predict the topic of tweets by examining the text content of the postings. In this project, you will create a model to classify the topic of pictures used in the tweets to help predict the topic of the tweets later. Your model from the NLP session and this one later can be used to predict a tweet's topic by examining both textual and the visual content of the postings.

You will first execute Step 1 to pull the images from the corresponding URL address of each image contained in the tweets. These images are already labeled manually by human editors in terms of whether the images belong to `Business` topic or not. The label `1` means the image belongs to the business topic and `2` means otherwise.

In Step 2, you will create a classification algorithm. Please divide the dataset into train and test datasets. You may use your train dataset to validate the accuracy of your model when tuning up the hyperparametrs of your model. After finalizing training your model, test it on your test dataset and report its accuracy. You may use different accuracy metrics. 

## Step 1: Download the images


In [4]:
import pandas as pd
import logging, os, requests, shutil
from _logging import set_logging
from _pckle import save_pickle_object, load_pickle_object
from _utility import gl, get_perc

set_logging(logging)


In [5]:
def get_photo_urls():
    df_photo_urls = load_pickle_object(gl.pkl_df_photo_urls)
    if df_photo_urls is not None:
        logging.info("Topic assigned Photo Urls retrieved from storage")
        return df_photo_urls

    logging.info("Topic assigned Photo Urls not currently stored")
    filepath = os.path.join("Files", "tweet_data.csv")
    logging.info(f"Read data from {filepath}")
    df_tweets_all = pd.read_csv(filepath)
    # remove tweets with no photo urls
    df_tweets = df_tweets_all[~df_tweets_all[gl.photoUrl].isnull()]
    num_of_rows = len(df_tweets_all)
    logging.info(f"There are {num_of_rows} entries")
    df_tweets_all.drop_duplicates(subset=gl.photoUrl, inplace=True)
    num_of_rows_after = len(df_tweets_all)
    logging.info(f"Number of duplicate photo rows deleted = {num_of_rows - num_of_rows_after}")
    # select only relevant columns
    df_photo_urls = df_tweets[[gl.photoUrl, gl.topic]].copy()
    # we are interested if the topic is or is not Business, so add a flag column
    df_topic = pd.get_dummies(df_photo_urls[gl.topic], columns=[gl.topic])
    df_photo_urls[gl.is_business] = df_topic[gl.business]
    save_pickle_object(df_photo_urls, gl.pkl_df_photo_urls)
    return df_photo_urls


In [6]:
df_photo_urls = get_photo_urls()
total_urls = len(df_photo_urls)
total_is_business = sum(df_photo_urls[gl.is_business])
perc_business = get_perc(total_is_business, total_urls)
logging.info(f"There are {total_urls} photos, of which {total_is_business} ({perc_business}%) are of Business")

#df_photo_urls.head(15)

2023-02-01 05:52:05,355 | INFO : Pickle file in: pickle\pkl_df_photo_urls.pkl
2023-02-01 05:52:05,356 | INFO : Topic assigned Photo Urls not currently stored
2023-02-01 05:52:05,357 | INFO : Read data from Files\tweet_data.csv
2023-02-01 05:52:09,599 | INFO : There are 785916 entries
2023-02-01 05:52:10,090 | INFO : Number of duplicate photo rows deleted = 530830
2023-02-01 05:52:10,355 | INFO : Saving pickle file from: pickle\pkl_df_photo_urls.pkl
2023-02-01 05:52:10,567 | INFO : There are 277896 photos, of which 20134 (7.25%) are of Business


In [7]:
def create_folder(folder):
    if os.path.exists(folder) == False:
        os.makedirs(folder)

Some image urls may no longer exist. So the best option is to download them to a local folder.

In [8]:
def get_url_file_path(row, i, business_folder, other_folder):
    url = row[0]
    url_parts = url.split(".")
    index = len(url_parts) - 1
    ext = url_parts[index]
    file_name = f"{i}.{ext}"
    folder = business_folder if row[1] == "Business" else other_folder
    file_path = os.path.join(folder, file_name)
    return url, file_path

In [9]:
def download_images(df_photo_urls):
    business_folder = os.path.join("Images", "Business")
    if os.path.exists(business_folder):
        logging.info("Images already copied across")
        return
        
    other_folder = os.path.join("Images", "Other")
    create_folder(business_folder)
    create_folder(other_folder)
    np_photo_urls = df_photo_urls.to_numpy()
    invalid_url_cnt = 0
    exception_cnt = 0
    successful_cnt = 0
    logging.info("------- Start copying the images")
    for i, row in enumerate(np_photo_urls, start=1):
        if i % 1000 == 0:
            logging.info(f"----Copying the {i} image")
        url, file_path = get_url_file_path(row, i, business_folder, other_folder)
        try:
            res = requests.get(url, stream = True)
            if res.status_code == 200:
                with open(file_path,'wb') as f:
                    shutil.copyfileobj(res.raw, f)
                successful_cnt += 1
            else:
                invalid_url_cnt += 1
                #logging.info(f"!! {i} Image from url {url} cannot be retreived")
        except:
            exception_cnt += 1
            logging.info(f"!! {i} EXCEPTION: Image from url {url} cannot be retreived")
            
    logging.info("** All Images still available downloaded")
    logging.info(f"{successful_cnt} ({get_perc(successful_cnt, i)}%) images were successfully downloaded")
    logging.info(f"{invalid_url_cnt} ({get_perc(invalid_url_cnt, i)}%) image urls are invalid")
    logging.info(f"{exception_cnt} ({get_perc(exception_cnt, i)}%) attempted image downloads caused exceptions")



In [10]:
download_images(df_photo_urls)

2023-02-01 05:53:04,241 | INFO : ------- Start copying the images
2023-02-01 05:54:47,495 | INFO : ----Copying the 1000 image
2023-02-01 05:56:18,630 | INFO : ----Copying the 2000 image
2023-02-01 05:59:03,021 | INFO : ----Copying the 3000 image
2023-02-01 06:00:30,646 | INFO : ----Copying the 4000 image
2023-02-01 06:01:56,006 | INFO : ----Copying the 5000 image
2023-02-01 06:03:35,439 | INFO : ----Copying the 6000 image
2023-02-01 06:05:09,023 | INFO : ----Copying the 7000 image
2023-02-01 06:06:38,211 | INFO : ----Copying the 8000 image
2023-02-01 06:08:05,667 | INFO : ----Copying the 9000 image
2023-02-01 06:13:36,347 | INFO : ----Copying the 10000 image
2023-02-01 06:17:14,128 | INFO : ----Copying the 11000 image
2023-02-01 06:18:39,789 | INFO : ----Copying the 12000 image
2023-02-01 06:22:22,278 | INFO : ----Copying the 13000 image
2023-02-01 06:27:28,717 | INFO : ----Copying the 14000 image
2023-02-01 06:31:26,233 | INFO : ----Copying the 15000 image
2023-02-01 06:33:39,419 | IN

: 

Had to interupt the download process as the program appeared to stop for no apparent reason. We have enough images to train the data and create a classifier model. This will be done in the code ipynb file; P2_TwitterProject.ipynb