# URL Flood Presence Dataset Creation Notebook
#### Constructs Dataset, Train/Dev/Test Data Splits and corresponding folders containing images.


**IMPORTANT NOTE:**

Some of the tweet data we use is produced by rehydrating tweets from tweet IDs. Because of this, some tweets may have been deleted since the [original work](https://github.com/dyllew/towards-automated-crowdsourced-crisis-reporting), which created the first version of the URL Flood Presence dataset and its Train/Dev/Test splits. If you would like access to the tweets used to create the original URL Flood Presence dataset, as used in the [original work](https://github.com/dyllew/towards-automated-crowdsourced-crisis-reporting), please email [url_googleai@mit.edu](mailto:url_googleai@mit.edu) with your request for the URL Flood Presence Dataset and your plans for using it. We can only permit use for non-commercial research in accordance with [Twitter's Content Redistribution Policies](https://developer.twitter.com/en/developer-terms/agreement-and-policy), and you must comply with the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy before receiving the dataset.

#### Install required packages

In [None]:
!pipenv install

In [None]:
import os
from os.path import join, splitext, exists
import numpy as np
import pandas as pd
import json
import shutil
from copy import copy
import requests
import wget
from zipfile import ZipFile
import tarfile
from pathlib import Path
from git.repo.base import Repo

from sklearn.model_selection import train_test_split

EVENT_NAME_COL_NAME, CLASS_LABEL_COL_NAME = 'event_name', 'class_label'
FLOOD_LABEL_NAME, NOT_FLOOD_LABEL_NAME = 'flood', 'not_flood'
LABELS_LIST = [FLOOD_LABEL_NAME, NOT_FLOOD_LABEL_NAME]
IMAGE_PATH_COL_NAME = 'image_path'

def make_dir(dir_path):
    try:
        os.mkdir(dir_path)
    except FileExistsError:
        print('{} directory already exists.'.format(dir_path))

def print_label_totals(df, df_name):
    # Allows us to see number of data points for each class (Flood/Not Flood) from each event source
    event_names = df[EVENT_NAME_COL_NAME].unique()
    labels = df[CLASS_LABEL_COL_NAME].unique()
    total_label_counts = {label:0 for label in labels}
    for event_name in event_names:
        event_label_counts = {label:0 for label in labels}
        for label in labels:
            mask = (df[EVENT_NAME_COL_NAME] == event_name) & (df[CLASS_LABEL_COL_NAME] == label) & (df[IMAGE_PATH_COL_NAME].apply(exists))
            num_label = len(df[mask])
            event_label_counts[label] = num_label
            total_label_counts[label] += num_label
        print(f'{event_name}: ')
        total_event_imgs = len(df[df[EVENT_NAME_COL_NAME] == event_name])
        for label in labels:
            print(f'    {label}: {event_label_counts[label]} images')
        print(f'    Total: {total_event_imgs} images')
    df_length = len(df)
    print(f'{df_name}:' )
    for label in labels:
        print(f'    {label}: {total_label_counts[label]} images')
    print(f'    Total: {df_length} images')
    

def add_imgs_to_df(df, img_name_set, src_dir, data_dir_path, event_name, label):
    src_dir_img_names = os.listdir(src_dir)
    cant_find_count = 0
    for img_name in img_name_set:
        i = len(df)
        try:
            original_filename = list(filter(lambda x: splitext(x)[0] == img_name, src_dir_img_names))[0]
            full_img_path = join(data_dir_path, original_filename)
            df.loc[i] = [event_name, original_filename, full_img_path, label]
        except IndexError:
            # If file does not exist, we don't add it to the dataframe
            cant_find_count += 1
    print("Could not find {} {} images from the {} dataset".format(cant_find_count, label, event_name))

In [None]:
# Sets path to directory for URL Flood presence dataset and Train/Dev/Test splits
flood_dataset_path = './Flood Dataset'
make_dir(flood_dataset_path)
final_splits_dir_path = join(flood_dataset_path, 'url_flood_presence')
make_dir(final_splits_dir_path)

In [None]:
splits = [
    'train',
    'dev',
    'test'
]

### We are going to aggregate data from three sources to make this dataset
---

### 1. Consolidated Crisis Dataset [1, 2, 3] - Disaster Types

#### We construct the labeled directories Flood vs. Not Flood derived from the Consolidated Crisis Image data set
The data for the original task was disaster type classification. We use the **"flood"** class from the original dataset for the positive **"flood"** class in our case and consolidate all other classes from the original dataset into the negative **"not flood"** class.
Link to this data is [here](https://crisisnlp.qcri.org/crisis-image-datasets-asonam20#).

We start with the Train/Dev/Test splits provided by the authors, and we consolidate the labels into **"flood"** and **"not_flood"** categories to be used to construct the labeled image folders for our case. We begin by constructing a folder of all the images from the original dataframe.
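
The label consolidation described above can be sketched on a plain list of labels. This is a minimal illustration, not the notebook's own cell: the helper name `to_flood_presence_label` and the example labels are assumptions made here, and only the `flood` / `not_flood` names come from this notebook.

```python
FLOOD_LABEL_NAME, NOT_FLOOD_LABEL_NAME = 'flood', 'not_flood'

def to_flood_presence_label(original_label):
    """Map a disaster-type label to the binary flood / not_flood scheme."""
    return FLOOD_LABEL_NAME if original_label == FLOOD_LABEL_NAME else NOT_FLOOD_LABEL_NAME

# Illustrative disaster-type labels (assumed for this example, not read from the dataset)
original_labels = ['flood', 'fire', 'earthquake', 'flood', 'not_disaster']
binary_labels = [to_flood_presence_label(label) for label in original_labels]
print(binary_labels)
```

Every label other than `flood` collapses into `not_flood`, so the negative class is expected to be a mix of other disaster types and non-disaster images.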

#### (a) Download and Extract tarfile containing images and Train/Dev/Test splits for Disaster Types

In [None]:
disaster_types_tarfile_link = "https://crisisnlp.qcri.org/data/crisis_image_datasets_benchmarks/data_disaster_types.tar.gz"
disaster_types_data_tar_filename = disaster_types_tarfile_link.split('/')[-1]
p = Path(disaster_types_data_tar_filename)
extensions = "".join(p.suffixes)
disaster_types_data_name = str(p).replace(extensions, "")
full_disaster_types_data_dir_path = join(flood_dataset_path, disaster_types_data_name)

In [None]:
print(f'Downloading & Extracting {disaster_types_data_tar_filename} as directory {full_disaster_types_data_dir_path} ...')
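response = requests.get(disaster_types_tarfile_link, stream = True)
response.raise_for_status()  # Fail early on HTTP errors instead of trying to extract an error page
# Stream the gzipped tar archive straight from the response and extract it
with tarfile.open(fileobj = response.raw, mode = "r|gz") as file:
    file.extractall(path = flood_dataset_path)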
print(f'Extracted {disaster_types_data_tar_filename} as directory {full_disaster_types_data_dir_path}')

#### (b) Relabel the Train/Dev/Test splits using the labeling scheme described above

In [None]:
disaster_types_splits_dict = {
   split: pd.read_csv(join(full_disaster_types_data_dir_path, f'consolidated_disaster_types_{split}_final.tsv'), sep='\t') for split in splits
}

In [None]:
data_columns = disaster_types_splits_dict[splits[0]].columns
for split in splits:
    df = disaster_types_splits_dict[split]
    df[CLASS_LABEL_COL_NAME] = df[CLASS_LABEL_COL_NAME].apply(lambda x: FLOOD_LABEL_NAME if x == FLOOD_LABEL_NAME else NOT_FLOOD_LABEL_NAME)

In [None]:
def remove_bad_string_from_path(df, path_col_name, bad_string, path_idx):
    new_df = df.copy()
    def path_func(path, bad_string, path_idx):
        split_path = path.split('/')
        if split_path[path_idx] == bad_string:
            mod_path = "/".join(split_path[:path_idx] + split_path[path_idx+1:])
            return mod_path
        return path
    new_df[path_col_name] = new_df[path_col_name].apply(lambda path: path_func(path, bad_string, path_idx))
    return new_df

#### (c) Create labeled flood/not_flood CSVs from the disaster types splits
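
The path clean-up applied in this step drops one unwanted component from each image path when it appears at a given position. A minimal sketch on single strings follows; the helper name `drop_path_component` and the example paths are hypothetical, and the bounds check is an addition that the notebook's helper does not have.

```python
def drop_path_component(path, bad_string, path_idx):
    """Remove the component at position path_idx if it equals bad_string."""
    parts = path.split('/')
    # Bounds check added for this sketch so short paths are returned unchanged
    if path_idx < len(parts) and parts[path_idx] == bad_string:
        return '/'.join(parts[:path_idx] + parts[path_idx + 1:])
    return path

# Hypothetical paths, chosen only to show both outcomes
cleaned = drop_path_component('data/aidr_disaster_types/flood/img_001.jpg', 'aidr_disaster_types', 1)
unchanged = drop_path_component('data/flood/img_002.jpg', 'aidr_disaster_types', 1)
print(cleaned)
print(unchanged)
```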

In [None]:
BAD_STRING = 'aidr_disaster_types'
for split in splits:
    split_df = disaster_types_splits_dict[split]
    split_df = remove_bad_string_from_path(split_df, IMAGE_PATH_COL_NAME, BAD_STRING, 1)
    print(f'disaster types {split} split has {len(split_df)} total images')
    abs_disaster_types_data_dir_path = "/".join(full_disaster_types_data_dir_path.split("/")[1:])
    split_df[IMAGE_PATH_COL_NAME] = split_df[IMAGE_PATH_COL_NAME].apply(lambda path: join(abs_disaster_types_data_dir_path, path))
    # Subset to files which actually exist
    split_df = split_df[split_df[IMAGE_PATH_COL_NAME].apply(exists)]
    # Save splits
    split_csv_path = join(flood_dataset_path, f'consolidated_disaster_types_{split}.csv')
    split_df.to_csv(split_csv_path, index = False)
    print_label_totals(split_df, f'Full {split}')

### 2. Central European Floods 2013 [4]

#### We construct the dataframe with labels "flood" and "not_flood" from the EU flood data set which consists of flood-relevant and flood-irrelevant images from central European floods of May & June 2013
Link to this data repo is [here](https://github.com/cvjena/eu-flood-dataset)

#### Clone the repo

In [None]:
eu_flood_repo_url = "https://github.com/cvjena/eu-flood-dataset"
eu_flood_data_path = join(flood_dataset_path, 'eu_flood_data')
make_dir(eu_flood_data_path)

In [None]:
Repo.clone_from(eu_flood_repo_url, eu_flood_data_path)

#### Separate photos for labeling
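The set differences used in this step suggest that an image can appear in more than one relevance list, so anything marked flood-relevant is kept out of the negative class. A minimal sketch of that grouping, using hypothetical image IDs rather than the real relevance files:

```python
# Hypothetical image IDs; the real lists are read from the files in the relevance folder
flooding = {'img_a', 'img_b'}
depth = {'img_b', 'img_c'}
pollution = {'img_d'}
irrelevant = {'img_e'}

flood = set(flooding)
# Depth- or pollution-relevant images count as not_flood only if they are not flood-relevant
not_flood = irrelevant | (depth - flooding) | (pollution - flooding)

print(sorted(flood))
print(sorted(not_flood))
```

Here `img_b` is in both the flooding and depth lists, and it ends up only in the `flood` group.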

In [None]:
relevance_path = join(eu_flood_data_path, 'relevance')
relevance_dict = {}
relevance_files_list = os.listdir(relevance_path)
for relevance_filename in relevance_files_list:
    label_name = os.path.splitext(relevance_filename)[0]
    path_data_filename = join(relevance_path, relevance_filename)
    with open(path_data_filename, 'r') as f:
        relevance_dict[label_name] = f.read().splitlines()

In [None]:
flood_relevant_data_filenames = set(relevance_dict['flooding'])
depth_relevant_data_filenames = set(relevance_dict['depth'])
pollution_relevant_data_filenames = set(relevance_dict['pollution'])
irrelevant_data_filenames = set(relevance_dict['irrelevant'])

In [None]:
not_flood_data_filenames = set()
not_flood_data_filenames.update(irrelevant_data_filenames)
not_flood_data_filenames.update(depth_relevant_data_filenames - flood_relevant_data_filenames)
not_flood_data_filenames.update(pollution_relevant_data_filenames - flood_relevant_data_filenames)

#### Download imgs_small

In [None]:
eu_img_small_url = "https://archive.org/download/european-flood-2013/european-flood-2013_imgs_small.zip"
eu_img_dir = join(eu_flood_data_path, 'imgs_small')
filename = eu_img_small_url.split('/')[-1]

In [None]:
import sys
def bar_progress(current, total, width=80):
    progress_message = "Downloading: %d%% [%d / %d] bytes" % (current / total * 100, current, total)
    # Don't use print() as it will print in new line every time.
    sys.stdout.write("\r" + progress_message)
    sys.stdout.flush()

In [None]:
print(f'Downloading {filename}')
wget.download(eu_img_small_url, bar=bar_progress)

#### Extract imgs_small

In [None]:
zfile = ZipFile(filename)
zfile.extractall(eu_flood_data_path)
zfile.close()
print(f'Extracted {filename} as directory {eu_img_dir}')

#### Label images using "flood" & "not_flood" labels

In [None]:
eu_data_df = pd.DataFrame(columns=data_columns)
eu_data_abs_dir_path = '/'.join(eu_img_dir.split('/')[1:])
eu_event_name = 'central_eu13'

In [None]:
add_imgs_to_df(eu_data_df, flood_relevant_data_filenames, eu_img_dir, eu_data_abs_dir_path, eu_event_name, FLOOD_LABEL_NAME)
add_imgs_to_df(eu_data_df, not_flood_data_filenames, eu_img_dir, eu_data_abs_dir_path, eu_event_name, NOT_FLOOD_LABEL_NAME)

num_flood_imgs = len(eu_data_df[eu_data_df[CLASS_LABEL_COL_NAME] == FLOOD_LABEL_NAME])
num_not_flood_imgs = len(eu_data_df[eu_data_df[CLASS_LABEL_COL_NAME] == NOT_FLOOD_LABEL_NAME])

eu_csv_path = join(flood_dataset_path, 'eu_labeled_flood_data.csv')
eu_data_df.to_csv(eu_csv_path, index = False)

In [None]:
print_label_totals(eu_data_df, eu_event_name)

### 3. Harz 2017 & Rhine 2018 [5]

#### We construct the labeled image folders flood vs. not_flood for the Twitter scrape of flood-related keywords of the Harz17 and Rhine18 flood events in Germany. [5] 
Link to the data repo is [here](https://github.com/cvjena/twitter-flood-dataset)

**NOTE:**

Since this data source is built by rehydrating tweets from tweet IDs, some of the original 704 tweets for Harz 2017 and 1848 tweets for Rhine 2018 may have been deleted and thus can no longer be added to the URL Flood Presence Dataset. Please see the note at the top of the notebook for directions on obtaining the original URL Flood Presence Dataset used in the [original work](https://github.com/dyllew/towards-automated-crowdsourced-crisis-reporting).

#### Clone repo

In [None]:
twitter_flood_repo_url = "https://github.com/cvjena/twitter-flood-dataset"
path_twitter_flood_data = join(flood_dataset_path, 'twitter_flood_data')
path_to_download_file = join(path_twitter_flood_data, 'download_images.py')
harz17_dir = join(path_twitter_flood_data, 'harz17')
rhine18_dir = join(path_twitter_flood_data, 'rhine18')
make_dir(path_twitter_flood_data)

In [None]:
Repo.clone_from(twitter_flood_repo_url, path_twitter_flood_data)

#### It is a good idea to run the cell below a few times to ensure you get the maximum number of currently existing tweets from the original Harz17 & Rhine18 datasets.

In [None]:
%run '{path_to_download_file}' '{harz17_dir}' '{rhine18_dir}'

In [None]:
for event_name in ['harz17', 'rhine18']:
    data_dir_files = os.listdir(path_twitter_flood_data)
    data_img_dir = join(path_twitter_flood_data, event_name)
    data_filenames = os.listdir(data_img_dir)
    data_json_path = join(path_twitter_flood_data, '{}.json'.format(event_name))
    img_dict_list = []
    event_data_df = pd.DataFrame(columns=data_columns)
    with open(data_json_path) as data_json_file:
        json_data = json.load(data_json_file)
        for img_name in json_data.keys():
            img_dict = copy(json_data[img_name])
            img_dict['image_filename'] = img_name
            img_dict_list.append(img_dict)
    flood_filenames = set(map(lambda img_dict: img_dict['image_filename'], filter(lambda img_dict: img_dict['RelFlooding'], img_dict_list)))
    not_flood_filenames = set(map(lambda img_dict: img_dict['image_filename'], filter(lambda img_dict: not img_dict['RelFlooding'], img_dict_list)))
    event_data_dir_path = '/'.join(data_img_dir.split('/')[1:])
    add_imgs_to_df(event_data_df, flood_filenames, data_img_dir, event_data_dir_path, event_name, FLOOD_LABEL_NAME)
    add_imgs_to_df(event_data_df, not_flood_filenames, data_img_dir, event_data_dir_path, event_name, NOT_FLOOD_LABEL_NAME)
    event_csv_filename = '{}_labeled_flood_data.csv'.format(event_name)
    event_csv_path = join(flood_dataset_path, event_csv_filename)
    event_data_df.to_csv(event_csv_path, index = False)
    print_label_totals(event_data_df, event_name)

## Construct Final Train/Dev/Test Splits

Now that the data from each source is prepared in separate folders, we construct our own Train/Dev/Test splits. We start from the consolidated train/dev/test splits, apply scikit-learn's `train_test_split` function to the EU and Twitter data, and append the resulting entries to the corresponding consolidated splits.
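
Because `train_test_split` produces only two parts per call, a three-way split is done in two stages, and the dev fraction passed to the second call has to be rescaled relative to what remains after the test set is removed. The arithmetic can be sketched as follows, using the 0.7 / 0.1 / 0.2 ratios set later in this notebook; the helper name and the sample count of 1000 are hypothetical, and the counts scikit-learn actually produces may differ by one because of rounding.

```python
train_ratio, dev_ratio, test_ratio = 0.7, 0.1, 0.2

def second_stage_dev_fraction(train_ratio, dev_ratio):
    """Fraction of the non-test data that should go to the dev split."""
    return dev_ratio / (train_ratio + dev_ratio)

dev_fraction = second_stage_dev_fraction(train_ratio, dev_ratio)

n_total = 1000  # hypothetical number of images
n_test = round(n_total * test_ratio)      # first stage: hold out the test set
n_remaining = n_total - n_test
n_dev = round(n_remaining * dev_fraction)  # second stage: carve dev out of the remainder
n_train = n_remaining - n_dev

print(f'dev fraction of remainder: {dev_fraction:.3f}')
print(f'train={n_train}, dev={n_dev}, test={n_test}')
```

Passing `dev_ratio` directly to the second call would instead give a dev split of 10% of the remaining 80%, which is 8% of the full data rather than the intended 10%.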

In [None]:
central_eu13_df = pd.read_csv(join(flood_dataset_path, 'eu_labeled_flood_data.csv'))
harz17_df = pd.read_csv(join(flood_dataset_path, 'harz17_labeled_flood_data.csv'))
rhine18_df = pd.read_csv(join(flood_dataset_path, 'rhine18_labeled_flood_data.csv'))

#### Set seed for reproducible randomized splitting

In [None]:
seed = 1
train_ratio, dev_ratio, test_ratio = 0.7, 0.1, 0.2

In [None]:
other_flood_data = pd.concat([central_eu13_df, harz17_df, rhine18_df], ignore_index=True)
X = other_flood_data.drop(columns=[CLASS_LABEL_COL_NAME])
y = other_flood_data[CLASS_LABEL_COL_NAME]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_ratio, random_state = seed)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_train, y_train, test_size = dev_ratio/(train_ratio + dev_ratio), random_state = seed
)
created_splits_data = {
    'train': (X_train, y_train),
    'dev': (X_dev, y_dev),
    'test': (X_test, y_test)
}

In [None]:
other_flood_data_dfs_dict = {}
for split in splits:
    (X_split, y_split) = created_splits_data[split]
    split_df = X_split.copy()  # Copy so that adding the label column does not write into a slice of the original data
    split_df[CLASS_LABEL_COL_NAME] = y_split
    other_flood_data_dfs_dict[split] = split_df


In [None]:
consolidated_data_dfs_dict = {
    split: pd.read_csv(join(flood_dataset_path, f'consolidated_disaster_types_{split}.csv')) for split in splits
}

In [None]:
final_split_dfs_dict = {
    split:pd.concat([consolidated_data_dfs_dict[split], other_flood_data_dfs_dict[split]], ignore_index = True) for split in splits
}

#### Create Train/Dev/Test splits (as .csv) using full data sources and create folders containing images in their corresponding Train/Dev/Test Folders
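
Images from different sources can share a filename, so the copy step in this section renames collisions by appending a counter before the extension. A minimal standalone sketch of that renaming logic follows; the helper name `unique_image_name` and the example filenames are hypothetical.

```python
from os.path import splitext

def unique_image_name(image_name, seen):
    """Return image_name, or a suffixed variant if the name was already used.

    `seen` maps each name handed out so far to how many times it was requested.
    """
    if image_name in seen:
        seen[image_name] += 1
        stem, ext = splitext(image_name)
        image_name = f'{stem}_{seen[image_name]}{ext}'
    # Register the name that is actually used (original or suffixed)
    seen[image_name] = 1
    return image_name

seen = {}
incoming = ['a.jpg', 'b.jpg', 'a.jpg', 'a.jpg']  # hypothetical filenames
renamed = [unique_image_name(name, seen) for name in incoming]
print(renamed)
```

The first occurrence keeps its name and later ones are numbered from 2, so no copied file overwrites another within the same pass.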

In [None]:
for split, split_df in final_split_dfs_dict.items():
    filenames = {} # To help with duplicate image filenames
    split_name = 'url_flood_presence_{}'.format(split)
    split_path = join(final_splits_dir_path, split)
    make_dir(split_path)
    labels = split_df[CLASS_LABEL_COL_NAME].unique()
    for label in labels:
        label_path = join(split_path, label)
        make_dir(label_path)
        label_df = split_df[split_df[CLASS_LABEL_COL_NAME] == label]
        for index, row in label_df.iterrows():
            src_event = row[EVENT_NAME_COL_NAME]
            abs_img_path = row[IMAGE_PATH_COL_NAME]
            image_name = abs_img_path.split('/')[-1]
            if image_name in filenames: # Found duplicate image filename
                filenames[image_name] += 1
                filename, ext = splitext(image_name)
                image_name = filename + '_' + str(filenames[image_name]) + ext
                filenames[image_name] = 1
            else:
                filenames[image_name] = 1
            final_img_path = join(label_path, image_name)
            abs_final_img_path = "/".join(final_img_path.split('/')[1:])
            split_df.loc[index] = [row[EVENT_NAME_COL_NAME], image_name, abs_final_img_path, row[CLASS_LABEL_COL_NAME]]
            shutil.copy(abs_img_path, final_img_path)
    print_label_totals(split_df, split_name)
    split_filename = split_name + '.csv'
    split_df.to_csv(join(final_splits_dir_path, split_filename), index = False)
print()
print(f'Final URL Flood Presence Train/Dev/Test splits and corresponding image folders can be found in {final_splits_dir_path}')

## References
---
[1] ***Firoj Alam, Ferda Ofli, Muhammad Imran, Tanvirul Alam, Umair Qazi, [Deep Learning Benchmarks and Datasets for Social Media Image Classification for Disaster Response](https://arxiv.org/pdf/2011.08916.pdf), In 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2020.***
```
@inproceedings{crisisdataset2020-images,
Author = {Firoj Alam and Ferda Ofli and Muhammad Imran and Tanvirul Alam and Umair Qazi},
Keywords = {Social Media, Crisis Computing, Tweet Text Classification, Disaster Response},
Title = {Deep Learning Benchmarks and Datasets for Social Media Image Classification for Disaster Response},
Publisher = {IEEE},
Booktitle = {2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)},
Year = {2020}
}
```

[2] ***Firoj Alam, Ferda Ofli, and Muhammad Imran, CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. In Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM), 2018, Stanford, California, USA.***
```
@InProceedings{crisismmd,
  author = {Alam, Firoj and Ofli, Ferda and Imran, Muhammad},
  title = { CrisisMMD: Multimodal Twitter Datasets from Natural Disasters},
  booktitle = {Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM)},
  year = {2018},
  month = {June},
  date = {23-28},
  location = {USA}}
```
[3] ***Hussein Mouzannar, Yara Rizk, and Mariette Awad, Damage Identification in Social Media Posts using Multimodal Deep Learning, In Proc. of ISCRAM, May 2018, pp. 529–543***
```
 @inproceedings{multimodal-deep-learning, title={Damage Identification in Social Media Posts using Multimodal Deep Learning}, booktitle={ISCRAM 2018 Conference Proceedings – 15th International Conference on Information Systems for Crisis Response and Management}, author={Mouzannar, Hussein and Yara Rizk and Awad, Mariette}, year={2018}, pages={529--543}} 
```

[4] ***Björn Barz, Kai Schröter, Moritz Münch, Bin Yang, Andrea Unger, Doris Dransch, and Joachim Denzler.
"Enhancing Flood Impact Analysis using Interactive Retrieval of Social Media Images."
Archives of Data Science, Series A, 5.1, 2019.***
```
@article{flood-impact-in-european-context,
  author    = {Bj{\"{o}}rn Barz and
               Kai Schr{\"{o}}ter and
               Moritz M{\"{u}}nch and
               Bin Yang and
               Andrea Unger and
               Doris Dransch and
               Joachim Denzler},
  title     = {Enhancing Flood Impact Analysis using Interactive Retrieval of Social
               Media Images},
  journal   = {CoRR},
  volume    = {abs/1908.03361},
  year      = {2019},
  url       = {http://arxiv.org/abs/1908.03361},
  eprinttype = {arXiv},
  eprint    = {1908.03361},
  timestamp = {Mon, 19 Aug 2019 13:21:03 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1908-03361.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

[5] ***Björn Barz, Kai Schröter, Ann-Christin Kra, and Joachim Denzler.
"Finding Relevant Flood Images on Twitter using Content-based Filters."
ICPR 2020 Workshop on Machine Learning Advances Environmental Science.***
```
@article{flood-impact-euro-context-twitter,
  author    = {Bj{\"{o}}rn Barz and
               Kai Schr{\"{o}}ter and
               Ann{-}Christin Kra and
               Joachim Denzler},
  title     = {Finding Relevant Flood Images on Twitter using Content-based Filters},
  journal   = {CoRR},
  volume    = {abs/2011.05756},
  year      = {2020},
  url       = {https://arxiv.org/abs/2011.05756},
  eprinttype = {arXiv},
  eprint    = {2011.05756},
  timestamp = {Thu, 12 Nov 2020 15:14:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2011-05756.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
