# Generating Training Data for Predicting Semantic Changes in Tweets in Response to Natural Disasters

In this part of the project, we take the necessary steps to create our training data. This includes loading the tweets and disaster data, rasterizing the tweets into a set of pixels, and calculating the change in semantic attributes in response to natural disaster events.

Our data is structured around two main components: Tweets and Disasters. Tweets are geolocated social media posts that contain semantic attributes like 'aggressiveness', 'sentiment', and 'stance'. Disasters are natural disaster events, each associated with a specific geolocation, date and disaster type.

To facilitate the analysis, we use a custom module, geoprocessing, which we developed for this project. This module provides two classes: the Disaster class and the TweetRaster class. The Disaster class represents a disaster instance and the TweetRaster class enables us to rasterize the Tweet data into a set of pixels, allowing us to handle the data in a spatially aggregated manner.

### Rasterizing Tweets and Calculating Semantic Changes
The rasterization process groups tweets based on their geographic location. Each group of tweets is represented by a pixel. This allows us to aggregate the semantic attributes of tweets by pixel and monitor their change in response to natural disasters.
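
To make the grouping concrete, here is a minimal sketch of how coordinates could be binned into pixels on a regular grid. It is not the TweetRaster implementation: the helper names `pixel_index` and `rasterize` are hypothetical, the tweets are made-up values, and we assume a plain longitude/latitude grid whose cell size equals the resolution in degrees.

```python
import math
from collections import defaultdict


def pixel_index(longitude, latitude, resolution):
    """Map a coordinate to the (column, row) index of its grid cell."""
    col = math.floor((longitude + 180.0) / resolution)
    row = math.floor((latitude + 90.0) / resolution)
    return col, row


def rasterize(tweets, resolution):
    """Group tweets (dicts with 'lng' and 'lat' keys) by pixel index."""
    pixels = defaultdict(list)
    for tweet in tweets:
        index = pixel_index(tweet['lng'], tweet['lat'], resolution)
        pixels[index].append(tweet)
    return dict(pixels)


# Toy tweets (made-up values, for illustration only)
tweets = [
    {'lng': 13.4, 'lat': 52.5, 'sentiment': 0.2},
    {'lng': 13.9, 'lat': 53.1, 'sentiment': -0.4},
    {'lng': 2.35, 'lat': 48.85, 'sentiment': 0.7},
]

# With a 2-degree grid, the first two tweets fall into the same pixel
pixels = rasterize(tweets, resolution=2)

# Aggregate a semantic attribute per pixel, here the mean sentiment
mean_sentiment = {
    index: sum(t['sentiment'] for t in group) / len(group)
    for index, group in pixels.items()
}
```

Once tweets are keyed by pixel, any per-pixel aggregate (mean sentiment, category counts) can be computed independently for each pixel.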

Following rasterization, we calculate semantic changes in response to natural disasters. The create_training_data method iterates over all disaster events and calls the compute_change method on each of them. For each pixel with at least 30 tweets both before and after the event, this step tests whether the semantic attributes of the tweets change significantly: a t-test is used for the continuous sentiment values, and a chi-square test of homogeneity for the discrete aggressiveness and stance values. If the change is significant, it is further classified as positive or negative.
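
The per-pixel tests can be sketched with SciPy as follows. This is an illustration under stated assumptions, not the code of compute_change: the function names are hypothetical, the 0.05 significance level is our assumption (the text does not give one), and we use Welch's variant of the t-test. For the categorical attributes the sketch only detects a difference in distribution, since the text does not specify how its direction is determined.

```python
import numpy as np
from scipy import stats

MIN_TWEETS = 30  # minimum tweet count per period, as stated in the text
ALPHA = 0.05     # assumed significance level (not given in the text)


def sentiment_change(before, after, alpha=ALPHA):
    """Return 1 (increase), -1 (decrease), 0 (no significant change),
    or None if the pixel has too few tweets to be tested."""
    if len(before) < MIN_TWEETS or len(after) < MIN_TWEETS:
        return None
    # Welch's t-test (unequal variances); this variant is an assumption
    _, p_value = stats.ttest_ind(before, after, equal_var=False)
    if p_value >= alpha:
        return 0
    return 1 if np.mean(after) > np.mean(before) else -1


def categorical_change(before_counts, after_counts, alpha=ALPHA):
    """Chi-square test of homogeneity on per-category tweet counts.
    Returns True if the two distributions differ significantly."""
    table = np.array([before_counts, after_counts])
    p_value = stats.chi2_contingency(table)[1]
    return bool(p_value < alpha)


# Toy sentiment values (made up): the "after" values are shifted upwards
before = np.linspace(-0.2, 0.2, 40)
after = before + 0.5
label = sentiment_change(before, after)

# Toy category counts (made up), e.g. tweets per stance class
changed = categorical_change([40, 30, 30], [10, 30, 60])
```

Returning None for sparse pixels keeps "not tested" separate from "tested, no significant change", which matters when the labels are later used for training.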

The time frame surrounding the disaster event in which we look for changes can be specified. This flexibility allows us to investigate the impact of disasters over different time periods.
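
As a rough sketch of what such a time frame means in practice, the snippet below splits tweet timestamps into a "before" and an "after" window around an event date. The function name `split_by_window` is hypothetical, the dates are made up, and the boundary handling (event day counted as "after", both outer bounds inclusive) is our assumption rather than documented behaviour of the module.

```python
from datetime import datetime, timedelta


def split_by_window(timestamps, event_date, days_before, days_after):
    """Split timestamps into those inside the window before the event
    and those inside the window after it."""
    start = event_date - timedelta(days=days_before)
    end = event_date + timedelta(days=days_after)
    before = [t for t in timestamps if start <= t < event_date]
    after = [t for t in timestamps if event_date <= t <= end]
    return before, after


# Toy timestamps (made up) around an event on 2016-03-10
event = datetime(2016, 3, 10)
timestamps = [
    datetime(2016, 3, 1),   # earlier than the 7-day "before" window
    datetime(2016, 3, 5),   # inside the "before" window
    datetime(2016, 3, 12),  # inside the "after" window
    datetime(2016, 3, 25),  # later than the 14-day "after" window
]
before, after = split_by_window(timestamps, event, days_before=7, days_after=14)
```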

The training set generated through this process contains, for each pixel, the change response (no change, positive change, negative change) to each disaster event in the disaster file. This forms the data we use to train our machine learning model.

### Geocoding the Disaster Dataset
Before we start rasterizing the tweets and calculating semantic changes, it's important to make sure that all disaster events have geographic coordinates. For this, we provide the geocode_disaster_dataset method. This method takes a raw disaster dataset and geocodes the disaster events that have missing coordinates based on their 'Country' column. It then creates a new file with these geocoded events, and from then on, we work with the geocoded version of the file.
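
The idea of filling missing coordinates from the 'Country' column can be sketched with pandas as below. This is only an illustration: how geocode_disaster_dataset actually resolves a country to coordinates is not described here, so the lookup table `COUNTRY_CENTROIDS` (with rough, illustrative values) and the function `fill_missing_coordinates` are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical lookup of rough country centroids as (latitude, longitude).
# A real pipeline would more likely query a geocoding service or gazetteer.
COUNTRY_CENTROIDS = {
    'Japan': (36.0, 138.0),
    'France': (46.0, 2.0),
}


def fill_missing_coordinates(disasters, centroids=COUNTRY_CENTROIDS):
    """Return a copy in which rows with missing coordinates are filled
    from the centroid of their country, when the country is known."""
    filled = disasters.copy()
    missing = filled['Latitude'].isna() | filled['Longitude'].isna()
    for index in filled.index[missing]:
        country = filled.at[index, 'Country']
        if country in centroids:
            latitude, longitude = centroids[country]
            filled.at[index, 'Latitude'] = latitude
            filled.at[index, 'Longitude'] = longitude
    return filled


# Toy disaster table (made-up rows, for illustration only)
disasters = pd.DataFrame({
    'Country': ['Japan', 'France', 'Atlantis'],
    'Latitude': [35.7, None, None],
    'Longitude': [139.7, None, None],
})
geocoded = fill_missing_coordinates(disasters)
```

Rows that already have coordinates are left untouched, and rows whose country is not in the lookup stay missing so they can be filtered out or handled separately.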

### Running the Data Generation Process
We run the data generation process for each semantic attribute ('aggressiveness', 'sentiment', 'stance') and for every combination of the number of days before and after the disaster event. With four options each for days before and days after (7, 14, 21 and 28), this results in 4 × 4 = 16 different training sets per attribute, or 48 in total, giving us a rich dataset to train and test our models on.

In [1]:
from geoprocessing import TweetRaster

In [2]:
# Set raster resolution in degrees at the Equator
resolution = 2

In [3]:
# Create a TweetRaster object with the specified resolution
raster = TweetRaster(resolution=resolution)

In [4]:
# Load the tweets dataset into the raster object
raster.load_tweets('data/tweets.csv',
                   longitude_column='lng',
                   latitude_column='lat',
                   crs=4326,
                   filter_before='2015-01-01')

In [5]:
# Get the date of the most recent tweet (passed below as filter_after when loading disasters)
latest_tweet = raster.tweets['created_at'].max()

In [6]:
# One-time step: geocode the raw disaster dataset (uncomment to regenerate the geocoded file)
# raster.geocode_disaster_dataset('data/disasters.csv', longitude_column='Longitude', latitude_column='Latitude')

In [7]:
# Load the geocoded disaster dataset into the raster object
raster.load_disasters('data/disasters_geocoded.csv',
                      longitude_column='Longitude',
                      latitude_column='Latitude',
                      crs=4326,
                      filter_before='2015-01-01',
                      filter_after=latest_tweet)

In [8]:
# Define the semantic attributes and days before and after disaster to consider
attributes = ['aggressiveness', 'sentiment', 'stance']
days_before_after = [7, 14, 21, 28]

In [9]:
import os.path

# Loop over each attribute and each combination of days before and after
for attribute in attributes:
    for days_before in days_before_after:
        for days_after in days_before_after:
            
            # Construct the path of the training data file
            PATH = f'training_data_{resolution}deg/training_data_{resolution}deg_{attribute}_{days_before}db_{days_after}da.csv'
            
            # Check if the training data file already exists
            if not os.path.exists(PATH):
                # Create the training data
                raster.create_training_data(attribute, days_before, days_after)
                # Write the training data to a CSV file
                raster.write_training_data(attribute, days_before, days_after)
                
            print(f'File ready: {PATH}')

File ready: training_data_2deg/training_data_2deg_aggressiveness_7db_7da.csv
File ready: training_data_2deg/training_data_2deg_aggressiveness_7db_14da.csv
File ready: training_data_2deg/training_data_2deg_aggressiveness_7db_21da.csv
File ready: training_data_2deg/training_data_2deg_aggressiveness_7db_28da.csv
File ready: training_data_2deg/training_data_2deg_aggressiveness_14db_7da.csv
File ready: training_data_2deg/training_data_2deg_aggressiveness_14db_14da.csv
File ready: training_data_2deg/training_data_2deg_aggressiveness_14db_21da.csv
File ready: training_data_2deg/training_data_2deg_aggressiveness_14db_28da.csv
File ready: training_data_2deg/training_data_2deg_aggressiveness_21db_7da.csv
File ready: training_data_2deg/training_data_2deg_aggressiveness_21db_14da.csv
File ready: training_data_2deg/training_data_2deg_aggressiveness_21db_21da.csv
File ready: training_data_2deg/training_data_2deg_aggressiveness_21db_28da.csv
File ready: training_data_2deg/training_data_2deg_aggressiv