# SemEval Dataset Processing

This notebook preprocesses the tweets of SemEval-2016 Task 6 (stance detection) and divides them into train, validation, and test splits.

## Corpus Download

In [0]:
%%bash

rm -rf data/

curl -LO https://cs.famaf.unc.edu.ar/~ccardellino/resources/semeval/semeval-2016-task-6.tar.gz
tar xvf semeval-2016-task-6.tar.gz

mv StanceDataset/ data/
rm -f semeval-2016-task-6.tar.gz

StanceDataset/
StanceDataset/train.csv
StanceDataset/test.csv




## Dataset Loading

After downloading the corpus, we load the data and build the train/validation/test datasets we will work with.

In [0]:
import pandas as pd

from sklearn.model_selection import train_test_split

### Exploratory Data Analysis

In this section we do some basic EDA of the corpus. For now it is limited to viewing the dataset and checking the stances and their counts.

In [0]:
train_dataset = pd.read_csv("./data/train.csv")
train_dataset.head()

Unnamed: 0,Tweet,Target,Stance,Opinion Towards,Sentiment
0,"@tedcruz And, #HandOverTheServer she wiped cle...",Hillary Clinton,AGAINST,1. The tweet explicitly expresses opinion abo...,neg
1,Hillary is our best choice if we truly want to...,Hillary Clinton,FAVOR,1. The tweet explicitly expresses opinion abo...,pos
2,@TheView I think our country is ready for a fe...,Hillary Clinton,AGAINST,1. The tweet explicitly expresses opinion abo...,neg
3,I just gave an unhealthy amount of my hard-ear...,Hillary Clinton,AGAINST,1. The tweet explicitly expresses opinion abo...,neg
4,@PortiaABoulger Thank you for adding me to you...,Hillary Clinton,NONE,3. The tweet is not explicitly expressing opi...,pos


Here we check the possible stances and the number of instances of each class, per target.

In [0]:
stances = train_dataset.groupby(["Target", "Stance"]).size().reset_index(name="Count")
stances

Unnamed: 0,Target,Stance,Count
0,Atheism,AGAINST,304
1,Atheism,FAVOR,92
2,Atheism,NONE,117
3,Climate Change is a Real Concern,AGAINST,15
4,Climate Change is a Real Concern,FAVOR,212
5,Climate Change is a Real Concern,NONE,168
6,Feminist Movement,AGAINST,328
7,Feminist Movement,FAVOR,210
8,Feminist Movement,NONE,126
9,Hillary Clinton,AGAINST,393
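
The raw counts above are easier to compare across targets once they are turned into per-target shares. The following is a minimal sketch of that idea, not part of the original analysis: it uses `pd.crosstab` with `normalize="index"` on a small made-up frame (`toy` and its rows are hypothetical stand-ins for `train_dataset`).

```python
import pandas as pd

# Hypothetical stand-in for train_dataset; the real frame is read from ./data/train.csv.
toy = pd.DataFrame({
    "Target": ["Atheism", "Atheism", "Atheism", "Atheism",
               "Feminist Movement", "Feminist Movement"],
    "Stance": ["AGAINST", "AGAINST", "AGAINST", "FAVOR", "FAVOR", "NONE"],
})

# One row per target, one column per stance.
# normalize="index" converts the counts into row-wise shares that sum to 1.
stance_share = pd.crosstab(toy["Target"], toy["Stance"], normalize="index")
print(stance_share.round(2))
```

Applied to the real data, the same call would take `train_dataset["Target"]` and `train_dataset["Stance"]` as its two arguments.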


(This needs further analysis, which I'll leave for later. For now, let's work on extracting the graphs.)



## Train/Validation/Test

We will use a validation subset of the train dataset for hyperparameter optimization, so we need to define it explicitly.

Before anything else, we select a single target and use only its tweets for our base experiments.

In [0]:
SELECTED_TARGET = "Legalization of Abortion"

train_abortion_dataset = train_dataset[train_dataset["Target"] == SELECTED_TARGET].reset_index(drop=True)

Next, we select a random 20% of the target's training tweets for validation.

In [0]:
train_abortion_indices, validation_abortion_indices = train_test_split(
    train_abortion_dataset.index, 
    test_size=0.2, 
    random_state=42 # Seeding with the answer to the Ultimate Question of Life, the Universe, and Everything :p
)

train_abortion_dataset.loc[train_abortion_indices, "Split"] = "Train"
train_abortion_dataset.loc[validation_abortion_indices, "Split"] = "Validation"
train_abortion_dataset.head()

Unnamed: 0,Tweet,Target,Stance,Opinion Towards,Sentiment,Split
0,Just laid down the law on abortion in my bioet...,Legalization of Abortion,AGAINST,1. The tweet explicitly expresses opinion abo...,neg,Train
1,@tooprettyclub Are you OK with #GOP males tell...,Legalization of Abortion,FAVOR,1. The tweet explicitly expresses opinion abo...,neg,Train
2,"If you don't want your kid, put it up for adop...",Legalization of Abortion,AGAINST,1. The tweet explicitly expresses opinion abo...,neg,Validation
3,"@RedAlert -there should be a ""stigma"" to butch...",Legalization of Abortion,AGAINST,1. The tweet explicitly expresses opinion abo...,neg,Train
4,But isn't that the problem then. Not enough fa...,Legalization of Abortion,NONE,2. The tweet does NOT expresses opinion about ...,neg,Train
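
The split above is purely random, so the stance proportions in the two parts can drift apart. As a possible alternative (not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps the class shares equal in both parts. The sketch below runs on a made-up frame; `toy` and its contents are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for train_abortion_dataset, with skewed class counts.
toy = pd.DataFrame({
    "Tweet": [f"tweet {i}" for i in range(50)],
    "Stance": ["AGAINST"] * 30 + ["FAVOR"] * 10 + ["NONE"] * 10,
})

train_idx, validation_idx = train_test_split(
    toy.index,
    test_size=0.2,
    random_state=42,
    stratify=toy["Stance"],  # preserve each stance's share in both parts
)

# Label every row as Train first, then overwrite the validation rows.
toy["Split"] = "Train"
toy.loc[validation_idx, "Split"] = "Validation"

split_counts = toy.groupby(["Split", "Stance"]).size()
print(split_counts)
```

With 30/10/10 examples and a 20% validation share, each class contributes exactly one fifth of its rows to validation.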


With the validation split defined, we still need the test dataset for the whole graph construction.

We collect everything into one dataset.

In [0]:
test_dataset = pd.read_csv("./data/test.csv")
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
test_abortion_dataset = test_dataset[test_dataset["Target"] == SELECTED_TARGET].copy()

test_abortion_dataset["Split"] = "Test"

dataset = pd.concat([train_abortion_dataset.sort_values("Split"), test_abortion_dataset], ignore_index=True)
dataset = dataset[["Tweet", "Stance", "Split"]]  # We only need these columns
dataset.head()


Unnamed: 0,Tweet,Stance,Split
0,Just laid down the law on abortion in my bioet...,AGAINST,Train
1,Bad 2 days for #Kansas Conservatives #ksleg @g...,NONE,Train
2,"Now that there's marriage equality, can we sta...",AGAINST,Train
3,I'll always put all my focus and energy toward...,AGAINST,Train
4,"@BarackObama celebrates ""equality"" while 3000 ...",AGAINST,Train


To check our division, let's look at the distribution of the classes in each split.

In [0]:
dataset.groupby(["Split", "Stance"]).size().reset_index(name="Count")

Unnamed: 0,Split,Stance,Count
0,Test,AGAINST,189
1,Test,FAVOR,46
2,Test,NONE,45
3,Train,AGAINST,278
4,Train,FAVOR,99
5,Train,NONE,145
6,Validation,AGAINST,77
7,Validation,FAVOR,22
8,Validation,NONE,32
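
To compare the splits directly, the counts in the table above can be converted into per-split shares. This is a small sketch that copies those counts by hand into a frame (the names `counts` and `shares` are introduced here for illustration only).

```python
import pandas as pd

# Counts copied from the table above: one row per split, one column per stance.
counts = pd.DataFrame(
    {
        "AGAINST": [189, 278, 77],
        "FAVOR": [46, 99, 22],
        "NONE": [45, 145, 32],
    },
    index=["Test", "Train", "Validation"],
)

# Divide each row by its total so every split's stances sum to 1.
shares = counts.div(counts.sum(axis=1), axis=0)
print(counts.sum(axis=1))
print(shares.round(3))
```

The row totals are 280 (Test), 522 (Train), and 131 (Validation), so validation is about 20% of the 653 original training tweets. AGAINST makes up 67.5% of Test but roughly 53% of Train and 59% of Validation, which is worth keeping in mind when reading the baseline results.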


We can now save this data and use it for the baselines.

In [0]:
for split in ["Train", "Validation", "Test"]:
    # One CSV per split, keeping only the tweet text and its stance label
    dataset.loc[dataset["Split"] == split, ["Tweet", "Stance"]].to_csv(
        f"./data/semeval.abortion.{split.lower()}.csv", index=False
    )
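
Before archiving, it can be worth reading the files back to confirm that they round-trip cleanly. The sketch below repeats the save loop on a made-up frame and writes to a temporary directory instead of `./data`; `toy`, `out_dir`, and `reloaded` are hypothetical names used only for this check.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the combined dataset built above.
toy = pd.DataFrame({
    "Tweet": ["first tweet", "second, with a comma", "third tweet"],
    "Stance": ["AGAINST", "FAVOR", "NONE"],
    "Split": ["Train", "Validation", "Test"],
})

out_dir = Path(tempfile.mkdtemp())  # temporary directory, not ./data
reloaded = {}
for split in ["Train", "Validation", "Test"]:
    path = out_dir / f"semeval.abortion.{split.lower()}.csv"
    toy.loc[toy["Split"] == split, ["Tweet", "Stance"]].to_csv(path, index=False)
    # Read the file straight back to check columns and quoting survive the round trip.
    reloaded[split] = pd.read_csv(path)

for split, frame in reloaded.items():
    print(split, frame.shape, list(frame.columns))
```

Tweets containing commas are quoted automatically by `to_csv`, so they come back as a single field.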

In [0]:
%%bash

cd data/
tar zcvf semeval.abortion.tgz semeval.*

semeval.abortion.test.csv
semeval.abortion.train.csv
semeval.abortion.validation.csv


### Resource

The dataset is now available at: https://cs.famaf.unc.edu.ar/~ccardellino/resources/semeval/semeval.abortion.tgz