# Set up

In [1]:
import numpy as np
import pandas as pd

# Load data

In [2]:
# load cleaned tweets dataset
filepath_in = '../data/derived/tweets_clean.csv'
tweet_df = pd.read_csv(filepath_or_buffer=filepath_in)
# preview dataframe
tweet_df.head()

Unnamed: 0,tweet_id,text,label
0,597576902212063232,Cisco had to deal with a fat cash payout to th...,0.0
1,565586175864610817,"@MadamPlumpette I'm decent at editing, no worr...",0.0
2,563881580209246209,@girlziplocked will read. gotta go afk for a b...,0.0
3,595380689534656512,guys. show me the data. show me your github. t...,0.0
4,563757610327748608,@tpw_rules nothings broken. I was just driving...,0.0


## Filter records

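For the unsupervised portion of the project, I am going to use clustering to (hopefully!) propagate more specific sexism labels to sexist tweets whose category of sexism is unknown. The table below shows how the old labels map to the new ones and which records take part.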

| Old label | Old meaning | New label | New meaning | Include in unsupervised? |
| --------- | ----------- | --------- | ----------- | ------------------------ |
| 0 | Not racist or sexist according to expert opinion (Waseem 2016) | N/A | Not sexist | No |
| 1 | Racist and not sexist according to expert opinion (Waseem 2016) | N/A | Not sexist | No |
| 2 | Sexist and not racist according to expert opinion (Waseem 2016) | 0 | Unknown category of sexism | Yes |
| 3 | Racist and sexist according to expert opinion (Waseem 2016) | 0 | Unknown category of sexism | Yes |
| 4 | Hostile sexist (Jha and Mamidi 2017) | 1 | Hostile sexism | Yes |
| 5 | Benevolent sexist (Jha and Mamidi 2017) | 2 | Benevolent sexism | Yes |

Given that, I now need to remove the tweets that are not sexist (old labels 0 and 1) from the dataset.

In [3]:
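# remove tweets that are not sexist (old labels 0 and 1)
# .copy() makes an independent dataframe, so the label assignment later on
# does not trigger a SettingWithCopyWarning
unsupervised_tweet_df = tweet_df[tweet_df['label'] > 1].copy()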
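
As a quick sanity check on the filter, here is a minimal sketch that applies the same condition to a small made-up dataframe. `toy_df` and its contents are hypothetical stand-ins, not the project data. It should leave only the old labels marked "Yes" in the table above. Note that rows with a missing label would also be dropped, since a comparison with NaN evaluates to False.

```python
import pandas as pd

# hypothetical toy data standing in for tweet_df, one row per old label
toy_df = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4, 5, 6],
    'text': ['a', 'b', 'c', 'd', 'e', 'f'],
    'label': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
})

# same condition as the filter above: keep only sexist tweets (old labels 2-5)
toy_filtered = toy_df[toy_df['label'] > 1].copy()

# collect the labels that survived the filter
remaining_labels = sorted(toy_filtered['label'].unique())
print(remaining_labels)
```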

## Map labels for unsupervised project

Now, I will remap the old labels in the remaining records to the new labels from the table above.

In [4]:
# create dictionary to map old to new labels
old_to_new_labels = {2:0, 3:0, 4:1, 5:2}

# map old to new labels in the filtered dataset
unsupervised_tweet_df['label'] = unsupervised_tweet_df['label'].map(old_to_new_labels)
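
One thing worth checking after this step: `Series.map` with a dictionary turns any value that is missing from the dictionary into NaN rather than raising an error. The sketch below, run on a hypothetical `toy_labels` series rather than the real data, shows one way to confirm that every label was mapped before casting the result to integers.

```python
import pandas as pd

# same mapping as above
old_to_new = {2: 0, 3: 0, 4: 1, 5: 2}

# hypothetical labels, stored as floats like the label column in the preview
toy_labels = pd.Series([2.0, 3.0, 4.0, 5.0, 3.0])

# values absent from the dictionary would become NaN here
toy_mapped = toy_labels.map(old_to_new)

# count unmapped labels; anything above zero means the dictionary is incomplete
n_unmapped = int(toy_mapped.isna().sum())
if n_unmapped > 0:
    raise ValueError(f'{n_unmapped} labels were not covered by the mapping')

# safe to cast to integers once no NaN values remain
toy_mapped = toy_mapped.astype(int)
print(toy_mapped.tolist())
```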

## Write unsupervised data to CSV

In [5]:
# write data to csv
filepath_out = '../data/derived/tweets_unsupervised.csv'
unsupervised_tweet_df.to_csv(filepath_out, index=False)
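
The `index=False` argument matters here. Without it, pandas writes the row index as an extra unnamed column, which shows up as `Unnamed: 0` when the file is read back (as in the preview of the input file above). The sketch below illustrates the difference on a hypothetical `toy_df`, using in-memory buffers instead of files so nothing is written to disk.

```python
import io
import pandas as pd

# hypothetical toy data, not the project dataset
toy_df = pd.DataFrame({'tweet_id': [10, 11], 'label': [0, 2]})

# write with the default index=True, then read back
buffer_with_index = io.StringIO()
toy_df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
cols_with_index = pd.read_csv(buffer_with_index).columns.tolist()

# write with index=False, then read back
buffer_without_index = io.StringIO()
toy_df.to_csv(buffer_without_index, index=False)
buffer_without_index.seek(0)
cols_without_index = pd.read_csv(buffer_without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Note that `index=False` only prevents a new index column. A column that is already called `Unnamed: 0` in the dataframe would still be written unless it is dropped first.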