## Downloading and preprocessing MEN
This notebook downloads and preprocesses the [MEN](https://staff.fnwi.uva.nl/e.bruni/resources/MEN) dataset.

In [1]:
%env URL=https://staff.fnwi.uva.nl/e.bruni/resources/MEN.zip
!wget $URL
!unzip MEN.zip

env: URL=https://staff.fnwi.uva.nl/e.bruni/resources/MEN.zip
--2017-08-11 11:40:43--  https://staff.fnwi.uva.nl/e.bruni/resources/MEN.zip
Resolving staff.fnwi.uva.nl... 146.50.61.62
Connecting to staff.fnwi.uva.nl|146.50.61.62|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96854 (95K) [application/zip]
Saving to: ‘MEN.zip’


2017-08-11 11:40:45 (125 KB/s) - ‘MEN.zip’ saved [96854/96854]

Archive:  MEN.zip
   creating: MEN/
  inflating: MEN/licence.txt         
  inflating: MEN/.DS_Store           
   creating: MEN/agreement/
  inflating: MEN/agreement/elias-men-ratings.txt  
  inflating: MEN/agreement/agreement-score.txt  
  inflating: MEN/agreement/marcos-men-ratings.txt  
  inflating: MEN/instructions.txt    
  inflating: MEN/MEN_dataset_lemma_form.test  
  inflating: MEN/MEN_dataset_lemma_form.dev  
  inflating: MEN/MEN_dataset_lemma_form_full  
  inflating: MEN/MEN_dataset_natural_form_full  
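
Where `wget` and `unzip` are not available, the same download-and-extract step can be done from Python. This is a minimal sketch using only the standard library; the helper name `download_and_extract` is hypothetical and not part of the original notebook.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, dest_dir="."):
    """Download the zip archive at `url` and unpack it into `dest_dir`.

    Hypothetical helper: a standard-library stand-in for `wget` + `unzip`.
    Returns the path of the downloaded archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last URL component, e.g. MEN.zip
    zip_path = dest / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return zip_path
```

Calling `download_and_extract("https://staff.fnwi.uva.nl/e.bruni/resources/MEN.zip")` should leave `MEN.zip` and the `MEN/` directory in the working directory, as the shell commands do.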


In [2]:
!ls

MEN                     MEN.zip                 preprocessing_men.ipynb


In [1]:
!ls MEN

MEN_dataset_lemma_form.dev    agreement
MEN_dataset_lemma_form.test   instructions.txt
MEN_dataset_lemma_form_full   licence.txt
MEN_dataset_natural_form_full


In [2]:
import pandas as pd

In [18]:
raw_data = 'MEN/MEN_dataset_natural_form_full'
df = pd.read_csv(raw_data, header=None, sep=' ')

In [19]:
df.head()

            0          1     2
0         sun   sunlight  50.0
1  automobile        car  50.0
2       river      water  49.0
3      stairs  staircase  49.0
4     morning    sunrise  49.0


In [21]:
df.columns = ['word1', 'word2', 'similarity_out_of_50']

In [24]:
df['similarity'] = df['similarity_out_of_50'] / 5  # rescale from 0-50 to 0-10

In [25]:
outfile = '../men.csv'
df.to_csv(outfile, index=False)

Now get the agreement data: the ratings from two of the authors on a 1-7 scale. These are stored in the two text files named `*-men-ratings.txt`.

In [3]:
!ls MEN/agreement

agreement-score.txt    elias-men-ratings.txt  marcos-men-ratings.txt


In [8]:
!head MEN/agreement/elias-men-ratings.txt

hamster	party	1	
bed	sleep	6
raspberry	strawberry	6
cooking	fruit	5
downtown	shopping	4
drug	wolf	2
colorful	outfit	6
burger	mac	5
frost	weather	5	
arch	concrete	2


In [21]:
# The files mix tabs and trailing whitespace, so split on any run of whitespace.
elias = pd.read_csv('MEN/agreement/elias-men-ratings.txt', sep=r'\s+', header=None)
marcos = pd.read_csv('MEN/agreement/marcos-men-ratings.txt', sep=r'\s+', header=None)
elias.head()

           0           1  2
0    hamster       party  1
1        bed       sleep  6
2  raspberry  strawberry  6
3    cooking       fruit  5
4   downtown    shopping  4
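
To see how a whitespace separator copes with the stray trailing tabs visible in the `head` output above, here is a small sketch on an in-memory sample. The three rows are copied from the file preview; `sample` and `ratings` are illustrative names only.

```python
import io

import pandas as pd

# Three rows in the same shape as the ratings files: tab-separated,
# with a trailing tab on some lines.
sample = "hamster\tparty\t1\t\nbed\tsleep\t6\nfrost\tweather\t5\t\n"

# A regex separator of one or more whitespace characters treats tabs and
# spaces alike and does not create an extra column for trailing whitespace.
ratings = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

print(ratings.shape)
print(ratings)
```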


Rename the columns:

In [22]:
cols = ['word1', 'word2', 'similarity_out_of_7']
elias.columns = cols
marcos.columns = cols

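The 'similarity_out_of_7' column may be read in as strings, since some lines end in a stray tab. Coerce it to numeric either way:

In [23]:
# astype(str) makes .str.strip() safe whether the column was parsed as text or as integers
elias['similarity_out_of_7'] = pd.to_numeric(elias['similarity_out_of_7'].astype(str).str.strip())
marcos['similarity_out_of_7'] = pd.to_numeric(marcos['similarity_out_of_7'].astype(str).str.strip())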

In [25]:
# Map the 1-7 ratings linearly onto 0-10 (1 -> 0, 7 -> 10)
elias['similarity'] = 10 * (elias['similarity_out_of_7'] - 1) / 6
marcos['similarity'] = 10 * (marcos['similarity_out_of_7'] - 1) / 6
marcos.head()

    word1       word2  similarity_out_of_7  similarity
0  burger    sandwich                    6    8.333333
1    blue      violet                    6    8.333333
2  splash        wash                    4         5.0
3    rust       rusty                    7        10.0
4   snake  strawberry                    1         0.0
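
Both rescalings in this notebook (0-50 to 0-10 for the MEN scores, 1-7 to 0-10 for the agreement ratings) are instances of the same linear min-max mapping. A minimal sketch, with a hypothetical helper `rescale` that is not used by the notebook itself:

```python
import pandas as pd


def rescale(values, old_min, old_max, new_max=10.0):
    """Linearly map values from [old_min, old_max] onto [0, new_max]."""
    return new_max * (values - old_min) / (old_max - old_min)


likert = pd.Series([1, 4, 7])               # ratings on the 1-7 scale
men_scores = pd.Series([0.0, 25.0, 50.0])   # scores on the 0-50 scale

print(rescale(likert, 1, 7).tolist())       # [0.0, 5.0, 10.0]
print(rescale(men_scores, 0, 50).tolist())  # [0.0, 5.0, 10.0]
```

With `old_min=1, old_max=7` this reduces to `10 * (x - 1) / 6`, and with `old_min=0, old_max=50` to `x / 5`.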


Write both sets of ratings to file:

In [28]:
outfile = '../elias-men.csv'
elias.to_csv(outfile, index=False)
outfile = '../marcos-men.csv'
marcos.to_csv(outfile, index=False)
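
As an illustration of how the two annotators' frames might be compared, the sketch below aligns them on the word pair and computes a Spearman rank correlation. It is not part of the original notebook: the toy values are made up, and it assumes each pair appears with the same `word1`/`word2` orientation in both files (the rows themselves are in a different order, so a merge is needed).

```python
import pandas as pd

# Toy stand-ins for the two frames (hypothetical values, not the real ratings).
elias_toy = pd.DataFrame({
    "word1": ["a", "c", "e", "g"],
    "word2": ["b", "d", "f", "h"],
    "similarity": [1.0, 6.0, 4.0, 7.0],
})
marcos_toy = pd.DataFrame({
    "word1": ["c", "a", "g", "e"],
    "word2": ["d", "b", "h", "f"],
    "similarity": [7.0, 2.0, 5.0, 3.0],
})

# Align the ratings on the word pair, since row order differs between files.
both = elias_toy.merge(marcos_toy, on=["word1", "word2"],
                       suffixes=("_elias", "_marcos"))

# Spearman correlation is the Pearson correlation of the ranks.
rho = both["similarity_elias"].rank().corr(both["similarity_marcos"].rank())
print(rho)
```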

In [29]:
# Clean up the downloaded archive and the extracted directory.
!rm -rf MEN MEN.zip
