## Downloading and preprocessing the SimLex dataset
This notebook downloads the SimLex-999 word-similarity dataset, renames its columns to more descriptive names, and saves the result as a CSV file.

In [6]:
%env URL=https://www.cl.cam.ac.uk/~fh295/SimLex-999.zip
!wget $URL
!unzip SimLex-999.zip

env: URL=https://www.cl.cam.ac.uk/~fh295/SimLex-999.zip
--2017-08-10 20:25:04--  https://www.cl.cam.ac.uk/~fh295/SimLex-999.zip
Resolving www.cl.cam.ac.uk... 128.232.0.20, 2001:630:212:200::80:14
Connecting to www.cl.cam.ac.uk|128.232.0.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16805 (16K) [application/zip]
Saving to: ‘SimLex-999.zip’


2017-08-10 20:25:05 (84.6 KB/s) - ‘SimLex-999.zip’ saved [16805/16805]

Archive:  SimLex-999.zip
  inflating: SimLex-999/README.txt   
  inflating: SimLex-999/SimLex-999.txt  
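
Where `wget` and `unzip` are not available, the same step can be sketched with the Python standard library alone. The helper name `fetch_and_extract` and its parameters are illustrative and not part of the original notebook; the URL is the one used in the cell above.

```python
import urllib.request
import zipfile
from pathlib import Path

SIMLEX_URL = "https://www.cl.cam.ac.uk/~fh295/SimLex-999.zip"


def fetch_and_extract(url, zip_path, dest_dir="."):
    """Download `url` to `zip_path` (unless it already exists) and unzip it.

    Hypothetical helper; returns the list of member names in the archive.
    """
    zip_path = Path(zip_path)
    if not zip_path.exists():
        # Skip the download when a local copy is already present.
        urllib.request.urlretrieve(url, str(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Example call (requires network access):
# fetch_and_extract(SIMLEX_URL, "SimLex-999.zip")
```

Skipping the download when the archive is already on disk makes the cell safe to re-run.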


In [7]:
!ls

SimLex-999                 preprocessing_simlex.ipynb
SimLex-999.zip


In [8]:
!ls SimLex-999

README.txt     SimLex-999.txt


In [9]:
import pandas as pd

In [11]:
df = pd.read_csv('SimLex-999/SimLex-999.txt', sep='\t')
df.head()

Unnamed: 0,word1,word2,POS,SimLex999,conc(w1),conc(w2),concQ,Assoc(USF),SimAssoc333,SD(SimLex)
0,old,new,A,1.58,2.72,2.81,2,7.25,1,0.41
1,smart,intelligent,A,9.2,1.75,2.46,1,7.11,1,0.67
2,hard,difficult,A,8.77,3.76,2.21,2,5.94,1,1.19
3,happy,cheerful,A,9.55,2.56,2.34,1,5.85,1,2.18
4,hard,easy,A,0.95,3.76,2.07,2,5.82,1,0.93
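
Before renaming anything, it can help to confirm that the file loaded with the expected layout. The following is a minimal sketch, assuming the column names shown in the output above; `check_simlex` and its `expected_rows` parameter are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Raw column names as they appear in the df.head() output above.
EXPECTED_COLUMNS = [
    "word1", "word2", "POS", "SimLex999", "conc(w1)", "conc(w2)",
    "concQ", "Assoc(USF)", "SimAssoc333", "SD(SimLex)",
]


def check_simlex(df, expected_rows=None):
    """Raise ValueError if `df` does not look like the raw SimLex table."""
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if expected_rows is not None and len(df) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(df)}")
    # A word parsed as a missing value would silently corrupt a pair.
    if df[["word1", "word2"]].isna().any().any():
        raise ValueError("missing word in at least one pair")
    return df
```

A call such as `check_simlex(df, expected_rows=999)` would then fail loudly if the separator or the file itself were wrong.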


Important notes from the README:
- POS applies to both words: SimLex only contains word pairs that share the same part of speech.
- Similarity ratings (SimLex999) were collected on a 0-6 scale and then linearly mapped to 0-10.
- Concreteness ratings (conc(w1), conc(w2)) are from the Nelson norms and are on a 1-7 scale.
- The concreteness quartile (concQ) is also derived from the Nelson norms (unsure how it is calculated from the two ratings).
- Assoc(USF) is the association strength from the Nelson norms.
- SimAssoc333 is a binary variable indicating whether the word pair is among the 333 most associated pairs (the top third), as per the Assoc(USF) column.
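
The linear mapping from the 0-6 scale to 0-10 mentioned in the notes can be sketched as a simple rescaling. This assumes both scales start at 0, so the mapping reduces to multiplying by 10/6; the function names are illustrative and do not come from the README.

```python
def rescale_similarity(raw, old_max=6.0, new_max=10.0):
    """Linearly map a rating from [0, old_max] to [0, new_max]."""
    return raw * new_max / old_max


def to_raw_scale(scaled, old_max=6.0, new_max=10.0):
    """Invert the mapping: recover the value on the [0, old_max] scale."""
    return scaled * old_max / new_max
```

Under this assumption, the rating of 9.2 for the pair smart/intelligent corresponds to roughly 5.52 on the original 0-6 scale.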

In [12]:
df.columns = ['word1', 'word2', 'POS', 'similarity', 'word1_concreteness', 'word2_concreteness',
              'concreteness_quartile', 'nelson_norms', 'top_333_in_nelson', 'similarity_sd']
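
Assigning to `df.columns` relies on the column order staying exactly as it was when the file was read. A more defensive variant, sketched here with the same target names, renames by key so that a reordered or missing column raises an error instead of being silently mislabelled. `COLUMN_MAP` and `rename_simlex_columns` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Old name -> new name; word1, word2 and POS keep their names.
COLUMN_MAP = {
    "SimLex999": "similarity",
    "conc(w1)": "word1_concreteness",
    "conc(w2)": "word2_concreteness",
    "concQ": "concreteness_quartile",
    "Assoc(USF)": "nelson_norms",
    "SimAssoc333": "top_333_in_nelson",
    "SD(SimLex)": "similarity_sd",
}


def rename_simlex_columns(df):
    """Return a copy of `df` with descriptive column names."""
    missing = set(COLUMN_MAP) - set(df.columns)
    if missing:
        raise KeyError(f"columns not found: {sorted(missing)}")
    return df.rename(columns=COLUMN_MAP)
```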

In [13]:
df.head()

Unnamed: 0,word1,word2,POS,similarity,word1_concreteness,word2_concreteness,concreteness_quartile,nelson_norms,top_333_in_nelson,similarity_sd
0,old,new,A,1.58,2.72,2.81,2,7.25,1,0.41
1,smart,intelligent,A,9.2,1.75,2.46,1,7.11,1,0.67
2,hard,difficult,A,8.77,3.76,2.21,2,5.94,1,1.19
3,happy,cheerful,A,9.55,2.56,2.34,1,5.85,1,2.18
4,hard,easy,A,0.95,3.76,2.07,2,5.82,1,0.93


In [14]:
outfile = '../simlex.csv'
df.to_csv(outfile, index=False)
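
Since the source files are deleted in the next step, it may be worth confirming that the CSV reads back intact first. The sketch below wraps the write in a round-trip check; `save_and_verify` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def save_and_verify(df, path):
    """Write `df` to CSV without the index and confirm it reads back intact."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    # Compare values approximately and ignore dtype differences, since a
    # round trip through text can change float formatting or column dtypes.
    pd.testing.assert_frame_equal(
        df.reset_index(drop=True), reloaded, check_exact=False, check_dtype=False
    )
    return reloaded
```

Passing `index=False` keeps the row index out of the file, so no `Unnamed: 0` column appears when the CSV is loaded again.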

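Remove everything except this notebook to save space. The output CSV was written to the parent directory, so it is not affected. The `find | xargs rm` pipeline also passes `.` and the directories to `rm`, which refuses to delete them without `-r`; the resulting errors are harmless, and the `rm -rf` that follows deletes the `SimLex-999` directory.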

In [15]:
!find . -not -name 'preprocessing_simlex.ipynb' -print0 | xargs -0 rm --
!rm -rf SimLex-999

rm: "." and ".." may not be removed
rm: ./.ipynb_checkpoints: is a directory
rm: ./SimLex-999: is a directory
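
The errors above can be avoided by doing the cleanup in Python and handling files and directories separately. This is a minimal sketch, not the notebook's original method; `remove_all_except` is a hypothetical helper, and unlike the shell pipeline it leaves the contents of any kept directory untouched.

```python
import shutil
from pathlib import Path


def remove_all_except(directory, keep):
    """Delete every top-level entry in `directory` whose name is not in `keep`.

    Returns the sorted names of the entries that were removed.
    """
    removed = []
    for entry in Path(directory).iterdir():
        if entry.name in keep:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a directory and its contents
        else:
            entry.unlink()  # remove a regular file or a symlink
        removed.append(entry.name)
    return sorted(removed)


# Example call for this notebook's directory:
# remove_all_except(".", {"preprocessing_simlex.ipynb", ".ipynb_checkpoints"})
```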
