## Preprocessing ws-353
This notebook downloads and preprocesses the WordSimilarity-353 dataset.

In [32]:
%env URL=http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.zip
!wget $URL
!unzip wordsim353.zip

env: URL=http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.zip
--2017-08-10 12:32:55--  http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.zip
Resolving www.cs.technion.ac.il... 132.68.32.15
Connecting to www.cs.technion.ac.il|132.68.32.15|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23257 (23K) [application/zip]
Saving to: ‘wordsim353.zip’


2017-08-10 12:32:56 (292 MB/s) - ‘wordsim353.zip’ saved [23257/23257]

Archive:  wordsim353.zip
  inflating: combined.csv            
  inflating: set1.csv                
  inflating: set2.csv                
  inflating: combined.tab            
  inflating: set1.tab                
  inflating: set2.tab                
  inflating: instructions.txt        


In [33]:
!ls

combined.csv               set1.tab
combined.tab               set2.csv
instructions.txt           set2.tab
preprocessing_ws-353.ipynb wordsim353.zip
set1.csv


The ws-353 data comes in three sets: set1, set2, and combined. Set1 and set2 partition the word pairs, and combined contains all of them. However, only set1 and set2 record each subject's individual similarity judgements; combined has only the mean rating. I therefore combine set1 and set2 manually to preserve the individual-level judgements.

In [34]:
import pandas as pd
import numpy as np

In [35]:
set1_df = pd.read_csv('set1.csv')
set2_df = pd.read_csv('set2.csv')

In [36]:
set1_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 16 columns):
Word 1          153 non-null object
Word 2          153 non-null object
Human (mean)    153 non-null float64
1               153 non-null float64
2               153 non-null float64
3               153 non-null float64
4               153 non-null int64
5               153 non-null int64
6               153 non-null int64
7               153 non-null float64
8               153 non-null int64
9               153 non-null float64
10              153 non-null int64
11              153 non-null float64
12              153 non-null int64
13              153 non-null int64
dtypes: float64(7), int64(7), object(2)
memory usage: 19.2+ KB


Some columns have dtype `int64`, which makes me think those subjects didn't know real-valued responses were possible. Convert every numeric column to `float` anyway.

In [37]:
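# Cast numeric columns directly; to_numeric(errors='ignore') is deprecated in recent pandas.
set1_df = set1_df.astype({c: 'float64' for c in set1_df.select_dtypes('number').columns})
set2_df = set2_df.astype({c: 'float64' for c in set2_df.select_dtypes('number').columns})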
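
Casting to `float` does not erase the observation about integer-only subjects: it can still be recovered afterwards by checking which columns hold only whole numbers. A minimal sketch on a toy frame follows; the ratings and the three-subject layout are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical ratings for three subjects; only subject "2" used a fractional score.
ratings = pd.DataFrame({
    "1": [9, 7, 10],
    "2": [8.0, 8.5, 10.0],
    "3": [4, 5, 10],
})

# Cast to float so that all rating columns share one dtype.
ratings = ratings.astype("float64")

# A subject used only whole numbers if every rating equals its rounded value.
whole_only = [
    col for col in ratings.columns
    if (ratings[col] == ratings[col].round()).all()
]
print(whole_only)
```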

The numbered columns in set1 and set2 do not refer to the same subjects (set1 has 13 raters, set2 has 16), but I combine the two sets anyway. I add a column recording which set each word pair came from.

In [38]:
set1_df['which_set?'] = 'set1' 
set1_df.head()

Unnamed: 0,Word 1,Word 2,Human (mean),1,2,3,4,5,6,7,8,9,10,11,12,13,which_set?
0,love,sex,6.77,9.0,6.0,8.0,8.0,7.0,8.0,8.0,4.0,7.0,2.0,6.0,7.0,8.0,set1
1,tiger,cat,7.35,9.0,7.0,8.0,7.0,8.0,9.0,8.5,5.0,6.0,9.0,7.0,5.0,7.0,set1
2,tiger,tiger,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,set1
3,book,paper,7.46,8.0,8.0,7.0,7.0,8.0,9.0,7.0,6.0,7.0,8.0,9.0,4.0,9.0,set1
4,computer,keyboard,7.62,8.0,7.0,9.0,9.0,8.0,8.0,7.0,7.0,6.0,8.0,10.0,3.0,9.0,set1


In [41]:
combined_df = pd.concat([set1_df, set2_df])
combined_df['which_set?'] = combined_df['which_set?'].fillna('set2')
combined_df.head()

Unnamed: 0,1,10,11,12,13,14,15,16,2,3,4,5,6,7,8,9,Human (mean),Word 1,Word 2,which_set?
0,9.0,2.0,6.0,7.0,8.0,,,,6.0,8.0,8.0,7.0,8.0,8.0,4.0,7.0,6.77,love,sex,set1
1,9.0,9.0,7.0,5.0,7.0,,,,7.0,8.0,7.0,8.0,9.0,8.5,5.0,6.0,7.35,tiger,cat,set1
2,10.0,10.0,10.0,10.0,10.0,,,,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,tiger,tiger,set1
3,8.0,8.0,9.0,4.0,9.0,,,,8.0,7.0,7.0,8.0,9.0,7.0,6.0,7.0,7.46,book,paper,set1
4,8.0,8.0,10.0,3.0,9.0,,,,7.0,9.0,9.0,8.0,8.0,7.0,7.0,6.0,7.62,computer,keyboard,set1
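
An alternative to filling in the missing tag afterwards is to tag both frames before concatenating. The sketch below shows this on two tiny stand-in frames; the frame names and the second frame's word pair and ratings are hypothetical, chosen only to show how a rater column missing from one set becomes `NaN`.

```python
import pandas as pd

# Hypothetical miniature versions of the two sets; the second has an extra rater column.
small_set1 = pd.DataFrame({"Word 1": ["love"], "Word 2": ["sex"], "1": [9.0], "2": [6.0]})
small_set2 = pd.DataFrame({"Word 1": ["cup"], "Word 2": ["mug"], "1": [8.0], "2": [7.0], "3": [9.0]})

# Tag each frame before concatenating, so no fillna step is needed afterwards.
tagged = pd.concat(
    [
        small_set1.assign(**{"which_set?": "set1"}),
        small_set2.assign(**{"which_set?": "set2"}),
    ],
    ignore_index=True,  # give the result a fresh 0..n-1 index
    sort=False,         # keep columns in order of first appearance
)
print(tagged)
```

Rater `3` is `NaN` for the row from the first frame, in the same way that columns 14 to 16 are empty for set1 pairs in the output above.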


Rename the columns, then reorder them.

In [42]:
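# Rename by name rather than by position, so the result does not depend on the column order produced by pd.concat.
combined_df = combined_df.rename(columns={'Human (mean)': 'similarity', 'Word 1': 'word1', 'Word 2': 'word2'})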
new_columns = ['word1', 'word2', 'similarity', 'which_set?'] + list(map(str, range(1, 17)))
combined_df = combined_df[new_columns]
combined_df.head()

Unnamed: 0,word1,word2,similarity,which_set?,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,love,sex,6.77,set1,9.0,6.0,8.0,8.0,7.0,8.0,8.0,4.0,7.0,2.0,6.0,7.0,8.0,,,
1,tiger,cat,7.35,set1,9.0,7.0,8.0,7.0,8.0,9.0,8.5,5.0,6.0,9.0,7.0,5.0,7.0,,,
2,tiger,tiger,10.0,set1,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,,,
3,book,paper,7.46,set1,8.0,8.0,7.0,7.0,8.0,9.0,7.0,6.0,7.0,8.0,9.0,4.0,9.0,,,
4,computer,keyboard,7.62,set1,8.0,7.0,9.0,9.0,8.0,8.0,7.0,7.0,6.0,8.0,10.0,3.0,9.0,,,
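
As a sanity check, the `similarity` column should equal the mean of the individual ratings in each row. The sketch below recomputes it for the first two rows of the table above; the variable names are mine, and it assumes the published mean is rounded to two decimals.

```python
import numpy as np
import pandas as pd

rater_cols = [str(i) for i in range(1, 17)]

# Ratings copied from the first two rows above; raters 14-16 have no set1 scores.
love_sex = [9, 6, 8, 8, 7, 8, 8, 4, 7, 2, 6, 7, 8] + [np.nan] * 3
tiger_cat = [9, 7, 8, 7, 8, 9, 8.5, 5, 6, 9, 7, 5, 7] + [np.nan] * 3

sample = pd.DataFrame([love_sex, tiger_cat], columns=rater_cols)
sample.insert(0, "similarity", [6.77, 7.35])

# DataFrame.mean skips NaN by default, so the missing raters are ignored.
recomputed = sample[rater_cols].mean(axis=1).round(2)
print(np.allclose(recomputed, sample["similarity"]))
```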


In [43]:
outfile = '../ws-353.csv'
combined_df.to_csv(outfile, index=False)
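
The wide layout keeps one column per rater. For analyses of individual judgements, a long layout with one row per (word pair, rater) can be easier to work with. This is a sketch of one way to reshape it, not part of the preprocessing itself: the frame is trimmed to three rater columns, and the all-`NaN` column `3` stands in for raters who did not score a pair.

```python
import numpy as np
import pandas as pd

# Trimmed stand-in for the combined frame (only three rater columns).
wide = pd.DataFrame({
    "word1": ["love", "tiger"],
    "word2": ["sex", "cat"],
    "similarity": [6.77, 7.35],
    "which_set?": ["set1", "set1"],
    "1": [9.0, 9.0],
    "2": [6.0, 7.0],
    "3": [np.nan, np.nan],  # placeholder for raters with no score
})

id_cols = ["word1", "word2", "similarity", "which_set?"]

# One row per (word pair, rater); drop raters who did not score the pair.
long = (
    wide.melt(id_vars=id_cols, var_name="rater", value_name="rating")
        .dropna(subset=["rating"])
        .reset_index(drop=True)
)
print(long)
```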

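Remove everything except this notebook to save space. The two `rm` messages in the output are harmless: `find .` also lists `.` itself and the `.ipynb_checkpoints` directory, neither of which plain `rm` will delete.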

In [44]:
!find . -not -name 'preprocessing_ws-353.ipynb' -print0 | xargs -0 rm --

rm: "." and ".." may not be removed
rm: ./.ipynb_checkpoints: is a directory
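
The same cleanup can be done from Python without the `rm` messages by deleting only regular files. This is a minimal sketch: `remove_files_except` is a hypothetical helper, not something the notebook defines, and the demonstration runs in a temporary directory so that nothing real is deleted.

```python
import tempfile
from pathlib import Path


def remove_files_except(directory, keep_name):
    """Delete regular files directly inside `directory`, except `keep_name`.

    Subdirectories are left untouched. Returns the sorted names of removed files.
    """
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and path.name != keep_name:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


# Demonstrate in a throwaway directory that mimics the notebook folder.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    for name in ["preprocessing_ws-353.ipynb", "set1.csv", "wordsim353.zip"]:
        (tmp_path / name).write_text("")
    (tmp_path / ".ipynb_checkpoints").mkdir()

    removed = remove_files_except(tmp_path, "preprocessing_ws-353.ipynb")
    remaining = sorted(p.name for p in tmp_path.iterdir())

print(removed, remaining)
```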
