### Importing the necessary modules

In [9]:
import pandas as pd
import random

### Filtering the Dataset

Filtering the shifts dataset to include only the "Semantic evolution" and "Microevolution" shift types.
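
The filter in this section compares against labels that start with a space, which suggests that the `Type` values in the CSV are padded. As a minimal sketch, assuming that padding is the only difference, the same selection can be done by stripping whitespace and using `isin`. The toy rows and the `' Borrowing'` label are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Toy rows (hypothetical); the Type labels carry a leading space,
# mirroring the comparison strings used in the filter
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'Type': [' Semantic evolution', ' Microevolution', ' Borrowing'],
})

# Shift types to keep, written without padding
keep = ['Semantic evolution', 'Microevolution']

# Strip surrounding whitespace before comparing, then keep matching rows
filtered = toy[toy['Type'].str.strip().isin(keep)]
print(filtered)
```

Stripping first means the filter does not depend on how many spaces surround each label.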

In [10]:
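#Loading the shifts dataset
data = pd.read_csv('Shifts.csv')

#Keeping only the two shift types of interest. The labels are matched together with their leading space.
SC_data = data[(data['Type'] == ' Semantic evolution') | (data['Type'] == ' Microevolution')]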

### Merging the Shifts Data

Merging cognate shifts into a single row. Shifts that share a root word and follow the same meaning shift path are placed in the same row. The script takes the filtered shifts and groups those that share a shift ID, source language, source lexeme and source meaning. The second languages and second lexemes of each group are then written into one row, separated by commas. For instance, if ShiftXYZ includes shifts from a word in an ancestor language into several daughter languages, and the shifted meaning is the same across these languages, these shifts are presented as a single row.
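
A minimal sketch of this merge on toy data may make the result easier to picture. All values below are hypothetical placeholders, not rows from the shifts dataset; only the column names follow the dataset.

```python
import pandas as pd

# Toy shifts (hypothetical placeholders): shift 1 goes from one
# ancestor word into two daughter languages with the same new meaning
toy = pd.DataFrame({
    'ID': [1, 1, 2],
    'Language_1': ['AncestorA', 'AncestorA', 'AncestorB'],
    'Lexeme_1': ['root1', 'root1', 'root2'],
    'Meaning_1': ['head', 'head', 'dog'],
    'Language_2': ['DaughterX', 'DaughterY', 'DaughterZ'],
    'Lexeme_2': ['x1', 'y1', 'z1'],
    'Meaning_2': ['leader', 'leader', 'hound'],
})

# Group on the source side, then join the target languages and
# lexemes of each group into comma-separated strings
toy_merged = (
    toy.groupby(['ID', 'Language_1', 'Lexeme_1', 'Meaning_1'])
       .agg({
           'Language_2': lambda x: ','.join(x.unique()),
           'Lexeme_2': lambda x: ','.join(x.unique()),
           'Meaning_2': 'first',
       })
       .reset_index()
)
print(toy_merged)
```

The two rows of shift 1 collapse into one row whose `Language_2` and `Lexeme_2` cells list both daughter languages and lexemes, while shift 2 stays as it is.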

In [11]:
#Keeping only the necessary columns in the dataframe
SC_data = SC_data[['ID', 'Type', 'Language_1', 'Lexeme_1', 'Meaning_1', 'Direction', 'Language_2', 'Lexeme_2', 'Meaning_2']]

#Grouping the rows that have the same value in ID, Language 1, Lexeme 1 and Meaning 1 columns, and merging the Language 2 and 
#Lexeme 2 columns into the same row, every value separated by commas.
merged = SC_data.groupby(['ID', 'Language_1', 'Lexeme_1', 'Meaning_1']).agg({
    'Language_2': lambda x: ','.join(x.unique()),
    'Lexeme_2':   lambda x: ','.join(x.unique()),
    'Meaning_2':  'first',   
    'Type':       'first',   
    'Direction':  'first'
}).reset_index()

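#Reordering the columns by name (the Type column is dropped here)
merged_reordered = merged[['ID', 'Language_1', 'Lexeme_1', 'Meaning_1', 'Direction',
                           'Language_2', 'Lexeme_2', 'Meaning_2']]

#Saving the merged shifts. The row index is written as the first column of the csv.
merged_reordered.to_csv('MergedShifts.csv')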

### Sampling the Data

Randomly sampling 200 rows from the merged dataframe for annotation. 100 of these rows are saved with all the information and the other 100 rows are saved without any language information.
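
For comparison, a rough sketch of the same kind of split using pandas' own `DataFrame.sample`. This is an alternative technique, not the approach used in this notebook, and it would not select the same rows as a seeded `random.sample`. The toy dataframe and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the merged shifts table
toy = pd.DataFrame({
    'ID': range(1000),
    'Language_1': ['LangA'] * 1000,
    'Meaning_1': ['m1'] * 1000,
    'Meaning_2': ['m2'] * 1000,
})

# Draw 200 rows without replacement; random_state makes the draw repeatable
sampled = toy.sample(n=200, random_state=3)

# The first 100 sampled rows keep every column,
# the remaining 100 lose the language column
with_lang = sampled.iloc[:100]
without_lang = sampled.iloc[100:].drop(columns=['Language_1'])

# The two halves never share a row
print(set(with_lang['ID']).isdisjoint(without_lang['ID']))
```

Because `sample` returns the rows in random order, slicing the result in half gives two disjoint random subsets.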

In [12]:
#Setting the seed so that the results are consistent across runs
random.seed(3)

#Importing the dataframe
data = pd.read_csv('MergedShifts.csv')

#The 'Unnamed: 0' column holds the row number, which is unique to each row, so it is used for sampling.
#Putting all row numbers into a flat list
ids = data['Unnamed: 0'].tolist()

#Sampling 200 row numbers
sample_id = random.sample(ids, 200)

#Sampling 100 row numbers from the sampled row numbers for the rows with all information
id_lang = random.sample(sample_id, 100)

#Removing the 100 row numbers chosen above from the 200 sampled row numbers,
#which leaves the 100 row numbers for the rows without language information
id_nolang = list(set(sample_id) - set(id_lang))

#Saving the first sample of rows
lang = data[data['Unnamed: 0'].isin(id_lang)]

#Saving the second sample of rows
nolang = data[data['Unnamed: 0'].isin(id_nolang)]

#Removing the language information from the second sample and the row number column
nolang = nolang[['ID', 'Meaning_1', 'Direction', 'Meaning_2']]

#Removing the row number column from the first sample
lang = lang[['ID', 'Language_1', 'Lexeme_1', 'Meaning_1', 'Direction',
       'Language_2', 'Lexeme_2', 'Meaning_2']]

#Saving the sample with language information
lang.to_csv('RandomShiftsLang.csv')

#Saving the sample without language information
nolang.to_csv('RandomShifts.csv')
