SemEval-2020 Task 1: Automatic Identification of Novel Word Senses and its Application to Social Media

About the Project

This project applies techniques based on word sense induction for determining instances of novel word senses. In the get_data file there are scripts for scraping social media data to provide the corpora for the project.

Requirements

The project is in Python, and requires a version of at least 3.6. To install the required libraries, run

pip3 install -r requirements.txt

or

conda install --file requirements.txt

to use Anaconda

Running the Project

There are two ways of running the project -- one for using SemEval 2020 Task 1 for evaluation, and one for the final project without a provided set of lemmas. Both projects require two textfiles, containing the two corpora which are to be compared, as well as the location for the final output, though the form of this output may differ based on which way the project is being run. To retrieve the social media data, there are two scripts in the get_data.py file. Calling the scrape script will produce a csv file with many millions of tweets in it. The partition script will take 500,000 of these tweets and produce two textfiles, one for the reference and one for the focus corpus. Running the project as part of SemEval requires an additional parameter, the location of a textfile containing the target lemmas. Run the project with

python main.py corpus1 corpus2 targets output --semeval_mode=True

when using for evaluation on SemEval, and with

python main.py corpus1 corpus2 output

otherwise. Running the project in SemEval mode produces a file with the the binary classification of word sense change and a file with the Jensen-Shannon distance for all words in the file containing target words. Running the project for the final project produces a file with the k words identified as having changed the most, where k is a model hyperparameter with a default of 25.

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
output		output
README.md		README.md
analogy.py		analogy.py
corpus.py		corpus.py
focus.txt		focus.txt
get_data.py		get_data.py
hdp.py		hdp.py
main.py		main.py
reference.txt		reference.txt
requirements.txt		requirements.txt
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

output

output

README.md

README.md

analogy.py

analogy.py

corpus.py

corpus.py

focus.txt

focus.txt

get_data.py

get_data.py

hdp.py

hdp.py

main.py

main.py

reference.txt

reference.txt

requirements.txt

requirements.txt

utils.py

utils.py

Repository files navigation

SemEval-2020 Task 1: Automatic Identification of Novel Word Senses and its Application to Social Media

About the Project

Requirements

Running the Project

About

Releases

Packages

Languages

elerisarsfield/semeval

Folders and files

Latest commit

History

Repository files navigation

SemEval-2020 Task 1: Automatic Identification of Novel Word Senses and its Application to Social Media

About the Project

Requirements

Running the Project

About

Resources

Stars

Watchers

Forks

Languages