project-symmetry

About

Software designed for comparing Wikipedia articles in different languages in order to determine what information is missing from one article but present in another. The goal is for everyone to have the same access to information, no matter what language they speak. This is one small step toward eliminating the digital divide.

The intended use of this software is comparing Wikipedia articles; however, in its current state, users must copy and paste text into text boxes. Because of this, it can be used to compare any two sets of text and view their similarities.

For more information, visit: https://www.grey-box.ca

Requirements

Python >= 3.7, < 3.11:
https://www.python.org/downloads/
Git:
https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

Installation

# Open your operating system's command-line interface
# Navigate to the folder where you want to clone the project
# Clone repo 
  git clone https://github.com/grey-box/Project-Symmetry-Semantic-comparison-Alpha.git
# change directory to Project-Symmetry-Semantic-comparison-Alpha
  cd Project-Symmetry-Semantic-comparison-Alpha
# Install required libraries
  python -m pip install -r requirements.txt
# Install the NLTK Punkt tokenizer
  python -c "import nltk; nltk.download('punkt')"
# Create a DeepL config file (a quick key check follows this block)
  1- Inside the *dev* folder, create a file called *deeplconfig.py*
  2- In the file, add the following line of code: deepl_api_key = "your_api_key"
  3- Replace "your_api_key" with your personal DeepL API key
  4- Save the file
# Run program
  python main.py
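
Optionally, you can confirm the key is picked up correctly with the official deepl Python package before launching the app. This is a minimal sketch: the dev.deeplconfig import path is an assumption based on the folder layout above and may need adjusting.

  import deepl
  from dev import deeplconfig  # assumed import path; adjust to match the project layout

  # A successful usage query confirms the API key is valid
  translator = deepl.Translator(deeplconfig.deepl_api_key)
  usage = translator.get_usage()
  print(f"Characters used: {usage.character.count} / {usage.character.limit}")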

Getting Started

Text Entry

[Screenshot: the two text-entry boxes]

Paste the articles you wish to compare into these text boxes. Please note that, at this time, both texts must be in the same language, so use a translation tool such as DeepL to translate one of them first.
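
For example, the same DeepL key set up during installation can translate one article into the other's language before pasting. A minimal sketch, assuming the deepl package and the dev.deeplconfig module described above; the sample text and target language are placeholders:

  import deepl
  from dev import deeplconfig  # assumed import path; see the installation step above

  translator = deepl.Translator(deeplconfig.deepl_api_key)

  french_article = "Barack Obama est un homme d'État américain..."  # placeholder excerpt
  result = translator.translate_text(french_article, target_lang="EN-US")
  print(result.text)  # paste this English translation into the second text box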

Language Selection

We currently offer support for 45 different languages. Select one from the drop-down, then hit "Select" to change the on-screen display.

[Screenshot: language selection drop-down]

Comparison

First, select a comparison tool (currently, only two are supported):

[Screenshot: comparison tool drop-down]

Then select a similarity percentage:

[Screenshot: similarity percentage selector]

The program will search for sentence pairs with a similarity score >= this number. (Note: the program is unlikely to return results if you select a high percentage, due to the nature of the comparison tools. A threshold of ~10% for BLEU Score and ~30% for Sentence-BERT has returned the best results, though feel free to test different values.) Click "Select" to apply the Comparison Tool and Similarity Percentage.
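
For reference, both metrics can be reproduced outside the app with nltk and sentence-transformers. This is a minimal sketch, not the app's exact scoring: the example sentences and the all-MiniLM-L6-v2 model are illustrative choices.

  from nltk.tokenize import word_tokenize
  from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
  from sentence_transformers import SentenceTransformer, util

  sent_a = "Barack Obama was the 44th president of the United States."
  sent_b = "Obama served as the 44th US president."

  # BLEU: n-gram overlap between the two token lists (0..1)
  bleu = sentence_bleu([word_tokenize(sent_a)], word_tokenize(sent_b),
                       smoothing_function=SmoothingFunction().method1)

  # Sentence-BERT: cosine similarity of sentence embeddings (-1..1)
  model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
  emb_a, emb_b = model.encode([sent_a, sent_b])
  sbert = util.cos_sim(emb_a, emb_b).item()

  print(f"BLEU: {bleu:.2f}, Sentence-BERT: {sbert:.2f}")
  # A sentence pair counts as a match when its score >= the selected percentage / 100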

Finally, click "Compare", and the program should highlight sentences in both articles that are similar to each other. Matching colors denote matching sentences.

Example: English vs. French article on Barack Obama:

[Screenshot: highlighted matching sentences in the English and French articles on Barack Obama]

Note: Some highlighted sections may not be very similar at all; please see the disclaimer below. We will try to improve results so that less human review is needed.

Testing

Testing for comparison speed will be done as follows:

  • Clean up the formatting of the articles; if this step is skipped, the comparison will take much longer than needed (e.g. ~10 minutes instead of ~1.5 minutes)
    • Paste without formatting into a word document
    • Remove the infobox
      • It is likely the infoboxes contain roughly the same information, and since the text is unformatted, it will be hard to tell where this information is coming from
      • Mostly done to cut down on comparison time
    • Preferably remove image captions
    • Remove references
  • Keep a minimal number of apps open so more RAM is available for the comparison (e.g. tests were run with only VSCode, the comparison app, and Excel open).
  • The number of iterations is based on the number of sentences in each article (O(m * n)); see the timing sketch after this list
    • NLTK splits articles into sentences using sent_tokenize; the sentence count is simply taken with len()
  • Estimates are made based on previous results
    • Comparison speed is measured using the time library
    • Time per comparison = (total time) / (# iterations)
    • Initial estimate of 0.0005 seconds per iteration
  • Articles of varying length are used
    • Barack Obama (random selection)
    • Elvis Presley (one of the longest articles according to this site)
    • Boris Johnson (one of the largest articles according to Wikipedia)
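
The measurement above can be reproduced with a rough timing harness like the following. A minimal sketch: compare_sentences is a hypothetical stand-in for whichever comparison tool (BLEU or Sentence-BERT) is selected.

  import time
  from nltk.tokenize import sent_tokenize

  def estimate_comparison_time(article_a, article_b, compare_sentences):
      """Time an m x n sentence comparison and report seconds per iteration."""
      sents_a = sent_tokenize(article_a)
      sents_b = sent_tokenize(article_b)
      iterations = len(sents_a) * len(sents_b)  # O(m * n)

      start = time.perf_counter()
      for a in sents_a:
          for b in sents_b:
              compare_sentences(a, b)  # hypothetical scoring function
      total = time.perf_counter() - start

      print(f"{iterations} iterations in {total:.1f}s "
            f"({total / iterations:.4f}s per comparison)")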

If you do not wish to wait roughly 1-2 minutes to compare entire articles, it is recommended that you compare one section at a time. Doing so returns results much faster, though more testing is needed to determine how much time a section-by-section comparison actually saves.

Disclaimer

This project utilizes several NLP libraries to compare text. It is important to note that the results may not always be accurate. Most of these libraries do not take sentence structure and grammar into consideration, so the user is advised to double-check that highlighted sections are actually close to each other. The best translations and comparisons will always be made by a real person; however, having someone do this manually would be extremely time-consuming, which is one of the problems this project aims to solve.
