project-symmetry

About

Software designed for comparing Wikipedia articles in different languages in order to determine what information is missing from one article but present in another. The goal is for everyone to have the same access to information, no matter what language they speak. This is one small step toward eliminating the digital divide.

The intended use of this software is comparing Wikipedia articles; however, in its current state, users must copy and paste text into text boxes. Because of this, it can be used to compare any two sets of text and view their similarities.

For more information, visit: https://www.grey-box.ca

Requirements

Python >= 3.7, < 3.11:
https://www.python.org/downloads/
Git:
https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

Installation

# Open your operating system's command-line interface
# Navigate to the folder where you want to clone the project
# Clone repo 
  git clone https://github.com/grey-box/Project-Symmetry-Semantic-comparison-Alpha.git
# change directory to Project-Symmetry-Semantic-comparison-Alpha
  cd Project-Symmetry-Semantic-comparison-Alpha
# Install required libraries
  python -m pip install -r requirements.txt
# Install the NLTK Punkt tokenizer
  python -c "import nltk; nltk.download('punkt')"
# Create a DeepL config file (a quick key check follows this block)
  1- Inside the *dev* folder, create a file called *deeplconfig.py*
  2- In the file, add the following line of code: deepl_api_key = "your_api_key"
  3- Replace "your_api_key" with your personal DeepL API key
  4- Save the file
# Run program
  python main.py
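
Optionally, you can confirm the key is picked up correctly with the official deepl Python package before launching the app. This is a minimal sketch: the dev.deeplconfig import path is an assumption based on the folder layout above and may need adjusting.

  import deepl
  from dev import deeplconfig  # assumed import path; adjust to match the project layout

  # A successful usage query confirms the API key is valid
  translator = deepl.Translator(deeplconfig.deepl_api_key)
  usage = translator.get_usage()
  print(f"Characters used: {usage.character.count} / {usage.character.limit}")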

Getting Started

Text Entry

[Screenshot: the two text-entry boxes]

Paste the articles you wish to compare into these text boxes. Please note that, at this time, both texts must be in the same language, so use a translation tool such as DeepL to translate one of them first.
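
For example, the same DeepL key set up during installation can translate one article into the other's language before pasting. A minimal sketch, assuming the deepl package and the dev.deeplconfig module described above; the sample text and target language are placeholders:

  import deepl
  from dev import deeplconfig  # assumed import path; see the installation step above

  translator = deepl.Translator(deeplconfig.deepl_api_key)

  french_article = "Barack Obama est un homme d'État américain..."  # placeholder excerpt
  result = translator.translate_text(french_article, target_lang="EN-US")
  print(result.text)  # paste this English translation into the second text box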

Language Selection

We currently offer support for 45 different languages. Select one from the drop-down, then hit "Select" to change the on-screen display.

[Screenshot: language selection drop-down]

Comparison

First, select a comparison tool (currently, only two are supported):

[Screenshot: comparison tool drop-down]

Then select a similarity percentage:

[Screenshot: similarity percentage selector]

The program will search for sentence pairs with a similarity score >= this number. (Note: the program is unlikely to return results if you select a high percentage, due to the nature of the comparison tools. A threshold of ~10% for BLEU Score and ~30% for Sentence-BERT has returned the best results, though feel free to test different values.) Click "Select" to apply the Comparison Tool and Similarity Percentage.
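
For reference, both metrics can be reproduced outside the app with nltk and sentence-transformers. This is a minimal sketch, not the app's exact scoring: the example sentences and the all-MiniLM-L6-v2 model are illustrative choices.

  from nltk.tokenize import word_tokenize
  from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
  from sentence_transformers import SentenceTransformer, util

  sent_a = "Barack Obama was the 44th president of the United States."
  sent_b = "Obama served as the 44th US president."

  # BLEU: n-gram overlap between the two token lists (0..1)
  bleu = sentence_bleu([word_tokenize(sent_a)], word_tokenize(sent_b),
                       smoothing_function=SmoothingFunction().method1)

  # Sentence-BERT: cosine similarity of sentence embeddings (-1..1)
  model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
  emb_a, emb_b = model.encode([sent_a, sent_b])
  sbert = util.cos_sim(emb_a, emb_b).item()

  print(f"BLEU: {bleu:.2f}, Sentence-BERT: {sbert:.2f}")
  # A sentence pair counts as a match when its score >= the selected percentage / 100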

Finally, click "Compare", and the program should highlight sentences in both articles that are similar to each other. Matching colors denote matching sentences.

Example: English vs. French article on Barack Obama:

[Screenshot: highlighted matching sentences in the English and French articles on Barack Obama]

Note: Some highlighted sections may not be very similar at all; please see the disclaimer below. We will try to improve results so that less human review is needed.

Testing

Testing for comparison speed will be done as follows:

  • Clean up the formatting of the articles; if this step is skipped, the comparison will take much longer than needed (e.g. ~10 minutes instead of ~1.5 minutes)
    • Paste without formatting into a word document
    • Remove the infobox
      • It is likely the infoboxes contain roughly the same information, and since the text is unformatted, it will be hard to tell where this information is coming from
      • Mostly done to cut down on comparison time
    • Preferably remove image captions
    • Remove references
  • Keep a minimal number of apps open so more RAM is available for the comparison (e.g. tests were run with only VSCode, the comparison app, and Excel open).
  • The number of iterations is based on the number of sentences in each article (O(m * n)); see the timing sketch after this list
    • NLTK splits articles into sentences using sent_tokenize; the sentence count is simply taken with len()
  • Estimates are made based on previous results
    • Comparison speed is measured using the time library
    • Time per comparison = (total time) / (# iterations)
    • Initial estimate of 0.0005 seconds per iteration
  • Articles of varying length are used
    • Barack Obama (random selection)
    • Elvis Presley (one of the longest articles according to this site)
    • Boris Johnson (one of the largest articles according to Wikipedia)
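
The measurement above can be reproduced with a rough timing harness like the following. A minimal sketch: compare_sentences is a hypothetical stand-in for whichever comparison tool (BLEU or Sentence-BERT) is selected.

  import time
  from nltk.tokenize import sent_tokenize

  def estimate_comparison_time(article_a, article_b, compare_sentences):
      """Time an m x n sentence comparison and report seconds per iteration."""
      sents_a = sent_tokenize(article_a)
      sents_b = sent_tokenize(article_b)
      iterations = len(sents_a) * len(sents_b)  # O(m * n)

      start = time.perf_counter()
      for a in sents_a:
          for b in sents_b:
              compare_sentences(a, b)  # hypothetical scoring function
      total = time.perf_counter() - start

      print(f"{iterations} iterations in {total:.1f}s "
            f"({total / iterations:.4f}s per comparison)")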

If you do not wish to wait roughly 1-2 minutes to compare entire articles, it is recommended that you compare one section at a time. Doing so returns results much faster, though more testing is needed to determine how much time a section-by-section comparison actually saves.

Disclaimer

This project utilizes several NLP libraries to compare text. It is important to note that the results may not always be accurate. Most of these libraries do not take sentence structure and grammar into consideration, so the user is advised to double-check that highlighted sections are actually close to each other. The best translations and comparisons will always be made by a real person; however, having someone do this manually would be extremely time-consuming, which is one of the problems this project aims to solve.
