# Question 1 
Fetch and return the content of an article from the specified URL. Additionally, save the content to a file.

In [1]:
#!pip install bs4
import requests
from bs4 import BeautifulSoup
def get_article_content(url='https://www.ynet.co.il/laisha/article/b1al4tzyp'):
    # Given a URL, return the article content as a string and save it to a file.
    # Returns None if the content could not be found.
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')

    # Fetch the text via the class name (might need to change it for other websites).
    # This class name is applicable for some articles on ynet.co.il
    class_name = 'public-DraftEditor-content'
    article_content = soup.find('div', {'class': class_name})

    if article_content is None:
        print("Article content wasn't fetched")
        return None

    article_text = article_content.get_text()
    with open('article_content.txt', 'w', encoding='utf-8') as file:
        file.write(article_text)
    print("The content was saved to a file")
    return article_text

# Question 2
## Idea
I approached the problem in the following way:
1. Use a regex to identify the words that represent a number in Hebrew within a text, using a list of indicative words
2. Translate those words into English
3. Convert the English words into integers
4. Replace the Hebrew words in the original text with their corresponding integers
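
The four steps can be sketched end to end on a toy sentence. This is only an illustration under stated assumptions: the lookup table `TOY_VALUES` and the helpers `toy_hebrew_to_int` and `toy_replace_numbers` are hypothetical, and the table stands in for the translation and word2number steps so that the example runs offline. It handles only simple additive phrases, not multipliers such as thousands.

```python
import re

# Hypothetical lookup table standing in for steps 2-3 (translation + word2number).
# It covers only the words used in the toy sentence below.
TOY_VALUES = {"מאה": 100, "חמישים": 50, "שלוש": 3}


def toy_hebrew_to_int(phrase):
    """Sum the values of the words in a phrase, ignoring a leading 'ו' ("and")."""
    total = 0
    for word in phrase.split():
        if word not in TOY_VALUES and word.startswith("ו"):
            word = word[1:]
        total += TOY_VALUES[word]
    return total


def toy_replace_numbers(text):
    # Step 1: find runs of number words, where later words may start with 'ו'.
    alternatives = "|".join(sorted(TOY_VALUES, key=len, reverse=True))
    pattern = rf"\b(?:{alternatives})(?:\sו?(?:{alternatives}))*\b"
    # Steps 2-4: convert each run to an integer and substitute it into the text.
    return re.sub(pattern, lambda m: str(toy_hebrew_to_int(m.group(0))), text)


result = toy_replace_numbers("מאה חמישים ושלוש סתם בדיקה")
print(result)
```

Here the three number words are found as a single run, converted to 153, and written back in place of the Hebrew phrase.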

In [32]:
heb_numbers_keywords = ['אחת', 'שתיים', 'שלוש', 'ארבע', 'חמש', 'שש', 'שבע', 'שמונה', 'תשע', 'עשר', 'עשרה', 'עשרים',
                        'שלושים', 'ארבעים', 'חמישים', 'שישים', 'שבעים', 'שמונים', 'תשעים', 'מאה', 'מאתיים', 'שלוש מאות',
                        'ארבע מאות', 'חמש מאות', 'שש מאות', 'שבע מאות', 'שמונה מאות', 'תשע מאות', 'אלף', 'אלפיים',
                        'שלושת אלפים', 'ארבעת אלפים', 'חמשת אלפים', 'ששת אלפים', 'שבעת אלפים', 'שמונת אלפים',
                        'תשעת אלפים', 'עשרת אלפים', 'מיליון', 'אחת', 'אחד',  'שתי', 'שתיים', 'שניים', 'שלוש',
                        'שלושה', 'ארבע', 'ארבעה', 'חמש', 'חמישה', 'שש', 'שישה', 'שבע', 'שבעה', 'שמונה', 'תשע', 'תשעה',
                        'עשר', 'עשרה']

In [33]:
#!pip install word2number
from word2number import w2n
import re

## Implementation
The function find_sequences receives two arguments: the text to search in and the list of words to search for.
The regex was designed with Hebrew linguistics in mind: it captures either a single word that represents a number or a sequence of words that together form a number, e.g. "מאה חמישים ושלוש". The first word may carry a one-letter prefix, and every following word may be joined with the conjunction "ו" ("and").



In [34]:
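def find_sequences(sentence, words):
    # Regex alternation is ordered, so try longer keywords first; otherwise
    # 'שלוש' would match before 'שלוש מאות'. dict.fromkeys drops duplicates.
    ordered = [re.escape(w) for w in sorted(dict.fromkeys(words), key=len, reverse=True)]
    pattern = (
        r"\b(?:[א-ת]?"       # first word, with an optional one-letter prefix
        + "|[א-ת]?".join(ordered)
        + r")(?:\s(?:ו?"     # following words, each optionally joined by 'ו' ("and")
        + "|ו?".join(ordered)
        + r"))*\b"
    )
    return re.findall(pattern, sentence)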


In [35]:
# Test the find_sequences function
s1 = "מאה חמישים ושלוש סתם בדיקה"
s2 = " בדיקה מאתיים ארבעים ושתיים"
s3 = "שבעים ושבע כגדכג"
res = []
for s in [s1, s2, s3]:
    res += find_sequences(s, heb_numbers_keywords)
res

['מאה חמישים ושלוש', 'מאתיים ארבעים ושתיים', 'שבעים ושבע']
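
One property worth checking in a pattern built from a keyword list is the order of the alternatives. Python's `re` module tries alternatives from left to right and keeps the first one that lets the whole pattern match. This matters for lists that contain both a word and a phrase beginning with it, such as 'שלוש' and 'שלוש מאות' in heb_numbers_keywords. A minimal sketch with a hypothetical two-item list:

```python
import re

text = "שלוש מאות"  # "three hundred"

# Short keyword listed first: the match stops after the first word.
short_first = re.findall(r"\b(?:שלוש|שלוש מאות)\b", text)

# Longest keyword listed first: the whole phrase is matched.
words = ["שלוש", "שלוש מאות"]
ordered = sorted(words, key=len, reverse=True)
long_first = re.findall(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b", text)

print(short_first)
print(long_first)
```

Sorting the keywords by length in descending order before joining them is one simple way to make the longer phrase win.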

The translation function from Hebrew to English:

In [36]:
#!pip install translate
from translate import Translator

def translate_hebrew_to_english(text):
    translator = Translator(to_lang="en", from_lang="he")
    translation = translator.translate(text)
    return translation

Now we want to retrieve a list of words to replace in the text.
* This function
    * identifies Hebrew numbers in the text using the find_sequences function
    * translates them to English using the translate_hebrew_to_english function
    * converts the translated words into integers using the word_to_num function from the word2number library.

* The information is aggregated into a list of tuples of the form (English number, Hebrew number, the number as an integer), sorted so that the longest Hebrew phrases come first.



In [37]:
def get_words_to_replace(text):
    words_to_replace = find_sequences(text, heb_numbers_keywords)
    eng_version = [translate_hebrew_to_english(w) for w in words_to_replace]
    nums = []
    for w in eng_version:
        try:
            nums.append(w2n.word_to_num(w))
        except ValueError:
            # The phrase wasn't translated into a parsable English number
            nums.append(None)

    eng_heb_num = list(zip(eng_version, words_to_replace, nums))
    # Longest Hebrew phrases first, so a short word is never replaced inside a longer phrase
    sorted_words = sorted(eng_heb_num, key=lambda x: len(x[1]), reverse=True)

    # If the translation is empty or could not be converted to an integer, something went
    # wrong with the translation. Therefore, don't change that word.
    filtered_list = [(eng, heb, num) for eng, heb, num in sorted_words
                     if eng != '' and num is not None]
    print(filtered_list)
    return filtered_list
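
The list is sorted by the length of the Hebrew phrase because the replacements are later applied one after another with `str.replace`. The sketch below, using a hypothetical sentence and a hypothetical helper `apply_in_order`, shows what can go wrong when a short number word is replaced before a longer phrase that contains it:

```python
text = "מאה חמישים ושלוש ועוד שלוש"  # "one hundred fifty-three and three more"
replacements = [("שלוש", "3"), ("מאה חמישים ושלוש", "153")]


def apply_in_order(text, pairs):
    """Apply (old, new) replacements one after another, in the given order."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


by_length = lambda pair: len(pair[0])
shortest_first = apply_in_order(text, sorted(replacements, key=by_length))
longest_first = apply_in_order(text, sorted(replacements, key=by_length, reverse=True))

print(shortest_first)  # the long phrase is broken up and never converted
print(longest_first)   # both numbers are converted as intended
```

With the shortest phrase first, the word 'שלוש' inside the longer phrase is replaced on its own, so the full phrase no longer exists when its turn comes.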

Now we compose these steps into one function:

In [38]:
def question2(text):
    values_to_replace = get_words_to_replace(text)
    for _, old_text_heb, new_text_num in values_to_replace:
        text = text.replace(old_text_heb, str(new_text_num))
    # Write with an explicit encoding, since the text is Hebrew
    with open('modified_content.txt', 'w', encoding='utf-8') as file:
        file.write(text)
    return text

In [39]:
def example2():
    get_article_content()
    with open('article_content.txt', 'r', encoding='utf-8') as file:
        file_content = file.read()
    # question2 also writes the result to modified_content.txt
    question2(file_content)
    print('The modified file has been created')
# example2()

# Question 3


To solve this question, I assume that a term being counted will most likely appear right after the number that quantifies it.

* This function takes a text input and searches for patterns where a number is followed by a space and another word. 
* The regular expression used in the function also allows for an optional prefix in Hebrew characters before the number. 
* The function returns a list of tuples, where each tuple contains the extracted number and the following word.

In [40]:
def find_num_word_tuple(text):
    pattern = r'\b[א-ת]*(\d+)\s*([a-zA-Zא-ת]+)\b'
    matches = re.findall(pattern, text)
    results = []
    for match in matches:
        number, word = match
        results.append((number,word))
        print(f"Number: {number}, Word: {word}")
    return results
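
As a quick illustration of what the pattern captures, the sketch below applies the same regular expression to a short hypothetical sentence. The second pair comes from a date ("on October 7"), not from a counted object, which is why a further filtering step is useful.

```python
import re

pattern = r'\b[א-ת]*(\d+)\s*([a-zA-Zא-ת]+)\b'

# Hypothetical sample: "we waited 3 days on October 7"
sample = "חיכינו 3 ימים ב7 באוקטובר"
pairs = re.findall(pattern, sample)

for number, word in pairs:
    print(number, word)
```

The one-letter prefix attached to the second number is consumed by the optional Hebrew-letter part of the pattern and is not included in the captured number.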

The following function determines whether a Hebrew word is a noun by translating it to English and analyzing its part of speech.

This is important because if a word appears after a number and the number is counting it, the word should be a noun.

In [43]:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
def is_noun(heb_word):
    translated = translate_hebrew_to_english(heb_word)
    words = word_tokenize(translated)
    tagged_words = pos_tag(words)
    # Check if the first word is tagged as a common noun (NN or NNS)
    return bool(tagged_words) and tagged_words[0][1] in ['NN', 'NNS']


Finally, we combine the functions in order to answer question 3.
The function below extracts the numeric values along with their associated Hebrew words and displays them in a DataFrame.

In [44]:
import pandas as pd

def question3(text):
    # The given text has already been processed by question2, so we can look for numeric values
    results = find_num_word_tuple(text)
    # Keep only the pairs whose word is a noun
    results = [(number, word) for number, word in results if is_noun(word)]
    data = {"Number": [int(number) for number, _ in results],
            "Word": [word for _, word in results]}
    df = pd.DataFrame(data)
    print(df)
    return df

with open('modified_content.txt', 'r', encoding='utf-8') as modified_file:
    content = modified_file.read()
question3(content)

Number: 3, Word: ימים
Number: 7, Word: באוקטובר
Number: 2, Word: שאר
Number: 18, Word: שנים
Number: 26, Word: באוקטובר
Number: 5, Word: דקות
Number: 2, Word: אמניות
Number: 2, Word: שצריך
Number: 7, Word: באוקטובר
Number: 3, Word: סיפורים
Number: 14, Word: מחזיקה
Number: 84, Word: מנחל
Number: 25, Word: שנה
Number: 2, Word: בנות
Number: 20, Word: ו
Number: 12, Word: ברמת
Number: 1, Word: היה
Number: 2, Word: את
Number: 1, Word: הייתי
Number: 3, Word: צפייה
Number: 2, Word: שאני
Number: 7, Word: שנים
Number: 2, Word: את
Number: 2, Word: שכל
Number: 22, Word: הקמתי
Number: 27, Word: עזבה
Number: 2, Word: קלטות
Number: 2, Word: שמתייחסים
Number: 2, Word: אותו
Number: 15, Word: שנה
Number: 70, Word: לאבא
Number: 10, Word: שנים
Number: 3, Word: צפייה
Number: 2015, Word: יחד
Number: 2, Word: הצגות



# Question 4
In order to sort all the articles by their level of "closeness" to a single chosen article, we first have to define a metric.
I would approach the problem using the following pipeline:
## Preprocessing
Parse the text and modify it:
1. Convert words to their base form (lemmatization)
2. Remove punctuation
3. In English, convert all words to lower case

## Vectorize each document
Compute the TF-IDF vector for each document. TF-IDF weights each term by its frequency
in the document relative to how common the term is across the entire corpus.

Now we have a vector for each article; each element of the vector is a numerical value for
the importance of one term in that article.

## Calculate the pairwise distance
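After each document is represented by a vector, calculate the cosine similarity between the chosen article's vector and the vector of every other article.

Cosine similarity measures the angle between two vectors.
We interpret each value of cosine similarity:
+ 1  indicates that the vectors point in the same direction (high similarity)
+ 0  indicates no similarity (the documents share no terms)
+ -1 indicates opposite directions; this cannot occur here, because TF-IDF weights are non-negative, so the values fall between 0 and 1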
## Result
After that, sort the documents by their cosine similarity value in descending order.
The result is a sorted list of articles in which the first document is the most similar to the chosen article.
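
The pipeline above can be sketched with the standard library alone. This is a minimal illustration, not a tuned implementation: it assumes the documents are already preprocessed into lists of tokens, it uses a smoothed IDF (one of several common variants), and the names `tfidf_vectors`, `cosine` and `rank_by_similarity` are hypothetical. In practice a library such as scikit-learn would normally be used for these steps.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (each doc is a list of tokens)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = max(len(doc), 1)  # avoid division by zero for empty documents
        vectors.append({
            term: (count / length) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(docs, target_index):
    """Return (index, similarity) pairs for all other docs, most similar first."""
    vectors = tfidf_vectors(docs)
    target = vectors[target_index]
    scores = [(i, cosine(target, vec))
              for i, vec in enumerate(vectors) if i != target_index]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy corpus, already lower-cased and tokenized
docs = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock markets fell sharply today".split(),
]
ranking = rank_by_similarity(docs, target_index=0)
for index, score in ranking:
    print(index, round(score, 3))
```

In this toy corpus the second document shares several terms with the first and is ranked first, while the third shares none and receives a similarity of 0.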

# Question 5

The function's input will be the URLs of the two articles and their corresponding tables (the output of question 3).

First, we create a table that contains only the rows that appear in both tables, meaning the rows where the terms being counted are identical.

In order to determine the likelihood that two counted objects are identical, we will compute several values:
* The similarity of the two articles - the more similar the articles are, the more likely it is that objects with the same name are identical.
* The similarity of the paragraphs in which the term is mentioned - applying the method from question 4 to the two paragraphs gives their similarity value. The more similar they are, the more likely it is that they refer to the same object.
* Whether the quantity assigned to the term is equal in both articles - equal quantities are another reason to link the two terms.


If we aggregate these values, we can estimate the probability that the two terms refer to the same object.
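
One possible way to aggregate the signals is a weighted sum. The sketch below is only an assumption about how this could look: the tables are represented as plain lists of (number, word) tuples, the helper names `shared_terms` and `same_object_score` are hypothetical, and the weights are arbitrary placeholders. The result is a heuristic score between 0 and 1, not a calibrated probability.

```python
def shared_terms(table_a, table_b):
    """Rows whose counted term appears in both tables, as (term, count_a, count_b)."""
    # If a term appears several times in table_b, the last count is kept.
    counts_b = {word: number for number, word in table_b}
    return [(word, number, counts_b[word])
            for number, word in table_a if word in counts_b]


def same_object_score(article_sim, paragraph_sim, count_a, count_b,
                      weights=(0.3, 0.5, 0.2)):
    """Combine the three signals into one score in [0, 1] (weights are placeholders)."""
    w_article, w_paragraph, w_count = weights
    count_match = 1.0 if count_a == count_b else 0.0
    return w_article * article_sim + w_paragraph * paragraph_sim + w_count * count_match


# Hypothetical tables in the (number, word) layout of question 3
table_a = [(3, "ימים"), (18, "שנים")]
table_b = [(3, "ימים"), (5, "דקות")]

common = shared_terms(table_a, table_b)
term, count_a, count_b = common[0]
# The similarity values would come from the method of question 4; these are made up.
score = same_object_score(article_sim=0.8, paragraph_sim=0.6,
                          count_a=count_a, count_b=count_b)
print(term, round(score, 2))
```

The weights would need to be chosen, or learned from labeled pairs, before the score could be interpreted with any confidence.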