<a href="https://colab.research.google.com/github/arina080803/CS_trial_task/blob/main/Startseva_ML_Trial_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Text Improvement Engine

Develop a tool that analyses a given text and suggests improvements based on the similarity to a list of "standardised" phrases. These standardised phrases represent the ideal way certain concepts should be articulated, and the tool should recommend changes to align the input text closer to these standards.

To compare texts, I use the spaCy library, which provides pre-trained word vectors for calculating semantic similarity. To analyze the text and replace phrases, I use the en_core_web_md language model for English.

**en_core_web_md** is a medium-sized English model trained on written web text (blogs, news, comments). It includes a tagger, a dependency parser, a lemmatizer, a named entity recognizer and a word vector table with 20k unique vectors.

In [None]:
!pip install -U spacy
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     42.8/42.8 MB 13.3 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [53]:
import spacy
import pandas as pd

from sklearn.metrics.pairwise import cosine_similarity

In [54]:
nlp = spacy.load("en_core_web_md")

with open('/content/sample_text.txt', 'r') as f:
    text = f.read()

print(text)

standardised_terms = pd.read_csv('/content/Standardised terms.csv')
standard_phrases = standardised_terms['Optimal performance'].tolist()
standardised_terms

In today's meeting, we discussed a variety of issues affecting our department. The weather was unusually sunny, a pleasant backdrop to our serious discussions. We came to the consensus that we need to do better in terms of performance. Sally brought doughnuts, which lightened the mood. It's important to make good use of what we have at our disposal. During the coffee break, we talked about the upcoming company picnic. We should aim to be more efficient and look for ways to be more creative in our daily tasks. Growth is essential for our future, but equally important is building strong relationships with our team members. As a reminder, the annual staff survey is due next Friday. Lastly, we agreed that we must take time to look over our plans carefully and consider all angles before moving forward. On a side note, David mentioned that his cat is recovering well from surgery.


| Unnamed: 0 | Optimal performance |
|---|---|
| 0 | Utilise resources |
| 1 | Enhance productivity |
| 2 | Conduct an analysis |
| 3 | Maintain a high standard |
| 4 | Implement best practices |
| 5 | Ensure compliance |
| 6 | Streamline operations |
| 7 | Foster innovation |
| 8 | Drive growth |
| 9 | Leverage synergies |


The function iterates over every token in the doc object. Stop words and punctuation are copied to the output unchanged; every other token is compared with each phrase in the standard_phrases list in a nested loop.

The similarity between the current token and a phrase is calculated using the similarity() method, which compares their vector representations. Only the phrase with the highest similarity is kept as a candidate.

If that highest similarity is greater than the specified threshold (0.65 in this case) and the token is not identical to the phrase, the token is considered suitable for replacement. In that case, a tuple containing the text of the current token, the replacement phrase and the similarity value is added to the replacements list.

In [70]:
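def recommend_replacements(text, threshold=0.65):
    doc = nlp(text)
    # Parse each standard phrase once instead of once per token.
    phrase_docs = [(phrase, nlp(phrase)) for phrase in standard_phrases]
    replacements = []
    new_text = []
    for token in doc:
        # Stop words and punctuation are never replaced.
        if token.is_stop or token.is_punct:
            new_text.append(token.text)
            continue
        # Find the standard phrase most similar to this token.
        max_similarity = 0
        best_phrase = ""
        for phrase, phrase_doc in phrase_docs:
            similarity = token.similarity(phrase_doc)
            if similarity > max_similarity:
                max_similarity = similarity
                best_phrase = phrase
        # Suggest the phrase only if it is similar enough and not identical.
        if max_similarity > threshold and token.text.lower() != best_phrase.lower():
            replacements.append((token.text, best_phrase, max_similarity))
            new_text.append(best_phrase)
        else:
            new_text.append(token.text)
    return replacements, " ".join(new_text)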


As the similarity threshold decreases, the function outputs more substitutions, but they become less logical. With a lower threshold, weaker matches are accepted, so there is less assurance that the replacement carries the correct meaning.

In [71]:
text = "In today's meeting, we discussed a variety of issues affecting our department. The weather was unusually sunny, a pleasant backdrop to our serious discussions. We came to the consensus that we need to do better in terms of performance. Sally brought doughnuts, which lightened the mood. It's important to make good use of what we have at our disposal. During the coffee break, we talked about the upcoming company picnic. We should aim to be more efficient and look for ways to be more creative in our daily tasks. Growth is essential for our future, but equally important is building strong relationships with our team members. As a reminder, the annual staff survey is due next Friday. Lastly, we agreed that we must take time to look over our plans carefully and consider all angles before moving forward. On a side note, David mentioned that his cat is recovering well from surgery."
replacements, new_text = recommend_replacements(text)
for original, replacement, similarity in replacements:
    print(f"Replace '{original}' for '{replacement}' (Similarity: {similarity})")

Replace 'discussions' for 'Demonstrate leadership' (Similarity: 0.6848907759870494)
Replace 'performance' for 'Monitor performance metrics' (Similarity: 0.8226165365470574)
Replace 'important' for 'Implement best practices' (Similarity: 0.6628278234525389)
Replace 'use' for 'Utilise resources' (Similarity: 0.693050941497876)
Replace 'efficient' for 'Utilise resources' (Similarity: 0.6761136459875585)
Replace 'tasks' for 'Prioritise tasks' (Similarity: 0.9417426949185265)
Replace 'Growth' for 'Drive growth' (Similarity: 0.6779135022454443)
Replace 'essential' for 'Implement best practices' (Similarity: 0.6975695485315163)
Replace 'important' for 'Implement best practices' (Similarity: 0.6628278234525389)
Replace 'relationships' for 'Facilitate collaboration' (Similarity: 0.7003356398668303)
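
To see how strongly the threshold drives the result, the scores printed above can be re-filtered at stricter cut-offs. The sketch below is only an illustration: the `suggestions` list is copied from the output above (rounded to four decimals), and `filter_by_threshold` is a hypothetical helper, not part of spaCy. Thresholds below 0.65 cannot be explored this way, because those candidates were never printed.

```python
# Scores copied from the output above, rounded to four decimals.
suggestions = [
    ("discussions", "Demonstrate leadership", 0.6849),
    ("performance", "Monitor performance metrics", 0.8226),
    ("important", "Implement best practices", 0.6628),
    ("use", "Utilise resources", 0.6931),
    ("efficient", "Utilise resources", 0.6761),
    ("tasks", "Prioritise tasks", 0.9417),
    ("Growth", "Drive growth", 0.6779),
    ("essential", "Implement best practices", 0.6976),
    ("important", "Implement best practices", 0.6628),
    ("relationships", "Facilitate collaboration", 0.7003),
]


def filter_by_threshold(rows, threshold):
    """Keep only the suggestions whose similarity exceeds the threshold."""
    return [row for row in rows if row[2] > threshold]


for threshold in (0.65, 0.70, 0.80, 0.90):
    kept = filter_by_threshold(suggestions, threshold)
    print(f"threshold {threshold:.2f}: {len(kept)} suggestion(s)")
```

With these scores, only the 'performance' and 'tasks' suggestions survive a threshold of 0.80, which are also the two cases where the token literally appears in the standard phrase.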


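Next, let's compute the **cosine similarity** between the word embeddings explicitly, using cosine_similarity from scikit-learn on the vectors of the pre-trained model. By default, spaCy's similarity() method is itself a cosine similarity over the same vectors, so the results are expected to match up to rounding.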
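
As a reminder of what is being computed: the cosine similarity of two vectors is their dot product divided by the product of their lengths. Below is a minimal pure-Python sketch. The three-dimensional vectors are made-up toy values, not real spaCy embeddings, and `cosine` and `mean_vector` are hypothetical helper names. `mean_vector` imitates the way a phrase vector is obtained by averaging the vectors of its words, which is how spaCy builds `Doc.vector` by default.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Toy values only: one "token" vector and a two-word "phrase".
token_vec = [1.0, 0.0, 1.0]
phrase_word_vecs = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
phrase_vec = mean_vector(phrase_word_vecs)

print(cosine(token_vec, phrase_vec))
```

Because cosine similarity ignores vector length, the averaged phrase vector here points in the same direction as the token vector, so the score is (up to rounding) the maximum possible value of 1.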

In [72]:
def recommend_replacements(text, threshold=0.65):
    doc = nlp(text)
    # Compute each phrase embedding once instead of once per token.
    phrase_embeddings = [
        (phrase, nlp(phrase).vector.reshape(1, -1)) for phrase in standard_phrases
    ]
    replacements = []
    new_text_cos = []
    for token in doc:
        # Stop words and punctuation are never replaced.
        if token.is_stop or token.is_punct:
            new_text_cos.append(token.text)
            continue
        max_similarity = 0
        best_phrase = ""
        token_embedding = token.vector.reshape(1, -1)
        for phrase, phrase_embedding in phrase_embeddings:
            # cosine_similarity returns a 1x1 matrix for two single-row inputs.
            similarity = cosine_similarity(token_embedding, phrase_embedding)[0][0]
            if similarity > max_similarity:
                max_similarity = similarity
                best_phrase = phrase
        # Suggest the phrase only if it is similar enough and not identical.
        if max_similarity > threshold and token.text.lower() != best_phrase.lower():
            replacements.append((token.text, best_phrase, max_similarity))
            new_text_cos.append(best_phrase)
        else:
            new_text_cos.append(token.text)
    return replacements, " ".join(new_text_cos)


In [73]:
text = "In today's meeting, we discussed a variety of issues affecting our department. The weather was unusually sunny, a pleasant backdrop to our serious discussions. We came to the consensus that we need to do better in terms of performance. Sally brought doughnuts, which lightened the mood. It's important to make good use of what we have at our disposal. During the coffee break, we talked about the upcoming company picnic. We should aim to be more efficient and look for ways to be more creative in our daily tasks. Growth is essential for our future, but equally important is building strong relationships with our team members. As a reminder, the annual staff survey is due next Friday. Lastly, we agreed that we must take time to look over our plans carefully and consider all angles before moving forward. On a side note, David mentioned that his cat is recovering well from surgery."
replacements, new_text_cos = recommend_replacements(text)
for original, replacement, similarity in replacements:
    print(f"Replace '{original}' for '{replacement}' (Similarity: {similarity})")

Replace 'discussions' for 'Demonstrate leadership' (Similarity: 0.684890627861023)
Replace 'performance' for 'Monitor performance metrics' (Similarity: 0.8226163983345032)
Replace 'important' for 'Implement best practices' (Similarity: 0.6628279089927673)
Replace 'use' for 'Utilise resources' (Similarity: 0.6930509805679321)
Replace 'efficient' for 'Utilise resources' (Similarity: 0.6761137843132019)
Replace 'tasks' for 'Prioritise tasks' (Similarity: 0.9417427182197571)
Replace 'Growth' for 'Drive growth' (Similarity: 0.6779133677482605)
Replace 'essential' for 'Implement best practices' (Similarity: 0.6975694894790649)
Replace 'important' for 'Implement best practices' (Similarity: 0.6628279089927673)
Replace 'relationships' for 'Facilitate collaboration' (Similarity: 0.7003356218338013)


In [74]:
print("\nImproved text:")
print(new_text_cos)


Improved text:
In today 's meeting , we discussed a variety of issues affecting our department . The weather was unusually sunny , a pleasant backdrop to our serious Demonstrate leadership . We came to the consensus that we need to do better in terms of Monitor performance metrics . Sally brought doughnuts , which lightened the mood . It 's Implement best practices to make good Utilise resources of what we have at our disposal . During the coffee break , we talked about the upcoming company picnic . We should aim to be more Utilise resources and look for ways to be more creative in our daily Prioritise tasks . Drive growth is Implement best practices for our future , but equally Implement best practices is building strong Facilitate collaboration with our team members . As a reminder , the annual staff survey is due next Friday . Lastly , we agreed that we must take time to look over our plans carefully and consider all angles before moving forward . On a side note , David mentioned t
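
The extra spaces before punctuation in the output above come from joining the tokens with a single space. One possible fix is to keep each token's original trailing whitespace, which spaCy exposes as `token.whitespace_`. The sketch below imitates that with plain `(text, trailing_whitespace)` pairs so that it runs without a model; `rebuild_text` and the sample sentence are hypothetical.

```python
def rebuild_text(tokens, replacements):
    """Join tokens using their own trailing whitespace.

    tokens: list of (text, trailing_whitespace) pairs.
    replacements: dict mapping a token index to its replacement phrase.
    """
    parts = []
    for index, (text, trailing_ws) in enumerate(tokens):
        parts.append(replacements.get(index, text) + trailing_ws)
    return "".join(parts)


# Hand-written stand-in for what a tokenizer would produce.
tokens = [
    ("It", ""), ("'s", " "), ("important", " "),
    ("to", " "), ("plan", ""), (".", ""),
]

print(rebuild_text(tokens, {}))
print(rebuild_text(tokens, {2: "Implement best practices"}))
```

This only repairs the spacing. The replacement phrase still does not fit the sentence grammatically, which is a separate problem.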

For manual text input, we use a simple text-based command-line interface (CLI).

In [75]:
def get_user_input():
    print("Enter the text to analyze:")
    user_input = input()
    return user_input


text = get_user_input()
print("You entered text:")
print(text)

Enter the text to analyze:
In the README file and/or in the video, the candidates will be required to analyse the results, are they good/not good, how could this have been improved if there was more time for completion etc
You entered text:
In the README file and/or in the video, the candidates will be required to analyse the results, are they good/not good, how could this have been improved if there was more time for completion etc
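
The text entered above is only echoed back. A possible next step is to pass it to the recommendation function. The sketch below takes the recommender as a parameter so that it does not depend on the notebook state. `suggest_for_input` and `fake_recommender` are hypothetical stand-ins; in the notebook one would pass `recommend_replacements` and leave `read_line` at its default.

```python
def suggest_for_input(recommender, read_line=input):
    """Read one line of text and return formatted suggestions for it."""
    print("Enter the text to analyze:")
    user_text = read_line()
    replacements, improved = recommender(user_text)
    lines = [
        f"Replace '{original}' with '{phrase}' (similarity: {score:.2f})"
        for original, phrase, score in replacements
    ]
    return lines, improved


# Stand-in recommender with a fixed answer, used only to make the sketch runnable.
def fake_recommender(text):
    return [("tasks", "Prioritise tasks", 0.94)], text.replace("tasks", "Prioritise tasks")


lines, improved = suggest_for_input(fake_recommender, read_line=lambda: "daily tasks")
print(lines)
print(improved)
```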


# Conclusions

**spaCy** is an open-source software library for advanced natural language processing, written in Python and Cython.

spaCy is designed specifically for production use and helps to create applications that process and "understand" large amounts of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

The text in this task is small, but it is enough to test how the library behaves on a small amount of data, to check the quality of the suggested replacements and to select a suitable threshold value.

Unfortunately, in the allotted time I was unable to find a list of standardised phrases that fits this template. Nevertheless, some conclusions can be drawn from applying the proposed list:

*   The specified threshold plays a key role. When this value is lowered even slightly, the result of the function changes significantly: words with little semantic load, such as articles, are treated as meaningful and replaced with suggested phrases that have a completely different meaning and do not fit the context. Below 0.65, words are replaced with phrases whose meanings are entirely unrelated;
*   **Cosine similarity** on its own does not solve the natural-language problems of synonymy and polysemy, which has a big impact on the accuracy of the matching. The explicit cosine computation suggests the same replacements as spaCy's built-in similarity() method. The scores differ only around the seventh decimal place, which is consistent with floating-point rounding rather than a real gain in accuracy.



Ideas for improvement:

*   We could train a dedicated model that works alongside the spaCy library or any other natural language processing library. Such a model could split the text into semantic phrases rather than single tokens, which would rule out substitutions of articles. It could also learn from hundreds of similar phrases to take context into account and avoid lexical, grammatical and semantic errors.
*   If we avoid training our own model, we could additionally use text preprocessing libraries that work not only with punctuation marks, spaces and individual words, but also take groups of words into account during processing. Combining such libraries would allow high-quality lexical replacement in any text the user supplies, regardless of its size.
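
The first idea above, comparing whole phrases rather than single tokens, can be approximated even without training a model: generate every short run of consecutive words and score each run against the standard phrases. The sketch below shows only the candidate-generation step. `candidate_spans` is a hypothetical helper, and the scoring of each span (for example with an averaged phrase vector) is deliberately left out.

```python
def candidate_spans(words, max_len=3):
    """Return every run of 1 to max_len consecutive words as a string."""
    spans = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(words):
                break
            spans.append(" ".join(words[start:end]))
    return spans


words = "make good use of".split()
print(candidate_spans(words, max_len=2))
# ['make', 'make good', 'good', 'good use', 'use', 'use of', 'of']
```

Scoring spans such as "good use" instead of the single token "use" would let a whole expression be replaced by one standard phrase, at the cost of more comparisons per sentence.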

