In [34]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import string  # provides string.punctuation, a string of all ASCII punctuation characters

nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Wajih\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [35]:
# our text example
text = """During my studies, I was able to acquire a set of faculties that enabled me to reinforce my skills in multiple areas: a significant knowledge of the fundamentals of Python in order to implement data science techniques and to build machine learning as well as deep learning models. In addition, my study experience was also reinforced by a series of online courses that delved into web scraping, data analytics and projects in machine learning and time series Analysis. Added to that, I am extremely familiar with SQL to manage data and also data visualization tools like Power BI in order to come up with insights and well-structured decisions. Moreover, my experience with ‘EY Tunisia’ was a golden ticket to develop dashboards and reports that can help analyze business performance and processes using Python and Power BI. In fact, I implemented different techniques to transform data into insights that drive business value. It was an enriching opportunity to delve into the professional world in an efficient way Added to that, my last job in Tunisia as business consultant is an enriching opportunity to delve into the professional world in an efficient way. Besides, starting a professional experience as an international freelancer in the field of data analytics was an asset for me to connect with employers from all over the world. Thus, joining your company would be an experience in perfect match with my ambitions, especially since I am very serious about enhancing my work capabilities besides the fact that I am very motivated and ready to invest my time to integrate successfully the team and to be a driving force in your approach to work.
Please find the attached Curriculum Vitae.
Thank you in advance for your time and I am at your disposal for any further information and interview. Hoping that my application will hold your attention and in anticipation of a reply, which I hope is favorable, please accept my sincere greetings.
"""

The sent_tokenize function from the Natural Language Toolkit (nltk) splits the input text into individual sentences.

In [36]:
# Tokenize text into sentences
sentences = nltk.sent_tokenize(text)
print(sentences)

['During my studies, I was able to acquire a set of faculties that enabled me to reinforce my skills in multiple areas: a significant knowledge of the fundamentals of Python in order to implement data science techniques and to build machine learning as well as deep learning models.', 'In addition, my study experience was also reinforced by a series of online courses that delved into web scraping, data analytics and projects in machine learning and time series Analysis.', 'Added to that, I am extremely familiar with SQL to manage data and also data visualization tools like Power BI in order to come up with insights and well-structured decisions.', 'Moreover, my experience with ‘EY Tunisia’ was a golden ticket to develop dashboards and reports that can help analyze business performance and processes using Python and Power BI.', 'In fact, I implemented different techniques to transform data into insights that drive business value.', 'It was an enriching opportunity to delve into the profe

In [37]:
# to see how many sentences we have
print(len(sentences))

11


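The preprocess function converts the input text to lowercase and removes all ASCII punctuation (the characters in string.punctuation) using a translation table built with str.maketrans. The list comprehension then applies this function to each sentence in the sentences list, producing a list of processed sentences with consistent casing. Two side effects are visible in the output: removing hyphens joins compound words ("well-structured" becomes "wellstructured"), and non-ASCII marks such as the curly quotes around EY Tunisia are left in place.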

In [38]:

# Clean and prepare the text (lowercase and remove punctuation).
# str.maketrans('', '', string.punctuation) creates a translation table that deletes every character in string.punctuation.
# Applying that table with text.translate(...) removes those characters from the string.
def preprocess(text):
    return text.lower().translate(str.maketrans('', '', string.punctuation))

processed_sentences = [preprocess(sentence) for sentence in sentences]

processed_sentences

['during my studies i was able to acquire a set of faculties that enabled me to reinforce my skills in multiple areas a significant knowledge of the fundamentals of python in order to implement data science techniques and to build machine learning as well as deep learning models',
 'in addition my study experience was also reinforced by a series of online courses that delved into web scraping data analytics and projects in machine learning and time series analysis',
 'added to that i am extremely familiar with sql to manage data and also data visualization tools like power bi in order to come up with insights and wellstructured decisions',
 'moreover my experience with ‘ey tunisia’ was a golden ticket to develop dashboards and reports that can help analyze business performance and processes using python and power bi',
 'in fact i implemented different techniques to transform data into insights that drive business value',
 'it was an enriching opportunity to delve into the professional 
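The curly quotes survive because string.punctuation only covers ASCII characters. The following is a minimal sketch of one possible variant, not part of the original notebook: preprocess_unicode is a hypothetical helper that also drops any character whose Unicode category starts with "P" (punctuation). In this notebook the difference is probably harmless, since 'ey' still shows up as a clean term in the vocabulary printed later, but it can matter if the cleaned text is used elsewhere.

```python
import string
import unicodedata

def preprocess_ascii(text):
    """Same approach as the notebook: lowercase, drop ASCII punctuation only."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

def preprocess_unicode(text):
    """Hypothetical variant: also drop Unicode punctuation such as curly quotes."""
    return ''.join(
        ch for ch in text.lower()
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith('P')
    )

# Sample sentence made up for illustration, reusing phrases from the text.
sample = "My experience with ‘EY Tunisia’ led to well-structured decisions."
print(preprocess_ascii(sample))    # curly quotes are kept
print(preprocess_unicode(sample))  # curly quotes are removed
```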

The TfidfVectorizer is initialized with English stop words removed, which means common words like "the" or "is" will not be considered in the analysis. The fit_transform method then computes the TF-IDF (Term Frequency-Inverse Document Frequency) matrix, representing each sentence in the processed_sentences list as a vector in the feature space defined by the vectorizer.

In [39]:
# Vectorize the sentences using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(processed_sentences)
print(tfidf_matrix)

  (0, 103)	0.2032252323934614
  (0, 0)	0.2032252323934614
  (0, 2)	0.2032252323934614
  (0, 98)	0.2032252323934614
  (0, 48)	0.2032252323934614
  (0, 40)	0.2032252323934614
  (0, 91)	0.2032252323934614
  (0, 100)	0.2032252323934614
  (0, 79)	0.2032252323934614
  (0, 13)	0.2032252323934614
  (0, 99)	0.2032252323934614
  (0, 71)	0.2032252323934614
  (0, 54)	0.2032252323934614
  (0, 89)	0.1737095302955632
  (0, 82)	0.1737095302955632
  (0, 61)	0.2032252323934614
  (0, 28)	0.12325210428704043
  (0, 95)	0.2032252323934614
  (0, 107)	0.1737095302955632
  (0, 18)	0.2032252323934614
  (0, 74)	0.1737095302955632
  (0, 72)	0.3474190605911264
  (0, 30)	0.2032252323934614
  (0, 77)	0.2032252323934614
  (1, 28)	0.14466250680764378
  :	:
  (7, 65)	0.20967080559478515
  (7, 105)	0.20967080559478515
  (7, 106)	0.20967080559478515
  (7, 37)	0.20967080559478515
  (7, 52)	0.20967080559478515
  (7, 12)	0.20967080559478515
  (8, 15)	0.5773502691896258
  (8, 26)	0.5773502691896258
  (8, 117)	0.5773502691896
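To make these numbers less opaque, here is a small pure-Python sketch of how such weights can be computed by hand. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF, L2 normalization of each row); the toy corpus and the helper names idf and tfidf_row are made up for illustration. It is consistent with row 8 of the output above: three values of about 0.577, which is 1/sqrt(3), is what a sentence gets when it keeps three terms of equal weight.

```python
import math
from collections import Counter

# Hypothetical, already tokenized toy corpus (stop words removed).
docs = [
    ["data", "analytics", "python", "data"],
    ["data", "visualization", "sql"],
    ["attached", "curriculum", "vitae"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF, assumed to match scikit-learn's default (smooth_idf=True).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc):
    # Raw count times IDF, then scale the row to unit L2 length.
    counts = Counter(doc)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

rows = [tfidf_row(doc) for doc in docs]
for i, row in enumerate(rows):
    print(i, {term: round(w, 4) for term, w in sorted(row.items())})
```

In the second toy document, "data" receives a lower weight than "sql" because it also occurs in another document, which is the effect the IDF factor is meant to produce.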

-- Just to better understand the TF-IDF matrix

The todense() method converts the sparse tfidf_matrix into a dense matrix, allowing for easier visualization of the TF-IDF values as a full array. The get_feature_names_out() method retrieves the list of unique terms (words) corresponding to the columns of the matrix, which represent the vocabulary used in the TF-IDF transformation.

In [40]:
# Convert the matrix to a dense format and print it
dense_matrix = tfidf_matrix.todense()
print(dense_matrix)

# Get the terms (words) corresponding to the columns
terms = vectorizer.get_feature_names_out()
print(terms)

[[0.20322523 0.         0.20322523 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.25713503 0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.31622777 0.         ... 0.         0.         0.        ]]
['able' 'accept' 'acquire' 'added' 'addition' 'advance' 'ambitions'
 'analysis' 'analytics' 'analyze' 'anticipation' 'application' 'approach'
 'areas' 'asset' 'attached' 'attention' 'bi' 'build' 'business'
 'capabilities' 'come' 'company' 'connect' 'consultant' 'courses'
 'curriculum' 'dashboards' 'data' 'decisions' 'deep' 'delve' 'delved'
 'develop' 'different' 'disposal' 'drive' 'driving' 'efficient'
 'employers' 'enabled' 'enhancing' 'enriching' 'especially' 'experience'
 'extremely' 'ey' 'fact' 'faculties' 'familiar' 'favorable' 'field'
 'force' 
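A dense matrix of this size is hard to read directly. One way to inspect it is to pair each column with its term and keep only the largest weights per row. The sketch below uses a hypothetical four-term vocabulary and invented weights so that it runs on its own; top_terms is an illustrative helper, not a library function.

```python
# Hypothetical vocabulary and dense TF-IDF rows (values invented for illustration).
terms = ["analytics", "data", "python", "sql"]
dense_rows = [
    [0.00, 0.45, 0.89, 0.00],
    [0.66, 0.36, 0.00, 0.55],
]

def top_terms(row, terms, k=2):
    """Return the k highest-weighted (term, weight) pairs of one row, skipping zeros."""
    ranked = sorted(zip(terms, row), key=lambda pair: pair[1], reverse=True)
    return [(term, weight) for term, weight in ranked[:k] if weight > 0]

for i, row in enumerate(dense_rows):
    print(i, top_terms(row, terms))
```

With the notebook's own variables, the rows could come from tfidf_matrix.toarray() and the terms from vectorizer.get_feature_names_out().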

-- Back to our project:

The cosine_similarity function computes the pairwise cosine similarity between all the sentences in the tfidf_matrix. This results in a similarity matrix where each value represents how similar two sentences are, based on the angle between their TF-IDF vector representations.

In [41]:
# Calculate Cosine Similarity between sentences
cosine_sim = cosine_similarity(tfidf_matrix)
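For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following pure-Python sketch computes it for hypothetical three-term vectors; cosine is an illustrative helper, not the scikit-learn function used above.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors over a 3-term vocabulary.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a: similarity close to 1
c = [0.0, 0.0, 3.0]  # shares no terms with a: similarity 0

print(cosine(a, b))
print(cosine(a, c))
```

Because TF-IDF weights are never negative, the similarity between two sentences falls between 0 (no shared terms) and 1 (same direction).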

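The sum(axis=1) method adds up each row of the similarity matrix, giving one score per sentence: its total similarity to every sentence in the text. Because the diagonal of the matrix holds each sentence's similarity with itself (1.0), every score includes that constant, so a score of exactly 1 means the sentence shares no terms with any other sentence. The resulting sentence_scores array measures how representative each sentence is of the text as a whole.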

In [42]:
# Rank sentences based on similarity (here using the sum of similarities for each sentence)
sentence_scores = cosine_sim.sum(axis=1)
print(sentence_scores)


[1.34105807 1.4526298  1.44538314 1.33749316 1.42345819 1.30098255
 1.47683434 1.20918678 1.         1.10736023 1.        ]
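Each of these sums includes the sentence's similarity with itself. A quick sketch with a hypothetical 3x3 similarity matrix suggests why this does not affect the ranking: removing the diagonal lowers every score by the same amount.

```python
# Hypothetical symmetric similarity matrix for three sentences.
sim = [
    [1.0, 0.3, 0.0],
    [0.3, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]

# Row sums as in the notebook (self-similarity included).
with_self = [sum(row) for row in sim]
# Same sums with the diagonal entry removed.
without_self = [sum(row) - row[i] for i, row in enumerate(sim)]

def ranking(scores):
    """Indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

print(ranking(with_self))
print(ranking(without_self))
```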


This line selects the 3 highest-scoring sentences. The argsort() function returns the indices that would sort the sentence_scores array in ascending order, [-3:] keeps the last three of them (the highest scores), and [::-1] reverses them so the highest score comes first. These indices are then used to retrieve the corresponding sentences from the sentences list.

In [43]:
# Select top N sentences for the summary (since my text contains 11 sentences, we will use 3 sentences to summarize)
top_n_sentences = [sentences[i] for i in sentence_scores.argsort()[-3:][::-1]]
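The slicing can be checked without NumPy. The sketch below copies the scores printed earlier and mimics argsort() with sorted(); top_n_indices is a hypothetical helper written for this illustration.

```python
# Scores copied from the printed sentence_scores output.
sentence_scores = [1.34105807, 1.4526298, 1.44538314, 1.33749316, 1.42345819,
                   1.30098255, 1.47683434, 1.20918678, 1.0, 1.10736023, 1.0]

def top_n_indices(scores, n=3):
    # Indices in ascending score order, which is what argsort() returns.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Keep the last n (highest scores), then reverse so the best comes first.
    return order[-n:][::-1]

print(top_n_indices(sentence_scores))  # [6, 1, 2]
```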

This line joins the 3 top-ranked sentences (stored in top_n_sentences) into a single string, with each sentence separated by a space. The sentences appear in order of score, not in the order they occur in the text. The result is a short extractive summary made of the sentences that are most similar to the rest of the text.

In [44]:

# Join the selected sentences and print the summary
summary = " ".join(top_n_sentences)
print(summary)


Besides, starting a professional experience as an international freelancer in the field of data analytics was an asset for me to connect with employers from all over the world. In addition, my study experience was also reinforced by a series of online courses that delved into web scraping, data analytics and projects in machine learning and time series Analysis. Added to that, I am extremely familiar with SQL to manage data and also data visualization tools like Power BI in order to come up with insights and well-structured decisions.
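The summary lists the sentences by score, so it opens with "Besides, ..." even though that sentence comes later in the letter. If a summary that follows the original flow is preferred, one option (an assumption, not what the notebook does) is to sort the selected indices before joining. The sketch uses short stand-in sentences so that it runs on its own.

```python
# Indices of the three selected sentences, highest score first.
top_indices = [6, 1, 2]

# Hypothetical stand-ins for the 11 tokenized sentences.
sentences = [f"Sentence {i}." for i in range(11)]

# Restore document order before joining.
ordered_indices = sorted(top_indices)
summary_in_order = " ".join(sentences[i] for i in ordered_indices)
print(summary_in_order)
```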
