Text Summarization Using NLTK:

The provided code extracts a summary from a given text by first tokenizing the text into words and sentences. It then filters out common stopwords and calculates the frequency of each remaining word. Next, it assigns a value to each sentence based on the frequency of important words it contains. Finally, it selects and outputs sentences with values above a certain threshold (1.2 times the average sentence value) to create a summary of the text.

The steps that will be implemented are:

1- Tokenize the text into words and sentences
2- Build a frequency table for the words
3- Score each sentence (sum of its word frequencies)
4- Compute the average sentence value
5- Select sentences with a value > 1.2 * average and add them to the summary

In short, a sentence scores higher when it contains frequently used words, and the threshold decides whether that sentence is worth adding to the summary.
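Before going through the NLTK cells, the five steps can be sketched with the standard library alone. This is a rough illustration, not the notebook's code: the regular-expression tokenizers stand in for `word_tokenize` and `sent_tokenize`, and `STOPWORDS`, `tokenize_words` and `summarize` are hypothetical names with a deliberately tiny stopword set.

```python
import re
from collections import Counter

# Assumption: a tiny hand-written stopword set standing in for NLTK's English list.
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "it", "on"}

def tokenize_words(text):
    """Naive lowercase word tokenizer (stand-in for nltk's word_tokenize)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def summarize(text, multiplier=1.2):
    # Step 1: naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    # Step 2: frequency table of non-stopword tokens.
    freq = Counter(w for w in tokenize_words(text) if w not in STOPWORDS)
    # Step 3: sentence score = sum of the frequencies of its distinct words.
    scores = {s: sum(freq[w] for w in set(tokenize_words(s))) for s in sentences}
    # Step 4: average sentence value.
    average = sum(scores.values()) / len(scores)
    # Step 5: keep sentences scoring above multiplier * average, in original order.
    return [s for s in sentences if scores[s] > multiplier * average]

demo = "Data science uses data. Cats sleep. Data pipelines move data and data models."
print(summarize(demo))
```

With this demo text the two sentences that mention "data" score 7 and 8 against a threshold of about 6.8, so they are kept and "Cats sleep." is dropped.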

In [22]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

In [23]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Wajih\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [24]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Wajih\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [25]:
text = """During my studies, I was able to acquire a set of faculties that enabled me to reinforce my skills in multiple areas: a significant knowledge of the fundamentals of Python in order to implement data science techniques and to build machine learning as well as deep learning models. In addition, my study experience was also reinforced by a series of online courses that delved into web scraping, data analytics and projects in machine learning and time series Analysis. Added to that, I am extremely familiar with SQL to manage data and also data visualization tools like Power BI in order to come up with insights and well-structured decisions. Moreover, my experience with ‘EY Tunisia’ was a golden ticket to develop dashboards and reports that can help analyze business performance and processes using Python and Power BI. In fact, I implemented different techniques to transform data into insights that drive business value. It was an enriching opportunity to delve into the professional world in an efficient way Added to that, my last job in Tunisia as business consultant is an enriching opportunity to delve into the professional world in an efficient way. Besides, starting a professional experience as an international freelancer in the field of data analytics was an asset for me to connect with employers from all over the world. Thus, joining your company would be an experience in perfect match with my ambitions, especially since I am very serious about enhancing my work capabilities besides the fact that I am very motivated and ready to invest my time to integrate successfully the team and to be a driving force in your approach to work.
Please find the attached Curriculum Vitae.
Thank you in advance for your time and I am at your disposal for any further information and interview. Hoping that my application will hold your attention and in anticipation of a reply, which I hope is favorable, please accept my sincere greetings.
"""

This statement creates a set called stopWords containing common English stopwords (such as "the", "is", "in") from the stopwords corpus of the nltk library. The stopwords.words("english") call returns these stopwords as a list; converting the list to a set makes the membership checks used later fast.

In [26]:
stopWords = set(stopwords.words("english"))
print(stopWords)

{'its', 'further', 'whom', 'few', 'isn', 'needn', 'your', "you'd", 'does', 'mustn', 'can', 'itself', 'own', 'ours', 'doing', 'during', 'more', 'will', 'hers', 'it', 'is', 'couldn', 'each', 'not', "doesn't", 'if', "weren't", 'as', 'yourself', 'what', 'who', 'up', 'with', 'he', 'only', 'theirs', 'hasn', 'they', 'has', "mightn't", 'then', 'are', 'too', 'now', 'through', 'do', 'm', 'yours', 'against', 'most', 'don', 'about', 'them', 'into', 'shouldn', 'd', 'ourselves', 'until', "it's", 'were', 'between', "wasn't", 'off', 'very', 'hadn', "isn't", 'myself', 'ain', 'be', 'there', 'or', 'our', 'than', 'other', 'll', 'out', 'me', "couldn't", 'herself', 'below', 'him', 'which', 'both', "don't", "haven't", 'any', "hadn't", 'just', 'at', 'on', 'this', 'was', 'when', 'and', "that'll", 'where', 'again', 'wouldn', 'yourselves', 'having', 'same', 'some', 'didn', 'aren', 'to', "mustn't", 'here', 'ma', 've', 'she', 'shan', 'the', 'while', 'such', 't', 'my', 'i', 'of', "aren't", "won't", 'that', "wouldn'

The word_tokenize function from the nltk library splits the input text into individual words or tokens. As the output shows, punctuation marks come out as separate tokens.

In [27]:
words = word_tokenize(text)
print(words)

['During', 'my', 'studies', ',', 'I', 'was', 'able', 'to', 'acquire', 'a', 'set', 'of', 'faculties', 'that', 'enabled', 'me', 'to', 'reinforce', 'my', 'skills', 'in', 'multiple', 'areas', ':', 'a', 'significant', 'knowledge', 'of', 'the', 'fundamentals', 'of', 'Python', 'in', 'order', 'to', 'implement', 'data', 'science', 'techniques', 'and', 'to', 'build', 'machine', 'learning', 'as', 'well', 'as', 'deep', 'learning', 'models', '.', 'In', 'addition', ',', 'my', 'study', 'experience', 'was', 'also', 'reinforced', 'by', 'a', 'series', 'of', 'online', 'courses', 'that', 'delved', 'into', 'web', 'scraping', ',', 'data', 'analytics', 'and', 'projects', 'in', 'machine', 'learning', 'and', 'time', 'series', 'Analysis', '.', 'Added', 'to', 'that', ',', 'I', 'am', 'extremely', 'familiar', 'with', 'SQL', 'to', 'manage', 'data', 'and', 'also', 'data', 'visualization', 'tools', 'like', 'Power', 'BI', 'in', 'order', 'to', 'come', 'up', 'with', 'insights', 'and', 'well-structured', 'decisions', '.'

This cell builds a frequency table (freqTable) that counts how often each word in the words list appears, excluding stopwords. Each word is first converted to lowercase so that counting is case-insensitive. Stopwords are skipped; any other word is added with a count of 1 the first time it appears and has its count incremented on every later occurrence. Finally, the frequency table is printed. Note that punctuation tokens such as ',' and '.' are not stopwords, so they are counted as well.

In [28]:
freqTable = dict()

for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1
print(freqTable)

{'studies': 1, ',': 12, 'able': 1, 'acquire': 1, 'set': 1, 'faculties': 1, 'enabled': 1, 'reinforce': 1, 'skills': 1, 'multiple': 1, 'areas': 1, ':': 1, 'significant': 1, 'knowledge': 1, 'fundamentals': 1, 'python': 2, 'order': 2, 'implement': 1, 'data': 6, 'science': 1, 'techniques': 2, 'build': 1, 'machine': 2, 'learning': 3, 'well': 1, 'deep': 1, 'models': 1, '.': 11, 'addition': 1, 'study': 1, 'experience': 4, 'also': 2, 'reinforced': 1, 'series': 2, 'online': 1, 'courses': 1, 'delved': 1, 'web': 1, 'scraping': 1, 'analytics': 2, 'projects': 1, 'time': 3, 'analysis': 1, 'added': 2, 'extremely': 1, 'familiar': 1, 'sql': 1, 'manage': 1, 'visualization': 1, 'tools': 1, 'like': 1, 'power': 2, 'bi': 2, 'come': 1, 'insights': 2, 'well-structured': 1, 'decisions': 1, 'moreover': 1, '‘': 1, 'ey': 1, 'tunisia': 2, '’': 1, 'golden': 1, 'ticket': 1, 'develop': 1, 'dashboards': 1, 'reports': 1, 'help': 1, 'analyze': 1, 'business': 3, 'performance': 1, 'processes': 1, 'using': 1, 'fact': 2, 'im
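The printed table counts ',' and '.' as if they were words. One possible variation, sketched here under the assumption that punctuation should not influence the scores, uses `collections.Counter` and skips tokens made up only of punctuation. The helper name `build_freq_table` and the sample tokens are hypothetical; the notebook itself does not filter punctuation.

```python
import string
from collections import Counter

def build_freq_table(tokens, stop_words):
    """Count lowercase tokens, skipping stopwords and pure-punctuation tokens."""
    # ASCII punctuation plus the curly quotes that appear in the sample text.
    punctuation = set(string.punctuation) | {"‘", "’"}
    lowered = (t.lower() for t in tokens)
    return Counter(
        t for t in lowered
        if t not in stop_words and not all(ch in punctuation for ch in t)
    )

# Hypothetical sample tokens; in the notebook these would be `words` and `stopWords`.
tokens = ["Data", ",", "data", "and", "Python", ".", "well-structured"]
table = build_freq_table(tokens, stop_words={"and"})
print(table)
```

Hyphenated tokens such as 'well-structured' are kept because they contain letters; only tokens consisting entirely of punctuation are dropped.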

This statement uses the sent_tokenize function from the nltk library to split the input text, which must be a string, into individual sentences. The resulting list of sentences is assigned to the variable sentences.

In [29]:
sentences = sent_tokenize(text)
print(sentences)

['During my studies, I was able to acquire a set of faculties that enabled me to reinforce my skills in multiple areas: a significant knowledge of the fundamentals of Python in order to implement data science techniques and to build machine learning as well as deep learning models.', 'In addition, my study experience was also reinforced by a series of online courses that delved into web scraping, data analytics and projects in machine learning and time series Analysis.', 'Added to that, I am extremely familiar with SQL to manage data and also data visualization tools like Power BI in order to come up with insights and well-structured decisions.', 'Moreover, my experience with ‘EY Tunisia’ was a golden ticket to develop dashboards and reports that can help analyze business performance and processes using Python and Power BI.', 'In fact, I implemented different techniques to transform data into insights that drive business value.', 'It was an enriching opportunity to delve into the profe

This function calculates a "value" for each sentence in the sentences list from the word frequencies stored in freqTable. For each sentence, it checks every entry of freqTable against the lowercased sentence and, on a match, adds that entry's frequency to the sentence's value. Because the check uses Python's in operator on the sentence string, it is a substring match: a short entry can match inside a longer word, and each distinct entry is added once no matter how often it occurs in the sentence. The result is a dictionary (sentenceValue) in which each key is a sentence and each value is the sum of the frequencies of the matching entries; it is then printed.

In [30]:
def getsentenceValue():
    sentenceValue = dict()
    for sentence in sentences:
        for word, freq in freqTable.items():
            if word in sentence.lower():
                if sentence in sentenceValue:
                    sentenceValue[sentence] += freq
                else:
                    sentenceValue[sentence] = freq
    return sentenceValue

sentenceValue = getsentenceValue()
print(sentenceValue)

{'During my studies, I was able to acquire a set of faculties that enabled me to reinforce my skills in multiple areas: a significant knowledge of the fundamentals of Python in order to implement data science techniques and to build machine learning as well as deep learning models.': 61, 'In addition, my study experience was also reinforced by a series of online courses that delved into web scraping, data analytics and projects in machine learning and time series Analysis.': 61, 'Added to that, I am extremely familiar with SQL to manage data and also data visualization tools like Power BI in order to come up with insights and well-structured decisions.': 52, 'Moreover, my experience with ‘EY Tunisia’ was a golden ticket to develop dashboards and reports that can help analyze business performance and processes using Python and Power BI.': 52, 'In fact, I implemented different techniques to transform data into insights that drive business value.': 44, 'It was an enriching opportunity to 
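The scoring loop tests `word in sentence.lower()`, which is a substring test on the sentence string. The sketch below contrasts that behavior with a whole-token comparison. It is only an illustration: `score_substring`, `score_tokens`, the regular-expression tokenizer and the three-entry frequency table are assumptions made for the example, not part of the notebook.

```python
import re

def score_substring(sentence, freq_table):
    """Mirror of the notebook's test: substring match on the lowercased sentence."""
    lowered = sentence.lower()
    return sum(freq for word, freq in freq_table.items() if word in lowered)

def score_tokens(sentence, freq_table):
    """Alternative: an entry counts only when it equals a whole token."""
    tokens = set(re.findall(r"[\w'-]+", sentence.lower()))
    return sum(freq for word, freq in freq_table.items() if word in tokens)

# Hypothetical miniature frequency table.
freq_table = {"bi": 2, "power": 2, "capabilities": 1}
sentence = "I am enhancing my work capabilities."

print(score_substring(sentence, freq_table))  # 3: 'bi' matches inside 'capabilities'
print(score_tokens(sentence, freq_table))     # 1: only 'capabilities' is a token
```

Whether the stricter token match is preferable depends on the goal; it would change the sentence values and therefore possibly the summary printed in this notebook.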


The getsumValues() function sums the values of all sentences in sentenceValue and divides the total by the number of sentences to get the average. It returns the average sentence value truncated to an integer.

In [31]:
def getsumValues():
    sumValues = 0
    for sentence in sentenceValue:
        sumValues += sentenceValue[sentence]
    
    average = int(sumValues/len(sentenceValue))
    return average

average = getsumValues()
print(average)

45
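With an average of 45, the selection threshold used in the next step is 1.2 * 45, roughly 54. The short sketch below applies that threshold to the first five sentence values visible in the printed output and shows the effect of the `int()` truncation. The helper `average_score` and its `truncate` flag are hypothetical and only serve the illustration.

```python
def average_score(scores, truncate=True):
    """Mean of the sentence values; truncated to an int, as in the notebook, when truncate=True."""
    mean = sum(scores) / len(scores)
    return int(mean) if truncate else mean

average = 45                            # value printed by the notebook
threshold = 1.2 * average               # roughly 54
visible_scores = [61, 61, 52, 52, 44]   # first five values visible in the printed sentenceValue
passing = [s for s in visible_scores if s > threshold]
print(passing)  # [61, 61]

# int() drops the fractional part, which lowers the threshold slightly.
print(average_score([61, 52, 44]), average_score([61, 52, 44], truncate=False))
```

Only the two sentences scoring 61 clear the threshold among these five, which matches the first two sentences of the final summary.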


This cell builds a summary by selecting the sentences in sentences whose value is greater than 1.2 times the average sentence value. For each sentence, it checks that the sentence is in sentenceValue and that its value exceeds the threshold, then appends it to the summary string. Finally, the summary is printed.

In [32]:
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence

print(summary)


 During my studies, I was able to acquire a set of faculties that enabled me to reinforce my skills in multiple areas: a significant knowledge of the fundamentals of Python in order to implement data science techniques and to build machine learning as well as deep learning models. In addition, my study experience was also reinforced by a series of online courses that delved into web scraping, data analytics and projects in machine learning and time series Analysis. Thus, joining your company would be an experience in perfect match with my ambitions, especially since I am very serious about enhancing my work capabilities besides the fact that I am very motivated and ready to invest my time to integrate successfully the team and to be a driving force in your approach to work.
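The printed summary starts with a space because each sentence is appended as `" " + sentence`. A small variation, sketched here with a hypothetical helper `build_summary` and toy data, collects the selected sentences in a list and joins them once, which avoids the leading space and makes the multiplier configurable.

```python
def build_summary(sentences, sentence_value, average, multiplier=1.2):
    """Join, in original order, the sentences whose value exceeds multiplier * average."""
    threshold = multiplier * average
    # .get(..., 0) treats sentences without a recorded value as scoring zero.
    chosen = [s for s in sentences if sentence_value.get(s, 0) > threshold]
    return " ".join(chosen)

# Toy data for illustration only.
toy_sentences = ["First point.", "Minor aside.", "Key result."]
toy_values = {"First point.": 61, "Minor aside.": 30, "Key result.": 70}
print(build_summary(toy_sentences, toy_values, average=45))
```

In the notebook this would be called as `build_summary(sentences, sentenceValue, average)`.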
