# About the project

>The project was completed as a part of the [Natural Language Processing course](https://hyperskill.org/tracks/10) on JetBrains Academy.

>The aim of this project was to learn how to extract key terms from a collection of news stories. While working on it, I learned how to:

>- read an XML file and extract the headers and the text;
>- **tokenize** each text;
>- **lemmatize** each word in each story;
>- get rid of punctuation, stopwords, and non-nouns with the help of NLTK;
>- compute the **TF-IDF metric** for each word in all stories.

In [5]:
#Imports:

import string

import nltk
import pandas as pd
from lxml import etree

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

#Parsing the XML file

xml_path = "/Users/katerynaboguslavska/Downloads/news.xml"
tree = etree.parse(xml_path)
root = tree.getroot()
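
The layout of `news.xml` is not shown here, but the `root[0].findall('news/value')` lookup in the next cell suggests a structure roughly like the one below. This is a minimal sketch under that assumption: the sample XML, the element names `data` and `corpus`, the helper `extract_stories`, and the sample text are all hypothetical, and the standard library's `xml.etree.ElementTree` stands in for `lxml` so the snippet runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real news.xml may be laid out differently.
SAMPLE_XML = """
<data>
  <corpus>
    <news>
      <value name="head">First headline</value>
      <value name="text">Body of the first story.</value>
    </news>
    <news>
      <value name="head">Second headline</value>
      <value name="text">Body of the second story.</value>
    </news>
  </corpus>
</data>
"""


def extract_stories(xml_string):
    """Return (headline, text) pairs from the assumed layout."""
    root = ET.fromstring(xml_string.strip())
    stories = []
    # root[0] is the first child of the root element (<corpus> in this sample)
    for news in root[0].findall('news'):
        # Map each <value> element's "name" attribute to its text content
        fields = {value.get('name'): value.text for value in news.findall('value')}
        stories.append((fields.get('head'), fields.get('text')))
    return stories


stories = extract_stories(SAMPLE_XML)
for head, text in stories:
    print(head, '->', text)
```

Grouping the values per `news` element keeps each headline paired with its own text, even if a story happens to be missing one of the two fields.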

In [6]:
#Extracting the headers and the text

list_for_vectorizer = []
list_of_names = []

for tag in root[0].findall('news/value'):
    
    if tag.attrib['name'] == 'head':
        list_of_names.append(tag.text)
        
    if tag.attrib['name'] == 'text':
        
        #Applying tokenization and lemmatization
        
        tknzd_text = word_tokenize(tag.text.lower())
        lemmatizer = WordNetLemmatizer()
        lemma_text = [lemmatizer.lemmatize(w) for w in tknzd_text]
        
        #Getting rid of punctuation and stopwords (lookups are built once as sets, not once per word)
        
        punctuation = set(string.punctuation)
        stop_words = set(stopwords.words('english'))
        without_punct = [word for word in lemma_text if word not in punctuation]
        without_stopwords = [word for word in without_punct if word not in stop_words]
        
        #Applying part-of-speech tagging (POS-tagging) and choosing nouns
        
        pos_tag = [nltk.pos_tag([word]) for word in without_stopwords]
        extr_nouns = [word[0][0] for word in pos_tag if word[0][1] == "NN"]
        list_of_nouns = ' '.join(extr_nouns)
        list_for_vectorizer.append(list_of_nouns)

In [7]:
#Applying TfidfVectorizer for every word in all news stories
        
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(list_for_vectorizer)
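
To make the numbers in `tfidf_matrix` less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as I understand them: raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The function `tfidf` and the toy corpus are hypothetical, and the sketch splits on whitespace only, so it skips the vectorizer's own tokenization (which, among other things, ignores one-character tokens).

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per whitespace-separated document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times the smoothed inverse document frequency
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalize so that each non-empty document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


# Hypothetical toy corpus of already-filtered nouns
docs = ["brain sleep brain", "sleep health", "health coast coast"]
for row in tfidf(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first toy document, `brain` outweighs `sleep` both because it occurs twice and because it appears in no other document, which is the behavior that makes TF-IDF useful for picking key terms.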

In [8]:
#Looping through each news story

for x in range(0, len(list_of_names)):
    
    print(list_of_names[x] + ':')
    
    #Creating dataFrame with TF-IDF for every word in news story
    
    df = pd.DataFrame(tfidf_matrix[x].toarray().transpose(), index=vectorizer.get_feature_names_out())
    
    #Sorting words in reverse alphabetical order, according to the task instructions;
    #nlargest below keeps this order among words with equal scores
        
    df_sorted = df.sort_index(ascending = False)
        
    #Picking the five best scoring words
    
    top_five = df_sorted.nlargest(5,0)
    
    #Extracting keywords
    
    keywords_in_dict = top_five.transpose().to_dict(orient = 'list')
    keywords = ' '.join(list(keywords_in_dict.keys()))
    print(keywords)

Brain Disconnects During Sleep:
sleep cortex consciousness tononi tm
New Portuguese skull may be an early relative of Neandertals:
skull fossil europe trait genus
Living by the coast could improve mental health:
health coast mental living household
Did you knowingly commit a crime? Brain scans could tell:
brain suitcase study security scenario
Computer learns to detect skin cancer more accurately than doctors:
dermatologist skin melanoma cnn lesion
US economic growth stronger than expected despite weak demand:
rate growth quarter economy investment
Microsoft becomes third listed US firm to be valued at $1tn:
microsoft share cloud market company
Apple's Siri is a better rapper than you:
siri rhyme smooth rizzo producer
Netflix viewers like comedy for breakfast and drama at lunch:
netflix day comedy viewer tv
Loneliness May Make Quitting Smoking Even Tougher:
smoking loneliness smoke quit lead
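
The ranking step in the last cell can also be expressed without a DataFrame. The sketch below assumes the intended rule is "highest score first, ties broken in reverse alphabetical order", which is how the `sort_index(ascending = False)` call reads; the function `top_keywords` and the example scores are hypothetical and are not taken from the stories above.

```python
def top_keywords(scores, n=5):
    """Return the n best-scoring words, ties broken in reverse alphabetical order."""
    # Sorting on (score, word) in reverse puts higher scores first and,
    # among equal scores, words that come later in the alphabet first.
    ranked = sorted(scores.items(), key=lambda item: (item[1], item[0]), reverse=True)
    return [word for word, _ in ranked[:n]]


# Hypothetical scores for illustration only
example = {"alpha": 0.5, "beta": 0.5, "gamma": 0.9, "delta": 0.1}
print(' '.join(top_keywords(example, n=3)))  # prints: gamma beta alpha
```

Sorting on a single composite key makes the tie-breaking rule explicit in one place, instead of relying on the order in which two separate sorts are applied.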
