## Class 6: NLP Toolkits

All exercises must be solved in Python, in your copy of the Jupyter Notebook for the given class, in the designated places (cells marked with the comment `# Solution`).

Do not delete the cells containing the task descriptions.

Display outputs using `print`.

## Optional - may be useful for the final project
I strongly encourage you to check out the resources for Polish: https://github.com/sdadas/polish-nlp-resources

# NOTE
Each task must be tested on your own texts.

### Task 1

Using the NLTK toolkit, write a function that takes an English text as input, tokenizes it at the word and sentence level, and prints the following information:
* the number of characters in the text
* the number of words (tokens according to the NLTK tokenizer) in the text
* the number of sentences (according to the NLTK sentence tokenizer) in the text

In [None]:
# Solution 1

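import nltk
# word_tokenize and sent_tokenize rely on NLTK's Punkt tokenizer data, which must be
# downloaded once, e.g. nltk.download("punkt") ("punkt_tab" on newer NLTK releases).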

def tokenize_and_process_text(text):
    characters_count=len(text)
    
    tokens=nltk.word_tokenize(text)
    tokens_count=len(tokens)
    
    sentences=nltk.sent_tokenize(text)
    sentences_count=len(sentences)
    
    print(f"Number of characters in text: {characters_count}")
    print(f"Number of words in text: {tokens_count}")
    print(f"Number of sentences in text: {sentences_count}")

# Source of text: https://lingua.com/english/reading/valentines-day/

example="""Chocolates and flowers are commonly given as gifts during Valentine's Day, 
as are accompanying greeting cards (greeting card companies release new Valentine's Day designs annually).
Red and pink are generally understood to be "the colors" of Valentine's Day, and many individuals, 
instead of celebrating romantically, spend the holiday with their friends and/or family members."""

tokenize_and_process_text(example)

Number of characters in text: 379
Number of words in text: 67
Number of sentences in text: 2
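
The character count in the solution above comes from `len(text)`, so it includes spaces and the newline characters embedded in the triple-quoted string. As a minimal sketch of how the two ways of counting differ, the helper below (a hypothetical `character_counts`, not part of the original solution) reports the count with and without whitespace:

```python
def character_counts(text: str) -> dict[str, int]:
    """Return the number of characters with and without whitespace."""
    return {
        # len() counts every character, including spaces and newlines.
        "all": len(text),
        # str.isspace() is True for spaces, tabs and newlines.
        "non_whitespace": sum(1 for ch in text if not ch.isspace()),
    }


sample = "Red and pink.\nTwo lines here."
print(character_counts(sample))
# → {'all': 29, 'non_whitespace': 24}
```

Which number is the "right" one depends on how the task defines a character. The solution above uses the first convention.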


### Task 2

Using the spaCy toolkit, write a function that takes an English text as input together with one of three arguments:
* noun
* verb
* adjective

and then tokenizes the text into words and returns the words matching the part of speech given by the second argument (e.g. for the argument noun it returns a list of nouns). Use the PoS (part-of-speech) module.

In [None]:
# Solution 2

import spacy

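def process_part_of_speech(partOfSpeech):
    # Map the user-facing name to spaCy's coarse-grained tag (token.pos_).
    # The match statement requires Python 3.10 or newer.
    match partOfSpeech:
        case "noun":
            return "NOUN"
        case "verb":
            return "VERB"
        case "adjective":
            return "ADJ"
        case _:
            # Fail loudly instead of silently returning None.
            raise ValueError(f"Unsupported part of speech: {partOfSpeech!r}")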

def return_part_of_speech_from_text(text,partOfSpeech):
    sp=spacy.load("en_core_web_sm")
    doc=sp(text)
    pos=process_part_of_speech(partOfSpeech)
    found_words=[]
    for token in doc:
        if token.pos_==pos:
            found_words.append(token.text)
    print(f"List of {partOfSpeech+'s'} in this text:\n {found_words}")

# Text source: https://www.newsinlevels.com/products/worlds-dirtiest-city-level-3/

example="""Delhi’s air quality index (AQI) reached a severe level of 418, far above the ideal range of 0-50. 
Every winter, Delhi struggles with smog caused by various factors, including cold temperatures, crop burning, dust, vehicle emissions, and low wind speeds. 
Residents experience burning eyes, itchy throats, and difficulty breathing.
While some can use air purifiers indoors, many must work outside, facing hazardous conditions. 
Pollution severely impacts vulnerable groups like children and the elderly, leading to health problems such as respiratory issues. 
Experts criticize the lack of long-term solutions, noting that government measures are often reactive."""

return_part_of_speech_from_text(example,"noun")

return_part_of_speech_from_text(example,"verb")

return_part_of_speech_from_text(example,"adjective")

List of nouns in this text:
 ['air', 'quality', 'index', 'level', 'range', 'winter', 'smog', 'factors', 'temperatures', 'crop', 'burning', 'dust', 'vehicle', 'emissions', 'wind', 'speeds', 'Residents', 'eyes', 'throats', 'difficulty', 'breathing', 'air', 'purifiers', 'indoors', 'conditions', 'Pollution', 'groups', 'children', 'health', 'problems', 'issues', 'Experts', 'lack', 'term', 'solutions', 'government', 'measures']
List of verbs in this text:
 ['reached', 'struggles', 'caused', 'including', 'experience', 'burning', 'use', 'work', 'facing', 'impacts', 'leading', 'criticize', 'noting']
List of adjectives in this text:
 ['severe', 'ideal', 'various', 'cold', 'low', 'itchy', 'many', 'hazardous', 'vulnerable', 'elderly', 'such', 'respiratory', 'long', 'reactive']
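
The task asks for the matching words to be returned, whereas the solution above prints them. A possible returning variant is sketched below. The names `POS_TAGS` and `words_with_pos` are hypothetical, and the function only assumes that each token has `text` and `pos_` attributes, as spaCy tokens do. Stand-in tokens are used so that the sketch runs without a downloaded model:

```python
from types import SimpleNamespace

# Assumed mapping from the task's argument names to the coarse-grained
# part-of-speech tags that spaCy exposes as token.pos_.
POS_TAGS = {"noun": "NOUN", "verb": "VERB", "adjective": "ADJ"}


def words_with_pos(tokens, part_of_speech):
    """Return the text of every token whose pos_ matches part_of_speech."""
    try:
        tag = POS_TAGS[part_of_speech]
    except KeyError:
        raise ValueError(f"Unsupported part of speech: {part_of_speech!r}") from None
    return [token.text for token in tokens if token.pos_ == tag]


# Stand-in tokens for illustration only; they are not produced by spaCy.
demo_tokens = [
    SimpleNamespace(text="Residents", pos_="NOUN"),
    SimpleNamespace(text="experience", pos_="VERB"),
    SimpleNamespace(text="itchy", pos_="ADJ"),
    SimpleNamespace(text="throats", pos_="NOUN"),
]

print(words_with_pos(demo_tokens, "noun"))
# → ['Residents', 'throats']
```

With spaCy installed, the `doc` object created in the solution above could be passed as `tokens` directly, since iterating over a `Doc` yields tokens with these attributes.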


### Task 3

Using the spaCy toolkit and the en_core_web_sm model, write a program that analyzes an English-language text document about economic events. The program should extract the following named entities (using the NER, Named Entity Recognition, module):
* ORGANIZATION (e.g., company names; spaCy label `ORG`)
* DATE (e.g., contract start/end dates)
* MONEY (e.g., payment amounts)

The program must print the entities found and their values.

In [None]:
# Solution 3

# Sample text
text = """
The agreement between Google and Microsoft was signed on January 15, 2023.
The payment of $100,000,000 is due on March 1, 2024. Another payment of $300,000,000 is scheduled for June 15, 2024.
"""

# Add your own text as well
# My example, source: https://www.newsinlevels.com/products/renault-electric-cars-level-3/
# Changed "euros" to "€".

example="""
Renault aims to boost confidence in its electric vehicles unit, 
Ampere, projecting a revenue surge to 10 billion € in 2025 and 25 billion € in 2031.
The company disclosed financial objectives, including break-even in 2025 and a 10% operating margin by 2030, 
ahead of a planned market listing in 2024. The IPO faces challenges due to sluggish electric vehicle demand, 
market uncertainties, and heightened Chinese competition. 
CEO Luca de Meo emphasized Renault’s cash capacity for Ampere’s growth but preferred the IPO for accelerated development. 
However, he asserted that if the valuation falls below an undisclosed threshold, the IPO plan would be abandoned.
Renault also introduced a new EV model Twingo Legend priced below 20,000 €, 
aiming to compete with affordable Chinese alternatives.
"""

import spacy

def economical_text_analysis(text):
    sp=spacy.load("en_core_web_sm")
    doc=sp(text)
    organizations=[]
    dates=[]
    money_amounts=[]
    for ent in doc.ents:
        match (ent.label_):
            case "ORG":
                organizations.append(ent.text)
            case "DATE":
                dates.append(ent.text)
            case "MONEY":
                money_amounts.append(ent.text)
    print("Entities found in text: ")
    print("Organizations:")
    print(organizations)
    print("Dates:")
    print(dates)
    print("Money amounts: ")
    print(money_amounts)

economical_text_analysis(text)

print("\n\n")

economical_text_analysis(example)

Entities found in text: 
Organizations:
['Google', 'Microsoft']
Dates:
['January 15, 2023', 'March 1, 2024', 'June 15, 2024']
Money amounts: 
['100,000,000', '300,000,000']



Entities found in text: 
Organizations:
['Renault', 'IPO', 'Renault', 'IPO', 'IPO', 'Renault', 'EV', 'Twingo Legend']
Dates:
['2025', '2031', '2025', '2030', '2024']
Money amounts: 
['10 billion €', '25 billion €', '20,000 €']
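
The second output above lists the same organization several times, because every mention in the text is appended. If unique values are preferred, one option is to group entities by label and skip repeats while keeping first-seen order. The sketch below uses the hypothetical names `WANTED_LABELS` and `group_entities`. It assumes only that each entity has `text` and `label_` attributes, as the items of spaCy's `doc.ents` do, and it uses stand-in entities so that it runs without a model:

```python
from types import SimpleNamespace

# Labels requested in the task; ORG is the label used in the solution above.
WANTED_LABELS = ("ORG", "DATE", "MONEY")


def group_entities(entities, labels=WANTED_LABELS):
    """Group entity texts by label, dropping repeats but keeping first-seen order."""
    grouped = {label: [] for label in labels}
    for ent in entities:
        bucket = grouped.get(ent.label_)  # None for labels we do not collect
        if bucket is not None and ent.text not in bucket:
            bucket.append(ent.text)
    return grouped


# Stand-in entities for illustration only; they are not produced by spaCy.
demo_entities = [
    SimpleNamespace(text="Renault", label_="ORG"),
    SimpleNamespace(text="2025", label_="DATE"),
    SimpleNamespace(text="Renault", label_="ORG"),
    SimpleNamespace(text="10 billion €", label_="MONEY"),
    SimpleNamespace(text="Luca de Meo", label_="PERSON"),
]

print(group_entities(demo_entities))
# → {'ORG': ['Renault'], 'DATE': ['2025'], 'MONEY': ['10 billion €']}
```

Whether to deduplicate is a design choice. Keeping repeats, as the solution above does, preserves how often each entity is mentioned.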
