# Day 2 - Module 5: Text Generation and Understanding

## Module objectives:
- Text summarization
- Text generation
- Machine translation
- Sentiment analysis in practice
- Question answering (Q&A systems)

In [None]:
# Import libraries
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown

print("✅ Libraries loaded!")

## 5.1 Text Summarization

### Types of summarization:

#### 1. Extractive Summarization
- Selects the most important sentences from the original text
- Does not create new sentences
- Faster and simpler
- Example: picking the 3 most important sentences from an article
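
The extractive idea can be sketched without any model. The snippet below is a minimal illustration that assumes a simple word-frequency score; the helper names (`split_sentences`, `extractive_summary`), the stopword list, and the demo text are hypothetical and are not used elsewhere in this module. Real extractive systems use stronger scoring methods.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "it", "that", "for", "on", "with", "as", "this"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def content_words(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def extractive_summary(text, num_sentences=3):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return [sentences[i] for i in chosen]

demo_text = (
    "Solar panels convert sunlight into electricity. "
    "New solar panels are more efficient than older solar panels. "
    "The weather was pleasant yesterday. "
    "Efficient solar panels reduce the cost of electricity."
)
for sentence in extractive_summary(demo_text, num_sentences=2):
    print(sentence)
```

Every returned sentence is copied verbatim from the input, which is the defining property of the extractive approach.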

#### 2. Abstractive Summarization
- Generates new sentences
- Paraphrases and rewrites the source
- Produces more "human-like" summaries
- Requires Seq2Seq models (e.g. T5, BART, Pegasus)

### Applications:
- Article summaries
- News aggregation
- Meeting summaries
- Legal document analysis
- Product review summaries

In [None]:
# Example 1: Abstractive summarization
print("=== ABSTRACTIVE SUMMARIZATION ===")

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
Artificial intelligence has made tremendous progress in recent years, particularly 
in the field of natural language processing. Large language models like GPT-3 and 
GPT-4 have demonstrated remarkable capabilities in understanding and generating 
human-like text. These models are trained on vast amounts of text data and can 
perform a wide variety of tasks without specific training for each task. The impact 
of these advances extends across multiple industries including healthcare, finance, 
education, and customer service. However, there are also important concerns about 
bias, misinformation, and the ethical implications of AI systems. As these 
technologies continue to evolve, it's crucial to develop appropriate guidelines 
and regulations to ensure they benefit society as a whole.
"""

print("ORIGINAL ARTICLE:")
print(article.strip())
print(f"\nWord count: {len(article.split())}")

# Summarize (max_length and min_length are measured in tokens, not words)
summary = summarizer(
    article, 
    max_length=60, 
    min_length=30, 
    do_sample=False
)

print("\n" + "="*70)
print("SUMMARY:")
print(summary[0]['summary_text'])
print(f"\nWord count: {len(summary[0]['summary_text'].split())}")

In [None]:
# Example 2: Summaries of different lengths
print("\n=== SUMMARIES OF DIFFERENT LENGTHS ===")

# "max" and "min" are token limits passed to the model, not word counts
lengths = [
    {"max": 30, "min": 20, "name": "Short"},
    {"max": 60, "min": 40, "name": "Medium"},
    {"max": 100, "min": 70, "name": "Long"},
]

for config in lengths:
    summary = summarizer(
        article,
        max_length=config["max"],
        min_length=config["min"],
        do_sample=False
    )
    print(f"\n{config['name']} ({config['min']}-{config['max']} tokens):")
    print(summary[0]['summary_text'])

In [None]:
# Example 3: Summarizing multiple documents
print("\n=== SUMMARIZING MULTIPLE ARTICLES ===")

articles = [
    {
        "title": "Climate Change Report",
        "text": "Scientists warn that global temperatures are rising at an alarming rate. "
                "The effects include melting ice caps, rising sea levels, and more frequent "
                "extreme weather events. Immediate action is needed to reduce carbon emissions."
    },
    {
        "title": "Tech Industry Growth",
        "text": "The technology sector continues to expand rapidly. Cloud computing, artificial "
                "intelligence, and cybersecurity are driving innovation. Companies are investing "
                "heavily in research and development to stay competitive."
    },
    {
        "title": "Healthcare Innovation",
        "text": "Medical technology is transforming patient care. Telemedicine, AI-assisted "
                "diagnostics, and personalized treatments are becoming more common. These advances "
                "promise to improve outcomes and reduce costs."
    }
]

for article_data in articles:
    summary = summarizer(
        article_data["text"],
        max_length=40,
        min_length=20,
        do_sample=False
    )
    print(f"\n📰 {article_data['title']}")
    print(f"Original: {article_data['text'][:100]}...")
    print(f"Summary: {summary[0]['summary_text']}")

## 5.2 Text Generation

### Applications:
- Writing articles and blog posts
- Generating product descriptions
- Creating marketing copy
- Automated replies
- Creative writing (short stories, poetry)

### Generation control parameters:
- **temperature**: Controls "creativity" (close to 0 = nearly deterministic, 2+ = very creative); in `transformers` the value must be greater than 0 when sampling
- **top_k**: Samples only from the k most probable tokens
- **top_p**: Nucleus sampling, which samples from the smallest set of tokens whose cumulative probability reaches p
- **max_length**: Maximum length of the generated sequence in tokens (for text-generation pipelines this count includes the prompt)
- **num_beams**: Beam search, which explores several candidate sequences in parallel
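
To make the sampling parameters more concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p reshape a next-token distribution. The vocabulary, the logits, and the helper names (`softmax`, `top_k_filter`, `top_p_filter`) are made up for illustration. Library implementations apply the same ideas to model logits, but their details may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Hypothetical next-token candidates and their logits.
vocab = ["great", "good", "fine", "odd", "purple"]
logits = [2.0, 1.5, 1.0, 0.0, -1.0]

# Low temperature sharpens the distribution, high temperature flattens it.
for t in (0.3, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, temperature=t)])

probs = softmax(logits)
print("top_k=2:", {vocab[i]: round(q, 3) for i, q in top_k_filter(probs, 2).items()})
print("top_p=0.9:", {vocab[i]: round(q, 3) for i, q in top_p_filter(probs, 0.9).items()})
```

After filtering, a token would be sampled from the renormalised distribution. With `do_sample=False` the most probable token is chosen instead and these parameters have no effect.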

In [None]:
# Example 1: Basic text generation
print("=== TEXT GENERATION ===")

generator = pipeline("text-generation", model="gpt2")

prompt = "The future of artificial intelligence is"

print(f"Prompt: {prompt}\n")

# Generate several versions
results = generator(
    prompt,
    max_length=80,
    num_return_sequences=3,
    temperature=0.8,
    do_sample=True
)

for i, result in enumerate(results, 1):
    print(f"\nVersion {i}:")
    print(result['generated_text'])
    print("-" * 70)

In [None]:
# Example 2: Effect of temperature on generation
print("\n=== EFFECT OF TEMPERATURE ===")

temperatures = [0.3, 0.7, 1.2]
prompt = "Machine learning is"

for temp in temperatures:
    result = generator(
        prompt,
        max_length=50,
        temperature=temp,
        do_sample=True,
        num_return_sequences=1
    )
    print(f"\nTemperature = {temp}:")
    print(result[0]['generated_text'])

In [None]:
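# Example 3: Generating different types of content
# Note: GPT-2 is not instruction-tuned, so it continues the prompt text
# rather than following it as a command.
print("\n=== GENERATING DIFFERENT TYPES OF CONTENT ===")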

prompts = [
    "Write a product description for a smartwatch:\n",
    "Create a blog post introduction about healthy eating:\n",
    "Write an email subject line for a sale:\n",
]

for prompt in prompts:
    result = generator(
        prompt,
        max_length=70,
        temperature=0.7,
        do_sample=True
    )
    print(f"\n{'='*70}")
    print(result[0]['generated_text'])

## 5.3 Machine Translation

### Evolution of machine translation:
1. **Rule-based** (1950s-1980s) - grammar rules
2. **Statistical MT** (1990s-2010s) - statistics from parallel corpora
3. **Neural MT** (2014+) - Seq2Seq neural networks
4. **Transformer-based** (2017+) - attention-based models

### Most popular models:
- **MarianMT** - fast, lightweight models for many language pairs
- **mBART** - multilingual BART
- **M2M-100** - direct translation between 100 languages
- **NLLB** (No Language Left Behind) - 200+ languages

In [None]:
# Example 1: English → German translation
print("=== TRANSLATION: EN → DE ===")

translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")

english_texts = [
    "Hello, how are you today?",
    "Artificial intelligence is transforming the world.",
    "I would like to book a table for two people.",
]

for text in english_texts:
    translation = translator(text)
    print(f"\nEN: {text}")
    print(f"DE: {translation[0]['translation_text']}")

In [None]:
# Example 2: English → French translation
print("\n=== TRANSLATION: EN → FR ===")

translator_fr = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")

for text in english_texts:
    translation = translator_fr(text)
    print(f"\nEN: {text}")
    print(f"FR: {translation[0]['translation_text']}")

In [None]:
# Example 3: Multilingual translation
print("\n=== MULTILINGUAL TRANSLATION ===")

# For Polish, a dedicated model can be used,
# e.g. Helsinki-NLP/opus-mt-en-pl

try:
    translator_pl = pipeline("translation_en_to_pl", model="Helsinki-NLP/opus-mt-en-pl")
    
    text = "Machine learning is a fascinating field of study."
    translation = translator_pl(text)
    
    print(f"\nEN: {text}")
    print(f"PL: {translation[0]['translation_text']}")
except Exception as e:
    print(f"⚠️ Polish model unavailable: {e}")

## 5.4 Sentiment Analysis in Practice

### Levels of sentiment analysis:

#### 1. Document-level
- Sentiment of the whole document
- Example: a review is positive/negative

#### 2. Sentence-level
- Sentiment of each sentence
- Makes it possible to detect mixed opinions

#### 3. Aspect-based (ABSA)
- Sentiment for specific aspects
- Example: "The food was great, but the service was poor"
  - Food: positive
  - Service: negative
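
The aspect-based idea can be illustrated with a toy keyword heuristic. The sketch below is not a trained ABSA model: the lexicons (`ASPECTS`, `POSITIVE`, `NEGATIVE`) and the function `aspect_sentiment` are hypothetical, hand-written for this example, and do not handle negation or sarcasm.

```python
import re

# Hypothetical, hand-written lexicons for illustration only.
ASPECTS = {"food": "food", "meal": "food", "service": "service",
           "staff": "service", "price": "price"}
POSITIVE = {"great", "excellent", "friendly", "tasty", "good"}
NEGATIVE = {"poor", "slow", "rude", "bad", "expensive"}

def aspect_sentiment(text):
    """Assign a sentiment label to each aspect mentioned in the text."""
    results = {}
    # Split into clauses on commas, semicolons, full stops and the word "but".
    clauses = re.split(r"[,;.]|\bbut\b", text.lower())
    for clause in clauses:
        tokens = re.findall(r"[a-z]+", clause)
        aspects = {ASPECTS[t] for t in tokens if t in ASPECTS}
        score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
        if score == 0:
            continue  # no clear sentiment signal in this clause
        label = "positive" if score > 0 else "negative"
        for aspect in aspects:
            results[aspect] = label
    return results

print(aspect_sentiment("The food was great, but the service was poor."))
```

The key step is scoring each clause separately, so that opposite opinions about different aspects do not cancel out as they would at document level.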

### Business applications:
- Brand reputation monitoring
- Product review analysis
- Customer satisfaction assessment
- Financial market sentiment analysis
- Social media content moderation

In [None]:
# Example 1: Basic sentiment analysis
print("=== SENTIMENT ANALYSIS ===")

sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

reviews = [
    "This product is absolutely amazing! Best purchase ever!",
    "Terrible quality. Complete waste of money.",
    "It's okay, nothing special but does the job.",
    "Love it! Highly recommend to everyone!",
    "Disappointed. Not what I expected at all.",
]

results = []
for review in reviews:
    sentiment = sentiment_analyzer(review)[0]
    results.append({
        'review': review,
        'sentiment': sentiment['label'],
        'score': sentiment['score']
    })
    print(f"\nReview: {review}")
    print(f"  → {sentiment['label']}: {sentiment['score']:.4f}")

In [None]:
# Visualize the results
df_sentiment = pd.DataFrame(results)

plt.figure(figsize=(12, 6))

# Horizontal bar chart: bar length is the model's confidence in the predicted label
colors = ['green' if s == 'POSITIVE' else 'red' for s in df_sentiment['sentiment']]
plt.barh(range(len(df_sentiment)), df_sentiment['score'], color=colors, alpha=0.7)
plt.yticks(range(len(df_sentiment)), [r[:40] + '...' if len(r) > 40 else r for r in df_sentiment['review']])
plt.xlabel('Confidence Score')
plt.title('Sentiment Analysis of Reviews')
plt.axvline(x=0.5, color='gray', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
# Example 2: Sentence-level sentiment analysis
print("\n=== SENTIMENT ANALYSIS: SENTENCE LEVEL ===")

text = """The hotel room was spacious and clean. The staff was very friendly. 
However, the breakfast was disappointing. The location is perfect though. 
Overall, I would recommend this place."""

# Split into sentences
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # newer NLTK releases look for this resource
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(text)

print(f"Original text:\n{text}\n")
print("=" * 70)
print("Sentence-by-sentence analysis:\n")

for i, sentence in enumerate(sentences, 1):
    sentiment = sentiment_analyzer(sentence)[0]
    emoji = "😊" if sentiment['label'] == 'POSITIVE' else "😞"
    print(f"{i}. {sentence}")
    print(f"   {emoji} {sentiment['label']} ({sentiment['score']:.2f})\n")

## 5.5 Question Answering (Q&A Systems)

### Types of Q&A systems:

#### 1. Extractive QA
- The answer is a span of the provided text
- The model finds the most relevant span
- Example: SQuAD dataset
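
Extractive QA models typically output a start score and an end score for every token, and the answer is the span with the best combined score. The sketch below shows that selection step on a toy sentence. The scores are made up to stand in for model logits, and `best_span` is a hypothetical helper; the actual pipeline post-processing differs in detail.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end, score) maximising start + end score with start <= end."""
    best = None
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            total = s_score + end_scores[e]
            if best is None or total > best[2]:
                best = (s, e, total)
    return best

tokens = ["TechCorp", "was", "founded", "in", "2010", "by",
          "Jane", "Smith", "and", "John", "Doe"]
# Made-up scores for the question "Who founded the company?"
start_scores = [0.1, 0.0, 0.2, 0.1, 0.3, 0.2, 2.5, 0.4, 0.0, 0.9, 0.1]
end_scores = [0.0, 0.1, 0.1, 0.0, 0.9, 0.1, 0.2, 1.2, 0.0, 0.3, 2.8]

start, end, score = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))
```

Because the answer is always a slice of the input tokens, an extractive model cannot produce wording that is absent from the context.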

#### 2. Open-domain QA
- The answer comes from a large knowledge base
- Wymaga retrieval + reading comprehension
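
The "retrieval" half of this setup can be sketched with a very simple word-overlap ranker. This is only a stand-in for real retrievers such as BM25 or dense embeddings. The knowledge base, the `tokenize` and `retrieve` helpers, and the question are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_n=1):
    """Rank documents by the number of words they share with the question."""
    q_tokens = tokenize(question)
    scored = [(len(q_tokens & tokenize(doc["text"])), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

knowledge_base = [
    {"title": "Solar", "text": "Solar panels convert sunlight into electricity."},
    {"title": "Wind", "text": "Wind turbines convert moving air into electricity."},
    {"title": "Coffee", "text": "Espresso is brewed by forcing hot water through ground coffee."},
]

question = "How is espresso brewed?"
best_doc = retrieve(question, knowledge_base)[0]
print(best_doc["title"])

# Reading step (not run here): an extractive QA model would then answer
# from the retrieved text only, e.g.
# qa_pipeline(question=question, context=best_doc["text"])
```

Splitting the work this way keeps the reader model's input short: it only sees the passages the retriever selected, not the whole knowledge base.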

#### 3. Generative QA
- The model generates the answer
- Can produce answers that do not appear in the context

### Applications:
- Chatbots and virtual assistants
- Document search engines
- Technical support systems
- Educational platforms

In [None]:
# Example: Extractive question answering
print("=== QUESTION ANSWERING ===")

qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = """
Natural Language Processing (NLP) is a field of artificial intelligence that focuses 
on the interaction between computers and human language. It combines computational 
linguistics with statistical, machine learning, and deep learning models. Modern NLP 
has been revolutionized by transformer architectures like BERT and GPT. These models 
are pre-trained on massive text corpora and can be fine-tuned for specific tasks. 
Common applications include machine translation, sentiment analysis, text summarization, 
and question answering systems. The field continues to evolve rapidly with new 
breakthroughs announced regularly.
"""

questions = [
    "What is NLP?",
    "What are transformer architectures mentioned?",
    "What are common applications of NLP?",
    "How are modern NLP models trained?",
]

print("CONTEXT:")
print(context.strip())
print("\n" + "="*70 + "\n")

for question in questions:
    result = qa_pipeline(question=question, context=context)
    print(f"❓ Question: {question}")
    print(f"✅ Answer: {result['answer']}")
    print(f"   (Confidence: {result['score']:.4f})\n")

In [None]:
# Example 2: Q&A over different contexts
print("\n=== Q&A: DIFFERENT DOCUMENTS ===")

documents = [
    {
        "title": "Company Info",
        "text": "TechCorp was founded in 2010 by Jane Smith and John Doe. "
                "The company specializes in AI solutions and has offices in "
                "New York, London, and Tokyo. It currently employs 500 people."
    },
    {
        "title": "Product Description",
        "text": "Our flagship product, AI Assistant Pro, uses advanced NLP "
                "to help businesses automate customer support. It supports "
                "20 languages and can handle 10,000 queries per second."
    },
]

questions_per_doc = [
    "Who founded the company?",
    "How many languages does AI Assistant Pro support?",
]

for doc, question in zip(documents, questions_per_doc):
    print(f"\n📄 Document: {doc['title']}")
    print(f"❓ Question: {question}")

    result = qa_pipeline(question=question, context=doc['text'])
    print(f"✅ Answer: {result['answer']}")
    print(f"   (Confidence: {result['score']:.4f})")

## Hands-on exercise: A complete NLU/NLG pipeline

### Task:
Build a system that:
1. Accepts a long article
2. Produces a summary of it
3. Analyzes its sentiment
4. Answers questions about the article
5. Generates a title for the article

In [None]:
# Complete NLP pipeline
class NLPPipeline:
    def __init__(self):
        print("Loading models...")
        self.summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
        # Pin the same models used earlier instead of relying on task defaults
        self.sentiment = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english"
        )
        self.qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
        self.generator = pipeline("text-generation", model="gpt2")
        print("✅ Pipeline ready!\n")

    def analyze_article(self, article, questions=None):
        print("="*70)
        print("ARTICLE ANALYSIS")
        print("="*70)

        # 1. Summary
        print("\n1️⃣ SUMMARY:")
        summary = self.summarizer(article, max_length=60, min_length=30, do_sample=False)
        summary_text = summary[0]['summary_text']
        print(summary_text)

        # 2. Sentiment analysis (run on the summary, not the full article)
        print("\n2️⃣ SENTIMENT:")
        sentiment = self.sentiment(summary_text)[0]
        emoji = "😊" if sentiment['label'] == 'POSITIVE' else "😞"
        print(f"{emoji} {sentiment['label']} (confidence: {sentiment['score']:.2%})")

        # 3. Question answering
        answers = {}
        if questions:
            print("\n3️⃣ ANSWERS TO QUESTIONS:")
            for q in questions:
                answer = self.qa(question=q, context=article)
                answers[q] = answer['answer']
                print(f"\n❓ {q}")
                print(f"✅ {answer['answer']}")

        # 4. Title generation
        print("\n4️⃣ SUGGESTED TITLE:")
        title_prompt = f"Write a catchy title for this article: {summary_text[:100]}"
        # max_new_tokens limits only the continuation; the prompt alone can
        # already reach max_length=30 tokens. return_full_text=False drops the
        # prompt from the output, so the printed line is newly generated text.
        title = self.generator(
            title_prompt,
            max_new_tokens=20,
            num_return_sequences=1,
            return_full_text=False
        )
        title_text = title[0]['generated_text'].strip().split('\n')[0]
        print(title_text)

        return {
            'summary': summary_text,
            'sentiment': sentiment,
            'answers': answers,
            'title': title_text,
        }

# Test
article_example = """
Breakthrough in renewable energy research was announced today. Scientists at MIT have 
developed a new solar panel technology that is 40% more efficient than current models. 
The innovation uses a novel material that can capture a broader spectrum of sunlight. 
This development could significantly reduce the cost of solar energy and accelerate 
the transition to clean energy sources. The research team plans to commercialize the 
technology within three years. Industry experts are calling this a game-changer for 
the renewable energy sector.
"""

# Initialize the pipeline
nlp_pipeline = NLPPipeline()

# Analyze the article
questions = [
    "Who developed the new technology?",
    "How much more efficient is the new solar panel?",
    "When will the technology be commercialized?"
]

results = nlp_pipeline.analyze_article(article_example, questions)

## Module 5 Summary

✅ We learned how to create text summaries (extractive and abstractive)

✅ We explored text generation techniques and parameter control

✅ We implemented machine translation between languages

✅ We carried out advanced sentiment analysis

✅ We built a Question Answering system

✅ We built a complete NLU/NLG pipeline

### Key takeaways:
- Transformer models enable a wide range of NLP tasks
- Hugging Face pipelines greatly simplify implementation
- Each task has its own specific models and parameters
- Quality depends on choosing the right model and tuning it

---

**Next step**: Module 6 - NLP in business practice (chatbots, automation)