<a href="https://colab.research.google.com/github/aynuod/Text-Mining-with-Large-Language-Models-LLMs-/blob/main/Text_Mining_with_Large_Language_Models_(LLMs).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Mining with LLMs in Colab

### **Objectives**
1. To demonstrate how pretrained Large Language Models (LLMs) can simplify text mining tasks.
2. To perform sentiment analysis on a sample dataset to classify text as positive or negative.
3. To summarize a long paragraph using an encoder-decoder Transformer model.
4. To generate creative text using GPT-2, showcasing its generative capabilities.

---

### **About the Lab**
This lab uses the **Hugging Face Transformers library** to load pretrained models. These models were trained on large text corpora, and some have been further fine-tuned for specific tasks. The lab covers:
- Sentiment analysis.
- Summarization.
- Text generation.

Using these models saves time, avoids expensive training, and lets us focus on practical applications.


In [13]:
from transformers import pipeline
import pandas as pd

# Sample dataset
data = {
    "text": [
        "I love this product! It's fantastic.",
        "The experience was awful. Totally disappointed."
    ]
}
df = pd.DataFrame(data)


In [14]:
# Load pre-trained sentiment analysis model
sentiment_analyzer = pipeline("sentiment-analysis")

# Apply sentiment analysis
df['sentiment'] = df['text'].apply(lambda x: sentiment_analyzer(x)[0]['label'])
print(df)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


                                              text sentiment
0             I love this product! It's fantastic.  POSITIVE
1  The experience was awful. Totally disappointed.  NEGATIVE
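
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the model ID and revision hash printed in that warning. The helper names `build_sentiment_analyzer` and `label_texts` are illustrative and not part of the original notebook. The sketch also passes all texts in one call rather than once per row, which the pipeline accepts.

```python
# Model ID and revision copied from the pipeline warning shown above.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"


def build_sentiment_analyzer(model_id=MODEL_ID, revision=REVISION):
    """Create a sentiment pipeline pinned to one model and revision (hypothetical helper)."""
    # Imported lazily so label_texts can be used without loading the library.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_id, revision=revision)


def label_texts(analyzer, texts):
    """Classify all texts in a single call and return only the labels (hypothetical helper)."""
    results = analyzer(list(texts))  # one dict per input, with "label" and "score" keys
    return [result["label"] for result in results]


# Usage in the notebook (downloads the model on first run):
# sentiment_analyzer = build_sentiment_analyzer()
# df["sentiment"] = label_texts(sentiment_analyzer, df["text"])
```

Pinning the revision means a later update to the hosted model cannot silently change the labels.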


In [15]:
# Summarization pipeline
summarizer = pipeline("summarization")

# Summarize long text
text = """
The Transformer architecture itself is quite impressive. It can be highly
parallelized and scaled in ways that previous state of the art NLP models could
not be, allowing it to scale to much larger data sets and training times than
previous NLP models. The Transformer uses a special kind of attention
calculation called self-attention to allow each word in a sequence to “attend to”
(look to for context) all other words in the sequence, enabling it to capture long-range
dependencies and contextual relationships between words. Of course, no
architecture is perfect. Transformers are still limited to an input context window
which represents the maximum length of text it can process at any given
moment.
Since the advent of the Transformer in 2017, the ecosystem around using and
deploying Transformers has only exploded. The aptly named “Transformers”
library and its supporting packages have made it accessible for practitioners to
use, train, and share models, greatly accelerating its adoption and being used by
thousands of organizations and counting. Popular LLM repositories like
Hugging Face have popped up, providing access to powerful open-source
models to the masses. In short, using and productionizing a Transformer has
never been easier.
"""
summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print(summary[0]['summary_text'])


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


 The Transformer architecture itself is quite impressive. It can be highly parallelized and scaled in ways that previous state of the art NLP models could not be, allowing it to scale to much larger data sets and training times than previous models
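
The passage being summarized notes that a Transformer can only process a bounded input context window. The text here is short, but for longer inputs one common workaround is to split the text into pieces, summarize each piece, and join the results. The following is a rough sketch under stated assumptions: it uses whitespace-separated word counts as a crude proxy for tokens, and the helper names `chunk_words` and `summarize_long` and the 400-word default are illustrative choices, not taken from the notebook.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words (hypothetical helper)."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long(summarizer, text, max_words=400, **summarizer_kwargs):
    """Summarize each chunk separately and join the partial summaries (hypothetical helper)."""
    chunks = chunk_words(text, max_words)
    if not chunks:
        return ""
    # The summarization pipeline accepts a list and returns one dict per input.
    outputs = summarizer(chunks, **summarizer_kwargs)
    return " ".join(output["summary_text"].strip() for output in outputs)


# Usage in the notebook, reusing the summarizer and text defined above:
# print(summarize_long(summarizer, text, max_length=50, min_length=25, do_sample=False))
```

Word counts only approximate token counts, so a chunk size well below the model's limit leaves a safety margin.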


In [16]:
# Text generation pipeline
# gpt2 is a public checkpoint, so no login or access token is required.
generator = pipeline("text-generation", model="gpt2")

# Generate text based on a prompt
prompt = "Artificial intelligence in healthcare is"
output = generator(prompt, max_length=50, num_return_sequences=1)
print(output[0]['generated_text'])


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Artificial intelligence in healthcare is an area that has gained significant interest since the beginning of AI research. There have been several years under way, but what emerges to me is that there has been minimal progress on basic AI technology. We don't need to
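
The log messages above concern how output length and padding are chosen. A hedged sketch of making both explicit: `max_new_tokens` limits only the newly generated tokens, whereas `max_length` also counts the prompt, and `pad_token_id` is passed directly instead of being left to the default. An optional seed, set through `set_seed` from the Transformers library, makes sampled output repeatable. The helper names `generation_kwargs` and `generate_text` and the default of 40 new tokens are illustrative assumptions.

```python
def generation_kwargs(eos_token_id, max_new_tokens=40):
    """Build explicit generation arguments (hypothetical helper)."""
    return {
        "max_new_tokens": max_new_tokens,  # counts generated tokens only, not the prompt
        "num_return_sequences": 1,
        "pad_token_id": eos_token_id,      # reuse the end-of-sequence token for padding
    }


def generate_text(generator, prompt, max_new_tokens=40, seed=None):
    """Generate one continuation of prompt with explicit settings (hypothetical helper)."""
    if seed is not None:
        # Imported lazily; fixes the random state so sampling is repeatable.
        from transformers import set_seed

        set_seed(seed)
    kwargs = generation_kwargs(generator.tokenizer.eos_token_id, max_new_tokens)
    output = generator(prompt, **kwargs)
    return output[0]["generated_text"]


# Usage in the notebook, reusing the generator and prompt defined above:
# print(generate_text(generator, prompt, max_new_tokens=40, seed=42))
```

Because sampling is random by default, running the original cell again will usually produce different text, while a fixed seed should reproduce the same continuation in the same environment.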
