# 1. Introduction
* **Text as a Challenge**: Unlike numerical data, raw text is unstructured and messy. This makes it hard for computers to directly analyze and uncover insights.
* **Vectorization to the Rescue**: Vectorization techniques transform words, sentences, and even entire documents into numerical representations. This allows us to use mathematical and computational tools for powerful text analysis.
* **Your Mission**: This assignment will take you on a journey through text processing and vectorization. You'll decode clues, uncover hidden connections, and collaborate with others to reach the ultimate treasure!
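
To make the idea of vectorization concrete, here is a minimal, dependency-free sketch of a bag-of-words representation. The function name `bag_of_words` and the two sample sentences are illustrative assumptions, not part of the assignment; real projects would normally use a library vectorizer instead.

```python
from collections import Counter


def bag_of_words(sentences):
    """Map each sentence to a count vector over a shared, sorted vocabulary."""
    tokenized = [s.lower().split() for s in sentences]
    # The vocabulary is the sorted set of all tokens seen in any sentence
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One integer per vocabulary word: how often it occurs in this sentence
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


# Hypothetical sample sentences, for illustration only
vocab, vectors = bag_of_words(["rockets roar", "rockets orbit earth"])
print(vocab)    # ['earth', 'orbit', 'roar', 'rockets']
print(vectors)  # [[0, 0, 1, 1], [1, 1, 0, 1]]
```

Once text is in this form, each sentence is a point in a shared numeric space, which is what lets the later sections compare texts mathematically.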

# 2. Setting Up
* Install the necessary libraries
* Import the libraries
* Load the Dataset

## Make sure you have these libraries installed
(`pip install [library_name]` if needed):
* nltk
* pandas
* scikit-learn (installed as `scikit-learn`, imported as `sklearn`)
* gensim
* spacy
* (Optional for advanced exploration): transformers



In [1]:
# Import libraries
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords
import nltk

nltk.download("stopwords")
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec  # For Word2Vec embeddings
import re  # For regular expressions

# Optional advanced exploration with Transformers
from transformers import pipeline

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Daniel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!





# 3. Your Quest Begins – The Initial Clue
* Decipher the Message: Your first clue is the key! Analyze it closely. What words or themes stand out?
* * Hint 1: Think about which topic category within the 20 Newsgroups dataset connects to your initial clue.

In [2]:
# Load the 20 newsgroups dataset for 'sci.med' and 'sci.space' categories
categories = ["sci.med", "sci.space"]
newsgroups_train = fetch_20newsgroups(subset="train", categories=categories)
newsgroups_test = fetch_20newsgroups(subset="test", categories=categories)

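# Create a dataframe for 'sci.med' and 'sci.space' categories
df = pd.DataFrame(data=newsgroups_train.data, columns=["text"])
df["target"] = newsgroups_train.target
# Map target indices through target_names so the labels stay correct regardless of the order of `categories`
df["category"] = df["target"].map(lambda x: newsgroups_train.target_names[x])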

# Display the first few rows
df.head()

# Display the number of rows and columns
df.shape

(1187, 3)

# 4. Keyword Quest
Finding the Guiding Stars: Time to extract keywords that illuminate your path. Let's start with TF-IDF:


In [11]:
# Clues
# Group4 (Sci Med)
scimed_Clue_1a = (
    "The 'happy chemical' in your brain, I influence mood, both joy and pain."
)
scimed_Clue_2a = (
    "From tryptophan, my form takes flight, a neurotransmitter shining bright."
)
scimed_Clue_1b = (
    "Tiny creatures, unseen by the eye, I exist in numbers that reach for the sky."
)
scimed_Clue_2b = (
    "Prokaryotic cells, simple yet grand, I shape the world on water and land."
)
scimed_Clue_3 = (
    "Within your belly, a bustling scene, I aid digestion, a microbiome team."
)

# Sci Space topic
scispace_Clue_1a = "To break Earth's grasp and touch the sky, where rockets roar and dreams take flight so high."
scispace_Clue_2a = "Defying gravity's relentless hold, a journey embarked, a story to unfold. Among the stars, new frontiers to see, the boundless quest of humanity."
scispace_Clue_1b = (
    "A celestial eye, orbiting bright, beaming down signals, day and through night."
)
scispace_Clue_2b = "Earth's silent companion, a technological feat, spinning its dance in a rhythm so sweet. Whispers of data from distant space, it charts the cosmos with unwavering grace."

scimed_clues_list = [
    scimed_Clue_1a,
    scimed_Clue_2a,
    scimed_Clue_1b,
    scimed_Clue_2b,
    scimed_Clue_3,
]
scispace_clues_list = [
    scispace_Clue_1a,
    scispace_Clue_2a,
    scispace_Clue_1b,
    scispace_Clue_2b,
]

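# Clean the clue text: lowercase it and strip everything except letters, digits, and whitespace.
# Apostrophes are stripped too, so "Earth's" becomes "earths".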
def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return text

scimed_clues_list = [clean_text(clue) for clue in scimed_clues_list]
scispace_clues_list = [clean_text(clue) for clue in scispace_clues_list]

print(f"Sci Med Clues: {scimed_clues_list}")
print("----------------")
print(f"Sci Space Clues: {scispace_clues_list}")

Sci Med Clues: ['the happy chemical in your brain i influence mood both joy and pain', 'from tryptophan my form takes flight a neurotransmitter shining bright', 'tiny creatures unseen by the eye i exist in numbers that reach for the sky', 'prokaryotic cells simple yet grand i shape the world on water and land', 'within your belly a bustling scene i aid digestion a microbiome team']
----------------
Sci Space Clues: ['to break earths grasp and touch the sky where rockets roar and dreams take flight so high', 'defying gravitys relentless hold a journey embarked a story to unfold among the stars new frontiers to see the boundless quest of humanity', 'a celestial eye orbiting bright beaming down signals day and through night', 'earths silent companion a technological feat spinning its dance in a rhythm so sweet whispers of data from distant space it charts the cosmos with unwavering grace']
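
The cleaned output above contains tokens such as `earths` and `gravitys`, because stripping the apostrophe glues the possessive `s` onto the word. If that matters for later vocabulary lookups, one possible variant is sketched below. The name `clean_text_possessive` is hypothetical, and the approach assumes straight apostrophes; note that it also turns contractions such as "it's" into "it".

```python
import re


def clean_text_possessive(text):
    """Lowercase, drop a trailing possessive 's, then strip other punctuation."""
    text = text.lower()
    # Remove 's at the end of a word: "earth's" -> "earth"
    text = re.sub(r"'s\b", "", text)
    # Remove anything that is not a letter, digit, or whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return text


print(clean_text_possessive("To break Earth's grasp"))
# to break earth grasp
```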


In [12]:
# Function to use TF-IDF to extract keywords from a list of clues
def tfidf_extract_keywords(clues_list):
    # Create a TF-IDF Vectorizer
    tfidf_vectorizer = TfidfVectorizer(stop_words="english")

    # Fit the vectorizer to the clues
    tfidf_matrix = tfidf_vectorizer.fit_transform(clues_list)

    # Get the feature names of `tfidf_vectorizer`
    feature_names = tfidf_vectorizer.get_feature_names_out()

    # Create a DataFrame of the `tfidf_matrix`
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

    # Get the highest-scoring term for each clue; on ties, idxmax returns the
    # first column, i.e. the alphabetically first term
    top_keywords = tfidf_df.idxmax(axis=1).values

    return top_keywords


# Extract keywords for 'sci.med' and 'sci.space' clues
scimed_keywords = tfidf_extract_keywords(scimed_clues_list)
scispace_keywords = tfidf_extract_keywords(scispace_clues_list)

print(f"Top keywords for 'sci.med' clues: {scimed_keywords}")
print("----------------")
print(f"Top keywords for 'sci.space' clues: {scispace_keywords}")

Top keywords for 'sci.med' clues: ['brain' 'bright' 'creatures' 'cells' 'aid']
----------------
Top keywords for 'sci.space' clues: ['break' 'boundless' 'beaming' 'charts']
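
Each top keyword above happens to be the alphabetically first non-stop word of its clue. A plausible explanation is that, in a corpus this small, most words occur once in a single clue and therefore tie for the highest TF-IDF score. The sketch below illustrates this with a hand-rolled TF-IDF; it assumes the smoothed IDF and L2 normalisation that scikit-learn uses by default, and the function name `tfidf_scores` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: score} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw score: term frequency times smoothed inverse document frequency
        raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        results.append({t: v / norm for t, v in raw.items()})
    return results


# Toy documents built from clue words, with stop words already removed
toy_docs = ["happy chemical brain mood", "tryptophan neurotransmitter bright"]
scores = tfidf_scores(toy_docs)

# Pick the highest score per document, breaking ties alphabetically
top = [min(s, key=lambda t: (-s[t], t)) for s in scores]
print(top)  # ['brain', 'bright']
```

Because every term in a toy document has the same score, the "top" keyword is decided purely by the tie-break. Fitting the vectorizer on a larger corpus (for example the newsgroup posts themselves) and then transforming the clues would likely give more informative scores.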


In [13]:
import numpy as np
from gensim.models import KeyedVectors

# Load pre-trained GloVe embeddings
glove_file = "glove.6B.100d.txt"
glove_embeddings = KeyedVectors.load_word2vec_format(
    glove_file, binary=False, no_header=True
)


# Function to get related keywords using GloVe embeddings
def get_related_keywords_glove(keywords, embeddings, n=5):
    related_keywords = []
    for keyword in keywords:
        try:
            # Get the most similar words
            similar_words = embeddings.most_similar(keyword, topn=n)
            related_keywords.extend([word for word, _ in similar_words])
        except KeyError:
            # Skip the keyword if it's not in the vocabulary
            continue

    return related_keywords

# Get related keywords for 'sci.med' and 'sci.space' clues using GloVe embeddings
scimed_related_keywords_glove = get_related_keywords_glove(scimed_keywords, glove_embeddings)
scispace_related_keywords_glove = get_related_keywords_glove(scispace_keywords, glove_embeddings)




print(f"Related keywords for 'sci.med' clues (GloVe): {scimed_related_keywords_glove}")
print("----------------")
print(
    f"Related keywords for 'sci.space' clues (GloVe): {scispace_related_keywords_glove}"
)



Related keywords for 'sci.med' clues (GloVe): ['tissue', 'spinal', 'tumor', 'brains', 'heart', 'dark', 'blue', 'colors', 'gray', 'light', 'creature', 'beasts', 'monsters', 'beings', 'animals', 'cell', 'tissues', 'tissue', 'embryonic', 'genes', 'assistance', 'humanitarian', 'relief', 'funding', 'efforts']
----------------
Related keywords for 'sci.space' clues (GloVe): ['breaking', 'set', 'broke', 'start', 'put', 'limitless', 'inexhaustible', 'unquenchable', 'insatiable', 'spontaneity', 'beamed', 'smiling', 'grinning', 'flashed', 'smiles', 'chart', 'billboard', 'albums', 'charting', 'charted']


## Hint 2:
Look for keywords that might link to other texts, reveal new concepts, or hint at hidden patterns within the data.

In [16]:
# Function to get related keywords using Word2Vec
def get_related_keywords(clues_list, keywords, n=5):
    # Tokenize the clues
    tokenized_clues = [clue.split() for clue in clues_list]

    # Create a Word2Vec model
    word2vec = Word2Vec(tokenized_clues, vector_size=100, window=5, min_count=1, sg=1)

    related_keywords = []
    for keyword in keywords:
        try:
            # Get the most similar keywords for each input keyword
            related = word2vec.wv.most_similar(positive=[keyword], topn=n)
            related_keywords.extend([w[0] for w in related])
        except KeyError:
            # Skip keywords not in the vocabulary
            pass

    return related_keywords


# Get related keywords for 'sci.med' and 'sci.space' clues
scimed_related_keywords = get_related_keywords(scimed_clues_list, scimed_keywords)
scispace_related_keywords = get_related_keywords(scispace_clues_list, scispace_keywords)

print(f"Related keywords for 'sci.med' clues: {scimed_related_keywords}")
print("----------------")
print(f"Related keywords for 'sci.space' clues: {scispace_related_keywords}")

Related keywords for 'sci.med' clues: ['pain', 'i', 'scene', 'digestion', 'chemical', 'by', 'a', 'prokaryotic', 'that', 'reach', 'influence', 'unseen', 'tryptophan', 'form', 'on', 'neurotransmitter', 'within', 'bustling', 'my', 'the', 'mood', 'flight', 'your', 'and', 'water']
----------------
Related keywords for 'sci.space' clues: ['in', 'its', 'quest', 'signals', 'gravitys', 'earths', 'silent', 'rockets', 'sky', 'the', 'roar', 'journey', 'take', 'from', 'defying', 'to', 'from', 'take', 'sweet', 'night']


# 5. Semantic Safari
* Exploring the World of Meaning: Word embeddings like Word2Vec or GloVe help us understand how words relate to each other.
* * Hint 3: Calculate similarities between your keywords and texts in other categories. Could there be unexpected connections?
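
One common way to compare a keyword with a whole text is to average the word vectors of the text and take the cosine similarity with the keyword's vector. The sketch below shows the arithmetic with plain Python lists. The three-dimensional `toy_embeddings` are invented for illustration and the helper names are hypothetical; with real embeddings the same logic would apply to the GloVe or Word2Vec vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(words, embeddings):
    """Average the vectors of the words that have an embedding; None if none do."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_embeddings = {
    "rocket": [0.9, 0.1, 0.0],
    "orbit": [0.8, 0.2, 0.1],
    "brain": [0.0, 0.9, 0.3],
}

text_vec = mean_vector("the rocket reached orbit".split(), toy_embeddings)
sim_space = cosine_similarity(toy_embeddings["rocket"], text_vec)
sim_med = cosine_similarity(toy_embeddings["brain"], text_vec)
print(sim_space > sim_med)  # True
```

Returning `None` for a text with no known words makes the "no vocabulary overlap" case explicit, so it cannot be mistaken for a genuine similarity of zero.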


In [17]:
# For each keyword, find its nearest-neighbour word in a Word2Vec model trained on the clues
def calculate_similarity(keywords, clues_list):
    # Tokenize the clues
    tokenized_clues = [clue.split() for clue in clues_list]

    # Create a Word2Vec model
    word2vec = Word2Vec(tokenized_clues, vector_size=100, window=5, min_count=1, sg=1)

    similarities = []
    for keyword in keywords:
        try:
            # Get the single most similar word (not clue) and its cosine similarity
            nearest = word2vec.wv.most_similar(positive=[keyword], topn=1)
            similarities.append((keyword, nearest[0][0], nearest[0][1]))
        except KeyError:
            # Skip keywords not in the vocabulary
            pass

    return similarities


# Calculate similarities between 'sci.med' keywords and 'sci.space' clues, and vice versa
scimed_similarities = calculate_similarity(scimed_keywords, scispace_clues_list)
scispace_similarities = calculate_similarity(scispace_keywords, scimed_clues_list)

print(
    f"Similarities between 'sci.med' keywords and 'sci.space' clues: {scimed_similarities}"
)
print("----------------")
print(
    f"Similarities between 'sci.space' keywords and 'sci.med' clues: {scispace_similarities}"
)

Similarities between 'sci.med' keywords and 'sci.space' clues: [('bright', 'space', 0.18466418981552124)]
----------------
Similarities between 'sci.space' keywords and 'sci.med' clues: []





In [22]:
# Calculate the average similarity score; an empty list gives NaN without a NumPy warning
def average_score(similarities):
    scores = [score for _, _, score in similarities]
    return np.mean(scores) if scores else float("nan")

scimed_similarity_score = average_score(scimed_similarities)
scispace_similarity_score = average_score(scispace_similarities)

print(f"Average similarity score for 'sci.med' clues: {scimed_similarity_score}")
print(f"Average similarity score for 'sci.space' clues: {scispace_similarity_score}")



Average similarity score for 'sci.med' clues: 0.18466418981552124
Average similarity score for 'sci.space' clues: nan




# Advanced Exploration: Transformers (Optional)
While Word2Vec and GloVe offer valuable insights, Transformer-based models can provide even more nuanced semantic understanding. These models go beyond individual word meanings and capture context-dependent relationships between words.

* **Exploring with Transformers**: Consider using pre-trained Transformers for tasks like question answering or text summarization.
* * Imagine you have a question related to the content you've analyzed. You could use a question-answering pipeline to find the answer within relevant texts.
* * Text summarization pipelines could be helpful for generating concise summaries of lengthy documents you encounter during your exploration.
* Explore the Transformers documentation (https://huggingface.co/docs/transformers/en/index) to discover more pipelines and fine-tune your exploration.

**Benefits and Considerations**:
* Transformers can potentially uncover deeper semantic relationships compared to traditional word embeddings.
* They often require more computational resources.
* The Transformers section below provides an example using question-answering and text-classification pipelines. You can experiment with other functionalities offered by Transformers based on your interests.

In [19]:
from transformers import pipeline


# Function to perform question answering using a pre-trained model
# (no model name is given, so the library default is used; the pipeline is rebuilt on every call)
def answer_question(question, context):
    answerer = pipeline("question-answering")
    answer = answerer(question=question, context=context)
    return answer


# Function to perform text classification using a pre-trained model
# (the default model is a sentiment classifier, so labels are POSITIVE/NEGATIVE, not topics)
def classify_text(text):
    classifier = pipeline("text-classification")
    result = classifier(text)
    return result


# Example usage for question answering
scimed_question = "What influences mood in the brain?"
scimed_context = " ".join(scimed_clues_list)
scimed_answer = answer_question(scimed_question, scimed_context)
print(f"Question: {scimed_question}")
print(f"Answer: {scimed_answer['answer']}")
print(f"Score: {scimed_answer['score']}")
print("----------------")

scispace_question = "What is the goal of space exploration?"
scispace_context = " ".join(scispace_clues_list)
scispace_answer = answer_question(scispace_question, scispace_context)
print(f"Question: {scispace_question}")
print(f"Answer: {scispace_answer['answer']}")
print(f"Score: {scispace_answer['score']}")
print("----------------")

# Example usage for text classification
scimed_text = "This text seems to be about neurotransmitters and their effects on mood."
scimed_classification = classify_text(scimed_text)
print(f"Text: {scimed_text}")
print(f"Classification: {scimed_classification[0]['label']}")
print(f"Score: {scimed_classification[0]['score']}")
print("----------------")

scispace_text = (
    "This text seems to be about space exploration and technological advancements."
)
scispace_classification = classify_text(scispace_text)
print(f"Text: {scispace_text}")
print(f"Classification: {scispace_classification[0]['label']}")
print(f"Score: {scispace_classification[0]['score']}")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



Question: What influences mood in the brain?
Answer: happy chemical
Score: 0.5135706067085266
----------------


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Question: What is the goal of space exploration?
Answer: to see the boundless quest of humanity
Score: 0.1151282787322998
----------------



Text: This text seems to be about neurotransmitters and their effects on mood.
Classification: NEGATIVE
Score: 0.9496364593505859
----------------
Text: This text seems to be about space exploration and technological advancements.
Classification: POSITIVE
Score: 0.9836159348487854


# 6. Pattern Pursuit
* **Cracking the Code**: Examine closely for unusual patterns within the texts – letter sequences, numbers, or anything resembling a code. Regular expressions will be your powerful ally.
## Hint 4:
Need help learning regular expressions? Check out this resource: https://docs.python.org/3/library/re.html
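
As a small illustration of the kind of code-like patterns mentioned above, the sketch below pulls dates, hyphenated codes, and digit-letter tokens out of a sentence. The sample string and the three pattern choices are assumptions made up for this example; they are not taken from the dataset.

```python
import re

# Hypothetical sample text, invented for illustration
sample = "Meet at pad 39A on 1969-07-16, reference code XK-42 or call 555-0134."

# ISO-style dates: four digits, two digits, two digits
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", sample)
# Codes: two or more capital letters, a hyphen, then digits
codes = re.findall(r"\b[A-Z]{2,}-\d+\b", sample)
# Mixed tokens: digits immediately followed by one capital letter
mixed = re.findall(r"\b\d+[A-Z]\b", sample)

print(dates)  # ['1969-07-16']
print(codes)  # ['XK-42']
print(mixed)  # ['39A']
```

Note that these patterns rely on capital letters and punctuation, so they would need to run on the raw text rather than on text that has already been lowercased and stripped.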

In [21]:
import re


# Function to detect patterns in clues using regular expressions
def detect_patterns(clues_list):
    patterns = []
    for clue in clues_list:
        # NOTE: all four patterns below are the same generic word pattern, so
        # each one matches every alphabetic word in the clue rather than a
        # specific kind of term.

        # Intended for chemical names (e.g., serotonin, tryptophan)
        chemical_pattern = r"\b[A-Za-z]+\b"
        chemicals = re.findall(chemical_pattern, clue)

        # Intended for scientific terms (e.g., neurotransmitter, microbiome)
        science_term_pattern = r"\b[A-Za-z]+\b"
        science_terms = re.findall(science_term_pattern, clue)

        # Intended for space-related terms (e.g., rockets, gravity, satellites)
        space_term_pattern = r"\b[A-Za-z]+\b"
        space_terms = re.findall(space_term_pattern, clue)

        # Intended for celestial bodies or phenomena (e.g., stars, cosmos)
        celestial_pattern = r"\b[A-Za-z]+\b"
        celestial_terms = re.findall(celestial_pattern, clue)

        if chemicals or science_terms or space_terms or celestial_terms:
            patterns.append(
                {
                    "clue": clue,
                    "chemicals": chemicals,
                    "science_terms": science_terms,
                    "space_terms": space_terms,
                    "celestial_terms": celestial_terms,
                }
            )

    return patterns


# Detect patterns in 'sci.med' and 'sci.space' clues
scimed_patterns = detect_patterns(scimed_clues_list)
scispace_patterns = detect_patterns(scispace_clues_list)

print("Detected patterns in 'sci.med' clues:")
for pattern in scimed_patterns:
    print(f"Clue: {pattern['clue']}")
    if pattern["chemicals"]:
        print(f"Chemicals: {pattern['chemicals']}")
    if pattern["science_terms"]:
        print(f"Science Terms: {pattern['science_terms']}")
    print("----------------")

print("Detected patterns in 'sci.space' clues:")
for pattern in scispace_patterns:
    print(f"Clue: {pattern['clue']}")
    if pattern["space_terms"]:
        print(f"Space Terms: {pattern['space_terms']}")
    if pattern["celestial_terms"]:
        print(f"Celestial Terms: {pattern['celestial_terms']}")
    print("----------------")

Detected patterns in 'sci.med' clues:
Clue: the happy chemical in your brain i influence mood both joy and pain
Chemicals: ['the', 'happy', 'chemical', 'in', 'your', 'brain', 'i', 'influence', 'mood', 'both', 'joy', 'and', 'pain']
Science Terms: ['the', 'happy', 'chemical', 'in', 'your', 'brain', 'i', 'influence', 'mood', 'both', 'joy', 'and', 'pain']
----------------
Clue: from tryptophan my form takes flight a neurotransmitter shining bright
Chemicals: ['from', 'tryptophan', 'my', 'form', 'takes', 'flight', 'a', 'neurotransmitter', 'shining', 'bright']
Science Terms: ['from', 'tryptophan', 'my', 'form', 'takes', 'flight', 'a', 'neurotransmitter', 'shining', 'bright']
----------------
Clue: tiny creatures unseen by the eye i exist in numbers that reach for the sky
Chemicals: ['tiny', 'creatures', 'unseen', 'by', 'the', 'eye', 'i', 'exist', 'in', 'numbers', 'that', 'reach', 'for', 'the', 'sky']
Science Terms: ['tiny', 'creatures', 'unseen', 'by', 'the', 'eye', 'i', 'exist', 'in', 'numb
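
The output above lists every word of each clue under every label, because the pattern `\b[A-Za-z]+\b` matches any alphabetic word. A more selective alternative is to build one whole-word alternation per category from an explicit term list, as sketched below. The term lists and the names `build_pattern` and `detect_terms` are hypothetical and hand-picked from the clue vocabulary; a real solution would need lists suited to the data.

```python
import re

# Hypothetical term lists, chosen by hand from the clue vocabulary.
# "gravitys" has no apostrophe because the clues were cleaned beforehand.
MED_TERMS = ["tryptophan", "neurotransmitter", "prokaryotic", "microbiome", "digestion"]
SPACE_TERMS = ["rockets", "gravitys", "orbiting", "celestial", "cosmos", "stars"]


def build_pattern(terms):
    """Compile a single whole-word pattern that matches any of the given terms."""
    return re.compile(r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b")


MED_PATTERN = build_pattern(MED_TERMS)
SPACE_PATTERN = build_pattern(SPACE_TERMS)


def detect_terms(clue):
    """Return the category terms found in one cleaned clue."""
    return {
        "med_terms": MED_PATTERN.findall(clue),
        "space_terms": SPACE_PATTERN.findall(clue),
    }


example = detect_terms(
    "from tryptophan my form takes flight a neurotransmitter shining bright"
)
print(example)
# {'med_terms': ['tryptophan', 'neurotransmitter'], 'space_terms': []}
```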

# 7. Collaboration and Convergence
* **Teamwork Makes the Dream Work**: How will your team share its findings and combine its insights? Discuss effective communication strategies.
* **The Final Puzzle**: Once all the clues are gathered, collaborate to solve the ultimate puzzle and locate the treasure!

# 8. Reflection and Report
* Document Your Journey: Your final report is crucial! It should include:
* * The methods and techniques you used at each stage.
* * A detailed explanation of what the provided code snippets do and why.
* * Insights on your collaboration process.
* Lessons Learned: Think about:
* * Which text processing techniques were most helpful and why?
* * How did vectorization empower you to find hidden connections?
* * What was the most surprising part of this adventure?
* * How could you use these skills for other problems in the real world?

# 8. Reflection Report - Text Treasure Hunt: The Vectorization Adventure

## Introduction
In this report, we document our journey through the "Text Treasure Hunt: The Vectorization Adventure," where our team embarked on a thrilling exploration of text processing and vectorization techniques to uncover hidden messages, solve puzzles, and discover the ultimate treasure. We utilized a range of methods and collaborated effectively to decipher clues and unravel the mysteries within the given texts.

## Methods and Techniques
### 1. Keyword Extraction using TF-IDF
- We employed the Term Frequency-Inverse Document Frequency (TF-IDF) technique to extract the most relevant keywords from the clues. The TF-IDF vectorizer was created using the scikit-learn library, which helped us identify the top keywords based on their importance in the clues.
- Explanation: The `tfidf_extract_keywords` function takes a list of clues as input, creates a TF-IDF vectorizer, fits it to the clues, and extracts the top keywords based on their TF-IDF scores. This technique allowed us to identify the most important and relevant keywords from the clues.

### 2. Semantic Analysis using Word2Vec and GloVe Embeddings
- We utilized the Word2Vec and GloVe embedding techniques to explore semantic relationships between words and find related keywords. The gensim library was used to load pre-trained GloVe embeddings and perform semantic analysis.
- Explanation: The `get_related_keywords_glove` function takes a list of keywords, pre-trained GloVe embeddings, and the desired number of related keywords as input. It uses the embeddings to find the most similar words to each keyword and returns a list of related keywords. This technique allowed us to discover hidden connections and expand our understanding of the clues.

### 3. Pattern Detection using Regular Expressions
- We employed regular expressions to detect specific patterns within the clues, such as chemical names, scientific terms, space-related terms, and celestial bodies or phenomena. The re library in Python was used to define and search for these patterns.
- Explanation: The `detect_patterns` function takes a list of clues as input and uses regular expressions to search for specific patterns within each clue. It looks for chemical names, scientific terms, space-related terms, and celestial bodies or phenomena. The detected patterns are then stored in a list of dictionaries, where each dictionary represents a clue and its corresponding patterns. This technique helped us uncover hidden patterns and extract relevant information from the clues.

### 4. Question Answering and Text Classification using Transformers
- We leveraged pre-trained transformer models from the transformers library to perform question answering and text classification tasks. The `pipeline` function was used to load pre-trained models and perform these tasks on the clues.
- Explanation: The `answer_question` function takes a question and a context as input, loads a pre-trained question-answering model using the `pipeline` function, and returns the answer along with its score. The `classify_text` function takes a text as input, loads a pre-trained text classification model, and returns the predicted label and its score. These techniques allowed us to extract specific information and gain insights from the clues.

## Collaboration Process
Our team collaborated effectively throughout the adventure. We held regular meetings to discuss our findings, share insights, and brainstorm ideas. We divided tasks among team members based on their strengths and expertise, ensuring that each stage of the adventure was tackled efficiently. We maintained open communication channels, using collaborative tools like GitHub and Slack to share code snippets, discuss challenges, and provide feedback to each other. The collaborative nature of our team played a crucial role in our success.

## Lessons Learned
1. The TF-IDF technique proved to be highly effective in extracting relevant keywords from the clues. It helped us identify the most important terms and focus our attention on the key aspects of each clue.

2. Vectorization techniques, such as Word2Vec and GloVe embeddings, empowered us to discover hidden connections between words and uncover semantic relationships. By representing words as dense vectors, we were able to explore similarities and find related concepts that provided valuable insights.

3. The most surprising part of this adventure was the power of pattern detection using regular expressions. By defining specific patterns, we were able to uncover hidden codes, identify relevant terms, and extract meaningful information from the clues. This technique opened up new possibilities for analyzing and interpreting the texts.

4. The skills and techniques learned during this adventure have wide-ranging applications in the real world. Text processing and vectorization can be used for various tasks, such as sentiment analysis, content recommendation, information retrieval, and more. These techniques can be applied in domains like marketing, customer service, research, and data analysis to gain insights, make informed decisions, and solve complex problems.

## Conclusion
The "Text Treasure Hunt: The Vectorization Adventure" was an exciting and enlightening experience for our team. Through the use of text processing techniques, vectorization, and collaborative problem-solving, we successfully navigated the challenges and uncovered hidden messages within the clues. The insights gained from this adventure have not only deepened our understanding of the power of natural language processing but have also equipped us with valuable skills that can be applied to a wide range of real-world problems. We look forward to leveraging these techniques in future endeavors and continuing to explore the vast possibilities of text analysis and machine learning.

In [1]:
!pip install TTS
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install transformers
!pip install pygame

In [None]:
import time
import datetime
import logging
from textwrap import dedent

import torch
from pygame import mixer
from transformers import pipeline
from TTS.api import TTS

# Initialize the TTS engine
device = "cuda:0" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.to(device)
pipe = None

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S')

# Initialize the mixer for audio playback
mixer.init()

def load_model():
    """Load the Zephyr 7B chat model as a text-generation pipeline."""
    return pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")

def generate_response(text):
    """Generate a response to the user's message, save it to a file, and speak it aloud via TTS."""

    messages = [
        {
            "role": "system",
            "content": "You are a friendly chatbot who is proficient in Science and Space. You can answer questions and provide information on these topics.",
        },
        {"role": "user", "content": text},
    ]
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=1024, do_sample=True, temperature=0.1, top_k=50, top_p=0.95)
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    output_file = f"output_{timestamp}.wav"

    with open("text_gen_output.txt", 'w', encoding='utf-8') as file:
        file.write(outputs[0]["generated_text"])
    unparsed_response = outputs[0]["generated_text"]
    parsed_response = unparsed_response.split("<|assistant|>\n")[-1]
    print(f"Response: {parsed_response}")
    try:
        tts.tts_to_file(text=parsed_response, file_path=output_file,speaker_wav="samples_en_sample.wav", language="en")
        logging.info(f"Processed text to {output_file}")
        # Play the output file
        mixer.music.load(output_file)
        mixer.music.play()
        while mixer.music.get_busy():  # wait for the audio to finish playing
            time.sleep(1)
    except Exception as e:
        logging.error(f"Error processing text: {str(e)}")
    return outputs[0]["generated_text"]

if __name__ == "__main__":
    pipe = load_model()
    generate_response(dedent("""Please answer the clues below.
Detected patterns in 'sci.med' from the newsgroup dataset:

Clue: the happy chemical in your brain i influence mood both joy and pain
Answer:

Clue: from tryptophan my form takes flight a neurotransmitter shining bright
Answer:

Clue: tiny creatures unseen by the eye i exist in numbers that reach for the sky
Answer:

Clue: prokaryotic cells simple yet grand i shape the world on water and land
Answer:

Clue: within your belly a bustling scene i aid digestion a microbiome team
Answer:


Detected patterns in 'sci.space' newsgroup dataset clues:
Clue: to break earths grasp and touch the sky where rockets roar and dreams take flight so high
Answer:

Clue: defying gravitys relentless hold a journey embarked a story to unfold among the stars new frontiers to see the boundless quest of humanity
Answer:

Clue: a celestial eye orbiting bright beaming down signals day and through night
Answer:

Clue: earths silent companion a technological feat spinning its dance in a rhythm so sweet whispers of data from distant space it charts the cosmos with unwavering grace
Answer:

"""))