In [1]:
fake_article = """
WHY SLEEP MATTERS 💡

Sleep is not just rest^1—it's a critical biological function. (1) Every night, your brain and body engage in vital repair processes. Short subtitle: SLEEP STAGES.

There are four main stages of sleep: N1, N2, N3, and REM. Each stage plays a unique role in cognitive function and memory consolidation. [2] “Missing out on deep sleep can affect how your brain stores memories,” says Dr. Luna Moon, a neurologist at BrainHealth Org. 🧠

Quick facts:
- 35% of adults report poor sleep
- REM sleep is when dreams occur
- Sleep deprivation impairs judgment

Interview with Dr. Luna Moon:
Q: What’s the #1 mistake people make about sleep?
A: They treat it like a luxury. It’s a necessity—like food or water.

🌟 ADVERTISEMENT: Try DreamEase pillows—engineered with NASA technology! Use code SLEEP2024 for 20% off.

Studies also show a strong connection between sleep quality and emotional regulation. Poor sleep increases risks of depression and anxiety. Researchers from the Sleep Science Institute^2 note that even a single night of bad sleep can elevate cortisol levels, impacting your mood for days. “Sleep is your emotional reset button,” Dr. Moon adds.

References:
1. Smith, J. (2020). Sleep and Brain Health. Sleep Journal.
2. Sleep Science Institute Annual Report, 2021.

About the Author:
Chris Napper is a freelance health writer with over 15 years of experience writing about wellness, nutrition, and sleep science.

(END)
"""


In [2]:
import pandas as pd
df = pd.DataFrame({'text': [fake_article]})


In [3]:
import nltk

nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [4]:
import pandas as pd
import re
from nltk.tokenize import sent_tokenize, word_tokenize

# Sample placeholder: Replace with actual load step
# df = pd.read_csv('raw_articles.csv')  # Should contain a 'text' column

def is_relevant(text):
    # Basic filter for irrelevant articles (customize as needed)
    text_lower = text.lower()
    return any(keyword in text_lower for keyword in ["sleep", "night", "dream"])

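def clean_text(text):
    # Remove bracketed and parenthesized footnote markers, e.g. [1] and (1).
    # Caret-style markers such as ^1 are not matched here: the symbol filter
    # below drops the caret and leaves the digit behind.
    text = re.sub(r"\[\d+\]", "", text)
    text = re.sub(r"\(\d+\)", "", text)

    # Cut everything from the first trailing-section heading onward.
    # The heading must start a line, so the word "references" inside a
    # sentence does not truncate the article.
    text = re.split(
        r"(?im)^[ \t]*(?:references|about the author|bibliography)\b", text, maxsplit=1
    )[0]

    # Replace every character outside this ASCII whitelist with a space.
    # This also drops emoji, curly quotes, and long dashes.
    text = re.sub(r"[^a-zA-Z0-9\s.,!?'-]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()

    return text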

def filter_sentences(sentences):
    # Drop exact duplicates and very short sentences. The threshold is 5 tokens
    # from word_tokenize, which counts punctuation marks as separate tokens.
    seen = set()
    filtered = []
    for s in sentences:
        s_clean = s.strip()
        if len(word_tokenize(s_clean)) >= 5 and s_clean not in seen:
            seen.add(s_clean)
            filtered.append(s_clean)
    return filtered

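def chunk_into_passages(sentences, min_words=100, max_words=150):
    # Accumulate whole sentences until the running token count reaches
    # min_words, then emit the passage. max_words is only a nominal target:
    # sentences are never split, so a long final sentence can push a passage
    # past it. Tokens are re-joined with single spaces, so punctuation ends up
    # space-separated (e.g. "luxury .").
    passages = []
    current = []
    for sent in sentences:
        current += word_tokenize(sent)
        if len(current) >= min_words:
            passages.append(" ".join(current))
            current = []
    if current:
        passages.append(" ".join(current))  # Trailing passage may be shorter than min_words
    return passages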

!pip install rake-nltk
import nltk
from rake_nltk import Rake

nltk.download('stopwords')  # RAKE uses NLTK's stopword list to split candidate phrases
rake = Rake()

def generate_title_with_keywords(text):
    rake.extract_keywords_from_text(text)
    keywords = rake.get_ranked_phrases()
    return keywords[0].title() if keywords else "Untitled"




all_passages = []

for idx, row in df.iterrows():
    text = row["text"]

    if not is_relevant(text):
        print(f"Skipping row {idx}: no sleep-related keywords found")
        continue  # Skip irrelevant articles

    cleaned = clean_text(text)
    sentences = sent_tokenize(cleaned)
    filtered_sentences = filter_sentences(sentences)
    passages = chunk_into_passages(filtered_sentences)

    all_passages.extend(passages)
print("Total passages:", len(all_passages))
# Convert to DataFrame
corpus_df = pd.DataFrame({"passage": all_passages})
corpus_df["index"] = corpus_df.index
corpus_df["title"] = corpus_df["passage"].apply(generate_title_with_keywords)
corpus_df = corpus_df[["index", "passage", "title"]]

print("Total clean passages:", len(corpus_df))
print(corpus_df.head())


Collecting rake-nltk
  Downloading rake_nltk-1.0.6-py3-none-any.whl.metadata (6.4 kB)
Downloading rake_nltk-1.0.6-py3-none-any.whl (9.1 kB)
Installing collected packages: rake-nltk
Successfully installed rake-nltk-1.0.6
Total passages: 2
Total clean passages: 2
   index                                            passage  \
0      0  WHY SLEEP MATTERS Sleep is not just rest 1 it ...   
1      1  A They treat it like a luxury . It s a necessi...   

                                            title  
0    Sleep Deprivation Impairs Judgment Interview  
1  Advertisement Try Dreamease Pillows Engineered  


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
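
The recorded passages still contain a stray footnote digit ("rest 1 it") and a split contraction ("It s"), because `clean_text` does not match caret-style markers and its ASCII whitelist removes curly apostrophes. One possible remedy is a small normalization pass that runs on the raw text before `clean_text`. The sketch below is only an illustration: the name `normalize_before_cleaning` and the replacement table `PUNCT_MAP` are assumptions, not part of the original notebook.

```python
import re

# Hypothetical mapping from common Unicode punctuation to ASCII characters
# that the whitelist in clean_text already keeps (apostrophe and hyphen).
PUNCT_MAP = str.maketrans({
    "\u2019": "'",    # curly apostrophe / right single quote
    "\u2018": "'",    # left single quote
    "\u2014": " - ",  # long (em) dash
    "\u2013": "-",    # en dash
})

def normalize_before_cleaning(text):
    """Hypothetical helper: intended to run on raw text before clean_text."""
    # Map curly apostrophes and long dashes to ASCII so they survive the whitelist.
    text = text.translate(PUNCT_MAP)
    # Remove caret-style footnote markers such as ^1 or ^12.
    text = re.sub(r"\^\d+", "", text)
    return text

sample = "Sleep is not just rest^1\u2014it\u2019s a critical function."
print(normalize_before_cleaning(sample))
```

In the loop this would be used as `cleaned = clean_text(normalize_before_cleaning(text))`. Advertisement lines are untouched by this pass and would need their own filter.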

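`chunk_into_passages` closes a passage once it reaches `min_words` and never checks the upper bound, and it joins tokens rather than sentences. If a hard cap and natural spacing matter, one option is to pack whole sentences greedily. The sketch below makes several assumptions: `chunk_sentences` is a hypothetical name, words are counted with `str.split` instead of `word_tokenize`, and a single sentence longer than `max_words` is emitted on its own rather than split.

```python
def chunk_sentences(sentences, min_words=100, max_words=150):
    """Hypothetical variant: greedy sentence packing with a hard word cap."""
    passages = []
    current = []  # sentences in the passage being built
    count = 0     # whitespace-delimited words in `current`
    for sent in sentences:
        n = len(sent.split())
        # Close the current passage if adding this sentence would exceed the cap.
        if current and count + n > max_words:
            passages.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
        # Close as soon as the passage is long enough.
        if count >= min_words:
            passages.append(" ".join(current))
            current, count = [], 0
    if current:
        passages.append(" ".join(current))  # trailing passage may be short
    return passages

demo = ["one two three.", "four five.", "six seven eight nine."]
print(chunk_sentences(demo, min_words=4, max_words=5))
```

The trade-off is that a passage may fall below `min_words` when the next sentence would overflow the cap. Whether that is acceptable depends on how the passages are consumed downstream.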