# Data Preprocessing Lab: Time Series and Text (NLP)

This lab covers two core preprocessing workflows:

1. Time Series preprocessing (indexing, resampling, handling missing data)
2. Text preprocessing (cleaning, tokenization, vectorization)

## Part 1: Time Series Preprocessing

We will use a synthetic dataset simulating daily temperature readings from 1 January to 30 April 2023 (120 days).

In [None]:
import pandas as pd
import numpy as np

# Generate synthetic time series data
np.random.seed(42)
date_rng = pd.date_range(start="2023-01-01", end="2023-04-30", freq="D")
temperature = np.random.normal(loc=15, scale=5, size=len(date_rng))
df_ts = pd.DataFrame({"Date": date_rng, "Temperature": temperature})
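# Blank out 10 distinct rows (replace=False avoids picking the same row twice)
missing_idx = np.random.choice(df_ts.index, size=10, replace=False)
df_ts.loc[missing_idx, "Temperature"] = np.nan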
df_ts.set_index("Date", inplace=True)

df_ts.head()

### Plot the Time Series

In [None]:
import matplotlib.pyplot as plt

df_ts.plot(figsize=(10, 4), title="Daily Temperature")
plt.ylabel("Temperature (°C)")
plt.grid(True)
plt.show()

### Fill Missing Values and Resample

In [None]:
# Fill gaps by linear interpolation (the pandas default method)
df_filled = df_ts.interpolate()

# Resample to weekly means (with "W", each week ends on Sunday)
df_weekly = df_filled.resample("W").mean()

df_weekly.head()
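
To see what these two calls do, it can help to run them on a series small enough to check by hand. The sketch below uses a hypothetical five-day series (`toy`, which is not part of the lab data) starting on Sunday 2023-01-01, and relies on the pandas defaults: linear interpolation and `"W"` bins that end on Sunday.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: five daily readings with a two-day gap
idx = pd.date_range("2023-01-01", periods=5, freq="D")
toy = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0], index=idx, name="Temperature")

# Linear interpolation fills the gap with evenly spaced values (12 and 14)
toy_filled = toy.interpolate()

# "W" groups days into weeks ending on Sunday, labelled by that Sunday
toy_weekly = toy_filled.resample("W").mean()

print(toy_filled)
print(toy_weekly)
```

The first weekly bin holds only Sunday 1 January (mean 10) and the second holds 2 to 5 January (mean 15), so the first and last rows of a weekly resample can be based on fewer than seven days.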

### Exercise 1: Time Series Handling


- Count the number of missing values in `df_ts`
- Fill missing values using forward fill instead of interpolation
- Plot both original and filled time series for comparison


In [None]:
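# Forward fill: each gap takes the last observed value
# (DataFrame.ffill() replaces the deprecated fillna(method="ffill"))
df_ffill = df_ts.ffill()
# TODO: count the missing values in df_ts and plot df_ts against df_ffill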
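
As a rough illustration of the difference between the two fill strategies, the sketch below counts the gaps in a hypothetical six-day frame (`demo`, an assumption made for this example and not the lab data) and puts forward fill and linear interpolation side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical six-day frame with one single-day gap and one two-day gap
idx = pd.date_range("2023-01-01", periods=6, freq="D")
demo = pd.DataFrame(
    {"Temperature": [14.0, np.nan, 17.0, np.nan, np.nan, 11.0]}, index=idx
)

# isna() marks missing cells; summing the booleans counts them
n_missing = int(demo["Temperature"].isna().sum())

demo_ffill = demo.ffill()         # repeat the last observed value
demo_interp = demo.interpolate()  # draw a straight line across each gap

comparison = pd.DataFrame(
    {
        "original": demo["Temperature"],
        "ffill": demo_ffill["Temperature"],
        "interpolate": demo_interp["Temperature"],
    }
)
print(f"Missing values: {n_missing}")
print(comparison)
```

Forward fill produces flat steps (17, 17, 17 before the drop to 11), while interpolation moves gradually (17, 15, 13, 11). Which is preferable depends on whether the quantity is expected to change smoothly between observations.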

## Part 2: Text Preprocessing (NLP)

We will use a small corpus of text to explore common text preprocessing techniques.

In [None]:
documents = [
    "Natural Language Processing (NLP) is a subfield of AI.",
    "It focuses on understanding and generating human language.",
    "Text preprocessing is a crucial step in NLP pipelines.",
    "Common techniques include tokenization, stopword removal, and stemming.",
]

### Basic Text Cleaning

In [None]:
import re


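def clean_text(text):
    # Lowercase, then delete everything except letters and whitespace
    # (digits and punctuation are dropped, so "(NLP)" becomes "nlp")
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return text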


cleaned_docs = [clean_text(doc) for doc in documents]
cleaned_docs
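
One side effect of deleting characters outright is that hyphenated or punctuation-joined words get fused together. The sketch below contrasts the lab's `clean_text` with a hypothetical variant, `clean_text_spaced` (a name introduced here for illustration only), that replaces unwanted characters with a space and then collapses repeated whitespace.

```python
import re


def clean_text(text):
    # Same logic as the lab's function: delete non-letter, non-space characters
    text = text.lower()
    return re.sub(r"[^a-z\s]", "", text)


def clean_text_spaced(text):
    # Hypothetical variant: swap unwanted characters for a space instead
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()


sample = "State-of-the-art NLP, circa 2023!"
print(repr(clean_text(sample)))         # 'stateoftheart nlp circa '
print(repr(clean_text_spaced(sample)))  # 'state of the art nlp circa'
```

Neither behaviour is correct in general: fusing keeps compound terms as one token, while spacing keeps their parts searchable. The choice depends on the downstream task.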

### Tokenization and Stopword Removal

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def tokenize(text):
    # Split on whitespace and drop scikit-learn's built-in English stop words
    return [word for word in text.split() if word not in ENGLISH_STOP_WORDS]


tokenized_docs = [tokenize(doc) for doc in cleaned_docs]
tokenized_docs
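
The whitespace-based `tokenize` above can be compared with the analyzer that scikit-learn builds internally. The following is a minimal sketch, assuming the default `CountVectorizer` settings apart from `stop_words="english"`; the `sample` string is the cleaned form of the third lab document.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer


def tokenize(text):
    # Same logic as the lab's function
    return [word for word in text.split() if word not in ENGLISH_STOP_WORDS]


sample = "text preprocessing is a crucial step in nlp pipelines"

manual_tokens = tokenize(sample)

# build_analyzer() returns the callable the vectorizer applies to each document:
# lowercasing, regex tokenization, then stop-word filtering
analyzer = CountVectorizer(stop_words="english").build_analyzer()
analyzer_tokens = analyzer(sample)

print(manual_tokens)
print(analyzer_tokens)
```

On already-cleaned text the two approaches tend to agree. They can diverge on raw text, because the default analyzer splits on punctuation and ignores single-character tokens, whereas `str.split()` only splits on whitespace.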

### Vectorization with CountVectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

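# CountVectorizer lowercases and tokenizes on its own, so it takes the cleaned
# strings; stop words stay in the vocabulary unless stop_words="english" is set
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(cleaned_docs)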
pd.DataFrame(X_counts.toarray(), columns=vectorizer.get_feature_names_out())

### Exercise 2: Text Preprocessing


- Write a function to clean and tokenize a list of new documents
- Use `TfidfVectorizer` to vectorize them
- Print the TF-IDF matrix


In [None]:
# TODO: Add your own documents to new_docs and preprocess them
# (TfidfVectorizer lowercases and tokenizes each document itself)
from sklearn.feature_extraction.text import TfidfVectorizer

new_docs = [
    "Language models are powerful tools.",
    "Preprocessing affects model performance.",
]
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(new_docs)
pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())
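
The TF-IDF weights can also be reproduced by hand, which helps when interpreting the matrix. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm="l2"`, raw term counts), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. The two sentences in `toy_docs` are hypothetical and share the word "language" so that the weights differ across terms.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus: "language" appears in both documents
toy_docs = [
    "language models are powerful tools",
    "language preprocessing affects model performance",
]

# Step 1: raw term counts (same vocabulary and column order as TfidfVectorizer)
counts = CountVectorizer(stop_words="english").fit_transform(toy_docs).toarray()

# Step 2: smoothed inverse document frequency per term
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: weight the counts, then scale each row to unit L2 length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Reference result from scikit-learn with default settings
reference = TfidfVectorizer(stop_words="english").fit_transform(toy_docs).toarray()
print(np.allclose(manual_tfidf, reference))
```

Because "language" occurs in both documents, its IDF is the minimum value of 1, so it receives a lower weight within each row than the terms unique to that document.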