# 🎙️ Audio-to-Vector Feature Extraction

## Pipeline: ASR → Text Cleaning → TF-IDF Vectorization

This notebook demonstrates a complete NLP pipeline for transforming raw audio signals into structured numerical vectors suitable for machine learning models.

---

## 🛠️ Step 2: Extracting TF-IDF Embeddings from Audio Transcripts

In this section, we convert spoken language into a machine-readable format by combining **Automatic Speech Recognition (ASR)** with **Natural Language Processing (NLP)** vectorization techniques.

### 📋 The Workflow

1. **Audio Ingestion:** Load raw audio files (e.g., `.wav`, `.mp3`) into the environment.
2. **ASR Conversion:** Use **Automatic Speech Recognition** to transcribe the audio signal into raw text strings.
3. **Text Preprocessing:** Use the `re` (regular expression) library to:
   * **Normalize** text by converting it to lowercase.
   * **Sanitize** whitespace by collapsing runs of spaces, tabs, and newlines.
   * **Preserve** punctuation, since the cleaning step is deliberately non-destructive.

4. **Feature Extraction:** Apply `TfidfVectorizer` to transform the cleaned transcripts into **TF-IDF (Term Frequency-Inverse Document Frequency)** vectors, capturing the statistical importance of each word.

## 1: Import the libraries
Let's import the libraries needed for this step.

### Install libraries that are not already installed

In [1]:
!pip install openai-whisper

Collecting openai-whisper
  Downloading openai_whisper-20250625.tar.gz (803 kB)
     803.2/803.2 kB 15.1 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: openai-whisper
  Building wheel for openai-whisper (pyproject.toml) ... done
  Created wheel 

In [2]:
import numpy as np          # High-performance mathematical operations on arrays
import pandas as pd         # Structured data manipulation and CSV handling
import re                   # Regular expression operations for text cleaning and pattern matching
import os                   # Interacting with the operating system and managing file paths
from tqdm import tqdm       # Visual progress bars for loops and long-running tasks
from typing import Dict, Tuple # Type hinting for better code documentation and IDE support

# Machine Learning - Feature Extraction
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
# Warning Management
import warnings
warnings.filterwarnings('ignore') # Suppress non-critical alerts and deprecation warnings

In [4]:
# Import OpenAI's Whisper library for transcription
import whisper
print(whisper.__version__)  # Prints the installed version, confirming the correct package

20250625


## 2: Load the Dataset
We download the dataset with `kagglehub`. The dataset handle is read from a Colab secret named `DATASET`, so it never appears in the notebook, and the download avoids manual file uploads.

In [5]:
from google.colab import userdata
dataset = userdata.get('DATASET')

In [None]:
import kagglehub
path = kagglehub.dataset_download(dataset)

### Read the csv files
Let's read the data into pandas DataFrames.

In [8]:
# Base path for the dataset structure
base_path = f"{path}/dataset"

# Load the training and testing datasets
# Using f-strings for clean, dynamic path formatting
train_df = pd.read_csv(f"{base_path}/train.csv")
test_df = pd.read_csv(f"{base_path}/test.csv")

In [9]:
# Quick verification of data integrity
print(f"Train set loaded: {train_df.shape}")
print(f"Test set loaded:  {test_df.shape}")

Train set loaded: (444, 2)
Test set loaded:  (195, 1)


## 3: EDA
Let's do some basic analysis.

In [10]:
# view top 5 rows
train_df.head()

Unnamed: 0,filename,label
0,audio_1261.wav,1.0
1,audio_942.wav,1.5
2,audio_1110.wav,1.5
3,audio_1024.wav,1.5
4,audio_538.wav,2.0


In [11]:
# check for null values
train_df.isna().sum()

Unnamed: 0,0
filename,0
label,0


In [12]:
train_df.label.value_counts()

label,count
5.0,110
3.0,87
2.0,70
4.5,58
4.0,52
2.5,40
3.5,23
1.5,3
1.0,1


In [None]:
# Build an example audio path for later use.
# A fixed file keeps the outputs below reproducible.
example_path = f'{base_path}/audios_train/audio_77.wav'
example_path

## 4: Text Extraction from Audio Files (ASR)
In this section, we convert raw audio signals into text transcripts. This process is known as Automatic Speech Recognition (ASR). We will implement three modular helper functions to ensure the process is scalable and handles errors gracefully.

### Load the whisper model
Load the model used for transcription.

In [18]:
# Load the model
model = whisper.load_model("base")

100%|███████████████████████████████████████| 139M/139M [00:01<00:00, 88.5MiB/s]


In [None]:
# Let's test the model on the example file
result = model.transcribe(example_path)

In [21]:
example_text = result['text']
example_text

" All right. A crowded market is where lots of women and young men go to exchange groups and services. It's a lot of people in the market. The farmers themselves, the staff, some self-reshable groups, some self-non-reshable groups, some self-splotting groups, and other stuff. So, in the market, it's always rowdy and you'll be hearing stuff like, buy your things here, buy your thoughts here. What do you want to buy? You get to and then... You like to buy your own things. You're in the same time."

In [34]:
result

{'text': " All right. A crowded market is where lots of women and young men go to exchange groups and services. It's a lot of people in the market. The farmers themselves, the staff, some self-reshable groups, some self-non-reshable groups, some self-splotting groups, and other stuff. So, in the market, it's always rowdy and you'll be hearing stuff like, buy your things here, buy your thoughts here. What do you want to buy? You get to and then... You like to buy your own things. You're in the same time.",
 'segments': [{'id': 0,
   'seek': 0,
   'start': 0.0,
   'end': 2.0,
   'text': ' All right.',
   'tokens': [50364, 1057, 558, 13, 50464],
   'temperature': 0.0,
   'avg_logprob': -0.5252492951183785,
   'compression_ratio': 1.2477064220183487,
   'no_speech_prob': 0.21202288568019867},
  {'id': 1,
   'seek': 0,
   'start': 2.0,
   'end': 15.0,
   'text': ' A crowded market is where lots of women and young men',
   'tokens': [50464,
    316,
    21634,
    2142,
    307,
    689,
   

### 🔍 Why We Extract Acoustic Metadata

In machine learning, especially when working with **Automatic Speech Recognition (ASR)**, the raw text is only half the story. The metadata provided by Whisper acts as a **quality control layer** for your features.

While the **Text** provides the semantic content, the **Metadata** provides essential context regarding the "confidence" and "reliability" of that text. These metrics can be used as auxiliary inputs to machine learning models to improve prediction accuracy.

> 🧪 Key Metrics Explained

* **`avg_logprob` (Average Log Probability):** The mean log-probability of the decoded tokens, a proxy for the model's confidence in its transcription. More negative values suggest potential errors or background noise.
* **`temperature`:** The sampling temperature that produced the segment. Whisper starts at 0.0 and retries with higher temperatures when a decoding attempt fails its quality checks, so a non-zero value indicates the model struggled to find a clear path.
* **`compression_ratio`:** How well the text compresses, which measures its redundancy. High ratios often flag repetitive "loops" or failed transcriptions.
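
As a rough illustration of how these metrics can act as a quality gate, the sketch below flags segments whose values cross a cut-off. The segment dictionaries are made-up values and `flag_unreliable_segments` is a hypothetical helper. The cut-offs (-1.0 for `avg_logprob`, 2.4 for `compression_ratio`) mirror the defaults Whisper's `transcribe` uses for its own fallback logic; treat them as assumptions to tune rather than recommended values for this dataset.

```python
from typing import Dict, List


def flag_unreliable_segments(
    segments: List[Dict],
    logprob_threshold: float = -1.0,           # assumed cut-off
    compression_ratio_threshold: float = 2.4,  # assumed cut-off
) -> List[int]:
    """Return the ids of segments whose quality metrics look suspicious."""
    flagged = []
    for seg in segments:
        low_confidence = seg["avg_logprob"] < logprob_threshold
        repetitive = seg["compression_ratio"] > compression_ratio_threshold
        if low_confidence or repetitive:
            flagged.append(seg["id"])
    return flagged


# Made-up segments, shaped like the entries in result['segments']
toy_segments = [
    {"id": 0, "avg_logprob": -0.35, "compression_ratio": 1.25},
    {"id": 1, "avg_logprob": -1.40, "compression_ratio": 1.10},  # low confidence
    {"id": 2, "avg_logprob": -0.50, "compression_ratio": 3.20},  # repetitive text
]

flagged_ids = flag_unreliable_segments(toy_segments)
print(flagged_ids)
```

In this notebook the metrics are averaged per file rather than inspected per segment, so a check like this would be an optional extra, not part of the pipeline shown here.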



### Helper function 1
 **`get_features_from_transcript`**: Acts as a **parser** that extracts the raw text and key quality metrics (confidence levels and compression ratios) from the complex dictionary returned by Whisper.

In [29]:
def get_features_from_transcript(transcript: Dict) -> list:
    """
    Parses the Whisper output dictionary to extract text and acoustic metadata.

    Returns [text, avg_logprob, temperature, compression_ratio]; the three
    metrics are averaged over segments and are NaN when no segments exist.
    """
    text = transcript.get('text', "")

    # One row per segment; an empty list yields an empty DataFrame
    segments_df = pd.DataFrame(transcript.get('segments', []))

    # Select only numeric columns and compute per-column means
    numeric_cols = segments_df.select_dtypes(include='number')
    if numeric_cols.empty:
        print("No numeric columns found in segments DataFrame")

    segments_mean = numeric_cols.mean().to_dict()
    avg_logprob = segments_mean.get('avg_logprob', np.nan)
    temp = segments_mean.get('temperature', np.nan)
    comp_ratio = segments_mean.get('compression_ratio', np.nan)

    return [text, avg_logprob, temp, comp_ratio]

In [31]:
# Let's test our helper function
get_features_from_transcript(transcript=result)

[" All right. A crowded market is where lots of women and young men go to exchange groups and services. It's a lot of people in the market. The farmers themselves, the staff, some self-reshable groups, some self-non-reshable groups, some self-splotting groups, and other stuff. So, in the market, it's always rowdy and you'll be hearing stuff like, buy your things here, buy your thoughts here. What do you want to buy? You get to and then... You like to buy your own things. You're in the same time.",
 -0.6297709218101206,
 0.0,
 1.4300646290520846]

### Helper function 2

* **`create_df_from_audio_file`**: Serves as the **primary worker**; it runs the ASR model on a single file, handles potential errors (like corrupted audio), and packages the results into a clean, labeled DataFrame row.

In [40]:
def create_df_from_audio_file(file_path: str) -> pd.DataFrame:
    """
    Transcribes a single audio file and returns a structured DataFrame row.
    """
    columns = ['text', 'avg_logprob', 'temperature', 'compression_ratio']

    try:
        # Perform transcription using the Whisper model
        result = model.transcribe(file_path, language='en')
        features = get_features_from_transcript(result)
        return pd.DataFrame([features], columns=columns)

    except Exception as e:
        print(f"⚠️ Error processing {file_path}: {e}")
        return pd.DataFrame([[None] * len(columns)], columns=columns)

In [33]:
create_df_from_audio_file(example_path)

Unnamed: 0,text,avg_logprob,temperature,compression_ratio
0,All right. A crowded market is where lots of ...,-0.629771,0.0,1.430065


### Helper function 3
**`batch_transcribe`**: Orchestrates the **bulk processing** of your entire dataset, using a progress bar to track status while combining individual results into one final, structured master DataFrame.

In [35]:
def batch_transcribe(file_paths: list) -> pd.DataFrame:
    """
    Processes a list of files with a progress bar and returns a concatenated DataFrame.
    """
    results = []
    for path in tqdm(file_paths, desc="Transcribing Audio"):
        results.append(create_df_from_audio_file(path))

    return pd.concat(results, ignore_index=True)

In [36]:
# Let's check with a single file
batch_transcribe([example_path])

Transcribing Audio: 100%|██████████| 1/1 [00:01<00:00,  1.74s/it]


Unnamed: 0,text,avg_logprob,temperature,compression_ratio
0,All right. A crowded market is where lots of ...,-0.629771,0.0,1.430065


### Transcribing all audio and collecting metadata

In [44]:
all_files_path = train_df.filename.apply(lambda x: f'{base_path}/audios_train/{x}')

In [45]:
len(all_files_path)

444

In [46]:
df = batch_transcribe(all_files_path)

No numeric columns found in segments DataFrame
(message printed 16 times during the run, interleaved with progress updates)

Transcribing Audio: 100%|██████████| 444/444 [19:46<00:00,  2.67s/it]


In [48]:
# let's see some rows
df.head()

Unnamed: 0,text,avg_logprob,temperature,compression_ratio
0,My favorite hobby is cultivation of plants su...,-0.477353,0.0,1.463235
1,The playground looks like very clear and neat...,-0.623653,0.0,1.081016
2,My goal is to become an electrical employee a...,-0.318257,0.0,1.412077
3,My favorite place is in Andhra Pradesh. It is...,-0.486152,0.0,1.247096
4,"My favorite places, my favorite places, Mutti...",-0.565608,0.0,1.488246


In [49]:
# check the shape
df.shape

(444, 4)

In [50]:
# check for null values
df.isna().sum()

Unnamed: 0,0
text,0
avg_logprob,16
temperature,16
compression_ratio,16
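
The 16 missing values in each metadata column line up with the 16 `No numeric columns found` messages printed during transcription, most likely files for which Whisper returned no segments. One possible way to handle them, sketched here on a made-up frame, is to record which rows were affected and then fill the gaps with the column median. The toy values and the `no_segments` column name are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the transcription DataFrame
toy = pd.DataFrame({
    "text": ["hello there", "", "good morning"],
    "avg_logprob": [-0.4, np.nan, -0.6],
    "temperature": [0.0, np.nan, 0.0],
    "compression_ratio": [1.2, np.nan, 1.4],
})
meta_cols = ["avg_logprob", "temperature", "compression_ratio"]

# Keep a record of which rows had no metadata before filling them
toy["no_segments"] = toy[meta_cols].isna().all(axis=1).astype(int)

# Fill gaps with the per-column median of the observed values
toy[meta_cols] = toy[meta_cols].fillna(toy[meta_cols].median())

print(toy[meta_cols + ["no_segments"]])
```

If this approach were used on the real data, the medians should be computed on the training set only and reused for the test set.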


## 5: Basic Text Cleaning 🧹

In a **Grammar Ranking** task, the structure of the sentence is just as important as the words themselves. Therefore, our cleaning process is "non-destructive." We focus on normalizing the format without removing the grammatical markers (like punctuation) that the model needs to evaluate correctness.

### 📝 Cleaning Strategy:

* **Lowercasing:** Standardizes the vocabulary to reduce the sparsity of the TF-IDF matrix.
* **Whitespace Normalization:** Uses Regex (`\s+`) to collapse multiple spaces, tabs, or newlines into a single space.
* **Trimming:** Removes leading and trailing whitespace that often appears at the beginning or end of ASR outputs.

In [51]:
def clean_text(text: str) -> str:
    """
    Performs non-destructive cleaning to preserve grammatical structure
    while normalizing whitespace and casing.
    """
    if not isinstance(text, str):
        return ""

    # Convert to lowercase for vocabulary consistency
    text = text.lower()

    # Replace multiple whitespaces/newlines with a single space
    text = re.sub(r"\s+", " ", text)

    # Remove whitespace from the start and end of the string
    return text.strip()


In [53]:
# Apply the cleaning to our transcribed text
df['cleaned_text'] = df['text'].apply(clean_text)

In [54]:
df.head()

Unnamed: 0,text,avg_logprob,temperature,compression_ratio,cleaned_text
0,My favorite hobby is cultivation of plants su...,-0.477353,0.0,1.463235,my favorite hobby is cultivation of plants suc...
1,The playground looks like very clear and neat...,-0.623653,0.0,1.081016,the playground looks like very clear and neat ...
2,My goal is to become an electrical employee a...,-0.318257,0.0,1.412077,my goal is to become an electrical employee an...
3,My favorite place is in Andhra Pradesh. It is...,-0.486152,0.0,1.247096,my favorite place is in andhra pradesh. it is ...
4,"My favorite places, my favorite places, Mutti...",-0.565608,0.0,1.488246,"my favorite places, my favorite places, mutti ..."


## 6: Text-to-Embedding Vectorization (TF-IDF)

Now that we have cleaned transcripts, we transform the text into a numerical format that a **TensorFlow** model can consume. We use **TF-IDF (Term Frequency-Inverse Document Frequency)**, which gives more weight to terms that are distinctive to a transcript and less weight to terms that appear in almost every transcript.
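
To make the weighting concrete, here is a minimal pure-Python sketch of the calculation on a made-up three-document corpus. It assumes the smoothed IDF formula and L2 normalization that scikit-learn documents for `TfidfVectorizer`'s default settings, and it uses naive whitespace tokenization; the corpus and the helper names `smoothed_idf` and `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

docs = [
    "my favorite place is the market",
    "my favorite hobby is reading",
    "the market is crowded",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def smoothed_idf(term: str) -> float:
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0


def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = {t: c * smoothed_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 normalization
    return {t: w / norm for t, w in raw.items()}


weights = tfidf_vector(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {w:.3f}")
```

In the first document, "place" (which appears in only one document) receives the largest weight, while "is" (which appears in all three) receives the smallest.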

### 📐 Configuration Logic

* **N-gram Range (1, 2):** We capture both individual words and two-word phrases (bigrams). This is crucial for grammar ranking, as it helps the model understand word order and local context (e.g., "he go" vs. "he goes").
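* **Max Features (888):** As a heuristic, we cap the vocabulary at twice the number of training rows (2 × 444 = 888) to limit overfitting on a sparse matrix. Treat this as a starting point rather than a fixed rule.
* **Vocabulary:** The vectorizer learns its vocabulary from the training set, keeping the 888 most frequent terms across the corpus, to create a fixed-size input vector for our neural network. Note that the default tokenizer drops punctuation and single-character tokens, so "you're" becomes the bigram `you re`.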

In [55]:
# Initialize the TfidfVectorizer with the grammar-sensitive n-gram range
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),    # Captures word pairs to identify grammatical patterns
    max_features=888,      # Guided by Rule of Thumb: 2 * total_rows
    lowercase=False        # Text is already lowercased in our cleaning step
)

In [56]:
# Fit on training data and transform to numerical vectors
X_train_tfidf = vectorizer.fit_transform(df['cleaned_text']).toarray()

In [57]:
print(f"✅ TF-IDF Matrix created with shape: {X_train_tfidf.shape}")

✅ TF-IDF Matrix created with shape: (444, 888)


## 7: Persisting the Vectorizer

To ensure consistency between training and inference, we must save the fitted `TfidfVectorizer` object. This allows us to `transform` the test set (or future new audio transcripts) using the exact same vocabulary and IDF weights learned from the training set.

### ⚠️ The "Golden Rule" of Vectorization

* **Training Data:** Use `.fit_transform()` (Learns vocabulary + transforms).
* **Test/New Data:** Use `.transform()` (Only transforms using the *saved* vocabulary). **Never** call `.fit()` on your test data.

```python
# --- Later in your Test/Inference phase ---
import joblib

# Load the vectorizer that was fitted on the training transcripts
loaded_vectorizer = joblib.load('tfidf_vectorizer.joblib')

# Transform the test data using the LOADED vectorizer.
# This assumes test_df already has a 'cleaned_text' column produced by the
# same transcription and cleaning steps; the result has exactly 888 columns
# in the same order as the training matrix.
X_test_tfidf = loaded_vectorizer.transform(test_df['cleaned_text']).toarray()

print(f"✅ Test Matrix shape: {X_test_tfidf.shape} (Matches Train features!)")
```
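
A small sketch of why the rule matters, using made-up sentences: a vectorizer fitted on training text keeps its column layout when transforming new text, and words it never saw are ignored instead of creating new columns. The texts and variable names here are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["my favorite place is the market", "the market is crowded"]
test_texts = ["my favorite stadium is crowded"]  # "stadium" never seen in training

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)  # learns vocabulary + IDF weights
X_test = vec.transform(test_texts)        # reuses them, learns nothing new

print(X_train.shape, X_test.shape)
print("stadium" in vec.vocabulary_)
```

Both matrices have the same number of columns, so a model trained on `X_train` can consume `X_test` directly.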

In [58]:
import joblib

# 1. Save the fitted vectorizer to a file
# This 'freezes' the 888 features we defined earlier
joblib.dump(vectorizer, 'tfidf_vectorizer.joblib')
print("✅ Vectorizer saved successfully as 'tfidf_vectorizer.joblib'")

✅ Vectorizer saved successfully as 'tfidf_vectorizer.joblib'


## 8: Feature Fusion (Text + Metadata)

In this final preprocessing step, we concatenate our two different feature sets. This allows the model to learn not just from *what* was said, but also from *how confidently* Whisper transcribed it.

### 📋 The Merging Logic

1. **Label the TF-IDF Matrix:** Wrap the dense TF-IDF array in a DataFrame whose column names are the vocabulary terms from `get_feature_names_out()`.
2. **Align the Index:** Give the new DataFrame the same index as the transcript DataFrame so that rows line up during concatenation.
3. **Horizontal Concatenation:** Join them side-by-side. With 888 TF-IDF features and 3 metadata features (`avg_logprob`, `temperature`, `compression_ratio`), the model input has **891 numeric features**; the DataFrame has 893 columns because it also keeps `text` and `cleaned_text`.
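
Index alignment is the step most likely to go wrong silently, because `pd.concat` with `axis=1` joins on index labels, not on row position. The sketch below uses a made-up metadata frame with a deliberately non-default index to show the difference; all values and names are illustrative.

```python
import numpy as np
import pandas as pd

meta = pd.DataFrame(
    {"avg_logprob": [-0.4, -0.6, -0.3]},
    index=[10, 11, 12],  # deliberately not 0..2
)
tfidf_values = np.array([[0.0, 0.7], [0.5, 0.0], [0.2, 0.9]])

# Misaligned: the new frame gets a default 0..2 index
misaligned = pd.concat(
    [meta, pd.DataFrame(tfidf_values, columns=["able", "about"])],
    axis=1,
)

# Aligned: reuse the index of the frame we are joining onto
aligned = pd.concat(
    [meta, pd.DataFrame(tfidf_values, columns=["able", "about"], index=meta.index)],
    axis=1,
)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

The misaligned version doubles the row count and fills the gaps with NaN, while the aligned version keeps three complete rows.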

In [61]:
# Convert the NumPy array to a DataFrame.
# We use vectorizer.get_feature_names_out() to label the columns with the actual terms
tfidf_df = pd.DataFrame(
    X_train_tfidf,
    columns=vectorizer.get_feature_names_out(),
    index=df.index   # align with the transcript DataFrame we concatenate onto
)

# Concatenate horizontally (axis=1)
df = pd.concat([df, tfidf_df], axis=1)

In [62]:
print(f"✅ Final DataFrame created! Total columns: {df.shape[1]}")

✅ Final DataFrame created! Total columns: 893


In [63]:
df.head()

Unnamed: 0,text,avg_logprob,temperature,compression_ratio,cleaned_text,10th,able,able to,about,about my,...,you are,you back,you can,you have,you know,you re,you you,your,your own,your spirit
0,My favorite hobby is cultivation of plants su...,-0.477353,0.0,1.463235,my favorite hobby is cultivation of plants suc...,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,The playground looks like very clear and neat...,-0.623653,0.0,1.081016,the playground looks like very clear and neat ...,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,My goal is to become an electrical employee a...,-0.318257,0.0,1.412077,my goal is to become an electrical employee an...,0.0,0.0,0.0,0.190415,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,My favorite place is in Andhra Pradesh. It is...,-0.486152,0.0,1.247096,my favorite place is in andhra pradesh. it is ...,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"My favorite places, my favorite places, Mutti...",-0.565608,0.0,1.488246,"my favorite places, my favorite places, mutti ...",0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [64]:
df.shape

(444, 893)

## 9: Save the Processed Dataset

We now export the finalized DataFrame (containing the raw text, the acoustic metadata, and the high-dimensional TF-IDF embeddings) into a single CSV file.

### 📝 Storage Considerations

* **Consistency:** By setting `index=False`, we prevent pandas from adding an extra unnamed column, which keeps the file clean for future loading.
* **Large Files:** Because we added 888 TF-IDF columns, the resulting CSV will be significantly larger than the original input. This file will serve as the primary training source for your **TensorFlow** model.
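
The effect of `index=False` is easy to check with an in-memory round trip. This sketch writes a made-up two-row frame to `io.StringIO` buffers instead of files, so nothing touches the disk.

```python
import io

import pandas as pd

toy = pd.DataFrame({"text": ["hello", "world"], "avg_logprob": [-0.4, -0.6]})

with_index = io.StringIO()
toy.to_csv(with_index)                  # default: the index is written as a column
without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # index omitted

with_index.seek(0)
without_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Reading back the file written with the default settings yields an extra `Unnamed: 0` column, which is the same artifact visible in the `train.csv` preview earlier in this notebook.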

In [65]:
# Save the final feature-rich dataframe to a CSV
# This file now contains: [Original Columns] + [Acoustic Metadata] + [888 TF-IDF Features]
file_name = "tf_embedding_transcripts.csv"

df.to_csv(file_name, index=False)

print("✅ Data saved successfully")

✅ Data saved successfully


---
### ⚖️ A Final Note on Scaling

Since TF-IDF values range from **0 to 1** but Whisper metadata (like `avg_logprob`) can be **negative or large**, it is highly recommended to apply a `StandardScaler` to the final matrix before feeding it into a Neural Network:

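```python
from sklearn.preprocessing import StandardScaler

# X_train_final / X_test_final are assumed to be the numeric feature matrices
# (TF-IDF + metadata columns, without the raw text columns, and with missing
# metadata values already handled); they are not defined earlier in this notebook.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_final)  # learn mean/std on train
X_test_scaled = scaler.transform(X_test_final)        # reuse train statistics
```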
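
For intuition, the following NumPy sketch reproduces the arithmetic that standardization performs: subtract the training mean and divide by the training standard deviation (population form, `ddof=0`). The arrays are made-up values for two metadata columns, not data from this notebook.

```python
import numpy as np

# Made-up metadata columns: avg_logprob, compression_ratio
train_meta = np.array([[-0.4, 1.2], [-0.6, 1.4], [-0.5, 1.6]])
test_meta = np.array([[-0.9, 1.0]])

# Statistics come from the training rows only
mean = train_meta.mean(axis=0)
std = train_meta.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train_meta - mean) / std
test_scaled = (test_meta - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))
print(test_scaled)
```

One caveat for this dataset: `temperature` is 0.0 in every row shown above, and a constant column has zero standard deviation. The manual formula would divide by zero in that case, whereas `StandardScaler` leaves such a column unscaled, so dropping constant columns beforehand may be the simpler option.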