In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/sms-spam-collection-dataset/spam.csv


<div style="background-color:#121212;color:#e0e0e0;font-family:Arial,Helvetica,sans-serif;padding:20px;line-height:1.6;">
  <h1 style="color:#ff9800;margin-bottom:10px;">📚 Hybrid Learning in AI — SMS Spam Example</h1>

  <p style="margin-bottom:15px;">
    <strong style="color:#4fc3f7;">Hybrid Learning</strong> combines two or more approaches to build a stronger AI system.
    It blends the strengths of different methods, making them work together — much like combining a calculator’s speed 🖩 with a human’s judgment 🧠.
  </p>

  <h2 style="color:#ffcc80;margin-top:20px;">🔹 Why Use Hybrid Learning?</h2>
  <ul style="margin-left:20px;margin-bottom:15px;">
    <li>Better accuracy through synergy of techniques.</li>
    <li>Greater robustness against noisy or unexpected data.</li>
    <li>Flexibility — choose the right tool for the right part of the task.</li>
  </ul>

  <h2 style="color:#ffcc80;margin-top:20px;">🔹 Real-World Examples</h2>
  <ul style="margin-left:20px;margin-bottom:15px;">
    <li>Bank fraud detection using Neural Networks + Rule-based checks.</li>
    <li>Search engines combining Transformer NLP + TF-IDF ranking.</li>
    <li>Medical imaging with CNN predictions + doctor-verified rules.</li>
    <li>AI games blending supervised strategies + reinforcement learning.</li>
    <li>Network security using anomaly detection + classification.</li>
  </ul>

  <h2 style="color:#ffcc80;margin-top:20px;">📂 Our Tutorial Dataset — SMS Spam Collection</h2>
  <p style="margin-bottom:15px;">
    This dataset (<strong style="color:#81d4fa;">~500 KB</strong>) contains spam and ham (not spam) SMS messages.<br>
    <strong>Why it’s perfect:</strong> Tiny size, cleans easily, trains fast, and the task is relatable.
  </p>

  <h2 style="color:#ffcc80;margin-top:20px;">🛠 Hybrid Plan: TF-IDF Model + Keyword Rules</h2>
  <ol style="margin-left:20px;margin-bottom:15px;">
    <li><strong style="color:#4fc3f7;">ML Component</strong>
      <ul>
        <li>Convert SMS text to numerical vectors with TF-IDF.</li>
        <li>Train a Logistic Regression or Naive Bayes classifier.</li>
      </ul>
    </li>
    <li><strong style="color:#4fc3f7;">Rule-based Component</strong>
      <ul>
        <li>Define a dictionary of common spam words (<em>"free", "win", "click", "offer"</em>).</li>
        <li>If a message contains enough triggers → predict spam directly.</li>
      </ul>
    </li>
    <li><strong style="color:#4fc3f7;">Hybrid Decision</strong>
      <ul>
        <li>If rules detect high spam risk → final label spam.</li>
        <li>Else → trust ML model output.</li>
      </ul>
    </li>
  </ol>

  <h2 style="color:#ffcc80;margin-top:20px;">📊 Tutorial Workflow</h2>
  <ol style="margin-left:20px;margin-bottom:15px;">
    <li>Dataset exploration — visualize spam vs. ham counts.</li>
    <li>Baseline ML Model — simple TF-IDF + Logistic Regression.</li>
    <li>Design keyword rules — experiment with thresholds.</li>
    <li>Combine outputs — weighted decision or rule priority.</li>
    <li>Evaluate — compare hybrid vs. pure ML performance.</li>
    <li>Conclusion — discuss where hybrid wins.</li>
  </ol>

  <h2 style="color:#ffcc80;margin-top:20px;">💡 Why This Tutorial Works Well</h2>
  <ul style="margin-left:20px;margin-bottom:15px;">
    <li>Lightweight — ideal for live demos and quick iteration.</li>
    <li>Clear step-by-step introduction to hybrid concepts.</li>
    <li>Real-world relevance — SMS spam filtering is everywhere.</li>
  </ul>

  <p style="margin-top:20px;font-style:italic;color:#b0bec5;">
    Tip: Treat Hybrid Learning like mixing coffee and milk ☕ + 🥛 — each can be good alone, but together, they create something smoother.
  </p>
</div>
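
The rule-priority decision from the Hybrid Plan can be sketched in a few lines. This is a minimal illustration, not the tutorial's final implementation: the keyword set reuses the four example words from the plan, while `count_triggers`, `hybrid_predict`, and the `min_triggers=2` threshold are hypothetical names and values chosen for the sketch. The `ml_label` argument stands in for whatever the trained classifier would predict.

```python
import re

# Example trigger words from the plan; a real list would likely be longer.
SPAM_KEYWORDS = {"free", "win", "click", "offer"}


def count_triggers(message: str) -> int:
    """Count how many distinct spam keywords appear as whole words."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    return len(words & SPAM_KEYWORDS)


def hybrid_predict(message: str, ml_label: str, min_triggers: int = 2) -> str:
    """Rule priority: enough triggers force 'spam', otherwise keep the ML label."""
    if count_triggers(message) >= min_triggers:
        return "spam"
    return ml_label


# `ml_label` is a placeholder for a classifier's output on each message.
print(hybrid_predict("Click now to win a FREE offer", ml_label="ham"))
print(hybrid_predict("Ok lar... Joking wif u oni...", ml_label="ham"))
```

Matching whole words rather than substrings keeps a word such as "winter" from counting as "win"; the threshold is the knob to experiment with in the workflow's rule-design step.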


<div style="background-color:#121212;color:#e0e0e0;padding:20px;font-family:Arial,Helvetica,sans-serif;line-height:1.6;border-radius:8px;">
  <h2 style="color:#ff9800;margin-top:0;">🔀 Different Ways to Do Hybrid Learning in AI</h2>

  <p style="margin-bottom:15px;">
    Hybrid learning in AI means <strong style="color:#4fc3f7;">combining two or more methods</strong> to solve a problem more effectively than using them in isolation.
    Below are popular hybrid strategies:
  </p>

  <h3 style="color:#ffcc80;margin-top:15px;">1️⃣ ML + Rule-Based Systems</h3>
  <p>Combine a machine learning model with human‑crafted rules.  
    <em style="color:#b0bec5;">Example:</em> Spam filter using Naive Bayes + keyword blacklist.
  </p>

  <h3 style="color:#ffcc80;margin-top:15px;">2️⃣ Supervised + Unsupervised Learning</h3>
  <p>Unsupervised clustering or representation learning feeds features into a supervised classifier.  
    <em style="color:#b0bec5;">Example:</em> Customer segmentation via KMeans → segments used for targeted prediction.
  </p>

  <h3 style="color:#ffcc80;margin-top:15px;">3️⃣ Classical ML + Deep Learning</h3>
  <p>Use deep learning for feature extraction, then apply classical algorithms for prediction.  
    <em style="color:#b0bec5;">Example:</em> CNN image embeddings + Random Forest classifier.
  </p>

  <h3 style="color:#ffcc80;margin-top:15px;">4️⃣ Reinforcement + Supervised Learning</h3>
  <p>Train a model with supervised learning, then fine‑tune using reinforcement signals.  
    <em style="color:#b0bec5;">Example:</em> Game AI that learns from human replays before exploring new strategies.
  </p>

  <h3 style="color:#ffcc80;margin-top:15px;">5️⃣ Multiple Supervised Models (Ensembles)</h3>
  <p>Blend predictions from different supervised models to improve accuracy and robustness.  
    <em style="color:#b0bec5;">Example:</em> Random Forest + Gradient Boosting + Neural Network ensemble.
  </p>

  <p style="margin-top:20px;font-style:italic;color:#b0bec5;">
    Each hybrid method has trade-offs — the art is knowing which combination fits your data and problem best.
  </p>
</div>
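
As a small illustration of strategy 5, the sketch below blends spam probabilities from several models by (optionally weighted) averaging, often called soft voting. The function name `soft_vote`, the 0.5 threshold, and the probabilities are made up for the example; in practice the probabilities would come from trained models.

```python
def soft_vote(spam_probs, weights=None, threshold=0.5):
    """Average per-model spam probabilities and threshold the result."""
    if weights is None:
        weights = [1.0] * len(spam_probs)
    total = sum(weights)
    avg = sum(p * w for p, w in zip(spam_probs, weights)) / total
    label = "spam" if avg >= threshold else "ham"
    return label, avg


# Made-up probabilities standing in for three models' outputs on one message.
label, avg = soft_vote([0.9, 0.4, 0.7])
print(label, round(avg, 3))

# Giving the first model three times the weight of the second.
print(soft_vote([0.2, 0.9], weights=[3, 1]))
```

The same weighted-average idea also covers the "weighted decision" option in the tutorial workflow, where a rule score and a model probability are mixed instead of letting the rules override the model.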


<div style="background-color:#121212;color:#e0e0e0;padding:20px;font-family:Arial,Helvetica,sans-serif;line-height:1.6;border-radius:8px;">
  <h2 style="color:#ff9800;margin-top:0;">📍 Step 1 — Understanding the Dataset</h2>

  <p style="margin-bottom:15px;">
    Before building any <strong style="color:#4fc3f7;">Hybrid Learning</strong> model, we first need to
    understand the shape, quality, and nature of our data.  
    This step ensures we know exactly what we’re working with — preventing surprises later.
  </p>

  <h3 style="color:#ffcc80;margin-top:15px;">🔎 What We'll Do Here</h3>
  <ul style="margin-left:20px;margin-bottom:15px;">
    <li>Load the <strong>SMS Spam Collection</strong> dataset into a Pandas DataFrame.</li>
    <li>Inspect the first few rows using <code style="background-color:#2e2e2e;color:#81d4fa;padding:2px 6px;border-radius:4px;">df.head()</code>.</li>
    <li>Get dataset dimensions with <code style="background-color:#2e2e2e;color:#81d4fa;padding:2px 6px;border-radius:4px;">df.shape</code>.</li>
    <li>Check class distribution (spam vs. ham) — both counts and percentages.</li>
    <li>Look at basic descriptive statistics for text lengths.</li>
    <li>Ensure there are no <strong>missing values</strong> or obviously corrupted entries.</li>
  </ul>

  <h3 style="color:#ffcc80;margin-top:15px;">💡 Why This Matters</h3>
  <p style="margin-bottom:15px;">
    Understanding the dataset's structure and target balance helps us:
  </p>
  <ul style="margin-left:20px;margin-bottom:15px;">
    <li>Spot data quality issues early.</li>
    <li>Choose the right preprocessing steps.</li>
    <li>Plan for strategies like resampling when the target classes are imbalanced.</li>
  </ul>

  <p style="margin-top:20px;font-style:italic;color:#b0bec5;">
    Think of this step as <em>“reading the recipe before cooking”</em> — you don’t want surprises
    halfway through the meal!
  </p>
</div>
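
One possible shape for the "missing values or obviously corrupted entries" check is a small report function. The sketch below runs on a toy DataFrame so it is self-contained; `quality_report` and its column arguments are hypothetical names, and what counts as "corrupted" (blank text, exact duplicates, unexpected labels) is an assumption made for illustration.

```python
import pandas as pd


def quality_report(df, text_col="message", label_col="label",
                   valid_labels=("ham", "spam")):
    """Summarize a few basic data-quality problems as counts."""
    text = df[text_col]
    return {
        # Null cells in either the label or the text column
        "missing": int(df[[label_col, text_col]].isnull().sum().sum()),
        # Non-null messages that are empty or whitespace only
        "blank_messages": int(text.dropna().str.strip().eq("").sum()),
        # Rows repeating an earlier (label, message) pair
        "duplicate_rows": int(df.duplicated(subset=[label_col, text_col]).sum()),
        # Labels outside the expected set
        "unexpected_labels": int((~df[label_col].isin(list(valid_labels))).sum()),
    }


# Toy rows for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham", "junk"],
    "message": ["Ok lar...", "Free entry", "Ok lar...", "   ", None],
})
print(quality_report(toy))
```

Duplicate messages are not necessarily errors in an SMS corpus, so the duplicate count is something to inspect rather than a reason to drop rows automatically.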


In [2]:
import pandas as pd

# Load the SMS Spam Collection dataset
# Adjust the filename/path if needed (Kaggle usually puts it in ../input/)
df = pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv", encoding="latin-1")

# --- Basic Inspection ---
print("First 5 rows:")
display(df.head())

# Dataset dimensions
print(f"\nShape of dataset: {df.shape}")

# Remove any unwanted extra columns if they exist
df = df[['v1', 'v2']]
df.columns = ['label', 'message']

# --- Missing value check ---
print("\nMissing values per column:")
print(df.isnull().sum())

# --- Target distribution ---
label_counts = df['label'].value_counts()
label_percent = df['label'].value_counts(normalize=True) * 100

print("\nLabel distribution (counts):")
print(label_counts)
print("\nLabel distribution (percentages):")
print(label_percent.round(2))

# --- Text length statistics ---
df['message_length'] = df['message'].apply(len)
print("\nMessage length statistics:")
print(df['message_length'].describe())

# Optional: quick sample of messages
print("\nRandom sample of ham/spam messages:")
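# Select the columns explicitly so apply() does not operate on the grouping column implicitly
# (recent pandas versions warn about the implicit behavior)
display(df.groupby('label')[['label', 'message']].apply(lambda x: x.sample(2, random_state=42)))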


First 5 rows:


,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,



Shape of dataset: (5572, 5)

Missing values per column:
label      0
message    0
dtype: int64

Label distribution (counts):
label
ham     4825
spam     747
Name: count, dtype: int64

Label distribution (percentages):
label
ham     86.59
spam    13.41
Name: proportion, dtype: float64

Message length statistics:
count    5572.000000
mean       80.118808
std        59.690841
min         2.000000
25%        36.000000
50%        61.000000
75%       121.000000
max       910.000000
Name: message_length, dtype: float64

Random sample of ham/spam messages:



label,,label,message
ham,3714,ham,"I am late,so call you tomorrow morning.take ca..."
ham,1311,ham,U r too much close to my heart. If u go away i...
spam,1455,spam,Summers finally here! Fancy a chat or flirt wi...
spam,1852,spam,This is the 2nd time we have tried 2 contact u...
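
The length statistics above cover all messages together. A per-class breakdown could be obtained as in the sketch below, which uses a toy DataFrame so it runs on its own; in the notebook, the same `groupby` call would be applied to `df`. The toy messages and the choice of aggregates are assumptions made for illustration.

```python
import pandas as pd

# Toy rows for illustration only; in the notebook, use the loaded `df` instead.
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam", "spam"],
    "message": [
        "Ok lar...",
        "See you at 5",
        "Free entry, text WIN now",
        "Claim your prize today!!",
    ],
})

# Character count per message
toy["message_length"] = toy["message"].str.len()

# Summary of message length for each label
per_class = toy.groupby("label")["message_length"].agg(
    ["count", "mean", "median", "max"]
)
print(per_class)
```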


<div style="background-color:#121212;color:#e0e0e0;padding:20px;font-family:Arial,Helvetica,sans-serif;line-height:1.6;border-radius:8px;">
  <h2 style="color:#ff9800;margin-top:0;">🔍 Step 1 Insight — Data Overview</h2>

  <p style="margin-bottom:15px;">
    After loading and taking an initial peek at the <strong style="color:#4fc3f7;">SMS Spam Collection Dataset</strong>,
    here’s what we observe:
  </p>

  <h3 style="color:#ffcc80;margin-top:15px;">📊 Target Distribution</h3>
  <ul style="margin-left:20px;margin-bottom:15px;">
    <li>Ham messages: <strong>4825</strong> (86.59%)</li>
    <li>Spam messages: <strong>747</strong> (13.41%)</li>
    <li>The dataset is <strong>imbalanced</strong>, with roughly 6.5 ham messages per spam message, but not severely.</li>
  </ul>

  <h3 style="color:#ffcc80;margin-top:15px;">📜 Sample Messages</h3>
  <ul style="margin-left:20px;margin-bottom:15px;">
    <li><strong>Ham:</strong> “I am late, so call you tomorrow morning. Take care.” (casual, personal tone)</li>
    <li><strong>Spam:</strong> “Summers finally here! Fancy a chat or flirt with me?” (promotional, unsolicited tone)</li>
  </ul>

  <h3 style="color:#ffcc80;margin-top:15px;">📏 Message Length Insights</h3>
  <ul style="margin-left:20px;margin-bottom:15px;">
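    <li>Across all messages, length averages about 80 characters (median 61) and ranges from 2 to 910.</li>
    <li>Ham messages tend to be shorter and conversational, while spam often carries promotional content and may be longer; the statistics above are not split by class, so this still needs a per-class breakdown to confirm.</li>
    <li>Length statistics will help us design smarter preprocessing (e.g., handling very short or very long outliers).</li>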
  </ul>

  <h3 style="color:#ffcc80;margin-top:15px;">🧹 Data Quality Check</h3>
  <ul style="margin-left:20px;margin-bottom:15px;">
    <li>No missing values in either <strong>label</strong> or <strong>message</strong>.</li>
    <li>Some promotional spam lines may contain repeated or unusual character sequences, which are worth cleaning only if doing so improves model performance.</li>
  </ul>

  <p style="margin-top:20px;font-style:italic;color:#b0bec5;">
    The dataset is small, clean, and interpretable — perfect for quick experiments and 
    for illustrating <strong>Hybrid Learning</strong> concepts without heavy computation.
  </p>
</div>
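
To put the imbalance into numbers, the sketch below uses the counts printed by the cell output (4825 ham, 747 spam) to compute two reference figures: the accuracy of always predicting ham, and "balanced" class weights from the formula n_samples / (n_classes * class_count), the same heuristic scikit-learn documents for `class_weight="balanced"`. Class weighting is shown here as one alternative to resampling, not as the approach this tutorial necessarily takes; `balanced_class_weights` is a hypothetical helper name.

```python
# Label counts as printed by the notebook's output cell
counts = {"ham": 4825, "spam": 747}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(counts)

# Accuracy of a trivial model that always predicts the majority class
majority_accuracy = max(counts.values()) / sum(counts.values())

print({label: round(w, 3) for label, w in weights.items()})
print(f"Always-ham baseline accuracy: {majority_accuracy:.2%}")
```

Any model, hybrid or not, has to beat the always-ham baseline to be useful, which is why precision and recall on the spam class are usually more informative than accuracy alone on this dataset.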
