# Detect Deepfake Phishing Multi-Modal Attacks

3/24/2025, Dave Sisk, https://github.com/davidcsisk, https://www.linkedin.com/in/davesisk-doctordatabase/

A number of AI-driven deepfake exploits have drawn attention over the past year, including phishing emails carrying deepfaked audio or video of a company's CEO instructing employees to make a funds transfer, and even a CEO being deceived by a deepfake into transferring company funds to a fake corporate headquarters. Let's examine how we can combine tools from the AI/ML/data science realm to detect this type of deepfake attack.

By "multi-modal", I'm referring to attacks delivered as communications that might include text, audio, and video content.

## Working Examples

If an audio message like the one below came in what appeared to be an official communication, how believable would it be?

In [1]:
# Play the audio using the default OS audio player (`start` is Windows-only; use `open` on macOS or `xdg-open` on Linux)
wav_file = './/vector-search-with-security-logs//deepfake_DonaldTrump-audio.wav'
!start {wav_file}

If a video message similar to the one below was sent to all employees from what appeared to be an official source, how believable would it be?

In [2]:
# Play the video using the default OS video player (`start` is Windows-only; use `open` on macOS or `xdg-open` on Linux)
video_file = './vector-search-with-security-logs/deepfake_ElonMusk_SocialSecurityPhishing.mp4'
!start {video_file}

If these examples don't seem believable enough, consider that better models would simply produce more convincing deepfakes. The answer to the question at hand is clear: these are believable enough to potentially cause harm.

## Pre-processing of Working Examples

Our first order of business is to get text transcriptions of the messages in these examples, so that any vector search stays in the text-only realm, where well-tested embedding models are available. We'll use AI tooling to produce these transcriptions.
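
One way to picture this pre-processing stage is as a small router that decides, per attachment, which steps are needed to reach plain text. The sketch below is illustrative only: the function name `plan_preprocessing`, the step labels, and the extension sets are assumptions, not part of the notebook.

```python
from pathlib import Path

# Hypothetical extension sets; adjust to the formats you actually receive.
AUDIO_EXTS = {".wav", ".mp3"}
VIDEO_EXTS = {".mp4", ".mov"}


def plan_preprocessing(path: str) -> list[str]:
    """Return the ordered steps needed to turn one attachment into text."""
    ext = Path(path).suffix.lower()
    if ext in VIDEO_EXTS:
        # Video: pull out the audio track first, then transcribe it.
        return ["extract_audio_to_wav", "transcribe"]
    if ext == ".wav":
        # Already in the format the transcription step expects.
        return ["transcribe"]
    if ext in AUDIO_EXTS:
        # Compressed audio: convert to WAV before transcribing.
        return ["convert_to_wav", "transcribe"]
    # Anything else is treated as text and embedded directly.
    return ["use_text_as_is"]


print(plan_preprocessing("deepfake_ElonMusk_SocialSecurityPhishing.mp4"))
```

Whatever the input type, every path ends in plain text, which is what the vector search operates on.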

### Audio message
We can transcribe directly from the WAV audio file using the smallest open-source Vosk model that has an internalized language graph, which should produce a reasonably accurate transcription. (If the audio were in compressed MP3 format, which is very likely if it arrived by email, we'd have to convert it to WAV first and then transcribe the text from that. That's merely a requirement of this particular model, not of the overall technology.)
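
For the MP3 case mentioned above, a minimal sketch of the conversion step could look like the following. It reuses the ffmpeg options that appear later in this notebook (`-vn -acodec pcm_s16le -ar 16000 -ac 1`) and adds `-y` to overwrite an existing output file. The helper names and the file names in the demo line are hypothetical, and ffmpeg is assumed to be installed and on the PATH.

```python
import shutil
import subprocess


def build_wav_conversion_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that writes mono 16-bit PCM WAV at 16 kHz."""
    return [
        "ffmpeg", "-y",          # overwrite dst if it already exists
        "-i", src,               # input file (e.g. an MP3 attachment)
        "-vn",                   # ignore any video or cover-art stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        dst,
    ]


def convert_to_wav(src: str, dst: str) -> None:
    """Run the conversion; raises if ffmpeg is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_wav_conversion_cmd(src, dst), check=True)


# Show the command for a hypothetical attachment without running it.
print(" ".join(build_wav_conversion_cmd("voicemail.mp3", "voicemail.wav")))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.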

In [3]:
#! pip install moviepy
#! pip install vosk


In [4]:
from vosk import Model, KaldiRecognizer
import wave
import json
import textwrap

# Load the 2nd smallest Vosk model
model = Model(".//vector-search-with-security-logs//vosk-model-en-us-0.22-lgraph")

wav_file = './/vector-search-with-security-logs//deepfake_DonaldTrump-audio.wav'

# Open the audio file
with wave.open(wav_file, "rb") as wf:
    if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getframerate() not in [8000, 16000]:
        raise ValueError("Audio file must be mono 16-bit PCM WAV sampled at 8 or 16 kHz.")
    
    recognizer = KaldiRecognizer(model, wf.getframerate())
    recognizer.SetWords(True)
    
    transcription = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if recognizer.AcceptWaveform(data):
            result = json.loads(recognizer.Result())
            transcription.append(result.get("text", ""))
    
    # Get the final transcription
    final_result = json.loads(recognizer.FinalResult())
    transcription.append(final_result.get("text", ""))

# Combine all parts of the transcription
transcribed_text = " ".join(transcription)

# Wrap the text for better readability
wrapped_text = textwrap.fill(transcribed_text, width=80)
print("Transcribed Text:\n", wrapped_text)

Transcribed Text:
 hi shane i'm following up on the audit readiness review issue in q one expenses
for contractor payments we need to transfer one thirty seven thousand eight
hundred twenty dollar zero sense to the holding account a b c one twenty three
this must be sent by three pm today so do d o g e can reflected in the pre-
ordered submission let me know once the transfer is done thanks
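
The format check at the top of the transcription cell can also be pulled out into a small helper and exercised without any speech model. This is a sketch using only the standard-library `wave` module; the names `meets_wav_requirements` and `write_silence` are hypothetical, and the accepted sample rates simply mirror the check in the cell above.

```python
import os
import tempfile
import wave


def meets_wav_requirements(path: str) -> bool:
    """True if the WAV file is mono, 16-bit, and sampled at 8 or 16 kHz."""
    with wave.open(path, "rb") as wf:
        return (
            wf.getnchannels() == 1
            and wf.getsampwidth() == 2
            and wf.getframerate() in (8000, 16000)
        )


def write_silence(path: str, channels: int = 1, rate: int = 16000, seconds: int = 1) -> None:
    """Write a silent 16-bit PCM WAV file, used here only as test input."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * channels * rate * seconds)


with tempfile.TemporaryDirectory() as tmp:
    mono = os.path.join(tmp, "mono.wav")
    stereo = os.path.join(tmp, "stereo.wav")
    write_silence(mono)
    write_silence(stereo, channels=2)
    print(meets_wav_requirements(mono), meets_wav_requirements(stereo))
```

A file that fails this check would need to be converted before transcription.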


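### Video message
For the video, we first extract the audio track to a mono 16 kHz WAV file with ffmpeg, then transcribe it with the same Vosk model.

In [9]: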
import os
import wave
from vosk import Model, KaldiRecognizer
import json
import textwrap

# Load the Vosk model
model = Model(".//vector-search-with-security-logs//vosk-model-en-us-0.22-lgraph")

# Path to the video file
video_file = './/vector-search-with-security-logs//deepfake_ElonMusk_SocialSecurityPhishing.mp4'
audio_file = './/vector-search-with-security-logs//extracted_audio.wav'

# Delete the extracted audio file if it already exists
if os.path.exists(audio_file):
    os.remove(audio_file)

# Extract audio using ffmpeg (no moviepy)
!ffmpeg -i {video_file} -vn -acodec pcm_s16le -ar 16000 -ac 1 {audio_file}

# Transcribe the extracted audio
with wave.open(audio_file, "rb") as wf:
    if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getframerate() not in [8000, 16000]:
        raise ValueError("Audio file must be mono 16-bit PCM WAV sampled at 8 or 16 kHz.")
    
    recognizer = KaldiRecognizer(model, wf.getframerate())
    recognizer.SetWords(True)
    
    transcription = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if recognizer.AcceptWaveform(data):
            result = json.loads(recognizer.Result())
            transcription.append(result.get("text", ""))
    
    # Get the final transcription
    final_result = json.loads(recognizer.FinalResult())
    transcription.append(final_result.get("text", ""))

# Combine all parts of the transcription
transcribed_text = " ".join(transcription)

# Wrap the text for better readability
wrapped_text = textwrap.fill(transcribed_text, width=80)
print("\nTranscribed Text:\n", wrapped_text)



Transcribed Text:
 greetings valued employees immediate action is needed from you please go to the
listed u r l logging into your social security account and change your password
this will provide a immediate flag that your social security account is valid
please complete this task task by five pm today thank you
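
Both transcripts contain small speech-to-text artifacts, such as letters spelled out one at a time ("u r l") and a repeated word ("task task"). One optional cleanup step before embedding is sketched below, as an assumption rather than anything the notebook does: merge runs of single-letter tokens and drop immediately repeated words. The function name `normalize_transcript` is hypothetical.

```python
def normalize_transcript(text: str) -> str:
    """Merge spelled-out letter runs and drop immediately repeated words."""
    merged = []
    run = []  # consecutive single-letter tokens, e.g. ["u", "r", "l"]
    for token in text.split():
        if len(token) == 1 and token.isalpha():
            run.append(token)
            continue
        if run:
            merged.append("".join(run))  # a run of one letter is left unchanged
            run = []
        merged.append(token)
    if run:
        merged.append("".join(run))

    cleaned = []
    for token in merged:
        # Skip a word that exactly repeats the previous one.
        if not cleaned or cleaned[-1] != token:
            cleaned.append(token)
    return " ".join(cleaned)


print(normalize_transcript(
    "please go to the listed u r l and complete this task task by five pm today"
))
```

This is deliberately crude: dropping repeats would also collapse legitimate repetitions such as "twenty twenty", so it is worth checking against real transcripts before relying on it.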


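## Vector Search Against Known Phishing Examples

With both messages reduced to text, we can embed a small datastore of known phishing emails and search it by cosine similarity.

In [32]: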
# Build the datastore of known phishing emails (34 rows in the CSV used here)
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import warnings

# Suppress the FutureWarning from transformers
warnings.filterwarnings('ignore', category=FutureWarning, module='transformers.tokenization_utils_base')

# Load the CSV into a pandas dataframe
df = pd.read_csv('.//vector-search-with-security-logs//deepfake_phishing_examples.csv')

# Ensure 'Subject' and 'Body' columns exist, create them if missing
if 'Subject' not in df.columns:
    df['Subject'] = ''
if 'Body' not in df.columns:
    df['Body'] = ''

# Initialize the sentence-transformers model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Combine Subject and Body into a single column for embedding, handling empty values
df['Subject'] = df['Subject'].fillna('')
df['Body'] = df['Body'].fillna('')
df['combined_text'] = (df['Subject'] + ' ' + df['Body']).str.strip()

# Calculate embeddings for the combined text in one batched call (one vector per row)
df['embeddings'] = list(model.encode(df['combined_text'].tolist()))

# Function to perform cosine similarity search
def search_similar(query, top_n=3):
    query_embedding = model.encode(query).reshape(1, -1)
    embeddings = np.vstack(df['embeddings'].values)
    similarities = cosine_similarity(query_embedding, embeddings).flatten()
    top_indices = similarities.argsort()[-top_n:][::-1]
    results = df.iloc[top_indices][['Subject', 'Body', 'combined_text']].copy()
    results['similarity_score'] = similarities[top_indices]
    results['preview'] = results['combined_text'].apply(lambda x: x[:100] if x else 'No content available')
    return results[['preview', 'similarity_score']]

df.shape

(34, 4)
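
To make the ranking inside `search_similar` concrete, here is the same cosine-similarity top-n logic on toy 3-dimensional vectors, in plain Python with no embedding model involved. The vectors, labels, and the helper names `cosine` and `top_matches` are made up for illustration; real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query_vec, corpus, top_n=3):
    """Score every corpus vector against the query and keep the best top_n."""
    scored = [(label, cosine(query_vec, vec)) for label, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy "embeddings": each axis loosely stands for one topic.
toy_corpus = {
    "wire transfer request": [0.9, 0.1, 0.0],
    "password reset lure": [0.1, 0.9, 0.1],
    "lunch menu": [0.0, 0.1, 0.9],
}

for label, score in top_matches([0.8, 0.2, 0.0], toy_corpus, top_n=2):
    print(f"{label}: {score:.3f}")
```

The query vector points mostly along the first axis, so the wire-transfer entry ranks first, just as a query about urgent payments ranks the payment-themed emails highest in the real datastore.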

In [33]:
df.sample(3)

| | Subject | Body | combined_text | embeddings |
|---|---|---|---|---|
| 11 | Time-Sensitive Funding Approval | Please move $64,200 to the project fund immedi... | Time-Sensitive Funding Approval Please move $6... | [0.01350849, 0.039254844, 0.011224821, 0.06720... |
| 32 | Director Request – Time Sensitive | Send $119,000 to ensure project continuation. ... | Director Request – Time Sensitive Send $119,00... | [-0.0680404, 0.020338254, 0.044019964, 0.04546... |
| 1 | Immediate Action Required – Payment Authorization | [Employee Name], I’ve approved the invoice fro... | Immediate Action Required – Payment Authorizat... | [-0.1030883, 0.08535468, -0.013656264, -0.0190... |


In [35]:
# Example query with a custom number of top matches
query = "urgent action required for account security"
top_n = 5  # Specify the number of top matches to return
results = search_similar(query, top_n=top_n)
print(results)

                                              preview  similarity_score
3   Urgent from Mobile I’m traveling and can’t acc...          0.497088
7   Immediate Compliance Wire I’ve been told we ha...          0.489520
22  Banking Delay Mitigation I’ve been notified th...          0.480011
13  Transfer Instruction: Urgent Initiate a transf...          0.461940
0   Urgent Wire Transfer – Confidential Hi [Employ...          0.421657
