# I-SOON Leak Analysis using Python and Generative AI

## Author: Thomas Roccia | [@fr0gger_](https://twitter.com/fr0gger_)

## Introduction

Analyzing leaked data can be a tedious task, especially if it's written in a foreign language. Luckily, with Python, it's possible to rapidly automate this process.

In this notebook, we will analyze the I-Soon data leak, which exposes sensitive information about potential Chinese espionage capabilities. It is an interesting use case because most of the data consists of PNG files, so automating the extraction requires OCR. The leak is available here: https://github.com/I-S00N/I-S00N

The leak contains txt, log, md and png files. This notebook focuses on the PNG files, which make up most of the data.

The goal of this notebook is to provide the tools and workflow to let you analyze this kind of data by yourself.

Let's dive in.

## Disclaimer
Please use the data available in this notebook "as is". This document outlines a methodology for analyzing this kind of data and should not be considered an intelligence report.

The output provided may require additional verification due to possible inaccuracies in the translation or limitations inherent to LLM technologies.

Nevertheless, this document provides an initial step for analyzing leaked data, particularly PNG files in a foreign language.

## Requirements

In [4]:
#!pip install Pillow pytesseract
#!pip install googletrans # Be careful: you might run into issues with the version of HTTPX used by this library and by OpenAI
#!pip install openai
#!pip install bokeh

# You also need an OpenAI API key
api_key = "sk-"

## Analyzing the data

As always, the first thing to do before jumping into the data is to spend time understanding what kind of information we have, the structure, the format, the number of documents...

This is a crucial step to ensure you analyze your data in the right way. As we focus on the PNG files, let's count how many we have in the repository.

In [49]:
import os
from collections import Counter

# Passing the directory 
image_directory = '0'
files = os.listdir(image_directory)

# Extracting file extensions and counting occurrences
file_extensions = [os.path.splitext(file)[1] for file in files]
extensions_count = Counter(file_extensions)

# Printing statistics about file types
print("File Type numbers:")
for ext, count in extensions_count.items():
    print(f"{ext if ext else 'No Extension'}: {count}")


File Type numbers:
.md: 70
.png: 489
.log: 6
.txt: 11
No Extension: 1
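
The charts in the next cells use a hardcoded dictionary of counts. As a minimal sketch, the same dictionary could be derived from the `extensions_count` Counter computed above. The helper name `counts_for_chart` and the `other` label for files without an extension are assumptions made for this example.

```python
from collections import Counter

def counts_for_chart(extensions_count):
    """Turn a Counter like {'.png': 489, '': 1} into {'png': 489, 'other': 1}."""
    counts = {}
    # most_common() yields the extensions from the most to the least frequent
    for ext, count in extensions_count.most_common():
        label = ext.lstrip(".").lower() if ext else "other"
        counts[label] = counts.get(label, 0) + count
    return counts

# Same numbers as the output printed above
example = Counter({".md": 70, ".png": 489, ".log": 6, ".txt": 11, "": 1})
print(counts_for_chart(example))
```

The result can then replace the hardcoded `file_type_counts` dictionary, which keeps the charts in sync with the directory content.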


Let's create a pie chart and a bar chart to visualize the distribution.

In [62]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.transform import cumsum
from bokeh.models import ColumnDataSource, HoverTool
from math import pi
import pandas as pd

output_notebook()

file_type_counts = {'png': 489, 'md': 70, 'log': 6, 'txt': 11, 'other': 1}

data = pd.Series(file_type_counts).reset_index(name='value').rename(columns={'index': 'file_type'})
data['angle'] = data['value']/data['value'].sum() * 2*pi
data['color'] = ['#0999d3', '#718dbf', '#e84d60', '#ddb7b1', '#ddb777']

# Pie chart
p = figure(height=350, title="File Type Distribution", toolbar_location=None,
           tools="hover", tooltips="@file_type: @value", x_range=(-0.5, 1.0))

p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color='color', legend_field='file_type', source=data)

p.axis.axis_label = None
p.axis.visible = False
p.grid.grid_line_color = None

show(p)


In [65]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
import pandas as pd

output_notebook()

data = pd.DataFrame(list(file_type_counts.items()), columns=['file_type', 'count'])

# Convert DataFrame 
source = ColumnDataSource(data)

# Create figure
p = figure(x_range=data['file_type'], height=250, title="File Type Distribution",  # `height` replaces the older `plot_height`
           toolbar_location=None, tools="")

# Add vertical bars 
p.vbar(x='file_type', top='count', width=0.9, source=source, legend_field="file_type")

# Set attributes
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.yaxis.axis_label = "Count"
p.xaxis.axis_label = "File Type"
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

# Display the plot
show(p)

To give you an example of the kind of information available, let's take a look at one of the images.
![image](0/12756724-394c-4576-b373-7c53f1abbd94_15.png)

This one is quite interesting because the screenshot contains text, images and diagrams. Other images are screenshots of discussions or of Windows folders, which are more difficult to analyze without context and will require additional analysis. For now, let's focus on extracting the text from our images.

## Extracting and translating text for one image

Now that we have a better view of the distribution and know the content of the data a little better, we are going to extract the text using OCR.

In [1]:
import os
# Set the TESSDATA_PREFIX 
os.environ['TESSDATA_PREFIX'] = 'tessdata-main'

### Analyzing one image at a time

In [2]:
from PIL import Image
import pytesseract
from googletrans import Translator, LANGUAGES

# Load the PNG image
image_path = '0/12756724-394c-4576-b373-7c53f1abbd94_15.png'
image = Image.open(image_path)

# Use Tesseract to do OCR on the image
text = pytesseract.image_to_string(image, lang='chi_sim')  # 'chi_sim' for simplified Chinese

print("Extracted Text:", text)

# Translate in English
translator = Translator()
translated_text = translator.translate(text, dest='en')
print("Translated Text:", translated_text.text)


Extracted Text: 专 业 的 数 孙 情 报 解 决 方 案 提 供 商

1. 6.5 “ 产 品 图 片

噱 | 鱼 黑 | 关 黑 心 日 - 心 煜 乔 息 -| 里

小 林 5 ( 怡 织 ; 耶 fL; 电 量 ; 648; 弈 幕 亮 阮
白 pnas Q 晓 士 R 关 Hnars a

E E e J 万
-
一 3 万 = E ES3
5 “ 二

snasriaa ]ngutumsape

E ESE3SH

i E

E H

E 圆

Rs E

E 5

E 王

E 芸

〈Android 远 程 控 制 管 理 系 统 界 面 图 )

1.7 “Linux 远 程 控 制 管 理 系 统

1. 7. 1 “ 产 品 简 介
Linux 远 程 控 制 系 统 是 一 款 针 对 Linux 系 统 , 可 对 其 进 行 远 程 控 制 和 取 证 设
备 信 息 的 系 统 。
系 统 主 要 通 过 将 设 置 好 的 服 务 端 安 装 到 目 标 主 机 上 , 上 线 后 通 过 控 制 端 的 操
作 对 目 标 主 机 进 行 控 制 。 支 持 正 向 连 接 和 反 向 连 接 两 种 上 线 朱 式 。

连 接 模 式

(Linux 远 程 控 制 系 统 运 行 形 态 图 )

安 淘 信 息 技 术 有 限 公 司 16150

Translated Text: Division of the Digital Grandson Reporting Correctional Corporation

1. 6.5 "Product Graphics

Puppet | Fish Black | Guan Heixin Day -Xinyu Qiao Ending -| Li

Kobayashi 5 (Yaori; Ye FL; Electricity; 648; Yi Mu Liang Ruan
White PNAS Q Xiaoshi R Guan HNARS A

E e j Wan
-
One 30,000 = ES3
5 "Two

SNASRIAA] NGutumsape

E eSe3SH

I E

E H

E -circle

RS E

E 5

E king



------------------
And now we have the image above translated! You can do the same for any other image by changing the image path in the code above.
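
In the extracted text above, Tesseract put a space between nearly every Chinese character, which may make it harder for the translator to segment words. Here is a minimal sketch of a cleanup step that could run before translation. The helper `collapse_cjk_spaces` is hypothetical, and it only covers the basic CJK Unified Ideographs range, so treat it as a starting point rather than a complete fix.

```python
import re

# Spaces or tabs located between two CJK ideographs.
# Assumption: only the basic block U+4E00..U+9FFF is considered.
CJK_GAP = re.compile(r"(?<=[\u4e00-\u9fff])[ \t]+(?=[\u4e00-\u9fff])")

def collapse_cjk_spaces(text):
    """Remove the spaces inserted between adjacent Chinese characters."""
    return CJK_GAP.sub("", text)

# Fragment taken from the OCR output above
sample = "Linux 远 程 控 制 系 统 是 一 款 针 对 Linux 系 统"
print(collapse_cjk_spaces(sample))
```

Spaces next to Latin words such as `Linux` are kept, and line breaks are left untouched so the layout of the page is preserved.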

## Processing the images and storing the results in a JSON file
Now let's automate the extraction process and create a JSON file that we can query to access the data.

In [26]:
import os
import json
from PIL import Image
import pytesseract
from googletrans import Translator

# Directory containing PNG files
image_directory = '0'

# Sort the files to keep track of the context when possible
files = sorted(os.listdir(image_directory))  

translator = Translator()
results = []

for file_name in files:
    # Check if the file is a PNG
    if file_name.endswith('.png'):
        file_path = os.path.join(image_directory, file_name)
        
        # Perform OCR using Tesseract
        image = Image.open(file_path)
        text = pytesseract.image_to_string(image, lang='chi_sim')  # Adjust lang as needed

        if text and not text.isspace():
            print(f"Extracted Text from {file_name}")#, text)
            
            try:
                translated_text = translator.translate(text, dest='en') 
                #print(f"Translated Text from {file_name}:", translated_text.text)
                
                # create the json file with both original and translated text
                results.append({
                    'file_name': file_name,
                    'original_text': text,
                    'translated_text': translated_text.text
                })
                
            except Exception as e:
                print(f"Translation failed for {file_name} with error: {e}")
        else:
            print(f"No text found in {file_name} or text is not suitable for translation.")

# Save the json
with open('text_translations2.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, ensure_ascii=False, indent=4)

print("All done! The extracted and translated texts are saved in text_translations2.json, maintaining the original order.")

# The process can take up to 30 min


Extracted Text from 0-08a6bcd3-6477-4252-8f35-4f8f80d114f9.png
Extracted Text from 0-0b54af64-c2cd-4acb-9864-73a584aa6ebc.png
No text found in 0-0baba509-5e81-4b88-b509-843822d09e21.png or text is not suitable for translation.
Extracted Text from 0-0f319bf6-e667-4bac-a974-dfda1142e9ff.png
No text found in 0-129ac70f-8942-4ca7-b1f2-ddeaa3d984b5.png or text is not suitable for translation.
Extracted Text from 0-1a20ded1-50fc-4153-9a95-e158eeb7199e.png
Extracted Text from 0-1afcf93d-50f1-4f1e-896d-87b0da7519f7.png
No text found in 0-1b0dc208-d2bb-43ea-b744-534f3b759394.png or text is not suitable for translation.
No text found in 0-1cc570d8-cddb-401e-8c37-ef10c0e4841f.png or text is not suitable for translation.
Extracted Text from 0-300450bf-221e-4eeb-bdda-dc1115c947ea.png
No text found in 0-32eb7662-f212-4811-a7c1-1cfeb121cd99.png or text is not suitable for translation.
No text found in 0-330f554f-a3e6-4bd3-8b1b-d5949e1f30e8.png or text is not suitable for translation.
Extracted Text f
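
Since the loop above can take up to 30 minutes and a translation can fail along the way, it may be worth making the run resumable. The sketch below is one possible approach, not part of the original workflow: the two helpers `load_previous_results` and `pending_files` are hypothetical names, and they assume the JSON layout written by the cell above (a list of entries with a `file_name` key).

```python
import json
import os

def load_previous_results(json_path):
    """Return the results saved by an earlier run, or an empty list."""
    if not os.path.exists(json_path):
        return []
    with open(json_path, "r", encoding="utf-8") as f:
        return json.load(f)

def pending_files(all_files, results):
    """Return the PNG files with no entry in results yet, in sorted order."""
    done = {item["file_name"] for item in results}
    return [name for name in sorted(all_files)
            if name.endswith(".png") and name not in done]

# Example with made-up file names
previous = [{"file_name": "a.png", "original_text": "", "translated_text": ""}]
print(pending_files(["b.png", "a.png", "notes.txt"], previous))
```

Note that images where no text was found are not stored in the JSON file, so with this approach they would go through OCR again on the next run.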

## Leveraging Generative AI to Analyze the data

Now let's use Gen AI to help us analyze the data collected.

In [1]:
import json
import os

# Load the JSON file containing the translations
with open('text_translations2.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
    
all_translated_texts = ""

# Loop through each item in the data list
for item in data:
    # Append the translated text to the all_translated_texts variable
    all_translated_texts += item['translated_text'] + "\n\n"

len(all_translated_texts)

507799

In [2]:
all_translated_texts[:1000]

'Euuui\n\nEhehytnr\nHHEH\nBTH\ntuattp\ny\'a]> l.] [[_ _ ′] ′ | y "quantity` _ 芗]\nEhar\nE\n\nE\nriver\n\nE\n\nnnarseran\n\\ @e\n\nFor a moment\n\nE\nCountry -1> Country\n\ng\nR2\nn\nH\nF&HSi\n\nE\n\nE\n\nE\n\nShirayuki 50 孛\n\nE\n\nThe department is simply called (the post -response is slow), and now it can be\nIs it overwhelming?\n\nIt has no physical slow\n\nDifferent from the same file service device, separate after coming out\nFu Baosun is\n\nTake the single one by one\n\nNow I look at the internal capacity\n\nWell, we want to buy Kui Box\'s Maotai Manda National Public Consultation\n\nTo be honest, this point is taken before, but it is\nThe timeliness is not good, many Dongxi reads today\nThere is a value from the beginning, and it will not be available in two days\n\nAccording to the experience of you, you have updated this one\nWow, please\n\nIs it possible to pass back from time to time?\n\n『 E\n\nS B oesxysc\n4\nLice Diao fork K 虬 K Ji stare at DAVE\n\nEUCEECR\n\nrnecoveas 一\n

In [9]:
file_name = data[0]['file_name']  # Adjusted to access the first item in the list
translated_text = data[0]['translated_text']

from openai import OpenAI
os.environ["OPENAI_API_KEY"] = api_key

client = OpenAI()

completion = client.chat.completions.create(
  model="gpt-4-0125-preview",
  max_tokens=4096,
  messages=[
    {"role": "system", "content": "You are a Cyber Threat Intelligence analyst specialized in China operation. You are dedicated to analyzing leaked sensitive information in relation to Chinese espionage capabilities. The data contains multiple format documents, chat conversion, screenshot of products."},
    {"role": "user", "content": f"Make me a summary of this information: {all_translated_texts[:10000]}" }
  ]
)

print(completion.choices[0].message.content)

Based on the provided text, it appears to be a mixture of fragmented data, possibly from a larger intelligence gathering operation related to Chinese espionage or surveillance activities. The text contains various elements such as references to espionage capabilities, data breaches, targeted espionage operations against specific countries, technological vulnerabilities, and potential espionage tools and methods. Here's a succinct summary categorizing the key points:

1. **Cyber Espionage Tools and Vulnerabilities**: Mentions of "Mikrotik's 0day" and "gmail acquisition" suggest discussions or reports on exploiting vulnerabilities in network equipment and email services for intelligence gathering. The question "Is there any related to iOS?" could indicate an interest in vulnerabilities within Apple's operating systems.

2. **Data Breaches and Intelligence Gathering Operations**: The text lists extensive data breaches affecting companies and government entities across various countries, i

---------
Let's modify our prompt a little!

In [10]:
file_name = data[0]['file_name']  # Adjusted to access the first item in the list
translated_text = data[0]['translated_text']

from openai import OpenAI
os.environ["OPENAI_API_KEY"] = api_key

client = OpenAI()

completion = client.chat.completions.create(
  model="gpt-4-0125-preview",
  max_tokens=4096,
  messages=[
    {"role": "system", "content": "You are a Cyber Threat Intelligence analyst specialized in China operation. You are dedicated to analyzing leaked sensitive information in relation to Chinese espionage capabilities. The data contains multiple format documents, chat conversion, screenshot of products..."},
    {"role": "user", "content": f"Make me a summary of the following information and include evidences such as names, ip or any other artefacts: {all_translated_texts[:10000]}" }
  ]
)

print(completion.choices[0].message.content)

The provided text appears to be a collection of seemingly random sequences interspersed with mentions of potential espionage-related activities, cyber threats, and intelligence gathering efforts attributed possibly to Chinese operations or interests. Given the fragmented and coded nature of the excerpt, the following summary attempts to identify key points and artifacts of interest within the constraints of the provided data:

### Potential Cyber Espionage Tools and Targets
- **Mikrotik's 0day and Gmail acquisition**: References to vulnerabilities or exploits possibly used for gaining unauthorized access.
- **iOS-related questions**: May imply an interest or ongoing efforts to compromise Apple's iOS devices.

### Data Breaches or Intelligence Gathering Efforts
- Specific details about data collections, including numbers of records and types of data obtained from various organizations and countries, notably from Myanmar, Vietnam, and presumably other regions. The data mentioned includes

------------
Interesting! Now we have a broader overview of the kind of information the data might contain.
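
Both prompts above only send `all_translated_texts[:10000]`, that is the first 10,000 of the 507,799 characters. One way to cover more of the data is to split the text into chunks and summarize each of them. The sketch below only shows the splitting part; `chunk_text` is a hypothetical helper, and the 10,000-character limit is simply reused from the slice above.

```python
def chunk_text(text, max_chars=10000, separator="\n\n"):
    """Split text into chunks of at most max_chars, cutting on separator when possible."""
    chunks, current = [], ""
    for block in text.split(separator):
        # A block longer than max_chars is cut into fixed-size slices
        while len(block) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(block[:max_chars])
            block = block[max_chars:]
        candidate = block if not current else current + separator + block
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = block
    if current:
        chunks.append(current)
    return chunks

# Small demonstration with a tiny limit
print(chunk_text("aaa\n\nbbb\n\nccc", max_chars=8))
```

Each chunk could then be passed to the same `client.chat.completions.create` call, and the partial summaries combined in a final prompt. Keep in mind that this multiplies the number of API calls.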

### Retrieval Augmented Generation (RAG)
In this part, I want to experiment with using RAG to query our data.

In [14]:
#!pip install langchain langchain-community chromadb jq
#!pip install langchain-openai
#!pip install jq
#!pip install --upgrade --quiet  langchain langchain-openai faiss-cpu tiktoken

### Using ChromaDB
Let's create a simple RAG pipeline using LangChain so that we can query our data with our own questions.

In [77]:
from langchain.prompts import ChatPromptTemplate
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import JSONLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Create the embedding
embedding_function = OpenAIEmbeddings()

# Loading our JSON file
loader = JSONLoader(file_path="text_translations2.json", jq_schema=".[] | .translated_text", text_content=False)
documents = loader.load()

db = Chroma.from_documents(documents, embedding_function)
retriever = db.as_retriever()

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI(temperature=0, model_name="gpt-4-0125-preview")

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

In [49]:
query = "Can you give me details about intelligence capabilities from this data leak?"
# Alternative query: "Give me details about the email collection platform"
print(chain.invoke(query))

Based on the provided context from the data leak, the intelligence capabilities described involve the use of professional Advanced Persistent Threat (APT) penetration techniques. The document outlines a service system that specializes in intelligence services, emphasizing a "loose supply" approach. Here are the key points regarding the intelligence capabilities:

1. **Professional APT Penetration Team**: The service boasts a research team with significant expertise in APT penetration. This suggests a high level of skill in conducting sophisticated cyber attacks that remain undetected over long periods.

2. **Rich Experience in APT Penetration**: The document highlights the team's extensive experience in APT operations, indicating they have successfully executed multiple APT campaigns in the past. This experience likely contributes to their ability to navigate complex security environments and achieve their objectives.

3. **Cooked APT Implementation Flow**: The term "cooked" here sugge
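
In the chain above, the retriever output is passed directly into `{context}`, so the prompt most likely receives the string form of a list of documents, metadata included. A common pattern is to join only the text of each document first. The sketch below illustrates the idea with stand-in objects so it runs without LangChain; `format_docs` is a name chosen for this example.

```python
from types import SimpleNamespace

def format_docs(docs):
    """Join the page_content of the retrieved documents into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)

# Stand-in objects exposing a page_content attribute, used here in place of
# real LangChain Document instances
sample_docs = [
    SimpleNamespace(page_content="First retrieved passage"),
    SimpleNamespace(page_content="Second retrieved passage"),
]
print(format_docs(sample_docs))
```

In the chain definition, the context entry would then become `"context": retriever | format_docs`, the rest staying unchanged.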

### Using FAISS for RAG with Memory

In [50]:
# Uncomment the following line if you need to initialize FAISS with no AVX2 optimization
# os.environ['FAISS_NO_AVX2'] = '1'

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Reuse the JSONLoader created in the ChromaDB section above
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1433, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(docs, embeddings)

Created a chunk of size 3069, which is longer than the specified 1433
Created a chunk of size 2432, which is longer than the specified 1433
Created a chunk of size 1507, which is longer than the specified 1433
Created a chunk of size 2345, which is longer than the specified 1433
Created a chunk of size 2518, which is longer than the specified 1433
Created a chunk of size 3880, which is longer than the specified 1433
Created a chunk of size 2013, which is longer than the specified 1433
Created a chunk of size 1659, which is longer than the specified 1433


In [73]:
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts.prompt import PromptTemplate

retriever = db.as_retriever(search_kwargs={"k":5})


# Define your custom template
custom_template = """You are a Cyber Threat Intelligence analyst specialized in China operation. You are dedicated to analyzing leaked sensitive information in relation to Chinese espionage capabilities. The data contains multiple format documents, chat conversion, screenshot of products... If you do not know the answer reply with 'I am sorry'.
Chat History:
{chat_history}
Follow Up Input: {question}
Answer: """

CUSTOM_QUESTION_PROMPT = PromptTemplate.from_template(custom_template)

# Initialize memory for chat history
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Initialize the ConversationalRetrievalChain
qa_chain = ConversationalRetrievalChain.from_llm(model, retriever, condense_question_prompt=CUSTOM_QUESTION_PROMPT, memory=memory)

def execute_conversation(question):
    # Load conversational history from file
    try:
        with open('conversational_history.json', 'r') as f:
            conversational_history = json.load(f)
    except FileNotFoundError:
        conversational_history = []
    
    # Update conversational history with the user's question
    conversational_history.append(("User", question))
    
    # Use the ConversationalRetrievalChain to get the answer
    result = qa_chain({"question": question})
    
    # Extract the 'answer' part from the result
    response_text = result.get('answer', 'Sorry, I could not generate a response.')
    
    # Update conversational history with the bot's response
    conversational_history.append(("Bot", response_text))
    
    # Limit the history to the last 10 messages
    if len(conversational_history) > 10:
        conversational_history = conversational_history[-10:]
    
    # Save conversational history to file
    with open('conversational_history.json', 'w') as f:
        json.dump(conversational_history, f)
    
    # Print only the last message in the conversational history
    last_message = conversational_history[-1]
    print(f"Discussion:\n{last_message[0]}: {last_message[1]}")


In [56]:
# Call the function with a question
execute_conversation("Which countries might be a target according to the documents?")

Discussion:
Bot: Based on the document, the countries that might be targets include:

1. Afghanistan (referred to as "Afu Khan Guojia")
2. Countries in Southeast Asia (mentioned in the context of anti-terrorism postal mail)
3. Countries in West Asia (mentioned in relation to the Ministry of Communications)
4. Thailand (Thai Ministry of Finance and Ministry of Commerce)
5. Mongolia (Mentioned in relation to Foreign Communications and the Police Bureau)
6. Kazakhstan (referred to in the context of airlines and possibly telecommunications with "Airastanna Airlines" and "Harzakstan Kcell")
7. Malaysia (mentioned in the context of the military, specifically the Malaysian Army Network)
8. Macau (mentioned in the context of airlines, "Macau Airlines")
9. Pakistan (mentioned in the context of cooperation with the Pakistani Public Security Bureau)
10. Syria (mentioned in the context of a specific direction of focus)
11. Uzbekistan (mentioned as "Wu Zabbettan")
12. Iran (mentioned as "Yilang")



In [62]:
# Call the function with a question
execute_conversation("Give me more details about any references related to espionage capabilities")

Discussion:
Bot: The document outlines a company's capabilities in network attack, anti-penetration, and security research, emphasizing its experience in network infiltration and data extraction. It mentions the company's service to central government bodies, law enforcement, and departments concerned with public order, providing them with key information data on specific targets or network systems of interest. This includes the extraction and analysis of large-scale data for national web departments, aiming at specific target data excavation services.

The company boasts a professional APT (Advanced Persistent Threat) penetration research team with extensive experience in APT penetration and a well-established implementation process. It caters to domestic public security departments based on their business needs, offering services to obtain crucial information data on specific targets through APT appointments.

Furthermore, the document highlights the company's involvement in counter-

In [65]:
# Call the function with a question
execute_conversation("Give me more details about the email/social network intelligence platform")

Discussion:
Bot: The email/social network intelligence platform described in the provided context appears to be a sophisticated system designed for comprehensive email analysis and decision-making. Here are the key features and functionalities based on the details provided:

1. **Product Superiority:**
   - **High Accuracy:** Utilizes a large data frame and "Wenben Intelligence" to accurately recognize techniques, enabling fast analysis of high-volume emails.
   - **Powerful Analysis:** Capable of conducting various types of custom analyses on target emails, including but not limited to the mailing list, email content, and sender information.
   - **High Availability:** Offers a stable and reliable system with different interface operations that are convenient and easy to use.

2. **Email Collection:**
   - The platform provides a function for the self-initiated collection of emails based on specified criteria such as target mailbox account, IP address of the server device, and passwor

In [67]:
# Call the function with a question
execute_conversation("Extract and summarize any kind of information that might be useful for a defender or a threat analyst, including potential IOCs and TTPs")

Discussion:
Bot: The document outlines a comprehensive set of tools and methodologies used by a professional network attack and anti-penetration team, which could be of significant interest to defenders and threat analysts. Here's a summary of the relevant information, including potential Indicators of Compromise (IOCs) and Tactics, Techniques, and Procedures (TTPs):

### TTPs:

1. **Auxiliary Modules for Initial Reconnaissance:**
   - Scanning and checking various network services.
   - Collecting login credentials through fake services.
   - Port scanning to identify open services and potential vulnerabilities.

2. **Penetration and Post-Penetration Modules:**
   - Utilization of custom and predefined modules for conducting penetration attacks and maintaining access post-penetration.
   - Techniques for lateral movement within the network.
   - Reuse of evidence or high-value information obtained during the initial penetration for further attacks.

3. **Attack Persistence:**
   - Tec
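
As mentioned in the disclaimer, LLM output may need additional verification. For artefacts such as IP addresses or email addresses, a simple regex pass over the translated text can serve as a cross-check of what the model reports. This is a minimal sketch with deliberately simple patterns; `extract_iocs` is a hypothetical helper and the patterns will miss or over-match some cases.

```python
import ipaddress
import re

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

def extract_iocs(text):
    """Return the sorted, de-duplicated IPv4 and email candidates found in text."""
    ips = set()
    for candidate in IPV4_RE.findall(text):
        try:
            # Rejects impossible addresses such as 999.1.1.1
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        ips.add(candidate)
    emails = set(EMAIL_RE.findall(text))
    return {"ipv4": sorted(ips), "email": sorted(emails)}

# Made-up sample, not taken from the leak
sample = "Server 192.168.1.10 and 999.1.1.1, contact admin@example.com"
print(extract_iocs(sample))
```

It could be applied to `all_translated_texts` or to the `original_text` fields of the JSON file. Four-part section numbers in the documents may show up as false positives, so the results still need a manual review.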

## What next?
In this document, we explored a method to analyze data in PNG format and in the Chinese language, concerning a leak related to a government contractor with offensive capabilities. We demonstrated how to leverage OCR to extract data from the PNG files and translate it into English to glean more details about it. Then, we used generative AI to summarize some of the information available, and finally built a RAG pipeline to explore the data and make specific queries without manually digging through the vast amount of information.

There is obviously more to discover, and I hope this process will assist you in further analyzing the data leak.

If you like this notebook, stay tuned for more soon!

Thomas Roccia | [@fr0gger_](https://twitter.com/fr0gger_)