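# Purpose and Significance

## Background

The nifH gene is an important marker gene for studying biological nitrogen fixation. Analyzing its sequence features, such as GC content and sequence length, makes it possible to assess its functional activity or abundance, which matters for understanding the nitrogen cycle in marine ecosystems.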

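## Research Objectives

1. Explore the relationship between key features of the nifH gene and its functional activity.
2. Use machine learning models (linear regression, random forest, neural networks) to predict the nitrogen-fixing function of unknown samples.
3. Compare the performance of the different models and assess their suitability for bioinformatics analysis.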

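## Significance

Introducing modeling methods can:

1. Improve the accuracy of nitrogen-fixation gene function prediction and broaden its application scenarios.
2. Capture complex feature relationships, providing a reference for future gene function research.
3. Build a general analysis framework that offers methodological support for the functional analysis of other genes.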


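## ID: unique identifier of the gene sequence.

## Sequence: the base sequence of the gene (A, T, C, G).

## Length: length of the gene, in base pairs.

## GC_Content: GC content, defined as (G + C) / total length.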
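The GC_Content definition above can be sketched in a few lines of plain Python. The function name `gc_content` is a hypothetical helper introduced here for illustration, and the example string is the 50-base prefix of the first sequence printed later in this notebook; treating an empty sequence as 0.0 is an assumption.

```python
def gc_content(seq: str) -> float:
    """Return (G + C) / total length for a nucleotide string."""
    seq = seq.upper()  # tolerate lowercase (soft-masked) input
    if not seq:
        return 0.0  # assumption: define the GC content of an empty sequence as 0
    return (seq.count("G") + seq.count("C")) / len(seq)


# 50-base prefix of the first nifH sequence shown later in the notebook
example = "ATGTCTTTGCGCCAGATTGCGTTCTACGGTAAGGGCGGTATCGGCAAGTC"
print(f"Length = {len(example)}, GC_Content = {gc_content(example):.3f}")
```

Because the value is a ratio, it stays comparable across sequences of different lengths.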


In [1]:
import sys
print(sys.version)

3.9.18 (main, Sep 11 2023, 08:38:23) 
[Clang 14.0.6 ]


In [2]:
pip install tensorflow


Collecting tensorflow
  Downloading tensorflow-2.16.2-cp39-cp39-macosx_10_15_x86_64.whl.metadata (4.1 kB)
Collecting absl-py>=1.0.0 (from tensorflow)
  Using cached absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting astunparse>=1.6.0 (from tensorflow)
  Using cached astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting flatbuffers>=23.5.26 (from tensorflow)
  Using cached flatbuffers-24.3.25-py2.py3-none-any.whl.metadata (850 bytes)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 (from tensorflow)
  Using cached gast-0.6.0-py3-none-any.whl.metadata (1.3 kB)
Collecting google-pasta>=0.1.1 (from tensorflow)
  Using cached google_pasta-0.2.0-py3-none-any.whl.metadata (814 bytes)
Collecting h5py>=3.10.0 (from tensorflow)
  Downloading h5py-3.12.1-cp39-cp39-macosx_10_9_x86_64.whl.metadata (2.5 kB)
Collecting libclang>=13.0.0 (from tensorflow)
  Using cached libclang-18.1.1-py2.py3-none-macosx_10_9_x86_64.whl.metadata (5.2 kB)
Collecting ml-dtypes~=0.3.1 (from tensorflow

In [3]:
pip install seaborn

Collecting seaborn
  Using cached seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting pandas>=1.2 (from seaborn)
  Downloading pandas-2.2.3-cp39-cp39-macosx_10_9_x86_64.whl.metadata (89 kB)
Collecting pytz>=2020.1 (from pandas>=1.2->seaborn)
  Using cached pytz-2024.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas>=1.2->seaborn)
  Downloading tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached seaborn-0.13.2-py3-none-any.whl (294 kB)
Downloading pandas-2.2.3-cp39-cp39-macosx_10_9_x86_64.whl (12.6 MB)
Using cached pytz-2024.2-py2.py3-none-any.whl (508 kB)
Downloading tzdata-2024.2-py2.py3-none-any.whl (346 kB)

In [5]:
pip install biopython


Collecting biopython
  Downloading biopython-1.84-cp39-cp39-macosx_10_9_x86_64.whl.metadata (12 kB)
Downloading biopython-1.84-cp39-cp39-macosx_10_9_x86_64.whl (2.8 MB)
Installing collected packages: biopython
Successfully installed biopython-1.84
Note: you may need to restart the kernel to use updated packages.


In [73]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from Bio import SeqIO
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split




<!-- is a typical nifH gene sequence length range; the GC content is 0.646, consistent with the high GC content common in nitrogen-fixing genes. -->


In [76]:
# Path to the main data folder
data_folder = "./nifHdata"

# List holding all sequences
all_sequences = []

# Labels, one per genome
all_labels = []

all_annotations = []  # annotation (FASTA description) for each sequence

In [77]:
# Iterate over the 20 dataset folders
for i in range(20):
    folder_name = f"nifH_datasets ({i})" if i > 0 else "nifH_datasets"
    folder_path = os.path.join(data_folder, folder_name)
    gene_file = os.path.join(folder_path, "ncbi_dataset/data/gene.fna")

    # Make sure the file exists
    if os.path.exists(gene_file):
        # Read the sequences and annotations from the FASTA file
        for record in SeqIO.parse(gene_file, "fasta"):
            all_sequences.append(str(record.seq))  # store as a string
            all_annotations.append(record.description)  # store the annotation
            all_labels.append(i)  # the label is the folder index
    else:
        print(f"Cannot find file: {gene_file}")


# Print the results
print(f"Successfully loaded {len(all_sequences)} sequences!")
print(f"{len(set(all_labels))} classes in total, labels: {set(all_labels)}")
print("Example sequence:", all_sequences[0][:50], "...")  # first 50 characters of the first sequence
print("Example annotation:", all_annotations[0])  # annotation of the first sequence
print("Example label:", all_labels[0])  # label of the first sequence


Successfully loaded 20 sequences!
20 classes in total, labels: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}
Example sequence: ATGTCTTTGCGCCAGATTGCGTTCTACGGTAAGGGCGGTATCGGCAAGTC ...
Example annotation: NZ_VISK01000015.1:262164-263045 nifH [organism=Azospirillum brasilense] [GeneID=56451760] [chromosome=]
Example label: 0
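The annotation line printed above carries bracketed `[key=value]` fields such as the organism name. As a minimal sketch, assuming every header follows that same bracketed layout, a regular expression can pull those fields into a dictionary. The helper name `parse_header_fields` is hypothetical and not part of Biopython.

```python
import re


def parse_header_fields(description: str) -> dict:
    """Collect the [key=value] pairs found in a FASTA description line."""
    # Each match is a (key, value) tuple; an empty value such as [chromosome=] is kept as ""
    return dict(re.findall(r"\[(\w+)=([^\]]*)\]", description))


header = (
    "NZ_VISK01000015.1:262164-263045 nifH "
    "[organism=Azospirillum brasilense] [GeneID=56451760] [chromosome=]"
)
fields = parse_header_fields(header)
print(fields["organism"])  # Azospirillum brasilense
```

Applied to `all_annotations`, this would give one organism name per sequence, which could later serve as a more meaningful label than the folder index.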


# Calculate the length of each sequence


In [53]:
sequence_lengths = [len(seq) for seq in all_sequences]

print(f"Sequence length statistics: shortest = {np.min(sequence_lengths)}, longest = {np.max(sequence_lengths)}, average = {np.mean(sequence_lengths):.2f}")

Sequence length statistics: shortest = 813, longest = 894, average = 856.05


# Convert the raw sequences directly to one-hot encoding

### One-hot encoding represents each gene sequence in a numerical form that a model can consume while retaining the biological information.

A raw gene sequence is a string of characters (A, T, G, C, and so on), which most machine learning models cannot take as input directly. Before modeling, these characters must be converted to numbers. One-hot encoding is a common choice because it preserves the identity of every base at every position in the original sequence.


In [54]:
# One-hot encoding function
def one_hot_encode(seq):
    mapping = {'A': 0, 'T': 1, 'G': 2, 'C': 3, 'N': 4}  # "N" is included as a special character
    one_hot = np.zeros((len(seq), len(mapping)))
    # Upper-case first so lowercase bases are not silently skipped;
    # any character outside the mapping stays an all-zero row
    for i, char in enumerate(seq.upper()):
        if char in mapping:
            one_hot[i, mapping[char]] = 1
    return one_hot


# Encode all sequences
one_hot_sequences = [one_hot_encode(seq) for seq in all_sequences]



print(f"One-hot encoding succeeded; total sequences: {len(one_hot_sequences)}")
print(f"Shape of the first one-hot encoded sequence: {one_hot_sequences[0].shape}")

One-hot encoding succeeded; total sequences: 20
Shape of the first one-hot encoded sequence: (882, 5)
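To make the encoding concrete, the sketch below runs a six-character toy string through a NumPy-indexed variant of the same idea. It uses the same column order as the mapping above (A, T, G, C, N). The function name `one_hot_encode_vectorized` and the toy input are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

BASES = "ATGCN"  # column order: A=0, T=1, G=2, C=3, N=4


def one_hot_encode_vectorized(seq: str) -> np.ndarray:
    """One-hot encode a sequence; characters outside BASES become all-zero rows."""
    lookup = {base: col for col, base in enumerate(BASES)}
    # Column index per position, -1 for unknown characters
    cols = np.array([lookup.get(ch, -1) for ch in seq.upper()], dtype=int)
    one_hot = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    known = cols >= 0
    one_hot[np.nonzero(known)[0], cols[known]] = 1.0
    return one_hot


encoded = one_hot_encode_vectorized("ATGCNX")
print(encoded.shape)  # (6, 5)
print(encoded)
```

Each row has at most a single 1, so the first row (for `A`) is `[1, 0, 0, 0, 0]`, and the last row (for the unknown `X`) is all zeros.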


# Split the dataset

### The sequence data must be split into a training set and a validation set. With only 20 sequences in total, a simple 8:2 split is used.


In [55]:
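# Generate target labels as the task requires; here, placeholder classification labels
# (the first 10 sequences are assumed to be 1, the last 10 to be 0)
labels = np.array([1] * 10 + [0] * 10, dtype=np.float32)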

In [56]:
from sklearn.model_selection import train_test_split

# Simulated target labels (assumes a binary classification problem, 1 and 0)
# In a real application, the true labels must come from the experimental design
labels = [1 if i < 10 else 0 for i in range(20)]  # example labels: first 10 are 1, last 10 are 0

# Split the dataset
train_sequences, test_sequences, train_labels, test_labels = train_test_split(
    one_hot_sequences, labels, test_size=0.2, random_state=42
)

# Print the split result
print(f"Number of training set samples: {len(train_sequences)}")
print(f"Number of validation set samples: {len(test_sequences)}")


Number of training set samples: 16
Number of validation set samples: 4
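With only four validation samples, a purely random split can leave the two classes unevenly represented. One option, shown here as a sketch rather than as the method used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both subsets. The `sample_ids` list is a stand-in for the encoded sequences.

```python
from sklearn.model_selection import train_test_split

# Placeholder labels matching the notebook's assumption: first 10 are 1, last 10 are 0
labels = [1] * 10 + [0] * 10
sample_ids = list(range(20))  # stand-in for the one-hot encoded sequences

train_ids, test_ids, train_y, test_y = train_test_split(
    sample_ids,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,  # preserve the 1:1 class ratio in both subsets
)

print(f"Validation set: {len(test_y)} samples, {sum(test_y)} positive")
print(f"Training set: {len(train_y)} samples, {sum(train_y)} positive")
```

With 10 samples per class and a 20% validation share, stratification puts two samples of each class in the validation set.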


In [58]:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Flatten, LSTM


# Build the model


In [59]:
# Convert to dense tensors, padding with 0
train_data_tensor = tf.ragged.constant(train_sequences).to_tensor(default_value=0.0)
test_data_tensor = tf.ragged.constant(test_sequences).to_tensor(default_value=0.0)

# Confirm that the shapes are consistent
print("Training data shape:", train_data_tensor.shape)
print("Validation data shape:", test_data_tensor.shape)

Training data shape: (16, 894, 5)
Validation data shape: (4, 894, 5)
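Padding each subset separately yields matching shapes only if both subsets happen to contain a sequence of the maximum length. A more defensive sketch, assuming a fixed target length of 894 (the longest sequence reported earlier), pads every array to that length with NumPy. The helper name `pad_one_hot` and the demo arrays are hypothetical.

```python
import numpy as np


def pad_one_hot(encoded_seqs, max_len):
    """Zero-pad (or truncate) a list of (length, channels) arrays to (n, max_len, channels)."""
    n_channels = encoded_seqs[0].shape[1]
    batch = np.zeros((len(encoded_seqs), max_len, n_channels), dtype=np.float32)
    for row, arr in enumerate(encoded_seqs):
        length = min(len(arr), max_len)  # anything longer than max_len is truncated
        batch[row, :length] = arr[:length]
    return batch


# Demo arrays with the shortest and longest lengths reported in the notebook
demo = [np.ones((813, 5)), np.ones((894, 5))]
padded = pad_one_hot(demo, max_len=894)
print(padded.shape)  # (2, 894, 5)
```

Using one global `max_len` for both training and validation data keeps the input shape fixed regardless of how the split falls.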


In [60]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Flatten

# Model definition
model = Sequential([
    Flatten(input_shape=(894, 5)),  # input shape must match the padded data
    Dense(64, activation='relu'),   # first fully connected layer
    Dense(32, activation='relu'),   # second fully connected layer
    Dense(1, activation='sigmoid')  # output layer; sigmoid for binary classification
])

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Print the model architecture
model.summary()
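The summary output is not shown here, but the parameter count follows from the layer sizes: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The short calculation below works this out for the architecture defined above; `dense_params` is a helper introduced only for this illustration.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


flattened = 894 * 5          # Flatten turns (894, 5) into 4470 features
layer_units = [64, 32, 1]    # the three Dense layers defined above

total = 0
n_in = flattened
for n_out in layer_units:
    params = dense_params(n_in, n_out)
    print(f"Dense({n_out}): {params} parameters")
    total += params
    n_in = n_out

print(f"Total trainable parameters: {total}")  # 288257
```

With roughly 288,000 parameters and only 16 training samples, the network can easily memorize the training set, which is worth keeping in mind when reading the training log.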


# Train the model


In [65]:
import numpy as np

# Convert the training and test data to NumPy arrays
train_data_tensor = np.array(train_data_tensor, dtype=np.float32)
test_data_tensor = np.array(test_data_tensor, dtype=np.float32)

# Convert the labels to NumPy arrays
train_labels = np.array(train_labels, dtype=np.float32)
test_labels = np.array(test_labels, dtype=np.float32)

# Check shapes and dtypes once more
print("Train data shape:", train_data_tensor.shape)
print("Train data dtype:", train_data_tensor.dtype)

print("Test data shape:", test_data_tensor.shape)
print("Test data dtype:", test_data_tensor.dtype)

print("Train labels shape:", train_labels.shape)
print("Train labels dtype:", train_labels.dtype)

print("Test labels shape:", test_labels.shape)
print("Test labels dtype:", test_labels.dtype)


Train data shape: (16, 894, 5)
Train data dtype: float32
Test data shape: (4, 894, 5)
Test data dtype: float32
Train labels shape: (16,)
Train labels dtype: float32
Test labels shape: (4,)
Test labels dtype: float32


In [66]:
# Start training the model
history = model.fit(
    train_data_tensor,
    train_labels,
    epochs=10,  # number of training epochs
    validation_data=(test_data_tensor, test_labels),
    batch_size=4  # process 4 samples per step
)

# Report that training has finished
print("Model training complete!")


Epoch 1/10
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 88ms/step - accuracy: 0.4083 - loss: 0.8034 - val_accuracy: 0.7500 - val_loss: 0.4310
Epoch 2/10
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step - accuracy: 1.0000 - loss: 0.2429 - val_accuracy: 0.7500 - val_loss: 0.5307
Epoch 3/10
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step - accuracy: 1.0000 - loss: 0.0775 - val_accuracy: 0.7500 - val_loss: 0.5965
Epoch 4/10
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - accuracy: 1.0000 - loss: 0.0541 - val_accuracy: 0.7500 - val_loss: 0.7410
Epoch 5/10
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - accuracy: 1.0000 - loss: 0.0215 - val_accuracy: 0.7500 - val_loss: 0.8515
Epoch 6/10
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step - accuracy: 1.0000 - loss: 0.0068 - val_accuracy: 0.7500 - val_loss: 0.9170
Epoch 7/10
4/4 ━━━━━━━━━━━━━━━━━━

In [71]:
# Evaluate the model
evaluation = model.evaluate(test_data_tensor, test_labels)
print(f"Validation loss: {evaluation[0]:.4f}, validation accuracy: {evaluation[1]:.4f}")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 55ms/step - accuracy: 0.7500 - loss: 1.1155
Validation loss: 1.1155, validation accuracy: 0.7500


In [43]:
# Take one sample from the validation set
sample_data = tf.expand_dims(test_data_tensor[0], axis=0)  # add a batch dimension

# Predict
prediction = model.predict(sample_data)
print(f"Prediction: {prediction[0][0]:.4f}")  # the output is a probability


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 94ms/step
Prediction: 0.7205


In [72]:
# Select one sample from the validation set
sample_data = tf.expand_dims(test_data_tensor[0], axis=0)  # add a batch dimension

# Predict
prediction = model.predict(sample_data)
print(f"Prediction (probability): {prediction[0][0]:.4f}")  # the output is a probability


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 97ms/step
Prediction (probability): 0.0121
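The sigmoid output is a probability, so a decision threshold is still needed to turn it into a class label. The sketch below applies a threshold of 0.5, a common default that is assumed here rather than taken from the notebook, to the two probabilities printed by the prediction cells above.

```python
import numpy as np

# The two probabilities printed by the prediction cells above (from separate runs)
probabilities = np.array([0.7205, 0.0121])

threshold = 0.5  # assumption: conventional default for a sigmoid output
predicted_labels = (probabilities >= threshold).astype(int)

for prob, label in zip(probabilities, predicted_labels):
    print(f"probability = {prob:.4f} -> predicted class {label}")
```

The same validation sample maps to class 1 in one run and class 0 in the other, which shows how unstable predictions can be for a model trained on 16 samples.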
