
**CODEBASE**

In [8]:
# Import necessary libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np  # Import NumPy

# Sample data (replace with your dataset)
texts = [
    "SELECT * FROM users WHERE id=1",
    "DROP TABLE users",
    "UPDATE products SET price=0 WHERE id=5",
    "INSERT INTO logs (message) VALUES ('SQL injection attack')"
]

labels = [0, 1, 0, 1]  # 0 for normal, 1 for SQL injection

# Tokenization and padding
max_words = 1000  # Maximum number of words in the vocabulary
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x_train = pad_sequences(sequences)

# Convert labels to a NumPy array
labels = np.array(labels)


# Define the DCNN model
model = keras.Sequential([
    # Embedding layer to convert words to dense vectors
    keras.layers.Embedding(input_dim=max_words, output_dim=32, input_length=x_train.shape[1]),

    # Convolutional layers
    keras.layers.Conv1D(32, 3, activation='relu'),
    keras.layers.Conv1D(64, 3, activation='relu'),
    keras.layers.Conv1D(128, 3, activation='relu'),

    # Global max pooling to reduce dimensions
    keras.layers.GlobalMaxPooling1D(),

    # Dense layer with 16 neurons and ReLU activation
    keras.layers.Dense(16, activation='relu'),

    # Final dense output layer with 1 neuron and sigmoid activation for binary classification
    keras.layers.Dense(1, activation='sigmoid')
])




# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(x_train, labels, epochs=10)



# Demonstration: Predict SQL injection
new_texts = [
    "SELECT * FROM orders WHERE id=2",  # Normal query
    "DELETE FROM orders WHERE id=2 OR 1=1"  # SQL injection attempt
]


# Define the maximum sequence length (should be the same as during training)
max_sequence_length = x_train.shape[1]

# Tokenize and pad the new input sequences
sequences = tokenizer.texts_to_sequences(new_texts)
x_new = pad_sequences(sequences, maxlen=max_sequence_length)

# Predict SQL injection
predictions = model.predict(x_new)

for i, text in enumerate(new_texts):
    if predictions[i][0] > 0.5:  # each prediction row holds a single probability
        print(f"'{text}' is a SQL injection attempt.")
    else:
        print(f"'{text}' is a normal query.")


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
'SELECT * FROM orders WHERE id=2' is a SQL injection attempt.
'DELETE FROM orders WHERE id=2 OR 1=1' is a SQL injection attempt.
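
The three stacked `Conv1D` layers in the model above use the Keras default `padding='valid'`, so each layer shortens the sequence by `kernel_size - 1`. Below is a minimal sketch of that arithmetic in plain Python. It assumes a padded training length of 8 tokens (the real value is `x_train.shape[1]`), and the helper names `conv1d_valid_length` and `stacked_lengths` are hypothetical, not Keras functions.

```python
def conv1d_valid_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding (no zero padding)."""
    if input_length < kernel_size:
        raise ValueError(
            f"input length {input_length} is shorter than kernel size {kernel_size}"
        )
    return (input_length - kernel_size) // stride + 1


def stacked_lengths(input_length, kernel_sizes):
    """Sequence length before the first layer and after each stacked layer."""
    lengths = [input_length]
    for kernel_size in kernel_sizes:
        lengths.append(conv1d_valid_length(lengths[-1], kernel_size))
    return lengths


# Assumed padded length of 8; substitute x_train.shape[1] for a real dataset.
print(stacked_lengths(8, [3, 3, 3]))  # [8, 6, 4, 2]
```

With three layers of kernel size 3, the padded input needs at least 7 positions. Anything shorter leaves the last layer with too few positions to convolve over.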


**Explained code block**


In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# #explanation
# Import TensorFlow and Keras, which are libraries for building and training deep learning models.
# Import Tokenizer and pad_sequences for text preprocessing, and NumPy for array handling.




# Sample data (replace with your dataset)
texts = [
    "SELECT * FROM users WHERE id=1",
    "DROP TABLE users",
    "UPDATE products SET price=0 WHERE id=5",
    "INSERT INTO logs (message) VALUES ('SQL injection attack')"
]

labels = np.array([0, 1, 0, 1])  # 0 for normal, 1 for SQL injection

# #explanation
# Here, we define some sample input texts, which represent SQL queries. These texts can be replaced with your own dataset.
# We also define labels: 0 for normal queries and 1 for SQL injection attempts.
# The labels are stored as a NumPy array because model.fit can reject a plain Python list of labels.





# Tokenization and padding
max_words = 1000  # Maximum number of words in the vocabulary
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x_train = pad_sequences(sequences)

# #explanation
# Tokenization splits each text into words and maps every word to an integer index. We define a maximum vocabulary size of 1000 words.
# We create a tokenizer object and fit it on the input texts, building a vocabulary based on these texts.
# Next, we convert the input texts to sequences of integers using the tokenizer. Because no oov_token is set, words that were not seen during fitting are silently dropped.
# To ensure that all sequences have the same length, we pad them using pad_sequences. This is necessary because the network expects fixed-size input.




# Define the model
model = keras.Sequential([
    keras.layers.Embedding(input_dim=max_words, output_dim=32, input_length=x_train.shape[1]),
    keras.layers.Flatten(),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# #explanation
# We define the neural network model using Keras' Sequential API.
# The model consists of:
# An Embedding layer: Converts integer-encoded words into dense vectors. 'input_dim' is the size of the vocabulary, 'output_dim' is the dimension of the word vectors, and 'input_length' is the length of input sequences.
# A Flatten layer: Flattens the output from the Embedding layer.
# A Dense layer with 16 neurons and ReLU activation.
# A final Dense layer with 1 neuron and sigmoid activation for binary classification.




# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# #explanation
# We compile the model, specifying the optimizer, loss function, and evaluation metric. Here, we use the Adam optimizer and binary cross-entropy loss for binary classification.




# Train the model (You should use your own dataset)
model.fit(x_train, labels, epochs=10)

# #explanation
# We train the model using the 'fit' method for 10 epochs. Replace 'x_train' and 'labels' with your own training data.





# Demonstration: Predict SQL injection
new_texts = [
    "SELECT * FROM orders WHERE id=2",  # Normal query
    "DELETE FROM orders WHERE id=2 OR 1=1"  # SQL injection attempt
]

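# Pad to the training length so the input shape matches what the model expects
sequences = tokenizer.texts_to_sequences(new_texts)
x_new = pad_sequences(sequences, maxlen=x_train.shape[1])

predictions = model.predict(x_new)

for i, text in enumerate(new_texts):
    if predictions[i][0] > 0.5: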
        print(f"'{text}' is a SQL injection attempt.")
    else:
        print(f"'{text}' is a normal query.")

# #explanation
# This demonstrates how to use the trained model to make predictions on new input texts.
# We prepare the new input texts (new_texts) by tokenizing and padding them in the same way as the training data.
# We then use the trained model to predict whether each input text is a SQL injection attempt or a normal query. A prediction above 0.5 is treated as a SQL injection attempt.
# We print the results based on the predictions.
# This code provides a basic understanding of how to create a neural network model for SQL injection detection using TensorFlow and Keras. For a production-ready system, you would need to train the model on a larger and more diverse dataset and consider additional techniques for improving accuracy and robustness.
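
Padding matters most at prediction time, because new sequences have to reach the model at the same length it was trained on. Here is a rough pure-Python sketch of what `pad_sequences` does with its default `padding='pre'` and `truncating='pre'` settings. `pad_pre` is a hypothetical stand-in rather than the Keras implementation, and the integer ids are made up.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep the last maxlen ids, like truncating='pre'
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


# Made-up integer ids standing in for tokenizer output.
train_like = [[4, 2, 7], [5, 1, 3, 6, 2, 8, 9, 4]]
maxlen = max(len(seq) for seq in train_like)  # 8, fixed by the training data

new_like = [[1, 2, 3, 4], [1, 2, 3, 4, 5]]
x_new_like = pad_pre(new_like, maxlen)
print(x_new_like)
# [[0, 0, 0, 0, 1, 2, 3, 4], [0, 0, 0, 1, 2, 3, 4, 5]]
```

Calling `pad_sequences` without `maxlen` pads only to the longest sequence in that call, so new inputs can come out shorter than the training inputs unless the training length is passed explicitly.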

*If the explanation above is a little fuzzy (mine often are 🙋), here it is again step by step.*

**Step-by-step explanation**

1. Import necessary libraries: We import TensorFlow and Keras for building and training neural networks, as well as Tokenizer and pad_sequences for text preprocessing.

2. Sample data: We define a list of sample SQL queries (texts) and corresponding labels (labels) where 0 represents normal queries, and 1 represents SQL injection attempts.

3. Tokenization and padding: We tokenize the input texts, build a vocabulary, convert texts to sequences of integers, and pad sequences to have consistent lengths.

4. Model definition: We define the neural network model with an embedding layer, flattening layer, dense hidden layer, and a final dense output layer for binary classification.

5. Model compilation: We compile the model, specifying the optimizer (Adam), loss function (binary cross-entropy), and metrics (accuracy) for training.

6. Model training: We train the model using the provided training data (x_train and labels) for a specified number of epochs (10 in this case).

7. Demonstration: We prepare new input texts (new_texts), tokenize and pad them, and use the trained model to predict whether each input represents a SQL injection attempt or a normal query. We print the results based on the predictions.
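
Steps 5 and 7 can be made concrete with a little arithmetic. The sketch below computes binary cross-entropy by hand and applies the 0.5 decision threshold. The probabilities are invented for illustration and are not outputs of the model above, and the helper names `binary_cross_entropy` and `classify` are hypothetical.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def classify(prob, threshold=0.5):
    """Map a sigmoid output to the label printed in the demonstration."""
    return "SQL injection attempt" if prob > threshold else "normal query"


y_true = [0, 1, 0, 1]          # same label convention as the sample data
y_prob = [0.1, 0.8, 0.3, 0.6]  # made-up sigmoid outputs

loss = binary_cross_entropy(y_true, y_prob)
print(round(loss, 3))  # roughly 0.299
print([classify(p) for p in y_prob])
```

The loss shrinks as the predicted probabilities move toward their labels, which is what the Adam optimizer pushes for during training. The threshold only comes into play afterwards, when a probability is turned into a yes/no answer.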