**Please note**: This Jupyter notebook is under active development. Starting June 5th, 2024, I intend to make a series of refinements and expansions to improve its functionality and usability. Your patience during this development phase is greatly appreciated.

In [21]:
# Importing the required libraries
import numpy as np
import pandas as pd
from math import log2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from scipy.fft import fft, ifft
from scipy.special import erfc

## Preprocessing


The preprocessing is based on a previous commit of sid-chava's public repository, [QRNGClassifier Repository](https://github.com/sid-chava/QRNGClassifier).


**QRNG Classifier Preprocessing Functions Improvements**: Starting from June 5th, 2024, I plan to enhance the preprocessing functions of the QRNG Classifier. This could involve:

1. **Refining Feature Extraction**: Improve the methods used to extract features from the raw data. This could involve using more sophisticated techniques or algorithms to better capture the characteristics of the data.


2. **Introducing New Data Transformation Techniques**: Implement new techniques for transforming the data into a format that's more suitable for the classifier. This could involve normalization, scaling, or other transformation methods.

These improvements aim to make the preprocessing functions more effective, which could in turn improve the performance of the QRNG Classifier.

In [22]:
# Path to the training data file
file_path = 'AI_2qubits_training_data.txt'

# Read the data from the file
data = []
with open(file_path, 'r') as file:
    for line in file:
        if line.strip():
            binary_number, label = line.strip().split()
            data.append((binary_number, int(label)))

# Convert the data into a DataFrame
df = pd.DataFrame(data, columns=['binary_number', 'label'])

num_concats = 1

new_df = pd.DataFrame({'Concatenated_Data': [''] * (len(df) // num_concats), 'label': [''] * (len(df) // num_concats)})

# Loop through each group of `num_concats` rows and concatenate their 'binary_number' strings
for i in range(0, len(df), num_concats):
    new_df.iloc[i // num_concats, 0] = ''.join(df['binary_number'][i:i+num_concats])
    new_df.iloc[i // num_concats, 1] = df['label'][i]

# Calculate Shannon entropy for each concatenated binary sequence
def calculate_2bit_shannon_entropy(binary_string):
    # Ensure the string length is a multiple of 2 for exact 2-bit grouping
    if len(binary_string) % 2 != 0:
        raise ValueError("Binary string length must be a multiple of 2.")
    
    # Define possible 2-bit combinations
    #patterns = ['0000', '1000', '1100', '1110', '1111', '0100', '0110', '0111', '0010', '0011', '0001', '1001', '1101', '0110', '0101', '1010']
    patterns = ['00', '10', '11', '01']
    frequency = {pattern: 0 for pattern in patterns}
    
    # Count frequency of each pattern
    for i in range(0, len(binary_string), 2):
        segment = binary_string[i:i+2]
        if segment in patterns:
            frequency[segment] += 1
    
    # Calculate total segments counted
    total_segments = sum(frequency.values())
    
    # Calculate probabilities and entropy
    entropy = 0
    for count in frequency.values():
        if count > 0:
            probability = count / total_segments
            entropy -= probability * log2(probability)
    
    return entropy

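def classic_spectral_test(bit_string):
    # Map bits {0, 1} to {-1, +1} and take the discrete Fourier transform
    bit_array = 2 * np.array([int(bit) for bit in bit_string]) - 1
    dft = fft(bit_array)
    # Keep the magnitudes of the first half of the spectrum
    n_half = len(bit_string) // 2 + 1
    mod_dft = np.abs(dft[:n_half])
    # NOTE: this threshold divides by n, whereas the NIST SP 800-22 spectral test
    # multiplies by n; almost no peak falls below it, hence the near-zero p-values
    threshold = np.sqrt(np.log(1 / 0.05) / len(bit_string))
    peaks_below_threshold = np.sum(mod_dft < threshold)
    expected_peaks = 0.95 * n_half
    # Normalized difference between observed and expected peak counts
    d = (peaks_below_threshold - expected_peaks) / np.sqrt(len(bit_string) * 0.95 * 0.05)
    p_value = erfc(np.abs(d) / np.sqrt(2)) / 2
    return p_value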

# Apply the entropy calculation and the spectral test

new_df['shannon_entropy'] = new_df['Concatenated_Data'].apply(calculate_2bit_shannon_entropy)
new_df['spectral_test'] = new_df['Concatenated_Data'].apply(classic_spectral_test)

# Expand each bit string into one column per character ('0' or '1')
df_features = pd.DataFrame(new_df['Concatenated_Data'].apply(list).tolist())
new_df = pd.concat([new_df.drop(columns='Concatenated_Data'), df_features], axis=1)

#print(df)

#print(df.head(10))


# Split the data into features (X) and labels (y)
X = new_df.drop(columns='label').values
#print(X)
y = new_df['label'].values
y=y.astype('int')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
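
The p-values produced by `classic_spectral_test` depend heavily on how the threshold is scaled. For comparison, here is a minimal sketch of the spectral (DFT) test using the formulas as I understand them from NIST SP 800-22: threshold `sqrt(ln(1/0.05) * n)`, expected count `0.95 * n / 2`, and only the first `n / 2` coefficients. The name `nist_style_spectral_test` is hypothetical, and the sketch uses `numpy.fft` instead of `scipy.fft`; treat it as an illustration rather than a validated implementation.

```python
import math
import numpy as np

def nist_style_spectral_test(bit_string):
    """Sketch of the SP 800-22 DFT test; returns a p-value in [0, 1]."""
    n = len(bit_string)
    # Map bits {0, 1} to {-1, +1}
    x = 2 * np.array([int(b) for b in bit_string]) - 1
    # Magnitudes of the first n/2 DFT coefficients
    magnitudes = np.abs(np.fft.fft(x))[: n // 2]
    # 95% peak-height threshold
    threshold = math.sqrt(math.log(1 / 0.05) * n)
    # Expected vs. observed number of peaks below the threshold
    expected = 0.95 * n / 2
    observed = int(np.sum(magnitudes < threshold))
    d = (observed - expected) / math.sqrt(n * 0.95 * 0.05 / 4)
    return math.erfc(abs(d) / math.sqrt(2))

# Example on a 100-bit pseudo-random string (seeded for reproducibility)
rng = np.random.default_rng(0)
bits = "".join(rng.integers(0, 2, size=100).astype(str))
p_value = nist_style_spectral_test(bits)
print(p_value)
```

Applied to the notebook's data, this would be used the same way as the existing function, for example through `new_df['Concatenated_Data'].apply(...)` before the bit columns are expanded.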

## Understanding the data and modeling

In [23]:
new_df["label"].value_counts()

label
4    8000
1    2000
2    2000
3    2000
Name: count, dtype: int64
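
Because class 4 makes up most of the data, it helps to know what accuracy a trivial classifier would reach. A small sketch using the counts printed above; the `counts` dictionary is typed in by hand for illustration:

```python
# Class counts copied from the value_counts() output above
counts = {4: 8000, 1: 2000, 2: 2000, 3: 2000}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
# Accuracy of always predicting the most frequent class
baseline_accuracy = counts[majority_label] / total

print(f"Majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.4f}")
```

Always predicting class 4 is right about 57% of the time on the full dataset, so model accuracies are best read against that baseline.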

In [24]:
new_df

Unnamed: 0,label,shannon_entropy,spectral_test,0,1,2,3,4,5,6,...,90,91,92,93,94,95,96,97,98,99
0,1,1.935451,2.158752e-105,0,1,0,0,1,1,1,...,1,1,1,1,1,1,0,0,1,0
1,1,1.963615,8.731597e-110,0,1,1,0,0,1,1,...,0,1,1,0,0,0,1,1,0,1
2,1,1.939471,8.731597e-110,1,1,1,0,1,0,0,...,0,1,1,0,0,0,0,0,1,1
3,1,1.872164,8.731597e-110,1,1,0,1,0,0,0,...,1,1,0,1,1,1,1,1,0,1
4,1,1.976281,8.731597e-110,0,0,0,0,0,0,0,...,0,0,1,1,1,1,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13995,4,1.942653,8.731597e-110,1,1,1,1,0,1,0,...,0,0,1,1,0,0,1,1,0,0
13996,4,1.919479,8.731597e-110,0,1,0,1,1,0,0,...,0,1,0,0,0,0,1,1,0,0
13997,4,1.862236,8.731597e-110,1,1,0,0,1,1,1,...,0,0,0,0,1,1,1,0,0,0
13998,4,1.856367,8.731597e-110,1,1,0,1,1,1,0,...,1,1,0,0,1,1,0,1,0,0


In [25]:
import numpy as np

# Assuming y_train is a numpy array
values, counts = np.unique(y_train, return_counts=True)
for value, count in zip(values, counts):
    print(f"Value: {value}, Count: {count}")

Value: 1, Count: 1593
Value: 2, Count: 1602
Value: 3, Count: 1598
Value: 4, Count: 6407


In [26]:
from imblearn.over_sampling import RandomOverSampler

# Define the oversampling strategy; 'minority' resamples only the smallest class
oversample = RandomOverSampler(sampling_strategy='minority')

# fit and apply the transform
X_over, y_over = oversample.fit_resample(X_train, y_train)

# Now, you can use X_over and y_over to train your model
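
To make the effect of random oversampling concrete, here is a NumPy-only sketch that duplicates rows (sampling with replacement) until every class matches the largest one. It only illustrates the idea: `oversample_to_majority` is a hypothetical helper, not the imbalanced-learn implementation, and the toy data is made up.

```python
import numpy as np

def oversample_to_majority(X, y, seed=0):
    """Duplicate randomly chosen rows so every class reaches the majority count."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # start with all original rows
    for label, count in zip(labels, counts):
        if count < target:
            pool = np.flatnonzero(y == label)
            # Draw the missing rows with replacement from this class
            keep.append(rng.choice(pool, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy data: 6 rows of class 4, 2 rows of class 1, 1 row of class 2
X_toy = np.arange(18).reshape(9, 2)
y_toy = np.array([4, 4, 4, 4, 4, 4, 1, 1, 2])
X_bal, y_bal = oversample_to_majority(X_toy, y_toy)
print(dict(zip(*np.unique(y_bal, return_counts=True))))
```

Every class ends up with as many rows as the majority class, but no new information is created: the added rows are exact copies of existing ones.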

## Random Forest

In [27]:
from sklearn.ensemble import RandomForestClassifier

def calculate_min_entropy(sequence):
    sequence = np.asarray(sequence, dtype=float)  # Convert sequence to float
    p = np.mean(sequence)  # Mean of the row; the proportion of ones for a pure bit sequence
    max_prob = max(p, 1 - p)
    if max_prob == 1:  # All bits are the same, so the min-entropy is zero
        return 0.0
    min_entropy = -np.log2(max_prob)
    return min_entropy


vectorized_entropy = np.vectorize(calculate_min_entropy, signature='(n)->()')

# Calculate min-entropy for each sequence in the training and testing datasets
# NOTE: each row still contains the shannon_entropy and spectral_test columns,
# so the mean above is not taken over the bits alone
min_entropy_train = vectorized_entropy(X_over)
min_entropy_test = vectorized_entropy(X_test)

X_over_with_entropy = np.column_stack((X_over, min_entropy_train))
X_test_with_entropy = np.column_stack((X_test, min_entropy_test))
# Create the Random Forest classifier
rf_model = RandomForestClassifier(random_state=42)

# Train the model
rf_model.fit(X_over_with_entropy, y_over)

# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test_with_entropy)

# Calculate the accuracy of the Random Forest model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", accuracy_rf)


Random Forest Accuracy: 0.5807142857142857
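
For a binary sequence, the min-entropy feature is `-log2(max(p, 1 - p))`, where `p` is the fraction of ones. A small sketch that applies it to bit values only; the helper name `min_entropy_bits` and the example strings are illustrative assumptions:

```python
from math import log2

def min_entropy_bits(bit_string):
    """Min-entropy per bit, -log2(max(p, 1 - p)), with p the fraction of ones."""
    p = bit_string.count("1") / len(bit_string)
    max_prob = max(p, 1 - p)
    # max_prob == 1 means a constant sequence, which carries no entropy
    return 0.0 if max_prob == 1 else -log2(max_prob)

print(min_entropy_bits("0101"))  # balanced sequence
print(min_entropy_bits("0001"))  # biased toward zeros
print(min_entropy_bits("1111"))  # constant sequence
```

A balanced sequence reaches the maximum of 1 bit per symbol, and the value drops toward 0 as the sequence becomes more biased.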


In [28]:
# most important features
importances = rf_model.feature_importances_
indices = np.argsort(importances)[::-1]
print("Feature ranking:")
for f in range(X_over_with_entropy.shape[1]):
    print(f"{f + 1}. feature {indices[f]} ({importances[indices[f]]})")

# Select the top 10 features

top_10_features = indices[:10]

# Train the model with the top 10 features
class_weights = {1: 0.5, 2: 0.2, 3: 0.2, 4: 0.1}


rf_model_top_10 = RandomForestClassifier(random_state=42, class_weight=class_weights)

# Train the model

rf_model_top_10.fit(X_over_with_entropy[:, top_10_features], y_over)

# Make predictions on the test set


y_pred_rf_top_10 = rf_model_top_10.predict(X_test_with_entropy[:, top_10_features])

# Calculate the accuracy of the Random Forest model with the top 10 features

accuracy_rf_top_10 = accuracy_score(y_test, y_pred_rf_top_10)

print("Random Forest Accuracy with Top 10 Features:", accuracy_rf_top_10)


# cross validation score

from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation with the top 10 features (note: this runs on the test split, not the training data)

cv_scores = cross_val_score(rf_model_top_10, X_test_with_entropy[:, top_10_features], y_test, cv=5)

print("Cross-Validation Scores:", cv_scores)

# Calculate the mean cross-validation score

mean_cv_score = np.mean(cv_scores)

print("Mean Cross-Validation Score:", mean_cv_score)





Feature ranking:
1. feature 0 (0.08477059987352682)
2. feature 102 (0.06434640975426185)
3. feature 17 (0.009360488671683025)
4. feature 63 (0.009163701720908513)
5. feature 77 (0.009079627343959694)
6. feature 15 (0.00900906297108786)
7. feature 93 (0.008950080531223542)
8. feature 45 (0.008922831238742648)
9. feature 101 (0.008918315560353371)
10. feature 57 (0.00891624892302456)
11. feature 75 (0.008901839749523063)
12. feature 71 (0.008867368045092748)
13. feature 87 (0.008865473227228818)
14. feature 21 (0.008858333071439956)
15. feature 40 (0.008851016009282807)
16. feature 29 (0.008840669183816813)
17. feature 13 (0.008816316806955847)
18. feature 23 (0.008815627007443063)
19. feature 97 (0.008811754262263737)
20. feature 6 (0.008796758562269922)
21. feature 43 (0.008787512165949007)
22. feature 5 (0.008741807351821028)
23. feature 91 (0.008737613662650044)
24. feature 83 (0.008735842074073328)
25. feature 42 (0.008731787876909754)
26. feature 51 (0.008709759322460665)
27. featu

In [29]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Define your class weights
# class_weights = {1: 0.5, 2: 0.2, 3: 0.2, 4: 0.1}

# Initialize a list to store the results
results = []
vectorized_entropy = np.vectorize(calculate_min_entropy, signature='(n)->()')

# Calculate min-entropy for each sequence in the training and testing datasets
min_entropy_train = vectorized_entropy(X_train)
min_entropy_test = vectorized_entropy(X_test)

X_train_with_entropy = np.column_stack((X_train, min_entropy_train))
X_test_with_entropy = np.column_stack((X_test, min_entropy_test))


# Loop over the desired range of feature counts
for num_features in range(5, 51):
    # Select the top features
    top_features = indices[:num_features]

    # Train the model with the top features (no class weights are used here)
    rf_model = RandomForestClassifier(random_state=42)
    rf_model.fit(X_train_with_entropy[:, top_features], y_train)

    # Make predictions on the test set
    y_pred_rf = rf_model.predict(X_test_with_entropy[:, top_features])

    # Calculate the accuracy of the model
    accuracy_rf = accuracy_score(y_test, y_pred_rf)

    # Store the number of features and the accuracy in the results list
    results.append((num_features, accuracy_rf))
    # print(num_features)
# Sort the results by accuracy in descending order
results.sort(key=lambda x: x[1], reverse=True)

# Print the top 10 results
for i in range(10):
    print(f"Number of features: {results[i][0]}, Accuracy: {results[i][1]}")

Number of features: 40, Accuracy: 0.6057142857142858
Number of features: 29, Accuracy: 0.6053571428571428
Number of features: 36, Accuracy: 0.6032142857142857
Number of features: 46, Accuracy: 0.6032142857142857
Number of features: 44, Accuracy: 0.6028571428571429
Number of features: 26, Accuracy: 0.6014285714285714
Number of features: 39, Accuracy: 0.6014285714285714
Number of features: 28, Accuracy: 0.6003571428571428
Number of features: 42, Accuracy: 0.6
Number of features: 27, Accuracy: 0.5996428571428571


In [30]:
# class_weights = {1: 20, 2: 20, 3: 20, 4: 1}

from sklearn.utils import compute_class_weight


# Select the top 16 features (the variable name and printed label still say "top 10")
top_10_features = indices[:16]


rf_model_top_10 = RandomForestClassifier(random_state=42, class_weight=class_weights)

# Train the model

rf_model_top_10.fit(X_train_with_entropy[:, top_10_features], y_train)

# Make predictions on the test set


y_pred_rf_top_10 = rf_model_top_10.predict(X_test_with_entropy[:, top_10_features])

# Calculate the accuracy of the Random Forest model with the selected features

accuracy_rf_top_10 = accuracy_score(y_test, y_pred_rf_top_10)

print("Random Forest Accuracy with Top 10 Features:", accuracy_rf_top_10)


# cross validation score

from sklearn.model_selection import cross_val_score

# Perform 10-fold cross-validation on the Random Forest model with the selected features

cv_scores = cross_val_score(rf_model_top_10, X_train_with_entropy[:, top_10_features], y_train, cv=10)

print("Cross-Validation Scores:", cv_scores)

# Calculate the mean cross-validation score

mean_cv_score = np.mean(cv_scores)

print("Mean Cross-Validation Score:", mean_cv_score)

# confusion matrix

from sklearn.metrics import confusion_matrix

# Calculate the confusion matrix

conf_matrix = confusion_matrix(y_test, y_pred_rf_top_10)


print("Confusion Matrix:")
print(conf_matrix)

Random Forest Accuracy with Top 10 Features: 0.5853571428571429
Cross-Validation Scores: [0.59464286 0.60267857 0.59732143 0.59464286 0.58928571 0.59107143
 0.58571429 0.5875     0.6        0.59464286]
Mean Cross-Validation Score: 0.59375
Confusion Matrix:
[[  43   56   47  261]
 [  33   90   34  241]
 [  26   11  113  252]
 [  54   56   90 1393]]
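
Overall accuracy hides how unevenly the classes are predicted. A short sketch that derives per-class recall from the confusion matrix printed above, assuming rows are the true labels 1 to 4 in order (the scikit-learn convention); the matrix is copied by hand:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
conf = np.array([
    [43, 56, 47, 261],
    [33, 90, 34, 241],
    [26, 11, 113, 252],
    [54, 56, 90, 1393],
])

# Recall per class: correct predictions divided by the number of true samples
recall = np.diag(conf) / conf.sum(axis=1)
# Overall accuracy: trace divided by the total count
accuracy = np.trace(conf) / conf.sum()

for label, r in zip([1, 2, 3, 4], recall):
    print(f"class {label}: recall = {r:.3f}")
print(f"accuracy = {accuracy:.4f}")
```

With these numbers, recall for class 4 is about 0.87 while classes 1 to 3 stay below 0.30, so most of the accuracy comes from the majority class.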


In [31]:
from imblearn.over_sampling import RandomOverSampler

# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')

# fit and apply the transform
X_over, y_over = oversample.fit_resample(X_train, y_train)

# Now, you can use X_over and y_over to train your model

In [32]:
import numpy as np

# Class counts after oversampling (y_over is a numpy array)
values, counts = np.unique(y_over, return_counts=True)
for value, count in zip(values, counts):
    print(f"Value: {value}, Count: {count}")

Value: 1, Count: 6407
Value: 2, Count: 1602
Value: 3, Count: 1598
Value: 4, Count: 6407


In [33]:
from sklearn.ensemble import RandomForestClassifier

def calculate_min_entropy(sequence):
    sequence = np.asarray(sequence, dtype=float)  # Convert sequence to float
    p = np.mean(sequence)  # Mean of the row; the proportion of ones for a pure bit sequence
    max_prob = max(p, 1 - p)
    if max_prob == 1:  # All bits are the same, so the min-entropy is zero
        return 0.0
    min_entropy = -np.log2(max_prob)
    return min_entropy

# NOTE: splitting after oversampling lets duplicated rows land in both the
# train and test sets, which can inflate the test accuracy
X_train, X_test, y_train, y_test = train_test_split(X_over, y_over, test_size=0.2, random_state=42)


vectorized_entropy = np.vectorize(calculate_min_entropy, signature='(n)->()')

# Calculate min-entropy for each sequence in the training and testing datasets
min_entropy_train = vectorized_entropy(X_train)
min_entropy_test = vectorized_entropy(X_test)

X_train_with_entropy = np.column_stack((X_train, min_entropy_train))
X_test_with_entropy = np.column_stack((X_test, min_entropy_test))

from sklearn.utils import compute_class_weight


# Select the top 13 features (the variable name and printed label still say "top 10")
top_10_features = indices[:13]


rf_model_top_10 = RandomForestClassifier(random_state=42, class_weight=class_weights)

# Train the model

rf_model_top_10.fit(X_train_with_entropy[:, top_10_features], y_train)

# Make predictions on the test set


y_pred_rf_top_10 = rf_model_top_10.predict(X_test_with_entropy[:, top_10_features])

# Calculate the accuracy of the Random Forest model with the selected features

accuracy_rf_top_10 = accuracy_score(y_test, y_pred_rf_top_10)

print("Random Forest Accuracy with Top 10 Features:", accuracy_rf_top_10)


conf_matrix = confusion_matrix(y_test, y_pred_rf_top_10)


print("Confusion Matrix:")
print(conf_matrix)

Random Forest Accuracy with Top 10 Features: 0.765220106150484
Confusion Matrix:
[[1246    2    5   23]
 [  92   43   18  172]
 [  60    8   66  181]
 [ 109   26   56 1096]]


In [34]:
from sklearn.model_selection import cross_val_score

# Perform cross-validation
cv_scores = cross_val_score(rf_model_top_10, X_train_with_entropy[:, top_10_features], y_train, cv=10)
print("Cross-validation scores:", cv_scores)
print("Mean cross-validation score:", np.mean(cv_scores))

# Check accuracy on the training dataset
train_accuracy = rf_model_top_10.score(X_train_with_entropy[:, top_10_features], y_train)
print("Training accuracy:", train_accuracy)

# Check accuracy on the test dataset
test_accuracy = rf_model_top_10.score(X_test_with_entropy[:, top_10_features], y_test)
print("Test accuracy:", test_accuracy)

Cross-validation scores: [0.73478939 0.754879   0.74395004 0.75019516 0.74004684 0.7392662
 0.73224044 0.7392662  0.75565964 0.74863388]
Mean cross-validation score: 0.7438926784237646
Training accuracy: 0.99867301537741
Test accuracy: 0.765220106150484
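
The gap between the training accuracy (about 0.999) and the cross-validation accuracy (about 0.74) is worth checking against how the data was split. When rows are duplicated by oversampling before the split, copies of the same row can end up in both sets. The sketch below counts such overlaps on made-up data; `random_split` and the row-id bookkeeping are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_split(n_rows, test_fraction, rng):
    """Return shuffled (train_idx, test_idx) positions for n_rows rows."""
    perm = rng.permutation(n_rows)
    n_test = int(n_rows * test_fraction)
    return perm[n_test:], perm[:n_test]

# Each original row gets a unique id; 80 majority rows and 20 minority rows
row_ids = np.arange(100)
labels = np.array([4] * 80 + [1] * 20)
minority = row_ids[labels == 1]

# Variant A: oversample first (duplicate minority ids), then split
extra = rng.choice(minority, size=60, replace=True)
ids_over = np.concatenate([row_ids, extra])
train_pos, test_pos = random_split(len(ids_over), 0.2, rng)
leaked = np.intersect1d(ids_over[train_pos], ids_over[test_pos])

# Variant B: split first, then oversample only the training part
train_pos, test_pos = random_split(len(row_ids), 0.2, rng)
train_ids, test_ids = row_ids[train_pos], row_ids[test_pos]
train_minority = train_ids[labels[train_ids] == 1]
extra = rng.choice(train_minority, size=60, replace=True)
train_ids_over = np.concatenate([train_ids, extra])
clean = np.intersect1d(train_ids_over, test_ids)

print("ids shared by train and test, oversample-then-split:", len(leaked))
print("ids shared by train and test, split-then-oversample:", len(clean))
```

Splitting first keeps the test rows unseen by construction, so the second count is always zero; the first variant typically shares several minority rows between the two sets.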


In [35]:
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Gradient Boosting
gb_model = GradientBoostingClassifier(random_state=12)
gb_model.fit(X_train_with_entropy[:, top_10_features], y_train)
y_pred_gb = gb_model.predict(X_test_with_entropy[:, top_10_features])
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print("Gradient Boosting Accuracy:", accuracy_gb)

# AdaBoost
ab_model = AdaBoostClassifier(random_state=12)
ab_model.fit(X_train_with_entropy[:, top_10_features], y_train)
y_pred_ab = ab_model.predict(X_test_with_entropy[:, top_10_features])
accuracy_ab = accuracy_score(y_test, y_pred_ab)
print("AdaBoost Accuracy:", accuracy_ab)

Gradient Boosting Accuracy: 0.6063065875741492
AdaBoost Accuracy: 0.5707149547299407


In [36]:
# Random Forest without class weights, for comparison with the boosted models
rf_model_plain = RandomForestClassifier(random_state=12)
rf_model_plain.fit(X_train_with_entropy[:, top_10_features], y_train)
y_pred_rf_plain = rf_model_plain.predict(X_test_with_entropy[:, top_10_features])
accuracy_rf_plain = accuracy_score(y_test, y_pred_rf_plain)
print("RNF Accuracy:", accuracy_rf_plain)

RNF Accuracy: 0.7664689353730877


In [37]:
from imblearn.over_sampling import RandomOverSampler

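# Define the oversampling strategy; 'auto' resamples every class except the majority
# NOTE: X_train and y_train here come from the earlier split of already oversampled data
oversample = RandomOverSampler(sampling_strategy='auto')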

# fit and apply the transform
X_over, y_over = oversample.fit_resample(X_train, y_train)

# Now, you can use X_over and y_over to train your model

In [38]:
import numpy as np

# Class counts after oversampling (y_over is a numpy array)
values, counts = np.unique(y_over, return_counts=True)
for value, count in zip(values, counts):
    print(f"Value: {value}, Count: {count}")

Value: 1, Count: 5131
Value: 2, Count: 5131
Value: 3, Count: 5131
Value: 4, Count: 5131


In [39]:
from sklearn.ensemble import RandomForestClassifier

def calculate_min_entropy(sequence):
    sequence = np.asarray(sequence, dtype=float)  # Convert sequence to float
    p = np.mean(sequence)  # Mean of the row; the proportion of ones for a pure bit sequence
    max_prob = max(p, 1 - p)
    if max_prob == 1:  # All bits are the same, so the min-entropy is zero
        return 0.0
    min_entropy = -np.log2(max_prob)
    return min_entropy

# NOTE: splitting after oversampling lets duplicated rows land in both the
# train and test sets, which can inflate the test accuracy
X_train, X_test, y_train, y_test = train_test_split(X_over, y_over, test_size=0.2, random_state=42)


vectorized_entropy = np.vectorize(calculate_min_entropy, signature='(n)->()')

# Calculate min-entropy for each sequence in the training and testing datasets
min_entropy_train = vectorized_entropy(X_train)
min_entropy_test = vectorized_entropy(X_test)

X_train_with_entropy = np.column_stack((X_train, min_entropy_train))
X_test_with_entropy = np.column_stack((X_test, min_entropy_test))

from sklearn.utils import compute_class_weight


# Select the top 13 features (the variable name and printed label still say "top 10")
top_10_features = indices[:13]


rf_model_top_10 = RandomForestClassifier(random_state=42, class_weight=class_weights)

# Train the model

rf_model_top_10.fit(X_train_with_entropy[:, top_10_features], y_train)

# Make predictions on the test set


y_pred_rf_top_10 = rf_model_top_10.predict(X_test_with_entropy[:, top_10_features])

# Calculate the accuracy of the Random Forest model with the selected features

accuracy_rf_top_10 = accuracy_score(y_test, y_pred_rf_top_10)

print("Random Forest Accuracy with Top 10 Features:", accuracy_rf_top_10)


conf_matrix = confusion_matrix(y_test, y_pred_rf_top_10)


print("Confusion Matrix:")
print(conf_matrix)

Random Forest Accuracy with Top 10 Features: 0.9003654080389769
Confusion Matrix:
[[ 972   15   19   16]
 [  18 1005    5    7]
 [   6    8  977   17]
 [ 104   83  111  742]]


In [40]:
from sklearn.model_selection import cross_val_score

# Perform cross-validation
cv_scores = cross_val_score(rf_model_top_10, X_train_with_entropy[:, top_10_features], y_train, cv=10)
print("Cross-validation scores:", cv_scores)
print("Mean cross-validation score:", np.mean(cv_scores))

# Check accuracy on the training dataset
train_accuracy = rf_model_top_10.score(X_train_with_entropy[:, top_10_features], y_train)
print("Training accuracy:", train_accuracy)

# Check accuracy on the test dataset
test_accuracy = rf_model_top_10.score(X_test_with_entropy[:, top_10_features], y_test)
print("Test accuracy:", test_accuracy)

Cross-validation scores: [0.88124239 0.88611449 0.9043849  0.89159562 0.88976857 0.89159562
 0.8818514  0.89220463 0.8909866  0.89335771]
Mean cross-validation score: 0.8903101923086915
Training accuracy: 0.9987209939704002
Test accuracy: 0.9003654080389769


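Test accuracy: roughly 89-90%. Because the train/test split was made after oversampling, duplicated rows can appear in both sets, so this figure is likely optimistic.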