**Please note**: This Jupyter notebook is under active development. Since June 5th, 2024, I have been refining and expanding it, adding preprocessing functions and other changes that improve its functionality and usability. Your patience and understanding during this development phase are greatly appreciated.

In [1]:
# Importing the required libraries
import numpy as np
import pandas as pd
from math import log2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from scipy.fft import fft, ifft
from scipy.special import erfc

## Preprocessing

The pre-processing part of this project was initially based on a previous commit from the public repository of Sid Chava, available at [QRNGClassifier Repository](https://github.com/sid-chava/QRNGClassifier).

**QRNG Classifier Preprocessing Functions Enhancements**: As of June 8th, 2024, this Jupyter notebook is under active development. I am working on enhancing the preprocessing functions of the QRNG Classifier, which includes:

1. **Refining Feature Extraction**: I am improving the methods used to extract features from the raw data. This involves using more sophisticated techniques or algorithms to better capture the characteristics of the data.

2. **Introducing New Data Transformation Techniques**: I am implementing new techniques for transforming the data into a format that's more suitable for the classifier. This includes normalization, scaling, or other transformation methods.

These improvements are aimed at enhancing the effectiveness of the preprocessing functions, which could potentially lead to better performance of the QRNG Classifier. Your patience and understanding during this development phase are greatly appreciated.

The good results we obtained previously were due to data leakage. After removing it, we see that with oversampling the model overfits and does not perform as well as we had expected.
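
One common way oversampling leaks is when rows are duplicated before the train/test split, so copies of the same row land on both sides. The sketch below is only an illustration of the safer order (split first, then oversample the training rows); the function name `split_then_oversample` and the toy data are hypothetical and are not part of this notebook's pipeline.

```python
import numpy as np

def split_then_oversample(y, test_fraction=0.2, seed=42):
    """Hold out a test set first, then randomly oversample only the training rows."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_test = int(len(y) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]

    # Oversample every class in the training set up to the largest class size
    y_train = y[train_idx]
    classes, counts = np.unique(y_train, return_counts=True)
    target = counts.max()
    resampled = [
        rng.choice(train_idx[y_train == cls], size=target, replace=True)
        for cls in classes
    ]
    return np.concatenate(resampled), test_idx

# Toy labels: 80 rows of class 0 and 20 rows of class 1
y = np.array([0] * 80 + [1] * 20)
train_idx, test_idx = split_then_oversample(y)
```

The returned index arrays can then be used to slice the feature matrix; the test rows are never duplicated and never appear in the training indices.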

In [2]:
# Concatenate data
import itertools
from math import sqrt


def concatenateData(df, num_concats):
    # Number of complete groups; leftover rows that do not fill a group are ignored
    num_groups = len(df) // num_concats
    new_df = pd.DataFrame({
        'Concatenated_Data': [''] * num_groups,
        'label': [0] * num_groups
    })

    # Concatenate the 'binary_number' strings of each group of num_concats rows;
    # the group takes the label of its first row
    for g in range(num_groups):
        start = g * num_concats
        new_df.iloc[g, 0] = ''.join(df['binary_number'].iloc[start:start + num_concats])
        new_df.iloc[g, 1] = df['label'].iloc[start]

    return new_df

# Calculate Shannon entropy for each concatenated binary sequence
def shannon_entropy(binary_string):
    if len(binary_string) % 2 != 0:
        raise ValueError("Binary string length must be a multiple of 2.")
    
    patterns = ['00', '10', '11', '01']
    frequency = {pattern: 0 for pattern in patterns}
    
    for i in range(0, len(binary_string), 2):
        segment = binary_string[i:i+2]
        if segment in patterns:
            frequency[segment] += 1
    
    total_segments = sum(frequency.values())
    
    entropy = 0
    for count in frequency.values():
        if count > 0:
            probability = count / total_segments
            entropy -= probability * log2(probability)
    
    return entropy


def classic_spectral_test(bit_string):
    bit_array = 2 * np.array([int(bit) for bit in bit_string]) - 1
    dft = fft(bit_array)
    n_half = len(bit_string) // 2 + 1
    mod_dft = np.abs(dft[:n_half])
    threshold = np.sqrt(np.log(1 / 0.05) / len(bit_string))
    peaks_below_threshold = np.sum(mod_dft < threshold)
    expected_peaks = 0.95 * n_half
    d = (peaks_below_threshold - expected_peaks) / np.sqrt(len(bit_string) * 0.95 * 0.05)
    p_value = erfc(np.abs(d) / np.sqrt(2)) / 2
    return d

def frequency_test(bit_string):
    n = len(bit_string)
    count_ones = bit_string.count('1')
    count_zeros = bit_string.count('0')
    
    # The test statistic
    s = (count_ones - count_zeros) / sqrt(n)
    
    # The p-value
    p_value = erfc(abs(s) / sqrt(2))
    
    return p_value

def runs_test(bit_string):
    n = len(bit_string)
    runs = 1  # Start with the first run
    for i in range(1, n):
        if bit_string[i] != bit_string[i - 1]:
            runs += 1
    
    n0 = bit_string.count('0')
    n1 = bit_string.count('1')
    
    # Expected number of runs
    expected_runs = (2 * n0 * n1 / n) + 1
    variance_runs = (2 * n0 * n1 * (2 * n0 * n1 - n)) / (n ** 2 * (n - 1))
    
    # The test statistic
    z = (runs - expected_runs) / sqrt(variance_runs)
    
    # The p-value
    p_value = erfc(abs(z) / sqrt(2))
    
    return p_value

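def linear_complexity(bit_string, M=500):
    # Block-wise proxy with block size M: for each full block, take the 1-based
    # position of its last 1 bit (this is not a Berlekamp-Massey computation).
    # Strings shorter than M contain no full block, so the result is 0.
    n = len(bit_string)
    bit_array = np.array([int(bit) for bit in bit_string])
    lc = 0  # Running sum of the per-block values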
    
    # Process blocks of size M
    for i in range(0, n, M):
        block = bit_array[i:i+M]
        if len(block) < M:
            continue
        
        lc_block = 0
        for j in range(M):
            if block[j] == 1:
                lc_block = j + 1
        
        lc += lc_block
    
    lc = lc / (n / M)
    return lc

def autocorrelation_test(bit_string, lag=1):
    n = len(bit_string)
    bit_array = np.array([int(bit) for bit in bit_string])
    autocorrelation = np.correlate(bit_array, np.roll(bit_array, lag), mode='valid')[0]
    return autocorrelation / n

def maurer_universal_test(bit_string):
    k = 6
    l = 5
    q = 20
    bit_array = np.array([int(bit) for bit in bit_string])
    max_val = 2**k
    init_subseq = bit_array[:q]
    rest_subseq = bit_array[q:]
    d = {}
    for i in range(len(init_subseq) - k + 1):
        d[tuple(init_subseq[i:i+k])] = i
    t = []
    for i in range(len(rest_subseq) - k + 1):
        subseq = tuple(rest_subseq[i:i+k])
        if subseq in d:
            t.append(i - d[subseq])
            d[subseq] = i
    if not t:
        return 0
    t = np.array(t)
    log_avg = np.mean(np.log2(t))
    return log_avg - np.log2(q)

def binary_matrix_rank_test(bit_string, M=32, Q=32):
    bit_array = np.array([int(bit) for bit in bit_string])
    num_matrices = len(bit_array) // (M * Q)
    ranks = []
    for i in range(num_matrices):
        matrix = bit_array[i*M*Q:(i+1)*M*Q].reshape((M, Q))
        rank = np.linalg.matrix_rank(matrix)
        ranks.append(rank)
    return np.mean(ranks)

def cumulative_sums_test(bit_string):
    bit_array = np.array([int(bit) for bit in bit_string])
    adjusted = 2 * bit_array - 1
    cumulative_sum = np.cumsum(adjusted)
    max_excursion = np.max(np.abs(cumulative_sum))
    return max_excursion

def longest_run_ones_test(bit_string, block_size=100):
    bit_array = np.array([int(bit) for bit in bit_string])
    num_blocks = len(bit_array) // block_size
    max_runs = []
    for i in range(num_blocks):
        block = bit_array[i*block_size:(i+1)*block_size]
        # default=0 covers blocks that contain no 1 bits at all
        max_run = max((len(list(g)) for k, g in itertools.groupby(block) if k == 1), default=0)
        max_runs.append(max_run)
    return np.mean(max_runs)

def random_excursions_test(bit_string):
    bit_array = np.array([int(bit) for bit in bit_string])
    bit_array = 2 * bit_array - 1  # Convert to ±1

    cumulative_sum = np.cumsum(bit_array)
    states = np.unique(cumulative_sum)

    if 0 not in states:
        states = np.append(states, 0)
    state_counts = {state: 0 for state in states}
    for state in cumulative_sum:
        state_counts[state] += 1

    state_counts[0] -= 1  # Adjust for zero state
    pi = [0.5 * (1 - (1 / (2 * state + 1)**2)) for state in states]
    x = np.sum([(state_counts[state] - len(bit_string) * pi[i])**2 / (len(bit_string) * pi[i]) for i, state in enumerate(states)])

    return x


def unique_subsequences(bit_string, length=4):
    bit_array = np.array([int(bit) for bit in bit_string])
    n = len(bit_array)
    subsequences = set()
    
    for i in range(n - length + 1):
        subseq = tuple(bit_array[i:i+length])
        subsequences.add(subseq)
    
    return len(subsequences)

def sample_entropy(bit_string, m=2, r=0.2):
    bit_array = np.array([int(bit) for bit in bit_string])
    N = len(bit_array)
    
    def _phi(m):
        x = np.array([bit_array[i:i+m] for i in range(N - m + 1)])
        C = np.sum(np.all(np.abs(x[:, None] - x) <= r, axis=2), axis=0) / (N - m + 1.0)
        return np.sum(C) / (N - m + 1.0)
    
    return -np.log(_phi(m + 1) / _phi(m))

def permutation_entropy(bit_string, order=3):
    bit_array = np.array([int(bit) for bit in bit_string])
    n = len(bit_array)
    
    permutations = np.array(list(itertools.permutations(range(order))))
    c = np.zeros(len(permutations))
    
    for i in range(n - order + 1):
        sorted_index_array = tuple(np.argsort(bit_array[i:i+order]))
        for j, p in enumerate(permutations):
            if np.array_equal(p, sorted_index_array):
                c[j] += 1
    
    c = c / (n - order + 1)
    pe = -np.sum(c * np.log2(c + np.finfo(float).eps))
    return pe

def lyapunov_exponent(bit_string, m=2, t=1):
    bit_array = np.array([int(bit) for bit in bit_string])
    N = len(bit_array)
    
    def _phi(m):
        x = np.array([bit_array[i:i+m] for i in range(N - m + 1)])
        C = np.sum(np.all(np.abs(x[:, None] - x) <= t, axis=2), axis=0) / (N - m + 1.0)
        return np.sum(np.log(C + np.finfo(float).eps)) / (N - m + 1.0)
    
    return abs(_phi(m) - _phi(m + 1))

def entropy_rate(bit_string, k=2):
    bit_array = np.array([int(bit) for bit in bit_string])
    n = len(bit_array)
    prob = {}
    
    for i in range(n - k + 1):
        subseq = tuple(bit_array[i:i + k])
        if subseq in prob:
            prob[subseq] += 1
        else:
            prob[subseq] = 1
    
    for key in prob:
        prob[key] /= (n - k + 1)
    
    entropy_rate = -sum(p * log2(p) for p in prob.values())
    return entropy_rate

# Apply randomness tests
def apply_randomness_tests(df, tests):
    if not tests:
        raise ValueError("No randomness tests specified.")

    test_functions = {
        'shannon_entropy': shannon_entropy,
        'classic_spectral_test': classic_spectral_test,
        'frequency_test': frequency_test,
        'runs_test': runs_test,
        'linear_complexity': linear_complexity,
        'autocorrelation_test': autocorrelation_test,
        'maurer_universal_test': maurer_universal_test,
        'binary_matrix_rank_test': binary_matrix_rank_test,
        'cumulative_sums_test': cumulative_sums_test,
        'longest_run_ones_test': longest_run_ones_test,
        'random_excursions_test': random_excursions_test,
        'unique_subsequences': unique_subsequences,
        'sample_entropy': sample_entropy,
        'permutation_entropy': permutation_entropy,
        'lyapunov_exponent': lyapunov_exponent,
        'entropy_rate': entropy_rate,
        'min_entropy': calculate_min_entropy
    }

    for test in tests:
        if test not in test_functions:
            raise ValueError(f"Invalid randomness test: {test}")
        df[test] = df['Concatenated_Data'].apply(test_functions[test])

    return df


# Preprocess data
def preprocess_data(df, num_concats, tests):
    df = concatenateData(df, num_concats)
    processed_df = apply_randomness_tests(df, tests)
    
    # Convert concatenated binary strings into separate columns
    df_features = pd.DataFrame(processed_df['Concatenated_Data'].apply(list).tolist(), dtype=float)
    processed_df = pd.concat([processed_df.drop(columns='Concatenated_Data'), df_features], axis=1)

    return processed_df

# Calculate min-entropy
def calculate_min_entropy(sequence):
    # Convert the bit string to an array of 0/1 values; passing the string itself
    # to np.asarray(..., dtype=float) would parse it as a single decimal number
    bits = np.array([int(bit) for bit in sequence], dtype=float)
    p = np.mean(bits)  # Proportion of ones
    max_prob = max(p, 1 - p)
    # max_prob is at least 0.5; a constant sequence gives max_prob == 1 and min-entropy 0
    min_entropy = -np.log2(max_prob)
    return min_entropy

# Main
file_path = 'AI_2qubits_training_data.txt'

# Read the data from the file
data = []
with open(file_path, 'r') as file:
    for line in file:
        if line.strip():
            binary_number, label = line.strip().split()
            data.append((binary_number, int(label)))

# Convert the data into a DataFrame
df = pd.DataFrame(data, columns=['binary_number', 'label'])

tests_to_apply = [
    'shannon_entropy', 'classic_spectral_test', 'frequency_test', 'runs_test',
    'linear_complexity', 'autocorrelation_test', 'maurer_universal_test', 
    'binary_matrix_rank_test', 'cumulative_sums_test', 'longest_run_ones_test', 
    'random_excursions_test', 'unique_subsequences', 'sample_entropy', 
    'permutation_entropy', 'lyapunov_exponent', 'entropy_rate', 'min_entropy'
]

# Preprocess data and apply randomness tests
preprocessed_df = preprocess_data(df, num_concats=1, tests=tests_to_apply)

print(preprocessed_df)

  log_avg = np.mean(np.log2(t))
  log_avg = np.mean(np.log2(t))
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  x = np.sum([(state_counts[state] - len(bit_string) * pi[i])**2 / (len(bit_string) * pi[i]) for i, state in enumerate(states)])
  x = np.sum([(state_counts[state] - len(bit_string) * pi[i])**2 / (len(bit_string) * pi[i]) for i, state in enumerate(states)])


       label  shannon_entropy  classic_spectral_test  frequency_test  \
0          1         1.935451             -21.771553        0.423711   
1          1         1.963615             -22.230385        0.841481   
2          1         1.939471             -22.230385        0.109599   
3          1         1.872164             -22.230385        0.071861   
4          1         1.976281             -22.230385        0.230139   
...      ...              ...                    ...             ...   
13995      4         1.942653             -22.230385        0.689157   
13996      4         1.919479             -22.230385        0.689157   
13997      4         1.862236             -22.230385        0.317311   
13998      4         1.856367             -22.230385        0.841481   
13999      4         1.717977             -22.230385        0.071861   

       runs_test  linear_complexity  autocorrelation_test  \
0       0.120217                0.0                  0.33   
1       0.027
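
In the output above, `classic_spectral_test` is -22.230385 for almost every row. With n = 100 the threshold `sqrt(log(1 / 0.05) / n)` is about 0.17, so almost no DFT peak falls below it, and the statistic collapses to roughly (0 - 0.95 * 51) / sqrt(100 * 0.95 * 0.05) ≈ -22.23. For comparison, here is a sketch of the test as I understand it from NIST SP 800-22, where the threshold is `sqrt(log(1 / 0.05) * n)`. The function name `nist_spectral_test` is my own, and this is not the function that produced the numbers above.

```python
import numpy as np
from math import erfc, log, sqrt

def nist_spectral_test(bit_string):
    """Discrete Fourier transform (spectral) test, following NIST SP 800-22."""
    n = len(bit_string)
    x = 2 * np.array([int(bit) for bit in bit_string]) - 1   # map 0/1 to -1/+1
    magnitudes = np.abs(np.fft.fft(x))[: n // 2]             # first half of the spectrum
    threshold = sqrt(log(1 / 0.05) * n)                      # 95% peak-height threshold
    n0 = 0.95 * n / 2                                        # expected peaks below threshold
    n1 = np.sum(magnitudes < threshold)                      # observed peaks below threshold
    d = (n1 - n0) / sqrt(n * 0.95 * 0.05 / 4)
    p_value = erfc(abs(d) / sqrt(2))
    return d, p_value

# A strictly alternating sequence has all of its energy at frequency n/2
d, p_value = nist_spectral_test("01" * 50)
```

For a feature column, either `d` or `p_value` could be used; which one is more informative for the classifier is something I have not checked.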

In [3]:
preprocessed_df

Unnamed: 0,label,shannon_entropy,classic_spectral_test,frequency_test,runs_test,linear_complexity,autocorrelation_test,maurer_universal_test,binary_matrix_rank_test,cumulative_sums_test,...,90,91,92,93,94,95,96,97,98,99
0,1,1.935451,-21.771553,0.423711,0.120217,0.0,0.33,0.028726,,10,...,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0
1,1,1.963615,-22.230385,0.841481,0.027240,0.0,0.31,,,8,...,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
2,1,1.939471,-22.230385,0.109599,0.498506,0.0,0.32,-inf,,21,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
3,1,1.872164,-22.230385,0.071861,0.620874,0.0,0.36,,,18,...,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
4,1,1.976281,-22.230385,0.230139,0.725698,0.0,0.18,,,18,...,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13995,4,1.942653,-22.230385,0.689157,0.556584,0.0,0.24,,,10,...,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
13996,4,1.919479,-22.230385,0.689157,0.987149,0.0,0.23,-0.014534,,6,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
13997,4,1.862236,-22.230385,0.317311,0.011137,0.0,0.36,-0.127187,,15,...,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0
13998,4,1.856367,-22.230385,0.841481,0.548989,0.0,0.25,,,8,...,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
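
The `-inf` and empty (NaN) entries in the `maurer_universal_test` column, together with the `np.log2` warning lines in the previous output, are consistent with zero or negative gaps reaching `np.log2`. In `maurer_universal_test` the positions stored for the initial segment and the loop index over the remaining segment both start at 0, so `i - d[subseq]` can be zero or negative on the first match. The gap values below are made up to show the effect.

```python
import numpy as np

# Hypothetical gap lists: one containing a zero, one containing a negative value
gaps_with_zero = np.array([3, 0, 5])
gaps_with_negative = np.array([3, -2, 5])

with np.errstate(divide="ignore", invalid="ignore"):
    mean_with_zero = np.mean(np.log2(gaps_with_zero))          # log2(0) is -inf
    mean_with_negative = np.mean(np.log2(gaps_with_negative))  # log2(-2) is nan
```

A single zero gap turns the mean into `-inf` and a single negative gap turns it into `nan`, which matches the two kinds of invalid values seen in the column.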


In [4]:
import numpy as np
from scipy.fft import fft  # same module as in the first cell (scipy.fftpack is the legacy interface)
from collections import Counter
from itertools import groupby

# Count the number of 0s and 1s
def count_bits(sequence):
    return Counter(sequence)

# Count the number of transitions from 0 to 1 and from 1 to 0
def count_transitions(sequence):
    return sum(sequence[i-1] != sequence[i] for i in range(1, len(sequence)))

# Calculate the lengths of runs of consecutive 0s or 1s
def run_lengths(sequence):
    return [len(list(group)) for key, group in groupby(sequence)]

# Measure the entropy of the bit sequence
def entropy(sequence):
    value,counts = np.unique(list(sequence), return_counts=True)
    return -np.sum((counts/len(sequence)) * np.log2(counts/len(sequence)))

# Perform a Fourier transform and use the power spectrum as features
def spectral_analysis(sequence):
    transform = fft([int(bit) for bit in sequence])
    power_spectrum = np.abs(transform)**2
    return power_spectrum

# Compute the autocorrelation of the bit sequence
def autocorrelation(sequence):
    sequence = np.array([int(bit) for bit in sequence])
    result = np.correlate(sequence, sequence, mode='full')
    return result[result.size // 2:]

# Count the occurrences of each possible n-gram
def ngrams(sequence, n=2):
    return Counter(sequence[i:i+n] for i in range(len(sequence) - n + 1))

# Look for cyclic patterns in the bit sequence
def cyclic_patterns(sequence):
    # This is a complex task that may require specific domain knowledge
    # Placeholder function
    pass

# Find the longest run of 0s and 1s
def longest_run(sequence):
    return max(len(list(group)) for key, group in groupby(sequence))

# Calculate the rate at which bits flip from 0 to 1 or vice versa
def bit_flipping_rate(sequence):
    return count_transitions(sequence) / len(sequence)

In [5]:
from scipy.special import gammaincc

def serial_test(bit_string, m=2):
    n = len(bit_string)
    bit_array = np.array([int(bit) for bit in bit_string])
    counts = np.zeros(2**m)
    for i in range(n):
        counts[int(bit_string[i:i+m], 2)] += 1
    counts /= n
    psim = sum(counts**2) * 2**m - 1
    del1 = psim - (2**(m-1))
    del2 = psim - (2**(m-2)) if m > 1 else 0
    p_value1 = gammaincc(2**(m-2), del1 / 2)
    p_value2 = gammaincc(2**(m-3), del2 / 2) if m > 1 else 0
    return p_value1, p_value2

def poker_test(bit_string, m=4):
    n = len(bit_string)
    k = n // m
    counts = np.zeros(2**m)
    for i in range(k):
        counts[int(bit_string[i*m:(i+1)*m], 2)] += 1
    counts /= k
    x = 2**m / k * sum(counts**2) - k
    p_value = gammaincc(2**(m-2), x / 2)
    return p_value

def runs_above_below_test(bit_string):
    n = len(bit_string)
    pi = bit_string.count('1') / n
    tau = 2 / sqrt(n)
    if abs(pi - 0.5) >= tau:
        return 0.0
    else:
        bit_array = np.array([int(bit) for bit in bit_string])
        mean = np.mean(bit_array)
        diff = bit_array - mean
        runs = 1
        for i in range(1, n):
            if diff[i] * diff[i-1] < 0:
                runs += 1
        p_value = erfc(abs(runs - 2*n*pi*(1-pi)) / (2*sqrt(2*n)*pi*(1-pi)))
        return p_value
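
For reference, a sketch of the serial test as I understand it from NIST SP 800-22: the statistic uses raw counts of overlapping m-bit patterns on the sequence extended with its first m - 1 bits, and the p-values are taken from the first and second differences of psi-squared. The names `psi_squared` and `nist_serial_test` are mine, and this is offered as a comparison point rather than a drop-in replacement for `serial_test` above.

```python
import numpy as np
from scipy.special import gammaincc

def psi_squared(bit_string, m):
    """psi^2_m statistic over overlapping m-bit patterns with wrap-around."""
    if m <= 0:
        return 0.0
    n = len(bit_string)
    extended = bit_string + bit_string[: m - 1]   # every start position gets m bits
    counts = np.zeros(2 ** m)
    for i in range(n):
        counts[int(extended[i:i + m], 2)] += 1
    return (2 ** m / n) * np.sum(counts ** 2) - n

def nist_serial_test(bit_string, m=2):
    psi_m = psi_squared(bit_string, m)
    psi_m1 = psi_squared(bit_string, m - 1)
    psi_m2 = psi_squared(bit_string, m - 2)
    del1 = psi_m - psi_m1                  # first difference
    del2 = psi_m - 2 * psi_m1 + psi_m2     # second difference
    p_value1 = gammaincc(2 ** (m - 2), del1 / 2)
    p_value2 = gammaincc(2 ** (m - 3), del2 / 2)
    return p_value1, p_value2

# Worked example: 10-bit sequence with m = 3
p1, p2 = nist_serial_test("0011011101", m=3)
```

For this sequence the intermediate values are psi-squared 2.8, 1.2, and 0.4 for m = 3, 2, 1, giving differences 1.6 and 0.8.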

In [6]:
preprocessed_df=preprocessed_df.drop("binary_matrix_rank_test",axis=1)
preprocessed_df=preprocessed_df.drop("linear_complexity",axis=1)

# Apply the feature extraction methods to the binary_number column
df[['serial_test_p1', 'serial_test_p2']] = df['binary_number'].apply(lambda x: pd.Series(serial_test(x)))
df['poker_test'] = df['binary_number'].apply(lambda x: poker_test(x))
df['runs_above_below_test'] = df['binary_number'].apply(lambda x: runs_above_below_test(x))
df['count_0'] = df['binary_number'].apply(lambda x: count_bits(x)['0'])
df['count_1'] = df['binary_number'].apply(lambda x: count_bits(x)['1'])
df['transitions'] = df['binary_number'].apply(count_transitions)
df['run_lengths_0'] = df['binary_number'].apply(lambda x: run_lengths(x.replace('1', ' ')))
df['run_lengths_1'] = df['binary_number'].apply(lambda x: run_lengths(x.replace('0', ' ')))
df['entropy'] = df['binary_number'].apply(entropy)
df['spectral_analysis'] = df['binary_number'].apply(lambda x: np.mean(spectral_analysis(x)))
df['autocorrelation'] = df['binary_number'].apply(lambda x: np.mean(autocorrelation(x)))
df['ngrams'] = df['binary_number'].apply(lambda x: ngrams(x, 2)[x[:2]])  # Using 2-grams as an example
df['longest_run_0'] = df['binary_number'].apply(lambda x: max(run_lengths(x.replace('1', ' '))))
df['longest_run_1'] = df['binary_number'].apply(lambda x: max(run_lengths(x.replace('0', ' '))))
df['bit_flipping_rate'] = df['binary_number'].apply(bit_flipping_rate)
# Get the current column names
current_columns = df.columns

# Identify the columns to drop
columns_to_drop = [col for col in current_columns if col.startswith('run_length_0_') or col.startswith('run_length_1_')]

# Drop the identified columns
df = df.drop(columns=columns_to_drop)
# Add columns for the lengths of the first 10 runs of 0 and 1
for i in range(10):
    df[f'run_length_0_{i+1}'] = df['run_lengths_0'].apply(lambda x: x[i] if i < len(x) else np.nan)
    df[f'run_length_1_{i+1}'] = df['run_lengths_1'].apply(lambda x: x[i] if i < len(x) else np.nan)
df['mean_run_length_0'] = df['run_lengths_0'].apply(np.mean)
df['max_run_length_0'] = df['run_lengths_0'].apply(max)
df['min_run_length_0'] = df['run_lengths_0'].apply(min)
df['std_run_length_0'] = df['run_lengths_0'].apply(np.std)
df['mean_run_length_1'] = df['run_lengths_1'].apply(np.mean)
df['max_run_length_1'] = df['run_lengths_1'].apply(max)
df['min_run_length_1'] = df['run_lengths_1'].apply(min)
df['std_run_length_1'] = df['run_lengths_1'].apply(np.std)
df = df[df['label'] != 1]

df=df.drop("run_lengths_0",axis=1)
df=df.drop("run_lengths_1",axis=1)
df=df.drop("binary_number",axis=1)
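
One thing to watch in the cell above: `run_lengths(x.replace('1', ' '))` still groups the space characters, so the list contains the runs of both symbols, and `run_lengths_0` and `run_lengths_1` come out identical. That would explain why the `_0` and `_1` summary columns match in the table below. A small sketch of per-symbol run lengths follows; the helper name `run_lengths_of` is hypothetical.

```python
from itertools import groupby

def run_lengths_of(sequence, symbol):
    """Lengths of the maximal runs of `symbol` in `sequence`, in order of appearance."""
    return [len(list(group)) for key, group in groupby(sequence) if key == symbol]

example = "0011101"
zeros = run_lengths_of(example, "0")   # runs "00" and "0"
ones = run_lengths_of(example, "1")    # runs "111" and "1"
```

With this helper, `zeros` is `[2, 1]` and `ones` is `[3, 1]` for the example string, so the two feature families would no longer be copies of each other.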


In [7]:
df=df.drop("label",axis=1)

In [8]:
df

Unnamed: 0,serial_test_p1,serial_test_p2,poker_test,runs_above_below_test,count_0,count_1,transitions,entropy,spectral_analysis,autocorrelation,...,run_length_0_10,run_length_1_10,mean_run_length_0,max_run_length_0,min_run_length_0,std_run_length_0,mean_run_length_1,max_run_length_1,min_run_length_1,std_run_length_1
2000,,,,0.226743,35,65,39,0.934068,65.0,21.45,...,1,1,2.500000,11,1,1.949359,2.500000,11,1,1.949359
2001,,,,0.505677,44,56,45,0.989588,56.0,15.96,...,2,2,2.173913,5,1,1.166644,2.173913,5,1,1.166644
2002,,,,0.339922,33,67,39,0.914926,67.0,22.78,...,1,1,2.500000,12,1,2.291288,2.500000,12,1,2.291288
2003,,,,0.739835,39,61,45,0.964800,61.0,18.91,...,3,3,2.173913,8,1,1.632607,2.173913,8,1,1.632607
2004,,,,0.297566,60,40,42,0.970951,40.0,8.20,...,1,1,2.325581,11,1,2.248730,2.325581,11,1,2.248730
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13995,,,,0.700522,52,48,47,0.998846,48.0,11.76,...,1,1,2.083333,6,1,1.238839,2.083333,6,1,1.238839
13996,,,,0.828718,52,48,50,0.998846,48.0,11.76,...,2,2,1.960784,7,1,1.468102,1.960784,7,1,1.468102
13997,,,,0.020167,45,55,37,0.992774,55.0,15.40,...,3,3,2.631579,7,1,1.494218,2.631579,7,1,1.494218
13998,,,,0.691988,51,49,47,0.999711,49.0,12.25,...,2,2,2.083333,6,1,1.204736,2.083333,6,1,1.204736
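
The `poker_test` column above is empty (NaN). A likely cause: after `counts /= k`, the sum of squares is at most 1, so `x = 2**m / k * sum(counts**2) - k` is negative (at most 16/25 - 25 for m = 4 and k = 25), and `gammaincc` is not defined for negative arguments. Below is a sketch of the textbook form of the poker test, which uses raw block counts and a chi-square distribution with 2^m - 1 degrees of freedom. The function name `poker_test_chi2` is my own.

```python
import numpy as np
from scipy.special import gammaincc

def poker_test_chi2(bit_string, m=4):
    """Poker test: chi-square statistic on counts of non-overlapping m-bit blocks."""
    k = len(bit_string) // m
    counts = np.zeros(2 ** m)
    for i in range(k):
        counts[int(bit_string[i * m:(i + 1) * m], 2)] += 1
    x = (2 ** m / k) * np.sum(counts ** 2) - k   # raw counts, not frequencies
    dof = 2 ** m - 1
    p_value = gammaincc(dof / 2, x / 2)          # chi-square survival function
    return x, p_value

# Every 4-bit pattern exactly once: perfectly uniform block counts
uniform_blocks = "".join(format(i, "04b") for i in range(16))
x, p_value = poker_test_chi2(uniform_blocks)
```

The usual recommendation is k >= 5 * 2^m, so for 100-bit strings a smaller block size such as m = 2 is probably more appropriate than m = 4.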


In [9]:
preprocessed_df=preprocessed_df.join(df)

In [44]:
preprocessed_df

Unnamed: 0,label,shannon_entropy,classic_spectral_test,frequency_test,runs_test,autocorrelation_test,cumulative_sums_test,longest_run_ones_test,random_excursions_test,unique_subsequences,...,run_length_0_10,run_length_1_10,mean_run_length_0,max_run_length_0,min_run_length_0,std_run_length_0,mean_run_length_1,max_run_length_1,min_run_length_1,std_run_length_1
2000,2,1.857699,-22.230385,0.002700,0.150635,0.45,30,11.0,,16,...,1.0,1.0,2.500000,11.0,1.0,1.949359,2.500000,11.0,1.0,1.949359
2001,2,1.953521,-21.771553,0.230139,0.382632,0.33,15,5.0,,15,...,2.0,2.0,2.173913,5.0,1.0,1.166644,2.173913,5.0,1.0,1.166644
2002,2,1.791342,-22.230385,0.000674,0.234812,0.47,36,12.0,,16,...,1.0,1.0,2.500000,12.0,1.0,2.291288,2.500000,12.0,1.0,2.291288
2003,2,1.881277,-22.230385,0.027807,0.585556,0.38,23,8.0,,16,...,3.0,3.0,2.173913,8.0,1.0,1.632607,2.173913,8.0,1.0,1.632607
2004,2,1.837127,-22.230385,0.045500,0.208791,0.19,23,5.0,,16,...,1.0,1.0,2.325581,11.0,1.0,2.248730,2.325581,11.0,1.0,2.248730
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13995,4,1.942653,-22.230385,0.689157,0.556584,0.24,10,4.0,,16,...,1.0,1.0,2.083333,6.0,1.0,1.238839,2.083333,6.0,1.0,1.238839
13996,4,1.919479,-22.230385,0.689157,0.987149,0.23,6,7.0,,16,...,2.0,2.0,1.960784,7.0,1.0,1.468102,1.960784,7.0,1.0,1.468102
13997,4,1.862236,-22.230385,0.317311,0.011137,0.36,15,7.0,,15,...,3.0,3.0,2.631579,7.0,1.0,1.494218,2.631579,7.0,1.0,1.494218
13998,4,1.856367,-22.230385,0.841481,0.548989,0.25,8,5.0,,16,...,2.0,2.0,2.083333,6.0,1.0,1.204736,2.083333,6.0,1.0,1.204736


In [11]:
preprocessed_df = preprocessed_df[preprocessed_df['label'] != 1]

In [12]:
non_numeric_column_names = [col for col in preprocessed_df.columns if not str(col).isdigit()]
print(non_numeric_column_names)

['label', 'shannon_entropy', 'classic_spectral_test', 'frequency_test', 'runs_test', 'autocorrelation_test', 'maurer_universal_test', 'cumulative_sums_test', 'longest_run_ones_test', 'random_excursions_test', 'unique_subsequences', 'sample_entropy', 'permutation_entropy', 'lyapunov_exponent', 'entropy_rate', 'min_entropy', 'serial_test_p1', 'serial_test_p2', 'poker_test', 'runs_above_below_test', 'count_0', 'count_1', 'transitions', 'entropy', 'spectral_analysis', 'autocorrelation', 'ngrams', 'longest_run_0', 'longest_run_1', 'bit_flipping_rate', 'run_length_0_1', 'run_length_1_1', 'run_length_0_2', 'run_length_1_2', 'run_length_0_3', 'run_length_1_3', 'run_length_0_4', 'run_length_1_4', 'run_length_0_5', 'run_length_1_5', 'run_length_0_6', 'run_length_1_6', 'run_length_0_7', 'run_length_1_7', 'run_length_0_8', 'run_length_1_8', 'run_length_0_9', 'run_length_1_9', 'run_length_0_10', 'run_length_1_10', 'mean_run_length_0', 'max_run_length_0', 'min_run_length_0', 'std_run_length_0', 'mea

In [13]:
null_columns = preprocessed_df.columns[preprocessed_df.isnull().any()]
num_null_columns = len(null_columns)
print(f"Number of columns with null values: {num_null_columns}")

Number of columns with null values: 5


In [45]:
null_rows = preprocessed_df[non_numeric_column_names].isnull().sum()
print(null_rows)

KeyError: "['maurer_universal_test'] not in index"

In [15]:
# drop column binary_matrix_rank_test

# preprocessed_df = preprocessed_df.drop(columns=['binary_matrix_rank_test'])

In [16]:
# Drop maurer_universal_test

preprocessed_df = preprocessed_df.drop(columns=['maurer_universal_test'])


In [17]:
# Fill null values in random_excursions_test with the column average

preprocessed_df['random_excursions_test'] = preprocessed_df['random_excursions_test'].fillna(preprocessed_df['random_excursions_test'].mean())

In [18]:
# # Replace infinities with NaN
# preprocessed_df = np.where(np.isinf(preprocessed_df), np.nan, preprocessed_df)

# # Calculate the mean of each column, ignoring NaN values
# col_means = np.nanmean(preprocessed_df, axis=0)

# # Find indices in preprocessed_df where NaN values are present
# inds = np.where(np.isnan(preprocessed_df))

# # Replace NaNs with corresponding column mean
# preprocessed_df[inds] = np.take(col_means, inds[1])

preprocessed_df = preprocessed_df.replace([np.inf, -np.inf], np.nan)
preprocessed_df = preprocessed_df.fillna(preprocessed_df.mean())


In [48]:
X=preprocessed_df.drop(columns='label').values
y=preprocessed_df['label'].values

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
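
The mean imputation in the earlier cell runs on the full frame before this split, so the fill values include statistics from rows that end up in the test set. A sketch of the alternative, computing the column means on the training rows only and reusing them for the test rows, is below. The helper name `impute_with_train_means` and the demo arrays are hypothetical.

```python
import numpy as np

def impute_with_train_means(X_train, X_test):
    """Replace inf/NaN in both sets using column means computed on X_train only."""
    X_train = np.where(np.isfinite(X_train), X_train, np.nan)
    X_test = np.where(np.isfinite(X_test), X_test, np.nan)
    train_means = np.nanmean(X_train, axis=0)
    X_train = np.where(np.isnan(X_train), train_means, X_train)
    X_test = np.where(np.isnan(X_test), train_means, X_test)
    return X_train, X_test

X_train_demo = np.array([[1.0, np.nan], [3.0, 4.0], [np.inf, 6.0]])
X_test_demo = np.array([[np.nan, -np.inf]])
X_train_filled, X_test_filled = impute_with_train_means(X_train_demo, X_test_demo)
```

In the demo the training means are 2.0 and 5.0, and the single test row is filled with exactly those values.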



In [46]:
X.shape

(12000, 156)

In [49]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score, confusion_matrix

# Initialize the RandomForestClassifier
clf = RandomForestClassifier()

# Train the model
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
f1 = f1_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
conf_matrix = confusion_matrix(y_test, y_pred)

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")

Accuracy: 0.725
Precision: 0.7036748879017508
F1 Score: 0.7016699516855223
Confusion Matrix:
[[ 162   33  207]
 [  36  150  240]
 [  69   75 1428]]
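
To read the confusion matrix above, it helps to compute per-class recall. scikit-learn puts true labels on the rows and predictions on the columns; I am assuming the rows are in sorted label order 2, 3, 4.

```python
import numpy as np

conf_matrix = np.array([[162, 33, 207],
                        [36, 150, 240],
                        [69, 75, 1428]])

accuracy = np.trace(conf_matrix) / conf_matrix.sum()
recall_per_class = np.diag(conf_matrix) / conf_matrix.sum(axis=1)
majority_share = conf_matrix.sum(axis=1).max() / conf_matrix.sum()
```

Recall is about 0.40 and 0.35 for the first two classes and about 0.91 for the last one, while always predicting the largest class would already score 0.655. Most of the 0.725 accuracy therefore comes from the majority class.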


In [50]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score, confusion_matrix

# Initialize the RandomForestClassifier
# APPLY WEIGHTS
class_weights = { 2: 1, 3: 1, 4: 20}
clf = RandomForestClassifier(class_weight=class_weights)


# Train the model
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
f1 = f1_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
conf_matrix = confusion_matrix(y_test, y_pred)

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")

Accuracy: 0.72875
Precision: 0.711240331182139
F1 Score: 0.7143627575379586
Confusion Matrix:
[[ 170   43  189]
 [  38  189  199]
 [  84   98 1390]]
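
Note that `{2: 1, 3: 1, 4: 20}` puts the extra weight on class 4, which is already the largest class. For comparison, the "balanced" heuristic weights each class by n_samples / (n_classes * class_count), which is what scikit-learn's `class_weight='balanced'` computes. The counts below are an assumption: the one-vs-rest cell further down reports 8000 rows of class 4 and 4000 other rows, and I assume these are split evenly between classes 2 and 3.

```python
import numpy as np

classes = np.array([2, 3, 4])
class_counts = np.array([2000, 2000, 8000])   # assumed counts, see note above

n_samples = class_counts.sum()
balanced = n_samples / (len(classes) * class_counts)
class_weights = {int(c): float(w) for c, w in zip(classes, balanced)}
```

This gives `{2: 2.0, 3: 2.0, 4: 0.5}`, which could be passed as `class_weight=class_weights`, or `class_weight='balanced'` could be used directly.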


In [23]:
# # top 10 features 

# # Get feature importances
# importances = clf.feature_importances_

# # Get the indices of the top 10 features
# indices = np.argsort(importances)[-20:]

# # Get the names of the top 10 features
# top_features = X_train.columns[indices]

# # Print the top 10 features
# print("Top 10 features: ", top_features)

importances = clf.feature_importances_

# Get the indices of the 20 most important features (ascending importance)
indices = np.argsort(importances)[-20:]

# Get the names of these features
top_features = preprocessed_df.columns[indices]

# Print the selected features
print("Top 10 features: ", top_features)

Top 10 features:  Index(['run_length_1_10', 'count_1', 'longest_run_1', 'spectral_analysis',
       'entropy', 'runs_above_below_test', 'std_run_length_0', 'count_0',
       'runs_test', 'autocorrelation', 'lyapunov_exponent',
       'autocorrelation_test', 'poker_test', 'frequency_test',
       'min_run_length_0', 'min_run_length_1', 'entropy_rate',
       'unique_subsequences', 'sample_entropy', 'label'],
      dtype='object')
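
The list above ends with `'label'`, which cannot be a feature. `X` was built from `preprocessed_df.drop(columns='label')`, so the importances are indexed against the columns without `label`, while the names are looked up in `preprocessed_df.columns`, which still has `label` in first position. Every name is therefore probably shifted by one. A sketch of an aligned lookup on made-up names and importances:

```python
import numpy as np

columns = ["label", "feat_a", "feat_b", "feat_c"]   # hypothetical frame columns
importances = np.array([0.2, 0.5, 0.3])             # hypothetical importances for X

# Names must come from the same columns that were used to build X
feature_names = np.array([c for c in columns if c != "label"])

top = np.argsort(importances)[::-1][:2]             # two largest, most important first
top_features = feature_names[top]
```

Under that reading, the last entry above would correspond to `shannon_entropy`, which is also the last entry in the one-vs-rest feature list further down.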


Next, try a one-vs-rest approach (class 4 against all other classes) to identify the best model.

In [24]:
X_2 = preprocessed_df.drop("label", axis=1)
y_2 = preprocessed_df["label"]

# One-vs-rest target: keep class 4 and map every other class to 0
y_2 = np.where(y_2 != 4, 0, y_2)
unique_values, counts = np.unique(y_2, return_counts=True)
for value, count in zip(unique_values, counts):
    print(f"Value {value} appears {count} times in y_2")
from sklearn.model_selection import train_test_split

# Split the one-vs-rest data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_2, y_2, test_size=0.2, random_state=42)

# Use string column names throughout (the bit columns are integers)
X_train.columns = X_train.columns.astype(str)
X_test.columns = X_test.columns.astype(str)

# Replace infinities with NaN, then fill NaN with the training-set column means
X_train = X_train.replace([np.inf, -np.inf], np.nan)
train_means = X_train.mean()
X_train = X_train.fillna(train_means)

# Apply the same training-set means to the test set
X_test = X_test.replace([np.inf, -np.inf], np.nan)
X_test = X_test.fillna(train_means)

Value 0 appears 4000 times in y_2
Value 4 appears 8000 times in y_2


In [25]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score, confusion_matrix

# Initialize the RandomForestClassifier (no class weights for the binary task)
clf = RandomForestClassifier()


# Train the model
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
f1 = f1_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
conf_matrix = confusion_matrix(y_test, y_pred)

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")

Accuracy: 0.7608333333333334
Precision: 0.7544003555787422
F1 Score: 0.7542625490601116
Confusion Matrix:
[[ 465  363]
 [ 211 1361]]
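
The weighted averages above hide how differently the two classes behave. As a rough check, per-class precision and recall can be recomputed directly from the confusion matrix printed above. The sketch assumes scikit-learn's convention (rows are true classes, columns are predicted classes, in sorted label order, so 0 = "rest" comes before 4).

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class; order: 0 = "rest", then 4)
conf = np.array([[465, 363],
                 [211, 1361]])

true_pos = np.diag(conf)
recall = true_pos / conf.sum(axis=1)      # correct / all true members of the class
precision = true_pos / conf.sum(axis=0)   # correct / all predictions of the class
accuracy = true_pos.sum() / conf.sum()

for name, p, r in zip(["rest (0)", "class 4"], precision, recall):
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
print(f"accuracy={accuracy:.4f}")
```

Under that assumption the "rest" class is recovered much less often than class 4 (465 of 828 samples versus 1361 of 1572).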


In [26]:
# Top 35 features by importance

# Get feature importances
importances = clf.feature_importances_

# Get the indices of the 35 most important features (in ascending order of importance)
indices = np.argsort(importances)[-35:]

# Get the names of those features
top_features = X_train.columns[indices]

# Print the selected features
print("Top 35 features: ", top_features)

Top 35 features:  Index(['run_length_0_2', 'run_length_1_6', 'run_length_0_9', 'run_length_0_3',
       'run_length_0_7', 'max_run_length_0', 'max_run_length_1',
       'run_length_0_4', 'run_length_0_8', 'run_length_1_9', 'run_length_1_8',
       'longest_run_0', 'longest_run_ones_test', 'entropy', 'frequency_test',
       'ngrams', 'autocorrelation_test', 'cumulative_sums_test', 'count_1',
       'autocorrelation', 'min_entropy', 'count_0', 'entropy_rate',
       'spectral_analysis', 'sample_entropy', 'std_run_length_1',
       'std_run_length_0', 'bit_flipping_rate', 'mean_run_length_1',
       'runs_above_below_test', 'transitions', 'runs_test',
       'mean_run_length_0', 'permutation_entropy', 'shannon_entropy'],
      dtype='object')


In [27]:
# Select only the top-ranked features from the previous cell
X_train_top = X_train[top_features]
X_test_top = X_test[top_features]

# Train the model
clf.fit(X_train_top, y_train)

# Make predictions on the test set
y_pred_top = clf.predict(X_test_top)

# Calculate metrics
accuracy_top = accuracy_score(y_test, y_pred_top)
precision_top = precision_score(y_test, y_pred_top, average='weighted')  # Use 'weighted' for multi-class problems
f1_top = f1_score(y_test, y_pred_top, average='weighted')  # Use 'weighted' for multi-class problems
conf_matrix_top = confusion_matrix(y_test, y_pred_top)

# Print metrics
print(f"Accuracy: {accuracy_top}")
print(f"Precision: {precision_top}")
print(f"F1 Score: {f1_top}")
print(f"Confusion Matrix:\n{conf_matrix_top}")

Accuracy: 0.7629166666666667
Precision: 0.7569843284405694
F1 Score: 0.7574754098360656
Confusion Matrix:
[[ 478  350]
 [ 219 1353]]


In [28]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer

# Create an imputer object that replaces NaN values with the mean value of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on the training data and transform it
X_train_imputed = imputer.fit_transform(X_train)

# Transform the testing data with the same imputer
X_test_imputed = imputer.transform(X_test)

# Create a Gradient Boosting classifier
clf = GradientBoostingClassifier(n_estimators=100)

# Train the model with the imputed data
clf.fit(X_train_imputed, y_train)

# Make predictions with the imputed test data
y_pred = clf.predict(X_test_imputed)
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
f1 = f1_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
conf_matrix = confusion_matrix(y_test, y_pred)

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")



Accuracy: 0.7741666666666667
Precision: 0.7687035333087965
F1 Score: 0.768357953887151
Confusion Matrix:
[[ 485  343]
 [ 199 1373]]


In [29]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

# NOTE: AdaBoostClassifier and SVC are imported but not used yet; this cell
# retrains the same GradientBoostingClassifier as the previous cell.

# Create an imputer that replaces NaN values with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on the training data and transform it
X_train_imputed = imputer.fit_transform(X_train)

# Transform the testing data with the same imputer
X_test_imputed = imputer.transform(X_test)

# Create a Gradient Boosting classifier
clf = GradientBoostingClassifier(n_estimators=100)

# Train the model with the imputed data
clf.fit(X_train_imputed, y_train)

# Make predictions with the imputed test data
y_pred = clf.predict(X_test_imputed)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
f1 = f1_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
conf_matrix = confusion_matrix(y_test, y_pred)

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")



Accuracy: 0.7741666666666667
Precision: 0.7687035333087965
F1 Score: 0.768357953887151
Confusion Matrix:
[[ 485  343]
 [ 199 1373]]


In [30]:
# Try another split: class 2 vs. all

In [31]:
from sklearn.model_selection import train_test_split

# Features and labels for the "class 2 vs. rest" task
X_2 = preprocessed_df.drop("label", axis=1)
y_2 = preprocessed_df["label"]

# Relabel every class other than 2 as 0
y_2 = np.where(y_2 != 2, 0, y_2)
unique_values, counts = np.unique(y_2, return_counts=True)
for value, count in zip(unique_values, counts):
    print(f"Value {value} appears {count} times in y_2")

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X_2, y_2, test_size=0.2, random_state=42)

# Make all column names strings (scikit-learn rejects mixed-type names)
X_train.columns = X_train.columns.astype(str)
X_test.columns = X_test.columns.astype(str)

# Training set: replace infinities with NaN, then fill NaN with the column mean
X_train = X_train.replace([np.inf, -np.inf], np.nan)
X_train = X_train.fillna(X_train.mean())

# Test set: same clean-up, using the test set's own column means
X_test = X_test.replace([np.inf, -np.inf], np.nan)
X_test = X_test.fillna(X_test.mean())

Value 0 appears 10000 times in y_2
Value 2 appears 2000 times in y_2


In [32]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score, confusion_matrix

# Initialize the RandomForestClassifier
# APPLY WEIGHTS
# class_weights = {1: 1, 2, 4: 20}
clf = RandomForestClassifier()


# Train the model
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
f1 = f1_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
conf_matrix = confusion_matrix(y_test, y_pred)

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")

Accuracy: 0.8541666666666666
Precision: 0.8332262399937216
F1 Score: 0.8308128196115337
Confusion Matrix:
[[1932   66]
 [ 284  118]]
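
With 10000 "rest" samples against 2000 for class 2, plain accuracy is hard to read on its own. The sketch below compares the reported accuracy with a majority-class baseline and with balanced accuracy (the mean of the per-class recalls), all derived from the confusion matrix printed above. It assumes scikit-learn's layout of true classes in rows and predicted classes in columns.

```python
import numpy as np

# Confusion matrix copied from the output above (order: 0 = "rest", then 2)
conf = np.array([[1932, 66],
                 [284, 118]])

support = conf.sum(axis=1)                          # true samples per class
majority_baseline = support.max() / support.sum()   # always predict the larger class
accuracy = np.trace(conf) / conf.sum()
recall = np.diag(conf) / support
balanced_accuracy = recall.mean()

print(f"majority baseline: {majority_baseline:.4f}")
print(f"accuracy:          {accuracy:.4f}")
print(f"recall per class:  {np.round(recall, 4)}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

Under that assumption only 118 of the 402 class 2 test samples are recognised, so most of the headline accuracy comes from the majority class.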


In [33]:
# Same approach with class 3 vs. all

In [34]:
from sklearn.model_selection import train_test_split

# Features and labels for the "class 3 vs. rest" task
X_2 = preprocessed_df.drop("label", axis=1)
y_2 = preprocessed_df["label"]

# Relabel every class other than 3 as 0
y_2 = np.where(y_2 != 3, 0, y_2)
unique_values, counts = np.unique(y_2, return_counts=True)
for value, count in zip(unique_values, counts):
    print(f"Value {value} appears {count} times in y_2")

# Drop the random_excursions_test feature for this run
X_2 = X_2.drop("random_excursions_test", axis=1)

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X_2, y_2, test_size=0.2, random_state=42)

# Make all column names strings (scikit-learn rejects mixed-type names)
X_train.columns = X_train.columns.astype(str)
X_test.columns = X_test.columns.astype(str)

# Training set: replace infinities with NaN, then fill NaN with the column mean
X_train = X_train.replace([np.inf, -np.inf], np.nan)
X_train = X_train.fillna(X_train.mean())

# Test set: same clean-up, using the test set's own column means
X_test = X_test.replace([np.inf, -np.inf], np.nan)
X_test = X_test.fillna(X_test.mean())

Value 0 appears 10000 times in y_2
Value 3 appears 2000 times in y_2


In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score, confusion_matrix

# Initialize the RandomForestClassifier
# APPLY WEIGHTS
# class_weights = {1: 1, 2, 4: 20}
clf = RandomForestClassifier()


# Train the model
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
f1 = f1_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
conf_matrix = confusion_matrix(y_test, y_pred)

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")

Accuracy: 0.8420833333333333
Precision: 0.8195182160747122
F1 Score: 0.8088492389364309
Confusion Matrix:
[[1924   50]
 [ 329   97]]
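
The "APPLY WEIGHTS" comment in the cell above hints at class weighting as a way to handle the imbalance. One option that scikit-learn offers is `class_weight="balanced"`, which weights classes inversely to their frequency. The sketch below compares it with the default on synthetic data; the dataset, the variable names and the roughly 5:1 imbalance are assumptions for illustration, and whether weighting helps on the QRNG features would need to be checked separately.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 5:1), standing in for the notebook's features
X_syn, y_syn = make_classification(
    n_samples=1200, n_features=20, weights=[0.83, 0.17], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

minority_recall = {}
for weights in (None, "balanced"):
    model = RandomForestClassifier(class_weight=weights, random_state=42)
    model.fit(X_tr, y_tr)
    # Recall on the minority class (label 1 in the synthetic data)
    minority_recall[weights] = recall_score(y_te, model.predict(X_te), pos_label=1)
    print(f"class_weight={weights}: minority recall = {minority_recall[weights]:.3f}")
```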


In [36]:
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score, confusion_matrix

# Initialize the RandomForestClassifier
clf = RandomForestClassifier()

# Initialize a SimpleImputer model
imputer = SimpleImputer(strategy='mean')

# Fit the imputer on the training data, then apply the same transform to the
# test data. Imputing only X_train left the model with 152 features while
# X_test still had 155, which made predict() raise a ValueError.
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Apply SMOTE to the training data only
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train_imputed, y_train)

# Train the model
clf.fit(X_train_res, y_train_res)

# Make predictions on the test set
y_pred = clf.predict(X_test_imputed)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
f1 = f1_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
conf_matrix = confusion_matrix(y_test, y_pred)

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")
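
The imputer and the classifier have to see the same columns at fit time and at predict time. One way to guarantee this is to chain them in a scikit-learn `Pipeline`, so that every call to `predict` runs the fitted imputer first. The sketch below uses toy data with one entirely empty column, which `SimpleImputer` discards by default when the strategy is not `'constant'`; all names and data here are hypothetical, and this is only one possible arrangement.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)

# Toy data: 5 features, one of which (column 2) contains only NaN
X_tr = rng.normal(size=(200, 5))
X_tr[:, 2] = np.nan
X_tr[::10, 0] = np.nan            # a few scattered missing values as well
y_tr = rng.integers(0, 2, size=200)

X_te = rng.normal(size=(50, 5))
X_te[:, 2] = np.nan

# The imputer is fitted on the training data and reused automatically at predict time
pipe = make_pipeline(SimpleImputer(strategy="mean"), RandomForestClassifier(random_state=42))
pipe.fit(X_tr, y_tr)
y_hat = pipe.predict(X_te)

n_after = pipe.named_steps["simpleimputer"].transform(X_te).shape[1]
print(f"features before: {X_te.shape[1]}, after imputation: {n_after}")
```

If oversampling is also needed, imbalanced-learn provides its own pipeline class that applies samplers during fitting only; that variant is not shown here.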

In [None]:
from imblearn.over_sampling import RandomOverSampler

# Initialize RandomOverSampler; sampling_strategy=1.0 resamples the minority
# class until it matches the majority class (1:1 ratio)
ros = RandomOverSampler(sampling_strategy=1.0)

# Fit RandomOverSampler and resample the training data
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

# Train a fresh RandomForestClassifier on the resampled data
clf = RandomForestClassifier()
clf.fit(X_train_res, y_train_res)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
f1 = f1_score(y_test, y_pred, average='weighted')  # Use 'weighted' for multi-class problems
conf_matrix = confusion_matrix(y_test, y_pred)

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")

Accuracy: 0.8425
Precision: 0.8218908345752608
F1 Score: 0.8245785479382748
Confusion Matrix:
[[1876   98]
 [ 280  146]]


