# Investigation of Missing Data

The aim of this investigation is to quantify how missing data affects model performance.

1. Begin with a base data set.
2. Split the data into train and test sets.
3. Use a simple ML algorithm to predict the target with the full training data (baseline).
4. Simulate missing data (NaNs).
5. Perform deletion or imputation on the training data.
6. Run the same ML algorithm on the distorted data to predict the target.
7. Cross-validate this method for statistically robust results.
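As a rough orientation, steps 1 to 6 can be sketched end to end on synthetic data. This is a minimal sketch, not the notebook's actual pipeline: the data, the coefficients, the split sizes and the helper `fit_predict_mse` are all illustrative assumptions, and ordinary least squares via NumPy stands in for the sklearn model. Step 7 (cross-validation) is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Synthetic base data set (an assumption; stands in for the housing data).
n, k = 2000, 4
X = rng.normal(size=(n, k))
true_coef = np.array([1.5, -2.0, 0.5, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=n)

# 2. Train/test split.
X_train, X_test = X[:1600], X[1600:]
y_train, y_test = y[:1600], y[1600:]

def fit_predict_mse(X_tr, y_tr, X_te, y_te):
    """Hypothetical helper: least squares with an intercept, returns test MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_te = np.column_stack([np.ones(len(X_te)), X_te])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_te @ coef - y_te) ** 2))

# 3. Baseline on the full training data.
mse_base = fit_predict_mse(X_train, y_train, X_test, y_test)

# 4. Simulate MCAR missingness: each training entry is NaN with probability 0.2.
X_missing = X_train.copy()
X_missing[rng.random(X_missing.shape) < 0.2] = np.nan

# 5. Listwise deletion: keep only the rows without any NaN.
keep = ~np.isnan(X_missing).any(axis=1)

# 6. Refit on the reduced data and compare against the baseline.
mse_del = fit_predict_mse(X_missing[keep], y_train[keep], X_test, y_test)
print(f"rows kept: {keep.mean():.1%}, normalised MSE: {mse_del / mse_base:.3f}")
```

Dividing by the baseline MSE gives a ratio that is easy to compare across missing percentages: values near 1 mean deletion cost little accuracy.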

## Import Packages

In [1]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

## Import Data

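The data is the California housing data set (available in sklearn via `fetch_california_housing`), loaded here from a local Excel export.

This is a numerical data set for regression analysis.

It contains features of varying importance, which makes it useful for studying the effect of modifying them.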

In [426]:
# Import Test Data

df = pd.read_excel('fetch_california_housing.xlsx')
df.rename(columns={'target': 'MedHouseValue'}, inplace=True)
print('Features:', list(df.columns))
print(f"Total data rows: {len(df)}")

Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', 'MedHouseValue']
Total data rows: 20640


In [427]:
target = 'MedHouseValue'
y = df[target]
X = df.drop(columns = [target])

We will use the KFold methodology because we want to use all of our data, breaking it into train and test folds.

In [490]:
from sklearn.model_selection import KFold

folds = 5

kf = KFold(n_splits=folds, shuffle=True, random_state=40)

for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]


After this loop, `X_train`, `X_test`, `y_train` and `y_test` hold only the last fold, so for now the data is effectively a single train/test split. We will loop through all folds later to get a cross-validated model.

Set up a basic linear regression model to use as the baseline reference.

The model can be changed in the future.

In [491]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

model = LinearRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse_base = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error Baseline: {mse_base:.5f}")

Mean Squared Error Baseline: 0.50234


### Theory: Missing Data Types

MCAR - Missing Completely at Random
- Missingness is unrelated to any data values, observed or unobserved.
- e.g. a sensor randomly fails to record readings, regardless of the values being measured.
- Simplest case to handle: listwise deletion does not introduce bias, it only reduces the sample size.

MAR - Missing at Random
- Missingness is related to other observed variables, but not to the missing value itself.
- e.g. income is more likely to be missing for younger people, but among people of the same age, missingness is random.
- If we model the relationship well (e.g. using age), imputation can be accurate.
- Listwise deletion can introduce bias if the data distribution shifts.

MNAR - Missing Not at Random
- Missingness depends on the missing value itself.
- e.g. people with higher income are less likely to report their income.
- Hardest to deal with; even advanced imputations may introduce bias.
- Sometimes requires domain knowledge or modeling the missingness explicitly.

In this analysis, MCAR will be investigated.
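To make the three mechanisms concrete, the sketch below builds a missingness mask for each one on a toy table. Everything in it is an illustrative assumption: the `age` and `income` columns, the probabilities and the thresholds are made up, and `age` is drawn independently of `income`, so the MAR mask does not bias the observed income mean in this toy case.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Toy data (hypothetical columns, not the housing data).
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.lognormal(mean=10, sigma=0.5, size=n),
})

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on an observed variable (age), not on income itself.
mar_prob = np.where(toy["age"] < 30, 0.5, 0.1)
mar_mask = rng.random(n) < mar_prob

# MNAR: missingness depends on the value that goes missing (high incomes are hidden).
high_income = toy["income"] > toy["income"].quantile(0.8)
mnar_prob = np.where(high_income, 0.6, 0.1)
mnar_mask = rng.random(n) < mnar_prob

true_mean = toy["income"].mean()
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = toy.loc[~mask, "income"].mean()
    print(f"{name}: {mask.mean():.1%} missing, "
          f"observed mean / true mean = {observed_mean / true_mean:.3f}")
```

Under MNAR the observed mean falls below the true mean because high incomes are removed more often. That is the kind of bias that deletion cannot undo.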

Artificially inject NaNs into the data by removing a given percentage of the values from each column (MCAR).

In [442]:
def unclean_X_train(X_train, missing_percentage = 0.2, seed = None):
    X_copy = X_train.copy()
    rng = np.random.default_rng(seed)
    for col in X_copy.columns:
        num_missing = int(len(X_copy) * missing_percentage)
        missing_indices = rng.choice(X_copy.index, size=num_missing, replace=False)
        X_copy.loc[missing_indices, col] = np.nan
    return X_copy

mp = 0.2
uc_X_train = unclean_X_train(X_train, missing_percentage= mp)


## Deletion

Now we investigate the effect of removing every row that contains at least one missing value.
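Listwise deletion is costly because a row survives only if none of its columns was hit. If each of k columns independently loses a fraction p of its values, the expected share of complete rows is roughly (1 - p)^k, which for p = 0.2 and k = 8 is about 16.8%. The sketch below checks that approximation by simulation. The function names are hypothetical, and the row count of 16,512 is an assumption (80% of the 20,640 rows, as in one 5-fold training split).

```python
import numpy as np

def expected_fraction_remaining(missing_percentage, n_columns):
    """Probability that a row has no NaN when each column independently
    loses the given fraction of its values (MCAR)."""
    return (1.0 - missing_percentage) ** n_columns

def simulated_fraction_remaining(missing_percentage, n_rows, n_columns, seed=0):
    """Blank exactly int(n_rows * p) entries per column, then count complete rows."""
    rng = np.random.default_rng(seed)
    is_missing = np.zeros((n_rows, n_columns), dtype=bool)
    num_missing = int(n_rows * missing_percentage)
    for col in range(n_columns):
        rows = rng.choice(n_rows, size=num_missing, replace=False)
        is_missing[rows, col] = True
    return float((~is_missing.any(axis=1)).mean())

for p in (0.20, 0.25, 0.30):
    expected = expected_fraction_remaining(p, 8)
    simulated = simulated_fraction_remaining(p, 16_512, 8)
    print(f"p = {p:.2f}: expected {expected:.3f}, simulated {simulated:.3f}")
```

The exponential decay in k explains why only a small share of rows remains once a fifth of every column is missing.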

In [499]:
def deletion(df_X = pd.DataFrame(), df_y = pd.DataFrame()):
    df = pd.concat([df_X, df_y], axis = 1)
    nona_df = df.dropna()
    return nona_df.iloc[:, :-1], nona_df.iloc[:,-1]

50.0% missing values from 8 column, ~0.3% of data remain
Mean Squared Error Baseline: 1.71992


In [535]:
mps = np.arange(0.2, 0.35, 0.005)
mses = []
print(f"Total Training Columns: {len(X_train.columns)}")
for mp in mps:
    # Inject missing values
    uc_X_train = unclean_X_train(X_train, missing_percentage= mp, seed = 42)

    # Remove Rows
    pX_train, py_train = deletion(df_X = uc_X_train, df_y = y_train)

    # Show remaining data
    pc_vals = round((len(pX_train)/len(X_train)),3)*100
    
    # Train and evaluate
    model.fit(pX_train, py_train)
    y_pred = model.predict(X_test)
    mser = mean_squared_error(y_test, y_pred)/mse_base
    mses.append(mser)

    print(f'MP: {mp*100:.2f}%, ~{pc_vals:.1f}% of data remain. MSE score: {mser:.5f}')    

Total Training Columns: 8
MP: 20.00%, ~16.5% of data remain. MSE score: 1.01375
MP: 20.50%, ~16.2% of data remain. MSE score: 1.01088
MP: 21.00%, ~14.9% of data remain. MSE score: 1.00571
MP: 21.50%, ~14.4% of data remain. MSE score: 1.00234
MP: 22.00%, ~13.8% of data remain. MSE score: 1.20302
MP: 22.50%, ~13.3% of data remain. MSE score: 1.00455
MP: 23.00%, ~12.1% of data remain. MSE score: 1.00708
MP: 23.50%, ~11.4% of data remain. MSE score: 1.28220
MP: 24.00%, ~10.9% of data remain. MSE score: 1.00348
MP: 24.50%, ~10.6% of data remain. MSE score: 1.27739
MP: 25.00%, ~10.2% of data remain. MSE score: 1.24522
MP: 25.50%, ~9.4% of data remain. MSE score: 0.98733
MP: 26.00%, ~9.0% of data remain. MSE score: 1.09385
MP: 26.50%, ~8.6% of data remain. MSE score: 0.98920
MP: 27.00%, ~8.4% of data remain. MSE score: 0.99772
MP: 27.50%, ~7.6% of data remain. MSE score: 1.36321
MP: 28.00%, ~7.4% of data remain. MSE score: 1.29513
MP: 28.50%, ~6.7% of data remain. MSE score: 1.42153
MP: 29.00

In [563]:
import plotly.graph_objects as go
def plotter(X = mps, y = mses):
    # Create DataFrame
    df = pd.DataFrame({
        "Missing Percentage": X,
        "Deletion MSE score": y
    })
    # Create line plot
    fig = go.Figure()

    # Add a line trace
    fig.add_trace(go.Scatter(
        x=df["Missing Percentage"]*100,
        y=df["Deletion MSE score"],
        mode='lines+markers',
        name='Deletion',
        line=dict(color='blue'),
        marker=dict(size=8)
    ))

    # Update layout
    fig.update_layout(
        title="Standardised MSE vs. Missing Percentage",
        xaxis_title="Missing Data Percentage (%)",
        yaxis_title="MSE Score",
        template="plotly_white"
    )

    fig.show()
    
# Plot the normalised MSE values collected in the deletion loop above
plotter(X = mps, y = mses)


These results have a large variance, so we will now use cross-validation with K-fold splits to smooth them.

In [586]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
def ml_mse(model, X_train, y_train, X_test, y_test): 
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return mse

folds = 10
model = LinearRegression()

res_av = []
for i in range(50):
    kf = KFold(n_splits=folds, shuffle=True, random_state=np.random.randint(0, 10000))
    mse_res = []
    for mp in mps:
        mses = []
        for fold_idx, (train_index, test_index) in enumerate(kf.split(X)):
            X_train, X_test = X.iloc[train_index].copy(), X.iloc[test_index].copy()
            y_train, y_test = y.iloc[train_index].copy(), y.iloc[test_index].copy()
            mse_base = ml_mse(model = model, X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test)
            uc_X_train = unclean_X_train(X_train, missing_percentage= mp, seed = None)
            pX_train, py_train = deletion(df_X = uc_X_train, df_y = y_train)
            if len(pX_train) == 0:
                print(f"No rows left after deletion at mp={mp:.3f}; skipping fold {fold_idx}")
                continue
            mse_sample = ml_mse(model = model, X_train = pX_train, y_train = py_train, X_test = X_test, y_test = y_test)
            mses.append(mse_sample/mse_base)
        mse_res.append(np.median(mses))
        #lower_q = np.nanpercentile(mse_res, 25, axis=0)
        #upper_q = np.nanpercentile(mse_res, 75, axis=0)
        #iqr = upper_q - lower_q
        #med = np.median(mses)
        #std = np.std(mses)
        #mean = np.mean(mses)
        #print(round(mp,3), f'\tMSE = {med:.1f} ± {iqr:.1f} (95% CI)"')
    res_av.append(mse_res)
arr = np.array(res_av)
smooth_mse = np.median(arr, axis = 0)
print(smooth_mse)

plotter(X = mps, y = smooth_mse)

[1.00377256 1.00440298 1.00478078 1.00522651 1.00638375 1.00828765
 1.00688738 1.01071522 1.00702187 1.00728756 1.01234431 1.01031229
 1.00928024 1.00725438 1.01007581 1.00968669 1.01443434 1.01672375
 1.02046944 1.01912646 1.03099034 1.0223759  1.01898457 1.02495749
 1.03893593 1.02827901 1.0326383  1.05074787 1.04665583 1.07917232]


In [558]:
folds = 10
kf = KFold(n_splits=folds, shuffle=True, random_state=40)

mps = np.arange(0.2, 0.35, 0.005)
mse_matrix = np.zeros((folds, len(mps)))  # Collect all MSEs
model = LinearRegression()

for fold_idx, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index].copy(), X.iloc[test_index].copy()
    y_train, y_test = y.iloc[train_index].copy(), y.iloc[test_index].copy()

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse_base = mean_squared_error(y_test, y_pred)

    for i, mp in enumerate(mps):
        uc_X_train = unclean_X_train(X_train, missing_percentage=mp, seed=fold_idx)
        pX_train, py_train = deletion(df_X=uc_X_train, df_y=y_train)

        if len(pX_train) == 0:
            mse_matrix[fold_idx, i] = np.nan  # safer than skipping
            continue

        model = LinearRegression()
        model.fit(pX_train, py_train)
        y_pred = model.predict(X_test)
        mser = mean_squared_error(y_test, y_pred) / mse_base
        mse_matrix[fold_idx, i] = mser

# Compute average ignoring NaNs
mse_res = np.nanmean(mse_matrix, axis=0)


import plotly.graph_objects as go

# Create DataFrame
df = pd.DataFrame({
    "Missing Percentage": mps,
    "Deletion MSE score": mse_res
})
# Create line plot
fig = go.Figure()

# Add a line trace
fig.add_trace(go.Scatter(
    x=df["Missing Percentage"]*100,
    y=df["Deletion MSE score"],
    mode='lines+markers',
    name='Deletion',
    line=dict(color='blue'),
    marker=dict(size=8)
))

# Update layout
fig.update_layout(
    title="Standardised MSE vs. Missing Percentage",
    xaxis_title="Missing Data Percentage (%)",
    yaxis_title="MSE Score",
    template="plotly_white"
)

fig.show()


In [537]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Assuming X and y are your feature matrix and target vector
kf = KFold(n_splits=50, shuffle=True, random_state=42)

for fold_idx, (train_index, test_index) in enumerate(kf.split(X)):
    # Split the data
    X_train, X_test = X.iloc[train_index].copy(), X.iloc[test_index].copy()
    y_train, y_test = y.iloc[train_index].copy(), y.iloc[test_index].copy()

    # --- Your custom processing here, e.g., uncleaning ---
    uc_X_train = unclean_X_train(X_train, missing_percentage=0.2, seed=fold_idx)
    pX_train, py_train = deletion(df_X=uc_X_train, df_y=y_train)

    # Skip if all rows were dropped
    if len(pX_train) == 0:
        print(f"Fold {fold_idx}: skipped due to all rows dropped.")
        continue

    # Train model
    model = LinearRegression()
    model.fit(pX_train, py_train)

    # Evaluate on the untouched test fold
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)

    print(f"Fold {fold_idx} - MSE: {mse:.5f}")


Fold 0 - MSE: 0.43430
Fold 1 - MSE: 0.56216
Fold 2 - MSE: 0.55924
Fold 3 - MSE: 0.52693
Fold 4 - MSE: 0.52758
Fold 5 - MSE: 0.40324
Fold 6 - MSE: 0.67056
Fold 7 - MSE: 0.58414
Fold 8 - MSE: 0.46409
Fold 9 - MSE: 0.62650
Fold 10 - MSE: 0.46930
Fold 11 - MSE: 0.40726
Fold 12 - MSE: 0.49491
Fold 13 - MSE: 0.38911
Fold 14 - MSE: 0.50702
Fold 15 - MSE: 0.52363
Fold 16 - MSE: 0.66189
Fold 17 - MSE: 0.57179
Fold 18 - MSE: 0.53255
Fold 19 - MSE: 0.59938
Fold 20 - MSE: 0.41633
Fold 21 - MSE: 0.53134
Fold 22 - MSE: 0.53726
Fold 23 - MSE: 0.45918
Fold 24 - MSE: 0.46759
Fold 25 - MSE: 0.47237
Fold 26 - MSE: 0.44324
Fold 27 - MSE: 0.46572
Fold 28 - MSE: 0.56485
Fold 29 - MSE: 0.57374
Fold 30 - MSE: 0.52261
Fold 31 - MSE: 0.59363
Fold 32 - MSE: 0.40692
Fold 33 - MSE: 0.42484
Fold 34 - MSE: 10.67180
Fold 35 - MSE: 0.45533
Fold 36 - MSE: 0.40793
Fold 37 - MSE: 0.41866
Fold 38 - MSE: 0.50465
Fold 39 - MSE: 0.51349
Fold 40 - MSE: 0.59466
Fold 41 - MSE: 0.67135
Fold 42 - MSE: 0.51113
Fold 43 - MSE: 0.588

In [409]:

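from sklearn.model_selection import cross_val_score

# unclean_df and remove_na are not defined in the notebook as shown. They appear
# to be earlier names of unclean_X_train and deletion, so they are aliased here
# (an assumption).
def unclean_df(df, missing_percentage=0.2, seed=None):
    return unclean_X_train(df, missing_percentage=missing_percentage, seed=seed)

remove_na = deletion

X_train, y_train = X.copy(), y.copy()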
uc_X = unclean_df(df = X_train, missing_percentage=mp, seed = None)
nona_X, nona_y = remove_na(df_X = uc_X, df_y = y_train)
pc_vals = round((len(nona_X)/len(X_train)),3)*100
print(f'{mp*100}% missing values from {len(uc_X.columns)} column,', f"~{pc_vals}% of data remain")

model = LinearRegression()
def deletion_method(X, y, model = LinearRegression(), cv = 5, mp = 0.2, repeats = 30, seed = None):
    all_mse = []
    for i in range(repeats):
        X_train, y_train = X.copy(), y.copy()
        uc_X = unclean_df(df = X_train, missing_percentage=mp, seed = None)
        nona_X, nona_y = remove_na(df_X = uc_X, df_y = y_train)
        mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
        cv_scores = cross_val_score(model, nona_X, nona_y, scoring=mse_scorer, cv=cv)
        all_mse.append(-np.median(cv_scores))
    return np.median(all_mse)

def method_pipe(method=deletion_method, X=None, y=None, model=LinearRegression(), cv=50, mp=0.2, repeats=100, outer_repeats=15):
    results = []
    for i in range(outer_repeats):
        res = method(X=X, y=y, model=model, cv=cv, mp=mp, repeats=repeats)
        results.append(res)
    return results

results = method_pipe(method=deletion_method,
                        X=X,
                        y=y,
                        model=LinearRegression(),
                        cv=50,
                        mp=0.2,
                        repeats=100,
                        outer_repeats=15)

def sweep_missing_levels(method, X, y, mps,
                         model=LinearRegression(),
                         cv=50,repeats=100,outer_repeats=15):
    results_dict = {}
    for mp in mps:
        scores = method_pipe(method=method, X=X, y=y, model=model, cv=cv, mp=mp, repeats=repeats, outer_repeats=outer_repeats)
        mean_score = np.mean(scores)
        std_score = np.std(scores)
            
        results_dict[mp] = {"mean": mean_score,
                            "std": std_score,
                            "raw": scores}
    return results_dict

def print_results(name="Deletion Method", results_dict={}):
    print(f"=== {name} ===")
    for mp in sorted(results_dict.keys()):
        mean = results_dict[mp][0]/mse_base
        std = results_dict[mp][1]/mse_base
        ci = 1.96 * std  # 95% confidence interval
        print(f"For {int(mp*100)}% missing: MSE = {mean:.5f} ± {ci:.5f} (95% CI)")

20.0% missing values from 8 column, ~16.900000000000002% of data remain
[0.4282837773584294, 0.4388100680910673, 0.43310207510821597, 0.4307078500256206, 0.4317484762200053, 0.428199106308657, 0.4389581645959829, 0.42833666372588075, 0.4319093371565851, 0.42916451975164616, 0.43408863311980117, 0.4341204069586203, 0.43903227307939113, 0.43014731204391266, 0.42765344045362913]


In [410]:
mps = [0.15,0.175, 0.20,0.225, 0.25, 0.275, 0.30]
del_spectrum = {}
for mp in mps:
    results = method_pipe(
    method=deletion_method,
    X=X,
    y=y,
    model=LinearRegression(),
    cv=50,
    mp=mp  ,
    repeats=100,
    outer_repeats=15
    )
    del_spectrum[mp] = [np.mean(results), np.std(results)]

In [413]:
print_results(name = "Deletion Method", results_dict = del_spectrum)



=== Deletion Method ===
For 15% missing: MSE = 0.80075 ± 0.01097 (95% CI)
For 17% missing: MSE = 0.79510 ± 0.01087 (95% CI)
For 20% missing: MSE = 0.78538 ± 0.01725 (95% CI)
For 22% missing: MSE = 0.77894 ± 0.01817 (95% CI)
For 25% missing: MSE = 0.76590 ± 0.02007 (95% CI)
For 27% missing: MSE = 0.75143 ± 0.01898 (95% CI)
For 30% missing: MSE = 0.73216 ± 0.01842 (95% CI)


In [408]:
import numpy as np
results = [0.78465, 0.78562, 0.7821, 0.79894, 0.78495,
           0.78557, 0.78756, 0.77896, 0.7809, 0.78123,
           0.78493, 0.79469, 0.77427, 0.78404, 0.79616]

mean_score = np.mean(results)
std_dev = np.std(results)
ci_low = mean_score - 1.96 * std_dev
ci_high = mean_score + 1.96 * std_dev

pm = ci_high-ci_low
print(pm)

print(f"Mean Normalized MSE (Deletion @ 20%): {mean_score:.5f}")
print(f"95% CI: [{ci_low:.5f}, {ci_high:.5f}]")

0.024916350529476272
Mean Normalized MSE (Deletion @ 20%): 0.78564
95% CI: [0.77318, 0.79810]
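The interval above uses ±1.96 standard deviations, which describes the spread of individual runs rather than the uncertainty of the mean. A confidence interval for the mean would divide the standard deviation by the square root of the number of runs. The sketch below contrasts the two using the same 15 values. It assumes a normal approximation and uses the sample standard deviation (`ddof=1`), both of which are choices made here and not taken from the notebook.

```python
import numpy as np

# Normalised MSE values copied from the cell above.
results = np.array([0.78465, 0.78562, 0.7821, 0.79894, 0.78495,
                    0.78557, 0.78756, 0.77896, 0.7809, 0.78123,
                    0.78493, 0.79469, 0.77427, 0.78404, 0.79616])

n = len(results)
mean_score = results.mean()
std_dev = results.std(ddof=1)        # sample standard deviation
std_err = std_dev / np.sqrt(n)       # standard error of the mean

spread_half_width = 1.96 * std_dev   # covers roughly 95% of single runs
ci_half_width = 1.96 * std_err       # roughly 95% interval for the mean itself

print(f"mean = {mean_score:.5f}")
print(f"+/- {spread_half_width:.5f} (spread of single runs)")
print(f"+/- {ci_half_width:.5f} (normal-approximation CI of the mean)")
```

With only 15 runs, a Student-t quantile would give a slightly wider interval than the 1.96 used here.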


Because the entries to delete are chosen at random, each run gives a different result, so we need repeated samples to obtain stable statistics.
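A small simulation can illustrate why repeated sampling matters and why the median is a reasonable summary when a few runs produce very large errors. The ratios below are purely synthetic: the mixture of values near 1 with occasional large outliers, and the hypothetical helper `simulate_ratios`, are assumptions chosen only to mimic a heavy-tailed outcome.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ratios(n_trials):
    """Synthetic normalised-MSE ratios: mostly near 1, occasionally very large.
    The mixture weights and ranges are illustrative assumptions."""
    ratios = rng.normal(loc=1.0, scale=0.02, size=n_trials)
    is_outlier = rng.random(n_trials) < 0.15
    ratios[is_outlier] = rng.uniform(2.0, 70.0, size=int(is_outlier.sum()))
    return ratios

# A single batch: the mean is pulled up by outliers, the median is not.
one_batch = simulate_ratios(1000)
print(f"mean = {one_batch.mean():.2f}, median = {np.median(one_batch):.3f}")

# Spread of the batch median for different numbers of trials per batch.
spread = {}
for n_trials in (50, 1000):
    medians = [np.median(simulate_ratios(n_trials)) for _ in range(200)]
    spread[n_trials] = float(np.std(medians))
    print(f"{n_trials} trials per batch: std of batch medians = {spread[n_trials]:.4f}")
```

More trials per batch shrink the run-to-run variation of the median, which is the effect the sample-size experiments in this section explore.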

### Gather Statistics

In [96]:
model = LinearRegression()
mp = 0.20
samples = 100


def deletion_method(model = model, missing_percentage = 0.2, samples = samples):
    mses = []
    for i in range(samples):
        X_train_testing = X_train.copy()
        uc_X_train = unclean_df(df = X_train_testing, missing_percentage=missing_percentage)
        nona_X_train, nona_y_train = remove_na(df_X = uc_X_train, df_y = y_train)
        model.fit(nona_X_train, nona_y_train)
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        mses.append(mse)
    av = np.median(mses)
    return av
array = []
for i in range(15):
    res = deletion_method(model = model, missing_percentage = mp, samples = samples)
    print(f"Out of {samples} trials, the MSE is {res:.3f}")
    array.append(res)
print(np.array(array).std())

Out of 100 trials, the MSE is 0.572
Out of 100 trials, the MSE is 0.567
Out of 100 trials, the MSE is 0.573
Out of 100 trials, the MSE is 0.566
Out of 100 trials, the MSE is 0.577
Out of 100 trials, the MSE is 0.568
Out of 100 trials, the MSE is 0.567
Out of 100 trials, the MSE is 0.568
Out of 100 trials, the MSE is 0.570
Out of 100 trials, the MSE is 0.569
Out of 100 trials, the MSE is 0.569
Out of 100 trials, the MSE is 0.572
Out of 100 trials, the MSE is 0.573
Out of 100 trials, the MSE is 0.570
Out of 100 trials, the MSE is 0.571
0.002780457207882115


We can see that the MSE is stable across repeats: the standard deviation of the 15 medians is about 0.003.

33.7

In [105]:
model = LinearRegression()
samples = [50,100,200,300,500,750,1000,1500]
dat = {}
for s in samples:
    array = []
    for i in range(15):
        res = deletion_method(model = model, missing_percentage = mp, samples = s)
        #print(f"Out of {samples} trials, the MSE is {res:.3f}")
        array.append(res)
    dat[s] = [np.array(array).mean(), np.array(array).std()]
    print(s,np.array(array).mean()/np.array(array).std())
        

50 32.7081565674406
100 60.92481409370882
200 95.8776794698639
300 111.81757160634179
500 213.26987246494366
750 162.67796828584923
1000 200.33957258642906
1500 272.8709671503992


In [109]:
print(dat[50])

[0.5503443099442408, 0.016825904230019558]


In [20]:
medians = []
for j in range(10):
    model = LinearRegression()
    mse_array = []
    mean_array = []
    median_array = []
    cv_array = []
    std_array = []
    for i in range(5000):
        X_train_testing = X_train.copy()
        uc_X_train = unclean_df(df = X_train_testing, missing_percentage=0.2)
        nona_X_train, nona_y_train = remove_na(df_X = uc_X_train, df_y = y_train)
        model.fit(nona_X_train, nona_y_train)
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        
        mse_array.append(mse/mse_base)
        #mean = np.mean(mse_array)
        #std = np.std(mse_array)
        #mean_array.append(mean)
        median_array.append(np.median(mse_array))
        #std_array.append(std)
        #cv_array.append(std/mean*100)
    medians.append(median_array[-1])
    print(np.mean(medians), np.median(medians))
    print(j, median_array[-1])
            

3.327880245275264 3.327880245275264
0 3.327880245275264
3.394586564174972 3.394586564174972
1 3.4612928830746803
3.382254254481424 3.357589635094329
2 3.357589635094329
3.35542593942453 3.3427349401847968
3 3.274940994253848
3.355446914527208 3.3555308149379215
4 3.3555308149379215
3.344303880474495 3.3417055301065925
5 3.288588710210929
3.331307835510668 3.327880245275264
6 3.253331565727706
3.3527220467343994 3.3417055301065925
7 3.5026215253005164
3.340657831551241 3.327880245275264
8 3.2441441100859736
3.3360222596815077 3.3110911790645856
9 3.2943021128539067


In [22]:
#10x5000 50k
values = [3.327880245275264, 3.4612928830746803, 3.357589635094329, 3.274940994253848, 3.3555308149379215, 3.288588710210929, 3.253331565727706, 3.5026215253005164, 3.2441441100859736, 3.2943021128539067]
print(np.mean(values))
print(np.std(values))
print(values)


3.3360222596815077
0.08207365636529661
[3.327880245275264, 3.4612928830746803, 3.357589635094329, 3.274940994253848, 3.3555308149379215, 3.288588710210929, 3.253331565727706, 3.5026215253005164, 3.2441441100859736, 3.2943021128539067]


In [24]:
#10x2000 20k
values = [3.4376347158664244, 3.171580983297706, 3.5011636762403775, 3.2619653853847783, 3.2951237407960523, 3.2376586627135233, 3.595510072930888, 3.3197196678389655, 3.2665110438313443, 3.3733587260065043]
print(np.mean(values))
print(np.std(values))
print(values)


3.3460226674906566
0.1242158032573488
[3.4376347158664244, 3.171580983297706, 3.5011636762403775, 3.2619653853847783, 3.2951237407960523, 3.2376586627135233, 3.595510072930888, 3.3197196678389655, 3.2665110438313443, 3.3733587260065043]


In [25]:
#50x300 15k
values = [3.2449768334294173, 3.377668911564963, 3.745694854483121, 3.341872831103161, 3.389833225094573, 3.287254459614655, 3.818407202444228, 3.5392590731504185, 3.6763509551129125, 2.8727369944420493, 3.235944502883862, 3.1782726746006698, 1.5811353709152407, 3.1159598262266286, 3.6643917157158725, 3.3312595691194176, 3.833441483607693, 3.1796581847538516, 3.18872722193215, 3.6525409188193496, 3.090049444337705, 3.172172587913545, 3.6722327693047676, 3.058631987257057, 3.4323437364418057, 3.0661657375976734, 3.245569690885966, 2.983057292865514, 3.354759629637024, 2.979685249144371, 3.3192152916536575, 3.2608384349224346, 3.5521544595997554, 3.4562735156590363, 3.3966145545351907, 2.8226999678510585, 2.533600642112428, 4.37761281956724, 3.3947439904520476, 2.7462787794329895, 3.999552075156877, 2.89433793654894, 3.658914578453439, 2.7422175004091844, 2.927537488858441, 2.9307012390427873, 3.95293636406072, 3.369552676138811, 3.687587361028585, 2.8913393673724577]
print(np.mean(values))
print(np.std(values))
print(values)

3.2844952795451148
0.4354236070894812
[3.2449768334294173, 3.377668911564963, 3.745694854483121, 3.341872831103161, 3.389833225094573, 3.287254459614655, 3.818407202444228, 3.5392590731504185, 3.6763509551129125, 2.8727369944420493, 3.235944502883862, 3.1782726746006698, 1.5811353709152407, 3.1159598262266286, 3.6643917157158725, 3.3312595691194176, 3.833441483607693, 3.1796581847538516, 3.18872722193215, 3.6525409188193496, 3.090049444337705, 3.172172587913545, 3.6722327693047676, 3.058631987257057, 3.4323437364418057, 3.0661657375976734, 3.245569690885966, 2.983057292865514, 3.354759629637024, 2.979685249144371, 3.3192152916536575, 3.2608384349224346, 3.5521544595997554, 3.4562735156590363, 3.3966145545351907, 2.8226999678510585, 2.533600642112428, 4.37761281956724, 3.3947439904520476, 2.7462787794329895, 3.999552075156877, 2.89433793654894, 3.658914578453439, 2.7422175004091844, 2.927537488858441, 2.9307012390427873, 3.95293636406072, 3.369552676138811, 3.687587361028585, 2.89133936

In [26]:
#20x1000 20k
values = [3.4394095192618535, 3.615864327348639, 3.6269170871708143, 3.20048570812129, 3.089724308665157, 3.287636004301067, 3.1600876264977824, 3.4483581993730166, 3.460831276114738, 3.378088654590651, 3.0209822458663855, 3.484882803002507, 3.473739563649847, 3.0655851820711066, 3.291860137056357, 3.20622900900359, 2.9926461227776606, 3.0838926170577414, 3.599345156725983, 2.9460502951682805]
print(np.mean(values))
print(np.std(values))
print(values)

3.293630792191223
0.21391295042534003
[3.4394095192618535, 3.615864327348639, 3.6269170871708143, 3.20048570812129, 3.089724308665157, 3.287636004301067, 3.1600876264977824, 3.4483581993730166, 3.460831276114738, 3.378088654590651, 3.0209822458663855, 3.484882803002507, 3.473739563649847, 3.0655851820711066, 3.291860137056357, 3.20622900900359, 2.9926461227776606, 3.0838926170577414, 3.599345156725983, 2.9460502951682805]


In [27]:
#5x5000 25k
values = [3.201568069061337, 3.213140384398362, 3.236317727146057, 3.2629776937396846, 3.3854302160962506]
print(np.mean(values))
print(np.std(values))
print(values)

3.259886818088338
0.066200462074394
[3.201568069061337, 3.213140384398362, 3.236317727146057, 3.2629776937396846, 3.3854302160962506]


In [161]:
#round 1
print(mean_array[-1])
print(median_array[-1])
print(np.mean(mean_array))
print(np.median(mean_array))
print(np.mean(median_array))
print(np.median(median_array))

14.265114700639796
3.493093347290814
14.30840334135783
14.158984460021559
3.5872695623261657
3.4886314355636876


In [168]:
#round 2
print(mean_array[-1])
print(median_array[-1])
print(np.mean(mean_array))
print(np.median(mean_array))
print(np.mean(median_array))
print(np.median(median_array))

13.598078264284128
3.205806404836883
13.868958751532391
13.756609745222967
3.4117068267330555
3.278013557416217


In [170]:
#round 3
print(mean_array[-1])
print(median_array[-1])
print(np.mean(mean_array))
print(np.median(mean_array))
print(np.mean(median_array))
print(np.median(median_array))

13.60159992285426
3.2218040556590966
13.56899209077242
13.529322994295768
3.2108811678677562
3.185513047708655


In [172]:
#round 4
print(mean_array[-1])
print(median_array[-1])
print(np.mean(mean_array))
print(np.median(mean_array))
print(np.mean(median_array))
print(np.median(median_array))

14.164656170938967
3.2806866693939147
14.200324169850548
14.133510724396526
3.2565711170204676
3.2613582924655566


In [174]:
#round 5
print(mean_array[-1])
print(median_array[-1])
print(np.mean(mean_array))
print(np.median(mean_array))
print(np.mean(median_array))
print(np.median(median_array))

14.03855394287025
3.2939291402335797
14.23975325750225
14.16707137970554
3.31618930839682
3.3305067336027134


In [156]:
print(np.std(mean_array))
print(np.std(median_array))

0.06752498670066134
0.11583670680915746


In [1]:
import plotly.graph_objects as go

# Sample data for demonstration
x = list(range(len(mean_array)))

fig = go.Figure()

# Define the first trace
fig.add_trace(go.Scatter(x=x,y=mean_array,mode='markers',name='mean'))
fig.add_trace(go.Scatter(x=x,y=median_array,mode='markers',name='median'))
fig.add_trace(go.Scatter(x=x,y=std_array,mode='markers',name='std'))
fig.add_trace(go.Scatter(x=x,y=cv_array,mode='markers',name='cv'))
fig.add_trace(go.Scatter(x=x,y=mse_array,mode='markers',name='mse'))
# Define the second

# Update layout if needed
fig.update_layout(
    title='Scatter Plot with Multiple Traces',
    xaxis_title='X Axis',
    yaxis_title='Y Axis'
)

# Show the plot
fig.show()

NameError: name 'mean_array' is not defined

In [13]:
import plotly.graph_objects as go
import numpy as np

# Histogram of the per-run median normalised MSE values (medians)
fig = go.Figure(data=[go.Histogram(x=medians)])

# Update layout
fig.update_layout(
    title="Distribution of Array",
    xaxis_title="Value",
    yaxis_title="Frequency"
)

# Show the plot
fig.show()


Fill the NaN values in the DataFrame with column means and compare the result against deletion.
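Mean imputation keeps every row, but the means should come from the training data only so that no test information leaks in. The sketch below shows the idea on a toy frame. The column names, sizes and the 20% missing rate are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy training frame (a hypothetical stand-in for X_train).
train = pd.DataFrame(
    rng.normal(loc=[5.0, 50.0], scale=[1.0, 10.0], size=(1000, 2)),
    columns=["a", "b"],
)

# Inject roughly 20% MCAR NaNs into every column.
train_missing = train.mask(rng.random(train.shape) < 0.2)

# Column means of the observed training values (NaNs are skipped by default).
train_means = train_missing.mean()

# Fill each NaN with the mean of its own column.
filled = train_missing.fillna(train_means)

print("NaNs left:", int(filled.isna().sum().sum()))
print("std of observed values:", train_missing.std().round(3).to_dict())
print("std after imputation:  ", filled.std().round(3).to_dict())
```

The column means are preserved, but the standard deviation shrinks because every filled entry sits exactly at the mean. This is a known side effect of mean imputation, and it may matter when comparing it with deletion.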

In [36]:
import random
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random.randint(1,1000))
model_type = LinearRegression()

mps, nonas, fills = [],[],[]
for mp in np.arange(0.28, 0.31, 0.0025):
    nona_mse = []
    fill_mse = []
    mp = np.round(mp,5)
    print(mp)
    for i in range(1500):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random.randint(1,1000))
        model = model_type
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mse_base = mean_squared_error(y_test, y_pred)

        X_train_testing = X_train.copy()
        uc_X_train = unclean_df(df = X_train_testing, missing_percentage=mp)
        nona_model = model_type
        nona_X_train, nona_y_train = remove_na(df_X = uc_X_train, df_y = y_train)
        #print('You have:', round((len(nona_X_train)/len(X_train_testing))*100,3),"% of your data")
        nona_model.fit(nona_X_train, nona_y_train)
        nona_y_pred = nona_model.predict(X_test)
        nona_mse.append(mean_squared_error(y_test, nona_y_pred)/mse_base)

        fill_model = model_type
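        # Use column means of the observed training values only, so that no test
        # data or target values leak into the imputation.
        fill_type = uc_X_train.mean()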
        fill_X_train = uc_X_train.fillna(fill_type)
        fill_model.fit(fill_X_train, y_train)
        fill_y_pred = fill_model.predict(X_test)
        fill_mse.append(mean_squared_error(y_test, fill_y_pred)/mse_base)
    mps.append(mp)
    nonas.append(np.median(nona_mse))
    fills.append(np.median(fill_mse))



0.28
0.2825
0.285
0.2875
0.29
0.2925
0.295
0.2975
0.3
0.3025
0.305
0.3075


In [37]:
import plotly.graph_objects as go

fig = go.Figure()

# Define the first trace
fig.add_trace(go.Scatter(x=mps,y=nonas,mode='lines+markers',name='No Nans'))
fig.add_trace(go.Scatter(x=mps,y=fills,mode='lines+markers',name='Mean Fills'))
# Define the second

# Update layout if needed
fig.update_layout(
    title='Linear regression, median of 1500 trials per missing percentage',
    xaxis_title='Missing percentage data',
    yaxis_title='Median MSE/Full MSE',
    yaxis_range=[1, 1.4],
    xaxis_range = [0.28, 0.31]
)

# Show the plot
fig.show()

In [110]:
print('You have:', round((len(nona_X_train)/len(X_train_testing))*100,3),"% of your data")
print('Nona:', np.median(nona_mse))
print('Fill:', np.median(fill_mse))
print(nona_mse)

You have: 16.527 % of your data
Nona: 1.0164830877270417
Fill: 1.1507339175911555
[1.4338394572645514, 1.0084033866528856, 3.8777993030515465, 67.43740336298035, 1.3651462902118128, 1.0012994980640373, 1.6737457081288933, 62.371403935065864, 15.319578785157807, 4.41476315488352, 0.9953384651031596, 0.964375013168074, 1.006158141396367, 1.795827648159923, 32.41431758253976, 12.54508300623257, 0.9769849248977207, 45.22404204285804, 10.106446963735428, 3.34911102836339, 1.0655257249059116, 1.5061013018719103, 0.9883054062995351, 0.9982524356511878, 62.5349177644916, 3.37597898765677, 1.0096060638684279, 0.9972484566170596, 1.0085114661596117, 1.007331162516382, 9.71258254925982, 1.0653568873392307, 1.0228546780665801, 1.0088606925591395, 1.497315534098148, 0.9554761354957103, 7.719024375906334, 10.33843838921872, 47.73969175425282, 0.9903908605238381, 10.619887151085353, 75.69912917782605, 2.9850466837005367, 10.40567671573482, 1.0033744212938183, 1.0422151465149376, 11.792131726489623, 0

In [108]:
print('You have:', round((len(nona_X_train)/len(X_train_testing))*100,3),"% of your data")
print('Nona:', np.median(nona_mse))
print('Fill:', np.median(fill_mse))
print(nona_mse)

You have: 16.697 % of your data
Nona: 1.0152770533601854
Fill: 1.1503015409322126
[1.1284562206081128, 1.0045278619072013, 0.9944357279280278, 13.51671457629931, 1.5388361918549907, 0.9952093709647705, 0.9927385203495255, 0.9664621219625419, 1.0109075896676352, 0.9829729631015001, 0.9849892188887295, 0.9751595795955345, 25.087576993283953, 1.3831655015493338, 0.9967907118810151, 0.9528521663377012, 15.364672082863933, 1.0577413240403597, 0.9735382586818185, 1.358481685124242, 1.1430176179325275, 1.029658168655174, 2.2065498991166543, 8.183158561337043, 0.9809754318127785, 0.9958220459888285, 2.4420775989129817, 1.0129972833316467, 31.142658945284857, 0.9896084651079073, 0.9658750440452193, 1.0019449784082348, 1.1499815512051657, 0.962132555651637, 1.0093300115059658, 63.43166012515986, 9.032845059075344, 1.0101916083317195, 1.0422009523293523, 0.9958783705004005, 1.0045557186701541, 9.098214873412761, 0.9775397848436291, 0.989236095034933, 0.9661736135935886, 45.79596736098328, 0.98978

In [112]:
print('You have:', round((len(nona_X_train)/len(X_train_testing))*100,3),"% of your data")
print('Nona:', np.median(nona_mse))
print('Fill:', np.median(fill_mse))
print(nona_mse)

You have: 16.885 % of your data
Nona: 1.0171564111889095
Fill: 1.1497994684683723
[1.0153063411751335, 1.0003356897623972, 0.9870143004152432, 0.9724153170442432, 77.00232975515026, 0.9981804658668846, 0.996514013895072, 0.909535451423761, 0.993050352988959, 1.6760941710405746, 3.5977400877447754, 12.672545771929741, 1.005588918679266, 1.032899509248033, 1.0098911147457061, 1.0021054681824548, 0.9612813458074355, 1.4417457330626775, 1.0037448826427189, 0.9958770881953729, 14.890539563705987, 1.1046786201094854, 0.9554040536314846, 0.9983931823138797, 15.387131613367497, 1.0062337124830574, 11.79025248324135, 63.981038848567714, 1.0010655952016188, 0.9571201328012523, 0.9888868667613974, 5.31214325204129, 0.9971375928116498, 11.987217868189848, 4.0760287877287675, 1.0033064202479032, 0.9823690466169077, 0.9127843922819009, 3.444725172373367, 4.981404760725319, 0.9871909594900499, 46.33904784537466, 0.996714301799808, 54.6030546471981, 2.8062668853267474, 1.0016049057151812, 49.047042745

In [114]:
print('You have:', round((len(nona_X_train)/len(X_train_testing))*100,3),"% of your data")
print('Nona:', np.median(nona_mse))
print('Fill:', np.median(fill_mse))
print(nona_mse)

You have: 16.715 % of your data
Nona: 1.0147253118142427
Fill: 1.1494705198754365
[1.1514877180312508, 0.9777755125150416, 22.110328938495762, 8.795010678183294, 1.9336221949850294, 7.777255116987351, 12.5967167126659, 10.720197225948985, 1.788954567698377, 0.9592063316066775, 0.964224441626847, 0.959165379499256, 0.9847300485642878, 56.967840229008885, 2.622708115193513, 57.104685561316764, 0.9849785296696245, 0.9954307834740056, 1.0048810182108936, 1.0142985034566827, 1.0090579222587759, 4.4649876245684466, 1.0002299008534374, 0.9850312794659031, 1.0105471033948001, 1.0010973605528393, 1.0024387788616607, 0.9957590232415883, 3.131937555020995, 1.8188085490057009, 12.337228072643379, 3.1982340729212724, 3.6531584028636805, 1.0013354113584136, 2.0760680770168314, 0.998533836810119, 6.500176663981946, 9.796498213946833, 1.0364830874448454, 61.7083989183134, 0.9805425752863047, 0.9940344354052046, 9.96295250447026, 1.0347589180028913, 1.0116691474600592, 1.0037519429869444, 1.02181647206