`Types of missing data `

MCAR : Missing completely at random
MAR: Missing at random
MNAR: Missing not at random
In order to understand the difference between the differnet types of missig data, we will try to construct them ourserlves.

`MCAR`
Data missing completely at random means that the probability of missingness in one independent variable does not depend on the other observed independent variables. In other words, the probability of missingness in an indepdent variable depends only on some external factors.

For more clarification we will attempt to construct some data with a MCAR pattern.

Let us imagine that we collect a set of 3000 entry data about some employees including: Sex, Height, Weight and Salary. We suppose that all information about sex was correctly collected (with no missing data). In a first stage, the simulated dataframe has no missing data (the missing entries will be included later).

In [24]:
import numpy as np
import pandas as pd
N = 3000
sex = np.random.choice(["Male", "Female"], N, p=[0.6, 0.4])
height = 140 + (200-140) * np.random.rand(N)
weight = 40 + (120-40) * np.random.rand(N)
salary = 30000+(80000-30000) * np.random.rand(N)
df = pd.DataFrame(data=[sex, height, weight, salary]).transpose()
df.columns = ["Sex", "Height", "weight", "salary"]

Suppose that we want to impose that Height will have a MCAR pattern. As previously said, the missingness should be included by an external factor. We suppose that this external factor is throwing a dice, if the value is equal to 6, then the record will be missing. Let us apply this for the height variable.

In [5]:
# Initialize the Dice columns
df["Dice"] = df["Sex"]
# Fill the Dice column with the probability values
df["Dice"] = np.random.choice([1, 2, 3, 4, 5, 6], N, p=[1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
# Dtermine indices where Dice=6
index = df[df["Dice"]==6].index
# Replace with NaN
df.loc[index,"Height"] = np.nan

In [5]:
from sklearn.datasets import make_classification,make_regression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0, 
    n_repeated=0,
    n_classes=4,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

In [6]:
from sklearn.cluster import DBSCAN

In [16]:
db_scan=DBSCAN(eps=0.6)
db_scan.fit(X)

DBSCAN(eps=0.6)

In [11]:
db_scan.fit_predict(X[0].reshape(-1,1))

array([ 0,  0,  0,  0,  0, -1, -1, -1,  0, -1], dtype=int64)

In [17]:
db_scan.labels_

array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1

In [30]:
from sklearn.ensemble import RandomForestClassifier

feature_names = [f"feature {i}" for i in range(X.shape[1])]
forest = RandomForestClassifier(random_state=0,oob_score=True)
forest.fit(X_train, y_train)

RandomForestClassifier(oob_score=True, random_state=0)

In [31]:
import time
start_time = time.time()
importances = forest.feature_importances_
future=[tree.feature_importances_ for tree in forest.estimators_]
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
elapsed_time = time.time() - start_time

In [32]:
importances

array([0.20944276, 0.31787234, 0.19518962, 0.04039265, 0.03860917,
       0.03406563, 0.04025513, 0.0425733 , 0.0400183 , 0.04158111])

In [33]:
l=[]
for i in range(0,10):
    l.append(future[i][0])

In [46]:
forest.oob_decision_function_,forest.oob_score_

(array([[0.12121212, 0.87878788],
        [0.94736842, 0.05263158],
        [0.08571429, 0.91428571],
        ...,
        [0.13793103, 0.86206897],
        [0.59459459, 0.40540541],
        [0.17241379, 0.82758621]]),
 0.9146666666666666)

In [37]:
len(forest.estimators_

100

In [45]:
forest.ccp_alpha

0.0

In [16]:
from sklearn.svm import SVC,SVR

In [17]:
svc=SVC(C=0.25,kernel='linear')

In [18]:
svc.fit(X_train,y_train)

SVC(C=0.25, kernel='linear')

In [19]:
pred_input=X_train[0].reshape(1,-1)

In [37]:
pred_input,y_train[0]

(array([[ 1.68252279e+00,  6.04696997e-01, -1.45709613e+00,
         -9.50376747e-01,  1.49789647e-03,  1.82697091e-01,
          1.88464997e-01,  3.97595514e-01, -6.64581115e-01,
         -1.33331408e-01]]),
 2)

In [62]:
predicted_decision=svc.decision_function(pred_input)
predicted_decision

array([[ 2.25734165, -0.29752679,  3.28955432,  0.79052905]])

In [43]:
predicted=svc.predict(pred_input)

In [40]:
from sklearn.metrics import hinge_loss

In [61]:
hinge_loss([y_train[0]],pred_decision=predicted_decision,labels=np.array([0,1,2,3]))

0.0

In [59]:
import numpy as np
from sklearn import svm
X = np.array([[0], [1], [2], [3]])
Y = np.array([0, 1, 2, 3])
labels = np.array([0, 1, 2, 3])
est = svm.LinearSVC()
est.fit(X, Y)
pred_decision = est.decision_function([[-1], [2], [3]])
y_true = [0, 2, 3]
hinge_loss(y_true, pred_decision, labels=labels)


0.5641176877140288

In [60]:
pred_decision

array([[ 1.27272366,  0.03419818, -0.68378804, -1.40168089],
       [-1.45453282, -0.58119921, -0.37605156, -0.1710036 ],
       [-2.36361831, -0.78633168, -0.27347274,  0.23922216]])

In [35]:
from box import Box

movie_box = Box({ "Robin Hood: Men in Tights": { "imdb stars": 6.7, "length": 104 } })

In [86]:
from sklearn.preprocessing import StandardScaler
rt=[23,456,667,6,6,7,8]

In [99]:
Sc = StandardScaler(with_mean=False)

In [89]:
import numpy as np
np.array(rt).reshape(-1,1)

array([[ 23],
       [456],
       [667],
       [  6],
       [  6],
       [  7],
       [  8]])

In [90]:
Sc.fit_transform(np.array(rt).reshape(-1,1))

array([[-0.56582741],
       [ 1.12885923],
       [ 1.95467651],
       [-0.63236245],
       [-0.63236245],
       [-0.62844863],
       [-0.6245348 ]])

In [92]:
Sc.fit_transform(np.array(rt).reshape(-1,1))

array([[0.090018  ],
       [1.78470464],
       [2.61052191],
       [0.02348296],
       [0.02348296],
       [0.02739678],
       [0.03131061]])

In [20]:
import pandas as pd
import numpy as np
rt_ct=[1,2,1,2,1,2,2]
np.array(rt_ct).reshape(7,1)

array([[1],
       [2],
       [1],
       [2],
       [1],
       [2],
       [2]])

In [106]:
Sc.fit_transform(np.array(rt_ct).reshape(-1,1))

array([[2.02072594],
       [4.04145188],
       [2.02072594],
       [4.04145188],
       [2.02072594],
       [4.04145188],
       [4.04145188]])

In [39]:
from sklearn.preprocessing import StandardScaler,OneHotEncoder,LabelEncoder

In [23]:
s=OneHotEncoder(drop='first').fit_transform(np.array(rt_ct).reshape(-1,1))

In [35]:
X = [['Male', 1], ['Female', 3]]

In [40]:
s=LabelEncoder()

In [42]:
s.fit_transform(X)

ValueError: y should be a 1d array, got an array of shape (2, 2) instead.

In [38]:
s.get_feature_names()

array(['x0_Male', 'x1_3'], dtype=object)

In [9]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,StratifiedShuffleSplit

In [10]:
RAW_TRAIN_DF = pd.read_csv('data/Kaggle_Training_Dataset_v2.csv')

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [10]:
CATEGORICAL_COLUMNS = [i for i in RAW_TRAIN_DF.columns if len(RAW_TRAIN_DF[i].unique())<20]
NUMERICAL_COLUMNS = [i for i in RAW_TRAIN_DF.columns if i not in CATEGORICAL_COLUMNS]

In [11]:
SimpleImputer(strategy='median').fit_transform(RAW_TRAIN_DF[NUMERICAL_COLUMNS])

array([[ 6.200e+01,  8.000e+00,  0.000e+00, ..., -9.900e+01, -9.900e+01,
         0.000e+00],
       [ 9.000e+00,  8.000e+00,  0.000e+00, ..., -9.900e+01, -9.900e+01,
         0.000e+00],
       [ 1.700e+01,  8.000e+00,  0.000e+00, ...,  9.200e-01,  9.500e-01,
         0.000e+00],
       ...,
       [ 1.000e+01,  1.200e+01,  0.000e+00, ...,  4.800e-01,  4.800e-01,
         0.000e+00],
       [ 2.913e+03,  1.200e+01,  0.000e+00, ...,  4.800e-01,  4.800e-01,
         0.000e+00],
       [ 1.500e+01,  8.000e+00,  0.000e+00, ...,  8.200e-01,  8.100e-01,
         0.000e+00]])

In [8]:
RAW_TRAIN_DF.drop(columns='sku',inplace=True)

In [47]:
from sklearn.base import BaseEstimator,TransformerMixin
class ManualFeatureEditor(BaseEstimator, TransformerMixin):

    def __init__(self):
        """
        ManualFeatureEditor Initialization
        Replacing -99.0 to np.nan
        and then filling it with median
        """
        try:
            pass
        except Exception as e:
            raise BackOrderException(e, sys) from e

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        try:
            nn = []
            for i in X.columns:
                d=np.array(X[i])
                d[d==-99.0]=X[i].median()
                d[np.isnan(d)]=X[i].median()
                nn.append(d)
                print(d)
            genrated_feature = np.c_[nn[0],nn[1]]
            print(genrated_feature)
            return genrated_feature
        except Exception as e:
            raise e

In [48]:
out_=ManualFeatureEditor().fit_transform(X=RAW_TRAIN_DF[['perf_6_month_avg','perf_12_month_avg']])

[0.82 0.82 0.92 ... 0.48 0.48  nan]
[0.81 0.81 0.95 ... 0.48 0.48  nan]
[[0.82 0.81]
 [0.82 0.81]
 [0.92 0.95]
 ...
 [0.48 0.48]
 [0.48 0.48]
 [ nan  nan]]


In [34]:
from backorder.credentials import Decrypt
from backorder.constant import *
from backorder.util.util import read_yaml_file
config_path= CREDENTIAL_FILE_PATH
config = read_yaml_file(file_path=config_path)
username = Decrypt(config['mongodb']['user_name']).get_decrypted_massage()
password = Decrypt(config['mongodb']['password']).get_decrypted_massage()

In [26]:
import pymongo
client_db = pymongo.MongoClient(f"mongodb+srv://{username}:{password}@cluster0.o1yzz.mongodb.net/?retryWrites=true&w=majority")
db=client_db['DATA']
table=db['aws']♣♣
for i in table.find():
    print(i)

{'_id': ObjectId('63298bfd4897510123a69f6b')}
{'_id': ObjectId('63298c294897510123a69f6c'), 'access_key': 'AKIA5Y2HP7EX3PJBYSEF', 'secret_access_key': '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'}


In [13]:
db=client.myclient["backorder_prediction"]
table=db["train_data"]

In [15]:
import os
import boto3
from botocore import UNSIGNED
from botocore.config import Config

In [16]:
os.environ['AWS_ACCESS_KEY_ID'] = 'AKIA5Y2HP7EX3PJBYSEF'
os.environ['AWS_SECRET_ACCESS_KEY'] = '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'

In [28]:
client = boto3.resource('s3',)
bucket=client.Bucket('backorderprediction')
for s3_file in bucket.objects.all():
    print(s3_file.key)

data/
data/dataset.zip


In [22]:
boto3.client ('s3').download_file('backorderprediction','data/dataset.zip','E:\project\dataset.zip')

In [33]:
from backorder.cloud.cloud import CloudKey
CloudKey().get_cloud_key()

AttributeError: type object 'MangoDbconnection' has no attribute 'get_records_from_collection'