<a href="https://colab.research.google.com/github/cprachaseree/shl_datascience/blob/main/SHL_data_science.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data wrangling and library implementation

You are given a matrix A. Transform it into a new matrix B such that its ith column has mean i and variance i.
Note: i equals 1 for the first column and increases sequentially.
The input to the function editMatrix shall be a matrix A. Return the transformed matrix B.
The test cases tab illustrates some examples.

In [5]:
import numpy as np
def editMatrix(A):
    B = np.zeros(A.shape)
    for j in range(A.shape[1]): # for each column
        col_mean = A[:, j].mean()
        col_std = A[:, j].std()
        # standardize, then scale to standard deviation sqrt(j + 1) so the variance is j + 1
        B[:, j] = np.sqrt(j + 1) * (A[:, j] - col_mean) / col_std + (j + 1)
    return B

In [6]:
# test: column j of B (counting from 1) should have mean j and variance j
for i in range(1):
    A = np.random.rand(4, 4)
    B = editMatrix(A)
    print(A)
    print(np.round(B.mean(axis=0), 6))
    print(np.round(B.var(axis=0), 6))

[[0.5153917  0.42142499 0.42704058 0.81997677]
 [0.59408262 0.47548337 0.30943911 0.34900357]
 [0.70545768 0.89041964 0.74888284 0.08761036]
 [0.85918967 0.73615281 0.64558016 0.44039621]]
[1. 2. 3. 4.]
[1. 2. 3. 4.]
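
The requirement of mean i and variance i means each standardized column has to be scaled by its target standard deviation, which is the square root of i, before the mean is shifted to i. The following is a minimal vectorized sketch of that idea; the name `edit_matrix_vectorized` is hypothetical, and it assumes the population variance (NumPy's default `ddof=0`) and no constant columns.

```python
import numpy as np

def edit_matrix_vectorized(A):
    """Return B whose column i (1-based) has mean i and population variance i."""
    A = np.asarray(A, dtype=float)
    i = np.arange(1, A.shape[1] + 1)          # target mean and variance per column
    z = (A - A.mean(axis=0)) / A.std(axis=0)  # standardize: mean 0, variance 1
    return np.sqrt(i) * z + i                 # scale std to sqrt(i), shift mean to i

rng = np.random.default_rng(0)
B = edit_matrix_vectorized(rng.random((5, 3)))
print(np.round(B.mean(axis=0), 6))  # column means
print(np.round(B.var(axis=0), 6))   # column variances
```

Both printed rows should be close to 1, 2, 3. Broadcasting handles all columns at once, so no explicit loop is needed.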


Emma has to implement a classification algorithm for a 2-class problem. For each sample, her algorithm outputs the
probability of the sample belonging to a particular class. She wants to evaluate the log-loss error of the predictions
made by her algorithm.

In [18]:
def log_loss_sklearn(y, y_pred):
    from sklearn.metrics import log_loss
    return log_loss(y, y_pred)

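def log_loss(y_true, y_pred):
    # y_pred[i] = [P(class 0), P(class 1)] for sample i
    loss = 0
    for y_t, y_p in zip(y_true, y_pred):
        # add the log-probability assigned to the true class
        loss += (y_t * np.log(y_p[1]) + (1-y_t) * np.log(y_p[0]))
    return - loss / len(y_true)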

y_true = [0, 0, 1, 1]
y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
res_log_loss_sklearn =  log_loss_sklearn(y_true, y_pred)
print(res_log_loss_sklearn)
res_log_loss =  log_loss(y_true, y_pred)
print(res_log_loss)

0.1738073366910675
0.1738073366910675
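
A predicted probability of exactly 0 or 1 makes the logarithm infinite, so a hand-written log-loss usually clips the probabilities first. The sketch below is one possible vectorized variant, not a library API: the name `binary_log_loss`, the choice to pass only the class-1 probability, and the clipping constant `eps` are all assumptions made for illustration.

```python
import numpy as np

def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels and class-1 probabilities."""
    y = np.asarray(y_true, dtype=float)
    # keep probabilities strictly inside (0, 1) so the logs stay finite
    p = np.clip(np.asarray(p_pos, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_true = [0, 0, 1, 1]
p_pos = [0.1, 0.2, 0.7, 0.99]  # second column of the y_pred list used above
print(binary_log_loss(y_true, p_pos))
```

On this data the result should agree with the two values printed above to many decimal places, since no probability is near enough to 0 or 1 for clipping to matter.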


Emma, a data scientist, wishes to build a spam classifier that can classify emails as spam or non-spam. When given an
input email as a vector, the classifier returns the probability of the email being spam. Emma decides to apply a threshold
to the probability values: an email is classified as spam when its probability is at or above the threshold. She makes a set of
threshold values and calculates precision and recall for each, in order to gauge the performance at various thresholds.

$$\mathbf{Precision=\frac{t_p}{t_p+f_p}}$$

$$\mathbf{Recall=\frac{t_p}{t_p+f_n}}$$


In [25]:
# IMPORT LIBRARY PACKAGES NEEDED BY YOUR PROGRAM
# SOME CLASSES WITHIN A PACKAGE MAY BE RESTRICTED
# DEFINE ANY CLASS AND METHOD NEEDED
# THIS FUNCTION IS REQUIRED
#
# Parameters: y: ndarray, shape (n_samples,1)
# y_pred: ndarray, shape (n_samples,1)
# Thres: list
#
# Returns: score: list
#
def computePrecisionRecall_1(y,y_pred,Thres):
    # INSERT YOUR CODE HERE
    import numpy as np
    score = np.zeros((len(Thres), 2))
    for i, thres in enumerate(Thres):
        y_pred_bool = np.zeros(y_pred.shape)
        y_pred_bool[y_pred >= thres] = 1
        y_pred_bool[y_pred < thres] = 0
        tp = np.sum((y == 1) & (y_pred_bool == 1))
        fp = np.sum((y == 0) & (y_pred_bool == 1))
        fn = np.sum((y == 1) & (y_pred_bool == 0))
        score[i, 0] = tp / (tp + fp)
        score[i, 1] = tp / (tp + fn)
    return score

def computePrecisionRecall_2(y,y_pred,Thres):
    from sklearn.metrics import precision_score, recall_score
    import numpy as np
    score = np.zeros((len(Thres), 2))

    for i, thres in enumerate(Thres):
        y_pred_bool = np.zeros(y_pred.shape)
        y_pred_bool[y_pred >= thres] = 1
        y_pred_bool[y_pred < thres] = 0

        score[i, 0] = precision_score(y, y_pred_bool)
        score[i, 1] = recall_score(y, y_pred_bool)
    return score

y_true = np.array([0, 0, 1, 0, 1, 1])
y_pred = np.array([0.1, 0.6, 0.8, 0.3, 0.9, 0.2])
Thres = [0.3, 0.5, 0.8]
score1 = computePrecisionRecall_1(y_true,y_pred,Thres)
score2 = computePrecisionRecall_2(y_true,y_pred,Thres)
print(score1)
print(score2)

[[0.5        0.66666667]
 [0.66666667 0.66666667]
 [1.         0.66666667]]
[[0.5        0.66666667]
 [0.66666667 0.66666667]
 [1.         0.66666667]]
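
When a threshold is high enough that no email is predicted as spam, the precision denominator t_p + f_p is zero and the plain ratio is undefined. The following sketch guards both ratios; the function name `precision_recall_at` and the `zero_value` parameter are hypothetical, and returning 0 in the undefined case is a convention chosen here, not something the problem statement specifies.

```python
import numpy as np

def precision_recall_at(y_true, y_score, thresholds, zero_value=0.0):
    """Rows of [precision, recall], one row per threshold."""
    y_true = np.asarray(y_true).ravel().astype(bool)
    y_score = np.asarray(y_score, dtype=float).ravel()
    scores = np.zeros((len(thresholds), 2))
    for i, t in enumerate(thresholds):
        predicted = y_score >= t            # spam if probability is at or above t
        tp = np.sum(predicted & y_true)     # true positives
        fp = np.sum(predicted & ~y_true)    # false positives
        fn = np.sum(~predicted & y_true)    # false negatives
        scores[i, 0] = tp / (tp + fp) if tp + fp > 0 else zero_value
        scores[i, 1] = tp / (tp + fn) if tp + fn > 0 else zero_value
    return scores

y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.6, 0.8, 0.3, 0.9, 0.2])
# the last threshold is above every score, so nothing is predicted as spam
print(precision_recall_at(y_true, y_score, [0.3, 0.5, 0.8, 0.95]))
```

The first three rows should match the scores printed above; the extra threshold 0.95 exercises the guarded branch.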


When building a predictive model, feature selection is an important step. You wish to find the best set of features in a
data set. To know whether a feature is informative or not, you need to build a linear regression model using one
feature at a time and compare its F-statistic with those of the other features.
Given a number k, identify the k best features in the data set provided to you.
The inputs to the function fstatFeatures shall be the input matrix X, response matrix y, and an integer k. The function
must return an array of booleans wherein the ith element is 1 if the ith feature is among the top k features. Otherwise
the ith element is 0.
The test cases tab illustrates some examples.

In [None]:
# IMPORT LIBRARY PACKAGES NEEDED BY YOUR PROGRAM
# SOME CLASSES WITHIN A PACKAGE MAY BE RESTRICTED
# DEFINE ANY CLASS AND METHOD NEEDED
# THIS FUNCTION IS REQUIRED
#
# Parameters: X: ndarray shape (n_samples,n_features)
# y: ndarray shape(n_samples,1)
# k: integer
#
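# Returns: res: ndarray shape (n_features,), 1 for the k best features and 0 otherwise
#
def fstatFeatures(X,y,k):
    #INSERT YOUR CODE HERE
    import numpy as np
    from sklearn.feature_selection import f_regression
    # f_regression expects a 1-D target, so flatten the (n_samples,1) column
    f_statistic, p = f_regression(X, np.ravel(y))
    # indices of the k largest F-statistics
    n = (-f_statistic).argsort()[:k]
    res = np.array([0]*X.shape[1])
    for i in n:
        res[i] = 1
    return res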
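
For a regression on a single feature with an intercept, the F-statistic can be written in terms of the Pearson correlation r between that feature and the response: F = r² / (1 − r²) · (n − 2). The sketch below computes this per feature with NumPy only, as a way to see what is being ranked; the names `f_stat_per_feature` and `top_k_mask` and the small data set are made up for illustration, and the code assumes no feature is constant or perfectly correlated with y.

```python
import numpy as np

def f_stat_per_feature(X, y):
    """F-statistic of a one-feature linear regression, for each column of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                     # center features
    yc = y - y.mean()                           # center response
    r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return r ** 2 / (1.0 - r ** 2) * (n - 2)    # n - 2 residual degrees of freedom

def top_k_mask(X, y, k):
    """0/1 array marking the k features with the largest F-statistic."""
    f = f_stat_per_feature(X, y)
    mask = np.zeros(f.shape[0], dtype=int)
    mask[np.argsort(-f)[:k]] = 1
    return mask

# hypothetical data: column 0 tracks y closely, the other two much less
X = np.array([[1, 5, 0], [2, 3, 1], [3, 6, 0], [4, 2, 1], [5, 4, 0], [6, 1, 1]], dtype=float)
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 5.8])
print(f_stat_per_feature(X, y))
print(top_k_mask(X, y, 2))
```

Because F grows with r², ranking features by F-statistic is equivalent to ranking them by the absolute value of their correlation with the response.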


# ML workflow implementation


Your manager has asked you to build a prediction model using the company's marketing data. The data is a mix of
numerical as well as categorical attributes. You only need to one-hot encode (N-1 dummy variables for N categories)
the categorical attributes. You can use the numerical attributes as is.
Given an input feature dataframe and a column matrix of true responses, your task is to build a prediction model using
linear regression, after applying one-hot encoding on the categorical attributes. Calculate the Mean Squared Error
(MSE) between the true responses and the predicted responses.
The inputs to the function linearRegressionMSE shall be a dataframe X of features and a column matrix y of true
responses. The function must return the Mean Squared Error (MSE).
The test cases tab illustrates some examples.

In [None]:
def linearRegressionMSE(X,y):
    import numpy as np
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    enc = OneHotEncoder()
    nrows, ncols = X.shape
    blocks = []  # one 2-D block of model inputs per original column
    for col in X.columns:
        if X[col].dtype == object:
            # map each category to an integer code, then one-hot encode the codes
            unique, unique_inverse = np.unique(X[col], return_inverse=True)
            unique_inverse = unique_inverse.reshape(len(unique_inverse), 1)
            bits = enc.fit_transform(unique_inverse).toarray()
            # drop the last dummy column: N-1 dummy variables for N categories
            bits = bits[:, :-1]
            blocks.append(bits)
        else:
            # numerical attributes are used as is
            blocks.append(X[col].values.reshape(nrows, 1))
    preds = np.concatenate(blocks, axis=1)
    LR = LinearRegression()
    LR.fit(preds, y)
    y_pred = LR.predict(preds)
    ans = mean_squared_error(y, y_pred)
    return ans
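
The N-1 dummy encoding itself can be illustrated without scikit-learn: `np.unique(..., return_inverse=True)` gives an integer code per row, and indexing an identity matrix with those codes yields the one-hot rows. This is only a sketch of the idea; the helper name `dummy_encode` and the colour values are hypothetical, and dropping the last category (rather than the first) is an arbitrary choice.

```python
import numpy as np

def dummy_encode(values):
    """Return (dummies, categories): N-1 dummy columns for N distinct categories."""
    categories, codes = np.unique(np.asarray(values), return_inverse=True)
    codes = codes.ravel()                     # one integer code per row
    one_hot = np.eye(len(categories))[codes]  # row i is the indicator of codes[i]
    return one_hot[:, :-1], categories        # drop the last category's column

dummies, categories = dummy_encode(["red", "blue", "green", "blue"])
print(categories)  # sorted distinct categories
print(dummies)     # rows for the dropped category are all zeros
```

The dropped category becomes the baseline: its rows are all zeros, which avoids a column that is perfectly collinear with the regression intercept.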



In [1]:
'''
# index
a = np.array(['a', 'b', 'b', 'c', 'a'])
u, indices = np.unique(a, return_index=True)
u
array(['a', 'b', 'c'], dtype='<U1')
indices
array([0, 1, 3])
a[indices]
array(['a', 'b', 'c'], dtype='<U1')

# inverse
a = np.array([1, 2, 6, 4, 2, 3, 2])
u, indices = np.unique(a, return_inverse=True)
u
array([1, 2, 3, 4, 6])
indices
array([0, 1, 4, 3, 1, 2, 1])
u[indices]
array([1, 2, 6, 4, 2, 3, 2])

# counts
a = np.array([1, 2, 6, 4, 2, 3, 2])
values, counts = np.unique(a, return_counts=True)
values
array([1, 2, 3, 4, 6])
counts
array([1, 3, 1, 1, 1])
np.repeat(values, counts)
array([1, 2, 2, 2, 3, 4, 6])
'''

Glen is a data science engineer at an online assessment company. The data from a particular test consists of various
competency scores for candidates. His manager asks him to build a personality score prediction model based on the
competency scores present in the input score data. He loads the input score data in a matrix and finds that some of
the scores are missing. The missing scores are represented as NaNs. He decides to replace each NaN value with the
average value of that feature. He then builds a model using linear regression to predict the values of personality
scores. He evaluates his model by computing R-Squared statistic between actual and predicted personality scores.
In this question, you must replicate Glen’s work.
The input to the function linearRegressionWithMissingData shall be two matrices, X and y. X represents the input score
data matrix and y represents the column matrix for personality scores. The function must return the computed
R-Squared statistic between actual and predicted personality scores.
The test cases tab illustrates some examples.

In [None]:
# IMPORT LIBRARY PACKAGES NEEDED BY YOUR PROGRAM
# SOME CLASSES WITHIN A PACKAGE MAY BE RESTRICTED
# DEFINE ANY CLASS AND METHOD NEEDED
# THIS FUNCTION IS REQUIRED
#
# Parameters: X: ndarray, shape (n_samples,n_features)
# y: ndarray, shape (n_samples,1)
#
# Returns: score: float
#
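def linearRegressionWithMissingData(X,y):
    # INSERT YOUR CODE HERE
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # work on a float copy so the caller's matrix is not modified
    X = np.array(X, dtype=float)
    # per-feature mean, ignoring NaNs
    x_mean = np.nanmean(X, axis=0)

    # replace each NaN with the mean of its column
    for i in range(X.shape[1]):
        X[np.isnan(X[:, i]), i] = x_mean[i]

    model = LinearRegression()
    model = model.fit(X, y)
    # score() returns the R-squared of the fit
    score = model.score(X, y)
    return score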
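
The two steps in Glen's workflow, mean imputation and the R-Squared computation, can also be written out with NumPy alone to make the arithmetic explicit. The sketch below is illustrative: `impute_column_means` and `r_squared` are hypothetical names, the least-squares fit uses `np.linalg.lstsq` with an added intercept column, and R-Squared is computed as 1 − SS_res / SS_tot.

```python
import numpy as np

def impute_column_means(X):
    """Return a copy of X with each NaN replaced by the mean of its column."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)            # means ignoring NaNs
    return np.where(np.isnan(X), col_means, X)   # new array, input untouched

def r_squared(X, y):
    """R-squared of an ordinary least-squares fit of y on X with an intercept."""
    y = np.asarray(y, dtype=float).ravel()
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    ss_res = np.sum(residual ** 2)               # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)         # total variation
    return 1.0 - ss_res / ss_tot

# hypothetical scores with two missing entries
X = np.array([[1.0, np.nan], [2.0, 4.0], [np.nan, 6.0], [3.0, 8.0], [5.0, 7.0]])
y = np.array([3.0, 4.5, 5.0, 8.0, 9.5])
X_filled = impute_column_means(X)
print(X_filled)
print(r_squared(X_filled, y))
```

Using `np.where` avoids the explicit loop over columns and leaves the original matrix, NaNs included, unchanged.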