In [2]:
##import libraries

import pandas as pd
import numpy as np
from sklearn import linear_model

1. Create a function to read the CSV file provided into a DataFrame. 
2. You MUST place the CSV file in the same directory/folder where your notebook is located. The method below should work without change when you give the file name "stroke-data.csv". 
3. The imported file has NA (not available) values in some columns. These rows need to be dropped, as machine learning algorithms cannot process data with missing values. Remember that when rows are dropped, some (row) indexes will be missing. 
4. The first step in processing data is to review the data types of the features (columns). 
5. Use the **pandas** attributes *columns* and *dtypes* to create a dictionary with column names as keys and the datatypes as values.
6. This function then returns the new DataFrame (df) and the dictionary (df_types), where each key-value pair represents a column name and that column's dtype. 
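
As a small illustration of the steps above, the sketch below uses a made-up DataFrame (the values are hypothetical and not taken from stroke-data.csv) to show how `dropna()` leaves a gap in the row index and how *columns* and *dtypes* can be paired into a dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the stroke dataset
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "gender": ["Male", "Female", "Male", "Female"],
})

# Row 1 has a missing bmi, so it is dropped
clean = toy.dropna()

# The surviving index labels are 0, 2, 3: label 1 is now missing
remaining_index = list(clean.index)

# Pair each column name with its dtype
toy_types = dict(zip(clean.columns, clean.dtypes))

print(remaining_index)
print(toy_types)
```

If later code relies on consecutive row numbers, calling `reset_index(drop=True)` on the cleaned DataFrame renumbers the rows from 0.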

In [168]:
def process_data(fl):
    
    # Import the CSV file (fl)
    # Your code goes here
    df2 = pd.read_csv(fl)
        
    # Drop all rows with NA values
    # Your code goes here
    df2 = df2.dropna()
    
    # Create a dictionary with keys the column names and values the type of data
    # Your code goes here
 
    # Pair each column name with its dtype (avoids positional indexing into df2.dtypes)
    df2_types = dict(zip(df2.columns, df2.dtypes))

    return df2, df2_types

{'id': dtype('int64'),
 'gender': dtype('O'),
 'age': dtype('float64'),
 'hypertension': dtype('int64'),
 'heart_disease': dtype('int64'),
 'ever_married': dtype('O'),
 'avg_glucose_level': dtype('float64'),
 'bmi': dtype('float64'),
 'smoking_status': dtype('O'),
 'stroke': dtype('int64')}

Many machine learning algorithms are designed to process numeric data and cannot natively handle categorical data. Therefore as part of the model building process, we must apply pre-processing steps to convert the data into an encoded format which the algorithms can handle.

1. In the following function you will identify and convert categorical variables to numeric data type. 
2. You will need the python *dictionary* "df2_types" of the function "process_data" we created in task 1. We can use this to identify data in a categorical (non-numeric) data format.
3. Create a list "cat_ls" of column names which are non-numeric. 
4. Process each column named in "cat_ls" separately. 
5. For a column name, say "col_name", find the *distinct* categories. For example, in column "gender" there are 2 categories "Male" and "Female". 
6. For a (categorical) column 'C' with *k* categories, *k-1* new columns are created and 'C' is replaced by these new columns. For example, the "*gender*" column will be replaced by one numerical column. The column "*smoking_status*" is to be replaced with 2 numerical columns. This process is referred to as *one-hot encoding*.
7. The encoding is done as follows. Suppose there are 3 categories "cat1", "cat2", "cat3" in column 'C'. Create 2 columns with distinct names, say "cat_level1" and "cat_level2". If the observation in a row is 'cat1', put a 1 in 'cat_level1' and a 0 in 'cat_level2' in the same row. If it is 'cat2', put 0 in 'cat_level1' and 1 in 'cat_level2'. If it is 'cat3', put 0 in both. 
8. It is simpler if the column has only 2 categories (like "gender"). It will be replaced by 1 column of 1's and 0's. 
9. The number of columns in the new DataFrame will generally be greater than in the original. For the *stroke-dataset* this number is 11. Remember to **drop** the old non-numeric columns.  
10. Depending on how you do it, the column ordering may change. This is important for identifying the output column "stroke". 
11. You may reorder the columns. Suggestion: move "stroke" to the last column in the new DataFrame. 
12. You should NOT use any feature-processing modules from **sklearn** or pandas.get_dummies() for this part. If they are used, the maximum mark for this task will not exceed 60%. 
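
The encoding rule described above can be sketched on a toy column without `pandas.get_dummies()` or sklearn. The helper name `encode_k_minus_1`, the toy values and the choice of the last (alphabetically sorted) category as the all-zeros reference level are assumptions made for illustration only.

```python
import pandas as pd

def encode_k_minus_1(df, col):
    """Replace categorical column `col` with k-1 indicator columns (hypothetical helper)."""
    categories = sorted(df[col].unique())
    out = df.drop(columns=[col])
    # The last category is the reference level: every indicator is 0 for it
    for cat in categories[:-1]:
        out[f"{col}_{cat}"] = (df[col] == cat).astype(int)
    return out

# Hypothetical toy data with 3 categories, so 2 indicator columns are expected
toy = pd.DataFrame({
    "smoking_status": ["smokes", "never smoked", "formerly smoked", "smokes"],
    "stroke": [1, 0, 0, 1],
})

encoded = encode_k_minus_1(toy, "smoking_status")

# Move the output column "stroke" to the last position
encoded = encoded[[c for c in encoded.columns if c != "stroke"] + ["stroke"]]
print(encoded)
```

Rows whose category is "smokes" get 0 in both indicator columns, which matches the 'cat3' case in step 7.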


In [112]:
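def one_hot_encode(y,df,ColumnToEncode,column_name):
    classes = np.unique(y) # sorted unique class labels
    nb_classes = len(classes) # get the number of unique classes
    standardised_labels = dict(zip(classes, np.arange(nb_classes))) # map each class label to an integer code
    targets = np.vectorize(standardised_labels.get)(y) # map the dictionary values to array.
    df_t = pd.DataFrame(np.eye(nb_classes)[targets])
    # Name the indicator columns; column_name is sorted so the names line up with the integer codes,
    # and prefixed with the source column so names from different columns cannot clash
    df_t.columns = [f"{ColumnToEncode}_{c}" for c in sorted(column_name)]
    # Keep k-1 indicator columns; the last class becomes the all-zeros reference level
    df_t =  df_t.iloc[:,0:(len(df_t.columns)-1)]
    
    df = pd.concat([df.reset_index(drop = True), # Cbind DataFrames
                            df_t],
                           axis = 1)
    
    df = df.drop([ColumnToEncode], axis=1)
    return df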

def convert_to_numeric():
    
    # Read the appropriate file, should be in the same directory as the notebook
    df, dict_types = process_data("stroke-data.csv")
    df_temp = df    
    
    # Apply the one hot encoding process outlined to the new dataframe df2
    # Your code goes here
    # Columns whose dtype is not numeric are treated as categorical
    cat_ls = [col for col, dtype in dict_types.items()
              if not pd.api.types.is_numeric_dtype(dtype)]
    
    for col in cat_ls:
        df_temp = one_hot_encode(df_temp[col], df_temp, col,
                                 list(df_temp[col].unique()))
    
    # Move the output column "stroke" to the last position
    df2 = df_temp[[c for c in df_temp.columns if c != "stroke"] + ["stroke"]]
    
    return df2       

1. Convert all columns except "id" and "stroke" into a numerical feature matrix **X**. The size of the matrix will be *no_of_rows* $\times$  *(no_of_columns-2)*. The number of columns should be 9. 
2. Put the values in the "stroke" column in the array **y**. 
3. Use the sklearn [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) method to generate *X_train, X_test, y_train, y_test*. 
4. In the *train_test_split()* method, the fraction of data to be split off for testing has to be specified. Vary this fraction between 0.2 and 0.33. Run your program a few times to choose an optimum value. The optimum will correspond to the fraction giving the best accuracy/precision (see Task 5). 
5. Return the 4 arrays. 
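
To make the effect of the test fraction concrete, here is a minimal sketch on random toy data (100 rows and 9 features, chosen only to mirror the shape described above). The names `X_toy` and `y_toy` and the fixed `random_state` are assumptions for illustration; the task itself does not ask for a fixed seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 9 features, binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 9))
y_toy = rng.integers(0, 2, size=100)

for frac in (0.2, 0.25, 0.33):
    # random_state is fixed here only so that repeated runs give the same split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=frac, random_state=0
    )
    print(frac, X_tr.shape, X_te.shape)
```

Without a fixed `random_state`, every call shuffles the rows differently, so metrics compared across fractions from a single run each also include run-to-run noise.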

In [140]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

def create_arrays(size):
    
    # Call the function created in Task 2 to source the encoded data frame
    df = convert_to_numeric()
    
    # Create the X and y objects
    # Your code goes here
    X = np.array(df.drop(['id', 'stroke'], axis=1))
    y = np.array(df['stroke'])
    
    # Create test/train splits for X and y
    # Your code goes here
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = size)
    
    # Function returns the four newly created objects
    return X_train, X_test, y_train, y_test

1. In the following function you will use [linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) from sklearn to create and train a logistic regression model. 
2. The model should be trained on the train set created in task 3. **Do not use the full dataset or test set for training**.
3. As this is a binary classification problem (2 classes: "stroke", "no-stroke"), the default model does not need significant adjustment.
4. You should refer to the documentation and experiment with changing the hyperparameters of the model.


Once you have trained a model, answer the questions below:
1. In the LogisticRegression class, the first keyword argument is *penalty='l2'*. What is penalized and why? Explain this in 2 sentences.  
Answer: The penalty is applied to the magnitude of the model's coefficients: it regularizes the regression model to avoid overfitting and to reduce the impact of high-magnitude coefficients. In other words, it shrinks the parameters and so simplifies the model.
2. Instead of the $l_2$ penalty one may use the $l_1$ penalty. What is the difference between the two?  
Answer: L1 adds a penalty proportional to the sum of the absolute values of the model's coefficients and may eliminate a feature from the model completely by driving its coefficient to zero. L2, by contrast, adds a penalty proportional to the sum of the squared coefficients; it does not eliminate features but shrinks every coefficient.
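
For reference, the two penalties can be written side by side. This is a generic sketch: $L(w)$ stands for the unpenalized logistic loss and $\lambda \ge 0$ for the penalty strength, both of which are notation introduced here rather than taken from the task. In scikit-learn the strength is set through the parameter *C*, which is the inverse of the regularization strength, so a smaller *C* means a stronger penalty.

$$J_{l_2}(w) = L(w) + \lambda \sum_{j} w_j^2, \qquad J_{l_1}(w) = L(w) + \lambda \sum_{j} |w_j|$$

Near $w_j = 0$ the squared term has slope $2\lambda w_j$, which goes to zero, so a small coefficient is barely pushed any further. The absolute-value term keeps a constant slope of $\lambda$, which is why $l_1$ can drive coefficients exactly to zero.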

In [133]:
def fit_logitmodel(X, y):
    
    # Create the logitmodel_stroke model
    # Your code goes here
    # linear_model is imported from sklearn at the top of the notebook
    logitmodel_stroke = linear_model.LogisticRegression()
    
    # Train the logitmodel_stroke model
    # Your code goes here
    logitmodel_stroke.fit(X, y)
    
    return logitmodel_stroke

X_train, X_test, y_train, y_test = create_arrays(0.3)
model = fit_logitmodel(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([[ 0.07348576,  0.5010995 ,  0.44139574,  0.00351817, -0.00437526,
         0.04879587,  0.15959432, -0.46904383, -0.42487511]])
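
The convergence warning above names two remedies: raising *max_iter* or scaling the features. The sketch below tries both on synthetic data (not the stroke data). `StandardScaler` and `make_pipeline` are not used anywhere in the original notebook, and whether they are permitted outside the encoding task is not stated, so treat this as one possible approach under those assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales: a 0/1 flag next to values near 100 and 30
X_toy = np.column_stack([
    rng.integers(0, 2, size=500),
    rng.normal(100.0, 45.0, size=500),
    rng.normal(29.0, 8.0, size=500),
])
y_toy = (X_toy[:, 1] + 10 * X_toy[:, 0] + rng.normal(0, 30, size=500) > 110).astype(int)

# Standardize each feature, then fit with a larger iteration budget
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_toy, y_toy)

# Number of solver iterations actually used
n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print(n_iter)
```

If the scaler is fitted inside a pipeline like this, it learns its mean and variance from the training data only and reuses them when predicting on the test set.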

The process for evaluating a classification model is different from that for a regression model. In regression we have a wide range of values, so we measure variance; classification has a much smaller problem space, so we measure how often the correct prediction is made. There are multiple metrics for measuring this; [this article](https://www.mage.ai/blog/definitive-guide-to-accuracy-precision-recall-for-product-developers) and the [Wikipedia page](https://en.wikipedia.org/wiki/Precision_and_recall) provide additional context.

1. As this is binary classification there are 2 classes. Class 1 indicates positive stroke risk and class 0 indicates negative stroke risk. 
2. When testing we use a separate dataset which the model was not trained on. This is essential to observe how the model performs on data it has not seen before.
3. In the function below *X_ts* represents the data used to generate test predictions and *y_obs* represents the actual values we are trying to predict. 
4. We can evaluate a classification model by having it make a set of predictions for a test set (X_ts) and comparing these with the actual values (y_obs).
5. Suppose *y_pred* is a predicted value when run on a sample from *X_ts*. We compare it to the corresponding observed value in *y_obs*. There are four potential outcomes from this comparison:

    1. *y_pred* = 1 (positive) and *y_obs* = 1 (positive): counted as *true positive*.
    2. *y_pred* = 1 (positive) and *y_obs* = 0 (negative): counted as *false positive*. 
    3. *y_pred* = 0 (negative) and *y_obs* = 0 (negative): counted as *true negative*. 
    4. *y_pred* = 0 (negative) and *y_obs* = 1 (positive): counted as *false negative*. 
    
6. Count all 4 cases for the entire sample input to the function *evaluate_logitmodel* and store them in 4 variables: *tp*, *fp*, *tn* and *fn*. For example, *tp* will give the total number of true positives and *fn* the total number of false negatives. 
7. The two metrics we will be using for evaluation are *accuracy* and *precision*. The formulas for these are below. 
$$acc = \frac{tp+tn}{tp+tn+fp+fn} \quad\text{(accuracy)}, \quad prec = \frac{tp}{tp + fp} \quad\text{(precision)}$$

8. Run the model training/evaluation process for 5 different test/train split ratios (see task 3). Add a paragraph below outlining:
    1. The results of your different test/train splits
    2. How the different split sizes affected model evaluation
    3. Was there a difference in accuracy/precision and if so, what could be causing this?
    4. For a fixed train/test split, evaluate the metrics on the train data (*X_train*) and test data (*X_test*) separately and record the values of the metrics.  
9. This task is designed to test your understanding of model evaluation. **No built-in evaluation functions or metrics should be used**. 
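
The four counts and the two formulas can be checked by hand on a tiny example. The arrays below are made up for illustration, and only NumPy comparisons are used, with no built-in metric functions.

```python
import numpy as np

# Hypothetical observed labels and predictions for 8 samples
y_obs_toy  = np.array([1, 0, 0, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 1, 0, 0, 0, 1, 0])

# Count each of the four outcomes
tp = int(np.sum((y_pred_toy == 1) & (y_obs_toy == 1)))  # predicted 1, observed 1
fp = int(np.sum((y_pred_toy == 1) & (y_obs_toy == 0)))  # predicted 1, observed 0
tn = int(np.sum((y_pred_toy == 0) & (y_obs_toy == 0)))  # predicted 0, observed 0
fn = int(np.sum((y_pred_toy == 0) & (y_obs_toy == 1)))  # predicted 0, observed 1

acc = (tp + tn) / (tp + tn + fp + fn)
prec = tp / (tp + fp)

print(tp, fp, tn, fn)   # 2 1 4 1
print(acc, prec)
```

Here accuracy is 6/8 = 0.75 and precision is 2/3: of the three samples predicted positive, two were actually positive.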


Answer:

As the test fraction of the train/test split increases, with more data going into the test split, the accuracy increases from 0.93 to 0.94. However, precision cannot be calculated, because the model is not predicting any stroke value: both the false positive and true positive counts are 0. This is due to the highly skewed data in the training set, where only 180 of approximately 3000 values have stroke = 1. Yes, there is a difference between accuracy and precision, because accuracy takes into account both true positives and true negatives. Its value stays close to 95 percent because, with such skewed data, it is high even if the model predicts every test data point as negative. Precision, however, is undefined or close to zero because the model is not predicting any positive values.
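
The reasoning in the answer can be checked with plain arithmetic. The sketch below takes the approximate counts quoted in the answer (about 180 positives in about 3000 rows) and evaluates a hypothetical classifier that always predicts "no stroke"; the counts are rounded figures, not exact values from the dataset.

```python
# Approximate counts quoted in the answer above
n_total = 3000
n_positive = 180

# A classifier that always predicts 0 ("no stroke") makes no positive predictions
tp, fp = 0, 0
fn = n_positive
tn = n_total - n_positive

acc = (tp + tn) / (tp + tn + fp + fn)

# Precision is tp / (tp + fp); with no positive predictions the denominator is 0
prec = tp / (tp + fp) if (tp + fp) > 0 else float("nan")

print(acc, prec)   # 0.94 nan
```

An accuracy of about 0.94 is therefore reachable without identifying a single stroke case, which is why accuracy alone is a weak guide on data this skewed.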

In [164]:
# The model object is the output of the function fit_logitmodel; it is used to obtain y_pred
def evaluate_logitmodel(model, X_ts,  y_obs):
    
    # Use the .predict() method of the model to generate a set of predictions for X_ts
    # Your code goes here
    
    y_pred = model.predict(X_ts)
    
    # Determine the tp, fp, tn and fn values for the prediction set
    # Your code goes here
    tp = np.sum(np.logical_and(y_pred == 1, y_obs == 1))
    tn = np.sum(np.logical_and(y_pred == 0, y_obs == 0))
    fp = np.sum(np.logical_and(y_pred == 1, y_obs == 0))
    fn = np.sum(np.logical_and(y_pred == 0, y_obs == 1))
    
    
    # Calculate the accuracy and precision values
    # Your code goes here
    acc = (tp + tn)/(tp+tn+fp+fn)
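    # Precision is undefined (nan) when the model makes no positive predictions
    prec = tp/(tp+fp) if (tp+fp) > 0 else np.nan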
    
    return acc,prec

In [165]:
test_size = [0.2,0.25,0.28,0.3,0.33]

# Note: range(4) covers only the first four of the five split sizes listed above
for i in range(4):
    X_train, X_test, y_train, y_test = create_arrays(test_size[i])
    model = fit_logitmodel(X_train, y_train)
    
    acc,prec = evaluate_logitmodel(model,X_test,y_test)
    
    print("accurracy for iteration ",i," is:", acc)
    print("prec for iteration ",i," is:", prec)

accurracy for iteration  0  is: 0.9357664233576642
prec for iteration  0  is: nan
accurracy for iteration  1  is: 0.9369894982497082
prec for iteration  1  is: nan
accurracy for iteration  2  is: 0.9489583333333333
prec for iteration  2  is: 0.0
accurracy for iteration  3  is: 0.9426070038910506
prec for iteration  3  is: 1.0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  prec = tp/(tp+fp)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the 