# Final Exam
*author: Logan Reine*

## Introduction

### `MLequations_v3.ipynb` is a machine learning library of my own making. The functions and equations used to calculate virtually all answers are defined in the MLequations file. It will be submitted with this assignment as both a `.ipynb` file and a PDF file.

## Headings

In [57]:
%run MLequations_v3.ipynb
import warnings
warnings.simplefilter(action = 'ignore', category = FutureWarning)
warnings.simplefilter(action = 'ignore', category = pd.errors.SettingWithCopyWarning)

## Data

In [21]:
education = pd.read_csv("education.csv")
entertainment = pd.read_csv("entertainment.csv")
cancer = pd.read_csv("cancer.csv")
mailshot = pd.read_csv("mailshot.csv")
clustering = pd.read_csv("clustering.csv")

# 1  Classification vs. Regression

### [3 points] A pollster is collecting data to predict the 2024 United States Presidential election winner. Is this a classification or regression problem? Explain. Also, what are suitable predictive features/attributes for this machine learning problem?

**This is a classification problem: the target, the winning candidate (or party), is a categorical level rather than a continuous value. It would only become a regression problem if the pollster predicted a continuous quantity such as vote share. Political party affiliation is a suitable predictive feature; tallying a community's party affiliation indicates how that district is likely to vote.**

# 2 ML Application Pipeline

### [3 points] 80% of the work done on predictive data analytics projects is often expended in the Business Understanding, Data Understanding, and Data Preparation phases, and just 20% is spent on the Modeling, Evaluation, and Deployment phases. Why do you think this would be the case?

**This is probably the case because you need a proper scope and abstraction of the business problem to design the data appropriately. Once the abstraction is complete, you need to identify and collect the data relevant to the business question, and data collection alone can take an extended period of time. Once the relevant data has been collected, it often still needs to be cleaned and formatted to fit the scheme of the model.**

# 3 Predictive Data Analytics Use Case

### [4 points] An online fashion retailer is struggling to generate the volume of sales that they had originally hoped for when they started the business. List at least two ways in which predictive data analytics could be used to help address this business problem. For each proposed approach, describe the predictive model that will be built, how the model will be used by the business, and how using the model will help address the original business problem.

**The most obvious approach would probably be similarity-based learning. The Russell-Rao, Sokal-Michener, or Jaccard indices could be used to recommend items to shoppers whose past purchases are similar to those of other shoppers. Another interesting form of predictive data analytics is error-based learning. Starting from what the similarity-based algorithm suggested, an error-based model could be trained on whether each recommended purchase was successful or not. This could be used to discover which items have the highest purchase rate. The retailer would then not only suggest similar items to shoppers, but also recommend items with a strong purchase history within the shopper's demographic.**
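
A minimal sketch of the three similarity indices named above, assuming each shopper is described by a binary vector of purchased items. The function names and the toy vectors are illustrative only and are not taken from `MLequations_v3.ipynb`.

```python
def co_counts(a, b):
    """Return (co-presence, mismatch, co-absence) counts for two binary vectors."""
    cp = sum(1 for x, y in zip(a, b) if x and y)          # item bought by both
    ca = sum(1 for x, y in zip(a, b) if not x and not y)  # item bought by neither
    mismatch = len(a) - cp - ca                           # item bought by exactly one
    return cp, mismatch, ca

def russell_rao(a, b):
    cp, _, _ = co_counts(a, b)
    return cp / len(a)

def sokal_michener(a, b):
    cp, _, ca = co_counts(a, b)
    return (cp + ca) / len(a)

def jaccard(a, b):
    cp, mismatch, _ = co_counts(a, b)
    return cp / (cp + mismatch) if (cp + mismatch) else 0.0

# Hypothetical purchase histories over six catalogue items.
shopper_a = [1, 1, 0, 0, 1, 0]
shopper_b = [1, 0, 0, 0, 1, 1]

print(russell_rao(shopper_a, shopper_b))
print(sokal_michener(shopper_a, shopper_b))
print(jaccard(shopper_a, shopper_b))
```

Jaccard ignores co-absences, which tends to suit sparse purchase data where most shoppers have not bought most items.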

# 4 Data Quality Assessment

### [10 points] Worldwide, breast cancer is the most common form of cancer for women and the second most common form of cancer overall. Reliable, population-wide screening is one tool that can be used to reduce the impact of breast cancer, and there is an opportunity for machine learning to be used for this. A large hospital group has collected a cancer screening dataset for possible use with machine learning that contains features extracted from tissue samples extracted by biopsy from adults presenting for screening. Features have been extracted from these biopsies by lab technicians who rate samples across a number of categories on a scale of 1 to 10. The samples have then been manually categorized by clinicians as either benign or malignant. The descriptive features in this dataset are defined as follows:

    AGE: The age of the person screened.
    SEX: The sex of the person screened, either male or female.
    SIZE UNIFORMITY: A measure of the variation in size of cells in the tissue samples, higher values indicate more uniform sizes (1 to 10).
    SHAPE UNIFORMITY: A measure of the variation in shape of cells in the tissue samples, higher values indicate more uniform shapes (1 to 10).
    MARGINAL ADHESION: A measure of how much cells in the biopsy stick together (1 to 10).
    MITOSES: A measure of how fast cells are growing (1 to 10).
    CLUMP THICKNESS: A measure of the amount of layering in cells (1 to 10).
    BLAND CHROMATIN: A measure of the texture of cell nuclei (1 to 10).
    CLASS: The clinician’s assessment of the biopsy sample as either benign or malignant.

### The following data is given to you (an extract from the Analytics Base Table (ABT) — the full ABT contains 680 instances; data quality report; and distribution of feature values):

![Local Image](fig-1.png)

![Local Image](fig-2.png)

![Local Image](fig-3.png)

### Discuss this data quality report in terms of the following:

### a. Missing Values

**The *Size Uniformity* and *Shape Uniformity* features are each missing approximately 10% of their values, and *Bland Chromatin* is missing approximately 23%.**

### b. Irregular cardinality

***Shape Uniformity* has a cardinality of 11 even though the scale only runs from 1 to 10, so at least one value must be invalid or off the scale. *Mitoses* has a cardinality of 9 on the same 1-to-10 scale; it is plausible that one rating (for example, 10) never occurs, but it is worth noting either way.**

### c. Outliers

**The *ABT* and *Data Quality Report* both report an age of 0, and after cross-referencing with the bar charts, this value could be an outlier, along with ages of 100+ at the other end of the spectrum.**

### d. Feature distribution

**For feature distribution, there is substantial disproportion between *males* and *females*.  The *female* population accounts for almost 93% of the entire dataset.** 

# 5 Regression Parameters

### [5 points] Why is it bad to have large values for regression parameters?

**Large parameter values make a model overly sensitive: a small change in a descriptive feature produces a large change in the prediction. This often signals overfitting to noise or outliers in the training data and leads to a failure to generalize.**

# 6 ML Definition

### [5 points] Many machine learning models are represented using parameters. Use this idea to define what machine learning is.

**Parameters are the values or weights that a machine learning model uses to make a prediction. Machine learning is the process of automatically adjusting these parameters based on the error of the model's predictions on training data, and this adjustment is how the machine "learns".**

# 7 Ensemble Models

### [5 points] What is an ensemble model? When do you use this model?

**An ensemble model is a set of models, each trained on a different randomly drawn sample of the training data, whose individual predictions are combined (for example, by voting or averaging) into a single prediction. This method can be useful when a single model is unstable or overfits, or when the dataset has an unequal distribution of features and more varied training samples are needed. Processes like bootstrap aggregating (random samples of instances drawn with replacement) and subspace sampling (random samples of features) are used to create the different training sets.**
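
Bootstrap aggregating and subspace sampling, as mentioned above, can be sketched with NumPy as follows. This is only an illustration under assumed names (`bootstrap_indices`, `subspace_indices`) and toy data; it is not the implementation in `MLequations_v3.ipynb`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (one bootstrap sample)."""
    return rng.integers(0, n_rows, size=n_rows)

def subspace_indices(n_features, n_keep, rng):
    """Sample n_keep distinct feature indices (one random subspace)."""
    return rng.choice(n_features, size=n_keep, replace=False)

X = np.arange(40).reshape(10, 4)   # toy data: 10 instances, 4 features

rows = bootstrap_indices(X.shape[0], rng)
cols = subspace_indices(X.shape[1], 2, rng)
sample = X[rows][:, cols]          # training set for one ensemble member

print(sample.shape)
```

Repeating the two sampling calls once per model gives every ensemble member a different view of the same dataset.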

# 8 Random Forest

### [5 points] Explain how the random forest machine learning model addresses bias and overfitting?

**In a random forest, there are many decision trees; each tree is built from a random sample of the training data drawn with replacement (bootstrap aggregating), usually with a random subset of features, and is trained independently of the other trees. Because each tree sees different data, no single sample or feature systematically dominates the forest, and aggregating the trees' predictions averages out the overfitting of any individual tree. Individual decision trees can also be pruned when the summed error of the child nodes is greater than that of the parent node, which further combats overfitting.**
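
For classification, the independently trained trees are typically combined by majority vote. The sketch below shows only that aggregation step; the per-tree predictions are hypothetical placeholders rather than output from real trees.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the individual tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for a single query instance.
tree_predictions = ["malignant", "benign", "malignant", "malignant", "benign"]
forest_prediction = majority_vote(tree_predictions)

print(forest_prediction)
```

A single tree that overfits its bootstrap sample is outvoted as long as most of the other trees disagree with it.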

# 9 k-NN and Naive Bayes

### [5 points] You have studied the k-NN and naive Bayes’ machine learning models. Describe application characteristics where the k-NN is preferred over naive Bayes’. Also, what application characteristics warrant naive Bayes’ over k-NN?

**The k-NN model requires no training phase, so if the data changes frequently, k-NN may serve better than naive Bayes. However, k-NN suffers from the curse of dimensionality, and predictions become more computationally expensive as the number of instances and features grows. Naive Bayes assumes the features are conditionally independent given the target, so its cost grows only linearly with the number of features, which makes it the better choice for high-dimensional data.**

# 10 Support Vector Machines 1

### [5 points] What are support vectors? What is a kernel in the context of a support vector machine (SVM)?

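**Support vectors are the training instances closest to the decision boundary; they define where the boundary and its margin lie in an SVM. A kernel is a function that computes the similarity of two instances as if they had been mapped into a higher-dimensional space, where the SVM can define a linear decision boundary.**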
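
One way to see what a kernel does is to compare it with an explicit feature mapping. The sketch below uses a degree-2 polynomial kernel on 2-D inputs as an assumed example (the function names `phi` and `poly_kernel` are illustrative); the kernel value equals the dot product in the mapped space without ever building the mapped vectors.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Degree-2 homogeneous polynomial kernel: (a . b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

explicit = np.dot(phi(a), phi(b))   # dot product after mapping to 3-D
implicit = poly_kernel(a, b)        # same value computed in the original 2-D space

print(explicit, implicit)
```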

# 11 Support Vector Machines 2

### [5 points] Can we use the Support Vector Machine (SVM) model for both regression and classification? If not, why not?

**Yes. For classification, a support vector machine defines a hyperplane boundary that separates the data into the corresponding classes. For regression, the same machinery fits a hyperplane that models the trend of a continuous target (support vector regression).**

# 12 Harmonic Mean

### [5 points] When evaluating machine learning models, we often prefer harmonic mean over arithmetic mean. Why?

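**Arithmetic means can greatly exaggerate performance when averaging rates or ratios, because one large value can mask a small one. The harmonic mean is pulled toward the smaller values, so it delivers the same information as a more conservative, "normalized" summary of those rates and ratios.**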
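
A small worked example makes the difference concrete. The precision and recall values below are made up for illustration; the harmonic mean of precision and recall is the F1 measure.

```python
def arithmetic_mean(values):
    return sum(values) / len(values)

def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

# Hypothetical model: perfect precision but very poor recall.
precision, recall = 1.0, 0.1

am = arithmetic_mean([precision, recall])
hm = harmonic_mean([precision, recall])   # equivalent to the F1 measure

print(f"arithmetic: {am:.4f}, harmonic: {hm:.4f}")
```

The arithmetic mean reports 0.55, which hides the poor recall, while the harmonic mean stays close to the weaker of the two values.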

# 13 Naive Bayes

### [10 points] Imagine that you have been given a dataset of 1,000 documents that have been classified as being about entertainment or education. There are 700 entertainment documents in the dataset and 300 education documents in the dataset. The tables below give the number of documents from each topic that a selection of words occurred in.

In [70]:
entertainment

Unnamed: 0,fun,is,machine,christmas,family,learning
0,415,695,35,0,400,70


In [71]:
education

Unnamed: 0,fun,is,machine,christmas,family,learning
0,200,295,120,0,10,105


### a. What target level will a naive Bayes model predict for the following query document: “machine learning is fun”?

In [25]:
columns = ['machine', 'learning', 'is', 'fun']
entertainment_values = [35, 70, 695, 415]

print(f'Prediction in \'entertainment\' document: {bayes_predict_sin(entertainment, columns, entertainment_values, 700):.4f}%') 

columns = ['machine', 'learning', 'is', 'fun']
education_values = [120, 105, 295, 200]

print(f'Prediction in \'education\' document: {bayes_predict_sin(education, columns, education_values, 300):.4f}%') 

Prediction in 'entertainment' document: 0.0029%
Prediction in 'education' document: 0.0918%
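
The values printed above appear to match the product of the conditional word probabilities for each class. The sketch below repeats the calculation in plain Python and also multiplies in the class priors (700/1000 and 300/1000), which is an assumption about how the final score should be formed; `nb_score` is an illustrative name, not a function from `MLequations_v3.ipynb`. Only the words present in the query are used, mirroring the cell above.

```python
# Document counts per class, copied from the tables above.
counts = {
    "entertainment": {"total": 700, "machine": 35, "learning": 70, "is": 695, "fun": 415},
    "education": {"total": 300, "machine": 120, "learning": 105, "is": 295, "fun": 200},
}
n_docs = 1000
query = ["machine", "learning", "is", "fun"]

def nb_score(level):
    """Prior times the product of P(word | level) for each query word."""
    c = counts[level]
    score = c["total"] / n_docs
    for word in query:
        score *= c[word] / c["total"]
    return score

scores = {level: nb_score(level) for level in counts}
prediction = max(scores, key=scores.get)

print(scores, prediction)
```

With or without the priors, the education score is the larger of the two, so the predicted target level for "machine learning is fun" is education.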


### b. What target level will a naive Bayes model predict for the following query document: “christmas family fun”?

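**No calculation is necessary: the word 'christmas' occurs in zero documents of either class, so the conditional probability P(christmas | class) is 0 for both entertainment and education. Both scores are therefore 0, and the naive Bayes model cannot prefer either target level.**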

### c. What target level will a naive Bayes model predict for the query document in Part (b) of this question, if Laplace smoothing with k=10 and a vocabulary size of 6 is used?

In [52]:
columns = ['christmas', 'family', 'fun']
entertainment_values = [0, 400, 415]

print(f'Prediction in \'entertainment\' with k = {10} and domain = {6}: {bayes_predict_sin(entertainment, columns, entertainment_values, 700, 10, 6):.4f}%') 

columns = ['christmas', 'family', 'fun']
education_values = [0, 10, 200]

print(f'\nPrediction in \'education\' with k = {10} and domain = {6}: {bayes_predict_sin(education, columns, education_values, 300, 10, 6):.4f}%') 

Prediction in 'entertainment' with k = 10 and domain = 6: 0.0040%

Prediction in 'education' with k = 10 and domain = 6: 0.0009%
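
A sketch of the smoothed calculation, assuming the Laplace estimate (count + k) / (total + k × vocabulary size) with k = 10 and a vocabulary size of 6. The helper name `smoothed` is illustrative, and multiplying by an unsmoothed class prior is an additional assumption.

```python
def smoothed(count, total, k=10, vocab_size=6):
    """Laplace-smoothed estimate of P(word | class)."""
    return (count + k) / (total + k * vocab_size)

# Document counts for 'christmas', 'family', 'fun' and class totals from the tables above.
data = {
    "entertainment": {"total": 700, "counts": [0, 400, 415]},
    "education": {"total": 300, "counts": [0, 10, 200]},
}
n_docs = 1000

likelihoods, scores = {}, {}
for level, d in data.items():
    like = 1.0
    for c in d["counts"]:
        like *= smoothed(c, d["total"])
    likelihoods[level] = like
    scores[level] = like * d["total"] / n_docs   # unsmoothed prior (assumption)

prediction = max(scores, key=scores.get)
print(likelihoods, prediction)
```

The smoothed likelihoods round to the values printed above, and entertainment has the higher score, so the predicted target level for "christmas family fun" is entertainment.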


# 14 Multivariate logistic regression model

### [10 points] A multivariate logistic regression model has been built to diagnose breast cancer in patients on the basis of features extracted from tissue samples extracted by biopsy. The model uses three descriptive features — MITOSES, a measure of how fast cells are growing; CLUMP THICKNESS, a measure of the amount of layering in cells; and BLAND CHROMATIN, a measure of the texture of cell nuclei — and predicts the status of a biopsy as either benign or malignant. The weights in the trained model are shown in the following table.

In [51]:
cancer

Unnamed: 0,ID,MITOSES,THICKNESS,CHROMATIN
0,1,7,4,3
1,2,3,5,1
2,3,3,3,3
3,4,5,3,1
4,5,7,4,4
5,6,10,4,1
6,7,5,2,1


In [50]:
w = [13.92, 3.09, 0.63, 1.11]

for i in range(len(cancer)):
    d = cancer.iloc[i, 1:].tolist()
    print(f"\tQuery {i + 1} Prediction: {multi_reg(w ,d):.4f}")

	Query 1 Prediction: 41.4000
	Query 2 Prediction: 27.4500
	Query 3 Prediction: 28.4100
	Query 4 Prediction: 32.3700
	Query 5 Prediction: 42.5100
	Query 6 Prediction: 48.4500
	Query 7 Prediction: 31.7400
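
The values printed above equal the linear combination of the weights and features (for example, 13.92 + 3.09·7 + 0.63·4 + 1.11·3 = 41.40 for Query 1). A logistic regression model would normally pass this value through the logistic function to obtain a probability. The sketch below shows that extra step for Query 1, using the weights exactly as written in the cell above; the `logistic` helper is illustrative, and which target level corresponds to an output near 1 is not stated in the source.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

w = [13.92, 3.09, 0.63, 1.11]   # bias followed by the three feature weights, as in the cell above
query_1 = [7, 4, 3]             # MITOSES, THICKNESS, CHROMATIN for ID 1

z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], query_1))
p = logistic(z)

print(f"linear score: {z:.4f}, logistic output: {p:.4f}")
```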


# 15 Confusion Matrix

### [10 points] A marketing company working for a charity has developed two different models that predict the likelihood that donors will respond to a mailshot asking them to make a special extra donation. The prediction scores generated for a test set for these two models are shown in the table below.

In [12]:
mailshot

Unnamed: 0,ID,Target,Score 1,Score 2
0,1,False,0.1026,0.2089
1,2,False,0.2937,0.008
2,3,True,0.512,0.8378
3,4,True,0.8645,0.716
4,5,False,0.1987,0.1891
5,6,True,0.76,0.9398
6,7,True,0.7519,0.98
7,8,True,0.2994,0.8578
8,9,False,0.0552,0.156
9,10,False,0.9231,0.56


### a. Using a classification threshold of 0.5, and assuming that true is the positive target level, construct a confusion matrix for each of the models.

In [31]:
mailshot_bin = mailshot.copy()

# Scores at or above the 0.5 threshold are predicted True (the positive level).
for col in ['Score 1', 'Score 2']:
    mailshot_bin[col] = mailshot_bin[col] >= 0.5

In [38]:
print(f'\t**MODEL 1**')
con_matrix(mailshot_bin['Target'], mailshot_bin['Score 1'])

	**MODEL 1**


Unnamed: 0,Positive,Negative
Positive,15,2
Negative,2,11


In [39]:
print(f'\t**MODEL 2**')
con_matrix(mailshot_bin['Target'], mailshot_bin['Score 2'])

	**MODEL 2**


Unnamed: 0,Positive,Negative
Positive,14,3
Negative,3,10


### b. Calculate the simple accuracy and average class accuracy (using an arithmetic mean) for each model.

In [47]:
sim_1 = simple_accuracy(mailshot_bin['Target'], mailshot_bin['Score 1'])
sim_2 = simple_accuracy(mailshot_bin['Target'], mailshot_bin['Score 2'])

print(f'Simple Accuracy for Model 1: {sim_1:.4f}')
print(f'Simple Accuracy for Model 2: {sim_2:.4f}')

Simple Accuracy for Model 1: 0.8667
Simple Accuracy for Model 2: 0.8000


In [48]:
aca_1 = average_class_accuracy(mailshot_bin['Target'], mailshot_bin['Score 1'])
aca_2 = average_class_accuracy(mailshot_bin['Target'], mailshot_bin['Score 2'])

print(f'Average Class Accuracy for Model 1: {aca_1:.4f}')
print(f'Average Class Accuracy for Model 2: {aca_2:.4f}')

Average Class Accuracy for Model 1: 0.8643
Average Class Accuracy for Model 2: 0.7964
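
Both measures can be checked by hand from the confusion matrices in Part (a). The sketch below assumes the matrix cells are read as TP, FN, FP, TN; because the off-diagonal counts are equal in each matrix, the result does not depend on which axis holds the actual labels. The helper names are illustrative.

```python
def simple_acc(tp, fn, fp, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fn + fp + tn)

def avg_class_acc(tp, fn, fp, tn):
    """Arithmetic mean of the recall for the positive and negative levels."""
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return (recall_pos + recall_neg) / 2

# Counts copied from the confusion matrices above.
model_1 = dict(tp=15, fn=2, fp=2, tn=11)
model_2 = dict(tp=14, fn=3, fp=3, tn=10)

for name, m in [("Model 1", model_1), ("Model 2", model_2)]:
    print(f"{name}: {simple_acc(**m):.4f} {avg_class_acc(**m):.4f}")
```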


### c. Based on the average class accuracy measures, which model appears to perform best at this task?

**Model 1. Its average class accuracy (0.8643) is higher than that of Model 2 (0.7964).**

# 16 Clustering

### [10 points] The following table shows a small dataset used for human activity recognition from a wearable accelerometer sensor. Each instance describes the average acceleration in the X, Y, and Z directions within a short time window. There are no labels, so this data is being clustered in an attempt to recognize different activity from this simple data stream. The k-means clustering approach is to be applied to this dataset with k = 2 and using Euclidean distance. The initial cluster centroids for the two clusters C1 and C2 are c1 = ⟨−0.235, 0.253, 0.438⟩ and c2 = ⟨0.232, 0.325, −0.159⟩. The following table also shows the distance to these two cluster centers for each instance in the dataset after iteration 1.

In [23]:
clustering

Unnamed: 0,ID,X,Y,Z,Dist(di-c1),Dist(di-c2)
0,1,-0.154,0.376,0.099,0.37,0.467
1,2,-0.103,0.476,-0.027,0.532,0.39
2,3,0.228,0.036,-0.251,0.858,0.303
3,4,0.33,0.013,-0.263,0.932,0.343
4,5,-0.114,0.482,0.014,0.497,0.417
5,6,0.295,0.084,-0.297,0.922,0.285
6,7,0.262,0.042,-0.304,0.918,0.319
7,8,-0.051,0.416,-0.306,0.784,0.332


### a. Assign each instance to its nearest cluster to generate the clustering at the first iteration of k-means on the basis of the initial cluster centroids.

In [58]:
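nearest_cluster = clustering[['ID', 'X', 'Y', 'Z']].copy()

# Assign each instance to the closer of the two initial centroids.
nearest_cluster['Clusters'] = np.where(clustering['Dist(di-c1)'] <= clustering['Dist(di-c2)'], 1, 2)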

nearest_cluster

Unnamed: 0,ID,X,Y,Z,Clusters
0,1,-0.154,0.376,0.099,1
1,2,-0.103,0.476,-0.027,2
2,3,0.228,0.036,-0.251,2
3,4,0.33,0.013,-0.263,2
4,5,-0.114,0.482,0.014,2
5,6,0.295,0.084,-0.297,2
6,7,0.262,0.042,-0.304,2
7,8,-0.051,0.416,-0.306,2


### b. On the basis of the clustering calculated in Part (a), calculate a set of new cluster centroids.

In [66]:
clusters = nearest_cluster.groupby('Clusters')

new_centroids = clusters[['X', 'Y', 'Z']].mean()

new_centroids

Clusters,X,Y,Z
1,-0.154,0.376,0.099
2,0.121,0.221286,-0.204857


### c. Calculate the distances of each instance to these new cluster centers and perform another clustering iteration.

In [69]:
clustering_3rd = clustering[['X', 'Y', 'Z']].copy()

for i, row in new_centroids.iterrows(): 
    
    clustering_3rd[f'Dist_c{i}'] = np.sqrt((clustering_3rd['X'] - row['X'])**2 + (clustering_3rd['Y'] - row['Y'])**2 + (clustering_3rd['Z'] - row['Z'])**2)

clustering_3rd

Unnamed: 0,X,Y,Z,Dist_c1,Dist_c2
0,-0.154,0.376,0.099,0.0,0.438053
1,-0.103,0.476,-0.027,0.168751,0.382999
2,0.228,0.036,-0.251,0.619697,0.218881
3,0.33,0.013,-0.263,0.705031,0.30074
4,-0.114,0.482,0.014,0.141637,0.413637
5,0.295,0.084,-0.297,0.666094,0.240028
6,0.262,0.042,-0.304,0.668596,0.248704
7,-0.051,0.416,-0.306,0.419802,0.278797
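
To complete the second iteration, each instance is reassigned to whichever new centroid is closer. The sketch below redoes the update and assignment steps with NumPy, starting from the Part (a) assignments; the variable names are illustrative and the data is copied from the table above.

```python
import numpy as np

points = np.array([
    [-0.154, 0.376, 0.099],
    [-0.103, 0.476, -0.027],
    [0.228, 0.036, -0.251],
    [0.330, 0.013, -0.263],
    [-0.114, 0.482, 0.014],
    [0.295, 0.084, -0.297],
    [0.262, 0.042, -0.304],
    [-0.051, 0.416, -0.306],
])
labels = np.array([1, 2, 2, 2, 2, 2, 2, 2])   # assignments from Part (a)

# Update step: each centroid is the mean of its current members.
centroids = np.array([points[labels == k].mean(axis=0) for k in (1, 2)])

# Assignment step: Euclidean distance from every instance to every centroid.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
new_labels = dists.argmin(axis=1) + 1

print(new_labels)
```

Comparing the two distance columns above gives the same result: instances 1, 2, and 5 are closer to the new c1, and instances 3, 4, 6, 7, and 8 are closer to the new c2.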
