# Breast Cancer Dataset Analysis with Logistic Regression

Logistic regression is a machine learning algorithm that models the probability that something either is a particular thing or isn't. In more scientific terminology, it models the probability that an input belongs to a certain class, such as whether an email is "spam" or "not spam". Logistic regression can also be extended to classify inputs into more than two categories, but that is out of scope for this assignment, so we don't get into it here.

The main idea behind logistic regression is that, given data as input, we estimate the probability that each data point belongs to a certain class using something called a sigmoid function (an S-shaped curve). The sigmoid maps a weighted sum of the input features to a number between 0 and 1, and a 'decision threshold' of 0.5 (picture an imaginary line on a graph) turns that number into a class: values above the threshold are classified as 1 and values below it as 0. Data points that land far from the threshold (close to 0 or 1) on the correct side are already classified confidently, so they are weighted lightly when the model is adjusted during training. Data points that land close to that imaginary threshold are the uncertain ones, so they are weighted more heavily to reduce the chance of incorrect classification.
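To make the sigmoid and the 0.5 threshold concrete, here is a minimal sketch using NumPy. The `scores` array is made up for illustration (it stands in for the weighted sums a trained model would produce), and the `sigmoid` helper is written by hand rather than taken from scikit-learn.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (weighted sums of features), chosen for illustration.
scores = np.array([-4.0, -0.5, 0.5, 4.0])
probabilities = sigmoid(scores)

# Apply the 0.5 decision threshold to turn probabilities into class labels.
predictions = (probabilities > 0.5).astype(int)

for z, p, label in zip(scores, probabilities, predictions):
    print(f"score={z:+.1f}  probability={p:.3f}  predicted class={label}")

# A score of exactly 0 sits right on the threshold.
print(sigmoid(0.0))
```

Scores of -4.0 and 4.0 give probabilities near 0 and 1 (confident predictions), while -0.5 and 0.5 give probabilities close to 0.5 (uncertain predictions).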

We use something called a loss function to measure how 'incorrect' a probabilistic guess from our sigmoid function is, but that, too, is out of scope for this assignment, so I will skip it for now.

## Table of contents
1. Import necessary modules
2. Load the dataset, extracting the data and target variables
3. View the data
4. How frequently does the positive target occur?
5. Generate summary statistics
6. Create a pairplot for the first few features
7. Create a correlation coefficient heatmap
8. Create a boxplot for mean radius by target type
9. Split the data into training and test sets
10. Build and train the logistic regression model
11. Evaluate the model
12. Generate the confusion matrix
13. Generate a classification report
14. Extract coefficients
15. Normalize the coefficients by the standard deviation
16. Sort feature names and coefficients by absolute value of coefficients
17. Visualize feature importance

-----------

### Step 1: Import necessary modules

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
```

### Step 2: Load the dataset, extracting the data and target variables

```python
# With `as_frame=True`, both arrays are pandas objects: X is a DataFrame and y is a Series.
dataset = load_breast_cancer(as_frame=True)
target = dataset.target
# `df` will _not_ include the target column (use `dataset.frame` for that).
df = dataset.data
```
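As a quick sanity check on the note about `dataset.frame`, the sketch below compares the features-only DataFrame with the combined one. The variable names `features`, `labels`, and `combined` are chosen here for illustration and are not used elsewhere in the notebook.

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer(as_frame=True)

features = dataset.data    # DataFrame of feature columns only
labels = dataset.target    # Series of 0/1 labels
combined = dataset.frame   # DataFrame of features plus a "target" column

# The combined frame has exactly one more column than the features frame.
print(features.shape)
print(combined.shape)
print("target" in features.columns, "target" in combined.columns)
```

Keeping the features and the target in separate objects, as the cell above does, is convenient for `train_test_split` later, while `dataset.frame` is handy for plots that need the target alongside the features.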