# Logistic Regression

## Back to the Breast Cancer Data

Let's start off with this dataset and see if we can apply Logistic Regression!

In [None]:
# The usual imports
import pandas as pd
import numpy as np
import plotly.express as px

# Metric for evaluating classifiers (the model itself is imported further down)
from sklearn.metrics import accuracy_score


Start by reading in the data and taking a look at it.

In [None]:
# Put csv into dataframe (name it df)
URL = "https://raw.githubusercontent.com/ishaandey/node/master/week-8/lab/breast_cancer.csv"
df = pd.read_csv(URL)

# Quick data cleaning: drop unused columns, then encode diagnosis as malignant (M) = 1, benign (B) = 0
df = df.drop(columns=['id','Unnamed: 32'])
df['diagnosis'] = df['diagnosis'].map({'M':1,'B':0})
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## Single Variable Logistic Regression

Split the data into a feature matrix `X` and a target `y`, then into training and testing sets. For now, `X` will contain only one variable: `concave points_mean`. If you remember, that was the most important feature in our Decision Tree model.

In [None]:
from sklearn.model_selection import train_test_split

# Split into X and y
# What would our y be in this case?
X = df[['concave points_mean']]
y = df.diagnosis

# Use train_test_split to split into training/testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

Now let's create our model and train it.

In [None]:
from sklearn.linear_model import LogisticRegression

# Create the Logistic Regression model
# Hint: It's very similar to the rest of our classification models
clf = LogisticRegression()

# Fit the model with the training data
clf.fit(X_train, y_train)

LogisticRegression()
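
Under the hood, a fitted logistic regression turns a feature value into a probability by passing a linear score through the sigmoid function, and `predict` labels a sample as class 1 when that probability is above 0.5. Here is a minimal sketch of the idea. The coefficient `w` and intercept `b` are made up for illustration; they are not the values the model above learned (those are stored in `clf.coef_` and `clf.intercept_`).

```python
import math

def sigmoid(z):
    """Squash any real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficient and intercept, chosen only for illustration
w, b = 80.0, -4.0

def predict_proba(x):
    """Probability of class 1 (malignant) for a single feature value x."""
    return sigmoid(w * x + b)

for x in (0.01, 0.04, 0.10):
    p = predict_proba(x)
    label = 1 if p > 0.5 else 0
    print(f"x={x:.2f}  P(malignant)={p:.3f}  predicted={label}")
```

Larger feature values push the score up and the probability toward 1, while smaller values push it toward 0.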

Let's plot the data to see whether it looks like a good fit for logistic regression.

In [None]:
fig = px.scatter(df, x='concave points_mean', y='diagnosis')
fig.show()

Now that we've seen the graph, let's evaluate our model on the testing data and get some metrics! The metric we will be using is `f1_score`.

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

# Get predicted and actual
predicted = clf.predict(X_test)
actual = np.array(y_test)

# Get the f1_score from the import above 
# (feel free to look back at your classification metrics notes to know what that means)
f1 = f1_score(actual, predicted)
print(f1)

0.24489795918367346
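
As a reminder of what this number measures, the F1 score is the harmonic mean of precision and recall, so it drops when the model misses positives, raises false alarms, or both. The sketch below computes it by hand. The helper `f1_from_labels` and the toy labels are made up for illustration and are not taken from the dataset.

```python
def f1_from_labels(actual, predicted):
    """F1 score for binary labels, treating 1 as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: a model that rarely predicts the positive class
toy_actual    = [1, 1, 1, 1, 0, 0, 0, 0]
toy_predicted = [1, 0, 0, 0, 0, 0, 0, 1]
print(f1_from_labels(toy_actual, toy_predicted))
```

In this toy case half of the predictions are correct, but the F1 score is much lower because three of the four positives were missed.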


## Multiple Variables with Logistic Regression

Let's see if we can get a better F1 score with more variables in the model.

In [None]:
# Split into X and y
# What would our y be in this case?
X = df[['concave points_mean', 'concave points_worst', 'radius_worst']]
y = df.diagnosis

# Use train_test_split to split into training/testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [None]:
# Create the Logistic Regression model
clf = LogisticRegression()

# Fit the model with the training data
clf.fit(X_train, y_train)

LogisticRegression()

In [None]:
# Get predicted and actual
predicted = clf.predict(X_test)
actual = np.array(y_test)

# Get the f1_score from the import above 
# (feel free to look back at your classification metrics notes to know what that means)
f1 = f1_score(actual, predicted)
print(f1)

0.9135802469135803
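
The gap between the single-variable and three-variable scores may not come only from the extra information. scikit-learn's `LogisticRegression` applies L2 regularization by default, which penalizes large coefficients, and a feature with very small values such as `concave points_mean` (around 0.1 in the rows shown earlier) needs a large coefficient to separate the classes. Here is a minimal sketch of that effect. It assumes synthetic data invented for illustration (not the breast cancer dataset) and uses `StandardScaler` in a pipeline, which is not part of the lab above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small-valued feature (NOT the real dataset)
rng = np.random.default_rng(0)
x_class0 = rng.normal(0.03, 0.015, size=300)
x_class1 = rng.normal(0.09, 0.03, size=200)
X_demo = np.concatenate([x_class0, x_class1]).reshape(-1, 1)
y_demo = np.array([0] * 300 + [1] * 200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same default model, with and without standardizing the feature first
raw_model = LogisticRegression().fit(Xd_train, yd_train)
scaled_model = make_pipeline(StandardScaler(), LogisticRegression()).fit(Xd_train, yd_train)

f1_raw = f1_score(yd_test, raw_model.predict(Xd_test), zero_division=0)
f1_scaled = f1_score(yd_test, scaled_model.predict(Xd_test), zero_division=0)
print(f"raw F1: {f1_raw:.3f}, scaled F1: {f1_scaled:.3f}")
```

If the scaled pipeline scores noticeably higher on data like this, it may be worth trying the same pipeline on `concave points_mean` alone before concluding that one variable is not enough.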
