
In this lesson, we're going to use logistic regression to predict a penguin's species.

To recap, logistic regression is a classification technique for predicting a qualitative (categorical) outcome, such as a "positive" or "negative" COVID test result. This differs from linear regression, where the outcome is numerical (quantitative).
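
As a rough illustration of the recap above, the sketch below passes a weighted sum of inputs through the logistic (sigmoid) function to get a probability, then thresholds it to pick a class. The weights, bias, and feature values are made up for illustration and are not taken from the penguin data.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, bias, and feature values (illustrative only)
weights = [0.8, -0.5]
bias = 0.1
features = [2.0, 1.0]

# Linear score, computed the same way as a linear regression prediction
score = bias + sum(w * x for w, x in zip(weights, features))

# Squash the score into a probability, then threshold at 0.5
probability = sigmoid(score)
label = "positive" if probability >= 0.5 else "negative"
print(round(probability, 3), label)
```

With more than two classes, as with the three penguin species, the same idea extends to one probability per class, and the class with the highest probability is predicted.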

The dataset chosen for this example is the penguins dataset available with Seaborn (https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv). It contains information about penguins, including their bill and flipper dimensions as well as body mass, sex, species, and the island they inhabit. Note that the full dataset has 344 rows (one row per penguin).

Data Scrubbing:
*   Removing rows with missing values
*   One-hot encoding for island and sex

Independent Variables:
*   bill_length_mm
*   bill_depth_mm
*   flipper_length_mm
*   body_mass_g
*   island
*   sex

Dependent Variable:
*   species

Evaluation:
*   Confusion matrix
*   Classification report

In [None]:
# 1) Import the following Python libraries: A) pandas B) train_test_split from Scikit-learn C) LogisticRegression from Scikit-learn D) confusion_matrix and classification_report from Scikit-learn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
# 2) Import dataset from the web: https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')

In [None]:
# 3) Delete rows with missing values
# how='any' drops a row if at least one of its columns is missing
df.dropna(axis=0, how='any', inplace=True)
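
Before dropping rows, it can help to see how many values are actually missing. The sketch below uses a small made-up DataFrame (named `toy` here, not part of the lesson) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) standing in for the penguins table
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "sex": ["Male", "Female", None],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows where every column has a value
cleaned = toy.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")
```

The same two calls, `isna().sum()` and `dropna()`, can be applied to `df` in this notebook.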

In [None]:
# 4) Convert non-numeric variables using one-hot encoding. These variables are: island and sex
df = pd.get_dummies(df, columns=['island', 'sex'])
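
To see what one-hot encoding produces, here is a small sketch on a made-up three-row DataFrame (the values are illustrative). Note the order of the new columns: `get_dummies` appends them after the untouched columns and sorts the categories alphabetically, which matters whenever a sample is typed in by hand.

```python
import pandas as pd

# Toy rows (hypothetical values) with the two categorical columns from the lesson
toy = pd.DataFrame({
    "body_mass_g": [3750, 5000, 3400],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "sex": ["Male", "Female", "Female"],
})

# Each category becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["island", "sex"])
print(list(encoded.columns))
print(encoded)
```

Checking `list(X.columns)` on the real data in the same way confirms the exact feature order the model is trained on.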

In [None]:
# 5) Assign the X and y variables
X = df.drop('species',axis=1)
y = df['species']

In [None]:
# 6) Shuffle the dataset and split the data into test/train sets (70/30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
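
Because the split above is random, the results change from run to run. As an optional variation (not part of the original lesson), the sketch below fixes `random_state` for repeatable results and uses `stratify` so each class appears in the test set in proportion to its share of the data. The toy data and the seed value `0` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and labels (hypothetical): 10 rows per class
toy_X = pd.DataFrame({"feature": range(30)})
toy_y = pd.Series(["Adelie"] * 10 + ["Chinstrap"] * 10 + ["Gentoo"] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.3,
    random_state=0,   # assumption: any fixed integer gives a repeatable split
    stratify=toy_y,   # keep class proportions the same in train and test
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```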

In [None]:
# 7) Assign LogisticRegression as the model's algorithm
model = LogisticRegression()

In [None]:
# 8) Link model to X and y variables using the fit function
model.fit(X_train, y_train)
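
With features on very different scales (body mass in the thousands of grams next to bill depth in the tens of millimetres), the solver may need more iterations than its default limit and can emit a convergence warning. One common remedy, shown here as an optional sketch rather than the lesson's method, is to standardize the features inside a pipeline and raise `max_iter`. The synthetic data and the name `scaled_model` are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the penguin features
X_demo, y_demo = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)
X_demo[:, 3] *= 1000  # put one feature on a much larger scale than the others

# Standardize each feature, then fit logistic regression with a higher iteration cap
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

print(scaled_model.score(X_demo, y_demo))
```

A pipeline like this can be fitted and used for prediction in the same way as a plain `LogisticRegression` object.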

In [None]:
# 9) Run algorithm on test data to make predictions
model_test = model.predict(X_test)

In [None]:
# 10) Evaluate predictions by comparing the model's predictions with the actual outcome of the test data using a confusion matrix and classification report
print(confusion_matrix(y_test, model_test)) 
print(classification_report(y_test, model_test))
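
To help read the confusion matrix, the sketch below builds one from six made-up labels. Rows correspond to the actual classes and columns to the predicted classes, so correct predictions sit on the diagonal. The label lists are hypothetical and not model output.

```python
from sklearn.metrics import confusion_matrix

labels = ["Adelie", "Chinstrap", "Gentoo"]

# Hypothetical actual and predicted species for six penguins
actual    = ["Adelie", "Adelie",    "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"]
predicted = ["Adelie", "Chinstrap", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"]

# Rows are actual classes, columns are predicted classes (in the order of `labels`)
cm = confusion_matrix(actual, predicted, labels=labels)
print(cm)

# Accuracy is the share of predictions on the diagonal
correct = cm.diagonal().sum()
accuracy = correct / cm.sum()
print(accuracy)
```

Here one Adelie penguin is misclassified as Chinstrap, which shows up as the single off-diagonal count in the first row.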

In [None]:
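# 11) Make a prediction with the model using a sample data point (called 'penguin') and the predict function
# Values must follow the column order of X; get_dummies sorts categories, so sex_Female comes before sex_Male
penguin = [
    39,    # bill_length_mm
    18.5,  # bill_depth_mm
    180,   # flipper_length_mm
    3750,  # body_mass_g
    0,     # island_Biscoe
    0,     # island_Dream
    1,     # island_Torgersen
    0,     # sex_Female
    1,     # sex_Male
]

# Make prediction (wrapping the values in a DataFrame keeps the feature names used during training)
new_penguin = model.predict(pd.DataFrame([penguin], columns=X.columns))
new_penguin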
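
Besides the predicted label, a fitted `LogisticRegression` can report how confident it is through `predict_proba`. The sketch below trains a tiny stand-in model on made-up flipper lengths (the data and the names `toy_model` and `sample` are assumptions) and prints one probability per class.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data (hypothetical values), one feature and two species
toy_X = pd.DataFrame({"flipper_length_mm": [180, 185, 190, 210, 215, 220]})
toy_y = ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"]

toy_model = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)

# One new data point with the same column name as the training data
sample = pd.DataFrame({"flipper_length_mm": [182]})

# predict_proba returns one row per sample and one column per class in classes_
probabilities = toy_model.predict_proba(sample)[0]
for species, p in zip(toy_model.classes_, probabilities):
    print(species, round(p, 3))
```

Calling `model.predict_proba` on the penguin sample above works the same way, with the columns ordered as in `model.classes_`.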