# Logistic Regression Model

In [16]:
# Step 1: import the libraries and functions needed for logistic regression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Import dataset
df = pd.read_csv('Dataset.csv')

# Drop rows containing missing values
df = df.dropna(axis=0, how='any')

# Convert non-numeric data using one-hot encoding
df = pd.get_dummies(df, columns=['island', 'sex'])

# Assign X and y variables
X = df.drop('species',axis=1)
y = df['species']

# Split data into test/train set (70/30 split) and shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

# Assign algorithm
model = LogisticRegression()

# Link algorithm to X and y variables
model.fit(X_train, y_train)

# Run algorithm on test data to make predictions
model_test = model.predict(X_test)

# Evaluate predictions
print(confusion_matrix(y_test, model_test)) 
print(classification_report(y_test, model_test))


[[46  1  0]
 [ 0 18  0]
 [ 0  0 35]]
              precision    recall  f1-score   support

      Adelie       1.00      0.98      0.99        47
   Chinstrap       0.95      1.00      0.97        18
      Gentoo       1.00      1.00      1.00        35

    accuracy                           0.99       100
   macro avg       0.98      0.99      0.99       100
weighted avg       0.99      0.99      0.99       100



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
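
The warning above means the solver hit its iteration limit before it converged. Below is a minimal sketch of the two fixes the message itself suggests (scaling the data and raising `max_iter`). It runs on scikit-learn's bundled iris data instead of the penguin CSV, which is an assumption made only so the example runs on its own; `max_iter=1000` and `random_state=0` are arbitrary illustrative choices, not values from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): iris instead of the penguin dataset
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, random_state=0
)

# Scale each feature to zero mean / unit variance, then fit with a higher limit
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

# Iterations the solver actually used, and accuracy on the held-out data
iterations_used = int(scaled_model[-1].n_iter_.max())
test_accuracy = scaled_model.score(X_te, y_te)
print(iterations_used, test_accuracy)
```

If `iterations_used` stays well below the limit, the solver converged and the warning should not appear.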


### Datapoint to predict: 

In [17]:
# Data point to predict
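# Values must follow the column order of X (check with X.columns).
# get_dummies orders dummy columns alphabetically, so sex_Female comes before sex_Male.
penguin = [
    39,    # bill_length_mm
    18.5,  # bill_depth_mm
    180,   # flipper_length_mm
    3750,  # body_mass_g
    0,     # island_Biscoe
    0,     # island_Dream
    1,     # island_Torgersen
    1,     # sex_Female
    0,     # sex_Male
]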

# Make prediction
new_penguin = model.predict([penguin])
new_penguin




array(['Adelie'], dtype=object)
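
A plain list only works if every value sits in exactly the same position as the matching column of `X`. Here is a hedged sketch of a safer way: describe the penguin by column name and align it to the training columns. The tiny training frame below is made up for illustration, and `X_small`, `new_row` and `aligned` are hypothetical names that are not part of the notebook.

```python
import pandas as pd

# Tiny made-up training frame (hypothetical values, same column names as above)
train = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0],
    "island": ["Torgersen", "Dream", "Biscoe"],
    "sex": ["Male", "Female", "Male"],
})
X_small = pd.get_dummies(train, columns=["island", "sex"])
print(list(X_small.columns))

# Describe the new penguin by name instead of by position
new_row = pd.DataFrame([{"bill_length_mm": 39.0, "island": "Torgersen", "sex": "Male"}])
new_row = pd.get_dummies(new_row, columns=["island", "sex"])

# Align to the training columns; dummy columns missing from the new row become 0
aligned = new_row.reindex(columns=X_small.columns, fill_value=0)
print(aligned.iloc[0].to_dict())
```

The printed column list shows the order the model expects. Passing a frame aligned this way to `model.predict` avoids mixing up positions.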

# Process explained step by step

1) First we import the libraries and load the dataset <br>
<h3>2) Then we drop any incomplete data points (i.e. rows containing a NaN value) <code>df = df.dropna(axis=0, how='any')</code></h3>
3) Next we use one-hot encoding to convert the non-numeric columns (<code>island</code> and <code>sex</code>) into numeric 0/1 columns <br>
4) Then we assign the independent and dependent variables. Our dependent variable is the species: we are trying to predict a penguin's species from its characteristics. <b>Dependent Variable: Species</b> <br>
5) Then we shuffle the data and split it into training data (70%) and test data (30%) <br>
6) Then we choose the algorithm for our model. For this problem we chose <b>LogisticRegression()</b> <br>
7) Then we fit the model to the training data with this line: <code>model.fit(X_train, y_train)</code> (almost nothing new up to here)
<h3>8) Then we run the fitted model on the test data to get predictions we can compare with the true species <code>model_test = model.predict(X_test)</code></h3>
<h3><b>9) Accuracy check:</b> <code>print(confusion_matrix(y_test, model_test))</code> <br>
<code>print(classification_report(y_test, model_test))</code> These lines print the confusion matrix and the classification report, which we use to evaluate the accuracy of the model</h3>
10) Lastly we want to predict the SPECIES of an individual penguin, so we create a list <code>penguin</code> holding its feature values in the same order as the columns of <code>X</code> <br>
11) <code>new_penguin = model.predict([penguin])</code> With this line we pass our sample penguin to the model and see the predicted species
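
One thing to keep in mind about step 5: the split is shuffled with no fixed seed, so the confusion matrix can change from run to run. Here is a small sketch of making the split repeatable with `random_state`. The data is made up and the seed value 42 is an arbitrary choice for illustration; the notebook itself does not set a seed.

```python
from sklearn.model_selection import train_test_split

# Made-up data (assumption): 10 rows, two classes
X_demo = [[i] for i in range(10)]
y_demo = ["Adelie"] * 5 + ["Gentoo"] * 5

# Same seed -> same shuffle, so both calls return identical splits
split_a = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=True, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=True, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(len(X_train_a), len(X_test_a))  # 70/30 split of 10 rows
print(split_a == split_b)
```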


# Things you didn't know:

Dropping NaN values: <code>df = df.dropna(axis=0, how='any')</code> tells pandas to exclude every data point that is incomplete. If any value in a row is NaN, the entire row is dropped.
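
To see what `how='any'` does, here is a small sketch on a made-up frame (the values are invented for illustration), compared with `how='all'`, which drops a row only when every value in it is missing.

```python
import numpy as np
import pandas as pd

# Made-up rows (assumption): row 2 is missing one value, row 3 is missing both
demo = pd.DataFrame({
    "bill_length_mm": [39.1, 40.3, np.nan],
    "sex": ["Male", None, None],
})

any_dropped = demo.dropna(axis=0, how="any")  # drop rows with at least one missing value
all_dropped = demo.dropna(axis=0, how="all")  # drop only rows where everything is missing

print(len(demo), len(any_dropped), len(all_dropped))
```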

Dependent Variable: We do not need to one-hot encode the dependent variable. scikit-learn classifiers accept string class labels (such as the species names) directly.
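
A quick sketch of that point on made-up one-feature data (the numbers are invented, only the species names come from the notebook): the model is fit on string labels and returns the same strings as predictions.

```python
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data (assumption): small values are Adelie, large are Gentoo
X_demo = [[0], [1], [2], [10], [11], [12]]
y_demo = ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"]

clf = LogisticRegression()
clf.fit(X_demo, y_demo)          # string labels are accepted as-is

print(clf.classes_)              # class names the model learned
print(clf.predict([[1], [11]]))  # predictions come back as the same strings
```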

Predictions: We run the fitted model on our test data to predict the species of each test penguin <code>model_test = model.predict(X_test)</code>

## ACCURACY CHECK!!! (We use the <b>Confusion Matrix</b> and <b>Classification Report</b> for all classification algorithms)

<code>print(confusion_matrix(y_test, model_test))</code> prints a table that counts the test penguins by actual species (rows) and predicted species (columns)

### The output will be something like this:
[[46  1  0]   # actual species N1 (Adelie): predicted as 46 species N1, 1 species N2, 0 species N3<br>
 [ 0 18  0]  # actual species N2 (Chinstrap): predicted as 0 species N1, 18 species N2, 0 species N3<br>
 [ 0  0 35]] # actual species N3 (Gentoo): predicted as 0 species N1, 0 species N2, 35 species N3

This is how accurately we predicted the test data. Our test data had 100 penguins (30% of the rows left after dropping the incomplete data points). We mispredicted only one penguin: one of the first species (Adelie) was predicted as the second species (Chinstrap).
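
The same reading can be done in a few lines of plain Python. This is only a sketch: the matrix is copied from the output above, and the species order is taken from the classification report.

```python
# Confusion matrix copied from the output above: rows = actual, columns = predicted
labels = ["Adelie", "Chinstrap", "Gentoo"]
matrix = [
    [46, 1, 0],
    [0, 18, 0],
    [0, 0, 35],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))  # diagonal = correct predictions
accuracy = correct / total

# Every non-zero off-diagonal cell is a group of misclassified penguins
errors = [
    (labels[i], labels[j], matrix[i][j])  # (actual, predicted, count)
    for i in range(len(labels))
    for j in range(len(labels))
    if i != j and matrix[i][j] > 0
]

print(accuracy)
print(errors)
```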

## Classification Report

<code>print(classification_report(y_test, model_test))</code> This gives us a detailed report with per-species metrics like <b>precision</b> and <b>recall</b>, with which we can judge how accurate our model is (Unfortunately you don't know how to properly use those yet)
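
As a first step towards using them, here is a rough sketch of where the numbers come from, computed by hand from the confusion matrix printed earlier. Precision for a species is correct predictions divided by everything predicted as that species (column sum). Recall is correct predictions divided by everything that actually is that species (row sum). F1 is the harmonic mean of the two.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
labels = ["Adelie", "Chinstrap", "Gentoo"]
matrix = [[46, 1, 0], [0, 18, 0], [0, 0, 35]]

report = {}
for k, name in enumerate(labels):
    true_positive = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column sum
    actually_k = sum(matrix[k])                      # row sum (the "support")
    precision = true_positive / predicted_as_k
    recall = true_positive / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2))

for name, scores in report.items():
    print(name, scores)  # (precision, recall, f1)
```

The rounded values match the precision, recall and f1-score columns of the report printed by the notebook.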

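### The output will be something like this:
<pre>
              precision    recall  f1-score   support

      Adelie       1.00      0.98      0.99        47
   Chinstrap       0.95      1.00      0.97        18
      Gentoo       1.00      1.00      1.00        35

    accuracy                           0.99       100
   macro avg       0.98      0.99      0.99       100
weighted avg       0.99      0.99      0.99       100
</pre>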

# Some important things to remember:

![Multinomial Logistic Regression.png](attachment:b1b77295-eda1-4d75-a310-845228ff9954.png)

![Logistic Regression Graph.png](attachment:4f018862-5eda-4968-a877-e93112e3653c.png)