# Logistic Regression Algorithm using Scikit Learn

__What is Logistic Regression and when should you use it?__

Logistic Regression is a supervised machine learning algorithm used for classification problems. Despite its name, it is a classification method: it fits a statistical model that estimates the probability of a binary outcome.

The logistic regression model is a linear model: it computes a weighted sum of the input features and passes it through the logistic function (also called the sigmoid function) to get a value between 0 and 1, which can be interpreted as the probability that the input belongs to a certain class. The model then predicts the class from this probability, typically by comparing it with a threshold such as 0.5.
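The mapping from features to a probability can be sketched in a few lines of NumPy. This is only an illustration: the `weights`, `bias`, and example values are made up, not learned from data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a model with two input features
weights = np.array([0.8, -0.5])
bias = 0.1

x = np.array([2.0, 1.0])                    # one example with two features
z = np.dot(weights, x) + bias               # linear combination of the features
probability = sigmoid(z)                    # estimated probability of class 1
predicted_class = int(probability >= 0.5)   # threshold the probability at 0.5

print(probability, predicted_class)
```

In a fitted scikit-learn model, the weights and bias play the role of `coef_` and `intercept_`.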

The logistic regression algorithm is particularly useful when you want to predict a binary outcome (e.g. yes/no, true/false, 0/1) based on a set of input features. It is a simple and interpretable model that can be easily regularized to avoid overfitting. Additionally, it is easy to implement and computationally efficient.

Logistic Regression is commonly used in:

- Medical field, to predict patient outcomes based on their medical history and other factors
- Marketing, to predict the likelihood of a customer purchasing a product or service
- Credit risk analysis, to predict the likelihood of a loan applicant defaulting on a loan
- Natural Language Processing, to classify text into different categories

It is important to note that Logistic Regression is not suitable for highly non-linear decision boundaries. In that case, other models such as Random Forest or SVM should be considered.

__When shouldn't we use the Logistic Regression Algorithm?__

There are several situations where logistic regression may not be the best choice of algorithm:

__Non-linear decision boundaries:__ Logistic regression is a linear model, and it may not be able to accurately capture non-linear relationships between the input features and the output. In cases where the decision boundary is highly non-linear, other algorithms such as decision trees, random forests, or support vector machines may be more appropriate.
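The limitation can be seen on a small synthetic example. The sketch below assumes scikit-learn's `make_circles` toy data (not used elsewhere in this notebook) and compares plain logistic regression with the same model fed squared features. It is an illustration, not a recommended recipe.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: one class forms a ring around the other
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Plain logistic regression can only draw a straight line through the plane
linear_model = LogisticRegression()
linear_model.fit(X_train, y_train)
linear_acc = linear_model.score(X_test, y_test)

# Adding squared and interaction terms lets the same linear model separate the rings
poly_model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
poly_model.fit(X_train, y_train)
poly_acc = poly_model.score(X_test, y_test)

print(f"linear: {linear_acc:.2f}, polynomial: {poly_acc:.2f}")
```

Hand-made feature transformations like this only work when you already know the shape of the boundary. Otherwise the tree-based models or SVMs mentioned above are usually the simpler choice.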

__Multi-class classification problems:__ In its basic form, logistic regression handles binary classification, where the output takes only two possible values. For a multi-class problem (i.e. more than two output classes), you need an extension such as one-vs-all (one-vs-rest) or multinomial logistic regression, or you may want to consider other algorithms such as decision trees or Random Forest.
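As a rough sketch of the one-vs-all idea, scikit-learn's `OneVsRestClassifier` can wrap a binary logistic regression and train one model per class. The iris data and `max_iter=1000` are choices made for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # three classes

# Train one binary logistic regression per class (class k vs. the rest)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(len(ovr.estimators_))   # number of binary models that were trained
print(ovr.predict(X[:3]))     # predicted class labels for the first three flowers
```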

__High-dimensional data:__ When the number of input features is large relative to the number of training examples, plain (unregularized) logistic regression tends to overfit and its coefficient estimates become unstable. In these cases, regularized logistic regression (L1 or L2), dimensionality reduction, or methods such as linear discriminant analysis (LDA) may be more appropriate.

__Large number of categorical variables:__ Logistic regression requires numerical input features, so categorical variables must be encoded first (for example with one-hot encoding). Variables with a large number of categories then produce many sparse columns, which can make the model harder to fit and interpret. In these cases, algorithms such as decision trees or Random Forest may be more appropriate.
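Categorical features are usually converted to numbers before a logistic regression model is fitted. Here is a minimal sketch using one-hot encoding. The `colors` and `bought` arrays are hypothetical toy data invented for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a single categorical feature and a binary label
colors = np.array([["red"], ["blue"], ["green"], ["red"], ["blue"], ["green"]])
bought = np.array([1, 0, 0, 1, 0, 1])

# One-hot encode the categories, then fit logistic regression on the 0/1 columns
clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LogisticRegression())
clf.fit(colors, bought)

encoded = clf.named_steps["onehotencoder"].transform(colors)
print(encoded.shape)                      # one column per distinct category
print(clf.predict([["red"], ["blue"]]))   # predictions for two categories
```

Each distinct category becomes its own column, so a feature with thousands of categories adds thousands of columns.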

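__Large number of missing values:__ Logistic regression cannot use missing values directly (scikit-learn's `LogisticRegression` raises an error when the input contains NaN), so they must be imputed or the affected rows dropped. With a large number of missing values, this can distort the model. In these cases, algorithms whose implementations can handle missing values natively, such as Random Forest or XGBoost, may be more appropriate.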
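When only a moderate share of values is missing, a common workaround is to impute them before fitting. The sketch below assumes mean imputation with scikit-learn's `SimpleImputer`. The data is a hypothetical toy array, and mean imputation is just one of several possible strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data with missing entries marked as np.nan
X = np.array([[1.0, 2.0], [np.nan, 1.5], [3.0, np.nan],
              [4.0, 5.0], [0.5, 1.0], [3.5, 4.5]])
y = np.array([0, 0, 1, 1, 0, 1])

# Replace each missing value with its column mean, then fit the classifier
clf = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
clf.fit(X, y)

# The same imputation is applied automatically at prediction time
print(clf.predict([[np.nan, 4.0]]))
```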

It's always important to consider the characteristics of your data and problem in order to choose the most appropriate algorithm.

In [1]:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
# load the iris dataset
iris = datasets.load_iris()
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
        ...

In [3]:
# Splitting the data set into X and y
X = iris.data[:, :2] # we only take the first two features (sepal length and sepal width)
y = iris.target

In [4]:
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
# initialize the model
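# solver specifies the optimization algorithm; multi_class specifies how multi-class problems are handled
# ('auto' uses multinomial here because the target has three classes and the solver is lbfgs).
# Note: multi_class is deprecated in scikit-learn 1.5 and later, where it can simply be omitted.
model = LogisticRegression(solver='lbfgs', multi_class='auto')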

In [6]:
# train the model
model.fit(X_train, y_train)

LogisticRegression()

In [7]:
# make predictions on the test set
y_pred = model.predict(X_test)

In [8]:
# evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.9
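
A single accuracy number hides which classes are being confused. As a possible follow-up, the sketch below repeats the steps above in one self-contained block and adds a confusion matrix and per-class report. It uses the default `LogisticRegression()` settings, which is an assumption based on the `LogisticRegression()` output shown above.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Same data preparation as in the cells above
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall and F1 score for each iris species
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```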


__Advantages of the Logistic Regression Algorithm__

Logistic Regression has several advantages, including:

__Simple and interpretable:__ Logistic regression is a simple and interpretable model that is easy to understand and explain to others. It provides a clear relationship between the input features and the probability of the output belonging to a certain class.
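One way to read a fitted model is through its coefficients. In the sketch below, the binary target "setosa vs. not setosa" is constructed just for this illustration. Each coefficient is the change in log-odds per unit of its feature, and exponentiating it gives an odds ratio.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
# Binary problem for illustration: is the flower Iris setosa (class 0) or not?
X = iris.data[:, :2]
y = (iris.target == 0).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Each coefficient is the change in log-odds per unit increase of its feature;
# exponentiating gives the multiplicative change in the odds.
for name, coef in zip(iris.feature_names[:2], model.coef_[0]):
    print(f"{name}: coefficient={coef:.2f}, odds ratio={np.exp(coef):.2f}")
```

A positive coefficient raises the estimated probability of the positive class as the feature increases, and a negative one lowers it.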

__Computationally efficient:__ Logistic regression is a fast and efficient algorithm that can handle large amounts of data. It is also easy to implement in most programming languages, including Python, R, and MATLAB.

__Can be regularized to avoid overfitting:__ Logistic regression can be regularized to prevent overfitting by adding a penalty term to the cost function. This can be done by adding L1 or L2 regularization to the model.
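In scikit-learn the strength of the penalty is controlled by the `C` parameter of `LogisticRegression`, which is the inverse of the regularization strength. The sketch below uses two arbitrary values of `C` to show how a stronger penalty shrinks the coefficients.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# C is the inverse of the regularization strength:
# a smaller C means a stronger (L2 by default) penalty on the coefficients.
strong = LogisticRegression(C=0.01, max_iter=5000).fit(X, y)
weak = LogisticRegression(C=100.0, max_iter=5000).fit(X, y)

strong_size = np.abs(strong.coef_).sum()   # coefficients shrunk toward zero
weak_size = np.abs(weak.coef_).sum()       # much larger coefficients

print(strong_size, weak_size)
```

A suitable value of `C` is normally chosen by cross-validation rather than by hand.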

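__Can handle categorical and numerical data:__ Logistic regression works directly on numerical data and can use categorical data once it has been encoded as numbers (for example with one-hot encoding). Missing values have to be imputed or removed before training.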

__Can be used for binary and multi-class classification:__ Logistic regression can be used for both binary and multi-class classification problems. One-vs-all or multinomial logistic regression can be used for multi-class classification problems.

__Good performance on small datasets:__ Logistic Regression is a good algorithm to use when you have a small dataset. Because it has few parameters, it can still deliver good performance in those scenarios.

__Logistic Regression can be used for online learning:__ Logistic Regression can be updated on the fly as new data arrives and can be used for online learning, which is useful in many scenarios such as spam detection or credit risk analysis.

__Logistic Regression is a well-established algorithm:__ Logistic Regression is a well-established algorithm that has been widely used in many fields such as medical research, marketing, and finance.

__Disadvantages of the Logistic Regression Algorithm__

Logistic Regression also has some disadvantages, including:

__Assumes linearity:__ Logistic regression assumes a linear relationship between the input features and the log-odds of the output. If the relationship is non-linear, the model may not be able to accurately capture the relationship and may not perform well.
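Written out for a binary outcome with probability $p$ and features $x_1, \dots, x_n$, the assumption is that the log-odds are a linear function of the features. The coefficient symbols $\beta_0, \dots, \beta_n$ are notation introduced here for illustration.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}
```

The right-hand form is the sigmoid function applied to the same linear combination.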

__Assumes little multicollinearity between features:__ Logistic regression works best when the input features are not strongly correlated with each other. Highly correlated features make the coefficient estimates unstable and hard to interpret. Interactions between features are also not captured unless they are added explicitly as extra features, so the model may not perform well when such interactions matter.

__Sensitive to outliers:__ Logistic regression can be sensitive to outliers: extreme feature values or mislabeled points in the training data can pull the decision boundary and noticeably affect the model's predictions, especially on small datasets.

__Not suitable for a large number of categorical variables:__ Categorical variables must be encoded as numbers, and variables with many categories produce many extra columns. The model may not work well when there is a large number of such variables in the input data.

__Limited to binary or multi-class classification:__ Logistic regression is primarily used for binary and multi-class classification problems, and it may not be suitable for other types of problems such as regression or clustering.

__Logistic Regression requires a sufficiently large sample size:__ Logistic Regression needs enough data to estimate its coefficients reliably. A common rule of thumb is to have at least about 10 examples of the least frequent outcome class for each independent variable.

__Logistic Regression is not suitable for highly non-linear decision boundaries:__ In cases where the decision boundary is highly non-linear, other algorithms such as decision trees, random forests, or support vector machines may be more appropriate.

__Logistic Regression is sensitive to irrelevant features:__ Logistic Regression can be sensitive to irrelevant features, which can decrease the performance of the model. It is important to select the relevant features for the model.

##### Md. Ashiqur Rahman
##### Thank You