Problem Statement No. 08
1. Implement logistic regression using Python/R to perform classification on the Social_Network_Ads.csv dataset.
2. Compute the confusion matrix to find TP, FP, TN, FN, accuracy, error rate, precision, and recall on the given dataset.

Use: Social_Network_Ads.csv

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [2]:
# Load the dataset named in the problem statement
data = pd.read_csv('Social_Network_Ads.csv')

In [3]:
# Splitting the dataset into features and target variable
X = data.iloc[:, [2, 3]].values  # Features: Age and EstimatedSalary
y = data.iloc[:, 4].values  # Target variable: Purchased

In [4]:
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [5]:
# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
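
The scaling step above can be illustrated with a minimal pure-Python sketch. It assumes the usual standardization formula, (x - mean) / std with the population standard deviation, which is what StandardScaler applies per feature. The helper names and the sample ages are hypothetical and are not taken from the dataset.

```python
import statistics

def fit_standardizer(column):
    """Return the mean and population standard deviation of one feature column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std (divides by n, not n - 1)
    return mean, std

def standardize(column, mean, std):
    """Scale each value to (x - mean) / std using previously fitted statistics."""
    return [(x - mean) / std for x in column]

# Hypothetical ages, for illustration only
train_age = [20.0, 30.0, 40.0]
test_age = [50.0]

# Fit on the training values only, then reuse the same statistics on the test values
mean, std = fit_standardizer(train_age)
train_scaled = standardize(train_age, mean, std)
test_scaled = standardize(test_age, mean, std)

print(train_scaled)
print(test_scaled)
```

The test values are scaled with the training mean and standard deviation, mirroring the fit_transform / transform split in the cell above, so no information from the test set leaks into the scaling.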

In [6]:
# Fitting Logistic Regression to the Training set
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

In [7]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [9]:
# Creating the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
TN, FP, FN, TP = cm.ravel()

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Error Rate
error_rate = 1 - accuracy

# Precision
precision = precision_score(y_test, y_pred)

# Recall
recall = recall_score(y_test, y_pred)

# F1 Score
f1 = f1_score(y_test, y_pred)

print("Confusion Matrix:")
print(cm)
print("True Positives:", TP)
print("False Positives:", FP)
print("True Negatives:", TN)
print("False Negatives:", FN)
print("Accuracy:", accuracy)
print("Error Rate:", error_rate)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Confusion Matrix:
[[65  3]
 [ 8 24]]
True Positives: 24
False Positives: 3
True Negatives: 65
False Negatives: 8
Accuracy: 0.89
Error Rate: 0.10999999999999999
Precision: 0.8888888888888888
Recall: 0.75
F1 Score: 0.8135593220338982
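
As a cross-check, the reported metrics can be recomputed by hand from the four counts in the confusion matrix printed above. This is a small sketch that hard-codes those counts; it does not rerun the model.

```python
# Counts copied from the confusion matrix printed above
TN, FP, FN, TP = 65, 3, 8, 24

total = TN + FP + FN + TP

accuracy = (TP + TN) / total      # share of correct predictions
error_rate = (FP + FN) / total    # share of incorrect predictions
precision = TP / (TP + FP)        # correct positives among predicted positives
recall = TP / (TP + FN)           # correct positives among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:   {accuracy:.4f}")
print(f"Error rate: {error_rate:.4f}")
print(f"Precision:  {precision:.4f}")
print(f"Recall:     {recall:.4f}")
print(f"F1 score:   {f1:.4f}")
```

The values agree with the scikit-learn output up to floating-point rounding, which is also why the error rate above prints as 0.10999999999999999 rather than 0.11.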



Explanation of each aspect:

    1. Logistic Regression:
        ◦ Logistic regression is a type of regression analysis used for predicting the probability of a binary outcome. It's commonly used for classification problems.
        ◦ Unlike linear regression, which predicts a continuous outcome, logistic regression predicts the probability of a categorical outcome (e.g., whether a customer will buy a product or not).
        ◦ It uses the logistic function (also known as the sigmoid function) to model the probability of the outcome as a function of the input features.
    2. Dataset:
        ◦ The dataset used here is Social_Network_Ads.csv. The code takes Age and EstimatedSalary (columns 2 and 3) as features and Purchased (column 4) as the target; the remaining columns, likely a user ID and gender, are not used.
    3. Confusion Matrix:
        ◦ A confusion matrix is a table that describes the performance of a classification model on test data for which the true values are known.
        ◦ In scikit-learn's confusion_matrix, each row represents an actual class and each column a predicted class, so for the 0/1 labels used here cm.ravel() returns the counts in the order TN, FP, FN, TP.
        ◦ It helps to visualize the performance of an algorithm by showing the counts of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions.
    4. Evaluation Metrics:
        ◦ True Positive (TP): The number of correctly predicted positive instances.
        ◦ True Negative (TN): The number of correctly predicted negative instances.
        ◦ False Positive (FP): The number of instances that were incorrectly predicted as positive when they are actually negative (Type I error).
        ◦ False Negative (FN): The number of instances that were incorrectly predicted as negative when they are actually positive (Type II error).
        ◦ Accuracy: The ratio of correctly predicted instances to the total instances in the test set ((TP + TN) / (TP + TN + FP + FN)).
        ◦ Error Rate: The ratio of incorrectly predicted instances to the total instances in the test set ((FP + FN) / (TP + TN + FP + FN)), which equals 1 - Accuracy.
        ◦ Precision: The ratio of correctly predicted positive observations to the total predicted positives (TP / (TP + FP)).
        ◦ Recall (Sensitivity): The ratio of correctly predicted positive observations to all observations in the actual positive class (TP / (TP + FN)).
        ◦ F1 Score: The harmonic mean of precision and recall (2 * Precision * Recall / (Precision + Recall)). It balances the two.
    5. Implementation:
        ◦ In the provided code, logistic regression is implemented using Python's scikit-learn library.
        ◦ The dataset is split into training and testing sets, features are scaled using standardization, and logistic regression is fitted to the training set.
        ◦ Predictions are made on the test set, and a confusion matrix is created.
        ◦ Accuracy, precision, recall, and F1 score are computed with scikit-learn's metric functions from the actual and predicted labels; the error rate is taken as 1 - accuracy, and TP, FP, TN, and FN are read from the confusion matrix.
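
The logistic (sigmoid) function described in item 1 can be sketched in a few lines of plain Python. This is only an illustration of how a linear score becomes a probability and then a class label. The weights, bias, and feature values are hypothetical and are not the coefficients learned by the classifier above.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for large negative scores
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    """Predict class 1 when the probability exceeds the threshold, else class 0."""
    return 1 if predict_proba(features, weights, bias) > threshold else 0

# Hypothetical coefficients for (scaled Age, scaled EstimatedSalary)
weights = [2.0, 1.0]
bias = -1.0

likely_buyer = [1.0, 0.5]     # score z = 1.5
unlikely_buyer = [-1.0, 0.0]  # score z = -3.0

print(predict_proba(likely_buyer, weights, bias), predict_label(likely_buyer, weights, bias))
print(predict_proba(unlikely_buyer, weights, bias), predict_label(unlikely_buyer, weights, bias))
```

A threshold of 0.5 on the probability corresponds to a score of zero, which is the decision rule a binary logistic regression classifier uses by default.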
