Part A:

Problem Statement: Consider the Credit Card Fraud Detection dataset from Kaggle and build a machine-learning model that detects whether a credit card transaction is fraudulent. Demonstrate the steps of data preprocessing and analysis, apply a 70/30 train/test split, use logistic regression to build the model, and evaluate its accuracy.

Building a credit card fraud detection model involves several stages: data preprocessing, analysis, model building, and evaluation.

Step 1: Data preprocessing and analysis

Import the required libraries, then load the data.

Explore the dataset to learn about its features, summary statistics, and structure.

Check for missing values and take the proper action (such as impute or remove) if any are found.

If necessary, carry out feature engineering.

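Because fraudulent transactions are rare relative to legitimate ones, the dataset is highly imbalanced; it should be balanced (for example by resampling), or the imbalance should at least be accounted for during evaluation.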

70% of the data should be used for training and 30% should be used for testing.
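
As a minimal sketch of the imbalance check and the 70/30 split described above, the snippet below uses a small synthetic stand-in rather than the Kaggle file (the 1,000 rows and the 2% fraud rate are assumptions for illustration). It also passes stratify, which the notebook code that follows does not use; stratification keeps the fraud rate the same in the training and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data: 1,000 rows, 2% positives (assumed numbers).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 20 + [0] * 980)

# Check how imbalanced the target is before splitting.
print("Fraud rate overall:", y_demo.mean())

# stratify=y_demo preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("Fraud cases in train:", int(y_tr.sum()), "of", len(y_tr))
print("Fraud cases in test:", int(y_te.sum()), "of", len(y_te))
```

Without stratification, a rare class can end up over- or under-represented in the test set purely by chance.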

Step 1: Data processing

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset (pandas can read a zip archive directly when it contains a single CSV file)
data = pd.read_csv(r"C:\Users\asus\Downloads\archive (1).zip")

# Check for missing values
print(data.isnull().sum())

# Splitting the data into features (X) and target variable (y)
X = data.drop("Class", axis=1)
y = data["Class"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the features: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64


Step 2: Model Building (using polynomial Kernel)

In [5]:
# Import the necessary libraries
from sklearn.svm import SVC

# Create an SVM model with Polynomial kernel
model_poly = SVC(kernel='poly', random_state=42)

# Train the model using the training data
model_poly.fit(X_train, y_train)


Step 3: Model Evaluation

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Make predictions on the test data for the polynomial kernel
y_pred_poly = model_poly.predict(X_test)

# Evaluate the model's performance for the polynomial kernel
accuracy_poly = accuracy_score(y_test, y_pred_poly)
print("Accuracy (Polynomial Kernel):", accuracy_poly)

# Print classification report and confusion matrix for the polynomial kernel
print("Classification Report (Polynomial Kernel):")
print(classification_report(y_test, y_pred_poly))

print("Confusion Matrix (Polynomial Kernel):")
print(confusion_matrix(y_test, y_pred_poly))

Accuracy (Polynomial Kernel): 0.9994850368081645
Classification Report (Polynomial Kernel):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85307
           1       0.90      0.76      0.83       136

    accuracy                           1.00     85443
   macro avg       0.95      0.88      0.91     85443
weighted avg       1.00      1.00      1.00     85443

Confusion Matrix (Polynomial Kernel):
[[85295    12]
 [   32   104]]
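
The reported metrics can be re-derived by hand from the confusion matrix above. The following sketch does that arithmetic in plain Python and, as an added comparison that is not part of the original analysis, computes the accuracy of a trivial baseline that labels every transaction as legitimate. The variable names are illustrative.

```python
# Confusion matrix reported above: rows = actual class, columns = predicted class.
tn, fp = 85295, 12   # actual legitimate: correctly passed, falsely flagged
fn, tp = 32, 104     # actual fraud: missed, correctly caught

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the flagged transactions, how many were fraud
recall = tp / (tp + fn)      # of the fraud cases, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a model that predicts "legitimate" for every transaction.
baseline_accuracy = (tn + fp) / total

print(f"accuracy          = {accuracy:.6f}")
print(f"precision (fraud) = {precision:.2f}")
print(f"recall (fraud)    = {recall:.2f}")
print(f"f1 (fraud)        = {f1:.2f}")
print(f"baseline accuracy = {baseline_accuracy:.6f}")
```

Because only 136 of the 85,443 test transactions are fraudulent, the trivial baseline already reaches about 99.84% accuracy, so precision and recall on class 1 are more informative than accuracy here.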


Code using sigmoid kernel:

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create an SVM model with Sigmoid kernel
model_sigmoid = SVC(kernel='sigmoid', random_state=42)

# Define hyperparameters to tune
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto', 0.1, 1],
    'coef0': [0.1, 0.5, 1]
}

# Create GridSearchCV object for hyperparameter tuning
grid_search = GridSearchCV(model_sigmoid, param_grid, cv=5, n_jobs=-1, verbose=1)

# Train the model using the training data with hyperparameter tuning
grid_search.fit(X_train, y_train)

# Get the best hyperparameters found during the tuning process
best_params = grid_search.best_params_

# Create an SVM model with the best hyperparameters
model_sigmoid_best = SVC(kernel='sigmoid', C=best_params['C'], gamma=best_params['gamma'],
                         coef0=best_params['coef0'], random_state=42)

# Train the model using the training data with the best hyperparameters
model_sigmoid_best.fit(X_train, y_train)

# Make predictions on the test data for the sigmoid kernel
y_pred_sigmoid_best = model_sigmoid_best.predict(X_test)

# Evaluate the model's performance for the sigmoid kernel
accuracy_sigmoid_best = accuracy_score(y_test, y_pred_sigmoid_best)
print("Accuracy (SVM with Sigmoid Kernel - Tuned):", accuracy_sigmoid_best)

# Print classification report and confusion matrix for the sigmoid kernel
print("Classification Report (SVM with Sigmoid Kernel - Tuned):")
print(classification_report(y_test, y_pred_sigmoid_best))

print("Confusion Matrix (SVM with Sigmoid Kernel - Tuned):")
print(confusion_matrix(y_test, y_pred_sigmoid_best))

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Accuracy (SVM with Sigmoid Kernel - Tuned): 0.9986891846026006
Classification Report (SVM with Sigmoid Kernel - Tuned):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85307
           1       0.60      0.52      0.56       136

    accuracy                           1.00     85443
   macro avg       0.80      0.76      0.78     85443
weighted avg       1.00      1.00      1.00     85443

Confusion Matrix (SVM with Sigmoid Kernel - Tuned):
[[85260    47]
 [   65    71]]
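
Two details of the tuning step may be worth illustrating. GridSearchCV ranks candidates by the estimator's default score (accuracy for SVC) unless scoring is set, and with the default refit=True it retrains the best candidate on the full training data and exposes it as best_estimator_. The sketch below shows both on a small synthetic dataset; the data, the reduced grid, and the choice of scoring="f1" are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic imbalanced dataset (assumed stand-in, roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, weights=[0.9, 0.1], random_state=42
)

# Reduced grid so the example runs quickly.
param_grid = {"C": [0.1, 1], "gamma": ["scale", 0.1]}

# scoring="f1" ranks candidates by F1 on the positive class instead of accuracy.
search = GridSearchCV(SVC(kernel="sigmoid"), param_grid, cv=3, scoring="f1")
search.fit(X_demo, y_demo)

# With refit=True (the default), best_estimator_ is already trained on all of
# X_demo, so no separate SVC needs to be constructed and fitted.
best_model = search.best_estimator_
print("Best parameters:", search.best_params_)
print("Predictions for the first five rows:", best_model.predict(X_demo[:5]))
```

On heavily imbalanced data, selecting by accuracy can favour candidates that rarely predict the minority class, which is one reason to consider a different scoring option.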


Part B:

Problem Statement: Use the following insurance dataset and build a predictive system to predict insurance costs. Demonstrate the steps of data preprocessing and analysis, apply a 70/30 train/test split, use linear regression to build the model, and evaluate the accuracy of the predicted insurance cost.

Library Imports: We import the required libraries: pandas to load the data, train_test_split from sklearn.model_selection to split the dataset into training and test sets, LinearRegression from sklearn.linear_model to build the linear regression model, and mean_squared_error and r2_score from sklearn.metrics to evaluate the model.

Load the Data: Using the pandas read_csv function, we load the insurance dataset and store it in the 'data' variable.

Data preparation: We preprocess the data to get it ready for modelling.
One-hot encoding: To turn categorical data (such as gender, smoking status, and location) into numerical values (dummy variables), we use pd.get_dummies. Passing drop_first=True drops one dummy column per categorical feature, which avoids perfectly collinear columns. We then separate the target variable (y), which is the 'charges' column, from the feature variables (X).
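
To make the encoding step concrete, here is a small sketch on a three-row sample. The rows and the column names (age, smoker, region) are hypothetical placeholders, not taken from the insurance file.

```python
import pandas as pd

# Tiny hypothetical sample; column names and values are assumptions for illustration.
sample = pd.DataFrame({
    "age": [19, 33, 45],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northwest", "southeast"],
})

# drop_first=True drops the first category (in sorted order) of each categorical column.
encoded = pd.get_dummies(sample, drop_first=True)
print(encoded.columns.tolist())
```

The dropped categories ("no" for smoker, "northwest" for region) become the implicit baseline: a row with all of that feature's dummy columns set to zero belongs to the dropped category.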

Train-Test Split: Using the train_test_split function from sklearn.model_selection, we divide the data into training and test sets.
70% of the data goes into the training set and 30% into the test set.
random_state is set to 42 to ensure reproducibility.

Create the linear regression model and fit it: We create a LinearRegression object and assign it to the variable "model".
We then fit the model to the training data (X_train, y_train) using the fit method.
During fitting, the model learns the relationship between the features and the target variable.

Make predictions: Using the predict method, we use the trained model to make predictions on the test data (X_test).

Model assessment: Mean Squared Error (MSE) and the R-squared (R2) score are the two measures we use to assess how accurate the model's predictions are.
MSE is the average squared difference between the predicted values and the actual values. A lower MSE indicates better performance.
The R2 score measures the proportion of the target variable's variance that can be explained by the features. A higher R2 indicates better performance; its maximum is 1, and it can be negative on test data when the model does worse than always predicting the mean.
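
Both metrics can be computed by hand, which may help show what they measure. The sketch below uses four made-up actual and predicted values (not from the insurance data) and plain Python rather than sklearn.metrics.

```python
# Hypothetical actual and predicted values, for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_hat = [110.0, 190.0, 310.0, 390.0]

n = len(y_true)

# MSE: mean of the squared residuals.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
mse_manual = ss_res / n

# R2: 1 minus the ratio of residual variation to total variation around the mean.
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual)
print("R2:", r2_manual)
```

If the predictions were no better than always predicting mean_true, ss_res would equal ss_tot and R2 would be 0; predictions worse than that give a negative R2.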



In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the insurance dataset
data = pd.read_csv(r"C:\Users\asus\Downloads\insurance.csv")

# Convert categorical variables into dummy variables
data = pd.get_dummies(data, drop_first=True)

# Separate features (X) and target variable (y)
X = data.drop('charges', axis=1)
y = data['charges']

# Split the data into training and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions using the test data
y_pred = model.predict(X_test)

# Calculate the Mean Squared Error (MSE) and R-squared (R2) score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R2) Score:", r2)

Mean Squared Error (MSE): 33780509.57479163
R-squared (R2) Score: 0.7696118054369012
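
The MSE is expressed in squared units of the target, which makes it hard to read. As a small worked step that is not part of the original run, taking the square root gives the root mean squared error (RMSE) in the same units as charges:

```python
import math

# MSE reported by the run above (units: charges squared).
mse_reported = 33780509.57479163

# RMSE is in the same units as the target, so it is easier to interpret.
rmse = math.sqrt(mse_reported)
print(f"RMSE: {rmse:.2f}")
```

An RMSE of roughly 5,812 indicates that the typical prediction error is on the order of 5,800 in the units of the charges column, alongside an R2 of about 0.77.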
