**Part A: Credit Card Fraud Detection using Logistic Regression**

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Data Preprocessing and Analysis
creditcard_data = pd.read_csv("creditcard.csv")

# Assuming the 'Class' column represents the target variable (0 for non-fraud, 1 for fraud)
X = creditcard_data.drop(columns=["Class"])
y = creditcard_data["Class"]

# Split the data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 2: Building the Logistic Regression Model with increased max_iter
logistic_model = LogisticRegression(max_iter=1000, solver='lbfgs')
logistic_model.fit(X_train, y_train)

# Step 3: Evaluation
y_pred = logistic_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(classification_rep)


Accuracy: 0.999133925541004
Confusion Matrix:
[[85283    24]
 [   50    86]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85307
           1       0.78      0.63      0.70       136

    accuracy                           1.00     85443
   macro avg       0.89      0.82      0.85     85443
weighted avg       1.00      1.00      1.00     85443



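These results come from the logistic regression model trained with the increased `max_iter`. <br>
The accuracy is very high at approximately **99.91%**, but accuracy is misleading on this dataset: only 136 of the 85,443 test transactions are fraudulent, so a model that always predicts class 0 would already score about 99.84%. <br>
The confusion matrix and classification report give a clearer picture: non-fraudulent transactions (class 0) are classified almost perfectly, while for fraudulent transactions (class 1) precision is 0.78 and recall is 0.63. The model catches 86 of the 136 fraud cases, misses 50, and raises 24 false alarms.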
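
To make the class 1 figures easier to verify, here is a minimal sketch that recomputes them by hand from the confusion matrix printed above. The numbers are copied from that output; the variable names and the "always predict non-fraud" baseline are illustrative assumptions, not part of the original notebook.

```python
# Confusion matrix copied from the output above
# (rows: actual class, columns: predicted class)
conf_matrix = [[85283, 24],
               [50, 86]]

(tn, fp), (fn, tp) = conf_matrix
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # share of flagged transactions that are truly fraud
recall = tp / (tp + fn)      # share of actual fraud cases that the model catches
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that labels every transaction as non-fraud
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f} baseline={baseline_accuracy:.4f}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The recomputed precision, recall, and F1 should agree with the class 1 row of the classification report, and the small gap between the model's accuracy and the baseline shows why accuracy alone is a weak measure here.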

**Part B: Insurance Cost Prediction using Linear Regression**

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Step 1: Data Preprocessing and Analysis
insurance_data = pd.read_csv("insurance.csv")

X = insurance_data.drop(columns=["charges"])
y = insurance_data["charges"]

# Identify categorical columns
categorical_cols = ["sex", "smoker", "region"]

# Perform one-hot encoding on categorical columns
preprocessor = ColumnTransformer(transformers=[('cat', OneHotEncoder(), categorical_cols)], remainder='passthrough')
X = preprocessor.fit_transform(X)

# Split the data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 2: Building the Linear Regression Model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Step 3: Evaluation
y_pred = linear_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("R-squared:", r2)


Mean Absolute Error: 4145.4505556276035
Mean Squared Error: 33780509.57479167
R-squared: 0.7696118054369009


-> **Mean Absolute Error (MAE):** The MAE represents the average absolute difference between the actual and predicted values. <br>In this case, the average absolute difference is approximately **$4,145.45**. Smaller MAE values indicate better performance.

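-> **Mean Squared Error (MSE):** The MSE measures the average squared difference between the actual and predicted values. <br>In this case, it is approximately **33,780,509.57**. This value is in squared dollars, not dollars, so it cannot be compared directly with the MAE; its square root (the RMSE) is about **$5,812**, which is in the same units as the charges. Smaller MSE values indicate better performance.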

-> **R-squared (Coefficient of Determination):** The R-squared value measures how well the model explains the variance in the target variable (insurance costs). <br>Its maximum is 1, which represents a perfect fit; 0 corresponds to always predicting the mean, and on held-out data it can be negative when the model does worse than that. In this case, the R-squared value is approximately **0.77**, indicating that the model explains about **77%** of the variance in the test-set insurance costs.
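
The three metrics above are easier to compare once the MSE is converted back to the units of the target. The following is a small sketch that uses the values copied from the printed output; the variable names `rmse` and `unexplained_share` are illustrative and do not appear in the original notebook.

```python
import math

# Metric values copied from the output above
mae = 4145.4505556276035
mse = 33780509.57479167
r2 = 0.7696118054369009

# The square root brings the MSE back to dollars so it is comparable with the MAE
rmse = math.sqrt(mse)

# Share of the variance in charges that the model leaves unexplained
unexplained_share = 1 - r2

print(f"MAE:  {mae:,.2f}")
print(f"RMSE: {rmse:,.2f}")
print(f"Unexplained variance: {unexplained_share:.1%}")
```

An RMSE noticeably larger than the MAE, as here, suggests that a minority of predictions have large errors, since squaring penalizes those more heavily.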