# Machine Learning Financial Fraud Detector

### Version 1: Financial Fraud Detection Using Text Classification

I developed a machine learning model that flags fraudulent financial statements using natural language processing. The project uses a Kaggle dataset of financial filings from companies with known fraudulent and non-fraudulent histories. A TF-IDF vectorizer converts the filing text into numerical features, which are fed into a Logistic Regression classifier. I tuned the TF-IDF and Logistic Regression hyperparameters with GridSearchCV using 5-fold cross-validation. The best model reached a cross-validation accuracy of 0.82 and an accuracy of about 94% on the held-out test set of 34 filings. Because the test set is small, the 94% figure is a promising but noisy estimate rather than firm evidence of effectiveness.

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

In [11]:
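# Load the Kaggle dataset of filings and fraud labels
df = pd.read_csv("Final_Dataset.csv")

# 'Fillings' (sic) is the dataset's column holding the filing text
X = df['Fillings']
# 'Fraud' holds the label ('yes' or 'no')
y = df['Fraud']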

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
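
With a test set this small, a purely random split can shift the yes/no ratio between train and test. The sketch below shows one possible variation, not what the notebook ran: passing `stratify=y` to `train_test_split` so both splits keep the same class ratio. The toy DataFrame is hypothetical and only reuses the notebook's column names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for Final_Dataset.csv (same column names).
toy = pd.DataFrame({
    "Fillings": [f"filing text {i}" for i in range(20)],
    "Fraud": ["yes"] * 8 + ["no"] * 12,
})
X_toy, y_toy = toy["Fillings"], toy["Fraud"]

# stratify=y_toy keeps the yes/no ratio the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Using this in the notebook would change the split, so the scores reported here would no longer apply as-is.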

In [16]:
# TF-IDF features feed a Logistic Regression classifier; random_state is fixed for reproducibility
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000)),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42))
])

In [17]:
# 5. Set up parameter grid for hyperparameter tuning
param_grid = {
    'tfidf__max_df': [0.85, 0.90, 0.95],
    'tfidf__min_df': [1, 2, 5],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.01, 0.1, 1, 10, 100]
}

In [20]:
# 6. Set up GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)

In [21]:
# 7. Fit GridSearchCV on the training data
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 90 candidates, totalling 450 fits




In [22]:
# 8. Output the best parameters and best cross-validation score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters found:  {'clf__C': 10, 'tfidf__max_df': 0.95, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 1)}
Best cross-validation score: 0.82


In [23]:
# 9. Evaluate the best estimator on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

In [24]:
print("Test set Accuracy:", accuracy_score(y_test, y_pred))
print("Test set Classification Report:")
print(classification_report(y_test, y_pred))

Test set Accuracy: 0.9411764705882353
Test set Classification Report:
              precision    recall  f1-score   support

          no       0.94      0.94      0.94        18
         yes       0.94      0.94      0.94        16

    accuracy                           0.94        34
   macro avg       0.94      0.94      0.94        34
weighted avg       0.94      0.94      0.94        34
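
The test accuracy (0.94) is well above the best cross-validation score (0.82), and with only 34 test filings the gap may be partly sampling noise. As a rough check, the sketch below computes a Wilson score interval for the test accuracy. The function name `wilson_interval` is my own, and the count of 32 correct predictions is inferred from 0.9411764705882353 × 34 rather than printed by the notebook.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)
    ) / denom
    return center - half_width, center + half_width

# Assumption: 32 of 34 test filings were classified correctly (inferred, see above).
low, high = wilson_interval(32, 34)
print(f"95% interval for test accuracy: {low:.3f} to {high:.3f}")
```

The interval is wide at this sample size, which is why the cross-validation score is worth reporting alongside the test accuracy.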



### Version 2: Enhanced Fraud Detection with Benford’s Law Features

Building on the initial model, this version adds numeric features inspired by Benford's Law. Benford's Law observes that in many naturally occurring datasets the leading digits of numbers follow a predictable distribution in which smaller digits appear more often, so manipulated figures may deviate from it. I wrote a custom transformer that counts, for each filing, how often each digit from 1 to 9 appears as the first digit of a number in the text. These counts were standardized and combined with the TF-IDF text features using a FeatureUnion, and the combined feature set was used to train a Logistic Regression classifier. As with Version 1, hyperparameters were tuned with GridSearchCV. The best cross-validation score rose slightly from 0.82 to 0.84, while test-set accuracy stayed at about 94%, so any benefit from the Benford features on this dataset appears modest.
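
For reference, Benford's Law gives the expected share of leading digit d as log10(1 + 1/d). The short sketch below tabulates those expected shares. It is illustrative only and is not used by the pipeline; the name `benford_expected` is my own.

```python
import math

# Benford's Law: P(d) = log10(1 + 1/d) for leading digit d in 1..9.
benford_expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, prob in benford_expected.items():
    print(f"{digit}: {prob:.3f}")
```

The digit 1 is expected to lead about 30% of numbers, and the shares fall steadily through 9.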

In [34]:
from sklearn.base import BaseEstimator, TransformerMixin
import re
import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

In [35]:
class BenfordTransformer(BaseEstimator, TransformerMixin):
    """Count how often each digit 1-9 is the first digit of a number in each document."""

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data
        return self

    def transform(self, X, y=None):
        benford_features = []
        for text in X:
            # Find every run of consecutive digits; "1,234.56" yields "1", "234", "56"
            numbers = re.findall(r'\d+', text)
            # Take the first character of each digit run
            first_digits = [num[0] for num in numbers]
            # Raw counts (not proportions) for digits 1 through 9; leading "0" is not counted
            counts = [first_digits.count(str(digit)) for digit in range(1, 10)]
            benford_features.append(counts)
        return np.array(benford_features)
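
The `\d+` pattern in `BenfordTransformer` treats each run of digits as its own number, so a value like "1,234.5" contributes three first digits, and the features are raw counts that grow with document length. The sketch below is one possible alternative, not the notebook's method: a hypothetical helper `leading_digit_proportions` that keeps thousands separators and decimals together, takes the first significant digit, and returns proportions.

```python
import re
import numpy as np

# Hypothetical pattern: digits with optional thousands separators and a decimal part.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def leading_digit_proportions(text):
    """Return the share of numbers whose first significant digit is 1..9."""
    counts = np.zeros(9)
    for token in NUMBER_RE.findall(text):
        # Drop separators and leading zeros so that 0.052 contributes a 5.
        digits = token.replace(",", "").replace(".", "").lstrip("0")
        if digits:
            counts[int(digits[0]) - 1] += 1
    total = counts.sum()
    # Texts without any usable number yield an all-zero vector.
    return counts / total if total else counts

sample = "Revenue rose to 1,234.5 from 987 while margin fell 0.052 points."
props = leading_digit_proportions(sample)
print(props.round(2))
```

Swapping this into the transformer would change the feature values, so the scores reported below would need to be recomputed.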

In [36]:
# Create a pipeline for the Benford features that includes scaling
benford_pipeline = Pipeline([
    ('benford', BenfordTransformer()),
    ('scaler', StandardScaler())
])

In [37]:
# Combine TF-IDF features with the scaled Benford features
combined_features = FeatureUnion([
    ("tfidf", TfidfVectorizer(stop_words='english', max_features=5000)),
    ("benford", benford_pipeline)
])

In [38]:
# Create a new pipeline that uses the combined features and Logistic Regression
pipeline_with_benford_scaled = Pipeline([
    ('features', combined_features),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=2000, random_state=42))
])

In [39]:
# Parameter grid for GridSearchCV (tuning only the TF-IDF and classifier parts)
param_grid_benford_scaled = {
    'features__tfidf__max_df': [0.85, 0.90, 0.95],
    'features__tfidf__min_df': [1, 2, 5],
    'features__tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.01, 0.1, 1, 10, 100]
}

In [40]:
# Perform grid search with 5-fold cross-validation
grid_search_benford_scaled = GridSearchCV(pipeline_with_benford_scaled, 
                                          param_grid_benford_scaled, 
                                          cv=5, 
                                          n_jobs=-1, 
                                          verbose=1)
grid_search_benford_scaled.fit(X_train, y_train)

Fitting 5 folds for each of 90 candidates, totalling 450 fits




In [42]:
# Output the best parameters and cross-validation score
print("Best parameters found (with scaled Benford features):", grid_search_benford_scaled.best_params_)
print("Best cross-validation score (with scaled Benford features): {:.2f}".format(grid_search_benford_scaled.best_score_))

Best parameters found (with scaled Benford features): {'clf__C': 10, 'features__tfidf__max_df': 0.95, 'features__tfidf__min_df': 1, 'features__tfidf__ngram_range': (1, 2)}
Best cross-validation score (with scaled Benford features): 0.84


In [43]:
# Evaluate the best estimator on the test set
best_model_benford_scaled = grid_search_benford_scaled.best_estimator_
y_pred_benford_scaled = best_model_benford_scaled.predict(X_test)

print("Test set Accuracy (with scaled Benford features):", accuracy_score(y_test, y_pred_benford_scaled))
print("Test set Classification Report (with scaled Benford features):")
print(classification_report(y_test, y_pred_benford_scaled))

Test set Accuracy (with scaled Benford features): 0.9411764705882353
Test set Classification Report (with scaled Benford features):
              precision    recall  f1-score   support

          no       0.94      0.94      0.94        18
         yes       0.94      0.94      0.94        16

    accuracy                           0.94        34
   macro avg       0.94      0.94      0.94        34
weighted avg       0.94      0.94      0.94        34
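
Both versions report the same test accuracy, but identical accuracy does not guarantee identical predictions. As a sketch, the hypothetical helper `paired_outcomes` below tallies where two sets of predictions agree or disagree with the truth. It is shown on toy labels; in the notebook one would pass `y_test`, `y_pred`, and `y_pred_benford_scaled`.

```python
import numpy as np

def paired_outcomes(y_true, pred_a, pred_b):
    """Tally per-sample outcomes for two models evaluated on the same test set."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_ok = pred_a == y_true
    b_ok = pred_b == y_true
    return {
        "both_correct": int(np.sum(a_ok & b_ok)),
        "only_a_correct": int(np.sum(a_ok & ~b_ok)),
        "only_b_correct": int(np.sum(~a_ok & b_ok)),
        "both_wrong": int(np.sum(~a_ok & ~b_ok)),
    }

# Toy labels (hypothetical): both models score 4 of 5 but err on different samples.
truth = ["yes", "no", "yes", "no", "yes"]
model_a = ["yes", "no", "no", "no", "yes"]
model_b = ["yes", "yes", "yes", "no", "yes"]
result = paired_outcomes(truth, model_a, model_b)
print(result)
```

If the two "only correct" counts are both zero on the real test set, the Benford features did not change any test prediction.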

