# Fake profile detection with logistic regression and hyperparameter tuning in Python

## Description:

This Jupyter Notebook outlines a machine learning workflow for detecting fake profiles using a logistic regression model. The process involves loading a dataset, standardizing the features, and tuning the model's regularization strength `C` using Bayesian optimization with Optuna, where each trial is scored by 5-fold cross-validated accuracy.

Each step is encapsulated in a separate code cell for clear organization and easy execution. The notebook concludes with an evaluation of the model's performance on test data and an examination of feature importance derived from the model's coefficients. 

By following these steps, we aim to create a robust classifier that can accurately identify fake profiles based on the provided dataset.

# Step 1: Load the Data

In [3]:
import pandas as pd

# Specify data path for the model
file_path = r'C:\Users\ferna\OneDrive\Documents\Skills\ML and AI\MODULE 25 - capstone project\Instagram.csv'

# Load the data
df = pd.read_csv(file_path)

# Display the first five rows of the DataFrame to check the headers and top entries
df.head()

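| | profile pic | nums/length username | fullname words | nums/length fullname | name==username | description length | external URL | private | #posts | #followers | #follows | fake |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.27 | 0 | 0.0 | 0 | 53 | 0 | 0 | 32 | 1000 | 955 | 0 |
| 1 | 1 | 0.0 | 2 | 0.0 | 0 | 44 | 0 | 0 | 286 | 2740 | 533 | 0 |
| 2 | 1 | 0.1 | 2 | 0.0 | 0 | 0 | 0 | 1 | 13 | 159 | 98 | 0 |
| 3 | 1 | 0.0 | 1 | 0.0 | 0 | 82 | 0 | 0 | 679 | 414 | 651 | 0 |
| 4 | 1 | 0.0 | 2 | 0.0 | 0 | 0 | 0 | 1 | 6 | 151 | 126 | 0 |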


# Step 2: Preprocess the Data

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assume 'fake' is the target variable
X = df.drop('fake', axis=1)
y = df['fake']
print("Feature matrix and target vector created.")

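# Standardize the features
# Note: the scaler is fit on all rows before the split below, so test-set
# statistics influence the scaling; fitting on the training rows only avoids this
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Features have been standardized.")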

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
print("Data has been split into training and test sets.")

Feature matrix and target vector created.
Features have been standardized.
Data has been split into training and test sets.
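
Because the scaler above is fit before the split, the test rows contribute to the mean and variance used for standardization. The following is a minimal sketch of a leakage-free variant, not the notebook's own method. It assumes synthetic stand-in data from `make_classification` rather than the Instagram dataset, and the names `X_demo`, `y_demo`, and `pipe` are hypothetical. It splits first and lets a scikit-learn pipeline fit the scaler on training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 11 features and a binary target
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=42)

# Split first so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Inside cross_val_score the whole pipeline is refit per fold,
# so the scaler only ever sees that fold's training rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring='accuracy')

# Fit on the full training split, then score once on the held-out test split
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"CV accuracy: {cv_scores.mean():.3f}, test accuracy: {test_accuracy:.3f}")
```

The same `pipe` object could be built inside an Optuna objective, so that each cross-validation fold is scaled independently during tuning.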


# Step 3: Define the Logistic Regression Model

This step is integrated into the Optuna optimization process, so we will define the model within the objective function during the hyperparameter tuning step.

# Step 4: Perform Cross-Validation and Hyperparameter Tuning Using Optuna

In [7]:
# Install Optuna if it's not already installed
!pip install optuna

Collecting optuna
  Downloading optuna-3.4.0-py3-none-any.whl (409 kB)
     -------------------------------------- 409.6/409.6 kB 6.4 MB/s eta 0:00:00
Collecting colorlog
  Downloading colorlog-6.8.0-py3-none-any.whl (11 kB)
Collecting alembic>=1.5.0
  Downloading alembic-1.13.0-py3-none-any.whl (230 kB)
     -------------------------------------- 230.6/230.6 kB 2.3 MB/s eta 0:00:00
Collecting Mako
  Downloading Mako-1.3.0-py3-none-any.whl (78 kB)
     ---------------------------------------- 78.6/78.6 kB 4.3 MB/s eta 0:00:00
Installing collected packages: Mako, colorlog, alembic, optuna
Successfully installed Mako-1.3.0 alembic-1.13.0 colorlog-6.8.0 optuna-3.4.0


In [10]:
import optuna
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Hyperparameters to be tuned
    C = trial.suggest_float('C', 1e-4, 1e4, log=True)
    
    # Logistic Regression model with increased max_iter
    model = LogisticRegression(C=C, max_iter=1000, random_state=42)
    
    # Cross-validation
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
    
    return score

# Create a study object and specify the optimization direction as maximizing accuracy
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

[I 2023-12-06 16:14:15,410] A new study created in memory with name: no-name-ca75e942-b27f-4ce2-88fc-1549bf45b300
[I 2023-12-06 16:14:15,425] Trial 0 finished with value: 0.9108695652173914 and parameters: {'C': 0.015441878622321182}. Best is trial 0 with value: 0.9108695652173914.
[I 2023-12-06 16:14:15,485] Trial 1 finished with value: 0.9086956521739131 and parameters: {'C': 158.17333750153313}. Best is trial 0 with value: 0.9108695652173914.
[I 2023-12-06 16:14:15,495] Trial 2 finished with value: 0.8913043478260869 and parameters: {'C': 0.0006051526096588452}. Best is trial 0 with value: 0.9108695652173914.
[I 2023-12-06 16:14:15,525] Trial 3 finished with value: 0.9108695652173914 and parameters: {'C': 0.01305923342531416}. Best is trial 0 with value: 0.9108695652173914.
[I 2023-12-06 16:14:15,558] Trial 4 finished with value: 0.9086956521739131 and parameters: {'C': 6.843316984517569}. Best is trial 0 with value: 0.9108695652173914.
[I 2023-12-06 16:14:15,574] Trial 5 finished w [... remaining log output truncated]
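
The `log=True` argument passed to `trial.suggest_float` in the objective makes the search treat each order of magnitude of `C` equally, which suits a range as wide as `1e-4` to `1e4`. The sketch below is a rough illustration in plain Python, not Optuna's sampler: the helper `sample_log_uniform` is hypothetical, and it mirrors only the log scaling of the search space, not the adaptive choices Optuna makes between trials.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw one value uniformly in log10 space between low and high."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent

rng = random.Random(0)
samples = [sample_log_uniform(1e-4, 1e4, rng) for _ in range(10_000)]

# Share of draws below 1.0, the geometric midpoint of [1e-4, 1e4]
share_below_one = sum(s < 1.0 for s in samples) / len(samples)
print(f"min={min(samples):.2e}, max={max(samples):.2e}, share below 1.0={share_below_one:.2f}")
```

Roughly half of the draws land below `1.0`. With a linear scale, almost all draws would fall in the top decade of the range, and small values of `C` would rarely be tried.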

# Step 5: Output the Best Hyperparameters

In [11]:
# Best hyperparameters
best_params = study.best_params
print('Best Parameters:', best_params)

Best Parameters: {'C': 0.3495200485853912}


# Step 6: Train the Model with the Best Parameters Found

In [12]:
# Train the model with the best parameters found
# (max_iter=1000 matches the setting used during tuning)
model_best = LogisticRegression(**best_params, max_iter=1000, random_state=42)
model_best.fit(X_train, y_train)

# Step 7: Predictions and Performance Evaluation

In [13]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Predictions
y_pred = model_best.predict(X_test)

# Performance Evaluation
print('Classification Report:\n', classification_report(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('Accuracy Score:', accuracy_score(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.97      0.89        63
           1       0.95      0.75      0.84        53

    accuracy                           0.87       116
   macro avg       0.89      0.86      0.87       116
weighted avg       0.88      0.87      0.87       116

Confusion Matrix:
 [[61  2]
 [13 40]]
Accuracy Score: 0.8706896551724138
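
The figures in the classification report can be recomputed by hand from the confusion matrix. The short sketch below uses only the four counts printed above and assumes that label `1` marks a fake profile. The variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
tn, fp = 61, 2    # true class 0: correctly kept, wrongly flagged
fn, tp = 13, 40   # true class 1 (assumed fake): missed, correctly flagged

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_fake = tp / (tp + fp)   # of the profiles flagged as class 1, how many truly are
recall_fake = tp / (tp + fn)      # of the true class 1 profiles, how many were flagged
f1_fake = 2 * precision_fake * recall_fake / (precision_fake + recall_fake)

print(f"accuracy={accuracy:.2f}, precision={precision_fake:.2f}, "
      f"recall={recall_fake:.2f}, f1={f1_fake:.2f}")
# prints: accuracy=0.87, precision=0.95, recall=0.75, f1=0.84
```

Under that labeling, the model rarely flags a genuine profile (2 of 63) but misses 13 of the 53 fake ones, which is why recall for class 1 is well below its precision.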


# Step 8: Show Feature Importance

In [18]:
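# Build the coefficient table (reconstructed: the cell defining `feature_importance`
# is missing from this export; column names are taken from the output below)
feature_importance = pd.DataFrame({'importance': model_best.coef_[0]}, index=X.columns)
feature_importance['abs_importance'] = feature_importance['importance'].abs()

# Sort the features by signed coefficient, in descending order
feature_importance_sorted = feature_importance.sort_values(by='importance', ascending=False)

# Display the sorted feature importance
print('Feature Importance (in decreasing order of impact):\n', feature_importance_sorted)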

Feature Importance (in decreasing order of impact):
                       importance  abs_importance
nums/length username    1.532483        1.532483
name==username          0.560643        0.560643
nums/length fullname    0.166220        0.166220
private                -0.072266        0.072266
#followers             -0.207122        0.207122
#follows               -0.209910        0.209910
fullname words         -0.276121        0.276121
description length     -0.478877        0.478877
external URL           -0.781232        0.781232
#posts                 -1.313050        1.313050
profile pic            -1.690651        1.690651
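
The table above is sorted by signed coefficient, so the strongest negative predictors sit at the bottom. The sketch below ranks the features by magnitude instead and converts each coefficient to an odds ratio. It is an illustrative reading aid, with the values copied from the printed output. Because the features were standardized, each odds ratio describes a one-standard-deviation increase in that feature.

```python
import math

# Coefficients copied from the output above (features were standardized)
coefficients = {
    'nums/length username': 1.532483,
    'name==username': 0.560643,
    'nums/length fullname': 0.166220,
    'private': -0.072266,
    '#followers': -0.207122,
    '#follows': -0.209910,
    'fullname words': -0.276121,
    'description length': -0.478877,
    'external URL': -0.781232,
    '#posts': -1.313050,
    'profile pic': -1.690651,
}

# Rank by magnitude: a large negative coefficient matters as much as a large positive one
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

for name, coef in ranked:
    # exp(coef) is the multiplicative change in the odds of class 1
    # for a one-standard-deviation increase in the feature
    odds_ratio = math.exp(coef)
    print(f"{name:<22} coef={coef:+.3f}  odds ratio={odds_ratio:.2f}")
```

Ranked this way, `profile pic`, `nums/length username`, and `#posts` have the largest effects, while `private` contributes the least.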
