# Fake profile detection with Random Forest and Bayesian hyperparameter tuning in Python

## Description:

This Jupyter notebook uses machine learning to detect fake Instagram profiles. It loads and preprocesses the data, then compares four classifiers: Logistic Regression, K-Nearest Neighbors, Support Vector Machine, and Random Forest.

The notebook then tunes the hyperparameters of the Random Forest model with Optuna. The final sections train the tuned model, make predictions, evaluate performance, and report feature importance.

The overall goal is to provide a structured approach to model selection and optimization in a real-world data analysis context.

# Step 1: Load the Data

In [1]:
import pandas as pd

# Specify data path for the model
file_path = 'input_file_path'

# Load the data
df = pd.read_csv(file_path)

# Display the first five rows of the DataFrame to check the headers and top entries
df.head()

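| | profile pic | nums/length username | fullname words | nums/length fullname | name==username | description length | external URL | private | #posts | #followers | #follows | fake |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.27 | 0 | 0.0 | 0 | 53 | 0 | 0 | 32 | 1000 | 955 | 0 |
| 1 | 1 | 0.0 | 2 | 0.0 | 0 | 44 | 0 | 0 | 286 | 2740 | 533 | 0 |
| 2 | 1 | 0.1 | 2 | 0.0 | 0 | 0 | 0 | 1 | 13 | 159 | 98 | 0 |
| 3 | 1 | 0.0 | 1 | 0.0 | 0 | 82 | 0 | 0 | 679 | 414 | 651 | 0 |
| 4 | 1 | 0.0 | 2 | 0.0 | 0 | 0 | 0 | 1 | 6 | 151 | 126 | 0 |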
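
Before modeling, it can help to check for missing values and class balance. The sketch below rebuilds the five rows shown above as a stand-in for the full dataset (an assumption, since the real file path is a placeholder) and runs two quick checks. The names `sample_df`, `missing_per_column`, and `class_share` are illustrative only.

```python
import pandas as pd

# Stand-in for the real CSV: the five rows displayed by df.head() above.
columns = ["profile pic", "nums/length username", "fullname words",
           "nums/length fullname", "name==username", "description length",
           "external URL", "private", "#posts", "#followers", "#follows", "fake"]
rows = [
    [1, 0.27, 0, 0.0, 0, 53, 0, 0, 32, 1000, 955, 0],
    [1, 0.00, 2, 0.0, 0, 44, 0, 0, 286, 2740, 533, 0],
    [1, 0.10, 2, 0.0, 0, 0, 0, 1, 13, 159, 98, 0],
    [1, 0.00, 1, 0.0, 0, 82, 0, 0, 679, 414, 651, 0],
    [1, 0.00, 2, 0.0, 0, 0, 0, 1, 6, 151, 126, 0],
]
sample_df = pd.DataFrame(rows, columns=columns)

# Missing values per column; non-zero counts need handling before scaling.
missing_per_column = sample_df.isna().sum()

# Share of each class in the target; a strong imbalance makes accuracy misleading.
class_share = sample_df["fake"].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

On the full dataset the same two calls can be run on `df` directly.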


# Step 2: Preprocess the Data

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assume 'fake' is the target variable
X = df.drop('fake', axis=1)
y = df['fake']
print("Feature matrix and target vector created.")

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Features have been standardized.")

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
print("Data has been split into training and test sets.")

Feature matrix and target vector created.
Features have been standardized.
Data has been split into training and test sets.
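
The cell above fits `StandardScaler` on all rows before splitting, so the test rows contribute to the scaling statistics. One way to avoid this is to split first and wrap the scaler and the classifier in a scikit-learn `Pipeline`, so the scaler is fit on the training rows only. A minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the Instagram features (the names `X_demo`, `y_demo`, and `pipe` are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption: 11 numeric features).
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=42)

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fit inside the pipeline, using training rows only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Tree-based models such as Random Forest are largely insensitive to feature scaling, so this matters mainly for Logistic Regression, K-Nearest Neighbors, and the Support Vector Machine.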


# Step 3: Explore Different Models

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Define models
models = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier()
}

# Train and evaluate models
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Model: {name}")
    print("Accuracy Score:", accuracy_score(y_test, y_pred))
    print('Classification Report:\n', classification_report(y_test, y_pred))
    print('------------------------------------------------------\n')

Model: Logistic Regression
Accuracy Score: 0.8620689655172413
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.95      0.88        63
           1       0.93      0.75      0.83        53

    accuracy                           0.86       116
   macro avg       0.88      0.85      0.86       116
weighted avg       0.87      0.86      0.86       116

------------------------------------------------------

Model: K-Nearest Neighbors
Accuracy Score: 0.8706896551724138
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.92      0.89        63
           1       0.90      0.81      0.85        53

    accuracy                           0.87       116
   macro avg       0.87      0.87      0.87       116
weighted avg       0.87      0.87      0.87       116

------------------------------------------------------

Model: Support Vector Machine
Accuracy Score: 0.87068965517241
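
With only 116 test rows, accuracy differences of one or two points between models may not be meaningful. A possible complement is to compare the same four model types with stratified 5-fold cross-validation. The sketch below is an illustration on synthetic data, not the notebook's own procedure, and the names `X_demo`, `y_demo`, `candidates`, and `cv_results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption: 11 numeric features).
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_results = {}
for name, estimator in candidates.items():
    # Scaling happens inside each fold, so validation rows never affect the scaler.
    scores = cross_val_score(
        make_pipeline(StandardScaler(), estimator),
        X_demo, y_demo, cv=cv, scoring="accuracy",
    )
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation next to the mean gives a rough sense of whether two models differ by more than fold-to-fold noise.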

# Step 4: Perform Cross-Validation and Hyperparameter Tuning for Random Forest Using Optuna

In [7]:
# Install Optuna if it's not already installed
!pip install optuna



In [4]:
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Hyperparameters to be tuned
    n_estimators = trial.suggest_int('n_estimators', 10, 300)
    max_depth = trial.suggest_int('max_depth', 2, 32, log=True)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 150)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 60)

    # Random Forest model
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                   min_samples_split=min_samples_split, 
                                   min_samples_leaf=min_samples_leaf,
                                   random_state=42)
    
    # Cross-validation
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
    
    return score

# Create a study object and specify the optimization direction as maximizing accuracy
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

[I 2023-12-07 10:57:57,739] A new study created in memory with name: no-name-2a933841-4af0-4b75-a95b-db5f9456c2d1
[I 2023-12-07 10:57:58,121] Trial 0 finished with value: 0.8978260869565219 and parameters: {'n_estimators': 71, 'max_depth': 19, 'min_samples_split': 3, 'min_samples_leaf': 52}. Best is trial 0 with value: 0.8978260869565219.
[I 2023-12-07 10:57:59,337] Trial 1 finished with value: 0.9130434782608695 and parameters: {'n_estimators': 223, 'max_depth': 3, 'min_samples_split': 25, 'min_samples_leaf': 27}. Best is trial 1 with value: 0.9130434782608695.
[I 2023-12-07 10:58:00,305] Trial 2 finished with value: 0.9043478260869564 and parameters: {'n_estimators': 175, 'max_depth': 16, 'min_samples_split': 30, 'min_samples_leaf': 39}. Best is trial 1 with value: 0.9130434782608695.
[I 2023-12-07 10:58:01,895] Trial 3 finished with value: 0.9065217391304348 and parameters: {'n_estimators': 282, 'max_depth': 2, 'min_samples_split': 143, 'min_samples_leaf': 5}. Best is trial 1 with v

# Step 5: Output the Best Hyperparameters

In [5]:
# Best hyperparameters
best_params = study.best_params
print('Best Parameters:', best_params)

Best Parameters: {'n_estimators': 155, 'max_depth': 15, 'min_samples_split': 15, 'min_samples_leaf': 8}


# Step 6: Train the Model with the Best Parameters Found

In [6]:
# Train the model with the best parameters found
model_best = RandomForestClassifier(**best_params, random_state=42)
model_best.fit(X_train, y_train)

# Step 7: Predictions and Performance Evaluation

In [7]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Predictions
y_pred = model_best.predict(X_test)

# Performance Evaluation
print('Classification Report:\n', classification_report(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('Accuracy Score:', accuracy_score(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.94      0.92        63
           1       0.92      0.89      0.90        53

    accuracy                           0.91       116
   macro avg       0.91      0.91      0.91       116
weighted avg       0.91      0.91      0.91       116

Confusion Matrix:
 [[59  4]
 [ 6 47]]
Accuracy Score: 0.9137931034482759
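
The figures in the classification report follow directly from the confusion matrix. The short calculation below reproduces the class 1 row and the overall accuracy, assuming that label 1 marks a fake profile and that rows are true labels and columns are predicted labels (the scikit-learn convention).

```python
# Confusion matrix from the output above: rows = true labels, columns = predictions.
confusion = [[59, 4],
             [6, 47]]

tn, fp = confusion[0]  # true class 0: predicted 0, predicted 1
fn, tp = confusion[1]  # true class 1: predicted 0, predicted 1

precision_fake = tp / (tp + fp)  # share of predicted-fake profiles that are fake
recall_fake = tp / (tp + fn)     # share of fake profiles that were caught
f1_fake = 2 * precision_fake * recall_fake / (precision_fake + recall_fake)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_fake:.2f} recall={recall_fake:.2f} "
      f"f1={f1_fake:.2f} accuracy={accuracy:.4f}")
# prints: precision=0.92 recall=0.89 f1=0.90 accuracy=0.9138
```

Six fake profiles are missed (false negatives) and four genuine ones are flagged (false positives), which is why recall for class 1 is slightly lower than precision.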


# Step 8: Show Feature Importance

In [8]:
import pandas as pd

# Extracting feature importances
feature_importances = pd.DataFrame(model_best.feature_importances_,
                                   index = X.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)

print('Feature Importance:\n', feature_importances)

Feature Importance:
                       importance
#followers              0.337346
#posts                  0.262252
nums/length username    0.127657
profile pic             0.103287
description length      0.076681
#follows                0.052121
fullname words          0.031636
external URL            0.003658
nums/length fullname    0.003191
private                 0.002143
name==username          0.000028
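
The values above are impurity-based importances, which scikit-learn computes from the training data and which can favor features with many distinct values. Permutation importance on held-out rows is a common cross-check. The sketch below illustrates it on synthetic data; `X_demo`, `y_demo`, and `forest` are hypothetical stand-ins for the notebook's data and `model_best`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 features, of which 3 are informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out rows and record the accuracy drop.
result = permutation_importance(
    forest, X_te, y_te, n_repeats=10, random_state=0, scoring="accuracy"
)

# List features from largest to smallest mean accuracy drop.
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

If both methods rank the same features highly, that agreement adds some confidence to the ranking; large disagreements are worth investigating.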
