This project is part of Data Analysis with Python (Task 8). It covers two parts: feature engineering and hyperparameter tuning on a student dataset (generic example), and fraud detection with a Decision Tree on a synthetic dataset.
- Created new features (Total_Score) in a student dataset.
- Tuned a Random Forest model using GridSearchCV.
- Generated a synthetic fraud dataset (fraud_detection.csv).
- Encoded categorical features (credit/debit).
- Engineered new features (Amount_Squared, Log_Amount).
- Trained & tuned a Decision Tree using GridSearchCV.
- Evaluated with Precision, Recall, F1-score.
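The fraud-detection steps above can be sketched end to end. This is a minimal, illustrative version: the column names (`Amount`, `Type`, `Is_Fraud`) and the inline synthetic data generator are assumptions standing in for `fraud_detection.csv`, not the project's actual schema.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Illustrative synthetic data (stand-in for fraud_detection.csv)
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "Amount": rng.exponential(scale=100, size=n),
    "Type": rng.choice(["credit", "debit"], size=n),
})
# ~4% fraud, loosely tied to unusually large amounts (illustrative only)
df["Is_Fraud"] = (
    (df["Amount"] > df["Amount"].quantile(0.95)) & (rng.random(n) < 0.8)
).astype(int)

# Encode the categorical Type column (credit/debit -> 1/0)
df["Type_Encoded"] = (df["Type"] == "credit").astype(int)

# Engineered features
df["Amount_Squared"] = df["Amount"] ** 2
df["Log_Amount"] = np.log1p(df["Amount"])

X = df[["Amount", "Type_Encoded", "Amount_Squared", "Log_Amount"]]
y = df["Is_Fraud"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Tune the Decision Tree with GridSearchCV
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

Scoring on F1 rather than accuracy matters here: with heavy class imbalance, accuracy alone would reward a model that never predicts fraud.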
- Best Params (Section 1, Random Forest): {'max_depth': 3, 'n_estimators': 50}
- Accuracy (Section 1): ~60%
- Best Params (Section 2, Decision Tree): {'criterion': 'gini', 'max_depth': 10, 'min_samples_split': 5}
- Accuracy (Section 2): ~90%
⚠️ Accuracy looks high because most transactions are legitimate; due to class imbalance, the model still struggles to detect the minority fraud class.
- Use SMOTE (Synthetic Minority Oversampling Technique) to oversample fraud cases, or undersample legitimate transactions.
- Try class weights (`class_weight="balanced"`) in Decision Trees.
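Class weighting is the lowest-effort of these fixes, since it needs no extra library (SMOTE lives in the separate `imbalanced-learn` package). A minimal sketch on toy imbalanced data, which is illustrative only:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny imbalanced toy data (illustrative): the positive "fraud" class is rare
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 1.6).astype(int)

# class_weight="balanced" reweights each class inversely to its frequency,
# so errors on the rare class cost more when the tree chooses its splits
clf = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0)
clf.fit(X, y)
```

Compared with resampling, class weighting changes only the loss, not the data, so it composes cleanly with the existing GridSearchCV setup.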
- Random Forest, XGBoost, or LightGBM often perform better than a single Decision Tree.
- Ensemble methods can reduce overfitting and capture complex fraud patterns.
- Create time-based features (e.g., transactions per hour/day).
- Calculate average transaction amount per user.
- Flag unusual transactions (very high or very frequent).
- Use algorithms like Isolation Forest or One-Class SVM to detect rare frauds.
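Anomaly detectors frame fraud differently: instead of learning both classes, they model "normal" behavior and flag outliers. A minimal Isolation Forest sketch on made-up transaction amounts (the data and `contamination` value are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative amounts: mostly typical values plus a few extreme ones
rng = np.random.default_rng(1)
normal = rng.normal(loc=50, scale=10, size=(490, 1))
extreme = rng.normal(loc=500, scale=50, size=(10, 1))
amounts = np.vstack([normal, extreme])

# IsolationForest isolates points with random splits; outliers need fewer
# splits to isolate and are labeled -1, inliers are labeled 1
iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(amounts)
```

Unlike the Decision Tree, this needs no fraud labels at all, which is useful when confirmed fraud cases are too rare to train on.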
- `pip install pandas numpy scikit-learn`
- `python Task8_FeatureEngineering_ModelTuning.py`
- Run Section 1 (Random Forest on student dataset).
- Run Section 2 (Decision Tree on fraud detection dataset).
- Best hyperparameters
- Accuracy (Section 1)
- Classification Report (Section 2)