# Project Retrospective: Lessons Learned and Future Improvements
## 1. Introduction
This notebook introduces no new project code or models. Instead, it serves as a post-mortem analysis of the entire project lifecycle. While the previous notebooks focused on quantitative results, this document focuses on the qualitative learning process. It details the challenges faced, the architectural decisions made, the pitfalls encountered regarding data leakage, and what I learned about how the Data Science role differs from Data Engineering.

## 2. The "Synthetic Data" Reality Check
One of the most significant realizations of this project was how limiting it is to generate a custom data ecosystem from scratch.

**The Complexity of Banking Ecosystems:** Attempting to simulate a complex banking environment from scratch was an ambitious goal that ultimately revealed the limitations of my current tools. The resulting dataset was too deterministic. As feared, the models achieved near-perfect scores (AUC ~1.0). This was not a triumph of modeling, but a symptom of data simplicity. The patterns were too explicit, making the classification task trivial.
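To illustrate why rule-generated labels make the task trivial, here is a minimal toy sketch. It is not the project's dataset: the amount feature, the 900 threshold, and the 10% label noise are all assumptions made up for the example. When the label is a pure function of one feature, that feature alone ranks every case perfectly; only once noise is added does the score drop below 1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy feature (assumption): transaction amounts drawn uniformly.
amount = rng.uniform(1, 1000, size=1000)

# Deterministic rule (assumption): every amount above 900 is labelled fraud.
is_fraud_rule = (amount > 900).astype(int)

# Noisier variant: flip roughly 10% of the labels at random.
flip = rng.random(1000) < 0.1
is_fraud_noisy = np.where(flip, 1 - is_fraud_rule, is_fraud_rule)

# Score each transaction by the raw feature; no model is needed.
auc_rule = roc_auc_score(is_fraud_rule, amount)
auc_noisy = roc_auc_score(is_fraud_noisy, amount)

print(f"AUC with rule-based labels: {auc_rule:.3f}")
print(f"AUC with noisy labels:      {auc_noisy:.3f}")
```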

One of my initial ideas was to apply a model trained on my own schema to real-world datasets. This experience showed that a model trained on a specific, self-generated schema cannot be naively transferred to them. Real data is messier, noisier, and less rule-bound.

For future personal projects, it is more valuable to find "dirty" real-world data than to spend time engineering a "perfect" synthetic world.

## 3. Methodological Pitfalls: Data Leakage & Splitting
Reflecting on the workflow, I identified several critical issues regarding data integrity and strict separation of concerns.

### 3.1. The Train/Test Split Timing
I realized too late that the Train/Test split must be the very first step, before any exploration or engineering. Performing EDA on the entire dataset introduces bias: I "saw" the test data patterns before training, so my feature and modeling choices were no longer blind to the test set.

In future projects, the split will be performed immediately after loading the raw data.
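A minimal sketch of that ordering, assuming scikit-learn's train_test_split and a toy table standing in for the raw data. The amount column, the 75/25 ratio, and the stratification on Is_Fraud are illustrative choices, not the project's actual settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the raw table; in practice this is the freshly loaded data.
raw_df = pd.DataFrame({
    "amount": [float(i) for i in range(1, 21)],
    "Is_Fraud": [1 if i % 5 == 0 else 0 for i in range(1, 21)],
})

# Split before any EDA or feature engineering.
# Stratifying keeps the fraud rate similar in both parts.
train_df, test_df = train_test_split(
    raw_df,
    test_size=0.25,
    stratify=raw_df["Is_Fraud"],
    random_state=42,
)

# From here on, exploration and engineering use train_df only;
# test_df stays untouched until the final evaluation.
print(train_df.shape, test_df.shape)
```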

### 3.2. Clustering and Leakage
The implementation of K-Means for customer profiling exposed two leakage risks:

* Using Is_Fraud, directly or indirectly, to define clusters is dangerous. It works for historical analysis, but new clients arrive unlabelled, so the clusters would depend on information that is not available at inference time. I believe I avoided this here, but it needs care.

* The K-Means algorithm was fitted on the entire dataset (Train + Test). This means the cluster centroids used during training were influenced by the distribution of the test set, which violates the blind testing principle.
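The fix for the second point can be sketched as follows: fit the scaler and K-Means on the training rows only, then assign test rows to the already learned clusters. The random feature matrices and the choice of four clusters are placeholders for the example, not the project's profiling features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder feature matrices (assumption): 3 numeric profiling features.
X_train = rng.normal(size=(200, 3))
X_test = rng.normal(size=(50, 3))

# Scaling statistics and centroids are both learned from X_train only.
profiler = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
profiler.fit(X_train)

# Test rows are only assigned to the nearest existing centroid.
train_clusters = profiler.predict(X_train)
test_clusters = profiler.predict(X_test)
```

Because the test rows never reach fit, they cannot move the centroids or the scaling statistics.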

## 4. Engineering Wins: The Custom Transformer
Despite the data limitations, the Data Engineering and Software Architecture aspects provided the most value in this project.

* **Custom Scikit-Learn Transformer:** Implementing the FraudPreprocessor class from scratch was a major technical milestone. I learned how to structure a class that adheres to the fit/transform API.

* **The History Logic:** The .fit() method builds each client's history (profile) and stores it, so that during .transform() (inference) new transactions are compared against the stored profile rather than against data seen at prediction time.

* **Pipeline Integration:** This class proved robust enough to work within a larger pipeline, integrating with both Logistic Regression and XGBoost without changes.
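The pattern described in this list can be sketched roughly as below. This is not the actual FraudPreprocessor: the class name ClientHistoryTransformer, the client_id and amount columns, and the single mean-amount profile are hypothetical simplifications chosen to show the fit/transform split.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ClientHistoryTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a profile-based preprocessor."""

    def __init__(self, client_col="client_id", amount_col="amount"):
        self.client_col = client_col
        self.amount_col = amount_col

    def fit(self, X, y=None):
        # Learn and store each client's mean amount from the training data.
        self.client_mean_ = X.groupby(self.client_col)[self.amount_col].mean()
        # Fallback profile for clients never seen during fit.
        self.global_mean_ = X[self.amount_col].mean()
        return self

    def transform(self, X):
        # Compare each transaction with the stored profile; nothing is re-learned.
        profile = X[self.client_col].map(self.client_mean_).fillna(self.global_mean_)
        return pd.DataFrame(
            {
                "amount": X[self.amount_col],
                "amount_vs_profile": X[self.amount_col] / profile,
            },
            index=X.index,
        )


# Toy data (assumption): client C appears only at inference time.
train = pd.DataFrame({"client_id": ["A", "A", "B"], "amount": [10.0, 30.0, 100.0]})
y_train = [0, 0, 1]
new = pd.DataFrame({"client_id": ["A", "C"], "amount": [40.0, 70.0]})

features = ClientHistoryTransformer().fit(train).transform(new)

# The same class drops into a pipeline ahead of any estimator.
model = Pipeline([("prep", ClientHistoryTransformer()), ("clf", LogisticRegression())])
model.fit(train, y_train)
fraud_scores = model.predict_proba(new)[:, 1]
```

Storing the learned state in attributes with a trailing underscore follows the scikit-learn convention for fitted parameters. Swapping LogisticRegression for another estimator should not require changes to the transformer.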

## 5. Workflow and Project Management
The process of building the project revealed several insights about planning and execution.

### 5.1. The 80/20 Rule Verified
Paradoxically for a "Data Science" project, the modeling phase was negligible. Roughly 80% of the time went into EDA and Data Engineering (dataset creation). The actual model training yielded no new discoveries; it served only to validate the engineering.

### 5.2. Planning vs. Reacting
* **Lack of Strategy:** I had to redesign components mid-project because I hadn't anticipated the requirements of the next step. Future projects require a clear architectural diagram and data flow strategy before writing the first line of code.

### 5.3. Reproducibility and Assets
I learned the importance of deciding which assets (especially heavy plots) to save. Re-running a notebook just to generate a specific image is inefficient. A better strategy for saving/loading visuals is needed.
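One possible strategy, sketched under assumptions: render a heavy plot once, save it to disk, and reuse the file on later runs. The helper save_or_reuse, the figures directory, and the toy histogram are all hypothetical names for illustration. The sketch builds the figure with matplotlib's Figure class directly so that saving does not depend on the notebook's display backend.

```python
from pathlib import Path

from matplotlib.figure import Figure


def save_or_reuse(name, make_figure, out_dir="figures"):
    """Render and save a figure only if it is not already on disk."""
    path = Path(out_dir) / f"{name}.png"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        fig = make_figure()  # the expensive step runs only on a cache miss
        fig.savefig(path, dpi=150, bbox_inches="tight")
    return path


def amount_histogram():
    """Toy plot (assumption) standing in for a heavy visualization."""
    fig = Figure(figsize=(6, 4))
    ax = fig.subplots()
    ax.hist([12.5, 40.0, 41.2, 300.0, 18.9], bins=5)
    ax.set_xlabel("amount")
    return fig
```

Calling save_or_reuse("amount_hist", amount_histogram) returns the path to the PNG, which a Markdown cell can then embed without re-running the plotting code. The trade-off is that a cached image goes stale when the data changes, so the file has to be deleted or renamed to force a refresh.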

## 6. Coding Practices & Tools
### 6.1. Code Quality and Structure
* **Notebook Setup:** I established a standard for initializing notebooks (organized imports, config settings).

* **Readability:** I made a conscious effort to write legible, well-annotated code, aiming for a professional standard. However, without external code review, I cannot confirm whether the code is truly optimized or well annotated.

* **Version Control:** I implemented Git/GitHub. While usage was basic (no branching strategies), it proved useful for version rollback on two occasions.

### 6.2. The "Testing" Gap
I identified a significant inefficiency in my coding style: I tend to write large blocks of code with multiple dependencies before running them, and I do not write unit tests. While I was lucky in this project (few bugs), this methodology is fragile.

I need to investigate debugging tools and adopt TDD (Test-Driven Development) principles, testing small functions individually before integrating them into the pipeline.
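As a first step in that direction, a small function and its tests might look like the sketch below. The function amount_ratio is a hypothetical example, not project code. The tests use plain assert statements, so they run without extra dependencies, and a test runner such as pytest would also discover them by their test_ prefix.

```python
def amount_ratio(amount, client_mean):
    """Ratio of a transaction amount to the client's historical mean."""
    if client_mean <= 0:
        raise ValueError("client_mean must be positive")
    return amount / client_mean


def test_amount_ratio_typical():
    # A transaction twice the usual size yields a ratio of 2.
    assert amount_ratio(50.0, 25.0) == 2.0


def test_amount_ratio_rejects_zero_mean():
    # A missing or zero profile must fail loudly instead of dividing by zero.
    try:
        amount_ratio(50.0, 0.0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a zero mean")


# Run the checks directly, for example at the bottom of a notebook cell.
test_amount_ratio_typical()
test_amount_ratio_rejects_zero_mean()
```

Under TDD the two tests would be written first and the function filled in until they pass.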

## 7. Final Conclusion
This project was an exercise in humility and simulation.

While the models did not solve a complex mathematical problem due to the simplicity of the data, the project succeeded as a Professional Simulation. It forced me to deal with pipeline architecture, class inheritance, Git integration, and documentation.

The key realization is that in a real-world scenario, the value often lies not in the algorithm (which can be imported in one line), but in the robustness of the data pipeline and the quality of the feature engineering, the two areas where this project focused its greatest efforts.