# Module 4: Final Assignment

First, you'll take on both the role of the client and the data scientist to develop a business problem related to one of the following topics:
1. Emails
2. Hospitals
3. Credit Cards

You'll use the business problem you defined to demonstrate your knowledge of the Business Understanding stage.

Then, taking on the role of a data scientist, you'll describe how you would apply data science methodology practices at each of the listed stages to address the business problem you identified.

#### Which topic did you choose to apply the data science methodology to?

<u>**Answer:**</u>

I chose to apply the data science methodology to *credit cards*.

Financial institutions and customers often face credit card fraud. The problem is well-suited to a data-driven solution because of the large volume of transactional data available and the need for real-time, accurate predictions. Addressing this issue can help companies save money, enhance customer trust, and comply with regulations aimed at reducing financial crime.

---

#### Next, you will play the role of the client and the data scientist. 

Using the topic that you selected, complete the Business Understanding stage by coming up with a problem that you would like to solve and phrasing it in the form of a question that you will use data to answer.

You are required to:
1. Describe the problem, related to the topic you selected.
2. Phrase the problem as a question to be answered using data.

<u>**Answer:**</u>

**Problem:** Fraudulent credit card transactions are hard to distinguish from legitimate ones because of severe class imbalance (very few fraudulent transactions compared to legitimate ones). Current fraud detection methods generate false positives that inconvenience customers, while still missing some fraudulent transactions.

**Question:** *How can we design a fraud detection system that identifies fraudulent transactions in real time with high accuracy while minimizing false positives?*

---

#### Business Understanding

Briefly explain how you would complete each of the following stages for the problem that you described in the Business Understanding stage, so that you are ultimately able to answer the question that you came up with.
1. Analytic Approach
2. Data Requirements
3. Data Collection
4. Data Understanding and Preparation
5. Modeling and Evaluation

<u>**Answer:**</u>

1. **Analytic Approach:** To address the problem, we’ll adopt a predictive modeling approach. The goal is to classify transactions as fraudulent or legitimate based on historical data.

- *Type of Analytics:* Supervised learning, specifically binary classification.
- *Desired Output:* A model that assigns a probability score to each transaction, allowing it to be flagged as fraudulent or legitimate.
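
As a minimal sketch of how a probability score becomes a flag, the snippet below applies a cutoff to a list of scores. The function name `flag_transactions`, the example scores, and both thresholds are illustrative assumptions, not part of any specific fraud system.

```python
from typing import List


def flag_transactions(scores: List[float], threshold: float = 0.5) -> List[bool]:
    """Return True for each transaction whose fraud probability meets the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [score >= threshold for score in scores]


# Hypothetical model outputs for four transactions.
scores = [0.02, 0.45, 0.91, 0.67]

print(flag_transactions(scores, threshold=0.5))  # [False, False, True, True]
print(flag_transactions(scores, threshold=0.8))  # [False, False, True, False]
```

Raising the threshold flags fewer transactions, which tends to reduce false positives at the cost of missing more fraud; the right cutoff is a business decision rather than a modeling one.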

2. **Data Requirements:** Identify the data attributes needed for the model:

- *Transaction Details:* Transaction amount, timestamp, merchant category, location.
- *Customer Behavior:* Frequency of transactions, average spending patterns, typical locations.
- *Fraud Markers:* Historical labels indicating whether a transaction was fraudulent.
- *Device/Network Data:* IP address, device type, and geolocation.
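
One way to pin these requirements down is a typed record for a single transaction. The sketch below is only an assumption about how the attributes might be named and typed; every field name here is hypothetical, and customer-behavior attributes would typically be derived from many such records rather than stored on each one.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    # Transaction details
    transaction_id: str
    customer_id: str
    amount: float
    timestamp: datetime
    merchant_category: str
    latitude: float
    longitude: float
    # Device/network data
    ip_address: str
    device_type: str
    # Fraud marker: None until the transaction has been labeled
    is_fraud: Optional[bool] = None


# Hypothetical example record; all values are made up for illustration.
example = Transaction(
    transaction_id="t-001",
    customer_id="c-123",
    amount=42.5,
    timestamp=datetime(2024, 1, 15, 12, 0),
    merchant_category="grocery",
    latitude=40.71,
    longitude=-74.01,
    ip_address="203.0.113.7",
    device_type="mobile",
)
print(example.is_fraud)  # None
```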

3. **Data Collection:** Data sources include:

- *Internal Bank Records:* Historical transaction logs with fraud labels.
- *External Data:* Industry fraud intelligence, such as lists of blocked IP addresses or suspicious merchants.
- *Real-Time Streams:* Data from customer transactions as they occur.

4. **Data Understanding and Preparation:**
- *Data Understanding:*
   - Perform exploratory data analysis (EDA) to identify trends, correlations, and anomalies.
   - Quantify the class imbalance by examining the ratio of fraudulent to legitimate transactions.
- *Data Preparation:*
   - *Cleaning:* Remove duplicate transactions, fix missing or inconsistent values.
   - *Transformation:* Normalize transaction amounts, encode categorical variables (e.g., merchant categories).
   - *Feature Engineering:*
      - Create features like "average spending per hour" or "distance between transaction locations."
      - Use time-based features like "transactions within the last 5 minutes."
   - *Balancing the Dataset:* Use oversampling (e.g., SMOTE) or undersampling techniques on the training data to address the class imbalance, leaving the test data at its original class ratio.
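
The feature-engineering and balancing steps above can be sketched with the standard library alone. All function names are hypothetical. Note that `random_oversample` duplicates existing minority rows at random; it is a simpler stand-in for SMOTE, which instead synthesizes new minority samples.

```python
import math
import random
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def count_recent(timestamps: Sequence[datetime], window: timedelta) -> List[int]:
    """For each transaction, count the earlier ones within `window` before it.

    Assumes the timestamps belong to one customer and are sorted ascending.
    """
    counts = []
    start = 0
    for i, ts in enumerate(timestamps):
        # Advance the window start past transactions that are too old.
        while timestamps[start] < ts - window:
            start += 1
        counts.append(i - start)
    return counts


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def random_oversample(rows: List[Tuple[dict, int]], seed: int = 0) -> List[Tuple[dict, int]]:
    """Duplicate random fraud rows (label 1) until both classes are the same size."""
    rng = random.Random(seed)
    minority = [row for row in rows if row[1] == 1]
    majority = [row for row in rows if row[1] == 0]
    if not minority or len(minority) >= len(majority):
        return list(rows)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(rows) + extra


# Hypothetical data for illustration.
times = [
    datetime(2024, 1, 15, 12, 0),
    datetime(2024, 1, 15, 12, 2),
    datetime(2024, 1, 15, 12, 4),
    datetime(2024, 1, 15, 12, 30),
]
print(count_recent(times, timedelta(minutes=5)))  # [0, 1, 2, 0]

train_rows = [({"amount": 10.0 * i}, 0) for i in range(5)] + [({"amount": 900.0}, 1)]
balanced = random_oversample(train_rows)
print(len(balanced), sum(label for _, label in balanced))  # 10 5
```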

5. **Modeling and Evaluation:**
- *Modeling:*
   - *Algorithms:* Random forests, gradient boosting (XGBoost/LightGBM), or deep learning models (e.g., neural networks for large datasets).
   - Train multiple models and use grid search or Bayesian optimization to tune hyperparameters.
   - Ensemble methods could combine models for better accuracy.
- *Evaluation:*
   - *Metrics:* Focus on precision, recall, and F1-score to balance false positives and false negatives.
   - *Confusion matrix:* Analyze true positives, false positives, false negatives, and true negatives.
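   - *ROC curve and AUC:* Assess the trade-off between sensitivity and specificity across thresholds; with heavily imbalanced classes, a precision-recall curve is a useful complement.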
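
The evaluation metrics above follow directly from the confusion-matrix counts. The sketch below computes them by hand on made-up labels (1 = fraudulent, 0 = legitimate); the function names and the example data are assumptions for illustration only.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives, treating label 1 as fraud."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def precision_recall_f1(counts: Dict[str, int]) -> Dict[str, float]:
    """Derive precision, recall, and F1 from confusion-matrix counts."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and predictions for eight transactions.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'tp': 2, 'fp': 1, 'fn': 1, 'tn': 4}
print(precision_recall_f1(counts))
```

Here precision and recall are both 2/3 even though 6 of 8 predictions are correct, which illustrates why plain accuracy can overstate performance when fraud is rare.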

---