## Data Collection

To detect fraudulent credit card transactions, we need data that contains both legitimate and fraudulent transactions. Such data can be obtained from a financial institution or from a public source such as the **Kaggle Credit Card Fraud Detection Dataset**.

### Key Data Points to Collect:

1. **Transaction Details:**
   - Transaction ID: A unique identifier for each transaction.
   - Cardholder ID: A unique identifier for the cardholder, anonymized for privacy.
   - Amount: The monetary value of the transaction.
   - Merchant Category: The category of the merchant where the transaction occurred (e.g., retail, electronics, groceries).
   - Location: The geographic location of the transaction.
   - Transaction Time: The timestamp when the transaction took place.
   - Mode of Transaction: Whether the transaction was online, in-store, or over the phone.

2. **Cardholder Information:**
   - Cardholder's spending history: Past spending behavior of the cardholder, including average transaction amount and frequency of transactions.
   - Credit limit: The cardholder’s maximum allowable credit.

3. **Fraud Label:**
   - A binary label indicating whether the transaction was fraudulent (1) or not (0).
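
The transaction-level fields listed above could be captured in a typed record. The following is a minimal sketch, not a prescribed schema: the class name `Transaction` and all field names are assumptions chosen for illustration, and cardholder-level information (spending history, credit limit) is left out on the assumption that it would live in a separate record keyed by the cardholder ID.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Transaction:
    """One labeled transaction record (hypothetical schema)."""

    transaction_id: str
    cardholder_id: str         # anonymized identifier
    amount: float              # monetary value; currency is not specified here
    merchant_category: str     # e.g. "Retail", "Electronics", "Groceries"
    location: str
    transaction_time: datetime
    mode: str                  # "Online", "In-store", or "Phone"
    is_fraud: int              # 1 = fraudulent, 0 = legitimate

    def __post_init__(self):
        # Reject labels outside the binary encoding described above.
        if self.is_fraud not in (0, 1):
            raise ValueError("is_fraud must be 0 or 1")


# Illustrative values only, not real data.
example = Transaction(
    transaction_id="002",
    cardholder_id="456",
    amount=5000.0,
    merchant_category="Electronics",
    location="Online",
    transaction_time=datetime(2024, 1, 1, 11, 30),
    mode="Online",
    is_fraud=1,
)
```

Validating the label at construction time is one possible design choice; it surfaces mislabeled rows during collection rather than during modeling.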

### Data Source:
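- **Kaggle Credit Card Fraud Detection Dataset:** This dataset contains 284,807 transactions made by European cardholders over two days in September 2013. The dataset is highly imbalanced, with only 492 frauds (0.172% of all transactions). Its columns are `Time`, `Amount`, the label `Class`, and 28 anonymized PCA components (`V1` to `V28`), so descriptive fields such as merchant category or location are not available in it.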

- **Internal Bank Data:** If working with a bank or financial institution, we can use historical transaction data collected over a period of time, where fraudulent transactions are already identified and labeled.
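
To make the imbalance in the Kaggle figures concrete, the short calculation below derives the fraud rate and the ratio of legitimate to fraudulent transactions from the two counts quoted above. The variable names are assumptions; only the counts come from the text.

```python
TOTAL_TRANSACTIONS = 284_807  # count quoted for the Kaggle dataset
FRAUD_TRANSACTIONS = 492      # fraudulent transactions among them

fraud_rate = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS
legit_per_fraud = (TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS) / FRAUD_TRANSACTIONS

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Legitimate transactions per fraud: {legit_per_fraud:.0f}")
```

With roughly 578 legitimate transactions for every fraudulent one, any sampling, splitting, or evaluation step has to account for how rare the positive class is.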

### Example Data Table:

| Transaction ID | Cardholder ID | Amount | Merchant Category | Location | Transaction Time      | Mode of Transaction | Fraudulent (1 = yes, 0 = no) |
|----------------|---------------|--------|-------------------|----------|-----------------------|---------------------|------------------------------|
| 001            | 123           | 100    | Retail            | New York | 2024-01-01 10:00:00   | In-store            | 0                            |
| 002            | 456           | 5000   | Electronics       | Online   | 2024-01-01 11:30:00   | Online              | 1                            |
| 003            | 789           | 200    | Restaurant        | Chicago  | 2024-01-01 12:45:00   | In-store            | 0                            |
| ...            | ...           | ...    | ...               | ...      | ...                   | ...                 | ...                          |

Once the data is collected, we can proceed to the next step of **Data Understanding**, where we will analyze the structure and distributions within the dataset.


## Data Understanding

The goal here is to identify patterns that differentiate fraudulent from legitimate transactions.

**For example:**

- Do certain merchant categories have a higher share of fraudulent transactions?
- Does the transaction amount differ between fraudulent and legitimate transactions?
- Are there locations or times of day where fraud is more likely to occur?
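
Questions like these usually reduce to comparing the fraud rate across groups. The sketch below shows one way to do that with the standard library; the rows are hypothetical illustration data and `fraud_rate_by_group` is a helper name introduced here, not part of any dataset or library.

```python
from collections import defaultdict

# Hypothetical rows: (merchant_category, is_fraud). Not real data.
rows = [
    ("Retail", 0),
    ("Electronics", 1),
    ("Restaurant", 0),
    ("Electronics", 0),
    ("Retail", 0),
]


def fraud_rate_by_group(records):
    """Return {group: fraction of records in that group labeled 1}."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for group, is_fraud in records:
        totals[group] += 1
        frauds[group] += is_fraud
    return {group: frauds[group] / totals[group] for group in totals}


rates = fraud_rate_by_group(rows)
# List groups from highest to lowest fraud rate.
for category, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{category}: {rate:.0%}")
```

The same helper applies to location or hour of day by changing what is used as the grouping key. With very few frauds per group, the resulting rates are noisy and should be read alongside the group sizes.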

## Data Preparation

The data will likely need cleaning and pre-processing before it can be used for modeling. Typical steps include:

- Handling missing values (e.g., missing transaction details).
- Encoding categorical variables (e.g., mode of transaction).
- Scaling numerical values (e.g., transaction amounts) so that features with large ranges do not dominate the model.
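
The three steps above can be sketched on a handful of records as follows. The records are hypothetical, and the specific choices (median imputation, one-hot encoding, standardization) are assumptions picked as common defaults rather than the only reasonable options.

```python
from statistics import mean, median, pstdev

# Hypothetical raw records; None marks a missing amount.
raw = [
    {"amount": 100.0, "mode": "In-store"},
    {"amount": 5000.0, "mode": "Online"},
    {"amount": None, "mode": "In-store"},
    {"amount": 200.0, "mode": "Phone"},
]

# 1. Handle missing values: impute missing amounts with the median.
known = [r["amount"] for r in raw if r["amount"] is not None]
fill = median(known)
amounts = [r["amount"] if r["amount"] is not None else fill for r in raw]

# 2. Encode the categorical column as one-hot indicator vectors.
modes = sorted({r["mode"] for r in raw})
one_hot = [[1 if r["mode"] == m else 0 for m in modes] for r in raw]

# 3. Scale amounts to zero mean and unit variance (standardization).
mu, sigma = mean(amounts), pstdev(amounts)
scaled = [(a - mu) / sigma for a in amounts]

# Final feature rows: [scaled_amount, one indicator per mode].
features = [[s] + flags for s, flags in zip(scaled, one_hot)]
```

In a real pipeline the imputation value, category list, mean, and standard deviation would be computed on the training split only and then reused on validation and test data, so that no information leaks from the held-out sets.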

## Modeling & Evaluation

Using a labeled dataset, we can apply machine learning models such as:

- Logistic Regression: To classify transactions as fraudulent or non-fraudulent.
- Decision Trees/Random Forests: To capture non-linear patterns in the data.
- Neural Networks: To identify complex relationships and patterns that might not be evident using traditional models.
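
As an illustration of the simplest model in the list, here is a from-scratch sketch of logistic regression trained with batch gradient descent. The toy data, learning rate, and epoch count are assumptions for demonstration; in practice an established library implementation would likely be used instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label  # gradient of log loss w.r.t. z
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict_proba(weights, bias, features):
    """Estimated probability that a transaction is fraudulent."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


# Hypothetical, already-scaled toy data: one feature (standardized amount).
X = [[-1.0], [-0.8], [-0.5], [0.9], [1.2], [1.5]]
y = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic_regression(X, y)
predictions = [int(predict_proba(weights, bias, row) >= 0.5) for row in X]
```

The 0.5 decision threshold is itself a tunable choice: lowering it flags more transactions, which tends to raise recall at the cost of precision.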

Model performance will be evaluated using the following metrics:

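- Accuracy: The overall percentage of correctly classified transactions. On a dataset this imbalanced, accuracy alone is misleading: with 0.172% fraud, a model that labels every transaction as legitimate is still about 99.8% accurate.
- Precision and Recall: Precision is the share of flagged transactions that are actually fraudulent, so high precision means few false positives (legitimate transactions marked as fraud). Recall is the share of fraudulent transactions that are caught, so high recall means few missed frauds.
- F1 Score: The harmonic mean of precision and recall, which balances the two.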
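
The metrics above can be computed directly from true and predicted labels. The sketch below uses hypothetical labels and an illustrative helper named `classification_metrics`; it also compares a model against a baseline that never flags fraud, to show why accuracy by itself says little on imbalanced data.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # frauds caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # frauds missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate, passed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 2 frauds among 10 transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_legit = [0] * 10                       # baseline that never flags fraud
model_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # one hit, one false alarm, one miss

baseline = classification_metrics(y_true, always_legit)
model = classification_metrics(y_true, model_pred)
```

In this example both prediction sets reach the same accuracy, yet the baseline has zero recall while the model catches half of the frauds. Returning 0.0 when a denominator is zero is a convention chosen here to keep the sketch simple.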