# Session 65: Capstone Project Part 1 (Data Collection & Inspection)

**Unit 6: Data Ethics, Privacy, and Future Trends**
**Hour: 65**
**Mode: Practical Project**

---

### 1. Objective

This lab session focuses on the **Obtain** and initial **Explore** steps for our StyleSphere marketing campaign project. We will load the dataset, perform our standard inspection checks, and get a first feel for its structure and contents.

### 2. Setup

Import Pandas and any other libraries you anticipate needing.

In [None]:
import pandas as pd
import numpy as np

### 3. Obtain: Loading the Data

The dataset is available online, so we will load it from a public URL. Despite the `.csv` extension, the file is tab-separated, so we pass `sep='\t'` to `read_csv`. With the default comma separator, pandas would read each line into a single column.

In [None]:
url = 'https://raw.githubusercontent.com/LeoFernan/Marketing-Campaigns-Analysis/main/marketing_campaign.csv'
df = pd.read_csv(url, sep='\t')
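
To see why the separator matters, the sketch below parses a small made-up tab-separated sample twice: once with the default comma separator and once with `sep="\t"`. The `sample` string and its values are hypothetical stand-ins, not rows from the real file.

```python
import io
import pandas as pd

# Hypothetical three-column sample in a tab-separated layout (not real data).
sample = "ID\tYear_Birth\tIncome\n1\t1970\t50000\n2\t1985\t\n"

# The default separator is a comma, so each line lands in one wide column.
wrong = pd.read_csv(io.StringIO(sample))

# Declaring the tab separator splits the fields as intended.
right = pd.read_csv(io.StringIO(sample), sep="\t")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

A single-column result right after loading is a quick signal that the separator is wrong.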


### 4. Initial Inspection

Let's run our standard set of inspection commands to understand the data we're working with.

#### 4.1. `.head()` and `.shape`

In [None]:
df.head()

In [None]:
df.shape

**Finding:** We have 2240 customers (rows) and 29 columns of information.

#### 4.2. `.info()` - The Technical Summary

In [None]:
df.info()

**Findings & Potential Issues:**
1.  **Missing Values:** The `Income` column has `2216` non-null values, but there are `2240` rows in total. This confirms that we have **24 missing income values** that we will need to handle in the next session.
2.  **Incorrect Data Type:** The `Dt_Customer` column has dtype `object` (text) even though it holds dates. To do any time-based analysis (like calculating customer lifetime), we will need to convert it to a proper datetime type.
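
Both findings can also be checked directly instead of being read off the `.info()` printout. The sketch below uses a tiny hypothetical frame (`demo`, not the real data) and assumes that `Dt_Customer` is stored as day-month-year text; it is worth looking at a few raw values before relying on that format.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, holding only the two problem columns.
demo = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, np.nan],
    "Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013", "10-02-2014"],
})

# Missing values per column; on the real data, Income should report 24.
missing = demo.isna().sum()
print(missing)

# Preview the date conversion without modifying the frame.
# The '%d-%m-%Y' format is an assumption; errors='coerce' turns
# any text that does not match it into NaT instead of raising.
parsed = pd.to_datetime(demo["Dt_Customer"], format="%d-%m-%Y", errors="coerce")
print(parsed.dtype)
print(parsed.isna().sum())  # number of values that failed to parse
```

If the last count is greater than zero on the real data, the assumed format does not fit every row and should be revisited before cleaning.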

#### 4.3. `.describe()` - Statistical Summary

In [None]:
df.describe()

**Findings:**
*   **`Year_Birth`:** The minimum value is 1893, which would make that customer well over 100 years old. This is highly unlikely to be a real birth year for a current customer and is probably a data entry error or an outlier we should investigate.
*   **`Income`:** The average household income is around $52,247 (computed over the non-missing values only, since `.describe()` skips missing entries). The maximum of $666,666 is more than twelve times the mean, so it could also be an outlier or a placeholder value.
*   **`Response`:** The mean of this column is `0.149`. Since 'Yes' is coded as 1 and 'No' as 0, the mean equals the share of 1s, which is the overall response rate: **14.9%** of customers in the dataset responded to the last campaign. With roughly 85% non-responders, this is another **imbalanced dataset**.
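
These observations translate into a few follow-up checks. The sketch below runs them on a small hypothetical frame (`demo`, not the real data). The 1900 cutoff for birth years is an illustrative threshold chosen here, not a rule that comes with the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df with the three columns discussed above.
demo = pd.DataFrame({
    "Year_Birth": [1893, 1957, 1984, 1965, 1976],
    "Income": [60182.0, 58138.0, 666666.0, 46344.0, 71613.0],
    "Response": [0, 1, 0, 0, 0],
})

# Rows with an implausibly early birth year (the cutoff is an assumption).
suspect_births = demo[demo["Year_Birth"] < 1900]

# The largest incomes, to eyeball extreme values.
top_incomes = demo["Income"].nlargest(3)

# Share of each class; the mean of a 0/1 column equals the share of 1s.
class_shares = demo["Response"].value_counts(normalize=True)
response_rate = demo["Response"].mean()

print(len(suspect_births))  # 1
print(top_incomes.iloc[0])  # 666666.0
print(response_rate)        # 0.2
```

Running the same three checks on `df` shows exactly which rows sit behind the suspicious minimum and maximum, which is useful input for the cleaning decisions in the next session.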

### 5. Conclusion

In this session, we successfully loaded our project data and performed a thorough initial inspection. We have already identified several key data quality issues that need to be addressed before we can proceed with analysis:
1.  24 missing values in the `Income` column.
2.  An incorrect data type for `Dt_Customer`.
3.  Potential outliers or errors in `Year_Birth` and `Income`.

This sets a clear agenda for our next session, which will be dedicated to data cleaning.

**Next Session:** We will execute the **Scrub** phase of our project, cleaning the data to prepare it for analysis.