## Data Validation & Cleaning

The original dataset had 15,000 rows and 8 columns covering weekly sales for the new product line. Each row records the week, sales method, customer ID, quantity sold, revenue, years as a customer, website visits, and state.

**Here’s what was checked and changed:**

- **Sales Method:**  
  - Fixed spelling mistakes and made sure only three sales methods are used: 'Email', 'Call', and 'Email + Call'.

- **Years as Customer:**  
  - Removed two rows with values higher than 41 years. Since the company started in 1984 and the data goes up to 2025, the longest possible customer history is 41 years.

- **Revenue:**  
  - Found 1,074 rows without revenue data. Kept these rows for customer and sales approach analysis, but left them out when analyzing revenue.

- **Customer ID:**  
  - Checked that every customer ID was unique, as expected.

- **Other Columns:**  
  - All other fields (`week`, `nb_sold`, `nb_site_visits`, `state`) looked reasonable and didn’t need changes.

- **Data Types:**  
  - All columns were loaded with the right data types, matching the description in the project brief. No changes were needed.
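The checks above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names `sales_method`, `customer_id`, `years_as_customer`, and `revenue` are assumed, the misspelled labels in `METHOD_FIXES` are hypothetical, and the sample rows are made up.

```python
import pandas as pd

# Tiny illustrative frame; values are invented, not taken from the real dataset.
raw = pd.DataFrame({
    "customer_id": ["a1", "b2", "c3", "d4", "e5"],
    "sales_method": ["Email", "em + call", "call", "Email + Call", "email"],
    "years_as_customer": [0, 3, 47, 12, 63],
    "revenue": [95.0, None, 48.5, 180.2, 99.1],
})

# Hypothetical misspellings mapped onto the three allowed labels.
METHOD_FIXES = {"em + call": "Email + Call", "email": "Email", "call": "Call"}
ALLOWED_METHODS = {"Email", "Call", "Email + Call"}
MAX_TENURE = 2025 - 1984  # company founded in 1984, data runs to 2025


def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the validation steps described above and return a cleaned copy."""
    out = df.copy()

    # Normalise sales method labels, then confirm only the three expected remain.
    out["sales_method"] = out["sales_method"].str.strip().replace(METHOD_FIXES)
    unexpected = set(out["sales_method"]) - ALLOWED_METHODS
    if unexpected:
        raise ValueError(f"Unexpected sales methods: {sorted(unexpected)}")

    # Drop rows whose tenure exceeds the company's age.
    out = out[out["years_as_customer"] <= MAX_TENURE]

    # Customer IDs are expected to be unique.
    if not out["customer_id"].is_unique:
        raise ValueError("Duplicate customer IDs found")

    return out.reset_index(drop=True)


cleaned = clean_sales(raw)

# Rows with missing revenue are kept; they are only excluded in revenue analysis.
missing_revenue = int(cleaned["revenue"].isna().sum())
print(len(cleaned), missing_revenue)
```

Raising an error on unexpected labels, rather than silently dropping them, makes it obvious when a new misspelling shows up in later data.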


## Exploratory Data Analysis

This section looks at the main patterns in the sales data using a few simple charts and summaries.

### Single Variable Analysis

To get a better idea of the customer base and their activity, I started with two histograms:

- The first histogram shows how many years each customer has been with the company. The distribution is right-skewed: most customers are new or have only a few years with us, and only a few have been with the company for a very long time. The values spread all the way up to 41 years, which matches the maximum possible based on the company’s history.

![Distribution of Years as Customer](years_as_customer_hist.png)

- The second histogram looks at how many times customers visited the company’s website in the last six months. Most customers have between 20 and 30 visits, with some as low as 12 and some as high as 41. This distribution is fairly symmetric but slightly right-skewed, with most customers clustered near the average but a few with much higher activity.

![Distribution of Site Visits](nb_site_visits_hist.png)

### Linking the Exploratory Analysis

After seeing these general patterns in the customer base, I used boxplots to check if different sales methods were being used with any particular group of customers or if they reached similar types of people.

- The boxplot for **years as customer** across sales methods shows that all approaches—Call, Email, and Email + Call—were used with customers who have almost the same range and average tenure. The median and spread are very close for each method.
- The boxplot for **site visits** by sales method shows a similar pattern: all three methods were applied to customers with similar levels of website activity, and there is no clear bias toward targeting only the most or least active users.

**Outcome:**  
The key takeaway is that both customer tenure and online activity are distributed similarly across all sales methods. This means that the higher revenue observed from the combined 'Email + Call' approach isn’t just because it was aimed at a special segment like long-term or highly active customers, but because it genuinely performs better for the typical customer seen in our data. This reinforces the decision to recommend 'Email + Call' as the primary sales strategy, since it works well across the main customer base established in the histograms.

![Years as Customer by Sales Method](years_as_customer_by_sales_method.png)
![Site Visits by Sales Method](nb_site_visits_by_sales_method.png)
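A numeric check that could sit alongside the boxplots is a per-method summary of the two variables. The sketch below assumes the column names `sales_method` and `years_as_customer` (only `nb_site_visits` is named in this report) and uses made-up rows purely to show the shape of the calculation.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "sales_method": ["Email", "Email", "Call", "Call", "Email + Call", "Email + Call"],
    "years_as_customer": [1, 5, 2, 4, 3, 3],
    "nb_site_visits": [24, 26, 23, 27, 25, 25],
})

cols = ["years_as_customer", "nb_site_visits"]

# Median tenure and site visits for each sales method.
medians = sample.groupby("sales_method")[cols].median()

# Gap between the highest and lowest group median for each variable;
# a small gap suggests the methods reached similar customers.
median_gap = medians.max() - medians.min()

print(medians)
print(median_gap)
```

If the gaps are small relative to the overall spread of each variable, that supports reading the revenue differences as an effect of the method rather than of who was targeted.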

### Comparing Two Variables

To see how revenue changes with the sales approach, I plotted a boxplot for revenue by sales method. This chart makes it clear that the combined 'Email + Call' approach brings in the highest revenue per customer, while 'Call' on its own leads to the lowest. The 'Email' method sits somewhere in between.

![Revenue by Sales Method](revenue_by_sales_method_boxplot.png)

**Key points:**
- The combined approach not only has the highest median revenue but also the widest spread, showing some customers respond really well to this method.
- The distribution for 'Call' is tight and low, which means it’s not just the lowest on average but it’s also less likely to surprise you with a big sale.
- Outliers are present in all groups, but they don’t change the overall trend.
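The median and spread that the boxplot shows can also be tabulated directly. This is a sketch under assumptions: the column names `sales_method` and `revenue` are assumed, the rows are invented, and the interquartile range is used here as one reasonable measure of spread.

```python
import pandas as pd

# Invented revenue values; one missing value mimics the rows without revenue.
sample = pd.DataFrame({
    "sales_method": ["Call"] * 4 + ["Email"] * 4 + ["Email + Call"] * 4,
    "revenue": [45.0, 47.0, 49.0, 51.0,
                90.0, 95.0, 100.0, None,
                150.0, 180.0, 190.0, 220.0],
})


def iqr(values: pd.Series) -> float:
    """Interquartile range: the height of the box in a boxplot."""
    return values.quantile(0.75) - values.quantile(0.25)


# Exclude rows without revenue, then summarise each sales method.
spread = (
    sample.dropna(subset=["revenue"])
    .groupby("sales_method")["revenue"]
    .agg(median="median", iqr=iqr)
)

print(spread)
```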

---

All images above were exported from Jupyter Notebook, which was the main tool used for the data analysis.

## Key Metric for the Business

For this project, the most useful metric for the business is average revenue per customer, broken down by sales method. This directly shows how effective each approach is at generating revenue from the new product line.

- By tracking average revenue per customer, the business can see which sales strategy works best and where to focus their resources for the highest impact.

### Current Results

Based on the current data (rows with missing revenue excluded):
- Call: average revenue per customer is $47.60
- Email: average revenue per customer is $97.13
- Email + Call: average revenue per customer is $183.65

These results show a clear pattern: using email and a call together leads to the highest returns, at about 3.9 times the 'Call' average and about 1.9 times the 'Email' average, while relying only on phone calls brings in the least.
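For ongoing tracking, the metric can be recomputed with a short groupby. The sketch below assumes the column names `sales_method` and `revenue` and uses made-up rows, so its output does not reproduce the figures above.

```python
import pandas as pd

# Made-up rows; the missing value stands in for customers without revenue data.
sample = pd.DataFrame({
    "sales_method": ["Call", "Call", "Call", "Email", "Email",
                     "Email + Call", "Email + Call"],
    "revenue": [40.0, 50.0, None, 90.0, 100.0, 170.0, 190.0],
})

# Average revenue per customer by sales method.
# mean() skips missing values, so rows without revenue are left out.
avg_revenue = sample.groupby("sales_method")["revenue"].mean().round(2)

# How the best method compares with the weakest one.
ratio_vs_call = avg_revenue["Email + Call"] / avg_revenue["Call"]

print(avg_revenue)
print(round(ratio_vs_call, 2))
```

Running the same calculation on each new batch of data would show whether the gap between methods holds over time.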

---

## Recommendation

Based on the findings, I recommend that the company:
- Focus more on the combined 'Email + Call' approach, as it brings in the highest revenue per customer.
- Review the current use of phone calls alone, since this method brings the lowest revenue and may not be an efficient use of time and resources.
- Consider further exploring why 'Email + Call' works so well—there may be lessons that can be applied to other campaigns or products.