## Analysis Report


**Introduction**

In this analysis, we aimed to identify small businesses most likely to respond to the Wave-2 mailing campaign for upgrading to QuickBooks Version 3.0. Using predictive modeling, we analyzed past purchasing behavior, geographic trends, and customer engagement to optimize the mailing strategy. Logistic regression was selected as the final model due to its interpretability and strong predictive performance. By applying data-driven selection criteria, we targeted high-probability responders while minimizing costs. This approach ensured an efficient and profitable campaign, maximizing response rates and return on investment.



**Q1. Describe how you developed your predictive models, and discuss predictive performance for each model.**
→ We developed multiple predictive models to determine the most effective approach for identifying customers likely to upgrade to QuickBooks following the Wave-2 mailing. Our focus was on comparing logistic regression and neural network models to assess their predictive capabilities.

Logistic Regression Model: This model was selected due to its interpretability, robustness, and strong predictive performance. Logistic regression is well-suited for binary classification tasks, such as predicting customer upgrades. The model was trained on historical data using key features such as past orders, spending, geographic location, software version, and engagement history.

Neural Network Model (NNR): We tested several neural network architectures with different configurations of hidden layers and nodes, using GridSearchCV to optimize hyperparameters. The best-performing model had one hidden layer with three nodes. However, while the neural network achieved competitive AUC scores, logistic regression provided more stable results and was easier to interpret.

Performance Comparison: Both models were evaluated using AUC (Area Under the Curve) as the primary metric. The logistic regression model demonstrated slightly better generalizability, with more stable predictive performance across training and test sets. While the neural network model had a slightly higher AUC in training, it did not meaningfully outperform logistic regression on the test set.

Given these findings, we chose the logistic regression model as the final predictive model for determining the Wave-2 mailing list. The logistic regression model achieved a pseudo R-squared of 0.129 and an AUC of 0.768, making it a robust tool for predicting upgrade likelihood.

**Q2. How did you compare and evaluate different models?**

→ To ensure a data-driven model selection, we used multiple evaluation techniques.

Evaluation Metrics:

- AUC-ROC: The logistic regression model achieved an AUC of 0.755, indicating strong predictive power.
- Cumulative Gains Chart: Confirmed effective ranking of businesses by upgrade likelihood, optimizing marketing efforts.
- Permutation Feature Importance: Identified the key predictors: zip_bins, upgraded, last, and numords.

Despite slightly higher AUC scores for neural networks, logistic regression was preferred for its clear feature insights, consistent performance, and statistical validation of predictors without overfitting.
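
As a rough illustration of the AUC metric used above, the sketch below computes it as the share of (responder, non-responder) pairs in which the responder receives the higher score. The function name and the toy data are hypothetical and are not taken from the analysis itself.

```python
def auc(labels, scores):
    """AUC as the fraction of (responder, non-responder) pairs ranked correctly.

    Ties between a responder and a non-responder count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one responder and one non-responder")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration: 1 = upgraded, 0 = did not upgrade
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
toy_auc = auc(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the reported AUC values should be read.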

**Q3. If you created new variables to include in the model, please describe these as well.**

→ New Variables and Their Impact

To enhance predictive power, interaction terms were added to capture complex relationships:

- Geographic Response (zip_bins * numords): Captured regional variations in purchasing behavior.
- Recency & Location (zip_bins * last): Reflected how recent purchases influenced response rates across regions.
- Upgrade History & Location (zip_bins * upgraded): Highlighted the link between past upgrades and future engagement by region.

Impact on Model Performance:

- Pseudo R-squared improved from 0.121 to 0.135, boosting explanatory power.
- Interaction terms increased statistical significance, reinforcing the role of geography and behavior in response probability.
- Enhanced precision in targeting high-probability responders, optimizing Wave-2 mailing efficiency.

These refinements strengthened the model’s ability to capture customer behavior while maintaining interpretability and reliability.
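
A minimal sketch of how such interaction columns could be built is shown below. It assumes zip_bins is treated as a numeric bin index, so that each interaction is a simple product; if zip_bins were encoded as a categorical variable, each interaction would instead expand into one product column per bin. The helper name and the example row are hypothetical.

```python
def add_interactions(row):
    """Return a copy of `row` with zip_bins x {numords, last, upgraded} products added.

    Assumption: zip_bins is a numeric bin index, not a categorical label.
    """
    out = dict(row)
    for var in ("numords", "last", "upgraded"):
        out[f"zip_bins_x_{var}"] = row["zip_bins"] * row[var]
    return out


# Made-up example row for illustration
example = add_interactions({"zip_bins": 3, "numords": 4, "last": 10, "upgraded": 1})
print(example)
```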

**Q4. What criteria did you use to decide which customers should receive the Wave-2 mailing?**
→ The final Wave-2 mailing list was generated using a combination of model predictions and business-driven criteria to maximize response rates while minimizing costs. Businesses were selected if they had a predicted response probability of at least 0.0235, were located in high-response zip bins (top quartile), had made more than two purchases, had previously upgraded, and had recent transactions (last below the median). Businesses in historically low-response zip bins, those with no past orders, or those inactive for an extended period were excluded. This approach ensured a targeted and cost-effective mailing strategy focused on high-likelihood responders.
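
The selection rule above can be sketched as follows. The 0.0235 cutoff coincides with the break-even response rate implied by the $1.41 mailing cost and $60 revenue per response reported in the profit section (1.41 / 60); whether the threshold was derived this way is an assumption. The field names (`pred_prob`, `zip_top_quartile`) and the `last_median` argument are hypothetical.

```python
MARGIN_PER_RESPONSE = 60.0  # dollars per upgrade, as reported in the profit section
COST_PER_MAILING = 1.41     # dollars per mailed piece, as reported in the profit section

# Response rate at which expected revenue just covers the mailing cost
BREAKEVEN = COST_PER_MAILING / MARGIN_PER_RESPONSE

THRESHOLD = 0.0235  # cutoff stated in the report


def select_for_wave2(biz, last_median):
    """Apply the stated criteria to one business record (a dict with hypothetical keys)."""
    return bool(
        biz["pred_prob"] >= THRESHOLD
        and biz["zip_top_quartile"]    # located in a top-quartile zip bin
        and biz["numords"] > 2         # more than two purchases
        and biz["upgraded"] == 1       # has previously upgraded
        and biz["last"] < last_median  # recent transaction
    )


# Made-up records for illustration
likely = {"pred_prob": 0.08, "zip_top_quartile": True, "numords": 4, "upgraded": 1, "last": 5}
unlikely = {"pred_prob": 0.01, "zip_top_quartile": False, "numords": 1, "upgraded": 0, "last": 30}
print(round(BREAKEVEN, 4), select_for_wave2(likely, 15), select_for_wave2(unlikely, 15))
```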

**Q5. How much profit do you anticipate from the wave-2 mailing?**
→ Projected Profit from Wave-2 Mailing

The financial viability of the wave-2 mailing campaign was assessed based on anticipated revenue and cost considerations.

- Total Customers Targeted: 6,843 customers, representing 30.41% of the total pool.
- Expected Response Rate: 7.09%, based on historical response trends.
- Estimated Responses: 485 customers predicted to upgrade.
- Revenue Per Response: $60 per upgrade.
- Total Expected Revenue: $29,095.25.
- Cost Per Mailing: $1.41 per customer.
- Total Mailing Cost: $9,648.63.
- Projected Profit: $19,446.62.
- Return on Marketing Effort (ROME): 201.55%, indicating that every dollar spent on the mailing generates about $2.02 in profit.

This analysis indicates that the Wave-2 mailing is a highly profitable initiative with a strong return on investment.

The expected profit above is based on the test data. Using the response rate derived from the training data and the percentage of the customer base targeted in the test data, we can estimate the overall impact of the direct mail campaign on the larger customer base, excluding those who have already responded. The campaign is expected to reach 221,844 businesses, representing 29.06% of the remaining customer base. This effort is projected to yield 16,190 positive responses, resulting in a profit of $658,692.77 and a Return on Marketing Effort (ROME) of 210.55%.
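
The profit arithmetic above can be reproduced approximately with the sketch below, which takes ROME to be profit divided by mailing cost. The function name is hypothetical, and the inputs are the rounded response counts quoted in the text.

```python
def campaign_economics(n_mailed, expected_responses, margin=60.0, cost_per_piece=1.41):
    """Revenue, cost, profit, and ROME (profit / cost) for a mailing."""
    revenue = expected_responses * margin
    cost = n_mailed * cost_per_piece
    profit = revenue - cost
    return {"revenue": revenue, "cost": cost, "profit": profit, "rome": profit / cost}


# Inputs are the rounded counts quoted in the report
test_sample = campaign_economics(n_mailed=6_843, expected_responses=485)
full_base = campaign_economics(n_mailed=221_844, expected_responses=16_190)

for name, res in (("test sample", test_sample), ("remaining base", full_base)):
    print(f"{name}: profit ${res['profit']:,.2f}, ROME {res['rome']:.2%}")
```

Because the rounded response counts are used as inputs, the printed profits differ slightly from the reported figures, which appear to be based on unrounded expected responses.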


**Q6. What did you learn about the type of businesses that are likely to upgrade?**
→ Businesses most likely to upgrade were in high-response zip bins, had made frequent past purchases (numords > 2), previously upgraded QuickBooks (upgraded = 1), engaged recently (low last and sincepurch values), and were moderate to high spenders. In contrast, low-probability responders were in low-response zip bins, had long inactivity periods, no prior upgrades, and minimal past engagement. Key takeaways for Wave-2 highlighted the importance of geographic segmentation, past engagement as a strong predictor, and recency as a crucial factor. These insights helped refine the marketing strategy to focus on high-likelihood businesses while avoiding low-response segments.


**Conclusion:**

The logistic regression model provided a structured, interpretable, and statistically validated approach for selecting businesses for the Wave-2 mailing. It maximized response rates through data-driven selection, optimized marketing costs by reducing unnecessary outreach, and ensured high projected returns for a profitable campaign. By focusing on high-probability responders, the Wave-2 campaign was designed for strong ROI while maintaining efficiency and precision.

The Cumulative Gains Chart shows that both models, pred_logit (blue) and pred_nnr3 (orange), effectively rank customers based on their likelihood to purchase, as both curves are well above the random selection baseline. However, pred_logit demonstrates a slight advantage, capturing a marginally higher proportion of buyers at various points, particularly in the middle range of the population (30%-80% targeted). This suggests that pred_logit is slightly more effective at prioritizing high-response customers. Toward the upper end (90%-100% targeted), both models converge, indicating that they ultimately capture nearly the same total number of buyers. While the difference is minimal, pred_logit may be the better choice for optimizing marketing efforts.
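
For readers unfamiliar with cumulative gains, the sketch below shows one way such a curve can be computed: customers are ranked by predicted score, and for each targeted share of the population the fraction of all buyers captured is recorded. The function name and toy data are hypothetical; this is not the code used to produce the chart.

```python
def cumulative_gains(labels, scores, n_bins=10):
    """Share of all buyers captured in the top k/n_bins of customers, ranked by score."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)]
    total = sum(ranked)
    if total == 0:
        raise ValueError("no buyers in the data")
    n = len(ranked)
    gains = []
    for k in range(1, n_bins + 1):
        cutoff = n * k // n_bins  # number of customers targeted at this step
        gains.append(sum(ranked[:cutoff]) / total)
    return gains


# Toy data, made up for illustration: 1 = buyer, 0 = non-buyer
toy_labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
toy_scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
toy_gains = cumulative_gains(toy_labels, toy_scores, n_bins=5)
print(toy_gains)
```

The random-selection baseline at step k is simply k / n_bins, so a model adds value wherever its gains value exceeds that fraction.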

