### Goal - To use scikit-learn linear regression to make predictions with single and multiple features

### Dataset Overview
**Dataset:** E-Commerce Customer Data  
- **500 customer records** with 8 columns.  
- **Target Variable:** Yearly Amount Spent (ranges from $256 to $765).  

**Key Features:**  
- Avg. Session Length (minutes on site per session).  
- Time on App (minutes spent on mobile app per year).  
- Time on Website (minutes spent on website per year).  
- Length of Membership (years as a customer).  

**Notes:**  
- No missing values - clean dataset!

### Problem Statement 1: Single Feature Linear Regression

**Business Problem:**  
"As an e-commerce analyst, predict customer's Yearly Amount Spent based solely on their Length of Membership. This will help understand if customer loyalty (tenure) directly translates to revenue."

---

### High-Level Technical Steps

#### Data Preparation
- Load the dataset using pandas.  
- Select 'Length of Membership' as X (feature).  
- Select 'Yearly Amount Spent' as y (target).  
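
A minimal sketch of these steps. The data file is not named here, so the `read_csv` call appears only as a comment with a hypothetical file name, and a small synthetic frame with the same two column names stands in for the real data (all of its numbers are made up).

```python
import numpy as np
import pandas as pd

# With the real data this would be something like:
#   df = pd.read_csv("ecommerce_customers.csv")  # hypothetical file name
# Synthetic stand-in (assumption): 500 rows with a made-up linear relationship.
rng = np.random.default_rng(0)
membership = rng.normal(3.5, 1.0, 500)
df = pd.DataFrame({
    "Length of Membership": membership,
    "Yearly Amount Spent": 300 + 60 * membership + rng.normal(0, 20, 500),
})

X = df[["Length of Membership"]]  # double brackets keep X two-dimensional
y = df["Yearly Amount Spent"]     # target as a one-dimensional Series

print(X.shape, y.shape)
```

scikit-learn estimators expect a 2-D feature matrix, which is why `X` is selected with a list of column names even when there is only one feature.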

#### Data Visualization
- Create a scatter plot to visualize the relationship.  
- Check for a linear pattern.  

#### Train-Test Split
- Split data: 80% training, 20% testing.  
- Use `random_state=42` for reproducibility.  
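
The split could look roughly like this. The arrays are synthetic placeholders (an assumption) so the snippet runs on its own; with the real data, `X` and `y` would be the selected feature and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): one feature column and a made-up target.
rng = np.random.default_rng(0)
X = rng.normal(3.5, 1.0, size=(500, 1))
y = 300 + 60 * X[:, 0] + rng.normal(0, 20, 500)

# test_size=0.2 gives the 80/20 split; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```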

#### Model Training
- Import and instantiate `LinearRegression()`.  
- Fit the model on training data.  

#### Model Evaluation
- Make predictions on the test set.  
- Calculate metrics: MAE, MSE, RMSE, R².  
- Plot the regression line over the scatter plot.  
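
A sketch of training and the numeric part of the evaluation, again on made-up synthetic data (an assumption). RMSE is taken as the square root of MSE. The plotting step is left out here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): one feature column and a made-up target.
rng = np.random.default_rng(0)
X = rng.normal(3.5, 1.0, size=(500, 1))
y = 300 + 60 * X[:, 0] + rng.normal(0, 20, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)  # predictions on unseen rows only

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)             # same units as the target (dollars)
r2 = r2_score(y_test, y_pred)

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```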

#### Interpretation
- Extract coefficient and intercept.  
- Interpret: "For each additional year of membership, spending increases by $X."
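
Reading the fitted parameters might look like the following. The data is synthetic with a made-up slope of 60 (an assumption), so the printed dollar figure says nothing about the real customers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in (assumption): true slope 60, true intercept 300.
rng = np.random.default_rng(0)
X = rng.normal(3.5, 1.0, size=(500, 1))
y = 300 + 60 * X[:, 0] + rng.normal(0, 20, 500)

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # change in spending per extra year of membership
intercept = model.intercept_  # predicted spending at zero years of membership

print(f"For each additional year of membership, spending increases by ${slope:.2f}.")
print(f"Intercept: ${intercept:.2f}")
```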

<br>
<br>

### Problem Statement 2: Multiple Feature Linear Regression

**Business Problem:**  
"Build a comprehensive model to predict Yearly Amount Spent using customer engagement metrics: Length of Membership, Time on App, and Avg. Session Length. Determine which platform (app vs website) drives more revenue."

---

### High-Level Technical Steps

#### Feature Selection
- Select features: ['Length of Membership', 'Time on App', 'Avg. Session Length'].  
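- Exclude 'Time on Website' initially (due to low correlation with the target); add it back when comparing app vs website, since that comparison needs both features.  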

#### Data Preparation
- Create feature matrix X with selected columns.  
- Keep y as 'Yearly Amount Spent'.  

#### Exploratory Analysis
- Create pairplot to visualize relationships.  
- Check multicollinearity using correlation heatmap.  
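
As a numeric counterpart to the heatmap, the correlation matrix can be inspected directly. This sketch uses synthetic, independently drawn columns (an assumption), so the correlations it prints are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption): three independent, made-up feature columns.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Length of Membership": rng.normal(3.5, 1, n),
    "Time on App": rng.normal(12, 1, n),
    "Avg. Session Length": rng.normal(33, 1, n),
})

features = ["Length of Membership", "Time on App", "Avg. Session Length"]
corr = df[features].corr()  # pairwise Pearson correlations

# Mask the diagonal (always 1.0) and find the strongest feature-feature link.
off_diag = corr.where(~np.eye(len(features), dtype=bool))
max_abs_corr = off_diag.abs().max().max()

print(corr.round(2))
print(f"Largest absolute off-diagonal correlation: {max_abs_corr:.2f}")
```

A large off-diagonal value would suggest that two features carry overlapping information, which makes individual coefficients harder to interpret.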

#### Train-Test Split
- Perform an 80-20 split with `random_state=42`.  

#### Model Training
- Instantiate a `LinearRegression` model.  
- Fit the model on training data.  

#### Model Evaluation
- Calculate R², MAE, RMSE.  
- Compare results with the single feature model.  
- Create residual plots.  
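
One way to put the two models side by side on the same split. `fit_and_score` is a hypothetical helper, and the data is synthetic with made-up coefficients (an assumption), so only the mechanics carry over. The residuals it returns are what a residual plot would display.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): made-up coefficients and noise.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Avg. Session Length": rng.normal(33, 1, n),
    "Time on App": rng.normal(12, 1, n),
    "Length of Membership": rng.normal(3.5, 1, n),
})
df["Yearly Amount Spent"] = (
    -1000 + 25 * df["Avg. Session Length"] + 40 * df["Time on App"]
    + 60 * df["Length of Membership"] + rng.normal(0, 10, n)
)

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
y_train = train_df["Yearly Amount Spent"]
y_test = test_df["Yearly Amount Spent"]

def fit_and_score(cols):
    """Fit on the training rows, return test metrics and test residuals."""
    model = LinearRegression().fit(train_df[cols], y_train)
    y_pred = model.predict(test_df[cols])
    metrics = {
        "R2": r2_score(y_test, y_pred),
        "MAE": mean_absolute_error(y_test, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
    }
    return metrics, y_test - y_pred

single_metrics, _ = fit_and_score(["Length of Membership"])
multi_metrics, residuals = fit_and_score(
    ["Length of Membership", "Time on App", "Avg. Session Length"]
)

comparison = pd.DataFrame({"single": single_metrics, "multi": multi_metrics})
print(comparison.round(3))
```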

#### Feature Importance Analysis
- Extract and rank coefficients (each is expressed per unit of its own feature, so compare magnitudes with units in mind).  
- Create a bar plot of feature importance.  
- **Business Insight:** "Which channel drives more revenue?"
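
Ranking coefficients by absolute size could be sketched as below. The data and its coefficients are invented (an assumption), so the resulting order is illustrative only. Raw coefficients are per unit of each feature, so the ranking mixes years and minutes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in (assumption): made-up coefficients 60, 40 and 25.
rng = np.random.default_rng(0)
n = 500
features = ["Length of Membership", "Time on App", "Avg. Session Length"]
df = pd.DataFrame({
    "Length of Membership": rng.normal(3.5, 1, n),
    "Time on App": rng.normal(12, 1, n),
    "Avg. Session Length": rng.normal(33, 1, n),
})
df["Yearly Amount Spent"] = (
    -1000 + 60 * df["Length of Membership"] + 40 * df["Time on App"]
    + 25 * df["Avg. Session Length"] + rng.normal(0, 10, n)
)

model = LinearRegression().fit(df[features], df["Yearly Amount Spent"])

# Label each coefficient with its feature, then order by absolute magnitude.
coefs = pd.Series(model.coef_, index=features)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(ranked.round(2))
# ranked.plot.bar() would draw the bar chart (requires matplotlib).
```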

<br>
<br>
<br>

### Problem Statement 3: Advanced Model with Feature Engineering & Scaling

**Business Problem:**  
"Create an optimized prediction model for Yearly Amount Spent by engineering new features from customer behavior patterns and properly scaling all inputs. Test hypothesis: 'Power users (high app usage + long membership) spend disproportionately more.'"

---

### High-Level Technical Steps

#### Feature Engineering
- Create 'Total Digital Time' = Time on App + Time on Website.  
- Create 'Engagement Score' = Avg. Session Length × Time on App.  
- Create 'Power User' indicator = (Time on App > median) & (Length of Membership > median).  
- Create 'App vs Web Ratio' = Time on App / (Time on Website + 1).  
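
The four engineered columns translate fairly directly into pandas. This sketch runs on synthetic feature values (an assumption) and computes the medians over the whole frame for brevity. Computing them on the training split only would be the stricter choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption): made-up feature values, no target needed.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Avg. Session Length": rng.normal(33, 1, n),
    "Time on App": rng.normal(12, 1, n),
    "Time on Website": rng.normal(37, 1, n),
    "Length of Membership": rng.normal(3.5, 1, n),
})

df["Total Digital Time"] = df["Time on App"] + df["Time on Website"]
df["Engagement Score"] = df["Avg. Session Length"] * df["Time on App"]

# 1 when both app time and membership length are above their medians, else 0.
df["Power User"] = (
    (df["Time on App"] > df["Time on App"].median())
    & (df["Length of Membership"] > df["Length of Membership"].median())
).astype(int)

# The +1 in the denominator guards against division by zero.
df["App vs Web Ratio"] = df["Time on App"] / (df["Time on Website"] + 1)

print(df.iloc[:, 4:].describe().round(2))
```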

#### Feature Scaling Analysis
- Check feature distributions using histograms.  
- Identify features that sit on markedly different scales and therefore need scaling.  

#### Data Preprocessing Pipeline
- Import `StandardScaler` from sklearn.  
- Create pipeline: scaling → model.  
- Consider `MinMaxScaler` as an alternative.  
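
A possible shape for the pipeline, using `make_pipeline` on synthetic data (an assumption). Swapping in `MinMaxScaler` is a one-line change. For plain least-squares regression the two scalers should give the same predictions, because rescaling the inputs only rescales the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in (assumption): three made-up features and coefficients.
rng = np.random.default_rng(0)
X = rng.normal(loc=[3.5, 12.0, 33.0], scale=1.0, size=(500, 3))
y = X @ np.array([60.0, 40.0, 25.0]) - 1000 + rng.normal(0, 10, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each pipeline fits its scaler on the training data, then fits the model.
std_pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)
minmax_pipe = make_pipeline(MinMaxScaler(), LinearRegression()).fit(X_train, y_train)

r2_std = std_pipe.score(X_test, y_test)  # R² on the held-out rows
same_predictions = np.allclose(
    std_pipe.predict(X_test), minmax_pipe.predict(X_test)
)

print(f"R2 (StandardScaler pipeline): {r2_std:.3f}")
print(f"Same predictions with MinMaxScaler: {same_predictions}")
```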

#### Advanced Train-Test Split
- Use the same 80-20 split.  
- Fit the scaler on the training data only, then apply that fitted scaler to both the training and test sets (prevents data leakage).  
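
Without a pipeline, the leakage-safe pattern is usually to learn the scaling statistics from the training rows and reuse them on the test rows. A small sketch on synthetic features (an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): three made-up feature columns.
rng = np.random.default_rng(0)
X = rng.normal(loc=[3.5, 12.0, 33.0], scale=1.0, size=(500, 3))
y = X @ np.array([60.0, 40.0, 25.0]) + rng.normal(0, 10, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print("train means:", X_train_scaled.mean(axis=0).round(3))
print("test means: ", X_test_scaled.mean(axis=0).round(3))
```

The scaled test means are close to zero but not exactly zero, which is expected: the test rows never influenced the statistics used to scale them.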

#### Model Training & Comparison
- Train model with original features (scaled).  
- Train model with engineered features (scaled).  
- Compare both performances.  

#### Polynomial Features (Optional)
- Test quadratic terms for Length of Membership.  
- Use `PolynomialFeatures(degree=2, include_bias=False)`.  
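
To illustrate the quadratic test: `PolynomialFeatures` turns the single column x into [x, x²], and the expanded model can then be compared with the plain linear one. The data below is synthetic and purely linear by construction (an assumption), so the two scores should come out nearly equal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in (assumption): one feature with a made-up linear effect.
rng = np.random.default_rng(0)
X = rng.normal(3.5, 1.0, size=(500, 1))
y = 300 + 60 * X[:, 0] + rng.normal(0, 20, 500)

# include_bias=False drops the constant column; LinearRegression adds its own intercept.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x, x squared

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
linear_model = LinearRegression().fit(X_train, y_train)
quad_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
).fit(X_train, y_train)

r2_linear = linear_model.score(X_test, y_test)
r2_quad = quad_model.score(X_test, y_test)
print(f"linear R2={r2_linear:.3f}  quadratic R2={r2_quad:.3f}")
```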

#### Model Validation
- Perform cross-validation (5-fold).  
- Check for overfitting.  
- Visualize learning curves.  
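
The validation steps could be sketched with `cross_val_score` and `learning_curve`. Putting the scaler inside the pipeline means it is refit within each fold. The data is synthetic (an assumption), and the arrays returned by `learning_curve` are what a learning-curve plot would draw.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): three made-up features and coefficients.
rng = np.random.default_rng(0)
X = rng.normal(loc=[3.5, 12.0, 33.0], scale=1.0, size=(500, 3))
y = X @ np.array([60.0, 40.0, 25.0]) - 1000 + rng.normal(0, 10, 500)

model = make_pipeline(StandardScaler(), LinearRegression())

# 5-fold cross-validated R²: one score per held-out fold.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="r2")

# Training and validation R² at five increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, scoring="r2", train_sizes=np.linspace(0.1, 1.0, 5)
)

# A large positive gap (train much better than validation) hints at overfitting.
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)

print(f"CV R2: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")
print("train-minus-validation gap by size:", gap.round(3))
```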

#### Final Analysis
- Analyze feature importance with engineered features.  
- Provide business recommendations based on findings.  
- Assess model deployment readiness.  