# 🚀 Getting Started with Data Science Master System

Welcome to the **Data Science Master System** - a comprehensive, production-ready framework for data science and machine learning.

## What You'll Learn

In this notebook, you'll learn:
1. How to install and import the system
2. Basic data loading and exploration
3. Your first machine learning pipeline
4. Making predictions

**Prerequisites**: Basic Python knowledge

**Time Required**: ~15 minutes

## 1. Installation

First, let's install the Data Science Master System:

In [None]:
# Install the package (uncomment if needed)
# !pip install data-science-master-system

# Or make a local checkout importable (adjust the relative path to your repo root)
import sys
sys.path.insert(0, '../../')
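Before moving on, it can help to confirm that the package is actually importable from the current environment. The following is a minimal sketch using only the standard library; the module name `data_science_master_system` comes from this notebook's import cell, while the helper name `is_installed` is hypothetical and introduced here purely for illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


name = "data_science_master_system"
if is_installed(name):
    print(f"{name}: found")
else:
    print(f"{name}: not found; revisit the install step or the sys.path entry")
```

If the check reports "not found" after the `sys.path.insert` call, the relative path most likely does not point at the directory that contains the package folder.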

## 2. Import the System

In [None]:
# Import main components
from data_science_master_system import (
    Pipeline,
    DataLoader,
    FeatureFactory,
    Evaluator,
    Plotter,
)

# Standard libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

print("✅ All imports successful!")

## 3. Load Sample Data

Let's load a customer churn dataset - this is a classification problem where we predict if a customer will leave.

In [None]:
# Method 1: Using DataLoader (recommended)
loader = DataLoader()
df = loader.read('../data/csv/customer_churn.csv')

# Check the data
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.head()

In [None]:
# Quick data info
print("\n📊 Data Summary:")
print(f"  • Total customers: {len(df)}")
print(f"  • Churned customers: {df['churn'].sum()} ({df['churn'].mean()*100:.1f}%)")
print(f"  • Features: {len(df.columns) - 1}")
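The same summary can be reproduced with plain pandas, which is useful for sanity-checking what the loader returned. This sketch assumes the column names used in this notebook (`customer_id`, `age`, `monthly_charges`, `churn`) and a 0/1 encoded `churn` column. The rows are made up for illustration and are not taken from the real dataset, and `summarize_churn` is a hypothetical helper.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the real data lives in customer_churn.csv.
SAMPLE_CSV = """customer_id,age,monthly_charges,churn
C001,34,59.90,0
C002,51,89.50,1
C003,27,29.99,0
C004,45,74.25,1
C005,39,64.00,0
"""


def summarize_churn(df: pd.DataFrame, target: str = "churn") -> dict:
    """Return row count, churned count, churn rate, and number of non-target columns."""
    return {
        "customers": len(df),
        "churned": int(df[target].sum()),
        "churn_rate": float(df[target].mean()),
        "features": df.shape[1] - 1,  # every column except the target
    }


df_sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize_churn(df_sample)
print(summary)
```

Note that the feature count here still includes `customer_id`, just as in the cell above; the ID column is only dropped later, before modeling.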

## 4. Quick Data Exploration

In [None]:
# Visualize churn distribution
plotter = Plotter(style='default')

import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Churn distribution (sort_index keeps class 0 first so colors and tick labels line up)
df['churn'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color=['green', 'red'])
axes[0].set_title('Churn Distribution')
axes[0].set_xticklabels(['No Churn', 'Churn'], rotation=0)

# Age distribution
df['age'].hist(bins=20, ax=axes[1], color='steelblue')
axes[1].set_title('Age Distribution')

# Monthly charges
df['monthly_charges'].hist(bins=20, ax=axes[2], color='coral')
axes[2].set_title('Monthly Charges')

plt.tight_layout()
plt.show()

## 5. Build Your First Pipeline

Now let's create and train a machine learning pipeline in just **a few lines of code**!

In [None]:
# Remove ID column (not useful for prediction)
df_ml = df.drop(columns=['customer_id'])

# Create and train the pipeline
pipeline = Pipeline.auto_detect(df_ml, target='churn')  # Auto-detect problem type
pipeline.fit()  # Train the model

print("✅ Pipeline trained successfully!")
print(f"  • Problem type: {pipeline.problem_type}")
print(f"  • Model: {pipeline.model_name}")

## 6. Evaluate the Model

In [None]:
# Split data for proper evaluation
from sklearn.model_selection import train_test_split

X = df_ml.drop(columns=['churn'])
y = df_ml['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Retrain on training data only
pipeline_proper = Pipeline(
    problem_type='classification',
    model_name='random_forest',
    auto_preprocess=True
)
pipeline_proper.fit(X_train, y_train)

# Evaluate on test set
metrics = pipeline_proper.evaluate(X_test, y_test)

print("\n📈 Model Performance:")
for metric, value in metrics.items():
    print(f"  • {metric}: {value:.4f}")
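For comparison, here is a rough sketch of a hold-out evaluation written directly against scikit-learn. It is not a description of what `Pipeline.evaluate` does internally, since that is not shown in this notebook. The data is synthetic (generated with `make_classification`), a random forest is used only because the cell above requests `model_name='random_forest'`, and the `_demo` variable names are chosen to avoid clobbering the notebook's own variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for the churn table.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=42
)

# stratify=y_demo keeps the positive-class rate similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model_demo = RandomForestClassifier(n_estimators=100, random_state=42)
model_demo.fit(X_tr, y_tr)

y_pred_demo = model_demo.predict(X_te)
y_score_demo = model_demo.predict_proba(X_te)[:, 1]  # probability of the positive class

metrics_demo = {
    "accuracy": accuracy_score(y_te, y_pred_demo),
    "f1": f1_score(y_te, y_pred_demo),
    "roc_auc": roc_auc_score(y_te, y_score_demo),
}
for name, value in metrics_demo.items():
    print(f"{name}: {value:.4f}")
```

With an imbalanced target such as churn, accuracy alone can look good for a model that rarely predicts the minority class, which is why F1 and ROC AUC are reported alongside it.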

## 7. Make Predictions

In [None]:
# Make predictions on test data
predictions = pipeline_proper.predict(X_test)
probabilities = pipeline_proper.predict_proba(X_test)

# Show sample predictions
results = pd.DataFrame({
    'Actual': y_test.values[:10],
    'Predicted': predictions[:10],
    'Churn Probability': probabilities[:10, 1].round(3)
})

print("\n🔮 Sample Predictions:")
display(results)
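The `probabilities[:10, 1]` slice above takes the second column of the probability array, which is treated as the churn probability. A small sketch of turning such probabilities into labels with a custom cut-off follows. It assumes the pipeline returns an `(n_samples, 2)` array in scikit-learn's column order, which this notebook does not confirm; `apply_threshold` is a hypothetical helper and the probabilities are hand-written.

```python
import numpy as np


def apply_threshold(proba: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn an (n_samples, 2) probability array into 0/1 labels.

    Column 1 is treated as the positive (churn) class.
    """
    return (proba[:, 1] >= threshold).astype(int)


# Hand-written probabilities for illustration; each row sums to 1.
proba_demo = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.5, 0.5]])

labels_default = apply_threshold(proba_demo)      # cut-off at 0.5
labels_low = apply_threshold(proba_demo, 0.35)    # lower cut-off flags more customers

print(labels_default)  # [0 0 1 1]
print(labels_low)      # [0 1 1 1]
```

Lowering the threshold trades precision for recall, which can be reasonable when missing a churning customer costs more than contacting a loyal one.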

## 8. Feature Importance

Let's see which features are most important for predicting churn:

In [None]:
# Get feature importance
importance_df = pipeline_proper.feature_importance(top_n=10)

# Plot
fig = plotter.feature_importance(importance_df, title='Top 10 Features for Churn Prediction')
plt.show()
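To show roughly what a top-N importance table looks like, the sketch below builds one from scikit-learn's `feature_importances_` attribute on a random forest. This is an assumption about the general technique, not a description of how `pipeline_proper.feature_importance` is implemented; the data is synthetic and the feature names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names.
X_fi, y_fi = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_fi.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fi, y_fi)

# Impurity-based importances sum to 1 across all features.
importance_demo = (
    pd.DataFrame({"feature": feature_names, "importance": forest.feature_importances_})
    .sort_values("importance", ascending=False)
    .head(3)
    .reset_index(drop=True)
)
print(importance_demo)
```

Impurity-based importances tend to favor high-cardinality features, so it is worth cross-checking them with permutation importance before drawing business conclusions.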

## 9. Save Your Model

In [None]:
# Save the trained pipeline
pipeline_proper.save('my_first_model.joblib')
print("✅ Model saved!")

# Later, you can load it:
# loaded_pipeline = Pipeline.load('my_first_model.joblib')
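A quick way to trust a saved model is to reload it and confirm it predicts the same values. The sketch below does this round trip with `joblib` and a small scikit-learn model. Whether `Pipeline.save` uses joblib internally is an assumption based only on the `.joblib` file extension, and the toy data is invented for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, two classes.
X_s = np.array([[0.0], [1.0], [2.0], [3.0]])
y_s = np.array([0, 0, 1, 1])
model_s = LogisticRegression().fit(X_s, y_s)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model_s, path)
    restored = joblib.load(path)

same_predictions = bool(np.array_equal(model_s.predict(X_s), restored.predict(X_s)))
print(f"Predictions identical after reload: {same_predictions}")
```

Because joblib files are pickle-based, only load model files that come from a source you trust.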

## 🎉 Congratulations!

You've just:
- ✅ Loaded data using DataLoader
- ✅ Explored data with visualizations
- ✅ Built a machine learning pipeline
- ✅ Evaluated model performance
- ✅ Made predictions
- ✅ Analyzed feature importance
- ✅ Saved your model

### Next Steps

Continue learning with:
1. **02_data_loading_and_exploration.ipynb** - Deep dive into data handling
2. **03_feature_engineering.ipynb** - Create powerful features
3. **04_model_comparison.ipynb** - Compare multiple models