# Problem Statement

## **Business Context**

"Visit with Us," a leading travel company, is revolutionizing the tourism industry by leveraging data-driven strategies to optimize operations and customer engagement. While introducing a new package offering, such as the Wellness Tourism Package, the company faces challenges in targeting the right customers efficiently. The manual approach to identifying potential customers is inconsistent, time-consuming, and prone to errors, leading to missed opportunities and suboptimal campaign performance.

To address these issues, the company aims to implement a scalable and automated system that integrates customer data, predicts potential buyers, and enhances decision-making for marketing strategies. By utilizing an MLOps pipeline, the company seeks to achieve seamless integration of data preprocessing, model development, deployment, and CI/CD practices for continuous improvement. This system will ensure efficient targeting of customers, timely updates to the predictive model, and adaptation to evolving customer behaviors, ultimately driving growth and customer satisfaction.


## **Objective**

As an MLOps Engineer at "Visit with Us," your responsibility is to design and deploy an MLOps pipeline on GitHub to automate the end-to-end workflow for predicting customer purchases. The primary objective is to build a model that predicts whether a customer will purchase the newly introduced Wellness Tourism Package before contacting them. The pipeline will include data cleaning, preprocessing, transformation, model building, training, evaluation, and deployment, ensuring consistent performance and scalability. By leveraging GitHub Actions for CI/CD integration, the system will enable automated updates, streamline model deployment, and improve operational efficiency. This robust predictive solution will empower the marketing team to make data-driven decisions, enhance marketing strategies, and effectively target potential customers, thereby driving customer acquisition and business growth.

## **Data Description**

The dataset contains customer and interaction data that serve as key attributes for predicting the likelihood of purchasing the Wellness Tourism Package. The detailed attributes are:

**Customer Details**
- **CustomerID:** Unique identifier for each customer.
- **ProdTaken:** Target variable indicating whether the customer has purchased a package (0: No, 1: Yes).
- **Age:** Age of the customer.
- **TypeofContact:** The method by which the customer was contacted (Company Invited or Self Enquiry).
- **CityTier:** The city category based on development, population, and living standards (Tier 1 > Tier 2 > Tier 3).
- **Occupation:** Customer's occupation (e.g., Salaried, Freelancer).
- **Gender:** Gender of the customer (Male, Female).
- **NumberOfPersonVisiting:** Total number of people accompanying the customer on the trip.
- **PreferredPropertyStar:** Preferred hotel rating by the customer.
- **MaritalStatus:** Marital status of the customer (Single, Married, Divorced).
- **NumberOfTrips:** Average number of trips the customer takes annually.
- **Passport:** Whether the customer holds a valid passport (0: No, 1: Yes).
- **OwnCar:** Whether the customer owns a car (0: No, 1: Yes).
- **NumberOfChildrenVisiting:** Number of children below age 5 accompanying the customer.
- **Designation:** Customer's designation in their current organization.
- **MonthlyIncome:** Gross monthly income of the customer.

**Customer Interaction Data**
- **PitchSatisfactionScore:** Score indicating the customer's satisfaction with the sales pitch.
- **ProductPitched:** The type of product pitched to the customer.
- **NumberOfFollowups:** Total number of follow-ups by the salesperson after the sales pitch.
- **DurationOfPitch:** Duration of the sales pitch delivered to the customer.


# Model Building

In [1]:
# Create a master folder to keep all files created when executing the below code cells
import os
os.makedirs("tourism_project", exist_ok=True)

In [2]:
# Create a folder for storing the model building files
os.makedirs("tourism_project/model_building", exist_ok=True)

## Data Registration

In [3]:
os.makedirs("tourism_project/data", exist_ok=True)

Once the **data** folder has been created by executing the above cell, please upload **tourism.csv** into that folder.

## Data Preparation

In [9]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from datasets import Dataset, load_dataset
from huggingface_hub import login

# Read the Hugging Face token from the environment; never hard-code it in the notebook
HF_TOKEN = os.getenv("HF_TOKEN")
login(token=HF_TOKEN)

# Load data from HuggingFace
try:
    dataset = load_dataset("abhishek-kumar/tourism-package-prediction", split="train")
    df = dataset.to_pandas()
    print(f"Dataset loaded from HuggingFace: {len(df)} rows")
except Exception:
    # Fall back to the CSV uploaded to the data folder earlier
    df = pd.read_csv("tourism_project/data/tourism.csv")
    print(f"Dataset loaded locally: {len(df)} rows")

# Data cleaning and preprocessing
print("Starting data cleaning...")

# Remove unnecessary columns
if 'Unnamed: 0' in df.columns:
    df = df.drop('Unnamed: 0', axis=1)

# Handle missing values
print("Missing values before cleaning:")
print(df.isnull().sum()[df.isnull().sum() > 0])

# Fill missing values
numerical_cols = ['Age', 'DurationOfPitch', 'NumberOfFollowups', 'PreferredPropertyStar', 
                 'NumberOfTrips', 'PitchSatisfactionScore', 'NumberOfChildrenVisiting', 'MonthlyIncome']

for col in numerical_cols:
    if col in df.columns:
        df[col] = df[col].fillna(df[col].median())

categorical_cols = ['TypeofContact', 'Occupation', 'Gender', 'MaritalStatus', 
                   'ProductPitched', 'Designation']

for col in categorical_cols:
    if col in df.columns:
        df[col] = df[col].fillna(df[col].mode()[0] if not df[col].mode().empty else 'Unknown')

# Fix data inconsistencies
df['Gender'] = df['Gender'].replace('Fe Male', 'Female')

print("Missing values after cleaning:")
print(df.isnull().sum()[df.isnull().sum() > 0])

# Feature engineering
print("Feature engineering...")

# Create income categories
df['IncomeCategory'] = pd.cut(df['MonthlyIncome'], 
                             bins=[0, 15000, 25000, 35000, float('inf')], 
                             labels=[0, 1, 2, 3])  # Use numeric labels

# Create age groups
df['AgeGroup'] = pd.cut(df['Age'], 
                       bins=[0, 25, 35, 45, 55, float('inf')], 
                       labels=[0, 1, 2, 3, 4])  # Use numeric labels

# Encode categorical variables
label_encoders = {}
categorical_columns = df.select_dtypes(include=['object']).columns

for col in categorical_columns:
    if col != 'CustomerID':
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype(str))
        label_encoders[col] = le

print(f"Data preprocessing completed! Final shape: {df.shape}")

# Split data
print("Splitting data...")
X = df.drop(['CustomerID', 'ProdTaken'], axis=1)
y = df['ProdTaken']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create train and test dataframes
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

# Add CustomerID
train_df.insert(0, 'CustomerID', range(300000, 300000 + len(train_df)))
test_df.insert(0, 'CustomerID', range(400000, 400000 + len(test_df)))

# Save locally
train_df.to_csv("tourism_project/data/train_data.csv", index=False)
test_df.to_csv("tourism_project/data/test_data.csv", index=False)

print(f"Data split completed!")
print(f"Training set: {len(train_df)} samples")
print(f"Test set: {len(test_df)} samples")

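# Upload train dataset
# preserve_index=False keeps the shuffled pandas index from being stored as an extra
# "__index_level_0__" column, which would otherwise be picked up as a model feature
train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
train_dataset.push_to_hub(
    "abhishek-kumar/tourism-package-prediction-train",
    private=False,
    token=HF_TOKEN
)

# Upload test dataset
test_dataset = Dataset.from_pandas(test_df, preserve_index=False)
test_dataset.push_to_hub(
    "abhishek-kumar/tourism-package-prediction-test",
    private=False,
    token=HF_TOKEN
)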
print("Processed datasets uploaded to HuggingFace!")



Dataset loaded from HuggingFace: 4128 rows
Starting data cleaning...
Missing values before cleaning:
Series([], dtype: int64)
Missing values after cleaning:
Series([], dtype: int64)
Feature engineering...
Data preprocessing completed! Final shape: (4128, 22)
Splitting data...
Data split completed!
Training set: 3302 samples
Test set: 826 samples

No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.

Processed datasets uploaded to HuggingFace!
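
The Streamlit app further down needs the same integer codes that `LabelEncoder` assigned in this cell. `LabelEncoder` numbers the distinct values from 0 in sorted order, so the mapping can be reproduced with plain Python. The sketch below emulates that rule with the standard library only; the helper name `sorted_label_map` and the sample category lists are illustrative assumptions, not values read from the dataset.

```python
def sorted_label_map(values):
    """Emulate LabelEncoder: sorted distinct values numbered from 0."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

# Hypothetical category lists, for illustration only. The real ones come from
# label_encoders[col].classes_ after fitting on the training data.
gender_map = sorted_label_map(["Male", "Female", "Male"])
designation_map = sorted_label_map(
    ["Executive", "Manager", "Senior Manager", "AVP", "VP"]
)

print(gender_map)
print(designation_map)
```

Inside the notebook, `dict(zip(le.classes_, range(len(le.classes_))))` for each encoder in `label_encoders` gives the exact mapping, which could be saved next to the model so the app does not have to hard-code it.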


## Model Training and Registration with Experimentation Tracking

In [12]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report
from sklearn.model_selection import GridSearchCV
from huggingface_hub import HfApi
import xgboost as xgb
import mlflow
import mlflow.sklearn
import mlflow.xgboost
from datasets import load_dataset
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Setup MLflow
mlflow.set_experiment("tourism_package_prediction")

# Load train and test data
try:
    train_dataset = load_dataset("abhishek-kumar/tourism-package-prediction-train", split="train")
    train_df = train_dataset.to_pandas()
    test_dataset = load_dataset("abhishek-kumar/tourism-package-prediction-test", split="train")
    test_df = test_dataset.to_pandas()
    print("Data loaded from HuggingFace")
except Exception:
    train_df = pd.read_csv("tourism_project/data/train_data.csv")
    test_df = pd.read_csv("tourism_project/data/test_data.csv")
    print("Data loaded locally")

# Prepare features
X_train = train_df.drop(['CustomerID', 'ProdTaken'], axis=1)
y_train = train_df['ProdTaken']
X_test = test_df.drop(['CustomerID', 'ProdTaken'], axis=1)
y_test = test_df['ProdTaken']

print(f"Training features shape: {X_train.shape}")
print(f"Test features shape: {X_test.shape}")

# Function to evaluate models
def evaluate_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else y_pred
    
    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1_score': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_pred_proba)
    }
    
    print(f"\n{model_name} Performance:")
    for metric, value in metrics.items():
        print(f"   {metric.capitalize()}: {value:.4f}")
    
    return metrics

# Train models with hyperparameter tuning
models_results = []

# 1. Decision Tree
print("Training Decision Tree...")
with mlflow.start_run(run_name="DecisionTree"):
    param_grid = {
        'max_depth': [5, 10, 15],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }
    
    dt = DecisionTreeClassifier(random_state=42)
    grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    best_dt = grid_search.best_estimator_
    mlflow.log_params(grid_search.best_params_)
    mlflow.log_param("model_type", "DecisionTree")
    
    dt_metrics = evaluate_model(best_dt, X_test, y_test, "Decision Tree")
    mlflow.log_metrics(dt_metrics)
    mlflow.sklearn.log_model(best_dt, "model")
    
    models_results.append(("Decision Tree", best_dt, dt_metrics['roc_auc']))

# 2. Random Forest
print("Training Random Forest...")
with mlflow.start_run(run_name="RandomForest"):
    param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [10, 15, None],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2]
    }
    
    rf = RandomForestClassifier(random_state=42)
    grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    best_rf = grid_search.best_estimator_
    mlflow.log_params(grid_search.best_params_)
    mlflow.log_param("model_type", "RandomForest")
    
    rf_metrics = evaluate_model(best_rf, X_test, y_test, "Random Forest")
    mlflow.log_metrics(rf_metrics)
    mlflow.sklearn.log_model(best_rf, "model")
    
    models_results.append(("Random Forest", best_rf, rf_metrics['roc_auc']))

# 3. Gradient Boosting
print("Training Gradient Boosting...")
with mlflow.start_run(run_name="GradientBoosting"):
    param_grid = {
        'n_estimators': [100, 200],
        'learning_rate': [0.05, 0.1, 0.15],
        'max_depth': [3, 5, 7]
    }
    
    gb = GradientBoostingClassifier(random_state=42)
    grid_search = GridSearchCV(gb, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    best_gb = grid_search.best_estimator_
    mlflow.log_params(grid_search.best_params_)
    mlflow.log_param("model_type", "GradientBoosting")
    
    gb_metrics = evaluate_model(best_gb, X_test, y_test, "Gradient Boosting")
    mlflow.log_metrics(gb_metrics)
    mlflow.sklearn.log_model(best_gb, "model")
    
    models_results.append(("Gradient Boosting", best_gb, gb_metrics['roc_auc']))

# 4. XGBoost
print("Training XGBoost...")
with mlflow.start_run(run_name="XGBoost"):
    param_grid = {
        'n_estimators': [100, 200],
        'learning_rate': [0.05, 0.1, 0.15],
        'max_depth': [3, 5, 7],
        'subsample': [0.8, 0.9]
    }
    
    xgb_model = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
    grid_search = GridSearchCV(xgb_model, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    best_xgb = grid_search.best_estimator_
    mlflow.log_params(grid_search.best_params_)
    mlflow.log_param("model_type", "XGBoost")
    
    xgb_metrics = evaluate_model(best_xgb, X_test, y_test, "XGBoost")
    mlflow.log_metrics(xgb_metrics)
    mlflow.xgboost.log_model(best_xgb, "model")
    
    models_results.append(("XGBoost", best_xgb, xgb_metrics['roc_auc']))

# 5. AdaBoost
print("Training AdaBoost...")
with mlflow.start_run(run_name="AdaBoost"):
    param_grid = {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.5, 1.0, 1.5]
    }
    
    ada = AdaBoostClassifier(random_state=42)
    grid_search = GridSearchCV(ada, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    best_ada = grid_search.best_estimator_
    mlflow.log_params(grid_search.best_params_)
    mlflow.log_param("model_type", "AdaBoost")
    
    ada_metrics = evaluate_model(best_ada, X_test, y_test, "AdaBoost")
    mlflow.log_metrics(ada_metrics)
    mlflow.sklearn.log_model(best_ada, "model")
    
    models_results.append(("AdaBoost", best_ada, ada_metrics['roc_auc']))

# Compare models and select best
print("\n" + "="*60)
print("MODEL COMPARISON RESULTS")
print("="*60)

results_df = pd.DataFrame([(name, score) for name, model, score in models_results], 
                         columns=['Model', 'ROC_AUC'])
print(results_df)

# Find best model
best_model_name, best_model, best_score = max(models_results, key=lambda x: x[2])
print(f"\nBest Model: {best_model_name} (ROC-AUC: {best_score:.4f})")

# Save best model
os.makedirs("tourism_project/model_building", exist_ok=True)
joblib.dump(best_model, "tourism_project/model_building/best_model.joblib")

# Register best model to HuggingFace
api = HfApi()
repo_id = "abhishek-kumar/tourism-package-prediction-model"

try:
    api.create_repo(repo_id=repo_id, exist_ok=True, private=False)
    api.upload_file(
        path_or_fileobj="tourism_project/model_building/best_model.joblib",
        path_in_repo="best_model.joblib",
        repo_id=repo_id,
        token=HF_TOKEN
    )
    print(f"Best model registered to HuggingFace: {repo_id}")
except Exception as e:
    print(f"Error registering model: {e}")


Data loaded from HuggingFace
Training features shape: (3302, 21)
Test features shape: (826, 21)
Training Decision Tree...





Decision Tree Performance:
   Accuracy: 0.8668
   Precision: 0.6667
   Recall: 0.6164
   F1_score: 0.6405
   Roc_auc: 0.8710




Training Random Forest...





Random Forest Performance:
   Accuracy: 0.9116
   Precision: 0.9216
   Recall: 0.5912
   F1_score: 0.7203
   Roc_auc: 0.9678




Training Gradient Boosting...





Gradient Boosting Performance:
   Accuracy: 0.9395
   Precision: 0.9360
   Recall: 0.7358
   F1_score: 0.8239
   Roc_auc: 0.9740




Training XGBoost...





XGBoost Performance:
   Accuracy: 0.9358
   Precision: 0.9077
   Recall: 0.7421
   F1_score: 0.8166
   Roc_auc: 0.9588




Training AdaBoost...





AdaBoost Performance:
   Accuracy: 0.8584
   Precision: 0.7059
   Recall: 0.4528
   F1_score: 0.5517
   Roc_auc: 0.8592





MODEL COMPARISON RESULTS
               Model   ROC_AUC
0      Decision Tree  0.871046
1      Random Forest  0.967842
2  Gradient Boosting  0.974004
3            XGBoost  0.958766
4           AdaBoost  0.859212

Best Model: Gradient Boosting (ROC-AUC: 0.9740)


Best model registered to HuggingFace: abhishek-kumar/tourism-package-prediction-model
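
As a quick sanity check on the results above, the selection rule and one of the reported F1 scores can be recomputed by hand. This is a minimal sketch that copies the printed numbers into a dictionary; the names `roc_auc`, `best_name`, and `f1_from` are introduced here for illustration and are not part of the notebook.

```python
# ROC-AUC values copied from the comparison table above.
roc_auc = {
    "Decision Tree": 0.871046,
    "Random Forest": 0.967842,
    "Gradient Boosting": 0.974004,
    "XGBoost": 0.958766,
    "AdaBoost": 0.859212,
}

# Same selection rule as max(models_results, key=lambda x: x[2]).
best_name = max(roc_auc, key=roc_auc.get)

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Gradient Boosting precision and recall as printed above.
gb_f1 = f1_from(0.9360, 0.7358)

print(best_name, round(gb_f1, 4))
```

Because the winner is chosen on the test split, the reported ROC-AUC is likely a slightly optimistic estimate. Comparing models on `grid_search.best_score_` (the cross-validated score) would be one way to keep the test set untouched until the final evaluation.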


# Deployment

## Dockerfile

In [13]:
os.makedirs("tourism_project/deployment", exist_ok=True)

In [14]:
%%writefile tourism_project/deployment/Dockerfile
# Use the official Python 3.9 image as the base (the full image, not the slim variant)
FROM python:3.9

# Set the working directory inside the container to /app
WORKDIR /app

# Copy all files from the current directory on the host to the container's /app directory
COPY . .

# Install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt

# Create a non-root user and run the app as that user
RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user \
	PATH=/home/user/.local/bin:$PATH

WORKDIR $HOME/app

COPY --chown=user . $HOME/app

# Define the command to run the Streamlit app on port "8501" and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]


Overwriting tourism_project/deployment/Dockerfile


## Streamlit App

Please ensure that the web app script is named `app.py`.

In [15]:
%%writefile tourism_project/deployment/app.py
#!/usr/bin/env python3
"""
Streamlit App for Tourism Package Prediction
"""

import streamlit as st
import pandas as pd
import numpy as np
import joblib
from huggingface_hub import hf_hub_download

# Page configuration
st.set_page_config(
    page_title="Tourism Package Prediction",
    page_icon="🏖️",
    layout="wide",
    initial_sidebar_state="expanded"
)

@st.cache_resource
def load_model():
    """Load the trained model from HuggingFace Hub"""
    try:
        model_path = hf_hub_download(
            repo_id="abhishek-kumar/tourism-package-prediction-model",
            filename="best_model.joblib"
        )
        model = joblib.load(model_path)
        return model
    except Exception as e:
        st.error(f"Error loading model: {e}")
        return None

def prepare_input_data(age, gender, marital_status, city_tier, type_of_contact,
                      occupation, designation, monthly_income, num_person_visiting,
                      num_children_visiting, preferred_property_star, num_trips,
                      passport, own_car, duration_of_pitch, product_pitched,
                      num_followups, pitch_satisfaction_score):
    """Prepare input data for model prediction"""
    
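    # Create mapping dictionaries.
    # LabelEncoder assigns codes in sorted (alphabetical) order, so every map must
    # follow that order and cover each category present in the training data.
    # Verify all maps against label_encoders[col].classes_ from the preparation step.
    gender_map = {"Male": 1, "Female": 0}
    marital_map = {"Single": 2, "Married": 1, "Divorced": 0, "Unmarried": 3}
    contact_map = {"Self Enquiry": 1, "Company Invited": 0}
    occupation_map = {"Salaried": 2, "Small Business": 1, "Free Lancer": 0}
    # Sorted order of these five labels: AVP, Executive, Manager, Senior Manager, VP
    designation_map = {"AVP": 0, "Executive": 1, "Manager": 2, "Senior Manager": 3, "VP": 4}
    product_map = {"Basic": 0, "Standard": 1, "Deluxe": 2, "Super Deluxe": 3}
    passport_map = {"Yes": 1, "No": 0}
    car_map = {"Yes": 1, "No": 0}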
    
    # Feature engineering (matching training data encoding)
    if monthly_income <= 15000:
        income_category = 0  # Low
    elif monthly_income <= 25000:
        income_category = 1  # Medium
    elif monthly_income <= 35000:
        income_category = 2  # High
    else:
        income_category = 3  # Very High
    
    if age <= 25:
        age_group = 0  # Young
    elif age <= 35:
        age_group = 1  # Adult
    elif age <= 45:
        age_group = 2  # Middle-aged
    elif age <= 55:
        age_group = 3  # Senior
    else:
        age_group = 4  # Elderly
    
    # Create input array
    input_array = np.array([[
        age, contact_map[type_of_contact], city_tier, duration_of_pitch,
        occupation_map[occupation], gender_map[gender], num_person_visiting,
        num_followups, product_map[product_pitched], preferred_property_star,
        marital_map[marital_status], num_trips, passport_map[passport],
        pitch_satisfaction_score, car_map[own_car], num_children_visiting,
        designation_map[designation], monthly_income, income_category, age_group
    ]])
    
    return input_array

def main():
    """Main Streamlit app"""
    
    st.title("Tourism Package Prediction")
    st.markdown("### Predict Customer Purchase Likelihood for Wellness Tourism Package")
    st.markdown("---")
    
    # Load model
    model = load_model()
    if model is None:
        st.error("Failed to load the prediction model.")
        return
    
    # Sidebar inputs
    st.sidebar.header("Customer Information")
    
    # Demographics
    st.sidebar.subheader("Demographics")
    age = st.sidebar.slider("Age", 18, 80, 35)
    gender = st.sidebar.selectbox("Gender", ["Male", "Female"])
    marital_status = st.sidebar.selectbox("Marital Status", ["Single", "Married", "Divorced", "Unmarried"])
    
    # Location & Contact
    st.sidebar.subheader("Location & Contact")
    city_tier = st.sidebar.selectbox("City Tier", [1, 2, 3])
    type_of_contact = st.sidebar.selectbox("Type of Contact", ["Self Enquiry", "Company Invited"])
    
    # Professional Info
    st.sidebar.subheader("Professional Info")
    occupation = st.sidebar.selectbox("Occupation", ["Salaried", "Small Business", "Free Lancer"])
    designation = st.sidebar.selectbox("Designation", ["Executive", "Manager", "Senior Manager", "AVP", "VP"])
    monthly_income = st.sidebar.number_input("Monthly Income", 10000, 50000, 20000)
    
    # Travel Preferences
    st.sidebar.subheader("Travel Preferences")
    num_person_visiting = st.sidebar.slider("Number of Persons Visiting", 1, 5, 2)
    num_children_visiting = st.sidebar.slider("Number of Children Visiting", 0, 3, 0)
    preferred_property_star = st.sidebar.slider("Preferred Property Star Rating", 1.0, 5.0, 3.0, 0.5)
    num_trips = st.sidebar.slider("Number of Trips per Year", 0.0, 10.0, 2.0, 0.5)
    
    # Additional Info
    st.sidebar.subheader("Additional Info")
    passport = st.sidebar.selectbox("Has Passport", ["Yes", "No"])
    own_car = st.sidebar.selectbox("Owns Car", ["Yes", "No"])
    
    # Sales Interaction
    st.sidebar.subheader("Sales Interaction")
    duration_of_pitch = st.sidebar.slider("Duration of Pitch (minutes)", 5, 60, 15)
    product_pitched = st.sidebar.selectbox("Product Pitched", ["Basic", "Standard", "Deluxe", "Super Deluxe"])
    num_followups = st.sidebar.slider("Number of Followups", 0.0, 6.0, 3.0, 0.5)
    pitch_satisfaction_score = st.sidebar.slider("Pitch Satisfaction Score", 1, 5, 3)
    
    # Main content
    col1, col2 = st.columns([2, 1])
    
    with col1:
        st.subheader("Customer Profile Summary")
        profile_data = {
            "Age": age,
            "Gender": gender,
            "Marital Status": marital_status,
            "City Tier": city_tier,
            "Occupation": occupation,
            "Monthly Income": f"₹{monthly_income:,}",
            "Number of Persons": num_person_visiting,
            "Preferred Star Rating": preferred_property_star,
            "Annual Trips": num_trips,
            "Has Passport": passport,
            "Owns Car": own_car
        }
        
        for key, value in profile_data.items():
            st.write(f"**{key}:** {value}")
    
    with col2:
        st.subheader("Prediction")
        
        if st.button("Predict Purchase Likelihood", type="primary"):
            input_data = prepare_input_data(
                age, gender, marital_status, city_tier, type_of_contact,
                occupation, designation, monthly_income, num_person_visiting,
                num_children_visiting, preferred_property_star, num_trips,
                passport, own_car, duration_of_pitch, product_pitched,
                num_followups, pitch_satisfaction_score
            )
            
            try:
                prediction = model.predict(input_data)[0]
                prediction_proba = model.predict_proba(input_data)[0]
                
                if prediction == 1:
                    st.success("High likelihood of purchase!")
                    st.write(f"**Confidence:** {prediction_proba[1]:.2%}")
                    st.balloons()
                else:
                    st.warning("Low likelihood of purchase")
                    st.write(f"**Confidence:** {prediction_proba[0]:.2%}")
                
                # Probability breakdown
                st.subheader("Probability Breakdown")
                prob_df = pd.DataFrame({
                    'Outcome': ['Will Not Purchase', 'Will Purchase'],
                    'Probability': [prediction_proba[0], prediction_proba[1]]
                })
                st.bar_chart(prob_df.set_index('Outcome'))
                
            except Exception as e:
                st.error(f"Prediction error: {e}")
    
    st.markdown("---")
    st.markdown("### About This Model")
    st.info("""
    This ML model predicts customer purchase likelihood for the Wellness Tourism Package
    based on demographics, travel preferences, and sales interaction data.
    """)

if __name__ == "__main__":
    main()


Overwriting tourism_project/deployment/app.py
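
The app rebuilds `IncomeCategory` and `AgeGroup` with `if`/`elif` chains, while the preparation step used `pd.cut`, whose bins are right-inclusive by default, for example `(15000, 25000]`. The two only agree if the boundary values land in the same bin. The sketch below checks this with the standard library, using `bisect_left` as a stand-in for right-inclusive binning; `bin_index`, `income_category_app`, and the probe values are assumptions made for this illustration.

```python
from bisect import bisect_left

INCOME_EDGES = [15000, 25000, 35000]  # upper edges of the first three income bins
AGE_EDGES = [25, 35, 45, 55]          # upper edges of the first four age bins

def bin_index(value, upper_edges):
    """Index of the right-inclusive bin (lo, hi] that contains value."""
    return bisect_left(upper_edges, value)

def income_category_app(monthly_income):
    """The if/elif chain used in app.py."""
    if monthly_income <= 15000:
        return 0
    elif monthly_income <= 25000:
        return 1
    elif monthly_income <= 35000:
        return 2
    return 3

# Compare both rules on and around every bin edge.
probes = [10000, 15000, 15001, 25000, 25001, 35000, 35001, 50000]
mismatches = [
    x for x in probes if bin_index(x, INCOME_EDGES) != income_category_app(x)
]
print(mismatches)
```

An empty list means the app and the training-time binning agree on the probed incomes. The same `bin_index` helper can be pointed at `AGE_EDGES` to check the age groups.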


## Dependency Handling

Please ensure that the dependency handling file is named `requirements.txt`.

In [16]:
%%writefile tourism_project/deployment/requirements.txt
streamlit==1.28.1
pandas==2.0.3
numpy==1.24.3
scikit-learn==1.3.0
xgboost==1.7.6
joblib==1.3.2
huggingface-hub==0.16.4
datasets==2.14.4

Overwriting tourism_project/deployment/requirements.txt


# Hosting

In [18]:
%%writefile tourism_project/deployment/deploy_to_hf.py
#!/usr/bin/env python3
"""
Deployment Script for Tourism Package Prediction Project
"""

from huggingface_hub import HfApi, login
import os

#login(token=HF_TOKEN)

def deploy_to_huggingface_space():
    """Deploy application to HuggingFace Spaces"""
    print("Deploying to HuggingFace Spaces...")
    
    try:
        api = HfApi()
        space_id = "abhishek-kumar/tourism_project"
        
        files_to_upload = [
            ("app.py", "app.py"),
            ("requirements.txt", "requirements.txt"),
            ("Dockerfile", "Dockerfile")
        ]
        
        print(f"Uploading files to space: {space_id}")
        
        for local_path, repo_path in files_to_upload:
            if os.path.exists(local_path):
                print(f"Uploading {local_path}...")
                api.upload_file(
                    path_or_fileobj=local_path,
                    path_in_repo=repo_path,
                    repo_id=space_id,
                    repo_type="space",
                    token=os.getenv('HF_TOKEN')
                )
                print(f"{local_path} uploaded successfully")
        
        print(f"\nDeployment completed!")
        print(f"App URL: https://huggingface.co/spaces/{space_id}")
        return True
        
    except Exception as e:
        print(f"Deployment error: {e}")
        return False

if __name__ == "__main__":
    deploy_to_huggingface_space()


Overwriting tourism_project/deployment/deploy_to_hf.py
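
Note that the script above resolves `app.py`, `requirements.txt`, and `Dockerfile` relative to the current working directory and silently skips any file that does not exist, so it can print "Deployment completed!" without uploading anything. A pre-flight check along the following lines could catch that. This is only a sketch; `missing_files` and `REQUIRED_FILES` are hypothetical names, and the demonstration runs against a temporary directory rather than the real deployment folder.

```python
from pathlib import Path
import tempfile

REQUIRED_FILES = ("app.py", "requirements.txt", "Dockerfile")

def missing_files(deploy_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from deploy_dir."""
    base = Path(deploy_dir)
    return [name for name in required if not (base / name).is_file()]

# Demonstration in a throwaway directory that only contains app.py.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "app.py").write_text("print('placeholder')\n")
    absent = missing_files(tmp)

print(absent)
```

In the deployment script, a non-empty result from `missing_files("tourism_project/deployment")` could be turned into an early failure before any upload is attempted.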


# MLOps Pipeline with GitHub Actions Workflow

**Note:**

1. Before running the file below, make sure to add the HF_TOKEN to your GitHub secrets to enable authentication between GitHub and Hugging Face.
2. The below code is for a sample YAML file that can be updated as required to meet the requirements of this project.

```
name: Tourism Project Pipeline

on:
  push:
    branches:
      - main  # Automatically triggers on push to the main branch

jobs:

  register-dataset:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: <add_code_here>
      - name: Upload Dataset to Hugging Face Hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: <add_code_here>

  data-prep:
    needs: register-dataset
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: <add_code_here>
      - name: Run Data Preparation
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: <add_code_here>


  model-training:
    needs: data-prep
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: <add_code_here>
      - name: Start MLflow Server
        run: |
          nohup mlflow ui --host 0.0.0.0 --port 5000 &  # Run MLflow UI in the background
          sleep 5  # Wait a moment to let the server start
      - name: Model Building
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: <add_code_here>


  deploy-hosting:
    runs-on: ubuntu-latest
    needs: [model-training, data-prep, register-dataset]
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: <add_code_here>
      - name: Push files to Frontend Hugging Face Space
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: <add_code_here>

```

**Note:** To use this YAML file for our use case, we need to

1. Go to the GitHub repository for the project
2. Create a folder named ***.github/workflows/***
3. In the above folder, create a file named ***pipeline.yml***
4. Copy and paste the above content for the YAML file into the ***pipeline.yml*** file
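
The folder and file from steps 2 to 4 can also be created from a notebook cell once the repository has been cloned. The following is a minimal sketch under that assumption; `write_workflow` is a hypothetical helper, and the demonstration uses a temporary directory in place of the cloned repository and a one-line stand-in for the YAML content.

```python
from pathlib import Path
import tempfile

def write_workflow(repo_root, yaml_text, filename="pipeline.yml"):
    """Create .github/workflows/<filename> under repo_root and return its path."""
    workflow_dir = Path(repo_root) / ".github" / "workflows"
    workflow_dir.mkdir(parents=True, exist_ok=True)
    target = workflow_dir / filename
    target.write_text(yaml_text)
    return target

# Demonstration against a temporary directory standing in for the cloned repo.
with tempfile.TemporaryDirectory() as repo:
    created = write_workflow(repo, "name: Tourism Project Pipeline\n")
    relative = created.relative_to(repo).as_posix()
    content = created.read_text()

print(relative)
```

GitHub only picks up workflows stored under `.github/workflows/` at the repository root, so the helper should be called with the root of the cloned repository, not with the `tourism_project` subfolder.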

## Requirements file for the GitHub Actions Workflow

## GitHub Authentication and Push Files

* Before moving forward, we need to generate a secret token to push files directly from Colab to the GitHub repository.
* Please follow the below instructions to create the GitHub token:
    - Open your GitHub profile.
    - Click on ***Settings***.
    - Go to ***Developer Settings***.
    - Expand the ***Personal access tokens*** section and select ***Tokens (classic)***.
    - Click ***Generate new token***, then choose ***Generate new token (classic)***.
    - Add a note and select all required scopes.
    - Click ***Generate token***.
    - Copy the generated token and store it safely in a notepad.

In [None]:
# Install Git (-y answers the confirmation prompt so the cell does not hang)
!apt-get install -y git

# Set your Git identity (replace with your details)
!git config --global user.email "<-------GitHub Email Address------->"
!git config --global user.name "<--------GitHub UserName--------->"

# Clone your GitHub repository
!git clone https://github.com/<--------GitHub UserName--------->/<--------GitHub Reponame--------->.git

# Move your folder to the repository directory
!mv /content/tourism_project/ /content/<--------GitHub Reponame--------->

In [None]:
# Change directory to the cloned repository
%cd <--------GitHub Reponame--------->/

# Add the new folder to Git
!git add .

# Commit the changes
!git commit -m "first commit"

# Push to GitHub (you'll need your GitHub credentials; use a personal access token if 2FA enabled)
!git push https://<--------GitHub UserName--------->:<--------GitHub Token--------->@github.com/<--------GitHub UserName--------->/<--------GitHub Reponame--------->.git

# Output Evaluation

- GitHub (link to repository, screenshot of folder structure and executed workflow)

- Streamlit on Hugging Face (link to HF space, screenshot of Streamlit app)

<font size=6 color="navyblue">Power Ahead!</font>
___