<a href="https://colab.research.google.com/github/dashbinayak92-prog/tourism_package_prediction_mlops/blob/main/Tourism_Package_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
import os
import shutil

# Assumes the uploaded file is named 'tourism.csv'.
# If your upload has a different name (for example 'my_tourism_data.csv'), change source_path below.

source_path = 'tourism.csv'
destination_directory = 'tourism_project/data/'
destination_path = os.path.join(destination_directory, os.path.basename(source_path))

# Create the destination directory first; the move fails if it does not exist
os.makedirs(destination_directory, exist_ok=True)

if os.path.exists(source_path):
    shutil.move(source_path, destination_path)
    print(f"'{source_path}' moved to '{destination_path}' successfully.")
else:
    print(f"Error: '{source_path}' not found in the current directory. Please ensure it was uploaded correctly.")

'tourism.csv' moved to 'tourism_project/data/tourism.csv' successfully.


In [10]:
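from datasets import load_dataset

# Load the raw CSV as a 'train' split and push it to the Hugging Face Hub.
# Pushing requires an authenticated session (for example via `huggingface-cli login`).
dataset = load_dataset('csv', data_files={'train': 'tourism_project/data/tourism.csv'})
dataset.push_to_hub("dash-binayak92/tourism-propensity-dataset")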

CommitInfo(commit_url='https://huggingface.co/datasets/dash-binayak92/tourism-propensity-dataset/commit/703241ef63ef6a5153b246652a50726d4918433b', commit_message='Upload dataset', commit_description='', oid='703241ef63ef6a5153b246652a50726d4918433b', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/dash-binayak92/tourism-propensity-dataset', endpoint='https://huggingface.co', repo_type='dataset', repo_id='dash-binayak92/tourism-propensity-dataset'), pr_revision=None, pr_num=None)

## Load Dataset from Hugging Face


In [11]:
from datasets import load_dataset
dataset = load_dataset('dash-binayak92/tourism-propensity-dataset')
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'CustomerID', 'ProdTaken', 'Age', 'TypeofContact', 'CityTier', 'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisiting', 'Designation', 'MonthlyIncome'],
        num_rows: 4128
    })
})


## Perform Data Cleaning and Remove Unnecessary Columns




In [12]:
import pandas as pd

# Convert the 'train' split of the loaded dataset to a Pandas DataFrame
df = dataset['train'].to_pandas()

# Display the first few rows of the DataFrame
print("First 5 rows of the DataFrame:")
print(df.head())

# Display information about the DataFrame, including column names, non-null counts, and data types
print("\nDataFrame Info:")
df.info()

First 5 rows of the DataFrame:
   Unnamed: 0  CustomerID  ProdTaken   Age    TypeofContact  CityTier  \
0           0      200000          1  41.0     Self Enquiry         3   
1           1      200001          0  49.0  Company Invited         1   
2           2      200002          1  37.0     Self Enquiry         1   
3           3      200003          0  33.0  Company Invited         1   
4           5      200005          0  32.0  Company Invited         1   

   DurationOfPitch   Occupation  Gender  NumberOfPersonVisiting  ...  \
0              6.0     Salaried  Female                       3  ...   
1             14.0     Salaried    Male                       3  ...   
2              8.0  Free Lancer    Male                       3  ...   
3              9.0     Salaried  Female                       2  ...   
4              8.0     Salaried    Male                       3  ...   

   ProductPitched PreferredPropertyStar  MaritalStatus NumberOfTrips  \
0          Deluxe        

In [13]:
print("Original DataFrame shape:", df.shape)

# Remove 'Unnamed: 0' and 'CustomerID' columns
df_cleaned = df.drop(columns=['Unnamed: 0', 'CustomerID'])
print("DataFrame shape after dropping 'Unnamed: 0' and 'CustomerID':", df_cleaned.shape)

# Check for missing values in the cleaned DataFrame
print("\nMissing values before handling:\n", df_cleaned.isnull().sum())

# Drop rows with any missing values
df_cleaned.dropna(inplace=True)
print("\nDataFrame shape after dropping rows with missing values:", df_cleaned.shape)

# Verify no missing values remain
print("\nMissing values after handling:\n", df_cleaned.isnull().sum())

# Display the info of the fully cleaned DataFrame
print("\nCleaned DataFrame Info:")
df_cleaned.info()

Original DataFrame shape: (4128, 21)
DataFrame shape after dropping 'Unnamed: 0' and 'CustomerID': (4128, 19)

Missing values before handling:
 ProdTaken                   0
Age                         0
TypeofContact               0
CityTier                    0
DurationOfPitch             0
Occupation                  0
Gender                      0
NumberOfPersonVisiting      0
NumberOfFollowups           0
ProductPitched              0
PreferredPropertyStar       0
MaritalStatus               0
NumberOfTrips               0
Passport                    0
PitchSatisfactionScore      0
OwnCar                      0
NumberOfChildrenVisiting    0
Designation                 0
MonthlyIncome               0
dtype: int64

DataFrame shape after dropping rows with missing values: (4128, 19)

Missing values after handling:
 ProdTaken                   0
Age                         0
TypeofContact               0
CityTier                    0
DurationOfPitch             0
Occupation           
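
The cleaning steps above can also be wrapped in a small reusable function. The following is a minimal sketch rather than the notebook's own code: the `clean_frame` helper and the toy DataFrame are hypothetical. It drops identifier columns only when they are present and returns a new DataFrame instead of modifying one in place.

```python
import pandas as pd


def clean_frame(df: pd.DataFrame, id_cols: list[str]) -> pd.DataFrame:
    """Drop identifier columns (if present) and rows with missing values."""
    # errors="ignore" keeps the call safe when a column was already removed
    cleaned = df.drop(columns=id_cols, errors="ignore")
    # dropna returns a new DataFrame, so the input is left untouched
    return cleaned.dropna().reset_index(drop=True)


# Toy data for illustration only (not the tourism dataset)
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Age": [41.0, None, 37.0],
    "ProdTaken": [1, 0, 1],
})
toy_cleaned = clean_frame(toy, id_cols=["Unnamed: 0", "CustomerID"])
print(toy_cleaned.shape)
```

In the notebook's data no values are missing, so the `dropna` step leaves the row count at 4128; the guard matters only if a later version of the CSV contains gaps.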

## Split Dataset into Training and Testing Sets




In [14]:
from sklearn.model_selection import train_test_split
import os

# Split the cleaned DataFrame into training and testing sets
train_df, test_df = train_test_split(df_cleaned, test_size=0.2, random_state=42)

# Define the directory for saving the processed data
processed_data_dir = 'tourism_project/data/processed_data'
os.makedirs(processed_data_dir, exist_ok=True)

# Save the training DataFrame to a CSV file
train_data_path = os.path.join(processed_data_dir, 'train_data.csv')
train_df.to_csv(train_data_path, index=False)
print(f"Training data saved to '{train_data_path}'")

# Save the testing DataFrame to a CSV file
test_data_path = os.path.join(processed_data_dir, 'test_data.csv')
test_df.to_csv(test_data_path, index=False)
print(f"Testing data saved to '{test_data_path}'")

print("\nShape of training data:", train_df.shape)
print("Shape of testing data:", test_df.shape)

Training data saved to 'tourism_project/data/processed_data/train_data.csv'
Testing data saved to 'tourism_project/data/processed_data/test_data.csv'

Shape of training data: (3302, 19)
Shape of testing data: (826, 19)
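
The classification reports later in this notebook show 165 positive rows among the 826 test rows, so the target is imbalanced. One option, which this notebook does not use, is to pass `stratify=` to `train_test_split` so that both splits keep the same class ratio. The sketch below uses synthetic data (the `toy` frame and its 20% positive rate are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with a 20% positive rate (illustrative only)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "ProdTaken": np.array([1] * 200 + [0] * 800),
})

# stratify= makes the split preserve the class proportions of the given column
train_part, test_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["ProdTaken"]
)

# Both splits keep the same share of positives
print(train_part["ProdTaken"].mean(), test_part["ProdTaken"].mean())
```

Switching the notebook's split to a stratified one would change which rows land in each split, so the metrics reported further down would no longer match exactly.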


## Upload Processed Datasets to Hugging Face



In [15]:
from datasets import load_dataset
from huggingface_hub import HfApi

# Define paths to the processed datasets
train_csv_path = 'tourism_project/data/processed_data/train_data.csv'
test_csv_path = 'tourism_project/data/processed_data/test_data.csv'

# Load the datasets into a DatasetDict with 'train' and 'test' splits
processed_dataset = load_dataset('csv', data_files={'train': train_csv_path, 'test': test_csv_path})

# Push the combined dataset to Hugging Face Hub
# Replace 'dash-binayak92' with your Hugging Face username
# and 'tourism-propensity-processed' with your desired dataset name
repo_id = "dash-binayak92/tourism-propensity-processed"
processed_dataset.push_to_hub(repo_id)

print(f"Processed dataset uploaded to Hugging Face Hub: {repo_id}")
print(processed_dataset)

Processed dataset uploaded to Hugging Face Hub: dash-binayak92/tourism-propensity-processed
DatasetDict({
    train: Dataset({
        features: ['ProdTaken', 'Age', 'TypeofContact', 'CityTier', 'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisiting', 'Designation', 'MonthlyIncome'],
        num_rows: 3302
    })
    test: Dataset({
        features: ['ProdTaken', 'Age', 'TypeofContact', 'CityTier', 'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisiting', 'Designation', 'MonthlyIncome'],
        num_rows: 826
    })
})


## Verify Data Preparation

Confirm that all data preparation steps are complete and that the processed datasets are available on Hugging Face.


## Load Processed Datasets




In [16]:
from datasets import load_dataset
import pandas as pd

# Load the processed dataset from Hugging Face Hub
processed_dataset_hub = load_dataset('dash-binayak92/tourism-propensity-processed')

# Convert the 'train' split to a Pandas DataFrame
train_df_loaded = processed_dataset_hub['train'].to_pandas()

# Convert the 'test' split to a Pandas DataFrame
test_df_loaded = processed_dataset_hub['test'].to_pandas()

# Print the first 5 rows and shape of train_df_loaded
print("\nFirst 5 rows of train_df_loaded:")
print(train_df_loaded.head())
print("Shape of train_df_loaded:", train_df_loaded.shape)

# Print the first 5 rows and shape of test_df_loaded
print("\nFirst 5 rows of test_df_loaded:")
print(test_df_loaded.head())
print("Shape of test_df_loaded:", test_df_loaded.shape)


First 5 rows of train_df_loaded:
   ProdTaken   Age    TypeofContact  CityTier  DurationOfPitch  \
0          1  26.0  Company Invited         2             23.0   
1          0  42.0     Self Enquiry         1              6.0   
2          0  56.0  Company Invited         1              9.0   
3          1  27.0  Company Invited         3             36.0   
4          0  37.0     Self Enquiry         1              9.0   

       Occupation  Gender  NumberOfPersonVisiting  NumberOfFollowups  \
0        Salaried  Female                       2                3.0   
1        Salaried  Female                       2                4.0   
2        Salaried    Male                       4                4.0   
3  Small Business    Male                       4                6.0   
4        Salaried  Female                       4                4.0   

  ProductPitched  PreferredPropertyStar MaritalStatus  NumberOfTrips  \
0          Basic                    3.0       Married           

## Prepare Data for Modeling




In [17]:
TARGET_COLUMN = 'ProdTaken'

# Separate features (X) and target (y) for the training set
X_train = train_df_loaded.drop(columns=[TARGET_COLUMN])
y_train = train_df_loaded[TARGET_COLUMN]

# Separate features (X) and target (y) for the testing set
X_test = test_df_loaded.drop(columns=[TARGET_COLUMN])
y_test = test_df_loaded[TARGET_COLUMN]

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (3302, 18)
Shape of y_train: (3302,)
Shape of X_test: (826, 18)
Shape of y_test: (826,)


In [18]:
import numpy as np

# Identify categorical columns in X_train
categorical_cols = X_train.select_dtypes(include=['object']).columns
print(f"Categorical columns identified: {list(categorical_cols)}")

# Apply one-hot encoding to categorical columns in X_train
X_train_encoded = pd.get_dummies(X_train, columns=categorical_cols, drop_first=True)

# Apply one-hot encoding to categorical columns in X_test
X_test_encoded = pd.get_dummies(X_test, columns=categorical_cols, drop_first=True)

# Align columns so that both datasets have identical columns after one-hot encoding.
# Reindexing X_test_encoded to the columns of X_train_encoded adds any column missing from the test set
# (filled with 0) and drops any column that appears only in the test set.
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

# Update X_train and X_test to their encoded versions
X_train = X_train_encoded
X_test = X_test_encoded

print("\nShapes after one-hot encoding and column alignment:")
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_test: {y_test.shape}")

print("\nFirst 5 rows of X_train after encoding:")
print(X_train.head())

Categorical columns identified: ['TypeofContact', 'Occupation', 'Gender', 'ProductPitched', 'MaritalStatus', 'Designation']

Shapes after one-hot encoding and column alignment:
Shape of X_train: (3302, 29)
Shape of y_train: (3302,)
Shape of X_test: (826, 29)
Shape of y_test: (826,)

First 5 rows of X_train after encoding:
    Age  CityTier  DurationOfPitch  NumberOfPersonVisiting  NumberOfFollowups  \
0  26.0         2             23.0                       2                3.0   
1  42.0         1              6.0                       2                4.0   
2  56.0         1              9.0                       4                4.0   
3  27.0         3             36.0                       4                6.0   
4  37.0         1              9.0                       4                4.0   

   PreferredPropertyStar  NumberOfTrips  Passport  PitchSatisfactionScore  \
0                    3.0            1.0         1                       5   
1                    3.0           

## Define Models and Hyperparameter Grids




In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
import xgboost as xgb

print("Necessary classifier modules imported.")

Necessary classifier modules imported.


In [20]:
param_grid_dt = {
    'max_depth': [None, 5, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'min_samples_split': [2, 5, 10]
}

param_grid_bagging = {
    'n_estimators': [10, 50, 100],
    'max_samples': [0.5, 0.7, 1.0],
    'max_features': [0.5, 0.7, 1.0]
}

param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
    'min_samples_split': [2, 5]
}

param_grid_adaboost = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1.0]
}

param_grid_gb = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'subsample': [0.6, 0.8, 1.0]
}

print("Hyperparameter grids defined for all models.")

Hyperparameter grids defined for all models.
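
The size of each grid determines how long the search takes, since a full grid search fits one model per parameter combination per cross-validation fold. The sketch below counts the combinations for two of the grids defined above; `count_candidates` is a hypothetical helper, and `cv_folds = 3` reflects the `cv=3` setting used in the tuning cells that follow.

```python
from math import prod

# Two of the grids from the cell above, copied verbatim
param_grid_dt = {
    'max_depth': [None, 5, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'min_samples_split': [2, 5, 10]
}
param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'subsample': [0.6, 0.8, 1.0]
}


def count_candidates(param_grid: dict) -> int:
    """Number of parameter combinations a full grid search will try."""
    return prod(len(values) for values in param_grid.values())


cv_folds = 3
for name, grid in [("Decision Tree", param_grid_dt), ("XGBoost", param_grid_xgb)]:
    candidates = count_candidates(grid)
    print(f"{name}: {candidates} candidates, {candidates * cv_folds} fits")
```

The counts (36 and 243 candidates) agree with the "Fitting 3 folds for each of ... candidates" lines printed by the tuning cells, which is why the XGBoost search is by far the most expensive.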


## Train, Tune, and Evaluate Models with Experiment Tracking




In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, f1_score, precision_score, recall_score

# List to store results of each model
model_results = []

print("GridSearchCV and evaluation metrics imported. Model results list initialized.")

GridSearchCV and evaluation metrics imported. Model results list initialized.


In [22]:
print("\n--- Tuning Decision Tree Classifier ---")

# Instantiate Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Create GridSearchCV object for Decision Tree
grid_search_dt = GridSearchCV(estimator=dt_classifier, param_grid=param_grid_dt, cv=3, scoring='f1', n_jobs=-1, verbose=1)

# Fit GridSearchCV to the training data
grid_search_dt.fit(X_train, y_train)

# Get the best estimator
best_dt_model = grid_search_dt.best_estimator_

# Predict on the X_test data
y_pred_dt = best_dt_model.predict(X_test)

# Calculate evaluation metrics
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
recall_dt = recall_score(y_test, y_pred_dt)
f1_dt = f1_score(y_test, y_pred_dt)

# Print classification report
print("\nClassification Report for Decision Tree Classifier (Best Model):\n")
print(classification_report(y_test, y_pred_dt))

# Store the results
model_results.append({
    'model_name': 'Decision Tree Classifier',
    'best_params': grid_search_dt.best_params_,
    'test_accuracy': accuracy_dt,
    'test_precision': precision_dt,
    'test_recall': recall_dt,
    'test_f1_score': f1_dt
})

print("Decision Tree Classifier tuning and evaluation complete.")


--- Tuning Decision Tree Classifier ---
Fitting 3 folds for each of 36 candidates, totalling 108 fits

Classification Report for Decision Tree Classifier (Best Model):

              precision    recall  f1-score   support

           0       0.94      0.94      0.94       661
           1       0.75      0.75      0.75       165

    accuracy                           0.90       826
   macro avg       0.84      0.84      0.84       826
weighted avg       0.90      0.90      0.90       826

Decision Tree Classifier tuning and evaluation complete.


In [24]:
print("--- Tuning Bagging Classifier ---")

# Instantiate Bagging Classifier
bagging_classifier = BaggingClassifier(random_state=42)

# Create GridSearchCV object for Bagging
grid_search_bagging = GridSearchCV(estimator=bagging_classifier, param_grid=param_grid_bagging, cv=3, scoring='f1', n_jobs=-1, verbose=1)

# Fit GridSearchCV to the training data
grid_search_bagging.fit(X_train, y_train)

# Get the best estimator
best_bagging_model = grid_search_bagging.best_estimator_

# Predict on the X_test data
y_pred_bagging = best_bagging_model.predict(X_test)

# Calculate evaluation metrics
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
precision_bagging = precision_score(y_test, y_pred_bagging)
recall_bagging = recall_score(y_test, y_pred_bagging)
f1_bagging = f1_score(y_test, y_pred_bagging)

# Print classification report
print("\nClassification Report for Bagging Classifier (Best Model):\n")
print(classification_report(y_test, y_pred_bagging))

# Store the results
model_results.append({
    'model_name': 'Bagging Classifier',
    'best_params': grid_search_bagging.best_params_,
    'test_accuracy': accuracy_bagging,
    'test_precision': precision_bagging,
    'test_recall': recall_bagging,
    'test_f1_score': f1_bagging
})

print("Bagging Classifier tuning and evaluation complete.")

--- Tuning Bagging Classifier ---
Fitting 3 folds for each of 27 candidates, totalling 81 fits

Classification Report for Bagging Classifier (Best Model):

              precision    recall  f1-score   support

           0       0.94      0.99      0.96       661
           1       0.94      0.76      0.84       165

    accuracy                           0.94       826
   macro avg       0.94      0.87      0.90       826
weighted avg       0.94      0.94      0.94       826

Bagging Classifier tuning and evaluation complete.


In [25]:
print("--- Tuning Random Forest Classifier ---")

# Instantiate Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Create GridSearchCV object for Random Forest
grid_search_rf = GridSearchCV(estimator=rf_classifier, param_grid=param_grid_rf, cv=3, scoring='f1', n_jobs=-1, verbose=1)

# Fit GridSearchCV to the training data
grid_search_rf.fit(X_train, y_train)

# Get the best estimator
best_rf_model = grid_search_rf.best_estimator_

# Predict on the X_test data
y_pred_rf = best_rf_model.predict(X_test)

# Calculate evaluation metrics
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)

# Print classification report
print("\nClassification Report for Random Forest Classifier (Best Model):\n")
print(classification_report(y_test, y_pred_rf))

# Store the results
model_results.append({
    'model_name': 'Random Forest Classifier',
    'best_params': grid_search_rf.best_params_,
    'test_accuracy': accuracy_rf,
    'test_precision': precision_rf,
    'test_recall': recall_rf,
    'test_f1_score': f1_rf
})

print("Random Forest Classifier tuning and evaluation complete.")

--- Tuning Random Forest Classifier ---
Fitting 3 folds for each of 36 candidates, totalling 108 fits

Classification Report for Random Forest Classifier (Best Model):

              precision    recall  f1-score   support

           0       0.93      1.00      0.96       661
           1       0.97      0.69      0.81       165

    accuracy                           0.93       826
   macro avg       0.95      0.84      0.88       826
weighted avg       0.94      0.93      0.93       826

Random Forest Classifier tuning and evaluation complete.


In [26]:
print("--- Tuning AdaBoost Classifier ---")

# Instantiate AdaBoost Classifier
adaboost_classifier = AdaBoostClassifier(random_state=42)

# Create GridSearchCV object for AdaBoost
grid_search_adaboost = GridSearchCV(estimator=adaboost_classifier, param_grid=param_grid_adaboost, cv=3, scoring='f1', n_jobs=-1, verbose=1)

# Fit GridSearchCV to the training data
grid_search_adaboost.fit(X_train, y_train)

# Get the best estimator
best_adaboost_model = grid_search_adaboost.best_estimator_

# Predict on the X_test data
y_pred_adaboost = best_adaboost_model.predict(X_test)

# Calculate evaluation metrics
accuracy_adaboost = accuracy_score(y_test, y_pred_adaboost)
precision_adaboost = precision_score(y_test, y_pred_adaboost)
recall_adaboost = recall_score(y_test, y_pred_adaboost)
f1_adaboost = f1_score(y_test, y_pred_adaboost)

# Print classification report
print("\nClassification Report for AdaBoost Classifier (Best Model):\n")
print(classification_report(y_test, y_pred_adaboost))

# Store the results
model_results.append({
    'model_name': 'AdaBoost Classifier',
    'best_params': grid_search_adaboost.best_params_,
    'test_accuracy': accuracy_adaboost,
    'test_precision': precision_adaboost,
    'test_recall': recall_adaboost,
    'test_f1_score': f1_adaboost
})

print("AdaBoost Classifier tuning and evaluation complete.")

--- Tuning AdaBoost Classifier ---
Fitting 3 folds for each of 9 candidates, totalling 27 fits

Classification Report for AdaBoost Classifier (Best Model):

              precision    recall  f1-score   support

           0       0.84      0.97      0.90       661
           1       0.69      0.28      0.40       165

    accuracy                           0.83       826
   macro avg       0.77      0.63      0.65       826
weighted avg       0.81      0.83      0.80       826

AdaBoost Classifier tuning and evaluation complete.


In [27]:
print("--- Tuning Gradient Boosting Classifier ---")

# Instantiate Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier(random_state=42)

# Create GridSearchCV object for Gradient Boosting
grid_search_gb = GridSearchCV(estimator=gb_classifier, param_grid=param_grid_gb, cv=3, scoring='f1', n_jobs=-1, verbose=1)

# Fit GridSearchCV to the training data
grid_search_gb.fit(X_train, y_train)

# Get the best estimator
best_gb_model = grid_search_gb.best_estimator_

# Predict on the X_test data
y_pred_gb = best_gb_model.predict(X_test)

# Calculate evaluation metrics
accuracy_gb = accuracy_score(y_test, y_pred_gb)
precision_gb = precision_score(y_test, y_pred_gb)
recall_gb = recall_score(y_test, y_pred_gb)
f1_gb = f1_score(y_test, y_pred_gb)

# Print classification report
print("\nClassification Report for Gradient Boosting Classifier (Best Model):\n")
print(classification_report(y_test, y_pred_gb))

# Store the results
model_results.append({
    'model_name': 'Gradient Boosting Classifier',
    'best_params': grid_search_gb.best_params_,
    'test_accuracy': accuracy_gb,
    'test_precision': precision_gb,
    'test_recall': recall_gb,
    'test_f1_score': f1_gb
})

print("Gradient Boosting Classifier tuning and evaluation complete.")

--- Tuning Gradient Boosting Classifier ---
Fitting 3 folds for each of 27 candidates, totalling 81 fits

Classification Report for Gradient Boosting Classifier (Best Model):

              precision    recall  f1-score   support

           0       0.95      0.99      0.97       661
           1       0.94      0.80      0.86       165

    accuracy                           0.95       826
   macro avg       0.94      0.89      0.92       826
weighted avg       0.95      0.95      0.95       826

Gradient Boosting Classifier tuning and evaluation complete.
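
To see how the figures in the report relate to each other, the positive-class metrics can be recomputed from confusion-matrix counts. The counts below are not printed by the notebook; they are hypothetical values chosen to be consistent with the report above (165 positives, 661 negatives).

```python
# Hypothetical confusion-matrix counts for the positive class (ProdTaken == 1),
# chosen to be consistent with the report above; the notebook does not print them
tp, fp, fn, tn = 132, 9, 33, 652

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Accuracy is dominated by the majority class here, which is one reason the grid searches in this notebook optimise F1 on the positive class instead.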


In [28]:
print("--- Tuning XGBoost Classifier ---")

# Instantiate XGBoost Classifier
xgb_classifier = xgb.XGBClassifier(random_state=42, eval_metric='logloss')

# Create GridSearchCV object for XGBoost
grid_search_xgb = GridSearchCV(estimator=xgb_classifier, param_grid=param_grid_xgb, cv=3, scoring='f1', n_jobs=-1, verbose=1)

# Fit GridSearchCV to the training data
grid_search_xgb.fit(X_train, y_train)

# Get the best estimator
best_xgb_model = grid_search_xgb.best_estimator_

# Predict on the X_test data
y_pred_xgb = best_xgb_model.predict(X_test)

# Calculate evaluation metrics
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
precision_xgb = precision_score(y_test, y_pred_xgb)
recall_xgb = recall_score(y_test, y_pred_xgb)
f1_xgb = f1_score(y_test, y_pred_xgb)

# Print classification report
print("\nClassification Report for XGBoost Classifier (Best Model):\n")
print(classification_report(y_test, y_pred_xgb))

# Store the results
model_results.append({
    'model_name': 'XGBoost Classifier',
    'best_params': grid_search_xgb.best_params_,
    'test_accuracy': accuracy_xgb,
    'test_precision': precision_xgb,
    'test_recall': recall_xgb,
    'test_f1_score': f1_xgb
})

print("XGBoost Classifier tuning and evaluation complete.")

--- Tuning XGBoost Classifier ---
Fitting 3 folds for each of 243 candidates, totalling 729 fits

Classification Report for XGBoost Classifier (Best Model):

              precision    recall  f1-score   support

           0       0.95      0.99      0.97       661
           1       0.94      0.80      0.86       165

    accuracy                           0.95       826
   macro avg       0.94      0.89      0.92       826
weighted avg       0.95      0.95      0.95       826

XGBoost Classifier tuning and evaluation complete.
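
The tuning cells above all follow the same pattern: build a `GridSearchCV`, fit it, predict on the test set, and record the metrics. A minimal sketch of how that pattern could be factored into one function is shown below. `tune_and_evaluate` is a hypothetical helper, not part of the notebook, and the synthetic data exists only so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier


def tune_and_evaluate(name, estimator, param_grid, X_tr, y_tr, X_te, y_te):
    """Grid-search an estimator on the training data and score the best model on the test data."""
    search = GridSearchCV(estimator, param_grid, cv=3, scoring="f1")
    search.fit(X_tr, y_tr)
    y_pred = search.best_estimator_.predict(X_te)
    result = {
        "model_name": name,
        "best_params": search.best_params_,
        "test_accuracy": accuracy_score(y_te, y_pred),
        "test_precision": precision_score(y_te, y_pred),
        "test_recall": recall_score(y_te, y_pred),
        "test_f1_score": f1_score(y_te, y_pred),
    }
    return search, result


# Synthetic data so the sketch is self-contained (not the tourism dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

search, result = tune_and_evaluate(
    "Decision Tree Classifier",
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5], "min_samples_leaf": [1, 5]},
    X_tr, y_tr, X_te, y_te,
)
print(result["model_name"], result["best_params"])
```

Calling such a helper once per model also avoids appending the same model to `model_results` twice when a cell is re-run.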


In [29]:
import pandas as pd

# Convert the list of results to a DataFrame for better readability
results_df = pd.DataFrame(model_results)

# Print the results DataFrame
print("\n--- Model Performance Comparison ---")
print(results_df.sort_values(by='test_f1_score', ascending=False))

print("\nAll models tuned and evaluated. Results are summarized above.")


--- Model Performance Comparison ---
                     model_name  \
4  Gradient Boosting Classifier   
5            XGBoost Classifier   
1            Bagging Classifier   
2      Random Forest Classifier   
0      Decision Tree Classifier   
3           AdaBoost Classifier   

                                         best_params  test_accuracy  \
4  {'learning_rate': 0.2, 'max_depth': 7, 'n_esti...       0.949153   
5  {'colsample_bytree': 1.0, 'learning_rate': 0.2...       0.949153   
1  {'max_features': 1.0, 'max_samples': 1.0, 'n_e...       0.941889   
2  {'max_depth': 20, 'min_samples_leaf': 1, 'min_...       0.934625   
0  {'max_depth': None, 'min_samples_leaf': 1, 'mi...       0.899516   
3        {'learning_rate': 1.0, 'n_estimators': 200}       0.831719   

   test_precision  test_recall  test_f1_score  
4        0.936170     0.800000       0.862745  
5        0.936170     0.800000       0.862745  
1        0.939850     0.757576       0.838926  
2        0.974359     0.69
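
Gradient Boosting and XGBoost tie on test F1 in the table above. When a later step picks the winner with `idxmax`, pandas returns the first row among tied maxima, so the choice depends on row order. The sketch below reproduces this with values copied from the table (top three rows only); the `toy_results` frame is a stand-in for `results_df`.

```python
import pandas as pd

# Values copied from the comparison table above (top three rows only)
toy_results = pd.DataFrame({
    "model_name": ["Bagging Classifier", "Gradient Boosting Classifier", "XGBoost Classifier"],
    "test_accuracy": [0.941889, 0.949153, 0.949153],
    "test_f1_score": [0.838926, 0.862745, 0.862745],
})

# idxmax returns the first row label among tied maxima
first_best = toy_results.loc[toy_results["test_f1_score"].idxmax(), "model_name"]

# Listing every model that shares the top score makes the tie explicit
top_score = toy_results["test_f1_score"].max()
tied = toy_results.loc[toy_results["test_f1_score"] == top_score, "model_name"].tolist()

print(first_best)
print(tied)
```

If the tie matters, a secondary criterion such as training cost or model size could be used to break it; that decision is outside what the notebook specifies.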

## Select and Register Best Model



In [35]:
import joblib
from huggingface_hub import HfApi
import os

# Map model names to their GridSearchCV objects for retrieving best estimators
# Ensure these GridSearchCV objects are available from previous steps
grid_search_objects = {
    'Decision Tree Classifier': grid_search_dt,
    'Bagging Classifier': grid_search_bagging,
    'Random Forest Classifier': grid_search_rf,
    'AdaBoost Classifier': grid_search_adaboost,
    'Gradient Boosting Classifier': grid_search_gb,
    'XGBoost Classifier': grid_search_xgb
}

# 1. Identify the model with the highest 'test_f1_score'
best_model_row = results_df.loc[results_df['test_f1_score'].idxmax()]
best_model_name = best_model_row['model_name']
best_params = best_model_row['best_params']

print(f"Best performing model: {best_model_name}")
print(f"Best parameters: {best_params}")

# 2. Retrieve the actual best estimator object
best_model_estimator = grid_search_objects[best_model_name].best_estimator_

# 4. Define a path to save the best model locally
model_save_path = 'tourism_propensity_best_model.joblib'

# 5. Save the best model using joblib.dump()
joblib.dump(best_model_estimator, model_save_path)
print(f"Best model saved locally to {model_save_path}")

# 7. Define a repository ID for the model on Hugging Face
# Replace 'dash-binayak92' with your Hugging Face username
repo_id_model = "dash-binayak92/tourism-propensity-best-model"

# 8. Create a README.md content string
# The YAML front matter must start on the first line of the file, so the string opens with '---'
readme_content = f"""---
tags:
- tabular-classification
- scikit-learn
---
# Best Model for Tourism Propensity Prediction

This repository contains the best-performing model for predicting tourism propensity, selected based on its F1-score on the test set.

## Model Details
- **Model Name**: {best_model_name}
- **Best Parameters**: {best_params}

## Test Set Evaluation Metrics:
- **Accuracy**: {best_model_row['test_accuracy']:.4f}
- **Precision**: {best_model_row['test_precision']:.4f}
- **Recall**: {best_model_row['test_recall']:.4f}
- **F1-Score**: {best_model_row['test_f1_score']:.4f}

## Usage (Example with Python):
```python
import joblib
from huggingface_hub import hf_hub_download

# Download the model file from the Hub (it is cached locally after the first call)
model_path = hf_hub_download(repo_id="{repo_id_model}", filename="tourism_propensity_best_model.joblib")

# Load the model
best_model = joblib.load(model_path)

# Make predictions (example X_test needs to be in the same format as training data)
# y_pred = best_model.predict(X_test)
# print(y_pred)
```
"""

# Save README.md locally
readme_path = 'README.md'
with open(readme_path, 'w') as f:
    f.write(readme_content)

# 9. Use HfApi().upload_file() to upload the saved model and README.md
api = HfApi()

# Upload the model file
api.upload_file(
    path_or_fileobj=model_save_path,
    path_in_repo=model_save_path,
    repo_id=repo_id_model,
    repo_type="model",
)

# Upload the README.md file
api.upload_file(
    path_or_fileobj=readme_path,
    path_in_repo="README.md",
    repo_id=repo_id_model,
    repo_type="model",
)

print(f"Best model and README.md uploaded to Hugging Face Model Hub: {repo_id_model}")

Best performing model: Gradient Boosting Classifier
Best parameters: {'learning_rate': 0.2, 'max_depth': 7, 'n_estimators': 200}
Best model saved locally to tourism_propensity_best_model.joblib


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...pensity_best_model.joblib: 100%|##########| 2.94MB / 2.94MB            

No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


Best model and README.md uploaded to Hugging Face Model Hub: dash-binayak92/tourism-propensity-best-model
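
Gradient Boosting and XGBoost have identical test F1-scores in the comparison table, so the choice between them comes down to how the tie is resolved. The sketch below rebuilds three rows of that table (values copied from the printed output) and shows that `idxmax()` returns the first row holding the maximum, and how to list every tied model explicitly. The small `results_df` here is a stand-in for the notebook's full DataFrame.

```python
import pandas as pd

# Three rows copied from the comparison table (index = original row labels).
results_df = pd.DataFrame(
    {
        "model_name": [
            "Bagging Classifier",
            "Gradient Boosting Classifier",
            "XGBoost Classifier",
        ],
        "test_accuracy": [0.941889, 0.949153, 0.949153],
        "test_precision": [0.939850, 0.936170, 0.936170],
        "test_recall": [0.757576, 0.800000, 0.800000],
        "test_f1_score": [0.838926, 0.862745, 0.862745],
    },
    index=[1, 4, 5],
)

# idxmax() returns the label of the FIRST row that holds the maximum,
# so the winner of a tie depends only on row order.
first_max_label = results_df["test_f1_score"].idxmax()

# Listing every model that reaches the maximum makes the tie visible.
best_f1 = results_df["test_f1_score"].max()
tied = results_df[results_df["test_f1_score"] == best_f1]

print(results_df.loc[first_max_label, "model_name"])
print(tied["model_name"].tolist())
```

When a tie shows up, a secondary criterion (for example model size or inference time) can be applied to `tied` instead of relying on row order.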


In [31]:
import joblib
from huggingface_hub import HfApi
import os

# Map model names to their GridSearchCV objects for retrieving best estimators
# Ensure these GridSearchCV objects are available from previous steps
grid_search_objects = {
    'Decision Tree Classifier': grid_search_dt,
    'Bagging Classifier': grid_search_bagging,
    'Random Forest Classifier': grid_search_rf,
    'AdaBoost Classifier': grid_search_adaboost,
    'Gradient Boosting Classifier': grid_search_gb,
    'XGBoost Classifier': grid_search_xgb
}

# 1. Identify the model with the highest 'test_f1_score'
# (if several models tie, idxmax() returns the first matching row)
best_model_row = results_df.loc[results_df['test_f1_score'].idxmax()]
best_model_name = best_model_row['model_name']
best_params = best_model_row['best_params']

print(f"Best performing model: {best_model_name}")
print(f"Best parameters: {best_params}")

# 2. Retrieve the actual best estimator object
best_model_estimator = grid_search_objects[best_model_name].best_estimator_

# 4. Define a path to save the best model locally
model_save_path = 'tourism_propensity_best_model.joblib'

# 5. Save the best model using joblib.dump()
joblib.dump(best_model_estimator, model_save_path)
print(f"Best model saved locally to {model_save_path}")

# 7. Define a repository ID for the model on Hugging Face
# Replace 'dash-binayak92' with your Hugging Face username
repo_id_model = "dash-binayak92/tourism-propensity-best-model"

# 8. Create a README.md content string
# The YAML front matter must start on the first line of the file, so the string opens with '---'
readme_content = f"""---
tags:
- tabular-classification
- scikit-learn
---
# Best Model for Tourism Propensity Prediction

This repository contains the best-performing model for predicting tourism propensity, selected based on its F1-score on the test set.

## Model Details
- **Model Name**: {best_model_name}
- **Best Parameters**: {best_params}

## Test Set Evaluation Metrics:
- **Accuracy**: {best_model_row['test_accuracy']:.4f}
- **Precision**: {best_model_row['test_precision']:.4f}
- **Recall**: {best_model_row['test_recall']:.4f}
- **F1-Score**: {best_model_row['test_f1_score']:.4f}

## Usage (Example with Python):
```python
import joblib
from huggingface_hub import hf_hub_download

# Download the model file from the Hub (it is cached locally after the first call)
model_path = hf_hub_download(repo_id="{repo_id_model}", filename="tourism_propensity_best_model.joblib")

# Load the model
best_model = joblib.load(model_path)

# Make predictions (example X_test needs to be in the same format as training data)
# y_pred = best_model.predict(X_test)
# print(y_pred)
```
"""

# Save README.md locally
readme_path = 'README.md'
with open(readme_path, 'w') as f:
    f.write(readme_content)

# 9. Use HfApi().upload_file() to upload the saved model and README.md
api = HfApi()

# First, create the repository if it doesn't exist
# exist_ok=True makes the call a no-op when the repository already exists,
# while real failures (for example an invalid token) still raise
api.create_repo(repo_id=repo_id_model, repo_type="model", exist_ok=True)

# Upload the model file
api.upload_file(
    path_or_fileobj=model_save_path,
    path_in_repo=model_save_path,
    repo_id=repo_id_model,
    repo_type="model"
)

# Upload the README.md file
api.upload_file(
    path_or_fileobj=readme_path,
    path_in_repo="README.md",
    repo_id=repo_id_model,
    repo_type="model"
)

print(f"Best model and README.md uploaded to Hugging Face Model Hub: {repo_id_model}")

Best performing model: Gradient Boosting Classifier
Best parameters: {'learning_rate': 0.2, 'max_depth': 7, 'n_estimators': 200}
Best model saved locally to tourism_propensity_best_model.joblib


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...pensity_best_model.joblib:  40%|####      | 1.18MB / 2.94MB            

Best model and README.md uploaded to Hugging Face Model Hub: dash-binayak92/tourism-propensity-best-model


## Summary:

### Data Analysis Key Findings

*   **Data Loading and Preparation**:
    *   The `train` and `test` splits of the 'dash-binayak92/tourism-propensity-processed' dataset were successfully loaded, resulting in training data of shape (3302, 19) and testing data of shape (826, 19).
    *   The target variable `ProdTaken` was separated. Six categorical features were identified and one-hot encoded, expanding the feature set from 18 to 29 columns for both training (3302, 29) and testing (826, 29) datasets.
*   **Model Performance Comparison (F1-score on Test Set)**:
    *   **Gradient Boosting Classifier** and **XGBoost Classifier** were the top performers, both achieving an F1-score of 0.8627.
    *   **Bagging Classifier** followed closely with an F1-score of 0.8389.
    *   **Random Forest Classifier** achieved an F1-score of 0.8085.
    *   **Decision Tree Classifier** showed an F1-score of 0.7477.
    *   **AdaBoost Classifier** had the lowest performance with an F1-score of 0.4034.
*   **Best Model Selection and Registration**:
    *   The **Gradient Boosting Classifier** was selected as the best-performing model. It ties with XGBoost on F1-score (0.8627); `idxmax()` returns the first row among ties, which is why Gradient Boosting was picked. Its best hyperparameters were `{'learning_rate': 0.2, 'max_depth': 7, 'n_estimators': 200}`.
    *   The best model was saved locally as `tourism_propensity_best_model.joblib` and registered to the Hugging Face Model Hub under the repository `dash-binayak92/tourism-propensity-best-model`.
    *   A `README.md` file detailing the model's test metrics (Accuracy: 0.9492, Precision: 0.9362, Recall: 0.8000, F1-Score: 0.8627, as reported in the comparison table) and usage instructions was also uploaded to the Hugging Face repository.
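
The F1-scores listed above follow directly from the precision and recall columns of the comparison table, since F1 is the harmonic mean of the two. As a quick consistency check, the sketch below recomputes the score for two rows of the table; `f1_score_from` is a hypothetical helper written for this check, not a function used in the notebook.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as printed in the comparison table.
gradient_boosting_f1 = f1_score_from(0.936170, 0.800000)
bagging_f1 = f1_score_from(0.939850, 0.757576)

print(round(gradient_boosting_f1, 4), round(bagging_f1, 4))  # 0.8627 0.8389
```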

### Insights or Next Steps

*   Gradient Boosting and XGBoost (F1 0.8627) clearly outperformed the single Decision Tree (F1 0.7477), which suggests that combining many trees helps on this classification task. AdaBoost is also an ensemble method but scored lowest (F1 0.4034), so the gain depends on the boosting method and its tuning rather than on ensembling alone.
*   The top-performing models (Gradient Boosting and XGBoost) could be tuned further with broader search strategies such as `RandomizedSearchCV` or Bayesian optimization. Other boosting algorithms are also worth trying.
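
A randomized search, as suggested in the next steps, could look roughly like the sketch below. It runs on a small synthetic dataset from `make_classification` rather than the tourism data, and the parameter ranges, `n_iter`, and `cv` values are illustrative assumptions kept small so the example finishes quickly.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the training data (assumption, not the real dataset).
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Distributions to sample from instead of a fixed grid.
param_distributions = {
    "learning_rate": uniform(0.01, 0.3),  # continuous values in [0.01, 0.31]
    "max_depth": randint(2, 8),           # integers 2..7
    "n_estimators": randint(20, 60),      # integers 20..59
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter combinations
    scoring="f1",      # same selection metric as in the notebook
    cv=3,
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Unlike a grid search, the cost here is controlled by `n_iter` alone, so wider ranges can be explored without the number of fits growing multiplicatively.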


In [36]:
dockerfile_content = """
# Use a lightweight Python base image
# (the '-buster' variant is based on an end-of-life Debian release)
FROM python:3.9-slim

# Set environment variables
ENV PYTHONUNBUFFERED=1

# Set the working directory
WORKDIR /app

# Copy requirements file and install dependencies
# (requirements.txt will be created in a later step)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the trained model and the application script
# (app.py will be created in a later step)
COPY tourism_propensity_best_model.joblib .
COPY app.py .

# Expose the port the FastAPI application listens on
# (7860 is the default port that Hugging Face Docker Spaces route traffic to)
EXPOSE 7860

# Command to run the application
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
"""

# Create and write the content to a Dockerfile
with open('Dockerfile', 'w') as f:
    f.write(dockerfile_content)

print("Dockerfile created successfully.")

Dockerfile created successfully.


In [37]:
requirements_content = """
scikit-learn
pandas
joblib
huggingface_hub
uvicorn
fastapi
"""

# Create and write the content to a requirements.txt file
with open('requirements.txt', 'w') as f:
    f.write(requirements_content)

print("requirements.txt created successfully.")

requirements.txt created successfully.
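
The requirements above are unpinned, so the container may install different library versions than the ones used for training. scikit-learn does not guarantee that a model pickled with one version loads correctly in another, so pinning the versions from the training environment is a common safeguard. The sketch below shows one way to do that with the standard library; `pin_requirements` is a hypothetical helper, and packages that are not installed are left unpinned.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_requirements(packages):
    """Return 'name==version' for installed packages, or the bare name otherwise."""
    pinned = []
    for name in packages:
        try:
            pinned.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed in this environment: keep the entry unpinned.
            pinned.append(name)
    return pinned


packages = ["scikit-learn", "pandas", "joblib", "huggingface_hub", "uvicorn", "fastapi"]
requirements_text = "\n".join(pin_requirements(packages)) + "\n"
print(requirements_text)
```

Writing `requirements_text` to `requirements.txt` in the training environment keeps the serving image aligned with the versions that produced the `.joblib` file.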


In [38]:
app_content = """
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import pandas as pd
import os

# Initialize FastAPI app
app = FastAPI()

# Define the path to the model
MODEL_PATH = 'tourism_propensity_best_model.joblib'

# Load the trained model
# A failure here happens at import time, outside any request, so raise a plain
# exception instead of HTTPException (which only makes sense inside a request handler)
try:
    model = joblib.load(MODEL_PATH)
    print(f"Model loaded successfully from {MODEL_PATH}")
except FileNotFoundError as e:
    raise RuntimeError(f"Model file not found at {MODEL_PATH}") from e
except Exception as e:
    raise RuntimeError(f"Error loading model: {e}") from e

# Define a request body model for incoming data
# This should match the features (X_train columns) used for training
# Exclude 'ProdTaken' as it's the target
class PredictionRequest(BaseModel):
    Age: float
    CityTier: int
    DurationOfPitch: float
    NumberOfPersonVisiting: int
    NumberOfFollowups: float
    PreferredPropertyStar: float
    NumberOfTrips: float
    Passport: int
    PitchSatisfactionScore: int
    OwnCar: int
    NumberOfChildrenVisiting: float
    MonthlyIncome: float
    TypeofContact_Self_Enquiry: int
    Occupation_Large_Business: int
    Occupation_Salaried: int
    Occupation_Small_Business: int
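    Gender_Female: int
    Gender_Male: int
    ProductPitched_Deluxe: int
    ProductPitched_King: int
    ProductPitched_Standard: int
    ProductPitched_Super_Deluxe: int
    MaritalStatus_Married: int
    MaritalStatus_Single: int
    MaritalStatus_Unmarried: int
    Designation_Executive: int
    Designation_Manager: int
    Designation_Senior_Manager: int
    Designation_VP: int


# Several training columns contain spaces (e.g. 'Occupation_Large Business'), which are
# not valid Python identifiers, so the request fields use underscores instead.
# Assumes the model was fitted on a DataFrame, so feature_names_in_ is available.
FIELD_TO_COLUMN = {name.replace(' ', '_'): name for name in model.feature_names_in_}


@app.get("/health")
async def health_check():
    return {"status": "ok", "model_loaded": model is not None}

@app.post("/predict")
async def predict(request: PredictionRequest):
    try:
        # Convert request data to a Pandas DataFrame, restore the training column
        # names, and put the columns in the order used during training
        input_data = pd.DataFrame([request.model_dump()]).rename(columns=FIELD_TO_COLUMN)
        input_data = input_data[list(model.feature_names_in_)]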

        # Make prediction
        prediction = model.predict(input_data)
        prediction_proba = model.predict_proba(input_data)[:, 1] # Probability of the positive class

        return {"prediction": int(prediction[0]), "probability": float(prediction_proba[0])}
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Prediction error: {e}")
"""

# Create and write the content to app.py
with open('app.py', 'w') as f:
    f.write(app_content)

print("app.py created successfully.")


app.py created successfully.


In [39]:
app_content = """
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import pandas as pd
import os
from huggingface_hub import hf_hub_download

# Initialize FastAPI app
app = FastAPI()

# Define the Hugging Face model repository ID and filename
HF_REPO_ID = 'dash-binayak92/tourism-propensity-best-model'
HF_MODEL_FILENAME = 'tourism_propensity_best_model.joblib'

# Download the trained model from Hugging Face Hub
# A failure here happens at import time, outside any request, so raise a plain
# exception instead of HTTPException (which only makes sense inside a request handler)
try:
    # hf_hub_download stores the file in the local Hugging Face cache and returns its path
    model_path = hf_hub_download(repo_id=HF_REPO_ID, filename=HF_MODEL_FILENAME)
    model = joblib.load(model_path)
    print(f"Model loaded successfully from Hugging Face Hub: {HF_REPO_ID}/{HF_MODEL_FILENAME}")
except Exception as e:
    raise RuntimeError(f"Error loading model from Hugging Face Hub: {e}") from e

# Define a request body model for incoming data
# This should match the features (X_train columns) used for training
# Exclude 'ProdTaken' as it's the target
class PredictionRequest(BaseModel):
    Age: float
    CityTier: int
    DurationOfPitch: float
    NumberOfPersonVisiting: int
    NumberOfFollowups: float
    PreferredPropertyStar: float
    NumberOfTrips: float
    Passport: int
    PitchSatisfactionScore: int
    OwnCar: int
    NumberOfChildrenVisiting: float
    MonthlyIncome: float
    TypeofContact_Self_Enquiry: int
    Occupation_Large_Business: int
    Occupation_Salaried: int
    Occupation_Small_Business: int
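    Gender_Female: int
    Gender_Male: int
    ProductPitched_Deluxe: int
    ProductPitched_King: int
    ProductPitched_Standard: int
    ProductPitched_Super_Deluxe: int
    MaritalStatus_Married: int
    MaritalStatus_Single: int
    MaritalStatus_Unmarried: int
    Designation_Executive: int
    Designation_Manager: int
    Designation_Senior_Manager: int
    Designation_VP: int


# Several training columns contain spaces (e.g. 'Occupation_Large Business'), which are
# not valid Python identifiers, so the request fields use underscores instead.
# Assumes the model was fitted on a DataFrame, so feature_names_in_ is available.
FIELD_TO_COLUMN = {name.replace(' ', '_'): name for name in model.feature_names_in_}


@app.get("/health")
async def health_check():
    return {"status": "ok", "model_loaded": model is not None}

@app.post("/predict")
async def predict(request: PredictionRequest):
    try:
        # Convert request data to a Pandas DataFrame, restore the training column
        # names, and put the columns in the order used during training
        input_data = pd.DataFrame([request.model_dump()]).rename(columns=FIELD_TO_COLUMN)
        input_data = input_data[list(model.feature_names_in_)]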

        # Make prediction
        prediction = model.predict(input_data)
        prediction_proba = model.predict_proba(input_data)[:, 1] # Probability of the positive class

        return {"prediction": int(prediction[0]), "probability": float(prediction_proba[0])}
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Prediction error: {e}")
"""

# Overwrite the existing app.py with the updated content
with open('app.py', 'w') as f:
    f.write(app_content)

print("app.py updated successfully to load model from Hugging Face Hub.")

app.py updated successfully to load model from Hugging Face Hub.


In [40]:
print("Columns in X_train (feature set for the model):\n")
print(X_train.columns.tolist())

print("\nThe `PredictionRequest` class in `app.py` must map onto exactly these columns. Several names contain spaces (for example 'TypeofContact_Self Enquiry'), which cannot be used as Python field names, so the request fields have to be renamed back to these column names, in this order, before calling `model.predict`.")

Columns in X_train (feature set for the model):

['Age', 'CityTier', 'DurationOfPitch', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisiting', 'MonthlyIncome', 'TypeofContact_Self Enquiry', 'Occupation_Large Business', 'Occupation_Salaried', 'Occupation_Small Business', 'Gender_Female', 'Gender_Male', 'ProductPitched_Deluxe', 'ProductPitched_King', 'ProductPitched_Standard', 'ProductPitched_Super Deluxe', 'MaritalStatus_Married', 'MaritalStatus_Single', 'MaritalStatus_Unmarried', 'Designation_Executive', 'Designation_Manager', 'Designation_Senior Manager', 'Designation_VP']

The `PredictionRequest` class in `app.py` must map onto exactly these columns. Several names contain spaces (for example 'TypeofContact_Self Enquiry'), which cannot be used as Python field names, so the request fields have to be renamed back to these column names, in this order, before calling `model.predict`.
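
To make the relationship between request fields and training columns concrete, here is a minimal sketch using only the standard library. It assumes request fields are the training column names with spaces replaced by underscores; `to_field_name` and `to_training_row` are hypothetical helpers, and `TRAINING_COLUMNS` is copied from the output above.

```python
TRAINING_COLUMNS = [
    'Age', 'CityTier', 'DurationOfPitch', 'NumberOfPersonVisiting',
    'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'Passport',
    'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisiting', 'MonthlyIncome',
    'TypeofContact_Self Enquiry', 'Occupation_Large Business', 'Occupation_Salaried',
    'Occupation_Small Business', 'Gender_Female', 'Gender_Male',
    'ProductPitched_Deluxe', 'ProductPitched_King', 'ProductPitched_Standard',
    'ProductPitched_Super Deluxe', 'MaritalStatus_Married', 'MaritalStatus_Single',
    'MaritalStatus_Unmarried', 'Designation_Executive', 'Designation_Manager',
    'Designation_Senior Manager', 'Designation_VP',
]


def to_field_name(column: str) -> str:
    """Turn a training column name into a valid Python identifier."""
    return column.replace(' ', '_')


# Request field name -> training column name, in training order.
FIELD_TO_COLUMN = {to_field_name(c): c for c in TRAINING_COLUMNS}


def to_training_row(payload: dict) -> dict:
    """Rename request fields to training columns, rejecting mismatched payloads."""
    missing = [f for f in FIELD_TO_COLUMN if f not in payload]
    unexpected = [f for f in payload if f not in FIELD_TO_COLUMN]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    # Dict order follows FIELD_TO_COLUMN, i.e. the training column order.
    return {column: payload[field] for field, column in FIELD_TO_COLUMN.items()}


example_payload = {field: 0 for field in FIELD_TO_COLUMN}
row = to_training_row(example_payload)
print(list(row)[12:14])
```

A payload that uses a field with no training counterpart, or that omits `Gender_Female`, is rejected here before it ever reaches the model.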


In [41]:
app_content_local_load = """
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import pandas as pd
import os

# Initialize FastAPI app
app = FastAPI()

# Define the path to the model
MODEL_PATH = 'tourism_propensity_best_model.joblib'

# Load the trained model
# A failure here happens at import time, outside any request, so raise a plain
# exception instead of HTTPException (which only makes sense inside a request handler)
try:
    model = joblib.load(MODEL_PATH)
    print(f"Model loaded successfully from {MODEL_PATH}")
except FileNotFoundError as e:
    raise RuntimeError(f"Model file not found at {MODEL_PATH}") from e
except Exception as e:
    raise RuntimeError(f"Error loading model: {e}") from e

# Define a request body model for incoming data
# This should match the features (X_train columns) used for training
# Exclude 'ProdTaken' as it's the target
class PredictionRequest(BaseModel):
    Age: float
    CityTier: int
    DurationOfPitch: float
    NumberOfPersonVisiting: int
    NumberOfFollowups: float
    PreferredPropertyStar: float
    NumberOfTrips: float
    Passport: int
    PitchSatisfactionScore: int
    OwnCar: int
    NumberOfChildrenVisiting: float
    MonthlyIncome: float
    TypeofContact_Self_Enquiry: int
    Occupation_Large_Business: int
    Occupation_Salaried: int
    Occupation_Small_Business: int
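    Gender_Female: int
    Gender_Male: int
    ProductPitched_Deluxe: int
    ProductPitched_King: int
    ProductPitched_Standard: int
    ProductPitched_Super_Deluxe: int
    MaritalStatus_Married: int
    MaritalStatus_Single: int
    MaritalStatus_Unmarried: int
    Designation_Executive: int
    Designation_Manager: int
    Designation_Senior_Manager: int
    Designation_VP: int


# Several training columns contain spaces (e.g. 'Occupation_Large Business'), which are
# not valid Python identifiers, so the request fields use underscores instead.
# Assumes the model was fitted on a DataFrame, so feature_names_in_ is available.
FIELD_TO_COLUMN = {name.replace(' ', '_'): name for name in model.feature_names_in_}


@app.get("/health")
async def health_check():
    return {"status": "ok", "model_loaded": model is not None}

@app.post("/predict")
async def predict(request: PredictionRequest):
    try:
        # Convert request data to a Pandas DataFrame, restore the training column
        # names, and put the columns in the order used during training
        input_data = pd.DataFrame([request.model_dump()]).rename(columns=FIELD_TO_COLUMN)
        input_data = input_data[list(model.feature_names_in_)]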

        # Make prediction
        prediction = model.predict(input_data)
        prediction_proba = model.predict_proba(input_data)[:, 1] # Probability of the positive class

        return {"prediction": int(prediction[0]), "probability": float(prediction_proba[0])}
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Prediction error: {e}")
"""

# Overwrite the existing app.py with the content that loads the model locally
with open('app.py', 'w') as f:
    f.write(app_content_local_load)

print("app.py updated successfully to load model from local path.")


app.py updated successfully to load model from local path.
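
Once the API is running, the `/predict` endpoint expects a JSON body sent with POST. The sketch below builds such a request with the standard library only. The base URL is a hypothetical local address, `build_predict_request` is a helper introduced for illustration, and the payload shows just two fields for brevity; a real call must include every field of `PredictionRequest`.

```python
import json
import urllib.request


def build_predict_request(base_url: str, features: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical address; use the host and port the container is published on.
request = build_predict_request("http://localhost:8000/", {"Age": 35.0, "CityTier": 1})
print(request.get_method(), request.full_url)

# Sending the request requires a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```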


In [42]:
from huggingface_hub import HfApi, create_repo
import os

api = HfApi()

# Define the name of the new Hugging Face Space
# Replace 'dash-binayak92' with your Hugging Face username
space_id = "dash-binayak92/tourism-propensity-space-deployment"

# Create the Hugging Face Space repository
# exist_ok=True makes the call a no-op when the Space already exists,
# while real failures (for example an invalid token) still raise
create_repo(repo_id=space_id, repo_type="space", space_sdk="docker", exist_ok=True)
print(f"Successfully created Hugging Face Space: {space_id}")

# Define local file paths
dockerfile_path = 'Dockerfile'
requirements_path = 'requirements.txt'
app_path = 'app.py'
model_path = 'tourism_propensity_best_model.joblib'

# Upload the Dockerfile
api.upload_file(
    path_or_fileobj=dockerfile_path,
    path_in_repo="Dockerfile",
    repo_id=space_id,
    repo_type="space",
)
print(f"Uploaded {dockerfile_path} to {space_id}")

# Upload the requirements.txt file
api.upload_file(
    path_or_fileobj=requirements_path,
    path_in_repo="requirements.txt",
    repo_id=space_id,
    repo_type="space",
)
print(f"Uploaded {requirements_path} to {space_id}")

# Upload the app.py file
api.upload_file(
    path_or_fileobj=app_path,
    path_in_repo="app.py",
    repo_id=space_id,
    repo_type="space",
)
print(f"Uploaded {app_path} to {space_id}")

# Upload the tourism_propensity_best_model.joblib file
api.upload_file(
    path_or_fileobj=model_path,
    path_in_repo="tourism_propensity_best_model.joblib",
    repo_id=space_id,
    repo_type="space",
)
print(f"Uploaded {model_path} to {space_id}")

print("All necessary deployment files uploaded to Hugging Face Space.")

Successfully created Hugging Face Space: dash-binayak92/tourism-propensity-space-deployment
Uploaded Dockerfile to dash-binayak92/tourism-propensity-space-deployment
Uploaded requirements.txt to dash-binayak92/tourism-propensity-space-deployment
Uploaded app.py to dash-binayak92/tourism-propensity-space-deployment


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...pensity_best_model.joblib: 100%|##########| 2.94MB / 2.94MB            

Uploaded tourism_propensity_best_model.joblib to dash-binayak92/tourism-propensity-space-deployment
All necessary deployment files uploaded to Hugging Face Space.
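
The four separate uploads above can also be done as a single commit with `HfApi.upload_folder`, preceded by a check that every deployment file exists. This is a sketch under assumptions: `missing_deployment_files` and `upload_space_files` are hypothetical helpers, and the upload itself needs a valid Hugging Face token and network access, so it is not executed here.

```python
from pathlib import Path

REQUIRED_FILES = (
    "Dockerfile",
    "requirements.txt",
    "app.py",
    "tourism_propensity_best_model.joblib",
)


def missing_deployment_files(folder) -> list:
    """Return the required deployment files that are absent from `folder`."""
    root = Path(folder)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def upload_space_files(folder, space_id: str):
    """Upload the deployment files to a Space in one commit (needs token and network)."""
    missing = missing_deployment_files(folder)
    if missing:
        raise FileNotFoundError(f"Missing deployment files: {missing}")
    # Imported lazily so the file check above works without huggingface_hub installed.
    from huggingface_hub import HfApi

    return HfApi().upload_folder(
        folder_path=str(folder),
        repo_id=space_id,
        repo_type="space",
        allow_patterns=list(REQUIRED_FILES),  # upload only the deployment files
        commit_message="Update deployment files",
    )


# Example call (not executed here):
# upload_space_files(".", "dash-binayak92/tourism-propensity-space-deployment")
```

A single commit also means the Space rebuilds once instead of once per uploaded file.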


In [43]:
import os

workflow_dir = '.github/workflows/'
workflow_file_path = os.path.join(workflow_dir, 'pipeline.yml')

# Create the directories if they don't exist
os.makedirs(workflow_dir, exist_ok=True)

# Create an empty pipeline.yml file
with open(workflow_file_path, 'w') as f:
    f.write('')

print(f"Successfully created empty file: {workflow_file_path}")

Successfully created empty file: .github/workflows/pipeline.yml


In [44]:
github_workflow_content = """
name: CI/CD Pipeline for Hugging Face Space

on:
  push:
    branches:
      - main

jobs:
  deploy-to-hf-space:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout repository
      uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.9'

    - name: Install dependencies
      run: |
        pip install --upgrade pip
        pip install huggingface_hub
        pip install pandas scikit-learn joblib uvicorn fastapi

    - name: Log in to Hugging Face
      env:
        HF_TOKEN: ${{ secrets.HF_TOKEN }}
      run: |
        huggingface-cli login --token $HF_TOKEN

    - name: Get latest model and deployment files
      run: |
        # Download the model and deployment files from the workspace or local storage
        # Assuming app.py, Dockerfile, requirements.txt, and model are in the root directory
        # In a real GitHub Actions workflow, these would be part of the checkout
        echo "Deployment files are part of the checkout."

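    - name: Upload deployment files to the Hugging Face Space
      env:
        HF_USERNAME: ${{ secrets.HF_USERNAME }}
        HF_TOKEN: ${{ secrets.HF_TOKEN }}
      run: |
        # A Docker Space builds its image on Hugging Face from the Dockerfile in the
        # Space repository, so the files are uploaded instead of pushing an image
        python - <<'PY'
        import os
        from huggingface_hub import HfApi

        HfApi(token=os.environ['HF_TOKEN']).upload_folder(
            folder_path='.',
            repo_id=os.environ['HF_USERNAME'] + '/tourism-propensity-space-deployment',
            repo_type='space',
            allow_patterns=['Dockerfile', 'requirements.txt', 'app.py', 'tourism_propensity_best_model.joblib'],
        )
        PY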
"""

# Overwrite the empty pipeline.yml file with the workflow content
with open(workflow_file_path, 'w') as f:
    f.write(github_workflow_content)

print(f"GitHub Actions workflow content written to {workflow_file_path} successfully.")

GitHub Actions workflow content written to .github/workflows/pipeline.yml successfully.
