# Intro to Machine Learning: Project

# Introduction
Mobile carriers often face the challenge of transitioning their subscribers from outdated plans to more optimized and cost-effective alternatives. Megaline, a leading mobile carrier, has introduced two new plans—Smart and Ultra—that cater to different user needs based on their behavior patterns. To encourage the adoption of these new plans, Megaline seeks a robust recommendation system that can accurately analyze subscriber behavior and suggest the most suitable plan. The availability of detailed historical data on subscriber activity provides an excellent foundation for building such a predictive model.

This project aims to develop a machine learning classification model that recommends either the Smart or the Ultra plan. Using preprocessed behavior data from subscribers who have already switched to these plans, the model will learn patterns in four usage metrics: number of calls, call minutes, number of messages, and megabytes of data used. The primary goal is an accuracy of at least 75% on the test dataset, so that the recommendations are reliable enough to act on. By leveraging data-driven insights, the project will enhance the customer experience and support Megaline's strategic goal of modernizing its subscriber base.

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC


In [2]:
# Load the dataset from the local copy, falling back to the hosted file
try:
    data = pd.read_csv('datasets/users_behavior.csv')
except FileNotFoundError:
    data = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/users_behavior.csv')

In [3]:
data

      calls  minutes  messages   mb_used  is_ultra
0      40.0   311.90      83.0  19915.42         0
1      85.0   516.75      56.0  22696.96         0
2      77.0   467.66      86.0  21060.45         0
3     106.0   745.53      81.0   8437.39         1
4      66.0   418.74       1.0  14502.75         0
...     ...      ...       ...       ...       ...
3209  122.0   910.98      20.0  35124.90         1
3210   25.0   190.36       0.0   3275.61         0
3211   97.0   634.44      70.0  13974.06         0
3212   64.0   462.32      90.0  31239.78         0


In [4]:
# Separate features and target
X = data.drop('is_ultra', axis=1)
y = data['is_ultra']

# Split data: 60% training, 20% validation, 20% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# Output the shapes of the splits
print("Training set shape:", X_train.shape, y_train.shape)
print("Validation set shape:", X_valid.shape, y_valid.shape)
print("Test set shape:", X_test.shape, y_test.shape)

Training set shape: (1928, 4) (1928,)
Validation set shape: (643, 4) (643,)
Test set shape: (643, 4) (643,)
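
The split sizes printed above can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up and gives the remainder to the other side of the split; `split_sizes` is a hypothetical helper that only does this arithmetic and is not part of the original notebook.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper: assumes the test share is rounded up and the
    training side receives whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_rows = 1928 + 643 + 643  # total rows, summed from the printed shapes

# First split: 60% training, 40% held out
n_train, n_temp = split_sizes(n_rows, 0.4)
# Second split: the held-out 40% is halved into validation and test
n_valid, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_valid, n_test)
```

Under that rounding assumption the arithmetic gives the same three sizes as the printed shapes.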


In [8]:
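# Re-split into 80% training and 20% test. This overwrites the 60/20/20
# split above: GridSearchCV cross-validates on the training set, so the
# separate validation set is not used
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)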

# Define models and hyperparameter grids
models = {
    'RandomForest': {
        'model': RandomForestClassifier(random_state=42),
        'params': {
            'n_estimators': [50, 100, 150],
            'max_depth': [None, 10, 20],
            'min_samples_split': [2, 5, 10],
        }
    },
    'SVM': {
        'model': SVC(random_state=42),
        'params': {
            'C': [0.1, 1, 10],
            'kernel': ['linear', 'rbf'],
            'gamma': ['scale', 'auto']
        }
    }
}

# Perform GridSearchCV for each model
results = []
for model_name, model_details in models.items():
    grid = GridSearchCV(
        estimator=model_details['model'],
        param_grid=model_details['params'],
        cv=3,
        scoring='accuracy',
        n_jobs=-1
    )
    grid.fit(X_train, y_train)
    best_model = grid.best_estimator_
    y_pred = best_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results.append({
        'Model': model_name,
        'Best Params': grid.best_params_,
        'Accuracy': accuracy
    })
    print(f"Model: {model_name}")
    print(f"Best Parameters: {grid.best_params_}")
    print(f"Test Set Accuracy: {accuracy:.4f}")
    print(classification_report(y_test, y_pred))

# Display results summary
results_df = pd.DataFrame(results)
print(results_df)

Model: RandomForest
Best Parameters: {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 100}
Test Set Accuracy: 0.8212
              precision    recall  f1-score   support

           0       0.82      0.95      0.88       446
           1       0.83      0.52      0.64       197

    accuracy                           0.82       643
   macro avg       0.82      0.74      0.76       643
weighted avg       0.82      0.82      0.81       643

Model: SVM
Best Parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Test Set Accuracy: 0.7558
              precision    recall  f1-score   support

           0       0.75      0.98      0.85       446
           1       0.87      0.24      0.37       197

    accuracy                           0.76       643
   macro avg       0.81      0.61      0.61       643
weighted avg       0.78      0.76      0.70       643

          Model                                        Best Params  Accuracy
0  RandomForest  {'max_depth': 10, 'min_samp
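
To get a feel for how much work the grid search above does, the number of model fits can be counted directly from the two parameter grids. This is a minimal sketch: the grids and `cv=3` are copied from the cell above, while the `+ 1` assumes GridSearchCV's default behavior of refitting the best candidate once on the full training set.

```python
from itertools import product

# Parameter grids copied from the model-selection cell
param_grids = {
    'RandomForest': {
        'n_estimators': [50, 100, 150],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10],
    },
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf'],
        'gamma': ['scale', 'auto'],
    },
}
cv_folds = 3  # matches cv=3 in the GridSearchCV call

fit_counts = {}
for name, grid in param_grids.items():
    # Every combination of values is one candidate
    n_candidates = len(list(product(*grid.values())))
    # Each candidate is fitted once per fold; the final refit adds one more
    fit_counts[name] = n_candidates * cv_folds + 1
    print(f"{name}: {n_candidates} candidates, {fit_counts[name]} fits")
```

Because the count grows multiplicatively with each added hyperparameter, widening a grid is cheap to write but can be expensive to run.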

# Conclusion
This project evaluated two classification models, Random Forest and Support Vector Machine (SVM), for predicting whether a subscriber belongs on the Ultra plan based on usage behavior. The dataset has 3,214 rows and five columns. It was first split 60/20/20 into training, validation, and test sets, but the model-selection step re-split it 80/20 into training and test sets and relied on cross-validation instead of the separate validation set. GridSearchCV tuned the hyperparameters of both models using 3-fold cross-validation. The Random Forest model achieved an accuracy of 82.12% on the test set, with strong performance on the majority class (recall 0.95) but lower recall on the minority class (0.52). Its best parameters were a maximum depth of 10, a minimum samples split of 5, and 100 estimators. The SVM model reached an accuracy of 75.58%, just above the 75% target, with a minority-class recall of 0.24 and an F1-score of 0.37. The selected RBF kernel therefore did not separate the classes well; one plausible contributor, not tested here, is that the features were passed to the SVM unscaled, and RBF kernels are sensitive to feature scale.
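
The accuracy figures are easier to judge against a trivial baseline. The sketch below uses only numbers read off the Random Forest classification report above (which are rounded to two decimals) to compute the accuracy of always predicting the majority class, and to show how the class 1 F1-score and the macro-averaged recall follow from precision and recall. It assumes class 0 corresponds to the Smart plan, since `is_ultra` is 0 for those rows.

```python
# Values read off the Random Forest classification report (rounded)
support = {0: 446, 1: 197}           # test rows per class
precision_1, recall_1 = 0.83, 0.52   # class 1 (Ultra)
recall_0 = 0.95                      # class 0 (assumed to be Smart)

n_test = sum(support.values())

# Accuracy of a model that always predicts the majority class
baseline_accuracy = max(support.values()) / n_test

# F1 is the harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Macro-averaged recall weights both classes equally
macro_recall = (recall_0 + recall_1) / 2

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Class 1 F1 from rounded inputs:   {f1_1:.2f}")
print(f"Macro recall from rounded inputs: {macro_recall:.3f}")
```

Comparing each model's test accuracy with this baseline shows how much of the score comes from learning usage patterns and how much from the class distribution alone.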
Overall, the Random Forest model outperformed SVM in both accuracy and minority-class metrics, making it the more suitable model for this task. However, both models struggled to predict the minority class (is_ultra = 1), which is consistent with the class imbalance (197 of the 643 test rows, about 31%, are Ultra subscribers) or with overlapping feature distributions between the classes. Future work could explore resampling or additional features to improve minority-class predictions. Other algorithms, such as Gradient Boosting or neural networks, might also improve performance.
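
As one possible starting point for that future work, the sketch below shows how an SVM could be tuned with feature scaling and class weighting. Several things here are assumptions and were not part of the original analysis: it uses class weighting in place of resampling, it adds a `StandardScaler` step, it scores candidates on recall instead of accuracy, and it runs on synthetic stand-in data from `make_classification`, not on the Megaline dataset. Its printed numbers therefore say nothing about how the approach would perform on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the Megaline dataset): 4 features and
# a roughly 70/30 class balance to mimic the imbalance discussed above
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline so each CV fold is scaled using
# only its own training portion; class_weight='balanced' penalizes
# errors on the minority class more heavily
pipeline = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf', class_weight='balanced', random_state=42),
)
grid = GridSearchCV(
    estimator=pipeline,
    param_grid={'svc__C': [0.1, 1, 10]},  # 'svc__' targets the SVC step
    cv=3,
    scoring='recall',  # select on minority-class (label 1) recall
)
grid.fit(X_tr, y_tr)

minority_recall = recall_score(y_te, grid.predict(X_te))
print(f"Best C: {grid.best_params_['svc__C']}")
print(f"Minority-class recall on synthetic test data: {minority_recall:.2f}")
```

Swapping `X_demo` and `y_demo` for the notebook's `X` and `y` would be the natural next step; whether it improves on the results reported here would need to be measured.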