# Using Decision Tree, Random Forest, and Logistic Regression for classification tasks with telecom data 

The goal of this project is to develop a machine learning model that accurately predicts which mobile carrier plan a user will choose based on their usage patterns. By analyzing features such as call minutes, number of messages, and internet data consumption, we aim to identify the key factors influencing customer decisions and automate the classification process.

To achieve this, we explore several classification algorithms, compare their performance, and select the most reliable model. The results can help telecom companies better understand customer behavior, improve targeted marketing, and optimize plan offerings. The target is the highest possible accuracy, with the project requiring at least 0.75.

In [9]:
import pandas as pd

# Load the dataset
df = pd.read_csv('users_behavior.csv')

In [10]:
# Display the first five rows
print("First five rows:")
print(df.head(), "\n")

First five rows:
   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0 



In [11]:
# Show info about the dataset (including data types and missing values)
print("Info about the dataset:")
# df.info() prints its report directly and returns None, so call it without print()
df.info()
print()

Info about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB



In [12]:
# Show summary statistics for numeric columns
print("Summary statistics:")
print(df.describe(), "\n")

Summary statistics:
             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246     0.461100
min       0.000000     0.000000     0.000000      0.000000     0.000000
25%      40.000000   274.575000     9.000000  12491.902500     0.000000
50%      62.000000   430.600000    30.000000  16943.235000     0.000000
75%      82.000000   571.927500    57.000000  21424.700000     1.000000
max     244.000000  1632.060000   224.000000  49745.730000     1.000000 



In [13]:
# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())

Missing values in each column:
calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64


In [14]:
from sklearn.model_selection import train_test_split

# Load the data
df = pd.read_csv('users_behavior.csv')

# First, separate features and target
X = df.drop('is_ultra', axis=1)
y = df['is_ultra']

# Split into train+val and test (80% train+val, 20% test)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Split train+val into train and validation: 25% of the 80% goes to validation,
# giving 60% train / 20% validation / 20% test overall
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val
)

print(f"Train set: {X_train.shape[0]} samples")
print(f"Validation set: {X_valid.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Optionally, save splits for later use
X_train.to_csv('X_train.csv', index=False)
X_valid.to_csv('X_valid.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_valid.to_csv('y_valid.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

Train set: 1928 samples
Validation set: 643 samples
Test set: 643 samples
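
The split sizes printed above follow from applying the two fractions in sequence. As a quick arithmetic check, the sketch below recomputes them for the 3,214 rows in the dataset. It assumes the held-out part of each split is rounded up, which is how `train_test_split` is understood to treat a fractional `test_size`; the helper name `two_stage_split_sizes` is hypothetical and not part of the notebook.

```python
import math


def two_stage_split_sizes(n_samples, test_size, valid_size):
    """Return (train, valid, test) sizes for a two-stage hold-out split.

    Assumption: each held-out part is rounded up, and the remainder
    stays on the other side of the split.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train_val = n_samples - n_test
    n_valid = math.ceil(valid_size * n_train_val)
    n_train = n_train_val - n_valid
    return n_train, n_valid, n_test


sizes = two_stage_split_sizes(3214, test_size=0.2, valid_size=0.25)
print(sizes)  # (1928, 643, 643), matching the counts printed by the notebook
```

Taking 25% of the remaining 80% is what yields the overall 60/20/20 proportions.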


In [15]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load train/validation splits
X_train = pd.read_csv('X_train.csv')
y_train = pd.read_csv('y_train.csv').squeeze()  # squeeze to convert to Series
X_valid = pd.read_csv('X_valid.csv')
y_valid = pd.read_csv('y_valid.csv').squeeze()

results = []

# Decision Tree
for max_depth in [3, 5, 7, None]:
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    clf.fit(X_train, y_train)
    preds = clf.predict(X_valid)
    acc = accuracy_score(y_valid, preds)
    results.append({
        'model': 'DecisionTree',
        'params': f'max_depth={max_depth}',
        'accuracy': acc
    })

# Random Forest
for n_estimators in [10, 50, 100]:
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    clf.fit(X_train, y_train)
    preds = clf.predict(X_valid)
    acc = accuracy_score(y_valid, preds)
    results.append({
        'model': 'RandomForest',
        'params': f'n_estimators={n_estimators}',
        'accuracy': acc
    })

# Logistic Regression
for C in [0.1, 1, 10]:
    clf = LogisticRegression(C=C, max_iter=1000, random_state=42)
    clf.fit(X_train, y_train)
    preds = clf.predict(X_valid)
    acc = accuracy_score(y_valid, preds)
    results.append({
        'model': 'LogisticRegression',
        'params': f'C={C}',
        'accuracy': acc
    })

# Print results
df_results = pd.DataFrame(results)
print(df_results.sort_values(by='accuracy', ascending=False))

                model            params  accuracy
1        DecisionTree       max_depth=5  0.790047
5        RandomForest   n_estimators=50  0.785381
6        RandomForest  n_estimators=100  0.782271
0        DecisionTree       max_depth=3  0.777605
2        DecisionTree       max_depth=7  0.776050
4        RandomForest   n_estimators=10  0.772939
7  LogisticRegression             C=0.1  0.744946
8  LogisticRegression               C=1  0.744946
9  LogisticRegression              C=10  0.744946
3        DecisionTree    max_depth=None  0.715397


**Findings** The **Decision Tree** with max_depth=5 achieved the highest validation accuracy (0.790). Its lead over the best Random Forest is small: three of the 643 validation samples.
The **Random Forest** models were close behind, with n_estimators=50 scoring 0.785, followed by n_estimators=100 (0.782) and n_estimators=10 (0.773).
**Logistic Regression** scored 0.745 for every value of the regularization parameter C, just short of the 0.75 threshold.
The unrestricted Decision Tree (max_depth=None) was the weakest configuration at 0.715, which suggests that it overfits the training data.
Overall, depth-limited Decision Trees and Random Forests were more accurate than Logistic Regression on this classification task.
All configurations **except** Logistic Regression and the unrestricted Decision Tree surpassed the project's required accuracy threshold of 0.75.
In conclusion, the **Decision Tree** with max_depth=5 was the **best-performing** model on the validation set.
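
To make the threshold comparison easy to verify, the sketch below re-enters the validation accuracies from the results table and lists which configurations fall below 0.75. The names `validation_accuracy` and `THRESHOLD` are illustrative and do not appear in the notebook.

```python
# Validation accuracies copied from the results table above.
validation_accuracy = {
    ("DecisionTree", "max_depth=3"): 0.777605,
    ("DecisionTree", "max_depth=5"): 0.790047,
    ("DecisionTree", "max_depth=7"): 0.776050,
    ("DecisionTree", "max_depth=None"): 0.715397,
    ("RandomForest", "n_estimators=10"): 0.772939,
    ("RandomForest", "n_estimators=50"): 0.785381,
    ("RandomForest", "n_estimators=100"): 0.782271,
    ("LogisticRegression", "C=0.1"): 0.744946,
    ("LogisticRegression", "C=1"): 0.744946,
    ("LogisticRegression", "C=10"): 0.744946,
}

THRESHOLD = 0.75  # required accuracy stated in the findings

# Split the configurations by whether they reach the threshold.
passed = {k: v for k, v in validation_accuracy.items() if v >= THRESHOLD}
failed = {k: v for k, v in validation_accuracy.items() if v < THRESHOLD}

# Configuration with the highest validation accuracy.
best = max(validation_accuracy, key=validation_accuracy.get)

print("Best:", best, validation_accuracy[best])
for (model, params), acc in sorted(failed.items(), key=lambda item: item[1]):
    print(f"Below threshold: {model} ({params}) = {acc:.3f}")
```

Running this shows that the unrestricted Decision Tree (max_depth=None) also falls below 0.75, alongside the three Logistic Regression runs.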

In [16]:
# Load train and test splits
X_train = pd.read_csv('X_train.csv')
y_train = pd.read_csv('y_train.csv').squeeze()
X_test = pd.read_csv('X_test.csv')
y_test = pd.read_csv('y_test.csv').squeeze()

# Train the best model (DecisionTree with max_depth=5)
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate accuracy
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test set accuracy: {test_accuracy:.3f}")

Test set accuracy: 0.801


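**Sanity Check Summary**
**Accuracy Consistency:** The model's validation (0.79) and test (0.80) accuracies are close, which indicates that the chosen depth was not overfitted to the validation set.
**Baseline Comparison:** About 69% of users belong to class 0 (the mean of `is_ultra` is 0.306), so always predicting the majority class would score roughly 0.69. The test accuracy of 0.80 is well above that baseline.

**Conclusion:**  
The Decision Tree model (`max_depth=5`) passes both checks: it generalizes from validation to test data and clearly beats the majority-class baseline. No obvious issues are detected in its performance.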
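
A check that fits alongside the accuracy comparison above is a majority-class baseline: the accuracy obtained by always predicting the most common class. The sketch below is a minimal, dependency-free version under stated assumptions. The labels are synthetic, built only to mirror the class balance reported in the summary statistics (mean of `is_ultra` about 0.306), and `majority_baseline_accuracy` is a hypothetical helper. For real data, scikit-learn's `DummyClassifier(strategy="most_frequent")` provides the same baseline.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Synthetic labels (assumption): 3214 rows with the same share of class 1
# as the mean of is_ultra in the summary statistics (0.306472).
n_rows = 3214
n_positive = round(0.306472 * n_rows)
labels = [1] * n_positive + [0] * (n_rows - n_positive)

baseline = majority_baseline_accuracy(labels)
test_accuracy = 0.801  # test accuracy reported by the notebook

print(f"Majority-class baseline: {baseline:.3f}")
print(f"Improvement over baseline: {test_accuracy - baseline:.3f}")
```

In the notebook itself, the same comparison could be made by passing `y_test` to the helper, so that the baseline and the model are scored on identical samples.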

**Conclusion**
In this project, we set out to build a reliable classifier for predicting user plan choices using their activity data. The dataset was carefully split into training, validation, and test sets to ensure fair model evaluation.

After comparing multiple models, depth-limited Decision Trees and Random Forests outperformed Logistic Regression; only the unrestricted Decision Tree (0.715) scored lower. The best-performing model was a Decision Tree with a maximum depth of 5, achieving a validation accuracy of 0.79.

On the test set, the Decision Tree achieved an accuracy of 0.80. This is close to its validation accuracy (0.79), which indicates no major overfitting and suggests that the model generalizes to new, unseen data.

Overall, the project produced a model that meets the required accuracy threshold of 0.75 on both the validation and test sets. The results show that depth-limited tree-based models are effective for this classification task and give interpretable predictions that can support business decision-making.