## 1 Overview

### 1.1 Project Description

Megaline, a mobile carrier, has identified that a significant number of their subscribers are still using legacy plans. The company aims to develop a model that can analyze subscriber behavior and recommend one of their newer plans: Smart or Ultra. The project leverages behavior data from subscribers who have already switched to these new plans. The task is to build a classification model that recommends the appropriate plan based on subscriber behavior.

### 1.2 Objectives

- **Develop a Model:** Create a model with the highest possible accuracy to classify subscribers into Smart or Ultra plans. The accuracy threshold for this project is set at 0.75.
- **Evaluate Model Performance:** Use a test dataset to verify the model's accuracy.

## 2 Initialization

### 2.1 Add imports

Declaring all imports at the top of the notebook keeps its dependencies visible in one place and makes the external libraries available to every later cell.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

1. **Pandas**: a Python library for data manipulation and analysis, offering data structures and operations for working with structured data.
2. **Scikit-learn (`sklearn`)**: a Python machine learning library built on NumPy and SciPy. This notebook uses its `train_test_split`, `DecisionTreeClassifier`, `RandomForestClassifier`, `accuracy_score`, and `classification_report`.

### 2.2 Set up CSV DataFrames

I use Pandas to load the CSV file into a DataFrame so the data can be inspected and analyzed within the notebook.

In [2]:
path = {
    'local': './datasets/users_behavior.csv',
    'server': '/datasets/users_behavior.csv',
    'online': 'https://raw.githubusercontent.com/alexcoy06/Data-Science/main/Project%207/datasets/users_behavior.csv'
}

The `path` dictionary stores the dataset location for three environments: my `local` machine, TripleTen's `server`, and an `online` copy for remote use.

In [3]:
def load_csv(file_path):
    try:
        df = pd.read_csv(file_path['local'])
    except FileNotFoundError:
        try:
            df = pd.read_csv(file_path['server'])
        except FileNotFoundError:
            df = pd.read_csv(file_path['online'])
    return df

I define the `load_csv` function to load the dataset from the locations in `file_path`. It first tries `file_path['local']`; if that raises a `FileNotFoundError`, it tries `file_path['server']`, and if that also fails, it falls back to `file_path['online']`.
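The same fallback idea can also be written as a loop over candidate locations, which avoids nesting one `try` block per source. This is only a sketch of an alternative: the function name `load_first_available` is hypothetical, and catching `OSError` (the parent class of `FileNotFoundError`) is an assumption about which failures should trigger the fallback.

```python
import pandas as pd


def load_first_available(sources):
    """Return a DataFrame read from the first source that can be opened.

    `sources` is an ordered iterable of file paths or URLs (hypothetical helper,
    not part of the original notebook).
    """
    sources = list(sources)
    last_error = None
    for source in sources:
        try:
            return pd.read_csv(source)
        except OSError as error:
            # FileNotFoundError is a subclass of OSError; remember the failure
            # and move on to the next candidate.
            last_error = error
    raise FileNotFoundError(f'None of the sources could be read: {sources}') from last_error
```

With the `path` dictionary defined earlier, a call such as `load_first_available(path.values())` would try the locations in the order they were declared.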

In [4]:
behavior = load_csv(path)

The variable `behavior` is assigned the resulting DataFrame from the created function.

## 3 Preparing the Data

I will now proceed to examine the `behavior` data frame.

In [5]:
behavior

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


The dataset consists of monthly behavior information for individual users, with the following attributes:

- **Calls:** Number of calls made.
- **Minutes:** Total duration of calls in minutes.
- **Messages:** Number of text messages sent.
- **MB Used:** Internet traffic used in megabytes.
- **Is Ultra:** Indicator of the plan for the current month (Ultra - 1, Smart - 0).

In [6]:
behavior.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


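All five columns are numeric and have no nulls, so the data types are acceptable as they are. `calls` and `messages` are counts stored as `float64`, which does not affect the tree-based models used below.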

In [7]:
behavior_miss = behavior.isna().sum()
behavior_dupl = behavior.duplicated().sum()
print(f'There are {behavior_dupl} duplicate rows. Missing values per column:\n{behavior_miss}')

There are 0 duplicate rows. Missing values per column:
calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64


With no missing or duplicated values present, it is safe to proceed to the next step.

## 4 Splitting the Data

In this section, we split the dataset into training and validation sets using the `train_test_split` function from `sklearn.model_selection`. Holding out part of the data is what lets us measure how the model performs on data it was not trained on.

In [8]:
features = behavior.drop('is_ultra', axis=1)
target = behavior['is_ultra']

First, we define the features and the target variable: `is_ultra` is the target, and the remaining four columns are the features.

In [9]:
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=42)

This code splits the `features` and `target` datasets into training and validation sets using a 75/25 ratio, ensuring reproducibility with `random_state=42`.
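The objectives mention a test dataset, while the split above produces only training and validation sets. One possible way to set aside a separate test set is a two-step split. The sketch below runs on synthetic data; the 60/20/20 proportions, the `stratify` argument, and the `X_*`/`y_*` names are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (values are made up for illustration).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'calls': rng.integers(0, 200, size=1000).astype(float),
    'mb_used': rng.uniform(0, 40000, size=1000),
    'is_ultra': rng.integers(0, 2, size=1000),
})
X = demo.drop('is_ultra', axis=1)
y = demo['is_ultra']

# Step 1: hold out 20% of the rows as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)

print(len(X_train), len(X_valid), len(X_test))
```

Under this scheme the validation set would be used to compare models and settings, and the test set would be scored once at the end for the chosen model.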

## 5 Model Training and Evaluation

### 5.1 Decision Tree Classifier

Now I will use a decision tree classifier to predict the plan for each subscriber.

In [10]:
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(features_train, target_train)

This initializes a Decision Tree classifier with the same fixed random seed, then trains the model using training data, `features_train` and `target_train`.

In [11]:
target_pred_valid_dt = dt_model.predict(features_valid)  
accuracy_dt = accuracy_score(target_valid, target_pred_valid_dt) 
print(f'Decision Tree Classifier Accuracy: {accuracy_dt}')

Decision Tree Classifier Accuracy: 0.7375621890547264


The decision tree's accuracy on the validation set is about 74%, just below the project threshold of 0.75.
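The tree above is grown with default settings, so its depth is unrestricted. A common next step is to try several values of `max_depth` and keep the one with the best validation accuracy. The following is a minimal sketch on synthetic data from `make_classification`; the depth range of 1 to 10 and all variable names are assumptions, and results on the real dataset may differ.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Train one tree per depth and record its validation accuracy.
accuracy_by_depth = {}
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    accuracy_by_depth[depth] = accuracy_score(y_valid, model.predict(X_valid))

# Pick the depth with the highest validation accuracy.
best_depth = max(accuracy_by_depth, key=accuracy_by_depth.get)
best_accuracy = accuracy_by_depth[best_depth]
print(f'Best max_depth: {best_depth}, validation accuracy: {best_accuracy:.4f}')
```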

### 5.2 Random Forest Classifier

Now I will use a random forest classifier to predict the plan for each subscriber.

In [12]:
rf_model = RandomForestClassifier(random_state=42, n_estimators=100)
rf_model.fit(features_train, target_train)

This initializes a Random Forest classifier with 100 trees and the same fixed random seed. It then fits the model to training data, `features_train` and `target_train`.

In [13]:
target_pred_valid_rf = rf_model.predict(features_valid)  
accuracy_rf = accuracy_score(target_valid, target_pred_valid_rf) 
print(f'Random Forest Classifier Accuracy: {accuracy_rf}')

Random Forest Classifier Accuracy: 0.8208955223880597


The random forest reaches about 82% accuracy on the validation set, notably higher than the decision tree and above the 0.75 threshold.

## 6 Sanity Check and Documentation

A sanity check is a quick evaluation to confirm that the results make sense. Here, I look at the per-class precision and recall behind each model's accuracy score.

In [14]:
print(classification_report(target_valid, target_pred_valid_dt))

              precision    recall  f1-score   support

           0       0.81      0.81      0.81       565
           1       0.56      0.56      0.56       239

    accuracy                           0.74       804
   macro avg       0.69      0.69      0.69       804
weighted avg       0.74      0.74      0.74       804



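For the decision tree, precision and recall are 0.81 for Smart (class 0) but only 0.56 for Ultra (class 1), so the model is much weaker on the less common plan.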

In [15]:
print(classification_report(target_valid, target_pred_valid_rf))

              precision    recall  f1-score   support

           0       0.84      0.93      0.88       565
           1       0.77      0.57      0.65       239

    accuracy                           0.82       804
   macro avg       0.80      0.75      0.77       804
weighted avg       0.82      0.82      0.81       804



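The random forest improves on both classes. Smart reaches 0.84 precision and 0.93 recall, while Ultra reaches 0.77 precision but only 0.57 recall, so the model still misses roughly 43% of Ultra subscribers in the validation set.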
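Another common sanity check is to compare the models against a baseline that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on labels that mirror the validation-set support counts from the reports above (565 Smart, 239 Ultra). Fitting and scoring on the same labels, and the placeholder feature matrix, are simplifications for illustration; in the notebook the baseline would be fit on the training set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the same class counts as the validation set: 565 Smart (0), 239 Ultra (1).
y_demo = np.array([0] * 565 + [1] * 239)
# DummyClassifier ignores feature values, so a placeholder matrix is enough.
X_demo = np.zeros((len(y_demo), 1))

# Always predict the most frequent class.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)
print(f'Majority-class baseline accuracy: {baseline_accuracy:.4f}')  # 565 / 804 ≈ 0.7027
```

Against a baseline of about 0.70, the decision tree's 0.74 is only a small gain, while the random forest's 0.82 is a clearer improvement.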