In [4]:
import pandas as pd

The assignment can be broken down into the following steps:

1. **Data Preparation**:
   - Start by loading the dataset into your preferred data analysis environment (e.g., Python with pandas).
   - Inspect the dataset to understand its structure, check for missing values, and identify the features and target variable.
   - Convert the 'Connect_Date' column to a datetime format for further analysis.
   - You may also need to preprocess the data, such as encoding categorical variables and handling any outliers or anomalies.

2. **Train-Validation Split**:
   - Split the dataset into training and validation sets. Since you're supposed to create your validation set, you can use techniques like stratified sampling to ensure that both sets have a similar distribution of classes (churned vs. not churned).
   - It's essential to keep the validation set separate from the training set to evaluate the performance of your predictive model properly.

3. **Model Selection**:
   - Choose appropriate machine learning algorithms for binary classification. Common choices include logistic regression, random forests, gradient boosting, and support vector machines.
   - Experiment with different models and hyperparameters to find the best performing one for your task.

4. **Model Evaluation**:
   - Evaluate the performance of your model using the two specified metrics: profit @ top-20 and AUC.
   - For profit @ top-20, you'll need to predict churn probabilities for all customers in the validation set, rank them based on these probabilities, and calculate the accumulated profitability for the top 20 predicted churners.
   - For AUC, you can use standard evaluation functions provided by libraries like scikit-learn in Python.

5. **Addressing Class Imbalance**:
   - Since the dataset may suffer from class imbalance (i.e., fewer instances of churned customers compared to non-churned ones), consider techniques like oversampling, undersampling, or using class weights to handle this imbalance.
   - Be mindful that the class imbalance and the specific evaluation metric (profit @ top-20) will present challenges that you'll need to address during modeling and evaluation.

6. **Iteration and Optimization**:
   - Iterate on your model by fine-tuning hyperparameters, feature engineering, or trying different algorithms to improve performance.
   - Optimize your model for both the profit @ top-20 and AUC metrics, balancing between the two as needed.

Following these steps and iterating on the approach should yield a predictive model for telco customer churn. Document the process and results thoroughly for the assignment.
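The two evaluation metrics from step 4 can be sketched as follows. This is a minimal illustration, not the assignment's official scoring code: the function `profit_at_top_k`, the `customer_value` array, and the rule that profit only counts for customers who actually churn are all assumptions made for the example, and the toy numbers are invented.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def profit_at_top_k(y_true, churn_scores, customer_value, k=20):
    """Sum the value of true churners among the k highest-scored customers.

    Assumption: profit is only realised for customers who actually churn;
    the assignment's real profit definition may differ.
    """
    y_true = np.asarray(y_true)
    churn_scores = np.asarray(churn_scores, dtype=float)
    customer_value = np.asarray(customer_value, dtype=float)
    # Indices of the k largest scores (stable sort keeps ties in input order).
    top_k = np.argsort(-churn_scores, kind="stable")[:k]
    return float(customer_value[top_k][y_true[top_k] == 1].sum())


# Toy data (hypothetical, not taken from the assignment dataset).
toy_y = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.8, 0.1, 0.7, 0.6]
toy_value = [10.0, 20.0, 30.0, 40.0, 50.0]

toy_profit = profit_at_top_k(toy_y, toy_scores, toy_value, k=3)
toy_auc = roc_auc_score(toy_y, toy_scores)
```

Both metrics take predicted probabilities rather than hard class labels, so a model would be scored with `predict_proba` instead of `predict`.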

# Step 1
Load the dataset and inspect it to understand its structure, check for missing values, and identify the features and target variable.
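A quick inspection along these lines might look like the sketch below. The helper `summarize` and the toy DataFrame are hypothetical and only illustrate the checks named above (shape, missing values, target distribution).

```python
import pandas as pd


def summarize(df, target_col):
    """Return basic structural facts about a DataFrame."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isnull().sum().to_dict(),
        "rows_with_missing": int(df.isnull().any(axis=1).sum()),
        "target_share": df[target_col].value_counts(normalize=True).to_dict(),
    }


# Toy frame (hypothetical values, same column names as the churn data).
toy = pd.DataFrame({
    "Age": [48.0, 34.0, None, 22.0],
    "Gender": ["F", "F", "M", "F"],
    "target": [0, 0, 1, 1],
})
summary = summarize(toy, "target")
```

The `target_share` entry gives a first impression of how imbalanced the classes are.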

In [5]:

traindata = pd.read_csv('/Users/camillecu/Downloads/KUL/AdvancedAnalytic/AdvancedAnalytics_Assignments/Assignment1/data/train.csv')


In [None]:
# show rows that contain missing data
missing_data_rows = traindata[traindata.isnull().any(axis=1)]
missing_data_rows


Unnamed: 0,Gender,Age,Connect_Date,L_O_S,Dropped_Calls,tariff,Handset,Peak_calls_Sum,Peak_mins_Sum,OffPeak_calls_Sum,...,Tariff_OK,average cost min,Peak ratio,OffPeak ratio,Weekend ratio,Nat-InterNat Ratio,high Dropped calls,No Usage,target,id
1736,F,48.0,26/07/98,26.966667,2.0,Play 100,BS110,0.0,0.0,0.0,...,OK,0.5,0.0,0.0,0.0,0.0,F,T,0,K244380
3237,F,34.0,22/03/97,43.333333,2.0,Play 100,BS110,0.0,0.0,0.0,...,OK,0.5,0.0,0.0,0.0,0.0,F,T,0,K244320
3836,M,21.0,03/01/96,58.133333,2.0,Play 100,CAS30,0.0,0.0,0.0,...,OK,0.5,0.0,0.0,0.0,0.0,F,T,1,K213590
4301,F,22.0,08/08/98,26.533333,5.0,Play 100,CAS30,0.0,0.0,0.0,...,OK,0.5,0.0,0.0,0.0,0.0,F,T,1,K212820


In [6]:
# Drop rows with missing data from traindata; only 4 rows are affected
traindata = traindata.dropna()


# Stratified Splitting:
Ensure that both the training and validation sets have a class distribution similar to that of the original dataset. This is particularly important for imbalanced problems such as customer churn prediction.

In [7]:
from sklearn.model_selection import train_test_split

In [8]:

# Define 'X' and 'y' variables
X = traindata.drop('target', axis=1)
y = traindata['target']

# Perform train-validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
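To confirm that stratification preserved the class ratio, one could compare the positive rate in both splits. The sketch below uses a toy target with 20% positives; the data and variable names are invented for the illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 20% positives (hypothetical data).
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([1] * 20 + [0] * 80, name="target")

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# Share of positives in each split; with stratify they should match closely.
train_rate = y_tr.mean()
val_rate = y_va.mean()
```

On the real data the same check would be `y_train.mean()` against `y_val.mean()`.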


# Encoding categorical and text variables

In [9]:
# List all the columns in the traindata that are of type object (text string) or category
categorical_text_columns = traindata.select_dtypes(include=['object', 'category']).columns.tolist()
print(categorical_text_columns)


['Gender', 'Connect_Date', 'tariff', 'Handset', 'Usage_Band', 'Tariff_OK', 'high Dropped calls', 'No Usage', 'id']


In [10]:
#encode Gender column
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit label encoder and transform the "Gender" column in both training and validation sets
X_train['Gender'] = label_encoder.fit_transform(X_train['Gender'])
X_val['Gender'] = label_encoder.transform(X_val['Gender'])


"Connect_Date" is approximately uniformly distributed over a four-year period, which suggests that customers connected to the service at a fairly even rate throughout that time.
Extract the year, month, and day to capture any seasonal patterns or trends in customer connections.
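The raw values shown earlier (for example `26/07/98`) look day-first, so a parsing sketch under that assumption passes an explicit format instead of relying on inference. The two sample strings are copied from the rows displayed above; the `%d/%m/%y` format is an assumption about the whole column.

```python
import pandas as pd

raw_dates = pd.Series(["26/07/98", "03/01/96"])

# Explicit day-first format avoids reading 03/01/96 as March 1st.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%y")

date_parts = pd.DataFrame({
    "Connect_Year": parsed.dt.year,
    "Connect_Month": parsed.dt.month,
    "Connect_Day": parsed.dt.day,
})
```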

In [45]:
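# Convert 'Connect_Date' to datetime; the raw values look day-first (e.g. 26/07/98),
# so the format is given explicitly to avoid month/day swaps
X_train['Connect_Date'] = pd.to_datetime(X_train['Connect_Date'], format='%d/%m/%y')
X_val['Connect_Date'] = pd.to_datetime(X_val['Connect_Date'], format='%d/%m/%y')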

# Extract year, month, and day
X_train['Connect_Year'] = X_train['Connect_Date'].dt.year
X_train['Connect_Month'] = X_train['Connect_Date'].dt.month
X_train['Connect_Day'] = X_train['Connect_Date'].dt.day

X_val['Connect_Year'] = X_val['Connect_Date'].dt.year
X_val['Connect_Month'] = X_val['Connect_Date'].dt.month
X_val['Connect_Day'] = X_val['Connect_Date'].dt.day



In [46]:

X_train = X_train.drop(['Connect_Date'], axis=1)
X_val = X_val.drop(['Connect_Date'], axis=1)

In [None]:
# Specifies the tariff plan or subscription package of the customer.
X_train['tariff'].unique()

array(['CAT 100', 'Play 300', 'CAT 200', 'Play 100', 'CAT 50'],
      dtype=object)

In [59]:
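import category_encoders as ce
encoder = ce.WOEEncoder(cols=['tariff'])

# Fit on the training set only, then apply the same mapping to both sets
encoder.fit(X_train, y_train)
X_train = encoder.transform(X_train)
X_val = encoder.transform(X_val)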
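For intuition, weight of evidence (WOE) replaces each category with the log ratio of its share of positives to its share of negatives. The hand-rolled sketch below uses that convention on invented data; the helper `woe_table` and its `smoothing` term are hypothetical, and the library's exact regularisation may differ.

```python
import numpy as np
import pandas as pd


def woe_table(categories, target, smoothing=0.0):
    """WOE per category: ln(share of positives / share of negatives).

    `smoothing` is a hypothetical additive term to avoid log(0); it is not
    claimed to match any library's regularisation.
    """
    df = pd.DataFrame({"cat": list(categories), "y": list(target)})
    pos = df.groupby("cat")["y"].sum()
    neg = df.groupby("cat")["y"].count() - pos
    pos_share = (pos + smoothing) / (pos.sum() + 2 * smoothing)
    neg_share = (neg + smoothing) / (neg.sum() + 2 * smoothing)
    return np.log(pos_share / neg_share)


# Toy data: category "A" churns more often than category "B".
toy_tariff = ["A", "A", "A", "A", "B", "B", "B", "B"]
toy_churn = [1, 1, 0, 0, 1, 0, 0, 0]
woe = woe_table(toy_tariff, toy_churn)
```

A positive value marks a category with more than its share of churners, a negative value one with fewer. Because the mapping depends on the target, it should be learned from the training set only.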



In [56]:
#Represents the type or model of handset used by the customer.
X_train['Handset'].unique()


array(['BS210', 'ASAD170', 'S50', 'S80', 'BS110', 'WC95', 'SOP20',
       'CAS30', 'ASAD90', 'SOP10', 'CAS60'], dtype=object)

In [61]:
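import category_encoders as ce
encoder = ce.WOEEncoder(cols=['Handset'])

# Fit on the training set only, then apply the same mapping to both sets
encoder.fit(X_train, y_train)
X_train = encoder.transform(X_train)
X_val = encoder.transform(X_val)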



In [25]:
#Indicates the usage band of the customer.
X_train['Usage_Band'].unique()

array(['Med', 'High', 'MedHigh', 'MedLow', 'Low'], dtype=object)

In [None]:
from category_encoders import BinaryEncoder

# Create a binary encoder for the 'Usage_Band' column
binary_encoder = BinaryEncoder(cols=['Usage_Band'])
# Fit on the training set, then reuse the fitted encoder for the validation set
X_train = binary_encoder.fit_transform(X_train)
X_val = binary_encoder.transform(X_val)
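Since the `Usage_Band` labels appear to be ordered, an ordinal mapping is a possible alternative to binary encoding. This is a different technique from the one used above, and the ordering below is an assumption inferred from the label names only.

```python
import pandas as pd

# Assumed ordering, inferred from the label names only.
usage_order = {"Low": 0, "MedLow": 1, "Med": 2, "MedHigh": 3, "High": 4}

toy_usage = pd.DataFrame(
    {"Usage_Band": ["Med", "High", "MedHigh", "MedLow", "Low"]}
)
# Any label missing from the mapping would become NaN and need handling.
toy_usage["Usage_Band_ord"] = toy_usage["Usage_Band"].map(usage_order)
```

A single ordered column keeps the ranking visible to tree-based models, whereas binary encoding spreads it across several bit columns.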


In [None]:
X_train['Tariff_OK'].unique()

array(['High CAT 100', 'OK', 'High CAT 50', 'High Play 100'], dtype=object)

In [33]:
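# 1 if the tariff is flagged 'OK', 0 otherwise; applied to both sets
X_train['Tariff_OK'] = (X_train['Tariff_OK'] == 'OK').astype(int)
X_val['Tariff_OK'] = (X_val['Tariff_OK'] == 'OK').astype(int)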


In [34]:
X_train['Tariff_OK']

4872    0
4332    1
491     0
1866    1
4097    1
       ..
3015    1
4530    0
778     1
917     1
4396    1
Name: Tariff_OK, Length: 4032, dtype: int64

In [14]:
X_train['high Dropped calls'].unique()

array([0, 1])

In [13]:
# Reuse the LabelEncoder defined above: fit it on the training column only,
# then apply the same mapping to the validation column
X_train['high Dropped calls'] = label_encoder.fit_transform(X_train['high Dropped calls'])
X_val['high Dropped calls'] = label_encoder.transform(X_val['high Dropped calls'])
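The sketch below illustrates why an encoder is normally fitted on the training data and only applied to the validation data. The toy columns are invented; the point is that refitting on a validation column with fewer distinct values yields a different mapping.

```python
from sklearn.preprocessing import LabelEncoder

train_col = ["F", "T", "F", "T"]
val_col = ["T", "T"]  # validation happens to contain a single class

# Fitted on training data: 'F' -> 0, 'T' -> 1.
enc = LabelEncoder()
enc.fit(train_col)
val_consistent = enc.transform(val_col).tolist()

# Refitted on validation data alone: 'T' is the only class, so 'T' -> 0.
val_refit = LabelEncoder().fit_transform(val_col).tolist()
```

Here `val_consistent` keeps the training mapping, while `val_refit` silently assigns the same label a different code.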


In [39]:
X_train['high Dropped calls']

4872    0
4332    0
491     0
1866    0
4097    0
       ..
3015    0
4530    0
778     0
917     0
4396    1
Name: high Dropped calls, Length: 4032, dtype: int64

In [15]:
X_train['No Usage'].unique()

array(['F'], dtype=object)

In [16]:
# 'No Usage' contains only the value 'F' in the training set (every remaining
# customer has used the service), so it carries no information for predicting
# churn and can be dropped from both sets.
X_train = X_train.drop('No Usage', axis=1)
X_val = X_val.drop('No Usage', axis=1)
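The same check can be generalised to find every single-valued column at once. The helper `constant_columns` and the toy frame are hypothetical.

```python
import pandas as pd


def constant_columns(df):
    """Return names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


# Toy frame (hypothetical values).
toy = pd.DataFrame({
    "No Usage": ["F", "F", "F"],
    "Gender": ["F", "M", "F"],
})
to_drop = constant_columns(toy)
toy_reduced = toy.drop(columns=to_drop)
```

The list would be computed on the training set and then dropped from both the training and validation sets.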

In [17]:
#ID is not useful for prediction, so drop it
X_train = X_train.drop('id', axis=1)
X_val = X_val.drop('id', axis=1)

In [50]:
correlation = X_train[['Total_call_cost', 'Total_Cost']].corr()
print(correlation)

                 Total_call_cost  Total_Cost
Total_call_cost         1.000000    0.922025
Total_Cost              0.922025    1.000000


In [51]:
# The correlation between 'Total_call_cost' and 'Total_Cost' is high (about 0.92),
# so one of them is dropped from both sets
X_train = X_train.drop('Total_call_cost', axis=1)
X_val = X_val.drop('Total_call_cost', axis=1)

In [53]:
correlation1 = X_train[['All_calls_mins', 'National mins']].corr()
print(correlation1)

                All_calls_mins  National mins
All_calls_mins        1.000000       0.983414
National mins         0.983414       1.000000


In [54]:
# The correlation between 'All_calls_mins' and 'National mins' is very high (about 0.98),
# so one of them is dropped from both sets
X_train = X_train.drop('National mins', axis=1)
X_val = X_val.drop('National mins', axis=1)
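Rather than checking column pairs one at a time, a small helper could scan the whole correlation matrix. The function `highly_correlated_pairs`, the 0.9 threshold, and the toy data are assumptions made for this sketch.

```python
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """List (col_a, col_b, |r|) for numeric column pairs above the threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    # Visit each unordered pair once (upper triangle of the matrix).
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if r > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Toy frame (hypothetical values).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],  # almost a multiple of "a"
    "c": [4.0, 1.0, 3.0, 2.0],
})
pairs = highly_correlated_pairs(toy, threshold=0.9)
```

Which member of each pair to drop remains a judgment call; the helper only surfaces candidates.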

In [62]:
X_train

Unnamed: 0,Gender,Age,L_O_S,Dropped_Calls,tariff,Handset,Peak_calls_Sum,Peak_mins_Sum,OffPeak_calls_Sum,OffPeak_mins_Sum,...,Tariff_OK,average cost min,Peak ratio,OffPeak ratio,Weekend ratio,Nat-InterNat Ratio,high Dropped calls,Connect_Year,Connect_Month,Connect_Day
4872,0,18.0,21.566667,2.0,-0.065706,-0.904940,260.0,774.600000,41.0,180.300001,...,0,0.164789,0.791296,0.184186,0.024517,0.055794,0,1999,4,1
4332,0,19.0,36.500000,0.0,-0.517076,-2.326433,26.0,141.600000,290.0,627.300000,...,1,0.172304,0.176801,0.783244,0.039955,0.329335,0,1997,10,13
491,0,32.0,30.700000,2.0,-0.065706,-0.904940,286.0,839.400001,39.0,68.099999,...,0,0.175558,0.895933,0.072687,0.031380,0.106794,0,1998,5,4
1866,1,42.0,30.366667,2.0,-0.003297,-0.076801,311.0,1969.800000,18.0,122.700001,...,1,0.151454,0.897199,0.055887,0.046914,0.270087,0,1998,4,15
4097,1,27.0,13.833333,2.0,-0.517076,-0.076801,49.0,178.200000,213.0,567.299999,...,1,0.134312,0.209820,0.667962,0.122218,0.160121,0,1999,8,24
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3015,1,22.0,27.200000,1.0,-0.065706,-0.106055,221.0,424.800001,6.0,264.899999,...,1,0.177464,0.559463,0.348874,0.091663,0.138337,0,1998,7,19
4530,0,16.0,25.200000,1.0,0.392410,-0.904940,152.0,356.400000,115.0,286.800001,...,0,0.110832,0.487285,0.392125,0.120591,0.008258,0,1998,9,17
778,1,64.0,36.333333,1.0,0.392410,-0.904940,23.0,140.400000,15.0,167.400000,...,1,0.230781,0.415631,0.495560,0.088810,0.315206,0,1997,10,18
917,1,17.0,20.200000,0.0,-0.003297,-2.326433,575.0,898.800000,4.0,276.600000,...,1,0.153566,0.753774,0.231969,0.014257,0.189639,0,1999,2,14


In [None]:
X_val

Unnamed: 0,Gender,Age,Connect_Date,L_O_S,Dropped_Calls,tariff,Handset,Peak_calls_Sum,Peak_mins_Sum,OffPeak_calls_Sum,...,Total_Cost,Tariff_OK,average cost min,Peak ratio,OffPeak ratio,Weekend ratio,Nat-InterNat Ratio,high Dropped calls,No Usage,id
3160,1,51.0,251,47.766667,2.0,CAT 200,BS110,392.0,1774.800000,54.0,...,275.258159,OK,0.120049,0.818257,0.146611,0.035131,0.057113,F,F,K371910
1123,0,31.0,691,35.066667,13.0,CAT 200,SOP10,837.0,1256.400000,9.0,...,304.156451,OK,0.158938,0.844071,0.118307,0.037622,0.285647,T,F,K404320
3328,1,20.0,1386,11.900000,8.0,CAT 100,S50,31.0,615.600000,134.0,...,145.046952,OK,0.155259,0.663434,0.226641,0.109926,0.006818,F,F,K350260
2807,1,19.0,464,34.766667,9.0,Play 300,BS210,244.0,417.600000,5.0,...,144.517727,OK,0.139895,0.481883,0.481191,0.036926,0.192066,F,F,K274340
2195,1,32.0,649,36.466667,12.0,Play 100,SOP20,10.0,64.199999,64.0,...,66.067733,OK,0.110830,0.110537,0.774793,0.114669,0.026376,T,F,K413900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60,1,23.0,1198,18.166667,2.0,CAT 200,S50,737.0,1316.399999,174.0,...,212.980704,OK,0.110907,0.689070,0.310930,0.000000,0.005208,F,F,K307860
3405,1,35.0,1152,21.600000,0.0,Play 100,S50,42.0,81.600000,28.0,...,84.142654,OK,0.177551,0.197388,0.780842,0.021771,0.146363,F,F,K220360
4689,0,38.0,820,31.766667,1.0,CAT 200,ASAD170,410.0,885.600000,71.0,...,214.593309,OK,0.169196,0.841026,0.127635,0.031339,0.204474,F,F,K116210
2944,0,20.0,589,38.466667,0.0,CAT 100,WC95,93.0,537.000000,70.0,...,133.419976,OK,0.166922,0.694067,0.184955,0.120977,0.033084,F,F,K352160


# Model Selection:

## Logistic regression
Sensitive to correlated features and to the curse of dimensionality.
## Random forests
## Gradient boosting

In [63]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_

# Sort features by importance, highest first
sorted_indices = np.argsort(importances)[::-1]

# Select the top N features
N = 10
selected_features = X_train.columns[sorted_indices[:N]]

# Train a new model with only the selected features
rf_new = RandomForestClassifier()
rf_new.fit(X_train[selected_features], y_train)

In [None]:

# Logistic regression
from sklearn.linear_model import LogisticRegression

# Initialize the logistic regression model; a higher max_iter gives the solver
# more room to converge on unscaled features
logistic_model = LogisticRegression(max_iter=1000)

# Train the model on the training data
logistic_model.fit(X_train, y_train)
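A next step would be to score the validation set and compute AUC. The sketch below runs on synthetic data rather than the churn dataset, and the scaling pipeline, `class_weight="balanced"`, and all variable names are illustrative choices, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 15% positives).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    n_clusters_per_class=1, weights=[0.85, 0.15], random_state=42,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

# Scaling helps the solver converge; class_weight counters the imbalance.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
model.fit(X_tr, y_tr)

# AUC needs scores, so use the probability of the positive class.
val_scores = model.predict_proba(X_va)[:, 1]
val_auc = roc_auc_score(y_va, val_scores)
```

The same `val_scores` vector could also feed a top-20 ranking, since both metrics depend on the ordering of predicted probabilities.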