In [49]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [50]:
df = pd.read_csv('Churn.csv')

In [51]:
df.dtypes

Customer ID           object
Gender                object
Senior Citizen         int64
Partner               object
Dependents            object
tenure                 int64
Phone Service         object
Multiple Lines        object
Internet Service      object
Online Security       object
Online Backup         object
Device Protection     object
Tech Support          object
Streaming TV          object
Streaming Movies      object
Contract              object
Paperless Billing     object
Payment Method        object
Monthly Charges      float64
Total Charges         object
Churn                 object
dtype: object

In [52]:
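# Note: 'Total Charges' loads as object (see dtypes above), so get_dummies one-hot encodes
# each distinct value of it; pd.to_numeric(..., errors='coerce') would keep it numeric.
X = pd.get_dummies(df.drop(['Churn', 'Customer ID'], axis=1))
y = df['Churn'].apply(lambda x: 1 if x=='Yes' else 0)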

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [54]:
y_train.head()

2501    0
2212    0
1594    0
3388    0
4780    0
Name: Churn, dtype: int64

# 1. Import Dependencies

In [55]:
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense
from sklearn.metrics import accuracy_score

# 2. Build and Compile Model

## ReLU (hidden layer)

Purpose:
- ReLU is the most commonly used activation function in hidden layers of neural networks.
- It outputs the input directly if it is positive; otherwise, it outputs zero. This introduces non-linearity, which helps the model learn complex patterns.
- It is computationally cheap (a single comparison with zero), and because negative inputs are set to zero, only a subset of neurons is active for any given input. These sparse activations help speed up training.
        
Why ReLU?:
- Simplicity: ReLU is simple and easy to implement.
- Performance: It helps mitigate the vanishing gradient problem, which can occur with activation functions like sigmoid or tanh. ReLU allows gradients to propagate more effectively during backpropagation, which is crucial for training deep networks.

## Sigmoid (output layer)

Purpose:
- The sigmoid activation function squashes input values into a range between 0 and 1, making it suitable for binary classification problems (where you want an output representing a probability).
- It transforms the output of the network into a probability value, which is ideal when your output layer is expected to represent the probability of a particular class (0 or 1).
- This is why it's often used in the final layer of binary classification models.

Why Sigmoid in the Output Layer?:
- Binary Output: The sigmoid function is particularly useful in the output layer when you need a binary output (e.g., 0 or 1) for classification tasks.
- Interpretability: The output can be interpreted as the probability of the input belonging to the positive class (1).
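
As a rough illustration of the two functions described above, here is a minimal pure-Python sketch. The function names `relu` and `sigmoid` are chosen for this example only and are not the Keras implementations.

```python
import math

def relu(x: float) -> float:
    """Return x for positive inputs and 0.0 otherwise."""
    return max(0.0, x)

def sigmoid(x: float) -> float:
    """Squash any real input into a value between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Compare both functions on a negative, zero, and positive input.
for x in (-2.0, 0.0, 3.0):
    print(f"x={x:5.1f}  relu={relu(x):.1f}  sigmoid={sigmoid(x):.4f}")
```

ReLU passes positive values through unchanged, while sigmoid maps every input to a value that can be read as a probability.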

## Tanh (Hyperbolic Tangent)
- Range: (−1, 1)
- Purpose: Similar to Sigmoid but outputs values between −1 and 1. Because its outputs are zero-centered, it often leads to faster convergence in training compared to Sigmoid.
- Use Case: Often used in hidden layers where zero-centered, bounded outputs are wanted.

## Leaky ReLU
- Range: (−∞, ∞)
- Purpose: A variation of ReLU that allows a small, non-zero gradient (controlled by 𝛼) when the input is negative, addressing the "dying ReLU" problem where neurons could become inactive permanently.
- Use Case: Often used in hidden layers when ReLU might result in dead neurons.

## ELU (Exponential Linear Unit)
- Range: (−𝛼, ∞)
- Purpose: Similar to Leaky ReLU but smoother. ELU can provide better performance because its negative outputs push the mean of the activations closer to zero, which speeds up learning.
- Use Case: Applied in hidden layers, especially when a smooth gradient is preferred.
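
The difference between the two negative-side behaviors can be sketched in a few lines of plain Python. The default values `alpha=0.01` for Leaky ReLU and `alpha=1.0` for ELU are assumptions picked for illustration.

```python
import math

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Keep a small slope alpha for negative inputs instead of zeroing them."""
    return x if x > 0 else alpha * x

def elu(x: float, alpha: float = 1.0) -> float:
    """Smoothly saturate toward -alpha for large negative inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# Leaky ReLU keeps shrinking linearly; ELU levels off near -alpha.
for x in (-10.0, -1.0, 0.0, 2.0):
    print(f"x={x:6.1f}  leaky_relu={leaky_relu(x):8.4f}  elu={elu(x):8.4f}")
```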

## Softmax
- Range: (0, 1), with all outputs summing to 1
- Purpose: Converts logits (raw prediction scores) into probabilities. Unlike Sigmoid, which is used for binary classification, Softmax is used for multi-class classification where multiple classes exist, and one class must be chosen.
- Use Case: Typically used in the output layer of neural networks when dealing with multi-class classification problems.
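
A minimal sketch of Softmax in plain Python, assuming the logits arrive as a list of floats. Subtracting the maximum logit before exponentiating is a common stability trick and does not change the result.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # guards against overflow in math.exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three hypothetical class scores; the largest logit gets the largest probability.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 4) for p in probs], "sum =", round(sum(probs), 6))
```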

## Swish
- Range: [≈−0.278, ∞)
- Purpose: Combines properties of ReLU and Sigmoid. Swish tends to perform better than ReLU in some deep neural networks, particularly for very deep models.
- Use Case: Can be used in hidden layers, especially in deep networks.
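
The lower bound of roughly −0.278 can be checked numerically. This sketch assumes the common form swish(x) = x · sigmoid(x) and scans a grid of negative inputs for the minimum; the grid spacing of 0.001 is an arbitrary choice.

```python
import math

def swish(x: float) -> float:
    """Swish in the form x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

# Scan [-5, 0] in steps of 0.001 and keep the input with the smallest output.
grid = [i / 1000.0 for i in range(-5000, 1)]
x_min = min(grid, key=swish)
print(f"minimum of about {swish(x_min):.4f} near x = {x_min:.3f}")
```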

## GELU (Gaussian Error Linear Unit)
- Purpose: Provides a smoother version of ReLU. It introduces non-linearity in a probabilistic way and is used in some state-of-the-art models like BERT.
- Use Case: Commonly used in deep learning architectures, particularly in natural language processing (NLP).

## Softplus
- Range: (0, ∞)
- Purpose: A smooth approximation of ReLU. Unlike ReLU, which has a kink at zero, it is differentiable everywhere, which can be advantageous during optimization.


## Maxout
- Purpose: Generalizes ReLU and Leaky ReLU. It allows the model to learn the best activation function for the task by selecting the maximum of multiple linear functions.
- Use Case: Used in hidden layers, particularly in models that benefit from learning different activation functions.

### Summary
- ReLU, Leaky ReLU, ELU are commonly used in *hidden layers* for their simplicity and efficiency.
- Sigmoid, Softmax are often used in *output layers* for binary and multi-class classification tasks, respectively.
- Tanh, Swish, GELU are alternative activation functions that can **outperform** ReLU in certain scenarios.
- Maxout, Softplus offer flexibility and smoothness in activations, suitable for specific use cases.

Selecting the right activation function depends on the nature of the problem, the architecture of the network, and empirical testing.

In [56]:
# relu (Rectified Linear Unit) = most commonly used for hidden layers. easy and simple to implement

model = Sequential()
model.add(Dense(units=32, activation='relu', input_dim=len(X_train.columns)))
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))

# Summary:
# ReLU in hidden layers helps the model learn non-linear relationships in the data, making the network capable of solving more complex problems.
# Sigmoid in the output layer is used for binary classification, giving a probabilistic interpretation of the output.

### Learning Rate

1. Learning Rate Too High
   - Overshooting: The optimizer takes steps that are too large and overshoots the minimum of the loss function. The model may bounce around the minimum or diverge altogether, failing to converge.
   - Instability: Loss values fluctuate heavily from step to step, making the training process unstable.
   - Poor Convergence: The model might not find good weights, leading to poor accuracy or performance.
2. Learning Rate Too Low
   - Slow Convergence: The optimizer takes tiny steps towards the minimum, so the model might take a long time to reach a good solution.
   - Getting Stuck: A very low learning rate may leave the optimizer stuck in local minima or on plateaus, because its steps are too small to escape them.
   - Underfitting: If training stops before the weights have moved far enough, the model has not yet learned the underlying patterns and performs poorly on both the training and validation sets.
3. Learning Rate Just Right
   - Efficient Convergence: A well-chosen learning rate lets the model converge efficiently to a good minimum of the loss function. It balances the step size so the optimizer moves steadily toward the minimum without overshooting or taking too long.
   - Good Generalization: With the right learning rate, the model is more likely to generalize well to unseen data, balancing fit on the training data against performance on the validation/test set.
4. Learning Rate Scheduling
   - Learning Rate Decay: Reducing the learning rate during training can help the model converge more smoothly. Start with a higher learning rate to make quick progress, then reduce it to fine-tune the weights as you approach the minimum.
   - Learning Rate Schedulers: Exponential decay and step decay reduce the learning rate on a fixed schedule. Adaptive optimizers such as Adam and RMSprop instead scale the step size per parameter during training.
5. Practical Considerations
   - Grid Search/Cross-Validation: Finding a good learning rate usually involves experimentation. Grid search or cross-validation over a handful of candidate values can help identify one.
   - Learning Rate Finder: A learning rate finder plots the loss against a range of learning rates, helping you select a sensible starting point.
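
The three regimes above can be reproduced on a toy problem. This sketch minimizes f(w) = w² with plain gradient descent; the function name `gradient_descent` and the three learning rates are assumptions chosen to make each regime visible, not values tuned for the churn model.

```python
def gradient_descent(lr: float, steps: int = 50, w0: float = 1.0) -> float:
    """Minimize f(w) = w**2 from w0 and return the final weight."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w   # derivative of w**2
        w -= lr * grad   # standard gradient descent update
    return w

# Too high diverges, too low barely moves, a moderate value converges.
for lr in (1.1, 0.001, 0.1):
    print(f"lr={lr:<6} |w| after 50 steps = {abs(gradient_descent(lr)):.3g}")
```

On this function each update multiplies w by (1 − 2·lr), so any learning rate above 1.0 makes the weight grow instead of shrink.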

In [57]:
# model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics='accuracy')

In [58]:
from tensorflow.keras.optimizers import SGD

# Define the optimizer with momentum
optimizer = SGD(learning_rate=0.01, momentum=0.9)

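# Compile the model with the new optimizer.
# binary_crossentropy matches the single sigmoid output unit; 'categorical_crossentropy'
# expects one-hot targets over several outputs (the 0.268 accuracy recorded further down came from it).
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])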


- For classification, use `binary_crossentropy` with a single sigmoid output (as in this model), or `categorical_crossentropy` with one-hot labels and a softmax output.
- For regression, use MSE, MAE, or Huber loss.
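
To see why the loss has to match the output layer, here is a pure-Python sketch of both losses for a single example. The helper names are illustrative and are not Keras APIs. The renormalization step in `categorical_crossentropy` is a simplified model of how that loss treats probability outputs, stated here as an assumption.

```python
import math

def binary_crossentropy(y_true: int, p: float, eps: float = 1e-7) -> float:
    """Loss for one label in {0, 1} and one predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def categorical_crossentropy(y_true: list[int], probs: list[float], eps: float = 1e-7) -> float:
    """Loss for one one-hot label vector and one vector of class probabilities."""
    total = sum(probs)  # renormalize so the outputs sum to 1
    loss = 0.0
    for y, p in zip(y_true, probs):
        q = min(max(p / total, eps), 1.0 - eps)
        loss -= y * math.log(q)
    return loss

# One positive example, scored with a bad (0.1) and a good (0.9) prediction.
for p in (0.1, 0.9):
    print(f"p={p}: binary={binary_crossentropy(1, p):.4f}  "
          f"categorical on 1 output={categorical_crossentropy([1], [p]):.7f}")
```

With only one output, renormalizing always yields a probability of 1, so the categorical loss stays near zero whatever the network predicts and provides almost no training signal. The binary loss separates the bad prediction from the good one.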

### List of optimizers that can be used
1. SGD (stochastic gradient descent)
2. SGD with momentum

**Adaptive Learning Rate Optimizers**
1. Adagrad
2. Adadelta
3. Adam
4. RMSprop
5. AdaMax
6. Nadam


##### Summary
- The choice of optimizer depends on the nature of the problem, the architecture of the neural network, and specific training requirements.
- SGD, Adam, and RMSprop are some of the most commonly used optimizers, but experimenting with different optimizers and their parameters can often lead to improved performance.
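
As a rough sketch of what momentum adds to plain SGD, the following minimizes f(w) = w² with and without a velocity term. It uses the common update velocity = momentum · velocity − lr · grad; the function name `minimize` is hypothetical, and the values `lr=0.01` and `momentum=0.9` mirror the optimizer cell above.

```python
def minimize(lr: float, momentum: float = 0.0, steps: int = 100, w0: float = 1.0) -> float:
    """Minimize f(w) = w**2 with SGD; momentum=0.0 reduces to plain SGD."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        grad = 2.0 * w                              # derivative of w**2
        velocity = momentum * velocity - lr * grad  # accumulate past gradients
        w += velocity
    return w

plain = minimize(lr=0.01)
with_momentum = minimize(lr=0.01, momentum=0.9)
print(f"plain SGD:     |w| = {abs(plain):.4f}")
print(f"with momentum: |w| = {abs(with_momentum):.4f}")
```

With the same learning rate and number of steps, the momentum run ends much closer to the minimum on this toy function, because consecutive gradients pointing the same way build up speed.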

In [59]:
X_train = X_train.astype('float32')
y_train = y_train.astype('float32')  # or 'int64' if y_train contains integer labels

X_test = X_test.astype('float32')
y_test = y_test.astype('float32')  # or 'int64' if y_test contains integer labels


# 3. Fit, Predict and Evaluate

In [60]:
model.fit(X_train, y_train, epochs=100, batch_size=32)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x21102b2f370>

In [61]:
y_hat = model.predict(X_test)
y_hat = [0 if val < 0.5 else 1 for val in y_hat]

In [62]:
accuracy_score(y_test, y_hat)

0.26827537260468415

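#### SGD with binary cross-entropy
The 0.268 accuracy above was recorded with `categorical_crossentropy`, which does not suit a single sigmoid output. Accuracy with `binary_crossentropy` instead:
- 200 epochs: 0.8055
- 100 epochs: 0.7821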


# 4. Saving and Reloading

In [63]:
model.save('tfmodel')

INFO:tensorflow:Assets written to: tfmodel\assets


In [64]:
del model 

In [65]:
model = load_model('tfmodel')