# Machine-Learning Pipeline Implementation

## Introduction

For telecommunication companies, it is key to attract new customers and at the same time avoid contract terminations (i.e., churn) to grow their revenue-generating base. Customers terminate their contracts for different reasons, such as better price offers, more interesting packages, bad service experiences, or changes in their personal situations. Churn analytics makes it possible to predict customer churn and to identify the underlying reasons that drive it. With this in mind, telecommunication companies might decide to apply machine-learning models to predict churn on an individual customer basis and take countermeasures, such as discounts, special offers, or other incentives, to keep their customers. Therefore, in this homework, we will imagine that a company asks you to predict, by means of a machine-learning approach, whether its customers are about to churn.

Specifically, we will ask you to:

1. Load and explore the data appropriately, based on the task and application (what is important?).
2. Implement and evaluate a machine-learning approach appropriately, based on the task and data set (e.g., size).



## The Data Set

This sample data tracks a fictional telco company's customer churn based on a variety of possible factors. Specifically, the `data` folder includes CSV files pertaining to customers' demographic attributes, contract information, and monthly service-related data. Each file is characterized by the following attributes.

**customer.csv**

| Name                   | Description                         |
| ---------------------- | ------------------------------------------------------------ |
| CustomerID | A unique ID that identifies each customer.  | 
| Gender |  The customer’s gender: Male, Female  | 
| SeniorCitizen | Indicates if the customer is 65 or older: 0, 1.  | 
| Partner |  Indicates if the customer is married: Yes, No | 
| Dependents | Indicates if the customer lives with any dependents: Yes, No.  |  
| PaperlessBilling | Indicates if the customer has chosen paperless billing: Yes, No.  | 
| PaymentMethod |  Indicates how the customer pays their bill: Mailed check, Electronic check, Credit card, Bank transfer.  | 

**contract.csv**

| Name                   | Description                         |
| ---------------------- | ------------------------------------------------------------ |
| ContractID |  A unique ID that identifies each contract.  | 
| CustomerID |  A unique ID that identifies each customer.  | 
| Contract |  Indicates the customer’s current contract type: Month-to-Month, One Year, Two Year.  | 
| StartDate |  Start date of the contract. | 

**churn.csv**

| Name                   | Description                         |
| ---------------------- | ------------------------------------------------------------ |
| CustomerID | A unique ID that identifies each customer.   | 
| Churn | 1 = the customer left the company. 0 = the customer remained with the company.  | 

**phone_usage.csv**

| Name                   | Description                         |
| ---------------------- | ------------------------------------------------------------ |
| ContractID | A unique ID that identifies each contract.  | 
| Date | The reference period for the monthly usage indicated by this record for this ContractID.  | 
| MonthlyUsage | Indicates the customer’s monthly usage for the phone.  | 

**services.csv**

| Name                   | Description                         |
| ---------------------- | ------------------------------------------------------------ |
| ContractID | A unique ID that identifies each contract.  | 
| ServiceValue | The value of the service of type `Service` for this contract. No = the customer does not have that service.   | 
| Service | A string label identifying a type of service offered by the company: PhoneService, InternetService, MultipleLines, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies.   | 

**charges.csv**

| Name                   | Description                         |
| ---------------------- | ------------------------------------------------------------ |
| ContractID |  A unique ID that identifies each contract. | 
| Date | The billing date for the monthly usage indicated by this record for this ContractID.   | 
| Charge | Indicates the contract’s monthly charge.  | 

In [None]:
#### PACKAGE IMPORTS ####

# Run this cell first to import all required packages. Do not make any imports elsewhere in the notebook. 

import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
import pandas as pd
import numpy as np
import math

%matplotlib inline

### your packages here

<a id="section1"></a>
## 1  Load and explore the data appropriately. 
----

In this section, you should:
1. Combine and look at the features in customer.csv and contract.csv, and connect them to their potential influence on customer churn. 
2. Extend the feature set with behavioral features you will create from phone_usage.csv, services.csv, and charges.csv.  

In [None]:
# Files with one record per customer / contract
customer = pd.read_csv('./data/customer.csv')
contract = pd.read_csv('./data/contract.csv')
churn = pd.read_csv('./data/churn.csv')

# Files with one record per customer / contract over months
phone_usage = pd.read_csv('./data/phone_usage.csv')
services = pd.read_csv('./data/services.csv')
charges = pd.read_csv('./data/charges.csv')

<a id="section1.1"></a>
### Task 1.1 

From the descriptive tables above, you have seen that customer information is spread across different files. Three of these files, `customer.csv`, `contract.csv`, and `churn.csv`, include one record per customer / contract. Therefore, we first ask you to:
- Combine the first two dataframes (customer and contract) into a single dataframe named `X`.
- Copy the third dataframe into a dataframe named `y`.
- Conduct an exploratory analysis on `X`, supported by appropriate visualizations, to relate the values of these features to their potential influence on the customer churn `y` (e.g., how having dependents relates to the churn target).
- Think of and write down your hypotheses on the extent to which each feature in `X` relates to the customer leaving the company; the churn target is indicated in `y` (e.g., customers with dependents are less likely to leave the company).
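As a starting point for the combination step, the following minimal sketch joins two toy dataframes on `CustomerID` and computes a churn rate per category. The rows are made up for illustration, and the `validate="one_to_one"` check assumes each customer has exactly one contract, which you should verify on the real files. It is not a complete solution to the task.

```python
import pandas as pd

# Tiny made-up frames that mimic the documented columns (NOT the real data).
customer = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Dependents": ["Yes", "No", "No", "Yes"],
})
contract = pd.DataFrame({
    "ContractID": [10, 11, 12, 13],
    "CustomerID": [1, 2, 3, 4],
    "Contract": ["Two Year", "Month-to-Month", "Month-to-Month", "One Year"],
})
churn = pd.DataFrame({"CustomerID": [1, 2, 3, 4], "Churn": [0, 1, 1, 0]})

# Join on the shared key; validate raises if the key is not unique on both
# sides (assumption: one contract per customer).
X = customer.merge(contract, on="CustomerID", how="inner", validate="one_to_one")
y = churn.copy()

# Churn rate per category of one feature, as a first exploratory summary.
joined = X.merge(y, on="CustomerID")
churn_by_dependents = joined.groupby("Dependents")["Churn"].mean()
print(churn_by_dependents)
```

The same `groupby(...)["Churn"].mean()` pattern can be repeated for each categorical feature and passed to a bar plot to support your hypotheses.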

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

<a id="section1.2"></a>
### Task 1.2

While demographic information is important, it does not reveal how customers actually used the services. Therefore, we now ask you to extend the current set of features for each customer with novel behavioral features derived from the services the customers subscribed to, the amount they are charged monthly, and their actual phone usage. To this end, you will need to work with the files `phone_usage.csv`, `services.csv`, and `charges.csv`, and their corresponding dataframes. Specifically, we ask you to:
- Create, in total, at least **5 novel behavioral features**. They can come from the phone usage, the services, and/or the charges.
- Add these novel features to the `X` you prepared in the previous task.
- Explore how these novel features in `X` relate to the churn in `y` (e.g., how the number of services the customer subscribed to relates to the churn).
- Think of and write down your hypotheses on the extent to which each **novel** feature in `X` relates to the customer leaving the company; the churn target is indicated in `y` (e.g., customers with fewer services are more likely to leave the company).
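One common way to derive such features is to aggregate the monthly records per `ContractID` and join the result back onto `X`. The sketch below illustrates this on made-up rows. The feature names `MeanCharge`, `TotalCharge`, and `NumServices` are hypothetical, and counting every `ServiceValue` other than `No` as an active service is an assumption to check against the real data.

```python
import pandas as pd

# Made-up monthly records that mimic the documented columns (NOT the real data).
charges = pd.DataFrame({
    "ContractID": [10, 10, 11, 11],
    "Date": ["2020-01-01", "2020-02-01", "2020-01-01", "2020-02-01"],
    "Charge": [20.0, 30.0, 50.0, 50.0],
})
services = pd.DataFrame({
    "ContractID": [10, 10, 11, 11],
    "Service": ["PhoneService", "TechSupport", "PhoneService", "TechSupport"],
    "ServiceValue": ["Yes", "No", "Yes", "Yes"],
})
X = pd.DataFrame({"CustomerID": [1, 2], "ContractID": [10, 11]})

# Collapse many rows per contract into one row of summary statistics.
charge_features = (
    charges.groupby("ContractID")
    .agg(MeanCharge=("Charge", "mean"), TotalCharge=("Charge", "sum"))
    .reset_index()
)

# Count services per contract, treating anything other than "No" as active.
active = services[services["ServiceValue"] != "No"]
service_counts = active.groupby("ContractID").size().reset_index(name="NumServices")

# Left joins keep every contract in X, even if it has no matching records.
X = X.merge(charge_features, on="ContractID", how="left")
X = X.merge(service_counts, on="ContractID", how="left")
X["NumServices"] = X["NumServices"].fillna(0).astype(int)
print(X)
```

Using `how="left"` and filling missing counts with zero avoids silently dropping contracts that have no active services.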

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

<a id="section2"></a>
## 2  Implement and evaluate a machine-learning approach appropriately. 
----

In this section, you should:
1. Choose and initialize a machine-learning model based on the task and the data set (and pre-process the features in X, **if necessary/appropriate**). 
2. Choose and run an evaluation method with appropriate performance metrics, based on the task and the data set. 
3. Report, interpret, and discuss the results achieved by your model appropriately, comparing against random and majority class predictions. 

<a id="section2.1"></a>
### Task 2.1

A range of machine-learning algorithms have been explored in the lectures and the lab sessions, and choosing the one you think is most likely to perform best depends on your problem type and data. As you have seen in this course, certain algorithms are better suited to regression tasks, while others are better suited to classification tasks. The data also plays a key role in choosing the right algorithm for the right problem. Some algorithms can work with smaller sample sets, while others require lots of samples. Similarly, certain algorithms work better with categorical data, while others work better with numerical input. Once you have selected the algorithm you will use for the next steps:
- you **could** pre-process the features in `X` here (**if necessary/appropriate**), so that they are ready to be fed into the selected algorithm; 
- and then initialize the corresponding scikit-learn or statsmodels model so that it is ready for the next steps.

*REMARK: please note that pre-processing your data is NOT mandatory: you should pre-process the features in X only if necessary or appropriate, based on the model you chose and how it is able to deal with the data at hand. What matters is that you justify your decisions. In case you need to encode categorical variables, as mentioned in lab session 3, scikit-learn has classes for this purpose, namely [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) and [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). Another alternative is the [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function of Pandas. We link here again the reading mentioned in lab session 3 on these three encoding strategies: [this supporting tutorial](https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd).*
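To make the encoding options concrete, here is a minimal sketch that one-hot encodes a single categorical column in two ways, with `pd.get_dummies` and with `OneHotEncoder`. The toy frame is made up, and whether one-hot encoding is appropriate at all depends on the model you chose.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows using two of the documented columns (NOT the real data).
X = pd.DataFrame({
    "Contract": ["Month-to-Month", "One Year", "Two Year", "One Year"],
    "SeniorCitizen": [0, 1, 0, 0],
})

# Option 1: pandas replaces the column with one indicator column per category.
dummies = pd.get_dummies(X, columns=["Contract"])
print(dummies.columns.tolist())

# Option 2: scikit-learn learns the categories in fit and reuses them in
# transform; handle_unknown="ignore" encodes unseen categories as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X[["Contract"]]).toarray()
print(encoder.categories_)
print(encoded)
```

A fitted `OneHotEncoder` can be applied consistently to training and test data, whereas `pd.get_dummies` derives its columns from whatever frame it is given.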

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

<a id="section2.2"></a>
### Task 2.2

At this point, you should have successfully initialized your machine-learning model. What should you do now? You need to evaluate how well the model predicts the target, which is an essential part of the entire pipeline. Which evaluation methods and performance metrics to use depends primarily on the nature of your problem and the characteristics of the data. Coming back to your homework, ask yourself what the main problem you are trying to solve is, select the right performance metrics and evaluation method, and run the evaluation of your model accordingly. In this task, we ask you to run an appropriate evaluation method and compute one or more appropriate performance metrics, among those described in the lectures and in the lab sessions.
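As one possible illustration, the sketch below runs stratified 5-fold cross-validation with several metrics on synthetic, imbalanced data. The logistic regression model, the choice of recall, F1, and ROC AUC, and the 75/25 class balance are all assumptions made for the example, not requirements of the task; you should justify your own choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the prepared features and target (NOT the real data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=0
)

model = LogisticRegression(max_iter=1000)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["recall", "f1", "roc_auc"]
scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=metrics)

# Report the mean and spread across folds for each metric.
for name in metrics:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the spread across folds, and not only the mean, gives a sense of how stable the estimate is for the given data set size.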

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

<a id="section2.3"></a>
### Task 2.3

Selecting the right performance metrics and evaluation method and running the evaluation of your model accordingly is not enough. To assess the quality of the model, you need to report and communicate its results appropriately (e.g., deciding which performance metrics you present and how you present them). The way you perform this task should be coherent with the problem type you are dealing with, the characteristics of the data set, and what is important in the target context. In this task, we ask you to 1) report the obtained results visually, and 2) interpret and discuss them with respect to the performance achievable by a random model and by a model that always predicts the majority class.
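The two reference points can be obtained with scikit-learn's `DummyClassifier`. The sketch below compares a majority-class baseline, a uniformly random baseline, and an example model on synthetic data. The data, the logistic regression model, and the two metrics shown are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target (NOT the real data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

candidates = {
    "majority": DummyClassifier(strategy="most_frequent"),
    "random": DummyClassifier(strategy="uniform", random_state=0),
    "model": LogisticRegression(max_iter=1000),
}

# Fit each candidate on the same split and score it on the same test set.
rows = []
for name, clf in candidates.items():
    pred = clf.fit(X_train, y_train).predict(X_test)
    rows.append({
        "name": name,
        "accuracy": accuracy_score(y_test, pred),
        "recall": recall_score(y_test, pred),
    })
results = pd.DataFrame(rows).set_index("name")
print(results)
```

Note that the majority baseline can reach a high accuracy on imbalanced data while its recall on the churn class is zero, which is why accuracy alone can be misleading here. A table like `results` can be turned into a chart with `results.plot.bar()`.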

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE