

# Introduction

The world of telecommunications is changing: mobile phones and accessible mobile data have changed what companies can offer and what customers value. As the world becomes more interconnected, customers demand more from their internet service and make less use of traditional phones. Because of this shift, new companies are emerging that offer low prices on data-only plans, which threatens the traditional telecommunication companies (telcos). As competition grows, retaining current customers becomes increasingly important. Understanding which parts of a telco service different customers use, and which offerings matter to them, can help telcos personalise a customer's experience.

This report examines a telco dataset of 7043 customers with 19 attributes covering customer information and service subscriptions, plus the target variable 'churn', which records whether a customer left within the month. Using machine learning algorithms, we classify which customers will churn and which will be retained. This shows which types of customer are satisfied with their service and which are not, so that we can learn from them to increase retention.




# Exploration

The telco dataset used in this report comes from Kaggle. It was uploaded by BlastChar and includes 21 attributes and 7043 rows.

The original dataset includes customerID as its first attribute. It is ignored in this report because it identifies a customer rather than describing one, so it adds no value to the results. Excluding customerID and the target leaves the 19 predictive attributes.

Below is a brief summary of each attribute and how they are distributed:

* *'Gender'* - String (object) - slightly more males than females
* *'SeniorCitizen'* - Binary - significantly fewer senior citizens in the dataset
* *'Partner'* - String (object) - 48.3% have a partner, 51.7% do not have a partner
* *'Dependents'* - String (object) - only 29.96% have dependents
* *'Tenure'* - Integer (int64) - fairly normally distributed, slightly right skewed

![Graph of Tenure with right skew](https://i.imgur.com/QD57TDE.jpg)
* *'PhoneService'* - String (object) - only 10% do not have a phone service
* *'MultipleLines'* - String (object) - 48% do not have multiple lines
* *'InternetService'* - String (object) - 43.9% fibre optic, 34.4% DSL and 21.7% have none
* *'OnlineSecurity'* - String (object) - 28.7% do, 49.67% do not, the remainder have no internet service
* *'OnlineBackup'* - String (object) - 34.5% do, 43.8% do not
* *'DeviceProtection'* - String (object) - 34.4% do, 43.9% do not
* *'TechSupport'* - String (object) - 29% do, 49.3% do not
* *'StreamingTV'* - String (object) - 38.4% do, 39.9% do not
* *'StreamingMovies'* - String (object) - 38.8% do, 39.5% do not
* *'Contract'* - String (object) - 55% month-to-month, 20.9% one year, 24% two year
* *'PaperlessBilling'* - String (object) - 59.2% yes
* *'PaymentMethod'* - String (object) - 21.9% automatic bank transfer, 21.6% automatic credit card, 33.6% electronic check, 22.9% mailed check
* *'MonthlyCharges'* - Float (float64) - fairly normally distributed with a slight left skew

![Graph of Monthly Charges with left skew](https://i.imgur.com/VprAPYf.jpg)
* *'TotalCharges'* - Float (float64) - strong right skew

![Graph of Total Charges with right skew](https://i.imgur.com/8o5nkd9.jpg)
* *'Churn'* - the target is imbalanced: 73.46% is No (retained) and 26.54% is Yes (lost). Accuracy alone is therefore not very informative, so the F score is used instead.
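
The percentages and skew directions in the list above can be reproduced with a few lines of pandas. The following is a minimal sketch, not the code used for this report: the `sample` DataFrame is a hypothetical four-row stand-in for the telco data, and `category_shares` is a helper name introduced only for illustration.

```python
import pandas as pd


def category_shares(column: pd.Series) -> pd.Series:
    """Return the percentage share of each category in a column."""
    return (column.value_counts(normalize=True) * 100).round(2)


# Hypothetical miniature stand-in for the telco data (not real rows)
sample = pd.DataFrame({
    "Churn": ["No", "No", "No", "Yes"],
    "TotalCharges": [20.0, 35.0, 50.0, 900.0],
})

churn_shares = category_shares(sample["Churn"])

# A positive skewness means a long tail to the right (right skew)
total_skew = sample["TotalCharges"].skew()

print(churn_shares)
print("TotalCharges skewness:", total_skew)
```

Applied to the full dataset, `category_shares` would be called once per string attribute, and `.skew()` once per numeric attribute.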


The attributes were evaluated in KNIME for relationships, which produced the correlation matrix below.

![Correlation Matrix](https://i.imgur.com/ynrVAQJ.jpg)

The interesting relationships are outlined below: 
* *'StreamingTV'* correlates strongly and positively with *'DeviceProtection'* (0.76), *'InternetService'* (0.71) and *'OnlineSecurity'* (0.7)
* *'StreamingMovies'* correlates strongly and positively with *'DeviceProtection'* (0.77), *'InternetService'* (0.71) and *'OnlineSecurity'* (0.7)
* *'StreamingMovies'* and *'StreamingTV'* correlate strongly and positively with each other (0.83)
* *'TotalCharges'* correlates strongly and positively with *'Tenure'* (0.81)
* *'OnlineSecurity'* correlates positively with *'TechSupport'* (0.79)
* *'OnlineBackup'* correlates negatively with *'MonthlyCharges'* (-0.71)
* The attributes that correlate most with the target *'Churn'* are:
  * Negatively: Contract, TenureMonths, OnlineSecurity, TechSupport, DeviceProtection
  * Positively: MonthlyCharges, SeniorCitizen and Partner
  * The target correlates least with Gender, PhoneService and InternetService
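
The matrix in this report was produced in KNIME. As a rough cross-check, a similar table can be computed in pandas once the attributes are numeric. The sketch below uses Pearson correlation (the pandas default), which may differ from the measure KNIME applies to nominal attributes. The `encoded` DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical, already label-encoded sample; values are illustrative only
encoded = pd.DataFrame({
    "TenureMonths":   [1, 5, 12, 24, 48, 72],
    "TotalCharges":   [30.0, 150.0, 700.0, 1500.0, 3400.0, 6000.0],
    "MonthlyCharges": [30.0, 32.0, 60.0, 65.0, 70.0, 85.0],
    "MonthlyChurn":   [1, 1, 0, 1, 0, 0],
})

# Pairwise Pearson correlation between every pair of columns
corr = encoded.corr()

# Correlation of each attribute with the target, strongest first by magnitude
churn_corr = corr["MonthlyChurn"].drop("MonthlyChurn")
ranked = churn_corr.reindex(churn_corr.abs().sort_values(ascending=False).index)

print(corr.round(2))
print(ranked.round(2))
```

Sorting by absolute value keeps strongly negative attributes (such as tenure in this toy sample) near the top, rather than treating them as weak.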
  
  
The dataset is reasonably clean, with few or no missing values and no outliers, so it did not require much cleaning.

The data did require transformation, as the majority of the attributes were strings. These attributes were encoded as numerical categories so that the models could accept them as input. This was a prerequisite for experimenting with any model on the dataset.

Additionally, *'TenureMonths'*, *'MonthlyCharges'* and *'TotalCharges'* required normalisation because their ranges are much larger than those of the encoded attributes, with *'TotalCharges'* reaching nearly 9000. The StandardScaler from the sklearn preprocessing package was used to standardise these attributes, and doing so increased the model accuracies.



  



In [0]:
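import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

# Label-encode only the string (object) columns so numeric columns keep their values.
# telco is the DataFrame loaded earlier in the notebook.
string_cols = telco.select_dtypes(include='object').columns
telco[string_cols] = telco[string_cols].apply(LabelEncoder().fit_transform)

# Columns to standardise; the names are assumed from the attributes described above
ColumnstoScale = telco[['TotalCharges', 'MonthlyCharges', 'TenureMonths']]

# Rescale each column to zero mean and unit variance
scaler = preprocessing.StandardScaler()
scaled_telco = scaler.fit_transform(ColumnstoScale)
scaled_telco = pd.DataFrame(scaled_telco, columns=['TotalCharges_Normalised', 'MonthlyCharges_Normalised', 'TenureMonths_Normalised'])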

The models we have decided to use are Decision Tree, Random Forest, SVM Linear & Kernel. 

**Decision Tree**

We have used a decision tree because we are dealing with a discrete target variable and the data is fairly logical in nature. The decision tree is a good model as it is robust to errors and can work around data that may still be messy. The concern with using a decision tree is that it is prone to overfitting the dataset.


**Random Forest** 

A Random Forest classifier combines many decision trees and therefore has the same advantages. It also improves resistance to overfitting and often increases accuracy, because it aggregates the predictions of multiple decision tree models.


**SVM**

Support Vector Machine classifiers can capture complicated data by transforming it into a space in which complex relationships can be separated. They are useful for both regression and classification. Whilst an SVM might be more than the simple dataset in this report needs, it is still useful for comparison with the two simpler algorithms mentioned above.
This report uses both a linear SVM, where the data is split by a straight line (a hyperplane), and a kernel SVM, where the data is mapped into a new space so that the classification boundaries are non-linear in the original attributes.

https://www.dezyre.com/article/top-10-machine-learning-algorithms/202 


# Methodology

The models used are all imported from the sklearn package for Python, and tuning is limited to the parameters these implementations expose.

The first step is cleaning the data so that it works with the models, as described in the Exploration section of the report.
Once this is complete, the data is split into training and test datasets, starting with the choice of attributes used for prediction. The following attributes were used for prediction:

* 'Gender', 'SeniorCitizen', 'Partner','Dependents', 'TenureMonths_Normalised',  'PhoneService',  'MultipleLines',  'InternetService',  'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'ContractTerm', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges_Normalised', 'TotalCharges_Normalised'

The target is then set to 'MonthlyChurn'. Once these have been defined, the data is divided into training and test sets with an 80/20 split.
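
The feature selection and 80/20 split described above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's exact cell: the `telco` DataFrame here is a small synthetic stand-in with only three of the predictors, and `random_state=1` is an assumed seed added so the split is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned telco data (values are random, not real)
rng = np.random.default_rng(0)
telco = pd.DataFrame({
    "TenureMonths_Normalised": rng.normal(size=100),
    "MonthlyCharges_Normalised": rng.normal(size=100),
    "ContractTerm": rng.integers(0, 3, size=100),
    "MonthlyChurn": rng.integers(0, 2, size=100),
})

# Every column except the target is used for prediction
feature_cols = [col for col in telco.columns if col != "MonthlyChurn"]
X = telco[feature_cols]
y = telco["MonthlyChurn"]

# test_size=0.2 gives the 80/20 split; random_state is an assumed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)
```

With the full dataset, `feature_cols` would be the list of 19 attributes given above.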

After this, the classifiers were imported from sklearn and their parameters adjusted as below:
* **Decision Tree**
  * Max depth of 5 
  * Criterion = entropy 
* **Random Forest**
  * N estimators = 500
  * Random_state = 1
  * Max depth of 8
* **Linear & Kernel SVM**
  * Unchanged

These are the final parameters that are used in the models after trial and error testing. 
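
As a sketch, the parameter list above maps onto sklearn constructors roughly as follows. The assumption here is that the two SVMs are `SVC` with `kernel="linear"` and with the default `"rbf"` kernel. The report only says their parameters were left unchanged, so the exact classes used may differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One entry per model, using the parameter values listed above
classifiers = {
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", max_depth=5),
    "Random Forest": RandomForestClassifier(
        n_estimators=500, max_depth=8, random_state=1
    ),
    # Assumed: both SVMs are sklearn's SVC with otherwise default settings
    "Linear SVM": SVC(kernel="linear"),
    "Kernel SVM": SVC(kernel="rbf"),
}

for name, clf in classifiers.items():
    print(name, "->", type(clf).__name__)
```

Keeping the models in one dictionary makes it easy to fit and score each of them in a single loop.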

Attributes with low correlation were also removed to test whether that simplification improved the classifiers. It improved the decision tree by approximately 0.013679 and the linear SVM by approximately 0.003085, but worsened the random forest by approximately 0.007144 and the kernel SVM by approximately 0.00018.
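
One way to remove weakly correlated attributes is to drop every column whose absolute correlation with the target falls below a cut-off. The sketch below assumes this approach. The helper name `drop_weak_features`, the threshold of 0.1 and the sample data are all hypothetical, as the report does not state which cut-off was used.

```python
import pandas as pd


def drop_weak_features(df: pd.DataFrame, target: str, threshold: float):
    """Drop columns whose absolute correlation with the target is below threshold.

    Returns the reduced DataFrame and the list of removed column names.
    """
    corr_with_target = df.corr()[target].drop(target)
    weak = corr_with_target[corr_with_target.abs() < threshold].index.tolist()
    return df.drop(columns=weak), weak


# Hypothetical encoded sample: Gender is unrelated to churn, ContractTerm is not
sample = pd.DataFrame({
    "Gender":       [0, 1, 0, 1, 1, 0, 0, 1],
    "ContractTerm": [0, 0, 0, 1, 1, 2, 2, 2],
    "MonthlyChurn": [1, 1, 1, 0, 1, 0, 0, 0],
})

reduced, removed = drop_weak_features(sample, "MonthlyChurn", threshold=0.1)

print("Removed:", removed)
print("Remaining columns:", list(reduced.columns))
```

The classifiers would then be retrained on `reduced` and their scores compared with the full-attribute runs.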

In [0]:
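from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.tree import DecisionTreeClassifier

# Entropy-based decision tree with a capped depth to limit overfitting
dtC = DecisionTreeClassifier(criterion="entropy", max_depth=6)

# Fit on the training split, then predict churn for the held-out test split
dtC = dtC.fit(X_train, y_train)
y_pred_dtC = dtC.predict(X_test)

# Weighted F1 is reported first because the target classes are imbalanced
print('Weighted F1 Score', f1_score(y_test, y_pred_dtC, average='weighted'))
print("Accuracy of decision tree classifier:", metrics.accuracy_score(y_test, y_pred_dtC))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_dtC))
print(classification_report(y_test, y_pred_dtC))
print(confusion_matrix(y_test, y_pred_dtC))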

# Evaluation
*Report execution on data?*

*Perform and report testing*

*Perform efficiency analysis*

*Do possible comparative study*



*   Report the accuracy stats
*   How do we do an efficiency analysis? Google how to get output on computational cost and the time taken to create each model
* Comparison should be fairly straightforward
* Execution - what do you see as an outcome

**Execution - what happened**
* Initially many of the cells raised errors; they all run now that the data has been cleaned up further.
* Each printed the classification report 


**Accuracy stats**

* Need to use F1 score as the target isn't balanced
* Decision Tree - 0.7954053519036592
* *Random Forest - 0.8202317922734982*
* Linear SVM - 0.8002769583555663
* Kernel SVM - 0.7475030683876639
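
To illustrate why the F1 score is preferred over accuracy here, the sketch below scores a trivial model that predicts "retained" for every customer. The labels are synthetic, with a 73/27 split chosen to approximate the class balance reported in the Exploration section.

```python
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels: 73 retained (0) and 27 churned (1)
y_true = [0] * 73 + [1] * 27

# A trivial model that never predicts churn
y_majority = [0] * 100

acc = accuracy_score(y_true, y_majority)
weighted_f1 = f1_score(y_true, y_majority, average="weighted", zero_division=0)
churn_f1 = f1_score(y_true, y_majority, pos_label=1, zero_division=0)

print("Accuracy:", acc)
print("Weighted F1:", round(weighted_f1, 4))
print("F1 for the churn class:", churn_f1)
```

The trivial model reaches 73% accuracy without identifying a single churner, while its weighted F1 is lower and its F1 for the churn class is zero. The classification report printed for each model shows the same per-class breakdown.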


**Efficiency - time for each to execute**
* Decision Tree - 0.14952842057000454
* Random Forest - 0.6895580111799973
* Linear SVM - 1.2340110431299944
* Kernel SVM - 1.787700364480006

In [0]:
#Testing 

# Conclusion

*Reflections and improvements*

# Ethics

Which ethical approach, what are misuses of this technique. 

Perhaps sitting somewhere between Deontological (duty-based) ethics and Utilitarianism, which somewhat contradict each other.

The idea is that some things are definitely wrong and there is a 'principle of the thing': there is a duty to do the right thing even if it doesn't help everyone (for example, if it ruins a business). However, consequences shouldn't be ignored completely, because doing the right thing in a business setting might not be right in a human setting. The consequences of the 'right' action should still be considered, but if it is the right thing, we should do it.


*From website - http://www.bbc.co.uk/ethics/introduction/duty_1.shtml* 

> Duty-based or Deontological ethics
Deontological (duty-based) ethics are concerned with what people do, not with the consequences of their actions.

> Do the right thing.
Do it because it's the right thing to do.
Don't do wrong things.
Avoid them because they are wrong.
Under this form of ethics you can't justify an action by showing that it produced good consequences, which is why it's sometimes called 'non-Consequentialist'.


> Duty-based ethics are usually what people are talking about when they refer to 'the principle of the thing'. Duty-based ethics teaches that some acts are right or wrong because of the sorts of things they are, and people have a duty to act accordingly, regardless of the good or bad consequences that may be produced. Someone who follows Duty-based ethics should do the right thing, even if that produces more harm (or less good) than doing the wrong thing:

> People have a duty to do the right thing, even if it produces a bad result. So, for example, the philosopher Kant thought that it would be wrong to tell a lie in order to save a friend from a murderer. 

> If we compare Deontologists with Consequentialists we can see that Consequentialists begin by considering what things are good, and identify 'right' actions as the ones that produce the maximum of those good things.
Deontologists appear to do it the other way around; they first consider what actions are 'right' and proceed from there. (Actually this is what they do in practice, but it isn't really the starting point of deontological thinking.)

> So a person is doing something good if they are doing a morally right action.


**Misuse of method/thing**
We are predicting if a customer will leave a company. 
* Poach other customers 
 

**Video - 3-5 mins- Can just be ppt with voice over :)**