## Road Map

Specific Tools
- sklearn.preprocessing.StandardScaler
- sklearn.linear_model.LogisticRegression

Clean 
> - verify the data types in the dataset
> - remove any values from the dataset that are not needed
> - convert data types to integers if necessary
> - tools: astype, dtypes, LabelEncoder
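
The cleaning steps above can be sketched on a small made-up DataFrame. The column names and values here are hypothetical and are not taken from ChurnData.csv.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, not taken from ChurnData.csv.
toy = pd.DataFrame({
    "tenure": [11.0, 33.0, 23.0],
    "plan": ["basic", "plus", "basic"],
    "churn": [1.0, 0.0, 1.0],
})

print(toy.dtypes)  # verify the data type of every column

# Cast a float column that only holds whole numbers to int.
toy["churn"] = toy["churn"].astype(int)

# Encode a text column as integer labels (classes are sorted alphabetically).
encoder = LabelEncoder()
toy["plan"] = encoder.fit_transform(toy["plan"])

print(toy.dtypes)
```

The scikit-learn docs describe LabelEncoder as a tool for target labels, so for input features something like OrdinalEncoder may be the better fit.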

Prepare 
> - convert the dataset into an np.array
> - normalize the data if the values vary significantly
> - split dataset into training and testing sets
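
A minimal sketch of the array and split steps, assuming a made-up feature matrix `X` and outcome `y` (both hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (10 rows, 2 columns) and binary outcome.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```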

Model 
> - select features
> - set outcome
> - fit the chosen algorithm to the training set
> - apply the same transformation to unseen data
> - predict outcome using testing set

Evaluate
> - compare testing and predicted outcomes
> - use Mean Squared Error, f1_score, accuracy_score
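
As a rough illustration of the metrics listed above, here they are on made-up labels (`y_true` and `y_pred` are hypothetical):

```python
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Hypothetical true and predicted labels for a binary problem.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)  # share of labels predicted correctly
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision and recall
mse = mean_squared_error(y_true, y_pred)   # for 0/1 labels this equals 1 - accuracy

print(accuracy, f1, mse)
```

Mean squared error is mainly a regression metric. On 0/1 class labels it only counts the share of wrong predictions.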


## Preparing:

1. select features

2. convert the values into an array

3. normalize the values of the features
> From the docs

> - sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)

> - Standardize features by removing the mean and scaling to unit variance

Methods:
> - fit(X, y) : Compute the mean and std to be used for later scaling.

> - fit_transform(X, y) : Fit to data, then transform it.

> - transform(X, copy) : Perform standardization by centering and scaling

4. split the data into training and testing sets
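
A small sketch of how `fit`, `transform` and `fit_transform` relate, using made-up arrays. It assumes the split has already happened and fits the scaler on the training rows only, which differs from the notebook cell further down, where all rows are scaled before splitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
scaler.fit(X_train)                       # learn per-column mean and std
X_train_std = scaler.transform(X_train)   # center and scale the training rows
X_test_std = scaler.transform(X_test)     # reuse the training statistics

# fit_transform is the same as fit followed by transform on the same data.
same = np.allclose(StandardScaler().fit_transform(X_train), X_train_std)

print(scaler.mean_)  # per-column means learned from X_train
```

Fitting on the training rows only keeps information about the test rows out of the scaling step.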

---


## Modeling

1. fit the data to a logistic regression model

> - using predict and predict_proba

> - from the docs

> - sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
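
To show the difference between `predict` and `predict_proba`, here is a minimal sketch on made-up one-feature data (the arrays are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data: class 1 becomes more likely as x grows.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression(solver="liblinear").fit(X, y)

labels = model.predict(X)       # hard class labels, shape (6,)
probs = model.predict_proba(X)  # one column per class, shape (6, 2)

# Each row of probs sums to 1; predict returns the class with the larger probability.
print(labels)
print(probs.round(3))
```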

Parameters

> - C : float, default=1.0
> Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization
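
One way to see the effect of `C` is to compare coefficient sizes at two settings. This is a sketch on synthetic data; the data, the seed and the two `C` values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data set with a fixed seed so the run is repeatable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

strong = LogisticRegression(C=0.01, solver="liblinear").fit(X, y)  # heavy penalty
weak = LogisticRegression(C=100.0, solver="liblinear").fit(X, y)   # light penalty

# A smaller C shrinks the coefficients toward zero.
print(np.linalg.norm(strong.coef_), np.linalg.norm(weak.coef_))
```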

> - solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’

> For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.

> For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.

> - ‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
> - ‘liblinear’ and ‘saga’ also handle L1 penalty
> - ‘saga’ also supports ‘elasticnet’ penalty
> - ‘liblinear’ does not support setting penalty='none'
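
The solver and penalty rules above can be checked with a quick sketch on synthetic data. It assumes an unsupported combination raises `ValueError` when `fit` is called, and the exact error message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# L1 penalty is supported by 'liblinear' (and 'saga').
l1_model = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)

# L1 with 'lbfgs' is not supported and is rejected when fit is called.
try:
    LogisticRegression(penalty="l1", solver="lbfgs").fit(X, y)
    rejected = False
except ValueError:
    rejected = True

print(l1_model.coef_, rejected)
```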

---

```python
import pandas as pd
from sklearn import preprocessing, linear_model, metrics, model_selection

df = pd.read_csv('datasets/ChurnData.csv')

# clean: check that every column can be cast to int
# (the result is not assigned, so df itself is unchanged)
df.astype(int).dtypes

# prepare
# every column in the dataset, listed for reference
columns = ['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip',
           'callcard', 'wireless', 'longmon', 'tollmon', 'equipmon', 'cardmon',
           'wiremon', 'longten', 'tollten', 'cardten', 'voice', 'pager',
           'internet', 'callwait', 'confer', 'ebill', 'loglong', 'logtoll',
           'lninc', 'custcat', 'churn']

features = ['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']
output = ['churn']

x = df[features].values
y = df[output].values.ravel()  # flatten to 1-D, the shape fit() expects for a target

# standardize each feature to zero mean and unit variance
x = preprocessing.StandardScaler().fit(x).transform(x)

x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.2, random_state=4)

# model
LogReg = linear_model.LogisticRegression(C=0.01, solver='liblinear').fit(x_train, y_train)

y_hat = LogReg.predict(x_test)         # predicted class labels
y_prob = LogReg.predict_proba(x_test)  # predicted class probabilities

# evaluate
log_loss = metrics.log_loss(y_test, y_prob)
print('Log Loss: %.3f' % log_loss, '\n')

f1_score = metrics.f1_score(y_test, y_hat)
print('F1 score: %.3f' % f1_score, '\n')

classification_report = metrics.classification_report(y_test, y_hat)
print(classification_report)

confusion_matrix = metrics.confusion_matrix(y_test, y_hat)
print('confusion matrix \n', confusion_matrix)
```



```
Log Loss: 0.602 

F1 score: 0.545 

              precision    recall  f1-score   support

         0.0       0.73      0.96      0.83        25
         1.0       0.86      0.40      0.55        15

   micro avg       0.75      0.75      0.75        40
   macro avg       0.79      0.68      0.69        40
weighted avg       0.78      0.75      0.72        40

confusion matrix 
 [[24  1]
 [ 9  6]]
```


---
## Evaluation

- log_loss

> Log loss (logarithmic loss) measures the performance of a classifier whose predicted output is a probability between 0 and 1. Lower values are better.
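
To make the definition concrete, this sketch computes binary log loss by hand and compares it with `sklearn.metrics.log_loss`. The labels and probabilities are made up.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities of class 1.
y_true = np.array([1, 0, 1, 0])
p_class1 = np.array([0.9, 0.2, 0.6, 0.4])

# Average negative log-likelihood of the true labels.
manual = -np.mean(
    y_true * np.log(p_class1) + (1 - y_true) * np.log(1 - p_class1)
)

# A 1-D array is read as the probability of the positive class.
library = log_loss(y_true, p_class1)

print(manual, library)
```

Confident wrong predictions are penalized the most, because the log of a probability near 0 is a large negative number.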

- f1_score

> The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 and its worst at 0.
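
As a worked check, the F1 score printed by the notebook cell can be recomputed from its confusion matrix `[[24, 1], [9, 6]]`. This assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# Counts for the positive class (churn = 1) from the confusion matrix [[24, 1], [9, 6]].
tp, fp, fn = 6, 1, 9

precision = tp / (tp + fp)  # 6 / 7
recall = tp / (tp + fn)     # 6 / 15
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 3))  # 0.86 0.4 0.545
```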

- classification report

> Build a text report showing the main classification metrics.

- confusion matrix

> Compute a confusion matrix to evaluate the accuracy of a classification.
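
A short sketch of how to read a binary confusion matrix, using made-up labels. In scikit-learn, rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() returns the four cells in this order.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)  # 2 1 1 3
```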

---

## Disclaimer

This script was originally from Coursera's [IBM AI Engineering course](https://www.coursera.org/professional-certificates/ai-engineer), authored by [Saeed Aghabozorgi](https://ca.linkedin.com/in/saeedaghabozorgi), and was modified to fit my needs.