# NLP Example, Twitter Tweets <img style="float: right; width: 310px;" src="./Data/Twitter_Logo.jpg"/>  
  
---  

### By: Heather M. Steich, M.S.
### Date: October 29$^{th}$, 2017
### Written in: Python 3.4.5

In [1]:
import sys
print(sys.version)

3.4.5 |Anaconda custom (64-bit)| (default, Jul  5 2016, 14:53:07) [MSC v.1600 64 bit (AMD64)]


---  
  
## Dataset Credit  
  
  
The data for this project are used with permission (on condition of citation) from the following source:  

    Z. Cheng, J. Caverlee, and K. Lee. You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users. 
    In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), Toronto, Oct 2010.

<https://archive.org/details/twitter_cikm_2010><img style="float: center;" src="./Data/paper_logo.gif">

---  

## Overview

The goal of the exercise is to extract information about concert appearances of musicians, performers or bands.  For each such tweet, we are looking to extract:  

 - Who was the performer  
 - When was the show  
 - Where was the show  
 - The Twitter user who attended it  
 - The sentiment of the tweet  
   
Not all of these fields are available in all tweets, and that’s ok.  
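
Because any of the five fields may be missing, one way to hold the extraction result is a record whose fields all default to `None`. This is only a sketch for illustration: the name `ConcertMention`, its field names, and the sample values are hypothetical and do not come from the dataset.

```python
from collections import namedtuple

# Hypothetical record for one concert tweet; every field may be missing.
ConcertMention = namedtuple(
    "ConcertMention",
    ["performer", "show_date", "venue", "user_id", "sentiment"],
)
# Make every field optional by defaulting it to None.
ConcertMention.__new__.__defaults__ = (None,) * len(ConcertMention._fields)

# A tweet that names only the performer and the attending user
# (made-up values, not taken from the dataset):
mention = ConcertMention(performer="Example Band", user_id=12345)
missing = [f for f in ConcertMention._fields if getattr(mention, f) is None]
print(missing)  # fields that could not be extracted
```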

Each row in the dataset includes the ID of the user who sent the tweet and the tweet's timestamp. For the ‘when’ field, we are interested in the date of the show, not just the date of the tweet. We are not interested in any other tweets, including tweets about performers that don’t mention concerts.
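
The show date often has to be inferred from the tweet's timestamp plus a relative word in the text. Below is a minimal sketch of that idea, assuming a small hypothetical lookup of relative phrases (`RELATIVE_DAYS`) and a hypothetical helper `resolve_show_date`. Real tweets would need far broader date parsing.

```python
import re
from datetime import datetime, timedelta

# Hypothetical lookup: offset in days relative to the tweet's timestamp.
RELATIVE_DAYS = {
    "last night": -1,
    "yesterday": -1,
    "tonight": 0,
    "today": 0,
    "tomorrow": 1,
}

def resolve_show_date(tweet_text, created_at):
    """Return the inferred show date, or None if no relative phrase is found."""
    text = tweet_text.lower()
    for phrase, offset in RELATIVE_DAYS.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            return (created_at + timedelta(days=offset)).date()
    return None

created = datetime(2010, 3, 15, 17, 35, 58)  # a timestamp in the dataset's format
print(resolve_show_date("Seeing them live tomorrow!", created))
print(resolve_show_date("No date words here", created))
```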

---  
  
### Part 1: Classify whether the tweets are relevant

In [2]:
## LOAD LIBRARIES

# Data wrangling & processing: 
import numpy as np
import pandas as pd

# Machine learning (used by the modeling cells further below):
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.metrics import confusion_matrix
from pandas_ml import ConfusionMatrix
from sklearn.metrics import roc_curve
#from sklearn.ensemble import RandomForestRegressor as RFR

# Plotting:
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # Used to render DataFrames in the modeling cells

# Remove warning messages:
#import warnings
#warnings.filterwarnings('ignore')

In [3]:
## ESTABLISH PLOT FORMATTING

#mpl.rcdefaults()  # Resets plot defaults

def plt_format():
    %matplotlib inline
    plt.rcParams['figure.figsize'] = (16, 10)
    plt.rcParams['font.size'] = 16
    plt.rcParams['font.family'] = 'Times New Roman'
    plt.rcParams['axes.labelcolor'] = 'black'
    plt.rcParams['axes.labelsize'] = 20
    plt.rcParams['axes.labelweight'] = 'bold'
    plt.rcParams['axes.titlesize'] = 32
    plt.rcParams['axes.titleweight'] = 'bold'
    plt.rcParams['legend.fontsize'] = 16
    plt.rcParams['legend.markerscale'] = 4
    plt.rcParams['text.color'] = 'black'
    plt.rcParams['xtick.labelsize'] = 20
    plt.rcParams['ytick.labelsize'] = 20
    plt.rcParams['legend.frameon'] = False
    plt.rcParams['axes.linewidth'] = 1

#plt.rcParams.keys()  # Available rcParams
plt_format()

 - Step 2: Load, view & prepare the provided data

In [18]:
## LOAD DATA:

# Read in the files:
train = pd.read_csv("./Data/corrected_training_set_tweets.csv")
test = pd.read_csv("./Data/corrected_test_set_tweets.csv")

# Translate the timestamps to DateTime objects:
train.tCreatedAt = pd.to_datetime(train.tCreatedAt)
test.tCreatedAt = pd.to_datetime(test.tCreatedAt)

# Print shapes:
print('Train Shape:', train.shape)
print('Train Column Names:', train.columns)
print('\nTest Shape:', test.shape)
print('Test Column Names:', test.columns)

Train Shape: (3741881, 4)
Train Column Names: Index(['UserID', 'tTweetID', 'tTweet', 'tCreatedAt'], dtype='object')

Test Shape: (5125748, 4)
Test Column Names: Index(['UserID', 'tTweetID', 'tTweet', 'tCreatedAt'], dtype='object')


In [15]:
## PRINT A PREVIEW OF THE DATAFRAMES:

train.head()

| | UserID | tTweetID | tTweet | tCreatedAt |
|---|---|---|---|---|
| 0 | 60730027 | 6320951896 | @thediscovietnam coo. thanks. just dropped yo... | 2009-12-03 18:41:07 |
| 1 | 60730027 | 6320673258 | @thediscovietnam shit it ain't lettin me DM yo... | 2009-12-03 18:31:01 |
| 2 | 60730027 | 6319871652 | @thediscovietnam hey cody, quick question...ca... | 2009-12-03 18:01:51 |
| 3 | 60730027 | 6318151501 | @smokinvinyl dang. you need anything? I got ... | 2009-12-03 17:00:16 |
| 4 | 60730027 | 6317932721 | maybe i'm late in the game on this one, but th... | 2009-12-03 16:52:36 |


In [16]:
test.head()

| | UserID | tTweetID | tTweet | tCreatedAt |
|---|---|---|---|---|
| 0 | 22077441 | 10538487904 | Ok today I have to find something to wear for ... | 2010-03-15 17:35:58 |
| 1 | 22077441 | 10536835844 | I am glad I'm having this show but I can't wai... | 2010-03-15 16:53:44 |
| 2 | 22077441 | 10536809086 | Honestly I don't even know what's going on any... | 2010-03-15 16:52:59 |
| 3 | 22077441 | 10534149786 | @LovelyJ_Janelle hey sorry I'm sitting infront... | 2010-03-15 15:42:07 |
| 4 | 22077441 | 10530203659 | Sitting infront of this sewing machine ... I d... | 2010-03-15 13:55:22 |


In [19]:
## CHECK DATA TYPES:

print('Training:\n', train.dtypes)
print('\nTesting:\n', test.dtypes)

Training:
 UserID                 int64
tTweetID               int64
tTweet                object
tCreatedAt    datetime64[ns]
dtype: object

Testing:
 UserID                 int64
tTweetID               int64
tTweet                object
tCreatedAt    datetime64[ns]
dtype: object
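
Part 1 asks whether a tweet is about a concert appearance at all. As a first-pass baseline (an assumption, not the method this notebook settles on), tweets can be flagged with a keyword pattern over the `tTweet` text. The keyword list and the helper name `looks_like_concert_tweet` are hypothetical, and keyword matching will produce both false positives and misses.

```python
import re

# Hypothetical keyword list; a real classifier would need labeled data.
CONCERT_PATTERN = re.compile(
    r"\b(concert|gig|live at|on tour|tour dates?|tickets?|opening for)\b",
    re.IGNORECASE,
)

def looks_like_concert_tweet(text):
    """Return True if the tweet text contains a concert-related keyword."""
    return bool(CONCERT_PATTERN.search(text))

# Made-up sample texts:
samples = [
    "Got tickets to the concert on Friday!",
    "Sitting in front of this sewing machine",
]
flags = [looks_like_concert_tweet(s) for s in samples]
print(flags)
```

On the loaded frames, the same function could be applied with `train.tTweet.astype(str).apply(looks_like_concert_tweet)`.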


# The following blocks are from a different Notebook:

 - Step 3: Data exploration

In [None]:
## BASIC STATISTICS FOR THE CLAIMS DATA:

claim_df.describe()

In [None]:
## THERE WAS ONE NEGATIVE CLAIM:

claim_df[claim_df.ClaimedAmount < 0]

In [None]:
## IT SEEMS LIKE THIS WAS A REVERSAL OF A CHARGE ON THE SAME DAY:

claim_df[claim_df.PolicyId == 777949]

In [None]:
## BASIC STATISTICS FOR THE POLICY DATA:

policy_df.describe()

In [None]:
## VIEW POLICIES THAT DO NOT HAVE AN ASSIGNED MONTHLY PREMIUM:

policy_df[policy_df.MonthlyPremium.isnull()]

In [None]:
## CHECK THE DATE RANGES TO MAKE SURE THIS IS ALL 2016 DATA:

print("Earliest Date of Claims", claim_df.ClaimDate.min())
print("Latest Date of Claims", claim_df.ClaimDate.max())
print("\nEarliest Date of Policy Enrollment", policy_df.EnrollDate.min())
print("Latest Date of Policy Enrollment", policy_df.EnrollDate.max())
print("\nEarliest Date of Policy Cancel", policy_df.CancelDate.min())
print("Latest Date of Policy Cancel", policy_df.CancelDate.max())

In [None]:
## MANY CUSTOMERS ENROLLED LONG BEFORE 2016:  

print("Number of Policy Holders Enrolled Prior to 2016:", 
      policy_df[policy_df.EnrollDate < '2016'].shape[0])
policy_df[policy_df.EnrollDate < '2016'].head()

In [None]:
## COUNT OF UNIQUE POLICY ID'S:

print("Unique Policy ID's:", policy_df.PolicyId.unique().shape[0])
print("Number of Duplicated Rows:", sum(policy_df.duplicated()))

In [None]:
## COUNT OF UNIQUE POLICY ID'S THAT FILED CLAIMS:

print("Unique Policy ID's That Filed Claims:", 
      claim_df.PolicyId.unique().shape[0])
print("Number of Duplicated Rows:", sum(claim_df.duplicated()))

**Key finding:**  
  
**There are 3,744 claims that have duplicates, so 6,616 rows (4.255%) are duplicates of another row.**  
  
**This would be a pivotal time to reach out and check that these are in fact valid claims.  Because this is a homework assignment, I will make the executive decision that these *are* valid claims.  For example, perhaps the animal went to the veterinarian for two vaccines, which resulted in two identical claim amounts on the same day.**  

**Please note that this is an assumption, and in the real world I would ask for further clarification prior to proceeding with analyses.**  
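
The two numbers in the key finding count different things: the number of distinct claims that occur more than once, and the number of rows that belong to any such repeated group (which is what `duplicated(keep=False)` marks). A small sketch on made-up rows, written in plain Python so the counting is explicit:

```python
from collections import Counter

# Made-up claim rows: (PolicyId, ClaimDate, ClaimedAmount).
rows = [
    (1, "2016-01-05", 50.0),
    (1, "2016-01-05", 50.0),   # exact repeat of the row above
    (2, "2016-02-10", 20.0),
    (3, "2016-03-01", 75.0),
    (3, "2016-03-01", 75.0),
    (3, "2016-03-01", 75.0),   # this claim appears three times
]

counts = Counter(rows)
# Distinct claims that appear more than once:
claims_with_duplicates = sum(1 for n in counts.values() if n > 1)
# Every row that is part of a repeated group (keep=False semantics):
rows_in_duplicate_groups = sum(n for n in counts.values() if n > 1)
percent = round(rows_in_duplicate_groups / len(rows) * 100, 3)

print(claims_with_duplicates, rows_in_duplicate_groups, percent)
```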

In [None]:
## MANY CLAIMS ARE DUPLICATED:

print("Duplicated Claims Account for", 
      claim_df[claim_df.duplicated(keep=False)].shape[0],
      "Rows of the Claims Data.  This is",
      round(claim_df[claim_df.duplicated(keep=False)].shape[0] 
      / claim_df.shape[0] * 100, 3), "% of the Claims Data.")
claim_df[claim_df.duplicated(keep=False)].head(10)

 - Step 4: Transform & merge the data to a monthly-per-policy basis

In [None]:
## ADD A LEVEL TO COLUMN INDEX TO ALIGN WITH THE MONTHLY CLAIMS DATA:

# Re-assign DataFrame to another name:
reindex_policy = pd.DataFrame(policy_df)

# Reset the index level to prevent an error from occurring if 
# this cell is run more than once:
try:
    reindex_policy = reindex_policy[list(policy_df.columns.levels[0].values)]
except AttributeError:  # Columns are not a MultiIndex yet (first run)
    reindex_policy = policy_df[list(policy_df.columns.values)]

# Add extra level to column index:
reindex_policy.columns = pd.MultiIndex.from_arrays([reindex_policy.columns, 
                    [' '] * len(reindex_policy.columns)])

# Print preview of re-indexed policy DataFrame:
reindex_policy.head()

In [None]:
## PIVOT & MERGE THE TWO DATAFRAMES TO GET A MORE COMPLETE PICTURE OF 
## CLAIMS BEHAVIOUR, ADDING ADDITIONAL METRICS (FEATURES) FOR MODELLING:

# Pivot the claim DataFrame to have one row per policy, & a column
# for every month's ClaimedAmount & PaidAmount; if there is no claim data, 
# fill as a '$0' claim:
pivot_df = pd.pivot_table(claim_df,index=claim_df['PolicyId'],
               columns=claim_df['ClaimDate'].dt.month,
               aggfunc=np.sum, fill_value=0).reset_index()

# Merge the Claims and Policy DataFrames together in a 'left' join on PolicyId:
wide_df = reindex_policy.join(pivot_df.set_index('PolicyId'))

# Fill in missing amounts with zeros:
wide_df['ClaimedAmount'] = wide_df.ClaimedAmount.fillna(0)
wide_df['PaidAmount'] = wide_df.PaidAmount.fillna(0)

# Calculate basic metrics for each policy:
wide_df['MeanClaims'] = np.mean(wide_df.ClaimedAmount, axis=1)
wide_df['MeanPaid'] = np.mean(wide_df.PaidAmount, axis=1)
wide_df['MedianClaims'] = np.median(wide_df.ClaimedAmount, axis=1)
wide_df['MedianPaid'] = np.median(wide_df.PaidAmount, axis=1)
wide_df['TotalClaims'] = np.sum(wide_df.ClaimedAmount, axis=1)
wide_df['TotalPaid'] = np.sum(wide_df.PaidAmount, axis=1)
wide_df['TotalDifference'] =  (np.sum(wide_df.ClaimedAmount, axis=1) - 
                               np.sum(wide_df.PaidAmount, axis=1))
wide_df['ProportionCovered'] =  (np.sum(wide_df.PaidAmount, axis=1) / 
                                    np.sum(wide_df.ClaimedAmount, axis=1))

# Since we only have claims for 2016, calculate the number of months &  
# paid premiums in 2016 alone:
wide_df['Premiums2016'] = (np.where(pd.to_datetime(pd.Series(wide_df['CancelDate'])).dt.year == 2016,
         pd.to_datetime(pd.Series(wide_df['CancelDate'])).dt.month, 12) 
         - np.where(pd.to_datetime(pd.Series(wide_df['EnrollDate'])).dt.year == 2016, 
         pd.to_datetime(pd.Series(wide_df['EnrollDate'])).dt.month, 0))
wide_df['PremiumsPaid2016'] = np.multiply(wide_df.Premiums2016.values, 
                                       wide_df.MonthlyPremium.iloc[0:, 0].values)
wide_df['PremiumVPaid'] = (wide_df.PremiumsPaid2016.values - wide_df.TotalPaid.values)

# Use a binary key to mark if the customer is current or canceled at the end of 2016:
# (A missing CancelDate is NaT, which compares as False, so active policies get 0.)
wide_df['Churned'] = np.where(wide_df.CancelDate 
                                       > pd.Timestamp('2016-01-01'), 1, 0)

# Calculate the total number of months the customer has held a policy:
end = pd.to_datetime('2016-12-31')
wide_df['PolicyLength'] = (wide_df.CancelDate.fillna(end) - 
                           wide_df.EnrollDate).astype('timedelta64[M]')

# Print a preview of the prepared DataFrame:
wide_df.head()
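
The pivot step can be pictured on a toy example: sum each policy's claimed amounts per calendar month, and fill months without claims with 0. This sketch uses plain Python and made-up rows. `monthly_totals` is a hypothetical helper, not part of the notebook, and unlike the pivot above it always produces all twelve months.

```python
from collections import defaultdict
from datetime import date

# Made-up claims: (PolicyId, ClaimDate, ClaimedAmount).
claims = [
    (101, date(2016, 1, 5), 40.0),
    (101, date(2016, 1, 20), 10.0),
    (101, date(2016, 3, 2), 25.0),
    (202, date(2016, 12, 31), 99.0),
]

def monthly_totals(claim_rows):
    """Return {PolicyId: [total for Jan, ..., total for Dec]} with 0.0 fill."""
    table = defaultdict(lambda: [0.0] * 12)
    for policy_id, claim_date, amount in claim_rows:
        table[policy_id][claim_date.month - 1] += amount
    return dict(table)

wide = monthly_totals(claims)
print(wide[101][:3])   # January to March totals for policy 101
print(sum(wide[202]))  # yearly total for policy 202
```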

In [None]:
## WRITE OUT THE PREPARED DATA FRAME FOR FUTURE REFERENCE:

# Condense the columns' MultiIndex for clarification prior to writing out:
write_out = wide_df.copy(deep=True)
write_out.columns =  [''.join(tuple(map(str, t))) for t in write_out.columns.values]
write_out.to_csv('./Data/PreparedData.csv', index=False)

**Key finding:**  
  
**Based on the provided data, using 2016 premiums collected (assuming *no* pro-rating for partial months) & paid claims only, Trupanion made a gross profit of $24,185,900.12!**  

In [None]:
## DATA FACT == IN THE BLACK FOR 2016:

# Calculation of the 2016 total sum of premiums paid to Trupanion minus paid claims:
print('$', round(np.sum(wide_df.PremiumVPaid), 2))
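
The profit figure rests on a simple per-policy rule: months of premium counted in 2016, times the monthly premium, minus paid claims. Below is a sketch of that arithmetic for a single policy, following the `np.where` logic in the preparation cell. The function name `premium_months_2016` and the dollar values are made up for illustration.

```python
from datetime import date

def premium_months_2016(enroll_date, cancel_date=None):
    """Mirror the notebook's rule for months of premium counted in 2016."""
    if cancel_date is not None and cancel_date.year == 2016:
        end_month = cancel_date.month
    else:
        end_month = 12
    start_month = enroll_date.month if enroll_date.year == 2016 else 0
    return end_month - start_month

# Enrolled before 2016 and still active: all 12 months count.
full_year = premium_months_2016(date(2014, 7, 1))

# Enrolled in March 2016, canceled in September 2016: 9 - 3 = 6 months.
months = premium_months_2016(date(2016, 3, 15), date(2016, 9, 10))

# Net position for a made-up policy: premiums collected minus paid claims.
monthly_premium, total_paid = 40.0, 150.0
net = months * monthly_premium - total_paid
print(full_year, months, net)
```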

**Key finding:**  
  
**There are 5 policies missing data on monthly premiums.  I typically would ask for clarification here, but since there are so few I'll make the executive decision.  I believe that these rows should be dropped for a model on paid claims, because none of them had any claims submitted or paid.  However, since 4 out of the 5 rows were canceled policies, I would choose to keep them for a model on cancel predictions.**

In [None]:
## LOCATE POLICIES MISSING MONTHLY PREMIUM DATA & CREATE NEW DATAFRAMES FOR USE 
## IN PAID CLAIMS & CANCELLATION MODELING:

# Identify the rows with missing data:
missing = np.argwhere(np.isnan(wide_df.xs('PremiumVPaid', axis=1, drop_level=True))).ravel()
# (Copy so the fill below modifies this frame rather than a view of wide_df.)
no_premium = wide_df[wide_df.index.isin(missing)].copy()

# DataFrame without the 5 rows:
paid_df = wide_df[~wide_df.index.isin(no_premium.index)]

# DataFrame with 5 missing premiums filled with zeros:
no_premium[['MonthlyPremium', 'PremiumsPaid2016', 'PremiumVPaid']] = (
    no_premium[['MonthlyPremium', 'PremiumsPaid2016', 'PremiumVPaid']].fillna(0))
cancel_df = pd.concat([paid_df, no_premium])

# Print the policies with missing premiums:
no_premium = wide_df[wide_df.index.isin(missing)] 
no_premium

In [None]:
## THE MOST EXPENSIVE CUSTOMERS:

a = paid_df.PremiumVPaid.sort_values()[0:5].index
paid_df[paid_df.index.isin(a)]

 - Step 5: Visualize the data to help determine appropriate modeling methods

In [None]:
plt_format()

plt.plot(np.sum(wide_df, axis=0).ClaimedAmount, label='Claimed Amount')
plt.plot(np.sum(wide_df, axis=0).PaidAmount, label='Paid Amount')
plt.title('Total Claimed & Paid Amounts in 2016, by Month')
plt.xlim(0.9, 12.1)
plt.xlabel('Month')
plt.xticks([1,2,3,4,5,6,7,8,9,10,11,12], ['January', 'February', 
            'March', 'April', 'May', 'June', 'July', 'August', 
            'September', 'October', 'November', 'December'], rotation=45)
plt.ylabel('Total Dollars')
plt.legend(loc=5);

In [None]:
plt_format()

sns.distplot(paid_df.PremiumVPaid.astype(int))
plt.title('Distribution of Premiums Collected Minus Paid Claims')
plt.xlabel('Premium Collected Minus Paid Claims');

In [None]:
## DATA DISTRIBUTION COMPARISONS:

plt_format()

sns.violinplot(x='Churned', y='MonthlyPremium ', data=write_out, 
               split=True, inner='quartile', saturation=0.6, 
               palette={1: "deepskyblue", 0: "mediumpurple"})
plt.title('Monthly Premium Distributions, Separated by Churn Status')
plt.xlabel('Churned')
plt.ylabel('Monthly Premium, USD')

plt.ylim(0, 180);

In [None]:
## DATA DISTRIBUTION COMPARISONS:

plt_format()

sns.violinplot(x='Churned', y='PolicyLength', 
               data=write_out, split=True, inner='quartile', saturation=0.6, 
               palette={1: "deepskyblue", 0: "mediumpurple"})
plt.title('Policy Length Distributions, Separated by Churn Status')
plt.xlabel('Churned')
plt.ylabel('Policy Length, in Months')

plt.ylim(-10, 210);

In [None]:
## DATA DISTRIBUTION COMPARISONS:

plt_format()

sns.violinplot(x='Churned', y='TotalPaid', 
               data=write_out, split=True, inner='quartile', saturation=0.6, 
               palette={1: "deepskyblue", 0: "mediumpurple"})
plt.title('Total Paid Distributions, Separated by Churn Status')
plt.xlabel('Churned')
plt.ylabel('Total Paid, USD');

In [None]:
plt_format()

plt.plot(paid_df.PolicyLength, paid_df.MonthlyPremium, '.')
plt.title('Policy Length vs. Monthly Premium')
plt.xlabel('Policy Length, in Months')
plt.ylabel('Monthly Premium, in Dollars');

In [None]:
plt_format()

plt.plot(paid_df.EnrollDate, paid_df.MeanPaid, '.')
plt.title('Enrollment Date vs. Average Paid Claim Amount')
plt.xlabel('Enrollment Date')
plt.ylabel('Mean Paid Claim Amount, in Dollars');

In [None]:
plt_format()

sns.distplot(paid_df.PolicyLength.astype(int))
plt.title('Policy Length Distribution')
plt.xlabel('Policy Length, in Months');

---  
  
  
## Model: Predicting Cancellation Probabilities 
  
 - First, run through a trial model to get an idea of how the model will perform

In [None]:
## DEFINE MODEL INPUT FEATURES & OUTPUT:

df = cancel_df.copy(deep=True)

input_features = ['ClaimedAmount', 'MonthlyPremium', 'PaidAmount', 
                  'MeanClaims', 'MeanPaid', 'MedianClaims', 
                  'MedianPaid', 'TotalClaims', 'TotalPaid', 
                  'TotalDifference', 'Premiums2016', 'PremiumsPaid2016', 
                  'PremiumVPaid', 'PolicyLength']
output_feature = 'Churned'

X = df[input_features]
y = df[output_feature]

In [None]:
plt_format()

sns.heatmap(pd.concat([pd.DataFrame(X), y], axis=1).corr())
plt.title('Heatmap of Data Feature Correlations');

In [None]:
## RANDOM FOREST MODELING:

# Split into training & test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    
# Scale the data to a mean of '0' and standard deviation of '1',
# fitting the scaler on the training set & applying it to the test set:
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
      
# Initialize a classifier:
clf = RF(n_estimators=10, criterion='entropy')
clf.fit(X_train, y_train)
# Predict classes:
y_pred = clf.predict(X_test)
# Predict probabilities:
y_prob = clf.predict_proba(X_test)
    
# Print the accuracy:
accuracy = clf.score(X_test, y_test)
print("  Accuracy: ", accuracy*100, '%\n')

print('  Model Statistics:')
confusion_matrices = ConfusionMatrix(y_test, y_pred)
confusion_matrices.print_stats()
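
The scaling step in this cell fits the scaler on the training split only and then applies the same statistics to the test split, so nothing about the test set leaks into preprocessing. Here is a plain-Python sketch of that idea for a single made-up feature. The helper names are hypothetical; the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Return (mean, population standard deviation) learned from training values."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Standardize values using previously learned statistics."""
    return [(v - mu) / sigma for v in values]

train_premiums = [20.0, 40.0, 60.0]   # made-up training values
test_premiums = [40.0, 80.0]          # made-up test values

mu, sigma = fit_scaler(train_premiums)              # learned from training data only
scaled_train = transform(train_premiums, mu, sigma)
scaled_test = transform(test_premiums, mu, sigma)   # reuses the training statistics

print(round(mean(scaled_train), 6), round(pstdev(scaled_train), 6))
print(scaled_test)
```

The scaled test values are generally not centered on zero, which is expected: they are measured against the training distribution.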

In [None]:
plt_format()

confusion_matrices.plot(backend='seaborn', annot=True, fmt=".0f")
plt.title('Confusion Matrix of Cancellation Predictions');

In [None]:
## EVALUATE PROBABILITY PREDICTIONS:

# Number of times a predicted probability is assigned to an observation:
counts = pd.value_counts(y_prob[:, 1])
is_churn = y_test == 1

# Calculate true probabilities:
true_prob = {}
for prob in counts.index:
    true_prob[prob] = np.mean(is_churn[y_prob[:, 1] == prob])
true_prob = pd.Series(true_prob)

# Reshape & rename:
counts = pd.concat([counts, true_prob], axis=1).reset_index()
counts.columns = ['pred_prob', 'count', 'true_prob']
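
The cell above groups observations by their predicted probability and compares each group with the observed churn rate; for a well-calibrated model the two are roughly equal. The same bookkeeping on made-up values, in plain Python:

```python
from collections import defaultdict

# Made-up predicted churn probabilities and true labels (1 = churned).
pred_probs = [0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

groups = defaultdict(list)
for p, y in zip(pred_probs, labels):
    groups[p].append(y)

# For each distinct predicted probability: how often it was assigned, and
# the observed fraction of churners among those observations.
calibration = {p: (len(ys), sum(ys) / len(ys)) for p, ys in groups.items()}
for p in sorted(calibration):
    print(p, calibration[p])
```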

In [None]:
plt_format()

baseline = np.mean(is_churn)

plt.plot(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 
         c="#95a5a6", linewidth=3, alpha=0.8, label='Ideal Prediction Line')
plt.plot(np.linspace(0, 1, 10), np.linspace(baseline, baseline, 10), 
         c="#3498db", linewidth=3, alpha=0.8, label='Mean Probability of Churn')
plt.scatter(data=counts, x='pred_prob', y='true_prob', s='count', 
            marker='o', label=None, alpha=0.7, c="#9b59b6")
plt.title("Random Forest Model Outcomes")
plt.xlabel("Predicted Probability")
plt.ylabel("Relative Frequency of Outcome")
plt.xlim(-0.05,  1.05)
plt.ylim(-0.05, 1.05)
plt.legend(loc=2);

In [None]:
plt_format()

conf_mat = pd.DataFrame(
    confusion_matrix(y_test.values, y_pred), 
    columns=["Predicted False", "Predicted True"], 
    index=["Actual False", "Actual True"]
)
display(conf_mat)

# Calculate the false positives & true positives for all thresholds of the classification
fpr, tpr, threshold = roc_curve(y_test, y_prob[:, 1])
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, color='deepskyblue', linewidth=3)
plt.plot([0, 1], [0, 1], '--', color='mediumpurple', linewidth=3)
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
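
Each point on the ROC curve is the pair (false positive rate, true positive rate) obtained by treating every observation with a score at or above some threshold as positive. A sketch of that calculation on made-up values follows. `roc_point` is a hypothetical helper written for illustration; the notebook itself relies on `roc_curve`.

```python
def roc_point(labels, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives

labels = [0, 0, 1, 1]            # made-up true classes
scores = [0.1, 0.4, 0.35, 0.8]   # made-up predicted probabilities

# Lowering the threshold moves the point up and to the right.
points = [roc_point(labels, scores, t) for t in (0.8, 0.4, 0.35, 0.1)]
print(points)
```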

In [None]:
fig = plt.figure(figsize=(20, 18))
ax = fig.add_subplot(111)

df_f = pd.DataFrame(clf.feature_importances_, columns=["Importance"])
df_f["Labels"] = df[input_features].columns
df_f.sort_values("Importance", inplace=True, ascending=False)
display(df_f.head(5))

index = np.arange(len(clf.feature_importances_))
bar_width = 0.7
rects = plt.barh(index , df_f["Importance"], bar_width, alpha=0.4, color='b', label='Main')
plt.yticks(index, df_f["Labels"])
plt.title('Proportion of Importance for each Model Input Feature')
plt.xlabel('Importance, Proportion')
plt.ylabel('Model Input Feature')
plt.show()

In [None]:
## PREVIEW CHURN PROBABILITIES:

pd.DataFrame(y_prob).iloc[:, 1].head(10)

---  
  
- Final cancellation prediction model

In [None]:
## RANDOM FOREST MODELING; TRAIN FINAL MODEL WITH OUTPUT:

# Scale the data to a mean of '0' and standard deviation of '1':
scaler = StandardScaler().fit(X)
X = scaler.transform(X)
      
# Initialize a classifier:
clf = RF(n_estimators=4, criterion='entropy', random_state=None)
clf.fit(X, y)
# Predict classes:
y_pred = clf.predict(X)
# Predict probabilities:
y_prob = clf.predict_proba(X)
    
# Print the accuracy (measured on the same data the model was trained on):
accuracy = clf.score(X, y)
print("  Accuracy: ", accuracy*100, '%\n')

print('  Model Statistics:')
confusion_matrices = ConfusionMatrix(y, y_pred)
confusion_matrices.print_stats()

In [None]:
## EVALUATE PROBABILITY PREDICTIONS:

# Number of times a predicted probability is assigned to an observation:
counts = pd.value_counts(y_prob[:, 1])
is_churn = y == 1

# Calculate true probabilities:
true_prob = {}
for prob in counts.index:
    true_prob[prob] = np.mean(is_churn[y_prob[:, 1] == prob])
true_prob = pd.Series(true_prob)

# Reshape & rename:
counts = pd.concat([counts, true_prob], axis=1).reset_index()
counts.columns = ['pred_prob', 'count', 'true_prob']

In [None]:
## PLOT PREDICTION COUNTS:

plt_format()

baseline = np.mean(is_churn)

plt.plot(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 
         c="#95a5a6", linewidth=3, alpha=0.8, label='Ideal Prediction Line')
plt.plot(np.linspace(0, 1, 10), np.linspace(baseline, baseline, 10), 
         c="#3498db", linewidth=3, alpha=0.8, label='Mean Probability of Churn')
plt.scatter(data=counts, x='pred_prob', y='true_prob', s='count', 
            marker='o', label=None, alpha=0.7, c="#9b59b6")
plt.title("Random Forest Model Outcomes")
plt.xlabel("Predicted Probability")
plt.ylabel("Relative Frequency of Outcome")
plt.xlim(-0.05,  1.05)
plt.ylim(-0.05, 1.05)
plt.legend(loc=2);

In [None]:
## WRITE OUT CANCELLATION PROBABILITIES:

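# Pair each probability with its policy by position; df's index labels may
# no longer be in row order after the concat that built cancel_df:
cancel_probs = pd.DataFrame({'PolicyId': np.ravel(df.PolicyId.values),
                             'CancelProb': y_prob[:, 1]},
                            columns=['PolicyId', 'CancelProb'])
cancel_probs.to_csv('./Data/CancellationProbabilities.csv', index=False)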