## Workflow

1. Perform any cleaning, exploratory analysis, and/or visualization needed to
prepare the provided data for this analysis.
   
2. Build a predictive model to help determine the probability that a rider will
be retained.

3. Evaluate the model.  Focus on metrics that are important for your *statistical
model*.
 
4. Identify / interpret features that are the most influential in affecting
your predictions.

5. Discuss the validity of your model, including issues such as
leakage.  For more on leakage, see [this essay on
Kaggle](https://www.kaggle.com/dansbecker/data-leakage), and this paper: [Leakage in Data
Mining: Formulation, Detection, and Avoidance](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.7769&rep=rep1&type=pdf).

6. Repeat steps 2-5 until you have a satisfactory model.

7. Consider business decisions that your model may indicate are appropriate.
Evaluate possible decisions with metrics that are appropriate for *decision
rules*.
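
To illustrate the difference between the two kinds of metrics in steps 3 and 7, here is a minimal sketch using made-up predictions. Log loss scores the predicted probabilities themselves, while the net-value calculation scores a decision rule (a threshold plus an action). The labels, probabilities, threshold, and dollar amounts below are hypothetical and not taken from the data.

```python
import numpy as np

# Hypothetical labels (1 = retained) and predicted retention probabilities.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
p_retained = np.array([0.9, 0.2, 0.4, 0.7, 0.1, 0.6, 0.3, 0.8])


def log_loss(y, p, eps=1e-15):
    """Model metric: mean negative log-likelihood of the true labels."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def decision_value(y, p, threshold, offer_cost, value_saved):
    """Decision metric: net value of sending an offer to every rider whose
    predicted retention probability falls below `threshold`.

    Assumes (hypothetically) that each offer costs `offer_cost` and recovers
    `value_saved` only when the rider would otherwise have churned.
    """
    targeted = p < threshold
    churners_targeted = np.sum(targeted & (y == 0))
    return float(churners_targeted * value_saved - np.sum(targeted) * offer_cost)


model_score = log_loss(y_true, p_retained)
rule_value = decision_value(y_true, p_retained, threshold=0.5,
                            offer_cost=5.0, value_saved=20.0)
print(f"log loss: {model_score:.3f}")
print(f"net value of rule at threshold 0.5: {rule_value:.2f}")
```

A model with a good log loss can still support a poor decision rule if the threshold or the cost assumptions are off, which is why the two are evaluated separately.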

## Deliverables

- Code you used to build the model.  The more repeatable and self-explanatory,
  the better.

- A presentation including the following points:
  - How did you compute the target?
  - What model did you use in the end? Why?
  - Which alternative models did you consider? Why were they not good enough?
  - What performance metric did you use to evaluate the *model*? Why?
  - **Based on insights from the model, what plans do you propose to
    reduce churn?**
  - What are the potential impacts of implementing these plans or decisions?
    What performance metrics did you use to evaluate these *decisions*? Why?

### Numerical Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

### Learning Libraries

### Load in Data

In [2]:
df = pd.read_csv('data/churn_train.csv')

#### Convert times (currently in string) to datetimes

In [3]:
df['last_trip_date'] = pd.to_datetime(df['last_trip_date'])
df['signup_date'] = pd.to_datetime(df['signup_date'])

In [4]:
df.head()

Unnamed: 0,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,city,last_trip_date,phone,signup_date,surge_pct,trips_in_first_30_days,luxury_car_user,weekday_pct
0,6.94,5.0,5.0,1.0,Astapor,2014-05-03,Android,2014-01-12,0.0,0,False,100.0
1,8.06,5.0,5.0,1.0,Astapor,2014-01-26,Android,2014-01-25,0.0,2,True,0.0
2,21.5,4.0,,1.0,Winterfell,2014-05-21,iPhone,2014-01-02,0.0,1,True,100.0
3,9.46,5.0,,2.75,Winterfell,2014-01-10,Android,2014-01-09,100.0,1,False,100.0
4,13.77,5.0,,1.0,Winterfell,2014-05-13,iPhone,2014-01-31,0.0,0,False,100.0


## Problem Description

A ride-sharing company (Company X) is interested in predicting rider retention.
To help explore this question, we have provided a sample dataset of a cohort of
users who signed up for an account in January 2014. The data was pulled on July
1, 2014; we consider a user retained if they were “active” (i.e. took a trip)
in the preceding 30 days (from the day the data was pulled). In other words, a
user is "active" if they have taken a trip since June 1, 2014. The data,
`churn.csv`, is in the [data](data) folder.  The data are split into train and
test sets.  You are encouraged to tune and estimate your model's performance on
the train set, then see how it does on the unseen data in the test set at the
end.

- The 'Feature Importance' discussion is in the Random Forests lecture
- It is also discussed on page 262 of hands-on

#### A user is "active" if they have taken a trip since June 1, 2014.

In [5]:
cutoff_date = '2014-06-01'
cutoff_date = pd.to_datetime(cutoff_date)

In [6]:
df['active'] = (df['last_trip_date'] >= cutoff_date).astype(int)
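
The June 1 cutoff can also be derived from the pull date instead of being typed in by hand. The sketch below is one way to do that on a tiny made-up frame; `pull_date`, `window`, `derived_cutoff`, and `feature_cols` are names introduced here for illustration. It also leaves `last_trip_date` out of the feature set, since the target is computed directly from that column and keeping it would leak the answer into the model.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the training data.
sample = pd.DataFrame({
    "last_trip_date": pd.to_datetime(["2014-05-03", "2014-06-01", "2014-06-29"]),
    "avg_dist": [6.94, 8.06, 21.5],
})

pull_date = pd.Timestamp("2014-07-01")   # day the data was pulled
window = pd.Timedelta(days=30)           # "active" window from the problem statement
derived_cutoff = pull_date - window      # 2014-06-01

sample["active"] = (sample["last_trip_date"] >= derived_cutoff).astype(int)

# last_trip_date defines the target, so it must not be used as a feature.
feature_cols = [c for c in sample.columns if c not in ("last_trip_date", "active")]
X, y = sample[feature_cols], sample["active"]
```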

In [7]:
df.head()

Unnamed: 0,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,city,last_trip_date,phone,signup_date,surge_pct,trips_in_first_30_days,luxury_car_user,weekday_pct,active
0,6.94,5.0,5.0,1.0,Astapor,2014-05-03,Android,2014-01-12,0.0,0,False,100.0,0
1,8.06,5.0,5.0,1.0,Astapor,2014-01-26,Android,2014-01-25,0.0,2,True,0.0,0
2,21.5,4.0,,1.0,Winterfell,2014-05-21,iPhone,2014-01-02,0.0,1,True,100.0,0
3,9.46,5.0,,2.75,Winterfell,2014-01-10,Android,2014-01-09,100.0,1,False,100.0,0
4,13.77,5.0,,1.0,Winterfell,2014-05-13,iPhone,2014-01-31,0.0,0,False,100.0,0


In [8]:
df['active'].mean()

0.3758

In [9]:
np.sort(df['avg_rating_of_driver'].unique())

array([1. , 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6,
       2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9,
       4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5. , nan])

#### Let's look at avg_rating_of_driver = NaN values

In [10]:
df[df['avg_rating_of_driver'].isnull()]

Unnamed: 0,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,city,last_trip_date,phone,signup_date,surge_pct,trips_in_first_30_days,luxury_car_user,weekday_pct,active
2,21.50,4.0,,1.00,Winterfell,2014-05-21,iPhone,2014-01-02,0.0,1,True,100.0,0
3,9.46,5.0,,2.75,Winterfell,2014-01-10,Android,2014-01-09,100.0,1,False,100.0,0
4,13.77,5.0,,1.00,Winterfell,2014-05-13,iPhone,2014-01-31,0.0,0,False,100.0,0
5,14.51,5.0,,1.00,Astapor,2014-04-22,iPhone,2014-01-29,0.0,0,True,100.0,0
10,3.96,5.0,,2.00,Winterfell,2014-01-19,iPhone,2014-01-18,100.0,1,False,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
39966,8.60,5.0,,1.00,Winterfell,2014-05-26,iPhone,2014-01-31,0.0,0,False,100.0,0
39971,13.45,5.0,,1.00,King's Landing,2014-02-26,Android,2014-01-08,0.0,0,True,100.0,0
39975,1.16,5.0,,1.00,Astapor,2014-01-18,iPhone,2014-01-18,0.0,1,True,0.0,0
39980,4.48,4.0,,1.00,Astapor,2014-05-11,iPhone,2014-01-21,0.0,1,True,50.0,0


In [20]:
df[df['avg_rating_of_driver'].isnull()].describe()

Unnamed: 0,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,surge_pct,trips_in_first_30_days,weekday_pct,active
count,6528.0,6471.0,0.0,6528.0,6528.0,6528.0,6528.0,6528.0
mean,7.458338,4.783959,,1.079312,8.693474,0.594516,58.917831,0.203891
std,7.778382,0.62555,,0.312519,26.766206,0.60937,46.79959,0.40292
min,0.0,1.0,,1.0,0.0,0.0,0.0,0.0
25%,2.41,5.0,,1.0,0.0,0.0,0.0,0.0
50%,4.67,5.0,,1.0,0.0,1.0,100.0,0.0
75%,10.6,5.0,,1.0,0.0,1.0,100.0,0.0
max,160.96,5.0,,5.0,100.0,5.0,100.0,1.0


### Make a column, avg_rating_of_driver_nan, that is 1 if the rider hasn't given out a rating and 0 otherwise

In [23]:
df['avg_rating_of_driver_nan'] = df['avg_rating_of_driver'].isnull().astype(int)

#### Let's look at avg_rating_by_driver = NaN values

In [18]:
df[df['avg_rating_by_driver'].isnull()].describe()

Unnamed: 0,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,surge_pct,trips_in_first_30_days,weekday_pct,active
count,162.0,0.0,105.0,162.0,162.0,162.0,162.0,162.0
mean,6.337037,,4.509524,1.181049,17.078395,0.567901,53.08642,0.197531
std,13.425184,,0.95066,0.492734,37.56552,0.544613,50.059393,0.399371
min,0.0,,1.0,1.0,0.0,0.0,0.0,0.0
25%,1.9725,,4.0,1.0,0.0,0.0,0.0,0.0
50%,3.175,,5.0,1.0,0.0,1.0,100.0,0.0
75%,6.43,,5.0,1.0,0.0,1.0,100.0,0.0
max,160.96,,5.0,4.0,100.0,2.0,100.0,1.0


In [24]:
### Make a column, avg_rating_by_driver_nan, that is 1 if the rider hasn't been given a rating and 0 otherwise

In [25]:
df['avg_rating_by_driver_nan'] = df['avg_rating_by_driver'].isnull().astype(int)

In [26]:
df.head()

Unnamed: 0,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,city,last_trip_date,phone,signup_date,surge_pct,trips_in_first_30_days,luxury_car_user,weekday_pct,active,avg_rating_of_driver_nan,avg_rating_by_driver_nan
0,6.94,5.0,5.0,1.0,Astapor,2014-05-03,Android,2014-01-12,0.0,0,False,100.0,0,0,0
1,8.06,5.0,5.0,1.0,Astapor,2014-01-26,Android,2014-01-25,0.0,2,True,0.0,0,0,0
2,21.5,4.0,,1.0,Winterfell,2014-05-21,iPhone,2014-01-02,0.0,1,True,100.0,0,1,0
3,9.46,5.0,,2.75,Winterfell,2014-01-10,Android,2014-01-09,100.0,1,False,100.0,0,1,0
4,13.77,5.0,,1.0,Winterfell,2014-05-13,iPhone,2014-01-31,0.0,0,False,100.0,0,1,0


In [27]:
df.describe()

Unnamed: 0,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,surge_pct,trips_in_first_30_days,weekday_pct,active,avg_rating_of_driver_nan,avg_rating_by_driver_nan
count,40000.0,39838.0,33472.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0
mean,5.791302,4.777434,4.601697,1.074956,8.857342,2.2807,60.874382,0.3758,0.1632,0.00405
std,5.708056,0.448088,0.61481,0.222427,20.014008,3.811289,37.089619,0.484335,0.369553,0.063511
min,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.42,4.7,4.3,1.0,0.0,0.0,33.3,0.0,0.0,0.0
50%,3.88,5.0,4.9,1.0,0.0,1.0,66.7,0.0,0.0,0.0
75%,6.93,5.0,5.0,1.05,8.3,3.0,100.0,1.0,0.0,0.0
max,160.96,5.0,5.0,8.0,100.0,125.0,100.0,1.0,1.0,1.0


### What to do with the NaN values? Should we convert them to 0, or to the column's average value?

- Most ratings are at or near 5 stars: the medians are 5.0 (avg_rating_by_driver) and 4.9 (avg_rating_of_driver), and both means are above 4.5
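
One option, sketched below on made-up values, is to fill the missing ratings with the column median while keeping the `_nan` indicator columns created earlier, so a model can still tell imputed ratings from real ones. Filling with 0 would place those riders well below the lowest real rating of 1. Whether median imputation is the right choice here is an assumption to test, not a conclusion from the data.

```python
import numpy as np
import pandas as pd

# Made-up ratings; NaN means the rider never rated a driver.
sample = pd.DataFrame({"avg_rating_of_driver": [5.0, np.nan, 4.0, 4.5, np.nan]})

# Record missingness first, otherwise the information is lost after filling.
sample["avg_rating_of_driver_nan"] = sample["avg_rating_of_driver"].isnull().astype(int)

# Compute the fill value on the training data only, then reuse it on the test set.
train_median = sample["avg_rating_of_driver"].median()
sample["avg_rating_of_driver"] = sample["avg_rating_of_driver"].fillna(train_median)
```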

### Look at 'phone' column
- Do Android or iPhone users churn at a higher rate?
- What about the NaN values?

In [28]:
df['phone'].unique()

array(['Android', 'iPhone', nan], dtype=object)

In [34]:
df['phone'].isnull().sum()

319
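
A minimal sketch for answering both questions: label the missing phones as their own category, then compare the share of active riders per group. The frame below is made up, and `"Unknown"` is a placeholder label chosen here, not a value from the data.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "phone": ["Android", "iPhone", np.nan, "iPhone", "Android", "iPhone"],
    "active": [0, 1, 0, 1, 0, 0],
})

# Treat missing phones as a separate group instead of dropping those riders.
sample["phone"] = sample["phone"].fillna("Unknown")

# Mean of a 0/1 column is the retention rate; churn is its complement.
summary = sample.groupby("phone")["active"].agg(riders="count", retention="mean")
summary["churn"] = 1 - summary["retention"]
print(summary)
```

With only 319 missing values in the real data, the rate for the missing group will be noisier than the rates for the two named platforms.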

### Look at 'city' column

In [35]:
df['city'].unique()

array(['Astapor', 'Winterfell', "King's Landing"], dtype=object)
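
Since `city` takes only three values, it can be turned into indicator columns before modeling. The sketch below uses `pd.get_dummies` on a made-up frame; dropping the first level is one common choice for linear models and is an assumption here, not something this notebook prescribes.

```python
import pandas as pd

sample = pd.DataFrame({"city": ["Astapor", "Winterfell", "King's Landing", "Astapor"]})

# One indicator column per city; drop_first=True removes the redundant
# first level (Astapor), which becomes the implicit baseline.
encoded = pd.get_dummies(sample, columns=["city"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```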