<center> <h1> Park Slope Parents Membership Project</h1></center>
<center> <h2> Descriptive Analysis and Predictive Model for Membership Longevity </h2></center>
<hr>
<center> <h2> Summary and Technical Analysis</h2></center>
<center> <h3> Winter 2016-2017 </n> A. Joshua Bentley, DSI-2 Capstone</h3></center>

In [1]:
# Load libraries

from datetime import datetime, date, timedelta
import csv
import pandas as pd 
import plotly.graph_objs as go
import plotly.plotly as py
from plotly.graph_objs import graph_objs
# plot.ly credentials withheld for security. Images of results supplied.
import matplotlib.pyplot as plt
%matplotlib inline

## Project Overview
Park Slope Parents, a networking and peer support group for parents living in and around the Park Slope Neighborhood of Brooklyn has seen a flattening of membership and was interested in better understanding when families joined and left the group. Specifically they had these questions:

* When do people typically join PPP relative to their children’s birth dates?
* When do people typically let membership lapse?
* What is the average length of membership?
* Is there seasonality associated with membership?
* How have membership levels changed over the years?
* How has the group’s reach changed over the years?
* How can we continue to grow?

## Methodology
My goal is to address these questions in five phases. The first two are covered in this paper while the remaining phases will be addressed in the future and are beyond the client's initial scope.

Phase I: Descriptive Analysis
Phase II: Modelling Long Term versus Short Term membership
Phase III: NLP
Phase IV: Demographic shifts in Brooklyn for marketing efforts
Phase V: Membership price analysis

## Phase I: Descriptive Analysis
After removing open-response fields (address and reasons for joining the group), which I felt would be more useful in future phases, I dummied out several groups that had string responses where int would be more useful in analysis, then deleted the original column of data. Where the answers were binary (e.g., Yes/No, male/female) I deleted the negative value of the pair.

I chose to take this route with the binary response columns rather than converting the responses to 0 and 1 because it seemed clearer code, though perhaps less "Pythonic."

I added columns for the month and year a member joined, the member's first child was born, and the member's second child was born (if only one child repeated first child's month and year. I created a new column for the year the membership lapsed. While doing this I discovered several outliers. Some were the result of input error without the system throwing up an exception. This resulted in children reported as having been born in the year 2 (002) ranther than 2002 (2002 or 02). 

More difficult to identify and work around were situations where grandparents would join the organization and when they were asked for their childen's birthdays would answer accuratly, meaning they would give their childen's birthdays and not their grandchildren's birthdays. In order to account for this I limited the study to members who had children whose birthdays were between 1995 - 2018.

My first round of EDA was histograms taken directly from the data on month joined, number of children, birth patterns by month, how the group was discovered, and membership type (options were available for 1, 2, 5, and lifetime memberships). Most visualizations are plot.ly initiated in python and then cleaned in plot.ly workshop. Stacked 100% cylander charts (or, as the client calls them, Smartees charts) were made in Excel.

#### Descriptive Analysis Observations and Visualizations
Code here was fairly straighforward and is provided following the write-up. All plot.ly code is commented out as it requires sign-in to run and my credentials have not been provided.

** 1. When do families join? **
Typically membership is fairly even through the year with November and December lagging as the lowest months for new members joining. Perhaps because of these softer numbers January tends to be stronger than most for new families.

New families also seem to join in the summer months. According to PPP this is largely for nanny recommendations.

This year the summer swell was unusually strong as nearly half of the year's new families joined between May and July.

![](https://ajbentley.github.io/assets/images/psp/join_month_year.png?raw=true)



** 2. How many children do PPP families have? **
Most families have had only one child, but the number of two children families has been very strong as well. I had been toying with the idea of just using 1-child families since that would be easier for some comparisons, but that is clearly not an option here. The good news is that there are very few 3+ families so the fact that we don't have birth dates past #2 is less of an issue.

Please note that the original distribution for this feature included a "more than four" category. There were few enough families of that category that I merged them with the 4 (to make 4+)

Additionally the 0.5 column is for families who had a child on the way, through pregnancy or adoption, but were not yet parents.

Mean number of children for Parkside Parenting Partnership families has been 1.37

![](https://ajbentley.github.io/assets/images/psp/psp_kidcount.png?raw=true)


Looking at how the composition of the membership has changed over time in terms of number of children, the Parkside Parenting Partnership is increasingly comprised of families with one child.

![](https://ajbentley.github.io/assets/images/psp/kid_count_yr.png?raw=true)


** 3. When are members' children born? **
Birth patterns are very similar for first and second children, never more than 50 children between number born in each month.

Please note, this does not mean that individual families have children in the same month, just that overall patterns are similar.

This section is laregly here for the benefit of the client and has a considerable bias as when there was no second child the first child's birthdate is used. Regardless, there are enough second children that a more apparent discrepency should be seen if it was somehow different.


![](https://ajbentley.github.io/assets/images/psp/psp_birthmonth.png?raw=true)


** 4. How did members find out about the Parkside Parenting Partnership? **
Far and away the top means of finding PPP was through a friend/neighbor who is a member of the group.

*Data Dictionary*

|Label	 |  Definition|
|--------|------------|
|0  |	 A PPP member I don\'t know told me about it|
|1	 |  A PPPP member who is a friend/neighbor|
|2	 |  Found it through Yahoo|
|3	 |  Found it through a Google search|
|4	 |  Heard about it on another online parenting group (Urban Baby, etc.)|
|5	 |  Heard about it through a magazine, newspaper, blog|
|6	 |  I don't remember / Other|
|7	 |  NA|

![](https://ajbentley.github.io/assets/images/psp/psp_discovered.png?raw=true)


Finding PPP through a Google search was the #3 means of finding PPP and has been fairly consistent in most years, following a similar pattern to joining the group in most years.

In 2016 the likelihood of a family having found PPP this way has been much less until May, dropping off again after July, mirroring the pattern seen for new families joining this year.  

![](https://ajbentley.github.io/assets/images/psp/google_found_year.png?raw=true)

Interestingly, the trend of searches for the group as reported by Google are very different than the responses from families joining PPP. Searches in general have been on the decline, though this may merely suggest that people say they found PPP through Google when they just meant that they found it online somehow.

These data collected from Google Trends

![](https://ajbentley.github.io/assets/images/psp/psp_google_search.png?raw=true)

** 5. How popular is each type of membership? **
In terms of percent of all families, 1-year memberships have been on a slow decline over the years with 2-year memberships on the rise.

When she found out about this the client let me know that she is interested in looking into whether the price point for 2 year memberships is costing the organization more money than it would lose by having only 1 year memberships available. This is something that I will address in Phase V of the study.

![](https://ajbentley.github.io/assets/images/psp/mem_type_yr.png?raw=true)

** 6. How long do people keep their memberships?**
Most members have only stayed for a year.

Still, enough have been members through several years that the average is 2.83 years

![](https://ajbentley.github.io/assets/images/psp/psp_memduration.png?raw=true)


** 7. When do members join the Parkside Parenting Partnership, relative to the arrival of their child?**
Looking just at members who show a join date at most 12 years before the child's arrival or less than 2 years prior, we can see that the average (median) member joins about 5 months after their first child is born. This average includes people who join before their children are due.

About 2/3 of members join after their children arrived.

Among these members the average join date is 22 months old. That means that half of members who join after their children have arrived do so after age 22 months.

The 1st quartile in this measurement, the 25% earliest joins post arrival, was at 4.5 months old. That is a large drop-off and presents a potential target for membership.

Among members who join before their children arrive, they do so 3.6 months before the arrival. The 1st quartile here is 5.3 months so there isn't as much opportunity to reach out here.


![](https://ajbentley.github.io/assets/images/psp/psp_joinvbirth.png?raw=true)

Over time we can see that it has become more and more popular for people to join PPP while they are still waiting for their child to be born or adopted, representing over half of the members joining in 2016.

![](https://ajbentley.github.io/assets/images/psp/jvb_yrly.png?raw=true)

** 8. When do memberships lapse, relative to the arrival of members' latest children? **
Looking only at expired memberships, they tend to last about 22 months after their second child was born (if applicable--if not then 1st child birth date was used).

This is a _very_ interesting number as it is also the median age at which new members join (join date versus 1st child's birth). This suggests there may be a "ships in the night" pattern and makes it doubly important that Parenting Network concentrates on either increasing or emphasizing value for parents of children under 2 years old in order to both retain membership past this age and to bring parents in a little sooner.

![](https://ajbentley.github.io/assets/images/psp/psp_expvbirth.png?raw=true)


It appears that families are staying a little longer relative to their children's ages with a slight uptick in expirations after age 5 and slightly fewer in terms of 1-2 year olds.

![](https://ajbentley.github.io/assets/images/psp/evb_yrly.png?raw=true)

**Technical Challenges**
I hadn't really had to work with datetime information before. Previously when I had to get a piece of information it would just be a year, which I would get by referencing a slice of a string. Here I needed to use the entire date.

What was a major problem was that it was difficult to make calculations off of the timedeltas that were the result of addition / subtraction on datetimes, but division / multiplication of them resulted in integers, though it was not clear what those integers represented. I found that if I divided it by 1440 I would get the number of days it represented.

One thing that I needed to do in order to understand the organization's composition over time, as well as to answer the question of how membership levels changed year to year, was create breakouts for membership in each year. The data I had gave a start date and an end date which generally spanned a number of years. I tried a number of different ways to do this and was generally successful except for 2016, which I could not isolate. 

My first attempt only identified the year the member joined, which wasn't useful as I had that data already.

In [None]:
### first attempt at isolating membership years
### stopped indicating membership after it found a first year

for year in year_list:

    mem_year_list = []
    
    for x in dfy.join_year:
        while x == year:
            if x >= year | x < (year+1):
                mem_year_list.append(1)
            else:
                mem_year_list.append(0)

    myl = pd.Series(mem_year_list)
    dfy[year] = myl
    er = str(year)
    er = ('mem_'+ er[2:])
    dfy.rename (columns={year:er}, inplace=True)

My next attempt at it showed a steep decline in 2016. When I asked the client about it she said that this wasn't true, according to data she had from a report from her site host. I went back and saw that the method I'd used to do this excluded members whose expiration was further out than 2017. When I tried to fix it it showed membership doubling.

In [None]:
annual_mem_cols = ['mem_no','joined', 'exp_date', 'exp_year']
dfy = pd.DataFrame(dfn[annual_mem_cols])

dfy['joined'] = pd.to_datetime(dfy['joined'])
dfy['exp_date'] = pd.to_datetime(dfy['exp_date'])

year_list = [2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012,\
             2013, 2014, 2015, 2016]

for year in year_list:
    
    dfy['mem_' + str(year)] = dfy.apply(lambda x: x['joined'].year <= year and x['exp_date'].\
                                        year>= year, axis=1).astype('int')

I decided that since that method had worked for all years except for 2016 I took 2016 out of that code and made a new code to bring in 2016 data. These figures still didn't jibe with what the client had. After sorting through the code in Excel and picking things out one by one I found that my numbers were correct based on the data provided. The client and I agreed to move forward with these data and to discuss the disconnect with her provider at a later date. 

Below is the code used to isolate members in 2016.

In [None]:
memlist_2016 = []

for n in dfn.exp_year:
    if n >= 2016:
        memlist_2016.append(1)
    else:
        memlist_2016.append(0)

        
m16 = pd.Series(memlist_2016)
        
dfy = pd.concat([dfy, m16], axis = 1)
dfy.rename (columns={0:'mem_2016'}, inplace=True)
dfy.head()

# plot the year to year changes
mem_by_year_list = ['mem_2010','mem_2011','mem_2012','mem_2013','mem_2014','mem_2015','mem_2016']

dfmc = pd.DataFrame(dfn[mem_by_year_list])
dfmc.columns = ['2010', '2011', '2012', '2013', '2014', '2015', '2016']
memcount =[]
memcount = pd.Series(dfmc.sum(axis=0))
print memcount

memcount.plot(kind='line', figsize=(8,4), title="Annual PPP Membership", fontsize=13)

**Nontechnical Challenges**
During EDA I noticed trend changes that needed to be addressed. I contacted the client and was informed that in 2007 they got a website and in 2009 began charging for membership. After exploring year to year data I concluded that these changes were disruptive / significant and chose to cut data trends prior to 2010.

After publishing my Phase I results on my website my client became upset that information about her group were being released to the public. I had not discussed this with her previously and had not thought it would be an issue. As a compromise I changed the name of the group in published work to "Parkside Parenthood Partnership." My hope is that in the future she will allow me to change the name back so that people looking at my portfolio will know that it was real work and not an exercise on dummy data.

As part of my data collection for Phase III I needed to be made an admin for a Slack Group for new parents. The members of the group were upset by someone being "in their midst" when talking about private issues. This was a good experience for this level in my work to learn that if I have to do something similar in the future the members should be advised of the reason for my presence and that I am collecting metadata, not reading individual conversations. It was a very active example of the [Hawthorne effect](https://en.wikipedia.org/wiki/Hawthorne_effect) (an observer effect similar to Heisenberg's uncertainty principle). 

### Phase I Code

In [3]:
# read in data

# dfn = pd.read_csv("../../projects/psp/raw_data/PSP_data_4capstone.csv")
# dfn = pd.read_csv("../../projects/psp/refined_data/psp_numerical.csv")

In [None]:
# convert datetime columns

dfn.joined = pd.to_datetime(dfn.joined, format = '%Y/%m/%d')
dfn.exp_date = pd.to_datetime(dfn.exp_date, format = '%Y/%m/%d')
dfn.last_renewal_date = pd.to_datetime(dfn.last_renewal_date, format = '%Y/%m/%d')
dfn.kid1_bday = pd.to_datetime(dfn.kid1_bday, format = '%m/%d/%y')
dfn.kid2_bday = pd.to_datetime(dfn.kid2_bday, format = '%m/%d/%y')

In [None]:
# setting member number as index
dfn.set_index('mem_no')

# getting rid of a few columns
dfn.drop(["address",'join_reason'], axis=1, inplace=True)

In [None]:
# make a few dummies

status_dummy = pd.get_dummies(dfn['status'], prefix = 'status')
memtype_dummy = pd.get_dummies(dfn['mem_type'], prefix = 'mem_type')
email_dummy = pd.get_dummies(dfn['club_email'], prefix = 'club_email')
dup_dummy = pd.get_dummies(dfn['dup'], prefix = 'dup')
parent_dummy = pd.get_dummies(dfn['parent_status'], prefix = 'parent_status')

advice_dummy = pd.get_dummies(dfn['advice_grp'], prefix = 'advice_grp')
classifieds_dummy = pd.get_dummies(dfn['classifieds'], prefix = 'classifieds')
class_sp_dummy = pd.get_dummies(dfn['classifieds_spouse'], prefix = 'classifieds_spouse')
tony_dummy = pd.get_dummies(dfn['tony_kids'], prefix = 'tony_dids')
disc_dummy = pd.get_dummies(dfn['discovered'], prefix = 'discovered')


dfn = dfn.join(status_dummy)
dfn = dfn.join(memtype_dummy)
dfn = dfn.join(email_dummy)
dfn = dfn.join(dup_dummy)
dfn = dfn.join(parent_dummy)
dfn = dfn.join(advice_dummy)
dfn = dfn.join(classifieds_dummy)
dfn = dfn.join(class_sp_dummy)
dfn = dfn.join(tony_dummy)
dfn = dfn.join(disc_dummy)

In [None]:
# columns for joined month and year

dfn['join_year'] = dfn['joined'].dt.year
dfn['join_month'] = dfn['joined'].dt.month

# columns for exp_date year

dfn['exp_year'] = dfn['exp_date'].dt.year

# # columns for 1st kid's birth month and year
dfn['k1bday_year'] = dfn['kid1_bday'].dt.year
dfn['k1bday_month'] = dfn['kid1_bday'].dt.month

# # columns for 2nd kid's birth month and year
dfn['k2bday_year'] = dfn['kid2_bday'].dt.year
dfn['k2bday_month'] = dfn['kid2_bday'].dt.month

In [None]:
# check dates for out of consideration range. basically for grandparents who are using their children's birth dates,
# not their grandchildren's. Org started in 2002 so will assume anything prior to 1990 will be out of range

dfn = pd.DataFrame(dfn.loc[dfn['k1bday_year'] >= 1995])
dfn = pd.DataFrame(dfn.loc[dfn['k2bday_year'] >= 1995])
dfn = pd.DataFrame(dfn.loc[dfn['k1bday_year'] < 2018])
dfn = pd.DataFrame(dfn.loc[dfn['k2bday_year'] < 2018])

In [None]:
# enumerate some columns that had been dummied so that they come up on histograms

dfn['mem_type'].replace('1 year membership ($40)', 0, inplace=True)
dfn['mem_type'].replace('2 Year Membership ($75)', 1, inplace=True)
dfn['mem_type'].replace('3 year membership ($110)', 2, inplace=True)
dfn['mem_type'].replace('5 year membership ($175)', 3, inplace=True)
dfn['mem_type'].replace('Complimentary', 4, inplace=True)
dfn['mem_type'].replace('Lifetime Member', 5, inplace=True)
dfn['mem_type'].replace('Trial Membership', 6, inplace=True)

dfn['parent_status'].replace('No', 0, inplace=True)
dfn['parent_status'].replace('No, but we are pregnant/adopting', 1, inplace=True)
dfn['parent_status'].replace('Yes', 2, inplace=True)

dfn['discovered'].replace('A PSP member I don\'t know told me about it', 0, inplace=True)
dfn['discovered'].replace('A PSP member who is a friend/neighbor', 1, inplace=True)
dfn['discovered'].replace('Found it through Yahoo', 2, inplace=True)
dfn['discovered'].replace('Found it through a Google search', 3, inplace=True)
dfn['discovered'].replace('Heard about it on another online parenting group (Urban Baby, etc.)', 4, inplace=True)
dfn['discovered'].replace('Heard about it through a magazine, newspaper, blog', 5, inplace=True)
dfn['discovered'].replace('I don\'t remember', 6, inplace=True)
dfn['discovered'].replace('NA', 5, inplace=True)
dfn['discovered'].replace('Other', 6, inplace=True)

** 1. When do members join? **

In [None]:
# historam of month joined
# data = [go.Histogram(x=dfn.join_month)]
# py.iplot(data, thickness=5)

In [None]:
# create pivot table to show count of joins per month per year.
dfn_p = pd.pivot_table(dfn, values='mem_no', index='join_year', columns='join_month',\
               aggfunc=len, fill_value=0)

# exported to Excel to create stacked bar graph

** 2. How many children to PPP families have?**

In [None]:
# histogram of number of children
# data = [go.Histogram(x=dfn.kid_count)]
# py.iplot(data)

In [None]:
# create graph for number of children broken out by annual membership

num_child_cols = ['mem_no','kid_count', 'mem_2010','mem_2011','mem_2012','mem_2013',\
                  'mem_2014','mem_2015','mem_2016']

dfn_num_child = pd.DataFrame(dfn[num_child_cols])

# count of members with child due by year
df_due = pd.DataFrame(dfn_num_child.loc[dfn_num_child['kid_count'] == 0.5])
due_count = df_due.sum(axis=0)
due = pd.Series(due_count)
print due_count

# count of members with 1 child per year
df_one = pd.DataFrame(dfn_num_child.loc[dfn_num_child['kid_count'] == 1.0])
one_count = df_one.sum(axis=0)
one = pd.Series(one_count)
print one_count

# count of members with 2 children by year
df_two = pd.DataFrame(dfn_num_child.loc[dfn_num_child['kid_count'] == 2.0])
two_count = df_two.sum(axis=0)
two = pd.Series(two_count)
print two_count

# count of members with 3 children by year
df_three = pd.DataFrame(dfn_num_child.loc[dfn_num_child['kid_count'] == 3.0])
three_count = df_three.sum(axis=0)
three = pd.Series(three_count)
print three_count

# count of members with 4 children by year
df_four = pd.DataFrame(dfn_num_child.loc[dfn_num_child['kid_count'] == 4.0])
kfour_count = df_four.sum(axis=0)
four_or_more = pd.Series(kfour_count)
print kfour_count

# concat kid count dfs
df_kids = pd.concat([due, one, two, three, four_or_more], axis=1)
df_kids.drop(['kid_count'], axis=0, inplace=True)

# Export to Excel for 100% stacked graph

** 3. When are members' children born?**

In [None]:
# Compare birth patterns by month for first and second child

# x0 = dfn.k1bday_month
# x1 = dfn.k2bday_month

# trace1 = go.Histogram(
#     x=x0,
#     opacity=0.75
# )
# trace2 = go.Histogram(
#     x=x1,
#     opacity=0.75
# )
# data = [trace1, trace2]
# layout = go.Layout(
#     barmode='overlay'
# )
# fig = go.Figure(data=data, layout=layout)
# py.iplot(fig)

** 4. How did members find out about the Parkside Parenting Partnership?**

In [None]:
# histogram of how discovered
# data = [go.Histogram(x=dfn.discovered)]
# py.iplot(data)

In [None]:
# create graph showing when Google was the way the group was found by annual membership

# google_cols = ['mem_no','joined','discovered_Found it through a Google search']

google_cols = ['mem_no','joined','join_year','join_month','discovered_Found it through a Google search']

df_ggl = pd.DataFrame(dfn[google_cols])

# break into annual columns
df10 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2010])
df10.rename(columns={'join_month':'joined_2010'}, inplace=True)
df10g = df10.groupby(['joined_2010'])['join_year'].count()

df11 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2011])
df11.rename(columns={'join_month':'joined_2011'}, inplace=True)
df11g = df11.groupby(['joined_2011'])['join_year'].count()

df12 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2012])
df12.rename(columns={'join_month':'joined_2012'}, inplace=True)
df12g = df12.groupby(['joined_2012'])['join_year'].count()

df13 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2013])
df13.rename(columns={'join_month':'joined_2013'}, inplace=True)
df13g = df13.groupby(['joined_2013'])['join_year'].count()

df14 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2014])
df14.rename(columns={'join_month':'joined_2014'}, inplace=True)
df14g = df14.groupby(['joined_2014'])['join_year'].count()

df15 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2015])
df15.rename(columns={'join_month':'joined_2015'}, inplace=True)
df15g = df15.groupby(['joined_2015'])['join_year'].count()

df16 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2016])
df16.rename(columns={'join_month':'joined_2016'}, inplace=True)
df16g = df16.groupby(['joined_2016'])['join_year'].count()


# combine annual counts into single df
df_ggl = pd.concat([df10g, df11g, df12g, df13g, df14g, df15g, df16g], axis=1)
df_ggl.columns = ['joined_2010', 'joined_2011', 'joined_2012', 'joined_2013',\
                  'joined_2014', 'joined_2015', 'joined_2016']

# change column names
df_ggl['month']=('Jan','Feb','Mar','Apr','May','June','July','Aug','Sept','Oct','Nov','Dec')
df_ggl

# Export to Excel for graphing

In [None]:
# bring in Google Trends data
goog = pd.read_csv("../../projects/psp/raw_data/multiTimeline.csv")

# create graph in plotly

trace = go.Scatter(
    x = goog['Week'],
    y = goog['Park Slope Parents: (New York)']
)

data = [trace]


** 5. How popular is each type of membership? **

In [None]:
# histogram of membership type
# data = [go.Histogram(x=dfn.mem_type)]
# py.iplot(data)

In [None]:
# create graph for membership type broken out by annual membership

mem_type_cols = ['mem_no','mem_type', 'mem_2010','mem_2011','mem_2012','mem_2013',\
                  'mem_2014','mem_2015','mem_2016']

dfn_mem_type = pd.DataFrame(dfn[mem_type_cols])

# count of members with 1 year membership
df_mem1 = pd.DataFrame(dfn_mem_type.loc[dfn_mem_type['mem_type'] == 0])
mem1_count = df_mem1.sum(axis=0)
smem1 = pd.Series(mem1_count)
print mem1_count

# # count of members with 2 year membership
df_mem2 = pd.DataFrame(dfn_mem_type.loc[dfn_mem_type['mem_type'] == 1])
mem2_count = df_mem2.sum(axis=0)
smem2 = pd.Series(mem2_count)
# print due_count

# # count of members with 3 year membership
df_mem3 = pd.DataFrame(dfn_mem_type.loc[dfn_mem_type['mem_type'] == 2])
mem3_count = df_mem3.sum(axis=0)
smem3 = pd.Series(mem3_count)
# print due_count

# # count of members with 5 year membership
df_mem5 = pd.DataFrame(dfn_mem_type.loc[dfn_mem_type['mem_type'] == 3])
mem5_count = df_mem5.sum(axis=0)
smem5 = pd.Series(mem5_count)
# print due_count

# # count of members with lifetime membership
df_memlife = pd.DataFrame(dfn_mem_type.loc[dfn_mem_type['mem_type'] == 5])
memlife_count = df_memlife.sum(axis=0)
smemlife = pd.Series(memlife_count)
# print due_count

# concat annual dfs into single df
df_memt = pd.concat([smem1, smem2, smem3, smem5, smemlife], axis=1)
df_memt.drop(['mem_no','mem_type'], axis=0, inplace=True)

# change column names
df_memt.columns=('1yr_mem', '2yr_mem', '3yr_mem','5yr_mem','lifetime')

# Export to Excel for graphing

** 6. How long do people keep their memberships?

In [None]:
# calculate the average membership duration
dfn['mem_duration'] = dfn['exp_year'] - dfn['join_year']

In [None]:
# histogram of membership length
# data = [go.Histogram(x=dfn.mem_duration)]
# py.iplot(data)

** 7. When do members join the PPP, relative to the arrival of their children?**

In [None]:
# creating a column that shows the difference between when people join and when their kids
# are born
dfn['join_v_birth'] = (dfn['joined']-dfn['kid1_bday']).astype('timedelta64[m]')

In [None]:
# calculating average difference between date joined and when child was born
# once calculation is done, divide by 1440 to get number of days, then by 30 for months
k = ((sum(dfn.join_v_birth) / len(dfn.join_v_birth)))
print k/1440
print ((k/1440)/30)

In [None]:
# exclude outliers

dfk = pd.DataFrame(dfn.loc[dfn['join_v_birth'] < 6220800])
dfk = pd.DataFrame(dfk.loc[dfk['join_v_birth'] > -1036800])
dfk.head()

In [None]:
# histogram of difference between PSP join and 1st child's birth
# data = [go.Histogram(x=(dfk.join_v_birth/43200))]
# py.iplot(data)


In [None]:
# separating out positives and negatives in join v birth (those who joined pre and post birth)
jvb_p = []
jvb_n = []
jvb_z = []

for n in dfk.join_v_birth:
    if n > 0:
        jvb_p.append(n)
    elif n < 0:
        jvb_n.append(n)
    else:
        jvb_z.append(n)
        
# convert pre, post, and zero to Series

jvb_pos = pd.Series(jvb_p)
jvb_neg = pd.Series(jvb_n)
jvb_zed = pd.Series(jvb_z)

In [None]:
# get descriptive data for members who join post-birth
jvb_pos.describe()

In [None]:
# get descriptive data for members who join pre-birth
jvb_pos.describe()

In [None]:
# create graph showing difference between when a member joined and the age of child by year

join_v_birth_cols = ['mem_no','join_year','join_v_birth']

dfn_join_v_birth = pd.DataFrame(dfn[join_v_birth_cols])

dfn_join_v_birth.describe()

# calculate months from datetime (mult by 43200)
# 1 year = 518400
# 2 years = 1036800
# 3 years = 1555200
# 5 years = 2592000

# joined while child due
df_due = pd.DataFrame(dfn_join_v_birth.loc[dfn_join_v_birth['join_v_birth'] <= 0])
df_due['jvb_range'] = 'joined while due'


# joined while child was under 1 year old
df_one = pd.DataFrame(dfn_join_v_birth.loc[dfn_join_v_birth['join_v_birth'] > 0])
df_one = pd.DataFrame(df_one.loc[df_one['join_v_birth'] <= 51840.0 ])
df_one['jvb_range'] = 'joined while child under 1'


# joined while child was 1 - 2 years old
df_two = pd.DataFrame(dfn_join_v_birth.loc[dfn_join_v_birth['join_v_birth'] > 518400])
df_two = pd.DataFrame(df_two.loc[df_two['join_v_birth'] <= 1036800.0 ])
df_two['jvb_range'] = 'joined while child 1-2'


# joined while child was 2 - 3 years old
df_three = pd.DataFrame(dfn_join_v_birth.loc[dfn_join_v_birth['join_v_birth'] > 1036800.0])
df_three = pd.DataFrame(df_three.loc[df_three['join_v_birth'] <= 1555200.0 ])
df_three['jvb_range'] = 'joined while child 2-3'


# joined while child was 3 - 5 years old
df_five = pd.DataFrame(dfn_join_v_birth.loc[dfn_join_v_birth['join_v_birth'] > 1555200.0])
df_five = pd.DataFrame(df_five.loc[df_five['join_v_birth'] <= 2592000.0 ])
df_five['jvb_range'] = 'joined while child 3-5'


# joined while child was 5+ years old
df_older = pd.DataFrame(dfn_join_v_birth.loc[dfn_join_v_birth['join_v_birth'] > 2592000])
df_older['jvb_range'] = 'joined while child 5 or older'


# combine annual dfs
df_jvb = pd.concat([df_due, df_one, df_two, df_five, df_older], axis=0)
# df_kids.drop(['kid_count'], axis=0, inplace=True)
df_jvb.head(20)

# create pivot table 
dfn_jvbpiv = pd.pivot_table(df_jvb, values='mem_no', index='join_year', columns='jvb_range',\
               aggfunc=len, fill_value=0)

# export to Excel for graphing



** 8. When do memberships lapse, relative to the birth of members' last child?**

In [None]:
# creating a new column which shows the difference between the member's
# expiration date and the 2nd child's birthday (defaults to 1st child if no second)

dfn['exp_v_birth'] = (dfn['exp_date']-dfn['kid2_bday']).astype('timedelta64[m]')

In [None]:
# largest membership length before "lifetime" is 5 years so I will exclude any instances of a membership longer than
# 5 years. Also excluding records with negative numbers here.

dfk2 = pd.DataFrame(dfn.loc[dfn['exp_v_birth'] <= 2628000])
dfk2 = pd.DataFrame(dfk2.loc[dfk2['exp_v_birth'] >= 0])

# breaking out members who have lapsed from members who are still active

df_exp = pd.DataFrame(dfk2.loc[dfk2['status'] == 'Expired'])

In [None]:
# histogram of difference between PSP expiration date and 2nd child's arrival for expired members
# data = [go.Histogram(x=(df_exp.exp_v_birth/43200))]
# py.iplot(data)

In [None]:
# create graph showing difference between when a memberbership lapsed and the age of youngest child (of 2) by year

exp_v_birth_cols = ['mem_no','exp_year','exp_v_birth']

dfn_exp_v_birth = pd.DataFrame(dfn[exp_v_birth_cols])

dfn_exp_v_birth.describe()

# calculate months from datetime (mult by 43200)
# 1 year = 518400
# 2 years = 1036800
# 3 years = 1555200
# 5 years = 2592000


# exped while child was under 1 year old
df_one_exp = pd.DataFrame(dfn_exp_v_birth.loc[dfn_exp_v_birth['exp_v_birth'] <= 51840.0])
df_one_exp['exp_range'] = 'exped while child under 1'


# exped while child was 1 - 2 years old
df_two_exp = pd.DataFrame(dfn_exp_v_birth.loc[dfn_exp_v_birth['exp_v_birth'] > 518400])
df_two_exp = pd.DataFrame(df_two_exp.loc[df_two_exp['exp_v_birth'] <= 1036800.0 ])
df_two_exp['exp_range'] = 'exped while child 1-2'


# exped while child was 2 - 3 years old
df_three_exp = pd.DataFrame(dfn_exp_v_birth.loc[dfn_exp_v_birth['exp_v_birth'] > 1036800.0])
df_three_exp = pd.DataFrame(df_three_exp.loc[df_three_exp['exp_v_birth'] <= 1555200.0 ])
df_three_exp['exp_range'] = 'exped while child 2-3'


# exped while child was 3 - 5 years old
df_five_exp = pd.DataFrame(dfn_exp_v_birth.loc[dfn_exp_v_birth['exp_v_birth'] > 1555200.0])
df_five_exp = pd.DataFrame(df_five_exp.loc[df_five_exp['exp_v_birth'] <= 2592000.0 ])
df_five_exp['exp_range'] = 'exped while child 3-5'


# exped while child was 5+ years old
df_older_exp = pd.DataFrame(dfn_exp_v_birth.loc[dfn_exp_v_birth['exp_v_birth'] > 2592000])
df_older_exp['exp_range'] = 'exped while child 5 or older'


# combine annual dfs
df_evb = pd.concat([df_one_exp, df_two_exp, df_five_exp, df_older_exp], axis=0)

# create pivot table
dfn_evbpiv = pd.pivot_table(df_evb, values='mem_no', index='exp_year', columns='exp_range',\
               aggfunc=len, fill_value=0)

# export to excel for graphing



## Phase II: Descriptions and Predictions of Long versus Short Term Members
One of the client's concerns was that membership was flattening (or declining, according to the data she provided me). I thought it would be useful for her to know what features have distinguished long term versus short term membership.



#### Descriptive Analysis Observations and Visualizations
Code here was fairly straighforward and is provided following the write-up. All plot.ly code is commented out as it requires sign-in to run and my credentials have not been provided.

** 1. When do families join? **
Typically membership is fairly even through the year with November and December lagging as the lowest months for new members joining. Perhaps because of these softer numbers January tends to be stronger than most for new families.

New families also seem to join in the summer months. According to PPP this is largely for nanny recommendations.

This year the summer swell was unusually strong as nearly half of