# Bentley Capstone Progress Report + Preliminary Findings

Thus far most of my work has been in the EDA process, partly because the project is largely descriptive in nature. This step has been fraught with frustrations, many of which are ongoing.

There are a few prongs to this project:
1. Descriptive Analysis and Visualization
2. Classification modeling on what makes a long-term versus short-term member
3. Text analysis on PSP's Slack, Yahoo Groups, and website discussions
4. Expansion study looking at demographic shifts in the areas the group has grown into, to see which neighborhoods are becoming increasingly similar

##### Descriptive Analysis and Visualization
This part is done at this point and will be what you mostly see below. It is also all up on my website. This was actually an issue with the client as she did not want her results to be publicized. For that reason you'll see me calling it "Parkside Parenting Partnership" instead of "Park Slope Parents."

I've had a couple of barriers here. The most substantial is that the client's dashboard from the website host gives her different membership numbers than her database. This resulted in a lot of repeated work as the data show that she lost about a thousand members in the last year while her report showed that she was fairly flat. Her report also shows her as having about a thousand fewer members than the data does.

This is still unresolved but I've told her that we can address the disparity after the project is over.

There was another round of revisiting after I asked about big changes in the data seen in 2007 and 2009. She let me know that they got a "real" website in 2007 and started charging for membership in 2009. This basically meant that any data before 2010 was worthless. Unfortunately everyone who had been a member before they started charging was grandfathered in as a lifetime member so I couldn't really cut them out. 

##### Classification modeling
This is going to be my next step. I had planned on doing the text analysis first, but it is running into technical difficulties that I would rather push off, so that if I can't address them as fully as I hope, it won't interfere with completing the project.

I have nothing to show for this yet.


##### Text analysis
This has become a locus of "excitement if I had the time for it."

Of the three text sources I will be using, even the smallest, Slack (about 100M), is too big to work with on my computer, so I will need to move to cloud computing. Alex also told me about Dask, which I'll check out, but I'm afraid it might not be enough when I get to the larger sets.

Slack had an interesting twist, actually. In order to download the information I had to be given admin rights and join the group. It was a group of new mothers who were very upset that this guy had suddenly shown up and wasn't interacting at all. After Susan, the client, told me about it, I went in and wrote to the group to explain what I was doing and to let them know that I wasn't actually reading anything they wrote (apparently there are a lot of bodily fluids involved in most of their conversations). It was an interesting insight into informed participation.

I decided not to tell them that if they thought anything they did online was actually private and secure they were foolish. 

So the next barrier with this is that the network's website has now told me that they will not be able to supply me with the discussion data which means I'm going to have to scrape it all down. I'm hoping that they will be able to get me some top-line usage so that I can at least use frequency of participation on the site as part of my classifier. My hopes are not high.


##### Expansion

The final piece of this project I am now certain will not be attainable during the time allotted by the course. This would be looking at demographics in the areas that my client's group has expanded into and seeing what neighborhoods are becoming increasingly similar.

## Data cleaning
I assume you assume data cleaning is an arduous process and that you would be as happy as not to just not have to hear me complain about how many ways I had to correct the spelling of Brooklyn (38).

I also assume that if you really REALLY want to find out about it you can look at my website.

One thing that I will say about my data cleaning step is that I decided to break my notebooks up so that I didn't have to run everything over and over again and have to search so much through everything (I just found out about the clickable TOC and will be using the **** out of that in the future). So I did all of the cleanup and then exported to csv so that I could then just open it in other notebooks. 

I'd originally tried to export as a dbf but that was crazy to figure out.

## EDA

There have been two big barriers with my EDA. 

One is that my client didn't tell me until today that there were major changes in the organization in 2007 and 2009 that make any trend including those years misleading. Unfortunately most of the members from before 2009 were grandfathered into lifetime membership, so I couldn't really remove them from any overall observations on the data.

I have been able to exclude them from yearly trends, though, which has been a blessing...to a point.

Because people are members over a number of years, I need to isolate who was a member in any given year. I had thought of a very clever way to do this (I thought), which ended up only capturing the first year of membership. I was then able to write code, with help from Stack Overflow (and man do some folks get their hackles up when you tell them what they suggested doesn't work!), that isolated membership in every year except for 2016. That one is still eluding me and is basically the barrier to my completing this portion.

This has been especially frustrating because I have twice now thought that the issue was resolved, moved forward with analysis and visualizing, and then realized I needed to go back to the drawing board. 

I also decided a few days ago to stop banging my head against walls of visualizations that I can't get right in Python and just do them in Excel. I'll get back around to learning how to do them in Python later, but I've got to keep an eye on the deadline too much now to be a purist about such things.



In [2]:
# Load libraries
from datetime import datetime, date, timedelta
import csv
import pandas as pd 
import plotly.graph_objs as go
import plotly.plotly as py
from plotly.graph_objs import graph_objs
from IPython.display import Image
from IPython.core.display import HTML 


# read in data

dfn = pd.read_csv("../../projects/psp/refined_data/psp_numerical.csv")


Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
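
The warning above is pandas reporting that it inferred more than one type for the first column while reading the file in chunks. A minimal sketch of the two remedies the message names, run on a tiny in-memory CSV rather than the real file. The column names `mem_no` and `joined` come from later cells; treating `mem_no` as the first column, and the sample values, are assumptions.

```python
import io
import pandas as pd

# Stand-in for psp_numerical.csv. The first column is assumed (not confirmed)
# to be mem_no, holding a mix of numeric and text IDs.
csv_text = "mem_no,joined\n1001,2015-03-02\nA17,2016-06-11\n1003,2016-07-19\n"

# Option 1: declare the column's type up front so pandas never has to guess,
# and parse the date column while reading.
df_typed = pd.read_csv(io.StringIO(csv_text),
                       dtype={"mem_no": str},
                       parse_dates=["joined"])

# Option 2: let pandas read the whole file in one pass before inferring types.
df_onepass = pd.read_csv(io.StringIO(csv_text), low_memory=False)

print(df_typed.dtypes)
```

Declaring the type is the more predictable of the two, since the column then has the same type no matter how the file is chunked.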



### EDA Step 1: Full Data Set Averages

#### Q 1: What are membership levels like from year to year?

This core question has not yet been answerable because of my membership year isolation issue. The first time I ran it, it looked like membership had fallen by 30% year to year. The next time it looked like it had grown 4% (which I assumed was correct until I saw it wasn't). The most recent incarnation has membership doubling this year.

Below is the code I have been working with:

In [None]:
# This doesn't work for 2016 because it doesn't recognize that anything with an
# exp date past 2016 should be assigned to 2016.

dfy['joined'] = pd.to_datetime(dfy['joined'])
dfy['exp_date'] = pd.to_datetime(dfy['exp_date'])

year_list = list(range(2002, 2016))  # 2002 through 2015

for year in year_list:
    # 1 if the member joined in or before this year and expires in or after it
    dfy['mem_' + str(year)] = ((dfy['joined'].dt.year <= year) &
                               (dfy['exp_date'].dt.year >= year)).astype('int')
    
# code to isolate 2016

memlist_2016 = []

for n in dfn.exp_year:
    if n == 2016:
        memlist_2016.append(1)
    else:
        memlist_2016.append(0)
        
m16 = pd.Series(memlist_2016)

# join 2016 to other years
dfy = pd.concat([dfy, m16], axis = 1)
dfy.rename (columns={0:'mem_2016'}, inplace=True)
dfy.head()
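
One way the year-isolation step might be written so that 2016 follows the same rule as every other year: a member counts as active in a year if they joined in or before it and their membership expires in or after it. This is only a sketch on made-up rows, not the project's data. Treating a missing `exp_date` as open-ended (for example, a lifetime member) is an assumption, and the helper name `add_membership_flags` is hypothetical.

```python
import pandas as pd

def add_membership_flags(df, years):
    """Add one 0/1 column per year marking whether each member was active."""
    out = df.copy()
    join_year = pd.to_datetime(out["joined"]).dt.year
    exp_year = pd.to_datetime(out["exp_date"]).dt.year
    # Assumption: a missing expiration date means the membership never lapses.
    exp_year = exp_year.fillna(9999)
    for year in years:
        out["mem_" + str(year)] = ((join_year <= year) & (exp_year >= year)).astype(int)
    return out

# Made-up example rows
members = pd.DataFrame({
    "mem_no": [1, 2, 3, 4],
    "joined": ["2014-05-01", "2015-11-15", "2016-06-30", "2008-01-10"],
    "exp_date": ["2015-05-01", "2018-11-15", "2017-06-30", None],
})

flags = add_membership_flags(members, range(2014, 2017))
print(flags[["mem_no", "mem_2014", "mem_2015", "mem_2016"]])
```

Because the test is `exp_year >= year` rather than `exp_year == year`, a membership expiring in 2018 is still counted for 2016.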

### When do members join?
Typically membership is fairly even through the year with November and December lagging as the lowest months for new members joining. Perhaps because of these softer numbers January tends to be stronger than most for new members.

New families also seem to join in the summer months, largely for nanny recommendations.

This year the summer swell was unusually strong as nearly half of the year's new families joined between May and July.

In [None]:
# count of new members by join year (rows) and join month (columns)
dfn_p = pd.pivot_table(dfn, values='mem_no', index='join_year', columns='join_month',
                       aggfunc=len, fill_value=0)

![join_month_year](https://ajbentley.github.io/assets/images/psp/join_month_year.png)

### How many children do PPP families have?
Most families have had only one child, but the number of two-child families has been very strong as well. I had been toying with the idea of just using 1-child families since that would be easier for some comparisons, but that is clearly not an option here. The good news is that there are very few 3+ families, so the fact that we don't have birth dates past #2 is less of an issue.

Please note that the original distribution for this feature included a "more than four" category. There were few enough families in that category that I merged them with the 4s (to make 4+).

Additionally, the 0.5 column is for families who had a child on the way, through pregnancy or adoption, but were not yet parents.

The mean number of children for Parkside Parenting Partnership families has been 1.37.

In [None]:
# histogram of number of children
data = [go.Histogram(x=dfn.kid_count)]
py.iplot(data)

![](https://ajbentley.github.io/assets/images/psp/psp_kidcount.png?raw=true)

Looking at how the composition of the membership has changed over time in terms of number of children, the Parkside Parenting Partnership is increasingly composed of families with one child.

In [None]:
# creating dataframe to show year to year changes in kid count composition

num_child_cols = ['mem_no','kid_count', 'mem_2010','mem_2011','mem_2012','mem_2013',\
                  'mem_2014','mem_2015','mem_2016']

dfn_num_child = pd.DataFrame(dfn[num_child_cols])

def yearly_counts(df, column, value):
    """Sum every column (including the mem_<year> flags) over rows where df[column] == value."""
    return df.loc[df[column] == value].sum(axis=0)

# count of members by year for each family size
due = yearly_counts(dfn_num_child, 'kid_count', 0.5)           # child on the way
one = yearly_counts(dfn_num_child, 'kid_count', 1.0)
two = yearly_counts(dfn_num_child, 'kid_count', 2.0)
three = yearly_counts(dfn_num_child, 'kid_count', 3.0)
four_or_more = yearly_counts(dfn_num_child, 'kid_count', 4.0)  # 4 is the merged "4+" category

for counts in (due, one, two, three, four_or_more):
    print(counts)

In [None]:
# bringing separate dfs together for final df

df_kids = pd.concat([due, one, two, three, four_or_more], axis=1)
# drop the summed ID and kid_count rows so only the mem_<year> rows remain
df_kids.drop(['mem_no', 'kid_count'], axis=0, inplace=True)
df_kids.columns=('child_due','1 child','2 children','3 children','4 + children')

### exporting to excel to make 100% stacked chart

![](https://ajbentley.github.io/assets/images/psp/kid_count_yr.png?raw=true)
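
A 100% stacked chart needs each year's counts expressed as shares of that year's total. A short sketch of that normalization in pandas, on invented counts (the numbers are placeholders, not results from the data); the column labels mirror those used for `df_kids`.

```python
import pandas as pd

# Placeholder counts: rows are membership-year flags, columns are family sizes.
counts = pd.DataFrame(
    {"child_due": [10, 20], "1 child": [60, 50], "2 children": [30, 30]},
    index=["mem_2015", "mem_2016"],
)

# Divide each row by its own total so every row sums to 100.
shares = counts.div(counts.sum(axis=1), axis=0) * 100
print(shares.round(1))
```

If matplotlib is installed, `shares.plot(kind="bar", stacked=True)` should draw the stacked bars from the same table.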

### When are members' children born?
Birth patterns are very similar for first and second children: in no month do the counts of first and second children born differ by more than 50.

Please note, this does not mean that individual families have children in the same month, just that overall patterns are similar.


In [None]:
# Compare birth patterns by month for first and second child

x0 = dfn.k1bday_month
x1 = dfn.k2bday_month

trace1 = go.Histogram(
    x=x0,
    opacity=0.75
)
trace2 = go.Histogram(
    x=x1,
    opacity=0.75
)
data = [trace1, trace2]
layout = go.Layout(
    barmode='overlay'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)


![](https://ajbentley.github.io/assets/images/psp/psp_birthmonth.png?raw=true)

### How did members find out about the Parkside Parenting Partnership?
Far and away the top means of finding PPP was through a friend/neighbor who is a member of the group.

Unfortunately it looks like "I don't remember / Other" is the second most likely response but #3, Found it through a Google search, is informative.

*Data Dictionary*

|Label |Definition|
|------|----------|
|0 |A PPP member I don't know told me about it|
|1 |A PPP member who is a friend/neighbor|
|2 |Found it through Yahoo|
|3 |Found it through a Google search|
|4 |Heard about it on another online parenting group (Urban Baby, etc.)|
|5 |Heard about it through a magazine, newspaper, blog|
|6 |I don't remember / Other|
|7 |NA|


In [None]:
# histogram of how discovered
data = [go.Histogram(x=dfn.discovered)]
py.iplot(data)

![](https://ajbentley.github.io/assets/images/psp/psp_discovered.png?raw=true)

Finding PPP through a Google search has been fairly consistent from year to year, following a pattern similar to that of new members joining the group.

In 2016 families were much less likely to have found PPP this way until May, and the likelihood dropped off again after July, mirroring the pattern seen for new families joining this year.

In [None]:
# create graph showing when Google was the way the group was found

# google_cols = ['mem_no','joined','discovered_Found it through a Google search']

google_cols = ['mem_no','joined','join_year','join_month','discovered_Found it through a Google search']

df_ggl = pd.DataFrame(dfn[google_cols])

# break into annual columns
df10 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2010])
df10.rename(columns={'join_month':'joined_2010'}, inplace=True)
df10g = df10.groupby(['joined_2010'])['join_year'].count()

df11 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2011])
df11.rename(columns={'join_month':'joined_2011'}, inplace=True)
df11g = df11.groupby(['joined_2011'])['join_year'].count()

df12 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2012])
df12.rename(columns={'join_month':'joined_2012'}, inplace=True)
df12g = df12.groupby(['joined_2012'])['join_year'].count()

df13 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2013])
df13.rename(columns={'join_month':'joined_2013'}, inplace=True)
df13g = df13.groupby(['joined_2013'])['join_year'].count()

df14 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2014])
df14.rename(columns={'join_month':'joined_2014'}, inplace=True)
df14g = df14.groupby(['joined_2014'])['join_year'].count()

df15 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2015])
df15.rename(columns={'join_month':'joined_2015'}, inplace=True)
df15g = df15.groupby(['joined_2015'])['join_year'].count()

df16 = pd.DataFrame(df_ggl.loc[df_ggl['join_year'] == 2016])
df16.rename(columns={'join_month':'joined_2016'}, inplace=True)
df16g = df16.groupby(['joined_2016'])['join_year'].count()

df_ggl = pd.concat([df10g, df11g, df12g, df13g, df14g, df15g, df16g], axis=1)
df_ggl.columns = ['joined_2010', 'joined_2011', 'joined_2012', 'joined_2013',\
                  'joined_2014', 'joined_2015', 'joined_2016']
df_ggl['month']=('Jan','Feb','Mar','Apr','May','June','July','Aug','Sept','Oct','Nov','Dec')
df_ggl

![](https://ajbentley.github.io/assets/images/psp/google_found_year.png?raw=true)
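
The cell that builds this chart keeps the Google indicator column in `google_cols` but, as written, does not appear to filter on it, so it groups every new member by join month. If the intent is to count only families who found the group through Google, one option is to filter on that column first and then pivot, which also replaces the seven per-year blocks. A sketch on invented rows, assuming the indicator is a 0/1 dummy column:

```python
import pandas as pd

google_col = "discovered_Found it through a Google search"

# Invented rows; the 0/1 coding of the indicator column is an assumption.
joins = pd.DataFrame({
    "mem_no": [1, 2, 3, 4, 5],
    "join_year": [2015, 2015, 2016, 2016, 2016],
    "join_month": [1, 1, 5, 5, 6],
    google_col: [1, 0, 1, 1, 0],
})

# Keep only members who reported finding the group through Google,
# then count them by join month (rows) and join year (columns).
google_joins = joins[joins[google_col] == 1]
google_by_month = pd.pivot_table(google_joins, values="mem_no",
                                 index="join_month", columns="join_year",
                                 aggfunc="count", fill_value=0)
print(google_by_month)
```

Months with no Google joiners drop out of the index here; reindexing on `range(1, 13)` with `fill_value=0` would restore them if all twelve rows are wanted.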

Guess who was paying attention in class today when you were talking about Google Trends?

Interestingly, the trend in searches for the group as reported by Google is very different from the responses from families joining PPP. Searches in general have been on the decline, though this may merely suggest that people say they found PPP through Google when they just mean that they found it online somehow.

![](https://ajbentley.github.io/assets/images/psp/psp_google_search.png?raw=true)

### How popular is each type of membership?
Far and away the top membership option is annual.

|Label |Definition|
|------|----------|
|0 |1 year membership ($40)|
|1 |2 Year Membership ($75)|
|2 |3 year membership ($110)|
|3 |5 year membership ($175)|
|4 |Complimentary|
|5 |Lifetime Member|
|6 |Trial Membership|


In [None]:
# histogram of membership type
data = [go.Histogram(x=dfn.mem_type)]
py.iplot(data)

![](https://ajbentley.github.io/assets/images/psp/psp_memtype.png?raw=true)

In terms of percent of all families, 1-year memberships have been on a slow decline over the years with 2-year memberships on the rise.


In [None]:
# make dataframes for all membership types trends

mem_type_cols = ['mem_no','mem_type', 'mem_2010','mem_2011','mem_2012','mem_2013',\
                  'mem_2014','mem_2015','mem_2016']

dfn_mem_type = pd.DataFrame(dfn[mem_type_cols])

def yearly_counts(df, column, value):
    """Sum every column (including the mem_<year> flags) over rows where df[column] == value."""
    return df.loc[df[column] == value].sum(axis=0)

# count of members by year for each membership type (codes from the table above)
smem1 = yearly_counts(dfn_mem_type, 'mem_type', 0)     # 1 year membership
smem2 = yearly_counts(dfn_mem_type, 'mem_type', 1)     # 2 year membership
smem3 = yearly_counts(dfn_mem_type, 'mem_type', 2)     # 3 year membership
smem5 = yearly_counts(dfn_mem_type, 'mem_type', 3)     # 5 year membership
smemlife = yearly_counts(dfn_mem_type, 'mem_type', 5)  # lifetime membership

print(smem1)


In [None]:
# merge dfs

df_memt = pd.concat([smem1, smem2, smem3, smem5, smemlife], axis=1)
df_memt.drop(['mem_no','mem_type'], axis=0, inplace=True)
df_memt.columns=('1yr_mem', '2yr_mem', '3yr_mem','5yr_mem','lifetime')

### exporting to excel to make 100% stacked chart

![](https://ajbentley.github.io/assets/images/psp/mem_type_yr.png?raw=true)

### How long do people keep their memberships?
Most members have only stayed for a year.

Still, enough have multi-year memberships that the average is 2.83 years.

In [None]:
# histogram of membership length
data = [go.Histogram(x=dfn.mem_duration)]
py.iplot(data)

![](https://ajbentley.github.io/assets/images/psp/psp_memduration.png?raw=true)

### When do members join the Parkside Parenting Partnership, relative to the arrival of their child?
Looking just at members who show a join date at most 12 years after their first child's arrival and less than 2 years before it, we can see that the median member joins about 5 months after their first child is born. This figure includes people who join before their children are due.

About 2/3 of members join after their children arrive.

Among these members the median age of the child at joining is 22 months. That means that half of the members who join after their children have arrived do so after the child is 22 months old.

The 1st quartile in this measurement, the earliest 25% of post-arrival joins, is 4.5 months old. That is a large drop-off and presents a potential target for membership.

Among members who join before their children arrive, the median is 3.6 months before the arrival. The 1st quartile here is 5.3 months, so there isn't as much opportunity to reach out here.

In [None]:
# creating a column that shows the difference between when people join and when their kids are born

dfn['join_v_birth'] = (dfn['joined']-dfn['kid1_bday']).astype('timedelta64[m]')

In [None]:
# The 'm' unit in timedelta64[m] is minutes, not months, so the values need to be divided by 1440 to get
# to the number of days, which can then be converted to months (43200 minutes in a 30-day month).

k = ((sum(dfn.join_v_birth) / len(dfn.join_v_birth)))
print(k)                  # mean gap in minutes
print(k/43200)            # mean gap in 30-day months
print(k/1440)             # mean gap in days
print((k/1440)/30)        # days converted to months; same as k/43200
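
A sketch of an alternative conversion that skips the intermediate minute counts: dividing a timedelta column by a fixed-length `pd.Timedelta` returns plain floats in that unit. The 30-day month matches the 43,200-minute factor used in this notebook, and the dates are invented.

```python
import pandas as pd

# Invented dates for two members
demo = pd.DataFrame({
    "joined": pd.to_datetime(["2016-03-01", "2015-01-01"]),
    "kid1_bday": pd.to_datetime(["2016-01-01", "2015-03-02"]),
})

gap = demo["joined"] - demo["kid1_bday"]       # timedelta Series

# Dividing by a Timedelta returns floats measured in that unit.
gap_days = gap / pd.Timedelta(days=1)
gap_months = gap / pd.Timedelta(days=30)       # same 30-day month as 43200 minutes
gap_minutes = gap / pd.Timedelta(minutes=1)

print(gap_days.tolist(), gap_months.tolist())
```

A negative value means the member joined before the child arrived, the same sign convention as `join_v_birth`.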

In [None]:
# for some reason there are still some strange numbers in here, including people joining the organization when their
# children are in their 20s. In order to exclude outliers I'm going to create a more focused dataframe that will be 
# more useful for the analysis

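dfk = pd.DataFrame(dfn.loc[dfn['join_v_birth'] < 6220800])    # joined less than 12 years after birth (minutes)
dfk = pd.DataFrame(dfk.loc[dfk['join_v_birth'] > -1036800])   # joined less than 2 years before birth (minutes)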
dfk.head()

In [None]:
# histogram of difference between PSP join and 1st child's birth
data = [go.Histogram(x=(dfk.join_v_birth/43200))]
py.iplot(data)

# Reminder: this chart excludes any record that showed a join date over 12 years after the first child's birth or
# over 2 years before the first child's birth.

# On average (median) parents join PSP about 5 months after their first child is born (includes those who join pre).

![](https://ajbentley.github.io/assets/images/psp/psp_joinvbirth.png?raw=true)

Over time we can see that it has become more and more popular for people to join PPP while they are still waiting for their child to be born or adopted, representing over half of the members joining in 2016.

In [None]:
# separating out positives and negatives in join v birth (those who joined pre and post birth)
jvb_p = []
jvb_n = []
jvb_z = []

for n in dfk.join_v_birth:
    if n > 0:
        jvb_p.append(n)
    elif n < 0:
        jvb_n.append(n)
    else:
        jvb_z.append(n)

In [None]:
# about 2/3 of members join after their child is born.

pl = len(jvb_p)
nl = len(jvb_n)
zl = len(jvb_z)

lensum = (pl+nl+zl)
print pl, nl, zl 
print lensum

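# shares of the filtered records, using the computed total rather than a hard-coded count
print(pl / float(lensum))
print(nl / float(lensum))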

df_jvb_len = pd.DataFrame([pl, nl, zl])

In [None]:
jvb_pos = pd.Series(jvb_p)
jvb_neg = pd.Series(jvb_n)
jvb_zed = pd.Series(jvb_z)

In [None]:
jvb_pos.describe()

# among members who join after their first child is born, on average (median) their child is 22 months old. 
# the 1st quartile (lowest 25% of join v birth) joined when their child was 4.5 months old. 

In [None]:
jvb_neg.describe()

# among members who join before their first child is born, on average (median) their child is due in 3.6 months. 
# the 1st quartile (lowest 25% of join v birth) joined when their child was due in 5.3 months. 
# This is a much tighter cluster than for those who join after they have their child.

In [None]:
# create graph showing difference between when a member joined and the age of child by year

join_v_birth_cols = ['mem_no','join_year','join_v_birth']

dfn_join_v_birth = pd.DataFrame(dfn[join_v_birth_cols])

dfn_join_v_birth.describe()

# calculate months from datetime (mult by 43200)
# 1 year = 518400
# 2 years = 1036800
# 3 years = 1555200
# 5 years = 2592000

# joined while child due
df_due = pd.DataFrame(dfn_join_v_birth.loc[dfn_join_v_birth['join_v_birth'] <= 0])
df_due['jvb_range'] = 'joined while due'


# joined while child was under 1 year old
df_one = pd.DataFrame(dfn_join_v_birth.loc[dfn_join_v_birth['join_v_birth'] > 0])
df_one = pd.DataFrame(df_one.loc[df_one['join_v_birth'] <= 518400.0 ])  # 1 year = 518400 minutes
df_one['jvb_range'] = 'joined while child under 1'


# joined while child was 1 - 2 years old
df_two = pd.DataFrame(dfn_join_v_birth.loc[dfn_join_v_birth['join_v_birth'] > 518400])
df_two = pd.DataFrame(df_two.loc[df_two['join_v_birth'] <= 1036800.0 ])
df_two['jvb_range'] = 'joined while child 1-2'


# joined while child was 2 - 3 years old
df_three = pd.DataFrame(dfn_join_v_birth.loc[dfn_join_v_birth['join_v_birth'] > 1036800.0])
df_three = pd.DataFrame(df_three.loc[df_three['join_v_birth'] <= 1555200.0 ])
df_three['jvb_range'] = 'joined while child 2-3'


# joined while child was 3 - 5 years old
df_five = pd.DataFrame(dfn_join_v_birth.loc[dfn_join_v_birth['join_v_birth'] > 1555200.0])
df_five = pd.DataFrame(df_five.loc[df_five['join_v_birth'] <= 2592000.0 ])
df_five['jvb_range'] = 'joined while child 3-5'


# joined while child was 5+ years old
df_older = pd.DataFrame(dfn_join_v_birth.loc[dfn_join_v_birth['join_v_birth'] > 2592000])
df_older['jvb_range'] = 'joined while child 5 or older'


In [None]:
# bring individual dfs together
df_jvb = pd.concat([df_due, df_one, df_two, df_three, df_five, df_older], axis=0)

# create pivot table
dfn_jvbpiv = pd.pivot_table(df_jvb, values='mem_no', index='join_year', columns='jvb_range',\
               aggfunc=len, fill_value=0)

### exporting to excel to make 100% stacked chart

![](https://ajbentley.github.io/assets/images/psp/jvb_yrly.png?raw=true)
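
The age ranges built by hand in the cells above could also be assigned in one step with `pd.cut`, which keeps all of the boundaries and labels in one place. A sketch on invented values, assuming `join_v_birth` is measured in minutes as in this notebook and using the same 518,400-minute year:

```python
import pandas as pd

MIN_PER_YEAR = 518400  # 12 months x 43200 minutes, as in the notebook's comments

# Invented join-vs-birth gaps in minutes
demo = pd.DataFrame({
    "mem_no": [1, 2, 3, 4, 5, 6],
    "join_year": [2015, 2015, 2015, 2016, 2016, 2016],
    "join_v_birth": [-100000, 200000, 600000, 1200000, 2000000, 3000000],
})

edges = [float("-inf"), 0, 1 * MIN_PER_YEAR, 2 * MIN_PER_YEAR,
         3 * MIN_PER_YEAR, 5 * MIN_PER_YEAR, float("inf")]
labels = ["joined while due", "joined while child under 1",
          "joined while child 1-2", "joined while child 2-3",
          "joined while child 3-5", "joined while child 5 or older"]

# right=True (the default) makes each bin (low, high], matching the > / <= tests.
demo["jvb_range"] = pd.cut(demo["join_v_birth"], bins=edges, labels=labels)

# count of members by join year (rows) and age range (columns)
jvb_by_year = pd.crosstab(demo["join_year"], demo["jvb_range"])
print(jvb_by_year)
```

The same edges and an analogous label list would work for the expiration ranges later in the notebook.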

### When do memberships lapse, relative to the arrival of members' latest children?
Looking only at expired memberships, they tend to lapse about 22 months (median) after the member's second child was born (if there is no second child, the first child's birth date was used).

This is a _very_ interesting number as it is also the median age at which new members join (join date versus 1st child's birth). This suggests there may be a "ships in the night" pattern and makes it doubly important that the Parkside Parenting Partnership concentrate on either increasing or emphasizing value for parents of children under 2 years old, in order both to retain members past this age and to bring parents in a little sooner.


In [None]:
# creating a new column which shows the difference between the member's
# expiration date and the 2nd child's birthday (defaults to 1st child if no second)

dfn['exp_v_birth'] = (dfn['exp_date']-dfn['kid2_bday']).astype('timedelta64[m]')
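
The comment in this cell says the calculation falls back to the first child's birthday when there is no second child. If `kid2_bday` is not already back-filled during cleaning (that step is not shown here, so this is an assumption), `fillna` is one way to apply the rule. A sketch on invented dates:

```python
import pandas as pd

# Invented rows: the second family has no second child
demo = pd.DataFrame({
    "exp_date": pd.to_datetime(["2016-01-01", "2016-01-01"]),
    "kid1_bday": pd.to_datetime(["2012-01-01", "2015-01-01"]),
    "kid2_bday": pd.to_datetime(["2014-01-01", None]),
})

# Use the second child's birthday where present, otherwise the first child's.
latest_bday = demo["kid2_bday"].fillna(demo["kid1_bday"])

# Gap in days between the expiration date and that birthday
demo["exp_v_birth_days"] = (demo["exp_date"] - latest_bday) / pd.Timedelta(days=1)
print(demo)
```

Without the fallback, the subtraction gives a missing value for any family with no second child, and those rows are then dropped by the range filters that follow.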

In [None]:
# largest membership length before "lifetime" is 5 years so I will exclude any instances of a membership longer than
# 5 years. Also excluding records with negative numbers here.

dfk2 = pd.DataFrame(dfn.loc[dfn['exp_v_birth'] <= 2628000])
dfk2 = pd.DataFrame(dfk2.loc[dfk2['exp_v_birth'] >= 0])

In [None]:
# breaking out members who have lapsed from members who are still active

df_exp = pd.DataFrame(dfk2.loc[dfk2['status'] == 'Expired'])

In [None]:
# histogram of difference between PSP expiration date and 2nd child's arrival for expired members
data = [go.Histogram(x=(df_exp.exp_v_birth/43200))]
py.iplot(data)

# this is a much more evenly distributed dataset, though still right-skewed.

# On average (median) PSP memberships have lapsed about 22 months after their second child is born (first child if
# only).

# This is a VERY interesting number as it is also the median age at which new members join (join date versus 1st
# child's birth). This suggests there may be a "ships in the night" pattern.

![](https://ajbentley.github.io/assets/images/psp/psp_expvbirth.png?raw=true)

It appears that families are staying a little longer relative to their children's ages, with a slight uptick in expirations after age 5 and slightly fewer among 1-2 year olds.

In [None]:
# create graph showing difference between when a membership lapsed and the age of youngest child (of 2) by year

exp_v_birth_cols = ['mem_no','exp_year','exp_v_birth']

dfn_exp_v_birth = pd.DataFrame(dfn[exp_v_birth_cols])

dfn_exp_v_birth.describe()

# calculate months from datetime (mult by 43200)
# 1 year = 518400
# 2 years = 1036800
# 3 years = 1555200
# 5 years = 2592000


# exped while child was under 1 year old
df_one_exp = pd.DataFrame(dfn_exp_v_birth.loc[dfn_exp_v_birth['exp_v_birth'] <= 518400.0])  # 1 year = 518400 minutes
df_one_exp['exp_range'] = 'exped while child under 1'


# exped while child was 1 - 2 years old
df_two_exp = pd.DataFrame(dfn_exp_v_birth.loc[dfn_exp_v_birth['exp_v_birth'] > 518400])
df_two_exp = pd.DataFrame(df_two_exp.loc[df_two_exp['exp_v_birth'] <= 1036800.0 ])
df_two_exp['exp_range'] = 'exped while child 1-2'


# exped while child was 2 - 3 years old
df_three_exp = pd.DataFrame(dfn_exp_v_birth.loc[dfn_exp_v_birth['exp_v_birth'] > 1036800.0])
df_three_exp = pd.DataFrame(df_three_exp.loc[df_three_exp['exp_v_birth'] <= 1555200.0 ])
df_three_exp['exp_range'] = 'exped while child 2-3'


# exped while child was 3 - 5 years old
df_five_exp = pd.DataFrame(dfn_exp_v_birth.loc[dfn_exp_v_birth['exp_v_birth'] > 1555200.0])
df_five_exp = pd.DataFrame(df_five_exp.loc[df_five_exp['exp_v_birth'] <= 2592000.0 ])
df_five_exp['exp_range'] = 'exped while child 3-5'


# exped while child was 5+ years old
df_older_exp = pd.DataFrame(dfn_exp_v_birth.loc[dfn_exp_v_birth['exp_v_birth'] > 2592000])
df_older_exp['exp_range'] = 'exped while child 5 or older'


In [None]:
df_evb = pd.concat([df_one_exp, df_two_exp, df_three_exp, df_five_exp, df_older_exp], axis=0)
df_evb.head(20)

In [None]:
dfn_evbpiv = pd.pivot_table(df_evb, values='mem_no', index='exp_year', columns='exp_range',\
               aggfunc=len, fill_value=0)

In [None]:
### exporting to excel to make 100% stacked chart

![](https://ajbentley.github.io/assets/images/psp/evb_yrly.png?raw=true)