## Exercise 1: Produce a "people" file

Produce a “people” file based on the schema provided. Save it to the working directory as a CSV with a header line.

**Initial data inspection**

Before we begin, let's get an overall sense of the dataset. We'll load each of the 3 CSV files as a pandas DataFrame and conduct a quick inspection.

In [1]:
# First, we need to import pandas
import pandas as pd

# We can set some options to determine how data is displayed in the notebook
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.colheader_justify', 'left')

### Constituent Information: df_cons_info

In [2]:
# Load the Constituent Information ('cons.csv') file as a DataFrame
df_cons_info = pd.read_csv('data/cons.csv')

In [3]:
# Now, let's examine the header line and first 3 rows of data
df_cons_info.head(3)

Unnamed: 0,cons_id,prefix,firstname,middlename,lastname,suffix,salutation,gender,birth_dt,title,employer,occupation,income,source,subsource,userid,password,is_validated,is_banned,change_password_next_login,consent_type_id,create_dt,create_app,create_user,modified_dt,modified_app,modified_user,status,note
0,1,,,Lee,,MD,,E,,vSkSIzEQJdXnqeTTTXSG,,,29716060000.0,google,,3663,_kXcXaoK7i,1,0,0,5958,"Fri, 1983-08-26 06:02:03",1484,6162,"Sun, 2015-12-27 09:28:02",4022,6349,1,
1,2,,,,,II,boFqBKgLlSgEZsFrgCZd,E,"Mon, 2004-11-15",,,,671.7468,facebook,pRzBAZSGNScwCyreCEYr,7125,Ll3ZUxnh*9,0,0,1,4236,"Mon, 1979-03-05 21:08:54",4176,5476,"Tue, 1989-06-20 13:28:57",9010,5698,1,
2,3,,,David,King,,,D,"Fri, 1994-04-08",bxGxufoNzpKvjwNIxgRj,iPUtgXtqIBEaxQxaMMsr,,,,UAWXnALxxBXmwbPibFdw,5202,&@sU8IaE+L,1,0,1,1263,"Fri, 2008-08-22 19:20:28",4702,8239,"Fri, 2020-06-05 18:13:57",8837,1175,1,


The above DataFrame is a little hard to read. Let's fix that by creating a styler object we can use to format the text.

In [4]:
# Create a Styler object and left-align the text
style = df_cons_info.head(3).style
style.set_properties(**{'text-align': 'left'})
style.set_table_styles([{
    'selector': 'th:not(.index_name)',
    'props': 'text-align: left;'
}])

Unnamed: 0,cons_id,prefix,firstname,middlename,lastname,suffix,salutation,gender,birth_dt,title,employer,occupation,income,source,subsource,userid,password,is_validated,is_banned,change_password_next_login,consent_type_id,create_dt,create_app,create_user,modified_dt,modified_app,modified_user,status,note
0,1,,,Lee,,MD,,E,,vSkSIzEQJdXnqeTTTXSG,,,29716063420.773495,google,,3663,_kXcXaoK7i,1,0,0,5958,"Fri, 1983-08-26 06:02:03",1484,6162,"Sun, 2015-12-27 09:28:02",4022,6349,1,
1,2,,,,,II,boFqBKgLlSgEZsFrgCZd,E,"Mon, 2004-11-15",,,,671.7468,facebook,pRzBAZSGNScwCyreCEYr,7125,Ll3ZUxnh*9,0,0,1,4236,"Mon, 1979-03-05 21:08:54",4176,5476,"Tue, 1989-06-20 13:28:57",9010,5698,1,
2,3,,,David,King,,,D,"Fri, 1994-04-08",bxGxufoNzpKvjwNIxgRj,iPUtgXtqIBEaxQxaMMsr,,,,UAWXnALxxBXmwbPibFdw,5202,&@sU8IaE+L,1,0,1,1263,"Fri, 2008-08-22 19:20:28",4702,8239,"Fri, 2020-06-05 18:13:57",8837,1175,1,


In [5]:
# Let's find out the size of this dataset and the data type of each of its columns
df_cons_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700000 entries, 0 to 699999
Data columns (total 29 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   cons_id                     700000 non-null  int64  
 1   prefix                      350304 non-null  object 
 2   firstname                   350244 non-null  object 
 3   middlename                  560213 non-null  object 
 4   lastname                    349314 non-null  object 
 5   suffix                      349541 non-null  object 
 6   salutation                  350021 non-null  object 
 7   gender                      349891 non-null  object 
 8   birth_dt                    349954 non-null  object 
 9   title                       350082 non-null  object 
 10  employer                    349228 non-null  object 
 11  occupation                  350239 non-null  object 
 12  income                      350637 non-null  float64
 13  source        

### Constituent Email Addresses: df_cons_email

***Data Notes***

* Boolean columns (including 'is_primary') in all of these datasets are 1/0 numeric values. 1 means True, 0 means False.
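
As a minimal sketch of what this note means in practice, a 1/0 column can be cast to a real boolean and then used directly as a row filter. The small DataFrame below is a made-up stand-in for df_cons_email, not data from the actual file.

```python
import pandas as pd

# Toy stand-in for df_cons_email; these values are invented for illustration.
emails = pd.DataFrame({
    'cons_email_id': [1, 2, 3],
    'is_primary': [1, 0, 1],
})

# astype(bool) maps 1 -> True and 0 -> False.
emails['is_primary'] = emails['is_primary'].astype(bool)

# A boolean column can be used directly as a row mask.
primary_emails = emails[emails['is_primary']]
print(primary_emails)
```

Casting is optional: comparing the raw column with `== 1` gives the same mask.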

In [6]:
# Load the Constituent Email Addresses file, 'cons_email.csv' as a DataFrame
df_cons_email = pd.read_csv('data/cons_email.csv')

In [7]:
# Again, we'll create a styler object to make the data easier to read
style = df_cons_email.head(3).style
style.set_properties(**{'text-align': 'left'})
style.set_table_styles([{
    'selector': 'th:not(.index_name)',
    'props': 'text-align: left;'
}])

Unnamed: 0,cons_email_id,cons_id,cons_email_type_id,is_primary,email,canonical_local_part,domain,double_validation,create_dt,create_app,create_user,modified_dt,modified_app,modified_user,status,note
0,1,548198,3361,1,xmartinez@vincent.com,,gmail.com,,"Wed, 1994-01-26 23:49:16",4072,9954,"Sat, 2014-04-19 19:10:39",1990,7595,1,
1,2,491137,2474,1,hmiller@haynes.biz,jqCyozTDojYuylQPTHfm,hotmail.com,,"Thu, 1999-12-09 06:18:27",1600,5716,"Sat, 1984-07-14 05:55:27",4686,3248,1,
2,3,413429,5175,1,aaron64@yahoo.com,FCBeBiVoqnnKDWjnllhN,yahoo.com,kRLZexQEwYIMbwFNvQxg,"Wed, 1992-11-18 16:46:27",7358,2685,"Sun, 1995-12-24 13:13:01",3857,7405,1,


In [8]:
# Check the size of the data file and the data types of each of its columns
df_cons_email.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400000 entries, 0 to 1399999
Data columns (total 16 columns):
 #   Column                Non-Null Count    Dtype 
---  ------                --------------    ----- 
 0   cons_email_id         1400000 non-null  int64 
 1   cons_id               1400000 non-null  int64 
 2   cons_email_type_id    1400000 non-null  int64 
 3   is_primary            1400000 non-null  int64 
 4   email                 1400000 non-null  object
 5   canonical_local_part  700029 non-null   object
 6   domain                1400000 non-null  object
 7   double_validation     699825 non-null   object
 8   create_dt             1400000 non-null  object
 9   create_app            1400000 non-null  int64 
 10  create_user           1400000 non-null  int64 
 11  modified_dt           1400000 non-null  object
 12  modified_app          1400000 non-null  int64 
 13  modified_user         1400000 non-null  int64 
 14  status                1400000 non-null  int64 
 15

### Constituent Subscription Status: df_cons_subs

***Data Notes***

* We only care about subscription statuses where chapter_id is 1.
* If an email is not present in this table, it is assumed to still be subscribed where chapter_id is 1.
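
One way to read these two notes together is sketched below: filter the subscription table to chapter 1 first, left-join it onto the emails, and treat a missing status as "still subscribed". The toy frames and the `is_unsub` output column are illustrative assumptions, and the sketch assumes each email has at most one chapter 1 row.

```python
import pandas as pd

# Toy stand-ins; the values are invented for illustration.
emails = pd.DataFrame({'cons_email_id': [1, 2, 3]})
subs = pd.DataFrame({
    'cons_email_id': [1, 2],
    'chapter_id': [1, 4],
    'isunsub': [1, 1],
})

# Keep only chapter 1 statuses before joining.
chapter_one = subs.loc[subs['chapter_id'] == 1, ['cons_email_id', 'isunsub']]

# A left join keeps every email, including those with no chapter 1 row.
status = emails.merge(chapter_one, on='cons_email_id', how='left')

# A missing isunsub means no chapter 1 row exists, so assume still subscribed.
status['is_unsub'] = status['isunsub'].fillna(0).astype(bool)
print(status[['cons_email_id', 'is_unsub']])
```

Email 2 only has a chapter 4 row and email 3 has no row at all, so both end up as subscribed.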

In [9]:
# Now, load the Constituent Subscription Status file, 'cons_email_chapter_subscription.csv', as a DataFrame and repeat the inspection steps.
df_cons_subs = pd.read_csv('data/cons_email_chapter_subscription.csv')

# Creating the styler object to make the data easier to read.
style = df_cons_subs.head(3).style
style.set_properties(**{'text-align': 'left'})
style.set_table_styles([{
    'selector': 'th:not(.index_name)',
    'props': 'text-align: left;'
}])

Unnamed: 0,cons_email_chapter_subscription_id,cons_email_id,chapter_id,isunsub,unsub_dt,modified_dt
0,1,332188,1,1,"Sat, 1971-06-12 15:38:44","Thu, 1990-06-28 10:54:20"
1,2,536526,1,1,"Wed, 2006-07-12 01:50:45","Thu, 1979-09-20 06:02:35"
2,3,134711,1,1,"Tue, 1987-01-06 13:05:15","Sun, 1974-03-03 15:11:50"


In [10]:
# Checking size of the data set and data types of each column
df_cons_subs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350000 entries, 0 to 349999
Data columns (total 6 columns):
 #   Column                              Non-Null Count   Dtype 
---  ------                              --------------   ----- 
 0   cons_email_chapter_subscription_id  350000 non-null  int64 
 1   cons_email_id                       350000 non-null  int64 
 2   chapter_id                          350000 non-null  int64 
 3   isunsub                             350000 non-null  int64 
 4   unsub_dt                            350000 non-null  object
 5   modified_dt                         350000 non-null  object
dtypes: int64(4), object(2)
memory usage: 16.0+ MB


In [11]:
# Creating the styler object to make the data easier to read.
style = df_cons_subs.tail(3).style
style.set_properties(**{'text-align': 'left'})
style.set_table_styles([{
    'selector': 'th:not(.index_name)',
    'props': 'text-align: left;'
}])

Unnamed: 0,cons_email_chapter_subscription_id,cons_email_id,chapter_id,isunsub,unsub_dt,modified_dt
349997,349998,258910,4,1,"Sun, 2008-07-13 04:00:13","Sun, 1972-12-03 00:52:17"
349998,349999,570,1,1,"Sun, 1982-11-28 17:43:05","Fri, 2000-10-20 23:39:51"
349999,350000,270074,2,1,"Sun, 2019-08-11 14:01:34","Mon, 1997-04-28 18:41:48"


Now that we have a general understanding of our dataset, let's list the columns we need to create and where the source data for each one lives. Note that not all of the column names match exactly, so we should confirm with the client that we've mapped them correctly.

New column name / Data location / Expected data type / Description
* code / df_cons_info.source / string / Source code
* created_dt / df_cons_info.create_dt / datetime / Person creation datetime
* updated_dt / df_cons_info.modified_dt / datetime / Person updated datetime

* email / df_cons_email.email / string / Primary email address

* is_unsub / df_cons_subs.isunsub / boolean / Is the primary email address unsubscribed?
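
The mapping for the df_cons_info columns could be applied roughly as follows. This is a sketch on a two-row sample copied from the output above; the format string is an assumption inferred from values such as "Fri, 1983-08-26 06:02:03", and `column_map` and `DT_FORMAT` are hypothetical names.

```python
import pandas as pd

# Two sample rows taken from the df_cons_info output shown earlier.
cons = pd.DataFrame({
    'source': ['google', 'facebook'],
    'create_dt': ['Fri, 1983-08-26 06:02:03', 'Mon, 1979-03-05 21:08:54'],
    'modified_dt': ['Sun, 2015-12-27 09:28:02', 'Tue, 1989-06-20 13:28:57'],
})

# Source column name -> new column name, following the mapping list.
column_map = {
    'source': 'code',
    'create_dt': 'created_dt',
    'modified_dt': 'updated_dt',
}
people = cons.rename(columns=column_map)

# Assumed format: abbreviated weekday, then an ISO-style date and time.
DT_FORMAT = '%a, %Y-%m-%d %H:%M:%S'
for col in ['created_dt', 'updated_dt']:
    people[col] = pd.to_datetime(people[col], format=DT_FORMAT)

print(people.dtypes)
```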

## Observations & Questions: 

* df_cons_info and df_cons_email share the column "cons_id", so we can join on it. They also share the audit columns (create_dt, create_app, create_user, modified_dt, modified_app, modified_user, status, note), so the join key needs to be named explicitly.
* df_cons_info: there are 700,000 constituents.
* df_cons_email: there are 1,400,000 email addresses. We know that a constituent can have more than one email address, but are there email addresses with no matching constituent?
* df_cons_email and df_cons_subs share the column "cons_email_id", so we can join on it. Both also have a modified_dt column.
* Once we've joined all 3 tables into 1 big DataFrame, we want to keep only rows where is_primary is 1 (True) AND chapter_id is either 1 or missing.
* If an email has no row in df_cons_subs, its chapter_id and isunsub will be missing after a left join. Per the data notes, we should assume it is still subscribed to chapter 1 and set its isunsub value to 0 (False).
* We must limit the result to just the columns we want; we still need to decide how and when to do this.
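
The question about email addresses without a matching constituent could be checked with the `indicator` option of `merge`, which labels each row by where it was found. The frames below are invented stand-ins, so the counts say nothing about the real files.

```python
import pandas as pd

# Toy stand-ins; cons_id 99 deliberately has no constituent record.
cons = pd.DataFrame({'cons_id': [1, 2, 3]})
emails = pd.DataFrame({
    'cons_email_id': [10, 11, 12, 13],
    'cons_id': [1, 1, 2, 99],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join.
check = emails.merge(cons[['cons_id']], on='cons_id', how='left', indicator=True)
orphans = check[check['_merge'] == 'left_only']
print(len(orphans), 'email(s) with no matching constituent')

# How many email addresses does each constituent have?
emails_per_cons = emails.groupby('cons_id').size()
print(emails_per_cons)
```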


In [12]:
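# Merge the email and constituent tables. No `on` is passed, so pandas joins on
# every column name the two frames share, not just cons_id.
df_cons_info_email = pd.merge(df_cons_email, df_cons_info, how='left')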
df_cons_info_email.head()

Unnamed: 0,cons_email_id,cons_id,cons_email_type_id,is_primary,email,canonical_local_part,domain,double_validation,create_dt,create_app,create_user,modified_dt,modified_app,modified_user,status,note,prefix,firstname,middlename,lastname,suffix,salutation,gender,birth_dt,title,employer,occupation,income,source,subsource,userid,password,is_validated,is_banned,change_password_next_login,consent_type_id
0,1,548198,3361,1,xmartinez@vincent.com,,gmail.com,,"Wed, 1994-01-26 23:49:16",4072,9954,"Sat, 2014-04-19 19:10:39",1990,7595,1,,,,,,,,,,,,,,,,,,,,,
1,2,491137,2474,1,hmiller@haynes.biz,jqCyozTDojYuylQPTHfm,hotmail.com,,"Thu, 1999-12-09 06:18:27",1600,5716,"Sat, 1984-07-14 05:55:27",4686,3248,1,,,,,,,,,,,,,,,,,,,,,
2,3,413429,5175,1,aaron64@yahoo.com,FCBeBiVoqnnKDWjnllhN,yahoo.com,kRLZexQEwYIMbwFNvQxg,"Wed, 1992-11-18 16:46:27",7358,2685,"Sun, 1995-12-24 13:13:01",3857,7405,1,,,,,,,,,,,,,,,,,,,,,
3,4,347346,4117,1,wyattvincent@hotmail.com,,gmail.com,zSbfmlqXimGyWVBUGdQg,"Sat, 1983-11-26 16:49:14",881,3444,"Sun, 1975-01-19 14:32:56",8713,7713,1,,,,,,,,,,,,,,,,,,,,,
4,5,443000,6781,1,tspencer@hotmail.com,VaQIYlKcUkIywkKKEptD,gmail.com,,"Wed, 2000-11-15 13:28:34",5380,5456,"Sun, 1994-03-13 16:38:37",765,8618,1,,,,,,,,,,,,,,,,,,,,,
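
Every constituent column in the output above is empty. A likely cause is that `pd.merge` joins on all shared column names when `on` is omitted, and the two tables share more than cons_id. The sketch below reproduces the effect on toy data and shows one possible alternative: naming the key and adding suffixes. The suffix names are arbitrary choices, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins that share two column names: cons_id and create_dt.
cons = pd.DataFrame({
    'cons_id': [1, 2],
    'source': ['google', 'facebook'],
    'create_dt': ['Fri, 1983-08-26 06:02:03', 'Mon, 1979-03-05 21:08:54'],
})
emails = pd.DataFrame({
    'cons_email_id': [10, 11],
    'cons_id': [2, 1],
    'email': ['a@example.com', 'b@example.com'],
    'create_dt': ['Wed, 1994-01-26 23:49:16', 'Thu, 1999-12-09 06:18:27'],
})

# Without `on`, the join key is (cons_id, create_dt), so nothing matches.
implicit = pd.merge(emails, cons, how='left')
print(implicit['source'].isna().all())

# Naming the key matches on cons_id alone; suffixes keep both create_dt columns.
explicit = pd.merge(emails, cons, how='left', on='cons_id',
                    suffixes=('_email', '_cons'))
print(explicit[['email', 'source', 'create_dt_email', 'create_dt_cons']])
```

With suffixes in place, the person-level dates would come from the `_cons` columns.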


In [13]:
df_cons_info_email.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1400000 entries, 0 to 1399999
Data columns (total 36 columns):
 #   Column                      Non-Null Count    Dtype  
---  ------                      --------------    -----  
 0   cons_email_id               1400000 non-null  int64  
 1   cons_id                     1400000 non-null  int64  
 2   cons_email_type_id          1400000 non-null  int64  
 3   is_primary                  1400000 non-null  int64  
 4   email                       1400000 non-null  object 
 5   canonical_local_part        700029 non-null   object 
 6   domain                      1400000 non-null  object 
 7   double_validation           699825 non-null   object 
 8   create_dt                   1400000 non-null  object 
 9   create_app                  1400000 non-null  int64  
 10  create_user                 1400000 non-null  int64  
 11  modified_dt                 1400000 non-null  object 
 12  modified_app                1400000 non-null  int64  
 1

In [14]:
# Now merge df_cons_info_email with df_cons_subs; the intended key is cons_email_id
# (no `on` is passed, so pandas joins on every shared column: cons_email_id and modified_dt)
df_cons_info_email_subs = pd.merge(df_cons_info_email, df_cons_subs, how='left')
df_cons_info_email_subs.head()
# Next steps:
# - keep only rows containing a primary email address (where is_primary is 1)
# - keep only rows where chapter_id is 1
# - for emails with no chapter 1 row in df_cons_subs, assume the user is still subscribed and set isunsub to False (0)

Unnamed: 0,cons_email_id,cons_id,cons_email_type_id,is_primary,email,canonical_local_part,domain,double_validation,create_dt,create_app,create_user,modified_dt,modified_app,modified_user,status,note,prefix,firstname,middlename,lastname,suffix,salutation,gender,birth_dt,title,employer,occupation,income,source,subsource,userid,password,is_validated,is_banned,change_password_next_login,consent_type_id,cons_email_chapter_subscription_id,chapter_id,isunsub,unsub_dt
0,1,548198,3361,1,xmartinez@vincent.com,,gmail.com,,"Wed, 1994-01-26 23:49:16",4072,9954,"Sat, 2014-04-19 19:10:39",1990,7595,1,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,491137,2474,1,hmiller@haynes.biz,jqCyozTDojYuylQPTHfm,hotmail.com,,"Thu, 1999-12-09 06:18:27",1600,5716,"Sat, 1984-07-14 05:55:27",4686,3248,1,,,,,,,,,,,,,,,,,,,,,,,,,
2,3,413429,5175,1,aaron64@yahoo.com,FCBeBiVoqnnKDWjnllhN,yahoo.com,kRLZexQEwYIMbwFNvQxg,"Wed, 1992-11-18 16:46:27",7358,2685,"Sun, 1995-12-24 13:13:01",3857,7405,1,,,,,,,,,,,,,,,,,,,,,,,,,
3,4,347346,4117,1,wyattvincent@hotmail.com,,gmail.com,zSbfmlqXimGyWVBUGdQg,"Sat, 1983-11-26 16:49:14",881,3444,"Sun, 1975-01-19 14:32:56",8713,7713,1,,,,,,,,,,,,,,,,,,,,,,,,,
4,5,443000,6781,1,tspencer@hotmail.com,VaQIYlKcUkIywkKKEptD,gmail.com,,"Wed, 2000-11-15 13:28:34",5380,5456,"Sun, 1994-03-13 16:38:37",765,8618,1,,,,,,,,,,,,,,,,,,,,,,,,,
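
The remaining steps (keep primary emails, select the output columns, and save a CSV with a header line) might look like the sketch below. It starts from a made-up, already-joined frame; the column order and the `people.csv` file name are assumptions, since the schema itself is not reproduced in this notebook.

```python
import pandas as pd

# Toy stand-in for the joined and cleaned data; values are invented.
merged = pd.DataFrame({
    'email': ['a@example.com', 'b@example.com', 'c@example.com'],
    'is_primary': [1, 0, 1],
    'code': ['google', 'facebook', None],
    'is_unsub': [False, True, False],
    'created_dt': pd.to_datetime(['1983-08-26 06:02:03',
                                  '1979-03-05 21:08:54',
                                  '2008-08-22 19:20:28']),
    'updated_dt': pd.to_datetime(['2015-12-27 09:28:02',
                                  '1989-06-20 13:28:57',
                                  '2020-06-05 18:13:57']),
})

# Assumed output column order; adjust to match the provided schema.
people_columns = ['email', 'code', 'is_unsub', 'created_dt', 'updated_dt']

# Keep one row per primary email address and only the schema columns.
people = merged.loc[merged['is_primary'] == 1, people_columns]

# index=False drops the row index; the header line is written by default.
# Passing a path such as 'people.csv' (hypothetical name) writes a file instead.
csv_text = people.to_csv(index=False)
print(csv_text.splitlines()[0])
```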


In [15]:
df_cons_info_email_subs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1400000 entries, 0 to 1399999
Data columns (total 40 columns):
 #   Column                              Non-Null Count    Dtype  
---  ------                              --------------    -----  
 0   cons_email_id                       1400000 non-null  int64  
 1   cons_id                             1400000 non-null  int64  
 2   cons_email_type_id                  1400000 non-null  int64  
 3   is_primary                          1400000 non-null  int64  
 4   email                               1400000 non-null  object 
 5   canonical_local_part                700029 non-null   object 
 6   domain                              1400000 non-null  object 
 7   double_validation                   699825 non-null   object 
 8   create_dt                           1400000 non-null  object 
 9   create_app                          1400000 non-null  int64  
 10  create_user                         1400000 non-null  int64  
 11  modified_dt