### Objectives
The purpose of this notebook is to ensure sufficient data quality for a subsequent exploratory data analysis exercise. It also creates reusable helper functions in case this or a similar exercise needs to be repeated in the future.

### Package Imports & Path Setup

In [34]:
import sys
import os
from ydata_profiling import ProfileReport

In [35]:
# enable importing modules from the utils directory

# construct path to utils directory; __file__ is not defined in a notebook,
# so the path is resolved relative to the current working directory
path_to_utils = os.path.abspath(os.path.join(os.getcwd(), 'utils'))

# avoid adding multiple times
if path_to_utils not in sys.path:
    sys.path.append(path_to_utils)

In [36]:
import pandas as pd
from utils.kscleaning import cleaning_pipeline

### Data Import

In [37]:
data = pd.read_excel("./data/Kickstarter.xlsx")

### Data Profiling

In [38]:
# checking size of the dataset, match of columns with data dictionary
print("Original shape:", data.shape)
print("Columns:", data.columns)

Original shape: (15474, 45)
Columns: Index(['id', 'name', 'goal', 'pledged', 'state', 'disable_communication',
       'country', 'currency', 'deadline', 'state_changed_at', 'created_at',
       'launched_at', 'staff_pick', 'backers_count', 'static_usd_rate',
       'usd_pledged', 'category', 'spotlight', 'name_len', 'name_len_clean',
       'blurb_len', 'blurb_len_clean', 'deadline_weekday',
       'state_changed_at_weekday', 'created_at_weekday', 'launched_at_weekday',
       'deadline_month', 'deadline_day', 'deadline_yr', 'deadline_hr',
       'state_changed_at_month', 'state_changed_at_day', 'state_changed_at_yr',
       'state_changed_at_hr', 'created_at_month', 'created_at_day',
       'created_at_yr', 'created_at_hr', 'launched_at_month',
       'launched_at_day', 'launched_at_yr', 'launched_at_hr',
       'create_to_launch_days', 'launch_to_deadline_days',
       'launch_to_state_change_days'],
      dtype='object')


In [39]:
# validating expected data types for data profiling
data.dtypes

id                                      int64
name                                   object
goal                                  float64
pledged                               float64
state                                  object
disable_communication                    bool
country                                object
currency                               object
deadline                       datetime64[ns]
state_changed_at               datetime64[ns]
created_at                     datetime64[ns]
launched_at                    datetime64[ns]
staff_pick                               bool
backers_count                           int64
static_usd_rate                       float64
usd_pledged                           float64
category                               object
spotlight                                bool
name_len                              float64
name_len_clean                        float64
blurb_len                             float64
blurb_len_clean                   

In [40]:
# viewing subset of records
data.head(10).style

Unnamed: 0,id,name,goal,pledged,state,disable_communication,country,currency,deadline,state_changed_at,created_at,launched_at,staff_pick,backers_count,static_usd_rate,usd_pledged,category,spotlight,name_len,name_len_clean,blurb_len,blurb_len_clean,deadline_weekday,state_changed_at_weekday,created_at_weekday,launched_at_weekday,deadline_month,deadline_day,deadline_yr,deadline_hr,state_changed_at_month,state_changed_at_day,state_changed_at_yr,state_changed_at_hr,created_at_month,created_at_day,created_at_yr,created_at_hr,launched_at_month,launched_at_day,launched_at_yr,launched_at_hr,create_to_launch_days,launch_to_deadline_days,launch_to_state_change_days
0,1538064060,MAGIC PIXEL - Bluetooth full color LED display,15000.0,5933.0,failed,False,GB,GBP,2016-03-19 09:31:29,2016-03-19 09:31:32,2015-12-18 03:17:13,2016-02-18 09:31:29,False,66,1.429989,8484.125686,Gadgets,False,8.0,8.0,18.0,14.0,Saturday,Saturday,Friday,Thursday,3,19,2016,9,3,19,2016,9,12,18,2015,3,2,18,2016,9,62,30,30
1,556771080,SmartPi - Turn your Raspberry Pi into a SmartMeter,9000.0,16552.0,successful,False,DE,EUR,2016-04-03 08:05:09,2016-04-03 08:05:10,2016-02-08 09:27:33,2016-02-18 08:05:09,False,131,1.114939,18454.471487,,True,9.0,6.0,23.0,15.0,Sunday,Sunday,Monday,Thursday,4,3,2016,8,4,3,2016,8,2,8,2016,9,2,18,2016,8,9,45,45
2,839314928,PlantSitter - The World's Smartest Plant Monitoring System,60000.0,43234.0,canceled,False,US,USD,2016-03-29 08:01:08,2016-03-28 09:46:41,2016-01-31 05:21:52,2016-02-18 08:01:08,False,632,1.0,43234.0,Gadgets,False,8.0,8.0,22.0,12.0,Tuesday,Monday,Sunday,Thursday,3,29,2016,8,3,28,2016,9,1,31,2016,5,2,18,2016,8,18,40,39
3,681077916,Digital Video LUT Box for Colorblindness Correction,125000.0,1262.0,canceled,False,US,USD,2016-03-19 07:48:02,2016-02-23 09:30:28,2016-01-28 11:21:14,2016-02-18 07:48:02,False,4,1.0,1262.0,Hardware,False,7.0,6.0,24.0,16.0,Saturday,Tuesday,Thursday,Thursday,3,19,2016,7,2,23,2016,9,1,28,2016,11,2,18,2016,7,20,30,5
4,1315415013,help send Object Collection to Norway!,2000.0,2300.0,successful,False,US,USD,2016-03-03 17:00:00,2016-03-03 17:00:00,2016-02-16 10:00:06,2016-02-18 07:00:44,False,29,1.0,2300.0,Experimental,True,6.0,5.0,19.0,13.0,Thursday,Thursday,Tuesday,Thursday,3,3,2016,17,3,3,2016,17,2,16,2016,10,2,18,2016,7,1,14,14
5,836821539,The Tragedy of Mario and Juliet,3000.0,3255.0,successful,False,US,USD,2016-04-18 04:13:25,2016-04-18 04:13:25,2016-01-27 17:18:58,2016-02-18 04:13:25,False,24,1.0,3255.0,Plays,True,6.0,4.0,19.0,14.0,Monday,Monday,Wednesday,Thursday,4,18,2016,4,4,18,2016,4,1,27,2016,17,2,18,2016,4,21,60,60
6,2077265745,Timepiece Pulu,35000.0,823.0,failed,False,US,USD,2016-03-25 21:46:51,2016-03-25 21:46:51,2016-01-24 23:54:17,2016-02-17 21:46:51,False,9,1.0,823.0,Gadgets,False,2.0,2.0,20.0,15.0,Friday,Friday,Sunday,Wednesday,3,25,2016,21,3,25,2016,21,1,24,2016,23,2,17,2016,21,23,37,37
7,2119284588,Stikk- The Gel Pad That Will Stick Anything to Everything!,2000.0,62831.0,successful,False,AU,AUD,2016-03-18 21:25:42,2016-03-18 21:25:42,2016-02-16 08:46:01,2016-02-17 21:25:42,False,2012,0.710177,44621.120406,Gadgets,True,10.0,9.0,15.0,12.0,Friday,Friday,Tuesday,Wednesday,3,18,2016,21,3,18,2016,21,2,16,2016,8,2,17,2016,21,1,30,30
8,1463630983,AHS Theater - Help us light up our stage!,6000.0,6530.0,successful,False,US,USD,2016-04-17 18:44:54,2016-04-17 18:44:55,2016-02-16 14:50:22,2016-02-17 18:44:54,False,69,1.0,6530.0,Spaces,True,9.0,7.0,26.0,18.0,Sunday,Sunday,Tuesday,Wednesday,4,17,2016,18,4,17,2016,18,2,16,2016,14,2,17,2016,18,1,60,60
9,694708905,The Lawn Project,80000.0,0.0,failed,False,CA,CAD,2016-03-18 18:10:49,2016-03-18 18:10:49,2016-02-14 23:26:48,2016-02-17 18:10:49,False,0,0.725834,0.0,Web,False,3.0,3.0,21.0,13.0,Friday,Friday,Sunday,Wednesday,3,18,2016,18,3,18,2016,18,2,14,2016,23,2,17,2016,18,2,30,30


In [7]:
# dropping ID column (no intrinsic meaning) to run data through ydata profiling
data2 = data.drop("id", axis=1)

In [8]:
profile = ProfileReport(data2, title="Raw Profile Report")

In [37]:
# profile.to_file(output_file="Kickstarter Raw Profile.html")


**Observations**
* Missingness is generally low. The column with the most missing data is `category`, at 9%; this will require imputation.
* Other columns with missing values (< 1%) are `name`, `name_len`, `name_len_clean`, `blurb_len`, and `blurb_len_clean`.
* As `name` cannot be meaningfully imputed, it will only be tagged as missing where applicable. This tag can support checks that there is no pattern in the missingness in future data pulls.
* Where `name` is present, `name_len` can be derived and imputed.
* Without access to project blurbs, `blurb_len` and `blurb_len_clean` cannot be meaningfully imputed and will be left null. If these are identified as features for modeling, observations missing these attributes will be excluded.
* For comparability, a `usd_goal` variable will be created, analogous to `usd_pledged`. This can be done using `static_usd_rate`.
* `category` has relatively high cardinality, with 23 distinct values in the dataset.
* The sample does not include any in-progress projects, as per the `state` column.
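The checks and derived columns described in these observations can be sketched as follows. This is a minimal illustration on a toy frame, not the code in `utils.kscleaning`: the `name_missing` column name is hypothetical, and the `usd_goal` formula assumes that `usd_pledged` is `pledged * static_usd_rate`, which is consistent with the rows shown in `data.head(10)` (for example, 5933 × 1.429989 ≈ 8484.12).

```python
import numpy as np
import pandas as pd

# Toy frame reusing the Kickstarter column names (values are illustrative only).
sample = pd.DataFrame({
    "name": ["Project A", None, "Project C", "Project D"],
    "goal": [15000.0, 9000.0, 60000.0, 2000.0],
    "static_usd_rate": [1.429989, 1.114939, 1.0, 0.710177],
    "category": ["Gadgets", np.nan, "Gadgets", "Web"],
})

# Share of missing values per column, in percent, highest first.
missing_pct = sample.isna().mean().mul(100).sort_values(ascending=False)

# Tag missing names instead of imputing them (hypothetical column name).
sample["name_missing"] = sample["name"].isna()

# Convert the goal to USD, assuming the same conversion as usd_pledged.
sample["usd_goal"] = sample["goal"] * sample["static_usd_rate"]

# Number of distinct non-null categories.
n_categories = sample["category"].nunique()

print(missing_pct)
print(sample[["goal", "static_usd_rate", "usd_goal"]])
print("Distinct categories:", n_categories)
```

On the real data, the same expressions would be applied to `data` rather than `sample`.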

### Data Preprocessing Exploration
This section explores the cases that require action, to inform the creation of replicable functions.

In [41]:
missing_nl = data[data['name_len'].isna()]

In [42]:
missing_nl.head().style

Unnamed: 0,id,name,goal,pledged,state,disable_communication,country,currency,deadline,state_changed_at,created_at,launched_at,staff_pick,backers_count,static_usd_rate,usd_pledged,category,spotlight,name_len,name_len_clean,blurb_len,blurb_len_clean,deadline_weekday,state_changed_at_weekday,created_at_weekday,launched_at_weekday,deadline_month,deadline_day,deadline_yr,deadline_hr,state_changed_at_month,state_changed_at_day,state_changed_at_yr,state_changed_at_hr,created_at_month,created_at_day,created_at_yr,created_at_hr,launched_at_month,launched_at_day,launched_at_yr,launched_at_hr,create_to_launch_days,launch_to_deadline_days,launch_to_state_change_days
977,272079457,N/A (Canceled),1500000.0,0.0,canceled,False,US,USD,2016-01-06 22:57:52,2015-12-08 04:43:04,2015-12-07 19:17:09,2015-12-07 22:57:52,False,0,1.0,0.0,Spaces,False,,,,,Wednesday,Tuesday,Monday,Monday,1,6,2016,22,12,8,2015,4,12,7,2015,19,12,7,2015,22,0,30,0
2189,626888806,Star Wars Bluetooth Speakers (Canceled),60000.0,36058.0,canceled,False,GB,GBP,2015-10-30 07:51:51,2015-10-02 17:30:21,2015-07-02 04:51:04,2015-09-30 07:51:51,False,242,1.516337,54676.079907,Sound,False,,,,,Friday,Friday,Thursday,Wednesday,10,30,2015,7,10,2,2015,17,7,2,2015,4,9,30,2015,7,90,30,2
10047,470839570,TEST (Canceled),1000001.0,31.0,canceled,False,US,USD,2014-09-26 23:48:49,2014-09-25 20:26:10,2014-09-06 09:18:22,2014-09-12 23:48:49,False,2,1.0,31.0,Hardware,False,,,,,Friday,Thursday,Saturday,Friday,9,26,2014,23,9,25,2014,20,9,6,2014,9,9,12,2014,23,6,14,12
11544,1773256696,OF Press - A WordPress Theme and Site Builder (Canceled),5000.0,71.0,canceled,False,US,USD,2014-08-10 16:45:23,2014-07-30 17:58:09,2014-07-10 17:53:09,2014-07-11 16:45:23,False,4,1.0,71.0,Software,False,,,,,Sunday,Wednesday,Thursday,Friday,8,10,2014,16,7,30,2014,17,7,10,2014,17,7,11,2014,16,0,30,19


**Observation**
* Name length can be computed for these rows, as a name is present.
* The provided name length variables (which count the words in the name) may include the status label, e.g. (Canceled).
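A word-count imputation of `name_len` could look like the sketch below. It assumes that `name_len` is a whitespace-separated word count, which matches the populated rows shown earlier (e.g. "Timepiece Pulu" has `name_len` 2.0); the pipeline's actual implementation may differ.

```python
import numpy as np
import pandas as pd

# Two names taken from the outputs above; name_len is missing for the second.
sample = pd.DataFrame({
    "name": ["Timepiece Pulu", "TEST (Canceled)"],
    "name_len": [2.0, np.nan],
})

# Word count of each name, splitting on whitespace.
word_counts = sample["name"].str.split().str.len()

# Fill only the missing lengths; populated values are left untouched.
sample["name_len"] = sample["name_len"].fillna(word_counts)
print(sample)
```

Note that the imputed count for "TEST (Canceled)" still includes the status label, which is the issue raised in the second observation.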

In [43]:
canceled = data[data['state'] == 'canceled']

In [44]:
canceled.head().style

Unnamed: 0,id,name,goal,pledged,state,disable_communication,country,currency,deadline,state_changed_at,created_at,launched_at,staff_pick,backers_count,static_usd_rate,usd_pledged,category,spotlight,name_len,name_len_clean,blurb_len,blurb_len_clean,deadline_weekday,state_changed_at_weekday,created_at_weekday,launched_at_weekday,deadline_month,deadline_day,deadline_yr,deadline_hr,state_changed_at_month,state_changed_at_day,state_changed_at_yr,state_changed_at_hr,created_at_month,created_at_day,created_at_yr,created_at_hr,launched_at_month,launched_at_day,launched_at_yr,launched_at_hr,create_to_launch_days,launch_to_deadline_days,launch_to_state_change_days
2,839314928,PlantSitter - The World's Smartest Plant Monitoring System,60000.0,43234.0,canceled,False,US,USD,2016-03-29 08:01:08,2016-03-28 09:46:41,2016-01-31 05:21:52,2016-02-18 08:01:08,False,632,1.0,43234.0,Gadgets,False,8.0,8.0,22.0,12.0,Tuesday,Monday,Sunday,Thursday,3,29,2016,8,3,28,2016,9,1,31,2016,5,2,18,2016,8,18,40,39
3,681077916,Digital Video LUT Box for Colorblindness Correction,125000.0,1262.0,canceled,False,US,USD,2016-03-19 07:48:02,2016-02-23 09:30:28,2016-01-28 11:21:14,2016-02-18 07:48:02,False,4,1.0,1262.0,Hardware,False,7.0,6.0,24.0,16.0,Saturday,Tuesday,Thursday,Thursday,3,19,2016,7,2,23,2016,9,1,28,2016,11,2,18,2016,7,20,30,5
10,1203278901,QwiQcom: A new way to communicate (Canceled),100000.0,6005.0,canceled,False,US,USD,2016-03-18 18:02:08,2016-03-15 09:43:10,2015-11-02 17:38:33,2016-02-17 18:02:08,False,60,1.0,6005.0,Gadgets,False,7.0,6.0,22.0,16.0,Friday,Tuesday,Monday,Wednesday,3,18,2016,18,3,15,2016,9,11,2,2015,17,2,17,2016,18,107,30,26
13,1975151755,Shorter Biography: Changing Autobiography Narrative,25000.0,100.0,canceled,False,US,USD,2016-03-18 15:29:56,2016-02-26 10:04:29,2016-02-15 17:44:50,2016-02-17 15:29:56,False,2,1.0,100.0,Software,False,5.0,5.0,17.0,12.0,Friday,Friday,Monday,Wednesday,3,18,2016,15,2,26,2016,10,2,15,2016,17,2,17,2016,15,1,30,8
27,69927850,Unlost (Canceled),20000.0,183.0,canceled,False,FR,EUR,2016-04-17 06:03:37,2016-04-14 10:32:19,2015-06-27 05:38:18,2016-02-17 06:03:37,False,7,1.116094,204.245175,Gadgets,False,2.0,2.0,15.0,11.0,Sunday,Thursday,Saturday,Wednesday,4,17,2016,6,4,14,2016,10,6,27,2015,5,2,17,2016,6,235,60,57


In [45]:
# checking if similar for suspended
suspended = data[data['state'] == 'suspended']
suspended.head(10).style

Unnamed: 0,id,name,goal,pledged,state,disable_communication,country,currency,deadline,state_changed_at,created_at,launched_at,staff_pick,backers_count,static_usd_rate,usd_pledged,category,spotlight,name_len,name_len_clean,blurb_len,blurb_len_clean,deadline_weekday,state_changed_at_weekday,created_at_weekday,launched_at_weekday,deadline_month,deadline_day,deadline_yr,deadline_hr,state_changed_at_month,state_changed_at_day,state_changed_at_yr,state_changed_at_hr,created_at_month,created_at_day,created_at_yr,created_at_hr,launched_at_month,launched_at_day,launched_at_yr,launched_at_hr,create_to_launch_days,launch_to_deadline_days,launch_to_state_change_days
229,1527373655,Phanes (Home Human Egg Fertilization Device) (Suspended),15000.0,25.0,suspended,True,US,USD,2016-03-15 09:25:21,2016-02-08 10:33:42,2015-12-03 10:46:21,2016-02-04 09:25:21,False,1,1.0,25.0,,False,7.0,7.0,10.0,7.0,Tuesday,Monday,Thursday,Thursday,3,15,2016,9,2,8,2016,10,12,3,2015,10,2,4,2016,9,62,40,4
233,675376262,E-Sign (Suspended),75000.0,50.0,suspended,True,GB,GBP,2016-03-05 07:37:56,2016-02-10 15:22:32,2014-04-24 04:37:56,2016-02-04 07:37:56,False,1,1.440414,72.020685,Software,False,2.0,2.0,10.0,7.0,Saturday,Wednesday,Thursday,Thursday,3,5,2016,7,2,10,2016,15,4,24,2014,4,2,4,2016,7,651,30,6
344,1505313406,Rare Meteorite Pedigree (Suspended),2200.0,1.0,suspended,True,US,USD,2016-02-27 17:55:18,2016-02-05 12:46:59,2016-01-23 18:13:36,2016-01-28 17:55:18,False,1,1.0,1.0,,False,4.0,4.0,22.0,13.0,Saturday,Friday,Saturday,Thursday,2,27,2016,17,2,5,2016,12,1,23,2016,18,1,28,2016,17,4,30,7
345,1459318004,Glasscam (Suspended),10000.0,104.0,suspended,True,IT,EUR,2016-03-25 05:57:00,2016-02-04 14:06:53,2016-01-22 04:57:18,2016-01-28 16:08:31,False,4,1.086562,113.002453,Hardware,False,2.0,2.0,24.0,14.0,Friday,Thursday,Friday,Thursday,3,25,2016,5,2,4,2016,14,1,22,2016,4,1,28,2016,16,6,56,6
371,249917655,Lumière: THE ULTIMATE EYE PROTECTION LAMP YOU TAKE ANYWHERE!,11000.0,55998.0,suspended,True,AU,AUD,2016-02-29 06:30:47,2016-02-08 13:00:19,2016-01-08 07:39:47,2016-01-27 06:30:47,False,315,0.695718,38958.789125,Hardware,False,9.0,9.0,20.0,17.0,Monday,Monday,Friday,Wednesday,2,29,2016,6,2,8,2016,13,1,8,2016,7,1,27,2016,6,18,33,12
405,1788122327,TITAN : The PC that fits into the palm of your hand,19700.0,3025.0,suspended,True,US,USD,2016-02-29 19:00:00,2016-02-03 08:24:58,2015-12-09 13:12:23,2016-01-25 13:30:29,False,17,1.0,3025.0,Hardware,False,12.0,7.0,24.0,13.0,Monday,Wednesday,Wednesday,Monday,2,29,2016,19,2,3,2016,8,12,9,2015,13,1,25,2016,13,47,35,8
408,882967864,Shot Drive (Suspended),1000.0,0.0,suspended,True,US,USD,2016-02-24 10:33:46,2016-02-22 16:21:34,2016-01-19 21:07:30,2016-01-25 10:33:46,False,0,1.0,0.0,Hardware,False,3.0,3.0,12.0,8.0,Wednesday,Monday,Tuesday,Monday,2,24,2016,10,2,22,2016,16,1,19,2016,21,1,25,2016,10,5,30,28
497,2121781245,REX Super Brain/ sleep/ meditation/ concentration/ learning,10000.0,42172.0,suspended,True,CA,CAD,2016-02-16 12:40:49,2016-02-03 09:13:57,2015-09-14 19:23:35,2016-01-18 12:40:49,False,297,0.687799,29005.861958,Wearables,False,7.0,7.0,11.0,9.0,Tuesday,Wednesday,Monday,Monday,2,16,2016,12,2,3,2016,9,9,14,2015,19,1,18,2016,12,125,29,15
535,1793062138,flying cars (Suspended),1.0,0.0,suspended,True,CA,CAD,2016-02-13 10:38:47,2016-01-15 16:43:53,2016-01-08 01:22:19,2016-01-14 10:38:47,False,0,0.702277,0.0,Flight,False,3.0,3.0,14.0,5.0,Saturday,Friday,Friday,Thursday,2,13,2016,10,1,15,2016,16,1,8,2016,1,1,14,2016,10,6,30,1
547,1703712469,Religion scientifique du gros bon sens/Scientific Religion,50000.0,100.0,suspended,True,CA,CAD,2016-03-13 10:14:30,2016-01-25 13:56:55,2016-01-12 09:40:08,2016-01-13 10:14:30,False,1,0.703358,70.335804,Web,False,7.0,7.0,21.0,21.0,Sunday,Monday,Tuesday,Wednesday,3,13,2016,10,1,25,2016,13,1,12,2016,9,1,13,2016,10,1,60,12


In [46]:
failed = data[data['state'] == 'failed']
failed.head(10).style

Unnamed: 0,id,name,goal,pledged,state,disable_communication,country,currency,deadline,state_changed_at,created_at,launched_at,staff_pick,backers_count,static_usd_rate,usd_pledged,category,spotlight,name_len,name_len_clean,blurb_len,blurb_len_clean,deadline_weekday,state_changed_at_weekday,created_at_weekday,launched_at_weekday,deadline_month,deadline_day,deadline_yr,deadline_hr,state_changed_at_month,state_changed_at_day,state_changed_at_yr,state_changed_at_hr,created_at_month,created_at_day,created_at_yr,created_at_hr,launched_at_month,launched_at_day,launched_at_yr,launched_at_hr,create_to_launch_days,launch_to_deadline_days,launch_to_state_change_days
0,1538064060,MAGIC PIXEL - Bluetooth full color LED display,15000.0,5933.0,failed,False,GB,GBP,2016-03-19 09:31:29,2016-03-19 09:31:32,2015-12-18 03:17:13,2016-02-18 09:31:29,False,66,1.429989,8484.125686,Gadgets,False,8.0,8.0,18.0,14.0,Saturday,Saturday,Friday,Thursday,3,19,2016,9,3,19,2016,9,12,18,2015,3,2,18,2016,9,62,30,30
6,2077265745,Timepiece Pulu,35000.0,823.0,failed,False,US,USD,2016-03-25 21:46:51,2016-03-25 21:46:51,2016-01-24 23:54:17,2016-02-17 21:46:51,False,9,1.0,823.0,Gadgets,False,2.0,2.0,20.0,15.0,Friday,Friday,Sunday,Wednesday,3,25,2016,21,3,25,2016,21,1,24,2016,23,2,17,2016,21,23,37,37
9,694708905,The Lawn Project,80000.0,0.0,failed,False,CA,CAD,2016-03-18 18:10:49,2016-03-18 18:10:49,2016-02-14 23:26:48,2016-02-17 18:10:49,False,0,0.725834,0.0,Web,False,3.0,3.0,21.0,13.0,Friday,Friday,Sunday,Wednesday,3,18,2016,18,3,18,2016,18,2,14,2016,23,2,17,2016,18,2,30,30
11,781088620,"POI. Person of interest. A social app, but better.",105000.0,10.0,failed,False,US,USD,2016-04-02 17:43:31,2016-04-02 17:43:34,2016-01-28 22:37:00,2016-02-17 17:43:31,False,1,1.0,10.0,Apps,False,9.0,7.0,24.0,17.0,Saturday,Saturday,Thursday,Wednesday,4,2,2016,17,4,2,2016,17,1,28,2016,22,2,17,2016,17,19,45,45
12,1346050884,Volda Komiklubb,80000.0,0.0,failed,False,NO,NOK,2016-03-03 15:30:43,2016-03-03 15:30:43,2016-02-17 14:49:34,2016-02-17 15:30:43,False,0,0.116231,0.0,Experimental,False,2.0,2.0,22.0,13.0,Thursday,Thursday,Wednesday,Wednesday,3,3,2016,15,3,3,2016,15,2,17,2016,14,2,17,2016,15,0,15,15
14,1208442418,Kuruko - The AMV Sharing Site,600.0,10.0,failed,False,US,USD,2016-04-07 14:32:18,2016-04-07 14:32:18,2016-01-31 22:58:09,2016-02-17 14:32:18,False,2,1.0,10.0,Web,False,6.0,6.0,22.0,17.0,Thursday,Thursday,Sunday,Wednesday,4,7,2016,14,4,7,2016,14,1,31,2016,22,2,17,2016,14,16,50,50
17,1523306227,Chill Cocktail Ice Theater Project,35000.0,0.0,failed,False,US,USD,2016-04-17 13:52:00,2016-04-17 13:52:02,2015-12-29 22:20:54,2016-02-17 13:52:00,False,0,1.0,0.0,Experimental,False,5.0,5.0,13.0,13.0,Sunday,Sunday,Tuesday,Wednesday,4,17,2016,13,4,17,2016,13,12,29,2015,22,2,17,2016,13,49,60,60
19,1609059265,The Vegan Booklet,500.0,26.0,failed,False,FR,EUR,2016-03-18 12:12:28,2016-03-18 12:12:28,2016-02-15 09:02:21,2016-02-17 12:12:28,False,4,1.116094,29.01844,Web,False,3.0,3.0,18.0,11.0,Friday,Friday,Monday,Wednesday,3,18,2016,12,3,18,2016,12,2,15,2016,9,2,17,2016,12,2,30,30
20,246127349,CafeSync,5000.0,231.0,failed,False,US,USD,2016-03-18 12:05:45,2016-03-18 12:05:45,2016-02-10 12:26:36,2016-02-17 12:05:45,False,8,1.0,231.0,Software,False,1.0,1.0,20.0,16.0,Friday,Friday,Wednesday,Wednesday,3,18,2016,12,3,18,2016,12,2,10,2016,12,2,17,2016,12,6,30,30
22,1614791826,RasPiBox Zero,6000.0,912.0,failed,False,DE,EUR,2016-03-18 11:13:13,2016-03-18 11:13:14,2016-01-20 13:51:23,2016-02-17 11:13:13,False,25,1.116094,1017.877591,,False,2.0,2.0,17.0,14.0,Friday,Friday,Wednesday,Wednesday,3,18,2016,11,3,18,2016,11,1,20,2016,13,2,17,2016,11,27,30,30


In [47]:
successful = data[data['state'] == 'successful']
successful.head(10).style

Unnamed: 0,id,name,goal,pledged,state,disable_communication,country,currency,deadline,state_changed_at,created_at,launched_at,staff_pick,backers_count,static_usd_rate,usd_pledged,category,spotlight,name_len,name_len_clean,blurb_len,blurb_len_clean,deadline_weekday,state_changed_at_weekday,created_at_weekday,launched_at_weekday,deadline_month,deadline_day,deadline_yr,deadline_hr,state_changed_at_month,state_changed_at_day,state_changed_at_yr,state_changed_at_hr,created_at_month,created_at_day,created_at_yr,created_at_hr,launched_at_month,launched_at_day,launched_at_yr,launched_at_hr,create_to_launch_days,launch_to_deadline_days,launch_to_state_change_days
1,556771080,SmartPi - Turn your Raspberry Pi into a SmartMeter,9000.0,16552.0,successful,False,DE,EUR,2016-04-03 08:05:09,2016-04-03 08:05:10,2016-02-08 09:27:33,2016-02-18 08:05:09,False,131,1.114939,18454.471487,,True,9.0,6.0,23.0,15.0,Sunday,Sunday,Monday,Thursday,4,3,2016,8,4,3,2016,8,2,8,2016,9,2,18,2016,8,9,45,45
4,1315415013,help send Object Collection to Norway!,2000.0,2300.0,successful,False,US,USD,2016-03-03 17:00:00,2016-03-03 17:00:00,2016-02-16 10:00:06,2016-02-18 07:00:44,False,29,1.0,2300.0,Experimental,True,6.0,5.0,19.0,13.0,Thursday,Thursday,Tuesday,Thursday,3,3,2016,17,3,3,2016,17,2,16,2016,10,2,18,2016,7,1,14,14
5,836821539,The Tragedy of Mario and Juliet,3000.0,3255.0,successful,False,US,USD,2016-04-18 04:13:25,2016-04-18 04:13:25,2016-01-27 17:18:58,2016-02-18 04:13:25,False,24,1.0,3255.0,Plays,True,6.0,4.0,19.0,14.0,Monday,Monday,Wednesday,Thursday,4,18,2016,4,4,18,2016,4,1,27,2016,17,2,18,2016,4,21,60,60
7,2119284588,Stikk- The Gel Pad That Will Stick Anything to Everything!,2000.0,62831.0,successful,False,AU,AUD,2016-03-18 21:25:42,2016-03-18 21:25:42,2016-02-16 08:46:01,2016-02-17 21:25:42,False,2012,0.710177,44621.120406,Gadgets,True,10.0,9.0,15.0,12.0,Friday,Friday,Tuesday,Wednesday,3,18,2016,21,3,18,2016,21,2,16,2016,8,2,17,2016,21,1,30,30
8,1463630983,AHS Theater - Help us light up our stage!,6000.0,6530.0,successful,False,US,USD,2016-04-17 18:44:54,2016-04-17 18:44:55,2016-02-16 14:50:22,2016-02-17 18:44:54,False,69,1.0,6530.0,Spaces,True,9.0,7.0,26.0,18.0,Sunday,Sunday,Tuesday,Wednesday,4,17,2016,18,4,17,2016,18,2,16,2016,14,2,17,2016,18,1,60,60
15,1491190846,BAT-SAFE,25000.0,33912.0,successful,False,US,USD,2016-03-18 14:29:38,2016-03-18 14:29:52,2016-01-28 18:18:01,2016-02-17 14:29:38,False,473,1.0,33912.0,Gadgets,True,1.0,1.0,21.0,13.0,Friday,Friday,Thursday,Wednesday,3,18,2016,14,3,18,2016,14,1,28,2016,18,2,17,2016,14,19,30,30
16,185703695,The Disaster Prediction App,35000.0,119910.44,successful,False,US,USD,2016-03-09 13:52:44,2016-03-09 13:52:45,2016-02-07 12:43:43,2016-02-17 13:52:44,False,2208,1.0,119910.44,Apps,True,4.0,4.0,16.0,12.0,Wednesday,Wednesday,Sunday,Wednesday,3,9,2016,13,3,9,2016,13,2,7,2016,12,2,17,2016,13,10,21,21
18,2101427126,Motion Control Camera Camcorder HD Bluetooth Smart Glasses,5000.0,10678.0,successful,False,US,USD,2016-04-17 13:18:39,2016-04-17 13:18:39,2016-02-02 16:39:30,2016-02-17 13:18:39,False,87,1.0,10678.0,Wearables,True,8.0,8.0,18.0,13.0,Sunday,Sunday,Tuesday,Wednesday,4,17,2016,13,4,17,2016,13,2,2,2016,16,2,17,2016,13,14,60,60
21,1624887146,I/O Cape for the BeagleBone Black (BBB-GVS-3),500.0,706.0,successful,False,US,USD,2016-03-20 19:00:00,2016-03-20 19:00:00,2016-02-12 17:56:05,2016-02-17 11:40:34,False,16,1.0,706.0,,True,7.0,5.0,16.0,11.0,Sunday,Sunday,Friday,Wednesday,3,20,2016,19,3,20,2016,19,2,12,2016,17,2,17,2016,11,4,32,32
26,766209806,Akvavit Theatre presents NOTHING OF ME by Arne Lygre,2500.0,2565.0,successful,False,US,USD,2016-03-02 23:59:00,2016-03-02 23:59:00,2016-02-16 10:02:53,2016-02-17 08:03:10,False,33,1.0,2565.0,Plays,True,9.0,8.0,19.0,13.0,Wednesday,Wednesday,Tuesday,Wednesday,3,2,2016,23,3,2,2016,23,2,16,2016,10,2,17,2016,8,0,14,14


**Observation**
* Having noted the status 'canceled' within the names of some cases with missing name lengths, a check was made on whether it also appears in other canceled projects for which `name_len` is populated. In the example of Unlost (Canceled), both `name_len` and `name_len_clean` record the length as 2 words, whereas the true length of the project name is better captured as 1 word.
* This points to a flaw in the provided dataset that should be addressed for meaningful EDA and model building.
* This observation prompted a check on the other statuses, and a similar pattern was observed for suspended projects. Preprocessing will include checking for statuses captured within project names and computing updated name length figures.
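One way to strip a status label from a name and recompute its word count is sketched below. The regular expression and the helper names `strip_status_suffix` and `word_count` are assumptions made for illustration; they are not taken from `utils.kscleaning`, and only the two labels seen in the outputs above, (Canceled) and (Suspended), are handled.

```python
import re

# Assumed pattern: a trailing "(Canceled)" or "(Suspended)" label, case-insensitive.
STATUS_SUFFIX = re.compile(r"\s*\((?:canceled|suspended)\)\s*$", flags=re.IGNORECASE)


def strip_status_suffix(name: str) -> str:
    """Remove a trailing status label from a project name."""
    return STATUS_SUFFIX.sub("", name).strip()


def word_count(name: str) -> int:
    """Count whitespace-separated words in a name."""
    return len(name.split())


for raw in ["Unlost (Canceled)", "E-Sign (Suspended)", "Timepiece Pulu"]:
    clean = strip_status_suffix(raw)
    print(f"{raw!r} -> {clean!r} ({word_count(clean)} words)")
```

Applied to a DataFrame, these helpers could be mapped over the non-null values of the `name` column. Anchoring the pattern to the end of the string leaves other parenthesised text in a name untouched.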

### Data Preprocessing with Pipeline

In [48]:
cleaning_pipeline(data)

'Processed dataset saved as cleaned_data.xlsx'