# Project 1 - Data Science Blog 

I want to try and answer three questions:

1. Are there any categories that are more successful than others?
2. Is there an optimal time-scale for raising the appropriate funds, or an optimal start time?
3. Can we predict whether a Kickstarter campaign will be successful given this data? Are there any key attributes that have a big impact on the outcome?

In [78]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Import the data

Extract the data into a directory called "`data`". It should contain two CSV files, one of which is the dataset for this project: "`ks-projects-201801.csv`".

In [79]:
# A forward slash works on Windows, macOS and Linux alike
data_filename = 'data/ks-projects-201801.csv'

In [80]:
df = pd.read_csv(data_filename)    

In [81]:
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0


In [82]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378661 entries, 0 to 378660
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   ID                378661 non-null  int64  
 1   name              378657 non-null  object 
 2   category          378661 non-null  object 
 3   main_category     378661 non-null  object 
 4   currency          378661 non-null  object 
 5   deadline          378661 non-null  object 
 6   goal              378661 non-null  float64
 7   launched          378661 non-null  object 
 8   pledged           378661 non-null  float64
 9   state             378661 non-null  object 
 10  backers           378661 non-null  int64  
 11  country           378661 non-null  object 
 12  usd pledged       374864 non-null  float64
 13  usd_pledged_real  378661 non-null  float64
 14  usd_goal_real     378661 non-null  float64
dtypes: float64(5), int64(2), object(8)
memory usage: 43.3+ MB


In [83]:
df.describe()

Unnamed: 0,ID,goal,pledged,backers,usd pledged,usd_pledged_real,usd_goal_real
count,378661.0,378661.0,378661.0,378661.0,374864.0,378661.0,378661.0
mean,1074731000.0,49080.79,9682.979,105.617476,7036.729,9058.924,45454.4
std,619086200.0,1183391.0,95636.01,907.185035,78639.75,90973.34,1152950.0
min,5971.0,0.01,0.0,0.0,0.0,0.0,0.01
25%,538263500.0,2000.0,30.0,2.0,16.98,31.0,2000.0
50%,1075276000.0,5200.0,620.0,12.0,394.72,624.33,5500.0
75%,1610149000.0,16000.0,4076.0,56.0,3034.09,4050.0,15500.0
max,2147476000.0,100000000.0,20338990.0,219382.0,20338990.0,20338990.0,166361400.0


## Clean the data

### Check for nulls

In [84]:
df.isna().sum()

ID                     0
name                   4
category               0
main_category          0
currency               0
deadline               0
goal                   0
launched               0
pledged                0
state                  0
backers                0
country                0
usd pledged         3797
usd_pledged_real       0
usd_goal_real          0
dtype: int64

Only `name` (4 rows) and `usd pledged` (3,797 rows) contain nulls, so drop every row that has one:

In [85]:
df = df.dropna()
df = df.reset_index(drop=True)
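
Nearly all of the dropped rows are lost because of the `usd pledged` column. If the later analysis relies only on `usd_pledged_real`, one possible alternative is to drop that column first and keep the rows. This is a minimal sketch on a made-up frame (the values and the `toy` name are illustrative assumptions), not the approach used in the rest of this notebook:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values that mimics the null pattern in the real data
toy = pd.DataFrame({
    'name': ['A', None, 'C', 'D'],
    'usd pledged': [10.0, 5.0, np.nan, np.nan],
    'usd_pledged_real': [10.0, 5.0, 7.0, 0.0],
})

# Approach used in this notebook: drop every row with any null
dropped_all = toy.dropna().reset_index(drop=True)

# Alternative: drop the sparse column first, then drop the remaining null rows
dropped_col = toy.drop(columns=['usd pledged']).dropna().reset_index(drop=True)

print(len(dropped_all), len(dropped_col))
```

On the toy frame the second approach keeps three rows instead of one; whether the trade-off is worthwhile depends on whether `usd pledged` is needed later.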

### Remove "Canceled" Campaigns

In [143]:
df = df[df.state != "canceled"]
df = df.reset_index(drop=True)
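
The filter above removes only the `canceled` rows. If the `state` column turns out to hold other values besides the outcomes of interest, an explicit allow-list may be easier to reason about. A minimal sketch on made-up data follows; the labels `successful` and `other` are assumptions, so it is worth checking `df['state'].value_counts()` for the real values first:

```python
import pandas as pd

# Toy data; these state labels are illustrative, not a full list from the dataset
toy = pd.DataFrame({'state': ['failed', 'successful', 'canceled', 'failed', 'other']})

# Inspect how many campaigns fall under each state before filtering
print(toy['state'].value_counts())

# Keep only the outcomes of interest (assumed labels)
finished = toy[toy['state'].isin(['successful', 'failed'])].reset_index(drop=True)
print(finished)
```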

### Split out Launched Date and Time

Split the `launched` column into `launched_date` and `launched_time`:

In [87]:
# Each `launched` value looks like "2015-08-11 12:12:28": split it on the space
launched_parts = df['launched'].str.split(' ', n=1, expand=True)

launched_dates = launched_parts[0]   # e.g. "2015-08-11"
launched_times = launched_parts[1]   # e.g. "12:12:28"

In [88]:
df['launched_date'] = launched_dates
df['launched_time'] = launched_times
df = df.drop('launched', axis=1)

### Split the Launch Time into Categories

In [136]:
def time_categorize(time_list):
    """
    Sorts zero-padded 'HH:MM:SS' strings into Morning, Afternoon, Evening and
    Night, and returns the labels as a list in the same order as the input.

    Times are compared as strings, which works because they are zero-padded.
    Any value that falls outside every range is labelled 'ERROR'; ideally no
    such labels appear in the result.
    """
    time_of_day = []   # one label per input time, in order

    for time in time_list:
        if '05:00:00' <= time < '12:00:00':
            time_of_day.append('Morning')
        elif '12:00:00' <= time < '17:00:00':
            time_of_day.append('Afternoon')
        elif '17:00:00' <= time < '21:00:00':
            time_of_day.append('Evening')
        elif '21:00:00' <= time < '24:00:00' or '00:00:00' <= time < '05:00:00':
            time_of_day.append('Night')
        else:
            time_of_day.append('ERROR')

    return time_of_day

In [139]:
launch_time_category = time_categorize(list(df.launched_time))
df['launch_time_category'] = launch_time_category
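
For a dataset of this size, a vectorized version of the same bucketing may run faster than a Python loop. The sketch below uses `np.select` with the same hour boundaries; `time_categorize_vectorized` is a hypothetical helper name, and unlike the function above it raises on malformed input instead of returning 'ERROR':

```python
import numpy as np
import pandas as pd

def time_categorize_vectorized(times):
    """Label a Series of zero-padded 'HH:MM:SS' strings with a time of day."""
    # The first two characters are the hour, e.g. '08:35:03' -> 8
    hours = times.str[:2].astype(int).to_numpy()
    conditions = [
        (hours >= 5) & (hours < 12),
        (hours >= 12) & (hours < 17),
        (hours >= 17) & (hours < 21),
    ]
    labels = ['Morning', 'Afternoon', 'Evening']
    # Every remaining hour (21 to 23 and 0 to 4) counts as night
    return pd.Series(np.select(conditions, labels, default='Night'), index=times.index)

sample = pd.Series(['12:12:28', '04:43:57', '08:35:03', '17:00:00', '21:30:00'])
result = time_categorize_vectorized(sample).tolist()
print(result)
```

It could be applied as `df['launch_time_category'] = time_categorize_vectorized(df['launched_time'])`.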

In [140]:
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real,launched_date,launched_time,launch_time_category
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,0.0,failed,0,GB,0.0,0.0,1533.95,2015-08-11,12:12:28,Afternoon
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2421.0,failed,15,US,100.0,2421.0,30000.0,2017-09-02,04:43:57,Night
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,220.0,failed,3,US,220.0,220.0,45000.0,2013-01-12,00:20:50,Night
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,1.0,failed,1,US,1.0,1.0,5000.0,2012-03-17,03:24:11,Night
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,1283.0,canceled,14,US,1283.0,1283.0,19500.0,2015-07-04,08:35:03,Morning


# 1. Are there any types of projects that are more successful than others?

### First, let's get an idea of the proportion of each `main_category` and `category` across all of the campaigns in the data

In [92]:
main_category_df = df[['main_category', 'state']]
category_df = df[['category', 'state']]

#### Main Categories

In [104]:
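def get_count_of_states_per_category(df, category_col_name):
    """
    Work in progress: currently only prints each unique value in
    category_col_name; the per-state counts are not computed yet.
    """
    category_df = df[[category_col_name, 'state']]
    for category in category_df[category_col_name].unique():
        print(category)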

In [106]:
get_count_of_states_per_category(df, 'category')

Poetry
Narrative Film
Music
Film & Video
Restaurants
Food
Drinks
Product Design
Documentary
Nonfiction
Indie Rock
Crafts
Games
Tabletop Games
Design
Comic Books
Art Books
Fashion
Childrenswear
Theater
Comics
DIY
Webseries
Animation
Food Trucks
Public Art
Illustration
Photography
Pop
People
Art
Family
Fiction
Accessories
Rock
Hardware
Software
Weaving
Gadgets
Web
Jazz
Ready-to-wear
Festivals
Video Games
Anthologies
Publishing
Shorts
Electronic Music
Radio & Podcasts
Apps
Cookbooks
Apparel
Metal
Comedy
Hip-Hop
Periodicals
Dance
Technology
Painting
World Music
Photobooks
Drama
Architecture
Young Adult
Latin
Mobile Games
Flight
Fine Art
Action
Playing Cards
Makerspaces
Punk
Thrillers
Children's Books
Audio
Performance Art
Ceramics
Vegan
Graphic Novels
Fabrication Tools
Performances
Sculpture
Sound
Stationery
Print
Farmer's Markets
Events
Classical Music
Graphic Design
Spaces
Country & Folk
Wearables
Mixed Media
Journalism
Movie Theaters
Animals
Digital Art
Horror
Knitting
Small Batch
Insta
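
The function above only lists the category names. One way the per-state counts could be computed is with `pd.crosstab`, sketched here on made-up rows. The helper name `count_states_per_category`, the `success_rate` column and the `successful` state label are assumptions for illustration, and the code expects at least one `successful` row to be present:

```python
import pandas as pd

# Toy data with made-up outcomes
toy = pd.DataFrame({
    'main_category': ['Music', 'Music', 'Film & Video',
                      'Film & Video', 'Film & Video', 'Publishing'],
    'state': ['successful', 'failed', 'failed',
              'failed', 'successful', 'failed'],
})

def count_states_per_category(df, category_col_name):
    """Count campaigns per state for each category and add a success rate."""
    counts = pd.crosstab(df[category_col_name], df['state'])
    # Share of campaigns in each category that ended as 'successful'
    counts['success_rate'] = counts['successful'] / counts.sum(axis=1)
    return counts.sort_values('success_rate', ascending=False)

summary = count_states_per_category(toy, 'main_category')
print(summary)
```

The same call with `'category'` instead of `'main_category'` would give the finer-grained breakdown.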

# 2. Is there an optimal time-scale for raising the appropriate funds, or an optimal start time?
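
A possible starting point for this question is the length of each campaign in days, derived from the `deadline` and `launched_date` columns. The sketch below uses the first four rows shown in the `df.head()` output; the `duration_days` and `launch_month` column names are assumptions, not part of the dataset:

```python
import pandas as pd

# Dates taken from the first four rows of the df.head() output
toy = pd.DataFrame({
    'deadline': ['2015-10-09', '2017-11-01', '2013-02-26', '2012-04-16'],
    'launched_date': ['2015-08-11', '2017-09-02', '2013-01-12', '2012-03-17'],
})

deadline = pd.to_datetime(toy['deadline'])
launched = pd.to_datetime(toy['launched_date'])

# Campaign length in whole days, and the calendar month of the launch
toy['duration_days'] = (deadline - launched).dt.days
toy['launch_month'] = launched.dt.month

print(toy[['duration_days', 'launch_month']])
```

Grouping `state` by these columns (or by `launch_time_category`) would then show whether certain durations or start times line up with more successful campaigns.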

# 3. Can we predict whether a Kickstarter campaign will be successful given this data? Are there any key attributes that have a big impact on the outcome?

Note: maybe use the title word count after removing stop words and punctuation?
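
A rough sketch of that idea using only the standard library. The `STOP_WORDS` set is a tiny illustrative list and `title_word_count` is a hypothetical helper; a real analysis would probably use a fuller stop-word list:

```python
import string

# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {'a', 'an', 'the', 'of', 'to', 'for', 'is', 'in', 'and', 'from'}

def title_word_count(title):
    """Count the words in a title after removing punctuation and stop words."""
    # Lower-case the title and strip every ASCII punctuation character
    cleaned = title.lower().translate(str.maketrans('', '', string.punctuation))
    return sum(1 for word in cleaned.split() if word not in STOP_WORDS)

print(title_word_count('Where is Hank?'))
print(title_word_count('The Songs of Adelaide & Abullah'))
```

It could be turned into a feature with something like `df['name'].apply(title_word_count)`, since rows with a null `name` were already dropped.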