# KickStarter Project

## Table of Contents
1. Import Libraries and Data
2. Explorative Data Analysis
3. Feature analysis
4. Data visualization
5. Partition dataset into train / test sets
6. Modelling and Hyperparameter Optimization
7. Conclusion

# 1. Import Libraries and Data

## 1.1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from matplotlib import rcParams, cycler
from matplotlib.ticker import StrMethodFormatter
import seaborn as sns
import datetime as dt
sns.set_style('whitegrid')
%matplotlib inline

## 1.2. Load Data

This project analyses a dataset of Kickstarter projects. Kickstarter is one of the most renowned platforms for promoting (mostly) creative, smart and visionary ideas and concepts.

In [2]:
#importing data
df = pd.read_csv('/Users/anamatias/Documents/github/Kickstarter git copy/DSI_kickstarterscrape_dataset.csv',encoding='latin1', index_col = 0)

FileNotFoundError: [Errno 2] No such file or directory: '/Users/anamatias/Documents/github/Kickstarter git copy/DSI_kickstarterscrape_dataset.csv'
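
The cell above depends on an absolute path that is specific to one machine. A minimal sketch of a more portable loader follows, assuming the CSV is kept in a hypothetical `data/` folder next to the notebook; both the folder name and the `load_kickstarter` helper are illustrative and not part of the project.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location: a 'data' folder next to the notebook (assumption).
DATA_PATH = Path('data') / 'DSI_kickstarterscrape_dataset.csv'


def load_kickstarter(path=DATA_PATH):
    """Read the Kickstarter CSV, failing early with a readable message."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f'Dataset not found at {path.resolve()}')
    # Same options as the cell above: latin1 encoding, first column as index
    return pd.read_csv(path, encoding='latin1', index_col=0)
```

With this helper, `df = load_kickstarter()` works on any machine where the file sits at the assumed relative path.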

# 2. Explorative Data Analysis (EDA)

In [None]:
# show first 5 observations of dataframe for exploration purposes
df.head()

In [None]:
#check how many rows and columns
print('Dimension of the dataset:', df.shape)

In [None]:
# name of the columns
df.columns

In [None]:
#get some info
df.info()

In [None]:
#number of missing values per column
df.isnull().sum().sort_values(ascending = False)

In [None]:
#remove the rows with the missing values
df.dropna(inplace = True)

In [None]:
#look for duplicate values
df.duplicated().sum()

In [None]:
#drop the duplicated rows
df = df.drop_duplicates(subset=None, keep='first')

In [None]:
#number of unique values per column, sorted in descending order
df.nunique().sort_values(ascending = False)

#### Status of the project
Status shows how successful a project was.

In [None]:
#Calculate the frequency of each value of a categorical feature
def categorical_count(df, feature):
    #Calculate frequency as a percentage and as an absolute count
    freq = pd.concat([df[feature].value_counts(normalize=True) * 100, df[feature].value_counts(normalize=False)], axis=1)
    #rename columns
    freq.columns = [feature + '_%', feature + '_count']
    return freq

In [None]:
categorical_count(df, 'status')
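
To make the output of `categorical_count` concrete, here is a small sketch on made-up data. The five status values are invented for illustration; the function body repeats the definition above so the example runs on its own.

```python
import pandas as pd


def categorical_count(df, feature):
    # Percentage and absolute count of each value, side by side
    freq = pd.concat([df[feature].value_counts(normalize=True) * 100,
                      df[feature].value_counts(normalize=False)], axis=1)
    freq.columns = [feature + '_%', feature + '_count']
    return freq


# Invented toy data, not taken from the Kickstarter dataset
toy = pd.DataFrame({'status': ['successful', 'failed', 'failed', 'live', 'failed']})
freq = categorical_count(toy, 'status')
print(freq)
```

The result has one row per distinct status, with the share in the `status_%` column and the raw count in `status_count`.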

In [None]:
#Histogram to check the frequency of the different status
plt.hist(df['status'])
plt.xlabel('Status', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.title('Project Status',fontsize = 20)
plt.show()

Based on the table and histogram above, we can confirm that the classes of "Status" are far from evenly distributed: "failed" and "successful" together account for 91% of all projects.

The main goal of this exercise is to identify the attributes that matter for the success or failure of a project. Projects with status 'live', 'canceled' or 'suspended' can be dropped, because we cannot know for sure whether they would have succeeded or failed. For the further analysis we keep only the two classes 'successful' and 'failed'.

In [None]:
#return only successful and failed projects
df = df.loc[df['status'].isin(['successful', 'failed'])]
df.head()

With only two classes left, predicting 'Status' becomes a binary classification problem.

In [None]:
categorical_count(df, 'status')

In [None]:
#replace the categorical data with 1 and 0 values, where 1 means 'successful' and 0 'failed'
#dtype=int keeps the new 'successful' column numeric (pandas 2.x returns booleans by default)
status = pd.get_dummies(df['status'], drop_first=True, dtype=int)
df = df.drop(['status'], axis=1)
df = pd.concat([df,status],axis=1)
df.head()
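
The dummy encoding keeps only the `successful` column because `drop_first=True` drops the alphabetically first level, `failed`. The sketch below, on invented toy data, shows this next to a more explicit alternative (comparing against the string and casting to int). The alternative is an illustration, not the notebook's method.

```python
import pandas as pd

# Invented toy data with the two remaining classes
toy = pd.DataFrame({'status': ['successful', 'failed', 'successful']})

# Dummy encoding as in the cell above: 'failed' is dropped, 'successful' is kept
dummies = pd.get_dummies(toy['status'], drop_first=True, dtype=int)

# Explicit alternative: 1 where the status is 'successful', 0 otherwise
explicit = (toy['status'] == 'successful').astype(int)

print(dummies['successful'].tolist(), explicit.tolist())
```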

### Features which will be used for further analysis
Features that leak the label should be excluded, so that the classifier is able to predict the success of a project right after its launch.
- name
- category
- subcategory
- location
- goal
- pledge
- backers
- reward level
- updates
- comments
- duration
- successful



In [None]:
#Drop columns not relevant for this analysis
df = df.drop(['url', 'subcategory','funded percentage','levels', 'reward levels', 'updates', 'comments'], axis = 1, inplace = False)
df.head()

In [None]:
# Rename `funded date` to `funded_date`
df.rename(columns={'funded date':'funded_date'}, inplace=True)

# 3. Feature analysis

## What is the mean (total) pledge that projects get? (not per backer)

In [None]:
# Mean of pledge
pledge_mean = df['pledged'].mean()
print(f'The mean total pledge is $ {round(pledge_mean,2)}')

## Create a histogram that shows the distribution for number of backers. What is the skew of the distribution?

In [None]:
#Plotting the Histogram
plt.hist(df['backers'])
plt.xlabel('Backers', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.title('Number of Backers', fontsize = 20)
plt.show()


## What type of projects would be most successful at getting funded?

In [None]:
# Print categories
df['category'].unique().tolist()

In [None]:
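# Success rate per category: successful projects divided by all projects in that category
order = (df.loc[df.successful==1]['category'].value_counts())/(df['category'].value_counts())
# Category names, in the index order of the ratio above
order_1 = order.index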

In [None]:
# Plot showing the percentage of Success
sns.catplot(y='successful', x= 'category', data=df, kind='bar', ci = None, aspect = 2)
plt.xlabel('Category', fontsize = 15)
plt.ylabel('Percentage of Success', fontsize = 15)
plt.title('Success by Category', fontsize = 20)
plt.show()

In [None]:
# Print locations
df['location'].unique().tolist()

### b. Skew of the distribution
The skew shows the level of the distortion from the symmetry bell curve or the normal distribution. 

In [None]:
#Checking Data Skew
backers_skew = df['backers'].skew()
print(f'The skew is {round(backers_skew,2)}')

The skew is greater than 1, which indicates that the data is highly skewed.
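
As a sketch of what `Series.skew()` measures: pandas reports the bias-adjusted sample skewness, which can be reproduced by hand from the second and third central moments. The backer counts below are invented, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Invented backer counts with one very large project (not the real data)
backers = pd.Series([1, 2, 2, 3, 50])

n = len(backers)
deviations = backers - backers.mean()
m2 = (deviations ** 2).mean()              # second central moment
m3 = (deviations ** 3).mean()              # third central moment
g1 = m3 / m2 ** 1.5                        # biased (population) skewness
G1 = g1 * np.sqrt(n * (n - 1)) / (n - 2)   # bias-adjusted sample skewness

# Both values should agree up to floating-point error
print(round(G1, 2), round(backers.skew(), 2))
```

A single very large value pulls the third moment up, which is why a long right tail produces a large positive skew.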

## Is the 'duration' variable normally distributed?

### a. Histogram

There are several ways to check whether our data follows a normal distribution. We can start with the simplest one, the histogram. With a histogram we can check whether the data is symmetrical, has a bell shape, and whether the mean and median coincide.

In [None]:
sns.displot(df['duration'], kde = True)

The shape of the histogram is not clearly defined, but we can note that it is trimodal: it has three separate peaks, each a local maximum of the frequency distribution.

### b. Boxplot

A boxplot is another way to visualize data and get an idea of its distribution.

In [None]:
#plt.boxplot(df.duration, meanline = True, vert = False)
sns.boxplot(x=df['duration'], palette='flare')

The plot above shows that the median is not centred in the box and does not coincide with the mean; it sits close to the lower quartile (the 25% mark). The outliers are also easy to identify, and there are more of them on the right side of the boxplot than on the left. With the median near the lower quartile and a longer tail on the right, the distribution is right-skewed, which indicates that the data doesn't follow a normal distribution.
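
Visual checks can be complemented by a formal test. The following is a minimal sketch using `scipy.stats.normaltest` (the D'Agostino-Pearson test) on synthetic right-skewed data. The sample is generated here and is not the Kickstarter `duration` column, and the 0.05 threshold is a conventional choice rather than something taken from the notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, right-skewed sample standing in for a non-normal variable (assumption)
sample = rng.exponential(scale=30, size=500)

# Null hypothesis: the sample comes from a normal distribution
statistic, p_value = stats.normaltest(sample)

alpha = 0.05  # conventional significance level (assumption)
is_normal = p_value >= alpha
print(f'statistic={statistic:.1f}, p-value={p_value:.3g}, looks normal: {is_normal}')
```

A small p-value rejects normality. On the real data the same call would be `stats.normaltest(df['duration'])`.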

# Part 2: Qualitative Analysis

## What's the best length of time to run a campaign?

In [None]:
#Define bar plot
colors = ('darkgreen','darkred')
dur = df.groupby('successful')['duration'].mean().sort_values()
dur.plot(kind = 'barh', figsize = (8, 3), color = colors,  zorder=2, width=0.3)

# Set axis label
plt.xlabel('Duration', fontsize = 15)
plt.ylabel('Status', fontsize = 15)
plt.title('Optimal duration of the project', fontsize = 20)


Projects on Kickstarter typically last from 1 to 50 days. Campaigns with shorter durations show a higher success rate.
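
Comparing two group means hides how the success rate changes across the range of durations. The sketch below bins the duration and computes the success rate per bin. The data is invented and the bin edges are an arbitrary assumption, so only the technique carries over to the real dataset.

```python
import pandas as pd

# Invented toy data: campaign length in days and outcome (1 = successful)
toy = pd.DataFrame({
    'duration':   [10, 12, 20, 25, 30, 40, 45, 60],
    'successful': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Arbitrary 15-day bins; intervals are closed on the right, e.g. (0, 15]
bins = [0, 15, 30, 45, 60]
labels = ['1-15', '16-30', '31-45', '46-60']
toy['duration_bin'] = pd.cut(toy['duration'], bins=bins, labels=labels)

# The mean of a 0/1 column is the share of successful projects in each bin
rate_by_bin = toy.groupby('duration_bin', observed=True)['successful'].mean()
print(rate_by_bin)
```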

## What's the ideal pledge goal?

In [None]:
#Define bar plot
colors = ('darkgreen','darkred')
dur = df.groupby('successful')['goal'].mean().sort_values()
dur.plot(kind = 'barh', figsize = (8, 3), color = colors,  zorder=2, width=0.3)

# Set axis label
plt.xlabel('Pledge Goal', fontsize = 12)
plt.ylabel('Status', fontsize = 12)
plt.title('Pledge Goal', fontsize = 20)

In [None]:
#Goal mean of the successful projects
mean_suc = round(df.groupby('successful')['goal'].mean().loc[1],2)
print(f'The ideal pledge goal is $ {mean_suc}')

The chart shows that projects with a lower pledge goal have a better chance of succeeding.
Let's look at the relationship between a project's goal and the amount actually pledged. The chart below plots the goal against the amount pledged for each project, coloured by the project's outcome.

In [None]:
#define colors (darkgreen for successful projects and darkred for failed ones), keyed by the value of 'successful'
colors = {1: 'darkgreen', 0: 'darkred'}
#create a plot using seaborn, adjust data to millions
ax = sns.scatterplot(x = df.pledged/1e6, y = df.goal/1e6, hue=df.successful, palette=colors)
#add blue line to better visualize the border between failed and successful projects
sns.lineplot(x=(0,1.5), y=(0,1.5), color='darkblue')
#set the axes from -0.1 to 1.5 (-0.1 looks better than 0 actually)
ax.set(ylim=(-0.1,1.5), xlim=(-0.1,1.5))
#set labels and title
ax.set(xlabel='Amount Pledged in Millions', ylabel='Goal in Millions', title= 'Goal vs. Pledged')

The graph suggests that unsuccessful projects usually fail without getting close to their goal, meaning that they “do not move horizontally towards the blue line but stay at x~0”. This is in line with Kickstarter being an “all or nothing” platform, which we can confirm on the platform website. If you don’t make it, you probably didn’t even come close. On the other hand, many successful projects exceed their stated goal by far and raise a multiple of their initial goal.
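
The same observation can be read off numerically through the ratio of pledged amount to goal. This is a minimal sketch on invented numbers; the `funded_ratio` column name is an illustrative choice and does not exist in the dataset.

```python
import pandas as pd

# Invented toy data: two successful and two failed projects
toy = pd.DataFrame({
    'goal':       [1000, 5000, 2000, 10000],
    'pledged':    [1500, 200, 2100, 0],
    'successful': [1, 0, 1, 0],
})

# Ratio >= 1 means the goal was reached; values near 0 mean almost nothing was raised
toy['funded_ratio'] = toy['pledged'] / toy['goal']

# The median is less sensitive than the mean to a few hugely overfunded projects
median_ratio = toy.groupby('successful')['funded_ratio'].median()
print(median_ratio)
```

On the real data, a median ratio close to 0 for failed projects would back up the reading of the scatter plot.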

## What type of projects would be most successful at getting funded?

In [None]:
# Print categories
df['category'].unique().tolist()

In [None]:
# Plot showing the percentage of Success
sns.catplot(y='successful', x= 'category', data=df, kind='bar', ci = None, aspect = 2)
plt.xlabel('Category', fontsize = 15)
plt.ylabel('Percentage of Success', fontsize = 15)
plt.title('Success by Category', fontsize = 20)
plt.show()

The category with the highest success rate is Dance.
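
The ranking can also be computed directly instead of being read off the bar chart. The sketch below uses invented rows, so the resulting numbers say nothing about the real dataset; only the `groupby` pattern is the point.

```python
import pandas as pd

# Invented toy data, not taken from the Kickstarter dataset
toy = pd.DataFrame({
    'category':   ['Dance', 'Dance', 'Film', 'Film', 'Film', 'Games'],
    'successful': [1, 1, 1, 0, 0, 0],
})

# Share of successful projects per category, highest first
success_rate = (toy.groupby('category')['successful']
                   .mean()
                   .sort_values(ascending=False))
best_category = success_rate.idxmax()
print(success_rate)
```

Passing `order=success_rate.index` to `sns.catplot` would then draw the bars from the most to the least successful category.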

## Is there any ideal month/day/time to launch a campaign?

In [None]:
# Parse 'funded_date' once; '-%f' matches the trailing numeric offset at the end of the string
funded = df['funded_date'].apply(lambda x: dt.datetime.strptime(x, '%a, %d %b %Y %H:%M:%S -%f'))
# Create 'funded_month' column (full month name) from the parsed date
df['funded_month'] = funded.dt.strftime('%B')
# Create 'funded_weekday' column (abbreviated weekday name) from the parsed date
df['funded_weekday'] = funded.dt.strftime('%a')
# Create 'funded_hour' column (hour on a 24-hour clock) from the parsed date
df['funded_hour'] = funded.dt.strftime('%H')
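
A vectorised alternative is `pd.to_datetime` with an explicit format, which reads the trailing offset as a time zone via `%z`. This is a sketch under assumptions: the two sample strings are invented to match the pattern used above, and every timestamp is assumed to end in a numeric UTC offset such as `-0000`.

```python
import pandas as pd

# Invented samples in the format implied by the strptime pattern above (assumption)
sample = pd.Series([
    'Fri, 19 Aug 2011 19:28:17 -0000',
    'Mon, 02 Aug 2010 03:59:00 -0000',
])

# %z parses the numeric offset; utc=True normalises everything to UTC
parsed = pd.to_datetime(sample, format='%a, %d %b %Y %H:%M:%S %z', utc=True)

months = parsed.dt.month_name()          # full month name
weekdays = parsed.dt.day_name().str[:3]  # three-letter weekday
hours = parsed.dt.hour                   # integer hour, 0-23
print(months.tolist(), weekdays.tolist(), hours.tolist())
```

Unlike the string columns produced by `strftime`, `hours` is numeric here, so it sorts in clock order without extra work.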

In [None]:
# Plot showing rates of success by Month
sns.catplot(y='successful', x='funded_month', data = df, kind='bar', ci = None, aspect = 1.8)
plt.xlabel('Month of Funding', fontsize = 15)
plt.ylabel('Percentage of Success', fontsize = 15)
plt.title('Success by Funding Month', fontsize = 20)
plt.show()
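
Because `funded_month` holds plain strings, the bars are drawn in the order the months first appear in the data rather than in calendar order. One possible fix is an ordered categorical, sketched here on invented rows.

```python
import calendar

import pandas as pd

# Invented toy data, not taken from the Kickstarter dataset
toy = pd.DataFrame({
    'funded_month': ['March', 'January', 'March', 'February'],
    'successful':   [1, 0, 0, 1],
})

# 'January' ... 'December' (English names under the default locale)
month_order = list(calendar.month_name)[1:]
toy['funded_month'] = pd.Categorical(toy['funded_month'],
                                     categories=month_order, ordered=True)

# Grouping on an ordered categorical returns the months in calendar order
rate_by_month = toy.groupby('funded_month', observed=True)['successful'].mean()
print(rate_by_month)
```

Alternatively, `order=month_order` can be passed to `sns.catplot` to fix the bar order without changing the column.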

In [None]:
# Plot showing rates of success by Weekday
sns.catplot(y='successful', x='funded_weekday', data = df, kind='bar', ci = None, aspect = 1.8)
plt.xlabel('Day of Funding', fontsize = 15)
plt.ylabel('Percentage of Success', fontsize = 15)
plt.title('Success by Funding Weekday', fontsize = 20)
plt.show()

In [None]:
# Plot showing rates of success by Hour
sns.catplot(y='successful', x='funded_hour', data = df, kind='bar', ci = None, aspect = 1.8)
plt.xlabel('Hour of Funding', fontsize = 15)
plt.ylabel('Percentage of Success', fontsize = 15)
plt.title('Success by Funding Hour', fontsize = 20)
plt.show()

In [None]:
# Create 'funded_year' column based on datetime conversion from 'funded_date' column
df['funded_year'] = df['funded_date'].apply(lambda x: dt.datetime.strptime(x, '%a, %d %b %Y %X -%f').strftime('%Y'))

In [None]:
# Plot showing rates of success by Year
sns.catplot(y='successful', x='funded_year', data = df, kind='bar', ci = None, aspect = 1.8)
plt.xlabel('Year of Funding', fontsize = 15)
plt.ylabel('Percentage of Success', fontsize = 15)
plt.title('Success by Funding Year', fontsize = 20)
plt.show()