<a href="https://colab.research.google.com/github/avmiloserdov/cofound-predict-success-model/blob/main/CoFound_le_kickstarter_lgbm703.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement
The goal is to predict, before a Kickstarter project's deadline, whether the project will succeed or fail, and to identify the factors that determine a project's success rate.


# Solution Notebook
This notebook has 4 steps/modules:
    1. Data Understanding (EDA) and Preprocessing
    2. Feature Engineering and heuristic feature selection
    3. Model Building
        3A. XGBoost
        3B. Random Forest
        3C. LGBM (2 versions: with one-hot encoded features, and with categorical features as integer-coded category columns)
        3D. Ensemble Models: Normal Averaging and AdaBoosting
    4. Feature importance

The best accuracy obtained on Test Data was 70.3%, from LGBM (version 2).

## Setting up the required libraries and packages

In [154]:
# Libraries
import numpy as np
import pandas as pd
import os
from datetime import datetime
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
import string
#import itertools
#from itertools import product

## Importing a dataset

In [155]:
# read in data
kickstarters_2017 = pd.read_csv('ks-projects-201801.csv')
#kickstarters_2017.head()

In [156]:
categoryArr = kickstarters_2017.category.unique()
print('Number of categories in the dataset:', len(categoryArr))

#categoryArr2 = ['Audio','Print','Journalism','Video','Web','Photo','Games','Tabletop Games','Video Games','Mobile Games','Playing Cards','Puzzles','Live Games','Gaming Hardware',
#               'Fashion','Childrenswear','Accessories','Ready-to-wear','Apparel','Jewelry','Footwear','Couture','Pet Fashion','Product Design','Design','Architecture','Graphic Design',
#               'Typography','Interactive Design','Civic Design','Crafts','DIY','Weaving','Stationery','Woodworking','Letterpress','Pottery','Glass','Printing','Public Art','Illustration',
#               'Art','Painting','Performance Art','Ceramics','Sculpture','Mixed Media','Digital Art','Installations','Conceptual Art','Textiles','Video Art','Hardware','Software','Gadgets',
#               'Web','Apps','Technology','Flight','Makerspaces','Fabrication Tools','Sound','Wearables','DIY Electronics','Camera Equipment','3D Printing','Space Exploration','Robots'] 

Number of categories in the dataset: 159


## Basic Tests and EDA on input data

In [157]:
#printing all summary of the kickstarter data
#this will give the dimensions of data set : (rows, columns)
#print(kickstarters_2017.shape)
#columns and data types
#print(kickstarters_2017.info())
#basic stats of columns
#print(kickstarters_2017.describe())
#number of unique values in all columns
#print(kickstarters_2017.nunique())

The above stats help us reach the following conclusions:
1. The data is at ID level (the number of unique IDs equals the number of rows).
2. The numerical data fields are: goal, pledged, backers, usd_pledged, usd_pledged_real, usd_goal_real.

#### Understanding Variables in the Dataset

The dataset has 15 variables including ID. Since ID is the level of the dataset, we can set it as the index of the data later. Variables like name, currency, deadline, launched date and country are self-explanatory. Explanations of some key variables are as follows:

Main_Category: There are 15 main categories for the project. These main categories broadly classify projects based on the topic and genre they belong to.

Category: Main categories are further subdivided into categories that give a more specific idea of the project. For example, Main Category “Technology” has 15 categories like Gadgets, Web, Apps, Software etc. There are 159 total categories.

Goal: This is the amount the company needs to raise to start its project. The goal amount is an important variable: if it is too high, the project may fail to raise that amount and be unsuccessful; if it is too low, the project may reach its goal quickly and backers may not be interested in pledging more.

Pledged: This is the amount raised by the company through its backers. On Kickstarter, if the total amount pledged is lower than the goal, the project is unsuccessful and the start-up company doesn’t receive any funds. If the pledged amount reaches or exceeds the goal, the project is considered successful. The variable “usd pledged” is the amount of money raised in US dollars.

Number of Backers: This is the number of people who have supported the project by pledging some amount.
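
The success rule described under Pledged can be written as a small check. This is a minimal sketch: the function names `is_successful` and `goal_ratio` are hypothetical, and counting a pledged amount exactly equal to the goal as a success is an assumption about the all-or-nothing rule, not something the dataset states.

```python
def is_successful(pledged: float, goal: float) -> bool:
    """Return True when the pledged amount reaches the goal (assumed rule)."""
    return pledged >= goal


def goal_ratio(pledged: float, goal: float) -> float:
    """Pledged amount as a fraction of the goal (1.0 means exactly funded)."""
    return pledged / goal


# Illustrative amounts, not rows from the dataset.
funded = is_successful(1200.0, 1000.0)
unfunded = is_successful(450.0, 1000.0)
ratio = goal_ratio(450.0, 1000.0)
```

The notebook itself does not recompute the outcome this way; it takes the label from the `state` column.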

In [158]:
#Distribution of projects by state
percent_success = round(kickstarters_2017["state"].value_counts() 
/ len(kickstarters_2017["state"]) * 100,2)

print("State, % ")
print(percent_success)

State, % 
failed        52.22
successful    35.38
canceled      10.24
undefined      0.94
live           0.74
suspended      0.49
Name: state, dtype: float64


In [159]:
#renaming column usd_pledged as there is no '_' in the actual dataset variable name
col_names_prev=list(kickstarters_2017)
col_names_new= ['ID',
 'name',
 'category',
 'main_category',
 'currency',
 'deadline',
 'goal',
 'launched',
 'pledged',
 'state',
 'backers',
 'country',
 'usd_pledged',
 'usd_pledged_real',
 'usd_goal_real']
kickstarters_2017.columns= col_names_new

In [160]:
#segregating the variables as categorical and continuous
cat_vars=[ 'category', 'main_category', 'currency','country']
cont_vars=['goal', 'pledged', 'backers','usd_pledged','usd_pledged_real','usd_goal_real']

In [161]:
#Correlation of the quantitative attributes
#kickstarters_2017[cont_vars].corr()

In [162]:
#setting unique ID as index of the table
#this is because the ID column will not be used in the algorithm. yet it is needed to identify the project
df_kick= kickstarters_2017.set_index('ID')

In [163]:
# Filtering only for successful and failed projects
# .copy() makes kick_projects an independent DataFrame, so later column assignments do not act on a slice of df_kick
kick_projects = df_kick[(df_kick['state'] == 'failed') | (df_kick['state'] == 'successful')].copy()
#converting 'successful' state to 1 and failed to 0
kick_projects['state'] = (kick_projects['state'] =='successful').astype(int)
print(kick_projects.shape)

(331675, 14)


In [164]:
#checking distribution of projects across various main categories
#kick_projects.groupby(['main_category','state']).size()
#kick_projects.groupby(['category','state']).size()

In [165]:
#correlation of continuous variables with the dependent variable
#kick_projects[['goal', 'pledged', 'backers','usd_pledged','usd_pledged_real','usd_goal_real','state']].corr()

## Feature Engineering

In [166]:
#creating derived metrics/ features

#converting the date columns from string to date format
#will use it to derive the duration of the project
kick_projects['launched_date'] = pd.to_datetime(kick_projects['launched'], format='%Y-%m-%d %H:%M:%S')
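#no explicit format for the deadline: the column may hold date-only strings, which a strict '%H:%M:%S' format can reject in recent pandas
kick_projects['deadline_date'] = pd.to_datetime(kick_projects['deadline'])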



In [167]:
kick_projects= kick_projects.sort_values('launched_date',ascending=True)

In [168]:
#kick_projects.head()

In [169]:
#creating features from the project name

#length of name
kick_projects['name_len'] = kick_projects.name.str.len()

# name ends with !
kick_projects['name_exclaim'] = (kick_projects.name.str[-1] == '!').astype(int)

# name ends with ?
kick_projects['name_question'] = (kick_projects.name.str[-1] == '?').astype(int)

# number of words in the name
kick_projects['name_words'] = kick_projects.name.apply(lambda x: len(str(x).split(' ')))

# if name is uppercase
kick_projects['name_is_upper'] = kick_projects.name.str.isupper().astype(float)
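
To make the name features concrete, here is a pure-Python sketch that computes the same five values for a single string. The helper `name_features` is hypothetical and only mirrors the pandas expressions above for ordinary strings; missing names are not handled the way pandas handles them.

```python
def name_features(name):
    """Compute the name-based features for one project name."""
    text = str(name)
    return {
        "name_len": len(text),                      # number of characters
        "name_exclaim": int(text.endswith("!")),    # 1 if the name ends with '!'
        "name_question": int(text.endswith("?")),   # 1 if the name ends with '?'
        "name_words": len(text.split(" ")),         # space-separated tokens
        "name_is_upper": float(text.isupper()),     # 1.0 if all cased letters are uppercase
    }


# Illustrative project name, not taken from the dataset.
example = name_features("BUILD IT NOW!")
```

For this example the features are a length of 13, an exclamation flag of 1, a question flag of 0, three words and an uppercase flag of 1.0.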

In [170]:
# normalizing goal by applying log
kick_projects['goal_log'] = np.log1p(kick_projects.goal)
#creating goal features to check what range goal lies in
kick_projects['Goal_10'] = kick_projects.goal.apply(lambda x: x // 10)
kick_projects['Goal_1000'] = kick_projects.goal.apply(lambda x: x // 1000)
kick_projects['Goal_100'] = kick_projects.goal.apply(lambda x: x // 100)
kick_projects['Goal_500'] = kick_projects.goal.apply(lambda x: x // 500)

In [171]:
#features from date column
kick_projects['duration']=(kick_projects['deadline_date']-kick_projects['launched_date']).dt.days
#the idea for deriving launched quarter month year is that perhaps projects launched in a particular year/ quarter/ month might have a low success rate
kick_projects['launched_quarter']= kick_projects['launched_date'].dt.quarter
kick_projects['launched_month']= kick_projects['launched_date'].dt.month
kick_projects['launched_year']= kick_projects['launched_date'].dt.year
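#.dt.week is deprecated in recent pandas; isocalendar().week returns the ISO week number
kick_projects['launched_week']= kick_projects['launched_date'].dt.isocalendar().week.astype(int)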


In [172]:
#kick_projects.head()
#df = pd.DataFrame(kick_projects)
#df2 = df.drop(df.columns[[0,3,4,5,6,7,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32]], axis=1)
#df3 = df2[df2['main_category'].isin(['Technology','Design','Crafts'])]
#df4 = df3[34:134]
#kick_projects_new.head()
#df3.head() 

In [173]:
#events = pd.read_csv('!events.csv')
#events2 = events.drop(events.columns[[0,4,5,6,8]], axis=1)
#events2.head()

In [174]:
#from pathlib import Path  
#filepath = Path('/content/out.csv')  
#df4.to_csv(filepath)  

In [175]:
#additional features from goal, pledge and backers columns
kick_projects.loc[:,'goal_reached'] = kick_projects['pledged'] / kick_projects['goal'] # Pledged amount as a percentage of goal.
#The above field will be used to compute another metric
# In backers column, impute 0 with 1 to prevent undefined division.
kick_projects.loc[kick_projects['backers'] == 0, 'backers'] = 1 
kick_projects.loc[:,'pledge_per_backer'] = kick_projects['pledged'] / kick_projects['backers'] # Pledged amount per backer.

In [176]:
#will create percentile buckets for the goal amount in a category
kick_projects['goal_cat_perc'] =  kick_projects.groupby(['category'])['goal'].transform(
                     lambda x: pd.qcut(x, [0, .35, .70, 1], labels =[1,2,3], duplicates='drop'))

#will create percentile buckets for the duration in a category
kick_projects['duration_cat_perc'] =  kick_projects.groupby(['category'])['duration'].transform(
                     lambda x: pd.qcut(x, [0, .35, .70, 1], labels =False, duplicates='drop'))
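
The per-category bucketing can be checked on a toy frame. The sketch below uses made-up goals for two categories (the frame `toy` and the column `goal_bucket` are hypothetical names) and the same 35% / 70% cut points with `labels=False`, so each project gets bucket 0, 1 or 2 relative to its own category.

```python
import pandas as pd

# Two categories with very different goal scales (illustrative values).
toy = pd.DataFrame({
    "category": ["Art"] * 10 + ["Tech"] * 10,
    "goal": list(range(100, 1100, 100)) + list(range(1000, 11000, 1000)),
})

# Bucket each goal by its quantile within its own category.
toy["goal_bucket"] = toy.groupby("category")["goal"].transform(
    lambda x: pd.qcut(x, [0, .35, .70, 1], labels=False, duplicates="drop")
)

bucket_sizes = toy.groupby(["category", "goal_bucket"]).size()
```

Because the quantiles are computed inside each group, a goal of 1000 falls in the top bucket for "Art" but in the bottom bucket for "Tech".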

In [177]:
#creating a metric to see number of competitors for a given project in a given quarter
#number of participants in a given category, that launched in the same year and quarter and in the same goal bucket
ks_particpants_qtr=kick_projects.groupby(['category','launched_year','launched_quarter','goal_cat_perc']).count()
ks_particpants_qtr=ks_particpants_qtr[['name']]
#since the above table has all group by columns created as index, converting them into columns
ks_particpants_qtr.reset_index(inplace=True)

#creating a metric to see number of competitors for a given project in a given month
#number of participants in a given category, that launched in the same year and month and in the same goal bucket
ks_particpants_mth=kick_projects.groupby(['category','launched_year','launched_month','goal_cat_perc']).count()
ks_particpants_mth=ks_particpants_mth[['name']]
#since the above table has all group by columns created as index, converting them into columns
ks_particpants_mth.reset_index(inplace=True)

#creating a metric to see number of competitors for a given project in a given week
#number of participants in a given category, that launched in the same year and week and in the same goal bucket
ks_particpants_wk=kick_projects.groupby(['category','launched_year','launched_week','goal_cat_perc']).count()
ks_particpants_wk=ks_particpants_wk[['name']]
#since the above table has all group by columns created as index, converting them into columns
ks_particpants_wk.reset_index(inplace=True)

In [178]:
#renaming columns of the derived table
colmns_qtr=['category', 'launched_year', 'launched_quarter', 'goal_cat_perc', 'participants_qtr']
ks_particpants_qtr.columns=colmns_qtr

colmns_mth=['category', 'launched_year', 'launched_month', 'goal_cat_perc', 'participants_mth']
ks_particpants_mth.columns=colmns_mth

colmns_wk=['category', 'launched_year', 'launched_week', 'goal_cat_perc', 'participants_wk']
ks_particpants_wk.columns=colmns_wk

In [179]:
#merging the participants columns into the base table
kick_projects = pd.merge(kick_projects, ks_particpants_qtr, on = ['category', 'launched_year', 'launched_quarter','goal_cat_perc'], how = 'left')
kick_projects = pd.merge(kick_projects, ks_particpants_mth, on = ['category', 'launched_year', 'launched_month','goal_cat_perc'], how = 'left')
kick_projects = pd.merge(kick_projects, ks_particpants_wk, on = ['category', 'launched_year', 'launched_week','goal_cat_perc'], how = 'left')
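
The count-then-merge pattern used for the competitor features can be compared with a `groupby(...).transform("count")` on a toy frame. This is a sketch with made-up rows and only three grouping keys (the goal bucket is left out for brevity); it is not a replacement for the cells above.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "category": ["Art", "Art", "Art", "Tech", "Tech"],
    "launched_year": [2015, 2015, 2016, 2015, 2015],
    "launched_quarter": [1, 1, 1, 2, 2],
})
keys = ["category", "launched_year", "launched_quarter"]

# Route 1: count projects per group, then merge the counts back (as in the notebook).
counts = toy.groupby(keys)["name"].count().reset_index(name="participants_qtr")
merged = toy.merge(counts, on=keys, how="left")

# Route 2: broadcast the group count to every row in one step.
toy["participants_qtr"] = toy.groupby(keys)["name"].transform("count")
```

Both routes give the same column. Note that the count includes the project itself, so a value of 1 means the project had no competitors in its group.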

In [180]:
#creating 2 metrics to get average pledge per backer for a category in a year according to the goal bucket it lies in and the success rate ie average pledged to goal ratio for the category and goal bucket in this year
#using pledge_per_backer (computed earlier) and averaging it by category in a launch year
ks_ppb_goal=pd.DataFrame(kick_projects.groupby(['category','launched_year','goal_cat_perc'])[['pledge_per_backer','goal_reached']].mean())
#since the above table has all group by columns created as index, converting them into columns
ks_ppb_goal.reset_index(inplace=True)
#renaming column
ks_ppb_goal.columns= ['category','launched_year','goal_cat_perc','avg_ppb_goal','avg_success_rate_goal']

#creating a metric: the success rate, i.e. the average pledged-to-goal ratio for the category and duration bucket in this year
ks_ppb_duration=pd.DataFrame(kick_projects.groupby(['category','launched_year','duration_cat_perc'])['goal_reached'].mean())
#since the above table has all group by columns created as index, converting them into columns
ks_ppb_duration.reset_index(inplace=True)
#renaming column
ks_ppb_duration.columns= ['category','launched_year','duration_cat_perc','avg_success_rate_duration']



In [181]:
#merging the average pledge-per-backer and success-rate columns into the base table
kick_projects = pd.merge(kick_projects, ks_ppb_goal, on = ['category', 'launched_year','goal_cat_perc'], how = 'left')
kick_projects = pd.merge(kick_projects, ks_ppb_duration, on = ['category', 'launched_year','duration_cat_perc'], how = 'left')

In [182]:
#creating 2 metrics: mean and median goal amount
median_goal_cat=pd.DataFrame(kick_projects.groupby(['category','launched_year','duration_cat_perc'])['goal'].median())
#since the above table has all group by columns created as index, converting them into columns
median_goal_cat.reset_index(inplace=True)
#renaming column
median_goal_cat.columns= ['category','launched_year','duration_cat_perc','median_goal_year']

mean_goal_cat=pd.DataFrame(kick_projects.groupby(['category','launched_year','duration_cat_perc'])['goal'].mean())
#since the above table has all group by columns created as index, converting them into columns
mean_goal_cat.reset_index(inplace=True)
#renaming column
mean_goal_cat.columns= ['category','launched_year','duration_cat_perc','mean_goal_year']

In [183]:
#merging the median and mean goal columns into the base table
kick_projects = pd.merge(kick_projects, median_goal_cat, on = ['category', 'launched_year','duration_cat_perc'], how = 'left')
kick_projects = pd.merge(kick_projects, mean_goal_cat, on = ['category', 'launched_year','duration_cat_perc'], how = 'left')

In [184]:
#print(kick_projects.shape)
#kick_projects[:3]

In [185]:
# replacing all 'N,0"' values in the country column with 'NZERO' to avoid discrepancies while one hot encoding
kick_projects = kick_projects.replace({'country': 'N,0"'}, {'country': 'NZERO'}, regex=True)

In [186]:
#kick_projects[:3]

In [187]:
#selecting the needed fields only
#this will lead to the final features list

#creating a list of columns to be dropped
drop_columns= ['name','launched','deadline','launched_date','deadline_date','pledged','backers','usd_pledged','usd_pledged_real','pledge_per_backer','goal_reached']
#dropping columns above
kick_projects.drop(drop_columns, axis=1, inplace=True)

In [188]:
#these functions will be used on the textual column entries to replace '&', '-' and spaces, and to strip surrounding whitespace
def replace_ampersand(val):
    if isinstance(val, str):
        return val.replace('&', 'and')
    else:
        return val

def replace_hyphen(val):
    if isinstance(val, str):
        return val.replace('-', '_')
    else:
        return val

def remove_extraspace(val):
    if isinstance(val, str):
        return val.strip()
    else:
        return val

def replace_space(val):
    if isinstance(val, str):
        return val.replace(' ', '_')
    else:
        return val

In [189]:
#apply those functions to the category columns
#this removes special characters from the text columns.
#Since these fields will be one-hot encoded, the column names derived from them
#should be compatible with the required format
kick_projects['category'] = kick_projects['category'].apply(remove_extraspace)
kick_projects['category'] = kick_projects['category'].apply(replace_ampersand)
kick_projects['category'] = kick_projects['category'].apply(replace_hyphen)
kick_projects['category'] = kick_projects['category'].apply(replace_space)

kick_projects['main_category'] = kick_projects['main_category'].apply(remove_extraspace)
kick_projects['main_category'] = kick_projects['main_category'].apply(replace_ampersand)
kick_projects['main_category'] = kick_projects['main_category'].apply(replace_hyphen)
kick_projects['main_category'] = kick_projects['main_category'].apply(replace_space)

In [190]:
#missing value treatment
# Check for nulls.
#kick_projects.isnull().sum()

There are only 3 rows with nulls, and the rows with nulls have no names. These rows can be removed.

In [191]:
#dropping all rows that have any nulls
kick_projects=kick_projects.dropna() 

In [192]:
# Check for nulls again.
#kick_projects.isnull().sum()

No nulls, we are good to go

In [193]:
#creating a backup copy of the dataset
kick_projects_copy= kick_projects.copy()

#kick_projects_copy[:5]

In [194]:
#from pathlib import Path  
#filepath = Path('/content/kick_projects_copy.csv')  
#kick_projects_copy.to_csv(filepath)

In [195]:
#kick_projects = pd.read_csv('kick_projects_copy.csv')

In [196]:
for c in kick_projects.columns:
    #data type of the current column
    col_type = kick_projects[c].dtype
    #processing only the categorical (object) columns
    if col_type == 'object' :
        a=kick_projects[c].unique()
        #mapping each unique value to an integer, in order of first appearance
        diction = {val: idx for idx, val in enumerate(a)}
        print(diction)
        #replacing the values in the backup copy with their integer codes
        kick_projects_copy[c] = [diction[item] for item in kick_projects_copy[c]] 
        # converting data type to 'category'
        kick_projects_copy[c] = kick_projects_copy[c].astype('category')

{'Fashion': 0, 'Shorts': 1, 'Illustration': 2, 'Software': 3, 'Journalism': 4, 'Fiction': 5, 'Rock': 6, 'Photography': 7, 'Puzzles': 8, 'Graphic_Design': 9, 'Film_and_Video': 10, 'Publishing': 11, 'Documentary': 12, 'Theater': 13, 'Sculpture': 14, 'Electronic_Music': 15, 'Nonfiction': 16, 'Painting': 17, 'Indie_Rock': 18, 'Public_Art': 19, 'Art': 20, 'Crafts': 21, 'Jazz': 22, 'Music': 23, 'Comics': 24, "Children's_Books": 25, 'Narrative_Film': 26, 'Tabletop_Games': 27, 'Video_Games': 28, 'Digital_Art': 29, 'Food': 30, 'Animation': 31, 'Conceptual_Art': 32, 'Pop': 33, 'Hip_Hop': 34, 'Country_and_Folk': 35, 'Periodicals': 36, 'Webseries': 37, 'Product_Design': 38, 'Performance_Art': 39, 'Art_Books': 40, 'World_Music': 41, 'Knitting': 42, 'Technology': 43, 'Classical_Music': 44, 'Graphic_Novels': 45, 'Poetry': 46, 'Radio_and_Podcasts': 47, 'Design': 48, 'Hardware': 49, 'Webcomics': 50, 'Dance': 51, 'Translations': 52, 'Crochet': 53, 'Games': 54, 'Photo': 55, 'Mixed_Media': 56, 'Space_Expl
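
The dictionary printed above assigns codes in order of first appearance, whereas the `LabelEncoder` used in the following cells assigns codes in sorted order of the labels, so the two encodings generally differ for the same value. The sketch below shows both schemes in plain Python; `first_seen_codes` and `sorted_codes` are hypothetical helper names.

```python
def first_seen_codes(values):
    """Map each distinct value to an integer in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def sorted_codes(values):
    """Map each distinct value to an integer in sorted order of the values."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


# A few category names in an illustrative order.
sample = ["Fashion", "Shorts", "Illustration", "Fashion"]
by_appearance = first_seen_codes(sample)
by_sort_order = sorted_codes(sample)
```

Here "Shorts" gets code 1 by appearance but code 2 by sort order, which is why a code from one scheme should not be looked up in the other.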

In [197]:
le = preprocessing.LabelEncoder()

In [198]:
#category_dic = pd.DataFrame([le.fit_transform(kick_projects['category']),le.inverse_transform(le.fit_transform(kick_projects['category']))])
#category_dic.to_csv('category_dictionary.csv')

In [199]:
#test = le.fit_transform(kick_projects['category']

In [200]:
#main_category_dic = pd.DataFrame([le.fit_transform(kick_projects['main_category']),le.inverse_transform(le.fit_transform(kick_projects['main_category']))])
#main_category_dic.to_csv('main_category_dictionary.csv')

In [201]:
#currency_dic = pd.DataFrame([le.fit_transform(kick_projects['currency']),le.inverse_transform(le.fit_transform(kick_projects['currency']))])
#currency_dic.to_csv('currency_dictionary.csv')

In [202]:
dic_test_df = pd.DataFrame(le.fit_transform(kick_projects['category']))

In [203]:
dic_test_arr = le.fit_transform(kick_projects['category'])
dic_test_arr

array([ 52, 129,  70, ...,   7, 136,  68])

In [204]:
drop_columns2= ['currency','country', 'name_exclaim', 'name_question' ,'name_words' ,'name_is_upper' ,'goal_log' ,'Goal_10' ,'Goal_1000' ,'Goal_100' ,'Goal_500' ,'goal' ,'launched_quarter' ,'launched_month' ,'launched_week' ,'goal_cat_perc' ,'duration_cat_perc','participants_mth', 'participants_wk', 'median_goal_year']
#dropping columns above
kick_projects.drop(drop_columns2, axis=1, inplace=True)

In [205]:
kick_projects.loc[435]

category                     Photography
main_category                Photography
state                                  0
usd_goal_real                     1500.0
name_len                            38.0
duration                              87
launched_year                       2009
participants_qtr                      13
avg_ppb_goal                   39.690597
avg_success_rate_goal           1.207638
avg_success_rate_duration       0.716354
mean_goal_year               2986.559556
Name: 435, dtype: object

In [206]:
#country_dic = pd.DataFrame([le.fit_transform(kick_projects['country']),le.inverse_transform(kick_projects['country']))
#country_dic.to_csv('country_dictionary.csv')

In [207]:
kick_projects_ip = pd.DataFrame()

kick_projects_ip['category'] = le.fit_transform(kick_projects['category'])
kick_projects_ip['main_category'] = le.fit_transform(kick_projects['main_category'])

In [208]:
dropCol = ['category','main_category']
pd.concat([kick_projects_ip,kick_projects.drop(columns = dropCol)], axis=1)

Unnamed: 0,category,main_category,state,usd_goal_real,name_len,duration,launched_year,participants_qtr,avg_ppb_goal,avg_success_rate_goal,avg_success_rate_duration,mean_goal_year
0,52.0,5.0,0.0,1000.00,59.0,39.0,2009.0,3.0,40.982361,0.325542,0.314024,5011.470588
1,129.0,6.0,0.0,80000.00,30.0,87.0,2009.0,1.0,65.203511,0.274317,0.433184,9675.250000
2,70.0,0.0,1.0,20.00,19.0,8.0,2009.0,3.0,13.095238,0.552500,0.508929,437.500000
3,131.0,13.0,1.0,99.00,28.0,79.0,2009.0,7.0,36.765524,0.572958,0.611654,4358.076923
4,52.0,5.0,0.0,1900.00,10.0,28.0,2009.0,3.0,40.982361,0.325542,0.256781,1600.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
331670,136.0,8.0,1.0,35.98,20.0,2.0,2017.0,23.0,36.799397,2.953141,1.540368,29724.629371
331671,68.0,10.0,1.0,271.03,48.0,4.0,2017.0,245.0,44.882036,18.711116,10.934493,11423.117560
331672,,,1.0,200.00,20.0,3.0,2017.0,90.0,35.351686,1.420548,0.984520,19187.261388
331673,,,1.0,250.00,14.0,1.0,2017.0,245.0,44.882036,18.711116,10.934493,11423.117560


In [209]:
#combining the label-encoded columns with the remaining features and filling missing values with 0
kick_projects_ip = pd.concat([kick_projects_ip,kick_projects.drop(columns = dropCol)], axis=1).fillna(0)
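
The NaN entries in the preview above are consistent with index misalignment: `pd.concat(..., axis=1)` aligns rows on the index, and a frame that has had rows dropped keeps gaps in its index while a freshly built frame is numbered 0..n-1. This is an interpretation of the output rather than something the notebook states. A minimal sketch with made-up frames (`left`, `right` are hypothetical names):

```python
import pandas as pd

# Freshly built frame: index 0, 1, 2.
left = pd.DataFrame({"category": [10, 11, 12]})
# Frame that lost a row earlier: index 0, 2, 3 (gap at 1).
right = pd.DataFrame({"state": [0, 1, 1]}, index=[0, 2, 3])

# Aligning on the index yields the union of both indexes, with NaN where one side is missing.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first pairs the rows by position instead.
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)
```

If the same effect is at work in the notebook, `fillna(0)` hides the gaps by writing 0 into them, so resetting the index of `kick_projects` before the concat may be the safer choice.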

In [210]:
kick_projects_ip_copy = kick_projects_ip.copy()

In [211]:
features=kick_projects_ip_copy.drop(columns='state').columns
#response has the target variable
response = ['state']

## Model Building

In [212]:
#creating test and train dependent and independent variables
#Split the data into test and train (30-70: random sampling)
train_ind, test_ind, train_dep, test_dep = train_test_split(kick_projects_ip[features], kick_projects_ip[response], test_size=0.3, random_state=0)

### LGBM

LGBM, or LightGBM, is a gradient boosting framework that uses tree-based learning algorithms. What sets it apart from level-wise (depth-wise) tree growth, the default strategy in libraries such as XGBoost, is that it grows trees leaf-wise: the tree is extended vertically at its most promising leaf instead of being split horizontally one full level at a time.

At each step, LightGBM chooses the leaf with the maximum delta loss to grow. For the same number of leaves, leaf-wise growth can therefore reduce the loss more than level-wise growth, which often translates into better accuracy. The resulting trees can be deep and unbalanced and may overfit on small datasets, which is why parameters such as num_leaves and max_depth are constrained in the model below. Additionally, as the name suggests, LightGBM is computationally light and trains quickly.
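
The difference between the two growth strategies can be shown with a toy tree whose split gains are given up front. This is only a sketch of the selection order under stated assumptions: the gains in `toy_gains` are made up, nodes are numbered so that node `i` has children `2i+1` and `2i+2`, and real LightGBM computes gains from the data rather than reading them from a table.

```python
import heapq
from collections import deque


def children(node):
    """Children of a node in a binary tree stored by index."""
    return (2 * node + 1, 2 * node + 2)


def grow_leaf_wise(gains, max_leaves):
    """Always split the available leaf with the largest gain (best-first)."""
    frontier = [(-gains[0], 0)]            # max-heap via negated gains
    order = []
    while frontier and len(order) + 1 < max_leaves:  # leaves = splits + 1
        _, node = heapq.heappop(frontier)
        order.append(node)
        for child in children(node):
            if child in gains:
                heapq.heappush(frontier, (-gains[child], child))
    return order


def grow_level_wise(gains, max_leaves):
    """Split nodes level by level, left to right, regardless of gain."""
    queue = deque([0])
    order = []
    while queue and len(order) + 1 < max_leaves:
        node = queue.popleft()
        order.append(node)
        for child in children(node):
            if child in gains:
                queue.append(child)
    return order


# Hypothetical loss reductions for splitting each node.
toy_gains = {0: 10, 1: 2, 2: 8, 5: 6, 6: 1}
leaf_order = grow_leaf_wise(toy_gains, max_leaves=4)
level_order = grow_level_wise(toy_gains, max_leaves=4)
leaf_total = sum(toy_gains[n] for n in leaf_order)
level_total = sum(toy_gains[n] for n in level_order)
```

With a budget of four leaves, the leaf-wise order follows the high-gain branch (nodes 0, 2, 5) while the level-wise order finishes the second level first (nodes 0, 1, 2), so the leaf-wise tree collects more total gain in this example.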

In [213]:
import lightgbm as lgb

In [214]:
#creating the LGBM classification model
gbm_model = lgb.LGBMClassifier(
        boosting_type= "dart",
        n_estimators=1300,
        learning_rate=0.08,
        num_leaves=35,
        colsample_bytree=.8,
        subsample=.9,
        max_depth=9,
        reg_alpha=.1,
        reg_lambda=.1,
        min_split_gain=.01
)

#fitting the model to the training data
#the target is passed as a 1-D array, which avoids the column-vector warning from scikit-learn
#verbose is not passed to fit(): recent LightGBM releases no longer accept it there
gbm_model=gbm_model.fit(train_ind[features], 
            train_dep[response].values.ravel())


#### Predictions using LGBM

In [215]:
# Predict on the test data
test_ind["Pred_state_LGB"] = gbm_model.predict(test_ind[features])

# Predict on the train data
train_ind["Pred_state_LGB"] = gbm_model.predict(train_ind[features])

# Predict on the full dataset
#kick_projects_ip["Pred_state_LGB"] = gbm_model.predict(kick_projects_ip_scaled_ftrs)

In [216]:
predict_status = gbm_model.predict_proba(train_ind[features].iloc[155].to_numpy().reshape(1, -1))



In [217]:
train_ind[features].iloc[157].to_numpy().reshape(1, -1)

array([[9.60000000e+01, 0.00000000e+00, 3.72000000e+03, 4.90000000e+01,
        4.00000000e+01, 2.01000000e+03, 1.30000000e+01, 5.92821876e+01,
        5.55345694e-01, 6.00120191e-01, 5.67500000e+03]])

In [218]:
#first number: probability of failure
#second number: probability of success
predict_status

array([[0.24718272, 0.75281728]])
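
The probability pair can be turned into a label with a simple threshold. The sketch below uses the two numbers printed above; the helper `interpret_proba` is hypothetical, and the 0.5 threshold is an assumption that matches picking the more probable of the two classes.

```python
def interpret_proba(proba_row, threshold=0.5):
    """Split a [P(fail), P(success)] pair and derive the predicted label."""
    p_fail, p_success = proba_row
    label = 1 if p_success >= threshold else 0
    return {"p_fail": p_fail, "p_success": p_success, "label": label}


# Probabilities taken from the output above.
result = interpret_proba([0.24718272, 0.75281728])
```

The columns follow the order of the model's classes, so index 0 is the failed class and index 1 is the successful class. Raising the threshold trades recall on successful projects for precision.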

In [219]:
gbm_model.booster_.save_model('lgbm_model_le.txt')

<lightgbm.basic.Booster at 0x7fc2286494d0>

#### Model Evaluation

In [220]:
# Accuracy on the different samples
print ("Accuracy on the training set: ", accuracy_score(train_dep[response], 
                                                         gbm_model.predict(train_ind[features])))
print ("Accuracy on the test set: ", accuracy_score(test_dep[response], 
                                                        gbm_model.predict(test_ind[features])))
#print ("Overall accuracy: ", accuracy_score(kick_projects_ip[response], 
                                           #  gbm_model.predict(kick_projects_ip_scaled_ftrs)))

Accuracy on the training set:  0.7138931481832435
Accuracy on the test set:  0.6963207139483232


In [221]:
# classification report
print('\nClassification metrics')
print(classification_report(y_true=test_dep[response], y_pred=test_ind["Pred_state_LGB"]))


Classification metrics
              precision    recall  f1-score   support

         0.0       0.72      0.80      0.76     59250
         1.0       0.65      0.55      0.59     40253

    accuracy                           0.70     99503
   macro avg       0.68      0.67      0.68     99503
weighted avg       0.69      0.70      0.69     99503
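
The columns of the report are related by simple formulas, which can be checked against the rounded numbers printed above. This sketch recomputes the F1 score of class 0 from its precision and recall, and the accuracy as the support-weighted average of the per-class recalls; because the inputs are already rounded to two decimals, only approximate agreement should be expected.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class 0 (failed) values from the report above.
f1_failed = f1_from(0.72, 0.80)

# Accuracy equals the recall of each class weighted by its support.
support_failed, support_success = 59250, 40253
weighted_recall = (0.80 * support_failed + 0.55 * support_success) / (
    support_failed + support_success
)
```

The gap between recall on failed projects (0.80) and on successful ones (0.55) shows that the model finds failures more reliably than successes.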



#### Feature Importances

In [222]:
## Feature importances
ftr_imp_lgb=zip(features,gbm_model.feature_importances_)

In [223]:
feature_imp_lgb=pd.DataFrame(list(zip(features,gbm_model.feature_importances_)))
column_names_lgb= ['features','LGB_imp']
feature_imp_lgb.columns= column_names_lgb

feature_imp_lgb= feature_imp_lgb.sort_values('LGB_imp',ascending=False)
feature_imp_lgb

Unnamed: 0,features,LGB_imp
2,usd_goal_real,6327
4,duration,6078
8,avg_success_rate_goal,5402
9,avg_success_rate_duration,5380
7,avg_ppb_goal,4497
6,participants_qtr,4323
10,mean_goal_year,3857
3,name_len,3356
5,launched_year,2236
0,category,1722
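
With the default `importance_type='split'`, `feature_importances_` counts how many times each feature is used in a split, so the raw numbers are easier to compare as shares. The sketch below uses the ten rows shown in the table above; the `main_category` row is not shown there, so the shares are relative to the listed rows only.

```python
# Split counts copied from the table above (main_category is not listed there).
split_counts = {
    "usd_goal_real": 6327,
    "duration": 6078,
    "avg_success_rate_goal": 5402,
    "avg_success_rate_duration": 5380,
    "avg_ppb_goal": 4497,
    "participants_qtr": 4323,
    "mean_goal_year": 3857,
    "name_len": 3356,
    "launched_year": 2236,
    "category": 1722,
}

total = sum(split_counts.values())
# Share of all listed splits that use each feature.
shares = {name: count / total for name, count in split_counts.items()}
top_feature = max(shares, key=shares.get)
```

Split counts favour features with many distinct values, so a gain-based importance may rank the features differently.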


1. avg_success_rate_goal - average ratio of pledged amount to goal for the project's category, launch year and goal bucket

2. avg_success_rate_duration - average ratio of pledged amount to goal for the project's category, launch year and duration bucket

3. duration - length of the fundraising period, in days

4. usd_goal_real - funding goal in US dollars

5. avg_ppb_goal - average pledge per backer for the project's category, launch year and goal bucket

6. mean_goal_year - average funding goal for the project's category, launch year and duration bucket

7. name_len - length of the name in characters

8. participants_qtr - number of projects launched in the same category, year, quarter and goal bucket

9. category - category

10. main_category - main category

11. launched_year - launch year
