<header>
    <h1>CA4010 - Data warehousing and Data mining</h1>
    <h2>Continuous assessment project</h2>
</header>
<p>
    For this project, we want to predict if a project submitted to 
    <a href="https://www.indiegogo.com">indiegogo.com</a> will or will not be funded.
    For this purpose, we'll use a
    <a href="https://www.kaggle.com/kingburrito666/indiegogo-project-statistics/data">
    dataset from kaggle containing one year of indiegogo projects.</a>
    The version used here is the one cleaned in parts 1, 2 and 5.
</p>
<p>
    This notebook describes and shows how we'll predict whether a project will be funded or not.
</p>

<h2>Classification</h2>
<p>Here we'll experiment with classification and check how reliable our results are. We don't expect very high scores because our dataset is poorly balanced: some projects have very high collected percentages, pledge counts or target amounts, while others have very low ones.</p>
<p>As we want to predict if a project will be funded or not, we'll use a <b>classifier</b> and because our dataset contains <b>mixed attributes</b>, we'll use a <b>Decision Tree</b>.</p>

In [63]:
import pandas as pd
from sklearn import preprocessing
from sklearn import metrics
import matplotlib.pyplot as plt
import numpy as np
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
%matplotlib inline

In [64]:
indiegogo = pd.read_csv("indiegogo_cleaned_dataset.tsv", sep='\t')

<p>First, we'll have to transform our categorical data using <b>one hot encoding</b> so that the decision tree only receives numerical attributes. This encoding will be applied to:
        <ul>
            <li>category_slug</li>
            <li>currency_code</li>
        </ul>
As <i>has_partner</i> is a boolean, we don't need to apply one hot encoding to it.
</p>
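
<p>To make the encoding step concrete, here is a minimal sketch on a toy frame. The rows, the category values and the <i>pledges_count</i> column name are assumptions made for illustration, not values taken from the real dataset. It shows that, with no <i>columns</i> argument, <i>pd.get_dummies</i> expands the text columns and leaves boolean and numeric columns as they are.</p>

```python
import pandas as pd

# Toy rows; the values and the "pledges_count" column name are made up for illustration.
toy = pd.DataFrame({
    "category_slug": ["art", "tech", "art"],
    "currency_code": ["USD", "EUR", "USD"],
    "has_partner": [True, False, True],
    "pledges_count": [12, 3, 40],
})

# With no `columns` argument, only object/string/category columns are encoded.
encoded = pd.get_dummies(toy)
print(sorted(encoded.columns))
```

<p>Each distinct value of <i>category_slug</i> and <i>currency_code</i> becomes its own indicator column, which is why the number of categories directly drives the width of the encoded dataset.</p>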

In [65]:
indiegogo = pd.get_dummies(indiegogo)

<p>As we want to predict if a project will or will not be funded, we'll extract 2 categories from our dataset using the collected_percentage values:
        <ul>
            <li><b>Funded:</b> The project is successfully funded, with a collected percentage of at least 100%</li>
            <li><b>Not funded:</b> The project is not funded (collected percentage below 100%)</li>
        </ul>
</p>
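
<p>As a small illustration of this labelling rule, the sketch below applies the same threshold to a handful of made-up percentages and then reports the share of each class. The values are assumptions chosen only to show the behaviour at the 100% boundary.</p>

```python
import pandas as pd

# Made-up percentages; 100 is the funding threshold used in this notebook.
collected_percentage = pd.Series([12.5, 100.0, 250.0, 99.9, 0.0])

labels = collected_percentage.map(lambda x: "Funded" if x >= 100 else "Not funded")

# Share of each class, useful to spot imbalance before training.
balance = labels.value_counts(normalize=True)
print(labels.tolist())
print(balance)
```

<p>Running the same <i>value_counts</i> check on the real labels is a quick way to quantify how unbalanced the two classes are.</p>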

In [66]:
y = indiegogo.collected_percentage.map(lambda x: 'Funded' if x >= 100 else 'Not funded')

In [67]:
X = indiegogo.drop(['collected_percentage'], axis=1)

<p>Now we'll normalize our data so that attributes with large values don't dominate the computation. As our dataset has a very high standard deviation for almost all its attributes, we don't want to use <b>min-max</b> normalisation, because the extreme values are not representative of our dataset. That's why we'll use <b>z-norm</b> here, which is more robust for this kind of data.</p>

In [74]:
cols = X.columns.values
for col in cols:
    X[col] = (X[col] - X[col].mean())/X[col].std(ddof=0)
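
<p>For reference, the manual z-norm above can be compared with scikit-learn's <i>StandardScaler</i>, which also divides by the population standard deviation (<i>ddof=0</i>). The sketch below uses two made-up numeric columns (the names and values are assumptions) and checks that both approaches give the same result.</p>

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up numeric columns with one extreme row, for illustration only.
toy = pd.DataFrame({
    "pledges_count": [1.0, 2.0, 3.0, 4.0, 1000.0],
    "target_amount": [500.0, 1000.0, 1500.0, 2000.0, 2500.0],
})

# Manual z-score, column by column (population standard deviation, ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

# StandardScaler also uses the population standard deviation.
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual.to_numpy(), scaled))
```

<p>One possible refinement, not applied in this notebook, would be to fit the scaler on the training sample only and reuse it on the test sample, so that no test statistics enter the training data.</p>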

<p>Now we'll split the dataset into train and test samples. The test sample will contain 30% of the entries.</p>

In [75]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
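
<p>Because the two classes are unbalanced, a variant worth knowing is a stratified split, which keeps the class proportions identical in both samples. This is a sketch on made-up data (30 funded, 70 not funded, one dummy feature), not the split used in this notebook; it relies on the <i>stratify</i> parameter of <i>train_test_split</i>.</p>

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 30 funded and 70 not funded projects with one dummy feature.
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series(["Funded"] * 30 + ["Not funded"] * 70)

# stratify=y_toy preserves the 30/70 class ratio in both samples.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```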

<p>Now we'll build the decision tree using gini as the splitting criterion. We don't expect a big difference compared with entropy, as the results are often very similar, as explained <a href="https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md">here.</a> As our data are widely spread, we're not expecting a very high score for this prediction.</p>

In [70]:
clf = DecisionTreeClassifier(random_state=0, max_depth=7, criterion="gini")
clf = clf.fit(X_train, y_train)
# Predict once and reuse the predictions for both reports
y_pred = clf.predict(X_test)
print(clf.score(X_test, y_test))  # mean accuracy on the test sample
print(metrics.classification_report(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))

0.86659794803
             precision    recall  f1-score   support

     Funded       0.77      0.73      0.75     11233
 Not funded       0.90      0.92      0.91     30483

avg / total       0.86      0.87      0.87     41716

[[ 8151  3082]
 [ 2483 28000]]
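
<p>The figures in this report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below assumes scikit-learn's convention (rows are true classes, columns are predicted classes, labels in sorted order) and also computes the accuracy of always answering "Not funded" as a reference point.</p>

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions,
# in the order ['Funded', 'Not funded'].
cm = np.array([[8151, 3082],
               [2483, 28000]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted, per class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / actual, per class

# Accuracy of always answering 'Not funded', as a reference point.
baseline = cm.sum(axis=1).max() / cm.sum()

print(round(accuracy, 4), precision.round(2), recall.round(2), round(baseline, 3))
```

<p>The majority-class baseline comes out at about 73%, so the tree's score should be read against that figure rather than against 50%.</p>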


<p>After a few tests on max_depth, we've obtained a decision tree with an accuracy of <b>86.7%</b>, which is way higher than expected. We can see that precision and recall are better on not funded projects, probably because the number of funded projects is substantially smaller than the number of not funded ones. It would be interesting to re-run the experiment with a dataset containing an equal number of both kinds of projects. We can also guess that projects with extreme values for pledges count (very high pledge count and not funded, or very low pledge count and funded) could be responsible for some of the misclassifications.</p>
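
<p>One way to build a dataset with as many projects of each kind is to undersample the larger class. The following is only a sketch on a made-up frame (the <i>label</i> and <i>pledges_count</i> columns are assumptions); it draws as many rows from each class as the smallest class contains.</p>

```python
import pandas as pd

# Made-up labelled frame: 3 funded and 7 not funded projects.
toy = pd.DataFrame({
    "pledges_count": range(10),
    "label": ["Funded"] * 3 + ["Not funded"] * 7,
})

# Size of the smallest class.
n_min = toy["label"].value_counts().min()

# Draw n_min rows from each class, without replacement.
balanced = pd.concat(
    [group.sample(n=n_min, random_state=0) for _, group in toy.groupby("label")]
)

print(balanced["label"].value_counts())
```

<p>If this were tried on the real data, it would probably make sense to undersample the training sample only and keep the test sample at its original proportions. Passing <i>class_weight="balanced"</i> to <i>DecisionTreeClassifier</i> is another option that avoids discarding rows.</p>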

<p>Now we'll test the decision tree with entropy as the splitting criterion to see if the results are equivalent.</p>

In [72]:
clf = DecisionTreeClassifier(random_state=0, max_depth=7, criterion="entropy")
clf = clf.fit(X_train, y_train)
# Predict once and reuse the predictions for both reports
y_pred = clf.predict(X_test)
print(clf.score(X_test, y_test))  # mean accuracy on the test sample
print(metrics.classification_report(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))

0.865974685972
             precision    recall  f1-score   support

     Funded       0.78      0.71      0.74     11233
 Not funded       0.90      0.92      0.91     30483

avg / total       0.86      0.87      0.86     41716

[[ 7947  3286]
 [ 2305 28178]]


<p>As expected, we reach almost the same accuracy: <b>86.6%</b> instead of <b>86.7%</b>.</p>
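
<p>To see why the two criteria tend to agree, both impurity measures can be computed by hand for a node with given class shares. This is a minimal sketch of the textbook formulas, with the entropy expressed in bits; the helper names <i>gini</i> and <i>entropy</i> are ours.</p>

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_i^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping empty classes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Class shares in the test sample above: 11233 funded out of 41716.
p_funded = 11233 / 41716
for shares in ([0.5, 0.5], [p_funded, 1 - p_funded]):
    print(shares, round(gini(shares), 3), round(entropy(shares), 3))
```

<p>Both measures are largest for an evenly mixed node and fall to zero for a pure one, so they usually rank candidate splits in a similar order.</p>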

<h2>Conclusion</h2>
<p>Our dataset was very complicated to clean as we didn't have any information about the meaning of each attribute. We had to deal with several tricky points such as:
<ul>
    <li>Remove duplicate projects</li>
    <li>Find out the meaning of the balance attribute</li>
    <li>Convert the balance to a uniform currency</li>
    <li>Aggregate the 70 categories of projects we had using an efficient strategy</li>
    <li>Filter out the attributes with only one value or with unique values</li>
</ul>
After this first cleaning, we saw that our dataset contained about 700 projects having extreme values for at least one attribute. Most of them were interesting, so we chose to keep them. Our dataset also contained mixed attributes, and one hot encoding with the 14 categories we first extracted wasn't usable, so we had to refine the aggregation a second time, more efficiently.
We also had to be very careful in our choice of normalisation algorithm, as our dataset had a very high standard deviation for almost all its numerical attributes and contained mixed attributes.
Having considered all the challenges we encountered, we didn't expect to get a good prediction score and were thinking of trying a clustering approach to classify the projects. We were very surprised to find that our decision tree reached very good precision and recall.
This project has shown us the impact of a correctly cleaned dataset and the challenges it may pose.
It has also shown us that knowing our dataset is very important, as it lets us choose the algorithms that best fit our data, and that experimenting is key too, as the results we found were often not those we were expecting.
</p>