<header>
    <h1>CA4010 - Data warehousing and Data mining</h1>
    <h2>Continuous assessment project</h2>
</header>
<p>
    For this project, we want to predict whether a project submitted to 
    <a href="https://www.indiegogo.com">indiegogo.com</a> will be funded or not.
    For this purpose, we'll use a
    <a href="https://www.kaggle.com/kingburrito666/indiegogo-project-statistics/data">
    dataset from Kaggle containing one year of Indiegogo projects.</a>
    The version used here is the concatenation of all csv files, produced with the provided 'combiner.py'.<br/>
    However, some modifications have been made to the output of combiner.py:
    <ul>
        <li>
            One row was normalized: <i>in_forever_funding</i> was set to 'True' instead of 'null' to avoid
            a type warning at loading time and to allow a boolean type for this attribute.
        </li>
        <li>
            The file was saved as tsv (tab-separated values) instead of csv for better readability.
        </li>
    </ul>
</p>
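<p>
    As a minimal sketch of why the 'null' value matters at loading time, the snippet below loads two tiny in-memory tsv samples. The <i>title</i> column and the sample rows are made up for illustration; only the <i>in_forever_funding</i> column name comes from the dataset. With pandas defaults, 'null' is read as a missing value, so the column typically cannot be loaded as a plain boolean.
</p>

```python
import io

import pandas as pd

# Hypothetical two-row samples; only the in_forever_funding column name is real.
tsv_with_null = "title\tin_forever_funding\nA\tTrue\nB\tnull\n"
tsv_normalized = "title\tin_forever_funding\nA\tTrue\nB\tTrue\n"

with_null = pd.read_csv(io.StringIO(tsv_with_null), sep="\t")
normalized = pd.read_csv(io.StringIO(tsv_normalized), sep="\t")

# 'null' is one of the default missing-value markers of read_csv,
# so the first column contains a missing value and is not loaded as bool.
print(with_null["in_forever_funding"].dtype)
print(normalized["in_forever_funding"].dtype)
```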
<p>
    This notebook describes and shows how we reduce the categories we have to a usable number.
</p>

<h2>Make category consistent</h2>
<p>We can check category_slug consistency</p>

In [1]:
import pandas as pd

In [2]:
indiegogo = pd.read_csv('indiegogo_cleaned_dataset.tsv', sep='\t')

In [3]:
print("{}, {}".format(set(indiegogo.category_slug), len(set(indiegogo.category_slug))))

{'education', 'productivity', 'music', 'dance-theater', 'tabletop_games', 'video_web', 'photography', 'gaming', 'food-beverages', 'wellness', 'community_projects', 'religion', 'health-fitness', 'human_rights', 'community', 'food_beverages', 'video_games', 'audio', 'animal-rights', 'animal_rights', 'web-series-tv-shows', 'fashion', 'energy_green_tech', 'tech_innovation', 'travel_outdoors', 'fashion-wearables', 'video-games', 'theatre', 'energy-green-tech', 'technology', 'design', 'small_business', 'culture', 'writing_publishing', 'spirituality', 'health_fitness', 'podcasts_blogs_vlogs', 'health', 'food', 'politics', 'phones_accessories', 'sports', 'camera_gear', 'fashion_wearables', 'camera-gear', 'tech-innovation', 'comic', 'environment', 'travel-outdoors', 'writing-publishing', 'creative_works', 'local-businesses', 'community-projects', 'local_businesses', 'art', 'transportation', 'home', 'dance', 'web_series_tv_shows', 'dance_theater', 'podcasts-blogs-vlogs', 'phones-accessories', 't

<p>
    <b>70</b> categories is too many: such a high number of classes for a single attribute would make the classification process inefficient. Moreover, we can see that many <b>categories are duplicated</b>, using '-' instead of '_'.
Some categories <b>can also be aggregated</b>:
    <ul>
        <li>animals and animal_rights</li>
        <li>comic and comics</li>
        <li>community and community_projects</li>
        <li>dance and dance_theater</li>
        <li>fashion and fashion_wearables</li>
        <li>food and food_beverages</li>
        <li>tabletop_games, video_games and gaming into games</li>
        <li>health, wellness and health_fitness</li>
        <li>writing and writing_publishing</li>
        <li>tech_innovation and energy_green_tech into technology</li>
    </ul>
</p>
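<p>
    The '-' versus '_' duplicates can also be listed programmatically rather than by eye. The sketch below is only an illustration: <i>hyphen_duplicates</i> is a hypothetical helper and the sample values stand in for <i>indiegogo.category_slug</i>.
</p>

```python
import pandas as pd

# Illustrative sample only; in the notebook this would be indiegogo.category_slug.
slugs = pd.Series(["food-beverages", "food_beverages", "video-games",
                   "video_games", "music", "camera-gear"])


def hyphen_duplicates(values):
    """Return the slugs containing '-' whose '_' spelling is also present."""
    unique = set(values)
    return sorted(s for s in unique if "-" in s and s.replace("-", "_") in unique)


duplicates = hyphen_duplicates(slugs)
print(duplicates)
```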
<p>Let's start by replacing every '-' with '_' to make our categories more consistent.</p>

In [4]:
indiegogo.category_slug = indiegogo.category_slug.apply(lambda x: x.replace('-', '_'))
print("{}, {}".format(set(indiegogo.category_slug), len(set(indiegogo.category_slug))))

{'education', 'productivity', 'music', 'tabletop_games', 'video_web', 'photography', 'gaming', 'wellness', 'community_projects', 'religion', 'human_rights', 'community', 'food_beverages', 'video_games', 'audio', 'animal_rights', 'fashion', 'energy_green_tech', 'tech_innovation', 'travel_outdoors', 'theatre', 'technology', 'design', 'small_business', 'writing_publishing', 'phones_accessories', 'culture', 'spirituality', 'podcasts_blogs_vlogs', 'health', 'food', 'health_fitness', 'politics', 'sports', 'camera_gear', 'fashion_wearables', 'comic', 'environment', 'creative_works', 'local_businesses', 'art', 'transportation', 'dance', 'web_series_tv_shows', 'dance_theater', 'transmedia', 'film', 'comics', 'home', 'writing', 'animals'}, 51


<p>We have removed <b>19 categories</b> with normalization alone. It's better, but <b>51 categories</b> is still a high number. Let's aggregate the related categories.</p>

In [5]:
def aggregate_categories(category):
    # Map a slug prefix to the category it is merged into.
    # Prefixes are tested in insertion order and the first match wins.
    to_aggregate = {
        "animal_": "animals",
        "community_": "community",
        "dance_": "dance",
        "fashion_": "fashion",
        "food_": "food",
        "comic": "comics",
        "video_games": "games",
        "gaming": "games",
        "tabletop_games": "games",
        "health_": "health",
        "writing_": "writing",
        "tech_": "technology",
        "energy_green_tech": "technology",
        "web_series_": "series",
        "podcasts_": "podcast",
        "wellness": "health",
    }
    for prefix, target in to_aggregate.items():
        if category.startswith(prefix):
            return target
    # Categories that match no prefix are kept unchanged.
    return category

In [6]:
indiegogo.category_slug = indiegogo.category_slug.apply(aggregate_categories)
print("{}, {}".format(set(indiegogo.category_slug), len(set(indiegogo.category_slug))))

{'education', 'productivity', 'music', 'video_web', 'photography', 'religion', 'human_rights', 'community', 'audio', 'fashion', 'travel_outdoors', 'theatre', 'technology', 'design', 'small_business', 'phones_accessories', 'culture', 'spirituality', 'health', 'food', 'podcast', 'politics', 'series', 'games', 'camera_gear', 'sports', 'environment', 'creative_works', 'local_businesses', 'art', 'transportation', 'dance', 'transmedia', 'film', 'comics', 'home', 'writing', 'animals'}, 38
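<p>
    Prefix matching is compact, but it would also catch any future slug that happens to start with one of the prefixes. As a possible alternative, not the approach used in this notebook, an explicit old-to-new mapping can be applied with <i>Series.replace</i>. The mapping below is only a subset and the sample values are illustrative.
</p>

```python
import pandas as pd

# Explicit old -> new mapping (subset shown for illustration).
EXACT_MAPPING = {
    "animal_rights": "animals",
    "video_games": "games",
    "gaming": "games",
    "tabletop_games": "games",
    "wellness": "health",
    "health_fitness": "health",
}

# Illustrative sample; in the notebook this would be indiegogo.category_slug.
sample = pd.Series(["gaming", "health_fitness", "music", "animal_rights"])

# With a dict, Series.replace swaps exact matches and leaves other values untouched.
aggregated = sample.replace(EXACT_MAPPING)
print(aggregated.tolist())
```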


<p>It's better: aggregation reduced the number of categories by <b>13</b>. However, we still have <b>38 categories left</b>, which is too many. Let's analyze the categories that remain in the list.</p>

<h2>Remove categories with the fewest projects</h2>
<p>During the previous data cleaning, we noticed that some categories contained <b>only a few projects</b>. Let's see if we can remove them and decrease the number of categories.</p>

<p>Here is the number of projects across all categories. Ideally, we'd like to end up with around <b>10 categories</b> to process without losing too much data.</p>

In [7]:
print("{} projects".format(indiegogo.shape[0]))

139052 projects


<p>Here is the detail of the different categories. We can make a big cut by keeping only the categories with more than 200 or 300 projects. As we don't know at what point this cutoff becomes harmful for our data, we'll try several cutoffs (some more aggressive than others) and see whether it's better for us to keep more projects, to keep less data, or to find another strategy to get fewer categories.</p>

In [8]:
indiegogo['category_slug'].value_counts()

film                  43239
music                 17116
local_businesses      16900
education             15052
dance                 11331
art                    7058
health                 6622
writing                4895
fashion                2879
animals                2857
environment            2411
phones_accessories     1892
games                  1755
comics                 1002
travel_outdoors         820
home                    684
photography             638
productivity            346
human_rights            340
transportation          294
camera_gear             185
audio                   170
technology              166
food                    122
community                75
series                   61
creative_works           53
culture                  34
design                   13
spirituality             12
small_business           11
podcast                   6
video_web                 5
politics                  3
religion                  2
theatre             
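<p>
    To judge how much data a target of around 10 categories would keep, one option is to look at the cumulative share of projects covered by the largest categories. The sketch below hard-codes the ten largest counts and the project total reported above; in the notebook the same Series could be taken from <i>value_counts()</i>.
</p>

```python
import pandas as pd

TOTAL_PROJECTS = 139052  # total number of projects reported by the notebook

# Ten largest categories, copied from the value_counts() output above.
top_counts = pd.Series({
    "film": 43239, "music": 17116, "local_businesses": 16900,
    "education": 15052, "dance": 11331, "art": 7058, "health": 6622,
    "writing": 4895, "fashion": 2879, "animals": 2857,
})

# Running share of all projects kept when only the k largest categories remain.
coverage = top_counts.sort_values(ascending=False).cumsum() / TOTAL_PROJECTS

for k, (name, share) in enumerate(coverage.items(), start=1):
    print("top {:2d} (down to {}): {:.1%}".format(k, name, share))
```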

<p>Now we'll apply several cutoffs, starting by removing categories with fewer than 100 projects and raising the threshold by 100 until we remove categories with fewer than 900 projects. Each result is saved in a tsv file for further analysis.</p>

In [9]:
# Count projects per category once and reuse the result for every threshold.
category_counts = indiegogo['category_slug'].value_counts()

for threshold in range(100, 1000, 100):
    # Categories with fewer than `threshold` projects are dropped.
    categories_to_remove = category_counts[category_counts < threshold].index
    cleaned_data = indiegogo[~indiegogo.category_slug.isin(categories_to_remove)]
    print("Remove categories with less than {} projects...".format(threshold))
    print("{} projects and {} categories left\n".format(cleaned_data.shape[0],
                                                      cleaned_data.category_slug.nunique()))
    outfile = "indiegogo_cleaned_dataset_{}.tsv".format(threshold)
    cleaned_data.to_csv(outfile, index=False, sep='\t')

Remove categories with less than 100 projects...
138774 projects and 24 categories left

Remove categories with less than 200 projects...
138131 projects and 20 categories left

Remove categories with less than 300 projects...
137837 projects and 19 categories left

Remove categories with less than 400 projects...
137151 projects and 17 categories left

Remove categories with less than 500 projects...
137151 projects and 17 categories left

Remove categories with less than 600 projects...
137151 projects and 17 categories left

Remove categories with less than 700 projects...
135829 projects and 15 categories left

Remove categories with less than 800 projects...
135829 projects and 15 categories left

Remove categories with less than 900 projects...
135009 projects and 14 categories left
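<p>
    Dropping rows is not the only way to get fewer categories. As one possible alternative strategy, not evaluated in this notebook, rare categories could be merged into a single catch-all class so that no project is lost. <i>lump_rare_categories</i> and the 'other' label are hypothetical names, and the sample data is illustrative.
</p>

```python
import pandas as pd


def lump_rare_categories(series, threshold, other_label="other"):
    """Replace categories with fewer than `threshold` rows by `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < threshold].index
    # Series.where keeps values where the condition holds and substitutes elsewhere.
    return series.where(~series.isin(rare), other_label)


# Illustrative sample; in the notebook this would be indiegogo.category_slug.
sample = pd.Series(["film"] * 5 + ["music"] * 3 + ["politics", "religion"])
lumped = lump_rare_categories(sample, threshold=3)
print(lumped.value_counts().to_dict())
```

<p>
    The trade-off is that the catch-all class mixes unrelated projects, so whether it helps the classifier would need to be checked against the cutoff files produced above.
</p>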

