<header>
    <h1>CA4010 - Data warehousing and Data mining</h1>
    <h2>Continuous assessment project</h2>
</header>
<p>
    For this project, we want to predict if a project submitted to 
    <a href="https://www.indiegogo.com">indiegogo.com</a> will or will not be funded.
    For this purpose, we'll use a
    <a href="https://www.kaggle.com/kingburrito666/indiegogo-project-statistics/data">
    dataset from kaggle containing one year of indiegogo projects.</a>
    The version used here is the concatenation of all csv files using the given 'combiner.py'.<br/>
    However, some modifications have been made to the output of combiner.py:
    <ul>
        <li>
            One row was normalized: <i>in_forever_funding</i> was set to 'True' instead of 'null' to avoid
            a type warning at loading time and to allow a boolean type for this attribute.
        </li>
        <li>
            The file was saved as tsv (tab-separated values) instead of csv for better readability.
        </li>
    </ul>
</p>
<p>
    This notebook describes and shows how we reduce the number of categories to a usable size.
</p>

<h2>Make category consistent</h2>
<p>We first check the consistency of <i>category_slug</i> by listing its distinct values.</p>

In [25]:
import pandas as pd

In [26]:
indiegogo = pd.read_csv('indiegogo_cleaned_dataset.tsv', sep='\t')

In [27]:
print("{}, {}".format(set(indiegogo.category_slug), len(set(indiegogo.category_slug))))

{'dance-theater', 'podcasts_blogs_vlogs', 'comics', 'energy-green-tech', 'creative-works', 'food', 'environment', 'human_rights', 'sports', 'film', 'home', 'technology', 'animal_rights', 'religion', 'phones-accessories', 'energy_green_tech', 'wellness', 'travel-outdoors', 'theatre', 'transmedia', 'community', 'fashion_wearables', 'politics', 'food_beverages', 'audio', 'photography', 'travel_outdoors', 'community-projects', 'education', 'fashion-wearables', 'writing', 'food-beverages', 'productivity', 'health-fitness', 'design', 'video-games', 'animal-rights', 'culture', 'dance_theater', 'human-rights', 'writing_publishing', 'video_web', 'transportation', 'community_projects', 'web-series-tv-shows', 'tabletop_games', 'art', 'comic', 'tabletop-games', 'camera-gear', 'animals', 'tech-innovation', 'music', 'fashion', 'tech_innovation', 'small_business', 'local_businesses', 'video_games', 'web_series_tv_shows', 'writing-publishing', 'podcasts-blogs-vlogs', 'health', 'local-businesses', 'cam

<p>
    <b>70</b> categories is too many: such a high number of classes for a single attribute would make the classification process inefficient. Moreover, we can see that many <b>categories are duplicated</b>, using '-' instead of '_'.
Some categories <b>can also be aggregated</b>:
    <ul>
        <li>animals and animal_rights</li>
        <li>comic and comics</li>
        <li>community and community_projects</li>
        <li>dance and dance_theater</li>
        <li>fashion and fashion_wearables</li>
        <li>food and food_beverages</li>
        <li>tabletop_games, video_games and gaming into games</li>
        <li>health, wellness and health_fitness</li>
        <li>writing and writing_publishing</li>
        <li>tech_innovation and energy_green_tech into technology</li>
    </ul>
</p>
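<p>The hyphen/underscore duplicates can also be found programmatically instead of by eye. The following is a minimal sketch on a toy Series; the values in <i>slugs</i> are hypothetical stand-ins for <i>indiegogo.category_slug</i>, and the variable names are ours.</p>

```python
import pandas as pd

# Toy stand-in for indiegogo.category_slug (hypothetical values).
slugs = pd.Series(["video-games", "video_games", "film",
                   "human-rights", "human_rights", "film"])

# Number of distinct raw spellings.
n_raw = slugs.nunique()

# Normalize each distinct spelling, then count how many raw spellings
# collapse onto the same normalized slug.
unique_slugs = pd.Series(slugs.unique())
normalized = unique_slugs.str.replace("-", "_", regex=False)
variant_counts = normalized.value_counts()

# Slugs that exist under more than one spelling.
duplicated = sorted(variant_counts[variant_counts > 1].index)

print(n_raw)        # 5
print(duplicated)   # ['human_rights', 'video_games']
```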
<p>Let's start by replacing all '-' by '_' to make our categories more consistent</p>

In [28]:
indiegogo.category_slug = indiegogo.category_slug.apply(lambda x: x.replace('-', '_'))
print("{}, {}".format(set(indiegogo.category_slug), len(set(indiegogo.category_slug))))

{'podcasts_blogs_vlogs', 'comics', 'food', 'environment', 'human_rights', 'sports', 'film', 'animal_rights', 'technology', 'home', 'religion', 'energy_green_tech', 'wellness', 'theatre', 'transmedia', 'community', 'fashion_wearables', 'politics', 'food_beverages', 'audio', 'photography', 'travel_outdoors', 'education', 'writing', 'productivity', 'dance_theater', 'design', 'culture', 'writing_publishing', 'video_web', 'transportation', 'community_projects', 'tabletop_games', 'art', 'comic', 'animals', 'music', 'fashion', 'tech_innovation', 'small_business', 'local_businesses', 'video_games', 'web_series_tv_shows', 'health', 'camera_gear', 'gaming', 'creative_works', 'health_fitness', 'phones_accessories', 'dance', 'spirituality'}, 51
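<p>A possible variant of the cell above, assuming the column could contain missing values: the <i>lambda</i> calls <i>x.replace</i> on every element, which fails on a missing value, whereas the vectorized <i>Series.str.replace</i> leaves missing values missing. The toy data below is hypothetical.</p>

```python
import pandas as pd

# Hypothetical slugs, including one missing value.
slugs = pd.Series(["video-games", "web-series-tv-shows", None, "film"])

# regex=False treats '-' as a literal character, not as a pattern.
normalized = slugs.str.replace("-", "_", regex=False)

# The missing value is preserved instead of raising an error.
print(normalized.isna().sum())
print(normalized.dropna().tolist())
```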


<p>Normalization alone removed <b>19 categories</b>. That's better, but <b>51 categories</b> is still a high number. Let's aggregate the related categories.</p>

In [29]:
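def aggregate_categories(category):
    """Map a category slug to its aggregated category.

    Keys are matched as prefixes: a key ending with '_' catches every
    variant of a category (e.g. 'food_' matches 'food_beverages').
    """
    to_aggregate = {
        "animal_": "animals",
        "community_": "community",
        "dance_": "dance",
        "fashion_": "fashion",
        "food_": "food",
        "comic": "comics",
        "video_games": "games",
        "gaming": "games",
        "tabletop_games": "games",
        "health_": "health",
        "writing_": "writing",
        "tech_": "technology",
        "energy_green_tech": "technology",
        "web_series_": "series",
        "podcasts_": "podcast",
        "wellness": "health",
    }
    # Return the target of the first prefix that matches.
    for prefix, target in to_aggregate.items():
        if category.startswith(prefix):
            return target
    # Categories without a matching prefix are kept unchanged.
    return category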

In [30]:
indiegogo.category_slug = indiegogo.category_slug.apply(aggregate_categories)
print("{}, {}".format(set(indiegogo.category_slug), len(set(indiegogo.category_slug))))

{'podcast', 'comics', 'food', 'environment', 'human_rights', 'sports', 'film', 'home', 'technology', 'religion', 'theatre', 'series', 'transmedia', 'community', 'games', 'politics', 'audio', 'photography', 'travel_outdoors', 'education', 'writing', 'productivity', 'design', 'culture', 'video_web', 'transportation', 'art', 'animals', 'music', 'fashion', 'small_business', 'local_businesses', 'health', 'camera_gear', 'creative_works', 'phones_accessories', 'dance', 'spirituality'}, 38
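<p>Prefix matching is compact, but a short key such as "comic" would also catch any other slug that happens to start with it. As an alternative sketch (not the method used in this notebook), an exact-match dictionary passed to <i>Series.replace</i> makes every merge explicit. The mapping below covers only some of the merges and the toy Series is hypothetical.</p>

```python
import pandas as pd

# Exact-match mapping; only part of the merges is listed, for illustration.
EXACT_MAPPING = {
    "animal_rights": "animals",
    "comic": "comics",
    "community_projects": "community",
    "dance_theater": "dance",
    "tabletop_games": "games",
    "video_games": "games",
    "gaming": "games",
    "wellness": "health",
    "health_fitness": "health",
}

# Hypothetical slugs standing in for indiegogo.category_slug.
slugs = pd.Series(["comic", "comics", "video_games", "gaming", "film", "wellness"])

# Values that are not keys of the mapping are left untouched.
aggregated = slugs.replace(EXACT_MAPPING)

print(aggregated.tolist())
# ['comics', 'comics', 'games', 'games', 'film', 'health']
```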


<p>That's better: aggregation reduced the number of categories by <b>13</b>. However, we still have <b>38 categories left</b>, which is too many. Let's analyze the categories that remain in the list.</p>

<h2>Remove the categories with the fewest projects</h2>
<p>During the previous data cleaning, we noticed that some categories contained <b>only a few projects</b>. Let's see if we can remove them and decrease the number of categories.</p>

<p>Here is the total number of projects across all categories. Ideally, we want around <b>10 categories</b> to process, without losing too much data.</p>

In [47]:
print("{} projects".format(indiegogo.shape[0]))

146318 projects


<p>Here is the breakdown by category. We could make a large cut by keeping only the categories with more than 200 or 300 projects. As we don't know at what point a cutoff starts to hurt our data, we'll try several cutoffs and see whether it is better to keep more projects, to keep fewer categories, or to find another strategy to reduce the number of categories.</p>

In [48]:
indiegogo['category_slug'].value_counts()

film                  44787
local_businesses      18418
music                 17429
education             15821
dance                 11393
art                    7273
health                 7227
writing                5073
fashion                3245
animals                2988
environment            2630
phones_accessories     2257
games                  2059
comics                 1029
travel_outdoors         968
home                    820
photography             656
productivity            425
human_rights            366
transportation          357
technology              226
audio                   209
camera_gear             205
food                    142
community                97
series                   63
creative_works           60
culture                  36
spirituality             15
design                   13
small_business           12
podcast                   6
video_web                 5
politics                  3
religion                  2
sports              
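<p>One way to read such a table when choosing a cutoff is to look at the cumulative share of projects covered by the largest categories. This is a minimal sketch with hypothetical counts, not the figures above; the 90% target is an arbitrary assumption chosen for illustration.</p>

```python
import pandas as pd

# Hypothetical counts standing in for indiegogo['category_slug'].value_counts().
counts = pd.Series({"film": 500, "music": 300, "art": 150, "food": 40, "sports": 10})

# Share of all projects covered by the k largest categories (k = 1, 2, ...).
coverage = counts.sort_values(ascending=False).cumsum() / counts.sum()

# Smallest number of categories that covers at least 90% of the projects.
k = int((coverage >= 0.9).to_numpy().argmax()) + 1

print(coverage)
print(k)  # 3
```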

<p>Now we'll apply several cutoffs, starting by removing the categories with fewer than 100 projects and raising the threshold by 100 each time, up to removing the categories with fewer than 900 projects. Each resulting dataset is saved in a tsv file for further analysis.</p>

In [49]:
# Number of projects per category, computed once.
category_counts = indiegogo['category_slug'].value_counts()

for threshold in range(100, 1000, 100):
    # Drop every project whose category has fewer than `threshold` projects.
    categories_to_remove = category_counts[category_counts < threshold].index
    cleaned_data = indiegogo[~indiegogo.category_slug.isin(categories_to_remove)]
    print("Remove categories with less than {} projects...".format(threshold))
    print("{} projects and {} categories left\n".format(cleaned_data.shape[0],
                                                      cleaned_data.category_slug.nunique()))
    outfile = "indiegogo_cleaned_categories_{}.tsv".format(threshold)
    cleaned_data.to_csv(outfile, index=False, sep='\t')

Remove categories with less than 100 projects...
146003 projects and 24 categories left

Remove categories with less than 200 projects...
145861 projects and 23 categories left

Remove categories with less than 300 projects...
145221 projects and 20 categories left

Remove categories with less than 400 projects...
144498 projects and 18 categories left

Remove categories with less than 500 projects...
144073 projects and 17 categories left

Remove categories with less than 600 projects...
144073 projects and 17 categories left

Remove categories with less than 700 projects...
143417 projects and 16 categories left

Remove categories with less than 800 projects...
143417 projects and 16 categories left

Remove categories with less than 900 projects...
142597 projects and 15 categories left
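<p>Dropping rows is not the only way to get fewer categories. As a sketch of one other possible strategy (not applied in this notebook), rare categories can be merged into a single catch-all label so that no project is lost. The function name <i>lump_rare_categories</i>, the label "other" and the toy data are all assumptions made for illustration.</p>

```python
import pandas as pd

def lump_rare_categories(slugs, threshold, other_label="other"):
    """Replace categories with fewer than `threshold` rows by `other_label`."""
    counts = slugs.value_counts()
    rare = counts[counts < threshold].index
    # Keep the slug where it is not rare, otherwise use the catch-all label.
    return slugs.where(~slugs.isin(rare), other_label)

# Hypothetical slugs: two frequent categories and two rare ones.
slugs = pd.Series(["film"] * 5 + ["music"] * 3 + ["sports"] + ["religion"])
lumped = lump_rare_categories(slugs, threshold=3)

print(lumped.value_counts().to_dict())
# {'film': 5, 'music': 3, 'other': 2}
```

<p>With this approach the number of rows is unchanged and only the number of distinct categories decreases, at the cost of a heterogeneous "other" class.</p>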

