<header>
    <h1>CA4010 - Data warehousing and Data mining</h1>
    <h2>Continuous assessment project</h2>
</header>
<p>
    For this project, we want to predict if a project submitted to 
    <a href="https://www.indiegogo.com">indiegogo.com</a> will or will not be funded.
    For this purpose, we'll use a
    <a href="https://www.kaggle.com/kingburrito666/indiegogo-project-statistics/data">
    dataset from kaggle containing one year of indiegogo projects.</a>
    The version used here is the concatenation of all CSV files, produced with the provided 'combiner.py'.<br/>
    However, the combined file was then modified in two ways:
    <ul>
        <li>
            One row was normalized: <i>in_forever_funding</i> was set to 'True' instead of 'null' to avoid a
            type warning at load time and to allow the boolean type for this attribute.
        </li>
        <li>
            The file was saved as TSV (tab-separated values) instead of CSV for better readability.
        </li>
    </ul>
</p>
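<p>
    The two modifications above can be sketched as follows. This is a minimal illustration on a hypothetical three-row sample, not the actual preprocessing that was run: the <i>title</i> column, the sample rows and the names <i>raw_csv</i>, <i>to_bool</i> and <i>tsv_buffer</i> are assumptions made for the example.
</p>

```python
import io
import pandas as pd

# Hypothetical miniature of the combined CSV; the real file has many more columns.
raw_csv = io.StringIO(
    "title,in_forever_funding\n"
    "Project A,True\n"
    "Project B,null\n"
    "Project C,False\n"
)

# Read everything as text so that 'null' is not silently converted to NaN.
df = pd.read_csv(raw_csv, dtype=str, keep_default_na=False)

# Assumption taken from the text above: the 'null' value is treated as True.
to_bool = {"True": True, "False": False, "null": True}
df["in_forever_funding"] = df["in_forever_funding"].map(to_bool).astype(bool)

# Write tab-separated values; a file path could be passed instead of a buffer.
tsv_buffer = io.StringIO()
df.to_csv(tsv_buffer, sep="\t", index=False)
print(tsv_buffer.getvalue())
```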
<p>
    This notebook describes and shows how we reduce the categories to a usable number.
</p>

<h2>Make category consistent</h2>
<p>We can check category_slug consistency</p>

In [50]:
import pandas as pd

In [51]:
indiegogo = pd.read_csv('indiegogo_cleaned_dataset.tsv', sep='\t')

In [52]:
print("{}, {}".format(set(indiegogo.category_slug), len(set(indiegogo.category_slug))))

{'animals', 'animal_rights', 'food_beverages', 'small_business', 'web-series-tv-shows', 'design', 'tabletop-games', 'phones_accessories', 'tech-innovation', 'audio', 'podcasts_blogs_vlogs', 'fashion', 'technology', 'travel-outdoors', 'fashion_wearables', 'comics', 'travel_outdoors', 'religion', 'human_rights', 'art', 'education', 'video-games', 'community-projects', 'productivity', 'theatre', 'video_games', 'community_projects', 'tech_innovation', 'camera-gear', 'transmedia', 'dance', 'local_businesses', 'writing_publishing', 'video_web', 'dance_theater', 'spirituality', 'web_series_tv_shows', 'health_fitness', 'energy-green-tech', 'gaming', 'local-businesses', 'environment', 'transportation', 'dance-theater', 'food-beverages', 'phones-accessories', 'animal-rights', 'human-rights', 'film', 'politics', 'tabletop_games', 'energy_green_tech', 'wellness', 'podcasts-blogs-vlogs', 'fashion-wearables', 'culture', 'camera_gear', 'creative_works', 'health', 'creative-works', 'home', 'sports', '

<p>
    <b>70</b> categories is too many: having so many classes for a single attribute would be inefficient during classification. Moreover, we can see that many <b>categories are duplicated</b>, with '-' used instead of '_'.
Some categories <b>can also be aggregated</b>:
    <ul>
        <li>animals and animal_rights</li>
        <li>comic and comics</li>
        <li>community and community_projects</li>
        <li>dance and dance_theater</li>
        <li>fashion and fashion_wearables</li>
        <li>food and food_beverages</li>
        <li>tabletop_games, video_games and gaming into games</li>
        <li>health, wellness and health_fitness</li>
        <li>writing and writing_publishing</li>
        <li>tech_innovation and energy_green_tech into technology</li>
    </ul>
</p>
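<p>
    To see which slugs differ only by '-' versus '_', one can group the raw values by their normalized form. The sketch below uses a small hypothetical sample in place of <i>indiegogo.category_slug</i>; the names <i>slugs</i>, <i>groups</i> and <i>duplicated</i> are illustrative only.
</p>

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample; in the notebook this would be indiegogo.category_slug.
slugs = pd.Series(
    ["video-games", "video_games", "art", "tech-innovation", "tech_innovation", "music"]
)

# Group every raw slug under its normalized spelling ('-' replaced by '_').
groups = defaultdict(set)
for slug in slugs.unique():
    groups[slug.replace("-", "_")].add(slug)

# Keep only the normalized slugs that have more than one raw spelling.
duplicated = {norm: sorted(variants) for norm, variants in groups.items() if len(variants) > 1}
print(duplicated)
```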
<p>Let's start by replacing every '-' with '_' to make our categories more consistent.</p>

In [53]:
indiegogo.category_slug = indiegogo.category_slug.apply(lambda x: x.replace('-', '_'))
print("{}, {}".format(set(indiegogo.category_slug), len(set(indiegogo.category_slug))))

{'animals', 'animal_rights', 'food_beverages', 'small_business', 'phones_accessories', 'design', 'audio', 'podcasts_blogs_vlogs', 'fashion', 'technology', 'fashion_wearables', 'travel_outdoors', 'comics', 'religion', 'human_rights', 'art', 'education', 'productivity', 'theatre', 'video_games', 'community_projects', 'tech_innovation', 'local_businesses', 'dance_theater', 'writing_publishing', 'transmedia', 'dance', 'video_web', 'spirituality', 'web_series_tv_shows', 'health_fitness', 'gaming', 'environment', 'transportation', 'politics', 'film', 'energy_green_tech', 'tabletop_games', 'wellness', 'culture', 'camera_gear', 'creative_works', 'health', 'home', 'sports', 'comic', 'food', 'music', 'photography', 'community', 'writing'}, 51


<p>We have removed <b>19 categories</b> through normalization alone. That is better, but <b>51 categories</b> is still a high number. Let's aggregate the related categories.</p>

In [54]:
to_aggregate = {
        "animal_": "animals",
        "community_": "community",
        "dance_": "dance",
        "fashion_": "fashion",
        "food_": "food",
        "comic": "comics",
        "video_games": "games",
        "gaming": "games",
        "tabletop_games": "games",
        "health_": "health",
        "writing_": "writing",
        "tech_": "technology",
        "energy_green_tech": "technology",
        "web_series_": "series",
        "podcasts_": "podcast"
    }

In [55]:
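def aggregate_categories(category):
    # Return the target of the first to_aggregate key that is a prefix of the category.
    # to_aggregate is read as a global at call time.
    for key, target in to_aggregate.items():
        if category.startswith(key):
            return target
    # No key matched: keep the category unchanged.
    return category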
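
<p>
    Because the matching is done by prefix, a key can capture more slugs than intended, so it can help to list what each rule captures before applying it to the dataset. The following is a minimal sketch on a hypothetical subset of the rules and categories; <i>prefix_rules</i>, <i>aggregate</i> and <i>captured</i> are illustrative names that are not part of the notebook.
</p>

```python
# Hypothetical subset of the prefix rules and of the normalized categories.
prefix_rules = {"animal_": "animals", "comic": "comics", "tech_": "technology"}
categories = ["animals", "animal_rights", "comic", "comics", "tech_innovation", "technology"]


def aggregate(category, rules):
    """Return the target of the first rule whose prefix matches, else the category."""
    for prefix, target in rules.items():
        if category.startswith(prefix):
            return target
    return category


# For each target, collect the categories that a rule actually renames.
captured = {}
for category in categories:
    target = aggregate(category, prefix_rules)
    if target != category:
        captured.setdefault(target, []).append(category)

print(captured)
```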

In [56]:
indiegogo.category_slug = indiegogo.category_slug.apply(aggregate_categories)
print("{}, {}".format(set(indiegogo.category_slug), len(set(indiegogo.category_slug))))

{'animals', 'small_business', 'phones_accessories', 'design', 'audio', 'fashion', 'technology', 'travel_outdoors', 'comics', 'religion', 'human_rights', 'art', 'education', 'productivity', 'series', 'theatre', 'local_businesses', 'transmedia', 'dance', 'video_web', 'spirituality', 'podcast', 'environment', 'transportation', 'politics', 'film', 'wellness', 'culture', 'camera_gear', 'creative_works', 'health', 'home', 'sports', 'games', 'food', 'music', 'photography', 'community', 'writing'}, 39


<p>That is better: aggregation reduced the number of categories by <b>12</b>. However, we still have <b>39 categories left</b>, which is still too many. Let's generalize the remaining categories using the super-categories used by indiegogo:
    <ul>
        <li>community_projects</li>
        <li>tech_innovation</li>
        <li>creative_works</li>
    </ul>
</p>
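<p>
    When every remaining category is mapped by its full name, an exact-match lookup is a possible alternative to prefix matching, and it makes unmapped values easy to spot. The sketch below assumes a hypothetical excerpt of the mapping and a made-up slug, <i>filmmaking_gear</i>, that a prefix rule on "film" would have captured; <i>super_categories</i>, <i>mapped</i>, <i>unmapped</i> and <i>result</i> are illustrative names.
</p>

```python
import pandas as pd

# Hypothetical excerpt of a category -> super-category mapping.
super_categories = {
    "film": "creative_works",
    "audio": "tech_innovation",
    "education": "community_projects",
}

# Hypothetical sample; 'filmmaking_gear' is a made-up slug used to show the difference.
slugs = pd.Series(["film", "audio", "education", "creative_works", "filmmaking_gear"])

# Series.map with a dict does exact matching; values without an entry become NaN.
mapped = slugs.map(super_categories)

# Slugs that have no entry in the mapping, for manual review.
unmapped = sorted(slugs[mapped.isna()].unique())

# Keep the original slug wherever no mapping exists.
result = mapped.fillna(slugs)
print(unmapped)
print(result.tolist())
```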

In [57]:
to_aggregate = {
        "dance": "creative_works",
        "writing": "creative_works",
        "theatre": "creative_works",
        "photography": "creative_works",
        "transmedia": "creative_works",
        "design": "creative_works",
        "comics": "creative_works",
        "film": "creative_works",
        "art": "creative_works",
        "podcast": "creative_works",
        "series": "creative_works",
        "games": "creative_works",
        "music": "creative_works",
        "video_web": "creative_works",
        "camera_gear": "tech_innovation",
        "phones_accessories": "tech_innovation",
        "audio": "tech_innovation",
        "technology": "tech_innovation",
        "sports": "tech_innovation",
        "fashion": "tech_innovation",
        "food": "tech_innovation",
        "health": "tech_innovation",
        "home": "tech_innovation",
        "productivity": "tech_innovation",
        "transportation": "tech_innovation",
        "travel_outdoors": "tech_innovation",
        "small_business": "community_projects",
        "culture": "community_projects",
        "animals": "community_projects",
        "education": "community_projects",
        "environment": "community_projects",
        "human_rights": "community_projects",
        "local_businesses": "community_projects",
        "spirituality": "community_projects",
        "wellness": "community_projects",
        "religion": "community_projects",
        "community": "community_projects",
        "politics": "community_projects"
    }
indiegogo.category_slug = indiegogo.category_slug.apply(aggregate_categories)
print("{}, {}".format(set(indiegogo.category_slug), len(set(indiegogo.category_slug))))

{'creative_works', 'tech_innovation', 'community_projects'}, 3


<p>Now we can save the dataset containing our cleaned categories.</p>

In [58]:
indiegogo.to_csv("indiegogo_cleaned_dataset.tsv", index=False, sep='\t')