<header>
    <h1>CA4010 - Data warehousing and Data mining</h1>
    <h2>Continuous assessment project</h2>
</header>
<p>
    For this project, we want to predict if a project submitted to 
    <a href="https://www.indiegogo.com">indiegogo.com</a> will or will not be funded.
    For this purpose, we'll use a
    <a href="https://www.kaggle.com/kingburrito666/indiegogo-project-statistics/data">
    dataset from kaggle containing one year of indiegogo projects.</a>
    The version used here is the concatenation of all csv files, produced with the provided 'combiner.py'.<br/>
    However, the output of combiner.py was modified in two ways:
    <ul>
        <li>
            One row was normalized: <i>in_forever_funding</i> was set to 'True' instead of 'null' to avoid
            a type warning at loading time and to allow a boolean type for this attribute.
        </li>
        <li>
            The file was saved as tsv (tab-separated values) instead of csv for better readability.
        </li>
    </ul>
</p>
<p>
    This notebook describes and shows how we clean this dataset and which attributes
    of the dataset we will use in all our further analysis.
</p>

<h2>Overview of the dataset</h2>
<p>
    We'll first detail the dataset as it is and analyze it.
</p>

In [1]:
import pandas as pd

In [2]:
infile = 'indiegogo_raw_data.tsv'
dataset = pd.read_csv(infile, sep='\t')
dataset.shape

(1720022, 21)

<p>
    We can see that the dataset contains <b>21 attributes</b>, which is far too many.
    We'll have to remove the less interesting ones and keep between <b>6 and 8 attributes</b>.
    It also contains more than <b>1.7 million rows</b>, which seems to be a very high number of
    projects for only one year.
</p>

In [3]:
dataset[:5]

Unnamed: 0,id,title,nearest_five_percent,tagline,cached_collected_pledges_count,igg_image_url,compressed_image_url,balance,currency_code,amt_time_left,...,category_url,category_name,category_slug,card_type,collected_percentage,partner_name,in_forever_funding,friend_contributors,friend_team_members,source_url
0,773971,"Lasers, Lasers everywhere! Then I can make som...",100,Down The Rabbit Hole is wanting to expand into...,43,https://c1.iggcdn.com/indiegogo-media-prod-cld...,https://c1.iggcdn.com/indiegogo-media-prod-cld...,£520,GBP,No time left,...,/explore/small_business,Small Business,small_business,project,104%,,False,[],[],https://www.indiegogo.com/explore#/browse/landing
1,1820996,Support Kanekta!,100,"Support the creation of Kanekta, every $1 you ...",11,https://c1.iggcdn.com/indiegogo-media-prod-cld...,https://c1.iggcdn.com/indiegogo-media-prod-cld...,$511,CAD,No time left,...,/explore/small_business,Small Business,small_business,project,102%,,False,[],[],https://www.indiegogo.com/explore#/browse/landing
2,335364,Maise Designs at PCC!,100,"I've found myself, due to family sicknesses, n...",6,https://c1.iggcdn.com/indiegogo-media-prod-cld...,https://c1.iggcdn.com/indiegogo-media-prod-cld...,$510,USD,No time left,...,/explore/small_business,Small Business,small_business,project,102%,,False,[],[],https://www.indiegogo.com/explore#/browse/landing
3,84385,Moved by Design,100,Moved by Design offers a holistic service that...,100,https://c1.iggcdn.com/indiegogo-media-prod-cld...,https://c1.iggcdn.com/indiegogo-media-prod-cld...,$507,USD,No time left,...,/explore/small_business,Small Business,small_business,project,101%,,False,[],[],https://www.indiegogo.com/explore#/browse/landing
4,613091,Help Local Artisans Become Official Non-Profit...,100,Help our group of artisans obtain legal status...,15,https://c1.iggcdn.com/indiegogo-media-prod-cld...,https://c1.iggcdn.com/indiegogo-media-prod-cld...,$505,USD,No time left,...,/explore/small_business,Small Business,small_business,project,101%,,False,[],[],https://www.indiegogo.com/explore#/browse/landing


<h2>Removing unneeded projects</h2>
<p>
As we want to know whether a project will be funded or not, some projects in this dataset are not relevant to us. We will remove all projects which are:
    <ul>
        <li>In forever funding</li>
        <li>Still running</li>
        <li>Duplicated</li>
    </ul>
</p>
<p>We first remove <b>forever funding projects</b> which are not relevant for our needs</p>

In [4]:
dataset = dataset[dataset.in_forever_funding != True]
dataset.shape

(1553639, 21)

<p>We're not interested in the funding percentage of projects which are <b>still running</b>.</p>

In [5]:
dataset = dataset[dataset.amt_time_left == 'No time left']
dataset.shape

(1407807, 21)
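<p>
    The two filters above can also be written as a single boolean mask. The sketch below runs on a small
    hypothetical frame (the ids and the '3 days left' value are made up for illustration) and is only meant
    to show the mechanics, not to replace the cells above.
</p>

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'in_forever_funding': [False, True, False, False],
    'amt_time_left': ['No time left', 'No time left', '3 days left', 'No time left'],
})

# One mask per criterion, mirroring cells In [4] and In [5].
not_forever = toy['in_forever_funding'] != True
finished = toy['amt_time_left'] == 'No time left'

# Keep only the rows that satisfy both criteria.
kept = toy[not_forever & finished]
print(kept['id'].tolist())
```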

<p>Now that we have removed irrelevant projects, let's check if this dataset contains <b>duplicated projects</b></p>

In [6]:
dataset.shape[0] - len(set(dataset['id']))

1261489

<p>
    This dataset contains more than <b>1.2 million duplicated rows</b>,
    that is to say the <b>majority of this dataset</b> consists of duplicated projects!<br/>
    Let's remove the duplicated projects.
</p>

In [7]:
dataset = dataset.drop_duplicates(subset=['id'], keep='last')
dataset.shape

(146318, 21)
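<p>
    As a small illustration of the duplicate count and removal above, the sketch below uses a hypothetical
    toy frame. It shows that <i>duplicated</i> gives the same count as the set-based subtraction, and that
    keep='last' retains the last row seen for each id. Whether the last row is also the most recent snapshot
    of a project depends on the order of the combined file, which is an assumption here.
</p>

```python
import pandas as pd

# Hypothetical toy data: project 1 appears three times, project 2 twice.
toy = pd.DataFrame({
    'id': [1, 1, 2, 3, 2, 1],
    'collected_percentage': ['10%', '50%', '20%', '90%', '25%', '104%'],
})

# Rows whose id has already been seen earlier in the frame.
n_duplicates = toy.duplicated(subset=['id']).sum()

# Same quantity computed as in cell In [6].
n_duplicates_via_set = toy.shape[0] - len(set(toy['id']))

# keep='last' retains the final occurrence of each id.
deduplicated = toy.drop_duplicates(subset=['id'], keep='last')
print(n_duplicates, n_duplicates_via_set, deduplicated.shape)
```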

<h2>Cleaning data</h2>
<p>
    Some values have a numerical part and a string part.
    As they are not usable as they are, we want to transform them into <b>numerical values</b> and get rid of the string part.
    This transformation concerns:
    <ul>
        <li>balance</li>
        <li>collected percentage</li>
    </ul>
</p>

In [8]:
dataset.balance = dataset.balance.apply(lambda x: ''.join(c for c in x if c.isdigit()))
dataset.balance = dataset.balance.apply(pd.to_numeric)
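<p>
    The same digit extraction can be sketched with pandas string methods instead of a Python-level lambda.
    The sample values below are hypothetical (only the '£520' and '$511' shapes appear in the preview above),
    and this is one possible alternative rather than the method used in this notebook.
</p>

```python
import pandas as pd

# Hypothetical balance strings with a currency symbol and a thousands separator.
balance = pd.Series(['£520', '$511', '$1,250', '€30'])

# Remove every non-digit character in one vectorized call.
# Caveat: as with the lambda above, a decimal point would also be removed.
digits_only = balance.str.replace(r'\D', '', regex=True)

# Convert the remaining digit strings to numbers.
as_numbers = pd.to_numeric(digits_only)
print(as_numbers.tolist())
```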

<p>
    The balance is expressed in several currencies, as the currency_code values below show.
    For consistency, we will convert all currencies to <b>USD</b>.
</p>

In [9]:
set(dataset.currency_code)

{'AUD', 'CAD', 'EUR', 'GBP', 'USD'}

In [10]:
change_currencies = {'AUD': 0.78, 'CAD': 0.80, 'EUR': 1.18, 'GBP': 1.33, 'USD': 1}
dataset.balance = dataset.apply(lambda row: change_currencies[row['currency_code']] * row['balance'], axis=1)
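<p>
    The conversion can also be sketched without a row-wise <i>apply</i>, by mapping each currency code to its
    rate and multiplying the two columns. The toy amounts are hypothetical and the rates are the ones defined
    above. Note one difference in behaviour: <i>map</i> yields NaN for a code missing from the dictionary,
    whereas the dictionary lookup in the lambda would raise a KeyError.
</p>

```python
import pandas as pd

# Rates copied from the cell above (value of one unit in USD).
change_currencies = {'AUD': 0.78, 'CAD': 0.80, 'EUR': 1.18, 'GBP': 1.33, 'USD': 1}

# Hypothetical toy balances.
toy = pd.DataFrame({
    'balance': [100, 200, 50],
    'currency_code': ['USD', 'GBP', 'EUR'],
})

# Look up the rate of each row, then multiply column by column.
rate = toy['currency_code'].map(change_currencies)
toy['balance_usd'] = toy['balance'] * rate
print(toy)
```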

<p>Then we convert percentage values removing the '%' sign.</p>

In [11]:
dataset.collected_percentage = dataset.collected_percentage.apply(lambda x: ''.join(c for c in x if c.isdigit()))
dataset.collected_percentage = dataset.collected_percentage.apply(lambda x: pd.to_numeric(x, downcast='float'))

<p>
    Some other values are not usable as they are because they are too complex, such as:
    <ul>
        <li>title</li>
        <li>tagline</li>
        <li>partner_name</li>
    </ul>
</p>
<p>
    <b>Title</b> and <b>tagline</b> are still very interesting values which may
    influence the success of a campaign.
    We can use the <b>number of characters</b> of each one to see whether a short or long title (or tagline)
    has an impact on the success of a campaign.
</p>

In [12]:
dataset.title = dataset.title.apply(lambda x: len(str(x)))

In [13]:
dataset.tagline = dataset.tagline.apply(lambda x: len(str(x)))
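<p>
    One detail worth checking with this approach is how missing values are counted: <i>str(x)</i> turns a
    missing value into the text 'nan', which has length 3. The sketch below, on a hypothetical series,
    compares this with the <i>.str.len()</i> accessor, which leaves missing values missing so that they can
    be given an explicit length of 0. Whether the dataset actually contains missing titles or taglines is
    not verified here.
</p>

```python
import pandas as pd

# Hypothetical titles, one of them missing.
titles = pd.Series(['Support Kanekta!', float('nan'), 'Moved by Design'])

# Approach used in the cells above: the missing value becomes the string 'nan'.
via_str = titles.apply(lambda x: len(str(x)))

# Alternative: keep the value missing, then choose its length explicitly.
via_accessor = titles.str.len().fillna(0).astype(int)

print(via_str.tolist(), via_accessor.tolist())
```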

<p>
    <b>Partner name</b> is not usable as it is, because searching for similar partner_name values won't
    be of great help for our analysis. However, knowing whether having a partner helps a campaign to succeed is
    much more useful. Let's convert the partner_name values to <b>boolean values</b>:
    <ul>
        <li><b>True</b> if the campaign owner has one or more partners</li>
        <li><b>False</b> otherwise</li>
    </ul>
</p>

In [14]:
dataset.partner_name = dataset.partner_name.apply(lambda x: False if x == 'null' else True)
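<p>
    Depending on how the file is loaded, an absent partner may show up either as the literal string 'null'
    or as a missing value, since <i>pandas.read_csv</i> treats 'null' as a missing-value marker by default.
    The sketch below is a defensive variant that handles both cases. The partner name in it is hypothetical.
</p>

```python
import pandas as pd

# Hypothetical values: the string 'null', a missing value, and a real partner name.
partner = pd.Series(['null', float('nan'), 'Some Partner'])

# A campaign has a partner only if the value is present and is not the 'null' placeholder.
has_partner = partner.notna() & (partner != 'null')
print(has_partner.tolist())
```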

<h2>Cleaning attributes</h2>
<p>
    After the changes we've made, we'll rename some attributes for consistency:
    <ul>
        <li>
            Because it now holds boolean values, <b>partner_name</b> will become <b>has_partner</b>
        </li>
        <li>
            <b>Title</b> and <b>tagline</b> now hold their length instead of a text,
            and will be renamed to <b>title_len</b> and <b>tagline_len</b>
        </li>
        <li>
            The name <b>cached_collected_pledges_count</b> is too long and will be
            simplified to <b>pledges_count</b>
        </li>
    </ul>
</p>

In [15]:
dataset = dataset.rename(columns={'cached_collected_pledges_count': 'pledges_count',
                        'partner_name': 'has_partner', 'title': 'title_len',
                        'tagline': 'tagline_len'})

<p>
    Many of the remaining attributes are of no use for our analysis, so we'll drop them.
    We want to keep the <b>7 following attributes</b>:
    <ul>
        <li><i>title_len:</i> the number of characters in the title</li>
        <li><i>tagline_len:</i> the number of characters in the tagline</li>
        <li><i>pledges_count:</i> the number of pledges (payment promises) made to the campaign</li>
        <li><i>balance:</i> the amount required</li>
        <li><i>category_slug:</i> the category of the campaign</li>
        <li><i>collected_percentage:</i> the percentage of raised funds</li>
        <li><i>has_partner:</i> whether the campaign has a partnership</li>
    </ul>
</p>

In [16]:
cols_to_drop = ['nearest_five_percent', 'igg_image_url', 'compressed_image_url', 'url',
                'category_url', 'category_name', 'card_type', 'amt_time_left',
                'in_forever_funding', 'friend_contributors', 'friend_team_members', 'source_url', 'id',
                'currency_code']
dataset.drop(cols_to_drop, inplace=True, axis=1)

<p>Now our dataset looks like this:</p>

In [17]:
dataset[:5]

Unnamed: 0,title_len,tagline_len,pledges_count,balance,category_slug,collected_percentage,has_partner
2505,20,65,311,9045.0,video_web,181.0,False
3945,47,99,62,30214.0,technology,1178.0,False
46157,43,93,6704,571527.0,technology,2858.0,True
52031,15,19,1,1000.0,video_web,40.0,False
52599,6,97,29,11779.0,technology,39.0,False


<h2>Make category consistent</h2>
<p>We can check category_slug consistency</p>

In [18]:
print("{}, {}".format(set(dataset.category_slug), len(set(dataset.category_slug))))

{'theatre', 'photography', 'tabletop_games', 'energy_green_tech', 'dance-theater', 'phones-accessories', 'health-fitness', 'community_projects', 'religion', 'animal_rights', 'health_fitness', 'video-games', 'web_series_tv_shows', 'creative_works', 'tech_innovation', 'audio', 'animal-rights', 'human-rights', 'video_web', 'tabletop-games', 'health', 'camera-gear', 'video_games', 'travel_outdoors', 'home', 'human_rights', 'food_beverages', 'food', 'culture', 'politics', 'transportation', 'travel-outdoors', 'writing', 'podcasts-blogs-vlogs', 'community', 'music', 'sports', 'dance_theater', 'productivity', 'environment', 'phones_accessories', 'wellness', 'tech-innovation', 'fashion', 'food-beverages', 'local_businesses', 'fashion_wearables', 'animals', 'energy-green-tech', 'dance', 'gaming', 'writing-publishing', 'fashion-wearables', 'art', 'web-series-tv-shows', 'technology', 'film', 'education', 'local-businesses', 'comic', 'creative-works', 'transmedia', 'podcasts_blogs_vlogs', 'spiritua

<p>
    There are far too many categories. Moreover, we can see that many <b>categories are duplicated</b>, using '-' instead of '_', and that some categories <b>can be aggregated</b>:
    <ul>
        <li>animals and animal_rights</li>
        <li>comic and comics</li>
        <li>community and community_projects</li>
        <li>dance and dance_theater</li>
        <li>fashion and fashion_wearables</li>
        <li>food and food_beverages</li>
        <li>tabletop_games, video_games and gaming with games</li>
        <li>health and health_fitness</li>
        <li>writing and writing_publishing</li>
        <li>tech_innovation and energy_green_tech with technology</li>
    </ul>
</p>
<p>Let's start by replacing every '-' with '_' to make our categories more consistent.</p>

In [19]:
dataset.category_slug = dataset.category_slug.apply(lambda x: x.replace('-', '_'))

In [20]:
print("{}, {}".format(set(dataset.category_slug), len(set(dataset.category_slug))))

{'theatre', 'photography', 'tabletop_games', 'energy_green_tech', 'community_projects', 'religion', 'animal_rights', 'web_series_tv_shows', 'health_fitness', 'creative_works', 'tech_innovation', 'audio', 'video_web', 'health', 'video_games', 'travel_outdoors', 'home', 'human_rights', 'food_beverages', 'food', 'culture', 'politics', 'dance_theater', 'transportation', 'writing', 'community', 'music', 'sports', 'productivity', 'environment', 'phones_accessories', 'wellness', 'fashion', 'local_businesses', 'fashion_wearables', 'animals', 'dance', 'gaming', 'art', 'technology', 'film', 'education', 'podcasts_blogs_vlogs', 'comic', 'transmedia', 'spirituality', 'camera_gear', 'writing_publishing', 'comics', 'small_business', 'design'}, 51


<p>Let's aggregate the related categories, and shorten the web_series_tv_shows and podcasts_blogs_vlogs slugs at the same time.</p>

In [26]:
def aggregate_categories(category):
    """Return the aggregated name of a category slug, matching on slug prefixes."""
    # Prefix of the original slug -> aggregated category name.
    to_aggregate = {
        "animal_": "animals",
        "community_": "community",
        "dance_": "dance",
        "fashion_": "fashion",
        "food_": "food",
        "comic": "comics",
        "video_games": "games",
        "gaming": "games",
        "tabletop_games": "games",
        "health_": "health",
        "writing_": "writing",
        "tech_": "technology",
        "energy_green_tech": "technology",
        "web_series_": "series",
        "podcasts_": "podcast",
    }
    for prefix, aggregated in to_aggregate.items():
        if category.startswith(prefix):
            return aggregated
    # Slugs without a matching prefix are kept unchanged.
    return category

In [27]:
dataset.category_slug = dataset.category_slug.apply(aggregate_categories)

In [28]:
print("{}, {}".format(set(dataset.category_slug), len(set(dataset.category_slug))))

{'theatre', 'photography', 'series', 'religion', 'creative_works', 'games', 'audio', 'video_web', 'health', 'travel_outdoors', 'home', 'human_rights', 'food', 'culture', 'politics', 'podcast', 'transportation', 'writing', 'community', 'music', 'sports', 'productivity', 'environment', 'phones_accessories', 'wellness', 'fashion', 'local_businesses', 'animals', 'dance', 'art', 'technology', 'film', 'education', 'transmedia', 'spirituality', 'camera_gear', 'comics', 'small_business', 'design'}, 39
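<p>
    Prefix matching is compact, but a prefix could also catch a slug it was not meant for. A stricter
    alternative, sketched below, is an exact-match dictionary applied with <i>map</i>, falling back to the
    original slug when there is no entry. The dictionary is derived from the slugs printed above. It is an
    illustration of the idea, not the method used in this notebook.
</p>

```python
import pandas as pd

# Exact slug -> aggregated name, derived from the 51 slugs listed above.
exact_mapping = {
    'animal_rights': 'animals',
    'community_projects': 'community',
    'dance_theater': 'dance',
    'fashion_wearables': 'fashion',
    'food_beverages': 'food',
    'comic': 'comics',
    'video_games': 'games',
    'gaming': 'games',
    'tabletop_games': 'games',
    'health_fitness': 'health',
    'writing_publishing': 'writing',
    'tech_innovation': 'technology',
    'energy_green_tech': 'technology',
    'web_series_tv_shows': 'series',
    'podcasts_blogs_vlogs': 'podcast',
}

# A few slugs taken from the output above.
slugs = pd.Series(['animal_rights', 'comic', 'video_web', 'tech_innovation', 'technology'])

# Slugs missing from the dictionary map to NaN, so fall back to the original value.
aggregated = slugs.map(exact_mapping).fillna(slugs)
print(aggregated.tolist())
```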


<p>Now our dataset looks like this:</p>

In [29]:
dataset[:5]

Unnamed: 0,title_len,tagline_len,pledges_count,balance,category_slug,collected_percentage,has_partner
2505,20,65,311,9045.0,video_web,181.0,False
3945,47,99,62,30214.0,technology,1178.0,False
46157,43,93,6704,571527.0,technology,2858.0,True
52031,15,19,1,1000.0,video_web,40.0,False
52599,6,97,29,11779.0,technology,39.0,False


<p>We save it as a tsv file, which we will use for all our analyses.</p>

In [30]:
dataset.to_csv('indiegogo_cleaned_dataset.tsv', index=False, sep='\t')
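<p>
    A quick way to check that the saved file can be read back with the expected types is a round trip
    through <i>to_csv</i> and <i>read_csv</i>. The sketch below uses an in-memory buffer instead of a file
    and two rows copied from the preview above, restricted to four columns for brevity.
</p>

```python
import io

import pandas as pd

# Two rows from the preview above, restricted to four of the seven columns.
cleaned = pd.DataFrame({
    'title_len': [20, 43],
    'balance': [9045.0, 571527.0],
    'category_slug': ['video_web', 'technology'],
    'has_partner': [False, True],
})

# Write as tab-separated text without the index, as in the cell above.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False, sep='\t')

# Read the text back with the same separator.
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep='\t')
print(reloaded.dtypes)
```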