# Analyzing Startup Fundraising Deals from Crunchbase

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

The data set of investments we'll be exploring is current as of October 2013. You can download it from [GitHub](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv).

Throughout this project, we'll practice working with different memory constraints. For the initial step, we will assume that we only have 10 MB of available memory.

In [17]:
import pandas as pd
import numpy as np

Let's start by counting the missing values in each column, reading the file in chunks of 5,000 rows:

In [18]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

missing = []

for chunk in chunk_iter:
    missing.append(chunk.isnull().sum())
    
combined = pd.concat(missing)
grouped = combined.groupby(combined.index).sum().sort_values()

print(grouped)

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64
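Raw counts are easier to judge as a share of all rows. The sketch below shows one way to get that share while still reading in chunks. The CSV content and the names `toy_csv`, `missing_counts`, `total_rows`, and `missing_share` are made up for illustration and are not part of the Crunchbase data.

```python
import io
import pandas as pd

# Toy stand-in for the real CSV (hypothetical values, for illustration only).
toy_csv = io.StringIO(
    "company_name,investor_city,raised_amount_usd\n"
    "A,,100\n"
    "B,Boston,\n"
    "C,,300\n"
    "D,,400\n"
)

missing_counts = []
total_rows = 0
for toy_chunk in pd.read_csv(toy_csv, chunksize=2):
    missing_counts.append(toy_chunk.isnull().sum())
    total_rows += len(toy_chunk)

# Sum the per-chunk counts column by column, then divide by the row count.
missing_share = (pd.concat(missing_counts, axis=1).sum(axis=1) / total_rows).sort_values()
print(missing_share)
```

Pointing `pd.read_csv` at the real file (with the same `chunksize` and `encoding` used above) should give the equivalent shares for the full data set.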


Now let's look at memory footprints for each column:

In [19]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
test_chunk = next(chunk_iter)
cols = test_chunk.columns

chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
memory = pd.Series(0, index=cols)

for chunk in chunk_iter:
    memory += chunk.memory_usage(index=False, deep=True)
    
print(memory)

company_permalink         4057788
company_name              3591326
company_category_code     3421104
company_country_code      3172176
company_state_code        3106051
company_region            3411585
company_city              3505926
investor_permalink        4980548
investor_name             3915666
investor_category_code     622424
investor_country_code     2647292
investor_state_code       2476607
investor_region           3396281
investor_city             2885083
funding_round_type        3410707
funded_at                 3542185
funded_month              3383584
funded_quarter            3383584
funded_year                422960
raised_amount_usd          422960
dtype: int64


In [20]:
memory.sum() / (1024 ** 2) # Total memory

56.9876070022583
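As a rough sanity check against the 10 MB budget, the figures printed above can be turned into an average footprint per chunk. This is back-of-the-envelope arithmetic only: it assumes raised_amount_usd takes 8 bytes per row (a float64 column), and it ignores the index and any temporary memory pandas needs while parsing.

```python
# Figures copied from the outputs above.
total_bytes = 56.9876070022583 * 1024 ** 2
raised_amount_bytes = 422960  # assumed float64: 8 bytes per row
chunksize = 5000

n_rows = raised_amount_bytes // 8
n_chunks = -(-n_rows // chunksize)  # ceiling division
mb_per_full_chunk = total_bytes / n_rows * chunksize / 1024 ** 2

print(n_rows, n_chunks, round(mb_per_full_chunk, 2))
```

That works out to 52,870 rows in 11 chunks, at roughly 5.4 MB per full chunk on average, which suggests a chunk size of 5,000 leaves some headroom under 10 MB.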

Next, let's identify columns that aren't useful for analysis. Two kinds stand out: columns that are almost entirely missing, such as investor_category_code (50,427 missing values), and columns that only hold URL paths, such as investor_permalink and company_permalink.

In [21]:
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']
# Use the column index from the sample chunk rather than a leftover loop variable
keep_cols = test_chunk.columns.drop(drop_cols)

## Selecting Data Types

Now let's get familiar with the column types.

In [22]:
columns = {}
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

for chunk in chunk_iter:
    for col in chunk.columns:
        # Record every dtype pandas infers for this column across chunks
        columns.setdefault(col, set()).add(str(chunk.dtypes[col]))
            
columns

{'company_permalink': {'object'},
 'company_name': {'object'},
 'company_category_code': {'object'},
 'company_country_code': {'object'},
 'company_state_code': {'object'},
 'company_region': {'object'},
 'company_city': {'object'},
 'investor_permalink': {'object'},
 'investor_name': {'object'},
 'investor_category_code': {'float64', 'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_state_code': {'float64', 'object'},
 'investor_region': {'object'},
 'investor_city': {'float64', 'object'},
 'funding_round_type': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'raised_amount_usd': {'float64'}}
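A likely explanation for the columns that show both float64 and a text dtype is that pandas infers types separately for each chunk, and a chunk in which a column is entirely empty is read as float64 (all NaN). The toy example below illustrates the effect; the CSV content and the names `toy_csv` and `dtypes_seen` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical data: investor_city is empty in the first two rows only.
toy_csv = io.StringIO(
    "investor_name,investor_city\n"
    "Ann,\n"
    "Bob,\n"
    "Cat,Boston\n"
    "Dan,Austin\n"
)

dtypes_seen = set()
for toy_chunk in pd.read_csv(toy_csv, chunksize=2):
    # The first chunk has no city values at all; the second has only strings.
    dtypes_seen.add(str(toy_chunk['investor_city'].dtype))

print(dtypes_seen)
```

Passing an explicit `dtype` to `pd.read_csv` for such columns is one way to keep the type consistent from chunk to chunk.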

As the output shows, several columns have mixed types across chunks and will need cleaning. We should also identify which columns can be represented with more space-efficient types. Let's take a quick look at some sample data from the last chunk:

In [23]:
chunk.head()

Unnamed: 0,company_permalink,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_permalink,investor_name,investor_category_code,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
50000,/company/nuorder,NuORDER,fashion,USA,CA,Los Angeles,West Hollywood,/person/mortimer-singer,Mortimer Singer,,,,unknown,,series-a,2012-10-01,2012-10,2012-Q4,2012,3060000.0
50001,/company/chacha,ChaCha,advertising,USA,IN,Indianapolis,Carmel,/person/morton-meyerson,Morton Meyerson,,,,unknown,,series-b,2007-10-01,2007-10,2007-Q4,2007,12000000.0
50002,/company/binfire,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,/person/moshe-ariel,Moshe Ariel,,,,unknown,,angel,2008-04-18,2008-04,2008-Q2,2008,500000.0
50003,/company/binfire,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,/person/moshe-ariel,Moshe Ariel,,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,750000.0
50004,/company/unified-color,Unified Color,software,USA,CA,SF Bay,South San Frnacisco,/person/mr-andrew-oung,Mr. Andrew Oung,,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,


In [24]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
uniques = {} # Per column: a list of value_counts results, one per chunk

for chunk in chunk_iter:
    strings = chunk.select_dtypes(include=['object']) # Keep only the string columns
    cols = strings.columns
    
    # For each column we are keeping, store this chunk's value_counts
    for c in cols:
        if c in drop_cols:
            continue
            
        val_counts = strings[c].value_counts()
        
        if c in uniques:
            uniques[c].append(val_counts)
        else:
            uniques[c] = [val_counts]
            
combined = {} # Per column: value counts merged across all chunks

for col in uniques:
    c_concat = pd.concat(uniques[col]) # Stack the per-chunk value counts
    c_group = c_concat.groupby(c_concat.index).sum() # Sum the counts for each distinct value
    combined[col] = c_group # One entry per distinct value

for col in combined:
    print(col + ': ' + str(len(combined[col])))

company_name: 11573
company_category_code: 43
company_country_code: 2
company_state_code: 50
company_region: 546
company_city: 1229
investor_name: 10465
investor_country_code: 72
investor_state_code: 50
investor_region: 585
investor_city: 990
funding_round_type: 9
funded_at: 2808
funded_month: 192
funded_quarter: 72


From the counts above, the columns with only a handful of distinct values relative to the number of rows are good candidates for the category dtype:

company_category_code, company_country_code, company_state_code, investor_country_code, investor_state_code, funding_round_type

In [25]:
categories = ['company_category_code', 'company_country_code', 'company_state_code', 'investor_country_code', 'investor_state_code', 'funding_round_type']
cat_dic = {i:'category' for i in categories}
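To see why the category dtype helps for columns like these, here is a small comparison on made-up data. The series `round_types` is hypothetical; it only mimics a column with a few distinct values repeated many times.

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated 3,000 times each.
round_types = pd.Series(['series-a', 'series-b', 'angel'] * 3000)

as_text = round_types.memory_usage(index=False, deep=True)
as_category = round_types.astype('category').memory_usage(index=False, deep=True)

print(f"text: {as_text} bytes, category: {as_category} bytes")
```

A categorical column stores each distinct label once plus a small integer code per row, so the saving grows with the number of repeated values.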

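Looking back at the sample data, funded_month (for example, 2012-10) and funded_quarter (for example, 2012-Q4) both repeat the year that funded_year already stores. We can keep only the last two characters of each, then convert funded_month to a number and funded_quarter to a category. We'll also parse funded_at as a datetime.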

In [29]:
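def clean_last_2(val):
    """Return the last two characters of val, e.g. '2012-Q4' -> 'Q4'."""
    if pd.isna(val):
        return val  # Leave missing values untouched
    try:
        return val[-2:]
    except TypeError:
        return val  # Not sliceable (e.g. a number), so return it unchanged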
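As an alternative sketch (not the approach used in this notebook), the same cleanup can be written with the pandas string accessor, which slices every value at once and passes missing values through. The sample series `months` and `quarters` below are hypothetical values in the same shape as funded_month and funded_quarter.

```python
import pandas as pd

# Hypothetical values shaped like funded_month and funded_quarter.
months = pd.Series(['2012-10', '2007-10', None, '2008-04'])
quarters = pd.Series(['2012-Q4', None, '2008-Q2'])

# Keep the last two characters, then convert to the target types.
month_numbers = pd.to_numeric(months.str[-2:])
quarter_labels = quarters.str[-2:].astype('category')

print(month_numbers)
print(quarter_labels)
```

Because one value is missing, the month column comes back as floats rather than integers.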

## Loading Chunks into SQLite

Now that we have a strategy for choosing efficient types, we'll load each cleaned chunk into a table in a SQLite database so that we can query the full data set.

In [30]:
import sqlite3

chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols, parse_dates=['funded_at'], dtype=cat_dic)

conn = sqlite3.connect('test.db')
cur = conn.cursor()

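for chunk in chunk_iter:
    chunk['funded_month'] = pd.to_numeric(chunk['funded_month'].apply(clean_last_2))
    chunk['funded_quarter'] = chunk['funded_quarter'].apply(clean_last_2).astype('category')
    # Append to the table; the default if_exists='fail' raises once the table exists
    chunk.to_sql('investments', conn, if_exists='append')
    
cur.execute('SELECT * FROM investments LIMIT 1').fetchone()

Note that to_sql defaults to if_exists='fail'. Without if_exists='append', the load stops as soon as the table already exists, either on the second chunk or on a re-run against an existing test.db, with:

ValueError: Table 'investments' already exists.

With if_exists='append', re-running the cell adds the same rows again, so delete test.db before loading a second time.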
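The sketch below shows the append-then-query pattern end to end on toy data. It uses an in-memory database and two small hand-made DataFrames in place of the real chunks; the names `toy_chunks`, `toy_conn`, and `totals`, and all values, are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical rows; the real chunks would come from pd.read_csv(..., chunksize=5000).
toy_chunks = [
    pd.DataFrame({'company_name': ['A', 'B'], 'raised_amount_usd': [100.0, 250.0]}),
    pd.DataFrame({'company_name': ['A', 'C'], 'raised_amount_usd': [50.0, None]}),
]

toy_conn = sqlite3.connect(':memory:')  # in-memory database, nothing is written to disk
for toy_chunk in toy_chunks:
    # Append each chunk to the same table instead of trying to recreate it.
    toy_chunk.to_sql('investments', toy_conn, if_exists='append', index=False)

# Once every chunk is loaded, SQL can aggregate over the full table.
totals = pd.read_sql(
    'SELECT company_name, SUM(raised_amount_usd) AS total_raised '
    'FROM investments GROUP BY company_name ORDER BY company_name',
    toy_conn,
)
print(totals)
toy_conn.close()
```

Because the aggregation runs inside SQLite, only the small result set has to fit in pandas memory.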