# Analyzing Startup Fundraising Deals from Crunchbase

Every year, thousands of startups raise financing from investors. Each such event is called a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups; its user community submits, edits, and maintains most of that information.

The data set of investments we'll be exploring is current as of October 2013. You can download it from [GitHub](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv).

Throughout this project, we'll practice working with different memory constraints. For the initial step, we will assume that we only have 10 MB of available memory.
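
One way to check that a given chunk size respects the 10 MB budget is to measure the footprint of every chunk and keep the largest. The following is a minimal sketch, not part of the original notebook: the helper name `largest_chunk_mb`, the constant `MEMORY_BUDGET_MB`, and the three-row sample are all made up for illustration so that the code runs without the real file.

```python
import io
import pandas as pd

MEMORY_BUDGET_MB = 10  # the budget assumed in the text

def largest_chunk_mb(source, chunksize, **read_csv_kwargs):
    """Return the footprint, in MB, of the largest chunk read from `source`."""
    largest = 0.0
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        # deep=True includes the actual size of strings in object columns.
        chunk_mb = chunk.memory_usage(deep=True).sum() / (1024 ** 2)
        largest = max(largest, chunk_mb)
    return largest

# Made-up three-row sample (hypothetical data, not from Crunchbase).
sample = io.StringIO(
    "company_name,raised_amount_usd\n"
    "Acme,1000000\n"
    "Globex,\n"
    "Initech,250000\n"
)
largest = largest_chunk_mb(sample, chunksize=2)
fits = largest < MEMORY_BUDGET_MB
print(f"largest chunk: {largest:.6f} MB, fits in budget: {fits}")
```

For the real data set, the same helper could be called as `largest_chunk_mb('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')`.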

In [26]:
import pandas as pd
import numpy as np

Let's start by counting the missing values in each column, reading the file in chunks of 5,000 rows:

In [27]:
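# Read the file in 5,000-row chunks so that only one chunk is in memory at a time.
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

# Collect the per-column missing-value counts of each chunk.
missing = []

for chunk in chunk_iter:
    missing.append(chunk.isnull().sum())

# Stack the per-chunk counts, then sum them by column name.
combined = pd.concat(missing)
grouped = combined.groupby(combined.index).sum().sort_values()

print(grouped)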

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64
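
Raw counts are easier to judge as a share of all rows. The sketch below shows one way to get the fraction of missing values per column while still reading in chunks. It is an illustration under stated assumptions: the function name `missing_fraction` and the four-row sample are hypothetical and do not come from the original notebook.

```python
import io
import pandas as pd

def missing_fraction(source, chunksize, **read_csv_kwargs):
    """Return the fraction of missing values per column, computed chunk by chunk."""
    missing = None
    total_rows = 0
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        counts = chunk.isnull().sum()
        # Add this chunk's counts to the running per-column totals.
        missing = counts if missing is None else missing + counts
        total_rows += len(chunk)
    return (missing / total_rows).sort_values()

# Made-up sample: column "b" is empty in 3 of its 4 rows.
sample = io.StringIO("a,b\n1,\n2,\n3,x\n4,\n")
fractions = missing_fraction(sample, chunksize=2)
print(fractions)
```

On the real file, the call would presumably be `missing_fraction('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')`.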


Now let's look at the memory footprint of each column, in bytes:

In [28]:
# Accumulate the per-column memory usage (in bytes) across all chunks.
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
memory = None

for chunk in chunk_iter:
    # deep=True measures the actual size of the strings in object columns.
    usage = chunk.memory_usage(index=False, deep=True)
    memory = usage if memory is None else memory + usage

print(memory)

company_permalink         4057788
company_name              3591326
company_category_code     3421104
company_country_code      3172176
company_state_code        3106051
company_region            3411585
company_city              3505926
investor_permalink        4980548
investor_name             3915666
investor_category_code     622424
investor_country_code     2647292
investor_state_code       2476607
investor_region           3396281
investor_city             2885083
funding_round_type        3410707
funded_at                 3542185
funded_month              3383584
funded_quarter            3383584
funded_year                422960
raised_amount_usd          422960
dtype: int64


In [29]:
memory.sum() / (1024 ** 2)  # Total memory in MB

56.9876070022583

At roughly 57 MB, the full data set is well above the 10 MB budget, so it has to be processed in chunks rather than loaded all at once.

Next, let's pick the columns to drop because they are not useful for analysis. investor_category_code is missing in 50,427 rows, more than any other column, and investor_permalink and company_permalink contain only URLs. According to the per-column figures above, these three columns account for roughly 9.2 MB of the 57 MB total.

In [30]:
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']
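
A list like `drop_cols` can be applied at load time so that the unwanted columns never occupy memory. The sketch below uses the callable form of the `usecols` parameter of `pd.read_csv` for this. It is one possible approach, not necessarily the one the notebook takes next; the helper name `read_chunks_without` and the two-row sample are hypothetical.

```python
import io
import pandas as pd

drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']

def read_chunks_without(source, drop_cols, chunksize, **read_csv_kwargs):
    """Return a chunk iterator over `source` that skips the columns in `drop_cols`."""
    unwanted = set(drop_cols)
    return pd.read_csv(
        source,
        chunksize=chunksize,
        # Keep a column only if its name is not in the drop list.
        usecols=lambda name: name not in unwanted,
        **read_csv_kwargs,
    )

# Made-up sample rows (hypothetical values, not taken from the data set).
sample = io.StringIO(
    "company_name,company_permalink,investor_permalink,"
    "investor_category_code,raised_amount_usd\n"
    "Acme,/company/acme,/person/jane-doe,,1000000\n"
    "Globex,/company/globex,/company/initech-ventures,finance,250000\n"
)
chunks = list(read_chunks_without(sample, drop_cols, chunksize=1))
kept = list(chunks[0].columns)
print(kept)
```

With the real file, the iterator would be created with `read_chunks_without('crunchbase-investments.csv', drop_cols, chunksize=5000, encoding='ISO-8859-1')` and consumed in a `for` loop like the ones above.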