# Guided Project:
### Analyzing Startup Fundraising Deals from Crunchbase

## Introduction

In this course, we explored a few different ways to work with larger data sets in pandas. In this guided project, we'll practice using some of the techniques we learned to analyze startup investments from [Crunchbase.com](https://www.crunchbase.com).

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

In return, Crunchbase makes the data available through a Web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups crawled the site and released the data online. Because the information on the startups and their fundraising rounds is always changing, the data set we'll be using isn't completely up to date.

The data set of investments we'll be exploring is current as of October 2013. You can download it from [GitHub](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv). Here's a preview:

In [4]:
import pandas as pd
pd.read_csv('crunchbase-investments.csv', nrows=3)

|   | company_permalink | company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_permalink | investor_name | investor_category_code | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | funded_month | funded_quarter | funded_year | raised_amount_usd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /company/advercar | AdverCar | advertising | USA | CA | SF Bay | San Francisco | /company/1-800-flowers-com | 1-800-FLOWERS.COM |  | USA | NY | New York | New York | series-a | 2012-10-30 | 2012-10 | 2012-Q4 | 2012 | 2000000 |
| 1 | /company/launchgram | LaunchGram | news | USA | CA | SF Bay | Mountain View | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | other | 2012-01-23 | 2012-01 | 2012-Q1 | 2012 | 20000 |
| 2 | /company/utap | uTaP | messaging | USA |  | United States - Other |  | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | other | 2012-01-01 | 2012-01 | 2012-Q1 | 2012 | 20000 |


Throughout this guided project, we'll practice working with different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While `crunchbase-investments.csv` consumes **10.3 megabytes of disk space**, we know from earlier missions that **pandas often requires 4 to 6 times as much space in memory as the file does on disk** (especially when there are many string columns).

* Because the data set contains over 50,000 rows, you'll need to read the data set into dataframes using 5,000 row chunks to ensure that each chunk consumes much less than 10 megabytes of memory.
* Across all of the chunks, become familiar with:
  * Each column's missing value counts
  * Each column's memory footprint
  * The total memory footprint of all of the chunks combined
  * Which column(s) we can drop because they aren't useful for analysis
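
As a rough illustration of the checklist above, the sketch below runs the same kind of chunked pass over a small synthetic CSV held in memory. The names `demo_csv`, `name`, `city`, and `amount` are made up for the example and are not part of the Crunchbase file; `chunksize=2` stands in for the 5,000-row chunks.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV on disk: 6 rows, some cells missing.
demo_csv = io.StringIO(
    "name,city,amount\n"
    "a,Paris,10\n"
    "b,,20\n"
    "c,Rome,\n"
    "d,Rome,40\n"
    "e,,50\n"
    "f,Oslo,60\n"
)

null_counts = None    # running per-column missing-value counts
memory_bytes = None   # running per-column deep memory usage, in bytes
n_rows = 0

for chunk in pd.read_csv(demo_csv, chunksize=2):
    chunk_nulls = chunk.isnull().sum()
    chunk_mem = chunk.memory_usage(deep=True, index=False)
    # Series addition aligns on column names, so the totals stay per column.
    null_counts = chunk_nulls if null_counts is None else null_counts + chunk_nulls
    memory_bytes = chunk_mem if memory_bytes is None else memory_bytes + chunk_mem
    n_rows += len(chunk)

print(null_counts)                   # missing values per column
print(memory_bytes / 2**20)          # per-column footprint in MB
print(memory_bytes.sum() / 2**20)    # total footprint in MB
```

Keeping the running totals as pandas Series avoids tracking column positions by hand, which is one possible alternative to a list of `[column, count]` pairs.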

In [3]:
import numpy as np

In [17]:
chunk_iter = pd.read_csv('crunchbase-investments.csv',
                         chunksize=5000,
                        encoding='latin-1')

tot_mem_usage = 0
tot_cols = pd.read_csv('crunchbase-investments.csv', nrows=1).columns
tot_cols_numnulls = [[col, 0] for col in tot_cols]


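for i, chunk in enumerate(chunk_iter):

    print('#'*30)
    print('chunk #'+str(i+1))
    chunk.info()  # info() prints directly and returns None
    print('#'*30)
    
    # use a separate index name so the chunk counter `i` is not shadowed
    for j, col in enumerate(chunk.columns):
        print(col,'/ number of unique values -',
              len(chunk[col].unique()),
              '/',
              '{0:.3f}MB used'.format(chunk[col].memory_usage(deep=True)/2**20))
        
        # accumulate the missing-value count for this column
        tot_cols_numnulls[j][1] += chunk[col].isnull().sum()
    
    # accumulate the total memory footprint in megabytes
    tot_mem_usage += chunk.memory_usage(deep=True).sum()/2**20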

##############################
chunk #1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 20 columns):
company_permalink         5000 non-null object
company_name              5000 non-null object
company_category_code     4948 non-null object
company_country_code      5000 non-null object
company_state_code        4947 non-null object
company_region            5000 non-null object
company_city              4936 non-null object
investor_permalink        5000 non-null object
investor_name             5000 non-null object
investor_category_code    2443 non-null object
investor_country_code     4222 non-null object
investor_state_code       3629 non-null object
investor_region           5000 non-null object
investor_city             4100 non-null object
funding_round_type        5000 non-null object
funded_at                 5000 non-null object
funded_month              5000 non-null object
funded_quarter            5000 non-null object
funded_

In [12]:
tot_mem_usage

56.988484382629395

In [18]:
# total number of null values of each column
tot_cols_numnulls

[['company_permalink', 1],
 ['company_name', 1],
 ['company_category_code', 643],
 ['company_country_code', 1],
 ['company_state_code', 492],
 ['company_region', 1],
 ['company_city', 533],
 ['investor_permalink', 2],
 ['investor_name', 2],
 ['investor_category_code', 50427],
 ['investor_country_code', 12001],
 ['investor_state_code', 16809],
 ['investor_region', 2],
 ['investor_city', 12480],
 ['funding_round_type', 3],
 ['funded_at', 3],
 ['funded_month', 3],
 ['funded_quarter', 3],
 ['funded_year', 3],
 ['raised_amount_usd', 3599]]

### Note: which column(s) can we drop?
* **Too many null values** (more than 10 percent of all rows)
  * 'investor_category_code', 'investor_country_code', 'investor_state_code', 'investor_city'
* **Identical information**
  * The '\_permalink' and '\_name' columns seem to carry the same information.
  * Since the '\_permalink' columns use more memory than the '\_name' columns, we keep only the '\_name' columns.
* **Duplicated information**
  * The 'funded_at' column already contains everything in the columns below:
    * 'funded_month', 'funded_year'
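
The 10 percent rule can be checked directly against the null counts printed earlier. The sketch below hard-codes those counts and assumes 52,870 total rows, which is inferred from the index range reported later in the notebook rather than stated outright; `NULL_SHARE_LIMIT` is just a name chosen for this example.

```python
import pandas as pd

TOTAL_ROWS = 52870        # assumption: inferred from the index range 0..52869
NULL_SHARE_LIMIT = 0.10   # the 10 percent threshold from the note above

# Null counts copied from the tot_cols_numnulls output.
null_counts = pd.Series({
    'company_permalink': 1, 'company_name': 1, 'company_category_code': 643,
    'company_country_code': 1, 'company_state_code': 492, 'company_region': 1,
    'company_city': 533, 'investor_permalink': 2, 'investor_name': 2,
    'investor_category_code': 50427, 'investor_country_code': 12001,
    'investor_state_code': 16809, 'investor_region': 2, 'investor_city': 12480,
    'funding_round_type': 3, 'funded_at': 3, 'funded_month': 3,
    'funded_quarter': 3, 'funded_year': 3, 'raised_amount_usd': 3599,
})

null_share = null_counts / TOTAL_ROWS
too_sparse = null_share[null_share > NULL_SHARE_LIMIT].index.tolist()

print(null_share.round(3))
print(too_sparse)
```

Under that row-count assumption, only the four investor columns listed above cross the threshold; `raised_amount_usd` stays below it at roughly 7 percent.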

## Selecting Data Types

Now that we have a good sense of the missing values, let's get familiar with the column types before adding the data into SQLite.

* Identify the types for each column.
* Identify the numeric columns we can represent using more space efficient types.
* For text columns:
  * Analyze the unique value counts across all of the chunks to see if we can convert them to a numeric type.
  * See if we can clean any text columns and separate them into multiple numeric columns without adding any overhead when querying.
* Make your changes to the code from the last step so that the overall memory the data consumes stays under 10 megabytes.
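
To make the two text-column ideas concrete, here is a minimal sketch on made-up data shaped like two of the Crunchbase columns. It converts a low-cardinality string column to `category` and splits an ISO date string into small integer columns. It uses `pd.to_datetime` and the `.dt` accessor, which is an assumption of this example and only one of several ways to split the date.

```python
import pandas as pd

# Made-up sample: 1,000 rows with few distinct values per column.
demo = pd.DataFrame({
    'funding_round_type': ['series-a', 'angel', 'series-a', 'other'] * 250,
    'funded_at': ['2012-10-30', '2012-01-23', '2011-09-08', '2012-02-15'] * 250,
})
before = demo.memory_usage(deep=True).sum()

# Few distinct values relative to the row count: store as category codes.
demo['funding_round_type'] = demo['funding_round_type'].astype('category')

# Split the date string into small integer columns, then drop the text column.
parsed = pd.to_datetime(demo['funded_at'], format='%Y-%m-%d')
demo['funded_year'] = pd.to_numeric(parsed.dt.year, downcast='integer')
demo['funded_month'] = pd.to_numeric(parsed.dt.month, downcast='integer')
demo['funded_day'] = pd.to_numeric(parsed.dt.day, downcast='integer')
demo = demo.drop(columns='funded_at')

after = demo.memory_usage(deep=True).sum()
print(before, after)   # deep memory usage in bytes, before and after
print(demo.dtypes)
```

The exact savings depend on the data; the point is only that both conversions replace per-row Python strings with fixed-width values.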

In [30]:
selected_cols = ['company_name', 'company_category_code', 'company_country_code',
                'company_region', 'company_city', 'investor_name', 'investor_region',
                'funding_round_type', 'funded_at', 'funded_quarter', 'raised_amount_usd']

chunk_iter = pd.read_csv('crunchbase-investments.csv',
                        chunksize=5000, 
                         encoding='latin-1',
                        usecols=selected_cols)

for chunk in chunk_iter:
    print(chunk.info())
    break

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
company_name             5000 non-null object
company_category_code    4948 non-null object
company_country_code     5000 non-null object
company_region           5000 non-null object
company_city             4936 non-null object
investor_name            5000 non-null object
investor_region          5000 non-null object
funding_round_type       5000 non-null object
funded_at                5000 non-null object
funded_quarter           5000 non-null object
raised_amount_usd        4347 non-null float64
dtypes: float64(1), object(10)
memory usage: 429.8+ KB
None


In [38]:
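# identify the numeric columns we can represent using more space efficient types
def check_convertible_to_integer(df, col):
    # an integer dtype cannot hold NaN, so any missing value rules it out
    if df[col].isnull().any():
        return False
    # otherwise every value must be a whole number
    return bool((df[col] % 1 == 0).all())

def check_convertible_to_category(df, col):
    # worth converting when fewer than half of the values are unique
    return len(df[col].unique()) < len(df[col]) * .5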
 
chunk_iter = pd.read_csv('crunchbase-investments.csv',
                        chunksize=5000, 
                         encoding='latin-1',
                        usecols=selected_cols)
    
chunks = []

# dictionary to store the unique values for each object type column
obj_col_uniques = {
    'company_name':[],
    'company_category_code':[],
    'company_country_code':[],
    'company_region':[],
    'company_city':[],
    'investor_name':[],
    'investor_region':[],
    'funding_round_type':[]
}
        
for chunk in chunk_iter:
    
    for col in chunk.columns:
        
        if chunk[col].dtype == 'object':
            # for column 'funded_at'
            # split into 3 integer columns: funded_day, funded_month, funded_year
            # and make the 'funded_quarter' column numerical (integer)
            
            if col == 'funded_at':
                
                # drop the rows with null values in 'funded_at' (3 in the full data set)
                chunk = chunk.dropna(subset=[col])
                
                chunk['funded_day'] = pd.to_numeric(chunk[col].apply(lambda x: x[-2:]),
                                                    downcast='integer')
                chunk['funded_month'] = pd.to_numeric(chunk[col].apply(lambda x: x[5:7]),
                                                      downcast='integer')
                chunk['funded_year'] = pd.to_numeric(chunk[col].apply(lambda x: x[:4]),
                                                      downcast='integer')
                chunk['funded_quarter'] = pd.to_numeric(chunk['funded_quarter'].apply(lambda x: x[-1:]),
                                                      downcast='integer')
                
            # for column except 'funded_at'
            # check the ratio of unique values to the entire length
            else:
                
                if col in obj_col_uniques:
                    obj_col_uniques[col].extend(list(chunk[col].unique()))
                
                if check_convertible_to_category(chunk, col):
                    print(col, 'convertible to category type')
                else:
                    print(col, 'not convertible to category type')
            
        if chunk[col].dtype != 'object':
            
            if check_convertible_to_integer(chunk, col):
                print(col, '- to_numeric available')
            
            else:
                print(col, '- to_numeric not available')
        
    print('#'*30)
    
    chunks.append(chunk)

company_name not convertible to category type
company_category_code convertible to category type
company_country_code convertible to category type
company_region convertible to category type
company_city convertible to category type
investor_name convertible to category type
investor_region convertible to category type
funding_round_type convertible to category type
funded_quarter - to_numeric available
raised_amount_usd - to_numeric not available
##############################
company_name not convertible to category type
company_category_code convertible to category type
company_country_code convertible to category type
company_region convertible to category type
company_city convertible to category type
investor_name convertible to category type
investor_region convertible to category type
funding_round_type convertible to category type
funded_quarter - to_numeric available
raised_amount_usd - to_numeric not available
##############################
company_name not convertible to ca

#### Object columns convertible to category
* 'company_category_code', 'company_country_code', 'company_region', 'company_city', 'investor_name', 'investor_region', 'funding_round_type'

#### `raised_amount_usd` --- not convertible to integer type
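
One likely reason, sketched below on two made-up Series: the column contains missing values, and NaN can only live in a float column, so `pd.to_numeric(..., downcast='integer')` leaves it as `float64`. The same whole-number values without NaN do downcast to an integer type.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not taken from the full data set.
with_nan = pd.Series([2000000.0, 20000.0, np.nan])
without_nan = pd.Series([2000000.0, 20000.0])

dtype_with_nan = pd.to_numeric(with_nan, downcast='integer').dtype
dtype_without_nan = pd.to_numeric(without_nan, downcast='integer').dtype

print(dtype_with_nan)      # stays a float dtype because of the NaN
print(dtype_without_nan)   # an integer dtype
```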

In [42]:
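# Deduplicate the collected values and drop missing values in one pass.
# list.remove(np.nan) is fragile: NaN never compares equal to itself, and
# remove() raises ValueError when a column has no NaN at all.
obj_col_uniques_nodup = {key: [v for v in set(val) if pd.notnull(v)]
                         for key, val in obj_col_uniques.items()}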
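
The combined list of unique values matters because each chunk is converted separately. If two chunks end up with different category sets, `pd.concat` gives up the categorical dtype (it falls back to `object` in the pandas versions this notebook was written for) and the memory savings are lost. A minimal sketch with made-up values:

```python
import pandas as pd

chunk_a = pd.Series(['angel', 'series-a'])
chunk_b = pd.Series(['series-a', 'venture'])

# Categories inferred per chunk differ, so the result is no longer categorical.
mismatched = pd.concat([chunk_a.astype('category'), chunk_b.astype('category')])
print(mismatched.dtype)

# One shared dtype built from all known values keeps the result categorical.
shared = pd.CategoricalDtype(categories=['angel', 'series-a', 'venture'])
matched = pd.concat([chunk_a.astype(shared), chunk_b.astype(shared)])
print(matched.dtype)
```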

In [None]:
chunk_iter = pd.read_csv('crunchbase-investments.csv',
                        chunksize=5000, 
                         encoding='latin-1',
                        usecols=selected_cols)

chunks = []
        
for chunk in chunk_iter:
    
    for col in chunk.columns:
        
        if chunk[col].dtype == 'object':
            # for column 'funded_at'
            # split into 3 integer columns: funded_day, funded_month, funded_year
            # and make the 'funded_quarter' column numerical (integer)
            
            if col == 'funded_at':
                
                # drop the rows with null values in 'funded_at' (3 in the full data set)
                chunk = chunk.dropna(subset=[col])
                
                chunk['funded_day'] = pd.to_numeric(chunk[col].apply(lambda x: x[-2:]),
                                                    downcast='integer')
                chunk['funded_month'] = pd.to_numeric(chunk[col].apply(lambda x: x[5:7]),
                                                      downcast='integer')
                chunk['funded_year'] = pd.to_numeric(chunk[col].apply(lambda x: x[:4]),
                                                      downcast='integer')
                chunk['funded_quarter'] = pd.to_numeric(chunk['funded_quarter'].apply(lambda x: x[-1:]),
                                                      downcast='integer')
                
            # for column except 'funded_at'
            # check the ratio of unique values to the entire length
            else:
                # except for company_name : convert to category dtype
                if check_convertible_to_category(chunk, col):
                    #print(col, 'convertible to category type')
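                    # current pandas no longer accepts astype('category', categories=...);
                    # pass a CategoricalDtype with the shared category list instead
                    chunk[col] = chunk[col].astype(
                        pd.CategoricalDtype(categories=obj_col_uniques_nodup[col]))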
                    
                # company_name --- pass.
                
    print('#'*30)
    #print(chunk.columns)
    chunks.append(chunk.drop(['funded_at'], axis=1))

In [50]:
crunch = pd.concat(chunks)
crunch.head()

|   | company_name | company_category_code | company_country_code | company_region | company_city | investor_name | investor_region | funding_round_type | funded_quarter | raised_amount_usd | funded_day | funded_month | funded_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AdverCar | advertising | USA | SF Bay | San Francisco | 1-800-FLOWERS.COM | New York | series-a | 4 | 2000000.0 | 30 | 10 | 2012 |
| 1 | LaunchGram | news | USA | SF Bay | Mountain View | 10Xelerator | Columbus | other | 1 | 20000.0 | 23 | 1 | 2012 |
| 2 | uTaP | messaging | USA | United States - Other |  | 10Xelerator | Columbus | other | 1 | 20000.0 | 1 | 1 | 2012 |
| 3 | ZoopShop | software | USA | Columbus | columbus | 10Xelerator | Columbus | angel | 1 | 20000.0 | 15 | 2 | 2012 |
| 4 | eFuneral | web | USA | Cleveland | Cleveland | 10Xelerator | Columbus | other | 3 | 20000.0 | 8 | 9 | 2011 |


In [52]:
# processed dataset --- memory usage : 6.3 MB
crunch.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52867 entries, 0 to 52869
Data columns (total 13 columns):
company_name             52867 non-null object
company_category_code    52225 non-null category
company_country_code     52867 non-null category
company_region           52867 non-null category
company_city             52335 non-null category
investor_name            52867 non-null category
investor_region          52867 non-null category
funding_round_type       52867 non-null category
funded_quarter           52867 non-null int8
raised_amount_usd        49271 non-null float64
funded_day               52867 non-null int8
funded_month             52867 non-null int8
funded_year              52867 non-null int16
dtypes: category(7), float64(1), int16(1), int8(3), object(1)
memory usage: 6.3 MB


## Loading Chunks into SQLite

Now we're in good shape to start exploring and analyzing the data. The next step is to load each chunk into a table in a SQLite database so we can query the full data set.

* Create and connect to a new SQLite database file.
* Expand on the existing chunk processing code to export each chunk to a new table in the SQLite database.
* Query the table and make sure the data types match up to what you had in mind for each column.
* Use the `!wc` IPython command to return the file size of the database.
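
The steps above can be tried end to end on a throwaway database first. The sketch below appends chunks of a small made-up CSV to an in-memory SQLite table and inspects the resulting schema. The table name `demo_investments`, the connection name `demo_conn`, and the sample rows are assumptions for illustration only.

```python
import io
import sqlite3
import pandas as pd

# Made-up sample rows; the last one has a missing amount.
demo_csv = io.StringIO(
    "company_name,funded_quarter,raised_amount_usd\n"
    "AdverCar,4,2000000\n"
    "LaunchGram,1,20000\n"
    "uTaP,1,\n"
)

demo_conn = sqlite3.connect(':memory:')   # throwaway database for the sketch

# Fix the dtype up front so every chunk agrees on the column type;
# the first chunk written is the one that creates the table.
reader = pd.read_csv(demo_csv, chunksize=2,
                     dtype={'raised_amount_usd': 'float64'})
for chunk in reader:
    chunk.to_sql('demo_investments', demo_conn, if_exists='append', index=False)

schema = pd.read_sql('PRAGMA table_info(demo_investments);', demo_conn)
n_rows = pd.read_sql('SELECT COUNT(*) AS n FROM demo_investments;',
                     demo_conn)['n'].iloc[0]

print(schema[['name', 'type']])
print(n_rows)
```

Checking the row count after the loop is a cheap way to confirm that every chunk was appended rather than overwritten.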

In [53]:
import sqlite3

conn = sqlite3.connect('crunchbase.db')

In [54]:
chunk_iter = pd.read_csv('crunchbase-investments.csv',
                        chunksize=5000, 
                         encoding='latin-1',
                        usecols=selected_cols)

chunks = []
        
for chunk in chunk_iter:
    
    for col in chunk.columns:
        
        if chunk[col].dtype == 'object':
            # for column 'funded_at'
            # split into 3 integer columns: funded_day, funded_month, funded_year
            # and make the 'funded_quarter' column numerical (integer)
            
            if col == 'funded_at':
                
                # drop the rows with null values in 'funded_at' (3 in the full data set)
                chunk = chunk.dropna(subset=[col])
                
                chunk['funded_day'] = pd.to_numeric(chunk[col].apply(lambda x: x[-2:]),
                                                    downcast='integer')
                chunk['funded_month'] = pd.to_numeric(chunk[col].apply(lambda x: x[5:7]),
                                                      downcast='integer')
                chunk['funded_year'] = pd.to_numeric(chunk[col].apply(lambda x: x[:4]),
                                                      downcast='integer')
                chunk['funded_quarter'] = pd.to_numeric(chunk['funded_quarter'].apply(lambda x: x[-1:]),
                                                      downcast='integer')
                
            # for column except 'funded_at'
            # check the ratio of unique values to the entire length
            else:
                # except for company_name : convert to category dtype
                if check_convertible_to_category(chunk, col):
                    #print(col, 'convertible to category type')
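                    # current pandas no longer accepts astype('category', categories=...);
                    # pass a CategoricalDtype with the shared category list instead
                    chunk[col] = chunk[col].astype(
                        pd.CategoricalDtype(categories=obj_col_uniques_nodup[col]))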
                    
                # company_name --- pass.
    
    chunk = chunk.drop(['funded_at'], axis=1)
    chunk.to_sql('investments', conn, if_exists='append', index=False)



In [55]:
pd.read_sql('PRAGMA TABLE_INFO(investments);', conn)

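|   | cid | name | type | notnull | dflt_value | pk |
|---|---|---|---|---|---|---|
| 0 | 0 | company_name | TEXT | 0 |  | 0 |
| 1 | 1 | company_category_code | TEXT | 0 |  | 0 |
| 2 | 2 | company_country_code | TEXT | 0 |  | 0 |
| 3 | 3 | company_region | TEXT | 0 |  | 0 |
| 4 | 4 | company_city | TEXT | 0 |  | 0 |
| 5 | 5 | investor_name | TEXT | 0 |  | 0 |
| 6 | 6 | investor_region | TEXT | 0 |  | 0 |
| 7 | 7 | funding_round_type | TEXT | 0 |  | 0 |
| 8 | 8 | funded_quarter | INTEGER | 0 |  | 0 |
| 9 | 9 | raised_amount_usd | REAL | 0 |  | 0 |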


In [56]:
# db filesize - about 5.3MB
!wc crunchbase.db

   10197  268782 5279744 crunchbase.db


## Data Exploration and Analysis

Now that the data is in SQLite, we can use the pandas SQLite workflow we learned in the last mission to explore and analyze startup investments. Remember that each row isn't a unique company, but a unique investment from a single investor. This means that many startups will span multiple rows.

#### Use the pandas SQLite workflow to answer the following questions:
* What proportion of the total amount of funds did the top 10% raise? What about the top 1%? Compare these values to the proportions the bottom 10% and bottom 1% raised.
* Which category of company attracted the most investments?
* Which investor contributed the most money (across all startups)?
* Which investors contributed the most money per startup?
* Which funding round was the most popular? Which was the least popular?

In [99]:
# What proportion of the total amount of funds did the top 10% raise?

q = '''
    SELECT raised_amount_usd
    FROM investments
    ORDER BY 1 DESC
'''

raised_amount_usd_df = pd.read_sql(q, conn)
raised_amount_usd_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52867 entries, 0 to 52866
Data columns (total 1 columns):
raised_amount_usd    49271 non-null float64
dtypes: float64(1)
memory usage: 413.1 KB


In [86]:
raised_amount_usd_df['raised_amount_usd'].notnull().sum()

49271

In [89]:
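# select the column by name; positional [0] on a labelled Series is deprecated
top10per_index = int(raised_amount_usd_df['raised_amount_usd'].notnull().sum()*.1)

raised_top10per_funds_usd = raised_amount_usd_df['raised_amount_usd'].iloc[:top10per_index].sum()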
raised_top10per_funds_usd

339885132662.0

In [91]:
# What about the top 1%? 
top1per_index = int(raised_amount_usd_df['raised_amount_usd'].notnull().sum()*.01)

raised_top1per_funds_usd = raised_amount_usd_df['raised_amount_usd'].iloc[:top1per_index].sum()
raised_top1per_funds_usd

131467802677.0

In [93]:
# Compare these values to the proportions the bottom 10% and bottom 1% raised.

bottom1per_index = int(raised_amount_usd_df['raised_amount_usd'].notnull().sum()*.99)
bottom10per_index = int(raised_amount_usd_df['raised_amount_usd'].notnull().sum()*.90)

raised_bottom1per_funds_usd = raised_amount_usd_df['raised_amount_usd'].iloc[bottom1per_index:].sum()
raised_bottom10per_funds_usd = raised_amount_usd_df['raised_amount_usd'].iloc[bottom10per_index:].sum()

print(raised_top10per_funds_usd/raised_top1per_funds_usd)
print(raised_bottom10per_funds_usd/raised_bottom1per_funds_usd)

2.58531081939
198.260439149
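
The two ratios printed above compare the slices with each other. The question itself asks for each slice's share of the total, which could be computed along the lines of the sketch below. It uses a tiny made-up table in an in-memory database; `demo_conn`, `demo_investments`, and `share_of_total` are hypothetical names, and no result for the real data set is implied.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')
pd.DataFrame({
    'raised_amount_usd': [10.0, 9.0, 8.0, 7.0, 6.0, 5.0,
                          4.0, 3.0, 2.0, 1.0, None, None],
}).to_sql('demo_investments', demo_conn, index=False)

# Filter NULLs in SQL so the percentile cut-offs only count known amounts.
q = '''
    SELECT raised_amount_usd
    FROM demo_investments
    WHERE raised_amount_usd IS NOT NULL
    ORDER BY raised_amount_usd DESC
'''
amounts = pd.read_sql(q, demo_conn)['raised_amount_usd']

def share_of_total(sorted_desc, fraction, top=True):
    """Share of the grand total held by the top or bottom `fraction` of rows."""
    n = int(len(sorted_desc) * fraction)
    if top:
        rows = sorted_desc.iloc[:n]
    else:
        rows = sorted_desc.iloc[len(sorted_desc) - n:]
    return rows.sum() / sorted_desc.sum()

top_share = share_of_total(amounts, 0.10, top=True)
bottom_share = share_of_total(amounts, 0.10, top=False)
print(top_share, bottom_share)
```

Because each row is one investment rather than one company, shares computed this way describe the distribution of investment rows, not of startups.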


In [95]:
# Which category of company attracted the most investments?

q = '''
        SELECT company_category_code,
                SUM(raised_amount_usd) total_fund
        FROM investments
        GROUP BY company_category_code
        ORDER BY 2 DESC
'''

fund_amount_by_category = pd.read_sql(q, conn)
fund_amount_by_category.head()

|   | company_category_code | total_fund |
|---|---|---|
| 0 | biotech | 110396400000.0 |
| 1 | software | 73084520000.0 |
| 2 | mobile | 64777380000.0 |
| 3 | cleantech | 52705230000.0 |
| 4 | enterprise | 45860930000.0 |


In [98]:
fund_amount_by_category.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 2 columns):
company_category_code    42 non-null object
total_fund               43 non-null float64
dtypes: float64(1), object(1)
memory usage: 3.1 KB


In [101]:
# Which investor contributed the most money (across all startups)?

q = '''
        SELECT investor_name,
                SUM(raised_amount_usd) total_fund
        FROM investments
        GROUP BY 1
        ORDER BY 2 DESC
'''

fund_amount_by_investor = pd.read_sql(q, conn)
fund_amount_by_investor.head()

|   | investor_name | total_fund |
|---|---|---|
| 0 | Kleiner Perkins Caufield & Byers | 11217830000.0 |
| 1 | New Enterprise Associates | 9692542000.0 |
| 2 | Accel Partners | 6472126000.0 |
| 3 | Goldman Sachs | 6375459000.0 |
| 4 | Sequoia Capital | 6039402000.0 |


In [112]:
# Which investors contributed the most money per startup?

q = '''
        SELECT investor_name,
               AVG(raised_amount_usd) avg_fund_per_company,
               COUNT(company_name) num_of_companies_to_fund
        FROM investments
        GROUP BY 1
        ORDER BY 2 DESC
'''

fund_amount_by_investor_per_startup = pd.read_sql(q, conn)
fund_amount_by_investor_per_startup.head()

|   | investor_name | avg_fund_per_company | num_of_companies_to_fund |
|---|---|---|---|
| 0 | Marlin Equity Partners | 2600000000.0 | 1 |
| 1 | BrightHouse | 2350000000.0 | 2 |
| 2 | GI Partners | 1050000000.0 | 1 |
| 3 | Sprint Nextel | 833333300.0 | 3 |
| 4 | Siemens PLM Software | 750000000.0 | 1 |
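
Since each row is an investment rather than a company, `COUNT(company_name)` counts rows, and `AVG` averages over rows. If "per startup" is read as "per distinct company", a variant like the sketch below may be closer to the question. The table, connection, and investor names are made up, and this is only one possible reading.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')
pd.DataFrame({
    'investor_name': ['Fund X', 'Fund X', 'Fund X', 'Fund Y'],
    'company_name': ['Acme', 'Acme', 'Beta', 'Acme'],
    'raised_amount_usd': [100.0, 300.0, 200.0, 50.0],
}).to_sql('demo_investments', demo_conn, index=False)

q = '''
    SELECT investor_name,
           COUNT(company_name) AS num_rows,
           COUNT(DISTINCT company_name) AS num_companies,
           SUM(raised_amount_usd) / COUNT(DISTINCT company_name) AS total_per_company
    FROM demo_investments
    GROUP BY investor_name
    ORDER BY investor_name
'''
per_investor = pd.read_sql(q, demo_conn)
print(per_investor)
```

In this toy table, Fund X has three rows but only two distinct companies, so the two counts differ.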


In [114]:
# Which funding round was the most popular? 

q = '''
        SELECT funding_round_type,
               COUNT(*) amount_of_funding_round
        FROM investments
        GROUP BY 1
        ORDER BY 2 DESC
'''

most_famous_funding_round = pd.read_sql(q, conn)
most_famous_funding_round.head()

|   | funding_round_type | amount_of_funding_round |
|---|---|---|
| 0 | series-a | 13938 |
| 1 | series-c+ | 10870 |
| 2 | angel | 8989 |
| 3 | venture | 8917 |
| 4 | series-b | 8794 |


In [116]:
# Which was the least popular?

q = '''
        SELECT funding_round_type,
               COUNT(*) amount_of_funding_round
        FROM investments
        GROUP BY 1
        ORDER BY 2
'''

least_famous_funding_round = pd.read_sql(q, conn)
least_famous_funding_round.head()

|   | funding_round_type | amount_of_funding_round |
|---|---|---|
| 0 | crowdfunding | 5 |
| 1 | post-ipo | 33 |
| 2 | private-equity | 357 |
| 3 | other | 964 |
| 4 | series-b | 8794 |


## Next Steps

That's it for the guided steps. Here are some ideas for further exploration:

* Repeat the tasks in this guided project using stricter memory constraints (under 1 megabyte).
* Clean and analyze the other Crunchbase data sets from the same [GitHub repo](https://github.com/datahoarder/crunchbase-october-2013).
  * Understand which columns the data sets share, and how the data sets are linked.
  * Create a relational database design that links the data sets together and reduces the overall disk space the database file consumes.
  * Use pandas to populate each table in the database, create the appropriate indexes, and so on.