# Guided Project: Analyzing Startup Fundraising Deals from Crunchbase

In this course, we explored a few different ways to work with larger datasets in pandas. In this guided project, we'll practice using some of the techniques we learned to analyze startup investments from Crunchbase.com.

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

In return, Crunchbase makes the data available through a Web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups crawled the site and released the data online. Because the information on the startups and their fundraising rounds is always changing, the data set we'll be using isn't completely up to date. The data set of investments we'll be exploring is current as of October 2013.

Throughout this guided project, we'll practice working with different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While <code>crunchbase-investments.csv</code> consumes 10.3 megabytes of disk space, we know from earlier missions that pandas often requires 4 to 6 times as much space in memory as the file does on disk (especially when there are many string columns).
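As a rough illustration of why the in-memory size can exceed the on-disk size, the sketch below builds a small, made-up frame of strings (not the Crunchbase data), serializes it to CSV text, and compares the two sizes. The column names and values are hypothetical, and the exact ratio will vary with the data and the pandas version.

```python
import io
import pandas as pd

# Hypothetical sample: repeated short strings stored as Python objects.
sample = pd.DataFrame({
    "city": pd.Series(["San Francisco", "Columbus", "Cleveland"] * 1000, dtype=object),
    "state": pd.Series(["CA", "OH", "OH"] * 1000, dtype=object),
})

# Size of the CSV text, used here as a stand-in for the size on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_bytes = len(buffer.getvalue().encode("utf-8"))

# deep=True counts the Python string objects, not just the pointers to them.
memory_bytes = int(sample.memory_usage(deep=True).sum())

ratio = memory_bytes / csv_bytes
print(f"CSV: {csv_bytes} bytes, in memory: {memory_bytes} bytes, ratio: {ratio:.1f}x")
```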

In [1]:
import pandas as pd
import sqlite3

Because the data set contains over 50,000 rows, you'll need to read the data set into dataframes using 5,000-row chunks to ensure that each chunk consumes much less than 10 megabytes of memory.
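The chunked reading pattern described above can be sketched on a tiny in-memory CSV. The data and the chunk size of 5 are made up for illustration; with the real file, the file name and `chunksize=5000` would take their place.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file: 12 rows, two columns.
csv_text = "id,amount\n" + "\n".join(f"{i},{i * 1000}" for i in range(12))

# Passing chunksize makes read_csv return an iterator of DataFrames.
chunk_iter = pd.read_csv(io.StringIO(csv_text), chunksize=5)

chunk_lengths = []
for chunk in chunk_iter:
    # Only one chunk is held in memory per iteration.
    chunk_lengths.append(len(chunk))

print(chunk_lengths)  # → [5, 5, 2]
```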

## Data Exploration

Across all chunks, become familiar with:
- Each column's missing value counts
- Each column's memory footprint
- Total memory footprint of all of the chunks combined
- Which columns can be dropped because they aren't useful for analysis
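One way to collect the per-column figures in the list above is to keep running totals while iterating over the chunks. The following is a minimal sketch on made-up data; the column names and the chunk size are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical stand-in data; every other row has an empty region field.
csv_text = "name,region,amount\n" + "\n".join(
    f"company{i},{'SF Bay' if i % 2 else ''},{i * 1000}" for i in range(10)
)

missing_total = None
memory_total = None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    missing = chunk.isnull().sum()
    # index=False leaves out the row index so only the columns are counted.
    memory = chunk.memory_usage(deep=True, index=False)
    missing_total = missing if missing_total is None else missing_total + missing
    memory_total = memory if memory_total is None else memory_total + memory

print(missing_total)
print(memory_total / 1048576)  # bytes converted to megabytes
```

Adding the per-chunk Series keeps the result indexed by column name, so no `groupby` step is needed afterwards.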

In [2]:
missing_values = []
memory_chunk_MB = []

ci_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='latin-1')
for chunk in ci_iter:
    
    # Missing Value counts
    missing_values.append(chunk.isnull().sum())
    
    # Each chunk Memory Footprint Over All Chunks
    memory_chunk_MB.append(round(chunk.memory_usage(deep=True).sum() / 1048576,2)) # Memory in MB
    
# Missing Values Information
missing_values = pd.concat(missing_values)
missing_totals = missing_values.groupby(missing_values.index).sum()
print('Number of Missing Values In Each Category')
print(missing_totals)

# Check Whether Every Column Has At Least One Missing Value
if (missing_totals > 0).all():
    print('\nAll Columns Have Missing Values')

# Memory Chunks MB
print('\nMB Memory in Each Chunk')
print(memory_chunk_MB)

# Memory Total MB
print('\nMB Memory Total')
print(round(sum(memory_chunk_MB),2))

Number of Missing Values In Each Category
company_category_code       643
company_city                533
company_country_code          1
company_name                  1
company_permalink             1
company_region                1
company_state_code          492
funded_at                     3
funded_month                  3
funded_quarter                3
funded_year                   3
funding_round_type            3
investor_category_code    50427
investor_city             12480
investor_country_code     12001
investor_name                 2
investor_permalink            2
investor_region               2
investor_state_code       16809
raised_amount_usd          3599
dtype: int64

All Columns Have Missing Values

MB Memory in Each Chunk
[5.58, 5.53, 5.54, 5.53, 5.52, 5.55, 5.53, 5.51, 5.4, 4.64, 2.66]

MB Memory Total
56.99




In [3]:
# Data Types of Read In Values Unoptimized
head = pd.read_csv('crunchbase-investments.csv',encoding='latin-1').head(5)
print(head.dtypes)
head

company_permalink          object
company_name               object
company_category_code      object
company_country_code       object
company_state_code         object
company_region             object
company_city               object
investor_permalink         object
investor_name              object
investor_category_code     object
investor_country_code      object
investor_state_code        object
investor_region            object
investor_city              object
funding_round_type         object
funded_at                  object
funded_month               object
funded_quarter             object
funded_year               float64
raised_amount_usd         float64
dtype: object


|  | company_permalink | company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_permalink | investor_name | investor_category_code | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | funded_month | funded_quarter | funded_year | raised_amount_usd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /company/advercar | AdverCar | advertising | USA | CA | SF Bay | San Francisco | /company/1-800-flowers-com | 1-800-FLOWERS.COM |  | USA | NY | New York | New York | series-a | 2012-10-30 | 2012-10 | 2012-Q4 | 2012.0 | 2000000.0 |
| 1 | /company/launchgram | LaunchGram | news | USA | CA | SF Bay | Mountain View | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | other | 2012-01-23 | 2012-01 | 2012-Q1 | 2012.0 | 20000.0 |
| 2 | /company/utap | uTaP | messaging | USA |  | United States - Other |  | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | other | 2012-01-01 | 2012-01 | 2012-Q1 | 2012.0 | 20000.0 |
| 3 | /company/zoopshop | ZoopShop | software | USA | OH | Columbus | columbus | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | angel | 2012-02-15 | 2012-02 | 2012-Q1 | 2012.0 | 20000.0 |
| 4 | /company/efuneral | eFuneral | web | USA | OH | Cleveland | Cleveland | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | other | 2011-09-08 | 2011-09 | 2011-Q3 | 2011.0 | 20000.0 |


Looking at the data, we can see that the number of missing values varies widely across columns. Some have just a few missing values, while others have more than 10,000. Most of the columns are read in as object types.
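Object columns with only a few distinct values are often good candidates for the `category` dtype. As a small sketch of the effect, the example below compares the two representations for a made-up column; the values are hypothetical and only loosely modeled on `funding_round_type`.

```python
import pandas as pd

# Hypothetical low-cardinality column with 10,000 values and 4 distinct labels.
rounds = pd.Series(["series-a", "angel", "other", "series-b"] * 2500, dtype=object)

# deep=True includes the memory used by the string objects themselves.
object_bytes = rounds.memory_usage(deep=True)
category_bytes = rounds.astype("category").memory_usage(deep=True)

print(f"object: {object_bytes} bytes, category: {category_bytes} bytes")
```

A category column stores each distinct label once plus a small integer code per row, which is why the saving grows with the number of repeated values.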

## Optimizing Data Types in the DataFrame

In [4]:
# Set up initial dict with all column names and default None
col_dtype_dict = {}
# nrows=0 reads only the header row, so the full file is never loaded
for x in pd.read_csv('crunchbase-investments.csv',encoding='latin-1', nrows=0).columns.tolist():
    col_dtype_dict[x] = None

# Remove Columns That We Do Not Want
remove_col = ['company_permalink','company_city','investor_permalink','investor_category_code','investor_city',
              'investor_country_code','investor_state_code','funded_month','funded_quarter']
for key in remove_col:
    col_dtype_dict.pop(key, None)

# Remove datetime values since they will be set with parse_dates
for key in ['funded_at','funded_year']:
    col_dtype_dict.pop(key, None)

# Set dtypes
col_dtype_dict['company_name'] = 'object'
col_dtype_dict['company_category_code'] = 'category'
col_dtype_dict['company_country_code'] = 'category'
col_dtype_dict['company_state_code'] = 'category'
col_dtype_dict['company_region'] = 'category'
col_dtype_dict['investor_name'] = 'object'
col_dtype_dict['investor_region'] = 'category'
col_dtype_dict['funding_round_type'] = 'category'
col_dtype_dict['raised_amount_usd'] = 'float'

# Final Col Types Dict
col_dtype_dict

{'company_name': 'object',
 'company_category_code': 'category',
 'company_country_code': 'category',
 'company_state_code': 'category',
 'company_region': 'category',
 'investor_name': 'object',
 'investor_region': 'category',
 'funding_round_type': 'category',
 'raised_amount_usd': 'float'}

In [5]:
datetime_col = ['funded_at','funded_year']
cols = list(col_dtype_dict.keys()) + datetime_col
chunksize = 5000

memory_footprints = [['type_df','chunksize','column_types','memory_chunk_max_MB','total_memory_MB']]

# Optimized Chunks
# Create Chunk Iter
chunk_iter = pd.read_csv('crunchbase-investments.csv',
                         encoding='latin-1',
                         chunksize=chunksize,
                         dtype=col_dtype_dict,
                         parse_dates=datetime_col,
                         usecols=cols)
memory_footprint_total_MB = 0
memory_footprint_chunk = []
# Loop through chunks
for chunk in chunk_iter:
    memory_chunk = chunk.memory_usage(deep=True).sum() / 1048576 # Memory in MB
    memory_footprint_total_MB += memory_chunk
    memory_footprint_chunk.append(memory_chunk)
    chunk_types = chunk.dtypes.tolist()
# Add Info
memory_footprints.append(['Optimized',chunksize,chunk_types,max(memory_footprint_chunk),memory_footprint_total_MB])

# Create Stats Data Frame
stats = pd.DataFrame(memory_footprints[1:],columns=memory_footprints[0])
stats

|  | type_df | chunksize | column_types | memory_chunk_max_MB | total_memory_MB |
|---|---|---|---|---|---|
| 0 | Optimized | 5000 | [object, category, category, category, categor... | 0.891746 | 9.199158 |


## SQLite Database Creation and Exploration

Next, we'll append each optimized chunk to a table in a SQLite database.
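The chunk-by-chunk loading step can be sketched against an in-memory database. The table name `investments_demo` and the sample rows are made up for illustration and are not part of the project files.

```python
import io
import sqlite3
import pandas as pd

# Hypothetical stand-in data: 12 rows with two columns.
csv_text = "company_name,raised_amount_usd\n" + "\n".join(
    f"company{i},{i * 1000.0}" for i in range(12)
)

# ':memory:' creates a temporary database that never touches disk.
conn = sqlite3.connect(":memory:")

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    # if_exists='append' adds each chunk to the same table instead of replacing it.
    chunk.to_sql("investments_demo", conn, if_exists="append", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM investments_demo;").fetchone()[0]
conn.close()
print(row_count)  # → 12
```

Because `if_exists='append'` keeps adding rows, rerunning the load against a database on disk would duplicate the data unless the table is dropped first.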

In [6]:
# Code To Reset Table
# conn = sqlite3.connect('crunchbase-investments.db')
# conn.execute('DROP TABLE investments;')
# conn.close()

In [7]:
# Code For Inserting Chunks Into Database
# conn = sqlite3.connect('crunchbase-investments.db')

# chunk_iter = pd.read_csv('crunchbase-investments.csv',
#                          encoding='latin-1',
#                          chunksize=chunksize,
#                          dtype=col_dtype_dict,
#                          parse_dates=datetime_col,
#                          usecols=cols)
# for chunk in chunk_iter:
#     chunk.to_sql('investments', conn, if_exists='append', index=False)

In [8]:
# See SQLite Table
conn = sqlite3.connect('crunchbase-investments.db')
cur = conn.cursor()
cur.execute('PRAGMA TABLE_INFO(investments);')
table_info = cur.fetchall()
conn.close()

for row in table_info:
    print(row)

(0, 'company_name', 'TEXT', 0, None, 0)
(1, 'company_category_code', 'TEXT', 0, None, 0)
(2, 'company_country_code', 'TEXT', 0, None, 0)
(3, 'company_state_code', 'TEXT', 0, None, 0)
(4, 'company_region', 'TEXT', 0, None, 0)
(5, 'investor_name', 'TEXT', 0, None, 0)
(6, 'investor_region', 'TEXT', 0, None, 0)
(7, 'funding_round_type', 'TEXT', 0, None, 0)
(8, 'funded_at', 'TIMESTAMP', 0, None, 0)
(9, 'funded_year', 'TIMESTAMP', 0, None, 0)
(10, 'raised_amount_usd', 'REAL', 0, None, 0)


## Answering Some Questions About the Data
Use the pandas SQLite workflow to answer the following questions:

- What proportion of the total amount of funds did the top 10% raise? What about the top 1%? Compare these values to the proportions the bottom 10% and bottom 1% raised.
- Which category of company attracted the most investments?
- Which investor contributed the most money (across all startups)?
- Which investors contributed the most money per startup?
- Which funding round was the most popular? Which was the least popular?

### Question 1

In [9]:
conn = sqlite3.connect('crunchbase-investments.db')
query = """
        SELECT company_name AS company,
               raised_amount_usd AS usd
        FROM investments AS i
        GROUP BY company_name
        ORDER BY usd DESC;
        """
company_raised_funds = pd.read_sql(query,conn).dropna()

# Subset of company_raised_funds
# Use the row count instead of a hard-coded number of companies
n_companies = len(company_raised_funds)
company_raised_funds_top_ten_per = company_raised_funds.loc[0:int(.1*n_companies)]
company_raised_funds_top_one_per = company_raised_funds.loc[0:int(.01*n_companies)]
company_raised_funds_bottom_ten_per = company_raised_funds.loc[int(.9*n_companies):]
company_raised_funds_bottom_one_per = company_raised_funds.loc[int(.99*n_companies):]

# Output of Questions
questions = ['Proportion of Total Funds Raised By Top 10% of Companies:',
            'Proportion of Total Funds Raised By Top 1% of Companies:',
            'Proportion of Total Funds Raised By Bottom 10% of Companies:',
            'Proportion of Total Funds Raised by Bottom 1% of Companies:']

results = [company_raised_funds_top_ten_per['usd'].sum() / company_raised_funds['usd'].sum(),
           company_raised_funds_top_one_per['usd'].sum() / company_raised_funds['usd'].sum(),
           company_raised_funds_bottom_ten_per['usd'].sum() / company_raised_funds['usd'].sum(),
           company_raised_funds_bottom_one_per['usd'].sum() / company_raised_funds['usd'].sum()]


print("What proportion of the total amount of funds did the top 10% raise? What about the top 1%? \
Compare these values to the proportions the bottom 10% and bottom 1% raised.\n")
print("Total Funds Raised All {} Companies: ${} Billion USD".format(len(company_raised_funds),
                                                               round(company_raised_funds['usd'].sum() / 1E9, 2)))
for i, question in enumerate(questions):
    print("{} {}%".format(question,round(results[i] * 100,2)))

What proportion of the total amount of funds did the top 10% raise? What about the top 1%? Compare these values to the proportions the bottom 10% and bottom 1% raised.

Total Funds Raised All 10078 Companies: $115.19 Billion USD
Proportion of Total Funds Raised By Top 10% of Companies: 54.15%
Proportion of Total Funds Raised By Top 1% of Companies: 20.69%
Proportion of Total Funds Raised By Bottom 10% of Companies: 0.07%
Proportion of Total Funds Raised by Bottom 1% of Companies: 0.0%
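The Question 1 query selects `raised_amount_usd` without an aggregate function, so SQLite returns the amount from a single row of each company's group rather than a total. If per-company totals are what is intended, one possible variant sums the amounts first. The sketch below uses a tiny made-up table named `investments_demo` and a 25% cutoff purely for illustration; it is not a rerun of the analysis above.

```python
import sqlite3
import pandas as pd

# Hypothetical rows: company "a" has two rounds, the others have one each.
rows = pd.DataFrame({
    "company_name": ["a", "a", "b", "c", "d"],
    "raised_amount_usd": [60.0, 20.0, 10.0, 6.0, 4.0],
})
conn = sqlite3.connect(":memory:")
rows.to_sql("investments_demo", conn, index=False)

query = """
        SELECT company_name AS company,
               SUM(raised_amount_usd) AS usd
        FROM investments_demo
        GROUP BY company_name
        ORDER BY usd DESC;
        """
totals = pd.read_sql(query, conn).dropna().reset_index(drop=True)
conn.close()

# Positional slicing: the top quarter of companies, at least one.
top_n = max(1, int(0.25 * len(totals)))
top_share = totals["usd"].iloc[:top_n].sum() / totals["usd"].sum()
print(round(top_share, 2))  # → 0.8
```

Slicing with `iloc` after `reset_index` selects by position, which avoids surprises when `dropna` has left gaps in the index labels.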


### Question 2

In [10]:
conn = sqlite3.connect('crunchbase-investments.db')
query = """
        SELECT company_category_code as category,
               SUM(raised_amount_usd) AS usd
        FROM investments AS i
        GROUP BY category
        ORDER BY usd DESC;
        """
category_funds = pd.read_sql(query,conn).dropna()
category_funds['percent_total_funds'] = round(((category_funds['usd'] / category_funds['usd'].sum())*100),2)

#Output of Question
print("Which category of company attracted the most investments?")
category_funds.head(20)

Which category of company attracted the most investments?


|  | category | usd | percent_total_funds |
|---|---|---|---|
| 0 | biotech | 110396400000.0 | 16.33 |
| 1 | software | 73084520000.0 | 10.81 |
| 2 | mobile | 64777380000.0 | 9.58 |
| 3 | cleantech | 52705230000.0 | 7.8 |
| 4 | enterprise | 45860930000.0 | 6.78 |
| 5 | web | 40143260000.0 | 5.94 |
| 6 | medical | 25367110000.0 | 3.75 |
| 7 | advertising | 25076660000.0 | 3.71 |
| 8 | ecommerce | 22567220000.0 | 3.34 |
| 9 | network_hosting | 22419680000.0 | 3.32 |


### Question 3

In [11]:
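conn = sqlite3.connect('crunchbase-investments.db')
# Note: raised_amount_usd is selected without SUM(), so SQLite returns the
# amount from a single row per investor rather than a total across startups.
query = """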
        SELECT investor_name as investor,
               raised_amount_usd AS usd
        FROM investments AS i
        GROUP BY investor
        ORDER BY usd DESC;
        """
investor = pd.read_sql(query,conn).dropna()
investor['percent'] = round((investor['usd'] / investor['usd'].sum()*100),2)

print("Which investor contributed the most money (across all startups)?")
investor.head(10)

Which investor contributed the most money (across all startups)?


|  | investor | usd | percent |
|---|---|---|---|
| 0 | Marlin Equity Partners | 2600000000.0 | 2.51 |
| 1 | BrightHouse | 1500000000.0 | 1.45 |
| 2 | GI Partners | 1050000000.0 | 1.01 |
| 3 | Sprint Nextel | 920000000.0 | 0.89 |
| 4 | Siemens PLM Software | 750000000.0 | 0.72 |
| 5 | U.S. Department of Energy | 465000000.0 | 0.45 |
| 6 | Laurel Crown Partners | 450000000.0 | 0.43 |
| 7 | Iconiq Capital | 450000000.0 | 0.43 |
| 8 | Madison Dearborn Partners | 360100000.0 | 0.35 |
| 9 | Omniscient Venture Partners | 319000000.0 | 0.31 |


### Question 4

In [12]:
conn = sqlite3.connect('crunchbase-investments.db')
query = """
        SELECT i.company_name as company,
               i.company_category_code as category,
               i.investor_name as investor,
               SUM(i.raised_amount_usd) AS usd,
               c.usd_company_total,
               SUM(i.raised_amount_usd) / c.usd_company_total AS percent_company_total
        FROM investments AS i
        JOIN (
                SELECT 
                    company_name,
                    SUM(raised_amount_usd) AS usd_company_total
                FROM investments as i
                GROUP BY company_name
              ) AS c ON company = c.company_name
        GROUP BY company, investor
        ORDER BY percent_company_total DESC;
        """
company_investor_percent_raised = pd.read_sql(query,conn).dropna()
print("Which investors contributed the most money per startup?")
company_investor_percent_raised[company_investor_percent_raised['percent_company_total'] == 1]

Which investors contributed the most money per startup?


|  | company | category | investor | usd | usd_company_total | percent_company_total |
|---|---|---|---|---|---|---|
| 0 | 0xdata | analytics | Nexus Venture Partners | 1700000.0 | 1700000.0 | 1.0 |
| 1 | 1010data | software | Norwest Venture Partners | 35000000.0 | 35000000.0 | 1.0 |
| 2 | 11i Solutions | enterprise | Steel Pier Capital Advisors | 1800000.0 | 1800000.0 | 1.0 |
| 3 | 170 Systems | software | Polaris Venture Partners | 14000000.0 | 14000000.0 | 1.0 |
| 4 | 1World Online | enterprise | Alex Fedosseev | 1000000.0 | 1000000.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 3099 | walkby | ecommerce | Lightbank | 650000.0 | 650000.0 | 1.0 |
| 3100 | whereIstand.com | web | Chuck Zegar | 300000.0 | 300000.0 | 1.0 |
| 3101 | wmbly | web | Anonymous Angel | 20000.0 | 20000.0 | 1.0 |
| 3102 | y prime | health | Ballast Point Ventures | 5000000.0 | 5000000.0 | 1.0 |


Based on the above results, there are 3,103 company-investor pairs in which a single investor accounts for 100 percent of the total funds raised by the company.

### Question 5

In [13]:
conn = sqlite3.connect('crunchbase-investments.db')
query = """
        SELECT funding_round_type,
               SUM(raised_amount_usd) AS usd
        FROM investments AS i
        GROUP BY funding_round_type
        ORDER BY usd DESC;
        """
funding = pd.read_sql(query,conn).dropna()
funding['percent_total_funds'] = round(((funding['usd'] / funding['usd'].sum())*100),2)
print("Which funding round was the most popular? Which was the least popular?")
funding

Which funding round was the most popular? Which was the least popular?


|  | funding_round_type | usd | percent_total_funds |
|---|---|---|---|
| 0 | series-c+ | 265753500000.0 | 38.98 |
| 1 | venture | 130556500000.0 | 19.15 |
| 2 | series-b | 128326800000.0 | 18.82 |
| 3 | series-a | 86542150000.0 | 12.69 |
| 4 | post-ipo | 30917600000.0 | 4.54 |
| 5 | other | 18507260000.0 | 2.71 |
| 6 | private-equity | 16159880000.0 | 2.37 |
| 7 | angel | 4962075000.0 | 0.73 |
| 8 | crowdfunding | 6491500.0 | 0.0 |
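The table above ranks funding rounds by dollars raised. If "popular" is instead read as the number of recorded investments, a count-based variant might look like the sketch below. The table name `investments_demo` and its rows are made up for illustration, so the ordering shown here says nothing about the real data.

```python
import sqlite3
import pandas as pd

# Hypothetical rows: many small angel deals, fewer but larger later rounds.
rows = pd.DataFrame({
    "funding_round_type": ["angel", "angel", "angel", "series-a", "series-a", "series-b"],
    "raised_amount_usd": [1.0, 2.0, 1.0, 50.0, 40.0, 90.0],
})
conn = sqlite3.connect(":memory:")
rows.to_sql("investments_demo", conn, index=False)

query = """
        SELECT funding_round_type,
               COUNT(*) AS deals,
               SUM(raised_amount_usd) AS usd
        FROM investments_demo
        GROUP BY funding_round_type
        ORDER BY deals DESC;
        """
by_count = pd.read_sql(query, conn)
conn.close()
print(by_count)
```

In this toy table, angel rounds lead by count even though they raise the least money, which shows how the two definitions of popularity can disagree.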
