# Analyzing Startup Fundraising Deals from Crunchbase

In this project, we'll use techniques for working with large datasets in pandas to analyze startup investment data from [crunchbase.com](https://www.crunchbase.com). Crunchbase crowdsources information about each time a startup raises money and makes the data available through its API.

For this project, we'll use data from October 2013 that's been made available on [GitHub](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv). The dataset contains more than 50,000 rows, so we'll read it into dataframes in 5,000-row chunks to make sure that each chunk uses less than 10 MB of memory.
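As a rough illustration of the chunking idea, the sketch below reads a small in-memory CSV in chunks and measures each chunk's footprint. The CSV text and the `chunksize=5` value are made up so the example runs without the real file; the project itself uses `crunchbase-investments.csv` with 5,000-row chunks.

```python
import io
import pandas as pd

# Hypothetical stand-in for crunchbase-investments.csv (12 made-up rows).
csv_text = "company_name,raised_amount_usd\n" + "\n".join(
    f"startup_{i},{i * 1000}" for i in range(12)
)

# chunksize=5 is only for this demo; the project uses chunksize=5000.
chunk_iter = pd.read_csv(io.StringIO(csv_text), chunksize=5)

chunk_sizes_mb = []
for chunk in chunk_iter:
    # deep=True counts the bytes actually held by string columns.
    size_mb = chunk.memory_usage(deep=True).sum() / (1024 * 1024)
    chunk_sizes_mb.append(size_mb)

print(len(chunk_sizes_mb), "chunks; largest chunk:", max(chunk_sizes_mb), "MB")
```

Running the same loop over the real file would show whether every 5,000-row chunk stays under the 10 MB budget.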

## Introduction to the Data

Let's read in the data and take a look at what we're working with.

In [3]:
import pandas as pd
pd.options.display.max_columns = 99

chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

# Preview only the first chunk so the whole file never has to sit in memory at once
preview = next(chunk_iter)
preview.head()

Unnamed: 0,company_permalink,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_permalink,investor_name,investor_category_code,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
0,/company/advercar,AdverCar,advertising,USA,CA,SF Bay,San Francisco,/company/1-800-flowers-com,1-800-FLOWERS.COM,,USA,NY,New York,New York,series-a,2012-10-30,2012-10,2012-Q4,2012.0,2000000.0
1,/company/launchgram,LaunchGram,news,USA,CA,SF Bay,Mountain View,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-01-23,2012-01,2012-Q1,2012.0,20000.0
2,/company/utap,uTaP,messaging,USA,,United States - Other,,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-01-01,2012-01,2012-Q1,2012.0,20000.0
3,/company/zoopshop,ZoopShop,software,USA,OH,Columbus,columbus,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,angel,2012-02-15,2012-02,2012-Q1,2012.0,20000.0
4,/company/efuneral,eFuneral,web,USA,OH,Cleveland,Cleveland,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2011-09-08,2011-09,2011-Q3,2011.0,20000.0


## Check for Missing Values

In [9]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

missing_values = []
for chunk in chunk_iter:
    missing_values.append(chunk.isnull().sum())
    
combined_missing_values = pd.concat(missing_values)
unique_missing_values = combined_missing_values.groupby(combined_missing_values.index).sum()
unique_missing_values.sort_values(ascending=False)

investor_category_code    50427
investor_state_code       16809
investor_city             12480
investor_country_code     12001
raised_amount_usd          3599
company_category_code       643
company_city                533
company_state_code          492
funding_round_type            3
funded_year                   3
funded_month                  3
funded_at                     3
funded_quarter                3
investor_name                 2
investor_permalink            2
investor_region               2
company_region                1
company_permalink             1
company_name                  1
company_country_code          1
dtype: int64
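Raw counts are easier to judge relative to the number of rows. The following sketch, which uses a small made-up CSV rather than the real file, accumulates missing counts chunk by chunk and converts them to percentages; the column names are borrowed from the dataset, but the values are invented.

```python
import io
import pandas as pd

# Made-up rows for illustration; blank fields are read as NaN.
csv_text = """company_name,investor_city,raised_amount_usd
A,,100
B,Columbus,
C,,300
D,New York,400
"""

chunk_iter = pd.read_csv(io.StringIO(csv_text), chunksize=2)

total_rows = 0
missing_counts = None
for chunk in chunk_iter:
    total_rows += len(chunk)
    counts = chunk.isnull().sum()
    if missing_counts is None:
        missing_counts = counts
    else:
        # Add column by column; fill_value guards against a column absent in one chunk.
        missing_counts = missing_counts.add(counts, fill_value=0)

missing_pct = (missing_counts / total_rows * 100).sort_values(ascending=False)
print(missing_pct)
```

Only one small Series per chunk is kept around, so the approach uses very little memory however many chunks there are.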

## Check Memory Footprint

In [7]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

counter = 0
memory_footprint = pd.Series(dtype='float64')

for chunk in chunk_iter:
    if counter == 0:
        memory_footprint = chunk.memory_usage(deep=True)
    else:
        memory_footprint += chunk.memory_usage(deep=True)
    counter += 1
    
memory_footprint = memory_footprint.drop('Index') # Dropping calculation for the index
memory_footprint

company_permalink         4057788
company_name              3591326
company_category_code     3421104
company_country_code      3172176
company_state_code        3106051
company_region            3411545
company_city              3505886
investor_permalink        4980548
investor_name             3915666
investor_category_code     622424
investor_country_code     2647292
investor_state_code       2476607
investor_region           3396281
investor_city             2885083
funding_round_type        3410707
funded_at                 3542185
funded_month              3383584
funded_quarter            3383584
funded_year                422960
raised_amount_usd          422960
dtype: int64

In [12]:
tot_mem_fp = memory_footprint.sum() / (1024 * 1024)
print('Total memory footprint:', tot_mem_fp, 'MB')

Total memory footprint: 56.98753070831299 MB


Let's select the columns to drop because they won't be useful for analysis. We'll start with the columns that have too many missing values and those that contain URLs.

In [15]:
drop_columns = ['investor_permalink', 'company_permalink', 'investor_category_code']
# `chunk` is the last chunk from the loop above; every chunk shares the same columns
use_columns = chunk.columns.drop(drop_columns)
use_columns.tolist()

['company_name',
 'company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'funded_month',
 'funded_quarter',
 'funded_year',
 'raised_amount_usd']
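One way to apply the column selection is to pass it to `pd.read_csv` through the `usecols` parameter, so the dropped columns are never loaded. This is a minimal sketch on a made-up CSV with only a handful of the real column names; reading just the header with `nrows=0` is one possible way to obtain the full column list.

```python
import io
import pandas as pd

# Made-up rows; only the column names mirror the real dataset.
csv_text = """company_permalink,company_name,investor_permalink,investor_category_code,raised_amount_usd
/company/a,A,/company/x,,100
/company/b,B,/company/y,finance,200
/company/c,C,/company/z,,300
"""

drop_columns = ['investor_permalink', 'company_permalink', 'investor_category_code']

# nrows=0 parses only the header row, which is enough to learn the column names.
all_columns = pd.read_csv(io.StringIO(csv_text), nrows=0).columns
use_columns = all_columns.drop(drop_columns).tolist()

# usecols tells the parser to skip the unwanted columns in every chunk.
chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=2, usecols=use_columns))
print(chunks[0].columns.tolist())
```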

## Selecting Data Types

Next, we'll identify which columns we can represent using more space-efficient data types.

In [17]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

column_types = {}
for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in column_types:
            column_types[col] = [str(chunk.dtypes[col])]
        else:
            column_types[col].append(str(chunk.dtypes[col]))
            
# Collapse each column's per-chunk dtypes into the set of distinct dtypes seen
column_types_unique = {col: set(types) for col, types in column_types.items()}

column_types_unique

{'company_permalink': {'object'},
 'company_name': {'object'},
 'company_category_code': {'object'},
 'company_country_code': {'object'},
 'company_state_code': {'object'},
 'company_region': {'object'},
 'company_city': {'object'},
 'investor_permalink': {'object'},
 'investor_name': {'object'},
 'investor_category_code': {'float64', 'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_state_code': {'float64', 'object'},
 'investor_region': {'object'},
 'investor_city': {'float64', 'object'},
 'funding_round_type': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'raised_amount_usd': {'float64'}}

Next steps:

* Identify the numeric columns we can represent using more space-efficient types.
* For text columns:
    * Analyze the unique value counts across all of the chunks to see if we can convert them to a numeric type.
    * See if we can clean any text columns and separate them into multiple numeric columns without adding any overhead when querying.
* Update the code from the last step so that the overall memory the data consumes stays under 10 megabytes.
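The steps above might look something like the following sketch. It uses five rows taken from the preview at the top of the notebook, and the `optimize` helper and its dtype choices are illustrative assumptions rather than the only reasonable ones. Note that columns containing missing values (such as `funded_year` in some chunks of the real file) cannot be downcast to an integer type this way and stay as floats.

```python
import io
import pandas as pd

# Five rows from the notebook's preview, restricted to a few columns.
csv_text = """company_name,funding_round_type,funded_at,funded_year,raised_amount_usd
AdverCar,series-a,2012-10-30,2012,2000000.0
LaunchGram,other,2012-01-23,2012,20000.0
uTaP,other,2012-01-01,2012,20000.0
ZoopShop,angel,2012-02-15,2012,20000.0
eFuneral,other,2011-09-08,2011,20000.0
"""

def optimize(chunk):
    """Return a copy of the chunk with smaller dtypes (hypothetical helper)."""
    out = chunk.copy()
    # Few distinct values relative to the row count: store as category codes.
    out['funding_round_type'] = out['funding_round_type'].astype('category')
    # Store dates as datetime64 values instead of strings.
    out['funded_at'] = pd.to_datetime(out['funded_at'])
    # Downcast numerics to the smallest type that still holds the values.
    out['funded_year'] = pd.to_numeric(out['funded_year'], downcast='integer')
    out['raised_amount_usd'] = pd.to_numeric(out['raised_amount_usd'], downcast='float')
    return out

unique_round_types = set()
optimized_chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    # Track distinct values across chunks to judge whether 'category' is worthwhile.
    unique_round_types.update(chunk['funding_round_type'].dropna().unique())
    optimized_chunks.append(optimize(chunk))

print(sorted(unique_round_types))
print(optimized_chunks[0].dtypes)
```

Because each chunk builds its own set of categories, concatenating chunks later can fall back to the `object` dtype unless the categories are aligned first.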

## Loading Chunks Into SQLite

Now we're ready to start analyzing the data. The next step is to load each chunk into a table in a SQLite database so that we can query the full dataset.

In [18]:
import sqlite3
conn = sqlite3.connect('crunchbase.db')

chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

for chunk in chunk_iter:
    chunk.to_sql('investments', conn, if_exists='append', index=False)
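To check that the chunks were appended as expected, we can query the table back through pandas. The sketch below does this against an in-memory database (`':memory:'`) with a few made-up rows, so it doesn't touch `crunchbase.db`; the same two queries should work on the real connection.

```python
import io
import sqlite3
import pandas as pd

# Made-up rows; the real project reads crunchbase-investments.csv instead.
csv_text = """company_name,funding_round_type,raised_amount_usd
A,angel,100.0
B,series-a,200.0
C,other,300.0
D,angel,400.0
E,series-b,500.0
"""

conn = sqlite3.connect(':memory:')  # temporary database for the demo

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # The first chunk creates the table; later chunks are appended to it.
    chunk.to_sql('investments', conn, if_exists='append', index=False)

# Confirm that every row arrived and inspect the column types SQLite chose.
row_count = pd.read_sql('SELECT COUNT(*) AS n FROM investments;', conn)['n'].iloc[0]
schema = pd.read_sql('PRAGMA table_info(investments);', conn)
print(row_count)
print(schema[['name', 'type']])

conn.close()
```

Since `if_exists='append'` adds to whatever is already in the table, rerunning the loading cell against an existing `crunchbase.db` would insert every row a second time.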

## Analyzing the Startup Investments

Now we can use the pandas & SQLite workflow to analyze and explore the startup investments. Each row in our dataset represents a unique investment from a single investor, so some startups will span multiple rows.

We'll look to answer the following questions:
* What proportion of the total amount of funds did the top 10% of startups (by amount raised) raise? What about the top 1%? Compare these values to the proportions the bottom 10% and bottom 1% raised.
* Which category of company attracted the most investments?
* Which investor contributed the most money (across all startups)?
* Which investors contributed the most money per startup?
* Which funding round was the most popular? Which was the least popular?
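As a starting point, a few of these questions can be phrased as `GROUP BY` queries. The sketch below runs them against a toy table with invented rows, so the results say nothing about the real data. It also assumes that `raised_amount_usd` can simply be summed per row, which may overstate totals if the same round amount is repeated for each participating investor.

```python
import sqlite3
import pandas as pd

# Toy rows invented for illustration; column names follow the real dataset.
toy = pd.DataFrame({
    'company_name': ['Alpha', 'Alpha', 'Beta', 'Gamma', 'Delta'],
    'company_category_code': ['software', 'software', 'web', 'software', 'web'],
    'investor_name': ['Fund X', 'Fund Y', 'Fund X', 'Fund Z', 'Fund X'],
    'funding_round_type': ['series-a', 'series-a', 'angel', 'series-b', 'series-a'],
    'raised_amount_usd': [100.0, 300.0, 50.0, 500.0, 50.0],
})

conn = sqlite3.connect(':memory:')
toy.to_sql('investments', conn, index=False)

# Which category of company attracted the most investments (rows)?
top_category = pd.read_sql("""
    SELECT company_category_code, COUNT(*) AS n_investments
    FROM investments
    GROUP BY company_category_code
    ORDER BY n_investments DESC
    LIMIT 1;
""", conn)

# Which investor contributed the most money across all startups?
top_investor = pd.read_sql("""
    SELECT investor_name, SUM(raised_amount_usd) AS total_usd
    FROM investments
    GROUP BY investor_name
    ORDER BY total_usd DESC
    LIMIT 1;
""", conn)

# Which funding round type was the most popular?
top_round = pd.read_sql("""
    SELECT funding_round_type, COUNT(*) AS n_rounds
    FROM investments
    GROUP BY funding_round_type
    ORDER BY n_rounds DESC
    LIMIT 1;
""", conn)

conn.close()
print(top_category, top_investor, top_round, sep='\n')
```

Letting SQLite do the aggregation means only the small result tables are ever loaded into pandas.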

## Conclusion & Next Steps

In this project we worked with and analyzed a large dataset of fundraising deals in chunks using pandas and SQLite.

Some next steps we might take in this project could be to:

* Repeat the tasks, but under stricter memory constraints, such as 1 MB.
* Clean and analyze some other Crunchbase datasets from the same GitHub repository.
* Look at which columns these datasets share and how they are related.
* Create a relational database that links the datasets together.
* Use pandas to populate each table in the database with the appropriate indexes.

The idea for this project comes from the [DATAQUEST](https://app.dataquest.io/) **Processing Large Datasets In Pandas** course.