# Analyzing Startup Fundraising Deals from Crunchbase
---

In this guided project, we'll practice using some of the techniques we've learned to analyze startup investments from Crunchbase.com.

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

In return, Crunchbase makes the data available through a Web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups crawled the site and released the data online. Because the information on the startups and their fundraising rounds is always changing, the data set we'll be using isn't completely up to date.

The data set of investments we'll be exploring is current as of October 2013. You can download it from [GitHub](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv).

Throughout this guided project, we'll practice working with different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While `crunchbase-investments.csv` consumes 10.3 megabytes of disk space, we know from earlier lessons that pandas often requires 4 to 6 times as much space in memory as the file takes on disk (especially when there are many string columns).
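
As a rough back-of-the-envelope check, the rule of thumb above already puts the full file well past the 10 MB budget. The sketch below only restates that arithmetic; the variable names are illustrative.

```python
# Rough estimate of the in-memory size from the on-disk size.
disk_mb = 10.3                 # size of crunchbase-investments.csv on disk
low_mult, high_mult = 4, 6     # rule-of-thumb multipliers quoted above
budget_mb = 10                 # assumed memory budget for this step

low_est = disk_mb * low_mult
high_est = disk_mb * high_mult

print(f"Estimated footprint: {low_est:.1f} to {high_est:.1f} MB")
print("Fits in budget:", high_est <= budget_mb)
```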

### 1. Introduction
We will start by importing the libraries and the dataset. Because the data set contains over 50,000 rows, we'll read it into dataframes in 5,000-row chunks so that each chunk consumes much less than 10 megabytes of memory.

In [1]:
# Importing the libraries
import pandas as pd
import sqlite3

# Setting pandas display option
pd.options.display.max_columns = 99

# Importing the dataset
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, 
                         encoding='ISO-8859-1')
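
One way to confirm that a given chunk size stays under the budget is to measure every chunk as it is read. The following is a minimal sketch on a small made-up CSV held in memory (the data, the 5-row chunk size, and the names `csv_text` and `chunk_mb` are hypothetical stand-ins, not part of the project data).

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: 20 rows of made-up data
csv_text = "company_name,raised_amount_usd\n" + "".join(
    f"company_{i},{i * 1000}\n" for i in range(20)
)

chunk_mb = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    # deep=True also counts the bytes held by the Python string objects
    chunk_mb.append(chunk.memory_usage(deep=True).sum() / 1024 ** 2)

print(len(chunk_mb), "chunks; the largest uses", max(chunk_mb), "MB")
```

With the real file, the same loop over 5,000-row chunks would show whether the largest chunk stays below 10 MB.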

We will now check the dataset for missing values.

In [2]:
# Creating an empty list
mv_list = []

# Iterating through the chunks
for chunk in chunk_iter:
    mv_list.append(chunk.isnull().sum())

# Combining the values together    
combined_mv_vc = pd.concat(mv_list)

# Grouping the values and sorting the values
unique_combined_mv_vc = combined_mv_vc.groupby(combined_mv_vc.index).sum()
unique_combined_mv_vc.sort_values()

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64
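
The cell above stores one Series per chunk and combines them at the end. An alternative, sketched here on a tiny made-up CSV, is to keep a single running total so that only one Series stays in memory. The data and the name `total_missing` are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical data: column a has 1 missing value, column b has 2
csv_text = "a,b\n1,\n2,\n,x\n4,y\n"

total_missing = None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    counts = chunk.isnull().sum()
    if total_missing is None:
        total_missing = counts
    else:
        # fill_value=0 guards against a label missing from either side
        total_missing = total_missing.add(counts, fill_value=0)

total_missing = total_missing.astype(int)
print(total_missing)
```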

We can see that several columns contain many missing values; `investor_category_code` alone is missing in 50,427 rows. Next, we will measure the memory footprint of each column.

In [3]:
# Importing the dataset in
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

# Iterating over the chunks
counter = 0
series_memory_fp = pd.Series(dtype='float64')
for chunk in chunk_iter:
    if counter == 0:
        series_memory_fp = chunk.memory_usage(deep=True)
    else:
        series_memory_fp += chunk.memory_usage(deep=True)
    counter += 1

# Drop memory footprint calculation for the index
series_memory_fp = series_memory_fp.drop('Index')
series_memory_fp

company_permalink         4057788
company_name              3591326
company_category_code     3421104
company_country_code      3172176
company_state_code        3106051
company_region            3411585
company_city              3505926
investor_permalink        4980548
investor_name             3915666
investor_category_code     622424
investor_country_code     2647292
investor_state_code       2476607
investor_region           3396281
investor_city             2885083
funding_round_type        3410707
funded_at                 3542185
funded_month              3383584
funded_quarter            3383584
funded_year                422960
raised_amount_usd          422960
dtype: int64

We will now combine all the columns and measure the total footprint for the chunks in megabytes.

In [4]:
# Calculating the total footprint
series_memory_fp.sum()/1024**2

56.9876070022583

We can see that the 10.3 MB file takes up about 57 MB once loaded into pandas. We will start by dropping the columns we do not need:
- `investor_permalink` and `company_permalink`, which are just URLs
- `investor_category_code`, which is missing in most rows
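
To put the missing-value counts in proportion, we can divide them by the number of rows. The row count below is inferred rather than stated in the source: `raised_amount_usd` is a `float64` column (8 bytes per value) with a footprint of 422,960 bytes, which implies 52,870 rows. Treat this as a rough sketch under that assumption.

```python
# 8 bytes per float64 value, so bytes / 8 gives the inferred row count
n_rows = 422960 // 8

# Missing-value counts copied from the output above
missing = {
    "investor_category_code": 50427,
    "investor_state_code": 16809,
    "investor_city": 12480,
    "investor_country_code": 12001,
}

for col, n_missing in missing.items():
    print(f"{col}: {n_missing / n_rows:.1%} missing")
```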

In [5]:
# Dropping the columns
keep_cols = chunk.columns.drop(['investor_permalink', 'company_permalink',
                                'investor_category_code'])

# Previewing the columns we kept
keep_cols.to_list()

['company_name',
 'company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'funded_month',
 'funded_quarter',
 'funded_year',
 'raised_amount_usd']

### 2. Selecting Data Types
Next, we will check the data type of every column in each chunk. This tells us which columns are candidates for conversion to a more efficient type.

In [6]:
# Creating an empty dictionary
col_types = {}

# Importing the data
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols)

# Iterating the chunks
for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in col_types:
            col_types[col] = [str(chunk.dtypes[col])]
        else:
            col_types[col].append(str(chunk.dtypes[col]))
            
# Creating a dictionary to store the unique values
uniq_col_types = {}

# Iterating through the first dictionary
for k,v in col_types.items():
    uniq_col_types[k] = set(col_types[k])
    
# Previewing the result
uniq_col_types

{'company_name': {'object'},
 'company_category_code': {'object'},
 'company_country_code': {'object'},
 'company_state_code': {'object'},
 'company_region': {'object'},
 'company_city': {'object'},
 'investor_name': {'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_state_code': {'float64', 'object'},
 'investor_region': {'object'},
 'investor_city': {'float64', 'object'},
 'funding_round_type': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'raised_amount_usd': {'float64'}}
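
Some columns report more than one type because pandas infers types separately for each chunk. A chunk in which a text column is entirely empty is read as `float64` (all `NaN`), and an integer column becomes `float64` in any chunk that contains a missing value. The sketch below reproduces the first effect on a tiny made-up CSV; the data and names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical data: `city` is empty in the last two rows
csv_text = "name,city\nA,Paris\nB,Rome\nC,\nD,\n"

# Record the inferred dtype of `city` in every 2-row chunk
seen = [
    str(chunk["city"].dtype)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]
print(seen)  # the second chunk, which has no city values, is float64
```

Passing an explicit `dtype` to `pd.read_csv` is one way to keep the type consistent across chunks.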

Let's check whether any of the columns with the `object` data type can be converted into categorical data by counting their unique values.

In [7]:
# Importing the data
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols)

# Creating an empty dictionary to hold the value counts from each chunk
uniques = {}

# Iterating through each chunk
for chunk in chunk_iter:
    strings_only = chunk.select_dtypes(include=['object'])
    for c in strings_only.columns:
        val_counts = strings_only[c].value_counts()
        if c in uniques:
            uniques[c].append(val_counts)
        else:
            uniques[c] = [val_counts]

# Creating an empty dictionary to combine it together
uniques_combined = {}

# Iterating through each key in the uniques dictionary
for col in uniques:
    u_concat = pd.concat(uniques[col])
    u_group = u_concat.groupby(u_concat.index).sum()
    uniques_combined[col] = u_group
    
# Printing the unique values for the object columns
for col in uniques_combined:
    print(col)
    print(uniques_combined[col])
    print("-----------")

company_name
#waywire          5
0xdata            1
1-800-DENTIST     2
1000memories     10
100Plus           4
                 ..
yaM Labs          1
ybuy              4
zozi             38
zulily            6
zuuka!            3
Name: company_name, Length: 11573, dtype: int64
-----------
company_category_code
2/7/08                 1
advertising         3200
analytics           1863
automotive           164
biotech             4951
cleantech           1948
consulting           233
design                55
ecommerce           2168
education            783
enterprise          4489
fashion              368
finance              931
games_video         1893
government            10
hardware            1537
health               670
hospitality          331
legal                 87
local                 22
manufacturing        310
medical             1315
messaging            452
mobile              4067
music                287
nanotech             216
network_hosting     1075
news      

We can see that some columns can be stored more efficiently:
- `funded_at`, `funded_month`, and `funded_quarter` hold dates and can be parsed as datetime.
- `funding_round_type`, `investor_state_code`, and `investor_country_code` repeat a small set of values and can be converted to categorical.

We can also remove `funded_month`, `funded_quarter`, and `funded_year`, since all three can be derived from `funded_at`.
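
To illustrate why the categorical type helps, the sketch below compares the footprint of the same values stored as `object` and as `category`. The Series is made up for the illustration (30,000 values drawn from three labels) and is not taken from the data set.

```python
import pandas as pd

# Hypothetical column: 30,000 values drawn from 3 distinct labels
as_object = pd.Series(["series-a", "angel", "venture"] * 10_000, dtype="object")
as_category = as_object.astype("category")

# deep=True includes the bytes used by the string objects themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object:  ", object_bytes, "bytes")
print("category:", category_bytes, "bytes")
```

The categorical version stores each label once and keeps a small integer code per row, so the saving grows with the number of repeated values.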

In [8]:
# Changing the datatype
tar_cat = {
    "funding_round_type": "category", "investor_state_code": "category", 
    "investor_country_code": "category"
}

# Changing into datetime
date = ['funded_at']

# Removing unnecessary columns
keep_cols = chunk.columns.drop(['funded_month', 'funded_quarter',
                                'funded_year'])

# Previewing the columns we kept
keep_cols.to_list()

['company_name',
 'company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'raised_amount_usd']

We'll import the dataset again with these settings and compare the memory usage.

In [9]:
# Importing the data set
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize = 5000, 
                           encoding = 'ISO-8859-1', usecols = keep_cols,
                           parse_dates = date, dtype = tar_cat)

# Creating an empty list
mem_usage = []

# Recording the memory footprint of each chunk in megabytes
for chunk in chunk_iter:
    mem_usage.append(chunk.memory_usage(deep=True).sum() / 1024 ** 2)

# Printing the result
sum(mem_usage)

30.022969245910645

In [10]:
# Checking the data type
chunk.dtypes

company_name                     object
company_category_code            object
company_country_code             object
company_state_code               object
company_region                   object
company_city                     object
investor_name                    object
investor_country_code          category
investor_state_code            category
investor_region                  object
investor_city                   float64
funding_round_type             category
funded_at                datetime64[ns]
raised_amount_usd               float64
dtype: object

We can see that dropping columns and converting data types reduced the footprint by around 27 megabytes, from about 57 MB to about 30 MB.
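
The saving can be worked out directly from the two totals printed earlier. Note that the comparison is approximate: the first total excluded the index while the second includes it, although the index of each chunk is small. The variable names are illustrative.

```python
before_mb = 56.9876070022583     # total measured before any changes
after_mb = 30.022969245910645    # total measured after the changes

saved_mb = before_mb - after_mb
saved_share = saved_mb / before_mb

print(f"Saved {saved_mb:.1f} MB ({saved_share:.0%} of the original footprint)")
```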

### 3. Loading Chunks Into SQLite
We will now export the dataset into SQLite. We will reuse all of the settings from the previous step to save space.

In [11]:
# Establishing the connection
conn = sqlite3.connect('crunchbase.db')
cursor = conn.cursor()
cursor.execute('DROP TABLE IF EXISTS investments;')
               
# Loading the dataset
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize = 5000, 
                           encoding = 'ISO-8859-1', usecols = keep_cols,
                           parse_dates = date, dtype = tar_cat)

# Exporting each chunk into the database
for chunk in chunk_iter:
    chunk.to_sql("investments", conn, if_exists = 'append', index = False)
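
A quick sanity check after a chunked load is to compare the number of rows in the table with the number of rows read. The sketch below does this against an in-memory SQLite database with made-up data; the table name `investments_demo`, the connection `conn_demo`, and the CSV contents are hypothetical.

```python
import io
import sqlite3
import pandas as pd

# Hypothetical stand-in for the real file: 12 rows of made-up data
csv_text = "company_name,raised_amount_usd\n" + "".join(
    f"company_{i},{i * 1000}\n" for i in range(12)
)

conn_demo = sqlite3.connect(":memory:")

# Append each 5-row chunk to the same table
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    chunk.to_sql("investments_demo", conn_demo, if_exists="append", index=False)

row_count = conn_demo.execute(
    "SELECT COUNT(*) FROM investments_demo;"
).fetchone()[0]
print(row_count)

conn_demo.close()
```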

We will now inspect the column names and data types of the `investments` table in SQLite.

In [12]:
# Checking the datatype for the database
results_df = pd.read_sql('PRAGMA table_info(investments);', conn)
results_df

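| | cid | name | type | notnull | dflt_value | pk |
|---|---|---|---|---|---|---|
| 0 | 0 | company_name | TEXT | 0 | | 0 |
| 1 | 1 | company_category_code | TEXT | 0 | | 0 |
| 2 | 2 | company_country_code | TEXT | 0 | | 0 |
| 3 | 3 | company_state_code | TEXT | 0 | | 0 |
| 4 | 4 | company_region | TEXT | 0 | | 0 |
| 5 | 5 | company_city | TEXT | 0 | | 0 |
| 6 | 6 | investor_name | TEXT | 0 | | 0 |
| 7 | 7 | investor_country_code | TEXT | 0 | | 0 |
| 8 | 8 | investor_state_code | TEXT | 0 | | 0 |
| 9 | 9 | investor_region | TEXT | 0 | | 0 |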
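
SQLite has no categorical type, so the columns we read as `category` are declared as `TEXT` in the table, as the output above shows for `investor_country_code` and `investor_state_code`. The sketch below checks the declared types on a small made-up dataframe using `sqlite3` directly; the dataframe, table name, and variable names are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical two-row frame with one column of each kind
demo = pd.DataFrame({
    "round_type": pd.Series(["angel", "series-a"], dtype="category"),
    "funded_at": pd.to_datetime(["2012-01-23", "2012-10-30"]),
    "raised_amount_usd": [20000.0, 2000000.0],
})

conn_demo = sqlite3.connect(":memory:")
demo.to_sql("demo", conn_demo, index=False)

# Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk)
declared = {
    row[1]: row[2]
    for row in conn_demo.execute("PRAGMA table_info(demo);")
}
print(declared)

conn_demo.close()
```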


Finally, we will check the first five rows of our SQLite table.

In [13]:
# Checking the result from the database
test = pd.read_sql('SELECT * FROM investments LIMIT 5;', conn)
test

| | company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_name | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | raised_amount_usd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AdverCar | advertising | USA | CA | SF Bay | San Francisco | 1-800-FLOWERS.COM | USA | NY | New York | New York | series-a | 2012-10-30 00:00:00 | 2000000.0 |
| 1 | LaunchGram | news | USA | CA | SF Bay | Mountain View | 10Xelerator | USA | OH | Columbus | Columbus | other | 2012-01-23 00:00:00 | 20000.0 |
| 2 | uTaP | messaging | USA | | United States - Other | | 10Xelerator | USA | OH | Columbus | Columbus | other | 2012-01-01 00:00:00 | 20000.0 |
| 3 | ZoopShop | software | USA | OH | Columbus | columbus | 10Xelerator | USA | OH | Columbus | Columbus | angel | 2012-02-15 00:00:00 | 20000.0 |
| 4 | eFuneral | web | USA | OH | Cleveland | Cleveland | 10Xelerator | USA | OH | Columbus | Columbus | other | 2011-09-08 00:00:00 | 20000.0 |
