## Introduction

In this project, techniques for processing data in chunks are used to analyze startup investment data from Crunchbase.com.

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

In return, Crunchbase makes the data available through a web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups went to the site and released the data online. Since the information on the startups and their fundraising rounds is always changing, the dataset used will not be completely up to date.

The investments dataset explored here is a snapshot from October 2013. It can be downloaded from [GitHub](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv).

Throughout this project, different memory constraints will be used. In this step, it is assumed that only 10 megabytes of memory are available. While crunchbase-investments.csv takes up 10.3 megabytes on disk, pandas often needs 4 to 6 times as much space in memory as the file occupies on disk (especially when there are multiple string columns).
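The arithmetic behind that estimate can be sketched as follows. This is only an illustration: it multiplies the on-disk size by the 4x to 6x rule of thumb quoted above, and the variable names are made up for the example.

```python
# Rough in-memory size estimate from the figures quoted above.
file_size_mb = 10.3      # size of crunchbase-investments.csv on disk
memory_limit_mb = 10     # memory budget assumed for this step

low_estimate = file_size_mb * 4    # optimistic multiplier
high_estimate = file_size_mb * 6   # pessimistic multiplier

print(f"Estimated in-memory size: {low_estimate:.1f} to {high_estimate:.1f} MB")
print("Even the low estimate fits in the budget:", low_estimate <= memory_limit_mb)
```

Even the optimistic estimate is several times the 10 MB budget, which is why the file is read in chunks rather than all at once.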

In [15]:
# import libraries, set display options, and read the data in chunks with the correct encoding
import pandas as pd
pd.options.display.max_columns = 99
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize = 5000, encoding='ISO-8859-1')

### Computing each column's missing value count

In [16]:
# 1st method with dictionary
missing_count = {}

for chunk in chunk_iter:
    columns = chunk.columns
    for col in columns:
        if col not in missing_count:
            missing_count[col] = chunk[col].isnull().sum()
        else:
            missing_count[col] += chunk[col].isnull().sum()

missing_count = sorted(missing_count.items(), key=lambda x: x[1])
missing_count

[('company_permalink', 1),
 ('company_name', 1),
 ('company_country_code', 1),
 ('company_region', 1),
 ('investor_permalink', 2),
 ('investor_name', 2),
 ('investor_region', 2),
 ('funding_round_type', 3),
 ('funded_at', 3),
 ('funded_month', 3),
 ('funded_quarter', 3),
 ('funded_year', 3),
 ('company_state_code', 492),
 ('company_city', 533),
 ('company_category_code', 643),
 ('raised_amount_usd', 3599),
 ('investor_country_code', 12001),
 ('investor_city', 12480),
 ('investor_state_code', 16809),
 ('investor_category_code', 50427)]

In [6]:
# 2nd method with list
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize = 5000, encoding='ISO-8859-1')
mv_list = []
for chunk in chunk_iter:
    mv_list.append(chunk.isnull().sum())
    
combined_mv_vc = pd.concat(mv_list)
unique_combined_mv_vc = combined_mv_vc.groupby(combined_mv_vc.index).sum()
unique_combined_mv_vc.sort_values()

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64
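Both methods work because missing-value counts are additive: the count for the whole file is the sum of the counts per chunk. A minimal sketch of that property, using a small made-up CSV (the column names `a`, `b`, `c` and the values are hypothetical, not taken from the Crunchbase file):

```python
import io
import pandas as pd

# Small synthetic CSV standing in for the real file (hypothetical data).
csv_text = "a,b,c\n1,,x\n2,5,\n,6,y\n4,,z\n5,8,\n"

# Chunked count: add up the per-chunk missing-value Series.
chunked = sum(
    chunk.isnull().sum()
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
)

# Reference count from loading everything at once.
full = pd.read_csv(io.StringIO(csv_text)).isnull().sum()

print(chunked)
print("Matches full load:", chunked.to_dict() == full.to_dict())
```

On a file small enough to load whole, a comparison like this is a cheap way to check that the chunked logic is correct before applying it to larger data.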

### Total memory footprint

In [28]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize = 5000, encoding='ISO-8859-1')
# Accumulate per-column memory usage (in MB) across all chunks
series_memory_fp = None
for chunk in chunk_iter:
    chunk_mb = chunk.memory_usage(deep=True) / (1024 * 1024)
    series_memory_fp = chunk_mb if series_memory_fp is None else series_memory_fp + chunk_mb

# Drop index.
series_memory_fp = series_memory_fp.drop('Index')
print(series_memory_fp)
print("Total memory used: ", round(series_memory_fp.sum(), 4), "MB")

company_permalink         3.869808
company_name              3.424955
company_category_code     3.262619
company_country_code      3.025223
company_state_code        2.962161
company_region            3.253503
company_city              3.343473
investor_permalink        4.749821
investor_name             3.734270
investor_category_code    0.593590
investor_country_code     2.524654
investor_state_code       2.361876
investor_region           3.238946
investor_city             2.751430
funding_round_type        3.252704
funded_at                 3.378091
funded_month              3.226837
funded_quarter            3.226837
funded_year               0.403366
raised_amount_usd         0.403366
dtype: float64
Total memory used:  56.9875 MB
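The total above describes the whole file, but the 10 MB constraint applies to what is held in memory at any one time, which is a single chunk. A hedged sketch of checking each chunk against the budget follows; it uses a synthetic CSV so it runs on its own, and `MEMORY_LIMIT_MB` and the generated rows are assumptions for illustration.

```python
import io
import pandas as pd

MEMORY_LIMIT_MB = 10  # budget assumed in the introduction

# Hypothetical stand-in for the real CSV: 20,000 rows of short strings.
rows = "\n".join(f"company-{i},investor-{i},{i}" for i in range(20000))
csv_text = "company_name,investor_name,raised_amount_usd\n" + rows + "\n"

chunk_sizes_mb = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5000):
    # deep=True counts the bytes actually held by the string values
    chunk_sizes_mb.append(chunk.memory_usage(deep=True).sum() / (1024 * 1024))

print("Largest chunk (MB):", round(max(chunk_sizes_mb), 3))
print("All chunks within budget:", max(chunk_sizes_mb) < MEMORY_LIMIT_MB)
```

For the real file, a total of about 57 MB spread over 11 chunks of 5,000 rows averages roughly 5 MB per chunk, so this chunk size should stay under the limit, though individual chunks will vary.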


In [30]:
# Drop columns that will not be used (URLs) or that are almost entirely missing (>90% missing values)
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']
keep_cols = chunk.columns.drop(drop_cols)
print(keep_cols.tolist())

['company_name', 'company_category_code', 'company_country_code', 'company_state_code', 'company_region', 'company_city', 'investor_name', 'investor_country_code', 'investor_state_code', 'investor_region', 'investor_city', 'funding_round_type', 'funded_at', 'funded_month', 'funded_quarter', 'funded_year', 'raised_amount_usd']


### Selecting Data Types
This step determines which columns change type from one chunk to the next. Only the groundwork is laid here.

In [33]:
# Key: Column name, Value: List of types
col_types = {}
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols)

for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in col_types:
            col_types[col] = [str(chunk.dtypes[col])]
        else:
            col_types[col].append(str(chunk.dtypes[col]))
            
col_types # shows data type per chunk

{'company_name': ['object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object'],
 'company_category_code': ['object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object'],
 'company_country_code': ['object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object'],
 'company_state_code': ['object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object'],
 'company_region': ['object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object'],
 'company_city': ['object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object',
  'object'],
 'investor_name': ['object',
  'object',
  'object',
  'object',
  'object',
  'o

In [32]:
# Collapse each column's per-chunk types into the set of distinct types seen
uniq_col_types = {}
for k, v in col_types.items():
    uniq_col_types[k] = set(v)
uniq_col_types

{'company_name': {'object'},
 'company_category_code': {'object'},
 'company_country_code': {'object'},
 'company_state_code': {'object'},
 'company_region': {'object'},
 'company_city': {'object'},
 'investor_name': {'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_state_code': {'float64', 'object'},
 'investor_region': {'object'},
 'investor_city': {'float64', 'object'},
 'funding_round_type': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'raised_amount_usd': {'float64'}}

In [35]:
chunk[:5] # viewing chunk to see why data have more than one data type

Unnamed: 0,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
50000,NuORDER,fashion,USA,CA,Los Angeles,West Hollywood,Mortimer Singer,,,unknown,,series-a,2012-10-01,2012-10,2012-Q4,2012,3060000.0
50001,ChaCha,advertising,USA,IN,Indianapolis,Carmel,Morton Meyerson,,,unknown,,series-b,2007-10-01,2007-10,2007-Q4,2007,12000000.0
50002,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2008-04-18,2008-04,2008-Q2,2008,500000.0
50003,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,750000.0
50004,Unified Color,software,USA,CA,SF Bay,South San Frnacisco,Mr. Andrew Oung,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,
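The rows above show the cause: in this chunk `investor_country_code`, `investor_state_code`, and `investor_city` are empty. When every value of a column is missing within a chunk, pandas has only NaN to infer from and reads the column as float64. Similarly, an integer column such as `funded_year` becomes float64 in any chunk that contains a missing value. The sketch below reproduces the first effect on a tiny made-up CSV (the `name` and `city` columns are hypothetical) and shows one possible remedy, declaring the dtype up front.

```python
import io
import pandas as pd

# Hypothetical CSV: 'city' is empty in the first two rows, filled in the last two.
csv_text = "name,city\nA,\nB,\nC,Paris\nD,Rome\n"

# With type inference, the all-missing first chunk comes back as float64.
inferred = [str(chunk["city"].dtype)
            for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)]
print(inferred)

# Declaring the dtype keeps every chunk consistent.
declared = [str(chunk["city"].dtype)
            for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2,
                                     dtype={"city": "object"})]
print(declared)
```

Passing a `dtype` mapping to `pd.read_csv` is one option; whether `object` or another type is the best choice for each Crunchbase column is left open here.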


### Loading data chunks into a SQLite database

In [36]:
import sqlite3
conn = sqlite3.connect('crunchbase.db')

chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize = 5000, encoding='ISO-8859-1')

for chunk in chunk_iter:
    chunk.to_sql("investments", conn, if_exists='append', index = False)
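After loading, it can be useful to confirm what was written. The following is a minimal sketch, not part of the original notebook: it uses an in-memory database and a two-row sample built from values shown earlier so that it runs on its own. Against the real database, the same two queries could be run on the `conn` opened above.

```python
import sqlite3
import pandas as pd

# In-memory database and a tiny sample frame, so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
sample = pd.DataFrame({
    "company_name": ["NuORDER", "ChaCha"],
    "funded_year": [2012, 2007],
    "raised_amount_usd": [3060000.0, 12000000.0],
})
sample.to_sql("investments", conn, if_exists="append", index=False)

# Column names and the SQLite types pandas chose for them.
schema = pd.read_sql("PRAGMA table_info(investments);", conn)
print(schema[["name", "type"]])

# Row count, to confirm the append wrote what was expected.
row_count = conn.execute("SELECT COUNT(*) FROM investments;").fetchone()[0]
print("Rows:", row_count)
conn.close()
```

Because `if_exists='append'` adds rows to an existing table, re-running the loading cell without first removing `crunchbase.db` would duplicate the data; a row-count check like this makes that easy to spot.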