# Analyzing Startup Fundraising Deals from Crunchbase

This goal of this lesson is to analyze a dataset using chunks to determine ways to minimize the storage utilized through removing redundant columns, reducing numeric datatypes, and converting columns with low uniques to categories. This data will be loaded into SQL

In [1]:
import pandas as pd

In [2]:
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize = 5000, encoding = 'Latin-1')

In [3]:
# calculating null values for each column
null_totals = []

for chunk in chunk_iter:
    chunk_nulls = chunk.isnull().sum()
    null_totals.append(chunk_nulls)

combined = pd.concat(null_totals).groupby(level=0).sum()

In [4]:
combined # total nulls per column

company_category_code       643
company_city                533
company_country_code          1
company_name                  1
company_permalink             1
company_region                1
company_state_code          492
funded_at                     3
funded_month                  3
funded_quarter                3
funded_year                   3
funding_round_type            3
investor_category_code    50427
investor_city             12480
investor_country_code     12001
investor_name                 2
investor_permalink            2
investor_region               2
investor_state_code       16809
raised_amount_usd          3599
dtype: int64

### Memory calcs

In [5]:
# calculating the memory footprint of each column

chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize = 5000, encoding = 'Latin-1')
memory_footprint = {}

for chunk in chunk_iter:
    for col in chunk.columns:
        memory = chunk[col].memory_usage(deep=True) / (1024*1024)
        if col in memory_footprint:
            memory_footprint[col] += memory
        else:
            memory_footprint[col] = memory

In [6]:
memory_footprint # total memory per column

{'company_permalink': 3.8711891174316406,
 'company_name': 3.4263362884521484,
 'company_category_code': 3.2639999389648438,
 'company_country_code': 3.0266036987304688,
 'company_state_code': 2.9635419845581055,
 'company_region': 3.2548837661743164,
 'company_city': 3.3448543548583984,
 'investor_permalink': 4.751201629638672,
 'investor_name': 3.7356510162353516,
 'investor_category_code': 0.594970703125,
 'investor_country_code': 2.5260353088378906,
 'investor_state_code': 2.36325740814209,
 'investor_region': 3.2403268814086914,
 'investor_city': 2.752810478210449,
 'funding_round_type': 3.254084587097168,
 'funded_at': 3.379471778869629,
 'funded_month': 3.2282180786132812,
 'funded_quarter': 3.2282180786132812,
 'funded_year': 0.40474700927734375,
 'raised_amount_usd': 0.40474700927734375}

In [7]:
print(sum(memory_footprint.values())) # total memory of the table

57.01514911651611


### Removing Columns without value

In [8]:
chunk.head() # reviewing data to determine removable columns

Unnamed: 0,company_permalink,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_permalink,investor_name,investor_category_code,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
50000,/company/nuorder,NuORDER,fashion,USA,CA,Los Angeles,West Hollywood,/person/mortimer-singer,Mortimer Singer,,,,unknown,,series-a,2012-10-01,2012-10,2012-Q4,2012,3060000.0
50001,/company/chacha,ChaCha,advertising,USA,IN,Indianapolis,Carmel,/person/morton-meyerson,Morton Meyerson,,,,unknown,,series-b,2007-10-01,2007-10,2007-Q4,2007,12000000.0
50002,/company/binfire,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,/person/moshe-ariel,Moshe Ariel,,,,unknown,,angel,2008-04-18,2008-04,2008-Q2,2008,500000.0
50003,/company/binfire,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,/person/moshe-ariel,Moshe Ariel,,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,750000.0
50004,/company/unified-color,Unified Color,software,USA,CA,SF Bay,South San Frnacisco,/person/mr-andrew-oung,Mr. Andrew Oung,,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,


Reviewing the columns, it looks like company_permalink is basically company_name and same with investor_permalink. investor_category_code is mostly null. invester_city_code, investor_state_code and investor_country_code also have a lot of nulls but probably enough to keep (~75% non null). Also, funded_year, funded_month and funded_quarter are redundant.

In [9]:
# dropping columns mentioned above
updated_cols = chunk.columns.drop(['investor_permalink', 'company_permalink', 'investor_category_code', 'funded_month','funded_quarter','funded_year'])
updated_cols

Index(['company_name', 'company_category_code', 'company_country_code',
       'company_state_code', 'company_region', 'company_city', 'investor_name',
       'investor_country_code', 'investor_state_code', 'investor_region',
       'investor_city', 'funding_round_type', 'funded_at',
       'raised_amount_usd'],
      dtype='object')

### Reviewing Numeric column types for change in data types

In [10]:
# indentifying the data types of columns. we have to group a list and find uniques because a chunk of nas can return as an object when the rest are numeric
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize = 5000, encoding = 'Latin-1')
data_types = []

for chunk in chunk_iter:
    types = chunk[updated_cols].dtypes
    data_types.append(types)

combined = pd.concat(data_types).groupby(level=0).agg(lambda x: list(set(x)))

In [11]:
combined

company_category_code             [object]
company_city                      [object]
company_country_code              [object]
company_name                      [object]
company_region                    [object]
company_state_code                [object]
funded_at                         [object]
funding_round_type                [object]
investor_city            [object, float64]
investor_country_code    [object, float64]
investor_name                     [object]
investor_region                   [object]
investor_state_code      [object, float64]
raised_amount_usd                [float64]
dtype: object

In [12]:
chunk

Unnamed: 0,company_permalink,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_permalink,investor_name,investor_category_code,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
50000,/company/nuorder,NuORDER,fashion,USA,CA,Los Angeles,West Hollywood,/person/mortimer-singer,Mortimer Singer,,,,unknown,,series-a,2012-10-01,2012-10,2012-Q4,2012,3060000.0
50001,/company/chacha,ChaCha,advertising,USA,IN,Indianapolis,Carmel,/person/morton-meyerson,Morton Meyerson,,,,unknown,,series-b,2007-10-01,2007-10,2007-Q4,2007,12000000.0
50002,/company/binfire,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,/person/moshe-ariel,Moshe Ariel,,,,unknown,,angel,2008-04-18,2008-04,2008-Q2,2008,500000.0
50003,/company/binfire,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,/person/moshe-ariel,Moshe Ariel,,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,750000.0
50004,/company/unified-color,Unified Color,software,USA,CA,SF Bay,South San Frnacisco,/person/mr-andrew-oung,Mr. Andrew Oung,,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52865,/company/garantia-data,Garantia Data,enterprise,USA,CA,SF Bay,Santa Clara,/person/zohar-gilon,Zohar Gilon,,,,unknown,,series-a,2012-08-08,2012-08,2012-Q3,2012,3800000.0
52866,/company/duda-mobile,DudaMobile,mobile,USA,CA,SF Bay,Palo Alto,/person/zohar-gilon,Zohar Gilon,,,,unknown,,series-c+,2013-04-08,2013-04,2013-Q2,2013,10300000.0
52867,/company/sitebrains,SiteBrains,software,USA,CA,SF Bay,San Francisco,/person/zohar-israel,zohar israel,,,,unknown,,angel,2010-08-01,2010-08,2010-Q3,2010,350000.0
52868,/company/comprehend-systems,Comprehend Systems,enterprise,USA,CA,SF Bay,Palo Alto,/person/zorba-lieberman,Zorba Lieberman,,,,unknown,,series-a,2013-07-11,2013-07,2013-Q3,2013,8400000.0


In [13]:
# getting numeric col value counts
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize = 5000, encoding = 'Latin-1')
numeric_uniques = {}
numeric_cols = chunk[updated_cols].select_dtypes(include=['number']).columns

for chunk in chunk_iter:
    for col in numeric_cols:
        values = chunk[col].value_counts()
        if col in numeric_uniques:
            numeric_uniques[col].append(values)
        else:
            numeric_uniques[col] = [values]

In [14]:
# Combining the numeric col chunks of value counts
combined_numeric_values = {}

for col in numeric_uniques:
    combined = pd.concat(numeric_uniques[col])
    final = combined.groupby(combined.index).sum()
    combined_numeric_values[col] = final

In [15]:
combined_numeric_values

{'investor_country_code': investor_country_code
 ARE        7
 ARG       14
 AUS      163
 BEL       44
 BGR        4
        ...  
 UKR        9
 USA    36574
 VNM        5
 WSM        4
 ZAF        5
 Name: count, Length: 72, dtype: int64,
 'investor_state_code': investor_state_code
 AL       67
 AR       14
 AZ       84
 CA    18405
 CO      729
 CT      577
 DC      323
 DE       20
 FL      242
 GA      274
 HI       13
 IA        9
 ID       40
 IL      992
 IN       88
 KS       13
 KY       54
 LA       15
 MA     3619
 MD      486
 ME       41
 MI      315
 MN      101
 MO      148
 MS        6
 MT        1
 NC      339
 ND        5
 NE       35
 NH       51
 NJ      456
 NM       41
 NV       38
 NY     4404
 OH      309
 OK       21
 OR       85
 PA      762
 RI       92
 SC       34
 SD        9
 TN      147
 TX      816
 UT      200
 VA      579
 VT       26
 WA      847
 WI       82
 WV        4
 WY        3
 Name: count, dtype: int64,
 'investor_city': investor_city
 (Oc

In [16]:
pd.options.display.float_format = '{:,.2f}'.format 
combined_numeric_values['raised_amount_usd']

raised_amount_usd
1,000.00            3
2,000.00            2
2,100.00            1
3,000.00            3
5,000.00            8
                   ..
1,000,000,000.00    1
1,050,000,000.00    2
1,500,000,000.00    8
2,600,000,000.00    1
3,200,000,000.00    5
Name: count, Length: 1458, dtype: int64

raised_amount_usd can go to an int32 as this would go up to 4.2 billion and our highest value is 3.2 billion. The other columns above are text but showed as numeric in certain chunks because all values were null.

### Implementing data type changes, useful cols to get DF under 10 mbs

In [17]:
# creates a dictionary with the total uniques of each row iterating through chunks and keeps a total row count
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize = 5000, encoding = 'Latin-1', usecols = updated_cols)
unique_counts = {}
total_rows = 0
object_cols = chunk[updated_cols].select_dtypes(include=['object']).columns

for chunk in chunk_iter:
    total_rows += len(chunk)
    for col in chunk:
        if col in unique_counts:
            unique_counts[col] += len(chunk[col].unique())
        else:
            unique_counts[col] = len(chunk[col].unique())

In [18]:
# iterates through the total unique count dictionary and creates a list of columns where < 50% of values are unique
percent_unique = {}
less_than_50 = []
for column in unique_counts:
    percent_unique[column] = unique_counts[column] / total_rows
    if percent_unique[column] < .5:
        less_than_50.append(column)

In [19]:
less_than_50 # will use these to convert to categories

['company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'raised_amount_usd']

### Updating data types determined in analysis above

In [20]:
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize = 5000, encoding = 'Latin-1', usecols = updated_cols)
overall_memory = 0

for chunk in chunk_iter:
    chunk['raised_amount_usd'] = chunk['raised_amount_usd'].fillna(0).round().astype('int32')
    chunk['funded_at'] = pd.to_datetime(chunk['funded_at'])
    for col in less_than_50:
        chunk[col] = chunk[col].astype('category')
    for col in chunk.columns:
        memory = chunk[col].memory_usage(deep=True) / (1024*1024)
        overall_memory+= memory

In [21]:
overall_memory # overall memory is under 10 mb!

7.492980003356934

To get the data under 10 mbs, we dropped columns that did not provide useful info, primarily redundant data colums, we converted numeric values to smaller forms of int, we convert dates to datetime and we converted columns with < 50% uniques to categories

### Loading data into SQL

In [22]:
import sqlite3
conn = sqlite3.connect('crunchbase.db')
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize = 5000, encoding = 'Latin-1', usecols = updated_cols)

for chunk in chunk_iter:
    chunk['raised_amount_usd'] = chunk['raised_amount_usd'].fillna(0).round().astype('int32')
    chunk['funded_at'] = pd.to_datetime(chunk['funded_at']).dt.strftime('%Y-%m-%d %H:%M:%S') # had to update this due to some sql formatting
    for col in less_than_50:
        chunk[col] = chunk[col].astype('category')
    chunk.to_sql("investments", conn, if_exists='append', index=False, dtype={'raised_amount_usd': 'INTEGER'})
    

### Analyzing data in SQL pulling to pandas

In [67]:
# which category company had the most investments
conn = sqlite3.connect('crunchbase.db')
query = """SELECT company_category_code, SUM (raised_amount_usd) AS total_funded
            FROM investments  
            GROUP BY company_category_code
            ORDER BY total_funded DESC"""
funds_by_category = pd.read_sql_query(query, conn)
funds_by_category.head(10) # top 10 funded cateogries

Unnamed: 0,company_category_code,total_funded
0,biotech,772774961434.0
1,software,511591617068.0
2,cleantech,368936575196.0
3,enterprise,321026490911.0
4,mobile,293017148824.0
5,web,281002854923.0
6,medical,177569736967.0
7,advertising,175536633153.0
8,ecommerce,157970540497.0
9,network_hosting,156937786880.0


In [70]:
# which investor contributed the most money
query = """SELECT investor_name, SUM (raised_amount_usd) AS total_funded
            FROM investments  
            GROUP BY investor_name
            ORDER BY total_funded DESC"""
funds_by_investor = pd.read_sql_query(query, conn)
funds_by_investor.head(10) # top 5 funded investors

Unnamed: 0,investor_name,total_funded
0,Kleiner Perkins Caufield & Byers,78524784632.0
1,New Enterprise Associates,67847796408.0
2,Accel Partners,45304883393.0
3,Goldman Sachs,44628213000.0
4,Sequoia Capital,42275816870.0
5,Greylock Partners,34726880573.0
6,Intel Capital,32869317285.0
7,Draper Fisher Jurvetson (DFJ),31510228666.0
8,Oak Investment Partners,30450453089.0
9,Andreessen Horowitz,29635716320.0


In [71]:
# which investor contributed the most money per company
query = """SELECT investor_name, company_name, SUM (raised_amount_usd) AS total_funded
            FROM investments  
            GROUP BY investor_name, company_name
            ORDER BY total_funded DESC"""
funds_by_investor_and_company = pd.read_sql_query(query, conn)
funds_by_investor_and_company.head(10) # top 10 funded investors

Unnamed: 0,investor_name,company_name,total_funded
0,Sprint Nextel,Clearwire,17500000000.0
1,Eagle River Holdings,Clearwire,16940000000.0
2,Digital Sky Technologies,Facebook,11900000000.0
3,Goldman Sachs,Facebook,10500000000.0
4,Battery Ventures,Groupon,7595000000.0
5,Digital Sky Technologies,Groupon,7595000000.0
6,GI Partners,Wave Broadband,7350000000.0
7,Oak Hill Capital Partners,Wave Broadband,7350000000.0
8,Comcast,Clearwire,7255098112.0
9,Intel,Clearwire,7255098112.0


In [65]:
# which funding round was the most popular and least popular
query = """SELECT funding_round_type
            FROM investments  
            """
fund_round = pd.read_sql_query(query, conn)
fund_round.value_counts()

funding_round_type
series-a              97566
series-c+             76090
angel                 62923
venture               62419
series-b              61558
other                  6748
private-equity         2499
post-ipo                231
crowdfunding             35
Name: count, dtype: int64

In [84]:
# determining the porportion of funds the top 10% raised, top 1% bottom 10% and bottom 1%
conn = sqlite3.connect('crunchbase.db')
query = """SELECT company_name, SUM (raised_amount_usd) AS total_funded
            FROM investments 
            WHERE raised_amount_usd > 0
            GROUP BY company_name
            ORDER BY total_funded DESC"""
funds_by_company = pd.read_sql_query(query, conn)
funds_by_company

Unnamed: 0,company_name,total_funded
0,Clearwire,111760000000.00
1,Groupon,71297800000.00
2,Nanosolar,31535000000.00
3,Facebook,29078700000.00
4,SurveyMonkey,22750000000.00
...,...,...
10352,PictureMe Universe,28000.00
10353,IndyGeek,21700.00
10354,WhiteWilly,21000.00
10355,uromovie,14000.00


In [87]:
num_companies=len(funds_by_company) # total companies to get top 1, 10 and bottom 
total_funds = funds_by_company['total_funded'].sum() #total funds to get percent of overall

#Finding the indexes for top companies
top_10_percent=int(num_companies*0.1)
top_1_percent=int(num_companies*0.01)
below_10_percent=int(num_companies*0.1)
below_1_percent=int(num_companies*0.01)

#Calculating the funds raised within the different percents (top 1, top 10 etc.)
top_10_funds=funds_by_company.iloc[:top_10_percent]['total_funded'].sum()
top_1_funds=funds_by_company.iloc[:top_1_percent]['total_funded'].sum()
below_10_funds=funds_by_company.iloc[-below_10_percent:]['total_funded'].sum()
below_1_funds=funds_by_company.iloc[-below_1_percent:]['total_funded'].sum()

In [88]:
# calculating the percentage of funds within each group against the total
percent_top_10_perc = top_10_funds / total_funds 
percent_top_1_perc = top_1_funds / total_funds
bottom_10_perc = below_10_funds / total_funds
bottom_1_perc = below_1_funds / total_funds

print(f"Top 10% of funds hold {percent_top_10_perc:.2%} of the total funds.")
print(f"Top 1% of funds hold {percent_top_1_perc:.2%} of the total funds.")
print(f"Bottom 10% of funds hold {bottom_10_perc:.2%} of the total funds.")
print(f"Bottom 1% of funds hold {bottom_1_perc:.2%} of the total funds.")

Top 10% of funds hold 63.69% of the total funds.
Top 1% of funds hold 23.29% of the total funds.
Bottom 10% of funds hold 0.03% of the total funds.
Bottom 1% of funds hold 0.00% of the total funds.
