The startup ecosystem is a dynamic and rapidly evolving space, with certain industries and regions attracting more investment than others. This project aims to analyze startup growth and investment trends to identify the industries that receive the most funding, the countries with the fastest-growing startups, and the relationship between funding rounds and growth rates.

To achieve this, I will be using a dataset from Kaggle that includes key variables such as industry, funding rounds, investment amounts, valuation, number of investors, country, year founded, and growth rate. However, before conducting any analysis, it is crucial to clean the dataset to handle missing values, inconsistencies, and potential outliers. Proper data cleaning ensures the accuracy and reliability of insights, allowing for well-informed conclusions about investment opportunities and growth trends in the startup world.

In [None]:
# upload file into colab
from google.colab import files
uploaded = files.upload()

Saving startup_growth_investment_data.csv to startup_growth_investment_data.csv


In [None]:
# load the "startup_growth_investment_data.csv" as a DataFrame
import pandas as pd
df = pd.read_csv('startup_growth_investment_data.csv')

In [None]:
# show first rows of df
df.head()

Unnamed: 0,Startup Name,Industry,Funding Rounds,Investment Amount (USD),Valuation (USD),Number of Investors,Country,Year Founded,Growth Rate (%)
0,Startup_1,Blockchain,8,1335166000.0,6621448000.0,50,Germany,2012,77.1
1,Startup_2,SaaS,2,2781498000.0,8363214000.0,36,UK,2006,105.52
2,Startup_3,EdTech,10,3309032000.0,15482700000.0,39,Singapore,2016,190.47
3,Startup_4,Fintech,5,4050196000.0,12682530000.0,44,France,2021,9.44
4,Startup_5,EdTech,9,1645080000.0,6887966000.0,48,India,2011,192.0


In [None]:
# show info on df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Startup Name             5000 non-null   object 
 1   Industry                 5000 non-null   object 
 2   Funding Rounds           5000 non-null   int64  
 3   Investment Amount (USD)  5000 non-null   float64
 4   Valuation (USD)          5000 non-null   float64
 5   Number of Investors      5000 non-null   int64  
 6   Country                  5000 non-null   object 
 7   Year Founded             5000 non-null   int64  
 8   Growth Rate (%)          5000 non-null   float64
dtypes: float64(3), int64(3), object(3)
memory usage: 351.7+ KB


In [None]:
# describe df
df.describe()

Unnamed: 0,Funding Rounds,Investment Amount (USD),Valuation (USD),Number of Investors,Year Founded,Growth Rate (%)
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,5.4916,2455567000.0,7971059000.0,25.542,2011.544,102.091732
std,2.913353,1423787000.0,5479487000.0,14.271838,6.885285,56.179781
min,1.0,1102610.0,1722547.0,1.0,2000.0,5.06
25%,3.0,1221506000.0,3598305000.0,13.0,2006.0,52.815
50%,6.0,2460634000.0,7002304000.0,25.0,2012.0,102.215
75%,8.0,3639951000.0,11476760000.0,38.0,2018.0,150.58
max,10.0,4999544000.0,24709060000.0,50.0,2023.0,199.97


I imported the data into Colab and displayed some basic information to confirm that it is ready for analysis. The df.info() output reports 5000 non-null entries in every column, so there appear to be no missing values, and each column's data type suits its contents. Therefore, we can proceed.

In [None]:
# identify missing values in df
df.isnull().sum()

Unnamed: 0,0
Startup Name,0
Industry,0
Funding Rounds,0
Investment Amount (USD),0
Valuation (USD),0
Number of Investors,0
Country,0
Year Founded,0
Growth Rate (%),0


There are no missing values or incorrect data types in this dataset.
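
A null count does not catch inconsistencies in text columns, such as the same industry entered with different capitalization or stray whitespace. As a minimal sketch of one way to check for this, the hypothetical helper label_inconsistencies below counts how many distinct labels collapse once values are trimmed and lowercased. The sample frame holds only the five rows shown in the df.head() output, so it says nothing about the full dataset; on the real data the helper would be called on df instead.

```python
import pandas as pd

# Five rows copied from the df.head() output above (illustration only;
# the real DataFrame has 5000 rows).
sample = pd.DataFrame({
    'Startup Name': ['Startup_1', 'Startup_2', 'Startup_3', 'Startup_4', 'Startup_5'],
    'Industry': ['Blockchain', 'SaaS', 'EdTech', 'Fintech', 'EdTech'],
    'Country': ['Germany', 'UK', 'Singapore', 'France', 'India'],
})


def label_inconsistencies(frame, columns):
    """Hypothetical helper: for each column, return how many distinct labels
    disappear after stripping whitespace and lowercasing. A value above 0
    means some labels differ only by case or surrounding spaces."""
    report = {}
    for col in columns:
        raw_count = frame[col].nunique()
        normalized_count = frame[col].str.strip().str.lower().nunique()
        report[col] = int(raw_count - normalized_count)
    return report


report = label_inconsistencies(sample, ['Industry', 'Country'])
print(report)
```

A result of 0 for a column means that normalizing the labels did not merge any categories in that column.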

In [None]:
# identify duplicate values in df
df.duplicated().sum()

0

There are no duplicate rows in this dataset.
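
By default, duplicated() only flags rows that match on every column, so a startup listed twice with slightly different figures would not be counted. The sketch below shows the difference on a small made-up frame (toy values, not taken from the dataset) by also checking the "Startup Name" column on its own through the subset parameter. Whether the name is the right key for this dataset is an assumption.

```python
import pandas as pd

# Toy data for illustration only: Startup_1 appears twice with different values.
toy = pd.DataFrame({
    'Startup Name': ['Startup_1', 'Startup_2', 'Startup_1'],
    'Funding Rounds': [8, 2, 9],
})

# Rows identical across all columns (what df.duplicated().sum() measures).
full_row_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier "Startup Name", regardless of the other columns.
name_dupes = int(toy.duplicated(subset=['Startup Name']).sum())

print(full_row_dupes, name_dupes)
```

On the real data, the same second check would be df.duplicated(subset=['Startup Name']).sum().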

In [None]:
# import plotly
import plotly.express as px

# create a boxplot for the "Growth Rate" column
fig = px.box(df, y='Growth Rate (%)')

# show the plot
fig.show()

# create another boxplot for "Investment Amount (USD)"
fig = px.box(df, y='Investment Amount (USD)')

# show the plot
fig.show()

There are no clear outliers in either the "Growth Rate (%)" or the "Investment Amount (USD)" column.
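
The visual reading can be cross-checked numerically with the common 1.5 × IQR rule, which treats values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR as potential outliers. The sketch below applies that rule to the minimum, quartile, and maximum values as displayed in the df.describe() output. The helper name iqr_fences is hypothetical, and the multiplier 1.5 is the conventional choice rather than anything specific to this dataset.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Values copied from the df.describe() output above (as displayed there).
summary = {
    'Growth Rate (%)': {
        'min': 5.06, 'q1': 52.815, 'q3': 150.58, 'max': 199.97,
    },
    'Investment Amount (USD)': {
        'min': 1102610.0, 'q1': 1221506000.0,
        'q3': 3639951000.0, 'max': 4999544000.0,
    },
}

# True means the column has at least one value outside the fences.
outside_fences = {}
for column, stats in summary.items():
    lower, upper = iqr_fences(stats['q1'], stats['q3'])
    outside_fences[column] = stats['min'] < lower or stats['max'] > upper
    print(f"{column}: fences [{lower:,.2f}, {upper:,.2f}], "
          f"outside: {outside_fences[column]}")
```

Because both the minimum and the maximum of each column fall inside its fences, no value in either column is flagged by this rule, which agrees with the boxplots.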

In [None]:
# identify values in "Number of Investors" column below or equal to 0
df[df['Number of Investors'] <= 0]

Unnamed: 0,Startup Name,Industry,Funding Rounds,Investment Amount (USD),Valuation (USD),Number of Investors,Country,Year Founded,Growth Rate (%)


In [None]:
# What is the distribution of startup valuations?

# create a histogram for the "Valuation (USD)" column
fig = px.histogram(df, x = 'Valuation (USD)')

# show the plot
fig.show()

This histogram shows the count of startups in each valuation bin. The distribution is right-skewed: most startups are valued below $15 billion, while a small number are valued at $20 billion or more.

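The summary statistics from earlier show that the average valuation is approximately $7.97 billion, which falls toward the left side of the chart. The mean is also higher than the median (about $7.00 billion), which is consistent with a right-skewed distribution.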
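
The skew can be given a rough number without replotting. The sketch below computes Pearson's median skewness coefficient, 3 × (mean − median) / standard deviation, from the "Valuation (USD)" figures displayed in the df.describe() output. This is only one of several skewness measures and is used here as an illustration; a positive value indicates a longer right tail.

```python
# Values copied from the "Valuation (USD)" column of df.describe() above.
mean_valuation = 7971059000.0
median_valuation = 7002304000.0   # the 50% row
std_valuation = 5479487000.0


def pearson_median_skewness(mean, median, std):
    """Pearson's second (median) skewness coefficient."""
    return 3 * (mean - median) / std


skew = pearson_median_skewness(mean_valuation, median_valuation, std_valuation)
print(f"Pearson median skewness: {skew:.2f}")
```

On the full DataFrame, df['Valuation (USD)'].skew() would give the moment-based sample skewness instead, which can differ in magnitude but should agree in sign for a distribution like this one.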

In [None]:
# create the csv file
df.to_csv('startup_growth_investment_data_cleaned.csv', index=False)
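
The index=False argument matters when the file is read back later. The sketch below uses a small made-up frame (toy values, not from the dataset) and an in-memory buffer instead of a file to show that exporting with the default index adds an extra unnamed column on reload, while index=False preserves the original columns.

```python
import io

import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'Startup Name': ['Startup_1', 'Startup_2'],
    'Funding Rounds': [8, 2],
})

# to_csv() with no path returns the CSV text, so no file is written here.
reloaded_with_index = pd.read_csv(io.StringIO(toy.to_csv()))
reloaded_without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

The first list contains an additional "Unnamed: 0" column holding the old row index, and the second matches the columns of the original frame.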

In [None]:
# download csv file
files.download('startup_growth_investment_data_cleaned.csv')

The checks above found no missing values, no duplicate rows, no non-positive investor counts, and no clear outliers in growth rate or investment amount, so no records needed to be removed or corrected. The exported file is therefore a verified copy of the original dataset. Confirming this before the analysis reduces the risk of bias from missing or inconsistent data and makes the dataset suitable for identifying meaningful trends in startup growth and investment. This improves the quality of insights, allowing for better decision-making regarding industry opportunities and investment strategies.