# FUNDING ANALYSIS FOR INDIAN STARTUPS

#### Team: Team Namibia

## Table of Contents


- [Business Understanding](#Business-Understanding)

- [Data Understanding](#data-understanding)
    - [Data Collection](#data-collection)
    - [Feature Description](#feature-description)
    - [Exploratory Data Analysis](#exploratory-data-analysis)
    - [Data Quality Assessment](#data-quality-assessment)

- [Data Preparation](#data-preparation)




## Business Understanding
Team Namibia aims to venture into the start-up space in India. As the team's data experts, we have been tasked with investigating the economic landscape and recommending a course of action for this endeavour.


#### Problem Statement:
This project investigates the dynamics of startup funding in India from 2018 to 2021. The aim is to understand funding trends, sector preferences, investment stages, key investors, and the geographical distribution of funding. Any significant differences in funding amounts across years and sectors will also inform the action plan.

#### Objective
The goal of this analysis is to provide insights into the startup funding landscape in India from 2018 to 2021. 
- Identify trends and patterns in funding amounts over the years.
- Determine which sectors received the most funding and how sector preferences changed over time.
- Understand the distribution of funding across different stages of startups (e.g., Seed, Series A).
- Identify key investors and their investment behaviors.
- Analyze the geographical distribution of funding within India.

#### Analytical Questions
1. What are the trends and patterns in funding amounts for startups in India between 2018 and 2021?
   - Analyzing the annual and quarterly trends in funding can reveal patterns and growth trajectories. Look for peaks, dips, and any consistent growth patterns over these years.
2. Which sectors received the most funding, and how did sector preferences change over time from 2018 to 2021?
   - Identifying which industries or sectors received the most funding can show sectoral preferences and shifts. Understanding how this distribution has evolved over the years can highlight emerging trends and declining interests.
3. How is funding distributed across the different stages of startups (e.g., Seed, Series A)?
   - Analyzing the funding amounts at different startup stages can provide insights into the investment appetite at various growth phases. It can also help in understanding the maturity and risk preference of investors.
4. Who are the key investors in Indian startups, and what are their investment behaviors/patterns?
   - Identifying the most active investors and analyzing their investment portfolios can shed light on key players in the ecosystem. Understanding their investment patterns can also reveal strategic preferences and alliances.
5. What is the geographical distribution of startup funding within India, and how has this distribution changed over the years 2018 to 2021?
   - Analyzing the geographical distribution of startup funding can show regional hotspots for entrepreneurship and investment. Observing how this has changed over the years can reveal shifts in regional focus and development.
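
Questions 1 and 2 largely come down to grouped aggregations. The following is a minimal sketch, assuming a combined DataFrame with hypothetical `Year`, `Sector` and `Amount` columns; the rows are invented toy values, not figures from the datasets.

```python
import pandas as pd

# Toy rows for illustration only; these values are not taken from the datasets.
toy = pd.DataFrame({
    "Year":   [2018, 2018, 2019, 2019, 2019],
    "Sector": ["FinTech", "EdTech", "FinTech", "EdTech", "FinTech"],
    "Amount": [100.0, 50.0, 200.0, 75.0, 25.0],
})

# Question 1: total funding per year.
funding_by_year = toy.groupby("Year")["Amount"].sum()

# Question 2: total funding per sector, with one column per year.
funding_by_sector_year = toy.pivot_table(
    index="Sector", columns="Year", values="Amount", aggfunc="sum"
)
```

The same pattern would extend to stage, investor or headquarters by swapping the grouping column.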



### Null Hypothesis (H0)
There is no significant difference in the funding amounts for startups in India between the years 2018 and 2021 across different sectors.

### Alternate Hypothesis (H1)
There is a significant difference in the funding amounts for startups in India between the years 2018 and 2021 across different sectors.
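
The notebook does not name a statistical test for these hypotheses. As one possible way to check them, here is a sketch of a two-sided permutation test on the difference in mean funding between two years. The function name, the toy amounts and the 0.05 threshold are all assumptions made for illustration, and the sketch compares only two groups rather than every year and sector.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign the pooled values to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Toy funding amounts, invented for illustration (not from the datasets).
amounts_2018 = [1.0, 2.0, 2.5, 3.0, 4.0, 5.0]
amounts_2021 = [8.0, 9.0, 10.0, 12.0, 15.0, 20.0]

p_value = permutation_test(amounts_2018, amounts_2021)
reject_h0 = p_value < 0.05  # assumed significance level
```

Because funding amounts tend to be heavily skewed, a rank-based or median-based statistic may be a better fit than the mean on the real data.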


### Importing the necessary packages 

In [44]:
# Import the pyodbc library to handle ODBC database connections
import pyodbc 

# Import the dotenv function to load environment variables from a .env file
from dotenv import dotenv_values 

# Import the pandas library for data manipulation and analysis
import pandas as pd 


# Import Matplotlib for visualizations in Python
import matplotlib.pyplot as plt

# Importing Seaborn for statistical data visualization based on Matplotlib
import seaborn as sns


# Import the warnings library to handle warning messages
import warnings

# Filter out (ignore) any warnings that are raised
warnings.filterwarnings('ignore')




### Establishing a connection to the database

In [14]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the .env file
database = environment_variables.get("DATABASE")
server = environment_variables.get("SERVER")
username = environment_variables.get("UID")
password = environment_variables.get("PWD")

# Create the connection string using the retrieved credentials
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"
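
If a key is missing from the `.env` file, `environment_variables.get(...)` returns `None` and the connection string silently contains the text `None`. A small sketch of an early check is shown below; `build_connection_string`, `REQUIRED_KEYS` and the placeholder credentials are hypothetical names introduced for illustration.

```python
REQUIRED_KEYS = ("SERVER", "DATABASE", "UID", "PWD")

def build_connection_string(env):
    """Build the ODBC connection string, failing early on missing credentials."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing values in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['SERVER']};DATABASE={env['DATABASE']};"
        f"UID={env['UID']};PWD={env['PWD']};"
        "MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"
    )

# Placeholder credentials for illustration only.
example_env = {
    "SERVER": "example-server",
    "DATABASE": "example-db",
    "UID": "example-user",
    "PWD": "example-password",
}
example_connection_string = build_connection_string(example_env)
```

In the notebook, the dictionary returned by `dotenv_values('.env')` could be passed in place of `example_env`.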



In [15]:
# Establish a connection to the database using the connection string
connection = pyodbc.connect(connection_string) 

In [16]:
# Define the SQL query to select all columns from the specified table
query = "Select * from dbo.LP1_startup_funding2020"

# Execute the SQL query and fetch the result into a pandas DataFrame using the established database connection
data1 = pd.read_sql(query, connection)

data1

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,
...,...,...,...,...,...,...,...,...,...,...
1050,Leverage Edu,,Delhi,Edtech,AI enabled marketplace that provides career gu...,Akshay Chaturvedi,"DSG Consumer Partners, Blume Ventures",1500000.0,,
1051,EpiFi,,,Fintech,It offers customers with a single interface fo...,"Sujith Narayanan, Sumit Gwalani","Sequoia India, Ribbit Capital",13200000.0,Seed Round,
1052,Purplle,2012.0,Mumbai,Cosmetics,Online makeup and beauty products retailer,"Manish Taneja, Rahul Dash",Verlinvest,8000000.0,,
1053,Shuttl,2015.0,Delhi,Transport,App based bus aggregator serice,"Amit Singh, Deepanshu Malviya",SIG Global India Fund LLP.,8043000.0,Series C,


In [17]:
# Define the SQL query to select all columns from the specified table
query = "Select * from dbo.LP1_startup_funding2021"

# Execute the SQL query and fetch the result into a pandas DataFrame using the established database connection
data2 = pd.read_sql(query, connection)

data2

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed
...,...,...,...,...,...,...,...,...,...
1204,Gigforce,2019.0,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,$3000000,Pre-series A
1205,Vahdam,2015.0,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,$20000000,Series D
1206,Leap Finance,2019.0,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,$55000000,Series C
1207,CollegeDekho,2015.0,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",$26000000,Series B


In [20]:
# Write the 2020 and 2021 DataFrames to separate CSV files
data1.to_csv('startup_funding2020.csv')
data2.to_csv('startup_funding2021.csv')

In [18]:
# Concatenate two DataFrames 'data1' and 'data2' vertically (along the rows)
df = pd.concat([data1, data2])

# Write the concatenated DataFrame 'df' to a CSV file named 'lp1.csv'
df.to_csv('lp1.csv')

# Read the CSV file into a DataFrame
lp1 = pd.read_csv('lp1.csv')

lp1.head(10)


Unnamed: 0.1,Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,
5,5,qZense,2019.0,Bangalore,AgriTech,qZense Labs is building the next-generation Io...,"Rubal Chib, Dr Srishti Batra","Venture Catalysts, 9Unicorns Accelerator Fund",600000.0,Seed,
6,6,MyClassboard,2008.0,Hyderabad,EdTech,MyClassboard is a full-fledged School / Colleg...,Ajay Sakhamuri,ICICI Bank.,600000.0,Pre-series A,
7,7,Metvy,2018.0,Gurgaon,Networking platform,AI driven networking platform for individuals ...,Shawrya Mehrotra,HostelFund,,Pre-series,
8,8,Rupeek,2015.0,Bangalore,FinTech,Rupeek is an online lending platform that spec...,"Amar Prabhu, Ashwin Soni, Sumit Maniyar","KB Investment, Bertelsmann India Investments",45000000.0,Series C,
9,9,Gig India,2017.0,Pune,Crowdsourcing,GigIndia is a marketplace that provides on-dem...,"Aditya Shirole, Sahil Sharma","Shantanu Deshpande, Subramaniam Ramadorai",1000000.0,Pre-series A,
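
The `Unnamed: 0` column appears because `to_csv` writes the row index by default, and `read_csv` then loads it back as an unnamed column. A sketch of one way to avoid this, using toy frames in place of `data1` and `data2` and an in-memory buffer in place of a file:

```python
import io
import pandas as pd

# Toy frames standing in for data1 and data2 (values are illustrative).
frame_a = pd.DataFrame({"Company_Brand": ["A", "B"], "Amount": [1.0, 2.0]})
frame_b = pd.DataFrame({"Company_Brand": ["C"], "Amount": [3.0]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, ...
combined = pd.concat([frame_a, frame_b], ignore_index=True)

# index=False keeps the row index out of the file, so no 'Unnamed: 0' on reload.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

Note that applying this in the notebook would change the column counts reported in the outputs that follow.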


In [19]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.5+ KB


[Back to Top](#Table-of-Contents)


## Data Understanding

In this section, we seek to understand the data that will help us solve our business problem. We briefly describe how the data was collected, describe the useful features within the dataset, assess the data through an exploratory data analysis, and conclude with a data quality assessment.

### Data Collection
The analysis of Indian startup funding covers a span of 4 years, and each year's data comes from a separate dataset that is stored in CSV format.
- Two of the datasets (2020 and 2021) were obtained from the sakila dataset online by querying the tables dbo.LP1_startup_funding2020 and dbo.LP1_startup_funding2021. The 2020 dataset was saved to the variable data1 and the 2021 dataset to data2.
- The dataset for the year 2019 is a file named "startup_funding2019.csv" obtained from OneDrive using the link https://azubiafrica-my.sharepoint.com/personal/teachops_azubiafrica_org/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fteachops%5Fazubiafrica%5Forg%2FDocuments%2FCareer%20Accelerator%20Data%5FSets%2FLP1%20Datasets&ga=1 
- The final dataset, for 2018, is a file named "startup_funding2018.csv" obtained from a GitHub repository that can be accessed using this link https://github.com/Azubi-Africa/Career_Accelerator_LP1-Data_Analysis.
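
Since each year arrives as its own dataset, the year has to be recorded explicitly before the tables are stacked, or it is lost. A minimal sketch, assuming a hypothetical `Year` column and toy frames in place of the real yearly data:

```python
import pandas as pd

# Toy stand-ins for two yearly datasets (values are illustrative).
yearly_frames = {
    2020: pd.DataFrame({"Company_Brand": ["A"], "Amount": [1.0]}),
    2021: pd.DataFrame({"Company_Brand": ["B", "C"], "Amount": [2.0, 3.0]}),
}

# Tag each frame with its funding year, then stack them into one table.
combined = pd.concat(
    [frame.assign(Year=year) for year, frame in yearly_frames.items()],
    ignore_index=True,
)
```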


### Feature Description
Description of the columns in the dataset: 

- Company/Brand: Name of the company/start-up
- Founded: Year start-up was established
- Headquarters/Location: Location where the start-up is headquartered
- Sector/Industry: The industry or sector in which the start-up operates, such as healthtech, fintech, etc.
- What it does/About Company: Brief overview of the company's function
- Founders: Founders of the Company
- Amount: Total amount raised by the start-up in each funding round
- Stage/Round: Details of the funding stages such as seed, series A, series B, etc.
- Investors: The names of the investors or investment firms involved

These are the useful variables that will provide the necessary data to answer our analytical business questions.
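
The slashes in the list above suggest that the same feature may carry different column names in different years. If so, the names need to be harmonized before the datasets are combined. The sketch below assumes the alternate headers are spelled exactly as in the feature list, which should be checked against each file; `COLUMN_ALIASES` is a hypothetical name.

```python
import pandas as pd

# Assumed alternate headers mapped to the names used in the 2020/2021 tables.
COLUMN_ALIASES = {
    "Company/Brand": "Company_Brand",
    "Headquarters/Location": "HeadQuarter",
    "Sector/Industry": "Sector",
    "What it does/About Company": "What_it_does",
    "Stage/Round": "Stage",
    "Investors": "Investor",
}

# Toy frame with assumed headers (values are illustrative).
toy = pd.DataFrame({
    "Company/Brand": ["A"],
    "Sector/Industry": ["FinTech"],
    "Stage/Round": ["Seed"],
    "Amount": [1.0],
})

# rename() leaves any column that is not in the mapping untouched.
harmonized = toy.rename(columns=COLUMN_ALIASES)
```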



## Exploratory Data Analysis

In this section, we will take a look at the data structure as well as the main characteristics of our raw data.

### Viewing the data
Here we look at the data using different methods to gain an overall perspective of its structure (number of rows and columns), the columns and their data types, and a sample of the DataFrame (a preview of a few rows and columns).

In [57]:
# Read data from the CSV file 'startup_funding2020.csv' into a DataFrame named d20
d20 = pd.read_csv('startup_funding2020.csv')

# Display the first few rows of the DataFrame d20
d20.head()



Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,


In [27]:
# Print information about the structure of the DataFrame d20
d20.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     1055 non-null   int64  
 1   Company_Brand  1055 non-null   object 
 2   Founded        842 non-null    float64
 3   HeadQuarter    961 non-null    object 
 4   Sector         1042 non-null   object 
 5   What_it_does   1055 non-null   object 
 6   Founders       1043 non-null   object 
 7   Investor       1017 non-null   object 
 8   Amount         801 non-null    float64
 9   Stage          591 non-null    object 
 10  column10       2 non-null      object 
dtypes: float64(2), int64(1), object(8)
memory usage: 90.8+ KB


In [29]:
# Display the shape of the DataFrame d20 (number of rows, number of columns)
d20.shape

(1055, 11)

In [37]:
# Generate summary statistics for the numerical columns in DataFrame d20 
num_stats = d20.describe().T

num_stats


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,1055.0,527.0,304.6966,0.0,263.5,527.0,790.5,1054.0
Founded,842.0,2015.363,4.097909,1973.0,2014.0,2016.0,2018.0,2020.0
Amount,801.0,113043000.0,2476635000.0,12700.0,1000000.0,3000000.0,11000000.0,70000000000.0


In [39]:
# Generate descriptive statistics for the categorical columns in DataFrame d20
categorical_stats = d20.describe(include=['object']).T
categorical_stats

Unnamed: 0,count,unique,top,freq
Company_Brand,1055,905,Nykaa,6
HeadQuarter,961,77,Bangalore,317
Sector,1042,302,Fintech,80
What_it_does,1055,990,Provides online learning classes,4
Founders,1043,927,Falguni Nayar,6
Investor,1017,848,Venture Catalysts,20
Stage,591,42,Series A,96
column10,2,2,Pre-Seed,1


In [40]:
# Check for the total number of missing values in each column of the DataFrame d20
d20.isna().sum()


Unnamed: 0          0
Company_Brand       0
Founded           213
HeadQuarter        94
Sector             13
What_it_does        0
Founders           12
Investor           38
Amount            254
Stage             464
column10         1053
dtype: int64

In [42]:
# Check for duplicates in the d20 Dataframe
d20.duplicated().sum()



0

#### Observation
A data quality assessment examines the completeness, accuracy, consistency and relevance of the data. It is a crucial step for understanding the shortcomings of our data and using that knowledge to plan how to handle and clean it.

The dataset "startup_funding2020" has 1055 rows and 11 columns, with no duplicate rows. The columns are a mixture of strings/text (Company_Brand, Sector, HeadQuarter, Founders, What_it_does, Investor, Stage, column10), floats (Amount, Founded) and integers (Unnamed: 0). 

In this dataset, 3 columns have no missing values (Unnamed: 0, Company_Brand and What_it_does), and 4 columns have less than 10% missing values (HeadQuarter 8.9%, Sector 1.2%, Founders 1.1%, Investor 3.6%). Founded (20.2%) and Amount (24.1%) are each missing in roughly a fifth to a quarter of the rows. Only two columns have over 25% missing values: Stage (44.0%) and column10 (99.81%). 

The columns Unnamed: 0 and column10 are clearly irrelevant to our project goals: Unnamed: 0 is only a copy of the row index, and column10 is almost entirely empty. Other columns that are not needed to answer the analytical business questions are Founded and What_it_does.
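
The missing-value percentages can be recomputed from the counts reported by `d20.isna().sum()` above. The sketch below hardcodes those counts so that it runs on its own; with the DataFrame loaded, `d20.isna().mean() * 100` should give the same figures directly.

```python
import pandas as pd

TOTAL_ROWS = 1055  # from d20.shape

# Missing-value counts copied from the d20.isna().sum() output.
missing_counts = pd.Series({
    "Founded": 213,
    "HeadQuarter": 94,
    "Sector": 13,
    "Founders": 12,
    "Investor": 38,
    "Amount": 254,
    "Stage": 464,
    "column10": 1053,
})

# Share of rows with a missing value, as a percentage, highest first.
missing_pct = (missing_counts / TOTAL_ROWS * 100).round(2)
missing_pct = missing_pct.sort_values(ascending=False)
```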


### Data Cleaning
Messy data, such as missing values and data in the wrong format, can greatly impair our ability to analyze the data and gain useful insights from it. At this stage, we clean the data so that it is usable in our analysis. 
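
One example of data in the wrong format is the Amount column of the 2021 table, which holds strings such as `$1,200,000` and `$3000000`, while the 2020 table stores floats. A sketch of one way to convert such strings to numbers, using sample values in those two formats:

```python
import pandas as pd

# Sample strings in the two formats seen in the 2021 table, plus a missing value.
raw_amounts = pd.Series(["$1,200,000", "$3000000", None])

# Strip '$' and ',' and convert; anything that cannot be parsed becomes NaN.
clean_amounts = pd.to_numeric(
    raw_amounts.str.replace(r"[$,]", "", regex=True), errors="coerce"
)
```

Using `errors="coerce"` turns unparseable entries into missing values rather than raising, so they surface in the later missing-value checks.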

In [46]:
d20.isna().sum()

Unnamed: 0          0
Company_Brand       0
Founded           213
HeadQuarter        94
Sector             13
What_it_does        0
Founders           12
Investor           38
Amount            254
Stage             464
column10         1053
dtype: int64

In [48]:
# Drop the columns that are not needed: column10, Unnamed: 0, Founded and What_it_does
d20.drop(columns=['column10', 'Unnamed: 0', 'Founded', 'What_it_does'], inplace=True)

### Handling Missing Values
The unnecessary columns are dropped from the dataset. For the two remaining columns with the most missing values (Stage, about 44%, and Amount, about 24%), the rows that are missing both values are removed. The resulting DataFrame is then assessed for its new levels of missing data. Columns with less than 10% missing values are filled with either the median or the mode.
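
The median and mode fills can be sketched as follows. The toy frame and the choice of which column gets which fill are assumptions for illustration: the median only applies to numeric columns, while the mode works for text columns such as Sector.

```python
import pandas as pd

# Toy frame for illustration; the values are not taken from the datasets.
toy = pd.DataFrame({
    "Amount": [100.0, None, 300.0, 500.0],
    "Sector": ["FinTech", "EdTech", None, "FinTech"],
})

# Numeric column: fill gaps with the median, which is robust to outliers.
toy["Amount"] = toy["Amount"].fillna(toy["Amount"].median())

# Text column: fill gaps with the most frequent value (the mode).
toy["Sector"] = toy["Sector"].fillna(toy["Sector"].mode().iloc[0])
```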

In [61]:

# Select the rows where both the 'Amount' and 'Stage' columns are missing
d31 = d20[d20['Amount'].isna() & d20['Stage'].isna()]
#d31.shape

# Display the rows selected in d31
d31


Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
12,MasterG,2015.0,New Delhi,Fashion startup,MasterG is a design and skill development orga...,Gayatri Jolly,Acumen Fund's angel programme,,
18,Pine Labs,1998.0,Noida,FinTech,A merchant platform company that provides fina...,Amrish Rau,"Mastercard, Temasek Holdings",,
29,Delhivery,2011.0,Gurgaon,E-commerce,Delhivery is a supply chain services company t...,"Kapil Bharati, Mohit Tandon, Sahil Barua, Sura...","Steadview Capital, Canada Pension Plan Investm...",,
40,Fleeca India,2016.0,Jaipur,Tyre management,FLEECA is a Tyre Care Provider company.,Tikam Jain,Bridgestone India,,
44,PointOne Capital,2020.0,Bangalore,Venture capitalist,Pre-seed/Seed focussed VC investor,Mihir Jha,,,
...,...,...,...,...,...,...,...,...,...
1021,IncubateHub,,Bengaluru,Tech hub,Provides platform for Corporates to connect es...,Rajiv Mukherjee,Venture Catalysts,,
1022,Rage Coffee,,Delhi,FMCG,Provides effective Instant Coffee Infused with...,Bharat Sethi,Refex Capital,,
1023,Skilancer,,Noida,Technology,Solar module cleaning system [MCS] providers,Neeraj Kumar,Venture Catalysts,,
1024,Harappa Education,,New Delhi,Edtech,Provides online courses on foundational skills,Pramath Raj Sinha,James Murdoch-led Lupa Systems,,


In [62]:
# Drop the rows that are missing both Amount and Stage
d20_cleaned1 = d20.drop(index=d31.index)
d20_cleaned1

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,
...,...,...,...,...,...,...,...,...,...
1050,Leverage Edu,,Delhi,Edtech,AI enabled marketplace that provides career gu...,Akshay Chaturvedi,"DSG Consumer Partners, Blume Ventures",1500000.0,
1051,EpiFi,,,Fintech,It offers customers with a single interface fo...,"Sujith Narayanan, Sumit Gwalani","Sequoia India, Ribbit Capital",13200000.0,Seed Round
1052,Purplle,2012.0,Mumbai,Cosmetics,Online makeup and beauty products retailer,"Manish Taneja, Rahul Dash",Verlinvest,8000000.0,
1053,Shuttl,2015.0,Delhi,Transport,App based bus aggregator serice,"Amit Singh, Deepanshu Malviya",SIG Global India Fund LLP.,8043000.0,Series C

