# ***Indian Startup Ecosystem Analysis***

## ***Business Understanding***

Between 2018 and 2021, India's startup ecosystem evolved significantly. This project looks for patterns and trends in data on funding, investment details and startup information. Using data analytics techniques, we examine the effect that funding levels, business location and industry specialization have on the performance of startups. Understanding these dynamics is crucial for businesses and investors seeking to capitalize on the opportunities within India's vibrant startup ecosystem.

Installing relevant libraries

In [112]:
#%pip install pandas

Importing relevant libraries

In [113]:
import pandas as pd
import numpy as np

## 1. ***Data collection/ Data loading***

In [114]:
df_2018 = pd.read_csv('startup_funding2018.csv')
df_2019 = pd.read_csv('startup_funding2019.csv')
df_2020 = pd.read_csv('startup_funding2020.csv')
df_2021 = pd.read_csv('startup_funding2021.csv')

## 2. ***Data Exploration***

We explore each dataset separately to understand its characteristics and clean it where necessary.

### ***2018 start-ups funds data exploration and cleaning***

In [115]:
# a view of the first 5 rows in the dataset

df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [116]:
# Checking the data types in the dataset

df_2018.dtypes

Company Name     object
Industry         object
Round/Series     object
Amount           object
Location         object
About Company    object
dtype: object

In [117]:
# Checking for missing values

df_2018.isnull().sum()

Company Name     0
Industry         0
Round/Series     0
Amount           0
Location         0
About Company    0
dtype: int64
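
Note that `isnull()` only counts real NaN values. The preview above shows "—" in the Amount column of row 4, and a placeholder string like that is not counted as missing. A minimal sketch of the difference, using a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Toy frame (hypothetical data) mimicking the "—" placeholder seen in the preview.
toy = pd.DataFrame({
    "Company Name": ["A", "B", "C"],
    "Amount": ["250000", "—", "₹40,000,000"],
})

# isnull() reports nothing missing because "—" is an ordinary string.
real_missing = toy.isnull().sum()

# Counting the placeholder explicitly reveals the hidden gap.
placeholder_missing = (toy == "—").sum()

print(real_missing)
print(placeholder_missing)
```

This is why the columns are still inspected one by one below, even though the missing-value count is zero.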

#### ***Checking the Company Name Column***

In [118]:
# Checking for duplicates in the Company Name column.

df_2018['Company Name'].duplicated().value_counts()

Company Name
False    525
True       1
Name: count, dtype: int64

In [119]:
# Pulling the details of the duplicated values in the Company Name column

df_2018[df_2018['Company Name'].duplicated(keep=False)]


Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
348,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."


***Observation***
Rows 0 and 348 are identical, so this is a clear case of duplication. We drop one of the duplicated rows.

In [120]:
# Dropping duplicated row based on the column company name

df_2018.drop_duplicates(subset= ['Company Name'], keep= 'first', inplace= True, ignore_index= True)

df_2018.duplicated().sum()

0
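
The check `df_2018.duplicated().sum()` compares whole rows. To confirm that no company name repeats, the check can be limited to that column with `subset`. A small sketch of the difference on made-up data (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame (hypothetical data): "Acme" appears twice with different amounts.
toy = pd.DataFrame({
    "Company Name": ["Acme", "Beta", "Acme"],
    "Amount": [100, 200, 300],
})

# Whole-row comparison: no two rows are identical.
full_row_dupes = toy.duplicated().sum()

# Name-only comparison: the second "Acme" is flagged.
name_dupes = toy.duplicated(subset=["Company Name"]).sum()

print(full_row_dupes, name_dupes)
```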

#### ***Checking the Industry Column***

In [121]:
# Checking the Industry column

df_2018['Industry'].sample(15)

485    Artificial Intelligence, Human Resources, Info...
426                                Health Care, Hospital
513                  Food and Beverage, Restaurants, Tea
293                   Agriculture, AgTech, Manufacturing
15           Automotive, Search Engine, Service Industry
214                  Cloud Computing, Computer, Software
451    Last Mile Transportation, Railroad, Transporta...
259                                                    —
367     Beauty, Cosmetics, Health Care, Service Industry
238                                                    —
199                      Internet, Marketplace, Shopping
46                           Financial Services, Lending
350    Continuing Education, EdTech, Education, Skill...
496                       Internet, Knowledge Management
30                Internet of Things, Telecommunications
Name: Industry, dtype: object

***Observation***
Sampling the Industry column, the following points are worth noting:

- The Industry column lists sectors and subsectors, among others. We keep only the first entry, which is predominantly the sector.
- Some entries are the placeholder "—". We will replace them with NaN.

In [122]:
# Keep the first entry (the sector) from the comma-separated Industry column

df_2018['Industry'] = df_2018['Industry'].str.split(",").str[0]

# Replacing the "—" placeholder with NaN

df_2018['Industry'] = df_2018['Industry'].replace("—", np.nan)

In [123]:
# Sampling the Industry column to verify changes

df_2018['Industry'].sample(20)

56           E-Commerce
474          E-Commerce
40     Renewable Energy
43           E-Commerce
81          Advertising
482         Health Care
524       Biotechnology
77             Consumer
298         Health Care
511           Education
84           E-Learning
235                 NaN
351        Smart Cities
369              EdTech
303      Cryptocurrency
97            Logistics
445          Accounting
363        Crowdfunding
477                 NaN
325                 NaN
Name: Industry, dtype: object
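
To make the two steps above concrete, here is a small worked example on three made-up values (the `industry` Series is an assumption for illustration). The extra `.str.strip()` is not in the notebook's code; it is added here only to guard against stray spaces around the first entry.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical) in the same shape as the Industry column.
industry = pd.Series([
    "Agriculture, Farming",
    "Financial Services, FinTech",
    "—",
])

# Take the text before the first comma, then trim surrounding whitespace.
first_sector = industry.str.split(",").str[0].str.strip()

# The placeholder survives the split unchanged, so it is replaced afterwards.
first_sector = first_sector.replace("—", np.nan)

print(first_sector.tolist())
```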

#### ***Checking the Round/ Series Column***

In [124]:
df_2018['Round/Series'].unique()

array(['Seed', 'Series A', 'Angel', 'Series B', 'Pre-Seed',
       'Private Equity', 'Venture - Series Unknown', 'Grant',
       'Debt Financing', 'Post-IPO Debt', 'Series H', 'Series C',
       'Series E', 'Corporate Round', 'Undisclosed',
       'https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593',
       'Series D', 'Secondary Market', 'Post-IPO Equity',
       'Non-equity Assistance', 'Funding Round'], dtype=object)

***Observation***: Looking at the unique values of the Round/Series column, we noticed that two of them do not belong in the column:
- Undisclosed. Since these funding rounds were undisclosed, we will replace them with NaN.
- A Google Sheets link (https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593). Opening the link shows that the entry refers to seed funding.

In [125]:
# Replacing Undisclosed with nan
df_2018['Round/Series'] = df_2018['Round/Series'].replace("Undisclosed", np.nan)

In [126]:
# Replacing the link with seed
df_2018['Round/Series'] = df_2018['Round/Series'].replace("https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593", "Seed")

In [127]:
# Verifying changes made in the Round/ Series column

df_2018['Round/Series'].unique()

array(['Seed', 'Series A', 'Angel', 'Series B', 'Pre-Seed',
       'Private Equity', 'Venture - Series Unknown', 'Grant',
       'Debt Financing', 'Post-IPO Debt', 'Series H', 'Series C',
       'Series E', 'Corporate Round', nan, 'Series D', 'Secondary Market',
       'Post-IPO Equity', 'Non-equity Assistance', 'Funding Round'],
      dtype=object)
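
The two replacements above could also be expressed as a single mapping passed to `replace`. A minimal sketch on made-up values, where `STRAY_LINK` is a hypothetical stand-in for the spreadsheet URL found in the column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the stray spreadsheet link found in the column.
STRAY_LINK = "https://example.com/stray-link"

rounds = pd.Series(["Seed", "Undisclosed", STRAY_LINK, "Series A"])

# One mapping covers both fixes: unknown rounds become NaN, the link becomes "Seed".
cleaned = rounds.replace({"Undisclosed": np.nan, STRAY_LINK: "Seed"})

print(cleaned.tolist())
```

Keeping the fixes in one dictionary makes it easier to see at a glance which raw values were rewritten.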

#### ***Checking the Amount Column***

In [128]:
# Sampling the Amount column

df_2018.Amount.sample(20)

130      $540,000,000
116    ₹2,500,000,000
40                  —
169           5000000
489            550000
11                  —
136                 —
346                 —
467           3500000
344                 —
382                 —
444           4000000
138        $1,500,000
118                 —
357        $3,000,000
157                 —
173           1000000
50        ₹20,000,000
320                 —
501                 —
Name: Amount, dtype: object

***Observation***: Sampling the Amount column, the following observations were made:
- Some amounts are in rupees (₹). They must be converted to dollars using the 2018 exchange rate.
- The dollar signs ($) preceding some of the figures need to be removed.
- The "—" placeholder needs to be replaced with NaN.
- Commas used as thousands separators need to be removed.

#### ***Handling the observations made***

In [129]:
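# Rupee-to-dollar conversion rate used for the 2018 data
conversion_rate = 0.0146

# Function to clean and, where needed, convert the Amount column

def clean_and_convert(amount):
    # "—" marks an undisclosed amount
    if amount == "—":
        return float('nan')
    # Rupee amounts: strip the symbol and separators, then convert to dollars
    elif '₹' in amount:
        amount = amount.replace('₹', '').replace(',', '')
        return float(amount) * conversion_rate
    # Dollar or unmarked amounts: strip the symbol and separators
    else:
        amount = amount.replace('$', '').replace(',', '')
        return float(amount)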

In [130]:
# Applying the cleaning and conversion function to the amount column

df_2018['Amount'] = df_2018['Amount'].apply(clean_and_convert)

In [131]:
# Sampling to verify that the changes have taken effect

df_2018['Amount'].sample(20)

167      500000.0
488       40000.0
523      511000.0
346           NaN
210      102200.0
366      100000.0
441           NaN
471     3700000.0
183     3600000.0
307           NaN
524    35000000.0
274      300000.0
499     1138800.0
226           NaN
169     5000000.0
377     3300000.0
256     1450000.0
487     5840000.0
387     4964000.0
134      900000.0
Name: Amount, dtype: float64
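
As an alternative to the row-wise function above, the same cleaning can be sketched with vectorized pandas string methods. This is an illustration on four made-up values, not the notebook's method; `INR_TO_USD` simply reuses the 0.0146 rate defined earlier.

```python
import pandas as pd

INR_TO_USD = 0.0146  # same value as the notebook's conversion_rate

# Toy values (hypothetical) covering the four cases seen in the column.
amount = pd.Series(["250000", "₹40,000,000", "$1,500,000", "—"])

# Remember which rows were quoted in rupees before stripping the symbol.
is_rupee = amount.str.contains("₹", regex=False)

# Strip currency symbols and thousands separators, then parse.
# Anything that is not a number, such as "—", becomes NaN.
digits = amount.str.replace(r"[₹$,]", "", regex=True)
usd = pd.to_numeric(digits, errors="coerce")

# Convert only the rupee rows to dollars; other rows are kept as they are.
usd = usd.where(~is_rupee, usd * INR_TO_USD)

print(usd.tolist())
```

Recording the rupee mask before removing the symbols is the key step, since the currency information is lost once the strings are reduced to digits.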

#### ***Checking the Location Column***

In [132]:
# Sampling the location column

df_2018.Location.sample(20)

494         Noida, Uttar Pradesh, India
372             Gurgaon, Haryana, India
260         Bengaluru, Karnataka, India
321         Noida, Uttar Pradesh, India
413         Bangalore, Karnataka, India
51         Kanpur, Uttar Pradesh, India
451        Kormangala, Karnataka, India
44              New Delhi, Delhi, India
493          Mumbai, Maharashtra, India
411             Gurgaon, Haryana, India
381         Bengaluru, Karnataka, India
478       Lucknow, Uttar Pradesh, India
83              Gurgaon, Haryana, India
120            Pune, Maharashtra, India
23                  Delhi, Delhi, India
329         Bangalore, Karnataka, India
207    Hyderabad, Andhra Pradesh, India
498          Mumbai, Maharashtra, India
33          Bengaluru, Karnataka, India
148             Gurgaon, Haryana, India
Name: Location, dtype: object

***Observation***
- The city, state and country are stored together, separated by commas. We need to split them into separate columns so that we can analyze the role the city and state play.

In [133]:
# Splitting Location into City, State and Country
df_2018[['City', 'State', 'Country']] = df_2018['Location'].str.split(', ', expand=True)[[0, 1, 2]]

# Displaying the DataFrame
df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,City,State,Country
0,TheCollegeFever,Brand Marketing,Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",Bangalore,Karnataka,India
1,Happy Cow Dairy,Agriculture,Seed,584000.0,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,Mumbai,Maharashtra,India
2,MyLoanCare,Credit,Series A,949000.0,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,Gurgaon,Haryana,India
3,PayMe India,Financial Services,Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,Noida,Uttar Pradesh,India
4,Eunimart,E-Commerce Platforms,Seed,,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,Hyderabad,Andhra Pradesh,India
...,...,...,...,...,...,...,...,...,...
520,Udaan,B2B,Series C,225000000.0,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif...",Bangalore,Karnataka,India
521,Happyeasygo Group,Tourism,Series A,,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.,Haryana,Haryana,India
522,Mombay,Food and Beverage,Seed,7500.0,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...,Mumbai,Maharashtra,India
523,Droni Tech,Information Technology,Seed,511000.0,"Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...,Mumbai,Maharashtra,India
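
The split relies on every location having three comma-separated parts. The `info()` summary further down reports 521 non-null Country values out of 525 rows, which suggests a few locations have fewer parts. A small sketch, on made-up values, of how `str.split(..., expand=True)` pads such rows:

```python
import pandas as pd

# Toy locations (hypothetical data); the second entry has no country part.
location = pd.Series([
    "Bangalore, Karnataka, India",
    "Mumbai, Maharashtra",
])

# n=2 caps the result at three columns even if an entry holds extra commas.
parts = location.str.split(", ", n=2, expand=True)
parts.columns = ["City", "State", "Country"]

print(parts)
```

Rows with a missing part end up with an empty Country, so they can be located later with `isna()` and reviewed by hand.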


#### ***Checking the About Company Column***

In [134]:
# Sampling the About Company column

df_2018['About Company'].sample(10)

49     Reserve your Room, Hostel or Paying Guest, whi...
423    A one of its kind digital integrated platform ...
316    Nua is an e-commerce and feminine care brand p...
488    GoGaga is a new age dating app that has remove...
493    CLP India Pvt. Ltd. is a foreign investor in t...
455                               Online gaming startup.
42     UClean is a tech enabled laundry service provi...
189    Rupeek is an asset backed online lending platf...
378    Signzy are creating 'building blocks for a Dig...
310    A Bengaluru-based fraud analytics solution pro...
Name: About Company, dtype: object

***Observation***
- This column is free text, so there is not much to clean here. We will leave it as it is.

In [135]:
#Let's check the Data types to be certain everything is as expected

df_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525 entries, 0 to 524
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company Name   525 non-null    object 
 1   Industry       495 non-null    object 
 2   Round/Series   523 non-null    object 
 3   Amount         377 non-null    float64
 4   Location       525 non-null    object 
 5   About Company  525 non-null    object 
 6   City           525 non-null    object 
 7   State          525 non-null    object 
 8   Country        521 non-null    object 
dtypes: float64(1), object(8)
memory usage: 37.0+ KB


### ***2019 start-ups funds data exploration and cleaning***

In [136]:
# a view of the first 5 rows in the dataset

df_2019.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


#### ***Checking the Company/Brand Column***

In [137]:
# Checking for duplicated entries based on the Company/Brand column
# and pulling the details of the duplicated values

df_2019[df_2019['Company/Brand'].duplicated(keep=False)]


Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
7,Kratikal,2013.0,Noida,Technology,It is a product-based cybersecurity solutions ...,"Pavan Kushwaha, Paratosh Bansal, Dip Jung Thapa","Gilda VC, Art Venture, Rajeev Chitrabhanu.","$1,000,000",Pre series A
30,Licious,,Bangalore,Food,Online meat shop,"Vivek Gupta, Abhay Hanjura",Vertex Growth Fund,"$30,000,000",Series E
68,Licious,2015.0,Bangalore,Food,Online meat shop,"Vivek Gupta, Abhay Hanjura",Vertex Ventures,"$25,000,000",Series D
82,Kratikal,,Uttar pradesh,Technology,Provides cyber security solutions,Pavan Kushwaha,"Gilda VC, Art Venture, Rajeev Chitrabhanu","$1,000,000",Pre-series A


***Observation***
- Two companies appear twice: Kratikal and Licious.
- In Kratikal's case, row 7 is more complete than row 82. Noida is in the state of Uttar Pradesh, so both rows point to the same location. We will therefore drop row 82.
- In Licious's case, row 68 is more likely the correct entry since it has all fields captured. We will drop row 30.

In [138]:
# Dropping duplicated rows based on the column company/Brand

df_2019.drop([30, 82], axis= 0, inplace= True)


In [139]:
# Verifying that the changes have taken effect

df_2019['Company/Brand'].duplicated().value_counts()

Company/Brand
False    87
Name: count, dtype: int64
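
The "keep the more complete row" reasoning used above can also be automated instead of dropping rows by index. The sketch below runs on made-up data (`toy` and the helper column `_filled` are assumptions for illustration). It only ranks rows by how many fields are filled in; it cannot tell whether two rows describe the same funding event, so the manual review above is still needed.

```python
import pandas as pd

# Toy data (hypothetical values) with one company listed twice.
toy = pd.DataFrame({
    "Company/Brand": ["Acme", "Beta", "Acme"],
    "Founded": [None, 2015.0, 2013.0],
    "HeadQuarter": ["Noida", "Mumbai", "Noida"],
})

# Count the filled-in fields per row.
completeness = toy.notna().sum(axis=1)

# Sort so the most complete row of each company comes first, keep that row,
# then restore the original row order.
deduped = (
    toy.assign(_filled=completeness)
       .sort_values("_filled", ascending=False, kind="mergesort")
       .drop_duplicates(subset="Company/Brand", keep="first")
       .drop(columns="_filled")
       .sort_index()
)

print(deduped)
```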

#### ***Checking the Founded Column***

In [140]:
# Sampling the Founded column to check its entries

df_2019.Founded.sample(10)

15    2017.0
4     2004.0
33    2011.0
81    2013.0
76    2018.0
64       NaN
5        NaN
48    2011.0
10    2010.0
13    2019.0
Name: Founded, dtype: float64

***Observation***

No issues with entries

#### ***Checking the HeadQuarter Column***

In [141]:
# Sampling the column

df_2019.HeadQuarter.sample(10)

49    Bangalore
9         Delhi
25        Delhi
79          NaN
87        Delhi
11    Bangalore
21          NaN
71        Surat
46        Noida
57       Mumbai
Name: HeadQuarter, dtype: object

***Observation***

No issues with entries

#### ***Checking the Sector Column***

In [142]:
df_2019.Sector.sample(10)

81    Transport & Rentals
34              Logistics
87             Automobile
6                    SaaS
76             Healthtech
78        Virtual Banking
38          E-marketplace
32          E-marketplace
12                Fintech
47               AgriTech
Name: Sector, dtype: object

***Observation***

No issues with entries

#### ***Checking the What it does Column***

In [143]:
df_2019['What it does'].sample(10)

48                         Enables to order food online
76    It creates an engagement loop between doctors,...
8     It is an AI and big data services company prov...
33    Develops drones that are used by the military,...
3                 Provides interior designing solutions
38        Online fruits and vegetables delivery company
2               It aims to make learning fun via games.
70    Platform uses encryption technology to allow b...
27    Developer of an artificial intelligence-powere...
86    Find automobile repair and maintenance service...
Name: What it does, dtype: object

***Observation***

No issues with entries

#### ***Checking the Founders Column***

In [144]:
df_2019.Founders.sample(10)

48                               Amit Raj, Anshul Gupta
17           Chapman, Priya Sharma, Ashish Anantharaman
69                      Anurag Garg, Sridhar Srinivasan
87    Niraj Singh, Ramanshu Mahaur, Ganesh Pawar, Mo...
10                              Abhay Bhat, Kinnar Shah
2                                         Jatin Solanki
33                             Neel Mehta, Nihar Vartak
7       Pavan Kushwaha, Paratosh Bansal, Dip Jung Thapa
47        Devendra Gupta, Prateek Singhal, Vivek Pandey
25                         Arihant Jain, Ajeet Kushwaha
Name: Founders, dtype: object

***Observation***

No issues with entries

#### ***Checking the Investor Column***

In [145]:
df_2019.Investor.sample(10)

46    Manipal Education and Medical Group (MEMG) fam...
17                                       Goldman Sachs.
27                                      Canaan Partners
23                            Inflection Point Ventures
71                                        HNI investors
15           German development finance institution DEG
24                     Wilson Global Opportunities Fund
62                                     GAIL (India) Ltd
75    Inflection Point Ventures, SucSEED Venture Par...
9                                              SoftBank
Name: Investor, dtype: object

***Observation***

No issues with entries

#### ***Checking the Amount Column***

In [146]:
df_2019['Amount($)'].sample(10)

29      $2,500,000
10     Undisclosed
4       $6,000,000
67      $5,500,000
1     $150,000,000
58      $6,800,000
34     $20,000,000
52      $1,500,000
80    $311,000,000
11     $22,000,000
Name: Amount($), dtype: object

***Observation***

Sampling through the Amount column, the following observations were made:
- The dollar signs ($) preceding some of the figures need to be removed
- "Undisclosed" needs to be replaced with NaN
- Commas used as thousands separators need to be removed

In [147]:
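# Function to clean the Amount($) column

def clean_and_convert(amount):
    # "Undisclosed" marks a missing amount
    if amount == "Undisclosed":
        return float('nan')
    # Otherwise strip the dollar sign and thousands separators
    else:
        amount = amount.replace('$', '').replace(',', '')
        return float(amount)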
    

In [148]:
# Applying the cleaning and conversion function to the amount column

df_2019['Amount($)'] = df_2019['Amount($)'].apply(clean_and_convert)

In [149]:
# Sampling to verify that the changes have taken effect

df_2019['Amount($)'].sample(10)

8      20000000.0
46     50000000.0
15      5000000.0
42      8000000.0
2      28000000.0
85    693000000.0
18       182700.0
83      1000000.0
53      1000000.0
81      4800000.0
Name: Amount($), dtype: float64
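
The row-wise function above calls `.replace` on each value, so it assumes every entry is a string. If the column also held real NaN values, that call would fail. A hedged sketch of a variant that tolerates both "Undisclosed" and true missing values, shown on made-up data:

```python
import pandas as pd

# Toy values (hypothetical): a labelled gap ("Undisclosed") and a real one (None).
amount = pd.Series(["$6,300,000", "Undisclosed", None, "$150,000,000"])

# The .str methods pass missing values through untouched.
stripped = (
    amount.str.replace("$", "", regex=False)
          .str.replace(",", "", regex=False)
)

# errors="coerce" turns any remaining non-numeric text into NaN.
cleaned = pd.to_numeric(stripped, errors="coerce")

print(cleaned.tolist())
```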

#### ***Checking the Stage Column***

In [150]:
# Checking the unique values in the Stage column

df_2019.Stage.unique()

array([nan, 'Series C', 'Fresh funding', 'Series D', 'Pre series A',
       'Series A', 'Series G', 'Series B', 'Post series A',
       'Seed funding', 'Seed fund', 'Series F', 'Series B+', 'Seed round'],
      dtype=object)

***Observation***:

Several rows use different labels for the "Seed" funding stage. We will change the following to "Seed":
- Seed funding
- Seed fund
- Fresh funding
- Seed round

In [151]:
# Function to clean the Stage column

def clean(stage):
    # Map the seed-stage variants to a single label
    if stage in ["Seed funding", "Seed fund", "Fresh funding", "Seed round"]:
        return "Seed"
    else:
        # Leave every other value (including NaN) unchanged
        return stage
    

In [152]:
# Applying the cleaning function to the Stage column

df_2019['Stage'] = df_2019['Stage'].apply(clean)

In [161]:
# Sampling to verify that the changes have taken effect

df_2019['Stage'].sample(10)

27         NaN
1     Series C
39         NaN
78    Series A
28    Series C
18         NaN
50    Series C
75    Series B
36    Series B
32         NaN
Name: Stage, dtype: object
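
A random sample may not happen to include any of the rewritten rows, as in the output above where no "Seed" value appears. A deterministic check such as `value_counts` lists every remaining label. A minimal sketch on made-up values (the `stage` Series is an assumption for illustration):

```python
import pandas as pd

# Toy values (hypothetical) containing every seed variant plus other cases.
stage = pd.Series(["Seed funding", "Seed fund", "Fresh funding",
                   "Seed round", "Series A", None])

seed_variants = ["Seed funding", "Seed fund", "Fresh funding", "Seed round"]

# Replace every variant with the single label "Seed".
cleaned = stage.replace(seed_variants, "Seed")

# value_counts shows all labels that remain, so leftovers cannot hide.
counts = cleaned.value_counts(dropna=False)

print(counts)
```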