# ***Indian Startup Ecosystem Analysis***

## ***Business Understanding***

Between 2018 and 2021, India's startup ecosystem evolved significantly. This project seeks insights into patterns and trends in the data on funding, investment details and startup information. Using data analytics techniques, we examine the effect that funding levels, business location and industry specialization have on the performance of startups. Understanding these dynamics is crucial for businesses and investors seeking to capitalize on the opportunities within India's vibrant startup ecosystem.

Installing relevant libraries

In [1]:
#%pip install pandas

Importing relevant libraries

In [2]:
import pandas as pd
import numpy as np

## 1. ***Data collection/ Data loading***

In [3]:
df_2018 = pd.read_csv('startup_funding2018.csv')
df_2019 = pd.read_csv('startup_funding2019.csv')
df_2020 = pd.read_csv('startup_funding2020.csv')
df_2021 = pd.read_csv('startup_funding2021.csv')

## 2. ***Data Exploration***

We will explore each dataset separately to understand its characteristics and clean it where necessary.

### ***2018 startup funding data exploration and cleaning***

In [4]:
# a view of the first 5 rows in the dataset

df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [5]:
# Checking the data types in the dataset

df_2018.dtypes

Company Name     object
Industry         object
Round/Series     object
Amount           object
Location         object
About Company    object
dtype: object

In [6]:
# Checking for missing values

df_2018.isnull().sum()

Company Name     0
Industry         0
Round/Series     0
Amount           0
Location         0
About Company    0
dtype: int64

In [7]:
# Checking for duplicates in the Company Name column

df_2018['Company Name'].duplicated().value_counts()

Company Name
False    525
True       1
Name: count, dtype: int64

In [8]:
# Displaying the details of the duplicated values in the Company Name column
# (the boolean mask is used as-is; sorting it would misalign it with the DataFrame index)

df_2018[df_2018['Company Name'].duplicated(keep=False)]


Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
348,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."


***Observation***: This is a clear case of duplication, so we drop one of the duplicated rows.

In [9]:
# Dropping duplicated row based on the column company name

df_2018.drop_duplicates(subset= ['Company Name'], keep= 'first', inplace= True, ignore_index= True)

# Verifying that no duplicated company names remain
df_2018['Company Name'].duplicated().sum()

0

***Checking the Industry Column***

In [10]:
# Checking the Industry column

df_2018['Industry'].sample(15)

51               Media and Entertainment, News, Outdoors
113                                    EdTech, Education
137                                             Hospital
243                                                    —
286    Basketball, Cricket, Cycling, eSports, Fitness...
382    Finance, FinTech, Payments, Property Developme...
121                                                    —
231    Dietary Supplements, Food and Beverage, Health...
47                        Fitness, Health Care, Wellness
134                            Consumer Lending, FinTech
312                                Digital Entertainment
110              E-Learning, Education, Higher Education
379                    Apps, Financial Services, FinTech
71                            Consulting, Retail, Social
55                                  Health Care, Medical
Name: Industry, dtype: object

***Observation***
Sampling the Industry column, the following points are worth noting:

- The Industry column lists sectors, subsectors and other labels. We keep only the first entry, which is predominantly the sector.
- Some values are the placeholder "—". We will replace them with NaN.
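
To make the intended transformation concrete, here is a minimal sketch on made-up toy values (the Series `industry` and the name `first_label` are illustrative, not part of the dataset). It keeps the first comma-separated entry and turns the dash placeholder into NaN; the extra `str.strip()` is an assumption added to guard against stray whitespace.

```python
import numpy as np
import pandas as pd

# Toy values imitating the formats seen in the Industry column (hypothetical data)
industry = pd.Series([
    "Agriculture, Farming",
    "—",
    "Credit, Financial Services, Lending",
    "Hospital",
])

# Keep only the first comma-separated entry and trim surrounding whitespace
first_label = industry.str.split(",").str[0].str.strip()

# Treat the dash placeholder as a missing value
first_label = first_label.replace("—", np.nan)

print(first_label)
```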

In [11]:
# Keep the first entry (the sector) of the Industry column

df_2018['Industry'] = df_2018.Industry.str.split(",").str[0]

# Replacing "—" with NaN

df_2018['Industry'] = df_2018.Industry.replace("—" ,  np.nan)

In [12]:
# Sampling the Industry column to verify changes

df_2018['Industry'].sample(20)

184               Education
25              Agriculture
446         Cloud Computing
38               E-Commerce
314                Delivery
145              E-Learning
79                   Mobile
471                     NaN
322          Medical Device
8      Information Services
480              Industrial
131           Manufacturing
156                 Banking
515       Food and Beverage
12                     Apps
284    Information Services
166           Digital Media
64         Trading Platform
426             Health Care
351            Smart Cities
Name: Industry, dtype: object

***Checking the Round/ Series Column***

In [13]:
df_2018['Round/Series'].unique()

array(['Seed', 'Series A', 'Angel', 'Series B', 'Pre-Seed',
       'Private Equity', 'Venture - Series Unknown', 'Grant',
       'Debt Financing', 'Post-IPO Debt', 'Series H', 'Series C',
       'Series E', 'Corporate Round', 'Undisclosed',
       'https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593',
       'Series D', 'Secondary Market', 'Post-IPO Equity',
       'Non-equity Assistance', 'Funding Round'], dtype=object)

***Observation***: Looking at the unique values of the Round/Series column, we noticed two that do not belong in the column:
- Undisclosed. Since these funding rounds were undisclosed, we are going to replace them with NaN.
- Accessing the link https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593 shows that it refers to seed funding, so we replace it with Seed.
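
Out-of-place entries like these could also be flagged programmatically rather than by eye. The following is only a sketch on made-up toy values: the Series `rounds` and the URL `https://example.com/sheet` are hypothetical, and the check assumes that any value starting with "http" is a stray link. It only flags such rows for manual review; deciding what a link should be relabeled as still requires opening it.

```python
import numpy as np
import pandas as pd

# Toy values imitating the kinds of entries seen in Round/Series (hypothetical data)
rounds = pd.Series(["Seed", "Undisclosed", "https://example.com/sheet", "Series A"])

# Flag entries that look like links rather than funding stages
looks_like_url = rounds.str.startswith("http")

# Treat "Undisclosed" as a missing value
cleaned = rounds.replace("Undisclosed", np.nan)

# Show the rows that need manual review before any relabeling
print(rounds[looks_like_url])
```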

In [14]:
# Replacing Undisclosed with nan
df_2018['Round/Series'] = df_2018['Round/Series'].replace("Undisclosed", np.nan)

In [15]:
# Replacing the link with seed
df_2018['Round/Series'] = df_2018['Round/Series'].replace("https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593", "Seed")

In [16]:
# Verifying changes made in the Round/ Series column

df_2018['Round/Series'].unique()

array(['Seed', 'Series A', 'Angel', 'Series B', 'Pre-Seed',
       'Private Equity', 'Venture - Series Unknown', 'Grant',
       'Debt Financing', 'Post-IPO Debt', 'Series H', 'Series C',
       'Series E', 'Corporate Round', nan, 'Series D', 'Secondary Market',
       'Post-IPO Equity', 'Non-equity Assistance', 'Funding Round'],
      dtype=object)

***Checking the Amount Column***

In [17]:
# Sampling the Amount column

df_2018.Amount.sample(20)

221       ₹22,500,000
280                 —
332           1000000
352        $1,100,000
241       ₹19,200,000
16             150000
251       ₹35,000,000
222        ₹5,000,000
268       ₹16,600,000
41             150000
440          30000000
92     ₹2,000,000,000
331                 —
259           1000000
189           6830000
458           2500000
253      ₹100,000,000
455      ₹280,000,000
70            ₹60,000
245           1000000
Name: Amount, dtype: object

***Observation***: Sampling the Amount column, the following observations were made:
- Some amounts are in rupees (₹). They must be converted to dollars using a 2018 exchange rate.
- The dollar signs ($) preceding some of the figures need to be removed.
- The placeholder — needs to be replaced with NaN.
- Commas used as thousands separators need to be removed.

***Handling the observations made***

In [18]:
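# Rupee-to-dollar conversion rate used for the 2018 data
conversion_rate = 0.0146

# Function to clean and convert the Amount column

def clean_and_convert(amount):
    if amount == "—":
        # The dash placeholder marks a missing amount
        return float('nan')
    elif '₹' in amount:
        # Rupee amounts: strip the symbol and commas, then convert to dollars
        amount = amount.replace('₹', '').replace(',', '')
        return float(amount) * conversion_rate
    else:
        # Remaining amounts are treated as dollars: strip '$' and commas
        amount = amount.replace('$', '').replace(',', '')
        return float(amount)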

In [19]:
# Applying the cleaning and conversion function to the amount column

df_2018['Amount'] = df_2018['Amount'].apply(clean_and_convert)
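
As a cross-check of the cleaning logic above, the same steps can be sketched with vectorized pandas string methods instead of a row-wise function. This is an alternative sketch on made-up toy values, not the method used in this notebook; the names `INR_TO_USD`, `raw`, `is_rupee` and `amount_usd` are illustrative, and the rate is assumed to be the same 0.0146 used above.

```python
import numpy as np
import pandas as pd

# Assumed rupee-to-dollar rate, same value as the notebook's conversion_rate
INR_TO_USD = 0.0146

# Toy values imitating the formats seen in the Amount column (hypothetical data)
raw = pd.Series(["250000", "₹40,000,000", "$1,100,000", "—"])

# Remember which rows are rupee amounts before stripping the symbols
is_rupee = raw.str.contains("₹", regex=False)

# Remove currency symbols and thousands separators
digits = raw.str.replace(r"[₹$,]", "", regex=True)

# Parse to numbers; anything non-numeric (the dash placeholder) becomes NaN
amount_usd = pd.to_numeric(digits, errors="coerce")

# Convert only the rupee rows to dollars, leaving the others unchanged
amount_usd = amount_usd.where(~is_rupee, amount_usd * INR_TO_USD)

print(amount_usd)
```

With these toy inputs, the rupee value ₹40,000,000 comes out at roughly 584,000 dollars, which agrees with the converted figure shown for Happy Cow Dairy in the cleaned table.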

In [20]:
# Sampling to verify that the changes have been applied

df_2018['Amount'].sample(20)

361      6000000.0
390      1022000.0
127     36500000.0
5        1600000.0
467      3500000.0
249      4200000.0
114      2400000.0
146     14000000.0
139            NaN
208            NaN
352      1100000.0
431            NaN
235       400000.0
507            NaN
515     20440000.0
105      8030000.0
478            NaN
395      5000000.0
520    225000000.0
282            NaN
Name: Amount, dtype: float64

***Checking the Location Column***

In [21]:
# Sampling the location column

df_2018.Location.sample(20)

250       Jaipur, Rajasthan, India
372        Gurgaon, Haryana, India
475    Bengaluru, Karnataka, India
104     Mumbai, Maharashtra, India
278    Bengaluru, Karnataka, India
437    Bengaluru, Karnataka, India
332    Bengaluru, Karnataka, India
286        New Delhi, Delhi, India
130    Bangalore, Karnataka, India
175    Bengaluru, Karnataka, India
25           Mohali, Punjab, India
128        New Delhi, Delhi, India
186      Ahmedabad, Gujarat, India
137     Mumbai, Maharashtra, India
146       Pune, Maharashtra, India
401    Bengaluru, Karnataka, India
398    Bangalore, Karnataka, India
98          Kota, Rajasthan, India
341       Pune, Maharashtra, India
322    Bangalore, Karnataka, India
Name: Location, dtype: object

In [24]:
# Splitting Location into City and State
df_2018[['City', 'State']] = df_2018['Location'].str.split(', ', expand=True)[[0, 1]]

# Displaying the DataFrame
df_2018

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,City,State
0,TheCollegeFever,Brand Marketing,Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",Bangalore,Karnataka
1,Happy Cow Dairy,Agriculture,Seed,584000.0,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,Mumbai,Maharashtra
2,MyLoanCare,Credit,Series A,949000.0,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,Gurgaon,Haryana
3,PayMe India,Financial Services,Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,Noida,Uttar Pradesh
4,Eunimart,E-Commerce Platforms,Seed,,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,Hyderabad,Andhra Pradesh
...,...,...,...,...,...,...,...,...
520,Udaan,B2B,Series C,225000000.0,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif...",Bangalore,Karnataka
521,Happyeasygo Group,Tourism,Series A,,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.,Haryana,Haryana
522,Mombay,Food and Beverage,Seed,7500.0,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...,Mumbai,Maharashtra
523,Droni Tech,Information Technology,Seed,511000.0,"Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...,Mumbai,Maharashtra
