# Business Understanding
According to [The Times of India](https://timesofindia.indiatimes.com/business/india-business/india-becomes-third-largest-startup-ecosystem-in-the-world/articleshow/85871428.cms), India has emerged as the third-largest startup ecosystem in the world, after the US and China. In light of this, our team aims to enter the Indian startup ecosystem strategically, using data-driven insights to identify high-potential opportunities. Through research and analysis, we seek to understand the funding received by startups in India from 2018 to 2021.

#### Hypothesis
**Null Hypothesis (Ho):** The sector of a startup has no significant influence on the funding it receives.<br>

**Alternative Hypothesis (Ha):** The sector of a startup has a significant influence on the funding it receives.
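
One possible way to test these hypotheses is sketched below. It is only an illustration under stated assumptions: the helper name `sector_permutation_test`, the toy data, and the choice of statistic (the spread between sector medians of log-funding) are ours and are not taken from the analysis in this notebook. A permutation test shuffles the sector labels to simulate the null hypothesis and checks how often the shuffled spread is at least as large as the observed one.

```python
import numpy as np
import pandas as pd


def sector_permutation_test(df, n_permutations=2000, seed=0):
    """Permutation test: does median log-funding differ across sectors?

    Hypothetical helper; expects the columns 'sector' and 'amount'.
    """
    rng = np.random.default_rng(seed)
    log_amount = np.log10(df["amount"].to_numpy(dtype=float))
    sectors = df["sector"].to_numpy()

    def spread(labels):
        # Gap between the highest and the lowest sector median
        medians = pd.Series(log_amount).groupby(labels).median()
        return medians.max() - medians.min()

    observed = spread(sectors)
    # Shuffling the labels breaks any link between sector and funding
    permuted = np.array(
        [spread(rng.permutation(sectors)) for _ in range(n_permutations)]
    )
    # Add-one correction keeps the p-value strictly positive
    p_value = (np.sum(permuted >= observed) + 1) / (n_permutations + 1)
    return observed, p_value


# Toy data (illustrative only): three sectors with clearly different funding levels
toy = pd.DataFrame({
    "sector": ["FinTech"] * 8 + ["EdTech"] * 8 + ["AgriTech"] * 8,
    "amount": [5e6, 8e6, 1e7, 2e7, 3e7, 4e7, 6e7, 9e7,
               1e6, 2e6, 3e6, 4e6, 5e6, 6e6, 8e6, 1e7,
               1e5, 2e5, 3e5, 4e5, 5e5, 6e5, 8e5, 1e6],
})
observed, p_value = sector_permutation_test(toy, n_permutations=500)
print(f"observed spread = {observed:.2f}, p-value = {p_value:.4f}")
```

A small p-value would be evidence against Ho. On real data, sectors with very few startups would probably need to be grouped or excluded first.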

# Data Understanding
This dataset describes the amount of funding startups received from 2018 to 2021, along with each startup's sector, headquarters, what it does, year of establishment, name, investors, and funding stage.<br>

`Feature Description`:
- **Company_Brand:** Name of the startup
- **Founded:** Year of establishment
- **HeadQuarter:** Location of the startup's headquarters
- **Sector:** Sector or industry of the startup
- **What_it_does:** What the startup does
- **Founders:** Name(s) of the founder(s)
- **Investor:** Name(s) of the investor(s)
- **Amount:** Amount of investment, in USD or INR
- **Stage:** Phase of development (e.g. Ideation Stage, Pre-Seed Stage, Seed Stage, Early Stage (Series A, B, etc.))
- **Year:** Year the startup received funding

#### Analytical Questions
1. How is funding spread across the years?
2. What are the dominant sectors within the Indian startup ecosystem across the years?
3. Are there any emerging sectors that have shown a significant increase in funding year over year?
4. Which locations in India appear most favourable for startups to survive and grow?
5. How does the startup's location influence its funding and growth opportunities?
6. Is there a relationship between what a startup does and the funding it receives?
7. Is there a correlation between the year a startup received funding and the amount of funding it received?
8. Which cities or regions have the highest concentration of funded startups?

### Import libraries

In [1]:
# Import necessary libraries and packages
import pyodbc
from dotenv import dotenv_values #import the dotenv_values function from the dotenv package
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### Load Environment Variables and Create SQL Server Connection

In [2]:
# Load environment variables from .env file
environment_variables = dotenv_values('.env')
# Access login credentials from  the '.env' file
server = environment_variables.get("SERVER")
database = environment_variables.get("DATABASE")
username = environment_variables.get("UID")
password = environment_variables.get("PWD")

In [3]:
# Build the connection string
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"

# connect to the server using pyodbc
connection = pyodbc.connect(connection_string)
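
If a key is missing from `.env`, the f-string above silently inserts `None` and the failure only shows up when connecting. A minimal sketch of an early check follows; the helper name `build_connection_string` and the example values are hypothetical and not part of this notebook.

```python
def build_connection_string(env, driver="SQL Server"):
    """Build an ODBC connection string, failing early on missing settings.

    Hypothetical helper; `env` is any mapping, such as the result of dotenv_values.
    """
    required = ["SERVER", "DATABASE", "UID", "PWD"]
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return (
        f"DRIVER={{{driver}}};SERVER={env['SERVER']};DATABASE={env['DATABASE']};"
        f"UID={env['UID']};PWD={env['PWD']}"
    )


# Example values are placeholders, not real credentials
example = {"SERVER": "localhost", "DATABASE": "demo", "UID": "user", "PWD": "secret"}
print(build_connection_string(example))
```

The returned string can be passed to `pyodbc.connect` exactly as in the cell above.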

### Load Datasets 

In [4]:
# Write queries to retrieve tables from the database
query1 = "Select * from dbo.LP1_startup_funding2020"
query2 = "Select * from dbo.LP1_startup_funding2021"

# Retrieve dataset from database with connection created
df_2020 = pd.read_sql(query1, connection)
df_2021 = pd.read_sql(query2, connection)

# Load CSV files
df_2018 = pd.read_csv('Data/startup_funding2018.csv')
df_2019 = pd.read_csv('Data/startup_funding2019.csv')

## Exploring data quality and characteristics

#### 2020 Dataset Exploration and Preparation

In [5]:
# Preview dataframe
df_2020.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,


In [6]:
print(df_2020.info(), "\n====================== Null Value Percentage ==========================")
print(df_2020.isna().mean().mul(100), "\n======================= Duplicated rows =========================")
df_2020.loc[df_2020.duplicated()]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.5+ KB
None 
Company_Brand     0.000000
Founded          20.189573
HeadQuarter       8.909953
Sector            1.232227
What_it_does      0.000000
Founders          1.137441
Investor          3.601896
Amount           24.075829
Stage            43.981043
column10         99.810427
dtype: float64 


Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
145,Krimanshi,2015.0,Jodhpur,Biotechnology company,Krimanshi aims to increase rural income by imp...,Nikhil Bohra,"Rajasthan Venture Capital Fund, AIM Smart City",600000.0,Seed,
205,Nykaa,2012.0,Mumbai,Cosmetics,Nykaa is an online marketplace for different b...,Falguni Nayar,"Alia Bhatt, Katrina Kaif",,,
362,Byju’s,2011.0,Bangalore,EdTech,An Indian educational technology and online tu...,Byju Raveendran,"Owl Ventures, Tiger Global Management",500000000.0,,


Drop Duplicates

In [7]:
# Drop duplicated rows from DataFrame
df_2020.drop_duplicates(keep = "first", inplace = True)

`column10` is 99.8% null, so we drop it

In [8]:
# Drop column10
df_2020.drop(columns = ["column10"], inplace = True)
# Preview remaining columns 
df_2020.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')
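
Dropping by name works well for a single column. As a sketch under assumptions, a threshold-based variant could drop every mostly-empty column at once. The 80% cut-off and the toy frame below are illustrative choices, not values used in this notebook.

```python
import pandas as pd

# Toy frame (illustrative): 'column10' is 90% null, 'Amount' is 30% null
toy = pd.DataFrame({
    "Company_Brand": list("ABCDEFGHIJ"),
    "Amount": [1.0] * 7 + [None] * 3,
    "column10": [None] * 9 + ["x"],
})

threshold = 0.8  # assumed cut-off: drop columns that are more than 80% null
mostly_null = toy.columns[toy.isna().mean() > threshold]
trimmed = toy.drop(columns=mostly_null)
print(list(trimmed.columns))
```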

#### 2021 Dataset Exploration and Preparation

In [9]:
# Preview dataframe
df_2021.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D


In [10]:
print(df_2021.info(), "\n====================== Null Value Percentage ======================")
print(df_2021.isna().mean().mul(100), "\n====================== Duplicated rows ======================")
df_2021.loc[df_2021.duplicated()]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB
None 
Company_Brand     0.000000
Founded           0.082713
HeadQuarter       0.082713
Sector            0.000000
What_it_does      0.000000
Founders          0.330852
Investor          5.128205
Amount            0.248139
Stage            35.401158
dtype: float64 


Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
107,Curefoods,2020.0,Bangalore,Food & Beverages,Healthy & nutritious foods and cold pressed ju...,Ankit Nagori,"Iron Pillar, Nordstar, Binny Bansal",$13000000,
109,Bewakoof,2012.0,Mumbai,Apparel & Fashion,Bewakoof is a lifestyle fashion brand that mak...,Prabhkiran Singh,InvestCorp,$8000000,
111,FanPlay,2020.0,Computer Games,Computer Games,A real money game app specializing in trivia g...,YC W21,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000
117,Advantage Club,2014.0,Mumbai,HRTech,Advantage Club is India's largest employee eng...,"Sourabh Deorah, Smiti Bhatt Deorah","Y Combinator, Broom Ventures, Kunal Shah",$1700000,
119,Ruptok,2020.0,New Delhi,FinTech,Ruptok fintech Pvt. Ltd. is an online gold loa...,Ankur Gupta,Eclear Leasing,$1000000,
243,Trinkerr,2021.0,Bangalore,Capital Markets,Trinkerr is India's first social trading platf...,"Manvendra Singh, Gaurav Agarwal",Accel India,$6600000,Series A
244,Zorro,2021.0,Gurugram,Social network,Pseudonymous social network platform,"Jasveer Singh, Abhishek Asthana, Deepak Kumar","Vijay Shekhar Sharma, Ritesh Agarwal, Ankiti Bose",$32000000,Seed
245,Ultraviolette,2021.0,Bangalore,Automotive,Create and Inspire the future of sustainable u...,"Subramaniam Narayan, Niraj Rajmohan","TVS Motor, Zoho",$150000000,Series C
246,NephroPlus,2009.0,Hyderabad,Hospital & Health Care,A vision and passion of redefining healthcare ...,Vikram Vuppala,IIFL Asset Management,$24000000,Series E
247,Unremot,2020.0,Bangalore,Information Technology & Services,Unremot is a personal office for consultants!,Shiju Radhakrishnan,Inflection Point Ventures,$700000,Seed


Drop Duplicates

In [11]:
# Drop duplicated rows
df_2021.drop_duplicates(keep = "first", inplace = True)

#### 2018 Dataset Exploration and Preparation


In [12]:
# Preview dataframe
df_2018.head(3)

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India


In [13]:
print(df_2018.info(), "\n====================== Null Value Percentage ======================")
print(df_2018.isna().mean().mul(100), "\n====================== Duplicated rows ======================")
df_2018.loc[df_2018.duplicated()]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB
None 
Company Name     0.0
Industry         0.0
Round/Series     0.0
Amount           0.0
Location         0.0
About Company    0.0
dtype: float64 


Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
348,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."


Drop Duplicates

In [14]:
# Drop duplicated rows from DataFrame
df_2018.drop_duplicates(keep = "first", inplace = True)

#### 2019 Dataset Exploration and Preparation


In [15]:
# Preview dataframe
df_2019.head(3)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding


In [16]:
print(df_2019.info(), "\n====================== Null Value Percentage ======================")
print(df_2019.isna().mean().mul(100), "\n====================== Duplicated rows ======================")
df_2019.loc[df_2019.duplicated()]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB
None 
Company/Brand     0.000000
Founded          32.584270
HeadQuarter      21.348315
Sector            5.617978
What it does      0.000000
Founders          3.370787
Investor          0.000000
Amount($)         0.000000
Stage            51.685393
dtype: float64 


Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
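
The same three checks (info, null percentages, duplicated rows) are repeated for each year. A small helper could bundle the last two. The sketch below uses a hypothetical function name, `quality_report`, and toy data.

```python
import pandas as pd


def quality_report(df):
    """Return null percentages and duplicated rows (hypothetical helper)."""
    null_pct = df.isna().mean().mul(100).round(2)
    duplicates = df.loc[df.duplicated()]
    return null_pct, duplicates


# Toy data (illustrative): row 2 repeats row 1, and half of 'Amount' is missing
toy = pd.DataFrame({
    "Company_Brand": ["A", "B", "B", "C"],
    "Amount": [100.0, None, None, 300.0],
})
null_pct, duplicates = quality_report(toy)
print(null_pct)
print(duplicates)
```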


### Add year_funded column to each dataset

In [17]:
# Add year column to dataframe
df_2020["Year_Funded"] = 2020
# Convert data type to datetime format
df_2020["Year_Funded"] = pd.to_datetime(df_2020["Year_Funded"], format = "%Y")

# Add year column to dataframe
df_2021["Year_Funded"] = 2021
# Convert data type to datetime format
df_2021["Year_Funded"] = pd.to_datetime(df_2021["Year_Funded"], format = "%Y")

# Add year column to dataframe
df_2019["Year_Funded"] = 2019
# Convert data type to datetime format
df_2019["Year_Funded"] = pd.to_datetime(df_2019["Year_Funded"], format = "%Y")

# Add year column to dataframe
df_2018["Year_Funded"] = 2018
# Convert data type to datetime format
df_2018["Year_Funded"] = pd.to_datetime(df_2018["Year_Funded"], format = "%Y")

# Preview dataframe
df_2020.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Year_Funded
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,2020-01-01
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,2020-01-01
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,2020-01-01
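
The four near-identical blocks above could be collapsed into a loop. This is a sketch on toy frames; the `frames` dictionary is a hypothetical stand-in for the four yearly DataFrames.

```python
import pandas as pd

# Toy stand-ins for the yearly DataFrames (illustrative only)
frames = {
    2018: pd.DataFrame({"Company_Brand": ["A"]}),
    2019: pd.DataFrame({"Company_Brand": ["B", "C"]}),
}

for year, frame in frames.items():
    # Same result as in the notebook: the year becomes a timestamp at 1 January
    frame["Year_Funded"] = pd.to_datetime(str(year), format="%Y")

print(frames[2019])
```

A plain integer column would also support grouping by year and avoids implying a specific date, so the datetime conversion is a design choice rather than a requirement.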


#### Establish uniformity in column names

In [18]:
# Rename columns
df_2019.rename(columns={'Company/Brand':'Company_Brand','Amount($)':'Amount', 'What it does': 'What_it_does'}, inplace=True)

df_2018.rename(columns={'Company Name':'Company_Brand','Industry':'Sector','Round/Series':'Stage','Location':'HeadQuarter','About Company':'What_it_does'} ,inplace=True)

In [19]:
def rename_column(df):
    """
    This function takes in a dataframe and renames the column names to lower case
    """
    df.columns = [col_name.lower() for col_name in df.columns]
    return df

# Apply function to DataFrames
df_2018.pipe(rename_column)
df_2019.pipe(rename_column)
df_2020.pipe(rename_column)
df_2021.pipe(rename_column)

# Preview column names
df_2021.columns

Index(['company_brand', 'founded', 'headquarter', 'sector', 'what_it_does',
       'founders', 'investor', 'amount', 'stage', 'year_funded'],
      dtype='object')

In [20]:
# Compare if column names are uniform
print(df_2018.columns)
print(df_2019.columns)
print(df_2020.columns)
print(df_2021.columns)

Index(['company_brand', 'sector', 'stage', 'amount', 'headquarter',
       'what_it_does', 'year_funded'],
      dtype='object')
Index(['company_brand', 'founded', 'headquarter', 'sector', 'what_it_does',
       'founders', 'investor', 'amount', 'stage', 'year_funded'],
      dtype='object')
Index(['company_brand', 'founded', 'headquarter', 'sector', 'what_it_does',
       'founders', 'investor', 'amount', 'stage', 'year_funded'],
      dtype='object')
Index(['company_brand', 'founded', 'headquarter', 'sector', 'what_it_does',
       'founders', 'investor', 'amount', 'stage', 'year_funded'],
      dtype='object')


## Concatenate Datasets

In [21]:
# Combine the 2018, 2019, 2020 and 2021 dataframes into a single dataframe
df_combined = pd.concat([df_2020, df_2021, df_2018, df_2019], axis = 0)
# Preview combined dataframe
df_combined.head(3)

Unnamed: 0,company_brand,founded,headquarter,sector,what_it_does,founders,investor,amount,stage,year_funded
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,2020-01-01
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,2020-01-01
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,2020-01-01
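
Note that `pd.concat` keeps each frame's original row labels, so the combined index contains repeated labels. The toy sketch below (frames `a` and `b` are illustrative) shows the effect on label-based assignment and how `ignore_index=True` avoids it.

```python
import pandas as pd

a = pd.DataFrame({"company_brand": ["A0", "A1"], "amount": [1.0, 2.0]})
b = pd.DataFrame({"company_brand": ["B0", "B1"], "amount": [3.0, 4.0]})

stacked = pd.concat([a, b], axis=0)
print(stacked.index.is_unique)       # labels 0 and 1 each appear twice
stacked.loc[1, "amount"] = -1.0      # writes to every row labelled 1

renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(renumbered.index.is_unique)    # labels now run from 0 to 3
```

Because later cells edit rows by label, switching to `ignore_index=True` would change which labels those cells need to reference.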


In [22]:
print(df_combined.info(), "\n====================== Null Value Percentage ======================")
print(df_combined.isna().mean().mul(100), "\n====================== Duplicated rows ======================")
df_combined.loc[df_combined.duplicated()]

<class 'pandas.core.frame.DataFrame'>
Index: 2856 entries, 0 to 88
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   company_brand  2856 non-null   object        
 1   founded        2088 non-null   float64       
 2   headquarter    2742 non-null   object        
 3   sector         2838 non-null   object        
 4   what_it_does   2856 non-null   object        
 5   founders       2312 non-null   object        
 6   investor       2232 non-null   object        
 7   amount         2600 non-null   object        
 8   stage          1927 non-null   object        
 9   year_funded    2856 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(8)
memory usage: 245.4+ KB
None 
company_brand     0.000000
founded          26.890756
headquarter       3.991597
sector            0.630252
what_it_does      0.000000
founders         19.047619
investor         21.848739
amount            8.

Unnamed: 0,company_brand,founded,headquarter,sector,what_it_does,founders,investor,amount,stage,year_funded


# Data Preparation

Drop columns that are not present in all four datasets

In [23]:
df_combined.drop(columns = ["founded", "founders", "investor"], inplace = True)
df_combined.head()

Unnamed: 0,company_brand,headquarter,sector,what_it_does,amount,stage,year_funded
0,Aqgromalin,Chennai,AgriTech,Cultivating Ideas for Profit,200000.0,,2020-01-01
1,Krayonnz,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,100000.0,Pre-seed,2020-01-01
2,PadCare Labs,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,,Pre-seed,2020-01-01
3,NCOME,New Delhi,Escrow,Escrow-as-a-service platform,400000.0,,2020-01-01
4,Gramophone,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,340000.0,,2020-01-01


### Column Cleaning
`Column: amount`

In [24]:
df_combined["amount"].unique()[:40]

array([200000.0, 100000.0, nan, 400000.0, 340000.0, 600000.0, 45000000.0,
       1000000.0, 2000000.0, 1200000.0, 660000000.0, 120000.0, 7500000.0,
       5000000.0, 500000.0, 3000000.0, 10000000.0, 145000000.0,
       100000000.0, 21000000.0, 4000000.0, 20000000.0, 560000.0, 275000.0,
       4500000.0, 15000000.0, 390000000.0, 7000000.0, 5100000.0,
       700000000.0, 2300000.0, 700000.0, 19000000.0, 9000000.0,
       40000000.0, 750000.0, 1500000.0, 7800000.0, 50000000.0, 80000000.0],
      dtype=object)

Amount and Stage columns have some of their values interchanged

In [25]:
df_combined["stage"].unique()[30:-15]

array(['Pre series A1', 'Series E2', 'Pre series A', 'Seed Round',
       'Bridge Round', 'Pre seed round', 'Pre series B', 'Pre series C',
       'Seed Investment', 'Series D1', 'Mid series', 'Series C, D',
       'Seed funding', '$1200000', 'Seed+', 'Series F2', 'Series A+',
       'Series G', 'Series B3', 'PE', 'Series F1', 'Pre-series A1',
       '$300000', 'Early seed', '$6000000', '$1000000', 'Seies A',
       'Series A2', 'Series I', 'Angel', 'Private Equity',
       'Venture - Series Unknown'], dtype=object)

In [26]:
df_combined[df_combined["stage"] == "$1000000"]

Unnamed: 0,company_brand,headquarter,sector,what_it_does,amount,stage,year_funded
677,Saarthi Pedagogy,Ahmadabad,EdTech,"India's fastest growing Pedagogy company, serv...","JITO Angel Network, LetsVenture",$1000000,2021-01-01


In [27]:
# Put values in their appropriate columns
df_combined.at[677, "amount"], df_combined.at[677, "stage"] = 1000000, np.nan
df_combined.at[545, "amount"], df_combined.at[545, "stage"] = np.nan, "Pre-series A"
df_combined.at[551, "amount"], df_combined.at[551, "stage"] = 300000, np.nan
df_combined.at[538, "amount"], df_combined.at[538, "stage"] = 300000, np.nan
df_combined.at[242, "amount"], df_combined.at[242, "stage"] = np.nan, "Seed"
df_combined.at[257, "amount"], df_combined.at[257, "stage"] = np.nan, "Seed"
df_combined.at[1148, "amount"], df_combined.at[1148, "stage"] = np.nan, "Seed"
df_combined.at[98, "amount"], df_combined.at[98, "stage"] = 1200000, np.nan
df_combined.at[674, "amount"], df_combined.at[674, "stage"] = 6000000, np.nan
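
Hard-coded row labels are fragile if the data is reloaded or reindexed. As a sketch, rows where a dollar figure slipped into `stage` could instead be found with a boolean mask. It assumes such rows can be recognised by a leading `$`, uses toy data, and covers only this one pattern (not the rows where a stage name ended up in `amount`).

```python
import numpy as np
import pandas as pd

# Toy data (illustrative): rows 0 and 2 have the amount sitting in 'stage'
toy = pd.DataFrame({
    "amount": ["JITO Angel Network", "$2,000,000", "Upsparks"],
    "stage": ["$1000000", "Seed", "$1200000"],
})

# Treat a stage that starts with "$" as a shifted amount
shifted = toy["stage"].str.startswith("$", na=False)
toy.loc[shifted, "amount"] = toy.loc[shifted, "stage"]
toy.loc[shifted, "stage"] = np.nan
print(toy)
```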


In [28]:
# Clean the amount column and convert Indian rupees (INR) to US dollars (USD)
def clean_amount(amount):
    """
    Strips commas and the "$" / "₹" symbols and returns the amount in USD as a float.
    INR values are converted at an assumed fixed rate of 1 USD = 70 INR.
    """
    try:
        amount = str(amount)
        # Remove thousands separators
        amount = amount.replace(",", "")
        # INR amounts: strip the symbol and convert to USD
        if "₹" in amount:
            amount = amount.replace("₹", "")
            return round(float(amount) / 70, 2)
        # USD amounts: strip the symbol
        elif "$" in amount:
            amount = amount.replace("$", "")
            return round(float(amount), 2)
        # No currency symbol: assume the value is already in USD
        else:
            return round(float(amount), 2)
    except ValueError:
        # Non-numeric entries become NaN
        return np.nan

# Apply the clean_amount function to the 'amount' column
df_combined["amount"] = df_combined["amount"].apply(clean_amount)

# Preview dataframe
df_combined["amount"].unique()[:35]

array([2.00e+05, 1.00e+05,      nan, 4.00e+05, 3.40e+05, 6.00e+05,
       4.50e+07, 1.00e+06, 2.00e+06, 1.20e+06, 6.60e+08, 1.20e+05,
       7.50e+06, 5.00e+06, 5.00e+05, 3.00e+06, 1.00e+07, 1.45e+08,
       1.00e+08, 2.10e+07, 4.00e+06, 2.00e+07, 5.60e+05, 2.75e+05,
       4.50e+06, 1.50e+07, 3.90e+08, 7.00e+06, 5.10e+06, 7.00e+08,
       2.30e+06, 7.00e+05, 1.90e+07, 9.00e+06, 4.00e+07])

In [30]:
# Fill null values with the median
df_combined.fillna({"amount": df_combined["amount"].median()}, inplace = True)
df_combined.isna().sum()

company_brand      0
headquarter      114
sector            18
what_it_does       0
amount             0
stage            933
year_funded        0
dtype: int64
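
Median imputation hides which amounts were originally missing. One optional refinement, sketched here on toy data with a hypothetical flag column `amount_was_missing`, is to record that information before filling.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative): two missing amounts and one very large outlier
toy = pd.DataFrame({"amount": [1e5, np.nan, 3e5, 1e9, np.nan]})

# Record which rows were missing before they are filled
toy["amount_was_missing"] = toy["amount"].isna()
# The median is less sensitive to the outlier than the mean would be
toy["amount"] = toy["amount"].fillna(toy["amount"].median())
print(toy)
```

The flag makes it possible to rerun an analysis later with the imputed rows excluded.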

`Column: headquarter`

In [48]:
df_combined["headquarter"].unique()[80:-50]

array(['Rajsamand', 'Ranchi', 'Faridabad, Haryana', 'Computer Games',
       'Vadodara', 'Food & Beverages', 'Pharmaceuticals\t#REF!',
       'Gurugram\t#REF!', 'Mohali', 'Powai', 'Ghaziabad', 'Nagpur',
       'West Bengal', 'Samsitpur', 'Lucknow', 'Telangana', 'Silvassa',
       'Thiruvananthapuram', 'Faridabad', 'Roorkee', 'Ambernath',
       'Panchkula', 'Surat', 'Mangalore', 'Telugana', 'Bhubaneswar',
       'Kottayam', 'Beijing', 'Panaji', 'Satara', 'Orissia', 'Santra',
       'Mountain View, CA', 'Trivandrum', 'Jharkhand', 'Bhilwara',
       'Guwahati', 'Online Media\t#REF!', 'London',
       'Information Technology & Services', 'The Nilgiris', 'Gandhinagar',
       'Bangalore, Karnataka, India', 'Mumbai, Maharashtra, India'],
      dtype=object)

In [43]:
df_combined[df_combined["headquarter"] == "Information Technology & Services"]

Unnamed: 0,company_brand,headquarter,sector,what_it_does,amount,stage,year_funded
1176,Peak,Information Technology & Services,"Manchester, Greater Manchester",Peak helps the world's smartest companies put ...,75000000.0,Series C,2021-01-01


In [49]:


# Move misplaced values back into their correct columns
df_combined.at[98, "headquarter"] = "None"
df_combined.at[241, "headquarter"], df_combined.at[241, "sector"] = "Hauz Khas", "Food & Beverages"
df_combined.at[242, "what_it_does"] = df_combined.at[242, "sector"]
df_combined.at[242, "headquarter"], df_combined.at[242, "sector"] = "None", "Pharmaceuticals"
df_combined.at[257, "what_it_does"] = df_combined.at[257, "sector"]
df_combined.at[257, "headquarter"], df_combined.at[257, "sector"] = "Gurugram", "None"
df_combined.at[1100, "what_it_does"] = df_combined.at[1100, "sector"]
df_combined.at[1100, "headquarter"], df_combined.at[1100, "sector"] = "None", "Online Media"
df_combined.at[1176, "headquarter"], df_combined.at[1176, "sector"] = "Manchester, Greater Manchester", "Information Technology & Services"
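
Misaligned rows like these could also be flagged programmatically before fixing them by hand. The sketch below rests on two assumptions drawn from the values seen above: shifted rows either carry a `#REF!` marker or have a known sector name in `headquarter`. The toy frame is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "headquarter": ["Pune", "Gurugram\t#REF!", "Computer Games", None],
    "sector": ["EdTech", "FinTech", "Computer Games", "AgriTech"],
})

known_sectors = set(toy["sector"].dropna())
# Flag spreadsheet error markers and headquarters that are really sector names
suspect = (
    toy["headquarter"].str.contains("#REF!", regex=False, na=False)
    | toy["headquarter"].isin(known_sectors)
)
print(toy[suspect])
```

The flagged rows still need manual review, since the mask only narrows down where to look.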

In [50]:
df_combined["headquarter"].unique()

array(['Chennai', 'Bangalore', 'Pune', 'New Delhi', 'Indore', 'Hyderabad',
       'Gurgaon', 'Belgaum', 'Noida', 'Mumbai', 'Andheri', 'Jaipur',
       'Ahmedabad', 'Kolkata', 'Tirunelveli, Tamilnadu', 'Thane', None,
       'Singapore', 'Gurugram', 'None', 'Haryana', 'Kerala', 'Jodhpur',
       'Gujarat', 'Jaipur, Rajastan', 'Delhi',
       'Frisco, Texas, United States', 'California', 'Dhingsara, Haryana',
       'New York, United States', 'Patna',
       'San Francisco, California, United States',
       'San Francisco, United States', 'San Ramon, California',
       'Paris, Ile-de-France, France', 'Plano, Texas, United States',
       'Sydney', 'San Francisco Bay Area, Silicon Valley, West Coast',
       'Hauz Khas', 'Bangaldesh', 'London, England, United Kingdom',
       'Sydney, New South Wales, Australia', 'Milano, Lombardia, Italy',
       'Palmwoods, Queensland, Australia', 'France',
       'San Francisco Bay Area, West Coast, Western US',
       'Trivandrum, Kerala, India', 'Co

In [None]:
def clean_column(value):
    """
    Keeps the part of the string before the first comma and corrects misspelled values
    """
    replacement = {"Delhi": "New Delhi", "New New Delhi": "New Delhi", "San Franciscao": "San Francisco", "San Francisco Bay Area": "San Francisco",
                   "Bangaldesh": "Bangladesh", "Milano": "Milan", "Newcastle Upon Tyne": "Newcastle", "Hyderebad": "Hyderabad", "Banglore": "Bangalore",
                   "Cochin": "Kochi", "Orissia": "Odisha", "Thiruvananthapuram": "Trivandrum", "Samsitpur": "Samastipur", "Telugana": "Telangana",
                   "Gurgaon": "Gurugram", "Ahmadabad": "Ahmedabad"}

    # Leave missing values (None / NaN) untouched
    if not isinstance(value, str):
        return value
    # Keep only the first part, e.g. "Faridabad, Haryana" -> "Faridabad"
    value = value.split(",")[0]
    for old_value, new_value in replacement.items():
        value = value.replace(old_value, new_value)
    return value

# Apply function to column
df_combined["headquarter"] = df_combined["headquarter"].apply(clean_column)
df_combined["headquarter"].unique()
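
The same cleaning can be sketched with vectorised pandas string methods. This variant is an alternative, not the function above: `Series.replace` with a dictionary matches whole values only, so "Delhi" inside "New Delhi" is left alone and no "New New Delhi" repair entry is needed. The shortened `corrections` dictionary and the toy series are illustrative.

```python
import pandas as pd

corrections = {"Delhi": "New Delhi", "Gurgaon": "Gurugram", "Bangaldesh": "Bangladesh"}
cities = pd.Series(["Delhi", "New Delhi", "Gurgaon", "Jaipur, Rajastan", None])

cleaned = (
    cities.str.split(",").str[0]   # keep the part before the first comma
    .str.strip()                   # drop stray whitespace
    .replace(corrections)          # exact-match replacement on whole values
)
print(cleaned.tolist())
```

Missing values pass through the string methods unchanged, so no explicit `None` check is required.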

`Column: HeadQuarter`

In [None]:
df_2020["HeadQuarter"].unique()

array(['Chennai', 'Bangalore', 'Pune', 'New Delhi', 'Indore', 'Hyderabad',
       'Gurgaon', 'Belgaum', 'Noida', 'Mumbai', 'Andheri', 'Jaipur',
       'Ahmedabad', 'Kolkata', 'Tirunelveli, Tamilnadu', 'Thane', None,
       'Singapore', 'Gurugram', 'Gujarat', 'Haryana', 'Kerala', 'Jodhpur',
       'Jaipur, Rajastan', 'Delhi', 'Frisco, Texas, United States',
       'California', 'Dhingsara, Haryana', 'New York, United States',
       'Patna', 'San Francisco, California, United States',
       'San Francisco, United States', 'San Ramon, California',
       'Paris, Ile-de-France, France', 'Plano, Texas, United States',
       'Sydney', 'San Francisco Bay Area, Silicon Valley, West Coast',
       'Bangaldesh', 'London, England, United Kingdom',
       'Sydney, New South Wales, Australia', 'Milano, Lombardia, Italy',
       'Palmwoods, Queensland, Australia', 'France',
       'San Francisco Bay Area, West Coast, Western US',
       'Trivandrum, Kerala, India', 'Cochin', 'Samastipur, Bihar',


In [None]:
# def clean_column(column):
#     """
#     Corrects misspelled values, splits string at the comma, and takes the first part
#     """
#     replacement = {"Delhi": "New Delhi", "New New Delhi": "New Delhi", "San Franciscao": "San Francisco", "San Francisco Bay Area": "San Francisco",
#                    "Bangaldesh": "Bangladesh", "Milano": "Milan", "Newcastle Upon Tyne": "Newcastle",
#                    "Hyderebad": "Hyderabad", "Banglore": "Bangalore", "Cochin": "Kochi"}
    
#     # Condition ensures function processes non-None values
#     if column is not None:
#         for old_value, new_value in replacement.items():
#             column = column.split(",")[0]
#             column = column.replace(old_value, new_value)
#         return column

# # Apply function to column
# df_2020["HeadQuarter"] = df_2020["HeadQuarter"].apply(clean_column)
# df_2020["HeadQuarter"].unique()

array(['Chennai', 'Bangalore', 'Pune', 'New Delhi', 'Indore', 'Hyderabad',
       'Gurgaon', 'Belgaum', 'Noida', 'Mumbai', 'Andheri', 'Jaipur',
       'Ahmedabad', 'Kolkata', 'Tirunelveli', 'Thane', None, 'Singapore',
       'Gurugram', 'Gujarat', 'Haryana', 'Kerala', 'Jodhpur', 'Frisco',
       'California', 'Dhingsara', 'New York', 'Patna', 'San Francisco',
       'San Ramon', 'Paris', 'Plano', 'Sydney', 'Bangladesh', 'London',
       'Milan', 'Palmwoods', 'France', 'Trivandrum', 'Kochi',
       'Samastipur', 'Irvine', 'Tumkur', 'Newcastle', 'Shanghai',
       'Jiaxing', 'Rajastan', 'Ludhiana', 'Dehradun', 'Tangerang',
       'Berlin', 'Seattle', 'Riyadh', 'Seoul', 'Bangkok', 'Kanpur',
       'Chandigarh', 'Warangal', 'Odisha', 'Bihar', 'Goa', 'Tamil Nadu',
       'Uttar Pradesh', 'Bhopal', 'Coimbatore', 'Bengaluru'], dtype=object)

`Column: Stage`

In [None]:
df_2020["Stage"].unique()

array([None, 'Pre-seed', 'Seed', 'Pre-series A', 'Pre-series', 'Series C',
       'Series A', 'Series B', 'Debt', 'Pre-series C', 'Pre-series B',
       'Series E', 'Bridge', 'Series D', 'Series B2', 'Series F',
       'Pre- series A', 'Edge', 'Series H', 'Pre-Series B', 'Seed A',
       'Series A-1', 'Seed Funding', 'Pre-Seed', 'Seed round',
       'Pre-seed Round', 'Seed Round & Series A', 'Pre Series A',
       'Pre seed Round', 'Angel Round', 'Pre series A1', 'Series E2',
       'Pre series A', 'Seed Round', 'Bridge Round', 'Pre seed round',
       'Pre series B', 'Pre series C', 'Seed Investment', 'Series D1',
       'Mid series', 'Series C, D', 'Seed funding'], dtype=object)

Group the funding stages into the categories described in the Startup India [Funding Guide](https://www.startupindia.gov.in/content/sih/en/funding.html)

| Stage classification | Description | Funding stages found in the data |
|----------|-----------|-----------|
| Ideation | Brainstorming and developing business concepts, defining value propositions, and outlining plans | Pre-seed, Pre-Seed, Pre-seed Round, Pre seed Round, Pre seed round |
| Validation | Validating the business model, product-market fit, and scalability through research and feedback | Seed, Seed A, Seed Funding, Seed funding, Seed round, Seed Round, Seed Round & Series A, Seed Investment, Angel Round |
| Early Traction | Gaining initial traction, attracting early adopters, and refining based on feedback | Pre-series, Pre-series A, Pre- series A, Pre Series A, Pre series A, Pre series A1 |
| Scaling | Expanding operations, customer base, and market reach for rapid growth | Series A, Series A-1, Series B, Series B2, Pre-series B, Pre-Series B, Pre series B, Series C, "Series C, D", Pre-series C, Pre series C, Series D, Series D1, Series E, Series E2, Series F, Series H, Mid series, Edge |
| Exit Options | Considering exit strategies such as mergers, acquisitions, or IPOs | Debt, Bridge, Bridge Round |


In [None]:
def stage_classification(stage):
    """
    Maps a raw funding stage to its Funding Guide category.
    Values that match no category (including None) are returned unchanged.
    """
    classification = {
        "Ideation": ["Pre-seed", "Pre-Seed", "Pre-seed Round", "Pre seed Round", "Pre seed round"],
        "Validation": ["Seed", "Seed A", "Seed Funding", "Seed funding", "Seed round", "Seed Round", "Seed Round & Series A", "Seed Investment", "Angel Round"],
        "Early Traction": ["Pre-series", "Pre-series A", "Pre- series A", "Pre Series A", "Pre series A", "Pre series A1"],
        "Scaling": ["Series A", "Series A-1", "Series B", "Series B2", "Pre-series B", "Pre-Series B", "Pre series B", "Series C", "Series C, D",
                    "Pre-series C", "Pre series C", "Series D", "Series D1", "Series E", "Series E2", "Series F", "Series H", "Mid series", "Edge"],
        "Exit Options": ["Debt", "Bridge", "Bridge Round"],
    }

    # Return the first category whose list contains the raw stage
    for category, stages in classification.items():
        if stage in stages:
            return category
    return stage

# Apply function to column
df_2020["Stage"] = df_2020["Stage"].apply(stage_classification)
df_2020["Stage"].unique()

array([None, 'Ideation', 'Validation', 'Early Traction', 'Scaling',
       'Exit Options'], dtype=object)
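
Many of the raw labels differ only in case, hyphens or spacing. As a minimal sketch (not the notebook's method), the labels could be normalised before classification so that each stage needs only one dictionary entry. The helper name `normalise_stage` is hypothetical.

```python
import re

def normalise_stage(label):
    """Lower-case a stage label and collapse hyphens and repeated spaces."""
    # Leave missing or non-string values untouched
    if not isinstance(label, str):
        return label
    text = label.lower().replace("-", " ")
    return re.sub(r"\s+", " ", text).strip()

# Spelling variants taken from the unique() output above
variants = ["Pre-seed", "Pre-Seed", "Pre seed Round", "Pre seed round",
            "Pre- series A", "Pre Series A"]
normalised = sorted({normalise_stage(v) for v in variants})
print(normalised)  # ['pre seed', 'pre seed round', 'pre series a']
```

Six spellings collapse to three labels here. A classification dictionary keyed on the normalised labels would then be shorter and less likely to miss a variant.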

Add Year_Funded column to DataFrame

In [None]:
# # Add year column to dataframe
# df_2020["Year_Funded"] = 2020
# # Convert data type to datetime format
# df_2020["Year_Funded"] = pd.to_datetime(df_2020["Year_Funded"], format = "%Y")
# df_2020.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Year_Funded
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,2020-01-01
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Ideation,2020-01-01
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,3000000.0,Ideation,2020-01-01


#### 2021 Dataset Exploration and Preparation

In [None]:
# # Preview dataframe
# df_2021.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D


In [None]:
# print(df_2021.info(), "\n====================== Null Value Percentage ======================")
# print(df_2021.isna().mean().mul(100), "\n====================== Duplicated rows ======================")
# df_2021.loc[df_2021.duplicated()]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB
None 
Company_Brand     0.000000
Founded           0.082713
HeadQuarter       0.082713
Sector            0.000000
What_it_does      0.000000
Founders          0.330852
Investor          5.128205
Amount            0.248139
Stage            35.401158
dtype: float64 


Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
107,Curefoods,2020.0,Bangalore,Food & Beverages,Healthy & nutritious foods and cold pressed ju...,Ankit Nagori,"Iron Pillar, Nordstar, Binny Bansal",$13000000,
109,Bewakoof,2012.0,Mumbai,Apparel & Fashion,Bewakoof is a lifestyle fashion brand that mak...,Prabhkiran Singh,InvestCorp,$8000000,
111,FanPlay,2020.0,Computer Games,Computer Games,A real money game app specializing in trivia g...,YC W21,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000
117,Advantage Club,2014.0,Mumbai,HRTech,Advantage Club is India's largest employee eng...,"Sourabh Deorah, Smiti Bhatt Deorah","Y Combinator, Broom Ventures, Kunal Shah",$1700000,
119,Ruptok,2020.0,New Delhi,FinTech,Ruptok fintech Pvt. Ltd. is an online gold loa...,Ankur Gupta,Eclear Leasing,$1000000,
243,Trinkerr,2021.0,Bangalore,Capital Markets,Trinkerr is India's first social trading platf...,"Manvendra Singh, Gaurav Agarwal",Accel India,$6600000,Series A
244,Zorro,2021.0,Gurugram,Social network,Pseudonymous social network platform,"Jasveer Singh, Abhishek Asthana, Deepak Kumar","Vijay Shekhar Sharma, Ritesh Agarwal, Ankiti Bose",$32000000,Seed
245,Ultraviolette,2021.0,Bangalore,Automotive,Create and Inspire the future of sustainable u...,"Subramaniam Narayan, Niraj Rajmohan","TVS Motor, Zoho",$150000000,Series C
246,NephroPlus,2009.0,Hyderabad,Hospital & Health Care,A vision and passion of redefining healthcare ...,Vikram Vuppala,IIFL Asset Management,$24000000,Series E
247,Unremot,2020.0,Bangalore,Information Technology & Services,Unremot is a personal office for consultants!,Shiju Radhakrishnan,Inflection Point Ventures,$700000,Seed


In [None]:
# # Drop duplicated rows
# df_2021.drop_duplicates(keep = "first", inplace = True)

`Column: Amount`

In [None]:
# df_2021["Amount"].unique()[:20]

array(['$1,200,000', '$120,000,000', '$30,000,000', '$51,000,000',
       '$2,000,000', '$188,000,000', '$200,000', 'Undisclosed',
       '$1,000,000', '$3,000,000', '$100,000', '$700,000', '$9,000,000',
       '$40,000,000', '$49,000,000', '$400,000', '$300,000',
       '$25,000,000', '$160,000,000', '$150,000'], dtype=object)

In [None]:
# df_2021["Stage"].unique()

array(['Pre-series A', None, 'Series D', 'Series C', 'Seed', 'Series B',
       'Series E', 'Pre-seed', 'Series A', 'Pre-series B', 'Debt',
       '$1200000', 'Bridge', 'Seed+', 'Series F2', 'Series A+',
       'Series G', 'Series F', 'Series H', 'Series B3', 'PE', 'Series F1',
       'Pre-series A1', '$300000', 'Early seed', 'Series D1', '$6000000',
       '$1000000', 'Seies A', 'Pre-series', 'Series A2', 'Series I'],
      dtype=object)

Some amounts were entered in the Stage column, most likely through data-entry errors that shifted the values one column to the right.

In [None]:
# df_2021[df_2021["Stage"] == "$300000"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
538,Little Leap,2020.0,New Delhi,EdTech,Soft Skills that make Smart Leaders,Holistic Development Programs for children in ...,Vishal Gupta,ah! Ventures,$300000
551,BHyve,2020.0,Mumbai,Human Resources,A Future of Work Platform for diffusing Employ...,Backed by 100x.VC,"Omkar Pandharkame, Ketaki Ogale","ITO Angel Network, LetsVenture",$300000
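
Rather than looking up each misplaced value by hand, shifted rows could be found with a pattern match on the Stage column. The sketch below uses a two-row sample built from values shown in this notebook. The regular expression is an assumption about what a misplaced amount looks like (a dollar sign followed by digits and optional commas).

```python
import pandas as pd

# Two rows copied from the outputs above: one shifted, one correct
sample = pd.DataFrame({
    "Company_Brand": ["Little Leap", "Lead School"],
    "Amount": ["ah! Ventures", "$30,000,000"],
    "Stage": ["$300000", "Series D"],
})

# Flag rows whose Stage looks like a dollar amount instead of a funding round
misplaced = sample["Stage"].str.match(r"^\$[\d,]+$", na=False)
print(sample.loc[misplaced, "Company_Brand"].tolist())  # ['Little Leap']
```

The flagged rows still need manual inspection, because several columns are shifted and not just Amount and Stage.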


In [None]:
# # Replace all non-numeric values with the right values
# df_2021["Amount"].replace("Upsparks", 1200000, inplace = True)
# df_2021["Amount"].replace("ah! Ventures", 300000, inplace = True)
# df_2021["Amount"].replace("ITO Angel Network, LetsVenture", 300000, inplace = True)
# df_2021["Amount"].replace("None", 6000000, inplace = True)
# df_2021["Amount"].replace("JITO Angel Network, LetsVenture", 1000000, inplace = True)

Amount and Stage columns have some of their values interchanged.


In [None]:
# df_2021.loc[df_2021["Amount"] == "Seed"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
257,MoEVing,2021.0,Gurugram\t#REF!,MoEVing is India's only Electric Mobility focu...,"Vikash Mishra, Mragank Jain","Anshuman Maheshwary, Dr Srihari Raju Kalidindi",$5000000,Seed,
1148,Godamwale,2016.0,Mumbai,Logistics & Supply Chain,Godamwale is tech enabled integrated logistics...,"Basant Kumar, Vivek Tiwari, Ranbir Nandan",1000000\t#REF!,Seed,


In [None]:
# # Put values in the appropriate columns
# df_2021.at[242, "Amount"], df_2021.at[242, "Stage"] = 22000000, "Series C"
# df_2021.at[256, "Amount"], df_2021.at[256, "Stage"] = 22000000, "Series C"
# df_2021.at[257, "Amount"], df_2021.at[257, "Stage"] = 5000000, "Seed"
# df_2021.at[1148, "Amount"], df_2021.at[1148, "Stage"] = 1000000, "Seed"
# df_2021.at[545, "Amount"], df_2021.at[545, "Stage"] = 1000000, "Pre-series A"

In [None]:
# # Clean Amount column
# def clean_amount(Amount):
#     """ 
#     Removes "$" and converts column to float
#     """
#     try:
#         Amount = str(Amount)
#         # Remove commas
#         Amount = Amount.replace(",", "")
#         # Replace USD sign with nothing
#         if "$" in Amount:
#             Amount = Amount.replace("$", "")
#             return round(float(Amount), 2) 
#         # If no currency symbol, assume it's already in USD
#         else:
#             return round(float(Amount), 2)
#     except ValueError:
#         # For non-numeric entries, return NaN
#         return np.nan

# # Apply the clean_amount function to the 'amount' column
# df_2021["Amount"] = df_2021["Amount"].apply(clean_amount)

In [None]:
# Fill null values with the median value
# df_2021.fillna({"Amount": df_2021["Amount"].median()}, inplace = True)
# df_2021.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1191 entries, 0 to 256
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1190 non-null   object 
 1   Founded        1189 non-null   float64
 2   HeadQuarter    1189 non-null   object 
 3   Sector         1190 non-null   object 
 4   What_it_does   1190 non-null   object 
 5   Founders       1186 non-null   object 
 6   Investor       1129 non-null   object 
 7   Amount         1191 non-null   float64
 8   Stage          774 non-null    object 
dtypes: float64(2), object(7)
memory usage: 125.3+ KB


`Column: Stage`

In [None]:
df_2021["Stage"].unique()

array(['Pre-series A', None, 'Series D', 'Series C', 'Seed', 'Series B',
       'Series E', 'Pre-seed', 'Series A', 'Pre-series B', 'Debt', 'None',
       'Bridge', 'Seed+', 'Series F2', 'Series A+', 'Series G',
       'Series F', 'Series H', 'Series B3', 'PE', 'Series F1',
       'Pre-series A1', '$300000', 'Early seed', 'Series D1', '$6000000',
       '$1000000', 'Seies A', 'Pre-series', 'Series A2', 'Series I'],
      dtype=object)

In [None]:
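def stage_classification(column):
    classes = {"Ideation": ["Pre-seed", "Early seed"],
               "Validation": ["Seed", "Seed+"],
               "Early Traction": ["Pre-series", "Pre-series A", "Pre-series A1", "Pre-series B"]}
    for category, stages in classes.items():
        if column in stages:
            return category
    return column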

In [None]:
| Stage          | Categories       |
|----------------|------------------|
| **Ideation**   | Pre-seed         |
|                | Early seed       |
|                | Seed             |
| **Validation** | Seed+            |
|                | Pre-series       |
|                | Pre-series A     |
|                | Pre-series B     |
|                | Pre-series A1    |
| **Early Traction** | Series A        |
|                | Series A+        |
|                | Series A2        |
|                | Seies A          |
| **Scaling**    | Series B         |
|                | Series B3        |
|                | Series C         |
|                | Series D         |
|                | Series D1        |
|                | Series E         |
|                | Series F         |
|                | Series F1        |
|                | Series F2        |
|                | Series G         |
|                | Series H         |
|                | Series I         |
| **Exit Option**| PE (Private Equity) |
|                | Debt             |
|                | Bridge           |
| **Unclear/Outliers** | None             |
|                | $300000          |
|                | $6000000         |
|                | $1000000         |


Group the funding stages into the categories of the Startup India [Funding Guide](https://www.startupindia.gov.in/content/sih/en/funding.html):

| Stages Classification | Description | Stages of funding per Data given |
|----------|-----------|-----------|
| Ideation | Brainstorming and developing business concepts, defining value propositions, and outlining plans | Pre-seed, Early seed |
| Validation | Validating the business model, product-market fit, and scalability through research and feedback | Seed, Seed+ |
| Early Traction | Gaining initial traction, attracting early adopters, and refining based on feedback | Pre-series, Pre-series A, Pre-series A1, Pre-series B |
| Scaling | Expanding operations, customer base, and market reach for rapid growth | Series A, Series A+, Series A2, Seies A, Series B, Series B3, Series C, Series D, Series D1, Series E, Series F, Series F1, Series F2, Series G, Series H, Series I |
| Exit Options | Considering exit strategies such as mergers, acquisitions, or IPOs | Debt, Bridge, PE |
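
The grouping in the table can be applied with a reverse lookup and `Series.map`. This is a sketch under assumptions: the names `STAGE_GROUPS_2021`, `STAGE_LOOKUP_2021` and `classify_stage_2021` are hypothetical, and the Scaling group is assumed to cover every `Series ...` label in the 2021 `unique()` output, including the misspelling `Seies A`.

```python
import pandas as pd

# Category -> raw labels seen in the 2021 data (assumed grouping)
STAGE_GROUPS_2021 = {
    "Ideation": ["Pre-seed", "Early seed"],
    "Validation": ["Seed", "Seed+"],
    "Early Traction": ["Pre-series", "Pre-series A", "Pre-series A1", "Pre-series B"],
    "Scaling": ["Series A", "Series A+", "Series A2", "Seies A", "Series B",
                "Series B3", "Series C", "Series D", "Series D1", "Series E",
                "Series F", "Series F1", "Series F2", "Series G", "Series H",
                "Series I"],
    "Exit Options": ["Debt", "Bridge", "PE"],
}

# Invert to raw label -> category for direct lookup
STAGE_LOOKUP_2021 = {raw: category
                     for category, raws in STAGE_GROUPS_2021.items()
                     for raw in raws}

def classify_stage_2021(stages):
    """Map raw labels to categories; labels not in the lookup are kept as they are."""
    return stages.map(STAGE_LOOKUP_2021).fillna(stages)

demo = pd.Series(["Pre-series A", None, "Seies A", "PE", "$300000"])
result = classify_stage_2021(demo)
print(result.tolist())
```

Labels outside the lookup, such as the misplaced `$300000`, pass through unchanged, so they stay visible for later correction instead of being silently dropped.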


`Column: HeadQuarter`

In [None]:
df_2021["HeadQuarter"].unique()[:35]

array(['Bangalore', 'Mumbai', 'Gurugram', 'New Delhi', 'Hyderabad',
       'Jaipur', 'Ahmadabad', 'Chennai', None,
       'Small Towns, Andhra Pradesh', 'Goa', 'Rajsamand', 'Ranchi',
       'Faridabad, Haryana', 'Gujarat', 'Pune', 'Thane', 'Computer Games',
       'Cochin', 'Noida', 'Chandigarh', 'Gurgaon', 'Vadodara',
       'Food & Beverages', 'Pharmaceuticals\t#REF!', 'Gurugram\t#REF!',
       'Kolkata', 'Ahmedabad', 'Mohali', 'Haryana', 'Indore', 'Powai',
       'Ghaziabad', 'Nagpur', 'West Bengal', 'Patna', 'Samsitpur',
       'Lucknow', 'Telangana', 'Silvassa', 'Thiruvananthapuram',
       'Faridabad', 'Roorkee', 'Ambernath', 'Panchkula', 'Surat',
       'Coimbatore', 'Andheri', 'Mangalore', 'Telugana', 'Bhubaneswar',
       'Kottayam', 'Beijing', 'Panaji', 'Satara', 'Orissia', 'Jodhpur',
       'New York', 'Santra', 'Mountain View, CA', 'Trivandrum',
       'Jharkhand', 'Kanpur', 'Bhilwara', 'Guwahati',
       'Online Media\t#REF!', 'Kochi', 'London',
       'Information Technol

In [None]:
df_2021[df_2021["HeadQuarter"] == "Information Technology & Services"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
1176,Peak,2014.0,Information Technology & Services,"Manchester, Greater Manchester",Peak helps the world's smartest companies put ...,Atul Sharma,SoftBank Vision Fund 2,75000000.0,Series C


In [None]:
# df_2021.at[98, "HeadQuarter"], df_2021.at[98, "Stage"] = "None", "None"
# df_2021.at[241, "HeadQuarter"], df_2021.at[241, "Sector"] = "Hauz Khas", "Food & Beverages"
# df_2021.at[242, "HeadQuarter"], df_2021.at[242, "Sector"], df_2021.at[242, "What_it_does"], df_2021.at[242, "Investor"] = "None", "Pharmaceuticals", "None", "Varun Khanna"
# df_2021.at[257, "HeadQuarter"], df_2021.at[257, "Investor"], df_2021.at[257, "Sector"] = "Gurugram", "Vikash Mishra, Mragank Jain", "None"
# df_2021.at[1100, "HeadQuarter"], df_2021.at[1100, "Sector"] = "None", "Online Media"
# df_2021.at[1176, "HeadQuarter"], df_2021.at[1176, "Sector"] = "Manchester, Greater Manchester", "Information Technology & Services"

In [None]:
# def clean_column(column):
#     """
#     Corrects misspelled values, splits string at the comma, and takes the first part
#     """
#     replacement = {"Cochin": "Kochi", "Orissia": "Odisha", "Thiruvananthapuram": "Trivandrum",
#                    "Samsitpur": "Samastipur", "Telugana": "Telangana", "Gurgaon": "Gurugram", "Ahmadabad": "Ahmedabad"}
    
#     column = str(column)
#     # Condition ensures function processes non-None values
#     if column is not None:
#         for old_value, new_value in replacement.items():
#             column = column.split(",")[0]
#             column = column.replace(old_value, new_value)
#         return column

# # Apply function to column
# df_2021["HeadQuarter"] = df_2021["HeadQuarter"].apply(clean_column)
# df_2021["HeadQuarter"].replace("None", np.nan,  inplace = True)
# df_2021["HeadQuarter"].unique()

array(['Bangalore', 'Mumbai', 'Gurugram', 'New Delhi', 'Hyderabad',
       'Jaipur', 'Ahmedabad', 'Chennai', nan, 'Small Towns', 'Goa',
       'Rajsamand', 'Ranchi', 'Faridabad', 'Gujarat', 'Pune', 'Thane',
       'Kochi', 'Noida', 'Chandigarh', 'Vadodara', 'Hauz Khas', 'Kolkata',
       'Mohali', 'Haryana', 'Indore', 'Powai', 'Ghaziabad', 'Nagpur',
       'West Bengal', 'Patna', 'Samastipur', 'Lucknow', 'Telangana',
       'Silvassa', 'Trivandrum', 'Roorkee', 'Ambernath', 'Panchkula',
       'Surat', 'Coimbatore', 'Andheri', 'Mangalore', 'Bhubaneswar',
       'Kottayam', 'Beijing', 'Panaji', 'Satara', 'Odisha', 'Jodhpur',
       'New York', 'Santra', 'Mountain View', 'Jharkhand', 'Kanpur',
       'Bhilwara', 'Guwahati', 'London', 'Manchester', 'The Nilgiris',
       'Gandhinagar', 'nan'], dtype=object)
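
The result above still contains the string `'nan'` next to the real missing value `nan`. A likely cause is that `str()` turns a missing float into the text `"nan"`, which the later replacement of `"None"` does not catch. The sketch below is one possible alternative: it returns missing values before any string conversion. The name `clean_headquarter` is hypothetical, and it matches whole city names rather than substrings, which differs slightly from the function above.

```python
import numpy as np
import pandas as pd

# Known misspellings and alternative names, as listed in the cell above
REPLACEMENTS = {"Cochin": "Kochi", "Orissia": "Odisha",
                "Thiruvananthapuram": "Trivandrum", "Samsitpur": "Samastipur",
                "Telugana": "Telangana", "Gurgaon": "Gurugram",
                "Ahmadabad": "Ahmedabad"}

def clean_headquarter(value):
    """Keep the part before the first comma and fix known misspellings."""
    # Return missing values as NaN before any string conversion
    if pd.isna(value):
        return np.nan
    city = str(value).split(",")[0].strip()
    return REPLACEMENTS.get(city, city)

demo = pd.Series(["Faridabad, Haryana", "Cochin", None, np.nan, "Mountain View, CA"])
cleaned = demo.apply(clean_headquarter)
print(cleaned.tolist())
```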

Add Year_Funded column to DataFrame

In [None]:
# # Add year column to dataframe
# df_2021["Year_Funded"] = 2021
# # Convert data type to datetime format
# df_2021["Year_Funded"] = pd.to_datetime(df_2021["Year_Funded"], format = "%Y")
# df_2021.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Year_Funded
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",1200000.0,Pre-series A,2021-01-01
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",120000000.0,,2021-01-01
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital",30000000.0,Series D,2021-01-01


#### Add Year Column to 2018 dataset

In [None]:
# # Preview dataframe
# df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [None]:
# # Add year column to dataframe
# df_2018["Year"] = 2018
# # Convert data type to datetime format
# df_2018["Year"] = pd.to_datetime(df_2018["Year"], format = "%Y")
# df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,Year
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018-01-01
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018-01-01
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,2018-01-01
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,2018-01-01
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,2018-01-01


In [None]:
# print(df_2018.info(), "\n====================== Null Value Percentage ======================")
# print(df_2018.isna().mean().mul(100), "\n====================== Duplicated rows ======================")
# df_2018.loc[df_2018.duplicated()]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Company Name   526 non-null    object        
 1   Industry       526 non-null    object        
 2   Round/Series   526 non-null    object        
 3   Amount         526 non-null    object        
 4   Location       526 non-null    object        
 5   About Company  526 non-null    object        
 6   Year           526 non-null    datetime64[ns]
dtypes: datetime64[ns](1), object(6)
memory usage: 28.9+ KB
None 
Company Name     0.0
Industry         0.0
Round/Series     0.0
Amount           0.0
Location         0.0
About Company    0.0
Year             0.0
dtype: float64 


Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,Year
348,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018-01-01


Row 348 repeats the TheCollegeFever seed-round record already present in row 0, so we drop it.

In [None]:
# # Drop duplicated rows from DataFrame
# df_2018.drop_duplicates(keep = "first", inplace = True)

In [None]:
df_2018["Amount"].unique()

array(['250000', '₹40,000,000', '₹65,000,000', '2000000', '—', '1600000',
       '₹16,000,000', '₹50,000,000', '₹100,000,000', '150000', '1100000',
       '₹500,000', '6000000', '650000', '₹35,000,000', '₹64,000,000',
       '₹20,000,000', '1000000', '5000000', '4000000', '₹30,000,000',
       '2800000', '1700000', '1300000', '₹5,000,000', '₹12,500,000',
       '₹15,000,000', '500000', '₹104,000,000', '₹45,000,000', '13400000',
       '₹25,000,000', '₹26,400,000', '₹8,000,000', '₹60,000', '9000000',
       '100000', '20000', '120000', '₹34,000,000', '₹342,000,000',
       '$143,145', '₹600,000,000', '$742,000,000', '₹1,000,000,000',
       '₹2,000,000,000', '$3,980,000', '$10,000', '₹100,000',
       '₹250,000,000', '$1,000,000,000', '$7,000,000', '$35,000,000',
       '₹550,000,000', '$28,500,000', '$2,000,000', '₹240,000,000',
       '₹120,000,000', '$2,400,000', '$30,000,000', '₹2,500,000,000',
       '$23,000,000', '$150,000', '$11,000,000', '₹44,000,000',
       '$3,240,000', '₹60

#### Clean Amount column in 2018 dataset

In [None]:
# # Clean Amount column and convert Indian Rupee (INR) amounts to USD
# def clean_amount(Amount):
#     try:
#         Amount = str(Amount)
#         # Remove commas
#         Amount = Amount.replace(",", "")
#         # Check if the amount is in INR and convert to USD assuming 1 USD = 70 INR
#         if "₹" in Amount:
#             Amount = Amount.replace("₹", "")
#             return round(float(Amount) / 70, 2)
#         # Check if the amount is in USD
#         elif "$" in Amount:
#             Amount = Amount.replace("$", "")
#             return round(float(Amount), 2) 
#         # If no currency symbol, assume it's already in USD
#         else:
#             return round(float(Amount), 2)
#     except ValueError:
#         # For non-numeric entries, return NaN
#         return np.nan

# # Apply the clean_amount function to the 'amount' column
# df_2018["Amount"] = df_2018["Amount"].apply(clean_amount)
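
As a worked example of the conversion, the sketch below applies the same assumed rate of 70 INR per USD to a few values from the 2018 `unique()` output. The helper `to_usd` and the constant `INR_PER_USD` are hypothetical names. The rate is the notebook's fixed assumption, not a live exchange rate.

```python
import math

INR_PER_USD = 70  # fixed rate assumed in the notebook

def to_usd(raw, inr_per_usd=INR_PER_USD):
    """Convert a raw amount string to USD; return NaN when it is not numeric."""
    text = str(raw).replace(",", "").strip()
    # Only rupee amounts are divided by the rate; everything else is treated as USD
    rate = inr_per_usd if text.startswith("₹") else 1
    try:
        return round(float(text.lstrip("₹$")) / rate, 2)
    except ValueError:
        return math.nan

print(to_usd("₹40,000,000"))  # 571428.57
print(to_usd("$143,145"))     # 143145.0
print(to_usd("250000"))       # 250000.0
print(to_usd("\u2014"))       # nan (the dash placeholder used in the 2018 data)
```

Keeping the rate as a parameter makes it easy to rerun the cleaning with a different assumption.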

In [None]:
df_2018.fillna({"Amount": df_2018["Amount"].median()}, inplace = True)
df_2018.info()


<class 'pandas.core.frame.DataFrame'>
Index: 525 entries, 0 to 525
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Company Name   525 non-null    object        
 1   Industry       525 non-null    object        
 2   Round/Series   525 non-null    object        
 3   Amount         525 non-null    float64       
 4   Location       525 non-null    object        
 5   About Company  525 non-null    object        
 6   Year           525 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(5)
memory usage: 32.8+ KB
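
After median imputation the filled rows can no longer be told apart from reported amounts. One option, sketched here on a toy frame, is to record a flag before filling. The column name `Amount_Imputed` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: one undisclosed amount among three reported ones
amounts = pd.DataFrame({"Amount": [250000.0, np.nan, 2000000.0, 571428.57]})

# Record which rows were missing, then fill them with the median
amounts["Amount_Imputed"] = amounts["Amount"].isna()
amounts["Amount"] = amounts["Amount"].fillna(amounts["Amount"].median())
print(amounts)
```

The flag lets later analysis exclude or separately report the imputed rows, for example when testing whether sector influences funding.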


#### Add Year column to 2019 dataset


In [None]:
# # Preview dataframe
# df_2019.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [None]:
# # Add year column to dataframe
# df_2019["Year"] = 2019
# # Convert data type to datetime format
# df_2019["Year"] = pd.to_datetime(df_2019["Year"], format = "%Y")
# df_2019.head()



Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage,Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",,2019-01-01
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C,2019-01-01
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding,2019-01-01
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D,2019-01-01
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",,2019-01-01


In [None]:
# print(df_2019.info(), "\n====================== Null Value Percentage ======================")
# print(df_2019.isna().mean().mul(100), "\n====================== Duplicated rows ======================")
# df_2019.loc[df_2019.duplicated()]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Company/Brand  89 non-null     object        
 1   Founded        60 non-null     float64       
 2   HeadQuarter    70 non-null     object        
 3   Sector         84 non-null     object        
 4   What it does   89 non-null     object        
 5   Founders       86 non-null     object        
 6   Investor       89 non-null     object        
 7   Amount($)      89 non-null     object        
 8   Stage          43 non-null     object        
 9   Year           89 non-null     datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(8)
memory usage: 7.1+ KB
None 
Company/Brand     0.000000
Founded          32.584270
HeadQuarter      21.348315
Sector            5.617978
What it does      0.000000
Founders          3.370787
Investor          0.000000
Amount($)         0

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage,Year


#### Clean Amount column in 2019 dataset

In [None]:
df_2019["Amount($)"].unique()

array(['$6,300,000', '$150,000,000', '$28,000,000', '$30,000,000',
       '$6,000,000', 'Undisclosed', '$1,000,000', '$20,000,000',
       '$275,000,000', '$22,000,000', '$5,000,000', '$140,500',
       '$540,000,000', '$15,000,000', '$182,700', '$12,000,000',
       '$11,000,000', '$15,500,000', '$1,500,000', '$5,500,000',
       '$2,500,000', '$140,000', '$230,000,000', '$49,400,000',
       '$32,000,000', '$26,000,000', '$150,000', '$400,000', '$2,000,000',
       '$100,000,000', '$8,000,000', '$100,000', '$50,000,000',
       '$120,000,000', '$4,000,000', '$6,800,000', '$36,000,000',
       '$5,700,000', '$25,000,000', '$600,000', '$70,000,000',
       '$60,000,000', '$220,000', '$2,800,000', '$2,100,000',
       '$7,000,000', '$311,000,000', '$4,800,000', '$693,000,000',
       '$33,000,000'], dtype=object)

In [None]:
# Clean Amount column
def clean_amount(Amount):
    """
    Removes "$" and thousands separators and converts the value to float
    """
    try:
        # Remove commas and the USD sign; values without a currency
        # symbol are assumed to already be in USD
        Amount = str(Amount).replace(",", "").replace("$", "")
        return round(float(Amount), 2)
    except ValueError:
        # For non-numeric entries such as "Undisclosed", return NaN
        return np.nan

# Apply the clean_amount function to the 'amount' column
df_2019["Amount($)"] = df_2019["Amount($)"].apply(clean_amount)

In [None]:
df_2019.fillna({"Amount($)": df_2019["Amount($)"].median()}, inplace = True)
df_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Company/Brand  89 non-null     object        
 1   Founded        60 non-null     float64       
 2   HeadQuarter    70 non-null     object        
 3   Sector         84 non-null     object        
 4   What it does   89 non-null     object        
 5   Founders       86 non-null     object        
 6   Investor       89 non-null     object        
 7   Amount($)      89 non-null     float64       
 8   Stage          43 non-null     object        
 9   Year           89 non-null     datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(7)
memory usage: 7.1+ KB


#### Establish uniformity in column names

In [None]:
# # Rename columns
# df_2019.rename(columns={'Company/Brand':'Company_Brand','Amount($)':'Amount', 'What it does': 'What_it_does'}, inplace=True)

# df_2018.rename(columns={'Company Name':'Company_Brand','Industry':'Sector','Round/Series':'Stage','Location':'HeadQuarter','About Company':'What_it_does'} ,inplace=True)

### Concatenate Datasets

In [None]:
# # Combine the 2018, 2019, 2020 and 2021 dataframes into a single dataframe
# df_combined = pd.concat([df_2020, df_2021, df_2018, df_2019], axis = 0)
# # Preview combined dataframe
# df_combined.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Year_Funded,Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,2020-01-01,NaT
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Ideation,2020-01-01,NaT
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,3000000.0,Ideation,2020-01-01,NaT
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,2020-01-01,NaT
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,2020-01-01,NaT


In [None]:
# print(df_combined.info(), "\n====================== Null Value Percentage ======================")
# print(df_combined.isna().mean().mul(100), "\n====================== Duplicated rows ======================")
# df_combined.loc[df_combined.duplicated()]

<class 'pandas.core.frame.DataFrame'>
Index: 2857 entries, 0 to 88
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Company_Brand  2856 non-null   object        
 1   Founded        2088 non-null   float64       
 2   HeadQuarter    2742 non-null   object        
 3   Sector         2838 non-null   object        
 4   What_it_does   2856 non-null   object        
 5   Founders       2312 non-null   object        
 6   Investor       2232 non-null   object        
 7   Amount         2857 non-null   float64       
 8   Stage          1932 non-null   object        
 9   Year_Funded    2243 non-null   datetime64[ns]
 10  Year           614 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(2), object(7)
memory usage: 267.8+ KB
None 
Company_Brand     0.035002
Founded          26.916346
HeadQuarter       4.025201
Sector            0.665033
What_it_does      0.035002
Founders         19.075

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Year_Funded,Year
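
The `Index: 2857 entries, 0 to 88` line in the output above suggests that the row labels of the yearly frames are carried over unchanged, so the same label can appear more than once. A minimal sketch of the effect, using two hypothetical frames `a` and `b`, and of the `ignore_index=True` option that rebuilds a unique index:

```python
import pandas as pd

# Hypothetical frames standing in for two yearly datasets.
a = pd.DataFrame({"Company_Brand": ["A", "B"], "Amount": [1.0, 2.0]})
b = pd.DataFrame({"Company_Brand": ["C"], "Amount": [3.0]})

# Plain concatenation keeps each frame's own labels: 0, 1, 0.
stacked = pd.concat([a, b], axis=0)
print(stacked.index.is_unique)

# ignore_index=True assigns a fresh 0..n-1 index.
reindexed = pd.concat([a, b], axis=0, ignore_index=True)
print(list(reindexed.index))
```

A unique index matters mainly for label-based selection with `.loc`, which would otherwise return several rows for one label.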


In [None]:
df_combined["HeadQuarter"].unique()

array(['Chennai', 'Bangalore', 'Pune', 'New Delhi', 'Indore', 'Hyderabad',
       'Gurgaon', 'Belgaum', 'Noida', 'Mumbai', 'Andheri', 'Jaipur',
       'Ahmedabad', 'Kolkata', 'Tirunelveli', 'Thane', None, 'Singapore',
       'Gurugram', 'Gujarat', 'Haryana', 'Kerala', 'Jodhpur', 'Frisco',
       'California', 'Dhingsara', 'New York', 'Patna', 'San Francisco',
       'San Ramon', 'Paris', 'Plano', 'Sydney', 'Bangladesh', 'London',
       'Milan', 'Palmwoods', 'France', 'Trivandrum', 'Cochin',
       'Samastipur', 'Irvine', 'Tumkur', 'Newcastle', 'Shanghai',
       'Jiaxing', 'Rajastan', 'Kochi', 'Ludhiana', 'Dehradun',
       'Tangerang', 'Berlin', 'Seattle', 'Riyadh', 'Seoul', 'Bangkok',
       'Kanpur', 'Chandigarh', 'Warangal', 'Odisha', 'Bihar', 'Goa',
       'Tamil Nadu', 'Uttar Pradesh', 'Bhopal', 'Coimbatore', 'Bengaluru',
       'Ahmadabad', 'Small Towns, Andhra Pradesh', 'Rajsamand', 'Ranchi',
       'Faridabad, Haryana', 'Computer Games', 'Vadodara',
       'Food & Beverages',

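The HeadQuarter column needs to be cleaned so that it holds only the headquarters' location: some rows append a state or country after a comma (for example 'Faridabad, Haryana'), and a few hold values that look like sectors rather than places (for example 'Computer Games' and 'Food & Beverages').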

In [None]:
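# Keep only the text before the first comma (drops the trailing state or country)
df_combined["HeadQuarter"] = df_combined["HeadQuarter"].str.split(",", expand=True)[0]
# Keep only the text before the first tab character
df_combined["HeadQuarter"] = df_combined["HeadQuarter"].str.split("\t", expand=True)[0]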

df_combined["HeadQuarter"].unique()

array(['Chennai', 'Bangalore', 'Pune', 'New Delhi', 'Indore', 'Hyderabad',
       'Gurgaon', 'Belgaum', 'Noida', 'Mumbai', 'Andheri', 'Jaipur',
       'Ahmedabad', 'Kolkata', 'Tirunelveli', 'Thane', None, 'Singapore',
       'Gurugram', 'Gujarat', 'Haryana', 'Kerala', 'Jodhpur', 'Frisco',
       'California', 'Dhingsara', 'New York', 'Patna', 'San Francisco',
       'San Ramon', 'Paris', 'Plano', 'Sydney', 'Bangladesh', 'London',
       'Milan', 'Palmwoods', 'France', 'Trivandrum', 'Cochin',
       'Samastipur', 'Irvine', 'Tumkur', 'Newcastle', 'Shanghai',
       'Jiaxing', 'Rajastan', 'Kochi', 'Ludhiana', 'Dehradun',
       'Tangerang', 'Berlin', 'Seattle', 'Riyadh', 'Seoul', 'Bangkok',
       'Kanpur', 'Chandigarh', 'Warangal', 'Odisha', 'Bihar', 'Goa',
       'Tamil Nadu', 'Uttar Pradesh', 'Bhopal', 'Coimbatore', 'Bengaluru',
       'Ahmadabad', 'Small Towns', 'Rajsamand', 'Ranchi', 'Faridabad',
       'Computer Games', 'Vadodara', 'Food & Beverages',
       'Pharmaceuticals', 'Moha
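
The output above still contains several spellings of the same city ('Bangalore' and 'Bengaluru', 'Gurgaon' and 'Gurugram', 'Ahmadabad' and 'Ahmedabad', 'Cochin' and 'Kochi') as well as sector-like values. One possible follow-up, shown on a hypothetical series `hq`, is an alias map plus a mask for non-locations. Which spelling to keep is an assumption here, and both collections would need extending after a full review of the column.

```python
import pandas as pd

# Hypothetical sample of values seen in the unique() output.
hq = pd.Series(["Bangalore", "Bengaluru", "Gurgaon", "Gurugram",
                "Ahmadabad", "Cochin", "Computer Games", None])

# Assumed canonical spellings; the choice of target name is illustrative.
CITY_ALIASES = {
    "Bangalore": "Bengaluru",
    "Gurgaon": "Gurugram",
    "Ahmadabad": "Ahmedabad",
    "Cochin": "Kochi",
}
# Values from the output that read as sectors, not places.
NOT_A_LOCATION = {"Computer Games", "Food & Beverages", "Pharmaceuticals"}

# Strip stray whitespace, unify spellings, then blank out non-locations.
cleaned = hq.str.strip().replace(CITY_ALIASES)
cleaned = cleaned.mask(cleaned.isin(NOT_A_LOCATION))

print(cleaned.tolist())
```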

In [None]:
df_combined.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Year_Funded,Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,2020-01-01,NaT
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Ideation,2020-01-01,NaT
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,3000000.0,Ideation,2020-01-01,NaT
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,2020-01-01,NaT
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,2020-01-01,NaT


In [None]:
df_combined.isna().mean().mul(100)

Company_Brand     0.035002
Founded          26.916346
HeadQuarter       4.025201
Sector            0.665033
What_it_does      0.035002
Founders         19.075954
Investor         21.876094
Amount            0.000000
Stage            32.376619
Year_Funded      21.491075
Year             78.508925
dtype: float64
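
The non-null counts reported earlier for `Year_Funded` (2243) and `Year` (614) add up to the 2857 rows of the combined frame, which suggests the two columns hold the same information under different names. If that is the case, they could be merged into one column. A minimal sketch on a hypothetical frame `demo`, using `Series.combine_first`:

```python
import pandas as pd

# Hypothetical frame where each row has exactly one of the two dates.
demo = pd.DataFrame({
    "Year_Funded": pd.to_datetime(["2020-01-01", None, "2021-01-01"]),
    "Year": pd.to_datetime([None, "2019-01-01", None]),
})

# Take Year_Funded where present, otherwise fall back to Year.
demo["Year_Funded"] = demo["Year_Funded"].combine_first(demo["Year"])
demo = demo.drop(columns="Year")

print(demo["Year_Funded"].dt.year.tolist())
```

Renaming `Year` to `Year_Funded` in the affected yearly frame before concatenation would avoid the split in the first place.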

Drop columns that do not contribute to answering the research questions or achieving the analysis goals

In [None]:
# df_combined.drop(columns = ["Founded", "Founders", "Stage", "What_it_does"], inplace = True)
# df_combined.head()

Unnamed: 0,Company_Brand,HeadQuarter,Sector,Investor,Amount,Year_Funded,Year
0,Aqgromalin,Chennai,AgriTech,Angel investors,200000.0,2020-01-01,NaT
1,Krayonnz,Bangalore,EdTech,GSF Accelerator,100000.0,2020-01-01,NaT
2,PadCare Labs,Pune,Hygiene management,Venture Center,3000000.0,2020-01-01,NaT
3,NCOME,New Delhi,Escrow,"Venture Catalysts, PointOne Capital",400000.0,2020-01-01,NaT
4,Gramophone,Indore,AgriTech,"Siana Capital Management, Info Edge",340000.0,2020-01-01,NaT


We drop rows with null values in the HeadQuarter and Sector columns. These columns are missing in only about 4.0% and 0.7% of rows respectively, so dropping those rows should not have any significant effect on the analysis.


In [None]:
# df_combined.dropna(subset = ["HeadQuarter", "Sector"], inplace = True)
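
To check how many rows such a drop removes, the row count can be compared before and after. The frame `demo` below is hypothetical and only illustrates the pattern.

```python
import pandas as pd

# Hypothetical frame with gaps in the two columns of interest.
demo = pd.DataFrame({
    "HeadQuarter": ["Pune", None, "Mumbai", "Delhi"],
    "Sector": ["EdTech", "FinTech", None, "AgriTech"],
    "Amount": [1.0, 2.0, 3.0, 4.0],
})

before = len(demo)
# A row is removed if either HeadQuarter or Sector is missing.
demo = demo.dropna(subset=["HeadQuarter", "Sector"])
dropped = before - len(demo)

print(f"Dropped {dropped} of {before} rows ({dropped / before:.1%})")
```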

Fill the Investor column with the most frequent value

In [None]:
# # Fill the 'Investor' column with the most frequent value
# df_combined.fillna({"Investor": df_combined["Investor"].mode()[0]} , inplace = True)
# print(df_combined.info(), "\n====================== Null Value Percentage ======================")
# print(df_combined.isna().mean().mul(100))
# df_combined.head()


<class 'pandas.core.frame.DataFrame'>
Index: 2727 entries, 0 to 88
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Company_Brand  2727 non-null   object        
 1   HeadQuarter    2727 non-null   object        
 2   Sector         2727 non-null   object        
 3   Investor       2727 non-null   object        
 4   Amount         2727 non-null   float64       
 5   Year_Funded    2136 non-null   datetime64[ns]
 6   Year           591 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(1), object(4)
memory usage: 170.4+ KB
None 
Company_Brand     0.000000
HeadQuarter       0.000000
Sector            0.000000
Investor          0.000000
Amount            0.000000
Year_Funded      21.672167
Year             78.327833
dtype: float64


Unnamed: 0,Company_Brand,HeadQuarter,Sector,Investor,Amount,Year_Funded,Year
0,Aqgromalin,Chennai,AgriTech,Angel investors,200000.0,2020-01-01,NaT
1,Krayonnz,Bangalore,EdTech,GSF Accelerator,100000.0,2020-01-01,NaT
2,PadCare Labs,Pune,Hygiene management,Venture Center,3000000.0,2020-01-01,NaT
3,NCOME,New Delhi,Escrow,"Venture Catalysts, PointOne Capital",400000.0,2020-01-01,NaT
4,Gramophone,Indore,AgriTech,"Siana Capital Management, Info Edge",340000.0,2020-01-01,NaT
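
Filling with the mode assigns the single most common investor to every row that had none (about 21.9% of rows in the combined frame), which inflates that investor's count. The sketch below, on a hypothetical frame `demo`, contrasts mode imputation with a placeholder label; the label `"Undisclosed"` is an assumption, not a value from the dataset.

```python
import pandas as pd

# Hypothetical frame with two missing investors.
demo = pd.DataFrame({
    "Investor": ["Angel investors", None, "Venture Center",
                 "Angel investors", None],
})

# mode() ignores missing values and returns the most frequent entries.
most_common = demo["Investor"].mode()[0]

# Option 1: mode imputation, as in the cell above.
filled_mode = demo["Investor"].fillna(most_common)
# Option 2: an explicit placeholder keeps imputed rows distinguishable.
filled_label = demo["Investor"].fillna("Undisclosed")

print(filled_mode.value_counts())
print(filled_label.value_counts())
```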


| Funding stage in Indian startups | Description | Stage labels in the dataset |
|----------|-----------|-----------|
| Ideation | Brainstorming and developing business concepts, defining value propositions, and outlining plans | Pre-seed, Pre-Seed, Pre-seed Round, Pre seed Round, Pre seed round |
| Validation | Validating the business model, product-market fit, and scalability through research and feedback | Seed, Seed A, Seed Funding, Seed round, Seed Round, Seed Investment, Seed funding, Angel Round |
| Early Traction | Gaining initial traction, attracting early adopters, and refining based on feedback | Pre-series A, Pre series A, Pre-Series A, Pre- series A, Pre series A1, Series A, Series A-1 |
| Scaling | Expanding operations, customer base, and market reach for rapid growth | Series B, Series B2, Pre-series B, Pre series B, Pre-Series B, Series C, Pre-series C, Pre series C, Series D, Series D1, Series E, Series E2, Series C, D, Mid series, Series F, Series H |
| Exit Options | Considering exit strategies such as mergers, acquisitions, or IPOs | Debt, Bridge, Bridge Round |
| Others | Miscellaneous phases or unique development activities |  |
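
The table can be turned into a lookup so that raw stage labels are grouped consistently. This is a sketch, not the notebook's own method: the helper names `normalise` and `to_stage_group` are hypothetical, unmapped labels fall back to "Others", and missing values are left missing. Labels are compared after lower-casing and treating hyphens as spaces, so the spelling variants listed in the table collapse to one key each.

```python
import re
import pandas as pd

def normalise(label):
    """Lower-case, treat hyphens as spaces, collapse repeated whitespace."""
    return re.sub(r"\s+", " ", label.lower().replace("-", " ")).strip()

# One representative spelling per label from the table above.
STAGE_GROUPS = {
    "Ideation": ["Pre-seed", "Pre-seed Round"],
    "Validation": ["Seed", "Seed A", "Seed Funding", "Seed Round",
                   "Seed Investment", "Angel Round"],
    "Early Traction": ["Pre-series A", "Pre-series A1", "Series A",
                       "Series A-1"],
    "Scaling": ["Series B", "Series B2", "Pre-series B", "Series C",
                "Pre-series C", "Series D", "Series D1", "Series E",
                "Series E2", "Series C, D", "Mid series", "Series F",
                "Series H"],
    "Exit Options": ["Debt", "Bridge", "Bridge Round"],
}
LOOKUP = {normalise(raw): group
          for group, labels in STAGE_GROUPS.items() for raw in labels}

def to_stage_group(label):
    """Map a raw stage label to its group; keep missing values missing."""
    if pd.isna(label):
        return None
    return LOOKUP.get(normalise(label), "Others")

stages = pd.Series(["Pre seed round", "Pre- series A", "Series E2", None])
print(stages.map(to_stage_group).tolist())
```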


## Hypothesis Testing