# DATA ANALYSIS PROJECT

## INDIAN START-UPS FUNDING. INSIGHTS AND TRENDS FROM 2018 TO 2021

### Project Description
This data analysis project focuses on the funding received by start-ups in India from 2018 to 2021. <br><br>The objective is to gain insights into the ecosystem and propose the best course of action for our team's venture. <br><br>By analyzing the data on funding amounts, start-up details, and investor information, we aim to unearth prevailing patterns and gain insights about the opportunities in India's start-up ecosystem to inform decision-making.

### Taking Preview of the Data at Hand

Data from four different sources were gathered for this project. Two from Microsoft SQL Server, one from onedrive and the last one from a github repo.

We set the van off with loading the needed libraries.

In [423]:
# Load some libraries

import pyodbc
from dotenv import dotenv_values
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.impute import SimpleImputer
import re
import warnings

warnings.filterwarnings('ignore')

# Display all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Convert amounts from scientific format
pd.set_option('display.float_format', '{:.1f}'.format)

In [424]:
# # Load environment variables from .env file into a dictionary

# env_var = dotenv_values('.env')

# # Get the values for the credentials you set in the '.env' file

# database = env_var.get('DATABASE')
# username = env_var.get('USERNAME')
# password = env_var.get('PASSWORD')
# server = env_var.get('SERVER')


# connection_string = f"DRIVER={{SQL Server}}; SERVER={server};DATABASE={database};UID={username};PWD={password}"

In [425]:
# Get the values for the database credentials set in the '.env' file
env_var = {key.upper(): value for key, value in dotenv_values('.env').items()}

# Unpack the values from the dictionary
database, username, password, server = (env_var.get(key) for key in ['DATABASE', 'USERNAME', 'PASSWORD', 'SERVER'])

connection_string = f"DRIVER={{SQL Server}}; SERVER={server};DATABASE={database};UID={username};PWD={password}"

In [426]:
# Connect to database

connection = pyodbc.connect(connection_string)

Success! Now, we read the data using sql SELECT statement and pandas' read_sql.

In [427]:
# Loading the data for 2020

query2020 = "select * from dbo.LP1_startup_funding2020"

data2020 = pd.read_sql(query2020, connection)

In [428]:
# Loading the data for 2021

query2021 = "select * from dbo.LP1_startup_funding2021"

data2021 = pd.read_sql(query2021, connection)

Perfect! We hop onto the other datasets obtained in '.csv' format

In [429]:
# Loading data for 2018 & 2019


data2018 = pd.read_csv('startup_funding2018.csv')
data2019 = pd.read_csv('startup_funding2019.csv')

### Data Inspection

### 2021 Data

In [430]:
data2021.head(10)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed
5,Urban Company,2014.0,New Delhi,Home services,Urban Company (Formerly UrbanClap) is a home a...,"Abhiraj Singh Bhal, Raghav Chandra, Varun Khaitan",Vy Capital,"$188,000,000",
6,Comofi Medtech,2018.0,Bangalore,HealthTech,Comofi Medtech is a healthcare robotics startup.,Gururaj KB,"CIIE.CO, KIIT-TBI","$200,000",
7,Qube Health,2016.0,Mumbai,HealthTech,India's Most Respected Workplace Healthcare Ma...,Gagan Kapur,Inflection Point Ventures,Undisclosed,Pre-series A
8,Vitra.ai,2020.0,Bangalore,Tech Startup,Vitra.ai is an AI-based video translation plat...,Akash Nidhi PS,Inflexor Ventures,Undisclosed,
9,Taikee,2010.0,Mumbai,E-commerce,"Taikee is the ISO-certified, B2B e-commerce pl...","Nidhi Ramachandran, Sachin Chhabra",,"$1,000,000",


### Issues arising from data2021:

* Founded column is a float. It has to be a date

* Some Amounts have $undisclosed, Undisclosed and undisclosed. We may treat them as missing values

* The Amounts and Stage for FanPlay company are interchanged

* At index 242, 256, 257, and 545, the amount appears at the investor’s column and then stage at the amount column

* Little Leap campany at index 538 has Ah! Ventures(investors) instead of amount and then amount at stage column. Also, ‘Holistic Development Programs for children in …; should be   replaced with Vishal Gupta as founder.

* BHyve company has part of ‘what_it_does’ at founders cl then founders at investor cl and investors at amount col ……. It also had investor’s name as ITO Angel Network instead of JITO Angel Network

* Some amounts are separated by ‘,’ some too have ‘$$’ preceding them and some only ‘$’ as amount…this is for EventBeep, MPL

* Amount for Godamwale is misspelt as 1000000\t#REF! instead of 1000000 and is at investor col with stage rather taking its place. Also, the investor is **Capt. Anand Aryamane**

* for index 1100-Sochcast company, Heaquarter is replaced with ‘Online Media\t#REF!’ and sector
* There are some duplicates

### Resolution

* Delete duplicates
* Delete rows with anomalous values
* Impute with right values from credible sources where such values are mispelt, omitted and wrongly placed
* Convert columns to right data types
* Standardise the 'Stage' column for good analysis by renaming or grouping

In [431]:
# Confirm duplicated records

data2021.duplicated().sum()

19

Let's take a look, first.

In [432]:
data2021[data2021.duplicated()].head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
107,Curefoods,2020.0,Bangalore,Food & Beverages,Healthy & nutritious foods and cold pressed ju...,Ankit Nagori,"Iron Pillar, Nordstar, Binny Bansal",$13000000,
109,Bewakoof,2012.0,Mumbai,Apparel & Fashion,Bewakoof is a lifestyle fashion brand that mak...,Prabhkiran Singh,InvestCorp,$8000000,
111,FanPlay,2020.0,Computer Games,Computer Games,A real money game app specializing in trivia g...,YC W21,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000
117,Advantage Club,2014.0,Mumbai,HRTech,Advantage Club is India's largest employee eng...,"Sourabh Deorah, Smiti Bhatt Deorah","Y Combinator, Broom Ventures, Kunal Shah",$1700000,
119,Ruptok,2020.0,New Delhi,FinTech,Ruptok fintech Pvt. Ltd. is an online gold loa...,Ankur Gupta,Eclear Leasing,$1000000,


Alright! They have to be on the go...!

In [433]:
# Drop the duplicates

data2021 = data2021.drop_duplicates()


Checking.......

In [434]:
# Check data shape before deletion
data2021.shape

(1190, 9)

Success!

### Dealing with anomalous rows

....taking a look first at the unique values of each column!

Deleting rows with anomalous values


In [435]:

data2021['Company_Brand'].unique()

array(['Unbox Robotics', 'upGrad', 'Lead School', ..., 'Gigforce',
       'Vahdam', 'WeRize'], dtype=object)

In [436]:
# Function to drop anomalous rows


def drop_rows_by_column_values(dataframe, column_name, value):
    """
    Drops rows from the specified DataFrame based on a specific column value.

    Parameters:
        dataframe (pandas.DataFrame): The DataFrame to modify.
        column_name (str): The name of the column to check for the specified value.
        value: The value to match and drop rows based on the specified column.

    Returns:
        None

    Description:
        This function removes rows from the specified DataFrame where the specified column
        matches the provided value. The rows are dropped in-place, modifying the DataFrame directly.

        Parameters:
        - dataframe (pandas.DataFrame): The DataFrame to modify.
        - column_name (str): The name of the column to check for the specified value.
        - value: The value to match and drop rows based on the specified column.

        Returns:
        - None

        Example usage:
        drop_rows_by_column_value(data2021, 'Company_Brand', 'BrandXYZ')

        This example will drop all rows from the 'data2021' DataFrame where the 'Company_Brand'
        column contains the value 'BrandXYZ'.
    """
    dataframe.drop(dataframe[dataframe[column_name].isin(value)].index, inplace=True)

In [437]:
# Invoke function to drop the list of rows

drop_rows_by_column_values(data2021, 'Company_Brand', ['Little Leap', 'Godamwale', 'Sochcast', 'BHyve', 'Fullife Healthcare', 'MoEVing', 'AdmitKard', 'MYRE Capital'])


In [438]:
# Making sure the rows were deleted

data2021.shape

(1182, 9)

In [439]:
# Re-arrange indices for the data frame

data2021.reset_index(drop=True, inplace=True)

In [440]:
# Check for re-indexing

data2021.tail()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
1177,Gigforce,2019.0,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,$3000000,Pre-series A
1178,Vahdam,2015.0,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,$20000000,Series D
1179,Leap Finance,2019.0,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,$55000000,Series C
1180,CollegeDekho,2015.0,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",$26000000,Series B
1181,WeRize,2019.0,Bangalore,Financial Services,India’s first socially distributed full stack ...,"Vishal Chopra, Himanshu Gupta","3one4 Capital, Kalaari Capital",$8000000,Series A


### Dealing with the 'Founded' Col

In [441]:
# Imputing missing values

array = data2021['Founded'].values.reshape(-1,1)
imputer = SimpleImputer(strategy='most_frequent')

imputer.fit(array)

data2021['Founded'] = imputer.transform(array)

In [442]:
data2021['Founded'].unique()

array([2019., 2015., 2012., 2021., 2014., 2018., 2016., 2020., 2010.,
       2017., 1993., 2008., 2013., 1999., 1989., 2011., 2009., 2002.,
       1994., 2006., 2000., 2007., 1978., 2003., 1998., 1991., 1984.,
       2004., 2005., 1963.])

Converting 'Founded' from float to int64

In [443]:

data2021['Founded'] = data2021['Founded'].apply(str).apply(lambda x: x.split('.')[0])
data2021['Founded'] = pd.to_numeric(data2021['Founded'])
data2021.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [444]:
# Making sure the conversion was done

data2021['Founded'].dtype

dtype('int64')

In [445]:
data2021['Founded'].unique()

array([2019, 2015, 2012, 2021, 2014, 2018, 2016, 2020, 2010, 2017, 1993,
       2008, 2013, 1999, 1989, 2011, 2009, 2002, 1994, 2006, 2000, 2007,
       1978, 2003, 1998, 1991, 1984, 2004, 2005, 1963], dtype=int64)

### Cleaning the 'Amount' col

In [446]:
# Drop additional rows with anomalous values at the Amount col

drop_rows_by_column_values(data2021, 'Amount', ['Upsparks', 'JITO Angel Network, LetsVenture'])

Replacing undisclosed/$undisclosed/$Undisclosed values under the 'Amounts' Column

In [447]:
# Function to replace anomalous amount values with np.nan

def replace_values_with_nan(df, column_name, values_to_replace):
    """
    Replaces specified values in a column of a DataFrame with np.nan.

    Args:
        df (pandas.DataFrame): The DataFrame containing the column to modify.
        column_name (str): The name of the column to replace values in.
        values_to_replace (list or scalar): The value(s) to replace with np.nan. Can be a single value or a list of values.

    Returns:
        pandas.DataFrame: A modified DataFrame with the specified values replaced by np.nan.

    Example:
        # Create a sample DataFrame
        data = {
            'Column1': [1, 2, 3, 4, 5],
            'Column2': ['A', 'B', 'C', 'D', 'E'],
            'Column3': ['X', 'Y', 'Z', 'X', 'Z']
        }

        df = pd.DataFrame(data)

        # Define the column name and values to replace with np.nan
        column_name = 'Column3'
        values_to_replace = ['X', 'Z']

        # Call the replace_values_with_nan function
        df_modified = replace_values_with_nan(df, column_name, values_to_replace)

        # Print the modified DataFrame
        print(df_modified)
    """
    df[column_name] = df[column_name].replace(values_to_replace, np.nan)
    return df


In [448]:
# Invoking the function to replace the missing values with np.nan

data2021 = replace_values_with_nan(data2021, 'Amount', ['$Undisclosed', '$undisclosed', 'undisclosed', 'Undisclosed', 'None'])

# Replacing the undesired characters

data2021['Amount'] = data2021['Amount'].str.replace('[$,]', '')

In [449]:
# Converting the 'Amount' col to float

data2021['Amount'] = pd.to_numeric(data2021['Amount'])

In [450]:
# Checking....

data2021['Amount'].dtype

dtype('float64')

In [451]:
pd.set_option('display.float_format', '{:.1f}'.format)


In [452]:
data2021.describe()

Unnamed: 0,Founded,Amount
count,1180.0,1040.0
mean,2016.6,172664159.6
std,4.5,4651133125.4
min,1963.0,10000.0
25%,2015.0,1000000.0
50%,2018.0,3700000.0
75%,2020.0,15000000.0
max,2021.0,150000000000.0


Filling missing values in the 'Amount' Col

In [453]:
# Imputing missing values

array = data2021['Amount'].values.reshape(-1,1)
imputer = SimpleImputer(strategy='median')

imputer.fit(array)

data2021['Amount'] = imputer.transform(array)

In [454]:
data2021['Amount'] = data2021['Amount'].astype(float)

In [455]:
data2021[data2021['Amount'] == 150000000000.0]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
73,Alteria Capital,2018,Mumbai,FinTech,Alteria Capital is a Venture debt firm .,Vinod Murali,,150000000000.0,Debt


In [456]:
top_10_fundings = data2021.groupby('Company_Brand')['Amount'].sum().nlargest(10)

top_10_fundings

Company_Brand
Alteria Capital    150000000000.0
VerSe Innovation     1450000000.0
BYJU'S               1260000000.0
Dream Sports         1240000000.0
Meesho                870000000.0
Zetwerk               870000000.0
OYO                   865000000.0
Swiggy                800000000.0
BharatPe              530200000.0
Ola                   500000000.0
Name: Amount, dtype: float64

In [457]:
data2021.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",1200000.0,Pre-series A
1,upGrad,2015,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",120000000.0,
2,Lead School,2012,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital",30000000.0,Series D
3,Bizongo,2015,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital",51000000.0,Series C
4,FypMoney,2021,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal",2000000.0,Seed


In [458]:
data2021['Year_Funded'] = 2021

In [459]:
data2021.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Year_Funded
0,Unbox Robotics,2019,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",1200000.0,Pre-series A,2021
1,upGrad,2015,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",120000000.0,,2021
2,Lead School,2012,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital",30000000.0,Series D,2021
3,Bizongo,2015,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital",51000000.0,Series C,2021
4,FypMoney,2021,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal",2000000.0,Seed,2021


### Cleaning 'Stage' Col

Inspecting the unique values in the 'Stage' Column

In [460]:
data2021['Stage'].unique()

array(['Pre-series A', None, 'Series D', 'Series C', 'Seed', 'Series B',
       'Series E', 'Pre-seed', 'Series A', 'Pre-series B', 'Debt',
       'Bridge', 'Seed+', 'Series F2', 'Series A+', 'Series G',
       'Series F', 'Series H', 'Series B3', 'PE', 'Series F1',
       'Pre-series A1', 'Early seed', 'Series D1', 'Seies A',
       'Pre-series', 'Series A2', 'Series I'], dtype=object)

Check how many records have the null value at the 'Stage' column

In [461]:
data2021['Stage'].isnull().sum()

416

Oops! Pretty much! Let's take a deeper look at those records.

In [462]:
data2021[data2021['Stage'].isnull()]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Year_Funded
1,upGrad,2015,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",120000000.0,,2021
5,Urban Company,2014,New Delhi,Home services,Urban Company (Formerly UrbanClap) is a home a...,"Abhiraj Singh Bhal, Raghav Chandra, Varun Khaitan",Vy Capital,188000000.0,,2021
6,Comofi Medtech,2018,Bangalore,HealthTech,Comofi Medtech is a healthcare robotics startup.,Gururaj KB,"CIIE.CO, KIIT-TBI",200000.0,,2021
8,Vitra.ai,2020,Bangalore,Tech Startup,Vitra.ai is an AI-based video translation plat...,Akash Nidhi PS,Inflexor Ventures,3700000.0,,2021
9,Taikee,2010,Mumbai,E-commerce,"Taikee is the ISO-certified, B2B e-commerce pl...","Nidhi Ramachandran, Sachin Chhabra",,1000000.0,,2021
11,FreeStand,2017,New Delhi,B2B service,FreeStand enables FMCG brands to execute track...,"Konark Sharma, Sneh Soni","SucSEED Indovation, IIM Calcutta Innovation Park",100000.0,,2021
13,Freyr Energy,2014,Hyderabad,Renewable Energy,Freyr Energy is a company that provides full s...,"Radhika Choudary, Saurabh Marda","Impact Partners, C4D Partners",2000000.0,,2021
14,DealShare,2018,Jaipur,E-commerce,DealShare is a Social Commerce Startup,"Sankar Bora, Sourjyendu Medda, Vineet Rao","Tiger Global Management, InnoVen Capital",9000000.0,,2021
15,Tessolve,1993,Bangalore,Electronics,Tessolve Semiconductor offers engineering in s...,"P Raja Manickam, Srinivas Chinamilli, Veerappan V",Novo Tellus Capital,40000000.0,,2021
16,Smart Joules,2014,New Delhi,Renewable Energy,Smart Joules is an energy management company.,"Arjun P Gupta, Ujjal Majumdar, Sidhartha Gupta","Raintree Family Office, ADB arm",49000000.0,,2021


Alright! These companies are listed on cruchbase, the leading provider of private-company prospecting and research solutions. <br>
Let's see if we can get reliable data to fill-in the nulls

In [463]:
data2021.iloc[1, 8] = 'Venture - Series Unknown'
data2021.iloc[5, 8] = 'Secondary Market'
data2021.iloc[6, 8] = 'Pre-Seed'
data2021.iloc[8, 8] = 'Seed'
data2021.iloc[9, 8] = 'Venture - Series Unknown'
data2021.iloc[11, 8] = 'Pre-Seed'
data2021.iloc[13, 8] = 'Series A'
data2021.iloc[14, 8] = 'Series D'
data2021.iloc[15, 8] = 'Series E'
data2021.iloc[16, 8] = 'Series A'
data2021.iloc[24, 8] = 'Debt'
data2021.iloc[31, 8] = ' Series A'
data2021.iloc[34, 8] = 'Venture - Series Unknown'
data2021.iloc[35, 8] = 'Venture - Series Unknown'
data2021.iloc[36, 8] = 'Venture - Series Unknown'
data2021.iloc[37, 8] = 'Venture - Series Unknown'
data2021.iloc[40, 8] = 'Series A'
data2021.iloc[42, 8] = 'Seed'
data2021.iloc[46, 8] = 'Equity Crowdfunding'

Good! I could just get a few. 

Let's have a look at what the unique values are again.

In [464]:
data2021['Stage'].unique()

array(['Pre-series A', 'Venture - Series Unknown', 'Series D', 'Series C',
       'Seed', 'Secondary Market', 'Pre-Seed', 'Series A', 'Series E',
       'Series B', 'Pre-seed', 'Debt', ' Series A', None,
       'Equity Crowdfunding', 'Pre-series B', 'Bridge', 'Seed+',
       'Series F2', 'Series A+', 'Series G', 'Series F', 'Series H',
       'Series B3', 'PE', 'Series F1', 'Pre-series A1', 'Early seed',
       'Series D1', 'Seies A', 'Pre-series', 'Series A2', 'Series I'],
      dtype=object)

Ok, Let's put these values in a standard format.

In [465]:

def update_value(value):
    replacements = {
        r'Pre series|Early seed|Pre-series A|Pre-series A1': 'Pre-series',
        r'Seies A|Seed+|Pre-series B|Series A2': 'Series A',
        r'PE': 'Private Equity',
        r'Debt': 'Debt Financing',
        r'Seed1': 'Seed',
        r'None': 'Venture - Series Unknown',
        r'Series A+|Series B3| Series B': 'Series B',
        r'Series F2|Series F1|Series D1|Series D|Series G|Series H|Series I|Series E|Series F': 'Series C',
    }
    
    for pattern, replacement in replacements.items():
        value = re.sub(pattern, replacement, str(value))
    
    return value

data2021 = data2021.applymap(update_value)


Great! Checking........

In [466]:
data2021['Stage'].unique()

array(['Pre-series', 'Venture - Series Unknown', 'Series C', 'Series B',
       'Secondary Market', 'Pre-Series B', 'Pre-seed', 'Debt Financing',
       ' Series B', 'Equity Crowdfunding', 'Bridge', 'Series B+',
       'Private Equity', 'Pre-series1'], dtype=object)

Ouch!  'Series B+',  'Pre-Series B' and 'Pre-series1' want to be treated diferently! I see. Let's implore another method!

In [467]:
data2021['Stage'].replace('Series B+', 'Series B', inplace=True)
data2021['Stage'].replace('Pre-series1', 'Pre-series', inplace=True)
data2021['Stage'].replace('Pre-Series B', 'Series A', inplace=True)
data2021['Stage'].replace(' Series B', 'Series B', inplace=True)


Nice! Let's take a look at the unique values for stage column again!

In [468]:
data2021['Stage'].unique()

array(['Pre-series', 'Venture - Series Unknown', 'Series C', 'Series B',
       'Secondary Market', 'Series A', 'Pre-seed', 'Debt Financing',
       'Equity Crowdfunding', 'Bridge', 'Private Equity'], dtype=object)

Perfect!

In [469]:
data2021['HeadQuarter'].unique()

array(['Bangalore', 'Mumbai', 'Gurugram', 'New Delhi', 'Hyderabad',
       'Jaipur', 'Ahmadabad', 'Chennai', 'Venture - Series Unknown',
       'Small Towns, Andhra Pradesh', 'Goa', 'Rajsamand', 'Ranchi',
       'Faridabad, Haryana', 'Gujarat', 'Pune', 'Thane', 'Cochin',
       'Noida', 'Chandigarh', 'Gurgaon', 'Vadodara', 'Food & Beverages',
       'Kolkata', 'Ahmedabad', 'Mohali', 'Haryana', 'Indore', 'Powai',
       'Ghaziabad', 'Nagpur', 'West Bengal', 'Patna', 'Samsitpur',
       'Lucknow', 'Telangana', 'Silvassa', 'Thiruvananthapuram',
       'Faridabad', 'Roorkee', 'Ambernath', 'Panchkula', 'Surat',
       'Coimbatore', 'Andheri', 'Mangalore', 'Telugana', 'Bhubaneswar',
       'Kottayam', 'Beijing', 'Panaji', 'Satara', 'Orissia', 'Jodhpur',
       'New York', 'Santra', 'Mountain View, CA', 'Trivandrum',
       'Jharkhand', 'Kanpur', 'Bhilwara', 'Guwahati', 'Kochi', 'London',
       'Information Technology & Services', 'The Nilgiris', 'Gandhinagar'],
      dtype=object)