# Exploring the Indian Startup Ecosystem: A Data-Driven Analysis of Funding Trends and Industry Sectors

## Business Understanding
### Business Scenario
Your team is trying to venture into the Indian start-up
ecosystem. As the data expert of the team, you are to
investigate the ecosystem and propose the best course
of action.

*Analyze funding received by start-ups in India from
2018 to 2021.*
- Separate data for each year of funding is
provided.
- In these datasets, you'll find the start-ups' details,
the funding amounts received, and the investors'
information.

### Business Objective
The aim of this project is to analyse the Indian start-up ecosystem and advise stakeholders on which ventures to invest in to increase the potential for high profit/income.

## Hypothesis Testing
*Hypothesis* - The amount of funding a company receives depends on the sector it operates in
- Null Hypothesis (H_0) - The funding a company receives does not depend on the sector of investment
- Alternate Hypothesis (H_a) - The funding a company receives depends on the sector of investment
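
One way such a hypothesis could be tested is with a Kruskal-Wallis H-test, which compares the funding distributions of several sectors without assuming normality. The sketch below is only an illustration under stated assumptions: the toy data, the 0.05 significance level, and the `kruskal_by_sector` helper name are hypothetical and not part of the original analysis.

```python
import pandas as pd
from scipy import stats


def kruskal_by_sector(df, group_col="sector", value_col="amount"):
    """Run a Kruskal-Wallis H-test of value_col across the groups in group_col."""
    # one array of non-null amounts per sector
    samples = [
        group[value_col].dropna().to_numpy()
        for _, group in df.groupby(group_col)
    ]
    # keep only sectors with at least two observations
    samples = [sample for sample in samples if len(sample) >= 2]
    return stats.kruskal(*samples)


# toy data (illustrative values only, not taken from the datasets)
toy = pd.DataFrame({
    "sector": ["FinTech"] * 4 + ["EdTech"] * 4 + ["AgriTech"] * 4,
    "amount": [5e6, 7e6, 6e6, 9e6, 1e6, 2e6, 1.5e6, 3e6, 2e5, 4e5, 3e5, 1e5],
})

result = kruskal_by_sector(toy)
alpha = 0.05  # assumed significance level
reject_null = bool(result.pvalue < alpha)
print(f"H = {result.statistic:.3f}, p = {result.pvalue:.4f}, reject H_0: {reject_null}")
```

A rank-based test is used here because funding amounts tend to be heavily skewed; a one-way ANOVA would be an alternative if the amounts were first log-transformed.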

### Business Questions
- Which particular sector received the most funding over the time frame?
- How are start-ups distributed across funding stages, and how much funding was allocated to each stage?
- What is the distribution of funding based on location?
- Which year had the most investors?
- What are the top 3 considerations for investors when investing in start-ups?
- What was the impact of the COVID-19 pandemic on start-up funding in 2020?


# Data Loading and Exploration


## Data Dictionary

| Column Names|Description| Data Type|
|-------------|-----------|----------|
|Company/Brand|Name of company/start-up|Object|
|Founded|Year Start-up was founded|Int|
|Sector| Sector of service|Object|
|What it Does|Description about the company|Object|
|Founders| Founders of the company|Object|
|Investor| Investors|Object|
|Amount($)|Funds raised|Float|
|Stage|Round of funding reached|Object|
|Headquarter|Location of company|Object|

## Data Understanding
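The Data Understanding phase focuses on identifying, collecting, and analysing
the datasets that can help accomplish the project goals. In the CRISP-DM framework, this
phase has four tasks: collecting initial data, describing the data, exploring the data, and verifying data quality.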


In [1]:
#import all necessary libraries

# data manipulation
import pandas as pd
import numpy as np

# data visualization libraries
import matplotlib.pyplot as plt
from plotly import express as px
import seaborn as sns

# statistical libraries
from scipy import stats
import statistics as stat

# database manipulation libraries
import pyodbc
from dotenv import dotenv_values

# hide warnings
import warnings
warnings.filterwarnings("ignore")
 



## Setup Database Connection

In [2]:
# load environment variables
environment_variables = dotenv_values(".env")

# load database configurations
database = environment_variables.get("DB_DATABASENAME")
server = environment_variables.get("DB_SERVER")
username = environment_variables.get("DB_USERNAME")
password = environment_variables.get("DB_PASSWORD")

# database connection string
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"
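
Because `.get()` returns `None` for any key that is missing from `.env`, a missing setting would silently end up as the text `None` inside the connection string. The following is a minimal sketch of a guard against that; the `build_connection_string` helper and the sample values are hypothetical, while the key names mirror the ones used above.

```python
REQUIRED_KEYS = ("DB_SERVER", "DB_DATABASENAME", "DB_USERNAME", "DB_PASSWORD")


def build_connection_string(config, driver="SQL Server"):
    """Build an ODBC connection string, failing early on missing settings."""
    # collect every required key that is absent or empty
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError(f"Missing database settings: {', '.join(missing)}")
    return (
        f"DRIVER={{{driver}}};"
        f"SERVER={config['DB_SERVER']};"
        f"DATABASE={config['DB_DATABASENAME']};"
        f"UID={config['DB_USERNAME']};"
        f"PWD={config['DB_PASSWORD']}"
    )


# placeholder values for illustration only
sample_config = {
    "DB_SERVER": "example-server",
    "DB_DATABASENAME": "example_db",
    "DB_USERNAME": "example_user",
    "DB_PASSWORD": "example_password",
}
connection_string_example = build_connection_string(sample_config)
print(connection_string_example)
```

In the notebook, the dictionary returned by `dotenv_values(".env")` could be passed in place of `sample_config`.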



In [3]:
# create pyodbc connector
connection = pyodbc.connect(connection_string)


In [4]:
# Loading 2021 dataset from MS SQL server
query_2021 = "SELECT * FROM dbo.LP1_startup_funding2021"
df_2021 = pd.read_sql(query_2021,connection)
df_2021.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [5]:
# Load 2020 dataset from MS SQL Server
query_2020 = "SELECT * FROM dbo.LP1_startup_funding2020"
df_2020 = pd.read_sql(query_2020,connection)
df_2020.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


In [6]:
# load 2019 dataset
df_2019 = pd.read_csv("data/startup_funding2019.csv")
df_2019.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [7]:
# load 2018 dataset
df_2018 = pd.read_csv("./data/startup_funding2018.csv")
df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


# Data Assessment
After loading, the datasets were assessed visually and programmatically for quality issues.

## Data Quality Issues
I identified several data quality issues while exploring the datasets:
1. Inconsistent and missing columns: some datasets have inconsistent column structures with missing columns. This applies to the 2018 dataset.
2. Missing values and duplicates: the individual datasets contain null values and duplicate rows.
3. Inconsistent values and currencies in the amount column: the amount column contains inconsistent values and different currency types.
4. Inconsistent values in the sector, location, and industry columns: these columns contain many inconsistent values that need to be cleaned.
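
A programmatic assessment of this kind can be sketched as a small per-column report. This is an illustration under assumptions rather than the procedure actually used: the `quality_report` helper and the toy data are hypothetical. It counts the `"—"` placeholder separately because `isna()` does not treat it as missing.

```python
import pandas as pd


def quality_report(df, placeholder="—"):
    """Summarise dtype, missing values, placeholders and distinct values per column."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        # cells holding the placeholder string are not counted by isna()
        "placeholder": (df == placeholder).sum(),
        "unique": df.nunique(dropna=True),
    })
    report["missing_pct"] = (report["missing"] / len(df) * 100).round(1)
    return report


# toy data (illustrative values only)
toy = pd.DataFrame({
    "company": ["A", "B", "B", "C"],
    "amount": ["$1,000", "₹5,000", "₹5,000", None],
    "sector": ["FinTech", "EdTech", "EdTech", "—"],
})

report = quality_report(toy)
duplicate_rows = int(toy.duplicated().sum())
print(report)
print("duplicate rows:", duplicate_rows)
```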



# Data Cleaning on the 2018 Dataset

In [8]:
# preview the 2018 dataset
df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [9]:
# rename column labels
df_2018 = df_2018.rename(columns=lambda x: x.lower().replace(" ","_"))
df_2018.head()

Unnamed: 0,company_name,industry,round/series,amount,location,about_company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [10]:
# rename columns
rename_columns = {
    "round/series": "stage",
    "industry": "sector",
    "about_company": "what_it_does"
}
df_2018 = df_2018.rename(columns=rename_columns)

In [11]:
# confirm changes
df_2018.columns

Index(['company_name', 'sector', 'stage', 'amount', 'location',
       'what_it_does'],
      dtype='object')

In [12]:
# check the shape of the 2018 dataset
df_2018.shape

(526, 6)

In [13]:
# perform descriptive statistics on the dataset
df_2018.describe(include="all").T

Unnamed: 0,count,unique,top,freq
company_name,526,525,TheCollegeFever,2
sector,526,405,—,30
stage,526,21,Seed,280
amount,526,198,—,148
location,526,50,"Bangalore, Karnataka, India",102
what_it_does,526,524,"TheCollegeFever is a hub for fun, fiesta and f...",2


In [14]:
# check the info about the dataset
df_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  526 non-null    object
 1   sector        526 non-null    object
 2   stage         526 non-null    object
 3   amount        526 non-null    object
 4   location      526 non-null    object
 5   what_it_does  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


## Observations on the 2018 dataset
- The dataset consists of 526 rows and 6 columns
- All the data types of the columns are of the object type
- The amount column should be numeric but is stored as strings
- The dataset contains duplicate rows

### Course of Action
- Convert the amount column data type to float
- Drop duplicated rows

In [15]:
# check for null values
df_2018.isna().sum()

company_name    0
sector          0
stage           0
amount          0
location        0
what_it_does    0
dtype: int64

In [16]:
# check for duplicated values
df_2018.duplicated().sum()

1

In [17]:
# drop the duplicate row
df_2018.drop_duplicates(inplace=True)


In [18]:
# verify if duplicate still exists
df_2018.duplicated().sum() 

0

## Cleaning of the Amount Column
Exchange rate between the US dollar and the Indian rupee as of 2018:

1 USD = 68.4933 INR

Source: https://www.poundsterlinglive.com/history/USD-INR-2018

Assumption: all amounts without a currency symbol are assumed to be in USD.
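
As a worked example of the rate and assumption above, the sketch below converts single raw values to USD. The `to_usd` helper is hypothetical and only illustrates the arithmetic; it is not the function used later in the notebook.

```python
EXCHANGE_RATE_INR_PER_USD = 68.4933  # 2018 rate quoted above


def to_usd(raw, rate=EXCHANGE_RATE_INR_PER_USD):
    """Convert one raw amount string to a float in USD, or None if unparseable."""
    text = str(raw).replace(",", "").strip()
    divisor = 1.0
    if text.startswith("₹"):
        # rupee amounts are divided by the exchange rate
        text, divisor = text[1:], rate
    elif text.startswith("$"):
        text = text[1:]
    # values without a symbol are assumed to be in USD already
    try:
        return float(text) / divisor
    except ValueError:
        # placeholders that are not numbers cannot be converted
        return None


for raw in ["₹40,000,000", "$143,145", "250000", "—"]:
    print(raw, "->", to_usd(raw))
```

For instance, ₹40,000,000 divided by 68.4933 gives roughly 583,998.73 USD.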

In [19]:
df_2018["amount"].unique()

array(['250000', '₹40,000,000', '₹65,000,000', '2000000', '—', '1600000',
       '₹16,000,000', '₹50,000,000', '₹100,000,000', '150000', '1100000',
       '₹500,000', '6000000', '650000', '₹35,000,000', '₹64,000,000',
       '₹20,000,000', '1000000', '5000000', '4000000', '₹30,000,000',
       '2800000', '1700000', '1300000', '₹5,000,000', '₹12,500,000',
       '₹15,000,000', '500000', '₹104,000,000', '₹45,000,000', '13400000',
       '₹25,000,000', '₹26,400,000', '₹8,000,000', '₹60,000', '9000000',
       '100000', '20000', '120000', '₹34,000,000', '₹342,000,000',
       '$143,145', '₹600,000,000', '$742,000,000', '₹1,000,000,000',
       '₹2,000,000,000', '$3,980,000', '$10,000', '₹100,000',
       '₹250,000,000', '$1,000,000,000', '$7,000,000', '$35,000,000',
       '₹550,000,000', '$28,500,000', '$2,000,000', '₹240,000,000',
       '₹120,000,000', '$2,400,000', '$30,000,000', '₹2,500,000,000',
       '$23,000,000', '$150,000', '$11,000,000', '₹44,000,000',
       '$3,240,000', '₹60

In [20]:
# create function to clean the amount column
exchange_rate = 68.4933
    
def clean_amount():
    # copy original amount columns
    amount_column = df_2018["amount"].copy().str.replace(",","") 
    # extract values in Rupees(₹)
    amount_in_rupee = amount_column[amount_column.str.startswith("₹")]
    # strip off the ₹ symbol
    amount_in_rupee = amount_in_rupee.str.lstrip("₹")
    # convert the amount in rupee to USD by the exchange rate of 68.4933
    amount_in_rupee = amount_in_rupee.apply(lambda x: float(x)/exchange_rate)
    # extract values in dollars($)
    amount_in_dollar = amount_column[amount_column.str.startswith("$")]
    # strip off the dollar symbol
    amount_in_dollar = amount_in_dollar.str.lstrip("$")
    # Replace the Unclean columns with the clean one
    amount_column.loc[amount_in_rupee.index] = amount_in_rupee
    amount_column.loc[amount_in_dollar.index] = amount_in_dollar
    
    # convert the clean column to numeric
    amount_column = pd.to_numeric(amount_column,errors="coerce")
    # update the amount column 
    df_2018["amount"] = amount_column




In [21]:
# call the clean_amount function
clean_amount()

In [22]:
# check info about the dataset after cleaning
df_2018.info()


<class 'pandas.core.frame.DataFrame'>
Index: 525 entries, 0 to 525
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   company_name  525 non-null    object 
 1   sector        525 non-null    object 
 2   stage         525 non-null    object 
 3   amount        377 non-null    float64
 4   location      525 non-null    object 
 5   what_it_does  525 non-null    object 
dtypes: float64(1), object(5)
memory usage: 44.9+ KB


In [23]:
# check for the null values present after cleaning the amount column
df_2018.isna().sum()

company_name      0
sector            0
stage             0
amount          148
location          0
what_it_does      0
dtype: int64

In [24]:
# perform descriptive statistics on the cleaned dataset
df_2018.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
company_name,525.0,525.0,TheCollegeFever,1.0,,,,,,,
sector,525.0,405.0,—,30.0,,,,,,,
stage,525.0,21.0,Seed,279.0,,,,,,,
amount,377.0,,,,17616765.176312,77972595.460917,875.99809,500000.0,1300000.0,5000000.0,1000000000.0
location,525.0,50.0,"Bangalore, Karnataka, India",101.0,,,,,,,
what_it_does,525.0,524.0,Algorithmic trading platform.,2.0,,,,,,,


## Observations
- The mean of the amount column (about 17.6 million) is far larger than the median (1.3 million), which suggests outliers in the dataset
- The amount column still contains 148 null values

## Course Of Actions
Fill the NaN values in the amount column: impute them with the median amount of the corresponding funding stage, because the median is robust to the outliers
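
To illustrate why the median is preferred here, the short sketch below compares the mean and the median of a skewed sample. The amounts are made up for illustration and are not taken from the dataset.

```python
import statistics

# illustrative amounts in USD: four modest rounds and one very large one
amounts = [250_000, 500_000, 1_300_000, 5_000_000, 1_000_000_000]

mean_amount = statistics.mean(amounts)
median_amount = statistics.median(amounts)

# the single large round pulls the mean far above a typical value,
# while the median stays at the middle observation
print(f"mean:   {mean_amount:,.0f}")
print(f"median: {median_amount:,.0f}")
```

Imputing with the mean would assign every missing row a value larger than four of the five observations, whereas the median stays close to a typical funding round.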

In [25]:
# Function to fill NaN values in a column with the median of each group (here, the funding stage)
def impute_amount_column(df,filter_name,fill_value):
    # median of fill_value within each filter_name group, aligned row by row
    group_median = df.groupby(filter_name)[fill_value].transform("median")
    # groups with no known values fall back to the overall median
    df[fill_value] = df[fill_value].fillna(group_median).fillna(df[fill_value].median())
    return df
     


In [26]:
# call the impute_amount_column function
df_2018 = impute_amount_column(df_2018,"stage","amount")

In [27]:
df_2018.head()

Unnamed: 0,company_name,sector,stage,amount,location,what_it_does
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,583998.7,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,948997.9,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,583998.7,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [28]:
# check info about the cleaned dataset
df_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 525 entries, 0 to 525
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   company_name  525 non-null    object 
 1   sector        525 non-null    object 
 2   stage         525 non-null    object 
 3   amount        525 non-null    float64
 4   location      525 non-null    object 
 5   what_it_does  525 non-null    object 
dtypes: float64(1), object(5)
memory usage: 44.9+ KB


In [29]:
# Split the sector column into its comma-separated categories
industry_split = df_2018["sector"].str.split(",", expand=True)
industry_split.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,Brand Marketing,Event Promotion,Marketing,Sponsorship,Ticketing,,,,,,,
1,Agriculture,Farming,,,,,,,,,,
2,Credit,Financial Services,Lending,Marketplace,,,,,,,,
3,Financial Services,FinTech,,,,,,,,,,
4,E-Commerce Platforms,Retail,SaaS,,,,,,,,,


In [30]:
# concatenate industry_split with the original df_2018
df_2018 = pd.concat([df_2018,industry_split], ignore_index=False, axis=1)
df_2018.head()

Unnamed: 0,company_name,sector,stage,amount,location,what_it_does,0,1,2,3,4,5,6,7,8,9,10,11
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",Brand Marketing,Event Promotion,Marketing,Sponsorship,Ticketing,,,,,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,583998.7,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,Agriculture,Farming,,,,,,,,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,948997.9,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,Credit,Financial Services,Lending,Marketplace,,,,,,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,Financial Services,FinTech,,,,,,,,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,583998.7,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,E-Commerce Platforms,Retail,SaaS,,,,,,,,,


In [31]:
# drop the unwanted columns
df_2018.drop(columns=["sector",1,2,3,4,5,6,7,8,9,10,11],inplace=True)
df_2018.head()

Unnamed: 0,company_name,stage,amount,location,what_it_does,0
0,TheCollegeFever,Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",Brand Marketing
1,Happy Cow Dairy,Seed,583998.7,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,Agriculture
2,MyLoanCare,Series A,948997.9,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,Credit
3,PayMe India,Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,Financial Services
4,Eunimart,Seed,583998.7,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,E-Commerce Platforms


In [32]:
# Rename the column 0 to sector
df_2018 = df_2018.rename(columns={0:"sector"})
df_2018.head()

Unnamed: 0,company_name,stage,amount,location,what_it_does,sector
0,TheCollegeFever,Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",Brand Marketing
1,Happy Cow Dairy,Seed,583998.7,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,Agriculture
2,MyLoanCare,Series A,948997.9,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,Credit
3,PayMe India,Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,Financial Services
4,Eunimart,Seed,583998.7,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,E-Commerce Platforms
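
The split, concatenate, drop, and rename steps above could also be condensed into one chained expression that keeps only the first listed category. This is a sketch of an alternative, shown on toy data, not a replacement for the cells above.

```python
import pandas as pd

# toy data (illustrative values only)
toy = pd.DataFrame({
    "sector": [
        "Brand Marketing, Event Promotion, Marketing",
        "Agriculture, Farming",
        "—",
    ]
})

# split on commas, keep the first category, and trim surrounding whitespace
toy["sector"] = toy["sector"].str.split(",").str[0].str.strip()
print(toy["sector"].tolist())
```

Working in place avoids creating and then dropping the intermediate integer-named columns.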


In [33]:
# add founded column to the dataset
df_2018["founded"] = 2018
df_2018.head()

Unnamed: 0,company_name,stage,amount,location,what_it_does,sector,founded
0,TheCollegeFever,Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",Brand Marketing,2018
1,Happy Cow Dairy,Seed,583998.7,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,Agriculture,2018
2,MyLoanCare,Series A,948997.9,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,Credit,2018
3,PayMe India,Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,Financial Services,2018
4,Eunimart,Seed,583998.7,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,E-Commerce Platforms,2018


In [34]:
# check info about the new dataset
df_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 525 entries, 0 to 525
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   company_name  525 non-null    object 
 1   stage         525 non-null    object 
 2   amount        525 non-null    float64
 3   location      525 non-null    object 
 4   what_it_does  525 non-null    object 
 5   sector        525 non-null    object 
 6   founded       525 non-null    int64  
dtypes: float64(1), int64(1), object(5)
memory usage: 49.0+ KB


In [35]:
# look for unique values in the sector column
df_2018["sector"].sort_values(ascending=True).unique()


array(['3D Printing', 'Accounting', 'Advertising', 'Aerospace', 'AgTech',
       'Agriculture', 'Air Transportation', 'Alternative Medicine',
       'Analytics', 'Android', 'Apps', 'Artificial Intelligence', 'Audio',
       'Automotive', 'Autonomous Vehicles', 'B2B', 'Banking',
       'Basketball', 'Battery', 'Beauty', 'Big Data', 'Biopharma',
       'Biotechnology', 'Blockchain', 'Brand Marketing', 'Broadcasting',
       'Business Development', 'Business Intelligence', 'Business Travel',
       'Career Planning', 'Catering', 'Child Care', 'Children',
       'Classifieds', 'Clean Energy', 'CleanTech', 'Cloud Computing',
       'Cloud Infrastructure', 'Collaboration', 'Commercial',
       'Commercial Real Estate', 'Communities', 'Computer', 'Consulting',
       'Consumer', 'Consumer Applications', 'Consumer Electronics',
       'Consumer Goods', 'Consumer Lending', 'Continuing Education',
       'Cooking', 'Cosmetics', 'Creative Agency', 'Credit',
       'Credit Cards', 'Crowdfunding', 

## Restructuring the Sector Column into Key Sectors of the Economy
source: https://www.businessinsider.in/business/startups/news/top-10-industries-for-new-startups-in-india-as-per-hurun-list/articleshow/105651758.cms


In [43]:
sector_mapping = {
    '3D Printing': 'IT & Technology',
    'Accounting': 'Business Services',
    'Advertising': 'Media & Entertainment',
    'Aerospace': 'Manufacturing',
    'AgTech': 'Agriculture',
    'Agriculture': 'Agriculture',
    'Air Transportation': 'Transportation & Logistics',
    'Alternative Medicine': 'Healthcare',
    'Analytics': 'IT & Technology',
    'Android': 'IT & Technology',
    'Apps': 'IT & Technology',
    'Artificial Intelligence': 'IT & Technology',
    'Audio': 'Media & Entertainment',
    'Automotive': 'Manufacturing',
    'Autonomous Vehicles': 'Manufacturing',
    'B2B': 'Business Services',
    'Banking': 'Financial Services',
    'Basketball': 'Sports',
    'Battery': 'IT & Technology',
    'Beauty': 'Consumer Goods',
    'Big Data': 'IT & Technology',
    'Biopharma': 'Healthcare',
    'Biotechnology': 'Healthcare',
    'Blockchain': 'IT & Technology',
    'Brand Marketing': 'Media & Entertainment',
    'Broadcasting': 'Media & Entertainment',
    'Business Development': 'Business Services',
    'Business Intelligence': 'Business Services',
    'Business Travel': 'Business Services',
    'Career Planning': 'Business Services',
    'Catering': 'Hospitality',
    'Child Care': 'Business Services',
    'Children': 'Consumer Goods',
    'Classifieds': 'Media & Entertainment',
    'Clean Energy': 'IT & Technology',
    'CleanTech': 'IT & Technology',
    'Cloud Computing': 'IT & Technology',
    'Cloud Infrastructure': 'IT & Technology',
    'Collaboration': 'IT & Technology',
    'Commercial': 'Real Estate',
    'Commercial Real Estate': 'Real Estate',
    'Communities': 'IT & Technology',
    'Computer': 'IT & Technology',
    'Consulting': 'Business Services',
    'Consumer': 'Consumer Goods',
    'Consumer Applications': 'IT & Technology',
    'Consumer Electronics': 'Consumer Goods',
    'Consumer Goods': 'Consumer Goods',
    'Consumer Lending': 'Financial Services',
    'Continuing Education': 'Education',
    'Cooking': 'Hospitality',
    'Cosmetics': 'Consumer Goods',
    'Creative Agency': 'Business Services',
    'Credit': 'Financial Services',
    'Credit Cards': 'Financial Services',
    'Crowdfunding': 'Financial Services',
    'Crowdsourcing': 'Business Services',
    'Cryptocurrency': 'Financial Services',
    'Customer Service': 'Business Services',
    'Dating': 'Media & Entertainment',
    'Delivery': 'Transportation & Logistics',
    'Delivery Service': 'Transportation & Logistics',
    'Dental': 'Healthcare',
    'Dietary Supplements': 'Healthcare',
    'Digital Entertainment': 'Media & Entertainment',
    'Digital Marketing': 'Media & Entertainment',
    'Digital Media': 'Media & Entertainment',
    'E-Commerce': 'Retail',
    'E-Commerce Platforms': 'Retail',
    'E-Learning': 'Education',
    'EdTech': 'Education',
    'Education': 'Education',
    'Electric Vehicle': 'Transportation & Logistics',
    'Embedded Systems': 'IT & Technology',
    'Energy': 'IT & Technology',
    'Enterprise Resource Planning (ERP)': 'IT & Technology',
    'Enterprise Software': 'IT & Technology',
    'Environmental Consulting': 'Business Services',
    'Events': 'Hospitality',
    'Eyewear': 'Consumer Goods',
    'Facilities Support Services': 'Business Services',
    'Fantasy Sports': 'Sports',
    'Farming': 'Agriculture',
    'Fashion': 'Consumer Goods',
    'File Sharing': 'IT & Technology',
    'FinTech': 'Financial Services',
    'Finance': 'Financial Services',
    'Financial Services': 'Financial Services',
    'Fitness': 'Sports',
    'Food Delivery': 'Hospitality',
    'Food Processing': 'Consumer Goods',
    'Food and Beverage': 'Consumer Goods',
    'Fraud Detection': 'Financial Services',
    'Funding Platform': 'Financial Services',
    'Gaming': 'Media & Entertainment',
    'Government': 'Business Services',
    'Health Care': 'Healthcare',
    'Health Diagnostics': 'Healthcare',
    'Health Insurance': 'Financial Services',
    'Home Decor': 'Consumer Goods',
    'Hospital': 'Healthcare',
    'Hospitality': 'Hospitality',
    'Human Resources': 'Business Services',
    'Industrial': 'Manufacturing',
    'Industrial Automation': 'Manufacturing',
    'Information Services': 'IT & Technology',
    'Information Technology': 'IT & Technology',
    'Insurance': 'Financial Services',
    'Internet': 'IT & Technology',
    'Internet of Things': 'IT & Technology',
    'Last Mile Transportation': 'Transportation & Logistics',
    'Logistics': 'Transportation & Logistics',
    'Manufacturing': 'Manufacturing',
    'Market Research': 'Business Services',
    'Marketing': 'Media & Entertainment',
    'Marketplace': 'Retail',
    'Media and Entertainment': 'Media & Entertainment',
    'Medical': 'Healthcare',
    'Medical Device': 'Healthcare',
    'Mobile': 'IT & Technology',
    'Mobile Payments': 'Financial Services',
    'Music': 'Media & Entertainment',
    'Music Streaming': 'Media & Entertainment',
    'Nanotechnology': 'IT & Technology',
    'News': 'Media & Entertainment',
    'Online Games': 'Media & Entertainment',
    'Online Portals': 'Media & Entertainment',
    'Packaging Services': 'Manufacturing',
    'Reading Apps': 'Media & Entertainment',
    'Renewable Energy': 'IT & Technology',
    'Rental': 'Real Estate',
    'Retail': 'Retail',
    'Search Engine': 'IT & Technology',
    'Smart Cities': 'IT & Technology',
    'Social Media': 'Media & Entertainment',
    'Software': 'IT & Technology',
    'Sports': 'Sports',
    'Tourism': 'Hospitality',
    'Trading Platform': 'Financial Services',
    'Training': 'Education',
    'Transportation': 'Transportation & Logistics',
    'Travel': 'Hospitality',
    'Veterinary': 'Healthcare',
    'Wealth Management': 'Financial Services',
    'Wedding': 'Hospitality',
    'Wellness': 'Healthcare',
    'eSports': 'Sports',
    '—': 'Others'
}


In [51]:
# Replacing the sector column values with the new sectors
df_2018["sector"]= df_2018["sector"].replace(sector_mapping)
df_2018.head()

Unnamed: 0,company_name,stage,amount,location,what_it_does,sector,founded
0,TheCollegeFever,Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",Media & Entertainment,2018
1,Happy Cow Dairy,Seed,583998.7,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,Agriculture,2018
2,MyLoanCare,Series A,948997.9,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,Financial Services,2018
3,PayMe India,Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,Financial Services,2018
4,Eunimart,Seed,583998.7,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,Retail,2018
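
`Series.replace` leaves any label that is not a key of the mapping unchanged, so it can be useful to list the labels the mapping does not cover before relying on the result. The sketch below uses made-up toy data and a three-entry excerpt of the mapping (`raw_sectors` and `toy_mapping` are illustrative names, not objects from the notebook):

```python
import pandas as pd

# Hypothetical toy data standing in for df_2018["sector"] before the replacement
raw_sectors = pd.Series(["Music", "Travel", "Drones", "Software"])

# Small excerpt of the mapping; the real sector_mapping is much larger
toy_mapping = {
    "Music": "Media & Entertainment",
    "Travel": "Hospitality",
    "Software": "IT & Technology",
}

# Labels with no entry in the mapping pass through .replace() unchanged
unmapped = sorted(set(raw_sectors) - set(toy_mapping))
cleaned = raw_sectors.replace(toy_mapping)

print(unmapped)
print(cleaned.tolist())
```

Any label reported in `unmapped` would keep its raw value after the replacement and show up as its own sector in later counts.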


In [52]:
df_2018["sector"].value_counts()

sector
IT & Technology               116
Financial Services             77
Healthcare                     45
Consumer Goods                 44
Business Services              36
Media & Entertainment          32
Education                      31
Others                         30
Manufacturing                  29
Retail                         26
Hospitality                    19
Transportation & Logistics     14
Sports                         12
Agriculture                    10
Real Estate                     4
Name: count, dtype: int64

In [56]:
others_cat = df_2018[df_2018["sector"] == "Others" ]
others_cat

Unnamed: 0,company_name,stage,amount,location,what_it_does,sector,founded
58,MissMalini Entertainment,Seed,1518397.0,"Mumbai, Maharashtra, India",MissMalini Entertainment is a multi-platform n...,Others,2018
105,Jagaran Microfin,Debt Financing,8029982.0,"Kolkata, West Bengal, India",Jagaran Microfin is a Microfinance institution...,Others,2018
121,FLEECA,Seed,583998.7,"Jaipur, Rajasthan, India",FLEECA is a Tyre Care Provider company.,Others,2018
146,WheelsEMI,Series B,14000000.0,"Pune, Maharashtra, India","WheelsEMI is the brand name of NBFC, WheelsEMI...",Others,2018
153,Fric Bergen,Venture - Series Unknown,583998.7,"Alwar, Rajasthan, India",Fric Bergen is a leader in the specialty food ...,Others,2018
174,Deftouch,Seed,583998.7,"Bangalore, Karnataka, India",Deftouch is a mobile game development company ...,Others,2018
181,Corefactors,Seed,583998.7,"Bangalore, Karnataka, India","Corefactors is a leading campaign management, ...",Others,2018
210,Cell Propulsion,Seed,102199.8,"Bangalore, Karnataka, India",Cell Propulsion is an electric mobility startu...,Others,2018
230,Flathalt,Angel,50000.0,"Gurgaon, Haryana, India",FInd your Customized Home here.,Others,2018
235,dishq,Seed,400000.0,"Bengaluru, Karnataka, India",dishq leverages food science and machine learn...,Others,2018


## Recategorizing the "Others" Values into Their Appropriate Sectors

In [55]:
# Inspect the descriptions of the companies currently labeled "Others"
others_cat["what_it_does"]

58     MissMalini Entertainment is a multi-platform n...
105    Jagaran Microfin is a Microfinance institution...
121              FLEECA is a Tyre Care Provider company.
146    WheelsEMI is the brand name of NBFC, WheelsEMI...
153    Fric Bergen is a leader in the specialty food ...
174    Deftouch is a mobile game development company ...
181    Corefactors is a leading campaign management, ...
210    Cell Propulsion is an electric mobility startu...
230                      FInd your Customized Home here.
235    dishq leverages food science and machine learn...
238    Trell is a location based network which helps ...
242          New Apartments, Flats for Sale in Bangalore
243    It is a fabless semiconductor company focused ...
247    SaffronStays connects travellers to India's In...
251    Inner Being Wellness manufactures beauty, well...
257    SEO, PPC, Search Engine Marketing, Social Medi...
258                             Digital Marketing Agency
259    Scale Labs is a cross bo

In [59]:
keywords = {
    'entertainment': 'Media & Entertainment',
    'microfinance': 'Financial Services',
    'tyre care': 'Consumer Goods',
    'nbfc': 'Financial Services',
    'specialty food': 'Consumer Goods',
    'mobile game development': 'IT & Technology',
    'campaign management': 'Business Services',
    'electric mobility startup': 'Transportation & Logistics',
    'food science': 'Consumer Goods',
    'machine learning': 'IT & Technology',
    'location based network': 'IT & Technology',
    'real estate': 'Real Estate',
    'semiconductor company': 'IT & Technology',
    'travellers accommodation': 'Hospitality',
    'beauty, wellness': 'Wellness',
    'search engine marketing': 'IT & Technology',
    'digital marketing agency': 'Business Services',
    'cross border e-commerce solutions': 'Business Services',
    'wealth management platform': 'Financial Services',
    'micro-event & contextual marketing': 'Business Services',
    'partners with small and medium businesses': 'Financial Services',
    'celebrate and reward': 'Financial Services',
    'post-harvest management': 'Business Services',
    'cyber security': 'IT & Technology',
    'cosmetics brand': 'Consumer Goods',
    'activity discovery & booking platform': 'Hospitality',
    'edutech': 'Education'
}


In [64]:

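# Function to assign a sector based on the company description
def assign_sector(what_it_does):
    """Return the sector of the first keyword found in the description."""
    description = str(what_it_does).lower()  # str() guards against missing values
    for keyword, sector in keywords.items():
        if keyword in description:
            return sector
    return "Others"  # keep "Others" if no keyword matches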
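
To see how the first-match rule behaves, the sketch below runs the same kind of lookup on a few descriptions. It is self-contained on purpose: `sample_keywords` is a three-entry excerpt of the dictionary, `assign_sector_demo` is a local stand-in for the function, and the example strings are invented rather than taken from the dataset.

```python
# Excerpt of the keywords dictionary, in the same relative order as above
sample_keywords = {
    "microfinance": "Financial Services",
    "food science": "Consumer Goods",
    "machine learning": "IT & Technology",
}


def assign_sector_demo(description, keyword_map=sample_keywords):
    """Return the sector of the first keyword contained in the description."""
    text = str(description).lower()
    for keyword, sector in keyword_map.items():
        if keyword in text:
            return sector
    return "Others"


# Hypothetical descriptions, not rows from the dataset
examples = [
    "A Microfinance institution serving rural borrowers",
    "Uses food science and machine learning to recommend dishes",
    "Homestays for travellers",
]
results = [assign_sector_demo(text) for text in examples]
print(results)
```

Because dictionaries preserve insertion order, a description that contains several keywords is assigned the sector of whichever keyword appears first in the dictionary, so the order of the entries matters.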


In [61]:
# Update sectors for entries currently labeled as "Others"
others_mask = df_2018["sector"] == "Others"
df_2018.loc[others_mask, "sector"] = df_2018.loc[others_mask, "what_it_does"].apply(assign_sector)


In [65]:
# confirm changes
df_2018["sector"].value_counts()

sector
IT & Technology               121
Financial Services             82
Consumer Goods                 48
Healthcare                     45
Business Services              40
Media & Entertainment          33
Education                      32
Manufacturing                  29
Retail                         26
Hospitality                    20
Transportation & Logistics     15
Sports                         12
Agriculture                    10
Others                          7
Real Estate                     4
Wellness                        1
Name: count, dtype: int64
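
The counts above include a one-row `Wellness` sector. It comes from the `'beauty, wellness'` keyword, whereas `sector_mapping` sends `Wellness` to `Healthcare`. If a single consistent set of sectors is wanted, one option (an assumption here, not a step shown in the notebook) is to fold the stray label back in. A minimal sketch on toy data:

```python
import pandas as pd

# Toy stand-in for df_2018["sector"] after the keyword pass
sectors = pd.Series(["Healthcare", "Wellness", "Others", "Healthcare"])

# Fold the stray label into the sector that sector_mapping uses for it
sectors = sectors.replace({"Wellness": "Healthcare"})

print(sectors.value_counts())
```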

## Cleaning the Location Column

### Observations From the Column
- Most values in the column follow the city, state, country format

### Tasks to Perform
- Split the column values by comma
- Drop the state and country columns
- Retain the city column and rename it headquarter, in line with the other three datasets
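
As a rough sketch of these tasks on made-up values (the `toy` frame is a stand-in for `df_2018`), the city can also be taken directly as the first comma-separated piece. Stripping whitespace is an extra step that the task list does not mention.

```python
import pandas as pd

# Toy frame with the same "city, state, country" layout
toy = pd.DataFrame({
    "location": [
        "Bangalore, Karnataka, India",
        "Mumbai, Maharashtra, India",
    ]
})

# Split on commas, keep the first piece (the city), and trim stray spaces
toy["headquarter"] = toy["location"].str.split(",").str[0].str.strip()

print(toy["headquarter"].tolist())
```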

In [66]:
# look for unique values in the location column
df_2018["location"].unique()

array(['Bangalore, Karnataka, India', 'Mumbai, Maharashtra, India',
       'Gurgaon, Haryana, India', 'Noida, Uttar Pradesh, India',
       'Hyderabad, Andhra Pradesh, India', 'Bengaluru, Karnataka, India',
       'Kalkaji, Delhi, India', 'Delhi, Delhi, India', 'India, Asia',
       'Hubli, Karnataka, India', 'New Delhi, Delhi, India',
       'Chennai, Tamil Nadu, India', 'Mohali, Punjab, India',
       'Kolkata, West Bengal, India', 'Pune, Maharashtra, India',
       'Jodhpur, Rajasthan, India', 'Kanpur, Uttar Pradesh, India',
       'Ahmedabad, Gujarat, India', 'Azadpur, Delhi, India',
       'Haryana, Haryana, India', 'Cochin, Kerala, India',
       'Faridabad, Haryana, India', 'Jaipur, Rajasthan, India',
       'Kota, Rajasthan, India', 'Anand, Gujarat, India',
       'Bangalore City, Karnataka, India', 'Belgaum, Karnataka, India',
       'Thane, Maharashtra, India', 'Margão, Goa, India',
       'Indore, Madhya Pradesh, India', 'Alwar, Rajasthan, India',
       'Kannur, Kerala, Ind

In [67]:
# Split the location column into city (headquarter), state and country
df_2018[["headquarter", "state", "country"]] = df_2018["location"].str.split(",", expand=True)
df_2018.head()

Unnamed: 0,company_name,stage,amount,location,what_it_does,sector,founded,headquarter,state,country
0,TheCollegeFever,Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",Media & Entertainment,2018,Bangalore,Karnataka,India
1,Happy Cow Dairy,Seed,583998.7,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,Agriculture,2018,Mumbai,Maharashtra,India
2,MyLoanCare,Series A,948997.9,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,Financial Services,2018,Gurgaon,Haryana,India
3,PayMe India,Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,Financial Services,2018,Noida,Uttar Pradesh,India
4,Eunimart,Seed,583998.7,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,Retail,2018,Hyderabad,Andhra Pradesh,India


In [68]:
# check for the unique cities
df_2018["headquarter"].unique()

array(['Bangalore', 'Mumbai', 'Gurgaon', 'Noida', 'Hyderabad',
       'Bengaluru', 'Kalkaji', 'Delhi', 'India', 'Hubli', 'New Delhi',
       'Chennai', 'Mohali', 'Kolkata', 'Pune', 'Jodhpur', 'Kanpur',
       'Ahmedabad', 'Azadpur', 'Haryana', 'Cochin', 'Faridabad', 'Jaipur',
       'Kota', 'Anand', 'Bangalore City', 'Belgaum', 'Thane', 'Margão',
       'Indore', 'Alwar', 'Kannur', 'Trivandrum', 'Ernakulam',
       'Kormangala', 'Uttar Pradesh', 'Andheri', 'Mylapore', 'Ghaziabad',
       'Kochi', 'Powai', 'Guntur', 'Kalpakkam', 'Bhopal', 'Coimbatore',
       'Worli', 'Alleppey', 'Chandigarh', 'Guindy', 'Lucknow'],
      dtype=object)
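
The list contains several spellings of the same city, for example `Bangalore`, `Bengaluru` and `Bangalore City`, or `Cochin` and `Kochi`. This section of the notebook does not merge them. A minimal sketch of one way to do so follows; the alias map is hand-written and purely an assumption, as is the choice of which spelling to keep.

```python
import pandas as pd

# A few of the spellings that appear in the output above
cities = pd.Series(["Bangalore", "Bengaluru", "Bangalore City", "Cochin", "Kochi"])

# Hypothetical alias map; which spelling to keep is a judgment call
city_aliases = {
    "Bengaluru": "Bangalore",
    "Bangalore City": "Bangalore",
    "Cochin": "Kochi",
}
cities = cities.replace(city_aliases)

print(cities.unique())
```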

In [69]:
# check for the unique states in the dataset
df_2018["state"].unique()

array([' Karnataka', ' Maharashtra', ' Haryana', ' Uttar Pradesh',
       ' Andhra Pradesh', ' Delhi', ' Asia', ' Tamil Nadu', ' Punjab',
       ' West Bengal', ' Rajasthan', ' Gujarat', ' Kerala', ' Goa',
       ' Madhya Pradesh', ' India', ' Assam', ' Chandigarh'], dtype=object)

In [70]:
# check for the unique values in the country column
df_2018["country"].unique()

array([' India', None, ' Asia'], dtype=object)
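
The `None` in the country column comes from values with fewer than three comma-separated parts: `str.split(..., expand=True)` pads the missing trailing pieces with a missing value. A small sketch with two of the location strings shown earlier:

```python
import pandas as pd

locations = pd.Series(["Bangalore, Karnataka, India", "India, Asia"])

# expand=True returns one column per piece; short rows are padded at the end
parts = locations.str.split(",", expand=True)
parts.columns = ["headquarter", "state", "country"]

print(parts)
```

For `'India, Asia'` the pieces also shift left, which is why `India` appears among the cities and ` Asia` (with its leading space) among the states.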

In [71]:
# drop the country column
df_2018.drop(columns=["country"],inplace=True)
df_2018.head()

Unnamed: 0,company_name,stage,amount,location,what_it_does,sector,founded,headquarter,state
0,TheCollegeFever,Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",Media & Entertainment,2018,Bangalore,Karnataka
1,Happy Cow Dairy,Seed,583998.7,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,Agriculture,2018,Mumbai,Maharashtra
2,MyLoanCare,Series A,948997.9,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,Financial Services,2018,Gurgaon,Haryana
3,PayMe India,Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,Financial Services,2018,Noida,Uttar Pradesh
4,Eunimart,Seed,583998.7,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,Retail,2018,Hyderabad,Andhra Pradesh


In [72]:
# Replace state names that ended up in the headquarter column
# (Haryana, Uttar Pradesh) with their capital cities
df_2018["headquarter"] = df_2018["headquarter"].replace(
    {"Haryana": "Chandigarh", "Uttar Pradesh": "Lucknow"}
)
df_2018.head()

Unnamed: 0,company_name,stage,amount,location,what_it_does,sector,founded,headquarter,state
0,TheCollegeFever,Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",Media & Entertainment,2018,Bangalore,Karnataka
1,Happy Cow Dairy,Seed,583998.7,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,Agriculture,2018,Mumbai,Maharashtra
2,MyLoanCare,Series A,948997.9,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,Financial Services,2018,Gurgaon,Haryana
3,PayMe India,Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,Financial Services,2018,Noida,Uttar Pradesh
4,Eunimart,Seed,583998.7,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,Retail,2018,Hyderabad,Andhra Pradesh


In [73]:
# drop the state column
df_2018.drop(columns=["state"], inplace=True)

In [74]:
# confirm changes
df_2018.head()


Unnamed: 0,company_name,stage,amount,location,what_it_does,sector,founded,headquarter
0,TheCollegeFever,Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",Media & Entertainment,2018,Bangalore
1,Happy Cow Dairy,Seed,583998.7,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,Agriculture,2018,Mumbai
2,MyLoanCare,Series A,948997.9,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,Financial Services,2018,Gurgaon
3,PayMe India,Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,Financial Services,2018,Noida
4,Eunimart,Seed,583998.7,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,Retail,2018,Hyderabad
