# Introduction

For this project, we seek to analyse funding received by startups in India. The aim is to prescribe the best course of action for a startup looking into the Indian business ecosystem. Our first step will be to gain business understanding of the problem.

# Business Understanding

India has become an attractive location for investors and has seen a number of successful startups achieve the coveted "unicorn" status. To guide our quest for the best course of action as an upcoming startup, we asked a few questions which we will attempt to answer using the data on hand.

### Questions

- Does the age of the startup affect the funding received?
- Which sectors received the most funding?
- Does the number of founders affect the funding received?
- At what stage do startups receive the most funding?
- Does the location affect the funding received?

### Hypothesis 

##### NULL: Technological industries do not have a higher success rate of being funded 

##### ALTERNATE: Technological industries have a higher success rate of being funded
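One way this hypothesis could be examined is a chi-squared test of independence, using `chi2_contingency` from SciPy. The sketch below is only an illustration: the counts are invented, the split into "technology" and "other" sectors is an assumption, and so is the 5% significance level. Note that the test is non-directional, so the direction of any difference still has to be read from the observed proportions.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts, invented purely for illustration (not taken from the datasets).
# Rows: technology sectors, other sectors. Columns: funded, not funded.
observed = np.array([
    [120, 30],
    [90, 60],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}, reject null: {reject_null}")
```

With real data, the contingency table would be built from the cleaned Sector column once all four years are prepared.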

## Setup

## Importation
This section imports all the packages and libraries used throughout this notebook.

In [2]:
# Data handling
import numpy as np 
import pandas as pd 

# Visualization (Matplotlib, Plotly, Seaborn, etc. )
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt 
%matplotlib inline 
import seaborn as sns 
sns.set_style('whitegrid')
plt.style.use("fivethirtyeight")



import plotly.express as px

# EDA and statistical tests
from scipy import stats
from scipy.stats import pearsonr
from scipy.stats import chi2_contingency



# Data Loading
This section loads the datasets and any additional files.

#### Load 2018 Data  

In [3]:
# For CSV, use pandas.read_csv

#import the 2018 dataset 
#select specific columns 
startup_funding_2018 = pd.read_csv('startup_funding2018.csv', 
                                   usecols = ['Company Name','Industry','Round/Series','Amount','Location'])

# rename the columns for consistency 

#industry --> sector 
#Round/Series --> stage 
startup_funding_2018.rename(columns = {'Industry':'Sector'}, inplace = True)

startup_funding_2018.rename(columns = {'Round/Series':'Stage'}, inplace = True)

# Add the funding year as a column 

startup_funding_2018['Funding Year'] = "2018"

#Change the funding year to integer type 

startup_funding_2018['Funding Year'] = startup_funding_2018['Funding Year'].astype(int)

#### 2018 Data Exploration & Cleaning

In [4]:
#check shape of dataset
startup_funding_2018.shape

(526, 6)

In [5]:
#inspect dataset
startup_funding_2018.head()

Unnamed: 0,Company Name,Sector,Stage,Amount,Location,Funding Year
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India",2018
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",2018
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",2018
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",2018
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",2018


In [6]:
#check for null values
startup_funding_2018.isna().any()

Company Name    False
Sector          False
Stage           False
Amount          False
Location        False
Funding Year    False
dtype: bool

In [7]:
#Strip the location data to only the city-area. 
startup_funding_2018['Location'] = startup_funding_2018.Location.str.split(',').str[0]
startup_funding_2018['Location'].head()

0    Bangalore
1       Mumbai
2      Gurgaon
3        Noida
4    Hyderabad
Name: Location, dtype: object
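The same city can appear under more than one spelling (the 2018 data contains both "Bangalore" and "Bengaluru"). A minimal sketch of one way to harmonize such names follows; the alias map and the choice of canonical spelling are assumptions, and the toy frame only mimics the format of the Location column.

```python
import pandas as pd

# Toy data mirroring the 'Location' format of the 2018 file.
df = pd.DataFrame({
    "Location": [
        "Bangalore, Karnataka, India",
        "Bengaluru, Karnataka, India",
        "Gurgaon, Haryana, India",
    ]
})

# Keep only the first comma-separated part (the city) and trim whitespace.
df["Location"] = df["Location"].str.split(",").str[0].str.strip()

# Hypothetical alias map; extend it after inspecting df["Location"].unique().
city_aliases = {"Bengaluru": "Bangalore", "Gurgaon": "Gurugram"}
df["Location"] = df["Location"].replace(city_aliases)

print(df["Location"].tolist())
```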

In [8]:
#Save the index of the rows whose 'Amount' is quoted in rupees; it is needed later to convert those amounts to dollars
get_index = startup_funding_2018.index[startup_funding_2018['Amount'].str.contains('₹')]

In [9]:
#Check the summary information about the 2018 dataset 
startup_funding_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Company Name  526 non-null    object
 1   Sector        526 non-null    object
 2   Stage         526 non-null    object
 3   Amount        526 non-null    object
 4   Location      526 non-null    object
 5   Funding Year  526 non-null    int64 
dtypes: int64(1), object(5)
memory usage: 24.8+ KB


In [10]:
#Before the column can be converted to a numeric type, the currency symbols and commas have to be removed

startup_funding_2018['Amount'] = startup_funding_2018['Amount'].apply(lambda x:str(x).replace('₹', ''))

startup_funding_2018['Amount'] = startup_funding_2018['Amount'].apply(lambda x:str(x).replace('$', ''))

startup_funding_2018['Amount'] = startup_funding_2018['Amount'].apply(lambda x:str(x).replace(',', ''))

#startup_funding_2018['Amount'] = startup_funding_2018['Amount'].apply(lambda x:str(x).replace('—', '0'))

startup_funding_2018['Amount'] = startup_funding_2018['Amount'].replace('—', np.nan)


In [11]:
#converting the Amount values to numeric type, any value which cannot be converted will be changed to NaN

startup_funding_2018['Amount'] = pd.to_numeric(startup_funding_2018['Amount'], errors='coerce')
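Since the same symbol stripping and numeric conversion is repeated for each year, it could be wrapped in a small helper. The function below is a sketch under that assumption; `clean_amount` is a hypothetical name, and the sample values are made up to resemble the raw Amount column.

```python
import pandas as pd

def clean_amount(series: pd.Series) -> pd.Series:
    """Strip currency symbols and commas, then convert to float (NaN if not numeric)."""
    cleaned = (
        series.astype(str)
        .str.replace(r"[₹$,]", "", regex=True)  # inside [...] the $ is a literal character
        .str.strip()
    )
    return pd.to_numeric(cleaned, errors="coerce")

# "\u2014" is the dash placeholder the 2018 file uses for missing amounts.
raw = pd.Series(["250000", "₹40,000,000", "$1,500", "\u2014", "Undisclosed"])
amounts = clean_amount(raw)
print(amounts.tolist())
```

Because `errors="coerce"` turns every non-numeric string into NaN, placeholders such as the dash or "Undisclosed" do not need a separate replacement step in this variant.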

In [12]:
#Check the final dataset information. 
startup_funding_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company Name  526 non-null    object 
 1   Sector        526 non-null    object 
 2   Stage         526 non-null    object 
 3   Amount        378 non-null    float64
 4   Location      526 non-null    object 
 5   Funding Year  526 non-null    int64  
dtypes: float64(1), int64(1), object(4)
memory usage: 24.8+ KB


In [13]:
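#Convert the rows with rupees to dollars
#Multiply the rupee values in the Amount column by the conversion rate 
#Note: despite its name, the rate below converts rupees to dollars (1 rupee is taken as 0.012 dollars)

dollarToRupeeConversionRate = 0.012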
startup_funding_2018.loc[get_index,['Amount']]=startup_funding_2018.loc[get_index,['Amount']].values*dollarToRupeeConversionRate

startup_funding_2018.loc[:,['Amount']].head()


Unnamed: 0,Amount
0,250000.0
1,480000.0
2,780000.0
3,2000000.0
4,
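The conversion relies on remembering which rows were in rupees before the symbol was stripped. A compact sketch of the same idea with a boolean mask is shown here; the constant name `RUPEE_TO_DOLLAR` is hypothetical, the rate is the one used in this notebook, and the toy amounts copy the format of the 2018 file.

```python
import pandas as pd

RUPEE_TO_DOLLAR = 0.012  # same rate as in the notebook; an approximation

df = pd.DataFrame({"Amount": ["250000", "₹40,000,000", "₹65,000,000", "2000000"]})

# Record which rows are in rupees before the symbol is stripped.
is_rupee = df["Amount"].str.contains("₹", na=False)

# Strip symbols and commas, then convert to numbers.
df["Amount"] = pd.to_numeric(
    df["Amount"].str.replace(r"[₹$,]", "", regex=True), errors="coerce"
)

# Convert only the rupee rows; dollar rows are left untouched.
df.loc[is_rupee, "Amount"] = df.loc[is_rupee, "Amount"] * RUPEE_TO_DOLLAR
print(df["Amount"].tolist())
```

The mask must be computed before the symbols are removed, otherwise the rupee rows can no longer be told apart from the dollar rows.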


In [14]:
#print the first 50 rows of the dataset 
startup_funding_2018.head(50)

Unnamed: 0,Company Name,Sector,Stage,Amount,Location,Funding Year
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000.0,Bangalore,2018
1,Happy Cow Dairy,"Agriculture, Farming",Seed,480000.0,Mumbai,2018
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,780000.0,Gurgaon,2018
3,PayMe India,"Financial Services, FinTech",Angel,2000000.0,Noida,2018
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,,Hyderabad,2018
5,Hasura,"Cloud Infrastructure, PaaS, SaaS",Seed,1600000.0,Bengaluru,2018
6,Tripshelf,"Internet, Leisure, Marketplace",Seed,192000.0,Kalkaji,2018
7,Hyperdata.IO,Market Research,Angel,600000.0,Hyderabad,2018
8,Freightwalla,"Information Services, Information Technology",Seed,,Mumbai,2018
9,Microchip Payments,Mobile Payments,Seed,,Bangalore,2018


In [15]:
startup_funding_2018.loc[(178)]

Company Name                                       BuyForexOnline
Sector                                                     Travel
Stage           https://docs.google.com/spreadsheets/d/1x9ziNe...
Amount                                                  2000000.0
Location                                                Bangalore
Funding Year                                                 2018
Name: 178, dtype: object

In [17]:
startup_funding_2018.loc[178, ['Stage']] = ['']

startup_funding_2018['Stage'] = startup_funding_2018['Stage'].apply(lambda x:str(x).replace('Undisclosed', ''))

startup_funding_2018.loc[(178)]


Company Name    BuyForexOnline
Sector                  Travel
Stage                         
Amount               2000000.0
Location             Bangalore
Funding Year              2018
Name: 178, dtype: object

In [18]:
#find duplicates 
duplicate = startup_funding_2018[startup_funding_2018.duplicated()]

duplicate

Unnamed: 0,Company Name,Sector,Stage,Amount,Location,Funding Year
348,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000.0,Bangalore,2018


In [19]:
#drop duplicates 

startup_funding_2018 = startup_funding_2018.drop_duplicates(keep='first')


#### Load 2019 Data  

In [None]:
#import the 2019 dataset 
#select specific columns 
 
startup_funding_2019 = pd.read_csv('startup_funding2019.csv', usecols = ['Company/Brand','Founded','HeadQuarter','Sector','Investor','Amount($)','Stage'])

# rename the columns for consistency 

#Company/Brand  --> Company Name 
#HeadQuarter --> Location 
#Amount($)  --> Amount 

startup_funding_2019.rename(columns = {'Company/Brand':'Company Name'}, inplace = True)

startup_funding_2019.rename(columns = {'HeadQuarter':'Location'}, inplace = True)

startup_funding_2019.rename(columns = {'Amount($)':'Amount'}, inplace = True)

# Add the funding year as a column 

startup_funding_2019['Funding Year'] = "2019"

#Change the funding year to integer type

startup_funding_2019['Funding Year'] = startup_funding_2019['Funding Year'].astype(int)
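The 2019, 2020 and 2021 files share the same column names, so the load, rename and year-tagging steps could be collected in one function. The sketch below assumes that shared layout; `load_funding_year` and `RENAME_MAP` are hypothetical names, and the CSV content is a made-up single row used only so the example runs without the real files.

```python
import io
import pandas as pd

# Column names shared by the 2019, 2020 and 2021 files, mapped to the common schema.
RENAME_MAP = {
    "Company/Brand": "Company Name",
    "HeadQuarter": "Location",
    "Amount($)": "Amount",
}

def load_funding_year(source, year: int) -> pd.DataFrame:
    """Read one yearly file, harmonize column names and tag the funding year."""
    df = pd.read_csv(source).rename(columns=RENAME_MAP)
    df["Funding Year"] = year  # already an integer, so no later type conversion is needed
    return df

# Tiny in-memory stand-in for a yearly CSV file (made-up row).
sample_csv = io.StringIO(
    "Company/Brand,Founded,HeadQuarter,Sector,Investor,Amount($),Stage\n"
    'ExampleCo,2015,Mumbai,FinTech,Example Capital,"$1,000,000",Seed\n'
)
startup_funding_sample = load_funding_year(sample_csv, 2019)
print(startup_funding_sample.columns.tolist())
```

With the real data, `source` would be a file name such as `'startup_funding2019.csv'`, and `usecols` could be passed through to `pd.read_csv` as in the cells of this notebook.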

#### 2019 Data Exploration & Cleaning

In [None]:
#check the shape of the dataset 
startup_funding_2019.shape

In [None]:
#check the first 5 records of the dataset 
startup_funding_2019.head()

In [None]:
#check the summarized information on the 2019 dataset 
startup_funding_2019.info()

In [None]:
#check on the location information 
startup_funding_2019['Location'].head()

In [None]:
startup_funding_2019.head()

In [None]:
#Before the column can be converted to a numeric type, the currency symbols and commas have to be removed

startup_funding_2019['Amount'] = startup_funding_2019['Amount'].apply(lambda x:str(x).replace('₹', ''))

startup_funding_2019['Amount'] = startup_funding_2019['Amount'].apply(lambda x:str(x).replace('$', ''))

startup_funding_2019['Amount'] = startup_funding_2019['Amount'].apply(lambda x:str(x).replace(',', ''))

#startup_funding_2019['Amount'] = startup_funding_2019['Amount'].apply(lambda x:str(x).replace('—', '0'))
startup_funding_2019['Amount'] = startup_funding_2019['Amount'].replace('—', np.nan)


In [None]:
#Some rows-values in the amount column are undisclosed 
# Extract the rows with undisclosed funding information 

index_new = startup_funding_2019.index[startup_funding_2019['Amount']=='Undisclosed']
#Print the number of rows with such undisclosed values
print('The number of values with undisclosed amount is ', len(index_new))

In [None]:
#explore these records 
startup_funding_2019.loc[(index_new)]

In [None]:
#Since undisclosed amounts do not provide any intelligence, 
#we decided to treat them as missing values 
# Replace the undisclosed amounts with NaN 

#startup_funding_2019 = startup_funding_2019.drop(labels=index_new, axis=0)
#startup_funding_2019['Amount'] = startup_funding_2019['Amount'].apply(lambda x:str(x).replace('Undisclosed', ''))

startup_funding_2019['Amount'] = startup_funding_2019['Amount'].replace('Undisclosed', np.nan)

In [None]:
startup_funding_2019.loc[(index_new)]

In [None]:
#Convert the Amount column to float 
#startup_funding_2019['Amount'] = startup_funding_2019.Amount.apply(lambda x:float(x))
startup_funding_2019['Amount'] = pd.to_numeric(startup_funding_2019['Amount'], errors='coerce')

In [None]:
#Check the first 5 rows of the dataset 
startup_funding_2019.head()

In [None]:
#Check the summary information of the dataset 

startup_funding_2019.info()

In [None]:
#Check if there are any NULL VALUES 
startup_funding_2019.isna().any()

Company Name    False
Founded          True
Location         True
Sector           True
Investor        False
Amount           True
Stage            True
Funding Year    False
dtype: bool

In [None]:
#Count the columns that contain NULL VALUES 
startup_funding_2019.isna().any().sum()

5

Although the 2019 dataset contains some NULL values, we will handle them at a later point.

In [None]:
#find duplicates 

duplicate = startup_funding_2019[startup_funding_2019.duplicated()]

duplicate



There are no duplicates 

#### Load 2020 Data

In [None]:
#import the 2020 dataset 
#select specific columns 

startup_funding_2020 = pd.read_csv('startup_funding2020.csv', usecols = ['Company/Brand','Founded','HeadQuarter','Sector','Investor','Amount($)','Stage'])

# rename the columns for consistency 

#Company/Brand  --> Company Name 
#HeadQuarter --> Location 
#Amount($)  --> Amount 

startup_funding_2020.rename(columns = {'Company/Brand':'Company Name'}, inplace = True)

startup_funding_2020.rename(columns = {'HeadQuarter':'Location'}, inplace = True)

startup_funding_2020.rename(columns = {'Amount($)':'Amount'}, inplace = True)

# Add the funding year as a column 


startup_funding_2020['Funding Year'] = "2020"

#Change the funding year to integer type

startup_funding_2020['Funding Year'] = startup_funding_2020['Funding Year'].astype(int)


In [None]:
#Check the first 5 rows of the 2020 funding data
startup_funding_2020.head()

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
0,Aqgromalin,2019,Chennai,AgriTech,Angel investors,"$200,000",,2020
1,Krayonnz,2019,Bangalore,EdTech,GSF Accelerator,"$100,000",Pre-seed,2020
2,PadCare Labs,2018,Pune,Hygiene management,Venture Center,Undisclosed,Pre-seed,2020
3,NCOME,2020,New Delhi,Escrow,"Venture Catalysts, PointOne Capital","$400,000",,2020
4,Gramophone,2016,Indore,AgriTech,"Siana Capital Management, Info Edge","$340,000",,2020


In [None]:
#Summary information of the dataset 
startup_funding_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Company Name  1055 non-null   object
 1   Founded       843 non-null    object
 2   Location      961 non-null    object
 3   Sector        1042 non-null   object
 4   Investor      1017 non-null   object
 5   Amount        1052 non-null   object
 6   Stage         591 non-null    object
 7   Funding Year  1055 non-null   int32 
dtypes: int32(1), object(7)
memory usage: 61.9+ KB


As can be seen, the Founded and Amount attributes will need conversion to numeric data.

In [None]:
#To convert the Founded attribute to numeric data, we had to coerce the conversion
#This is due to some non-numeric and missing values which were causing errors 
#The nullable 'Int64' dtype keeps the years as integers while still allowing missing values

startup_funding_2020['Founded'] = pd.to_numeric(startup_funding_2020['Founded'], errors='coerce').astype('Int64')
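For reference, a small sketch of how coercion interacts with a nullable integer dtype. The values are made up; the point is only that unparseable entries become missing while valid years remain integers.

```python
import pandas as pd

founded = pd.Series(["2019", "2018", "-", None])

# Unparseable entries and missing values become NaN (dtype float64).
founded_num = pd.to_numeric(founded, errors="coerce")

# The nullable integer dtype stores the years as integers and NaN as <NA>.
founded_int = founded_num.astype("Int64")

print(founded_int.dtype, int(founded_int.isna().sum()))
```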

In [None]:
# check for NA's 
startup_funding_2020.isna().sum()

Company Name      0
Founded         213
Location         94
Sector           13
Investor         38
Amount            3
Stage           464
Funding Year      0
dtype: int64

In [None]:
#Before the column can be converted to a numeric type, the currency symbols and commas have to be removed

startup_funding_2020['Amount'] = startup_funding_2020['Amount'].apply(lambda x:str(x).replace('₹', ''))

startup_funding_2020['Amount'] = startup_funding_2020['Amount'].apply(lambda x:str(x).replace('$', ''))

startup_funding_2020['Amount'] = startup_funding_2020['Amount'].apply(lambda x:str(x).replace(',', ''))

#startup_funding_2020['Amount'] = startup_funding_2020['Amount'].apply(lambda x:str(x).replace('—', '0'))
startup_funding_2020['Amount'] = startup_funding_2020['Amount'].replace('—', np.nan)

In [None]:
#Find the number of rows with undisclosed amounts 
index1 = startup_funding_2020.index[startup_funding_2020['Amount']=='Undisclosed']
print('The total number of undisclosed records is', len(index1))

The total number of undisclosed records is 243


In [None]:
#Since undisclosed amounts do not provide any intelligence, 
#we decided to replace them with NaN

startup_funding_2020['Amount'] = startup_funding_2020['Amount'].replace('Undisclosed', np.nan)

In [None]:
#print a summary information on the 2020 data 
startup_funding_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Company Name  1055 non-null   object
 1   Founded       842 non-null    Int64 
 2   Location      961 non-null    object
 3   Sector        1042 non-null   object
 4   Investor      1017 non-null   object
 5   Amount        812 non-null    object
 6   Stage         591 non-null    object
 7   Funding Year  1055 non-null   int32 
dtypes: Int64(1), int32(1), object(6)
memory usage: 63.0+ KB


The Amount attribute still needs to be converted to a numeric datatype.

In [None]:
#Find the row with 887000 23000000 in the amount section
index1 = startup_funding_2020.index[startup_funding_2020['Amount']=='887000 23000000']
index1

Int64Index([465], dtype='int64')

In [None]:
#print the row record
startup_funding_2020.loc[(index1)]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
465,True Balance,2014,Gurugram,Finance,Balancehero,887000 23000000,Series C,2020


In [None]:
#replace the values with the average 
avg = str((887000+23000000)/2)
startup_funding_2020.at[465, 'Amount'] = avg 

#print the row record to confirm
print(startup_funding_2020.loc[(465)])

Company Name    True Balance
Founded                 2014
Location            Gurugram
Sector               Finance
Investor         Balancehero
Amount            11943500.0
Stage               Series C
Funding Year            2020
Name: 465, dtype: object


In [None]:
#Find the row with 800000000 to 850000000 in the amount section
index1 = startup_funding_2020.index[startup_funding_2020['Amount']=='800000000 to 850000000']
index1

Int64Index([472], dtype='int64')

In [None]:
#print the row record 
startup_funding_2020.loc[(index1)]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
472,Eruditus,2010,Mumbai,Education,"Bertelsmann India Investments, Sequoia Capital...",800000000 to 850000000,,2020


In [None]:
#replace the values with the average 
avg = str((800000000+850000000)/2)

startup_funding_2020.at[472, 'Amount'] = avg 

#print the row record to confirm 
print(startup_funding_2020.loc[(472)])

Company Name                                             Eruditus
Founded                                                      2010
Location                                                   Mumbai
Sector                                                  Education
Investor        Bertelsmann India Investments, Sequoia Capital...
Amount                                                825000000.0
Stage                                                         NaN
Funding Year                                                 2020
Name: 472, dtype: object
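The two range-like amounts were averaged by hand. A sketch of how the same rule could be applied programmatically follows; `midpoint_if_range` is a hypothetical helper, and it assumes currency symbols and commas have already been stripped, as they have been at this point in the notebook.

```python
import re
import pandas as pd

def midpoint_if_range(value):
    """Return the mean of the numbers in a string holding two or more numbers;
    return any other value unchanged."""
    if not isinstance(value, str):
        return value
    numbers = re.findall(r"\d+(?:\.\d+)?", value)
    if len(numbers) < 2:
        return value
    return str(sum(float(n) for n in numbers) / len(numbers))

amount = pd.Series(["887000 23000000", "800000000 to 850000000", "340000", None])
amount = amount.apply(midpoint_if_range)
print(amount.tolist())
```

Whether the midpoint is a fair stand-in for a reported range is a modelling choice; the helper only reproduces the rule used in the cells above.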


In [None]:
#Find the row with the misspelled value 'Undiclsosed' in the Amount column 
index4 = startup_funding_2020.index[startup_funding_2020['Amount']=='Undiclsosed']
index4

Int64Index([665], dtype='int64')

In [None]:
#print the row record 
startup_funding_2020.loc[(index4)]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
665,Credgencies,2018,,AI & Debt,Titan Capital,Undiclsosed,Seed Round,2020


In [None]:
# Replace the misspelled 'Undiclsosed' value with NaN 

startup_funding_2020['Amount'] = startup_funding_2020['Amount'].replace('Undiclsosed', np.nan)

In [None]:
#Find the row with the misspelled value 'Undislosed' in the Amount column 
index5 = startup_funding_2020.index[startup_funding_2020['Amount']=='Undislosed']
index5

Int64Index([1012], dtype='int64')

In [None]:
#print the row record 
startup_funding_2020.loc[(index5)]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
1012,Toddle,,Bengaluru,,Matrix Partners India,Undislosed,,2020


In [None]:
# Replace the misspelled 'Undislosed' value with NaN 

startup_funding_2020['Amount'] = startup_funding_2020['Amount'].replace('Undislosed', np.nan)

In [None]:
#Convert the Amount column to float 

startup_funding_2020['Amount'] = pd.to_numeric(startup_funding_2020['Amount'], errors='coerce')

In [None]:
duplicates = startup_funding_2020[startup_funding_2020.duplicated()]

duplicates

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
145,Krimanshi,2015,Jodhpur,Biotechnology company,"Rajasthan Venture Capital Fund, AIM Smart City",600000.0,Seed,2020
205,Nykaa,2012,Mumbai,Cosmetics,"Alia Bhatt, Katrina Kaif",,,2020
362,Byju’s,2011,Bangalore,EdTech,"Owl Ventures, Tiger Global Management",500000000.0,,2020


In [None]:
#delete duplicates 

startup_funding_2020 = startup_funding_2020.drop_duplicates(keep='first')


In [None]:
#Check the 2020 datatset information to confirm the datatypes 
startup_funding_2020.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1052 entries, 0 to 1054
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company Name  1052 non-null   object 
 1   Founded       839 non-null    Int64  
 2   Location      958 non-null    object 
 3   Sector        1039 non-null   object 
 4   Investor      1014 non-null   object 
 5   Amount        805 non-null    float64
 6   Stage         590 non-null    object 
 7   Funding Year  1052 non-null   int32  
dtypes: Int64(1), float64(1), int32(1), object(5)
memory usage: 70.9+ KB


In [None]:
#Check the first set of row 
startup_funding_2020.head()

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
0,Aqgromalin,2019,Chennai,AgriTech,Angel investors,200000.0,,2020
1,Krayonnz,2019,Bangalore,EdTech,GSF Accelerator,100000.0,Pre-seed,2020
2,PadCare Labs,2018,Pune,Hygiene management,Venture Center,,Pre-seed,2020
3,NCOME,2020,New Delhi,Escrow,"Venture Catalysts, PointOne Capital",400000.0,,2020
4,Gramophone,2016,Indore,AgriTech,"Siana Capital Management, Info Edge",340000.0,,2020


In [None]:
#Check the final shape of the data after preprocessing 
startup_funding_2020.shape

(1052, 8)

#### Load 2021 Data

In [None]:
#import the 2021 dataset 
#select specific columns 

startup_funding_2021 = pd.read_csv('startup_funding2021.csv', usecols = ['Company/Brand','Founded','HeadQuarter','Sector','Investor','Amount($)','Stage'])

# rename the columns for consistency 
#Company/Brand  --> Company Name 
#HeadQuarter --> Location 
#Amount($)  --> Amount 

startup_funding_2021.rename(columns = {'Company/Brand':'Company Name'}, inplace = True)

startup_funding_2021.rename(columns = {'HeadQuarter':'Location'}, inplace = True)

startup_funding_2021.rename(columns = {'Amount($)':'Amount'}, inplace = True)

# Add the funding year as a column 

startup_funding_2021['Funding Year'] = "2021"

#Change the funding year to integer type
startup_funding_2021['Funding Year'] = startup_funding_2021['Funding Year'].astype(int)

In [None]:
#Check the 2021 funding data 
startup_funding_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company Name  1209 non-null   object 
 1   Founded       1208 non-null   float64
 2   Location      1208 non-null   object 
 3   Sector        1209 non-null   object 
 4   Investor      1147 non-null   object 
 5   Amount        1206 non-null   object 
 6   Stage         781 non-null    object 
 7   Funding Year  1209 non-null   int32  
dtypes: float64(1), int32(1), object(6)
memory usage: 71.0+ KB
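Before searching for problem values one at a time, it can help to list every Amount entry that would fail numeric conversion in a single pass. The sketch below assumes a toy column with made-up rows that imitate the kinds of values found in the 2021 file.

```python
import pandas as pd

amount = pd.Series(
    ["$1200000", "Undisclosed", "Upsparks", "Series C", "Seed", "$300,000", None]
)

# Strip currency symbols and commas, then attempt the conversion.
stripped = amount.str.replace(r"[₹$,]", "", regex=True).str.strip()
as_number = pd.to_numeric(stripped, errors="coerce")

# Present in the raw column but not convertible: these need manual inspection.
suspicious = amount[as_number.isna() & amount.notna()]
print(suspicious.value_counts())
```

Each distinct value in `suspicious` then points either to a placeholder such as "Undisclosed" or to a row whose values have slipped into the wrong columns.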


In [None]:
#Find the rows with 'Undisclosed' in the Amount column 
index6 = startup_funding_2021.index[startup_funding_2021['Amount']=='Undisclosed']

print(len(index6))


43


In [None]:
#print the row records 
startup_funding_2021.loc[(index6)]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
7,Qube Health,2016.0,Mumbai,HealthTech,Inflection Point Ventures,Undisclosed,Pre-series A,2021
8,Vitra.ai,2020.0,Bangalore,Tech Startup,Inflexor Ventures,Undisclosed,,2021
21,Uable,2020.0,Bangalore,EdTech,"Chiratae Ventures, JAFCO Asia",Undisclosed,Seed,2021
39,TruNativ,2019.0,Mumbai,Food & Beverages,9Unicorns,Undisclosed,Seed,2021
54,AntWak,2019.0,Bangalore,EdTech,"Vaibhav Domkundwar, Kunal Shah",Undisclosed,Seed,2021
64,Rage Coffee,2018.0,New Delhi,Food & Beverages,"GetVantage, Prakash Katama",Undisclosed,Pre-series A,2021
67,Kudos,2014.0,Pune,FinTech,Marquee fintech founders,Undisclosed,Pre-series A,2021
316,Hubhopper,2015.0,New Delhi,Podcast,"ITI Growth Opportunities Fund, Unit-E Ventures",Undisclosed,,2021
319,Battery Smart,2019.0,New Delhi,Battery,Orios Venture Partners,Undisclosed,Seed,2021
321,Onelife,2019.0,Mumbai,Healthcare,Wipro venture capital arm,Undisclosed,,2021


In [None]:
# Replace the Undisclosed amounts with NaN 

startup_funding_2021['Amount'] = startup_funding_2021['Amount'].replace('Undisclosed', np.nan)

In [None]:
#print the row records 
startup_funding_2021.loc[(index6)]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
7,Qube Health,2016.0,Mumbai,HealthTech,Inflection Point Ventures,,Pre-series A,2021
8,Vitra.ai,2020.0,Bangalore,Tech Startup,Inflexor Ventures,,,2021
21,Uable,2020.0,Bangalore,EdTech,"Chiratae Ventures, JAFCO Asia",,Seed,2021
39,TruNativ,2019.0,Mumbai,Food & Beverages,9Unicorns,,Seed,2021
54,AntWak,2019.0,Bangalore,EdTech,"Vaibhav Domkundwar, Kunal Shah",,Seed,2021
64,Rage Coffee,2018.0,New Delhi,Food & Beverages,"GetVantage, Prakash Katama",,Pre-series A,2021
67,Kudos,2014.0,Pune,FinTech,Marquee fintech founders,,Pre-series A,2021
316,Hubhopper,2015.0,New Delhi,Podcast,"ITI Growth Opportunities Fund, Unit-E Ventures",,,2021
319,Battery Smart,2019.0,New Delhi,Battery,Orios Venture Partners,,Seed,2021
321,Onelife,2019.0,Mumbai,Healthcare,Wipro venture capital arm,,,2021


In [None]:
#'Upsparks' is an investor name that ended up in the Amount column
index7 = startup_funding_2021.index[startup_funding_2021['Amount']=='Upsparks']

print(len(index7), index7)

2 Int64Index([98, 111], dtype='int64')

In [None]:
startup_funding_2021.loc[index7]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
98,FanPlay,2020.0,Computer Games,Computer Games,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000,2021
111,FanPlay,2020.0,Computer Games,Computer Games,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000,2021


In [None]:
#drop the duplicate

startup_funding_2021 = startup_funding_2021.drop(labels=index7[1], axis=0)

In [None]:
#Rearrange the record data correctly 

startup_funding_2021.loc[index7[0], ['Amount', 'Stage']] = ['$1200000', '']


In [None]:
startup_funding_2021.loc[index7[0]]

Company Name                        FanPlay
Founded                              2020.0
Location                     Computer Games
Sector                       Computer Games
Investor        Pritesh Kumar, Bharat Gupta
Amount                             $1200000
Stage                                      
Funding Year                           2021
Name: 98, dtype: object

In [None]:
#'Series C' is a funding stage that ended up in the Amount column
index8 = startup_funding_2021.index[startup_funding_2021['Amount']=='Series C']

print(len(index8), index8)

2 Int64Index([242, 256], dtype='int64')

In [None]:
startup_funding_2021.loc[index8]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
242,Fullife Healthcare,2009.0,Pharmaceuticals\t#REF!,Primary Business is Development and Manufactur...,$22000000,Series C,,2021
256,Fullife Healthcare,2009.0,Pharmaceuticals\t#REF!,Primary Business is Development and Manufactur...,$22000000,Series C,,2021


In [None]:
#since the record is duplicated, we can drop one copy 
startup_funding_2021 = startup_funding_2021.drop(labels=index8[1], axis=0)

In [None]:
startup_funding_2021.loc[index8[0], ['Sector', 'Location', 'Amount', 'Investor', 'Stage']] = ['Pharmaceuticals', '', '$22000000', '', 'Series C']

In [None]:
startup_funding_2021.loc[242]

Company Name    Fullife Healthcare
Founded                     2009.0
Location                          
Sector             Pharmaceuticals
Investor                          
Amount                   $22000000
Stage                     Series C
Funding Year                  2021
Name: 242, dtype: object

In [None]:
#Find the rows where the stage 'Seed' appears in the Amount column
index9 = startup_funding_2021.index[startup_funding_2021['Amount']=='Seed']
print(index9)

Int64Index([257, 1148], dtype='int64')


In [None]:
startup_funding_2021.loc[index9]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
257,MoEVing,2021.0,Gurugram\t#REF!,MoEVing is India's only Electric Mobility focu...,$5000000,Seed,,2021
1148,Godamwale,2016.0,Mumbai,Logistics & Supply Chain,1000000\t#REF!,Seed,,2021


In [None]:
startup_funding_2021.loc[index9[0], ['Sector', 'Location', 'Amount', 'Investor', 'Stage']] = ['Electric Mobility', 'Gurugram', '$5000000', '', 'Seed']
startup_funding_2021.loc[index9[1], ['Amount', 'Investor', 'Stage']] = ['1000000', '', 'Seed']

In [None]:
#Check for amounts recorded as lowercase 'undisclosed'
index10 = startup_funding_2021.index[startup_funding_2021['Amount']=='undisclosed']
print(index10)

Int64Index([], dtype='int64')


In [None]:
startup_funding_2021.loc[index10]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year


In [None]:
# Replace any lowercase 'undisclosed' amounts with NaN 
startup_funding_2021['Amount'] = startup_funding_2021['Amount'].replace('undisclosed', np.nan)


In [None]:
#Find the row where the investor name 'ah! Ventures' appears in the Amount column

index11 = startup_funding_2021.index[startup_funding_2021['Amount']=='ah! Ventures']
print(index11)


Int64Index([538], dtype='int64')


In [None]:
startup_funding_2021.loc[index11]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
538,Little Leap,2020.0,New Delhi,EdTech,Vishal Gupta,ah! Ventures,$300000,2021


In [None]:
startup_funding_2021.loc[index11, ['Amount', 'Stage']] = ['$300000', '']

In [None]:
startup_funding_2021.loc[index11]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
538,Little Leap,2020.0,New Delhi,EdTech,Vishal Gupta,$300000,,2021
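Several 2021 records follow the same pattern: the Stage column holds a dollar figure and the Amount column holds text. A sketch of how such rows might be detected and realigned in one step is given below. It rests on two assumptions that should be verified against the data: that a Stage value of the form `$<digits>` always signals this shift, and that the text found in Amount is an additional investor name. The second row of the toy frame is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Company Name": ["Little Leap", "ExampleCo"],
    "Investor": ["Vishal Gupta", "Example Capital"],
    "Amount": ["ah! Ventures", "$500000"],
    "Stage": ["$300000", "Seed"],
})

# Rows whose 'Stage' looks like a dollar figure are assumed to be shifted.
shifted = df["Stage"].str.match(r"^\$\d+$", na=False)

# Assumption: the text sitting in 'Amount' is an extra investor name, so append it.
df.loc[shifted, "Investor"] = df.loc[shifted, "Investor"] + ", " + df.loc[shifted, "Amount"]

# Move the dollar figure into 'Amount' and clear 'Stage'.
df.loc[shifted, "Amount"] = df.loc[shifted, "Stage"]
df.loc[shifted, "Stage"] = ""
print(df)
```

Appending the stray text to Investor is a design choice; overwriting it, as done manually for some records in this notebook, is an equally valid alternative when the text is not trusted.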


In [None]:
#Find the row where the stage 'Pre-series A' appears in the Amount column

index12 = startup_funding_2021.index[startup_funding_2021['Amount']=='Pre-series A']
index12

Int64Index([545], dtype='int64')

In [None]:
startup_funding_2021.loc[index12]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
545,AdmitKard,2016.0,Noida,EdTech,$1000000,Pre-series A,,2021


In [None]:
# The values of this record are shifted one column to the left:
# move the amount and the stage into their own columns and leave the investor empty

startup_funding_2021.at[545, 'Amount'] = '$1000000'
startup_funding_2021.at[545, 'Investor'] = ''
startup_funding_2021.at[545, 'Stage'] = 'Pre-series A'

In [None]:
startup_funding_2021.loc[index12]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
545,AdmitKard,2016.0,Noida,EdTech,,$1000000,Pre-series A,2021


In [None]:
#Find the row where the investor names 'ITO Angel Network, LetsVenture' appear in the Amount column
index13 = startup_funding_2021.index[startup_funding_2021['Amount']=='ITO Angel Network, LetsVenture']

index13

Int64Index([551], dtype='int64')

In [None]:
startup_funding_2021.loc[index13]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
551,BHyve,2020.0,Mumbai,Human Resources,"Omkar Pandharkame, Ketaki Ogale","ITO Angel Network, LetsVenture",$300000,2021


In [None]:
# Row 551: Amount holds extra investor names and Stage holds the amount.
# Move each value to its proper column instead of dropping the row.
startup_funding_2021.at[551, 'Amount'] = '$300000'
startup_funding_2021.at[551, 'Investor'] = 'Omkar Pandharkame, Ketaki Ogale, JITO Angel Network, LetsVenture'
startup_funding_2021.at[551, 'Stage'] = ''

In [None]:
startup_funding_2021.loc[index13]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
551,BHyve,2020.0,Mumbai,Human Resources,"Omkar Pandharkame, Ketaki Ogale, JITO Angel Ne...",$300000,,2021


In [None]:
#JITO Angel Network LetsVenture
index14 = startup_funding_2021.index[startup_funding_2021['Amount']=='JITO Angel Network, LetsVenture']

index14

Int64Index([677], dtype='int64')

In [None]:
startup_funding_2021.loc[index14]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
677,Saarthi Pedagogy,2015.0,Ahmadabad,EdTech,Sushil Agarwal,"JITO Angel Network, LetsVenture",$1000000,2021


In [None]:
# Row 677: Amount holds extra investor names and Stage holds the amount.
# Move each value to its proper column instead of dropping the row.
startup_funding_2021.at[677, 'Amount'] = '$1000000'
startup_funding_2021.at[677, 'Investor'] = 'Sushil Agarwal, JITO Angel Network, LetsVenture'
startup_funding_2021.at[677, 'Stage'] = ''

In [None]:
startup_funding_2021.loc[index14]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year
677,Saarthi Pedagogy,2015.0,Ahmadabad,EdTech,"Sushil Agarwal, JITO Angel Network, LetsVenture",$1000000,,2021
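
The rows repaired above were each found by searching for one known bad string. As a rough sketch of how such rows could be surfaced in a single pass, the snippet below flags every `Amount` value that does not look like a monetary figure. The toy frame, the helper `looks_like_money`, and the `known_placeholders` set are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame that mimics the shifted rows seen above.
toy = pd.DataFrame(
    {
        "Company Name": ["A", "B", "C", "D"],
        "Amount": ["$300000", "ah! Ventures", "Undisclosed", "Pre-series A"],
        "Stage": ["Seed", "$300000", None, None],
    }
)


def looks_like_money(value) -> bool:
    """True for strings made of an optional currency symbol, digits and commas."""
    if not isinstance(value, str):
        return False
    stripped = value.strip().lstrip("$₹").replace(",", "")
    return stripped.isdigit()


# Placeholders that are expected in Amount and handled elsewhere (assumed list).
known_placeholders = {"undisclosed"}

amount = toy["Amount"]
is_text = amount.notna() & ~amount.map(looks_like_money)
is_placeholder = amount.astype(str).str.lower().isin(known_placeholders)

# Rows whose Amount is neither a figure nor a known placeholder are
# candidates for the manual column repairs shown above.
suspect_rows = toy[is_text & ~is_placeholder]
print(suspect_rows)
```

The helper only accepts whole-number figures, so decimal amounts would also be flagged and would need a looser check.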


In [None]:
# Look for rows where Amount holds the literal string 'nan'
index15 = startup_funding_2021.index[startup_funding_2021['Amount']=='nan']

index15

Int64Index([], dtype='int64')

In [None]:
startup_funding_2021.loc[index15]

Unnamed: 0,Company Name,Founded,Location,Sector,Investor,Amount,Stage,Funding Year


In [None]:
# Convert any literal 'nan' strings in Amount to real missing values
startup_funding_2021['Amount'] = startup_funding_2021['Amount'].replace('nan', np.nan)

In [None]:
# Strip currency symbols and thousands separators. Casting to str turns missing
# values into the text 'nan', which pd.to_numeric later converts back to NaN.
amount_text = startup_funding_2021['Amount'].astype(str)
for symbol in ['₹', '$', ',']:
    # regex=False so that '$' is treated as a literal character
    amount_text = amount_text.str.replace(symbol, '', regex=False)

# Treat the dash placeholder as a missing value
startup_funding_2021['Amount'] = amount_text.replace('—', np.nan)

In [None]:
# Convert Amount to numbers; any value that still cannot be parsed becomes NaN
startup_funding_2021['Amount']  = pd.to_numeric(startup_funding_2021['Amount'], errors='coerce')
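
The symbol stripping and the numeric conversion can also be bundled into one reusable function, which is convenient when the same cleaning has to be repeated for each year's dataset. This is a minimal sketch: the function name `clean_amount` and the sample values are hypothetical. Note that removing `₹` only removes the symbol; rupee figures are not converted to dollars, and doing so would need an exchange rate that is not given here.

```python
import numpy as np
import pandas as pd


def clean_amount(series: pd.Series) -> pd.Series:
    """Strip currency symbols and thousands separators, then convert to float."""
    # astype(str) turns missing values into 'nan', which to_numeric maps to NaN.
    text = series.astype(str).str.replace(r"[₹$,]", "", regex=True).str.strip()
    # errors='coerce' turns placeholders such as 'undisclosed' into NaN.
    return pd.to_numeric(text, errors="coerce")


# Hypothetical sample covering the cases handled in the cells above.
raw = pd.Series(["$1,200,000", "₹300000", "undisclosed", "—", np.nan])
cleaned = clean_amount(raw)
print(cleaned)
```

Because `errors='coerce'` silently discards anything unparseable, it is worth comparing the count of missing values before and after the call.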

In [None]:
startup_funding_2021.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1207 entries, 0 to 1208
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company Name  1207 non-null   object 
 1   Founded       1206 non-null   float64
 2   Location      1206 non-null   object 
 3   Sector        1207 non-null   object 
 4   Investor      1145 non-null   object 
 5   Amount        1064 non-null   float64
 6   Stage         784 non-null    object 
 7   Funding Year  1207 non-null   int32  
dtypes: float64(2), int32(1), object(5)
memory usage: 112.4+ KB


##### Dealing with the Location attribute

In [None]:
startup_funding_2021.loc[98]

Company Name                        FanPlay
Founded                              2020.0
Location                     Computer Games
Sector                       Computer Games
Investor        Pritesh Kumar, Bharat Gupta
Amount                            1200000.0
Stage                                      
Funding Year                           2021
Name: 98, dtype: object

In [None]:
startup_funding_2021.loc[752]

Company Name    NewLink Group
Founded                2016.0
Location              Beijing
Sector           Tech Startup
Investor         Bain Capital
Amount            200000000.0
Stage                     NaN
Funding Year             2021
Name: 752, dtype: object

In [None]:
# Keep only the city, i.e. the text before the first comma
startup_funding_2021['Location'] = startup_funding_2021.Location.str.split(',').str[0]

# Blank out rows whose Location is unusable, e.g. row 98 repeats the sector
# name and row 752 lies outside India (Beijing)
startup_funding_2021.at[98, 'Location'] = ''
startup_funding_2021.at[241, 'Location'] = ''
startup_funding_2021.at[255, 'Location'] = ''
startup_funding_2021.at[752, 'Location'] = ''
startup_funding_2021.at[1100, 'Location'] = ''
startup_funding_2021.at[1176, 'Location'] = ''
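
Row 98 above shows one recognisable pattern: the `Location` value simply repeats the `Sector` value. A possible way to catch that pattern without listing row numbers by hand is sketched below on a hypothetical toy frame. It would not catch a case like row 752, where the location is a real city outside India, so manual inspection is still needed.

```python
import pandas as pd

# Hypothetical toy frame; only the column names match the notebook.
toy = pd.DataFrame(
    {
        "Location": ["Bangalore, Karnataka", "Computer Games", "Mumbai", None],
        "Sector": ["FinTech", "Computer Games", "EdTech", "HealthTech"],
    }
)

# Keep only the text before the first comma and trim whitespace.
city = toy["Location"].str.split(",").str[0].str.strip()

# A Location equal to the Sector is most likely a shifted value.
same_as_sector = city.notna() & (city == toy["Sector"])

# Blank those entries out, as done for the hand-picked rows above.
toy["Location"] = city.mask(same_as_sector, "")
print(toy)
```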

##### Dealing with the Sector attribute 

In [None]:
# Correct the Sector value of row 1100
startup_funding_2021.at[1100, 'Sector'] = 'Audio experience'

In [None]:
# Find rows that are exact duplicates of an earlier row

startup_funding_2021[startup_funding_2021.duplicated()]



duplicate

Unnamed: 0,Company Name,Sector,Stage,Amount,Location,Funding Year
348,TheCollegeFever,Brand Marketing,Seed,250000.0,Bangalore,2018
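
`DataFrame.duplicated()` marks every repeat of an earlier row, so the filter above lists only the second and later copies. The sketch below, on a hypothetical toy frame, shows the same check followed by `drop_duplicates()`. Whether duplicates should actually be dropped from the funding data is an assumption, not something the notebook states at this point.

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate row.
toy = pd.DataFrame(
    {
        "Company Name": ["X", "Y", "X"],
        "Amount": [1.0, 2.0, 1.0],
    }
)

# Rows that repeat an earlier row (the first occurrence is not flagged).
dupes = toy[toy.duplicated()]

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(dupes)
print(deduped)
```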
