# Exploring the Indian Startup Ecosystem: A Data Driven Analysis of Funding Trends and Industry Sectors

## Business Understanding
### Business Scenario
Your team is trying to venture into the Indian start-up
ecosystem. As the data expert of the team, you are to
investigate the ecosystem and propose the best course
of action.

*Analyze funding received by start-ups in India from
2018 to 2021.*
- Separate data for each year of funding will is
provided.
- In these datasets, you'll find the start-ups' details,
the funding amounts received, and the investors'
information.

### Business Objective
The aim of this project is to perform analysis on the Indian start-ups ecosystem and advice stakeholders on which venture to invest in to increase the potential of high profit/income.

In [81]:
#import all necessary libraries

# data manipulation
import pandas as pd
import numpy as np

# data visualization libraries
import matplotlib.pyplot as plt
from plotly import express as px
import seaborn as sns

# statistical libraries
from scipy import stats
import statistics as stat

# database manipulation libraries
import pyodbc
from dotenv import dotenv_values

# hide warnings
import warnings
warnings.filterwarnings("ignore")


 

## Setup Database Connection

In [82]:
# load environment variables
environment_variables = dotenv_values(".env")

# load database configurations
database = environment_variables.get("DB_DATABASENAME")
server = environment_variables.get("DB_SERVER")
username = environment_variables.get("DB_USERNAME")
password = environment_variables.get("DB_PASSWORD")

# database connection string
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"



In [83]:
# create pyodbc connector
connection = pyodbc.connect(connection_string)

In [84]:
# Loading 2021 dataset from MS SQL server
query_2021 = " SELECT * FROM dbo.LP1_startup_funding2021"
df_2021 = pd.read_sql(query_2021,connection)
df_2021.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [85]:
# Load 2020 dataset from MS SQL Server
query_2020 = "SELECT * FROM dbo.LP1_startup_funding2020"
df_2020 = pd.read_sql(query_2020,connection)
df_2020.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


In [86]:
# load 2019 dataset
df_2019 = pd.read_csv("data/startup_funding2019.csv")
df_2019.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [87]:
# load 2018 dataset
df_2018 = pd.read_csv("D:\Programming Stuffs\DAP(Azubi Africa)\Career Accelerator\Sprint1\Indian_Start_Up_Analysis\data\startup_funding2018.csv")
df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [88]:
# concatenated all dataset
data = pd.concat([df_2018,df_2019,df_2020,df_2021],ignore_index=False)
data.head() 

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage,Company_Brand,What_it_does,column10
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,,,,,,,,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,,,,,,,,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,,,,,,,,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,,,,,,,,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,,,,,,,,,,


In [89]:
# save dataframe to csv
data.to_csv("Indian_startup_dataset")

In [90]:
# Renaming columns to lowercase with underscores
data = data.rename(columns=lambda x: x.lower().replace(' ', '_'))
data.head()

Unnamed: 0,company_name,industry,round/series,amount,location,about_company,company/brand,founded,headquarter,sector,what_it_does,founders,investor,amount($),stage,company_brand,what_it_does.1,column10
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,,,,,,,,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,,,,,,,,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,,,,,,,,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,,,,,,,,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,,,,,,,,,,


In [91]:
# convert final dataset from csv to dataframe
data.head()

Unnamed: 0,company_name,industry,round/series,amount,location,about_company,company/brand,founded,headquarter,sector,what_it_does,founders,investor,amount($),stage,company_brand,what_it_does.1,column10
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",,,,,,,,,,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,,,,,,,,,,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,,,,,,,,,,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,,,,,,,,,,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,,,,,,,,,,,,


In [92]:
# view the last five rows
data.tail()

Unnamed: 0,company_name,industry,round/series,amount,location,about_company,company/brand,founded,headquarter,sector,what_it_does,founders,investor,amount($),stage,company_brand,what_it_does.1,column10
1204,,,,$3000000,,,,2019.0,Gurugram,Staffing & Recruiting,,"Chirag Mittal, Anirudh Syal",Endiya Partners,,Pre-series A,Gigforce,A gig/on-demand staffing company.,
1205,,,,$20000000,,,,2015.0,New Delhi,Food & Beverages,,Bala Sarda,IIFL AMC,,Series D,Vahdam,VAHDAM is among the world’s first vertically i...,
1206,,,,$55000000,,,,2019.0,Bangalore,Financial Services,,"Arnav Kumar, Vaibhav Singh",Owl Ventures,,Series C,Leap Finance,International education loans for high potenti...,
1207,,,,$26000000,,,,2015.0,Gurugram,EdTech,,Ruchir Arora,"Winter Capital, ETS, Man Capital",,Series B,CollegeDekho,"Collegedekho.com is Student’s Partner, Friend ...",
1208,,,,$8000000,,,,2019.0,Bangalore,Financial Services,,"Vishal Chopra, Himanshu Gupta","3one4 Capital, Kalaari Capital",,Series A,WeRize,India’s first socially distributed full stack ...,


In [93]:
# check the shape of data
data.shape

(2879, 18)

In [94]:
# check for columns in the data
data.columns

Index(['company_name', 'industry', 'round/series', 'amount', 'location',
       'about_company', 'company/brand', 'founded', 'headquarter', 'sector',
       'what_it_does', 'founders', 'investor', 'amount($)', 'stage',
       'company_brand', 'what_it_does', 'column10'],
      dtype='object')

## Hypothesis Testing
*Hypothesis* - The amount of funds a company receive depends on the sector a company finds itself
- Null Hypothesis(H_o) - The funds a company receive does not depend on the sector of investment
- Alternate Hypothesis(H_a) - The funds a company receive depends on the sector of investment

### Business Questions
- Which particular sector received the most funding over the time frame?
- The distribution of start ups in stages and the amount allocated each?
- In which 3 locations have start ups had the most funding?
- Which year had the most investors?
- Top 3 investor considerations in investing in start ups
- What was the impact of COVID-19 pandemic on startup funding in 2020?
- How is funding related to Metropolitan cities
- What is the distribution of companies based on the Round/Series 

## Data Understanding
Data Understanding it drives the focus to identify, collect, and analyse
the data sets that can help you accomplish the project goals. This
phase also has four tasks:
### Tasks To Be Performed
1. Collect initial data: Acquire the necessary data and (if
necessary) load it into your analysis tool.
2. Describe data: Examine the data and document its surface
properties like data format, number of records, or field
identities.
3. Explore data: Dig deeper into the data. Query it, visualize it,
and identify relationships among the data.
4. Verify data quality: How clean/dirty is the data? Document
any quality issues.

## Data Preprocessing
Data Preprocessing is done to check the quality of the data.The following methods can be used to check the quality of data
- Data types
- Missing data
- Categorical values
- Continuous variable values
- Duplicated records
- Custom rule based checks

*Note*: After checking for duplicated records, we have to go through the columns and perform rule based checks where appropriate.



# Data Exploration and Cleaning on the 2018 Dataset

In [95]:
df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [96]:
# rename column labels
df_2018 = df_2018.rename(columns=lambda x: x.lower().replace(" ","_"))
df_2018.head()

Unnamed: 0,company_name,industry,round/series,amount,location,about_company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [97]:
# rename round/stage column to stage
df_2018 =df_2018.rename(columns={"round/series":"stage"}) 

In [98]:
# check for the columns in the dataframe
df_2018.columns

Index(['company_name', 'industry', 'stage', 'amount', 'location',
       'about_company'],
      dtype='object')

In [99]:
# check the shape of the 2018 dataset
df_2018.shape

(526, 6)

In [100]:
# perform descriptive statistics on the dataset
df_2018.describe(include="all").T

Unnamed: 0,count,unique,top,freq
company_name,526,525,TheCollegeFever,2
industry,526,405,—,30
stage,526,21,Seed,280
amount,526,198,—,148
location,526,50,"Bangalore, Karnataka, India",102
about_company,526,524,"TheCollegeFever is a hub for fun, fiesta and f...",2


In [101]:
# check the info about the dataset
df_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   company_name   526 non-null    object
 1   industry       526 non-null    object
 2   stage          526 non-null    object
 3   amount         526 non-null    object
 4   location       526 non-null    object
 5   about_company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


## Observations on the 2018 dataset
- The dataset consists of 526 rows and 6 columns
- All the data types of the columns are of the object type
- The amount column is supposed to be in float

### Course of Action
- Convert the amount column data type to float
- Drop duplicated rows

In [102]:
# check for null values
df_2018.isna().sum()

company_name     0
industry         0
stage            0
amount           0
location         0
about_company    0
dtype: int64

In [103]:
# check for duplicated values
df_2018.duplicated().sum()

1

In [104]:
# drop the duolicate value
df_2018.drop_duplicates(inplace=True)


In [105]:
# verify if duplicate still exists
df_2018.duplicated().sum() 

0

In [106]:
# create function to clean the amount column
exchange_rate = 1/0.0146
    
def clean_amount(df, col_name):
    cleaned_data = df[col_name].copy()
    if df[col_name].dtype == 'object':
        # Remove non-numeric characters except '.' and ','
        cleaned_data = cleaned_data.str.replace("[^\d.,]", "", regex=True)
        
        # Replace empty strings with NaN
        cleaned_data = cleaned_data.replace('', float('nan'), regex=True)
        
        # Remove commas from numbers
        cleaned_data = cleaned_data.str.replace(",", "", regex=True)
        
        if cleaned_data.str.startswith("$").any():
            # If the column starts with '$', clean it as dollars
            cleaned_data = cleaned_data.str.replace("[$]", "", regex=True).astype(float)
        elif cleaned_data.str.startswith("₹").any():
            # If the column starts with '₹', clean it as rupees
            cleaned_data = cleaned_data.str.replace("[₹]", "", regex=True).astype(float) * exchange_rate
        else:
            # Convert to float if there is no currency symbol preceding the amount
            cleaned_data = cleaned_data.astype(float)
    return cleaned_data



In [107]:
# apply the clean_data function to the 2018 dataframe
df_2018["amount"] = clean_amount(df_2018, "amount")

In [108]:
# check info about the dataset after cleaning
df_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 525 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   company_name   525 non-null    object 
 1   industry       525 non-null    object 
 2   stage          525 non-null    object 
 3   amount         377 non-null    float64
 4   location       525 non-null    object 
 5   about_company  525 non-null    object 
dtypes: float64(1), object(5)
memory usage: 28.7+ KB


In [109]:
# check for the null values present after cleaning the amount column
df_2018.isna().sum()

company_name       0
industry           0
stage              0
amount           148
location           0
about_company      0
dtype: int64

In [110]:
average_amount_by_stage = df_2018.groupby("stage")["amount"].mean()

In [117]:
# Function to fill NaN values in the amount column with the average based on stage of company
def impute_amount_column(df,filter_name,fill_value):
    unique_values = df[filter_name].unique()
    for val, avg_amount in zip(unique_values,df.groupby(filter_name)[fill_value].transform('mean')):
        df.loc[df[filter_name] == val, fill_value] = df.loc[df[filter_name] == val, fill_value].fillna(avg_amount)
    return df
     


In [118]:
df_2018 = impute_amount_column(df_2018,"stage","amount")

In [119]:
df_2018

Unnamed: 0,company_name,industry,stage,amount,location,about_company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,2.500000e+05,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,4.000000e+07,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,6.500000e+07,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2.000000e+06,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,3.328377e+07,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...
...,...,...,...,...,...,...
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,2.250000e+08,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif..."
522,Happyeasygo Group,"Tourism, Travel",Series A,3.328377e+07,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7.500000e+03,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...
524,Droni Tech,Information Technology,Seed,3.500000e+07,"Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...


In [116]:
# reload the 2018 dataframe to check where the null values are
df_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 525 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   company_name   525 non-null    object 
 1   industry       525 non-null    object 
 2   stage          525 non-null    object 
 3   amount         525 non-null    float64
 4   location       525 non-null    object 
 5   about_company  525 non-null    object 
dtypes: float64(1), object(5)
memory usage: 28.7+ KB


In [33]:
# apply cleaned_data function on dataframe
df_2018 = clean_data(df_2018,"amount")

AttributeError: 'str' object has no attribute 'str'

In [60]:
# check for the currency with the dollar symbol
starts_with_dollar = df_2018["amount"].str.startswith("$")
dollar_data = df_2018[starts_with_dollar]

# count the currencies with dollar infront
len(dollar_data)
    

59

In [61]:
# check for the currency with the Rupee symbol
starts_with_rupee = df_2018["amount"].str.startswith("₹")
rupee_data = df_2018[starts_with_rupee]

# count the currencies with dollar infront
len(rupee_data)

144

In [65]:
# strip off currency symbols from the dataset
rupee_data["amount"] = rupee_data["amount"].str.replace("₹","")

# remove all commas from the column 
rupee_data["amount"] = rupee_data["amount"].str.replace(",","")

# convert the amount column to float
rupee_data["amount"] = rupee_data["amount"].astype(float)
rupee_data.head()

Unnamed: 0,company_name,industry,round/series,amount,location,about_company
1,Happy Cow Dairy,"Agriculture, Farming",Seed,40000000.0,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,65000000.0,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
6,Tripshelf,"Internet, Leisure, Marketplace",Seed,16000000.0,"Kalkaji, Delhi, India",Tripshelf is an online market place for holida...
7,Hyperdata.IO,Market Research,Angel,50000000.0,"Hyderabad, Andhra Pradesh, India",Hyperdata combines advanced machine learning w...
15,Pitstop,"Automotive, Search Engine, Service Industry",Seed,100000000.0,"Bengaluru, Karnataka, India",Pitstop offers general repair and maintenance ...


In [67]:
# convert Rupees to Dollars
exchange_rate = 0.0146 #0.0146USD = 1 Rupee
rupee_data["amount"] = rupee_data["amount"] * (1/exchange_rate)
rupee_data.head()


Unnamed: 0,company_name,industry,round/series,amount,location,about_company
1,Happy Cow Dairy,"Agriculture, Farming",Seed,2739726000.0,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,4452055000.0,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
6,Tripshelf,"Internet, Leisure, Marketplace",Seed,1095890000.0,"Kalkaji, Delhi, India",Tripshelf is an online market place for holida...
7,Hyperdata.IO,Market Research,Angel,3424658000.0,"Hyderabad, Andhra Pradesh, India",Hyperdata combines advanced machine learning w...
15,Pitstop,"Automotive, Search Engine, Service Industry",Seed,6849315000.0,"Bengaluru, Karnataka, India",Pitstop offers general repair and maintenance ...


In [70]:
# combine the dollar and the rupee dataset 
df_2018 = pd.concat([dollar_data,rupee_data],ignore_index=False)
df_2018.head()

Unnamed: 0,company_name,industry,round/series,amount,location,about_company
86,WHR,"Health Care, Information Technology",Seed,143145.0,"Pune, Maharashtra, India",WHR is to make affordable healthcare a reality...
90,SBI Life,Insurance,Private Equity,742000000.0,"Mumbai, Maharashtra, India",SBI Life is one of the life insurance company ...
93,NoPaperForms Solutions Pvt. Ltd.,"EdTech, Education, Information Services, SaaS",Series B,3980000.0,"New Delhi, Delhi, India","NoPaperForms is a marketing automation, lead n..."
95,AuthMetrik,"B2B, Biometrics, Cyber Security, Fraud Detecti...",Grant,10000.0,"Gurgaon, Haryana, India","SaaS, B2B, Security, Stop account sharing, Fra..."
101,Swiggy,"Food Delivery, Food Processing, Internet",Series H,1000000000.0,"Bangalore, Karnataka, India",Swiggy is a food ordering and delivery company...


In [72]:
# check the information on the dataset
df_2018.shape

(203, 6)

In [None]:
# Clean the amount column inside the rupee data
# strip off currency symbols from the dataset
dollar_data["amount"] = dollar_data["amount"].str.replace("$","")

# remove all commas from the column 
dollar_data["amount"] = dollar_data["amount"].str.replace(",","")

# convert the amount column to float
dollar_data["amount"] = dollar_data["amount"].astype(float)
dollar_data.head()

From the analysis of the currencies in the amoount column, we have 144 amount in Rupees and 59 data with the dollar sign 

In [30]:
# convert the data type of amount to float
df_2018["amount"] = df_2018["amount"].astype(float) 

ValueError: could not convert string to float: '₹40,000,000'

In [11]:
# perform descriptive statistics on data
data.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
company_name,526.0,525.0,TheCollegeFever,2.0,,,,,,,
industry,526.0,405.0,—,30.0,,,,,,,
round/series,526.0,21.0,Seed,280.0,,,,,,,
amount,2533.0,754.0,—,148.0,,,,,,,
location,526.0,50.0,"Bangalore, Karnataka, India",102.0,,,,,,,
about_company,526.0,524.0,"TheCollegeFever is a hub for fun, fiesta and f...",2.0,,,,,,,
company/brand,89.0,87.0,Kratikal,2.0,,,,,,,
founded,2110.0,,,,2016.079621,4.368006,1963.0,2015.0,2017.0,2019.0,2021.0
headquarter,2239.0,123.0,Bangalore,764.0,,,,,,,
sector,2335.0,502.0,FinTech,173.0,,,,,,,


In [12]:
# check the information about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2879 entries, 0 to 1208
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   company_name   526 non-null    object 
 1   industry       526 non-null    object 
 2   round/series   526 non-null    object 
 3   amount         2533 non-null   object 
 4   location       526 non-null    object 
 5   about_company  526 non-null    object 
 6   company/brand  89 non-null     object 
 7   founded        2110 non-null   float64
 8   headquarter    2239 non-null   object 
 9   sector         2335 non-null   object 
 10  what_it_does   89 non-null     object 
 11  founders       2334 non-null   object 
 12  investor       2253 non-null   object 
 13  amount($)      89 non-null     object 
 14  stage          1415 non-null   object 
 15  company_brand  2264 non-null   object 
 16  what_it_does   2264 non-null   object 
 17  column10       2 non-null      object 
dtypes: float64(1)


### Observations
- The dataset consists of 18 columns in all with some duplicated columns
- It is assumed that the duplicated columns are; what_it_does, company_brand and company/brand
- The datatype of 17 of the columns are objects except the founded column which is a float 
- All the columns in the dataset consists of null values
- Column 10 has no real signinficant data and will be dropped in data cleaning
- Some rows have wrong information pertaining to the columns and will have to be discussed further
- Nulls in the dataset will also have to be discussed and dealt with


### Asumptions
- All amounts will be converted to USD in the Data Cleaning and EDA
 - 2018 amount column will assume the currency of USD
- Rename of df_2018 columns based on similarities in wording and comparison with other years


