# Introduction

In this project, we analyse the funding received by startups in India. The aim is to recommend the best course of action for a startup entering the Indian business ecosystem. Our first step is to gain a business understanding of the problem.

# Business Understanding

India has become an attractive location for investors and has seen a number of successful startups achieve the coveted "unicorn" status. To identify the best course of action for a new startup, we pose the following questions, which we will attempt to answer using the data on hand.

### Questions

- Does the age of the startup affect the funding received?
- Which sectors received the most funding?
- Does the number of founders affect the funding received?
- At what stage do startups receive the most funding?
- Does the location affect the funding received?

### Hypothesis

##### NULL: Startups in technology sectors are not more likely to be funded than startups in other sectors

##### ALTERNATE: Startups in technology sectors are more likely to be funded than startups in other sectors
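
One possible way to test these hypotheses is a chi-square test of independence with `chi2_contingency` from SciPy. The sketch below is only an illustration: the counts are placeholders rather than values from the dataset, the split into "technology" and "other" sectors is assumed, and the 0.05 significance level is an assumption as well.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Placeholder counts for illustration only; they are NOT taken from the dataset.
# Rows: technology sectors, other sectors. Columns: funded, not funded.
observed = np.array([
    [120, 30],
    [200, 90],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}, reject null: {reject_null}")
```

The chi-square test only indicates whether the two groups differ. Because the alternate hypothesis is directional, the funded rates of the two groups would also need to be compared to see which one is higher.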

## Setup

## Importation
This section imports all the packages and libraries used throughout this notebook.

In [2]:
# Data handling
import numpy as np 
import pandas as pd 

# Visualization (Matplotlib, Plotly, Seaborn, etc.)
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt 
%matplotlib inline 
import seaborn as sns 
sns.set_style('whitegrid')
plt.style.use("fivethirtyeight")

import plotly.express as px

# EDA and statistical tests
from scipy import stats
from scipy.stats import pearsonr
from scipy.stats import chi2_contingency


# Data Loading
This section loads the datasets and any additional files.

#### Load 2018 Data  

In [3]:
# Load the 2018 dataset, keeping only the columns needed for the analysis
startup_funding_2018 = pd.read_csv('startup_funding2018.csv',
                                   usecols=['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location'])

# Rename columns for consistency: Industry --> Sector, Round/Series --> Stage
startup_funding_2018.rename(columns={'Industry': 'Sector', 'Round/Series': 'Stage'}, inplace=True)

# Add the funding year as an integer column
startup_funding_2018['Funding Year'] = 2018

#### 2018 Data Exploration & Cleaning

In [4]:
#check shape of dataset
startup_funding_2018.shape

(526, 6)

In [5]:
#inspect dataset
startup_funding_2018.head()

| | Company Name | Sector | Stage | Amount | Location | Funding Year |
|---|---|---|---|---|---|---|
| 0 | TheCollegeFever | Brand Marketing, Event Promotion, Marketing, S... | Seed | 250000 | Bangalore, Karnataka, India | 2018 |
| 1 | Happy Cow Dairy | Agriculture, Farming | Seed | ₹40,000,000 | Mumbai, Maharashtra, India | 2018 |
| 2 | MyLoanCare | Credit, Financial Services, Lending, Marketplace | Series A | ₹65,000,000 | Gurgaon, Haryana, India | 2018 |
| 3 | PayMe India | Financial Services, FinTech | Angel | 2000000 | Noida, Uttar Pradesh, India | 2018 |
| 4 | Eunimart | E-Commerce Platforms, Retail, SaaS | Seed | — | Hyderabad, Andhra Pradesh, India | 2018 |


In [6]:
#check for null values
startup_funding_2018.isna().any()

Company Name    False
Sector          False
Stage           False
Amount          False
Location        False
Funding Year    False
dtype: bool
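
Note that `isna()` reports no missing values even though the `head()` output above shows a dash placeholder in `Amount`. The following minimal sketch, built on a small hand-typed sample that mirrors those five raw values, shows why a placeholder string is not counted as missing and how it could be counted separately. The name `sample` is used only for this illustration.

```python
import pandas as pd

# Small sample mirroring the raw 'Amount' values shown in the head() output.
sample = pd.DataFrame({"Amount": ["250000", "₹40,000,000", "₹65,000,000", "2000000", "—"]})

# isna() only detects real missing values, so the placeholder is not counted.
real_nulls = int(sample["Amount"].isna().sum())

# Counting the placeholder string directly reveals the hidden gap.
placeholder_nulls = int((sample["Amount"].str.strip() == "—").sum())

print(real_nulls, placeholder_nulls)
```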

In [7]:
# Keep only the city from the location (the text before the first comma)
startup_funding_2018['Location'] = startup_funding_2018.Location.str.split(',').str[0]
startup_funding_2018['Location'].head()

0    Bangalore
1       Mumbai
2      Gurgaon
3        Noida
4    Hyderabad
Name: Location, dtype: object
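
The cleaned data contains both "Bangalore" and "Bengaluru" as city names. If these should be treated as the same city (an assumption, not something the notebook states), the split could be followed by an alias mapping. The sketch below uses illustrative input strings and a hypothetical `city_aliases` dictionary.

```python
import pandas as pd

# Illustrative raw locations; the exact raw strings in the dataset may differ.
locations = pd.Series([
    "Bangalore, Karnataka, India",
    "Bengaluru, Karnataka, India",
    "Mumbai, Maharashtra, India",
])

# Keep only the text before the first comma, then trim stray whitespace.
city = locations.str.split(",").str[0].str.strip()

# Hypothetical alias map: treating 'Bengaluru' as 'Bangalore' is an assumption.
city_aliases = {"Bengaluru": "Bangalore"}
city = city.replace(city_aliases)

print(city.tolist())
```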

In [8]:
# Record the index of rows whose 'Amount' is in rupees; it is used later to convert those values to dollars
get_index = startup_funding_2018.index[startup_funding_2018['Amount'].str.contains('₹')]

In [9]:
#Check the summary information about the 2018 dataset 
startup_funding_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Company Name  526 non-null    object
 1   Sector        526 non-null    object
 2   Stage         526 non-null    object
 3   Amount        526 non-null    object
 4   Location      526 non-null    object
 5   Funding Year  526 non-null    int64 
dtypes: int64(1), object(5)
memory usage: 24.8+ KB


In [10]:
# Before converting 'Amount' to a numeric type, remove currency symbols and thousands separators
for symbol in ['₹', '$', ',']:
    startup_funding_2018['Amount'] = startup_funding_2018['Amount'].astype(str).str.replace(symbol, '', regex=False)

# Treat the '—' placeholder as a missing value
startup_funding_2018['Amount'] = startup_funding_2018['Amount'].replace('—', np.nan)


In [11]:
# Convert 'Amount' to a numeric type; values that cannot be converted become NaN

startup_funding_2018['Amount'] = pd.to_numeric(startup_funding_2018['Amount'], errors='coerce')

In [12]:
#Check the final dataset information. 
startup_funding_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Company Name  526 non-null    object 
 1   Sector        526 non-null    object 
 2   Stage         526 non-null    object 
 3   Amount        378 non-null    float64
 4   Location      526 non-null    object 
 5   Funding Year  526 non-null    int64  
dtypes: float64(1), int64(1), object(4)
memory usage: 24.8+ KB


In [13]:
# Convert the rupee amounts to dollars by multiplying them by the rupee-to-dollar rate
rupee_to_dollar_rate = 0.012
startup_funding_2018.loc[get_index, 'Amount'] *= rupee_to_dollar_rate

startup_funding_2018.loc[:, ['Amount']].head()


| | Amount |
|---|---|
| 0 | 250000.0 |
| 1 | 480000.0 |
| 2 | 780000.0 |
| 3 | 2000000.0 |
| 4 | NaN |
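
The cleaning steps above (flag rupee rows, strip symbols, convert to numbers, apply the exchange rate) could also be wrapped in a single helper. This is a sketch rather than the notebook's method: `clean_amount` and `RUPEE_TO_DOLLAR` are hypothetical names, and the `$1,600,000` input is an assumed raw format used only to exercise the dollar branch.

```python
import pandas as pd

RUPEE_TO_DOLLAR = 0.012  # same fixed rate as used in the notebook


def clean_amount(raw: pd.Series, rate: float = RUPEE_TO_DOLLAR) -> pd.Series:
    """Return amounts in dollars as floats; unparseable values become NaN."""
    text = raw.astype(str)
    # Remember which rows were quoted in rupees before the symbol is removed.
    is_rupee = text.str.contains("₹", regex=False)
    # Drop currency symbols and thousands separators.
    digits = text.str.replace(r"[₹$,]", "", regex=True).str.strip()
    amount = pd.to_numeric(digits, errors="coerce")
    # Keep dollar amounts as they are and scale only the rupee rows.
    return amount.where(~is_rupee, amount * rate)


sample = pd.Series(["250000", "₹40,000,000", "$1,600,000", "—"])
cleaned = clean_amount(sample)
print(cleaned)
```

Flagging the rupee rows inside the helper avoids relying on an index computed in an earlier cell, which matters if the cells are re-run out of order.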


In [14]:
#print the first 50 rows of the dataset 
startup_funding_2018.head(50)

| | Company Name | Sector | Stage | Amount | Location | Funding Year |
|---|---|---|---|---|---|---|
| 0 | TheCollegeFever | Brand Marketing, Event Promotion, Marketing, S... | Seed | 250000.0 | Bangalore | 2018 |
| 1 | Happy Cow Dairy | Agriculture, Farming | Seed | 480000.0 | Mumbai | 2018 |
| 2 | MyLoanCare | Credit, Financial Services, Lending, Marketplace | Series A | 780000.0 | Gurgaon | 2018 |
| 3 | PayMe India | Financial Services, FinTech | Angel | 2000000.0 | Noida | 2018 |
| 4 | Eunimart | E-Commerce Platforms, Retail, SaaS | Seed | NaN | Hyderabad | 2018 |
| 5 | Hasura | Cloud Infrastructure, PaaS, SaaS | Seed | 1600000.0 | Bengaluru | 2018 |
| 6 | Tripshelf | Internet, Leisure, Marketplace | Seed | 192000.0 | Kalkaji | 2018 |
| 7 | Hyperdata.IO | Market Research | Angel | 600000.0 | Hyderabad | 2018 |
| 8 | Freightwalla | Information Services, Information Technology | Seed | NaN | Mumbai | 2018 |
| 9 | Microchip Payments | Mobile Payments | Seed | NaN | Bangalore | 2018 |


In [15]:
# Inspect row 178, where 'Stage' holds a URL instead of a funding stage
startup_funding_2018.loc[178]

Company Name                                       BuyForexOnline
Sector                                                     Travel
Stage           https://docs.google.com/spreadsheets/d/1x9ziNe...
Amount                                                  2000000.0
Location                                                Bangalore
Funding Year                                                 2018
Name: 178, dtype: object

In [17]:
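# Row 178 has a URL in 'Stage'; replace it with an empty string
startup_funding_2018.loc[178, 'Stage'] = ''

# Blank out 'Undisclosed' stages as well
startup_funding_2018['Stage'] = startup_funding_2018['Stage'].apply(lambda x: str(x).replace('Undisclosed', ''))

startup_funding_2018.loc[178]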


Company Name    BuyForexOnline
Sector                  Travel
Stage                         
Amount               2000000.0
Location             Bangalore
Funding Year              2018
Name: 178, dtype: object
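
An empty string is not counted by `isna()`, so blanked stages will not appear in later null checks. An alternative, sketched below under the assumption that URLs and "Undisclosed" should both count as missing, is to mask them with `NaN`. The sample values, including the `example.com` URL, are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of 'Stage' values, including a stray URL.
stage = pd.Series(["Seed", "Undisclosed", "https://example.com/sheet", "Series A"])

# Treat URLs and the 'Undisclosed' label as missing.
looks_like_url = stage.str.startswith("http")
is_undisclosed = stage.str.strip().eq("Undisclosed")
stage_clean = stage.mask(looks_like_url | is_undisclosed, np.nan)

print(stage_clean.isna().sum())
```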

In [18]:
#find duplicates 
duplicate = startup_funding_2018[startup_funding_2018.duplicated()]

duplicate

| | Company Name | Sector | Stage | Amount | Location | Funding Year |
|---|---|---|---|---|---|---|
| 348 | TheCollegeFever | Brand Marketing, Event Promotion, Marketing, S... | Seed | 250000.0 | Bangalore | 2018 |


In [None]:
#drop duplicates 

startup_funding_2018 = startup_funding_2018.drop_duplicates(keep='first')
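
Dropping rows leaves gaps in the index (row 348 disappears here). The sketch below, on a small illustrative frame named `df`, shows how `duplicated(keep=False)` can list every copy of a duplicated row for review and how `reset_index(drop=True)` could rebuild a continuous index afterwards. Whether to reset the index is a design choice the notebook does not specify.

```python
import pandas as pd

# Illustrative frame with one repeated row.
df = pd.DataFrame({
    "Company Name": ["TheCollegeFever", "Happy Cow Dairy", "TheCollegeFever"],
    "Stage": ["Seed", "Seed", "Seed"],
    "Amount": [250000.0, 480000.0, 250000.0],
})

# keep=False flags every copy of a duplicated row, which helps manual review.
all_copies = df[df.duplicated(keep=False)]

# Drop later copies and rebuild a gap-free index.
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)

print(len(all_copies), len(deduped))
```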
