# INDIAN STARTUP ECOSYSTEM DATA ANALYSIS

This notebook walks through a simple data understanding and data preparation (cleaning) process on the 2019 Indian startup ecosystem dataset. It uses the CRISP-DM methodology as its guiding framework; both phases are core parts of CRISP-DM. The steps involved in each phase are:

## Data Understanding

* Collect initial data
* Describe data
* Explore data
* Verify data quality

## Data Preparation

* Select data
* Clean data
* Construct data
* Format data





In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.impute import SimpleImputer

import statsmodels.api as sm
from statsmodels.formula.api import ols

import warnings
warnings.filterwarnings('ignore')


In [None]:
file_path = '/Users/florenceaffoh/Downloads/startup_funding2019.csv'
startup_2019 = pd.read_csv(file_path)

startup_2019.tail()

## Exploratory Data Analysis

In [None]:
print(startup_2019.shape)
print(startup_2019.isnull().sum())


In [None]:
startup_2019.columns

In [None]:
startup_2019.nunique()

In [None]:
startup_2019.describe()


In [None]:
startup_2019.dtypes

In [None]:
startup_2019['Amount($)'].unique()

In [None]:
startup_2019['Sector'].unique()

In [None]:
startup_2019['Company/Brand'].duplicated().sum()

duplicated_companies = startup_2019[startup_2019['Company/Brand'].duplicated()]
print(duplicated_companies['Company/Brand'])


In [None]:
startup_2019['HeadQuarter'].duplicated().sum()

In [None]:
startup_2019['Stage'].unique()

## Dealing with Missing Values
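
The cleaning and imputation idea used in the cells below can be previewed on a tiny example. This is only a sketch under stated assumptions: the `toy` DataFrame and its values are hypothetical, and plain pandas `fillna` stands in for `SimpleImputer` (mean for the numeric column, most frequent value for the categorical one).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "Amount($)": ["$1,000,000", "Undisclosed", "$500,000", np.nan],
    "Sector": ["Fintech", np.nan, "Edtech", "Fintech"],
})

# Keep digits and dots only, then convert. Entries with no digits turn
# into empty strings, which to_numeric coerces to NaN.
toy["Amount($)"] = pd.to_numeric(
    toy["Amount($)"].str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

# Numeric column: fill missing values with the column mean.
toy["Amount($)"] = toy["Amount($)"].fillna(toy["Amount($)"].mean())

# Categorical column: fill missing values with the most frequent value.
toy["Sector"] = toy["Sector"].fillna(toy["Sector"].mode()[0])

print(toy)
```

Because the columns are filled in place, the rows stay aligned and nothing has to be recombined afterwards.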

In [None]:
# Strip the '$' sign, commas, and any other non-numeric characters.
# Entries that contain no digits become empty strings.
startup_2019['Amount($)'] = startup_2019['Amount($)'].str.replace(r'[^\d.]', '', regex=True)

# Convert the 'Amount($)' column to a numeric dtype. Empty strings are coerced
# to NaN; if any NaN remains, the column stays float64 despite downcast='integer'.
startup_2019['Amount($)'] = pd.to_numeric(startup_2019['Amount($)'], errors='coerce', downcast='integer')

# Verify the updated dtype of the 'Amount($)' column
print(startup_2019['Amount($)'].dtype)


In [None]:
startup_2019['Amount($)'].unique()


In [None]:
startup_2019.columns


In [None]:
num_col = ['Amount($)']
cat_cols = ['Company/Brand', 'HeadQuarter', 'Sector', 'What it does', 'Founders', 'Investor','Stage','Founded']

In [None]:
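# Mean imputation for the numeric column, most-frequent imputation for the categorical columns
num_imputer = SimpleImputer(strategy='mean').fit(startup_2019[num_col])
cat_imputer = SimpleImputer(strategy='most_frequent').fit(startup_2019[cat_cols])

In [None]:
# Rebuild DataFrames from the imputed arrays, keeping the original row index
startup_2019_num_imputed = pd.DataFrame(num_imputer.transform(startup_2019[num_col]), columns=num_col, index=startup_2019.index)
cat_num_imputed = pd.DataFrame(cat_imputer.transform(startup_2019[cat_cols]), columns=cat_cols, index=startup_2019.index)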

In [None]:
startup_2019_num_imputed

In [None]:
cat_num_imputed

In [None]:
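# axis=1 joins the imputed numeric and categorical columns side by side;
# the default axis=0 would stack them as extra rows and reintroduce NaNs
startup_2019 = pd.concat([startup_2019_num_imputed, cat_num_imputed], axis=1)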

startup_2019

## Hypothesis

Null Hypothesis: The location of the startup does not influence the amount of funding it would receive.

Alternate Hypothesis: The location of the startup influences the amount of funding it would receive.
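
When more than two locations are compared, one possible way to test a hypothesis of this form is a one-way ANOVA on funding amounts grouped by headquarters. The sketch below is illustrative only: the `toy` data, the city labels, and the 0.05 significance level are assumptions, not results from the dataset.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data; the amounts are made up for illustration only.
toy = pd.DataFrame({
    "HeadQuarter": ["Bangalore"] * 3 + ["Mumbai"] * 3 + ["Delhi"] * 3,
    "Amount($)": [1.0e6, 2.0e6, 1.5e6, 3.0e6, 2.5e6, 3.5e6, 0.5e6, 1.0e6, 0.8e6],
})

# One array of funding amounts per location.
groups = [g["Amount($)"].to_numpy() for _, g in toy.groupby("HeadQuarter")]

# One-way ANOVA: the null hypothesis is that all locations share the same mean.
f_statistic, p_value = stats.f_oneway(*groups)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

print("F-Statistic:", f_statistic)
print("P-Value:", p_value)
print("Reject null hypothesis:", reject_null)
```

Funding amounts are often heavily skewed, so a rank-based alternative such as `scipy.stats.kruskal` may be worth considering on the real data.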

## Questions

1. Which sector receives the most and the least funding?
2. Does the year the startup was funded influence the amount of funding it would receive?
3. Which startups, by age, were awarded the most and the least funding?
4. Does the number of founders in a startup influence the amount of funding it would receive?
5. What are the top 10 highest amounts of funding given to a startup?
6. (Skipped; this question was about the funding stages.)
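
Questions 1 and 5 could be approached with a `groupby` aggregation and `nlargest`. The sketch below uses a hypothetical `toy` DataFrame whose column names mirror the dataset; the company names and amounts are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; names and amounts are made up for illustration only.
toy = pd.DataFrame({
    "Company/Brand": ["A", "B", "C", "D", "E"],
    "Sector": ["Fintech", "Edtech", "Fintech", "Healthtech", "Edtech"],
    "Amount($)": [5_000_000, 1_000_000, 3_000_000, 500_000, 2_000_000],
})

# Question 1: total funding per sector, sorted from highest to lowest.
funding_by_sector = (
    toy.groupby("Sector")["Amount($)"].sum().sort_values(ascending=False)
)
top_sector = funding_by_sector.index[0]
bottom_sector = funding_by_sector.index[-1]

# Question 5: the ten largest individual funding amounts
# (fewer rows are returned if the data has fewer than ten).
top_rounds = toy.nlargest(10, "Amount($)")[["Company/Brand", "Amount($)"]]

print(funding_by_sector)
print(top_rounds)
```

Whether to rank sectors by total, mean, or median funding is a design choice; totals favour sectors with many deals, while medians are less sensitive to a few very large rounds.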

In [None]:
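# Perform Welch's t-test on funding amounts for two location groups.
# Assumption: the most frequent headquarters city is compared with all other cities.
amounts = pd.to_numeric(startup_2019['Amount($)'], errors='coerce')
top_city = startup_2019['HeadQuarter'].mode()[0]
in_top_city = startup_2019['HeadQuarter'] == top_city

group1 = amounts[in_top_city].dropna()
group2 = amounts[~in_top_city].dropna()

t_statistic, p_value = stats.ttest_ind(group1, group2, equal_var=False)

# Print the results
print("Compared:", top_city, "vs. all other locations")
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)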
