# Pandas and Exploratory Data Analysis (EDA)

It's important that you hone your Pandas and exploratory data analysis (EDA) skills before the session starts. If you are having trouble, Google is your best friend! If you are still having problems, ask your fellow Fellows for help through the Platform. Good luck!
<br> <br>
Begin by downloading the Crunchbase dataset on start-up investments, which can be found [here](https://drive.google.com/file/d/1zsjN1tGWdXPb4wf4eTM62usMciSV-0sX/view).

**Exercise:**
The first thing we should do is import the Pandas library. It will probably be helpful to give this library an alias, too. Then, import the dataset and give it a name!

In [6]:
import pandas as pd
df = pd.read_csv("./Crunchbase_Startup_Investment_Data.csv")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 7: invalid continuation byte

Some of you may have experienced a problem already - that's OK! We can deal.
<br><br>
The problem here is that the dataset is encoded in Latin-1, but Pandas has defaulted to UTF-8 encoding. Bad, pandas! But it's ok, you can correct for this by specifying the encoding in your command. <br><br>
*Pro tip:* If you're having trouble, try Googling your error messages. You are probably not the first to encounter any particular error.

In [7]:
df = pd.read_csv("./Crunchbase_Startup_Investment_Data.csv", encoding='latin1')
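To see why the `encoding` argument matters, here is a minimal sketch on a made-up, in-memory file (the company name and the `raw_bytes` variable are invented for illustration and are not taken from the Crunchbase data). Encoding an `ó` as Latin-1 produces the same `0xf3` byte that the error message above complains about.

```python
import io
import pandas as pd

# A tiny CSV saved as Latin-1; "\u00f3" is the letter o with an acute accent
raw_bytes = "company,raised\nComunicaci\u00f3n SA,1000\n".encode("latin1")

# Decoding those bytes as UTF-8 fails, just like the default read did
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Telling Pandas the real encoding lets the read succeed
toy = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin1")
print(utf8_ok)
print(toy)
```

If `latin1` produces odd-looking characters on some other file, that file probably uses a different encoding.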



Now that we have successfully imported the data, let's do some Exploratory Data Analysis! 

**Exercise:**<br>
Let's begin by displaying the first 5 rows of each column. <br>(*Hint: there is a special command for this!*)

In [None]:
print(df.head())


**Question:**<br>
How many columns are in this dataset? How many rows?

In [None]:
print("N of rows", len(df))
print("N of columns", len(df.columns))
df.info()  # info() prints its summary itself and returns None

**Exercise:**<br>
You'll probably notice that the command above actually truncates the number of columns it shows. This is to make display easier. However, we will definitely want to see each of the column names so that we know what kinds of data are available to us. Try pulling out all of the column names.

In [None]:
for col in df.columns:
    print(col)

**Question:**<br>
What are the data types in each column?

In [None]:
print(df.dtypes)
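As a quick illustration on a toy table (all values below are invented), `shape` gives the row and column counts in one call, and `dtypes` shows how Pandas stored each column. Note that a column of numbers written as text is stored as a text type (`object` in most Pandas versions) rather than a numeric one, which is worth keeping in mind for later.

```python
import pandas as pd

# Toy data: "raised" holds numbers written as text, plus a "-" placeholder
toy = pd.DataFrame({
    "name": ["Acme", "Bolt"],
    "raised": ["1,000", "-"],
    "year": [2010, 2012],
})

n_rows, n_cols = toy.shape            # (rows, columns)
column_names = toy.columns.tolist()   # plain Python list of column names

print(n_rows, n_cols)
print(column_names)
print(toy.dtypes)                     # "year" is numeric, "raised" is not
```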


One of the inevitable frustrations in working with large datasets is that they can be messy. Often, values can be missing. Values might be missing because they don't apply, or simply because they got lost in the shuffle (e.g. they weren't recorded, the data was corrupted, etc.). Missing values can take different forms in different datasets - and sometimes even multiple forms! One typical form is NaN, which stands for Not a Number. <br><br>
**Question:**<br>
How many NaN's appear in each column? How many total across columns?

In [None]:
print("NaNs per column:\n", df.isna().sum())
print("Total NaNs across all columns:", df.isna().sum().sum())
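Because missing values can show up in more than one form, it can help to tell Pandas about them when reading the file. The sketch below uses an invented four-row CSV in which a missing amount appears as an empty field, as `-`, and as `unknown`; these placeholder strings are assumptions for the example, so check what your own data actually uses.

```python
import io
import pandas as pd

raw = """company,raised
Acme,1000
Bolt,-
Cork,
Dune,unknown
"""

# na_values adds extra strings to the set Pandas already treats as missing
toy = pd.read_csv(io.StringIO(raw), na_values=["-", "unknown"])

per_column = toy.isna().sum()        # NaN count for each column
total = int(toy.isna().sum().sum())  # NaN count for the whole table

print(per_column)
print(total)
```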

**Exercise:**<br>
Let's take a look at all the columns that pertain to the amounts of money each company has
raised. How many columns are relevant? Can you pull them all out at once?

In [None]:
for col in df.columns:
    if 'raise' in col:
        print(col)


One of the first things that you should notice is that the column 'raised_amount_each' is
completely useless. This kind of thing is another unfortunate consequence of large datasets -
they can be messy, and sometimes data doesn't get filled in correctly.

Luckily, there is another column that can help us out here.
Let's take a look at 'raised_amount_total_usd'.

You've probably noticed that some rows contain numbers, while others contain NaN's.

**Question:**<br>
How many rows contain numbers?

In [None]:
print(df['raised_amount_total_usd'].notna().sum())



**Question:**<br>
How much money in total was raised across every company in this dataset?

In [None]:
print(df['raised_amount_total_usd'].sum())


Did you get an error? Oh noooooo! Can you piece together what happened from the TypeError?
What type of data appears in that column? What can you do to remedy it?

(*Hint: you'll need to convert these values, but this may be a 2-step process.
 You may need to remove certain elements first.*)

In [8]:
# Step 1: drop rows with no amount; .copy() makes an independent DataFrame and avoids SettingWithCopyWarning
df_new = df.dropna(subset=['raised_amount_total_usd']).copy()
# Step 2: strip whitespace and thousands separators from the text values
df_new['raised_amount_total_usd'] = df_new['raised_amount_total_usd'].astype(str).str.strip()
df_new['raised_amount_total_usd'] = df_new['raised_amount_total_usd'].str.replace(',', '', regex=False)
# Step 3: drop rows whose value contains "-", then convert the rest to float
df_new = df_new[~df_new['raised_amount_total_usd'].str.contains('-', regex=False)].copy()
df_new['raised_amount_total_usd'] = df_new['raised_amount_total_usd'].astype(float)
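As an alternative sketch (not the approach used in the cell above), `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into NaN instead of raising an error. The toy values below are invented; the point is only to show the two-step idea of cleaning the text first and converting second.

```python
import pandas as pd

# Toy amounts: thousands separators, stray spaces, a "-" placeholder, a missing value
amounts = pd.Series(["1,000,000", " 250,000 ", "-", None])

# Step 1: clean up the text
cleaned = amounts.str.strip().str.replace(",", "", regex=False)

# Step 2: convert; unparseable entries become NaN rather than errors
numeric = pd.to_numeric(cleaned, errors="coerce")

print(numeric)
print(numeric.sum())  # sum() skips NaN by default
```

The trade-off is that coercion silently hides bad values, so it is worth counting the NaNs before and after to see how many rows were affected.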


Ok, whew! Now that THAT'S done, we can return to our question.

**Question:**<br>
How much money in total was raised across every company in this dataset?

In [9]:
print(df_new['raised_amount_total_usd'].sum())



1280964574193.0


WOW! That's a lot of moola!! Does it make you want to start a business??
Let's pretend you said 'yes'. And, since you're no dummy,
I'm sure you would do the appropriate market research before crafting a business model.

**Question**:<br>
How many unique types of company markets are there? What are they?

In [None]:
print(df_new['company_market'].nunique())  # nunique() does not count NaN as a market

print(df_new['company_market'].dropna().unique())


As I'm sure you've guessed, not all of these markets received an equal share of investment
money. Let's try breaking investment down by different markets!

**Question:**<br>
How much money was invested in each company market?

(*Hint: You'll need to **group** the data **by** market type...*)

In [None]:
print(df_new.groupby('company_market')['raised_amount_total_usd'].sum())



It's good to know how much investment each market saw,
but we need a bit more organization here.
We don't want to build our startup in just ANY market, we want the HOTTEST market!

**Question:**<br>
Which company markets received the most investment money? Find the top 10.

In [None]:
print(df_new.groupby('company_market')['raised_amount_total_usd'].sum().nlargest(10))



**Question:**<br>
Which company markets received no investment money?
Can you find the bottom 3 markets to receive at least SOME investment money (aka more than $0)?

In [None]:
grouped = df_new.groupby('company_market')['raised_amount_total_usd'].sum()
print(grouped[grouped == 0])
# Filter out the $0 markets first, otherwise nsmallest would just return them
print(grouped[grouped > 0].nsmallest(3))
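The group-then-rank pattern from the last few questions can be sketched on a toy table (market names and amounts are invented). It sums funding per market, takes the largest totals, and then takes the smallest totals among markets that raised more than $0.

```python
import pandas as pd

toy = pd.DataFrame({
    "company_market": ["Biotech", "Biotech", "Games", "Games", "Legal", "Travel"],
    "raised_amount_total_usd": [5.0, 7.0, 1.0, 2.0, 0.0, 4.0],
})

# One total per market
totals = toy.groupby("company_market")["raised_amount_total_usd"].sum()

top_2 = totals.nlargest(2)                          # best-funded markets
unfunded = totals[totals == 0]                      # markets with no money at all
bottom_2_funded = totals[totals > 0].nsmallest(2)   # smallest non-zero totals

print(top_2)
print(unfunded)
print(bottom_2_funded)
```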

Fantastic work! Now we know which company markets to avoid, and which to pursue. 

Next, we will want to narrow down WHERE to build our startup.
After all, funding can change based on where our business is located!

**Question:**<br>
In which countries did startups in the top 10 company markets receive the most funding?

In [None]:
largest_10 = df_new.groupby('company_market')['raised_amount_total_usd'].sum().nlargest(10).index
# Keep only rows in the top 10 markets, then total the funding per country
in_top_markets = df_new[df_new['company_market'].isin(largest_10)]
print(in_top_markets.groupby('company_country_code')['raised_amount_total_usd'].sum().sort_values(ascending=False))

Woohoo! Go USA! But should we start our business in Maine?
In Florida? In Washington state? Let's try narrowing it down even further.

**Question:**<br>
Within the top country, which state received the most investment funding across the top 10 company markets?

In [None]:
# Rank countries by total funding within the top 10 markets
in_top_markets = df_new[df_new['company_market'].isin(largest_10)]
top_countries = in_top_markets.groupby('company_country_code')['raised_amount_total_usd'].sum().sort_values(ascending=False)
top_country = top_countries.index[0]
print(top_country)
# Restrict to the top country, then find its best-funded state
in_top_country = in_top_markets[in_top_markets['company_country_code'] == top_country]
print(in_top_country.groupby('company_state_code')['raised_amount_total_usd'].sum().nlargest(1))


Great! Now let's zoom in even further! 

**Question:**<br>
How about the cities in the top state?

In [None]:
in_top_country = df_new[(df_new['company_market'].isin(largest_10)) & (df_new['company_country_code'] == top_country)]
top_states = in_top_country.groupby('company_state_code')['raised_amount_total_usd'].sum().nlargest(1).index
print(top_states[0])
# Restrict to the top state, then rank its cities by total funding
in_top_state = in_top_country[in_top_country['company_state_code'] == top_states[0]]
print(in_top_state.groupby('company_city')['raised_amount_total_usd'].sum().sort_values(ascending=False))
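The country, state, city drill-down follows one repeating pattern: filter with a boolean mask, group, and pick the winner. Here is a minimal sketch on invented data (the shortened column names `country`, `state`, `city`, and `usd` are placeholders for this example, not the dataset's real column names).

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["USA", "USA", "USA", "CAN", "USA"],
    "state": ["CA", "CA", "MA", "ON", "MA"],
    "city": ["San Diego", "Palo Alto", "Boston", "Toronto", "Cambridge"],
    "usd": [10.0, 30.0, 15.0, 50.0, 20.0],
})

# Level 1: best-funded country (idxmax returns the label of the largest total)
top_country = toy.groupby("country")["usd"].sum().idxmax()

# Level 2: best-funded state inside that country
in_country = toy[toy["country"] == top_country]
top_state = in_country.groupby("state")["usd"].sum().idxmax()

# Level 3: cities inside that state, ranked by funding
in_state = in_country[in_country["state"] == top_state]
city_totals = in_state.groupby("city")["usd"].sum().sort_values(ascending=False)

print(top_country, top_state)
print(city_totals)
```

Saving each filtered table in its own variable keeps every step short and makes it easy to inspect the intermediate results.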



Are you surprised by the city? It turns out that our cofounder, Investra Q. McMoney, **hates**
the hot weather. But maybe there are other cities that would be good candidates for our startup...

**Question:**<br>
What are the top 5 cities in the USA for biotechnology company market investment funding?

In [None]:
print(df_new[(df_new['company_market']=='Biotechnology') & (df_new['company_country_code'] == top_country)].groupby('company_city')['raised_amount_total_usd'].sum().nlargest(5))


Fantastic! Looks like we have at least a few locations to scout!

In the meantime, we should consider the sources of funding. With that in mind,
let's turn our attention to investor markets.

**Question:**<br>
Which investor markets raised the most money?

In [None]:
print(df_new.groupby('investor_market')['raised_amount_total_usd'].sum().sort_values(ascending=False))


Are you surprised? How do you interpret the difference between company market investment and investor market investment? Why wouldn't these numbers be the same? Interpreting these kinds of apparent mismatches is super important, especially when it comes to generating actionable insights.
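One possible contributor to a mismatch like this, assuming the investor market column has missing values, is that `groupby` leaves out rows whose group key is NaN by default. The toy data below is invented; it only demonstrates that behaviour and is not a full explanation of the real dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "company_market": ["Software", "Software", "Games"],
    "investor_market": ["Finance", None, None],
    "usd": [10.0, 20.0, 5.0],
})

by_company = toy.groupby("company_market")["usd"].sum()
by_investor = toy.groupby("investor_market")["usd"].sum()  # NaN keys dropped

# dropna=False (available in pandas 1.1 and later) keeps the NaN group
by_investor_all = toy.groupby("investor_market", dropna=False)["usd"].sum()

print(by_company.sum(), by_investor.sum(), by_investor_all.sum())
```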

But we should go deeper here. Let's look at this data over time, shall we?

**Question:**<br>
What is the earliest year for which we have funding data?
What is the latest year?

In [None]:
print(df_new['funded_year'].min())
print(df_new['funded_year'].max())

Let's take a look at how the investor market has changed over time.
We don't want to get ourselves ensnared in a bubble!

**Question:**<br>
What investor market raised the most money in the earliest year for which we have data?

In [None]:
df_new = df_new.dropna(subset = ['investor_market'])
earliest = df_new['funded_year'].min()
print(earliest)
print(df_new[df_new['funded_year'] == earliest].groupby('investor_market')['raised_amount_total_usd'].sum().nlargest(1))

Did you get any results? Why not? Try to troubleshoot.

This is another problem with big datasets. Sometimes they can be sparser than they appear.

Any one particular year, especially one of the earlier years, may not have much representation in this dataset. One way to approach this problem, then, is to look at investor markets using larger temporal windows.

**Exercise:**
Look at money raised in different investor markets over larger windows of time.
How have the investor markets changed over time? What used to be hot? What's hot now?

In [14]:
# Convert the year to a datetime so it can be binned with pd.Grouper
df_new['funded_year'] = pd.to_datetime(df_new['funded_year'], format='%Y')
# Sum funding per investor market within 5-year bins
by_window = df_new.groupby([pd.Grouper(key="funded_year", freq="5Y"), 'investor_market'])['raised_amount_total_usd'].sum().reset_index()
print(by_window.sort_values(['funded_year','raised_amount_total_usd']))


    funded_year        investor_market  raised_amount_total_usd
0    1989-12-31        Venture Capital             2.500000e+06
2    1994-12-31                Finance             3.400000e+05
3    1994-12-31    Hardware + Software             1.300000e+07
1    1994-12-31              Education             1.755000e+07
4    1994-12-31                  Legal             1.755000e+07
..          ...                    ...                      ...
293  2014-12-31            Health Care             9.574701e+09
406  2014-12-31               Software             1.149453e+10
314  2014-12-31  Investment Management             2.721440e+10
283  2014-12-31                Finance             6.690651e+10
429  2014-12-31        Venture Capital             7.401890e+10

[443 rows x 3 columns]
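If you would rather keep the year as a plain integer, a different way to build 5-year windows is integer division. This is a swapped-in technique, not what the cell above does, and its bins start at multiples of 5 rather than ending on the year-end dates shown in the output. The toy values and the `window_start` column are invented for the sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    "funded_year": [1991, 1993, 1996, 1999, 2003],
    "investor_market": ["Finance", "Finance", "Legal", "Finance", "Legal"],
    "usd": [1.0, 2.0, 4.0, 8.0, 16.0],
})

# 1991 -> 1990, 1996 -> 1995, 2003 -> 2000, and so on
toy["window_start"] = (toy["funded_year"] // 5) * 5

# Total funding per window and investor market
windowed = toy.groupby(["window_start", "investor_market"])["usd"].sum()
print(windowed)
```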


Looks like some investor markets have changed,
but others are very consistent!

**Exercise:**<br>
How does the investor market compare to company markets
for these same windows of time?

In [15]:
print(df_new.groupby([pd.Grouper(key="funded_year", freq="5Y"), 'company_market'])['raised_amount_total_usd'].sum().reset_index().sort_values(['funded_year','raised_amount_total_usd']))


     funded_year       company_market  raised_amount_total_usd
0     1979-12-31  Enterprise Software             2.000000e+06
1     1989-12-31             Software             2.500000e+06
4     1994-12-31                Games             6.000000e+04
3     1994-12-31           Consulting             3.350000e+06
5     1994-12-31             Software             4.134000e+07
...          ...                  ...                      ...
868   2014-12-31          Health Care             3.706816e+10
678   2014-12-31     Clean Technology             3.960144e+10
775   2014-12-31           E-Commerce             4.242423e+10
1168  2014-12-31             Software             4.482451e+10
645   2014-12-31        Biotechnology             9.838299e+10

[1273 rows x 3 columns]


Congratulations, smarties!
You've made it to the end of this introduction to Pandas!

But this is really only where exploratory data analysis begins. The next step in EDA is data visualization. Try to come up with different data visualizations for these data. Data viz can often shed light on some surprising aspects of your data, and can inspire whole new analyses that you might not have otherwise expected.


Here are some ideas for data stories you can tell using visualizations:<br>
* Try plotting a time series of funding by company or investor market. 
* Which industries are receiving the most funding?<br> 
* Are there differences in the funding structures of different industries?<br>
* What is the geographical distribution of funding?<br>
* How has startup funding changed over time?<br>
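As a starting point for a time-series plot, it often helps to reshape the data so that each market is its own column. The sketch below does that with `pivot_table` on invented numbers; calling `.plot()` on the result would then draw one line per market, assuming matplotlib is installed.

```python
import pandas as pd

toy = pd.DataFrame({
    "funded_year": [2010, 2010, 2011, 2011, 2011],
    "company_market": ["Biotech", "Games", "Biotech", "Biotech", "Games"],
    "usd": [5.0, 1.0, 2.0, 3.0, 4.0],
})

# Rows are years, columns are markets, cells are total funding
wide = toy.pivot_table(
    index="funded_year",
    columns="company_market",
    values="usd",
    aggfunc="sum",
    fill_value=0,
)
print(wide)
# wide.plot() would draw one line per market (requires matplotlib)
```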


But EDA is really only the jumping-off point for real Data Science. The bread and butter of DS is data analysis, to which none of you are strangers. Try your hand at some analysis! Begin by importing some additional libraries. What kind of machine learning algorithms can you apply? Think carefully about why you would use one ML technique over another. This is a critical skill: companies will care about how you think about data. Your advanced degree is a big leg-up here: you've had experience thinking deeply about complex problems and the appropriate analyses to apply to them. What can you come up with?


Here are some ideas of analyses or avenues of investigation:
* How does early round funding impact the future success of a company?
* Does geographical location affect the funding or future success of a company?
* How significant are the results?
* Does it qualitatively make sense?


Discuss these topics with your fellow Fellows on the Platform:
* Why are your results interesting?
* Could you imagine a useful product on top of this?
* From a technical point-of-view, what was challenging about dealing with this dataset?
* What were the hardest points or roadblocks along the way?
* Are there any secondary data sources you can call upon to gain further insight?
* Where did you make wrong turns?
* What aspects of analysis did you get stuck on?
* How would you approach your workflow differently?
* Could task sharing or communication be streamlined?
