<a href="https://colab.research.google.com/github/birkenkrahe/py/blob/main/unicorn_python_problem.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code along in Python: unicorn companies

This is a demo notebook using Python 3 (Python Software Foundation, 2023) and the IPython shell (Pérez & Granger, 2023) as part of a Colab exercise originally developed for a DataCamp workspace (Schouwenaars & Cotton, 2021). We use the "unicorns" dataset (CB Insights, 2023) to demonstrate the following data analytics steps:

1. Import CSV data as `pandas` data frame and look at the headers.
2. View the `unique` values of the data frame column `Category`.
3. Clean the data frame column `Category` by eliminating errors.
4. Group the records by industry `Category` and count them.
5. Plot the `Category` count by industry as a barplot.

## How this works

You get a prepared notebook that you open in your private Colab workspace. We'll work through the notebook together to learn how to work with interactive notebooks. A complete solution notebook is available here: https://tinyurl.com/mrykukbw - when you open the link, the notebook is uploaded to your private workspace, and any changes you make are saved to your `Colab Notebooks` folder in `GDrive`.

## Unicorn Companies

A unicorn company is a privately held company with a current valuation of over US$1 billion. In this workspace, we'll be looking at a dataset that consists of unicorn companies and startups across the globe as of November 2021, including country of origin, sector, select investors, and valuation of each unicorn. Former unicorn companies that have since exited due to IPO or acquisitions are not included in this list.

## Import CSV data as `pandas` data frame and look at the headers.

In the following code block, we import `pandas` and alias it as `pd`. Then we read the CSV file from the current directory using the `pd.read_csv` function. We assign the result to a data frame `df`. We then print the data frame by entering its name. You run a code cell with `CTRL-Enter` (or with `SHIFT-Enter` if you want to go to the next cell).

*Note: you first have to upload the CSV file to Colab as `unicorn_companies.csv`. Download the file [from this link](https://drive.google.com/file/d/1z8chiWMfIrdRnGerPAbq3xGNLSEddfiP/view?usp=sharing) to your PC, then upload it here.*

In [None]:
# import pandas

# read CSV file

# show data frame


Notice that the output does not show the entire 917-row data frame but only a selection of rows. On the right-hand side of the table you find two icons: one for the default table format, and one to "suggest a graph".
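
As a minimal sketch of the import step described above, the following block reads CSV data into a data frame and inspects the headers. To keep it self-contained, it reads from an in-memory string instead of the uploaded file; the column names `Company` and `Country` and all rows are invented for illustration, and only `Category` is taken from this exercise.

```python
import io
import pandas as pd

# Hypothetical stand-in for unicorn_companies.csv: these rows and the
# 'Company' and 'Country' columns are made up for illustration only.
csv_text = """Company,Category,Country
Alpha,Fintech,United States
Beta,Artificial intelligence,China
Gamma,Finttech,Germany
Delta,Artificial Intelligence,India
"""

# pd.read_csv accepts a file path or a file-like object; with the uploaded
# file you would pass 'unicorn_companies.csv' instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text))

# Look at the headers and at the dimensions (rows, columns)
print(df.columns.tolist())
print(df.shape)

# In a notebook, a bare name on the last line of a cell displays the table
df
```

With the real file, the same calls apply unchanged once the `StringIO` object is replaced by the file name.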

## View the `unique` values of the data frame column `Category`.

In the next code block, we use another `pandas` function, `pd.unique`. This fully qualified form (`pd.unique`) is what you need if you want to look up information about the function using Python's `help`. You can get help with `?` in a sidebar, or with `help` as regular output, or you can get the same information [via the web](https://pandas.pydata.org/docs/reference/api/pandas.unique.html).

In [None]:
# show help in sidebar


In [None]:
# show help in output cell


In the next code block, we use the `[]` operator to access only the `'Category'` column of the data frame `df`, and then call `unique` on it. Notice that the call ends in empty parentheses `()`, which means that it is made without arguments.

In [None]:
# Print out all categories
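
One way this cell could be filled in, sketched on a small made-up data frame (only the column name `Category` comes from the exercise). It shows both the method form, which takes no arguments, and the function form `pd.unique`, which takes the column as its argument.

```python
import pandas as pd

# Made-up sample data; the real data frame comes from unicorn_companies.csv
df = pd.DataFrame({
    "Category": ["Fintech", "Artificial intelligence", "Fintech",
                 "Finttech", "Artificial Intelligence"]
})

# Method form: select the column with [] and call unique() without arguments
categories = df["Category"].unique()

# Function form: pass the column to pd.unique
same_categories = pd.unique(df["Category"])

# Both return an array of the distinct values in order of first appearance
print(categories)
```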


The formatting is somewhat strange and contains information we don't need right now. In the next code block, we print only one value per line, using the `print` function inside a `for` loop that runs over the values of the `array`:

In [None]:
# Print out all categories - one per line


Just for completeness: you can also use a *list comprehension* with a dummy variable `_`. Here, `[]` is not the index operator but the delimiter of the `list` data structure. Lastly, the `;` at the end suppresses the cell output in the IPython shell that we're using. Without it, IPython displays the list of `None` values returned by `print` after the desired output.

In [None]:
# print unique categories as a list
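
The two printing variants can be compared side by side in a short sketch. The sample values are made up, and the name `results` is introduced here only to make the `None` values visible.

```python
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({"Category": ["Fintech", "Health", "Fintech", "E-commerce"]})
categories = df["Category"].unique()

# Variant 1: a for loop prints one value per line
for category in categories:
    print(category)

# Variant 2: a list comprehension used only for its side effect.
# print() returns None, so the comprehension builds a list of None values.
results = [print(_) for _ in categories]
```

In a notebook cell, writing the comprehension on the last line without an assignment would display `[None, None, None]` below the printed values; a trailing `;` hides that list.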


It turns out there are some duplicates because of typos and different capitalization. Let's clean those up.

## Clean the data frame column `Category` by eliminating errors.

We want to replace all occurrences of `'Artificial intelligence'` with `'Artificial Intelligence'`, and all occurrences of `'Finttech'` with `'Fintech'`. We can use the function `pd.DataFrame.replace` to do this. You can look up the help with `?` or `help` ([or here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html)). This function has many arguments but you only need two: the value you wish to replace (`to_replace`) and the value to replace it with (`value`). In Python, we can chain these two tasks with the "attribute access" or "dot" operator `.`, because each call to `replace` returns a new data frame. We save the cleaned data frame in `df_clean` and print out its categories using a list comprehension.

*Note that the command goes over two lines. To make sure that Python understands that, we end a continued line with `\`.*

In [None]:
# Replace wrong text cells

# Add another printout here
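
A minimal sketch of the cleaning step on made-up data: two chained `replace` calls, continued over two lines with `\`, followed by a printout of the cleaned categories. The `Company` column and all rows are invented for illustration.

```python
import pandas as pd

# Made-up sample containing the two errors named in the text
df = pd.DataFrame({
    "Company": ["Alpha", "Beta", "Gamma", "Delta"],
    "Category": ["Fintech", "Artificial intelligence",
                 "Finttech", "Artificial Intelligence"],
})

# Each replace() returns a new data frame, so the calls can be chained;
# the backslash tells Python that the statement continues on the next line.
df_clean = df.replace(to_replace="Artificial intelligence", value="Artificial Intelligence") \
             .replace(to_replace="Finttech", value="Fintech")

# Print the cleaned categories, one per line
[print(_) for _ in df_clean["Category"].unique()];
```

The original `df` is left unchanged, so the uncleaned values remain available for comparison.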


Much better! With the categories cleaned up, let's see how many unicorns there are in each category.

## Group the records by industry `Category` and count them.

To find out how many unicorn companies there are in each `Category` (aka industry), we group the corresponding records using the function `pd.DataFrame.groupby`. The command in the code cell below performs several operations on the `df_clean` data frame:

`groupby(by = 'Category', as_index = False)`: This groups the data frame by the `'Category'` column. The `as_index = False` parameter ensures that the result retains `'Category'` as a column rather than using it as an index.

`size()`: After grouping, this function computes the size of each group, that is, the number of rows for each `'Category'`. With the default `as_index = True`, `size()` returns a `pd.Series` (a vector or 1-dim array) indexed by group; with `as_index = False`, as here, it returns a data frame that holds the counts in a column named `size`.

`sort_values(by = ['size'])`: This sorts the result by the `size` column, in ascending order by default.

In short, because of `as_index = False` the counts appear as a new column next to `'Category'`, and by default this column is named `size`.

In [None]:
# group data by category

# show data frame of categories and their size
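
The grouping chain described above can be sketched on a small made-up data frame as follows. The sample uses distinct group sizes so that the sort order is unambiguous, and it assumes a reasonably recent `pandas` version in which `size()` honors `as_index = False`.

```python
import pandas as pd

# Made-up cleaned data: three categories with 3, 2 and 1 records
df_clean = pd.DataFrame({
    "Category": ["Fintech", "Fintech", "Fintech",
                 "Artificial Intelligence", "Artificial Intelligence",
                 "Health"]
})

# Group by category, count the rows per group, sort by the count column
category_counts = df_clean.groupby(by="Category", as_index=False).size() \
                          .sort_values(by=["size"])

# show data frame of categories and their size
print(category_counts)
```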


The result, `category_counts`, is a pandas data frame with two columns, sorted by group size rather than alphabetically. When you let Colab suggest a graph, you get a line plot, a histogram (distribution) and a time series. `type` returns the type (class) of its argument, and `pd.DataFrame.shape` is an attribute of the data frame that contains its dimensions as a tuple (rows, columns).

In [None]:
# show the data type of category_counts

# show the dimension of category counts
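
A short sketch of both inspections, using a hand-built stand-in for `category_counts` (the values are made up):

```python
import pandas as pd

# Hand-built stand-in for the grouped result; values are illustrative only
category_counts = pd.DataFrame({
    "Category": ["Health", "Artificial Intelligence", "Fintech"],
    "size": [1, 2, 3],
})

# type() is a function call; shape is an attribute, so it takes no parentheses
print(type(category_counts))
print(category_counts.shape)
```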


## Plot the `Category` count by industry as a barplot.

The last table `category_counts` is a data frame with a categorical and a numerical frequency column. To visualize the relationship between these, a barplot is most suitable.

There are many different graphics packages available. The one most often mentioned is `matplotlib`. It is a great package to get a quick overview but you usually need to customize the graphs quite a bit before they look publishable.

Instead, we use the `plotly` package, which has an `express` module that does most of the heavy lifting for us. All it needs is the data, the names of the x and y columns, and a title:

In [None]:
# import plotly.express

# Create a bar plot of category group size vs. category


Compare the result when using `matplotlib.pyplot`: instead of one line, we need several lines of code to get a similarly appealing result. However, as mentioned above, `matplotlib` is the way to go for quick data exploration.

In [None]:
# import matplotlib.pyplot

# plot category group size vs. Category

# rotate the x ticks by 90 degrees to make them readable

# add a title

# label the y-axis

# draw a grid to increase readability

# show the final plot
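
The steps listed in this cell could look roughly like the following sketch, which uses made-up counts. The `Agg` backend line is an assumption added so that the block also runs outside a notebook; in Colab it can be omitted. The names `bars` and `ax` are introduced here only to keep handles on the plot.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in Colab
import matplotlib.pyplot as plt
import pandas as pd

# Hand-built stand-in for the grouped result; values are illustrative only
category_counts = pd.DataFrame({
    "Category": ["Health", "Artificial Intelligence", "Fintech"],
    "size": [1, 2, 3],
})

# plot category group size vs. Category
bars = plt.bar(category_counts["Category"], category_counts["size"])
ax = plt.gca()  # handle on the current axes

# rotate the x ticks by 90 degrees to make them readable
plt.xticks(rotation=90)

# add a title
plt.title("Number of unicorn companies per category")

# label the y-axis
plt.ylabel("Number of companies")

# draw a grid to increase readability
plt.grid(True)
```

In a notebook, a final `plt.show()` displays the finished figure.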


## References

CB Insights. The Complete List of Unicorn Companies. CB Insights. Published 2023. Accessed August 19, 2023. https://www.cbinsights.com/research-unicorn-companies

Google LLC. Google Colaboratory. Accessed August 19, 2023. https://colab.research.google.com

Pérez F, Granger BE. IPython (Version 8.14.0). IPython Development Team. Published 2023. Accessed August 19, 2023. https://ipython.org

Python Software Foundation. Python (Version 3.8.10). Python Software Foundation. Published 2021. Accessed August 19, 2023. https://www.python.org

Schouwenaars F, Cotton R. Unicorn companies. DataCamp. Published 2022. Accessed August 19, 2023. http://bit.ly/ws-unicorn



References are formatted in [AMA style](https://academic.oup.com/amamanualofstyle):

- The names of all authors are inverted (the last name precedes the initials of the first and middle names).
- Authors are separated by commas; AMA style places no "and" or ampersand (&) before the last author.
- The title of the work is followed by the name of the website or publisher.
- The publication year follows the publisher and is followed by the access date.
- The URL is the final component of the citation.