In [None]:
# <<Jupyter notebook for beginners: A tutorial>>
# https://www.dataquest.io/blog/jupyter-notebook-tutorial/
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
import time
time.sleep(3)

The cell above doesn't produce any output, but it does take three seconds to execute. Notice how Jupyter signifies that the cell is currently running by changing its label to In [*].

In general, the output of a cell comes from any text data specifically printed during the cell's execution, as well as the value of the last line in the cell, be it a lone variable, a function call, or something else. For example:

In [None]:
def say_hello(recipient):
    return 'Hello, {}!'.format(recipient)

#say_hello('Tim')
a = 'Hello Time'  # An assignment is a statement, so it produces no output
a                 # A bare expression on the last line is displayed as output
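
To make the distinction concrete, here is a small sketch in which one line prints text and the last line is a bare expression. The function name `describe` is made up for illustration. In a notebook, the printed text appears as output and the value of the final expression is echoed beneath it, while the assignment in between displays nothing.

```python
def describe(recipient):
    # print() writes text to the cell's output as a side effect
    print('Building a greeting for', recipient)
    # the return value is only echoed if the call ends up as the cell's last expression
    return 'Hello, {}!'.format(recipient)

greeting = describe('Tim')  # assignment: the printed line appears, but no value is echoed
greeting                    # bare expression on the last line: the notebook echoes its value
```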

# Markdown
Markdown is a lightweight, easy to learn markup language for formatting plain text. Its syntax has a one-to-one correspondence with HTML tags, so some prior knowledge here would be helpful but is definitely not a prerequisite. Remember that this article was written in a Jupyter notebook, so all of the narrative text and images you have seen so far were written in Markdown. Let's cover the basics with a quick example.


# This is a level 1 heading
## This is a level 2 heading
This is some plain text that forms a paragraph.
Add emphasis via **bold** and __bold__, or *italic* and _italic_.

Paragraphs must be separated by an empty line.

* Sometimes we want to include lists.
  * Which can be indented.

1. Lists can also be numbered.
2. For ordered lists.

[It is possible to include hyperlinks](https://www.example.com)

Inline code uses single backticks: `foo()`, and code blocks use triple backticks:

```
bar()
```

Or they can be indented by 4 spaces:

    foo()

And finally, adding images is easy: ![Alt text](https://www.example.com/image.jpg)

When attaching images, you have three options:

* Use a URL to an image on the web.
* Use a local URL to an image that you will be keeping alongside your notebook, such as in the same git repo.
* Add an attachment via "Edit > Insert Image"; this will convert the image into a string and store it inside your notebook .ipynb file. Note that this will make your .ipynb file much larger!

There is plenty more detail to Markdown, especially around hyperlinking, and it's also possible to simply include plain HTML. Once you find yourself pushing the limits of the basics above, you can refer to the official guide from Markdown's creator, John Gruber, on his website.

In [None]:
df = pd.read_csv('fortune500.csv') 

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
len(df)

In [None]:
df.columns

In [None]:
df.columns[0]

In [None]:
df.columns = ['year', 'rank', 'company', 'revenue', 'profit']

In [None]:
df.columns

In [None]:
df.dtypes

Uh oh. It looks like there's something wrong with the profit column: we would expect it to be a float64 like the revenue column. This indicates that it probably contains some non-numeric values, so let's take a look.
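
To see why a single stray string changes the dtype of a whole column, consider this small sketch. The CSV text and the name `sample` are made up for illustration and are not taken from fortune500.csv. Because 'N.A.' is not one of the markers pandas treats as missing by default, the column is kept as text rather than parsed as numbers (reported as `object` in most pandas versions).

```python
import io
import pandas as pd

# Made-up miniature of the data: one profit value is the placeholder 'N.A.'
csv_text = (
    "year,revenue,profit\n"
    "1955,9823.5,806\n"
    "1955,5661.4,584.8\n"
    "1956,3000.0,N.A.\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

# revenue parses cleanly as numbers; profit does not, because of the one string
revenue_is_numeric = pd.api.types.is_numeric_dtype(sample['revenue'])
profit_is_numeric = pd.api.types.is_numeric_dtype(sample['profit'])
print(sample.dtypes)
```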

In [None]:
non_numberic_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numberic_profits].head()

Just as we suspected! Some of the values are strings, which have been used to indicate missing data. Are there any other values that have crept in?

In [None]:
set(df.profit[non_numberic_profits])

That makes it easy to interpret, but what should we do? Well, that depends on how many values are missing.

In [None]:
len(df.profit[non_numberic_profits])
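
The count on its own is easier to judge as a share of all rows. As a minimal sketch on made-up values (the Series `profit` below is a stand-in, not the real column), the mean of a boolean mask gives that share directly:

```python
import pandas as pd

# Made-up profit values standing in for the real column
profit = pd.Series(['806', 'N.A.', '584.8', '12.5', '3.1', '-4.0', '7', '19.9'])

# Same pattern as above: any character other than a digit, '.', or '-'
flagged = profit.str.contains('[^0-9.-]')

flagged_count = int(flagged.sum())     # number of non-numeric entries
flagged_share = float(flagged.mean())  # the same count as a fraction of all rows
print(flagged_count, flagged_share)
```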

It's a small fraction of our data set, though not completely inconsequential as it is still around 1.5%. If rows containing N.A. are, roughly, uniformly distributed over the years, the easiest solution would just be to remove them. So let's have a quick look at the distribution.

In [None]:
bin_sizes, _, _ = plt.hist(df.year[non_numberic_profits], bins=range(1955, 2006))

At a glance, we can see that the largest number of invalid values in a single year is fewer than 25, and as there are 500 data points per year, removing these values would account for less than 5% of the data for the worst years. Indeed, other than a surge around the 90s, most years have fewer than half the missing values of the peak. For our purposes, let's say this is acceptable and go ahead and remove these rows.
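
Rather than reading the worst case off the plot by eye, the per-year counts can also be computed. The sketch below uses a made-up list of years (not the real data) and assumes 500 rows per year, as stated above; `invalid_years` and `rows_per_year` are illustrative names.

```python
import pandas as pd

# Made-up stand-in: the year of each row whose profit was flagged as invalid
invalid_years = pd.Series([1985, 1990, 1990, 1991, 1991, 1991, 1998])
rows_per_year = 500  # assumption taken from the text: 500 data points per year

# Count flagged rows per year, then find the worst year and its share of that year's rows
invalid_per_year = invalid_years.value_counts().sort_index()
worst_year = invalid_per_year.idxmax()
worst_share = invalid_per_year.max() / rows_per_year
print(worst_year, worst_share)
```

With the real data, `df.year[non_numberic_profits]` would take the place of `invalid_years`.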

In [None]:
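# Keep only the rows with numeric profits; copy so the assignment below acts on a new frame, not a view
df = df.loc[~non_numberic_profits].copy()
# Convert the remaining profit strings to numbers
df['profit'] = pd.to_numeric(df['profit'])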

We should check that this worked.

In [None]:
len(df)

In [None]:
df.dtypes

Great! We have finished our data set setup.
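
As an aside, the filter-then-convert steps can also be written the other way round. This is an alternative sketch, not what the cells above do: `pd.to_numeric` with `errors='coerce'` turns anything it cannot parse into NaN, which can then be dropped. The values in `raw` are made up. One caveat is that coercion silently discards every unparseable value, so it hides the kind of inspection we did above.

```python
import pandas as pd

# Made-up values standing in for the profit column
raw = pd.Series(['806', 'N.A.', '584.8', '-4.0'])

# errors='coerce' replaces unparseable entries with NaN instead of raising an error
converted = pd.to_numeric(raw, errors='coerce')

# Dropping the NaN entries leaves only the values that parsed as numbers
cleaned = converted.dropna()
print(cleaned)
```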

If you were going to present your notebook as a report, you could get rid of the investigatory cells we created, which are included here to demonstrate the flow of working with notebooks, and merge the relevant cells (see the Advanced Functionality section below for more on this) to create a single data set setup cell. Then, if we ever mess up our data set elsewhere, we can just rerun the setup cell to restore it.
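
A merged setup cell might look something like the sketch below. The function name `load_fortune500` and the tiny in-memory CSV (including its header names and rows) are made up so the example runs on its own; the steps inside mirror the cells above.

```python
import io
import pandas as pd

def load_fortune500(source):
    """Read the CSV, rename the columns, and drop rows whose profit is not numeric."""
    df = pd.read_csv(source)
    df.columns = ['year', 'rank', 'company', 'revenue', 'profit']
    # astype(str) keeps the check working even if the column was already parsed as numbers
    non_numeric = df['profit'].astype(str).str.contains('[^0-9.-]')
    df = df.loc[~non_numeric].copy()
    df['profit'] = pd.to_numeric(df['profit'])
    return df

# Made-up stand-in for fortune500.csv (hypothetical header and rows)
demo_csv = io.StringIO(
    "Year,Rank,Company,Revenue,Profit\n"
    "1955,1,Company A,9823.5,806\n"
    "1955,2,Company B,5661.4,N.A.\n"
)
df = load_fortune500(demo_csv)
print(df.dtypes)
```

In the notebook itself, the call would be `df = load_fortune500('fortune500.csv')`, and rerunning that one cell restores a clean `df`.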