## Getting started in Colab



1.  Log in to your (Lyon) Google account.

2.  Open [https://colab.research.google.com](https://colab.research.google.com).

3.  Select `New notebook`.

4.  At the top, change `Untitled` in the title to `ColabDemo`.

5.  You're on a free (virtual) Linux server (see sidebar).

6.  Enter `CTRL + ALT + T` or select `+Text` for a text block.

7.  This is *Markdown*, a minimal markup language that, like HTML,
    describes the layout of text.

8.  Create a headline: `# Import CSV data as DataFrame`

9.  Enter `CTRL + ALT + I` or select `+Code` for a code block.

10. Text and code blocks can be moved up and down.

11. The code block is an *IPython* shell, an interactive layer on top
    of the standard Python shell. Your commands are passed to the
    Python interpreter, which runs them right away.

12. IPython allows you to practice interactive computing. For example,
    you can import CSV files, print them as a table, and get summary
    statistics:



In [1]:
import pandas as pd
df = pd.read_csv("./sample_data/california_housing_test.csv")
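
The path above only exists on the Colab server. As a minimal sketch of what `pd.read_csv` does, the block below parses a few made-up CSV lines from an in-memory string instead of a file. The values and the names `csv_text` and `demo` are illustrative assumptions, not rows from the real dataset.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk (not real housing data).
csv_text = """longitude,latitude,median_income,median_house_value
-122.05,37.37,6.6,344700.0
-118.30,34.26,3.6,176500.0
-117.81,33.78,5.8,270500.0
"""

# read_csv accepts a path or a file-like object; the first line becomes the header.
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)          # (number of rows, number of columns)
print(list(demo.columns))  # column names taken from the header line
```

Wrapping the text in `io.StringIO` only serves to keep the sketch self-contained; in Colab you would pass the file path as in the cell above.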

1.  In Colab, run a code block with `CTRL + ENTER`, or with
    `SHIFT + ENTER` (which also creates a new code block below it).
    This is just like in Snap!, which is also an *interpreted*
    language (written in JavaScript and running in the browser as an
    HTML5 application).

2.  You can print the data frame as a table by typing its name:



In [1]:
df
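
In a notebook, a bare name on the last line of a cell displays the object. If you only want the dimensions or the first few rows, `shape` and `head()` are handy. The sketch below uses a small made-up frame called `demo` (a hypothetical stand-in for `df`).

```python
import pandas as pd

# Small made-up frame (hypothetical values) standing in for the housing data.
demo = pd.DataFrame({
    "median_income": [6.6, 3.6, 5.8, 2.1],
    "median_house_value": [344700.0, 176500.0, 270500.0, 98000.0],
})

n_rows, n_cols = demo.shape  # shape is a (rows, columns) tuple
first_two = demo.head(2)     # head(n) returns the first n rows

print(n_rows, n_cols)
print(first_two)
```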

1.  You can see that there are 3000 rows (records) and 9 columns
    (features) describing the California housing market. You can
    quickly get statistical information on this dataset:



In [1]:
df.describe()
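
As a sketch of how that summary is laid out, the block below calls `describe()` on a small made-up frame (the name `demo` and its values are assumptions for illustration). The result is itself a data frame with one row per statistic and one column per numeric feature.

```python
import pandas as pd

# Made-up numbers, chosen so the mean is easy to check by hand.
demo = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "median_house_value": [100000.0, 200000.0, 300000.0, 400000.0],
})

summary = demo.describe()

# The row labels name the statistics that were computed.
print(list(summary.index))

# Individual entries can be looked up by row label and column name.
print(summary.loc["mean", "median_income"])
print(summary.loc["max", "median_house_value"])
```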

1.  You recognize the total count (number of entries, records, or
    rows), the average, the minimum, and the maximum. Without more
    information about these data (such as units), this means little.
    It does not, however, make sense to average over longitude and
    latitude.

2.  This command limits the function to columns 3 through 9
    (excluding columns 1-2, longitude and latitude): the notation
    between the [ ] uses the 'slicing' operator to subset the rows and
    columns. Python counts positions from 0 and a slice excludes its
    end position, so `2:9` selects positions 2 through 8:



In [1]:
df.iloc[:,2:9].describe()
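
To see which columns a slice like `2:9` picks, here is a minimal sketch on a made-up frame with nine columns. The frame `demo` and the column names `c1` to `c9` are hypothetical, chosen so the positions are easy to read off.

```python
import pandas as pd

# Made-up frame with three rows and nine columns named c1..c9.
demo = pd.DataFrame(
    [[i * 10 + j for j in range(1, 10)] for i in range(3)],
    columns=[f"c{j}" for j in range(1, 10)],
)

# ":" keeps all rows; "2:9" keeps column positions 2 up to (not including) 9.
subset = demo.iloc[:, 2:9]

print(list(subset.columns))
print(subset.shape)
```

Position 2 is the third column because counting starts at 0, so the first two columns are dropped and seven remain.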

1.  In fact, in `df.iloc[:,2:9].describe()` there are four operators
    and two pandas tools at work:
    1.  The dot operator to access attributes and methods
    2.  The [] operator to index the data frame
    3.  The , operator to separate rows and columns
    4.  The : operator to slice off rows and columns
    5.  The `iloc[]` indexer to select data frame values by integer position
    6.  The `describe()` method to compute a stats summary

2.  In IPython, you can quickly make plots using the `matplotlib`
    library, which contains a module `pyplot`, which in turn contains
    plotting functions like `boxplot` or `scatter`:



In [1]:
import matplotlib.pyplot as plt
plt.boxplot(df[:100]['median_house_value'],vert=False)
plt.show()
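
With its default settings, `plt.boxplot` draws as separate points any values that lie more than 1.5 times the interquartile range (IQR) beyond the box. The sketch below reproduces that rule by hand with pandas on a handful of made-up numbers; the series `values` is an assumption for illustration, not housing data.

```python
import pandas as pd

# Made-up values with one obviously large entry.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 40], dtype=float)

q1 = values.quantile(0.25)  # lower edge of the box
q3 = values.quantile(0.75)  # upper edge of the box
iqr = q3 - q1               # interquartile range (height of the box)

# Fences at 1.5 * IQR beyond the quartiles; points outside count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(values.median())    # the line inside the box
print(outliers.tolist())  # values that would be drawn as separate points
```

Note that the median is computed from all values, including the outliers.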

1.  In the last code block, we imported the plotting library
    `matplotlib` and the `pyplot` module in it, and told Python to plot a
    boxplot for one variable/column only, the median house value, and
    to restrict the plot to the first 100 records (i.e. locations).

2.  The `plt.show()` command indicates that there's a difference, to
    Python, between making the plot and displaying it, or sending it
    to the standard output.

3.  You can see in the box plot that there are five 'outliers' in
    this data, houses that are much more valuable than the rest, and
    that the middle value (or median) is at around 180,000 USD, i.e.
    half the houses are less expensive and the other half more
    expensive.

4.  Another useful plot is the scatterplot, which is what we made in
    the R programming language for the `mtcars` dataset and the
    variables `mpg` vs. `wt` (miles per gallon vs. car weight). Now we
    plot the median income (y-axis) against the median house value
    (x-axis). We expect them to be *positively correlated*, that is,
    to increase together:



In [1]:
plt.scatter(df.median_house_value,df.median_income)
plt.show()
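
The expectation of a positive correlation can also be checked numerically. As a sketch, `Series.corr` computes the Pearson correlation coefficient between two columns; the frame `demo` below contains made-up values, so the result says nothing about the real dataset.

```python
import pandas as pd

# Made-up values in which income and house value rise together.
demo = pd.DataFrame({
    "median_income": [2.0, 3.0, 4.0, 5.0, 6.0],
    "median_house_value": [110000.0, 150000.0, 240000.0, 260000.0, 350000.0],
})

# Pearson correlation: close to +1 means the two columns increase together.
r = demo["median_house_value"].corr(demo["median_income"])

print(round(r, 3))
```

On the Colab data, the same call would be `df.median_house_value.corr(df.median_income)`.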

1.  It's hard to see anything in this plot (there are 3000 values
    here, one for each record) so let's reduce the number of
    (x,y)-values to 100 each:



In [1]:
plt.scatter(df.median_house_value[:100],df.median_income[:100])
plt.show()
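
The slice `[:100]` keeps the first 100 values in file order. A minimal sketch of equivalent spellings, plus a random sample as an alternative when the first rows might not be representative (the frame `demo` is made up, and `random_state=0` is an arbitrary seed chosen for repeatability):

```python
import pandas as pd

# Made-up column with ten values 0..9.
demo = pd.DataFrame({"median_income": range(10)})

first_a = demo.median_income[:3]       # slice, as in the cell above
first_b = demo.median_income.head(3)   # same rows via head()
first_c = demo.median_income.iloc[:3]  # same rows via positional indexing

# Alternative: draw three rows at random instead of taking the first three.
random_rows = demo.sample(n=3, random_state=0)

print(first_a.tolist())
print(len(random_rows))
```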

1.  We can customize this plot minimally by adding labels and a title:



In [1]:
plt.scatter(df.median_house_value[:100],df.median_income[:100])
plt.xlabel('Median House Value')  # x-axis: first argument of scatter
plt.ylabel('Median Income')       # y-axis: second argument of scatter
plt.title('Scatterplot: Median Income vs. Median House Value')
plt.show()

1.  IPython/Colab has a lot more power, e.g. there are many 'magic
    commands', prefixed with `%`, that provide additional information
    and functionality. For example, enter `%whos` in a code block now
    for a list of the variables, functions, and imported modules
    available in this notebook session.

This concludes the demonstration of Colab's IPython capabilities.

