# Introduction to Python
### Prof. Dr. Juanjo Manjarín
**Statistics & Data Analysis**

---

In this introductory document we are going to explore some of the basic operations with Python:

  * Packages and Modules
  * Data Types
  * The Pandas Data Frame
  * Reading and Saving external data
  * Google's Colaboratory Environment



##1.- Introduction

Python is a programming language that can be either **compiled** or **interpreted**. This means that we can use it to write complex programs that are compiled into executable files or libraries, but also to write small scripts that are executed on the fly.

Here we will focus on this last execution mode. A very simple and illustrative example is letting Python behave as a huge calculator. For example, let's add 10 and 10

In [1]:
10 + 10

20

##2.- Packages and Modules

Python is a general-purpose programming language. Therein lies most of its strength but, at the same time, the base installation cannot contain everything we need or may need.

Many Python projects have built consistent sets of functions that we can use in our daily work. These functions are packed in external modules that we can load in our projects (we will use the words *module*, *package* and *library* interchangeably to refer to these packs, although there are subtle but important differences between them).

The most important libraries we will use during this course are:

- numpy (typically as `np`): For numerical analysis. See [manual](https://www.numpy.org/devdocs/contents.html).
- scipy: For scientific computing. See the [scipy lecture notes](http://scipy-lectures.org/index.html). We will extensively use the submodule `stats` of the scipy library for statistical modeling.
- pandas:   For data structuring and manipulation. See [documentation](http://pandas.pydata.org/pandas-docs/stable/) and [cheatsheet](http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).
- matplotlib: For 2D plots. See [tutorials](https://matplotlib.org/tutorials/index.html).

We load them using Python **keywords**, the main one being **import**. To load the **pandas** library we would do

```python
import pandas
```

However, there are common abbreviations for these packages so that we do not have to write the whole name every time. Then we would do

```python
import pandas as pd
```

Now, whenever we want to use any of the pandas functions, say **Series**, we must call it as

```python
pd.Series()
```

so that the interpreter knows that we are calling the Series function inside the pandas library, i.e. the abbreviation is a reference name for the whole library.

Sometimes we do not want or need to import the whole set of functions in the library. We can then restrict the import as much as we want; for example, the **pyplot** module in **matplotlib** can be imported as

```python
import matplotlib.pyplot as plt
```

At other times we just want a very small set of functions, for example the ones corresponding to the normal distribution from the **scipy** library. Then we do

```python
from scipy.stats import norm
```

This module structure is a common source of confusion when we take our first steps in Python, and most of our difficulties can be summed up in one single question: "how do I know where the function I need is, or even if it exists?" The short answer is *with practice*, which implies that many times we have to take a long tour through the documentation or the web looking for the answers.
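One practical way to approach that question from inside the interpreter is to inspect a module directly. The sketch below uses the standard library's `math` module purely as a stand-in (it is not one of the course libraries); the same built-ins, `dir()` and `help()`, should work on `pd`, `np` or `scipy.stats` once they are imported.

```python
import math

# dir() lists the names a module exposes; drop the private ones
public_names = [name for name in dir(math) if not name.startswith("_")]

# Check whether a function exists before going to the documentation
has_sqrt = "sqrt" in public_names
print(has_sqrt)

# Every function carries a short description in its __doc__ attribute;
# help(math.sqrt) prints the same text in a friendlier layout
first_doc_line = math.sqrt.__doc__.splitlines()[0]
print(first_doc_line)
```

This does not replace the documentation, but it is often enough to confirm a name or recall what a function does.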

##3.- Types of Data

Just as the base installation does not contain all the functions we may ever need, the base data types are rather general and powerful but do not satisfy all the needs of data analysis. We are going to briefly explain the base ones first, because we will use them from time to time, and then the ones that come with the modules that are important for us.

###3.1.- The Base Library

Among the data types that are predefined in Python, we will look at five

  * Strings
  * Lists
  * Tuples
  * Sets
  * Dictionaries

To generate a new data structure we can use two different approaches:

 * Using the **constructor** of the type. This is a special type of function that will be `str()`, `list()`, `tuple()`, `set()` and `dict()`. This is usually the safest form (mostly at the beginning)
 * Using the **format** of the type. This will be `" "`, `[ ]`, `( )` and `{ }` (with a small subtlety for this last one)

The important observation to keep in mind is that, although there are some basic methods common to all the types, others are specific to each class. Let's see the basic properties of each type
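As a minimal sketch of the two approaches, the comparisons below check that the constructor and the format build the same object. The last lines show what is presumably the subtlety with the braces: written empty, they produce a dictionary and not a set.

```python
# Each pair builds the same object: constructor on the left, format on the right
same_list = list((1, 2, 3)) == [1, 2, 3]
same_tuple = tuple([1, 2, 3]) == (1, 2, 3)
same_set = set([1, 2, 3]) == {1, 2, 3}
same_dict = dict(a=1, b=2) == {"a": 1, "b": 2}
print(same_list, same_tuple, same_set, same_dict)

# Empty braces give a dictionary; an empty set needs the constructor
empty_braces = {}
empty_set = set()
print(type(empty_braces).__name__)  # dict
print(type(empty_set).__name__)     # set
```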



####3.1.1 Strings

A **string** is literal text: if we write a number inside a string, the type represents the characters written and not their numeric content (this is important in any arithmetic operation we make with them)

```python
"my number is 2"
```

is equivalent to 

```python
str("my number is 2")
```


In [3]:
a = str("my number is 2")
a

'my number is 2'
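To illustrate why the distinction matters in arithmetic, here is a small sketch with a hypothetical variable `text_number`: operators applied to strings act on the text, so the value has to be converted before doing numeric work.

```python
text_number = "2"

# With strings, + concatenates and * repeats
concatenated = text_number + text_number   # "22"
repeated = text_number * 3                 # "222"

# Convert to a numeric type before doing arithmetic
numeric_sum = int(text_number) + int(text_number)  # 4

print(concatenated, repeated, numeric_sum)
```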


####3.1.2 Lists

A **list** is the first important data type in Python. In a list we can store, under one common name, a **not fixed** number of sequential values.

The following two forms are equivalent. From the format

```python
[1, 2, "Hello"]
```

or with the constructor

```python
list([1, 2, "Hello"])
```

In [6]:
mylist = list([1, 2, "Hello"])
mylist

[1, 2, 'Hello']

In this context, the word sequential means that we can access the information in the list in a sequential form. Lists can also be generated with list comprehensions, which build the list from a `for` loop written inside the brackets (these are outside the scope of this course)

```python
[x**2 for x in [1,2,3]]
```

We may have multidimensional lists which allow us to store in one single variable tables or higher dimensional structures. For bidimensional tables we may write

```python
[[1, 2, 3], ["a", "b", "c"], [15, 25, 30]]
```

which is a table with three rows and three columns
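A short sketch of how such a nested list can be read and extended. The variable name `table` is an assumption for illustration, and the indices start at 0, as explained in the numpy section.

```python
table = [[1, 2, 3], ["a", "b", "c"], [15, 25, 30]]

# First index selects the row, second index the position inside that row
second_row = table[1]          # ["a", "b", "c"]
single_cell = table[2][0]      # 15

# Lists are not fixed in size: append() adds a new row in place
table.append([0, 0, 0])
n_rows = len(table)            # 4
n_cols = len(table[0])         # 3

print(second_row, single_cell, n_rows, n_cols)
```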

####3.1.3 Tuples

**Tuples** are very similar to lists with one exception: their elements cannot be changed. We cannot add, replace, remove or reorder them in any way.

We may create a tuple with the format

```python
(1, 2, 3)
```

with the constructor

```python
tuple([1, 2, 3])
```

or with the list comprehension together with the constructor

```python
tuple([x**2 for x in [1, 2, 3]])
```

In [22]:
mytuple = tuple([x**2 for x in [1, 2, 3]])
mytuple

(1, 4, 9)

The fact that this type cannot be changed can easily be seen if we try to assign a value through the index operator: the interpreter will throw an error

In [23]:
mytuple[3] = 2

TypeError: 'tuple' object does not support item assignment
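The same behaviour can be sketched in a way that does not stop the notebook, by catching the error. The last lines show one possible workaround, which builds a new tuple instead of modifying the existing one.

```python
mytuple = (1, 4, 9)

try:
    mytuple[0] = 2          # item assignment is not allowed on tuples
except TypeError as err:
    error_message = str(err)
    print(error_message)

# To "change" a tuple, build a new one, e.g. by concatenation
longer_tuple = mytuple + (16,)
print(longer_tuple)
```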

####3.1.4 Sets

**Sets** are used to store unique values without any order among them. We therefore cannot access their elements by index (see later), only by their explicit value. The main advantage with respect to lists and tuples is that checking whether a value belongs to a set is much faster.

We can generate sets using the format

```python
{1, 2, 3}
```

or the constructor applied to a string, a list, a tuple or a list comprehension

```python
set([1, 2, 3])
```

obviously, we can generate a list or a tuple using their constructors with a set. 
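A minimal sketch of these properties, using a hypothetical list `values` with repeated entries: building the set removes the duplicates, membership is tested by value, and `sorted()` gives back a list in a predictable order.

```python
values = [3, 1, 2, 3, 1]

# Duplicates collapse: a set keeps each value only once
unique_values = set(values)
print(len(values), len(unique_values))   # 5 3

# Membership is tested by value, not by position
print(2 in unique_values)    # True
print(10 in unique_values)   # False

# Going back to an ordered type; sorted() gives a predictable order
back_to_list = sorted(unique_values)
print(back_to_list)          # [1, 2, 3]
```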

####3.1.5 Dictionaries

**Dictionaries** are keyed collections: in the definition we must provide a key for each piece of data we want to store.

If we define the data from the format this may be

```python
{1: "John", 2: "Eve"}
```

if we use the constructor we have

```python
dict(a = "John", b = "Eve")
```

In this second case the keys are passed as keyword arguments, so they must be *valid* Python names, and numbers are not; that's why we have changed them to *a* and *b*. A way around this limitation is to use the constructor on the format of a dictionary

```python
dict({1: "John", 2: "Eve"})
```


In [32]:
mydict = dict({1: "John", 2: "Eve"})
mydict

{1: 'John', 2: 'Eve'}

this class comes with some specific methods that we can invoke from the dictionary itself (take a look at the documentation), for example, to know the keys of the dictionary

In [42]:
mydict.keys()


dict_keys([1, 2])
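A few more everyday operations, as a sketch: reading a value by its key, adding an entry, and listing keys and pairs. The extra entry `"Ana"` is a made-up value for illustration; the last line shows that `dict()` also accepts a list of (key, value) pairs, which is another way to get numeric keys from the constructor.

```python
mydict = {1: "John", 2: "Eve"}

# Values are retrieved by key, not by position
first_name = mydict[1]

# Assigning to a new key adds an entry
mydict[3] = "Ana"

# keys(), values() and items() give views that can be turned into lists
all_keys = list(mydict.keys())
all_pairs = list(mydict.items())

# dict() also accepts (key, value) pairs, so numeric keys are possible
from_pairs = dict([(1, "John"), (2, "Eve")])

print(first_name, all_keys, all_pairs, from_pairs)
```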

###3.2.- The Numpy Array

The Numpy array is one of the possible arrays we can find in the Python packages. To use it we need to import the numpy package so let's do it before any other thing

In [0]:
import numpy as np

the constructor for this data type is the `np.array()` function which can be used in the usual way

In [3]:
myarray = np.array([1, 2, 3])
myarray

array([1, 2, 3])

you see that the output makes explicit reference to the array nature of the dataset.

Now, this structure is much like a list: both can be iterated, sliced and indexed, and both are used to store data. However, while the list is **NOT** a vector and cannot be used in vector operations, the array can and will be used as such. Consider the following examples

In [4]:
np.array([1, 2, 3])/2

array([0.5, 1. , 1.5])

in this case we have a vectorized operation where division by 2 is applied to all the elements of the array. Try to do this same operation in a list and you will receive an error: run the following code

```python
[1, 2, 3]/2
```

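In any case, keep in mind that the numpy array behaves this way only if it has been defined from an ordered sequence such as a list; a set or a dictionary passed to `np.array()` is stored as a single object and these operations fail.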
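To make the contrast concrete, the following sketch performs the same computation with a plain list and with an array. With a list the division has to be written element by element, and multiplication does something different altogether.

```python
import numpy as np

values = [1, 2, 3]

# With a plain list, the division has to be written element by element
halved_list = [v / 2 for v in values]

# With an array the same operation is vectorized
halved_array = np.array(values) / 2

# Multiplication shows the difference even more clearly
repeated_list = values * 2               # the list is repeated: [1, 2, 3, 1, 2, 3]
doubled_array = np.array(values) * 2     # each element is doubled: array([2, 4, 6])

print(halved_list, halved_array, repeated_list, doubled_array)
```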

####3.2.1 Indexing and Slicing

An important observation is that in numpy the dimensions are known as **axes**. This is very relevant since many operations require you to state explicitly the axis along which you want to perform them.

Among these operations, **indexing** and **slicing** are a must-know.

  * **Indexing** refers to the fact that each element in the data structure has a reference index that allows us to access it individually. The only peculiar behaviour is that Python inherits the **0-based index** from **C**, i.e. the index begins at 0, not at 1, so the first element's index is always 0. The index is written inside square brackets right after the name of the data structure

In [9]:
myarray[0]

1

> you see that the element with index 0 in `myarray` is the first element, in this case the number 1. We can also use negative indices, which count from the end of the array: the index -1 is the last element, -2 the one before it, and so on. Then

In [10]:
myarray[-1]

3

> In multidimensional arrays we will find a different index per axis, then

In [12]:
mybidimarray = np.array([[1, 3, 5], [2, 4, 6]])
mybidimarray

array([[1, 3, 5],
       [2, 4, 6]])

> now the index is used as follows: we have a pair of indices, the first one for the first axis and the second one for the second axis (try to say what the output of the following code will be without executing it)

```python
mybidimarray[1,2]
```

How do we access more than one element in the data set? Through slicing

  * **Slicing** refers to the use of the slicing operator "**:**", which is to be read as "init:end", where the element at position *end* is not included. Let's define a longer array and see how to use it

In [21]:
myarray = np.random.random(10)
myarray

array([0.71968138, 0.09225487, 0.02679831, 0.70784509, 0.07940403,
       0.8711127 , 0.08729038, 0.20672998, 0.37942293, 0.05036329])

> then the elements from the second to the fifth are

In [25]:
myarray[1:5]

array([0.09225487, 0.02679831, 0.70784509, 0.07940403])

> If we do not impose any of the boundaries the interpreter considers that we want every element in that direction, then for example, the three first elements are

In [27]:
myarray[:3]

array([0.71968138, 0.09225487, 0.02679831])


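> The operator accepts a second colon followed by a step, read as "init:end:step". In the following example the slice ends before index 1, so only the first element is returned whatever the step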

In [26]:
myarray[:1:3]

array([0.71968138])

> in two dimensional arrays, we can slice in exactly the same way but considering the two axes in the data. Let's reshape our data in a 2x5 array

In [32]:
myarray = myarray.reshape(2,5)
myarray

array([[0.71968138, 0.09225487, 0.02679831, 0.70784509, 0.07940403],
       [0.8711127 , 0.08729038, 0.20672998, 0.37942293, 0.05036329]])

> then to obtain the third column we would do

In [33]:
myarray[:, 2]

array([0.02679831, 0.20672998])
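Since the random values above change on every run, the following sketch repeats the main slicing patterns on a deterministic array built with `np.arange()`. The name `grid` is an assumption for illustration.

```python
import numpy as np

grid = np.arange(10).reshape(2, 5)   # [[0 1 2 3 4], [5 6 7 8 9]]

second_row = grid[1, :]          # [5 6 7 8 9]
third_column = grid[:, 2]        # [2 7]
block = grid[:, 1:3]             # columns 1 and 2 of every row
every_other = grid[0, ::2]       # step of 2 along the first row: [0 2 4]

print(second_row, third_column, block, every_other)
```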

###3.3.- The Pandas Data Frame

The Pandas package contains two relevant structures. One of them is the **series**, which is similar to an array; in fact, it is a sort of enhanced array. You can take a look at the documentation to see how series work.

Here we are going to concentrate on data frames. These are the main data structure we are going to work with and, just as the numpy array resembles the list, the data frame resembles the dictionary, which is particularly important when we go through filter and selection operations.

In short, we can say that a data frame is a two-dimensional labeled data structure. This is basically what we see in any spreadsheet we have worked with: rows and columns with names and indices.

To generate this structure we need to load the pandas library

In [0]:
import pandas as pd

now we are ready to use the basic constructor of this data type, the `pd.DataFrame()` function, let's see it using a randomly generated dataset

In [61]:
mydf = pd.DataFrame({"Age": np.random.randint(18, 65, 10), 
                     "Exper": np.random.randint(1, 15, 10), 
                     "Tenure": np.random.randint(1, 5, 10)
                    })
mydf

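|   | Age | Exper | Tenure |
|---|-----|-------|--------|
| 0 | 46 | 4 | 4 |
| 1 | 38 | 10 | 2 |
| 2 | 46 | 9 | 1 |
| 3 | 64 | 9 | 4 |
| 4 | 58 | 4 | 4 |
| 5 | 37 | 11 | 1 |
| 6 | 29 | 3 | 3 |
| 7 | 42 | 8 | 3 |
| 8 | 45 | 4 | 1 |
| 9 | 33 | 3 | 4 |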


Note that what we had as keys in the dictionary are now the names of the variables/columns. Also see that in pandas we will always have an extra column on the left, which corresponds to the index, so each observation/row has an explicit index.
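Both kinds of labels can be inspected directly. The sketch below uses a small data frame with fixed, made-up values (so that the result is reproducible) and reads its column names, its index and its shape.

```python
import pandas as pd

# Hypothetical values, fixed so that the result is reproducible
small_df = pd.DataFrame({"Age": [46, 38, 29],
                         "Exper": [4, 10, 3]})

column_names = list(small_df.columns)   # ['Age', 'Exper']
row_index = list(small_df.index)        # [0, 1, 2]
n_rows, n_cols = small_df.shape         # 3 rows, 2 columns

print(column_names, row_index, n_rows, n_cols)
```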

####3.3.1.- Selecting and Filtering

We have two different subsetting operations:

 * One in which we **select** different variables/columns out of the whole set
 * A second in which we **filter** to keep only some rows/observations based on some conditions

Now that we know the different types of labels in a data frame, we must always keep in mind that **columns** are addressed by their name while **rows** are addressed by their index.

Then if we want to select the column named *Age* in the previous data frame we must be explicit with the name, usually written inside square brackets. We must, however, be careful with the selection process: there is a huge difference between using single and double brackets

The selection with a **single bracket** returns

In [62]:
mydf["Age"]

0    46
1    38
2    46
3    64
4    58
5    37
6    29
7    42
8    45
9    33
Name: Age, dtype: int64

but this is **NOT a data frame**, it is a pandas **series** as can be seen if we do

In [63]:
type(mydf["Age"])

pandas.core.series.Series

the same result is obtained if we use dot notation

In [64]:
mydf.Age

0    46
1    38
2    46
3    64
4    58
5    37
6    29
7    42
8    45
9    33
Name: Age, dtype: int64

However if we use **double brackets** we have

In [65]:
mydf[["Age"]]

|   | Age |
|---|-----|
| 0 | 46 |
| 1 | 38 |
| 2 | 46 |
| 3 | 64 |
| 4 | 58 |
| 5 | 37 |
| 6 | 29 |
| 7 | 42 |
| 8 | 45 |
| 9 | 33 |


which can be seen, even from the output, to be a data frame. Finally, sometimes we may need only the array of values stripped of all the pandas structure. This can be obtained using

In [66]:
mydf["Age"].values

array([46, 38, 46, 64, 58, 37, 29, 42, 45, 33])

which returns a single numpy array with the values.
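The three results can be compared side by side. This sketch uses a small data frame with made-up values and checks the type and shape of each selection; it uses `to_numpy()`, which pandas also provides for extracting the underlying array, in place of `.values`.

```python
import pandas as pd

small_df = pd.DataFrame({"Age": [46, 38, 29], "Tenure": [4, 2, 3]})

single = small_df["Age"]          # pandas Series
double = small_df[["Age"]]        # pandas DataFrame with one column
raw = small_df["Age"].to_numpy()  # plain numpy array

print(type(single).__name__)       # Series
print(type(double).__name__)       # DataFrame
print(single.shape, double.shape)  # (3,) (3, 1)
print(type(raw).__name__)          # ndarray
```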

If we want to extract two different columns, we should think that the subset is a list, then it must be passed as a list argument

In [67]:
mydf[["Age", "Tenure"]]

|   | Age | Tenure |
|---|-----|--------|
| 0 | 46 | 4 |
| 1 | 38 | 2 |
| 2 | 46 | 1 |
| 3 | 64 | 4 |
| 4 | 58 | 4 |
| 5 | 37 | 1 |
| 6 | 29 | 3 |
| 7 | 42 | 3 |
| 8 | 45 | 1 |
| 9 | 33 | 4 |


When we want to **filter** some values we do it either through their index or based on a condition of the values of some variable (for example when we want to keep only one of the categories in a categorical variable)

In order to filter by index we use the same slicing operator as in numpy, then

In [68]:
mydf[2:5]

|   | Age | Exper | Tenure |
|---|-----|-------|--------|
| 2 | 46 | 9 | 1 |
| 3 | 64 | 9 | 4 |
| 4 | 58 | 4 | 4 |


returns the rows with indices 2 to 4, i.e. from the third to the fifth row (note that the end value, 5, is not included in the interval). All the other properties of this operator are kept too.

The filter under conditions is done using the logical conditions as for example

In [70]:
mydf[mydf["Tenure"] >= 3]

|   | Age | Exper | Tenure |
|---|-----|-------|--------|
| 0 | 46 | 4 | 4 |
| 3 | 64 | 9 | 4 |
| 4 | 58 | 4 | 4 |
| 6 | 29 | 3 | 3 |
| 7 | 42 | 8 | 3 |
| 9 | 33 | 3 | 4 |


or adding more conditions (careful with the parentheses)

In [71]:
mydf[(mydf["Tenure"] >= 3) & (mydf["Age"] <= 45)]

|   | Age | Exper | Tenure |
|---|-----|-------|--------|
| 6 | 29 | 3 | 3 |
| 7 | 42 | 8 | 3 |
| 9 | 33 | 3 | 4 |
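It may help to look at what the condition produces before it is used as a filter. In this sketch, with made-up values, the comparison returns a boolean Series (a mask) with one value per row; masks can be combined with `&` (and), `|` (or) and `~` (not).

```python
import pandas as pd

small_df = pd.DataFrame({"Age": [46, 38, 29, 64], "Tenure": [4, 2, 3, 4]})

# A comparison on a column gives a boolean Series, one value per row
mask = small_df["Tenure"] >= 3
print(mask.tolist())            # [True, False, True, True]

# & is "and", | is "or", ~ is "not"; each condition needs its own parentheses
both = small_df[(small_df["Tenure"] >= 3) & (small_df["Age"] <= 45)]
either = small_df[(small_df["Tenure"] >= 3) | (small_df["Age"] <= 45)]
negated = small_df[~mask]

print(both.index.tolist(), either.index.tolist(), negated.index.tolist())
```

The parentheses are needed because `&` and `|` bind more tightly than the comparison operators.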


We can also filter and select in one single step. If we want only the rows from index 8 onwards and the columns *Exper* and *Tenure*, we can write

In [72]:
mydf[["Exper", "Tenure"]][8:]

|   | Exper | Tenure |
|---|-------|--------|
| 8 | 4 | 1 |
| 9 | 3 | 4 |


Finally, there are two indexers that we can use to select and filter:

 * **iloc**, which is *integer location based index*, and
 * **loc**, which is *label based location* (we are not going to use it in these notes but you can take a look at the documentation)

with iloc you can specify `iloc[row, column]` using the positions of the row and the column, as in the following examples

- To select the first column

In [73]:
mydf.iloc[:,0]

0    46
1    38
2    46
3    64
4    58
5    37
6    29
7    42
8    45
9    33
Name: Age, dtype: int64

- To select the first row

In [74]:
mydf.iloc[0]

Age       46
Exper      4
Tenure     4
Name: 0, dtype: int64

- To select the last row

In [75]:
mydf.iloc[-1]

Age       33
Exper      3
Tenure     4
Name: 9, dtype: int64

- To select the first, fourth and seventh rows (positions 0, 3 and 6) and the first and third columns

In [76]:
mydf.iloc[[0,3,6], [0,2]]

|   | Age | Tenure |
|---|-----|--------|
| 0 | 46 | 4 |
| 3 | 64 | 4 |
| 6 | 29 | 3 |
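For completeness, here is a brief sketch of how `loc` differs from `iloc`. The data frame `people` and its text index are hypothetical; a non-numeric index makes the difference between positions and labels visible.

```python
import pandas as pd

# Hypothetical data with text labels as the row index
people = pd.DataFrame({"Age": [46, 38, 29], "Tenure": [4, 2, 3]},
                      index=["ana", "ben", "eve"])

by_position = people.iloc[0, 1]         # first row, second column
by_label = people.loc["ana", "Tenure"]  # same cell, addressed by labels

# Slices differ: iloc excludes the end position, loc includes the end label
iloc_slice = people.iloc[0:2]           # rows "ana" and "ben"
loc_slice = people.loc["ana":"eve"]     # rows "ana", "ben" and "eve"

print(by_position, by_label, len(iloc_slice), len(loc_slice))
```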


####3.3.2.- Adding Columns

Adding new variables to a data frame is a rather easy operation (remember that the only requirement is that the number of observations is the same as the number of rows in the data frame).

We can use the single brackets: we write the name of the new column inside of the brackets and assign to it any array. 




In [77]:
mydf["gender"] = ["f", "f", "f", "m", "m", "f", "m", "m", "f", "m"]
mydf

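|   | Age | Exper | Tenure | gender |
|---|-----|-------|--------|--------|
| 0 | 46 | 4 | 4 | f |
| 1 | 38 | 10 | 2 | f |
| 2 | 46 | 9 | 1 | f |
| 3 | 64 | 9 | 4 | m |
| 4 | 58 | 4 | 4 | m |
| 5 | 37 | 11 | 1 | f |
| 6 | 29 | 3 | 3 | m |
| 7 | 42 | 8 | 3 | m |
| 8 | 45 | 4 | 1 | f |
| 9 | 33 | 3 | 4 | m |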


Often we want to add a variable based on the values of another variable. Suppose that we want to add a variable based on the values of *Exper* such that

 - if Exper <= 5, we denote it as "Low"
 - if 5 < Exper <= 10, we denote it as "Mid"
 - if Exper > 10, we denote it as "High"

In this case we can use the function `where()` from the numpy package, which is a sort of if-else condition and, just as in those constructions, can be nested as many times as needed. The structure is `np.where(condition, x, y)`, such that it returns *x* if the condition is satisfied and *y* otherwise.

In the following example, since 1 is not greater than 3, it will return "B"

In [26]:
np.where(1 > 3, "A", "B")

array('B', dtype='<U1')

Let's use this, together with the single bracket in order to add the new variable (let's denote it as *Code_Exp*)

In [79]:
mydf["Code_Exp"] = np.where(mydf["Exper"] <= 5, "Low",
                            np.where(mydf["Exper"] > 10, "High", "Mid"))
mydf

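|   | Age | Exper | Tenure | gender | Code_Exp |
|---|-----|-------|--------|--------|----------|
| 0 | 46 | 4 | 4 | f | Low |
| 1 | 38 | 10 | 2 | f | Mid |
| 2 | 46 | 9 | 1 | f | Mid |
| 3 | 64 | 9 | 4 | m | Mid |
| 4 | 58 | 4 | 4 | m | Low |
| 5 | 37 | 11 | 1 | f | High |
| 6 | 29 | 3 | 3 | m | Low |
| 7 | 42 | 8 | 3 | m | Mid |
| 8 | 45 | 4 | 1 | f | Low |
| 9 | 33 | 3 | 4 | m | Low |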


As mentioned before, we had to nest a second conditional in order to generate the three possible values. This can be generalized to any number of conditions.
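When the number of conditions grows, nesting becomes hard to read. As an alternative to the nested `np.where()` used in these notes, numpy also offers `np.select()`, which takes a list of conditions and a list of labels. A minimal sketch with made-up experience values:

```python
import numpy as np

exper = np.array([4, 10, 9, 11, 3])

# Conditions are checked in order; the first one that holds decides the label
conditions = [exper <= 5, exper <= 10]
labels = ["Low", "Mid"]

# Values that satisfy none of the conditions receive the default label
code_exp = np.select(conditions, labels, default="High")
print(code_exp)
```

Because the first matching condition wins, the second condition only needs to state the upper bound of the "Mid" range.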

####3.3.3- Sorting Data

In many situations it is important or just interesting to sort the values of the data frame descending or ascending as a function of one of the variables. This can be done with the function `sort_values()` as a method of the data frame itself.

Suppose we want to sort our previous data frame using the variable *Age*

In [80]:
mydf.sort_values(by = ["Age"], ascending = True)

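|   | Age | Exper | Tenure | gender | Code_Exp |
|---|-----|-------|--------|--------|----------|
| 6 | 29 | 3 | 3 | m | Low |
| 9 | 33 | 3 | 4 | m | Low |
| 5 | 37 | 11 | 1 | f | High |
| 1 | 38 | 10 | 2 | f | Mid |
| 7 | 42 | 8 | 3 | m | Mid |
| 8 | 45 | 4 | 1 | f | Low |
| 0 | 46 | 4 | 4 | f | Low |
| 2 | 46 | 9 | 1 | f | Mid |
| 4 | 58 | 4 | 4 | m | Low |
| 3 | 64 | 9 | 4 | m | Mid |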


if we want to sort with respect to two different variables we may just add them to the list argument of `by`, for example, if we used `gender` and `Code_Exp` we would do (note that the sorting in the `Code_Exp` is done alphabetically) 

In [81]:
mydf.sort_values(by = ["gender", "Code_Exp"], ascending = True)

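|   | Age | Exper | Tenure | gender | Code_Exp |
|---|-----|-------|--------|--------|----------|
| 5 | 37 | 11 | 1 | f | High |
| 0 | 46 | 4 | 4 | f | Low |
| 8 | 45 | 4 | 1 | f | Low |
| 1 | 38 | 10 | 2 | f | Mid |
| 2 | 46 | 9 | 1 | f | Mid |
| 4 | 58 | 4 | 4 | m | Low |
| 6 | 29 | 3 | 3 | m | Low |
| 9 | 33 | 3 | 4 | m | Low |
| 3 | 64 | 9 | 4 | m | Mid |
| 7 | 42 | 8 | 3 | m | Mid |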
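Two details of `sort_values()` are worth a short sketch, here with made-up values: `ascending` also accepts one value per column, and the method returns a new data frame, leaving the original order untouched.

```python
import pandas as pd

small_df = pd.DataFrame({"gender": ["m", "f", "f", "m"],
                         "Age": [64, 38, 46, 29]})

# One direction per column: gender ascending, Age descending
ordered = small_df.sort_values(by=["gender", "Age"], ascending=[True, False])
print(ordered["Age"].tolist())       # [46, 38, 64, 29]

# sort_values() returns a new data frame; the original keeps its order
print(small_df["Age"].tolist())      # [64, 38, 46, 29]

# reset_index(drop=True) renumbers the rows after sorting
renumbered = ordered.reset_index(drop=True)
print(renumbered.index.tolist())     # [0, 1, 2, 3]
```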


####3.3.4.- Grouping

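Pandas is very helpful when we want to find answers from our dataset based on certain conditions on some of the variables. For example, in our data set we may want to find the mean experience for males and females separately. This procedure is known as **grouping** and is done with the **`groupby()`** function. Since the data frame now contains text columns, we pass `numeric_only = True` so that the mean is computed only over the numeric ones (recent pandas versions raise an error otherwise)

In [82]:
mydf.groupby("gender").mean(numeric_only = True)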

| gender | Age | Exper | Tenure |
|--------|-----|-------|--------|
| f | 42.4 | 7.6 | 1.8 |
| m | 45.2 | 5.4 | 3.6 |


In the same way, if we want to filter the dataset using this grouping we can use the **`get_group()`** function

In [83]:
mydf.groupby("gender").get_group("f")

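|   | Age | Exper | Tenure | gender | Code_Exp |
|---|-----|-------|--------|--------|----------|
| 0 | 46 | 4 | 4 | f | Low |
| 1 | 38 | 10 | 2 | f | Mid |
| 2 | 46 | 9 | 1 | f | Mid |
| 5 | 37 | 11 | 1 | f | High |
| 8 | 45 | 4 | 1 | f | Low |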


this is equivalent to

In [84]:
mydf[mydf.gender == "f"]

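|   | Age | Exper | Tenure | gender | Code_Exp |
|---|-----|-------|--------|--------|----------|
| 0 | 46 | 4 | 4 | f | Low |
| 1 | 38 | 10 | 2 | f | Mid |
| 2 | 46 | 9 | 1 | f | Mid |
| 5 | 37 | 11 | 1 | f | High |
| 8 | 45 | 4 | 1 | f | Low |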
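Grouping is often combined with the selection of a single column. The sketch below, with made-up values, computes the mean of *Exper* per group, several summaries at once with `agg()`, and the number of rows per group with `size()`.

```python
import pandas as pd

small_df = pd.DataFrame({"gender": ["f", "f", "m", "m"],
                         "Exper": [4, 10, 9, 3],
                         "Code_Exp": ["Low", "Mid", "Mid", "Low"]})

# Selecting the column first avoids averaging over text columns
mean_exper = small_df.groupby("gender")["Exper"].mean()
print(mean_exper["f"], mean_exper["m"])    # 7.0 6.0

# agg() computes several summaries at once
summary = small_df.groupby("gender")["Exper"].agg(["mean", "count"])
print(summary)

# size() counts the rows in each group
group_sizes = small_df.groupby("gender").size()
print(group_sizes.to_dict())               # {'f': 2, 'm': 2}
```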


##4.- Input/Output in Pandas

Pandas also allows us to load external data from many different sources (CSV (Comma Separated Values), Excel, HTML, JSON, ...) into our workspace. All the functions have more or less the same structure, `read_type()`. In particular we will mostly use the `read_csv(source, options)` function.

We are going to use this later, once we see how to mount Google Drive. For now let's just explain some of the arguments we can use in the function:

 * `sep`, this is the separator between the columns and, by default, it is a comma. However, sometimes we can find it to be a semicolon or any other character, which must be written explicitly here, for example `sep = ";"`
 * `na_values`, makes explicit how the NAs are written in the dataset. Since not all datasets are consistent and sometimes different sources have different encodings, we can find different ways of denoting these values. These must be passed as a list, then we can find `na_values = ["no_info", " ", "."]` 
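The two arguments can be tried without any file on disk. In this sketch the file content is a made-up string, and `io.StringIO` from the standard library lets `read_csv()` treat it as if it were a file.

```python
import io
import pandas as pd

# Hypothetical file content: semicolon separated, with two ways of writing NAs
raw_text = "Age;Exper\n46;4\nno_info;10\n29;.\n"

# io.StringIO lets read_csv treat the string as if it were a file on disk
data = pd.read_csv(io.StringIO(raw_text), sep=";", na_values=["no_info", "."])

print(data)
print(data.isna().sum().to_dict())   # {'Age': 1, 'Exper': 1}
```

Without `na_values`, the columns would be read as text because of the entries `no_info` and `.`.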



##5.- Google's Colaboratory

This is the general environment where we will be working. The main advantage we have here is that we do not have to install anything on our computers and there is no further work on our side to keep everything updated and working. It has some disadvantages, but they are more technical and do not really concern us.

The Colaboratory environment can be seen as a fork of Jupyter and has the same general structure: notebooks are general text documents organized in cells where we can either write code or write text



###5.1.- Code Cells

Here is where we will write our Python code. Then

* Type **Cmd/Ctrl+Enter** to run the cell in place; or
* Press the **Play** button to execute the code.

Note the effect of the `print()` statement.

In [91]:
a = 10
b = 20
print(a)
print(b)

10
20


###5.2.- Text cells
This is a **text cell**. You can **double-click** or **press-enter** to edit this cell. Text cells use markdown syntax. To learn more, see the official [markdown guide](/notebooks/markdown_guide.ipynb).

You can also add math to text cells using [$\LaTeX$](http://www.latex-project.org/) Just place the statement within a pair of **\$** signs. For example `$\sqrt{3x-1}+(1+x)^2$` becomes the inline equation $\sqrt{3x-1}+(1+x)^2$. If you want to separate this you should create an environment with **\\begin\{equation\}** and **\\end\{equation\}**

\begin{equation}
\sqrt{3x-1}+(1+x)^2
\end{equation}

Some interesting markdown features are:

* Asterisk '*' or dash '-' for bullet points
>- '>-' for indenting a list

1. Numbered list
2. Second item

Include words between double asterisks (or double underscores) for bold, and single asterisks (single underscore) for italic. Example:

- **This is bold**, as well as __this__
- *This is italic*, as well as _this_

Now, a separation line to separate blocks:
***

or

---

To define titles and a Table of Contents:

```
# Main Title
## Subtitle
### Subsubtitle
```

this goes up to six levels.

A simple table:

| Column 1 | Column 2 |
|-----------------|-----------------|
|  value 1   | value 2      |

###5.3.- Connecting Colaboratory to Google Drive

To import and export data, we need to connect Colaboratory with our own Google Drive. This needs to be done every time a session is started.

1. Go to "View" and click on "Table of Contents"
2. Select Files. Navigate one level up, to observe the structure of the machine where your notebook resides. To discover where your notebook is located, type:

```python
!pwd
```

in a code cell. Try `!ls` to get the list of files in your current directory.

3. Now, to connect your Google Drive under the current directory, copy and execute the following code:

```python
from google.colab import drive
drive.mount('mydrive')
```
You will be asked to enter an authorization code. Once you are done, press `REFRESH` in the Files tab on the left. You should now see a new folder, called `mydrive`, appearing under `content`. You can also use `!ls` to get the updated list. Now **you can access** any file you have in your Google Drive. This is how we will import files into Colaboratory:

- Step 1. Upload the file from your PC to Google Drive
- Step 2. Connect Colaboratory and Google Drive
- Step 3. Navigate to the file inside Google Drive. Right-click on the file, to get the full path. 
- Step 4. Import the file using the appropriate function, typically via:

```python
myData = pd.read_csv("/full path")
```
**Note**: Do not forget to add the first slash "/" in front of the full path.

**Alternative Approach**: You can simply upload the file into the current directory from your local PC. Remember, though, that the next time you connect to this notebook, the file will no longer be there, i.e., you will have to upload it again.

In [87]:
!pwd
!ls

/content
sample_data


now we can connect to our drive

In [88]:
from google.colab import drive
drive.mount('mydrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at mydrive


and then we can read the data set and load it into our workspace

In [89]:
pd.read_csv("/content/sample_data/california_housing_train.csv")

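|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|-----------|----------|--------------------|-------------|----------------|------------|------------|---------------|--------------------|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 | 65500.0 |
| 5 | -114.58 | 33.63 | 29.0 | 1387.0 | 236.0 | 671.0 | 239.0 | 3.3438 | 74000.0 |
| 6 | -114.58 | 33.61 | 25.0 | 2907.0 | 680.0 | 1841.0 | 633.0 | 2.6768 | 82400.0 |
| 7 | -114.59 | 34.83 | 41.0 | 812.0 | 168.0 | 375.0 | 158.0 | 1.7083 | 48500.0 |
| 8 | -114.59 | 33.61 | 34.0 | 4789.0 | 1175.0 | 3134.0 | 1056.0 | 2.1782 | 58400.0 |
| 9 | -114.60 | 34.83 | 46.0 | 1497.0 | 309.0 | 787.0 | 271.0 | 2.1908 | 48100.0 |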
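The introduction also lists saving data. A minimal sketch of the reverse operation uses the data frame method `to_csv()`. The file name and the temporary folder are assumptions for illustration; in Colaboratory the path would normally point inside the mounted drive instead.

```python
import os
import tempfile
import pandas as pd

small_df = pd.DataFrame({"Age": [46, 38], "Exper": [4, 10]})

# Hypothetical file name inside a temporary folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "small_df.csv")

    # index=False leaves the row index out of the file
    small_df.to_csv(path, index=False)

    # Reading the file back gives the same columns and values
    reloaded = pd.read_csv(path)

print(reloaded.equals(small_df))
```

If the index is written to the file (the default), reading it back produces an extra unnamed column, which is why `index=False` is used here.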
