<a href="https://colab.research.google.com/github/dianakorka/dmml2021/blob/master/week2/Basic_Pandas_Load_File.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Pandas operations

*Goal*: Learn how to read a dataset into a **Pandas Data Frame**.

Then we'll look at what a Pandas Data Frame is and what some of its properties are, and perform some basic data manipulations. This will help you better understand your data.

## 1. Loading a structured dataset that is made available in CSV format

First, import the `Pandas` package. It comes pre-installed in Google Colab and Anaconda.

In [None]:
import pandas as pd  # press Shift+Enter to run the cell (works on both Mac and Windows)

Recap from last time: you can autocomplete your code with functions that are included in `pandas`. E.g., type `pd.read` and see that it suggests some functions.

Let's load a CSV file from the [data folder](https://github.com/michalis0/DataMining_and_MachineLearning/tree/master/week2/data) of the Git repository folder for week 2. Select the data file `pandas_tutorial_read.csv` and then click on `Raw` to obtain the link for the code below.

In [2]:
# let's load a CSV file
data = pd.read_csv('https://raw.githubusercontent.com/dianakorka/dmml2021/master/week2/data/pandas_tutorial_read.csv')
data.head()

Unnamed: 0,2018-01-01 00:01:01;read;country_7;2458151261;SEO;North America
0,2018-01-01 00:03:20;read;country_7;2458151262;...
1,2018-01-01 00:04:01;read;country_7;2458151263;...
2,2018-01-01 00:04:02;read;country_7;2458151264;...
3,2018-01-01 00:05:03;read;country_8;2458151265;...
4,2018-01-01 00:05:42;read;country_6;2458151266;...


Is the above correct? Most likely not. We see there are `;` (semi-colons) and the data seem to be all read in one column.

The problem is that the default delimiter in the `pd.read_csv()` function is comma `,` so we need to change it to `;`.

In [3]:
data = pd.read_csv('https://raw.githubusercontent.com/dianakorka/dmml2021/master/week2/data/pandas_tutorial_read.csv',
                   delimiter=';')
data.head()

Unnamed: 0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
0,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
1,2018-01-01 00:04:01,read,country_7,2458151263,AdWords,Africa
2,2018-01-01 00:04:02,read,country_7,2458151264,AdWords,Europe
3,2018-01-01 00:05:03,read,country_8,2458151265,Reddit,North America
4,2018-01-01 00:05:42,read,country_6,2458151266,Reddit,North America


This looks better, but something else is now wrong: the file has no header row, so the first data row was used as the column names. If you do not know the column names, you can pass `header=None` so that the first row is kept as data and the columns get default numeric names. For more information, see the `pd.read_csv()` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

In [4]:
data = pd.read_csv('https://raw.githubusercontent.com/dianakorka/dmml2021/master/week2/data/pandas_tutorial_read.csv',
                   delimiter=';',
                   header=None)
data.head()

Unnamed: 0,0,1,2,3,4,5
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
2,2018-01-01 00:04:01,read,country_7,2458151263,AdWords,Africa
3,2018-01-01 00:04:02,read,country_7,2458151264,AdWords,Europe
4,2018-01-01 00:05:03,read,country_8,2458151265,Reddit,North America


Column names can usually be derived from the metadata that is published together with the dataset. We can set them with the `names` parameter, as below.

In [5]:
data = pd.read_csv('https://raw.githubusercontent.com/dianakorka/dmml2021/master/week2/data/pandas_tutorial_read.csv',
                   delimiter=';',
                   names = ['my_datetime', 'event', 'country', 'user_id', 'source', 'topic'])
data.head()

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
2,2018-01-01 00:04:01,read,country_7,2458151263,AdWords,Africa
3,2018-01-01 00:04:02,read,country_7,2458151264,AdWords,Europe
4,2018-01-01 00:05:03,read,country_8,2458151265,Reddit,North America
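The loading steps above can also be tried without a network connection. The sketch below is only an illustration: it assumes two hypothetical rows in the same semicolon-separated layout as the tutorial file, held in an in-memory buffer instead of a URL.

```python
import io
import pandas as pd

# Two hypothetical rows in the same layout as the tutorial file (no header row).
raw = (
    "2018-01-01 00:01:01;read;country_7;2458151261;SEO;North America\n"
    "2018-01-01 00:03:20;read;country_7;2458151262;SEO;South America\n"
)

# Default comma delimiter: every line lands in a single column,
# and the first line is consumed as the header.
wrong = pd.read_csv(io.StringIO(raw))

# Semicolon delimiter, no header row in the file, explicit column names.
names = ['my_datetime', 'event', 'country', 'user_id', 'source', 'topic']
right = pd.read_csv(io.StringIO(raw), delimiter=';', header=None, names=names)

print(wrong.shape)  # (1, 1)
print(right.shape)  # (2, 6)
```

Comparing the two shapes is a quick way to spot a wrong delimiter before looking at the rows themselves.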


### A first look at the Pandas Data Frame

The `data.head()` method shows the first 5 rows. You can also check:

- the whole dataset: just type ```data```
- the last 5 rows with ```data.tail()``` or
- a random sample such as ```data.sample(5)```

Try it out below:

In [6]:
data.sample(5)

Unnamed: 0,my_datetime,event,country,user_id,source,topic
555,2018-01-01 07:37:41,read,country_6,2458151816,Reddit,Europe
322,2018-01-01 04:18:39,read,country_3,2458151583,AdWords,North America
1752,2018-01-01 23:29:22,read,country_6,2458153013,Reddit,Asia
303,2018-01-01 04:07:19,read,country_2,2458151564,Reddit,Australia
756,2018-01-01 10:09:20,read,country_7,2458152017,Reddit,South America


## Data Frame components
There are three components of a Data Frame:
- the index,
- columns and
- data (values).

We can store each of these components into separate variables. Let's do that and then inspect them:

In [7]:
index = data.index
columns = data.columns
values = data.values

In [8]:
index

RangeIndex(start=0, stop=1795, step=1)

By default, each row is given an index number, starting from 0, then 1,2,3, etc. up until the maximum number of rows (n) minus 1.

In [9]:
columns

Index(['my_datetime', 'event', 'country', 'user_id', 'source', 'topic'], dtype='object')

In [10]:
values

array([['2018-01-01 00:01:01', 'read', 'country_7', 2458151261, 'SEO',
        'North America'],
       ['2018-01-01 00:03:20', 'read', 'country_7', 2458151262, 'SEO',
        'South America'],
       ['2018-01-01 00:04:01', 'read', 'country_7', 2458151263,
        'AdWords', 'Africa'],
       ...,
       ['2018-01-01 23:59:36', 'read', 'country_6', 2458153053, 'Reddit',
        'Asia'],
       ['2018-01-01 23:59:36', 'read', 'country_7', 2458153054,
        'AdWords', 'Europe'],
       ['2018-01-01 23:59:38', 'read', 'country_5', 2458153055, 'Reddit',
        'Asia']], dtype=object)

## Data types of the Pandas Data Frame components

In [11]:
type(index)

pandas.core.indexes.range.RangeIndex

In [12]:
type(columns)

pandas.core.indexes.base.Index

In [13]:
type(values)

numpy.ndarray


The index and the columns are both pandas **`Index`** objects (**`RangeIndex`** is a subclass of **`Index`**), i.e. a sequence of labels for either the rows or the columns.

The values are a NumPy **`ndarray`**, which stands for n-dimensional array, and is the primary container of data in the NumPy library. Pandas is built directly on top of NumPy.
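These type relationships can be checked directly. The snippet below is a small sketch on a made-up two-row frame (the name `toy` is an assumption, not part of the tutorial data).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'event': ['read', 'read'], 'user_id': [1, 2]})

# RangeIndex is a subclass of Index, so row and column labels are both Index objects.
print(isinstance(toy.index, pd.RangeIndex))  # True
print(isinstance(toy.index, pd.Index))       # True
print(isinstance(toy.columns, pd.Index))     # True

# .to_numpy() exposes the data as a NumPy array, like .values does.
arr = toy.to_numpy()
print(isinstance(arr, np.ndarray), arr.shape)  # True (2, 2)
```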

## General information about the data

Using the `info()` method you can obtain a concise summary of the data, including the data type of each column: here `object` (used for strings) for most columns and `int64` for `user_id`.

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   my_datetime  1795 non-null   object
 1   event        1795 non-null   object
 2   country      1795 non-null   object
 3   user_id      1795 non-null   int64 
 4   source       1795 non-null   object
 5   topic        1795 non-null   object
dtypes: int64(1), object(5)
memory usage: 84.3+ KB


The `shape` property allows you to see how many rows and columns there are.

In [15]:
data.shape

(1795, 6)

In [16]:
# this is the number of rows/observations, the data numerosity
data.shape[0]

1795

In [17]:
# this is the number of columns or attributes, the data dimensionality
data.shape[1]

6

## Selecting columns

If you want to select some particular columns from the data frame you can do it like this:

```data[['country', 'user_id']]```

It is also possible to use a **different order**:

```data[['user_id', 'country']]```.

The way to remember the syntax is that outer brackets signify that you want to select columns, and the inner brackets are for the list of columns itself.

Try it out.

In [18]:
data[['user_id', 'source', 'country']]

Unnamed: 0,user_id,source,country
0,2458151261,SEO,country_7
1,2458151262,SEO,country_7
2,2458151263,AdWords,country_7
3,2458151264,AdWords,country_7
4,2458151265,Reddit,country_8
...,...,...,...
1790,2458153051,AdWords,country_2
1791,2458153052,SEO,country_8
1792,2458153053,Reddit,country_6
1793,2458153054,AdWords,country_7


In [19]:
type(data[['user_id', 'source', 'country']])

The above returns a `pandas.DataFrame`. If you want to return a `pandas.Series` instead then you can use this syntax:

```data.user_id ```

or

``` data['user_id'] ```

In [20]:
data.user_id

0       2458151261
1       2458151262
2       2458151263
3       2458151264
4       2458151265
           ...    
1790    2458153051
1791    2458153052
1792    2458153053
1793    2458153054
1794    2458153055
Name: user_id, Length: 1795, dtype: int64

In [21]:
type(data.user_id)
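The difference between the two return types can be seen on a small made-up frame (the name `toy` and its values are assumptions for illustration). Note that attribute access such as `toy.user_id` only works when the column name is a valid Python identifier; the bracket form works for any column name.

```python
import pandas as pd

toy = pd.DataFrame({'user_id': [1, 2], 'country': ['country_7', 'country_8']})

as_frame = toy[['user_id']]  # list of column names -> DataFrame
as_series = toy['user_id']   # single column name -> Series

print(type(as_frame).__name__)        # DataFrame
print(type(as_series).__name__)       # Series
print(toy.user_id.equals(as_series))  # True
```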

## Selecting rows

You can also select a few rows of your dataset using the Data Frame index. For example below we select the first two rows, from 0 (included) to 2 (last value not included).

More on this, a little further below.

In [22]:
data[0:2]

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America


## Boolean indexing

Boolean indexing filters the rows that satisfy a condition, in our case the events whose source is SEO. To do so you can write:

``` data[data.source == 'SEO'] ```

where the inner statement creates a boolean mask.

In [23]:
data[data.source == 'SEO']

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
11,2018-01-01 00:08:57,read,country_7,2458151272,SEO,Australia
15,2018-01-01 00:11:22,read,country_7,2458151276,SEO,North America
16,2018-01-01 00:13:05,read,country_8,2458151277,SEO,North America
...,...,...,...,...,...,...
1772,2018-01-01 23:45:58,read,country_7,2458153033,SEO,South America
1777,2018-01-01 23:49:52,read,country_5,2458153038,SEO,North America
1779,2018-01-01 23:51:25,read,country_4,2458153040,SEO,South America
1784,2018-01-01 23:54:03,read,country_2,2458153045,SEO,North America
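To make the boolean mask visible, here is a minimal sketch on a made-up four-row frame (the data and the name `toy` are assumptions). It also shows one way to combine two conditions.

```python
import pandas as pd

toy = pd.DataFrame({
    'source': ['SEO', 'AdWords', 'SEO', 'Reddit'],
    'country': ['country_7', 'country_7', 'country_2', 'country_2'],
})

mask = toy.source == 'SEO'  # boolean Series, one value per row
print(mask.tolist())        # [True, False, True, False]

# Conditions combine with & (and) and | (or); each one needs its own parentheses.
both = toy[(toy.source == 'SEO') & (toy.country == 'country_7')]
print(both.index.tolist())  # [0]
```

The filtered frame keeps the original index labels, which is why the output above shows gaps in the row numbers.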


## Selecting both rows and columns by name (`df.loc`) or by position (`df.iloc`)

Sometimes you need to select the values for a given set of rows and columns, like below.

The recommended way to do this is by using either `df.loc` or `df.iloc`.   

The first one is *label based* so you need to pass it the index value of the row and the column names.

In [24]:
data.loc[0, 'topic']

'North America'

In [25]:
data.loc[0:2, ['my_datetime', 'event']]

Unnamed: 0,my_datetime,event
0,2018-01-01 00:01:01,read
1,2018-01-01 00:03:20,read
2,2018-01-01 00:04:01,read



The second one is *integer-position based*, so you need to pass it the row and column number, so for example row 0 (first row) and column 0 (the first column of the data frame, for us `my_datetime`).


In [26]:
data.iloc[[0,2], 0:2]

Unnamed: 0,my_datetime,event
0,2018-01-01 00:01:01,read
2,2018-01-01 00:04:01,read
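One detail that is easy to miss in the outputs above: a slice passed to `loc` includes its end label, whereas a slice passed to `iloc` excludes its end position. A short sketch on a made-up frame (the name `toy` is an assumption):

```python
import pandas as pd

toy = pd.DataFrame({'event': ['a', 'b', 'c', 'd']})

by_label = toy.loc[0:2]      # label slice: the end label 2 is included
by_position = toy.iloc[0:2]  # position slice: the end position 2 is excluded

print(len(by_label))     # 3
print(len(by_position))  # 2
```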



Used on the left-hand side of an assignment, either of these two indexers writes directly into the original data frame, so they can be used for replacing values. This is different from chained indexing (see the section below).

For example, you can use the code below to change the value of an observation. The change is made in place and carries forward to the data frame.

In [27]:
data.loc[0, 'topic']='USA'

In [28]:
data.head(1)

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,USA


### Chaining (or chained indexing)

Selecting by row and column can also be done using a combination of selection methods as follows:

``` data[['country', 'user_id']][0:1] ```


In [None]:
data[['country', 'user_id']][0:1]

Unnamed: 0,country,user_id
0,country_7,2458151261


**CAUTION**: Keep in mind that when you use chaining you may be working on a *copy* of the original DataFrame. So if you use chaining to change data, you may observe that the original DataFrame was not changed, and pandas may also warn you about it.

If you try to replace a value in this way, it will not modify the original data frame.

In [None]:
data[['country', 'user_id']][0:1].user_id=1

In [None]:
data.head(1)

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,USA
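The contrast between assigning through `.loc` and assigning to a column selection can be reproduced on a toy frame. This is a sketch under assumptions: the frame `toy` and its values are made up, and the intermediate `subset` variable stands in for the first step of a chained expression.

```python
import warnings
import pandas as pd

toy = pd.DataFrame({'country': ['country_7', 'country_8'], 'user_id': [10, 20]})

# Writing through .loc on the frame itself changes the frame.
toy.loc[0, 'user_id'] = 11

# Selecting a list of columns first produces a new object; writing to it
# leaves the original untouched. Some pandas versions warn about this,
# so warnings are silenced here for the sake of the demo.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    subset = toy[['country', 'user_id']]
    subset.loc[1, 'user_id'] = 99

print(toy.user_id.tolist())     # [11, 20]
print(subset.user_id.tolist())  # [11, 99]
```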


You can find more documentation on indexing and chained indexing [in the Pandas documentation here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).

---

Now it's your turn to solve an exercise and deepen your knowledge.


<div class="alert alert-block alert-success">
    <h2>Exercise 1:</h2>


    
>Select the user_id, the country and the topic columns for the users who are from country_2, and show only the first 10 rows
</div>

In [None]:
# enter your solution here.

---
## 2. Loading JSON files

Much of the data on the Internet is available in JSON format, a semi-structured text format that is very similar to a Python dictionary.

We will see how to load a JSON dataset in a Pandas DataFrame.

We will use the Oslo City Bike API, which provides open real-time data on the [City bike stations in Oslo, Norway](https://oslobysykkel.no/en/open-data/realtime).

To have a look at the latest JSON data by station, open this link: https://gbfs.urbansharing.com/oslobysykkel.no/station_status.json

<img src='https://upload.wikimedia.org/wikipedia/commons/a/ae/SmartBike_%C3%A0_Oslo_Norv%C3%A8ge_en_2016_.jpg' width="300">


In [None]:
import requests
url = 'https://gbfs.urbansharing.com/oslobysykkel.no/station_status.json'
data = requests.get(url).json()
data

{'data': {'stations': [{'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1633278843,
    'num_bikes_available': 1,
    'num_docks_available': 17,
    'station_id': '2315'},
   {'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1633278843,
    'num_bikes_available': 17,
    'num_docks_available': 12,
    'station_id': '2309'},
   {'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1633278843,
    'num_bikes_available': 11,
    'num_docks_available': 1,
    'station_id': '2308'},
   {'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1633278843,
    'num_bikes_available': 24,
    'num_docks_available': 5,
    'station_id': '2307'},
   {'is_installed': 1,
    'is_renting': 1,
    'is_returning': 1,
    'last_reported': 1633278843,
    'num_bikes_available': 15,
    'num_docks_available': 3,
    'station_id': '2306'},
   {'is_installed': 1,
    'is_renting'

In [None]:
type(data)

dict

Above you see what the JSON file looks like. The JSON result contains 4 keys: `last_updated`, `ttl`, `version`, and `data`. The value of `data` is a dictionary, and its `stations` key holds a list of dictionaries with the latest data on each City bike station.

In [None]:
data.keys()

dict_keys(['last_updated', 'ttl', 'version', 'data'])
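The nesting can be explored without calling the API. The sketch below assumes a hypothetical payload with the same four top-level keys; the `ttl` and `version` values and the reduced set of station fields are made up for illustration.

```python
import pandas as pd

# Hypothetical payload with the same nesting as the station_status feed.
payload = {
    'last_updated': 1633278843,
    'ttl': 10,         # made-up value
    'version': '2.2',  # made-up value
    'data': {'stations': [
        {'station_id': '2315', 'num_bikes_available': 1, 'num_docks_available': 17},
        {'station_id': '2309', 'num_bikes_available': 17, 'num_docks_available': 12},
    ]},
}

stations = payload['data']['stations']
print(type(payload['data']).__name__)  # dict
print(type(stations).__name__)         # list

# Each dictionary becomes a row, each key a column.
toy = pd.DataFrame(stations)
print(toy.shape)  # (2, 3)
```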

With Pandas we can easily convert a list of dictionaries into a DataFrame.

In [None]:
import pandas as pd  # same alias as in section 1
df = pd.DataFrame(data["data"]["stations"])
df.head(5)

Unnamed: 0,station_id,is_installed,is_renting,is_returning,last_reported,num_bikes_available,num_docks_available
0,2315,1,1,1,1633278843,1,17
1,2309,1,1,1,1633278843,17,12
2,2308,1,1,1,1633278843,11,1
3,2307,1,1,1,1633278843,24,5
4,2306,1,1,1,1633278843,15,3


To see if the data has been imported correctly, we can verify the datatypes of the columns. Pandas tries to infer the datatypes and in this case it does a pretty good job. In general, you should consider explicitly providing the datatype of each column.

In [None]:
df.dtypes

station_id             object
is_installed            int64
is_renting              int64
is_returning            int64
last_reported           int64
num_bikes_available     int64
num_docks_available     int64
dtype: object
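One possible way to set datatypes explicitly after loading is `astype` with a column-to-type mapping. The sketch below uses a made-up frame, and the choice of `string` and `bool` as target types is an assumption for illustration, not a recommendation from the data provider.

```python
import pandas as pd

toy = pd.DataFrame({'station_id': ['2315', '2309'], 'is_renting': [1, 0]})

# Cast columns explicitly instead of relying on inference.
typed = toy.astype({'station_id': 'string', 'is_renting': 'bool'})

print(typed.dtypes)
print(typed['is_renting'].tolist())  # [True, False]
```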

One column that does not look correctly parsed is **last_reported**, which was read as an `integer`, so you may want to convert it to the `datetime` type.

<div class="alert alert-block alert-success">
    <h2>Exercise 2:</h2>


    
>Convert the **last_reported** column from UNIX date format (read as integer here) to `datetime` datatype. <br>
**Hint**: Use the [pandas.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function.
</div>

In [None]:
df.last_reported.head(2)

0    1633278843
1    1633278843
Name: last_reported, dtype: int64

In [None]:
# Your solution here

0     2021-10-03 16:34:03
1     2021-10-03 16:34:03
2     2021-10-03 16:34:03
3     2021-10-03 16:34:03
4     2021-10-03 16:34:03
              ...        
245   2021-10-03 16:34:03
246   2021-10-03 16:34:03
247   2021-10-03 16:34:03
248   2021-10-03 16:34:03
249   2021-10-03 16:34:03
Name: last_reported, Length: 250, dtype: datetime64[ns]

Let's confirm that the **last_reported** column is now of type `datetime`.
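One way to run such a check is sketched below on a toy Series that is already of `datetime` type (the Series `converted` is made up here, so the conversion step of the exercise is not shown).

```python
import pandas as pd

# Toy Series that already holds datetime values.
converted = pd.Series(
    pd.to_datetime(['2021-10-03 16:34:03', '2021-10-03 16:34:03']),
    name='last_reported',
)

# True for any datetime64 column, with or without a time zone.
print(pd.api.types.is_datetime64_any_dtype(converted))  # True

# The .dt accessor is only available on datetime-like columns.
print(converted.dt.year.tolist())  # [2021, 2021]
```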

## Adding a column...or two

We notice that:

- **total_docks** = **num_bikes_available** (bikes ready to rent) + **num_docks_available** (how many docks are free)

Thus we can add a column with the name `total_docks`. And then we can add a column `perc_full` that shows how full each station is.

In [None]:
df['total_docks'] = df.num_bikes_available + df.num_docks_available

In [None]:
df["perc_full"] = df.num_bikes_available / df.total_docks
df.head()

Unnamed: 0,station_id,is_installed,is_renting,is_returning,last_reported,num_bikes_available,num_docks_available,total_docks,perc_full
0,2315,1,1,1,1633278843,1,17,18,0.055556
1,2309,1,1,1,1633278843,17,12,29,0.586207
2,2308,1,1,1,1633278843,11,1,12,0.916667
3,2307,1,1,1,1633278843,24,5,29,0.827586
4,2306,1,1,1,1633278843,15,3,18,0.833333


## Summary Statistics

You can also use the `describe` method of Pandas to get a general understanding of the central tendency and the spread of each numeric column.

In [None]:
df.describe()

Unnamed: 0,is_installed,is_renting,is_returning,last_reported,num_bikes_available,num_docks_available,total_docks,perc_full
count,250.0,250.0,250.0,250.0,250.0,250.0,250.0,250.0
mean,1.0,1.0,1.0,1633279000.0,8.864,13.028,21.892,0.419544
std,0.0,0.0,0.0,0.0,7.130841,8.840125,9.02297,0.306956
min,1.0,1.0,1.0,1633279000.0,0.0,0.0,6.0,0.0
25%,1.0,1.0,1.0,1633279000.0,3.0,6.0,15.0,0.1525
50%,1.0,1.0,1.0,1633279000.0,8.0,12.0,20.0,0.396429
75%,1.0,1.0,1.0,1633279000.0,12.75,18.0,27.0,0.666667
max,1.0,1.0,1.0,1633279000.0,36.0,44.0,49.0,1.0
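Individual statistics can be read from the result of `describe()` like from any other data frame. A small sketch on made-up values (the frame `toy` is an assumption); it also shows that text columns are left out by default when numeric columns are present.

```python
import pandas as pd

toy = pd.DataFrame({
    'station_id': ['a', 'b', 'c', 'd'],
    'num_bikes_available': [0, 4, 8, 12],
})

stats = toy.describe()

print(list(stats.columns))                       # ['num_bikes_available']
print(stats.loc['mean', 'num_bikes_available'])  # 6.0
print(stats.loc['50%', 'num_bikes_available'])   # 6.0
```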


The question for next week's lab will be: **Are the values in the summary statistics what you expected them to be?**


## Writing the data to a CSV file

With the above, we just scratched the surface of what it means to do data processing.

After you have done your basic data processing, you may want to save the DataFrame in a new CSV file, so that you don't have to repeat the same pre-processing every time. You can use the [to_csv](https://datatofish.com/export-dataframe-to-csv/) method.

**Note**: When you use Google Colab, this file will only be saved in your temporary virtual machine space and will be deleted once your Colab instance is closed (i.e. you close the window). To access the file, click on the Files icon in the command palette on the left hand side of the web interface.

If you want to explore more permanent solutions of saving your file, see [here](https://colab.research.google.com/notebooks/io.ipynb).


In [None]:
# uncomment the following to save the file
#df.to_csv("my_new_file.csv", sep=',', index=False)
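To see what `index=False` does without creating a file, the sketch below writes a made-up frame to an in-memory buffer and reads it back. The frame `toy` is an assumption; the `dtype` argument on reading is there because a CSV file does not store datatypes, so `station_id` would otherwise be inferred as an integer.

```python
import io
import pandas as pd

toy = pd.DataFrame({'station_id': ['2315', '2309'], 'perc_full': [0.25, 0.5]})

buffer = io.StringIO()
toy.to_csv(buffer, sep=',', index=False)  # index=False leaves out the row labels
header = buffer.getvalue().splitlines()[0]
print(header)  # station_id,perc_full

# Read the text back, keeping station_id as text.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={'station_id': str})
print(restored['station_id'].tolist())  # ['2315', '2309']
print(restored['perc_full'].tolist())   # [0.25, 0.5]
```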

## (optional) Yes, but how about saving and reading JSON files separately?

See below how to save a dictionary as a JSON file and then read it in with `pd.read_json()`.

In [None]:
import json
with open("oslo.json", mode='w') as f:
  json.dump(data['data']['stations'], f)

In [None]:
pd.read_json('oslo.json')  # same relative path as written above (in Colab this is /content/oslo.json)

Unnamed: 0,station_id,is_installed,is_renting,is_returning,last_reported,num_bikes_available,num_docks_available
0,2315,1,1,1,1633278843,1,17
1,2309,1,1,1,1633278843,17,12
2,2308,1,1,1,1633278843,11,1
3,2307,1,1,1,1633278843,24,5
4,2306,1,1,1,1633278843,15,3
...,...,...,...,...,...,...,...
245,457,1,1,1,1633278843,3,6
246,377,1,1,1,1633278843,17,12
247,738,1,1,1,1633278843,2,10
248,460,1,1,1,1633278843,2,25
