<a href="https://colab.research.google.com/github/aguinaldoabbj/minicourse_open_data_natal_2019/blob/master/1_intro_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas basics




## Introduction

Pandas is a library that unifies the most common workflows that data analysts and data scientists previously relied on many different libraries for. Pandas has quickly become an important tool in a data professional's toolbelt and is the most popular library for working with tabular data in Python. Tabular data is any data that can be represented as rows and columns.

To represent tabular data, Pandas uses a custom data structure called a **DataFrame**. A DataFrame is a highly efficient, 2-dimensional data structure that provides a suite of methods and attributes to quickly explore, analyze, and visualize data. The DataFrame is similar to the NumPy 2D array but adds support for many features that help you work with tabular data.

One of the biggest advantages that Pandas has over NumPy is the **ability to store mixed data types** in rows and columns. Many tabular datasets contain a range of data types, and Pandas DataFrames handle them effortlessly, whereas a regular NumPy array holds a single data type. Pandas DataFrames **can also handle missing values gracefully**, using the special value **NaN** to represent them. A common complaint about NumPy is that it offers little built-in support for missing values, so people end up having to find and replace them manually. In addition, Pandas DataFrames carry axis labels for both rows and columns, which let you refer to elements in the DataFrame more intuitively. Since many tabular datasets contain column titles, DataFrames preserve that metadata from the file alongside the data.
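As a minimal sketch of these two points, the block below builds a small DataFrame by hand. The column names and values are made up for illustration and are not taken from the course dataset; the block carries its own imports so it can be run in isolation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a text column, an integer column, and a float
# column with one missing entry.
toy = pd.DataFrame({
    "company": ["A", "B", "C"],
    "line": [21, 22, 30],
    "passengers": [13493.0, np.nan, 8519.0],
})

print(toy.dtypes)                 # each column keeps its own dtype
print(toy["passengers"].isna())   # flags the missing entry
print(toy["passengers"].mean())   # NaN is skipped by default
```

The mean is computed over the two available values only, which is the kind of graceful handling of missing data described above.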

To get started, we have to import the Pandas library:

In [0]:
import pandas as pd

## Our case study

In this part of the course, you'll learn the basics of pandas while exploring several datasets from [Dados Abertos Natal](http://dados.natal.br/). Let's start with the "Bilhetagem" dataset, which refers to the bus service ticketing system in Natal.

You can download it through this URL: [Bilhetagem Analítica 2018](http://dados.natal.br/dataset/4fad551d-4d3b-4597-b8d3-7e887e22332e/resource/ec5b95a3-7b93-4346-98f6-1bd013faa651/download/dados-be-2018-analitico.csv)

## Downloading Data

We can download data directly from [Dados Abertos Natal](http://dados.natal.br/) to our workspace in Google Colab. We can use the 'wget' download utility (available on most systems) to get the job done. The exclamation mark at the beginning tells Colab to run a system command instead of passing the line to the Python interpreter.

In [3]:
!wget -c https://raw.githubusercontent.com/aguinaldoabbj/minicourse_open_data_natal_2019/master/data/dados-be-2018-analitico.csv
#wget -c https://github.com/aguinaldoabbj/minicourse_open_data_natal_2019/blob/master/data/dados-be-2018-analitico.csv

--2019-03-02 14:17:05--  https://raw.githubusercontent.com/aguinaldoabbj/minicourse_open_data_natal_2019/master/data/dados-be-2018-analitico.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 72991 (71K) [text/plain]
Saving to: ‘dados-be-2018-analitico.csv’


2019-03-02 14:17:05 (5.04 MB/s) - ‘dados-be-2018-analitico.csv’ saved [72991/72991]



We can check that the file was downloaded using 'ls':


In [4]:
!ls -lah  dados-be*

-rw-r--r-- 1 root root 72K Mar  2 14:17 dados-be-2018-analitico.csv


# First steps with pandas

## Loading data from a CSV file
As we can see, the downloaded data is a CSV file. CSVs are commonly used to store tabular data. In simple terms, a CSV file contains table rows whose cells are separated by some token, such as a comma or a semicolon. Pandas provides the read_csv() function, which takes the path of the CSV file and produces a DataFrame representation of its data.

In [0]:
# The CSV file uses ';' as separator and is encoded in iso-8859-1
df = pd.read_csv("dados-be-2018-analitico.csv", encoding='iso-8859-1', sep=';')
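To see why the `sep` and `encoding` arguments matter, here is a small sketch that reads an in-memory sample instead of the downloaded file. The two-row sample and the variable names are assumptions made for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row sample, encoded as ISO-8859-1 and separated by ';'
raw = "Mês;Empresa;Linha\n1;CONCEIÇÃO;21\n1;CONCEIÇÃO;22\n".encode("iso-8859-1")

# With the right separator and encoding we get three proper columns
sample = pd.read_csv(io.BytesIO(raw), encoding="iso-8859-1", sep=";")
print(sample.columns.tolist())

# With the default separator (',') each line ends up in a single column
wrong = pd.read_csv(io.BytesIO(raw), encoding="iso-8859-1")
print(wrong.shape)
```

If the columns of a freshly loaded DataFrame look glued together like in `wrong`, the separator is usually the first thing to check.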

We can check that the data was loaded successfully by calling the DataFrame's head() method to see the first 5 rows:

In [6]:
df.head(5)

Unnamed: 0,Mês,Empresa,Linha,Estudante_Cartao,Estudante_BT,Vale_Transporte,Integracao_Plena,Integracao_Complementar,Gratuito_Cartao,Gratuito_BT,Inteira_Cartao,Inteira_Especie,Tarifa_Social,Qtd_Viagens
0,1,CONCEIÇÃO,21,13.493,9.506,35.944,8.671,0.0,17.138,7.397,6.542,43.727,3.965,3.684
1,1,CONCEIÇÃO,22,4.644,3.371,13.088,3.23,0.0,8.109,3.157,2.529,15.363,0.0,1.737
2,1,CONCEIÇÃO,30,8.519,5.614,20.58,5.346,0.0,7.265,3.191,4.415,21.507,1.537,2.5
3,1,CONCEIÇÃO,31,9.631,6.02,23.46,7.587,0.0,7.065,2.922,5.491,22.212,140.0,2.478
4,1,CONCEIÇÃO,41,6.556,4.038,20.388,7.924,0.0,5.704,2.791,3.883,13.87,0.0,2.296


### Exploring the dataframe

Now that we've read the dataset into a DataFrame, we can start using more Pandas DataFrame methods to explore the data. In the same way we have just used head() to see the first rows of the DataFrame, we can use the tail() method to see the last ones:

In [21]:
df.tail()

Unnamed: 0,Mês,Empresa,Linha,Estudante_Cartao,Estudante_BT,Vale_Transporte,Integracao_Plena,Integracao_Complementar,Gratuito_Cartao,Gratuito_BT,Inteira_Cartao,Inteira_Especie,Tarifa_Social,Qtd_Viagens
944,10,SANTA MARIA - URB,561,61.0,27.0,2029.0,536.0,0.0,131.0,7.0,219.0,695.0,120.0,340.0
945,10,VIA SUL - URB,50,48362.0,10350.0,71459.0,22085.0,215.0,18052.0,9226.0,14547.0,58611.0,8484.0,5172.0
946,10,VIA SUL - URB,51,30382.0,6422.0,25362.0,9241.0,0.0,10020.0,7065.0,6933.0,29185.0,2293.0,3265.0
947,10,VIA SUL - URB,52,21965.0,5475.0,23329.0,7763.0,0.0,10908.0,7764.0,6190.0,27879.0,2089.0,3120.0
948,10,VIA SUL - URB,65,14231.0,3355.0,14359.0,4247.0,0.0,4541.0,2308.0,3658.0,13157.0,0.0,1739.0


A new DataFrame containing 10 random rows of the original DataFrame can be created by using the sample() method:

In [22]:
df.sample(10)

Unnamed: 0,Mês,Empresa,Linha,Estudante_Cartao,Estudante_BT,Vale_Transporte,Integracao_Plena,Integracao_Complementar,Gratuito_Cartao,Gratuito_BT,Inteira_Cartao,Inteira_Especie,Tarifa_Social,Qtd_Viagens
325,4,GUANABARA - URB,75,14.054,5.433,18.184,3.86,120.0,5.986,2.319,4.512,18.333,616.0,1.909
487,6,CONCEIÇÃO,599,8830.0,2806.0,14389.0,5953.0,7.0,4768.0,811.0,3536.0,9744.0,0.0,1385.0
556,6,SANTA MARIA - URB,54,27472.0,17008.0,36128.0,14230.0,0.0,10828.0,7197.0,7932.0,41499.0,0.0,3223.0
357,4,REUNIDAS,7847,23.672,11.535,33.532,7.26,95.0,11.067,4.995,9.012,33.191,1.345,2.588
876,10,DUNAS - URB,57,4165.0,804.0,3029.0,1242.0,0.0,1813.0,604.0,755.0,2672.0,0.0,500.0
13,1,CONCEIÇÃO,905,155.0,79.0,890.0,38.0,0.0,249.0,32.0,52.0,522.0,491.0,249.0
219,3,GUANABARA - URB,4,13.369,10.562,33.885,7.357,1.0,4.927,2.675,4.07,23.109,977.0,2.244
521,6,GUANABARA - URB,592,1588.0,392.0,1817.0,407.0,55.0,1026.0,244.0,366.0,1243.0,0.0,675.0
670,8,CONCEIÇÃO,76,4329.0,1428.0,7847.0,1701.0,57.0,3852.0,1316.0,1660.0,9830.0,0.0,700.0
273,3,SANTA MARIA - URB,48,17.962,10.429,16.002,7.693,0.0,8.347,4.397,3.446,19.621,459.0,2.105


These DataFrame printouts give an overview of the data structure (rows, columns and cells). We can get a better understanding of our DataFrame with a few more Pandas attributes and methods:

In [8]:
# get the shape of the DataFrame as (rows, columns)
df.shape


(949, 14)

In [9]:
# show the basic information about the dataset (columns, sizes, column types, etc)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 949 entries, 0 to 948
Data columns (total 14 columns):
Mês                        949 non-null int64
Empresa                    949 non-null object
Linha                      949 non-null int64
Estudante_Cartao           949 non-null float64
Estudante_BT               949 non-null float64
Vale_Transporte            949 non-null float64
Integracao_Plena           949 non-null float64
Integracao_Complementar    949 non-null float64
Gratuito_Cartao            949 non-null float64
Gratuito_BT                949 non-null float64
Inteira_Cartao             949 non-null float64
Inteira_Especie            949 non-null float64
Tarifa_Social              949 non-null float64
Qtd_Viagens                949 non-null float64
dtypes: float64(11), int64(2), object(1)
memory usage: 103.9+ KB


In [10]:
# describe statistical information about the numerical columns
df.describe()

Unnamed: 0,Mês,Linha,Estudante_Cartao,Estudante_BT,Vale_Transporte,Integracao_Plena,Integracao_Complementar,Gratuito_Cartao,Gratuito_BT,Inteira_Cartao,Inteira_Especie,Tarifa_Social,Qtd_Viagens
count,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0
mean,5.492097,1109.404636,7410.613845,2328.857335,10773.196672,3142.489758,70.314604,3736.196684,1852.104958,2381.738394,10533.645599,489.891953,1034.513133
std,2.88441,2239.082157,11892.630953,4004.715689,17442.582959,5382.396103,155.008061,5853.841755,2824.474806,3602.624978,16393.121063,1446.41432,1303.173175
min,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,3.0,39.0,11.796,6.554,19.368,6.418,0.0,8.371,4.458,5.582,19.119,0.0,2.918
50%,5.0,75.0,162.0,158.0,313.0,364.0,0.0,270.0,381.0,319.0,271.0,1.305,434.0
75%,8.0,599.0,10670.0,3239.0,15774.0,4184.0,65.0,6141.0,2991.0,3746.0,17110.0,381.0,1829.0
max,10.0,9018.0,64442.0,29076.0,112950.0,35284.0,1309.0,37702.0,17586.0,18331.0,88285.0,17298.0,5407.0


# Basic concepts and operations in pandas

## Indexing

When you read a file into a DataFrame, pandas uses the values in the first row (also known as the header) for the column labels and the row number for the row labels. Collectively, the labels are referred to as the index. DataFrames contain both a row index and a column index. Here's a diagram that displays some of the column and row labels for the data:

![Pandas DataFrame](https://raw.githubusercontent.com/ivanovitchm/cba2018/master/1-intro-pandas/indexing.png)

The labels allow us to refer to values in the DataFrame, which we'll learn more about in the rest of this notebook.
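Both sets of labels can be inspected directly. The sketch below uses a tiny hand-built DataFrame (an assumption for illustration, not the course file) and the `index` and `columns` attributes.

```python
import pandas as pd

# Hypothetical two-row DataFrame with the default row labels
toy = pd.DataFrame({"Empresa": ["CONCEIÇÃO", "REUNIDAS"], "Linha": [21, 7847]})

print(toy.index)     # row labels: RangeIndex(start=0, stop=2, step=1)
print(toy.columns)   # column labels
```

The same two attributes are available on the `df` loaded earlier in this notebook.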

## Data types

When you display an individual row, which pandas represents as a Series object, you may notice the text "dtype: object" after the last value. "dtype: object" refers to the data type, or dtype, of that Series. The object dtype can hold any Python object and is what pandas typically uses for string values. Pandas borrows from the NumPy type system and contains the following dtypes:

* "object" - for representing string values.
* "int" - for representing integer values.
* "float" - for representing float values.
* "datetime" - for representing time values.
* "bool" - for representing Boolean values.

When reading a file into a DataFrame, pandas analyzes the values and infers each column's type. To see the type of each column, use the DataFrame.dtypes attribute, which returns a Series containing each column name and its corresponding type. It is also possible to specify column types when reading the data by using the dtype parameter of read_csv(). Read more about data types in the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/).

Sometimes we encounter the value "NaN", which stands for Not a Number. Pandas uses it to mark values it cannot represent, and it is normally associated with missing values.

In [15]:
df.dtypes

Mês                          int64
Empresa                     object
Linha                        int64
Estudante_Cartao           float64
Estudante_BT               float64
Vale_Transporte            float64
Integracao_Plena           float64
Integracao_Complementar    float64
Gratuito_Cartao            float64
Gratuito_BT                float64
Inteira_Cartao             float64
Inteira_Especie            float64
Tarifa_Social              float64
Qtd_Viagens                float64
dtype: object
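The following sketch illustrates type inference, the `dtype` parameter of `read_csv()`, and how an empty cell shows up as NaN. The in-memory sample text is hypothetical and only reuses a few column names from the dataset.

```python
import io
import pandas as pd

# Hypothetical sample with an empty Qtd_Viagens cell in the second row
text = "Empresa;Linha;Qtd_Viagens\nA;21;3684\nB;22;\n"

# Let pandas infer the types: Linha becomes an integer column, while
# Qtd_Viagens becomes float because the empty cell is stored as NaN
inferred = pd.read_csv(io.StringIO(text), sep=";")
print(inferred.dtypes)
print(inferred["Qtd_Viagens"].isna().tolist())

# Force Linha to be read as text, e.g. to treat line numbers as identifiers
forced = pd.read_csv(io.StringIO(text), sep=";", dtype={"Linha": str})
print(forced["Linha"].tolist())
```

Passing a dictionary to `dtype` only overrides the listed columns; the others are still inferred.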

## Series
The Series object is a core data structure that Pandas uses to represent rows and columns. A Series is a labelled collection of values similar to a one-dimensional NumPy array. The main advantage of Series objects is the ability to use non-integer labels, whereas NumPy arrays can only be indexed by integer position.
Pandas uses this feature to provide more context when returning a row or a column from a DataFrame. For example, when you select a row from a DataFrame, instead of just returning the values in that row as a list, pandas returns a Series object that contains the column labels as well as the corresponding values.
The next section shows the Series object returned for a single row of our DataFrame.
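To illustrate labelled access in isolation, the sketch below builds a Series by hand. The three values are copied from the first row of the head() output shown earlier; only a subset of the columns is used to keep it short.

```python
import pandas as pd

# A hand-built Series whose labels are column names (subset of the first row)
first_row = pd.Series({"Mês": 1, "Empresa": "CONCEIÇÃO", "Linha": 21})

print(first_row["Empresa"])       # access a value by its label
print(first_row.index.tolist())   # the labels themselves
```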

## Selecting a row
While we use plain bracket notation to access elements in a NumPy array or a standard list, we use the Pandas loc[] indexer to select rows in a DataFrame by row label. Recall that when you read a file into a DataFrame, pandas uses the row number (or position) as each row's label by default. Pandas uses zero-indexing, so the first row is at index 0, the second row at index 1, and so on.
If you're interested in accessing a single row, pass the row label to loc[]. Pandas raises a KeyError if you pass a label that doesn't exist. For example, the following line selects the seventh row:

In [12]:
# Series object representing the seventh row.
df.loc[6]

Mês                                1
Empresa                    CONCEIÇÃO
Linha                             63
Estudante_Cartao              21.918
Estudante_BT                  14.054
Vale_Transporte               44.118
Integracao_Plena               14.63
Integracao_Complementar            0
Gratuito_Cartao               15.034
Gratuito_BT                    5.771
Inteira_Cartao                 7.765
Inteira_Especie               39.855
Tarifa_Social                  3.182
Qtd_Viagens                    3.768
Name: 6, dtype: object
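As a small sketch of what happens with a valid and an invalid label, the block below uses a hypothetical three-row DataFrame rather than the full dataset.

```python
import pandas as pd

# Hypothetical DataFrame with the default row labels 0, 1 and 2
toy = pd.DataFrame({"Linha": [21, 22, 30]})

row = toy.loc[1]        # Series for the row labelled 1
print(row["Linha"])

try:
    toy.loc[99]         # no row carries the label 99
except KeyError as err:
    print("KeyError:", err)
```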

## Selecting Multiple Rows

If you're interested in accessing multiple rows of the DataFrame, you can pass in either a slice of row labels or a list of row labels, and pandas will return a DataFrame. Note that unlike slicing lists in Python, a slice of a DataFrame using .loc[] will include both the start and the end row:

In [14]:
# DataFrame containing the rows at index 3, 4, 5, and 6 returned.
df.loc[3:6]

Unnamed: 0,Mês,Empresa,Linha,Estudante_Cartao,Estudante_BT,Vale_Transporte,Integracao_Plena,Integracao_Complementar,Gratuito_Cartao,Gratuito_BT,Inteira_Cartao,Inteira_Especie,Tarifa_Social,Qtd_Viagens
3,1,CONCEIÇÃO,31,9.631,6.02,23.46,7.587,0.0,7.065,2.922,5.491,22.212,140.0,2.478
4,1,CONCEIÇÃO,41,6.556,4.038,20.388,7.924,0.0,5.704,2.791,3.883,13.87,0.0,2.296
5,1,CONCEIÇÃO,59,8.184,6.086,22.623,7.332,0.0,13.514,4.373,4.622,29.075,4.573,2.65
6,1,CONCEIÇÃO,63,21.918,14.054,44.118,14.63,0.0,15.034,5.771,7.765,39.855,3.182,3.768


In [17]:
# DataFrame containing the rows at index 2, 5, and 10 returned
df.loc[[2,5,10]]

Unnamed: 0,Mês,Empresa,Linha,Estudante_Cartao,Estudante_BT,Vale_Transporte,Integracao_Plena,Integracao_Complementar,Gratuito_Cartao,Gratuito_BT,Inteira_Cartao,Inteira_Especie,Tarifa_Social,Qtd_Viagens
2,1,CONCEIÇÃO,30,8.519,5.614,20.58,5.346,0.0,7.265,3.191,4.415,21.507,1.537,2.5
5,1,CONCEIÇÃO,59,8.184,6.086,22.623,7.332,0.0,13.514,4.373,4.622,29.075,4.573,2.65
10,1,CONCEIÇÃO,411,5.208,2.869,10.397,4.33,0.0,4.723,1.654,2.676,11.043,580.0,1.095
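The inclusive end of a .loc[] slice is easy to overlook. The sketch below compares it with position-based slicing through .iloc[], which follows the usual Python convention of excluding the end. The five-row DataFrame is a made-up example.

```python
import pandas as pd

# Hypothetical DataFrame with the default row labels 0..4
toy = pd.DataFrame({"Linha": [21, 22, 30, 31, 41]})

by_label = toy.loc[1:3]       # labels 1, 2 and 3 (end included)
by_position = toy.iloc[1:3]   # positions 1 and 2 (end excluded)

print(len(by_label), len(by_position))
```

With the default integer labels the two indexers look alike, so the number of returned rows is the quickest way to tell them apart.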


### Exercise

![alt text](https://cdn.dribbble.com/users/2344801/screenshots/4774578/alphatestersanimation2.gif =150x120)

Select the last 10 rows of our DataFrame and assign them to the variable last_rows. Tip: use the .iloc[] indexer.


In [0]:
# put your code in this cell

## Selecting Individual Columns

When accessing a column in a DataFrame, Pandas returns a Series object containing the row label and each row's value for that column. To access a single column, use bracket notation and pass in the column name as a string:

In [19]:
df['Empresa']

0              CONCEIÇÃO
1              CONCEIÇÃO
2              CONCEIÇÃO
3              CONCEIÇÃO
4              CONCEIÇÃO
5              CONCEIÇÃO
6              CONCEIÇÃO
7              CONCEIÇÃO
8              CONCEIÇÃO
9              CONCEIÇÃO
10             CONCEIÇÃO
11             CONCEIÇÃO
12             CONCEIÇÃO
13             CONCEIÇÃO
14             CONCEIÇÃO
15             CONCEIÇÃO
16             CONCEIÇÃO
17             CONCEIÇÃO
18             CONCEIÇÃO
19           DUNAS - URB
20           DUNAS - URB
21           DUNAS - URB
22           DUNAS - URB
23           DUNAS - URB
24           DUNAS - URB
25           DUNAS - URB
26           DUNAS - URB
27           DUNAS - URB
28           DUNAS - URB
29           DUNAS - URB
             ...        
919             REUNIDAS
920             REUNIDAS
921             REUNIDAS
922             REUNIDAS
923             REUNIDAS
924             REUNIDAS
925             REUNIDAS
926             REUNIDAS
927             REUNIDAS


### Exercise

![alt text](https://cdn.dribbble.com/users/2344801/screenshots/4774578/alphatestersanimation2.gif =150x120)



1.   Assign the "Empresa" column to the variable empresas
2.   Assign the last 5 rows of the "Linha" column to the variable linhas. Tip: use the tail() method.

In [0]:
# put your code in this cell

## Selecting Multiple Columns By Name

To select multiple columns, pass in a list of strings representing the column names and pandas will return a dataframe containing only the values in those columns. The following code returns a dataframe containing the "Empresa" and "Linha" columns, in that order:

In [25]:
df[['Empresa','Linha']]

Unnamed: 0,Empresa,Linha
0,CONCEIÇÃO,21
1,CONCEIÇÃO,22
2,CONCEIÇÃO,30
3,CONCEIÇÃO,31
4,CONCEIÇÃO,41
5,CONCEIÇÃO,59
6,CONCEIÇÃO,63
7,CONCEIÇÃO,71
8,CONCEIÇÃO,76
9,CONCEIÇÃO,83


When selecting multiple columns, the order of the columns in the returned DataFrame matches the order of the column names in the list of strings that you passed in. This allows you to easily explore specific columns that may not be positioned next to each other in the DataFrame.
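A quick sketch of this behaviour, using a hypothetical two-row DataFrame: the columns come back in the order given in the list, not in their original order.

```python
import pandas as pd

# Hypothetical DataFrame whose original column order is Mês, Empresa, Linha
toy = pd.DataFrame({"Mês": [1, 1], "Empresa": ["A", "B"], "Linha": [21, 22]})

# Ask for Linha first and Mês second
reordered = toy[["Linha", "Mês"]]
print(reordered.columns.tolist())
```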

### Exercise

![alt text](https://cdn.dribbble.com/users/2344801/screenshots/4774578/alphatestersanimation2.gif =150x120)

Select and display at least one random row of a DataFrame created with only the "Empresa", "Linha" and "Mês" columns. Tip: use the sample() method.

In [0]:
# put your code in this cell

## Column Uniqueness

While getting to know the data, it's useful to check which distinct values some attributes take. For this, Pandas provides the unique() method, which can be used, for example, to find out how many different bus companies appear in our data:

In [35]:
# uniqueness of column "Empresa"
empresas = df['Empresa'].unique()
# display the resulting array
empresas


array(['CONCEIÇÃO', 'DUNAS - URB', 'GUANABARA - URB', 'REUNIDAS',
       'SANTA MARIA - URB', 'VIA SUL - URB'], dtype=object)

In [33]:
# length of the array returned
len(empresas)

6

So, there are 6 different bus companies in our DataFrame.
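Besides unique(), a Series also offers nunique() and value_counts(), which may be handy here. The sketch below uses a short hand-made Series of company names rather than the full column.

```python
import pandas as pd

# Hypothetical sample of company names with one repeated value
companies = pd.Series(["CONCEIÇÃO", "REUNIDAS", "CONCEIÇÃO", "VIA SUL - URB"])

print(companies.unique())         # distinct values, in order of first appearance
print(companies.nunique())        # number of distinct values
print(companies.value_counts())   # how often each value occurs
```

nunique() gives the same number as len() applied to the result of unique(), as long as the column has no missing values.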

### Exercise

![alt text](https://cdn.dribbble.com/users/2344801/screenshots/4774578/alphatestersanimation2.gif =150x120)

How many different bus lines are there in our DataFrame?

In [0]:
# put your code in this cell

## Data manipulation with Pandas

### Overview
In the previous sections, we learned how to explore a pandas DataFrame. In this section, we'll explore how to manipulate a DataFrame and apply transformations to it. We'll continue to work with the same Bilhetagem dataset. We'll build a better dataset by cleaning the data and removing information that isn't useful. We're also going to learn how to group information and manipulate data.