# Pandas basics




## Introduction

Pandas is a library that unifies the most common workflows for which data analysts and data scientists previously relied on many different libraries. Pandas has quickly become an important tool in a data professional's toolbelt and is the most popular library for working with tabular data in Python. Tabular data is any data that can be represented as rows and columns.

To represent tabular data, Pandas uses a custom data structure called a **DataFrame**. A DataFrame is a highly efficient, 2-dimensional data structure that provides a suite of methods and attributes to quickly explore, analyze, and visualize data. The DataFrame is similar to the NumPy 2D array but adds support for many features that help you work with tabular data.
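As a minimal sketch of what a DataFrame looks like, the snippet below builds one from a dictionary of columns. The column names and values are made up for illustration and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical toy data: each dictionary key becomes a column
toy_df = pd.DataFrame({
    "line": [21, 22, 30],
    "company": ["A", "B", "A"],
    "trips": [3684, 1737, 2500],
})

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column labels
```

Building from a dictionary is only one option; later in this section the DataFrame is created by reading a CSV file instead.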

One of the biggest advantages that Pandas has over NumPy is the **ability to store mixed data types** in rows and columns. Many tabular datasets contain a range of data types, and Pandas DataFrames handle them effortlessly, while a NumPy array is designed to hold a single data type. Pandas DataFrames **can also handle missing values gracefully** using the special value **NaN** ("not a number") to represent them. A common complaint about NumPy is that it has no general-purpose marker for missing values (integer arrays, for instance, cannot hold one), so people end up having to find and replace these values manually. In addition, Pandas DataFrames carry axis labels for both rows and columns, which lets you refer to elements in the DataFrame more intuitively. Since many tabular datasets contain column titles, DataFrames preserve this metadata from the file alongside the data.
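The three points above (mixed types, missing values, and axis labels) can be illustrated with a small sketch. The data and the row labels `r1` to `r3` are hypothetical, and `np.nan` is used here to mark the missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mixing text, integers, and one missing value
mixed_df = pd.DataFrame(
    {
        "company": ["A", "B", "C"],
        "line": [21, 22, 30],
        "fare": [4.0, np.nan, 3.5],
    },
    index=["r1", "r2", "r3"],
)

print(mixed_df.dtypes)                # one dtype per column
print(mixed_df["fare"].isna())        # True where the value is missing
print(mixed_df["fare"].mean())        # missing values are skipped by default
print(mixed_df.loc["r3", "company"])  # label-based access by row and column
```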

To get started, we have to import the Pandas library:

In [0]:
import pandas as pd

## Our case study

In this part of the course, you'll learn the basics of Pandas while exploring several datasets from [Dados Abertos Natal](http://dados.natal.br/). Let's start with the "Bilhetagem" dataset, which refers to the bus service billing system in Natal.

You can download it from this URL: [Bilhetagem Analítica 2018](http://dados.natal.br/dataset/4fad551d-4d3b-4597-b8d3-7e887e22332e/resource/ec5b95a3-7b93-4346-98f6-1bd013faa651/download/dados-be-2018-analitico.csv)

## Downloading Data

We can download data from [Dados Abertos Natal](http://dados.natal.br/) directly to our workspace in Google Colab using the 'wget' download utility, which is available on most systems. The exclamation mark at the beginning of the line tells Colab to run the command with the system's utilities instead of the Python interpreter. The command below fetches a copy of the file hosted on GitHub.

In [27]:
!wget -c https://raw.githubusercontent.com/aguinaldoabbj/minicourse_open_data_natal_2019/master/data/dados-be-2018-analitico.csv
#wget -c https://github.com/aguinaldoabbj/minicourse_open_data_natal_2019/blob/master/data/dados-be-2018-analitico.csv

--2019-02-13 18:22:39--  https://raw.githubusercontent.com/aguinaldoabbj/minicourse_open_data_natal_2019/master/data/dados-be-2018-analitico.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 72991 (71K) [text/plain]
Saving to: ‘dados-be-2018-analitico.csv’


2019-02-13 18:22:39 (2.94 MB/s) - ‘dados-be-2018-analitico.csv’ saved [72991/72991]



We can check that the file was downloaded using 'ls':


In [28]:
!ls -lah  dados-be*

-rw-r--r-- 1 root root 72K Feb 13 18:22 dados-be-2018-analitico.csv
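The same check can also be done from Python, without shell utilities. The sketch below uses the standard-library pathlib module and assumes the file name from the download step; it only reports whether the file is present in the current directory and how large it is.

```python
from pathlib import Path

# File name taken from the download step above
csv_path = Path("dados-be-2018-analitico.csv")

if csv_path.exists():
    size_kb = csv_path.stat().st_size / 1024
    print(f"{csv_path.name}: {size_kb:.0f} KB")
else:
    print(f"{csv_path.name} not found in {Path.cwd()}")
```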


## Loading data from a CSV file
As we can see, the downloaded data is a CSV file. CSV files are commonly used to store tabular data. Put simply, each line is a table row whose cells are separated by a delimiter, such as a comma or semicolon. Pandas provides the read_csv() function, which takes the path of the CSV file and returns a DataFrame representation of its data.

In [0]:
# The CSV file uses ';' as separator and is encoded in ISO-8859-1
df = pd.read_csv("dados-be-2018-analitico.csv", encoding='iso-8859-1', sep=';')
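To see what the sep and encoding arguments do without needing the downloaded file, here is a small sketch that parses an in-memory sample. The two data rows are a hypothetical stand-in shaped like the real file, and read_csv() is given a file-like object instead of a path.

```python
import io

import pandas as pd

# Hypothetical two-row sample, encoded the same way as the real file
raw_bytes = "Mês;Empresa;Linha\n1;CONCEIÇÃO;21\n1;CONCEIÇÃO;22\n".encode("iso-8859-1")

# read_csv accepts a file-like object as well as a path
small_df = pd.read_csv(io.BytesIO(raw_bytes), encoding="iso-8859-1", sep=";")

print(small_df)
```

If the encoding argument does not match how the file was written, accented characters such as those in "Mês" may fail to decode or come out garbled, which is why it is worth setting explicitly.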

We can check that the data was loaded successfully by calling the DataFrame's head() method, which shows the first 5 rows:

In [35]:
df.head(5)

| | Mês | Empresa | Linha | Estudante_Cartao | Estudante_BT | Vale_Transporte | Integracao_Plena | Integracao_Complementar | Gratuito_Cartao | Gratuito_BT | Inteira_Cartao | Inteira_Especie | Tarifa_Social | Qtd_Viagens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CONCEIÇÃO | 21 | 13.493 | 9.506 | 35.944 | 8.671 | 0.0 | 17.138 | 7.397 | 6.542 | 43.727 | 3.965 | 3.684 |
| 1 | 1 | CONCEIÇÃO | 22 | 4.644 | 3.371 | 13.088 | 3.23 | 0.0 | 8.109 | 3.157 | 2.529 | 15.363 | 0.0 | 1.737 |
| 2 | 1 | CONCEIÇÃO | 30 | 8.519 | 5.614 | 20.58 | 5.346 | 0.0 | 7.265 | 3.191 | 4.415 | 21.507 | 1.537 | 2.5 |
| 3 | 1 | CONCEIÇÃO | 31 | 9.631 | 6.02 | 23.46 | 7.587 | 0.0 | 7.065 | 2.922 | 5.491 | 22.212 | 140.0 | 2.478 |
| 4 | 1 | CONCEIÇÃO | 41 | 6.556 | 4.038 | 20.388 | 7.924 | 0.0 | 5.704 | 2.791 | 3.883 | 13.87 | 0.0 | 2.296 |
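head() accepts the number of rows to return and defaults to 5 when called without an argument; tail() is its counterpart for the end of the DataFrame. The sketch below shows both on a hypothetical toy DataFrame that stands in for the course data.

```python
import pandas as pd

# Hypothetical stand-in for the course DataFrame
toy_df = pd.DataFrame({"Linha": range(1, 11), "Qtd_Viagens": range(100, 1100, 100)})

first_three = toy_df.head(3)   # first 3 rows
last_two = toy_df.tail(2)      # last 2 rows
default_head = toy_df.head()   # 5 rows when no argument is given

print(first_three)
print(last_two)
```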


### Exploring the dataframe
Now that we've read the dataset into a DataFrame, we can start using more Pandas DataFrame methods to explore the data.

A new DataFrame containing a random sample of rows from the original DataFrame can be created with the sample() method:

In [34]:
sample_df = df.sample(10)
# show the sampled DataFrame
sample_df

| | Mês | Empresa | Linha | Estudante_Cartao | Estudante_BT | Vale_Transporte | Integracao_Plena | Integracao_Complementar | Gratuito_Cartao | Gratuito_BT | Inteira_Cartao | Inteira_Especie | Tarifa_Social | Qtd_Viagens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 760 | 9 | CONCEIÇÃO | 21 | 27123.0 | 4877.0 | 37501.0 | 9927.0 | 0.0 | 17449.0 | 8856.0 | 7483.0 | 43980.0 | 2142.0 | 3918.0 |
| 219 | 3 | GUANABARA - URB | 4 | 13.369 | 10.562 | 33.885 | 7.357 | 1.0 | 4.927 | 2.675 | 4.07 | 23.109 | 977.0 | 2.244 |
| 656 | 7 | SANTA MARIA - URB | 411 | 6678.0 | 1697.0 | 7761.0 | 3243.0 | 0.0 | 3327.0 | 1080.0 | 2032.0 | 7945.0 | 0.0 | 717.0 |
| 673 | 8 | CONCEIÇÃO | 587 | 1163.0 | 0.0 | 479.0 | 249.0 | 0.0 | 888.0 | 109.0 | 125.0 | 1385.0 | 0.0 | 1468.0 |
| 403 | 5 | DUNAS - URB | 332 | 3.743 | 1.206 | 5.495 | 1.27 | 0.0 | 1.29 | 496.0 | 1.441 | 3.642 | 253.0 | 577.0 |
| 939 | 10 | SANTA MARIA - URB | 66 | 16390.0 | 3831.0 | 15271.0 | 2454.0 | 0.0 | 3834.0 | 2102.0 | 3065.0 | 16972.0 | 3290.0 | 2091.0 |
| 908 | 10 | GUANABARA - URB | 1516 | 15376.0 | 3478.0 | 22809.0 | 4293.0 | 182.0 | 9273.0 | 5323.0 | 5042.0 | 26397.0 | 3650.0 | 2172.0 |
| 549 | 6 | SANTA MARIA - URB | 36 | 9882.0 | 4407.0 | 15131.0 | 5527.0 | 0.0 | 7763.0 | 3904.0 | 3177.0 | 16874.0 | 0.0 | 2095.0 |
| 284 | 3 | VIA SUL - URB | 52 | 18.588 | 12.721 | 23.356 | 7.949 | 1.0 | 10.974 | 6.446 | 5.829 | 29.111 | 474.0 | 3.091 |
| 257 | 3 | REUNIDAS | 68 | 14.582 | 6.371 | 26.601 | 6.185 | 214.0 | 10.614 | 3.26 | 7.084 | 27.343 | 754.0 | 2.327 |
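Because sample() draws rows at random, each run returns different rows unless a seed is fixed. The sketch below, on a hypothetical toy DataFrame, uses the random_state parameter to make the draw repeatable and the frac parameter to sample a fraction of the rows rather than a fixed count. The seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for the course DataFrame
toy_df = pd.DataFrame({"Linha": range(1, 101)})

# The same random_state yields the same rows on every run
sample_a = toy_df.sample(10, random_state=42)
sample_b = toy_df.sample(10, random_state=42)
print(sample_a.equals(sample_b))

# frac draws a fraction of the rows instead of a fixed count
tenth = toy_df.sample(frac=0.1, random_state=42)
print(len(tenth))
```

Note that the sampled rows keep their original index labels, which is why the index in the printout above is not consecutive.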


These DataFrame printouts give an overview of the data structure (rows, columns, and cells). A better understanding of the DataFrame under study can be obtained with a few more Pandas DataFrame methods and attributes:

In [38]:
# get the shape of the DataFrame as (rows, columns)
df.shape


(949, 14)
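shape is an attribute rather than a method, so it is read without parentheses, and the resulting tuple can be unpacked into a row count and a column count. A minimal sketch on a hypothetical toy DataFrame:

```python
import pandas as pd

# Hypothetical toy data with 3 rows and 2 columns
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy_df.shape   # unpack the (rows, columns) tuple
print(f"{n_rows} rows x {n_cols} columns")
print(len(toy_df))              # len() gives the row count only
```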

In [40]:
# show the basic information about the dataset (columns, sizes, column types, etc)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 949 entries, 0 to 948
Data columns (total 14 columns):
Mês                        949 non-null int64
Empresa                    949 non-null object
Linha                      949 non-null int64
Estudante_Cartao           949 non-null float64
Estudante_BT               949 non-null float64
Vale_Transporte            949 non-null float64
Integracao_Plena           949 non-null float64
Integracao_Complementar    949 non-null float64
Gratuito_Cartao            949 non-null float64
Gratuito_BT                949 non-null float64
Inteira_Cartao             949 non-null float64
Inteira_Especie            949 non-null float64
Tarifa_Social              949 non-null float64
Qtd_Viagens                949 non-null float64
dtypes: float64(11), int64(2), object(1)
memory usage: 103.9+ KB
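The pieces of information that info() prints can also be obtained individually, which is handy when they are needed in code rather than on screen. The sketch below uses a hypothetical two-row DataFrame that reuses a few column names from the dataset; it reads the column types from the dtypes attribute, picks out the numeric columns with select_dtypes(), and counts missing values per column.

```python
import pandas as pd

# Hypothetical toy data reusing three column names from the dataset
toy_df = pd.DataFrame({
    "Mês": [1, 2],
    "Empresa": ["CONCEIÇÃO", "REUNIDAS"],
    "Qtd_Viagens": [3684.0, 2327.0],
})

print(toy_df.dtypes)  # dtype of each column

# Names of the columns that hold numbers (integers or floats)
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Number of missing values per column
print(toy_df.isna().sum())
```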


In [43]:
# describe statistical information about the numerical columns
df.describe()

| | Mês | Linha | Estudante_Cartao | Estudante_BT | Vale_Transporte | Integracao_Plena | Integracao_Complementar | Gratuito_Cartao | Gratuito_BT | Inteira_Cartao | Inteira_Especie | Tarifa_Social | Qtd_Viagens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 949.0 | 949.0 | 949.0 | 949.0 | 949.0 | 949.0 | 949.0 | 949.0 | 949.0 | 949.0 | 949.0 | 949.0 | 949.0 |
| mean | 5.492097 | 1109.404636 | 7410.613845 | 2328.857335 | 10773.196672 | 3142.489758 | 70.314604 | 3736.196684 | 1852.104958 | 2381.738394 | 10533.645599 | 489.891953 | 1034.513133 |
| std | 2.88441 | 2239.082157 | 11892.630953 | 4004.715689 | 17442.582959 | 5382.396103 | 155.008061 | 5853.841755 | 2824.474806 | 3602.624978 | 16393.121063 | 1446.41432 | 1303.173175 |
| min | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 3.0 | 39.0 | 11.796 | 6.554 | 19.368 | 6.418 | 0.0 | 8.371 | 4.458 | 5.582 | 19.119 | 0.0 | 2.918 |
| 50% | 5.0 | 75.0 | 162.0 | 158.0 | 313.0 | 364.0 | 0.0 | 270.0 | 381.0 | 319.0 | 271.0 | 1.305 | 434.0 |
| 75% | 8.0 | 599.0 | 10670.0 | 3239.0 | 15774.0 | 4184.0 | 65.0 | 6141.0 | 2991.0 | 3746.0 | 17110.0 | 381.0 | 1829.0 |
| max | 10.0 | 9018.0 | 64442.0 | 29076.0 | 112950.0 | 35284.0 | 1309.0 | 37702.0 | 17586.0 | 18331.0 | 88285.0 | 17298.0 | 5407.0 |
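By default describe() summarizes only the numeric columns, which is why Empresa is absent from the table above. As a sketch on hypothetical toy data, passing include="all" adds text columns to the summary, and individual statistics can be read from the result with loc.

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column
toy_df = pd.DataFrame({
    "Empresa": ["A", "B", "A", "C"],
    "Qtd_Viagens": [10.0, 20.0, 30.0, 40.0],
})

numeric_summary = toy_df.describe()            # numeric columns only
full_summary = toy_df.describe(include="all")  # also summarizes text columns

# The result is itself a DataFrame, indexed by statistic name
print(numeric_summary.loc["mean", "Qtd_Viagens"])
print(full_summary.loc["top", "Empresa"])      # most frequent value
```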
