---

# 3. Pandas

Pandas is the best-known Python library for manipulating and analyzing data. It is built on top of NumPy, so many of its features will feel familiar. We will use Pandas to work with structured datasets.

Just as NumPy gives us arrays and the features that come with them, Pandas gives us two main objects: DataFrames and Series. The DataFrame is by far the more commonly used of the two.

We are going to use open data published by the Argentine government. The CSV is available at the following link: [Names 2010-2014](https://www.datos.gob.ar/dataset/otros-nombres-personas-fisicas)
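To make the two objects concrete, here is a minimal sketch. The `toy` table and its three rows are made up for illustration (the values mirror the first rows of the dataset used later). It shows that each column of a DataFrame is itself a Series.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
amounts = pd.Series([2986, 2252, 2176], name="amount")

# A DataFrame is a two-dimensional table; each of its columns is a Series.
toy = pd.DataFrame({
    "name": ["Benjamin", "Sofia", "Bautista"],
    "amount": [2986, 2252, 2176],
})

print(isinstance(toy["amount"], pd.Series))  # True
print(toy.shape)                             # (3, 2)
print(amounts.name)                          # amount
```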

In [None]:
import pandas as pd

## Reading a csv file

In [None]:
df_names = pd.read_csv("https://raw.githubusercontent.com/agusle/something_new_everyday/master/intensive_training/data/nombres-2010-2014.csv")
df_names

Unnamed: 0,nombre,cantidad,anio
0,Benjamin,2986,2010
1,Sofia,2252,2010
2,Bautista,2176,2010
3,Joaquín,2111,2010
4,Juan Ignacio,2039,2010
...,...,...,...
871489,Leire Jasmin,1,2014
871490,Isaias Sebastian Ariel,1,2014
871491,Yanira Valentina,1,2014
871492,Angie Ainara,1,2014
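`pd.read_csv` accepts a local path, a URL, or any file-like object. The sketch below reads a small in-memory CSV so it runs without a network connection. The three-line `csv_text` is a hypothetical excerpt that assumes the file uses the column names shown in the output above.

```python
import io
import pandas as pd

# Hypothetical excerpt; assumed to match the column layout of the real file.
csv_text = """nombre,cantidad,anio
Benjamin,2986,2010
Sofia,2252,2010
Bautista,2176,2010
"""

# io.StringIO wraps the string in a file-like object that read_csv can consume.
df_small = pd.read_csv(io.StringIO(csv_text))
print(df_small.columns.tolist())  # ['nombre', 'cantidad', 'anio']
print(len(df_small))              # 3

# nrows limits how many data rows are read, which helps when previewing big files.
df_preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(len(df_preview))            # 2
```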


## Renaming columns

First of all, let's rename the columns to `name`, `amount` and `year`.

In [None]:
df_names = df_names.rename(columns={'nombre': 'name', 'cantidad': 'amount', 'anio': 'year'})
df_names

Unnamed: 0,name,amount,year
0,Benjamin,2986,2010
1,Sofia,2252,2010
2,Bautista,2176,2010
3,Joaquín,2111,2010
4,Juan Ignacio,2039,2010
...,...,...,...
871489,Leire Jasmin,1,2014
871490,Isaias Sebastian Ariel,1,2014
871491,Yanira Valentina,1,2014
871492,Angie Ainara,1,2014
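One detail worth noting is that `rename` returns a new DataFrame and leaves the original untouched, which is why the result is assigned back to `df_names`. A minimal sketch on a made-up one-row table:

```python
import pandas as pd

toy = pd.DataFrame({"nombre": ["Sofia"], "cantidad": [2252], "anio": [2010]})

# rename maps old labels to new ones and returns a new DataFrame.
renamed = toy.rename(columns={"nombre": "name", "cantidad": "amount", "anio": "year"})

print(list(toy.columns))      # ['nombre', 'cantidad', 'anio']  (unchanged)
print(list(renamed.columns))  # ['name', 'amount', 'year']
```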


## Some useful Pandas functions

**TODO:** Investigate the functions and attributes used in the following cells. What do they do? What do you think they can be useful for?

In [None]:
df_names.head()

Unnamed: 0,name,amount,year
0,Benjamin,2986,2010
1,Sofia,2252,2010
2,Bautista,2176,2010
3,Joaquín,2111,2010
4,Juan Ignacio,2039,2010


In [None]:
df_names.tail()

Unnamed: 0,name,amount,year
871489,Leire Jasmin,1,2014
871490,Isaias Sebastian Ariel,1,2014
871491,Yanira Valentina,1,2014
871492,Angie Ainara,1,2014
871493,Elias Hernando,1,2014


In [None]:
df_names.count()

name      871494
amount    871494
year      871494
dtype: int64

In [None]:
df_names.shape

(871494, 3)

In [None]:
df_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 871494 entries, 0 to 871493
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   name    871494 non-null  object
 1   amount  871494 non-null  int64 
 2   year    871494 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 19.9+ MB


In [None]:
df_names.sample(10)

Unnamed: 0,name,amount,year
554365,Yazmin Selene Iriel,1,2013
646918,Alejandro Pedro,1,2013
290829,Camilo Samuel,3,2012
562775,Cielo Aleli,1,2013
559520,Yenhy Maya,1,2013
199418,Oriana Valentina Mariana,1,2011
692299,Nila,4,2014
101663,Emanuel Adair,1,2010
431199,Nadia Micaela,1,2012
208449,Alhue Ana Estrella,1,2011
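As a hint for the exercise, the sketch below runs the same calls on a tiny made-up table with one missing value. It shows how `shape` and `count()` can differ: `shape` counts every row, while `count()` only counts non-null entries per column. The `random_state` argument is used here only to make the sample repeatable.

```python
import pandas as pd

# Made-up table with one missing name.
toy = pd.DataFrame({
    "name": ["Benjamin", "Sofia", None, "Joaquín"],
    "amount": [2986, 2252, 2176, 2111],
})

print(toy.head(2))   # first two rows
print(toy.tail(1))   # last row
print(toy.shape)     # (4, 2): rows and columns, missing values included
print(toy.count())   # non-null values per column: name 3, amount 4

# Fixing random_state makes the random selection reproducible.
picked = toy.sample(2, random_state=0)
print(len(picked))   # 2
```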


## Append a new row

**TODO:** Suppose that when the data was loaded, someone forgot to add a name along with its amount and year.

Let's add the following row to our dataset:

Name: "Daenerys Stormborn of the House Targaryen, First of Her Name, the Unburnt, Queen of the Andals and the First Men, Khaleesi of the Great Grass Sea, Breaker of Chains, and Mother of Dragons"

Amount: 100
Year: 2011

In [None]:
new_name = {
    'name': 'Daenerys Stormborn of the House Targaryen, First of Her Name, the Unburnt, Queen of the Andals and the First Men, Khaleesi of the Great Grass Sea, Breaker of Chains, and Mother of Dragons',
    'amount': 100,
    'year': 2011
}

# DataFrame.append was removed in pandas 2.0, so we wrap the dict
# in a one-row DataFrame and concatenate it instead.
df_names = pd.concat([df_names, pd.DataFrame([new_name])], ignore_index=True)
df_names

Unnamed: 0,name,amount,year
0,Benjamin,2986,2010
1,Sofia,2252,2010
2,Bautista,2176,2010
3,Joaquín,2111,2010
4,Juan Ignacio,2039,2010
...,...,...,...
871490,Isaias Sebastian Ariel,1,2014
871491,Yanira Valentina,1,2014
871492,Angie Ainara,1,2014
871493,Elias Hernando,1,2014
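In current pandas, rows are usually added with `pd.concat`, and the `ignore_index` flag decides whether the result is renumbered. The sketch below uses a made-up two-row table and a shortened name to show the difference, plus a `.loc`-based alternative for adding a single row to a frame with a default integer index.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["Benjamin", "Sofia"], "amount": [2986, 2252], "year": [2010, 2010]})
new_row = pd.DataFrame([{"name": "Daenerys", "amount": 100, "year": 2011}])

# Without ignore_index, the new row keeps its own label, so 0 appears twice.
kept = pd.concat([toy, new_row])
print(list(kept.index))        # [0, 1, 0]

# With ignore_index=True, the result gets a fresh 0..n-1 index.
renumbered = pd.concat([toy, new_row], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2]

# Alternative for one row: assign to the next free integer label.
# Values must follow the column order: name, amount, year.
in_place = toy.copy()
in_place.loc[len(in_place)] = ["Daenerys", 100, 2011]
print(len(toy), len(in_place))  # 2 3
```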


**TODO:** Investigate the `columns` and `index` attributes. What do they hold? What data type are they? What known data type do they resemble?

In [None]:
df_names.columns

Index(['name', 'amount', 'year'], dtype='object')

In [None]:
df_names.index

RangeIndex(start=0, stop=871495, step=1)
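As a hint, both attributes hold pandas `Index` objects, which behave much like immutable sequences: they can be indexed, measured with `len`, and converted to a list. The sketch below checks this on a made-up two-row table.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["Benjamin", "Sofia"], "amount": [2986, 2252]})

# Column labels live in an Index object.
print(isinstance(toy.columns, pd.Index))     # True
print(list(toy.columns))                     # ['name', 'amount']
print(toy.columns[0])                        # name

# The default row labels are a RangeIndex, comparable to Python's range.
print(isinstance(toy.index, pd.RangeIndex))  # True
print(list(toy.index))                       # [0, 1]
print(len(toy.index))                        # 2
```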

**TODO:** What do the following operations do?

In [None]:
df_names['name']

0                                                  Benjamin
1                                                     Sofia
2                                                  Bautista
3                                                   Joaquín
4                                              Juan Ignacio
                                ...                        
871490                               Isaias Sebastian Ariel
871491                                     Yanira Valentina
871492                                         Angie Ainara
871493                                       Elias Hernando
871494    Daenerys Stormborn of the House Targaryen, Fir...
Name: name, Length: 871495, dtype: object

In [None]:
df_names[['name', 'year']]

Unnamed: 0,name,year
0,Benjamin,2010
1,Sofia,2010
2,Bautista,2010
3,Joaquín,2010
4,Juan Ignacio,2010
...,...,...
871490,Isaias Sebastian Ariel,2014
871491,Yanira Valentina,2014
871492,Angie Ainara,2014
871493,Elias Hernando,2014


In [None]:
df_names.amount

0         2986
1         2252
2         2176
3         2111
4         2039
          ... 
871490       1
871491       1
871492       1
871493       1
871494     100
Name: amount, Length: 871495, dtype: int64

In [None]:
df_names['amount']

0         2986
1         2252
2         2176
3         2111
4         2039
          ... 
871490       1
871491       1
871492       1
871493       1
871494     100
Name: amount, Length: 871495, dtype: int64

In [None]:
'name' in df_names

True
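The operations above can be summarized on a small made-up table: a single label returns a Series, a list of labels returns a DataFrame, attribute access is equivalent to bracket access for simple column names, and `in` tests column labels rather than cell values.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["Benjamin", "Sofia"], "amount": [2986, 2252], "year": [2010, 2010]})

one_col = toy["name"]              # single label: a Series
two_cols = toy[["name", "year"]]   # list of labels: a DataFrame
still_df = toy[["name"]]           # one-element list: still a DataFrame

print(type(one_col).__name__)            # Series
print(type(two_cols).__name__)           # DataFrame
print(type(still_df).__name__)           # DataFrame
print(toy.amount.equals(toy["amount"]))  # True

# Membership checks look at column labels, not at the data.
print("name" in toy)    # True
print("Sofia" in toy)   # False
```

Bracket access is the safer habit: attribute access does not work for column names that contain spaces or that collide with DataFrame methods such as `count`.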