# 3. <a id='intro'>[Pandas](https://www.freecodecamp.org/news/how-to-analyze-data-with-python-pandas/)</a>

- <a href='#def'> 3.1 Definition</a>  
- <a href='#series'>3.2. Pandas Series</a>
     - <a href='#3.2.1'>3.2.1. From `lists` to `Series`</a>
     - <a href='#3.2.2'> 3.2.2. From `NumPy array` to `Series`</a>
     - <a href='#3.2.3'> 3.2.3. From `Dictionary` to `Series`</a>
     - <a href='#3.2.4'> 3.2.4. `Series` vs `NumPy`</a>
     - <a href='#3.2.5'> 3.2.5 Indexing</a>
- <a href='#3.3'>3.3 DataFrame</a>
     - <a href='#3.3.1'>3.3.1 DataFrame Generation</a>
     - <a href='#3.3.2'>3.3.2 Indexing</a>
     - <a href='#3.3.3'>3.3.3 General Methods</a>
     - <a href='#3.3.4'>3.3.4 Importing Data</a>
     - <a href='#3.3.5'>3.3.5 Filtering data</a> 
     - <a href='#3.3.6'>3.3.6 Dealing with nulls</a>  
     - <a href='#3.3.7'>3.3.7 Duplicates</a>  
     - <a href='#3.3.8'>3.3.8 Groupby</a>  
     - <a href='#3.3.9'>3.3.9 Reshape</a>  
     - <a href='#3.3.10'>3.3.10 Merge</a>  
     
- <a href='#3.4'>3.4 References</a>  
     

## 3.1. <a id='def'>Definition</a>

Pandas is a Python library for working with data sets. It is the "must-learn" library for data I/O, cleansing, transformation, and aggregation. It is an external library, so we need to import it into our applications with the `import` keyword:

In [1]:
import pandas as pd
import numpy as np

Now the `Pandas` package can be referred to as `pd` instead of pandas.

## 3.2. <a id='series'>[Pandas Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)</a>

The first pandas data structure is the Series. A Series is a one-dimensional array that can hold any data type, similar to a NumPy ndarray. However, a Series has an **index** that gives a label to each entry.
Typically a Series contains information about **one feature** of the data. <br>

**The Series is the basic unit of pandas: each column of a DataFrame is a Series.**

A `Pandas Series` is a one-dimensional array of indexed data. It can be created from a list or array as follows:

### 3.2.1. <a id='3.2.1'>From `lists` to `Series`</a>

In [2]:
list_1 = [0.25, 0.5, 0.75, 1.0]
list_1

[0.25, 0.5, 0.75, 1.0]

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0]) 
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [4]:
data = pd.Series(list_1) 
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [5]:
print(data)

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64


In [6]:
type(list_1)

list

In [7]:
type(data)

pandas.core.series.Series

### 3.2.2. <a id='3.2.2'> From `NumPy array` to `Series` </a>

In [8]:
vector_1 = np.array( [ 10, 20, 1, 2, 
                      3, 4, 5, 6, 7] )
vector_1

array([10, 20,  1,  2,  3,  4,  5,  6,  7])

In [9]:
series1 = pd.Series( vector_1 )
series1

0    10
1    20
2     1
3     2
4     3
5     4
6     5
7     6
8     7
dtype: int32

In [10]:
type(vector_1)

numpy.ndarray

In [11]:
type(series1)

pandas.core.series.Series

### 3.2.3.  <a id='3.2.3'> From `Dictionary` to `Series` </a>

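Python dictionaries have the same key-value structure as JSON objects, so data read from JSON files often arrives as a dictionary that we want to convert to pandas.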

In [12]:
population_dict = { 'California' : 38332521,
                    'Texas'      : 26448193,
                    'New York'   : 19651127,
                    'Florida'    : 19552860,
                    'Illinois'   : 12882135 }

In [13]:
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

As we see in the output, the `Series` wraps both a sequence of values and a sequence of indices, which we can access with the `values` and `index` attributes. The values are simply a familiar NumPy array:

In [14]:
population.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [15]:
population.values

array([38332521, 26448193, 19651127, 19552860, 12882135], dtype=int64)

### 3.2.4.  <a id='3.2.4'> `Series` vs `NumPy`</a>

The essential difference is the presence of the index: while the `NumPy array` has an implicitly defined integer index used to access the values, the `Pandas Series` has an explicitly defined index associated with the values. <br>

The `index` does not need to be an integer; we can also use `strings`.

In [16]:
claudia = np.arange(5, 21, 2)
claudia

array([ 5,  7,  9, 11, 13, 15, 17, 19])

In [17]:
import numpy as np
math = pd.Series( np.arange(5,21,3) , ['joyce','jeremy','ivan','marcy','daniel','franclin']) # pd.Series(values, index)
math

joyce        5
jeremy       8
ivan        11
marcy       14
daniel      17
franclin    20
dtype: int32

In [18]:
info = np.arange(5, 15 ,3.)
index_info =  ['joyce','jeremy','ivan','marcy']

In [19]:
math_2 = pd.Series(info, index_info) # pd.Series(values, index)
math_2

joyce      5.0
jeremy     8.0
ivan      11.0
marcy     14.0
dtype: float64

In [20]:
math_3 = pd.Series(info, index_info, dtype = int,  name = "Daniel") # pd.Series(values, index, dtype, name)
math_3

joyce      5
jeremy     8
ivan      11
marcy     14
Name: Daniel, dtype: int32

Exercise: <br>
Get the `values` and `index` from the `math_3` `Series`.

In [21]:
math_3.values

array([ 5,  8, 11, 14])

In [22]:
math_3.index

Index(['joyce', 'jeremy', 'ivan', 'marcy'], dtype='object')

### 3.2.5.  <a id='3.2.5'> Indexing</a>


Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns. Indexing can also be known as Subset Selection.

In [23]:
print( data )

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64


In [24]:
print( data[ 1:3 ] ) # second and third rows (positions 1 and 2); positional slicing excludes the end position

1    0.50
2    0.75
dtype: float64


In [25]:
print( population )

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64


In [26]:
print( population[ 'California':'New York' ] ) # label-based slicing does include the end label

California    38332521
Texas         26448193
New York      19651127
dtype: int64


| Indexer 	| Definition 	|
| --- | --- |
| `loc[]` 	| Gets rows (and/or columns) by label.<br> Accepts a Boolean array or Boolean `Series` for indexing. |
| `iloc[]` 	| Gets rows (and/or columns) by integer position. <br> Accepts a Boolean array, but not a Boolean `Series`. |
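
As a minimal sketch of Boolean indexing with both indexers, the example below rebuilds the `population` Series from section 3.2.3 and selects the states above a 20 million cutoff. The threshold is an arbitrary value chosen for illustration, and `mask`, `large_loc`, and `large_iloc` are hypothetical names.

```python
import pandas as pd

population = pd.Series({
    "California": 38332521,
    "Texas": 26448193,
    "New York": 19651127,
    "Florida": 19552860,
    "Illinois": 12882135,
})

# Boolean Series aligned with the index: True where population exceeds 20 million
mask = population > 20_000_000

# .loc accepts the Boolean Series directly
large_loc = population.loc[mask]

# .iloc works by position, so it needs a plain Boolean array
large_iloc = population.iloc[mask.to_numpy()]

print(large_loc)
print(large_loc.equals(large_iloc))
```

Converting the mask with `.to_numpy()` drops the labels, which is why `iloc` can use it.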

Get the value of New York.

In [27]:
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [28]:
population.loc["New York"]  # row labeled "New York"

19651127

In [29]:
population.iloc[2] # row at position 2

19651127

In [30]:
print( population.loc[ "New York" ] == population.iloc[ 2 ] )

True


In [31]:
population.iloc[1:4] # rows at positions 1, 2 and 3 (the second, third and fourth)

Texas       26448193
New York    19651127
Florida     19552860
dtype: int64

In [32]:
print(population.loc[["Texas","New York","Florida"]])

Texas       26448193
New York    19651127
Florida     19552860
dtype: int64


Replicate this exercise for the `data` `Series`.

## 3.3.  <a id='3.3'> [DataFrame](https://www.w3schools.com/python/pandas/pandas_dataframes.asp)</a>


A DataFrame is a collection of multiple Series. It can be thought of as a 2-dimensional array, where each row is a separate data point and each column is a feature of the data. The rows are labeled with an index (as in a Series) and the column labels are stored in the `columns` attribute.<br>
There are many different ways to initialize a DataFrame. <br>
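
To make the "collection of Series" idea concrete, here is a small sketch that builds a DataFrame from two Series sharing the same index and then pulls one column back out. The student names and grades are illustrative values, and `grades` and `math_col` are hypothetical names.

```python
import pandas as pd

# Two Series that share the same index (one label per student)
names = ["Alejandro", "Pedro", "Ramiro"]
math = pd.Series([15, 16, 10], index=names, name="Math")
art = pd.Series([12, 16, 15], index=names, name="Art")

# Each Series becomes one column of the DataFrame
grades = pd.DataFrame({"Math": math, "Art": art})

# Selecting a single column returns a Series that carries the DataFrame's index
math_col = grades["Math"]

print(grades)
print(type(math_col))
print(grades.index.equals(math_col.index))
```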


### 3.3.1. <a id='3.3.1'> DataFrame Generation</a>
#### From `lists` and `dict` to `DataFrame`

In [33]:
# Grades
students = [ "Alejandro", "Pedro", "Ramiro", "Axel", "Juan" ]
math     = [ 15, 16, 10, 12, 13 ]
english  = [ 13, 9, 16, 14, 17 ]
art      = [ 12, 16, 15, 19, 10 ]

# Dictionary
grades_A = {'Students':students, 'Math':math, 'English':english, 'Art':art}

In [34]:
grades_A

{'Students': ['Alejandro', 'Pedro', 'Ramiro', 'Axel', 'Juan'],
 'Math': [15, 16, 10, 12, 13],
 'English': [13, 9, 16, 14, 17],
 'Art': [12, 16, 15, 19, 10]}

In [35]:
gradesA1 = pd.DataFrame( grades_A )
gradesA1

Unnamed: 0,Students,Math,English,Art
0,Alejandro,15,13,12
1,Pedro,16,9,16
2,Ramiro,10,16,15
3,Axel,12,14,19
4,Juan,13,17,10


In [36]:
type(gradesA1)

pandas.core.frame.DataFrame

#### From `lists` and `NumPy` to `DataFrame`

In [37]:
values = np.array([ [ 1, 2, 3 ], [ 4, 5, 6 ], [ 7, 8, 9 ] ] )
values

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [38]:
col_names = [ 'a', 'b', 'c' ]

In [39]:
data2 = pd.DataFrame( values, columns = col_names )
data2

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


In [40]:
type(data2)

pandas.core.frame.DataFrame

### 3.3.2. <a id='3.3.2'> Indexing</a>

We can use the same methods as `Series`: `iloc` and `loc`. We can select columns and rows.

In [41]:
# Grades
students = [ "Gissela", "Daniel", "Andres", "Sandra", "Rosalyn" ]
math     = [ 16, 14, 17, 17, 17 ]
english  = [ 16, 17, 19, 18, 15 ]
art      = [ 11, 17, 13, 14, 17 ]

# Dictionary
diplomado = {'Students':students, 'Math':math, 'English':english, 'Art':art}
gradesA1 = pd.DataFrame( diplomado )
gradesA1

Unnamed: 0,Students,Math,English,Art
0,Gissela,16,16,11
1,Daniel,14,17,17
2,Andres,17,19,13
3,Sandra,17,18,14
4,Rosalyn,17,15,17


#### Using .loc

In [42]:
gradesA1.loc[:, "Students"] # : selects all rows

0    Gissela
1     Daniel
2     Andres
3     Sandra
4    Rosalyn
Name: Students, dtype: object

In [43]:
(gradesA1.loc[0:2, "Students"]) # 0:2 selects the rows labeled 0 through 2 (the end label is included), because loc works with labels

0    Gissela
1     Daniel
2     Andres
Name: Students, dtype: object

In [44]:
(gradesA1.loc[0:2, :]) 

Unnamed: 0,Students,Math,English,Art
0,Gissela,16,16,11
1,Daniel,14,17,17
2,Andres,17,19,13


In [45]:
gradesA1.loc[:, ["Students" , "Art", "English"] ] # the column order can be changed

Unnamed: 0,Students,Art,English
0,Gissela,11,16
1,Daniel,17,17
2,Andres,13,19
3,Sandra,14,18
4,Rosalyn,17,15


#### Using .iloc

In [46]:
gradesA1

Unnamed: 0,Students,Math,English,Art
0,Gissela,16,16,11
1,Daniel,14,17,17
2,Andres,17,19,13
3,Sandra,17,18,14
4,Rosalyn,17,15,17


In [47]:
gradesA1.iloc[ 0,0 ] # value at row 0 and column 0 (first row and first column)

'Gissela'

In [48]:
gradesA1.iloc[ 0:3, 1:3 ] # rows at positions 0 to 2 (3 is excluded), columns at positions 1 and 2

Unnamed: 0,Math,English
0,16,16
1,14,17
2,17,19


In [49]:
gradesA1.iloc[[0, 2], [1, 3]] # rows at positions 0 and 2 (first and third), columns at positions 1 and 3 (second and fourth)

Unnamed: 0,Math,Art
0,16,11
2,17,13


### 3.3.3. <a id='3.3.3'> General Methods</a>

|Method|Description|
|------|-----------|
|columns|Attribute (no parentheses) that holds the column names.|
|sort_values( )|Sort by the values along either axis.|
|sort_index( )|Sort by the index.|
|head( )|Show the first N observations.|
|drop( )| Remove the entries  <br>  with the specified label or labels|
|append( )| Concatenate two or more Series. Removed in pandas 2.0; use `pd.concat( )` instead.|
|drop_duplicates( )| Remove duplicate values|
|dropna( ) |Drop null entries|
|fillna( ) |Replace null entries <br> with a specified value or strategy|
|reset_index( )| Reset the index to 0..n-1; the old index becomes a column unless `drop = True`.|
|sample( ) |Draw a random sample of entries|
|shift( ) |Shift the index|
|unique( ) |Return unique values|


In [50]:
deps = {
        'dep' : ['Lima', 'Piura', 'Tumbes', 'Cuzco', 'Ica', 'Puno'],
        'year': [ 2000, 2001, 2002, 2001, 2002, 2003 ],
        'pop' : [ 1.5, 1.7, 3.6, 2.4, 2.9, 3.2 ] 
        }
dep1 = pd.DataFrame( deps )
dep1

Unnamed: 0,dep,year,pop
0,Lima,2000,1.5
1,Piura,2001,1.7
2,Tumbes,2002,3.6
3,Cuzco,2001,2.4
4,Ica,2002,2.9
5,Puno,2003,3.2


#### sort

In [51]:
# Without the `inplace` argument: returns a new sorted DataFrame and leaves dep1 unchanged
dep1_sort = dep1.sort_values(['year','pop'], ascending = False)

In [52]:
dep1

Unnamed: 0,dep,year,pop
0,Lima,2000,1.5
1,Piura,2001,1.7
2,Tumbes,2002,3.6
3,Cuzco,2001,2.4
4,Ica,2002,2.9
5,Puno,2003,3.2


In [53]:
dep1_sort

Unnamed: 0,dep,year,pop
5,Puno,2003,3.2
2,Tumbes,2002,3.6
4,Ica,2002,2.9
3,Cuzco,2001,2.4
1,Piura,2001,1.7
0,Lima,2000,1.5


In [54]:
# With the `inplace` argument: overwrites dep1
dep1.sort_values(['year','pop'],ascending = False , inplace = True )
dep1

Unnamed: 0,dep,year,pop
5,Puno,2003,3.2
2,Tumbes,2002,3.6
4,Ica,2002,2.9
3,Cuzco,2001,2.4
1,Piura,2001,1.7
0,Lima,2000,1.5


In [55]:
# With `inplace` and `ignore_index`: overwrites dep1 and renumbers the index from 0
dep1.sort_values(['year','pop'],ascending = False , inplace = True, ignore_index = True)
dep1

Unnamed: 0,dep,year,pop
0,Puno,2003,3.2
1,Tumbes,2002,3.6
2,Ica,2002,2.9
3,Cuzco,2001,2.4
4,Piura,2001,1.7
5,Lima,2000,1.5


In [56]:
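# sort_index() does not restore the original order here, because ignore_index = True (previous cell) already replaced the original index with 0..n-1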
dep1.sort_index(inplace = True)
dep1

Unnamed: 0,dep,year,pop
0,Puno,2003,3.2
1,Tumbes,2002,3.6
2,Ica,2002,2.9
3,Cuzco,2001,2.4
4,Piura,2001,1.7
5,Lima,2000,1.5
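
The interaction between `ignore_index` and `sort_index` can be sketched on a small, made-up DataFrame. The names `dep`, `restored`, and `not_restored` are hypothetical; the point is only that `sort_index` can undo a sort while the original row labels are still present.

```python
import pandas as pd

dep = pd.DataFrame({
    "dep": ["Lima", "Piura", "Tumbes"],
    "year": [2000, 2001, 2002],
})

# Sorting without ignore_index keeps the original row labels,
# so sort_index can bring back the original order
sorted_keep = dep.sort_values("year", ascending=False)
restored = sorted_keep.sort_index()

# With ignore_index=True the labels are renumbered 0..n-1,
# so sorting by index leaves the rows in their sorted order
sorted_reset = dep.sort_values("year", ascending=False, ignore_index=True)
not_restored = sorted_reset.sort_index()

print(restored.equals(dep))
print(not_restored.equals(dep))
```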


In [57]:
dep1.sort_values(['year','pop'], ascending = False)

Unnamed: 0,dep,year,pop
0,Puno,2003,3.2
1,Tumbes,2002,3.6
2,Ica,2002,2.9
3,Cuzco,2001,2.4
4,Piura,2001,1.7
5,Lima,2000,1.5


In [58]:
dep1.sort_values(['year','pop'], ascending = [False,True])

Unnamed: 0,dep,year,pop
0,Puno,2003,3.2
2,Ica,2002,2.9
1,Tumbes,2002,3.6
4,Piura,2001,1.7
3,Cuzco,2001,2.4
5,Lima,2000,1.5


#### Operations with DataFrame, new column

In [59]:
gradesA1[ 'avg' ] = ( gradesA1[ 'Math' ] + gradesA1[ 'English' ] + gradesA1[ 'Art' ] ) / 3
gradesA1

Unnamed: 0,Students,Math,English,Art,avg
0,Gissela,16,16,11,14.333333
1,Daniel,14,17,17,16.0
2,Andres,17,19,13,16.333333
3,Sandra,17,18,14,16.333333
4,Rosalyn,17,15,17,16.333333


In [60]:
gradesA1.rename(columns = {'avg': 'promedio'}, inplace = True)
gradesA1

Unnamed: 0,Students,Math,English,Art,promedio
0,Gissela,16,16,11,14.333333
1,Daniel,14,17,17,16.0
2,Andres,17,19,13,16.333333
3,Sandra,17,18,14,16.333333
4,Rosalyn,17,15,17,16.333333


In [61]:
# Mean English & Art
gradesA1.iloc[:, 2:4].mean( axis = 1) # the result is a pandas Series

0    13.5
1    17.0
2    16.0
3    16.0
4    16.0
dtype: float64

In [62]:
# Mean Math & Art
gradesA1.iloc[:, [1, 3]].mean( axis = 1) # the result is a pandas Series

0    13.5
1    15.5
2    15.0
3    15.5
4    17.0
dtype: float64

In [63]:
# head
gradesA1.head( 4 )

Unnamed: 0,Students,Math,English,Art,promedio
0,Gissela,16,16,11,14.333333
1,Daniel,14,17,17,16.0
2,Andres,17,19,13,16.333333
3,Sandra,17,18,14,16.333333


In [64]:
# drop column
# Always use copy when you want to modify your DataFrame
gradesA1_1 = gradesA1.drop( [ 'promedio' , 'Art' ], axis = 1 )
gradesA1_1

Unnamed: 0,Students,Math,English
0,Gissela,16,16
1,Daniel,14,17
2,Andres,17,19
3,Sandra,17,18
4,Rosalyn,17,15


#### Concatenate

In [65]:
# add new data gradesA2
students = [ "Rebeca", "Xavi", "Cristiano", "Ronaldo", "Leo" ]
math     = [ 15, 18, 14, 7, 10 ]
english  = [ 18, 9, 11, 12, 20 ]
art      = [ 10, 16, 20, 19, 5 ]

# Dictionary
grades_A2 = {'Students':students, 'Math':math, 'English':english, 'Art':art}
gradesA2 = pd.DataFrame( grades_A2 )

In [66]:
gradesA2

Unnamed: 0,Students,Math,English,Art
0,Rebeca,15,18,10
1,Xavi,18,9,16
2,Cristiano,14,11,20
3,Ronaldo,7,12,19
4,Leo,10,20,5


In [67]:
gradesA1

Unnamed: 0,Students,Math,English,Art,promedio
0,Gissela,16,16,11,14.333333
1,Daniel,14,17,17,16.0
2,Andres,17,19,13,16.333333
3,Sandra,17,18,14,16.333333
4,Rosalyn,17,15,17,16.333333


In [68]:
pd.concat([gradesA1,gradesA2], ignore_index=True)

Unnamed: 0,Students,Math,English,Art,promedio
0,Gissela,16,16,11,14.333333
1,Daniel,14,17,17,16.0
2,Andres,17,19,13,16.333333
3,Sandra,17,18,14,16.333333
4,Rosalyn,17,15,17,16.333333
5,Rebeca,15,18,10,
6,Xavi,18,9,16,
7,Cristiano,14,11,20,
8,Ronaldo,7,12,19,
9,Leo,10,20,5,


In [69]:
grades_total = pd.concat([gradesA1,gradesA2], ignore_index=True).copy()
grades_total

Unnamed: 0,Students,Math,English,Art,promedio
0,Gissela,16,16,11,14.333333
1,Daniel,14,17,17,16.0
2,Andres,17,19,13,16.333333
3,Sandra,17,18,14,16.333333
4,Rosalyn,17,15,17,16.333333
5,Rebeca,15,18,10,
6,Xavi,18,9,16,
7,Cristiano,14,11,20,
8,Ronaldo,7,12,19,
9,Leo,10,20,5,


In [70]:
# grades_total =  gradesA1.append( gradesA2 ) 
# As of pandas 2.0 the append() method was removed; use pd.concat() as in the cells above.

#### Duplicates

In [71]:
cars = pd.DataFrame({
            'brands'    : [ 'hyundai', 'hyundai', 'kia', 'kia', 'kia' ] ,
            'model'     : [ 'sedan', 'sedan', 'sedan', 'truck', 'truck' ] ,
            'passengers': [ 4, 4, 5, 6, 8 ]
            })
cars

Unnamed: 0,brands,model,passengers
0,hyundai,sedan,4
1,hyundai,sedan,4
2,kia,sedan,5
3,kia,truck,6
4,kia,truck,8


In [72]:
cars_subset = cars.drop_duplicates(subset = [ 'brands' ])
cars_subset

Unnamed: 0,brands,model,passengers
0,hyundai,sedan,4
2,kia,sedan,5


In [73]:
cars_subset1 = cars.drop_duplicates(subset = [ 'brands' ], keep = 'last') # keep = 'last' keeps the last row of each group of duplicates
cars_subset1

Unnamed: 0,brands,model,passengers
1,hyundai,sedan,4
4,kia,truck,8


In [74]:
cars_subset2 = cars.drop_duplicates(subset = [ 'brands', 'model' ])
cars_subset2

Unnamed: 0,brands,model,passengers
0,hyundai,sedan,4
2,kia,sedan,5
3,kia,truck,6


In [75]:
cars.drop_duplicates(subset = ['brands','model','passengers' ], keep = 'last')

Unnamed: 0,brands,model,passengers
1,hyundai,sedan,4
2,kia,sedan,5
3,kia,truck,6
4,kia,truck,8


In [76]:
all_columns = cars.columns
cars.drop_duplicates(subset = all_columns) # passing cars.columns avoids typing every column name

Unnamed: 0,brands,model,passengers
0,hyundai,sedan,4
2,kia,sedan,5
3,kia,truck,6
4,kia,truck,8


In [77]:
cars

Unnamed: 0,brands,model,passengers
0,hyundai,sedan,4
1,hyundai,sedan,4
2,kia,sedan,5
3,kia,truck,6
4,kia,truck,8


#### unique

In [78]:
cars['brands'].unique()

array(['hyundai', 'kia'], dtype=object)

In [79]:
cars['model'].unique()

array(['sedan', 'truck'], dtype=object)

#### dropna and fillna

In [80]:
#dropna
grades_total

Unnamed: 0,Students,Math,English,Art,promedio
0,Gissela,16,16,11,14.333333
1,Daniel,14,17,17,16.0
2,Andres,17,19,13,16.333333
3,Sandra,17,18,14,16.333333
4,Rosalyn,17,15,17,16.333333
5,Rebeca,15,18,10,
6,Xavi,18,9,16,
7,Cristiano,14,11,20,
8,Ronaldo,7,12,19,
9,Leo,10,20,5,


In [81]:
grades_total_NA = grades_total.dropna() # it is better not to overwrite the existing DataFrame
grades_total_NA 

Unnamed: 0,Students,Math,English,Art,promedio
0,Gissela,16,16,11,14.333333
1,Daniel,14,17,17,16.0
2,Andres,17,19,13,16.333333
3,Sandra,17,18,14,16.333333
4,Rosalyn,17,15,17,16.333333


In [82]:
# fillna
print( grades_total , "\n" )
grades_total_fill_na = grades_total.fillna(5)
print( grades_total_fill_na )

    Students  Math  English  Art   promedio
0    Gissela    16       16   11  14.333333
1     Daniel    14       17   17  16.000000
2     Andres    17       19   13  16.333333
3     Sandra    17       18   14  16.333333
4    Rosalyn    17       15   17  16.333333
5     Rebeca    15       18   10        NaN
6       Xavi    18        9   16        NaN
7  Cristiano    14       11   20        NaN
8    Ronaldo     7       12   19        NaN
9        Leo    10       20    5        NaN 

    Students  Math  English  Art   promedio
0    Gissela    16       16   11  14.333333
1     Daniel    14       17   17  16.000000
2     Andres    17       19   13  16.333333
3     Sandra    17       18   14  16.333333
4    Rosalyn    17       15   17  16.333333
5     Rebeca    15       18   10   5.000000
6       Xavi    18        9   16   5.000000
7  Cristiano    14       11   20   5.000000
8    Ronaldo     7       12   19   5.000000
9        Leo    10       20    5   5.000000


In [83]:
print( grades_total , "\n" )
grades_total_fill_na = grades_total.fillna( "No value" )
print( grades_total_fill_na )

    Students  Math  English  Art   promedio
0    Gissela    16       16   11  14.333333
1     Daniel    14       17   17  16.000000
2     Andres    17       19   13  16.333333
3     Sandra    17       18   14  16.333333
4    Rosalyn    17       15   17  16.333333
5     Rebeca    15       18   10        NaN
6       Xavi    18        9   16        NaN
7  Cristiano    14       11   20        NaN
8    Ronaldo     7       12   19        NaN
9        Leo    10       20    5        NaN 

    Students  Math  English  Art   promedio
0    Gissela    16       16   11  14.333333
1     Daniel    14       17   17       16.0
2     Andres    17       19   13  16.333333
3     Sandra    17       18   14  16.333333
4    Rosalyn    17       15   17  16.333333
5     Rebeca    15       18   10   No value
6       Xavi    18        9   16   No value
7  Cristiano    14       11   20   No value
8    Ronaldo     7       12   19   No value
9        Leo    10       20    5   No value
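
Filling a numeric column with a string such as "No value" mixes text and numbers in the same column, which makes later arithmetic awkward. As a hedged alternative, the sketch below fills the gaps with numeric values instead: the column mean, or a different value per column through a dictionary. The data and the names `scores`, `filled_mean`, and `filled_dict` are made up for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "student": ["Ana", "Luis", "Rosa", "Tito"],
    "math": [16.0, np.nan, 14.0, 18.0],
    "avg": [14.0, 16.0, np.nan, np.nan],
})

# Fill one column with its own mean (the mean ignores the missing values)
filled_mean = scores["avg"].fillna(scores["avg"].mean())

# Fill several columns at once, each with its own replacement value
filled_dict = scores.fillna({"math": 0, "avg": scores["avg"].mean()})

print(filled_mean.tolist())
print(filled_dict)
```

Whether the mean, zero, or any other value is a sensible replacement depends on the analysis; the sketch only shows the mechanics.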


#### reset index

In [84]:
grades_total

Unnamed: 0,Students,Math,English,Art,promedio
0,Gissela,16,16,11,14.333333
1,Daniel,14,17,17,16.0
2,Andres,17,19,13,16.333333
3,Sandra,17,18,14,16.333333
4,Rosalyn,17,15,17,16.333333
5,Rebeca,15,18,10,
6,Xavi,18,9,16,
7,Cristiano,14,11,20,
8,Ronaldo,7,12,19,
9,Leo,10,20,5,


In [85]:
grades_total.reset_index()
# unlike ignore_index, reset_index keeps the previous index as a new column

Unnamed: 0,index,Students,Math,English,Art,promedio
0,0,Gissela,16,16,11,14.333333
1,1,Daniel,14,17,17,16.0
2,2,Andres,17,19,13,16.333333
3,3,Sandra,17,18,14,16.333333
4,4,Rosalyn,17,15,17,16.333333
5,5,Rebeca,15,18,10,
6,6,Xavi,18,9,16,
7,7,Cristiano,14,11,20,
8,8,Ronaldo,7,12,19,
9,9,Leo,10,20,5,


In [86]:
grades_total.reset_index( drop = True ) # drop = True discards the previous index

Unnamed: 0,Students,Math,English,Art,promedio
0,Gissela,16,16,11,14.333333
1,Daniel,14,17,17,16.0
2,Andres,17,19,13,16.333333
3,Sandra,17,18,14,16.333333
4,Rosalyn,17,15,17,16.333333
5,Rebeca,15,18,10,
6,Xavi,18,9,16,
7,Cristiano,14,11,20,
8,Ronaldo,7,12,19,
9,Leo,10,20,5,


#### sample

In [87]:
grades_total.sample(n = 5)

Unnamed: 0,Students,Math,English,Art,promedio
2,Andres,17,19,13,16.333333
9,Leo,10,20,5,
7,Cristiano,14,11,20,
4,Rosalyn,17,15,17,16.333333
6,Xavi,18,9,16,


In [88]:
grades_total.sample( frac = 0.5).reset_index()

Unnamed: 0,index,Students,Math,English,Art,promedio
0,0,Gissela,16,16,11,14.333333
1,4,Rosalyn,17,15,17,16.333333
2,2,Andres,17,19,13,16.333333
3,3,Sandra,17,18,14,16.333333
4,6,Xavi,18,9,16,


In [89]:
print(grades_total.sample( frac = 0.5).reset_index(), "\n\n", grades_total)

   index Students  Math  English  Art   promedio
0      0  Gissela    16       16   11  14.333333
1      8  Ronaldo     7       12   19        NaN
2      2   Andres    17       19   13  16.333333
3      4  Rosalyn    17       15   17  16.333333
4      1   Daniel    14       17   17  16.000000 

     Students  Math  English  Art   promedio
0    Gissela    16       16   11  14.333333
1     Daniel    14       17   17  16.000000
2     Andres    17       19   13  16.333333
3     Sandra    17       18   14  16.333333
4    Rosalyn    17       15   17  16.333333
5     Rebeca    15       18   10        NaN
6       Xavi    18        9   16        NaN
7  Cristiano    14       11   20        NaN
8    Ronaldo     7       12   19        NaN
9        Leo    10       20    5        NaN


In [90]:
print(grades_total.sample( frac = 0.5).reset_index(drop = True), "\n\n", grades_total)

    Students  Math  English  Art   promedio
0     Daniel    14       17   17  16.000000
1    Gissela    16       16   11  14.333333
2  Cristiano    14       11   20        NaN
3     Rebeca    15       18   10        NaN
4        Leo    10       20    5        NaN 

     Students  Math  English  Art   promedio
0    Gissela    16       16   11  14.333333
1     Daniel    14       17   17  16.000000
2     Andres    17       19   13  16.333333
3     Sandra    17       18   14  16.333333
4    Rosalyn    17       15   17  16.333333
5     Rebeca    15       18   10        NaN
6       Xavi    18        9   16        NaN
7  Cristiano    14       11   20        NaN
8    Ronaldo     7       12   19        NaN
9        Leo    10       20    5        NaN
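
Each call to `sample` above returns different rows because the draw is random. If a reproducible sample is needed, the `random_state` parameter fixes the seed. The sketch below uses a made-up DataFrame and an arbitrary seed of 42.

```python
import pandas as pd

grades = pd.DataFrame({
    "student": list("ABCDEFGHIJ"),
    "math": range(10, 20),
})

# The same seed gives the same rows on every call
first = grades.sample(frac=0.5, random_state=42)
second = grades.sample(frac=0.5, random_state=42)

print(first.equals(second))
print(len(first))
```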


### 3.3.4. <a id='3.3.4'> Importing Data</a>

|Method|Description|
|------|-----------|
|read_excel( )|Read an Excel file into a DataFrame.|
|to_csv( )| Write the index and entries to a CSV file.|
|read_csv( )| Read a CSV file into a DataFrame.|
|to_json( )| Convert the object to a JSON string.|
|to_pickle( )| Serialize the object and store it in an external file.|
|to_sql( )| Write the object data to an open SQL database.|
|read_html( )| Read the tables in an HTML page into a list of DataFrames.|
|read_spss( )| Read an SPSS file into a DataFrame.|
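
As a minimal sketch of how a writer and a reader from the table pair up, the example below sends a small DataFrame through `to_csv` and `read_csv`. It writes to an in-memory `io.StringIO` buffer rather than a file on disk so that it runs anywhere; the data and the names `buffer` and `grades_back` are illustrative.

```python
import io

import pandas as pd

grades = pd.DataFrame({"Students": ["Gissela", "Daniel"], "Math": [16, 14]})

# Write to an in-memory text buffer; index=False leaves the row labels out of the CSV
buffer = io.StringIO()
grades.to_csv(buffer, index=False)

# Rewind the buffer and read the same text back into a new DataFrame
buffer.seek(0)
grades_back = pd.read_csv(buffer)

print(buffer.getvalue())
print(grades_back.equals(grades))
```

Without `index=False` the row labels would be written as an extra unnamed first column and read back as data.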

[ENAPRES DATA](http://proyecto.inei.gob.pe/enapres/)

The National Survey of Budgetary Programs - ENAPRES, has been running since 2010 in the urban and rural areas of the 24 Departments and the Constitutional Province of Callao, as part of the research carried out by the National Institute of Statistics and Informatics (INEI) in coordination with the Ministry of Economy and Finance (MEF) and the different ministries and agencies of the public sector.

In [91]:
!pip install pyreadstat > NUL 2>&1
!pip install savReaderWriter > NUL 2>&1

In [92]:
# enapres2020_1 = pd.read_spss( r"../_data/enapres_2020_ch_100/736-Modulo1618/CAP_100_URBANO_RURAL_3.sav" )

In [None]:
import pyreadstat
enapres2020_1, meta = pyreadstat.read_sav("../_data/enapres_2020_ch_100/736-Modulo1618/CAP_100_URBANO_RURAL_3.sav", apply_value_formats=True)

In [None]:
# Display the data
print("Data:")
print(enapres2020_1.head())

# Display the metadata
print("\nMetadata:")
print(meta)

In [None]:
enapres2020_1.attrs[ 'value_labels' ] = meta.variable_value_labels # value labels of each variable
enapres2020_1.attrs[ 'var_labels' ] = meta.column_names_to_labels # variable (column) labels

In [None]:
enapres2020_1.head(5)

In [None]:
enapres2020_1.attrs[ 'var_labels' ]

In [None]:
dict(list(enapres2020_1.attrs[ 'var_labels' ].items())[:5])

In [None]:
dict(list(enapres2020_1.attrs[ 'value_labels' ].items())[:5])

### 3.3.5. <a id='3.3.5'>Filtering data (by row)</a> 

In [None]:
enapres2020_1.AREA == 'URBANO'

In [None]:
enapres2020_1.loc[ enapres2020_1.AREA == 'URBANO',  :  ]

In [None]:
# select observations
# when we create a subset of our data, copy the object so later changes do not touch the original
df_urban_main = enapres2020_1.loc[ enapres2020_1.AREA == 'URBANO', : ].copy()
df_urban_main

In [None]:
df_urban_completa = df_urban_main.loc[ df_urban_main.RESFIN == 'Completa', : ]
df_urban_completa

In [None]:
enapres2020_1.loc[ (enapres2020_1.AREA == 'URBANO') & (enapres2020_1.RESFIN == 'Completa') ]

In [None]:
df_urban = enapres2020_1.loc[ (enapres2020_1.AREA == 'URBANO') &  (enapres2020_1.RESFIN == 'Completa'), : ].copy() # copy, because df_urban is modified in place later
df_urban
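
Because the ENAPRES file may not be at hand, the row filters above can be sketched on a small, made-up DataFrame. The column names `AREA` and `RESFIN` mirror the survey, but the rows are invented, and `survey`, `mask`, and `urban_complete` are hypothetical names.

```python
import pandas as pd

survey = pd.DataFrame({
    "AREA": ["URBANO", "RURAL", "URBANO", "URBANO"],
    "RESFIN": ["Completa", "Completa", "Incompleta", "Completa"],
    "HOGAR": [1, 1, 2, 1],
})

# Each comparison yields a Boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
mask = (survey["AREA"] == "URBANO") & (survey["RESFIN"] == "Completa")

# .copy() makes the subset an independent DataFrame that is safe to modify later
urban_complete = survey.loc[mask, :].copy()

print(urban_complete)
```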

### 3.3.5. <a id='3.3.5'>Filtering data (by column)</a> 

In [None]:
# We will work on this variable
df_urban.attrs[ 'var_labels' ]['P172D']

In [None]:
df_urban.columns

In [None]:
# Select columns with regex
# All the columns that start with P172 (^ anchors the match at the start of the name)
df_urban.filter(regex = "^P172").columns

In [None]:
df_urban.filter( regex = "^P172").head(5)

In [None]:
df_urban.filter( like = "P172").columns

In [None]:
# all columns whose name contains "P172"
df_urban.filter( like = "P172").head(5)
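
The difference between `like` and `regex` is easy to miss, so here is a small sketch on made-up column names. It assumes an empty DataFrame is enough to show which names are kept; `by_like`, `by_prefix`, and `by_star` are hypothetical names.

```python
import pandas as pd

# An empty DataFrame is enough: filter only looks at the column names
df = pd.DataFrame(columns=["P172A", "P172D", "P171", "XP172", "AREA"])

# like="P172" keeps every column whose name contains the substring anywhere
by_like = df.filter(like="P172").columns.tolist()

# regex="^P172" keeps only the names that start with P172
by_prefix = df.filter(regex="^P172").columns.tolist()

# regex="P172*" means "P17 followed by zero or more 2s", so it also matches P171
by_star = df.filter(regex="P172*").columns.tolist()

print(by_like)
print(by_prefix)
print(by_star)
```

In a regular expression `*` is a quantifier rather than a shell-style wildcard, which is why the unanchored pattern is broader than it looks.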

### 3.3.6. <a id='3.3.6'>Dealing with nulls</a>  

To simplify our Exploratory Data Analysis (EDA), we drop the columns in which at least 30% of the values are null.
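
The idea can be sketched on a toy DataFrame before applying it to the survey. This version uses `isnull().mean()` to get the share of nulls directly, which is an alternative to comparing counts against `len(df) * 0.3`; the data and the names `null_share`, `threshold`, and `kept` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],             # no nulls
    "b": [np.nan, np.nan, 3, 4],   # 50% null
    "c": [1, np.nan, 3, 4],        # 25% null
})

# isnull() gives True/False per cell; the mean of a Boolean column is its share of True
null_share = df.isnull().mean()

# Keep only the columns whose null share is below the 30% threshold
threshold = 0.3
kept = df.loc[:, null_share < threshold]

print(null_share)
print(kept.columns.tolist())
```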

In [None]:
df_urban

In [None]:
df_urban.isnull()

In [None]:
null_sum = df_urban.isnull().sum()
null_sum

In [None]:
len( df_urban ) * 0.3

In [None]:
null_sum < len( df_urban ) * 0.3

In [None]:
col_ok = df_urban.columns[ null_sum < len( df_urban ) * 0.3 ] 
col_ok

In [None]:
col_no_ok = df_urban.columns[ null_sum >= len( df_urban ) * 0.3 ] # >= so that columns with exactly 30% nulls are dropped as well
col_no_ok

In [None]:
df_urban

In [None]:
df_urban.drop(columns = col_no_ok, inplace = True )
df_urban

In [None]:
df_urban

In [None]:
# check the ID: build a key by concatenating the identifying columns
( df_urban['PER'].astype(str)+ "_" + 
 df_urban['MES'].astype(str)+ "_" + 
 df_urban['CCDD'].astype(str) + "_" + 
 df_urban['CCPP'].astype(str) + "_" + 
 df_urban['CCDI'].astype(str) + "_" + 
 df_urban['CONGLOMERADO'].astype(str) + "_" + 
 df_urban['NSELV'].astype(str) + "_" +
 df_urban['VIVIENDA'].astype(str) + "_" + 
 df_urban['HOGAR'].astype(int).astype(str) 
)

In [None]:
( df_urban['PER'].astype(str)+ "_" + 
 df_urban['MES'].astype(str)+ "_" + 
 df_urban['CCDD'].astype(str) + "_" + 
 df_urban['CCPP'].astype(str) + "_" + 
 df_urban['CCDI'].astype(str) + "_" + 
 df_urban['CONGLOMERADO'].astype(str) + "_" + 
 df_urban['NSELV'].astype(str) + "_" +
 df_urban['VIVIENDA'].astype(str) + "_" + 
 df_urban['HOGAR'].astype(int).astype(str) 
).is_unique

### 3.3.7. <a id='3.3.7'>[Duplicates](https://thispointer.com/pandas-find-duplicate-rows-in-a-dataframe-based-on-all-or-selected-columns-using-dataframe-duplicated-in-python/)</a>  

Find duplicated rows.

In [None]:
df_urban.shape

In [None]:
df_urban.loc[: , ['CCDD' ,'CCPP' , 'CCDI' ,'CONGLOMERADO' , 'NSELV', 'VIVIENDA', 'HOGAR'] ]

In [None]:
df_urban[df_urban.loc[: , ['CCDD' ,'CCPP' , 'CCDI' ,'CONGLOMERADO' , 'NSELV', 'VIVIENDA', 'HOGAR'] ].duplicated(keep=False) ]

In [None]:
df_urban.loc[: , ['CCDD' ,'CCPP' , 'CCDI' ,'CONGLOMERADO' , 'NSELV', 'VIVIENDA', 'HOGAR'] ].duplicated(keep=False).sum()

In [None]:
df_urban.loc[[4848,5693,26150,26157,36993,37886,37889,38215,39625,39631] , ['CCDD' ,'CCPP' , 'CCDI' ,'CONGLOMERADO' , 'NSELV', 'VIVIENDA', 'HOGAR'] ]

We keep the first occurrence of each duplicated key and drop the later repeats.

In [None]:
df_urban_no_dpl = df_urban[ ~ df_urban.loc[:, ['CCDD' ,'CCPP' , 'CCDI' ,'CONGLOMERADO' , 'NSELV', 'VIVIENDA', 'HOGAR'] ].duplicated() ].copy()
df_urban_no_dpl
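
On a toy DataFrame, the difference between `duplicated(keep=False)` and the default `keep='first'` can be sketched as follows. The column names echo the survey keys, but the rows are invented, and `homes`, `all_dups`, `later_dups`, and `deduped` are hypothetical names.

```python
import pandas as pd

homes = pd.DataFrame({
    "CONGLOMERADO": [10, 10, 11, 11],
    "VIVIENDA": [1, 1, 1, 2],
    "value": ["first", "second", "x", "y"],
})
key = ["CONGLOMERADO", "VIVIENDA"]

# keep=False flags every member of a duplicated group (useful for inspection)
all_dups = homes.duplicated(subset=key, keep=False)

# The default keep="first" flags only the later repeats
later_dups = homes.duplicated(subset=key)

# Negating the default mask keeps the first occurrence of each key
deduped = homes[~later_dups].copy()

print(all_dups.tolist())
print(later_dups.tolist())
print(deduped)
```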

In [None]:
df_urban_no_dpl.shape

In [None]:
df_urban_no_dpl.NOMBREDD.value_counts()

In [None]:
df_urban_no_dpl.ESTRATO.value_counts()

### 3.3.8. <a id='3.3.8'>Groupby</a>  


In [None]:
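# recode 'Si' to 1 and 'No' to 0
# assign the result back instead of using inplace=True on a column, which may act on a temporary copy
df_urban_no_dpl['P172D'] = df_urban_no_dpl['P172D'].replace({'Si': 1, 'No': 0})
df_urban_no_dpl['P172D'] = pd.to_numeric(df_urban_no_dpl['P172D'], errors='coerce')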
df_urban_no_dpl.P172D

In [None]:
df_urban_no_dpl.P172D.value_counts()

In [None]:
df_urban_no_dpl.groupby(['CCDD' ,'CCPP' , 'CCDI', 'P172D'])['P172D'].sum()

In [None]:
df_urban_no_dpl.groupby(['CCDD' ,'CCPP' , 'CCDI'])['P172D'].mean()

In [None]:
df_urban_no_dpl.groupby(['CCDD' ,'CCPP' , 'CCDI'], as_index = False )['P172D'].mean()

#### [Agg](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)
Aggregate using one or more operations over the specified axis.
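
A minimal sketch of named aggregation on made-up data: each keyword in `agg` names an output column and maps it to a `(column, function)` pair. The `district` and `recycles` columns and the `summary` name are assumptions for illustration, with `recycles` playing the role of a 0/1 indicator such as `P172D`.

```python
import pandas as pd

df = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B"],
    "recycles": [1.0, 0.0, 1.0, 0.0, 0.0],
})

# One output column per keyword: new_name = (source_column, aggregation)
summary = df.groupby("district", as_index=False).agg(
    recycle_mean=("recycles", "mean"),
    recycle_median=("recycles", "median"),
    n=("recycles", "count"),
)

print(summary)
```

For a 0/1 indicator the group mean is the share of ones in that group.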


In [None]:
df_urban_no_dpl['P172D'] = df_urban_no_dpl['P172D'].astype( float )

In [None]:
df_urban_no_dpl.groupby(['CCDD' ,'CCPP' , 'CCDI' ], as_index = False ).agg( { "P172D": "mean" } )

In [None]:
import numpy as np

In [None]:
df3_rec = df_urban_no_dpl.groupby([ 'CCDD' ,'CCPP' , 'CCDI' ], as_index = False).agg( 
    recycle_median = ('P172D', 'median' ),  # string names avoid the FutureWarning that recent pandas versions raise for np.median / np.mean
    recycle_mean   = ('P172D', 'mean'))
df3_rec

### 3.3.9. <a id='3.3.9'>Reshape</a>  

#####  From Wide to Long

In [None]:
df3_rec

In [None]:
df3_rec_stack = df3_rec.set_index([ 'CCDD' ,'CCPP' , 'CCDI' ]).stack().reset_index().rename( {"level_3" : "STATS",
                                                                                              0 : "VALUES" }, 
                                                                                            axis = 1 )
df3_rec_stack

In [None]:
df3_rec_melt = df3_rec.melt(
    id_vars = [ 'CCDD' ,'CCPP' , 'CCDI' ] ,
    var_name = 'STATS', 
    value_name = 'VALUES')
df3_rec_melt

##### From Long to Wide

In [None]:
df3_rec_stack.set_index(['CCDD' ,'CCPP' , 'CCDI' , "STATS"]).unstack().head()

In [None]:
df4 = df3_rec_stack.set_index(  [ 'CCDD' ,'CCPP' , 'CCDI' , "STATS" ]   ).unstack().rename_axis( [None, None], axis = 1 )
df4

In [None]:
df4.columns = df3_rec_stack.STATS.unique()
df4.reset_index().head()

In [None]:
df_l_w = df3_rec_stack.pivot( index = [ 'CCDD' ,'CCPP' , 'CCDI' ], 
                         columns = 'STATS' ,
                         values = 'VALUES' 
                        ).rename_axis( [None], axis = 1 ).reset_index()
df_l_w.head()
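
The wide-to-long and long-to-wide steps can be checked on a tiny table where the result is easy to read. This sketch uses a single key column and invented values; `wide`, `long`, and `back` are hypothetical names.

```python
import pandas as pd

wide = pd.DataFrame({
    "CCDD": ["01", "02"],
    "recycle_median": [0.0, 1.0],
    "recycle_mean": [0.25, 0.75],
})

# Wide to long: one row per (CCDD, statistic) pair
long = wide.melt(id_vars="CCDD", var_name="STATS", value_name="VALUES")

# Long to wide: the STATS values become columns again
back = (
    long.pivot(index="CCDD", columns="STATS", values="VALUES")
    .rename_axis(None, axis=1)   # remove the "STATS" label from the column axis
    .reset_index()
)

print(long)
print(back)
```

Note that `pivot` returns the new columns in sorted order, so the column order of `back` can differ from `wide` even though the values match.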

### 3.3.10. <a id='3.3.10'>[Merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)</a>  

In [None]:
df_urban_merge = df_urban_no_dpl.merge(df_l_w, 
                                       on = ['CCDD' ,'CCPP' , 'CCDI'] , 
                                       how = "left" , 
                                       validate = "m:1")

In [None]:
df_urban_merge.head()
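
To show what `validate = "m:1"` protects against, the sketch below merges a made-up household table with a district-level table, then repeats the merge with a duplicated key on the right-hand side. All data and names (`households`, `district_stats`, `bad_stats`, `check_failed`) are illustrative.

```python
import pandas as pd

households = pd.DataFrame({"CCDD": ["01", "01", "02"], "HOGAR": [1, 2, 1]})
district_stats = pd.DataFrame({"CCDD": ["01", "02"], "recycle_mean": [0.25, 0.75]})

# how="left" keeps every household row; validate="m:1" checks that
# CCDD is unique in the right-hand table before merging
merged = households.merge(district_stats, on="CCDD", how="left", validate="m:1")

# A right-hand table with a repeated key violates the many-to-one assumption
bad_stats = pd.concat([district_stats, district_stats.iloc[[0]]], ignore_index=True)
try:
    households.merge(bad_stats, on="CCDD", how="left", validate="m:1")
    check_failed = False
except pd.errors.MergeError:
    check_failed = True

print(merged)
print(check_failed)
```

Without `validate`, the second merge would silently duplicate household rows instead of raising an error.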

In [None]:
# See all your DataFrames
%whos DataFrame
# %whos DataFrame lists every variable that is an instance of the pandas DataFrame class,
# with its name, type, and a short summary of its contents.

## 3.4. <a id='3.4'>References</a>  

1. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html
2. https://towardsdatascience.com/all-the-core-functions-of-python-pandas-you-need-to-know-d219cbd87636
3. https://pandas.pydata.org/docs/reference/api/pandas.melt.html#pandas.melt
4. https://stackoverflow.com/questions/47152691/how-can-i-pivot-a-dataframe
5. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
6. https://stackoverflow.com/questions/11346283/renaming-column-names-in-pandas
7. https://thispointer.com/pandas-find-duplicate-rows-in-a-dataframe-based-on-all-or-selected-columns-using-dataframe-duplicated-in-python/
