# Pandas

## Introduction

pandas is a library for data manipulation and analysis. It was created in 2008 in response to the increasing use of Python in scientific applications traditionally dominated by **R**, MATLAB or SAS, and it builds on the maturity and stability of **NumPy** and **SciPy**. Its name derives from ***Pan**el **Da**ta*, a common term in statistics and econometrics for multidimensional datasets.

It allows:
- Easy importing from CSV, JSON, Excel, SQL, etc.
- Manipulation operations: selection, filtering, aggregation.
- Data cleaning (also called *data cleansing*).
- *Data wrangling* or *data munging*: transforming data between formats.

pandas structures:
- Series: 1D array
- DataFrame: 2D array
- Panel: 3D array (deprecated and removed from current pandas versions, where only Series and DataFrame remain)


Official documentation: https://pandas.pydata.org/docs/

In [1]:
import pandas as pd

## Series

The **Series** type is a one-dimensional array that contains a sequence of values and a sequence of labels associated with the values, called an index. The existence of this explicit index (whose labels can be of any hashable type, such as integers or strings) is the main difference from a NumPy vector, which has only an implicit index (a sequence of integers indicating the position). Series indexes behave like dictionary keys, while **NumPy** indexes behave like list positions.
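
As a rough illustration of the difference between an implicit and an explicit index, the following sketch compares a NumPy vector with a Series holding the same values. The city names and temperatures are made-up data, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A NumPy vector only knows positions 0, 1, 2.
temperatures_np = np.array([21.5, 19.0, 23.0])

# A Series carries the same values plus an explicit label for each one.
# The city names are hypothetical illustration data.
temperatures = pd.Series([21.5, 19.0, 23.0], index=["Madrid", "Bilbao", "Sevilla"])

print(temperatures_np[1])      # access by position
print(temperatures["Bilbao"])  # access by label, like a dictionary key

# Arithmetic between two Series aligns values by label, not by position,
# so the different ordering of the labels below does not matter.
adjustment = pd.Series({"Sevilla": 1.0, "Madrid": -1.0, "Bilbao": 0.0})
adjusted = temperatures + adjustment
print(adjusted)
```

Label alignment is the behaviour that the explicit index buys: the same addition on two NumPy vectors would pair the values purely by position.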

### Structure

In [2]:
serie_ejemplo = pd.Series([1,2,3,4,5,6]) # Series with implicit index since it starts from a list
print(serie_ejemplo)
print(type(serie_ejemplo))

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
<class 'pandas.core.series.Series'>


In [3]:
# Similarly to how in NumPy we create vectors from lists, in pandas we can create series from dictionaries. In this case, the dictionary keys will be the series indexes and the dictionary values will be the series values.

estudiantes_con_notas = pd.Series({'Estudiante 1': 5, 'Estudiante 2': 10, 'Estudiante 3': 7, 'Estudiante 4': 8})

In [4]:
pd.Series([5,10,7,8], index=["Estudiante 1","Estudiante 2","Estudiante 3","Estudiante 4"]) # Can also be done this way

Estudiante 1     5
Estudiante 2    10
Estudiante 3     7
Estudiante 4     8
dtype: int64

In [5]:
asientos_ocupados_teatro = pd.Series({1: "Pepe Pérez", 7: "Juan Gómez", 6: "Ana López", 2: "María García", 5: "Luisa Martínez"})
asientos_ocupados_teatro

1        Pepe Pérez
7        Juan Gómez
6         Ana López
2      María García
5    Luisa Martínez
dtype: object

### Accessing elements of a series

Be careful when mixing positions and labels: the explicit index can consist of numbers rather than text strings. In that case, an integer passed to the indexing operator is interpreted as a label of the explicit index, not as a position.

To operate on positions, use the **iloc** attribute (from *integer location*); to operate on labels, use the **loc** attribute or the indexing operator **[]** directly, as in lists or dictionaries. The indexing operator is the most common choice, since it is shorter and more readable.

In [6]:
print(asientos_ocupados_teatro[7]) # Returns the value at explicit index 7
print(asientos_ocupados_teatro.loc[7]) # Equivalent to the above
print(asientos_ocupados_teatro.iloc[1]) # Returns the value at position 1
# print(asientos_ocupados_teatro[0]) # Would raise a KeyError: the explicit index is numeric and has no label 0 (allowing a positional fallback here would be a source of errors)

Juan Gómez
Juan Gómez
Juan Gómez


In [7]:
print(estudiantes_con_notas["Estudiante 1"]) # Returns the value at explicit index "Estudiante 1"
print(estudiantes_con_notas.loc["Estudiante 1"]) # Equivalent to the above
print(estudiantes_con_notas.iloc[0]) # Returns the value at position 0
print(estudiantes_con_notas[0]) # Falls back to position 0 because the explicit index holds strings, but emits a warning
# This fallback is a source of errors and will be removed in future pandas versions; use iloc for positional access instead
# print(estudiantes_con_notas.loc[0]) # Would raise a KeyError since the explicit index holds strings and label 0 doesn't exist

5
5
5
5




In [8]:
# Modifying values by indexes
estudiantes_con_notas['Estudiante 1'] = 10
estudiantes_con_notas['Estudiante 3':] = 5 # Modifies values from label 'Estudiante 3' (included) to the end (label-based slicing)
estudiantes_con_notas

Estudiante 1    10
Estudiante 2    10
Estudiante 3     5
Estudiante 4     5
dtype: int64
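
One detail worth checking when slicing: a label-based slice with **loc** includes both endpoints, while a positional slice with **iloc** excludes the stop position, as in Python lists. The sketch below rebuilds a small grades Series (same labels as above, original values) to show the difference; the variable names are illustrative only.

```python
import pandas as pd

grades = pd.Series(
    {"Estudiante 1": 5, "Estudiante 2": 10, "Estudiante 3": 7, "Estudiante 4": 8}
)

# Label-based slice: both 'Estudiante 2' and 'Estudiante 3' are included.
by_label = grades.loc["Estudiante 2":"Estudiante 3"]

# Positional slice: position 1 is included, position 3 is excluded.
by_position = grades.iloc[1:3]

print(by_label)
print(by_position)
print(by_label.equals(by_position))  # both slices select the same two rows
```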

In [9]:
print(estudiantes_con_notas.mean()) # Mean of the grades
print(estudiantes_con_notas.std()) # Standard deviation

7.5
2.886751345948129


In [10]:
print(estudiantes_con_notas.describe()) # Descriptive statistics of student grades

count     4.000000
mean      7.500000
std       2.886751
min       5.000000
25%       5.000000
50%       7.500000
75%      10.000000
max      10.000000
dtype: float64


## DataFrame

### Structure of a DataFrame

A **DataFrame** is a two-dimensional tabular data structure, with labeled rows and columns. It is similar to a relational database table (SQL). It can be considered as a collection of Series that share the same index. It is the most used data structure in pandas.
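
To make the "collection of Series" idea concrete, here is a minimal sketch with hypothetical names and grades: selecting a column returns a Series that shares the DataFrame's index, and selecting a row returns a Series indexed by the column names.

```python
import pandas as pd

# Hypothetical data used only for illustration.
df = pd.DataFrame(
    {"math": [7.0, 8.5, 6.0], "physics": [6.5, 9.0, 7.5]},
    index=["Ana", "Luis", "Marta"],
)

math_column = df["math"]                   # one column is a Series
print(type(math_column))
print(math_column.index.equals(df.index))  # the column shares the row index

luis_row = df.loc["Luis"]                  # one row is also a Series
print(luis_row)                            # its index holds the column names
```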

In [11]:
pd.DataFrame({'Notas': estudiantes_con_notas}) # Create a DataFrame from the grades series (giving a name to the column)

| | Notas |
|---|---|
| Estudiante 1 | 10 |
| Estudiante 2 | 10 |
| Estudiante 3 | 5 |
| Estudiante 4 | 5 |


In [12]:
# Directly create a dataframe with grades of several students in several subjects
pd.DataFrame({'PIA': estudiantes_con_notas, 'SAA': [5, 6, 7, 8], 'MIA': [9, 8, 7, 6], 'SBD': [10, 9, 8, 7], 'BDA': [6, 7, 8, 9]})

| | PIA | SAA | MIA | SBD | BDA |
|---|---|---|---|---|---|
| Estudiante 1 | 10 | 5 | 9 | 10 | 6 |
| Estudiante 2 | 10 | 6 | 8 | 9 | 7 |
| Estudiante 3 | 5 | 7 | 7 | 8 | 8 |
| Estudiante 4 | 5 | 8 | 6 | 7 | 9 |


In the previous case we used the indexes of ```estudiantes_con_notas``` to create the dataframe. We pass a Series object for the first column and plain lists for the following ones; the lists are matched to the Series index by position.

In [13]:
# Another option would be to specify the explicit indexes and receive all grades as lists
pd.DataFrame({'PIA': estudiantes_con_notas, 'SAA': [5, 6, 7, 8], 'MIA': [9, 8, 7, 6], 'SBD': [10, 9, 8, 7], 'BDA': [6, 7, 8, 9]}, index=['Wrong Name', 'Estudiante 2', 'Estudiante 3', 'Estudiante 4'])

| | PIA | SAA | MIA | SBD | BDA |
|---|---|---|---|---|---|
| Wrong Name | NaN | 5 | 9 | 10 | 6 |
| Estudiante 2 | 10.0 | 6 | 8 | 9 | 7 |
| Estudiante 3 | 5.0 | 7 | 7 | 8 | 8 |
| Estudiante 4 | 5.0 | 8 | 6 | 7 | 9 |


We made an error in a student's name. Because the first grade list (PIA) is a Series, pandas aligns it by label, finds no grade for the index 'Wrong Name' and fills that cell with **NaN (Not a Number)**, a NumPy constant; this is also why the PIA column is shown as floats. The other grades are plain lists without an index, so they are assigned by position and assumed to be correct. Creating the DataFrame from a dictionary of lists instead of a dictionary of Series would avoid the NaN, but only by hiding the mismatch.
In this type of process it is important to stay alert. Identifying each student's grades only by their position in a list is not very robust: if a student is added or the order of students changes, the grades are assigned to the wrong students. It is better to use a dictionary of Series, since the explicit index identifies each student correctly.

The following solution is more robust:

In [14]:
notas_pia = pd.Series({'Marvin Minsky': 5.7, 'John McCarthy': 6.5, 'Claude Shannon': 6.5, 'Alan Turing': 7.0})
notas_saa = pd.Series({'Marvin Minsky': 8.0, 'John McCarthy': 8.5, 'Claude Shannon': 8.0, 'Alan Turing': 9.0})
notas_mia = pd.Series({'Marvin Minsky': 7.0, 'John McCarthy': 6.0, 'Claude Shannon': 6.0, 'Alan Turing': 7.0})
notas_sbd = pd.Series({'Marvin Minsky': 9.0, 'John McCarthy': 9.0, 'Claude Shannon': 9.0, 'Alan Turing': 10.0})
notas_bda = pd.Series({'John McCarthy': 7.8, 'Claude Shannon': 6.9, 'Alan Turing': 9.9, 'Marvin Minsky': 10}) # Order does not matter here: the series have explicit indexes, so values are aligned by label

notas_df = pd.DataFrame({'PIA': notas_pia, 'SAA': notas_saa, 'MIA': notas_mia, 'SBD': notas_sbd, 'BDA': notas_bda})
notas_df

| | PIA | SAA | MIA | SBD | BDA |
|---|---|---|---|---|---|
| Alan Turing | 7.0 | 9.0 | 7.0 | 10.0 | 9.9 |
| Claude Shannon | 6.5 | 8.0 | 6.0 | 9.0 | 6.9 |
| John McCarthy | 6.5 | 8.5 | 6.0 | 9.0 | 7.8 |
| Marvin Minsky | 5.7 | 8.0 | 7.0 | 9.0 | 10.0 |


A couple of problems to consider when using *strings* as indexes in data analysis:
- They may not be unique (two people can have the same name).
- Names may be written differently (for example, in uppercase or lowercase) in different data sources.

These are two of the reasons why relational databases typically use unique, indexed primary keys, often auto-incrementing integers that have no meaning in themselves (surrogate keys).

In [15]:
notas_pia = pd.Series({'Marvin Minsky': 5.7, 'John McCarthy': 6.2, 'Claude Shannon': 6.5, 'Alan Turing': 7.0})
notas_saa = pd.Series({'marvin minsky': 8.0, 'McCarthy': 8.5, 'shannon': 8.0, 'Alan-Turing': 9.0})
notas_df_liandola_parda = pd.DataFrame({'PIA': notas_pia, 'SAA': notas_saa})
notas_df_liandola_parda

| | PIA | SAA |
|---|---|---|
| Alan Turing | 7.0 | NaN |
| Alan-Turing | NaN | 9.0 |
| Claude Shannon | 6.5 | NaN |
| John McCarthy | 6.2 | NaN |
| Marvin Minsky | 5.7 | NaN |
| McCarthy | NaN | 8.5 |
| marvin minsky | NaN | 8.0 |
| shannon | NaN | 8.0 |
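
Some of these mismatches can be reduced before combining the data. The following sketch, which uses hypothetical data and a hypothetical helper called `normalize_index`, lowercases and strips the labels and checks that they are unique. It only handles differences in case and surrounding whitespace; it would not reconcile labels such as 'McCarthy' and 'John McCarthy'.

```python
import pandas as pd

# Hypothetical grades whose labels differ only in case and surrounding spaces.
grades_a = pd.Series({"Alan Turing": 7.0, "Claude Shannon": 6.5})
grades_b = pd.Series({"alan turing ": 9.0, "CLAUDE SHANNON": 8.0})


def normalize_index(series):
    """Return a copy of the series whose labels are stripped and lowercased."""
    normalized = series.copy()
    normalized.index = normalized.index.str.strip().str.lower()
    return normalized


grades_a_norm = normalize_index(grades_a)
grades_b_norm = normalize_index(grades_b)

# Repeated labels would make the alignment ambiguous, so check for them first.
print(grades_a_norm.index.is_unique, grades_b_norm.index.is_unique)

combined = pd.DataFrame({"PIA": grades_a_norm, "SAA": grades_b_norm})
print(combined)
```

With the labels normalized, both columns align on the same two rows and no NaN values appear.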


### Information about a DataFrame

In [16]:
notas_df.info() # Information about the DataFrame

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Alan Turing to Marvin Minsky
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   PIA     4 non-null      float64
 1   SAA     4 non-null      float64
 2   MIA     4 non-null      float64
 3   SBD     4 non-null      float64
 4   BDA     4 non-null      float64
dtypes: float64(5)
memory usage: 192.0+ bytes


In [17]:
notas_df.head() # First 5 rows (in this case, there are only 4, normally we will work with much larger datasets and it will be useful to see only the first rows to get an idea of the data)

| | PIA | SAA | MIA | SBD | BDA |
|---|---|---|---|---|---|
| Alan Turing | 7.0 | 9.0 | 7.0 | 10.0 | 9.9 |
| Claude Shannon | 6.5 | 8.0 | 6.0 | 9.0 | 6.9 |
| John McCarthy | 6.5 | 8.5 | 6.0 | 9.0 | 7.8 |
| Marvin Minsky | 5.7 | 8.0 | 7.0 | 9.0 | 10.0 |


In [18]:
notas_df.shape # Number of rows and columns

(4, 5)

In [19]:
notas_df.keys() # "Index" object with column names and type

Index(['PIA', 'SAA', 'MIA', 'SBD', 'BDA'], dtype='object')

In [20]:
notas_df.columns # Equivalent to the above but only for DataFrame (the keys() method also works to retrieve Series keys)

Index(['PIA', 'SAA', 'MIA', 'SBD', 'BDA'], dtype='object')

In [21]:
notas_df.dtypes # Data types of columns

PIA    float64
SAA    float64
MIA    float64
SBD    float64
BDA    float64
dtype: object

In [22]:
notas_df.index # Row indexes

Index(['Alan Turing', 'Claude Shannon', 'John McCarthy', 'Marvin Minsky'], dtype='object')

### Writing and reading data files

pandas offers a wide variety of functions to import and export data from and to files. Without going into depth, as an example we can store the DataFrame ```notas_df``` in a **CSV file** with the **to_csv** method and retrieve it with the **read_csv** function.

In [23]:
notas_df.to_csv('data/grades.csv') # the data directory must exist
df = pd.read_csv('data/grades.csv', index_col=0)
df

| | PIA | SAA | MIA | SBD | BDA |
|---|---|---|---|---|---|
| Alan Turing | 7.0 | 9.0 | 7.0 | 10.0 | 9.9 |
| Claude Shannon | 6.5 | 8.0 | 6.0 | 9.0 | 6.9 |
| John McCarthy | 6.5 | 8.5 | 6.0 | 9.0 | 7.8 |
| Marvin Minsky | 5.7 | 8.0 | 7.0 | 9.0 | 10.0 |


The parameter ```index_col=0``` indicates that the first column of the CSV file is the explicit index of the DataFrame. If it is not specified, an implicit index is created and the former index is read as an ordinary column:

In [24]:
pd.read_csv('data/grades.csv')

| | Unnamed: 0 | PIA | SAA | MIA | SBD | BDA |
|---|---|---|---|---|---|---|
| 0 | Alan Turing | 7.0 | 9.0 | 7.0 | 10.0 | 9.9 |
| 1 | Claude Shannon | 6.5 | 8.0 | 6.0 | 9.0 | 6.9 |
| 2 | John McCarthy | 6.5 | 8.5 | 6.0 | 9.0 | 7.8 |
| 3 | Marvin Minsky | 5.7 | 8.0 | 7.0 | 9.0 | 10.0 |
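
The 'Unnamed: 0' column appears because the index was written to the file without a header. One possible way to avoid it, sketched below with a small hypothetical DataFrame, is to name the index column when writing (`index_label`) and to refer to it by that name when reading. The column name `student` and the temporary directory are assumptions made for this example; the sketch also creates the target directory first, since `to_csv` does not create it.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical subset of the grades, used only for illustration.
grades = pd.DataFrame(
    {"PIA": [7.0, 6.5], "SAA": [9.0, 8.0]},
    index=["Alan Turing", "Claude Shannon"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "data" / "grades.csv"
    csv_path.parent.mkdir(parents=True, exist_ok=True)  # create the directory if missing

    grades.to_csv(csv_path, index_label="student")  # give the index column a header
    restored = pd.read_csv(csv_path, index_col="student")  # restore it by name

print(restored)
print(restored.index.name)
```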
