# PANDAS

## Objectives of the session

• Understand what a Series object is and how it relates to DataFrames.

• Know the fundamental functionalities and transformations that we can carry out on DataFrames.

We import the pandas library to be able to work with DataFrames in Python. If it is not installed by default, we can install it with one of the following commands:
* <code>pip install pandas</code>
* <code>conda install pandas</code>
* <code>conda install -c anaconda pandas </code>

In [None]:
# Import pandas with its usual alias
import pandas as pd


## What is pandas?

* It is a library dedicated to processing data in the form of matrices or data tables (__PAN__el __DA__ta).
* It is one of the quintessential libraries for data science.
* The data structures we work with are Series and DataFrames.
* A succession of Series composes a DataFrame.
* Installation may be required: <code>pip install pandas</code>.
* It is commonly used with the alias <code>pd</code>.
* To import the library, <code>import pandas as pd</code> is used.

![Imagen1.png](attachment:Imagen1.png)

## Series

A Series is an indexed collection that works in a similar way to a data dictionary. It is the minimum unit with which we can work in pandas; in fact, a DataFrame is nothing more than a succession of Series.

We must take into account that a Series works in a similar way to a data dictionary, where the entries have the form {key_1: 'value one', key_2: 'value two', ....., key_n: 'value n'}. A Series can be built from a NumPy array or a list, plus some values that will act as the index (if they are not given, they are taken by default from 0 to the length of the set of elements minus one).

To create a Series, we simply use the constructor <code>**pd.Series**</code>

In [None]:
criptos = pd.Series(
    [15613.77, 496.29, 0.52, 0.83, 242.03], 
    index= ['Bitcoin', 'Ethereum', 'XRP', 'Tether', 'BTC Cash']
)

criptos

A Series object has its own class, <code>pandas.Series</code>.

### Question.

An object of class pandas.Series looks similar to a dictionary. Does it have the same functions, such as keys or values?
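
One way to explore this question is the minimal sketch below, which rebuilds the example Series from above. The variable names (`labels`, `values`, `pairs`, `xrp`) are only illustrative.

```python
import pandas as pd

# Same example Series as in the cell above
criptos = pd.Series(
    [15613.77, 496.29, 0.52, 0.83, 242.03],
    index=['Bitcoin', 'Ethereum', 'XRP', 'Tether', 'BTC Cash']
)

labels = criptos.keys()        # method: returns the index (same as criptos.index)
values = criptos.values        # attribute, not a method: returns a NumPy array
pairs = list(criptos.items())  # (label, value) pairs, as with dict.items()
xrp = criptos['XRP']           # label-based lookup, like a dictionary key
has_xrp = 'XRP' in criptos     # membership is tested against the index
```

Note that, unlike a dictionary, `values` is an attribute here, so it is used without parentheses.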

## Dataframes

![Imagen2.jpg](attachment:Imagen2.jpg)

Almost any type of data structure (array, list, dictionary...) can be turned into a DataFrame. The function to create a pandas DataFrame is
* <code>__pandas.DataFrame()__</code>

* Through **Series**

In [None]:
# Create a DataFrame from the Series object
df_criptos = pd.DataFrame(criptos)

# By default names of columns are created from 0
df_criptos

* Through **Numpy Array**
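
A minimal sketch of this case, assuming a small two-dimensional array invented for illustration (`arr` and `df_array` are hypothetical names):

```python
import numpy as np
import pandas as pd

# A 2 x 3 array: each inner list becomes a row of the DataFrame
arr = np.array([[1, 2, 3], [4, 5, 6]])

# No index or columns are passed, so both default to 0, 1, 2, ...
df_array = pd.DataFrame(arr)
```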

It is important to highlight in the previous example that, since no index parameter was used, the index has automatically taken values from 0 to n - 1.

* Through **List**
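
A minimal sketch of this case, assuming a list of rows and column names chosen only for illustration:

```python
import pandas as pd

# Each inner list is one row; the columns parameter names the fields
rows = [['Bitcoin', 15613.77], ['Ethereum', 496.29], ['XRP', 0.52]]
df_list = pd.DataFrame(rows, columns=['name', 'price'])
```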

* Through a **data dictionary**

Note that when dealing with a data dictionary, the keys are interpreted as columns by default. To have the keys recognized as the index, we can use the **orient** parameter of <code>pd.DataFrame.from_dict</code>. Otherwise, we can pass the row labels through the **index** parameter.
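
The sketch below illustrates both options under the assumption of a small dictionary invented for this purpose. The `symbol` values and all variable names are hypothetical.

```python
import pandas as pd

prices = {'Bitcoin': [15613.77, 'BTC'], 'Ethereum': [496.29, 'ETH']}

# Default behaviour: the dictionary keys become the column names
df_cols = pd.DataFrame(prices)

# orient='index': the dictionary keys become the row labels instead
df_rows = pd.DataFrame.from_dict(prices, orient='index',
                                 columns=['price', 'symbol'])

# Alternative: keys as columns, row labels passed through index
df_indexed = pd.DataFrame({'price': [15613.77, 496.29]},
                          index=['Bitcoin', 'Ethereum'])
```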

### Loading CSV files as DataFrames

One of the great advantages of working with DataFrames is that we can load files very easily and then process their rows and columns. There are several types of data that we can load into pandas:

* CSV
* JSON
* HTML
* EXCEL
* HDF5
* TXT
* ORC
* STATA
* ...

All the types of data that we can load are found in the following link: https://pandas.pydata.org/docs/user_guide/io.html#io

Throughout this content we will show the main functionalities of DataFrames using a DataFrame loaded from a CSV file.

To load a CSV file as a DataFrame we have the function <code>**read_csv**</code>
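
As a minimal sketch, the example below reads CSV text from an in-memory buffer so that it runs without any file on disk. The CSV content is invented for illustration; with a real file we would pass its path instead of the buffer.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would live in a .csv file
csv_text = "region,sales,year\nEUROPA,153752,2018\nUSA,256198,2018\n"

# read_csv accepts a path or any file-like object; sep sets the delimiter
df_csv = pd.read_csv(io.StringIO(csv_text), sep=',')
```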

### Series and columns of a Dataframe

As we have seen before, a DataFrame is a succession or collection of Series; that is, each column acts as a Series, and all the columns share the same index.

To access a column, we write the name of the column in square brackets.

If we look at the class of a column, we will see that it is Series.

Another way to get a column is with dot notation, similar to SQL:

* df_name.__column_name__
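
Both ways of accessing a column can be sketched as follows, assuming a small DataFrame built for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['EUROPA', 'USA', 'LATAM'],
    'sales': [153752, 256198, 145638],
})

col = df['sales']      # square brackets with the column name
col_type = type(col)   # pandas Series
same_col = df.sales    # dot notation returns the same column
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so square brackets are the safer general choice.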

### Selecting multiple columns

Now that we have seen how a DataFrame is made up of Series, we can see how to select multiple columns. To do this, we simply have to pass a list of the column names inside the square brackets, such as:

__df_name[['column_one', 'column_two', 'column_three']]__

Similarly, we can pass a variable that holds a list of column names.

To do the same by position, that is, to obtain a subset of the columns of the DataFrame by the position they occupy, we can do:
* __dataframe_name[dataframe_name.columns[[col_n, col_m]]]__
* __dataframe_name.iloc[:, [col_n, col_m]]__

Using column names.

Using column positions.

To select multiple columns with iloc, we have to take into account that a DataFrame is indexed in the following way:

* __dataframe.iloc[rows, columns]__: where both rows and columns are indexable
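
The selections described in this section can be sketched as follows, assuming a small example DataFrame (all variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['EUROPA', 'USA', 'LATAM'],
    'sales': [153752, 256198, 145638],
    'year': [2018, 2018, 2018],
})

# By name: a list of column names inside the square brackets
by_name = df[['region', 'sales']]

# By name, with the list stored in a variable first
wanted = ['region', 'year']
by_list = df[wanted]

# By position: columns 0 and 2, in two equivalent ways
by_pos_1 = df[df.columns[[0, 2]]]
by_pos_2 = df.iloc[:, [0, 2]]
```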

### Selecting multiple rows

In general, we can slice rows as we do with lists, except that selecting a single row works differently:
* To select from the first row to a limit we perform: <code>__dataframe[0:n]__</code>, similarly we can omit the zero and simply perform: <code>__dataframe[ :n]__</code>
* To select from a row to the end we do: <code>__dataframe[n:]__</code>
* To select a defined range of rows we do: <code>__dataframe[n:m]__</code>

To select a single row we cannot index a single position as we do with lists (<code>lista[n]</code>): <code>dataframe[n]</code> looks for a column named _n_ and typically raises a <code>KeyError</code>. Instead, we select a range that contains only the row we want, ending one position later: <code>dataframe[n:n+1]</code>
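
A minimal sketch of these row selections, assuming the eight-row sales data used elsewhere in this notebook:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['EUROPA', 'EUROPA', 'EUROPA', 'USA', 'USA', 'USA', 'LATAM', 'LATAM'],
    'sales': [153752, 168742, 162587, 256198, 285743, 290371, 145638, 151678],
})

first_three = df[:3]   # rows 0, 1, 2
from_five = df[5:]     # row 5 to the end
middle = df[2:5]       # rows 2, 3, 4 (row 5 is excluded)
single = df[3:4]       # only row 3, returned as a one-row DataFrame

# df[3] is interpreted as a column lookup, so it fails here
try:
    df[3]
    single_lookup_failed = False
except KeyError:
    single_lookup_failed = True
```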

### iloc Function Summary

With the __.iloc__ indexer we can select several rows in a very simple way that can be summarized as:
* <code>dataframe.iloc[0]</code> - First row of a dataframe.
* <code>dataframe.iloc[1]</code> - Second row of a dataframe.
* <code>dataframe.iloc[-1]</code> - (Negative indexing) Last row of a dataframe.
* <code>dataframe.iloc[[n]]</code> - Row _n_ of the dataframe, returned as a DataFrame.
* <code>dataframe.iloc[n:m]</code> - Custom range of rows _n_ to _m_ (row _m_ is excluded).

For columns, the summary of the __.iloc__ indexer becomes:
* <code>dataframe.iloc[:, 0]</code> - First column of a dataframe.
* <code>dataframe.iloc[:, 1]</code> - Second column of a dataframe.
* <code>dataframe.iloc[:, -1]</code> - (Negative indexing) Last column of a dataframe.
* <code>dataframe.iloc[:,[n,m]]</code> - Exactly the _n_ and _m_ columns of a dataframe.
* <code>dataframe.iloc[:, n:m]</code> - Custom range of columns _n_ to _m_ of a dataframe (column _m_ is excluded).

We can also select rows and columns at the same time with __.iloc__, as in the following examples:
* <code>dataframe.iloc[[0,5,7,9], [1,4]]</code> - Selection of rows 0, 5, 7 and 9 from columns 1 and 4
* <code>dataframe.iloc[0:10, 0:2]</code> - Range-based selection of rows 0 to 9 from columns 0 and 1 (the end of each range is excluded)
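
The summary above can be tried out with a sketch like the following, assuming the eight-row sales data used in this notebook (the positions are adapted to its three columns):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['EUROPA', 'EUROPA', 'EUROPA', 'USA', 'USA', 'USA', 'LATAM', 'LATAM'],
    'sales': [153752, 168742, 162587, 256198, 285743, 290371, 145638, 151678],
    'year': [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019],
})

first_row = df.iloc[0]              # a Series holding the first row
last_row = df.iloc[-1]              # a Series holding the last row
row_as_frame = df.iloc[[2]]         # row 2, kept as a one-row DataFrame
rows_range = df.iloc[2:5]           # rows 2, 3, 4

first_col = df.iloc[:, 0]           # first column as a Series
last_col = df.iloc[:, -1]           # last column as a Series
block = df.iloc[[0, 5, 7], [0, 1]]  # rows 0, 5, 7 of columns 0 and 1
corner = df.iloc[0:4, 0:2]          # rows 0 to 3 of columns 0 and 1
```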

### Indexes of a DataFrame

These are the labels that identify each row within a DataFrame. Unless a specific label-based index is specified, the index will run from 0 to the total length of the DataFrame minus one.

In [None]:
# Create a df
sales = {
   'region': ["EUROPA", "EUROPA", "EUROPA", 
              "USA", "USA", "USA", "LATAM", "LATAM"],
   'sales':[153752, 168742, 162587, 256198, 285743, 290371, 145638, 151678],
   'year':[2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019]
}
df = pd.DataFrame(sales)
df

To observe the index of a dataframe we can use its attribute <code>**index**</code>

Depending on how our data is structured, we can create multi-indices from the DataFrame's own columns. For example, the region and year columns contain repeated elements, which lets us build a multi-index with one entry per region and year of sales.

With the function <code>set_index()</code> we can establish a new index.

In view of the new DataFrame, we can see that the region and year columns now act as an index, so if we query the first row...

We obtain the sales volume, with the region (EUROPA) and the year (2018) as its index labels.

Similarly, we can pass a list as an index to a DataFrame. Obviously, this list must be the same length as the DataFrame.

With the <code>.loc</code> indexer we can look up elements by their index labels.
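
The steps above can be sketched as follows, rebuilding the sales DataFrame so that the example is self-contained (`df_multi` and `first` are illustrative names):

```python
import pandas as pd

sales = {
    'region': ['EUROPA', 'EUROPA', 'EUROPA', 'USA', 'USA', 'USA', 'LATAM', 'LATAM'],
    'sales': [153752, 168742, 162587, 256198, 285743, 290371, 145638, 151678],
    'year': [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019],
}
df = pd.DataFrame(sales)

index_before = df.index   # default index running from 0 to 7

# region and year stop being columns and become a two-level index
df_multi = df.set_index(['region', 'year'])

# The first row is now labelled by the pair (region, year)
first = df_multi.iloc[0]
```

`set_index` returns a new DataFrame, so `df` itself keeps its default index.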
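
A minimal sketch of both ideas, assuming single-letter labels invented for illustration. The multi-index is sorted first, which keeps label lookups straightforward.

```python
import pandas as pd

sales = {
    'region': ['EUROPA', 'EUROPA', 'EUROPA', 'USA', 'USA', 'USA', 'LATAM', 'LATAM'],
    'sales': [153752, 168742, 162587, 256198, 285743, 290371, 145638, 151678],
    'year': [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019],
}

# A list used as index: one label per row (same length as the data)
labels = list('abcdefgh')
df_labeled = pd.DataFrame(sales, index=labels)
row_c = df_labeled.loc['c']              # row whose label is 'c'

# Label lookups on a multi-index built from region and year
df_multi = pd.DataFrame(sales).set_index(['region', 'year']).sort_index()
usa = df_multi.loc['USA']                # all years for one region
usa_2019 = df_multi.loc[('USA', 2019)]   # one region and one year
```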

### Some basic functions of a dataframe

Next, we are going to see a list of some of the main functions that we can apply in a dataframe to operate on its columns:

* Get the first rows of a dataframe with __<code>.head</code>__
* Get the last rows of a dataframe with __<code>.tail</code>__
* Get the unique elements of a dataframe column with __<code>.unique</code>__
* Get the frequency of the unique values of a column of a dataframe with __<code>.value_counts</code>__
* Obtain a statistical summary of the dataset with __<code>.describe</code>__
* Get the mean of a column with __<code>.mean</code>__
* Get a copy of a dataframe with __<code>.copy</code>__
* View the column names of a dataframe with __<code>.columns</code>__
* Get the correlation between all numeric variables with __<code>.corr</code>__
* Remove duplicates from a dataframe with __<code>.drop_duplicates</code>__
* Especially for Big Data sets, see the RAM consumption of our dataset with __<code>.memory_usage</code>__
* See a graphical summary of all the numeric variables in the dataframe (scatter plots between pairs of variables, with a histogram or density plot of each variable on the diagonal) with __<code>pd.plotting.scatter_matrix</code>__
* Obtain a histogram of each numerical variable of the dataset with __<code>.hist</code>__

### First rows of a dataframe

If we don't pass anything to .head(), the first 5 rows are shown by default; we can also specify the number of rows to show.

### Last Rows of a Dataframe

The same happens with __tail__: if we do not pass the number of rows to display as a parameter, it will return the last 5.
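
Both functions can be sketched on the eight-row sales data as follows:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['EUROPA', 'EUROPA', 'EUROPA', 'USA', 'USA', 'USA', 'LATAM', 'LATAM'],
    'sales': [153752, 168742, 162587, 256198, 285743, 290371, 145638, 151678],
})

default_head = df.head()    # first 5 rows
short_head = df.head(3)     # first 3 rows
default_tail = df.tail()    # last 5 rows
short_tail = df.tail(2)     # last 2 rows
```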

### Unique values by column

As we can see, this function is most useful for categorical variables since, in general, numerical variables have too many different values.
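
A minimal sketch on the categorical column of the sales data, including the related frequency count:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['EUROPA', 'EUROPA', 'EUROPA', 'USA', 'USA', 'USA', 'LATAM', 'LATAM'],
    'sales': [153752, 168742, 162587, 256198, 285743, 290371, 145638, 151678],
})

regions = df['region'].unique()        # distinct values, in order of appearance
counts = df['region'].value_counts()   # how many rows each value has
```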

### Summary with statistics

It is interesting to note that, by default, the statistical summary discards any variable that is not numerical.
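
This behaviour can be checked with a sketch like the following, where `include='all'` is used to bring the non-numeric column back into the summary:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['EUROPA', 'EUROPA', 'EUROPA', 'USA', 'USA', 'USA', 'LATAM', 'LATAM'],
    'sales': [153752, 168742, 162587, 256198, 285743, 290371, 145638, 151678],
    'year': [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019],
})

summary = df.describe()                     # numeric columns only
full_summary = df.describe(include='all')   # every column
```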

### Mean of a column
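
A minimal sketch, assuming the sales column of the example data:

```python
import pandas as pd

df = pd.DataFrame({
    'sales': [153752, 168742, 162587, 256198, 285743, 290371, 145638, 151678],
})

# mean() on a single column returns one number
mean_sales = df['sales'].mean()
```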

### Copy a dataframe

As with lists and arrays, assigning a DataFrame to a new variable does not create a copy: both names refer to the same object in memory. It is easy to make the mistake of assigning a complete DataFrame to a new variable and later modifying this new variable, thinking that we are only changing a copy. That is not the case: the changes affect both, because they are the same DataFrame. Let's see an example of how NOT to copy a DataFrame.

Changes made in bad_copy are reflected in the original data.

Therefore, we have modified both DataFrames. To make modifications only on the copy, we have to use the method <code>.copy</code>
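
The difference can be sketched as follows, assuming a tiny DataFrame invented for illustration. `bad_copy` follows the name used above, while `original` and `good_copy` are hypothetical names.

```python
import pandas as pd

original = pd.DataFrame({'sales': [1, 2, 3]})

# Plain assignment: bad_copy is just another name for the same object
bad_copy = original
bad_copy.loc[0, 'sales'] = 999      # this also changes original

# .copy() creates an independent DataFrame
good_copy = original.copy()
good_copy.loc[1, 'sales'] = -1      # original is not affected
```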

### Correlation matrix

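Only numeric columns are taken into account. In recent pandas versions this has to be requested explicitly (for example with <code>numeric_only=True</code>), since non-numeric columns otherwise raise an error.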
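
A minimal sketch that avoids depending on the pandas version by selecting the numeric columns before computing the correlation. This is one possible approach, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['EUROPA', 'EUROPA', 'EUROPA', 'USA', 'USA', 'USA', 'LATAM', 'LATAM'],
    'sales': [153752, 168742, 162587, 256198, 285743, 290371, 145638, 151678],
    'year': [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019],
})

# Keep only numeric columns, then compute pairwise correlations
corr = df.select_dtypes(include='number').corr()
```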

### Names of columns

The columns attribute is an Index object that works much like a list.
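
A short sketch of this list-like behaviour, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({'region': ['EUROPA'], 'sales': [153752], 'year': [2018]})

names = df.columns                 # an Index object
as_list = list(df.columns)         # converted to a plain Python list
first_name = df.columns[0]         # positional access, as with a list
has_sales = 'sales' in df.columns  # membership test
n_columns = len(df.columns)
```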

### Managing Duplicates
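
A minimal sketch, assuming a small DataFrame with one repeated row invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['EUROPA', 'USA', 'USA'],
    'year': [2018, 2018, 2018],
})

is_dup = df.duplicated()        # True for rows that repeat an earlier row
deduped = df.drop_duplicates()  # keeps the first occurrence of each row

# Duplicates can also be judged on a subset of columns
by_region = df.drop_duplicates(subset=['region'], keep='last')
```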

### RAM usage of a Dataframe
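
A minimal sketch on the sales data. `deep=True` asks pandas to also measure the contents of object (string) columns.

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['EUROPA', 'EUROPA', 'EUROPA', 'USA', 'USA', 'USA', 'LATAM', 'LATAM'],
    'sales': [153752, 168742, 162587, 256198, 285743, 290371, 145638, 151678],
    'year': [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019],
})

mem = df.memory_usage(deep=True)   # bytes used by the index and each column
total_bytes = mem.sum()            # total for the whole DataFrame
```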

### Scatter Matrix

### Histograms of numeric columns

### Changing the type of variables

With the `info` method we can obtain the type of each variable.

As we can see, all the variables are of type float except ocean_proximity, which is of type object even though it is really a categorical variable. One of the most important transformations that we can perform on the variables of a DataFrame is to change their type, for example to a categorical or date type. We are going to see how to change a variable of a DataFrame to a categorical type.

One of the possible ways to change the type of a variable is through `pd.Categorical`

Show the info of the df again.

Now we can include the category type in the summary.
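
The conversion can be sketched as follows. The three-row DataFrame and its values are invented for illustration and only borrow the `ocean_proximity` column name from the dataset described above.

```python
import pandas as pd

df = pd.DataFrame({
    'median_income': [8.3, 7.2, 5.6],                      # hypothetical values
    'ocean_proximity': ['NEAR BAY', 'INLAND', 'NEAR BAY'],  # hypothetical values
})

# Convert the text column to a categorical type
df['ocean_proximity'] = pd.Categorical(df['ocean_proximity'])

df.info()   # the column is now reported with dtype category

# The statistical summary can now be restricted to categorical columns
cat_summary = df.describe(include=['category'])
```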

### GROUPING DATAFRAMES - GROUP BY

As in SQL, in Python we can also group our data, using the function __<code>groupby</code>__
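
A minimal sketch on the sales data, roughly equivalent to a SQL `GROUP BY region` with `SUM` and `AVG` aggregates:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['EUROPA', 'EUROPA', 'EUROPA', 'USA', 'USA', 'USA', 'LATAM', 'LATAM'],
    'sales': [153752, 168742, 162587, 256198, 285743, 290371, 145638, 151678],
})

# Total sales per region
totals = df.groupby('region')['sales'].sum()

# Several aggregations at once
stats = df.groupby('region')['sales'].agg(['sum', 'mean'])
```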

### DATAFRAME MERGE

On some occasions, we will need to join two or more datasets. To do this, first of all, as with lists, we can use the function __<code>.append</code>__. Note that <code>DataFrame.append</code> was deprecated in pandas 1.4 and removed in pandas 2.0, where <code>pd.concat</code> takes its place.

Because of the number of rows and columns of the DataFrame, we are going to create two smaller DataFrames to demonstrate the append function clearly.

It is important to take into account new columns that appear in only one of the DataFrames that we are going to concatenate. To illustrate this, we are going to build the datasets again, giving one more column to the dataset with the last positions.

Second, we can make use of the __<code>pd.concat</code>__ function. It is very important to know the types of union ( _join_ ) that we can perform when concatenating datasets
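
Since `.append` is not available in pandas 2.0 and later, the sketch below stacks the rows with `pd.concat` instead. The two small DataFrames and the `households` values are invented for illustration.

```python
import pandas as pd

first_rows = pd.DataFrame({'region': ['EUROPA', 'USA'],
                           'sales': [153752, 256198]})

# The second dataset carries one extra column
last_rows = pd.DataFrame({'region': ['LATAM'],
                          'sales': [145638],
                          'households': [126]})   # hypothetical value

# Stack the rows; ignore_index renumbers the result from 0
stacked = pd.concat([first_rows, last_rows], ignore_index=True)
```

Rows coming from `first_rows` have no `households` value, so that column is filled with NaN for them.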
* __inner__: The union keeps only the elements common to both datasets (the shared columns, when stacking rows).
* __outer__: The union keeps all the elements of both datasets.
    
It is recommended to take a look at the documentation to see the different types of join that we can perform: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

![FB_IMG_1650437650199.jpg](attachment:FB_IMG_1650437650199.jpg)

As we can see, in the concatenation of type _inner_ the column _households_ of the new dataset does not appear, since it is not a common element of both datasets. With type _outer_, however, the column is kept, and the rows that come from the dataset lacking it take null values, as is the case of the _households_ column, which only appears in the new dataset.
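
The two join types can be compared with a sketch like the following, where the data, including the `households` values, is invented for illustration:

```python
import pandas as pd

base = pd.DataFrame({'population': [322, 2401],
                     'median_income': [8.3, 7.2]})
new = pd.DataFrame({'population': [496],
                    'median_income': [5.6],
                    'households': [177]})

# inner: only the columns present in both datasets survive
inner = pd.concat([base, new], join='inner', ignore_index=True)

# outer (the default): every column survives, with gaps filled by NaN
outer = pd.concat([base, new], join='outer', ignore_index=True)
```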

### NEW COLUMNS

On many occasions, we will need to add new columns to a DataFrame or combine existing columns to obtain a new one. It is important to know that operations are performed column-wise; that is, if two columns have the same length, we can perform an operation between them without needing to iterate over their rows.

To add a new column we simply assign to the DataFrame, indexed with the name of the column that we are going to create.
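
A minimal sketch, assuming two hypothetical columns `price` and `units`:

```python
import pandas as pd

df = pd.DataFrame({'price': [2.0, 3.5, 4.0], 'units': [10, 4, 5]})

# Element-wise product of two columns, with no explicit loop over rows
df['revenue'] = df['price'] * df['units']

# A column can also be built from a single column and a constant
df['price_with_tax'] = df['price'] * 2
```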

### DELETE ROWS AND COLUMNS

On some occasions, when we perform data cleaning or because they are not the object of our analysis, we will need to delete rows or columns from a DataFrame. For both cases the function is the same, __<code>.drop</code>__: to delete rows we use the value 0 in the parameter __axis__, and to delete columns the value 1.
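
Both uses can be sketched as follows on a small example DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['EUROPA', 'USA', 'LATAM'],
    'sales': [153752, 256198, 145638],
    'year': [2018, 2018, 2018],
})

without_rows = df.drop([0, 1], axis=0)   # delete the rows labelled 0 and 1
without_col = df.drop('year', axis=1)    # delete the column named year
```

`drop` returns a new DataFrame, so `df` keeps all its rows and columns unless the result is assigned back to it.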

### NULL MANAGEMENT

Finally, as usual when working with DataFrames, _missing values_ or simply null values can appear, represented in pandas by the special value __NaN__. Through the __<code>.isnull</code>__ function we can know whether each element of a DataFrame is null. A very common practice is to obtain the number of null values per column in a DataFrame and their percentage.

If what we want is to replace the null values and not delete them, we will use the function __<code>.fillna</code>__

On the contrary, if what we want is to delete the null values of a DataFrame we can use the function __<code>.dropna</code>__
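
The three operations can be sketched as follows, assuming a small DataFrame with gaps invented for illustration (the fill values are arbitrary choices):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', None, 'z', 'w'],
})

null_counts = df.isnull().sum()        # number of nulls per column
null_pct = df.isnull().mean() * 100    # percentage of nulls per column

# Replace nulls, with a different fill value for each column
filled = df.fillna({'a': 0.0, 'b': 'unknown'})

# Delete every row that contains at least one null
dropped = df.dropna()
```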

### Writing dataframe as csv

Just as we can load data from different data sources and process it as a DataFrame, we can also write a DataFrame to one of the multiple formats that pandas can export. This time, we will dump the information of a DataFrame as a .csv file, for which we have the <code>**to_csv**</code> function. As parameters we will use the name of the file, the __sep__ argument to choose the separator and, if we do not want the DataFrame index to be written, the __index__ argument with the value _False_.
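
A minimal sketch of these parameters. To avoid writing to disk, no file name is passed here, in which case `to_csv` returns the CSV content as a string. The file name in the comment is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'region': ['EUROPA', 'USA'], 'sales': [153752, 256198]})

# With a file name, the data would be written to disk, for example:
# df.to_csv('sales.csv', sep=';', index=False)

# Without a file name, the CSV text is returned instead
csv_text = df.to_csv(sep=';', index=False)
lines = csv_text.splitlines()
```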