# Managing data with Pandas.

## (Part 1)

__*Pandas*__ is a powerful Python library that helps us query (extract), summarize (aggregate), transform, and visualize *tabular data*. It is one of the most popular open source libraries for data exploration currently available. Tabular refers to data that is two-dimensional, consisting of rows and columns. Commonly, we refer to this organized structure of data as a table.

__*Pandas*__ takes its name from 'panel data', a term for tabular data used in the financial world, shortened to pandas. The library can be easily imported and used in Jupyter notebooks, which are available in several IDEs such as VSCode, PyCharm, and many more.

There are many aspects that make __*Pandas*__ an attractive choice for data analysis and it continues to have one of the fastest growing user bases.

* It's a Python library and integrates well with the other popular data science libraries such as numpy, scikit-learn, statsmodels, matplotlib, and seaborn.
* It is nearly self-contained in that lots of functionality is built into one package. This contrasts with R, where many packages are needed to obtain the same functionality.
* The community is excellent. Looking at Stack Overflow, for example, there are many tens of thousands of pandas questions. If you need help, you are nearly guaranteed to find it quickly.

Even though Python already has data structures that hold sequences of values, they are not built for scientific computing. __Lists__, its primary data structure, can store objects of any type and are not optimized for tabular data analysis. Python lacks a built-in data structure that holds homogeneous data types for fast numerical computation. This data structure, generally referred to as an "array" in most languages, is provided by the third-party numpy library.

All of the data in pandas is stored in numpy arrays. That said, it isn't necessary to know much about numpy when learning pandas. You can think of pandas as a higher-level, easier to use interface for doing data analysis than numpy. It is a good idea to eventually learn numpy, but for most data analysis tasks, pandas will be the right tool.
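As a rough illustration of the contrast described above, the following sketch compares a Python list with a numpy array and then looks at the array behind a pandas Series. The values and variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# A Python list can mix objects of any type.
mixed = [1, "two", 3.0]

# A numpy array stores a single, homogeneous data type.
arr = np.array([1.5, 2.5, 3.0])
print(arr.dtype)

# Arithmetic is applied to every element at once.
doubled = arr * 2
print(doubled)

# A pandas Series built from the array hands its data back as a numpy array.
s = pd.Series(arr)
print(type(s.to_numpy()))
```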

### Content
In the first part of this introduction to pandas we are covering the following content:

- Installing Pandas.
- Recommended method to work with data.
- Reading the data.
- Data types and Missing Values.
- Working with indexes.

## 1.0. Installing Pandas

In order to use __*pandas*__, it first needs to be installed on the computer. To do so, we run the following command in the command line:

> *pip install pandas* 

Once installed, the pandas version can be verified by accessing the special attribute `__version__`.

> `import pandas as pd`

> `pd.__version__`

## 1.1. Recommended method to work with data

### A five-step logical process

It is recommended to approach the data through small snippets of code in each cell, especially when you are a beginner, because major issues may arise when too many lines of code are written in a single cell of a notebook. Getting feedback from every line of code written is, in fact, a nice way to verify your code. Therefore, a five-step process is highly suggested to help increase your ability to do data exploration:

- Write and execute a single line of code at a time.
- Verify that this line of code works by inspecting the output.
- Assign the result to a variable.
- Within the same cell, in a second line, output the head of the object obtained.
- Do not continue to add more lines of code, continue to the next cell instead. 

Let's see an example of the aforementioned process by reading data from a CSV file. Do not worry too much about the meaning of the code; it will be explained in the following section:

> *`Step 1:` Write and execute a single line of code at a time.*

In [46]:
pd.read_csv(r"C:\Users\jober\OneDrive\Desktop\Data Science\Data_used\winequality\winequality-white.csv", sep=";")

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,6.3,0.30,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
...,...,...,...,...,...,...,...,...,...,...,...,...
4896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7
4897,6.0,0.21,0.38,0.8,0.020,22.0,98.0,0.98941,3.26,0.32,11.8,6


> *`Step 2:` Verify that this line of code works by inspecting the output.*

Looking above, the output seems to be correct; hence, the DataFrame is read properly.

> *`Step 3:` Assign the result to a variable.*

In [47]:
# This is generally done in the same cell. Here, it is assigned to a variable "test":

test = pd.read_csv(r"C:\Users\jober\OneDrive\Desktop\Data Science\Data_used\winequality\winequality-white.csv", sep=";")

> *`Step 4:` Within the same cell, in a second line, output the head of the object obtained.*

In [48]:
# This can be done in the same cell; it is done separately here just to illustrate the step:

test.head(2)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6


> *`Step 5:` Do not continue to add more lines of code, continue to the next cell instead.*

`NOTE:` Merging many lines of code into a single cell is discouraged, because keeping them separate makes it easier to track your work from one step to the next. Most lines of code in a notebook apply operations to the data, so it is vital that you can see exactly what they are doing.

There are some additional notes, such as:
- __Only assign output data to variables when needed.__ This will help you to check the results when it's convenient.
- __Be strategic when creating variables.__ Creating new variables allows you to save previous work and to verify results when an error arises.
- __Reusing variables is appropriate.__ When you are sure about the result obtained, reusing the variable will reduce memory consumption during data handling.
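The process can also be tried without access to the wine file. The sketch below is only an illustration: it assumes a tiny semicolon-separated excerpt held in memory (the `csv_text` string and the `sample` variable are hypothetical) and reads it with `io.StringIO` in place of a file path.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt in the same semicolon-separated layout.
csv_text = "fixed acidity;pH;quality\n7.0;3.0;6\n6.3;3.3;6\n"

# Steps 1 to 3: read the data, inspect the output, then assign it to a variable.
sample = pd.read_csv(io.StringIO(csv_text), sep=";")

# Step 4: output the head of the object in the same cell.
print(sample.head(2))
```

In a notebook, the last line would simply be `sample.head(2)`, since the final expression of a cell is displayed automatically.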

## 1.2. Reading data

### DataFrames and Series

*pandas* can handle numerous data formats, such as JSON, Parquet, CSV, and many others, and convert them into DataFrames. A __DataFrame__ is a two-dimensional data structure that looks like any other rectangular table of data and is composed of three separate components - the columns, the index, and the data.

The columns provide a label for each column and are always displayed in bold font above the data. Each column represents a single vertical sequence of data. In data analysis, columns represent different variables.


To access the pandas resources, we need to import pandas into our namespace. By convention, pandas is imported and aliased to the name *pd*. The DataFrame called `wine2`, which has 12 columns, each one labeled at the top of the object, is read and displayed as follows:

In [49]:
import pandas as pd
wine2 = pd.read_csv(r"C:\Users\jober\OneDrive\Desktop\Data Science\Data_used\winequality\winequality-white.csv", sep=";")
wine2.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.0010,3.00,0.45,8.8,6
1,6.3,0.30,0.34,1.6,0.049,14.0,132.0,0.9940,3.30,0.49,9.5,6
...,...,...,...,...,...,...,...,...,...,...,...,...
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6


In [50]:
# The column labels can be extracted from the DataFrame with the columns attribute:
col = wine2.columns

col 

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

The index provides a label for each row and is always displayed to the left of the data. By default, it is simply a sequence of integers beginning at 0. A row is a single horizontal sequence of data. In data analysis, every row represents a record or observation of the phenomenon under study.

A __Series__ is a single dimension of data. It is analogous to a single column of data or a one-dimensional array. A Series can be selected from a DataFrame by appending a set of brackets [ ] containing the column name to the end of the DataFrame variable name.

In [51]:
chlorides = wine2["chlorides"]
# While the .head() method shows the first 5 rows of the object, the .tail() method shows the last 5 rows.
chlorides.tail()

4893    0.039
4894    0.047
        ...  
4896    0.022
4897    0.020
Name: chlorides, Length: 5, dtype: float64

A Series contains only a single dimension of data, but it has two components, the *index* and the *data*. A Series, however, has no rows and columns. In appearance, it resembles a one-column DataFrame, but it technically has no columns. It just has a sequence of values that are labeled by an index. The index serves as labels for the values, and here each label references a single value. In the above example, the index label *4895* corresponds to the value *0.041*.

Below the Series display, you will see a few other items printed to the screen - the __name__, __length__, and __dtype__. Those items are __NOT__ part of the Series data and are just extra pieces of information to help you understand the Series. The length is the number of values in the Series, and dtype is the data type of the Series.

Series indexes and DataFrame indexes are virtually identical, so the same rules apply to them.
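To make the two components of a Series concrete, here is a minimal sketch that builds a three-value Series by hand, reusing labels and values quoted in this section. The name `chlorides_sample` is hypothetical.

```python
import pandas as pd

# Three values labeled by an integer index, as in the chlorides example.
chlorides_sample = pd.Series(
    [0.039, 0.047, 0.041],
    index=[4893, 4894, 4895],
    name="chlorides",
)

print(chlorides_sample.index)    # the labels
print(chlorides_sample.values)   # the data
print(chlorides_sample[4895])    # the value referenced by label 4895
```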

### Types of Objects

The object type can be verified with the *type()* function, whose output displays the __fully-qualified name__ of the type. The fully-qualified name always includes the package and module name where the type was defined. Let's see an example:

In [52]:
type(wine2)

pandas.core.frame.DataFrame

The package name is the first part of the fully-qualified name and, in this case, is pandas. Only the word after the last dot is the name of the type and, in this case, we have verified that the *wine2* variable has type *DataFrame*. The module name is the word immediately preceding the name of the type. Here, it is frame.

In [53]:
# we can verify the type of the series as:
type(chlorides)

pandas.core.series.Series
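The parts of a fully-qualified name can also be read separately. The sketch below uses the standard Python attributes `__module__` and `__name__` on a hypothetical small DataFrame called `frame`; the exact module path printed may differ between pandas versions.

```python
import pandas as pd

# A small, made-up DataFrame just to have an object to inspect.
frame = pd.DataFrame({"quality": [6, 6, 7]})
frame_type = type(frame)

print(frame_type.__module__)   # package and module where the type is defined
print(frame_type.__name__)     # name of the type itself
```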

### Changing display options.

Sometimes DataFrames have a large number of columns, and showing all of them can be overwhelming. By default, pandas shows at most 20 columns; when there are more, only the first 10 and last 10 columns are shown on the screen. The current value of a display option can be read with the `get_option` function, which is accessed directly from pandas. Here are three examples:

In [54]:
pd.get_option('display.max_columns')

20

In [55]:
pd.get_option('display.max_rows')

4

In [56]:
pd.get_option('display.max_colwidth')

50

We can modify the display options using the `set_option` function to change an option value. Several options can be set at one time by passing option names and values in pairs. The example below shows how to set the maximum number of columns to 20 and the maximum number of rows to 60:

In [62]:
pd.set_option('display.max_columns', 20,'display.max_rows', 60)

In [58]:
pd.get_option('display.max_columns')

20
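As a small round-trip sketch, an option can be changed, read back, and then returned to its default. Note that `reset_option` is not used elsewhere in this section; it is a pandas function that restores the default value of an option, and the value 10 is an arbitrary choice for the example.

```python
import pandas as pd

# Change a single option and read it back.
pd.set_option("display.max_rows", 10)
changed_rows = pd.get_option("display.max_rows")
print(changed_rows)

# Restore the library default for that option.
pd.reset_option("display.max_rows")
default_rows = pd.get_option("display.max_rows")
print(default_rows)
```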

## 1.3. Data types and Missing Values

Knowing the data type of each column is very important. The main reason for this is that every value in each column will be of the same type.

There are many different data types that can be stored in DataFrames and Series, and knowing them is one of the most important pieces of information we can have about our data because pandas stores its data such that each column is exactly one data type. The following are the data types that appear most frequently in DataFrames:

- __Boolean:__ Only two possible values: *True* or *False*.
- __Integer:__ Whole numbers without decimals.
- __Float:__ Numbers with decimals.
- __Object:__ Almost always a string, but can technically contain any Python object.
- __Datetime:__ Specific date and time with nanosecond precision.

To get the data type of each column in a DataFrame or a Series, we access the *dtypes* attribute of the object of interest. Let's see an example:

In [64]:
bikes = pd.read_csv(r"C:\Users\jober\OneDrive\Desktop\Data Science\Data_used\bikes.csv")
bikes.dtypes

gender                object
starttime             object
stoptime              object
tripduration           int64
from_station_name     object
start_capacity       float64
to_station_name       object
end_capacity         float64
temperature          float64
wind_speed           float64
events                object
dtype: object

As shown, the __*dtypes*__ attribute returns a Series with the column names as the index and the data types as the values. Columns of strings have the `object` data type. The __*info()*__ method not only gives the data type of each column but also provides additional information such as:
- The type of object (always a DataFrame)
- The type of index, the number of rows, and the number of columns
- The data type of each column and the number of non-missing values (a.k.a. non-null)
- The frequency count of all data types, and 
- The total memory usage. 

In [65]:
bikes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50089 entries, 0 to 50088
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             50089 non-null  object 
 1   starttime          50089 non-null  object 
 2   stoptime           50089 non-null  object 
 3   tripduration       50089 non-null  int64  
 4   from_station_name  50089 non-null  object 
 5   start_capacity     50083 non-null  float64
 6   to_station_name    50089 non-null  object 
 7   end_capacity       50077 non-null  float64
 8   temperature        50089 non-null  float64
 9   wind_speed         50089 non-null  float64
 10  events             50089 non-null  object 
dtypes: float64(4), int64(1), object(6)
memory usage: 4.2+ MB
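The common data types listed earlier can be seen side by side in a small DataFrame built by hand. This is only a sketch: the `trips` DataFrame and its `is_member` column are hypothetical, and the other values are borrowed from the bikes data for illustration.

```python
import pandas as pd

# One column per common data type.
trips = pd.DataFrame({
    "is_member": [True, False],         # boolean
    "tripduration": [993, 623],         # integer
    "temperature": [73.9, 69.1],        # float
    "gender": ["Male", "Male"],         # strings
    "starttime": pd.to_datetime(
        ["2013-06-28 19:01:00", "2013-06-28 22:53:00"]
    ),                                  # datetime
})

print(trips.dtypes)
```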


In [67]:
# Let's check the DataFrame content:
bikes.head(2)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy


From the visual display of the bikes DataFrame above, it appears that both the *starttime* and *stoptime* columns are datetimes. Unfortunately, the *read_csv* function does not automatically read these columns as datetimes. The `parse_dates` parameter of read_csv can be given the column names so that they are read as datetimes; otherwise they will be read in as strings. Let's reread the data with this modification:

In [68]:
bikes = pd.read_csv(r"C:\Users\jober\OneDrive\Desktop\Data Science\Data_used\bikes.csv", parse_dates=['starttime','stoptime'])
bikes.dtypes.head()

gender                       object
starttime            datetime64[ns]
stoptime             datetime64[ns]
tripduration                  int64
from_station_name            object
dtype: object
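The effect of `parse_dates` can be reproduced without the bikes file. The following sketch assumes a two-row excerpt held in memory (`csv_text` is a hypothetical stand-in for the file) and compares the column types with and without the parameter.

```python
import io
import pandas as pd

# Two rows from the bikes data, reduced to three columns.
csv_text = (
    "starttime,stoptime,tripduration\n"
    "2013-06-28 19:01:00,2013-06-28 19:17:00,993\n"
    "2013-06-28 22:53:00,2013-06-28 23:03:00,623\n"
)

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["starttime", "stoptime"])

print(raw.dtypes)
print(parsed.dtypes)

# With real datetimes, subtraction gives the elapsed time of each trip.
elapsed = parsed["stoptime"] - parsed["starttime"]
print(elapsed)
```

One practical reason to parse the columns is shown in the last lines: arithmetic between datetime columns works directly, which is not possible while they are stored as strings.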

### Getting more information (The metadata)

The __Metadata__ can be understood as the data about the data. The data type of each column and the number of rows and columns are examples of metadata.

The size of a DataFrame (the total number of values in it, that is, the number of rows multiplied by the number of columns) can be obtained with the `size` attribute:

In [69]:
bikes.size

550979

The attribute `shape` returns a tuple of integers representing the number of rows and columns (in that order) of the DataFrame. Let's see it in action:

In [70]:
bikes.shape

(50089, 11)

The tuple obtained means that the DataFrame *bikes* contains 50,089 rows and 11 columns. In addition, if only one of the elements is of interest, we can select it with an integer position as follows:

In [74]:
bikes.shape[0]

50089

In [75]:
bikes.shape[1]

11
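The relationship between `shape` and `size` can be checked on any DataFrame. The sketch below uses a small, hypothetical DataFrame named `small`; it also uses Python's built-in `len()`, which returns the number of rows.

```python
import pandas as pd

small = pd.DataFrame({"age": [30, 2, 12], "height": [165, 70, 120]})

# Unpack the shape tuple into rows and columns.
n_rows, n_cols = small.shape
print(n_rows, n_cols)

# size is the number of rows multiplied by the number of columns.
print(small.size == n_rows * n_cols)

# len() returns the number of rows.
print(len(small))
```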

### Missing Values 

Datasets often have missing values and need some representation to identify them. Pandas uses the object __*Not A Number - NaN -*__ to represent missing values in float and object types, and __*Not A Time - NaT -*__ to represent missing values in datetime types. The default boolean and integer data types have no representation for missing values. When missing values are placed in an integer column, pandas converts the entire column to floats. A boolean column is likewise converted to another type, such as floats (False becomes 0 and True becomes 1) or object, depending on the operation and the pandas version.
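A brief sketch of this behavior, using made-up values: the same numbers are stored as integers when complete and as floats once one of them is missing, and a missing datetime appears as NaT. The `isna` method used for counting is not introduced elsewhere in this section.

```python
import numpy as np
import pandas as pd

# A complete column of whole numbers is stored as integers.
whole = pd.Series([11, 31, 15])

# With a missing entry, the column is stored as floats instead.
with_gap = pd.Series([11, np.nan, 15])

print(whole.dtype, with_gap.dtype)
print(with_gap.isna().sum())   # number of missing values

# In datetime data, the missing value is shown as NaT.
times = pd.to_datetime(pd.Series(["2013-06-28 19:01:00", None]))
print(times)
```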

## 1.4. Working with indexes

The index of a DataFrame provides a label for each of the rows. If not explicitly provided, pandas automatically creates and assigns a sequence of consecutive integers beginning at 0 as the index. A `RangeIndex` type is the simplest index and represents that sequence of integers starting at 0.

A numpy array underlies the index. The attribute `values` is used to retrieve the index values:
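The default index can be observed on a DataFrame created without one. This is a minimal sketch with a hypothetical `people` DataFrame.

```python
import pandas as pd

# No index is given, so pandas creates one automatically.
people = pd.DataFrame({"name": ["Jane", "Niko", "Aaron"], "age": [30, 2, 12]})

print(people.index)         # the default index
print(list(people.index))   # the labels it represents
```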

In [90]:
df = pd.read_csv(r"C:\Users\jober\OneDrive\Desktop\Data Science\Data_used\sample_data.csv")
df

Unnamed: 0,name,state,color,food,age,height,score
0,Jane,NY,blue,Steak,30,165,4.6
1,Niko,TX,green,Lamb,2,70,8.3
2,Aaron,FL,red,Mango,12,120,9.0
3,Penelope,AL,white,Apple,4,80,3.3
4,Dean,AK,gray,Cheese,32,180,1.8
5,Christina,TX,black,Melon,33,172,9.5
6,Cornelia,TX,red,Beans,69,150,2.2


In [91]:
# Let's get the index values as:
df.index.values

array([0, 1, 2, 3, 4, 5, 6], dtype=int64)

The index is a complex object on its own and has many attributes and methods. Knowing how to work with it and select from it is the minimum we must know as Data Scientists. We can choose values just as we do with a Python list: by placing the integer location of the item we want within square brackets, by selecting a range of values using slice notation, or by passing a list of integers.

In [93]:
# Using the integer notation, we get:
df.index.values[4]

4

In [96]:
# Using the slice notation with its start, stop, and step components, we get the desired index values:
df.index.values[0:5:2]

array([0, 2, 4], dtype=int64)

In [97]:
# Using the list of integers notation, we get:
desired_indexes = [0, 5, 6]
df.index.values[desired_indexes]

array([0, 5, 6], dtype=int64)

### Setting an Index of a DataFrame

We can modify the index of a DataFrame by calling the `set_index` method to use one of the existing columns as the index. However, the column set as the index will no longer be one of the columns of the returned DataFrame. The following example shows how to change the index and the resulting impact on the data:

In [82]:
# The shape of the original DataFrame (df) is:
df.shape

(7, 7)

We modify the index by using the *set_index* method, obtaining a new DataFrame (df2) that is one column smaller than the original DataFrame (df). The set_index method returns a copy and leaves the original DataFrame unchanged, so the result must be assigned to a variable to keep it. Saving it into a new variable, rather than overwriting the original, is good practice but not mandatory.

In [84]:
df2 = df.set_index("name")
df2

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


In [85]:
# The shape of the new (returned) dataframe is:
df2.shape

(7, 6)
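The same behavior can be checked on a DataFrame built in memory. The sketch below uses a three-row excerpt of the sample data; the names `people` and `by_name` are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Jane", "Niko", "Aaron"],
    "state": ["NY", "TX", "FL"],
    "age": [30, 2, 12],
})

# set_index returns a new DataFrame; the original is left as it was.
by_name = people.set_index("name")

print(people.shape)         # original shape
print(by_name.shape)        # one column fewer
print(by_name.index.name)   # the index keeps the column's name
```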

### Accessing the index, columns, and data.

The index, columns, and data are each separate objects that can be accessed from the DataFrame as attributes and NOT methods. The attributes `index`, `columns`, and `values` are used to extract that information.

In [86]:
Index = df2.index
Index

Index(['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'], dtype='object', name='name')

In [87]:
Labels = df2.columns
Labels

Index(['state', 'color', 'food', 'age', 'height', 'score'], dtype='object')

In [88]:
data = df2.values
data

array([['NY', 'blue', 'Steak', 30, 165, 4.6],
       ['TX', 'green', 'Lamb', 2, 70, 8.3],
       ['FL', 'red', 'Mango', 12, 120, 9.0],
       ['AL', 'white', 'Apple', 4, 80, 3.3],
       ['AK', 'gray', 'Cheese', 32, 180, 1.8],
       ['TX', 'black', 'Melon', 33, 172, 9.5],
       ['TX', 'red', 'Beans', 69, 150, 2.2]], dtype=object)

Accessing these components does not affect the DataFrame, which can still be recalled at any time from the variable *df2*. Both the index and the columns are a special type of object named __Index__, which is somewhat similar to a Python list. Nonetheless, we rarely need to operate on these components directly and will instead be working with the entire DataFrame.

It's important to notice that when we select a column from a DataFrame as a Series, the index remains the same:

In [89]:
color = df2['color']
color

name
Jane          blue
Niko         green
Aaron          red
Penelope     white
Dean          gray
Christina    black
Cornelia       red
Name: color, dtype: object

Only the `index` and `values` attributes are valid for a Series, and they operate exactly as they do for DataFrames.
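A short sketch of these attributes on a Series, using a hand-built excerpt of the color column (the variable name `colors` is hypothetical):

```python
import pandas as pd

colors = pd.Series(
    ["blue", "green", "red"],
    index=["Jane", "Niko", "Aaron"],
    name="color",
)

print(colors.index)    # the labels
print(colors.values)   # the data

# A Series has no columns attribute.
print(hasattr(colors, "columns"))
```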

### Choosing a good index.

Modifying the index is sometimes useful, but the majority of the time it is not necessary. Data Scientists can complete all of their analysis tasks with just the default index. If a column is to be chosen as the index, the values it contains must be both __unique__ and __descriptive__. The *set_index* method can verify that all values used for the index are unique when the `verify_integrity` parameter is set to __True__. In the following example, the column "color" of the DataFrame *df* contains duplicated colors, so the call raises a `ValueError` indicating that the color "red" is duplicated.

In [98]:
df.set_index('color', verify_integrity=True)

ValueError: Index has duplicate keys: Index(['red'], dtype='object', name='color')
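In a script, the error can be caught so that execution continues. The following sketch assumes a three-row excerpt of the sample data (`people` is a hypothetical name) and also uses the `is_unique` attribute of a Series, which is not covered above, to run the same check beforehand.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Aaron", "Cornelia", "Jane"],
    "color": ["red", "red", "blue"],
})

# "name" holds unique values, so it is accepted as the index.
by_name = people.set_index("name", verify_integrity=True)

# "color" repeats "red", so pandas raises a ValueError.
try:
    people.set_index("color", verify_integrity=True)
    duplicate_found = False
except ValueError as err:
    duplicate_found = True
    print(err)

# The same check can be made before calling set_index.
print(people["color"].is_unique)
```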

## Summary
Managing data with Pandas is part art and part science. In this section we have introduced a logical process to follow to ensure reliable results in our analysis when using Pandas, how to extract basic information from our DataFrames, and the basic commands to handle rows and columns. This chapter sets the foundations for further analysis of our datasets.