# Pandas Library

The pandas library is used for working with data in the form of tables, time series, or other datasets. It can load and save data from different sources (CSV, XLSX, SQL databases, ...).

It provides highly optimized data structures for data analysis.

We will use it for data exploration and data modification (ETL process).

Pandas uses the NumPy library in the background to work with multidimensional arrays. We'll look at NumPy later.

Installation using pip: **pip install pandas**


In [None]:
import pandas as pd

## Loading data from CSV

In [None]:
salary = pd.read_csv ("..\\dataset\\salary_dataset.csv")

Data display

In [None]:
salary
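As a minimal sketch of what `read_csv` does, the example below parses a small CSV text held in memory instead of a file. The column names and values are invented for illustration and are not taken from salary_dataset.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content; not the real salary dataset.
csv_text = "name,salary\nAlice,52000\nBob,48000\n"

# read_csv accepts a file path or any file-like object.
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)   # (number of rows, number of columns)
print(demo.dtypes)  # column types are inferred from the text
```

The first line of the text is used as the header, so the column names come from the file itself.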

## Reading data from Excel
For this functionality you need to install the openpyxl library: **pip install openpyxl**

In [None]:
customers = pd.read_excel ("..\\dataset\\mall_customers.xlsx")

Displaying part of the data

In [None]:
customers

## Reading data from an SQLite database using an SQL query

In [None]:
import sqlite3
# sqlite3.connect returns a connection object (not a cursor).
conn = sqlite3.connect("..\\dataset\\database.db")
points = pd.read_sql_query("SELECT * FROM points", conn)
points
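The same call can be tried without the database file. The sketch below builds an in-memory SQLite database with a hypothetical `points` table (the columns and rows are assumptions, not the contents of database.db) and passes a value to the query through the `params` argument.

```python
import sqlite3

import pandas as pd

# In-memory database with invented rows; not the real database.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (ID INTEGER, NAME TEXT, POINTS REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [(1, "Ann", 7.5), (2, "Ben", 12.0), (3, "Eva", 15.0)],
)
conn.commit()

# The value in params replaces the ? placeholder in the query.
top = pd.read_sql_query(
    "SELECT * FROM points WHERE POINTS > ?", conn, params=(10,)
)
conn.close()
print(top)
```

Using a placeholder instead of building the SQL string by hand keeps the query safe when the value comes from user input.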

## Change data format and export
Sometimes it is useful to read structured data in one format and convert it to another.

For example, points was read from SQLite and we need to work with it as JSON.

In [None]:
points.to_json()

Or save them to a csv file.

In [None]:
points.to_csv("..\\dataset\\database.csv")

If we want to put them into XML, we need to have the lxml library installed: **pip install lxml**

In [None]:
points.to_xml()
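The export methods take options that control the layout of the output. The following sketch uses a small invented DataFrame to show two common ones: `orient="records"` for JSON and `index=False` for CSV.

```python
import json

import pandas as pd

# Invented data for illustration.
df = pd.DataFrame({"ID": [1, 2], "NAME": ["Ann", "Ben"]})

# orient="records" writes one JSON object per row.
json_text = df.to_json(orient="records")
rows = json.loads(json_text)

# Without a path, to_csv returns the CSV text;
# index=False leaves the row index out of the output.
csv_text = df.to_csv(index=False)

print(json_text)
print(csv_text)
```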

## Structure of read data
The pandas read functions return the data as a DataFrame.

DataFrame is the main object of pandas, it resembles an Excel table or SQL table.
* Each column can have a different data type.
* Each row has an index that can be customized.

A DataFrame consists of:
- Series - a column of data
- columns - the column labels, stored as an Index object; column names can be edited
- index - the row labels

In [None]:
type(points)

We can list the dataset structure using info.

In [None]:
points.info()
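The parts of a DataFrame can be inspected one by one. This is a small sketch on invented data; the custom row labels 10 and 20 are chosen only to show that the index does not have to start at 0.

```python
import pandas as pd

df = pd.DataFrame(
    {"NAME": ["Ann", "Ben"], "POINTS": [7.5, 12.0]},
    index=[10, 20],  # custom row labels
)

print(df.dtypes)         # one data type per column
print(list(df.columns))  # column labels
print(list(df.index))    # row labels
print(type(df["NAME"]))  # a single column is a Series
```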

### Series
The values of a Series are typically stored in NumPy arrays. We will look at NumPy later.

We can access the column by its name and it can be seen that it is of type Series.

In [None]:
points["NAME"]

It is also possible to select a subset of columns by passing a list of column names. The result is a DataFrame.

In [None]:
points[["NAME", "CATEGORY"]]

In [None]:
type(points["NAME"])
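The difference between single and double brackets is easy to check on a small example. The data below is invented; `to_numpy()` is used only to show the array behind a column.

```python
import pandas as pd

df = pd.DataFrame(
    {"NAME": ["Ann", "Ben"], "CATEGORY": [1, 2], "POINTS": [7.5, 12.0]}
)

one_column = df["NAME"]          # a column name gives a Series
one_column_frame = df[["NAME"]]  # a list of names gives a DataFrame
values = df["POINTS"].to_numpy() # the underlying NumPy array

print(type(one_column))
print(type(one_column_frame))
print(type(values))
```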

### Columns
columns is a DataFrame attribute that holds all column names as an Index object.

In [None]:
points.columns

In [None]:
type(points.columns)

Columns can be renamed by passing a dictionary that maps old names to new names.
* inplace = False (the default): the renamed DataFrame is returned and the original is left unchanged
* inplace = True: the columns are renamed directly in the existing DataFrame

In [None]:
points.rename(columns={"CATEGORY":"Category"})

In [None]:
points.columns

So if we want to rename a column permanently, we need to use inplace=True (or assign the returned DataFrame back to the variable).

In [None]:
points.rename(columns={"CATEGORY":"Category"}, inplace=True)

In [None]:
points.columns
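An alternative to `inplace=True` is to assign the result of `rename` back to a variable. The sketch below uses an invented DataFrame to show that the original stays unchanged until that assignment.

```python
import pandas as pd

df = pd.DataFrame({"CATEGORY": [1, 2], "POINTS": [7.5, 12.0]})

# rename returns a new DataFrame; df itself is not modified.
renamed = df.rename(columns={"CATEGORY": "Category"})
before = list(df.columns)

# Assigning the result back makes the change permanent.
df = df.rename(columns={"CATEGORY": "Category"})
after = list(df.columns)

print(before)
print(after)
```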

### Index
In pandas, Index is a basic object that represents an index of rows or columns in a DataFrame or Series.

Each DataFrame has a row index (df.index) and a column index (df.columns)

Both are instances of the pandas.core.indexes.base.Index class (or its subclasses, such as RangeIndex)

In [None]:
points.index

Data are indexed according to an internal index starting from 0.

In [None]:
points

If you don't want to use an internally generated index, but want to use a column as an index, this can be set.

In [None]:
points.set_index("ID", inplace=True)

In [None]:
# Rows are now indexed by the ID column, whose values start at 1.
points

In [None]:
points.index
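Setting an index can also be undone. In this sketch on invented data, `set_index` moves a column into the index and `reset_index` moves it back into a regular column.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3], "NAME": ["Ann", "Ben", "Eva"]})

# ID becomes the row label and is no longer a regular column.
indexed = df.set_index("ID")
name_of_2 = indexed.loc[2, "NAME"]

# reset_index turns the index back into a column.
restored = indexed.reset_index()

print(list(indexed.columns))
print(name_of_2)
print(list(restored.columns))
```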

Sometimes it is necessary to use a date column as an index. Then you can select rows by a specified time span (year, quarter).

In [None]:
points.set_index("DATE", inplace=True)

In [None]:
points

DATE is set as Index.

In [None]:
points.index

Time-based selection requires the index to be a DatetimeIndex, so we convert it with pd.to_datetime.

In [None]:
points.index = pd.to_datetime(points.index)

In [None]:
points.index

With a DatetimeIndex we can select records from a certain date range.

Slicing by a date range is only reliable on a sorted index, so sort_index() is called first.

In [None]:

points.sort_index().loc["2020-01-01" : "2020-12-31"]
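The whole sequence (convert to dates, set the index, sort, slice) can be sketched on a few invented rows. The dates are deliberately given out of order.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "DATE": ["2021-03-01", "2020-01-10", "2020-07-15"],
        "POINTS": [9.0, 7.5, 12.0],
    }
)

# Convert the text to timestamps, use them as the index and sort by date.
df["DATE"] = pd.to_datetime(df["DATE"])
df = df.set_index("DATE").sort_index()

# Both ends of a label slice are included.
year_2020 = df.loc["2020-01-01":"2020-12-31"]

# Partial dates work too: January through the end of June 2020.
first_half = df.loc["2020-01":"2020-06"]

print(year_2020)
print(first_half)
```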

## Data samples
If the dataset is very large, displaying all of it is impractical. Pandas has methods to display parts of the data.
* head - first records
* tail - last records
* sample - random records

In [None]:
points.head(5)

In [None]:
points.tail(3)

In [None]:
points.sample(5)
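Because `sample` picks rows at random, two runs normally give different rows. A possible way to make the selection repeatable is the `random_state` parameter, sketched here on invented data.

```python
import pandas as pd

df = pd.DataFrame({"POINTS": range(10)})

first = df.head(3)  # first three rows
last = df.tail(2)   # last two rows

# The same random_state gives the same random selection every time.
picked = df.sample(n=4, random_state=0)
again = df.sample(n=4, random_state=0)

print(picked)
```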

## Data access

### Iteration over columns
For a DataFrame, the items() method iterates over (column name, Series) pairs, much like items() on a dictionary iterates over (key, value) pairs.

In [None]:
for key, values in points.items():
    print (key, values)
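A short sketch on invented data shows what each pair contains: the first element is the column name and the second is the whole column as a Series.

```python
import pandas as pd

df = pd.DataFrame({"NAME": ["Ann", "Ben"], "POINTS": [7.5, 12.0]})

# Collect the length of every column, keyed by column name.
lengths = {}
for name, column in df.items():
    lengths[name] = len(column)
    print(name, type(column).__name__)

print(lengths)
```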

### Data selection by columns

In [None]:
points[["NAME", "POINTS"]]

### Row selection by index
The loc indexer selects rows by their index labels.

In [None]:
points.loc['2020-01-10']

In [None]:
points.loc['2021']

### Row selection by numeric index (order)
To select a row by its position in the dataset, use **iloc**.
* Positions are counted from 0
* Negative numbers count from the end
* A slice selects a range of rows; the end position is not included
* A slice with a step selects every n-th row

In [None]:
points.iloc[0]

In [None]:
points.iloc[-1]

In [None]:
points.iloc[2:4]

In [None]:
# every third row, starting with the first
points.iloc[::3]
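The rules for positions follow ordinary Python slicing. The sketch below applies them to an invented column of seven values so the selected rows are easy to verify.

```python
import pandas as pd

df = pd.DataFrame({"POINTS": [5, 6, 7, 8, 9, 10, 11]})

first_row = df.iloc[0]       # position 0
last_row = df.iloc[-1]       # counted from the end
middle = df.iloc[2:4]        # positions 2 and 3; 4 is not included
every_third = df.iloc[::3]   # positions 0, 3 and 6

print(middle)
print(every_third)
```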

### Combination of column and row selection
* For loc, row labels (index values) and column names are entered
* For iloc, integer positions of rows and columns are entered

In [None]:
points.loc["2021", "POINTS"]

In [None]:
# The pandas sum() method skips NaN values; the built-in sum() would return NaN.
points.loc["2021", "POINTS"].sum()

In [None]:
points.iloc[0, 0:3]
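The two indexers can reach the same cell in different ways. In this sketch the row labels "a", "b", "c" and all values are invented.

```python
import pandas as pd

df = pd.DataFrame(
    {"NAME": ["Ann", "Ben", "Eva"], "POINTS": [7.5, 12.0, 15.0]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "POINTS"]  # row label and column name
by_position = df.iloc[1, 1]       # row position and column position

# A label slice includes its end label, unlike a position slice.
block = df.loc["a":"b", ["NAME", "POINTS"]]

print(by_label, by_position)
print(block)
```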

### Selection by condition
The condition returns True and False values for the given indexes to indicate whether the record meets the condition.

In [None]:
points["Category"] == 2

The result of the condition is then used to select rows.

In [None]:
points[points["Category"] == 2]

The conditions may be more complex. For a logical OR expression, the | character is used.

In [None]:
points[(points["Category"] == 2) | (points["POINTS"] > 8)]

For the logical expression AND, & is used.

In [None]:
points[(points["Category"] == 2) & (points["POINTS"] > 8)]
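Two details of combined conditions are worth trying out: each comparison needs its own parentheses, and `~` negates a condition. The sketch below also uses `isin`, a pandas method for testing membership in a list of values. The data is invented.

```python
import pandas as pd

df = pd.DataFrame(
    {"Category": [1, 2, 2, 3], "POINTS": [9.0, 7.0, 10.0, 4.0]}
)

# & and | bind tighter than == and >, so each comparison is parenthesized.
mask = (df["Category"] == 2) & (df["POINTS"] > 8)
both = df[mask]

# ~ inverts a boolean mask (logical NOT).
not_category_2 = df[~(df["Category"] == 2)]

# isin tests whether each value is in the given list.
in_set = df[df["Category"].isin([1, 3])]

print(both)
```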

## Applying the function to data
In some cases, the underlying data must be processed by a function that returns a new value for each record.

For example, the following function converts a test score into a grade.

In [None]:
import numpy as np
def grade(score):
    # A missing score stays missing.
    if np.isnan(score):
        return np.nan
    # Branches are checked in order, so each one only needs an upper bound.
    if score < 8:
        return "D"
    if score <= 12:
        return "C"
    if score <= 16:
        return "B"
    return "A"

In [None]:
points

In [None]:
points["grade"]=points["POINTS"].apply(grade)

In [None]:
points
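For simple range-based grading, `apply` with a Python function is not the only option. As an alternative technique, the sketch below uses `pd.cut`, which assigns values to bins without a Python-level loop. The bin edges are assumptions made for this example: with `right=False` a score that lies exactly on an edge falls into the higher grade, which may differ from the function used in this notebook.

```python
import numpy as np
import pandas as pd

# Invented scores, including one missing value.
scores = pd.Series([3.0, 8.0, 12.0, 16.0, np.nan])

# Bins with right=False: [-inf, 8) D, [8, 12) C, [12, 16) B, [16, inf) A.
grades = pd.cut(
    scores,
    bins=[-np.inf, 8, 12, 16, np.inf],
    labels=["D", "C", "B", "A"],
    right=False,
)

print(grades)
```

A missing score stays missing in the result, so no separate NaN check is needed.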

## Basic statistics
After reading data from a file, you often want to get an idea of the data by basic statistics (minimum, maximum, count, standard deviation, etc.)

This is what the **describe** function is for.

In [None]:
points.describe()

The pandas library has many statistical functions that you can run on selected data.

In [None]:
print (points["POINTS"].mean())

In [None]:
print (points["POINTS"].min())
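Several statistics can be computed in one call with `agg`. The sketch below uses invented values and includes one NaN to show that the pandas statistics skip missing values by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"POINTS": [4.0, 8.0, 12.0, np.nan]})

# agg runs each named statistic on the column.
summary = df["POINTS"].agg(["min", "max", "mean", "std"])

# count reports only the non-missing values.
count = df["POINTS"].count()

print(summary)
print(count)
```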

## NaN values
Similarly to NULL values in databases, a dataset can contain NaN, i.e. a missing or unknown value.

Sometimes it is advisable to remove incomplete records so that they do not invalidate statistics or interfere with the learning process.

The dropna method deletes incomplete records.

In [None]:
points2=points.dropna(inplace=False)
points2
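By default `dropna` removes a row when any column is missing, which can delete more than intended. A possible refinement is the `subset` parameter, shown here on invented data where the NOTE column is mostly empty.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "NAME": ["Ann", "Ben", "Eva"],
        "POINTS": [7.5, np.nan, 12.0],
        "NOTE": [None, "late", None],
    }
)

# Missing values per column.
missing = df.isna().sum()

# Every row has at least one missing value, so all rows are dropped.
any_missing = df.dropna()

# Only rows with a missing POINTS value are dropped.
only_points = df.dropna(subset=["POINTS"])

print(missing)
print(only_points)
```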

Sometimes we may not want to delete incomplete records, but we want to fill in the missing values with some value, such as the average value in a column.

In [None]:
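# Assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update the DataFrame.
points["POINTS"] = points["POINTS"].fillna(points["POINTS"].mean())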

In [None]:
points

Filling in missing values is not always advisable. Is a student who hasn't taken the test really supposed to have a C average?

In [None]:
points["grade"]=points["POINTS"].apply(grade)

In [None]:
points
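The effect of filling with the mean can be measured on a tiny invented example: the mean stays the same, but the number of values grows and the spread shrinks, so the filled data looks more uniform than the real data.

```python
import numpy as np
import pandas as pd

scores = pd.Series([4.0, 12.0, np.nan])

# Replace the missing value with the mean of the known values.
filled = scores.fillna(scores.mean())

print(scores.mean(), filled.mean())    # the mean is unchanged
print(scores.count(), filled.count())  # one more value is counted
print(scores.std(), filled.std())      # the standard deviation drops
```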