# Python Pandas Tutorial

The *pandas* package is one of the most important tools for data scientists and analysts working with Python today. While powerful machine learning and impressive visualization tools may get the most attention, pandas is the backbone of many data projects.

> The name \[*pandas*\] is derived from the term "**pan**el **da**ta", which is an economic term for datasets containing observations over various time periods for the same subjects.

So, if you want or need to work with large datasets and many tables, it's extremely helpful to delve deeper into pandas.

# What is Pandas Used For?

This tool is essentially the home of your data. With Pandas, you get to know your data by cleaning, transforming, and analyzing it.

With Pandas, you can extract data from various files (e.g., CSV) and present it in a DataFrame—a table, essentially—and then work with that data. For example, you can:

- Calculate statistics and answer questions about the data, such as:

    - What is the average, median, maximum, or minimum of each column?
    - Does column A correlate with column B?
    - What does the distribution of data in column C look like?

- Clean the data by removing missing values and filtering rows or columns based on certain criteria

- Visualize the data using Matplotlib. You can create bar charts, line charts, histograms, scatter plots, and much more.

- Save the cleaned and transformed data back into a CSV file, another file, or a database

Before diving into modeling or complex visualizations, it's important to have a good understanding of the nature of your dataset. And Pandas is the best way to achieve this.
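As a rough sketch of the workflow listed above, the snippet below builds a small made-up DataFrame, computes a few statistics, drops rows with missing values, and writes the result out as CSV. The column names `A`, `B`, `C` and all values are purely illustrative assumptions, and an in-memory buffer stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical example data; None marks a missing value
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0],
    'B': [2.0, 4.0, 6.0, 8.0],
    'C': [5.0, None, 7.0, 8.0],
})

col_means = df.mean()             # average of each column (missing values are skipped)
corr_ab = df['A'].corr(df['B'])   # Pearson correlation between columns A and B

cleaned = df.dropna()             # remove rows that contain missing values

# Write the cleaned table as CSV into an in-memory buffer instead of a file
buffer = io.StringIO()
cleaned.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, a file path can be passed to `to_csv` in place of the buffer.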


## How Does Pandas Fit into the Data Science Toolkit?

The Pandas library is not only a central element of the Data Science toolkit but is also used in conjunction with other libraries from this collection.

Pandas is based on the **NumPy** library, which means that many structures from NumPy are used or replicated in Pandas. Data in Pandas is often used for statistical analysis in **SciPy**, plotting functions from **Matplotlib**, and machine learning algorithms in **Scikit-learn**.

NumPy, SciPy, Matplotlib, and Scikit-learn are known as *libraries*, which are collections of code that provide specific functions and tools. For example, **NumPy** includes various mathematical functions such as sine (`numpy.sin(x)`) or the natural logarithm (`numpy.log(x)`).

To access these libraries, you first need to import them. This is done using the `import` command followed by the required library.


In [45]:
import numpy 

pi = numpy.pi
n1 = numpy.sin(60)  # the argument is interpreted in radians, not degrees
print(pi, n1)

3.141592653589793 -0.3048106211022167
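The negative result may look surprising if you expected the sine of 60 degrees. A minimal sketch of the degree-based calculation, converting the angle with `numpy.radians` first (the variable names are only illustrative):

```python
import numpy

angle_deg = 60
angle_rad = numpy.radians(angle_deg)   # convert degrees to radians
sin_60_deg = numpy.sin(angle_rad)      # roughly 0.866

print(sin_60_deg)
```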


## Getting Started with Pandas

### Installation and Import
Pandas can be installed easily. Open your terminal (on macOS or Linux) or command prompt (on Windows) and install it with the following command: `pip install pandas`

When using an online environment such as a hosted notebook, this is usually not necessary, as the most popular libraries are pre-installed. In that case, you can import Pandas directly with `import pandas`. To avoid writing the full name every time, you can import a library under an abbreviation: with `import pandas as pd`, `pd` becomes the name under which you access Pandas.
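A quick way to check that the import worked is to print the installed version. This is only a small sanity check; the variable name `empty` is illustrative.

```python
import pandas as pd

# After the aliased import, everything in pandas is reachable through `pd`
print(pd.__version__)    # version string of the installed pandas

empty = pd.DataFrame()   # equivalent to pandas.DataFrame()
print(empty.shape)
```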


## Basics of Pandas: Series and DataFrames

The two main components of Pandas are `Series` and `DataFrame`.

A `Series` is essentially a single column, and a `DataFrame` is a two-dimensional table consisting of a collection of Series that share the same index.

DataFrames and Series are similar in many ways. Many operations that can be performed on one component can also be performed on the other, such as filling in missing values and calculating the average.
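The following sketch illustrates this similarity with made-up values: the same `mean()` and `fillna()` methods are called once on a Series and once on a DataFrame.

```python
import pandas as pd

# A Series: one column of values (None becomes a missing value, NaN)
s = pd.Series([1.0, None, 3.0], name='apples')

# A DataFrame: several columns sharing one index
df = pd.DataFrame({'apples': [1.0, None, 3.0], 'oranges': [4.0, 5.0, None]})

column = df['apples']      # selecting one column returns a Series

s_mean = s.mean()          # a single number; missing values are skipped
df_means = df.mean()       # one mean per column, returned as a Series

s_filled = s.fillna(0)     # replace missing values in the Series
df_filled = df.fillna(0)   # same method applied to the whole DataFrame

print(s_mean)
print(df_means)
```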

### Creating DataFrames
Creating DataFrames directly in Python is useful for testing new methods and functions from the Pandas documentation.

There are **many** ways to create a DataFrame from scratch, but a good option is to use a simple `dict` (dictionary).

Suppose we have a fruit stand selling apples and oranges. We want a column for each fruit and a row for each customer purchase. To organize this as a dictionary for Pandas, we could do the following:


In [2]:
import pandas as pd

data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

purchases = pd.DataFrame(data)

purchases

   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2


**How Did This Work?**

Each **(key, value)** entry in `data` corresponds to a **column** in the resulting DataFrame.
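For comparison, here is one of the other ways to build the same table: a list with one dictionary per row instead of one list per column. The variable names are illustrative.

```python
import pandas as pd

# One dictionary per purchase (row); the keys become the column names
rows = [
    {'apples': 3, 'oranges': 0},
    {'apples': 2, 'oranges': 3},
    {'apples': 0, 'oranges': 7},
    {'apples': 1, 'oranges': 2},
]

purchases_from_rows = pd.DataFrame(rows)
print(purchases_from_rows)
```

Which form is more convenient usually depends on how the raw data arrives: column by column or record by record.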

The **index** of this DataFrame was provided to us as the numbers 0-3 during creation, but we can also create our own index when initializing the DataFrame.

Thus, we can use customer names as our index and look up the order of a specific customer (i.e., access individual rows). To select a row by its index label, use `.loc[]` with the label in the brackets. To select a row by its integer position, use `.iloc[]`, which works with zero-based positions just like list indexing.

In [23]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

June = purchases.loc['June']  # .loc because we select the row by its index label
Lily = purchases.iloc[2]  # .iloc because we select the row by its integer position
apples = purchases['apples']  # no .loc or .iloc needed: columns are selected directly by name


print(purchases)
print('--------------')
print(June)
print('--------------')
print(Lily)
print('--------------')
print(apples)

        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2
--------------
apples     3
oranges    0
Name: June, dtype: int64
--------------
apples     0
oranges    7
Name: Lily, dtype: int64
--------------
June      3
Robert    2
Lily      0
David     1
Name: apples, dtype: int64
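Both accessors can do more than return a single row. The sketch below rebuilds the same table and shows a few common variations; the variable names are illustrative.

```python
import pandas as pd

data = {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

lily_oranges = purchases.loc['Lily', 'oranges']    # single cell: row label, column name
two_customers = purchases.loc[['June', 'David']]   # several rows selected by label
first_two = purchases.iloc[0:2]                    # first two rows selected by position

print(lily_oranges)
print(two_customers)
print(first_two)
```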


## Most Important Operations with DataFrames

DataFrames have hundreds of methods and other operations that are crucial for any analysis. As a beginner, you should be familiar with the operations that perform simple transformations on your data and allow for basic statistical analysis.

The first thing to do when opening a new dataset is to print a few rows to get an overview. This can be achieved with `.head()`:


In [4]:
purchases.head()

        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2


`.head()` by default outputs the **first** five rows of your DataFrame, but you can also specify a number: `purchases.head(2)` would, for example, output the first two rows.

To display the **last** five rows, use `.tail()`. `tail()` also accepts a number to output the last rows as desired.


In [5]:
print(purchases.head(2))
print(purchases.tail(3))

        apples  oranges
June         3        0
Robert       2        3
        apples  oranges
Robert       2        3
Lily         0        7
David        1        2


### Getting Information About Your Data

`.info()` should be one of the first commands you run after loading your data. `.info()` provides important details about your dataset, such as the number of rows and columns, the number of non-null values, the data type of each column, and how much memory the DataFrame occupies.

Note that our dataset has no missing values: both columns report 4 non-null entries for 4 rows. In real datasets, a non-null count lower than the number of rows indicates missing values that need to be handled.

Another quick and useful attribute is `.shape`, which simply outputs a tuple **(number of rows, number of columns)**:


In [6]:
purchases.info()

purchases.shape

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, June to David
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   apples   4 non-null      int64
 1   oranges  4 non-null      int64
dtypes: int64(2)
memory usage: 268.0+ bytes


(4, 2)
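Two further checks are often run alongside `.info()`. As a minimal sketch on the same table, `isna().sum()` counts the missing values per column and `describe()` summarizes the numeric columns:

```python
import pandas as pd

data = {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

missing_per_column = purchases.isna().sum()   # number of missing values in each column
summary = purchases.describe()                # count, mean, std, min, quartiles, max

print(missing_per_column)
print(summary)
```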

## Adding Rows with concat

To add more rows to our DataFrame, we can use the `pd.concat()` function. Older tutorials use `DataFrame.append()` for this, but that method was deprecated in pandas 1.4 and removed in pandas 2.0. We now want to add two more customers (Paul and Larissa). To do this, we define a new DataFrame that specifies the amount of fruit both have purchased and then pass the old and the new DataFrame to `pd.concat()` as a list, which returns the combined DataFrame.

In [28]:
new_data = {
    'apples': [3, 0],
    'oranges': [2, 2]
}

new_purchases = pd.DataFrame(new_data, index=['Paul', 'Larissa'])

purchases = pd.concat([purchases, new_purchases])

In [29]:
purchases

         apples  oranges
June          3        0
Robert        2        3
Lily          0        7
David         1        2
Paul          3        2
Larissa       0        2
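For a single additional row, assigning to `.loc` with a label that does not exist yet is a compact alternative. The sketch below uses a hypothetical customer, Emma, who is not part of the tutorial's data:

```python
import pandas as pd

data = {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

# Assigning to a new label appends a row; values follow the column order
purchases.loc['Emma'] = [4, 1]   # hypothetical customer: 4 apples, 1 orange

print(purchases)
```

This modifies the DataFrame in place, so it suits small interactive changes better than adding many rows in a loop.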


## Accessing Individual Rows

To filter individual rows, you can use a comparison operator such as `==` to specify which rows you want to display.
> In Python, `==` is used as a comparison operator and represents an equality check between two values or expressions. The `==` operator checks if the values on both sides are equal, returning True if they are equal and False if they are not.

To display all rows where the column `apples` has a value of 3, first specify the condition using `==` on that column, then use the result to index the DataFrame. You can do this as follows:

In [32]:
a1 = purchases[purchases['apples'] == 3]

# gives the 'oranges' column for all rows where apples == 3

a2 = a1['oranges']


print(a1)
print('--------------')
print(a2)

      apples  oranges
June       3        0
Paul       3        2
--------------
June    0
Paul    2
Name: oranges, dtype: int64
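It can help to look at the condition on its own: the comparison produces a Series of `True`/`False` values, and indexing the DataFrame with it keeps the `True` rows. The sketch below rebuilds the six-row table and also tries other comparison operators; the variable names are illustrative.

```python
import pandas as pd

purchases = pd.DataFrame(
    {'apples': [3, 2, 0, 1, 3, 0], 'oranges': [0, 3, 7, 2, 2, 2]},
    index=['June', 'Robert', 'Lily', 'David', 'Paul', 'Larissa'],
)

mask = purchases['apples'] == 3   # Boolean Series: True where apples equals 3
print(mask)

at_least_two = purchases[purchases['apples'] >= 2]   # rows with 2 or more apples
some_oranges = purchases[purchases['oranges'] != 0]  # rows with at least one orange

print(at_least_two)
print(some_oranges)
```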


To combine multiple conditions, use `|` (or) and `&` (and). With `|`, a row is returned if at least one of the conditions is satisfied; with `&`, a row is returned only if all conditions are met. These operators can be chained as often as needed. Each individual comparison must be wrapped in parentheses, and additional parentheses determine how the conditions are grouped. For example:

> (a==0) | (a==2) & (o==0)

is different from

> ((a==0) | (a==2)) & (o==0)

Try both cases to understand why parentheses are important!
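The difference comes from operator precedence: in Python, `&` binds more tightly than `|`, so the first expression is read as `(a==0) | ((a==2) & (o==0))`. A minimal sketch on the six-row table, where `a` and `o` stand for the `apples` and `oranges` columns:

```python
import pandas as pd

purchases = pd.DataFrame(
    {'apples': [3, 2, 0, 1, 3, 0], 'oranges': [0, 3, 7, 2, 2, 2]},
    index=['June', 'Robert', 'Lily', 'David', 'Paul', 'Larissa'],
)
a = purchases['apples']
o = purchases['oranges']

# Read as: a == 0, OR (a == 2 AND o == 0)
without_grouping = purchases[(a == 0) | (a == 2) & (o == 0)]

# Read as: (a == 0 OR a == 2), AND o == 0
with_grouping = purchases[((a == 0) | (a == 2)) & (o == 0)]

print(without_grouping)   # Lily and Larissa (both have 0 apples)
print(with_grouping)      # empty: nobody with 0 or 2 apples bought 0 oranges
```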

In [39]:
# Apples should be either 3 or 0
a3 = purchases[(purchases['apples'] == 3) | (purchases['apples'] == 0)]

# Apples should be 3 and oranges 0
a4 = purchases[(purchases['apples'] == 3) & (purchases['oranges'] == 0)]

# Apples should be either 3 or 0, and oranges 2
a5 = purchases[((purchases['apples'] == 3) | (purchases['apples'] == 0)) & (purchases['oranges'] == 2)]

print(a3)
print('--------------')
print(a4)
print('--------------')
print(a5)

         apples  oranges
June          3        0
Lily          0        7
Paul          3        2
Larissa       0        2
--------------
      apples  oranges
June       3        0
--------------
         apples  oranges
Paul          3        2
Larissa       0        2
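When several `==` checks on the same column are joined with `|`, the `isin()` method is a shorter way to write the same filter. The sketch below rebuilds the table and compares both forms; `a3_isin` is an illustrative name.

```python
import pandas as pd

purchases = pd.DataFrame(
    {'apples': [3, 2, 0, 1, 3, 0], 'oranges': [0, 3, 7, 2, 2, 2]},
    index=['June', 'Robert', 'Lily', 'David', 'Paul', 'Larissa'],
)

# Two equality checks joined with | ...
a3 = purchases[(purchases['apples'] == 3) | (purchases['apples'] == 0)]

# ... select the same rows as a single isin() call
a3_isin = purchases[purchases['apples'].isin([3, 0])]

print(a3_isin)
```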


# Tasks:

1) Get familiar with DataFrames by creating your own table listing 4 friends/family members (or both). The names should be used as the row index, and the columns should include Age, Gender, and Hair Color. (And fill these in ;) )

2) Now, display all rows for the men/boys from your table.

3) Next, display all rows for people with brown or blonde hair.

4) Finally, display the age of all people with brown hair.