# Introduction to Pandas

We're starting to work with data! So far we've learned about different types of data:
* ints
* floats
* booleans
* strings (text)
* qualitative vs. quantitative
* discrete vs. continuous
* numerical vs. categorical

And we've also learned different ways that this data can be organized (structures):
* variables
* lists
* dictionaries

And we've learned that there are a lot of steps in the data analysis process for working with this data! But how do we actually perform those steps? Fortunately there is a package written for Python that makes working with data much easier - **<span style='color:blue'>Pandas</span>**!

Pandas is basically the de facto data analysis package for Python. It has its own structures for working with data (Series and DataFrames), provides many common commands for the things data analysts often do with their data, and integrates really well with other packages (like the matplotlib package for plotting that we'll learn about in a future class!).

<center><img src="https://media1.giphy.com/media/VCzhhGsOVH2PNIa3hQ/giphy.gif" width="400" height="400" />

## Prerequisites

Pandas is a **package** in Python, which means we need to *import* it to tell our Python code that we are using it. To do that we use the following format:

```
import package_name as name_you_use_to_refer_to_the_package
```

So for pandas the convention is to always import it as `pd` like below:

In [3]:
import pandas as pd

**<span style='color:red'>Warning</span>** : If you get an error with this code (typically a `ModuleNotFoundError`), it most likely means that pandas isn't installed in the Python environment you are running, so Python can't import it. Installing it (for example with `pip install pandas`) usually fixes this.

Now anywhere you want to use a command that comes from the pandas library, you just have to use `pd.command_name` to tell Python where to look for the code for that command.
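
As a small sketch of the `pd.command_name` pattern, the cell below reaches into pandas twice through the `pd` prefix. The variable names (`numbers`, `total`) are made up for illustration, and the version printed will depend on your own installation.

```python
import pandas as pd

# Everything that lives in pandas is reached through the `pd` prefix.
print(pd.__version__)  # the installed pandas version (varies by machine)

# pd.Series is one such command; Series are introduced in the next section.
numbers = pd.Series([1, 2, 3])
total = numbers.sum()
print(total)
```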

## Series

The most basic data structure in pandas is a **<span style='color:blue'>series</span>**. A Series is very similar to a list, except that when you print it out it explicitly shows the index values, and you can set those values based on the data. These indices then act more like row labels than the numerical position of the item in the list.

In [1]:
my_list = ['eggs', 'cheese', 'bacon']
print(my_list)

['eggs', 'cheese', 'bacon']


In [2]:
my_list[1]

'cheese'

You can even define a Series with a list:

```
series_name = pd.Series(data = [list of values], index = [list of row labels])
```

The `index` parameter is optional; if you leave it out, the row labels are just 0, 1, 2, ... like normal list index values. Here we provide our own labels:

In [6]:
employee_ids = ['123', '456', '789', '000']
employees = pd.Series(data=['Hannah', 'Henry', 'Henrietta', 'Humphrey'], index=employee_ids)
print(employees)

123       Hannah
456        Henry
789    Henrietta
000     Humphrey
dtype: object
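
For comparison, here is a minimal sketch of the case where `index` is left out. It reuses the breakfast items from the list example above; the names `breakfast` and `default_labels` are only for illustration.

```python
import pandas as pd

# No index given: pandas numbers the rows 0, 1, 2, ... just like a list.
breakfast = pd.Series(data=['eggs', 'cheese', 'bacon'])
print(breakfast)

# The automatically assigned row labels
default_labels = list(breakfast.index)
print(default_labels)
```

Because the labels here are the integers 0, 1, 2, `breakfast[1]` looks up the row labelled `1`, which happens to match the list position.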


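Working with them is very similar to working with a list as well. If you want a specific value out of a Series, you use square brackets with the row label, or `.iloc` plus square brackets with the numerical position (using a plain integer in `[]` as a position on a Series with non-integer labels is deprecated in recent pandas versions):

In [28]:
employees.iloc[0] # Using the numerical position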

'Hannah'

In [29]:
employees['123'] # Using the specific row label

'Hannah'

With pandas, series are like individual columns of a table.

## Dataframes

*In what formats have you seen data in your experiences?*

Likely one of those formats has been a table. This is the general anatomy of a table that holds data:

<center><img src="https://bam.files.bbci.co.uk/bam/live/content/zvxgd2p/large" width="600" height="600" />

**Field** : usually a column of a table; holds a certain piece of information

**Record** : usually a row of a table; holds all of the different pieces of information for a single instance

**Unit** : a single cell of the table (intersection of a row and column)

A **<span style='color:blue'>dataframe</span>** is just the main way that we make tables in Python. It is like gluing multiple Series together that all have the same row labels.
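
To make the "gluing" picture concrete, here is a rough sketch that places two Series side by side with `pd.concat`. This is just one way to illustrate the idea, not the usual way to build a dataframe, and the labels and values are hypothetical.

```python
import pandas as pd

labels = ['123', '456']  # hypothetical row labels shared by both Series

names = pd.Series(['Hannah', 'Henry'], index=labels, name='name')
ages = pd.Series([24, 56], index=labels, name='age')

# axis=1 places the Series next to each other as columns;
# each Series' name becomes its column name.
glued = pd.concat([names, ages], axis=1)
print(glued)
```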

## Creating a Dataframe

You can either create a dataframe from scratch, or read in data from another compatible format like an Excel or CSV file.

The most used format for creating a DataFrame from scratch is with a dictionary-like format:

```
dataframe_name = pd.DataFrame(
    data = {
        'column_name' : [list or Series of column values],
        'column_name2' : [list or Series of column 2 values],
        ...
    },
    index = [list of row labels]
)
```

So the `data` parameter is given a dictionary where the keys are the names of the columns in the table, and the value for each key holds the units of data in that column, one per row (order matters!). The `index` parameter is for when you want specific labels for the rows; if you don't care what the row labels are, you can leave `index` out and the dataframe will automatically number the rows starting from 0.
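
As a quick sketch of the "leave the index out" case, the made-up dataframe below (`pets_df` and its columns are hypothetical) is built from plain lists only, so pandas assigns the row labels itself.

```python
import pandas as pd

# Plain lists and no index parameter: rows are numbered automatically.
pets_df = pd.DataFrame(
    data={
        'pet': ['cat', 'dog', 'fish'],
        'legs': [4, 4, 0],
    }
)
print(pets_df)

row_labels = list(pets_df.index)
print(row_labels)
```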

In [7]:
employees_df = pd.DataFrame(
    data = {
        'name' : employees, # using an already defined Series
        'age' : [24, 56, 61, 32], # using a list
        'department' : ['sales', 'marketing', 'development', 'sales']
    }
)

In [15]:
employees_df

          name  age   department
123     Hannah   24        sales
456      Henry   56    marketing
789  Henrietta   61  development
000   Humphrey   32        sales


Notice how it brought along the row labels from the Series we used!

These are the different parts of this kind of dataframe:

<center><img src="https://files.realpython.com/media/fig-1.d6e5d754edf5.png" width="600" height="600" />

In [9]:
# How would you create a dataframe that looks like the example below?
fruit_df = pd.DataFrame(
    data = {
        'Apples':[35, 41],
        'Bananas':[21, 34]
    },
    index = ['2017 Sales', '2018 Sales']
)

fruit_df

            Apples  Bananas
2017 Sales      35       21
2018 Sales      41       34


In [21]:
apples = [35, 41]
bananas = [12, 34]

fruits = [apples, bananas]
print(fruits)

[[35, 41], [12, 34]]


In [20]:
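years = ['2017 Sales', '2018 Sales']
# The two Series use different row labels, so pandas lines the rows up by label
# and fills every missing combination with NaN (which also turns the ints into floats)
apples = pd.Series([35, 41], name='Apples', index=years)
bananas = pd.Series([21, 34], name='Bananas', index=['2019 Sales', '2020 Sales'])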

fruits_2 = pd.DataFrame(
    data = {
        'Apples': apples,
        'Bananas': bananas
    }
)

fruits_2

            Apples  Bananas
2017 Sales    35.0      NaN
2018 Sales    41.0      NaN
2019 Sales     NaN     21.0
2020 Sales     NaN     34.0


![](https://i.imgur.com/CHPn7ZF.png)
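
Reading from a file, the other option mentioned at the start of this section, can be sketched with `pd.read_csv`. To keep the example self-contained, the CSV text below is made up and wrapped in `io.StringIO` so it behaves like a file; with a real file you would pass its path instead (for example a hypothetical `'employees.csv'`).

```python
import io
import pandas as pd

# Made-up CSV contents standing in for a real file on disk.
csv_text = """employee_id,name,age
123,Hannah,24
456,Henry,56
000,Humphrey,32
"""

# dtype=str keeps the ids as text, so the leading zeros in '000' survive.
csv_df = pd.read_csv(io.StringIO(csv_text), dtype={'employee_id': str})

# Use the id column as the row labels instead of 0, 1, 2, ...
csv_df = csv_df.set_index('employee_id')
print(csv_df)
```

Without the `dtype` argument, pandas would read the ids as numbers and `000` would turn into `0`.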

## Viewing A Sample of a Dataframe

For our itty bitty dataframe it doesn't matter a lot because there are only 4 rows. But if you have a 10,000 row dataframe you aren't going to want to print the whole thing! Instead you can look at a sample of it by printing just the first several rows with `df_name.head(number_of_rows)`:

In [25]:
employees_df.head(2)

       name  age  department
123  Hannah   24       sales
456   Henry   56   marketing
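
To see why this matters on something bigger, here is a small sketch with a made-up 10,000 row dataframe (`big_df` and its single column are hypothetical). It also shows that calling `head()` with no number gives you the first 5 rows.

```python
import pandas as pd

# A made-up dataframe with 10,000 rows
big_df = pd.DataFrame(data={'row_number': list(range(10000))})

first_rows = big_df.head()    # no argument: the first 5 rows
first_three = big_df.head(3)  # just the first 3 rows

print(len(big_df), len(first_rows), len(first_three))
print(first_three)
```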


## Selecting Specific Columns in a Dataframe

In [33]:
employees_df

          name  age   department
123     Hannah   24        sales
456      Henry   56    marketing
789  Henrietta   61  development
000   Humphrey   32        sales


There are two ways to get a specific column/Series from a dataframe:

```
dataframe_name.column_name
```
OR
```
dataframe_name['column_name']
```

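They both work great EXCEPT when a column name has spaces in it (or otherwise isn't a valid Python name), or when it matches something a dataframe already has, like the `count` method. In those cases, you have to use the square bracket `[]` version.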
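
Here is a minimal sketch of the cases where only the square brackets work. The dataframe and its column names (`unit price`, `count`) are invented for illustration; `count` is chosen because dataframes already have a method with that name.

```python
import pandas as pd

sales_df = pd.DataFrame(
    data={
        'unit price': [1.5, 2.0],  # has a space, so sales_df.unit price is not valid Python
        'count': [10, 20],         # same name as the built-in DataFrame.count method
    }
)

# Square brackets work for any column name.
prices = sales_df['unit price']
counts = sales_df['count']

# The dot version finds the dataframe's own count method, not our column.
dot_version_is_method = callable(sales_df.count)
print(dot_version_is_method)
```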

In [29]:
employees_df.age.to_list()

[24, 56, 61, 32]

In [27]:
employees_df.name

123       Hannah
456        Henry
789    Henrietta
000     Humphrey
Name: name, dtype: object

In [35]:
employees_df['name']

123       Hannah
456        Henry
789    Henrietta
000     Humphrey
Name: name, dtype: object

<center><img src="https://files.realpython.com/media/fig-2.2bee1e181467.png" width="500" height="500" />

As you can see, you get back the Series for that column, named after the column. If you then want a specific value/unit of data from that column, you can use the numerical position (with `.iloc`) or the row label, because the column is itself a Series:

In [36]:
employees_df['name'].iloc[0]

'Hannah'

In [37]:
employees_df['name']['123']

'Hannah'

## Selecting Specific Rows in a Dataframe

In [33]:
employees_df

          name  age   department
123     Hannah   24        sales
456      Henry   56    marketing
789  Henrietta   61  development
000   Humphrey   32        sales


To select a specific row, you can use `df.iloc[numerical position]` or `df.loc[row label]`:

In [35]:
employees_df.iloc[0]

name          Hannah
age               24
department     sales
Name: 123, dtype: object

In [39]:
employees_df.loc['123']

name          Hannah
age               24
department     sales
Name: 123, dtype: object

As you can see, this also returns a Series for that row.
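
As a small sketch of how the two fit together, the made-up dataframe below (`staff_df` is hypothetical) pulls the same row out by position and by label. It also shows that `.loc` accepts a row label and a column name together to get a single unit of data.

```python
import pandas as pd

staff_df = pd.DataFrame(
    data={'name': ['Hannah', 'Henry'], 'age': [24, 56]},
    index=['123', '456'],
)

row_by_position = staff_df.iloc[1]  # second row, counting from 0
row_by_label = staff_df.loc['456']  # the row labelled '456'
print(row_by_position.equals(row_by_label))

# Row label and column name together give one cell of the table.
henry_age = staff_df.loc['456', 'age']
print(henry_age)
```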

<center><img src="https://files.realpython.com/media/fig-3.2e2d5c452c23.png" width="600" height="600" />

## Additional Resource

We are going to continue to talk about and learn how to use Pandas, but for some additional practice I highly recommend the mini lessons on Kaggle : https://www.kaggle.com/learn/pandas