### Introduction to Pandas

Pandas is one of the most popular Python libraries. It makes working with structured data easy and provides numerous functionalities for data exploration, analysis and visualization. Before building a machine learning model for any data, an extensive **exploratory data analysis (EDA)** is usually performed to identify potential issues with the data (such as missing values and collinearity), select the best inputs (features), and possibly generate additional features (feature engineering). These steps are often carried out with pandas in conjunction with other libraries such as **NumPy, Matplotlib and Seaborn**. Pandas can also read from and write to Excel files, making it easier to perform Excel-type operations within our IDE.

To use pandas, we first need to import it. The conventional syntax for doing so is:

<code>import pandas as pd</code>

Pandas provides the **DataFrame** class, which allows us to create and work with tabular data. A common way to create a DataFrame is to pass a dictionary to its constructor:

<code>df = pd.DataFrame(dictionary_input)</code>

As our first example, let's define some parameters to use in generating a dataframe:
<code>
names = ["Adekunle", "Saheed", "Fahad", "Fatimah", "Sheriff", "Abdul", "Darlington", "Simon", "Abiodun", "Ammar"]
ages = np.random.choice(np.arange(20, 51, 1), size = len(names))
job_title = ["teacher","engineer", "doctor", "student", "researcher", "lawyer", "CEO", "data scientist", "professor", "reservoir engineer"]
</code>
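The variables above can be combined into a dataframe roughly as follows. This is a minimal sketch: the column names `"name"`, `"age"` and `"job_title"` are assumptions chosen for illustration, and the ages depend on the random seed.

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # fixed seed so the random ages are reproducible

names = ["Adekunle", "Saheed", "Fahad", "Fatimah", "Sheriff",
         "Abdul", "Darlington", "Simon", "Abiodun", "Ammar"]
ages = np.random.choice(np.arange(20, 51, 1), size=len(names))
job_title = ["teacher", "engineer", "doctor", "student", "researcher",
             "lawyer", "CEO", "data scientist", "professor", "reservoir engineer"]

# Each dictionary key becomes a column name and each value supplies that
# column's entries. The column names used here are assumed, not prescribed.
df = pd.DataFrame({"name": names, "age": ages, "job_title": job_title})

print(df.shape)  # (10, 3)
print(df)
```

All values passed in the dictionary must have the same length, otherwise pandas raises an error when building the dataframe.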

In [None]:
# import libraries here
import pandas as pd
import numpy as np

np.random.seed(1)

In [None]:
# define the variables here

In [None]:
# define the dataframe here

In [None]:
# take a look at df here

**Note**

If the dataframe contains a large number of rows, only some of the data is displayed. Pandas provides the head(), tail() and sample() methods to help us view the first few rows, the last few rows, and randomly selected rows respectively, as illustrated below.
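As a rough illustration of the three methods, the sketch below uses a hypothetical 100-row frame (`df_demo`, with made-up columns `id` and `value`) rather than the dataframe from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical 100-row frame, used only to show the three methods.
df_demo = pd.DataFrame({"id": np.arange(100), "value": np.arange(100) * 2})

first_rows = df_demo.head()    # first 5 rows by default
last_rows = df_demo.tail(3)    # last 3 rows
# 4 randomly chosen rows; random_state makes the draw repeatable
random_rows = df_demo.sample(n=4, random_state=1)

print(first_rows)
print(last_rows)
print(random_rows)
```

Each method returns a new dataframe and leaves the original unchanged.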

In [None]:
# let's try the head() function here

In [None]:
# let's try the tail() function here

In [None]:
# let's try the sample() function here

**Note**

Just as with dictionaries, we can obtain the values for a given key (a column name in this case) as illustrated below.
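The two access styles can be sketched as follows, using a small hypothetical frame with made-up ages.

```python
import pandas as pd

# Hypothetical three-row frame; the ages are made up for illustration.
df_demo = pd.DataFrame({"name": ["Adekunle", "Saheed", "Fahad"],
                        "age": [31, 45, 28]})

by_key = df_demo["age"]    # bracket style: works for any column name
by_attr = df_demo.age      # attribute style: same column, returned as a Series

# A list of column names returns a dataframe instead of a Series.
subset = df_demo[["name", "age"]]

print(by_key)
print(type(by_key).__name__, type(subset).__name__)  # Series DataFrame
```

The attribute style only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so a column such as `"years in position"` has to be accessed with brackets.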

In [None]:
# use the df[] style here

In [None]:
# use the df. style here

**Note**

It is also possible to add a new column to the dataframe using the following syntax:

<code>dataframe_name[new_column_name] = new_values</code>
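A minimal sketch of this syntax, assuming a small hypothetical frame: a column is first created with missing values, then overwritten with random integers, and a second column is converted to integers with `astype`. The value ranges are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # reproducible random values

# Hypothetical three-row frame; the ages are made up for illustration.
df_demo = pd.DataFrame({"name": ["Adekunle", "Saheed", "Fahad"],
                        "age": [31, 45, 28]})

# A scalar is broadcast to every row, so the whole column starts as NaN.
df_demo["years in position"] = np.nan
print(df_demo.dtypes)  # the new column is float64 because NaN is a float

# Assigning to an existing column name overwrites the column.
df_demo["years in position"] = np.random.randint(1, 11, size=len(df_demo))

# Random floats for the salary (arbitrary range), then converted to integers.
df_demo["annual_base_salary"] = np.random.uniform(40_000, 120_000, size=len(df_demo))
df_demo["annual_base_salary"] = df_demo["annual_base_salary"].astype(int)

print(df_demo.dtypes)
print(df_demo)
```

Note that `astype(int)` truncates the decimal part rather than rounding it.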

In [None]:
# let's include a new column named "years in position" and populate with "nan"

In [None]:
# take a look at df again

In [None]:
# let's change the values of the "years in position" to some randomly generated integers

In [None]:
# take a look at df again

In [None]:
# let's have another column named annual_base_salary

In [None]:
# take a look at df again

In [None]:
# let's check the data types of our columns

In [None]:
# let's change annual_base_salary to integer

# take a look at df again


In [None]:
# let's use apply to round each annual_base_salary value to the nearest 1000

#### Applying functions to dataframe columns
Suppose these individuals work for the same organization, which chooses to give each person a bonus equal to the number of years they have been in their current position multiplied by 10% of their current base salary. To add this information to the dataframe, we need to do some calculations first. Luckily, pandas provides functionality for manipulating whole dataframes or specific columns within them.
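The bonus calculation described above could be sketched as follows. The frame, its values, and the helper `round_to_thousand` are hypothetical and only serve to illustrate `apply` and element-wise column arithmetic.

```python
import pandas as pd

# Hypothetical data; years and salaries are made up for illustration.
df_demo = pd.DataFrame({
    "name": ["Adekunle", "Saheed", "Fahad"],
    "years in position": [2, 4, 3],
    "annual_base_salary": [61_400, 98_700, 47_800],
})

def round_to_thousand(value):
    """Round a number to the nearest 1000 and return it as an int."""
    return int(round(value / 1000) * 1000)

# apply calls the function once per element of the column.
df_demo["annual_base_salary"] = df_demo["annual_base_salary"].apply(round_to_thousand)

# Column arithmetic is element-wise: years x 10% of salary, row by row.
df_demo["bonus"] = df_demo["years in position"] * 0.10 * df_demo["annual_base_salary"]

# Round the bonus to the nearest 1000 and convert it to an integer as well.
df_demo["bonus"] = df_demo["bonus"].apply(round_to_thousand)

print(df_demo)
```

For simple arithmetic like this, operating on whole columns is generally faster than `apply`, which runs a Python function on each value. `apply` is most useful when the logic cannot easily be expressed with column operations.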

In [None]:
# let's calculate bonus here

# let's round bonus to the nearest 1000 also and then convert to integer type


In [None]:
# take a look at df again


#### Slicing Dataframes

It is also possible to obtain specific elements or subsets from a dataframe. Pandas provides the **loc, iloc, at, iat** indexers to achieve this. We will discuss how each one works in the next few code cells.

**DataFrame.loc** vs **DataFrame.iloc**

**DataFrame.loc** locates data by label, while **DataFrame.iloc** strictly uses integer positions, whether for rows alone or for a row/column pair.

To illustrate, let's make a copy of our dataframe and change the index of the copy to letters A-J
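The label versus position distinction can be sketched with a small hypothetical frame (four rows, made-up ages, and assumed column names) and a relabelled copy.

```python
import string
import pandas as pd

# Hypothetical four-row frame; values and column names are for illustration.
df_demo = pd.DataFrame({
    "name": ["Adekunle", "Saheed", "Fahad", "Fatimah"],
    "age": [31, 45, 28, 39],
    "job_title": ["teacher", "engineer", "doctor", "student"],
})

# Copy the frame and relabel the copy's rows with letters;
# df_demo keeps its default integer labels 0..3.
df_copy = df_demo.copy()
df_copy.index = list(string.ascii_uppercase[:len(df_copy)])

print(df_demo.loc[0])     # 0 is a row label here
print(df_copy.loc["A"])   # the same row, found by its new label
print(df_copy.iloc[0])    # the same row again, found by position

# loc slices include both endpoints; iloc slices exclude the stop position.
by_label = df_copy.loc["A":"C", "name":"age"]   # rows A, B, C; columns name to age
by_position = df_copy.iloc[0:3, 0:2]            # rows 0, 1, 2; columns 0 and 1

# Lists pick out specific rows and columns.
picked = df_copy.loc[["A", "D"], ["name", "job_title"]]

print(by_label)
print(picked)
```

Both slices above select the same block of data, which shows that the difference lies in how the rows and columns are addressed rather than in what can be selected.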

In [None]:
# first let's get the letters
import string
capital_letters = string.ascii_uppercase
first_10 = [capital_letters[ind] for ind in range(10)]

In [None]:
# make a copy of df and change the index of the copy


In [None]:
# let's get the data for df at the row labelled 0 
df.loc[0] # here 0 is interpreted as a label, not a position

In [None]:
# try df_copy.loc[0]

In [None]:
# what about df_copy.loc["A"]

In [None]:
# let's get the name for this position

In [None]:
# let's try multiple rows with df

In [None]:
# repeat with df_copy

In [None]:
# if we want only the columns from name to position

In [None]:
# if we want only specific columns

In [None]:
# let's get the first row for df
df.iloc[0]  # here 0 is interpreted as an integer

In [None]:
# try df_copy.iloc["A"]

This raises an error (as expected) because iloc expects integer positions, not strings.
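The sketch below reproduces both failure modes on a hypothetical two-row frame: a label passed to `iloc`, and an integer passed to `loc` when the row labels are letters. The errors are caught so the cell runs to completion.

```python
import pandas as pd

# Hypothetical two-row frame with letter labels, for illustration only.
df_copy = pd.DataFrame({"name": ["Adekunle", "Saheed"]}, index=["A", "B"])

errors = {}

try:
    df_copy.iloc["A"]     # a label passed to the position-based indexer
except TypeError as err:
    errors["iloc"] = type(err).__name__

try:
    df_copy.loc[0]        # 0 is not one of the labels "A", "B"
except KeyError as err:
    errors["loc"] = type(err).__name__

print(errors)
print(df_copy.iloc[0])    # position 0 works and returns the row labelled "A"
```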

In [None]:
# try df_copy.iloc[0] instead

In [None]:
# let's get multiple data with iloc

**DataFrame.at** vs **DataFrame.iat**

DataFrame.at uses a row/column label pair to access a single element, while DataFrame.iat uses a row/column pair of integer positions instead.
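A minimal sketch of both accessors on a hypothetical three-row frame with letter labels (the ages are made up):

```python
import pandas as pd

# Hypothetical frame with letter row labels, for illustration only.
df_copy = pd.DataFrame(
    {"name": ["Adekunle", "Saheed", "Fahad"], "age": [31, 45, 28]},
    index=["A", "B", "C"],
)

name_b = df_copy.at["B", "name"]   # row label "B", column label "name"
same_cell = df_copy.iat[1, 0]      # second row, first column: the same cell

print(name_b, same_cell)

# Both accessors can also be used to set a single value.
df_copy.at["C", "age"] = 29
print(df_copy.iat[2, 1])
```

Because `at` and `iat` only ever return one element, they are a convenient choice when a single scalar is needed rather than a row, column, or sub-frame.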

In [None]:
# obtain the item at row label B and column label "name" for df_copy

In [None]:
# try df.iat here
df.iat[2, 0]

**Note**

You can combine different indexers as illustrated below.
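One way to combine them, sketched on a hypothetical frame with letter labels: `loc` first narrows the data down to a Series, and `iat` then picks one element of that Series by position.

```python
import pandas as pd

# Hypothetical frame with letter row labels, for illustration only.
df_copy = pd.DataFrame(
    {"name": ["Adekunle", "Saheed", "Fahad"], "age": [31, 45, 28]},
    index=["A", "B", "C"],
)

# loc returns row "B" as a Series indexed by column name;
# iat then takes the first entry of that Series.
chained = df_copy.loc["B"].iat[0]

# loc returns the "name" column for rows A to B; iat takes its second entry.
names_ab = df_copy.loc["A":"B", "name"]
second_name = names_ab.iat[1]

print(chained, second_name)
```

Chaining like this is fine for reading values. For assigning values, a single `loc`, `iloc`, `at` or `iat` call on the dataframe is the safer choice, since a chained assignment may act on a temporary copy.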

In [None]:
# try .loc and .iat together here