# Pandas DataFrames

A pandas DataFrame is a two-dimensional, size-mutable tabular data structure with labeled rows and columns, similar to a spreadsheet or an SQL table.

DataFrames can be created from many sources, including lists, dictionaries, CSV files, and Excel files.
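
As a minimal sketch of two of these sources, the snippet below builds the same small table from a list of dictionaries and from CSV text. The CSV is read from an in-memory string via `io.StringIO` so the example needs no file on disk; the variable names and sample values are illustrative assumptions.

```python
import io

import pandas as pd

# A list of dictionaries: each dictionary becomes one row.
records = [{"Name": "John", "Age": 28}, {"Name": "Anna", "Age": 34}]
df_from_records = pd.DataFrame(records)

# CSV text: read_csv accepts a path or any file-like object.
csv_text = "Name,Age\nJohn,28\nAnna,34\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_records)
print(df_from_csv)
```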

## Basic DataFrame Operations

### Creating a DataFrame

The simplest way to create a DataFrame is to pass a dictionary to `pd.DataFrame()`: each key becomes a column name and each list becomes that column's values.



In [2]:
import pandas as pd

# Sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'James'],
        'Age': [28, 34, 29, 32, 41],
        'City': ['New York', 'Paris', 'Berlin', 'London', 'Toronto']}
df = pd.DataFrame(data)
df

,Name,Age,City
0,John,28,New York
1,Anna,34,Paris
2,Peter,29,Berlin
3,Linda,32,London
4,James,41,Toronto


### Viewing Data

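`head(n)` and `tail(n)` return the first or last `n` rows of a DataFrame (5 by default). If `n` is larger than the number of rows, as with `tail(6)` below, the whole DataFrame is returned.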

In [4]:
df.head(3)

,Name,Age,City
0,John,28,New York
1,Anna,34,Paris
2,Peter,29,Berlin


In [5]:
df.tail(6)

,Name,Age,City
0,John,28,New York
1,Anna,34,Paris
2,Peter,29,Berlin
3,Linda,32,London
4,James,41,Toronto


`describe()` provides a statistical summary of the numerical columns.

In [6]:
df.describe()

,Age
count,5.0
mean,32.8
std,5.167204
min,28.0
25%,29.0
50%,32.0
75%,34.0
max,41.0
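
The individual figures in this summary can also be computed directly on a column. The sketch below recomputes a few of them for the `Age` values used above; the `summary` dictionary is only an illustrative container, not part of pandas.

```python
import pandas as pd

ages = pd.Series([28, 34, 29, 32, 41], name="Age")

summary = {
    "count": ages.count(),
    "mean": ages.mean(),
    "std": ages.std(),        # sample standard deviation (ddof=1), as in describe()
    "median": ages.median(),  # same value as the 50% row of describe()
}
print(summary)
```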


## Data Selection, Addition, and Deletion

### Selection

We can select columns or rows from a DataFrame. A single column is selected by its name, which returns a Series.

In [10]:
df["Age"]

0    28
1    34
2    29
3    32
4    41
Name: Age, dtype: int64

To select multiple columns, pass a list of column names; this returns a DataFrame.

In [11]:
df[["Name", "Age"]]

,Name,Age
0,John,28
1,Anna,34
2,Peter,29
3,Linda,32
4,James,41


To select rows, use `loc`, which looks rows up by their index label.

In [19]:
df.loc[0] # prints the first record

Name        John
Age           28
City    New York
Name: 0, dtype: object

In [31]:
df.loc[1:3] # prints records 2 to 4 (the end label 3 is included)

,Name,Age,City
1,Anna,34,Paris
2,Peter,29,Berlin
3,Linda,32,London


In [27]:
df.loc[3:] # prints from the 4th record onwards

,Name,Age,City
3,Linda,32,London
4,James,41,Toronto


In [30]:
df.loc[:2] # prints until the 3rd record

,Name,Age,City
0,John,28,New York
1,Anna,34,Paris
2,Peter,29,Berlin


The numbers shown next to each row are the index. Here the index happens to match each record's position, but an index can hold other values. Let's make the person's name the index.


In [53]:
df2 = df.set_index("Name")
df2.head(3)

Name,Age,City
John,28,New York
Anna,34,Paris
Peter,29,Berlin


Integer labels no longer work with `loc`; rows must now be looked up by the values in the new index.

In [44]:
# df2.loc[0] would raise a KeyError because 0 is not a label in this index
df2.loc["John":"Anna"]

Name,Age,City
John,28,New York
Anna,34,Paris
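
To make the failure mode concrete, the sketch below rebuilds a shortened version of the name-indexed DataFrame and shows that an integer lookup with `loc` raises a `KeyError`, while a lookup by name succeeds. The `raised_key_error` flag and `john_age` variable are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Anna", "Peter"],
    "Age": [28, 34, 29],
})
df2 = df.set_index("Name")

# Integer 0 is not a label in the new index, so loc raises KeyError.
try:
    df2.loc[0]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# A row label together with a column label selects a single value.
john_age = df2.loc["John", "Age"]
print(raised_key_error, john_age)
```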


However, rows can still be selected by integer position using `iloc[]`.

In [46]:
df2.iloc[0:2]

Name,Age,City
John,28,New York
Anna,34,Paris
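
One difference between the two accessors is easy to miss: a `loc` slice includes its end label, whereas an `iloc` slice excludes its end position, like ordinary Python slicing. A minimal sketch on a default integer index, using shortened sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Anna", "Peter", "Linda", "James"],
    "Age": [28, 34, 29, 32, 41],
})

by_label = df.loc[1:3]      # labels 1, 2 and 3 (end label included)
by_position = df.iloc[1:3]  # positions 1 and 2 (end position excluded)

print(len(by_label), len(by_position))
```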


To restore the default integer index and turn `Name` back into a regular column, use `reset_index()`.

In [48]:
df2.reset_index()

,Name,Age,City
0,John,28,New York
1,Anna,34,Paris
2,Peter,29,Berlin
3,Linda,32,London
4,James,41,Toronto


Operations like these return a new DataFrame and leave the original unchanged. For instance, the index of `df2` is still the `Name` column.

In [50]:
df2.head(3)

Name,Age,City
John,28,New York
Anna,34,Paris
Peter,29,Berlin


To apply the change to the existing DataFrame instead of returning a new one, pass `inplace=True`.

In [54]:
df2.reset_index(inplace=True)
df2.head(3)

,Name,Age,City
0,John,28,New York
1,Anna,34,Paris
2,Peter,29,Berlin
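
An alternative to `inplace=True` is to assign the returned DataFrame back to the same name, which gives the same end result. A minimal sketch with shortened sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Anna", "Peter"],
    "Age": [28, 34, 29],
})
df2 = df.set_index("Name")

# Rebind df2 to the new DataFrame returned by reset_index().
df2 = df2.reset_index()
print(df2)
```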


Finally, column and row selection can be combined.

In [57]:
df[["Name", "Age"]].loc[0:3]

,Name,Age
0,John,28
1,Anna,34
2,Peter,29
3,Linda,32


### Addition

To add a column, assign values to a new column name.

In [60]:
df["Over_30"] = df["Age"]>30
df.head(3)

,Name,Age,City,Over_30
0,John,28,New York,False
1,Anna,34,Paris,True
2,Peter,29,Berlin,False
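
When the original DataFrame should stay untouched, `assign()` returns a copy with the extra columns instead of modifying it. In this sketch `Age_Next_Year` is a hypothetical column added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Anna", "Peter", "Linda", "James"],
    "Age": [28, 34, 29, 32, 41],
})

# assign() leaves df unchanged and returns a new DataFrame.
df_with_flags = df.assign(
    Over_30=df["Age"] > 30,
    Age_Next_Year=df["Age"] + 1,  # hypothetical extra column
)
print(df_with_flags)
```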


### Deletion

To delete a column, use `drop()` with `axis=1`.

In [61]:
df.drop("Over_30", axis=1, inplace=True)
df.head(3)

,Name,Age,City
0,John,28,New York
1,Anna,34,Paris
2,Peter,29,Berlin
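
`drop()` also accepts the `columns=` and `index=` keywords, which some readers find clearer than `axis`. The sketch below removes a column and a row without `inplace=True`, so the original DataFrame is left as it was; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Anna", "Peter", "Linda", "James"],
    "Age": [28, 34, 29, 32, 41],
    "City": ["New York", "Paris", "Berlin", "London", "Toronto"],
})

without_city = df.drop(columns=["City"])  # same effect as drop("City", axis=1)
without_first = df.drop(index=[0])        # removes the row labelled 0

print(without_city.columns.tolist())
print(without_first.index.tolist())
```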


## Data Filtering

Filtering is often used to narrow down the data to relevant subsets.

The expression `df["Age"] > 30` on its own returns a boolean Series with one `True` or `False` value per row.


In [66]:
df["Age"] > 30

0    False
1     True
2    False
3     True
4     True
Name: Age, dtype: bool

Passing this boolean Series back to the DataFrame as a mask keeps only the rows where the value is `True`.

In [67]:
df[df["Age"] > 30]

,Name,Age,City
1,Anna,34,Paris
3,Linda,32,London
4,James,41,Toronto


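Conditions can be combined with `&` (and) and `|` (or). Each condition must be wrapped in parentheses, because these operators bind more tightly than comparisons such as `>` and `==`.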

In [69]:
df[(df['Age'] > 30) & (df['City'] == 'Paris')]

,Name,Age,City
1,Anna,34,Paris
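
A few more filter forms, sketched on the same sample data: `|` for "or", `~` to negate a condition, and `isin()` to test membership in a list of values. The result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Anna", "Peter", "Linda", "James"],
    "Age": [28, 34, 29, 32, 41],
    "City": ["New York", "Paris", "Berlin", "London", "Toronto"],
})

older_or_berlin = df[(df["Age"] > 40) | (df["City"] == "Berlin")]
not_paris = df[~(df["City"] == "Paris")]
in_cities = df[df["City"].isin(["Paris", "London"])]

print(older_or_berlin["Name"].tolist())
print(not_paris["Name"].tolist())
print(in_cities["Name"].tolist())
```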


## Combining DataFrames

Data often arrives from multiple sources; `pd.merge()` and `pd.concat()` are the main tools for combining it.

First off, let's add an `Employee_ID` to the employees.

In [76]:
df["Employee_ID"] = range(1, len(df) + 1)
df

,Name,Age,City,Employee_ID
0,John,28,New York,1
1,Anna,34,Paris,2
2,Peter,29,Berlin,3
3,Linda,32,London,4
4,James,41,Toronto,5


Now let's assume we have more details about each employee, such as their role.

In [79]:
df_roles = pd.DataFrame({
    "Employee_ID" : [1, 2, 3, 4, 5],
    "Role": ["Developer", "Data Scientist", "Developer", "HR Manager", "Accountant"]
})

df_roles.head(2)

,Employee_ID,Role
0,1,Developer
1,2,Data Scientist


Now we want to combine `df` and `df_roles` on their shared `Employee_ID` column, much like an SQL join.

In [82]:
df_merged = pd.merge(df, df_roles, on="Employee_ID")
df_merged.head(3)

,Name,Age,City,Employee_ID,Role
0,John,28,New York,1,Developer
1,Anna,34,Paris,2,Data Scientist
2,Peter,29,Berlin,3,Developer
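
By default `pd.merge()` performs an inner join, so rows without a match in both DataFrames are dropped. The sketch below assumes a hypothetical situation in which one employee has no role recorded and compares the default with `how="left"`, which keeps every row of the left DataFrame and fills the missing role with `NaN`.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee_ID": [1, 2, 3],
    "Name": ["John", "Anna", "Peter"],
})
# Hypothetical roles table with no entry for employee 3.
df_roles = pd.DataFrame({
    "Employee_ID": [1, 2],
    "Role": ["Developer", "Data Scientist"],
})

inner = pd.merge(df, df_roles, on="Employee_ID")             # default how="inner"
left = pd.merge(df, df_roles, on="Employee_ID", how="left")  # keep all rows of df

print(inner)
print(left)
```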


If a new employee joins, we can append their record to the existing DataFrame with `pd.concat()`.

In [84]:
df_new_emp = pd.DataFrame({
    "Employee_ID": [10],
    "Name": ["Jon"],
    "City": ["Alaska"],
    "Age": [42],
    "Role": ["Developer"]
})

df_new_emp

,Employee_ID,Name,City,Age,Role
0,10,Jon,Alaska,42,Developer


In [97]:
df_concat = pd.concat([df_merged, df_new_emp])
df_concat.tail(3)

,Name,Age,City,Employee_ID,Role
3,Linda,32,London,4,HR Manager
4,James,41,Toronto,5,Accountant
0,Jon,42,Alaska,10,Developer


This has stacked the two DataFrames, but the original index labels were kept, so two records now share index `0`.

In [91]:
df_concat.loc[0]

,Name,Age,City,Employee_ID,Role
0,John,28,New York,1,Developer
0,Jon,42,Alaska,10,Developer


To renumber the rows, we can use `reset_index()` again. Passing `drop=True` discards the old index instead of adding it as an `index` column.

In [98]:
df_concat.reset_index(drop=True, inplace=True)
df_concat

,Name,Age,City,Employee_ID,Role
0,John,28,New York,1,Developer
1,Anna,34,Paris,2,Data Scientist
2,Peter,29,Berlin,3,Developer
3,Linda,32,London,4,HR Manager
4,James,41,Toronto,5,Accountant
5,Jon,42,Alaska,10,Developer
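
The renumbering can also be done during concatenation: `pd.concat()` accepts `ignore_index=True`, which discards the incoming index labels and assigns a fresh 0 to n-1 index. A minimal sketch with shortened sample data; note that columns are aligned by name, so their order in the second DataFrame does not matter.

```python
import pandas as pd

df_merged = pd.DataFrame({"Name": ["John", "Anna"], "Age": [28, 34]})
df_new_emp = pd.DataFrame({"Age": [42], "Name": ["Jon"]})

# ignore_index=True avoids the duplicate index label 0.
df_concat = pd.concat([df_merged, df_new_emp], ignore_index=True)
print(df_concat)
```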


## Additional Operations

### Renaming Columns

Every DataFrame has column names, which can be inspected through the `columns` attribute.

In [99]:
df.columns

Index(['Name', 'Age', 'City', 'Employee_ID'], dtype='object')

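To rename a column, say `Employee_ID` to `EmpID`, pass a mapping of old to new names to `rename()`. This returns a new DataFrame and leaves `df` unchanged unless the result is assigned back.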

In [102]:
df.rename(columns={"Employee_ID": "EmpID"}).head(2)

,Name,Age,City,EmpID
0,John,28,New York,1
1,Anna,34,Paris,2
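
Because `rename()` returns a new DataFrame, the original column names remain in place until the result is stored. The sketch below keeps both versions side by side to show the difference; `renamed` is an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Anna"], "Employee_ID": [1, 2]})

# The result must be stored for the new name to persist.
renamed = df.rename(columns={"Employee_ID": "EmpID"})

print(df.columns.tolist())
print(renamed.columns.tolist())
```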
