# Quick Introduction to Pandas DataFrame for Machine Learning and Deep Learning

This tutorial is a quick introduction to [**DataFrames**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), the central data structure in the pandas API. It is not a comprehensive guide; it covers only the DataFrame essentials you need to get started with Machine Learning and Deep Learning.

Think of a DataFrame as an in-memory spreadsheet. Similar to a spreadsheet:

  * Data is stored in cells within a DataFrame.
  * The DataFrame features named columns (usually) and numbered rows.

## Importing NumPy and Pandas Modules

Execute the code cell below to import the NumPy and pandas modules.

In [1]:
# Importing the NumPy library and aliasing it as 'np' for convenience
import numpy as np

# Importing the Pandas library and aliasing it as 'pd' for convenience
import pandas as pd

## Creating a Simple DataFrame

In the code cell below, a basic DataFrame is constructed, consisting of 10 cells organized as follows:

  * 5 rows
  * 2 columns: one named `temperature` and the other named `activity`

The code instantiates the `pd.DataFrame` class with two keyword arguments:

  * `data` supplies the values for the 10 cells, here a 5x2 NumPy array built with `np.array`.
  * `columns` specifies the column names for the DataFrame.

**Note**: Avoid redefining variables in the code cells below, as they are utilized further in this tutorial.

In [2]:
# Create and populate a 5x2 NumPy array.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])

# Create a Python list that holds the names of the two columns.
my_column_names = ['temperature', 'activity']

# Create a DataFrame using the pd.DataFrame class.
# The 'data' parameter is assigned the NumPy array, and 'columns' parameter
# is assigned the column names.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame to display the constructed data structure.
print(my_dataframe)

   temperature  activity
0            0         3
1           10         7
2           20         9
3           30        14
4           40        15
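
A quick way to confirm the layout described above is to inspect a few attributes of the DataFrame. The sketch below rebuilds the same data under a separate, hypothetical name (`demo_df`) so that the tutorial's own variables stay untouched; the attribute names `shape`, `columns`, and `dtypes` are part of the pandas API.

```python
import numpy as np
import pandas as pd

# Rebuild the same 5x2 data under a separate, hypothetical name so the
# tutorial's own variables are left untouched.
demo_df = pd.DataFrame(
    data=np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]]),
    columns=["temperature", "activity"],
)

print(demo_df.shape)          # (number of rows, number of columns)
print(list(demo_df.columns))  # the column names, in order
print(demo_df.dtypes)         # one dtype per column (platform-dependent integer type)
```

Checking `shape` right after construction is a cheap guard against accidentally transposed or truncated input data.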


## Adding a New Column to a DataFrame

Adding a column to a pandas DataFrame is straightforward: assign values to a new column name. In the example below, a third column named `adjusted` is added to the existing DataFrame (`my_dataframe`):

In [3]:
# Create a new column named 'adjusted' in the DataFrame.
# The values in the 'adjusted' column are derived by adding 2 to the
# corresponding values in the 'activity' column.
my_dataframe["adjusted"] = my_dataframe["activity"] + 2

# Print the entire DataFrame to display the updated structure
# with the new 'adjusted' column.
print(my_dataframe)

   temperature  activity  adjusted
0            0         3         5
1           10         7         9
2           20         9        11
3           30        14        16
4           40        15        17
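
Assigning to a new column name modifies the DataFrame in place. As a possible alternative, the sketch below uses the `assign` method, which returns a new DataFrame and leaves the original as it was. The names `demo_df` and `with_adjusted` are hypothetical and chosen to avoid clashing with the tutorial's variables.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's data.
demo_df = pd.DataFrame({
    "temperature": [0, 10, 20, 30, 40],
    "activity": [3, 7, 9, 14, 15],
})

# assign() returns a new DataFrame; demo_df itself keeps its two columns.
with_adjusted = demo_df.assign(adjusted=demo_df["activity"] + 2)

print(with_adjusted)
print(list(demo_df.columns))  # still only 'temperature' and 'activity'
```

Which form to prefer is a matter of style: in-place assignment is shorter, while `assign` is convenient when you want to keep the original DataFrame unchanged.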


## Selecting a Subset of a DataFrame

pandas offers several ways to extract specific rows, columns, slices, or cells from a DataFrame. The cell below uses `head`, `iloc`, row slicing, and column-name indexing.

In [4]:
# Print the first 3 rows (index 0, 1, and 2) of the DataFrame.
print(f"Rows #0, #1, and #2: \n{my_dataframe.head(3)} \n")

# Print the third row (index 2) of the DataFrame.
print(f"Row #2:\n {my_dataframe.iloc[[2]]} \n")

# Print rows #1, #2, and #3 of the DataFrame using slicing.
print(f"Rows #1, #2, and #3: \n{my_dataframe[1:4]} \n")

# Print the entire 'temperature' column from the DataFrame.
print(f"Column 'temperature': \n{my_dataframe['temperature']} \n")

Rows #0, #1, and #2: 
   temperature  activity  adjusted
0            0         3         5
1           10         7         9
2           20         9        11 

Row #2:
    temperature  activity  adjusted
2           20         9        11 

Rows #1, #2, and #3: 
   temperature  activity  adjusted
1           10         7         9
2           20         9        11
3           30        14        16 

Column 'temperature': 
0     0
1    10
2    20
3    30
4    40
Name: temperature, dtype: int64 
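
Beyond the selections shown above, pandas also supports label-based selection with `loc`, combined row and column selection with `iloc`, and filtering with a boolean mask. The following is a minimal sketch on a hypothetical `demo_df` with the same values; the threshold of 8 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's data.
demo_df = pd.DataFrame({
    "temperature": [0, 10, 20, 30, 40],
    "activity": [3, 7, 9, 14, 15],
})

# Position-based: rows at positions 1, 2, 3 (end exclusive), first column only.
by_position = demo_df.iloc[1:4, 0]

# Label-based: rows labeled 1 through 3 (end inclusive), two named columns.
by_label = demo_df.loc[1:3, ["temperature", "activity"]]

# Boolean mask: keep only the rows where 'activity' exceeds 8.
busy_rows = demo_df[demo_df["activity"] > 8]

# Single cell, addressed by row label and column name.
one_cell = demo_df.at[2, "activity"]

print(by_position, by_label, busy_rows, one_cell, sep="\n\n")
```

Note the asymmetry: an `iloc` slice excludes its end position, whereas a `loc` slice includes its end label.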



## Example: Creating a DataFrame

The code below accomplishes the following steps:

  1. Creates a 3x4 (3 rows x 4 columns) pandas DataFrame with columns named `Eleanor`, `Chidi`, `Tahani`, and `Jason`. Each of the 12 cells is filled with a random integer between 0 and 100, inclusive.

  2. Outputs:

     * The entire DataFrame
     * The value in the cell of row #1 in the `Eleanor` column

  3. Adds a fifth column named `Janet`, populated with the row-wise sums of `Tahani` and `Jason`.


In [5]:
# Step 1: Create a 3x4 DataFrame with random integers between 0 and 100, inclusive.
# The upper bound passed to np.random.randint is exclusive, hence 101.
df_example = pd.DataFrame({
    'Eleanor': np.random.randint(0, 101, 3),
    'Chidi': np.random.randint(0, 101, 3),
    'Tahani': np.random.randint(0, 101, 3),
    'Jason': np.random.randint(0, 101, 3)
})

# Step 2: Output the DataFrame and a specific cell value.
print(f"Entire DataFrame:\n{df_example} \n")

# Print the value in the cell of row #1 in the 'Eleanor' column.
print(f"Value in cell (Row #1, 'Eleanor'): {df_example.at[1, 'Eleanor']} \n")

# Step 3: Create a new column 'Janet' with row-wise sums of 'Tahani' and 'Jason'.
df_example['Janet'] = df_example['Tahani'] + df_example['Jason']

# Display the updated DataFrame with the new 'Janet' column.
print(f"DataFrame with 'Janet' column:\n{df_example}")

Entire DataFrame:
   Eleanor  Chidi  Tahani  Jason
0       65     65      16     18
1       60     55      50     47
2       24     15      55     20 

Value in cell (Row #1, 'Eleanor'): 60 

DataFrame with 'Janet' column:
   Eleanor  Chidi  Tahani  Jason  Janet
0       65     65      16     18     34
1       60     55      50     47     97
2       24     15      55     20     75
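
Because the values are random, the numbers above will differ on every run. If reproducible output is wanted, one option is NumPy's seeded `Generator` API, sketched below. The seed value 42 and the names `rng`, `names`, and `seeded_df` are arbitrary choices for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same sequence on every run.
rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

names = ["Eleanor", "Chidi", "Tahani", "Jason"]

# integers() excludes the upper bound by default, so 101 yields values 0..100.
seeded_df = pd.DataFrame(rng.integers(0, 101, size=(3, 4)), columns=names)

# Same row-wise sum as in the example above.
seeded_df["Janet"] = seeded_df["Tahani"] + seeded_df["Jason"]

print(seeded_df)
```

Generating the whole 3x4 block in one call, rather than one call per column, also keeps the shape of the data explicit in a single place.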


## Duplicating a DataFrame

pandas gives you two ways to obtain a second variable for a DataFrame, but only one of them actually duplicates the data:

* **Referencing:** Assigning a DataFrame to a new variable creates a reference to the same object. A modification made through either variable is visible through both.

* **Copying:** Calling the `pd.DataFrame.copy` method creates an independent copy. Changes made to the original DataFrame do not affect the copy, and vice versa.

This distinction is subtle yet crucial.

In [6]:
# Experiment with a reference: Creating a reference to the DataFrame.
reference_to_df = df_example

# Print the starting value of a specific cell in both df_example and reference_to_df.
print(f"  Starting value of df: {df_example['Jason'][1]}")
print(f"  Starting value of reference_to_df: {reference_to_df['Jason'][1]}\n")

# Modify a cell in df_example; the change is also visible through reference_to_df.
df_example.at[1, 'Jason'] = df_example['Jason'][1] + 5
print(f"  Updated df: {df_example['Jason'][1]}")
print(f"  Updated reference_to_df: {reference_to_df['Jason'][1]}\n\n")

# Experiment with a true copy: Creating a true independent copy of the DataFrame.
copy_of_my_dataframe = my_dataframe.copy()

# Print the starting value of a specific cell in both my_dataframe and copy_of_my_dataframe.
print(f"  Starting value of my_dataframe: {my_dataframe['activity'][1]}")
print(f"  Starting value of copy_of_my_dataframe: {copy_of_my_dataframe['activity'][1]}\n")

# Modify a cell in my_dataframe, ensuring that copy_of_my_dataframe remains unchanged.
my_dataframe.at[1, 'activity'] = my_dataframe['activity'][1] + 3
print(f"  Updated my_dataframe: {my_dataframe['activity'][1]}")
print(f"  copy_of_my_dataframe does not get updated: {copy_of_my_dataframe['activity'][1]}")

  Starting value of df: 47
  Starting value of reference_to_df: 47

  Updated df: 52
  Updated reference_to_df: 52


  Starting value of my_dataframe: 7
  Starting value of copy_of_my_dataframe: 7

  Updated my_dataframe: 10
  copy_of_my_dataframe does not get updated: 7
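
The difference between a reference and a copy can also be checked directly with Python's `is` operator, which tests whether two names point to the same object. The sketch below uses a small hypothetical DataFrame (`original`, `alias`, `clone`) rather than the tutorial's variables.

```python
import pandas as pd

original = pd.DataFrame({"activity": [3, 7, 9]})

alias = original         # a second name for the same object
clone = original.copy()  # a separate object with its own data (deep copy by default)

print(alias is original)  # True
print(clone is original)  # False

# Modify one cell through the original name.
original.at[1, "activity"] = 100

print(alias.at[1, "activity"])  # 100: the alias sees the change
print(clone.at[1, "activity"])  # 7: the copy is unaffected
```

In Machine Learning workflows this matters whenever you preprocess data: copy first if the untouched original is still needed later.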
