# Part III: Python Packages for Data Analysis

In this course, we will use various Python packages to perform data manipulation, data analysis, data visualization, and numerical optimization.
Python packages (or libraries) are collections of modules that provide specific functions or features. We can install and import these packages directly in our Google Colab notebooks to use them in our code. Below are some popular packages we will use and their purpose:

- **Data handling and analysis**: These packages enable working with diverse data types like arrays, data frames, or time series. They also facilitate tasks like cleaning, processing, aggregating, and manipulating data. Key packages include:
  - *NumPy*: Supports multidimensional arrays and matrices, along with mathematical functions and operations on them. NumPy is ideal for scientific computing needs like creating/manipulating arrays, linear algebra, and mathematical applications.
  - *Pandas*: Offers high-performance data structures and analysis tools for tabular data. Its DataFrame can store/process data in rows and columns with multiple methods for indexing, filtering, grouping, and reshaping. Pandas allows reading, writing, and exploring data from sources like CSV, Excel, SQL, etc.
- **Data visualization**: These packages help us to create and display graphical representations of data, such as charts, plots, or maps. They also enable customizing the appearance, style, and interactivity of the visualizations. The following key packages provide a variety of plots that are suitable for exploring and understanding data.
  - *Matplotlib*: Leading Python plotting library to create and adapt figures, axes and other plot elements. It supports diverse plot types including bar, pie, scatter, histograms etc. We will employ Matplotlib for basic data visualization needs.
  - *Seaborn*: High-level statistical data visualization library built on Matplotlib. Seaborn integrates well with Pandas, offering enhanced, informative plots like heatmaps, boxplots etc. to represent data.
- **Numerical optimization**: These packages help us to formulate and solve optimization problems, such as linear programming, integer programming, or nonlinear programming. They also provide tools for modeling, solving, and analyzing the results of the optimization problems. Some of the packages that we will use for this purpose are:
  - *pyomo*: A Python-based open-source software for modeling and solving optimization problems. It supports a variety of problem types, such as linear, quadratic, nonlinear, mixed-integer, stochastic, and bilevel programming. It also supports a variety of solvers, such as GLPK, Gurobi, IBM Cplex, and more. We will model optimization problems in Pyomo leveraging its algebraic modeling capabilities.
  - *glpk*: A free software for solving large-scale linear programming and mixed integer programming problems. It implements the simplex method, the interior-point method, and the branch-and-cut method. It also provides a standalone solver that can be called from the command line or from other programs. GLPK offers Python APIs to interface optimization problems.
  - *gurobipy*: A Python interface for Gurobi, one of the leading commercial solvers that can handle various optimization problems, such as linear programming, mixed-integer programming, and quadratic programming. The GurobiPy package provides Python bindings to formulate and solve problems. We will use GurobiPy to integrate Gurobi with Pyomo.

> *Note*: The `gurobipy` package installed with `pip` includes a size-limited license that is sufficient for small optimization problems. To solve larger problems in Google Colab, you need a Web License Service (WLS) license, which is free for academic purposes. For more info visit [this link](https://support.gurobi.com/hc/en-us/articles/4409582394769-Google-Colab-Installation-and-Licensing).

To install Python packages, we use the `!pip install` command in a code cell. This command downloads and installs the packages from the Python Package Index (PyPI), which is a repository of software for the Python programming language.


## NumPy: A package for scientific computing

**NumPy** stands for Numerical Python and provides an efficient way to store and manipulate numerical data in Python. NumPy arrays are the main objects of the library, and they are similar to lists in Python, but can have any number of dimensions and store data of the same type. NumPy arrays are faster and more compact than lists, and support a wide range of mathematical operations and functions.

To install NumPy, create a new code cell and run the following command:

In [None]:
!pip install numpy

This will display the output of the installation process, such as the version of the package, the dependencies, and the location of the files.

To use NumPy, you need to import it in your Python program. The convention is to use the alias `np` for NumPy, like this:

In [None]:
import numpy as np

### Creating NumPy arrays and matrices

There are several ways to create NumPy arrays, depending on the source and shape of the data. Here are some common methods:

- `np.array()`: This function takes a sequence (such as a list, tuple, or another array) and converts it into a NumPy array. You can specify the data type of the array using the `dtype` argument, otherwise it will be inferred from the data. For example:

In [None]:
# Create a one-dimensional array from a list
a = np.array([1, 2, 3, 4, 5])
print(a)
# Output: [1 2 3 4 5]

# Create a two-dimensional array (a matrix) from a nested list
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(b)
# Output: [[1 2 3]
#          [4 5 6]
#          [7 8 9]]

# Create an array of floating-point numbers from a range
c = np.array(range(5), dtype=float)
print(c)
# Output: [0. 1. 2. 3. 4.]
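
The `dtype` inference mentioned above can be checked directly. The following is a small sketch (the variable names `ints`, `mixed`, and `as_float` are only illustrative) showing how NumPy chooses a data type from the input values and how an explicit `dtype` overrides that choice:

```python
import numpy as np

# Integers only: NumPy infers an integer dtype
ints = np.array([1, 2, 3])

# Mixing an integer with a float promotes the whole array to float64
mixed = np.array([1, 2.5, 3])

# An explicit dtype overrides the inference
as_float = np.array([1, 2, 3], dtype=np.float64)

print(ints.dtype.kind)  # 'i' (signed integer; the exact width depends on the platform)
print(mixed.dtype)      # float64
print(as_float.dtype)   # float64
```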

- `np.arange()`: This function creates an array of evenly spaced values within a given interval. It takes three arguments: `start`, `stop`, and `step`, where `start` is the first value, `stop` is the last value (exclusive), and `step` is the increment. If only one argument is given, it is treated as `stop` and `start` is set to zero.
- `np.linspace()`: This function creates an array of evenly spaced values within a given interval. It takes three arguments: `start`, `stop`, and `num`, where `start` is the first value, `stop` is the last value (inclusive), and `num` is the number of values.
  
For example:

In [None]:
# Create an array of integers from 0 to 9
d = np.arange(10)
print(d)
# Output: [0 1 2 3 4 5 6 7 8 9]

# Create an array of odd numbers from 1 to 19
e = np.arange(1, 20, 2)
print(e)
# Output: [ 1  3  5  7  9 11 13 15 17 19]

# Create an array of 10 values from 0 to 1
f = np.linspace(0, 1, 10)
print(f)
# Output: [0.         0.11111111 0.22222222 0.33333333 0.44444444 0.55555556
#          0.66666667 0.77777778 0.88888889 1.        ]

# Create an array of 5 values from -1 to 1
g = np.linspace(-1, 1, 5)
print(g)
# Output: [-1.  -0.5  0.   0.5  1. ]
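
To make the difference between the two functions concrete, here is a short sketch that builds the same grid in both ways. It also uses the `endpoint` argument of `np.linspace()`, which is not covered above; setting it to `False` excludes the `stop` value:

```python
import numpy as np

# arange: fixed step, stop value excluded
by_step = np.arange(0, 1, 0.25)

# linspace: fixed number of points, stop value included by default
by_count = np.linspace(0, 1, 5)

# endpoint=False excludes the stop value, as arange does
half_open = np.linspace(0, 1, 4, endpoint=False)

print(by_step)    # [0.   0.25 0.5  0.75]
print(by_count)   # [0.   0.25 0.5  0.75 1.  ]
print(half_open)  # [0.   0.25 0.5  0.75]
```

As a rule of thumb, `np.arange()` is convenient when the step size matters and `np.linspace()` when the number of points matters.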

- `np.zeros()`: This function creates an array of zeros with a given `shape`. The `shape` can be an integer or a tuple of integers, representing the size of each dimension. The data type of the array can be specified using the `dtype` argument, otherwise it will be `float`.
- `np.ones()`: This function is similar to `np.zeros()`, except that it creates an array of ones with a given `shape`.
- `np.full()`: This function is similar to `np.zeros()`, except that it creates an array of a specified value with a given `shape`.
- `np.eye()`: This function creates an `n` x `n` identity matrix for a given size `n`. The data type of the array can be specified using the `dtype` argument, otherwise it will be `float`.
- `np.diag()`: This function creates a diagonal matrix from a given one-dimensional array of values. It has no `dtype` argument; the result has the same data type as the input values.

For example:

In [None]:
# Create a one-dimensional array of 5 zeros
h = np.zeros(5)
print(h)
# Output: [0. 0. 0. 0. 0.]

# Create a two-dimensional array of 3x4 zeros
i = np.zeros((3, 4))
print(i)
# Output: [[0. 0. 0. 0.]
#          [0. 0. 0. 0.]
#          [0. 0. 0. 0.]]

# Create a one-dimensional array of 5 ones
j = np.ones(5)
print(j)
# Output: [1. 1. 1. 1. 1.]

# Create a two-dimensional array of 3x4 ones
k = np.ones((3, 4))
print(k)
# Output: [[1. 1. 1. 1.]
#          [1. 1. 1. 1.]
#          [1. 1. 1. 1.]]

# Create a 3x3 identity matrix
l = np.eye(3)
print(l)
# Output: [[1. 0. 0.]
#          [0. 1. 0.]
#          [0. 0. 1.]]

# Create a diagonal matrix from a list of diagonal values
m = np.diag([1, 2, 3])
print(m)
# Output: [[1 0 0]
#          [0 2 0]
#          [0 0 3]]

# Create a 2x3 array filled with -1.0
n = np.full((2, 3), -1, dtype=float)
print(n)
# Output: [[-1. -1. -1.]
#          [-1. -1. -1.]]
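
A few more details about these constructors, shown as a brief sketch (the variable names are illustrative): `np.diag()` works in both directions, and the default `float` type of `np.zeros()` can be changed with `dtype`.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Given a two-dimensional array, np.diag() extracts the main diagonal
main_diag = np.diag(m)
print(main_diag)  # [1 5 9]

# Given a one-dimensional array, it builds a matrix with the input's dtype
d = np.diag([1.5, 2.5])
print(d.dtype)  # float64

# np.zeros() defaults to float, but dtype can request integers
z = np.zeros((2, 2), dtype=int)
print(z.dtype.kind)  # 'i'
```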

### Indexing and slicing

Indexing and slicing operations are used to access specific elements or sub-arrays of a NumPy array:

In [None]:
# The syntax for indexing and slicing a one-dimensional array is similar to
# that of Python lists.

# Create a one-dimensional array
a = np.array([10, 20, 30, 40, 50])

# Index the first element
print(a[0])  # Output: 10

# Index the last element
print(a[-1])  # Output: 50

# Slice the first three elements
print(a[:3])  # Output: [10 20 30]

# Slice the last two elements
print(a[-2:])  # Output: [40 50]

In [None]:
# Indexing and slicing for multi-dimensional NumPy arrays are more complex,
# but also more powerful. You can use commas to separate the indices or slices
# for each dimension of the array.

# Create a two-dimensional array
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Index the element at row 1 and column 2
print(b[1, 2])  # Output: 6

# Index the element at row 0 and column -1
print(b[0, -1])  # Output: 3

# Slice the first two rows and all columns
print(b[:2, :])
# Output: [[1 2 3]
#          [4 5 6]]

# Slice the last row and the first two columns
print(b[-1, :2])  # Output: [7 8]

# You can also use advanced indexing techniques, such as integer arrays or boolean arrays,
# to select arbitrary elements or subarrays from an array.

# Create a two-dimensional array
c = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

# Use an integer array to index the rows
d = np.array([0, 2, 1])
print(c[d, :])
# Output: [[10 20 30]
#          [70 80 90]
#          [40 50 60]]

# Use a boolean array to index the columns
e = np.array([True, False, True])
print(c[:, e])
# Output: [[10 30]
#          [40 60]
#          [70 90]]

# You can modify an array by assigning a new value to an index or subarray.
print(b)
# Output: [[1 2 3]
#          [4 5 6]
#          [7 8 9]]
b[0, 1] *= 10
print(b)
# Output: [[ 1 20  3]
#          [ 4  5  6]
#          [ 7  8  9]]
b[:, 2] = [-1, 0, 1]
print(b)
# Output: [[ 1 20 -1]
#          [ 4  5  0]
#          [ 7  8  1]]
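
One point worth keeping in mind when modifying arrays: a basic slice is a view that shares memory with the original array, whereas indexing with an integer or boolean array returns a copy. The sketch below illustrates this behaviour; the names `base`, `view`, `independent`, and `picked` are only for illustration.

```python
import numpy as np

base = np.array([10, 20, 30, 40, 50])

# A basic slice is a view: writing to it changes the original array
view = base[:3]
view[0] = -1
print(base)  # [-1 20 30 40 50]

# .copy() creates independent data
independent = base[:3].copy()
independent[1] = 999
print(base)  # [-1 20 30 40 50]

# Indexing with a list or integer array also returns a copy
picked = base[[3, 4]]
picked[0] = 0
print(base)  # [-1 20 30 40 50]
```

If you need to change a sub-array without touching the original data, call `.copy()` on the slice first.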

### Filtering

Filtering is a technique to select a subset of an array that satisfies some criteria. For example, you may want to filter an array to get only the positive values, or only the even values, or only the values that match a certain condition. One way to filter an array is to use a boolean mask. A boolean mask is an array of the same shape as the original array, but with True or False values indicating which elements to keep or discard. You can create a boolean mask by applying a logical expression to the array. For example:

In [None]:
# Create a two-dimensional array
arr = np.array([[10, -20, 30], [-40, 50, -60], [70, -80, 90]])
print(arr)
# Output: [[ 10 -20  30]
#          [-40  50 -60]
#          [ 70 -80  90]]

# Create a boolean mask for positive values
mask = arr > 0
print(mask)
# Output: [[ True False  True]
#          [False  True False]
#          [ True False  True]]

# Apply the mask to the array to get the filtered array
filtered_arr = arr[mask]
print(filtered_arr)
# Output: [10 30 50 70 90]

Another way to filter an array is to use the `np.where()` function. This function takes a condition and returns the indices of the array where the condition is `True`. You can then use these indices to index the array and get the filtered array. For example:

In [None]:
arr = np.array([1, 2, 3, 4, 5])

# Use np.where() to get the indices of even values
indices = np.where(arr % 2 == 0)
print(indices)
# Output: (array([1, 3]),)

# Use the indices to index the array and get the filtered array
filtered = arr[indices]
print(filtered)
# Output: [2 4]
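
Two related patterns are sketched below under the same ideas: combining several conditions into one mask, and calling `np.where()` with three arguments, in which case it returns values instead of indices. The array and variable names are illustrative.

```python
import numpy as np

arr = np.array([10, -20, 30, -40, 50])

# Combine conditions with & (and) or | (or); each condition needs parentheses
in_range = arr[(arr > 0) & (arr < 40)]
print(in_range)  # [10 30]

# np.where(condition, x, y) takes values from x where the condition is True
# and from y elsewhere
clipped = np.where(arr > 0, arr, 0)
print(clipped)  # [10  0 30  0 50]
```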

### Basic operations and methods

NumPy arrays support many basic operations that work element-wise, either between two arrays or between an array and a scalar. In an array-array operation, elements at matching positions are combined; in an array-scalar operation, the scalar is combined with every element of the array. Here are some examples of basic operations:

- Arithmetic operations: NumPy arrays support the standard arithmetic operators `+`, `-`, `*`, `/`, and `**` for addition, subtraction, multiplication, division, and exponentiation, respectively. These operators can be used between two arrays of the same shape, or between an array and a scalar. For example:

In [None]:
# Create two arrays of the same shape
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Add the two arrays element-wise
print(a + b)
# Output: [5 7 9]

# Subtract the two arrays element-wise
print(a - b)
# Output: [-3 -3 -3]

# Multiply the two arrays element-wise
print(a * b)
# Output: [ 4 10 18]

# Divide the two arrays element-wise
print(a / b)
# Output: [0.25 0.4  0.5 ]

# Raise the first array to the power of the second array element-wise
print(a**b)
# Output: [  1  32 729]

# Add a scalar to an array
print(a + 10)
# Output: [11 12 13]

# Multiply an array by a scalar
print(a * 2)
# Output: [2 4 6]

- Comparison operations: NumPy arrays support the standard comparison operators `<`, `>`, `<=`, `>=`, `==`, and `!=` for less than, greater than, less than or equal to, greater than or equal to, equal to, and not equal to, respectively. These operators can be used between two arrays of the same shape, or between an array and a scalar. The result is a boolean array of the same shape, indicating the outcome of each comparison. For example:

In [None]:
# Create two arrays of the same shape
a = np.array([1, 2, 3])
b = np.array([3, 2, 1])

# Compare the two arrays element-wise
print(a < b)
# Output: [ True False False]

are_equal = a == b
print(are_equal)
# Output: [False  True False]

# Compare an array with a scalar
print(a > 2)
# Output: [False False  True]

print(a != 2)
# Output: [ True False  True]
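
Because a comparison returns a boolean array rather than a single `True` or `False`, it cannot be used directly in an `if` statement. A minimal sketch of the usual ways to reduce it to one value, using the `any()` and `all()` methods and the `np.array_equal()` function:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 2, 1])

# True if at least one pair of elements is equal
print((a == b).any())  # True

# True only if every pair of elements is equal
print((a == b).all())  # False

# np.array_equal() compares shape and all elements in one call
print(np.array_equal(a, b[::-1]))  # True
```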

NumPy arrays have many built-in methods and functions that can perform various operations on the data. Some of these methods and functions are:

| Method | Description |
| --- | --- |
| `sum()` | Calculate the sum of all elements in the array, or along a specified axis. |
| `max()` | Return the maximum value in the array, or along a specified axis. |
| `min()` | Return the minimum value in the array, or along a specified axis. |
| `mean()` | Calculate the mean of all elements in the array, or along a specified axis. |
| `std()` | Calculate the standard deviation of all elements in the array, or along a specified axis. |
| `argmax()` | Return the index of the maximum value in the array, or along a specified axis. |
| `argmin()` | Return the index of the minimum value in the array, or along a specified axis. |
| `dot()` | Calculate the dot product of two arrays. |
| `reshape()` | Reshape an array. |
| `flatten()` | Flatten an array. |
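| `sort()` | Sort an array in place (`np.sort()` returns a sorted copy instead). |
| `np.unique()` | Find the sorted unique elements in an array (available as a function only, not as an array method). |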
| `transpose()` or `T` | Transpose an array. |

> *Notes*: 
> - You apply the above methods on NumPy arrays using the `.` operator, or you can use the equivalent functions with the `np.` prefix. For example, `np.sum(a)` is equivalent to `a.sum()`.
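> - The `axis` argument selects the dimension that is collapsed. For a two-dimensional array, `axis=0` aggregates down the rows and returns one result per column, while `axis=1` aggregates across the columns and returns one result per row.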

Example:

In [None]:
# Create a two-dimensional array
a = np.array([[1, 2, 3], [6, 5, 4]])
print(a)
# Output: [[1 2 3]
#          [6 5 4]]

# Sum all the elements
print(a.sum())  # Output: 21

# Sum the elements along the rows
print(a.sum(axis=1))  # Output: [ 6 15]

# Sum the elements along the columns
print(a.sum(axis=0))  # Output: [7 7 7]

print(a.min(), a.max(), sep=", ")  # Output: 1, 6
print("row minimums:", a.min(axis=1))  # Output: row minimums: [1 4]
print("column maximums:", a.max(axis=0))  # Output: column maximums: [6 5 4]

# Calculate the average of all elements
print(a.mean())  # Output: 3.5

# Calculate the standard deviation
print(a.std())  # Output: 1.707825127659933

# Calculate the mean along the rows
print(a.mean(axis=1))  # Output: [2. 5.]

# Calculate the standard deviation along the columns
print(a.std(axis=0))  # Output: [2.5 1.5 0.5]

# Find the index of the minimum and maximum elements
print(a.argmin(axis=1), a.argmax(), sep=", ")  # Output: [0 2], 3

In [None]:
# Transpose the array
print(a.T)
# Output: [[1 6]
#          [2 5]
#          [3 4]]
print(a.transpose() == a.T)
# Output: [[ True  True]
#          [ True  True]
#          [ True  True]]

# Sort the array inplace
a.sort()
print(a)
# Output: [[1 2 3]
#          [4 5 6]]

b = a.T  # a.T is a view of a, so an in-place sort of b would also modify a
b.sort(axis=0)  # Sort each column
print(b)
# Output: [[1 4]
#          [2 5]
#          [3 6]]

# Create two one-dimensional arrays
c = np.array([1, 2, 3])
d = np.array([4, 5, 6])

# Calculate the dot product
print(c.dot(d))  # Output: 32
# Using the np.dot() function also works
print(np.dot(c, d))  # Output: 32

e = np.array([[1, 2], [3, 4]])
f = np.array([[5, 6], [7, 8]])

# Calculate the dot product
print(e.dot(f))
# Output: [[19 22]
#          [43 50]]

# Reshape an array
print(a)
# Output: [[1 2 3]
#          [4 5 6]]
b = a.reshape(3, 2)
print(b)
# Output: [[1 2]
#          [3 4]
#          [5 6]]

b = a.flatten()
print(b)
# Output: [1 2 3 4 5 6]
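
The difference between sorting in place and sorting a copy, and the use of `np.unique()`, can be sketched as follows. The `return_counts` argument used here is an extra option of `np.unique()` that also reports how often each value occurs; the variable names are illustrative.

```python
import numpy as np

values = np.array([3, 1, 2, 3, 1])

# np.sort() returns a sorted copy and leaves the original untouched
sorted_copy = np.sort(values)
print(values)       # [3 1 2 3 1]
print(sorted_copy)  # [1 1 2 3 3]

# np.unique() returns the sorted unique values; return_counts adds frequencies
uniques, counts = np.unique(values, return_counts=True)
print(uniques)  # [1 2 3]
print(counts)   # [2 1 2]
```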

### Creating random arrays

The `random` module in NumPy provides various functions for generating random arrays. For example:


In [None]:
# Create an array of 5 values drawn uniformly from the interval [0, 1)
a = np.random.rand(5)
print(a)

# Create a 3x3 array of random integers from 1 to 99 (the upper bound 100 is excluded)
b = np.random.randint(1, 100, size=(3, 3))
print(b)

# You can specify the seed for the random number generator to ensure reproducibility
np.random.seed(42)
c = np.random.randint(1, 100, size=(3, 3))
print(c)
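
NumPy also offers a newer interface based on `np.random.default_rng()`, which creates a generator object with its own seed instead of relying on the global seed. The following sketch is an alternative to the functions above, not a replacement for them in this course; the variable names are illustrative.

```python
import numpy as np

# Create a generator with a fixed seed for reproducibility
rng = np.random.default_rng(42)

# 5 values drawn uniformly from [0, 1)
floats = rng.random(5)

# 3x3 integers from 1 to 100 (the upper bound 101 is excluded)
ints = rng.integers(1, 101, size=(3, 3))

# A second generator with the same seed reproduces the same numbers
rng_again = np.random.default_rng(42)
same_floats = rng_again.random(5)
print(np.array_equal(floats, same_floats))  # True
```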

## Pandas: A package for data analysis

**Pandas** is a Python library that is used for data manipulation and analysis. Built on top of NumPy, Pandas provides a convenient and efficient way to work with various types of data, such as tabular data, time series, matrices, etc.

To use Pandas, you need to install it in your Colab environment:

In [None]:
!pip install pandas

Now we can import it using the alias `pd`, like this:

In [None]:
import pandas as pd



Pandas introduces two new data structures to Python: `Series` and `DataFrame`, which are both based on NumPy arrays.

- `Series`: A Series is a one-dimensional array that can hold any type of data, such as numbers, strings, booleans, etc. A `Series` has an index, which labels each element in the array. Pandas `Series` can be created from a list, a dictionary, or a NumPy array. For example:

In [None]:
# Create a Series from a list
s = pd.Series([1, 2, 3, 4, 5])

# Print the Series
print(s)

# Access the element using indexing and slicing
print(s[0])  # Output: 1
values = s[2:-2]
print(values)
# Output: 2    3
#         dtype: int64
print(type(values))  # Prints <class 'pandas.core.series.Series'>
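
A `Series` does not have to use the default integer index. As a small sketch with made-up data, the example below creates one `Series` from a dictionary (the keys become the index labels) and one from a list with an explicit `index` argument:

```python
import pandas as pd

# Dictionary keys become the index labels
prices = pd.Series({"apple": 1.5, "banana": 0.5, "cherry": 3.0})
print(prices["banana"])        # 0.5
print(prices.iloc[0])          # 1.5 (first element by position)
print(prices.index.tolist())   # ['apple', 'banana', 'cherry']

# Labels can also be passed explicitly with the index argument
from_list = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(from_list.loc["c"])      # 30
```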

- `DataFrame`: A `DataFrame` is a two-dimensional array that can hold multiple columns of different types of data. A `DataFrame` has an index for the rows and columns, which can be customized or automatically generated. A `DataFrame` can be created from a dictionary, a list of lists, a NumPy array, or another `DataFrame`. For example:

In [None]:
my_data = {"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35], "gender": ["F", "M", "M"]}
client_data = pd.DataFrame(my_data)

# Show the DataFrame
client_data

In [None]:
# Each column in the DataFrame is a Series
print(client_data["name"])
print(type(client_data["name"]))  # Prints <class 'pandas.core.series.Series'>

# Therefore you can access the elements using the index
print(client_data["name"][0])  # Prints Alice

In [None]:
my_data = [[1, "B", True], [2, "A", False], [3, "C", True]]
my_columns = ["id", "grade", "passed"]
product_data = pd.DataFrame(my_data, columns=my_columns)

print(product_data)
# Output:
#    id grade  passed
# 0   1     B    True
# 1   2     A   False
# 2   3     C    True

# Use the `shape` attribute to get the number of rows and columns
print(product_data.shape)  # Output: (3, 3)
# Use the `len()` function to get the number of rows
print(len(product_data))  # Output: 3
print(len(product_data) == product_data.shape[0])  # Output: True

One of the most common ways to use pandas is to read data from an external file and store it in a `DataFrame`. Pandas provides several functions to read different types of files, such as `read_csv()`, `read_excel()`, `read_sql()`, `read_json()`, etc. These functions return a `DataFrame` object that can be manipulated further. For example:

In [None]:
# Read a CSV file from a URL and store it in a DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")

# Show the first five rows of the DataFrame
df.head()
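
The same functions also accept file-like objects, which is handy for trying them out without a file or an internet connection. The sketch below reads a small, made-up CSV text through `io.StringIO` and writes it back with `to_csv()`; the data and variable names are purely illustrative.

```python
import io

import pandas as pd

# A small CSV document held in memory (illustrative data)
csv_text = """name,age,city
Alice,25,Montreal
Bob,30,Toronto
"""

# read_csv() accepts a file-like object as well as a path or URL
people = pd.read_csv(io.StringIO(csv_text))
print(people.shape)         # (2, 3)
print(people["age"].sum())  # 55

# to_csv() writes the DataFrame back; index=False omits the row labels
buffer = io.StringIO()
people.to_csv(buffer, index=False)
print(buffer.getvalue().splitlines()[0])  # name,age,city
```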

### Indexing and manipulation of data

The `index` of a pandas `DataFrame` is a special object that labels the rows of the `DataFrame`. The `index` can be a single column or a combination of columns, and it can have any type of values, such as numbers, strings, dates, etc. The `index` can be used to identify, select, and manipulate the rows of the `DataFrame` in various ways.

- To access the `index` of a `DataFrame`, you can use the `index` attribute.
- By default, the `index` is an integer sequence starting from 0. To change the `index` of a `DataFrame`, you can use the `set_index()` method. This method takes one or more columns as arguments and returns a new `DataFrame` with those columns as the `index`. You can also use the `inplace` parameter to modify the original DataFrame instead of creating a new one. For example:

In [None]:
# Create a DataFrame from a dictionary
client_data = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "id": ["1001", "1002", "1003"],
        "age": [25, 30, 35],
        "gender": ["F", "M", "M"],
    }
)

client_data

In [None]:
print(client_data.index)
# Output: RangeIndex(start=0, stop=3, step=1)

# Set the 'id' column as the index of the DataFrame
new_client_data = client_data.set_index("id")  # This creates a new DataFrame with a new index
print(client_data)  # No changes to the original DataFrame!

client_data.set_index("id", inplace=True)  # This modifies the original DataFrame
print(client_data)
# Output:
#          name  age gender
# id
# 1001    Alice   25      F
# 1002      Bob   30      M
# 1003  Charlie   35      M

print(client_data.index)  # Prints Index(['1001', '1002', '1003'], dtype='object', name='id')
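
Setting an index moves the chosen column out of the regular columns, and `reset_index()` moves it back. A minimal sketch with a small illustrative `DataFrame` (the names `clients`, `indexed`, and `restored` are not part of the examples above):

```python
import pandas as pd

clients = pd.DataFrame({"id": ["1001", "1002"], "name": ["Alice", "Bob"]})

# After set_index(), "id" is the index and no longer a regular column
indexed = clients.set_index("id")
print(indexed.columns.tolist())  # ['name']
print(indexed.index.name)        # id

# reset_index() turns the index back into a column and restores 0, 1, ...
restored = indexed.reset_index()
print(restored.columns.tolist())  # ['id', 'name']
print(restored.index.tolist())    # [0, 1]
```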

#### Accessing data using `loc` and `iloc`

Pandas provides two indexers for accessing data in a `DataFrame` by label or by position: `loc` and `iloc`. Both use square brackets and can select rows, columns, or subsets of data.

- `loc`: The `loc[]` indexer is used for label-based indexing. It takes one or more labels and returns the rows or columns that match them. It also accepts boolean arrays and label slices, which include both endpoints.
- `iloc`: The `iloc[]` indexer is similar to `loc[]`, but it is used for position-based indexing. It accepts integer positions, integer arrays, and position slices, which exclude the end position as in Python lists. For example:

In [None]:
row = client_data.loc["1001"]  # Select a single row based on the index
print(row)
# Output:
# name    Alice
# age        25
# gender      F

rows = client_data.loc[["1001", "1002"]]  # Select multiple rows based on the index
print(rows)
# Output:
#      name  age gender
# id
# 1001 Alice  25      F
# 1002   Bob  30      M

row = client_data.iloc[0]  # Select a single row based on the position
print(row)
# Output:
# name    Alice
# age       25
# gender     F

rows = client_data.iloc[[0, 1]]  # Select multiple rows based on the position
print(rows)
# Output:
#      name  age gender
# id
# 1001 Alice  25      F
# 1002   Bob  30      M

rows = client_data.iloc[0:2]  # Select a slice of rows by position (position 2 is excluded)
print(rows)
# Output:
#      name  age gender
# id
# 1001 Alice  25      F
# 1002   Bob  30      M

# Select last two rows
rows = client_data.iloc[-2:]
print(rows)
# Output:
#          name  age gender
# id
# 1002      Bob   30      M
# 1003  Charlie   35      M

In [None]:
# Select columns (without `loc` or `iloc`)
column = client_data["name"]  # Select a single column based on the column name
print(column)
# Output:
# id
# 1001    Alice
# 1002      Bob
# 1003  Charlie

columns = client_data[["name", "age"]]  # Select multiple columns based on the column name
print(columns)
# Output:
#          name  age
# id
# 1001    Alice   25
# 1002      Bob   30
# 1003  Charlie   35

In [None]:
# Select columns with `loc`
column = client_data.loc[:, "name"]  # Select a single column based on the column name
print(column)
# Output:
# id
# 1001    Alice
# 1002      Bob
# 1003  Charlie

columns = client_data.loc[:, ["name", "age"]]  # Select multiple columns based on the column name
print(columns)
# Output:
#          name  age
# id
# 1001    Alice   25
# 1002      Bob   30
# 1003  Charlie   35

table = client_data.loc[["1001", "1002"], ["name", "age"]]  # Selecting rows and columns by labels
print(table)
# Output:
#          name  age
# id
# 1001    Alice   25
# 1002      Bob   30

In [None]:
# Select columns with `iloc`
columns = client_data.iloc[:, [0, 1]]  # Select multiple columns based on the position
print(columns)
# Output:
#          name  age
# id
# 1001    Alice   25
# 1002      Bob   30
# 1003  Charlie   35

table = client_data.iloc[-2:, 1:]  # Using positions
print(table)
# Output:
#       age gender
# id
# 1002   30      M
# 1003   35      M
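
One difference between the two indexers is easy to miss: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position. The sketch below shows this on a small illustrative `DataFrame`, together with a boolean mask in `loc` and the `at` accessor for a single value:

```python
import pandas as pd

clients = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]},
    index=["1001", "1002", "1003"],
)

# Label slice: both "1001" and "1002" are included
by_label = clients.loc["1001":"1002"]
print(len(by_label))  # 2

# Position slice: position 1 is excluded
by_position = clients.iloc[0:1]
print(len(by_position))  # 1

# loc also accepts a boolean mask for rows plus a column label
older = clients.loc[clients["age"] > 28, "name"]
print(older.tolist())  # ['Bob', 'Charlie']

# at[] reads a single value by row and column label
print(clients.at["1003", "age"])  # 35
```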

### Adding and removing rows and columns

Pandas allows you to add and remove rows and columns from a `DataFrame` using a label, `loc`, the `concat()` function, and the `drop()` method. For example:

In [None]:
client_data = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "id": ["1001", "1002", "1003"],
        "age": [25, 30, 35],
        "gender": ["F", "M", "M"],
    }
)

client_data

In [None]:
# Add a new row
new_row = ["Linda", "1005", 30, "F"]
# Add the new row using a label as the index
n = len(client_data)  # Evaluates to 3
client_data.loc[n] = new_row
# If the label already exists, the row will be replaced
print(client_data)
# Output:
#       name    id  age gender
# 0    Alice  1001   25      F
# 1      Bob  1002   30      M
# 2  Charlie  1003   35      M
# 3    Linda  1005   30      F

# Remove a row using its label as the index
client_data.drop(n, inplace=True)
print(client_data)
# Output:
#       name    id  age gender
# 0    Alice  1001   25      F
# 1      Bob  1002   30      M
# 2  Charlie  1003   35      M

# Remove the second row using its position
client_data.drop(client_data.index[1], inplace=True)
print(client_data)
# Output:
#       name    id  age gender
# 0    Alice  1001   25      F
# 2  Charlie  1003   35      M

In [None]:
# Appending rows from another DataFrame
new_clients = pd.DataFrame(
    {
        "name": ["Adam", "Eve"],
        "age": [18, 18],
        "gender": ["M", "F"],
        "id": ["2001", "2002"],
    }
)

all_clients = pd.concat([client_data, new_clients])
print(all_clients)
# Output:
#       name    id  age gender
# 0    Alice  1001   25      F
# 2  Charlie  1003   35      M
# 0     Adam  2001   18      M
# 1      Eve  2002   18      F

# Note that the index labels are not updated after appending.
# You can use the ignore_index parameter to reset the index labels:
all_clients = pd.concat([client_data, new_clients], ignore_index=True)
print(all_clients)
# Output:
#       name    id  age gender
# 0    Alice  1001   25      F
# 1  Charlie  1003   35      M
# 2     Adam  2001   18      M
# 3      Eve  2002   18      F

# You can use the `reset_index()` method to assign default indices:
all_clients.reset_index(drop=True, inplace=True)
# If `drop=True`, the old index will be removed and a new one will be created
# If `drop=False`, the old index will be kept as a new column and a new index will be created
print(all_clients)

# Adding a new column
all_clients["city"] = ["Montreal", "Vancouver", "Toronto", "Toronto"]
print(all_clients)
# Output:
#       name    id  age gender       city
# 0    Alice  1001   25      F   Montreal
# 1  Charlie  1003   35      M  Vancouver
# 2     Adam  2001   18      M    Toronto
# 3      Eve  2002   18      F    Toronto

# Removing columns
all_clients.drop("city", axis=1, inplace=True)
print(all_clients)
# Output:
#       name    id  age gender
# 0    Alice  1001   25      F
# 1  Charlie  1003   35      M
# 2     Adam  2001   18      M
# 3      Eve  2002   18      F

#### More on data manipulation

Sometimes we want to add a new column whose values are derived from other columns. Some of the common methods are:

- Using arithmetic operations: You can use arithmetic operators such as `+`, `-`, `*`, `/`, etc. to perform element-wise operations on existing columns and assign the result to a new column.
- Using conditional logic: You can use conditional logic to create new columns based on the values of existing columns.
- Using the `apply()` method: You can use the `apply()` method to apply a custom function and assign the result to a new column. Applied to a single column, the function receives one value at a time; applied to a `DataFrame` with `axis=1`, it receives one row at a time. In both cases it returns a single value for each row.

Example:

In [None]:
student_data = pd.DataFrame(
    {
        "name": ["Nina", "Jim", "Katie"],
        "age": [16, 18, 19],
        "year_registered": [2019, 2020, 2013],
    }
)

current_year = 2024

# Arithmetic operations:
# Create a new column to store their age when they registered
years_since_registration = current_year - student_data["year_registered"]
student_data["age_when_registered"] = student_data["age"] - years_since_registration
print(student_data)
# Output:
#     name  age  year_registered  age_when_registered
# 0   Nina   16             2019                   11
# 1    Jim   18             2020                   14
# 2  Katie   19             2013                    8

# Using logical operators:
# Create a new column to indicate if the student is an adult or not
student_data["is_adult"] = student_data["age"] >= 18
print(student_data)
# Output:
#     name  age  year_registered  age_when_registered  is_adult
# 0   Nina   16             2019                   11     False
# 1    Jim   18             2020                   14      True
# 2  Katie   19             2013                    8      True

In [None]:
# Using the `apply()` method:
# Assume that we want to assign categories to our clients based on their registration year.
# Clients who registered 10 or more years ago are classified as "old_client".
# Otherwise, they are classified as "new_client".


# Create a function to assign categories
def assign_category(year):
    return "old_client" if current_year - year >= 10 else "new_client"


# Create a new column to store the category by applying the function
student_data["category"] = student_data["year_registered"].apply(assign_category)
print(student_data)
# Output:
#     name  age  year_registered  age_when_registered  is_adult    category
# 0   Nina   16             2019                   11     False  new_client
# 1    Jim   18             2020                   14      True  new_client
# 2  Katie   19             2013                    8      True  old_client
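
For comparison, here is a sketch of two variations on the same idea. The first uses `np.where()` to build the category column without a Python function, and the second uses `apply()` with `axis=1` so that the function sees a whole row. The helper `describe` and the column name `label` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

students = pd.DataFrame(
    {
        "name": ["Nina", "Jim", "Katie"],
        "age": [16, 18, 19],
        "year_registered": [2019, 2020, 2013],
    }
)
current_year = 2024

# Conditional logic on a whole column at once
years_registered = current_year - students["year_registered"]
students["category"] = np.where(years_registered >= 10, "old_client", "new_client")
print(students["category"].tolist())
# ['new_client', 'new_client', 'old_client']


# Row-wise function: with axis=1 each call receives one row
def describe(row):
    return f"{row['name']} ({row['age']})"


students["label"] = students.apply(describe, axis=1)
print(students["label"].tolist())
# ['Nina (16)', 'Jim (18)', 'Katie (19)']
```

Column-wise expressions such as the `np.where()` call are usually faster than `apply()` on large tables, while `apply()` is more flexible when the logic does not fit a simple expression.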


### Filtering and sorting

Sometimes, you may want to filter a `DataFrame` based on certain conditions and select only the rows that satisfy those conditions. One common way to do this is using the `loc` attribute. This allows you to select rows based on a single condition or multiple conditions using boolean indexing. You can use comparison operators such as `==`, `!=`, `<`, `>`, `<=`, and `>=`, or the `isin()` method, to compare column values with scalars or iterables. You can also combine multiple conditions with the logical operators `&`, `|`, and `~`, wrapping each condition in parentheses. For example:

In [None]:
# Import pandas
import pandas as pd

# Create a DataFrame from a dictionary
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
        "age": [25, 30, 35, 40, 28, 32],
        "gender": ["F", "M", "M", "M", "F", "M"],
        "salary": [5000, 6000, 7000, 8000, 5500, 6500],
    }
)

# Filter rows where gender is F using loc
df_f = df.loc[df["gender"] == "F"]

print(df_f)
# Output:
#     name  age gender  salary
# 0  Alice   25      F    5000
# 4    Eve   28      F    5500

# Filter rows where age is between 30 and 39
df_age = df.loc[(df["age"] >= 30) & (df["age"] < 40)]

print(df_age)
# Output:
#       name  age gender  salary
# 1      Bob   30      M    6000
# 2  Charlie   35      M    7000
# 5    Frank   32      M    6500

# Filter rows where name is in a list using loc
mask = df["name"].isin(["Alice", "Bob"])
df_name = df.loc[mask]

print(df_name)
# Output:
#     name  age gender  salary
# 0  Alice   25      F    5000
# 1    Bob   30      M    6000

# Filter rows where name is not in a list
# Use the `~` operator to invert the mask
df_name = df.loc[~mask]
print(df_name)
# Output:
#       name  age gender  salary
# 2  Charlie   35      M    7000
# 3    David   40      M    8000
# 4      Eve   28      F    5500
# 5    Frank   32      M    6500
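
The `|` operator and column selection are not shown in the cell above, so here is a minimal sketch of both. It rebuilds the same example data to stay self-contained; the variable names `young_or_high` and `men_over_30` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
        "age": [25, 30, 35, 40, 28, 32],
        "gender": ["F", "M", "M", "M", "F", "M"],
        "salary": [5000, 6000, 7000, 8000, 5500, 6500],
    }
)

# Keep rows where the person is younger than 30 OR earns at least 7000.
# Each condition needs its own parentheses because `|` binds tighter than `<` and `>=`.
young_or_high = df.loc[(df["age"] < 30) | (df["salary"] >= 7000)]
print(young_or_high["name"].tolist())
# Output:
# ['Alice', 'Charlie', 'David', 'Eve']

# `loc` can filter rows and select columns in one step:
# names and salaries of men older than 30
men_over_30 = df.loc[(df["gender"] == "M") & (df["age"] > 30), ["name", "salary"]]
print(men_over_30)
```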

To sort a Pandas `DataFrame` by one or more columns, use the `sort_values()` method. Pass a column name, or a list of column names, to the `by` argument. The `ascending` argument controls the sort direction: it defaults to `True` (ascending) and also accepts a list with one value per column. With `inplace=True`, the `DataFrame` is modified directly instead of a sorted copy being returned. For example:

In [None]:
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Eve", "David", "Charlie", "Frank"],
        "age": [25, 30, 35, 40, 30, 32],
        "gender": ["F", "M", "F", "M", "M", "M"],
        "salary": [5500, 6000, 7000, 8000, 5000, 6500],
    }
)

# sort by age in ascending order (the default)
df.sort_values(by="age", inplace=True)
df

In [None]:
# sort by salary in descending order
df.sort_values(by="salary", ascending=False, inplace=True)
print(df)
# Output:
#       name  age gender  salary
# 3    David   40      M    8000
# 2      Eve   35      F    7000
# 5    Frank   32      M    6500
# 1      Bob   30      M    6000
# 0    Alice   25      F    5500
# 4  Charlie   30      M    5000

# sort by gender then by salary
df.sort_values(by=["gender", "salary"], inplace=True)
print(df)
# Output:
#       name  age gender  salary
# 0    Alice   25      F    5500
# 2      Eve   35      F    7000
# 4  Charlie   30      M    5000
# 1      Bob   30      M    6000
# 5    Frank   32      M    6500
# 3    David   40      M    8000

# sort by gender in ascending order then by salary in descending order
df.sort_values(by=["gender", "salary"], ascending=[True, False], inplace=True)
print(df)
# Output:
#       name  age gender  salary
# 2      Eve   35      F    7000
# 0    Alice   25      F    5500
# 3    David   40      M    8000
# 5    Frank   32      M    6500
# 1      Bob   30      M    6000
# 4  Charlie   30      M    5000
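
The examples above all sort in place. As a complementary sketch, the cell below leaves the original `DataFrame` untouched and renumbers the rows of the result. The names `by_salary` and `top_three` are illustrative, and the data is a trimmed copy of the example above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Eve", "David", "Charlie", "Frank"],
        "age": [25, 30, 35, 40, 30, 32],
        "salary": [5500, 6000, 7000, 8000, 5000, 6500],
    }
)

# Without `inplace=True`, sort_values() returns a new DataFrame and leaves `df` unchanged
by_salary = df.sort_values(by="salary", ascending=False)

# The original row labels travel with the rows;
# reset_index(drop=True) renumbers them starting from 0
top_three = by_salary.head(3).reset_index(drop=True)
print(top_three)
# Output:
#     name  age  salary
# 0  David   40    8000
# 1    Eve   35    7000
# 2  Frank   32    6500
```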

### Data aggregation

Data aggregation is the process of summarizing data into a more compact and meaningful form. It supports many kinds of analysis, such as finding patterns, trends, outliers, and correlations. It also reduces the size and complexity of your data, making it easier to understand and visualize.

One of the most common ways to aggregate a Pandas `DataFrame` is the `groupby()` method. It splits a `DataFrame` into groups based on one or more columns and then applies a function to each group. The function can be a built-in aggregation, such as `sum()`, `mean()`, or `count()`, or a custom function defined by the user. The result is a new `Series` or `DataFrame` holding the aggregated values for each group.

Another way to perform data aggregation using Pandas dataframes is to use the `pivot_table()` method. This method allows you to create a multidimensional table from a `DataFrame`, where the rows and columns are defined by one or more columns, and the values are defined by an aggregation function. The `pivot_table()` method can help you to reshape and reorganize your data, and to create cross-tabulations and contingency tables.

In [None]:
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Eve", "David", "Charlie", "Frank"],
        "age": [25, 30, 35, 40, 30, 32],
        "gender": ["F", "M", "F", "M", "M", "M"],
        "salary": [5500, 6000, 7000, 8000, 5000, 6500],
    }
)

# Assume that we are interested in the mean salary for each gender
grouped = df.groupby("gender")
salaries = grouped["salary"]
# print the size of each group
print(grouped.size())
# Output:
# gender
# F    2
# M    4

# print the mean salary
print(salaries.mean())
# Output:
# gender
# F    6250.0
# M    6375.0

# Find out, for each gender, how many people are at least 30 years old
grouped = df.loc[df["age"] >= 30].groupby("gender")
print(grouped.size())
# Output:
# gender
# F    1
# M    4
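
The text above mentions custom aggregation functions, which the cell does not demonstrate. A minimal sketch, assuming a hypothetical helper `salary_range` that is not part of Pandas, could combine built-in and custom aggregations with `agg()`:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Eve", "David", "Charlie", "Frank"],
        "age": [25, 30, 35, 40, 30, 32],
        "gender": ["F", "M", "F", "M", "M", "M"],
        "salary": [5500, 6000, 7000, 8000, 5000, 6500],
    }
)


def salary_range(values):
    """Hypothetical custom aggregation: gap between the highest and lowest value."""
    return values.max() - values.min()


# agg() accepts a list that mixes built-in aggregation names and custom functions;
# each entry becomes one column of the result
summary = df.groupby("gender")["salary"].agg(["min", "max", "mean", salary_range])
print(summary)
# Output:
#          min   max    mean  salary_range
# gender
# F       5500  7000  6250.0          1500
# M       5000  8000  6375.0          3000
```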

In [None]:
# Using pivot table
# find the minimum salary for each gender
pivoted = df.pivot_table(index="gender", values="salary", aggfunc="min")
print(pivoted)
# Output:
#         salary
# gender
# F         5500
# M         5000


# add a new column to the dataframe
def get_category(age):
    return "old" if age > 30 else "young"


df["age_category"] = df["age"].apply(get_category)

# Find the mean salary for each age category indexed by gender
# Using multi-index
pivoted = df.pivot_table(index=["gender", "age_category"], values="salary", aggfunc="mean")
print(pivoted)
# Output:
#                      salary
# gender age_category
# F      old           7000.0
#        young         5500.0
# M      old           7250.0
#        young         5500.0

# Using the `columns` argument
pivoted = df.pivot_table(index="gender", columns="age_category", values="salary", aggfunc="mean")
print(pivoted)
# Output:
# age_category     old   young
# gender
# F             7000.0  5500.0
# M             7250.0  5500.0
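
For the cross-tabulations mentioned earlier, where the goal is to count rows rather than aggregate a value column, `pd.crosstab()` is one option. The sketch below rebuilds the example data and uses a lambda in place of `get_category()` to stay self-contained.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Eve", "David", "Charlie", "Frank"],
        "age": [25, 30, 35, 40, 30, 32],
        "gender": ["F", "M", "F", "M", "M", "M"],
        "salary": [5500, 6000, 7000, 8000, 5000, 6500],
    }
)
df["age_category"] = df["age"].apply(lambda age: "old" if age > 30 else "young")

# Count how many people fall into each gender / age-category combination
counts = pd.crosstab(df["gender"], df["age_category"])
print(counts)
# Output:
# age_category  old  young
# gender
# F               1      1
# M               2      2
```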

## Matplotlib and Seaborn: Visualizing data

Data visualization is an important skill for data analysis, as it can help you to explore, understand, and communicate your data. By leveraging the power of `pandas` and utilizing data visualization libraries such as Matplotlib and Seaborn, you can create various types of plots that can reveal patterns, trends, outliers, relationships, and more in your data.

First, let's install Matplotlib and Seaborn:

In [None]:
!pip install matplotlib seaborn

> *Note*: We can also install multiple packages at once by separating them with spaces. The above command installs both `matplotlib` and `seaborn` at once.


### Matplotlib

**Matplotlib** is a low-level library that provides a comprehensive set of tools for creating and customizing plots. Pandas' built-in integration with matplotlib offers some useful methods for plotting dataframes and series.

To create basic plots using matplotlib and pandas, follow these steps:

1. Import the libraries: `import matplotlib.pyplot as plt`
2. Load or create your data as a pandas dataframe or series
3. Choose the type of plot that suits your data and your analysis goal
4. Use the `plot()` method of the dataframe or series, and pass the `kind` argument to specify the type of plot, such as `kind='line'`, `kind='box'`, `kind='hist'`, or `kind='scatter'`
5. Customize your plot by adding labels, titles, legends, etc. using `matplotlib` functions such as `plt.xlabel()`, `plt.title()`, `plt.legend()`, etc.
6. Show or save your plot using `plt.show()` or `plt.savefig()`
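
The steps above can be sketched as follows. The data is made up purely for illustration, and the column names `ad_spend` and `revenue` are assumptions. A scatter plot is used here because it is the one `kind` that requires both `x` and `y`.

```python
import matplotlib.pyplot as plt  # Step 1: import the libraries
import pandas as pd

# Step 2: create the data (made-up numbers, for illustration only)
ads = pd.DataFrame(
    {
        "ad_spend": [10, 20, 30, 40, 50],
        "revenue": [25, 41, 58, 79, 102],
    }
)

# Steps 3 and 4: a scatter plot suits two numeric columns;
# plot() returns the matplotlib Axes it drew on
ax = ads.plot(kind="scatter", x="ad_spend", y="revenue", figsize=(6, 4))

# Step 5: customize the plot
plt.title("Revenue vs. Ad Spend (illustrative data)")
plt.xlabel("Ad spend ($)")
plt.ylabel("Revenue ($)")

# Step 6: display the plot
# (calling plt.savefig("some_file.png") before plt.show() would write it to disk instead)
plt.show()
```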

Here are some examples of basic plots using `matplotlib` and `pandas`:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Alternatively, you can write:
# from matplotlib import pyplot as plt

dataset_url = "https://raw.githubusercontent.com/m-mehdi/pandas_tutorials/main/weekly_stocks.csv"
# read the data from the URL, parse the dates, and set the index to the date
df = pd.read_csv(dataset_url, parse_dates=["Date"], index_col="Date")

# only include the second half of 2021 data
df = df.loc["2021-07-01":"2021-12-31"]

df.head()

In [None]:
# Let's plot a line plot and see how Microsoft performed over time:
df.plot(kind="line", y="MSFT", figsize=(8, 5))
# The `figsize` argument takes a (width, height) tuple in inches, allowing us to change the size of the output figure.

# Set the title
plt.title("Microsoft Stock Prices")
# Set the y-axis label
plt.ylabel("Price ($)")
# Set the x-axis label
plt.xlabel("Date")
# Display the plot
plt.show()

In [None]:
# You can plot multiple columns in the same plot
df.plot(kind="hist", y=["MSFT", "FB"], bins=20, alpha=0.5)
# The `bins` argument sets the number of bins in the histogram.
# The `alpha` argument sets the transparency of the bars, so overlapping bars stay visible.
# The default `figsize` is used if not specified.

plt.title("Stock Prices")
# In a histogram, the x-axis shows the prices and the y-axis shows how often they occur
plt.xlabel("Price ($)")
plt.show()

In [None]:
# Plot a box plot of the closing prices
df.plot(kind="box", y=df.columns)  # Plot for all columns
plt.show()

### Seaborn

**Seaborn** is another library for data visualization in Python, which is built on top of `matplotlib` and `pandas`. Seaborn offers a higher-level interface and more attractive default styles for creating various types of plots, such as histograms, scatter plots, box plots, line plots, etc. Seaborn also integrates well with Pandas dataframes and supports statistical analysis and inference.

You can import `seaborn` as follows:

In [None]:
import seaborn as sns  # `sns` is the conventional alias for seaborn

Below, we will use `sns` to create some basic plots from a pandas `DataFrame`, with `matplotlib` handling labels and display:

In [None]:
# Load the "tips" dataset from the Seaborn library
df = sns.load_dataset("tips")
print(type(df))  # Prints <class 'pandas.core.frame.DataFrame'>

# Display the first few rows of the DataFrame
df.head()

# Description of the dataset:
# - total_bill: Total bill amount in dollars.
# - tip: Tip amount.
# - sex: Gender of the payer (Male or Female).
# - smoker: Whether the party was a smoker (Yes or No).
# - day: Day of the week (Thur, Fri, Sat, Sun).
# - time: Time of day (Lunch or Dinner).
# - size: Size of the party.

In [None]:
# Create a line plot of the mean total bill per day
sns.lineplot(df, x="day", y="total_bill", hue="smoker")
# The `hue` argument is used to group the data by the `smoker` column.
plt.show()
# The shaded bands are error bars around the mean. You can pass the
# `errorbar=None` argument to the `lineplot()` function to remove them.

In [None]:
# Create a histogram of tip amounts with one color per category of `sex`
sns.histplot(df, x="tip", hue="sex", alpha=0.3)
plt.xlabel("Tip ($)")  # Setting the x-axis label
plt.show()

In [None]:
sns.barplot(df, x="size", y="total_bill", hue="sex")
plt.ylabel("Total bill ($)")  # Setting the y-axis label
# Remove the legend title and set
# the legend position to "upper center"
plt.legend(title=None, loc="upper center")

plt.show()

In [None]:
# plot the maximum tip amount for each day of the week using a horizontal bar chart
ax1, ax2 = "day", "tip"
grouped = df.groupby(ax1, observed=True)[ax2].max().reset_index()
print(grouped)
# Output:
#     day    tip
# 0  Thur   6.70
# 1   Fri   4.73
# 2   Sat  10.00
# 3   Sun   6.50

sns.barplot(grouped, x=ax2, y=ax1, orient="h")  # `orient="h"` for horizontal bar chart
plt.title("Maximum Tip Amount by Day of the Week")
plt.show()

# Alternatively, you can use the pandas `plot()` method (built on matplotlib) on the grouped data:
grouped.plot(kind="barh", x=ax1, y=ax2)
plt.title("Maximum Tip Amount by Day of the Week")
plt.show()

# Or, use seaborn's `estimator` argument on the original data:
# No need to use `groupby`. Passing the estimator by name avoids
# relying on NumPy, which is not imported in this cell.
sns.barplot(df, x=ax2, y=ax1, estimator="max", errorbar=None, orient="h")
plt.title("Maximum Tip Amount by Day of the Week")
plt.show()

In [None]:
sns.regplot(df, x="total_bill", y="tip")
plt.title("Regression Plot of Total Bill vs Tip")
plt.show()
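
Box plots, listed among Seaborn's plot types at the start of this section, follow the same pattern. The sketch below uses a small made-up dataset, so it runs without downloading anything. The `orders` data and its values are illustrative assumptions, not part of the "tips" dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small made-up dataset, for illustration only
orders = pd.DataFrame(
    {
        "day": ["Fri", "Fri", "Fri", "Fri", "Sat", "Sat", "Sat", "Sat"],
        "total_bill": [12.5, 18.0, 22.4, 15.1, 20.3, 31.7, 26.9, 40.2],
    }
)

# One box per day: the box spans the quartiles and the line inside marks the median.
# boxplot() returns the matplotlib Axes, which can be customized directly.
ax = sns.boxplot(data=orders, x="day", y="total_bill")
ax.set_title("Total Bill by Day (illustrative data)")
ax.set_ylabel("Total bill ($)")
plt.show()
```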