# Week 8 Overview

This week will be a mix of data joining/merging problems and linear algebra. 
The first 5 problems are data cleaning and the final 4 problems are linear algebra. 

There are multiple ways to combine data. These methods are common across tools such as pandas, SQL, and R. The naming sometimes differs, but the general concepts apply. 

## **Joining or Merging**
This is a process of combining two datasets by adding the columns of one dataset to the other by some logical relationship between the columns. 

In SQL we call this joining but pandas has two functions:

**merge** - The default behavior for merge is to combine rows where the values in the shared columns match.

**join** - The default behavior for join is to combine rows where the index values match. 

I will often colloquially use the word "join" for either merging or joining in pandas. 

Left Dataset
| key    | value |
| -------- | ------- |
| A1  | $250    |
| A2 | $80     |
| A3    | $420    |

Right Dataset
| key    | different_value |
| -------- | ------- |
| A1  | cat    |
| A2 | dog     |
| A3    | apple    |

Data Joined on key

| key    | value | different_value |
| -------- | ------- | ------- |
| A1  | $250    | cat |
| A2 | $80     | dog |
| A3    | $420    | apple |

Typically we refer to the starting dataset as the left dataset and the one being added as the right. 
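
As a minimal sketch of the example above, the two tables can be rebuilt as small dataframes (the variable names `left`, `right`, `merged`, and `joined` are illustrative, not part of the assignment) and combined with both pandas functions:

```python
import pandas as pd

# The Left and Right datasets from the tables above
left = pd.DataFrame({"key": ["A1", "A2", "A3"], "value": ["$250", "$80", "$420"]})
right = pd.DataFrame({"key": ["A1", "A2", "A3"], "different_value": ["cat", "dog", "apple"]})

# merge matches rows on a column; here the shared "key" column
merged = left.merge(right, on="key")

# join matches rows on the index, so "key" is moved into the index first
joined = left.set_index("key").join(right.set_index("key"))

print(merged)
print(joined.reset_index())
```

Both calls should produce the same three rows; the only difference is whether `key` ends up as a column or as the index.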

The logic is typically that there is the same value in a specific column in both datasets. SQL allows for slightly more advanced logic which we will learn next quarter. Today we will focus on just columns matching. 

There are different types of joins that you will explore in this notebook (inner, outer, left, right, cross). The typical visual used to illustrate these concepts is a Venn diagram. If you get stuck picking the right join type, search for "types of joins" and look at the pictures that come up.
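
To get a feel for how the join types differ, here is a small sketch that counts the rows each one returns. The two dataframes are made up for illustration: they overlap on `A2` and `A3` only, so each join type gives a different result.

```python
import pandas as pd

# Hypothetical data: "A1" exists only on the left, "A4" only on the right
left = pd.DataFrame({"key": ["A1", "A2", "A3"], "value": [250, 80, 420]})
right = pd.DataFrame({"key": ["A2", "A3", "A4"], "different_value": ["dog", "apple", "pear"]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    row_counts[how] = len(left.merge(right, on="key", how=how))

# A cross join pairs every left row with every right row, so no key is given
row_counts["cross"] = len(left.merge(right, how="cross"))

print(row_counts)
# {'inner': 2, 'left': 3, 'right': 3, 'outer': 4, 'cross': 9}
```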


## **Concat or Union**

This is a process of combining two datasets by adding the rows of one dataset to the end of another. No matching logic is required. This is called concat in pandas and union in SQL. 

In most versions of SQL you are required to have the same columns in both datasets. In pandas you don't have to. If I concatenate the two datasets above in pandas I would get:

| key    | value | different_value |
| -------- | ------- | ------- |
| A1  | $250    | null |
| A2 | $80     | null |
| A3    | $420    | null |
| A1  | null    | cat |
| A2 | null     | dog |
| A3    | null    | apple |

However if my right dataset looked like this:

| key    | value |
| -------- | ------- |
| A1  | cat    |
| A2 | dog     |
| A3    | apple    |

then I could union them in SQL or concat in pandas to get:

| key    | value |
| -------- | ------- |
| A1  | $250    |
| A2 | $80     |
| A3    | $420    |
| A1  | cat    |
| A2 | dog     |
| A3    | apple    |
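
Both concat results above can be sketched in pandas as follows. The variable names are illustrative, and the `rename` step is just one assumed way to make the columns line up. Note that pandas typically displays the missing values as `NaN` rather than `null`.

```python
import pandas as pd

left = pd.DataFrame({"key": ["A1", "A2", "A3"], "value": ["$250", "$80", "$420"]})
right = pd.DataFrame({"key": ["A1", "A2", "A3"], "different_value": ["cat", "dog", "apple"]})

# Different column names: pandas keeps every column and fills the gaps with missing values
stacked = pd.concat([left, right], ignore_index=True)
print(stacked)

# Same column names: the rows simply stack into two columns
right_renamed = right.rename(columns={"different_value": "value"})
stacked_same = pd.concat([left, right_renamed], ignore_index=True)
print(stacked_same)
```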



In [13]:
import pandas as pd
import numpy as np

In [14]:
df_1 = pd.DataFrame({"ints": range(100)})
df_2 = pd.DataFrame({"ints": range(-10, 10)}, index=range(-10, 10))


df_1['threes'] = np.floor(df_1['ints']/3) * 3

df_2['evens'] = df_2['ints']*2
df_2['threes'] = np.floor(df_2['ints']/3) * 3

### Problem 1

Your first task is to create a dataset by `merging` `df_1` and `df_2` on the `ints` column, keeping only rows where the value matches on both sides. The result will be a dataframe with 10 rows and 4 columns.

You will then create the same dataframe by using the `join` function and joining the two datasets where the indexes are equal. Handling the duplicated column names takes a little more work, so look up the error and figure out which arguments to set. How many columns do you get in this case?

In [15]:
df_merge = df_1.merge(df_2, on="ints", suffixes=("_df1", "_df2"))
print(df_merge.shape)

(10, 4)


In [16]:
df_join = df_1.set_index("ints").join(df_2.set_index("ints"), how="inner", lsuffix="_df1", rsuffix="_df2")
print(df_join.shape)

(10, 3)
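
An alternative reading of the `join` task is to join on the indexes the dataframes already have, without calling `set_index`. The sketch below rebuilds `df_1` and `df_2` as defined earlier so that it runs on its own; `df_join_index` is an illustrative name. Because `ints` stays a regular column on both sides, it is duplicated along with `threes`, which changes the column count.

```python
import numpy as np
import pandas as pd

# Rebuild the dataframes from the setup cell
df_1 = pd.DataFrame({"ints": range(100)})
df_1["threes"] = np.floor(df_1["ints"] / 3) * 3
df_2 = pd.DataFrame({"ints": range(-10, 10)}, index=range(-10, 10))
df_2["evens"] = df_2["ints"] * 2
df_2["threes"] = np.floor(df_2["ints"] / 3) * 3

# Join on the existing indexes (0..99 on the left, -10..9 on the right).
# lsuffix/rsuffix resolve the clash between the shared column names.
df_join_index = df_1.join(df_2, how="inner", lsuffix="_df1", rsuffix="_df2")
print(df_join_index.shape)
print(list(df_join_index.columns))
```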


### Problem 2

Next you will perform the same merge as above three times with the following modifications:

* You want to keep all rows in `df_1` even if there is no match found in `df_2`
* You want to keep all rows in `df_2` even if there is no match found in `df_1`
* You want to keep all rows in `df_1` and `df_2` even if there is no match found in the other dataframe


How many rows do you end up with in each case? 

Think through a scenario where you might want to do this and add it as a comment above each merge. 

In [17]:
df_left = df_1.merge(df_2, on="ints", how="left", suffixes=("_df1", "_df2"))
print(df_left.shape)

(100, 4)


In [18]:
df_right = df_1.merge(df_2, on="ints", how="right", suffixes=("_df1", "_df2"))
print(df_right.shape)

(20, 4)


In [19]:
df_outer = df_1.merge(df_2, on="ints", how="outer", suffixes=("_df1", "_df2"))
print(df_outer.shape)

(110, 4)


### Problem 3

Now we are going to merge on columns that are not the same. Merge on the following:

* Merge `df_1` and `df_2` where `df_1.ints = df_2.evens`, only keep rows where there is a value for either dataframe
* Merge `df_1` and `df_2` where `df_1.ints = df_2.threes`, only keep rows where there is a value for either dataframe
* Merge `df_1` and `df_2` where `df_1.ints = df_2.threes`, keep all rows from `df_1` even if there is no match found in `df_2`
* Merge `df_1` and `df_2` where `df_1.threes = df_2.threes`, only keep rows where there is a value for either dataframe


How many rows do you end up with in each case? Are there any duplications? (try: `value_counts`)

Think through a scenario where you might want to do this and add it as a comment above each merge. 

In [20]:
df_merge_1 = df_1.merge(df_2, left_on="ints", right_on="evens", how="outer")
print(df_merge_1.shape)

(110, 5)


In [21]:
df_merge_2 = df_1.merge(df_2, left_on="ints", right_on="threes", how="outer")
print(df_merge_2.shape)

(116, 5)


In [22]:
df_merge_3 = df_1.merge(df_2, left_on="ints", right_on="threes", how="left")
print(df_merge_3.shape)

(106, 5)


In [23]:
df_merge_4 = df_1.merge(df_2, on="threes", how="outer")
print(df_merge_4.shape)

(128, 4)
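
One possible way to check for duplication, sketched here under the assumption that `df_1` and `df_2` are built as in the setup cell, is to call `value_counts` on the key columns. Any key that appears several times on the right repeats the matching left row that many times.

```python
import numpy as np
import pandas as pd

# Rebuild the dataframes from the setup cell
df_1 = pd.DataFrame({"ints": range(100)})
df_1["threes"] = np.floor(df_1["ints"] / 3) * 3
df_2 = pd.DataFrame({"ints": range(-10, 10)}, index=range(-10, 10))
df_2["evens"] = df_2["ints"] * 2
df_2["threes"] = np.floor(df_2["ints"] / 3) * 3

# How often does each join key occur on the right-hand side?
print(df_2["threes"].value_counts().sort_index())

merged = df_1.merge(df_2, left_on="ints", right_on="threes", how="left")

# Left rows that matched more than one right row appear more than once
dup_counts = merged["ints_x"].value_counts()
duplicated_keys = dup_counts[dup_counts > 1]
print(duplicated_keys)
```

Here the left values 0, 3, and 6 each match three rows of `df_2`, which is why the left merge returns 106 rows instead of 100.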


### Problem 4

Add a new column to `df_2` called `threes_string` that is the `threes` column converted to a string. Attempt to merge `df_1` and `df_2` where `df_1.threes = df_2.threes_string` with an inner join. What happens? Why?

The merge fails with a type-mismatch error: `df_1.threes` is numeric (float) while `df_2.threes_string` holds strings, and pandas will not merge on key columns whose data types are incompatible. 

In [None]:
df_2['threes_string'] = df_2['threes'].astype(str)
df_merge_problem = df_1.merge(df_2, left_on='threes', right_on='threes_string', how='inner')
print(df_merge_problem)
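
The failure and one possible fix can be sketched as below. The column name `threes_numeric` and the choice to cast back to `float` are assumptions made for illustration; casting `df_1.threes` to a string would be another option as long as both sides end up formatted identically.

```python
import numpy as np
import pandas as pd

# Rebuild the dataframes from the setup cell
df_1 = pd.DataFrame({"ints": range(100)})
df_1["threes"] = np.floor(df_1["ints"] / 3) * 3
df_2 = pd.DataFrame({"ints": range(-10, 10)}, index=range(-10, 10))
df_2["evens"] = df_2["ints"] * 2
df_2["threes"] = np.floor(df_2["ints"] / 3) * 3
df_2["threes_string"] = df_2["threes"].astype(str)

# Merging a float key against a string key raises a ValueError
merge_failed = False
try:
    df_1.merge(df_2, left_on="threes", right_on="threes_string", how="inner")
except ValueError as err:
    merge_failed = True
    print(f"Merge failed: {err}")

# Casting the key back to a numeric dtype makes the columns compatible again
df_2["threes_numeric"] = df_2["threes_string"].astype(float)
df_fixed = df_1.merge(df_2, left_on="threes", right_on="threes_numeric", how="inner")
print(df_fixed.shape)
```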

### Problem 5

Now you will play around with `pd.concat` by doing the following:

* Concatenate `df_1` and `df_2` keeping all rows, columns and indexes
* Concatenate `df_1` and `df_2` keeping all rows and columns but ignore the indexes from the original dataframes and instead have the index on this dataframe run from zero to the number of rows.
* Concatenate `df_1` and `df_2` keeping all rows and indexes the same but only keeping columns that exist in both dataframes


In [None]:
df_concat_1 = pd.concat([df_1, df_2], axis=0, join='outer')
print(df_concat_1.shape)

df_concat_2 = pd.concat([df_1, df_2], axis=0, join='outer', ignore_index=True)
print(df_concat_2.shape)

(120, 4)
(120, 4)


In [None]:
df_concat_3 = pd.concat([df_1, df_2], axis=0, join='inner')
print(df_concat_3.shape)

(120, 2)


## Linear Algebra: Rank and Column Space

### Problem 6
You will now learn how to create random matrices with arbitrary rank (subject to the constraints about matrix sizes, etc.). To create an $m \times n$ matrix with rank $r$, multiply a random $m \times r$ matrix with a random $r \times n$ matrix. Implement this in Python and confirm that the rank is indeed $r$. 

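What happens if you set $r > \min\{m, n\}$, and why does that happen?

The rank is capped by the smaller of $m$ and $n$: a matrix cannot have more linearly independent columns than it has rows, or more linearly independent rows than it has columns, so the product has rank at most $\min\{m, n\}$ no matter how large $r$ is. 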

In [28]:
def generate_matrix_with_rank(num_rows, num_columns, rank):
    matrix_A = np.random.randn(num_rows, rank)
    matrix_B = np.random.randn(rank, num_columns)
    result_matrix = np.dot(matrix_A, matrix_B)
    matrix_rank = np.linalg.matrix_rank(result_matrix)
    
    print(f"Rank of the matrix: {matrix_rank}")
    print(f"Expected rank: {rank}")
    return result_matrix, matrix_rank

num_rows, num_columns, rank = 5, 4, 3
result_matrix, matrix_rank = generate_matrix_with_rank(num_rows, num_columns, rank)

Rank of the matrix: 3
Expected rank: 3
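
The $r > \min\{m, n\}$ case can be tried directly. This is a small sketch with illustrative sizes ($m = 5$, $n = 4$, $r = 7$) and a seeded generator so the run is repeatable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ask for rank 7 in a 5x4 matrix, which is more than min(5, 4) = 4
m, n, r = 5, 4, 7
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
rank = np.linalg.matrix_rank(A)

print(f"Shape: {A.shape}, requested rank: {r}, actual rank: {rank}")
```

The matrix multiplication still works because the inner dimensions agree, but the resulting rank comes out as 4, the smaller of the two matrix dimensions.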


### Problem 7
Interestingly, the matrices $A$, $A^T$, $A^T A$, and $AA^T$ all have the same rank. Write code to demonstrate this, using random matrices of various sizes, shapes (square, tall, wide), and ranks. Create a total of 6 random matrices, two of each shape, with different sizes and ranks. 

In [None]:
import numpy as np

def check_rank_of_matrices(matrix):
    transpose_matrix = matrix.T
    product_A_T_A = np.dot(transpose_matrix, matrix)
    product_A_A_T = np.dot(matrix, transpose_matrix)
    
    rank_A = np.linalg.matrix_rank(matrix)
    rank_A_T = np.linalg.matrix_rank(transpose_matrix)
    rank_A_T_A = np.linalg.matrix_rank(product_A_T_A)
    rank_A_A_T = np.linalg.matrix_rank(product_A_A_T)
    
    print(f"Rank of A: {rank_A}")
    print(f"Rank of A^T: {rank_A_T}")
    print(f"Rank of A^T A: {rank_A_T_A}")
    print(f"Rank of A A^T: {rank_A_A_T}")
    print("="*30)

# Test 1: Square matrix, full rank
A1 = np.random.randn(5, 5)
check_rank_of_matrices(A1)

# Test 2: Tall matrix, full rank
A2 = np.random.randn(5, 3)
check_rank_of_matrices(A2)

# Test 3: Wide matrix, full rank
A3 = np.random.randn(3, 5)
check_rank_of_matrices(A3)

# Test 4: Square matrix, rank < min(rows, columns)
A4 = np.random.randn(5, 5)
A4[:, 2:] = 0 
check_rank_of_matrices(A4)

# Test 5: Tall matrix, rank < min(rows, columns)
A5 = np.random.randn(5, 3)
A5[:, 1:] = 0  
check_rank_of_matrices(A5)

# Test 6: Wide matrix, rank < min(rows, columns)
A6 = np.random.randn(3, 5)
A6[1:, :] = 0 
check_rank_of_matrices(A6)


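Rank of A: 5
Rank of A^T: 5
Rank of A^T A: 5
Rank of A A^T: 5
==============================
Rank of A: 3
Rank of A^T: 3
Rank of A^T A: 3
Rank of A A^T: 3
==============================
Rank of A: 3
Rank of A^T: 3
Rank of A^T A: 3
Rank of A A^T: 3
==============================
Rank of A: 2
Rank of A^T: 2
Rank of A^T A: 2
Rank of A A^T: 2
==============================
Rank of A: 1
Rank of A^T: 1
Rank of A^T A: 1
Rank of A A^T: 1
==============================
Rank of A: 1
Rank of A^T: 1
Rank of A^T A: 1
Rank of A A^T: 1
==============================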


### Problem 8

Demonstrate the addition rule of matrix rank, $r(A + B) \leq r(A) + r(B)$, by creating three pairs of rank-1 matrices whose sums have 
1. rank-0
2. rank-1
3. rank-2

Then repeat this exercise using matrix multiplication instead of addition.

In [31]:
import numpy as np

def compute_rank_of_sum(matrix_A, matrix_B):
    result_matrix = matrix_A + matrix_B
    return np.linalg.matrix_rank(result_matrix)

def compute_rank_of_multiplication(matrix_A, matrix_B):
    result_matrix = matrix_A @ matrix_B
    return np.linalg.matrix_rank(result_matrix)

# Every matrix below is 2x2 and has rank 1.

# Rank-0 sum: B is the negative of A, so the sum is the zero matrix
matrix_A1 = np.array([[1, 0], [0, 0]])
matrix_B1 = np.array([[-1, 0], [0, 0]])
print(f"Sum rank (rank-0 case): {compute_rank_of_sum(matrix_A1, matrix_B1)}")

# Rank-1 sum: A and B share the same row space and column space
matrix_A2 = np.array([[1, 0], [0, 0]])
matrix_B2 = np.array([[1, 0], [0, 0]])
print(f"Sum rank (rank-1 case): {compute_rank_of_sum(matrix_A2, matrix_B2)}")

# Rank-2 sum: the nonzero entries sit in different rows and columns
matrix_A3 = np.array([[1, 0], [0, 0]])
matrix_B3 = np.array([[0, 0], [0, 1]])
print(f"Sum rank (rank-2 case): {compute_rank_of_sum(matrix_A3, matrix_B3)}")

# Rank-0 product: the column space of B lies in the null space of A
matrix_A4 = np.array([[1, 0], [0, 0]])
matrix_B4 = np.array([[0, 0], [0, 1]])
print(f"Multiplication rank (rank-0 case): {compute_rank_of_multiplication(matrix_A4, matrix_B4)}")

# Rank-1 product
matrix_A5 = np.array([[1, 0], [0, 0]])
matrix_B5 = np.array([[1, 0], [0, 0]])
print(f"Multiplication rank (rank-1 case): {compute_rank_of_multiplication(matrix_A5, matrix_B5)}")

# A rank-2 product is impossible: r(AB) <= min(r(A), r(B)) = 1,
# so the best any pair of rank-1 matrices can do is rank 1
matrix_A6 = np.array([[1, 0], [0, 0]])
matrix_B6 = np.array([[0, 1], [0, 0]])
print(f"Multiplication rank (rank-2 attempt): {compute_rank_of_multiplication(matrix_A6, matrix_B6)}")


Sum rank (rank-0 case): 0
Sum rank (rank-1 case): 1
Sum rank (rank-2 case): 2
Multiplication rank (rank-0 case): 0
Multiplication rank (rank-1 case): 1
Multiplication rank (rank-2 attempt): 1
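
The two rank bounds, $r(A + B) \leq r(A) + r(B)$ and $r(AB) \leq \min\{r(A), r(B)\}$, can also be spot-checked on random rank-1 matrices. This is a sketch rather than a proof; the helper `random_rank_one` is a hypothetical name, and it builds each matrix as the outer product of two random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rank_one(size):
    # The outer product of two nonzero vectors is a rank-1 matrix
    return np.outer(rng.standard_normal(size), rng.standard_normal(size))

results = []
for _ in range(5):
    A, B = random_rank_one(3), random_rank_one(3)
    rank_sum = np.linalg.matrix_rank(A + B)
    rank_product = np.linalg.matrix_rank(A @ B)
    results.append((rank_sum, rank_product))
    print(f"rank(A + B) = {rank_sum}, rank(A @ B) = {rank_product}")
```

For random inputs the sum typically reaches the upper bound of 2, while the product never exceeds 1.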


### Problem 9

The goal of this exercise is to answer the question: is $v \in C(A)$?

Create a rank-3 matrix $A \in \mathbb{R}^{4 \times 3}$ and vector $v \in \mathbb{R}^{4}$ using numbers randomly drawn from a normal distribution. 

Follow the algorithm described in the [In the Column Space?](https://learning.oreilly.com/library/view/practical-linear-algebra/9781098120603/ch06.html#id335) section of Practical Linear Algebra to determine whether the vector is in the column space of the matrix. 

Rerun the code multiple times to see whether you find a consistent pattern. 

Next, use a rank-4 matrix $A \in \mathbb{R}^{4 \times 4}$ and a vector $v \in \mathbb{R}^{4}$, again using numbers randomly drawn from a normal distribution. What happens in this case? Why?


In [None]:
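import numpy as np

def is_v_in_column_space(matrix_A, vector_v):
    # Augment A with v as an extra column. v is in C(A) exactly when
    # the augmented matrix has the same rank as A.
    augmented_matrix = np.column_stack((matrix_A, vector_v))
    rank_A = np.linalg.matrix_rank(matrix_A)
    rank_augmented = np.linalg.matrix_rank(augmented_matrix)
    return bool(rank_augmented == rank_A)

# A random 4x3 matrix spans only a 3-D subspace of R^4,
# so a random v almost surely falls outside its column space
rank_3_matrix = np.random.randn(4, 3)
vector_3 = np.random.randn(4)
is_in_column_space_3 = is_v_in_column_space(rank_3_matrix, vector_3)
print(f"Vector is in the column space of rank-3 matrix: {is_in_column_space_3}")

# A rank-4 4x4 matrix spans all of R^4, so every v is in its column space
rank_4_matrix = np.random.randn(4, 4)
vector_4 = np.random.randn(4)
is_in_column_space_4 = is_v_in_column_space(rank_4_matrix, vector_4)
print(f"Vector is in the column space of rank-4 matrix: {is_in_column_space_4}")


Vector is in the column space of rank-3 matrix: False
Vector is in the column space of rank-4 matrix: True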
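
A useful sanity check for any column-space test is a vector that is known to lie in $C(A)$. The sketch below assumes the augment-and-compare-rank approach (append $v$ to $A$ as an extra column and see whether the rank increases); `in_column_space`, `v_inside`, and `v_random` are illustrative names. The known member is built as a linear combination of the columns of $A$.

```python
import numpy as np

rng = np.random.default_rng(0)

def in_column_space(A, v):
    # v is in C(A) when appending it as a column leaves the rank unchanged
    augmented = np.column_stack((A, v))
    return bool(np.linalg.matrix_rank(augmented) == np.linalg.matrix_rank(A))

A = rng.standard_normal((4, 3))

# A linear combination of the columns of A is in C(A) by construction
v_inside = A @ rng.standard_normal(3)

# A random vector in R^4 almost surely is not
v_random = rng.standard_normal(4)

inside_result = in_column_space(A, v_inside)
random_result = in_column_space(A, v_random)
print(f"Constructed vector in C(A): {inside_result}")
print(f"Random vector in C(A): {random_result}")
```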
