# CSCI4022 Homework 8; Recommendations


## Due Wednesday, November 17 at 11:59 pm to Canvas and Gradescope

#### Submit this file as a .ipynb with *all cells compiled and run* to the associated dropbox.

***

Your solutions to computational questions should include any specified Python code and results as well as written commentary on your conclusions.  Remember that you are encouraged to discuss the problems with your classmates, but **you must write all code and solutions on your own**.

**NOTES**: 

- Any relevant data sets should be available on Canvas. To make life easier on the graders if they need to run your code, do not change the relative path names here. Instead, move the files around on your computer.
- If you're not familiar with typesetting math directly into Markdown then by all means, do your work on paper first and then typeset it later.  Here is a [reference guide](https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference) linked on Canvas on writing math in Markdown. **All** of your written commentary, justifications and mathematical work should be in Markdown.  I also recommend the [wikibook](https://en.wikibooks.org/wiki/LaTeX) for LaTeX.
- Because you can technically evaluate notebook cells in a non-linear order, it's a good idea to do **Kernel $\rightarrow$ Restart & Run All** as a check before submitting your solutions.  That way if we need to run your code you will know that it will work as expected. 
- It is **bad form** to make your reader interpret numerical output from your code.  If a question asks you to compute some value from the data you should show your code output **AND** write a summary of the results in Markdown directly below your code. 
- 45 points of this assignment are in problems.  The remaining 5 are for neatness, style, and overall exposition of both code and text.
- This probably goes without saying, but... For any question that asks you to calculate something, you **must show all work and justify your answers to receive credit**. Sparse or nonexistent work will receive sparse or nonexistent credit. 
- There is *not a prescribed API* for these problems.  You may answer coding questions with whatever syntax or object typing you deem fit.  Your evaluation will primarily live in the clarity of how well you present your final results, so don't skip over any interpretations!  Your code should still be commented and readable to ensure you followed the given course algorithm.

---
**Shortcuts:**  [Problem 1](#p1) | [Problem 2](#p2) | [Problem 3](#p3) | [Extra Credit](#ec) |
---

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
# import statsmodels.api as sm

***
<a/ id='p1'></a>
[Back to top](#top)
# Problem 1 (7 pts; Theory: PCA)
Prove that if $M$ is any matrix, then $M^TM$ and $MM^T$ are symmetric.

**Solution Markdown**


---

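**Rule 1:** $(AB)^T = B^TA^T$

**Rule 2:** $(A^T)^T = A$

$(M^TM)^T=M^T(M^T)^T=M^TM$

$(MM^T)^T=(M^T)^TM^T=MM^T$

In both cases the product equals its own transpose: Rule 1 reverses the order of the factors and Rule 2 removes the double transpose, so $M^TM=(M^TM)^T$ and $MM^T=(MM^T)^T$. Both matrices are therefore symmetric, and since $M$ was arbitrary (not necessarily square), this holds for any matrix.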
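
As a quick numerical sanity check of the argument above (not a substitute for the proof), the sketch below builds an arbitrary non-square matrix and confirms that both products equal their transposes. The matrix shape and the random seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)      # seed chosen arbitrarily for reproducibility
M = rng.normal(size=(4, 3))         # any matrix works; it need not be square

gram_small = M.T @ M                # 3 x 3
gram_large = M @ M.T                # 4 x 4

# A matrix S is symmetric exactly when S equals its transpose.
is_sym_small = np.allclose(gram_small, gram_small.T)
is_sym_large = np.allclose(gram_large, gram_large.T)
print(is_sym_small, is_sym_large)
```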

***
<a/ id='p2'></a>
[Back to top](#top)
# Problem 2 (13 pts; Theory: Recommendations and Scaling)

To date, we've used **centered cosine distance** as a mechanism to center data.  This is not the only such option.  Consider the following:

Three computers, A, B, and C, have the numerical features listed below:

| Feature | A | B | C |
| --- | --- | --- | --- |
|Processor Speed | 3.06 | 2.68 | 2.92 |
|Disk Size | 500 | 320 | 640 |
|Main-Memory Size | 6 | 4 | 6 |

We may imagine these values as defining a vector for each computer; for instance, A's vector is [3.06, 500, 6]. We can compute the cosine distance between any two of the vectors, but if we do not scale the components (via a scalar multiplication), then the disk
size will dominate the dot product and make differences in the other components essentially invisible. Let us use 1 as the scale factor for processor speed, $\alpha$ for the disk size, and $\beta$ for the main memory size.

**(A)** In terms of $\alpha$ and $\beta$, compute the cosines of the angles between the vectors for each pair of the three computers.

**(B)** What are the angles between the vectors if $\alpha=\beta=1$?

**(C)** What are the angles between the vectors if $\alpha=0.01$ and $\beta=0.5$?

**(D)** One fair way of selecting scale factors is to make each inversely proportional to the average value in its component. What would be the values of $\alpha$ and $\beta$, and what would be the angles between the vectors?


In [2]:
def cosine_similarity(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

raw_data = [[3.06, 500, 6],
            [2.68, 320, 4],
            [2.92, 640, 6]]

'''
A
'''

data = [[data[0], data[1], data[2]] for data in raw_data]

AB = cosine_similarity(data[0], data[1])
AC = cosine_similarity(data[0], data[2])
BC = cosine_similarity(data[1], data[2])

print(f"The angle between AB is: {np.round(AB, 5)} radians and {np.round(np.degrees(AB), 5)} degrees")
print(f"The angle between AC is: {np.round(AC, 5)} radians and {np.round(np.degrees(AC), 5)} degrees")
print(f"The angle between BC is: {np.round(BC, 5)} radians and {np.round(np.degrees(BC), 5)} degrees")

'''
B
'''
alpha = 1
beta  = 1

data = [[data[0], alpha * data[1], beta * data[2]] for data in raw_data]
print(data)

AB = cosine_similarity(data[0], data[1])
AC = cosine_similarity(data[0], data[2])
BC = cosine_similarity(data[1], data[2])

print(f"The angle between AB is: {np.round(AB, 5)} radians and {np.round(np.degrees(AB), 5)} degrees")
print(f"The angle between AC is: {np.round(AC, 5)} radians and {np.round(np.degrees(AC), 5)} degrees")
print(f"The angle between BC is: {np.round(BC, 5)} radians and {np.round(np.degrees(BC), 5)} degrees")

'''
C
'''

alpha = 0.01
beta  = 0.5

data = [[data[0], alpha * data[1], beta * data[2]] for data in raw_data]
print(data)

AB = cosine_similarity(data[0], data[1])
AC = cosine_similarity(data[0], data[2])
BC = cosine_similarity(data[1], data[2])

print(f"The angle between AB is: {np.round(AB, 5)} radians and {np.round(np.degrees(AB), 5)} degrees")
print(f"The angle between AC is: {np.round(AC, 5)} radians and {np.round(np.degrees(AC), 5)} degrees")
print(f"The angle between BC is: {np.round(BC, 5)} radians and {np.round(np.degrees(BC), 5)} degrees")

'''
D
'''

The angle between AB is: 1.0 radians and 57.29563 degrees
The angle between AC is: 1.0 radians and 57.29551 degrees
The angle between BC is: 0.99999 radians and 57.29508 degrees
[[3.06, 500, 6], [2.68, 320, 4], [2.92, 640, 6]]
The angle between AB is: 1.0 radians and 57.29563 degrees
The angle between AC is: 1.0 radians and 57.29551 degrees
The angle between BC is: 0.99999 radians and 57.29508 degrees
[[3.06, 5.0, 3.0], [2.68, 3.2, 2.0], [2.92, 6.4, 3.0]]
The angle between AB is: 0.99088 radians and 56.77333 degrees
The angle between AC is: 0.99155 radians and 56.8119 degrees
The angle between BC is: 0.96918 radians and 55.5298 degrees


'\nD\n'

---

Not sure if I interpreted this part correctly. In fact, I know I didn't. I don't really remember covering this, and I have no clue why my answers came out this way: computers A and C seem to be the closest, yet I'm getting B and C.

Really no idea where to get started on this for **A.**, **B.**, **C.**, **D.**. Just need Thanksgiving break to be here already.

---
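A point that may help with this problem: the ratio $\frac{u \cdot v}{\lVert u \rVert \lVert v \rVert}$ is the *cosine* of the angle, so an `arccos` step is still needed to get the angle itself. For example, after scaling, the cosine for the pair A, B is $\frac{8.2008 + 160000\alpha^2 + 24\beta^2}{\sqrt{9.3636 + 250000\alpha^2 + 36\beta^2}\,\sqrt{7.1824 + 102400\alpha^2 + 16\beta^2}}$, and the other two pairs follow the same pattern. The sketch below is one possible way to carry out parts (B) through (D) numerically. The helper names `angle_degrees` and `pairwise_angles` are made up for illustration, and part (D) assumes "inversely proportional to the average" is normalised so that the processor-speed factor stays at 1.

```python
import numpy as np

raw = np.array([[3.06, 500, 6],    # computer A
                [2.68, 320, 4],    # computer B
                [2.92, 640, 6]])   # computer C

def angle_degrees(u, v):
    """Angle between u and v in degrees, via arccos of the cosine."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clipping guards against round-off pushing cos slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pairwise_angles(alpha, beta):
    """Angles for AB, AC, BC after scaling disk size by alpha and memory by beta."""
    a, b, c = raw * np.array([1.0, alpha, beta])
    return {"AB": angle_degrees(a, b),
            "AC": angle_degrees(a, c),
            "BC": angle_degrees(b, c)}

part_b = pairwise_angles(1.0, 1.0)
part_c = pairwise_angles(0.01, 0.5)

# Part (D), under the stated assumption: each factor is mean(processor) / mean(component).
means = raw.mean(axis=0)
alpha_d, beta_d = means[0] / means[1], means[0] / means[2]
part_d = pairwise_angles(alpha_d, beta_d)

for name, result in [("B", part_b), ("C", part_c), ("D", part_d)]:
    print(name, {pair: round(float(deg), 3) for pair, deg in result.items()})
```

Under this reading, $\alpha=\beta=1$ gives angles well below one degree for every pair because disk size dominates, while $\alpha=0.01$, $\beta=0.5$ gives roughly 7.7, 7.5, and 14.3 degrees for AB, AC, and BC, so A and C come out as the closest pair.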

***
<a/ id='p3'></a>
[Back to top](#top)
# Problem 3 (25 pts; Practice: Recommendations)

This problem is about recommender systems. There is a joke recommender dataset from this site:

http://eigentaste.berkeley.edu/dataset/

`jokeratings.csv` contains data from 24,983 users who have rated 36 or more jokes from this set.  The text of the jokes themselves is contained in `jokes.csv`.  Both are saved as UTF-8 files.

**WARNING NOTE**: A number of the jokes are distasteful at best and may be offensive.  You may skip any prompt that requires you to actually read/print jokes if you wish, and fabricate ranking values when applicable in part **D**.  In practice, finding objects that are polarizing (such as bad/offensive jokes) is part of the point of a recommendation system, so instead I recommend you rank such jokes very low and see if a similar user pattern is manifested in the data!

In [3]:
ratings=pd.read_csv('jokeratings.csv', encoding='UTF-8', header=None)
ratings.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
0,74,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17,-8.98,...,2.82,99.0,99.0,99.0,99.0,99.0,-5.63,99.0,99.0,99.0
1,100,4.08,-0.29,6.36,4.37,-2.38,-9.66,-0.73,-5.34,8.88,...,2.82,-4.95,-0.29,7.86,-0.19,-2.14,3.06,0.34,-4.32,1.07
2,49,99.0,99.0,99.0,99.0,9.03,9.27,9.03,9.27,99.0,...,99.0,99.0,99.0,9.08,99.0,99.0,99.0,99.0,99.0,99.0


In [4]:
jokes=pd.read_csv('jokes.csv', encoding='UTF-8', header=None)
len(jokes)

100

**Part A:**

Read the dataset and ensure it is clean. That is, ensure that it contains no odd or fill entries such as NaN or nonsensical numerical values.  Perform any adjustments to set up the data frame for a k-nearest neighbors recommendation system.

(Note that the first column of the CSV is the count of jokes rated by that user, the other 100 columns are their actual ratings.)

In [5]:
# Valid ratings lie in [-10, 10]; 99 marks "not rated".
lb = -10
ub = 10
fill = 99

initial_size = ratings.shape[0]
print(f"Number of users with valid ratings: {initial_size}\n")

# Drop any rows containing NaN (dropna returns a new frame, so reassign it).
print("Dropping rows with NaNs...")
ratings = ratings.dropna()
after_nan_size = ratings.shape[0]
print(f"We dropped {initial_size - after_nan_size} rows\n")

# Keep only rows whose ratings all lie within [lb, ub] or equal the fill value.
print(f"Dropping rows with ratings less than {lb}, greater than {ub}, and unequal to the null value ({fill} in our case)...")
joke_ratings = ratings.iloc[:, 1:]
valid = (((lb <= joke_ratings) & (joke_ratings <= ub)) | (joke_ratings == fill)).all(axis=1)
ratings = ratings[valid]
final_size = ratings.shape[0]
print(f"We dropped {after_nan_size - final_size} rows\n")

print(f"Number of users with valid ratings: {final_size}")
print(f"We dropped {initial_size - final_size} rows in total!")

Number of users with valid ratings: 24983

Dropping rows with NaNs...
We dropped 0 rows

Dropping rows with ratings less than -10, greater than 10, and unequal to the null value (99 in our case)...
We dropped 0 rows

Number of users with valid ratings: 24983
We dropped 0 rows in total!


**Part B:**

For a naive recommender system, we have enough data to implement $k$-nearest neighbors for collaborative filtering.

Assume there is a user who has rated only one joke, and that joke happens to be joke 8. The user has given the joke a rating of 5.

- Print joke 8.

Now we have to recommend a set of jokes to the user.

- You will need to find a list of `k` people who have given a similar rating on joke 8. The value of `k` is your choice.

Print these users (by their index/identifiers).

- You will also need to find the top `m` jokes rated by these `k` users. Again `m` is your choice.


Since we only have one user and one rating, we may have had numerous "perfect match" other users.  Even in this case, you should include terms to *weight* the averages of the nearest-neighbors by their similarities to the new user.

Print these jokes (both by their identifiers and the joke itself)
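
To make the weighting idea above concrete, here is a minimal sketch on a tiny made-up ratings matrix (not the joke data). It assumes a similarity of the form $1/(1+d)$, where $d$ is the absolute difference on the one shared joke; that choice, the toy numbers, and the values of `k` and `m` are all illustrative assumptions rather than the course's prescribed method.

```python
import numpy as np

# Toy ratings: rows are users, columns are jokes; NaN marks "not rated".
R = np.array([[5.0, 8.0, np.nan, -3.0],
              [4.5, 6.0, 2.0, np.nan],
              [-6.0, -9.0, 7.0, 1.0],
              [5.2, np.nan, -1.0, 4.0]])
target_joke, target_rating = 0, 5.0   # the single rating the new user supplied
k, m = 3, 2                           # illustrative choices

# Distance on the shared joke, turned into a similarity in (0, 1].
dist = np.abs(R[:, target_joke] - target_rating)
sim = 1.0 / (1.0 + dist)

nearest = np.argsort(dist)[:k]        # indices of the k closest users
w = sim[nearest][:, None]             # column vector of neighbor weights

# Similarity-weighted mean per joke, skipping missing ratings.
rated = ~np.isnan(R[nearest])
weighted_sum = (w * np.nan_to_num(R[nearest])).sum(axis=0)
weight_total = (w * rated).sum(axis=0)
pred = np.full(R.shape[1], np.nan)
np.divide(weighted_sum, weight_total, out=pred, where=weight_total > 0)

pred[target_joke] = np.nan            # do not recommend the joke already rated
order = np.argsort(-np.nan_to_num(pred, nan=-np.inf))
top_m = order[:m]
print(nearest, top_m)
```

Dividing by the sum of weights over only the neighbors who rated a given joke keeps a joke from being penalised just because few neighbors have seen it.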

In [6]:
# np.nansum returns 0 if all values in the array are NaN, and I wanted it to return NaN
def true_nansum(arr):
    values = np.asarray(arr, dtype=float)
    if np.isnan(values).all():
        return np.nan
    return np.nansum(values)

# Build a row shaped like `ratings`: [count, rating for joke 0, ..., rating for joke n_col - 1]
def create_new_user(jokes, scores, fill, n_col):
    l = [len(jokes)]
    for col in range(n_col):
        flag = True
        for i, joke in enumerate(jokes):
            if joke == col:
                l.append(scores[i])
                flag = False
        if flag:
            l.append(fill)
    return l

# Return the positional indices of the k users closest to `user`, measured by the
# total absolute rating difference over the co-rated jokes in `cols`
def neighbors(df, user, k, fill_value=np.nan, cols=None):
    cols         = [] if cols is None else cols
    nan_df       = df.replace(fill_value, np.nan)
    nan_user     = pd.Series(user).replace(fill_value, np.nan)
    nan_diff     = nan_df - nan_user
    cols_to_drop = list(set(nan_diff.columns) - set(cols))
    nan_diff     = nan_diff.drop(cols_to_drop, axis=1)
    abs_sums     = []
    for index, row in nan_diff.iterrows():
        # take absolute values before summing so opposite-signed differences cannot cancel
        abs_sums.append(true_nansum(np.abs(row)))
    return list(pd.Series(abs_sums).sort_values()[:k].index)

# Rank jokes by the similarity-weighted average rating of the given neighbors and
# return the indices of the top m (unrated entries count as 0; `cols` is unused)
def get_jokes(df, user, neighbors, m, fill_value=np.nan, cols=None):
    z_df     = df.replace(fill_value, 0)
    z_user   = pd.Series(user).replace(fill_value, 0)
    sim_XY   = []
    weighted = np.zeros(len(df.columns))
    for neighbor in neighbors:
        # Pearson correlation between the neighbor's row and the new user's row
        sim = stats.pearsonr(z_df.iloc[neighbor,:], z_user)[0]
        sim_XY.append(sim)
        weighted += sim * z_df.iloc[neighbor,:].to_numpy()
    weighted *= (1 / np.sum(sim_XY))
    # skip column 0 (the count) so that position i corresponds to joke i
    jokes = pd.Series(weighted[1:]).sort_values(ascending=False)[:m].index
    return jokes

In [7]:
'''
Joke 8
'''
print(f"Joke 8: {jokes.iloc[8].values[0]}\n")

'''
Choosing m
'''
# fraction of the jokes to recommend
percent = 0.1 
n_jokes = len(jokes)
# number of jokes to recommend to our new user
m       = int(n_jokes * percent)
print(f"m = {m}\n")

'''
Choosing k
'''
# percentage of users to use
percent = 0.01 
# number of neighbors for our new user
k       = int(ratings.shape[0] * percent)
print(f"k = {k}\n")

'''
Creating a new user
'''
n_columns    = 100
fill_value   = 99.0
joke_indices = [8]
new_scores   = [5]
new_user     = create_new_user(joke_indices, new_scores, fill_value, n_columns)
print(f"Our new user looks like:\n{new_user}\n")

'''
Getting the k-nearest neighbors
'''
# The actual columns the jokes are stored in
columns     = [i for i in range(1, n_columns + 1)]
k_neighbors_complex = neighbors(ratings, new_user, k, fill_value, columns)
print(f"{k} nearest neighbors are:\n{k_neighbors_complex}")

'''
Getting m jokes
'''
recommended_jokes = get_jokes(ratings, new_user, k_neighbors_complex, m, fill_value, columns)
for joke in recommended_jokes:
    print(f"Joke {joke}: {jokes.iloc[joke].values[0]}\n")

Joke 8: A country guy goes into a city bar that has a dress code, and the maitre d'  demands he wear a tie. Discouraged, the guy goes to his car to sulk when  inspiration strikes: He's got jumper cables in the trunk! So he wraps them around his neck, sort of like a string tie (a bulky string tie to be sure) and returns to the bar. The maitre d' is reluctant, but says to the guy, "Okay, you're a pretty resourceful fellow, you can come in... but just don't start anything"!

m = 10

k = 249

Our new user looks like:
[1, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 5, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 

In [8]:
'''
Joke 8
'''
print(f"Joke 8: {jokes.iloc[8].values[0]}\n")

'''
Choosing m
'''
# fraction of the jokes to recommend
percent = 0.1 
n_jokes = len(jokes)
# number of jokes to recommend to our new user
m       = int(n_jokes * percent)
print(f"m = {m}\n")

'''
Choosing k
'''
# percentage of users to use
percent = 0.01 
# number of neighbors for our new user
k       = int(ratings.shape[0] * percent)
print(f"k = {k}\n")

'''
Creating a new user
'''
n_columns    = 100
fill_value   = 99.0
joke_indices = [8]
new_scores   = [5]
new_user     = create_new_user(joke_indices, new_scores, fill_value, n_columns)
print(f"Our new user looks like:\n{new_user}\n")

'''
Getting the k-nearest neighbors
'''
k_neighbors_simple = list(abs(ratings.iloc[:,joke_indices[0]+1] - new_scores[0]).sort_values()[:k].index)
print(f"{k} nearest neighbors are:\n{k_neighbors_simple}")

'''
Getting m jokes
'''
recommended_jokes = get_jokes(ratings, new_user, k_neighbors_simple, m, fill_value, columns)
for joke in recommended_jokes:
    print(f"Joke {joke}: {jokes.iloc[joke].values[0]}\n")

Joke 8: A country guy goes into a city bar that has a dress code, and the maitre d'  demands he wear a tie. Discouraged, the guy goes to his car to sulk when  inspiration strikes: He's got jumper cables in the trunk! So he wraps them around his neck, sort of like a string tie (a bulky string tie to be sure) and returns to the bar. The maitre d' is reluctant, but says to the guy, "Okay, you're a pretty resourceful fellow, you can come in... but just don't start anything"!

m = 10

k = 249

Our new user looks like:
[1, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 5, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 

In [9]:
'''
Making sure both solutions match, and while they may order the things differently, they are both correct solutions 
I would say.
'''
s   = set(k_neighbors_simple)
print("S:")
for i in k_neighbors_simple:
    print(ratings.iloc[i,joke_indices[0]+1], end=" ")
c   = set(k_neighbors_complex)
print("\nC:")
for i in k_neighbors_complex:
    print(ratings.iloc[i,joke_indices[0]+1], end=" ")
s_n = len(s)
c_n = len(c)
d   = s - c
d_n = len(d)

'''As it turns out, both methods work well, however, my functional solution will be easier to implement with the weights'''

S:
5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.05 4.95 5.05 5.05 5.05 4.95 4.95 5.05 4.95 4.95 5.05 5.05 4.95 5.05 5.05 5.05 5.05 4.95 4.95 4.95 5.05 4.95 4.95 5.05 4.95 4.95 4.95 5.05 4.95 5.05 5.05 4.95 5.05 5.05 4.95 5.05 5.05 5.05 4.95 5.05 4.95 4.95 4.95 5.05 4.95 4.95 5.05 4.95 4.95 4.95 5.05 4.95 4.95 5.05 5.05 4.95 4.95 4.95 5.05 4.95 4.95 4.95 4.95 4.95 5.05 5.1 4.9 5.1 4.9 4.9 5.1 4.9 5.1 5.1 5.1 4.9 5.1 4.9 4.9 5.1 5.1 4.9 5.1 4.9 4.9 4.9 5.1 5.1 4.9 5.1 4.9 5.1 5.1 4.9 4.9 4.9 4.9 4.9 4.9 5.1 4.9 4.9 4.9 5.1 5.1 4.9 5.1 4.9 4.9 4.9 4.9 5.1 4.9 4.9 5.1 4.9 4.9 4.9 4.9 5.1 5.1 4.9 5.1 5.1 5.1 5.1 4.9 5.1 5.1 5.1 5.1 4.9 4.9 4.9 4.9 4.9 5.1 4.9 4.85 5.15 5.15 5.15 5.15 4.85 4.85 4.85 5.15 5.15 5.15 5.15 4.85 5.15 4.85 5.15 5.15 5.15 4.85 5.15 4.85 4.85 5.15 5.15 4.85 4.85 4.85 5.15 4.85 5.15 4.85 4.85 5.15 5.15 4.85 4.85 5.15 5.15 5.15 5.15 5.15 4.85 5.15 4.85 4.85 4.85 5.15 5.15 5.15 4.85 4.85 5.15 4.85 5.15 4.85 5.15 4.85 4.8

'As it turns out, both methods work well, however, my functional solution will be easier to implement with the weights'

**Part C:**

Justify why you picked these values for `m` and `k`.

In general, how would your recommendation change with changing `k` and keeping `m` constant?  What effects do increases in `k` cause?  What about decreases?

---

I chose `k` as a percentage of the users. I tried $0.1$, which took a while to compute; $0.01$, which I think is the sweet spot; and $0.005$, which seemed too small a sample size. Larger `k` takes more time, but in theory I'd expect a bigger 'battle' between jokes when we raise `k` and keep `m` constant, since more neighbors contribute ratings to the weighted averages, and, conversely, less competition when we decrease `k`.

As for `m`, I also chose a percentage of all the jokes: $0.1$, or $10\%$, simply because I wanted to read them. A smaller `m` might have made more sense, because not all of the recommended jokes were that similar in style.

---

**Part D:**
Now try it yourself.  Read jokes 1-4 and rate them all.  Create a collaborative filter for yourself.  What are the top 3 jokes recommended to you?  What do you think of them?

In [10]:
'''
Jokes 1, 2, 3, 4
'''
joke_nums = [1,2,3,4]
for joke_i in joke_nums:
    print(f"Joke {joke_i}: {jokes.iloc[joke_i].values[0]}\n")

'''
Choosing m
'''
m = 3 # we're given this
print(f"m = {m}\n")    
    
'''
Choosing k
'''
# percentage of users to use
percent = 0.01 
# number of neighbors for our new user
k       = int(ratings.shape[0] * percent)
print(f"k = {k}\n")

'''
Creating a new user
'''
n_columns    = 100
fill_value   = 99.0
joke_indices = [i for i in joke_nums]
print(joke_indices)

'''NOTE'''
# Although some of these jokes can definitely be viewed as offensive,
# I rated them by which ones elicited the largest response. 
# Plus I happen to enjoy some good ol' fashioned dark humor.
new_scores   = [7, 3.5, 0, 6]


new_user     = create_new_user(joke_indices, new_scores, fill_value, n_columns)
print(f"Our new user looks like:\n{new_user}\n")

'''
Getting the k-nearest neighbors
'''
# The actual columns the jokes are stored in
columns     = [i for i in range(1, n_columns + 1)]
k_neighbors = neighbors(ratings, new_user, k, fill_value, columns)
print(f"{k} nearest neighbors are:\n{k_neighbors}")

'''
Getting m jokes
'''
recommended_jokes = get_jokes(ratings, new_user, k_neighbors, m, fill_value, columns)
for joke in recommended_jokes:
    print(f"Joke {joke}: {jokes.iloc[joke].values[0]}\n")

Joke 1: This couple had an excellent relationship going until one day he came home from work to find his girlfriend packing. He asked her why she was leaving him and she told him that she had heard awful things about him.   "What could they possibly have said to make you move out?"   "They told me that you were a pedophile."   He replied, "That's an awfully big word for a ten year old."

Joke 2: Q. What's 200 feet long and has 4 teeth?   A. The front row at a Willie Nelson Concert.

Joke 3: Q. What's the difference between a man and a toilet?   A. A toilet doesn't follow you around after you use it.

Joke 4: Q.	What's O. J. Simpson's Internet address?  A.	Slash, slash, backslash, slash, slash, escape.

m = 3

k = 249

[1, 2, 3, 4]
Our new user looks like:
[4, 99.0, 7, 3.5, 0, 6, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 99.0, 

**Part E:**

Read some more of the jokes.  Repeat your favorite(s) here:

In [11]:
favs = [1, 4, 27, 32, 53]
for joke in favs:
    print(f"Joke {joke}: {jokes.iloc[joke].values[0]}\n")

Joke 1: This couple had an excellent relationship going until one day he came home from work to find his girlfriend packing. He asked her why she was leaving him and she told him that she had heard awful things about him.   "What could they possibly have said to make you move out?"   "They told me that you were a pedophile."   He replied, "That's an awfully big word for a ten year old."

Joke 4: Q.	What's O. J. Simpson's Internet address?  A.	Slash, slash, backslash, slash, slash, escape.

Joke 27: A mechanical, electrical and a software engineer from Microsoft were driving through the desert when the car broke down. The mechanical engineer said "It seems to be a problem with the fuel injection system, why don't we pop the hood and I'll take a look at it." To which the electrical engineer replied, "No I think it's just a loose ground wire, I'll get out and take a look." Then, the Microsoft engineer jumps in. "No, no, no. If we just close up all the windows, get out, wait a few minutes,

***
<a/ id='ec'></a>
[Back to top](#top)
# Extra Credit (Running BigCLAM on Big Data: up to +25 pts).  This problem may be turned in either with HW7 or HW8.

Consider the data set `Marvel_Network`.  This set consists of two columns, hero1 and hero2. Each row records the co-occurrence of two Marvel characters in a comic.

### Part A) (10 pts) Cleaning and setup
#### (i) Use item baskets to count how many times each character appears.  You may also want to count how many times each *edge* between pairs of characters appears (use your code from A Priori!)

Print the most popular 5 characters' names.


In [12]:
#Names and counts


#### (ii) Remove any characters having degree = 1. What do these nodes represent?

#### (iii) For speed of computation, you should also remove any *edges* with count 1. What do these *edges* represent?

#### (iv) Create an adjacency matrix (or probably a compact/sparse representation of such a matrix!) for the data after accounting for parts (ii), (iii).  How many characters are there?  How many edges?
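
The steps in parts (i) through (iv) could be sketched as follows. This uses a tiny made-up edge list in place of `Marvel_Network` (the placeholder names and counts are invented), treats each row as a two-item basket, and stores the adjacency as a dictionary of dictionaries; all of these are illustrative assumptions, not a required layout.

```python
from collections import Counter
import pandas as pd

# Toy stand-in for the real data; only the column names come from the problem.
edges = pd.DataFrame({
    "hero1": ["A", "A", "B", "A", "C", "D"],
    "hero2": ["B", "B", "C", "C", "A", "A"],
})

# (i) Each row is a 2-item basket; sort the pair so (A, C) and (C, A) match.
pairs = [tuple(sorted(p)) for p in zip(edges["hero1"], edges["hero2"])]
edge_counts = Counter(pairs)
char_counts = Counter(name for pair in pairs for name in pair)
top5 = char_counts.most_common(5)

# (ii) Degree = number of distinct partners; keep characters with degree > 1.
degree = Counter()
for u, v in edge_counts:
    degree[u] += 1
    degree[v] += 1
keep_nodes = {name for name, d in degree.items() if d > 1}

# (iii) Keep edges seen more than once, between kept characters only.
kept_edges = {(u, v): c for (u, v), c in edge_counts.items()
              if c > 1 and u in keep_nodes and v in keep_nodes}

# (iv) Sparse symmetric adjacency: adjacency[u][v] = co-occurrence count.
adjacency = {name: {} for name in keep_nodes}
for (u, v), c in kept_edges.items():
    adjacency[u][v] = c
    adjacency[v][u] = c

print(top5, len(adjacency), len(kept_edges))
```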

### Part B)  (10 pts)  Detect communities in the graph generated above using BigCLAM
For this problem, take the number of communities to find as 4 (plus a background community!).  Use your choice of initialization. If you have **prior knowledge** of any comic book characters, consider setting 4 characters from "different" Marvel universes as your seeds.  If not, either use PageRank or properties of your adjacency matrix to decide where to initialize affiliations.

Are the final communities ones you've seen in popular culture?  Verify by checking the community affiliation scores of two characters that you believe should share a dominant community, like Wolverine and Professor X.

### Part C) (5 pts) Detect communities in the graph generated above using *weighted* BigCLAM

It turns out that some edges occurred multiple times in the data set.  We can adjust our model to count the edges proportionately to the number of times they occurred: we just have to weight the partial derivatives in our gradient calculation by multiplying each term $v$ in $\nabla F_u$ by the number of times $u$ and $v$ were seen together.

Run the model accordingly, and report the community affiliations of the same pair of characters you looked at in Part B).  They should be even closer now, right?
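
One possible reading of the weighting described above is sketched below for a single row of the affiliation matrix. It assumes the standard BigCLAM gradient, $\nabla F_u = \sum_{v \in N(u)} F_v \frac{e^{-F_u \cdot F_v}}{1 - e^{-F_u \cdot F_v}} - \sum_{v \notin N(u)} F_v$, and applies the co-occurrence count only to the neighbor terms, leaving the non-neighbor sum unchanged. The function name, the dense-matrix layout, and the toy graph are illustrative assumptions; real data would likely need a sparse representation.

```python
import numpy as np

def bigclam_row_gradient(F, A, u, W=None, eps=1e-10):
    """Gradient of the BigCLAM log-likelihood with respect to F[u].

    F: (n_nodes, n_communities) nonnegative affiliation matrix.
    A: (n_nodes, n_nodes) 0/1 adjacency matrix.
    W: optional co-occurrence counts; defaults to A (the unweighted model).
    """
    if W is None:
        W = A
    nbr = A[u].astype(bool)      # astype returns a copy, so A is not modified
    nbr[u] = False
    non = ~nbr
    non[u] = False

    dots = F[nbr] @ F[u]                       # F_u . F_v for each neighbor v
    p_no_edge = np.exp(-dots)
    coeff = W[u, nbr] * p_no_edge / np.maximum(1.0 - p_no_edge, eps)

    # Neighbor terms pull F[u] toward F[v]; non-neighbor terms push it away.
    return coeff @ F[nbr] - F[non].sum(axis=0)

# Tiny example: node 0 is linked to nodes 1 and 2; edge (0, 1) was seen 3 times.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
W = A.copy()
W[0, 1] = W[1, 0] = 3
rng = np.random.default_rng(1)
F = rng.uniform(0.1, 1.0, size=(4, 2))

g_plain = bigclam_row_gradient(F, A, 0)
g_weighted = bigclam_row_gradient(F, A, 0, W)
print(g_plain, g_weighted)
```

With every weight equal to 1 this reduces to the unweighted gradient, which gives a simple way to check a weighted implementation against the Part B code.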