In [5]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display, HTML

# Fix the dying kernel problem (only a problem in some installations - you can remove it, if it works without it)
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Numpy tasks

For a detailed reference check out: https://numpy.org/doc/stable/reference/arrays.indexing.html.

**Task 1.** Calculate the sigmoid (logistic) function on every element of the following numpy array [0.3, 1.2, -1.4, 0.2, -0.1, 0.1, 0.8, -0.25] and print the last 5 elements. Use only vector operations.

In [3]:
# Write your code here
array = np.array([0.3, 1.2, -1.4, 0.2, -0.1, 0.1, 0.8, -0.25])
sigm = lambda x: 1 / (1 + np.exp(-x))
# Apply the sigmoid to every element, then print only the last 5 results
array2 = sigm(array)
print(array2[-5:])


[0.549834   0.47502081 0.52497919 0.68997448 0.4378235 ]


**Task 2.** Calculate the dot product of the following two vectors:<br/>
$x = [3, 1, 4, 2, 6, 1, 4, 8]$<br/>
$y = [5, 2, 3, 12, 2, 4, 17, 11]$<br/>
a) by using element-wise multiplication and np.sum,<br/>
b) by using np.dot,<br/>
c) by using np.matmul and transposition (x.T).

In [18]:
# Write your code here
x = np.array([3, 1, 4, 2, 6, 1, 4, 8])
y = np.array([5, 2, 3, 12, 2, 4, 17, 11])

print("a " + str(np.sum(x * y)))
print("b " + str(np.dot(x, y)))

# For 1-D arrays x.T is the same array, so this matches the dot product above
print("c " + str(np.matmul(x.T, y)))



a 225
b 225
c 225


**Task 3.** Calculate the value of the logistic model<br/>
$$y = \frac{1}{1 + e^{-x_0 \theta_0 - \ldots - x_9 \theta_9 - \theta_{10}}}$$
for<br/>
$x = [1.2, 2.3, 3.4, -0.7, 4.2, 2.7, -0.5, 1.4, -3.3, 0.2]$<br/>
$\theta = [2.7, 0.33, -2.12, -1.73, 2.9, -5.8, -0.9, 12.11, 3.43, -0.5, -1.65]$<br/>
and print the result. Use only vector operations.

In [37]:
# Write your code here
x = np.array([1.2, 2.3, 3.4, -0.7, 4.2, 2.7, -0.5, 1.4, -3.3, 0.2])
theta = np.array([2.7, 0.33, -2.12, -1.73, 2.9, -5.8, -0.9, 12.11, 3.43, -0.5, -1.65])

# Exponent: -(x_0*theta_0 + ... + x_9*theta_9) - theta_10, where the last theta is the intercept
power = np.sum(-x * theta[:-1])
y = 1 / (1 + np.exp(power - theta[-1]))
print(y)

0.2417699832615572


**Task 4.** Calculate the value of the multivariate linear regression model<br/>
$$y = A x + B$$
for<br/>
$A = \begin{bmatrix} 1 & 2 & 1 \\ 3 & 0 & 1 \end{bmatrix}$<br/>
$B = \begin{bmatrix} 0.2 \\ 0.3 \end{bmatrix}$<br/>
$x = [1, 2, 3]^T$<br/>
and print the result. Use only vector and matrix operations.
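
A shape detail worth checking in this task: if $x$ is stored as a 1-D array while $B$ is a column of shape (2, 1), NumPy broadcasting silently produces a (2, 2) result. The sketch below compares both variants; the names `x_flat`, `x_col`, `bad`, and `good` are illustrative only.

```python
import numpy as np

A = np.array([[1, 2, 1],
              [3, 0, 1]])
B = np.array([[0.2],
              [0.3]])          # shape (2, 1)

x_flat = np.array([1, 2, 3])   # shape (3,)
x_col = x_flat.reshape(3, 1)   # shape (3, 1), matching x = [1, 2, 3]^T

# (2,) + (2, 1) broadcasts to (2, 2), which is not the intended result
bad = A @ x_flat + B
# (2, 1) + (2, 1) stays (2, 1)
good = A @ x_col + B

print(bad.shape)   # (2, 2)
print(good.shape)  # (2, 1)
```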

In [7]:
# Write your code here
A = np.array([1,2,1,3,0,1]).reshape(2,3)
print(A)
B = np.array([0.2,0.3]).reshape(2,1)
print(B)
print()
# x must be a column vector of shape (3, 1); a 1-D x would make A x + B broadcast to (2, 2)
x = np.array([1,2,3]).reshape(3,1)
result = (np.matmul(A,x)+B)
print(result)


[[1 2 1]
 [3 0 1]]
[[0.2]
 [0.3]]

[[8.2]
 [6.3]]


# Pandas

## Load datasets

- Steam (https://www.kaggle.com/tamber/steam-video-games)

- MovieLens (https://grouplens.org/datasets/movielens/)

In [6]:
steam_df = pd.read_csv(os.path.join("data", "steam", "steam-200k.csv"), 
                       names=['user-id', 'game-title', 'behavior-name', 'value', 'zero'])

ml_ratings_df = pd.read_csv(os.path.join("data", "movielens_small", "ratings.csv"))
ml_movies_df = pd.read_csv(os.path.join("data", "movielens_small", "movies.csv"))

## Merge both MovieLens DataFrames into one

In [8]:
ml_df = pd.merge(ml_ratings_df, ml_movies_df, on='movieId')
ml_df.head(10)
steam_df.tail(10)

Unnamed: 0,user-id,game-title,behavior-name,value,zero
199990,128470551,Fallen Earth,purchase,1.0,0
199991,128470551,Fallen Earth,play,2.4,0
199992,128470551,Magic Duels,purchase,1.0,0
199993,128470551,Magic Duels,play,2.2,0
199994,128470551,Titan Souls,purchase,1.0,0
199995,128470551,Titan Souls,play,1.5,0
199996,128470551,Grand Theft Auto Vice City,purchase,1.0,0
199997,128470551,Grand Theft Auto Vice City,play,1.5,0
199998,128470551,RUSH,purchase,1.0,0
199999,128470551,RUSH,play,1.4,0


## Pandas tasks - Steam dataset

**Task 5.** How many people made a purchase in the Steam dataset? Remember that a person could buy many games, but you need to count every person once.

In [9]:
# Write your code here
steam_g = steam_df.loc[steam_df['behavior-name'] == 'purchase']
print(len(pd.unique(steam_g['user-id'])))

12393


**Task 6.** How many people made a purchase of "The Elder Scrolls V Skyrim"?

In [10]:
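# Write your code here
# Count only purchase rows, not play rows, for the selected title
is_purchase = steam_df['behavior-name'] == 'purchase'
is_skyrim = steam_df['game-title'] == 'The Elder Scrolls V Skyrim'
steam_g = steam_df.loc[is_purchase & is_skyrim]
x = pd.unique(steam_g['user-id'])
print(len(x))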

717


**Task 7.** How many purchases did people make on average?

In [26]:
# Write your code here
cond2 = steam_df['behavior-name'] == 'purchase'
g = steam_df[cond2]
# Each purchase row has value 1.0, so the per-user sum of 'value' is the purchase count
group = g.groupby('user-id')['value'].sum()

print(group.mean())



10.45033486645687


**Task 8.** Who bought the most games?

In [28]:
# Write your code here
cond2 = steam_df['behavior-name'] == 'purchase'
# Sum only the 'value' column so the text columns are not aggregated
g = steam_df[cond2].groupby('user-id')['value'].sum()
g = g.sort_values(ascending=False).reset_index()
display(g.head(1))

Unnamed: 0,user-id,value
0,62990992,1075.0


**Task 9.** How many hours on average did people play "The Elder Scrolls V Skyrim"?
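
One possible approach, sketched on a made-up toy frame with the same column names as `steam_df` (the rows and the title `Game A` are invented for illustration): combine two boolean masks and average the `value` column of the matching play rows.

```python
import pandas as pd

# Made-up rows in the same layout as the Steam DataFrame (not real data)
toy_df = pd.DataFrame({
    'user-id': [1, 1, 2, 2, 3],
    'game-title': ['Game A', 'Game A', 'Game A', 'Game A', 'Game B'],
    'behavior-name': ['purchase', 'play', 'purchase', 'play', 'play'],
    'value': [1.0, 10.0, 1.0, 30.0, 5.0],
})

is_play = toy_df['behavior-name'] == 'play'
is_game = toy_df['game-title'] == 'Game A'

# Mean of hours over play rows of the selected title only
mean_hours = toy_df.loc[is_play & is_game, 'value'].mean()
print(mean_hours)  # 20.0
```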

In [None]:
# Write your code here

**Task 10.** Which games were played the most (in terms of the number of hours played)? Print the first 10 titles and respective numbers of hours.
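
A minimal sketch of the grouping pattern this task calls for, again on invented rows: filter to play rows, sum hours per title, and keep the largest totals. On the real data `nlargest(10)` would replace `nlargest(2)`.

```python
import pandas as pd

# Made-up play records (not real data)
toy_df = pd.DataFrame({
    'user-id': [1, 2, 3, 1, 2],
    'game-title': ['Game A', 'Game A', 'Game B', 'Game C', 'Game C'],
    'behavior-name': ['play'] * 5,
    'value': [10.0, 30.0, 5.0, 50.0, 1.0],
})

plays = toy_df[toy_df['behavior-name'] == 'play']

# Total hours per title, largest first
total_hours = plays.groupby('game-title')['value'].sum().nlargest(2)
print(total_hours)
```

Swapping `sum()` for `mean()` gives the average hours per title instead of the total.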

In [None]:
# Write your code here

**Task 11.** Which games are the most consistently played (in terms of the average number of hours played)? Print the first 10 titles and respective numbers of hours.

In [None]:
# Write your code here

**Task 12\*\*.** Correct the above to account for the fact that a game with 0 hours played has no play record; only its purchase is recorded.
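
One way to think about this, sketched on invented rows: left-join the purchase records with the play records so that every owner appears, then fill the missing hours with 0. The helper names `keys`, `purchases`, `plays`, and `hours` are assumptions made for this sketch.

```python
import pandas as pd

# Made-up records: user 2 bought the game but never played it (not real data)
toy_df = pd.DataFrame({
    'user-id': [1, 1, 2, 3, 3],
    'game-title': ['Game A'] * 5,
    'behavior-name': ['purchase', 'play', 'purchase', 'purchase', 'play'],
    'value': [1.0, 10.0, 1.0, 1.0, 20.0],
})

keys = ['user-id', 'game-title']
purchases = toy_df.loc[toy_df['behavior-name'] == 'purchase', keys].drop_duplicates()
plays = (toy_df[toy_df['behavior-name'] == 'play']
         .groupby(keys, as_index=False)['value'].sum()
         .rename(columns={'value': 'hours'}))

# Left join keeps every purchase; purchases without a play row get 0 hours
hours = purchases.merge(plays, on=keys, how='left').fillna({'hours': 0.0})

mean_played_only = plays.groupby('game-title')['hours'].mean()
mean_all_owners = hours.groupby('game-title')['hours'].mean()
print(mean_played_only['Game A'])  # 15.0
print(mean_all_owners['Game A'])   # 10.0
```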

In [None]:
# Write your code here

**Task 13.** Apply the sigmoid function
$$f(x) = \frac{1}{1 + e^{-\frac{1}{100}x}}$$
to hours played and print the first 10 rows from the entire Steam dataset after this change.
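
A hedged sketch of the transformation on invented rows. It assumes that only rows with `behavior-name == 'play'` hold hours and that purchase rows should stay untouched; `scaled_sigmoid` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up rows (not real data)
toy_df = pd.DataFrame({
    'behavior-name': ['purchase', 'play', 'play'],
    'value': [1.0, 0.0, 100.0],
})

def scaled_sigmoid(x):
    """f(x) = 1 / (1 + exp(-x / 100)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x / 100.0))

is_play = toy_df['behavior-name'] == 'play'
out_df = toy_df.copy()
# Transform hours only; purchase rows keep their original value
out_df.loc[is_play, 'value'] = scaled_sigmoid(out_df.loc[is_play, 'value'])
print(out_df.head(10))
```

With this scaling, 0 hours maps to 0.5 and 100 hours maps to about 0.731.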

In [None]:
# Write your code here

## Pandas tasks - MovieLens dataset

**Task 14\*.** Calculate popularity (by the number of users who watched a movie) of all genres. Print a DataFrame with two columns: genre, n_users, where n_users contains the number of users who watched a given genre. Sort all genres in descending order.
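
In MovieLens the `genres` column packs several genres into one string separated by `|`. A possible sketch, on invented rows, splits that string, explodes it into one row per (rating, genre) pair, and counts distinct users per genre.

```python
import pandas as pd

# Made-up rows in the layout of the merged MovieLens frame (not real data)
toy_ml_df = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'movieId': [10, 20, 10, 20],
    'rating': [4.0, 3.0, 5.0, 2.0],
    'genres': ['Action|Comedy', 'Comedy', 'Action|Comedy', 'Comedy'],
})

# One row per (rating, genre) pair
by_genre = toy_ml_df.assign(genre=toy_ml_df['genres'].str.split('|')).explode('genre')

# Number of distinct users per genre, most popular first
n_users = (by_genre.groupby('genre')['userId'].nunique()
           .sort_values(ascending=False)
           .rename('n_users')
           .reset_index())
print(n_users)
```

The same exploded frame can be grouped on `rating` with `mean()` to get a per-genre average.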

In [None]:
# Write your code here

**Task 15\*.** Calculate average rating for all genres. Print a DataFrame with two columns: genre, rating, where rating contains the average rating for a given genre. Sort all genres in descending order.

In [None]:
# Write your code here

**Task 16.** Calculate each movie's rating bias (the deviation of its average rating from the mean of all movies' average ratings). Print the first 10 in the form: title, average rating, bias.
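
A small sketch of the bias computation on invented ratings. It reads "mean of all movies' average rating" as the mean of the per-movie averages, not the mean of all individual ratings; that reading is an assumption.

```python
import pandas as pd

# Made-up ratings (not real data)
toy_ml_df = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie B', 'Movie B', 'Movie C'],
    'rating': [5.0, 4.0, 2.0, 3.0, 3.5],
})

# Average rating per movie
movie_avg = (toy_ml_df.groupby('title')['rating'].mean()
             .rename('average rating')
             .reset_index())

# Bias = per-movie average minus the mean of the per-movie averages
movie_avg['bias'] = movie_avg['average rating'] - movie_avg['average rating'].mean()
print(movie_avg.head(10))
```

Grouping by `userId` instead of `title` gives the analogous per-user bias.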

In [None]:
# Write your code here

**Task 17.** Calculate each user's rating bias (the deviation of their average rating from the mean of all users' average ratings). Print the first 10 in the form: user_id, average rating, bias.

In [None]:
# Write your code here

**Task 18.** Randomly choose 10 movies and 10 users and print their interaction matrix in the form of a DataFrame with user_id as index and movie titles as columns. You can iterate over the DataFrame in this task.
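
Iteration is allowed here, but a pivot is one alternative. The sketch below uses invented rows, samples 2 users and 2 movies instead of 10, and fixes an arbitrary `random_state` for repeatability; unrated pairs show up as NaN.

```python
import pandas as pd

# Made-up ratings (not real data)
toy_ml_df = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3, 3],
    'title': ['Movie A', 'Movie B', 'Movie A', 'Movie C', 'Movie B', 'Movie C'],
    'rating': [4.0, 3.0, 5.0, 2.0, 1.0, 4.5],
})

# Sample distinct users and titles; random_state=0 is an arbitrary seed
users = toy_ml_df['userId'].drop_duplicates().sample(n=2, random_state=0)
titles = toy_ml_df['title'].drop_duplicates().sample(n=2, random_state=0)

subset = toy_ml_df[toy_ml_df['userId'].isin(users) & toy_ml_df['title'].isin(titles)]

# Reindex so sampled users/movies without any rating in the subset still appear
interactions = (subset.pivot_table(index='userId', columns='title', values='rating')
                .reindex(index=sorted(users), columns=sorted(titles)))
print(interactions)
```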

In [None]:
# Write your code here

## Pandas + numpy tasks

**Task 19.** Create the entire interaction matrix for the MovieLens dataset. Print the submatrix of first 10 rows and 10 columns.
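
A minimal sketch on invented rows: pivot the ratings into a users-by-movies table and convert it to a NumPy array. Filling unrated pairs with 0 is an assumed convention, not something the task prescribes.

```python
import numpy as np
import pandas as pd

# Made-up ratings (not real data)
toy_ml_df = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating': [4.0, 3.0, 5.0, 2.0],
})

# Rows are users, columns are movies; missing pairs become 0
matrix_df = toy_ml_df.pivot_table(index='userId', columns='movieId',
                                  values='rating', fill_value=0)
interaction_matrix = matrix_df.to_numpy()
print(interaction_matrix[:10, :10])
```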

In [None]:
# Write your code here

**Task 20.** Calculate the matrix of size (n_users, n_users) where at position (i, j) there is the number of movies watched both by user i and user j. Print the submatrix of first 10 rows and 10 columns.
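
If the interaction matrix is binarized (1 if the user watched the movie, 0 otherwise), the overlap counts are a single matrix product: entry (i, j) of $W W^T$ sums, over all movies, the product of the two users' indicators. A sketch with a hypothetical 3 x 3 matrix `watched`:

```python
import numpy as np

# Hypothetical binary interaction matrix: rows are users, columns are movies
watched = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
])

# Entry (i, j) counts movies watched by both user i and user j;
# the diagonal holds each user's own movie count
user_overlap = watched @ watched.T
print(user_overlap[:10, :10])
# [[2 1 1]
#  [1 1 0]
#  [1 0 2]]
```

Moving the transpose to the other side, `watched.T @ watched`, gives the movie-by-movie counts of shared viewers.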

In [None]:
# Write your code here

**Task 21.** Calculate the matrix of size (n_items, n_items) where at position (i, j) there is the number of users who watched both movie i and movie j. To prevent hanging your computer because of RAM shortage use only the first 1000 items. Print the submatrix of first 10 rows and 10 columns.

In [None]:
# Write your code here