<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Mutation" data-toc-modified-id="Mutation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Mutation</a></span></li><li><span><a href="#SQLlite" data-toc-modified-id="SQLlite-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>SQLlite</a></span></li><li><span><a href="#Bit-Operation" data-toc-modified-id="Bit-Operation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Bit Operation</a></span></li><li><span><a href="#Constructing-Co-occurence-Matrix" data-toc-modified-id="Constructing-Co-occurence-Matrix-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Constructing Co-occurence Matrix</a></span></li><li><span><a href="#Normalized-Co-occurences" data-toc-modified-id="Normalized-Co-occurences-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Normalized Co-occurences</a></span></li></ul></div>

## Mutation

- http://book.pythontips.com/en/latest/mutation.html

> Whenever you assign a variable to another variable of mutable datatype, any changes to the data are reflected by both variables. The new variable is just an alias for the old variable. This is only true for mutable datatypes.

In [21]:
# `item` is an alias for the dictionary stored in the list,
# so updating `item` in place also updates the dictionary
# that `items` holds
items = [{
    'name': 'chair',
    'price': 15.00
}]

item = items[0]
data = {'name': 'piano', 'price': 12.00}
item.update(data)
items

[{'name': 'piano', 'price': 12.0}]
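When the aliasing above is not wanted, one common way around it is to copy the object before modifying it. The following is a minimal sketch using the standard library `copy` module; the variable names simply mirror the cell above.

```python
import copy

items = [{'name': 'chair', 'price': 15.00}]

# deepcopy creates an independent dictionary, so updating the copy
# leaves the dictionary stored in the list untouched
item = copy.deepcopy(items[0])
item.update({'name': 'piano', 'price': 12.00})

print(items)             # the original dictionary is unchanged
print(item)              # only the copy carries the new values
print(item is items[0])  # False, they are two distinct objects
```

For a flat dictionary like this one, `dict(items[0])` or `items[0].copy()` would also be enough; `deepcopy` matters once the values themselves are mutable.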

## SQLlite

- https://docs.python.org/3.6/library/sqlite3.html
- https://www.sqlite.org/autoinc.html

> SQLite is a database that stores its data in a single file on disk, so the argument passed to `connect` here is a file name. It is more lightweight than client-server options such as MySQL or PostgreSQL, which makes it convenient for quick prototyping, although it can be slower as a result.

In [7]:
import sqlite3


connection = sqlite3.connect('data.db')
cursor = connection.cursor()

create_table_statement = """
    CREATE TABLE users(id int, username text, password text)
"""
cursor.execute(create_table_statement)

insert_user_statement = "INSERT INTO users VALUES(?, ?, ?)"

# inserting one record
user = (1, 'ethen', 'asdf')
cursor.execute(insert_user_statement, user)

# inserting multiple records
users = [
    (2, 'rolf', 'asdf'),
    (3, 'anne', 'xyz')
]
cursor.executemany(insert_user_statement, users)

# looping through the result of a SELECT statement as an iterator
get_all_users_statement = "SELECT * FROM users"
for row in cursor.execute(get_all_users_statement):
    print(row)
    
# fetch the first result
result = cursor.execute(get_all_users_statement)
row = result.fetchone()
print(row)

connection.commit()
connection.close()

(1, 'ethen', 'asdf')
(2, 'rolf', 'asdf')
(3, 'anne', 'xyz')
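As a variation on the cell above, the sketch below uses an in-memory database so it can be rerun without leaving a `data.db` file behind, and declares `id` as `INTEGER PRIMARY KEY` so SQLite assigns the ids itself (the behavior described in the autoincrement link). The schema change is an assumption made for illustration, not part of the original cell.

```python
import sqlite3

# ':memory:' creates a temporary database that lives only in RAM
connection = sqlite3.connect(':memory:')

# used as a context manager, the connection commits on success and
# rolls back if an exception is raised; it does not close itself
with connection:
    connection.execute(
        'CREATE TABLE users(id INTEGER PRIMARY KEY, username text, password text)')
    # id is omitted, so SQLite fills it in automatically
    connection.executemany(
        'INSERT INTO users(username, password) VALUES(?, ?)',
        [('ethen', 'asdf'), ('rolf', 'asdf'), ('anne', 'xyz')])

rows = connection.execute(
    'SELECT id, username FROM users ORDER BY id').fetchall()
print(rows)

connection.close()
```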


## Bit Operation

- https://code.tutsplus.com/articles/understanding-bitwise-operators--active-11301

> To check whether a number is even, we can test its lowest bit with a bitwise AND instead of using the mod operation. The bit operation tends to be faster, especially when the number is large.

In [18]:
num = 311452345245123412341

In [19]:
%%timeit
if (num % 2):
    result = 'number is odd'
else:
    result = 'number is even'

84.3 ns ± 0.564 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [20]:
%%timeit
if (num & 1):
    result = 'number is odd'
else:
    result = 'number is even'

54.3 ns ± 0.802 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
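Outside of a notebook, the same comparison can be sketched with the standard library `timeit` module. The helper names `is_odd_mod` and `is_odd_bit` are hypothetical, and the timings will vary by machine and Python version, so no expected numbers are shown.

```python
import timeit

num = 311452345245123412341


def is_odd_mod(n):
    """Parity check using the mod operation."""
    return (n % 2) == 1


def is_odd_bit(n):
    """Parity check using the lowest bit."""
    return (n & 1) == 1


# both checks agree for Python integers, including negative ones
agree = all(is_odd_mod(n) == is_odd_bit(n) for n in range(-10, 11))
print(agree, is_odd_bit(num))

# time each approach over the same number of calls
mod_time = timeit.timeit(lambda: is_odd_mod(num), number=100000)
bit_time = timeit.timeit(lambda: is_odd_bit(num), number=100000)
print('mod:', mod_time, 'bit:', bit_time)
```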


## Constructing Co-occurence Matrix

- https://stackoverflow.com/questions/20574257/constructing-a-co-occurrence-matrix-in-python-pandas

> Given a table with users and their purchases of each item, we can leverage a matrix multiplication to construct the co-purchase matrix.

In [12]:
import pandas as pd

# note: for larger tables, a sparse matrix might be preferred
# to store this information
df = pd.DataFrame({
    'userId': [1, 2, 3, 4, 5, 6],
    'Snack': [1, 0, 1, 1, 0, 0],
    'Trans': [1, 1, 1, 0, 0, 1],
    'Dop': [1, 0, 1, 0, 1, 1]}).set_index('userId')
df.head()

| userId | Snack | Trans | Dop |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 0 |
| 3 | 1 | 1 | 1 |
| 4 | 1 | 0 | 0 |
| 5 | 0 | 0 | 1 |


In [13]:
X = df.values
co_occurence = X.T.dot(X)
co_occurence

array([[3, 2, 2],
       [2, 4, 3],
       [2, 3, 4]])
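The raw array above loses the item names. A possible variation, sketched below, keeps the labels by multiplying the DataFrames directly and separates the diagonal (each item's own purchase count) from the off-diagonal co-purchase counts. The names `item_counts` and `pairs_only` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'userId': [1, 2, 3, 4, 5, 6],
    'Snack': [1, 0, 1, 1, 0, 0],
    'Trans': [1, 1, 1, 0, 0, 1],
    'Dop': [1, 0, 1, 0, 1, 1]}).set_index('userId')

# DataFrame.dot aligns on labels, so the result keeps the item names
co_occurrence = df.T.dot(df)

# the diagonal holds how many users purchased each item
item_counts = pd.Series(np.diag(co_occurrence), index=co_occurrence.index)

# zero out the diagonal on a copy to keep only item-to-item counts
values = co_occurrence.to_numpy().copy()
np.fill_diagonal(values, 0)
pairs_only = pd.DataFrame(
    values, index=co_occurrence.index, columns=co_occurrence.columns)

print(co_occurrence)
print(item_counts)
print(pairs_only)
```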

In [16]:
import pandas as pd

base_path = '/Users/mingyuliu/personal/project/learning/learn/ml-100k/'
rating_path = base_path + 'u.data'

df = pd.read_csv(rating_path, sep='\t', header=None)
df.columns = ['userId', 'itemId', 'rating', 'timestamp']
print('dimension: ', df.shape)
df.head()

dimension:  (100000, 4)


| | userId | itemId | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


In [17]:
# itemId is parsed as an integer column, so match against integer ids
subset = df[df['itemId'].isin([40, 167])]
subset.head()

| | userId | itemId | rating | timestamp |
|---|---|---|---|---|
| 13 | 210 | 40 | 3 | 891035994 |
| 375 | 87 | 40 | 3 | 879876917 |
| 582 | 276 | 40 | 3 | 874791871 |
| 1243 | 43 | 40 | 3 | 883956468 |
| 2240 | 244 | 167 | 3 | 880607853 |


In [30]:
# https://stackoverflow.com/questions/32918506/pandas-how-to-filter-for-items-that-occur-more-than-once-in-a-dataframe
temp = subset.groupby('userId').filter(lambda x: len(x) > 1)
temp['userId'].unique()

array([210,  87, 244,  13,   5, 222, 279, 280, 174, 286,  92, 435, 389,
         1, 504, 417, 660, 711, 805, 318, 648, 378, 804])

In [34]:
temp.head()

| | userId | itemId | rating | timestamp |
|---|---|---|---|---|
| 13 | 210 | 40 | 3 | 891035994 |
| 375 | 87 | 40 | 3 | 879876917 |
| 2240 | 244 | 167 | 3 | 880607853 |
| 2699 | 13 | 167 | 4 | 882141659 |
| 4094 | 5 | 167 | 2 | 875636281 |


In [35]:
s = subset.groupby('userId')
s.get_group(210)

| | userId | itemId | rating | timestamp |
|---|---|---|---|---|
| 13 | 210 | 40 | 3 | 891035994 |
| 37325 | 210 | 167 | 4 | 891036054 |


In [21]:
subset.groupby(['userId', 'itemId']).count()

| userId | itemId | rating | timestamp |
|---|---|---|---|
| 1 | 40 | 1 | 1 |
| 1 | 167 | 1 | 1 |
| 5 | 40 | 1 | 1 |
| 5 | 167 | 1 | 1 |
| 10 | 40 | 1 | 1 |
| 11 | 40 | 1 | 1 |
| 13 | 40 | 1 | 1 |
| 13 | 167 | 1 | 1 |
| 22 | 167 | 1 | 1 |
| 43 | 40 | 1 | 1 |
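To connect the long-format ratings back to the matrix multiplication used earlier, one possible approach is to pivot the rows into a user-by-item indicator table first. The sketch below uses a small made-up `ratings` frame in place of `u.data`, so the numbers are illustrative only.

```python
import pandas as pd

# hypothetical ratings in the same long format as u.data
ratings = pd.DataFrame({
    'userId': [1, 1, 2, 3, 3, 4],
    'itemId': [40, 167, 40, 40, 167, 167]})

# rows are users, columns are items, entries count the interactions
user_item = pd.crosstab(ratings['userId'], ratings['itemId'])

# clip to 0/1 in case a user interacted with an item more than once
user_item = user_item.clip(upper=1)

# same matrix multiplication as before, now labeled by itemId
co_occurrence = user_item.T.dot(user_item)
print(co_occurrence)
```

Here users 1 and 3 interacted with both items, so the off-diagonal entry for items 40 and 167 is 2.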


## Normalized Co-occurences

Raw co-occurrence counts favor popular items, so they can be normalized by popularity using the Jaccard similarity, e.g. (# purchases made on both item i and item j) / (# purchases made on item i or item j).
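A minimal sketch of this normalization, reusing the toy purchase matrix from the co-occurrence section: the intersection is the co-occurrence count, and the union is obtained as count(i) + count(j) - count(i and j). It assumes every item has at least one purchase, otherwise the division would hit a zero denominator.

```python
import numpy as np

# rows are users, columns are the items Snack, Trans, Dop
X = np.array([
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1]])

# intersection: number of users who purchased both item i and item j
intersection = X.T.dot(X)

# the diagonal holds the number of users who purchased each item
counts = np.diag(intersection)

# union: users who purchased item i or item j
union = counts[:, None] + counts[None, :] - intersection

jaccard = intersection / union
print(jaccard)
```

Each item has a similarity of 1 with itself, and the Trans/Dop pair scores higher than the Snack/Trans pair because 3 of the 5 users who bought either also bought both.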