### **Data Mining Using Python**

<font color="red">File access required:</font> In Colab this notebook requires first uploading files **Shop.csv** and **Movies.csv** using the *Files* feature in the left toolbar. If running the notebook on a local computer, simply ensure these files are in the same workspace as the notebook.

In [9]:
# Set-up
import csv

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Look at CSV files:** TID,item pairs

In [10]:
# Read shopping dataset from CSV file
# Create dictionary "Sitems" with key = item and value = set of transactions
# Also set variable Snumtrans = number of transactions
Sitems = {}
trans = []  # list of transactions used to set Snumtrans

with open('/content/drive/MyDrive/Google Colab/Data Mining Using Python and SQL/Copy of Shop.csv') as f:
    rows = csv.DictReader(f)

    for r in rows:
        if r['item'] not in Sitems:
            Sitems[r['item']] = {r['TID']}
        else:
            Sitems[r['item']].add(r['TID'])
        if r['TID'] not in trans:
            trans.append(r['TID'])

Snumtrans = len(trans)
print('Number of transactions:', Snumtrans)
print('Number of distinct items:', len(Sitems))
print('Item dictionary:')
Sitems

Number of transactions: 5
Number of distinct items: 5
Item dictionary:


{'milk': {'1', '2', '4', '5'},
 'eggs': {'1', '3', '4'},
 'juice': {'1', '2', '5'},
 'cookies': {'2', '5'},
 'chips': {'3', '5'}}

In [11]:
# Read movies dataset from CSV file
# Create dictionary "Mitems" with key = item and value = set of transactions
# Also set variable Mnumtrans = number of transactions
Mitems = {}
trans = []  # list of transactions used to set Mnumtrans

with open('/content/drive/MyDrive/Google Colab/Data Mining Using Python and SQL/Copy of Movies.csv') as f:
    rows = csv.DictReader(f)

    for r in rows:
        if r['item'] not in Mitems:
            Mitems[r['item']] = {r['TID']}
        else:
            Mitems[r['item']].add(r['TID'])
        if r['TID'] not in trans:
            trans.append(r['TID'])

Mnumtrans = len(trans)
print('Number of transactions (users):', Mnumtrans)
print('Number of distinct items (movies):', len(Mitems))
print('Item dictionary:')
Mitems.items()

Number of transactions (users): 1382
Number of distinct items (movies): 123
Item dictionary:


dict_items([('The Fault in Our Stars', {'128203', '14642', '69391', '15590', '31198', '29139', '174147', '114602', '243482', '184307', '72265', '186450', '152735', '18805', '61803', '105787', '127474', '153533', '98509', '101420', '102371', '140380', '171673', '215987', '145755', '221712', '191042', '38153', '232320', '34280', '54212', '4231', '12924', '151368', '210968', '121896', '62076', '78715', '150676', '25851', '71352', '68551', '128828', '50601', '41453', '173452', '100316', '47173', '22793', '229131', '239816', '39827', '87907', '240694', '163775', '166205', '92094', '6802', '167831', '15530', '176880', '158208', '115516', '3508', '234322', '124830', '174077', '206129', '6573', '171575', '121408', '168240', '87127', '152646', '126898', '218424', '241263', '241916', '214778', '184040', '176', '43142', '102901', '218669', '206146', '200931', '46770', '232001', '89200', '237123', '71268', '55096', '165260', '96530', '54858', '36040', '158048', '58296', '204050', '7353', '120662',

### Some new Python features

In [None]:
# Iterating through dictionaries
for i in Sitems:
    print(i)
    print(Sitems[i])

milk
{'5', '1', '4', '2'}
eggs
{'1', '4', '3'}
juice
{'5', '1', '2'}
cookies
{'5', '2'}
chips
{'5', '3'}


In [None]:
# Intersecting sets
# How many transactions contain both eggs and milk?
set1 = Sitems['eggs']
print('Transactions containing eggs:', set1)
set2 = Sitems['milk']
print('Transactions containing milk:', set2)
set3 = set1 & set2
print('Transactions containing both:', set3)
print('Number of transactions containing both:', len(set3))

Transactions containing eggs: {'1', '4', '3'}
Transactions containing milk: {'5', '1', '4', '2'}
Transactions containing both: {'1', '4'}
Number of transactions containing both: 2


## Shopping dataset - frequent item-sets

### Frequent item-sets of two

#### Print all pairs of items and the number of transactions they occur together in (see what's wrong and fix it)

In [29]:
for i1 in Sitems:
    for i2 in Sitems:
      # if i1 != i2:
      if i1 < i2:
        common = len(Sitems[i1] & Sitems[i2])
        print([i1, i2, common])

['eggs', 'milk', 2]
['eggs', 'juice', 1]
['juice', 'milk', 3]
['cookies', 'milk', 2]
['cookies', 'eggs', 0]
['cookies', 'juice', 2]
['chips', 'milk', 1]
['chips', 'eggs', 1]
['chips', 'juice', 1]
['chips', 'cookies', 1]


#### Now only print pairs that meet support threshold

In [30]:
support = .1
for i1 in Sitems:
    for i2 in Sitems:
      if i1 < i2:
        common = len(Sitems[i1] & Sitems[i2])
        if common/Snumtrans > support:
          print(i1, '|', i2)

eggs | milk
eggs | juice
juice | milk
cookies | milk
cookies | juice
chips | milk
chips | eggs
chips | juice
chips | cookies


### Frequent item-sets of three

In [None]:
support = .3
for i1 in Sitems:
    for i2 in Sitems:
        for i3 in Sitems:
            if i1 < i2 and i2 < i3:
                common = len(Sitems[i1] & Sitems[i2] & Sitems[i3])
                if common/Snumtrans > support:
                    print(i1, '|', i2, '|', i3)

cookies | juice | milk


### <font color = 'green'>**Your Turn - Movies dataset frequent item-sets**</font>

In [None]:
print(Mnumtrans, 'transactions (users)')
print(len(Mitems), 'distinct items (movies)')

1382 transactions (users)
123 distinct items (movies)


#### Mine for frequent item-sets of three and four items in the Movies dataset. Find a single support threshold where the number of frequent item-sets of three items is more than 10 but less than 20, and the number of frequent item-sets of four items is more than 0.

In [None]:
# Frequent item-sets of three
support = .1
for i1 in Mitems:
    for i2 in Mitems:
        for i3 in Mitems:
            if i1 < i2 and i2 < i3:
                common = len(Mitems[i1] & Mitems[i2] & Mitems[i3])
                if common/Snumtrans > support:
                    print(i1, '|', i2, '|', i3)

Boyhood | Inside Out | Teenage Mutant Ninja Turtles
Boyhood | Inside Out | Sisters
Boyhood | Inside Out | Leviathan
Boyhood | Inside Out | Louis C.K.: Live at The Comedy Store
Boyhood | Inside Out | Laggies
Boyhood | Inside Out | The Best of Me
Boyhood | Inside Out | Magic in the Moonlight
Boyhood | Inside Out | Justice League: Gods and Monsters
Boyhood | Inside Out | Man Up
Boyhood | Inside Out | Self/less
Boyhood | Inside Out | Strangerland
Boyhood | Inside Out | Steve Jobs: The Man in the Machine
Boyhood | Inside Out | Rage
Boyhood | Inside Out | Listen Up Philip
Boyhood | Inside Out | Outcast
Boyhood | Gone Girl | Inside Out
Boyhood | Gone Girl | Teenage Mutant Ninja Turtles
Boyhood | Gone Girl | Sisters
Boyhood | Gone Girl | Leviathan
Boyhood | Gone Girl | In the Heart of the Sea
Boyhood | Gone Girl | Louis C.K.: Live at The Comedy Store
Boyhood | Gone Girl | Laggies
Boyhood | Gone Girl | The Best of Me
Boyhood | Gone Girl | Magic in the Moonlight
Boyhood | Gone Girl | Justice Lea

In [None]:
# Frequent item-sets of four
support = .01
for i1 in Mitems:
    for i2 in Mitems:
        for i3 in Mitems:
          for i4 in Mitems:
            if i1 < i2 and i2 < i3 and i3 < i4:
                common = len(Mitems[i1] & Mitems[i2] & Mitems[i3] & Mitems[i4])
                if common/Snumtrans > support:
                    print(i1, '|', i2, '|', i3, '|', i4)

## Shopping dataset - association rules

### Association rules with one item on the left-hand side

#### First compute frequent item-sets of one item, as candidate left-hand sides of assocation rules. Include the number of transactions the items occur in.

In [18]:
support = .5
frequentLHS = []
for i in Sitems:
    if len(Sitems[i])/Snumtrans > support:
        frequentLHS.append([i,len(Sitems[i])])
print(frequentLHS)

[['milk', 4], ['eggs', 3], ['juice', 3]]


#### Now find right-hand side items with sufficient confidence (see what's wrong and fix it)

In [19]:
confidence = .5
for lhs in frequentLHS:
    for i in Sitems:
        common = len(Sitems[lhs[0]] & Sitems[i])
        if common/lhs[1] > confidence:
            print(lhs[0], '->', i)

milk -> milk
milk -> juice
eggs -> milk
eggs -> eggs
juice -> milk
juice -> juice
juice -> cookies


### Association rules with two items on the left-hand side

#### First compute frequent item-sets of two items, as candidate left-hand sides of assocation rules. Include the number of transactions the items occur in.

In [20]:
support = .5
frequentLHS = []
for i1 in Sitems:
    for i2 in Sitems:
        if i1 < i2:
            common = len(Sitems[i1] & Sitems[i2])
            if common/Snumtrans > support:
                frequentLHS.append([i1,i2,common])
print(frequentLHS)

[['juice', 'milk', 3]]


#### Now find right-hand side items with sufficient confidence

In [21]:
confidence = .5
for lhs in frequentLHS:
    for i in Sitems:
        if i not in lhs:
            common = len(Sitems[lhs[0]] & Sitems[lhs[1]] & Sitems[i])
            if common/lhs[2] > confidence:
                print(lhs[0], '|', lhs[1], '->', i)

juice | milk -> cookies


## Shopping dataset - association rules with lift instead of confidence

### Association rules with one item on the left-hand side

#### First compute frequent item-sets of one item, as candidate left-hand sides of assocation rules. Include the number of transactions the items occur in.

In [25]:
support = .5
frequentLHS = []
for i in Sitems:
    if len(Sitems[i])/Snumtrans > support:
        frequentLHS.append([i,len(Sitems[i])])
print(frequentLHS)

[['milk', 4], ['eggs', 3], ['juice', 3]]


#### Now find right-hand side items with sufficient lift

In [26]:
liftthresh = 1
for lhs in frequentLHS:
    for i in Sitems:
        if i not in lhs:
            common = len(Sitems[lhs[0]] & Sitems[i])
            lift = (common/lhs[1]) / (len(Sitems[i])/Snumtrans)
            if lift > liftthresh:
                print(lhs[0], '->', i, ' lift:', lift)

milk -> juice  lift: 1.25
milk -> cookies  lift: 1.25
juice -> milk  lift: 1.25
juice -> cookies  lift: 1.6666666666666665


### Association rules with two items on the left-hand side

#### First compute frequent item-sets of two items, as candidate left-hand sides of assocation rules. Include the number of transactions the items occur in.

In [13]:
support = .5
frequentLHS = []
for i1 in Sitems:
    for i2 in Sitems:
        if i1 < i2:
            common = len(Sitems[i1] & Sitems[i2])
            if common/Snumtrans > support:
                frequentLHS.append([i1,i2,common])
print(frequentLHS)

[['juice', 'milk', 3]]


#### Now find right-hand side items with sufficient lift

In [12]:
liftthresh = 1
for lhs in frequentLHS:
    for i in Sitems:
        if i not in lhs:
            common = len(Sitems[lhs[0]] & Sitems[lhs[1]] & Sitems[i])
            lift = (common/lhs[2]) / (len(Sitems[i])/Snumtrans)
            if lift > liftthresh:
                print(lhs[0], '|', lhs[1], '->', i, ' lift:', lift)

### <font color = 'green'>**Your Turn - Movies dataset association rules**</font>

#### Mine for association rules in the Movies dataset with three items on the left-hand side. Find support and confidence thresholds (need not be the same) so the number of association rules is more than 10 but less than 20.

In [None]:
# Association rules with three items on the left-hand side
# Hint: Make sure to include the code from the seprate cells above that
#   together implement the two steps of association rule mining
support = .01
confidence = .92

# Frequent item-sets of three as candidate left-hand sides
frequentLHS = []
for i1 in Mitems:
    for i2 in Mitems:
        for i3 in Mitems:
            if i1 < i2 and i2 < i3:
                common = len(Mitems[i1] & Mitems[i2] & Mitems[i3])
                if common/Mnumtrans > support:
                    frequentLHS.append([i1,i2,i3,common])
# print(len(frequentLHS))
for lhs in frequentLHS:
    for i in Mitems:
        if i not in lhs:
            common = len(Mitems[lhs[0]] & Mitems[lhs[1]] & Mitems[lhs[2]] & Mitems[i])
            if common/lhs[3] > confidence:
                print(lhs[0], '|', lhs[1], '|', lhs[2], '->', i)

Boyhood | Fury | Inside Out -> The Imitation Game
Boyhood | Fury | Inside Out -> Gone Girl
Boyhood | Calvary | Fury -> The Imitation Game
Boyhood | Calvary | Fury -> Gone Girl
Big Hero 6 | Boyhood | Fury -> The Imitation Game
Big Hero 6 | Boyhood | Fury -> Gone Girl
Big Hero 6 | Boyhood | Wild Tales -> The Imitation Game
Big Hero 6 | Boyhood | Wild Tales -> Gone Girl
Big Hero 6 | Fury | Transformers: Age of Extinction -> The Imitation Game
Big Hero 6 | The Hunger Games: Mockingjay - Part 2 | The Imitation Game -> Inside Out
Big Hero 6 | Calvary | The Imitation Game -> Gone Girl
Big Hero 6 | Calvary | Gone Girl -> The Imitation Game
Fury | The Imitation Game | Wild Tales -> Gone Girl
Fury | Inside Out | The Imitation Game -> Gone Girl
Fury | Gone Girl | Wild Tales -> The Imitation Game
Calvary | Fury | The Imitation Game -> Gone Girl
Calvary | Fury | Gone Girl -> The Imitation Game


#### Mine for association rules in the Movies dataset with three items on the left-hand side. Find support and lift thresholds so the number of association rules is more than 10 but less than 20. Only consider lift thresholds > 1.


In [None]:
# Association rules with three items on the left-hand side
support = .08
liftthresh = 5

# Frequent item-sets of three as candidate left-hand sides
frequentLHS = []
for i1 in Mitems:
    for i2 in Mitems:
        for i3 in Mitems:
            if i1 < i2 and i2 < i3:
                common = len(Mitems[i1] & Mitems[i2] & Mitems[i3])
                if common/Mnumtrans > support:
                    frequentLHS.append([i1,i2,i3,common])
# print(len(frequentLHS))
for lhs in frequentLHS:
    for i in Mitems:
        if i != lhs[0] and i != lhs[1] and i != lhs[2]:
            common = len(Mitems[lhs[0]] & Mitems[lhs[1]] & Mitems[lhs[2]] & Mitems[i])
            lift = (common/lhs[3]) / (float(len(Mitems[i]))/Mnumtrans)
            if lift > liftthresh:
                print(lhs[0], '|', lhs[1], '|', lhs[2], '->', i, ' lift:', lift)

Big Hero 6 | Gone Girl | The Imitation Game -> Samba  lift: 5.80672268907563
Big Hero 6 | Gone Girl | The Imitation Game -> Faith of Our Fathers  lift: 5.80672268907563
Big Hero 6 | Gone Girl | The Imitation Game -> Turks & Caicos  lift: 5.80672268907563
Big Hero 6 | Gone Girl | The Imitation Game -> Flowers in the Attic  lift: 11.61344537815126
Big Hero 6 | Gone Girl | The Imitation Game -> Rage  lift: 5.80672268907563
Big Hero 6 | Gone Girl | The Imitation Game -> Listen Up Philip  lift: 5.80672268907563
Big Hero 6 | Gone Girl | The Imitation Game -> Whitey: United States of America v. James J. Bulger  lift: 11.61344537815126
Big Hero 6 | Gone Girl | The Imitation Game -> The Wonders  lift: 11.61344537815126
Big Hero 6 | Gone Girl | The Imitation Game -> Action Jackson  lift: 11.61344537815126
Big Hero 6 | Gone Girl | The Imitation Game -> Breathe  lift: 11.61344537815126
Big Hero 6 | Gone Girl | The Imitation Game -> Court  lift: 11.61344537815126
Big Hero 6 | Gone Girl | The Imitat