# COMP4433 Individual Project

## Topic 1: Weblog Mining

JAHJA Darwin // 16094501d

---

## Intro

Data mining offers many ways to extract hidden information and structure from datasets. In this project, we use association rule mining with the Apriori algorithm to extract and analyze information from the Microsoft Anonymous Web Data.

The dataset records which areas of www.microsoft.com each user visited during a one-week timeframe in February 1998. Users are identified only by a sequential number, for example User #14988, User #14989, etc. There are 294 areas, each identified by its title (e.g. "NetShow for PowerPoint") and URL (e.g. "/stream").

The project source code and report are combined in this single Jupyter notebook, which makes it easier to read and follow what we implement in this data mining project.

In [1]:
# Import packages
import csv
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

## Data Preprocessing

First, we need to reorganize the data according to line type. There are 3 types of lines in total:

```
Attribute Lines:
For example, 'A,1277,1,"NetShow for PowerPoint","/stream"'
Where:
  'A' marks this as an attribute line, 
  '1277' is the attribute ID number for an area of the website (called a Vroot),
  '1' may be ignored, 
  '"NetShow for PowerPoint"' is the title of the Vroot, 
  '"/stream"' is the URL relative to "http://www.microsoft.com"

Case and Vote Lines:
For each user, there is a case line followed by zero or more vote lines.
For example:
  C,"10164",10164
  V,1123,1
  V,1009,1
  V,1052,1
Where:
  'C' marks this as a case line, 
  '10164' is the case ID number of a user, 
  'V' marks the vote lines for this case, 
  '1123', '1009', '1052' are the attribute IDs of Vroots that a user visited. 
  '1' may be ignored.
```

Note that we use both the train and test data (around 38k users in total) for the association analysis. First, we manually separate the attribute lines and the case-vote lines into 3 separate `.csv` files: `msweb_attr.csv`, `msweb_traincv.csv` and `msweb_testcv.csv`, holding the attribute lines, the training case-vote lines and the test case-vote lines respectively.
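The manual separation could also be scripted. The following is a minimal sketch, assuming the raw file follows the line format described above; `split_msweb` is a hypothetical helper that is not part of the original notebook, and the sample lines are the ones quoted in the format description.

```python
import csv
import io

def split_msweb(lines):
    """Split raw msweb lines into attribute rows and case/vote rows."""
    attr_rows, cv_rows = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if row[0] == 'A':
            attr_rows.append(row)
        elif row[0] in ('C', 'V'):
            cv_rows.append(row)
        # Assumption: any other line type is not needed and is skipped
    return attr_rows, cv_rows

# Small in-memory sample instead of the real data file
sample = io.StringIO(
    'A,1277,1,"NetShow for PowerPoint","/stream"\n'
    'C,"10164",10164\n'
    'V,1123,1\n'
    'V,1009,1\n'
    'V,1052,1\n'
)
attr_rows, cv_rows = split_msweb(sample)
print(attr_rows)
print(len(cv_rows))
```

Each returned list can then be written out with `csv.writer` to produce the separate files.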

Then, we load the attributes as a DataFrame:

In [2]:
attr = pd.read_csv('resources/msweb_attr.csv', header=None,
                   usecols=[1,3,4], index_col=0,
                   names=['id', 'title', 'url'])
attr.head()

| id | title | url |
|---|---|---|
| 1287 | International AutoRoute | /autoroute |
| 1288 | library | /library |
| 1289 | Master Chef Product Information | /masterchef |
| 1297 | Central America | /centroam |
| 1215 | For Developers Only Info | /developer |


Next, we handle the case and vote lines. We create a list of visited titles for each user, then use `TransactionEncoder` to transform these lists into the format we need.

In [3]:
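# Read a case-vote file and return a list of transactions:
# one list of visited Vroot titles per user (case)
def load_cv(filedir):
    cvlist = []
    with open(filedir, 'r', newline='') as csv_file:
        reader = csv.reader(csv_file)
        next(reader)  # skip the first case line; its vote lines follow
        items = []
        for row in reader:
            if row[0] == 'V':
                # Vote line: map the Vroot ID to its title via `attr`
                title = attr.at[int(row[1]), 'title']
                items.append(title)
            else:
                # Next case line: store the previous user's items
                cvlist.append(items)
                items = []
        # Store the items of the last user
        cvlist.append(items)
    return cvlist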

In [4]:
# Load the data and merge as one list
train_cv = load_cv('resources/msweb_traincv.csv')
test_cv = load_cv('resources/msweb_testcv.csv')

# Transform it to a one-hot encoded boolean DataFrame
te = TransactionEncoder()
all_cv = train_cv + test_cv

te_arr = te.fit(all_cv).transform(all_cv)
df = pd.DataFrame(te_arr, columns=te.columns_)

print('**The final dataset**')
df.info()
df.head()

**The final dataset**
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37711 entries, 0 to 37710
Columns: 285 entries, About Microsoft  to vbscripts
dtypes: bool(285)
memory usage: 10.2 MB


|  | About Microsoft | Access Development | ActiveX Data Objects | ActiveX Technology Development | Advanced Data Connector | Advanced Technology | Anti Piracy Information | Argentina | Australia | Authorized Technical Education Center Program | ... | mdsn | misc | msdownload. | partner | promo | regwiz | security. | softlib | sports | vbscripts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | True | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | True | False | False | False | False | False | False | False | False |


As we can see, there are 37711 users in total and 285 boolean columns, one per area title, showing which content each user viewed. Now we can use Apriori to find some interesting patterns.
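To make the encoded format concrete, here is a minimal pure-Python sketch of the same one-hot idea on a toy input. It only illustrates the shape of the result and is not how `TransactionEncoder` is implemented; `one_hot` and the toy transactions are made up for this example.

```python
def one_hot(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)
        rows.append([col in present for col in columns])
    return columns, rows

# Three toy users; the second one visited nothing
toy = [
    ['regwiz', 'Free Downloads'],
    [],
    ['misc'],
]
columns, rows = one_hot(toy)
print(columns)
for row in rows:
    print(row)
```

Each row corresponds to one user and each column to one area title, which is the layout `apriori` expects.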

## Association Analysis

Before that, we should explain why we use association analysis.

Association analysis is relatively light on mathematical concepts and easy to explain to non-technical people. In addition, it is an unsupervised learning tool that looks for hidden patterns, so it needs little data preparation and feature engineering. It is a good starting point for certain kinds of data exploration and can point the way to a deeper dive into the data using other approaches.

Here, we will use Apriori, an association rule mining algorithm, to discover those hidden patterns.

In Apriori, the *support* of an itemset is the fraction of transactions (here, users) that contain it. In many instances, we look for high support to ensure that a relationship is useful. However, low support can also be useful when we are trying to find 'hidden' relationships.
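The level-wise search behind Apriori can be sketched in a few lines of plain Python. This is a simplified illustration under the usual textbook description (join frequent itemsets, prune candidates with an infrequent subset, count support), not the mlxtend implementation; `apriori_sketch` and the toy transactions are hypothetical, and the function assumes a non-empty transaction list.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    tsets = [frozenset(t) for t in transactions]
    n = len(tsets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset
        return sum(itemset <= t for t in tsets) / n

    # Level 1: frequent single items
    current = {}
    for item in {i for t in tsets for i in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

toy = [
    {'Windows 95', 'Windows Family of OSs'},
    {'Windows 95', 'Windows Family of OSs', 'isapi'},
    {'isapi'},
    {'Free Downloads'},
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is ever counted.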

Let's say we only want rules for areas that were visited by at least 1000 unique users during the week. The corresponding support is around 1000 / 37711 = 2.65%, so let's generate frequent itemsets that have a support of at least 2.65%.
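The threshold arithmetic can be checked directly. This short sketch uses the user count reported for the encoded dataset and confirms that rounding the support down to 0.0265 still corresponds to a minimum of 1000 users; the variable names are chosen for this example only.

```python
import math

n_users = 37711       # number of rows in the encoded dataset
min_users = 1000      # desired minimum number of visitors per itemset

exact_support = min_users / n_users
min_support = 0.0265  # rounded threshold used in the next cell

# Smallest whole number of users that satisfies support >= min_support
implied_min_users = math.ceil(min_support * n_users)
print(round(exact_support, 4), implied_min_users)
```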

In [5]:
# Adjust the min support
min_support = 0.0265

frequent_itemsets = apriori(df, min_support=min_support, use_colnames=True)
frequent_itemsets.info()
frequent_itemsets.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 2 columns):
support     57 non-null float64
itemsets    57 non-null object
dtypes: float64(1), object(1)
memory usage: 1.0+ KB


|  | support | itemsets |
|---|---|---|
| 0 | 0.03373 | (Developer Network) |
| 1 | 0.046034 | (Developer Workshop) |
| 2 | 0.33176 | (Free Downloads) |
| 3 | 0.044417 | (Games) |
| 4 | 0.287025 | (Internet Explorer) |
| 5 | 0.099096 | (Internet Site Construction for Developers) |
| 6 | 0.091194 | (Knowledge Base) |
| 7 | 0.046379 | (MS Office Info) |
| 8 | 0.259526 | (Microsoft.com Search) |
| 9 | 0.156267 | (Products ) |


Next, we generate the rules with their corresponding support, confidence and lift. We keep only rules with a lift of at least 1: a lift above 1 means the antecedent and consequent appear together more often than expected if they were independent, so these rules are generally more interesting.
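For reference, the metrics reported for a rule $A \Rightarrow B$ follow the standard definitions below, where $\mathrm{supp}(X)$ is the fraction of users whose visits include every area in $X$. The notation is introduced here for illustration and does not appear in the original notebook.

$$
\begin{aligned}
\mathrm{conf}(A \Rightarrow B) &= \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)} \\
\mathrm{lift}(A \Rightarrow B) &= \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)} \\
\mathrm{leverage}(A \Rightarrow B) &= \mathrm{supp}(A \cup B) - \mathrm{supp}(A)\,\mathrm{supp}(B) \\
\mathrm{conviction}(A \Rightarrow B) &= \frac{1 - \mathrm{supp}(B)}{1 - \mathrm{conf}(A \Rightarrow B)}
\end{aligned}
$$

A lift of exactly 1 corresponds to $\mathrm{supp}(A \cup B) = \mathrm{supp}(A)\,\mathrm{supp}(B)$, that is, $A$ and $B$ being visited independently of each other.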

In [6]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.sort_values(by=['lift'], ascending=False).head(20)

|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 83 | (Windows95 Support) | (isapi, Windows Family of OSs) | 0.054281 | 0.044258 | 0.0284 | 0.523205 | 11.821793 | 0.025998 | 2.004513 |
| 78 | (isapi, Windows Family of OSs) | (Windows95 Support) | 0.044258 | 0.054281 | 0.0284 | 0.641702 | 11.821793 | 0.025998 | 2.639473 |
| 18 | (SiteBuilder Network Membership) | (Internet Site Construction for Developers) | 0.034446 | 0.099096 | 0.027896 | 0.809854 | 8.172436 | 0.024483 | 4.737954 |
| 19 | (Internet Site Construction for Developers) | (SiteBuilder Network Membership) | 0.099096 | 0.034446 | 0.027896 | 0.281509 | 8.172436 | 0.024483 | 1.343864 |
| 47 | (Windows Family of OSs) | (Windows 95) | 0.14041 | 0.035719 | 0.032749 | 0.233239 | 6.529824 | 0.027734 | 1.257603 |
| 46 | (Windows 95) | (Windows Family of OSs) | 0.035719 | 0.14041 | 0.032749 | 0.916852 | 6.529824 | 0.027734 | 10.338105 |
| 1 | (Internet Site Construction for Developers) | (Developer Workshop) | 0.099096 | 0.046034 | 0.028506 | 0.287664 | 6.248902 | 0.023944 | 1.339207 |
| 0 | (Developer Workshop) | (Internet Site Construction for Developers) | 0.046034 | 0.099096 | 0.028506 | 0.61924 | 6.248902 | 0.023944 | 2.366066 |
| 75 | (Knowledge Base) | (isapi, Support Desktop) | 0.091194 | 0.058975 | 0.033465 | 0.366967 | 6.222436 | 0.028087 | 1.486534 |
| 74 | (isapi, Support Desktop) | (Knowledge Base) | 0.058975 | 0.091194 | 0.033465 | 0.567446 | 6.222436 | 0.028087 | 2.101024 |


Let's figure out what this tells us.

For instance, we can see that there are quite a few rules with a high lift value, which means that the areas in the rule are visited together more frequently than would be expected from how often each is visited on its own.

We can also see several rules where the confidence is high as well. This part of the analysis is where domain knowledge becomes useful.
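The confidence and lift values in the rules table can be recomputed from the three support columns. The sketch below does this for the rule (Windows 95) → (Windows Family of OSs), using the rounded supports printed in the table; `rule_metrics` is a hypothetical helper written for this check, not an mlxtend function.

```python
def rule_metrics(supp_a, supp_b, supp_ab):
    """Compute rule metrics from antecedent, consequent and joint support."""
    confidence = supp_ab / supp_a
    lift = confidence / supp_b
    leverage = supp_ab - supp_a * supp_b
    # Conviction is undefined when confidence == 1 (division by zero)
    conviction = (1 - supp_b) / (1 - confidence)
    return confidence, lift, leverage, conviction

# Supports of (Windows 95), (Windows Family of OSs) and both together
confidence, lift, leverage, conviction = rule_metrics(0.035719, 0.14041, 0.032749)
print(round(confidence, 3), round(lift, 2), round(leverage, 4), round(conviction, 1))
```

Because the inputs are already rounded to six decimals, the results agree with the table only to about three or four decimal places.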

We can filter the DataFrame for rules with a lift of at least 5 and a confidence of at least 80%:

In [7]:
rules[ (rules['lift'] >= 5) & (rules['confidence'] >= 0.8) ]

|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 18 | (SiteBuilder Network Membership) | (Internet Site Construction for Developers) | 0.034446 | 0.099096 | 0.027896 | 0.809854 | 8.172436 | 0.024483 | 4.737954 |
| 46 | (Windows 95) | (Windows Family of OSs) | 0.035719 | 0.14041 | 0.032749 | 0.916852 | 6.529824 | 0.027734 | 10.338105 |
| 53 | (Windows95 Support) | (isapi) | 0.054281 | 0.161279 | 0.045584 | 0.839766 | 5.206905 | 0.036829 | 5.234334 |
| 80 | (Windows Family of OSs, Windows95 Support) | (isapi) | 0.032696 | 0.161279 | 0.0284 | 0.868613 | 5.385773 | 0.023127 | 6.383597 |


Looking at the rules, some interesting patterns appear. Users who visited SiteBuilder Network Membership also tend to visit Internet Site Construction for Developers. Users who go to the Windows 95 area also visit Windows Family of OSs. Users visiting Windows95 Support, including those who also visit Windows Family of OSs, tend to visit the isapi area (Internet Server API) as well. From these patterns, we learn that the dataset contains a group of users who might be developers; judging from their web logs, they might be looking for ways to fix or build things after reading specifications or guidance.

## Conclusion

In conclusion, this notebook applied association rule mining (ARM) with Apriori to explore the historical weblog data: we preprocessed and formatted the data, then found interesting rules and hidden patterns in this dataset.
