# Metisa - Data Science Challenge (Calvin)

Data: Anonymous Microsoft Web Data Data Set - http://archive.ics.uci.edu/ml/datasets/Anonymous+Microsoft+Web+Data

The data is in an ASCII-based sparse-data format called "DST". Each line of the data file starts with a letter which tells the line's type. The three line types of interest are Attribute, Case and Vote. Each Attribute is a website, each Case is a user and each Vote is an Attribute that the user visited. For more details, please read the data description file for the structure of the data set.

## Task:
1. Assume we are at a point in time where only the training data is available. We want to recommend websites that users should visit, based on their user ID (case ID number). Please construct a recommender system, train it on the training data set, and then generate recommendations for a user given their user ID.
2. Please also write the procedure to test your recommender with the test data set. Explain the metrics that you use for the testing.

## Please also answer the following questions:
1. What are the pros and cons of the recommendation algorithm you have used?
2. How did you evaluate your recommender's performance? Why?
3. Are you happy with your recommender's results? What could be a suitable baseline to compare your recommender's performance to?

## Remarks:
1. The challenge does not have 'the one' solution or answer. There are many ways to approach the task. The same holds true for the accompanying questions. Please motivate all the choices you have made.
2. We have stated the task with many implicit and explicit requirements. If you cannot comply with any of these requirements, please state this and work around it.
3. We also value your input on how this challenge can be improved.
4. Very important: we want to see how you think. Please write down all your thoughts, however preliminary. We much prefer that you discuss an issue without offering a solution, rather than not mentioning it.


## Relevant Information:

    We created the data by sampling and processing the www.microsoft.com logs.
    The data records the use of www.microsoft.com by 38000 anonymous,
    randomly-selected users. For each user, the data lists all the areas of
    the web site (Vroots) that user visited in a one week timeframe.

    Users are identified only by a sequential number, for example, User #14988,
    User #14989, etc. The file contains no personally identifiable information.
    The 294 Vroots are identified by their title (e.g. "NetShow for PowerPoint")
    and URL (e.g. "/stream"). The data comes from one week in February, 1998.

    Dataset format:
	-- The data is in an ASCII-based sparse-data format called "DST".
           Each line of the data file starts with a letter which tells the line's type.
           The three line types of interest are:
               -- Attribute lines:
		     For example, 'A,1277,1,"NetShow for PowerPoint","/stream"'
                     Where:
                        'A' marks this as an attribute line,
                        '1277' is the attribute ID number for an area of the website
                                 (called a Vroot),
	                '1' may be ignored,
			'"NetShow for PowerPoint"' is the title of the Vroot,
                        '"/stream"' is the URL relative to "http://www.microsoft.com"
                -- Case and Vote Lines:
                    For each user, there is a case line followed by zero or more vote lines.
                     For example:
                           C,"10164",10164
                           V,1123,1
                           V,1009,1
                           V,1052,1
                      Where:
                         'C' marks this as a case line,
                          '10164' is the case ID number of a user,
                         'V' marks the vote lines for this case,
                         '1123', '1009', '1052' are the attribute IDs of Vroots that a
                                user visited.
                          '1' may be ignored.

## Number of Instances:
      -- Training: 32711
      -- Testing:   5000
    Each instance represents an anonymous, randomly selected user of the web site.

## Number of Attributes: 294

## Attribute Information:
   Each attribute is an area ("vroot") of the www.microsoft.com web site.

   The datasets record which Vroots each user visited in a one-week timeframe
   in February 1998.

## Missing Attribute Values: The data is very sparse, so vroot visits are explicit,
    nonvisits are implicit (missing).

## Class Distribution: 
    Mean number of vroot visits per case: 3.0
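
The format described above can also be read line by line without pandas. The following is a minimal sketch, not the approach used in the cells below: the helper name `parse_dst` and the returned dictionary layout are assumptions made for illustration, and the sample lines are the ones quoted in the data description.

```python
import csv
import io


def parse_dst(lines):
    """Parse DST lines into (attributes, visits).

    attributes: {vroot_id: (title, url)}
    visits:     {case_id: [vroot_id, ...]} in file order
    (Hypothetical helper; the layout of the result is an assumption.)
    """
    attributes, visits = {}, {}
    current_case = None
    # csv.reader takes care of the quoted fields such as "NetShow for PowerPoint".
    for row in csv.reader(lines):
        if not row:
            continue
        kind = row[0]
        if kind == "A":
            # A,<vroot id>,<ignored>,<title>,<url>
            attributes[int(row[1])] = (row[3], row[4])
        elif kind == "C":
            # C,"<case id>",<case id>
            current_case = int(row[2])
            visits[current_case] = []
        elif kind == "V" and current_case is not None:
            # V,<vroot id>,<ignored>: belongs to the most recent case line
            visits[current_case].append(int(row[1]))
        # Any other line type is skipped.
    return attributes, visits


sample = io.StringIO(
    'A,1277,1,"NetShow for PowerPoint","/stream"\n'
    'C,"10164",10164\n'
    'V,1123,1\n'
    'V,1009,1\n'
    'V,1052,1\n'
)
attributes, visits = parse_dst(sample)
print(attributes)  # {1277: ('NetShow for PowerPoint', '/stream')}
print(visits)      # {10164: [1123, 1009, 1052]}
```

Passing an open file handle instead of the `io.StringIO` sample should work the same way, since `csv.reader` accepts any iterable of lines.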

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_table('anonymous-msweb.data', skiprows = 7, sep = ',', header = None, names = ['Attribute', 'ID', 'Ig', 'Vroot', 'URL'])



## First, divide the data into two tables: 
### 1) Attribute table - store all the attribute information
### 2) Cases and Votes table - to extract information about our cases

In [21]:
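# Attribute table: the first 294 rows (labels 0..293) are the 'A' lines.
# .ix has been removed from pandas; .loc slices by label and includes the end label.
dfA = data.loc[0:293, :]
dfA.head()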

| | Attribute | ID | Ig | Vroot | URL |
|---|---|---|---|---|---|
| 0 | A | 1287 | 1 | International AutoRoute | /autoroute |
| 1 | A | 1288 | 1 | library | /library |
| 2 | A | 1289 | 1 | Master Chef Product Information | /masterchef |
| 3 | A | 1297 | 1 | Central America | /centroam |
| 4 | A | 1215 | 1 | For Developers Only Info | /developer |


In [12]:
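# The Case and Vote data is then stored in dfC (first three columns only).
# .ix has been removed from pandas; select the rows by label and the columns by name.
dfC = data.loc[294:, ['Attribute', 'ID', 'Ig']]
dfC = dfC.reset_index()
dfC.head()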

| | index | Attribute | ID | Ig |
|---|---|---|---|---|
| 0 | 294 | C | 10001 | 10001 |
| 1 | 295 | V | 1000 | 1 |
| 2 | 296 | V | 1001 | 1 |
| 3 | 297 | V | 1002 | 1 |
| 4 | 298 | C | 10002 | 10002 |
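
Because each vote row belongs to the nearest case row above it, one way to attach a user to every vote is to forward-fill the case ID. This is a sketch on toy rows that mirror the preview above, not the approach taken in the following cells: the column name `case_id` is an assumption, and the second vote of case 10002 is made up for illustration.

```python
import pandas as pd

# Toy rows in the same layout as the dfC preview (Attribute, ID).
# The last vote (1003) is hypothetical.
rows = pd.DataFrame({
    "Attribute": ["C", "V", "V", "V", "C", "V", "V"],
    "ID": [10001, 1000, 1001, 1002, 10002, 1001, 1003],
})

# Keep the ID on case rows, blank it elsewhere, then carry the last case ID forward.
rows["case_id"] = rows["ID"].where(rows["Attribute"] == "C").ffill().astype(int)

# Keep only the vote rows: one (case_id, vroot_id) pair per visit.
pairs = (
    rows.loc[rows["Attribute"] == "V", ["case_id", "ID"]]
    .rename(columns={"ID": "vroot_id"})
    .reset_index(drop=True)
)
print(pairs)
```

The resulting long table of (user, vroot) pairs is a convenient starting point for grouping visits by user or by vroot.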


In [20]:
dfA[dfA['ID'].isin([1000,1001,1002])]

| | Attribute | ID | Ig | Vroot | URL |
|---|---|---|---|---|---|
| 78 | A | 1001 | 1 | Support Desktop | /support |
| 217 | A | 1002 | 1 | End User Produced View | /athome |
| 268 | A | 1000 | 1 | regwiz | /regwiz |


# Strategy:
1. Find relationships between attributes. For example, the first case (ID 10001) visited sites 1000, 1001 and 1002, which suggests a relation between regwiz, Support Desktop and End User Produced View.
2. Derive these relations from the case and vote data, then use them to recommend websites based on a user's visit history.
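
One way to make this strategy concrete is to count how often two Vroots are visited by the same user and to score unvisited Vroots by that count. The sketch below is only an illustration under that assumption: the function names `build_cooccurrence` and `recommend` are hypothetical, and the visit histories are toy data rather than values taken from the data set.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(visits):
    """For each vroot, count how often every other vroot appears in the same case."""
    co = defaultdict(Counter)
    for vroots in visits.values():
        # permutations yields both (a, b) and (b, a), so the counts are symmetric.
        for a, b in permutations(set(vroots), 2):
            co[a][b] += 1
    return co


def recommend(co, visited, top_n=3):
    """Rank unvisited vroots by their summed co-occurrence with the visited ones."""
    scores = Counter()
    for a in visited:
        for b, count in co.get(a, {}).items():
            if b not in visited:
                scores[b] += count
    # Sort by score (descending), then by ID so that ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [vroot for vroot, _ in ranked[:top_n]]


# Toy visit histories (hypothetical).
visits = {
    10001: [1000, 1001, 1002],
    10002: [1001, 1003],
    10003: [1001, 1003, 1004],
}
co = build_cooccurrence(visits)
print(recommend(co, visited={1001}))  # [1003, 1000, 1002]
```

Raw counts favour popular Vroots, so a normalised similarity (for example, dividing by how often each Vroot is visited overall) may be worth considering as a refinement.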


In [33]:
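# Scratch test: for every case row among the first 10 rows,
# record the ID found on the row that follows it (normally its first vote).
test = np.full(20, np.nan)

for i in range(0, 10):
    if dfC.Attribute[i] == 'C':
        test[i] = dfC.ID[i+1]
test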

array([ 1000.,    nan,    nan,    nan,  1001.,    nan,    nan,  1001.,
          nan,    nan,    nan,    nan,    nan,    nan,    nan,    nan,
          nan,    nan,    nan,    nan])

In [34]:
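# The shape has to be passed as a tuple. np.empty(10, 2) treats 2 as the dtype
# and fails with "TypeError: data type not understood".
test = np.full((10, 2), np.nan)
test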

In [5]:
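# Keep only the case (user) rows; reset_index() moves the original row labels into an 'index' column.
dfC = data.loc[data['Attribute'] == 'C']
dfC = dfC.reset_index()
# Placeholder column for the number of votes per case, to be filled in
# from the gaps between consecutive 'index' values.
dfC['Nvote'] = float('NaN')
dfC.head(10)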

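| | index | Attribute | ID | Ig | Vroot | URL | Nvote |
|---|---|---|---|---|---|---|---|
| 0 | 294 | C | 10001 | 10001 | | | |
| 1 | 298 | C | 10002 | 10002 | | | |
| 2 | 301 | C | 10003 | 10003 | | | |
| 3 | 305 | C | 10004 | 10004 | | | |
| 4 | 307 | C | 10005 | 10005 | | | |
| 5 | 309 | C | 10006 | 10006 | | | |
| 6 | 312 | C | 10007 | 10007 | | | |
| 7 | 314 | C | 10008 | 10008 | | | |
| 8 | 316 | C | 10009 | 10009 | | | |
| 9 | 319 | C | 10010 | 10010 | | | |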


In [7]:
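# Gap between the row positions of the first two cases.
# It counts the case line itself plus its vote lines, so the number of votes is the gap minus 1.
i = 0
dfC['index'][i+1] - dfC['index'][i]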

4
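
The same gap computation can be done for all cases at once instead of one index at a time. The following is a sketch on the first four case positions from the preview above. The variable names are assumptions, and `end_row` is set to 307 (where the fifth case starts) only because the toy example stops there; for the full table it would be the position just past the last row.

```python
import numpy as np

# Row positions of the first four case lines, taken from the preview above.
case_rows = np.array([294, 298, 301, 305])
# Position just past the last vote of the last case in this toy example.
end_row = 307

# Append the end marker, take consecutive differences, and subtract 1
# so that the case line itself is not counted as a vote.
n_votes = np.diff(np.append(case_rows, end_row)) - 1
print(n_votes)  # [3 2 3 1]
```

Appending an end marker also avoids the out-of-range lookup that an `i+1` loop would hit on the last case.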