### CS 421 PROJECT

In this project, you will work with data extracted from well-known recommender-system datasets: you are provided with a large set of interactions between users (persons) and items (movies). Whenever a user "interacts" with an item, they watch the movie and give it a mark or "rating" between 1 and 5 stars (5 stars indicating that the user liked the movie very much, and 1 star indicating that the user didn't like the movie at all).




In this exercise, we will **not** be performing the recommendation task per se. Instead, we will identify *anomalous users*. In the dataset that you are provided with, some of the data was corrupted. Whilst most of the data comes from real-life user-item interactions from a famous movie rating website, some "users" are anomalous: they were generated by me according to some undisclosed procedure.

You are provided with two data frames: the first one ("ratings") contains the interactions provided to you, and the second one ("labels") contains the labels for the users.

As you can see, the three columns in "ratings" correspond to the user ID, the item ID and the rating. Thus, each row of "ratings" contains a single interaction. For instance, if the row "142, 152, 5" is present, this means that the user with ID 142 has given the movie 152 the rating 5 stars.

The dataframe "labels" has two columns. The first column contains the user IDs, whilst the second column contains the labels. A label of 1 indicates that the user is fake (generated by me), whilst a label of 0 denotes a natural user (coming from real-life interactions).

For instance, if the "labels" dataframe contains the row "142, 1", it means that all of the ratings given by the user with ID 142 are fake. In other words, all rows in the dataframe "ratings" which start with the user ID 142 correspond to fake interactions.
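
The relationship between the two dataframes can be sketched on toy data. The values below are invented for illustration and are not taken from the project files; only the column names "user", "item", "rating" and "label" follow the notebook.

```python
import pandas as pd

# Toy data for illustration only (not from the project files).
ratings = pd.DataFrame(
    {
        "user": [142, 142, 7, 7, 7],
        "item": [152, 9, 152, 30, 55],
        "rating": [5, 1, 4, 3, 2],
    }
)
labels = pd.DataFrame({"user": [7, 142], "label": [0, 1]})

# Attach each user's label to every one of that user's interactions.
labelled = ratings.merge(labels, on="user", how="left")

# All interactions that belong to fake users (label == 1).
fake_rows = labelled[labelled["label"] == 1]
print(fake_rows)
```

Because the label is a property of the user, every row of a given user ends up with the same label after the merge.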

#### Evaluation

Your task is to classify unseen instances as either anomalies or non-anomalies (guess whether they are real users or whether they were generated by me).

There are **far more** normal users than anomalies in the dataset, which makes it a very heavily **unbalanced dataset**. Accuracy is therefore not a good measure of performance, since simply predicting that every user is normal already gives high accuracy. We need to use other evaluation metrics instead (see lecture notes from week 3).
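
To make the point about accuracy concrete, here is a minimal sketch with a made-up label vector (950 normal users and 50 anomalies; these proportions are an assumption, not the real class balance). A "model" that calls everyone normal scores well on accuracy but no better than chance on AUC.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 950 normal users (0) and 50 anomalies (1).
y_true = np.array([0] * 950 + [1] * 50)

# A trivial "model": every user is predicted normal, with the same score.
y_pred = np.zeros_like(y_true)
scores = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, scores)
print(acc, auc)  # 0.95 0.5
```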

The **evaluation metrics** are the **AUC** (area under the curve), the **precision**, the **recall**, and the **F1 score**. The **main metric** will be the **AUC**, and it will by default be used to rank teams. This means your programs should return an **anomaly score** for each user (the higher the score, the more likely the model thinks the user is anomalous).
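
As a rough illustration of how these four metrics could be computed from anomaly scores with scikit-learn, consider the sketch below. The labels and scores are invented, and the 0.5 threshold used to turn scores into hard labels is an arbitrary assumption; the project statement does not specify one.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Made-up labels and anomaly scores for six users.
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.7, 0.3, 0.8, 0.4])

# AUC is computed directly from the raw scores (no threshold needed).
auc = roc_auc_score(y_true, scores)

# Precision, recall and F1 need hard labels; 0.5 is an arbitrary cut-off.
y_hat = (scores >= 0.5).astype(int)
precision = precision_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)
f1 = f1_score(y_true, y_hat)

print(auc, precision, recall, f1)  # 0.875 0.5 0.5 0.5
```

Note that the AUC depends only on how the scores order the users, whereas the other three metrics change with the chosen threshold.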

Every few weeks, we will evaluate the performance of each team (on an *unseen test set* I will provide) in terms of AUC, precision, recall and F1 score. Teams will be ranked by **AUC**, with the F1 score used to break ties, where a tie is defined as a difference of less than 0.005 in AUC.
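
One possible reading of this ranking rule is sketched below. The function `rank_teams`, the team names and the numbers are all hypothetical, and the grouping strategy (teams within 0.005 AUC of the best team in their group count as tied) is an assumption; the actual procedure used for grading may differ.

```python
def rank_teams(results, tol=0.005):
    """Rank (team, auc, f1) tuples: AUC first, F1 to break near-ties in AUC."""
    by_auc = sorted(results, key=lambda r: r[1], reverse=True)
    ranked, group = [], []
    for row in by_auc:
        # Close the current group once the gap to its leader reaches tol.
        if group and group[0][1] - row[1] >= tol:
            ranked.extend(sorted(group, key=lambda r: r[2], reverse=True))
            group = []
        group.append(row)
    # Flush the last group, ordered by F1.
    ranked.extend(sorted(group, key=lambda r: r[2], reverse=True))
    return [team for team, _, _ in ranked]


# Invented results: A and B are within 0.005 AUC, so F1 decides between them.
results = [("A", 0.912, 0.40), ("B", 0.910, 0.55), ("C", 0.880, 0.70)]
ranking = rank_teams(results)
print(ranking)  # ['B', 'A', 'C']
```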

The **difficulty of the anomaly generation procedure will change as the project evolves: depending on how well the teams are doing, I will generate easier or harder anomalies**.

The **first round** will take place after recess (week 9): this means that I will **release the next test set on the Tuesday of week 9**, and you must hand in your scores before **Wednesday at noon (5th of October)**. We will then look at the results together on the Thursday.

We will check everyone's performance in this way every week (once in week 10, once in week 11 and once in week 12).

Whilst performance (expressed in terms of AUC and your ranking compared to other teams) at **each of the check points** (weeks 9 to 12 inclusive) is an **important component** of your **final grade**, the **final report** and the detail of the various methods you will have tried will **also** be very **important**. Ideally, to get perfect marks (A+), you should try at least **two supervised methods** and **two unsupervised methods**, as well as be ranked the **best team** in terms of performance.
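
As one example of what an unsupervised method might look like, the sketch below runs scikit-learn's `IsolationForest` on synthetic per-user features. The two features (mean rating and rating spread) and all the numbers are invented for illustration; this is not a recommended or required approach, only one candidate among many.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic per-user features: [mean rating, rating standard deviation].
normal = rng.normal(loc=[3.5, 1.0], scale=0.3, size=(200, 2))
odd = rng.normal(loc=[1.0, 0.0], scale=0.1, size=(5, 2))
features = np.vstack([normal, odd])

# Fit without using any labels.
forest = IsolationForest(random_state=0).fit(features)

# score_samples is higher for more "normal" points, so negate it to get
# an anomaly score where higher means more anomalous.
anomaly_score = -forest.score_samples(features)
print(anomaly_score[:200].mean(), anomaly_score[200:].mean())
```

The negation matters because the AUC ranking expects higher scores for more anomalous users.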

The performance part of the grading will be based half on performance at weeks 9, 10 and 11, and half on performance at week 12.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data=np.load("first_batch.npz")

In [2]:
X=data["X"]
y=data["y"]

XX=pd.DataFrame(X)
yy=pd.DataFrame(y)
XX.rename(columns={0:"user",1:"item",2:"rating"},inplace=True)

In [3]:
XX.head()

   user  item  rating
0  2549     0       4
1  2549    30       3
2  2549    55       2
3  2549    67       2
4  2549    72       4


In [4]:
XX.nunique()

user      5200
item       999
rating       6
dtype: int64

In [5]:
yy.rename(columns={0:"user",1:"label"},inplace=True)

In [6]:
yy.head()

   user  label
0     0      0
1     1      0
2     2      0
3     3      0
4     4      0


In [7]:
yy.iloc[1143]

user     1143
label       1
Name: 1143, dtype: int64

In [8]:
XX = pd.merge(XX, yy, on='user')

In [9]:
XX

        user  item  rating  label
0       2549     0       4      0
1       2549    30       3      0
2       2549    55       2      0
3       2549    67       2      0
4       2549    72       4      0
...      ...   ...     ...    ...
807648  1143   347       1      1
807649  1143   368       1      1
807650  1143   452       1      1
807651  1143   637       1      1


In [10]:
y = XX.label
XX = XX.drop(['label'], axis=1)
XX

        user  item  rating
0       2549     0       4
1       2549    30       3
2       2549    55       2
3       2549    67       2
4       2549    72       4
...      ...   ...     ...
807648  1143   347       1
807649  1143   368       1
807650  1143   452       1
807651  1143   637       1


In [11]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: label, dtype: int64

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the interaction rows into training (80%) and testing (20%) sets
x_train, x_test, y_train, y_test = train_test_split(XX, y, test_size=0.2, random_state=42)

# Instantiate the logistic regression model
lr = LogisticRegression(solver='lbfgs', max_iter=1000)

# Fit the model using the training data
lr.fit(x_train, y_train)

# Predict hard labels for the test rows and report accuracy (y_true first, then y_pred)
y_pred = lr.predict(x_test)
print(accuracy_score(y_test, y_pred))

0.9752988590425368
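
The model above is fitted on individual interaction rows, whereas the labels are defined per user. One possible alternative, sketched here on invented toy data, is to first summarise each user's ratings into a single feature row. The feature names `n_ratings`, `mean_rating` and `std_rating` are hypothetical choices, not something prescribed by the project.

```python
import pandas as pd

# Toy interactions; column names follow the notebook, values are invented.
ratings = pd.DataFrame(
    {
        "user": [1, 1, 1, 2, 2, 3],
        "item": [10, 11, 12, 10, 13, 11],
        "rating": [5, 4, 3, 1, 1, 2],
    }
)

# One row per user: number of ratings, mean rating and rating spread.
user_features = (
    ratings.groupby("user")["rating"]
    .agg(n_ratings="count", mean_rating="mean", std_rating="std")
    .fillna(0.0)  # std is NaN for users with a single rating
)
print(user_features)
```

A table like `user_features` has exactly one row per user, so it lines up directly with the "labels" dataframe and can be passed to any classifier or anomaly detector.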


In [13]:
# from sklearn.metrics import roc_curve
# from sklearn.metrics import roc_auc_score
# lr_probs = lr.predict_proba(x_test.values.reshape(-1,1))
# A = np.array(y_test)
# B = [1-A, A]
# y_test_dummy=np.array(B).transpose()
# lr_auc = roc_auc(y_test_dummy, lr_probs, multi_class="ovr")
# print(lr_auc)

In [14]:
# https://www.statology.org/auc-in-python/

from sklearn.metrics import roc_auc_score

# Predicted probability of the positive class (label 1) for each test row
y_pred_proba = lr.predict_proba(x_test)[:, 1]

# Calculate AUC of the model on the held-out rows
auc = roc_auc_score(y_test, y_pred_proba)

# Print AUC score
print(auc)

0.7187084751252006
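
The AUC above is computed over interaction rows, while the project asks for one anomaly score per user. A simple way to bridge the two, shown here as a sketch on invented numbers, is to average the row-level probabilities of each user. Taking the mean is an assumption; the maximum or median would be equally plausible choices.

```python
import pandas as pd

# Invented row-level probabilities, standing in for lr.predict_proba(...)[:, 1].
rows = pd.DataFrame(
    {
        "user": [5327, 5327, 5452, 5452, 5452],
        "proba": [0.10, 0.20, 0.60, 0.90, 0.30],
    }
)

# Average the per-interaction probabilities to get one score per user.
user_scores = rows.groupby("user")["proba"].mean()
print(user_scores)
```

The resulting series is indexed by user ID, which makes it straightforward to compare against user-level labels with `roc_auc_score`.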


In [15]:
# Testing with second_batch
data2=np.load("second_batch.npz")

In [16]:
X = data2["X"]
XX2 = pd.DataFrame(X)
XX2.rename(columns={0:"user",1:"item",2:"rating"},inplace=True)
# XX2 = XX2.drop(['user'], axis=1)
XX2

        user  item  rating
0       5327     1       3
1       5327     8       2
2       5327    38       5
3       5327    80       3
4       5327   102       4
...      ...   ...     ...
190198  5452   897       1
190199  5452   927       1
190200  5452   931       1
190201  5452   935       1


In [17]:
y_pred2=lr.predict(XX2)
print(len(y_pred2))
y_pred2

190203


array([0, 0, 0, ..., 0, 0, 0])

In [18]:
import sys
np.set_printoptions(threshold=sys.maxsize)
y_pred2

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [19]:
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,