### CS 421 PROJECT

In this project, you will work with data extracted from well-known recommender-system datasets. You are provided with a large set of interactions between users (persons) and items (movies). Whenever a user "interacts" with an item, they watch the movie and give it a mark or "rating" between 1 and 5 stars (5 stars indicating that the user liked the movie very much, and 1 star indicating that the user didn't like the movie at all).




In this exercise, we will **not** be performing the recommendation task per se. Instead, we will identify *anomalous users*. In the dataset that you are provided with, some of the data was corrupted. Whilst most of the data comes from real-life user-item interactions from a famous movie rating website, some "users" are anomalous: they were generated by me according to some undisclosed procedure.

You are provided with two data frames: the first one ("ratings") contains the interactions provided to you, and the second one ("labels") contains the labels for the users.

As you can see, the three columns in "ratings" correspond to the user ID, the item ID and the rating. Thus, each row of "ratings" contains a single interaction. For instance, if the row "142, 152, 5" is present, this means that the user with ID 142 has given the movie 152 the rating 5 stars.

The dataframe "labels" has two columns. In the first column we have the user ids, whilst the second column contains the labels. A label of 1 indicates that the user is fake (generated by me), whilst a label of 0 denotes a natural user (coming from real life interactions). 

For instance, if the labels matrix contains the line "142, 1", it means that all of the ratings given by the user with id 142 are fake. This means all lines in the dataframe "ratings" which start with the userID 142 correspond to fake interactions. 
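
As a minimal sketch of how the two data frames relate, the snippet below uses a few made-up toy rows (not the project data) and merges the labels onto the ratings by user ID, so that every interaction of a fake user is marked as fake. The variable names `flagged` and `fake_rows` are chosen for illustration only.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
ratings = pd.DataFrame(
    {
        "user": [142, 142, 7, 7, 7],
        "item": [152, 30, 152, 55, 72],
        "rating": [5, 1, 4, 3, 2],
    }
)
labels = pd.DataFrame({"user": [142, 7], "label": [1, 0]})

# Attach each user's label to every one of that user's interactions.
flagged = ratings.merge(labels, on="user", how="left")

# All rows belonging to user 142 are now marked as fake (label 1).
fake_rows = flagged[flagged["label"] == 1]
print(fake_rows)
```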

#### Evaluation

Your task is to classify unseen instances as either anomalies or non-anomalies (that is, to guess whether they are real users or were generated by me).

There are **far more** normal users than anomalies in the dataset, which makes it a very heavily **unbalanced dataset**. Accuracy is therefore not a good measure of performance, since simply predicting that every user is normal already gives high accuracy. We need to use other evaluation metrics instead (see lecture notes from week 3).
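
To illustrate the point about accuracy, here is a small sketch on synthetic labels. The 950/50 split is an assumed proportion, not the one in the project data. A trivial predictor that calls every user normal reaches high accuracy while detecting no anomaly at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Synthetic labels: 950 normal users (0) and 50 anomalies (1).
# These proportions are assumed for illustration.
y_true = np.array([0] * 950 + [1] * 50)

# Trivial "model": every user is predicted normal, with a constant score.
y_pred = np.zeros_like(y_true)
scores = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_pred)        # high, despite finding no anomalies
rec = recall_score(y_true, y_pred)          # no anomaly is detected
auc_value = roc_auc_score(y_true, scores)   # constant scores carry no ranking information
print(acc, rec, auc_value)
```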

The **evaluation metrics** are the **AUC** (area under the curve), the **precision**, the **recall**, and the **F1 score**. The **main metric** will be the **AUC**, and it will by default be used to rank teams. This means your programs should return an **anomaly score** for each user (the higher the score, the more likely the model thinks the user is anomalous).
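
The following sketch shows one way the four metrics could be computed from anomaly scores with scikit-learn. The labels and scores are made up. The 0.5 threshold used to turn scores into hard labels is an assumption, since the project description does not say how this is done.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

# Made-up labels and anomaly scores for six users (illustration only).
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.7, 0.6, 0.9])

# AUC is computed directly from the scores; no threshold is needed.
auc_value = roc_auc_score(y_true, scores)

# Precision, recall and F1 need hard labels; the 0.5 threshold is an assumption.
threshold = 0.5
y_pred = (scores >= threshold).astype(int)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(auc_value, precision, recall, f1)
```

Because the AUC depends only on how the scores order the users, it is unaffected by the choice of threshold, whereas precision, recall and F1 all change with it.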

Every few weeks, we will evaluate the performance of each team (on an *unseen test set* I will provide) in terms of AUC, precision, recall and F1 score. Teams will be ranked by **AUC**, with the F1 score used to break ties, where a tie is defined as a difference of less than 0.005 in AUC.

The difficulty implied by **the generation procedure of the anomalies WILL CHANGE as the project evolves: depending on how well the teams are doing, I will generate easier or harder anomalies**.

The **first round** will take place after recess (week 9): this means that I will **release the next test set on the Tuesday of week 9**, and you must hand in your scores before the **WEDNESDAY at NOON (5th of October)**. We will then look at the results together on the Thursday.

We will check everyone's performance in this way every week after that (once in week 10, once in week 11 and once in week 12).

Whilst performance (expressed in terms of AUC and your ranking compared to other teams) at **each of the check points** (weeks 9 to 12 inclusive) is an **important component** of your **final grade**, the **final report** and the detail of the various methods you will have tried will **also** be very **important**. Ideally, to get perfect marks (A+), you should try at least **two supervised methods** and **two unsupervised methods**, as well as be ranked the **best team** in terms of performance.
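
As one possible starting point for the unsupervised side, here is a hedged sketch using scikit-learn's `IsolationForest` on per-user summary features. The synthetic data, the choice of features (`count`, `mean`, `std` of each user's ratings) and the choice of model are all assumptions made for illustration, not a prescribed solution.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic interactions for illustration only: 50 users, 20 ratings each.
rng = np.random.default_rng(0)
ratings = pd.DataFrame(
    {
        "user": np.repeat(np.arange(50), 20),
        "item": rng.integers(0, 100, size=1000),
        "rating": rng.integers(1, 6, size=1000),
    }
)

# One row of summary features per user (this feature choice is an assumption).
features = ratings.groupby("user")["rating"].agg(["count", "mean", "std"])

# Fit without labels. score_samples is higher for normal points, so negate it
# to obtain an anomaly score where higher means more anomalous.
forest = IsolationForest(random_state=0).fit(features)
anomaly_score = -forest.score_samples(features)
print(anomaly_score[:5])
```

The negation matters here: the project asks for scores where a higher value means a more anomalous user, which is the opposite of the convention used by `score_samples`.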

The performance part of the grading will be based half on performance at weeks 9, 10 and 11, and half on performance at week 12.

In [1]:
import numpy as np
import pandas as pd
data=np.load("first_batch.npz")

In [2]:
X=data["X"]
y=data["y"]


XX=pd.DataFrame(X)
yy=pd.DataFrame(y)
XX.rename(columns={0:"user",1:"item",2:"rating"},inplace=True)

In [3]:
XX.head()

Unnamed: 0,user,item,rating
0,2549,0,4
1,2549,30,3
2,2549,55,2
3,2549,67,2
4,2549,72,4


In [4]:
# Get the number of unique values in each column
XX.nunique()

user      5200
item       999
rating       6
dtype: int64

In [5]:
yy.rename(columns={0:"user",1:"label"},inplace=True)

In [6]:
yy.head(10)

Unnamed: 0,user,label
0,0,0
1,1,0
2,2,0
3,3,0
4,4,0
5,5,0
6,6,0
7,7,0
8,8,0
9,9,0


In [7]:
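# Reduce each user to a single feature: their most frequent rating (the mode).
# Note that this overwrites XX with one row per user.
XX = XX.groupby('user')['rating'].apply(lambda x: x.value_counts().index[0]).reset_index(name='rating')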
XX

Unnamed: 0,user,rating
0,0,4
1,1,4
2,2,2
3,3,4
4,4,4
...,...,...
5195,5195,3
5196,5196,4
5197,5197,3
5198,5198,4


In [8]:
# Add list of movie reviews to yy
# This will be training data for the classification model
#yy["reviews"] = user_reviews
yy

# Model should output a prediction array with 5200 entries, one for each user
# We can either predict 0-1 or probabilities
# Metrics are AUC, precision, recall, F1

Unnamed: 0,user,label
0,0,0
1,1,0
2,2,0
3,3,0
4,4,0
...,...,...
5195,5195,0
5196,5196,0
5197,5197,0
5198,5198,0


In [9]:
# Separate inputs and labels
y = yy['label']

print(y)

0       0
1       0
2       0
3       0
4       0
       ..
5195    0
5196    0
5197    0
5198    0
5199    0
Name: label, Length: 5200, dtype: int64


In [10]:
from sklearn.model_selection import train_test_split
# Applying logreg and SVM models for binary classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import svm
# Evaluation metrics: AUC, prec, recall, F1
from sklearn.metrics import auc, roc_auc_score, precision_score, recall_score, f1_score



# Instantiate logreg model and fit logreg
logreg=LogisticRegression(solver='lbfgs', max_iter=1000)

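# Fit on first batch data. XX has the columns "user" and "rating" (each user's
# most frequent rating), so the user ID is also passed to the model as a feature.
logreg.fit(XX, y)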

In [16]:
# Load and predict on third batch
batch3_data = np.load('third_batch.npz')
batch3_df = pd.DataFrame(batch3_data['X'])
batch3_df.rename(columns={0:"user",1:"item",2:"rating"},inplace=True)

batch3_df

Unnamed: 0,user,item,rating
0,7105,0,5
1,7105,3,4
2,7105,16,3
3,7105,17,3
4,7105,18,3
...,...,...,...
195408,7182,807,0
195409,7182,808,1
195410,7182,809,4
195411,7182,815,0


In [18]:
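# Same per-user reduction as for the training data: most frequent rating
batch3_df = batch3_df.groupby('user')['rating'].apply(lambda x: x.value_counts().index[0]).reset_index(name='rating')

# Hard 0/1 predictions and anomaly scores (predicted probability of class 1)
y_pred_logreg = logreg.predict(batch3_df)
y_pred_proba = logreg.predict_proba(batch3_df)[:, 1]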


# Evaluation (left commented out: y_test is not defined in this notebook)
# print("------Logistic regression evaluation metrics------")
# print(f"AUC: {roc_auc_score(y_test, y_pred_proba)}")
# print(f"Precision: {precision_score(y_test, y_pred_logreg)}")
# print(f"Recall: {recall_score(y_test, y_pred_logreg)}")
# print(f"F1: {f1_score(y_test, y_pred_logreg)}")


In [23]:
predictions = pd.DataFrame(y_pred_proba, columns=['y_pred'])
predictions

Unnamed: 0,y_pred
0,0.025104
1,0.025108
2,0.069954
3,0.069962
4,0.069971
...,...
1295,0.029680
1296,0.029684
1297,0.082015
1298,0.010370


In [24]:
np.savez('W10_predictions.npz', predictions)