# Anomaly Detection 'Many Ratings'
KLF v1.0

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
from os.path import exists

### Build reduced ratings dataset (if not exists)
The full ratings file is about 1.2 GB and takes a long time to load.
Since we only need a tiny fraction of this data, we reduce it to the ratings of selected user ids.

This is a one-time operation, but it has to be repeated when new users are added or when the notebook runs in a new environment.

In [2]:
if not exists('data/reduced_ratings.csv'):
    users_to_include = [172357,134596,123100]

    all_ratings = pd.read_csv('data/all_ratings.csv')
    all_ratings = all_ratings.query('userId in @users_to_include')
    all_ratings.to_csv('data/reduced_ratings.csv', header=True, index=False)
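
The cell above loads the whole file into memory before filtering. If that becomes a problem, the same reduction can be done in chunks. The following is a minimal sketch, not part of the original notebook: the helper name `reduce_ratings` and the default `chunksize` are assumptions chosen for illustration.

```python
import pandas as pd

def reduce_ratings(src_path, dst_path, user_ids, chunksize=1_000_000):
    """Stream src_path in chunks and keep only rows whose userId is in user_ids."""
    kept = [
        chunk[chunk['userId'].isin(user_ids)]
        for chunk in pd.read_csv(src_path, chunksize=chunksize)
    ]
    reduced = pd.concat(kept, ignore_index=True)
    reduced.to_csv(dst_path, header=True, index=False)
    return reduced
```

It could be called as `reduce_ratings('data/all_ratings.csv', 'data/reduced_ratings.csv', [172357, 134596, 123100])`.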

## Import (local)

In [3]:
ratings_single_account = pd.read_csv('data/ratings_single_account.csv')
movies_single_account = pd.read_csv('data/movies_single_account.csv')
reduced_ratings = pd.read_csv('data/reduced_ratings.csv')

## Set Global Design

In [4]:
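def ratings_for_uid(user_id):
    # Select all ratings of one user and return them in chronological order.
    all_user_ratings = reduced_ratings.query('userId == @user_id').copy()
    # Parse the dates first so that sorting is chronological rather than lexicographic.
    all_user_ratings['rating_date'] = pd.to_datetime(all_user_ratings['rating_date'])
    all_user_ratings.sort_values(by=['rating_date'], inplace=True)
    return all_user_ratings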

def apply_grid_bg_design(fig):
    # The parameter is named `fig` so it does not shadow the imported `go` module.
    fig.update_layout(
        plot_bgcolor= 'rgb(244,247,251)',
    )
    fig.update_xaxes(
        showgrid=True, gridwidth=1, gridcolor='lightgrey'
    )
    fig.update_yaxes(
        showgrid=True, gridwidth=1, gridcolor='lightgrey'
    )
    

# Users with many ratings

It is not uncommon to have very enthusiastic users. There will always be a small fraction of people with significantly more interactions than the average person. However, we do have some users with interesting rating activity. 

The most active user has rated over 23,000 movies in less than 3 years. We don't want to be judgmental, but this is a bit too much, even for a _very_ enthusiastic movie connoisseur. The average movie length is around 90 minutes.<sup>1</sup> Assuming this user was legitimate and watched every rated movie within the 3-year rating period, they would have spent about 32 hours a day watching movies.
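
The estimate can be checked with a few lines of arithmetic. This sketch uses the rounded figures from the text (23,000 movies as a lower bound, 3 years as an upper bound), so the result is a conservative lower bound and lands slightly below the quoted 32 hours.

```python
n_movies = 23_000        # lower bound from the text ("over 23,000")
avg_length_min = 90      # average movie length in minutes
period_days = 3 * 365    # upper bound from the text ("less than 3 years")

total_hours = n_movies * avg_length_min / 60
hours_per_day = total_hours / period_days
print(f'{total_hours:.0f} hours in total, {hours_per_day:.1f} hours per day')
# prints "34500 hours in total, 31.5 hours per day"
```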

Unlike detecting suspect activity in empty profiles, we have a lot more data to work with here. This allows us to get an insight into the rating patterns of individual users.

For most of the Top 20 most active users, there are obvious signs indicating bot activity. One phenomenon a lot of them share is 'Rating Bursts', short timeframes with hundreds of ratings in minutes. Another common pattern is a very even, unnatural distribution of ratings.

We are more interested in the ones who deviate from obvious bot patterns - heavy users with seemingly legitimate rating activity.

<sup>1</sup> <cite>Average Movie Length - https://towardsdatascience.com/are-new-movies-longer-than-they-were-10hh20-50-year-ago-a35356b2ca5b</cite>



In [5]:
# Showing unnatural patterns (rating Bursts, Distribution)
def plot_indicators():
    all_user_ratings = ratings_for_uid(172357)

    t_range = ['2016-06-25 08:00:00','2016-06-25 12:00:00']
    all_user_ratings = all_user_ratings.query('rating_date >= @t_range[0] and rating_date <= @t_range[1]')

    fgo2 = make_subplots(rows=1, cols=2, column_widths=[0.5,0.5], subplot_titles=('(1) Rating Bursts','(2) Unnatural Distribution'))

    fig2_1 = px.scatter(all_user_ratings,x='rating_date',y='rating', range_x=t_range)
    fig2_2 = px.histogram(all_user_ratings,x='rating')

    fgo2.add_trace(fig2_1['data'][0], row=1, col=1)
    fgo2.add_trace(fig2_2['data'][0], row=1, col=2)



    fgo2.update_layout(
        title_text='Potential Indicators for Bot Activity (Visualized for User 172357)',
        yaxis = dict(
            tickmode = 'array',
            tickvals = np.arange(0.5,5.5,0.5)
        ),
        bargap = 0.05,
    )

    fgo2.update_xaxes(
        range=t_range,
        row=1,
        col=1,
    )

    fgo2.update_xaxes(
        tickmode = 'array',
        tickvals = np.arange(0.5,5.5,0.5),
        row=1,
        col=2,
    )

    fgo2.update_xaxes(title_text="Time", title_font_size=12, row = 1, col = 1)
    fgo2.update_yaxes(title_text="Rating", title_font_size=12, row = 1, col = 1)
    fgo2.update_xaxes(title_text="Rating", title_font_size=12, row = 1, col = 2)
    fgo2.update_yaxes(title_text="Count", title_font_size=12, row = 1, col = 2)


    fgo2.update_traces(
        marker_color='darkgrey',
    )

    apply_grid_bg_design(fgo2)

    return fgo2

plot_indicators().show()

Our hypothesis is that (1) Rating Bursts and (2) Unnatural Distributions are the main indicators of bot activity. We searched for a user to showcase these patterns. This example shows a burst of ratings by user 172357 lasting about 2 hours, together with the corresponding histogram. The burst consists of 1005 ratings.

(1) Rating Bursts are easy to identify. We define them as unusually high amounts of activity within contained timeframes, ranging from seconds to hours, separated by distinct intervals with no activity. The distribution of rating scores is irrelevant to this criterion.
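
The definition above can be turned into a simple session-based check. This is a minimal sketch, not the method used for the plots in this notebook: the helper `find_rating_bursts` and both thresholds (`max_gap`, `min_ratings`) are assumptions chosen for illustration.

```python
import pandas as pd

def find_rating_bursts(rating_dates, max_gap='10min', min_ratings=100):
    """Group timestamps into sessions separated by pauses longer than max_gap
    and return the sessions with at least min_ratings ratings."""
    dates = pd.to_datetime(pd.Series(rating_dates)).sort_values().reset_index(drop=True)
    # A new session starts whenever the pause since the previous rating exceeds max_gap.
    session_id = (dates.diff() > pd.Timedelta(max_gap)).cumsum()
    sessions = dates.groupby(session_id).agg(start='min', end='max', n_ratings='count')
    sessions['duration'] = sessions['end'] - sessions['start']
    return sessions[sessions['n_ratings'] >= min_ratings].reset_index(drop=True)
```

With the helper from this notebook, it could be applied as `find_rating_bursts(ratings_for_uid(172357)['rating_date'])`.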

Despite our initial assumption that (2) Unnatural Distribution would be fairly easy to identify, we found that this may not be the case. There is no consensus about what "unnatural" means. We presumed that natural distributions would take the form of a normal curve. However, this is only an assumption, and we are biased by our own rating behavior. In the case of user 172357, the distribution could be legitimate and simply indicate that the user tends to rate movies in a polarized way.
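
One way to at least quantify polarization is the share of ratings at the extremes of the scale. This is a hypothetical measure that is not used elsewhere in this notebook. The helper `extreme_share` and its cut-offs are assumptions, and a high value does not by itself tell us whether a distribution is "unnatural".

```python
import pandas as pd

def extreme_share(ratings, low=1.0, high=4.5):
    """Fraction of ratings at or below `low` or at or above `high`."""
    ratings = pd.Series(ratings, dtype=float)
    return float(((ratings <= low) | (ratings >= high)).mean())
```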

## User 134596

We would love to know more about User 134596. Despite the high number of ratings, the activity seems legitimate at first glance. There are no rating bursts, and the activity is continuous over almost 10 years. The rating distribution does not show any signs of polarization.

In [6]:
def plot_strip_scatter(uid,time_range):
    all_user_ratings = ratings_for_uid(uid)

    fgo = make_subplots(rows=1, cols=2, column_widths=[0.75,0.25], subplot_titles=('Rating Distribution','Histogram'), shared_yaxes=True)

    fig1_1 = px.scatter(all_user_ratings,x='rating_date',y='rating', hover_data=['movieId'], color="rating")
    fig1_1.update_traces(
        marker=dict(size=16, symbol="line-ns", line=dict(width=0, color="DarkSlateGrey")),
        selector=dict(mode="markers"),
    )

    fig1_2 = px.histogram(all_user_ratings,y='rating')
    fig1_2.update_traces(
        marker_color='darkgrey'
    )
    
    # Dummy heatmap: it is added only to get a discrete colorbar
    # with one color band per rating level.
    fgo.add_trace(go.Heatmap(
        z=[np.arange(0.0,5.5,0.5)],
        colorscale=[
            [0, "#0f0787"], [0.1, "#0f0787"],
            [0.1, "#5011a4"], [0.2, "#5011a4"],
            [0.2, "#790eac"], [0.3, "#790eac"],
            [0.3, "#a833aa"], [0.4, "#a833aa"],
            [0.4, "#be3e8a"], [0.5, "#be3e8a"],
            [0.5, "#d8576b"], [0.6, "#d8576b"],
            [0.6, "#ed7953"], [0.7, "#ed7953"],
            [0.7, "#fba342"], [0.8, "#fba342"],
            [0.8, "#fdcb2d"], [0.9, "#fdcb2d"],
            [0.9, "#f1f421"], [1.0, "#f1f421"],
        ],
        colorbar=dict(
            ticks="outside",
            ticktext=[str(x) for x in np.arange(0.5,5.5,0.5)],
            tickvals=np.arange(0.25,5.25,0.5),
        ),
    ))
    

    fgo.update_layout(
        plot_bgcolor= 'rgb(244,247,251)',
        yaxis = dict(
            tickmode = 'array',
            tickvals = np.arange(0.5,5.5,0.5)
        ),
        title_text=f'Movie Ratings of User <b>{uid}</b> ({len(all_user_ratings)} Ratings)',
        coloraxis_showscale=False,
        xaxis_range=time_range,
    )

    fgo.update_xaxes(title_text="Time", title_font_size=12, row = 1, col = 1)
    fgo.update_yaxes(title_text="Rating", title_font_size=12, row = 1, col = 1)
    fgo.update_xaxes(title_text="Count per Rating Level", title_font_size=12, row = 1, col = 2)

    fgo.update_xaxes(
        showgrid=True, gridwidth=1, gridcolor='lightgrey'
    )
    fgo.update_yaxes(
        showgrid=True, gridwidth=1, gridcolor='lightgrey', range=[0,5.5]
    )

    fgo.add_trace(fig1_1['data'][0], row=1, col=1)
    fgo.add_trace(fig1_2['data'][0], row=1, col=2)

    return fgo

plot_strip_scatter(134596,['2009-01-01','2019-01-01']).show()
plot_strip_scatter(123100,['2015-07-01','2019-01-01']).show()


## Searching for Time Patterns



In [7]:
def plot_freq_polygon(uid,startyear,endyear):
    all_user_ratings = ratings_for_uid(uid)
    all_user_ratings['hour'] = all_user_ratings['rating_date'].dt.hour.astype(int)

    fgo = go.Figure()

    color_seq = px.colors.qualitative.D3
    color_seq_count = 0

    empty_df = pd.DataFrame(index=np.arange(0,24,1))
    empty_df['count'] = 0

    for i in np.arange(startyear,endyear + 1,1):
        all_user_ratings_for_year_i = all_user_ratings[all_user_ratings['rating_date'].dt.year == i]

        # This only includes hours present in the data.
        # Naming the counts explicitly keeps the column name stable across pandas versions.
        hour_freq = all_user_ratings_for_year_i['hour'].value_counts().rename('count').to_frame()
        hour_freq.sort_index(inplace=True)

        # This fills in missing hours with 0
        hour_freq = hour_freq.combine_first(empty_df)
        
        fgo.add_trace(go.Scatter(x=hour_freq.index, y=hour_freq['count'], name=f'Year {i}', line=dict(color=color_seq[color_seq_count]), mode='lines+markers'))
        color_seq_count += 1


    fgo.update_xaxes(
        title_text="Timeline through the day",
        title_font_size=12,
    )
    fgo.update_yaxes(
        title_text="Rating Count for Hour",
        title_font_size=12,
        range=[0,120]
    )
    fgo.add_vrect(
        x0=17,
        x1=22,
        line_width=1,
        fillcolor='black',
        opacity=0.15,
        annotation_text='Hours with<br>no activity',
        annotation_position='top left',
        annotation=dict(font_size=14, font_color='black'),
    )

    fgo.update_layout(
        title_text=f'Favorite Rating Hours for User <b>{uid}</b>',
        xaxis = dict(
            tickmode = 'array',
            tickvals = np.arange(0,24,1)
        ),
    )
    apply_grid_bg_design(fgo)
    return fgo

plot_freq_polygon(134596,2011,2013).show()

The previous visualization of User 134596 left us with more questions than answers. There are no indications to justify doubt in the legitimacy of this user. It feels like looking for a needle in a haystack. Maybe there is no needle and User 134596 is in fact human. But we are not done yet.

Every rating has a timestamp. With thousands of ratings, this metadata can reveal a lot about the user's daily life. In the original dataset, the timestamps are stored as Unix time, which does not depend on any time zone: even if the server moved to a different time zone, the data would still be consistent.
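
As a brief illustration of how Unix time maps to UTC, the sketch below converts two example values with pandas. The values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values: seconds since 1970-01-01 00:00:00 UTC.
timestamps = pd.Series([1466841600, 1466845200])

# unit='s' interprets the integers as seconds; utc=True yields timezone-aware values.
rating_dates = pd.to_datetime(timestamps, unit='s', utc=True)
print(rating_dates.dt.hour.tolist())  # prints [8, 9]
```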

We created histograms of the timestamp components. As expected, minutes and seconds show no anomaly: ratings are evenly distributed over both. However, the hours tell a different story. There is a distinct pattern of activity over the day, including some resting hours with no activity at all. We then compared this histogram across years, and the favorite rating hours remain clearly correlated from year to year.
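
The year-to-year similarity can also be expressed as a number. The sketch below builds a table of rating counts per hour and year and correlates the yearly columns. It is an illustration under assumptions: the helper names are hypothetical, and Pearson correlation is only one possible choice of similarity measure.

```python
import pandas as pd

def hourly_profile_by_year(rating_dates):
    """Return a table of rating counts with one row per hour (0-23) and one column per year."""
    dates = pd.to_datetime(pd.Series(rating_dates))
    counts = pd.crosstab(dates.dt.hour, dates.dt.year)
    # Make sure every hour 0..23 is present, even without any ratings.
    counts = counts.reindex(range(24), fill_value=0)
    counts.index.name = 'hour'
    counts.columns.name = 'year'
    return counts

def year_to_year_correlation(rating_dates):
    """Pearson correlation between the hourly profiles of different years."""
    return hourly_profile_by_year(rating_dates).corr()
```

With the helper from this notebook, it could be applied as `year_to_year_correlation(ratings_for_uid(134596)['rating_date'])`.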

The consistency of hours without any ratings is remarkable. If User 134596 is a human, we can only applaud this disciplined sleep schedule. The decrease in activity from 03:00 to 11:00 UTC could indicate a workday, but this is pure speculation.
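
Hours without any ratings can be listed directly from the timestamps. This is a minimal sketch with a hypothetical helper `quiet_hours`. It reports hours in whatever time zone the timestamps are stored in.

```python
import pandas as pd

def quiet_hours(rating_dates):
    """Hours of the day (0-23) in which no rating was ever recorded."""
    hours = pd.to_datetime(pd.Series(rating_dates)).dt.hour
    return sorted(set(range(24)) - set(hours.tolist()))
```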

In conclusion, we found no evidence to justify doubt in the legitimacy of User 134596. If this user actually turns out to be a bot, we can only admire the creator's dedication and effort in running this account for almost ten years. Of course, there are more sophisticated ways to detect irregularities that we haven't covered yet. Overall, our approach worked well to identify anomalies in most of the highly active user accounts.