# Exploratory Data Analysis of User Ratings

In [56]:
import pandas as pd

In [58]:
user_ratings = pd.read_csv("../data/raw/unclean_user_ratings.csv")

movie_id and user_id have dtype object, which does not say what they actually contain. I'll first check whether either column mixes data types.

In [59]:
print(f"movie_id types: {user_ratings['movie_id'].apply(type).unique()}")
print(f"user_id types: {user_ratings['user_id'].apply(type).unique()}")

movie_id types: [<class 'str'>]
user_id types: [<class 'str'>]


Both columns contain only strings, as expected, so we can move on. Next, let's check whether any rows have missing values:

In [60]:
user_ratings.isnull().any(axis=1).sum()

np.int64(0)

There are no records with missing values. Let's take a look at the data to get an idea of what we want to do:

In [61]:
user_ratings.head()

Unnamed: 0,movie_id,rating_val,user_id
0,mank,5,deathproof
1,the-social-network,10,deathproof
2,insidious,10,deathproof
3,hush-2016,8,deathproof
4,deadpool,5,deathproof


Each row holds the movie_id of the rated movie (which matches movie_id in the movies dataset), the user_id of the user who rated it, and the rating_val they gave it, out of 10.

Let's first make sure that all the ratings are in the correct range:

In [62]:
invalid_ratings = user_ratings[(user_ratings['rating_val'] < 0) | (user_ratings['rating_val'] > 10)]
print(invalid_ratings)


Empty DataFrame
Columns: [movie_id, rating_val, user_id]
Index: []


All the ratings are within the correct range.
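An equivalent way to express this range check is `Series.between`, sketched below on a small made-up frame (the `toy` name and its values are assumptions for illustration, not rows from the dataset). Because `between` returns `False` for missing values, negating it also flags any NaN rating as invalid.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one rating above 10, one below 0, one missing
toy = pd.DataFrame({
    "movie_id": ["a", "b", "c", "d"],
    "rating_val": [5, 11, -1, np.nan],
    "user_id": ["u1", "u1", "u2", "u2"],
})

# between() is inclusive on both ends by default, matching the 0 to 10 bounds used above
valid_mask = toy["rating_val"].between(0, 10)

# Rows that fail the check, including the row with a missing rating
invalid_toy = toy[~valid_mask]
print(invalid_toy)
```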

When exploring the movies table, we found two titles that each appear under two different movie_id values (with the same or nearly the same release year, so each pair refers to one movie listed twice). Let's check whether both movie_ids in each pair received ratings from our users.

In [63]:
rating_count = user_ratings.groupby("movie_id")["rating_val"].count()
rating_count = rating_count.reset_index(name="ratings_count")

print(f"Rating count for black-panther: {rating_count[rating_count['movie_id'] == 'black-panther']}")
print(f"Rating count for black-panther-2018: {rating_count[rating_count['movie_id'] == 'black-panther-2018']}")

Rating count for black-panther:          movie_id  ratings_count
97  black-panther           4894
Rating count for black-panther-2018:               movie_id  ratings_count
98  black-panther-2018           3031


Both ids have ratings. We'll consolidate to black-panther-2018 because it is the more specific id. Now the same check for Ex Machina:

In [64]:
rating_count = user_ratings.groupby("movie_id")["rating_val"].count()
rating_count = rating_count.reset_index(name="ratings_count")

print(f"Rating count for ex-machina-2014: {rating_count[rating_count['movie_id'] == 'ex-machina-2014']}")
print(f"Rating count for ex-machina-2015: {rating_count[rating_count['movie_id'] == 'ex-machina-2015']}")

Rating count for ex-machina-2014:             movie_id  ratings_count
217  ex-machina-2014           4022
Rating count for ex-machina-2015:             movie_id  ratings_count
218  ex-machina-2015           3324


Both ids have a substantial number of ratings. A Google search indicates that 2015 is the more accurate release year, so we will consolidate the records under ex-machina-2015.
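The per-id lookups above can also be done in one pass by filtering on a list of ids and counting. This is a minimal sketch on made-up rows (the `toy` frame, its values, and the `ids_to_check` name are assumptions for illustration); on the real data the same pattern would be applied to `user_ratings`.

```python
import pandas as pd

# Hypothetical toy data standing in for user_ratings
toy = pd.DataFrame({
    "movie_id": ["black-panther", "black-panther-2018", "black-panther",
                 "ex-machina-2014", "ex-machina-2015", "mank"],
    "rating_val": [7, 8, 6, 9, 8, 5],
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
})

ids_to_check = ["black-panther", "black-panther-2018",
                "ex-machina-2014", "ex-machina-2015"]

# Keep only the ids of interest, then count rows per id
counts = toy.loc[toy["movie_id"].isin(ids_to_check), "movie_id"].value_counts()
print(counts)
```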

There's nothing else to check, so let's move on to data cleaning.

# Data Cleaning

## Consolidate Duplicated Movies with Different Movie ID

First we want to consolidate the duplicated movies to one id. All instances of "black-panther" will be changed to "black-panther-2018".

In [65]:
user_ratings.loc[user_ratings["movie_id"] == "black-panther", "movie_id"] = "black-panther-2018"

black_panther_ratings = user_ratings[user_ratings["movie_id"] == "black-panther"]
black_panther_2018_ratings = user_ratings[user_ratings["movie_id"] == "black-panther-2018"]

print(f"Ratings for Black Panther: {black_panther_ratings.shape}")
print(f"Ratings for Black Panther 2018: {black_panther_2018_ratings.shape}")

Ratings for Black Panther: (0, 3)
Ratings for Black Panther 2018: (7925, 3)


Now let's change all instances of "ex-machina-2014" to "ex-machina-2015".

In [66]:
user_ratings.loc[user_ratings["movie_id"] == "ex-machina-2014", "movie_id"] = "ex-machina-2015"

ex_machina_2014_ratings = user_ratings[user_ratings["movie_id"] == "ex-machina-2014"]
ex_machina_2015_ratings = user_ratings[user_ratings["movie_id"] == "ex-machina-2015"]

print(f"Ratings for Ex Machina 2014: {ex_machina_2014_ratings.shape}")
print(f"Ratings for Ex Machina 2015: {ex_machina_2015_ratings.shape}")

Ratings for Ex Machina 2014: (0, 3)
Ratings for Ex Machina 2015: (7346, 3)
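Both consolidations follow the same pattern, so they could also be expressed as a single mapping passed to `Series.replace`. The sketch below uses a made-up frame, and the `id_fixes` name is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for user_ratings
toy = pd.DataFrame({
    "movie_id": ["black-panther", "black-panther-2018", "ex-machina-2014", "mank"],
    "rating_val": [7, 8, 9, 5],
    "user_id": ["u1", "u2", "u3", "u4"],
})

# Hypothetical mapping from the id to retire to the id to keep
id_fixes = {
    "black-panther": "black-panther-2018",
    "ex-machina-2014": "ex-machina-2015",
}

# With a dict, replace() matches whole values, so "black-panther-2018" is left untouched
toy["movie_id"] = toy["movie_id"].replace(id_fixes)
print(toy["movie_id"].tolist())
```

Keeping the fixes in one dictionary makes it easy to add further id pairs if more duplicates turn up.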


Merging the ids may have introduced duplicate rows (the same user_id having rated both movie_ids with the same value). I'll check for rows that are identical in every column.

In [67]:
duplicates = user_ratings[user_ratings.duplicated()]

num_duplicates = duplicates.shape[0]
num_duplicates

5551

Let's drop these duplicates.

In [68]:
user_ratings = user_ratings.drop_duplicates()
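Note that `drop_duplicates()` with no arguments only removes rows that match in every column. A small made-up example (the `toy` frame and its values are assumptions for illustration) shows that a user with two different ratings for the same movie keeps both rows after this step:

```python
import pandas as pd

# Hypothetical toy data: u1 has two identical rows, u2 has two conflicting ratings
toy = pd.DataFrame({
    "movie_id": ["black-panther-2018"] * 4,
    "rating_val": [7, 7, 5, 6],
    "user_id": ["u1", "u1", "u2", "u2"],
})

deduped = toy.drop_duplicates()

# u1's identical rows collapse to one; u2's conflicting ratings both remain
print(deduped)
```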

Now let's check whether any users have given the same movie different ratings.

In [69]:
duplicates = user_ratings[user_ratings.duplicated(subset=["user_id", "movie_id"], keep=False)]
duplicates = duplicates.sort_values(by=["user_id", "movie_id"])

duplicates

Unnamed: 0,movie_id,rating_val,user_id
1528369,black-panther-2018,7,007filmreviwer
1599902,black-panther-2018,2,007filmreviwer
971380,black-panther-2018,5,_movieman_
1576623,black-panther-2018,6,_movieman_
1547536,black-panther-2018,5,admiringcinema
...,...,...,...
2440561,ex-machina-2015,8,yondu4
1671144,ex-machina-2015,8,zaydentomson
2464085,ex-machina-2015,9,zaydentomson
644969,black-panther-2018,8,zegan


We can see that this does happen. Let's consolidate each affected user's ratings for a movie by taking their average.

In [70]:
user_ratings = (
    user_ratings.groupby(["movie_id", "user_id"], as_index=False)["rating_val"]
      .mean()
)

# Check a record that was previously affected
record = user_ratings[
    (user_ratings["movie_id"] == "black-panther-2018") & 
    (user_ratings["user_id"] == "007filmreviwer")
]

print(record)


                  movie_id         user_id  rating_val
289116  black-panther-2018  007filmreviwer         4.5
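To make the effect of the averaging step concrete, here is a sketch on made-up rows (all values are assumptions for illustration). Two things are worth noting: the result has one row per (movie_id, user_id) pair, and `rating_val` becomes a float column because a mean can be fractional.

```python
import pandas as pd

# Hypothetical toy data: u1 rated the same movie twice with different values
toy = pd.DataFrame({
    "movie_id": ["black-panther-2018", "black-panther-2018", "mank"],
    "rating_val": [7, 2, 5],
    "user_id": ["u1", "u1", "u1"],
})

# Same grouping as above: one averaged rating per (movie_id, user_id) pair
averaged = (
    toy.groupby(["movie_id", "user_id"], as_index=False)["rating_val"]
       .mean()
)

print(averaged)
print(averaged["rating_val"].dtype)
```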


Let's check that all the duplicates are gone.

In [71]:
duplicates = user_ratings[user_ratings.duplicated(subset=["user_id", "movie_id"], keep=False)]
duplicates = duplicates.sort_values(by=["user_id", "movie_id"])

duplicates

Unnamed: 0,movie_id,user_id,rating_val


## Anonymise User ID

There is not much else to clean in this dataset. However, it's good practice to anonymise the user_ids. Let's replace them with sequential integers so that we can still tell users apart while keeping their identities anonymous.

In [72]:
# Check unique users
print(f"Unique users: {user_ratings['user_id'].nunique()}")

user_mapping = {uid: i for i, uid in enumerate(user_ratings['user_id'].unique(), start=1)}

user_ratings['user_id'] = user_ratings['user_id'].map(user_mapping)
print(f"Unique users after mapping: {user_ratings['user_id'].nunique()}")

print(user_ratings['user_id'].unique())

Unique users: 7449
Unique users after mapping: 7449
[   1    2    3 ... 7447 7448 7449]


We've successfully mapped the user_ids to an anonymised series and retained the user count.
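The same kind of sequential mapping can be obtained with `pd.factorize`, which assigns integer codes in order of first appearance. This is an alternative sketch on made-up user names, not the approach used above; the codes start at 0, so 1 is added to match the numbering above.

```python
import pandas as pd

# Hypothetical user names for illustration
toy = pd.DataFrame({"user_id": ["alice", "bob", "alice", "carol"]})

# factorize returns (codes, uniques); codes are 0-based in order of first appearance
codes, uniques = pd.factorize(toy["user_id"])
toy["user_id"] = codes + 1

print(toy["user_id"].tolist())
```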

In [73]:
user_ratings.reset_index(drop=True, inplace=True)

# Save the Cleaned Data

For testing purposes in the pipeline, it makes sense to export the cleaned DataFrame to a CSV file. This lets us use the cleaned data in the pipeline without rerunning the cleaning steps.

In [74]:
user_ratings.to_csv(
    "../tests/test_data/expected_user_ratings_clean_results.csv", index=False
)
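If the exported file is later used as an expected result in a test, a round-trip comparison is one way to do that. The sketch below writes a made-up frame to an in-memory buffer instead of the real path and compares the reloaded data with `pd.testing.assert_frame_equal`; the frame contents and the use of a buffer are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical cleaned data with the same column layout as above
cleaned = pd.DataFrame({
    "movie_id": ["black-panther-2018", "mank"],
    "user_id": [1, 1],
    "rating_val": [4.5, 5.0],
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# Raises AssertionError if the frames differ in values, dtypes, or column order
pd.testing.assert_frame_equal(cleaned, reloaded)
```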