# Exploratory Data Analysis of User Ratings

In [52]:
import pandas as pd

In [60]:
user_ratings = pd.read_csv('../data/raw/unclean_user_ratings.csv')

# Display summary information about the user_ratings DataFrame
user_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2603343 entries, 0 to 2603342
Data columns (total 3 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   movie_id    object
 1   rating_val  int64 
 2   user_id     object
dtypes: int64(1), object(2)
memory usage: 59.6+ MB


movie_id and user_id both have dtype object, which does not tell us what they actually contain. I'll first check whether either column holds mixed data types.

In [55]:
print(f"movie_id types: {user_ratings['movie_id'].apply(type).unique()}")
print(f"user_id types: {user_ratings['user_id'].apply(type).unique()}")

movie_id types: [<class 'str'>]
user_id types: [<class 'str'>]
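
Applying type to every element works, but it can be slow on millions of rows. As a rough alternative, pandas can summarise the contents of an object column with infer_dtype. The sketch below runs on a small made-up frame (the values, including the stray 42, are assumptions for illustration, not rows from the real file):

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Toy stand-in for the ratings data; values are illustrative only
sample = pd.DataFrame({
    "movie_id": ["mank", "insidious", "deadpool"],
    "user_id": ["deathproof", "deathproof", 42],  # 42 is a deliberately mismatched value
})

# infer_dtype returns a short label describing what a column holds,
# e.g. "string" when every non-missing value is a str
kinds = {col: infer_dtype(sample[col], skipna=True) for col in sample.columns}
print(kinds)

# Any column whose label is not "string" deserves a closer look
suspect_cols = [col for col, kind in kinds.items() if kind != "string"]
print(suspect_cols)
```

Here only the column with the stray integer is flagged; a column of pure strings is reported as "string".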


Both columns contain only strings, as expected, so we can move on. Next, let's check whether any rows have missing values:

In [56]:
user_ratings.isnull().any(axis=1).sum()

np.int64(0)
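
Note that isnull() only flags NaN and None. Empty or whitespace-only strings would slip through, so a complementary check on the text columns could look like the sketch below. The sample frame is made up for illustration, and text_cols is just a name chosen here:

```python
import pandas as pd

# Toy frame with one empty movie_id and one whitespace-only user_id
sample = pd.DataFrame({
    "movie_id": ["mank", "", "deadpool"],
    "rating_val": [5, 10, 8],
    "user_id": ["deathproof", "deathproof", "   "],
})

text_cols = ["movie_id", "user_id"]

# A row is "blank" if any text column is empty once whitespace is stripped
blank_mask = sample[text_cols].apply(lambda s: s.str.strip().eq("")).any(axis=1)
blank_count = int(blank_mask.sum())
print(blank_count)  # 2
```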

There are no records with missing values. Let's take a look at the data to get an idea of what we want to do:

In [57]:
user_ratings.head()

             movie_id  rating_val     user_id
0                mank           5  deathproof
1  the-social-network          10  deathproof
2           insidious          10  deathproof
3           hush-2016           8  deathproof
4            deadpool           5  deathproof


Each row holds the movie_id of the rated movie (which matches movie_id in the movies dataset), the user_id of the user who rated it, and rating_val, the rating out of 10 that they gave.

Let's first make sure that all the ratings are in the correct range:

In [58]:
invalid_ratings = user_ratings[(user_ratings['rating_val'] < 0) | (user_ratings['rating_val'] > 10)]
print(invalid_ratings)


Empty DataFrame
Columns: [movie_id, rating_val, user_id]
Index: []


All the ratings are within the correct range.
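
The same range check can also be written with Series.between, which is inclusive on both ends by default and reads a little closer to the rule being tested. A minimal sketch on made-up ratings (not taken from the real file):

```python
import pandas as pd

# Toy ratings, including three values outside the 0-10 range check
sample = pd.DataFrame({"rating_val": [5, 10, 0, 11, -1]})

# between(0, 10) is True for 0 <= rating_val <= 10
in_range = sample["rating_val"].between(0, 10)

# Invert the mask to keep only the out-of-range rows
invalid = sample[~in_range]
print(invalid)
```

On this toy data the rows with 11 and -1 are returned, while 0 and 10 count as valid because both bounds are included.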

There's nothing else to check, so let's move on to data cleaning.

# Data Cleaning

There is not much to clean in this dataset. However, it's good practice to anonymise the user_ids. Let's replace them with sequential integers so that we can still tell users apart while keeping their identities anonymous.

In [64]:
# Check unique users
print(f"Unique users: {user_ratings['user_id'].nunique()}")

# Map each distinct user_id to an integer, numbered from 1 in order of first appearance
user_mapping = {uid: i for i, uid in enumerate(user_ratings['user_id'].unique(), start=1)}

user_ratings['user_id'] = user_ratings['user_id'].map(user_mapping)
print(f"Unique users after mapping: {user_ratings['user_id'].nunique()}")

print(user_ratings['user_id'].unique())

Unique users: 7449
Unique users after mapping: 7449
[   1    2    3 ... 7447 7448 7449]


We've successfully mapped the user_ids to an anonymised sequence of integers, and the number of unique users is unchanged.
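
For reference, pd.factorize produces the same kind of numbering in one call: it assigns integer codes in order of first appearance, starting from 0, so adding 1 matches the dictionary approach. This is an alternative sketch on made-up usernames (only deathproof appears in the real data shown here), not the method used in this notebook:

```python
import pandas as pd

# Toy user_id column; "alice" and "bob" are invented for illustration
sample = pd.Series(["deathproof", "alice", "deathproof", "bob"], name="user_id")

# codes: one integer per row (0-based); uniques: the original labels in code order
codes, uniques = pd.factorize(sample)
anon_ids = pd.Series(codes + 1, index=sample.index, name="user_id")
print(anon_ids.tolist())  # [1, 2, 1, 3]

# Cross-check against the dictionary-based mapping
mapping = {uid: i for i, uid in enumerate(sample.unique(), start=1)}
same_result = sample.map(mapping).tolist() == anon_ids.tolist()
print(same_result)  # True
```

Either way, the numbering depends on row order, so the same user only gets the same integer if the input file is read in the same order each time.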

In [None]:
# Make sure the index is a clean 0..n-1 range before exporting
user_ratings.reset_index(drop=True, inplace=True)

# Save the cleaned data

For testing purposes in the pipeline, it makes sense to export the cleaned DataFrame to a CSV file. This lets us use the cleaned data in the pipeline without having to run the cleaning steps again.

In [67]:
user_ratings.to_csv(
    "../tests/test_data/expected_user_ratings_clean_results.csv", index=False
)
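
Since this file is meant to be compared against later, it may be worth confirming that it survives a round trip through CSV unchanged. The sketch below writes a small made-up frame to a temporary directory (not the real test_data path) and reads it back:

```python
import os
import tempfile

import pandas as pd

# Toy cleaned frame with the same three columns; values are illustrative only
cleaned = pd.DataFrame({
    "movie_id": ["mank", "the-social-network", "insidious"],
    "rating_val": [5, 10, 10],
    "user_id": [1, 1, 2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Temporary path used only for this sketch
    path = os.path.join(tmp_dir, "expected_user_ratings_clean_results.csv")
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises AssertionError if values, dtypes, or column order differ
pd.testing.assert_frame_equal(cleaned, reloaded)
```

Passing index=False matters here: without it, the index would be written as an extra unnamed column and the reloaded frame would no longer match.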