<a href="https://colab.research.google.com/github/evanh1393/dsi_capstone/blob/main/01_main_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports



In [1]:
import pandas as pd
import numpy as np

from datetime import datetime

# Load Dataframes

The data come from the [MovieLens 1M](https://grouplens.org/datasets/movielens/) dataset published by GroupLens. It contains three files: *movies*, *users*, and *ratings*. The files use the `.dat` extension and separate fields with **::** instead of commas, so `read_csv()` needs an explicit `sep='::'` and `engine='python'` to load them without throwing warnings.
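
As a minimal sketch of why the extra arguments matter: pandas treats a separator longer than one character as a regular expression, which only the Python parsing engine supports. Passing `engine='python'` explicitly avoids the fallback that emits a `ParserWarning`. The in-memory buffer below is a stand-in for the real `movies.dat`, and `sample_movies` is a name used only for this illustration.

```python
import io

import pandas as pd

# Two rows in the '::'-delimited layout (stand-in for the real movies.dat)
sample = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

# A multi-character separator is interpreted as a regex, which requires
# the Python engine; the file has no header row, so names are supplied.
sample_movies = pd.read_csv(
    sample,
    sep='::',
    engine='python',
    names=['movie_id', 'title', 'genres'],
)
print(sample_movies)
```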


## Mounting Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Creating the path of data
DATA_PATH = '/content/drive/MyDrive/Colab Notebooks/one-m-capstone/data/main/'

## Movies Dataframe

In [4]:
movies = pd.read_csv(DATA_PATH + 'movies.dat', sep='::', engine='python', names=['movie_id', 'title', 'genres'])
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  3883 non-null   int64 
 1   title     3883 non-null   object
 2   genres    3883 non-null   object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB


### Genres
The genres data will be crucial to our modeling later on. To make it easier to work with, we will replace the `|` separator with a single space, giving one space-separated string per movie.

In [6]:
movies['genres'] = movies['genres'].str.split('|')
movies['genres'] = movies['genres'].str.join(' ').astype(str)

In [7]:
movies.head(2)

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation Children's Comedy
1,2,Jumanji (1995),Adventure Children's Fantasy
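
The genre reformatting above can be checked on a toy Series. This sketch assumes that no genre label contains a space (hyphenated labels such as `Sci-Fi` are fine), so that splitting on whitespace recovers the original list. The names `genres`, `spaced`, and `round_trip` are illustrative only.

```python
import pandas as pd

genres = pd.Series(["Animation|Children's|Comedy", "Comedy|Romance", "Sci-Fi"])

# Same transformation as above: split on '|' and re-join with spaces
spaced = genres.str.split('|').str.join(' ')

# Splitting on whitespace and re-joining with '|' recovers the input,
# because no genre label in this sample contains a space
round_trip = spaced.str.split().str.join('|')
print(spaced.tolist())
```

If a future genre label did contain a space, this round trip would fail, which makes it a cheap sanity check before relying on whitespace tokenization.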


## Users Dataframe

In [8]:
users = pd.read_csv(DATA_PATH + 'users.dat', 
                    sep='::', 
                    engine='python', 
                    names=['user_id', 'gender', 'age', 'occupation', 'zip'])
users.head()

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [9]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     6040 non-null   int64 
 1   gender      6040 non-null   object
 2   age         6040 non-null   int64 
 3   occupation  6040 non-null   int64 
 4   zip         6040 non-null   object
dtypes: int64(3), object(2)
memory usage: 236.1+ KB


## Elaborating on the numeric values

This will make it easier to interpret our EDA. The mappings below are taken from the dataset's `README`.

In [10]:
age_map = {
    1  : 'Under 18',
    18 : '18-24',
    25 : '25-34',
    35 : '35-44',
    45 : '45-49',
    50 : '50-55',
    56 : '56+'
} 
occ_map = {
    0:  "other or not specified",
    1:  "academic/educator",
    2:  "artist",
    3:  "clerical/admin",
    4:  "college/grad student",
    5:  "customer service",
    6:  "doctor/health care",
    7:  "executive/managerial",
    8:  "farmer",
    9:  "homemaker",
    10:  "K-12 student",
    11:  "lawyer",
    12:  "programmer",
    13:  "retired",
    14:  "sales/marketing",
    15:  "scientist",
    16:  "self-employed",
    17:  "technician/engineer",
    18:  "tradesman/craftsman",
    19:  "unemployed",
    20:  "writer"
}
users['age_elab'] = users['age'].map(age_map)
users['occ_elab'] = users['occupation'].map(occ_map)

In [11]:
users.head()

Unnamed: 0,user_id,gender,age,occupation,zip,age_elab,occ_elab
0,1,F,1,10,48067,Under 18,K-12 student
1,2,M,56,16,70072,56+,self-employed
2,3,M,25,15,55117,25-34,scientist
3,4,M,45,7,2460,45-49,executive/managerial
4,5,M,25,20,55455,25-34,writer
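
One property of `Series.map` worth keeping in mind here: a value with no matching dictionary key becomes `NaN` rather than raising an error. The sketch below uses a hypothetical subset of the age mapping and one deliberately invalid code to show how unmapped values could be detected.

```python
import pandas as pd

# Hypothetical subset of the age mapping, plus one code (30) that is not in it
age_codes = pd.Series([1, 25, 56, 30])
age_map_subset = {1: 'Under 18', 25: '25-34', 56: '56+'}

labels = age_codes.map(age_map_subset)

# Series.map leaves NaN wherever a value has no key in the dictionary
unmapped = age_codes[labels.isna()]
print(unmapped.tolist())
```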


## Ratings Dataframe

In [12]:
ratings = pd.read_csv(DATA_PATH + 'ratings.dat', sep='::', 
                     engine='python', 
                     names=['user_id', 'movie_id', 'rating', 'timestamp'])
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


## Unrated movies

Here "unrated" does not mean *X-rated*; it means movies that appear in `movies` but have no entry in `ratings`. Our final modeling and implementation must preserve every movie in the data, so we resolve this by adding a fake user whose generated ratings cover every movie that has none.

In [13]:
all_movie_ids = movies['movie_id'].values
# A set gives constant-time membership checks
rated_movie_ids = set(ratings['movie_id'].values)
unrated_ids = [x for x in all_movie_ids if x not in rated_movie_ids]

print(f'There are {len(unrated_ids)} movies out of {len(all_movie_ids)} that do not have ratings')

There are 177 movies out of 3883 that do not have ratings
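
The same membership check can be sketched with `Series.isin`, which avoids scanning the ratings once per movie. The toy frames and the names `toy_movies`, `toy_ratings`, and `toy_unrated` are assumptions made for this illustration.

```python
import pandas as pd

toy_movies = pd.DataFrame({'movie_id': [1, 2, 3, 4]})
toy_ratings = pd.DataFrame({'movie_id': [1, 1, 3], 'rating': [5, 4, 3]})

# True for every movie that appears at least once in the ratings
has_rating = toy_movies['movie_id'].isin(toy_ratings['movie_id'])

# Keep the ids of movies that never appear in the ratings
toy_unrated = toy_movies.loc[~has_rating, 'movie_id'].tolist()
print(toy_unrated)
```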


### Fake User

In [14]:
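# Create a fake user who submits the imputed ratings and can be used for test predictions.
# Note: 30 is not one of the README age codes (25 is the code for '25-34').
fake_user = pd.DataFrame([[6041, 'M', 30, 0, '07974', '25-34', 'other or not specified']],
                         columns=users.columns)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
users = pd.concat([users, fake_user], ignore_index=True)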
users.tail(1)

Unnamed: 0,user_id,gender,age,occupation,zip,age_elab,occ_elab
6040,6041,M,30,0,7974,25-34,other or not specified


### Creating Fake (Imputed) Ratings

In [15]:
round(np.mean(ratings['rating']))

4

In [16]:
# Rounded global mean rating, used as the imputed score for every unrated movie
mean_score = round(np.mean(ratings['rating']))

# The fake user created above submits every imputed rating
fake_user_id = 6041

# Collect one rating record per unrated movie
fake_ratings = []

for movie_id in unrated_ids:
  fake_rating = {
      'user_id'   : fake_user_id,
      'movie_id'  : movie_id,
      'rating'    : mean_score,
      # the ratings data needs a timestamp (seconds since the epoch);
      # np.int was removed from NumPy, so use the built-in int
      'timestamp' : int(datetime.now().timestamp())
  }
  fake_ratings.append(fake_rating)

In [17]:
fakedf = pd.DataFrame(data=fake_ratings, columns=['user_id','movie_id','rating','timestamp'])

In [18]:
ratings = pd.concat([ratings, fakedf], ignore_index=True)
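
The imputation steps above can also be sketched without a loop on toy data. The placeholder user id, the unrated movie ids, and the fixed timestamp are all hypothetical values chosen for the example, not values from the dataset.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2],
    'movie_id': [10, 20, 10],
    'rating': [5, 3, 4],
    'timestamp': [978300760, 978302109, 978301968],
})
toy_unrated = [30, 40]          # hypothetical unrated movie ids
placeholder_user = 99           # hypothetical id for the placeholder user

# Rounded global mean of the existing ratings: (5 + 3 + 4) / 3 = 4.0
fill_value = round(toy_ratings['rating'].mean())

# Scalars are broadcast across the rows defined by the movie_id list
toy_imputed = pd.DataFrame({
    'user_id': placeholder_user,
    'movie_id': toy_unrated,
    'rating': fill_value,
    'timestamp': 978307200,     # arbitrary fixed timestamp for the sketch
})

toy_all = pd.concat([toy_ratings, toy_imputed], ignore_index=True)
print(toy_all)
```

Filling with a single global value is the simplest choice; it keeps every movie present in the ratings but carries no information about the movie itself.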

### Confirming Changes

In [19]:
all_movie_ids = movies['movie_id'].values
# A set gives constant-time membership checks
rated_movie_ids = set(ratings['movie_id'].values)
unrated_ids = [x for x in all_movie_ids if x not in rated_movie_ids]

print(f'There are {len(unrated_ids)} movies out of {len(all_movie_ids)} that do not have ratings')

There are 0 movies out of 3883 that do not have ratings


# Combined Dataframes

This dataframe will make modeling easier later on. By combining all of our data, we create a monolithic dataframe that is easily accessible and reduces the amount of importing we need to do later on. We will use a simple inner join, the `pd.merge` default, to merge our data.

In [20]:
combined = pd.merge(movies, ratings, on='movie_id')
combined = pd.merge(combined, users, on='user_id')
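
To illustrate why the imputed ratings matter for this join, the sketch below shows an inner merge silently dropping a movie with no ratings, and one possible way to audit that with the `how`, `indicator`, and `validate` options of `pd.merge`. The toy frames and variable names are assumptions for the example.

```python
import pandas as pd

toy_movies = pd.DataFrame({'movie_id': [1, 2, 3], 'title': ['A', 'B', 'C']})
toy_ratings = pd.DataFrame({'user_id': [7, 8], 'movie_id': [1, 2], 'rating': [5, 4]})

# Inner join (the default): movie 3 has no rating, so it disappears
inner = pd.merge(toy_movies, toy_ratings, on='movie_id')

# Outer join with indicator=True labels where each row came from;
# validate raises if movie_id is duplicated on the movies side
audit = pd.merge(toy_movies, toy_ratings, on='movie_id', how='outer',
                 indicator=True, validate='one_to_many')
dropped = audit.loc[audit['_merge'] == 'left_only', 'movie_id'].tolist()
print(dropped)
```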

# Saving our Data

In [22]:
movies.to_csv(DATA_PATH + 'processed_movies.csv', sep=',', header=True, columns=['movie_id', 'title', 'genres'])
ratings.to_csv(DATA_PATH + 'processed_ratings.csv', sep=',', header=True, columns=['user_id','movie_id','rating','timestamp'])
users.to_csv(DATA_PATH + 'processed_users.csv', sep=',', header=True, columns=['user_id', 'gender', 'age', 'occupation', 'zip', 'age_elab', 'occ_elab'])
combined.to_csv(DATA_PATH + 'combined.csv', header=True)
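
One detail to keep in mind when reloading these files: `to_csv` writes the index as an unnamed first column by default. The round trip below, using an in-memory buffer instead of a real path, sketches what that looks like on reload and how `index_col=0` restores the original columns. Passing `index=False` to `to_csv` would be an alternative if the index is not needed.

```python
import io

import pandas as pd

toy = pd.DataFrame({'movie_id': [1, 2], 'title': ['A', 'B']})

# Default to_csv writes the index as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)

# A plain reload picks that column up under the name 'Unnamed: 0'
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# Treating the first column as the index restores the original columns
with_index.seek(0)
restored = pd.read_csv(with_index, index_col=0)
print(list(reloaded.columns))
```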