# Let's Watch a Movie!
I love watching movies, but sometimes it can be hard to decide what to watch. I might be in the mood for an action movie like The Matrix, but I can’t seem to think of a title to watch. Thankfully, we can use machine learning to help us, using the power of recommender systems! 
A recommender system is an information filtering system that tries to predict which items are most relevant to a user. There are several kinds of recommender systems, but we will focus on Content-Based Filtering, which uses the features of an item to recommend similar items to the user. We will use two popular approaches to content-based filtering: a k-nearest neighbors algorithm and cosine similarity.<br><br> 
In this project, we will attempt to design our own recommender system that provides us with movie recommendations. These recommendations will be based on a movie we give to the system.
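
To make the idea concrete, here is a minimal sketch of content-based filtering with cosine similarity, where the "nearest neighbors" are simply the k most similar movies. It is not the system built in this project: the titles, the four genre flags, and the helper names `cosine_similarity_matrix` and `recommend` are all assumptions made up for illustration.

```python
import numpy as np

# Hypothetical toy catalogue. Each row is a movie and each column is a
# hand-assigned genre flag (action, sci-fi, romance, comedy). These values
# are illustrative only and do not come from the TMDB dataset.
titles = ["The Matrix", "Inception", "Mr. & Mrs. Smith", "The Notebook"]
features = np.array([
    [1, 1, 0, 0],  # The Matrix
    [1, 1, 0, 0],  # Inception
    [1, 0, 1, 1],  # Mr. & Mrs. Smith
    [0, 0, 1, 0],  # The Notebook
], dtype=float)


def cosine_similarity_matrix(x):
    """Return the pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / norms          # scale every row to unit length
    return unit @ unit.T      # dot products of unit vectors are cosines


def recommend(title, k=2):
    """Return the k movies most similar to `title`, best match first."""
    idx = titles.index(title)
    sims = cosine_similarity_matrix(features)[idx]
    order = np.argsort(-sims, kind="stable")       # highest similarity first
    neighbors = [i for i in order if i != idx][:k]  # skip the query movie
    return [(titles[i], float(sims[i])) for i in neighbors]


recs = recommend("The Matrix", k=2)
for name, score in recs:
    print(f"{name}: {score:.3f}")
```

Movies that share more genre flags with the query point in a similar direction in feature space, so they score closer to 1. A real system would use much richer features than four genre flags.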

## Goal
The goal of this project is to design and implement a successful movie recommender system. We will attempt two different approaches and compare how each system performs. Additionally, I hope to gain more insight into how these systems are developed and learn about ways to improve them. I find recommender systems fascinating, and I think it's important to understand the inner workings of systems that we interact with every day. <br><br>
Let's get started!
<br><br>

# Data

## Source
We will use the following dataset:
<br>
<br>
*TMDB 5000 Movie Dataset* (2017), https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata, &nbsp; author unknown
<br>
<br>
This is a public version of The Movie Database’s (https://www.themoviedb.org/) dataset that includes metadata on roughly 5,000 movies. This metadata was generated by users of the website and includes things like movie titles, ratings, genre, etc. 
<br>

## Explanation

There are two files in this dataset, tmdb_5000_movies.csv and tmdb_5000_credits.csv. They are 5.4 MB and 38.1 MB, respectively. 
<br>In the movies CSV, there are 4,803 samples with 20 features: 7 numerical and 13 categorical. Some of the key features include **title**, **keywords**, and **genres**.
<br>In the credits CSV, there are also 4,803 samples, but only 4 features: **title**, **cast**, and **crew**, which are categorical, and **movie_id**, which is numerical.
<br>All of the data comes from a single source.
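
Counts like these can be checked with pandas. The sketch below runs on a tiny made-up frame rather than the real files, so the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature frame. The values are made up; only the dtypes matter.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": ['[{"name": "Action"}]', '[{"name": "Drama"}]'],
    "budget": [1000, 2000],
    "popularity": [1.5, 2.5],
})

# Number of samples (rows) and features (columns)
n_rows, n_cols = toy.shape

# Split the features into numerical columns and everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"{n_rows} samples, {n_cols} features")
print("numerical:", numeric_cols)
print("categorical:", other_cols)
```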

## Data Cleaning
Now let's clean our data! This dataset has a lot of interesting characteristics, but it's not quite ready for our purposes yet. We will be cleaning the data by going through the following steps:
<br>
<ul>
<li>Consolidate the two CSV files into a single data frame. This will make it easier to move forward in our exploratory data analysis.</li>
<li>Remove the features that will not be relevant to our future model.</li>
<li>Inspect features with null values and decide whether to drop or impute them.</li>
<li>Determine whether any of the datatypes need to be changed (such as a float to an int, or a list to a string).</li>
</ul>
<br>
<br>
First, let's load both files and take a look at them.

In [None]:
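import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math

# Load the movie metadata and inspect its column types and first rows.
# Only the last expression in a cell is displayed, so print the rest explicitly.
df = pd.read_csv('data/tmdb_5000_movies.csv')
print(df.dtypes)
print(df.head())

# Load the cast and crew credits and inspect them the same way
df2 = pd.read_csv('data/tmdb_5000_credits.csv')
print(df2.dtypes)
df2.head()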

| | movie_id | title | cast | crew |
|---|---|---|---|---|
| 0 | 19995 | Avatar | `[{"cast_id": 242, "character": "Jake Sully", "...` | `[{"credit_id": "52fe48009251416c750aca23", "de...` |
| 1 | 285 | Pirates of the Caribbean: At World's End | `[{"cast_id": 4, "character": "Captain Jack Spa...` | `[{"credit_id": "52fe4232c3a36847f800b579", "de...` |
| 2 | 206647 | Spectre | `[{"cast_id": 1, "character": "James Bond", "cr...` | `[{"credit_id": "54805967c3a36829b5002c41", "de...` |
| 3 | 49026 | The Dark Knight Rises | `[{"cast_id": 2, "character": "Bruce Wayne / Ba...` | `[{"credit_id": "52fe4781c3a36847f81398c3", "de...` |
| 4 | 49529 | John Carter | `[{"cast_id": 5, "character": "John Carter", "c...` | `[{"credit_id": "52fe479ac3a36847f813eaa3", "de...` |
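
The cast and crew columns shown above hold JSON-encoded strings, so they need to be parsed before they are useful. Here is a rough sketch of how the consolidation and null-check steps might look on toy data. It makes several assumptions that the output above does not confirm: that the movies file has an `id` column matching `movie_id`, that each JSON entry has a `name` key, and that a `homepage` column exists. The helper `names_from_json` is hypothetical, and all values are made up.

```python
import json

import pandas as pd

# Hypothetical stand-ins for the two CSVs; values are made up for illustration.
movies = pd.DataFrame({
    "id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "genres": ['[{"id": 28, "name": "Action"}]', "[]"],
    "homepage": [None, "https://example.com"],
})
credits = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "cast": ['[{"cast_id": 1, "character": "Hero", "name": "Actor One"}]', "[]"],
})

# Assumption: movies.id and credits.movie_id identify the same movie.
# Drop the duplicate title column first so the merge does not create
# title_x / title_y, then drop the redundant key afterwards.
merged = movies.merge(
    credits.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
    how="inner",
).drop(columns="movie_id")


def names_from_json(cell):
    """Parse a JSON-encoded list of dicts and keep only each entry's name."""
    return [entry["name"] for entry in json.loads(cell)]


# Turn the JSON strings into plain Python lists of names
for col in ["genres", "cast"]:
    merged[col] = merged[col].apply(names_from_json)

# Count missing values per column to decide what to drop or impute
null_counts = merged.isna().sum()
print(merged)
print(null_counts)
```

An inner join keeps only movies that appear in both files. A left join would be the alternative if movies without credits should be kept.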

