# Metadata Cleanup Purpose

In our original project, we wanted to work with three files, all in the Resource folder. The file movie_metadata.csv had formatting issues that prevented its data from merging correctly with the data in the other two files when we created DataFrames in Jupyter Notebook. The code in this notebook cleans up movie_metadata.csv so that its data can be merged and used for future analysis.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

In [2]:
#Load movie_metadata CSV file
start_file = pd.read_csv("Resource/movie_metadata.csv")

#Remove extraneous columns
start_file = start_file.drop(columns = ["color", "num_critic_for_reviews", "director_facebook_likes", 
                                       "actor_3_facebook_likes", "actor_1_facebook_likes", "cast_total_facebook_likes",
                                       "plot_keywords", "actor_2_facebook_likes", "aspect_ratio", 
                                       "num_voted_users", "num_user_for_reviews", "movie_imdb_link"])

In [3]:
#Remove extra spaces at end of movie titles
#Note: These spaces were a primary reason why this data was not successfully merged in the original project.
start_file["movie_title"] = start_file["movie_title"].str.rstrip()
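
To illustrate why trailing spaces block a merge, here is a minimal sketch on toy data. The two small frames and the names `metadata`, `other`, `merged_before`, and `merged_after` are assumptions made for illustration; they are not the project's actual files.

```python
import pandas as pd

# Toy stand-ins for movie_metadata.csv and one of the other files (hypothetical data).
# The titles in `metadata` carry trailing spaces, as described in the note above.
metadata = pd.DataFrame(
    {"movie_title": ["Avatar ", "Spectre  "], "imdb_score": [7.9, 6.8]}
)
other = pd.DataFrame(
    {"movie_title": ["Avatar", "Spectre"], "gross": [760505847.0, 200074175.0]}
)

# Before cleanup: "Avatar " != "Avatar", so an inner merge finds no matching titles.
merged_before = metadata.merge(other, on="movie_title")

# After cleanup: rstrip() removes the trailing whitespace and the keys line up.
metadata["movie_title"] = metadata["movie_title"].str.rstrip()
merged_after = metadata.merge(other, on="movie_title")

print(len(merged_before), len(merged_after))
```

Comparing the row counts before and after the strip is a quick way to confirm that whitespace, rather than genuinely different titles, was the cause of the failed merge.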

In [4]:
#Find films where country is missing, then set the country for the ones identified as made in the USA
start_file.loc[start_file["country"].isnull()]
start_file.at[4, "country"] = "USA"
start_file.at[279, "country"] = "USA"
start_file.at[3397, "country"] = "USA"
start_file.at[4021, "country"] = "USA"
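
The same pattern can be sketched on a toy frame: list the rows with a missing country, then assign a value to several index labels at once with `.loc` instead of one `.at` call per row. The frame `films`, the titles, and the label list `usa_rows` are hypothetical and only stand in for the real data.

```python
import pandas as pd

# Hypothetical toy frame; None marks a missing country.
films = pd.DataFrame(
    {"movie_title": ["Film A", "Film B", "Film C"], "country": ["USA", None, None]}
)

# Show the rows that still need a country so they can be checked by hand.
missing = films.loc[films["country"].isna(), ["movie_title", "country"]]
print(missing)

# Index labels confirmed by hand as US productions (hypothetical choice).
usa_rows = [1]
films.loc[usa_rows, "country"] = "USA"

# Count how many rows are still missing a country after the fix.
remaining = int(films["country"].isna().sum())
print(remaining)
```

Keeping the labels in one list makes it easier to see at a glance which rows were edited manually.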

In [6]:
start_file.columns

Index(['director_name', 'duration', 'actor_2_name', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'actor_3_name', 'facenumber_in_poster',
       'language', 'country', 'content_rating', 'budget', 'title_year',
       'imdb_score', 'movie_facebook_likes'],
      dtype='object')

In [10]:
#Create a new dataframe with the columns reordered.
reordered_df = start_file[["movie_title", "director_name", "gross", "title_year", "budget", 
                          "imdb_score", "duration", "content_rating", "country", "actor_1_name", 
                          "actor_2_name", "actor_3_name", "genres", "facenumber_in_poster", 
                          "language", "movie_facebook_likes"]]
reordered_df.head()

Unnamed: 0,movie_title,director_name,gross,title_year,budget,imdb_score,duration,content_rating,country,actor_1_name,actor_2_name,actor_3_name,genres,facenumber_in_poster,language,movie_facebook_likes
0,Avatar,James Cameron,760505847.0,2009.0,237000000.0,7.9,178.0,PG-13,USA,CCH Pounder,Joel David Moore,Wes Studi,Action|Adventure|Fantasy|Sci-Fi,0.0,English,33000
1,Pirates of the Caribbean: At World's End,Gore Verbinski,309404152.0,2007.0,300000000.0,7.1,169.0,PG-13,USA,Johnny Depp,Orlando Bloom,Jack Davenport,Action|Adventure|Fantasy,0.0,English,0
2,Spectre,Sam Mendes,200074175.0,2015.0,245000000.0,6.8,148.0,PG-13,UK,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Action|Adventure|Thriller,1.0,English,85000
3,The Dark Knight Rises,Christopher Nolan,448130642.0,2012.0,250000000.0,8.5,164.0,PG-13,USA,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Action|Thriller,0.0,English,164000
4,Star Wars: Episode VII - The Force Awakens,Doug Walker,,,,7.1,,,USA,Doug Walker,Rob Walker,,Documentary,0.0,,0


In [11]:
#Keep only films made in the USA
#Budgets in the original file appear to be in various currencies. This filter narrows
#the frame to films whose budgets can be assumed to be in US dollars.
us_films = reordered_df.loc[reordered_df["country"] == "USA"]
us_films

Unnamed: 0,movie_title,director_name,gross,title_year,budget,imdb_score,duration,content_rating,country,actor_1_name,actor_2_name,actor_3_name,genres,facenumber_in_poster,language,movie_facebook_likes
0,Avatar,James Cameron,760505847.0,2009.0,237000000.0,7.9,178.0,PG-13,USA,CCH Pounder,Joel David Moore,Wes Studi,Action|Adventure|Fantasy|Sci-Fi,0.0,English,33000
1,Pirates of the Caribbean: At World's End,Gore Verbinski,309404152.0,2007.0,300000000.0,7.1,169.0,PG-13,USA,Johnny Depp,Orlando Bloom,Jack Davenport,Action|Adventure|Fantasy,0.0,English,0
3,The Dark Knight Rises,Christopher Nolan,448130642.0,2012.0,250000000.0,8.5,164.0,PG-13,USA,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Action|Thriller,0.0,English,164000
4,Star Wars: Episode VII - The Force Awakens,Doug Walker,,,,7.1,,,USA,Doug Walker,Rob Walker,,Documentary,0.0,,0
5,John Carter,Andrew Stanton,73058679.0,2012.0,263700000.0,6.6,132.0,PG-13,USA,Daryl Sabara,Samantha Morton,Polly Walker,Action|Adventure|Sci-Fi,1.0,English,24000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5037,Newlyweds,Edward Burns,4584.0,2011.0,9000.0,6.4,95.0,Not Rated,USA,Kerry Bishé,Caitlin FitzGerald,Daniella Pineda,Comedy|Drama,1.0,English,413
5039,The Following,,,,,7.5,43.0,TV-14,USA,Natalie Zea,Valorie Curry,Sam Underwood,Crime|Drama|Mystery|Thriller,1.0,English,32000
5040,A Plague So Pleasant,Benjamin Roberds,,2013.0,1400.0,6.3,76.0,,USA,Eva Boehnke,Maxwell Moody,David Chandler,Drama|Horror|Thriller,0.0,English,16
5041,Shanghai Calling,Daniel Hsia,10443.0,2012.0,,6.3,100.0,PG-13,USA,Alan Ruck,Daniel Henney,Eliza Coupe,Comedy|Drama|Romance,5.0,English,660
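
A minimal sketch of the same filter on three rows taken from the output above, with a sanity check that only USA rows remain. The names `films`, `us_only`, and `countries_left` are assumptions for illustration, and the `.copy()` call is an optional addition, not part of the original cell. It makes the result an independent frame, which some pandas versions need in order to avoid a `SettingWithCopyWarning` on later column edits.

```python
import pandas as pd

# Three rows copied from the output above (only the columns needed here).
films = pd.DataFrame(
    {
        "movie_title": ["Avatar", "Spectre", "John Carter"],
        "country": ["USA", "UK", "USA"],
        "budget": [237000000.0, 245000000.0, 263700000.0],
    }
)

# Boolean mask keeps the US rows; .copy() detaches the result from `films`.
us_only = films.loc[films["country"] == "USA"].copy()

# Sanity check: the only country left should be "USA".
countries_left = us_only["country"].unique().tolist()
print(len(films), len(us_only), countries_left)
```

Note that the filtered frame keeps the original index labels, which is why the output above skips from row 1 to row 3 where the UK film was dropped.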
