## COMP 3400: Data Preparation Techniques Project
## Name of our Project

**Group Members:**
  - Liudmila Strelnikova 201819885
  - David Chicas 201919354

**Description of our data:**
This dataset was uploaded by user *Daniel Grijalva* to *Kaggle* and can be found [here](https://www.kaggle.com/datasets/danielgrijalvas/movies). It compiles different aspects of the film industry from 1980 to 2020. In this data we can observe patterns such as the most common release dates for films, the highest-grossing genres, and the movies rated best by consumers. We will use this information for **nlahnlahnlah**.

**Description of our variables:**
- **Name:** Title of the film. This helps us differentiate between each film. 
- **Rating:** Given to each movie by the Classification and Ratings Administration (CARA), the rating provides the information needed to determine if a film is suitable for children. We can use this information to see the difference between movies rated for everyone and movies rated for a specific group of consumers.
- **Genre:** Thematic category given to a film depending on theme, plot, topics, and other conventions. This information can help us determine which genres are the most popular, the highest grossing, and the most common.
- **Year:** Release year of the film. Ranges from 1980 to 2020. We can use this to see the difference between how movies were consumed in the past and how they are consumed in the present.
- **Released:** Exact date when the film was released and place of release. This data can be presented in a more useful way by separating the information into two different columns. We can see which release dates are the most and least common and how the release date affects different aspects of the film.
- **Score:** Rating given to a film by users of [IMDb.com](https://www.imdb.com) on a scale of 1 to 10, depending on how enjoyable they found the film. This can help us see any correlation between what consumers like and other variables such as genre, release date, and director.
- **Director:** Person who manages the film's creative aspects, like directing the film crew and actors. With the data we have, we can see who the highest-grossing or best-rated directors are.
- **Writer:** Person who wrote the script for the movie. This information is not relevant to our project.
- **Star:** Best-known actor in the movie. This information is not relevant to our project.
- **Country:** Place where the movie was released for its premiere, decided by the producers and film company. We can see how the director or production company can affect the country of release for a film.
- **Budget:** Amount of money allotted to the film's creation by the producers and film company, presented in U.S. dollars. With this information we can see if a film was profitable or lost money, and whether particular directors or film companies tend to work with bigger budgets.
- **Gross:** Gross box office earnings of a movie in U.S. dollars. This does not include any other forms of revenue. It can be used to see the impact of user score, which genres are the most profitable, and many other correlations.
- **Company:** Business in charge of producing the film. We can see which film genres are most common for each film company, as well as how successful each company's films are.
- **Runtime:** Total amount of time from start to finish that the film lasts. Presented in minutes. With this information we can see if the duration of the movie has any effect on budget, rating, or score. 
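
As a rough illustration of the profitability idea mentioned under **Budget** and **Gross**, the sketch below subtracts budget from gross on a small hand-made sample. The `sample` frame and the `profit` column name are assumptions made for this example; the numbers are copied from rows shown in the data preview. Box office gross minus budget is only a simplified proxy for real profit.

```python
import pandas as pd

# Hypothetical three-row sample using the dataset's column names;
# the values are copied from rows of the data preview.
sample = pd.DataFrame({
    "name": ["The Shining", "Airplane!", "More to Life"],
    "budget": [19000000.0, 3500000.0, 7000.0],
    "gross": [46998772.0, 83453539.0, float("nan")],
})

# Simplified profit: gross minus budget.
# If either value is missing, the result stays NaN instead of being guessed.
sample["profit"] = sample["gross"] - sample["budget"]

print(sample[["name", "profit"]])
```

Rows with a missing budget or gross keep a missing profit, so they can be filtered out later instead of being counted as losses.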

In [15]:
import numpy as np
import matplotlib.pyplot as plt
import datetime
import pandas as pd

In [16]:
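movies = pd.read_csv('./movies.csv')
movies.head(10)
# Preview with "year" as the index. The result is not assigned, so `movies` keeps its default index.
movies.set_index('year')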

year,name,rating,genre,released,score,votes,director,writer,star,country,budget,gross,company,runtime
1980,The Shining,R,Drama,"June 13, 1980 (United States)",8.4,927000.0,Stanley Kubrick,Stephen King,Jack Nicholson,United Kingdom,19000000.0,46998772.0,Warner Bros.,146.0
1980,The Blue Lagoon,R,Adventure,"July 2, 1980 (United States)",5.8,65000.0,Randal Kleiser,Henry De Vere Stacpoole,Brooke Shields,United States,4500000.0,58853106.0,Columbia Pictures,104.0
1980,Star Wars: Episode V - The Empire Strikes Back,PG,Action,"June 20, 1980 (United States)",8.7,1200000.0,Irvin Kershner,Leigh Brackett,Mark Hamill,United States,18000000.0,538375067.0,Lucasfilm,124.0
1980,Airplane!,PG,Comedy,"July 2, 1980 (United States)",7.7,221000.0,Jim Abrahams,Jim Abrahams,Robert Hays,United States,3500000.0,83453539.0,Paramount Pictures,88.0
1980,Caddyshack,R,Comedy,"July 25, 1980 (United States)",7.3,108000.0,Harold Ramis,Brian Doyle-Murray,Chevy Chase,United States,6000000.0,39846344.0,Orion Pictures,98.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020,More to Life,,Drama,"October 23, 2020 (United States)",3.1,18.0,Joseph Ebanks,Joseph Ebanks,Shannon Bond,United States,7000.0,,,90.0
2020,Dream Round,,Comedy,"February 7, 2020 (United States)",4.7,36.0,Dusty Dukatz,Lisa Huston,Michael Saquella,United States,,,Cactus Blue Entertainment,90.0
2020,Saving Mbango,,Drama,"April 27, 2020 (Cameroon)",5.7,29.0,Nkanya Nkwai,Lynno Lovert,Onyama Laura,United States,58750.0,,Embi Productions,
2020,It's Just Us,,Drama,"October 1, 2020 (United States)",,,James Randall,James Randall,Christina Roz,United States,15000.0,,,120.0


## Getting rid of unnecessary columns, splitting "released" column

The columns "writer", "votes", and "star" are not relevant to our analysis, so they are removed from the dataframe.

In [17]:
to_drop = ['writer', 'votes', 'star']
movies.drop(to_drop, inplace=True, axis=1)

The column "released" is split into "date" and "release_country" for more meaningful analysis. 

In [18]:
movies[['date','release_country']] = movies.released.str.split("(",expand=True)


Because the previous operation split on the left parenthesis, the right parenthesis is still present at the end of each value in release_country and needs to be removed.

In [19]:
# regex=False treats ')' as a literal character; as a regex pattern, a lone ')' is invalid.
movies['release_country'] = movies['release_country'].str.replace(')', '', regex=False)
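
As an alternative to splitting and then stripping the parenthesis, both parts can be captured in one step with `str.extract`. This is only a sketch on a hand-made series. The `pattern` below and the third sample value (a row with no country in parentheses) are assumptions for illustration, not something confirmed about the dataset.

```python
import pandas as pd

# Two values in the format of the "released" column, plus a hypothetical
# value without a "(country)" part to show how missing pieces behave.
released = pd.Series([
    "June 13, 1980 (United States)",
    "April 27, 2020 (Cameroon)",
    "1985",
])

# Named groups become column names: "date" is everything before the optional
# parenthesised suffix, "release_country" is the text inside the parentheses.
pattern = r"^(?P<date>[^(]*?)\s*(?:\((?P<release_country>[^)]*)\))?\s*$"
parts = released.str.extract(pattern)

print(parts)
```

With this approach the date has no trailing space, and a value without parentheses gets a missing `release_country` instead of raising an error.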

To finish this cleaning step, the column "released" is deleted.

In [20]:
movies.drop('released', inplace=True, axis=1)

Next, we check whether every column has an appropriate data type.

In [21]:
movies.dtypes

name                object
rating              object
genre               object
year                 int64
score              float64
director            object
country             object
budget             float64
gross              float64
company             object
runtime            float64
date                object
release_country     object
dtype: object
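
The listing shows that "date" is stored as `object` (plain strings), which makes it hard to group by month or weekday. One possible way to convert it is sketched below on a hand-made series. The format string assumes dates written like "June 13, 1980", and the third value is a hypothetical incomplete date added to show how unparseable entries are handled.

```python
import pandas as pd

# Sample strings in the style of the "date" column. The first keeps the
# trailing space left behind by splitting on "(".
dates = pd.Series(["June 13, 1980 ", "July 2, 1980", "1985"])

# Strip whitespace, then parse "Month day, year".
# errors="coerce" turns anything that does not match the format into NaT.
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print(parsed.dt.month)
```

Once the column is a datetime, accessors such as `.dt.month` or `.dt.day_name()` can be used to study release-date patterns.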

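Now that we have deleted the data we do not need, we can format some of our variables: budget, gross, and runtime. This formatting converts the numeric columns to strings, so missing values appear as "$nan" or "nan min".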

In [22]:
# Format as display strings: dollars with thousands separators, runtime in whole minutes.
movies['budget'] = movies['budget'].apply(lambda x: f"${x:,.2f}")
movies['gross'] = movies['gross'].apply(lambda x: f"${x:,.2f}")
movies['runtime'] = movies['runtime'].apply(lambda x: f"{x:,.0f} min")
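
Because formatted columns are strings, they can no longer be summed or compared numerically. One option, sketched here under assumptions, is to keep the numeric columns untouched and write the formatted text into a separate copy used only for display. The helper `format_usd`, the "N/A" label, and the `sample` and `display_copy` names are all hypothetical choices for this example.

```python
import pandas as pd

def format_usd(value):
    """Return a dollar string such as "$19,000,000.00", or "N/A" if missing."""
    if pd.isna(value):
        return "N/A"
    return f"${value:,.2f}"

# Hypothetical two-row sample: one complete row and one with missing values.
sample = pd.DataFrame({
    "budget": [19000000.0, float("nan")],
    "gross": [46998772.0, float("nan")],
})

# Format a copy for display; the original frame stays numeric for analysis.
display_copy = sample.copy()
for col in ["budget", "gross"]:
    display_copy[col] = sample[col].map(format_usd)

print(display_copy)
print(sample.dtypes)
```

This keeps `sample` usable for calculations such as averages or correlations, while `display_copy` is only meant for presentation.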

In [23]:
movies.head(10)

Unnamed: 0,name,rating,genre,year,score,director,country,budget,gross,company,runtime,date,release_country
0,The Shining,R,Drama,1980,8.4,Stanley Kubrick,United Kingdom,"$19,000,000.00","$46,998,772.00",Warner Bros.,146 min,"June 13, 1980",United States
1,The Blue Lagoon,R,Adventure,1980,5.8,Randal Kleiser,United States,"$4,500,000.00","$58,853,106.00",Columbia Pictures,104 min,"July 2, 1980",United States
2,Star Wars: Episode V - The Empire Strikes Back,PG,Action,1980,8.7,Irvin Kershner,United States,"$18,000,000.00","$538,375,067.00",Lucasfilm,124 min,"June 20, 1980",United States
3,Airplane!,PG,Comedy,1980,7.7,Jim Abrahams,United States,"$3,500,000.00","$83,453,539.00",Paramount Pictures,88 min,"July 2, 1980",United States
4,Caddyshack,R,Comedy,1980,7.3,Harold Ramis,United States,"$6,000,000.00","$39,846,344.00",Orion Pictures,98 min,"July 25, 1980",United States
5,Friday the 13th,R,Horror,1980,6.4,Sean S. Cunningham,United States,"$550,000.00","$39,754,601.00",Paramount Pictures,95 min,"May 9, 1980",United States
6,The Blues Brothers,R,Action,1980,7.9,John Landis,United States,"$27,000,000.00","$115,229,890.00",Universal Pictures,133 min,"June 20, 1980",United States
7,Raging Bull,R,Biography,1980,8.2,Martin Scorsese,United States,"$18,000,000.00","$23,402,427.00",Chartoff-Winkler Productions,129 min,"December 19, 1980",United States
8,Superman II,PG,Action,1980,6.8,Richard Lester,United States,"$54,000,000.00","$108,185,706.00",Dovemead Films,127 min,"June 19, 1981",United States
9,The Long Riders,R,Biography,1980,7.0,Walter Hill,United States,"$10,000,000.00","$15,795,189.00",United Artists,100 min,"May 16, 1980",United States


In [24]:
movies


Unnamed: 0,name,rating,genre,year,score,director,country,budget,gross,company,runtime,date,release_country
0,The Shining,R,Drama,1980,8.4,Stanley Kubrick,United Kingdom,"$19,000,000.00","$46,998,772.00",Warner Bros.,146 min,"June 13, 1980",United States
1,The Blue Lagoon,R,Adventure,1980,5.8,Randal Kleiser,United States,"$4,500,000.00","$58,853,106.00",Columbia Pictures,104 min,"July 2, 1980",United States
2,Star Wars: Episode V - The Empire Strikes Back,PG,Action,1980,8.7,Irvin Kershner,United States,"$18,000,000.00","$538,375,067.00",Lucasfilm,124 min,"June 20, 1980",United States
3,Airplane!,PG,Comedy,1980,7.7,Jim Abrahams,United States,"$3,500,000.00","$83,453,539.00",Paramount Pictures,88 min,"July 2, 1980",United States
4,Caddyshack,R,Comedy,1980,7.3,Harold Ramis,United States,"$6,000,000.00","$39,846,344.00",Orion Pictures,98 min,"July 25, 1980",United States
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7663,More to Life,,Drama,2020,3.1,Joseph Ebanks,United States,"$7,000.00",$nan,,90 min,"October 23, 2020",United States
7664,Dream Round,,Comedy,2020,4.7,Dusty Dukatz,United States,$nan,$nan,Cactus Blue Entertainment,90 min,"February 7, 2020",United States
7665,Saving Mbango,,Drama,2020,5.7,Nkanya Nkwai,United States,"$58,750.00",$nan,Embi Productions,nan min,"April 27, 2020",Cameroon
7666,It's Just Us,,Drama,2020,,James Randall,United States,"$15,000.00",$nan,,120 min,"October 1, 2020",United States
