# Lab 06 - EDA - David Rogan

**What is EDA (Exploratory Data Analysis)?**
1. EDA is a process in data analytics used to analyze and investigate datasets and summarize their main characteristics.
2. EDA helps data analysts to discover patterns in the dataset and examine their assumptions.
3. EDA is also used to find out about the relationships between the dataset variables.
4. It was developed by the American mathematician John Tukey in the 1970s.
5. EDA helps analysts to answer the management team's questions.
6. When EDA is done, the dataset can be used for more sophisticated applications of AI such as ML.

**Types of EDA:**
1. Univariate: Examining an individual variable.
2. Bivariate: Examining two variables.
3. Multivariate: Examining more than two variables.
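
The three types can be told apart by how many columns take part in a single summary. The sketch below is only an illustration: `sample` is a hand-built frame holding five rows copied from the first rows of the movies dataset used in this lab, and the variable names are assumptions made for the example.

```python
import pandas as pd

# Five rows copied from the head of the ranking, for illustration only.
sample = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Studio": ["Buena Vista", "Fox", "Paramount", "Buena Vista", "Buena Vista"],
    "Year": [2019, 2009, 1997, 2015, 2018],
})

# Univariate: one variable on its own.
median_year = sample["Year"].median()

# Bivariate: one variable summarized against another.
mean_year_by_studio = sample.groupby("Studio")["Year"].mean()

# Multivariate: three variables in one summary.
best_rank = sample.groupby(["Studio", "Year"])["Rank"].min()

print(median_year)
print(mean_year_by_studio)
print(best_rank)
```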

**EDA Tools:**
1. Graphical EDA (data visualization): plots, graphs, charts.
2. Non-graphical EDA: summary statistics and tabulations that describe the variables and the relationships between them.

**EDA Most Popular Languages:**
1. Python
2. R

### Steps in Completing a Data Analysis Project:
1. Introduction:
   - Context (Background)
   - Objectives
   - Data Description: Name of each variable and a short description about the variable.
2. Importing the needed libraries
3. Loading the dataset.
4. EDA (Univariate, Bivariate, Multivariate)
   - Basic data exploration
   - Advanced data exploration
   - Data Cleaning
   - Data Engineering
   - Summary of EDA
5. Data Visualization
6. Data Mining
   - If the data analysis is a descriptive one: data mining and answering the data mining questions.
   - If the data analysis is a predictive one: the data will be used in a more advanced step of machine learning or any other model-building process.
7. Data Reporting (presentation)

### 1. Introduction: To be completed by me.
    - Context
    - Objective
    - Data Description

### 2. Importing libraries

In [6]:
import pandas as pd
import numpy as np

### 3. Loading (Reading) the Dataset

In [8]:
# Let's load the dataset
df = pd.read_csv('movies.csv')

In [9]:
# creating a copy of the dataset
movies = df.copy()
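
Working on a copy keeps the frame that was read from disk untouched, so the raw data can always be recovered without reloading the file. A minimal sketch of that behavior, assuming a two-row stand-in frame (the names `original` and `working` are made up for the example):

```python
import pandas as pd

# Stand-in for the loaded dataset; values copied from the movies data.
original = pd.DataFrame({"Title": ["Avatar", "Titanic"], "Year": [2009, 1997]})

# copy() makes a deep copy by default, so edits do not flow back.
working = original.copy()
working.loc[0, "Year"] = 0

print(original.loc[0, "Year"])  # the original value is unchanged
print(working.loc[0, "Year"])
```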

### 4. EDA

##### Basic Data Exploration
**Data Attributes:**
1. shape: Total number of rows and columns
2. size: Total number of values.
3. ndim: Dimensionality
4. columns: Names
5. dtypes: Different types of Data within the dataset.
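
These five attributes can be tried on a tiny hand-built frame before running them on the full dataset. The sketch assumes a three-row sample copied from the top of the ranking; `sample` is a name chosen for illustration.

```python
import pandas as pd

# Three rows copied from the top of the ranking, for illustration only.
sample = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Title": ["Avengers: Endgame", "Avatar", "Titanic"],
    "Year": [2019, 2009, 1997],
})

print(sample.shape)          # (rows, columns)
print(sample.size)           # rows * columns
print(sample.ndim)           # number of axes
print(list(sample.columns))  # column names
print(sample.dtypes)         # one dtype per column
```

Note that these are attributes, not methods, so they are written without parentheses.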

In [12]:
movies.shape

(782, 5)

In [13]:
movies.size

3910

In [14]:
movies.ndim

2

In [15]:
movies.columns

Index(['Rank', 'Title', 'Studio', 'Gross', 'Year'], dtype='object')

In [16]:
movies.dtypes

Rank       int64
Title     object
Studio    object
Gross     object
Year       int64
dtype: object

In [17]:
# Let's ensure that dataset has been loaded correctly.
# shows the first 5 rows of the dataset
movies.head()

| | Rank | Title | Studio | Gross | Year |
|---|---|---|---|---|---|
| 0 | 1 | Avengers: Endgame | Buena Vista | $2,796.30 | 2019 |
| 1 | 2 | Avatar | Fox | $2,789.70 | 2009 |
| 2 | 3 | Titanic | Paramount | $2,187.50 | 1997 |
| 3 | 4 | Star Wars: The Force Awakens | Buena Vista | $2,068.20 | 2015 |
| 4 | 5 | Avengers: Infinity War | Buena Vista | $2,048.40 | 2018 |


In [18]:
# Let's look at the tail of the dataset
movies.tail()

| | Rank | Title | Studio | Gross | Year |
|---|---|---|---|---|---|
| 777 | 778 | Yogi Bear | Warner Brothers | $201.60 | 2010 |
| 778 | 779 | Garfield: The Movie | Fox | $200.80 | 2004 |
| 779 | 780 | Cats & Dogs | Warner Brothers | $200.70 | 2001 |
| 780 | 781 | The Hunt for Red October | Paramount | $200.50 | 1990 |
| 781 | 782 | Valkyrie | MGM | $200.30 | 2008 |


##### Advanced Data Exploration

In [20]:
# Let's get more information about the dataset.
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 782 entries, 0 to 781
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Rank    782 non-null    int64 
 1   Title   782 non-null    object
 2   Studio  782 non-null    object
 3   Gross   782 non-null    object
 4   Year    782 non-null    int64 
dtypes: int64(2), object(3)
memory usage: 30.7+ KB


**Observations:**

  - movies.tail() returns the last 5 rows of the dataset
  - movies.head() returns the first 5 rows of the dataset
  - movies.info() gives more information about the dataset: the column names and positions, the non-null counts, the data types, the total number of entries and the memory usage
  - movies.columns gives the names of the columns
  - movies.size gives the total number of values (782 rows × 5 columns = 3910)
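
The `info()` output lists `Gross` as `object`, so it is stored as text and cannot be summed or averaged as it stands. One possible cleanup, sketched here on three values taken from the outputs above, strips the `$` and `,` characters and any stray spaces before converting to float. The column name `Gross_num` is an assumption made for this example and is not part of the lab.

```python
import pandas as pd

# Three Gross values copied from the head and tail outputs.
sample = pd.DataFrame({"Gross": ["$2,796.30", "$2,789.70", "$201.60"]})

# Remove the currency symbol and thousands separator, then convert.
sample["Gross_num"] = (
    sample["Gross"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .str.strip()
    .astype(float)
)

print(sample.dtypes)
print(sample["Gross_num"].max())
```

Passing `regex=False` matters here because `$` has a special meaning in regular expressions.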

In [22]:
# let's examine the statistical summary of the numeric columns
# describe() reports the count, mean, standard deviation, min, quartiles and max
movies.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Rank | 782.0 | 391.5 | 225.888247 | 1.0 | 196.25 | 391.5 | 586.75 | 782.0 |
| Year | 782.0 | 2006.620205 | 10.026227 | 1939.0 | 2001.0 | 2009.0 | 2014.0 | 2019.0 |


**Observations:**

  - movies.describe() gives the count, mean, standard deviation, min, 25th percentile, 50th percentile, 75th percentile and max of each numeric column.
  - count is the number of non-null values, here 782 movies
  - min is the smallest value in the column (Rank 1, Year 1939)
  - max is the largest value in the column (Rank 782, Year 2019)
  - 50% is the median, the middle value of the sorted data
  - mean is the average
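
`describe()` reports neither the mode nor the range, so those two need their own calls. A short sketch, assuming a five-value Series built from the release years of the first five Universal titles in this dataset (`years` and `year_range` are names chosen for the example):

```python
import pandas as pd

# Release years of five Universal titles, copied from the dataset.
years = pd.Series([2015, 2015, 2018, 2017, 2015], name="Year")

summary = years.describe()               # count, mean, std, min, quartiles, max
most_common = years.mode()               # a Series, because ties are possible
year_range = years.max() - years.min()   # spread between newest and oldest

print(summary["mean"], summary["50%"])
print(most_common.tolist())
print(year_range)
```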

In [24]:
# Let's get the summary of the object variables
movies.describe(include='object').T

| | count | unique | top | freq |
|---|---|---|---|---|
| Title | 782 | 773 | Beauty and the Beast | 2 |
| Studio | 782 | 37 | Warner Brothers | 132 |
| Gross | 782 | 701 | $225.90 | 3 |


**Observations:**

  - movies is the DataFrame containing the movie data
  - .describe() generates descriptive statistics of the DataFrame
  - include='object' tells pandas to include only the columns with the object (string) data type
  - .T transposes the result, which swaps the rows and columns
  - include='int64' would restrict the summary to the integer columns
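
The effect of selecting columns by type can be checked on a small mixed-type frame. This sketch uses `include="number"` and `exclude="number"`, two other selectors that `describe()` accepts, rather than naming one specific dtype; the four rows are copied from the dataset and the variable names are assumptions.

```python
import pandas as pd

sample = pd.DataFrame({
    "Title": ["Avatar", "Titanic", "101 Dalmatians", "101 Dalmatians"],
    "Studio": ["Fox", "Paramount", "Buena Vista", "Buena Vista"],
    "Year": [2009, 1997, 1996, 1961],
})

# Numeric columns get count, mean, std, min, quartiles and max.
number_summary = sample.describe(include="number").T

# Everything else gets count, unique, top and freq.
text_summary = sample.describe(exclude="number").T

print(number_summary)
print(text_summary)
```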

In [26]:
# Assigning Title variable as the index
movies.set_index('Title', inplace=True)

In [27]:
movies.head()

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Avengers: Endgame | 1 | Buena Vista | $2,796.30 | 2019 |
| Avatar | 2 | Fox | $2,789.70 | 2009 |
| Titanic | 3 | Paramount | $2,187.50 | 1997 |
| Star Wars: The Force Awakens | 4 | Buena Vista | $2,068.20 | 2015 |
| Avengers: Infinity War | 5 | Buena Vista | $2,048.40 | 2018 |


In [28]:
movies.shape

(782, 4)
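
The column count drops from 5 to 4 because `Title` has moved out of the columns and into the index. A small sketch of that round trip, assuming a three-row stand-in frame; it uses the non-in-place form of `set_index`, which returns a new frame instead of modifying the existing one.

```python
import pandas as pd

sample = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Title": ["Avengers: Endgame", "Avatar", "Titanic"],
    "Year": [2019, 2009, 1997],
})

indexed = sample.set_index("Title")   # Title becomes the row label
restored = indexed.reset_index()      # Title returns as the first column

print(sample.shape, indexed.shape, restored.shape)
```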

In [29]:
movies.sort_index().head()

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| 10,000 B.C. | 536 | Warner Brothers | $269.80 | 2008 |
| 101 Dalmatians | 708 | Buena Vista | $215.90 | 1961 |
| 101 Dalmatians | 425 | Buena Vista | $320.70 | 1996 |
| 2 Fast 2 Furious | 632 | Universal | $236.40 | 2003 |
| 2012 | 93 | Sony | $769.70 | 2009 |


In [30]:
movies.sort_values('Year').head()

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Gone with the Wind | 288 | MGM | $402.40 | 1939 |
| Bambi | 540 | RKO | $267.40 | 1942 |
| 101 Dalmatians | 708 | Buena Vista | $215.90 | 1961 |
| The Jungle Book | 755 | Buena Vista | $205.80 | 1967 |
| The Godfather | 604 | Paramount | $245.10 | 1972 |


In [31]:
movies.sort_values('Year').tail()

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Men in Black International | 686 | Sony | $220.80 | 2019 |
| John Wick: Chapter 3 - Parabellum | 458 | Lionsgate | $304.70 | 2019 |
| Pokemon Detective Pikachu | 263 | Warner Brothers | $427.50 | 2019 |
| Dark Phoenix | 603 | Fox | $245.10 | 2019 |
| Avengers: Endgame | 1 | Buena Vista | $2,796.30 | 2019 |


**Counting the Values**
- using the value_counts method.

In [33]:
# sort_values sorts the rows by the values in a column
# value_counts counts how often each unique value occurs, most frequent first
# let's look at the unique studios in this dataset
studios = movies['Studio'].value_counts()
studios

Warner Brothers           132
Buena Vista               125
Fox                       117
Universal                 109
Sony                       86
Paramount                  76
Dreamworks                 27
Lionsgate                  21
New Line                   16
MGM                        11
TriStar                    11
Miramax                    10
Weinstein                   6
Columbia                    5
WGUSA                       4
Polygram                    2
Orion                       2
SonR                        2
Dimension                   2
Vestron                     1
USA                         1
Lions                       1
Focus                       1
Rela.                       1
CL                          1
Pathe                       1
Artisan                     1
IFC                         1
GrtIndia                    1
RKO                         1
UTV                         1
FUN                         1
FR                          1
Newmarket 

In [34]:
len(studios)

37

In [35]:
studios = movies['Studio'].value_counts().head()
studios

Warner Brothers    132
Buena Vista        125
Fox                117
Universal          109
Sony                86
Name: Studio, dtype: int64

In [36]:
type(studios)

pandas.core.series.Series
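
`value_counts` can also report shares instead of raw counts through its `normalize` parameter. The sketch below assumes a five-value Series built from the studios of the first five rows of the dataset; `studio`, `counts` and `shares` are names chosen for the example.

```python
import pandas as pd

# Studios of the five highest-ranked movies, copied from the dataset.
studio = pd.Series(
    ["Buena Vista", "Fox", "Paramount", "Buena Vista", "Buena Vista"],
    name="Studio",
)

counts = studio.value_counts()                 # occurrences per studio
shares = studio.value_counts(normalize=True)   # fraction of all rows

print(counts)
print(shares)
print(len(counts))  # number of distinct studios
```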

**Observations:**
- Add 3 points here.

### Data Mining Questions

In [39]:
# 1. What are the top 5 studios in this dataset?
studios.head()

Warner Brothers    132
Buena Vista        125
Fox                117
Universal          109
Sony                86
Name: Studio, dtype: int64

### Data Indexers
- There are 2 types of indexers:
1. iloc indexer: position-based (implicit) indexing. It is integer-based and refers to the position of the row (i stands for integer, loc stands for location). Positions start at 0, so position 300 is the 301st row. Use it to select by row number.
2. loc indexer: label-based (explicit) indexing. It uses the user-defined index label, here the movie title, to locate the row. Use it to select by label.
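
The difference between the two indexers shows up clearly on a small frame indexed by title. This is a sketch on a three-row sample copied from the dataset, with variable names chosen for illustration:

```python
import pandas as pd

sample = pd.DataFrame(
    {"Rank": [1, 2, 3], "Year": [2019, 2009, 1997]},
    index=["Avengers: Endgame", "Avatar", "Titanic"],
)

by_position = sample.iloc[1]      # second row, counting from 0
by_label = sample.loc["Avatar"]   # row whose index label is "Avatar"

# Both lookups reach the same row by different routes.
print(by_position.equals(by_label))
```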

In [41]:
# 2. What is the movie sitting in position 300?
# positions start at 0, so this is the 301st row
movies.iloc[300]
# DataFrame name, the indexer, the position

Rank           301
Studio         Fox
Gross     $389.70 
Year          2016
Name: Independence Day: Resurgence, dtype: object

In [42]:
# 3. What is the information about the movie "Forrest Gump" ? To find out we will use loc indexer
movies.loc['Forrest Gump']

Rank            119
Studio    Paramount
Gross      $677.90 
Year           1994
Name: Forrest Gump, dtype: object

In [43]:
# 4. Is there only one version of the movie "101 Dalmatians"?
movies.loc['101 Dalmatians']

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| 101 Dalmatians | 425 | Buena Vista | $320.70 | 1996 |
| 101 Dalmatians | 708 | Buena Vista | $215.90 | 1961 |
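
Because titles are not unique, `loc` returns a single row as a Series for some labels and several rows as a DataFrame for others. A sketch of how repeated labels could be detected up front, assuming a three-row sample copied from the dataset:

```python
import pandas as pd

sample = pd.DataFrame(
    {"Rank": [425, 708, 119], "Year": [1996, 1961, 1994]},
    index=["101 Dalmatians", "101 Dalmatians", "Forrest Gump"],
)

# False when at least one label appears more than once.
print(sample.index.is_unique)

# Labels that occur more than once.
repeated = sample.index[sample.index.duplicated(keep=False)].unique()
print(list(repeated))

single = sample.loc["Forrest Gump"]    # one match: a Series
both = sample.loc["101 Dalmatians"]    # two matches: a DataFrame
print(type(single).__name__, type(both).__name__)
```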


In [44]:
# 5. What are the movies created by Universal Studios?
# let's use Boolean indexing:
# the inner expression builds a True/False mask, and the outer brackets keep the rows where it is True
# = is assignment; == compares each value in the column with 'Universal'
universal = movies[movies['Studio'] == 'Universal']
universal

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Jurassic World | 6 | Universal | $1,671.70 | 2015 |
| Furious 7 | 8 | Universal | $1,516.00 | 2015 |
| Jurassic World: Fallen Kingdom | 13 | Universal | $1,309.50 | 2018 |
| The Fate of the Furious | 17 | Universal | $1,236.00 | 2017 |
| Minions | 19 | Universal | $1,159.40 | 2015 |
| ... | ... | ... | ... | ... |
| The Break-Up | 763 | Universal | $205.00 | 2006 |
| Everest | 766 | Universal | $203.40 | 2015 |
| Patch Adams | 772 | Universal | $202.30 | 1998 |
| Kindergarten Cop | 775 | Universal | $202.00 | 1990 |


In [45]:
# 6. What are the top 5 movies by Universal Studios?
universal.head()

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Jurassic World | 6 | Universal | $1,671.70 | 2015 |
| Furious 7 | 8 | Universal | $1,516.00 | 2015 |
| Jurassic World: Fallen Kingdom | 13 | Universal | $1,309.50 | 2018 |
| The Fate of the Furious | 17 | Universal | $1,236.00 | 2017 |
| Minions | 19 | Universal | $1,159.40 | 2015 |


In [46]:
# 7. What are the movies released in 2015? (all studios, since the filter is on Year only)
released_2015 = movies[movies['Year'] == 2015]
released_2015

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Star Wars: The Force Awakens | 4 | Buena Vista | $2,068.20 | 2015 |
| Jurassic World | 6 | Universal | $1,671.70 | 2015 |
| Furious 7 | 8 | Universal | $1,516.00 | 2015 |
| Avengers: Age of Ultron | 9 | Buena Vista | $1,405.40 | 2015 |
| Minions | 19 | Universal | $1,159.40 | 2015 |
| Spectre | 58 | Sony | $880.70 | 2015 |
| Inside Out | 69 | Buena Vista | $857.60 | 2015 |
| Mission: Impossible - Rogue Nation | 118 | Paramount | $682.70 | 2015 |
| The Hunger Games: Mockingjay - Part 2 | 130 | Lionsgate | $653.40 | 2015 |
| The Martian | 137 | Fox | $630.20 | 2015 |
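
The result above contains every studio, because the filter tests `Year` only. To narrow it to a single studio such as Universal, a second condition can be combined with `&`. A sketch on six rows copied from the dataset; `universal_2015` is a name chosen for the example.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Studio": ["Buena Vista", "Universal", "Universal",
                   "Buena Vista", "Universal", "Universal"],
        "Year": [2015, 2015, 2015, 2015, 2018, 2015],
    },
    index=["Star Wars: The Force Awakens", "Jurassic World", "Furious 7",
           "Avengers: Age of Ultron", "Jurassic World: Fallen Kingdom", "Minions"],
)

# Each condition needs its own parentheses because & binds tighter than ==.
universal_2015 = sample[(sample["Studio"] == "Universal") & (sample["Year"] == 2015)]
print(universal_2015)
```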


In [47]:
# 8. What are the top 5 movies in 2015?
released_2015.head()

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Star Wars: The Force Awakens | 4 | Buena Vista | $2,068.20 | 2015 |
| Jurassic World | 6 | Universal | $1,671.70 | 2015 |
| Furious 7 | 8 | Universal | $1,516.00 | 2015 |
| Avengers: Age of Ultron | 9 | Buena Vista | $1,405.40 | 2015 |
| Minions | 19 | Universal | $1,159.40 | 2015 |


In [48]:
released_2015[:5]

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Star Wars: The Force Awakens | 4 | Buena Vista | $2,068.20 | 2015 |
| Jurassic World | 6 | Universal | $1,671.70 | 2015 |
| Furious 7 | 8 | Universal | $1,516.00 | 2015 |
| Avengers: Age of Ultron | 9 | Buena Vista | $1,405.40 | 2015 |
| Minions | 19 | Universal | $1,159.40 | 2015 |


In [49]:
# 9. What are the movies released by Fox Studios in 2006?
fox_2006 = movies[(movies['Studio'] == 'Fox') & (movies['Year'] == 2006)]
fox_2006

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Ice Age: The Meltdown | 125 | Fox | $660.90 | 2006 |
| Night at the Museum | 164 | Fox | $574.50 | 2006 |
| X-Men: The Last Stand | 238 | Fox | $459.40 | 2006 |
| The Devil Wears Prada | 415 | Fox | $326.60 | 2006 |
| Borat: Cultural Learnings of America for Make Benefit Glorious Nation of Kazakhstan | 554 | Fox | $261.60 | 2006 |
| Eragon | 586 | Fox | $249.50 | 2006 |


In [50]:
# 10. What are the top 3 movies released by Fox in 2006?
fox_2006[:3]
# or
fox_2006.head(3)

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Ice Age: The Meltdown | 125 | Fox | $660.90 | 2006 |
| Night at the Museum | 164 | Fox | $574.50 | 2006 |
| X-Men: The Last Stand | 238 | Fox | $459.40 | 2006 |
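
`head(3)` and `[:3]` return the first three rows in whatever order the frame happens to be in; they give the top three here only because the rows already appear in `Rank` order. A sketch of an order-independent alternative using `nsmallest`, on four Fox 2006 rows that have been shuffled on purpose (`top3` is a name chosen for the example):

```python
import pandas as pd

# Four Fox 2006 titles from the dataset, deliberately out of rank order.
fox_2006 = pd.DataFrame(
    {"Rank": [586, 125, 415, 164]},
    index=["Eragon", "Ice Age: The Meltdown",
           "The Devil Wears Prada", "Night at the Museum"],
)

# The best movies have the smallest Rank values.
top3 = fox_2006.nsmallest(3, "Rank")
print(top3)
```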


### Conclusion:
1. Ice Age: The Meltdown was ranked 125.
2. Night at the Museum was ranked 164.
3. fox_2006 holds the result of movies[(movies['Studio'] == 'Fox') & (movies['Year'] == 2006)].
4. released_2015 holds the result of movies[movies['Year'] == 2015].
5. The iloc indexer is position-based (implicit) indexing. It is integer-based and refers to the position of the row (i stands for integer, loc stands for location). Positions start at 0. Use it to select by row number.
6. The loc indexer is label-based (explicit) indexing. It uses the user-defined index label to locate the row. Use it to select by label.
7. In movies.loc['Forrest Gump'], movies is the data, loc is the indexer and 'Forrest Gump' is the title label being looked up.
8. len(studios) gives the number of distinct studios in the data (37 on the full value_counts result). len() counts the entries of any Series.
9. sort_values sorts the rows by the values in a column.
10. df = pd.read_csv('movies.csv') is how we load the dataset.
11. loc looks rows up by their text label. It matches the full label, not pieces of it.
12. .head() works on any DataFrame, including filtered subsets such as universal.head().
13. value_counts counts how often each unique value occurs, most frequent first.
14. Much of the work of finding specific data comes down to applying established building blocks, i.e. the methods above.
15. shape is the total number of rows and columns.
16. size is the total number of values.
17. ndim is the dimensionality (the number of axes, 2 for a DataFrame).
18. columns gives the names of the columns in this dataset.
19. dtypes gives the data type of each column in the dataset.
20. To build more complicated queries, we must combine simpler ones, as in question 9.
21. movies.tail() returns the last 5 rows of the dataset.
22. movies.head() returns the first 5 rows of the dataset.
23. .describe() generates descriptive statistics of the DataFrame.
24. include='object' tells pandas to include only columns with the object (string) data type.
25. .T transposes the result, which swaps the rows and columns.
26. include='int64' would restrict the summary to the integer columns.
27. movies.info() gives more information about the dataset. This includes the data types, the column names and positions, and the total number of entries.

### Observations
- 1 fox_2006[:3] and fox_2006.head(3) return the same rows.
- 2 The same steps can be applied to other datasets to draw out information as part of EDA.
- 3 = is assignment; == is a comparison against the values already in the dataset. Comparisons can be combined with & to build more specific filters.
- 4 Data indexing is very useful when searching for a specific value within your dataset.


### End of Lab 06