# Pandas

## DataFrame and Series
The pandas library has two primary containers of data, the DataFrame and the Series. You will spend nearly all your time working with these two objects when you use pandas.

![](images/01_dataframe_anatomy.png)

At first glance, the DataFrame looks like any other two-dimensional table of data that you have seen. It has rows and it has columns. Technically, there are three main components of the DataFrame.

### The three components of a DataFrame
A DataFrame is composed of three components: the **index**, the **columns**, and the **data**. You must be aware of all three in order to use the DataFrame to its full potential. The data is also known as the **values**.


### Each row has a label and each column has a label
The main takeaway from the DataFrame anatomy is that each row has a label and each column has a label. These labels are used to refer to specific rows or columns in the DataFrame.

The Index component of the **Series** and **DataFrame** is what separates pandas from most other data analysis libraries and is the key to understanding how many operations work.
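As a minimal sketch of what "labels" means in practice, the toy DataFrame below (hypothetical data, not the movie dataset) uses movie titles as row labels and then looks up a single value by its row label and column label with `.loc`.

```python
import pandas as pd

# Hypothetical toy data: the row labels are titles instead of 0, 1, 2.
toy = pd.DataFrame(
    {"duration": [178, 169, 148], "imdb_score": [7.9, 7.1, 6.8]},
    index=["Avatar", "Pirates", "Spectre"],
)

# Refer to one cell by its row label and its column label.
spectre_score = toy.loc["Spectre", "imdb_score"]
print(spectre_score)

# The labels themselves are stored in the index and columns components.
print(list(toy.index))
print(list(toy.columns))
```

Label-based selection with `.loc` is only previewed here to show why the labels matter.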

In [0]:
from google.colab import drive
import os
drive.mount('/content/drive')
#os.chdir('drive/My Drive/courses/FML/2. Data Science/1. Data Analysis/2. Pandas')
!ls

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
'1. pandas_intro.ipynb'
'2. pandas series.ipynb'
'3. Pandas DataFrame.ipynb'
'4. Selecting Subsets with [ ], .loc and .iloc.ipynb'
'5. Boolean Indexing.ipynb'
'6. Assigning subsets of data.ipynb'
'7. Other Important concepts in Pandas.ipynb'
'8. Groupby.ipynb'
'capstone projects'
 data
 images
'pandas from numpy.pptx'
'Pandas Solutions(Part 4-6).ipynb'
'samples codes'


In [0]:
import pandas as pd
import numpy as np

In [0]:
movie = pd.read_csv('data/movie.csv')
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0


In [0]:
index = movie.index
columns = movie.columns
values = movie.values

In [0]:
index

RangeIndex(start=0, stop=4916, step=1)

In [0]:
columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [0]:
values

array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
       ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
       ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
       ...,
       ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
       ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
       ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)

In [0]:
print(type(index))
print(type(columns))
print(type(values))

<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.indexes.base.Index'>
<class 'numpy.ndarray'>


In [0]:
print(type(index.values))
print(type(columns.values))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


#### Note:
The **head** method accepts a single parameter, n, which controls the number of rows displayed (5 by default). Similarly, the **tail** method returns the last n rows.
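A short sketch of the n parameter, using a hypothetical ten-row DataFrame rather than the movie data:

```python
import pandas as pd

# Hypothetical frame with rows labelled 0 through 9.
toy = pd.DataFrame({"x": range(10)})

first_three = toy.head(3)   # rows labelled 0, 1, 2
last_two = toy.tail(2)      # rows labelled 8, 9
default_head = toy.head()   # no argument: n defaults to 5

print(first_three)
print(last_two)
print(len(default_head))
```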

### Dtypes
The **dtypes** attribute displays the data type of each column in a DataFrame. It is crucial to know the type of data held in each column, as it fundamentally changes the kind of operations that are possible with it.


In [0]:
movie.dtypes

color                         object
director_name                 object
num_critic_for_reviews       float64
duration                     float64
director_facebook_likes      float64
actor_3_facebook_likes       float64
actor_2_name                  object
actor_1_facebook_likes       float64
gross                        float64
genres                        object
actor_1_name                  object
movie_title                   object
num_voted_users                int64
cast_total_facebook_likes      int64
actor_3_name                  object
facenumber_in_poster         float64
plot_keywords                 object
movie_imdb_link               object
num_user_for_reviews         float64
language                      object
country                       object
content_rating                object
budget                       float64
title_year                   float64
actor_2_facebook_likes       float64
imdb_score                   float64
aspect_ratio                 float64
m

In [0]:
movie.get_dtype_counts()  # a method, not an attribute; removed in pandas 1.0 (use movie.dtypes.value_counts() instead)

float64    13
int64       3
object     12
dtype: int64
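On recent pandas versions, a similar count of columns per data type can be obtained by chaining `value_counts` onto `dtypes`. The sketch below uses a hypothetical four-column DataFrame. The order of the result (most frequent type first) may differ from the output above.

```python
import pandas as pd

# Hypothetical toy frame: one text column, two float columns, one integer column.
toy = pd.DataFrame({
    "title": ["Avatar", "Spectre"],
    "duration": [178.0, 148.0],
    "imdb_score": [7.9, 6.8],
    "num_voted_users": [886204, 275868],
})

# dtypes is a Series of data types, so value_counts tallies each type.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```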

### Selecting a single column (Series)
A Series is a single column of data from a DataFrame. It is a single dimension of data, composed of just an index and the data.

Python's built-in lists, tuples, and dictionaries all use the **indexing operator** (`[]`) to select their data. DataFrames are more powerful and complex containers of data, but they too use the indexing operator as the primary means to select data. Passing a single string to the DataFrame indexing operator returns a Series.
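To make the parallel concrete, here is a small sketch that applies the indexing operator to each built-in container and then to a hypothetical two-row DataFrame:

```python
import pandas as pd

a_list = [10, 20, 30]
a_tuple = (10, 20, 30)
a_dict = {"a": 10, "b": 20}

# Lists and tuples are indexed by position, dictionaries by key.
print(a_list[0], a_tuple[1], a_dict["b"])

# Hypothetical toy DataFrame: a single string selects one column as a Series.
toy = pd.DataFrame({
    "director_name": ["James Cameron", "Sam Mendes"],
    "duration": [178, 148],
})
col = toy["director_name"]
print(type(col))
```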

The visual output of the Series is less stylized than that of the DataFrame. It represents a single column of data. Along with the index and values, the output displays the name, length, and data type of the Series.

In [0]:
director = movie['director_name']  # a single column name in [] returns a Series
director

0            James Cameron
1           Gore Verbinski
2               Sam Mendes
3        Christopher Nolan
4              Doug Walker
5           Andrew Stanton
6                Sam Raimi
7             Nathan Greno
8              Joss Whedon
9              David Yates
10             Zack Snyder
11            Bryan Singer
12            Marc Forster
13          Gore Verbinski
14          Gore Verbinski
15             Zack Snyder
16          Andrew Adamson
17             Joss Whedon
18            Rob Marshall
19        Barry Sonnenfeld
20           Peter Jackson
21               Marc Webb
22            Ridley Scott
23           Peter Jackson
24             Chris Weitz
25           Peter Jackson
26           James Cameron
27           Anthony Russo
28              Peter Berg
29         Colin Trevorrow
               ...        
4886            Eric Eason
4887              Uwe Boll
4888     Richard Linklater
4889       Joseph Mazzella
4890          Travis Legge
4891         Alex Kendrick
4

In [0]:
type(director)

pandas.core.series.Series

#### Note:
The old column name is now the name of the Series and is available through its **name** attribute.

In [0]:
director.name

'director_name'

#### Note:
It is possible to turn this Series into a one-column DataFrame with the to_frame method. This method will use the Series name as the new column name.

In [0]:
director.to_frame()

Unnamed: 0,director_name
0,James Cameron
1,Gore Verbinski
2,Sam Mendes
3,Christopher Nolan
4,Doug Walker
5,Andrew Stanton
6,Sam Raimi
7,Nathan Greno
8,Joss Whedon
9,David Yates


### Series Methods
Series methods are the primary way to use the abilities that a Series offers. We will learn quite a few Series methods now.

<span style="color:brown">Note:</span> You are not supposed to understand everything on the first pass. Just focus on understanding what you can do with pandas Series and DataFrames. You can always come back and look up the syntax (I still do, even after using pandas for more than 3 years).

In [0]:
director.head()

0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object

In [0]:
director.value_counts()

Steven Spielberg        26
Woody Allen             22
Martin Scorsese         20
Clint Eastwood          20
Ridley Scott            16
Spike Lee               16
Steven Soderbergh       15
Renny Harlin            15
Tim Burton              14
Oliver Stone            14
Ron Howard              13
Barry Levinson          13
Joel Schumacher         13
Robert Zemeckis         13
Robert Rodriguez        13
Brian De Palma          12
Tony Scott              12
Kevin Smith             12
Michael Bay             12
Rob Reiner              11
Chris Columbus          11
Shawn Levy              11
Richard Linklater       11
Richard Donner          11
Francis Ford Coppola    11
Sam Raimi               11
John McTiernan          10
Bobby Farrelly          10
Paul W.S. Anderson      10
David Fincher           10
                        ..
Joe Pytka                1
Pat Holden               1
Jim Jarmusch             1
Nick Gomez               1
Jonathan Newman          1
Ash Baron-Cohen          1
R

The number of elements in the Series can be found with the **size** or **shape** attribute or with the built-in **len** function.

In [0]:
director.size

4916

In [0]:
director.shape

(4916,)

In [0]:
len(director)

4916

The **count** method returns the number of non-missing values, which is not necessarily the same as **size** or **len**.

In [0]:
director.count()

4814

Since the **count** method returned a value less than the total number of Series elements, we know that there are missing values. The **isnull** method may be used to determine whether each individual value is missing or not. The result will be a Series of booleans the same length as the original Series.

In [0]:
director.isnull()

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
4886    False
4887    False
4888    False
4889    False
4890    False
4891    False
4892    False
4893    False
4894    False
4895    False
4896    False
4897    False
4898    False
4899    False
4900    False
4901    False
4902    False
4903    False
4904    False
4905    False
4906    False
4907    False
4908    False
4909    False
4910    False
4911    False
4912     True
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

In [0]:
director.isnull().sum()

102

In [0]:
director.size - director.isnull().sum() ## same as count method

4814

#### Note:
There exists a complement of isnull: the **notnull** method, which returns True for all the non-missing values.
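A minimal sketch of the two complementary methods on a hypothetical three-element Series with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing entry.
s = pd.Series(["James Cameron", np.nan, "Sam Mendes"])

missing = s.isnull()    # True where the value is missing
present = s.notnull()   # True where the value is present

print(missing.tolist())
print(present.tolist())

# Summing the notnull booleans gives the same number as count.
print(present.sum(), s.count())
```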

#### Selecting 'actor_1_facebook_likes'

In [0]:
actor_1_fb_likes = movie['actor_1_facebook_likes']
actor_1_fb_likes.head()

0     1000.0
1    40000.0
2    11000.0
3    27000.0
4      131.0
Name: actor_1_facebook_likes, dtype: float64

Basic summary statistics can be computed with the *min*, *max*, *mean*, *median*, *std* and *sum* methods.

In [0]:
actor_1_fb_likes.min(), \
actor_1_fb_likes.max(), \
actor_1_fb_likes.mean(), \
actor_1_fb_likes.median(), \
actor_1_fb_likes.std(), \
actor_1_fb_likes.sum()

(0.0, 640000.0, 6494.488490527602, 982.0, 15106.986883848309, 31881444.0)

You may use the *describe* method to return both the summary statistics and a few of the quantiles at once.

In [0]:
actor_1_fb_likes.describe()

count      4909.000000
mean       6494.488491
std       15106.986884
min           0.000000
25%         607.000000
50%         982.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64

In [0]:
actor_1_fb_likes.quantile(0.5)

982.0

To learn more about percentiles and quantiles, watch this video.


In [0]:
from IPython.display import YouTubeVideo

YouTubeVideo("IFKQLDmRK0Y")

In [0]:
actor_1_fb_likes.quantile([0.2,0.5,0.7])

0.2     510.0
0.5     982.0
0.7    8000.0
Name: actor_1_facebook_likes, dtype: float64
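The relationship between *quantile* and *median* can be checked on a small hypothetical Series. Passing a single number returns a scalar, while passing a list returns a Series indexed by the requested quantiles. This sketch relies on the default linear interpolation.

```python
import pandas as pd

# Hypothetical data: five evenly spaced values.
s = pd.Series([1, 2, 3, 4, 5])

# The 0.5 quantile is the median.
median_via_quantile = s.quantile(0.5)
print(median_via_quantile, s.median())

# A list of quantiles returns a Series labelled by those quantiles.
quartiles = s.quantile([0.25, 0.5, 0.75])
print(quartiles)
```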

In [0]:
actor_1_fb_likes.count()

4909

In [0]:
actor_1_fb_likes.size

4916

It is possible to **replace** all missing values within a Series with the **fillna** method.

In [0]:
actor_1_fb_likes_filled = actor_1_fb_likes.fillna(0)
actor_1_fb_likes_filled.count()

4916

To **remove** the missing values (instead of replacing them), use the **dropna** method.

In [0]:
actor_1_fb_likes_dropped = actor_1_fb_likes.dropna()
actor_1_fb_likes_dropped.count()

4909

We determined that there were missing values in the Series by observing that the result from the count method did not match the size attribute. A more direct approach is to use the **hasnans** attribute:

In [0]:
director.hasnans

True

In [0]:
actor_1_fb_likes.hasnans

True

In [0]:
actor_1_fb_likes_filled.hasnans

False

#### Selecting 'imdb_score' column

In [0]:
imdb_score = movie['imdb_score']
imdb_score.head()

0    7.9
1    7.1
2    6.8
3    8.5
4    7.1
Name: imdb_score, dtype: float64

### Operators

Each operator used below applies the same operation to every element in the Series. In native Python, this would require a for-loop to iterate through each item in the sequence and apply the operation. Pandas relies heavily on the NumPy library, which allows for vectorized computations: the ability to operate on entire sequences of data without writing explicit for-loops. Each operation returns a Series with the same index, but with values that have been modified by the operator.
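The contrast between an explicit loop and a vectorized operation can be sketched on a hypothetical three-element Series:

```python
import pandas as pd

# Hypothetical scores, standing in for a column such as imdb_score.
scores = pd.Series([7.9, 7.1, 6.8])

# Native Python style: visit each element and apply the operation.
looped = []
for score in scores:
    looped.append(score + 1)

# Vectorized style: one expression operates on the whole Series.
vectorized = scores + 1

print(looped)
print(vectorized)
```

Both approaches produce the same numbers, but the vectorized result is a Series that keeps the original index.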

In [0]:
imdb_score + 1  # the operators +, -, *, /, //, % and ** are all allowed

0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
5       7.6
6       7.2
7       8.8
8       8.5
9       8.5
10      7.9
11      7.1
12      7.7
13      8.3
14      7.5
15      8.2
16      7.6
17      9.1
18      7.7
19      7.8
20      8.5
21      8.0
22      7.7
23      8.9
24      7.1
25      8.2
26      8.7
27      9.2
28      6.9
29      8.0
       ... 
4886    8.0
4887    7.3
4888    8.1
4889    5.8
4890    4.3
4891    7.9
4892    5.6
4893    4.0
4894    7.6
4895    8.4
4896    7.2
4897    5.0
4898    7.1
4899    7.9
4900    8.5
4901    7.7
4902    8.4
4903    7.1
4904    6.4
4905    7.4
4906    8.0
4907    7.3
4908    7.9
4909    8.8
4910    7.4
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64

In [0]:
imdb_score > 7  # comparison operators are also allowed; they return a boolean Series

0        True
1        True
2       False
3        True
4        True
5       False
6       False
7        True
8        True
9        True
10      False
11      False
12      False
13       True
14      False
15       True
16      False
17       True
18      False
19      False
20       True
21      False
22      False
23       True
24      False
25       True
26       True
27       True
28      False
29      False
        ...  
4886    False
4887    False
4888     True
4889    False
4890    False
4891    False
4892    False
4893    False
4894    False
4895     True
4896    False
4897    False
4898    False
4899    False
4900     True
4901    False
4902     True
4903    False
4904    False
4905    False
4906    False
4907    False
4908    False
4909     True
4910    False
4911     True
4912     True
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool

In [0]:
director == 'James Cameron'

0        True
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26       True
27      False
28      False
29      False
        ...  
4886    False
4887    False
4888    False
4889    False
4890    False
4891    False
4892    False
4893    False
4894    False
4895    False
4896    False
4897    False
4898    False
4899    False
4900    False
4901    False
4902    False
4903    False
4904    False
4905    False
4906    False
4907    False
4908    False
4909    False
4910    False
4911    False
4912    False
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

### Chaining Series methods
In Python, every variable is an object, and all objects have attributes and methods that refer to or return more objects. The sequential invocation of methods using the dot notation is referred to as method chaining. Pandas is a library that lends itself well to **method chaining**, as many Series and DataFrame methods return more Series and DataFrames, upon which more methods can be called.

Each method in a chain does not have to return the same type of object: a chain can pass through Series and end with a single number.
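To see how the object type changes along a chain, the sketch below breaks a chain into its intermediate steps on a hypothetical Series of director initials:

```python
import pandas as pd

# Hypothetical data: "A" appears 3 times, "B" twice, "C" once.
directors = pd.Series(["A", "B", "A", "C", "A", "B"])

counts = directors.value_counts()   # a Series of counts, most frequent first
top_two = counts.head(2)            # still a Series, now with two rows
total = top_two.sum()               # a single number

# The same computation written as one chain.
chained = directors.value_counts().head(2).sum()

print(type(counts).__name__, type(top_two).__name__, total, chained)
```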

In [0]:
director.value_counts().head().sum()

104

In [0]:
actor_1_fb_likes.isnull().sum() 

7

Instead of summing the booleans to find the total number of missing values, we can take the **mean** of the boolean Series to get the fraction of values that are missing (multiply by 100 for a percentage).

In [0]:
actor_1_fb_likes.isnull().mean()

0.0014239218877135883
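Why the mean works: True counts as 1 and False as 0, so the mean of a boolean Series is the share of True values. A minimal sketch on a hypothetical Series in which half the values are missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the four values are missing.
likes = pd.Series([1000.0, np.nan, 131.0, np.nan])

n_missing = likes.isnull().sum()       # number of missing values
frac_missing = likes.isnull().mean()   # fraction of missing values
pct_missing = frac_missing * 100       # the same share as a percentage

print(n_missing, frac_missing, pct_missing)
```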

## <span style="color:red">Note</span>
All the non-missing values of *actor_1_fb_likes* should be integers, as it is impossible to have a partial Facebook like. With pandas' default NumPy-backed data types, a numeric column that contains **missing values** is stored as **float**, because the missing-value marker NaN is itself a float. If we fill the missing values in *actor_1_fb_likes* with zeros, we can then convert it to an integer with the **astype** method.

In [0]:
actor_1_fb_likes.dtype

dtype('float64')

In [0]:
actor_1_fb_likes.astype(int)  # raises ValueError: NaN cannot be converted to an integer

ValueError: ignored

In [0]:
actor_1_fb_likes.fillna(0).astype(int).head()
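
The fill-then-convert pattern can be sketched on a hypothetical three-element Series. As an alternative not used in this notebook, recent pandas versions also offer a nullable integer type, `"Int64"` (capital I), which keeps the missing value instead of replacing it. The sketch assumes a pandas version that supports it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: whole-number likes stored as float because of the NaN.
likes = pd.Series([1000.0, np.nan, 131.0])
print(likes.dtype)

# Replace the missing value with 0, then convert to a plain integer type.
as_int = likes.fillna(0).astype(int)
print(as_int.tolist())

# Assumption: the pandas version supports the nullable "Int64" extension type,
# which stores integers and still marks the missing entry as missing.
nullable = likes.astype("Int64")
print(nullable.dtype, nullable.isna().sum())
```

Filling with zero changes the data (a missing count becomes a real 0), so which approach is appropriate depends on what the missing values mean.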