# Chapter 1: Pandas Foundations

## Introduction  

The goal of this chapter is to introduce a foundation of pandas by thoroughly inspecting the Series and DataFrame data structures. It is vital for pandas users to know each component of the Series and the DataFrame, and to understand that each column of data in pandas holds precisely one data type.

In this chapter, you will learn how to select a single column of data from a DataFrame, which is returned as a Series. Working with this one-dimensional object makes it easy to show how different methods and operators work. Many Series methods return another Series as output. This leads to the possibility of calling further methods in succession, which is known as method chaining.

The Index component of the Series and DataFrame is what separates pandas from most other data analysis libraries and is the key to understanding how many operations work. We will get a glimpse of this powerful object when we use it as a meaningful label for Series values. The final two recipes contain simple tasks that frequently occur during a data analysis.


## Recipes
* [Dissecting the anatomy of a DataFrame](#Dissecting-the-anatomy-of-a-DataFrame)
* [Accessing the main DataFrame components](#Accessing-the-main-DataFrame-components)
* [Understanding data types](#Understanding-data-types)
* [Selecting a single column of data as a Series](#Selecting-a-single-column-of-data-as-a-Series)
* [Calling Series methods](#Calling-Series-methods)
* [Working with operators on a Series](#Working-with-operators-on-a-Series)
* [Chaining Series methods together](#Chaining-Series-methods-together)
* [Making the index meaningful](#Making-the-index-meaningful)
* [Renaming row and column names](#Renaming-row-and-column-names)
* [Creating and deleting columns](#Creating-and-deleting-columns)

In [27]:
%matplotlib inline
import pandas as pd
import numpy as np

print(pd.get_option('display.max_rows'))  # 60 by default; prints the current setting
pd.set_option('display.max_rows', 500)

500


# Dissecting the anatomy of a DataFrame

Before diving deep into pandas, it is worth knowing the components of the DataFrame. Visually, the displayed output of a pandas DataFrame (in a Jupyter Notebook) appears to be nothing more than an ordinary table of data consisting of rows and columns. Beneath the surface are three components, the index, the columns, and the data (also known as values), that you must be aware of in order to use the DataFrame to its full potential.


#### Change options to get specific output for book

In [28]:
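# Full option names avoid ambiguity with other options that end in max_columns / max_rows
pd.set_option('display.max_columns', 8, 'display.max_rows', 10)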

In [29]:
movie = pd.read_csv('data/movie.csv')
movie.head()

| | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Color | James Cameron | 723.0 | 178.0 | ... | 936.0 | 7.9 | 1.78 | 33000 |
| 1 | Color | Gore Verbinski | 302.0 | 169.0 | ... | 5000.0 | 7.1 | 2.35 | 0 |
| 2 | Color | Sam Mendes | 602.0 | 148.0 | ... | 393.0 | 6.8 | 2.35 | 85000 |
| 3 | Color | Christopher Nolan | 813.0 | 164.0 | ... | 23000.0 | 8.5 | 2.35 | 164000 |
| 4 | NaN | Doug Walker | NaN | NaN | ... | 12.0 | 7.1 | NaN | 0 |


![dataframe anatomy](./images/ch01_dataframe_anatomy.png)

# Accessing the main DataFrame components

Each of the three DataFrame components--the index, columns, and data--may be accessed directly from a DataFrame. Each of these components is itself a Python object with its own unique attributes and methods. It will often be the case that you would like to perform operations on the individual components and not on the DataFrame as a whole.


In [30]:
columns = movie.columns
index = movie.index
data = movie.values

In [31]:
columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [32]:
index

RangeIndex(start=0, stop=4916, step=1)

In [33]:
data

array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
       ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
       ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
       ...,
       ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
       ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
       ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)

In [34]:
type(index)

pandas.core.indexes.range.RangeIndex

In [35]:
type(columns)

pandas.core.indexes.base.Index

In [36]:
type(data)

numpy.ndarray

In [37]:
issubclass(pd.RangeIndex, pd.Index)

True

## How it works...

You may access the three main components of a DataFrame with the index, columns, and values attributes. The output of the columns attribute appears to be just a sequence of the column names. This sequence of column names is technically an Index object. The output of the type function is the fully qualified class name of the object.

The fully qualified class name of the object for the variable columns is pandas.core.indexes.base.Index. It begins with the package name, which is followed by a path of modules and ends with the name of the type. A common way of referring to objects is to include the package name followed by the name of the object type. In this instance, we would refer to the columns as a pandas Index object.  

The built-in issubclass function checks whether the first argument inherits from the second. The Index and RangeIndex objects are very similar, and in fact, pandas has a number of similar objects reserved specifically for either the index or the columns. The index and the columns must both be some kind of Index object. Essentially, the index and the columns represent the same thing, but along different axes. They are occasionally referred to as the row index and column index.

In this context, the Index objects refer to all the possible objects that can be used for the index or columns. They are all subclasses of pd.Index. In the pandas version used here, the complete list of Index objects is: CategoricalIndex, MultiIndex, IntervalIndex, Int64Index, UInt64Index, Float64Index, RangeIndex, TimedeltaIndex, DatetimeIndex, and PeriodIndex. (Int64Index, UInt64Index, and Float64Index were removed in pandas 2.0, where a plain Index holds numeric data.)
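
As a small illustration of the point that both axes hold Index objects, the following sketch builds a toy DataFrame and checks the type of each axis. The variable names and the data are made up for the example and are not taken from movie.csv.

```python
import pandas as pd

# Toy data; the names and values here are illustrative only.
toy = pd.DataFrame({'title': ['A', 'B', 'C'], 'score': [7.9, 7.1, 6.8]})

# Both axes are Index objects; the default row index is the RangeIndex subclass.
index_is_index = isinstance(toy.index, pd.Index)
columns_is_index = isinstance(toy.columns, pd.Index)
index_is_range = isinstance(toy.index, pd.RangeIndex)

# Using a column as the index replaces the RangeIndex with an Index of labels.
labelled = toy.set_index('title')
print(type(toy.index).__name__, type(labelled.index).__name__)
```

Checking with isinstance against pd.Index, rather than comparing exact types, keeps the check valid whichever Index subclass pandas happens to choose.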

A RangeIndex is a special type of Index object that is analogous to Python's range object. Its entire sequence of values is not loaded into memory until it is necessary to do so, thereby saving memory. It is completely defined by its start, stop, and step values.
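
The memory saving can be seen in a rough sketch like the one below, which compares a RangeIndex with an Index built from a fully materialized NumPy array. The length of one million is an arbitrary choice for the example, and the exact byte counts depend on the platform and pandas version.

```python
import numpy as np
import pandas as pd

# A RangeIndex stores only start, stop, and step.
lazy = pd.RangeIndex(start=0, stop=1_000_000, step=1)

# The same labels backed by an array store every value explicitly.
dense = pd.Index(np.arange(1_000_000))

print(lazy.start, lazy.stop, lazy.step)
print(lazy.memory_usage(), dense.memory_usage())
```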



## There's more

In [38]:
index.values

array([   0,    1,    2, ..., 4913, 4914, 4915])

In [39]:
columns.values

array(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes',
       'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users',
       'cast_total_facebook_likes', 'actor_3_name',
       'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
       'num_user_for_reviews', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score',
       'aspect_ratio', 'movie_facebook_likes'], dtype=object)

# Understanding data types

In very broad terms, data may be classified as either continuous or categorical. Continuous data is always numeric and represents some kind of measurement, such as height, wage, or salary, and can take on an infinite number of possible values. Categorical data, on the other hand, represents a discrete, finite set of values, such as car color, type of poker hand, or brand of cereal.
Pandas does not broadly classify data as either continuous or categorical. Instead, it has precise technical definitions for many distinct data types. The following table contains all pandas data types, with their string equivalents, and some notes on each type:  


In [40]:
movie = pd.read_csv('data/movie.csv')

In [41]:
movie.dtypes

color                       object
director_name               object
num_critic_for_reviews     float64
duration                   float64
director_facebook_likes    float64
                            ...   
title_year                 float64
actor_2_facebook_likes     float64
imdb_score                 float64
aspect_ratio               float64
movie_facebook_likes         int64
Length: 28, dtype: object

In [42]:
movie.get_dtype_counts()  # deprecated in pandas 0.25 and removed in 1.0; movie.dtypes.value_counts() gives the same counts


float64    13
int64       3
object     12
dtype: int64

## How it works...

Each DataFrame column must be exactly one type. For instance, every value in the column aspect_ratio is a 64-bit float, and every value in movie_facebook_likes is a 64-bit integer. Pandas defaults its core numeric types, integers and floats, to 64 bits regardless of the size actually needed to hold the data. Even if a column consists entirely of the integer value 0, the data type will still be int64. get_dtype_counts is a convenience method that returns the count of each data type in the DataFrame.

Homogeneous data is another term for a column whose values all have the same type. A DataFrame as a whole may contain heterogeneous data, with different data types in different columns.

The object data type is the one data type that is unlike the others. A column that is of object data type may contain values that are of any valid Python object. Typically, when a column is of the object data type, it signals that the entire column is strings. This isn't necessarily the case as it is possible for these columns to contain a mixture of integers, booleans, strings, or other, even more complex Python objects such as lists or dictionaries. The object data type is a catch-all for columns that pandas doesn’t recognize as any other specific type.
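
To make the rules above concrete, here is a minimal sketch with a made-up three-column DataFrame (the column names zeros, ratio, and mixed are hypothetical). It shows the int64 and float64 defaults and the object catch-all for a column holding arbitrary Python values.

```python
import pandas as pd

# Hypothetical columns chosen to show one dtype each.
toy = pd.DataFrame({
    'zeros': [0, 0, 0],               # small integers still default to int64
    'ratio': [1.78, 2.35, 1.85],      # floats default to float64
    'mixed': ['text', 42, [1, 2]],    # arbitrary Python objects become object
})

# Map each column name to the string name of its dtype.
dtype_names = {column: str(dtype) for column, dtype in toy.dtypes.items()}
print(dtype_names)
```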


## There's more...

Almost all pandas data types are built directly on NumPy. This tight integration makes it easier for users to combine pandas and NumPy operations. As pandas grew larger and more popular, the object data type proved to be too generic for all columns with string values. Pandas created its own categorical data type to handle columns of strings (or numbers) with a fixed number of possible values.
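
A short sketch of the categorical type, assuming a made-up column of color names with only three distinct values, might look like this. The memory figures are printed rather than stated because they vary with the pandas version.

```python
import pandas as pd

# 5,000 strings drawn from only three possible values (illustrative data).
colors = pd.Series(['red', 'green', 'blue', 'red', 'green'] * 1000)

# Converting stores each distinct value once plus a compact integer code per row.
as_category = colors.astype('category')

print(as_category.dtype)
print(list(as_category.cat.categories))
print(colors.memory_usage(deep=True), as_category.memory_usage(deep=True))
```

The conversion pays off when the number of distinct values is small relative to the number of rows, as it is here.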


# Selecting a single column of data as a Series

A Series is a single column of data from a DataFrame. It is a single dimension of data, composed of just an index and the data.




In [52]:
movie = pd.read_csv('data/movie.csv')

In [53]:
movie['director_name']

0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object

In [54]:
movie.director_name
#print(movie.director_name)

0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object

In [48]:
type(movie['director_name'])

pandas.core.series.Series

## How it works... 

Python has several built-in objects for containing data, such as lists, tuples, and dictionaries. All three of these objects use the indexing operator to select their data. DataFrames are more powerful and complex containers of data, but they too use the indexing operator as the primary means to select data. Passing a single string to the DataFrame indexing operator returns a Series.  

The visual output of the Series is less stylized than the DataFrame. It represents a single column of data. Along with the index and values, the output displays the name, length, and data type of the Series.

Alternatively, a column of data may be accessed using the dot notation, with the column name as an attribute. Although it works in this particular example, it is not best practice and is prone to error and misuse. Column names with spaces or special characters cannot be accessed in this manner; the operation would have failed if the column name were director name. Column names that collide with DataFrame methods, such as count, also fail to be selected correctly using the dot notation. Assigning new values or deleting columns with the dot notation might give unexpected results. Because of this, using the dot notation to access columns should be avoided in production code.
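
The two failure modes mentioned above can be reproduced with a small sketch. The DataFrame below is hypothetical; its column names count and director name were picked only to trigger the problems.

```python
import pandas as pd

# Hypothetical columns chosen to trigger the two failure modes.
toy = pd.DataFrame({'count': [1, 2, 3], 'director name': ['A', 'B', 'C']})

by_brackets = toy['count']      # the column, as a Series
by_dot = toy.count              # the DataFrame.count method, not the column

# A name containing a space only works with the indexing operator;
# writing toy.director name would be a syntax error.
spaced = toy['director name']

print(type(by_brackets).__name__, callable(by_dot))
```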



## There's more

In [55]:
director = movie['director_name'] # save Series to variable
director.name

'director_name'

In [51]:
#director.to_frame().head()
director.to_frame()

| | director_name |
|---|---|
| 0 | James Cameron |
| 1 | Gore Verbinski |
| 2 | Sam Mendes |
| 3 | Christopher Nolan |
| 4 | Doug Walker |
| ... | ... |
| 4911 | Scott Smith |
| 4912 | NaN |
| 4913 | Benjamin Roberds |
| 4914 | Daniel Hsia |


In [59]:
director.to_frame().shape

(4916, 1)

# Calling Series methods

## Getting ready...

In [23]:
s_attr_methods = set(dir(pd.Series))
len(s_attr_methods)

442

In [24]:
df_attr_methods = set(dir(pd.DataFrame))
len(df_attr_methods)

445

In [25]:
len(s_attr_methods & df_attr_methods)

376

## How to do it...

In [26]:
movie = pd.read_csv('data/movie.csv')
director = movie['director_name']
actor_1_fb_likes = movie['actor_1_facebook_likes']

In [27]:
director.head()

0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object

In [28]:
actor_1_fb_likes.head()

0     1000.0
1    40000.0
2    11000.0
3    27000.0
4      131.0
Name: actor_1_facebook_likes, dtype: float64

In [29]:
pd.set_option('display.max_rows', 8)  # full option name avoids ambiguity on newer pandas
director.value_counts()

Steven Spielberg    26
Woody Allen         22
Clint Eastwood      20
Martin Scorsese     20
                    ..
James Nunn           1
Gerard Johnstone     1
Ethan Maniquis       1
Antony Hoffman       1
Name: director_name, Length: 2397, dtype: int64

In [30]:
actor_1_fb_likes.value_counts()

1000.0     436
11000.0    206
2000.0     189
3000.0     150
          ... 
216.0        1
859.0        1
225.0        1
334.0        1
Name: actor_1_facebook_likes, Length: 877, dtype: int64

In [31]:
director.size

4916

In [32]:
director.shape

(4916,)

In [33]:
len(director)

4916

In [34]:
director.count()

4814

In [35]:
actor_1_fb_likes.count()

4909

In [36]:
actor_1_fb_likes.quantile()

982.0

In [37]:
actor_1_fb_likes.min(), actor_1_fb_likes.max(), \
actor_1_fb_likes.mean(), actor_1_fb_likes.median(), \
actor_1_fb_likes.std(), actor_1_fb_likes.sum()

(0.0, 640000.0, 6494.488490527602, 982.0, 15106.986883848309, 31881444.0)

In [38]:
actor_1_fb_likes.describe()

count      4909.000000
mean       6494.488491
std       15106.986884
min           0.000000
25%         607.000000
50%         982.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64

In [39]:
director.describe()

count                 4814
unique                2397
top       Steven Spielberg
freq                    26
Name: director_name, dtype: object

In [40]:
actor_1_fb_likes.quantile(.2)

510.0

In [41]:
actor_1_fb_likes.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])

0.1      240.0
0.2      510.0
0.3      694.0
0.4      854.0
        ...   
0.6     1000.0
0.7     8000.0
0.8    13000.0
0.9    18000.0
Name: actor_1_facebook_likes, Length: 9, dtype: float64

In [42]:
director.isnull()

0       False
1       False
2       False
3       False
        ...  
4912     True
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

In [43]:
actor_1_fb_likes_filled = actor_1_fb_likes.fillna(0)
actor_1_fb_likes_filled.count()

4916

In [44]:
actor_1_fb_likes_dropped = actor_1_fb_likes.dropna()
actor_1_fb_likes_dropped.size

4909
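
The difference between size and count in the cells above comes down to missing values. The following sketch, using a made-up five-element Series rather than the movie data, shows how the two numbers relate and what fillna and dropna do to them.

```python
import numpy as np
import pandas as pd

# Illustrative values with two missing entries.
likes = pd.Series([1000.0, np.nan, 131.0, np.nan, 40000.0])

total = likes.size                 # every element, missing or not
non_missing = likes.count()        # missing values are skipped
n_missing = likes.isnull().sum()   # number of missing values

filled = likes.fillna(0)           # same length, NaN replaced by 0
dropped = likes.dropna()           # shorter, missing entries removed

print(total, non_missing, n_missing, filled.count(), dropped.size)
```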

## There's more...

In [45]:
director.value_counts(normalize=True)

Steven Spielberg    0.005401
Woody Allen         0.004570
Clint Eastwood      0.004155
Martin Scorsese     0.004155
                      ...   
James Nunn          0.000208
Gerard Johnstone    0.000208
Ethan Maniquis      0.000208
Antony Hoffman      0.000208
Name: director_name, Length: 2397, dtype: float64

In [46]:
director.hasnans

True

In [47]:
director.notnull()

0        True
1        True
2        True
3        True
        ...  
4912    False
4913     True
4914     True
4915     True
Name: director_name, Length: 4916, dtype: bool

# Working with operators on a Series

In [48]:
pd.options.display.max_rows = 6

In [49]:
5 + 9    # plus operator example. Adds 5 and 9

14

In [50]:
4 ** 2   # exponentiation operator. Raises 4 to the second power

16

In [51]:
a = 10   # assignment operator.

In [52]:
5 <= 9   # less than or equal to operator

True

In [53]:
'abcde' + 'fg'    # plus operator for strings. Concatenates them

'abcdefg'

In [54]:
not (5 <= 9)      # not is an operator that is a reserved keyword and reverses a boolean

False

In [55]:
7 in [1, 2, 6]    # in operator checks for membership in a list

False

In [56]:
set([1,2,3]) & set([2,3,4])

{2, 3}

In [57]:
[1, 2, 3] - 3

TypeError: unsupported operand type(s) for -: 'list' and 'int'

In [58]:
a = set([1,2,3])     
a[0]                 # the indexing operator does not work with sets

TypeError: 'set' object does not support indexing

## Getting ready...

In [59]:
movie = pd.read_csv('data/movie.csv')
imdb_score = movie['imdb_score']
imdb_score

0       7.9
1       7.1
2       6.8
       ... 
4913    6.3
4914    6.3
4915    6.6
Name: imdb_score, Length: 4916, dtype: float64

In [60]:
imdb_score + 1

0       8.9
1       8.1
2       7.8
       ... 
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64

In [61]:
imdb_score * 2.5

0       19.75
1       17.75
2       17.00
        ...  
4913    15.75
4914    15.75
4915    16.50
Name: imdb_score, Length: 4916, dtype: float64

In [62]:
imdb_score // 7

0       1.0
1       1.0
2       0.0
       ... 
4913    0.0
4914    0.0
4915    0.0
Name: imdb_score, Length: 4916, dtype: float64

In [63]:
imdb_score > 7

0        True
1        True
2       False
        ...  
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool

In [64]:
director = movie['director_name']

In [65]:
director == 'James Cameron'

0        True
1       False
2       False
        ...  
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

## There's more...

In [66]:
imdb_score.add(1)              # imdb_score + 1

0       8.9
1       8.1
2       7.8
       ... 
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64

In [67]:
imdb_score.mul(2.5)            # imdb_score * 2.5

0       19.75
1       17.75
2       17.00
        ...  
4913    15.75
4914    15.75
4915    16.50
Name: imdb_score, Length: 4916, dtype: float64

In [68]:
imdb_score.floordiv(7)         # imdb_score // 7

0       1.0
1       1.0
2       0.0
       ... 
4913    0.0
4914    0.0
4915    0.0
Name: imdb_score, Length: 4916, dtype: float64

In [69]:
imdb_score.gt(7)               # imdb_score > 7

0        True
1        True
2       False
        ...  
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool

In [70]:
director.eq('James Cameron')   # director == 'James Cameron'

0        True
1       False
2       False
        ...  
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

In [71]:
imdb_score.astype(int).mod(5)

0       2
1       2
2       1
       ..
4913    1
4914    1
4915    1
Name: imdb_score, Length: 4916, dtype: int64

In [72]:
a = type(1)

In [73]:
type(a)

type

In [74]:
a = type(imdb_score)

In [75]:
a([1,2,3])

0    1
1    2
2    3
dtype: int64

# Chaining Series methods together

In [76]:
movie = pd.read_csv('data/movie.csv')
actor_1_fb_likes = movie['actor_1_facebook_likes']
director = movie['director_name']

In [77]:
director.value_counts().head(3)

Steven Spielberg    26
Woody Allen         22
Clint Eastwood      20
Name: director_name, dtype: int64

In [78]:
actor_1_fb_likes.isnull().sum()

7

In [79]:
actor_1_fb_likes.dtype

dtype('float64')

In [80]:
actor_1_fb_likes.fillna(0)\
                .astype(int)\
                .head()

0     1000
1    40000
2    11000
3    27000
4      131
Name: actor_1_facebook_likes, dtype: int64

## There's more...

In [81]:
actor_1_fb_likes.isnull().mean()

0.0014239218877135883

In [82]:
(actor_1_fb_likes.fillna(0)
                 .astype(int)
                 .head())

0     1000
1    40000
2    11000
3    27000
4      131
Name: actor_1_facebook_likes, dtype: int64
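
Two details of the chains above are worth a sketch on a tiny, made-up Series: why sum and mean work on the boolean result of isnull, and why fillna has to come before astype(int). This is an illustration under those assumptions, not part of the original recipe.

```python
import numpy as np
import pandas as pd

# Illustrative values with one missing entry.
likes = pd.Series([1000.0, np.nan, 131.0])

# True counts as 1 and False as 0, so sum() counts and mean() gives a share.
missing_count = likes.isnull().sum()
missing_share = likes.isnull().mean()

# Order matters: NaN cannot be cast to int, so fillna must come first.
try:
    likes.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

cleaned = (likes.fillna(0)
                .astype(int))
print(missing_count, missing_share, cast_failed, cleaned.tolist())
```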

# Making the index meaningful

In [83]:
movie = pd.read_csv('data/movie.csv')

In [84]:
movie.shape

(4916, 28)

In [85]:
movie2 = movie.set_index('movie_title')
movie2

| movie_title | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|
| Avatar | Color | James Cameron | 723.0 | 178.0 | ... | 936.0 | 7.9 | 1.78 | 33000 |
| Pirates of the Caribbean: At World's End | Color | Gore Verbinski | 302.0 | 169.0 | ... | 5000.0 | 7.1 | 2.35 | 0 |
| Spectre | Color | Sam Mendes | 602.0 | 148.0 | ... | 393.0 | 6.8 | 2.35 | 85000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| A Plague So Pleasant | Color | Benjamin Roberds | 13.0 | 76.0 | ... | 0.0 | 6.3 | NaN | 16 |
| Shanghai Calling | Color | Daniel Hsia | 14.0 | 100.0 | ... | 719.0 | 6.3 | 2.35 | 660 |
| My Date with Drew | Color | Jon Gunn | 43.0 | 90.0 | ... | 23.0 | 6.6 | 1.85 | 456 |


In [86]:
pd.read_csv('data/movie.csv', index_col='movie_title')

| movie_title | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|
| Avatar | Color | James Cameron | 723.0 | 178.0 | ... | 936.0 | 7.9 | 1.78 | 33000 |
| Pirates of the Caribbean: At World's End | Color | Gore Verbinski | 302.0 | 169.0 | ... | 5000.0 | 7.1 | 2.35 | 0 |
| Spectre | Color | Sam Mendes | 602.0 | 148.0 | ... | 393.0 | 6.8 | 2.35 | 85000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| A Plague So Pleasant | Color | Benjamin Roberds | 13.0 | 76.0 | ... | 0.0 | 6.3 | NaN | 16 |
| Shanghai Calling | Color | Daniel Hsia | 14.0 | 100.0 | ... | 719.0 | 6.3 | 2.35 | 660 |
| My Date with Drew | Color | Jon Gunn | 43.0 | 90.0 | ... | 23.0 | 6.6 | 1.85 | 456 |


## There's more...

In [87]:
movie2.reset_index()

| | movie_title | color | director_name | num_critic_for_reviews | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Avatar | Color | James Cameron | 723.0 | ... | 936.0 | 7.9 | 1.78 | 33000 |
| 1 | Pirates of the Caribbean: At World's End | Color | Gore Verbinski | 302.0 | ... | 5000.0 | 7.1 | 2.35 | 0 |
| 2 | Spectre | Color | Sam Mendes | 602.0 | ... | 393.0 | 6.8 | 2.35 | 85000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4913 | A Plague So Pleasant | Color | Benjamin Roberds | 13.0 | ... | 0.0 | 6.3 | NaN | 16 |
| 4914 | Shanghai Calling | Color | Daniel Hsia | 14.0 | ... | 719.0 | 6.3 | 2.35 | 660 |
| 4915 | My Date with Drew | Color | Jon Gunn | 43.0 | ... | 23.0 | 6.6 | 1.85 | 456 |
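
The round trip between set_index and reset_index can be checked on a two-row toy DataFrame. The sketch below reuses two titles and scores from the output above but is otherwise self-contained, so it does not need movie.csv.

```python
import pandas as pd

# Two rows borrowed from the movie data, for illustration.
toy = pd.DataFrame({'movie_title': ['Avatar', 'Spectre'],
                    'imdb_score': [7.9, 6.8]})

by_title = toy.set_index('movie_title')         # titles become row labels
score = by_title.loc['Spectre', 'imdb_score']   # select by label instead of position

restored = by_title.reset_index()               # titles move back to a column
print(score, list(restored.columns))
```

With a meaningful index, a row can be looked up by its label, which is the main practical benefit of replacing the default RangeIndex.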


# Renaming row and column names

In [88]:
movie = pd.read_csv('data/movie.csv', index_col='movie_title')

In [89]:
idx_rename = {'Avatar':'Ratava', 'Spectre': 'Ertceps'} 
col_rename = {'director_name':'Director Name', 
              'num_critic_for_reviews': 'Critical Reviews'} 

In [90]:
movie.rename(index=idx_rename, 
             columns=col_rename).head()

| movie_title | color | Director Name | Critical Reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|
| Ratava | Color | James Cameron | 723.0 | 178.0 | ... | 936.0 | 7.9 | 1.78 | 33000 |
| Pirates of the Caribbean: At World's End | Color | Gore Verbinski | 302.0 | 169.0 | ... | 5000.0 | 7.1 | 2.35 | 0 |
| Ertceps | Color | Sam Mendes | 602.0 | 148.0 | ... | 393.0 | 6.8 | 2.35 | 85000 |
| The Dark Knight Rises | Color | Christopher Nolan | 813.0 | 164.0 | ... | 23000.0 | 8.5 | 2.35 | 164000 |
| Star Wars: Episode VII - The Force Awakens | NaN | Doug Walker | NaN | NaN | ... | 12.0 | 7.1 | NaN | 0 |
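
Two properties of rename are easy to miss: it returns a new DataFrame and leaves the original untouched, and by default it silently ignores mapping keys that do not match any label. A minimal sketch on a one-row toy DataFrame follows; the key no_such_column is deliberately made up.

```python
import pandas as pd

# One-row toy DataFrame with a labelled index.
toy = pd.DataFrame({'director_name': ['James Cameron'], 'duration': [178.0]},
                   index=['Avatar'])

renamed = toy.rename(index={'Avatar': 'Ratava'},
                     columns={'director_name': 'Director Name',
                              'no_such_column': 'Ignored'})  # unmatched key is skipped

print(list(renamed.index), list(renamed.columns))
print(list(toy.index), list(toy.columns))   # the original is unchanged
```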


## There's more...

In [91]:
movie = pd.read_csv('data/movie.csv', index_col='movie_title')
index = movie.index
columns = movie.columns

index_list = index.tolist()
column_list = columns.tolist()

index_list[0] = 'Ratava'
index_list[2] = 'Ertceps'
column_list[1] = 'Director Name'
column_list[2] = 'Critical Reviews'

In [92]:
print(index_list[:5])

['Ratava', "Pirates of the Caribbean: At World's End", 'Ertceps', 'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens']


In [93]:
print(column_list)

['color', 'Director Name', 'Critical Reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes']


In [94]:
movie.index = index_list
movie.columns = column_list

In [95]:
movie.head()

| | color | Director Name | Critical Reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|
| Ratava | Color | James Cameron | 723.0 | 178.0 | ... | 936.0 | 7.9 | 1.78 | 33000 |
| Pirates of the Caribbean: At World's End | Color | Gore Verbinski | 302.0 | 169.0 | ... | 5000.0 | 7.1 | 2.35 | 0 |
| Ertceps | Color | Sam Mendes | 602.0 | 148.0 | ... | 393.0 | 6.8 | 2.35 | 85000 |
| The Dark Knight Rises | Color | Christopher Nolan | 813.0 | 164.0 | ... | 23000.0 | 8.5 | 2.35 | 164000 |
| Star Wars: Episode VII - The Force Awakens | NaN | Doug Walker | NaN | NaN | ... | 12.0 | 7.1 | NaN | 0 |


# Creating and deleting columns

In [96]:
movie = pd.read_csv('data/movie.csv')

In [97]:
movie['has_seen'] = 0

In [98]:
movie.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes', 'has_seen'],
      dtype='object')

In [99]:
movie['actor_director_facebook_likes'] = (movie['actor_1_facebook_likes'] + 
                                              movie['actor_2_facebook_likes'] + 
                                              movie['actor_3_facebook_likes'] + 
                                              movie['director_facebook_likes'])

In [100]:
movie['actor_director_facebook_likes'].isnull().sum()

122
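
Missing values appear in the new column because the plus operator returns NaN whenever any operand is NaN. The sketch below, with two made-up columns named actor_1 and actor_2, contrasts the plus operator with two alternatives that treat missing values differently.

```python
import numpy as np
import pandas as pd

# Hypothetical like counts with missing values in different rows.
toy = pd.DataFrame({'actor_1': [1000.0, 40000.0, np.nan],
                    'actor_2': [936.0, np.nan, np.nan]})

# NaN if either side is NaN.
plain_sum = toy['actor_1'] + toy['actor_2']

# NaN only if both sides are NaN.
filled_sum = toy['actor_1'].add(toy['actor_2'], fill_value=0)

# Row-wise sum skips NaN; a row with no values sums to 0.
row_sum = toy[['actor_1', 'actor_2']].sum(axis='columns')

print(plain_sum.tolist(), filled_sum.tolist(), row_sum.tolist())
```

Which behavior is appropriate depends on whether a missing like count should be read as zero or as unknown.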

In [101]:
movie['actor_director_facebook_likes'] = movie['actor_director_facebook_likes'].fillna(0)

In [102]:
movie['is_cast_likes_more'] = (movie['cast_total_facebook_likes'] >= 
                                  movie['actor_director_facebook_likes'])

In [103]:
movie['is_cast_likes_more'].all()

False

In [104]:
movie = movie.drop('actor_director_facebook_likes', axis='columns')

In [105]:
movie['actor_total_facebook_likes'] = (movie['actor_1_facebook_likes'] + 
                                       movie['actor_2_facebook_likes'] + 
                                       movie['actor_3_facebook_likes'])

movie['actor_total_facebook_likes'] = movie['actor_total_facebook_likes'].fillna(0)

In [106]:
movie['is_cast_likes_more'] = movie['cast_total_facebook_likes'] >= \
                                  movie['actor_total_facebook_likes']
    
movie['is_cast_likes_more'].all()

True

In [107]:
movie['pct_actor_cast_like'] = (movie['actor_total_facebook_likes'] / 
                                movie['cast_total_facebook_likes'])

In [108]:
movie['pct_actor_cast_like'].min(), movie['pct_actor_cast_like'].max() 

(0.0, 1.0)

In [109]:
movie.set_index('movie_title')['pct_actor_cast_like'].head()

movie_title
Avatar                                        0.577369
Pirates of the Caribbean: At World's End      0.951396
Spectre                                       0.987521
The Dark Knight Rises                         0.683783
Star Wars: Episode VII - The Force Awakens    0.000000
Name: pct_actor_cast_like, dtype: float64

## There's more...

In [110]:
profit_index = movie.columns.get_loc('gross') + 1
profit_index

9

In [111]:
movie.insert(loc=profit_index,
                 column='profit',
                 value=movie['gross'] - movie['budget'])

In [112]:
movie.head()

| | color | director_name | num_critic_for_reviews | duration | ... | has_seen | is_cast_likes_more | actor_total_facebook_likes | pct_actor_cast_like |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Color | James Cameron | 723.0 | 178.0 | ... | 0 | True | 2791.0 | 0.577369 |
| 1 | Color | Gore Verbinski | 302.0 | 169.0 | ... | 0 | True | 46000.0 | 0.951396 |
| 2 | Color | Sam Mendes | 602.0 | 148.0 | ... | 0 | True | 11554.0 | 0.987521 |
| 3 | Color | Christopher Nolan | 813.0 | 164.0 | ... | 0 | True | 73000.0 | 0.683783 |
| 4 | NaN | Doug Walker | NaN | NaN | ... | 0 | True | 0.0 | 0.0 |
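
Because the inserted profit column falls in the truncated middle of the display, a smaller sketch makes the effect of get_loc and insert easier to see. The gross and budget figures below are invented for the example.

```python
import pandas as pd

# Invented figures; only the column names mirror the movie data.
toy = pd.DataFrame({'gross': [760.5, 309.4],
                    'budget': [237.0, 300.0],
                    'imdb_score': [7.9, 7.1]})

# Place the new column immediately after 'gross'.
position = toy.columns.get_loc('gross') + 1

# insert() modifies the DataFrame in place and returns None.
result = toy.insert(loc=position,
                    column='profit',
                    value=toy['gross'] - toy['budget'])

print(list(toy.columns))
```

Unlike most DataFrame methods used in this chapter, insert changes the object in place, so its return value should not be assigned back to the variable.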
