### Pandas Tutorial: Data analysis with Python: Part 1

### Introduction

The pandas package makes importing and analyzing data much easier. It builds on NumPy and matplotlib.

In this tutorial, pandas will be used to analyze data on video game reviews from IGN.

Dataset is from: https://www.kaggle.com/egrinstein/20-years-of-games

The data is composed of 18,625 rows, collected with a web-crawling script (https://github.com/egrinstein/crawl_ign/blob/master/crawl_ign/crawl_ign/ign.ipynb)
- Good to review after finishing this tutorial to get a refresher on scraping

The dataset covers major game releases on various platforms from 1996 to 2017, along with each game's release date and review score.

Explore the dataset and find:
- Top Platforms
- Top Genres
- Best Ratings

In [9]:
# Set directory for the data
import os
path = 'C:\\Users\\' + os.getlogin() + '\\Documents\\Programming\\Python\\MachineLearning\\Data'
os.chdir(path)
os.getcwd()
os.listdir()

['01-ign.csv']

In [10]:
"""
Importing Data with Pandas

Read the csv file and write it to a DataFrame which is a representation of tabular data (rows and columns)
"""
import pandas as pd

# Use pd.read_csv() method
reviews = pd.read_csv('01-ign.csv')

"""
Once we read the data in a DataFrame, Pandas gives two methods that make it fast to print out.

    pd.DataFrame.head - prints the first N rows (default = 5)
    pd.DataFrame.tail - prints the last N rows (default = 5)
"""


In [15]:
# [::2] - step of 2: every second row, starting from the first
reviews[::2].head()

Unnamed: 0.1,Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
2,2,Great,Splice: Tree of Life,/games/splice/ipad-141070,iPad,8.5,Puzzle,N,2012,9,12
4,4,Great,NHL 13,/games/nhl-13/ps3-128181,PlayStation 3,8.5,Sports,N,2012,9,11
6,6,Awful,Double Dragon: Neon,/games/double-dragon-neon/xbox-360-131320,Xbox 360,3.0,Fighting,N,2012,9,11
8,8,Awful,Double Dragon: Neon,/games/double-dragon-neon/ps3-131321,PlayStation 3,3.0,Fighting,N,2012,9,11


In [17]:
# [::3] - step of 3: every third row, starting from the first
reviews[::3].head()

Unnamed: 0.1,Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
3,3,Great,NHL 13,/games/nhl-13/xbox-360-128182,Xbox 360,8.5,Sports,N,2012,9,11
6,6,Awful,Double Dragon: Neon,/games/double-dragon-neon/xbox-360-131320,Xbox 360,3.0,Fighting,N,2012,9,11
9,9,Good,Total War Battles: Shogun,/games/total-war-battles-shogun/pc-142564,PC,7.0,Strategy,N,2012,9,11
12,12,Good,Wild Blood,/games/wild-blood/iphone-139363,iPhone,7.0,,N,2012,9,10


In [27]:
# First 11 rows
reviews[:11]

Unnamed: 0.1,Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
1,1,Amazing,LittleBigPlanet PS Vita -- Marvel Super Hero E...,/games/littlebigplanet-ps-vita-marvel-super-he...,PlayStation Vita,9.0,Platformer,Y,2012,9,12
2,2,Great,Splice: Tree of Life,/games/splice/ipad-141070,iPad,8.5,Puzzle,N,2012,9,12
3,3,Great,NHL 13,/games/nhl-13/xbox-360-128182,Xbox 360,8.5,Sports,N,2012,9,11
4,4,Great,NHL 13,/games/nhl-13/ps3-128181,PlayStation 3,8.5,Sports,N,2012,9,11
5,5,Good,Total War Battles: Shogun,/games/total-war-battles-shogun/mac-142565,Macintosh,7.0,Strategy,N,2012,9,11
6,6,Awful,Double Dragon: Neon,/games/double-dragon-neon/xbox-360-131320,Xbox 360,3.0,Fighting,N,2012,9,11
7,7,Amazing,Guild Wars 2,/games/guild-wars-2/pc-896298,PC,9.0,RPG,Y,2012,9,11
8,8,Awful,Double Dragon: Neon,/games/double-dragon-neon/ps3-131321,PlayStation 3,3.0,Fighting,N,2012,9,11
9,9,Good,Total War Battles: Shogun,/games/total-war-battles-shogun/pc-142564,PC,7.0,Strategy,N,2012,9,11
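The slices used in the cells above can be checked on a small frame. This is a minimal sketch: the `df` below is a stand-in that holds only the first twelve scores shown above, not the real dataset.

```python
import pandas as pd

# Stand-in frame: the first twelve scores from the output above.
df = pd.DataFrame(
    {"score": [9.0, 9.0, 8.5, 8.5, 8.5, 7.0, 3.0, 9.0, 3.0, 7.0, 7.5, 7.5]}
)

every_second = df[::2]   # rows at positions 0, 2, 4, ...
every_third = df[::3]    # rows at positions 0, 3, 6, ...
first_eleven = df[:11]   # rows at positions 0 through 10

print(every_second.index.tolist())
print(every_third.index.tolist())
print(len(first_eleven))
```

A plain slice inside square brackets always works on rows by position, following the same start:stop:step rules as a Python list.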


In [33]:
# Get first column
reviews[reviews.columns[0]].head()

0    0
1    1
2    2
3    3
4    4
Name: Unnamed: 0, dtype: int64

In [34]:
# Get second column
reviews[reviews.columns[1]].head()

0    Amazing
1    Amazing
2      Great
3      Great
4      Great
Name: score_phrase, dtype: object

In [40]:
# Keep every column except the first and store the result in a new variable
data = reviews[reviews.columns[1:]] # the last column alone would be reviews[reviews.columns[-1]]

In [45]:
"""
Pandas selections and indexing
data.loc[row, column] - select data by label or by a conditional statement
"""
print(data.head())
data.shape # see rows and columns

  score_phrase                                              title  \
0      Amazing                            LittleBigPlanet PS Vita   
1      Amazing  LittleBigPlanet PS Vita -- Marvel Super Hero E...   
2        Great                               Splice: Tree of Life   
3        Great                                             NHL 13   
4        Great                                             NHL 13   

                                                 url          platform  score  \
0             /games/littlebigplanet-vita/vita-98907  PlayStation Vita    9.0   
1  /games/littlebigplanet-ps-vita-marvel-super-he...  PlayStation Vita    9.0   
2                          /games/splice/ipad-141070              iPad    8.5   
3                      /games/nhl-13/xbox-360-128182          Xbox 360    8.5   
4                           /games/nhl-13/ps3-128181     PlayStation 3    8.5   

        genre editors_choice  release_year  release_month  release_day  
0  Platformer            

(18625, 10)

As you can see, everything has been read in properly: we have 18,625 rows and 10 columns. I removed the first column, which only duplicates the row index and isn't needed.
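An alternative to dropping the duplicate index column after loading is to tell read_csv to use it as the index. The sketch below assumes the CSV stores the index as an unnamed first column; `csv_text` is a three-row stand-in built from values shown above, not the real file.

```python
import io

import pandas as pd

# Stand-in for 01-ign.csv: the header starts with a comma, so the first
# column has no name.
csv_text = """,score_phrase,title,platform,score
0,Amazing,LittleBigPlanet PS Vita,PlayStation Vita,9.0
1,Great,Splice: Tree of Life,iPad,8.5
2,Great,NHL 13,Xbox 360,8.5
"""

# Default read: the unnamed column becomes a regular column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(sample.columns.tolist())
```

With index_col=0 the resulting frame has only the four data columns, so no separate column-dropping step is needed.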

One of the big advantages of Pandas vs just using NumPy is that Pandas allows you to have columns with different data types. reviews has columns that store float values, like score, string values, like score_phrase, and integers, like release_year. Now that we’ve read the data in properly, let’s work on indexing reviews to get the rows and columns that we want. 
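The per-column types can be inspected with the dtypes attribute. A minimal sketch on a two-row stand-in frame (the values come from the output above; the frame itself is an assumption for illustration):

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "score_phrase": ["Amazing", "Great"],  # strings
        "score": [9.0, 8.5],                   # floats
        "release_year": [2012, 2012],          # integers
    }
)

# One dtype per column, unlike a NumPy array, which has a single dtype.
print(sample.dtypes)
```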

### Indexing DataFrames with Pandas

    pandas.DataFrame.iloc method

Earlier, we used the head method to print the first 5 rows of reviews. We could accomplish the same thing using the pandas.DataFrame.iloc method. The iloc method allows us to retrieve rows and columns by position. In order to do that, we’ll need to specify the positions of the rows that we want, and the positions of the columns that we want as well. 

In [49]:
"""
The iloc indexer for Pandas DataFrame is used for integer location based indexing / selection by position.

data.iloc[row, column]
"""
data.iloc[0:5].head() # first 5 rows, all columns
data.iloc[0:5,]       # same selection: a trailing comma with no column selector keeps every column
data.iloc[0:5,:]      # same selection with the column slice written out; only this last expression is displayed

Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
1,Amazing,LittleBigPlanet PS Vita -- Marvel Super Hero E...,/games/littlebigplanet-ps-vita-marvel-super-he...,PlayStation Vita,9.0,Platformer,Y,2012,9,12
2,Great,Splice: Tree of Life,/games/splice/ipad-141070,iPad,8.5,Puzzle,N,2012,9,12
3,Great,NHL 13,/games/nhl-13/xbox-360-128182,Xbox 360,8.5,Sports,N,2012,9,11
4,Great,NHL 13,/games/nhl-13/ps3-128181,PlayStation 3,8.5,Sports,N,2012,9,11


As you can see above, we specified that we wanted rows 0:5. This means that we wanted the rows from position 0 up to, but not including, position 5. 

The first row is considered to be in position 0. This gives us the rows at positions 0, 1, 2, 3, and 4. If we leave off the first position value, like :5, it’s assumed we mean 0. If we leave off the last position value, like 0:, it’s assumed we mean the last row or column in the DataFrame. We wanted all of the columns, so we specified just a colon (:), without any positions. This gave us the columns from 0 to the last column. 

Here are some indexing examples, along with the results:

- data.iloc[:5,:] — the first 5 rows, and all of the columns for those rows.
- data.iloc[:,:] — the entire DataFrame.
- data.iloc[5:,5:] — rows from position 5 onwards, and columns from position 5 onwards.
- data.iloc[:,0] — the first column, and all of the rows for the column.
- data.iloc[9,:] — the 10th row, and all of the columns for that row.
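The indexing examples above can be verified on a small frame where every value encodes its own position. This is a sketch on made-up data (12 rows, 6 columns named a to f), not on the reviews dataset.

```python
import pandas as pd

# Each value is row * 10 + column, so a value reveals where it came from.
df = pd.DataFrame(
    [[r * 10 + c for c in range(6)] for r in range(12)],
    columns=list("abcdef"),
)

first_five = df.iloc[:5, :]     # first 5 rows, all columns
everything = df.iloc[:, :]      # the entire DataFrame
lower_right = df.iloc[5:, 5:]   # rows and columns from position 5 onwards
first_column = df.iloc[:, 0]    # first column as a Series
tenth_row = df.iloc[9, :]       # 10th row as a Series

print(first_five.shape, everything.shape, lower_right.shape)
print(tenth_row.tolist())
```

Selecting a single row or column with an integer returns a Series, while slices return a DataFrame.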

Indexing by position is very similar to NumPy indexing. If you want to learn more, you can read our NumPy tutorial here. (https://www.dataquest.io/blog/numpy-tutorial-python/)

In [50]:
data.head()

Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
1,Amazing,LittleBigPlanet PS Vita -- Marvel Super Hero E...,/games/littlebigplanet-ps-vita-marvel-super-he...,PlayStation Vita,9.0,Platformer,Y,2012,9,12
2,Great,Splice: Tree of Life,/games/splice/ipad-141070,iPad,8.5,Puzzle,N,2012,9,12
3,Great,NHL 13,/games/nhl-13/xbox-360-128182,Xbox 360,8.5,Sports,N,2012,9,11
4,Great,NHL 13,/games/nhl-13/ps3-128181,PlayStation 3,8.5,Sports,N,2012,9,11


### Indexing Using Labels in Pandas

Now that we know how to retrieve rows and columns by position, it’s worth looking into the other major way to work with DataFrames, which is to retrieve rows and columns by label. 

A major advantage of Pandas over NumPy is that each of the columns and rows has a label. Working with column positions is possible, but it can be hard to keep track of which number corresponds to which column. We can work with labels using the pandas.DataFrame.loc method, which allows us to index using labels instead of positions.

In [54]:
# Get rows with labels 0 through 5 (inclusive) - show all columns
data.loc[0:5,:] # Like data.iloc[0:5,:], but the end label 5 is included

Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
1,Amazing,LittleBigPlanet PS Vita -- Marvel Super Hero E...,/games/littlebigplanet-ps-vita-marvel-super-he...,PlayStation Vita,9.0,Platformer,Y,2012,9,12
2,Great,Splice: Tree of Life,/games/splice/ipad-141070,iPad,8.5,Puzzle,N,2012,9,12
3,Great,NHL 13,/games/nhl-13/xbox-360-128182,Xbox 360,8.5,Sports,N,2012,9,11
4,Great,NHL 13,/games/nhl-13/ps3-128181,PlayStation 3,8.5,Sports,N,2012,9,11
5,Good,Total War Battles: Shogun,/games/total-war-battles-shogun/mac-142565,Macintosh,7.0,Strategy,N,2012,9,11


The above doesn’t look much different from data.iloc[0:5,:], except that it has six rows instead of five: loc slices include the end label, while iloc slices stop before the end position. The rows otherwise match because, while row labels can take on any values, our row labels match the positions exactly. You can see the row labels on the very left of the table above. You can also see them by accessing the index property of a DataFrame.
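The difference between the two slices is easiest to see by counting rows. A minimal sketch on a stand-in frame holding the first seven scores from the output above:

```python
import pandas as pd

df = pd.DataFrame({"score": [9.0, 9.0, 8.5, 8.5, 8.5, 7.0, 3.0]})

by_position = df.iloc[0:5]  # positions 0-4: the end position is excluded
by_label = df.loc[0:5]      # labels 0-5: the end label is included

print(len(by_position), len(by_label))
```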

In [56]:
# Display the row indexes for data
data.index

RangeIndex(start=0, stop=18625, step=1)

Indexes don’t always have to match up with positions, though. In the below code cell, we’ll:
- Get the rows at positions 10 through 19 of data, and assign the result to some_data.
- Display some_data; note that its row labels still start at 10, not 0.


In [63]:
# Get the rows at positions 10-19 and assign them to a new variable
some_data = data.iloc[10:20,] # iloc stops before position 20, whereas loc[10:20] would include label 20; the trailing comma keeps all columns, like [10:20, :]
some_data

Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
10,Good,Tekken Tag Tournament 2,/games/tekken-tag-tournament-2/ps3-124584,PlayStation 3,7.5,Fighting,N,2012,9,11
11,Good,Tekken Tag Tournament 2,/games/tekken-tag-tournament-2/xbox-360-124581,Xbox 360,7.5,Fighting,N,2012,9,11
12,Good,Wild Blood,/games/wild-blood/iphone-139363,iPhone,7.0,,N,2012,9,10
13,Amazing,Mark of the Ninja,/games/mark-of-the-ninja-135615/xbox-360-129276,Xbox 360,9.0,"Action, Adventure",Y,2012,9,7
14,Amazing,Mark of the Ninja,/games/mark-of-the-ninja-135615/pc-143761,PC,9.0,"Action, Adventure",Y,2012,9,7
15,Okay,Home: A Unique Horror Adventure,/games/home-a-unique-horror-adventure/mac-2001...,Macintosh,6.5,Adventure,N,2012,9,6
16,Okay,Home: A Unique Horror Adventure,/games/home-a-unique-horror-adventure/pc-137135,PC,6.5,Adventure,N,2012,9,6
17,Great,Avengers Initiative,/games/avengers-initiative/iphone-141579,iPhone,8.0,Action,N,2012,9,5
18,Mediocre,Way of the Samurai 4,/games/way-of-the-samurai-4/ps3-23516,PlayStation 3,5.5,"Action, Adventure",N,2012,9,3
19,Good,JoJo's Bizarre Adventure HD,/games/jojos-bizarre-adventure/xbox-360-137717,Xbox 360,7.0,Fighting,N,2012,9,3
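Because the slice keeps its original labels, position 0 and label 0 no longer refer to the same thing. The sketch below uses a made-up 25-row frame (`df` and its values are assumptions for illustration) to show the mismatch and one way to renumber the rows.

```python
import pandas as pd

# Made-up frame with labels 0-24.
df = pd.DataFrame({"score": [float(i) for i in range(25)]})

some = df.iloc[10:20]           # labels 10-19 are kept

first_by_position = some.iloc[0]
first_by_label = some.loc[10]   # same row, addressed by its label

# Label 0 does not exist in the slice, so a single-label lookup fails.
try:
    some.loc[0]
    label_zero_missing = False
except KeyError:
    label_zero_missing = True

# reset_index(drop=True) discards the old labels and renumbers from 0.
renumbered = some.reset_index(drop=True)

print(some.index.tolist())
print(renumbered.index.tolist())
```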


Column labels can make life much easier when you’re working with data. We can specify column labels in the loc method to retrieve columns by label instead of by position. 

In [69]:
# Rows up to and including label 5, and only the "score" column
data.loc[:5, "score"]

0    9.0
1    9.0
2    8.5
3    8.5
4    8.5
5    7.0
Name: score, dtype: float64

In [70]:
# Get the score and release_year columns by passing a list of column names
data.loc[:5, ["score", "release_year"]]

Unnamed: 0,score,release_year
0,9.0,2012
1,9.0,2012
2,8.5,2012
3,8.5,2012
4,8.5,2012
5,7.0,2012


### Pandas Series Objects

We can retrieve an individual column in Pandas a few different ways. So far, we’ve seen two types of syntax for this:

- data.iloc[:,1]: will retrieve the second column (title).
- data.loc[:,"title"]: will also retrieve the second column.

There’s a third, even easier, way to retrieve a whole column. We can just specify the column name in square brackets, like with a dictionary: 

In [71]:
# Get "score"
data["score"]

0         9.0
1         9.0
2         8.5
3         8.5
4         8.5
5         7.0
6         3.0
7         9.0
8         3.0
9         7.0
10        7.5
11        7.5
12        7.0
13        9.0
14        9.0
15        6.5
16        6.5
17        8.0
18        5.5
19        7.0
20        7.0
21        7.5
22        7.5
23        7.5
24        9.0
25        7.0
26        9.0
27        7.5
28        8.0
29        6.5
         ... 
18595     4.4
18596     6.5
18597     4.9
18598     6.8
18599     7.0
18600     7.4
18601     7.4
18602     7.4
18603     7.8
18604     8.6
18605     6.0
18606     6.4
18607     7.0
18608     5.4
18609     8.0
18610     6.0
18611     5.8
18612     7.8
18613     8.0
18614     9.2
18615     9.2
18616     7.5
18617     8.4
18618     9.1
18619     7.9
18620     7.6
18621     9.0
18622     5.8
18623    10.0
18624    10.0
Name: score, Length: 18625, dtype: float64

In [72]:
# Also use lists of columns with this method
data[["score", "release_year"]]

Unnamed: 0,score,release_year
0,9.0,2012
1,9.0,2012
2,8.5,2012
3,8.5,2012
4,8.5,2012
5,7.0,2012
6,3.0,2012
7,9.0,2012
8,3.0,2012
9,7.0,2012


When we retrieve a single column, we’re actually retrieving a Pandas Series object. A DataFrame stores tabular data, but a Series stores a single column or row of data. We can verify that a single column is a Series: 

In [73]:
# Verify if series
type(data["score"])

pandas.core.series.Series
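The three ways of selecting a column return the same Series, while a list of names returns a DataFrame even when the list has one entry. A minimal sketch on a three-row stand-in frame built from values shown above:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "score_phrase": ["Amazing", "Great", "Great"],
        "title": ["LittleBigPlanet PS Vita", "Splice: Tree of Life", "NHL 13"],
        "score": [9.0, 8.5, 8.5],
    }
)

by_position = sample.iloc[:, 1]      # second column, by position
by_label = sample.loc[:, "title"]    # same column, by label
by_brackets = sample["title"]        # same column, dictionary style

one_column_frame = sample[["score"]]  # list of names -> DataFrame

print(type(by_brackets).__name__, type(one_column_frame).__name__)
```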

We can create a Series manually to better understand how it works. To create a Series, we pass a list or NumPy array into the Series object when we instantiate it: 

In [78]:
# Create a Series from a list
s1 = pd.Series([1,2]) # pass the list [1,2] to the Series constructor
s1

0    1
1    2
dtype: int64

A Series can contain any type of data, including mixed types. Here, we create a Series that contains string objects: 

In [79]:
s2 = pd.Series(["Boris Yeltsin", "Mikhail Gorbachev"])
s2

0        Boris Yeltsin
1    Mikhail Gorbachev
dtype: object
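The dtype of a Series depends on what it is given. This sketch shows a list of integers, a NumPy array of floats, a mixed list, and a Series with custom labels; the label names `first` and `second` are made up for illustration.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2])                      # integer dtype
from_array = pd.Series(np.array([1.5, 2.5]))  # float dtype from a NumPy array
mixed = pd.Series([1, "Boris Yeltsin"])       # mixed types fall back to object

# The index argument replaces the default 0, 1, ... labels.
labelled = pd.Series([9.0, 8.5], index=["first", "second"])

print(ints.dtype, from_array.dtype, mixed.dtype)
print(labelled["second"])
```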

### Creating A DataFrame in Pandas

We can create a DataFrame by passing multiple Series into the DataFrame class. Here, we pass in the two Series objects we just created, s1 as the first row, and s2 as the second row: 

In [84]:
pd.DataFrame([s1, s2])

Unnamed: 0,0,1
0,1,2
1,Boris Yeltsin,Mikhail Gorbachev


We can also accomplish the same thing with a list of lists. Each inner list is treated as a row in the resulting DataFrame: 

In [81]:
# Without the variables
pd.DataFrame(
[
[1,2],    
["Boris Yeltsin", "Mikhail Gorbachev"]
])

Unnamed: 0,0,1
0,1,2
1,Boris Yeltsin,Mikhail Gorbachev


In [91]:
# Same list of lists, now with column names
pd.DataFrame(
[
[1,2],    
["Boris Yeltsin", "Mikhail Gorbachev"]
],
columns=['Column1', 'Column2']
)

Unnamed: 0,Column1,Column2
0,1,2
1,Boris Yeltsin,Mikhail Gorbachev


In [92]:
# Same again, with both column names and row labels
pd.DataFrame(
[
[1,2], ["Boris Yeltsin", "Mikhail Gorbachev"]
], index=['row1','row2'], columns=['Columns1','Columns2']
)

Unnamed: 0,Columns1,Columns2
row1,1,2
row2,Boris Yeltsin,Mikhail Gorbachev


In [101]:
# Assign the new df to a variable
new_df = pd.DataFrame(
[
[1,2], ["Boris Yeltsin", "Mikhail Gorbachev"]     
], index=['row1','row2'], columns=['column1','column2']
)

# Select by the new row labels with .loc[]
print(new_df.loc['row1':'row2', 'column1'])

row1                1
row2    Boris Yeltsin
Name: column1, dtype: object


We can skip specifying the columns keyword argument if we pass a dictionary into the DataFrame constructor. This will automatically set up column names: 

In [102]:
# Specify new dataframe without the columns keyword argument
# Pass a dictionary into the DataFrame constructor
new_df = pd.DataFrame(
    {'column1': [1, 'Boris'],
     'column2': [2, 'Mikhail']}
)
new_df

Unnamed: 0,column1,column2
0,1,2
1,Boris,Mikhail
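The two construction styles differ in orientation: a list of lists is read row by row, while a dictionary is read column by column. A minimal sketch building the same frame both ways; the `names_and_scores` frame at the end is a made-up example of the more common case where each column holds one type.

```python
import pandas as pd

# Row-oriented: each inner list is one row.
from_rows = pd.DataFrame(
    [[1, 2], ["Boris", "Mikhail"]],
    columns=["column1", "column2"],
)

# Column-oriented: each key is a column name, each value is that column.
from_dict = pd.DataFrame(
    {"column1": [1, "Boris"], "column2": [2, "Mikhail"]}
)

# Keeping one type per column lets pandas pick a numeric dtype for score.
names_and_scores = pd.DataFrame(
    {"name": ["Boris", "Mikhail"], "score": [9.0, 8.5]}
)

print(from_rows.values.tolist() == from_dict.values.tolist())
print(names_and_scores.dtypes)
```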


### Pandas DataFrame Methods

As we mentioned earlier, each column in a DataFrame is a Series object: 

In [103]:
type(data["title"])

pandas.core.series.Series

We can call most of the same methods on a Series object that we can on a DataFrame, including head: 

In [104]:
# Call head() method on the Series object
data['title'].head()

0                              LittleBigPlanet PS Vita
1    LittleBigPlanet PS Vita -- Marvel Super Hero E...
2                                 Splice: Tree of Life
3                                               NHL 13
4                                               NHL 13
Name: title, dtype: object

In [108]:
some_data

Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
10,Good,Tekken Tag Tournament 2,/games/tekken-tag-tournament-2/ps3-124584,PlayStation 3,7.5,Fighting,N,2012,9,11
11,Good,Tekken Tag Tournament 2,/games/tekken-tag-tournament-2/xbox-360-124581,Xbox 360,7.5,Fighting,N,2012,9,11
12,Good,Wild Blood,/games/wild-blood/iphone-139363,iPhone,7.0,,N,2012,9,10
13,Amazing,Mark of the Ninja,/games/mark-of-the-ninja-135615/xbox-360-129276,Xbox 360,9.0,"Action, Adventure",Y,2012,9,7
14,Amazing,Mark of the Ninja,/games/mark-of-the-ninja-135615/pc-143761,PC,9.0,"Action, Adventure",Y,2012,9,7
15,Okay,Home: A Unique Horror Adventure,/games/home-a-unique-horror-adventure/mac-2001...,Macintosh,6.5,Adventure,N,2012,9,6
16,Okay,Home: A Unique Horror Adventure,/games/home-a-unique-horror-adventure/pc-137135,PC,6.5,Adventure,N,2012,9,6
17,Great,Avengers Initiative,/games/avengers-initiative/iphone-141579,iPhone,8.0,Action,N,2012,9,5
18,Mediocre,Way of the Samurai 4,/games/way-of-the-samurai-4/ps3-23516,PlayStation 3,5.5,"Action, Adventure",N,2012,9,3
19,Good,JoJo's Bizarre Adventure HD,/games/jojos-bizarre-adventure/xbox-360-137717,Xbox 360,7.0,Fighting,N,2012,9,3


In [114]:
(data.iloc[:5,:])  # the first 5 rows, and all of the columns for those rows.
# print(data.iloc[:,:])   # the entire DataFrame.
# print(data.iloc[5:,5:]) # rows from position 5 onwards, and columns from position 5 onwards.
# print(data.iloc[:,0])   # the first column, and all of the rows for the column.
# print(data.iloc[9,:])   # the 10th row, and all of the columns for that row.

Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
1,Amazing,LittleBigPlanet PS Vita -- Marvel Super Hero E...,/games/littlebigplanet-ps-vita-marvel-super-he...,PlayStation Vita,9.0,Platformer,Y,2012,9,12
2,Great,Splice: Tree of Life,/games/splice/ipad-141070,iPad,8.5,Puzzle,N,2012,9,12
3,Great,NHL 13,/games/nhl-13/xbox-360-128182,Xbox 360,8.5,Sports,N,2012,9,11
4,Great,NHL 13,/games/nhl-13/ps3-128181,PlayStation 3,8.5,Sports,N,2012,9,11


In [122]:
# First 3 rows, columns from position 1 onwards
data.iloc[:3,1:]

Unnamed: 0,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
1,LittleBigPlanet PS Vita -- Marvel Super Hero E...,/games/littlebigplanet-ps-vita-marvel-super-he...,PlayStation Vita,9.0,Platformer,Y,2012,9,12
2,Splice: Tree of Life,/games/splice/ipad-141070,iPad,8.5,Puzzle,N,2012,9,12


In [123]:
# Second column (position 1)
(data.iloc[:,1]).head()

0                              LittleBigPlanet PS Vita
1    LittleBigPlanet PS Vita -- Marvel Super Hero E...
2                                 Splice: Tree of Life
3                                               NHL 13
4                                               NHL 13
Name: title, dtype: object

In [124]:
# First row
(data.iloc[:1,]).head()

Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12


Pandas Series and DataFrames also have other methods that make calculations simpler. For example, we can use the pandas.Series.mean method to find the mean of a Series: 

In [126]:
# Get mean()
data['score'].mean()

6.950459060402685

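We can also call the similar pandas.DataFrame.mean method, which finds the mean of each numerical column in a DataFrame. Older pandas versions skip non-numeric columns by default; from pandas 2.0 onwards this must be requested with numeric_only=True: 

In [127]:
# Get mean() for every numerical column
data.mean(numeric_only=True)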

score               6.950459
release_year     2006.515329
release_month       7.138470
release_day        15.603866
dtype: float64

We can modify the axis keyword argument to mean in order to compute the mean of each row or of each column. By default, axis is equal to 0, and will compute the mean of each column. We can also set it to 1 to compute the mean of each row. Note that with numeric_only=True this only uses the numerical values in each row: 

In [128]:
data.mean(axis=1, numeric_only=True)

0        510.500
1        510.500
2        510.375
3        510.125
4        510.125
5        509.750
6        508.750
7        510.250
8        508.750
9        509.750
10       509.875
11       509.875
12       509.500
13       509.250
14       509.250
15       508.375
16       508.375
17       508.500
18       507.375
19       507.750
20       507.750
21       514.625
22       514.625
23       514.625
24       515.000
25       514.250
26       514.750
27       514.125
28       514.250
29       513.625
          ...   
18595    510.850
18596    510.875
18597    510.225
18598    510.700
18599    510.750
18600    512.600
18601    512.600
18602    512.600
18603    512.450
18604    512.400
18605    511.500
18606    508.600
18607    510.750
18608    510.350
18609    510.750
18610    510.250
18611    508.700
18612    509.200
18613    508.000
18614    515.050
18615    515.050
18616    508.375
18617    508.600
18618    515.025
18619    514.725
18620    514.650
18621    515.000
18622    513.9
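The row means above can be reproduced by hand: for the first row, (9.0 + 2012 + 9 + 12) / 4 = 510.5. A minimal sketch on a two-row stand-in frame, using the values of rows 0 and 2 from the output above:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "title": ["LittleBigPlanet PS Vita", "Splice: Tree of Life"],
        "score": [9.0, 8.5],
        "release_year": [2012, 2012],
        "release_month": [9, 9],
        "release_day": [12, 12],
    }
)

# axis=0 (the default): one mean per numeric column.
column_means = sample.mean(numeric_only=True)

# axis=1: one mean per row, over the numeric columns only.
row_means = sample.mean(axis=1, numeric_only=True)

print(column_means)
print(row_means.tolist())
```

A row mean mixes scores, years, months, and days, so it is useful here only as a demonstration of the axis argument.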

There are quite a few methods on Series and DataFrames that behave like mean. Here are some handy ones:

- pandas.DataFrame.corr — finds the correlation between columns in a DataFrame.
- pandas.DataFrame.count — counts the number of non-null values in each DataFrame column.
- pandas.DataFrame.max — finds the highest value in each column.
- pandas.DataFrame.min — finds the lowest value in each column.
- pandas.DataFrame.median — finds the median of each column.
- pandas.DataFrame.std — finds the standard deviation of each column.
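A few of these methods, sketched on a made-up four-row frame with one missing genre so that count has something to skip:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "score": [9.0, 8.5, 3.0, 7.0],
        "genre": ["Platformer", "Puzzle", "Fighting", None],
    }
)

non_null = sample.count()            # non-null values per column
highest = sample["score"].max()
lowest = sample["score"].min()
middle = sample["score"].median()    # mean of the two middle values here
spread = sample["score"].std()       # sample standard deviation

print(non_null.to_dict())
print(highest, lowest, middle)
```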

We can use the corr method to see if any columns correlate with score. For instance, this would tell us if games released more recently have been getting higher reviews (release_year), or if games released towards the end of the year score better (release_month): 

In [129]:
# Use DataFrame.corr() to see if there are any correlations
# numeric_only needs pandas 1.5+; older versions skip non-numeric columns automatically
data.corr(numeric_only=True)

Unnamed: 0,score,release_year,release_month,release_day
score,1.0,0.062716,0.007632,0.020079
release_year,0.062716,1.0,-0.115515,0.016867
release_month,0.007632,-0.115515,1.0,-0.067964
release_day,0.020079,0.016867,-0.067964,1.0


As you can see above, none of our numeric columns correlates with score, meaning that release timing doesn’t linearly relate to review score. 
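To get a feel for what the values in a correlation matrix mean, it helps to look at columns whose relationship is known in advance. In this sketch the columns `double_score` and `inverse_score` are made up so that they correlate perfectly (+1 and -1) with score; select_dtypes is used to keep only the numeric columns.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],            # non-numeric, excluded below
        "score": [9.0, 8.5, 3.0, 7.0],
        "double_score": [18.0, 17.0, 6.0, 14.0],  # score * 2
        "inverse_score": [1.0, 1.5, 7.0, 3.0],    # 10 - score
    }
)

numeric = sample.select_dtypes(include="number")
corr = numeric.corr()

print(corr.round(3))
```

Values near 0, like those in the table above, indicate no linear relationship, while values near +1 or -1 indicate a strong one.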

### DataFrame Math with Pandas

We can also perform math operations on Series or DataFrame objects. For example, we can divide every value in the score column by 2 to switch the scale from 0–10 to 0–5: 

In [130]:
# Divide score by 2
data["score"] / 2

0        4.50
1        4.50
2        4.25
3        4.25
4        4.25
5        3.50
6        1.50
7        4.50
8        1.50
9        3.50
10       3.75
11       3.75
12       3.50
13       4.50
14       4.50
15       3.25
16       3.25
17       4.00
18       2.75
19       3.50
20       3.50
21       3.75
22       3.75
23       3.75
24       4.50
25       3.50
26       4.50
27       3.75
28       4.00
29       3.25
         ... 
18595    2.20
18596    3.25
18597    2.45
18598    3.40
18599    3.50
18600    3.70
18601    3.70
18602    3.70
18603    3.90
18604    4.30
18605    3.00
18606    3.20
18607    3.50
18608    2.70
18609    4.00
18610    3.00
18611    2.90
18612    3.90
18613    4.00
18614    4.60
18615    4.60
18616    3.75
18617    4.20
18618    4.55
18619    3.95
18620    3.80
18621    4.50
18622    2.90
18623    5.00
18624    5.00
Name: score, Length: 18625, dtype: float64
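The division returns a new Series and leaves the original column untouched. To keep the result, it has to be assigned somewhere. A minimal sketch on a three-row stand-in frame; the column name `score_out_of_5` is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({"score": [9.0, 8.5, 3.0]})

halved = sample["score"] / 2          # new Series, element by element

# Store the result as a new column (hypothetical name).
sample["score_out_of_5"] = halved

print(sample)
```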

### Boolean Indexing in Pandas

As we saw above, the mean of all the values in the score column of reviews is around 7. What if we wanted to find all the games that got an above average score? We could start by doing a comparison. The comparison compares each value in a Series to a specified value, then generates a Series full of Boolean values indicating the status of the comparison. For example, we can see which of the rows have a score value higher than 7: 

In [131]:
score_filter = data["score"] > 7
score_filter

0         True
1         True
2         True
3         True
4         True
5        False
6        False
7         True
8        False
9        False
10        True
11        True
12       False
13        True
14        True
15       False
16       False
17        True
18       False
19       False
20       False
21        True
22        True
23        True
24        True
25       False
26        True
27        True
28        True
29       False
         ...  
18595    False
18596    False
18597    False
18598    False
18599    False
18600     True
18601     True
18602     True
18603     True
18604     True
18605    False
18606    False
18607    False
18608    False
18609     True
18610    False
18611    False
18612     True
18613     True
18614     True
18615     True
18616     True
18617     True
18618     True
18619     True
18620     True
18621     True
18622    False
18623     True
18624     True
Name: score, Length: 18625, dtype: bool

In [132]:
score_filter.count()

18625
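Note that count returns the number of non-null entries in the Boolean Series, not the number of True values, which is why it equals the total row count. Since True counts as 1, sum gives the number of matches. A minimal sketch using the first ten scores from the output above:

```python
import pandas as pd

scores = pd.Series(
    [9.0, 9.0, 8.5, 8.5, 8.5, 7.0, 3.0, 9.0, 3.0, 7.0], name="score"
)
score_filter = scores > 7

total_entries = score_filter.count()   # every non-null entry, True or False
matches = int(score_filter.sum())      # number of True values
share = float(score_filter.mean())     # fraction of True values

print(total_entries, matches, share)
```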

In [137]:
score_filter.head()

0    True
1    True
2    True
3    True
4    True
Name: score, dtype: bool

Once we have a Boolean Series, we can use it to select only the rows in a DataFrame where the Series contains the value True. So, we can select only the rows in data where score is greater than 7: 

In [138]:
filtered_df = data[score_filter]
filtered_df.head()

Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
1,Amazing,LittleBigPlanet PS Vita -- Marvel Super Hero E...,/games/littlebigplanet-ps-vita-marvel-super-he...,PlayStation Vita,9.0,Platformer,Y,2012,9,12
2,Great,Splice: Tree of Life,/games/splice/ipad-141070,iPad,8.5,Puzzle,N,2012,9,12
3,Great,NHL 13,/games/nhl-13/xbox-360-128182,Xbox 360,8.5,Sports,N,2012,9,11
4,Great,NHL 13,/games/nhl-13/ps3-128181,PlayStation 3,8.5,Sports,N,2012,9,11


In [139]:
filtered_df["score"].mean()

8.237275510204082

In [140]:
data["score"].mean()

6.950459060402685

In [141]:
filtered_df.count()

score_phrase      9800
title             9800
url               9800
platform          9800
score             9800
genre             9785
editors_choice    9800
release_year      9800
release_month     9800
release_day       9800
dtype: int64

In [143]:
filtered_df[:1:].head()

Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12


In [146]:
filtered_df.iloc[:1,1:].head()

Unnamed: 0,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12


In [147]:
filtered_df.iloc[:1,:].head()

Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12


In [148]:
filtered_df.iloc[:2,:].head()

Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
1,Amazing,LittleBigPlanet PS Vita -- Marvel Super Hero E...,/games/littlebigplanet-ps-vita-marvel-super-he...,PlayStation Vita,9.0,Platformer,Y,2012,9,12


In [149]:
# First 2 rows, columns from position 4 onwards
filtered_df.iloc[:2,4:].head()

Unnamed: 0,score,genre,editors_choice,release_year,release_month,release_day
0,9.0,Platformer,Y,2012,9,12
1,9.0,Platformer,Y,2012,9,12


In [150]:
# First 2 rows with first column
filtered_df.iloc[:2,0].head()

0    Amazing
1    Amazing
Name: score_phrase, dtype: object

In [153]:
# First 2 rows with first column to 2nd (slices only first two columns)
filtered_df.iloc[:2,0:2].head()

Unnamed: 0,score_phrase,title
0,Amazing,LittleBigPlanet PS Vita
1,Amazing,LittleBigPlanet PS Vita -- Marvel Super Hero E...


It’s possible to use multiple conditions for filtering. Let’s say we want to find games released for the Xbox One that have a score of more than 7. In the below code, we:

- Set up a filter with two conditions:
    - Check if score is greater than 7.
    - Check if platform equals Xbox One.
- Apply the filter to data to get only the rows we want, and assign the result to filtered_reviews.
- Use the head method to print the first 5 rows of filtered_reviews.


In [155]:
# Get Xbox One with scores over 7
xbox_one_filter = (data["score"] > 7) & (data["platform"] == "Xbox One")
filtered_reviews = data[xbox_one_filter]
filtered_reviews.head()

Unnamed: 0,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
17137,Amazing,Gone Home,/games/gone-home/xbox-one-20014361,Xbox One,9.5,Simulation,Y,2013,8,15
17197,Amazing,Rayman Legends,/games/rayman-legends/xbox-one-20008449,Xbox One,9.5,Platformer,Y,2013,8,26
17295,Amazing,LEGO Marvel Super Heroes,/games/lego-marvel-super-heroes/xbox-one-20000826,Xbox One,9.0,Action,Y,2013,10,22
17313,Great,Dead Rising 3,/games/dead-rising-3/xbox-one-124306,Xbox One,8.3,Action,N,2013,11,18
17317,Great,Killer Instinct,/games/killer-instinct-2013/xbox-one-20000538,Xbox One,8.4,Fighting,N,2013,11,18


In [156]:
# Only 140 Xbox One reviews have a score above 7
filtered_reviews.count()

score_phrase      140
title             140
url               140
platform          140
score             140
genre             140
editors_choice    140
release_year      140
release_month     140
release_day       140
dtype: int64

#### When filtering with multiple conditions, it’s important to put each condition in parentheses, and separate them with a single ampersand (&). 
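The parentheses matter because Python's & operator binds more tightly than comparisons such as > and ==, so an unparenthesized filter is not grouped the way it reads. The same pattern extends to | (or) and ~ (not). A minimal sketch on a made-up four-row frame:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "platform": ["Xbox One", "Xbox One", "PlayStation 4", "PC"],
        "score": [9.5, 6.0, 8.0, 5.0],
    }
)

is_xbox_one = sample["platform"] == "Xbox One"
high_score = sample["score"] > 7

both = sample[(sample["score"] > 7) & (sample["platform"] == "Xbox One")]
either = sample[high_score | is_xbox_one]   # at least one condition holds
not_xbox_one = sample[~is_xbox_one]         # condition negated

print(both.index.tolist())
print(either.index.tolist())
print(not_xbox_one.index.tolist())
```

Storing each condition in a named variable first, as with `is_xbox_one` and `high_score`, avoids the precedence issue and keeps long filters readable.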

### Pandas Plotting

Now that we know how to filter, we can create plots to compare the review distribution for the Xbox One with the review distribution for the PlayStation 4. This will help us figure out which console has better games. We can do this via a histogram, which plots the frequencies for different score ranges and so shows which console has more highly reviewed games.

We can make a histogram for each console using the pandas.DataFrame.plot method. This method utilizes matplotlib, the popular Python plotting library, under the hood to generate good-looking plots. The plot method defaults to drawing a line graph, so we’ll need to pass in the keyword argument kind="hist" to draw a histogram instead. In the below code, we:

- Call %matplotlib to set up plotting (in a Jupyter notebook, %matplotlib inline renders the plots inside the notebook rather than in a separate window).
- Filter data to only have rows about the Xbox One.
- Plot the score column.


In [157]:
%matplotlib

Using matplotlib backend: TkAgg


In [160]:
data[data["platform"] == "Xbox One"]["score"].plot(kind="hist")

AttributeError: module 'pandas' has no attribute 'pause'

In [None]:
data[data["platform"] == "PlayStation 4"]["score"].plot(kind="hist")

In [161]:
filtered_reviews["score"].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1945a4ef978>
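The numbers behind a histogram can also be computed directly, which makes the two consoles easy to compare without a plotting backend. This is a sketch on made-up scores, not the real dataset: the helper `score_histogram` and the bin edges are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up reviews: four per console.
sample = pd.DataFrame(
    {
        "platform": ["Xbox One"] * 4 + ["PlayStation 4"] * 4,
        "score": [9.5, 8.3, 6.0, 7.5, 9.0, 8.0, 8.5, 4.0],
    }
)

# Bin edges; each bin excludes its left edge and includes its right edge.
bins = [0, 2, 4, 6, 8, 10]


def score_histogram(df, platform):
    """Count how many scores of one platform fall into each bin."""
    scores = df.loc[df["platform"] == platform, "score"]
    return pd.cut(scores, bins=bins).value_counts().sort_index()


xbox_counts = score_histogram(sample, "Xbox One")
ps4_counts = score_histogram(sample, "PlayStation 4")

print(xbox_counts)
print(ps4_counts)
```

Passing the same bins to plot(kind="hist", bins=bins) for both consoles should keep the two histograms directly comparable.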