PANDAS

To represent tabular data, pandas uses a custom data structure called a dataframe. A dataframe is a highly efficient, 2-dimensional data structure that provides a suite of methods and attributes to quickly explore, analyze, and visualize data. The dataframe is similar to the NumPy 2D array but adds support for many features that help you work with tabular data.

One of the biggest advantages that pandas has over NumPy is the ability to store mixed data types in one table. Many tabular datasets contain a range of data types, and a pandas dataframe handles them by giving each column its own data type, while a regular NumPy array has a single data type for all of its elements. Pandas dataframes can also handle missing values gracefully, using NaN (the floating-point "not a number" value, the same one NumPy exposes as np.nan) to represent them, and most pandas methods skip these values by default. A common complaint with NumPy is that it has no general-purpose object to represent missing values for every data type, so people end up having to find and replace these values manually. In addition, pandas dataframes contain axis labels for both rows and columns, which let you refer to elements in the dataframe more intuitively. Since many tabular datasets contain column titles, this means that dataframes preserve the metadata from the file around the data.

In [2]:
import numpy as np
np.none

AttributeError: module 'numpy' has no attribute 'none'
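
The contrast described above can be sketched on a tiny example. This is only an illustration: the column names and values are made up and are not taken from the reviews dataset.

```python
import numpy as np
import pandas as pd

# A regular NumPy array has one dtype, so mixing numbers and text
# coerces every element to a string.
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype)

# A DataFrame keeps a separate dtype per column and marks the gap with NaN.
# (Hypothetical data, for illustration only.)
df = pd.DataFrame({"title": ["A", "B", "C"], "score": [9.0, None, 7.5]})
print(df.dtypes)

# isna() flags the missing entry, and mean() skips it by default.
print(df["score"].isna())
print(df["score"].mean())
```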

In this tutorial, we'll use Pandas to analyze data on video game reviews.
To work with the data in Python, we'll need to read the csv file into a Pandas DataFrame. A DataFrame is a way to represent and work with tabular data. Tabular data has rows and columns, just like our csv file.
To read in the data, we'll use the pandas.read_csv function, which takes a csv file and returns a DataFrame. The code below reads ign.csv into a DataFrame named reviews and prints its column names:

In [1]:
import pandas as pd
reviews = pd.read_csv("ign.csv")
# dataframe.columns will provide the column list of dataset as shown in code below:
print(reviews.columns)

Index(['Unnamed: 0', 'score_phrase', 'title', 'url', 'platform', 'score',
       'genre', 'editors_choice', 'release_year', 'release_month',
       'release_day'],
      dtype='object')
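
A column named 'Unnamed: 0' usually appears when the first column of the csv file has an empty header, which is what happens when a DataFrame's index is written out along with the data. The sketch below reproduces this with a small in-memory csv (the rows are invented, not taken from ign.csv) and shows that the index_col parameter of read_csv can turn that column into the index instead.

```python
import io
import pandas as pd

# A small stand-in for a csv file whose first header cell is empty.
csv_text = """,title,platform,score
0,Game A,PC,8.5
1,Game B,Xbox One,6.0
"""

# Read as-is: the unnamed first column becomes 'Unnamed: 0'.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.columns.tolist())

# Read again, using the first column as the row index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```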


Once we read in a DataFrame, Pandas gives us two methods that make it fast to look at the data:
pandas.DataFrame.head -- returns the first N rows of a DataFrame. By default 5.
pandas.DataFrame.tail -- returns the last N rows of a DataFrame. By default 5.
We'll use the head method to see what's in reviews:

In [None]:
reviews.head()

The columns contain information about that game:

score_phrase — how IGN described the game in one word. This is linked to the score it received.
title — the name of the game.
url — the URL where you can see the full review.
platform — the platform the game was reviewed on (PC, PS4, etc).
score — the score for the game, from 1.0 to 10.0.
genre — the genre of the game.
editors_choice — N if the game wasn't an editor's choice, Y if it was. This is tied to score.
release_year — the year the game was released.
release_month — the month the game was released.
release_day — the day the game was released.
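
The head and tail methods both accept the number of rows to return. A minimal sketch on a made-up ten-row DataFrame (the column name n is only for illustration):

```python
import pandas as pd

# Ten rows numbered 0 through 9.
df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)   # rows 0, 1, 2
last_two = df.tail(2)      # rows 8, 9
default_head = df.head()   # five rows when no argument is given

print(first_three)
print(last_two)
```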

In [6]:
reviews.shape

(18625, 11)

In [9]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18625 entries, 0 to 18624
Data columns (total 11 columns):
Unnamed: 0        18625 non-null int64
score_phrase      18625 non-null object
title             18625 non-null object
url               18625 non-null object
platform          18625 non-null object
score             18625 non-null float64
genre             18589 non-null object
editors_choice    18625 non-null object
release_year      18625 non-null int64
release_month     18625 non-null int64
release_day       18625 non-null int64
dtypes: float64(1), int64(4), object(6)
memory usage: 1.6+ MB


In [10]:
reviews.describe()

        Unnamed: 0     score  release_year  release_month  release_day
count      18625.0   18625.0       18625.0        18625.0      18625.0
mean        9312.0  6.950459   2006.515329        7.13847    15.603866
std    5376.718717  1.711736      4.587529        3.47671     8.690128
min            0.0       0.5        1970.0            1.0          1.0
25%         4656.0       6.0        2003.0            4.0          8.0
50%         9312.0       7.3        2007.0            8.0         16.0
75%        13968.0       8.2        2010.0           10.0         23.0
max        18624.0      10.0        2016.0           12.0         31.0


As you can see, everything has been read in properly -- we have 18625 rows and 11 columns.
One of the big advantages of Pandas vs just using NumPy is that Pandas allows you to have columns with different data types. reviews has columns that store float values, like score, string values, like score_phrase, and integers, like release_year.
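
To see the per-column data types directly, the dtypes attribute lists them, and select_dtypes picks out columns by type. The sketch below uses a two-row stand-in with the same column names as reviews; the values are invented.

```python
import pandas as pd

# A tiny stand-in for reviews with text, float, and integer columns.
sample = pd.DataFrame({
    "score_phrase": ["Great", "Okay"],
    "score": [8.5, 6.0],
    "release_year": [2012, 2014],
})

# One dtype per column.
print(sample.dtypes)

# Keep only the numeric columns.
numeric = sample.select_dtypes(include="number")
print(numeric.columns.tolist())
```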

In [None]:
# Indexing DataFrames with Pandas using the pandas.DataFrame.iloc method
# The code below will replicate reviews.head():
reviews.iloc[:5, :]

In [None]:
# Here are some indexing examples, along with the results:
# the entire DataFrame.
reviews.iloc[:,:]

In [None]:
# rows from position 5 onwards, and columns from position 5 onwards
reviews.iloc[5:, 5:]

In [None]:
# the first column, and all of the rows for the column.
reviews.iloc[:,0]

In [None]:
# the 10th row, and all of the columns for that row.
reviews.iloc[9,:]

In [None]:
# Now that we know how to index by position, let's remove the first column, which doesn't have any useful information:
reviews = reviews.iloc[:, 1:]
reviews.head()

In [None]:
# the first three rows, and every column from position 1 onwards
reviews.iloc[:3, 1:]

In [None]:
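# loc selects by label and includes the end label, so this returns the rows labelled 0 through 3 (four rows)
reviews.loc[:3]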
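
The difference between iloc and loc is easy to miss when the row labels happen to be 0, 1, 2, and so on. The sketch below, on made-up scores, shows that iloc slices by position and excludes the end, while loc slices by label and includes it.

```python
import pandas as pd

df = pd.DataFrame({"score": [9.0, 8.5, 7.0, 6.5, 5.0]})

by_position = df.iloc[:3]  # positions 0, 1, 2
by_label = df.loc[:3]      # labels 0, 1, 2, 3 (end label included)
print(len(by_position), len(by_label))

# With string labels the two styles can no longer be confused.
relabeled = df.copy()
relabeled.index = ["a", "b", "c", "d", "e"]
label_slice = relabeled.loc["b":"d"]   # rows b, c, d
position_slice = relabeled.iloc[1:3]   # rows b, c
print(label_slice.index.tolist(), position_slice.index.tolist())
```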

Pandas Series
The Series object is a core data structure that pandas uses to represent rows and columns. A Series is a labelled collection of values similar to a one-dimensional NumPy array (a vector). The main advantage of Series objects is the ability to use non-integer labels, such as strings. NumPy arrays are indexed by integer position rather than by label.
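
As a small illustration of labelled access, the sketch below builds a Series with string labels (the game names are placeholders) and reads the same value by label, by position, and from the underlying NumPy array.

```python
import pandas as pd

# String labels instead of the default 0, 1, 2, ...
scores = pd.Series([9.0, 8.5], index=["Game A", "Game B"])

by_label = scores["Game B"]    # label-based lookup
by_position = scores.iloc[1]   # position-based lookup

# The underlying NumPy array only knows about integer positions.
arr = scores.to_numpy()
from_array = arr[1]

print(by_label, by_position, from_array)
```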

In [None]:
reviews["score"]

In [None]:
# We can also use lists of columns with this method:
reviews[["score", "release_year"]]

In [None]:
# We can verify that a single column is a Series:
type(reviews['score'])

In [None]:
# We can create a Series manually to better understand how it works.
# To create a Series, we pass a list or NumPy array into the Series object when we instantiate it:
s1 = pd.Series([1,2])
s1

In [None]:
s2 = pd.Series(['Pandas', 'are', 'better', 'than', 'Numpy', 99, 100])
s2

In [None]:
print(s2[5])
type(s2[5])

In [None]:
print(s2[2])
type(s2[2])

Creating A DataFrame in Pandas
We can create a DataFrame by passing multiple Series into the DataFrame class. Here, we pass in s1 and a new two-element s2, with s1 as the first row and s2 as the second row.

In [None]:
s2 = pd.Series(['Python', 'Programming'])
df = pd.DataFrame([s1,s2])
df.head()

In [None]:
pd.DataFrame(
    [
        [1,2],
        ['Python', 'Programming']
    ]
)

In [None]:
pd.DataFrame(
    [
        [1,2],
        ['Python', 'Programming']
    ],
    columns=["column1", "column2"]
)

In [None]:
# pandas series from dictionary
a_d ={'name': 'Diljeet', 'city': 'Delhi', 'email': 'diljeet@gmail.com'}
print (a_d)
print('-'*20)
s = pd.Series(a_d)
print (s)

print(s.loc['name'])
print(s.iloc[0])

In [None]:
# pandas series from dictionary of lists
data ={'name': ['Diljeet', 'Kishan', 'Prakhar'], 
      'city': ['Delhi', 'Gurgaon', 'Delhi'], 
      'email': ['diljeet@gmail.com', 'Kishan@gmail.com', 'Prakhar@gmail.com']}

print (data)
print('-'*20)
s = pd.Series(data)
print (s)

# print(s.loc['name'])
# print(s.iloc[0])
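
Passing a dictionary of lists to pd.Series gives one element per key, and each element is a whole list. If a table with one column per key is what we want, the same dictionary can be passed to pd.DataFrame instead. A short sketch with a trimmed-down version of the dictionary above:

```python
import pandas as pd

data = {"name": ["Diljeet", "Kishan"], "city": ["Delhi", "Gurgaon"]}

# Series: two elements, each one a Python list.
as_series = pd.Series(data)
print(as_series)

# DataFrame: two rows and two columns.
as_frame = pd.DataFrame(data)
print(as_frame)
```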

In [None]:
# Pandas DataFrame Methods
reviews["score"].mean()  # gets mean of column

In [None]:
# to get the mean of each row (numeric_only=True limits the calculation to the numeric columns)
reviews.mean(axis=1, numeric_only=True)
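
The axis argument decides the direction of the calculation: the default, axis=0, gives one result per column, and axis=1 gives one result per row. A minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["A", "B"],
    "score": [8.0, 6.0],
    "release_year": [2010, 2012],
})

# One mean per numeric column (axis=0 is the default).
col_means = df.mean(numeric_only=True)
print(col_means)

# One mean per row, across the numeric columns.
row_means = df.mean(axis=1, numeric_only=True)
print(row_means)
```

Averaging a score with a release year is not meaningful in itself; the point is only to show which way each axis runs.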

In [None]:
reviews['score'].count()

In [None]:
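# numeric_only=True limits each summary to the numeric columns
print(reviews.max(numeric_only=True))

In [None]:
print(reviews.min(numeric_only=True))

In [None]:
print(reviews.median(numeric_only=True))

In [None]:
print(reviews.std(numeric_only=True))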

In [None]:
# DataFrame Math with Pandas
# All the common mathematical operators that work in Python, like +, -, *, /, and ** (exponentiation) will work, 
# and will apply to each element in a DataFrame or a Series.
reviews['score'] / 2
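
A quick sketch of element-wise math on a small made-up Series. Note that exponentiation in Python is written **, while ^ is the bitwise XOR operator.

```python
import pandas as pd

scores = pd.Series([4.0, 9.0])

halved = scores / 2     # each element divided by 2
squared = scores ** 2   # each element squared

# Two Series are combined element by element, matched on their labels.
total = scores + pd.Series([1.0, 1.0])

print(halved.tolist(), squared.tolist(), total.tolist())
```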

In [None]:
score_filter = reviews["score"] > 7
score_filter

In [None]:
filtered_reviews = reviews[score_filter]
filtered_reviews.head()

In [None]:
# When filtering with multiple conditions, it's important to put each condition in parentheses, 
# and separate them with a single ampersand (&).
xbox_one_filter = (reviews["score"] > 7) & (reviews["platform"] == "Xbox One")
filtered_reviews = reviews[xbox_one_filter]
filtered_reviews.head()
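
Besides & (and), conditions can be combined with | (or) and inverted with ~ (not). The sketch below uses a four-row stand-in for reviews with invented values.

```python
import pandas as pd

df = pd.DataFrame({
    "platform": ["Xbox One", "PC", "Xbox One", "PlayStation 4"],
    "score": [8.0, 9.0, 5.0, 6.0],
})

high = df["score"] > 7
xbox = df["platform"] == "Xbox One"

both = df[high & xbox]     # high-scoring Xbox One rows
either = df[high | xbox]   # high-scoring rows or Xbox One rows
not_xbox = df[~xbox]       # every row that is not Xbox One

print(both.index.tolist(), either.index.tolist(), not_xbox.index.tolist())
```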

GroupBy

In [None]:
reviews.head()

In [None]:
reviews.groupby(by=['score_phrase'])[['score']].max()

In [None]:
reviews.groupby(by=['title', 'platform'])[['score']].sum()

In [4]:
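import numpy as np
# A dict cannot repeat a key: in {'score': np.sum, 'score': [np.max, np.mean]} the second
# 'score' entry replaces the first, so only the max and mean were ever computed.
reviews.groupby(by=['title', 'platform']).agg({'score': [np.max, np.mean]})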

                                             score
                                              amax  mean
title                          platform
#IDARB                         Xbox One        7.5   7.5
'Splosion Man                  Xbox 360        9.0   9.0
.deTuned                       PlayStation 3   4.0   4.0
.hack//G.U. Vol. 1: Rebirth    PlayStation 2   5.0   5.0
.hack//G.U. Vol. 2: Reminisce  PlayStation 2   5.5   5.5
.hack//G.U. Vol.3: Redemption  PlayStation 2   5.5   5.5
.hack//INFECTION (Part 1)      PlayStation 2   8.5   8.5
.hack//MUTATION (Part 2)       PlayStation 2   8.4   8.4
.hack//OUTBREAK (Part 3)       PlayStation 2   8.4   8.4
.hack//QUARANTINE (Part 4)     PlayStation 2   8.3   8.3
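
When several statistics of the same column are needed, one option is named aggregation, where each output column gets its own name. This is a sketch on invented data, and the output names total, best, and average are arbitrary choices for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "platform": ["PC", "PC", "Xbox One"],
    "score": [8.0, 6.0, 7.0],
})

# Each keyword is an output column: name=(input column, aggregation).
summary = df.groupby("platform").agg(
    total=("score", "sum"),
    best=("score", "max"),
    average=("score", "mean"),
)
print(summary)
```

The result has flat column names, which can be easier to work with than the two-level header produced by passing a list of functions.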


In [None]:
grp = reviews['score'].apply(lambda score: score+100 if score%2==0  else score)

In [None]:
grp[grp%2==0]
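
The apply call above runs the lambda once per value. The same rule can also be written without a Python-level function by using Series.where, which keeps a value where the condition is True and substitutes the alternative elsewhere. A sketch on made-up scores:

```python
import pandas as pd

scores = pd.Series([8.0, 7.5, 6.0])

# Element by element, as in the cell above.
via_apply = scores.apply(lambda score: score + 100 if score % 2 == 0 else score)

# Vectorized: keep scores that are not even, replace the rest with score + 100.
via_where = scores.where(scores % 2 != 0, scores + 100)

print(via_apply.tolist())
print(via_where.tolist())
```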

In [None]:
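def even_score(score):
    # apply passes one value at a time, so the argument is a single score, not a Series
    if score % 2 == 0:
        return score + 100
    # every other score returns None, which shows up as a missing value in the result
    return None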

In [None]:
grp = reviews['score'].apply(even_score)

In [None]:
grp.dropna()[0:10]

In [None]:
import numpy as np
grp.replace(np.nan, '0')[0:10]

In [None]:
grp[0:10]

In [None]:
grp.replace(np.nan, '0', inplace = True)

In [None]:
grp[0:10]

In [None]:
grp.replace('0', np.nan, inplace = True)

In [None]:
grp[0:3]

In [None]:
grp.fillna(0, inplace = True)

In [None]:
grp[0:5]
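
The cells above modify grp in place. An alternative is to leave the original untouched and assign the result to a new name, since dropna and fillna return a new Series by default. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([108.0, np.nan, 7.5])

dropped = s.dropna()   # removes the missing entry
filled = s.fillna(0)   # replaces the missing entry with 0

# The original Series still contains its missing value.
print(dropped.tolist(), filled.tolist(), s.isna().sum())
```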