PANDAS

To represent tabular data, pandas uses a custom data structure called a dataframe. A dataframe is a highly efficient, 2-dimensional data structure that provides a suite of methods and attributes to quickly explore, analyze, and visualize data. The dataframe is similar to the NumPy 2D array but adds support for many features that help you work with tabular data.

One of the biggest advantages that pandas has over NumPy is the ability to store a different data type in each column. Many tabular datasets contain a range of data types, and pandas dataframes handle them directly, whereas a plain NumPy array holds a single dtype for all of its elements. Pandas dataframes also handle missing values gracefully, representing them with NaN (the floating-point "not a number" value) and skipping them by default in most summary methods. A common complaint with NumPy is that it offers little built-in handling for missing values, so people end up finding and replacing them manually. In addition, pandas dataframes carry axis labels for both rows and columns, which lets you refer to elements more intuitively. Since many tabular datasets contain column titles, dataframes preserve that metadata alongside the data.
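
As a rough illustration of the points above, the sketch below builds a tiny dataframe with a text column and a numeric column that has one missing value. The titles and scores are made up for the example and are not taken from the review dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one text column, one float column with a gap.
games = pd.DataFrame(
    {
        "title": ["Game A", "Game B", "Game C"],
        "score": [9.0, np.nan, 7.5],
    }
)

print(games.dtypes)            # each column keeps its own dtype
print(games["score"].isna())   # the missing score is flagged as True
print(games["score"].mean())   # NaN is skipped by default
```

The mean here is computed from the two scores that are present, which is usually what you want for a quick summary.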

In this tutorial, we'll use Pandas to analyze data on video game reviews.
To work with the data in Python, we'll need to read the CSV file into a Pandas DataFrame. A DataFrame is a way to represent and work with tabular data. Tabular data has rows and columns, just like our CSV file.

Let's learn some basics of Pandas first.

In [None]:
import pandas as pd

In [None]:
# We can create a Series manually to better understand how it works.
# To create a Series, we pass a list or NumPy array into the Series object when we instantiate it:
s1 = pd.Series([1,2])
s1

In [None]:
s2 = pd.Series(['Pandas', 'are', 'better', 'than', 'Numpy', 99, 100])
s2

In [None]:
print(s2[5])
type(s2[5])

Creating A DataFrame in Pandas
We can create a DataFrame by passing multiple Series into the DataFrame class. Here, we redefine s2 so that it has the same length as s1, then pass in both Series objects: s1 becomes the first row and s2 the second row.

In [None]:
s2 = pd.Series(['Python', 'Programming'])
df = pd.DataFrame([s1,s2])
df.head()

In [None]:
pd.DataFrame(
    [
        [1,2],
        ['Python', 'Programming']
    ]
)

In [None]:
pd.DataFrame(
    [
        [1,2],
        ['Python', 'Programming']
    ],
    columns=["column1", "column2"]
)

In [None]:
# pandas series from dictionary
a_d ={'name': 'Diljeet', 'city': 'Delhi', 'email': 'diljeet@gmail.com'}
print (a_d)
print('-'*100)
s = pd.Series(a_d)
print(s)

# print(s.loc['name'])
# print(s.iloc[0])

In [None]:
# pandas series from dictionary of lists
data = {'name': ['Diljeet', 'Kishan', 'Prakhar'],
        'city': ['Delhi', 'Gurgaon', 'Delhi'],
        'email': ['diljeet@gmail.com', 'Kishan@gmail.com', 'Prakhar@gmail.com']}

print(data)
print('-'*100)

# Each dictionary key becomes a label and each list becomes a single element
s = pd.Series(data)
print(s)
# print(s.loc['name'])
# print(s.iloc[0])
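
A Series built from a dictionary of lists stores each whole list as one element, which is often not what you want for tabular data. The sketch below contrasts that with passing the same dictionary to pd.DataFrame, where each list becomes a column. The names and cities are hypothetical.

```python
import pandas as pd

# Hypothetical records, shaped like the dictionary of lists above.
people = {
    "name": ["Asha", "Ravi"],
    "city": ["Delhi", "Pune"],
}

as_series = pd.Series(people)     # 2 elements; each element is a whole list
as_frame = pd.DataFrame(people)   # 2 rows x 2 columns; each list is a column

print(as_series.loc["name"])   # label-based access returns the list of names
print(as_series.iloc[0])       # position-based access returns the same list
print(as_frame.shape)
```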

To read in the video game reviews data, we'll use the pandas.read_csv function. This function takes a CSV file and returns a DataFrame. The code below reads ign.csv into a DataFrame named reviews and displays it:

In [None]:
reviews = pd.read_csv("ign.csv")
reviews

Once we read in a DataFrame, Pandas gives us two methods that make it fast to preview the data:
* pandas.DataFrame.head -- returns the first N rows of a DataFrame (5 by default).
* pandas.DataFrame.tail -- returns the last N rows of a DataFrame (5 by default).

We'll use the head method to see what's in reviews:

In [None]:
reviews.head()
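
To see head and tail without needing the CSV file, here is a small sketch on a made-up eight-row DataFrame. The column name n is an assumption chosen only for this example.

```python
import pandas as pd

# Hypothetical 8-row frame standing in for a larger dataset.
numbers = pd.DataFrame({"n": range(8)})

first_five = numbers.head()    # no argument: the first 5 rows
last_two = numbers.tail(2)     # pass N to change how many rows come back

print(first_five)
print(last_two)
```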

In [None]:
# DataFrame.columns holds the column labels of the dataset:
print(reviews.columns)

The columns contain information about each game:

* score_phrase — how IGN described the game in one word. This is linked to the score it received.
* title — the name of the game.
* url — the URL where you can see the full review.
* platform — the platform the game was reviewed on (PC, PS4, etc).
* score — the score for the game, from 1.0 to 10.0.
* genre — the genre of the game.
* editors_choice — N if the game wasn't an editor's choice, Y if it was. This is tied to score.
* release_year — the year the game was released.
* release_month — the month the game was released.
* release_day — the day the game was released.

In [None]:
reviews.shape # returns rows and columns count of data set

In [None]:
reviews.info()

In [None]:
reviews.describe()

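The percentiles reported by describe() are computed by linear interpolation between the two nearest values in the sorted data:

linear: i + (j - i) * fraction, where i and j are the sorted values on either side of the requested position, and fraction is the fractional part of that position. For n sorted values and a percentile q between 0 and 1, the position is (n - 1) * q, counting from 0.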

In [None]:
data_1 = pd.DataFrame({'One':[4,6,8,10]},columns=['One'])
# res_25 = 4 + (6-4)*(3/4) =  5.5
# res_75 = 8 + (10-8)*(1/4) = 8.5
data_1.describe()
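
The interpolation rule can be written out by hand and compared with pandas. The helper linear_quantile below is a hypothetical function written for this illustration, not part of pandas; it assumes the default linear method.

```python
import math
import pandas as pd

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q   # fractional index into the sorted data
    i = math.floor(position)            # index of the value just below
    j = math.ceil(position)             # index of the value just above
    fraction = position - i
    return ordered[i] + (ordered[j] - ordered[i]) * fraction

values = [4, 6, 8, 10]
print(linear_quantile(values, 0.25))
print(linear_quantile(values, 0.75))
print(pd.Series(values).quantile([0.25, 0.75]).tolist())
```

For the four values used above, both approaches should agree with the 25% and 75% rows of describe().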

As you can see, everything has been read in properly:
* we have 18625 rows and 11 columns.

One of the big advantages of Pandas vs just using NumPy is that Pandas allows you to have columns with different data types. DataFrame reviews has columns that store float values, like score, string values, like score_phrase, and integers, like release_year.

In [None]:
# Indexing DataFrames with Pandas using the pandas.DataFrame.iloc method
# The code below replicates reviews.head():
reviews.iloc[:5, :]

In [None]:
# Here are some indexing examples, along with the results:
# the entire DataFrame.
reviews.iloc[:,:]

In [None]:
# rows from position 5 onwards, and columns from position 5 onwards
reviews.iloc[5:, 5:]

In [None]:
# the first column, and all of the rows for the column.
reviews.iloc[:,0]

In [None]:
# the 10th row, and all of the columns for that row.
reviews.iloc[9,:]

In [None]:
# Now that we know how to index by position, let's remove the first column, which doesn't have any useful information:
reviews = reviews.iloc[:, 1:]
reviews.head()

In [None]:
reviews.iloc[:3, 1:]

In [None]:
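# .loc slices by label and includes the end label,
# so this returns the rows labelled 0 through 3 (four rows)
reviews.loc[:3]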
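
The difference between slicing with iloc and loc is easy to miss when the row labels are the default integers. The sketch below uses a made-up five-row frame to show that the same slice returns a different number of rows.

```python
import pandas as pd

# Hypothetical frame with the default integer row labels 0..4.
df = pd.DataFrame({"score": [9.0, 8.5, 7.0, 6.5, 6.0]})

by_position = df.iloc[:3]   # positions 0, 1, 2; the end position is excluded
by_label = df.loc[:3]       # labels 0, 1, 2, 3; the end label is included

print(len(by_position))
print(len(by_label))
```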

Pandas Series
The Series object is a core data structure that pandas uses to represent rows and columns. A Series is a labelled collection of values similar to the NumPy vector. The main advantage of Series objects is the ability to utilize non-integer labels. NumPy arrays can only utilize integer labels for indexing.
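
A minimal sketch of non-integer labels, assuming made-up game names as the index. The same element can be reached by label or by position.

```python
import pandas as pd

# Hypothetical scores labelled by game name instead of by integer.
scores = pd.Series([9.0, 8.5, 7.0], index=["Game A", "Game B", "Game C"])

print(scores["Game B"])      # look up by label
print(scores.iloc[1])        # look up the same element by position
print(scores.index.tolist())
```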

In [None]:
reviews["score"]

In [None]:
# We can also use lists of columns with this method:
reviews[["score", "release_year"]]

In [None]:
# We can verify that a single column is a Series:
type(reviews['score'])

###  Math Functions with Pandas

In [None]:
# Pandas DataFrame Methods
reviews["score"].mean()  # gets mean of column

In [None]:
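# to get the mean of each row; numeric_only=True skips the text columns,
# which recent pandas versions otherwise refuse to average
reviews.mean(axis=1, numeric_only=True)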

In [None]:
reviews['score'].count()

In [None]:
reviews.max()

In [None]:
reviews.min()

In [None]:
reviews.median(numeric_only=True)  # text columns have no median, so skip them

In [None]:
reviews.std(numeric_only=True)  # text columns have no standard deviation, so skip them

All the common arithmetic operators that work in Python, like +, -, *, /, and ** (exponentiation), will work, and will apply to each element in a DataFrame or a Series. Note that ^ is bitwise XOR in Python, not a power operator.

In [None]:
reviews['score'] / 2
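
Here is a small sketch of element-wise arithmetic on a made-up Series. It uses ** for powers, which is Python's exponentiation operator.

```python
import pandas as pd

# Hypothetical scores used only to show element-wise arithmetic.
scores = pd.Series([2.0, 3.0, 4.0])

plus_one = scores + 1    # adds 1 to every element
doubled = scores * 2     # multiplies every element by 2
squared = scores ** 2    # raises every element to the power 2

print(plus_one.tolist())
print(doubled.tolist())
print(squared.tolist())
```

Each operation returns a new Series and leaves scores unchanged.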

In [None]:
score_filter = reviews["score"] > 7
score_filter

In [None]:
filtered_reviews = reviews[score_filter]
filtered_reviews.head()

In [None]:
# When filtering with multiple conditions, it's important to put each condition in parentheses,
# and combine them with a single ampersand (&) for "and" or a pipe (|) for "or".
xbox_one_filter = (reviews["score"] > 7) & (reviews["platform"] == "Xbox One")
filtered_reviews = reviews[xbox_one_filter]
filtered_reviews.head()
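
The parentheses matter because & and | bind more tightly than comparison operators such as > and ==. The sketch below, on a made-up four-row frame, names each condition first and then combines the masks with &, | and ~ (not).

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the review dataset.
games = pd.DataFrame(
    {
        "platform": ["Xbox One", "PC", "Xbox One", "PC"],
        "score": [8.0, 9.5, 6.0, 5.0],
    }
)

high = games["score"] > 7
xbox = games["platform"] == "Xbox One"

both = games[high & xbox]          # high score AND Xbox One
either = games[high | xbox]        # high score OR Xbox One
neither = games[~(high | xbox)]    # rows matching neither condition

print(both)
print(either)
print(neither)
```

Storing each condition in its own variable is one way to avoid precedence mistakes, since the comparisons are evaluated before the masks are combined.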

GroupBy

In [None]:
reviews.head()

In [None]:
reviews.groupby(by=['score_phrase'])[['score']].max()

In [None]:
reviews.groupby(by=['title', 'platform'])[['score']].sum()

In [None]:
import numpy as np
# String names such as 'sum' and 'min' select pandas' built-in aggregations
reviews.groupby(by=['score_phrase', 'platform']).agg({'score': ['sum', 'min']})

In [None]:
# verify above result
reviews[(reviews['score_phrase']=='Amazing') & (reviews['platform']=='Android')]
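
Grouped aggregation is easier to check by hand on a tiny table. The sketch below uses a made-up five-row frame and computes several aggregates per platform in one call.

```python
import pandas as pd

# Hypothetical rows; small enough to verify the aggregates by eye.
games = pd.DataFrame(
    {
        "platform": ["PC", "PC", "Wii", "Wii", "Wii"],
        "score": [9.0, 7.0, 8.0, 6.0, 4.0],
    }
)

# One row per platform, one column per aggregate.
summary = games.groupby("platform")["score"].agg(["sum", "min", "max"])
print(summary)
```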

Use of Lambda or Apply function in Pandas

Let's add 100 to each score that is an even number (2.0, 4.0, and so on) and leave the other scores unchanged.

In [None]:
even_grp = reviews['score'].apply(lambda score: score+100 if score%2==0  else score)

In [None]:
even_grp

Let's add 50 to each score that is an even number, and add 10 to every other score.

In [None]:
new_score= reviews['score'].apply(lambda score: score+50 if score%2==0  else score+10)
new_score
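
The same "add 50 if even, otherwise add 10" rule can also be written without apply. The sketch below swaps in NumPy's np.where as a vectorized alternative and checks it against the apply version on a few made-up scores. This is one possible approach, not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; 8.0 and 6.0 are even, 7.5 and 9.0 are not.
scores = pd.Series([8.0, 7.5, 6.0, 9.0])

# Element-by-element version, as in the tutorial.
with_apply = scores.apply(lambda s: s + 50 if s % 2 == 0 else s + 10)

# Vectorized version: np.where picks from two arrays based on a boolean mask.
with_where = pd.Series(
    np.where(scores % 2 == 0, scores + 50, scores + 10),
    index=scores.index,
)

print(with_apply.tolist())
print(with_where.tolist())
```

On large columns the vectorized form tends to be faster because it avoids calling a Python function once per element.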

In [None]:
reviews[['score', 'score']]

In [None]:
reviews[['score', 'score']].apply(np.sum)

In [None]:
def even_score(score):
    # Series.apply passes one scalar value at a time, not a whole Series
    if score % 2 == 0:
        return score + 100
    return score

In [None]:
grp = reviews['score'].apply(even_score)
grp

In [None]:
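def odd_score(row):
    # DataFrame.apply(axis=1) passes one row at a time as a Series
    row = row.copy()  # work on a copy rather than the object pandas passes in
    if row['score'] % 2 != 0 and row['platform'] == 'PlayStation 3':
        row['score'] = row['score'] + 100
    else:
        row['score'] = row['score'] + 10
    return row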

In [None]:
grp = reviews[['score', 'platform']].apply(odd_score, axis=1)
grp

In [None]:
# a lambda with apply is most useful when you want to compute a single column
grp = reviews[['score', 'platform']].apply(lambda x: x['score']+200 if x['platform']=='PlayStation 3' else x['score']+10, axis=1)
grp

In [None]:
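# broadcast: the scalar result is copied into every column of the original shape
# (result_type='broadcast' replaces the older broadcast=True argument, which current pandas no longer accepts)
grp = reviews[['score', 'platform']].apply(lambda x: x['score']+200 if x['platform']=='PlayStation 3' else x['score']+10, axis=1, result_type='broadcast')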
grp

In [None]:
# let's add some Null values in our grp dataset
grp.replace(19, np.nan, inplace=True)
grp

In [None]:
grp.dropna()[0:50]

In [None]:
grp.replace(np.nan, '0')[0:10]

In [None]:
grp

In [None]:
grp.replace(np.nan, '0', inplace = True)

In [None]:
grp[0:10]

In [None]:
grp.replace('0', np.nan, inplace = True)

In [None]:
grp[0:3]

In [None]:
grp.fillna(0, inplace = True)

In [None]:
grp[0:5]
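
The replace, dropna and fillna steps above can be tried on a small made-up Series. This sketch omits inplace=True, so each call returns a new object and the original stays unchanged, which makes the intermediate results easier to compare.

```python
import numpy as np
import pandas as pd

# Hypothetical values; 19 stands in for the value we treat as missing.
scores = pd.Series([19.0, 8.0, 19.0, 7.0])

with_gaps = scores.replace(19, np.nan)   # turn every 19 into a missing value
missing_count = with_gaps.isna().sum()   # how many values are missing
dropped = with_gaps.dropna()             # remove the missing entries
filled = with_gaps.fillna(0)             # or fill them with a default

print(missing_count)
print(dropped.tolist())
print(filled.tolist())
```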

In [None]:
# joined_df = pd.merge(reviews, reviews_low_score, left_on = 'Unnamed: 0', right_on='Unnamed: 0', how='right')
# joined_df.shape
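
The commented-out merge above refers to a second DataFrame that is not defined in this notebook. As a rough sketch of what a right join does, the example below uses two small made-up frames; reviews_small, low_scores and the id column are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical frames sharing an "id" key column.
reviews_small = pd.DataFrame(
    {"id": [0, 1, 2], "title": ["Game A", "Game B", "Game C"]}
)
low_scores = pd.DataFrame(
    {"id": [1, 2, 3], "score": [3.0, 2.5, 1.0]}
)

# how="right" keeps every row of the right-hand frame;
# ids missing from the left-hand frame get NaN in its columns.
joined = pd.merge(reviews_small, low_scores, on="id", how="right")

print(joined)
print(joined.shape)
```

Here id 3 appears only in low_scores, so its title is missing in the result, and id 0 is dropped because it has no match on the right.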