# Pandas

## What is pandas?

Pandas is an open-source Python library for creating, loading, and querying tabular datasets that fit in memory. You can use it to load .csv files into your Jupyter notebook. To learn more about the data structures in pandas, see their <a href="https://pandas.pydata.org/pandas-docs/stable/dsintro.html">webpage</a>. In this lecture we will go through some of the useful functions of this library while looking at the <a href="https://www.kaggle.com/claudiodavi/superhero-set">heroes dataset</a> from Kaggle.

<span style="color:red">NOTE:</span> Before continuing, please download the dataset and add it to the folder where this notebook resides.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Loading files

The function <code>read_csv()</code> loads a ".csv" file into the Jupyter notebook as an object of type <code>pandas.DataFrame</code>.

In [2]:
df_heroes_info = pd.read_csv("./heroes/heroes_information.csv")
df_heroes_powers = pd.read_csv("./heroes/super_hero_powers.csv")

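If the dataset is not at hand, the same call can be tried on an in-memory CSV. The sketch below is only an illustration: the rows, hero names, and publisher names are made up and do not come from the Kaggle files.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for the real heroes file
csv_text = """name,Publisher,Height
Hero A,Publisher X,180.0
Hero B,Publisher X,165.0
Hero C,Publisher Y,201.0
"""

# read_csv accepts a file path or any file-like object such as StringIO
df_demo = pd.read_csv(io.StringIO(csv_text))

print(type(df_demo))
print(df_demo.shape)
```

The result is an ordinary `pandas.DataFrame`, so everything shown in the rest of this notebook applies to it as well.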

To take a peek at the dataframes we can use the method <code>head</code>, which displays the first 5 rows by default.

In [3]:
df_heroes_info.head()


In [4]:
df_heroes_powers.head()

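The number of rows shown can be changed by passing an argument to `head`, and `tail` does the same from the bottom. A minimal sketch on a made-up dataframe (the column name `value` is hypothetical):

```python
import pandas as pd

# Hypothetical dataframe with ten rows
df_demo = pd.DataFrame({"value": range(10)})

first_three = df_demo.head(3)  # first 3 rows instead of the default 5
last_two = df_demo.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```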

## Querying the dataset

To access a particular column of the dataframe, we can index it by column name, just like a Python dictionary. The returned object is of type <code>pandas.Series</code>.

In [5]:
print(df_heroes_info['name'])
print(type(df_heroes_info['name']))

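A related detail worth knowing: indexing with a single label gives a Series, while indexing with a list of labels gives a DataFrame. The sketch below uses a made-up two-row dataframe to show the difference.

```python
import pandas as pd

# Hypothetical two-row dataframe
df_demo = pd.DataFrame({"name": ["Hero A", "Hero B"], "Height": [180.0, 165.0]})

single = df_demo["name"]    # one label -> pandas.Series
subset = df_demo[["name"]]  # list of labels -> pandas.DataFrame

print(type(single).__name__)
print(type(subset).__name__)
```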

To see all the powers recorded, we can use the <code>columns</code> attribute of the <code>df_heroes_powers</code> dataframe. The first column holds the hero names, so we skip it.

In [6]:
list_powers = list(df_heroes_powers.columns[1:])  # skip the first column (hero names)
for power in list_powers:
    print(power)

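Note that `columns` is an attribute holding an Index object, not a method, so it is used without parentheses. A small sketch on a made-up powers table (the power names here are only placeholders):

```python
import pandas as pd

# Hypothetical powers table: one name column followed by boolean power columns
powers_demo = pd.DataFrame({
    "hero_names": ["Hero A", "Hero B"],
    "Flight": [True, False],
    "Stealth": [False, True],
})

print(type(powers_demo.columns).__name__)

# Slice off the name column and convert the remaining labels to a plain list
power_names = powers_demo.columns[1:].tolist()
print(power_names)
```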

To count how many times each distinct value occurs in a column, we can use the method <code>value_counts</code>. Suppose we want to see the number of heroes that fight for the good side. We can query the dataset as follows:

In [7]:
df_heroes_info['Alignment'].value_counts()

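To make the behaviour of `value_counts` concrete, here is a minimal sketch on a made-up Series. The alignment labels are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical alignment column
alignment = pd.Series(["good", "bad", "good", "neutral", "good", "bad"])

counts = alignment.value_counts()                # occurrences of each distinct value
shares = alignment.value_counts(normalize=True)  # the same, as fractions of the total

print(counts)
print(shares)
```

Passing `normalize=True` is handy when the proportion of each category matters more than the raw count.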

Next we want to find the 5 most common powers and all the unique powers (those held by exactly one hero) in the dataset. One way to do it is as follows:

In [8]:
num_powers = df_heroes_powers.drop(columns='hero_names').sum(axis=0)  # number of heroes that have each power


The 5 most common powers:

In [9]:
common_pow = num_powers.sort_values(ascending=False)[0:5].index
for power in common_pow:
    print(power)


All the unique powers:

In [10]:
unique_pow = num_powers[num_powers == 1].index  # powers held by exactly one hero
for power in unique_pow:
    print(power)

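The counting steps above can be tried end to end on a tiny made-up table. This is a sketch under the assumption that the powers table has one name column followed by boolean columns; the heroes and powers below are hypothetical.

```python
import pandas as pd

# Hypothetical powers table: True means the hero has that power
powers_demo = pd.DataFrame({
    "hero_names": ["Hero A", "Hero B", "Hero C"],
    "Flight": [True, True, False],
    "Stealth": [False, True, False],
    "Super Strength": [True, True, True],
})

# Drop the name column so that only the boolean power columns are summed
power_counts = powers_demo.drop(columns="hero_names").sum(axis=0)

# Sort by count and keep the two most common powers
most_common = power_counts.sort_values(ascending=False).head(2).index.tolist()

# A boolean mask keeps only the powers held by exactly one hero
held_by_one = power_counts[power_counts == 1].index.tolist()

print(most_common)
print(held_by_one)
```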

We can also run more complex queries, such as looking at the powers that each publisher uses.

In [11]:
# Join on the hero name; left_on and left_index cannot be combined, so only left_on is used
pub_power_data = pd.merge(df_heroes_info[['name', 'Publisher']], df_heroes_powers,
                          left_on='name', right_on='hero_names')


In [12]:
grouped = pub_power_data.groupby('Publisher')


In [13]:
grouped.sum(numeric_only=True).sum(axis=1)  # total number of powers per publisher, ignoring text columns

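The merge-then-group pattern is easier to follow on a small example. The sketch below uses two made-up tables with the same column layout as the heroes files; all names and values are hypothetical.

```python
import pandas as pd

# Hypothetical info table
info_demo = pd.DataFrame({
    "name": ["Hero A", "Hero B", "Hero C"],
    "Publisher": ["Publisher X", "Publisher X", "Publisher Y"],
})

# Hypothetical powers table keyed by hero name
powers_demo = pd.DataFrame({
    "hero_names": ["Hero A", "Hero B", "Hero C"],
    "Flight": [True, False, True],
    "Stealth": [True, True, False],
})

# Inner join: keep heroes that appear in both tables
merged = pd.merge(info_demo, powers_demo, left_on="name", right_on="hero_names")

# numeric_only=True sums the boolean power columns and skips the text columns
per_publisher = merged.groupby("Publisher").sum(numeric_only=True)

# Summing across columns gives the total number of powers per publisher
total_powers = per_publisher.sum(axis=1)
print(total_powers)
```

Because `pd.merge` defaults to an inner join, heroes that are missing from either table are silently dropped from the result.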

## Visualizing Data

Visualizing data is fundamental to understanding a dataset. There are plenty of Python libraries for visualization. We will be using the matplotlib library for all our plots. If we want to look at the number of heroes created by each publisher, we can visualize the data using a bar graph.

In [14]:
heroes_publisher = df_heroes_info['Publisher'].value_counts()  # number of heroes per publisher
heroes_publisher.plot(kind='bar')
plt.show()


We can also apply different styles to our plots. The available styles are:

In [15]:
styles = plt.style.available
for style in styles:
    print(style)

bmh
classic
dark_background
fast
fivethirtyeight
ggplot
grayscale
seaborn-bright
seaborn-colorblind
seaborn-dark-palette
seaborn-dark
seaborn-darkgrid
seaborn-deep
seaborn-muted
seaborn-notebook
seaborn-paper
seaborn-pastel
seaborn-poster
seaborn-talk
seaborn-ticks
seaborn-white
seaborn-whitegrid
seaborn
Solarize_Light2
tableau-colorblind10
_classic_test


To learn more about the available styles and about creating your own, visit <a href="https://matplotlib.org/users/style_sheets.html">here</a>. Let's use the seaborn-talk style for now. (In Matplotlib 3.6 and later the seaborn styles carry a prefix, for example seaborn-v0_8-talk.)

In [16]:
plt.style.use('seaborn-talk')

In [17]:
heroes_publisher.plot.bar()
plt.show()

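`plt.style.use` changes the style for every plot that follows. To style a single figure only, one option is `plt.style.context`. The sketch below is an illustration on made-up publisher counts; it uses the `ggplot` style from the list above and selects the non-interactive `Agg` backend so that it also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical number of heroes per publisher
publisher_counts = pd.Series({"Publisher X": 5, "Publisher Y": 3, "Publisher Z": 1})

# The style applies only to figures created inside the with-block
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    publisher_counts.plot.bar(ax=ax)
    ax.set_xlabel("Publisher")
    ax.set_ylabel("Number of heroes")
    fig.tight_layout()
```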

Let us look at the height distribution of our heroes using a histogram.

In [18]:
plt.figure(figsize=(12,9))
df_heroes_info['Height'].plot.hist(bins=20)  # pass bins by keyword; the first positional argument is not bins in recent pandas
plt.xlabel('Height')
plt.ylabel('Frequency')
plt.show()


Did you notice something odd about this plot? Height cannot be negative! For every character whose height is missing, the value was recorded as -99. This brings us to our next topic: missing data.

## Missing Data

Impossible values such as -99 for height, or NaN, are often used in datasets to mark unknown entries. The easiest way to deal with them is to drop those rows. Replotting the histogram without the placeholder values, we have:

In [19]:
df_heroes_info[df_heroes_info['Height'] != -99]['Height'].plot.hist(bins=20)  # keep only rows with a known height
plt.xlabel('Height')

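An alternative to filtering on -99 each time is to convert the placeholder to NaN once, so that pandas treats it as missing from then on. The sketch below shows the idea on a made-up Series of heights; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical heights, with -99 marking unknown values
heights = pd.Series([180.0, -99.0, 165.0, -99.0, 201.0], name="Height")

# Replace the placeholder with NaN so pandas recognises it as missing
heights_clean = heights.replace(-99.0, np.nan)

n_missing = heights_clean.isna().sum()  # number of missing entries
known = heights_clean.dropna()          # drop the missing entries
mean_known = known.mean()               # statistics now ignore the placeholder

print(n_missing)
print(mean_known)
```

Dropping rows is the simplest option, but it discards data. Whether that is acceptable depends on how many entries are missing and on what the analysis is for.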