<a href="https://colab.research.google.com/github/emilylauyw/SheLovesData-Data_Analysis_With_Python/blob/master/Data_Analysis_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**PLEASE DO NOT MODIFY THIS NOTEBOOK**

See the instructions below on how to make your own copy

# 0) How to use Colab

### 0.1. Make a Copy

- You will not be able to edit this notebook directly. Instead, you will need to edit your own copy.
- Click: **File** -> **Save a copy in Drive**
- Follow the prompts to open "Copy of Data Analysis with Python.ipynb"
- Save a bookmark of your notebook so you can find it later (these are the exercises you will use during the workshop).

![copy to drive](https://i.imgur.com/gADwz6L.png)

### 0.2. Cells

- The grey blocks of code are called cells.
- You can edit a cell by clicking on it and typing.
- You can run a cell by hovering over it and clicking the play button, or by using the cmd/ctrl + Enter keyboard shortcut.
- Running a cell can change the state of your notebook session, including data defined in that cell or anywhere else in your notebook.
- Cells can depend on things that are defined in other cells, so the order in which you run the cells is important.
- When you run a cell, it displays the result of the last line in the cell.
- As you go through these exercises, try to guess what the output of each cell is going to be before running it.
- The first time you run a cell, it might take a few seconds to start working. That's normal.

![alt text](https://i.imgur.com/H4xTmyp.png)

### 0.3. Cheatsheet

Make sure to bookmark this [pandas cheatsheet](https://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3). Don’t worry, it’s not compulsory study. However, it will be a handy tool to have by your side when we’re doing the exercises. It’s also a great resource for learning more after the workshop. 

# Breakout Session 1 - Basics


Let's import the `pandas` Python module with the alias `pd`.

In [1]:
import pandas as pd

## 1) Reading Pokemon Data

Let's load Pokemon data from the url and check that it's loaded into a dataframe, which is a table of data that supports a lot of manipulation and query functions. You can copy the URL into another browser tab to see the raw file contents. 

Entering `df_pokemon` on the last line displays a preview of the dataframe.

The `df` in the name is short for "dataframe".

In [2]:
df_pokemon = pd.read_csv('https://github.com/veekun/pokedex/raw/master/pokedex/data/csv/pokemon.csv') 
df_pokemon

Unnamed: 0,id,identifier,species_id,height,weight,base_experience,order,is_default
0,1,bulbasaur,1,7,69,64,1,1
1,2,ivysaur,2,10,130,142,2,1
2,3,venusaur,3,20,1000,236,3,1
3,4,charmander,4,6,85,62,5,1
4,5,charmeleon,5,11,190,142,6,1
...,...,...,...,...,...,...,...,...
959,10153,araquanid-totem,752,31,2175,159,885,0
960,10154,togedemaru-totem,777,6,130,152,926,0
961,10155,necrozma-dusk,800,38,4600,306,954,0
962,10156,necrozma-dawn,800,42,3500,306,955,0


**Preview of the Pokemon data**

* **id** - unique id for the Pokemon
* **identifier** - name of the Pokemon
* **species_id** - id of the species the Pokemon belongs to
* **height** - height of the Pokemon
* **weight** - weight of the Pokemon
* **base_experience** - base experience points gained for defeating the Pokemon

Notice that apart from the above columns, there's a separate serial number column, with values from `0` to `963`. This is because `pandas` provides each row of the dataframe with a unique number / id by default.
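
To see the automatic index without downloading anything, here is a minimal sketch that reads the first three preview rows from an in-memory string. The name `df_sample` and the use of `io.StringIO` are assumptions made for illustration; the workshop itself loads the full file from the URL.

```python
import io
import pandas as pd

# First three rows of the preview above, typed in by hand so no download is needed.
csv_text = """id,identifier,species_id,height,weight,base_experience,order,is_default
1,bulbasaur,1,7,69,64,1,1
2,ivysaur,2,10,130,142,2,1
3,venusaur,3,20,1000,236,3,1
"""

# read_csv accepts any file-like object, not only a URL or a file path.
df_sample = pd.read_csv(io.StringIO(csv_text))

# No index column was specified, so pandas numbers the rows 0, 1, 2 by itself.
print(df_sample.index)
print(df_sample.shape)  # (number of rows, number of columns)
```

The index is not one of the eight columns in the file; pandas adds it when the dataframe is created.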

When you first get a data set, you will want some general information about it to get the gist of things. Let's check the general information for our Pokemon data set.

In [None]:
df_pokemon.info()

What does the above `info()` tell us about the data?<br>
* `RangeIndex` tells us that there are `964` rows in the data table
* We learn that there are `8` columns in this data numbered `0` to `7`
* The data type for every column is integer (`int64`), except the `identifier` column.
* `Null` means empty or nothing. `Non-Null Count` is telling us that, for each column, all the rows have a value in them, i.e. no empty cells.
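
Each fact in the list above can also be read individually. The following is a small sketch on a hypothetical mini-table (`df_sample`, built from a few preview values), not on the full data set:

```python
import pandas as pd

# Hypothetical mini-table; the values are taken from the preview rows above.
df_sample = pd.DataFrame({
    "id": [1, 2, 3],
    "identifier": ["bulbasaur", "ivysaur", "venusaur"],
    "height": [7, 10, 20],
    "weight": [69, 130, 1000],
})

n_rows, n_columns = df_sample.shape          # the row and column counts that info() reports
column_types = df_sample.dtypes              # one data type per column
missing_per_column = df_sample.isna().sum()  # 0 everywhere means no empty cells

print(n_rows, n_columns)
print(column_types)
print(missing_per_column)
```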


Even though the dataframe has many rows (964), `pandas` can still display it, as the preview of `df_pokemon` earlier shows. It does this by showing a sample of the rows: some rows from the top and some rows from the bottom, with `...` in the middle to indicate that more rows exist in between.

## 2) Filtering

### Find the information available for Pokemon `zangoose`.

Let's use the filtering concept that we just learned to pick out the details of the Pokemon named `zangoose`.

Let's write down our condition first

In [None]:
df_pokemon["identifier"] == "zangoose"

As you can see, the condition produces a long list of `False` values. There are too many boolean values to display them all.

Now let's apply the condition in the dataframe

In [None]:
df_zangoose = df_pokemon[df_pokemon["identifier"] == "zangoose"]
df_zangoose

Awesome! We have got the row that has information about the `zangoose` Pokemon.

Now, what's the data type of this result? You can find the data type of any data using the `type` function, like this:

In [None]:
print("df_zangoose is of type:", type(df_zangoose))

As you can see, the result is also a `DataFrame`
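
The two filtering steps can be tried on a small table where every row is visible. This is a sketch on a hypothetical `df_sample` built from the first five preview rows; it filters for `charmander` because `zangoose` is not among those rows.

```python
import pandas as pd

# Hypothetical mini-table built from the first five preview rows.
df_sample = pd.DataFrame({
    "identifier": ["bulbasaur", "ivysaur", "venusaur", "charmander", "charmeleon"],
    "height": [7, 10, 20, 6, 11],
    "weight": [69, 130, 1000, 85, 190],
})

# Step 1: the condition alone is a Series of True/False, one value per row.
is_charmander = df_sample["identifier"] == "charmander"
print(is_charmander.sum())  # number of rows where the condition is True

# Step 2: passing that Series inside [] keeps only the True rows.
df_charmander = df_sample[is_charmander]

# Conditions can be combined with & (and) or | (or); each one needs its own parentheses.
df_small_and_light = df_sample[(df_sample["height"] < 10) & (df_sample["weight"] < 80)]
print(df_small_and_light)
```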

## 3) Arranging

We can see the current order of the columns by displaying the dataframe `df_pokemon`.

In [None]:
df_pokemon

Let's rearrange the columns by bringing the `height` and `weight` columns forward and pushing the `species_id` column to the end.

In [None]:
df_pokemon[['id', 'identifier', 'height', 'weight',
            'base_experience', 'order', 'is_default', 'species_id']]

Some things to try with the above code:
* Try the above with a different order of columns
* Try to give a wrong or non-existent column name
* Try repeating a column name

Now, let's see what happens when we use single square brackets `[]` instead of double square brackets `[[]]`. Go ahead and run the code below.

In [None]:
df_pokemon['id', 'identifier', 'height', 'weight',
            'base_experience', 'order', 'is_default', 'species_id']

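As we discussed in the slides, it gives a big fat error 😅 It is a `KeyError`: with single brackets, pandas looks for one column whose name is the whole group of names. You can fix it by using the double square brackets `[[]]`.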
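
To look at the error without a long traceback, the sketch below catches it on a hypothetical two-row table (`df_sample`) and then does the same selection with double brackets.

```python
import pandas as pd

df_sample = pd.DataFrame({
    "id": [1, 2],
    "identifier": ["bulbasaur", "ivysaur"],
    "height": [7, 10],
})

# Single brackets: pandas treats "identifier", "id" as ONE key, the tuple
# ("identifier", "id"), and no column has that name.
try:
    df_sample["identifier", "id"]
    error_name = None
except KeyError as error:
    error_name = type(error).__name__
print(error_name)

# Double brackets: the inner [] is an ordinary Python list of column names.
df_reordered = df_sample[["identifier", "id"]]
print(list(df_reordered.columns))
```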

### How many Pokemon are there?
We know from the 'Preview of the Pokemon data' above that `id` gives us the unique identifier of each Pokemon. We want to check whether this data contains duplicate rows for any Pokemon and then count how many Pokemon we have.

To do so, we do the following:

In [None]:
number_of_rows = len(df_pokemon)
# find count of unique 'id' in the dataframe.
unique_ids = df_pokemon['id'].unique()
unique_ids_count = len(unique_ids)
duplicate_ids_count = number_of_rows - unique_ids_count
print("There are", duplicate_ids_count, "duplicate 'id' values in this dataframe.")
print("Number of Pokemon in this dataframe is", unique_ids_count)
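
pandas also has shortcuts for the same check. The sketch below uses a hypothetical table with one deliberately repeated `id`, so that the duplicate count is not zero; `nunique()` and `duplicated()` are standard pandas methods.

```python
import pandas as pd

# Hypothetical table with one deliberately repeated id (2).
df_sample = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "identifier": ["bulbasaur", "ivysaur", "ivysaur", "venusaur"],
})

unique_ids_count = df_sample["id"].nunique()              # same as len(df_sample["id"].unique())
duplicate_ids_count = df_sample["id"].duplicated().sum()  # repeats after the first occurrence
df_repeated = df_sample[df_sample["id"].duplicated(keep=False)]  # every row that shares an id

print(unique_ids_count, duplicate_ids_count)
print(df_repeated)
```

`keep=False` marks all copies of a repeated value, which is handy when you want to inspect the offending rows rather than just count them.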

### Find the Pokemon at index 335.

Note that index has a special meaning in a dataframe. If you look at the preview above, you will see there is an extra column with no name to the left of `id`. This wasn't in the raw data. This is the automatically created index column, which starts from 0 and counts up by 1 for each row. Some other dataframe operations will directly or indirectly change the index, as we will see below.

In [None]:
print("This returns a type:", type(df_pokemon.loc[335]))
df_pokemon.loc[335]

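Notice that the output **type**s of the two find queries are different: filtering for `zangoose` returned a `DataFrame`, while `.loc[335]` returns a `Series`.

This is because of the way in which we select from `df_pokemon`. A condition can match any number of rows, so its result stays a table. `.loc` with a single index label picks exactly one row, which pandas returns as a `Series`.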
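
The difference in types can be reproduced on a small table. This is a sketch on a hypothetical three-row `df_sample`; it also shows that passing a list of labels to `.loc` keeps the result a `DataFrame`.

```python
import pandas as pd

df_sample = pd.DataFrame(
    {"identifier": ["bulbasaur", "ivysaur", "venusaur"], "height": [7, 10, 20]}
)

row_as_series = df_sample.loc[1]    # one label: a single row, returned as a Series
row_as_frame = df_sample.loc[[1]]   # a list of labels: a DataFrame with one row
filtered = df_sample[df_sample["identifier"] == "ivysaur"]  # a condition: always a DataFrame

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
print(type(filtered).__name__)
```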


If you're done with this section and have time - head on to the [bonus section](#scrollTo=2zlUWLg0-6YG) 

# Breakout Session 2 - Selecting Data with Pandas

## 3) Selecting Specific Columns

How do we extract data from only the `identifier` column?

In [None]:
print(type(df_pokemon["identifier"]))
df_pokemon["identifier"]

Notice how we get a `Series` type of data and how it looks in the output.

In [None]:
print(type(df_pokemon[["identifier"]]))
df_pokemon[["identifier"]]

Notice how we get a `DataFrame` type of data and how its output looks different from the `Series` output. With `DataFrame` output, you see a nicely formatted table that you can hover over, unlike `Series` output.

How do we extract data from both the `identifier` and `base_experience` columns?

In [None]:
df_pokemon[["identifier", "base_experience"]]

Notice that using

```
df_pokemon[["identifier", "base_experience"]]
```
gives you the data for multiple columns, organised as a DataFrame. For consistency, even when you put a single column name in double brackets, like

```
df_pokemon[["identifier"]]
```
pandas returns a DataFrame instead of a Series.




## 4) Sorting by column values
Which Pokemon have the 10 highest base experience values?

In [None]:
df_pokemon.sort_values(by=['base_experience'], ascending=False).head(10)
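
pandas also offers `nlargest` as a shorthand for "sort descending, then take the first rows". The sketch below compares both on a hypothetical five-row table whose values come from the preview; it asks for the top 3 rather than the top 10 because the table is so small.

```python
import pandas as pd

# base_experience values taken from the preview rows above
df_sample = pd.DataFrame({
    "identifier": ["bulbasaur", "ivysaur", "venusaur", "charmander", "necrozma-dusk"],
    "base_experience": [64, 142, 236, 62, 306],
})

# Approach used in the cell above: sort everything, then keep the first rows.
top3_sorted = df_sample.sort_values(by="base_experience", ascending=False).head(3)

# Shorthand: ask directly for the rows with the largest values.
top3_nlargest = df_sample.nlargest(3, "base_experience")

print(top3_nlargest)
```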

## 5) Creating additional columns: Extracting data from other columns
Create a column called *BMI* (Body Mass Index) and calculate the *BMI* of each Pokemon.

`BMI = weight/height^2`

Save the result as `df_pokemon_with_bmi`.

In [None]:
df_pokemon_with_bmi = df_pokemon.copy()
df_pokemon_with_bmi['bmi'] = df_pokemon_with_bmi['weight']/df_pokemon_with_bmi['height']**2
df_pokemon_with_bmi
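
A note on units: the raw `height` and `weight` numbers do not appear to be metres and kilograms. Assuming the data set stores height in decimetres and weight in hectograms (tenths of a metre and tenths of a kilogram, an assumption not stated in this notebook), the sketch below converts first and then computes a BMI in the usual kg/m² unit. The column name `bmi_kg_m2` is made up for this example.

```python
import pandas as pd

df_sample = pd.DataFrame({
    "identifier": ["bulbasaur", "ivysaur", "venusaur"],
    "height": [7, 10, 20],      # assumed unit: decimetres
    "weight": [69, 130, 1000],  # assumed unit: hectograms
})

df_sample_with_bmi = df_sample.copy()
height_m = df_sample_with_bmi["height"] / 10   # decimetres -> metres
weight_kg = df_sample_with_bmi["weight"] / 10  # hectograms -> kilograms
df_sample_with_bmi["bmi_kg_m2"] = weight_kg / height_m ** 2

print(df_sample_with_bmi)
```

Under that assumption the converted value is 10 times the `bmi` column computed above, so the ranking of Pokemon by BMI is the same either way.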

## Exercise 1

Find the pokemon with species ID 505. 

## 6) Calculating poke statistics

### What is the data type of the `weight` column?

In [None]:
df_pokemon['weight'].dtype

### On average, how much do Pokemon weigh?


In [None]:
df_pokemon['weight'].mean()

### We can also use `describe()`
The `describe()` method gives descriptive statistics of a column, including `mean` (the mean of the values), `std` (the standard deviation of the observations), and `min` (the minimum of the values).
 

In [None]:
df_pokemon['weight'].describe()

Calling `describe()` automatically calculates a predefined set of metrics for a numeric column.
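
The result of `describe()` is itself a `Series`, so individual statistics can be picked out by name. A small sketch on the weights of the first five preview rows (the variable names are made up for this example):

```python
import pandas as pd

# weight values of the first five preview rows
weights = pd.Series([69, 130, 1000, 85, 190], name="weight")

summary = weights.describe()  # a Series indexed by the name of each statistic
print(summary["mean"], summary["min"], summary["max"])

# agg() computes only the statistics you ask for.
picked = weights.agg(["mean", "median", "max"])
print(picked)
```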

## Exercise 2


Try out what happens when you run `describe()` on a column with non-numeric values, like `identifier`

If you're done with this section give the [bonus section](#scrollTo=Natqojl1kZcX) a shot!

# Breakout Session 3 - Processing data

## 7) Grouping and Aggregation



### Group Pokemon by their species id

In [None]:
df_pokemon.groupby(["species_id"])["identifier"].count()

### Counting in Groups

We need to apply an aggregation operation, e.g. `count`, to the grouping in order for it to be useful.

In this case, the result of the aggregation is a `Series`. We wrap it in a `DataFrame` so that it displays as a formatted table.

In [None]:
df_pokemon_grouped_height_weight = pd.DataFrame(
    df_pokemon.groupby(["height", "weight"])["identifier"].count()
)

# let's look at the first 10 rows
df_pokemon_grouped_height_weight.head(10)

This DataFrame is a little different to the ones we've seen thus far.

The `identifier` column no longer contains the name of each Pokemon; it contains the count of Pokemon in each group.

Also, its index has two levels, one for each column we grouped by:

In [None]:
df_pokemon_grouped_height_weight.index.names
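
A two-level index is easier to see on a tiny table. The rows below are hypothetical (`pokemon_a` and so on are made-up names), chosen so that two entries share the same height and weight. The sketch also shows `reset_index`, which turns the index levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical rows: two entries share the same (height, weight) pair.
df_sample = pd.DataFrame({
    "identifier": ["pokemon_a", "pokemon_b", "pokemon_c", "pokemon_d"],
    "height": [7, 7, 10, 10],
    "weight": [69, 69, 130, 150],
})

counts = df_sample.groupby(["height", "weight"])["identifier"].count()
print(list(counts.index.names))  # the grouping columns became index levels
print(counts.loc[(7, 69)])       # look up one group by its (height, weight) pair

# reset_index turns the index levels back into ordinary columns.
df_counts = counts.reset_index(name="pokemon_count")
print(df_counts)
```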

In [None]:
df_pokemon_grouped_species = pd.DataFrame(
    df_pokemon.groupby(["species_id"])["weight"].mean()
)

# let's look at the first 10 rows
df_pokemon_grouped_species.head(10)

In [None]:
df_pokemon_grouped_species = pd.DataFrame(
    df_pokemon.groupby(["species_id"])["height"].mean()
)

# let's look at the first 10 rows
df_pokemon_grouped_species.head(10)

## 8) Selecting Records with a MultiIndex

In [None]:
df_pokemon_grouped_height_weight.loc[(1,9999)]

Surely this must be an exceptional Pokemon in terms of weight?

In [None]:
df_pokemon["weight"].describe()

Seems like it is the heaviest (or one of the heaviest) Pokemon, so which Pokemon is it?

## 9) Select a Record by Query

That one Pokemon with a height of 1 and a weight of 9999 is making us curious. Which Pokemon is it?

We could look it up using multiple indices or we could write a query.

In [None]:
df_pokemon.query("height == 1 and weight == 9999")

So which Pokemon happens to be...
<img src="http://i0.kym-cdn.com/entries/icons/original/000/000/056/itsover1000.jpg"/>

image credit: http://knowyourmeme.com/memes/its-over-9000

In [None]:
df_pokemon.query("weight >= 9000")
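
A query string is another way of writing the boolean filters from Breakout Session 1. The sketch below shows both forms side by side on hypothetical rows, plus the `@` prefix that lets a query refer to a Python variable (`weight_limit` is a made-up name).

```python
import pandas as pd

# Hypothetical rows, including one very short and very heavy entry.
df_sample = pd.DataFrame({
    "identifier": ["pokemon_a", "pokemon_b", "pokemon_c"],
    "height": [1, 20, 92],
    "weight": [9999, 1000, 9500],
})

by_query = df_sample.query("height == 1 and weight == 9999")
by_mask = df_sample[(df_sample["height"] == 1) & (df_sample["weight"] == 9999)]
print(by_query.equals(by_mask))  # both forms select the same rows

# Inside query(), @name refers to a Python variable instead of a column.
weight_limit = 9000
over_limit = df_sample.query("weight >= @weight_limit")
print(over_limit["identifier"].tolist())
```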

## 10) Renaming columns

The counted identifiers don't have a descriptive column name, so let's rename the column.

### What columns do we have in our grouped DataFrame?

In [None]:
df_pokemon_grouped_height_weight.columns

Let's rename that column by passing a list with the new column name.
If we had multiple columns we'd need to pass all of the column names in.

In [None]:
df_pokemon_grouped_height_weight.columns = ["pokemon_count"]
df_pokemon_grouped_height_weight.head(10)
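
When a dataframe has many columns, listing every name just to change one is error-prone. An alternative is `rename`, which takes a mapping from old names to new names. The sketch below applies it to a hypothetical grouped result (`df_counts`) shaped like the one above.

```python
import pandas as pd

# Hypothetical grouped result with the same layout as the one above.
df_counts = pd.DataFrame(
    {"identifier": [2, 1]},
    index=pd.MultiIndex.from_tuples([(7, 69), (10, 130)], names=["height", "weight"]),
)

# rename() maps old names to new ones and leaves any other column untouched.
df_renamed = df_counts.rename(columns={"identifier": "pokemon_count"})

print(list(df_counts.columns))   # the original dataframe is unchanged
print(list(df_renamed.columns))
```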

## Exercise 3

Count the number of Pokemon in each species.

## Exercise 4

Find the 5 tallest Pokemon.

# Breakout Session 4 - Visualisation

As the volume of the data increases, it becomes difficult for us to simply look at the column and row values to find patterns and make inferences.

We use Plots to help us present this information in a visual format and make it more readable.

## 11) Basic Plotting

Let's find the `height` of the Pokemon whose `weight` is more than `8000` and plot it in a graph.

In [None]:
heavyPokemon = df_pokemon[df_pokemon['weight'] > 8000]
plot = heavyPokemon.plot.scatter(x="identifier", y = "height", grid=True, figsize=(10, 10))
plot = plot.set_xticklabels(heavyPokemon.identifier, rotation=80)

Notice the one Pokemon with a very low height that is still in the `weight > 8000` group. Do you remember it from your previous analysis?

In [None]:
tallPokemon = df_pokemon[df_pokemon['height'] > 60]
plot = tallPokemon["weight"].plot(kind="bar", x = "identifier", figsize=(10, 10))
plot = plot.set_xticklabels(tallPokemon.identifier, rotation=80)


Head on to the [bonus section](#scrollTo=g4X50HcL03eX) for more such plots

# Mini Project 1
Plot the mean weight of each species.

Plot the mean weights as a histogram. You can find the docs for `DataFrame.hist` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html).

# BONUS SECTIONS

## Bonus/Optional Reading 1

### Indices and Uniqueness

The code in this bonus section modifies the data in the dataframe. Let's create a copy of the data before we do anything, to avoid changing the dataframe we use in the breakout sessions.

In [None]:
df_pokemon_copy = df_pokemon.copy()

The first column of the displayed dataframe is an automatically generated unique identifier for each row. It is called the `index`.

In [None]:
df_pokemon_copy.index

Our dataset includes an `id` column. If it is unique, we can use it as the unique index, rather than the automatically generated index.

To check if the `id` column is unique we can use the `is_unique` attribute. 

In [None]:
df_pokemon_copy["id"].is_unique

We also have an `identifier` column for the name of the Pokemon. 

Are the names (identifiers) unique?

In [None]:
df_pokemon_copy["identifier"].is_unique

If both checks return `True`, we have two unique columns (`id` and `identifier`), and we can use either of them as the unique index for our dataset.

In [None]:
df_pokemon_copy = df_pokemon_copy.set_index("id")
df_pokemon_copy.head()
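
Once `id` is the index, `.loc` looks rows up by their `id` value rather than by their position. A minimal sketch on a hypothetical three-row table (values taken from the end of the preview):

```python
import pandas as pd

df_sample = pd.DataFrame({
    "id": [10153, 10154, 10155],
    "identifier": ["araquanid-totem", "togedemaru-totem", "necrozma-dusk"],
})

df_by_id = df_sample.set_index("id")

name_by_label = df_by_id.loc[10154, "identifier"]   # label lookup: uses the id value
name_by_position = df_by_id.iloc[1]["identifier"]   # position lookup: the second row

print(name_by_label)
print(name_by_position)
```

After `set_index`, something like `df_by_id.loc[1]` raises a `KeyError`, because no row is labelled `1` any more; use `.iloc` when you mean "the row at this position".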

Head back to [continue on the main content](#scrollTo=PPuM0fOZdBKo)

## Bonus/Optional Reading 2
### Partial Text Matching


Convert the values in a column to strings and then apply the string `contains` operation to determine, for each value in the series, whether it contains "totem".

Also, let's make a separate DataFrame that has only the Pokemon with 'totem' in their name.

In [None]:
# find all the totem
totem_index = df_pokemon["identifier"].astype(str).str.contains("totem")
df_totem = df_pokemon[totem_index]
df_totem
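
As a variation on the cell above, `str.contains` has parameters for missing values and for literal matching, so the `astype(str)` step is optional. The sketch below uses a hypothetical `names` series with one missing entry to show the effect.

```python
import pandas as pd

# Hypothetical names, with one missing value to show how na=False behaves.
names = pd.Series(
    ["araquanid-totem", "togedemaru-totem", "necrozma-dusk", None],
    name="identifier",
)

# na=False treats a missing name as "no match";
# regex=False searches for the literal text rather than a regular expression.
has_totem = names.str.contains("totem", regex=False, na=False)

# endswith is stricter: it only matches names that end in "-totem".
ends_with_totem = names.str.endswith("-totem", na=False)

print(has_totem.tolist())
print(names[has_totem].tolist())
```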

#### Exercise - calculate the mean base experience of totem Pokemon

In [None]:
# mean base experience of Totem

#### Exercise - find all `totem` pokemon

In [None]:
# find index of totem
# select the rows and save into a new data frame df_totem
# show the first 10 rows of df_totem

Head back to [continue on the main content](#scrollTo=DhFIw5MfdeBy)

## Bonus/Optional Reading 3
### Visualization

#### Box Plot

A box plot is used to plot the values of a single variable / column. It helps you understand the spread of values in a column. For example, let's draw the box plot for the `height` column.

In [None]:
df_pokemon.boxplot('height')

In [None]:
# Multibar charts
tallPokemon = df_pokemon[df_pokemon['height'] > 60]
tallPokemon.plot(kind="bar", x = "identifier",y=["height","base_experience"], figsize=(10, 10))


Coming back to the box plot of `height`: among other things, it pictorially shows you the `median` value and the outliers.
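
The numbers behind a box plot can be computed directly. This sketch assumes the common default rule that points more than 1.5 times the interquartile range (IQR) away from the box are drawn as outliers; the heights are hypothetical values with one extreme entry.

```python
import pandas as pd

# Hypothetical heights with one extreme value.
heights = pd.Series([6, 7, 10, 11, 20, 145], name="height")

median = heights.median()     # the line inside the box
q1 = heights.quantile(0.25)   # bottom edge of the box
q3 = heights.quantile(0.75)   # top edge of the box
iqr = q3 - q1

# Values beyond these fences are drawn as individual outlier points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = heights[(heights < lower_fence) | (heights > upper_fence)]

print(median, q1, q3)
print(outliers.tolist())
```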

When you're done, head back to the [main content](#scrollTo=qoVoNjC9CU8h).