
# Pandas and Pokemon: Basic Data Analytics With Python
### Brendan Shea, PhD

**Data Analytics** refers to the process of examining, cleaning, transforming, and modeling data with the aim of discovering useful information, informing conclusions, and supporting decision-making. It involves using statistical, computational, and machine learning techniques to analyze and interpret complex data sets, enabling individuals and organizations to make more informed decisions.

Data Analytics is a broad field that spans many domains, encompassing everything from business intelligence and big data analytics to specialized fields like web analytics and social media analytics. At its core, data analytics is about uncovering patterns and extracting meaningful insights from raw data. This process can range from simple descriptive statistics, which describe and summarize data, to more complex predictive and prescriptive analytics, which forecast future trends and prescribe actions.

To put this into perspective, imagine you're playing a video game where you need to build a team of characters, each with their own set of skills and attributes. Data analytics in this context would involve examining the data on each character -- like their health points, attack strength, defense abilities, etc. -- to determine the most balanced and effective team composition. You would analyze past performance data, predict future outcomes in different scenarios, and make decisions on which characters to choose to maximize your chances of success.

In the real world, data analytics is used in a myriad of ways, from businesses analyzing customer data to improve their products and services, to healthcare providers using patient data to make better diagnostic and treatment decisions. In the educational sector, data analytics can help in understanding student performance and improving teaching methods.

### How Does Python Fit into Data Analytics?
Python serves as an accessible and practical entry point into the world of data manipulation and analysis. Its simple syntax and readability make it an ideal language for those who are just starting to explore data analytics.

When you're new to data analytics, the primary goal is to learn how to manage and interpret data effectively. Python facilitates this learning process with its straightforward and intuitive coding style. Compared with some other tools used for data work (such as MATLAB or R), many beginners find Python's syntax clear and concise, which makes it easier to grasp key concepts without getting overwhelmed.

One of the first steps in data analytics is learning how to handle and process data. Python offers a rich ecosystem of libraries specifically designed for these tasks. For instance, the Pandas library, which we'll be focusing on here, is a fundamental tool in Python for data manipulation. It allows you to easily read, write, and modify data in various formats like CSV, Excel, or databases.

Another aspect is data visualization, which is crucial in making sense of the data you are analyzing. Python libraries like Matplotlib and Seaborn enable you to create visualizations -- such as graphs and charts -- with just a few lines of code. This transforms complex numerical data into a series of easy-to-understand images, making it simpler to identify trends and patterns.

## Libraries for Data Analytics
Some important Python libraries for data analytics include the following.

| Python Library | Description |
| --- | --- |
| Pandas | A foundational library for data manipulation and analysis. Pandas provides DataFrames (similar to Excel spreadsheets) that make working with structured data intuitive and efficient. It's excellent for data cleaning, transformation, and analysis. |
| NumPy | Short for Numerical Python, NumPy is crucial for numerical computations. It offers support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. |
| Matplotlib | A plotting library that provides a MATLAB-like interface for creating a wide range of static, animated, and interactive visualizations. Matplotlib is very customizable and widely used for creating graphs and charts. |
| Seaborn | Built on top of Matplotlib, Seaborn is a statistical data visualization library. It provides a high-level interface for drawing attractive and informative statistical graphics. |
| SciPy | Used for scientific and technical computing, SciPy builds on NumPy and provides a large number of higher-level functions for optimization, regression, interpolation, etc. |
| Scikit-learn | A simple and efficient tool for data mining and data analysis. It's built on NumPy, SciPy, and Matplotlib, and it's best known for its capabilities in machine learning, including classification, regression, clustering, and dimensionality reduction. |
| Statsmodels | Focused on statistical models, hypothesis testing, and data exploration. It's a great tool for statistical analysis and offers extensive options for model formulation. |
| TensorFlow | An open-source library developed by Google primarily for deep learning applications. TensorFlow offers flexible tools for building and training neural networks to detect and decipher patterns and correlations, similar to human learning and reasoning. |
| Keras | An open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library and simplifies many aspects of creating and compiling deep learning models. |
| Plotly | A graphing library that makes interactive, publication-quality graphs online. It offers a range of graphical representations like scatter plots, line charts, bar charts, and more, with interactive features. |

## What is Pandas?

**Pandas** is a library in Python that makes data analysis and manipulation straightforward and accessible. It is a tool that turns complex tasks into manageable ones, simplifying the way we work with data. In the world of Python programming, especially in data analytics, Pandas is akin to a Swiss Army knife -- versatile and essential.

At its core, Pandas is designed for working with tabular or structured data. Tabular data is similar to what you would see in a spreadsheet -- data that's organized into rows and columns. For example, imagine a table containing information about different Pokémon: each row represents a Pokémon, and each column details attributes like type, hit points, and attack strength.

One of the primary components of Pandas is the DataFrame. A **DataFrame** is a way to store and manipulate data in a table with rows and columns, much like a sheet in an Excel workbook. You can think of a DataFrame as a powerful tool that allows you to do a lot of different things with your data -- sort it, filter it, calculate statistics from it, and even clean it (like removing or fixing incorrect data). Each column of a DataFrame is a Pandas **Series** (basically, a list of objects of the same data type).

Another essential feature of Pandas is its ability to handle missing data. In real life, data can be messy. Sometimes, information is missing or incomplete. Pandas provides a straightforward way to deal with these gaps, either by filling them in with specific values or by removing the parts of the data that are incomplete.

In addition to these features, Pandas also makes it easy to read data from different sources. Whether your data is in a CSV file, an Excel spreadsheet, or a database, Pandas can read it and turn it into a DataFrame. Once in a DataFrame, you can start analyzing and visualizing your data.
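
To make these ideas concrete, here is a minimal sketch that builds a tiny DataFrame by hand, pulls out one column as a Series, and handles a missing value in two ways. The name `sample_df` and the handful of values are illustrative assumptions, not the full dataset used in the rest of this chapter.

```python
import pandas as pd

# A tiny hand-typed sample (illustrative values only, not the full dataset).
sample_df = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Charmander"],
    "Type 1": ["Grass", "Grass", "Fire"],
    "Type 2": ["Poison", "Poison", None],   # the last entry is missing
    "HP": [45, 60, 39],
})

# Each column of a DataFrame is a Series.
hp = sample_df["HP"]
print(type(hp).__name__)

# Two common ways to handle the missing 'Type 2' value:
filled = sample_df.fillna({"Type 2": "None"})   # replace the gap with a placeholder string
dropped = sample_df.dropna(subset=["Type 2"])   # or drop rows where 'Type 2' is missing

print(filled)
print(dropped)
```

Which option is appropriate depends on the question being asked: filling keeps every row, while dropping keeps only complete rows.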

## Example: Loading Pokemon Data Into Pandas
In this example, we will walk through the process of loading a dataset from a CSV file -- specifically, a file containing Pokémon statistics -- into a Pandas DataFrame. A DataFrame is a central data structure in Pandas and can be thought of as a table with rows and columns, similar to an Excel spreadsheet. By loading data into a DataFrame, we make it possible to perform a variety of data analysis tasks efficiently.

### Step 1: Import Pandas Library
To use Pandas, you first need to ensure it's installed in your Python environment. If it's not already installed, you can do so using pip, Python's package manager, with the command `pip install pandas`. Once Pandas is installed, you start your Python script or notebook by importing it:

```python
import pandas as pd
```

Here, `pd` is a conventional alias used for Pandas. It's a shorthand that will save you some typing and keep your code clean.

### Step 2: Read the CSV File

With Pandas imported, the next step is to read the CSV file containing the Pokémon data. CSV, which stands for Comma-Separated Values, is a popular format for storing tabular data. Pandas has a built-in function, `read_csv()`, that makes reading CSV files straightforward. This function takes the file path or URL of the CSV file and converts it into a DataFrame. Here's how you do it:

```python
url = "https://github.com/brendanpshea/programming_problem_solving/raw/main/data/pokemon.csv"
pokemon_df = pd.read_csv(url)
```

In this code, `url` holds the link to the CSV file, and `pokemon_df` is the variable name we've chosen for our DataFrame. The `read_csv()` function fetches the data from the URL and parses it into a DataFrame.
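
If you want to experiment with `read_csv()` without downloading anything, one option is to hand it CSV text held in memory. The sketch below assumes a made-up three-row snippet (`csv_text`) laid out like the Pokémon file, with only a subset of its columns, and wraps it in `io.StringIO` so that it behaves like a file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file: the same kind of text a .csv file contains.
csv_text = """#,Name,Type 1,Type 2,Total
1,Bulbasaur,Grass,Poison,318
2,Ivysaur,Grass,Poison,405
4,Charmander,Fire,,309
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
mini_df = pd.read_csv(io.StringIO(csv_text))

print(mini_df.columns.tolist())
print(mini_df)
```

Notice that the empty field in the last row is read as a missing value rather than as an empty string.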

### Step 3: Verify the Data

Once the file is loaded into a DataFrame, it's good practice to verify the data to ensure everything looks as expected. This can be done by displaying the first few rows of the DataFrame. You can use the `head()` method for this:

```python
pokemon_df.head()
```

The `head()` method shows the first five rows of your DataFrame by default. This quick check helps confirm that your data is loaded correctly and gives you a glimpse of its structure and the type of data it contains.

Putting it all together:

In [None]:
import pandas as pd

url = "https://github.com/brendanpshea/programming_problem_solving/raw/main/data/pokemon.csv"
pokemon_df = pd.read_csv(url)

pokemon_df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


Let's break down what we learn from this DataFrame:

1.  Each column in the DataFrame represents a different attribute or feature of the Pokémon.
    -   `#`: An identifying number for the Pokémon. It is not unique per row: Venusaur and Mega Venusaur both have the number 3.
    -   `Name`: The name of the Pokémon.
    -   `Type 1` and `Type 2`: These columns represent the primary and secondary types of the Pokémon, indicating their elemental properties, like Grass, Poison, or Fire.
    -   `Total`: The sum of the six combat stats (HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed). For Bulbasaur, 45 + 49 + 49 + 65 + 65 + 45 = 318.
    -   `HP`: The health points of the Pokémon.
    -   `Attack` and `Defense`: These columns show the offensive and defensive strength of the Pokémon.
    -   `Sp. Atk` and `Sp. Def`: Special attack and special defense values.
    -   `Speed`: How fast the Pokémon can move in battles.
    -   `Generation`: Indicates the generation of the Pokémon series to which the Pokémon belongs.
    -   `Legendary`: A boolean (True or False) indicating whether the Pokémon is legendary.
2.  Each row in the DataFrame represents a different Pokémon, with its attributes listed across the columns.
    -   The leftmost column, which has no name (it appears as `Unnamed: 0` in the output above), is the index of the DataFrame. It provides a unique number to each row (starting from 0).
    -   The first five rows show data for Bulbasaur, Ivysaur, Venusaur, Mega Venusaur, and Charmander.
    -   We can see an evolution pattern where, for example, Bulbasaur evolves into Ivysaur and then into Venusaur. This is reflected in the increasing values of their `Total`, `HP`, `Attack`, and other stats.
3.  The DataFrame contains a mix of data types: integers (like HP, Attack), strings (like Name, Type 1), and booleans (Legendary).
4.  The missing `Type 2` value for Charmander (which Pandas displays as `NaN`) shows that the dataset has missing or undefined values, which is a common scenario in real-world data.

This DataFrame is a typical example of structured data that Pandas handles efficiently. By examining it, we can understand the attributes of each Pokémon, compare their abilities, and analyze patterns like evolution and type distribution. This information forms the basis for further analysis, such as statistical calculations, data visualization, and advanced data manipulation.

## Exploring the Data Frame: First Steps
When exploring a new DataFrame, such as `pokemon_df`, there are several initial steps you should take to familiarize yourself with the dataset. These steps are essential to understand the basic structure and content of your data.

### `head()`s and `tail()`s

First, we can use the `head()` method in Pandas (shown above). For example, `pokemon_df.head()` would display the first five rows of the `pokemon_df` DataFrame. This method gives you a quick glimpse into the types of data columns you have, the format of the data, and a sense of the values it contains. If you want to see more than five rows, you can specify a number as an argument, like `pokemon_df.head(10)` to view the first ten rows.

You can also take a look at the end of your DataFrame, which can be achieved using the `tail()` method. Just like `head()`, `tail()` defaults to the last five rows but can be adjusted to show more. This is particularly useful to ensure that the data has been loaded correctly and to understand if the format remains consistent throughout the dataset.

In [None]:
pokemon_df.tail()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True
799,721,Volcanion,Fire,Water,600,80,110,120,130,90,70,6,True


### What `shape` are you in?
The `.shape` attribute (you can tell it is an attribute rather than a method because it is accessed without parentheses) tells us how many rows and columns are in our DataFrame, in that order.

In [None]:
pokemon_df.shape

(800, 13)

It appears we have 800 rows (Pokemon) and 13 columns (properties of Pokemon).
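
As a quick sketch of how `head()`, `tail()`, and `.shape` behave, the example below uses a hypothetical eight-row DataFrame (`demo_df`, with made-up `id` and `score` columns) so that the results are easy to check by eye.

```python
import pandas as pd

# Hypothetical 8-row frame; only the row count matters for this illustration.
demo_df = pd.DataFrame({"id": range(1, 9), "score": [10, 20, 30, 40, 50, 60, 70, 80]})

first_three = demo_df.head(3)    # first 3 rows
last_two = demo_df.tail(2)       # last 2 rows
n_rows, n_cols = demo_df.shape   # shape is a (rows, columns) tuple; no parentheses

print(first_three)
print(last_two)
print(n_rows, n_cols)
```

Because `.shape` is a plain tuple, it can be unpacked into two variables as shown.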

### Please Give Me More `info()`
To understand the types of data in each column, the `info()` method is useful. It gives a concise summary of the DataFrame, including the number of non-null entries in each column, and the data type of each column. This is particularly helpful to identify columns with missing values and to plan data cleaning strategies.

In [None]:
pokemon_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    int64 
 11  Generation  800 non-null    int64 
 12  Legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB



The output of `pokemon_df.info()` provides valuable information about the DataFrame's structure and the data types of its columns:

-  **DataFrame Structure.** The dataset is a Pandas DataFrame. It contains 800 entries, indexed from 0 to 799, which implies that there are 800 Pokémon in this dataset.

-  **Data Columns and Counts.** The DataFrame consists of 13 columns. Each column's name is listed along with the count of non-null (non-missing) entries. For most columns, there are 800 non-null entries, indicating complete data for those fields. However, the 'Type 2' column stands out with only 414 non-null entries, suggesting that many Pokémon do not have a secondary type and this field is missing for a significant portion of the dataset.

-  **Data Types (Dtype).** This data frame contains three different data types.

    -   **int64.** Several columns are of integer data type (`int64`), including '#', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', and 'Generation'. These columns represent numerical data.
    -   **object.** The 'Name', 'Type 1', and 'Type 2' columns are of object type, typically used for strings in Pandas. This indicates these columns contain text data.
    -   **bool.** The 'Legendary' column is of boolean data type (`bool`), meaning it contains True/False values.
-  **Memory Usage.** The DataFrame occupies approximately 75.9 KB in memory. This information is useful for understanding the memory efficiency of the DataFrame, particularly relevant in scenarios involving large datasets or limited system resources.
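
The same facts that `info()` prints can also be retrieved as values you can work with. The sketch below assumes a small hand-typed sample (`sample_df`) and uses `isna().sum()` to count missing entries per column and `.dtypes` to list each column's data type.

```python
import pandas as pd

# Small hand-typed sample (illustrative; not the full dataset).
sample_df = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Mewtwo"],
    "Type 2": ["Poison", None, None],
    "HP": [45, 39, 106],
    "Legendary": [False, False, True],
})

# Missing values per column (the complement of info()'s "Non-Null Count").
missing_per_column = sample_df.isna().sum()

# Data type of each column, as a Series indexed by column name.
column_types = sample_df.dtypes

print(missing_per_column)
print(column_types)
```

Having the counts as a Series makes it possible to sort or filter them, which is handy once a dataset has many columns.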

### `describe()` Some Statistical Properties
Another vital step is to get a summary of your dataset, which can be done using the `describe()` method. This method provides a statistical summary of all numerical columns in the DataFrame: count, mean, standard deviation, minimum and maximum values, and percentiles. In the context of `pokemon_df`, this would give insights into the statistical distribution of Pokémon stats like HP, Attack, and Defense.

In [None]:
pokemon_df.describe()

Unnamed: 0,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
count,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0
mean,362.81375,435.1025,69.25875,79.00125,73.8425,72.82,71.9025,68.2775,3.32375
std,208.343798,119.96304,25.534669,32.457366,31.183501,32.722294,27.828916,29.060474,1.66129
min,1.0,180.0,1.0,5.0,5.0,10.0,20.0,5.0,1.0
25%,184.75,330.0,50.0,55.0,50.0,49.75,50.0,45.0,2.0
50%,364.5,450.0,65.0,75.0,70.0,65.0,70.0,65.0,3.0
75%,539.25,515.0,80.0,100.0,90.0,95.0,90.0,90.0,5.0
max,721.0,780.0,255.0,190.0,230.0,194.0,230.0,180.0,6.0


The output of `pokemon_df.describe()` provides a statistical summary of the Pokémon dataset, focusing on various numeric attributes. Here's a concise breakdown of what we learn from this summary:

- **Count.** Each attribute (like HP, Attack, Defense, etc.) has 800 entries, indicating there are 800 Pokémon in this dataset.

- **Mean.** This represents the average value for each attribute. For instance, the average HP (Hit Points) of Pokémon in this dataset is approximately 69.26, and the average Attack value is around 79.

- **Standard Deviation (std).** This shows how much variation or dispersion exists from the mean. A higher standard deviation indicates more spread out values. For example, the standard deviation of Total is around 120, suggesting a wide variation in the total stats of Pokémon.

- **Minimum (min).** The smallest value in each attribute. Notably, some minimums are very low: the minimum HP is 1, and the minimums for Attack, Defense, and Speed are each 5, suggesting the presence of some very weak Pokémon in those respects.

- **Percentiles (25%, 50%, 75%).** These values indicate the distribution of the data. The 25th percentile (also the first quartile) shows that 25% of Pokémon have a Total stat of 330 or less. The 50th percentile (**median**) shows the middle value, and the 75th percentile shows that 75% of Pokémon have a Total stat of 515 or less.

- **Maximum (max).** The highest value in each attribute. For example, the highest Total stat is 780, and the highest HP is 255. This indicates the presence of some extremely powerful Pokémon in those aspects.

Overall, this summary gives a comprehensive view of the central tendency, dispersion, and range of the numeric attributes in the Pokémon dataset, which is essential for understanding the overall statistical characteristics of these creatures.
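
Each row of the `describe()` output can also be computed on its own. As a small sketch, the example below takes the HP values of the first five rows shown earlier and compares `describe()` with the individual `mean()`, `median()`, and `quantile()` methods.

```python
import pandas as pd

# HP values of the first five rows shown earlier (Bulbasaur through Charmander).
hp = pd.Series([45, 60, 80, 80, 39], name="HP")

summary = hp.describe()       # count, mean, std, min, 25%, 50%, 75%, max

# The same numbers, computed individually:
mean_hp = hp.mean()
median_hp = hp.median()       # matches the 50% row of describe()
q1_hp = hp.quantile(0.25)     # matches the 25% row of describe()

print(summary)
print(mean_hp, median_hp, q1_hp)
```

Note that the mean (60.8) and the median (60) differ slightly here, which is a first hint about how the values are distributed.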

## Questions: Getting to Know Data Frames
1.  What is data analytics and why is it important in today's world? Discuss with an example related to something you are interested in.
2. Examine the 'HP', 'Attack', and 'Defense' statistics of Pokémon. What do these descriptive statistics tell us about the overall strengths and weaknesses of Pokémon?
3. How do the mean and median values in the 'Speed' or 'Total' statistics help us understand the general performance characteristics of Pokémon?
4. Discuss the significance of different data types (like int64, object, bool) seen in the Pokémon dataset. Why is it important to recognize these types?
5. In the Pokémon dataset, the 'Type 2' column has several null entries. What does the presence of these nulls signify?

### My Answers
1.

2.

3.

4.

5.

## How Can I Filter and Select Data?
Filtering and selecting data in Pandas is a fundamental aspect of data analysis, allowing you to extract specific subsets of data based on certain criteria. This is crucial in situations where you are interested in analyzing a particular segment of the data that meets specific conditions. For instance, in our Pokémon dataset (`pokemon_df`), you might want to analyze only the Legendary Pokémon or those belonging to a particular type.

**Filtering** in Pandas is done by specifying a condition that the rows must meet. This condition is usually a comparison operation. For example, to keep only Pokémon with a speed greater than 100, you would compare the 'Speed' column to the value 100. The result is a Series of boolean values (True or False), one for each row, indicating whether that row meets the condition. This Series of booleans can then be used to index the DataFrame, returning only the rows where the condition is True.

The general syntax for filtering based on a single condition is:

```python
filtered_df = df[df['column_name'] > value]
```

Suppose you want to filter `pokemon_df` to include only Legendary Pokémon. The 'Legendary' column contains boolean values, so the condition is straightforward:

In [None]:
legendary_df = pokemon_df[pokemon_df['Legendary'] == True]
legendary_df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
156,144,Articuno,Ice,Flying,580,90,85,100,95,125,85,1,True
157,145,Zapdos,Electric,Flying,580,90,90,85,125,90,100,1,True
158,146,Moltres,Fire,Flying,580,90,100,90,125,85,90,1,True
162,150,Mewtwo,Psychic,,680,106,110,90,154,90,130,1,True
163,150,MewtwoMega Mewtwo X,Psychic,Fighting,780,106,190,100,154,100,130,1,True
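
It can help to see the two steps of filtering separately. The sketch below assumes a four-row hand-typed sample (`sample_df`) and first builds the boolean Series (called `mask` here, a name chosen for illustration), then uses it to index the DataFrame.

```python
import pandas as pd

# Four hand-typed rows (illustrative sample, not the full dataset).
sample_df = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Zapdos", "Mewtwo"],
    "Speed": [45, 65, 100, 130],
    "Legendary": [False, False, True, True],
})

# Step 1: the comparison produces a boolean Series, one value per row.
mask = sample_df["Speed"] > 100
print(mask.tolist())   # [False, False, False, True]

# Step 2: indexing with the mask keeps only the rows where it is True.
fast_df = sample_df[mask]

# A boolean column can serve as the mask directly; '== True' is optional.
legendary_df = sample_df[sample_df["Legendary"]]

print(fast_df)
print(legendary_df)
```

Zapdos is excluded from `fast_df` because its Speed is exactly 100, and the condition asks for values strictly greater than 100.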


Selecting data, on the other hand, refers to choosing a subset of columns or rows based on their labels or positions. In Pandas, you can do this using the `loc` and `iloc` indexers (they are used with square brackets rather than called like methods).

**Selecting Columns.** To select specific columns by name, you do the following:

```python
selected_columns = df[['column1', 'column2']]
```

To select rows (or rows and columns), we can use `loc` and `iloc`.
-  `loc` is used for label-based indexing. You specify the name of the rows and columns you want to select.
-  `iloc` is used for position-based indexing. You specify the integer index of the rows and columns.

```python
# Using loc for label-based indexing
subset_df = df.loc[rows, 'column_name']

# Using iloc for position-based indexing
subset_df = df.iloc[row_indices, column_indices]
```

**Example.** If you want to select the 'Name' and 'Type 1' columns for the first 10 Pokémon in `pokemon_df`, you would use:

In [None]:
# Using .loc to select row and column
selected_pokemon = pokemon_df.loc[:9, ['Name', 'Type 1']]
selected_pokemon

Unnamed: 0,Name,Type 1
0,Bulbasaur,Grass
1,Ivysaur,Grass
2,Venusaur,Grass
3,VenusaurMega Venusaur,Grass
4,Charmander,Fire
5,Charmeleon,Fire
6,Charizard,Fire
7,CharizardMega Charizard X,Fire
8,CharizardMega Charizard Y,Fire
9,Squirtle,Water


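Or, using `iloc` (remembering that Python uses 0-based indexing). Note that a `loc` slice includes its end label, so `:9` returns rows 0 through 9, while an `iloc` slice excludes its end position, so `:10` returns the same ten rows.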

In [None]:
# Using iloc to select rows and columns
selected_pokemon = pokemon_df.iloc[:10, [1, 2]]
selected_pokemon


Unnamed: 0,Name,Type 1
0,Bulbasaur,Grass
1,Ivysaur,Grass
2,Venusaur,Grass
3,VenusaurMega Venusaur,Grass
4,Charmander,Fire
5,Charmeleon,Fire
6,Charizard,Fire
7,CharizardMega Charizard X,Fire
8,CharizardMega Charizard Y,Fire
9,Squirtle,Water


By mastering filtering and selecting data in Pandas, you gain powerful tools to slice and dice your data, making it possible to focus on specific areas of interest, which is essential for effective data analysis.
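
The difference between labels and positions is easiest to see on a small example. The sketch below assumes a four-row hand-typed sample (`sample_df`); it compares a `loc` slice with an `iloc` slice, and then sets the 'Name' column as the index so that row labels are no longer the same as row positions.

```python
import pandas as pd

# Four hand-typed rows (illustrative sample, not the full dataset).
sample_df = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander"],
    "Type 1": ["Grass", "Grass", "Grass", "Fire"],
    "HP": [45, 60, 80, 39],
})

by_label = sample_df.loc[:2, ["Name", "Type 1"]]   # labels 0, 1, 2 (end label included)
by_position = sample_df.iloc[:2, [0, 1]]           # positions 0, 1 (end position excluded)

# With names as the index, labels and positions are clearly different things.
indexed_df = sample_df.set_index("Name")
hp_by_label = indexed_df.loc["Venusaur", "HP"]   # look up by row and column label
hp_by_position = indexed_df.iloc[2, 1]           # third row, second remaining column

print(len(by_label), len(by_position))
print(hp_by_label, hp_by_position)
```

Both lookups at the end return Venusaur's HP; `loc` finds it by name, `iloc` by counting rows and columns.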

## Table: Filtering Data With Pandas
Pandas has *a lot* of power to filter and select data, of which we are just scratching the surface. Here are a few examples:

| Pandas Syntax | Description |
| --- | --- |
| `pokemon_df[pokemon_df['Type 1'] == 'Fire']` | Selects all Pokémon where the primary type (`Type 1`) is Fire. |
| `pokemon_df[pokemon_df['HP'] > 100]` | Filters Pokémon with health points (HP) greater than 100. |
| `pokemon_df[(pokemon_df['Attack'] > 100) & (pokemon_df['Defense'] > 100)]` | Selects Pokémon with both Attack and Defense stats over 100. |
| `pokemon_df.loc[pokemon_df['Legendary'] == True, ['Name', 'Total']]` | Retrieves the names and total stats of all Legendary Pokémon. |
| `pokemon_df.iloc[:10, [1, 2, 3]]` | Selects the first 10 Pokémon and their columns 'Name', 'Type 1', and 'Type 2'. |
| `pokemon_df.loc[pokemon_df['Speed'].idxmax()]` | Finds the Pokémon with the highest Speed stat. |
| `pokemon_df.sort_values(by='HP', ascending=False).head(10)` | Lists the top 10 Pokémon with the highest HP. |
| `pokemon_df[(pokemon_df['Type 1'] == 'Water')]` | Selects all Pokémon whose 'Type 1' is 'Water'. |
| `pokemon_df[pokemon_df['Name'].str.contains('Mega')]` | Selects all Pokémon with 'Mega' in their name. |
| `pokemon_df.groupby('Type 1').mean(numeric_only=True).sort_values(by='Attack', ascending=False)` | Groups Pokémon by their primary type and sorts the types by their average Attack stat in descending order. |

The table of examples includes several new Pandas methods that are commonly used for data manipulation and analysis. Let's break down these methods:

-  `idxmax()` returns the index of the first occurrence of the maximum value in a given column.
-  `sort_values(by=, ascending=)` sorts a DataFrame by the values of one or more columns. It has parameters:
   -   `by`: Specifies the column(s) to sort by.
   -   `ascending`: A boolean value (True or False) that defines whether the sorting should be in ascending order. By default, it's set to True.
-  `str.contains()` is used to test if a string pattern or regex is contained within each string of a Series or Index. It returns a Series of boolean values.
-  `groupby()` is used to group a DataFrame by one or more columns and perform aggregate operations on these groups. For example, `pokemon_df.groupby('Type 1').mean(numeric_only=True)` groups the Pokémon by their 'Type 1' and then calculates the mean (average) for each numeric column within each group. The `numeric_only=True` argument tells Pandas to skip text columns such as 'Name'; in Pandas 2.0 and later, leaving it out raises an error when the groups contain text columns. The result can then be combined with `sort_values()` to sort the groups by their average Attack stat.
-  Logical Operators (`&`, `|`) are used for combining boolean masks (conditions) in Pandas. Each condition must be wrapped in its own parentheses, as in `(condition1) & (condition2)`.
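
The methods listed above can be tried out on a small sample. The following is a minimal sketch, assuming a five-row hand-typed DataFrame (`sample_df`) with a subset of the dataset's columns; the variable names are illustrative.

```python
import pandas as pd

# Five hand-typed rows (illustrative sample, not the full dataset).
sample_df = pd.DataFrame({
    "Name": ["Bulbasaur", "VenusaurMega Venusaur", "Charmander", "Mewtwo", "Zapdos"],
    "Type 1": ["Grass", "Grass", "Fire", "Psychic", "Electric"],
    "Attack": [49, 100, 52, 110, 90],
    "Speed": [45, 80, 65, 130, 100],
})

# idxmax() returns a row label; .loc then retrieves that row.
fastest = sample_df.loc[sample_df["Speed"].idxmax()]

# sort_values: highest Attack first.
by_attack = sample_df.sort_values(by="Attack", ascending=False)

# str.contains builds a boolean mask from a text pattern.
megas = sample_df[sample_df["Name"].str.contains("Mega")]

# & (and) / | (or) combine masks; each condition needs its own parentheses.
fast_and_strong = sample_df[(sample_df["Speed"] > 90) & (sample_df["Attack"] > 100)]
grass_or_fire = sample_df[(sample_df["Type 1"] == "Grass") | (sample_df["Type 1"] == "Fire")]

# groupby + mean: numeric_only=True skips text columns such as 'Name'.
mean_by_type = sample_df.groupby("Type 1").mean(numeric_only=True)

print(fastest["Name"])
print(by_attack["Name"].tolist())
print(mean_by_type)
```

Each line produces either a single row (`fastest`), a filtered or reordered DataFrame, or a new DataFrame with one row per group (`mean_by_type`).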

## Exercises: Filtering and Selecting Data

In [None]:
# 1. Filter pokemon_df to show only Pokémon with a Speed less than 50. Show the head.

In [None]:
# 2. Select only the 'Name' column from pokemon_df. Show the head.

In [None]:
# 3. Select the 'Name', 'Attack', and 'Defense' columns from pokemon_df.
# Show the tail.

In [None]:
# 4. Filter pokemon_df for Pokémon that have a Defense of over 100 and HP less than 50.
# Show the first 3.

In [None]:
# 5. Use loc to select the rows from 100 to 107 and the columns 'Name' and 'Type 1'.

In [None]:
# 6. Sort pokemon_df by the 'Speed' column in descending order and select the top 15 Pokémon.

In [None]:
# 7. Select all Pokémon from pokemon_df whose names start with the letter 'G'.
# Now, sort by Total, and display the head

In [None]:
# 8. (OPTIONAL) Group pokemon_df by 'Generation', then select only the groups where the average Attack is above 70.
# Display the result.

## How Can I Create or Manipulate Columns in Pandas?
Creating or manipulating columns in Pandas is a vital aspect of data preprocessing and feature engineering. This process involves adding new columns or modifying existing ones in a DataFrame based on certain criteria or computations. For example, in the Pokémon dataset, you might want to create a new column that categorizes Pokémon based on their total stats, or you might want to modify an existing column to adjust the values in some way.

### Creating New Columns

You can create new columns in a Pandas DataFrame simply by assigning values to a new column name in the DataFrame. The values can be static, derived from existing columns, or based on a function.

Suppose you want to categorize Pokémon into 'High', 'Medium', and 'Low' tiers based on their 'Total' stats:

In [None]:
# Define a function to categorize based on Total stats
def categorize_total(total):
    if total >= 600:
        return 'High'
    elif total >= 300:
        return 'Medium'
    else:
        return 'Low'

# Create a new column 'Category' using the function
pokemon_df['Category'] = pokemon_df['Total'].apply(categorize_total)

# Show the data frame
pokemon_df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Category
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False,Medium
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False,Medium
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False,Medium
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False,High
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False,Medium


Here's what happens:
1. We create a function `categorize_total` that takes a numerical value (the 'Total' stat of a Pokémon) and returns a category ('High', 'Medium', or 'Low') based on the value. It uses `if-elif-else` statements to determine the category.
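2. We use the Pandas `.apply()` method. When called on a single column (a Series), as it is here, `.apply()` runs the function once for each value in that column, so `categorize_total` is applied to every value in 'Total'. (Called on a whole DataFrame, `.apply()` instead works along an axis, passing each column or each row to the function.)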
3. We use `pokemon_df['Category'] = ...` to create a new column in `pokemon_df` named 'Category'. Each row in this column gets the value returned by `categorize_total` for the corresponding row's 'Total' stat.
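
To check the thresholds, it can be useful to run the function on a few boundary values. The sketch below repeats `categorize_total` on a short Series of test values chosen for illustration (including 300 and 600 exactly). It also shows `pd.cut` as one possible alternative for the same binning; that alternative is an assumption added here for comparison, not the approach used in this chapter.

```python
import pandas as pd

# Test values chosen for illustration, including the exact thresholds 300 and 600.
totals = pd.Series([180, 300, 318, 525, 600, 780], name="Total")

def categorize_total(total):
    if total >= 600:
        return "High"
    elif total >= 300:
        return "Medium"
    else:
        return "Low"

# Series.apply calls the function once per value.
via_apply = totals.apply(categorize_total)

# Alternative sketch: pd.cut with left-closed bins
# [-inf, 300), [300, 600), [600, inf) reproduces the same thresholds.
via_cut = pd.cut(
    totals,
    bins=[float("-inf"), 300, 600, float("inf")],
    labels=["Low", "Medium", "High"],
    right=False,
)

print(via_apply.tolist())
print(via_cut.astype(str).tolist())
```

Both approaches give the same labels here; `.apply()` is more flexible for arbitrary logic, while `pd.cut` is convenient when the categories are simple numeric ranges.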

### Modifying Existing Columns
You can also modify existing columns. This is commonly done for tasks like data **normalization**, converting data types, or applying a transformation to the data.

For example, if you want to normalize the 'Total' stats to a scale of 0 to 1:

In [None]:
# I am going to make a copy of the data frame
# So we don't change our original
pokemon_df_normal = pokemon_df

# Normalize the 'Total' column
numerator = (pokemon_df['Total'] - pokemon_df['Total'].min())
denominator = (pokemon_df['Total'].max() - pokemon_df['Total'].min())
pokemon_df_normal['Total'] =  numerator / denominator

pokemon_df_normal.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Category
0,1,Bulbasaur,Grass,Poison,0.23,45,49,49,65,65,45,1,False,Medium
1,2,Ivysaur,Grass,Poison,0.375,60,62,63,80,80,60,1,False,Medium
2,3,Venusaur,Grass,Poison,0.575,80,82,83,100,100,80,1,False,Medium
3,3,VenusaurMega Venusaur,Grass,Poison,0.741667,80,100,123,122,120,80,1,False,High
4,4,Charmander,Fire,,0.215,39,52,43,60,50,65,1,False,Medium


Here's what we are doing here:

-   `numerator = (pokemon_df['Total'] - pokemon_df['Total'].min())`: This line finds the difference between each individual Pokémon's "Total" stat and the lowest "Total" of all Pokémon. It will have a different value for each row.
-   `denominator = (pokemon_df['Total'].max() - pokemon_df['Total'].min())`: This line finds the overall range of "Total" stats in the dataset, that is, the gap between the strongest and weakest Pokémon. This will be a constant.
-   `pokemon_df_normal['Total'] = numerator / denominator`: This line divides the differences we found earlier by the overall range. It rescales the "Total" stats to run from 0 (the weakest Pokémon) to 1 (the strongest).

This sort of normalization is very common in data science, and is often used when we want to compare different columns (for example, attack, defense, hp, etc.) that have different numerical ranges.
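
As a rough sketch of that multi-column use case, the same min-max formula can be applied to several columns at once. The toy values below are made up, and `stat_cols` is just an illustrative name.

```python
import pandas as pd

# Toy data (hypothetical values) with two columns on different scales
toy_df = pd.DataFrame({
    "HP": [40, 60, 80],
    "Attack": [50, 100, 150],
})

stat_cols = ["HP", "Attack"]

# Work on a copy so the original values stay available
normalized = toy_df.copy()

# .min() and .max() return one value per column; pandas lines them up
# with the matching columns when subtracting and dividing
col_min = toy_df[stat_cols].min()
col_max = toy_df[stat_cols].max()
normalized[stat_cols] = (toy_df[stat_cols] - col_min) / (col_max - col_min)

print(normalized)
```

After this step both columns run from 0 to 1, so a value of 0.5 means "halfway between the lowest and highest" in either column.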

### Conditional Column Creation
You can create columns based on conditions applied to existing data. This is particularly useful for flagging data, creating binary categories, or segmenting data into groups based on specific criteria.

For example, to create a binary flag for Pokémon with a speed greater than 100:

In [None]:
# Create a new column 'High_Speed_Flag'
pokemon_df['High_Speed_Flag'] = pokemon_df['Speed'] > 100
pokemon_df.head()


Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Category,High_Speed_Flag
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False,Medium,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False,Medium,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False,Medium,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False,High,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False,Medium,False
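
Conditions can also be combined, or turned into text labels instead of True/False. The following is a minimal sketch on made-up rows; the column names `Fast_And_Strong` and `Speed_Label` are hypothetical, and `np.where` comes from NumPy rather than Pandas.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Speed": [45, 110, 120, 100],
    "Attack": [49, 130, 60, 150],
})

# Combine two conditions with & (each condition needs its own parentheses)
toy_df["Fast_And_Strong"] = (toy_df["Speed"] > 100) & (toy_df["Attack"] > 100)

# np.where(condition, value_if_true, value_if_false) builds a label column
toy_df["Speed_Label"] = np.where(toy_df["Speed"] > 100, "Fast", "Not fast")

print(toy_df)
```

Note that a Speed of exactly 100 is not flagged, because the comparison is strictly greater than 100.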


## Basic Methods of Data Analytics
In this section, we'll explore how we can use Pandas to answer some basic questions that arise in data analysis. Some of the techniques here are relatively advanced, so don't worry if you don't get all the "details" yet. The goal is to get a broad sense of "what you can do" with Pandas.

### Comparative Analysis with Pandas
**Sample Question: How do the average attack and defense statistics differ between Fire-type and Water-type Pokémon? Does one type generally exhibit stronger offensive or defensive capabilities?**

This question is an example of a **comparative analysis**, a common type of inquiry in data analysis where two or more groups are compared based on certain metrics. In the context of the Pokémon dataset, the groups are Pokémon types (Fire and Water), and the metrics are attack and defense statistics.

In real-world scenarios, this type of question is ubiquitous. For example, in the business sector, companies often compare the performance of different product lines (e.g., comparing average sales of two different types of products). In healthcare research, different treatment methods may be compared based on effectiveness or side effects. The underlying principle is to identify and understand differences or similarities between distinct groups within a larger dataset.

To answer this question using Pandas, we will follow a series of steps:

1.  *Filter the DataFrame.* We need to create subsets of the DataFrame for each Pokémon type we're interested in - Fire and Water.
2.  *Calculate Averages.* For each subset, we calculate the mean values of the 'Attack' and 'Defense' columns.
3.  *Compare Results.* The final step involves comparing these average values to draw conclusions about the offensive and defensive capabilities of Fire-type versus Water-type Pokémon.

In [None]:
# Example: Comparative Analytics

# Filter for Fire-type Pokémon
fire_pokemon = pokemon_df[pokemon_df['Type 1'] == 'Fire']

# Filter for Water-type Pokémon
water_pokemon = pokemon_df[pokemon_df['Type 1'] == 'Water']

# Calculate average attack and defense for Fire-type Pokémon
avg_attack_fire = fire_pokemon['Attack'].mean()
avg_defense_fire = fire_pokemon['Defense'].mean()

# Calculate average attack and defense for Water-type Pokémon
avg_attack_water = water_pokemon['Attack'].mean()
avg_defense_water = water_pokemon['Defense'].mean()

# Displaying the results
print(f"Average Attack for Fire-type Pokémon: {avg_attack_fire}")
print(f"Average Defense for Fire-type Pokémon: {avg_defense_fire}")
print(f"Average Attack for Water-type Pokémon: {avg_attack_water}")
print(f"Average Defense for Water-type Pokémon: {avg_defense_water}")


Average Attack for Fire-type Pokémon: 84.76923076923077
Average Defense for Fire-type Pokémon: 67.76923076923077
Average Attack for Water-type Pokémon: 74.15178571428571
Average Defense for Water-type Pokémon: 72.94642857142857


These results suggest that Fire-type Pokémon have a somewhat higher average Attack than Water-type Pokémon (about 84.8 versus 74.2), but a somewhat lower average Defense (about 67.8 versus 72.9). In a real-world scenario, we might follow this up with a more detailed analysis (for example, by including medians, standard deviations, and other statistics).
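
One way such a follow-up might look is sketched below, using `groupby` with `.agg()` to get several statistics per group in one step. The rows are made up for illustration, so the numbers say nothing about real Pokémon.

```python
import pandas as pd

# Toy data (hypothetical values, not the real dataset)
toy_df = pd.DataFrame({
    "Type 1": ["Fire", "Fire", "Fire", "Water", "Water", "Water"],
    "Attack": [50, 80, 110, 40, 70, 70],
    "Defense": [40, 60, 80, 60, 70, 110],
})

# Keep only the two groups we want to compare
two_types = toy_df[toy_df["Type 1"].isin(["Fire", "Water"])]

# One row per type; mean, median, and standard deviation for each stat
summary = two_types.groupby("Type 1")[["Attack", "Defense"]].agg(["mean", "median", "std"])
print(summary)
```

Comparing the mean with the median hints at whether a few extreme values are pulling the average, and the standard deviation shows how spread out each group is.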

### Correlation Analysis Using Pandas
**Sample Question: Is there a significant correlation between a Pokémon's Speed and its Attack statistic? Does a higher speed typically imply a higher attack value?**

This question is an exploration of correlation analysis between two variables, which is a fundamental aspect of data analysis. In this case, the variables are the Speed and Attack statistics of Pokémon. Correlation analysis helps in understanding whether, and how strongly, two variables are related to each other.

In real-world applications, correlation analysis is widely used across various fields. For instance, in finance, analysts might examine the correlation between different stock prices or between a stock price and market indices. In healthcare, researchers might investigate the correlation between lifestyle factors (like exercise) and health outcomes (like heart disease risk). Identifying correlations can provide insights into potential relationships or dependencies between different factors.

To assess the correlation between Speed and Attack for Pokémon using Pandas, we will:

1.  *Select the Relevant Columns.* We focus on the 'Speed' and 'Attack' columns from the dataset.
2.  *Calculate Correlation using `df.corr()`.* Use Pandas to calculate the correlation coefficient between these two variables.
3.  *Interpret the Result.* The correlation coefficient, ranging from -1 to 1, will indicate the strength and direction of the relationship. A value closer to 1 implies a strong positive correlation.

In [None]:
# Calculate the correlation coefficient
correlation = pokemon_df[['Speed', 'Attack']].corr()

# Display the correlation coefficient
print("Correlation coefficient between Speed and Attack:")
print(correlation)

Correlation coefficient between Speed and Attack:
          Speed   Attack
Speed   1.00000  0.38124
Attack  0.38124  1.00000


This is a correlation matrix that shows the relationship between two variables, Speed and Attack.

-   The diagonal elements (1.00000 for both Speed and Attack) represent the perfect positive correlation between a variable and itself (i.e., Speed perfectly correlates with Speed and Attack perfectly correlates with Attack).
-   The off-diagonal element (0.38124) represents the positive correlation between Speed and Attack. This value indicates a moderate positive relationship, meaning that as Speed increases, Attack tends to increase as well, but not perfectly.
-   The correlation coefficient is always between -1 and 1. A value of 1 indicates a perfect positive correlation, a value of -1 indicates a perfect negative correlation, and a value of 0 indicates no correlation.

In other words, there is a moderate positive correlation between Speed and Attack. This means that as Speed increases, Attack tends to increase as well, but not perfectly. It is important to note that correlation does not imply causation. Just because Speed and Attack are correlated does not mean that one causes the other (and, in fact, in this case, we know that both are caused by something else--the decisions of Pokemon game designers!).
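
If only the single coefficient is needed rather than the whole matrix, one option is to call `.corr()` on one column and pass it the other. The sketch below uses made-up numbers, chosen so that one pair of columns rises together and another pair moves in opposite directions.

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "Speed": [10, 20, 30, 40, 50],
    "Attack": [12, 18, 33, 35, 52],
    "Defense": [50, 40, 30, 20, 10],
})

# Series.corr returns one number (Pearson correlation by default)
speed_attack_r = toy_df["Speed"].corr(toy_df["Attack"])
speed_defense_r = toy_df["Speed"].corr(toy_df["Defense"])

print(f"Speed vs Attack:  {speed_attack_r:.3f}")
print(f"Speed vs Defense: {speed_defense_r:.3f}")
```

In this toy table Speed and Attack rise together (a coefficient close to 1), while Defense falls in lockstep as Speed rises (a coefficient of -1).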

In [None]:
# Grouping data by generation and calculating the mean of Total stats for each generation
generation_means = pokemon_df.groupby('Generation')['Total'].mean()

# Displaying the mean Total stats per generation
print("Average Total Stats per Pokémon Generation:")
print(generation_means)


Average Total Stats per Pokémon Generation:
Generation
1    426.813253
2    418.283019
3    436.225000
4    459.016529
5    434.987879
6    436.378049
Name: Total, dtype: float64


This code calculates the average 'Total' stat for each generation of Pokémon in the `pokemon_df` DataFrame. Let's break it down:

-   `pokemon_df.groupby('Generation')` groups the DataFrame by the `Generation` column. This creates a dictionary-like object where each key is a generation and the value is a DataFrame containing all the pokemon of that generation.
-   `['Total']` selects the `Total` column from each group. This creates a new dictionary-like object where each key is a generation and the value is a Series containing the total stats for each pokemon of that generation.
-   `.mean()` calculates the average of each Series in the dictionary-like object. This creates a new Series containing the average total stats for each generation.
-  The result is stored in the `generation_means` variable.

This reveals that the differences between generations are relatively small. Still, some generations (such as 2, with an average of about 418) are underpowered relative to others (such as 4, with an average of about 459).
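
To make the "dictionary-like" description more concrete, here is a small sketch that loops over a grouped object before averaging. The rows are made up, and the loop is only for inspection; it is not needed to compute the means.

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Generation": [1, 1, 2, 2],
    "Total": [300, 500, 400, 600],
})

grouped = toy_df.groupby("Generation")

# Each step of the loop yields a key and the sub-DataFrame for that key
for generation, group in grouped:
    print(generation, group["Total"].tolist())

# Selecting a column and calling .mean() gives one average per group
generation_means = grouped["Total"].mean()
print(generation_means)
```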

### Association Analysis Using Pandas
**Sample Question: Is there a relationship between the rarity of Pokémon types and their average stat values? For instance, are rarer types like Dragon or Ghost statistically stronger?**

This question explores the concept of association analysis between two variables: Pokémon type rarity and average stat values. It seeks to determine if there is a link between how rare a Pokémon type is and its average statistics. This type of analysis is often used in fields like market research to investigate if there are associations between product availability (rarity) and their qualities (like price or features).

In a business context, for example, this could translate to analyzing whether luxury brands (rarer) typically offer higher quality or features than more commonly available brands. In environmental science, it might involve studying if rare animal species possess unique traits compared to more common species.

To investigate this relationship using Pandas, we'll:

1.  *Determine Rarity of Each Type.* Classify Pokémon types based on their frequency in the dataset.
2.  *Calculate Average Stats for Each Type.* Compute the average of relevant stats (like Total, HP, Attack, etc.) for each Pokémon type.
3.  *Analyze Relationship.* Compare the rarity of each type with its average stats to see if rarer types tend to have higher stats.


In [None]:
# Combine type columns, count occurrences, and rename for clarity
type_counts = pd.concat([pokemon_df['Type 1'], pokemon_df['Type 2']]).value_counts().rename('Frequency')

# Calculate mean for numeric columns only
type_avg_stats = pokemon_df.groupby('Type 1').mean(numeric_only=True)

# Merge with type counts, sort by rarity, and display analysis
print("Analysis of Stat Distribution and Rarity:")
print(type_avg_stats.merge(type_counts, left_index=True, right_index=True).sort_values(by='Frequency')[['Total', 'Frequency']])


Analysis of Stat Distribution and Rarity:
               Total  Frequency
Ice       433.458333         38
Fairy     413.176471         40
Ghost     439.562500         46
Steel     487.703704         49
Dragon    550.531250         50
Electric  443.409091         50
Dark      445.741935         51
Fighting  416.444444         53
Rock      453.750000         58
Poison    399.142857         62
Fire      458.076923         64
Ground    437.500000         67
Bug       378.927536         72
Psychic   475.947368         90
Grass     421.142857         95
Flying    485.000000        101
Normal    401.683673        102
Water     430.455357        126


This code block has a number of new concepts. Let's take a closer look at this.

1\. Counting Type Occurrences

-   `pd.concat([pokemon_df['Type 1'], pokemon_df['Type 2']])`: This line stacks the "Type 1" and "Type 2" columns into a single Series containing every type entry (including repeats).
-   `.value_counts()`: This counts how many times each type appears in the combined Series, giving us an idea of their rarity. Missing values (Pokémon with no second type) are ignored by default.
-   `.rename('Frequency')`: This names the resulting Series "Frequency", so the column is clearly labeled after merging.
-   The result is stored in a Series named `type_counts`.

2\. Calculating Average Stats for Each Type

-   `pokemon_df.groupby('Type 1')`: This groups the Pokémon data based on their primary type, creating smaller tables for each type.
-   `.mean(numeric_only=True)`: This calculates the average of each numerical column (like HP, Attack, etc.) within each type group.
-   The result is stored in a table named `type_avg_stats`, showing the average stats for each primary type.

3\. Combining and Sorting

-   `.merge(type_counts, left_index=True, right_index=True)`: This merges the `type_avg_stats` table (with average stats) and the `type_counts` table (with type frequencies), creating a comprehensive view.
-   `.sort_values(by='Frequency')`: This sorts the combined table based on the "Frequency" column, arranging types from rarest to most common.

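In the end, there doesn't appear to be any strong relationship between rarity and total strength. Some rare types (Steel, Dragon) are quite strong, but others (Ice, Fairy) are quite weak. Conversely, Flying types are common but quite strong. One caveat: the averages are grouped by primary type ('Type 1') only, while the frequencies count both type columns, so the two columns do not describe exactly the same set of Pokémon.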
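
A possible variation, sketched here on made-up rows, is to compute both the average and the count from the primary type alone and then summarize the relationship with a single correlation coefficient. The names `Avg_Total` and `rarity_corr` are hypothetical, and this is one simple way to quantify the relationship rather than the only one.

```python
import pandas as pd

# Toy data (hypothetical values, not the real dataset)
toy_df = pd.DataFrame({
    "Type 1": ["Water", "Water", "Water", "Bug", "Bug", "Dragon"],
    "Total": [400, 450, 500, 300, 400, 600],
})

# Named aggregation: average Total and number of rows for each primary type
by_type = toy_df.groupby("Type 1")["Total"].agg(Avg_Total="mean", Frequency="count")
by_type = by_type.sort_values("Frequency")
print(by_type)

# A negative value would mean rarer types tend to have higher averages
rarity_corr = by_type["Frequency"].corr(by_type["Avg_Total"])
print(f"Correlation between frequency and average Total: {rarity_corr:.2f}")
```

With only three types in the toy table the coefficient is not meaningful on its own; the point is the shape of the calculation.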

## Code to Know: Data Manipulation and Analysis


| Pandas Code | Description |
| --- | --- |
| `df['New_Column'] = df['Existing_Column'].apply(custom_function)` | Creates a new column by applying a function to an existing column. Useful for categorizing or transforming data. |
| `df['Normalized_Column'] = (df['Column'] - df['Column'].min()) / (df['Column'].max() - df['Column'].min())` | Normalizes a column's values to a 0-1 scale, based on the column's minimum and maximum values. |
| `df['Flag_Column'] = df['Column'] > threshold` | Creates a binary flag column based on whether each row's value in a specified column exceeds a given threshold. |
| `grouped_average = df.groupby('Group_Column')['Analysis_Column'].mean()` | Calculates the average of a column within each group defined by another column. Ideal for comparative analysis across categories. |
| `correlation_matrix = df[['Column1', 'Column2']].corr()` | Generates a correlation matrix between two specified columns, revealing the strength and direction of their relationship. |
| `average_by_category = df.groupby('Category_Column').mean(numeric_only=True)` | Computes the average values for all numerical columns within each category of a specified column, useful for category-wise analysis. `numeric_only=True` skips text columns, which cannot be averaged. |
| `merged_df = df1.merge(df2, on='Common_Column')` | Merges two DataFrames based on a common column, combining data for more comprehensive analysis. |
| `pivot_table = pd.pivot_table(df, index=['Column1', 'Column2'], values='Analysis_Column', aggfunc='mean')` | Creates a pivot table to explore the relationship between two categorical columns and a numerical column. |
| `df['Cumulative_Sum'] = df['Column'].cumsum()` | Adds a column with the cumulative sum of another column, useful for trend analysis over sequential data. |
| `df['Percentage'] = df['Partial_Column'] / df['Total_Column'] * 100` | Calculates the percentage of one column in relation to another, allowing for proportionate comparisons. |
| `filtered_df = df[df['Column'].isin(value_list)]` | Filters the DataFrame to include only rows where the column's values are in a specified list, useful for targeted data analysis. |
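
Three of the patterns in this table (`isin`, `pivot_table`, and `cumsum`) are not demonstrated earlier in the chapter, so here is a brief sketch of them on made-up rows. The values are hypothetical and chosen only to keep the arithmetic easy to check by hand.

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "Type 1": ["Fire", "Fire", "Water", "Water", "Grass"],
    "Legendary": [False, True, False, False, False],
    "Total": [300, 600, 400, 500, 350],
})

# isin: keep only rows whose type appears in the list
subset = toy_df[toy_df["Type 1"].isin(["Fire", "Water"])]

# pivot_table: average Total for each (type, legendary) combination
pivot = pd.pivot_table(subset, index=["Type 1", "Legendary"], values="Total", aggfunc="mean")
print(pivot)

# cumsum: running total down the column, in row order
toy_df["Cumulative_Total"] = toy_df["Total"].cumsum()
print(toy_df)
```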


## Discussion Questions
1. Imagine you have a dataset of student grades. How would you apply the technique of categorizing based on numeric values (like 'High', 'Medium', 'Low') to understand student performance? What criteria would you use to define these categories?
2. Discuss how changing the boundaries of these categories (e.g., what score range constitutes 'High') might affect your understanding of student performance. Why is it important to set these boundaries thoughtfully?
3. Consider a situation where you are comparing test scores from two different classes, but the tests had different maximum possible scores. Why would normalizing these scores (adjusting them to a common scale) be important in comparing the performance of the two classes?
4. In a school setting, how might creating a binary flag (like marking students as 'Pass' or 'Fail') be useful for quickly identifying students who need additional help?
5. Think about a situation where high temperatures might correlate with increased ice cream sales. What does it mean if these two variables are correlated? Does one cause the other?

### My Answers
1.

2.

3.

4.

5.

## Exercises: Data Manipulation and Analysis

In [None]:
# Exercise 1: In 'pokemon_df', create a new column 'Sample_New_Column' and fill it with the value 'Test' for all rows.
# Hint: Assign a string to a new column name, like pokemon_df['New_Column'] = 'Value'.

In [None]:
# Exercise 2: In 'pokemon_df', create a 'NormalizedTotal' column that rescales 'Total' to a scale of 0 to 100.
# Hint: Normalize 'Total' as shown earlier, and multiply by 100.

In [None]:
# Exercise 3: Create a new column 'High_Total' in 'pokemon_df' that contains
# True if 'Total' is above 500 and False otherwise.
# Hint: Use a simple conditional statement.

In [None]:
# Exercise 4: Calculate the correlation coefficient between the 'Attack' and 'Defense' columns in 'pokemon_df'.
# Hint: Use df.corr().

In [None]:
# Exercise 5: Create a new DataFrame from 'pokemon_df' that
# contains only 'Fire'-type Pokémon whose 'Total' is above 70.
# Hint: Use a combination of conditional statements like pokemon_df[condition1 & condition2].

## Review With Quizlet

In [None]:
%%html
<iframe src="https://quizlet.com/870693320/learn/embed?i=psvlh&x=1jj1" height="500" width="100%" style="border:0"></iframe>

## Glossary: Important Concepts in Data Analytics

| Term | Definition |
| --- | --- |
| Data Analytics | The science of analyzing raw data to make conclusions about that information. It involves applying an algorithmic or mechanical process to derive insights and encompasses a variety of techniques with many different goals. |
| Pandas | A software library written for the Python programming language for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series. |
| DataFrame | A two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns) in Pandas. It is commonly used for storing data tables and is similar to a SQL table or a spreadsheet. |
| Series | A one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index. It is a basic data structure in Pandas. |
| Data Cleaning | The process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combined with data analysis and data processing, it enhances the quality of data. |
| Data Munging | The process of converting or mapping data from one raw form into another format to make it more appropriate and valuable for a variety of downstream purposes such as analytics. |
| Data Visualization | The graphical representation of information and data. By using visual elements like charts, graphs, and maps, it provides an accessible way to see and understand trends, outliers, and patterns in data. |
| Data Integration | The process of combining data from different sources into a single, unified view. It involves the technical and business processes used to combine different data sources and present a unified view of the data. |
| Data Mining | The practice of examining large pre-existing databases in order to generate new information. It involves methods at the intersection of machine learning, statistics, and database systems. |
| Data Wrangling | The process of cleaning, structuring, and enriching raw data into a desired format for better decision-making in less time. It is a fundamental step in the data preparation process. |
| Index | The 'key' of a Pandas DataFrame or Series, used for fast access, alignment, and SQL-like joins. It can be thought of as an immutable array or an ordered set. |
| Merge | A function in Pandas that helps to combine different DataFrames and Series in a manner similar to SQL-style joins. It can be used to merge two datasets using a specified join method like inner, outer, left, or right. |
| GroupBy | A feature in Pandas that splits the data into groups based on some criteria, applies a function to each group independently, and combines the results. It is often used for grouping data and performing operations on each group separately. |
| Data Types | Categories of data found in data analysis, like integers, floats, or strings, that define the operations possible on the data and its storage method. In Pandas, it includes types like int64, float64, and object. |
