<a href="https://colab.research.google.com/github/brendanpshea/programming_problem_solving/blob/main/Programming_05_DataAnalytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas and Pokemon: Basic Data Analytics With Python
### Brendan Shea, PhD

**Data Analytics** refers to the process of examining, cleaning, transforming, and modeling data with the aim of discovering useful information, informing conclusions, and supporting decision-making. It involves using statistical, computational, and machine learning techniques to analyze and interpret complex data sets, enabling individuals and organizations to make more informed decisions.

Data Analytics is a broad field that spans many domains, encompassing everything from business intelligence and big data analytics to specialized fields like web analytics and social media analytics. At its core, data analytics is about uncovering patterns and extracting meaningful insights from raw data. This process can range from simple descriptive statistics, which describe and summarize data, to more complex predictive and prescriptive analytics, which forecast future trends and prescribe actions.

To put this into perspective, imagine you're playing a video game where you need to build a team of characters, each with their own set of skills and attributes. Data analytics in this context would involve examining the data on each character -- like their health points, attack strength, defense abilities, etc. -- to determine the most balanced and effective team composition. You would analyze past performance data, predict future outcomes in different scenarios, and make decisions on which characters to choose to maximize your chances of success.

In the real world, data analytics is used in a myriad of ways, from businesses analyzing customer data to improve their products and services, to healthcare providers using patient data to make better diagnostic and treatment decisions. In the educational sector, data analytics can help in understanding student performance and improving teaching methods.

### How Does Python Fit into Data Analytics?
Python serves as an accessible and practical entry point into the world of data manipulation and analysis. Its simple syntax and readability make it an ideal language for those who are just starting to explore data analytics.

When you're new to data analytics, the primary goal is to learn how to manage and interpret data effectively. Python facilitates this learning process with its straightforward and intuitive coding style. Compared with some other languages used for data work (such as MATLAB or R), many beginners find Python's syntax clear and concise, which makes it easier to grasp key concepts without getting overwhelmed.

One of the first steps in data analytics is learning how to handle and process data. Python offers a rich ecosystem of libraries specifically designed for these tasks. For instance, the Pandas library, which we'll be focusing on here, is a fundamental tool in Python for data manipulation. It allows you to easily read, write, and modify data in various formats like CSV, Excel, or databases.

Another aspect is data visualization, which is crucial in making sense of the data you are analyzing. Python libraries like Matplotlib and Seaborn enable you to create visualizations -- such as graphs and charts -- with just a few lines of code. This transforms complex numerical data into a series of easy-to-understand images, making it simpler to identify trends and patterns.
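As a rough illustration of the "few lines of code" claim, here is a minimal Matplotlib sketch that draws a bar chart of Attack values for four Pokémon. The numbers are copied from the sample table shown later in this lesson; the variable names and the output file name `attack_by_pokemon.png` are assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so this also runs as a plain script; omit in a notebook
import matplotlib.pyplot as plt

# Attack values copied from the sample table later in this lesson
names = ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander"]
attack = [49, 62, 82, 52]

fig, ax = plt.subplots()
bars = ax.bar(names, attack)      # one bar per Pokémon
ax.set_xlabel("Pokémon")
ax.set_ylabel("Attack")
ax.set_title("Attack by Pokémon")
fig.savefig("attack_by_pokemon.png")  # hypothetical output file name
```

In a notebook such as Colab, you would normally leave out the `matplotlib.use("Agg")` line and let the chart display inline.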

## Libraries for Data Analytics
Some important Python libraries for data analytics include the following.

| Python Library | Description |
| --- | --- |
| Pandas | A foundational library for data manipulation and analysis. Pandas provides DataFrames (similar to Excel spreadsheets) that make working with structured data intuitive and efficient. It's excellent for data cleaning, transformation, and analysis. |
| NumPy | Short for Numerical Python, NumPy is crucial for numerical computations. It offers support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. |
| Matplotlib | A plotting library that provides a MATLAB-like interface for creating a wide range of static, animated, and interactive visualizations. Matplotlib is very customizable and widely used for creating graphs and charts. |
| Seaborn | Built on top of Matplotlib, Seaborn is a statistical data visualization library. It provides a high-level interface for drawing attractive and informative statistical graphics. |
| SciPy | Used for scientific and technical computing, SciPy builds on NumPy and provides a large number of higher-level functions for optimization, regression, interpolation, etc. |
| Scikit-learn | A simple and efficient tool for data mining and data analysis. It's built on NumPy, SciPy, and Matplotlib, and it's best known for its capabilities in machine learning, including classification, regression, clustering, and dimensionality reduction. |
| Statsmodels | Focused on statistical models, hypothesis testing, and data exploration. It's a great tool for statistical analysis and offers extensive options for model formulation. |
| TensorFlow | An open-source library developed by Google primarily for deep learning applications. TensorFlow offers flexible tools for building and training neural networks to detect and decipher patterns and correlations, similar to human learning and reasoning. |
| Keras | An open-source software library that provides a high-level Python interface for artificial neural networks. Keras is best known as an interface to TensorFlow (recent versions can also run on other backends) and simplifies many aspects of creating and compiling deep learning models. |
| Plotly | A graphing library that makes interactive, publication-quality graphs online. It offers a range of graphical representations like scatter plots, line charts, bar charts, and more, with interactive features. |

## What is Pandas?

**Pandas** is a library in Python that makes data analysis and manipulation straightforward and accessible. It is a tool that turns complex tasks into manageable ones, simplifying the way we work with data. In the world of Python programming, especially in data analytics, Pandas is akin to a Swiss Army knife -- versatile and essential.

At its core, Pandas is designed for working with tabular or structured data. Tabular data is similar to what you would see in a spreadsheet -- data that's organized into rows and columns. For example, imagine a table containing information about different Pokémon: each row represents a Pokémon, and each column details attributes like type, hit points, and attack strength.

One of the primary components of Pandas is the DataFrame. A **DataFrame** is a way to store and manipulate data in a table with rows and columns, much like a sheet in an Excel workbook. You can think of a DataFrame as a powerful tool that allows you to do a lot of different things with your data -- sort it, filter it, calculate statistics from it, and even clean it (like removing or fixing incorrect data).
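To make sorting, filtering, and calculating statistics concrete, here is a minimal sketch on a tiny hand-built DataFrame. The four rows are copied from the Pokémon table shown later in this lesson; the variable names (`sample_df`, `sorted_df`, and so on) are chosen only for this illustration.

```python
import pandas as pd

# Four rows copied from the Pokémon table shown later in this lesson
sample_df = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander"],
    "Type 1": ["Grass", "Grass", "Grass", "Fire"],
    "HP": [45, 60, 80, 39],
    "Attack": [49, 62, 82, 52],
})

# Sort: strongest attackers first
sorted_df = sample_df.sort_values("Attack", ascending=False)

# Filter: keep only rows whose primary type is Grass
grass_df = sample_df[sample_df["Type 1"] == "Grass"]

# Statistic: average hit points across all four rows
mean_hp = sample_df["HP"].mean()

print(sorted_df)
print(grass_df)
print(mean_hp)
```

Each operation returns a new object and leaves `sample_df` unchanged, so you can try different sorts and filters without losing the original data.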

Another essential feature of Pandas is its ability to handle missing data. In real life, data can be messy. Sometimes, information is missing or incomplete. Pandas provides a straightforward way to deal with these gaps, either by filling them in with specific values or by removing the parts of the data that are incomplete.
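The two strategies just described (filling gaps or removing incomplete rows) can be sketched as follows. This is a small illustration on hand-typed data, not part of the lesson's dataset workflow; the placeholder text `"No second type"` is an arbitrary choice for this example.

```python
import pandas as pd

# Charmander has no secondary type, so its "Type 2" value is missing
types_df = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander"],
    "Type 2": ["Poison", None],
})

# Count the missing values in the column
missing_count = types_df["Type 2"].isna().sum()

# Option 1: fill the gap with a placeholder value
filled_df = types_df.fillna({"Type 2": "No second type"})

# Option 2: drop every row that is missing a "Type 2" value
dropped_df = types_df.dropna(subset=["Type 2"])

print(missing_count)
print(filled_df)
print(dropped_df)
```

Which option is appropriate depends on the analysis: filling keeps every row, while dropping keeps only complete ones.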

In addition to these features, Pandas also makes it easy to read data from different sources. Whether your data is in a CSV file, an Excel spreadsheet, or a database, Pandas can read it and turn it into a DataFrame. Once in a DataFrame, you can start analyzing and visualizing your data.
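As one illustration of reading from a database, the sketch below builds a throwaway in-memory SQLite database with Python's built-in `sqlite3` module and loads a query result with `pd.read_sql()`. The table name `pokemon` and its two columns are assumptions invented for this example; a real project would connect to an existing database instead.

```python
import sqlite3
import pandas as pd

# Build a temporary in-memory database (hypothetical table and columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pokemon (name TEXT, hp INTEGER)")
conn.executemany(
    "INSERT INTO pokemon VALUES (?, ?)",
    [("Bulbasaur", 45), ("Charmander", 39)],
)
conn.commit()

# Run a SQL query and receive the result as a DataFrame
db_df = pd.read_sql("SELECT name, hp FROM pokemon ORDER BY hp", conn)
conn.close()

print(db_df)
```

CSV and Excel files follow the same pattern with `pd.read_csv()` and `pd.read_excel()`: one function call in, one DataFrame out.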

## Example: Loading Pokemon Data Into Pandas
In this example, we will walk through the process of loading a dataset from a CSV file -- specifically, a file containing Pokémon statistics -- into a Pandas DataFrame. A DataFrame is a central data structure in Pandas and can be thought of as a table with rows and columns, similar to an Excel spreadsheet. By loading data into a DataFrame, we make it possible to perform a variety of data analysis tasks efficiently.

### Step 1: Import Pandas Library
To use Pandas, you first need to ensure it's installed in your Python environment. If it's not already installed, you can do so using pip, Python's package manager, with the command `pip install pandas`. Once Pandas is installed, you start your Python script or notebook by importing it:

```python
import pandas as pd
```

Here, `pd` is a conventional alias used for Pandas. It's a shorthand that will save you some typing and keep your code clean.

### Step 2: Read the CSV File

With Pandas imported, the next step is to read the CSV file containing the Pokémon data. CSV, which stands for Comma-Separated Values, is a popular format for storing tabular data. Pandas has a built-in function, `read_csv()`, that makes reading CSV files straightforward. This function takes the file path or URL of the CSV file and converts it into a DataFrame. Here's how you do it:

```python
url = "https://github.com/brendanpshea/programming_problem_solving/raw/main/data/pokemon.csv"
pokemon_df = pd.read_csv(url)
```

In this code, `url` holds the link to the CSV file, and `pokemon_df` is the variable name we've chosen for our DataFrame. The `read_csv()` function fetches the data from the URL and parses it into a DataFrame.
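Besides a URL, `read_csv()` also accepts a local file path or a file-like object. The sketch below uses that to try the function without an internet connection: it parses a short CSV string (two rows copied from the table shown later in this lesson, with only some of the columns) through `io.StringIO`. The names `csv_text` and `mini_df` are chosen only for this illustration.

```python
import io
import pandas as pd

# A tiny CSV document held in a string; the first line is the header row
csv_text = """#,Name,Type 1,Type 2,HP
1,Bulbasaur,Grass,Poison,45
4,Charmander,Fire,,39
"""

# StringIO makes the string behave like an open file
mini_df = pd.read_csv(io.StringIO(csv_text))

print(mini_df)
print(mini_df.shape)  # (number of rows, number of columns)
```

Note the empty field between the two commas in Charmander's row: Pandas reads it as a missing value and displays it as `NaN`.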

### Step 3: Verify the Data

Once the file is loaded into a DataFrame, it's good practice to verify the data to ensure everything looks as expected. This can be done by displaying the first few rows of the DataFrame. You can use the `head()` method for this:

```python
pokemon_df.head()
```

The `head()` method shows the first five rows of your DataFrame by default. This quick check helps confirm that your data is loaded correctly and gives you a glimpse of its structure and the type of data it contains.
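A few other quick checks are often useful alongside `head()`. The sketch below shows them on a small hand-built DataFrame so that it runs without downloading anything; the rows are copied from the table shown later in this lesson, and `sample_df` is a name chosen for this illustration.

```python
import pandas as pd

# Five rows copied from the Pokémon table shown later in this lesson
sample_df = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur",
             "VenusaurMega Venusaur", "Charmander"],
    "Type 1": ["Grass", "Grass", "Grass", "Grass", "Fire"],
    "HP": [45, 60, 80, 80, 39],
})

first_two = sample_df.head(2)      # pass a number to change how many rows you get
last_row = sample_df.tail(1)       # tail() works from the end instead
n_rows, n_cols = sample_df.shape   # (rows, columns)

print(first_two)
print(last_row)
print(n_rows, n_cols)
sample_df.info()                   # prints column names, non-null counts, and data types
```

The same calls work on `pokemon_df` once it has been loaded.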

Put together, the complete cell and its output look like this:

```python
import pandas as pd

url = "https://github.com/brendanpshea/programming_problem_solving/raw/main/data/pokemon.csv"
pokemon_df = pd.read_csv(url)

pokemon_df.head()
```

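|   | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |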


Let's break down what we learn from this DataFrame:

1.  Each column in the DataFrame represents a different attribute or feature of the Pokémon.
    -   `#`: The Pokémon's identification (Pokédex) number. It is not unique per row: Venusaur and Mega Venusaur both have the value 3.
    -   `Name`: The name of the Pokémon.
    -   `Type 1` and `Type 2`: These columns represent the primary and secondary types of the Pokémon, indicating their elemental properties, like Grass, Poison, or Fire.
    -   `Total`: The sum of the six combat stats (`HP`, `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, and `Speed`). For Bulbasaur, 45 + 49 + 49 + 65 + 65 + 45 = 318.
    -   `HP`: The health points of the Pokémon.
    -   `Attack` and `Defense`: These columns show the offensive and defensive strength of the Pokémon.
    -   `Sp. Atk` and `Sp. Def`: Special attack and special defense values.
    -   `Speed`: How fast the Pokémon can move in battles.
    -   `Generation`: Indicates the generation of the Pokémon series to which the Pokémon belongs.
    -   `Legendary`: A boolean (True or False) indicating whether the Pokémon is legendary.
2.  Each row in the DataFrame represents a different Pokémon, with its attributes listed across the columns.
    -   The leftmost column, which is not named, is the index of the DataFrame. It provides a unique number to each row (starting from 0).
    -   The first five rows show data for Bulbasaur, Ivysaur, Venusaur, Mega Venusaur, and Charmander.
    -   We can see an evolution pattern where, for example, Bulbasaur evolves into Ivysaur and then into Venusaur. This is indicated by the increasing values in their `Total`, `HP`, `Attack`, and other stats.
3.  The DataFrame contains a mix of data types: integers (like HP, Attack), strings (like Name, Type 1), and booleans (Legendary).
4.  Charmander has no secondary type, so its `Type 2` entry is missing; Pandas displays missing values like this as `NaN`. Missing or undefined values are a common scenario in real-world data.
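The observations in the list above can be checked in code. The following sketch rebuilds the five displayed rows by hand (so it runs without a download) and then tests the data types, the missing values, and whether `Total` equals the sum of the six combat stats. The names `sample_df`, `stat_cols`, and `totals_match` are chosen only for this illustration.

```python
import pandas as pd

# The five rows displayed above, typed in by hand (some columns omitted)
sample_df = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur",
             "VenusaurMega Venusaur", "Charmander"],
    "Type 2": ["Poison", "Poison", "Poison", "Poison", None],
    "Total": [318, 405, 525, 625, 309],
    "HP": [45, 60, 80, 80, 39],
    "Attack": [49, 62, 82, 100, 52],
    "Defense": [49, 63, 83, 123, 43],
    "Sp. Atk": [65, 80, 100, 122, 60],
    "Sp. Def": [65, 80, 100, 120, 50],
    "Speed": [45, 60, 80, 80, 65],
    "Legendary": [False, False, False, False, False],
})

# Data types: one entry per column
print(sample_df.dtypes)

# Missing values: count of NaN entries in each column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Does Total equal the row-wise sum of the six combat stats?
stat_cols = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
totals_match = (sample_df[stat_cols].sum(axis=1) == sample_df["Total"]).all()
print(totals_match)
```

This only confirms the pattern for the five rows shown; running the same checks on the full `pokemon_df` would tell you whether it holds for the whole dataset.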

This DataFrame is a typical example of structured data that Pandas handles efficiently. By examining it, we can understand the attributes of each Pokémon, compare their abilities, and analyze patterns like evolution and type distribution. This information forms the basis for further analysis, such as statistical calculations, data visualization, and advanced data manipulation.
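As a small taste of that further analysis, here is a minimal sketch of a type distribution and a per-type average, again using only the five rows displayed above so the numbers are easy to check by hand. The variable names are chosen for this illustration.

```python
import pandas as pd

# Three columns from the five rows displayed above
sample_df = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur",
             "VenusaurMega Venusaur", "Charmander"],
    "Type 1": ["Grass", "Grass", "Grass", "Grass", "Fire"],
    "Total": [318, 405, 525, 625, 309],
})

# Type distribution: how many rows have each primary type
type_counts = sample_df["Type 1"].value_counts()

# Average Total for each primary type
mean_total_by_type = sample_df.groupby("Type 1")["Total"].mean()

# Summary statistics (count, mean, min, max, quartiles) for Total
total_summary = sample_df["Total"].describe()

print(type_counts)
print(mean_total_by_type)
print(total_summary)
```

With only five rows these figures say little about Pokémon in general, but the same three calls scale unchanged to the full `pokemon_df`.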