<a href="https://colab.research.google.com/github/chonginbilly/Moringa_DS/blob/main/Data_Analysis_in_Pandas_Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="green">*To start working on this notebook, or any other notebook that we will use in this course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

---

# Introduction to Data Analysis in Pandas

Welcome to the fascinating world of data analysis with Pandas! In this lesson, we'll explore one of the most powerful and widely used Python libraries for data manipulation – Pandas. As we explore its capabilities, you'll discover how Pandas facilitates efficient handling, cleaning, and analysis of structured data. This skill set is essential for extracting valuable insights and patterns from your datasets.

## Objectives

By the end of this lesson, you will be able to:

- Understand the role of Pandas in data analysis and manipulation.

## Import libraries

In [None]:
import csv
# for operating system functionalities
import os
# for numerical operations
import numpy as np
# for tabular data analysis
import pandas as pd

## Exploring Python Libraries: Our Gateway to Versatility

Python, a language celebrated for its simplicity and versatility, comes alive through its extensive ecosystem of libraries. While base Python provides fundamental functionality for various tasks, the real magic happens when we tap into the vast collection of open-source libraries crafted by the vibrant Python community.

In data science, Python shines as a preferred language due to its rich ecosystem. The thriving scientific community has bestowed upon us a treasure trove of specialized packages tailored for advanced data manipulation, analysis, and visualization. These libraries serve as invaluable tools, empowering users to tackle complex tasks with efficiency and ease.

Whether you're opening a JSON file, exploring machine learning algorithms, or creating stunning visualizations, Python libraries offer a streamlined path. These libraries, developed and maintained by dedicated volunteers and professionals, epitomize the collaborative spirit of the Python community. As we embark on our data analysis journey, we'll harness the power of these libraries to elevate our Python experience, unlocking new realms of possibilities in the world of data science. Get ready to discover the artistry of Python libraries – your gateway to unparalleled versatility and efficiency.


## Pandas: Your Data's Best Friend

![pandas](https://drive.google.com/uc?export=view&id=1a53XdfP_lQiNO5BxiAC76CHCJJ59y3iw)

Pandas, an essential Python library, is your trusted companion in the world of data manipulation and analysis. Imagine it as the superhero cape for your datasets, enabling you to effortlessly handle, analyze, and visualize information. With Pandas, tasks that seem daunting become a breeze.

This library introduces two fundamental data structures: **Series** and **DataFrame**. A Series is like a **single-column table**, while a DataFrame is your **go-to two-dimensional table** of rows and columns, akin to a spreadsheet. Armed with these structures, you can filter, sort, and transform your data with simplicity and finesse.
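
To make the two structures concrete, here is a minimal sketch. The variable names `petal_lengths` and `flowers` are made up for illustration, and the values are a few measurements borrowed from the Iris dataset used later in this lesson.

```python
import pandas as pd

# A Series: one labeled column of values (illustrative data)
petal_lengths = pd.Series([1.4, 1.4, 1.3], name="petal_length(cm)")

# A DataFrame: several columns that share one row index
flowers = pd.DataFrame(
    {
        "petal_length(cm)": [1.4, 1.4, 1.3],
        "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
    }
)

print(petal_lengths.shape)  # one dimension: (3,)
print(flowers.shape)        # two dimensions: (3, 2)

# Selecting a single column of a DataFrame gives back a Series
class_column = flowers["class"]
print(type(class_column))
```

Note how a DataFrame can be thought of as a collection of Series that share the same index.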

Get ready to dive into the Pandas world where data manipulation becomes a joy. It's a must-have in your toolkit as you venture into the captivating realm of data science. For detailed guidance, explore the official [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/). Let's unleash the power of Pandas and elevate our data game!


### Why Pandas?

Pandas stands as a cornerstone in the field of data science, providing a user-friendly toolkit for Python that streamlines the manipulation of structured data. Its versatility extends from data cleaning to exploration and analysis, making it an indispensable tool for data scientists, analysts, and researchers.

Throughout this lesson, you'll witness firsthand how Pandas simplifies tasks such as cleaning and preprocessing data, enabling you to focus on the core aspects of your analysis. Whether you're dealing with spreadsheets, CSV files, databases, or any other structured data source, Pandas empowers you to navigate and extract meaningful information seamlessly.

Get ready to enhance your data manipulation skills and unlock the potential of your datasets with Pandas! Let's embark on this exciting journey into the world of data analysis.

## From `csv.DictReader` to DataFrames

As we move from `csv.DictReader` to Pandas, the spotlight falls on the DataFrame, the central data structure of the Pandas library. A DataFrame is a tabular representation of data with an integrated index. This means you can select and manipulate data by row or by column, making your data-handling tasks smoother.

Now, let's make a swift comparison between using Pandas and the built-in `csv` module. While the base Python syntax might require more intricate maneuvers, Pandas steps in with a syntax that is not only simpler but also more efficient. The DataFrame shines as your tool of choice for navigating and transforming tabular data, offering a seamless experience in the world of data science.

We'll utilize the renowned [Iris dataset](https://archive.ics.uci.edu/dataset/53/iris) for our exploration into Pandas. The initial five entries in this dataset are as follows:



|sepal_length(cm)|sepal_width(cm)|petal_length(cm)|petal_width(cm)|class|
|:------:|:------:|:------:|:------:|:------:|
|5.1|3.5|1.4|0.2|Iris-setosa|
|4.9|3.0|1.4|0.2|Iris-setosa|
|4.7|3.2|1.3|0.2|Iris-setosa|
|4.6|3.1|1.5|0.2|Iris-setosa|
|5.0|3.6|1.4|0.2|Iris-setosa|

### Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Set the path to where the data is stored in **your** Google Drive.

In [None]:
# Define the path to your data
data_path = '/content/gdrive/MyDrive/Product/Naivas Big Data /Data/Iris'

# Change the current working directory to the specified path
os.chdir(data_path)

# List files in the directory
files = os.listdir()
print("Files in the directory:", files)

Files in the directory: ['Index', 'iris.names', 'iris.data']


In [None]:
# current working directory
os.getcwd()

'/content/gdrive/MyDrive/Product/Naivas Big Data /Data/Iris'

To open and read the initial 5 lines using the `csv` module, the procedure is as follows:

In [None]:
# Using with open and csv.DictReader to read csv
with open("iris.data", "r") as csvfile:
    reader = csv.DictReader(csvfile)
    iris_data = list(reader)

# Print the first 5 rows of data
for index in range(5):
    print(iris_data[index])

{'sepal_length(cm)': '5.1', 'sepal_width(cm)': '3.5', 'petal_length(cm)': '1.4', 'petal_width(cm)': '0.2', 'class': 'Iris-setosa'}
{'sepal_length(cm)': '4.9', 'sepal_width(cm)': '3.0', 'petal_length(cm)': '1.4', 'petal_width(cm)': '0.2', 'class': 'Iris-setosa'}
{'sepal_length(cm)': '4.7', 'sepal_width(cm)': '3.2', 'petal_length(cm)': '1.3', 'petal_width(cm)': '0.2', 'class': 'Iris-setosa'}
{'sepal_length(cm)': '4.6', 'sepal_width(cm)': '3.1', 'petal_length(cm)': '1.5', 'petal_width(cm)': '0.2', 'class': 'Iris-setosa'}
{'sepal_length(cm)': '5.0', 'sepal_width(cm)': '3.6', 'petal_length(cm)': '1.4', 'petal_width(cm)': '0.2', 'class': 'Iris-setosa'}


We now have a list of dictionaries that all share the same keys. Suppose we want all the data for the 5th row (record). Because Python lists are zero-indexed, the 5th row sits at index 4, so plain list indexing is enough:

In [None]:
# 5th row (index 4, since indexing starts at 0)
iris_data[4]

{'sepal_length(cm)': '5.0',
 'sepal_width(cm)': '3.6',
 'petal_length(cm)': '1.4',
 'petal_width(cm)': '0.2',
 'class': 'Iris-setosa'}

Now suppose we want all the data for the 5th column, that is, the values stored under the `class` key. This is possible, but it requires a loop or a list comprehension. The process might resemble the following:

In [None]:
# Extracting all data for the Class using list comprehension
class_data = [row['class'] for row in iris_data]

# Print the data for the class column
print(class_data)

['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor', 'I
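
The same pattern applies to numeric columns, with one extra step: `csv.DictReader` returns every value as a string, so the values must be converted before any arithmetic. The sketch below uses an inline sample (the first five rows shown earlier) in place of the `iris.data` file so that it runs on its own; `sample`, `rows`, and `mean_sepal_length` are illustrative names.

```python
import csv
import io

# Inline sample standing in for iris.data (first five rows shown earlier)
sample = """sepal_length(cm),sepal_width(cm),petal_length(cm),petal_width(cm),class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Every value is a string, so convert to float before averaging
sepal_lengths = [float(row["sepal_length(cm)"]) for row in rows]
mean_sepal_length = sum(sepal_lengths) / len(sepal_lengths)
print(round(mean_sepal_length, 2))
```

Each new question about the data needs another loop or comprehension like this one, which is the overhead a DataFrame is designed to remove.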

## Using Pandas

Using pandas, the same task of selecting data becomes more straightforward. We can create a DataFrame directly from the CSV file, and then we can easily access specific columns or rows.

With pandas, accessing columns is as straightforward as accessing rows. For instance, if we convert the `iris_data` (a list of dictionaries) into a DataFrame and then view the first five rows:

In [None]:
# Converting the list of dictionaries into a DataFrame
iris_df = pd.DataFrame(iris_data)

# Displaying the first five rows of the DataFrame
iris_df.head()

| |sepal_length(cm)|sepal_width(cm)|petal_length(cm)|petal_width(cm)|class|
|:------:|:------:|:------:|:------:|:------:|:------:|
|0|5.1|3.5|1.4|0.2|Iris-setosa|
|1|4.9|3.0|1.4|0.2|Iris-setosa|
|2|4.7|3.2|1.3|0.2|Iris-setosa|
|3|4.6|3.1|1.5|0.2|Iris-setosa|
|4|5.0|3.6|1.4|0.2|Iris-setosa|


After importing the pandas library and assigning the standard alias `pd`, we used the DataFrame constructor to convert our existing list of dictionaries into a DataFrame.
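
One detail worth checking after this kind of conversion: because `csv.DictReader` produces strings, the resulting DataFrame columns hold text rather than numbers. The sketch below uses two made-up records in the same shape (`records` and `demo_df` are illustrative names) and shows one way to convert a measurement column with `astype(float)`.

```python
import pandas as pd

# Two records shaped like DictReader output: every value is a string
records = [
    {"sepal_length(cm)": "5.1", "class": "Iris-setosa"},
    {"sepal_length(cm)": "4.9", "class": "Iris-setosa"},
]

demo_df = pd.DataFrame(records)
print(demo_df.dtypes)  # both columns hold strings at this point

# Convert the measurement column so that arithmetic behaves numerically
demo_df["sepal_length(cm)"] = demo_df["sepal_length(cm)"].astype(float)
mean_length = demo_df["sepal_length(cm)"].mean()
print(mean_length)
```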

Now, with a more straightforward syntax, we can extract all the information from the `class` column:


In [None]:
# information about the class column
iris_df['class']

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: class, Length: 150, dtype: object

Extracting information by row is just as straightforward, similar to how it was done with the list of dictionaries. The `.iloc` accessor selects by zero-based position, so `iloc[3]` returns the 4th row:

In [None]:
# extracting information about a row
iris_df.iloc[3]

sepal_length(cm)            4.6
sepal_width(cm)             3.1
petal_length(cm)            1.5
petal_width(cm)             0.2
class               Iris-setosa
Name: 3, dtype: object
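
Row selection comes in two flavors, sketched below on a small stand-in DataFrame (`demo_df` is an illustrative name, with values taken from the first five Iris rows). `.iloc` selects by integer position, while `.loc` selects by index label. With the default index the two agree, but they can differ once the index is changed.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "sepal_length(cm)": [5.1, 4.9, 4.7, 4.6, 5.0],
        "class": ["Iris-setosa"] * 5,
    }
)

# .iloc selects by zero-based position, so 3 is the fourth row
fourth_row = demo_df.iloc[3]

# .loc selects by index label; with the default index, labels match positions
same_row = demo_df.loc[3]

# A row label and a column name can be combined in a single lookup
value = demo_df.loc[3, "sepal_length(cm)"]
print(value)
```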

We can eliminate the use of the `csv` module and the `iris_data` variable by directly reading the data from the CSV file. Simply specify the file path:

In [None]:
df = pd.read_csv('iris.data')
df

| |sepal_length(cm)|sepal_width(cm)|petal_length(cm)|petal_width(cm)|class|
|:------:|:------:|:------:|:------:|:------:|:------:|
|0|5.1|3.5|1.4|0.2|Iris-setosa|
|1|4.9|3.0|1.4|0.2|Iris-setosa|
|2|4.7|3.2|1.3|0.2|Iris-setosa|
|3|4.6|3.1|1.5|0.2|Iris-setosa|
|4|5.0|3.6|1.4|0.2|Iris-setosa|
|...|...|...|...|...|...|
|145|6.7|3.0|5.2|2.3|Iris-virginica|
|146|6.3|2.5|5.0|1.9|Iris-virginica|
|147|6.5|3.0|5.2|2.0|Iris-virginica|
|148|6.2|3.4|5.4|2.3|Iris-virginica|
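
The call above relies on the file having a header row. If your copy of the data has no header row, one option is to pass `header=None` together with `names`, as in this sketch. It reads from an inline string through `io.StringIO` so that it runs without the file; `raw`, `column_names`, and `headerless_df` are illustrative names.

```python
import io

import pandas as pd

# Two data rows with no header line, standing in for a headerless file
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

column_names = [
    "sepal_length(cm)",
    "sepal_width(cm)",
    "petal_length(cm)",
    "petal_width(cm)",
    "class",
]

# header=None: the first line is data; names: the column labels to use
headerless_df = pd.read_csv(io.StringIO(raw), header=None, names=column_names)

# Unlike csv.DictReader, read_csv infers numeric types for the measurements
print(headerless_df.dtypes)
```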


## Features of Pandas

The code snippets above showcase two prominent features of the pandas library as highlighted on the pandas [about page](https://pandas.pydata.org/about/):

1. A fast and efficient **DataFrame** object designed for data manipulation with integrated indexing.
2. Tools for seamless **reading and writing of data** between in-memory data structures and various formats, including CSV and text files, Microsoft Excel, SQL databases, and the high-performance HDF5 format.
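
As a small illustration of the second point, the sketch below writes a DataFrame to CSV text and reads it back. It uses an in-memory `io.StringIO` buffer instead of a real file, and `demo_df`, `buffer`, and `round_trip` are illustrative names.

```python
import io

import pandas as pd

demo_df = pd.DataFrame(
    {
        "petal_width(cm)": [0.2, 2.3],
        "class": ["Iris-setosa", "Iris-virginica"],
    }
)

# Write to an in-memory buffer; index=False leaves the row index out of the output
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)

# Rewind the buffer and read the same text back into a new DataFrame
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip)
```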

Additional key features include:

- Intelligent **data alignment** and integrated **handling of missing data**, allowing for automatic label-based alignment in computations and easy manipulation of messy data into an organized form.
- Flexible **reshaping** and pivoting of data sets.
- Intelligent label-based **slicing**, fancy **indexing**, and **subsetting** of large data sets.
- Aggregating or transforming data with a powerful **group-by** engine that facilitates split-apply-combine operations on data sets.
- High-performance **merging and joining** of data sets.
- **Time series** functionality, including date range generation, frequency conversion, moving window statistics, date shifting, and lagging. Users can even create domain-specific time offsets and join time series without losing data.
- Highly **optimized for performance**, with critical code paths written in Cython or C.
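
Two of these features, missing-data handling and group-by, can be previewed with a few lines. This is only a sketch on made-up data (`demo_df` and the deliberately missing value are assumptions for illustration), not a full treatment.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame(
    {
        "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
        "petal_length(cm)": [1.4, np.nan, 5.2, 5.0],  # one value deliberately missing
    }
)

# Missing values are stored as NaN and can be counted directly
missing_count = demo_df["petal_length(cm)"].isna().sum()
print(missing_count)

# Group-by: split the rows by class, then average each group
# (mean() skips NaN values by default)
mean_by_class = demo_df.groupby("class")["petal_length(cm)"].mean()
print(mean_by_class)
```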

In this module, we will explore these features in more detail.


## Summary

In this introductory lesson, we explored the pandas library, a robust tool for data manipulation and analysis in Python. We focused on its central data structure, the DataFrame, and moved from the `csv` module and `csv.DictReader` to pandas to illustrate the library's concise syntax. We created a DataFrame from a list of dictionaries, accessed its rows and columns, and read the Iris dataset directly from a CSV file. We also reviewed notable features of pandas, including integrated indexing, versatile data reading and writing tools, intelligent data alignment, and time series functionality. As we progress in the course, we will explore more advanced pandas features for efficient data manipulation and analysis.