# Pandas  

Pandas is a popular open-source data analysis and manipulation library for Python. Pandas stands for “Python Data Analysis Library”. The name is derived from the term “panel data”, an econometrics term for multidimensional structured data sets. It provides easy-to-use and powerful tools for working with structured data such as tables and time series.

Some advantages of Pandas include:

* Data manipulation: Pandas provides powerful tools for filtering, transforming, and aggregating data.
* Flexibility: Pandas can handle a wide range of data formats, including text files, CSV, Excel, SQL databases, and JSON.
* Integration with other libraries: Pandas works seamlessly with other data science libraries in Python, such as NumPy and Matplotlib.
* Time series analysis: Pandas has built-in support for working with time series data.

* Performance: Pandas is highly optimized, with critical code paths written in Cython or C.


In [None]:
import pandas as pd # import pandas library, and use pd as alias (this is a convention)

## Data Structures

Pandas data structures include Series and DataFrames. 
* A Series is a one-dimensional array-like object that can hold a variety of data types, such as integers, strings, and floats. 
* A DataFrame is a two-dimensional table-like data structure with rows and columns.

### Series
A Pandas Series is a one-dimensional labeled array that can hold any data type such as integers, floating-point numbers, strings, Python objects, etc. Each element in a Series has a label called the index, which is used to access the values of the Series. A Series is similar to a NumPy array, but it has an index that makes it more flexible and convenient to use.

In [None]:
# create a Series from a list
s1 = pd.Series([3, 5, 1, 2, 7])
print(s1)

In [None]:
# create a Series from a dictionary
data = {'apples': 5, 'oranges': 2, 'bananas': 8, 'pears': 1}
s2 = pd.Series(data)
print(s2)


In [None]:
# create a Series with custom index labels
s3 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s3)


In [None]:
# access elements of a Series using index labels
print(s2['apples'])   # output: 5
print(s3[['a', 'c']]) # output: a    10, c    30


In [None]:
# perform arithmetic operations on Series
s4 = s1 + s1
print(s4)


In [None]:
# create a boolean mask from a Series
mask = s1 > 3
print(mask)


In [None]:
# filter a Series using a boolean mask
filtered_s1 = s1[mask]
print(filtered_s1)


In [None]:
# apply a function to each element of a Series
s5 = s1.apply(lambda x: x ** 2)
print(s5)
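
The index is what sets a Series apart from a plain array: arithmetic between two Series matches values by label rather than by position. The sketch below illustrates this with made-up data; the names `stock` and `delivery` are hypothetical.

```python
import pandas as pd

# Two Series with overlapping but differently ordered labels (made-up data)
stock = pd.Series({'apples': 5, 'oranges': 2, 'bananas': 8})
delivery = pd.Series({'bananas': 1, 'apples': 10, 'kiwis': 4})

# Addition aligns on index labels, not on position
total = stock + delivery
print(total)

# Labels present in only one Series produce NaN; fill_value treats them as 0
total_filled = stock.add(delivery, fill_value=0)
print(total_filled)
```

Labels that appear in only one of the two Series end up as `NaN` in `total`, which is why `add()` with `fill_value` is often the safer choice.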

### DataFrame  
A Pandas DataFrame is a two-dimensional labeled data structure that is used to store and manipulate tabular data. It consists of rows and columns, where each column can have a different data type. A DataFrame can be thought of as a collection of Series that share the same index. The rows are labeled by the index, and the columns are labeled by their names. DataFrame is the most commonly used Pandas data structure and can be thought of as a spreadsheet or SQL table. It provides powerful data manipulation and analysis functionalities, such as data indexing, selection, merging, grouping, and pivoting.
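
The idea that a DataFrame is a collection of Series sharing one index can be checked directly. This is a minimal sketch with made-up data; the name `people` is hypothetical.

```python
import pandas as pd

people = pd.DataFrame(
    {'age': [25, 30, 35], 'country': ['USA', 'Canada', 'UK']},
    index=['Alice', 'Bob', 'Charlie'],
)

# Selecting one column returns a Series
ages = people['age']
print(type(ages))

# Every column carries the same row index as the DataFrame
print(ages.index.equals(people.index))

# Each column keeps its own data type
print(people.dtypes)
```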

In [None]:
# create a DataFrame from a dictionary
data = {'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
        'age': [25, 30, 35, 40],
        'country': ['USA', 'Canada', 'Australia', 'UK']}

df = pd.DataFrame(data)
print(df)

In [None]:
# create a DataFrame from a CSV file
df = pd.read_csv('data.csv')
print(df)

In [None]:
# create a DataFrame from a NumPy array
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
columns = ['A', 'B', 'C']

df = pd.DataFrame(data, columns=columns)
print(df)

In [None]:
# create a DataFrame from a list of dictionaries
data = [{'name': 'Alice', 'age': 25, 'country': 'USA'},
        {'name': 'Bob', 'age': 30, 'country': 'Canada'},
        {'name': 'Charlie', 'age': 35, 'country': 'Australia'},
        {'name': 'Dave', 'age': 40, 'country': 'UK'}]

df = pd.DataFrame(data)
print(df)

In [None]:
# load a JSON file into a DataFrame
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json'

# Load the JSON data into a DataFrame
df = pd.read_json(url, orient='columns')
print(df.head())

Data indexing and selection are crucial aspects of data analysis. Pandas provides flexible ways to select subsets of data using indexers such as `loc` and `iloc`.



Here are some common methods of the Pandas DataFrame:

* `head()`: Returns the first n rows of the DataFrame. The default value of n is 5.
* `tail()`: Returns the last n rows of the DataFrame. The default value of n is 5.
* `info()`: Prints a summary of the DataFrame, including the column names, data types, and non-null values.
* `describe()`: Generates descriptive statistics for numerical columns in the DataFrame, such as count, mean, and standard deviation.
* `shape`: Returns a tuple representing the dimensions of the DataFrame, i.e., (number of rows, number of columns).
* `columns`: Returns the column labels of the DataFrame as an `Index` object.
* `index`: Returns the row labels of the DataFrame as an `Index` object.
* `loc[]`: Allows you to access a group of rows and columns in the DataFrame using label-based indexing. For example, `df.loc[0:5, ['Name', 'Age']]` returns the rows labeled 0 through 5 (inclusive), and the "Name" and "Age" columns.
* `iloc[]`: Allows you to access a group of rows and columns in the DataFrame using integer-based indexing. For example `df.iloc[0:5, [0, 1]]` returns the first 5 rows and the first 2 columns.
* `drop()`: Removes one or more rows or columns from the DataFrame. For example, `df.drop('Age', axis=1)` removes the "Age" column.
* `fillna()`: Fills missing or NaN (Not a Number) values in the DataFrame with a specified value, such as the column mean or median.
* `groupby()`: Groups the DataFrame by one or more columns and applies a function, such as mean or sum, to each group.
* `pivot_table()`: Creates a pivot table from the DataFrame, allowing you to summarize and analyze the data in different ways.
* `merge()`: Merges two DataFrames based on a common column or index.
* `sort_values()`: Sorts the DataFrame by one or more columns in ascending or descending order.
* `to_csv()`: Writes the DataFrame to a CSV file.
* `plot()`: Creates a basic plot of the DataFrame, such as a line chart or histogram.

These are just a few of the many methods available in Pandas DataFrame. Pandas offers a rich set of functions for data manipulation, aggregation, and visualization, making it a powerful tool for data analysis.
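
Several of the methods listed above can be combined in a few lines. The following is a minimal sketch on made-up data, not a recommended workflow; the DataFrame `scores` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing score
scores = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'team': ['red', 'blue', 'red', 'blue'],
    'score': [8.0, np.nan, 6.0, 9.0],
})

print(scores.shape)          # (rows, columns)
print(list(scores.columns))  # column labels as a plain list

# Replace the missing score with the column mean
filled = scores.fillna({'score': scores['score'].mean()})

# Drop a column, then sort by score from highest to lowest
ranked = filled.drop('team', axis=1).sort_values('score', ascending=False)
print(ranked)

# Average score per team
team_means = filled.groupby('team')['score'].mean()
print(team_means)
```

None of these calls modifies `scores` in place; each returns a new object, which is why the results are assigned to new names.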

## Indexing, .loc[], and .iloc[]

Pandas provides different methods to select rows and columns of data, the most important of which are loc and iloc. Understanding the differences between these methods is crucial for data manipulation in pandas. 

### Indexing

An `index` in pandas works like an address: it is how any data point in a DataFrame or Series is located. Both rows and columns are labeled. The row labels are called the index, and the column labels are simply the column names.

In [None]:
data = {
    'age': [30, 2, 12, 4, 32, 33, 69],
    'color': ['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
    'food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
    'height': [165, 70, 120, 80, 180, 172, 150],
    'score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    'state': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
    }

index = ['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']

df = pd.DataFrame(data, index=index)


### .loc[]

`loc` is a label-based selection method: you pass the labels (names) of the rows or columns you want to select. Unlike `iloc`, a slice passed to `loc` includes its end label.

In [None]:
# select all rows for a specific column
print(df.loc[: , 'color'])

# Select all rows for multiple columns
print(df.loc[:, ['age', 'color']])

# Select multiple columns with specific rows
print(df.loc[['Jane', 'Nick'], ['age', 'color']])

# Select a range of rows for all columns
print(df.loc['Nick':'Dean'])

### .iloc[]

`iloc`, on the other hand, is an integer position-based selection method: you pass the integer positions of the rows or columns you want to select. Unlike `loc`, a slice passed to `iloc` excludes its end position.

In [None]:
# select all rows for a specific column
print(df.iloc[: , 1])

# Select all rows for multiple columns
print(df.iloc[:, [0, 1]])

# Select multiple columns with specific rows
print(df.iloc[[0, 1], [0, 1]])

# Select a range of rows for all columns
print(df.iloc[1:4])

### Differences between .loc and .iloc

- `loc` gets rows (or columns) with particular labels from the index.
- `iloc` gets rows (or columns) at particular positions in the index (so it only takes integers).  
  
In short, `loc` selects by label, and a slice includes its last element. `iloc` selects by integer position, and a slice excludes its last element.
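
The difference is easiest to see when the row labels are integers that do not match the row positions. The sketch below uses a hypothetical `df_demo` with made-up values.

```python
import pandas as pd

# Integer row labels that do NOT match positions (made-up data)
df_demo = pd.DataFrame({'value': ['a', 'b', 'c', 'd']}, index=[10, 20, 30, 40])

# loc slices by label and includes the end label
by_label = df_demo.loc[10:30]
print(by_label)

# iloc slices by position and excludes the end position
by_position = df_demo.iloc[0:2]
print(by_position)
```

Here `df_demo.loc[0]` would raise a `KeyError`, because no row carries the label 0, while `df_demo.iloc[0]` returns the first row.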

## Pandas Axis Manipulation 

### Axis 0 and Axis 1
In pandas, `axis=0` represents rows (running vertically downwards across rows) and `axis=1` represents columns (running horizontally from left to right across columns). 

Always remember, `axis=0` will act on all the **ROWS** in each **COLUMN**, `axis=1` will act on all the **COLUMNS** in each **ROW**. This becomes very important when you want to add rows, delete rows, or apply some calculation down a column or across a row.
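
A small example may make the rule concrete. This sketch uses a hypothetical two-column DataFrame named `grid`.

```python
import pandas as pd

grid = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# axis=0: collapse the rows, giving one result per column
col_sums = grid.sum(axis=0)
print(col_sums)

# axis=1: collapse the columns, giving one result per row
row_sums = grid.sum(axis=1)
print(row_sums)

# For drop(), the axis says where the label lives
without_first_row = grid.drop(0, axis=0)  # remove the row labeled 0
without_b = grid.drop('B', axis=1)        # remove the column labeled 'B'
```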

## Reading and Writing Data
Pandas provides functions for reading and writing data in a variety of formats. You can read data from a CSV file and write it to an Excel file, for example:

In [None]:
# Read data from a CSV file
df = pd.read_csv('data.csv')

# Write data to an Excel file (requires an Excel engine such as openpyxl to be installed)
df.to_excel('data.xlsx', index=False)
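
The same read/write pattern can be tried without any files on disk. The sketch below round-trips a small made-up table through CSV text using an in-memory buffer; the names `table` and `buffer` are hypothetical.

```python
import io
import pandas as pd

table = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# index=False leaves the row labels out of the output
buffer = io.StringIO()
table.to_csv(buffer, index=False)
print(buffer.getvalue())

# Rewind the buffer and read the CSV text back into a DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```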

## Memory usage
Pandas provides functionality to explicitly check the memory usage of a DataFrame. 

In [None]:
df.info(memory_usage='deep')

Here, `deep` provides the most accurate report of the memory usage, accounting for the true memory usage of all components of the DataFrame.

However, the memory consumption of a DataFrame is not only determined by the number of items but also the data type of the items. For instance, the data type `int64` consumes more memory than `int8`.

In [None]:
# Convert a column to a smaller integer type to reduce memory usage
# (only safe if every value fits in the int8 range, -128 to 127)
df['column_name'] = df['column_name'].astype('int8')
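
The effect of the data type on memory can be measured directly. This is a minimal sketch on a made-up Series named `small`, whose values are chosen so that they fit in `int8`.

```python
import numpy as np
import pandas as pd

# 1,000 small integers stored with an explicit 64-bit dtype
small = pd.Series(np.arange(1000) % 100, dtype='int64')

# Bytes used by the values only (index excluded)
bytes_int64 = small.memory_usage(index=False)
bytes_int8 = small.astype('int8').memory_usage(index=False)
print(bytes_int64, bytes_int8)

# Let pandas pick the smallest integer type that can hold the values
downcast = pd.to_numeric(small, downcast='integer')
print(downcast.dtype)
```

`pd.to_numeric` with `downcast='integer'` is one way to avoid choosing a type by hand, since it only shrinks the type as far as the data allows.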

## Dealing with Large Datasets
You'll likely encounter datasets that are too large to fit into memory. In such cases, Pandas provides several techniques to handle this scenario:

- Chunking: Read data in chunks small enough to fit into memory.

In [None]:
chunksize = 10 ** 6  # Tune this value to best match your memory availability
for chunk in pd.read_csv("large_dataset.csv", chunksize=chunksize):
    process(chunk)  # define a function to process your data

- DTypes: Minimize memory usage by specifying or converting data types.

In [None]:
df = pd.read_csv('data.csv', dtype={'column1': 'int8', 'column2': 'float32'})
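
Both techniques can be combined. The sketch below computes a mean chunk by chunk; a small in-memory CSV stands in for the large file, and the column name `value` is an assumption made for illustration.

```python
import io
import pandas as pd

# Stand-in for a large file: a small in-memory CSV (hypothetical data)
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
n_rows = 0
# chunksize makes read_csv return an iterator of DataFrames;
# dtype keeps each chunk compact
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4, dtype={'value': 'int16'})
for chunk in reader:
    total += int(chunk['value'].sum())
    n_rows += len(chunk)

# Mean computed without holding all rows in memory at once
print(total / n_rows)
```

This works because a sum and a row count can be accumulated across chunks. Statistics such as the median need a different approach.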

# Exploratory Data Analysis

Exploratory data analysis is an important step in understanding the data. 

In [None]:
# Get the summary statistics of the DataFrame
df.describe()

# Get the unique values in a column
df['column_name'].unique()

## Descriptive Statistics with Pandas

### Mean, median, mode

In [None]:
# create a pandas series
data = pd.Series(np.random.randint(low=0, high=10, size=20))

# calculate mean, median, and mode
mean = data.mean()
median = data.median()
mode = data.mode()[0]

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)

### Variance and standard deviation

In [None]:
# calculate variance and standard deviation
variance = data.var()
std_deviation = data.std()

print("Variance:", variance)
print("Standard deviation:", std_deviation)
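
One detail worth knowing: `var()` and `std()` in pandas compute the sample statistic (dividing by n - 1) by default, whereas NumPy's `np.var` divides by n. The sketch below compares the two on a small made-up Series named `values`.

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# pandas divides by n - 1 by default (sample variance, ddof=1)
sample_var = values.var()

# ddof=0 divides by n (population variance), matching NumPy's default
population_var = values.var(ddof=0)

print(sample_var, population_var, np.var(values.to_numpy()))
```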

### Skewness and kurtosis

In [None]:
skewness = data.skew()
kurtosis = data.kurtosis()

print("Skewness:", skewness)
print("Kurtosis:", kurtosis)

### Correlation
`corr()` computes the pairwise correlation between two Series (or between the columns of a DataFrame). NA/null values are excluded from the computation.

In [None]:
# create two pandas series
x = pd.Series(np.random.randint(low=0, high=10, size=20))
y = pd.Series(np.random.randint(low=0, high=10, size=20))

# calculate correlation
correlation = x.corr(y)

print("Correlation:", correlation)
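
To see how missing values are handled, the sketch below uses two made-up Series, `hours` and `marks`, that are perfectly linear apart from one missing entry.

```python
import numpy as np
import pandas as pd

hours = pd.Series([1.0, 2.0, 3.0, np.nan, 5.0])
marks = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

# The pair containing NaN is skipped; the remaining pairs are perfectly linear
pearson = hours.corr(marks)
print(pearson)

# DataFrame.corr() returns the full pairwise correlation matrix
matrix = pd.DataFrame({'hours': hours, 'marks': marks}).corr()
print(matrix)
```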

### Counting values
`value_counts()` counts how often each unique value occurs in a Series, much like a histogram over its distinct values.

In [None]:
# create a pandas dataframe
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Ella', 'Frank'],
    'age': [20, 25, 30, 35, 40, 45],
    'gender': ['F', 'M', 'M', 'M', 'F', 'M']
})

# count values in a column
gender_counts = df['gender'].value_counts()

print(gender_counts)
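
Counts can also be expressed as proportions. A minimal sketch on a made-up Series named `colors`:

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

# Absolute counts, sorted from most to least frequent
counts = colors.value_counts()
print(counts)

# normalize=True returns proportions that sum to 1
shares = colors.value_counts(normalize=True)
print(shares)
```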

`pd.qcut()` is a quantile-based discretization function: it divides the data into buckets that each hold roughly the same number of observations.

In [None]:
# create a pandas series
data = pd.Series(np.random.randint(low=0, high=10, size=20))

# Quartile cut; duplicates='drop' avoids an error when repeated values make bin edges coincide
print(pd.qcut(data, 4, duplicates='drop'))
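
It may help to contrast `pd.qcut()` with `pd.cut()`, which uses equal-width bins instead of equal-sized groups. The sketch below uses made-up data with one outlier; the labels `q1` to `q4` are arbitrary names chosen for illustration.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# qcut: bin edges follow the quantiles, so each bin gets the same number of values
by_quantile = pd.qcut(values, 4, labels=['q1', 'q2', 'q3', 'q4'])
print(by_quantile.value_counts().sort_index())

# cut: bins have equal width, so the outlier leaves the middle bins empty
by_width = pd.cut(values, 4)
print(by_width.value_counts().sort_index())
```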

# Next Material

Topics covered in the next material:

* **Data cleaning and preparation** are essential steps in data analysis. Pandas provides functions for handling missing data, removing duplicates, and converting data types.

* **Data aggregation and grouping** involve summarizing data based on one or more variables. Pandas provides a groupby function for grouping data by one or more columns and performing various operations on the resulting groups.

* **Merging and joining data** involve combining data from different sources into a single dataset. Pandas provides functions for merging and joining data based on common columns or indices.

* **Time series analysis** is a specialized area of data analysis that deals with data that is indexed by time. Pandas provides functions for working with time series data, such as resampling, rolling, and shifting.

* **Reshaping and pivoting data** involve transforming data from one format to another. Pandas provides functions for pivoting and reshaping data, such as melt, pivot, stack, and unstack.

> Content created by [**Carlos Cruz-Maldonado**](https://www.linkedin.com/in/carloscruzmaldonado/).  
> I am available to answer any questions or provide further assistance.   
> Feel free to reach out to me at any time.  