# Pandas  

Pandas is a popular open-source data analysis and manipulation library for Python. Pandas stands for “Python Data Analysis Library”. The name is also derived from the term “panel data”, an econometrics term for multidimensional structured data sets. It provides easy-to-use and powerful tools for working with structured data such as tables and time series.

Some advantages of Pandas include:

* Data manipulation: Pandas provides powerful tools for filtering, transforming, and aggregating data.
* Flexibility: Pandas can handle a wide range of data formats, including text files, CSV, Excel, SQL databases, and JSON.
* Integration with other libraries: Pandas works seamlessly with other data science libraries in Python, such as NumPy and Matplotlib.
* Time series analysis: Pandas has built-in support for working with time series data.
* Performance: Pandas is highly optimized, with critical code paths written in Cython or C.


In [1]:
import pandas as pd  # import the pandas library and use pd as its alias (this is a convention)

## Data Structures

Pandas data structures include Series and DataFrames. 
* A Series is a one-dimensional array-like object that can hold a variety of data types, such as integers, strings, and floats. 
* A DataFrame is a two-dimensional table-like data structure with rows and columns.

### Series
A Pandas Series is a one-dimensional labeled array that can hold any data type such as integers, floating-point numbers, strings, Python objects, etc. Each element in a Series has a label called the index, which is used to access the values of the Series. A Series is similar to a NumPy array, but it has an index that makes it more flexible and convenient to use.

In [2]:
# create a Series from a list
s1 = pd.Series([3, 5, 1, 2, 7])
print(s1)

0    3
1    5
2    1
3    2
4    7
dtype: int64


In [3]:
# create a Series from a dictionary
data = {'apples': 5, 'oranges': 2, 'bananas': 8, 'pears': 1}
s2 = pd.Series(data)
print(s2)


apples     5
oranges    2
bananas    8
pears      1
dtype: int64


In [4]:
# create a Series with custom index labels
s3 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s3)


a    10
b    20
c    30
d    40
dtype: int64


In [5]:
# access elements of a Series using index labels
print(s2['apples'])   # output: 5
print(s3[['a', 'c']]) # output: a    10, c    30


5
a    10
c    30
dtype: int64


In [10]:
# perform arithmetic operations on Series
s4 = s1 + s1
print(s4)


0     6
1    10
2     2
3     4
4    14
dtype: int64


In [7]:
# create a boolean mask from a Series
mask = s1 > 3
print(mask)


0    False
1     True
2    False
3    False
4     True
dtype: bool


In [8]:
# filter a Series using a boolean mask
filtered_s1 = s1[mask]
print(filtered_s1)


1    5
4    7
dtype: int64


In [9]:
# apply a function to each element of a Series
s5 = s1.apply(lambda x: x ** 2)
print(s5)

0     9
1    25
2     1
3     4
4    49
dtype: int64
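
The index is what distinguishes a Series from a plain NumPy array: arithmetic between two Series matches values by label rather than by position. The following is a minimal sketch with made-up labels and values; it also shows that the `apply` call above can be written as a vectorized expression.

```python
import pandas as pd

# two Series whose index labels only partly overlap (illustrative data)
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# addition aligns on labels; a label present in only one Series yields NaN
total = a + b
print(total)

# element-wise operators are vectorized, so s ** 2 computes the same
# values as s.apply(lambda v: v ** 2) without a Python-level loop
s = pd.Series([3, 5, 1, 2, 7])
squared = s ** 2
print(squared.tolist() == s.apply(lambda v: v ** 2).tolist())
```

Only the labels `y` and `z` exist in both Series, so only those two positions of `total` hold numbers.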


### DataFrame  
A Pandas DataFrame is a two-dimensional labeled data structure that is used to store and manipulate tabular data. It consists of rows and columns, where each column can have a different data type. A DataFrame can be thought of as a collection of Series that share the same index. The rows are labeled by the index, and the columns are labeled by their names. DataFrame is the most commonly used Pandas data structure and can be thought of as a spreadsheet or SQL table. It provides powerful data manipulation and analysis functionalities, such as data indexing, selection, merging, grouping, and pivoting.

In [11]:
# create a DataFrame from a dictionary
data = {'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
        'age': [25, 30, 35, 40],
        'country': ['USA', 'Canada', 'Australia', 'UK']}

df = pd.DataFrame(data)
print(df)

      name  age    country
0    Alice   25        USA
1      Bob   30     Canada
2  Charlie   35  Australia
3     Dave   40         UK
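
The description of a DataFrame as a collection of Series sharing one index can be checked directly. This is a small sketch; the variable names `people` and `ages` are illustrative and the data mirrors the table above.

```python
import pandas as pd

# illustrative data mirroring the table above
people = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
                       'age': [25, 30, 35, 40]})

# selecting a single column returns a Series
ages = people['age']
print(type(ages).__name__)

# the column carries the same row index as the DataFrame
print(ages.index.equals(people.index))

# each column keeps its own data type
print(people.dtypes)
```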


In [None]:
# create a DataFrame from a CSV file
df = pd.read_csv('data.csv')
print(df)
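
The cell above needs a `data.csv` file on disk, which is not provided here. As a stand-in, the sketch below reads CSV text from memory; the CSV content is hypothetical and only illustrates the call.

```python
import io
import pandas as pd

# hypothetical CSV content standing in for a file such as data.csv
csv_text = """name,age,country
Alice,25,USA
Bob,30,Canada
"""

# read_csv accepts a path, a URL, or a file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# to_csv without a path returns the CSV text; index=False omits the row labels
round_trip = df_csv.to_csv(index=False)
print(round_trip)
```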

In [12]:
# create a DataFrame from a NumPy array
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
columns = ['A', 'B', 'C']

df = pd.DataFrame(data, columns=columns)
print(df)

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9


In [None]:
# create a DataFrame from a list of dictionaries
data = [{'name': 'Alice', 'age': 25, 'country': 'USA'},
        {'name': 'Bob', 'age': 30, 'country': 'Canada'},
        {'name': 'Charlie', 'age': 35, 'country': 'Australia'},
        {'name': 'Dave', 'age': 40, 'country': 'UK'}]

df = pd.DataFrame(data)
print(df)

In [19]:
# load a JSON file into a DataFrame
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json'

# read the JSON data at the URL into a DataFrame
df = pd.read_json(url, orient='columns')
print(df.head())

   integer            datetime  category
0        5 2015-01-01 00:00:00         0
1        5 2015-01-01 00:00:01         0
2        9 2015-01-01 00:00:02         0
3        6 2015-01-01 00:00:03         0
4        6 2015-01-01 00:00:04         0


Data indexing and selection are crucial aspects of data analysis. Pandas provides flexible ways to select subsets of data using the `loc` (label-based) and `iloc` (position-based) indexers.
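
The difference between label-based and position-based selection is easiest to see on a DataFrame whose index is not the default 0, 1, 2, and so on. The sketch below uses an illustrative DataFrame named `df_sel` with string labels.

```python
import pandas as pd

# illustrative data with string row labels
df_sel = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
                       'age': [25, 30, 35, 40]},
                      index=['a', 'b', 'c', 'd'])

# loc selects by label; a label slice includes both endpoints
by_label = df_sel.loc['a':'c', ['name', 'age']]
print(by_label)

# iloc selects by integer position; the end of the slice is excluded
by_position = df_sel.iloc[0:2, [0, 1]]
print(by_position)

# loc also accepts a boolean mask for the rows
older_than_30 = df_sel.loc[df_sel['age'] > 30, 'name']
print(older_than_30)
```

Note that `loc['a':'c']` returns three rows while `iloc[0:2]` returns two.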



Here are some common methods of the Pandas DataFrame:

* `head()`: Returns the first n rows of the DataFrame. The default value of n is 5.
* `tail()`: Returns the last n rows of the DataFrame. The default value of n is 5.
* `info()`: Prints a summary of the DataFrame, including the column names, data types, and the number of non-null values in each column.
* `describe()`: Generates descriptive statistics for numerical columns in the DataFrame, such as count, mean, and standard deviation.
* `shape`: Returns a tuple representing the dimensions of the DataFrame, i.e., (number of rows, number of columns).
* `columns`: Returns an `Index` object holding the column names of the DataFrame.
* `index`: Returns an `Index` object holding the row labels of the DataFrame.
* `loc[]`: Allows you to access a group of rows and columns in the DataFrame using label-based indexing. For example, `df.loc[0:5, ['Name', 'Age']]` returns the rows labeled 0 through 5 (both endpoints included), and the "Name" and "Age" columns.
* `iloc[]`: Allows you to access a group of rows and columns in the DataFrame using integer-position-based indexing. For example, `df.iloc[0:5, [0, 1]]` returns the first 5 rows and the first 2 columns.
* `drop()`: Removes one or more rows or columns from the DataFrame. For example, `df.drop('Age', axis=1)` removes the "Age" column.
* `fillna()`: Fills missing or NaN (Not a Number) values in the DataFrame with a value you specify, such as a column's mean or median.
* `groupby()`: Groups the DataFrame by one or more columns and applies a function, such as mean or sum, to each group.
* `pivot_table()`: Creates a pivot table from the DataFrame, allowing you to summarize and analyze the data in different ways.
* `merge()`: Merges the DataFrame with another DataFrame based on common columns or the index.
* `sort_values()`: Sorts the DataFrame by one or more columns in ascending or descending order.
* `to_csv()`: Writes the DataFrame to a CSV file.
* `plot()`: Creates a basic plot of the DataFrame, such as a line chart or histogram.
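
Several of the methods listed above can be tried on a small table. The following is a minimal sketch; the DataFrame `df_demo` and its values are made up for illustration, with one age deliberately missing.

```python
import numpy as np
import pandas as pd

# illustrative data with one missing age
df_demo = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
                        'age': [25, np.nan, 35, 40],
                        'country': ['USA', 'Canada', 'Australia', 'UK']})

print(df_demo.shape)           # (rows, columns)
print(df_demo.head(2))         # first two rows
print(list(df_demo.columns))   # column names as a plain list

# drop returns a new DataFrame; df_demo itself is unchanged
no_country = df_demo.drop('country', axis=1)

# fill the missing age with the mean of the known ages
filled = df_demo.fillna({'age': df_demo['age'].mean()})

# sort by age, oldest first
oldest_first = filled.sort_values('age', ascending=False)
print(oldest_first)
```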

These are just a few of the many methods available in Pandas DataFrame. Pandas offers a rich set of functions for data manipulation, aggregation, and visualization, making it a powerful tool for data analysis.
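
As a sketch of the aggregation and combination methods mentioned above (`groupby()`, `pivot_table()`, and `merge()`), consider two small tables. The names `sales` and `managers` and all values are hypothetical.

```python
import pandas as pd

# hypothetical sales records and a lookup table of regional managers
sales = pd.DataFrame({'region': ['East', 'East', 'West', 'West'],
                      'product': ['A', 'B', 'A', 'B'],
                      'units': [10, 20, 30, 40]})
managers = pd.DataFrame({'region': ['East', 'West'],
                         'manager': ['Ana', 'Ben']})

# total units per region
units_by_region = sales.groupby('region')['units'].sum()
print(units_by_region)

# regions as rows, products as columns, summed units as cell values
pivot = sales.pivot_table(index='region', columns='product',
                          values='units', aggfunc='sum')
print(pivot)

# attach the manager to every sales row by matching on the 'region' column
merged = sales.merge(managers, on='region')
print(merged)
```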

## Descriptive Statistics with Pandas

### Mean, median, mode

In [20]:
# create a pandas series
data = pd.Series(np.random.randint(low=0, high=10, size=20))

# calculate mean, median, and mode
mean = data.mean()
median = data.median()
mode = data.mode()[0]

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)

Mean: 4.3
Median: 4.5
Mode: 0
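
Because the data above is random, the printed values change on every run. One detail worth noting is that `mode()` returns a Series, since several values can tie for most frequent; `mode()[0]` keeps only the first of them. A small sketch with fixed, illustrative values:

```python
import pandas as pd

# fixed illustrative values in which 1 and 2 are equally frequent
values = pd.Series([1, 1, 2, 2, 3])

modes = values.mode()
print(modes.tolist())      # every value tied for most frequent, in sorted order
print(modes[0])            # the first (smallest) mode only

print(values.mean())
print(values.median())
```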


### Variance and standard deviation

In [22]:
# calculate variance and standard deviation
variance = data.var()
std_deviation = data.std()

print("Variance:", variance)
print("Standard deviation:", std_deviation)

Variance: 8.74736842105263
Standard deviation: 2.9575950400710087
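
By default, `var()` and `std()` in Pandas compute the sample statistics (divisor n - 1, that is `ddof=1`), whereas NumPy's `np.var` and `np.std` default to the population versions (`ddof=0`). The sketch below uses illustrative values chosen so that the population variance is a round number.

```python
import numpy as np
import pandas as pd

# illustrative values with mean 5 and squared deviations summing to 32
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_var = values.var()               # divides by n - 1 (pandas default)
population_var = values.var(ddof=0)     # divides by n
numpy_var = np.var(values.to_numpy())   # NumPy default also divides by n

print(sample_var, population_var, numpy_var)
print(values.std(ddof=0))               # population standard deviation
```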


### Skewness and kurtosis

In [21]:
skewness = data.skew()
kurtosis = data.kurtosis()

print("Skewness:", skewness)
print("Kurtosis:", kurtosis)

Skewness: 0.06754113507991329
Kurtosis: -0.8822886285890141
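
To help read these numbers: skewness is zero for symmetric data and positive when there is a long right tail, and Pandas reports excess kurtosis (Fisher's definition, where a normal distribution scores 0). A minimal sketch with illustrative values:

```python
import pandas as pd

# illustrative data: one symmetric sample, one with a long right tail
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 1, 2, 10])

print(symmetric.skew())       # approximately zero for symmetric data
print(right_tailed.skew())    # positive: the tail extends to the right

# negative excess kurtosis means lighter tails than a normal distribution
print(symmetric.kurtosis())
```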


### Correlation

In [23]:
# create two pandas series
x = pd.Series(np.random.randint(low=0, high=10, size=20))
y = pd.Series(np.random.randint(low=0, high=10, size=20))

# calculate correlation
correlation = x.corr(y)

print("Correlation:", correlation)

Correlation: 0.007410480214306802
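
Two independent random series give a correlation near zero, as above. For contrast, the sketch below uses illustrative series with an exact linear relationship; `corr()` computes the Pearson coefficient by default.

```python
import pandas as pd

# illustrative series with exact linear relationships
x_demo = pd.Series([1, 2, 3, 4, 5])
y_up = 2 * x_demo + 1     # increases with x_demo
y_down = -x_demo          # decreases with x_demo

print(x_demo.corr(y_up))     # close to 1
print(x_demo.corr(y_down))   # close to -1

# DataFrame.corr() returns the pairwise correlation matrix of all columns
corr_matrix = pd.DataFrame({'x': x_demo, 'y_up': y_up, 'y_down': y_down}).corr()
print(corr_matrix)
```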


### Counting values

In [24]:
# create a pandas dataframe
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Ella', 'Frank'],
    'age': [20, 25, 30, 35, 40, 45],
    'gender': ['F', 'M', 'M', 'M', 'F', 'M']
})

# count values in a column
gender_counts = df['gender'].value_counts()

print(gender_counts)

M    4
F    2
Name: gender, dtype: int64
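
`value_counts()` can also report proportions instead of raw counts by passing `normalize=True`. A short sketch using the same gender values as above:

```python
import pandas as pd

# same gender values as in the DataFrame above
gender = pd.Series(['F', 'M', 'M', 'M', 'F', 'M'], name='gender')

# share of each value; results are sorted from most to least frequent
shares = gender.value_counts(normalize=True)
print(shares)

# number of distinct values
n_unique = gender.nunique()
print(n_unique)
```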


# Next session

Topics we will cover in the next session:

* **Data cleaning and preparation** are essential steps in data analysis. Pandas provides functions for handling missing data, removing duplicates, and converting data types.

* **Data aggregation and grouping** involve summarizing data based on one or more variables. Pandas provides the `groupby` method for grouping data by one or more columns and performing various operations on the resulting groups.

* **Merging and joining data** involve combining data from different sources into a single dataset. Pandas provides functions for merging and joining data based on common columns or indices.

* **Time series analysis** is a specialized area of data analysis that deals with data that is indexed by time. Pandas provides functions for working with time series data, such as resampling, rolling, and shifting.

* **Reshaping and pivoting data** involve transforming data from one format to another. Pandas provides functions for pivoting and reshaping data, such as melt, pivot, stack, and unstack.
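
As a brief preview of some of these topics, here is a rough sketch of a few of the functions named above. All data and variable names are made up for illustration, and the next session may approach these topics differently.

```python
import pandas as pd

# data cleaning: remove duplicate rows, drop missing ages, convert the type
raw = pd.DataFrame({'name': ['Alice', 'Bob', 'Bob', 'Dave'],
                    'age': ['25', '30', '30', None]})
clean = raw.drop_duplicates().dropna(subset=['age']).astype({'age': 'int64'})
print(clean)

# time series: daily values aggregated, shifted, and smoothed
ts = pd.Series([1, 2, 3, 4],
               index=pd.date_range('2024-01-01', periods=4, freq='D'))
two_day_sum = ts.resample('2D').sum()         # totals per 2-day bin
shifted = ts.shift(1)                         # each value moved one day later
rolling_mean = ts.rolling(window=2).mean()    # mean of each value and the previous one

# reshaping: wide table (one column per subject) to long format
wide = pd.DataFrame({'name': ['Alice', 'Bob'],
                     'math': [90, 80], 'art': [70, 60]})
long = wide.melt(id_vars='name', var_name='subject', value_name='score')
print(long)
```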

> Content created by **Carlos Cruz-Maldonado**.  
> Feel free to ping me at any time.